
How would you implement a robust frontend monitoring and logging system?

Capture errors (global window handlers plus error boundaries), performance (Core Web Vitals via RUM), and structured logs and breadcrumbs; enrich every event with context (user, route, release, session); sample and rate-limit; route to a backend such as Sentry or Datadog; add session replay and SLO-based alerting. Scrub PII and respect consent throughout.


A robust frontend monitoring system answers three questions in production: Is it broken? Is it slow? What were users doing when it happened? That means errors, performance, and behavioral context — collected, enriched, and routed somewhere actionable.

1. Error tracking

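  • Global handlers: window.onerror or addEventListener('error') for uncaught errors; a capture-phase listener (addEventListener('error', handler, true)) for resource load failures, which do not bubble and never reach window.onerror; unhandledrejection for promise rejections.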
  • React error boundaries → report on componentDidCatch with the component stack.
  • Manual capture in try/catch around risky event handlers and async code (boundaries don't catch those).
  • Source maps uploaded to the backend so minified stacks are readable.
  • Deduplicate and group identical errors; track error rate, not just count.
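
The global handlers above could be wired up roughly as follows. This is a minimal sketch, not a vendor SDK: ErrorReport, ListenerTarget, and installGlobalHandlers are hypothetical names, and the target is passed in so the same code runs in a test without a real window.

```typescript
// Minimal shape of what gets reported; field names are illustrative.
interface ErrorReport {
  kind: "error" | "resource" | "unhandledrejection";
  message: string;
  stack?: string;
}

// Structural stand-in for window, so the sketch also runs outside a browser.
interface ListenerTarget {
  addEventListener(type: string, listener: (event: any) => void, capture?: boolean): void;
}

function installGlobalHandlers(
  target: ListenerTarget,
  report: (r: ErrorReport) => void,
): void {
  // Capture phase (third argument true): resource load failures do not bubble.
  target.addEventListener("error", (event) => {
    if (event.error instanceof Error) {
      // Uncaught runtime error with a real Error object.
      report({ kind: "error", message: event.error.message, stack: event.error.stack });
    } else if (event.target && event.target !== target) {
      // Failed <img>/<script>/<link>: there is no Error, only the element.
      const el = event.target;
      report({ kind: "resource", message: `failed to load ${el.src ?? el.href ?? "resource"}` });
    } else {
      // Non-Error value thrown, or a cross-origin "Script error."
      report({ kind: "error", message: String(event.message ?? "unknown error") });
    }
  }, true);

  target.addEventListener("unhandledrejection", (event) => {
    const reason = event.reason;
    report({
      kind: "unhandledrejection",
      message: reason instanceof Error ? reason.message : String(reason),
      stack: reason instanceof Error ? reason.stack : undefined,
    });
  });
}
```

In a browser this would be called once at startup as installGlobalHandlers(window, report), where report hands the event to whatever transport the app uses.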

2. Performance (RUM — real user monitoring)

  • Core Web Vitals — LCP, INP, CLS via the web-vitals library, reported from real sessions.
  • Custom marks/measures (performance.mark), API latency, route-transition timing, long tasks.
  • Report p75/p95, segmented by route, device, region — averages hide the pain.
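
To make the percentile point concrete, here is a small sketch of nearest-rank percentiles grouped by route. VitalSample, percentile, and p75ByRoute are hypothetical names; in practice a RUM backend usually computes these, so this only illustrates the arithmetic.

```typescript
interface VitalSample {
  route: string;
  valueMs: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(Math.max(rank, 1), sorted.length) - 1;
  return sorted[index];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Segment by route before aggregating, so one slow page is not averaged away.
function p75ByRoute(samples: VitalSample[]): Map<string, number> {
  const byRoute = new Map<string, number[]>();
  for (const s of samples) {
    const bucket = byRoute.get(s.route) ?? [];
    bucket.push(s.valueMs);
    byRoute.set(s.route, bucket);
  }
  const result = new Map<string, number>();
  byRoute.forEach((values, route) => result.set(route, percentile(values, 75)));
  return result;
}
```

For ten made-up LCP samples of 800, 800, 900, 900, 1000, 1000, 4000, 4500, 5000, and 5200 ms, the mean is 2410 ms while p75 is 4500 ms and p95 is 5200 ms: the average looks acceptable even though a large share of sessions are slow.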

3. Structured logging & breadcrumbs

  • Breadcrumbs — a rolling trail of recent actions (clicks, navigations, network calls, state changes) attached to each error so you can see how the user got there.
  • Structured logs — JSON with consistent fields, levels (debug/info/warn/error), not free-text console.log.
  • Network logging — failed requests, status codes, timing.
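
A breadcrumb trail is essentially a small bounded buffer, and a structured log is a record with fixed fields. The sketch below shows one possible shape; Breadcrumb, BreadcrumbTrail, logRecord, and the default capacity of 30 are all assumptions for illustration.

```typescript
type Level = "debug" | "info" | "warn" | "error";

interface Breadcrumb {
  timestamp: number;
  category: "click" | "navigation" | "network" | "state";
  message: string;
  data?: Record<string, unknown>;
}

// Rolling trail: keeps only the most recent `capacity` breadcrumbs.
class BreadcrumbTrail {
  private items: Breadcrumb[] = [];
  private readonly capacity: number;

  constructor(capacity = 30) {
    this.capacity = capacity;
  }

  add(crumb: Breadcrumb): void {
    this.items.push(crumb);
    if (this.items.length > this.capacity) this.items.shift(); // drop the oldest
  }

  // Copy attached to an error report at capture time.
  snapshot(): Breadcrumb[] {
    return [...this.items];
  }
}

// Structured log line: consistent fields serialized as JSON instead of free text.
function logRecord(
  level: Level,
  message: string,
  fields: Record<string, unknown> = {},
  now: number = Date.now(),
): string {
  // Fixed fields are spread last so callers cannot overwrite them.
  return JSON.stringify({ ...fields, level, message, timestamp: now });
}
```

When an error is captured, trail.snapshot() would be attached to the report so the backend can show the steps that led up to it.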

4. Context enrichment (what makes reports actionable)

Every event carries: release/version, route/URL, user/session id (or anonymized id), browser/device/OS, feature-flag state, viewport, connection type, timestamp. An error without context is nearly useless; "this error, on release 1.4.2, on the checkout route, for users on Safari" is a bug ticket.
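
One way to guarantee that no event leaves without context is to route everything through a single enrich step. The sketch below assumes hypothetical EventContext and RawEvent shapes; the context is read through a callback at capture time so the route and flags reflect the moment of the error rather than page load.

```typescript
interface EventContext {
  release: string;
  route: string;
  sessionId: string; // or an anonymized id
  userAgent: string;
  viewport: { width: number; height: number };
  connectionType?: string; // not available in every browser
  featureFlags: Record<string, boolean>;
}

interface RawEvent {
  type: string;
  message: string;
}

type EnrichedEvent = RawEvent & { context: EventContext; timestamp: string };

// Attach the current context and a timestamp to every outgoing event.
function enrich(
  event: RawEvent,
  getContext: () => EventContext,
  now: Date = new Date(),
): EnrichedEvent {
  return { ...event, context: getContext(), timestamp: now.toISOString() };
}
```

In a browser, getContext would typically read location.pathname, navigator.userAgent, and window.innerWidth/innerHeight, with the release string injected at build time.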

5. Transport, sampling, reliability

  • Batch events and flush when the page is hidden (visibilitychange); send via navigator.sendBeacon, which queues the request so it survives page unload, or via fetch with keepalive.
  • Sample high-volume data (e.g. 100% of errors, X% of performance/replay) to control cost and load.
  • Rate-limit so an error loop doesn't DoS your own ingestion or the user's network.
  • Buffer offline, flush on reconnect. Monitoring must never break or slow the app.
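
Sampling, rate-limiting, and batching can live in one small queue in front of the transport. The following is a sketch under simplifying assumptions: TelemetryQueue and its options are hypothetical names, sampling is decided per event (many setups sample per session instead), the rate limit is a fixed one-minute window, and offline persistence is left out.

```typescript
interface TelemetryEvent {
  kind: "error" | "perf";
  payload: unknown;
}

interface QueueOptions {
  send: (body: string) => boolean; // returns false if the browser refused the request
  perfSampleRate: number;          // 0..1; errors are always kept
  maxEventsPerMinute: number;      // hard cap so an error loop cannot flood ingestion
  random?: () => number;           // injectable for tests
  now?: () => number;              // injectable for tests
}

class TelemetryQueue {
  private buffer: TelemetryEvent[] = [];
  private windowStart = 0;
  private acceptedInWindow = 0;
  private readonly opts: QueueOptions;

  constructor(opts: QueueOptions) {
    this.opts = opts;
  }

  // Returns true if the event was buffered, false if sampled out or rate-limited.
  enqueue(event: TelemetryEvent): boolean {
    const random = this.opts.random ?? Math.random;
    const now = (this.opts.now ?? Date.now)();
    if (event.kind === "perf" && random() >= this.opts.perfSampleRate) return false;

    // Fixed-window rate limit: reset the counter once a minute has passed.
    if (now - this.windowStart >= 60_000) {
      this.windowStart = now;
      this.acceptedInWindow = 0;
    }
    if (this.acceptedInWindow >= this.opts.maxEventsPerMinute) return false;

    this.acceptedInWindow += 1;
    this.buffer.push(event);
    return true;
  }

  // Sends everything buffered as one request; keeps the batch if the send is refused.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    const accepted = this.opts.send(JSON.stringify(batch));
    if (!accepted) this.buffer = batch.concat(this.buffer);
  }
}
```

In a browser, send could be (body) => navigator.sendBeacon("/telemetry", body), where "/telemetry" is a placeholder endpoint, and flush would be called on a timer and when document.visibilityState becomes "hidden".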

6. Tooling

Don't build the backend — use Sentry, Datadog RUM, LogRocket, New Relic, Grafana Faro. Add session replay (LogRocket/Sentry) to watch what happened. Wrap the vendor SDK in your own thin module so you can swap it.
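
The thin wrapper can be as small as an interface plus a swappable adapter. This is a sketch with hypothetical names (MonitoringAdapter, setMonitoringAdapter, captureError, addBreadcrumb); the try/catch is there so a failure inside the vendor SDK can never surface in application code.

```typescript
// The only monitoring surface the rest of the app imports.
interface MonitoringAdapter {
  captureError(error: Error, context?: Record<string, unknown>): void;
  addBreadcrumb(message: string, data?: Record<string, unknown>): void;
}

// Default adapter does nothing, so the app works with monitoring disabled.
const noopAdapter: MonitoringAdapter = {
  captureError() {},
  addBreadcrumb() {},
};

let adapter: MonitoringAdapter = noopAdapter;

// Called once at startup with the vendor-specific adapter.
function setMonitoringAdapter(next: MonitoringAdapter): void {
  adapter = next;
}

function captureError(error: unknown, context?: Record<string, unknown>): void {
  try {
    // Normalize thrown non-Error values so the backend always gets a stack-capable object.
    const err = error instanceof Error ? error : new Error(String(error));
    adapter.captureError(err, context);
  } catch {
    // Monitoring must never break the app: swallow failures inside the SDK.
  }
}

function addBreadcrumb(message: string, data?: Record<string, unknown>): void {
  try {
    adapter.addBreadcrumb(message, data);
  } catch {
    // Same rule as above.
  }
}
```

A vendor adapter then maps these two methods onto the SDK's own calls (for Sentry, captureException and addBreadcrumb), and swapping vendors means writing one new adapter rather than touching every call site.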

7. Alerting & dashboards

  • Alert on SLOs: error rate spike, Core Web Vitals regression, a new error type, a crash-rate threshold — routed to Slack/PagerDuty.
  • Dashboards for error trends, vitals over releases, top errors.
  • Release tracking — tie metrics to deploys so you catch a bad release fast.
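
Alert rules normally live in the monitoring tool, but the underlying check is simple enough to sketch. The shapes and thresholds below are assumptions: ReleaseWindow, Slo, and evaluateSlo are hypothetical, the 1% error-session budget is made up, and 2500 ms is the published "good" boundary for LCP at p75.

```typescript
// Aggregated metrics for one release over one time window.
interface ReleaseWindow {
  release: string;
  sessions: number;
  sessionsWithError: number;
  lcpP75Ms: number;
}

interface Slo {
  maxErrorSessionRate: number; // fraction of sessions allowed to hit an error
  maxLcpP75Ms: number;
}

// Returns one message per violated objective; an empty array means healthy.
function evaluateSlo(w: ReleaseWindow, slo: Slo): string[] {
  const alerts: string[] = [];
  if (w.sessions === 0) return alerts; // no traffic, nothing to judge

  // Rate, not raw count: 50 errors means different things at 100 and 100,000 sessions.
  const errorRate = w.sessionsWithError / w.sessions;
  if (errorRate > slo.maxErrorSessionRate) {
    alerts.push(`${w.release}: error-session rate ${(errorRate * 100).toFixed(1)}% exceeds SLO`);
  }
  if (w.lcpP75Ms > slo.maxLcpP75Ms) {
    alerts.push(`${w.release}: LCP p75 ${w.lcpP75Ms}ms exceeds SLO`);
  }
  return alerts;
}
```

Keying the window by release is what lets an alert say which deploy introduced the regression.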

8. Privacy

  • Scrub PII before sending — mask inputs, redact tokens, anonymize ids in replay.
  • Respect consent (GDPR) and data-residency requirements, and don't capture sensitive fields in the first place.
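
Scrubbing is easiest to enforce as a single function that every payload passes through before transport. The sketch below is illustrative only: the key pattern and regexes are assumptions that would need tuning per product, it errs on the side of over-redaction, and it does not handle cyclic objects.

```typescript
// Keys whose values are dropped entirely; intentionally broad.
const SENSITIVE_KEY = /pass(word)?|token|secret|authorization|cookie|ssn|card/i;
const EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;
const BEARER = /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g;

// Mask sensitive patterns that appear inside free text (messages, URLs).
function scrubString(value: string): string {
  return value.replace(BEARER, "Bearer [REDACTED]").replace(EMAIL, "[EMAIL]");
}

// Recursively scrub a payload: redact by key name, mask by pattern.
function scrub(value: unknown): unknown {
  if (typeof value === "string") return scrubString(value);
  if (Array.isArray(value)) return value.map((item) => scrub(item));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : scrub(inner);
    }
    return out;
  }
  return value; // numbers, booleans, null, undefined pass through
}
```

Running scrub on the client, before the event is queued, means the raw values never leave the browser; server-side scrubbing in the vendor tool is a second line of defense, not a replacement.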

The framing

"Three pillars — errors, performance (RUM), and behavioral context (breadcrumbs/replay) — every event enriched with release/route/user/device, sampled and rate-limited, sent reliably via sendBeacon, routed to a tool like Sentry with alerting on SLOs and release tracking. And PII scrubbed throughout. The monitoring system itself must be lightweight and never able to break the app."

Follow-up questions

  • Why enrich every event with release/route/device context?
  • Why use sendBeacon instead of fetch for telemetry?
  • How do you keep monitoring from impacting performance or flooding on an error loop?
  • How do you handle PII in error reports and session replay?

Common mistakes

  • Only catching React errors, missing window/async/resource errors.
  • Reporting errors with no context — unactionable.
  • No sampling or rate-limiting — cost blowup and error-loop floods.
  • Sending PII/tokens to the monitoring backend.
  • Not uploading source maps, so stacks are unreadable.

Performance considerations

  • Telemetry must be lightweight — batch, sample, send async via sendBeacon, never block the main thread. Rate-limiting prevents an error storm from flooding the network. Session replay is heavy — sample it heavily.

Edge cases

  • Errors during page unload (need sendBeacon/keepalive).
  • Offline users — buffer and flush on reconnect.
  • An error loop generating thousands of events.
  • Source map mismatch after a deploy.

Real-world examples

  • Sentry capturing errors + breadcrumbs + release tracking; web-vitals feeding a RUM dashboard.
  • LogRocket session replay scrubbed of PII, linked from each error.

Senior engineer discussion

Seniors structure it as errors + performance + context, stress enrichment (release/route/user/device) as what makes reports actionable, and cover the operational realities — sampling, rate-limiting, sendBeacon, offline buffering, source maps, SLO-based alerting, release tracking. They insist monitoring be unable to harm the app and treat PII scrubbing as non-negotiable.
