How do you design a monitoring and error tracking strategy for a frontend application?

Layer it: error tracking (Sentry) for exceptions + source maps, RUM for real-user performance (Core Web Vitals), product analytics for behavior, and synthetic/uptime checks. Add error boundaries, global handlers, alerting with thresholds, release tracking, and PII scrubbing. The goal: know it broke before users tell you.

A monitoring strategy answers one question: when something breaks or slows down in production, do you find out — and can you diagnose it — before users complain? It's layered.

1. Error tracking

Catch and report exceptions — Sentry, Datadog, etc.:

  • Global handlers: window.onerror and the unhandledrejection event, for uncaught errors and unhandled promise rejections.
  • React error boundaries — catch render-time crashes, report them, show a fallback instead of a white screen.
  • Source maps uploaded to the service (not served publicly) so stack traces map to original code.
  • Context — user id (non-PII), release version, browser, route, breadcrumbs (recent actions) — so an error is debuggable.
  • Release tracking — tag errors with the deploy version to spot regressions and know which release introduced what.
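The error-tracking pieces above can be sketched as one small reporter. This is an illustration under stated assumptions, not how Sentry or any other SDK is implemented: ErrorReporter, ErrorReport, Breadcrumb, and ListenerTarget are hypothetical names, and the send callback stands in for whatever transport the real service provides.

```typescript
// Hypothetical shapes for illustration; a real SDK defines its own.
interface Breadcrumb {
  at: number; // epoch milliseconds
  message: string;
}

interface ErrorReport {
  message: string;
  stack: string | null;
  release: string;
  route: string;
  userId: string | null; // opaque id only, never an email or name
  breadcrumbs: Breadcrumb[];
}

// Structural stand-in for `window`, so the sketch also runs outside a browser.
interface ListenerTarget {
  addEventListener(type: string, listener: (ev: any) => void): void;
}

class ErrorReporter {
  private crumbs: Breadcrumb[] = [];
  private release: string;
  private send: (report: ErrorReport) => void;
  private getRoute: () => string;
  private maxCrumbs: number;

  constructor(
    release: string,
    send: (report: ErrorReport) => void,
    getRoute: () => string,
    maxCrumbs: number = 20,
  ) {
    this.release = release;
    this.send = send;
    this.getRoute = getRoute;
    this.maxCrumbs = maxCrumbs;
  }

  // Keep only the most recent actions so every report stays small.
  addBreadcrumb(message: string, at: number = Date.now()): void {
    this.crumbs.push({ at, message });
    if (this.crumbs.length > this.maxCrumbs) this.crumbs.shift();
  }

  // Attach release, route, and breadcrumbs so the report is debuggable.
  capture(err: unknown, userId: string | null = null): void {
    const e = err instanceof Error ? err : new Error(String(err));
    this.send({
      message: e.message,
      stack: e.stack ?? null,
      release: this.release,
      route: this.getRoute(),
      userId,
      breadcrumbs: this.crumbs.slice(),
    });
  }

  // Uncaught errors arrive as "error" events; rejected promises with no
  // handler arrive as "unhandledrejection" events carrying a `reason`.
  install(target: ListenerTarget): void {
    target.addEventListener("error", (ev) => this.capture(ev.error ?? ev.message));
    target.addEventListener("unhandledrejection", (ev) => this.capture(ev.reason));
  }
}
```

In a browser the target passed to install would be window, and a React error boundary's componentDidCatch could call capture with the error it receives. The stack in the report is still minified at this point; mapping it back to original code is the job of the source maps uploaded to the service.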

2. Real User Monitoring (RUM) — performance

Measure what actual users experience:

  • Core Web Vitals — LCP, CLS, INP — collected via PerformanceObserver / the web-vitals library.
  • Navigation/resource timing, API latency from the client.
  • Segmented by device, geography, connection — averages hide the bad tail; watch p75/p95.
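To make the point about tails concrete, here is a minimal sketch of per-segment percentile aggregation. It assumes samples have already been collected (in a browser they might come from the web-vitals library's onLCP, onINP, and onCLS callbacks); VitalSample, SegmentSummary, and summarizeBySegment are hypothetical names, and nearest-rank is only one of several percentile definitions a RUM backend might use.

```typescript
interface VitalSample {
  metric: "LCP" | "INP" | "CLS";
  value: number;   // milliseconds for LCP and INP, unitless for CLS
  segment: string; // e.g. device class, country, or connection type
}

interface SegmentSummary {
  count: number;
  mean: number;
  p75: number;
  p95: number;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] ?? NaN;
}

function mean(values: number[]): number {
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Group one metric by segment and summarize each group separately.
function summarizeBySegment(
  samples: VitalSample[],
  metric: VitalSample["metric"],
): Record<string, SegmentSummary> {
  const grouped: Record<string, number[]> = {};
  for (const s of samples) {
    if (s.metric !== metric) continue;
    const bucket = grouped[s.segment] ?? [];
    bucket.push(s.value);
    grouped[s.segment] = bucket;
  }
  const out: Record<string, SegmentSummary> = {};
  for (const key of Object.keys(grouped)) {
    const values = grouped[key] ?? [];
    out[key] = {
      count: values.length,
      mean: mean(values),
      p75: percentile(values, 75),
      p95: percentile(values, 95),
    };
  }
  return out;
}
```

For ten LCP samples of 1200, 1300, ..., 2000 ms plus a single 9000 ms outlier, the mean is 2340 ms and the p75 is 1900 ms, while the p95 is 9000 ms. The one user with a nine-second load shows up only in the tail.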

3. Product analytics — behavior

What users do — funnels, drop-off, feature usage (Amplitude, PostHog, GA). Distinct from error/perf monitoring but part of "is the app healthy."

4. Synthetic monitoring / uptime

Scripted checks hitting critical flows (login, checkout) on a schedule from outside — catches outages even when no real user has hit the bug yet.
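A synthetic check is only as good as what it asserts. The sketch below covers just the pass/fail evaluation of one scripted step and deliberately leaves out scheduling and the browser or HTTP driver, which a real tool would supply. ProbeResult, ProbeExpectation, evaluateProbe, and firstFailure are hypothetical names.

```typescript
// Hypothetical result of one scripted step, as recorded by whatever drives
// the check (a headless browser, an HTTP client, a hosted uptime service).
interface ProbeResult {
  step: string; // e.g. "login" or "checkout"
  httpStatus: number;
  durationMs: number;
  body: string;
}

interface ProbeExpectation {
  maxDurationMs: number;
  mustContain: string; // a marker proving the page really rendered
}

interface ProbeVerdict {
  step: string;
  ok: boolean;
  reasons: string[];
}

// A 200 response is not enough: also check latency and expected content,
// because a fast, empty shell page is still an outage for the user.
function evaluateProbe(result: ProbeResult, expected: ProbeExpectation): ProbeVerdict {
  const reasons: string[] = [];
  if (result.httpStatus < 200 || result.httpStatus >= 400) {
    reasons.push("unexpected HTTP status " + result.httpStatus);
  }
  if (result.durationMs > expected.maxDurationMs) {
    reasons.push("took " + result.durationMs + " ms, budget is " + expected.maxDurationMs + " ms");
  }
  if (result.body.indexOf(expected.mustContain) === -1) {
    reasons.push("missing expected content: " + expected.mustContain);
  }
  return { step: result.step, ok: reasons.length === 0, reasons };
}

// A flow is healthy only if every step passed; surface the first failure.
function firstFailure(verdicts: ProbeVerdict[]): ProbeVerdict | null {
  for (const v of verdicts) {
    if (!v.ok) return v;
  }
  return null;
}
```

Checking for a content marker in addition to the status code is the design choice worth noting: a single-page app can return 200 for its HTML shell while the flow itself is broken.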

5. Alerting — the part that makes it useful

Monitoring without alerting is just dashboards nobody looks at:

  • Threshold + anomaly alerts — error rate spike, Web Vitals regression, a new error type, checkout funnel drop.
  • Routed to the right people (Slack/PagerDuty), tuned to avoid noise/fatigue.
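One way to combine a threshold alert with a simple anomaly check on error rate is sketched below. The names (WindowStats, AlertPolicy, evaluateErrorRate) and the rule itself are assumptions for illustration; hosted tools typically offer their own, more sophisticated anomaly detection.

```typescript
interface WindowStats {
  errors: number;   // sessions that hit at least one error in the window
  sessions: number; // all sessions in the window
}

interface AlertPolicy {
  maxErrorRate: number;  // absolute threshold, e.g. 0.02 = 2% of sessions
  anomalyFactor: number; // fire if the rate exceeds baseline by this multiple
  minSessions: number;   // ignore windows too small to be meaningful
}

type AlertDecision =
  | { fire: false }
  | { fire: true; kind: "threshold" | "anomaly"; rate: number };

function evaluateErrorRate(
  current: WindowStats,
  baseline: WindowStats,
  policy: AlertPolicy,
): AlertDecision {
  // Tiny windows produce wild rates; skipping them is a cheap noise filter.
  if (current.sessions < policy.minSessions) return { fire: false };

  const rate = current.errors / current.sessions;
  if (rate > policy.maxErrorRate) return { fire: true, kind: "threshold", rate };

  // Compare against a trailing baseline (say, the same hour last week).
  const baselineRate = baseline.sessions > 0 ? baseline.errors / baseline.sessions : 0;
  if (baselineRate > 0 && rate > baselineRate * policy.anomalyFactor) {
    return { fire: true, kind: "anomaly", rate };
  }
  return { fire: false };
}
```

The minSessions guard and the anomalyFactor multiple are the tuning knobs against fatigue: raising either makes the alert quieter at the cost of slower detection.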

6. Cross-cutting

  • PII scrubbing — never send passwords, tokens, personal data to third-party monitoring; scrub before send.
  • Sampling — RUM/breadcrumbs sampled to control cost and volume.
  • Privacy/consent — respect Do Not Track / consent where required.
  • Dashboards — error rate, Web Vitals, uptime in one place.
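Scrubbing before send can be sketched as a recursive pass over the outgoing payload. The key pattern, the email pattern, and the scrub function are illustrative assumptions and will not catch every kind of personal data; in practice a function like this would be called from the SDK's pre-send hook (Sentry, for instance, exposes beforeSend).

```typescript
// Keys whose values are dropped entirely. Illustrative, not exhaustive.
const SENSITIVE_KEY = /pass(word)?|token|secret|authorization|cookie/i;

// Loose email matcher for free-text fields such as error messages.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Returns a scrubbed copy; the original payload is left untouched.
function scrub(value: unknown, depth: number = 0): unknown {
  // Guard against very deep or cyclic structures.
  if (depth > 8) return "[Truncated]";

  if (typeof value === "string") return value.replace(EMAIL_PATTERN, "[email]");
  if (Array.isArray(value)) return value.map((v) => scrub(v, depth + 1));

  if (value !== null && typeof value === "object") {
    const src = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(src)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[Filtered]" : scrub(src[key], depth + 1);
    }
    return out;
  }

  // Numbers, booleans, null, and undefined pass through unchanged.
  return value;
}
```

Filtering by key name handles structured data such as headers and form state, while the string pass handles values that leak into messages and URLs. An allowlist of fields known to be safe is stricter than this denylist and may be the better default for sensitive applications.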

The framing

"I'd layer it. Error tracking — Sentry with global handlers, React error boundaries, uploaded source maps, and release tagging so I can see which deploy broke what. RUM for real-user performance — Core Web Vitals via PerformanceObserver, watching p75/p95 not averages. Product analytics for behavior and funnels. Synthetic checks on critical flows so I catch outages before users do. The piece that makes it real is alerting — threshold and anomaly alerts on error rate and Web Vitals, routed to people, tuned against fatigue. And throughout: scrub PII before anything leaves the client. The goal is finding out it broke before users tell me."

Follow-up questions

  • Why upload source maps to your error tracker?
  • Why look at p75/p95 instead of average performance?
  • What's the difference between RUM and synthetic monitoring?
  • How do you avoid sending PII to third-party monitoring?

Common mistakes

  • Monitoring without alerting — dashboards nobody watches.
  • No source maps — unreadable minified stack traces.
  • Tracking averages, missing the bad tail (p95).
  • Sending PII/tokens to third-party services.
  • No error boundaries — render crashes white-screen silently.
  • Alert fatigue from noisy, untuned alerts.

Performance considerations

  • The monitoring SDKs themselves add weight and run code — load them async, sample RUM/breadcrumbs to control overhead and cost, and don't let instrumentation block the main thread.
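Sampling is easiest to reason about when it is decided once per session, so a sampled session is recorded in full and an unsampled one sends nothing. The sketch below hashes a session id with FNV-1a and compares the result with the sample rate. fnv1a and isSampled are hypothetical helper names, and real SDKs usually expose sampling as a configuration value instead of requiring code like this.

```typescript
// 32-bit FNV-1a over UTF-16 code units (identical to the byte-wise
// algorithm for ASCII input). Used here only for a stable, cheap spread.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Deterministic: the same session id always gets the same answer, so a
// session's RUM data and breadcrumbs are either all kept or all dropped.
function isSampled(sessionId: string, rate: number): boolean {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  return fnv1a(sessionId) / 0x100000000 < rate;
}
```

Because the decision is a comparison against the rate, any session sampled at 10% is also sampled at 50%, which keeps cohorts stable when the rate is raised. Errors themselves are often left unsampled, or sampled at a much higher rate, since they are rarer and more valuable than routine performance beacons.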

Edge cases

  • Errors from browser extensions / third-party scripts polluting the data.
  • A spike that's actually one user in a loop.
  • Errors only in old cached client versions.
  • Ad blockers blocking the monitoring script itself.
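Several of these edge cases come down to counting the right thing. The sketch below drops events whose top stack frame comes from a browser extension, then counts distinct affected users and releases per error instead of raw events. CapturedEvent, ImpactSummary, and summarizeImpact are hypothetical names; hosted trackers generally provide equivalent inbound filters and "users affected" counts.

```typescript
interface CapturedEvent {
  fingerprint: string; // groups occurrences of the same error
  userId: string;      // opaque id
  release: string;
  topFrameUrl: string; // script URL of the top stack frame
}

interface ImpactSummary {
  events: number;
  users: number;      // distinct users affected
  releases: string[]; // distinct releases, sorted
}

const EXTENSION_SCHEMES = ["chrome-extension://", "moz-extension://", "safari-web-extension://"];

function isFromExtension(e: CapturedEvent): boolean {
  return EXTENSION_SCHEMES.some((prefix) => e.topFrameUrl.indexOf(prefix) === 0);
}

function summarizeImpact(events: CapturedEvent[]): Record<string, ImpactSummary> {
  const counts: Record<string, number> = {};
  const users: Record<string, Record<string, true>> = {};
  const releases: Record<string, Record<string, true>> = {};

  for (const e of events) {
    if (isFromExtension(e)) continue; // not our code, not our bug
    counts[e.fingerprint] = (counts[e.fingerprint] ?? 0) + 1;

    const seenUsers = users[e.fingerprint] ?? {};
    seenUsers[e.userId] = true;
    users[e.fingerprint] = seenUsers;

    const seenReleases = releases[e.fingerprint] ?? {};
    seenReleases[e.release] = true;
    releases[e.fingerprint] = seenReleases;
  }

  const out: Record<string, ImpactSummary> = {};
  for (const key of Object.keys(counts)) {
    out[key] = {
      events: counts[key] ?? 0,
      users: Object.keys(users[key] ?? {}).length,
      releases: Object.keys(releases[key] ?? {}).sort(),
    };
  }
  return out;
}
```

An error with 50 events and 1 user is a single client stuck in a loop; an error with 3 events and 3 users is a broader problem despite the smaller count. A releases list containing only old versions points at stale cached clients, not the current deploy.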

Real-world examples

  • Sentry for errors + source maps + releases; web-vitals reporting to analytics for RUM.
  • Synthetic checks on the checkout flow alerting before customers report an outage.

Senior engineer discussion

Seniors describe the layered strategy (errors, RUM, analytics, synthetic), insist on alerting and release tracking, watch percentile tails, and treat PII scrubbing, sampling, and SDK overhead as first-class concerns.
