
How would you design a frontend architecture that supports one million daily users?

Edge-first: serve static assets, and ideally HTML, from a CDN (Cloudflare/Fastly/Vercel Edge). SSG for content, ISR for catalog pages, SSR for personalized pages. Aggressive caching with hashed assets + tag-based invalidation. Service worker for repeat-visit speed. Per-route code splitting + bundle budgets in CI. Sampled RUM (web-vitals) with alerts on regression. Observability: error monitoring (Sentry), perf dashboard. Multi-region origin if SSR runs at the origin. Resilience: circuit breakers, retries, graceful degradation. Cost-aware: cap third-party tags, optimize images, lazy-load below the fold.


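1M daily users ÷ 86,400 seconds ≈ 12 visits per second on average. Each visit issues several page and API requests, so the raw request rate is a multiple of that; 50-200 RPS at peak is a reasonable planning range once time-of-day skew is included. The hard part isn't raw throughput (CDNs handle that). It's consistent perf across global users, low ops overhead, fast time-to-deploy, and resilience under spikes.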
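The arithmetic can be made explicit with a back-of-envelope estimator. This is a sketch only: `requestsPerUser` and `peakFactor` are assumed inputs that should be replaced with measured values, and the names are illustrative.

```typescript
// Back-of-envelope traffic estimate. Every input is an assumption.
interface TrafficAssumptions {
  dailyUsers: number;      // unique users per day
  requestsPerUser: number; // requests each user generates per day (assumed)
  peakFactor: number;      // peak-to-average ratio from time-of-day skew (assumed)
}

function estimateRps(a: TrafficAssumptions): { average: number; peak: number } {
  const secondsPerDay = 24 * 60 * 60; // 86,400
  const average = (a.dailyUsers * a.requestsPerUser) / secondsPerDay;
  return { average, peak: average * a.peakFactor };
}

// One request per user per day reproduces the "roughly 12 RPS" figure.
const baseline = estimateRps({ dailyUsers: 1_000_000, requestsPerUser: 1, peakFactor: 1 });
console.log(baseline.average.toFixed(1)); // 11.6
```

Raising `requestsPerUser` or `peakFactor` shows how quickly the planning number moves, which is why the cache hit ratio at the CDN matters more than the headline user count.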

Architecture layers

User → CDN (edge cache + edge SSR) → Origin (SSR / API) → Database / Cache

Edge / CDN

  • Static assets (JS, CSS, images): hashed URLs, Cache-Control: public, max-age=31536000, immutable.
  • HTML for public pages: SSG → CDN edge cache; sub-100ms TTFB worldwide.
  • HTML for personalized pages: SSR at the edge (Cloudflare Workers, Vercel Edge) with cookie-aware logic; responses can still be cached per cookie group (a segment of users) rather than per user.
  • API for public data: edge cache with short TTL + stale-while-revalidate.
  • Images: CDN-served, on-the-fly resized (Cloudflare Images / Imgix / Cloudinary), AVIF/WebP.
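The caching rules in this list can be written down as a small policy table. The sketch below is one possible mapping; the resource kinds and the TTL values for HTML and API responses are assumptions chosen for illustration, and only the hashed-asset header is taken from the text above. Per-cookie-group caching needs a CDN-specific cache key and is not shown.

```typescript
// Hypothetical resource categories for choosing a Cache-Control header.
type ResourceKind = "hashed-asset" | "public-html" | "per-user-html" | "public-api";

const CACHE_POLICY: Record<ResourceKind, string> = {
  // Content-hashed URL: the bytes never change, so cache for a year.
  "hashed-asset": "public, max-age=31536000, immutable",
  // Browser revalidates; the CDN (shared cache) keeps it for s-maxage seconds.
  "public-html": "public, max-age=0, s-maxage=300, stale-while-revalidate=60",
  // Per-user HTML must never land in a shared cache.
  "per-user-html": "private, no-store",
  // Short TTL at the edge, serve stale while refreshing in the background.
  "public-api": "public, s-maxage=30, stale-while-revalidate=300",
};

function cacheControlFor(kind: ResourceKind): string {
  return CACHE_POLICY[kind];
}

console.log(cacheControlFor("hashed-asset"));
```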

Origin

  • Stateless servers behind a load balancer; autoscale on CPU + queue depth.
  • Multi-region if SSR runs at the origin and users are global (otherwise edge SSR handles geo).
  • Connection pooling for DB.
  • Background workers for non-realtime tasks (email, analytics aggregation).
  • Health checks for fast failover.

Data layer

  • Primary DB with read replicas per region.
  • Redis for cache + session + rate limit counters.
  • Search engine (Elasticsearch, Meilisearch) for autocomplete.
  • Object storage (S3) for uploads.

Frontend specifics

Bundle

  • Per-route code splitting (automatic in Next/Remix).
  • Initial bundle budget: <200KB compressed for content sites, <400KB for dashboards.
  • size-limit in CI to enforce.
  • Tree-shaking, modular imports.
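To illustrate what a budget gate checks, here is a minimal sketch of the comparison a CI step could run. It is not size-limit's API; the `RouteBundle` shape and the route names are hypothetical, and the limits are the ones listed above.

```typescript
// Hypothetical description of one route's initial bundle.
interface RouteBundle {
  route: string;
  kind: "content" | "dashboard";
  compressedBytes: number; // size after gzip/brotli
}

// Budgets from the list above: 200KB for content sites, 400KB for dashboards.
const BUDGET_BYTES: Record<RouteBundle["kind"], number> = {
  content: 200 * 1024,
  dashboard: 400 * 1024,
};

// Returns the bundles that exceed their budget; CI would fail if any exist.
function overBudget(bundles: RouteBundle[]): RouteBundle[] {
  return bundles.filter((b) => b.compressedBytes > BUDGET_BYTES[b.kind]);
}

const violations = overBudget([
  { route: "/blog", kind: "content", compressedBytes: 150 * 1024 },
  { route: "/pricing", kind: "content", compressedBytes: 260 * 1024 },
  { route: "/app", kind: "dashboard", compressedBytes: 380 * 1024 },
]);
console.log(violations.map((v) => v.route)); // [ '/pricing' ]
```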

Rendering

  • Marketing pages: SSG.
  • Product/catalog: ISR with tag-based invalidation on data change.
  • Personalized (logged in): SSR.
  • App / dashboard: CSR with SSR-rendered shell.
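The per-page-type choice above amounts to a lookup from route to rendering mode. A minimal sketch, assuming hypothetical route prefixes (`/app`, `/account`, `/products`) and SSG as the default for everything else:

```typescript
type RenderMode = "SSG" | "ISR" | "SSR" | "CSR-shell";

// Hypothetical route prefixes; a real app would derive these from its router.
const ROUTE_MODES: Array<{ prefix: string; mode: RenderMode }> = [
  { prefix: "/app", mode: "CSR-shell" }, // dashboard: CSR inside an SSR-rendered shell
  { prefix: "/account", mode: "SSR" },   // personalized, logged-in pages
  { prefix: "/products", mode: "ISR" },  // catalog, revalidated on data change
];

function renderModeFor(pathname: string): RenderMode {
  const match = ROUTE_MODES.find(
    (r) => pathname === r.prefix || pathname.startsWith(r.prefix + "/"),
  );
  return match ? match.mode : "SSG"; // marketing and content pages
}

console.log(renderModeFor("/products/42")); // ISR
console.log(renderModeFor("/pricing"));     // SSG
```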

Caching

  • HTTP: long max-age + immutable on assets.
  • Service worker: pre-cache app shell, runtime cache for API.
  • App-level: React Query / SWR for dedup + cache.
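The "dedup + cache" behavior at the app level can be illustrated with a small class. This is a simplified sketch of the idea, not the API of React Query or SWR; `DedupCache`, its constructor parameters, and the injected clock are all hypothetical.

```typescript
// Shares one in-flight request per key and keeps results for ttlMs.
class DedupCache<T> {
  private inFlight = new Map<string, Promise<T>>();
  private values = new Map<string, { value: T; storedAt: number }>();

  constructor(
    private readonly fetcher: (key: string) => Promise<T>,
    private readonly ttlMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  get(key: string): Promise<T> {
    const hit = this.values.get(key);
    if (hit && this.now() - hit.storedAt < this.ttlMs) {
      return Promise.resolve(hit.value); // fresh cached value
    }
    const pending = this.inFlight.get(key);
    if (pending) return pending; // join the request already in flight

    const request = this.fetcher(key).then(
      (value) => {
        this.inFlight.delete(key);
        this.values.set(key, { value, storedAt: this.now() });
        return value;
      },
      (err) => {
        this.inFlight.delete(key); // allow a later retry
        throw err;
      },
    );
    this.inFlight.set(key, request);
    return request;
  }
}

let fetchCount = 0;
const cache = new DedupCache<string>((key) => {
  fetchCount += 1;
  return Promise.resolve("value for " + key);
}, 5000);
const first = cache.get("user:1");
const second = cache.get("user:1"); // no second fetch is issued
```

Two components asking for the same key at the same time share one network request, which is the main win on pages that render many widgets from the same data.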

Images

  • AVIF / WebP via CDN image service.
  • srcset + sizes for responsive.
  • Lazy-load below fold, preload LCP image.
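A sketch of how the image attributes could be generated. The `?w=` resize parameter is hypothetical (each image CDN has its own URL scheme), as is the `imgAttrs` helper. For the LCP image the sketch uses eager loading with a high fetch priority; a `<link rel="preload">` in the document head is the complementary step and is not shown.

```typescript
interface ImgAttrs {
  src: string;
  srcset: string;
  sizes: string;
  loading: "lazy" | "eager";
  fetchpriority: "high" | "auto";
}

function imgAttrs(baseUrl: string, widths: number[], sizes: string, isLcp: boolean): ImgAttrs {
  const sorted = widths.slice().sort((a, b) => a - b);
  // One candidate per width, e.g. "hero.jpg?w=480 480w".
  const srcset = sorted.map((w) => `${baseUrl}?w=${w} ${w}w`).join(", ");
  return {
    src: `${baseUrl}?w=${sorted[sorted.length - 1]}`, // fallback: largest width (assumed)
    srcset,
    sizes,
    loading: isLcp ? "eager" : "lazy",      // never lazy-load the LCP image
    fetchpriority: isLcp ? "high" : "auto",
  };
}

const hero = imgAttrs("https://img.example.com/hero.jpg", [960, 480], "100vw", true);
console.log(hero.srcset);
```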

Resilience

  • Circuit breakers on origin → external dependencies.
  • Retries with jitter for transient failures.
  • Graceful degradation: read-only mode if write path is down.
  • Static fallback HTML for outages.
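The circuit-breaker bullet can be sketched as a small state machine. This is a minimal illustration under assumed thresholds, not a production library; the class name, the parameters, and the injected clock are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private readonly failureThreshold: number, // consecutive failures before opening
    private readonly resetAfterMs: number,     // how long to stay open
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Whether a call to the dependency may go out right now.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open"; // allow trial requests through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial request, or too many consecutive failures, opens the circuit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

A caller checks `canRequest()` before hitting the dependency and serves the degraded path (read-only mode or static fallback) when it returns false.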

Observability

  • RUM (web-vitals → analytics) sampled at 1-10% of users.
  • Error monitoring (Sentry, Datadog).
  • Synthetic checks (Lighthouse CI per PR, scheduled WebPageTest).
  • Alerting on rate of change (LCP regresses by more than 10% week over week, error rate doubles).
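The sampling and rate-of-change alerting described above reduce to a few small functions. This sketch assumes the 75th percentile as the summary statistic (the text does not name one) and a 10% threshold; the function names are hypothetical, and collecting the LCP values themselves (for example with the web-vitals library) is not shown.

```typescript
// Decide once per session whether to report metrics (rate 0.01 to 0.10).
function shouldSample(rate: number, random: () => number = Math.random): boolean {
  return random() < rate;
}

// Nearest-rank percentile of a non-empty sample.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// True when this week's value is worse than last week's by more than threshold.
function hasRegressed(thisWeek: number, lastWeek: number, threshold = 0.1): boolean {
  return (thisWeek - lastWeek) / lastWeek > threshold;
}

const lastWeekP75 = percentile([1000, 1200, 1500, 2000], 75); // 1500
const thisWeekP75 = percentile([1100, 1400, 1800, 2400], 75); // 1800
console.log(hasRegressed(thisWeekP75, lastWeekP75)); // true (20% slower)
```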

Deployment

  • Atomic deploys (immutable build artifacts).
  • Canary / progressive rollout: 1% → 10% → 100% with health gates.
  • One-click rollback.
  • Feature flags for risky changes (kill-switch without deploy).
  • Blue/green or rolling for zero-downtime.
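Progressive rollout needs a stable way to decide which users see the new version. One common approach, sketched here as an assumption rather than any platform's built-in mechanism, is to hash the user ID into 100 buckets so that a user admitted at 1% stays admitted at 10% and 100%.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// True when the user falls inside the current rollout percentage (0 to 100).
function inRollout(userId: string, percent: number): boolean {
  return fnv1a32(userId) % 100 < percent;
}

console.log(inRollout("user-123", 100)); // true
console.log(inRollout("user-123", 0));   // false
```

Because the bucket depends only on the ID, widening the percentage adds users without reshuffling those already on the new version.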

Cost

  • CDN cache hit ratio target: >95% for static, >70% for HTML.
  • Origin compute: autoscale + spot instances where possible.
  • Image optimization: pay once at upload, serve forever.
  • Cap third-party tags: each adds bytes and extra requests on user devices.

Per-route considerations

  • /checkout must be highly reliable: simpler bundle, fewer third parties, more retries, idempotency keys.
  • /search: aggressive client-side cache, edge cache short-TTL, debounced input + abort.
  • Auth pages: minimal JS, fastest possible TTFB (login is on the critical funnel).
  • Marketing: SSG + aggressive image opt + minimal client JS.
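For the /search bullet, "abort" means cancelling the previous request when a newer keystroke supersedes it. A minimal sketch using the standard `AbortController`; the `LatestOnly` helper is hypothetical, and debouncing the input is assumed to happen in the caller.

```typescript
// Hands out an AbortSignal per request and aborts the previous one.
class LatestOnly {
  private controller: AbortController | null = null;

  next(): AbortSignal {
    if (this.controller) this.controller.abort(); // cancel the superseded request
    this.controller = new AbortController();
    return this.controller.signal;
  }
}

const searchRequests = new LatestOnly();
const signalForQueryA = searchRequests.next();
const signalForQueryB = searchRequests.next(); // the request for query A is now aborted
console.log(signalForQueryA.aborted, signalForQueryB.aborted); // true false
```

Passing the signal as `fetch(url, { signal })` makes the stale request reject instead of racing the newer one and overwriting its results.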

Scaling specific concerns

Spikes (launch, virality)

  • Pre-warm CDN for known popular URLs.
  • Origin should handle 3-5x typical peak without degradation.
  • Backoff + jitter on retries to prevent thundering herd.
  • Static fallback page if origin can't keep up.
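Backoff with jitter spreads retries out so that clients which failed together do not all retry together. The sketch below uses exponential backoff with "full jitter" (a random delay between zero and the exponential ceiling); the base delay, the cap, and the function names are assumptions.

```typescript
// Delay before retry number `attempt` (0-based): random in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retries `op` up to maxAttempts times, sleeping with jitter between attempts.
async function retryWithJitter<T>(
  op: () => Promise<T>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, 100, 5000)); // 100 ms base, 5 s cap (assumed)
      }
    }
  }
  throw lastError;
}

console.log(backoffDelayMs(3, 100, 5000, () => 0.5)); // 400
```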

Regional latency

  • Edge SSR / SSG keeps perf consistent worldwide.
  • Multi-region read replicas for SSR routes.
  • Segment RUM data by geo to find slow markets.

Browser diversity

  • Test matrix includes Safari (especially iOS), Chrome Android, Firefox.
  • Don't ship polyfills for browsers you don't support.
  • Progressive enhancement for old browsers (still works, with fewer features).

Team-level

  • Bundle budget enforced in CI, owned by perf champion / platform team.
  • Architecture Decision Records for significant choices.
  • Sunset policy for feature flags (max 90 days).
  • Periodic perf audit (quarterly).
  • Capacity planning before known traffic events.

What NOT to do

  • Single-region SSR for global users → slow TTFB outside the region.
  • CSR for content pages → bad SEO + slow LCP.
  • Mega vendor chunk → single dep update invalidates the world.
  • Long-cached HTML + frequently changing assets → version mismatch.
  • Storing tokens in localStorage → any XSS can read and exfiltrate them; prefer HttpOnly cookies.
  • No CI guards → regressions creep in.
  • No error monitoring → invisible failures.

Mental model

Edge-first + cache-everything + observe-everything + scale-stateless. At 1M DAU the math is forgiving (most stacks can handle the load), but the experience depends on getting cache strategy, image optimization, and observability right. The architecture is less about extreme scale and more about consistency, resilience, and ops simplicity.

Follow-up questions

  • How do you handle a traffic spike (10x normal)?
  • What's the right CDN cache hit ratio target?
  • When does edge SSR beat origin SSR?
  • How do you handle a regional outage?

Common mistakes

  • Single-region SSR for global users.
  • CSR for content — bad SEO + slow LCP.
  • No bundle budget — silent regression.
  • Mega vendor chunk — invalidates on every dep update.
  • No error/perf monitoring.
  • Letting third-party tags eat the perf budget.

Performance considerations

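  • At 1M DAU, perf is product. A 1s LCP improvement is often cited as worth roughly 5-10% conversion on transactional flows, though the effect varies by product and should be measured. Edge SSR + immutable assets + good image optimization can deliver sub-1s LCP for most users globally. Cost: typically a few hundred dollars/mo of CDN + edge compute, depending on traffic mix and provider pricing, and often less than one origin instance.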

Edge cases

  • Holiday / launch traffic spikes — pre-warm cache, capacity-plan.
  • Regional outage of CDN or DB — failover plan.
  • Browser version surprises — Samsung Internet, in-app webviews.
  • Save-Data signals — opt out of aggressive prefetches.
  • PWA install + offline mode for repeat users.
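Honoring the Save-Data signal from the list above can be as small as one check on the request headers. A sketch, assuming the headers arrive as an object with lower-cased names (as Node's HTTP server provides them); `allowPrefetch` is a hypothetical helper.

```typescript
// Browsers in data-saver mode send the request header "Save-Data: on".
function allowPrefetch(headers: Record<string, string | undefined>): boolean {
  const value = headers["save-data"];
  const saveDataOn = value !== undefined && value.trim().toLowerCase() === "on";
  return !saveDataOn; // skip aggressive prefetching when the user asked to save data
}

console.log(allowPrefetch({ "save-data": "on" })); // false
console.log(allowPrefetch({}));                    // true
```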

Real-world examples

  • Vercel / Netlify / Cloudflare host millions of sites at this scale on edge-first architecture.
  • Shopify, Stripe, Notion all run multi-region with edge-cached HTML.
  • BBC, NYT, Pinterest published case studies of LCP-driven engagement lift.

Senior engineer discussion

Seniors design the architecture for consistency and ops simplicity at scale, not exotic throughput. They pick edge-first, enforce budgets in CI, instrument observability, and design graceful degradation. They tie perf to business and pitch investment in those terms.
