Performance

What challenges have you faced when building frontends for large user traffic?

Real challenges at scale: bundle bloat from N teams contributing, third-party tags eating perf budget, cache invalidation across thousands of edge nodes, regional latency variance, A/B test framework adding render delay, dependency conflicts in monorepos, CI build times growing with codebase, on-call pager-fatigue from JS errors at scale. Fixes are organizational as much as technical: budgets in CI, dependency review, observability investment, error grouping, and forcing experiments behind a perf gate.

When traffic is large, the hard problems shift from "make it work" to "keep it from regressing while many people contribute and many users notice the smallest issue."

Common challenges I've hit (and the fixes that worked)

1. Bundle bloat from N teams contributing

Every team adds a dependency, a feature flag, an experiment. After a year nobody owns the bundle. Initial JS climbs from 200KB to 500KB+.

Fixes:

Bundle budget in CI (e.g. size-limit) that fails the build on regression; @next/bundle-analyzer to inspect what grew.
PR template asking "how much JS does this add and why?"
Quarterly bundle audits to delete dead code and replace heavy deps.
Module ownership in CODEOWNERS so the right team reviews.
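
A budget gate does not need to be elaborate. The sketch below assumes the build already emits gzip sizes per entry; the BundleReport shape, entry names, and byte values are hypothetical, and tools like size-limit do this off the shelf.

```typescript
// Hypothetical shape: gzip size in bytes keyed by entry name.
type BundleReport = Record<string, number>;

interface BudgetViolation {
  entry: string;
  size: number;
  budget: number;
  overBy: number;
}

// Returns every entry whose size exceeds its budget.
// An entry with no budget is treated as budget 0, so new bundles
// must be given an explicit budget before they can pass.
function checkBudgets(report: BundleReport, budgets: BundleReport): BudgetViolation[] {
  const violations: BudgetViolation[] = [];
  for (const entry of Object.keys(report)) {
    const size = report[entry] ?? 0;
    const budget = budgets[entry] ?? 0;
    if (size > budget) {
      violations.push({ entry, size, budget, overBy: size - budget });
    }
  }
  return violations;
}
```

In CI, a non-empty result would fail the job and print the overBy values so the PR author sees exactly how far over budget the change is.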

2. Third-party tags taxing the critical path

Analytics, A/B testing, chat widgets, ad scripts. They land in <head>, block parsing, ship megabytes of their own JS. A single bad tag can move LCP by 2 seconds.

Fixes:

All third-party scripts async or defer at minimum.
Tag manager moved out of the critical path; loaded after LCP fires.
Self-host analytics where possible.
Real-user metrics segmented by tag presence — quantify the cost.
Veto power on adding new tags above a measured budget.
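
Quantifying a tag's cost can start as a simple segment comparison. This is a rough sketch: the RumSample shape is hypothetical, and p75 LCP is my choice of summary statistic rather than anything the list above prescribes.

```typescript
// Hypothetical RUM sample: one page view, its LCP, and the third-party tags that loaded.
interface RumSample {
  lcpMs: number;
  tags: string[];
}

// Nearest-rank percentile (p in 0..100) over a non-empty list.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// p75 LCP of page views with the tag minus p75 LCP of page views without it.
// Returns null when either segment is empty.
function tagCostMs(samples: RumSample[], tag: string): number | null {
  const withTag = samples.filter(s => s.tags.indexOf(tag) !== -1).map(s => s.lcpMs);
  const withoutTag = samples.filter(s => s.tags.indexOf(tag) === -1).map(s => s.lcpMs);
  if (withTag.length === 0 || withoutTag.length === 0) return null;
  return percentile(withTag, 75) - percentile(withoutTag, 75);
}
```

The two segments can differ in region or device mix, so I treat the delta as an estimate to investigate, not proof of causation.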

3. Cache invalidation across edge

Static assets with content-hashed filenames get immutable plus a long max-age. Easy. HTML and API responses cached across thousands of CDN nodes are hard. Stale HTML that points to chunks that no longer exist crashes the app for any user holding that cached HTML.

Fixes:

HTML cache key includes a build ID; old HTML keeps working because its chunks stay deployed for a retention window after each release.
Soft purges (revalidate-on-next-request) over hard purges.
Tag-based invalidation (Cloudflare, Fastly).
Client service worker whose skipWaiting strategy is designed for graceful version handoff.
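
The split between immutable assets and revalidated HTML can be sketched as a header policy. This assumes built assets carry a content hash in the filename; the regex, the header values, and the htmlCacheKey format are illustrative choices, not a specific CDN's API.

```typescript
// Assumption: hashed assets look like app.3f9a1c2b.js (8+ hex chars before the extension).
const HASHED_ASSET = /\.[0-9a-f]{8,}\.(js|css|woff2|png|jpg|svg)$/;

function cacheControlFor(path: string): string {
  if (HASHED_ASSET.test(path)) {
    // The URL changes whenever the bytes change, so cache for a year and never revalidate.
    return "public, max-age=31536000, immutable";
  }
  // HTML and anything unhashed: browsers revalidate before reuse (max-age=0),
  // while shared caches such as a CDN may reuse the response for 60 seconds (s-maxage).
  return "public, max-age=0, s-maxage=60, must-revalidate";
}

// Hypothetical edge cache key: HTML from different builds never shares a cache entry.
function htmlCacheKey(path: string, buildId: string): string {
  return buildId + ":" + path;
}
```

The part no header solves is retention: chunks from the previous builds have to stay available until HTML referencing them has aged out.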

4. Regional latency variance

Local dev on fast WiFi → 100ms RTT. P95 user in India on 4G → 600ms RTT. The lab and the field disagree.

Fixes:

Geo-distributed RUM (web-vitals.js → analytics, tag with continent).
Edge SSR / SSG so HTML originates from the user's region.
Network-throttled CI checks (slow 3G profile) to catch regressions.
DevTools throttling baked into developer onboarding.
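
The RTT gap compounds because a cold page load pays it several times before the first byte of HTML arrives. A back-of-envelope sketch, assuming four sequential round trips on a cold HTTPS connection (DNS, TCP, TLS 1.3, HTTP request); that count is a simplification of mine, and real connections vary:

```typescript
// Rough lower bound on network wait before the first byte of HTML.
// The default of 4 sequential round trips is an assumption for a cold connection:
// DNS lookup, TCP handshake, TLS 1.3 handshake, HTTP request/response.
function networkFloorMs(rttMs: number, sequentialRoundTrips: number = 4): number {
  return rttMs * sequentialRoundTrips;
}

const devFloor = networkFloorMs(100);   // 4 * 100 = 400 ms
const fieldFloor = networkFloorMs(600); // 4 * 600 = 2400 ms
```

Every extra sequential hop on the critical path (a redirect, a late-discovered render-blocking script) adds another full RTT, which is why it barely registers locally and costs 600ms in the field.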

5. A/B test framework adding render delay

Sync experiment fetch on page load → flash-of-control-version → flash-of-variant → user-perceived jank.

Fixes:

Decide variant server-side; bake into HTML.
Or: hide content behind opacity: 0 until the experiment resolves (with a timeout), though this feels worse to users than a fast control variant.
Limit number of concurrent experiments per page.
Treat the experiment SDK as a third-party tag and budget it.
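
Deciding the variant server-side usually means deterministic bucketing: hash a stable ID so the same user always lands in the same variant with no client fetch. A minimal sketch; the function names are hypothetical, and the hash is an FNV-1a-style hash over UTF-16 code units, picked only because it fits in a few lines.

```typescript
// 32-bit FNV-1a-style hash over UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Same as multiplying by the FNV prime 16777619 modulo 2^32;
    // shifts avoid losing precision in 64-bit floats.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash;
}

// Maps (experiment, user) to one of the variants, stable across requests.
function assignVariant(userId: string, experiment: string, variants: string[]): string {
  if (variants.length === 0) throw new Error("variants must not be empty");
  // Scale the hash into [0, variants.length) using its full range, not just the low bits.
  const bucket = Math.floor((fnv1a(experiment + ":" + userId) / 4294967296) * variants.length);
  return variants[bucket];
}
```

Including the experiment name in the hash key keeps bucketing independent across experiments, so a user in "treatment" for one test is not automatically in "treatment" for the next.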

6. Dependency conflicts in monorepos

Two packages pin different React versions, so two copies ship and React's share of the bundle doubles. Or worse, hook calls cross instances and crash.

Fixes:

Strict peer-dep policies + overrides/resolutions to force one version.
npm ls react in CI to catch duplicates.
Shared internal library for cross-cutting concerns (logging, auth, design system).
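
The duplicate check can be automated by walking the JSON form of the dependency tree. The sketch assumes the output of npm ls with --all --json is a tree of nested dependencies objects, each node carrying a version string; the DepNode type and installedVersions helper are hypothetical.

```typescript
// Assumed node shape in the npm ls JSON tree.
interface DepNode {
  version?: string;
  dependencies?: { [name: string]: DepNode };
}

// Collects the distinct installed versions of pkg found anywhere in the tree.
function installedVersions(tree: DepNode, pkg: string): string[] {
  const found: string[] = [];
  const visit = (node: DepNode): void => {
    const deps = node.dependencies || {};
    for (const name of Object.keys(deps)) {
      const child = deps[name];
      if (name === pkg && child.version && found.indexOf(child.version) === -1) {
        found.push(child.version);
      }
      visit(child);
    }
  };
  visit(tree);
  return found.sort();
}
```

A CI step would fail when the returned list has more than one entry, and print the versions so the owning teams know what to align.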

7. CI build times

A 200-file monorepo takes 15 minutes to build + test. PR throughput suffers.

Fixes:

Incremental build (turborepo, nx).
Remote build cache.
Parallelize test shards.
Profile the actual slow steps; don't optimize blindly.
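
Sharding only helps if the shards finish at about the same time. One simple way to balance them, assuming per-file durations are recorded from earlier runs, is greedy longest-first assignment; the TestFile shape is hypothetical.

```typescript
interface TestFile {
  path: string;
  seconds: number; // duration recorded from a previous run
}

// Greedy longest-first: sort by duration descending,
// then place each file on whichever shard is currently lightest.
function shardTests(files: TestFile[], shardCount: number): TestFile[][] {
  if (shardCount < 1) throw new Error("shardCount must be at least 1");
  const shards: TestFile[][] = [];
  const totals: number[] = [];
  for (let i = 0; i < shardCount; i++) {
    shards.push([]);
    totals.push(0);
  }
  const byDuration = files.slice().sort((a, b) => b.seconds - a.seconds);
  for (const file of byDuration) {
    let lightest = 0;
    for (let i = 1; i < shardCount; i++) {
      if (totals[i] < totals[lightest]) lightest = i;
    }
    shards[lightest].push(file);
    totals[lightest] += file.seconds;
  }
  return shards;
}
```

This is a heuristic, not an optimal packing, but it avoids the common failure where one alphabetical shard holds all the slow integration tests.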

8. JS errors at scale = pager fatigue

Even a 0.1% error rate at 10M requests/day is 10k errors/day. The Sentry inbox overflows, and real bugs hide in the noise.

Fixes:

Error grouping by stack + breadcrumbs, not just message.
Sample low-value errors (e.g., network errors from offline users).
Alert on rate change, not absolute count.
Per-team ownership of error categories.
Source maps in production builds so stacks are readable.
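
Alerting on rate change rather than count can be sketched as a comparison against a baseline window. The ErrorWindow shape, the 2x factor, and the minimum-traffic guard are assumptions for illustration; real alerting systems add smoothing on top.

```typescript
interface ErrorWindow {
  errors: number;
  requests: number;
}

function errorRate(w: ErrorWindow): number {
  return w.requests === 0 ? 0 : w.errors / w.requests;
}

// Alerts when the current error rate exceeds the baseline rate by the given factor.
// Windows with too little traffic are ignored so a quiet hour cannot page anyone.
function shouldAlert(
  baseline: ErrorWindow,
  current: ErrorWindow,
  factor: number = 2,
  minRequests: number = 1000
): boolean {
  if (current.requests < minRequests) return false;
  return errorRate(current) > errorRate(baseline) * factor;
}
```

With the 0.1% baseline above, an hour at 0.125% stays quiet while an hour at 0.3% pages, regardless of how many thousand errors either one represents in absolute terms.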

9. Observability investment

Without RUM you're optimizing blind. But shipping web-vitals + custom marks + error tracking to analytics is itself a perf cost.

Fixes:

Beacon API for analytics (non-blocking).
Sample heavy data (full RUM at 1% sampling is plenty at high traffic).
Aggregate at the edge before shipping to backend.
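
Sampling works best when decided once per session, so a sampled session reports all of its metrics and an unsampled one sends nothing. A sketch with the transport injected; createReporter and the payload shape are hypothetical, and in a browser the transport would wrap navigator.sendBeacon(url, body).

```typescript
// Sends a payload and reports whether it was queued.
// In a browser this would be: (url, body) => navigator.sendBeacon(url, body)
type Transport = (url: string, body: string) => boolean;

interface Reporter {
  report(metric: string, value: number): boolean;
}

// The sampling decision is made once, when the reporter is created.
// random is injectable so the decision can be controlled in tests.
function createReporter(
  url: string,
  sampleRate: number,
  transport: Transport,
  random: () => number = Math.random
): Reporter {
  const sampled = random() < sampleRate;
  return {
    report(metric: string, value: number): boolean {
      if (!sampled) return false;
      return transport(url, JSON.stringify({ metric, value }));
    },
  };
}
```

Creating the reporter with a sampleRate of 0.01 gives the 1% sampling mentioned above; unsampled sessions pay only the cost of one comparison per metric.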

10. Feature flag / experiment tech debt

After a year, dozens of dead experiments are still in the codebase. Each adds a few KB and a branch that nobody touches.

Fixes:

Sunset policy on flags (max 90 days).
Automated PR to remove a flag when retired.
Reminder for stale flags in the flag dashboard.
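
The sunset policy is easy to enforce mechanically once flags carry a creation date. A sketch assuming a hypothetical Flag record exported from the flag service; the 90-day default mirrors the policy above.

```typescript
// Hypothetical flag record, e.g. exported from the flag dashboard.
interface Flag {
  name: string;
  createdAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Names of flags older than maxAgeDays as of the given time.
function staleFlags(flags: Flag[], now: Date, maxAgeDays: number = 90): string[] {
  return flags
    .filter(f => now.getTime() - f.createdAt.getTime() > maxAgeDays * DAY_MS)
    .map(f => f.name);
}
```

The returned names can feed the dashboard reminder or open the removal PR, so the policy does not depend on anyone remembering it.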

Mental model

At scale, performance is a socio-technical problem. The technical fixes are real but reactive; the organizational fixes (budgets, ownership, review, monitoring) are what keep wins from eroding. Without them, every win you ship gets eaten by the next quarter's growth.

Follow-up questions

•How do you set a sensible JS budget for a large app?
•What's your approach to third-party script governance?
•How do you handle the cache-invalidation deploy story?
•How do you measure tail-latency for international users?

Common mistakes

•Treating perf as one engineer's responsibility — without ownership it regresses.
•Setting a budget without CI enforcement — perf rots in 6 months.
•Letting third-party tags into the critical path without review.
•Soft-deleting feature flags instead of removing them — dead-code accumulation.
•Optimizing for the lab; ignoring international users on slow networks.
•Adding RUM and never looking at it.

Performance considerations

•At scale, the constant battle is regression prevention more than one-off wins. A 50ms LCP regression on 10M page views/day is a meaningful business hit; one-off perf sprints don't fix it — process and tooling do.

Edge cases

•Spikes in traffic (launches, virality) reveal cache holes invisible at steady state — load-test before launch.
•Mobile network handover (WiFi → cellular) mid-session: the retry policy must handle it.
•Browser version diversity at scale — your support matrix has long tails (Samsung Internet, in-app webviews).
•Holiday season for e-commerce — perf regressions a week before are fatal.
•Synthetic and field metrics can disagree wildly when third-party tags vary per user.

Real-world examples

•Pinterest, Etsy, BBC have published detailed retrospectives on cutting LCP at scale (10s to 2s wins, +X% conversion).
•Slack has talked about RSC adoption and bundle splitting as a way to escape monorepo bundle growth.
•Vercel's Next.js perf insights surface regressions per deploy.

Senior engineer discussion

Seniors talk about the organizational scaffolding (budgets, ownership, monitoring) at least as much as the technical fixes. They quantify cost in dollars and user impact, not just milliseconds, and they design the deploy story (cache, version handoff, rollback) into the perf strategy from day one.
