What challenges have you faced when building frontends for large user traffic?
Real challenges at scale: bundle bloat from N teams contributing, third-party tags eating perf budget, cache invalidation across thousands of edge nodes, regional latency variance, A/B test framework adding render delay, dependency conflicts in monorepos, CI build times growing with codebase, on-call pager-fatigue from JS errors at scale. Fixes are organizational as much as technical: budgets in CI, dependency review, observability investment, error grouping, and forcing experiments behind a perf gate.
When traffic is large, the hard problems shift from "make it work" to "keep it from regressing while many people contribute and many users notice the smallest issue."
Common challenges I've hit (and the fixes that worked)
1. Bundle bloat from N teams contributing
Every team adds a dependency, a feature flag, an experiment. After a year nobody owns the bundle. Initial JS climbs from 200KB to 500KB+.
Fixes:
- Bundle budget in CI that fails the build on regression (size-limit for enforcement; @next/bundle-analyzer to inspect what grew).
- PR template asking "how much JS does this add and why?"
- Quarterly bundle audits to delete dead code and replace heavy deps.
- Module ownership in CODEOWNERS so the right team reviews.
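A CI budget gate boils down to comparing measured sizes against a per-entry limit. The sketch below is a minimal illustration, not how size-limit works internally; the entry names, byte limits, and the `checkBudgets` helper are all hypothetical.

```typescript
// Hypothetical budget table: entry name -> maximum allowed bytes.
type Budget = Record<string, number>;

interface Violation {
  entry: string;
  size: number;
  limit: number;
  overBy: number;
}

// Compare measured sizes against budgets; entries with no measurement are skipped.
function checkBudgets(sizes: Record<string, number>, budgets: Budget): Violation[] {
  const violations: Violation[] = [];
  for (const [entry, limit] of Object.entries(budgets)) {
    const size = sizes[entry];
    if (size !== undefined && size > limit) {
      violations.push({ entry, size, limit, overBy: size - limit });
    }
  }
  return violations;
}

// Illustrative numbers only: "main" has grown past a 200 KiB budget.
const budgetViolations = checkBudgets(
  { main: 512_000, vendor: 180_000 },
  { main: 204_800, vendor: 204_800 },
);
```

A CI step would exit non-zero whenever `budgetViolations` is non-empty, and print `overBy` so the PR author sees how far over the change lands.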
2. Third-party tags taxing the critical path
Analytics, A/B testing, chat widgets, ad scripts. They land in <head>, block parsing, ship megabytes of their own JS. A single bad tag can move LCP by 2 seconds.
Fixes:
- All third-party scripts async or defer at minimum.
- Tag manager moved out of the critical path; loaded after LCP fires.
- Self-host analytics where possible.
- Real-user metrics segmented by tag presence — quantify the cost.
- Veto power on adding new tags above a measured budget.
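Segmenting RUM by tag presence can be as simple as comparing a percentile between the two groups. This is a rough sketch under assumptions: the `LcpSample` shape and the `hasTag` field are hypothetical, and the delta is a correlation (users with the tag may differ in other ways), not a controlled measurement.

```typescript
interface LcpSample {
  lcpMs: number;
  hasTag: boolean; // whether the (hypothetical) third-party tag loaded on this page view
}

// Nearest-rank percentile: smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// p75 LCP with and without the tag, plus the difference in milliseconds.
function tagCostP75(samples: LcpSample[]): { withTag: number; withoutTag: number; deltaMs: number } {
  const withTag = percentile(samples.filter(s => s.hasTag).map(s => s.lcpMs), 75);
  const withoutTag = percentile(samples.filter(s => !s.hasTag).map(s => s.lcpMs), 75);
  return { withTag, withoutTag, deltaMs: withTag - withoutTag };
}
```

The resulting `deltaMs` is the number to hold against the tag's budget when someone asks to add or keep it.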
3. Cache invalidation across edge
Static assets are immutable + long max-age. Easy. HTML and API responses with thousands of CDN nodes — hard. Stale HTML pointing to old chunks that no longer exist crashes the app for users with cached HTML.
Fixes:
- HTML cache key includes a build ID; old HTML is OK with old chunks (kept around for a deploy window).
- Soft purges (revalidate-on-next-request) over hard purges.
- Tag-based invalidation (Cloudflare, Fastly).
- Client service worker with a skipWaiting strategy designed for graceful version handoff.
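Keeping old chunks around "for a deploy window" needs a rule for which builds are still safe to delete. One possible rule, sketched below with a hypothetical `Deploy` record and `buildsToRetain` helper: keep the live build, plus any build that was superseded less than one window ago, since cached HTML for it may still be in circulation.

```typescript
interface Deploy {
  buildId: string;
  deployedAt: number; // epoch milliseconds
}

// Returns build IDs whose chunks should stay on the CDN/origin.
function buildsToRetain(deploys: Deploy[], now: number, windowMs: number): string[] {
  const ordered = [...deploys].sort((a, b) => a.deployedAt - b.deployedAt);
  return ordered
    .filter((_deploy, i) => {
      const successor = ordered[i + 1];
      // The live build has no successor and is always kept.
      if (successor === undefined) return true;
      // A superseded build is kept while stale HTML for it may still be cached.
      return now - successor.deployedAt <= windowMs;
    })
    .map(d => d.buildId);
}
```

The window should be at least as long as the longest HTML TTL (edge plus browser), which is an assumption to verify against the actual cache configuration.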
4. Regional latency variance
Local dev on fast WiFi → 100ms RTT. P95 user in India on 4G → 600ms RTT. The lab and the field disagree.
Fixes:
- Geo-distributed RUM (web-vitals.js → analytics, tag with continent).
- Edge SSR / SSG so HTML originates from the user's region.
- Network-throttled CI checks (slow 3G profile) to catch regressions.
- DevTools throttling baked into developer onboarding.
5. A/B test framework adding render delay
Sync experiment fetch on page load → flash-of-control-version → flash-of-variant → user-perceived jank.
Fixes:
- Decide variant server-side; bake into HTML.
- Or: hide content behind opacity: 0 until the experiment fires (with a timeout), though this feels worse to users than a fast control variant.
- Limit number of concurrent experiments per page.
- Treat the experiment SDK as a third-party tag and budget it.
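Deciding the variant server-side usually means hashing a stable user ID into a bucket, so the same user always gets the same HTML with no client round-trip. The following is a minimal sketch, not any particular experiment SDK's algorithm; `assignVariant` and its parameters are hypothetical, and the hash is FNV-1a applied to UTF-16 code units rather than bytes.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units; stable across runs and runtimes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Deterministically map (experiment, user) to one of the variants.
function assignVariant(userId: string, experiment: string, variants: string[]): string {
  if (variants.length === 0) throw new Error("at least one variant is required");
  const bucket = fnv1a(`${experiment}:${userId}`) % variants.length;
  return variants[bucket];
}
```

Including the experiment name in the hash input keeps bucket assignments independent between experiments, so the same users do not always land in "variant B" everywhere.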
6. Dependency conflicts in monorepos
Two packages depend on different React minor versions. Two copies ship. Bundle doubles for React. Or worse, hook calls cross instances and crash.
Fixes:
- Strict peer-dep policies + overrides/resolutions to force one version.
- npm ls react in CI to catch duplicates.
- Shared internal library for cross-cutting concerns (logging, auth, design system).
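The duplicate check can be automated by walking the JSON tree that `npm ls react --json` prints and collecting every distinct version. A small sketch, assuming only the nested `dependencies` / `version` fields of that output; `collectVersions` is a hypothetical helper.

```typescript
// Minimal shape of the `npm ls <pkg> --json` tree that this sketch relies on.
interface NpmLsNode {
  version?: string;
  dependencies?: Record<string, NpmLsNode>;
}

// Recursively collect every distinct installed version of `pkg`.
function collectVersions(tree: NpmLsNode, pkg: string, found = new Set<string>()): Set<string> {
  for (const [name, node] of Object.entries(tree.dependencies ?? {})) {
    if (name === pkg && node.version) found.add(node.version);
    collectVersions(node, pkg, found);
  }
  return found;
}
```

In CI, the script would parse the command's output with `JSON.parse` and fail the job when the returned set has more than one entry.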
7. CI build times
A 200-file monorepo takes 15 minutes to build + test. PR throughput suffers.
Fixes:
- Incremental build (turborepo, nx).
- Remote build cache.
- Parallelize test shards.
- Profile the actual slow steps; don't optimize blindly.
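Parallel shards only help if they finish at roughly the same time. One common heuristic, sketched here with a hypothetical `TestFile` shape and timings taken from a previous run, is longest-first greedy assignment: sort by duration and always place the next file on the lightest shard.

```typescript
interface TestFile {
  path: string;
  seconds: number; // duration recorded from an earlier CI run
}

// Greedy longest-processing-time-first split into `shardCount` shards.
function shardByDuration(files: TestFile[], shardCount: number): TestFile[][] {
  if (shardCount < 1) throw new Error("shardCount must be at least 1");
  const shards = Array.from({ length: shardCount }, () => ({ total: 0, files: [] as TestFile[] }));
  for (const file of [...files].sort((a, b) => b.seconds - a.seconds)) {
    // Pick the shard with the smallest accumulated time (first one wins ties).
    const lightest = shards.reduce((min, s) => (s.total < min.total ? s : min));
    lightest.files.push(file);
    lightest.total += file.seconds;
  }
  return shards.map(s => s.files);
}
```

This is a heuristic, not an optimal partition, but it is usually close enough that the slowest shard stops dominating wall-clock time.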
8. JS errors at scale = pager fatigue
Even a 0.1% error rate at 10M requests/day = 10k errors/day. Sentry inbox overflows. Real bugs hide in noise.
Fixes:
- Error grouping by stack + breadcrumbs, not just message.
- Sample low-value errors (e.g., network errors from offline users).
- Alert on rate change, not absolute count.
- Per-team ownership of error categories.
- Source maps in production builds so stacks are readable.
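Alerting on rate change rather than absolute count can be expressed as a ratio between the current window and a baseline. The sketch below is illustrative only: `TrafficWindow`, the default `ratio` of 2, and the `minErrors` floor of 50 are assumptions to tune, not recommended values.

```typescript
interface TrafficWindow {
  errors: number;
  requests: number;
}

// Page only when the error *rate* has grown by `ratio` versus the baseline.
function shouldAlert(
  baseline: TrafficWindow,
  current: TrafficWindow,
  ratio = 2,
  minErrors = 50,
): boolean {
  if (baseline.requests === 0 || current.requests === 0) return false;
  // Require a minimum count so a tiny window cannot page anyone.
  if (current.errors < minErrors) return false;
  const baseRate = baseline.errors / baseline.requests;
  const currRate = current.errors / current.requests;
  if (baseRate === 0) return true; // errors appeared where there were none
  return currRate / baseRate >= ratio;
}
```

With a 0.1% baseline at 10M requests/day, a day with double the traffic and 20k errors stays quiet because the rate is unchanged, while a window at 0.3% pages even though its absolute count is far smaller.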
9. Observability investment
Without RUM you're optimizing blind. But shipping web-vitals + custom marks + error tracking to analytics is itself a perf cost.
Fixes:
- Beacon API for analytics (non-blocking).
- Sample heavy data (full RUM at 1% sampling is plenty at high traffic).
- Aggregate at the edge before shipping to backend.
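Sampling plus the Beacon API might be combined along these lines. This is a sketch under assumptions: `shouldSample` and `reportMetric` are hypothetical helpers, the endpoint URL is up to the caller, and `navigator.sendBeacon` is looked up defensively so the code is a no-op outside a browser.

```typescript
// Decide once per session whether to report; `random` is injectable for testing.
function shouldSample(rate: number, random: () => number = Math.random): boolean {
  return random() < rate;
}

type BeaconCapable = { sendBeacon?: (url: string, data: string) => boolean };

// Queue the payload with navigator.sendBeacon when available.
// Returns false when the session is not sampled or the API is missing.
function reportMetric(url: string, payload: object, sampled: boolean): boolean {
  if (!sampled) return false;
  const nav = (globalThis as unknown as { navigator?: BeaconCapable }).navigator;
  if (!nav || typeof nav.sendBeacon !== "function") return false;
  return nav.sendBeacon(url, JSON.stringify(payload));
}
```

Making the sampling decision once per session (rather than per event) keeps a sampled session's data complete, which matters when correlating vitals with errors.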
10. Feature flag / experiment tech debt
After a year, dozens of dead experiments still in the codebase. Each adds a few KB and a branch that nobody touches.
Fixes:
- Sunset policy on flags (max 90 days).
- Automated PR to remove a flag when retired.
- Reminder for stale flags in the flag dashboard.
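A sunset policy is easy to enforce mechanically once flags carry a creation date. A minimal sketch, assuming a hypothetical `Flag` record exported from the flag system; the 90-day default mirrors the policy above.

```typescript
interface Flag {
  key: string;
  createdAt: Date;
}

const MAX_FLAG_AGE_DAYS = 90;

// Return the keys of flags older than the sunset policy allows.
function staleFlags(flags: Flag[], now: Date, maxAgeDays = MAX_FLAG_AGE_DAYS): string[] {
  const cutoffMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return flags
    .filter(f => now.getTime() - f.createdAt.getTime() > cutoffMs)
    .map(f => f.key);
}
```

The returned keys could feed the dashboard reminder or seed the automated removal PR.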
Mental model
At scale, performance is a socio-technical problem. The technical fixes are real but reactive; the organizational fixes (budgets, ownership, review, monitoring) are what keep wins from eroding. Without them, every win you ship gets eaten by the next quarter's growth.
Follow-up questions
- How do you set a sensible JS budget for a large app?
- What's your approach to third-party script governance?
- How do you handle the cache-invalidation deploy story?
- How do you measure tail-latency for international users?
Common mistakes
- Treating perf as one engineer's responsibility; without shared ownership it regresses.
- Setting a budget without CI enforcement; perf rots in 6 months.
- Letting third-party tags into the critical path without review.
- Soft-deleting feature flags instead of removing them, which lets dead code accumulate.
- Optimizing for the lab while ignoring international users on slow networks.
- Adding RUM and never looking at it.
Performance considerations
- At scale, the constant battle is regression prevention more than one-off wins. A 50ms LCP regression on 10M page views/day is a meaningful business hit; one-off perf sprints don't fix it, but process and tooling do.
Edge cases
- Spikes in traffic (launches, virality) reveal cache holes invisible at steady state, so load-test before launch.
- Mobile network handover (WiFi → cellular) mid-session: the retry policy must handle it.
- Browser version diversity at scale: your support matrix has long tails (Samsung Internet, in-app webviews).
- Holiday season for e-commerce: perf regressions a week before are fatal.
- Synthetic and field metrics can disagree wildly when third-party tags vary per user.
Real-world examples
- Pinterest, Etsy, BBC have published detailed retrospectives on cutting LCP at scale (10s to 2s wins, +X% conversion).
- Slack has talked about RSC adoption and bundle splitting as a way to escape monorepo bundle growth.
- Vercel's Next.js perf insights surface regressions per deploy.