Back to Behavioral
Behavioral
medium
mid

A critical UI feature is failing during peak traffic. How do you mitigate the issue?

Incident response: stabilize first (rollback, feature-flag off, scale, shed load), communicate, then diagnose with metrics/logs. Peak-only failure points to load-dependent causes — backend saturation, rate limits, race conditions, memory. Add resilience after: graceful degradation, retries, caching, load testing.

7 min read·~12 min to think through

A critical feature failing under peak load is an incident. The order matters: stabilize → communicate → diagnose → harden. Don't start with root-cause analysis while users are broken.

1. Stabilize — stop the bleeding (minutes)

Restore service first; understand it later.

  • Roll back the most recent deploy if the failure correlates with one.
  • Feature-flag the feature off (or to a degraded mode) — the rest of the app stays up. This is exactly why critical features should be behind kill switches.
  • Scale up if it's a capacity issue — more instances, raise limits, scale the DB.
  • Shed load / degrade gracefully — serve cached/stale data, queue non-critical work, disable expensive operations, show a clear "experiencing high load" state instead of errors.
  • Mitigate the obvious — bump a rate limit, add a circuit breaker, enable a CDN/edge cache for the hot path.

2. Communicate — in parallel

  • Open an incident, assign an owner/commander.
  • Update a status page and stakeholders; tell support.
  • Keep a running incident log (timestamps, actions) — invaluable for the postmortem.

3. Diagnose — once it's stable

Peak-only failure means the cause is load-dependent. Use metrics/logs/traces:

  • Backend saturation — DB connection pool exhausted, slow queries under load, CPU/memory maxed, downstream service timing out.
  • Rate limits — your own API or a third party throttling at volume.
  • Race conditions — only manifest under concurrency.
  • Resource exhaustion — memory leaks, connection/socket limits, cache stampede (everything misses cache at once and hammers the origin).
  • Frontend — too many requests on load, no dedup, retry storms amplifying the problem, an N+1 request pattern.
  • Correlate the failure start with deploys, traffic graphs, and dependency dashboards.

4. Fix the root cause (after the incident)

  • Address what diagnosis found — query optimization, connection pooling, caching, fixing the race.
  • Verify under load testing before calling it done.

5. Harden so it doesn't recur

  • Graceful degradation designed in — the feature should fail soft, not hard.
  • Resilience patterns — retries with backoff+jitter, circuit breakers, timeouts, request dedup/coalescing, caching with stale-while-revalidate.
  • Load testing in CI/staging at realistic peak.
  • Capacity planning & autoscaling; alerts that fire before full failure (saturation, error-rate, latency SLOs).
  • Kill switches on all critical features.
  • Blameless postmortem — what happened, why, action items.

The framing

"First priority is users, not curiosity — stabilize via rollback or kill switch, scale, and degrade gracefully, while communicating. Peak-only failure points to a load-dependent cause: backend saturation, rate limits, races, or cache stampede — I'd confirm with metrics and traces. Then fix the root cause, verify with load testing, and harden with degradation, circuit breakers, and pre-failure alerting."

Follow-up questions

  • Why stabilize before diagnosing?
  • What load-dependent causes would you suspect for a peak-only failure?
  • What is a cache stampede and how do you prevent it?
  • What resilience patterns prevent this class of incident?

Common mistakes

  • Debugging root cause while users are still broken instead of stabilizing first.
  • No kill switch, so the only option is a risky hotfix under pressure.
  • Frontend retry storms amplifying a backend overload.
  • No load testing, so peak behavior is never validated.
  • Skipping the blameless postmortem.

Performance considerations

  • Peak failures are fundamentally performance/capacity events. Caching, request dedup, connection pooling, and circuit breakers reduce load amplification. Retry-with-jitter prevents synchronized retry storms. Load testing surfaces the ceiling before users do.

Edge cases

  • Failure caused by a third-party dependency, not your code.
  • Cache stampede after a cache flush or deploy.
  • Rollback not possible (DB migration already applied).
  • Cascading failure across multiple services.

Real-world examples

  • Feature-flagging a failing checkout step into a degraded mode while diagnosing.
  • A cache stampede after deploy fixed with request coalescing + staggered TTLs.

Senior engineer discussion

Seniors run it as an incident: stabilize (rollback/kill switch/scale/degrade) and communicate before diagnosing. They reason that peak-only => load-dependent (saturation, rate limits, races, cache stampede), confirm with telemetry, then harden with graceful degradation, circuit breakers, retry-with-jitter, load testing, pre-failure alerting, and a blameless postmortem.

Related questions