How would you design a system to handle client side caching, API retries, and error boundaries gracefully?

Three layers. Cache: a data-fetching library (React Query / RTK Query / SWR) with TTL, stale-while-revalidate, invalidation after mutations, and request dedup. Retries: exponential backoff with jitter, only for idempotent requests, capped at 2-3 attempts, with a circuit breaker that trips after consecutive failures. Error boundaries: route-level and feature-level boundaries with fallback UIs, a global handler for unhandled rejections, and reporting to a monitoring service (Sentry). Tie them together with a single fetch wrapper.

Treat caching, retries, and error handling as one system. They overlap (a retried request still populates the cache; a cached response avoids a retry), and the boundaries between them define the resilience story.

Layer 1: cache + dedup + invalidation

Don't roll this layer yourself. Use React Query, RTK Query, or SWR. All three give:

  • Cache keyed by query args.
  • Dedup in-flight requests with the same key.
  • Stale-while-revalidate: return cache instantly, fetch in background, update on success.
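  • TTL (staleTime / gcTime in React Query; gcTime was called cacheTime before v5).
  • Invalidation after mutations: tags in RTK Query, query keys in React Query, key-based mutate in SWR. Mutate X → invalidate → affected queries refetch.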
  • Refetch on focus / reconnect / interval.
tsx
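const { data } = useQuery({
  queryKey: ['user', id],
  queryFn: async ({ signal }) => {
    const r = await fetch(`/users/${id}`, { signal });
    // fetch resolves on 4xx/5xx, so throw here or `retry` never fires on HTTP errors.
    if (!r.ok) throw new Error(`GET /users/${id} failed: ${r.status}`);
    return r.json();
  },
  staleTime: 60_000,
  retry: 2,
  retryDelay: attempt => Math.min(1000 * 2 ** attempt, 30_000),
});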

Two components asking for the same user issue one request. The cache survives unmounts until it is garbage-collected. When the user comes back to the tab, stale data refetches in the background with no spinner.
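
To make dedup and stale-while-revalidate concrete, here is a toy sketch of the mechanism. It is not how React Query, RTK Query, or SWR are actually implemented; `createQueryCache`, `Entry`, and the injectable `now` clock are hypothetical names, and invalidation, garbage collection, and subscriber notification are left out.

```typescript
type Entry<T> = { data?: T; fetchedAt: number; inflight?: Promise<T> };

// Toy cache: dedups in-flight requests and serves stale data while revalidating.
// `now` is injectable so staleness can be exercised without real timers.
function createQueryCache(staleTimeMs: number, now: () => number = Date.now) {
  const entries = new Map<string, Entry<unknown>>();

  function revalidate<T>(entry: Entry<T>, fetcher: () => Promise<T>): Promise<T> {
    if (entry.inflight) return entry.inflight; // dedup: share the request already in flight
    const p = fetcher()
      .then(data => {
        entry.data = data;
        entry.fetchedAt = now();
        return data;
      })
      .finally(() => {
        entry.inflight = undefined;
      });
    entry.inflight = p;
    return p;
  }

  return async function query<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
    let entry = entries.get(key) as Entry<T> | undefined;
    if (!entry) {
      entry = { fetchedAt: 0 };
      entries.set(key, entry as Entry<unknown>);
    }
    if (entry.data === undefined) return revalidate(entry, fetcher); // cold: must wait
    const stale = now() - entry.fetchedAt >= staleTimeMs;
    // Stale: refresh in the background. Failures are ignored in this sketch;
    // a real library would surface them to subscribers.
    if (stale) void revalidate(entry, fetcher).catch(() => {});
    return entry.data; // warm: answer instantly, even if stale
  };
}
```

The point of the sketch is the ordering: a warm key is answered from memory first and refreshed second, which is why a returning user sees data rather than a spinner.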

Layer 2: retry policy

Retry only idempotent requests: GET, HEAD, and PUT or DELETE where the server treats a repeated call as a no-op (for example, deleting an already tombstoned record). Never auto-retry POST without an idempotency-key header: a blind retry can mean duplicate charges or duplicate emails.
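
A minimal sketch of the client half of an idempotency key, assuming the server deduplicates on an `Idempotency-Key` header (a common convention, but the header name and the server behavior are assumptions here). `MutationInit`, `withIdempotencyKey`, and `hasIdempotencyKey` are hypothetical helpers.

```typescript
type MutationInit = { method: string; headers?: Record<string, string>; body?: string };

// Attach a caller-supplied key. The key must be created once per user action
// and reused for every retry of that action, never regenerated per attempt.
function withIdempotencyKey(init: MutationInit, key: string): MutationInit {
  return { ...init, headers: { ...init.headers, 'Idempotency-Key': key } };
}

// A retry policy can use this to decide whether a POST is safe to resend.
function hasIdempotencyKey(init: MutationInit): boolean {
  return Boolean(init.headers?.['Idempotency-Key']);
}
```

In a browser the key would typically come from something like `crypto.randomUUID()`, generated when the user clicks, before the first attempt goes out.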

ts
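// Methods HTTP defines as idempotent; drop PUT/DELETE if your server's handlers aren't.
const IDEMPOTENT = new Set(['GET', 'HEAD', 'OPTIONS', 'PUT', 'DELETE']);
const isIdempotent = (method = 'GET') => IDEMPOTENT.has(method.toUpperCase());

async function fetchWithRetry(input: RequestInfo | URL, init: RequestInit = {}, retries = 2): Promise<Response> {
  const canRetry = isIdempotent(init.method);
  for (let attempt = 0; ; attempt++) {
    const lastTry = !canRetry || attempt >= retries;
    try {
      const res = await fetch(input, init);
      if (res.status < 500 || lastTry) return res;  // done, or nothing left to try
    } catch (err) {
      if (lastTry) throw err;                        // network error on the final attempt
    }
    const delay = 200 * 2 ** attempt + Math.random() * 100;  // exp backoff + jitter
    await new Promise(r => setTimeout(r, delay));
  }
}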

Jitter is critical: without it, a downed service comes back online and gets dog-piled by every client retrying in sync.
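
One way to spread clients out further is "full jitter": pick a random delay between zero and the exponential ceiling instead of adding a small random offset. This is a swapped-in variant, not the formula used in the code above; `backoffDelay` and its default values are assumptions, and `random` is injectable only to make the function testable.

```typescript
// Full-jitter backoff: uniform in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  baseMs = 200,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

With the defaults, attempt 0 waits somewhere under 200 ms, attempt 3 somewhere under 1.6 s, and no attempt ever waits longer than the 30 s cap.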

Circuit breaker for tougher resilience: after N consecutive failures to a host, stop retrying for M seconds. Prevents retry storms.

ts
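class CircuitBreaker {
  failures = 0;
  openedAt = 0;  // time of the most recent failure
  // Open while the failure streak is at 5+ and the 30 s cool-down hasn't elapsed.
  // After the cool-down, requests flow again: a failure reopens, a success resets.
  isOpen(): boolean { return this.failures >= 5 && Date.now() - this.openedAt < 30_000; }
  record(ok: boolean): void {
    if (ok) this.failures = 0;
    else { this.failures++; this.openedAt = Date.now(); }
  }
}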
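
A sketch of how a breaker might sit in front of requests, one breaker per host, failing fast while the circuit is open. `createBreaker`, `createGuard`, and the thresholds are hypothetical; the breaker is re-declared as a closure with an injectable clock so the example is self-contained and testable.

```typescript
type Breaker = { isOpen(): boolean; record(ok: boolean): void };

function createBreaker(threshold = 5, coolDownMs = 30_000, now: () => number = Date.now): Breaker {
  let failures = 0;
  let lastFailureAt = 0;
  return {
    isOpen: () => failures >= threshold && now() - lastFailureAt < coolDownMs,
    record(ok: boolean) {
      if (ok) {
        failures = 0;
        return;
      }
      failures += 1;
      lastFailureAt = now();
    },
  };
}

// Returns a function that runs `request` unless the breaker for `host` is open.
function createGuard(makeBreaker: () => Breaker = () => createBreaker()) {
  const breakers = new Map<string, Breaker>();
  return async function guarded<T>(host: string, request: () => Promise<T>): Promise<T> {
    let breaker = breakers.get(host);
    if (!breaker) {
      breaker = makeBreaker();
      breakers.set(host, breaker);
    }
    if (breaker.isOpen()) throw new Error(`circuit open for ${host}`); // fail fast, no network call
    try {
      const result = await request();
      breaker.record(true);
      return result;
    } catch (err) {
      breaker.record(false);
      throw err;
    }
  };
}
```

Keying by host keeps one failing dependency from blocking calls to healthy ones.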

Layer 3: error boundaries

Two scopes:

Route-level: catches anything that breaks a whole page. Shows a fallback with retry + report.

tsx
<ErrorBoundary fallback={<RouteError />}>
  <RouteContent />
</ErrorBoundary>

Feature-level: catches errors inside a widget so the rest of the page survives.

tsx
<ErrorBoundary fallback={<WidgetError />}>
  <Chart />
</ErrorBoundary>

React's built-in error boundaries don't catch async errors or event handlers. For those, use onError callbacks from your data lib + a global handler:

js
window.addEventListener('unhandledrejection', e => log(e.reason));
window.addEventListener('error', e => log(e.error));

Wire all of it into Sentry or similar so production errors are visible.
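
One wrinkle with the global handlers: a promise can reject with any value, not just an Error, and `e.error` can be null. A small normalizer before logging keeps reports usable. This is a sketch; `toError` is a hypothetical helper, not part of any monitoring SDK.

```typescript
// Coerce an arbitrary rejection reason into an Error so the logger
// always receives something with a message and a stack.
function toError(reason: unknown): Error {
  if (reason instanceof Error) return reason;
  if (typeof reason === 'string') return new Error(reason);
  try {
    return new Error(`Non-error rejection: ${JSON.stringify(reason)}`);
  } catch {
    // JSON.stringify throws on circular structures and BigInt values.
    return new Error('Non-error rejection: [unserializable value]');
  }
}
```

The handlers above would then call `log(toError(e.reason))` and `log(toError(e.error))`.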

Putting it together: a single fetch wrapper

ts
type Options = RequestInit & { timeoutMs?: number; retries?: number };

// BASE_URL, handle401, isIdempotent, sleep, and ApiError are app-level helpers assumed to exist.
export async function api<T>(path: string, opts: Options = {}): Promise<T> {
  const { timeoutMs = 10_000, retries = 2, signal: callerSignal, ...rest } = opts;
  const url = `${BASE_URL}${path}`;
  const retryable = isIdempotent(opts.method ?? 'GET');

  for (let attempt = 0; attempt <= retries; attempt++) {
    const ctrl = new AbortController();
    let timedOut = false;
    const timer = setTimeout(() => { timedOut = true; ctrl.abort(); }, timeoutMs);
    // Forward caller cancellation (e.g. React Query's signal) to this attempt.
    const onAbort = () => ctrl.abort();
    if (callerSignal?.aborted) onAbort();
    callerSignal?.addEventListener('abort', onAbort, { once: true });

    try {
      const res = await fetch(url, { ...rest, signal: ctrl.signal });
      if (res.status === 401) handle401();
      if (res.status >= 500 && attempt < retries && retryable) {
        await sleep(200 * 2 ** attempt + Math.random() * 100);
        continue;
      }
      if (!res.ok) {
        const body = await res.json().catch(() => ({}));
        throw new ApiError(res.status, body.message ?? res.statusText, body);
      }
      const isJson = res.headers.get('content-type')?.includes('json');
      return (isJson ? await res.json() : await res.text()) as T;
    } catch (err: any) {
      // Retry our own timeouts only; a caller abort must propagate immediately.
      if (err.name === 'AbortError' && timedOut && attempt < retries && retryable) {
        await sleep(200 * 2 ** attempt + Math.random() * 100);
        continue;
      }
      throw err;
    } finally {
      clearTimeout(timer);
      callerSignal?.removeEventListener('abort', onAbort);
    }
  }
  throw new Error('unreachable');
}
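
The wrapper uses `ApiError` and `sleep` without defining them. One possible shape, as an assumption rather than a prescribed design (the constructor signature simply mirrors how the wrapper calls it):

```typescript
// Error type carrying the HTTP status and parsed body, so UI code can branch on them.
class ApiError extends Error {
  readonly status: number;
  readonly body: unknown;

  constructor(status: number, message: string, body?: unknown) {
    super(message);
    // Keeps `instanceof ApiError` working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = 'ApiError';
    this.status = status;
    this.body = body;
  }
}

// Promise-based delay used between retry attempts.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms));
```

Carrying `status` on the error is what lets boundaries and toasts treat a 404 differently from a 500 without re-parsing messages.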

Then plug api into React Query's queryFn:

ts
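useQuery({ queryKey: ['user', id], queryFn: ({ signal }) => api<User>(`/users/${id}`, { signal }), retry: false });

React Query handles caching, dedup, and refetch. The wrapper handles auth, timeouts, and server-error retries. Let only one layer retry: with React Query's default of 3 retries stacked on the wrapper's 2, a single failing call can fan out into 12 requests, hence retry: false here. Error boundaries catch what propagates up, which for query errors means opting in with throwOnError (useErrorBoundary in v4). Optimistic updates handle UX for mutations.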

UX surface

  • Optimistic UI for mutations — instant feedback, rollback on rejection.
  • Stale data while revalidating — never show a spinner if you have a cached value.
  • Retry banner on errors with a button — don't auto-retry forever.
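  • Offline detection via navigator.onLine and the online / offline events, plus queued mutations. Treat onLine as a hint: true only means a network is attached, not that your API is reachable.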
  • Toast for transient errors, inline for form-field errors, page for catastrophic.
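
The "queue mutations" idea from the list above can be sketched as an in-memory FIFO that replays on reconnect. `createMutationQueue` and `Mutation` are hypothetical names; persistence across reloads and conflict resolution are deliberately out of scope.

```typescript
type Mutation = { id: string; run: () => Promise<void> };

function createMutationQueue() {
  const pending: Mutation[] = [];
  let flushing = false;

  return {
    enqueue(mutation: Mutation): void {
      pending.push(mutation);
    },
    size: (): number => pending.length,
    // Replay in order and stop at the first failure, so later mutations
    // never run ahead of one they may depend on.
    async flush(): Promise<{ replayed: number; error?: unknown }> {
      if (flushing) return { replayed: 0 };
      flushing = true;
      let replayed = 0;
      try {
        while (pending.length > 0) {
          const next = pending[0];
          if (!next) break;
          await next.run();
          pending.shift(); // remove only after it succeeded
          replayed += 1;
        }
        return { replayed };
      } catch (error) {
        return { replayed, error }; // failed mutation stays at the head; report the error
      } finally {
        flushing = false;
      }
    },
  };
}
```

In a browser this would be wired up with something like `window.addEventListener('online', () => queue.flush())`, with the returned `error` sent to logging.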

Things to avoid

  • Auto-retrying mutations without idempotency keys.
  • Infinite retries — pick a cap.
  • No jitter — synchronized retry storms.
  • Treating all errors the same — 401/403/404/5xx have different UX.
  • Eating errors silently — they should reach logging.
  • Hand-rolling cache when React Query exists.
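
As an illustration of not treating all errors the same, here is one possible mapping from HTTP status to UI surface. The specific choices and the `Surface` names are assumptions; the point is that the decision lives in one place.

```typescript
type Surface =
  | 'reauthenticate'
  | 'forbidden-notice'
  | 'not-found-page'
  | 'inline-field-errors'
  | 'retry-toast'
  | 'error-page';

// Decide where an API failure should be shown, based on its status code.
function surfaceFor(status: number): Surface {
  if (status === 401) return 'reauthenticate';      // session expired
  if (status === 403) return 'forbidden-notice';    // signed in, not allowed
  if (status === 404) return 'not-found-page';
  if (status === 400 || status === 422) return 'inline-field-errors'; // validation
  if (status === 429 || status >= 500) return 'retry-toast';          // transient
  return 'error-page';
}
```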

Follow-up questions

  • When is it safe to auto-retry a POST?
  • What's an idempotency key and how do you implement one?
  • How does React Query's stale-while-revalidate work under the hood?
  • What's a circuit breaker and when do you need one client-side?

Common mistakes

  • Auto-retrying POST without idempotency keys — duplicate side effects.
  • Retry without jitter — retry storms when service recovers.
  • Catching errors and not reporting them — invisible production failures.
  • Rolling your own cache instead of using React Query / RTKQ.
  • Error boundaries only at the root — one error nukes the whole app.
  • Optimistic updates without rollback — UI shows success when the server actually failed.
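
The last mistake is easy to avoid with a snapshot. A library-agnostic sketch, assuming state is reachable through a getter and setter (`optimistic`, `get`, `set`, and `commit` are hypothetical names):

```typescript
// Apply `next` immediately, then confirm with the server.
// On failure, restore the snapshot and rethrow so the error still reaches logging.
async function optimistic<S>(
  get: () => S,
  set: (state: S) => void,
  next: S,
  commit: () => Promise<void>,
): Promise<void> {
  const previous = get(); // snapshot before the optimistic write
  set(next);
  try {
    await commit();
  } catch (err) {
    set(previous); // rollback
    throw err;
  }
}
```

React Query's mutation callbacks follow the same shape: snapshot in onMutate, restore in onError. Restoring a whole snapshot can clobber other updates made in the meantime, so real code often refetches after settling.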

Performance considerations

  • Caching cuts request volume, sometimes sharply on read-heavy screens (figures like 50-90% are a rough rule of thumb; measure your own hit rate).
  • Dedup collapses concurrent identical requests from one client into a single call.
  • Retries with backoff smooth over transient failures; circuit breakers prevent client-side amplification of server outages.
  • Error boundaries contain blast radius so one widget's failure doesn't take down the page.

Edge cases

  • 401 mid-session: refresh token, replay original request, or redirect to login — pick one and be consistent.
  • Offline → queue mutations → replay on reconnect with conflict resolution.
  • Background tab: pause polling, refetch on focus.
  • WebSocket reconnect: needs its own retry strategy + missed-message catch-up.
  • Server-Sent Events: EventSource reconnects on its own, but tune the retry interval and resume from the last event ID (Last-Event-ID) so messages aren't missed.
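
For the 401 case, the "refresh and replay" option has one subtlety: several requests usually fail at once, and they should share a single refresh. A sketch under those assumptions; `createAuthClient`, `refresh`, and `send` are hypothetical, and `send` stands in for whatever performs the real request with a given token.

```typescript
function createAuthClient(refresh: () => Promise<string>, initialToken: string) {
  let token = initialToken;
  let refreshing: Promise<string> | null = null;

  // Single-flight refresh: concurrent callers await the same promise.
  function refreshOnce(): Promise<string> {
    if (!refreshing) {
      refreshing = refresh()
        .then(fresh => {
          token = fresh;
          return fresh;
        })
        .finally(() => {
          refreshing = null;
        });
    }
    return refreshing;
  }

  return async function request<R extends { status: number }>(
    send: (token: string) => Promise<R>,
  ): Promise<R> {
    const first = await send(token);
    if (first.status !== 401) return first;
    const fresh = await refreshOnce();
    return send(fresh); // replay once; a second 401 goes back to the caller
  };
}
```

Replaying exactly once avoids a refresh loop when the refresh token itself is no longer valid; at that point redirecting to login is the consistent fallback.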

Real-world examples

  • GitHub's REST API returns ETags; a conditional request answered with 304 Not Modified doesn't count against the primary rate limit, so revalidating a cached response is cheap.
  • Linear pre-fetches related data so navigation feels instant.
  • Sentry / Datadog / Honeybadger capture uncaught errors via window.onerror and unhandled rejections via the unhandledrejection event.

Senior engineer discussion

Seniors think in terms of layered resilience: cache reduces calls, dedup reduces concurrency, retry handles transient failures, circuit breakers prevent amplification, error boundaries contain blast radius, monitoring closes the loop. They don't conflate these — each is a separate concern with its own knobs. They also make a clear distinction between safe (idempotent) and unsafe (mutation) retries.
