
How would you design a rate limiter for API calls in a single page application?

A client-side limiter that throttles outgoing requests: a token-bucket or sliding-window counter to cap request rate, a queue for excess requests, request deduplication and caching to reduce volume, plus respecting server 429s with backoff. The client limiter is UX/protection, not security.


A client-side rate limiter caps the rate at which a SPA sends requests — to respect a known API limit, protect a fragile backend, and smooth bursts. Important framing: a client limiter is a cooperation/UX mechanism, not a security control — the server must still enforce its own limits.

Why you'd want one

  • The API has a documented limit (e.g. 10 req/s) and you want to stay under it rather than getting 429s.
  • Prevent self-inflicted bursts — a component that fires on every keystroke/scroll.
  • Be a good citizen to a shared or fragile backend.

Core algorithm — token bucket (most common)

  • A bucket holds up to N tokens; tokens refill at a steady rate (e.g. 10/sec).
  • Each request consumes a token. Token available → send immediately. No token → queue or reject.
  • Allows short bursts (up to bucket size) while bounding the sustained rate. Simple and flexible.
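The refill-and-consume rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `createBucket`, `tryRemove`, and the injectable `now` clock are names assumed here for the example.

```javascript
// Minimal token bucket. `now` is injectable so the refill math can be
// exercised without real timers. All names are illustrative.
function createBucket({ capacity, refillPerSec, now = () => Date.now() }) {
  let tokens = capacity; // start full so an initial burst is allowed
  let last = now();      // time of the last refill, in ms

  function refill() {
    const t = now();
    const elapsedSec = (t - last) / 1000;
    // Add tokens for the elapsed time, never exceeding capacity.
    tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
    last = t;
  }

  return {
    // Consumes a token and returns true if one is available, else returns false.
    tryRemove() {
      refill();
      if (tokens >= 1) {
        tokens -= 1;
        return true;
      }
      return false;
    },
  };
}
```

Refilling lazily on each call, rather than on a `setInterval`, means an idle page does no work and the bucket stays correct even if timers are throttled in a background tab.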

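Alternatives: sliding-window counter (count requests in the trailing window and block once over the limit; more precise, with no bursts at window boundaries), fixed window (simplest, but a burst straddling a window boundary can reach up to twice the limit), leaky bucket (perfectly smooth output, at the cost of delaying bursts).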
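For comparison, here is a hedged sketch of the sliding-window idea. It uses the exact "log" variant, which stores one timestamp per request, rather than the approximate counter variant; `createSlidingWindow`, `tryAcquire`, and the `now` parameter are hypothetical names.

```javascript
// Sliding-window log: allow a request only if fewer than `limit` requests
// were sent in the trailing `windowMs`. Names are illustrative.
function createSlidingWindow({ limit, windowMs, now = () => Date.now() }) {
  const stamps = []; // send times inside the trailing window, oldest first

  return {
    tryAcquire() {
      const t = now();
      // Drop timestamps that have slid out of the trailing window.
      while (stamps.length > 0 && stamps[0] <= t - windowMs) stamps.shift();
      if (stamps.length >= limit) return false;
      stamps.push(t);
      return true;
    },
  };
}
```

Memory grows with `limit`, which is negligible for client-side limits in the tens of requests per second.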

The request queue

When over the limit, you usually want to queue, not drop:

text
request() → token available? ── yes ──▶ send
                             └─ no ───▶ enqueue ──▶ drained as tokens refill

  • Add priority — user-initiated requests jump ahead of background prefetches.
  • Add a queue cap and timeout so it can't grow unbounded.
  • Optionally concurrency limiting too — at most M requests in flight at once.
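One way to combine priority, a cap, and a timeout is sketched below. It is an assumption-laden illustration: `createRequestQueue`, its option names, and the convention that a lower `priority` number is served first are all invented for this example.

```javascript
// Bounded priority queue: lower `priority` number is served first, FIFO
// within the same priority. Entries that waited longer than `maxWaitMs`
// are handed to `onExpire` instead of being returned. Names are illustrative.
function createRequestQueue({
  maxSize,
  maxWaitMs,
  onExpire = () => {},
  now = () => Date.now(),
}) {
  const items = [];
  let seq = 0; // tie-breaker that preserves arrival order

  return {
    // Returns false when the queue is full, so the caller can reject early.
    enqueue(job, priority = 1) {
      if (items.length >= maxSize) return false;
      items.push({ job, priority, seq: seq++, enqueuedAt: now() });
      items.sort((a, b) => a.priority - b.priority || a.seq - b.seq);
      return true;
    },
    // Returns the next job that has not timed out, or undefined if none is left.
    dequeue() {
      while (items.length > 0) {
        const item = items.shift();
        if (now() - item.enqueuedAt <= maxWaitMs) return item.job;
        onExpire(item.job); // waited too long; report it and keep looking
      }
      return undefined;
    },
    get size() {
      return items.length;
    },
  };
}
```

Sorting on every insert is fine for the small queues a cap implies; a heap would only matter for much larger sizes. In a real wrapper, `onExpire` would typically reject the promise the caller is awaiting.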

Reduce volume first (the best rate limiting is fewer requests)

  • Debounce/throttle the events that trigger requests.
  • Deduplicate in-flight identical requests; cache responses (React Query/SWR).
  • Batch multiple calls into one where the API supports it.
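In-flight deduplication in particular is small enough to sketch by hand. The helper below is hypothetical (`createDeduper` and `dedupe` are not from any library) and assumes the caller supplies a key that uniquely identifies the request, such as method plus URL.

```javascript
// Shares one promise among concurrent callers that use the same key.
// `fetcher` is any function returning a promise, e.g. () => fetch(url).
function createDeduper() {
  const inFlight = new Map(); // key -> pending promise

  return function dedupe(key, fetcher) {
    if (inFlight.has(key)) return inFlight.get(key);

    // Wrapping in a Promise turns a synchronous throw into a rejection.
    const pending = new Promise((resolve) => resolve(fetcher())).finally(() => {
      inFlight.delete(key); // allow a fresh request once this one settles
    });

    inFlight.set(key, pending);
    return pending;
  };
}
```

This only collapses requests that overlap in time; it is not a cache, so a library such as React Query or SWR is still the better fit when responses should be reused after they settle.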

Respect the server

  • Honor 429 + Retry-After; exponential backoff + jitter on retries.
  • Read X-RateLimit-Remaining/-Reset if exposed and proactively slow down.
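A rough sketch of the 429 handling follows. The function names, the defaults (`baseMs`, `capMs`, `maxRetries`), and the choice of "full jitter" (a random delay between zero and the exponential ceiling) are assumptions for illustration; `Retry-After` itself may be either a number of seconds or an HTTP date.

```javascript
// Parses a Retry-After value into milliseconds, or null if it is unusable.
function parseRetryAfter(value, nowMs = Date.now()) {
  if (value == null || value.trim() === "") return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000; // delay-seconds form
  const dateMs = Date.parse(trimmed);                        // HTTP-date form
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Exponential backoff with full jitter: random delay below min(capMs, baseMs * 2^attempt).
function backoffDelay(attempt, { baseMs = 500, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceiling;
}

// Retries on 429, preferring the server's Retry-After over the computed backoff.
async function fetchWithRetry(url, opts = {}, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, opts);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const waitMs =
      parseRetryAfter(res.headers.get("Retry-After")) ?? backoffDelay(attempt);
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
}
```

After the last retry the 429 response is returned rather than thrown, so the caller decides how to surface it. Jitter matters because many tabs backing off on the same schedule would otherwise retry in lockstep.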

Implementation shape

Wrap fetch/axios in a limiter module (or an axios interceptor): limitedFetch(url) checks the bucket, then either sends the request or enqueues it, and the queue drains on the refill timer. Because the wrapper is centralized, every call site inherits the limit.

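js
// createTokenBucket is a helper you write (or a library equivalent), not a built-in.
const limiter = createTokenBucket({ capacity: 10, refillPerSec: 10 });

// Every call site goes through this wrapper, so the limit is enforced in one place.
async function limitedFetch(url, opts) {
  await limiter.acquire(); // resolves when a token is free
  return fetch(url, opts);
}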
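The snippet above leaves `createTokenBucket` undefined. One possible shape for it, matching the `{ capacity, refillPerSec }` options and the promise-returning `acquire()` used there, is sketched below. It is an assumption about how such a helper could work, with a plain FIFO of waiters and no priority, cap, or cancellation.

```javascript
// Hypothetical createTokenBucket: acquire() resolves as soon as a token is
// free, otherwise the caller waits in FIFO order until the bucket refills.
function createTokenBucket({ capacity, refillPerSec }) {
  let tokens = capacity;
  let last = Date.now();
  const waiters = []; // resolve callbacks of pending acquire() calls
  let timer = null;   // non-null while a drain is scheduled

  function refill() {
    const t = Date.now();
    tokens = Math.min(capacity, tokens + ((t - last) / 1000) * refillPerSec);
    last = t;
  }

  function drain() {
    timer = null;
    refill();
    // Hand out whole tokens to waiters, oldest first.
    while (waiters.length > 0 && tokens >= 1) {
      tokens -= 1;
      waiters.shift()();
    }
    if (waiters.length > 0) {
      // Sleep until roughly the moment the next whole token is available.
      const waitMs = Math.max(1, Math.ceil(((1 - tokens) / refillPerSec) * 1000));
      timer = setTimeout(drain, waitMs);
    }
  }

  return {
    acquire() {
      return new Promise((resolve) => {
        waiters.push(resolve);
        if (timer === null) drain(); // nothing scheduled, so try right away
      });
    },
  };
}
```

A timer is only scheduled while someone is waiting, so an idle limiter costs nothing. A production version would likely add the queue cap, priorities, and an abort path for requests that are still queued when the page navigates away.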

UI

Don't fail silently: if requests are queued or delayed, reflect it in the UI ("syncing…"), and disable controls that would fire repeated requests while one is still pending.

The framing

"A token bucket caps the outgoing rate while allowing small bursts; excess requests go into a priority queue that drains as tokens refill, with a queue cap to stay bounded. But I'd reduce volume first — debounce, dedup, cache, batch — and respect server 429s with backoff. Crucially, the client limiter is for UX and being a good citizen; the server still must enforce limits itself, since the client can't be trusted."

Follow-up questions

  • Token bucket vs sliding window vs leaky bucket — tradeoffs?
  • Why is a client-side rate limiter not a security control?
  • How do you handle priority in the request queue?
  • How does this interact with server-side 429 responses?

Common mistakes

  • Treating a client limiter as a security mechanism (it's bypassable).
  • Dropping requests instead of queueing them.
  • An unbounded queue that grows forever under load.
  • Not reducing volume first (debounce/dedup/cache), which limits the symptom rather than the cause.
  • Ignoring server 429s and Retry-After.

Performance considerations

  • Reducing request volume (debounce, dedup, cache, batch) is more effective than limiting, because a request that is never made costs nothing.
  • Token bucket allows useful bursts; a sliding window is stricter, capping every trailing window at the limit.
  • A bounded priority queue keeps memory and latency predictable.

Edge cases

  • Queue growing unbounded — needs a cap and timeouts.
  • High-priority request stuck behind low-priority ones.
  • Server limit differs from the client's assumed limit.
  • Page navigation/unmount with requests still queued.

Real-world examples

  • A client wrapper enforcing a third-party API's documented rate limit.
  • p-limit / bottleneck-style concurrency+rate limiting around fetch; axios interceptor queueing requests.

Senior engineer discussion

Seniors pick token bucket for burst-tolerance, design a bounded priority queue for overflow, and — importantly — frame the client limiter as cooperation/UX, not security, since the client is bypassable and the server must enforce its own limits. They lead with volume reduction (debounce/dedup/cache/batch) and integrate server 429/backoff handling.
