Networking
medium
mid

How do you handle rate limits and errors when calling AI APIs from the frontend?

Proxy calls through your backend, which owns the provider key, retries, and quotas. Respect 429/Retry-After with exponential backoff and jitter, handle AI-specific failures (timeouts, mid-stream drops, content filtering, context-length errors), degrade gracefully, and give users clear feedback. Add cost controls on top.

7 min read·~12 min to think through

AI APIs fail in the usual ways (rate limits, 5xx errors) and in AI-specific ways (slow generations, dropped streams, content filtering, context-length errors). Handle both, and remember that the backend should own most of this.

Architecture first: the backend is your buffer

Calls go Browser → your backend → AI provider. So:

  • The provider key, retries, and provider-level rate limits live server-side.
  • Your backend enforces your per-user quotas and rate limits — protecting against one user burning the shared budget.
  • The frontend deals mostly with your backend's responses, which you control and can make consistent.
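
As a rough illustration of the per-user quota point, the backend could keep a request counter per user and time window. This is a minimal sketch, assuming a single backend process with in-memory state; `UserQuota`, `tryConsume`, and the limits shown are hypothetical names and values, not part of any provider's API.

```typescript
// Hypothetical in-memory, fixed-window quota. A multi-instance backend
// would need shared storage instead of a local Map.
interface QuotaWindow {
  start: number; // ms timestamp when the current window opened
  used: number;  // requests counted in the current window
}

class UserQuota {
  private windows = new Map<string, QuotaWindow>();
  private limit: number;     // allowed requests per window (assumed >= 1)
  private windowMs: number;  // window length in milliseconds
  private now: () => number; // injectable clock, handy for tests

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns 0 if the request is allowed, else the ms until the window resets.
  tryConsume(userId: string): number {
    const t = this.now();
    const current = this.windows.get(userId);
    if (current === undefined || t - current.start >= this.windowMs) {
      this.windows.set(userId, { start: t, used: 1 });
      return 0;
    }
    if (current.used < this.limit) {
      current.used += 1;
      return 0;
    }
    return current.start + this.windowMs - t;
  }
}

// Example: 2 requests per user per minute.
const quota = new UserQuota(2, 60_000);
const waitMs = quota.tryConsume("user-123");
// When waitMs > 0, the backend would answer 429 with
// Retry-After: Math.ceil(waitMs / 1000) instead of calling the provider.
```

Because the backend produces this 429 itself, the frontend sees one consistent shape whether the limit came from your quota or from the provider.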

Rate limits (429)

  • Respect Retry-After: when a 429 response carries this header, it tells you how long to wait, either as a number of seconds or as an HTTP date.
  • Exponential backoff + jitter for retries (so concurrent clients don't retry in lockstep).
  • Cap retries; after N attempts, surface a clear error.
  • Prefer doing retries on the backend — it has the full picture and can queue.
  • Reduce volume — debounce input, prevent rapid resubmits, disable the submit button while a request is in flight.
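
The retry rules above can be sketched as follows. This is a minimal illustration, assuming "full jitter" (a random wait between zero and the exponential ceiling) and treating 429 and 5xx as retryable; `fetchWithRetry`, `RetryOptions`, and the default values are hypothetical, and the same logic would ideally run on the backend.

```typescript
// Minimal structural type so a real fetch Response can be passed in.
interface ResponseLike {
  status: number;
  headers: { get(name: string): string | null };
}

interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // ceiling for the first retry's wait
  maxDelayMs: number;  // upper bound on any single wait
}

const DEFAULT_RETRY: RetryOptions = { maxAttempts: 4, baseDelayMs: 500, maxDelayMs: 8000 };

// Retry-After is either a number of seconds or an HTTP date.
function parseRetryAfterMs(header: string | null, nowMs: number = Date.now()): number | null {
  if (header === null) return null;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const dateMs = Date.parse(value);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Full jitter: uniform in [0, min(maxDelay, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  opts: RetryOptions = DEFAULT_RETRY,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries on 429/5xx only; a thrown network error propagates to the caller.
async function fetchWithRetry<T extends ResponseLike>(
  doRequest: () => Promise<T>,
  opts: RetryOptions = DEFAULT_RETRY,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= opts.maxAttempts - 1) return res;
    const retryAfterMs = parseRetryAfterMs(res.headers.get("Retry-After"));
    const wait = retryAfterMs !== null ? retryAfterMs : backoffDelayMs(attempt, opts);
    // If the server asks for a longer wait than we tolerate, stop and
    // let the caller surface a clear error instead of retrying early.
    if (wait > opts.maxDelayMs) return res;
    await sleep(wait);
  }
}
```

Returning the last response rather than throwing is a design choice in this sketch: the caller can read the status and show a message such as how long to wait.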

AI-specific failure modes

  • Timeouts — generations can hang; set a timeout and let the user cancel (AbortController).
  • Mid-stream disconnects — a streaming response can drop partway. Keep the partial content, show it, offer "regenerate" or "continue."
  • Context-length errors — the prompt + history exceeded the model's window. Handle by truncating/summarizing history, not crashing.
  • Content filtering / safety refusals — the model declines or the provider blocks output. Show a graceful message, not an error stack.
  • Malformed output — if you asked for JSON, the model might return invalid JSON. Validate and retry/repair.
  • Provider outage — fall back gracefully; maybe a cached response or a "try again later."
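
For the mid-stream and cancellation cases, the key idea is that whatever text has already arrived is never thrown away. The sketch below assumes the stream has already been turned into an async iterable of text chunks; `readWithPartial` and `StreamResult` are hypothetical names.

```typescript
// Outcome of reading a stream: the text is kept even when the stream fails.
type StreamResult =
  | { status: "complete"; text: string }
  | { status: "interrupted"; text: string; reason: "aborted" | "dropped" };

// Consume text chunks and report progress as they arrive.
// `signal` is structurally compatible with a real AbortSignal.
async function readWithPartial(
  chunks: AsyncIterable<string>,
  onPartial: (textSoFar: string) => void,
  signal?: { readonly aborted: boolean },
): Promise<StreamResult> {
  let text = "";
  try {
    for await (const chunk of chunks) {
      // Stop quietly if the user cancelled or a timeout fired.
      if (signal !== undefined && signal.aborted) {
        return { status: "interrupted", text, reason: "aborted" };
      }
      text += chunk;
      onPartial(text);
    }
    return { status: "complete", text };
  } catch {
    // The connection dropped (or the abort surfaced as an error):
    // return what we have so the UI can offer "regenerate" or "continue".
    const reason = signal !== undefined && signal.aborted ? "aborted" : "dropped";
    return { status: "interrupted", text, reason };
  }
}
```

In a browser, the signal would typically come from an AbortController whose `abort()` is called both by a Cancel button and by a `setTimeout` that enforces the overall time limit.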

Graceful degradation & UX

  • Never fail silently; never dump a raw provider error.
  • Clear, friendly messaging: "AI is busy — retrying…", "That response was cut short — regenerate?".
  • Preserve user input and conversation on failure — don't make them retype.
  • Offer manual retry and regenerate affordances.
  • Design every state explicitly: loading, streaming, partial, and error.
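
One way to keep raw provider errors away from users is to have the backend return a small set of error codes and map each one to friendly copy on the frontend. The codes, messages, and `toUserFacingError` below are hypothetical; they assume your backend normalizes provider failures into these values.

```typescript
// Hypothetical error codes returned by your own backend, not by any provider.
type AiErrorCode =
  | "rate_limited"
  | "timeout"
  | "stream_interrupted"
  | "context_too_long"
  | "content_filtered"
  | "malformed_output"
  | "provider_unavailable";

interface UserFacingError {
  message: string;
  // What the UI should offer next.
  action: "auto_retry" | "manual_retry" | "edit_prompt";
}

const MESSAGES: Record<AiErrorCode, UserFacingError> = {
  rate_limited: { message: "AI is busy, retrying…", action: "auto_retry" },
  timeout: { message: "That took too long. Try again?", action: "manual_retry" },
  stream_interrupted: { message: "That response was cut short. Regenerate?", action: "manual_retry" },
  context_too_long: { message: "This conversation is too long. Try a shorter message or start a new chat.", action: "edit_prompt" },
  content_filtered: { message: "The AI couldn't respond to that request. Try rephrasing it.", action: "edit_prompt" },
  malformed_output: { message: "Something went wrong with that response. Regenerate?", action: "manual_retry" },
  provider_unavailable: { message: "The AI service is unavailable right now. Please try again later.", action: "manual_retry" },
};

const FALLBACK: UserFacingError = {
  message: "Something went wrong. Your message was kept, so you can try again.",
  action: "manual_retry",
};

// Unknown codes get the generic fallback, so raw error text never reaches the UI.
function toUserFacingError(code: string): UserFacingError {
  return Object.prototype.hasOwnProperty.call(MESSAGES, code)
    ? MESSAGES[code as AiErrorCode]
    : FALLBACK;
}
```

Keeping the mapping in one table makes it easy to review the copy and to confirm that every failure mode has a designed state.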

Cost & abuse control

  • Per-user quotas and rate limits on your backend.
  • Token/length caps on requests and responses.
  • Debounce, block spam, require auth.
  • Cache responses (by prompt hash) — cheaper, faster, and dodges rate limits entirely for repeats.
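
Caching by prompt hash might look like the sketch below on a Node.js backend. It assumes an in-memory Map with a time-to-live and that returning an identical response for a repeated prompt is acceptable for your feature; `PromptCache` and `cachedCompletion` are hypothetical names, while `createHash` is Node's built-in crypto API.

```typescript
import { createHash } from "node:crypto";

interface CacheEntry {
  value: string;
  expiresAt: number; // ms timestamp after which the entry is stale
}

class PromptCache {
  private entries = new Map<string, CacheEntry>();
  private ttlMs: number;
  private now: () => number; // injectable clock, handy for tests

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Hash the model together with the prompt so models never share entries.
  static key(model: string, prompt: string): string {
    const payload = JSON.stringify([model, prompt.trim()]);
    return createHash("sha256").update(payload).digest("hex");
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: string): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Only call the provider (via `generate`) on a cache miss.
async function cachedCompletion(
  cache: PromptCache,
  model: string,
  prompt: string,
  generate: () => Promise<string>,
): Promise<string> {
  const key = PromptCache.key(model, prompt);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = await generate();
  cache.set(key, value);
  return value;
}
```

If other request parameters change the output (for example a system prompt), they would need to be part of the hashed payload as well.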

How to answer

"First, proxy through my backend — it owns the provider key, provider-level retries, and my per-user quotas. For rate limits, respect 429/Retry-After with exponential backoff + jitter, capped, ideally retried server-side, and reduce volume with debounce + disabled-while-pending. Then the AI-specific failures: timeouts with cancellation, mid-stream drops (keep partial + regenerate), context-length errors (truncate history), content filtering and malformed output. Everything degrades gracefully with clear messaging, preserved input, and manual retry — plus cost controls: quotas, token caps, and caching by prompt hash."

Follow-up questions

  • Why handle retries on the backend rather than the frontend?
  • How do you handle a streaming response that drops mid-generation?
  • What do you do when the prompt exceeds the model's context window?
  • How does caching help with both cost and rate limits?
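
For the context-window question, one simple strategy is to keep system messages and then the newest turns that fit a token budget. The sketch below is only illustrative: `truncateHistory` is a hypothetical helper, and the four-characters-per-token estimate is a rough assumption, since real tokenizers differ by model.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough heuristic only; swap in a real tokenizer for the model you use.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep every system message, then as many of the newest turns as fit.
function truncateHistory(
  messages: ChatMessage[],
  maxTokens: number,
  countTokens: (text: string) => number = estimateTokens,
): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");

  let budget = maxTokens;
  for (const m of system) budget -= countTokens(m.content);

  const kept: ChatMessage[] = [];
  // Walk from newest to oldest so the most recent context survives.
  for (const m of [...turns].reverse()) {
    const cost = countTokens(m.content);
    if (cost > budget) break; // stop at the first turn that does not fit
    budget -= cost;
    kept.unshift(m); // restore chronological order
  }
  return [...system, ...kept];
}
```

Summarizing the dropped turns into a single short message is the other option the answer mentions; it costs an extra model call but preserves more context.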

Common mistakes

  • Retrying aggressively without backoff/jitter, worsening the rate limit.
  • Ignoring Retry-After.
  • Not handling mid-stream disconnects — losing partial content.
  • Showing raw provider errors to users.
  • No cost controls — one user exhausts the shared budget.

Performance considerations

  • Caching by prompt hash avoids both latency and rate limits for repeated prompts.
  • Backoff with jitter prevents synchronized retry storms.
  • Backend-side queuing smooths bursts.
  • Token caps bound latency and cost.

Edge cases

  • Stream dropping after partial output.
  • Context-length-exceeded errors.
  • Content-filter refusals.
  • Model returning malformed JSON when structured output was requested.
  • Provider-wide outage.
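
The malformed-JSON case can be handled by validating the output and asking again a bounded number of times. This is a minimal sketch: `parseModelJson` and `generateValidJson` are hypothetical helpers, the repair step only strips surrounding prose, and the `generate` callback stands in for whatever call your backend makes to the model.

```typescript
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Parse model output as JSON and check it with a caller-supplied type guard.
function parseModelJson<T>(
  raw: string,
  isValid: (value: unknown) => value is T,
): ParseResult<T> {
  // Light repair: models sometimes wrap the object in prose, so keep only
  // the span from the first "{" to the last "}" when both are present.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  const candidate = start !== -1 && end > start ? raw.slice(start, end + 1) : raw;

  let value: unknown;
  try {
    value = JSON.parse(candidate);
  } catch {
    return { ok: false, error: "Output was not valid JSON" };
  }
  if (!isValid(value)) {
    return { ok: false, error: "JSON did not match the expected shape" };
  }
  return { ok: true, value };
}

// Retry up to maxAttempts times, passing the previous error back as feedback
// so the next prompt can ask the model to correct itself.
async function generateValidJson<T>(
  generate: (feedback: string | null) => Promise<string>,
  isValid: (value: unknown) => value is T,
  maxAttempts: number = 2,
): Promise<ParseResult<T>> {
  let feedback: string | null = null;
  let last: ParseResult<T> = { ok: false, error: "maxAttempts must be at least 1" };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = parseModelJson(await generate(feedback), isValid);
    if (last.ok) return last;
    feedback = last.error;
  }
  return last;
}
```

Capping the attempts matters here for the same reason as with rate limits: each repair attempt is another billed model call.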

Real-world examples

  • A chat feature: backend proxy with per-user quotas + server-side backoff retries; frontend keeps partial streamed content and offers regenerate on a dropped stream.

Senior engineer discussion

Seniors put the backend at the center (key, retries, quotas), handle standard rate limiting (429/Retry-After, backoff+jitter, volume reduction), and enumerate the AI-specific failures — timeouts, mid-stream drops, context-length, content filtering, malformed output. They insist on graceful degradation with preserved input and treat caching + quotas as core cost/abuse control.
