How do you handle rate limits and errors when calling AI APIs on the frontend?
Proxy through your backend (which owns the provider key, retries, and quotas), respect 429/Retry-After with exponential backoff + jitter, handle AI-specific failures (timeouts, mid-stream drops, content filtering, context-length errors), degrade gracefully, and surface clear feedback with cost controls.
AI APIs fail in normal ways (rate limits, 5xx) and AI-specific ways (slow generations, mid-stream drops, content filtering, context-length overflows). Handle both — and remember the backend should own most of this.
Architecture first: the backend is your buffer
Calls go Browser → your backend → AI provider. So:
- The provider key, retries, and provider-level rate limits live server-side.
- Your backend enforces your per-user quotas and rate limits — protecting against one user burning the shared budget.
- The frontend deals mostly with your backend's responses, which you control and can make consistent.
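A minimal sketch of that proxy, assuming Express on Node 18+ (built-in `fetch`); the provider URL, the `x-user-id` header, and the quota numbers are placeholders for whatever your auth layer and provider actually use:

```ts
import express from "express";

const app = express();
app.use(express.json());

// Naive in-memory per-user counter, reset each minute (illustration only;
// production would use Redis or similar shared storage).
const requestsThisMinute = new Map<string, number>();
setInterval(() => requestsThisMinute.clear(), 60_000);

app.post("/api/ai", async (req, res) => {
  const userId = req.header("x-user-id") ?? "anonymous"; // assume auth middleware set this
  const count = (requestsThisMinute.get(userId) ?? 0) + 1;
  requestsThisMinute.set(userId, count);

  // Per-user quota: one user can't burn the shared budget.
  if (count > 20) {
    res.status(429).set("Retry-After", "60").json({ error: "Quota exceeded" });
    return;
  }

  // The provider key lives only here; it never reaches the browser.
  const upstream = await fetch("https://api.example-provider.com/v1/chat", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.PROVIDER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ messages: req.body.messages, max_tokens: 512 }),
  });

  // Normalize provider failures into a consistent shape for the frontend.
  if (!upstream.ok) {
    res.status(502).json({ error: "AI provider error" });
    return;
  }
  res.json(await upstream.json());
});

app.listen(3000);
```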
Rate limits (429)
- Respect `Retry-After` — the provider tells you how long to wait.
- Exponential backoff + jitter for retries, so concurrent clients don't retry in lockstep (see the sketch after this list).
- Cap retries; after N attempts, surface a clear error.
- Prefer doing retries on the backend — it has the full picture and can queue.
- Reduce volume — debounce input, prevent rapid resubmits, disable the submit button while a request is in flight.
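A browser-side sketch of that retry policy. It only handles the seconds form of `Retry-After` (the header can also be an HTTP date), and the retry budget is illustrative:

```ts
// Retry on 429 and transient 5xx with exponential backoff + full jitter,
// preferring the server's Retry-After when present.
async function fetchWithBackoff(
  url: string,
  init: RequestInit,
  maxAttempts = 4,
): Promise<Response> {
  let res!: Response;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    res = await fetch(url, init);
    if (res.status !== 429 && res.status < 500) return res; // success or non-retryable
    if (attempt === maxAttempts - 1) break; // retry budget exhausted; surface it

    const retryAfterSec = Number(res.headers.get("Retry-After"));
    const baseMs =
      Number.isFinite(retryAfterSec) && retryAfterSec > 0
        ? retryAfterSec * 1000
        : 2 ** attempt * 1000; // 1s, 2s, 4s, ...
    // Full jitter: a random delay in [0, base) so concurrent clients
    // don't all retry at the same moment.
    await new Promise((r) => setTimeout(r, Math.random() * baseMs));
  }
  return res; // caller decides how to surface the final error
}
```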
AI-specific failure modes
- Timeouts — generations can hang; set a timeout and let the user cancel (`AbortController`); sketched after this list.
- Mid-stream disconnects — a streaming response can drop partway. Keep the partial content, show it, offer "regenerate" or "continue."
- Context-length errors — the prompt + history exceeded the model's window. Handle by truncating/summarizing history, not crashing.
- Content filtering / safety refusals — the model declines or the provider blocks output. Show a graceful message, not an error stack.
- Malformed output — if you asked for JSON, the model might return invalid JSON. Validate and retry/repair.
- Provider outage — fall back gracefully; maybe a cached response or a "try again later."
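A frontend sketch combining the first two failure modes: a hard timeout via `AbortController` (the same signal can be wired to a Cancel button) and partial-content preservation when the stream drops. `/api/ai/stream` is a hypothetical plain-text streaming endpoint on your own backend:

```ts
async function streamCompletion(
  prompt: string,
  onChunk: (textSoFar: string) => void,
): Promise<{ text: string; complete: boolean }> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30_000); // cap hung generations
  let text = "";
  try {
    const res = await fetch("/api/ai/stream", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
      signal: controller.signal, // also expose controller.abort() as a Cancel button
    });
    if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      text += decoder.decode(value, { stream: true });
      onChunk(text); // render partial output as it arrives
    }
    return { text, complete: true };
  } catch {
    // Timeout, user cancel, or mid-stream drop: keep the partial content
    // so the UI can show it and offer "regenerate" or "continue".
    return { text, complete: false };
  } finally {
    clearTimeout(timeout);
  }
}
```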
Graceful degradation & UX
- Never fail silently; never dump a raw provider error.
- Clear, friendly messaging: "AI is busy — retrying…", "That response was cut short — regenerate?".
- Preserve user input and conversation on failure — don't make them retype.
- Manual retry / regenerate affordances.
- Design the loading, streaming, partial, and error states deliberately (one way to model them is sketched below).
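One way to keep those states honest is a discriminated union the rendering code can switch over (names illustrative):

```ts
type AiRequestState =
  | { status: "idle" }
  | { status: "loading" }                    // request sent, nothing back yet
  | { status: "streaming"; partial: string } // tokens arriving
  | { status: "partial"; text: string }      // stream dropped; offer continue/regenerate
  | { status: "done"; text: string }
  | { status: "error"; message: string; retryable: boolean };
```

An exhaustive `switch` over `status` (with a `never` check in the default branch) turns a forgotten UI state into a compile-time error rather than a blank screen.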
Cost & abuse control
- Per-user quotas and rate limits on your backend.
- Token/length caps on requests and responses.
- Debounce, block spam, require auth.
- Cache responses (by prompt hash) — cheaper, faster, and dodges rate limits entirely for repeats.
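A server-side sketch of caching by prompt hash; the in-memory `Map` stands in for whatever shared store (e.g. Redis with a TTL) you'd actually use:

```ts
import { createHash } from "node:crypto";

const cache = new Map<string, string>();

// Key on a hash of the normalized request, so identical prompts hit the cache.
function promptKey(model: string, messages: unknown): string {
  return createHash("sha256")
    .update(JSON.stringify({ model, messages }))
    .digest("hex");
}

async function cachedCompletion(
  model: string,
  messages: unknown,
  callProvider: () => Promise<string>,
): Promise<string> {
  const key = promptKey(model, messages);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // no provider call: no cost, no rate-limit hit

  const result = await callProvider();
  cache.set(key, result);
  return result;
}
```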
How to answer
"First, proxy through my backend — it owns the provider key, provider-level retries, and my per-user quotas. For rate limits, respect 429/Retry-After with exponential backoff + jitter, capped, ideally retried server-side, and reduce volume with debounce + disabled-while-pending. Then the AI-specific failures: timeouts with cancellation, mid-stream drops (keep partial + regenerate), context-length errors (truncate history), content filtering and malformed output. Everything degrades gracefully with clear messaging, preserved input, and manual retry — plus cost controls: quotas, token caps, and caching by prompt hash."
Follow-up questions
- Why handle retries on the backend rather than the frontend?
- How do you handle a streaming response that drops mid-generation?
- What do you do when the prompt exceeds the model's context window?
- How does caching help with both cost and rate limits?
Common mistakes
- Retrying aggressively without backoff/jitter, worsening the rate limit.
- Ignoring Retry-After.
- Not handling mid-stream disconnects — losing partial content.
- Showing raw provider errors to users.
- No cost controls — one user exhausts the shared budget.
Performance considerations
- Caching by prompt hash avoids both latency and rate limits for repeated prompts.
- Backoff with jitter prevents synchronized retry storms.
- Backend-side queuing smooths bursts.
- Token caps bound latency and cost.
Edge cases
- Stream dropping after partial output.
- Context-length-exceeded errors.
- Content-filter refusals.
- Model returning malformed JSON when structured output was requested.
- Provider-wide outage.
Real-world examples
- A chat feature: backend proxy with per-user quotas + server-side backoff retries; frontend keeps partial streamed content and offers regenerate on a dropped stream.