How do you handle rate limits and errors when calling AI APIs on the frontend?
Proxy through your backend (which owns the provider key, retries, and quotas), respect 429/Retry-After with exponential backoff + jitter, handle AI-specific failures (timeouts, mid-stream drops, content filtering, context-length errors), degrade gracefully, and surface clear feedback with cost controls.
AI APIs fail in normal ways (rate limits, 5xx) and AI-specific ways (slow generations, mid-stream drops, content filtering, context-length overflows). Handle both — and remember the backend should own most of this.
Architecture first: the backend is your buffer
Calls go Browser → your backend → AI provider. So:
- The provider key, retries, and provider-level rate limits live server-side.
- Your backend enforces your per-user quotas and rate limits — protecting against one user burning the shared budget.
- The frontend deals mostly with your backend's responses, which you control and can make consistent.
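A minimal sketch of that proxy, assuming Express on Node 18+ (built-in `fetch`); the provider URL, the `x-user-id` header, and the quota numbers are placeholders for whatever your auth layer and provider actually use:

```ts
import express from "express";

const app = express();
app.use(express.json());

// Naive in-memory per-user counter, reset each minute (illustration only;
// production would use Redis or similar shared storage).
const requestsThisMinute = new Map<string, number>();
setInterval(() => requestsThisMinute.clear(), 60_000);

app.post("/api/ai", async (req, res) => {
  const userId = req.header("x-user-id") ?? "anonymous"; // assume auth middleware set this
  const count = (requestsThisMinute.get(userId) ?? 0) + 1;
  requestsThisMinute.set(userId, count);

  // Per-user quota: one user can't burn the shared budget.
  if (count > 20) {
    res.status(429).set("Retry-After", "60").json({ error: "Quota exceeded" });
    return;
  }

  // The provider key lives only here; it never reaches the browser.
  const upstream = await fetch("https://api.example-provider.com/v1/chat", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.PROVIDER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ messages: req.body.messages, max_tokens: 512 }),
  });

  // Normalize provider failures into a consistent shape for the frontend.
  if (!upstream.ok) {
    res.status(502).json({ error: "AI provider error" });
    return;
  }
  res.json(await upstream.json());
});

app.listen(3000);
```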
Rate limits (429)
- Respect `Retry-After` — the provider tells you how long to wait.
- Exponential backoff + jitter for retries, so concurrent clients don't retry in lockstep (see the sketch after this list).
- Cap retries; after N attempts, surface a clear error.
- Prefer doing retries on the backend — it has the full picture and can queue.
- Reduce volume — debounce input, prevent rapid resubmits, disable the submit button while a request is in flight.
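A browser-side sketch of that retry policy. It only handles the seconds form of `Retry-After` (the header can also be an HTTP date), and the retry budget is illustrative:

```ts
// Retry on 429 and transient 5xx with exponential backoff + full jitter,
// preferring the server's Retry-After when present.
async function fetchWithBackoff(
  url: string,
  init: RequestInit,
  maxAttempts = 4,
): Promise<Response> {
  let res!: Response;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    res = await fetch(url, init);
    if (res.status !== 429 && res.status < 500) return res; // success or non-retryable
    if (attempt === maxAttempts - 1) break; // retry budget exhausted; surface it

    const retryAfterSec = Number(res.headers.get("Retry-After"));
    const baseMs =
      Number.isFinite(retryAfterSec) && retryAfterSec > 0
        ? retryAfterSec * 1000
        : 2 ** attempt * 1000; // 1s, 2s, 4s, ...
    // Full jitter: a random delay in [0, base) so concurrent clients
    // don't all retry at the same moment.
    await new Promise((r) => setTimeout(r, Math.random() * baseMs));
  }
  return res; // caller decides how to surface the final error
}
```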
AI-specific failure modes
- Timeouts — generations can hang; set a timeout and let the user cancel (`AbortController`); sketched after this list.
- Mid-stream disconnects — a streaming response can drop partway. Keep the partial content, show it, offer "regenerate" or "continue."
- Context-length errors — the prompt + history exceeded the model's window. Handle by truncating/summarizing history, not crashing.
- Content filtering / safety refusals — the model declines or the provider blocks output. Show a graceful message, not an error stack.
- Malformed output — if you asked for JSON, the model might return invalid JSON. Validate and retry/repair.
- Provider outage — fall back gracefully; maybe a cached response or a "try again later."
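A frontend sketch combining the first two failure modes: a hard timeout via `AbortController` (the same signal can be wired to a Cancel button) and partial-content preservation when the stream drops. `/api/ai/stream` is a hypothetical plain-text streaming endpoint on your own backend:

```ts
async function streamCompletion(
  prompt: string,
  onChunk: (textSoFar: string) => void,
): Promise<{ text: string; complete: boolean }> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30_000); // cap hung generations
  let text = "";
  try {
    const res = await fetch("/api/ai/stream", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
      signal: controller.signal, // also expose controller.abort() as a Cancel button
    });
    if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      text += decoder.decode(value, { stream: true });
      onChunk(text); // render partial output as it arrives
    }
    return { text, complete: true };
  } catch {
    // Timeout, user cancel, or mid-stream drop: keep the partial content
    // so the UI can show it and offer "regenerate" or "continue".
    return { text, complete: false };
  } finally {
    clearTimeout(timeout);
  }
}
```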
Graceful degradation & UX
- Never fail silently; never dump a raw provider error.
- Clear, friendly messaging: "AI is busy — retrying…", "That response was cut short — regenerate?".
- Preserve user input and conversation on failure — don't make them retype.
- Manual retry / regenerate affordances.
- Design the loading, streaming, partial, and error states deliberately (one way to model them is sketched below).
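One way to keep those states honest is a discriminated union the rendering code can switch over (names illustrative):

```ts
type AiRequestState =
  | { status: "idle" }
  | { status: "loading" }                    // request sent, nothing back yet
  | { status: "streaming"; partial: string } // tokens arriving
  | { status: "partial"; text: string }      // stream dropped; offer continue/regenerate
  | { status: "done"; text: string }
  | { status: "error"; message: string; retryable: boolean };
```

An exhaustive `switch` over `status` (with a `never` check in the default branch) turns a forgotten UI state into a compile-time error rather than a blank screen.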
Cost & abuse control
- Per-user quotas and rate limits on your backend.
- Token/length caps on requests and responses.
- Debounce, block spam, require auth.
- Cache responses (by prompt hash) — cheaper, faster, and dodges rate limits entirely for repeats.
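A server-side sketch of caching by prompt hash; the in-memory `Map` stands in for whatever shared store (e.g. Redis with a TTL) you'd actually use:

```ts
import { createHash } from "node:crypto";

const cache = new Map<string, string>();

// Key on a hash of the normalized request, so identical prompts hit the cache.
function promptKey(model: string, messages: unknown): string {
  return createHash("sha256")
    .update(JSON.stringify({ model, messages }))
    .digest("hex");
}

async function cachedCompletion(
  model: string,
  messages: unknown,
  callProvider: () => Promise<string>,
): Promise<string> {
  const key = promptKey(model, messages);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // no provider call: no cost, no rate-limit hit

  const result = await callProvider();
  cache.set(key, result);
  return result;
}
```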
How to answer
"First, proxy through my backend — it owns the provider key, provider-level retries, and my per-user quotas. For rate limits, respect 429/Retry-After with exponential backoff + jitter, capped, ideally retried server-side, and reduce volume with debounce + disabled-while-pending. Then the AI-specific failures: timeouts with cancellation, mid-stream drops (keep partial + regenerate), context-length errors (truncate history), content filtering and malformed output. Everything degrades gracefully with clear messaging, preserved input, and manual retry — plus cost controls: quotas, token caps, and caching by prompt hash."
Follow-up questions
- Why handle retries on the backend rather than the frontend?
- How do you handle a streaming response that drops mid-generation?
- What do you do when the prompt exceeds the model's context window?
- How does caching help with both cost and rate limits?
Common mistakes
- Retrying aggressively without backoff/jitter, worsening the rate limit.
- Ignoring Retry-After.
- Not handling mid-stream disconnects — losing partial content.
- Showing raw provider errors to users.
- No cost controls — one user exhausts the shared budget.
Performance considerations
- Caching by prompt hash avoids both latency and rate limits for repeated prompts.
- Backoff with jitter prevents synchronized retry storms.
- Backend-side queuing smooths bursts.
- Token caps bound latency and cost.
Edge cases
- Stream dropping after partial output.
- Context-length-exceeded errors.
- Content-filter refusals.
- Model returning malformed JSON when structured output was requested.
- Provider-wide outage.
Real-world examples
- A chat feature: backend proxy with per-user quotas + server-side backoff retries; frontend keeps partial streamed content and offers regenerate on a dropped stream.