How do you handle rate limiting and large payloads when consuming AI services?
Rate limits: queue with priority, respect Retry-After headers, exponential backoff + jitter, per-user quotas, circuit breaker. Large payloads: stream responses (SSE/HTTP chunked), enforce input token budgets, chunk huge inputs, use cheaper models for prefilter then expensive for refine, summarize/compress conversation history, return partial results, page through outputs. Always proxy via your server (never API keys client-side), cache deterministic queries, batch where API supports it.
LLM APIs are expensive, slow, and rate-limited. Production usage needs explicit handling for both rate limits (theirs) and payload size (yours and theirs).
Rate limiting
Most LLM APIs (OpenAI, Anthropic, Google) have per-org or per-key rate limits in:
- Requests per minute (RPM).
- Tokens per minute (TPM) — usually the binding constraint.
- Concurrent requests.
On hitting the limit, providers return 429 Too Many Requests, typically with a Retry-After header telling you how long to wait.
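Since TPM is usually the binding constraint, it can help to track your own token spend and hold requests back before the provider has to reject them. The following is a minimal token-bucket sketch, not any provider's actual accounting: `TokenBudget`, its parameters, and the injectable clock are hypothetical names, and the token estimate passed in is assumed to come from your own tokenizer or heuristic.

```javascript
// Hypothetical token-bucket tracker for a tokens-per-minute (TPM) budget.
// The bucket refills continuously at tokensPerMinute / 60 tokens per second.
class TokenBudget {
  constructor(tokensPerMinute, now = () => Date.now()) {
    this.capacity = tokensPerMinute;
    this.available = tokensPerMinute;
    this.now = now; // injectable clock, which makes the class easy to test
    this.lastRefill = now();
  }

  // Adds back the tokens earned since the last refill, up to capacity.
  refill() {
    const t = this.now();
    const refilled = ((t - this.lastRefill) / 60000) * this.capacity;
    this.available = Math.min(this.capacity, this.available + refilled);
    this.lastRefill = t;
  }

  // Returns true and deducts the tokens if the request fits the budget now.
  tryConsume(estimatedTokens) {
    this.refill();
    if (estimatedTokens > this.available) return false;
    this.available -= estimatedTokens;
    return true;
  }
}
```

A worker would call `tryConsume` before sending and leave the request in the queue when it returns false.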
Strategies
1. Server-side queue with priority.
User submits → request enters queue → worker pool sends to LLM, respecting RPM/TPM budgets. Interactive requests (user waiting) get higher priority than background (summarize stale chat).
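class LLMQueue {
  pending = [];        // kept sorted: higher priority first, FIFO within a priority
  inflight = 0;
  maxConcurrent = 5;

  // callFn: async function that actually sends the request to the provider
  constructor(callFn) { this.call = callFn; }

  enqueue(req, priority = 0) {
    return new Promise((resolve, reject) => {
      const job = { req, priority, resolve, reject };
      const i = this.pending.findIndex(j => j.priority < priority);
      if (i === -1) this.pending.push(job); else this.pending.splice(i, 0, job);
      this.drain();
    });
  }

  // Start jobs until the concurrency cap is reached. Nothing is awaited here,
  // so a slow request never blocks the loop.
  drain() {
    while (this.inflight < this.maxConcurrent && this.pending.length) {
      const { req, resolve, reject } = this.pending.shift();
      this.inflight++;
      this.call(req)
        .then(resolve, reject)
        .finally(() => { this.inflight--; this.drain(); });
    }
  }
}
2. Respect Retry-After + exponential backoff.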
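const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function callWithRetry(req, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const res = await fetch('https://api.openai.com/v1/chat/completions', req);
    if (res.status === 429 || res.status >= 500) {
      if (attempt === retries) throw new Error(`Failed after ${retries} retries`);
      // Retry-After is in seconds; a missing or non-numeric value (e.g. an HTTP date) yields NaN
      const retryAfter = parseInt(res.headers.get('Retry-After') ?? '', 10) * 1000;
      const backoff = Math.min(1000 * 2 ** attempt, 30000); // capped exponential backoff
      const jitter = Math.random() * 500;
      await sleep((retryAfter > 0 ? retryAfter : backoff) + jitter);
      continue;
    }
    return res;
  }
}
Jitter spreads retries out so that clients do not all retry at the same instant when the limit resets (the thundering-herd problem).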
3. Per-user quotas.
Track requests/tokens per user. Reject (with helpful UI) when over quota. Use Redis or a similar fast store.
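A per-user quota can be sketched as a fixed-window counter. This is an illustration under stated assumptions rather than a production design: `UserQuota` and its parameters are hypothetical, and a `Map` stands in for Redis, so the counts here live in one process only.

```javascript
// Hypothetical fixed-window quota: at most `limit` tokens per user per window.
// A Map stands in for Redis; in production the counter would live in a shared store.
class UserQuota {
  constructor(limit, windowMs, now = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock for testing
    this.usage = new Map(); // userId -> { windowStart, used }
  }

  // Records the usage if it fits and returns { allowed, remaining }.
  consume(userId, tokens) {
    const t = this.now();
    let entry = this.usage.get(userId);
    if (!entry || t - entry.windowStart >= this.windowMs) {
      entry = { windowStart: t, used: 0 }; // start a fresh window
      this.usage.set(userId, entry);
    }
    if (entry.used + tokens > this.limit) {
      return { allowed: false, remaining: this.limit - entry.used };
    }
    entry.used += tokens;
    return { allowed: true, remaining: this.limit - entry.used };
  }
}
```

The `remaining` value is what the UI can show the user when a request is rejected.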
4. Circuit breaker.
If the LLM provider has an outage, opening a circuit (stop trying for N seconds) prevents your service from amplifying the problem and queueing up requests that will time out.
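One way to sketch the idea: count consecutive failures, and once a threshold is reached, reject calls immediately until a cooldown has passed. `CircuitBreaker`, its defaults, and the wrapped `fn` are hypothetical; a fuller implementation would also restrict how many probe calls are let through after the cooldown.

```javascript
// Hypothetical circuit breaker: after `threshold` consecutive failures,
// reject calls immediately for `cooldownMs`, then let calls through again as probes.
class CircuitBreaker {
  constructor(threshold = 3, cooldownMs = 10000, now = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock for testing
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  async exec(fn) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      throw new Error('Circuit open: provider calls are paused');
    }
    try {
      const result = await fn();
      this.failures = 0;    // any success closes the circuit
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```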
5. Multi-provider fallback.
Have a second provider (Anthropic ↔ OpenAI) configured. On rate limit / outage, fail over.
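Failover can be as simple as trying providers in order. In this sketch, `callWithFallback` and the `{ name, call }` provider shape are hypothetical; each `call` is assumed to be an async wrapper you write around that provider's SDK or HTTP API.

```javascript
// Hypothetical failover helper: tries each provider in order and returns the
// first successful result, remembering why the earlier ones failed.
async function callWithFallback(providers, request) {
  const errors = [];
  for (const provider of providers) {
    try {
      const result = await provider.call(request);
      return { provider: provider.name, result };
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed (${errors.join('; ')})`);
}
```

Prompts and parameters often need per-provider adjustment, so in practice the wrappers tend to do some translation rather than forwarding the request unchanged.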
Large payloads
Input
LLMs have context limits (e.g., 200k tokens for Claude 3.5, 1M+ for Gemini), and per-request cost scales with input tokens.
Truncate / summarize conversation history.
Sliding window: keep the system prompt + the last N turns. For long conversations, summarize older turns:
const summary = await llm.summarize(messages.slice(0, -10));
const newMessages = [
  systemPrompt,
  { role: 'system', content: `Summary so far: ${summary}` },
  ...messages.slice(-10),
];
Chunk huge documents. For RAG over big corpora, split into chunks (500-2000 tokens), embed each, retrieve top-K relevant chunks, only send those.
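The splitting step might look like the sketch below. It assumes roughly 4 characters per token, which is a crude heuristic rather than a real tokenizer, and `chunkText` with its parameters is a hypothetical helper. Embedding and retrieval are left out.

```javascript
// Rough heuristic (an assumption, not a tokenizer): ~4 characters per token.
const CHARS_PER_TOKEN = 4;

// Splits `text` into chunks of about `maxTokens` tokens. Neighbouring chunks
// share `overlapTokens` of text, so content near a boundary appears in both.
function chunkText(text, maxTokens = 1000, overlapTokens = 100) {
  if (overlapTokens >= maxTokens) {
    throw new Error('overlapTokens must be smaller than maxTokens');
  }
  const size = maxTokens * CHARS_PER_TOKEN;
  const step = (maxTokens - overlapTokens) * CHARS_PER_TOKEN;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk reached the end
  }
  return chunks;
}
```

Splitting on character counts can cut sentences in half; splitting on paragraph or sentence boundaries first is a common refinement.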
Pre-filter with a cheaper model. Use a small/fast model to classify "is this relevant" → only escalate to the expensive model when needed.
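A two-stage call could be wired up roughly as follows. `answerWithPrefilter`, `cheapModel`, and `expensiveModel` are hypothetical names; both models are assumed to be async functions you supply that take a prompt string and return the reply text.

```javascript
// Hypothetical two-stage router: a cheap model decides relevance, and only
// relevant inputs are escalated to the expensive model.
async function answerWithPrefilter(question, document, { cheapModel, expensiveModel }) {
  const verdict = await cheapModel(
    'Answer only YES or NO. Is this document relevant to the question?\n' +
    `Question: ${question}\nDocument: ${document}`
  );
  if (!verdict.trim().toUpperCase().startsWith('YES')) {
    return { escalated: false, answer: null }; // skip the expensive call
  }
  const answer = await expensiveModel(
    'Using the document, answer the question.\n' +
    `Question: ${question}\nDocument: ${document}`
  );
  return { escalated: true, answer };
}
```

The trade-off is that a false NO from the cheap model silently drops a relevant document, so the classifier's error rate is worth measuring.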
Output
Stream the response. Use Server-Sent Events (SSE) to stream tokens as they're generated:
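// Server
res.writeHead(200, { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' });
const stream = await openai.chat.completions.create({ model, messages, stream: true });
for await (const chunk of stream) {
  res.write(`data: ${JSON.stringify(chunk.choices[0]?.delta?.content ?? '')}\n\n`);
}
res.write('event: done\ndata: {}\n\n'); // signal completion explicitly
res.end();

// Client
const es = new EventSource('/api/llm');
es.onmessage = e => append(JSON.parse(e.data));
es.addEventListener('done', () => es.close()); // otherwise EventSource reconnects and re-runs the request
Streaming makes the LLM feel responsive: the first tokens typically appear within a second or so, instead of the user waiting ~10s for the full reply.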
Cap output tokens. Set max_tokens so a runaway generation doesn't cost $5.
Paginate / continue. For very long outputs, generate in chunks with a "continue" prompt.
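A continuation loop might look like this sketch. `generateLong` and the `complete(messages)` function are hypothetical: `complete` is assumed to be your own wrapper returning `{ content, finishReason }`, where `'length'` means the reply was cut off by the output token cap (OpenAI's Chat Completions API reports this as `finish_reason: 'length'`).

```javascript
// Hypothetical continuation loop: keeps asking the model to continue while
// replies are being truncated by the output token cap, up to `maxRounds`.
async function generateLong(messages, complete, maxRounds = 5) {
  const history = [...messages];
  let output = '';
  for (let round = 0; round < maxRounds; round++) {
    const { content, finishReason } = await complete(history);
    output += content;
    if (finishReason !== 'length') break; // the model finished on its own
    // Feed the partial reply back and ask the model to pick up where it stopped.
    history.push({ role: 'assistant', content });
    history.push({ role: 'user', content: 'Continue exactly where you left off.' });
  }
  return output;
}
```

The `maxRounds` cap matters: without it, a model that never finishes would keep generating and billing indefinitely.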
Cache deterministic queries
If the same prompt + same parameters likely gets the same answer, cache:
const cacheKey = sha256(JSON.stringify({ model, prompt, temperature }));
if (cache.has(cacheKey)) return cache.get(cacheKey);
const res = await llm.call({ model, prompt, temperature }); // same parameters that went into the key
cache.set(cacheKey, res, { ttl: 3600 });
return res;
Expect a lower hit rate for high-temperature/creative requests and a higher one for fact lookups.
Batch where supported
OpenAI Batch API: submit many requests at once and get results within 24 hours, at a 50% discount. Great for non-realtime workloads.
Budget + monitoring
- Per-request token cap.
- Per-user daily budget.
- Per-tenant monthly budget.
- Cost dashboard: tokens × model × hour.
- Alert on spend anomalies.
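The budget checks above reduce to simple arithmetic on token counts. In the sketch below the prices are made-up placeholders, not real price sheets, and `estimateCostUsd`, `SpendTracker`, and the model names are hypothetical.

```javascript
// Placeholder prices in USD per million tokens. These numbers are invented
// for illustration; substitute your provider's current price sheet.
const PRICES = {
  'small-model': { input: 0.5, output: 1.5 },
  'large-model': { input: 5, output: 15 },
};

// Cost of one request: input and output tokens are priced separately.
function estimateCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  if (!price) throw new Error(`No price configured for ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Tracks spend per user per day and flags when the daily budget is exceeded.
class SpendTracker {
  constructor(dailyBudgetUsd) {
    this.dailyBudgetUsd = dailyBudgetUsd;
    this.spent = new Map(); // `${userId}:${day}` -> USD spent so far
  }

  record(userId, day, costUsd) {
    const key = `${userId}:${day}`;
    const total = (this.spent.get(key) ?? 0) + costUsd;
    this.spent.set(key, total);
    return { total, overBudget: total > this.dailyBudgetUsd };
  }
}
```

The same `record` calls can feed the cost dashboard and anomaly alerts if the totals are also emitted as metrics.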
Client-server architecture
Always:
Browser ──→ Your Server ──→ LLM Provider
Never:
Browser ──→ LLM Provider   ← API key leak
The server enforces rate limits, redacts inputs, validates outputs, caches, logs, and bills.
Pitfalls
- API key in browser.
- No rate limit → user can spam, drain budget.
- Synchronous mega-prompt (full conversation history) → context overflow, $$$.
- Not streaming → 10s blank screen.
- No retry on 429 → user sees error for normal load.
- No jitter → retry storm when limit resets.
- Caching at the wrong granularity — temperature in cache key matters.
- Treating LLM responses as trusted → render-as-HTML → XSS.
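For the last pitfall, the basic defence is to escape model output before it reaches any HTML context. A minimal sketch, with `escapeHtml` as a hypothetical helper name:

```javascript
// Escapes the five characters that are significant in HTML so that model
// output is displayed as text rather than interpreted as markup.
function escapeHtml(text) {
  const replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
  };
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}
```

In the browser, assigning the reply to `element.textContent` instead of `innerHTML` achieves the same thing without a helper. If the reply is rendered as Markdown, the generated HTML still needs sanitizing.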
Mental model
LLMs are a constrained, expensive, latency-sensitive resource. Treat the integration as you would any unreliable external service: queue, retry, fall back, cache, monitor cost. Stream to hide latency. Truncate/summarize to respect context. Budget per user to prevent runaway spend.
Follow-up questions
- How does the OpenAI Batch API change cost economics?
- How do you summarize conversation history without losing fidelity?
- When does caching make sense for LLM calls?
- What's the right backoff strategy for 429s?
Common mistakes
- API key shipped to browser.
- No retry / no backoff on 429: user sees errors at moderate load.
- No streaming: blank screen for 10s.
- No per-user budget: billing surprises.
- Sending full conversation history each call: context blows up.
- Treating LLM JSON as valid without schema validation.
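For the last mistake, validation can be as simple as parsing defensively and checking each field before use. The sketch below hand-validates one hypothetical reply shape (`sentiment` plus `confidence`); `parseSentimentReply` is an illustrative name, and a schema library would do the same job with less code.

```javascript
// Hypothetical validator for a reply expected to look like
// { "sentiment": "positive" | "negative" | "neutral", "confidence": 0..1 }.
function parseSentimentReply(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: 'Reply is not valid JSON' };
  }
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return { ok: false, error: 'Reply is not a JSON object' };
  }
  if (!['positive', 'negative', 'neutral'].includes(data.sentiment)) {
    return { ok: false, error: 'Invalid or missing "sentiment"' };
  }
  if (typeof data.confidence !== 'number' || data.confidence < 0 || data.confidence > 1) {
    return { ok: false, error: 'Invalid or missing "confidence"' };
  }
  // Return only the validated fields so unexpected extras are dropped.
  return { ok: true, value: { sentiment: data.sentiment, confidence: data.confidence } };
}
```

On a failed validation, a common follow-up is to retry the call once with the error message appended to the prompt.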
Performance considerations
- Streaming is usually the single biggest UX improvement: the wait before anything appears drops from the full generation time (often several seconds) to the time to first token.
- Caching deterministic queries can cut cost significantly for fact-lookup patterns.
- Per-user rate limits and quotas prevent abuse and keep the monthly bill predictable.
Edge cases
- Provider outages: have a fallback model/provider or graceful degradation.
- Streaming connection drop mid-response: client should resume or display partial.
- Tokenization differs per model, so token counts aren't directly comparable.
- Rate limits can be per-key or per-org, and both matter.
- Some endpoints have lower limits (e.g., embeddings vs chat).
Real-world examples
- ChatGPT, Claude, Gemini web UIs all stream by default.
- Perplexity / Phind use cheap pre-filter LLMs then expensive synthesizer.
- Most production LLM apps cache embeddings; some cache full responses for FAQ-style queries.