
How do you handle rate limiting and large payloads when consuming AI services?

Rate limits: queue requests with priority, respect Retry-After headers, retry with exponential backoff plus jitter, enforce per-user quotas, and add a circuit breaker. Large payloads: stream responses (SSE or HTTP chunked transfer), enforce input token budgets, chunk huge inputs, pre-filter with a cheaper model before refining with an expensive one, summarize or compress conversation history, return partial results, and page through long outputs. Always proxy through your own server (never ship API keys to the client), cache deterministic queries, and batch where the API supports it.


LLM APIs are expensive, slow, and rate-limited. Production usage needs explicit handling for both rate limits (theirs) and payload size (yours and theirs).

Rate limiting

Most LLM APIs (OpenAI, Anthropic, Google) enforce rate limits per organization or per key, measured in:

  • Requests per minute (RPM).
  • Tokens per minute (TPM) — usually the binding constraint.
  • Concurrent requests.

On hitting the limit, providers return 429 Too Many Requests, typically with a Retry-After header that says how long to wait.

Strategies

1. Server-side queue with priority.

User submits → request enters queue → worker pool sends to LLM, respecting RPM/TPM budgets. Interactive requests (user waiting) get higher priority than background (summarize stale chat).

ts
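// Bounded-concurrency queue; lower `priority` numbers run first (0 = interactive).
class LLMQueue<Req, Res> {
  private pending: { req: Req; priority: number; resolve: (r: Res) => void; reject: (e: unknown) => void }[] = [];
  private inflight = 0;
  private readonly call: (req: Req) => Promise<Res>;
  private readonly maxConcurrent: number;

  // `call` is whatever function actually sends the request to the provider.
  constructor(call: (req: Req) => Promise<Res>, maxConcurrent = 5) {
    this.call = call;
    this.maxConcurrent = maxConcurrent;
  }

  enqueue(req: Req, priority = 0): Promise<Res> {
    return new Promise<Res>((resolve, reject) => {
      this.pending.push({ req, priority, resolve, reject });
      // Array.prototype.sort is stable, so FIFO order is kept within a priority.
      this.pending.sort((a, b) => a.priority - b.priority);
      this.drain();
    });
  }

  // Start jobs until the concurrency cap is reached; never awaits inside the loop.
  private drain(): void {
    while (this.inflight < this.maxConcurrent && this.pending.length > 0) {
      const job = this.pending.shift()!;
      this.inflight++;
      this.call(job.req)
        .then(job.resolve, job.reject)
        .finally(() => {
          this.inflight--;
          this.drain();
        });
    }
  }
}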
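
The queue above caps concurrency only; it does not by itself keep traffic under an RPM or TPM budget. One common way to add that is a token bucket, sketched below. The class name `MinuteBucket` and its methods are hypothetical, and the clock is injectable purely to make the behaviour easy to test.

```typescript
// Hypothetical token bucket: `capacity` units refill evenly over one minute.
// Use one instance for requests (RPM) and another for tokens (TPM).
class MinuteBucket {
  private available: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly now: () => number;

  constructor(capacity: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.now = now;
    this.available = capacity; // start with a full budget
    this.lastRefill = now();
  }

  // Add back the budget accrued since the last call, capped at capacity.
  private refill(): void {
    const t = this.now();
    const accrued = ((t - this.lastRefill) / 60_000) * this.capacity;
    this.available = Math.min(this.capacity, this.available + accrued);
    this.lastRefill = t;
  }

  // Try to spend `cost` units; returns false (and spends nothing) if the budget is short.
  tryTake(cost: number): boolean {
    this.refill();
    if (cost > this.available) return false;
    this.available -= cost;
    return true;
  }

  // Milliseconds until `cost` units will be available (0 if they are available now).
  msUntil(cost: number): number {
    this.refill();
    const missing = Math.max(0, cost - this.available);
    return Math.ceil((missing / this.capacity) * 60_000);
  }
}
```

A worker would call `tryTake(1)` on the RPM bucket and `tryTake(estimatedTokens)` on the TPM bucket before sending, and sleep for `msUntil(...)` when either refuses.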

2. Respect Retry-After + exponential backoff.

ts
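const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function callWithRetry(req: RequestInit, retries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch('https://api.openai.com/v1/chat/completions', req);
    // Only 429 and 5xx are worth retrying; everything else goes back to the caller.
    if (res.status !== 429 && res.status < 500) return res;
    if (attempt === retries) throw new Error(`Failed after ${retries} retries (HTTP ${res.status})`);
    // Retry-After may be missing or an HTTP date; non-numeric values fall back to 0 here.
    const retryAfter = (parseInt(res.headers.get('Retry-After') ?? '', 10) || 0) * 1000;
    const backoff = Math.min(1000 * 2 ** attempt, 30000);
    // Wait at least as long as the server asked, and always add jitter.
    await sleep(Math.max(retryAfter, backoff) + Math.random() * 500);
  }
}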

Jitter spreads retries out so that clients do not all retry at the same instant when the limit resets (the thundering-herd problem).
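
HTTP allows Retry-After to carry either a number of seconds or an HTTP date, so a parser that only handles integers can silently ignore the server's request. A minimal sketch of handling both forms follows; the function name `parseRetryAfter` is made up for this example.

```typescript
// Returns the delay in milliseconds requested by a Retry-After header value,
// or null when the header is missing or cannot be parsed.
function parseRetryAfter(value: string | null, nowMs: number = Date.now()): number | null {
  if (value === null) return null;
  const trimmed = value.trim();
  // Form 1: a non-negative integer number of seconds, e.g. "20".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  // A date in the past means "retry now".
  return Math.max(0, dateMs - nowMs);
}
```

The caller can then wait for the larger of this value and its own backoff delay.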

3. Per-user quotas.

Track requests/tokens per user. Reject (with helpful UI) when over quota. Use Redis or a similar fast store.
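
As an illustration of the bookkeeping, here is a minimal per-user daily token quota. It is a sketch under assumptions: an in-memory `Map` stands in for Redis, the class name `DailyTokenQuota` is hypothetical, and the window is a fixed UTC day.

```typescript
// In-memory stand-in for Redis. A real deployment would use a shared store so
// that every server instance sees the same counters, and would expire old keys.
class DailyTokenQuota {
  private used = new Map<string, number>();
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // One counter per user per UTC day, e.g. "alice:2024-01-01".
  private key(userId: string, nowMs: number): string {
    return `${userId}:${new Date(nowMs).toISOString().slice(0, 10)}`;
  }

  // Records the spend and returns true, or returns false if it would exceed the limit.
  tryConsume(userId: string, tokens: number, nowMs: number = Date.now()): boolean {
    const k = this.key(userId, nowMs);
    const current = this.used.get(k) ?? 0;
    if (current + tokens > this.limit) return false;
    this.used.set(k, current + tokens);
    return true;
  }

  // Tokens the user can still spend today; useful for the "over quota" UI message.
  remaining(userId: string, nowMs: number = Date.now()): number {
    return this.limit - (this.used.get(this.key(userId, nowMs)) ?? 0);
  }
}
```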

4. Circuit breaker.

If the LLM provider has an outage, opening a circuit (stop trying for N seconds) prevents your service from amplifying the problem and queueing up requests that will time out.
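
A circuit breaker can be as small as a three-state machine. The sketch below is one possible shape, not a reference implementation: the class name, the consecutive-failure threshold, and the cooldown are all illustrative choices.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True if a call may be attempted right now.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: let trial calls through
    }
    return this.state !== "open";
  }

  // Any success closes the circuit and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // Open after `threshold` consecutive failures, or immediately if a trial call fails.
  recordFailure(): void {
    this.failures++;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

Callers check `canRequest()` before contacting the provider and fail fast (or fall back) when it returns false.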

5. Multi-provider fallback.

Have a second provider (Anthropic ↔ OpenAI) configured. On rate limit / outage, fail over.
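
Failover can be expressed as trying providers in order behind a common interface. In this sketch the `Provider` type and `completeWithFallback` are hypothetical names, and each provider is assumed to be wrapped so that it takes a prompt and returns text; in practice prompts and parameters often need per-provider adjustment.

```typescript
type Provider = { name: string; complete: (prompt: string) => Promise<string> };

// Tries each provider in order and returns the first successful answer.
async function completeWithFallback(
  providers: Provider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, text: await p.complete(prompt) };
    } catch (e) {
      // Remember why this provider failed, then move on to the next one.
      errors.push(`${p.name}: ${e instanceof Error ? e.message : String(e)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```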

Large payloads

Input

LLMs have context limits (e.g., 200k tokens for Claude 3.5, 1M+ for Gemini), and per-request cost scales with input tokens.

Truncate / summarize conversation history.

Sliding window: keep the system prompt + the last N turns. For long conversations, summarize older turns:

ts
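const KEEP_RECENT = 10; // number of most recent turns sent verbatim
const older = messages.slice(0, -KEEP_RECENT);
// Only pay for a summary when something is older than the window.
const summary = older.length > 0 ? await llm.summarize(older) : null;
const newMessages = [
  systemPrompt,
  ...(summary ? [{ role: 'system', content: `Summary so far: ${summary}` }] : []),
  ...messages.slice(-KEEP_RECENT),
];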

Chunk huge documents. For RAG over big corpora, split into chunks (500-2000 tokens), embed each, retrieve top-K relevant chunks, only send those.
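
The splitting step can be sketched as follows. This example counts whitespace-separated words as a rough proxy for tokens, which is an assumption: real tokenizers produce different counts, so a production version would measure with the target model's tokenizer. The function name `chunkByWords` and the overlap parameter are illustrative.

```typescript
// Splits text into chunks of at most `maxWords` words. `overlap` words are repeated
// between neighbouring chunks so that text cut at a boundary appears in both.
function chunkByWords(text: string, maxWords: number, overlap: number): string[] {
  if (maxWords <= 0 || overlap < 0 || overlap >= maxWords) {
    throw new Error("require maxWords > 0 and 0 <= overlap < maxWords");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = maxWords - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
    if (start + maxWords >= words.length) break; // this chunk reached the end
  }
  return chunks;
}
```

Each chunk is then embedded once and stored; at query time only the top-K most similar chunks are sent to the model.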

Pre-filter with a cheaper model. Use a small/fast model to classify "is this relevant" → only escalate to the expensive model when needed.
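
One way to wire up such a cascade is shown below. It is a sketch only: `Model` is a hypothetical wrapper type for "send a prompt, get text back", the YES/NO prompt wording is an arbitrary choice, and a real classifier would need evaluation before being trusted to drop passages.

```typescript
type Model = (prompt: string) => Promise<string>;

// Asks the cheap model which passages are relevant, then sends only those to the
// expensive model. Returns null when nothing is worth escalating.
async function filterThenRefine(
  question: string,
  passages: string[],
  cheap: Model,
  expensive: Model,
): Promise<string | null> {
  const relevant: string[] = [];
  for (const passage of passages) {
    const verdict = await cheap(
      `Is this passage relevant to the question? Answer YES or NO.\nQuestion: ${question}\nPassage: ${passage}`,
    );
    if (verdict.trim().toUpperCase().startsWith("YES")) relevant.push(passage);
  }
  if (relevant.length === 0) return null; // skip the expensive call entirely
  return expensive(`Question: ${question}\n\nContext:\n${relevant.join("\n---\n")}`);
}
```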

Output

Stream the response. Use Server-Sent Events (SSE) to stream tokens as they're generated:

ts
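// Server (Node-style response object)
res.writeHead(200, { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' });
const stream = await openai.chat.completions.create({ model, messages, stream: true });
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content;
  if (token) res.write(`data: ${JSON.stringify(token)}\n\n`); // skip empty deltas
}
res.write('event: done\ndata: {}\n\n'); // tell the client the stream is complete
res.end();

tsx
// Client
const es = new EventSource('/api/llm');
es.onmessage = e => append(JSON.parse(e.data));
// Close explicitly; otherwise EventSource reconnects and re-runs the request.
es.addEventListener('done', () => es.close());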

Streaming makes the LLM feel responsive: the first token can arrive in well under a second, instead of the user waiting ~10s for the full reply.

Cap output tokens. Set max_tokens (or the provider's equivalent parameter) so that a runaway generation cannot run up the bill.

Paginate / continue. For very long outputs, generate in chunks with a "continue" prompt.
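
The continue loop might look like the sketch below. `Generate` is a hypothetical wrapper around the provider call that reports whether the model stopped on its own (`"stop"`) or hit the output cap (`"length"`); the wording of the continue prompt and the `maxRounds` default are arbitrary.

```typescript
type Turn = { role: "user" | "assistant"; content: string };
type Generation = { text: string; finishReason: "stop" | "length" };
type Generate = (messages: Turn[]) => Promise<Generation>;

// Keeps asking for more until the model finishes or `maxRounds` is reached.
async function generateLong(prompt: string, generate: Generate, maxRounds = 5): Promise<string> {
  const messages: Turn[] = [{ role: "user", content: prompt }];
  let output = "";
  for (let round = 0; round < maxRounds; round++) {
    const { text, finishReason } = await generate(messages);
    output += text;
    if (finishReason === "stop") break; // the model finished on its own
    // Truncated by the token cap: feed the partial answer back and ask for more.
    messages.push({ role: "assistant", content: text });
    messages.push({ role: "user", content: "Continue exactly where you left off." });
  }
  return output;
}
```

The round cap matters: without it, a model that never emits a natural stop would loop (and bill) indefinitely.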

Cache deterministic queries

If the same prompt and parameters are likely to produce the same answer, cache the response:

ts
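// sha256() is assumed to be a small hashing helper (e.g. built on node:crypto).
// Every parameter that can change the output belongs in the key.
const cacheKey = sha256(JSON.stringify({ model, prompt, temperature }));
if (cache.has(cacheKey)) return cache.get(cacheKey);
const res = await llm.call(...);
cache.set(cacheKey, res, { ttl: 3600 }); // TTL unit depends on the cache library
return res;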

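Expect a lower hit rate for high-temperature or creative requests and a higher one for fact lookups. Even at temperature 0, providers may not return bit-identical output on every call, so treat a cached answer as representative rather than exact.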

Batch where supported

OpenAI Batch API: submit many requests at once and get results within 24 hours at a 50% discount. Great for non-realtime workloads.

Budget + monitoring

  • Per-request token cap.
  • Per-user daily budget.
  • Per-tenant monthly budget.
  • Cost dashboard: tokens × model × hour.
  • Alert on spend anomalies.
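
The arithmetic behind these budgets is simple enough to sketch. The prices below are made-up placeholders, not any provider's real rates, and the names `ModelPrice`, `estimateCostUsd`, and `withinBudget` are hypothetical.

```typescript
// Prices are expressed in USD per million tokens, the unit most price lists use.
type ModelPrice = { inputPerMTok: number; outputPerMTok: number };

function estimateCostUsd(inputTokens: number, outputTokens: number, price: ModelPrice): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
}

// True when the request still fits in what is left of the user's daily budget.
function withinBudget(spentUsd: number, requestCostUsd: number, dailyBudgetUsd: number): boolean {
  return spentUsd + requestCostUsd <= dailyBudgetUsd;
}

const examplePrice: ModelPrice = { inputPerMTok: 3, outputPerMTok: 15 }; // placeholder numbers
const cost = estimateCostUsd(2_000_000, 1_000_000, examplePrice); // (2M * 3 + 1M * 15) / 1M = 21
console.log(cost, withinBudget(80, cost, 100));
```

Logging this estimate per request, tagged with user, tenant, and model, gives the raw data for the dashboard and anomaly alerts.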

Client-server architecture

Always:

ts
Browser ──→ Your Server ──→ LLM Provider

Never:

ts
Browser ──→ LLM Provider     ← API key leak

The server enforces rate limits, redacts inputs, validates outputs, caches, logs, and bills.

Pitfalls

  • API key in browser.
  • No rate limit → user can spam, drain budget.
  • Synchronous mega-prompt (full conversation history) → context overflow, $$$.
  • Not streaming → 10s blank screen.
  • No retry on 429 → user sees error for normal load.
  • No jitter → retry storm when limit resets.
  • Caching at the wrong granularity — temperature in cache key matters.
  • Treating LLM responses as trusted → render-as-HTML → XSS.
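
For the last pitfall, the simplest defence is to escape model output before it is placed into HTML. The helper below is a minimal sketch with a hypothetical name; in a real UI, prefer the framework's built-in escaping (or `textContent`), and use a dedicated sanitizer if the output is rendered as Markdown or rich text.

```typescript
// Escapes the five characters that can break out of HTML text or attribute context.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```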

Mental model

LLMs are a constrained, expensive, high-latency resource. Treat the integration as you would any unreliable external service: queue, retry, fall back, cache, monitor cost. Stream to hide latency. Truncate/summarize to respect context. Budget per user to prevent runaway spend.

Follow-up questions

  • How does the OpenAI Batch API change cost economics?
  • How do you summarize conversation history without losing fidelity?
  • When does caching make sense for LLM calls?
  • What's the right backoff strategy for 429s?

Common mistakes

  • API key shipped to browser.
  • No retry / no backoff on 429 — user sees errors at moderate load.
  • No streaming — blank screen for 10s.
  • No per-user budget — billing surprises.
  • Sending full conversation history each call — context blows up.
  • Treating LLM JSON as valid without schema validation.
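
To illustrate the last point, here is a hand-written validator for one hypothetical response shape (`Sentiment`). It is a sketch; a schema library would do the same job with less code, but the principle is identical: parse, check every field, and reject anything unexpected.

```typescript
const LABELS = ["positive", "negative", "neutral"] as const;
type Label = (typeof LABELS)[number];
type Sentiment = { label: Label; confidence: number };

function isLabel(x: unknown): x is Label {
  return typeof x === "string" && (LABELS as readonly string[]).indexOf(x) !== -1;
}

// Returns a validated object, or null if the model's output is not usable.
function parseSentiment(raw: string): Sentiment | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all (e.g. the model added prose around it)
  }
  if (typeof data !== "object" || data === null) return null;
  const { label, confidence } = data as Record<string, unknown>;
  if (!isLabel(label)) return null;
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
  return { label, confidence };
}
```

On a `null` result the caller can retry with a corrective prompt or fall back to a default instead of crashing downstream code.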

Performance considerations

  • Streaming is usually the single biggest UX improvement: perceived latency drops from the full generation time to the time to first token.
  • Caching deterministic queries can cut cost significantly for fact-lookup patterns.
  • Per-user rate limits and quotas prevent abuse and keep the monthly bill predictable.

Edge cases

  • Provider outages: have a fallback model/provider or graceful degradation.
  • Streaming connection drop mid-response: client should resume or display partial.
  • Tokenization differs per model — token counts aren't directly comparable.
  • Rate limits can apply per key, per organization, or both; check which scope is the binding one for your account.
  • Some endpoints have lower limits (e.g., embeddings vs chat).

Real-world examples

  • ChatGPT, Claude, Gemini web UIs all stream by default.
  • Perplexity / Phind reportedly use cheap pre-filter LLMs followed by a more expensive synthesizer.
  • Most production LLM apps cache embeddings; some cache full responses for FAQ-style queries.

Senior engineer discussion

Senior engineers design LLM integrations with cost, latency, reliability, and safety as first-class concerns. They proxy, queue, stream, cache, budget, retry, and monitor. They understand rate-limit math (TPM, RPM, concurrency) and design batching/queuing to maximize throughput within the limit.
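
The rate-limit math can be made concrete with a small worked example. All numbers are illustrative rather than real provider limits, the function names are hypothetical, and `avgTokensPerRequest` is assumed to include both input and output tokens (providers differ in exactly what counts toward TPM).

```typescript
// Sustainable request rate: whichever budget runs out first is the binding constraint.
function sustainableRpm(rpmLimit: number, tpmLimit: number, avgTokensPerRequest: number): number {
  return Math.floor(Math.min(rpmLimit, tpmLimit / avgTokensPerRequest));
}

// Little's law: requests in flight = arrival rate x average latency.
function concurrencyNeeded(requestsPerMinute: number, avgLatencySeconds: number): number {
  return Math.ceil((requestsPerMinute * avgLatencySeconds) / 60);
}

const rpm = sustainableRpm(500, 200_000, 2_000); // TPM binds: 200000 / 2000 = 100
const workers = concurrencyNeeded(rpm, 6); // 100 * 6 / 60 = 10
console.log({ rpm, workers });
```

With these inputs, running more than about 10 concurrent workers would only produce 429s, while running fewer would leave budget unused.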
