How do you handle rate limiting and large payloads when consuming AI services?

Rate limits: queue with priority, respect Retry-After headers, exponential backoff + jitter, per-user quotas, circuit breaker. Large payloads: stream responses (SSE/HTTP chunked), enforce input token budgets, chunk huge inputs, use cheaper models for prefilter then expensive for refine, summarize/compress conversation history, return partial results, page through outputs. Always proxy via your server (never API keys client-side), cache deterministic queries, batch where API supports it.

LLM APIs are expensive, slow, and rate-limited. Production usage needs explicit handling for both rate limits (theirs) and payload size (yours and theirs).

Rate limiting

Most LLM APIs (OpenAI, Anthropic, Google) have per-org or per-key rate limits in:

Requests per minute (RPM).
Tokens per minute (TPM) — usually the binding constraint.
Concurrent requests.

On hitting the limit, providers return 429 Too Many Requests, typically with a Retry-After header (a delay in seconds, or sometimes an HTTP date).
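Rather than only reacting to 429s, the server can track its own RPM and TPM budgets and hold requests back before the provider rejects them. Below is a minimal sketch of a token bucket, one instance per limit. The names `TokenBucket` and `trySend` are made up for illustration, and the token estimate per request is assumed to come from elsewhere.

```javascript
// Token bucket: the budget refills continuously at `perMinute` units per minute.
class TokenBucket {
  constructor(perMinute, now = Date.now()) {
    this.capacity = perMinute;
    this.available = perMinute;
    this.last = now;
  }
  // Top up the budget for the time elapsed since the last check.
  refill(now) {
    const elapsedMs = now - this.last;
    this.available = Math.min(this.capacity, this.available + (elapsedMs * this.capacity) / 60000);
    this.last = now;
  }
  has(cost, now) {
    this.refill(now);
    return cost <= this.available;
  }
  take(cost) {
    this.available -= cost;
  }
}

// A request goes out only if BOTH the request budget and the token budget have room.
function trySend(rpmBucket, tpmBucket, estimatedTokens, now = Date.now()) {
  if (!rpmBucket.has(1, now) || !tpmBucket.has(estimatedTokens, now)) return false;
  rpmBucket.take(1);
  tpmBucket.take(estimatedTokens);
  return true;
}
```

When `trySend` returns false, the request stays in the queue and is retried on the next drain instead of being sent and bounced.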

Strategies

1. Server-side queue with priority.

User submits → request enters queue → worker pool sends to LLM, respecting RPM/TPM budgets. Interactive requests (user waiting) get higher priority than background (summarize stale chat).

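class LLMQueue {
  pending = [];
  inflight = 0;
  maxConcurrent = 5;
  // `call` is the function that actually sends a request to the provider.
  constructor(call) { this.call = call; }
  // Lower `priority` runs first (0 = interactive, 1 = background).
  enqueue(req, priority = 0) {
    return new Promise((resolve, reject) => {
      this.pending.push({ req, priority, resolve, reject });
      // Array sort is stable, so requests stay FIFO within a priority.
      this.pending.sort((a, b) => a.priority - b.priority);
      this.drain();
    });
  }
  // Start jobs until the concurrency cap is reached; never awaits inside the loop.
  drain() {
    while (this.inflight < this.maxConcurrent && this.pending.length) {
      const { req, resolve, reject } = this.pending.shift();
      this.inflight++;
      Promise.resolve()
        .then(() => this.call(req))
        .then(resolve, reject)
        .finally(() => { this.inflight--; this.drain(); });
    }
  }
}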

2. Respect Retry-After + exponential backoff.

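const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry(req, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const res = await fetch('https://api.openai.com/v1/chat/completions', req);
    // Retry only on rate limiting (429) and server errors (5xx).
    if (res.status !== 429 && res.status < 500) return res;
    if (attempt === retries) throw new Error(`Failed after ${retries} retries (HTTP ${res.status})`);
    // Retry-After is read as seconds; an HTTP-date value parses to NaN and falls back to 0.
    const retryAfter = (parseInt(res.headers.get('Retry-After') ?? '0', 10) || 0) * 1000;
    const backoff = Math.min(1000 * 2 ** attempt, 30000);
    // Wait at least as long as the server asked, and always add jitter.
    await sleep(Math.max(retryAfter, backoff) + Math.random() * 500);
  }
}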

Jitter spreads retries out so that clients do not all hit the API at the same instant when the limit resets (the thundering herd problem).

3. Per-user quotas.

Track requests/tokens per user. Reject (with helpful UI) when over quota. Use Redis or a similar fast store.
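A minimal sketch of a per-user quota, assuming a fixed time window. An in-memory `Map` stands in for Redis here, and the class name `UserQuota` and its return shape are illustrative, not a library API.

```javascript
// Fixed-window per-user token quota. A Map stands in for Redis;
// in production the counter would live in a shared store with a TTL.
class UserQuota {
  constructor(limitPerWindow, windowMs) {
    this.limit = limitPerWindow;
    this.windowMs = windowMs;
    this.usage = new Map(); // userId -> { windowStart, used }
  }
  // Records `tokens` against the user if they fit in the current window.
  consume(userId, tokens, now = Date.now()) {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    let entry = this.usage.get(userId);
    if (!entry || entry.windowStart !== windowStart) {
      entry = { windowStart, used: 0 }; // new window: start counting from zero
      this.usage.set(userId, entry);
    }
    const resetAt = windowStart + this.windowMs;
    if (entry.used + tokens > this.limit) {
      return { allowed: false, remaining: this.limit - entry.used, resetAt };
    }
    entry.used += tokens;
    return { allowed: true, remaining: this.limit - entry.used, resetAt };
  }
}
```

`remaining` and `resetAt` are what make the rejection helpful: the UI can show how much budget is left and when it comes back.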

4. Circuit breaker.

If the LLM provider has an outage, opening a circuit (stop trying for N seconds) prevents your service from amplifying the problem and queueing up requests that will time out.
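One way to sketch this, assuming a simple consecutive-failure threshold and a fixed cooldown. `CircuitBreaker`, `failureThreshold` and `cooldownMs` are illustrative names; the injectable clock only exists to make the behaviour easy to test.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }
  async exec(fn, now = () => Date.now()) {
    if (this.openedAt !== null && now() - this.openedAt < this.cooldownMs) {
      // Open: fail fast without touching the provider.
      throw new Error('circuit open: failing fast');
    }
    // Closed, or cooldown elapsed (half-open): let the call through.
    // This sketch does not limit half-open probes to a single call.
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.failureThreshold) this.openedAt = now();
      throw err;
    }
  }
}
```

Wrapping the provider call as `breaker.exec(() => callWithRetry(req))` means a failed probe after the cooldown reopens the circuit immediately, while a successful one closes it.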

5. Multi-provider fallback.

Have a second provider (Anthropic ↔ OpenAI) configured. On rate limit / outage, fail over.
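A rough sketch of the failover loop. It assumes each provider is wrapped in an object with a `name` and an async `call`, and that errors carry an HTTP `status` when one exists; both are conventions invented for this example.

```javascript
// Try providers in order; move on only for errors worth failing over on.
async function callWithFallback(providers, request) {
  const errors = [];
  for (const provider of providers) {
    try {
      return { provider: provider.name, result: await provider.call(request) };
    } catch (err) {
      // No status (network failure), 429 and 5xx are treated as provider-side problems.
      const retriable = err.status === undefined || err.status === 429 || err.status >= 500;
      if (!retriable) throw err; // e.g. a 400 is most likely our bug, so surface it
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Prompts and output formats usually need per-provider adaptation, so the wrappers behind `call` tend to carry more logic than this loop does.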

Large payloads

Input

LLMs have context limits (e.g., 200k tokens for Claude 3.5, 1M+ for Gemini), and per-request cost scales with input tokens.

Truncate / summarize conversation history.

Sliding window: keep the system prompt + the last N turns. For long conversations, summarize older turns:

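const KEEP = 10; // recent turns kept verbatim
const older = messages.slice(0, -KEEP);
// Only summarize when there is something older than the window.
const summary = older.length ? await llm.summarize(older) : null;
const newMessages = [
  systemPrompt,
  ...(summary ? [{ role: 'system', content: `Summary so far: ${summary}` }] : []),
  ...messages.slice(-KEEP),
];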

Chunk huge documents. For RAG over big corpora, split into chunks (500-2000 tokens), embed each, retrieve top-K relevant chunks, only send those.
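The splitting step might look like the sketch below. It assumes roughly 4 characters per token, which is a crude heuristic for English text rather than a real tokenizer, and it breaks on blank lines between paragraphs. `estimateTokens` and `chunkText` are illustrative names.

```javascript
// Rough heuristic: about 4 characters per token (an approximation, not a tokenizer).
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Group paragraphs into chunks whose estimated size stays within `maxTokens`.
function chunkText(text, maxTokens = 1000) {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks = [];
  let current = '';
  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (estimateTokens(candidate) <= maxTokens) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    current = para; // a single oversized paragraph is kept whole in this sketch
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk is then embedded and stored; at query time only the top-K chunks by similarity go into the prompt. For accurate budgets, swap the heuristic for the tokenizer of the model actually in use.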

Pre-filter with a cheaper model. Use a small/fast model to classify "is this relevant" → only escalate to the expensive model when needed.
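As a sketch, the cascade could be wired like this. `cheapModel` and `expensiveModel` are hypothetical async functions that take a prompt string and return the model's text; the prompts and the yes/no parsing are placeholders.

```javascript
// Ask the cheap model to filter documents; call the expensive model only if something survives.
async function answerWithCascade(question, documents, { cheapModel, expensiveModel }) {
  const verdicts = await Promise.all(
    documents.map((doc) =>
      cheapModel(`Is this relevant to "${question}"? Answer yes or no.\n\n${doc}`)),
  );
  const relevant = documents.filter((_, i) => /^\s*yes/i.test(verdicts[i]));
  if (relevant.length === 0) return { answer: null, escalated: false };
  const answer = await expensiveModel(
    `Question: ${question}\n\nContext:\n${relevant.join('\n---\n')}`,
  );
  return { answer, escalated: true };
}
```

The saving comes from the expensive model seeing only the surviving documents, or not being called at all.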

Output

Stream the response. Use Server-Sent Events (SSE) to stream tokens as they're generated:

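// Server
res.writeHead(200, { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' });
const stream = await openai.chat.completions.create({ model, messages, stream: true });
for await (const chunk of stream) {
  const text = chunk.choices[0]?.delta?.content;
  if (text) res.write(`data: ${JSON.stringify(text)}\n\n`);
}
// Tell the client the reply is complete, then close the connection.
res.write('event: done\ndata: {}\n\n');
res.end();

// Client
const es = new EventSource('/api/llm');
es.onmessage = e => append(JSON.parse(e.data));
// Close explicitly: otherwise EventSource reconnects and re-runs the request.
es.addEventListener('done', () => es.close());
es.onerror = () => es.close();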

Streaming makes the LLM feel responsive: the first token can arrive in well under a second, instead of the user waiting 10s or so for the full reply.

Cap output tokens. Set max_tokens so a runaway generation doesn't cost $5.

Paginate / continue. For very long outputs, generate in chunks with a "continue" prompt.
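A sketch of the continue loop. It assumes a hypothetical `generate(messages)` wrapper that returns `{ text, finishReason }`, where `finishReason === 'length'` means the reply was cut off by the output cap (the value the OpenAI chat API uses for `finish_reason`).

```javascript
// Keep asking the model to continue until it stops on its own or we hit maxRounds.
async function generateLong(generate, prompt, maxRounds = 5) {
  const messages = [{ role: 'user', content: prompt }];
  let output = '';
  for (let round = 0; round < maxRounds; round++) {
    const { text, finishReason } = await generate(messages);
    output += text;
    if (finishReason !== 'length') break; // finished naturally
    // Feed the partial reply back so the model can pick up where it stopped.
    messages.push({ role: 'assistant', content: text });
    messages.push({ role: 'user', content: 'Continue exactly where you left off.' });
  }
  return output;
}
```

`maxRounds` is the safety net: without it, a model that never finishes would keep billing.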

Cache deterministic queries

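If the same prompt with the same parameters is likely to produce the same answer, cache it:

// The key must cover everything that changes the output
// (add the system prompt, max_tokens, etc. if they vary).
const cacheKey = sha256(JSON.stringify({ model, prompt, temperature }));
if (cache.has(cacheKey)) return cache.get(cacheKey);
const res = await llm.call(...);
cache.set(cacheKey, res, { ttl: 3600 });
return res;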

Lower hit rate for high-temperature/creative requests; higher for fact lookups.

Batch where supported

OpenAI Batch API: submit many requests at once, get results within 24 hours, at a 50% discount. Great for non-realtime workloads.
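Batch input is uploaded as a JSONL file, one request per line. The sketch below builds such a file from a list of prompts; the field names (`custom_id`, `method`, `url`, `body`) follow the Batch API input format as I understand it, so check the current docs before relying on them. `toBatchJsonl` is an illustrative helper, not part of any SDK.

```javascript
// Build one JSONL line per prompt; `custom_id` is how results are matched back to inputs.
function toBatchJsonl(prompts, model) {
  return prompts
    .map((prompt, i) => JSON.stringify({
      custom_id: `req-${i}`,
      method: 'POST',
      url: '/v1/chat/completions',
      body: { model, messages: [{ role: 'user', content: prompt }] },
    }))
    .join('\n');
}
```

Results come back in a separate output file and are not guaranteed to be in input order, which is why each line carries its own `custom_id`.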

Budget + monitoring

Per-request token cap.
Per-user daily budget.
Per-tenant monthly budget.
Cost dashboard: tokens × model × hour.
Alert on spend anomalies.
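The budget items above come down to simple arithmetic on token counts. A minimal sketch follows; the model names and per-million-token prices are placeholders, not real price-list values, and `DailyBudget` keeps state in memory only.

```javascript
// Prices are USD per million tokens and purely illustrative.
const PRICES = {
  'small-model': { input: 0.5, output: 1.5 },
  'large-model': { input: 5, output: 15 },
};

function requestCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  if (!price) throw new Error(`no price configured for ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}

// Accumulates spend per user per day and flags when the daily limit is exceeded.
class DailyBudget {
  constructor(limitUsd) {
    this.limitUsd = limitUsd;
    this.spent = new Map(); // "userId:YYYY-MM-DD" -> USD
  }
  record(userId, day, costUsd) {
    const key = `${userId}:${day}`;
    const total = (this.spent.get(key) ?? 0) + costUsd;
    this.spent.set(key, total);
    return { total, overBudget: total > this.limitUsd };
  }
}
```

For example, 2,000 input tokens and 500 output tokens on the placeholder `large-model` cost (2000 × 5 + 500 × 15) / 1,000,000 = $0.0175. The same records feed the cost dashboard and the anomaly alerts.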

Client-server architecture

Always:

Browser ──→ Your Server ──→ LLM Provider

Never:

Browser ──→ LLM Provider     ← API key leak

The server enforces rate limits, redacts inputs, validates outputs, caches, logs, and bills.

Pitfalls

API key in browser.
No rate limit → user can spam, drain budget.
Synchronous mega-prompt (full conversation history) → context overflow, $$$.
Not streaming → 10s blank screen.
No retry on 429 → user sees error for normal load.
No jitter → retry storm when limit resets.
Caching at the wrong granularity — temperature in cache key matters.
Treating LLM responses as trusted → render-as-HTML → XSS.
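For the last pitfall, the simplest defence is to treat model output as plain text. In the browser, assigning to `textContent` instead of `innerHTML` is enough; when HTML must be built as a string, escape first. A minimal sketch (`escapeHtml` is an illustrative helper):

```javascript
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

// Escape the characters that let text break out into markup or attributes.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

If the output is rendered as Markdown, run the generated HTML through a sanitizer as well, since escaping alone does not cover that path.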

Mental model

LLMs are a constrained, expensive, latency-sensitive resource. Treat the integration as you would any unreliable external service: queue, retry, fall back, cache, monitor cost. Stream to hide latency. Truncate/summarize to respect context. Budget per user to prevent runaway spend.

Follow-up questions

•How does the OpenAI Batch API change cost economics?
•How do you summarize conversation history without losing fidelity?
•When does caching make sense for LLM calls?
•What's the right backoff strategy for 429s?

Common mistakes

•API key shipped to browser.
•No retry / no backoff on 429 — user sees errors at moderate load.
•No streaming — blank screen for 10s.
•No per-user budget — billing surprises.
•Sending full conversation history each call — context blows up.
•Treating LLM JSON as valid without schema validation.
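On the last point, a hand-rolled check is shown below as a sketch; in practice a schema library would usually do this job. `parseModelJson` and the tiny `{ field: type }` schema format are invented for this example.

```javascript
// Parse model output and verify that the expected fields exist with the expected types.
function parseModelJson(raw, schema) {
  let value;
  try {
    value = JSON.parse(raw);
  } catch {
    return { ok: false, error: 'not valid JSON' };
  }
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    return { ok: false, error: 'expected a JSON object' };
  }
  for (const [field, type] of Object.entries(schema)) {
    const actual = Array.isArray(value[field]) ? 'array' : typeof value[field];
    if (actual !== type) {
      return { ok: false, error: `field "${field}" should be ${type}, got ${actual}` };
    }
  }
  return { ok: true, value };
}
```

On failure, a common move is to retry once with the error message appended to the prompt, then give up and show a fallback.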

Performance considerations

•Streaming is usually the single biggest UX improvement: perceived latency drops from the full generation time to the time to first token.
•Caching deterministic queries can cut cost significantly for fact-lookup patterns.
•Per-user rate limits and quotas prevent abuse and keep the monthly bill predictable.

Edge cases

•Provider outages: have a fallback model/provider or graceful degradation.
•Streaming connection drop mid-response: client should resume or display partial.
•Tokenization differs per model — token counts aren't directly comparable.
•Rate limits are per-key OR per-org — both matter.
•Some endpoints have lower limits (e.g., embeddings vs chat).

Real-world examples

•ChatGPT, Claude, Gemini web UIs all stream by default.
•Perplexity / Phind use cheap pre-filter LLMs then expensive synthesizer.
•Most production LLM apps cache embeddings; some cache full responses for FAQ-style queries.

Senior engineer discussion

Seniors design LLM integration with cost + latency + reliability + safety as first-class concerns. They proxy, queue, stream, cache, budget, retry, and monitor. They understand rate-limit math (TPM, RPM, concurrency) and design batching/queuing to maximize throughput within the limit.
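The rate-limit math is short enough to write down. In this sketch, sustainable throughput is whichever of RPM and TPM binds first, and the worker-pool size follows from Little's law (concurrency ≈ arrival rate × latency). The limits in the example are made-up numbers, not any provider's actual tier.

```javascript
// Requests per minute the account can sustain: whichever limit binds first.
function maxRequestsPerMinute({ rpm, tpm, avgTokensPerRequest }) {
  return Math.min(rpm, Math.floor(tpm / avgTokensPerRequest));
}

// Little's law: in-flight requests = arrival rate (per second) * average latency (seconds).
function concurrencyNeeded(requestsPerMinute, avgLatencySeconds) {
  return Math.ceil((requestsPerMinute * avgLatencySeconds) / 60);
}

const throughput = maxRequestsPerMinute({ rpm: 500, tpm: 200000, avgTokensPerRequest: 2000 });
const workers = concurrencyNeeded(throughput, 6);
console.log(throughput, workers); // 100 10
```

With 2,000 tokens per request, TPM caps throughput at 100 requests per minute even though RPM would allow 500. At 6 seconds per call, about 10 concurrent workers are enough to use that budget fully; more would only produce 429s.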