Back to System Design
System Design
medium
mid

How do you integrate an AI API such as OpenAI or Claude into a frontend application?

Never call the AI API directly from the browser — proxy through your own backend so the API key stays secret. The backend handles auth, rate limiting, prompt construction, and streaming; the frontend streams the response, renders incrementally, and handles loading/errors/cancellation.

7 min read·~15 min to think through

The single most important rule: never call the AI provider directly from the browser. Everything else follows from that.

1. Architecture — proxy through your backend

ts
Browser → your backend (BFF) → OpenAI/Claude API
  • The API key must stay server-side. Putting it in frontend code or env-exposed-to-client = anyone can read it from the bundle/network and run up your bill. This is non-negotiable.
  • Your backend endpoint also lets you: authenticate your users, enforce your rate limits and quotas, construct/validate prompts, sanitize inputs, log/monitor, cache, and swap providers without a frontend change.

2. Streaming — essential for UX

AI responses are slow (seconds) and generated token-by-token. Don't make the user stare at a spinner.

  • The provider supports streaming (SSE / chunked responses). Your backend proxies the stream through to the client.
  • Frontend consumes it — ReadableStream from fetch, an EventSource, or the provider's SDK streaming helpers — and renders tokens incrementally as they arrive.
  • Show a typing/cursor indicator; let the UI update progressively.

3. Frontend state & UX

  • States: idle, loading/streaming, success, error — and partial (mid-stream) content.
  • Cancellation — an AbortController so the user can stop a long generation (and you stop paying for it).
  • Optimistic display of the user's message; append the assistant's streamed response.
  • Markdown rendering — model output is usually markdown; render it safely (sanitize — model output is untrusted; treat it like user content for XSS).
  • Conversation state — message history managed client-side and sent with each request (or referenced by a server-side thread id).

4. Errors, limits, cost

  • Handle rate limits (429) and provider errors gracefully — backoff, retry where safe, clear messaging. (See AI-specific rate-limit handling.)
  • Timeouts for stuck generations.
  • Cost control — token limits, max output length, debounce, prevent spam; enforce per-user quotas on your backend.
  • Latency — set expectations in the UI; streaming makes it feel faster.

5. AI-specific concerns

  • Hallucinations — don't present output as authoritative; cite sources where possible; let users verify/edit.
  • Safety — moderate inputs/outputs if user-facing.
  • Prompt injection — treat user input as untrusted in prompt construction; don't let it override system instructions.
  • Nondeterminism — same input, different output; design UI and tests around that.

How to answer

"The key decision: proxy through my own backend, never call the provider from the browser — the API key stays secret, and the backend owns auth, rate limiting, prompt construction, and provider abstraction. I'd stream the response (SSE/chunked) and render tokens incrementally since AI latency is high, with cancellation via AbortController, loading/error/partial states, and safe markdown rendering of the (untrusted) model output. Plus AI-specific handling: 429s, cost/token limits, prompt-injection safety, and not presenting hallucination-prone output as authoritative."

Follow-up questions

  • Why can't you call the AI API directly from the browser?
  • How do you implement streaming responses end to end?
  • Why must you sanitize AI-generated markdown output?
  • How do you control cost when integrating an AI API?

Common mistakes

  • Calling the provider directly from the frontend, exposing the API key.
  • Not streaming — making users wait on a spinner for seconds.
  • Rendering model output as raw HTML without sanitization.
  • No cancellation, so users can't stop (and stop paying for) a generation.
  • Ignoring rate limits, cost controls, and prompt-injection risk.

Performance considerations

  • Streaming dramatically improves perceived performance — first token in ~1s vs whole response in ~10s. The backend proxy adds a hop but enables caching and abstraction. Token/length limits control both latency and cost.

Edge cases

  • Stream interrupted mid-response (partial content + retry).
  • Provider rate limit or outage.
  • Very long generations needing timeouts/cancellation.
  • Prompt injection via user input.
  • Nondeterministic output complicating testing.

Real-world examples

  • A chat feature: backend BFF proxying Claude/OpenAI with streaming, frontend rendering tokens into sanitized markdown with a stop button.

Senior engineer discussion

Seniors lead with the security boundary — proxy through a backend, key stays server-side — and treat that backend as the place for auth, rate limiting, prompt construction, and provider abstraction. They make streaming + incremental rendering + cancellation core, sanitize untrusted model output, and raise AI-specific concerns: cost, 429s, prompt injection, hallucinations, nondeterminism.

Related questions