How do you stream AI responses to the UI in real time?

The API returns a streaming response (SSE or chunked fetch). Read the ReadableStream from response.body, decode chunks, parse the token deltas, and append to state as they arrive. Handle partial chunks, abort/cancel, errors mid-stream, auto-scroll, and a 'stop generating' control.

Streaming AI output is about consuming a response that arrives incrementally and rendering it as it comes — instead of waiting for the whole thing.

The transport

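LLM APIs stream via Server-Sent Events (SSE) or chunked HTTP: the response body arrives as a sequence of small chunks that together carry the token deltas. A chunk is not guaranteed to hold exactly one event. Because the request is a POST with a JSON body, the browser's EventSource API (GET only) doesn't fit, so on the client you read the stream with fetch and the ReadableStream API: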

js
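const controller = new AbortController();
const res = await fetch("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages }),
  signal: controller.signal,           // for cancellation
});
if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } keeps an incomplete multi-byte character for the next call
  buffer += decoder.decode(value, { stream: true });

  // SSE: events end with a blank line; each payload line starts with "data:".
  // This loop assumes one single-line JSON payload per event.
  const lines = buffer.split("\n");
  buffer = lines.pop();                // keep the last partial line
  for (const rawLine of lines) {
    const line = rawLine.replace(/\r$/, "");   // tolerate CRLF line endings
    if (!line.startsWith("data:")) continue;   // skips blank lines and comments
    const data = line.slice(5).trimStart();
    if (data === "[DONE]") { finished = true; break; }  // end-of-stream sentinel
    const delta = JSON.parse(data).choices?.[0]?.delta?.content ?? "";
    setMessage((m) => m + delta);       // append the token to state
  }
}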
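
The `signal` passed to fetch above is what a stop control hooks into. The following is a minimal sketch under stated assumptions, not a definitive implementation: `createGeneration` and `startStream` are hypothetical names, and `startStream(signal)` stands in for the fetch-and-read loop, assumed to reject with an error named "AbortError" when the signal fires (which is what fetch does when aborted without a custom reason).

```javascript
// Hypothetical helper: runs one generation and reports how it ended.
// `startStream(signal)` stands in for the fetch-and-read loop.
function createGeneration(startStream) {
  const controller = new AbortController();
  const finished = startStream(controller.signal).then(
    () => "complete",
    // A user-initiated stop is a normal outcome, not a failure.
    (err) => (err && err.name === "AbortError" ? "stopped" : "error")
  );
  return {
    stop: () => controller.abort(),   // wire this to the "Stop generating" button
    finished,                         // resolves to "complete" | "stopped" | "error"
  };
}
```

Calling `stop()` from a component's unmount or cleanup handler would also cover the case where the user navigates away while a response is still streaming.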

The details interviewers grade

  • Partial chunks: a network chunk does not align to event or token boundaries. Buffer the decoded text, parse only complete lines/events, and carry the leftover partial into the next read. Forgetting this makes JSON.parse throw on a truncated payload or drops tokens.
  • Streaming decode: call decoder.decode(value, { stream: true }) so multi-byte UTF-8 characters split across chunks decode correctly. The option belongs to decode(), not to the TextDecoder constructor.
  • Cancellation: wire an AbortController to a "Stop generating" button and abort the fetch to halt the stream. The abort surfaces as an AbortError rejection; treat it as a normal stop, not a failure.
  • Errors mid-stream: the stream can fail after it started. Catch, show what arrived plus an error, allow retry.
  • Rendering performance: appending on every token causes a render per token. For fast streams, batch updates (e.g. requestAnimationFrame or a small buffer) so you don't thrash. Memoize already-rendered markdown.
  • UX: auto-scroll to follow the output (but stop if the user scrolls up), a typing cursor, disable input while streaming.
  • Markdown: render incrementally; handle incomplete markdown/code fences gracefully.
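
The partial-chunk point in the list above can be isolated into a small incremental parser. This is a sketch under assumptions: `createSSEParser` is a hypothetical name, the input is already-decoded text, lines end in LF or CRLF (a bare CR terminator is not handled), and only `data:` fields are of interest.

```javascript
// Hypothetical incremental SSE parser: feed() accepts decoded text in
// arbitrary slices and calls onData once per complete event.
function createSSEParser(onData) {
  let buffer = "";
  return {
    feed(text) {
      buffer += text;
      // Normalise CRLF on the accumulated buffer so a "\r" and "\n" that
      // arrive in different chunks are still joined correctly.
      buffer = buffer.replace(/\r\n/g, "\n");
      let end;
      // An event is complete only once its terminating blank line has arrived.
      while ((end = buffer.indexOf("\n\n")) !== -1) {
        const rawEvent = buffer.slice(0, end);
        buffer = buffer.slice(end + 2);   // the rest waits for the next feed()
        const data = rawEvent
          .split("\n")
          .filter((line) => line.startsWith("data:"))
          .map((line) => line.slice(5).replace(/^ /, ""))
          .join("\n");
        if (data) onData(data);
      }
    },
  };
}
```

In the read loop, each decoded chunk would go to `parser.feed(...)`, and the callback would JSON-parse the payload and append its delta. Keeping the parser free of fetch and UI code makes the chunk-boundary behaviour easy to test by feeding it deliberately awkward slices.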

Why stream at all

Time-to-first-token is far shorter than time-to-full-response — streaming makes the app feel responsive and lets users read as it generates and stop early. It's a perceived-performance win.

The framing

"The API sends a streaming response — SSE or chunked — so on the client I read response.body as a ReadableStream, decode each chunk with a streaming TextDecoder, and append token deltas to state as they arrive. The non-obvious parts: network chunks don't align to token boundaries, so I buffer and only parse complete events; I wire an AbortController to a stop button; I handle mid-stream errors; and I batch renders so a fast token stream doesn't cause a render per token. Plus UX — auto-scroll, typing cursor, disabled input while generating."

Follow-up questions

  • Why do you need to buffer partial chunks?
  • How do you implement a 'stop generating' button?
  • How do you avoid a re-render on every single token?
  • What happens if the stream errors halfway through?
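
For the last follow-up, one way to keep what already arrived when the stream fails is to make the consumer return a result instead of throwing. A minimal sketch, assuming the deltas are exposed as an async iterable; `collectStream` and `onUpdate` are hypothetical names.

```javascript
// Hypothetical helper: drains an async iterable of text deltas and never
// throws; the caller gets whatever arrived plus the error, if any.
async function collectStream(deltas, onUpdate = () => {}) {
  let text = "";
  try {
    for await (const delta of deltas) {
      text += delta;
      onUpdate(text);               // e.g. push the partial text into UI state
    }
    return { text, error: null };
  } catch (error) {
    // The partial text is preserved so the UI can show it next to a retry option.
    return { text, error };
  }
}
```

The caller can then render `text` in both outcomes and show an error banner with a retry action only when `error` is set.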

Common mistakes

  • Assuming each network chunk is a complete token or event.
  • Calling decoder.decode(value) without { stream: true }, which breaks multi-byte characters split across chunks.
  • Re-rendering on every token with no batching — UI jank.
  • No cancellation, so users can't stop a long generation.
  • Not handling errors that occur after the stream started.
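
The decoding mistake is easy to reproduce without a network. The sketch below splits the UTF-8 bytes of a short string in the middle of a two-byte character and decodes the halves both ways; the sample string and variable names are illustrative only.

```javascript
const bytes = new TextEncoder().encode("héllo");   // "é" is two bytes: 0xC3 0xA9
const first = bytes.slice(0, 2);    // "h" plus the first byte of "é"
const second = bytes.slice(2);      // the second byte of "é" plus "llo"

// Without { stream: true } each call is treated as a complete input, so the
// dangling byte is replaced with U+FFFD (the replacement character).
const naive = new TextDecoder();
const broken = naive.decode(first) + naive.decode(second);

// With { stream: true } the decoder holds the incomplete byte sequence
// until the next call supplies the rest.
const streaming = new TextDecoder();
const correct =
  streaming.decode(first, { stream: true }) +
  streaming.decode(second, { stream: true }) +
  streaming.decode();               // final call flushes anything still buffered

console.log(correct === "héllo", broken.includes("\uFFFD"));
```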

Performance considerations

  • Appending state per token can cause hundreds of renders — batch with rAF or a buffer. Memoize already-rendered markdown so only the streaming tail re-parses. Auto-scroll work should be throttled.
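
Batching can be sketched as a small buffer that flushes at most once per scheduled tick. This is an illustration, not a definitive implementation: `createDeltaBatcher` is a hypothetical name, and the injectable `schedule` argument is an assumption added so the behaviour can be exercised outside a browser.

```javascript
// Hypothetical batcher: collects deltas and flushes them at most once per
// scheduled tick. Uses requestAnimationFrame when available.
function createDeltaBatcher(flush, schedule) {
  const defer =
    schedule ??
    (typeof requestAnimationFrame === "function"
      ? (cb) => requestAnimationFrame(cb)
      : (cb) => setTimeout(cb, 16));   // rough one-frame fallback
  let pending = "";
  let scheduled = false;
  return {
    push(delta) {
      pending += delta;
      if (scheduled) return;           // a flush is already queued for this tick
      scheduled = true;
      defer(() => {
        scheduled = false;
        const chunk = pending;
        pending = "";
        flush(chunk);                  // one state update for the whole batch
      });
    },
  };
}
```

With the read loop from earlier, usage might look like `const batcher = createDeltaBatcher((chunk) => setMessage((m) => m + chunk))`, with `batcher.push(delta)` replacing the direct `setMessage` call. However fast tokens arrive, state then updates at most once per frame.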

Edge cases

  • A token/event split across two network chunks.
  • Multi-byte UTF-8 character split across chunks.
  • Stream errors or connection drops mid-response.
  • User navigates away while streaming — must abort.
  • Incomplete markdown/code fence at the current cursor.
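
For the incomplete code fence case, one simple approach is to patch the text before handing it to the markdown renderer. The sketch below is a deliberate simplification: `closeOpenFence` is a hypothetical helper, it only recognises triple-backtick fences at the start of a line, and it ignores tilde fences, longer fences, and a fence marker that has only partly arrived.

```javascript
const FENCE = "`".repeat(3);   // the triple-backtick marker, built to keep this sample readable

// Hypothetical pre-render step: if the streamed markdown currently has an
// unclosed fence, append a temporary closing fence so the renderer treats
// the tail as code instead of swallowing the rest of the message.
function closeOpenFence(markdown) {
  const fenceLines = markdown
    .split("\n")
    .filter((line) => line.trimStart().startsWith(FENCE));
  const isOpen = fenceLines.length % 2 === 1;   // odd count means one is still open
  if (!isOpen) return markdown;
  return markdown.endsWith("\n") ? markdown + FENCE : markdown + "\n" + FENCE;
}
```

The patched string is used for rendering only. The stored message keeps the raw text, so the temporary fence disappears once the real closing fence streams in.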

Real-world examples

  • ChatGPT/Claude-style chat UIs rendering tokens as they generate.
  • AI code assistants streaming completions into an editor.

Senior engineer discussion

Seniors describe the ReadableStream consumption precisely, call out chunk-boundary buffering and streaming decode, wire cancellation, handle mid-stream errors, and address render batching plus auto-scroll/typing-cursor UX.
