System Design
medium
mid

How would you build a scalable chat UI for an LLM-powered product?

A scalable LLM chat UI requires: streaming via SSE or fetch streams, optimistic message rendering, virtualized long conversation lists, persistent thread storage (server + optimistic local), abort/regenerate semantics, markdown + code-block rendering with syntax highlighting, tool-call / structured-output UI, multi-modal attachments, and careful state ownership (server is source of truth for history; client buffers active stream). Performance hot spots: re-rendering during the token stream, keeping scroll pinned to the bottom, and markdown parsing on every token.

10 min read·~30 min to think through

What 'scalable' means here

Three axes:

  1. Per-thread: a chat with 10,000 messages must not slow down.
  2. Per-user: 1,000 threads must load instantly.
  3. Per-request: streaming tokens at 50/sec must render at 60fps.

Core architecture

text
┌─────────────────┐    SSE / WebSocket    ┌────────────────┐
│ Client (React)  │ ◀──────────────────── │  API Gateway   │
│ - thread list   │    JSON / fetch       │  - auth        │
│ - message list  │ ────────────────────▶ │  - rate limit  │
│ - composer      │                       └────────┬───────┘
└─────────────────┘                                │
                                          ┌────────▼───────┐
                                          │  LLM service   │
                                          │  + tools       │
                                          └────────┬───────┘
                                                   │
                                          ┌────────▼───────┐
                                          │  Postgres      │
                                          │  threads/msgs  │
                                          └────────────────┘

Streaming the response

The single biggest UX lever: tokens appear as they're generated.

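Server-Sent Events (SSE) is the right primary choice: text/event-stream, one direction, works through proxies. The browser's EventSource adds automatic reconnection but only issues GET requests, so to send a POST body use fetch with a ReadableStream and the same text/event-stream format, and handle reconnection yourself.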

js
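// Run inside an async function or an ES module (top-level await).
const res = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ threadId, message }),
});
if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);

const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // { stream: true } keeps multi-byte characters that span two chunks intact.
  // Chunks are arbitrary byte ranges, not tokens; SSE framing still needs parsing.
  appendToken(decoder.decode(value, { stream: true }));
}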
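
The loop above passes raw decoded chunks to appendToken, but a text/event-stream body is framed as events separated by blank lines, and a chunk boundary can fall anywhere inside an event. The following is a minimal sketch of an incremental parser, assuming the server sends each token in a `data:` field. `createSseParser` and `onData` are hypothetical names; the sketch ignores `event:`, `id:`, and `retry:` fields and does not handle bare `\r` line endings.

```javascript
// Minimal incremental SSE parser (sketch). Only "data:" fields are handled.
function createSseParser(onData) {
  let buffer = '';
  return {
    feed(chunk) {
      // Normalize CRLF so events are always separated by "\n\n".
      buffer = (buffer + chunk).replace(/\r\n/g, '\n');
      let boundary;
      while ((boundary = buffer.indexOf('\n\n')) !== -1) {
        const rawEvent = buffer.slice(0, boundary);
        buffer = buffer.slice(boundary + 2); // keep the incomplete tail for later
        const dataLines = [];
        for (const line of rawEvent.split('\n')) {
          if (line.startsWith('data:')) {
            // One optional space after the colon is not part of the payload.
            dataLines.push(line.slice(5).replace(/^ /, ''));
          }
        }
        // Comment-only events (lines starting with ":") produce no callback.
        if (dataLines.length > 0) onData(dataLines.join('\n'));
      }
    },
  };
}
```

In the read loop, the decoded chunk would go to `parser.feed(...)` and `appendToken` would become the `onData` callback.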

Token-render performance

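Naive: re-render the entire markdown component on every token. At 50 tokens/sec, that's 50 markdown parses per second, and each parse covers the whole message so far, so total work grows quadratically with message length.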

Mitigations:

  • Append-only buffer for the active message; only the trailing chunk re-renders.
  • Lazy markdown render: render plaintext during stream, swap to parsed markdown on complete. Or only parse complete paragraphs/blocks.
  • requestAnimationFrame batching: coalesce N tokens into one paint.
  • Memoize prior messages aggressively — they cannot change.
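
The requestAnimationFrame batching idea can be sketched as follows. `createTokenBatcher` and `flush` are hypothetical names, and the scheduler is injectable so the logic can run outside a browser; in a browser it defaults to requestAnimationFrame.

```javascript
// Coalesce tokens that arrive between frames into a single UI update (sketch).
function createTokenBatcher(flush, schedule = (cb) => requestAnimationFrame(cb)) {
  let pending = '';
  let scheduled = false;
  return function push(token) {
    pending += token;
    if (scheduled) return; // a flush is already queued for the next frame
    scheduled = true;
    schedule(() => {
      const text = pending;
      pending = '';
      scheduled = false;
      flush(text); // one state update per frame, however many tokens arrived
    });
  };
}
```

With this in place, `flush` is the only function that touches UI state, so the render rate is bounded by the frame rate rather than the token rate.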

Scroll behavior

Hard problem. Rules users expect:

  • New token appended → scroll stays pinned to bottom IF user was at bottom.
  • User scrolls up to read → DO NOT auto-scroll on new tokens.
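  • User scrolls back to bottom → re-engage auto-scroll.

js
// Measure before appending the token; afterwards scrollHeight has already grown.
const { scrollHeight, scrollTop, clientHeight } = listEl; // listEl: the scroll container
const isAtBottom = scrollHeight - scrollTop - clientHeight < 50; // 50px tolerance
if (isAtBottom) scrollToBottom();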
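
One way to implement all three rules is to record the pinned state when the user scrolls and consult it when content arrives, rather than measuring at append time. This is a sketch under that assumption; `createScrollPin` is a hypothetical helper, and `el` can be any object exposing scrollHeight, scrollTop, and clientHeight.

```javascript
// Track whether the user is pinned to the bottom of the message list (sketch).
function createScrollPin(el, threshold = 50) {
  let pinned = true; // a fresh chat starts at the bottom
  return {
    // Call from the container's "scroll" event handler.
    onScroll() {
      pinned = el.scrollHeight - el.scrollTop - el.clientHeight < threshold;
    },
    // Call after new content has been rendered.
    onContentAppended() {
      if (pinned) el.scrollTop = el.scrollHeight - el.clientHeight;
    },
    isPinned: () => pinned,
  };
}
```

Because content growth alone does not fire a scroll event, `pinned` keeps the value from the user's last scroll, which is what the rules above ask for.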

Virtualizing long conversations

10k messages × 50 DOM nodes each = 500,000 nodes, far more than a browser can lay out smoothly. Use @tanstack/react-virtual (formerly react-virtual) or react-window:

  • Variable-height windowing (messages vary).
  • Estimated heights with measured fallback.
  • Anchor at the bottom (chat-style — newest at the bottom of the window).
  • Stable keys (message id, not index).
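
To show what variable-height windowing with estimated heights amounts to, here is a minimal sketch of the range calculation. It is not how any particular library implements it; `getVisibleRange` and its parameters are hypothetical, and real libraries cache offsets instead of scanning linearly.

```javascript
// Compute which items to render for a given scroll position (sketch).
// heights[i] is the measured height of item i, or undefined if not measured yet.
function getVisibleRange(heights, scrollTop, viewportHeight, estimate = 80, overscan = 2) {
  const count = heights.length;
  let offset = 0;
  let start = 0;
  // Skip items that end at or above the top of the viewport.
  while (start < count && offset + (heights[start] ?? estimate) <= scrollTop) {
    offset += heights[start] ?? estimate;
    start += 1;
  }
  let end = start;
  let bottom = offset;
  // Include items until the viewport is filled.
  while (end < count && bottom < scrollTop + viewportHeight) {
    bottom += heights[end] ?? estimate;
    end += 1;
  }
  return {
    start: Math.max(0, start - overscan),
    end: Math.min(count, end + overscan), // exclusive
  };
}
```

For a 10,000-message thread only the items in `[start, end)` are mounted, each keyed by its message id.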

State ownership

  • Server: source of truth for the message history.
  • Client: optimistic + buffer.
  • On send: append user message optimistically; if server fails, mark error.
  • On stream: accumulate the streaming assistant message in client state; it is not part of the server's history yet.
  • On complete: server returns the final message id; reconcile.
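
The send / stream / complete / error lifecycle can be sketched as a reducer over the thread's message array. The action names, the `tempId` convention, and the assumption that the server returns ids for both the user and assistant messages are illustrative, not taken from any specific backend.

```javascript
// Client-side message state for one thread (sketch; all names are hypothetical).
function threadReducer(state, action) {
  const replyId = action.tempId + ':reply';
  switch (action.type) {
    case 'send': // optimistic user message plus an empty assistant placeholder
      return [
        ...state,
        { id: action.tempId, role: 'user', text: action.text, status: 'pending' },
        { id: replyId, role: 'assistant', text: '', status: 'streaming' },
      ];
    case 'token': // append to the live assistant message only
      return state.map((m) =>
        m.id === replyId ? { ...m, text: m.text + action.token } : m
      );
    case 'complete': // swap temporary ids for the server-issued ones
      return state.map((m) => {
        if (m.id === action.tempId) return { ...m, id: action.userId, status: 'sent' };
        if (m.id === replyId) return { ...m, id: action.assistantId, status: 'done' };
        return m;
      });
    case 'error': // keep the text so the user can retry
      return state.map((m) =>
        m.id === action.tempId || m.id === replyId ? { ...m, status: 'error' } : m
      );
    default:
      return state;
  }
}
```

Untouched messages keep their object identity across updates, which is what lets memoized message components skip re-rendering.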

Thread list

Sidebar with N threads. Patterns:

  • Lazy-load full history per thread; sidebar only carries title + lastMessageAt.
  • Infinite scroll / pagination on the thread list itself.
  • Background prefetch of the most-recent thread.

Composer features

  • Multi-line, autosize.
  • Submit on Enter, newline on Shift+Enter.
  • Paste image / file → upload + attach.
  • Stop generation button (AbortController) wired to the stream.
  • Regenerate (re-runs with same input).
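
Two of these behaviors are small enough to sketch directly: the Enter / Shift+Enter rule and the Stop button. `shouldSubmit`, `startGeneration`, and `run` are hypothetical names; `run(signal)` stands for whatever starts the request, for example by passing `signal` to fetch.

```javascript
// Decide whether a keydown in the composer should submit the message (sketch).
function shouldSubmit(event) {
  // isComposing is true while an IME candidate is being confirmed with Enter.
  return event.key === 'Enter' && !event.shiftKey && !event.isComposing;
}

// Wrap one generation so a Stop button can cancel it (sketch).
function startGeneration(run) {
  const controller = new AbortController();
  const done = run(controller.signal).catch((err) => {
    if (err && err.name === 'AbortError') return 'stopped'; // user pressed Stop
    throw err; // real failures still propagate
  });
  return { stop: () => controller.abort(), done };
}
```

Regenerate can reuse the same wrapper: stop any active generation, then call `startGeneration` again with the same input.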

Markdown + code

  • react-markdown or markdown-it with syntax highlighting (prism / shiki).
  • Memoize highlighter — slow on first call.
  • Copy-to-clipboard on code blocks.
  • Render diffs / tables / math (KaTeX) as needed.

Structured output / tool calls

LLMs increasingly return tool calls or JSON. UI patterns:

  • Tool call → collapsible card showing tool + arguments + result.
  • JSON / structured → render as form, table, or chart based on schema.
  • Citations → inline footnotes with hover preview.

Error and edge cases

  • Network drop mid-stream → reconnect & resume (or restart, depending on backend).
  • Token limit exceeded → render partial + error banner.
  • Rate-limited → backoff and surface.
  • Concurrent sends → queue or reject.
  • Server-side moderation block → surface gracefully.
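
For the rate-limited case, a common approach is exponential backoff with jitter, deferring to the server's Retry-After value when one is present. The sketch below assumes the header has already been parsed into seconds; `retryDelayMs` and its option names are hypothetical.

```javascript
// Delay before retry number `attempt` (0-based), in milliseconds (sketch).
// Exponential backoff with full jitter; a server-provided Retry-After wins.
function retryDelayMs(attempt, options = {}) {
  const { baseMs = 500, capMs = 30000, retryAfterSec, random = Math.random } = options;
  if (Number.isFinite(retryAfterSec)) return retryAfterSec * 1000;
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling);
}
```

The UI can surface the computed delay ("retrying in a few seconds") instead of failing silently.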

Persistence layer

  • Server: append-only messages table, indexed by (thread_id, created_at).
  • Client: cache last N threads in IndexedDB for instant cold-load.
  • Sync strategy: server is source of truth; client treats local as cache.

Observability

  • Per-message: tokens/sec, time-to-first-token, total latency.
  • Per-thread: message count, length.
  • Errors broken out by phase (auth, model, stream, render).
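
The per-message numbers can be collected with a small recorder around the stream. This is a sketch: `createStreamMetrics` is a hypothetical helper, the clock is injectable for testing, and "token" here means one callback invocation, which may not match the model's own token count.

```javascript
// Collect per-message streaming metrics (sketch).
function createStreamMetrics(now = () => performance.now()) {
  const startedAt = now(); // when the request was sent
  let firstTokenAt = null;
  let tokens = 0;
  return {
    onToken() {
      if (firstTokenAt === null) firstTokenAt = now();
      tokens += 1;
    },
    finish() {
      const endedAt = now();
      const streamMs = firstTokenAt === null ? 0 : endedAt - firstTokenAt;
      return {
        timeToFirstTokenMs: firstTokenAt === null ? null : firstTokenAt - startedAt,
        totalLatencyMs: endedAt - startedAt,
        tokensPerSec: streamMs > 0 ? (tokens / streamMs) * 1000 : 0,
      };
    },
  };
}
```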

Cost angle

  • Stream tokens to the client as they are generated; meter usage server-side and bill on completion.
  • Truncate or summarize old context to keep prompt cost bounded.
  • Show the user their running cost under heavy usage.
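
Truncating old context can be as simple as keeping the newest messages that fit a token budget. The sketch below assumes messages have a `text` field and uses a crude guess of about 4 characters per token, which is only a placeholder for the model's real tokenizer; `trimContext` is a hypothetical name.

```javascript
// Keep the newest messages that fit a token budget (sketch).
// The default countTokens is a rough ~4 chars/token guess, for illustration only.
function trimContext(messages, budget, countTokens = (m) => Math.ceil(m.text.length / 4)) {
  const kept = [];
  let used = 0;
  // Walk from newest to oldest so recent turns survive.
  for (let i = messages.length - 1; i >= 0; i -= 1) {
    const cost = countTokens(messages[i]);
    if (used + cost > budget) break;
    kept.unshift(messages[i]); // preserve chronological order
    used += cost;
  }
  return kept;
}
```

Summarization would replace the dropped prefix with a single synthetic message instead of discarding it.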

Recommended stack

  • React + a thin state layer (Zustand, Jotai).
  • TanStack Query for thread list (cache, refetch).
  • @tanstack/react-virtual for message list.
  • react-markdown + shiki for rendering.
  • fetch-based SSE (no eventsource lib needed in modern browsers).
  • Backend: Postgres + a streaming endpoint, queue for moderation / persistence.

Mental model

The chat UI is three problems wearing a trench coat: streaming (network + parse + render at 60fps), virtualization (don't render what isn't visible), and state ownership (server is truth; client buffers the live stream). Solve each separately; compose with care. Everything else (markdown, tools, attachments) layers on top.

Follow-up questions

  • How do you handle reconnection mid-stream?
  • How do you keep token rendering at 60fps?
  • How do you persist threads — client cache vs server?
  • How would you support multi-modal (image, file) inputs?

Common mistakes

  • Re-parsing markdown on every token — janks the whole UI.
  • Auto-scrolling even when the user has scrolled up.
  • Not virtualizing — long conversations grind to a halt.
  • Storing the full stream on the server before responding (kills TTFB).
  • No abort wiring — user cannot stop a runaway generation.

Performance considerations

  • Streaming + append-only render: TTFB ~200ms, perceived latency near zero.
  • Virtualization keeps a 10k-message thread at 60fps.
  • Token-rate render hot path: avoid a full markdown parse per token; either render plaintext during the stream or parse only complete blocks.

Edge cases

  • Network drop mid-stream — resume or restart?
  • User edits prompt while a response is streaming.
  • Markdown is invalid mid-token (unclosed code fence).
  • Tool call requires user confirmation mid-generation.
  • Browser tab backgrounded — should streaming pause?
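
For the unclosed code fence case, one option is to close the fence temporarily before handing partial text to the markdown renderer. This is a sketch that only recognizes fences written as a line starting with three backticks; `closeOpenFence` is a hypothetical helper, and tilde fences or longer backtick runs are not handled.

```javascript
// Close an unterminated fenced code block so partial markdown renders sanely (sketch).
const FENCE = '`'.repeat(3);

function closeOpenFence(markdown) {
  const fenceLines = markdown
    .split('\n')
    .filter((line) => line.trimStart().startsWith(FENCE));
  // An odd number of fence lines means the last block is still open.
  const isOpen = fenceLines.length % 2 === 1;
  return isOpen ? markdown + '\n' + FENCE : markdown;
}
```

The patched string is used for rendering only; the stored message text stays exactly as streamed.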

Real-world examples

  • ChatGPT — SSE stream, virtualized list, optimistic rendering.
  • Claude — SSE with abort, structured output rendering.
  • Vercel AI SDK — useChat hook implements the streaming buffer pattern.
  • Cursor — chat UI inside an editor with the same streaming model.

Senior engineer discussion

Seniors break the problem into streaming, virtualization, and state ownership. They prototype TTFB and tokens-per-second first because those drive every other decision. They design state with the server as source of truth and the client as a buffer, plan for partial failures (mid-stream disconnects), and instrument every phase so regressions are visible.
