System Design

medium

mid

How would you build a scalable chat UI for an LLM powered product?

Scalable LLM chat UI requires: streaming via SSE or fetch streams, optimistic message rendering, virtualized long conversation lists, persistent thread storage (server + optimistic local), abort/regenerate semantics, markdown + code-block rendering with syntax highlighting, tool-call / structured-output UI, multi-modal attachments, and careful state ownership (server is source of truth for history; client buffers active stream). Performance hot spots: re-rendering during token stream, scrolling pinned to bottom, markdown parsing on every token.

What 'scalable' means here

Three axes:

Per-thread: a chat with 10,000 messages must not slow down.
Per-user: 1,000 threads must load instantly.
Per-request: streaming tokens at 50/sec must render at 60fps.

Core architecture

┌─────────────────┐    SSE / WebSocket    ┌────────────────┐
│  Client (React) │ ◀──────────────────── │  API Gateway   │
│  - thread list  │    JSON / fetch       │  - auth        │
│  - message list │ ────────────────────▶ │  - rate limit  │
│  - composer     │                       └────────┬───────┘
└─────────────────┘                                │
                                          ┌────────▼───────┐
                                          │  LLM service   │
                                          │  + tools       │
                                          └────────┬───────┘
                                                   │
                                          ┌────────▼───────┐
                                          │  Postgres      │
                                          │  threads/msgs  │
                                          └────────────────┘

Streaming the response

The single biggest UX lever: tokens appear as they're generated.

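Server-Sent Events (SSE) is the right primary choice: a text/event-stream response is one-directional, works through most proxies, and auto-reconnects when consumed with the browser's EventSource. EventSource only issues GET requests, so a chat endpoint that takes a POST body is usually read with fetch and a ReadableStream instead. That gives the same incremental delivery, but the client has to handle reconnection itself.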

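// Sketch: POST the message and render the response body as it streams.
// threadId, message, and appendToken come from the surrounding component.
async function streamChat(threadId, message, appendToken, signal) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ threadId, message }),
    signal, // optional AbortSignal so the caller can stop generation
  });
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte characters split across chunks intact.
    appendToken(decoder.decode(value, { stream: true }));
  }
  const tail = decoder.decode(); // flush any bytes still buffered in the decoder
  if (tail) appendToken(tail);
}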
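
If the endpoint sends text/event-stream frames rather than raw text, the decoded chunks have to be split into events before rendering, because a network chunk can end in the middle of a frame. The following is a minimal sketch of such a parser, not a full implementation of the SSE specification: `createSseParser` is a hypothetical helper name, it assumes `\n` line endings, and it ignores the `retry` field.

```javascript
// Minimal SSE frame parser (sketch). Assumes "\n" line endings and that
// events are separated by a blank line, as in text/event-stream.
function createSseParser(onEvent) {
  let buffer = ""; // holds text that does not yet form a complete frame

  return function feed(chunk) {
    buffer += chunk;
    let boundary;
    // A blank line ("\n\n") terminates one event frame.
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const frame = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);

      const event = { id: undefined, event: "message", data: "" };
      const dataLines = [];
      for (const line of frame.split("\n")) {
        if (line === "" || line.startsWith(":")) continue; // comment / keep-alive
        const colon = line.indexOf(":");
        const field = colon === -1 ? line : line.slice(0, colon);
        let value = colon === -1 ? "" : line.slice(colon + 1);
        if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
        if (field === "data") dataLines.push(value);
        else if (field === "id") event.id = value;
        else if (field === "event") event.event = value;
      }
      // Frames without any data field are not dispatched.
      if (dataLines.length > 0) {
        event.data = dataLines.join("\n");
        onEvent(event);
      }
    }
  };
}
```

In use, each decoded chunk from the reader would be passed to `feed`, and `onEvent` would append `event.data` to the active message. Recording the last `id` seen gives the client something to send back if the backend supports resuming a dropped stream.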

Token-render performance

Naive: re-render the entire markdown component on every token. At 50 tokens/sec, that's 50 markdown parses per second, each over the whole message so far, so total work grows quadratically with message length.

Mitigations:

Append-only buffer for the active message; only the trailing chunk re-renders.
Lazy markdown render: render plaintext during stream, swap to parsed markdown on complete. Or only parse complete paragraphs/blocks.
requestAnimationFrame batching: coalesce N tokens into one paint.
Memoize prior messages aggressively — they cannot change.
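
The requestAnimationFrame batching idea above can be sketched as a small buffer that collects tokens and flushes them at most once per frame. `createTokenBatcher` and its `render` callback are hypothetical names for illustration; the `schedule` parameter defaults to the browser's `requestAnimationFrame` and is injectable so the logic can run outside a browser.

```javascript
// Coalesce many token arrivals into one render per animation frame (sketch).
function createTokenBatcher(render, schedule = (cb) => requestAnimationFrame(cb)) {
  let pending = "";      // tokens received since the last flush
  let scheduled = false; // true while a flush is already queued

  return {
    push(token) {
      pending += token;
      if (scheduled) return; // a flush is already queued for this frame
      scheduled = true;
      schedule(() => {
        scheduled = false;
        const chunk = pending;
        pending = "";
        render(chunk); // one state update / paint for all buffered tokens
      });
    },
  };
}
```

With this shape, the stream loop would call `batcher.push(token)` instead of updating state directly, so a burst of tokens inside one frame costs a single render.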

Scroll behavior

Hard problem. Rules users expect:

New token appended → scroll stays pinned to bottom IF user was at bottom.
User scrolls up to read → DO NOT auto-scroll on new tokens.
User scrolls back to bottom → re-engage auto-scroll.

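// el is the scroll container. Measure before appending: afterwards scrollHeight has already grown.
const wasAtBottom = el.scrollHeight - el.scrollTop - el.clientHeight < 50;
appendToken(token);
if (wasAtBottom) el.scrollTop = el.scrollHeight;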

Virtualizing long conversations

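10k messages × ~50 DOM nodes each ≈ 500,000 nodes, far more than a browser can lay out and scroll smoothly. Use a windowing library such as @tanstack/react-virtual (formerly react-virtual) or react-window: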

Variable-height windowing (messages vary).
Estimated heights up front, replaced by measured heights once a row has rendered.
Anchor at the bottom (chat-style — newest at the bottom of the window).
Stable keys (message id, not index).
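
To make the windowing idea concrete, here is a rough sketch of the core calculation a virtualizer performs: given per-row heights (measured where available, estimated otherwise), find which rows intersect the viewport. `visibleRange` is a hypothetical helper, not the API of any library named above, and it uses a linear scan where real libraries typically use cached offsets and binary search.

```javascript
// Compute the index range of rows to render for a variable-height list (sketch).
// heights[i] is the measured height of row i, or an estimate if not yet measured.
function visibleRange(heights, scrollTop, viewportHeight, overscan = 3) {
  const viewportBottom = scrollTop + viewportHeight;
  let top = 0;
  let first = -1;
  let last = -1;

  for (let i = 0; i < heights.length; i++) {
    const bottom = top + heights[i];
    // Row i occupies [top, bottom); keep it if it overlaps the viewport.
    if (bottom > scrollTop && top < viewportBottom) {
      if (first === -1) first = i;
      last = i;
    }
    top = bottom;
  }

  if (first === -1) return { start: 0, end: -1 }; // nothing visible
  // Overscan renders a few extra rows on each side to hide blank gaps while scrolling.
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(heights.length - 1, last + overscan),
  };
}
```

Only rows from `start` to `end` are mounted; the rest are represented by spacer height. For a chat pinned to the bottom, `scrollTop` would be the total height minus the viewport height, so the newest messages are the ones rendered.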

State ownership

Server: source of truth for the message history.
Client: optimistic + buffer.
On send: append user message optimistically; if server fails, mark error.
On stream: hold the in-progress assistant message in client state only; it is not part of the server's history until it completes.
On complete: server returns the final message id; reconcile.
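
The send, stream, and complete steps above could be modeled as a small reducer. This is one possible shape under stated assumptions: the action names, the `status` values, and the separate `serverId` field are all hypothetical, chosen so the client-side `id` stays stable as a render key while the server id is attached on reconcile.

```javascript
// Client-side message state for one thread (sketch; action shapes are assumptions).
function chatReducer(messages, action) {
  switch (action.type) {
    case "send": // optimistic: show the user message and an empty assistant bubble
      return [
        ...messages,
        { id: action.tempId, role: "user", text: action.text, status: "pending" },
        { id: action.streamId, role: "assistant", text: "", status: "streaming" },
      ];
    case "token": // buffer the live stream in client state only
      return messages.map((m) =>
        m.id === action.streamId ? { ...m, text: m.text + action.token } : m
      );
    case "complete": // reconcile with the ids the server assigned
      return messages.map((m) => {
        if (m.id === action.tempId) return { ...m, serverId: action.userId, status: "sent" };
        if (m.id === action.streamId) return { ...m, serverId: action.assistantId, status: "sent" };
        return m;
      });
    case "error": // keep what was shown, but mark it as failed
      return messages.map((m) =>
        m.id === action.tempId || m.id === action.streamId ? { ...m, status: "error" } : m
      );
    default:
      return messages;
  }
}
```

Because every case returns new objects only for the messages it touches, memoized components for earlier messages keep their props and skip re-rendering.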

Thread list

Sidebar with N threads. Patterns:

Lazy-load full history per thread; sidebar only carries title + lastMessageAt.
Infinite scroll / pagination on the thread list itself.
Background prefetch of the most-recent thread.

Composer features

Multi-line, autosize.
Submit on Enter, newline on Shift+Enter.
Paste image / file → upload + attach.
Stop generation button (AbortController) wired to the stream.
Regenerate (re-runs with same input).
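
A minimal sketch of the stop-button wiring, assuming one active generation per thread: `GenerationController` is a hypothetical wrapper around the standard `AbortController`. Starting a new generation (for example on regenerate) aborts the previous one, and the returned signal would be passed as the `signal` option of the streaming `fetch` call.

```javascript
// Tracks the single in-flight generation and lets the UI cancel it (sketch).
class GenerationController {
  #current = null; // AbortController for the active stream, if any

  // Begin a new generation; any previous one is aborted first.
  start() {
    this.stop();
    this.#current = new AbortController();
    return this.#current.signal;
  }

  // Called by the Stop button.
  stop() {
    this.#current?.abort();
    this.#current = null;
  }

  // Called when a stream ends on its own; ignores stale signals.
  finish(signal) {
    if (this.#current && this.#current.signal === signal) this.#current = null;
  }

  get isGenerating() {
    return this.#current !== null;
  }
}
```

When the signal is aborted, a pending `fetch` rejects with an error whose `name` is `"AbortError"`, which the caller would typically treat as a normal stop rather than a failure.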

Markdown + code

react-markdown or markdown-it with syntax highlighting (prism / shiki).
Create the highlighter once and reuse it; initialization (loading grammars and themes) makes the first call slow.
Copy-to-clipboard on code blocks.
Render diffs / tables / math (KaTeX) as needed.
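
While a response is still streaming, the markdown is often temporarily invalid, most visibly when a code fence has been opened but not yet closed. One way to handle this is to parse only the completed blocks and show the unfinished tail as plain text. The sketch below is a simplified illustration: `splitStableBlocks` is a hypothetical helper, it treats blank lines and closed fences as block boundaries, and it does not implement the full CommonMark block rules.

```javascript
// Split streaming markdown into a stable prefix (safe to parse and cache)
// and a pending tail (still changing; render as plain text). Sketch only.
function splitStableBlocks(text) {
  const lines = text.split("\n");
  let inFence = false;
  let offset = 0;    // character offset just past the current line
  let stableEnd = 0; // everything before this offset is complete

  // The last array element is the unterminated line still being streamed.
  for (let i = 0; i < lines.length - 1; i++) {
    const line = lines[i];
    offset += line.length + 1; // +1 for the "\n"
    if (/^\s{0,3}(```|~~~)/.test(line)) {
      inFence = !inFence;
      if (!inFence) stableEnd = offset; // a fence just closed
      continue;
    }
    // Outside a fence, a blank line ends the preceding block.
    if (!inFence && line.trim() === "") stableEnd = offset;
  }

  return { stable: text.slice(0, stableEnd), pending: text.slice(stableEnd) };
}
```

The `stable` part only ever grows, so its parsed output can be memoized; only `pending` needs to be re-rendered on each flush.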

Structured output / tool calls

LLMs increasingly return tool calls or JSON. UI patterns:

Tool call → collapsible card showing tool + arguments + result.
JSON / structured → render as form, table, or chart based on schema.
Citations → inline footnotes with hover preview.

Error and edge cases

Network drop mid-stream → reconnect & resume (or restart, depending on backend).
Token limit exceeded → render partial + error banner.
Rate-limited → backoff and surface.
Concurrent sends → queue or reject.
Server-side moderation block → surface gracefully.
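
For the rate-limited case, a common approach is exponential backoff with jitter, deferring to the server's `Retry-After` value when one is sent. The sketch below assumes `Retry-After` has already been parsed as a number of seconds (the header can also carry an HTTP date, which is not handled here); `retryDelayMs` and the 500ms base and 30s cap are illustrative choices, not values from any particular API.

```javascript
// Decide how long to wait before retrying a rate-limited request (sketch).
function retryDelayMs(attempt, retryAfterSeconds, random = Math.random) {
  // Prefer the server's explicit hint when it is a usable number.
  if (Number.isFinite(retryAfterSeconds) && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000;
  }
  const baseMs = 500;   // assumed starting delay
  const capMs = 30000;  // assumed upper bound
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // "Full jitter": pick uniformly in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling);
}
```

The returned delay is also what the UI can surface, for example as a countdown next to the failed message.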

Persistence layer

Server: append-only messages table, indexed by (thread_id, created_at).
Client: cache last N threads in IndexedDB for instant cold-load.
Sync strategy: server is source of truth; client treats local as cache.

Observability

Per-message: tokens/sec, time-to-first-token, total latency.
Per-thread: message count, length.
Errors broken out by phase (auth, model, stream, render).
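
The per-message metrics above can be captured on the client with a few timestamps. This is a rough sketch: `createStreamMetrics` is a hypothetical helper, the clock is injectable (defaulting to `performance.now()`), and it counts stream chunks, which only approximate model tokens unless the server emits one token per chunk.

```javascript
// Collect time-to-first-token, total latency, and throughput for one response (sketch).
function createStreamMetrics(now = () => performance.now()) {
  const startedAt = now(); // taken when the request is sent
  let firstTokenAt = null;
  let tokenCount = 0;

  return {
    // Call once per received chunk/token.
    onToken() {
      if (firstTokenAt === null) firstTokenAt = now();
      tokenCount += 1;
    },
    // Call when the stream ends; returns the metrics to report.
    finish() {
      const endedAt = now();
      const streamMs = firstTokenAt === null ? 0 : endedAt - firstTokenAt;
      return {
        timeToFirstTokenMs: firstTokenAt === null ? null : firstTokenAt - startedAt,
        totalLatencyMs: endedAt - startedAt,
        // Throughput is measured from the first token, not from request start.
        tokensPerSecond: streamMs > 0 ? tokenCount / (streamMs / 1000) : 0,
        tokenCount,
      };
    },
  };
}
```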

Cost angle

Meter tokens server-side as they stream; bill when the response completes.
Truncate or summarize old context to keep prompt cost bounded.
Show heavy users their running cost.
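
As a sketch of the truncation option, the function below keeps the most recent messages that fit within a token budget. `trimContext` is a hypothetical helper, and the default token estimate of roughly four characters per token is a crude assumption; a real system would use the tokenizer that matches its model.

```javascript
// Keep the newest messages whose combined token cost fits the budget (sketch).
function trimContext(messages, maxTokens, countTokens = (text) => Math.ceil(text.length / 4)) {
  const kept = [];
  let used = 0;
  // Walk from newest to oldest so recent turns are preferred.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = countTokens(messages[i].text);
    if (used + cost > maxTokens) break; // this and all older messages are dropped
    kept.unshift(messages[i]); // preserve chronological order
    used += cost;
  }
  return kept;
}
```

Summarization would replace the dropped prefix with a single synthetic message rather than discarding it, at the cost of an extra model call.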

Recommended stack

React + a thin state layer (Zustand, Jotai).
TanStack Query for thread list (cache, refetch).
@tanstack/react-virtual for message list.
react-markdown + shiki for rendering.
fetch-based SSE (no eventsource lib needed in modern browsers).
Backend: Postgres + a streaming endpoint, queue for moderation / persistence.

Mental model

The chat UI is three problems wearing a trench coat: streaming (network + parse + render at 60fps), virtualization (don't render what isn't visible), and state ownership (server is truth; client buffers the live stream). Solve each separately; compose with care. Everything else (markdown, tools, attachments) layers on top.

Follow-up questions

• How do you handle reconnection mid-stream?
• How do you keep token rendering at 60fps?
• How do you persist threads: client cache vs server?
• How would you support multi-modal (image, file) inputs?

Common mistakes

• Re-parsing markdown on every token, which janks the whole UI.
• Auto-scrolling even when the user has scrolled up.
• Not virtualizing, so long conversations grind to a halt.
• Buffering the full response on the server before sending anything (kills time-to-first-token).
• No abort wiring, so the user cannot stop a runaway generation.

Performance considerations

• Streaming plus append-only rendering keeps time to first byte low (around 200ms is a reasonable target) and perceived latency near zero. Virtualization keeps a 10k-message thread at 60fps. On the token-rate render hot path, avoid a full markdown parse per token: either render plaintext during the stream or parse only complete blocks.

Edge cases

• Network drop mid-stream: resume or restart?
• User edits prompt while a response is streaming.
• Markdown is invalid mid-token (unclosed code fence).
• Tool call requires user confirmation mid-generation.
• Browser tab backgrounded: should streaming pause?

Real-world examples

• ChatGPT: SSE stream, virtualized list, optimistic rendering.
• Claude: SSE with abort, structured output rendering.
• Vercel AI SDK: useChat hook implements the streaming buffer pattern.
• Cursor: chat UI inside an editor with the same streaming model.

Senior engineer discussion

Seniors break the problem into streaming, virtualization, and state ownership. They prototype TTFB and tokens-per-second first because those drive every other decision. They design state with the server as source of truth and the client as a buffer, plan for partial failures (mid-stream disconnects), and instrument every phase so regressions are visible.