How would you build a scalable chat UI for an LLM powered product?
A scalable LLM chat UI requires: streaming via SSE or fetch streams, optimistic message rendering, virtualized long conversation lists, persistent thread storage (server of record plus an optimistic local cache), abort/regenerate semantics, markdown + code-block rendering with syntax highlighting, tool-call / structured-output UI, multi-modal attachments, and careful state ownership (server is source of truth for history; client buffers the active stream). The performance hot spots are re-rendering during the token stream, keeping scroll pinned to the bottom, and markdown parsing on every token.
What 'scalable' means here
Three axes:
- Per-thread: a chat with 10,000 messages must not slow down.
- Per-user: 1,000 threads must load instantly.
- Per-request: streaming tokens at 50/sec must render at 60fps.
Core architecture
┌─────────────────┐    SSE / WebSocket    ┌────────────────┐
│ Client (React)  │ ◀──────────────────── │  API Gateway   │
│ - thread list   │      JSON / fetch     │  - auth        │
│ - message list  │ ────────────────────▶ │  - rate limit  │
│ - composer      │                       └────────┬───────┘
└─────────────────┘                                │
                                          ┌────────▼───────┐
                                          │  LLM service   │
                                          │  + tools       │
                                          └────────┬───────┘
                                                   │
                                          ┌────────▼───────┐
                                          │  Postgres      │
                                          │  threads/msgs  │
                                          └────────────────┘
Streaming the response
The single biggest UX lever: tokens appear as they're generated.
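Server-Sent Events (SSE) is the right primary choice: a text/event-stream response, one direction, and proxy-friendly because it is plain HTTP. The browser's EventSource API adds automatic reconnection but only issues GET requests, so a chat endpoint that takes a POST body is usually read with fetch and a ReadableStream instead. That gives the same streaming effect but leaves reconnection to you.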
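const res = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ threadId, message }),
});
if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);
const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters that are split across chunks intact
  appendToken(decoder.decode(value, { stream: true }));
}
Token-render performance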
Naive: re-render the entire markdown component on every token. At 50 tokens/sec, that's 50 markdown parses per second, each one over the whole message so far, so the work grows as the response gets longer.
Mitigations:
- Append-only buffer for the active message; only the trailing chunk re-renders.
- Lazy markdown render: render plaintext during stream, swap to parsed markdown on complete. Or only parse complete paragraphs/blocks.
- requestAnimationFrame batching: coalesce N tokens into one paint.
- Memoize prior messages aggressively — they cannot change.
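The requestAnimationFrame batching mitigation above can be sketched as a small buffer that schedules at most one flush per frame. This is a minimal sketch, not a definitive implementation: createTokenBatcher, onFlush, and the injectable schedule parameter are hypothetical names introduced here for illustration.

```javascript
// Coalesces streamed tokens into at most one flush per scheduled frame.
// `schedule` defaults to requestAnimationFrame in a browser; it is injectable
// (a hypothetical parameter) so the logic can be exercised outside a browser.
function createTokenBatcher(onFlush, schedule = (cb) => requestAnimationFrame(cb)) {
  let pending = '';
  let scheduled = false;
  return {
    push(token) {
      pending += token;
      if (scheduled) return; // a flush is already queued for this frame
      scheduled = true;
      schedule(() => {
        scheduled = false;
        const text = pending;
        pending = '';
        onFlush(text); // one state update (and one render) per frame
      });
    },
  };
}
```

In a component, onFlush would append the coalesced text to the active message, so ten tokens arriving inside one frame cost one render instead of ten.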
Scroll behavior
Hard problem. Rules users expect:
- New token appended → scroll stays pinned to bottom IF user was at bottom.
- User scrolls up to read → DO NOT auto-scroll on new tokens.
- User scrolls back to bottom → re-engage auto-scroll.
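// Measure before the new token renders; afterwards scrollHeight has already grown.
const isAtBottom = scrollHeight - scrollTop - clientHeight < 50; // 50px slack
// ...render the token, then:
if (isAtBottom) scrollToBottom();
Virtualizing long conversations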
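10k messages × 50 DOM nodes each is roughly 500,000 nodes, far more than a browser can lay out smoothly. Use a windowing library such as @tanstack/react-virtual (the successor to react-virtual) or react-window: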
- Variable-height windowing (messages vary).
- Estimated heights with measured fallback.
- Anchor at the bottom (chat-style — newest at the bottom of the window).
- Stable keys (message id, not index).
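To make the windowing idea concrete, here is a minimal sketch of the core calculation a virtualizer performs for variable-height rows. It is not how any particular library implements it; visibleRange and its parameters are hypothetical, and it recomputes offsets on every call for clarity where a real implementation would cache them.

```javascript
// Returns the index range to render for variable-height rows.
// heights[i] is the measured height of row i, or an estimate until it is measured.
function visibleRange(heights, scrollTop, viewportHeight, overscan = 3) {
  // offsets[i] = top edge of row i; offsets[n] = total content height.
  const offsets = [0];
  for (const h of heights) offsets.push(offsets[offsets.length - 1] + h);

  // Binary search: last row whose top edge is at or above `y`.
  const rowAt = (y) => {
    let lo = 0;
    let hi = heights.length - 1;
    while (lo < hi) {
      const mid = Math.ceil((lo + hi) / 2);
      if (offsets[mid] <= y) lo = mid;
      else hi = mid - 1;
    }
    return lo;
  };

  const first = rowAt(scrollTop);
  const last = rowAt(scrollTop + viewportHeight);
  return {
    start: Math.max(0, first - overscan), // overscan rows soften fast scrolling
    end: Math.min(heights.length - 1, last + overscan),
    totalHeight: offsets[heights.length], // sizes the scroll container
  };
}
```

Only rows from start to end are mounted; the rest of the thread is represented by totalHeight on the scroll container, which is why a 10,000-message thread costs about as much to render as a short one.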
State ownership
- Server: source of truth for the message history.
- Client: holds optimistic messages and the buffer for the active stream.
- On send: append user message optimistically; if server fails, mark error.
- On stream: accumulate the in-progress assistant message in client state rather than re-reading it from the server.
- On complete: server returns the final message id; reconcile.
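The send / stream / complete lifecycle above can be sketched as a pure reducer. The action names, the key and id fields, and the assumption that the server returns ids for both the user and the assistant message are illustrative choices, not something the design above prescribes.

```javascript
// Pure reducer over the active thread's messages (sketch; action names are hypothetical).
// `key` is client-generated and stable for rendering; `id` is null until the server confirms.
function chatReducer(messages, action) {
  const replyKey = action.tempId + ':reply';
  switch (action.type) {
    case 'send': // optimistic user message plus an empty assistant placeholder
      return [
        ...messages,
        { key: action.tempId, id: null, role: 'user', text: action.text, status: 'pending' },
        { key: replyKey, id: null, role: 'assistant', text: '', status: 'streaming' },
      ];
    case 'delta': // streamed text lives only in client state until completion
      return messages.map((m) =>
        m.key === replyKey ? { ...m, text: m.text + action.text } : m);
    case 'complete': // reconcile with the ids the server persisted
      return messages.map((m) => {
        if (m.key === action.tempId) return { ...m, id: action.userId, status: 'sent' };
        if (m.key === replyKey) return { ...m, id: action.assistantId, status: 'done' };
        return m;
      });
    case 'error': // keep the text so the user can retry
      return messages.map((m) =>
        m.key === action.tempId || m.key === replyKey ? { ...m, status: 'error' } : m);
    default:
      return messages;
  }
}
```

Keeping a stable client key separate from the server id means reconciliation does not change the list key, so the row is not remounted when the stream completes.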
Thread list
Sidebar with N threads. Patterns:
- Lazy-load full history per thread; sidebar only carries title + lastMessageAt.
- Infinite scroll / pagination on the thread list itself.
- Background prefetch of the most-recent thread.
Composer features
- Multi-line, autosize.
- Submit on Enter, newline on Shift+Enter.
- Paste image / file → upload + attach.
- Stop generation button (AbortController) wired to the stream.
- Regenerate (re-runs with same input).
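Two of the composer behaviours above are small enough to sketch directly: the Enter / Shift+Enter rule and the Stop button wiring. shouldSubmit and startGeneration are hypothetical helper names; the IME check is an addition the list does not mention but that the Enter rule needs in practice.

```javascript
// Enter submits; Shift+Enter inserts a newline. Enter pressed while an IME
// composition is active must not submit (isComposing is a standard KeyboardEvent field).
function shouldSubmit(event) {
  return event.key === 'Enter' && !event.shiftKey && !event.isComposing;
}

// One AbortController per generation; `stop` is what the Stop button calls.
// `run` is a caller-supplied function, e.g. (signal) => fetch(url, { signal, ... }).
function startGeneration(run) {
  const controller = new AbortController();
  const finished = run(controller.signal);
  return { stop: () => controller.abort(), finished };
}
```

Passing the signal into fetch cancels the request and makes the pending read reject with an AbortError, which the caller should treat as a normal stop rather than a failure. With React's synthetic events the isComposing flag is typically read from event.nativeEvent.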
Markdown + code
- react-markdown or markdown-it with syntax highlighting (prism / shiki).
- Memoize highlighter — slow on first call.
- Copy-to-clipboard on code blocks.
- Render diffs / tables / math (KaTeX) as needed.
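While a response is still streaming, the renderer sees incomplete markdown, most visibly an unclosed code fence. One way to parse only complete blocks is to split the text at the last block boundary and hand just the stable prefix to the markdown renderer. The sketch below assumes a deliberately simplified rule (a block boundary is a blank line outside a backtick fence); splitStableMarkdown is a hypothetical helper, not part of react-markdown or markdown-it.

```javascript
const FENCE = '`'.repeat(3); // the three-backtick code fence marker

// Splits streamed markdown into a prefix of complete blocks (safe to parse and memoize)
// and a trailing remainder (render as plain text until more arrives).
// Simplification: a block boundary is a blank line outside a backtick fence.
function splitStableMarkdown(text) {
  let inFence = false;
  let safeEnd = 0; // offset just past the last block boundary
  let pos = 0;
  while (true) {
    const nl = text.indexOf('\n', pos);
    if (nl === -1) break; // the last line is still being streamed
    const line = text.slice(pos, nl);
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    else if (!inFence && line.trim() === '') safeEnd = nl + 1;
    pos = nl + 1;
  }
  return { stable: text.slice(0, safeEnd), tail: text.slice(safeEnd) };
}
```

The stable part only ever grows, so its parsed output can be memoized, and an open fence stays in the tail until its closing marker and a following blank line arrive.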
Structured output / tool calls
LLMs increasingly return tool calls or JSON. UI patterns:
- Tool call → collapsible card showing tool + arguments + result.
- JSON / structured → render as form, table, or chart based on schema.
- Citations → inline footnotes with hover preview.
Error and edge cases
- Network drop mid-stream → reconnect & resume (or restart, depending on backend).
- Token limit exceeded → render partial + error banner.
- Rate-limited → backoff and surface.
- Concurrent sends → queue or reject.
- Server-side moderation block → surface gracefully.
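For the rate-limited case, "backoff" usually means honouring the server's Retry-After header when present and otherwise backing off exponentially with jitter. A minimal sketch under those assumptions follows; retryDelayMs and its option names are hypothetical, and only the seconds form of Retry-After is handled (the HTTP-date form falls through to backoff).

```javascript
// Delay before retry number `attempt` (0-based).
// Honors a Retry-After header given in seconds; otherwise uses exponential
// backoff with full jitter, capped at maxMs.
function retryDelayMs(attempt, retryAfterHeader, { baseMs = 500, maxMs = 30000, random = Math.random } = {}) {
  const seconds = Number(retryAfterHeader);
  const hasHeader = retryAfterHeader != null && retryAfterHeader !== '';
  if (hasHeader && Number.isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling); // full jitter spreads out retrying clients
}
```

The computed delay is also what the UI can surface to the user, for example as a "retrying in N seconds" notice, rather than failing silently.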
Persistence layer
- Server: append-only messages table, indexed by (thread_id, created_at).
- Client: cache last N threads in IndexedDB for instant cold-load.
- Sync strategy: server is source of truth; client treats local as cache.
Observability
- Per-message: tokens/sec, time-to-first-token, total latency.
- Per-thread: message count, length.
- Errors broken out by phase (auth, model, stream, render).
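The per-message metrics above fall out of three timestamps and a token count. A minimal sketch, assuming millisecond timestamps (for example from performance.now()) captured on the client; streamMetrics and its field names are hypothetical.

```javascript
// Derives per-message stream metrics from timestamps in milliseconds.
function streamMetrics({ sentAt, firstTokenAt, completedAt, tokenCount }) {
  const streamMs = completedAt - firstTokenAt;
  return {
    timeToFirstTokenMs: firstTokenAt - sentAt,
    totalLatencyMs: completedAt - sentAt,
    // Rate over the streaming window only, so a slow first token does not skew it.
    tokensPerSecond: streamMs > 0 ? tokenCount / (streamMs / 1000) : 0,
  };
}
```

Measuring tokens/sec over the streaming window keeps it independent of time-to-first-token, so the two numbers point at different problems (queueing and prompt processing versus generation speed).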
Cost angle
- Count tokens server-side as they stream and bill on completion.
- Truncate or summarize old context to keep prompt cost bounded.
- Show the user a running cost when usage is heavy.
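Truncating old context to a budget can be sketched as a walk from the newest message backwards. The token estimate here is a rough stand-in (about four characters per token, an assumption that only loosely holds for English text); fitToBudget and estimateTokens are hypothetical names, and a real system would use the model's own tokenizer.

```javascript
// Rough stand-in for a tokenizer (assumption: about 4 characters per token).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keeps the system prompt plus the most recent messages that fit the token budget.
function fitToBudget(systemPrompt, messages, budgetTokens, count = estimateTokens) {
  let used = count(systemPrompt);
  const kept = [];
  for (let i = messages.length - 1; i >= 0; i--) { // walk newest to oldest
    const cost = count(messages[i].text);
    if (used + cost > budgetTokens) break;
    used += cost;
    kept.unshift(messages[i]); // preserve chronological order
  }
  return { kept, usedTokens: used, droppedCount: messages.length - kept.length };
}
```

The dropped prefix is what a summarization step would compress; droppedCount tells the caller whether that step is needed at all.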
Recommended stack
- React + a thin state layer (Zustand, Jotai).
- TanStack Query for thread list (cache, refetch).
- @tanstack/react-virtual for message list.
- react-markdown + shiki for rendering.
- fetch-based SSE (no eventsource lib needed in modern browsers).
- Backend: Postgres + a streaming endpoint, queue for moderation / persistence.
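Choosing fetch-based SSE without an EventSource library means the client parses text/event-stream framing itself, and a network chunk can end in the middle of a frame. A minimal sketch of an incremental parser follows; createSseParser and onEvent are hypothetical names, and only the event and data fields are handled (id and retry are ignored).

```javascript
// Incremental parser for text/event-stream data (sketch; handles `event` and `data` only).
function createSseParser(onEvent) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    // Hold back a trailing "\r": it may be the first half of a "\r\n" split across chunks.
    const heldCr = buffer.endsWith('\r') ? '\r' : '';
    const text = buffer.slice(0, buffer.length - heldCr.length).replace(/\r\n?/g, '\n');
    const frames = text.split('\n\n'); // a blank line terminates each frame
    buffer = frames.pop() + heldCr; // the last piece is an incomplete frame
    for (const frame of frames) {
      let event = 'message';
      const data = [];
      for (const line of frame.split('\n')) {
        if (line === '' || line.startsWith(':')) continue; // blank or comment line
        const colon = line.indexOf(':');
        const field = colon === -1 ? line : line.slice(0, colon);
        let value = colon === -1 ? '' : line.slice(colon + 1);
        if (value.startsWith(' ')) value = value.slice(1); // strip one leading space
        if (field === 'event') event = value;
        else if (field === 'data') data.push(value);
      }
      if (data.length > 0) onEvent({ event, data: data.join('\n') });
    }
  };
}
```

Each decoded chunk from the response reader would be passed to feed, and onEvent would receive whole events regardless of how the network split them.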
Mental model
The chat UI is three problems wearing a trench coat: streaming (network + parse + render at 60fps), virtualization (don't render what isn't visible), and state ownership (server is truth; client buffers the live stream). Solve each separately; compose with care. Everything else (markdown, tools, attachments) layers on top.
Follow-up questions
- How do you handle reconnection mid-stream?
- How do you keep token rendering at 60fps?
- How do you persist threads: client cache vs server?
- How would you support multi-modal (image, file) inputs?
Common mistakes
- Re-parsing markdown on every token: janks the whole UI.
- Auto-scrolling even when the user has scrolled up.
- Not virtualizing: long conversations grind to a halt.
- Storing the full stream on the server before responding (kills TTFB).
- No abort wiring: user cannot stop a runaway generation.
Performance considerations
- Streaming plus append-only rendering puts the first token on screen quickly (TTFB around 200ms in a good case), so perceived latency is near zero.
- Virtualization keeps a 10k-message thread at 60fps.
- Token-rate render hot path: avoid a full markdown parse per token; either render plaintext during the stream or parse only complete blocks.
Edge cases
- Network drop mid-stream: resume or restart?
- User edits prompt while a response is streaming.
- Markdown is invalid mid-token (unclosed code fence).
- Tool call requires user confirmation mid-generation.
- Browser tab backgrounded: should streaming pause?
Real-world examples
- ChatGPT: SSE stream, virtualized list, optimistic rendering.
- Claude: SSE with abort, structured output rendering.
- Vercel AI SDK: useChat hook implements the streaming buffer pattern.
- Cursor: chat UI inside an editor with the same streaming model.