System Design
hard
senior

How would you build a real time order tracker using WebSockets?

WebSocket connection authenticated at handshake, subscribed to a per-order channel. Server pushes status events; client merges into local state (TanStack Query cache or a simple reducer). Handle reconnection with backoff + resubscription, request a snapshot on connect to fill missed events, dedupe by event id, and fall back to polling on persistent failure. Consider SSE for one-way streams — simpler infrastructure, automatic reconnect.


A real-time tracker has three system properties:

  1. Push — server drives updates (not polling).
  2. Per-order subscription — only the orders the user cares about.
  3. Resumable — reconnect + catch-up on missed events.

Transport choice

Option | Pros | Cons
WebSocket | Bidirectional, low overhead per message, mature | Needs sticky sessions or a pub/sub backend; reconnect logic written by hand
SSE (Server-Sent Events) | One-way, automatic reconnect, plain HTTP, CDN-friendly | One-way only; some proxies buffer
Long polling | Works everywhere | Inefficient; higher latency
HTTP/3 + WebTransport | Modern, multiplexed, datagrams + streams | Browser support patchy in 2026

For an order tracker (server pushes, client mostly listens), SSE is the right default. If you also need client → server messages (live chat with the driver, a "delivered" confirmation from the rider), use WebSocket.
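If SSE is the transport, the per-event sequence number maps onto the SSE `id:` field, which the browser sends back in the `Last-Event-ID` request header when `EventSource` reconnects on its own. Below is a minimal sketch of the server-side framing. `OrderEvent`, `formatSseEvent`, and `resumeFrom` are hypothetical names chosen for illustration.

```typescript
// Hypothetical event shape; field names are illustrative only.
interface OrderEvent {
  seq: number;
  status: string;
}

// Serialize one event as an SSE frame: an `id:` line, a `data:` line,
// and a blank line that terminates the frame.
function formatSseEvent(event: OrderEvent): string {
  return `id: ${event.seq}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Turn the Last-Event-ID header into a resume point.
// 0 means "no usable id, send everything still buffered".
function resumeFrom(lastEventId: string | undefined): number {
  const n = Number(lastEventId);
  return Number.isInteger(n) && n > 0 ? n : 0;
}
```

On reconnect, the server would read `Last-Event-ID`, pass it through `resumeFrom`, and replay anything newer before streaming live events.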

Architecture

ts
Client ── WS ──► Edge gateway ── pub/sub ──► Order service
                  (sticky to a            (publishes events
                   pod via cookie)         on order changes)

                  Redis Pub/Sub, NATS, Kafka, or
                  managed service (Pusher, Ably, Liveblocks)

The gateway holds the long-lived connection. The order service is stateless and publishes events to a pub/sub layer. The gateway subscribes to the topics its connected users care about and forwards each event to the matching sockets.
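The gateway's bookkeeping can be sketched as a map from channel to local connections. This is an in-memory illustration only, with no real pub/sub client; `Connection` and `ChannelRegistry` are hypothetical names. The boolean return values signal when the gateway would need to subscribe to or unsubscribe from the upstream topic.

```typescript
// Minimal connection abstraction so the sketch does not depend on a WebSocket library.
interface Connection {
  send(data: string): void;
}

class ChannelRegistry {
  private subscribers = new Map<string, Set<Connection>>();

  // Returns true when this is the first local subscriber for the channel,
  // i.e. the gateway should now subscribe to it on the pub/sub backend.
  subscribe(channel: string, conn: Connection): boolean {
    let set = this.subscribers.get(channel);
    const first = set === undefined;
    if (set === undefined) {
      set = new Set<Connection>();
      this.subscribers.set(channel, set);
    }
    set.add(conn);
    return first;
  }

  // Returns true when the last local subscriber left (time to unsubscribe upstream).
  unsubscribe(channel: string, conn: Connection): boolean {
    const set = this.subscribers.get(channel);
    if (set === undefined) return false;
    set.delete(conn);
    if (set.size > 0) return false;
    this.subscribers.delete(channel);
    return true;
  }

  // Called when the pub/sub layer delivers an event.
  // Returns how many local sockets received it.
  publish(channel: string, message: unknown): number {
    const set = this.subscribers.get(channel);
    if (set === undefined) return 0;
    const data = JSON.stringify(message);
    set.forEach((conn) => conn.send(data));
    return set.size;
  }
}
```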

Authentication

WebSocket handshake is an HTTP upgrade — pass auth via:

  • Cookie (HttpOnly auth cookie): automatic, but browsers attach cookies to cross-site WebSocket handshakes, so validate the Origin header on the upgrade endpoint.
  • Token in the Sec-WebSocket-Protocol subprotocol header: works, slightly weird.
  • Query param: avoid, it gets logged everywhere.
  • First-message auth: connect, send { type: "auth", token }, and the server closes the connection if the token is invalid.

Re-validate auth on token rotation; don't keep an indefinitely-old session alive.
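The first-message option can be sketched as a pure function the gateway runs on the first frame of every connection. The `verify` callback is a hypothetical stand-in for real token validation, and the close code 4401 is an assumption: an application-defined value taken from the 4000-4999 private-use range of RFC 6455.

```typescript
type AuthResult =
  | { ok: true; userId: string }
  | { ok: false; closeCode: number; reason: string };

// Hypothetical verifier: returns the user id for a valid token, or null.
type VerifyToken = (token: string) => string | null;

// Application-defined close code (assumption, not a standard value).
const CLOSE_UNAUTHORIZED = 4401;

function handleFirstMessage(raw: string, verify: VerifyToken): AuthResult {
  let msg: any; // untrusted input; every field is checked below
  try {
    msg = JSON.parse(raw);
  } catch {
    return { ok: false, closeCode: CLOSE_UNAUTHORIZED, reason: "malformed message" };
  }
  if (
    msg === null ||
    typeof msg !== "object" ||
    msg.type !== "auth" ||
    typeof msg.token !== "string"
  ) {
    return { ok: false, closeCode: CLOSE_UNAUTHORIZED, reason: "expected auth message" };
  }
  const userId = verify(msg.token);
  return userId === null
    ? { ok: false, closeCode: CLOSE_UNAUTHORIZED, reason: "invalid token" }
    : { ok: true, userId };
}
```

A gateway would typically also close connections that send nothing within a few seconds, so unauthenticated sockets do not linger.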

Subscription model

ts
client: { type: "subscribe", channel: "order:abc123" }
server: { type: "ok", channel: "order:abc123" }
server: { type: "event", channel: "order:abc123", seq: 42, payload: {...} }

Per-channel sequence numbers (seq) are the single most important piece. They let you detect missed events on reconnect.
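A minimal sketch of how a client might classify each incoming seq, assuming the server increments seq by exactly 1 per channel (the text above does not state the increment, so treat that as an assumption). `checkSeq` is a hypothetical helper name.

```typescript
type SeqCheck = "apply" | "duplicate" | "gap";

// lastSeq is the highest seq already applied for the channel,
// or undefined if nothing has been seen yet.
function checkSeq(lastSeq: number | undefined, incoming: number): SeqCheck {
  if (lastSeq === undefined) return "apply"; // first event on this channel
  if (incoming <= lastSeq) return "duplicate"; // already applied, drop it
  if (incoming === lastSeq + 1) return "apply"; // the expected next event
  return "gap"; // something was missed: resubscribe with `since` or refetch
}
```

On "gap" the client would not apply the event directly; it would ask the server to replay from `lastSeq` or reload a snapshot.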

Reconnection

ts
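// Hypothetical endpoint; substitute the real gateway URL.
const WS_URL = "wss://example.com/realtime";
type Listener = (payload: unknown) => void;

class RealtimeClient {
  ws: WebSocket | null = null;
  backoff = 1000;
  lastSeq: Map<string, number> = new Map();
  subscriptions = new Set<string>();
  listeners = new Map<string, Set<Listener>>();

  connect() {
    this.ws = new WebSocket(WS_URL);
    this.ws.onopen = () => {
      this.backoff = 1000; // reset after a successful connect
      for (const ch of this.subscriptions) {
        // `since` is undefined on the first subscribe for a channel.
        const since = this.lastSeq.get(ch);
        this.send({ type: "subscribe", channel: ch, since });
      }
    };
    this.ws.onmessage = (e) => this.handle(JSON.parse(e.data));
    this.ws.onclose = () => {
      this.ws = null;
      // Jitter: wait between 50% and 100% of the current backoff.
      const delay = this.backoff * (0.5 + Math.random() / 2);
      setTimeout(() => this.connect(), delay);
      this.backoff = Math.min(this.backoff * 2, 30_000);
    };
  }

  // Returns an unsubscribe function, so it can serve as a useEffect cleanup.
  subscribe(channel: string, listener: Listener) {
    this.subscriptions.add(channel);
    if (!this.listeners.has(channel)) this.listeners.set(channel, new Set());
    this.listeners.get(channel)!.add(listener);
    this.send({ type: "subscribe", channel, since: this.lastSeq.get(channel) });
    return () => { this.listeners.get(channel)?.delete(listener); };
  }

  send(msg: object) {
    // Dropped if the socket is not open; onopen resubscribes every channel.
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(JSON.stringify(msg));
  }

  emit(channel: string, payload: unknown) {
    this.listeners.get(channel)?.forEach((fn) => fn(payload));
  }

  handle(msg: any) {
    if (msg.type === "event") {
      const last = this.lastSeq.get(msg.channel) ?? 0;
      if (msg.seq <= last) return; // dedupe
      this.lastSeq.set(msg.channel, msg.seq);
      this.emit(msg.channel, msg.payload);
    }
  }
}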

  • Exponential backoff with jitter: don't thunder-herd reconnect after a server outage.
  • since on resubscribe: the server replays events newer than since from its buffer (Redis Streams, Kafka offset, in-memory ring buffer).
  • Heartbeat: ping/pong every 30s. This detects dead connections faster than TCP keepalive (which can take minutes).
  • Online/offline events: window.addEventListener("online", reconnectImmediately).
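
The server side of `since` can be sketched as a bounded in-memory buffer per channel. This is an illustration under simple assumptions (single process, seq starting at 1 and incrementing by 1); `ReplayBuffer` is a hypothetical name, and a production system would more likely lean on Redis Streams or Kafka offsets as listed above.

```typescript
interface BufferedEvent<T> {
  seq: number;
  payload: T;
}

class ReplayBuffer<T> {
  private events: BufferedEvent<T>[] = [];
  private nextSeq = 1;

  constructor(private readonly capacity: number) {}

  // Assign the next seq and keep only the newest `capacity` events.
  append(payload: T): BufferedEvent<T> {
    const event: BufferedEvent<T> = { seq: this.nextSeq++, payload };
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift();
    return event;
  }

  // Events newer than `since`, or null when the buffer no longer reaches
  // back that far and the client has to refetch a snapshot instead.
  replaySince(since: number): BufferedEvent<T>[] | null {
    const oldest = this.events.length > 0 ? this.events[0].seq : this.nextSeq;
    if (since < oldest - 1) return null;
    return this.events.filter((e) => e.seq > since);
  }
}
```

Returning null rather than a partial replay makes the "buffer too short" case explicit, so the client never silently skips events.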

Catch-up: snapshot + delta

When reconnecting or loading a new order page, pair the snapshot with the subscription: fetch the full state, note the seq it was taken at, then subscribe from that seq:

ts
client: GET /orders/abc123      → returns full state at seq=42
client: WS subscribe since=42   → server pushes events seq>42

Otherwise, events arriving while the snapshot was in flight may be lost or duplicated.
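A minimal sketch of the merge step on the client, assuming each event carries the full new status (so applying it is a plain overwrite). `Snapshot`, `StatusEvent`, and `mergeSnapshot` are hypothetical names. The function tolerates events that arrive out of order, duplicated, or older than the snapshot.

```typescript
interface Snapshot {
  seq: number;
  status: string;
}

interface StatusEvent {
  seq: number;
  status: string;
}

// Apply only events newer than the current state, in seq order.
// Events at or below the snapshot's seq are already reflected in it.
function mergeSnapshot(snapshot: Snapshot, events: StatusEvent[]): Snapshot {
  return events
    .slice() // do not mutate the caller's array
    .sort((a, b) => a.seq - b.seq)
    .reduce<Snapshot>(
      (state, e) => (e.seq > state.seq ? { seq: e.seq, status: e.status } : state),
      snapshot,
    );
}
```

The same function works if the client subscribes first and buffers events while the snapshot request is in flight, then merges the buffer once the snapshot arrives.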

State on the client

Two viable patterns:

1. TanStack Query as the cache.

tsx
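const queryClient = useQueryClient();
const { data } = useQuery({ queryKey: ["order", id], queryFn: fetchOrder });
useRealtime(`order:${id}`, (event) => {
  // Use an updater so the event is applied to the latest cached value.
  queryClient.setQueryData(["order", id], (prev) => applyEvent(prev, event));
});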

Cache holds the canonical state; subscriptions mutate it; UI re-renders.

2. Custom reducer.

tsx
const [state, dispatch] = useReducer(orderReducer, initial);
useEffect(() => realtime.subscribe(`order:${id}`, dispatch), [id]);

Fits well when state is complex (multi-step status with derived UI).
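
A minimal sketch of what such a reducer might look like. The status values and event kinds here are invented for illustration and are not taken from any real order API. Tracking `lastSeq` in the state keeps the reducer idempotent when an event is delivered twice.

```typescript
// Hypothetical status values and event kinds.
type OrderStatus = "placed" | "preparing" | "picked_up" | "delivered" | "cancelled";

interface OrderState {
  status: OrderStatus;
  etaMinutes: number | null;
  lastSeq: number;
}

type OrderEvent =
  | { seq: number; kind: "status_changed"; status: OrderStatus }
  | { seq: number; kind: "eta_updated"; etaMinutes: number };

function orderReducer(state: OrderState, event: OrderEvent): OrderState {
  // Ignore stale or duplicate events so replays are harmless.
  if (event.seq <= state.lastSeq) return state;
  switch (event.kind) {
    case "status_changed":
      return { ...state, status: event.status, lastSeq: event.seq };
    case "eta_updated":
      return { ...state, etaMinutes: event.etaMinutes, lastSeq: event.seq };
    default:
      return state;
  }
}
```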

Failure modes

  • Server hard kill / deploy: clients reconnect to a new pod and replay from a durable log (Kafka, Redis Streams). Plain Redis Pub/Sub is fire-and-forget, so events published while a client is disconnected are lost unless they are also buffered elsewhere.
  • Network blip: backoff handles it; since resubscribe catches up.
  • Stale tab: heartbeat detects; reconnect refreshes data.
  • Stale token: server closes the connection on token rotation; client refreshes auth, reconnects.
  • CDN / corporate proxy strips WebSockets: fall back to long-polling or SSE.

Scale at the gateway

  • One long-lived TCP connection per user, often per tab. 100k concurrent users = 100k sockets. A modern Node/Go gateway can do 100k+ per box.
  • Sticky sessions — once a user connects to pod X, subsequent reconnects should hit pod X to use the same buffer. Load balancer sticky cookies; or, decouple — gateways are stateless, pub/sub holds the events.
  • Fan-out — order updates may fan out to thousands of subscribers (a popular live event). Use a hierarchical pub/sub; don't direct-publish from the order service to gateways.
  • Backpressure — if a client is slow, the queue on the gateway grows. Drop / coalesce updates rather than OOMing.
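
The coalescing option for backpressure can be sketched as a per-connection queue that keeps only the newest update per key. This is only safe when each update carries the full latest state (for example, the current status or rider location); it would drop information if updates were deltas. `CoalescingQueue` is a hypothetical name.

```typescript
class CoalescingQueue<T> {
  private pending = new Map<string, T>();

  // Newer updates overwrite older ones for the same key, so memory is
  // bounded by the number of distinct keys rather than the event rate.
  push(key: string, update: T): void {
    this.pending.set(key, update);
  }

  // Take everything that is waiting, in first-seen key order, and reset.
  drain(): T[] {
    const out: T[] = [];
    this.pending.forEach((update) => out.push(update));
    this.pending.clear();
    return out;
  }

  get size(): number {
    return this.pending.size;
  }
}
```

The gateway would call `drain()` whenever the socket is ready to accept more data, so a slow client receives the latest state instead of a backlog.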

UI

  • Connection state indicator — connected, reconnecting, offline.
  • Optimistic updates for outgoing actions (e.g., "cancel order") with rollback on server reject.
  • Smooth animations between status changes — status moves are visually meaningful.
  • Polling fallback — if WS fails 3× consecutively, fall back to GET /orders/:id every 30s.
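
The polling fallback rule can be sketched as a small counter that picks the transport. `TransportSelector` is a hypothetical name; the threshold of 3 comes from the bullet above, and the actual polling loop is left out.

```typescript
type Transport = "websocket" | "polling";

class TransportSelector {
  private consecutiveFailures = 0;

  constructor(private readonly maxFailures = 3) {}

  // A successful connection resets the failure streak.
  onConnected(): Transport {
    this.consecutiveFailures = 0;
    return "websocket";
  }

  // After `maxFailures` failures in a row, switch to polling.
  onFailure(): Transport {
    this.consecutiveFailures += 1;
    return this.consecutiveFailures >= this.maxFailures ? "polling" : "websocket";
  }
}
```

While polling, a client might still retry the WebSocket occasionally and call `onConnected()` when it succeeds, so a temporary outage does not leave it on the slower path.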

Build vs buy

Need | Build | Buy
Internal tool, 1k concurrent users | Build (ws + Redis pub/sub) | Overkill
Customer-facing, 100k+ concurrent | Hard but possible | Pusher, Ably, Liveblocks, AWS IoT, Convex
Collaborative editing | Build is very hard | Yjs + Hocuspocus / Liveblocks

Buy when realtime isn't your differentiation. Build when ops are part of your competence.

Senior framing. The interviewer wants: (1) transport choice with reason, (2) seq + snapshot for resumability, (3) reconnect with backoff + heartbeat, (4) pub/sub for fan-out, (5) graceful degradation. The "we use WebSockets" answer is shallow; the architecture above is senior.

Follow-up questions

  • Why is sequence number on the wire the most important detail?
  • When would you pick SSE over WebSockets?
  • How do you avoid losing events between snapshot and subscription?
  • What's the scaling bottleneck — connections, fan-out, or pub/sub?

Common mistakes

  • Not deduping events on reconnect → double-applied state changes.
  • Reconnecting without `since` → missed events.
  • Auth via query param → token leaks into logs.
  • No heartbeat → silent dead connections.

Performance considerations

  • Coalesce updates server-side — don't send a tick per pixel of progress.
  • Use binary protocols (msgpack, protobuf) for high-rate channels.
  • Per-tab connection multiplied by users can blow out gateway file-descriptor limits.

Edge cases

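  • Proxies that buffer SSE: when nginx sits in front of your own service, send the `X-Accel-Buffering: no` response header to disable buffering. Corporate proxies outside your control may still buffer, so keep the polling fallback.
  • Mobile Safari closes WS in background: reconnect on visibility change.
  • Browser tab throttled in background: events queue and are delivered on focus.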

Real-world examples

  • Uber, DoorDash live tracking. Stripe Connect updates. Linear's live sync. Slack message streams.
