How would you design a payment gateway covering transaction flow, consistency, ledger systems, and fault tolerance?

Mostly a backend-leaning design: transaction flow (auth → capture), idempotency keys to prevent double-charges, a double-entry ledger as source of truth, state machines for transaction status, async webhooks for settlement, retries with exponential backoff, and reconciliation. Frontend role: never trust the client; confirm server-side.

A payment gateway is a correctness-and-reliability system — the dominant concern is "never lose or duplicate money." It's backend-heavy, but a frontend engineer should understand the flow and the client's (limited, untrusted) role.

Transaction flow

A payment isn't one step:

  1. Authorization — verify the card/account has funds and reserve them. Money isn't moved yet.
  2. Capture — actually move the reserved funds (can be immediate or later, e.g. on shipment).
  3. Settlement — funds clear between banks, asynchronously, over hours/days.
  4. Refund / void / chargeback — reverse flows.

Each transaction moves through a state machine: created → authorized → captured → settled (or failed/voided/refunded), with only valid transitions allowed.
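
A minimal sketch of such a state machine, assuming a simplified transition table. The names Status, ALLOWED and transition are illustrative, and a real gateway would likely track more states (partial captures, disputes, and so on).

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    VOIDED = "voided"
    REFUNDED = "refunded"


# Hypothetical transition table: each state maps to the states it may move to.
ALLOWED = {
    Status.CREATED: {Status.AUTHORIZED, Status.FAILED},
    Status.AUTHORIZED: {Status.CAPTURED, Status.VOIDED, Status.FAILED},
    Status.CAPTURED: {Status.SETTLED, Status.REFUNDED},
    Status.SETTLED: {Status.REFUNDED},
    Status.FAILED: set(),    # terminal
    Status.VOIDED: set(),    # terminal
    Status.REFUNDED: set(),  # terminal
}


class InvalidTransition(Exception):
    pass


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the table as data means an illegal jump such as created → settled is rejected in one place rather than in scattered if-statements.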

Idempotency — the core safety mechanism

The network is unreliable; clients retry. Every payment request carries an idempotency key: a unique id generated once per intended payment and reused on every retry of that payment. The gateway dedupes on it, so a retried or duplicated request with the same key returns the original result instead of charging again. This is what prevents double charges, and it is non-negotiable.
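
The dedupe step might look roughly like the sketch below. It assumes an in-memory dict as a stand-in for a durable store with a unique constraint on the key. IdempotentCharger, ChargeResult and the mismatch check are hypothetical, not any provider's API.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ChargeResult:
    charge_id: str
    amount_minor: int  # amount in minor units, e.g. cents


class IdempotentCharger:
    """In-memory stand-in for a durable key -> result store."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_minor: int) -> ChargeResult:
        # The lock makes check-then-charge atomic within this process.
        with self._lock:
            cached = self._results.get(idempotency_key)
            if cached is not None:
                # Assumption: reusing a key with different parameters is an error.
                if cached.amount_minor != amount_minor:
                    raise ValueError("idempotency key reused with a different amount")
                return cached  # replay the original result, no second charge
            self.charges_made += 1
            result = ChargeResult(f"ch_{self.charges_made}", amount_minor)
            self._results[idempotency_key] = result
            return result
```

A client would generate the key once (for example a UUID) when the user confirms the payment and send the same key on every retry of that payment.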

The ledger — source of truth

A double-entry ledger: every money movement is recorded as balanced debit/credit entries. The ledger is append-only and immutable — you never edit an entry, you add a reversing one. This makes the system auditable and the balance always derivable and verifiable. The ledger, not the UI or a cached balance, is the source of truth.
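
As a rough illustration, an append-only double-entry ledger can be sketched as follows. The Ledger and Entry names, and the convention that signed amounts within one transaction sum to zero, are assumptions made for this example. A production ledger would live in a database with the writes wrapped in a transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an entry cannot be edited after creation
class Entry:
    txn_id: str
    account: str
    amount_minor: int  # signed; the legs of one transaction sum to zero


class Ledger:
    def __init__(self):
        self._entries = []  # only ever appended to

    def post(self, txn_id, legs):
        """legs: list of (account, signed amount). Rejects unbalanced postings."""
        if sum(amount for _, amount in legs) != 0:
            raise ValueError("unbalanced transaction")
        self._entries.extend(Entry(txn_id, acct, amt) for acct, amt in legs)

    def reverse(self, txn_id, reversal_id):
        """Undo a transaction by appending mirror-image entries."""
        legs = [(e.account, -e.amount_minor)
                for e in self._entries if e.txn_id == txn_id]
        if not legs:
            raise KeyError(txn_id)
        self.post(reversal_id, legs)

    def balance(self, account):
        """Balances are derived from entries, never stored and mutated."""
        return sum(e.amount_minor for e in self._entries if e.account == account)

    def entry_count(self):
        return len(self._entries)
```

After a reversal the original entries are still present, so the full history remains auditable while the derived balance returns to its earlier value.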

Reliability & fault tolerance

  • Async + webhooks — settlement is slow; the gateway notifies your system via signed webhooks. Webhook handlers must be idempotent (webhooks get redelivered).
  • Retries with exponential backoff for transient failures; a dead-letter queue for ones that keep failing.
  • Reconciliation — periodic jobs compare your ledger against the bank/processor records to catch discrepancies.
  • Consistency — money operations need strong consistency (often a DB transaction around the ledger writes); avoid eventual consistency where funds are concerned.
  • Timeouts & the "unknown" state: a request that times out might have succeeded. You must be able to query the processor for the transaction's actual status and resolve it; never assume it failed.
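
To make the signed, idempotent webhook handler concrete, here is a sketch using an HMAC-SHA256 signature over the raw body. The payload fields (id, payment_id), the WebhookHandler class and the signing scheme are assumptions. Real providers define their own header formats and often include a timestamp in the signed data.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, payload: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


class WebhookHandler:
    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen_event_ids = set()  # a durable store in a real system
        self.settled = []

    def handle(self, payload: bytes, signature: str) -> str:
        expected = sign(self._secret, payload)
        # Constant-time comparison avoids leaking how much of the signature matched.
        if not hmac.compare_digest(expected.encode(), signature.encode()):
            return "rejected"
        event = json.loads(payload)
        if event["id"] in self._seen_event_ids:
            return "duplicate"  # redelivery: acknowledge without reprocessing
        self._seen_event_ids.add(event["id"])
        self.settled.append(event["payment_id"])
        return "processed"
```

The signature is checked against the raw bytes before any parsing, and the event id is what makes a redelivered webhook harmless.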

The frontend's role (small and untrusted)

  • Collect payment details, ideally in a provider-hosted iframe/SDK so card data never touches your servers, which reduces your PCI DSS scope.
  • Show transaction status — processing, success, failed — as UI only.
  • Never trust a client-side success signal. Real confirmation is server-to-server (webhook / verify call). The client can't be the source of truth for money.
  • Disable the submit button on click and send an idempotency key with the request, so a double-click or retry can't create a second charge.

The framing

"It's a correctness and reliability system — the rule is never lose or duplicate money. The flow is multi-step: authorize, capture, settle, with each transaction in a strict state machine. The core safety mechanism is idempotency keys — every request carries one so retries can't double-charge. The source of truth is an append-only double-entry ledger — immutable, auditable, balance always derivable. Reliability comes from async signed webhooks with idempotent handlers, retries with backoff, dead-letter queues, and reconciliation jobs against the processor. The frontend's role is deliberately small and untrusted: collect details in an iframe SDK, show status as UI only, and rely on server-to-server confirmation — the client never decides money moved."

Follow-up questions

  • Why is an idempotency key essential?
  • Why use a double-entry, append-only ledger?
  • What's the difference between authorization and capture?
  • Why can't the frontend be trusted to confirm a payment?

Common mistakes

  • No idempotency — retries cause double charges.
  • Treating a client-side success signal as proof of payment.
  • Mutable ledger entries instead of append-only reversals.
  • Assuming a timed-out request failed (it might have succeeded).
  • Non-idempotent webhook handlers (webhooks get redelivered).
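
The timed-out-request mistake can be avoided with a query-and-resolve loop that backs off between polls. The sketch below assumes a hypothetical lookup(payment_id) function that returns a status string or raises TransientError. Jitter, which is commonly added to the delays, is left out for brevity.

```python
import time


class TransientError(Exception):
    """Hypothetical error for failures worth retrying (timeouts, 5xx)."""


def backoff_delays(base: float, factor: float, attempts: int, cap: float):
    """Exponentially growing delays, each capped at `cap` seconds."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def resolve_unknown(lookup, payment_id, delays, sleep=time.sleep):
    """Poll until the status is known; a timeout is never treated as failure."""
    for delay in delays:
        try:
            status = lookup(payment_id)
            if status != "unknown":
                return status
        except TransientError:
            pass  # the lookup itself failed; wait and try again
        sleep(delay)
    # Out of attempts: hand off for follow-up (stand-in for a dead-letter queue).
    return "needs_manual_review"
```

Passing sleep as a parameter keeps the function testable without real waiting.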

Performance considerations

  • Correctness outranks throughput here — money operations use strong consistency and DB transactions. Async settlement and webhooks keep the user-facing path fast while the slow parts happen in the background.

Edge cases

  • A request that times out in an unknown state.
  • Duplicate or out-of-order webhooks.
  • Partial captures and partial refunds.
  • Chargebacks reversing a settled transaction.
  • Currency conversion and rounding.
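
For the rounding edge case, one common approach is to store amounts as integers in the currency's minor unit and round exactly once, at the boundary. The sketch below assumes banker's rounding and a small illustrative exponent table. The rounding mode is a policy choice, and the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Illustrative subset of minor-unit exponents (decimal places per currency).
MINOR_UNIT_EXPONENT = {"USD": 2, "JPY": 0, "KWD": 3}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string to integer minor units, rounding half to even."""
    exp = MINOR_UNIT_EXPONENT[currency]
    quantum = Decimal(1).scaleb(-exp)  # e.g. Decimal("0.01") for USD
    rounded = Decimal(amount).quantize(quantum, rounding=ROUND_HALF_EVEN)
    return int(rounded * 10 ** exp)


def split_evenly(total_minor: int, parts: int) -> list:
    """Split an amount so the parts always sum back to the total."""
    base, remainder = divmod(total_minor, parts)
    # The first `remainder` parts each carry one extra minor unit.
    return [base + (1 if i < remainder else 0) for i in range(parts)]
```

Parsing from strings rather than floats avoids binary floating-point error, and distributing the remainder explicitly means partial refunds or splits never lose or invent a minor unit.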

Real-world examples

  • Stripe/Razorpay: idempotency keys, signed webhooks, auth+capture, ledgers.
  • Reconciliation jobs catching discrepancies between internal records and the processor.
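
A reconciliation job of this kind can be sketched as a comparison of two maps from payment id to amount. The reconcile function and its report keys are assumptions for illustration. Real jobs typically work from settlement files or processor reports and also compare currency, status and fees.

```python
def reconcile(internal: dict, processor: dict) -> dict:
    """Compare {payment_id: amount_minor} maps and report discrepancies."""
    internal_ids = set(internal)
    processor_ids = set(processor)
    return {
        # We recorded it, the processor has no trace of it.
        "missing_at_processor": sorted(internal_ids - processor_ids),
        # The processor moved money that we never recorded.
        "missing_internally": sorted(processor_ids - internal_ids),
        # Both sides know the payment but disagree on the amount.
        "amount_mismatch": sorted(
            pid for pid in internal_ids & processor_ids
            if internal[pid] != processor[pid]
        ),
    }
```

An empty report means the two sides agree. Anything else would normally be raised for investigation rather than corrected automatically.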

Senior engineer discussion

Seniors center the answer on idempotency, the append-only double-entry ledger as source of truth, the auth/capture/settle state machine, and reliability patterns (webhooks, retries, reconciliation, the unknown state) — and correctly scope the frontend as small and untrusted.
