Design a Payment Gateway with discussion around transaction flow, consistency and ledger systems, reliability and fault tolerance.
Mostly a backend-leaning design: transaction flow (auth → capture), idempotency keys to prevent double-charges, a double-entry ledger as source of truth, state machines for transaction status, async webhooks for settlement, retries with exponential backoff, and reconciliation. Frontend role: never trust the client; confirm server-side.
A payment gateway is a correctness-and-reliability system — the dominant concern is "never lose or duplicate money." It's backend-heavy, but a frontend engineer should understand the flow and the client's (limited, untrusted) role.
Transaction flow
A payment isn't one step:
- Authorization — verify the card/account has funds and reserve them. Money isn't moved yet.
- Capture — actually move the reserved funds (can be immediate or later, e.g. on shipment).
- Settlement — funds clear between banks, asynchronously, over hours/days.
- Refund / void / chargeback — reverse flows.
Each transaction moves through a state machine — created → authorized → captured → settled (or failed/voided/refunded) — with only valid transitions allowed.
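One way to enforce "only valid transitions" is an explicit transition table. The sketch below is a minimal illustration, not a definitive design: the `ALLOWED` table, the `transition` helper, and the exact set of permitted moves are assumptions chosen to match the states named above.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    VOIDED = "voided"
    REFUNDED = "refunded"


# Hypothetical transition table: each status maps to the statuses it may move to.
ALLOWED = {
    Status.CREATED: {Status.AUTHORIZED, Status.FAILED},
    Status.AUTHORIZED: {Status.CAPTURED, Status.VOIDED, Status.FAILED},
    Status.CAPTURED: {Status.SETTLED, Status.REFUNDED},
    Status.SETTLED: {Status.REFUNDED},
    Status.FAILED: set(),    # terminal
    Status.VOIDED: set(),    # terminal
    Status.REFUNDED: set(),  # terminal
}


class InvalidTransition(Exception):
    """Raised when a requested status change is not in the table."""


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the table as data makes the legal moves easy to review, and a rejected move (say, created straight to settled) fails loudly instead of silently corrupting the transaction's status.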
Idempotency — the core safety mechanism
The network is unreliable; clients retry. Every payment request carries an idempotency key: an ID that is unique per logical payment and reused unchanged on every retry of that payment. The gateway dedupes on it: a retried or duplicated request with the same key returns the original result instead of charging again. This is what prevents double charges, and it is non-negotiable.
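The dedupe behaviour can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions: `IdempotentGateway`, `ChargeResult`, and `charge_fn` are hypothetical names, and a real gateway would persist keys in a database with a unique constraint so that two concurrent requests cannot both pass the check.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ChargeResult:
    charge_id: str
    amount_minor: int  # amount in minor units, e.g. cents


class IdempotentGateway:
    """Hypothetical wrapper that stores one result per idempotency key."""

    def __init__(self, charge_fn: Callable[[int], ChargeResult]) -> None:
        self._charge_fn = charge_fn
        self._results: Dict[str, ChargeResult] = {}

    def charge(self, idempotency_key: str, amount_minor: int) -> ChargeResult:
        # Key seen before: return the stored result, do not charge again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._charge_fn(amount_minor)
        self._results[idempotency_key] = result
        return result
```

A client that times out and resends with the same key gets the original `ChargeResult` back; only a request with a new key reaches `charge_fn`.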
The ledger — source of truth
A double-entry ledger: every money movement is recorded as balanced debit/credit entries. The ledger is append-only and immutable — you never edit an entry, you add a reversing one. This makes the system auditable and the balance always derivable and verifiable. The ledger, not the UI or a cached balance, is the source of truth.
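A minimal sketch of such a ledger, assuming signed integer amounts in minor units where the lines of one transaction must sum to zero. The `Ledger` class, the account names in the usage note, and the sign convention (positive for debit, negative for credit) are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: an entry cannot be edited once created
class Entry:
    txn_id: str
    account: str
    amount_minor: int  # assumed convention: positive = debit, negative = credit


class Ledger:
    def __init__(self) -> None:
        self._entries: List[Entry] = []  # only ever appended to

    def post(self, txn_id: str, lines: List[Tuple[str, int]]) -> None:
        """Append a balanced set of lines; reject anything that does not sum to zero."""
        if not lines:
            raise ValueError("nothing to post")
        if sum(amount for _, amount in lines) != 0:
            raise ValueError("unbalanced transaction")
        self._entries.extend(Entry(txn_id, acct, amt) for acct, amt in lines)

    def reverse(self, txn_id: str, reversal_id: str) -> None:
        """Undo a transaction by appending its mirror image, never by editing."""
        mirrored = [(e.account, -e.amount_minor)
                    for e in self._entries if e.txn_id == txn_id]
        self.post(reversal_id, mirrored)

    def balance(self, account: str) -> int:
        """Derive a balance by replaying every entry for the account."""
        return sum(e.amount_minor for e in self._entries if e.account == account)
```

For example, posting `[("customer_funds", 1000), ("merchant_payable", -1000)]` and then reversing it leaves four entries in the log and both balances at zero, so the full history stays auditable.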
Reliability & fault tolerance
- Async + webhooks — settlement is slow; the gateway notifies your system via signed webhooks. Webhook handlers must be idempotent (webhooks get redelivered).
- Retries with exponential backoff for transient failures; a dead-letter queue for ones that keep failing.
- Reconciliation — periodic jobs compare your ledger against the bank/processor records to catch discrepancies.
- Consistency — money operations need strong consistency (often a DB transaction around the ledger writes); avoid eventual consistency where funds are concerned.
- Timeouts & the "unknown" state — a request that times out might have succeeded; you must be able to query-and-resolve, never assume.
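The retry bullet above can be sketched as capped exponential backoff with jitter. This is an illustrative sketch, not a definitive implementation: `TransientError`, `call_with_retries`, and the default `base` and `cap` values are assumptions, and the `sleep` parameter exists only so the delay can be swapped out in tests.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical error for timeouts and other retryable failures."""


def call_with_retries(
    op: Callable[[], str],
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run op, retrying transient failures with capped exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                # Out of retries: the caller parks the job (e.g. dead-letter queue).
                raise
            delay = min(cap, base * 2 ** attempt)
            # Full jitter: wait a random time up to the computed delay.
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Because a timed-out call may have succeeded, `op` should either resend with the same idempotency key or query the transaction's status first; retrying with a fresh key would reintroduce the double-charge risk.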
The frontend's role (small and untrusted)
- Collect payment details, ideally in a processor-hosted iframe/SDK so raw card data never touches your servers, which reduces your PCI scope.
- Show transaction status (processing, success, failed) as UI only.
- Never trust a client-side success signal. Real confirmation is server-to-server (webhook / verify call). The client can't be the source of truth for money.
- Disable the submit button on click and send an idempotency key with the request, so a double-click or resubmit can't create a second charge.
The framing
"It's a correctness and reliability system — the rule is never lose or duplicate money. The flow is multi-step: authorize, capture, settle, with each transaction in a strict state machine. The core safety mechanism is idempotency keys — every request carries one so retries can't double-charge. The source of truth is an append-only double-entry ledger — immutable, auditable, balance always derivable. Reliability comes from async signed webhooks with idempotent handlers, retries with backoff, dead-letter queues, and reconciliation jobs against the processor. The frontend's role is deliberately small and untrusted: collect details in an iframe SDK, show status as UI only, and rely on server-to-server confirmation — the client never decides money moved."
Follow-up questions
- Why is an idempotency key essential?
- Why use a double-entry, append-only ledger?
- What's the difference between authorization and capture?
- Why can't the frontend be trusted to confirm a payment?
Common mistakes
- No idempotency: retries cause double charges.
- Treating a client-side success signal as proof of payment.
- Mutable ledger entries instead of append-only reversals.
- Assuming a timed-out request failed (it might have succeeded).
- Non-idempotent webhook handlers (webhooks get redelivered).
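To make the last mistake concrete, here is a minimal sketch of a webhook handler that verifies a signature and ignores redeliveries. Everything specific is an assumption: the HMAC-SHA256-over-body scheme, the `id` and `transaction_id` fields, and the `WebhookHandler` class are illustrative, and real processors define their own signing format (often including a timestamp).

```python
import hashlib
import hmac
import json
from typing import List, Set


def sign(secret: bytes, body: bytes) -> str:
    """Assumed scheme: hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


class WebhookHandler:
    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._seen_event_ids: Set[str] = set()  # a durable store in practice
        self.settled: List[str] = []

    def handle(self, body: bytes, signature: str) -> str:
        expected = sign(self._secret, body)
        # Constant-time comparison avoids leaking how much of the signature matched.
        if not hmac.compare_digest(expected.encode(), signature.encode()):
            return "rejected"
        event = json.loads(body)
        # Redelivered event: acknowledge it, but do not apply it twice.
        if event["id"] in self._seen_event_ids:
            return "duplicate"
        self._seen_event_ids.add(event["id"])
        self.settled.append(event["transaction_id"])
        return "processed"
```

Delivering the same signed body twice returns `"processed"` and then `"duplicate"`, and the transaction is recorded as settled only once.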
Performance considerations
- Correctness outranks throughput here: money operations use strong consistency and DB transactions. Async settlement and webhooks keep the user-facing path fast while the slow parts happen in the background.
Edge cases
- A request that times out in an unknown state.
- Duplicate or out-of-order webhooks.
- Partial captures and partial refunds.
- Chargebacks reversing a settled transaction.
- Currency conversion and rounding.
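The rounding and partial-amount cases are usually handled by storing money as integers in minor units rather than as floats. The sketch below is one possible approach: `to_minor_units` and `split_evenly` are hypothetical helpers, and the rounding mode (banker's rounding) is an assumption, since the correct rule depends on the currency and the processor.

```python
from decimal import Decimal, ROUND_HALF_EVEN
from typing import List


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Parse a decimal string into integer minor units (2 decimals by default)."""
    quantum = Decimal(1).scaleb(-exponent)  # Decimal("0.01") when exponent is 2
    rounded = Decimal(amount).quantize(quantum, rounding=ROUND_HALF_EVEN)
    return int(rounded.scaleb(exponent))


def split_evenly(total_minor: int, parts: int) -> List[int]:
    """Split an amount so the parts always sum back exactly to the total."""
    share, remainder = divmod(total_minor, parts)
    # The first `remainder` parts absorb the leftover minor units.
    return [share + (1 if i < remainder else 0) for i in range(parts)]
```

With these helpers, `to_minor_units("19.99")` gives `1999`, and `split_evenly(1000, 3)` gives `[334, 333, 333]`, so three partial refunds can never add up to more or less than the original capture.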
Real-world examples
- Stripe/Razorpay: idempotency keys, signed webhooks, auth+capture, ledgers.
- Reconciliation jobs catching discrepancies between internal records and the processor.
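A reconciliation job of the kind mentioned above can be reduced to a comparison of two maps. This is a simplified sketch under stated assumptions: both sides are taken to be dictionaries from transaction ID to amount in minor units, and the `reconcile` function and its issue labels are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(ledger: Dict[str, int], processor: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (transaction_id, issue) pairs for every discrepancy found."""
    issues: List[Tuple[str, str]] = []
    for txn_id in sorted(ledger.keys() | processor.keys()):
        if txn_id not in processor:
            issues.append((txn_id, "missing_at_processor"))
        elif txn_id not in ledger:
            issues.append((txn_id, "missing_in_ledger"))
        elif ledger[txn_id] != processor[txn_id]:
            issues.append((txn_id, "amount_mismatch"))
    return issues
```

In practice the processor side would come from a settlement report or API export, and each reported issue would be routed to an investigation queue rather than auto-corrected.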