How would you design a payment gateway covering transaction flow, consistency, ledger systems, and fault tolerance?

Mostly a backend-leaning design: transaction flow (auth → capture), idempotency keys to prevent double-charges, a double-entry ledger as source of truth, state machines for transaction status, async webhooks for settlement, retries with exponential backoff, and reconciliation. Frontend role: never trust the client; confirm server-side.


A payment gateway is a correctness-and-reliability system — the dominant concern is "never lose or duplicate money." It's backend-heavy, but a frontend engineer should understand the flow and the client's (limited, untrusted) role.

Transaction flow

A payment isn't one step:

Authorization — verify the card/account has funds and reserve them. Money isn't moved yet.
Capture — actually move the reserved funds (can be immediate or later, e.g. on shipment).
Settlement — funds clear between banks, asynchronously, over hours/days.
Refund / void / chargeback — reverse flows.

Each transaction moves through a state machine — created → authorized → captured → settled (or failed/voided/refunded) — with only valid transitions allowed.
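
The state machine above can be sketched as a lookup table of allowed moves. This is a minimal sketch, not a definitive implementation: the names Status, ALLOWED, and transition are hypothetical, and the exact transition table (for example, whether a captured payment can be refunded before it settles) is an assumption that varies by processor.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    VOIDED = "voided"
    REFUNDED = "refunded"


# Allowed transitions. This table is an illustrative assumption, not a standard.
ALLOWED = {
    Status.CREATED: {Status.AUTHORIZED, Status.FAILED},
    Status.AUTHORIZED: {Status.CAPTURED, Status.VOIDED, Status.FAILED},
    Status.CAPTURED: {Status.SETTLED, Status.REFUNDED},
    Status.SETTLED: {Status.REFUNDED},
    Status.FAILED: set(),
    Status.VOIDED: set(),
    Status.REFUNDED: set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not listed in ALLOWED."""


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the rules in one table means an illegal jump such as created → settled is rejected in a single place instead of being checked ad hoc across handlers.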

Idempotency — the core safety mechanism

The network is unreliable; clients retry. Every payment request carries an idempotency key: a unique id generated once per logical payment and reused on every retry of that payment. The gateway dedupes on it: a retried or duplicated request with the same key returns the original result instead of charging again. This is what prevents double charges, and it is non-negotiable.
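
The dedupe step might look like the sketch below. It is an in-memory stand-in under stated assumptions: IdempotentCharger, ChargeResult, and the charge_id format are hypothetical names, and a real gateway would keep the key in a durable store with a unique constraint rather than a dict guarded by a lock.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ChargeResult:
    charge_id: str
    amount_minor: int  # amount in minor units, e.g. cents


class IdempotentCharger:
    """In-memory stand-in for a durable store with a unique key constraint."""

    def __init__(self):
        self._results = {}  # idempotency key -> (request fingerprint, result)
        self._lock = threading.Lock()
        self.charges_made = 0  # counts real charges, for demonstration only

    def charge(self, key: str, account: str, amount_minor: int) -> ChargeResult:
        fingerprint = (account, amount_minor)
        # The lock makes "check key, then charge, then store" one atomic step.
        with self._lock:
            if key in self._results:
                stored_fingerprint, result = self._results[key]
                if stored_fingerprint != fingerprint:
                    # Same key with a different request is a client bug.
                    raise ValueError("idempotency key reused with different parameters")
                return result  # replay the original outcome, no second charge
            self.charges_made += 1
            result = ChargeResult(f"ch_{self.charges_made}", amount_minor)
            self._results[key] = (fingerprint, result)
            return result
```

Calling charge twice with the same key and parameters returns the same ChargeResult and leaves charges_made at 1. Storing a fingerprint of the request alongside the result is a design choice assumed here; it catches a key that is accidentally reused for a different payment.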

The ledger — source of truth

A double-entry ledger: every money movement is recorded as balanced debit/credit entries. The ledger is append-only and immutable — you never edit an entry, you add a reversing one. This makes the system auditable and the balance always derivable and verifiable. The ledger, not the UI or a cached balance, is the source of truth.
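
A minimal sketch of such a ledger follows. The names Entry and Ledger, the account labels used in the usage note, and the sign convention (positive for debit, negative for credit) are all assumptions made for illustration; a production ledger would live in a database with the writes wrapped in a transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    amount_minor: int  # signed minor units; positive = debit, negative = credit (assumed convention)


class Ledger:
    def __init__(self):
        self._journal = []  # (txn_id, entries) tuples; only ever appended to

    def post(self, txn_id, entries):
        """Append a transaction, refusing any set of entries that does not net to zero."""
        if sum(e.amount_minor for e in entries) != 0:
            raise ValueError(f"unbalanced transaction {txn_id}")
        self._journal.append((txn_id, tuple(entries)))

    def reverse(self, txn_id, reversal_id):
        """Undo a transaction by appending its mirror image; the original stays untouched."""
        original = next((ents for tid, ents in self._journal if tid == txn_id), None)
        if original is None:
            raise KeyError(txn_id)
        self.post(reversal_id, [Entry(e.account, -e.amount_minor) for e in original])

    def balance(self, account):
        """Derive a balance by replaying the journal rather than reading a stored total."""
        return sum(
            e.amount_minor
            for _, entries in self._journal
            for e in entries
            if e.account == account
        )
```

For example, posting a capture as Entry("processor_receivable", 1000) and Entry("merchant_payable", -1000) leaves merchant_payable at -1000. Reversing that transaction appends a second journal record and brings the balance back to 0 without editing the first.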

Reliability & fault tolerance

Async + webhooks: settlement is slow, so the gateway notifies your system via signed webhooks once the outcome is known. Verify the signature before acting, and make handlers idempotent, because webhooks get redelivered.
Retries with exponential backoff for transient failures; a dead-letter queue for ones that keep failing.
Reconciliation — periodic jobs compare your ledger against the bank/processor records to catch discrepancies.
Consistency — money operations need strong consistency (often a DB transaction around the ledger writes); avoid eventual consistency where funds are concerned.
Timeouts & the "unknown" state — a request that times out might have succeeded; you must be able to query-and-resolve, never assume.
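
The retry and unknown-state items above can be combined in one helper, sketched below. The names call_with_backoff, send, lookup, and TransientError are hypothetical: send(key) stands for the charge call and lookup(key) for a status query that returns the recorded result or None. The delay values are arbitrary, and the jitter strategy is one common choice rather than the only one.

```python
import random
import time


class TransientError(Exception):
    """A failure that is safe to retry, e.g. a connection reset."""


def call_with_backoff(send, lookup, key, max_attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call send(key) with retries; on a timeout, ask lookup(key) before retrying.

    The same idempotency key is reused on every attempt, so a retry that
    races with a late success is deduped by the server.
    """
    for attempt in range(max_attempts):
        try:
            return send(key)
        except TimeoutError:
            # Unknown state: the charge may have gone through. Query first.
            result = lookup(key)
            if result is not None:
                return result
        except TransientError:
            pass  # known failure, safe to retry
        if attempt < max_attempts - 1:
            # Exponential backoff with random jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("retries exhausted; route to a dead-letter queue for follow-up")
```

A timeout is handled differently from a known transient failure: the helper never assumes the request failed, it queries and resolves first.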

The frontend's role (small and untrusted)

Collect payment details, ideally in a processor-hosted iframe/SDK so raw card data never touches your servers, which shrinks your PCI scope.
Show transaction status — processing, success, failed — as UI only.
Never trust a client-side success signal. Real confirmation is server-to-server (webhook / verify call). The client can't be the source of truth for money.
Disable the submit button on click and send an idempotency key with the request, so a double-click or a retry cannot create a second charge.

The framing

"It's a correctness and reliability system — the rule is never lose or duplicate money. The flow is multi-step: authorize, capture, settle, with each transaction in a strict state machine. The core safety mechanism is idempotency keys — every request carries one so retries can't double-charge. The source of truth is an append-only double-entry ledger — immutable, auditable, balance always derivable. Reliability comes from async signed webhooks with idempotent handlers, retries with backoff, dead-letter queues, and reconciliation jobs against the processor. The frontend's role is deliberately small and untrusted: collect details in an iframe SDK, show status as UI only, and rely on server-to-server confirmation — the client never decides money moved."

Follow-up questions

•Why is an idempotency key essential?
•Why use a double-entry, append-only ledger?
•What's the difference between authorization and capture?
•Why can't the frontend be trusted to confirm a payment?

Common mistakes

•No idempotency — retries cause double charges.
•Treating a client-side success signal as proof of payment.
•Mutable ledger entries instead of append-only reversals.
•Assuming a timed-out request failed (it might have succeeded).
•Non-idempotent webhook handlers (webhooks get redelivered).
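
The last mistake above is avoided by recording each event id before acting on it. The sketch below also checks the signature first. It is a simplified illustration: WebhookHandler and the event fields id and payment_id are hypothetical, and the scheme (an HMAC-SHA256 hex digest of the raw body) is an assumption. Real processors define their own header format and often sign a timestamp too, so follow their documentation.

```python
import hashlib
import hmac
import json


class WebhookHandler:
    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen = set()  # processed event ids; a durable store in practice
        self.settled = []   # stand-in for the real side effect

    def handle(self, body: bytes, signature_hex: str) -> str:
        # Recompute the signature over the raw body and compare in constant time.
        expected = hmac.new(self._secret, body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
            return "rejected"
        event = json.loads(body)
        if event["id"] in self._seen:
            return "duplicate"  # redelivery: acknowledge without repeating the effect
        self._seen.add(event["id"])
        self.settled.append(event["payment_id"])
        return "processed"
```

Delivering the same signed event twice yields "processed" and then "duplicate", and the side effect runs once.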

Performance considerations

•Correctness outranks throughput here — money operations use strong consistency and DB transactions. Async settlement and webhooks keep the user-facing path fast while the slow parts happen in the background.

Edge cases

•A request that times out in an unknown state.
•Duplicate or out-of-order webhooks.
•Partial captures and partial refunds.
•Chargebacks reversing a settled transaction.
•Currency conversion and rounding.
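
For the rounding edge case, one common approach is to keep amounts as integers in minor units and round exactly once, explicitly. The helpers below are a sketch under assumptions: convert_minor and split_evenly are hypothetical names, the conversion assumes both currencies use the same number of minor-unit digits, and banker's rounding is one possible policy rather than a rule.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def convert_minor(amount_minor: int, rate: Decimal) -> int:
    """Convert an integer minor-unit amount at the given rate, rounding once."""
    converted = Decimal(amount_minor) * rate
    return int(converted.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


def split_evenly(amount_minor: int, parts: int) -> list:
    """Split an amount into parts that always sum back to the original."""
    base, remainder = divmod(amount_minor, parts)
    # The first `remainder` parts carry one extra minor unit each.
    return [base + 1 if i < remainder else base for i in range(parts)]
```

Splitting 100 minor units three ways gives [34, 33, 33], so no unit is lost or created, which matters for partial refunds as well.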

Real-world examples

•Stripe/Razorpay: idempotency keys, signed webhooks, auth+capture, ledgers.
•Reconciliation jobs catching discrepancies between internal records and the processor.
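
A reconciliation job of the kind mentioned above reduces to a set comparison. This is a minimal sketch: the function name reconcile, the {payment_id: amount_minor} input shape, and the report keys are assumptions, and a real job would also compare currency, status, and fees.

```python
def reconcile(ledger_rows: dict, processor_rows: dict) -> dict:
    """Compare {payment_id: amount_minor} maps and group the differences."""
    ledger_ids = ledger_rows.keys()
    processor_ids = processor_rows.keys()
    return {
        # Recorded internally but unknown to the processor.
        "missing_at_processor": sorted(ledger_ids - processor_ids),
        # Reported by the processor but absent from the ledger.
        "missing_in_ledger": sorted(processor_ids - ledger_ids),
        # Present on both sides with different amounts.
        "amount_mismatch": sorted(
            pid for pid in ledger_ids & processor_ids
            if ledger_rows[pid] != processor_rows[pid]
        ),
    }
```

An empty list under every key means the two sides agree; anything else goes to a human or an automated repair flow.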

Senior engineer discussion

Seniors center the answer on idempotency, the append-only double-entry ledger as source of truth, the auth/capture/settle state machine, and reliability patterns (webhooks, retries, reconciliation, the unknown state) — and correctly scope the frontend as small and untrusted.