Distributed Transactions: How to Actually Answer This in a System Design Interview

Senior Backend Engineer with 11+ years building distributed systems in fintech and payments — across PayPal, Goldman Sachs, Amazon, and early-stage startups. My work has spanned real-time payment infrastructure (PIX Brazil), crypto wallet platforms, large-scale monitoring automation that cut MTTR by 78%, and event-driven analytics pipelines. I gravitate toward problems where correctness and scale are both non-negotiable — the kind of problems that break naive designs quickly. Outside of engineering, I mentor backend engineers on system design, interview strategy, and career navigation. I run a mentoring practice on Topmate (Top 1%, People's Choice) and believe the best investment you can make early in your career is learning to communicate your thinking as clearly as you execute it. I write about distributed systems, fintech engineering patterns, and what senior-level interviews actually test — drawing from real systems I've built and real mistakes I've made.

Most candidates freeze when distributed transactions come up. They either recite "use 2PC" and stop — or they over-engineer into a saga rabbit hole before the interviewer has even asked a follow-up.

In 11+ years of building payment systems across fintech and e-commerce, this is the topic I've seen trip up senior engineers most often in interviews. Here's how to approach it the right way.


Why This Topic Comes Up

Distributed transactions appear in system design interviews whenever the problem involves multiple services that need to agree on an outcome. Classic scenarios:

  • A payment service debits a wallet and an order service marks an order as paid — both must succeed or both must roll back.

  • A booking system reserves a seat and charges a card simultaneously.

  • A logistics platform updates inventory and dispatches a delivery agent in the same flow.

The interviewer isn't testing whether you know the definition of atomicity. They're testing whether you understand why the solutions that work in a monolith break in distributed systems — and what you'd reach for instead.


Start Here: Why Monolith Solutions Don't Transfer

In a monolithic application, a single @Transactional annotation on a method buys you ACID guarantees across multiple DB operations, because they all run against one database inside one local transaction. The transaction manager and the database are co-located. Rollback is cheap. Failure modes are simple.

The moment you split that into two services with two separate databases, that safety net disappears entirely.

You now have three problems:

  1. Atomicity across networks — a network call can fail after the remote side has already committed.

  2. No shared transaction manager — there's no single coordinator that both services trust.

  3. Partial failure is the default — in distributed systems, the question isn't if something will fail, it's when and which part.

State this clearly at the start of your answer. It signals that you understand the root cause, not just the solutions.
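To make problem 1 concrete, here is a minimal sketch of a call whose response is lost after the remote side has committed. The WalletService class and the call_over_flaky_network helper are hypothetical stand-ins, and the "network" is simulated by raising a TimeoutError after the debit has been applied.

```python
class WalletService:
    """Hypothetical remote service with its own private state (its 'database')."""

    def __init__(self, balance: int) -> None:
        self.balance = balance

    def debit(self, amount: int) -> None:
        self.balance -= amount  # the remote, local commit succeeds


def call_over_flaky_network(wallet: WalletService, amount: int) -> None:
    """Simulates a request whose response is lost after the remote commit."""
    wallet.debit(amount)
    raise TimeoutError("response lost after remote commit")


wallet = WalletService(balance=100)
try:
    call_over_flaky_network(wallet, 30)
    outcome = "confirmed"
except TimeoutError:
    # The caller cannot tell "never applied" from "applied, reply lost".
    outcome = "unknown"

print(outcome, wallet.balance)
```

The caller sees only a timeout, yet the wallet has already been debited. A blind retry would debit twice, and a blind "treat as failed" would leave money missing, which is why the rest of the answer leans on coordination and idempotency.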


The Solutions — and When to Use Each

Option 1: Two-Phase Commit (2PC)

The classic academic answer. Worth explaining correctly.

How it works:

A coordinator (your service, or a dedicated transaction manager) runs two phases:

  • Phase 1 — Prepare: The coordinator asks all participants "can you commit?" Each participant locks the necessary resources, writes to a prepare log, and votes Yes or No.

  • Phase 2 — Commit/Abort: If all voted Yes, the coordinator sends Commit to all. If any voted No, it sends Abort.
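The two phases can be sketched in a few lines. This is an in-memory illustration under simplifying assumptions, not a production coordinator: the Participant class, its state strings, and the two_phase_commit function are hypothetical names, and nothing here is durable or networked.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory participant; a real one persists its prepare log."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> Vote:
        if self.can_commit:
            self.state = "prepared"  # resources locked, prepare record written
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> str:
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    # A real coordinator would durably log the decision here, before Phase 2.
    # Phase 2: broadcast the single decision to everyone.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision


happy = [Participant("wallet"), Participant("orders")]
print(two_phase_commit(happy), [p.state for p in happy])

unhappy = [Participant("wallet"), Participant("orders", can_commit=False)]
print(two_phase_commit(unhappy), [p.state for p in unhappy])
```

Notice that a single No vote aborts everyone, including participants that were ready to commit. That all-or-nothing decision is the point of the protocol.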

Where it works:

  • Same-datacenter services where network latency is low and predictable.

  • Systems where strong consistency is non-negotiable (financial ledgers, regulated records).

  • Small participant counts (2–3 services). The more participants, the worse the failure surface becomes.

The failure modes interviewers will push on:

What happens if the coordinator crashes after Phase 1 but before Phase 2?

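Every participant that voted Yes is locked and waiting. It can't commit on its own, because it doesn't know whether the other participants voted Yes. It can't abort either, because the coordinator may already have decided to commit (and may even have told some participants) before crashing. This is the blocking problem with 2PC. Recovery requires the coordinator to come back up and replay its decision log. The fallback, a timeout-based unilateral abort, steps outside the protocol's guarantee and has to be paired with compensating transactions to repair any participant that did commit.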

What happens if a participant crashes after voting Yes but before receiving the Commit?

On restart it checks its prepare log, contacts the coordinator, and replays the decision. This is recoverable — but it requires durable prepare logs and a coordinator that's still reachable.
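A rough sketch of that restart path, assuming the prepare log is a simple mapping of transaction id to state and the coordinator exposes its recorded decisions. The Coordinator class, decision_for, and recover are hypothetical names for illustration only.

```python
from typing import Optional


class Coordinator:
    """Hypothetical coordinator exposing its durable decision log."""

    def __init__(self) -> None:
        self.decisions: dict[str, str] = {}

    def decision_for(self, txn_id: str) -> Optional[str]:
        return self.decisions.get(txn_id)  # None if no decision is known


def recover(prepare_log: dict[str, str], coordinator: Coordinator) -> dict[str, str]:
    """Resolve each in-doubt transaction found in the participant's prepare log."""
    outcome_by_decision = {"commit": "committed", "abort": "aborted"}
    resolved: dict[str, str] = {}
    for txn_id, state in prepare_log.items():
        if state != "prepared":
            resolved[txn_id] = state  # already final, nothing to replay
            continue
        decision = coordinator.decision_for(txn_id)
        # With no decision available the participant must stay prepared and
        # keep its locks: this is the blocking behaviour of 2PC.
        resolved[txn_id] = outcome_by_decision.get(decision, "prepared")
    return resolved


coordinator = Coordinator()
coordinator.decisions["t1"] = "commit"
coordinator.decisions["t2"] = "abort"
log = {"t1": "prepared", "t2": "prepared", "t3": "prepared", "t4": "committed"}
print(recover(log, coordinator))
```

Transaction t3 is the interesting one: the participant voted Yes, no decision can be found, so it stays in doubt until the coordinator can answer.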

The honest verdict: 2PC trades availability for consistency. Under network partition or coordinator failure, the system blocks. In a payment context at scale, that blocking behaviour is often unacceptable.


Option 2: Saga Pattern

The production answer for most distributed systems at scale. This is what gets used in real fintech systems.

Core idea: Break the distributed transaction into a sequence of local transactions. Each step publishes an event or message when it completes. If any step fails, the saga executes compensating transactions to undo the work already done.

There are two flavours:

Choreography-based Saga

Each service listens for events and reacts autonomously. No central coordinator.

OrderService → publishes OrderPlaced
  → PaymentService listens, charges card → publishes PaymentProcessed
    → InventoryService listens, reserves stock → publishes StockReserved
      → FulfillmentService listens, dispatches order

If PaymentService fails:

PaymentService → publishes PaymentFailed
  → OrderService listens, cancels order → publishes OrderCancelled
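Both flows above can be sketched with a tiny in-memory event bus. Treat this as an illustration only: EventBus is a hypothetical synchronous stand-in for Kafka/RabbitMQ, the handler functions stand in for whole services, and the card_ok flag simply forces the failure path.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical synchronous, in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.history: list[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, event: dict) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(event)


def run_choreography(card_ok: bool) -> tuple[list[str], str]:
    bus = EventBus()
    orders: dict[str, str] = {}

    def charge_card(event: dict) -> None:  # PaymentService
        bus.publish("PaymentProcessed" if card_ok else "PaymentFailed", event)

    def reserve_stock(event: dict) -> None:  # InventoryService
        bus.publish("StockReserved", event)

    def dispatch(event: dict) -> None:  # FulfillmentService
        orders[event["order_id"]] = "DISPATCHED"

    def cancel_order(event: dict) -> None:  # OrderService, compensating step
        orders[event["order_id"]] = "CANCELLED"
        bus.publish("OrderCancelled", event)

    # Each service only knows which events it reacts to; nobody owns the flow.
    bus.subscribe("OrderPlaced", charge_card)
    bus.subscribe("PaymentProcessed", reserve_stock)
    bus.subscribe("StockReserved", dispatch)
    bus.subscribe("PaymentFailed", cancel_order)

    orders["o-1"] = "PENDING"
    bus.publish("OrderPlaced", {"order_id": "o-1"})
    return bus.history, orders["o-1"]


print(run_choreography(card_ok=True))
print(run_choreography(card_ok=False))
```

The only place the full sequence is visible is the bus history, which in production means correlating logs across services.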

Pros: Loose coupling. No single point of failure. Each service is independently deployable.

Cons: Hard to trace the overall transaction state. Cyclic dependencies between services can creep in. Debugging a failed saga requires correlating events across multiple service logs.

Orchestration-based Saga

A central orchestrator (a dedicated saga service or a workflow engine) drives the sequence explicitly.

SagaOrchestrator:
  1. Call OrderService.createOrder()       → success
  2. Call PaymentService.charge()          → success
  3. Call InventoryService.reserve()       → FAILURE
  4. Call PaymentService.refund()          → compensate step 2
  5. Call OrderService.cancelOrder()       → compensate step 1
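The sequence above maps naturally onto a loop that remembers completed steps and compensates them in reverse order. The following is a minimal sketch, assuming each step is a (name, action, compensation) tuple. The run_saga function and the state dictionary are hypothetical, and a real orchestrator would also persist its progress and retry compensations.

```python
from typing import Callable

# (name, action, compensation); all names are illustrative.
Step = tuple[str, Callable[[], object], Callable[[], object]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: FAILED")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: ok")
    return log


state = {"order": None, "payment": None}


def reserve_stock() -> None:
    raise RuntimeError("out of stock")  # forces the failure at step 3


saga_log = run_saga([
    ("createOrder", lambda: state.update(order="CREATED"),
     lambda: state.update(order="CANCELLED")),
    ("charge", lambda: state.update(payment="CHARGED"),
     lambda: state.update(payment="REFUNDED")),
    ("reserve", reserve_stock, lambda: None),
])
print(saga_log)
print(state)
```

The saga log is the observability win: one component can tell you exactly which step failed and which compensations ran.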

Pros: The full transaction lifecycle is visible in one place. Easier to monitor, debug, and reason about. Simpler to add retry logic.

Cons: The orchestrator becomes a central dependency. More initial complexity to build.

When to choose which:

  • Choreography works well for simple, linear flows with few participants.

  • Orchestration is the right call when the flow has branches, retries, timeouts, or more than 3–4 participants. In payment systems, orchestration almost always wins.


Option 3: Outbox Pattern (the glue that makes Sagas reliable)

This one often gets missed, but interviewers who know distributed systems will ask about it.

The problem: even with a Saga, there's a gap between writing to your local DB and publishing an event to Kafka/RabbitMQ. If the service crashes in that gap, the event is lost and the saga stalls silently.

The fix: Write the event to an outbox table in the same local DB transaction as your business operation. A separate process reads the outbox and publishes to the message broker. At-least-once delivery is guaranteed; idempotency on the consumer side handles deduplication.

-- Both inserts commit together or not at all
BEGIN TRANSACTION;
  INSERT INTO orders (id, status) VALUES (?, 'PENDING');
  INSERT INTO outbox (event_type, payload) VALUES ('OrderPlaced', ?);
COMMIT;

A relay process (or CDC via Debezium) reads the outbox and publishes to Kafka. If it crashes and retries, consumers handle duplicate events via idempotency keys.
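Here is a small runnable sketch of the write path and a polling relay, using SQLite from the Python standard library in place of a production database. The table layout, the published flag, and the place_order and relay_once functions are assumptions for illustration, and the publish callback stands in for a real Kafka producer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: str) -> None:
    # `with conn` commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PENDING')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def relay_once(publish) -> int:
    """Publish unpublished rows, then mark them as published.

    A crash between publish and the UPDATE re-publishes the row on the next
    run, which is why delivery is at-least-once and consumers must dedupe.
    """
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
place_order("o-1")
first_run = relay_once(lambda event_type, payload: sent.append((event_type, payload)))
second_run = relay_once(lambda event_type, payload: sent.append((event_type, payload)))
print(first_run, second_run, sent)
```

The design choice worth saying out loud: the service never talks to the broker inside the business transaction, so there is no window where the order exists but the event does not.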

Bring this up unprompted. It shows you've thought about failure at the infrastructure level, not just the design level.


The Trade-off Table (Draw This)

| | 2PC | Saga (Choreography) | Saga (Orchestration) |
|---|---|---|---|
| Consistency | Strong (ACID) | Eventual | Eventual |
| Availability | Blocks on failure | High | High |
| Complexity | Moderate | Low–Medium | Medium–High |
| Observability | Good | Hard | Good |
| Failure recovery | Coordinator-dependent | Event-driven compensation | Orchestrator-driven compensation |
| Best for | Small, co-located services | Simple linear flows | Complex, branching flows |

What Interviewers Actually Want to Hear

When the question comes up, structure your answer like this:

1. Acknowledge the root cause — in a distributed system, you can't have a shared transaction manager, so ACID across services is not possible by default.

2. Present 2PC honestly — explain how it works, then explain the blocking failure mode. This shows you understand it deeply enough to know when not to use it.

3. Land on Saga — explain both choreography and orchestration, then make a call based on the problem. Don't leave it as "it depends" — commit to a recommendation.

4. Mention the Outbox Pattern — this is the detail that separates a theoretical answer from a production-grade one.

5. Handle the follow-up on idempotency — the interviewer will likely ask how you prevent duplicate processing. Your answer: an idempotency key on every operation, plus either exactly-once semantics where the messaging layer supports them or at-least-once delivery with deduplication at the consumer.
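If the interviewer asks what deduplication at the consumer looks like, a sketch along these lines is usually enough. PaymentConsumer and its fields are hypothetical, and the in-memory set is an assumption that keeps the example short. In a real system the processed keys would live in a durable store, updated in the same transaction as the side effect.

```python
class PaymentConsumer:
    """Hypothetical consumer that ignores redelivered messages."""

    def __init__(self) -> None:
        self.processed: set[str] = set()
        self.charged_total = 0

    def handle(self, idempotency_key: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if idempotency_key in self.processed:
            return False  # duplicate delivery: acknowledge and do nothing
        self.charged_total += amount
        self.processed.add(idempotency_key)
        return True


consumer = PaymentConsumer()
results = [
    consumer.handle("order-1:charge", 50),
    consumer.handle("order-1:charge", 50),  # redelivery after a relay retry
    consumer.handle("order-2:charge", 20),
]
print(results, consumer.charged_total)
```

The key is derived from the business operation (here, order plus step), not from the message itself, so a republished copy of the same event maps to the same key.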


One Thing Most Candidates Get Wrong

They treat compensating transactions as equivalent to rollbacks. They're not.

A database rollback undoes a change as if it never happened — the row is never visible to anyone else, the side effects never occur.

A compensating transaction corrects a change that already happened and may have already been observed. If you charged a card and need to undo it, you issue a refund — you can't un-charge. If you sent a confirmation email, you can't un-send it. Compensating transactions are business operations, not technical reversals. The distinction matters when the interviewer asks how you handle partial failures in a Saga — your compensation logic needs to be designed with this in mind.
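The difference shows up clearly in a ledger. The sketch below uses SQLite from the Python standard library, and the ledger table and its columns are assumptions for illustration. The first block rolls back before commit and leaves no trace. The second commits a charge and then corrects it with a refund entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (id INTEGER PRIMARY KEY, kind TEXT, amount INTEGER)")
conn.commit()

# Rollback: the charge is never committed, so nobody can ever observe it.
try:
    with conn:
        conn.execute("INSERT INTO ledger (kind, amount) VALUES ('charge', -50)")
        raise RuntimeError("failure before commit")
except RuntimeError:
    pass
rows_after_rollback = conn.execute("SELECT kind, amount FROM ledger").fetchall()

# Compensation: the charge committed and was visible, so the only way to
# undo it is a new business operation that appends a refund.
with conn:
    conn.execute("INSERT INTO ledger (kind, amount) VALUES ('charge', -50)")
with conn:
    conn.execute("INSERT INTO ledger (kind, amount) VALUES ('refund', 50)")
rows_after_compensation = conn.execute(
    "SELECT kind, amount FROM ledger ORDER BY id"
).fetchall()
balance = conn.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]

print(rows_after_rollback)
print(rows_after_compensation, balance)
```

The balance nets out to zero in both cases, but after compensation the history still shows a charge and a refund. Anything that reacted to the charge in between (a statement, a notification, a fraud check) has already happened.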


TL;DR for Interview Day

  • Default answer: Saga with orchestration + Outbox Pattern.

  • Use 2PC only when: co-located services, small participant count, consistency > availability.

  • Saga ≠ rollback: compensation is a business operation, not a technical undo.

  • Always mention idempotency: every step in a Saga must be safe to retry.

  • Draw the trade-off table — it shows structured thinking, not just memorised terms.


I mentor senior engineers on system design and backend architecture. If you're preparing for a Staff or Principal Engineer loop, feel free to connect.


From Production to Whiteboard

Part 1 of 2

Most system design content teaches you what to draw on the whiteboard. This series teaches you why — from someone who's actually built these systems in production. Each article takes a real-world design problem, walks through the requirements, the trade-offs, and the decisions that matter at scale — then frames it as how you'd approach it in a senior-level system design interview. No hand-waving. No "it depends" without a reason. Just the thinking that separates a good answer from a great one.
