Distributed Transactions — From Absolute Zero to Mastery

Chapter 0: The Money Problem

You're building a payment system. Alice wants to send $100 to Bob. The operation requires two steps: debit Alice's account by $100 and credit Bob's account by $100. On a single database, this is straightforward — wrap both operations in a transaction, and the database guarantees they either both happen or neither does.

Now scale up. Alice's account lives in a database in US-East. Bob's account lives in a database in EU-West. There is no single database that owns both accounts. You need both databases to agree on the outcome of this transfer: either both commit (Alice debited AND Bob credited) or both abort (nothing changes). If Alice is debited but Bob isn't credited, $100 vanishes from the economy. If Bob is credited but Alice isn't debited, $100 appears from nowhere.

This is the atomic commit problem: getting multiple independent systems to agree on a single yes/no decision. It sounds simple until you add the reality of distributed systems: networks drop messages, servers crash mid-operation, and processes stall unpredictably.

The fundamental impossibility. You cannot have atomic commit that is both always available and always consistent across network partitions. This is a direct consequence of the FLP impossibility result: no deterministic protocol can guarantee consensus in an asynchronous system where even one process can fail. Every distributed transaction protocol is a compromise — it sacrifices something (availability, performance, or simplicity) to achieve atomicity.

What Goes Wrong Without Transactions

Failure mode	What happens	Business impact
Partial write	Alice debited, Bob not credited (crash between ops)	Money disappears — customer files dispute
Dirty read	Another transaction reads Alice's debited balance before transfer completes, then transfer aborts	Decisions based on phantom data
Double spend	Two concurrent transfers both check Alice's balance ($200), both debit $150	Account goes to -$100, bank loses money
Phantom	A report sums all accounts while a transfer is in progress — total is wrong	Audit failure, regulatory violation

The Money Transfer Problem

Watch a money transfer between two databases. Without a transaction, a crash mid-operation causes money to vanish. With a transaction, the crash triggers a rollback.

Alice: $500, Bob: $500. Click a transfer mode, then "Crash" to see the difference.

Quick check: A money transfer debits Account A, then credits Account B. The system crashes after the debit but before the credit. Without a transaction, what is the total money in the system?

Unchanged — the debit is automatically rolled back $100 less than before — the debit persisted but the credit didn't, so money vanished $100 more than before

Chapter 1: ACID Breakdown

The word "transaction" gets thrown around loosely. Let's be precise. A database transaction provides four guarantees, collectively known as ACID. But these guarantees are not equally useful, and not all databases implement them equally.

A — Atomicity

Atomicity means "all or nothing." Either every operation in the transaction completes, or none of them do. If the system crashes midway through, it rolls back any partial changes. Think of it as an undo button: the transaction is the unit of "undo." You can't undo half a transaction.

Important clarification: atomicity is NOT about concurrency (that's Isolation). Atomicity is about fault tolerance — what happens when things crash partway through. The name is misleading; "abortability" would be more accurate.

C — Consistency

Consistency means the transaction preserves application invariants. If your invariant is "account balances never go negative," then a transaction that would cause a negative balance must be aborted. But here's the critical insight: the database can't know your application's invariants. YOU are responsible for writing transactions that maintain consistency. The database just provides the tools (constraints, triggers, atomicity). In practice, "C" in ACID is the application's responsibility, not the database's.

I — Isolation

Isolation means concurrent transactions don't interfere with each other. Each transaction behaves as if it were the only transaction running. The strongest form is serializability: the result of running transactions concurrently is the same as if they ran one at a time in some serial order.

Isolation is the expensive one. True serializability requires coordination (locks, or timestamp ordering, or serial execution). This is why databases offer weaker isolation levels — they trade correctness for performance. Understanding these levels is critical for system design interviews.

D — Durability

Durability means once a transaction commits, the data survives crashes. Typically implemented via write-ahead logging (WAL): changes are written to a durable log before the commit acknowledgment. Even if the server crashes, it replays the log on recovery. In distributed systems, durability often means "replicated to N nodes."

ACID Properties Visualized

Each quadrant shows one ACID property. Click to simulate a violation and see what breaks.

Click a button to see what happens when each ACID property is violated.

Check: Which ACID property is the application's responsibility, not the database's?

Atomicity Consistency — the database provides tools (constraints), but the application must define and enforce its own invariants Isolation Durability

Chapter 2: Isolation Levels

True serializable isolation is expensive — it often requires locking or aborting transactions that conflict. In practice, databases offer a menu of isolation levels, each trading correctness for performance. Understanding these is essential: most production databases default to a weaker level, and bugs from insufficient isolation are among the hardest to diagnose.

The Hierarchy

Level	Dirty Reads	Non-repeatable Reads	Phantoms	Write Skew	Performance
Read Uncommitted	Possible	Possible	Possible	Possible	Fastest
Read Committed	Prevented	Possible	Possible	Possible	Fast
Repeatable Read / Snapshot	Prevented	Prevented	Possible	Possible	Medium
Serializable	Prevented	Prevented	Prevented	Prevented	Slowest

Let's demystify each level with concrete examples.

Read Committed

The most common default (PostgreSQL, Oracle, SQL Server). Two guarantees: (1) you only read data that has been committed (no dirty reads), (2) you only overwrite data that has been committed (no dirty writes). Implemented with row-level locks for writes and snapshot reads for reads.

// Read Committed prevents dirty reads but allows non-repeatable reads:

T1: BEGIN
T1: SELECT balance FROM accounts WHERE id=1 → $500
// T2 commits: UPDATE accounts SET balance=400 WHERE id=1
T1: SELECT balance FROM accounts WHERE id=1 → $400 (!)
// Same query, different result within the same transaction!
T1: COMMIT

Snapshot Isolation (MVCC)

Snapshot Isolation gives each transaction a consistent snapshot of the database at the moment it started. All reads within the transaction see that snapshot, even if other transactions commit changes in the meantime. This is how PostgreSQL's "REPEATABLE READ" and MySQL's "REPEATABLE READ" work (via Multi-Version Concurrency Control — MVCC).

// MVCC: each row has a creation version and deletion version

Row {id:1, balance:500, created_by:txn100, deleted_by:null}

// T1 (txn_id=200) starts → snapshot = version 200
// T2 (txn_id=201) updates: balance=400
Row {id:1, balance:500, created_by:txn100, deleted_by:txn201} // old version
Row {id:1, balance:400, created_by:txn201, deleted_by:null} // new version

// T1 reads: sees only rows where created_by < 200 AND deleted_by > 200 (or null)
// → T1 still sees balance=$500 (consistent with its snapshot)

Snapshot isolation is NOT serializable. It prevents dirty reads, non-repeatable reads, and phantoms from reads. But it allows write skew: two transactions each read the same data, make a decision based on it, and write different rows. Neither sees the other's write, and both commit — but the combined result violates an invariant.

Isolation Level Comparison

Two concurrent transactions access the same account. Select an isolation level to see which anomalies can occur.

Select an isolation level to see how concurrent transactions interact.

Check: PostgreSQL defaults to "Read Committed." Your application has a function that reads a user's balance, checks if it's ≥ $100, and if so, debits $100. Two concurrent calls arrive for the same user (balance $150). Can both succeed, overdrawing the account to -$50?

Yes — both read $150 (committed), both pass the check, both debit $100. Read Committed doesn't prevent this. No — the database locks the row when the first transaction reads it No — PostgreSQL detects the conflict and aborts one

Chapter 3: Read Phenomena

To understand isolation levels, you need to know the specific anomalies they prevent. Each "read phenomenon" is a concrete scenario where concurrent transactions produce results that couldn't happen in any serial execution.

Dirty Read

A dirty read happens when transaction T1 reads data written by transaction T2 that hasn't committed yet. If T2 later aborts, T1 has read data that never officially existed.

// Dirty Read scenario:
T1: UPDATE accounts SET balance = balance - 100 WHERE id = 1 (not yet committed)
T2: SELECT balance FROM accounts WHERE id = 1 → sees uncommitted value!
T1: ROLLBACK (the debit never happened)
// T2 made a decision based on phantom data

Non-Repeatable Read (Read Skew)

A non-repeatable read happens when T1 reads the same row twice and gets different values because T2 committed a change between the two reads.

Phantom Read

A phantom is like a non-repeatable read, but for sets of rows. T1 runs a query that returns a set of rows, T2 inserts or deletes a row that matches the query, and T1 re-runs the query — the set has changed.

Write Skew

Write skew is the subtlest and most dangerous anomaly. Two transactions each read the same data, make a decision, and write to different rows. Neither violates a constraint individually, but the combined result does.

// Write Skew: on-call doctor scheduling
// Invariant: at least 1 doctor must be on-call
// Currently on-call: Dr. A and Dr. B

T1: SELECT COUNT(*) FROM doctors WHERE on_call = true → 2
T2: SELECT COUNT(*) FROM doctors WHERE on_call = true → 2
T1: UPDATE doctors SET on_call = false WHERE id = 'A' (2-1=1, still ≥1, ok)
T2: UPDATE doctors SET on_call = false WHERE id = 'B' (2-1=1, still ≥1, ok)
T1: COMMIT
T2: COMMIT
// Result: 0 doctors on-call! Invariant violated!
// Each transaction was correct individually, but combined they broke the rule.

Write skew is everywhere. Double-booking meeting rooms, over-selling concert tickets, concurrent username registration, two-account balance transfers — all are write skew. Snapshot isolation doesn't prevent it. You need either serializable isolation or explicit application-level locking (SELECT ... FOR UPDATE).

Read Phenomena Visualizer

Watch each read phenomenon play out step by step. Two transactions interleave their operations on a shared database.

Select a phenomenon, then click Step to advance through the scenario.

Check: You're using Snapshot Isolation (MVCC). Two transactions concurrently check if a username "alice" is available, find it is, and both insert it. What happens?

Both succeed — duplicate username (write skew) One succeeds, one gets a unique constraint violation (if a UNIQUE constraint exists) It depends — if there's a UNIQUE index, the database catches it; without one, both succeed and you have duplicates

Chapter 4: Two-Phase Commit

How do you get two independent databases to agree on committing a transaction? You can't just send "COMMIT" to both — what if one succeeds and the other crashes before receiving the message? You need a protocol that ensures either ALL participants commit or ALL abort, even in the face of failures.

Two-Phase Commit (2PC) is the classic answer. It introduces a coordinator (also called the transaction manager) that orchestrates the commit across all participants (the databases).

The Protocol

Phase 1: Prepare

Coordinator sends PREPARE to all participants. Each participant writes all transaction data to a durable log, then votes YES (I can commit) or NO (I must abort). The vote is a PROMISE: a YES vote means the participant guarantees it can commit, even if it crashes and restarts.

↓

Coordinator Decision

If ALL participants voted YES → decision is COMMIT. If ANY participant voted NO → decision is ABORT. The coordinator writes this decision to its own durable log BEFORE sending it.

↓

Phase 2: Commit/Abort

Coordinator sends the decision (COMMIT or ABORT) to all participants. Each participant applies the decision and acknowledges. If a participant crashes, it reads its log on recovery, finds the PREPARE record, and asks the coordinator for the decision.

The Data Flow

// Transfer $100: Alice (DB-A) → Bob (DB-B)

Coordinator → DB-A: PREPARE (debit Alice $100)
Coordinator → DB-B: PREPARE (credit Bob $100)

DB-A: writes WAL, locks row, returns VOTE=YES
DB-B: writes WAL, locks row, returns VOTE=YES

Coordinator: all YES → writes COMMIT to its log

Coordinator → DB-A: COMMIT
Coordinator → DB-B: COMMIT

DB-A: applies debit, releases lock, returns ACK
DB-B: applies credit, releases lock, returns ACK

// Total time: 2 round-trips (prepare + commit)
// Messages: 4 (2 prepare + 2 commit) + 4 responses = 8 total

The point of no return. Once a participant votes YES, it has made a binding promise. It CANNOT abort unilaterally — it must wait for the coordinator's decision, even if it means waiting forever. This is the key property that makes 2PC work: after the prepare phase, the only entity that can decide the outcome is the coordinator.

Two-Phase Commit Protocol

Watch the 2PC protocol execute step by step. The coordinator orchestrates a commit across two database participants.

Click Step to advance through the Two-Phase Commit protocol.

Check: In 2PC, participant DB-A has voted YES (prepare phase complete). DB-A then detects that it's running out of disk space and wants to abort. Can it?

Yes — any participant can abort at any time No — once it voted YES, it must wait for the coordinator's decision. It has made a binding promise. It depends on whether the coordinator has decided yet

Chapter 5: Coordinator Crash

Two-Phase Commit has a fatal weakness: the coordinator is a single point of failure. If the coordinator crashes at exactly the wrong moment, participants are stuck in a state called in-doubt (or uncertain), holding locks indefinitely.

The Worst-Case Scenario

// Timeline of doom:

t=0: Coordinator sends PREPARE to DB-A and DB-B
t=1: DB-A votes YES (now holding locks, promised to commit if asked)
t=2: DB-B votes YES (now holding locks, promised to commit if asked)
t=3: Coordinator receives both YES votes
t=4: Coordinator writes COMMIT to its log
t=5: Coordinator CRASHES before sending COMMIT to anyone

// Now:
// DB-A is in-doubt. It voted YES. It's holding locks. It can't commit
// (hasn't received COMMIT) and can't abort (it promised!). It MUST wait.
// DB-B: same situation.
// The locks are held until the coordinator recovers and resends its decision.
// If the coordinator's disk is destroyed, those locks are held FOREVER
// (until a human intervenes).

This is a blocking protocol. The in-doubt state is the fundamental problem with 2PC. While locks are held, no other transaction can read or write those rows. A coordinator outage can stall an entire database. This is why some systems avoid 2PC entirely in favor of sagas (which we'll cover next) — even though sagas provide weaker guarantees.

Mitigations (But No Silver Bullet)

Mitigation	How it helps	Limitation
Coordinator HA	Run coordinator on a replicated cluster (Raft/Paxos)	Still has failover delay; doesn't eliminate the problem
Timeout + presumed abort	If coordinator doesn't respond in T seconds, participants abort	Can cause inconsistency if coordinator actually committed
Three-Phase Commit (3PC)	Adds a "pre-commit" phase to reduce blocking window	Doesn't work under network partitions; rarely used in practice
Cooperative termination	In-doubt participants ask each other for the coordinator's decision	If ALL participants are in-doubt, nobody knows the answer

Coordinator Crash Simulator

Watch what happens when the coordinator crashes at different points in the 2PC protocol. The "danger zone" is after participants vote YES but before they receive the decision.

Choose when the coordinator crashes to see the impact on participants.

Check: The coordinator crashes after writing COMMIT to its log but before sending COMMIT to any participant. When the coordinator recovers, what does it do?

Reads its log, finds COMMIT decision, and resends COMMIT to all participants Starts a new 2PC round from scratch Aborts the transaction since it never completed

Chapter 6: The Saga Pattern

2PC is correct but blocking. What if you need distributed atomicity but can't afford to hold locks across services? The Saga pattern provides an alternative: instead of one distributed transaction, execute a sequence of local transactions, each in its own service. If any step fails, execute compensating transactions to undo the previous steps.

Think of it like booking a trip. You book a flight, then a hotel, then a rental car. If the car rental fails, you cancel the hotel, then cancel the flight. Each cancellation is a separate operation that undoes the corresponding booking.

Saga vs. 2PC

Property	2PC	Saga
Atomicity	Full (all-or-nothing)	Eventual (via compensation)
Isolation	Yes (locks held during protocol)	No (intermediate states are visible)
Lock duration	Until commit/abort (could be seconds to hours)	Per-step only (milliseconds)
Blocking	Yes (coordinator crash blocks all)	No (each step is independent)
Complexity	Protocol is simple; failure handling is hard	Protocol is complex; each step needs a compensating action

Two Flavors: Choreography vs. Orchestration

Choreography: Each service listens for events and decides its own next step. No central coordinator. Services communicate through events (e.g., Kafka topics).

Order Service

Creates order, publishes "OrderCreated" event

↓ event

Payment Service

Hears "OrderCreated", charges card, publishes "PaymentCompleted"

↓ event

Inventory Service

Hears "PaymentCompleted", reserves stock, publishes "StockReserved"

↓ event

Shipping Service

Hears "StockReserved", schedules delivery, publishes "OrderFulfilled"

Orchestration: A central saga orchestrator tells each service what to do and tracks the overall state. The orchestrator is like a state machine.

Saga Orchestrator

Sends "charge $50" to Payment Service

↓ command

Payment Service

Charges $50, responds "success"

↓ reply

Saga Orchestrator

Sends "reserve 2 items" to Inventory Service

↓ command

Inventory Service

Reserves items, responds "success"

Choreography vs. Orchestration trade-off. Choreography is decentralized and loosely coupled but hard to debug (the saga logic is spread across all services). Orchestration centralizes the flow in one place but introduces a potential bottleneck and single point of responsibility. Most production systems start with orchestration for critical flows (order processing, payments) and choreography for less critical flows (notifications, analytics).

Compensating Transactions

The hard part of sagas is writing correct compensating transactions. A compensation must undo the effect of a step, but it can't simply "rollback" — the original transaction already committed. Compensation is a new, forward-moving operation.

// Step: Charge customer $50
// Compensation: Refund customer $50
// NOT the same as rollback! The charge shows on the statement, then the refund does.

// Step: Reserve 2 items from inventory
// Compensation: Release 2 items back to inventory

// Step: Send confirmation email
// Compensation: Send "sorry, your order was cancelled" email
// Some steps CANNOT be perfectly compensated — you can't un-send an email

Saga Choreography vs. Orchestration

Compare the two saga styles. In choreography, services communicate through events. In orchestration, a central coordinator sends commands.

Run a saga in choreography or orchestration mode. Click "Fail Step 3" to trigger compensation.

Check: A saga has 5 steps. Step 4 fails. How many compensating transactions must run?

4 — compensate all 4 previous steps 3 — compensate steps 3, 2, and 1 (the successfully completed steps, in reverse order). Step 4 failed so its own compensation isn't needed. 5 — compensate everything including the failed step

Chapter 7: The Outbox Pattern

Sagas rely on services publishing events. But there's a subtle reliability problem: how do you atomically update the database AND publish an event? If you update the database first and then crash before publishing, the event is lost. If you publish first and then crash before updating, the event describes something that didn't happen.

This is the dual write problem: you need to write to two systems (database + message broker) atomically, but they don't share a transaction.

// The dual write problem:

// Option 1: DB first, then publish
database.update(order) // ✓ succeeds
kafka.publish(event) // ✗ CRASH — event lost! Other services never learn about the update.

// Option 2: Publish first, then DB
kafka.publish(event) // ✓ succeeds
database.update(order) // ✗ CRASH — event was published for a non-existent update!

The Outbox Solution

Instead of writing to both systems, write ONLY to the database — but include the event in an outbox table within the same database transaction. A separate process (the "relay" or "publisher") reads the outbox table and publishes events to the message broker.

1. Atomic DB Write

In a single transaction: UPDATE order, INSERT INTO outbox (event_data). Both writes succeed or both fail.

↓

2. Relay Reads Outbox

A background process polls the outbox table (or uses CDC — Change Data Capture) to find new events.

↓

3. Publish to Broker

Relay publishes each event to Kafka/RabbitMQ. After successful publish, marks the outbox row as published.

At-least-once delivery. The relay might crash after publishing but before marking the row as published. On restart, it will re-publish the same event. This means consumers must be idempotent — processing the same event twice must produce the same result. Idempotency keys (unique IDs per event) are the standard solution: consumers check "have I processed event X before?" and skip duplicates.

CDC: The Modern Outbox

Change Data Capture (CDC) eliminates polling by tailing the database's write-ahead log (WAL). Tools like Debezium, Maxwell, or native PostgreSQL logical replication capture every committed change and stream it to Kafka automatically. The outbox table still exists (for structure), but the relay is replaced by the CDC connector.

python
# Outbox pattern: atomic write + deferred publish
def create_order(order):
    with db.transaction():
        db.execute("INSERT INTO orders ...", order)
        db.execute("INSERT INTO outbox (event_type, payload, published) VALUES (%s, %s, false)",
                   'OrderCreated', json.dumps(order))
    # No Kafka publish here! The relay handles it.

# Relay process (runs continuously)
def relay():
    while True:
        events = db.execute("SELECT * FROM outbox WHERE published = false ORDER BY id LIMIT 100")
        for e in events:
            kafka.publish(topic=e.event_type, value=e.payload, key=e.id)
            db.execute("UPDATE outbox SET published = true WHERE id = %s", e.id)

Outbox Pattern Visualizer

Watch the outbox pattern in action. A service writes to its database + outbox atomically. The relay then publishes events to Kafka. Crash it to see at-least-once delivery.

Write an order to see it appear in the outbox. Relay publishes it to Kafka.

Check: The outbox relay publishes an event to Kafka, then crashes before marking the outbox row as "published." When the relay restarts, what happens?

The event is published again (at-least-once delivery). Consumers must be idempotent. The event is lost — Kafka already has it but the relay doesn't know The relay detects the duplicate and skips it

Chapter 8: Saga Orchestrator

This is the showcase simulation. You are the saga orchestrator for an e-commerce order. The saga has four steps: create order, charge payment, reserve inventory, and schedule shipping. Each step can succeed or fail (you control which ones fail). When a step fails, the orchestrator runs compensating transactions for all previously completed steps in reverse order.

The goal: understand the orchestrator state machine. Try different failure scenarios. Notice how intermediate states are visible (no isolation), how compensation runs in reverse, and how the outbox ensures no events are lost.

Saga Orchestrator Simulation

An order saga with 4 steps. Toggle which steps will fail, then run the saga. Watch the orchestrator execute steps, detect failure, and run compensations in reverse.

Toggle step failures, then run the saga to see the orchestrator in action.

Observe the state machine transitions:

State	On Success	On Failure
Create Order	→ Charge Payment	→ Saga Failed (no compensation needed)
Charge Payment	→ Reserve Inventory	→ Cancel Order (compensate step 0)
Reserve Inventory	→ Schedule Shipping	→ Refund Payment → Cancel Order
Schedule Shipping	→ Saga Complete	→ Release Inventory → Refund Payment → Cancel Order

Chapter 9: Connections

Distributed transactions are the price you pay for splitting data across multiple services. Every microservice architecture eventually faces this problem, and the solution is always a trade-off between consistency, availability, and complexity.

Decision Framework

If you need...	Use this	Trade-off
Strong atomicity + isolation across DBs	2PC (XA transactions)	Blocking; coordinator is SPOF; high latency
Eventual consistency across services	Saga (orchestration)	No isolation; requires compensations; complex
Reliable event publishing	Outbox + CDC	At-least-once (idempotency required); added infrastructure
Simple read-your-writes on a single DB	Local transactions	Doesn't help with cross-service consistency
Serializable cross-shard transactions	Spanner-style (TrueTime)	Requires specialized hardware (atomic clocks); expensive

The Practical Answer

Avoid distributed transactions when possible. The best distributed transaction is the one you don't need. Design service boundaries so that operations requiring atomicity live in a single service. If your "create order" flow needs atomicity across Order, Payment, Inventory, and Shipping — maybe those should be one service (or at least one database). Microservices are about independent deployment, not about splitting every table into its own service.

Decision Tree

A visual guide for choosing the right distributed transaction approach.

Related lessons:

Coordination Avoidance — CRDTs and when you DON'T need transactions
Database Replication — the replication foundations that transactions build on
Partitioning & Storage — sharding introduces cross-shard transaction challenges

"A distributed system is one where I can't get my work done because a computer I've never heard of has crashed." — Leslie Lamport