Distributed Systems

Distributed Transactions

ACID, isolation levels, two-phase commit, sagas, and the outbox pattern — when you must coordinate.

Prerequisites: Basic SQL/databases + Networking fundamentals. That's it.
10
Chapters
9+
Simulations
0
Assumed Knowledge

Chapter 0: The Money Problem

You're building a payment system. Alice wants to send $100 to Bob. The operation requires two steps: debit Alice's account by $100 and credit Bob's account by $100. On a single database, this is straightforward — wrap both operations in a transaction, and the database guarantees they either both happen or neither does.

Now scale up. Alice's account lives in a database in US-East. Bob's account lives in a database in EU-West. There is no single database that owns both accounts. You need both databases to agree on the outcome of this transfer: either both commit (Alice debited AND Bob credited) or both abort (nothing changes). If Alice is debited but Bob isn't credited, $100 vanishes from the economy. If Bob is credited but Alice isn't debited, $100 appears from nowhere.

This is the atomic commit problem: getting multiple independent systems to agree on a single yes/no decision. It sounds simple until you add the reality of distributed systems: networks drop messages, servers crash mid-operation, and processes stall unpredictably.

The fundamental impossibility. You cannot have atomic commit that is both always available and always consistent across network partitions. This is a direct consequence of the FLP impossibility result: no deterministic protocol can guarantee consensus in an asynchronous system where even one process can fail. Every distributed transaction protocol is a compromise — it sacrifices something (availability, performance, or simplicity) to achieve atomicity.

What Goes Wrong Without Transactions

Failure modeWhat happensBusiness impact
Partial writeAlice debited, Bob not credited (crash between ops)Money disappears — customer files dispute
Dirty readAnother transaction reads Alice's debited balance before transfer completes, then transfer abortsDecisions based on phantom data
Double spendTwo concurrent transfers both check Alice's balance ($200), both debit $150Account goes to -$100, bank loses money
PhantomA report sums all accounts while a transfer is in progress — total is wrongAudit failure, regulatory violation
The Money Transfer Problem

Watch a money transfer between two databases. Without a transaction, a crash mid-operation causes money to vanish. With a transaction, the crash triggers a rollback.

Alice: $500, Bob: $500. Click a transfer mode, then "Crash" to see the difference.
Quick check: A money transfer debits Account A, then credits Account B. The system crashes after the debit but before the credit. Without a transaction, what is the total money in the system?

Chapter 1: ACID Breakdown

The word "transaction" gets thrown around loosely. Let's be precise. A database transaction provides four guarantees, collectively known as ACID. But these guarantees are not equally useful, and not all databases implement them equally.

A — Atomicity

Atomicity means "all or nothing." Either every operation in the transaction completes, or none of them do. If the system crashes midway through, it rolls back any partial changes. Think of it as an undo button: the transaction is the unit of "undo." You can't undo half a transaction.

Important clarification: atomicity is NOT about concurrency (that's Isolation). Atomicity is about fault tolerance — what happens when things crash partway through. The name is misleading; "abortability" would be more accurate.

C — Consistency

Consistency means the transaction preserves application invariants. If your invariant is "account balances never go negative," then a transaction that would cause a negative balance must be aborted. But here's the critical insight: the database can't know your application's invariants. YOU are responsible for writing transactions that maintain consistency. The database just provides the tools (constraints, triggers, atomicity). In practice, "C" in ACID is the application's responsibility, not the database's.

I — Isolation

Isolation means concurrent transactions don't interfere with each other. Each transaction behaves as if it were the only transaction running. The strongest form is serializability: the result of running transactions concurrently is the same as if they ran one at a time in some serial order.

Isolation is the expensive one. True serializability requires coordination (locks, or timestamp ordering, or serial execution). This is why databases offer weaker isolation levels — they trade correctness for performance. Understanding these levels is critical for system design interviews.

D — Durability

Durability means once a transaction commits, the data survives crashes. Typically implemented via write-ahead logging (WAL): changes are written to a durable log before the commit acknowledgment. Even if the server crashes, it replays the log on recovery. In distributed systems, durability often means "replicated to N nodes."

ACID Properties Visualized

Each quadrant shows one ACID property. Click to simulate a violation and see what breaks.

Click a button to see what happens when each ACID property is violated.
Check: Which ACID property is the application's responsibility, not the database's?

Chapter 2: Isolation Levels

True serializable isolation is expensive — it often requires locking or aborting transactions that conflict. In practice, databases offer a menu of isolation levels, each trading correctness for performance. Understanding these is essential: most production databases default to a weaker level, and bugs from insufficient isolation are among the hardest to diagnose.

The Hierarchy

LevelDirty ReadsNon-repeatable ReadsPhantomsWrite SkewPerformance
Read UncommittedPossiblePossiblePossiblePossibleFastest
Read CommittedPreventedPossiblePossiblePossibleFast
Repeatable Read / SnapshotPreventedPreventedPossiblePossibleMedium
SerializablePreventedPreventedPreventedPreventedSlowest

Let's demystify each level with concrete examples.

Read Committed

The most common default (PostgreSQL, Oracle, SQL Server). Two guarantees: (1) you only read data that has been committed (no dirty reads), (2) you only overwrite data that has been committed (no dirty writes). Implemented with row-level locks for writes and snapshot reads for reads.

// Read Committed prevents dirty reads but allows non-repeatable reads:

T1: BEGIN
T1: SELECT balance FROM accounts WHERE id=1 → $500
// T2 commits: UPDATE accounts SET balance=400 WHERE id=1
T1: SELECT balance FROM accounts WHERE id=1 → $400 (!)
// Same query, different result within the same transaction!
T1: COMMIT

Snapshot Isolation (MVCC)

Snapshot Isolation gives each transaction a consistent snapshot of the database at the moment it started. All reads within the transaction see that snapshot, even if other transactions commit changes in the meantime. This is how PostgreSQL's "REPEATABLE READ" and MySQL's "REPEATABLE READ" work (via Multi-Version Concurrency Control — MVCC).

// MVCC: each row has a creation version and deletion version

Row {id:1, balance:500, created_by:txn100, deleted_by:null}

// T1 (txn_id=200) starts → snapshot = version 200
// T2 (txn_id=201) updates: balance=400
Row {id:1, balance:500, created_by:txn100, deleted_by:txn201} // old version
Row {id:1, balance:400, created_by:txn201, deleted_by:null} // new version

// T1 reads: sees only rows where created_by < 200 AND deleted_by > 200 (or null)
// → T1 still sees balance=$500 (consistent with its snapshot)
Snapshot isolation is NOT serializable. It prevents dirty reads, non-repeatable reads, and phantoms from reads. But it allows write skew: two transactions each read the same data, make a decision based on it, and write different rows. Neither sees the other's write, and both commit — but the combined result violates an invariant.
Isolation Level Comparison

Two concurrent transactions access the same account. Select an isolation level to see which anomalies can occur.

Select an isolation level to see how concurrent transactions interact.
Check: PostgreSQL defaults to "Read Committed." Your application has a function that reads a user's balance, checks if it's ≥ $100, and if so, debits $100. Two concurrent calls arrive for the same user (balance $150). Can both succeed, overdrawing the account to -$50?

Chapter 3: Read Phenomena

To understand isolation levels, you need to know the specific anomalies they prevent. Each "read phenomenon" is a concrete scenario where concurrent transactions produce results that couldn't happen in any serial execution.

Dirty Read

A dirty read happens when transaction T1 reads data written by transaction T2 that hasn't committed yet. If T2 later aborts, T1 has read data that never officially existed.

// Dirty Read scenario:
T1: UPDATE accounts SET balance = balance - 100 WHERE id = 1 (not yet committed)
T2: SELECT balance FROM accounts WHERE id = 1 → sees uncommitted value!
T1: ROLLBACK (the debit never happened)
// T2 made a decision based on phantom data

Non-Repeatable Read (Read Skew)

A non-repeatable read happens when T1 reads the same row twice and gets different values because T2 committed a change between the two reads.

Phantom Read

A phantom is like a non-repeatable read, but for sets of rows. T1 runs a query that returns a set of rows, T2 inserts or deletes a row that matches the query, and T1 re-runs the query — the set has changed.

Write Skew

Write skew is the subtlest and most dangerous anomaly. Two transactions each read the same data, make a decision, and write to different rows. Neither violates a constraint individually, but the combined result does.

// Write Skew: on-call doctor scheduling
// Invariant: at least 1 doctor must be on-call
// Currently on-call: Dr. A and Dr. B

T1: SELECT COUNT(*) FROM doctors WHERE on_call = true → 2
T2: SELECT COUNT(*) FROM doctors WHERE on_call = true → 2
T1: UPDATE doctors SET on_call = false WHERE id = 'A' (2-1=1, still ≥1, ok)
T2: UPDATE doctors SET on_call = false WHERE id = 'B' (2-1=1, still ≥1, ok)
T1: COMMIT
T2: COMMIT
// Result: 0 doctors on-call! Invariant violated!
// Each transaction was correct individually, but combined they broke the rule.
Write skew is everywhere. Double-booking meeting rooms, over-selling concert tickets, concurrent username registration, two-account balance transfers — all are write skew. Snapshot isolation doesn't prevent it. You need either serializable isolation or explicit application-level locking (SELECT ... FOR UPDATE).
Read Phenomena Visualizer

Watch each read phenomenon play out step by step. Two transactions interleave their operations on a shared database.

Select a phenomenon, then click Step to advance through the scenario.
Check: You're using Snapshot Isolation (MVCC). Two transactions concurrently check if a username "alice" is available, find it is, and both insert it. What happens?

Chapter 4: Two-Phase Commit

How do you get two independent databases to agree on committing a transaction? You can't just send "COMMIT" to both — what if one succeeds and the other crashes before receiving the message? You need a protocol that ensures either ALL participants commit or ALL abort, even in the face of failures.

Two-Phase Commit (2PC) is the classic answer. It introduces a coordinator (also called the transaction manager) that orchestrates the commit across all participants (the databases).

The Protocol

Phase 1: Prepare
Coordinator sends PREPARE to all participants. Each participant writes all transaction data to a durable log, then votes YES (I can commit) or NO (I must abort). The vote is a PROMISE: a YES vote means the participant guarantees it can commit, even if it crashes and restarts.
Coordinator Decision
If ALL participants voted YES → decision is COMMIT. If ANY participant voted NO → decision is ABORT. The coordinator writes this decision to its own durable log BEFORE sending it.
Phase 2: Commit/Abort
Coordinator sends the decision (COMMIT or ABORT) to all participants. Each participant applies the decision and acknowledges. If a participant crashes, it reads its log on recovery, finds the PREPARE record, and asks the coordinator for the decision.

The Data Flow

// Transfer $100: Alice (DB-A) → Bob (DB-B)

Coordinator → DB-A: PREPARE (debit Alice $100)
Coordinator → DB-B: PREPARE (credit Bob $100)

DB-A: writes WAL, locks row, returns VOTE=YES
DB-B: writes WAL, locks row, returns VOTE=YES

Coordinator: all YES → writes COMMIT to its log

Coordinator → DB-A: COMMIT
Coordinator → DB-B: COMMIT

DB-A: applies debit, releases lock, returns ACK
DB-B: applies credit, releases lock, returns ACK

// Total time: 2 round-trips (prepare + commit)
// Messages: 4 (2 prepare + 2 commit) + 4 responses = 8 total
The point of no return. Once a participant votes YES, it has made a binding promise. It CANNOT abort unilaterally — it must wait for the coordinator's decision, even if it means waiting forever. This is the key property that makes 2PC work: after the prepare phase, the only entity that can decide the outcome is the coordinator.
Two-Phase Commit Protocol

Watch the 2PC protocol execute step by step. The coordinator orchestrates a commit across two database participants.

Click Step to advance through the Two-Phase Commit protocol.
Check: In 2PC, participant DB-A has voted YES (prepare phase complete). DB-A then detects that it's running out of disk space and wants to abort. Can it?

Chapter 5: Coordinator Crash

Two-Phase Commit has a fatal weakness: the coordinator is a single point of failure. If the coordinator crashes at exactly the wrong moment, participants are stuck in a state called in-doubt (or uncertain), holding locks indefinitely.

The Worst-Case Scenario

// Timeline of doom:

t=0: Coordinator sends PREPARE to DB-A and DB-B
t=1: DB-A votes YES (now holding locks, promised to commit if asked)
t=2: DB-B votes YES (now holding locks, promised to commit if asked)
t=3: Coordinator receives both YES votes
t=4: Coordinator writes COMMIT to its log
t=5: Coordinator CRASHES before sending COMMIT to anyone

// Now:
// DB-A is in-doubt. It voted YES. It's holding locks. It can't commit
// (hasn't received COMMIT) and can't abort (it promised!). It MUST wait.
// DB-B: same situation.
// The locks are held until the coordinator recovers and resends its decision.
// If the coordinator's disk is destroyed, those locks are held FOREVER
// (until a human intervenes).
This is a blocking protocol. The in-doubt state is the fundamental problem with 2PC. While locks are held, no other transaction can read or write those rows. A coordinator outage can stall an entire database. This is why some systems avoid 2PC entirely in favor of sagas (which we'll cover next) — even though sagas provide weaker guarantees.

Mitigations (But No Silver Bullet)

MitigationHow it helpsLimitation
Coordinator HARun coordinator on a replicated cluster (Raft/Paxos)Still has failover delay; doesn't eliminate the problem
Timeout + presumed abortIf coordinator doesn't respond in T seconds, participants abortCan cause inconsistency if coordinator actually committed
Three-Phase Commit (3PC)Adds a "pre-commit" phase to reduce blocking windowDoesn't work under network partitions; rarely used in practice
Cooperative terminationIn-doubt participants ask each other for the coordinator's decisionIf ALL participants are in-doubt, nobody knows the answer
Coordinator Crash Simulator

Watch what happens when the coordinator crashes at different points in the 2PC protocol. The "danger zone" is after participants vote YES but before they receive the decision.

Choose when the coordinator crashes to see the impact on participants.
Check: The coordinator crashes after writing COMMIT to its log but before sending COMMIT to any participant. When the coordinator recovers, what does it do?

Chapter 6: The Saga Pattern

2PC is correct but blocking. What if you need distributed atomicity but can't afford to hold locks across services? The Saga pattern provides an alternative: instead of one distributed transaction, execute a sequence of local transactions, each in its own service. If any step fails, execute compensating transactions to undo the previous steps.

Think of it like booking a trip. You book a flight, then a hotel, then a rental car. If the car rental fails, you cancel the hotel, then cancel the flight. Each cancellation is a separate operation that undoes the corresponding booking.

Saga vs. 2PC

Property2PCSaga
AtomicityFull (all-or-nothing)Eventual (via compensation)
IsolationYes (locks held during protocol)No (intermediate states are visible)
Lock durationUntil commit/abort (could be seconds to hours)Per-step only (milliseconds)
BlockingYes (coordinator crash blocks all)No (each step is independent)
ComplexityProtocol is simple; failure handling is hardProtocol is complex; each step needs a compensating action

Two Flavors: Choreography vs. Orchestration

Choreography: Each service listens for events and decides its own next step. No central coordinator. Services communicate through events (e.g., Kafka topics).

Order Service
Creates order, publishes "OrderCreated" event
↓ event
Payment Service
Hears "OrderCreated", charges card, publishes "PaymentCompleted"
↓ event
Inventory Service
Hears "PaymentCompleted", reserves stock, publishes "StockReserved"
↓ event
Shipping Service
Hears "StockReserved", schedules delivery, publishes "OrderFulfilled"

Orchestration: A central saga orchestrator tells each service what to do and tracks the overall state. The orchestrator is like a state machine.

Saga Orchestrator
Sends "charge $50" to Payment Service
↓ command
Payment Service
Charges $50, responds "success"
↓ reply
Saga Orchestrator
Sends "reserve 2 items" to Inventory Service
↓ command
Inventory Service
Reserves items, responds "success"
Choreography vs. Orchestration trade-off. Choreography is decentralized and loosely coupled but hard to debug (the saga logic is spread across all services). Orchestration centralizes the flow in one place but introduces a potential bottleneck and single point of responsibility. Most production systems start with orchestration for critical flows (order processing, payments) and choreography for less critical flows (notifications, analytics).

Compensating Transactions

The hard part of sagas is writing correct compensating transactions. A compensation must undo the effect of a step, but it can't simply "rollback" — the original transaction already committed. Compensation is a new, forward-moving operation.

// Step: Charge customer $50
// Compensation: Refund customer $50
// NOT the same as rollback! The charge shows on the statement, then the refund does.

// Step: Reserve 2 items from inventory
// Compensation: Release 2 items back to inventory

// Step: Send confirmation email
// Compensation: Send "sorry, your order was cancelled" email
// Some steps CANNOT be perfectly compensated — you can't un-send an email
Saga Choreography vs. Orchestration

Compare the two saga styles. In choreography, services communicate through events. In orchestration, a central coordinator sends commands.

Run a saga in choreography or orchestration mode. Click "Fail Step 3" to trigger compensation.
Check: A saga has 5 steps. Step 4 fails. How many compensating transactions must run?

Chapter 7: The Outbox Pattern

Sagas rely on services publishing events. But there's a subtle reliability problem: how do you atomically update the database AND publish an event? If you update the database first and then crash before publishing, the event is lost. If you publish first and then crash before updating, the event describes something that didn't happen.

This is the dual write problem: you need to write to two systems (database + message broker) atomically, but they don't share a transaction.

// The dual write problem:

// Option 1: DB first, then publish
database.update(order) // ✓ succeeds
kafka.publish(event) // ✗ CRASH — event lost! Other services never learn about the update.

// Option 2: Publish first, then DB
kafka.publish(event) // ✓ succeeds
database.update(order) // ✗ CRASH — event was published for a non-existent update!

The Outbox Solution

Instead of writing to both systems, write ONLY to the database — but include the event in an outbox table within the same database transaction. A separate process (the "relay" or "publisher") reads the outbox table and publishes events to the message broker.

1. Atomic DB Write
In a single transaction: UPDATE order, INSERT INTO outbox (event_data). Both writes succeed or both fail.
2. Relay Reads Outbox
A background process polls the outbox table (or uses CDC — Change Data Capture) to find new events.
3. Publish to Broker
Relay publishes each event to Kafka/RabbitMQ. After successful publish, marks the outbox row as published.
At-least-once delivery. The relay might crash after publishing but before marking the row as published. On restart, it will re-publish the same event. This means consumers must be idempotent — processing the same event twice must produce the same result. Idempotency keys (unique IDs per event) are the standard solution: consumers check "have I processed event X before?" and skip duplicates.

CDC: The Modern Outbox

Change Data Capture (CDC) eliminates polling by tailing the database's write-ahead log (WAL). Tools like Debezium, Maxwell, or native PostgreSQL logical replication capture every committed change and stream it to Kafka automatically. The outbox table still exists (for structure), but the relay is replaced by the CDC connector.

python
# Outbox pattern: atomic write + deferred publish
def create_order(order):
    with db.transaction():
        db.execute("INSERT INTO orders ...", order)
        db.execute("INSERT INTO outbox (event_type, payload, published) VALUES (%s, %s, false)",
                   'OrderCreated', json.dumps(order))
    # No Kafka publish here! The relay handles it.

# Relay process (runs continuously)
def relay():
    while True:
        events = db.execute("SELECT * FROM outbox WHERE published = false ORDER BY id LIMIT 100")
        for e in events:
            kafka.publish(topic=e.event_type, value=e.payload, key=e.id)
            db.execute("UPDATE outbox SET published = true WHERE id = %s", e.id)
Outbox Pattern Visualizer

Watch the outbox pattern in action. A service writes to its database + outbox atomically. The relay then publishes events to Kafka. Crash it to see at-least-once delivery.

Write an order to see it appear in the outbox. Relay publishes it to Kafka.
Check: The outbox relay publishes an event to Kafka, then crashes before marking the outbox row as "published." When the relay restarts, what happens?

Chapter 8: Saga Orchestrator

This is the showcase simulation. You are the saga orchestrator for an e-commerce order. The saga has four steps: create order, charge payment, reserve inventory, and schedule shipping. Each step can succeed or fail (you control which ones fail). When a step fails, the orchestrator runs compensating transactions for all previously completed steps in reverse order.

The goal: understand the orchestrator state machine. Try different failure scenarios. Notice how intermediate states are visible (no isolation), how compensation runs in reverse, and how the outbox ensures no events are lost.
Saga Orchestrator Simulation

An order saga with 4 steps. Toggle which steps will fail, then run the saga. Watch the orchestrator execute steps, detect failure, and run compensations in reverse.

Toggle step failures, then run the saga to see the orchestrator in action.

Observe the state machine transitions:

StateOn SuccessOn Failure
Create Order→ Charge Payment→ Saga Failed (no compensation needed)
Charge Payment→ Reserve Inventory→ Cancel Order (compensate step 0)
Reserve Inventory→ Schedule Shipping→ Refund Payment → Cancel Order
Schedule Shipping→ Saga Complete→ Release Inventory → Refund Payment → Cancel Order

Chapter 9: Connections

Distributed transactions are the price you pay for splitting data across multiple services. Every microservice architecture eventually faces this problem, and the solution is always a trade-off between consistency, availability, and complexity.

Decision Framework

If you need...Use thisTrade-off
Strong atomicity + isolation across DBs2PC (XA transactions)Blocking; coordinator is SPOF; high latency
Eventual consistency across servicesSaga (orchestration)No isolation; requires compensations; complex
Reliable event publishingOutbox + CDCAt-least-once (idempotency required); added infrastructure
Simple read-your-writes on a single DBLocal transactionsDoesn't help with cross-service consistency
Serializable cross-shard transactionsSpanner-style (TrueTime)Requires specialized hardware (atomic clocks); expensive

The Practical Answer

Avoid distributed transactions when possible. The best distributed transaction is the one you don't need. Design service boundaries so that operations requiring atomicity live in a single service. If your "create order" flow needs atomicity across Order, Payment, Inventory, and Shipping — maybe those should be one service (or at least one database). Microservices are about independent deployment, not about splitting every table into its own service.
Decision Tree

A visual guide for choosing the right distributed transaction approach.

Related lessons:

"A distributed system is one where I can't get my work done because a computer I've never heard of has crashed." — Leslie Lamport