Database Transactions — From Absolute Zero to Mastery

Chapter 0: The Problem

You are building a banking app. Alice wants to transfer $100 from her checking account to Bob's savings account. The operation seems trivial: subtract $100 from Alice, add $100 to Bob. Two writes. What could go wrong?

Everything.

The database executes the first write — Alice's balance drops from $500 to $400. Then the server crashes. The power supply fails. The process gets killed by the OOM killer. A network partition cuts the connection to the database. Pick your disaster. The second write never happens. Alice lost $100. Bob never received it. The money has vanished from the universe.

Or consider a different nightmare. Two people are browsing the same airline website. One seat remains on the 7:15 PM flight to JFK. Both customers simultaneously click "Book Now." Both see "1 seat available" because they both read the database before either wrote to it. Both bookings succeed. The airline now has two confirmed passengers for one seat. Someone is getting bumped at the gate.

These are not hypothetical scenarios. They happen in production systems every day. The bank transfer is a partial failure problem — one operation succeeded but the other didn't. The airline booking is a race condition — two concurrent operations interfered with each other because neither saw the other's changes.

The fundamental tension. Databases must handle many operations at once (concurrency) without crashing in the middle of any of them (fault tolerance). A transaction is the database's answer: a way to group multiple operations so they either ALL succeed or ALL fail, even in the presence of crashes and concurrency. Every systems design interview touches this concept.

The Formal Definition

A transaction is a group of database reads and writes that are treated as a single logical unit. The transaction either commits (all changes become permanent) or aborts/rolls back (all changes are undone). There is no in-between state where some changes persist and others don't.

In SQL, transactions look like this:

sql
BEGIN TRANSACTION;

  -- Read Alice's balance
  SELECT balance FROM accounts WHERE name = 'Alice';
  -- Result: 500

  -- Debit Alice
  UPDATE accounts SET balance = balance - 100 WHERE name = 'Alice';

  -- Credit Bob
  UPDATE accounts SET balance = balance + 100 WHERE name = 'Bob';

COMMIT;
-- If any statement fails, or the connection drops, or the server crashes:
-- ALL changes are automatically rolled back. Both accounts stay unchanged.

In application code, it looks like this:

python
import psycopg2

conn = psycopg2.connect("dbname=bank")
try:
    with conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'Alice'")
        cur.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'Bob'")
    conn.commit()  # Both succeed or...
except Exception:
    conn.rollback()  # ...both are undone

Watch It Break

The simulation below shows the bank transfer in action. Click "Start Transfer" to begin. Then click "Crash!" at any point during the transfer to simulate a failure. Watch what happens to the account balances.

Bank Transfer Crash Simulator

Click "Start Transfer" then "Crash!" between the two operations. Or let it complete successfully.

The Race Condition

Now consider the airline booking. Two customers query the database at the same time, both see 1 seat available, and both insert a booking. The simulation below shows the timeline of these two concurrent transactions.

Double-Booking Race Condition

Watch two customers simultaneously book the last seat. The timeline shows how the race unfolds.

How Common Is This?

You might think "crashes between two writes" is rare. It is not. Consider the scale: a busy e-commerce site processes 10,000 orders per second. Each order involves 3-5 database writes (insert order, update inventory, update customer stats, insert payment record, update shipping queue). That is 30,000-50,000 writes per second. If the system crashes once per day — a very optimistic assumption — it will land between two writes of the same logical operation. The probability is not "if" but "when."

Race conditions are even more common. Any system with more than one concurrent user can experience them. A database handling 1,000 concurrent connections has thousands of opportunities per second for read-write interleaving. Without isolation guarantees, data corruption is a statistical certainty.

The three failure modes. Every database application faces three kinds of failure: (1) Process crashes — the application or database process dies mid-operation. (2) Network failures — the connection between application and database drops. (3) Concurrency bugs — multiple operations interfere with each other. Transactions address all three. Without them, the application must handle each failure mode individually, which is both extremely difficult and inevitably incomplete.

Without transactions, your database is a minefield. Every concurrent operation could corrupt data. Every crash could leave things half-done. The rest of this lesson teaches you how transactions solve these problems — and the tradeoffs involved.

Interview warm-up: A junior engineer says "We don't need transactions — we'll just write careful code that checks for errors after each database operation." What is the fundamental flaw in this reasoning?

Transactions are faster than manual error checking Manual error checking works fine for single-server databases A crash can happen BETWEEN the error check and the next operation. No amount of application-level code can make a series of operations atomic — if the process dies mid-sequence, the error-handling code never runs. Only the database engine itself (via WAL + crash recovery) can guarantee atomicity, because it controls what gets written to disk.

Chapter 1: ACID — What It Actually Means

Every database textbook, every interview prep guide, and every recruiter's phone screen mentions ACID. But most explanations are wrong. They recite the acronym like a prayer without explaining what each letter actually guarantees at the implementation level. Let's fix that.

A — Atomicity

Atomicity does NOT mean "simultaneous." It means "all or nothing." If a transaction makes five writes and the system crashes after the third write, atomicity guarantees that all five writes are rolled back. The database looks as if the transaction never started. Zero writes persisted, not three.

The implementation is a write-ahead log (WAL). Before modifying any data page on disk, the database first writes a log entry describing the change. If the system crashes, the recovery process reads the WAL and undoes any changes from incomplete transactions. The WAL is the reason atomicity works — not clever application code, not locks, not prayers.

// WAL entries for our bank transfer:
LSN=101: BEGIN TXN_42
LSN=102: UPDATE accounts SET balance=400 WHERE id='Alice' (old=500)
LSN=103: UPDATE accounts SET balance=600 WHERE id='Bob' (old=500)
LSN=104: COMMIT TXN_42

// If crash happens after LSN=102 but before LSN=104:
// Recovery reads WAL, sees TXN_42 has no COMMIT entry
// UNDO: restore Alice balance to 500 (using old value from LSN=102)
// Result: both accounts back to $500. No money lost.

C — Consistency

Consistency in ACID is the most misunderstood letter. It does NOT mean "replicas agree" (that is replication consistency). In ACID, consistency means that the database preserves application-level invariants — rules like "total money across all accounts is constant" or "every foreign key references an existing row."

Here is the uncomfortable truth: consistency is mostly the application's responsibility, not the database's. The database can enforce some invariants (foreign keys, unique constraints, CHECK constraints), but it cannot know that "total money must be conserved" unless you tell it. Consistency is the odd one out in ACID — it is a property of the application using the database, not a property of the database itself.

Kleppmann's insight. Martin Kleppmann argues that the "C" in ACID was retrofitted to make the acronym work. Atomicity, Isolation, and Durability are genuine database guarantees. Consistency is a property that depends on the application writing correct transactions. The database only provides the tools (constraints, triggers) to help enforce it.

I — Isolation

Isolation means concurrent transactions don't interfere with each other. Each transaction behaves as if it were the only one running. In the airline example, proper isolation would prevent both customers from seeing "1 seat available" simultaneously — one of them would see the updated count after the other's booking.

Full isolation (called serializability) means the result of running transactions concurrently is the same as if they ran one at a time, in some serial order. But serializability is expensive. So most databases offer weaker isolation levels that trade correctness for performance. Understanding these tradeoffs is 80% of what interviews test.

D — Durability

Durability means that once a transaction commits, its data survives crashes — even power failures. The implementation combines WAL with fsync — a system call that forces the operating system to flush data from its write cache to the physical disk. Without fsync, the OS might acknowledge a write that is still sitting in a volatile buffer that vanishes on power loss.

// Durability guarantee chain:
Application calls COMMIT
→ Database writes COMMIT record to WAL
→ Database calls fsync() on WAL file
→ OS flushes write buffer to disk controller
→ Disk controller writes to physical media
→ fsync() returns
→ Database returns "COMMIT OK" to application

// If the process crashes after fsync returns, data is on disk.
// If power fails before fsync returns, COMMIT is not acknowledged
// and the WAL recovery will roll back the transaction.

Durability is not absolute. If every disk in the system fails simultaneously, data is lost. If the firmware lies about completing fsync (some SSDs do this for performance), data can be lost. Replication to other nodes adds another layer of durability. "Durability" means "survives the expected failure modes," not "survives the heat death of the universe."

The WAL in Detail

The Write-Ahead Log deserves a closer look because it is the foundation of both atomicity and durability. Every modern database — PostgreSQL, MySQL InnoDB, SQLite, Oracle, SQL Server — uses some form of WAL. The principle is simple: write the intent before the action.

Think of it like a pilot's checklist. Before taking off, the pilot writes down every step they are going to perform. If they are incapacitated mid-takeoff, another pilot can read the checklist and know exactly what was done and what remains. The WAL is the database's checklist.

// WAL Structure (simplified PostgreSQL model):
Each WAL record contains:
  LSN         Log Sequence Number (monotonically increasing)
  TXN_ID      Which transaction made this change
  Operation   INSERT / UPDATE / DELETE
  Page_ID     Which data page on disk is affected
  Old_Value   Previous value (for UNDO on rollback)
  New_Value   New value (for REDO on crash recovery)

// Recovery algorithm (ARIES-style, simplified):
1. Scan WAL forward: build list of committed and uncommitted transactions
2. REDO: replay all changes from committed transactions (in case data pages weren't flushed)
3. UNDO: reverse all changes from uncommitted transactions
// After recovery, the database is in a consistent state.

The WAL is append-only and sequential. This is critical for performance: sequential writes to disk are 100x-1000x faster than random writes. The actual data pages can be written lazily in the background (called checkpointing), while the WAL guarantees that no committed data is lost.

BASE: The Alternative

Not all systems provide ACID. Many distributed databases offer BASE instead:

Property	Meaning	Example
Basically Available	The system always responds, even if data is stale	DynamoDB returns the last-known value during a partition
Soft state	Data may change without input (replicas converging)	Eventual consistency means reads may see different values on different nodes
Eventual consistency	All replicas will eventually agree, given enough time	DNS propagation: update takes hours but eventually all servers agree

BASE is not "ACID but worse." It is a different set of tradeoffs optimized for availability and partition tolerance (the "AP" in CAP theorem). Your shopping cart can tolerate eventual consistency. Your bank balance cannot.

ACID vs BASE: When to Use Which

The choice between ACID and BASE is not a religious war — it is an engineering decision driven by your application's invariants:

Use Case	Invariant	Tolerance	Choice
Bank transfer	Money is conserved	Zero tolerance for incorrect balances	ACID (serializable)
Shopping cart	Items in cart reflect user intent	Brief stale reads are OK (refresh shows correct state)	BASE
Inventory count	Stock >= 0	Brief overselling tolerable if caught quickly	ACID for decrement, BASE for display
Social media feed	Posts eventually appear	High tolerance (seconds of delay are fine)	BASE
Flight booking	No overbooking	Zero tolerance	ACID (at least for the seat assignment step)
User analytics	Approximate correctness	High (10% error is acceptable)	BASE (or no transactions at all)

The hybrid approach. Most real systems use BOTH ACID and BASE. The payment processing path uses strict ACID transactions (money must be correct). The recommendation engine uses eventually consistent reads (showing a slightly stale recommendation is fine). The user activity log uses fire-and-forget writes (losing an occasional page view is acceptable). The skill is knowing which path each operation belongs on.

Transaction Costs: What You Pay

Transactions are not free. Here is what they cost:

// Performance cost of each ACID property:

Atomicity (WAL writes):
  Every write becomes TWO writes: WAL + data page
  fsync on WAL at commit: ~0.1-1ms on SSD, ~5-10ms on HDD
  Group commit amortizes: batch N transactions per fsync
  PostgreSQL default: commit_delay=0 (fsync every commit)

Isolation (locks or MVCC):
  Lock-based: blocked transactions = wasted CPU time
  MVCC: extra disk space for old row versions
  MVCC: VACUUM overhead (5-15% of total I/O in busy systems)
  SSI: conflict tracking memory + occasional aborts/retries

Durability (fsync):
  Without fsync: ~100K writes/sec on SSD
  With fsync per commit: ~10K writes/sec on SSD
  Group commit (batch 10): ~50K writes/sec on SSD
  10x performance difference between fsync modes!

// Trade-off: setting synchronous_commit=off in PostgreSQL
// disables fsync per commit. 10x faster but you can lose
// up to 3x wal_writer_delay (default 600ms) of committed data
// on crash. Acceptable for non-critical data (session logs).

Interactive ACID Demo

A transaction writes 3 records. Click "Inject Crash" at any point. Atomicity rolls back all partial changes.

Interview question: An engineer proposes using application-level retries to handle partial failures instead of database transactions. "If the second write fails, we'll retry it." What specific failure mode does this NOT handle?

Network latency between application and database The application process itself crashes (OOM kill, segfault, hardware failure) between the two writes. There is no code left running to execute the retry. The first write persists, the second never happens, and no retry logic ever runs. Only the database WAL + recovery can undo the first write. Database connection pooling limits

Chapter 2: Single-Object vs Multi-Object Transactions

Not all transactions are created equal. Writing a single row is much simpler to make atomic than writing across multiple tables. Let's draw the line clearly.

Single-Object Atomicity

Imagine you are writing a 20 KB JSON document to a database. The write takes multiple disk pages. If the system crashes after writing page 1 but before writing page 2, you have a torn write — half old data, half new data. The document is corrupted.

Almost every storage engine provides single-object atomicity out of the box. The mechanisms are:

Mechanism	How it works	What it prevents
WAL (Write-Ahead Log)	Write the complete new value to the log before modifying the data page	Torn writes on crash
Compare-and-Swap (CAS)	Atomically update only if current value matches expected	Concurrent overwrites
Increment operations	Atomic read-modify-write in a single CPU instruction	Lost updates on counters

Single-object atomicity is necessary but not sufficient. The bank transfer requires atomicity across TWO objects (two account rows). That requires a multi-object transaction.

When Single-Object Atomicity Breaks Down

Even single-object atomicity has subtle failure modes. Consider a document database storing a 20 KB JSON document:

// Single-object failure without WAL:
Document = 20 KB = 5 data pages (4 KB each)

Write page 1/5: success (4 KB written to disk)
Write page 2/5: success
Write page 3/5: POWER FAILURE

// Result: pages 1-2 have new data, pages 3-5 have old data.
// The document is corrupted — a mix of old and new versions.
// This is a "torn write."

// With WAL:
WAL: write complete new document (20 KB) to sequential log
fsync WAL
// Now update data pages (can happen lazily)
Write page 1/5... Write page 2/5... POWER FAILURE

// On recovery: WAL has the complete new document.
// Recovery replays the WAL and writes all 5 pages.
// Document is correct.

Multi-Object Transactions

A multi-object transaction groups multiple reads and writes — possibly to different rows, different tables, even different indexes — into a single atomic unit. The syntax in SQL is explicit:

sql
BEGIN TRANSACTION;
  UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
  UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;

-- If ANY statement fails or the connection drops between them,
-- the database automatically issues ROLLBACK.
-- Both accounts return to their original values.

Why Multi-Object Transactions Are Needed

Many common database operations are inherently multi-object:

Foreign key references. You insert an order row that references a customer_id. The database must verify the customer exists AND insert the order atomically. If the customer is deleted between the check and the insert, you have a dangling foreign key.

Denormalized data. You store a user's post_count on the user document AND the individual posts in a posts collection. Inserting a new post requires updating both. If only one succeeds, the count is wrong.

Secondary indexes. When you update a row, the database must also update every secondary index that covers that row. If the row update succeeds but an index update fails, queries using that index return wrong results.

Worked Example: Multi-Object Transaction Failure

Let's trace through a concrete example of what happens without multi-object transactions. An e-commerce application inserts an order and updates inventory:

// WITHOUT multi-object transaction:
Step 1: INSERT INTO orders (id, product_id, qty) VALUES (501, 42, 3);
       // Success! Order exists in database.

Step 2: UPDATE products SET stock = stock - 3 WHERE id = 42;
       // CRASH before this executes!

// State after crash:
orders table: order 501 exists (customer will be charged)
products table: stock unchanged (inventory not decremented)

// Consequences:
// 1. Customer is charged for the order
// 2. Inventory shows 3 more items than actually available
// 3. Another customer orders those same 3 items
// 4. Warehouse ships both orders but only has stock for one
// 5. Manual intervention: cancel one order, refund customer, apologize

// WITH multi-object transaction:
BEGIN TRANSACTION;
  INSERT INTO orders (id, product_id, qty) VALUES (501, 42, 3);
  UPDATE products SET stock = stock - 3 WHERE id = 42;
COMMIT;
// If crash happens between the INSERT and UPDATE:
// WAL has no COMMIT record → recovery rolls back the INSERT too
// Neither the order nor the inventory change persists. Clean state.

The NoSQL Trade-off

Many NoSQL databases (MongoDB before 4.0, early DynamoDB, Cassandra) intentionally dropped multi-object transactions. Their argument: multi-object transactions hurt performance and don't scale across partitions. Instead, they encourage application-level workarounds: idempotent operations, event sourcing, compensating transactions.

This works for some use cases (shopping carts, activity feeds) but fails catastrophically for others (financial transfers, inventory management). MongoDB added multi-document transactions in v4.0 (2018) after years of customer demand. The lesson: you can avoid transactions until you can't.

The Saga Pattern (When You Can't Use Transactions)

When data lives across multiple services (each with its own database), multi-object transactions are impossible. You can't span a SQL transaction across a payment service database and an inventory service database. The saga pattern is the most common alternative:

1. Create Order (Order Service)

Insert order with status=PENDING. This is a local transaction within the order service's database.

↓

2. Reserve Inventory (Inventory Service)

Decrement stock. If insufficient stock, publish "inventory failed" event.

↓

3. Process Payment (Payment Service)

Charge customer. If payment fails, publish "payment failed" event.

↓

4. Confirm Order

Update order status=CONFIRMED. Publish "order confirmed" event.

↻ On failure: compensating transactions reverse each step

If step 3 fails, a compensating transaction runs: the inventory service re-increments the stock, and the order service marks the order as CANCELLED. This is not atomic — there is a window where the inventory is reserved but the payment hasn't been processed. During that window, the system is in an inconsistent state. But eventually, it converges to a consistent state. This is the tradeoff: eventual consistency instead of strong consistency.

Multi-Object Transaction Visualizer

Watch a transaction span two tables with a foreign key. Toggle "With Transaction" to see the difference.

Interview question: A NoSQL database provides single-document atomicity but no multi-document transactions. You need to transfer money between two user documents. What are your options, and what are the tradeoffs?

Just do two writes and hope for the best Use a message queue to guarantee delivery Use a two-phase approach: (1) create a "transfer" document with status=pending, (2) debit source, (3) credit destination, (4) mark transfer=complete. A background reconciliation process periodically scans for stuck transfers (pending for >N seconds) and either retries or compensates. This is the "saga" pattern — it trades strong atomicity for eventual consistency with manual compensation logic. The tradeoff: more application code, eventual (not immediate) consistency, and edge cases around duplicate processing that require idempotency keys.

Chapter 3: Read Committed

Now we enter the world of isolation levels — the spectrum between "no isolation at all" and "perfect serializability." Each level prevents certain anomalies but allows others. Understanding this spectrum is the single most important transaction concept for interviews.

Read Committed is the weakest useful isolation level. It is the default in PostgreSQL, Oracle, and SQL Server. It provides two guarantees:

Guarantee 1: No Dirty Reads

A dirty read occurs when Transaction B reads data that Transaction A has written but not yet committed. If A later rolls back, B has acted on data that never officially existed.

// DIRTY READ scenario (READ UNCOMMITTED allows this):
Time 1: TXN_A begins
Time 2: TXN_A writes Alice.balance = 400 (was 500, debit $100)
Time 3: TXN_B reads Alice.balance → sees 400 DIRTY! A hasn't committed
Time 4: TXN_A aborts (rolls back) → Alice.balance is back to 500
Time 5: TXN_B thinks Alice has $400 but she actually has $500

// READ COMMITTED prevents this:
Time 3: TXN_B reads Alice.balance → sees 500 (old committed value)
// B only sees A's changes AFTER A commits

Guarantee 2: No Dirty Writes

A dirty write occurs when Transaction B overwrites a value that Transaction A has written but not yet committed. This can cause bizarre interleaving where the final state mixes parts of different transactions.

// DIRTY WRITE scenario:
// Alice buys a car. Two tables: listings and invoices.
Time 1: TXN_A (Alice's purchase) writes listings.buyer = 'Alice'
Time 2: TXN_B (Bob's purchase of SAME car) writes listings.buyer = 'Bob'
Time 3: TXN_A writes invoices.buyer = 'Alice'
Time 4: TXN_B writes invoices.buyer = 'Bob'

// Without dirty write prevention: listing says Bob, invoice says Bob
// But if TXN_B wrote listings BEFORE TXN_A committed,
// we could get listing=Bob, invoice=Alice — a corrupted state.

// READ COMMITTED prevents this using ROW-LEVEL LOCKS:
// When TXN_A writes a row, it acquires a lock.
// TXN_B must WAIT until TXN_A commits or aborts before writing that row.

Implementation

Read committed uses two mechanisms:

For writes: row-level locks. When a transaction writes a row, it acquires an exclusive lock on that row. Other transactions that want to write the same row must wait until the lock is released (at commit or abort).

For reads: the database keeps TWO copies of each modified row — the old committed value and the new uncommitted value. Any transaction reading that row sees the old value until the writing transaction commits. This is cheaper than read locks because reads never block.

Why not use read locks? An early approach to preventing dirty reads was to use shared locks for reads: if Transaction A holds a write lock, Transaction B's read must wait. But this means a long-running write blocks all reads on that row. One slow analytics query scanning millions of rows would block all concurrent writes. PostgreSQL's approach (keeping both old and new values) lets reads proceed without blocking, which is why it's the default isolation level in most databases.

Read Committed Implementation: The Two-Value Trick

Here is exactly how PostgreSQL implements read committed for reads, step by step:

// When TXN_A (not yet committed) updates row X:
// The database maintains TWO versions of the row:

Row X, Version 1: { balance: 500, xmin: 100, xmax: 150 }
  // Old version. Created by TXN_100 (committed long ago).
  // Marked as deleted by TXN_150 (TXN_A, not yet committed).

Row X, Version 2: { balance: 400, xmin: 150, xmax: 0 }
  // New version. Created by TXN_150 (TXN_A, not yet committed).

// When TXN_B reads row X under READ COMMITTED:
1. Database scans versions of row X
2. Finds Version 2: created by TXN_150. Is TXN_150 committed? NO.
   // Skip this version — it's uncommitted (dirty).
3. Finds Version 1: created by TXN_100 (committed).
   Deleted by TXN_150 — but TXN_150 is not committed yet!
   // The delete doesn't count yet. This version is still visible.
4. Returns Version 1: balance = 500.

// After TXN_A commits:
// The SAME query by TXN_B now returns 400 (Version 2 is now visible).
// This is a non-repeatable read — allowed under read committed.

This mechanism — checking the commit status of each version's creating transaction — is cheap. PostgreSQL maintains a commit log (clog) that stores the commit status of each transaction ID in just 2 bits (in-progress, committed, aborted, sub-committed). Checking clog is a simple array lookup.

What Read Committed Does NOT Prevent

Read committed has a critical gap: non-repeatable reads (also called read skew). You read a value, another transaction modifies and commits it, and reading again gives a different result — within the same transaction.

// NON-REPEATABLE READ (allowed under Read Committed):
// Alice has two accounts: checking=$500, savings=$500. Total=$1000.
Time 1: TXN_B (Alice's backup job) reads checking → $500
Time 2: TXN_A (transfer $100 from checking to savings) begins
Time 3: TXN_A: checking = 400, savings = 600. COMMIT.
Time 4: TXN_B reads savings → $600 (sees A's committed update)

// TXN_B saw: checking=$500, savings=$600. Total=$1100.
// Alice appears to have $100 more than she actually does.
// The backup is inconsistent.

For a quick interactive query, this is usually harmless — Alice refreshes the page and sees the correct values. But for a backup that takes 30 minutes, or an analytics query scanning millions of rows, reading inconsistent snapshots produces wrong results. This is why we need the next isolation level: snapshot isolation.

Worked Example: Non-Repeatable Read in Practice

Here is a concrete scenario that breaks analytics queries under read committed:

// Scenario: "How much total money is in the bank?"
// True answer: $1000 (Alice=$500, Bob=$500)

// Analytics query (TXN_B) scans accounts one by one:
TXN_B: SELECT balance FROM accounts WHERE id='alice' → $500
// Meanwhile, a transfer happens:
TXN_A: UPDATE accounts SET balance=400 WHERE id='alice';
TXN_A: UPDATE accounts SET balance=600 WHERE id='bob';
TXN_A: COMMIT;
TXN_B: SELECT balance FROM accounts WHERE id='bob' → $600

// TXN_B's total: $500 + $600 = $1100
// Real total: $1000
// The analytics query reports $100 more than actually exists!

// Under snapshot isolation, TXN_B would see the database as of
// its start time: Alice=$500, Bob=$500. Total=$1000. Correct.

This is not just a theoretical concern. Database backups use pg_dump, which reads every table sequentially. If the backup runs under read committed (the default), it can capture an inconsistent snapshot: some tables reflect data from before a transaction, others from after. Restoring this backup gives you a database where foreign keys are broken, counters don't add up, and invariants are violated. PostgreSQL's pg_dump uses a single REPEATABLE READ transaction for exactly this reason.

Isolation Levels in Real Databases

Database	Default Level	Notes
PostgreSQL	Read Committed	REPEATABLE READ actually provides snapshot isolation. SERIALIZABLE provides SSI.
MySQL InnoDB	Repeatable Read	Uses gap locking for phantoms. Does NOT detect lost updates automatically.
Oracle	Read Committed	Snapshot isolation available as SERIALIZABLE, but it is not truly serializable.
SQL Server	Read Committed	Offers both lock-based and snapshot-based read committed. Set with READ_COMMITTED_SNAPSHOT.
CockroachDB	Serializable	Only supports serializable. "If you want correctness, don't compromise."

The naming mess. The SQL standard defines four isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE. But databases implement them differently! PostgreSQL's REPEATABLE READ is actually snapshot isolation (which the SQL standard doesn't mention). MySQL's REPEATABLE READ uses gap locks, which prevents phantoms in some cases but not others. Oracle's SERIALIZABLE is actually snapshot isolation, not true serializability. Always test your database's actual behavior rather than trusting the name. The Hermitage project (by Martin Kleppmann) tests real database behavior and exposes these discrepancies.

Read Committed vs Dirty Reads

Two concurrent transactions. Toggle isolation level to see how dirty reads are prevented.

Interview question: Under Read Committed isolation, Transaction A updates a row but hasn't committed yet. Transaction B tries to READ that row. What does B see?

B sees the OLD committed value of the row. The database maintains both the old and new versions, and returns the old version to any reader until A commits. This is how PostgreSQL, Oracle, and SQL Server prevent dirty reads without using read locks. B is blocked until A commits or rolls back B sees A's uncommitted (dirty) value

Chapter 4: Snapshot Isolation & MVCC

Read committed prevents dirty reads but allows non-repeatable reads. For a backup job or a long analytics query, this is disastrous — the query sees an inconsistent mix of pre-update and post-update data. We need something stronger.

Snapshot isolation gives each transaction a frozen snapshot of the database as it existed at the transaction's start time. No matter how long the transaction runs, it always sees the same consistent view. Other transactions can modify data freely — the snapshot is immutable.

Multi-Version Concurrency Control (MVCC)

Snapshot isolation is implemented using Multi-Version Concurrency Control (MVCC). Instead of storing one version of each row, the database stores multiple versions, each tagged with the ID of the transaction that created or deleted it.

// Each row version has metadata:
row_version {
  data: { id: 'alice', balance: 500 },
  created_by: TXN_10,   // the transaction that inserted/updated this version
  deleted_by: NULL     // NULL = still active; set when row is updated or deleted
}

// When TXN_15 updates Alice's balance to 400:
// 1. Mark old version as deleted_by=TXN_15
// 2. Create new version with created_by=TXN_15

row_version_old {
  data: { id: 'alice', balance: 500 },
  created_by: TXN_10,
  deleted_by: TXN_15   ← marked as deleted
}
row_version_new {
  data: { id: 'alice', balance: 400 },
  created_by: TXN_15,
  deleted_by: NULL
}

MVCC Visibility Rules

When transaction TXN_X reads a row, it must decide which version to see. The rule is precise:

A row version is visible to TXN_X if and only if:

// A version is VISIBLE to TXN_X when ALL of these are true:
1. created_by ≤ TXN_X            creator started before or is TXN_X itself
2. created_by is committed        creator finished successfully
3. deleted_by is NULL             OR
   deleted_by > TXN_X            OR
   deleted_by is not committed    deleter hasn't committed yet

// In other words: the version was created by a committed transaction
// that started before me, and it hasn't been deleted by any committed
// transaction that started before me.

Worked Example: Three Concurrent Transactions

This is the kind of example interviewers love. Let's trace exactly which versions each transaction sees.

// Setup: accounts table, one row: {id:'alice', balance:500}
// created_by: TXN_5 (committed long ago)

// Timeline:
Time 1: TXN_10 begins           (snapshot at TXN_10)
Time 2: TXN_12 begins           (snapshot at TXN_12)
Time 3: TXN_12 updates alice.balance = 400   (creates new version)
Time 4: TXN_12 commits
Time 5: TXN_15 begins           (snapshot at TXN_15)
Time 6: TXN_10 reads alice.balance → ???
Time 7: TXN_15 reads alice.balance → ???

// Row versions after TXN_12 commits:
V1: { balance:500, created_by:TXN_5(committed), deleted_by:TXN_12(committed) }
V2: { balance:400, created_by:TXN_12(committed), deleted_by:NULL }

// What does TXN_10 see at Time 6?
// V2: created_by=TXN_12. Is TXN_12 ≤ TXN_10? NO (12 > 10).
// So V2 is INVISIBLE to TXN_10. TXN_12 started after TXN_10.
// V1: created_by=TXN_5 (committed, ≤ 10). deleted_by=TXN_12 (> 10).
// V1 is VISIBLE. TXN_10 sees balance = 500.

// What does TXN_15 see at Time 7?
// V2: created_by=TXN_12 (committed, ≤ 15). deleted_by=NULL.
// V2 is VISIBLE. TXN_15 sees balance = 400.

// TXN_10 sees the OLD value (500) even though TXN_12 committed!
// TXN_15 sees the NEW value (400) because TXN_12 committed before TXN_15 started.
// This is snapshot isolation in action.

The key insight. Under snapshot isolation, a transaction sees a frozen view of the database. It never sees changes made by transactions that started after it, even if those transactions commit. This is why backups and analytics queries work correctly — they see a consistent point-in-time snapshot regardless of concurrent writes.

PostgreSQL's MVCC Implementation

PostgreSQL stores MVCC metadata in system columns on every row:

Column	Meaning
xmin	Transaction ID that created this row version (INSERT or UPDATE)
xmax	Transaction ID that deleted this row version (UPDATE or DELETE). 0 if still live.
ctid	Physical location (page, offset) of this row version on disk

When you UPDATE a row in PostgreSQL, it doesn't modify the row in place. It marks the old version with xmax = current txn, then inserts a brand new row with xmin = current txn. This is why PostgreSQL tables need periodic VACUUM — to clean up old, invisible row versions that are wasting disk space.

VACUUM: Garbage Collection for MVCC

MVCC creates multiple versions of every row. Over time, old versions become invisible to all active transactions. These dead tuples waste disk space and slow down sequential scans (the database must skip over them). PostgreSQL's VACUUM process reclaims this space.

// Before VACUUM:
Page 42 contains:
  V1: { balance:500, xmin:TXN_5, xmax:TXN_12 } ← dead (no active txn can see this)
  V2: { balance:400, xmin:TXN_12, xmax:TXN_20 } ← dead
  V3: { balance:350, xmin:TXN_20, xmax:NULL } ← live (latest version)

// VACUUM marks V1 and V2 as reclaimable. Their space can be reused.
// VACUUM FULL actually compacts the table (rewrites it), but locks it.

// autovacuum runs in the background (PostgreSQL default).
// Trigger: when dead_tuples > autovacuum_vacuum_threshold +
// autovacuum_vacuum_scale_factor × table_size
// Default: 50 + 0.2 × row_count → VACUUM when 20% of rows are dead.

The long-running transaction problem. MVCC cannot garbage-collect any row version that might be visible to an active transaction. If you have a transaction that started 3 hours ago and never committed, VACUUM cannot remove ANY row versions created after that transaction started. Dead tuples accumulate, the table bloats, and performance degrades. This is called transaction ID wraparound in PostgreSQL, and it can cause the database to refuse new transactions entirely. Always set a statement_timeout and idle_in_transaction_session_timeout in production.

MVCC Index Handling

Indexes complicate MVCC. A B-tree index entry points to a specific row version's physical location (ctid in PostgreSQL). When a row is updated, the new version has a different ctid. The database has two strategies:

Strategy	How it works	Used by
HOT update	If no indexed column changed, store new version on same page, link old→new via a "heap-only tuple" chain. No index update needed.	PostgreSQL (when possible)
Index entry per version	Each version gets its own index entry. Index scans must check visibility for each entry.	PostgreSQL (when indexed column changes)
Undo log	Index points to latest version. Old versions are stored in a separate undo log and reconstructed on demand.	MySQL InnoDB, Oracle

PostgreSQL's approach (multiple index entries) means that a frequently-updated, heavily-indexed table can have very large indexes full of entries pointing to dead row versions. MySQL's approach (undo log) keeps indexes cleaner but adds overhead to reconstruct old versions for long-running snapshot transactions.

Observing MVCC in PostgreSQL

You can actually see MVCC in action using PostgreSQL's system columns. Here's a hands-on experiment you can run:

sql
-- Create a test table
CREATE TABLE test (id INT PRIMARY KEY, value TEXT);
INSERT INTO test VALUES (1, 'original');

-- See the MVCC metadata:
SELECT xmin, xmax, ctid, * FROM test;
--  xmin  | xmax | ctid  | id |  value
-- -------+------+-------+----+----------
--  1000  |    0 | (0,1) |  1 | original
-- xmin=1000: created by transaction 1000
-- xmax=0: not yet deleted
-- ctid=(0,1): page 0, tuple 1

-- Now update the row:
UPDATE test SET value = 'modified' WHERE id = 1;

SELECT xmin, xmax, ctid, * FROM test;
--  xmin  | xmax | ctid  | id |  value
-- -------+------+-------+----+----------
--  1001  |    0 | (0,2) |  1 | modified
-- NEW xmin! This is a completely new row version.
-- NEW ctid! It's at a different physical location (0,2).
-- The old version (0,1) still exists on disk with xmax=1001,
-- but it's invisible to committed transactions.

-- Check dead tuple count:
SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables
WHERE relname = 'test';
--  n_dead_tup | n_live_tup
-- ------------+------------
--           1 |          1
-- One dead tuple (the old 'original' version). VACUUM will reclaim it.

Snapshot Isolation Performance Characteristics

Operation	Read Committed	Snapshot Isolation	Why
Short reads	Fast	Same speed	Both use MVCC; snapshot just remembers which txn IDs to ignore
Long reads (analytics)	Inconsistent	Consistent snapshot	Snapshot freezes the visible set at transaction start
Writes	Same speed	Same speed	Write performance is identical; both create new row versions
Write conflicts	Last write wins	First writer wins (second aborts)	Snapshot detects that the row changed since snapshot start
Storage	Less bloat	More bloat if long txns	Long-running snapshot transactions prevent VACUUM from reclaiming old versions

MVCC Version Visibility Explorer

Three transactions interact with a row. Drag the time slider to see which version each transaction sees.

Time step 0

Interview question: TXN_20 starts, then TXN_25 updates row X and commits, then TXN_30 starts and reads row X. Under snapshot isolation, which version of row X does TXN_20 see? Which does TXN_30 see?

Both see the new value (TXN_25 committed) TXN_20 sees the OLD value (created before TXN_20's snapshot, and TXN_25 started after TXN_20 so its changes are invisible). TXN_30 sees the NEW value (TXN_25 committed before TXN_30 started, so TXN_25's changes are visible). Each transaction is frozen at its start time. TXN_20 sees the new value, TXN_30 sees the old value

Chapter 5: Preventing Lost Updates

Snapshot isolation prevents dirty reads and non-repeatable reads. But it has a blind spot: the lost update problem. This is one of the most common concurrency bugs in real applications.

The Pattern: Read-Modify-Write

A lost update happens whenever two transactions perform a read-modify-write cycle on the same data:

// Lost Update — the classic scenario:
// Counter starts at 100. Two transactions both increment it by 50.

TXN_A: reads counter → 100
TXN_B: reads counter → 100 (same snapshot!)
TXN_A: writes counter = 100 + 50 = 150
TXN_B: writes counter = 100 + 50 = 150 OVERWRITES A's update!

// Expected result: 200. Actual result: 150.
// TXN_A's increment was lost.

This pattern shows up everywhere:

Scenario	Read	Modify	Write
Counter increment	Read current value	Add 1	Write new value
Wiki page edit	Read page content	Edit text	Save changes
Account balance	Read balance	Subtract amount	Write new balance
JSON field update	Read document	Modify one field	Write entire document

Solution 1: Atomic Operations

The simplest fix — don't read-then-write. Let the database do it in a single atomic operation:

sql
-- BAD: read-modify-write (vulnerable to lost updates)
SELECT balance FROM accounts WHERE id = 'alice';  -- reads 500
-- application computes: 500 - 100 = 400
UPDATE accounts SET balance = 400 WHERE id = 'alice';

-- GOOD: atomic operation (no read-modify-write)
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
-- The database performs the read+modify+write as a single atomic step.
-- No other transaction can interleave between the read and the write.

Atomic operations work for simple modifications (increment, decrement, append). They don't work when the modification is complex (editing a wiki page requires reading the entire document, modifying it in the application, and writing it back).

Solution 2: Explicit Locking (SELECT FOR UPDATE)

sql
BEGIN;
SELECT balance FROM accounts WHERE id = 'alice' FOR UPDATE;
-- FOR UPDATE acquires a ROW-LEVEL EXCLUSIVE LOCK
-- Any other transaction trying to SELECT FOR UPDATE on this row BLOCKS

-- application computes new balance
UPDATE accounts SET balance = 400 WHERE id = 'alice';
COMMIT; -- lock released

SELECT FOR UPDATE tells the database: "I'm going to modify this row. Lock it so nobody else can read-for-update until I'm done." This serializes all read-modify-write operations on that row.

Solution 3: Automatic Lost Update Detection

PostgreSQL's repeatable read isolation level (which actually provides snapshot isolation) automatically detects lost updates. If TXN_B tries to write a row that TXN_A already wrote since B's snapshot started, PostgreSQL aborts TXN_B. The application can then retry.

// PostgreSQL automatic detection (REPEATABLE READ):
TXN_A: reads counter → 100
TXN_B: reads counter → 100
TXN_A: writes counter = 150. COMMIT. success
TXN_B: writes counter = 150. PostgreSQL detects conflict!
→ ERROR: could not serialize access due to concurrent update
TXN_B: RETRY → reads counter → 150. writes 200. COMMIT. success

MySQL does NOT do this. MySQL's "repeatable read" is actually a weaker form of snapshot isolation that does not detect lost updates. If you need lost update prevention on MySQL, you must use explicit locking (SELECT FOR UPDATE) or atomic operations. This is a common interview gotcha: "Does repeatable read prevent lost updates?" The answer depends on the database.

Solution 4: Compare-and-Set (Optimistic Locking)

sql
-- Read the current value
SELECT balance, version FROM accounts WHERE id = 'alice';
-- Returns: balance=500, version=7

-- Update ONLY if the version hasn't changed
UPDATE accounts
SET balance = 400, version = 8
WHERE id = 'alice' AND version = 7;

-- If another transaction changed it, version != 7, UPDATE affects 0 rows.
-- Application detects "0 rows affected" and retries.

This is also called optimistic concurrency control (OCC). It doesn't use locks — it lets both transactions proceed, but only one succeeds. The other detects the conflict and retries. Works well when conflicts are rare.

Worked Example: All Four Solutions Side by Side

Let's trace through the same scenario with each solution. Two users both try to withdraw $200 from an account with $300 balance. Only one should succeed (the second should get "insufficient funds").

// Initial state: account_1001.balance = 300

══════════ NO PROTECTION (Read Committed) ══════════
TXN_A: SELECT balance FROM accounts WHERE id=1001 → 300
TXN_B: SELECT balance FROM accounts WHERE id=1001 → 300
TXN_A: 300 >= 200, OK. UPDATE SET balance = 100 WHERE id=1001. COMMIT.
TXN_B: 300 >= 200, OK. UPDATE SET balance = 100 WHERE id=1001. COMMIT.
// Result: balance = 100. Both withdrew $200. Total withdrawn = $400.
// But account only had $300! Bank lost $100.

══════════ ATOMIC OPERATION ══════════
TXN_A: UPDATE accounts SET balance = balance - 200
       WHERE id=1001 AND balance >= 200;
       // 300 >= 200, OK. balance = 100. 1 row affected. COMMIT.
TXN_B: UPDATE accounts SET balance = balance - 200
       WHERE id=1001 AND balance >= 200;
       // 100 >= 200? NO. 0 rows affected. Report "insufficient funds".
// Result: balance = 100. Correct! Only one withdrawal succeeded.

══════════ SELECT FOR UPDATE ══════════
TXN_A: SELECT balance FROM accounts WHERE id=1001 FOR UPDATE;
       // Gets lock. Reads 300.
TXN_B: SELECT balance FROM accounts WHERE id=1001 FOR UPDATE;
       // BLOCKED. Waiting for TXN_A's lock.
TXN_A: 300 >= 200. UPDATE SET balance = 100. COMMIT. Lock released.
TXN_B: // Unblocked! Reads 100 (fresh value).
TXN_B: 100 >= 200? NO. Report "insufficient funds". ROLLBACK.
// Result: balance = 100. Correct!

══════════ COMPARE-AND-SET ══════════
TXN_A: SELECT balance, version FROM accounts WHERE id=1001;
       // balance=300, version=5
TXN_B: SELECT balance, version FROM accounts WHERE id=1001;
       // balance=300, version=5
TXN_A: UPDATE SET balance=100, version=6 WHERE id=1001 AND version=5;
       // 1 row affected. COMMIT.
TXN_B: UPDATE SET balance=100, version=6 WHERE id=1001 AND version=5;
       // 0 rows affected (version is now 6, not 5). RETRY.
TXN_B: SELECT balance, version → 100, 6. 100 >= 200? NO. "Insufficient funds".
// Result: balance = 100. Correct!

When to use which. Atomic operations are simplest and fastest — use them when possible (counters, balances, appends). SELECT FOR UPDATE is the go-to for complex read-modify-write where you need the full row value. Compare-and-set is best for distributed systems where database-level locking isn't available (DynamoDB conditional writes, Cassandra LWT). PostgreSQL's automatic detection is great if you're already using snapshot isolation — no code changes needed.

Lost Update Simulator

Two transactions both increment a counter. Watch the lost update, then see how each solution prevents it.

Interview question: You're designing a collaborative document editor. Multiple users can edit the same document simultaneously. Which lost update prevention strategy is most appropriate, and why?

Atomic operations (UPDATE SET content = content || new_text) SELECT FOR UPDATE (lock the document while editing) None of the SQL-level solutions work well here. Document editing requires operational transformation (OT) or CRDTs — data structures that can merge concurrent edits without conflicts. The document is not a single atomic value to increment; it is a sequence of characters where concurrent inserts/deletes at different positions should BOTH succeed. Traditional database transactions model the document as "one big blob" — but the right granularity is individual operations (insert char at position 5, delete char at position 12). Google Docs uses OT; Figma uses CRDTs.

Chapter 6: Write Skew & Phantoms

Lost updates happen when two transactions write to the SAME row. Write skew is a subtler and nastier problem: two transactions read the same data, make decisions based on what they read, then write to DIFFERENT rows — and the combination violates an invariant.

The Doctor On-Call Problem

A hospital requires at least one doctor on call at all times. Currently, Alice and Bob are both on call (2 doctors). Both simultaneously decide to take themselves off call:

// Write Skew — the doctor scenario:
// Invariant: COUNT(on_call) >= 1 at all times

TXN_Alice: SELECT COUNT(*) FROM doctors WHERE on_call = true;
          → 2 (both Alice and Bob are on call)
          "2 >= 2, so it's safe if I go off call"

TXN_Bob:   SELECT COUNT(*) FROM doctors WHERE on_call = true;
          → 2 (same snapshot — Alice hasn't written yet)
          "2 >= 2, so it's safe if I go off call"

TXN_Alice: UPDATE doctors SET on_call = false WHERE name = 'Alice'; COMMIT;
TXN_Bob:   UPDATE doctors SET on_call = false WHERE name = 'Bob'; COMMIT;

// Result: 0 doctors on call. Invariant violated!
// Each transaction wrote to a DIFFERENT row.
// No lost update (no row was written by both).
// No dirty read. No non-repeatable read.
// Yet the result is incorrect.

Why snapshot isolation doesn't catch this. Snapshot isolation detects conflicts when two transactions write the SAME row. Write skew involves transactions writing DIFFERENT rows — Alice updates her row, Bob updates his row. There is no write-write conflict for the database to detect. The conflict is between a READ (the count query) and a WRITE (the update) in the other transaction, which snapshot isolation doesn't track.

The Write Skew Taxonomy

Write skew is a family of anomalies, not a single bug. Understanding the taxonomy helps you spot it in interviews and code reviews. Every write skew has the same structure:

// The universal write skew template:
1. TXN_X reads a PREDICATE (a query matching multiple rows)
2. TXN_Y reads the SAME predicate
3. Based on the predicate result, TXN_X decides to write to Row_A
4. Based on the predicate result, TXN_Y decides to write to Row_B
5. Row_A ≠ Row_B (different rows!)
6. The combination of both writes violates an invariant
that the predicate was supposed to protect

// The key: neither write conflicts with the other in a write-write sense.
// The conflict is between TXN_X's READ and TXN_Y's WRITE (and vice versa).
// Snapshot isolation only tracks write-write conflicts, not read-write.

More Write Skew Examples

Scenario	Invariant	How write skew violates it
Meeting room booking	No overlapping bookings for same room	Two users check "room free at 2pm?", both get yes, both insert a booking
Username registration	Usernames are unique	Two users check "is 'alice99' taken?", both get no, both insert it
Multiplayer game	Two players can't move to same position	Both check "is (3,5) empty?", both get yes, both move there
Bank overdraft	Balance >= 0	Two concurrent withdrawals both check balance, both see enough funds

Phantoms

Notice a pattern in the write skew examples: one transaction's query result changes because another transaction inserts or deletes rows. The meeting room double-booking happens because the query "are there bookings for this room at 2pm?" returns zero rows, and another transaction inserts a row that WOULD have changed the query result.

These rows that don't exist yet (but would change a query's result) are called phantoms. They are the root cause of write skew.

Why SELECT FOR UPDATE can't help with phantoms. SELECT FOR UPDATE locks the rows returned by the query. But if the query returns zero rows (room has no bookings), there are no rows to lock! You can't lock rows that don't exist yet. The phantom will be inserted by the other transaction, and your lock protects nothing.

Materializing Conflicts

One workaround: create the rows to lock in advance. For the meeting room problem:

sql
-- Create a "lock table" with one row per room per 15-min slot
CREATE TABLE room_slots (
  room_id INT,
  slot_start TIMESTAMP,
  PRIMARY KEY (room_id, slot_start)
);
-- Pre-populate: INSERT INTO room_slots VALUES (1, '2024-01-15 14:00'), ...

-- Now bookings can lock the slot row:
BEGIN;
SELECT * FROM room_slots
WHERE room_id = 1 AND slot_start = '2024-01-15 14:00'
FOR UPDATE;
-- This locks the row. Another transaction trying to book the same slot blocks.

INSERT INTO bookings (room_id, start_time, user_id) VALUES (1, '2024-01-15 14:00', 42);
COMMIT;

This is called materializing conflicts — turning a phantom into a concrete row that can be locked. It works but it's ugly: you need to pre-create lock rows for every possible conflict, and the lock table leaks database implementation details into the application. The real solution is serializable isolation, which we cover in the next chapter.

Write Skew vs Lost Update: The Distinction

Interviewers love to test whether you can distinguish write skew from lost updates. Here is the precise difference:

Property	Lost Update	Write Skew
What conflicts	Two transactions write the SAME row	Two transactions write DIFFERENT rows
Pattern	Both read X, both modify X, both write X	Both read a CONDITION, each writes a different row, condition violated
Detection	Snapshot isolation detects (PostgreSQL)	Only serializable isolation detects
Fix	Atomic ops, SELECT FOR UPDATE, CAS	Serializable isolation, or materializing conflicts
Example	Two counter increments	Two doctors going off call

The formal pattern of write skew. Every write skew follows the same template: (1) Both transactions read a PREDICATE (a query that matches multiple rows). (2) Each transaction makes a decision based on the predicate result. (3) Each transaction writes to a DIFFERENT row, changing the predicate's result. (4) If both writes succeed, the predicate's result has changed in a way that violates an invariant. The key: the conflict is between a READ (the predicate) and a WRITE (the update), not between two writes.

Detecting Write Skew in Code Reviews

When reviewing code, here are red flags that indicate potential write skew:

python
# RED FLAG 1: Check-then-act on a count
count = db.execute("SELECT COUNT(*) FROM doctors WHERE on_call = TRUE")
if count >= 2:
    db.execute("UPDATE doctors SET on_call = FALSE WHERE id = %s", (my_id,))
# DANGER: another transaction can do the same check concurrently

# RED FLAG 2: Check-then-insert on uniqueness
existing = db.execute("SELECT id FROM users WHERE username = %s", (name,))
if not existing:
    db.execute("INSERT INTO users (username) VALUES (%s)", (name,))
# DANGER: use a UNIQUE constraint instead, which is enforced atomically

# RED FLAG 3: Check availability then reserve
available = db.execute("SELECT seats FROM flights WHERE id = %s", (flight_id,))
if available > 0:
    db.execute("INSERT INTO bookings ...")
    db.execute("UPDATE flights SET seats = seats - 1 WHERE id = %s", (flight_id,))
# DANGER: use SELECT FOR UPDATE on the flights row

The general pattern: any code that reads a condition, makes a decision, then writes based on that decision — without holding a lock on the condition — is vulnerable to write skew.

Fixing Write Skew: The Complete Toolkit

Depending on your database and situation, here are all available fixes for write skew, ranked from simplest to most complex:

Fix	How	Works for	Limitation
Unique constraint	CREATE UNIQUE INDEX	Username registration, seat booking	Only prevents duplicate inserts, not general predicates
Explicit lock	SELECT FOR UPDATE on predicate rows	Doctor on-call (lock the doctor rows)	Can't lock rows that don't exist (phantoms)
Materialized conflict	Pre-create lock rows, SELECT FOR UPDATE	Meeting room booking (lock the time-slot row)	Requires pre-populating lock table. Ugly.
Serializable isolation	SET TRANSACTION ISOLATION LEVEL SERIALIZABLE	Everything	Requires retry logic. Some performance cost.
Application-level lock	Redis SETNX, advisory lock	Cross-service invariants	Complex. Must handle lock expiry carefully.

The stafff-level answer. When asked "How do you prevent write skew?", don't immediately jump to serializable isolation. First ask: "Can I reformulate this as a unique constraint or a single-row lock?" Most write skew scenarios can be eliminated with simpler mechanisms. Reserve serializable isolation for the cases where the invariant truly spans multiple rows in complex ways. The best solutions don't add complexity — they restructure the data model to make the conflict impossible.

Write Skew: Doctor On-Call Scenario

Watch two doctors simultaneously go off call. The invariant "at least 1 on call" gets violated.

Interview question: You're building a flight booking system. Two customers simultaneously try to book the last seat. How would you prevent the double-booking using PostgreSQL?

Use SELECT FOR UPDATE on the flight's seats_available row, decrement it, and check the result: BEGIN; SELECT seats_available FROM flights WHERE id=123 FOR UPDATE; -- if seats_available > 0: UPDATE flights SET seats_available = seats_available - 1 WHERE id=123; INSERT INTO bookings ...; COMMIT; The lock serializes concurrent bookings. The second transaction blocks until the first commits, then reads the updated (decremented) seat count. Check seats_available in the application code before inserting Use a unique constraint on (flight_id, seat_number)

Chapter 7: Serializability — The Showcase

The gold standard of isolation is serializability. It guarantees that the result of executing transactions concurrently is identical to executing them in SOME serial order — one at a time, no overlap. This prevents every concurrency anomaly: dirty reads, non-repeatable reads, lost updates, write skew, and phantoms.

The question is: how do you achieve it without destroying performance? Three approaches exist, each with radically different tradeoffs.

Why Serializability Matters: A Concrete Case

Before diving into implementation, let's be precise about what serializability buys you. Consider a payroll system processing end-of-month salaries. Three transactions run concurrently:

// TXN_A: Calculate total payroll expense for department D
// SELECT SUM(salary) FROM employees WHERE dept = 'D'

// TXN_B: Transfer employee Alice from dept D to dept E
// UPDATE employees SET dept = 'E' WHERE name = 'Alice'

// TXN_C: Give everyone in dept D a 10% raise
// UPDATE employees SET salary = salary * 1.1 WHERE dept = 'D'

// Under snapshot isolation (non-serializable):
// TXN_A might see Alice's salary in its SUM (she was in D at snapshot time)
// TXN_B might move Alice to E
// TXN_C might NOT give Alice a raise (she's no longer in D when C runs)
// Result: the payroll report includes Alice's salary,
// but Alice didn't get the raise that the report assumed she would.
// The report is inconsistent with the actual state.

// Under serializability:
// The result must be equivalent to SOME serial order:
// Option 1: A, then B, then C — report includes Alice, she moves, raise applies to D without her
// Option 2: B, then C, then A — Alice moves first, raise applies without her, report excludes her
// Any serial order gives a consistent result.

Approach 1: Actual Serial Execution

The simplest approach: don't allow concurrency at all. Run every transaction on a single thread, one at a time. This sounds insane, but it works if:

Requirement	Why	How
Transactions are fast	A single thread can only do ~100K short txns/sec	Keep data in RAM (VoltDB, Redis)
No network round-trips	Waiting for app server between reads/writes wastes the single thread	Use stored procedures — entire transaction in one request
Data fits in memory	Disk I/O would block the single thread	In-memory databases with WAL for durability

VoltDB takes this approach. Redis is single-threaded by design. For partition-local transactions (where all data lives on one partition), this is surprisingly viable. For cross-partition transactions, you need coordination, which kills throughput.

Serial Execution: Throughput Calculation

How fast is a single thread? Let's do the math:

// Assumptions:
// - All data fits in RAM (no disk I/O during transaction)
// - Average transaction: 3 reads + 2 writes = 5 operations
// - Each RAM operation: ~100 nanoseconds (including data structure traversal)
// - Transaction overhead (begin/commit/WAL): ~1 microsecond

Transaction time = 5 × 100ns + 1μs = 1.5μs
Throughput = 1s / 1.5μs = ~666,000 transactions/second (per core)

// With stored procedures (no network round-trips between ops):
// Each transaction is a single function call, not a series of network messages.
// Without stored procedures:
// 3 reads × (0.1ms network RTT) + 2 writes × (0.1ms) = 0.5ms per transaction
// Throughput = 2,000 transactions/second. 300x slower!

// This is why serial execution REQUIRES stored procedures.
// VoltDB enforces this: all transactions are Java stored procedures.

Partitioning for throughput. One core can do ~500K txn/sec. If you need more, partition the data across N cores (or N nodes), each running its own serial executor. Each partition handles its subset of data independently. N=8 cores gives you ~4M txn/sec. The catch: cross-partition transactions require coordination, which serializes across partitions and destroys the throughput advantage. The system design skill is minimizing cross-partition transactions through smart data modeling.

Approach 2: Two-Phase Locking (2PL)

Two-Phase Locking was the standard approach for 30+ years. The rule is:

// Two-Phase Locking rules:
1. To READ a row: acquire a SHARED lock (multiple readers allowed)
2. To WRITE a row: acquire an EXCLUSIVE lock (no other readers or writers)
3. If you can't get a lock, WAIT until it's released
4. Hold ALL locks until COMMIT or ABORT (this is the "two phase" part)

// Phase 1: "Growing" — acquire locks, never release
// Phase 2: "Shrinking" — release all locks at once (on commit/abort)

// Why this works: if TXN_A reads a row, it holds a shared lock.
// TXN_B cannot write that row (needs exclusive lock, blocked by A's shared lock).
// So B's write cannot change what A read. No write skew possible.

The problem with 2PL: performance is terrible under contention. If transaction A holds a lock that B needs, B waits. If B holds a lock that C needs, C waits. One slow transaction can block dozens of others. And deadlocks are common: A waits for B's lock, B waits for A's lock. The database must detect deadlocks and abort one transaction.

Predicate locks and index-range locks. To prevent phantoms, 2PL uses predicate locks — locks on all rows matching a WHERE clause, including rows that don't exist yet. But checking every predicate against every other predicate is expensive. Index-range locks are a practical approximation: lock an index range (e.g., all bookings for room 1 between 2pm and 3pm). This is coarser (may lock more than needed) but fast to check.

Approach 3: Serializable Snapshot Isolation (SSI)

SSI is the modern approach, introduced in PostgreSQL 9.1 (2011). It's optimistic: let all transactions run concurrently on their own snapshots (like regular snapshot isolation), but track what each transaction reads and writes. At commit time, check if any conflicts would violate serializability. If so, abort the transaction. The application retries.

// SSI conflict detection:
// When TXN_A commits, check two things:

// 1. Did any concurrent transaction (TXN_B) write to a row that A read?
// If so, A's read was based on a stale snapshot. ABORT A.

// 2. Did A write to a row that any concurrent transaction (TXN_B) read?
// If so, B's read was based on a stale snapshot. ABORT B (when B tries to commit).

// The key insight: SSI only aborts when a rw-conflict CYCLE exists.
// Not every conflict means an anomaly — SSI uses dependency graph analysis
// to only abort when necessary.

SSI vs 2PL — the fundamental tradeoff. 2PL is pessimistic: it prevents conflicts by blocking. This guarantees no transaction ever needs to retry, but throughput collapses under contention. SSI is optimistic: it allows conflicts and aborts on detection. Transactions never block each other, so throughput is high, but some transactions must retry. SSI wins when contention is low (most transactions don't conflict). 2PL wins when contention is high (most transactions conflict, and retries would waste more work than blocking).

Deadlock Detection in 2PL

Deadlocks are the Achilles' heel of 2PL. They occur when two transactions each hold a lock the other needs:

// Classic deadlock:
TXN_A: Lock row X (exclusive)   success
TXN_B: Lock row Y (exclusive)   success
TXN_A: Lock row Y (exclusive)   BLOCKED — TXN_B holds it
TXN_B: Lock row X (exclusive)   BLOCKED — TXN_A holds it

// Both transactions are waiting for each other. Neither can proceed.
// Without detection, they wait forever.

// Detection: build a "waits-for" graph.
// TXN_A waits for TXN_B (needs Y). TXN_B waits for TXN_A (needs X).
// Cycle detected! Abort one transaction (usually the youngest or cheapest).

// Prevention: always acquire locks in a consistent order.
// If both transactions lock X before Y, TXN_B blocks on X first,
// and TXN_A can complete and release both locks.

Databases detect deadlocks by periodically checking for cycles in the waits-for graph (PostgreSQL does this every deadlock_timeout milliseconds, default 1 second). When a cycle is found, one transaction is aborted with an error: "ERROR: deadlock detected." The application must catch this error and retry.

SSI Implementation Details

SSI tracks two kinds of conflicts between concurrent transactions:

// SSI tracks "rw-antidependencies" (also called "rw-conflicts"):
// A rw-conflict from TXN_A to TXN_B means:
// "TXN_A read something that TXN_B subsequently wrote"

// Example: the doctor on-call scenario:
TXN_Alice reads on_call count (includes Bob's row) → 2
TXN_Bob reads on_call count (includes Alice's row) → 2
TXN_Alice writes Alice.on_call = false
TXN_Bob writes Bob.on_call = false

// rw-conflict: Alice read Bob's row, Bob wrote Bob's row
// rw-conflict: Bob read Alice's row, Alice wrote Alice's row
// This forms a CYCLE of rw-conflicts: Alice → Bob → Alice
// A cycle means the result is not serializable.
// SSI aborts one of them (whichever tries to commit second).

// The formal rule: if there exists a "dangerous structure"
// (two consecutive rw-conflicts involving the same transaction
// in the middle), the middle transaction may need to abort.

False positives. SSI may abort transactions that would not actually cause an anomaly. The conflict detection is conservative — it catches all real anomalies but may also flag some safe executions. In practice, the false positive rate is low (single-digit percent), and retries are cheap. The benefit is that reads never block and throughput stays high.

Stored Procedures for Serial Execution

For the serial execution approach to work, transactions must be fast. If a transaction needs to go back to the application server for logic (compute a price, check a rule), the single thread sits idle waiting for the network round-trip. The solution: stored procedures. The entire transaction — reads, logic, and writes — is bundled into a single request:

sql
-- Stored procedure: transfer money atomically
CREATE FUNCTION transfer(from_id INT, to_id INT, amount NUMERIC)
RETURNS VOID AS $$
DECLARE
  from_balance NUMERIC;
BEGIN
  SELECT balance INTO from_balance
  FROM accounts WHERE id = from_id;

  IF from_balance < amount THEN
    RAISE EXCEPTION 'Insufficient funds';
  END IF;

  UPDATE accounts SET balance = balance - amount WHERE id = from_id;
  UPDATE accounts SET balance = balance + amount WHERE id = to_id;
END;
$$ LANGUAGE plpgsql;

-- Single network round-trip: the database executes everything locally.
SELECT transfer(1, 2, 100.00);

VoltDB requires all transactions to be stored procedures. Redis processes all operations on a single thread. Both achieve serializability by eliminating concurrency entirely — and both are surprisingly fast for partition-local workloads.

The Big Simulation

The simulation below runs three concurrent transactions against the same data. Choose an isolation level and watch how each approach handles the workload. Serial queues them. 2PL blocks on lock conflicts. SSI runs optimistically and aborts on conflict.

Serializability Showdown: Serial vs 2PL vs SSI

Three transactions read and write overlapping rows. Watch how each approach handles conflicts differently.

Choose an isolation approach to begin.

How Each Approach Handles the Doctor Scenario

Let's trace the doctor on-call write skew through all three approaches to see exactly how they differ:

// Reminder: Alice and Bob are both on call. Each tries to go off call.
// Invariant: at least 1 doctor on call.

══════════ SERIAL EXECUTION ══════════
Queue: [TXN_Alice, TXN_Bob]
TXN_Alice runs: SELECT COUNT(*) WHERE on_call=true → 2. OK. UPDATE Alice.on_call=false. COMMIT.
TXN_Bob runs: SELECT COUNT(*) WHERE on_call=true → 1. NOT >= 2. ABORT.
Result: 1 doctor on call. Invariant preserved. Correct!

══════════ TWO-PHASE LOCKING ══════════
TXN_Alice: SELECT COUNT(*) WHERE on_call=true
  Acquires SHARED predicate lock on {on_call=true} rows (Alice, Bob)
TXN_Bob: SELECT COUNT(*) WHERE on_call=true
  Acquires SHARED predicate lock on same rows (compatible with Alice's shared lock)
TXN_Alice: UPDATE Alice.on_call=false
  Needs EXCLUSIVE lock on Alice's row. Alice's shared lock → upgrade to exclusive.
  But Bob holds a shared lock on Alice's row (from predicate lock). BLOCKED!
TXN_Bob: UPDATE Bob.on_call=false
  Needs EXCLUSIVE lock on Bob's row. But Alice holds a shared lock. BLOCKED!
DEADLOCK! Database detects the cycle and aborts TXN_Bob.
TXN_Alice: unblocked. UPDATE. COMMIT. 1 doctor on call. Correct!
TXN_Bob: retries. SELECT COUNT(*) → 1. ABORT (insufficient).

══════════ SSI (Serializable Snapshot Isolation) ══════════
TXN_Alice: snapshot at T=100. SELECT COUNT(*) → 2. UPDATE Alice.on_call=false.
TXN_Bob: snapshot at T=100. SELECT COUNT(*) → 2. UPDATE Bob.on_call=false.
TXN_Alice: COMMIT at T=101. SSI records: "Alice read rows matching on_call=true"
TXN_Bob: COMMIT at T=102.
  SSI checks: "Bob read on_call=true rows. Alice wrote to one of those rows
  and committed AFTER Bob's snapshot. Bob's read was stale. rw-conflict detected."
  ABORT TXN_Bob!
TXN_Bob: retries. SELECT COUNT(*) → 1. ABORT (insufficient).
Result: 1 doctor on call. Correct!

All three approaches reach the correct result. Serial forces one to go first. 2PL creates a deadlock (which is resolved by aborting one). SSI detects the stale read at commit time. The difference is in performance: serial has zero concurrency, 2PL causes blocking/deadlocks, and SSI causes one retry. For this particular scenario, SSI is the most efficient.

Comparison Table

Property	Serial Execution	Two-Phase Locking	SSI
Concurrency	None	Limited (lock contention)	High (snapshot-based)
Blocking	Queued	Readers block writers, writers block everyone	No blocking
Deadlocks	Impossible	Common (need detection + timeout)	Impossible
Aborts	Rare	On deadlock only	On conflict detection
Latency	Depends on queue depth	Unpredictable (lock waits)	Low (but retries add tail latency)
Throughput	Limited by single core	Degrades under contention	High until conflict rate is high
Used by	VoltDB, Redis	MySQL InnoDB, SQL Server	PostgreSQL 9.1+, CockroachDB

Interview question: Your PostgreSQL database uses serializable isolation (SSI). Under heavy load, you notice many transactions being aborted due to serialization failures. What would you investigate and what might you change?

Switch to read committed isolation First investigate: (1) Are transactions reading/writing overlapping data? SSI aborts when transactions have read-write dependencies. (2) Are transactions long-running? Longer transactions have more time to conflict. (3) Is the retry logic correct? (exponential backoff, limited retries). Mitigations: shorten transaction scope (do less work per transaction), reduce hot-spot contention (partition data so transactions touch different rows), move read-only queries to snapshot isolation (no need for serializable if you're only reading), and ensure retries use exponential backoff so aborted transactions don't immediately conflict again. Increase the lock timeout setting

Chapter 8: Distributed Transactions

Everything so far assumed a single database node. But what happens when a transaction spans multiple databases, or multiple partitions on different machines? You need a way to commit atomically across all of them — or roll back across all of them. This is the domain of distributed transactions.

Two-Phase Commit (2PC)

Two-Phase Commit is the classic protocol for distributed atomicity. It introduces a coordinator (also called the transaction manager) that orchestrates the commit across multiple participants (database nodes).

// Two-Phase Commit Protocol:

// PHASE 1: PREPARE
Coordinator → Participant A: "Can you commit TXN_42?"
Coordinator → Participant B: "Can you commit TXN_42?"
Coordinator → Participant C: "Can you commit TXN_42?"

// Each participant:
// - Writes all transaction data to its WAL
// - Acquires all necessary locks
// - Responds YES (I promise I can commit) or NO (something went wrong)

A → Coordinator: "YES"
B → Coordinator: "YES"
C → Coordinator: "YES"

// PHASE 2: COMMIT (or ABORT)
// If ALL said YES:
Coordinator writes "COMMIT TXN_42" to its own WAL (the decision point)
Coordinator → A: "COMMIT"
Coordinator → B: "COMMIT"
Coordinator → C: "COMMIT"

// If ANY said NO:
Coordinator → All: "ABORT"

The Fatal Flaw: Coordinator Crash

The critical moment is after Phase 1 (all participants said YES) but before Phase 2 (coordinator sends COMMIT or ABORT). If the coordinator crashes at this exact moment:

// The Danger Zone:
Time 1: All participants vote YES
Time 2: Coordinator writes COMMIT to WAL ← or crashes before this!
Time 3: Coordinator sends COMMIT to participants

// If coordinator crashes between Time 1 and Time 2:
// - Participants have all voted YES
// - They've promised they CAN commit
// - They're holding locks on the data
// - But they don't know the decision: COMMIT or ABORT?
// - They CANNOT release locks (what if the decision was COMMIT?)
// - They CANNOT commit (what if another participant voted NO?)
// - They are STUCK. "In doubt" transactions. Holding locks. Blocking others.

// This can last HOURS until the coordinator recovers.
// In the worst case, it requires manual intervention by a DBA.

This is why 2PC is hated. A coordinator crash after PREPARE but before COMMIT leaves participants holding locks indefinitely. Those locks block other transactions on those rows. A single coordinator failure can bring an entire distributed system to a halt. This is not a theoretical concern — it happens in production, and it is the reason most modern distributed databases avoid 2PC entirely.

Three-Phase Commit (3PC)

3PC adds a "pre-commit" phase to avoid the blocking problem. After all participants vote YES, the coordinator sends a PRE-COMMIT message. Participants can then decide: if they received PRE-COMMIT, they know the decision was COMMIT; if the coordinator crashes, they can safely commit on timeout.

In theory, 3PC is non-blocking. In practice, it doesn't work: network partitions can cause different participants to reach different conclusions about what happened. One participant might time out and commit, while another times out and aborts. 3PC assumes reliable network delivery with bounded delays, which is impossible in real distributed systems.

XA Transactions

XA is a standard API (X/Open XA) for 2PC across heterogeneous systems — for example, a transaction that spans a PostgreSQL database AND a message queue AND an Oracle database. The application server acts as the coordinator.

XA transactions are used in enterprise systems (Java EE / Jakarta EE), but they inherit all of 2PC's problems plus additional overhead from the XA API. Modern systems avoid them in favor of eventual consistency patterns (sagas, outbox pattern).

The Saga Pattern in Detail

Since distributed transactions are problematic, most modern microservice architectures use sagas instead. A saga is a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.

python
class OrderSaga:
    """Choreography-based saga for order processing."""

    def start(self, order):
        # Step 1: Create order (local transaction)
        self.order_service.create_order(order)  # status = PENDING
        self.event_bus.publish("OrderCreated", order)

    def on_order_created(self, event):
        # Step 2: Reserve inventory (local transaction)
        try:
            self.inventory_service.reserve(event.product_id, event.qty)
            self.event_bus.publish("InventoryReserved", event)
        except InsufficientStock:
            # Compensate: cancel the order
            self.order_service.cancel_order(event.order_id)
            self.event_bus.publish("OrderCancelled", event)

    def on_inventory_reserved(self, event):
        # Step 3: Process payment (local transaction)
        try:
            self.payment_service.charge(event.customer_id, event.amount)
            self.event_bus.publish("PaymentProcessed", event)
        except PaymentFailed:
            # Compensate: release inventory, then cancel order
            self.inventory_service.release(event.product_id, event.qty)
            self.order_service.cancel_order(event.order_id)
            self.event_bus.publish("OrderCancelled", event)

    def on_payment_processed(self, event):
        # Step 4: Confirm order (local transaction)
        self.order_service.confirm_order(event.order_id)  # status = CONFIRMED
        self.event_bus.publish("OrderConfirmed", event)

Saga limitations. Sagas provide eventual consistency, not strong consistency. During the saga execution, the system is in an intermediate state: inventory is reserved but payment hasn't been processed yet. If another saga reads the inventory count during this window, it sees a different number than it would after the saga completes. Sagas also don't provide isolation — interleaving sagas can see each other's intermediate states. For most e-commerce applications, this is acceptable. For financial ledgers, it is not.

Exactly-Once Processing: The Holy Grail

Many interview candidates claim their system provides "exactly-once processing." This is almost always wrong. In a distributed system, you can guarantee:

Guarantee	Meaning	How
At-most-once	Message processed 0 or 1 times	Don't retry on failure. Risk: lost messages.
At-least-once	Message processed 1 or more times	Retry until acknowledged. Risk: duplicate processing.
Exactly-once	Message processed exactly 1 time	At-least-once delivery + idempotent processing. The combination achieves the EFFECT of exactly-once.

True exactly-once delivery is impossible in a distributed system (proven by the Two Generals Problem). But you can achieve the effect of exactly-once by making every operation idempotent: processing the same message twice produces the same result as processing it once. The idempotency key pattern (store a unique request ID with each processed message, skip duplicates) is the standard implementation.

The 2PC State Machine

Let's trace the exact states each participant goes through during 2PC. This is a common whiteboard question.

// COORDINATOR state machine:
INIT → (send PREPARE) → WAITING
WAITING → (all YES) → write COMMIT to WAL → COMMITTED → (send COMMIT) → DONE
WAITING → (any NO) → write ABORT to WAL → ABORTED → (send ABORT) → DONE
WAITING → (timeout) → ABORTED → (send ABORT) → DONE

// PARTICIPANT state machine:
INIT → (receive PREPARE) → PREPARED → (send YES) → WAITING_DECISION
INIT → (receive PREPARE, but can't commit) → ABORTED → (send NO) → DONE
WAITING_DECISION → (receive COMMIT) → COMMITTED → DONE
WAITING_DECISION → (receive ABORT) → ABORTED → DONE
WAITING_DECISION → (timeout, coordinator crashed) → ??? ← THE PROBLEM!

// In WAITING_DECISION with no coordinator:
// Can't commit (maybe another participant voted NO)
// Can't abort (maybe coordinator decided COMMIT)
// Must WAIT. Holding locks. Blocking other transactions.

The "point of no return." When a participant votes YES in Phase 1, it makes an irrevocable promise: "I CAN commit this transaction." It has written all data to its WAL, acquired all locks, and verified all constraints. After this point, the participant has surrendered its autonomy — it can no longer unilaterally decide to abort. Only the coordinator can tell it what to do. This is the fundamental asymmetry that makes 2PC fragile: the coordinator has all the power, and participants are helpless without it.

Worked Example: 2PC with Numbers

// Distributed bank transfer: debit $100 from Node A, credit $100 to Node B

// PHASE 1: PREPARE
Coordinator → Node A: PREPARE (debit account 1001 by $100)
  Node A checks: account 1001 has $500. $500 >= $100. OK.
  Node A writes to WAL: "IF COMMIT: account 1001 = $400"
  Node A locks row for account 1001 (exclusive)
  Node A → Coordinator: YES

Coordinator → Node B: PREPARE (credit account 2001 by $100)
  Node B checks: account 2001 exists. OK.
  Node B writes to WAL: "IF COMMIT: account 2001 = $600"
  Node B locks row for account 2001 (exclusive)
  Node B → Coordinator: YES

// PHASE 2: COMMIT
Coordinator writes to its own WAL: "TXN_42: COMMIT" ← THE DECISION POINT
Coordinator → Node A: COMMIT
  Node A applies: account 1001 = $400. Releases lock.
Coordinator → Node B: COMMIT
  Node B applies: account 2001 = $600. Releases lock.

// Total time locks are held:
// Node A: from PREPARE until COMMIT received (2 network round-trips)
// In a cross-datacenter deployment with 50ms RTT: ~200ms of lock time
// During those 200ms, NO other transaction can touch account 1001 or 2001.

What Modern Systems Do Instead

System	Approach	Tradeoff
Google Spanner	Paxos-based commit across replicas, TrueTime for global ordering	Requires atomic clocks + GPS receivers in every data center
CockroachDB	Parallel commits with Raft consensus per range	Higher latency for cross-range transactions
Kafka + Outbox	Write to DB + outbox table atomically, async publish to Kafka	Eventual consistency, not strong consistency
Sagas	Sequence of local transactions with compensating actions on failure	No isolation between saga steps, complex failure handling

Google Spanner: How to Do It Right

Google Spanner is the gold standard for globally distributed, serializable transactions. It avoids 2PC's blocking problem by using Paxos-based state machines for each shard, and uses TrueTime — a global clock with bounded uncertainty — to assign globally ordered timestamps to transactions.

// Spanner's approach:
// 1. Each shard is a Paxos group (5 replicas, majority agreement)
// 2. A shard leader can crash without blocking — another replica takes over
// 3. Cross-shard transactions use 2PC, but each participant is a Paxos group
// (not a single node), so coordinator crash = Paxos leader election, not blocking

// TrueTime API:
TT.now() → [earliest, latest] // an interval, not a point
// Guarantee: the true time is somewhere within that interval
// Uncertainty is typically 1-7 milliseconds

// For a commit at timestamp t:
// Spanner waits until TT.now().earliest > t
// This guarantees that all future transactions will have timestamps > t
// Consequence: commit latency includes the clock uncertainty (1-7ms wait)
// This is why Google invested in atomic clocks + GPS receivers in every datacenter

Spanner proves that globally distributed, serializable transactions are possible — but at the cost of specialized hardware (atomic clocks), higher latency (the TrueTime wait), and enormous engineering investment. CockroachDB implements a similar approach using NTP clocks (less precise than TrueTime but no special hardware needed), with slightly weaker guarantees.

Choosing Between Distributed Transaction Approaches

When an interviewer asks "How would you handle transactions across services?", your answer should depend on the consistency requirements. Here is the decision framework:

// Decision tree for distributed transactions:

Q1: Can you avoid distributed transactions entirely?
   YES → Model your data so each transaction is local to one service.
         This is almost always the right answer. Rethink your service boundaries.
   NO → Q2

Q2: Do you need strong consistency (linearizability)?
   NO → Use sagas with compensating transactions. Accept eventual consistency.
        Works for e-commerce, booking, most SaaS applications.
   YES → Q3

Q3: Can you afford the infrastructure?
   YES → Use CockroachDB or Google Spanner. They solve this at the database level.
   NO → Use a single database as the source of truth.
        All writes go through one DB. Services read from replicas.
        This scales surprisingly far (Twitter ran on MySQL for years).

The microservices transaction trap. Many teams split into microservices first, then discover they need transactions across services. This is backwards. Start with a monolith (one database, real transactions). Split into services only when you have a clear scaling bottleneck, and split along boundaries that minimize cross-service transactions. The cost of distributed transactions far exceeds the cost of a larger monolith.

Two-Phase Commit Simulator

Watch 2PC with 3 participants. Inject a coordinator crash after PREPARE to see participants stuck "in doubt."

Interview question: An architect proposes using 2PC to coordinate transactions across your payment service and inventory service. What is your main concern?

2PC is too slow for high-throughput payments 2PC doesn't support more than 2 participants If the coordinator crashes after PREPARE but before COMMIT, both services hold locks indefinitely. The payment service can't process other payments for that customer. The inventory service can't release or sell that item. This is a single point of failure that can cascade into a system-wide outage. Prefer a saga pattern: process payment first, then reserve inventory, with a compensating "refund" action if inventory reservation fails. You trade strong consistency for availability.

Chapter 9: Interview Arsenal

Isolation Level Cheat Sheet

Anomaly	Read Uncommitted	Read Committed	Snapshot / Repeatable Read	Serializable
Dirty reads	Allowed	Prevented	Prevented	Prevented
Dirty writes	Allowed	Prevented	Prevented	Prevented
Non-repeatable reads	Allowed	Allowed	Prevented	Prevented
Lost updates	Allowed	Allowed	Depends*	Prevented
Write skew	Allowed	Allowed	Allowed	Prevented
Phantoms	Allowed	Allowed	Allowed	Prevented

*PostgreSQL's repeatable read detects lost updates. MySQL's does not.

System Design: Payment System

"Design a payment system with exactly-once semantics." Key insights: (1) Use an idempotency key per request — store it with the payment record in the SAME transaction. If a retry arrives with the same key, return the existing result instead of processing again. (2) The debit and credit must be atomic — use a database transaction, not two API calls. (3) For cross-service payments (debit service A, credit service B), use the outbox pattern: write the debit + an "outbox event" in one local transaction, then an async worker publishes the event to credit service B. If B fails, the outbox event is retried. (4) Store a payment state machine: PENDING → DEBITED → CREDITED → COMPLETE (or FAILED). Each transition is a local transaction.

System Design: Booking System

"Design a booking system that prevents double-booking." Key insights: (1) Use SELECT FOR UPDATE on the resource row (seat, room, time slot) inside a transaction. (2) Decrement the availability counter and insert the booking in the SAME transaction. (3) For high-concurrency scenarios (concert tickets), use optimistic locking: UPDATE SET available = available - 1 WHERE available > 0. If 0 rows affected, the item is sold out — no lock contention. (4) For distributed booking (multiple servers), either use a single database as the source of truth, or use a distributed lock service (Redis SETNX with TTL).

Coding Drill: MVCC Version Visibility

python
class MVCCStore:
    """Simplified MVCC implementation for interview coding rounds."""
    def __init__(self):
        self.versions = []  # list of (key, value, created_by, deleted_by)
        self.next_txn_id = 1
        self.committed = set()  # set of committed txn IDs

    def begin(self):
        txn_id = self.next_txn_id
        self.next_txn_id += 1
        return txn_id

    def write(self, txn_id, key, value):
        # Mark old versions as deleted
        for i, (k, v, created, deleted) in enumerate(self.versions):
            if k == key and deleted is None:
                self.versions[i] = (k, v, created, txn_id)
        # Create new version
        self.versions.append((key, value, txn_id, None))

    def read(self, txn_id, key):
        """Return the value visible to this transaction."""
        for k, v, created, deleted in reversed(self.versions):
            if k != key:
                continue
            # Visibility check:
            # 1. Creator must have committed AND started before us
            if created > txn_id or (created != txn_id and created not in self.committed):
                continue
            # 2. Not deleted, or deleter started after us, or deleter not committed
            if deleted is not None:
                if deleted <= txn_id and deleted in self.committed:
                    continue  # deleted by a committed txn before us
            return v
        return None

    def commit(self, txn_id):
        self.committed.add(txn_id)

# Example usage:
store = MVCCStore()
t1 = store.begin()  # TXN 1
store.write(t1, 'alice', 500)
store.commit(t1)

t2 = store.begin()  # TXN 2
t3 = store.begin()  # TXN 3
store.write(t2, 'alice', 400)
print(store.read(t3, 'alice'))  # 500 (t2 not committed)
store.commit(t2)
print(store.read(t3, 'alice'))  # 500 (t2 committed AFTER t3 started)
t4 = store.begin()  # TXN 4
print(store.read(t4, 'alice'))  # 400 (t2 committed before t4 started)

Coding Drill: Transaction Retry Logic

In any system using optimistic concurrency (SSI or compare-and-set), transactions may be aborted and need retrying. Here is production-grade retry logic with exponential backoff:

python
import time
import random
import psycopg2
from psycopg2 import errors

def run_with_retry(conn_params, txn_func, max_retries=5):
    """Execute a transaction function with automatic retry on serialization failure.

    txn_func: callable(cursor) -> result
    Returns the result of txn_func on success.
    Raises after max_retries exhausted.
    """
    for attempt in range(max_retries):
        conn = psycopg2.connect(**conn_params)
        conn.set_isolation_level(
            psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE
        )
        try:
            with conn.cursor() as cur:
                result = txn_func(cur)
            conn.commit()
            return result  # success!
        except errors.SerializationFailure:
            conn.rollback()
            if attempt == max_retries - 1:
                raise  # exhausted retries
            # Exponential backoff with jitter
            delay = (2 ** attempt) * 0.01  # 10ms, 20ms, 40ms, 80ms, 160ms
            delay *= (0.5 + random.random())  # add jitter to prevent thundering herd
            time.sleep(delay)
        except errors.DeadlockDetected:
            conn.rollback()
            # Same retry logic — deadlock is a transient failure
            delay = (2 ** attempt) * 0.01 * (0.5 + random.random())
            time.sleep(delay)
        finally:
            conn.close()

# Usage:
def transfer(cur):
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    return "transferred"

result = run_with_retry({'dbname': 'bank'}, transfer)

Why jitter matters. Without jitter, all retrying transactions wake up at the same time (after exactly 10ms, 20ms, etc.) and immediately conflict again. Adding a random factor spreads them out in time, dramatically reducing the probability of repeated conflicts. Amazon's "Exponential Backoff and Jitter" blog post is the canonical reference on this pattern.

Debug Scenarios

"Transactions are timing out under load." Most likely cause: 2PL contention. Check: (1) pg_stat_activity for blocked queries. (2) Lock wait chains: is one long-running transaction holding locks? (3) Solution: shorten transactions, move non-critical work outside the transaction, or switch to SSI.

"Occasional incorrect account balances." Most likely cause: lost updates or write skew at the wrong isolation level. Check: (1) Is the isolation level read committed? Lost updates are possible. (2) Is the code doing read-modify-write without FOR UPDATE? Classic lost update. (3) Are there multi-row invariants? Write skew. Fix: upgrade to serializable or add explicit locking.

"Deadlocks in production." This happens with 2PL. Check: (1) Are transactions acquiring locks in different orders? Fix by always locking resources in a consistent global order (e.g., by primary key). (2) Is there a lock hierarchy violation? (3) Set a reasonable lock_timeout so deadlocked transactions are aborted promptly rather than hanging.

"Database VACUUM is running constantly and table is bloating." A long-running transaction (or an idle transaction in repeatable read / serializable) is preventing dead tuple cleanup. Check: pg_stat_activity for old transactions. Fix: set idle_in_transaction_session_timeout = '5min' to automatically kill idle transactions. For analytics queries, use a read replica with its own MVCC lifecycle.

"Writes are succeeding but reads show old data." Most likely cause: the application reads from a replica with replication lag, but writes go to the primary. The replica hasn't caught up yet. Fix: for critical reads after writes, route to the primary (read-your-writes consistency). Or use session stickiness to ensure a user always reads from the same node after writing. This is a replication issue, not a transaction issue — but it's often confused with isolation problems.

"Serialization failures spiking during peak hours." SSI is aborting transactions because of high contention on hot rows. Investigate: (1) Which rows are hot? Often a single counter or status row that every transaction reads. (2) Can you partition the hot data? Instead of one global counter, use per-shard counters and sum them periodically. (3) Can you batch updates? Instead of incrementing on every request, buffer counts in Redis and flush to the database every second. (4) Can you downgrade the isolation level for reads? If the reading transaction only needs approximate data, use read committed.

Coding Drill: Simple 2PL Lock Manager

python
import threading
from collections import defaultdict

class LockManager:
    """Two-Phase Locking with deadlock detection via timeout."""
    def __init__(self):
        self.locks = {}  # key -> {'mode': 'S'|'X', 'holders': set(txn_ids)}
        self.mutex = threading.Lock()
        self.waiters = {}  # key -> threading.Event

    def acquire(self, txn_id, key, mode, timeout=5.0):
        """Acquire lock. Returns True on success, False on timeout (deadlock)."""
        while True:
            with self.mutex:
                if key not in self.locks:
                    self.locks[key] = {'mode': mode, 'holders': {txn_id}}
                    return True
                lock = self.locks[key]
                # Compatible: both shared
                if lock['mode'] == 'S' and mode == 'S':
                    lock['holders'].add(txn_id)
                    return True
                # Already hold it
                if txn_id in lock['holders']:
                    if mode == 'X':
                        lock['mode'] = 'X'  # upgrade
                    return True
                # Must wait
                if key not in self.waiters:
                    self.waiters[key] = threading.Event()
                evt = self.waiters[key]
            # Wait outside mutex
            if not evt.wait(timeout):
                return False  # timeout = probable deadlock

    def release_all(self, txn_id):
        """Release all locks held by txn_id (called on commit/abort)."""
        with self.mutex:
            for key in list(self.locks):
                lock = self.locks[key]
                lock['holders'].discard(txn_id)
                if not lock['holders']:
                    del self.locks[key]
                    if key in self.waiters:
                        self.waiters[key].set()
                        del self.waiters[key]

Design Challenge: Outbox Pattern

The outbox pattern is the most common way to achieve reliable cross-service communication without distributed transactions. It deserves its own worked example because it comes up in nearly every system design interview involving microservices.

python
# The Outbox Pattern — sending reliable events from a database

class OrderService:
    def place_order(self, user_id, product_id, amount):
        # SINGLE local transaction — no distributed coordination!
        with self.db.transaction() as txn:
            # 1. Insert the order
            order_id = txn.execute(
                "INSERT INTO orders (user_id, product_id, amount, status) "
                "VALUES (%s, %s, %s, 'PENDING') RETURNING id",
                (user_id, product_id, amount)
            )

            # 2. Insert event into outbox table (SAME transaction!)
            txn.execute(
                "INSERT INTO outbox (aggregate_id, event_type, payload) "
                "VALUES (%s, %s, %s)",
                (order_id, 'OrderPlaced',
                 json.dumps({'order_id': order_id,
                             'user_id': user_id,
                             'amount': amount}))
            )
            # Both writes are atomic. If either fails, both roll back.

class OutboxPublisher:
    """Background worker that polls the outbox and publishes to Kafka."""
    def run(self):
        while True:
            with self.db.transaction() as txn:
                # Grab unpublished events (with row lock to prevent duplicates)
                events = txn.execute(
                    "SELECT id, event_type, payload FROM outbox "
                    "WHERE published = FALSE "
                    "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
                )
                for event in events:
                    self.kafka.publish(event.event_type, event.payload)
                    txn.execute(
                        "UPDATE outbox SET published = TRUE WHERE id = %s",
                        (event.id,)
                    )
            time.sleep(0.1)  # poll every 100ms

Why this works. The order and the outbox event are written in the same local transaction — guaranteed atomic. The background publisher reads from the outbox and sends to Kafka. If the publisher crashes, it retries on the next poll (the outbox event is still there with published=FALSE). If Kafka is down, the publisher retries later. The consumer must be idempotent (handle duplicate messages) because at-least-once delivery is guaranteed, not exactly-once.

System Design: Idempotent Payment Processing

Let's walk through a complete payment system design — the kind of answer that gets a "strong hire" at a senior+ level:

python
class PaymentService:
    """Idempotent payment processing with exactly-once semantics."""

    def process_payment(self, idempotency_key, from_acct, to_acct, amount):
        """Process a payment. Safe to retry with the same idempotency_key."""

        with self.db.transaction(isolation='serializable') as txn:
            # Step 1: Check idempotency key
            existing = txn.execute(
                "SELECT result FROM payments WHERE idempotency_key = %s",
                (idempotency_key,)
            )
            if existing:
                return existing[0]  # Already processed — return cached result

            # Step 2: Check sufficient funds (with lock)
            balance = txn.execute(
                "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
                (from_acct,)
            )
            if balance < amount:
                # Record the failure (so retries get the same answer)
                txn.execute(
                    "INSERT INTO payments (idempotency_key, status, error) "
                    "VALUES (%s, 'FAILED', 'insufficient_funds')",
                    (idempotency_key,)
                )
                return {'status': 'FAILED', 'error': 'insufficient_funds'}

            # Step 3: Execute the transfer
            txn.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (amount, from_acct)
            )
            txn.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (amount, to_acct)
            )

            # Step 4: Record the success (for idempotency)
            txn.execute(
                "INSERT INTO payments (idempotency_key, status, amount) "
                "VALUES (%s, 'SUCCESS', %s)",
                (idempotency_key, amount)
            )

            # Step 5: Write to outbox (for async notification)
            txn.execute(
                "INSERT INTO outbox (event_type, payload) VALUES (%s, %s)",
                ('PaymentCompleted', json.dumps({
                    'from': from_acct, 'to': to_acct,
                    'amount': str(amount),
                    'idempotency_key': idempotency_key
                }))
            )

            return {'status': 'SUCCESS', 'amount': str(amount)}

Why this works. (1) The idempotency key check and the payment are in the SAME transaction — either both persist or neither does. (2) SELECT FOR UPDATE on the source account prevents concurrent withdrawals from draining the account below zero. (3) The outbox entry is written in the same transaction — if the payment succeeds, the notification event is guaranteed to exist. (4) If the client retries with the same key, it gets the cached result without re-executing. (5) Serializable isolation prevents any write skew between concurrent payments.

System Design: Distributed Lock with Fencing

For distributed systems without shared databases, you sometimes need a distributed lock. But distributed locks are tricky — a lock can be held by a crashed process:

// Distributed lock with fencing token:

// Step 1: Acquire lock with TTL (Redis SETNX)
lock = redis.set("lock:seat-42", my_id, nx=True, ex=30) # 30sec TTL
// If another process holds the lock, this returns False.

// Step 2: Get a monotonic fencing token
token = redis.incr("fence:seat-42") # returns 1, 2, 3, ...

// Step 3: Do work, then write to database WITH the token
db.execute( "UPDATE seats SET booked_by = %s WHERE id = 42 AND fence_token < %s", (my_id, token) )

// WHY the fencing token matters:
// Process A acquires lock, gets token=5, then pauses (GC, network delay)
// Lock expires (TTL). Process B acquires lock, gets token=6, writes to DB
// Process A wakes up, tries to write with token=5
// DB rejects: fence_token(5) < current_fence_token(6). A's write fails.
// Without fencing, A's stale write would overwrite B's valid write.

Interview Canvas: Isolation Level Decision Tree

Isolation Level Quick Reference

A visual summary of what each level prevents. Reference this in interviews.

Coding Drill: Deadlock Detection

python
from collections import defaultdict

class DeadlockDetector:
    """Detect deadlocks using cycle detection in the waits-for graph."""

    def __init__(self):
        # waits_for[A] = B means "transaction A is waiting for B's lock"
        self.waits_for = {}

    def add_wait(self, waiter_txn, holder_txn):
        """Record that waiter_txn is blocked waiting for holder_txn."""
        self.waits_for[waiter_txn] = holder_txn

    def remove_wait(self, txn):
        """Remove txn from waits-for graph (it got the lock or aborted)."""
        self.waits_for.pop(txn, None)

    def detect_cycle(self):
        """Return the cycle if one exists, else None.
        Uses Floyd's tortoise-and-hare for each starting node."""
        for start in self.waits_for:
            visited = set()
            node = start
            while node in self.waits_for:
                if node in visited:
                    # Found a cycle. Return the cycle path.
                    cycle = [node]
                    n = self.waits_for[node]
                    while n != node:
                        cycle.append(n)
                        n = self.waits_for[n]
                    cycle.append(node)
                    return cycle
                visited.add(node)
                node = self.waits_for[node]
        return None

# Usage:
dd = DeadlockDetector()
dd.add_wait('TXN_A', 'TXN_B')  # A waits for B
dd.add_wait('TXN_B', 'TXN_C')  # B waits for C
print(dd.detect_cycle())  # None (no cycle)

dd.add_wait('TXN_C', 'TXN_A')  # C waits for A — cycle!
print(dd.detect_cycle())  # ['TXN_A', 'TXN_B', 'TXN_C', 'TXN_A']
# Resolution: abort the youngest transaction in the cycle (usually TXN_C)

PostgreSQL's approach. PostgreSQL checks for deadlocks every deadlock_timeout milliseconds (default: 1 second). When a cycle is found, it aborts the transaction that would be least costly to retry (typically based on the number of locks held and the amount of work done). The remaining transactions in the cycle are unblocked and can proceed. Setting deadlock_timeout too low wastes CPU on frequent checks. Setting it too high means deadlocked transactions sit idle for longer. 1 second is a reasonable default for most workloads.

Paper / Resource	Why it matters
Hermitage (Martin Kleppmann)	Tests actual isolation level behavior across databases — many don't match their documentation
A Critique of ANSI SQL Isolation Levels (Berenson et al., 1995)	Defines snapshot isolation and write skew formally — the canonical paper
Serializable Snapshot Isolation (Cahill et al., 2008)	The algorithm behind PostgreSQL's serializable mode
Spanner: Google's Globally-Distributed Database (Corbett et al., 2012)	How Google achieves serializable distributed transactions with TrueTime
Amazon DynamoDB Transactions (AWS docs)	How DynamoDB implements transactions on a distributed key-value store using timestamp ordering
Calvin: Fast Distributed Transactions for Partitioned Database Systems (Thomson et al., 2012)	Deterministic transaction ordering — decide the serial order BEFORE execution, eliminating the need for 2PL or SSI

The Five Interview Dimensions for Transactions

Every strong interview answer on transactions touches five areas. Practice framing your answers across all five:

Dimension	Example question	What a great answer includes
Concept	"Explain the difference between snapshot isolation and serializable."	Both use MVCC. Snapshot allows write skew (reads don't create conflicts). Serializable tracks read-write dependencies (SSI) or uses predicate locks (2PL). Concrete example of anomaly that one prevents but the other doesn't.
Design	"Design a booking system that prevents double-booking."	SELECT FOR UPDATE on the resource row. Atomic decrement with WHERE available > 0. Idempotency key for retries. Discuss the scalability path: partition seats into blocks.
Code	"Implement MVCC visibility rules."	The MVCCStore class from this chapter. Created_by and deleted_by transaction IDs. Visibility depends on commit status relative to the reading transaction's snapshot.
Debug	"Our payment system occasionally double-charges customers."	Missing idempotency key → retries re-execute. Or: read committed with read-modify-write → lost update. Systematic diagnosis: check isolation level, check for non-atomic operations, check retry logic.
Frontier	"What's the state of the art for distributed transactions?"	Spanner (TrueTime + Paxos), CockroachDB (hybrid logical clocks + Raft), Calvin (deterministic ordering), FoundationDB (optimistic concurrency on sorted key-value). Trend: moving from pessimistic (2PL) to optimistic (SSI) approaches.

Design question: You're designing a ticket-selling system for a concert with 10,000 seats. At sale time, 50,000 users hit the system simultaneously. Each user must be assigned exactly one seat. Which isolation approach do you use and why?

Use optimistic locking with atomic decrement: UPDATE seats SET available = available - 1 WHERE event_id = X AND available > 0. If 0 rows affected, sold out. No row-level locks needed — the atomic UPDATE handles concurrency. For seat assignment, INSERT INTO bookings with a unique constraint on (event_id, seat_number) — duplicates are rejected by the constraint. This avoids the thundering herd problem of 50K SELECT FOR UPDATE queries all blocking on the same rows. For further scalability, partition seats into blocks and route users to different partitions. Use serializable isolation to prevent double-booking Use two-phase locking on the seat rows

Chapter 10: Connections

Transactions don't exist in isolation (pun intended). They connect to nearly every topic in distributed systems.

Where This Fits

Related Topic	Connection
DDIA Ch6: Replication	Replication determines how many copies of data exist. Transactions on a single leader are simpler than on multiple leaders or leaderless replicas. Multi-leader replication with transactions requires cross-leader coordination (which is why most systems don't support it).
DDIA Ch9: The Trouble with Distributed Systems	Network partitions, clock skew, and process pauses all complicate distributed transactions. 2PC's coordinator failure is a specific instance of the general problem: you can't distinguish a crashed node from a slow one.
DDIA Ch10: Consensus	Consensus protocols (Paxos, Raft) solve the problem that 2PC fails at: agreeing on a decision even when nodes crash. Spanner uses Paxos-based state machines to implement distributed transactions without 2PC's blocking problem.
CAP Theorem	Transactions with strong isolation require linearizability, which forces you to sacrifice availability during network partitions (the "CP" choice). BASE systems sacrifice consistency for availability (the "AP" choice).
Stream Processing	Event sourcing and the outbox pattern are alternatives to distributed transactions. Instead of atomically updating two systems, you write to one system and asynchronously propagate changes via events.
Batch Processing	MapReduce and Spark jobs process immutable input datasets and produce new output datasets. They don't need traditional transactions because they don't do in-place updates — they write fresh output. If a job fails, you rerun it from scratch.

Transactions in the Real World

To ground this lesson in practical reality, here is how some well-known systems handle transactions:

Company	System	Transaction Strategy
Stripe	Payment processing	Idempotency keys + single-database ACID transactions. Every API request has an idempotency key. The payment operation and the idempotency record are written atomically.
Uber	Trip management	Saga pattern with compensating transactions. Trip creation, driver matching, and payment are separate services with eventual consistency.
Google	Ads serving	Spanner for ad budget management (global consistency). Bigtable for click logs (eventual consistency). Different consistency for different data.
Netflix	Content catalog	Eventual consistency everywhere. If a title shows up 5 seconds late in the catalog, nobody notices. No need for transactions.
Banks	Core banking	Serializable transactions on mainframes (IBM DB2, Oracle). The one domain where "correctness at all costs" is literal — regulators audit for consistency.

The Isolation Level Decision Framework

1. What anomalies can you tolerate?

If you need perfect correctness for financial data: serializable. If occasional stale reads are fine (social media feed): read committed.

↓

2. What is your contention level?

Low contention (most transactions touch different data): SSI works great. High contention (hot rows): 2PL may be better than SSI's abort-and-retry cycle.

↓

3. Single node or distributed?

Single node: any isolation level is feasible. Distributed: serializable is expensive (Spanner-level infrastructure) or impractical (2PC blocking). Consider eventual consistency + application-level invariant enforcement.

↓

4. Default: Read Committed + explicit locking where needed

This is PostgreSQL's default and works for 90% of applications. Use SELECT FOR UPDATE for critical sections. Upgrade to serializable only for proven write skew scenarios.

Common Interview Mistakes

Having interviewed dozens of candidates on transactions, here are the most common mistakes:

Mistake	What they say	What they should say
"ACID means all four properties are equally important"	"A database is ACID or it isn't"	A, I, and D are database guarantees. C is mostly an application property. Different isolation levels trade off I for performance.
"Transactions are slow"	"We should avoid transactions for performance"	Read Committed transactions are nearly free. Only serializable adds significant overhead. The question is which isolation level, not whether to use transactions.
"SELECT FOR UPDATE solves everything"	"Just add FOR UPDATE to every SELECT"	FOR UPDATE can't lock rows that don't exist (phantoms). It also reduces concurrency by serializing all readers. Use it surgically on specific hot rows, not broadly.
"Eventual consistency means inconsistency"	"BASE systems have no guarantees"	BASE systems guarantee eventual convergence. The question is: how long is "eventual" and what happens during the convergence window? For many use cases, sub-second convergence is perfectly acceptable.
"2PC is the standard for distributed transactions"	"We'll use 2PC across microservices"	2PC is a last resort due to its blocking failure mode. Prefer sagas with compensating transactions, or design your data model so transactions are local to one service.

Closing Thought

"Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don't exist. A large class of errors is reduced down to a simple transaction abort, and the application just needs to try again." — Martin Kleppmann, Designing Data-Intensive Applications

The beauty of transactions is that they let application developers think about correctness without simultaneously thinking about every possible interleaving of concurrent operations and every possible crash timing. They are not free — they cost performance — but the alternative is ad-hoc, error-prone, application-level concurrency control that almost always has bugs you won't discover until production traffic exposes them.

The Transaction Mental Model

Here is the mental model that ties everything together. Think of transactions along two axes:

	Single Node	Distributed
Weak Isolation	Read Committed (fast, most apps). Snapshot Isolation (backups, analytics).	Eventual consistency. BASE semantics. Good for shopping carts, social feeds.
Strong Isolation	Serializable (SSI or 2PL). Required for financial data, inventory, bookings.	Spanner (Paxos + TrueTime). CockroachDB (Raft). Extremely expensive infrastructure.

Most applications live in the top-left quadrant: single-node, read committed, with selective use of explicit locking (SELECT FOR UPDATE) for critical sections. This is PostgreSQL's sweet spot and covers 90% of real-world needs.

The bottom-right quadrant — distributed serializable transactions — is the hardest problem in databases. Only a handful of systems (Google Spanner, CockroachDB, FoundationDB) provide it, and they come with significant latency and infrastructure costs. If your interviewer asks "How would you build a globally distributed, strongly consistent payment system?", the answer involves one of these systems — not handrolled 2PC.

Master the isolation levels. Know which anomalies each one prevents. Understand the implementation mechanisms (locks, MVCC, SSI). And always ask the interview question that separates senior from staff: "What isolation level does this system actually need, and what happens if we get it wrong?"

PostgreSQL Transaction Cheat Sheet

Since PostgreSQL is the most common database in interviews and startups, here are the practical commands you should know:

sql
-- Check current isolation level
SHOW default_transaction_isolation;  -- read committed (default)

-- Set isolation level for one transaction
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... do work ...
COMMIT;

-- Check for blocked queries (deadlocks, lock waits)
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active' AND wait_event_type IS NOT NULL;

-- Check for long-running transactions (MVCC bloat risk)
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

-- Check dead tuples (VACUUM needed?)
SELECT relname, n_dead_tup, n_live_tup,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;

-- Set safety timeouts (do this in production!)
SET lock_timeout = '5s';  -- abort if can't get lock in 5s
SET statement_timeout = '30s';  -- abort queries running > 30s
SET idle_in_transaction_session_timeout = '5min';  -- kill idle txns

Closing Quote

"The whole point of abstractions like ACID transactions is to hide complexity. But as we've seen, it's not as simple as 'ACID = good, no-ACID = bad.' The devil is in the implementation details." — Martin Kleppmann, Designing Data-Intensive Applications

What to Study Next

Replication (DDIA Ch5)

How data is copied across nodes. Single-leader replication gives you a clear place for transactions. Multi-leader and leaderless replication make transactions much harder.

↓

Partitioning (DDIA Ch6)

How data is split across nodes. Partition-local transactions are easy. Cross-partition transactions require coordination (2PC, sagas, or Spanner-style protocols).

↓

Distributed Trouble (DDIA Ch8)

Network partitions, clock skew, and process pauses. These are the failure modes that make distributed transactions so hard. Understanding them is essential for designing resilient systems.

↓

Consensus (DDIA Ch9)

Paxos, Raft, and Zab. These protocols solve the fundamental agreement problem that 2PC cannot: reaching a decision even when nodes crash. Spanner uses Paxos for its transaction protocol.

Final question: A startup is building a ride-hailing app. They ask: "Should we use serializable isolation for everything?" What is your answer?

Yes, correctness is paramount for financial transactions No, use eventual consistency for everything to maximize throughput Use different isolation levels for different operations. Payments and fare calculations: serializable or explicit locking (money must be correct). Driver-rider matching: read committed with optimistic locking (occasional retries are fine). Location updates: eventual consistency (GPS pings are inherently approximate). Ride history and analytics: read committed or snapshot isolation (stale reads are acceptable). The right answer is always "it depends on the invariants of each operation" — not one size fits all.