ACID, isolation levels, MVCC, serializability — the guarantees that keep your data correct.
You are building a banking app. Alice wants to transfer $100 from her checking account to Bob's savings account. The operation seems trivial: subtract $100 from Alice, add $100 to Bob. Two writes. What could go wrong?
Everything.
The database executes the first write — Alice's balance drops from $500 to $400. Then the server crashes. The power supply fails. The process gets killed by the OOM killer. A network partition cuts the connection to the database. Pick your disaster. The second write never happens. Alice lost $100. Bob never received it. The money has vanished from the universe.
Or consider a different nightmare. Two people are browsing the same airline website. One seat remains on the 7:15 PM flight to JFK. Both customers simultaneously click "Book Now." Both see "1 seat available" because they both read the database before either wrote to it. Both bookings succeed. The airline now has two confirmed passengers for one seat. Someone is getting bumped at the gate.
These are not hypothetical scenarios. They happen in production systems every day. The bank transfer is a partial failure problem — one operation succeeded but the other didn't. The airline booking is a race condition — two concurrent operations interfered with each other because neither saw the other's changes.
A transaction is a group of database reads and writes that are treated as a single logical unit. The transaction either commits (all changes become permanent) or aborts/rolls back (all changes are undone). There is no in-between state where some changes persist and others don't.
In SQL, transactions look like this:
sql BEGIN TRANSACTION; -- Read Alice's balance SELECT balance FROM accounts WHERE name = 'Alice'; -- Result: 500 -- Debit Alice UPDATE accounts SET balance = balance - 100 WHERE name = 'Alice'; -- Credit Bob UPDATE accounts SET balance = balance + 100 WHERE name = 'Bob'; COMMIT; -- If any statement fails, or the connection drops, or the server crashes: -- ALL changes are automatically rolled back. Both accounts stay unchanged.
In application code, it looks like this:
python import psycopg2 conn = psycopg2.connect("dbname=bank") try: with conn.cursor() as cur: cur.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'Alice'") cur.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'Bob'") conn.commit() # Both succeed or... except Exception: conn.rollback() # ...both are undone
The simulation below shows the bank transfer in action. Click "Start Transfer" to begin. Then click "Crash!" at any point during the transfer to simulate a failure. Watch what happens to the account balances.
Click "Start Transfer" then "Crash!" between the two operations. Or let it complete successfully.
Now consider the airline booking. Two customers query the database at the same time, both see 1 seat available, and both insert a booking. The simulation below shows the timeline of these two concurrent transactions.
Watch two customers simultaneously book the last seat. The timeline shows how the race unfolds.
You might think "crashes between two writes" is rare. It is not. Consider the scale: a busy e-commerce site processes 10,000 orders per second. Each order involves 3-5 database writes (insert order, update inventory, update customer stats, insert payment record, update shipping queue). That is 30,000-50,000 writes per second. If the system crashes once per day — a very optimistic assumption — it will land between two writes of the same logical operation. The probability is not "if" but "when."
Race conditions are even more common. Any system with more than one concurrent user can experience them. A database handling 1,000 concurrent connections has thousands of opportunities per second for read-write interleaving. Without isolation guarantees, data corruption is a statistical certainty.
Without transactions, your database is a minefield. Every concurrent operation could corrupt data. Every crash could leave things half-done. The rest of this lesson teaches you how transactions solve these problems — and the tradeoffs involved.
Every database textbook, every interview prep guide, and every recruiter's phone screen mentions ACID. But most explanations are wrong. They recite the acronym like a prayer without explaining what each letter actually guarantees at the implementation level. Let's fix that.
Atomicity does NOT mean "simultaneous." It means "all or nothing." If a transaction makes five writes and the system crashes after the third write, atomicity guarantees that all five writes are rolled back. The database looks as if the transaction never started. Zero writes persisted, not three.
The implementation is a write-ahead log (WAL). Before modifying any data page on disk, the database first writes a log entry describing the change. If the system crashes, the recovery process reads the WAL and undoes any changes from incomplete transactions. The WAL is the reason atomicity works — not clever application code, not locks, not prayers.
Consistency in ACID is the most misunderstood letter. It does NOT mean "replicas agree" (that is replication consistency). In ACID, consistency means that the database preserves application-level invariants — rules like "total money across all accounts is constant" or "every foreign key references an existing row."
Here is the uncomfortable truth: consistency is mostly the application's responsibility, not the database's. The database can enforce some invariants (foreign keys, unique constraints, CHECK constraints), but it cannot know that "total money must be conserved" unless you tell it. Consistency is the odd one out in ACID — it is a property of the application using the database, not a property of the database itself.
Isolation means concurrent transactions don't interfere with each other. Each transaction behaves as if it were the only one running. In the airline example, proper isolation would prevent both customers from seeing "1 seat available" simultaneously — one of them would see the updated count after the other's booking.
Full isolation (called serializability) means the result of running transactions concurrently is the same as if they ran one at a time, in some serial order. But serializability is expensive. So most databases offer weaker isolation levels that trade correctness for performance. Understanding these tradeoffs is 80% of what interviews test.
Durability means that once a transaction commits, its data survives crashes — even power failures. The implementation combines WAL with fsync — a system call that forces the operating system to flush data from its write cache to the physical disk. Without fsync, the OS might acknowledge a write that is still sitting in a volatile buffer that vanishes on power loss.
The Write-Ahead Log deserves a closer look because it is the foundation of both atomicity and durability. Every modern database — PostgreSQL, MySQL InnoDB, SQLite, Oracle, SQL Server — uses some form of WAL. The principle is simple: write the intent before the action.
Think of it like a pilot's checklist. Before taking off, the pilot writes down every step they are going to perform. If they are incapacitated mid-takeoff, another pilot can read the checklist and know exactly what was done and what remains. The WAL is the database's checklist.
The WAL is append-only and sequential. This is critical for performance: sequential writes to disk are 100x-1000x faster than random writes. The actual data pages can be written lazily in the background (called checkpointing), while the WAL guarantees that no committed data is lost.
Not all systems provide ACID. Many distributed databases offer BASE instead:
| Property | Meaning | Example |
|---|---|---|
| Basically Available | The system always responds, even if data is stale | DynamoDB returns the last-known value during a partition |
| Soft state | Data may change without input (replicas converging) | Eventual consistency means reads may see different values on different nodes |
| Eventual consistency | All replicas will eventually agree, given enough time | DNS propagation: update takes hours but eventually all servers agree |
BASE is not "ACID but worse." It is a different set of tradeoffs optimized for availability and partition tolerance (the "AP" in CAP theorem). Your shopping cart can tolerate eventual consistency. Your bank balance cannot.
The choice between ACID and BASE is not a religious war — it is an engineering decision driven by your application's invariants:
| Use Case | Invariant | Tolerance | Choice |
|---|---|---|---|
| Bank transfer | Money is conserved | Zero tolerance for incorrect balances | ACID (serializable) |
| Shopping cart | Items in cart reflect user intent | Brief stale reads are OK (refresh shows correct state) | BASE |
| Inventory count | Stock >= 0 | Brief overselling tolerable if caught quickly | ACID for decrement, BASE for display |
| Social media feed | Posts eventually appear | High tolerance (seconds of delay are fine) | BASE |
| Flight booking | No overbooking | Zero tolerance | ACID (at least for the seat assignment step) |
| User analytics | Approximate correctness | High (10% error is acceptable) | BASE (or no transactions at all) |
Transactions are not free. Here is what they cost:
A transaction writes 3 records. Click "Inject Crash" at any point. Atomicity rolls back all partial changes.
Not all transactions are created equal. Writing a single row is much simpler to make atomic than writing across multiple tables. Let's draw the line clearly.
Imagine you are writing a 20 KB JSON document to a database. The write takes multiple disk pages. If the system crashes after writing page 1 but before writing page 2, you have a torn write — half old data, half new data. The document is corrupted.
Almost every storage engine provides single-object atomicity out of the box. The mechanisms are:
| Mechanism | How it works | What it prevents |
|---|---|---|
| WAL (Write-Ahead Log) | Write the complete new value to the log before modifying the data page | Torn writes on crash |
| Compare-and-Swap (CAS) | Atomically update only if current value matches expected | Concurrent overwrites |
| Increment operations | Atomic read-modify-write in a single CPU instruction | Lost updates on counters |
Single-object atomicity is necessary but not sufficient. The bank transfer requires atomicity across TWO objects (two account rows). That requires a multi-object transaction.
Even single-object atomicity has subtle failure modes. Consider a document database storing a 20 KB JSON document:
A multi-object transaction groups multiple reads and writes — possibly to different rows, different tables, even different indexes — into a single atomic unit. The syntax in SQL is explicit:
sql BEGIN TRANSACTION; UPDATE accounts SET balance = balance - 100 WHERE id = 'alice'; UPDATE accounts SET balance = balance + 100 WHERE id = 'bob'; COMMIT; -- If ANY statement fails or the connection drops between them, -- the database automatically issues ROLLBACK. -- Both accounts return to their original values.
Many common database operations are inherently multi-object:
Let's trace through a concrete example of what happens without multi-object transactions. An e-commerce application inserts an order and updates inventory:
Many NoSQL databases (MongoDB before 4.0, early DynamoDB, Cassandra) intentionally dropped multi-object transactions. Their argument: multi-object transactions hurt performance and don't scale across partitions. Instead, they encourage application-level workarounds: idempotent operations, event sourcing, compensating transactions.
This works for some use cases (shopping carts, activity feeds) but fails catastrophically for others (financial transfers, inventory management). MongoDB added multi-document transactions in v4.0 (2018) after years of customer demand. The lesson: you can avoid transactions until you can't.
When data lives across multiple services (each with its own database), multi-object transactions are impossible. You can't span a SQL transaction across a payment service database and an inventory service database. The saga pattern is the most common alternative:
If step 3 fails, a compensating transaction runs: the inventory service re-increments the stock, and the order service marks the order as CANCELLED. This is not atomic — there is a window where the inventory is reserved but the payment hasn't been processed. During that window, the system is in an inconsistent state. But eventually, it converges to a consistent state. This is the tradeoff: eventual consistency instead of strong consistency.
Watch a transaction span two tables with a foreign key. Toggle "With Transaction" to see the difference.
Now we enter the world of isolation levels — the spectrum between "no isolation at all" and "perfect serializability." Each level prevents certain anomalies but allows others. Understanding this spectrum is the single most important transaction concept for interviews.
Read Committed is the weakest useful isolation level. It is the default in PostgreSQL, Oracle, and SQL Server. It provides two guarantees:
A dirty read occurs when Transaction B reads data that Transaction A has written but not yet committed. If A later rolls back, B has acted on data that never officially existed.
A dirty write occurs when Transaction B overwrites a value that Transaction A has written but not yet committed. This can cause bizarre interleaving where the final state mixes parts of different transactions.
Read committed uses two mechanisms:
For writes: row-level locks. When a transaction writes a row, it acquires an exclusive lock on that row. Other transactions that want to write the same row must wait until the lock is released (at commit or abort).
For reads: the database keeps TWO copies of each modified row — the old committed value and the new uncommitted value. Any transaction reading that row sees the old value until the writing transaction commits. This is cheaper than read locks because reads never block.
Here is exactly how PostgreSQL implements read committed for reads, step by step:
This mechanism — checking the commit status of each version's creating transaction — is cheap. PostgreSQL maintains a commit log (clog) that stores the commit status of each transaction ID in just 2 bits (in-progress, committed, aborted, sub-committed). Checking clog is a simple array lookup.
Read committed has a critical gap: non-repeatable reads (also called read skew). You read a value, another transaction modifies and commits it, and reading again gives a different result — within the same transaction.
For a quick interactive query, this is usually harmless — Alice refreshes the page and sees the correct values. But for a backup that takes 30 minutes, or an analytics query scanning millions of rows, reading inconsistent snapshots produces wrong results. This is why we need the next isolation level: snapshot isolation.
Here is a concrete scenario that breaks analytics queries under read committed:
This is not just a theoretical concern. Database backups use pg_dump, which reads every table sequentially. If the backup runs under read committed (the default), it can capture an inconsistent snapshot: some tables reflect data from before a transaction, others from after. Restoring this backup gives you a database where foreign keys are broken, counters don't add up, and invariants are violated. PostgreSQL's pg_dump uses a single REPEATABLE READ transaction for exactly this reason.
| Database | Default Level | Notes |
|---|---|---|
| PostgreSQL | Read Committed | REPEATABLE READ actually provides snapshot isolation. SERIALIZABLE provides SSI. |
| MySQL InnoDB | Repeatable Read | Uses gap locking for phantoms. Does NOT detect lost updates automatically. |
| Oracle | Read Committed | Snapshot isolation available as SERIALIZABLE, but it is not truly serializable. |
| SQL Server | Read Committed | Offers both lock-based and snapshot-based read committed. Set with READ_COMMITTED_SNAPSHOT. |
| CockroachDB | Serializable | Only supports serializable. "If you want correctness, don't compromise." |
Two concurrent transactions. Toggle isolation level to see how dirty reads are prevented.
Read committed prevents dirty reads but allows non-repeatable reads. For a backup job or a long analytics query, this is disastrous — the query sees an inconsistent mix of pre-update and post-update data. We need something stronger.
Snapshot isolation gives each transaction a frozen snapshot of the database as it existed at the transaction's start time. No matter how long the transaction runs, it always sees the same consistent view. Other transactions can modify data freely — the snapshot is immutable.
Snapshot isolation is implemented using Multi-Version Concurrency Control (MVCC). Instead of storing one version of each row, the database stores multiple versions, each tagged with the ID of the transaction that created or deleted it.
When transaction TXN_X reads a row, it must decide which version to see. The rule is precise:
A row version is visible to TXN_X if and only if:
This is the kind of example interviewers love. Let's trace exactly which versions each transaction sees.
PostgreSQL stores MVCC metadata in system columns on every row:
| Column | Meaning |
|---|---|
| xmin | Transaction ID that created this row version (INSERT or UPDATE) |
| xmax | Transaction ID that deleted this row version (UPDATE or DELETE). 0 if still live. |
| ctid | Physical location (page, offset) of this row version on disk |
When you UPDATE a row in PostgreSQL, it doesn't modify the row in place. It marks the old version with xmax = current txn, then inserts a brand new row with xmin = current txn. This is why PostgreSQL tables need periodic VACUUM — to clean up old, invisible row versions that are wasting disk space.
MVCC creates multiple versions of every row. Over time, old versions become invisible to all active transactions. These dead tuples waste disk space and slow down sequential scans (the database must skip over them). PostgreSQL's VACUUM process reclaims this space.
Indexes complicate MVCC. A B-tree index entry points to a specific row version's physical location (ctid in PostgreSQL). When a row is updated, the new version has a different ctid. The database has two strategies:
| Strategy | How it works | Used by |
|---|---|---|
| HOT update | If no indexed column changed, store new version on same page, link old→new via a "heap-only tuple" chain. No index update needed. | PostgreSQL (when possible) |
| Index entry per version | Each version gets its own index entry. Index scans must check visibility for each entry. | PostgreSQL (when indexed column changes) |
| Undo log | Index points to latest version. Old versions are stored in a separate undo log and reconstructed on demand. | MySQL InnoDB, Oracle |
PostgreSQL's approach (multiple index entries) means that a frequently-updated, heavily-indexed table can have very large indexes full of entries pointing to dead row versions. MySQL's approach (undo log) keeps indexes cleaner but adds overhead to reconstruct old versions for long-running snapshot transactions.
You can actually see MVCC in action using PostgreSQL's system columns. Here's a hands-on experiment you can run:
sql -- Create a test table CREATE TABLE test (id INT PRIMARY KEY, value TEXT); INSERT INTO test VALUES (1, 'original'); -- See the MVCC metadata: SELECT xmin, xmax, ctid, * FROM test; -- xmin | xmax | ctid | id | value -- -------+------+-------+----+---------- -- 1000 | 0 | (0,1) | 1 | original -- xmin=1000: created by transaction 1000 -- xmax=0: not yet deleted -- ctid=(0,1): page 0, tuple 1 -- Now update the row: UPDATE test SET value = 'modified' WHERE id = 1; SELECT xmin, xmax, ctid, * FROM test; -- xmin | xmax | ctid | id | value -- -------+------+-------+----+---------- -- 1001 | 0 | (0,2) | 1 | modified -- NEW xmin! This is a completely new row version. -- NEW ctid! It's at a different physical location (0,2). -- The old version (0,1) still exists on disk with xmax=1001, -- but it's invisible to committed transactions. -- Check dead tuple count: SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'test'; -- n_dead_tup | n_live_tup -- ------------+------------ -- 1 | 1 -- One dead tuple (the old 'original' version). VACUUM will reclaim it.
| Operation | Read Committed | Snapshot Isolation | Why |
|---|---|---|---|
| Short reads | Fast | Same speed | Both use MVCC; snapshot just remembers which txn IDs to ignore |
| Long reads (analytics) | Inconsistent | Consistent snapshot | Snapshot freezes the visible set at transaction start |
| Writes | Same speed | Same speed | Write performance is identical; both create new row versions |
| Write conflicts | Last write wins | First writer wins (second aborts) | Snapshot detects that the row changed since snapshot start |
| Storage | Less bloat | More bloat if long txns | Long-running snapshot transactions prevent VACUUM from reclaiming old versions |
Three transactions interact with a row. Drag the time slider to see which version each transaction sees.
Snapshot isolation prevents dirty reads and non-repeatable reads. But it has a blind spot: the lost update problem. This is one of the most common concurrency bugs in real applications.
A lost update happens whenever two transactions perform a read-modify-write cycle on the same data:
This pattern shows up everywhere:
| Scenario | Read | Modify | Write |
|---|---|---|---|
| Counter increment | Read current value | Add 1 | Write new value |
| Wiki page edit | Read page content | Edit text | Save changes |
| Account balance | Read balance | Subtract amount | Write new balance |
| JSON field update | Read document | Modify one field | Write entire document |
The simplest fix — don't read-then-write. Let the database do it in a single atomic operation:
sql -- BAD: read-modify-write (vulnerable to lost updates) SELECT balance FROM accounts WHERE id = 'alice'; -- reads 500 -- application computes: 500 - 100 = 400 UPDATE accounts SET balance = 400 WHERE id = 'alice'; -- GOOD: atomic operation (no read-modify-write) UPDATE accounts SET balance = balance - 100 WHERE id = 'alice'; -- The database performs the read+modify+write as a single atomic step. -- No other transaction can interleave between the read and the write.
Atomic operations work for simple modifications (increment, decrement, append). They don't work when the modification is complex (editing a wiki page requires reading the entire document, modifying it in the application, and writing it back).
sql BEGIN; SELECT balance FROM accounts WHERE id = 'alice' FOR UPDATE; -- FOR UPDATE acquires a ROW-LEVEL EXCLUSIVE LOCK -- Any other transaction trying to SELECT FOR UPDATE on this row BLOCKS -- application computes new balance UPDATE accounts SET balance = 400 WHERE id = 'alice'; COMMIT; -- lock released
SELECT FOR UPDATE tells the database: "I'm going to modify this row. Lock it so nobody else can read-for-update until I'm done." This serializes all read-modify-write operations on that row.
PostgreSQL's repeatable read isolation level (which actually provides snapshot isolation) automatically detects lost updates. If TXN_B tries to write a row that TXN_A already wrote since B's snapshot started, PostgreSQL aborts TXN_B. The application can then retry.
sql -- Read the current value SELECT balance, version FROM accounts WHERE id = 'alice'; -- Returns: balance=500, version=7 -- Update ONLY if the version hasn't changed UPDATE accounts SET balance = 400, version = 8 WHERE id = 'alice' AND version = 7; -- If another transaction changed it, version != 7, UPDATE affects 0 rows. -- Application detects "0 rows affected" and retries.
This is also called optimistic concurrency control (OCC). It doesn't use locks — it lets both transactions proceed, but only one succeeds. The other detects the conflict and retries. Works well when conflicts are rare.
Let's trace through the same scenario with each solution. Two users both try to withdraw $200 from an account with $300 balance. Only one should succeed (the second should get "insufficient funds").
Two transactions both increment a counter. Watch the lost update, then see how each solution prevents it.
Lost updates happen when two transactions write to the SAME row. Write skew is a subtler and nastier problem: two transactions read the same data, make decisions based on what they read, then write to DIFFERENT rows — and the combination violates an invariant.
A hospital requires at least one doctor on call at all times. Currently, Alice and Bob are both on call (2 doctors). Both simultaneously decide to take themselves off call:
Write skew is a family of anomalies, not a single bug. Understanding the taxonomy helps you spot it in interviews and code reviews. Every write skew has the same structure:
| Scenario | Invariant | How write skew violates it |
|---|---|---|
| Meeting room booking | No overlapping bookings for same room | Two users check "room free at 2pm?", both get yes, both insert a booking |
| Username registration | Usernames are unique | Two users check "is 'alice99' taken?", both get no, both insert it |
| Multiplayer game | Two players can't move to same position | Both check "is (3,5) empty?", both get yes, both move there |
| Bank overdraft | Balance >= 0 | Two concurrent withdrawals both check balance, both see enough funds |
Notice a pattern in the write skew examples: one transaction's query result changes because another transaction inserts or deletes rows. The meeting room double-booking happens because the query "are there bookings for this room at 2pm?" returns zero rows, and another transaction inserts a row that WOULD have changed the query result.
These rows that don't exist yet (but would change a query's result) are called phantoms. They are the root cause of write skew.
One workaround: create the rows to lock in advance. For the meeting room problem:
sql -- Create a "lock table" with one row per room per 15-min slot CREATE TABLE room_slots ( room_id INT, slot_start TIMESTAMP, PRIMARY KEY (room_id, slot_start) ); -- Pre-populate: INSERT INTO room_slots VALUES (1, '2024-01-15 14:00'), ... -- Now bookings can lock the slot row: BEGIN; SELECT * FROM room_slots WHERE room_id = 1 AND slot_start = '2024-01-15 14:00' FOR UPDATE; -- This locks the row. Another transaction trying to book the same slot blocks. INSERT INTO bookings (room_id, start_time, user_id) VALUES (1, '2024-01-15 14:00', 42); COMMIT;
This is called materializing conflicts — turning a phantom into a concrete row that can be locked. It works but it's ugly: you need to pre-create lock rows for every possible conflict, and the lock table leaks database implementation details into the application. The real solution is serializable isolation, which we cover in the next chapter.
Interviewers love to test whether you can distinguish write skew from lost updates. Here is the precise difference:
| Property | Lost Update | Write Skew |
|---|---|---|
| What conflicts | Two transactions write the SAME row | Two transactions write DIFFERENT rows |
| Pattern | Both read X, both modify X, both write X | Both read a CONDITION, each writes a different row, condition violated |
| Detection | Snapshot isolation detects (PostgreSQL) | Only serializable isolation detects |
| Fix | Atomic ops, SELECT FOR UPDATE, CAS | Serializable isolation, or materializing conflicts |
| Example | Two counter increments | Two doctors going off call |
When reviewing code, here are red flags that indicate potential write skew:
python # RED FLAG 1: Check-then-act on a count count = db.execute("SELECT COUNT(*) FROM doctors WHERE on_call = TRUE") if count >= 2: db.execute("UPDATE doctors SET on_call = FALSE WHERE id = %s", (my_id,)) # DANGER: another transaction can do the same check concurrently # RED FLAG 2: Check-then-insert on uniqueness existing = db.execute("SELECT id FROM users WHERE username = %s", (name,)) if not existing: db.execute("INSERT INTO users (username) VALUES (%s)", (name,)) # DANGER: use a UNIQUE constraint instead, which is enforced atomically # RED FLAG 3: Check availability then reserve available = db.execute("SELECT seats FROM flights WHERE id = %s", (flight_id,)) if available > 0: db.execute("INSERT INTO bookings ...") db.execute("UPDATE flights SET seats = seats - 1 WHERE id = %s", (flight_id,)) # DANGER: use SELECT FOR UPDATE on the flights row
The general pattern: any code that reads a condition, makes a decision, then writes based on that decision — without holding a lock on the condition — is vulnerable to write skew.
Depending on your database and situation, here are all available fixes for write skew, ranked from simplest to most complex:
| Fix | How | Works for | Limitation |
|---|---|---|---|
| Unique constraint | CREATE UNIQUE INDEX | Username registration, seat booking | Only prevents duplicate inserts, not general predicates |
| Explicit lock | SELECT FOR UPDATE on predicate rows | Doctor on-call (lock the doctor rows) | Can't lock rows that don't exist (phantoms) |
| Materialized conflict | Pre-create lock rows, SELECT FOR UPDATE | Meeting room booking (lock the time-slot row) | Requires pre-populating lock table. Ugly. |
| Serializable isolation | SET TRANSACTION ISOLATION LEVEL SERIALIZABLE | Everything | Requires retry logic. Some performance cost. |
| Application-level lock | Redis SETNX, advisory lock | Cross-service invariants | Complex. Must handle lock expiry carefully. |
Watch two doctors simultaneously go off call. The invariant "at least 1 on call" gets violated.
The gold standard of isolation is serializability. It guarantees that the result of executing transactions concurrently is identical to executing them in SOME serial order — one at a time, no overlap. This prevents every concurrency anomaly: dirty reads, non-repeatable reads, lost updates, write skew, and phantoms.
The question is: how do you achieve it without destroying performance? Three approaches exist, each with radically different tradeoffs.
Before diving into implementation, let's be precise about what serializability buys you. Consider a payroll system processing end-of-month salaries. Three transactions run concurrently:
The simplest approach: don't allow concurrency at all. Run every transaction on a single thread, one at a time. This sounds insane, but it works if:
| Requirement | Why | How |
|---|---|---|
| Transactions are fast | A single thread can only do ~100K short txns/sec | Keep data in RAM (VoltDB, Redis) |
| No network round-trips | Waiting for app server between reads/writes wastes the single thread | Use stored procedures — entire transaction in one request |
| Data fits in memory | Disk I/O would block the single thread | In-memory databases with WAL for durability |
VoltDB takes this approach. Redis is single-threaded by design. For partition-local transactions (where all data lives on one partition), this is surprisingly viable. For cross-partition transactions, you need coordination, which kills throughput.
How fast is a single thread? Let's do the math:
Two-Phase Locking was the standard approach for 30+ years. The rule is:
The problem with 2PL: performance is terrible under contention. If transaction A holds a lock that B needs, B waits. If B holds a lock that C needs, C waits. One slow transaction can block dozens of others. And deadlocks are common: A waits for B's lock, B waits for A's lock. The database must detect deadlocks and abort one transaction.
SSI is the modern approach, introduced in PostgreSQL 9.1 (2011). It's optimistic: let all transactions run concurrently on their own snapshots (like regular snapshot isolation), but track what each transaction reads and writes. At commit time, check if any conflicts would violate serializability. If so, abort the transaction. The application retries.
Deadlocks are the Achilles' heel of 2PL. They occur when two transactions each hold a lock the other needs:
Databases detect deadlocks by periodically checking for cycles in the waits-for graph (PostgreSQL does this every deadlock_timeout milliseconds, default 1 second). When a cycle is found, one transaction is aborted with an error: "ERROR: deadlock detected." The application must catch this error and retry.
SSI tracks two kinds of conflicts between concurrent transactions:
For the serial execution approach to work, transactions must be fast. If a transaction needs to go back to the application server for logic (compute a price, check a rule), the single thread sits idle waiting for the network round-trip. The solution: stored procedures. The entire transaction — reads, logic, and writes — is bundled into a single request:
sql -- Stored procedure: transfer money atomically CREATE FUNCTION transfer(from_id INT, to_id INT, amount NUMERIC) RETURNS VOID AS $$ DECLARE from_balance NUMERIC; BEGIN SELECT balance INTO from_balance FROM accounts WHERE id = from_id; IF from_balance < amount THEN RAISE EXCEPTION 'Insufficient funds'; END IF; UPDATE accounts SET balance = balance - amount WHERE id = from_id; UPDATE accounts SET balance = balance + amount WHERE id = to_id; END; $$ LANGUAGE plpgsql; -- Single network round-trip: the database executes everything locally. SELECT transfer(1, 2, 100.00);
VoltDB requires all transactions to be stored procedures. Redis processes all operations on a single thread. Both achieve serializability by eliminating concurrency entirely — and both are surprisingly fast for partition-local workloads.
The simulation below runs three concurrent transactions against the same data. Choose an isolation level and watch how each approach handles the workload. Serial queues them. 2PL blocks on lock conflicts. SSI runs optimistically and aborts on conflict.
Three transactions read and write overlapping rows. Watch how each approach handles conflicts differently.
Let's trace the doctor on-call write skew through all three approaches to see exactly how they differ:
| Property | Serial Execution | Two-Phase Locking | SSI |
|---|---|---|---|
| Concurrency | None | Limited (lock contention) | High (snapshot-based) |
| Blocking | Queued | Readers block writers, writers block everyone | No blocking |
| Deadlocks | Impossible | Common (need detection + timeout) | Impossible |
| Aborts | Rare | On deadlock only | On conflict detection |
| Latency | Depends on queue depth | Unpredictable (lock waits) | Low (but retries add tail latency) |
| Throughput | Limited by single core | Degrades under contention | High until conflict rate is high |
| Used by | VoltDB, Redis | MySQL InnoDB, SQL Server | PostgreSQL 9.1+, CockroachDB |
Everything so far assumed a single database node. But what happens when a transaction spans multiple databases, or multiple partitions on different machines? You need a way to commit atomically across all of them — or roll back across all of them. This is the domain of distributed transactions.
Two-Phase Commit is the classic protocol for distributed atomicity. It introduces a coordinator (also called the transaction manager) that orchestrates the commit across multiple participants (database nodes).
The critical moment is after Phase 1 (all participants said YES) but before Phase 2 (coordinator sends COMMIT or ABORT). If the coordinator crashes at this exact moment:
3PC adds a "pre-commit" phase to avoid the blocking problem. After all participants vote YES, the coordinator sends a PRE-COMMIT message. Participants can then decide: if they received PRE-COMMIT, they know the decision was COMMIT; if the coordinator crashes, they can safely commit on timeout.
In theory, 3PC is non-blocking. In practice, it doesn't work: network partitions can cause different participants to reach different conclusions about what happened. One participant might time out and commit, while another times out and aborts. 3PC assumes reliable network delivery with bounded delays, which is impossible in real distributed systems.
XA is a standard API (X/Open XA) for 2PC across heterogeneous systems — for example, a transaction that spans a PostgreSQL database AND a message queue AND an Oracle database. The application server acts as the coordinator.
XA transactions are used in enterprise systems (Java EE / Jakarta EE), but they inherit all of 2PC's problems plus additional overhead from the XA API. Modern systems avoid them in favor of eventual consistency patterns (sagas, outbox pattern).
Since distributed transactions are problematic, most modern microservice architectures use sagas instead. A saga is a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.
python class OrderSaga: """Choreography-based saga for order processing.""" def start(self, order): # Step 1: Create order (local transaction) self.order_service.create_order(order) # status = PENDING self.event_bus.publish("OrderCreated", order) def on_order_created(self, event): # Step 2: Reserve inventory (local transaction) try: self.inventory_service.reserve(event.product_id, event.qty) self.event_bus.publish("InventoryReserved", event) except InsufficientStock: # Compensate: cancel the order self.order_service.cancel_order(event.order_id) self.event_bus.publish("OrderCancelled", event) def on_inventory_reserved(self, event): # Step 3: Process payment (local transaction) try: self.payment_service.charge(event.customer_id, event.amount) self.event_bus.publish("PaymentProcessed", event) except PaymentFailed: # Compensate: release inventory, then cancel order self.inventory_service.release(event.product_id, event.qty) self.order_service.cancel_order(event.order_id) self.event_bus.publish("OrderCancelled", event) def on_payment_processed(self, event): # Step 4: Confirm order (local transaction) self.order_service.confirm_order(event.order_id) # status = CONFIRMED self.event_bus.publish("OrderConfirmed", event)
Many interview candidates claim their system provides "exactly-once processing." This is almost always wrong. In a distributed system, you can guarantee:
| Guarantee | Meaning | How |
|---|---|---|
| At-most-once | Message processed 0 or 1 times | Don't retry on failure. Risk: lost messages. |
| At-least-once | Message processed 1 or more times | Retry until acknowledged. Risk: duplicate processing. |
| Exactly-once | Message processed exactly 1 time | At-least-once delivery + idempotent processing. The combination achieves the EFFECT of exactly-once. |
True exactly-once delivery is impossible in a distributed system (proven by the Two Generals Problem). But you can achieve the effect of exactly-once by making every operation idempotent: processing the same message twice produces the same result as processing it once. The idempotency key pattern (store a unique request ID with each processed message, skip duplicates) is the standard implementation.
Let's trace the exact states each participant goes through during 2PC. This is a common whiteboard question.
| System | Approach | Tradeoff |
|---|---|---|
| Google Spanner | Paxos-based commit across replicas, TrueTime for global ordering | Requires atomic clocks + GPS receivers in every data center |
| CockroachDB | Parallel commits with Raft consensus per range | Higher latency for cross-range transactions |
| Kafka + Outbox | Write to DB + outbox table atomically, async publish to Kafka | Eventual consistency, not strong consistency |
| Sagas | Sequence of local transactions with compensating actions on failure | No isolation between saga steps, complex failure handling |
Google Spanner is the gold standard for globally distributed, serializable transactions. It avoids 2PC's blocking problem by using Paxos-based state machines for each shard, and uses TrueTime — a global clock with bounded uncertainty — to assign globally ordered timestamps to transactions.
Spanner proves that globally distributed, serializable transactions are possible — but at the cost of specialized hardware (atomic clocks), higher latency (the TrueTime wait), and enormous engineering investment. CockroachDB implements a similar approach using NTP clocks (less precise than TrueTime but no special hardware needed), with slightly weaker guarantees.
When an interviewer asks "How would you handle transactions across services?", your answer should depend on the consistency requirements. Here is the decision framework:
Watch 2PC with 3 participants. Inject a coordinator crash after PREPARE to see participants stuck "in doubt."
| Anomaly | Read Uncommitted | Read Committed | Snapshot / Repeatable Read | Serializable |
|---|---|---|---|---|
| Dirty reads | Allowed | Prevented | Prevented | Prevented |
| Dirty writes | Allowed | Prevented | Prevented | Prevented |
| Non-repeatable reads | Allowed | Allowed | Prevented | Prevented |
| Lost updates | Allowed | Allowed | Depends* | Prevented |
| Write skew | Allowed | Allowed | Allowed | Prevented |
| Phantoms | Allowed | Allowed | Allowed | Prevented |
*PostgreSQL's repeatable read detects lost updates. MySQL's does not.
python class MVCCStore: """Simplified MVCC implementation for interview coding rounds.""" def __init__(self): self.versions = [] # list of (key, value, created_by, deleted_by) self.next_txn_id = 1 self.committed = set() # set of committed txn IDs def begin(self): txn_id = self.next_txn_id self.next_txn_id += 1 return txn_id def write(self, txn_id, key, value): # Mark old versions as deleted for i, (k, v, created, deleted) in enumerate(self.versions): if k == key and deleted is None: self.versions[i] = (k, v, created, txn_id) # Create new version self.versions.append((key, value, txn_id, None)) def read(self, txn_id, key): """Return the value visible to this transaction.""" for k, v, created, deleted in reversed(self.versions): if k != key: continue # Visibility check: # 1. Creator must have committed AND started before us if created > txn_id or (created != txn_id and created not in self.committed): continue # 2. Not deleted, or deleter started after us, or deleter not committed if deleted is not None: if deleted <= txn_id and deleted in self.committed: continue # deleted by a committed txn before us return v return None def commit(self, txn_id): self.committed.add(txn_id) # Example usage: store = MVCCStore() t1 = store.begin() # TXN 1 store.write(t1, 'alice', 500) store.commit(t1) t2 = store.begin() # TXN 2 t3 = store.begin() # TXN 3 store.write(t2, 'alice', 400) print(store.read(t3, 'alice')) # 500 (t2 not committed) store.commit(t2) print(store.read(t3, 'alice')) # 500 (t2 committed AFTER t3 started) t4 = store.begin() # TXN 4 print(store.read(t4, 'alice')) # 400 (t2 committed before t4 started)
In any system using optimistic concurrency (SSI or compare-and-set), transactions may be aborted and need retrying. Here is production-grade retry logic with exponential backoff:
python import time import random import psycopg2 from psycopg2 import errors def run_with_retry(conn_params, txn_func, max_retries=5): """Execute a transaction function with automatic retry on serialization failure. txn_func: callable(cursor) -> result Returns the result of txn_func on success. Raises after max_retries exhausted. """ for attempt in range(max_retries): conn = psycopg2.connect(**conn_params) conn.set_isolation_level( psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE ) try: with conn.cursor() as cur: result = txn_func(cur) conn.commit() return result # success! except errors.SerializationFailure: conn.rollback() if attempt == max_retries - 1: raise # exhausted retries # Exponential backoff with jitter delay = (2 ** attempt) * 0.01 # 10ms, 20ms, 40ms, 80ms, 160ms delay *= (0.5 + random.random()) # add jitter to prevent thundering herd time.sleep(delay) except errors.DeadlockDetected: conn.rollback() # Same retry logic — deadlock is a transient failure delay = (2 ** attempt) * 0.01 * (0.5 + random.random()) time.sleep(delay) finally: conn.close() # Usage: def transfer(cur): cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1") cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2") return "transferred" result = run_with_retry({'dbname': 'bank'}, transfer)
python import threading from collections import defaultdict class LockManager: """Two-Phase Locking with deadlock detection via timeout.""" def __init__(self): self.locks = {} # key -> {'mode': 'S'|'X', 'holders': set(txn_ids)} self.mutex = threading.Lock() self.waiters = {} # key -> threading.Event def acquire(self, txn_id, key, mode, timeout=5.0): """Acquire lock. Returns True on success, False on timeout (deadlock).""" while True: with self.mutex: if key not in self.locks: self.locks[key] = {'mode': mode, 'holders': {txn_id}} return True lock = self.locks[key] # Compatible: both shared if lock['mode'] == 'S' and mode == 'S': lock['holders'].add(txn_id) return True # Already hold it if txn_id in lock['holders']: if mode == 'X': lock['mode'] = 'X' # upgrade return True # Must wait if key not in self.waiters: self.waiters[key] = threading.Event() evt = self.waiters[key] # Wait outside mutex if not evt.wait(timeout): return False # timeout = probable deadlock def release_all(self, txn_id): """Release all locks held by txn_id (called on commit/abort).""" with self.mutex: for key in list(self.locks): lock = self.locks[key] lock['holders'].discard(txn_id) if not lock['holders']: del self.locks[key] if key in self.waiters: self.waiters[key].set() del self.waiters[key]
The outbox pattern is the most common way to achieve reliable cross-service communication without distributed transactions. It deserves its own worked example because it comes up in nearly every system design interview involving microservices.
python # The Outbox Pattern — sending reliable events from a database class OrderService: def place_order(self, user_id, product_id, amount): # SINGLE local transaction — no distributed coordination! with self.db.transaction() as txn: # 1. Insert the order order_id = txn.execute( "INSERT INTO orders (user_id, product_id, amount, status) " "VALUES (%s, %s, %s, 'PENDING') RETURNING id", (user_id, product_id, amount) ) # 2. Insert event into outbox table (SAME transaction!) txn.execute( "INSERT INTO outbox (aggregate_id, event_type, payload) " "VALUES (%s, %s, %s)", (order_id, 'OrderPlaced', json.dumps({'order_id': order_id, 'user_id': user_id, 'amount': amount})) ) # Both writes are atomic. If either fails, both roll back. class OutboxPublisher: """Background worker that polls the outbox and publishes to Kafka.""" def run(self): while True: with self.db.transaction() as txn: # Grab unpublished events (with row lock to prevent duplicates) events = txn.execute( "SELECT id, event_type, payload FROM outbox " "WHERE published = FALSE " "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED" ) for event in events: self.kafka.publish(event.event_type, event.payload) txn.execute( "UPDATE outbox SET published = TRUE WHERE id = %s", (event.id,) ) time.sleep(0.1) # poll every 100ms
Let's walk through a complete payment system design — the kind of answer that gets a "strong hire" at a senior+ level:
python class PaymentService: """Idempotent payment processing with exactly-once semantics.""" def process_payment(self, idempotency_key, from_acct, to_acct, amount): """Process a payment. Safe to retry with the same idempotency_key.""" with self.db.transaction(isolation='serializable') as txn: # Step 1: Check idempotency key existing = txn.execute( "SELECT result FROM payments WHERE idempotency_key = %s", (idempotency_key,) ) if existing: return existing[0] # Already processed — return cached result # Step 2: Check sufficient funds (with lock) balance = txn.execute( "SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (from_acct,) ) if balance < amount: # Record the failure (so retries get the same answer) txn.execute( "INSERT INTO payments (idempotency_key, status, error) " "VALUES (%s, 'FAILED', 'insufficient_funds')", (idempotency_key,) ) return {'status': 'FAILED', 'error': 'insufficient_funds'} # Step 3: Execute the transfer txn.execute( "UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, from_acct) ) txn.execute( "UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, to_acct) ) # Step 4: Record the success (for idempotency) txn.execute( "INSERT INTO payments (idempotency_key, status, amount) " "VALUES (%s, 'SUCCESS', %s)", (idempotency_key, amount) ) # Step 5: Write to outbox (for async notification) txn.execute( "INSERT INTO outbox (event_type, payload) VALUES (%s, %s)", ('PaymentCompleted', json.dumps({ 'from': from_acct, 'to': to_acct, 'amount': str(amount), 'idempotency_key': idempotency_key })) ) return {'status': 'SUCCESS', 'amount': str(amount)}
For distributed systems without shared databases, you sometimes need a distributed lock. But distributed locks are tricky — a lock can be held by a crashed process:
A visual summary of what each level prevents. Reference this in interviews.
python from collections import defaultdict class DeadlockDetector: """Detect deadlocks using cycle detection in the waits-for graph.""" def __init__(self): # waits_for[A] = B means "transaction A is waiting for B's lock" self.waits_for = {} def add_wait(self, waiter_txn, holder_txn): """Record that waiter_txn is blocked waiting for holder_txn.""" self.waits_for[waiter_txn] = holder_txn def remove_wait(self, txn): """Remove txn from waits-for graph (it got the lock or aborted).""" self.waits_for.pop(txn, None) def detect_cycle(self): """Return the cycle if one exists, else None. Uses Floyd's tortoise-and-hare for each starting node.""" for start in self.waits_for: visited = set() node = start while node in self.waits_for: if node in visited: # Found a cycle. Return the cycle path. cycle = [node] n = self.waits_for[node] while n != node: cycle.append(n) n = self.waits_for[n] cycle.append(node) return cycle visited.add(node) node = self.waits_for[node] return None # Usage: dd = DeadlockDetector() dd.add_wait('TXN_A', 'TXN_B') # A waits for B dd.add_wait('TXN_B', 'TXN_C') # B waits for C print(dd.detect_cycle()) # None (no cycle) dd.add_wait('TXN_C', 'TXN_A') # C waits for A — cycle! print(dd.detect_cycle()) # ['TXN_A', 'TXN_B', 'TXN_C', 'TXN_A'] # Resolution: abort the youngest transaction in the cycle (usually TXN_C)
| Paper / Resource | Why it matters |
|---|---|
| Hermitage (Martin Kleppmann) | Tests actual isolation level behavior across databases — many don't match their documentation |
| A Critique of ANSI SQL Isolation Levels (Berenson et al., 1995) | Defines snapshot isolation and write skew formally — the canonical paper |
| Serializable Snapshot Isolation (Cahill et al., 2008) | The algorithm behind PostgreSQL's serializable mode |
| Spanner: Google's Globally-Distributed Database (Corbett et al., 2012) | How Google achieves serializable distributed transactions with TrueTime |
| Amazon DynamoDB Transactions (AWS docs) | How DynamoDB implements transactions on a distributed key-value store using timestamp ordering |
| Calvin: Fast Distributed Transactions for Partitioned Database Systems (Thomson et al., 2012) | Deterministic transaction ordering — decide the serial order BEFORE execution, eliminating the need for 2PL or SSI |
Every strong interview answer on transactions touches five areas. Practice framing your answers across all five:
| Dimension | Example question | What a great answer includes |
|---|---|---|
| Concept | "Explain the difference between snapshot isolation and serializable." | Both use MVCC. Snapshot allows write skew (reads don't create conflicts). Serializable tracks read-write dependencies (SSI) or uses predicate locks (2PL). Concrete example of anomaly that one prevents but the other doesn't. |
| Design | "Design a booking system that prevents double-booking." | SELECT FOR UPDATE on the resource row. Atomic decrement with WHERE available > 0. Idempotency key for retries. Discuss the scalability path: partition seats into blocks. |
| Code | "Implement MVCC visibility rules." | The MVCCStore class from this chapter. Created_by and deleted_by transaction IDs. Visibility depends on commit status relative to the reading transaction's snapshot. |
| Debug | "Our payment system occasionally double-charges customers." | Missing idempotency key → retries re-execute. Or: read committed with read-modify-write → lost update. Systematic diagnosis: check isolation level, check for non-atomic operations, check retry logic. |
| Frontier | "What's the state of the art for distributed transactions?" | Spanner (TrueTime + Paxos), CockroachDB (hybrid logical clocks + Raft), Calvin (deterministic ordering), FoundationDB (optimistic concurrency on sorted key-value). Trend: moving from pessimistic (2PL) to optimistic (SSI) approaches. |
Transactions don't exist in isolation (pun intended). They connect to nearly every topic in distributed systems.
| Related Topic | Connection |
|---|---|
| DDIA Ch6: Replication | Replication determines how many copies of data exist. Transactions on a single leader are simpler than on multiple leaders or leaderless replicas. Multi-leader replication with transactions requires cross-leader coordination (which is why most systems don't support it). |
| DDIA Ch9: The Trouble with Distributed Systems | Network partitions, clock skew, and process pauses all complicate distributed transactions. 2PC's coordinator failure is a specific instance of the general problem: you can't distinguish a crashed node from a slow one. |
| DDIA Ch10: Consensus | Consensus protocols (Paxos, Raft) solve the problem that 2PC fails at: agreeing on a decision even when nodes crash. Spanner uses Paxos-based state machines to implement distributed transactions without 2PC's blocking problem. |
| CAP Theorem | Transactions with strong isolation require linearizability, which forces you to sacrifice availability during network partitions (the "CP" choice). BASE systems sacrifice consistency for availability (the "AP" choice). |
| Stream Processing | Event sourcing and the outbox pattern are alternatives to distributed transactions. Instead of atomically updating two systems, you write to one system and asynchronously propagate changes via events. |
| Batch Processing | MapReduce and Spark jobs process immutable input datasets and produce new output datasets. They don't need traditional transactions because they don't do in-place updates — they write fresh output. If a job fails, you rerun it from scratch. |
To ground this lesson in practical reality, here is how some well-known systems handle transactions:
| Company | System | Transaction Strategy |
|---|---|---|
| Stripe | Payment processing | Idempotency keys + single-database ACID transactions. Every API request has an idempotency key. The payment operation and the idempotency record are written atomically. |
| Uber | Trip management | Saga pattern with compensating transactions. Trip creation, driver matching, and payment are separate services with eventual consistency. |
| Ads serving | Spanner for ad budget management (global consistency). Bigtable for click logs (eventual consistency). Different consistency for different data. | |
| Netflix | Content catalog | Eventual consistency everywhere. If a title shows up 5 seconds late in the catalog, nobody notices. No need for transactions. |
| Banks | Core banking | Serializable transactions on mainframes (IBM DB2, Oracle). The one domain where "correctness at all costs" is literal — regulators audit for consistency. |
Having interviewed dozens of candidates on transactions, here are the most common mistakes:
| Mistake | What they say | What they should say |
|---|---|---|
| "ACID means all four properties are equally important" | "A database is ACID or it isn't" | A, I, and D are database guarantees. C is mostly an application property. Different isolation levels trade off I for performance. |
| "Transactions are slow" | "We should avoid transactions for performance" | Read Committed transactions are nearly free. Only serializable adds significant overhead. The question is which isolation level, not whether to use transactions. |
| "SELECT FOR UPDATE solves everything" | "Just add FOR UPDATE to every SELECT" | FOR UPDATE can't lock rows that don't exist (phantoms). It also reduces concurrency by serializing all readers. Use it surgically on specific hot rows, not broadly. |
| "Eventual consistency means inconsistency" | "BASE systems have no guarantees" | BASE systems guarantee eventual convergence. The question is: how long is "eventual" and what happens during the convergence window? For many use cases, sub-second convergence is perfectly acceptable. |
| "2PC is the standard for distributed transactions" | "We'll use 2PC across microservices" | 2PC is a last resort due to its blocking failure mode. Prefer sagas with compensating transactions, or design your data model so transactions are local to one service. |
The beauty of transactions is that they let application developers think about correctness without simultaneously thinking about every possible interleaving of concurrent operations and every possible crash timing. They are not free — they cost performance — but the alternative is ad-hoc, error-prone, application-level concurrency control that almost always has bugs you won't discover until production traffic exposes them.
Here is the mental model that ties everything together. Think of transactions along two axes:
| Single Node | Distributed | |
|---|---|---|
| Weak Isolation | Read Committed (fast, most apps). Snapshot Isolation (backups, analytics). | Eventual consistency. BASE semantics. Good for shopping carts, social feeds. |
| Strong Isolation | Serializable (SSI or 2PL). Required for financial data, inventory, bookings. | Spanner (Paxos + TrueTime). CockroachDB (Raft). Extremely expensive infrastructure. |
Most applications live in the top-left quadrant: single-node, read committed, with selective use of explicit locking (SELECT FOR UPDATE) for critical sections. This is PostgreSQL's sweet spot and covers 90% of real-world needs.
The bottom-right quadrant — distributed serializable transactions — is the hardest problem in databases. Only a handful of systems (Google Spanner, CockroachDB, FoundationDB) provide it, and they come with significant latency and infrastructure costs. If your interviewer asks "How would you build a globally distributed, strongly consistent payment system?", the answer involves one of these systems — not handrolled 2PC.
Master the isolation levels. Know which anomalies each one prevents. Understand the implementation mechanisms (locks, MVCC, SSI). And always ask the interview question that separates senior from staff: "What isolation level does this system actually need, and what happens if we get it wrong?"
Since PostgreSQL is the most common database in interviews and startups, here are the practical commands you should know:
sql -- Check current isolation level SHOW default_transaction_isolation; -- read committed (default) -- Set isolation level for one transaction BEGIN; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; -- ... do work ... COMMIT; -- Check for blocked queries (deadlocks, lock waits) SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state = 'active' AND wait_event_type IS NOT NULL; -- Check for long-running transactions (MVCC bloat risk) SELECT pid, now() - xact_start AS duration, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start; -- Check dead tuples (VACUUM needed?) SELECT relname, n_dead_tup, n_live_tup, round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) AS dead_pct FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10; -- Set safety timeouts (do this in production!) SET lock_timeout = '5s'; -- abort if can't get lock in 5s SET statement_timeout = '30s'; -- abort queries running > 30s SET idle_in_transaction_session_timeout = '5min'; -- kill idle txns