Principles of Eventual Consistency — Burckhardt (MSR 2014)

Eventual Consistency

CRDTs, consistency models, CAP theorem, protocol correctness — the formal foundations of distributed data.

Prerequisites: Replication basics + Set theory. That's it.
12
Chapters
10+
Simulations
6
CRDTs Implemented

Chapter 0: The Problem

Two travelers open the same airline app at the same moment. Flight 714 to Lisbon has exactly one seat left — 22A, window, just behind the wing. One traveler is in Chicago, the other in Berlin. Both tap "Book Now" at 14:03:27 UTC. Both get a confirmation screen. Both believe they own seat 22A.

Now what?

The airline's database has replicas in North America and Europe. The Chicago request hit the US replica. The Berlin request hit the EU replica. Between these datacenters, the network round-trip is ~85 milliseconds. That 85ms gap is where distributed systems break. During those milliseconds, neither replica knows about the other's booking. Each one independently decided the seat was free, assigned it, and confirmed to the user.

This is not a hypothetical — it happens every day at scale. This exact class of bug caused United Airlines to oversell 42 flights in a single weekend in 2017. Marriott's reservation system double-booked 10,000 hotel rooms in a merger. Amazon's inventory system sold items that didn't exist during Prime Day surges. Every company building multi-region systems eventually confronts this problem.

The naive fix is obvious: before confirming, check with ALL replicas. Make sure nobody else claimed the seat. But this means the Chicago request must wait for a round-trip to Europe (85ms), and if the European datacenter is down or unreachable, the Chicago user can't book at all. You've traded availability for consistency.

The alternative: let both bookings succeed immediately, then resolve the conflict later. Maybe the system picks the earlier timestamp. Maybe it picks randomly. Maybe it refunds one traveler. This preserves availability but sacrifices consistency — for some window of time, the system believes two people own the same seat.

This is the central tension of distributed systems. When replicas are separated by a network — and networks are always slow, always unreliable — you cannot simultaneously have: (1) every replica agrees on every value at every moment (strong consistency), (2) every replica responds to every request (availability), and (3) the system works during network partitions (partition tolerance). You must choose. This lesson is about understanding exactly what you're choosing, what you're giving up, and the mathematics that prove why.

Sebastian Burckhardt's 2014 monograph, Principles of Eventual Consistency, does something no textbook before it accomplished: it takes the fuzzy, hand-wavy notion of "eventual consistency" and gives it rigorous mathematical foundations. He defines a hierarchy of consistency models as sets of axioms on event graphs. He proves which guarantees are achievable under which conditions. And he provides concrete data structures — CRDTs (Conflict-free Replicated Data Types) — that provably converge without coordination.

By the end of this lesson, you'll understand not just WHAT eventual consistency means, but WHY it's defined the way it is, HOW to implement systems that guarantee convergence, and WHEN you need something stronger.

The Scale of the Problem

Every major distributed system has war stories about consistency bugs. These aren't theoretical — they cost real money and real trust:

IncidentRoot CauseImpact
Amazon DynamoDB 2015Eventual consistency + stale read led to order being placed for out-of-stock itemFulfillment failures, customer refunds
GitHub 2018MySQL replication lag caused users to see stale repository state after pushesUsers thought commits were lost, re-pushed, caused merge conflicts
Facebook 2019Cache invalidation across regions took >30 seconds; users saw stale profile dataPrivacy concerns (old settings visible to others)
Google Cloud Spanner 2020Clock synchronization drift exceeded TrueTime bounds during datacenter maintenanceBrief unavailability in affected zone (CP system behaved correctly)

The lesson from these incidents: consistency bugs are not about correctness in the abstract. They're about real users seeing impossible states — ghost data, lost writes, duplicate purchases, privacy violations. The formal framework in this lesson exists because informal reasoning about these systems is not reliable enough.

The Airline Seat Problem

Two users try to book the same seat across a network partition. Watch the consistency-availability tradeoff in real time.

Click "Both Book Now" to see what happens when two users book the same seat simultaneously.

Why Can't We Just Use Locks?

If you've built concurrent programs, your first instinct might be: "use a distributed lock." Client A acquires a lock on seat 22A, writes, releases. Client B tries to acquire, blocks until A is done. Problem solved?

Distributed locks are harder than they sound. Consider what happens if the lock holder crashes before releasing. The lock is stuck forever unless you add a timeout. But if you add a timeout, what happens if the holder is just slow (not crashed)? It loses the lock while still thinking it holds it, and another client acquires the lock. Now TWO clients think they hold the lock. This is the fencing problem.

// Distributed lock failure scenario:

// t=0: Client A acquires lock with 10s timeout
// t=1: Client A begins writing to seat 22A
// t=8: Client A pauses for GC (garbage collection) for 5 seconds
// t=10: Lock EXPIRES. Client B acquires lock.
// t=11: Client B writes to seat 22A
// t=13: Client A's GC finishes. A still thinks it holds the lock!
// t=14: Client A writes to seat 22A. CONFLICT.

// Solution: fencing tokens. Lock server gives each acquisition a
// monotonically increasing token. The resource rejects writes with
// tokens older than the latest it's seen.
// But now you need a linearizable lock server... which brings us
// back to the consistency problem we're trying to solve.

The deeper issue: distributed locks require a consensus service (like ZooKeeper or etcd) to implement correctly. And consensus services are expensive — they require majority agreement on every lock acquisition. For a global system with thousands of operations per second, the lock service becomes a bottleneck. CRDTs and eventual consistency offer an alternative: don't lock at all, let everyone write, and resolve conflicts mathematically.

Interview warm-up: Two clients write to the same key on different replicas at the same time during a network partition. Under eventual consistency, what is GUARANTEED to happen?

Chapter 1: The Consistency Spectrum

Most developers think of consistency as binary: either your system is consistent or it isn't. This is like saying temperature is binary — either hot or cold. In reality, there's a precise spectrum of consistency guarantees, each one relaxing a specific constraint to gain performance or availability.

Before Burckhardt's formalization, the consistency landscape was confusing. Different papers used the same terms to mean different things. "Eventual consistency" in Amazon's Dynamo paper meant something subtly different from "eventual consistency" in Werner Vogels's blog post, which was different again from the academic definition. Burckhardt's contribution was to pin each model down with mathematical axioms, making the definitions unambiguous.

What emerged is surprisingly elegant: the models form a clean hierarchy, each one obtained from the previous by removing exactly one axiom. Understanding this hierarchy is the single most important thing you can learn about distributed systems — it's the map that tells you exactly what you're getting and what you're giving up for every design choice you make.

Burckhardt's key insight is that each consistency model is a set of axioms about how operations relate to each other. Stronger models have more axioms. Weaker models have fewer. Fewer axioms means more behaviors are "legal" — which means the system has more freedom, which means it can be faster and more available, but also harder for programmers to reason about.

An analogy: traffic laws. A country with many traffic laws (drive on the right, stop at red, yield to pedestrians, speed limits) has predictable traffic — you know what other drivers will do. A country with few laws has more "freedom" — drivers can go faster, take shortcuts — but you can't predict what anyone will do. Consistency models are traffic laws for data. Linearizability is a well-regulated highway. Eventual consistency is a lawless dirt road. Both get you there, but the experience is very different.

The Hierarchy

From strongest to weakest:

Linearizability (the gold standard). Every operation appears to take effect at a single instant between its invocation and completion. All observers see operations in the same order, and that order respects real-time. If operation A finishes before operation B starts, then A appears before B — for everyone. This is what single-threaded programs give you for free. In a distributed system, it requires coordination on every operation.

Sequential consistency. There exists SOME total order of all operations that is consistent with each client's local order. But this order does NOT have to respect real-time. If you write X=1, then I write X=2, a sequentially consistent system could show everyone X=1 (ordering my write before yours), as long as each individual client's operations appear in the order they issued them. The difference from linearizability: real-time ordering is not preserved across clients.

Causal consistency. If operation A could have influenced operation B (because B happened after A on the same client, or B read a value written by A), then everyone sees A before B. But operations that are concurrent — neither could have influenced the other — may appear in different orders on different replicas. This is weaker than sequential because there's no single total order of concurrent operations.

Eventual consistency. If no new updates are made, all replicas will eventually converge to the same value. That's it. No ordering guarantees during updates. No bound on how long "eventually" takes. Replicas can temporarily disagree on everything. The only promise: silence leads to agreement.

Quiescent consistency. Even weaker — convergence is only guaranteed during periods of quiescence (no operations in flight). If operations never stop, replicas may never converge. This is the absolute floor of useful consistency.

Key insight: each level removes exactly one constraint. Linearizability requires real-time order + total order + per-client order. Sequential drops real-time order. Causal drops total order (keeps causal order). Eventual drops all ordering (keeps convergence). Quiescent drops bounded convergence. Understanding what each level removes is the entire game.
ModelGuaranteesGives UpReal System
LinearizableReal-time total orderPerformance, availabilityGoogle Spanner, ZooKeeper
SequentialTotal order (not real-time)Real-time ordering across clientsPOSIX file systems
CausalCausally-related operations orderedTotal order of concurrent opsMongoDB (causal sessions), COPS
EventualReplicas converge when updates stopAll ordering during updatesCassandra, DynamoDB, Riak
QuiescentConverge during quiet periodsConvergence during continuous loadDNS (with TTL-based propagation)

Worked Example: What Each Model Allows

Two clients, two replicas. Client A writes X=1, then Client B writes X=2. Client C reads X from both replicas.

// Timeline:
// t=0: Client A writes X=1 to Replica 1
// t=5: Client B writes X=2 to Replica 2
// t=10: Client C reads X from Replica 1, then Replica 2

// LINEARIZABLE: Client C must read X=2 from both.
// Why? A's write finished (t=0), B's write finished (t=5),
// B is after A in real-time, so X=2 is the "latest."
// C reads at t=10, after both writes. Must see X=2.

// SEQUENTIAL: Client C might read X=1 from both, or X=2 from both.
// Legal orderings: A then B (result=2) or B then A (result=1).
// Both replicas must agree. But real-time is ignored.

// CAUSAL: A and B are on different clients, no causal link.
// Replica 1 could show X=1, Replica 2 could show X=2.
// They're concurrent, so no ordering is required between them.

// EVENTUAL: Anything goes. C might read X=1 from Replica 1
// and X=2 from Replica 2. Or vice versa. Eventually they'll agree.

Linearizability: The Gold Standard (In Detail)

Linearizability deserves special attention because it's what single-threaded programs give you for free — and what distributed systems spend enormous effort to approximate. The formal definition: every operation appears to take effect at a single, atomic point between its invocation and completion, and all operations are consistent with a single global total order that respects real time.

// Linearizability check: is this execution linearizable?

// Client A: write(X, 1) starts at t=0, completes at t=5
// Client B: write(X, 2) starts at t=3, completes at t=8
// Client C: read(X) starts at t=6, completes at t=7

// A's write could linearize at any point in [0, 5].
// B's write could linearize at any point in [3, 8].
// C's read could linearize at any point in [6, 7].

// If A linearizes at t=2, B at t=4, C at t=6.5:
// Order: A(write 1) → B(write 2) → C(read)
// C must return 2 (latest write). ✓ Legal.

// If A linearizes at t=4.5, B at t=3.5, C at t=6.5:
// Order: B(write 2) → A(write 1) → C(read)
// C must return 1 (latest write). ✓ Also legal.

// C CANNOT return 0 (initial value) — both writes
// completed before C started, so C must see at least one.
// Returning 0 would be a linearizability violation.

Checking whether an execution history is linearizable is NP-complete in general (you're searching for a linearization point assignment). In practice, tools like Jepsen use randomized testing to find violations.

Causal Consistency: The Sweet Spot

Many practitioners argue that causal consistency is the "sweet spot" — it prevents the most confusing anomalies (seeing a reply before the original message) while remaining implementable without global coordination. The key: causal consistency requires no total order of concurrent operations, which means replicas don't need to agree on how to order unrelated events.

// Why causal is the "sweet spot":

// Causal prevents:
// - Read-your-writes violations (you see your own updates)
// - Monotonic reads (you never go back in time)
// - Causal anomalies (see effect before cause)

// Causal allows:
// - Concurrent writes ordered differently on different replicas
// (usually fine — who cares if two unrelated likes are in different order?)

// Causal achievable WITHOUT:
// - Global consensus (no Paxos/Raft needed)
// - Total ordering (no single point of failure)
// - Blocking during partitions (remains available!)

// This is why COPS, Eiger, and MongoDB's causal sessions exist.
// Causal consistency + CRDTs = convergent without coordination.
The Consistency Hierarchy

Click each consistency level to see what it guarantees and what it gives up. Real systems are mapped to their actual consistency level.

Interview question: Client A writes X=5 and then reads X from the same replica. Under causal consistency, what MUST this read return?

Chapter 2: Event Graphs & Histories

To reason precisely about consistency, we need precise language. Without formalism, engineers argue in circles: "our system is eventually consistent" — but what does that MEAN? When does it converge? What can a reader see during convergence? What ordering is guaranteed? These questions have precise answers, but only if we define our terms precisely.

Burckhardt's formalism starts with the simplest possible building block: an event. Every time a client reads or writes a value, that's an event. An event has an operation (what it did), a return value (what it saw), and an identity (which client, when, where). Events are the atoms of distributed computation — everything else is built from them.

The genius of this approach is its generality. Whether you're building a key-value store, a message queue, a file system, or a collaborative editor, the formal model is the same: events, visibility, session order, arbitration. The specific operations differ, but the framework for reasoning about consistency is universal.

Now the key question: which events can "see" other events? If event A writes X=5 and event B later reads X=5 on the same replica, then B can see A. We write this as A vis B — "A is visible to B." But if B is on a different replica and the write hasn't propagated yet, A is NOT visible to B. B might read an old value, or nothing.

The visibility relation is the formal way of saying "this operation knew about that operation." It captures the flow of information through the system. When you replicate a write from node A to node B, you're extending the visibility relation: events on A become visible to future events on B. The strength of a consistency model is determined by how much visibility it guarantees.

The Three Relations

Session order (so). Events by the same client are ordered. If client 1 writes X, then reads Y, the write comes before the read in session order. This is the programmer's expectation: my operations happen in the order I issued them.

so ⊆ E × E : (a, b) ∈ so ⇔ same client, a before b

Visibility (vis). Which events can an operation "see"? If event A is visible to event B, then B's return value was computed with knowledge of A. Visibility is NOT the same as session order — an event on replica 2 might see events from replica 1 (after replication), but session order only relates events from the same client.

vis ⊆ E × E : (a, b) ∈ vis ⇔ b was computed knowing about a

Arbitration (ar). A total order on events that resolves conflicts. When two events write to the same key, arbitration decides who "wins." Think of it as the tiebreaker. In a last-writer-wins register, arbitration is determined by timestamps. In a Paxos-based system, arbitration is the consensus order.

ar ⊆ E × E : total order resolving conflicts

A History

These three relations — session order, visibility, and arbitration — are the alphabet of Burckhardt's formalism. Every consistency guarantee, every protocol property, every correctness proof is written in terms of these relations. Master them, and the rest of the framework follows.

Formally, a history is a tuple H = (E, op, rval, so, vis) where:

SymbolMeaningExample
ESet of events{e1, e2, e3, e4, e5}
opOperation for each eventop(e1) = write(X, 5)
rvalReturn value for each eventrval(e3) = 5
soSession order (per-client)e1 → e2 (same client)
visVisibility relatione1 → e3 (e3 saw e1's write)

Happens-Before: Combining Session Order and Visibility

Lamport's famous happens-before relation is the transitive closure of session order and visibility combined. If A is in session order before B, and B is visible to C, then A happens before C — even though A and C might be on different clients and different replicas.

hb = (so ∪ vis)+

Two events are concurrent if neither happens before the other:

a ∥ b ⇔ ¬(a hb b) ∧ ¬(b hb a)

Concurrent events are the root of all distributed systems complexity. If two events are ordered by happens-before, we know which one "came first" and can resolve conflicts. If they're concurrent, there IS no "first" — both happened independently, and the system must have a strategy for resolving the disagreement.

The key insight about happens-before: It's a PARTIAL order, not a total order. In a total order (like real numbers), every pair of elements is comparable — one is always less than or equal to the other. In a partial order, some pairs are incomparable. That's concurrency. Two events are concurrent precisely when they're incomparable in the happens-before partial order. The fraction of events that are concurrent determines how much "freedom" (and how many conflicts) the system has.

Happens-Before is NOT Wall-Clock Time

A critical distinction. Wall-clock time (UTC timestamps) is NOT the same as happens-before ordering. Two events can happen at the same wall-clock time on different replicas and still be ordered by happens-before (if one saw the other via replication). Conversely, two events with different wall-clock times can be concurrent (if neither saw the other).

// Wall-clock ≠ Happens-before

// Case 1: Same wall-clock time, but ORDERED by happens-before
// t=0: A writes X=1 on Replica 1
// t=0: Replica 2 receives A's write (fast replication), reads X=1
// These events happened at the "same time" but write hb→ read.

// Case 2: Different wall-clock times, but CONCURRENT
// t=0: A writes X=1 on Replica 1 (slow clock)
// t=5: B writes X=2 on Replica 2 (fast clock)
// Despite B happening "later," B didn't see A's write.
// No visibility, different clients → CONCURRENT.
// Wall-clock time is IRRELEVANT to the formal model.

This is why Lamport invented logical clocks — to capture happens-before without relying on physical time. Vector clocks extend this to capture exactly which events each replica has seen, enabling precise concurrency detection.

Worked Example: Building an Event Graph

// Three clients, two replicas.
// Client 1 (on Replica A): writes X=1, then X=3
// Client 2 (on Replica B): writes X=2
// Client 3 (on Replica A): reads X after replication from B

Events:
e1: Client 1 writes X=1 on Replica A
e2: Client 1 writes X=3 on Replica A
e3: Client 2 writes X=2 on Replica B
e4: Client 3 reads X on Replica A (after sync from B)

Session order: e1 →so e2 (same client, sequential)

Visibility:
e1 →vis e2 (same replica, e2 sees e1)
e1 →vis e4 (same replica, e4 sees e1)
e2 →vis e4 (same replica, e4 sees e2)
e3 →vis e4 (after replication, e4 sees e3)

Happens-before:
e1 →hb e2 →hb e4
e3 →hb e4

Concurrent:
e1 ∥ e3 (different clients, different replicas, no vis)
e2 ∥ e3 (different clients, different replicas, no vis)

// What does e4 read?
// e4 sees {e1: X=1, e2: X=3, e3: X=2}
// Arbitration decides: if LWW by timestamp, the latest wins.
// If e2 has timestamp 10 and e3 has timestamp 12: e4 reads X=2.
// If e2 has timestamp 12 and e3 has timestamp 10: e4 reads X=3.
Interactive Event Graph Builder

Click on a replica to create write events. Watch visibility and happens-before edges form. Concurrent events are highlighted.

Add events on replicas, then sync to see visibility edges form.
Interview question: Client A writes X=1 on Replica 1. Client B, on Replica 2, writes X=2 without having seen A's write. Are these events concurrent?

Chapter 3: Abstract Executions & Consistency Guarantees

We have events, visibility, session order, and arbitration. Now Burckhardt's central contribution: a consistency guarantee is a SET OF AXIOMS constraining these relations. Each axiom forbids certain behaviors. More axioms = stronger consistency = fewer legal executions = harder to implement efficiently.

An abstract execution is a tuple A = (H, vis, ar) where H is a history, vis is the visibility relation, and ar is the arbitration (total conflict-resolving order). A consistency model is a predicate that says "this abstract execution is legal" or "this one is not."

The Axioms

Burckhardt identifies five core axioms. Every consistency model is a combination of these:

Axiom 1: Eventual Visibility (EVIS). For any two events a and b, eventually a becomes visible to b or b becomes visible to a. No event stays invisible forever.

∀ a, b ∈ E : eventually a vis b ∨ b vis a

Axiom 2: Consistent Visibility (CVIS). Visibility is consistent with session order. If a client does operation A then operation B, anything visible to A is also visible to B.

∀ a, b, c : (a vis b) ∧ (b so c) ⇒ (a vis c)

In other words: a client never "forgets" what it has seen. If you read X=5 and then read X again, you won't see X=3 (an older value). This is also called monotonic reads.

Axiom 3: Consistent Arbitration (CAR). Arbitration is consistent with visibility. If a is visible to b, then a comes before b in the arbitration order.

∀ a, b : (a vis b) ⇒ (a ar b)

Axiom 4: Total Visibility (TVIS). If b sees a, then b sees everything that a saw. Visibility is transitively closed.

∀ a, b, c : (a vis b) ∧ (b vis c) ⇒ (a vis c)

Axiom 5: Real-Time Order (RTO). If operation a completes before operation b starts (in wall-clock time), then a is visible to b.

∀ a, b : (a finishes before b starts) ⇒ (a vis b)

Building Models from Axioms

ModelAxioms Required
QuiescentEVIS (during quiescence only)
EventualEVIS
CausalEVIS + CVIS + TVIS
SequentialEVIS + CVIS + TVIS + CAR + total vis
LinearizableEVIS + CVIS + TVIS + CAR + total vis + RTO
The punchline: Linearizability = Sequential + real-time ordering. Sequential = Causal + total order. Causal = Eventual + transitive visibility + session guarantees. Each step adds exactly one constraint. This is not handwaving — it's a mathematical lattice of axiom combinations.

Why Axioms, Not Algorithms?

You might wonder: why define consistency models as axioms instead of algorithms? Because the same consistency guarantee can be implemented by many different protocols. Linearizability can be achieved by Paxos, Raft, viewstamped replication, or a single-node database. By defining models as axioms on abstract relations, Burckhardt separates the WHAT (the guarantee) from the HOW (the implementation). This lets you prove that a new protocol provides a known guarantee, or that a known guarantee is sufficient for your application — independently.

It's the same principle as interfaces in programming: the axioms are the interface, the protocols are the implementations. You can swap implementations without changing the guarantee, and you can analyze the guarantee without knowing the implementation.

The Axiom Dependencies

The axioms are not independent — some imply others, and some combinations are contradictory or redundant. Understanding the dependency structure is key to navigating the consistency hierarchy:

// Axiom dependency graph:

// TVIS + CVIS together imply: if a vis b and b so c,
// then everything a saw is also seen by c.
// This is EXACTLY "session guarantees" — monotonic reads,
// read-your-writes, writes-follow-reads.

// CAR alone is weak — it just says arbitration respects visibility.
// CAR + TVIS + CVIS gives you sequential consistency's key property:
// there's a single total order (the arbitration) that everyone agrees on,
// and it's consistent with what each replica has seen.

// RTO is the strongest single axiom. It forces:
// "if you finished before I started, I see you."
// This pins the ordering to physical time.
// Combined with all other axioms → linearizability.

Worked Example: Which Executions Are Legal?

// Setup: Two clients, one register X (initially 0).
// Client A: write(X, 1)
// Client B: write(X, 2), then read(X)

// Question: What can Client B's read return?

// Under LINEARIZABLE:
// If A's write finished before B's write: order is A(w1), B(w2), B(r?)
// B's read must return 2 (B's own write is latest).
// If B's write finished before A's write: order is B(w2), A(w1), B(r?)
// Wait — B's read comes after B's write in session order.
// If A's write is concurrent with B's (w2), and real-time says A(w1)
// finishes before B(r), then B(r) must see A(w1). Returns 1.
// Legal: {1, 2} depending on real-time ordering.

// Under SEQUENTIAL:
// Any total order consistent with per-client order.
// Options: A(w1) B(w2) B(r) → returns 2
// B(w2) A(w1) B(r) → returns 1
// B(w2) B(r) A(w1) → returns 2 (B reads own write)
// Legal: {1, 2}

// Under CAUSAL:
// B(w2) so→ B(r): B's read must see B's write.
// A(w1) is concurrent with B's operations.
// B's read could see {A(w1), B(w2)} or just {B(w2)}.
// If it sees both, conflict resolution decides (LWW, etc).
// Legal: {1, 2} (could return 1 if A wins conflict resolution)

// Under EVENTUAL:
// B(r) might not even see B(w2) yet!
// Wait — actually, session order within a single replica is preserved.
// But pure eventual makes no session guarantees.
// Legal: {0, 1, 2} — B could read the initial value!

Counting Legal Executions: A Complexity Perspective

Here's an eye-opening way to understand the consistency spectrum: count how many distinct executions are legal under each model for the same set of operations.

// 3 clients, each performing 1 write: write(X,1), write(X,2), write(X,3)
// Then 1 client reads X. How many possible read results?

// LINEARIZABLE:
// All 3 writes have a fixed real-time order.
// If the order is w1→w2→w3, read returns 3.
// If w3→w2→w1, read returns 1.
// But the real-time order is FIXED — exactly 1 legal result.
// Legal results: 1 value.

// SEQUENTIAL:
// Any permutation of the 3 writes is legal.
// 3! = 6 possible orderings. Read sees the last in that order.
// Legal results: up to 3 values (whichever write is "last").

// CAUSAL:
// If no causal dependencies between the writes (all concurrent),
// each replica can order them differently.
// Legal results: up to 3 values.

// EVENTUAL:
// The read might not see ANY of the writes yet.
// Legal results: up to 4 values (initial + 3 writes).

// Pattern: weaker consistency → more legal results → harder to reason about.
// For N concurrent writes:
// Linearizable: 1 legal result
// Sequential: up to N legal results
// Eventual: up to N+1 legal results (including initial value)

This is why eventual consistency is hard for programmers. If you write code that reads a value and branches on it, under linearizability there's one possible execution path. Under eventual consistency, there could be many. Your code must handle ALL of them correctly.

The testing problem. Under linearizability, testing is straightforward: run the operation, check the result. Under eventual consistency, you need to test EVERY possible interleaving of operations across replicas. For N operations on R replicas, the number of interleavings grows combinatorially. This is why tools like Jepsen (by Kyle Kingsbury) are so valuable — they systematically test distributed systems under various failure scenarios and check whether the observed behaviors are consistent with the claimed consistency model.
Axiom Toggle: What's Legal?

Toggle axioms on and off. See which executions become legal or illegal under different combinations.

EVIS alone = eventual consistency. Toggle more axioms to strengthen the model.
Interview question: A system provides causal consistency. Client A writes X=1. Client B reads X=1 (sees A's write), then writes Y=2. Client C reads Y=2 (sees B's write). What must Client C see when it reads X?

Chapter 4: Replicated Data Types (CRDTs)

Abstract executions and axioms tell us what guarantees a system provides. But they don't tell us HOW to build it. We need data structures that can live on multiple replicas, accept updates independently, and converge to the same state without coordination. These are Conflict-free Replicated Data Types — CRDTs.

The idea is deceptively simple, and the name is slightly misleading. "Conflict-free" doesn't mean conflicts don't happen — it means conflicts are resolved automatically, without human intervention, using mathematical properties of the data structure. A better name might be "automatically-convergent replicated data types," but CRDT is catchier.

A sequential data type is what you already know: a set of states, a set of operations, and a specification mapping (state, operation) to (new state, return value). A stack, a counter, a register, a set — all sequential data types. You interact with them one operation at a time, and each operation sees the result of all previous operations.

A replicated data type extends this to handle concurrent operations on multiple replicas. The question is: when two replicas independently apply different operations and then sync, do they end up in the same state? If YES for all possible orderings of all possible operations, the data type is a CRDT.

This is a remarkably strong guarantee. It says that NO matter what operations happen, in what order, on which replicas, with what delays or duplications in message delivery — the final state is ALWAYS the same. This is what "conflict-free" means: not that conflicts don't occur, but that they resolve themselves deterministically.

Three CRDT Families

State-based CRDTs (CvRDTs). Each replica maintains a complete copy of the state. Replicas periodically send their entire state to other replicas, which merge the received state with their own. The merge function must be commutative, associative, and idempotent — in other words, it must form a join-semilattice.

merge(a, b) = merge(b, a)   (commutative)
merge(a, merge(b, c)) = merge(merge(a, b), c)   (associative)
merge(a, a) = a   (idempotent)
Why a join-semilattice? These three properties guarantee convergence regardless of network topology. Messages can arrive out of order, be duplicated, or take different paths — as long as every update eventually reaches every replica, the merge function ensures they all end up in the same state. Idempotency handles duplicates. Commutativity handles reordering. Associativity handles different groupings.

Operation-based CRDTs (CmRDTs). Instead of sending full states, replicas broadcast operations. The key requirement: operations must commute. If replica A applies op1 then op2, and replica B applies op2 then op1, both must end up in the same state. This needs a reliable broadcast layer (every operation delivered at least once to every replica).

∀ op1, op2 concurrent: apply(apply(s, op1), op2) = apply(apply(s, op2), op1)

Delta-based CRDTs. A hybrid. Instead of full states (expensive for large data) or operations (need reliable broadcast), send only the DELTA — the part of the state that changed. Deltas must also form a join-semilattice. This is what modern implementations like Riak and Akka Distributed Data use.

The Convergence Theorem

Burckhardt's formal statement: a state-based CRDT with a join-semilattice merge function, where update operations are monotonically increasing (the state only moves "up" in the lattice), converges to a least upper bound under eventual delivery.

// Theorem: If (S, ⊑, ⊔) is a join-semilattice and
// every update u satisfies s ⊑ u(s) (monotonic increase),
// then for any set of updates U applied in any order on any replica,
// all replicas converge to ⊔{u(s₀) | u ∈ U} — the join of all updates.

// Proof sketch:
// 1. Each update moves the state up in the lattice.
// 2. Merge = join = least upper bound.
// 3. Join is commutative, associative, idempotent.
// 4. Therefore the result is independent of application order.
// 5. After all messages delivered, every replica has merged all states.
// 6. All replicas compute the same join = convergence. ∎

What IS a Join-Semilattice? (Concretely)

This sounds abstract, but it's just a fancy name for a set with a "combine" operation that always produces the same result regardless of order. Think of it like a bag of marbles: if you pour bag A into bag B, or bag B into bag A, you get the same combined bag. And pouring a bag into itself changes nothing.

Formally, a join-semilattice is a set S with a partial order ≤ and a binary operation ⊔ (join) such that for any a, b in S, a ⊔ b is the least upper bound — the smallest element that's greater than or equal to both a and b.

// Examples of join-semilattices:

// 1. Natural numbers with max:
max(3, 5) = max(5, 3) = 5 // commutative
max(3, max(5, 7)) = max(max(3, 5), 7) = 7 // associative
max(5, 5) = 5 // idempotent

// 2. Sets with union:
{a,b} ∪ {b,c} = {b,c} ∪ {a,b} = {a,b,c} // commutative
{a} ∪ {a} = {a} // idempotent

// 3. Vectors with component-wise max:
max([2,1,0], [0,3,1]) = [2,3,1] // used by G-Counter!

// NON-examples (NOT semilattices):
// + (addition): not idempotent (3+3=6 ≠ 3)
// last-write: not commutative (LWW uses timestamps to break symmetry)
Why does the semilattice matter for convergence? Because the three properties guarantee that no matter what order messages arrive, no matter how many duplicates, no matter which pairs of replicas talk to each other, the final merged state is always the same: the least upper bound of all updates. The path doesn't matter — only the destination.

Formal Proof: G-Counter Forms a Join-Semilattice

Let's prove rigorously that the G-Counter merge function satisfies all three semilattice properties. This is the kind of proof Burckhardt demands — not handwaving, but step-by-step verification.

// Theorem: (V^N, merge) forms a join-semilattice,
// where V^N is the set of all N-dimensional non-negative integer vectors
// and merge(A, B)[i] = max(A[i], B[i]).

// PROOF OF COMMUTATIVITY: merge(A, B) = merge(B, A)
// merge(A, B)[i] = max(A[i], B[i])
// = max(B[i], A[i]) (max is commutative)
// = merge(B, A)[i]
// Equal for all i. ✓

// PROOF OF ASSOCIATIVITY: merge(A, merge(B, C)) = merge(merge(A, B), C)
// merge(A, merge(B, C))[i] = max(A[i], max(B[i], C[i]))
// = max(max(A[i], B[i]), C[i]) (max assoc.)
// = merge(merge(A, B), C)[i]
// Equal for all i. ✓

// PROOF OF IDEMPOTENCY: merge(A, A) = A
// merge(A, A)[i] = max(A[i], A[i]) = A[i]
// Equal for all i. ✓

// PROOF OF MONOTONICITY: increment moves state "up" in lattice
// Define A ⊑ B iff A[i] ≤ B[i] for all i.
// After increment on replica j: A'[j] = A[j] + 1, A'[k] = A[k] for k≠j.
// A[i] ≤ A'[i] for all i (only increased one component).
// So A ⊑ A'. Increment is monotonically increasing. ✓

// COROLLARY: G-Counter is a valid state-based CRDT.
// merge is a semilattice join, increment is monotonic.
// By the convergence theorem, all replicas converge. ∎

This proof may seem trivially simple — and it is, for the G-Counter. But when you get to more complex CRDTs like OR-Sets or RGA (Replicated Growable Arrays for text editing), the proofs become intricate. The G-Counter is the "hello world" of CRDT proofs: simple enough to understand, complex enough to illustrate the technique.

State Size vs. Message Size: The Delta Trade-off

State-based CRDTs send the entire state on every sync. For a G-Counter with 100 replicas, that's 100 integers — fine. For an OR-Set with 10 million entries, sending the full state is catastrophic. This is where delta-based CRDTs shine.

A delta-CRDT only sends the change since the last sync. The key requirement: deltas must also form a semilattice, so they can be merged independently. The receiver accumulates deltas and periodically merges them into its full state. If deltas are lost, the sender falls back to sending the full state — but this is rare.

// G-Counter delta example:
// Replica 0 has vector [5, 3, 2]. It increments to [6, 3, 2].
// Full state to send: [6, 3, 2] — 3 integers
// Delta to send: [6, 0, 0] — only the changed component
// Receiver: merge([4, 3, 1], [6, 0, 0]) = [6, 3, 1] — delta is enough!

// OR-Set delta: only the new (element, tag) pair added/removed.
// Instead of sending 10M entries, send 1 entry. Massive savings.
G-Counter: Your First CRDT

Three replicas each maintain a counter vector. Increment on any replica, then merge to see convergence.

Each replica has a vector [A, B, C]. Increment locally, then merge to converge.
Interview question: Why can't you implement a distributed counter as a single integer that each replica increments? Why do you need a vector?

Chapter 5: Counters & Registers

Let's implement the fundamental CRDTs from scratch. Each one follows the same pattern: define the state, define the update operations, define the merge function, prove convergence. This is the "concept to realization" bridge — by the end of this chapter, you'll be able to implement any of these from memory.

We start with the simplest (G-Counter) and build toward the most complex (MV-Register). Each builds on the previous: the PN-Counter uses two G-Counters, the LWW-Register uses timestamps like the G-Counter uses max, and the MV-Register uses version vectors — which are themselves G-Counters used for tracking causality rather than counting.

G-Counter (Grow-only Counter)

The simplest useful CRDT, and arguably the most important — because it demonstrates the core principle that all CRDTs share. Each of N replicas maintains a vector of N integers. Replica i only increments position i. The counter value is the sum of all positions. Merge is component-wise maximum.

Why a vector instead of a single integer? Because we need the merge function to be idempotent. If we used addition, merging the same state twice would double the count. With max, merging the same state twice changes nothing — which is exactly what idempotency requires.

// State: vector V of N integers, one per replica
// Initial: V = [0, 0, ..., 0]

// Increment on replica i:
V[i] = V[i] + 1

// Value (query):
value() = ∑ V[j] for j = 0..N-1

// Merge two replicas:
merge(Va, Vb)[j] = max(Va[j], Vb[j]) for each j

Worked Trace: G-Counter with 3 Replicas

// Initial state: all replicas have V = [0, 0, 0]
A: [0,0,0] val=0 B: [0,0,0] val=0 C: [0,0,0] val=0

// Step 1: A increments twice
A: [2,0,0] val=2 B: [0,0,0] val=0 C: [0,0,0] val=0

// Step 2: B increments once
A: [2,0,0] val=2 B: [0,1,0] val=1 C: [0,0,0] val=0

// Step 3: C increments three times
A: [2,0,0] val=2 B: [0,1,0] val=1 C: [0,0,3] val=3

// Step 4: A merges with B
// merge([2,0,0], [0,1,0]) = [max(2,0), max(0,1), max(0,0)] = [2,1,0]
A: [2,1,0] val=3 B: [0,1,0] val=1 C: [0,0,3] val=3

// Step 5: B merges with C
// merge([0,1,0], [0,0,3]) = [0,1,3]
A: [2,1,0] val=3 B: [0,1,3] val=4 C: [0,0,3] val=3

// Step 6: All merge (A↔B, B↔C, A↔C)
// merge([2,1,0], [0,1,3]) = [2,1,3]
// merge([2,1,3], [0,0,3]) = [2,1,3]
A: [2,1,3] val=6 B: [2,1,3] val=6 C: [2,1,3] val=6
// ✓ CONVERGED. True count = 2+1+3 = 6. All agree.

PN-Counter (Positive-Negative Counter)

G-Counters can only go up. What if you need decrements? The naive approach — allowing negative values in the vector — breaks the semilattice property (max of -3 and 5 is 5, but that loses the -3 decrement). The elegant solution: use TWO G-Counters. One for increments (P), one for decrements (N). The value is P.value() - N.value(). Each G-Counter independently satisfies the semilattice properties, and subtraction preserves the correct total.

This "decompose into monotonic components" pattern appears throughout CRDT design. When you need a non-monotonic operation (decrement, remove), split the state into monotonically increasing parts and derive the non-monotonic result from them.

value() = P.value() - N.value() = ∑P[j] - ∑N[j]
// PN-Counter trace: A increments, B decrements, C increments
// P = positive G-Counter, N = negative G-Counter

Initial: P=[0,0,0] N=[0,0,0] val=0-0=0

A increments: P=[1,0,0] N=[0,0,0] val=1-0=1
B decrements: P=[1,0,0] N=[0,1,0] val=1-1=0
C increments: P=[1,0,1] N=[0,1,0] val=2-1=1

// After full merge:
All: P=[1,0,1] N=[0,1,0] val=2-1=1 ✓

PN-Counter: Full Worked Trace with Concurrent Decrements

// 3 replicas: A, B, C. Tracking a stock count.
// P = positive vector, N = negative vector

// Initial: P=[0,0,0] N=[0,0,0] val=0

// Step 1: A adds 3 items to stock (3 increments)
A: P=[3,0,0] N=[0,0,0] val=3
B: P=[0,0,0] N=[0,0,0] val=0
C: P=[0,0,0] N=[0,0,0] val=0

// Step 2: B adds 2 items
A: P=[3,0,0] N=[0,0,0] val=3
B: P=[0,2,0] N=[0,0,0] val=2
C: P=[0,0,0] N=[0,0,0] val=0

// Step 3: C sells 1 item (decrement on C)
A: P=[3,0,0] N=[0,0,0] val=3
B: P=[0,2,0] N=[0,0,0] val=2
C: P=[0,0,0] N=[0,0,1] val=-1 // Temporarily negative! Will fix after merge.

// Step 4: A sells 2 items concurrently
A: P=[3,0,0] N=[2,0,0] val=1
B: P=[0,2,0] N=[0,0,0] val=2
C: P=[0,0,0] N=[0,0,1] val=-1

// Step 5: Full merge
// P_merged = [max(3,0,0), max(0,2,0), max(0,0,0)] = [3,2,0]
// N_merged = [max(2,0,0), max(0,0,0), max(0,0,1)] = [2,0,1]
All: P=[3,2,0] N=[2,0,1] val=(3+2+0)-(2+0+1)=5-3=2

// Verify: started with 0, added 3+2=5, sold 2+1=3. Net = 2. ✓
PN-Counter limitation: The counter can go negative (if decrements exceed increments on a single replica before merge). Your application must decide if negative values are meaningful. For stock counts, a negative count might mean "oversold" — which you can detect and correct after merge. For like counts, you'd typically clamp to zero in the UI.

LWW-Register (Last-Writer-Wins Register)

A register holds a single value. When two replicas write concurrently, who wins? The LWW-Register uses timestamps: the write with the highest timestamp wins. Simple, but lossy — concurrent writes are silently discarded. Despite its simplicity, LWW is one of the most widely deployed CRDTs — Cassandra uses it for every cell in every table by default.

The catch: LWW requires globally unique, monotonically increasing timestamps. In a distributed system, clock skew means two replicas might assign the same timestamp to different writes. Worse, a replica with a fast clock will always "win" over a replica with a slow clock, regardless of actual event ordering. This is why systems like Cassandra use hybrid logical clocks — combining physical time with a logical counter to guarantee uniqueness.

// State: (value, timestamp) pair
// Update: write(v) → state = (v, now())
// Merge: keep the pair with the higher timestamp

merge((v1, t1), (v2, t2)) = (v1, t1) if t1 > t2
                          = (v2, t2) if t2 > t1
                          = (max(v1,v2), t1) if t1 = t2 // tiebreaker
The LWW trap. LWW is dangerous because it silently drops data. If user A writes "alice@email.com" and user B writes "bob@email.com" to the same register at nearly the same time, one write vanishes. The loser gets no error. For many applications (shopping carts, collaborative editors), this data loss is unacceptable. LWW is appropriate ONLY when the last write truly should win (e.g., sensor readings where only the latest matters).

MV-Register (Multi-Value Register)

The alternative to LWW: keep ALL concurrent values and let the application resolve the conflict. This is the MV-Register (Multi-Value Register). Amazon's original Dynamo shopping cart used this approach — concurrent adds to a shopping cart produce a set of carts, which the application merges (usually: union of all items).

The MV-Register uses version vectors (essentially vector clocks for data versions) to detect concurrency. When two writes have incomparable version vectors (neither dominates the other), they're concurrent, and both values are kept. When a later write has a version vector that dominates an earlier one, the earlier value is replaced.

// State: set of (value, version-vector) pairs
// Write: replace all entries dominated by writer's version vector
// Merge: keep entries not dominated by any other entry

// Example: A writes "X" with vector [1,0], B writes "Y" with vector [0,1]
// Neither dominates → both kept: {("X",[1,0]), ("Y",[0,1])}
// Application sees: "conflict: X or Y?" and decides.

// If C then writes "Z" with vector [1,1] (after seeing both):
// [1,1] dominates both [1,0] and [0,1]
// Result: {("Z",[1,1])} — conflict resolved.

MV-Register: Detailed Worked Trace

// MV-Register with version vectors (3 replicas: R0, R1, R2)
// Initial: all replicas have value=null, vector=[0,0,0]

// Step 1: R0 writes "Alice", ticks its clock
R0: {("Alice", [1,0,0])}
R1: {}                    
R2: {}                    

// Step 2: R0 syncs to R1. R1 now has R0's entry.
R0: {("Alice", [1,0,0])}
R1: {("Alice", [1,0,0])}
R2: {}                    

// Step 3: CONCURRENTLY:
// R1 writes "Bob" (clock [1,1,0] — saw R0's write)
// R2 writes "Carol" (clock [0,0,1] — never saw anyone)
R0: {("Alice", [1,0,0])}
R1: {("Bob", [1,1,0])}     // dominates "Alice"[1,0,0], replaced
R2: {("Carol", [0,0,1])}

// Step 4: R1 and R2 merge.
// "Bob"[1,1,0] vs "Carol"[0,0,1]:
// [1,1,0] does NOT dominate [0,0,1] (component 2: 0 < 1)
// [0,0,1] does NOT dominate [1,1,0] (components 0,1: 0 < 1)
// CONCURRENT! Keep both.
R1: {("Bob", [1,1,0]), ("Carol", [0,0,1])}  // CONFLICT
R2: {("Bob", [1,1,0]), ("Carol", [0,0,1])}

// Application sees: "conflict between Bob and Carol"
// Application resolves (e.g., picks Bob, or shows both to user)

// Step 5: R2 writes "Dave" (clock [1,1,2] — after seeing Bob+Carol)
// [1,1,2] dominates [1,1,0] AND [0,0,1]. Both replaced.
R2: {("Dave", [1,1,2])}   // Conflict resolved by new write

Python Implementations

python
class GCounter:
    """Grow-only counter CRDT."""
    def __init__(self, node_id, n_nodes):
        self.id = node_id
        self.vector = [0] * n_nodes

    def increment(self):
        self.vector[self.id] += 1

    def value(self):
        return sum(self.vector)

    def merge(self, other):
        for i in range(len(self.vector)):
            self.vector[i] = max(self.vector[i], other.vector[i])


class PNCounter:
    """Positive-Negative counter = two G-Counters."""
    def __init__(self, node_id, n_nodes):
        self.P = GCounter(node_id, n_nodes)
        self.N = GCounter(node_id, n_nodes)

    def increment(self):
        self.P.increment()

    def decrement(self):
        self.N.increment()

    def value(self):
        return self.P.value() - self.N.value()

    def merge(self, other):
        self.P.merge(other.P)
        self.N.merge(other.N)


class LWWRegister:
    """Last-Writer-Wins Register."""
    def __init__(self):
        self.value = None
        self.timestamp = 0

    def write(self, val, ts):
        if ts > self.timestamp:
            self.value = val
            self.timestamp = ts

    def read(self):
        return self.value

    def merge(self, other):
        if other.timestamp > self.timestamp:
            self.value = other.value
            self.timestamp = other.timestamp


class MVRegister:
    """Multi-Value Register — keeps all concurrent values."""
    def __init__(self, node_id, n_nodes):
        self.id = node_id
        self.n = n_nodes
        self.entries = []  # list of (value, vector_clock)
        self.clock = [0] * n_nodes

    def write(self, val):
        self.clock[self.id] += 1
        # New write dominates all current entries
        self.entries = [(val, self.clock[:])]

    def read(self):
        return [v for v, _ in self.entries]

    def _dominates(self, vc1, vc2):
        """Does vc1 causally dominate vc2?"""
        return (all(vc1[i] >= vc2[i] for i in range(self.n))
                and vc1 != vc2)

    def merge(self, other):
        # Combine entries, keep only non-dominated ones
        all_entries = self.entries + other.entries
        kept = []
        for v, vc in all_entries:
            dominated = any(
                self._dominates(vc2, vc)
                for _, vc2 in all_entries
            )
            if not dominated:
                kept.append((v, vc))
        # Deduplicate by vector clock
        seen = set()
        self.entries = []
        for v, vc in kept:
            key = tuple(vc)
            if key not in seen:
                seen.add(key)
                self.entries.append((v, vc))
        # Merge clocks
        for i in range(self.n):
            self.clock[i] = max(self.clock[i], other.clock[i])


# Demo: MV-Register concurrent writes
r0 = MVRegister(0, 3)
r1 = MVRegister(1, 3)
r0.write("Alice")          # r0: [("Alice", [1,0,0])]
r1.write("Bob")            # r1: [("Bob", [0,1,0])]
r0.merge(r1)
print(r0.read())           # ["Alice", "Bob"] — concurrent!
r0.write("Carol")          # resolves conflict
print(r0.read())           # ["Carol"] — single value
LWW-Register vs MV-Register

Perform concurrent writes on two replicas. Compare LWW (one value survives) vs MV (all concurrent values preserved).

Write to both replicas before merging to see the difference between LWW and MV.
Interview question: You're designing a collaborative document editor. Users can concurrently edit the document title. Should you use an LWW-Register or an MV-Register for the title field?

Chapter 6: Sets & Maps

Counters and registers handle single values. Most real applications need collections: shopping carts (sets of items), user permissions (sets of roles), configuration maps (key-value pairs). Making these conflict-free is harder than you'd think, because collections have an operation that counters and registers don't: remove. And remove in a distributed system is surprisingly tricky — when you remove an element that was concurrently added on another replica, which one wins?

G-Set (Grow-only Set)

The simplest collection CRDT: you can add elements but never remove them. Merge is union.

merge(A, B) = A ∪ B

This is trivially a CRDT — union is commutative, associative, and idempotent. But it's rarely useful in practice. You almost always need to remove elements.

G-Set: Worked Trace

// G-Set trace: 3 replicas tracking seen message IDs

// Step 1: Replica A sees messages {m1, m2}
A: {m1, m2} B: {} C: {}

// Step 2: Replica B sees message {m2, m3}
A: {m1, m2} B: {m2, m3} C: {}

// Step 3: Replica C sees message {m1, m4}
A: {m1, m2} B: {m2, m3} C: {m1, m4}

// Step 4: A merges with B
// merge = union: {m1, m2} ∪ {m2, m3} = {m1, m2, m3}
A: {m1, m2, m3} B: {m2, m3} C: {m1, m4}

// Step 5: Full merge (all pairs)
All: {m1, m2, m3, m4} // Union of all seen messages. ✓

G-Sets are perfect for tracking "have I seen this before?" — duplicate detection, read receipts, visited URLs. Anything where you only need to add elements and never remove them.

2P-Set (Two-Phase Set)

Like the PN-Counter pattern: two G-Sets, one for adds (A) and one for removes (R). An element is in the set if it's in A but NOT in R. The catch: once an element is removed, it can NEVER be re-added (it's permanently in R).

lookup(e) = (e ∈ A) ∧ (e ∉ R)
// 2P-Set trace: friends list

// Initial: A={}, R={}

// Add "Alice": A={Alice}, R={}. lookup(Alice)=true ✓
// Add "Bob": A={Alice, Bob}, R={}. lookup(Bob)=true ✓
// Remove "Alice": A={Alice, Bob}, R={Alice}. lookup(Alice)=false ✓

// Try to re-add "Alice": A={Alice, Bob}, R={Alice}.
// lookup(Alice) = (Alice ∈ A) ∧ (Alice ∉ R) = true ∧ false = false
// Alice is PERMANENTLY removed. Can't re-add. ✗
The 2P-Set limitation is brutal. Imagine a shopping cart. User adds item X, removes it, then changes their mind and re-adds it. With a 2P-Set, the re-add fails — X is permanently in the remove set. This is why 2P-Sets are rarely used in practice. The OR-Set fixes this.

OR-Set (Observed-Remove Set) — The Practical CRDT

The OR-Set is the most important set CRDT. The key idea: each ADD generates a unique tag. When you REMOVE an element, you only remove the tags you've OBSERVED — tags you haven't seen (from concurrent adds on other replicas) survive.

Formally: the state is a set of (element, unique-tag) pairs. Add(e) inserts (e, fresh_tag). Remove(e) removes all pairs (e, *) currently visible on this replica. Merge is union of all pairs.

// OR-Set trace: 3 replicas, element "X"

// Step 1: Replica A adds "X" (generates tag α)
A: {(X, α)} B: {} C: {}

// Step 2: A syncs to B
A: {(X, α)} B: {(X, α)} C: {}

// Step 3: B removes "X" — removes all observed tags for X
// B observes (X, α), so it removes (X, α)
A: {(X, α)} B: {} C: {}

// Step 4: CONCURRENTLY, C adds "X" (generates tag β)
A: {(X, α)} B: {} C: {(X, β)}

// Step 5: Full merge (union of all pairs)
// A has (X, α), B has {}, C has (X, β)
// merge = {(X, α), (X, β)} ... wait, B removed (X, α)!
// Correct merge: B's remove eliminates (X, α) from B's knowledge.
// After all syncs: {(X, β)} — C's concurrent add SURVIVED B's remove.

// Result: "X" is in the set. Add wins over concurrent remove.
// This is the "add-wins" semantics of OR-Set.

Why "Observed-Remove" Matters: A Shopping Cart Example

Consider Amazon's shopping cart, one of the most famous eventual-consistency use cases. You add milk to your cart on your phone. Your partner removes milk from the cart on their laptop. Concurrently, you add milk again (you realized you need two cartons). With a 2P-Set, the re-add fails — milk is in the tombstone set forever. With an OR-Set, the re-add generates a fresh tag, B's remove only killed the tags it observed, and the fresh tag survives. Milk is in the cart.

Amazon's original Dynamo paper (2007) used a similar approach: the shopping cart was an MV-Register that kept concurrent versions. The application merged them by taking the union of all items. This occasionally produced "resurrected" items (items that were removed would reappear), which Amazon preferred over lost items. The OR-Set formalizes this: add always wins over concurrent remove.

The Full OR-Set Merge (with tombstone optimization)

The naive OR-Set keeps growing forever — removed tags are gone from the local set but might arrive again via a stale replica. The practical fix: maintain a causal context (a vector clock summarizing all tags seen) alongside the entries. During merge, any tag that's in the causal context but NOT in the entries was explicitly removed.

// Optimized OR-Set state:
// entries: set of (element, tag) pairs
// context: vector clock tracking all tags seen

// Merge(S1, S2):
// Keep entry (e, t) from S1 if:
// t is NOT in S2.context (S2 never saw this tag), OR
// t IS in S2.entries (S2 saw it and didn't remove it)
// Symmetrically for entries from S2.
// New context = merge(S1.context, S2.context)

// This lets replicas distinguish "tag was removed" from "tag not yet seen."
// Without this optimization, the OR-Set has unbounded metadata growth.

Python OR-Set Implementation

python
import uuid

class ORSet:
    """Observed-Remove Set CRDT. Add-wins semantics."""
    def __init__(self):
        self.entries = set()  # set of (element, unique_tag) tuples

    def add(self, element):
        tag = str(uuid.uuid4())[:8]
        self.entries.add((element, tag))

    def remove(self, element):
        # Remove only the tags we can SEE right now
        observed = {(e, t) for e, t in self.entries if e == element}
        self.entries -= observed

    def lookup(self, element):
        return any(e == element for e, _ in self.entries)

    def elements(self):
        return {e for e, _ in self.entries}

    def merge(self, other):
        # Union of all (element, tag) pairs.
        # Tags removed on one side but present on the other
        # survive — that's the "add-wins" behavior.
        self.entries = self.entries | other.entries


# Demo: concurrent add and remove
a = ORSet()
b = ORSet()
a.add("milk")
b.merge(a)       # b sees "milk"
b.remove("milk")  # b removes observed tags
a.add("milk")     # a concurrently re-adds with NEW tag
a.merge(b)
b.merge(a)
print(a.lookup("milk"))  # True — concurrent add survived
print(b.lookup("milk"))  # True — converged

OR-Set: Why "Add-Wins" is the Right Default

The OR-Set's "add-wins" semantics might seem arbitrary. Why should add beat remove? Consider the alternative: "remove-wins" (if anyone removes X concurrently with someone adding X, X is removed). This is actually worse for most applications:

// Shopping cart scenario: remove-wins vs add-wins

// Setup: User A and User B share a shopping cart.
// Cart contains: {milk, eggs, bread}

// Concurrent operations:
// User A removes "milk" (they decided they don't need it)
// User B adds "milk" (they realized they DO need it)

// Remove-wins: milk is removed. User B's explicit add is ignored.
// User B explicitly tried to add milk but it's gone. Surprising!

// Add-wins: milk stays. User A's remove is "overridden."
// User A might see milk reappear, but at least no data is lost.
// User A can remove again if they still want to.

// General principle: it's better to have something you don't want
// (you can delete it again) than to NOT have something you DO want
// (it silently disappeared). Data preservation > data pruning.

This principle — prefer not losing data — is why most CRDT designs bias toward preservation. It's the same philosophy behind Git's merge conflicts: show both versions and let the human decide, rather than silently picking one.

Vector Clocks for Version Tracking

The MV-Register and OR-Set both need to track causality — "which events has this replica seen?" The standard tool is a vector clock: each replica maintains a vector of N counters, one per replica. When replica i does an operation, it increments V[i]. When it receives a message from replica j with vector V', it sets V[k] = max(V[k], V'[k]) for all k.

python
class VectorClock:
    """Tracks causality across N replicas."""
    def __init__(self, node_id, n_nodes):
        self.id = node_id
        self.clock = [0] * n_nodes

    def tick(self):
        """Local event: increment own counter."""
        self.clock[self.id] += 1

    def merge(self, other_clock):
        """Merge with received clock."""
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], other_clock[i])
        self.clock[self.id] += 1  # receiving is also an event

    def dominates(self, other):
        """Does self causally dominate other?"""
        return (all(self.clock[i] >= other.clock[i]
                    for i in range(len(self.clock)))
                and self.clock != other.clock)

    def concurrent(self, other):
        """Are self and other concurrent?"""
        return not self.dominates(other) and not other.dominates(self)
OR-Set Step-Through

Add and remove elements on different replicas. Watch unique tags track adds/removes. Concurrent add + remove = add wins.

Each add generates a unique tag. Removes only kill tags you've seen.
Interview question: Replica A adds item "X" to an OR-Set. Replica B syncs with A (now both have "X"). Concurrently, B removes "X" while A adds "X" again. After full sync, is "X" in the set?

Chapter 7: Consistency Models — The Showcase

This is the payoff chapter. Everything we've learned — event graphs, axioms, CRDTs — converges in one interactive simulation. You control a distributed system with three replicas. You perform read/write operations on any replica. And you toggle between consistency models to see exactly which read results are legal under each one.

If you can predict what each model allows before clicking "Read All" — then you've internalized the formal framework. Every consistency debate in a design review reduces to this: "which behaviors does our model allow, and can our application tolerate them?"

The insight that makes distributed systems engineering hard is not that eventual consistency exists. It's that you can't tell the difference between consistency models during normal operation. When the network is fast and healthy, linearizable and eventual consistency look identical. The difference only manifests during partitions, lag, and concurrent access. This simulation forces those edge cases so you can see what each model actually guarantees.

The core challenge of distributed systems: Weaker consistency models have MORE legal behaviors, not fewer. Under eventual consistency, almost anything goes. Under linearizability, there's exactly one legal history for any set of operations. More freedom for the system = less predictability for the programmer.
Distributed Consistency Simulator

Three replicas storing register X. Write values, read from any replica, and toggle the consistency model to see which reads are legal.

Write values on different replicas, then read to see what each consistency model allows.

How To Use This Simulator

Here's a specific experiment to try. Follow these steps exactly:

1. Set Model to Eventual
Click the model button until it says "Eventual."
2. Write X=1 on R1
Click "Write on R1". R1 now shows X=1.
3. Write X=2 on R2
Click "Write on R2". R2 now shows X=2. R1 still shows X=1.
4. Click "Read All"
Under eventual consistency, R1 returns 1, R2 returns 2, R3 returns 0. All legal!
5. Switch to Linearizable, Read Again
Now only one value is legal — the "latest" write. All replicas must agree.

The visual difference is striking. Under eventual consistency, each replica lives in its own world. Under linearizability, there's one global truth. Every level in between is a different blend of "local truth" vs "global truth."

What You Should Observe

ModelWhat You'll SeeWhy
LinearizableAll replicas return the same "latest" valueReal-time total order means one global truth
SequentialAll replicas agree, but might not reflect real-time orderTotal order exists but may reorder concurrent writes
CausalCausally-related writes ordered; concurrent writes may differ across replicasOnly causal dependencies are preserved
EventualReplicas may temporarily disagree on everything, but converge when quiescentNo ordering guarantees during active updates

The Mental Model

Think of consistency models as constraints on a jigsaw puzzle. Linearizability says there's exactly one way to assemble the pieces — the real-time order. Sequential says any assembly that respects each person's piece order is fine. Causal says pieces that were placed "because of" earlier pieces must keep that order, but unrelated pieces can go anywhere. Eventual says you can scatter pieces across the table — as long as you eventually put them all together, the final picture will be the same for everyone.

The programmer's job is to choose the weakest model that still makes the application correct. Weaker models are faster and more available. The art is knowing what anomalies your application can tolerate.

Anomaly Catalog: What Goes Wrong at Each Level

Each weakening of the consistency model introduces specific anomalies. Understanding these anomalies is how you choose the right model for your application.

AnomalyOccurs BelowWhat HappensExample
Stale readLinearizableRead returns a value that was already overwrittenYou transfer $100, check balance, see old amount
Non-monotonic readCausalA later read returns an OLDER value than an earlier readYou see a comment, refresh, it disappears, refresh again, it's back
Read-your-writes violationCausalYou write a value, then read and DON'T see your own writeYou post a message, but it doesn't show in your feed
Causal violationSequentialYou see a reply but not the original messageBob replies to Alice's comment; you see Bob's reply but not Alice's comment
Total-order violationCausalTwo replicas see concurrent writes in different ordersReplica 1 shows balance $50 then $100; Replica 2 shows $100 then $50

The Decision Framework

1. What anomalies can my app tolerate?
A like counter can tolerate stale reads. A bank transfer cannot.
2. Choose the WEAKEST model that excludes intolerable anomalies
If stale reads are fine but causal violations aren't: choose causal consistency.
3. Pick the protocol template for that model
Causal → broadcast protocol with vector clocks. Strong → global-sequence (Raft/Paxos).
4. Choose CRDTs for data types that can be eventually consistent
Not everything needs the same model. Mix strong (for transfers) and eventual (for likes).
Real systems mix models. Google Spanner uses linearizability for transactions but relaxes to eventual consistency for read-only queries that can tolerate staleness. Facebook uses strong consistency for the "like" mutation (to prevent double-likes) but eventual consistency for displaying like counts. The best architectures use the strongest model where correctness demands it and the weakest model everywhere else.

Worked Example: Choosing Consistency for a Chat App

// Feature analysis for a distributed chat application:

// 1. Message ordering within a channel:
// Anomaly risk: causal violation (see reply before original)
// Minimum model: CAUSAL consistency
// Protocol: broadcast with vector clocks

// 2. Unread message count:
// Anomaly risk: stale read (count off by 1-2 for a moment)
// Minimum model: EVENTUAL consistency
// CRDT: PN-Counter (increment on new message, decrement on read)

// 3. User "typing..." indicator:
// Anomaly risk: stale indicator (shows typing for 2 extra seconds)
// Minimum model: EVENTUAL (or even no consistency — just timeout)
// Protocol: epidemic (fire-and-forget)

// 4. Channel membership (who's in the channel):
// Anomaly risk: removed user still sees messages
// Minimum model: CAUSAL (remove must happen-before access check)
// CRDT: OR-Set for member list

// 5. Payment/subscription status:
// Anomaly risk: access after payment failure
// Minimum model: LINEARIZABLE
// Protocol: global-sequence (Raft consensus)
Interview question: A system has 3 replicas. R1 has X=1, R2 has X=2, R3 has X=0 (initial). Under which consistency model(s) is it legal for a client to read X=0 from R3 AFTER writes to R1 and R2 have completed?

Chapter 8: Protocol Templates

Models tell us WHAT guarantees we want. Protocols tell us HOW to achieve them. Burckhardt identifies three protocol families, each implementing a different point on the consistency spectrum. The mapping is clean: one family per level.

Understanding these protocols is critical for system design interviews. When an interviewer asks "how would you implement X," the answer is almost always one of these three patterns — or a hybrid. The choice determines your latency, availability, and programming complexity.

Each protocol can be understood as answering one question: when does a replica learn about another replica's operations? Immediately (global-sequence), when causally needed (broadcast), or eventually (epidemic). The answer determines everything about the system's behavior.

1. Epidemic (Gossip) Protocols — For Eventual Consistency

Each replica periodically picks a random partner and exchanges state. After merging, both have the same information. Over time, every update reaches every replica — like a rumor spreading through a crowd.

There are three variants of gossip: push (I send my state to you), pull (I request your state), and push-pull (we exchange states). Push-pull converges fastest because each interaction makes both parties up-to-date. Push converges in O(N log N) messages; push-pull converges in O(N log log N) — a significant improvement for large clusters.

The gossip interval (how often each node contacts a peer) creates a direct tradeoff between convergence speed and network bandwidth. Gossip every 100ms and you converge in seconds; gossip every 10 seconds and you converge in minutes. Most systems use 1-second gossip intervals, giving convergence in 5-20 seconds for clusters of 100+ nodes.

1. Local Update
Replica applies operation to its local state immediately. No coordination.
2. Anti-Entropy
Periodically, pick a random replica and send your state (or a delta).
3. Merge
Receiving replica merges incoming state with its own using the CRDT merge function.
↻ repeat
// Convergence speed of gossip:
// In each round, each node contacts one random node.
// After k rounds, the probability that a node hasn't received an update:

P(not received after k rounds) ≈ (1 - 1/N)k

// For N=100 nodes, after k=500 rounds (5 full cycles):
P(missed) ≈ (99/100)500 ≈ 0.0066 = 0.66%

// After k=1000 rounds: P(missed) ≈ 0.004%
// Gossip converges in O(N log N) rounds with high probability.
// This is the same as epidemic spreading in biology.

2. Broadcast Protocols — For Causal Consistency

Tag each operation with a vector clock. Deliver operations in causal order: an operation is only applied when all operations it depends on have been applied. This guarantees CVIS and TVIS axioms.

The key mechanism is the delivery buffer. When a message arrives whose dependencies aren't met yet (the sender saw events that the receiver hasn't seen), the message waits in a buffer until the missing events arrive. This buffer is what makes causal broadcast more expensive than epidemic — messages can be delayed, and the buffer can grow during bursts of activity or slow networks.

1. Stamp
Before sending, attach the sender's current vector clock to the operation.
2. Broadcast
Send the (operation, vector-clock) to all replicas.
3. Buffer
Receiving replica holds the operation in a buffer until all dependencies (earlier vector clock entries) are satisfied.
4. Deliver
When dependencies are met, apply the operation and advance the local vector clock.
// Causal delivery example with vector clocks:
// 3 replicas: R1, R2, R3. Clocks = [r1_count, r2_count, r3_count]

// R1 writes X=1, clock [1,0,0]
// R2 receives, merges clock → [1,1,0] (tick its own), writes Y=2 clock [1,2,0]
// R3 receives Y=2 with clock [1,2,0]
// R3's clock is [0,0,0]. Dependency: r1≥1, r2≥2.
// R3 hasn't seen R1's write yet! Buffer Y=2 until R1's write arrives.
// R3 receives X=1 with clock [1,0,0]. Apply. Clock → [1,0,1].
// Now check buffer: Y=2 needs [1,2,0]. r1=1 ✓, r2=0 < 2 ✗.
// Still waiting. (R2's message was split - this is the intermediate op.)
// In practice, Y=2 has clock [1,2,0] meaning R2 had seen R1's write.
// R3 needs r2_count ≥ 1 (one prior R2 event, which was the merge).
// Once R3's r2_count catches up, deliver Y=2. Causal order preserved.

3. Global-Sequence Protocols — For Strong Consistency

A single designated node (or a consensus protocol like Paxos/Raft) assigns a global sequence number to every operation. All replicas apply operations in this sequence. This gives linearizability (with a leader) or sequential consistency (with consensus).

The fundamental cost of global-sequence protocols is the consensus round. Every write requires a majority of replicas to acknowledge before the operation is committed. This means the write latency is dominated by the slowest node in the majority — typically the one farthest away in terms of network distance. For a 3-node cluster in a single datacenter, this might be 5ms. For a 5-node cluster spanning continents, this could be 200ms.

This is why global-sequence protocols are impractical for latency-sensitive operations across geographic regions. Spanner works around this with GPS-synchronized atomic clocks (TrueTime), but this requires specialized hardware that most organizations can't afford.

1. Submit
Client sends operation to the sequencer (leader / consensus group).
2. Sequence
Sequencer assigns a global monotonic sequence number. In Raft: this is the log index.
3. Replicate
The sequenced operation is sent to all replicas in order.
4. Apply
Each replica applies operations strictly in sequence order. All replicas are identical at each sequence number.
ProtocolConsistencyLatencyAvailabilityExample
EpidemicEventualLocal writes (~1ms)All replicas serve reads/writesRiak, Cassandra
BroadcastCausalLocal + buffer delayAll serve writes; reads may waitCOPS, Eiger
Global-SeqLinearizableConsensus round (~10-100ms)Majority must be upSpanner, CockroachDB

Gossip Protocol: The Mathematics of Rumor Spreading

Epidemic protocols borrow directly from mathematical epidemiology. The spread of an update through a gossip network follows the same differential equation as the spread of an infectious disease through a population. This is not a metaphor — it's the same math.

// SIR epidemic model applied to gossip:
// S(t) = susceptible nodes (haven't received update)
// I(t) = infected nodes (have the update, actively spreading)
// R(t) = removed nodes (have the update, stopped spreading)

dS/dt = -β · S · I / N
dI/dt = β · S · I / N - γ · I

// Where β = contact rate (gossip frequency), γ = recovery rate
// For a gossip protocol with N=1000 nodes, β=1 (one contact/round):
// After 10 rounds: ~632 nodes have the update (1 - 1/e)
// After 20 rounds: ~998 nodes have the update
// After 30 rounds: ~999.95 nodes (with high probability, ALL)
// Total messages: ~N·ln(N) ≈ 6900 for N=1000
Gossip is surprisingly efficient. To reliably deliver an update to all N nodes, gossip requires O(N log N) total messages — the same order as sorting. Each node only contacts O(log N) other nodes. Compare this to a broadcast tree (O(N) messages but single point of failure) or a full broadcast (O(N²) messages). Gossip gives reliability through redundancy: multiple paths, multiple contacts, probabilistic coverage.

Vector Clocks in Practice: The Space Problem

Causal broadcast uses vector clocks with one entry per replica. With N replicas, each message carries an N-dimensional vector. For N=5, this is trivial. For N=10,000 (a large Cassandra cluster), this is 10,000 integers per message — a serious overhead.

// Vector clock space complexity:
// N replicas → N integers per message
// Each integer: 8 bytes (int64)
// Per message overhead: N × 8 bytes

// N = 5: 40 bytes (trivial)
// N = 100: 800 bytes (acceptable)
// N = 10000: 80 KB per message (unacceptable!)

// Solutions:
// 1. Dotted version vectors: track only active entries
// 2. Hash-based clocks: compress vector to fixed size
// 3. Server-side clocks: only N servers track time, clients don't
// Dynamo (Amazon) uses vector clocks with pruning: if vector > K entries,
// truncate oldest entries. Loses some causal info but bounds space.

Global-Sequence: The Raft Example

Let's trace a write through a Raft-based global-sequence protocol. Raft is the most widely-used consensus algorithm (etcd, CockroachDB, TiKV). The key: one leader sequences all operations; followers replicate in order.

// Raft write path (5 nodes: Leader L, Followers F1-F4)

// t=0: Client sends write(X, 42) to Leader L
// t=1: L appends to its log: [index=7, term=3, write(X, 42)]
// t=2: L sends AppendEntries RPC to F1, F2, F3, F4

// t=5: F1 responds: "index 7 appended"
// t=6: F2 responds: "index 7 appended"
// Majority! (L + F1 + F2 = 3 of 5)

// t=7: L commits index 7. Applies write(X, 42) to state machine.
// t=8: L responds to client: "OK, X=42 committed."

// t=10: F3 finally responds (slow but not failed)
// t=15: F4 responds (very slow)
// F3 and F4 apply index 7 on their next heartbeat.

// Total latency: t=0 to t=8 = 8 time units
// Dominated by: leader write + fastest majority response
// This is why Raft is typically 5-50ms for same-region deploys.

// If L fails after committing but before F3/F4 catch up:
// F1 or F2 wins election (they have index 7).
// No data loss. Index 7 persists on majority.
// F3/F4 catch up from new leader.

Latency Comparison: Real Numbers

ProtocolSame-RegionCross-RegionBottleneck
Epidemic (Riak)~1-2ms (local disk)~1-2ms (local disk; propagation in background)Local I/O only
Broadcast (COPS)~2-5ms (local + buffer check)~5-50ms (wait for causal deps)Causal dependency delay
Global-Seq (Raft)~5-20ms (consensus round)~100-300ms (cross-region RTT)Majority acknowledgment RTT
Global-Seq (Spanner)~5-10ms (TrueTime wait)~10-50ms (TrueTime wait + RTT)Clock uncertainty + RTT
The latency-consistency tradeoff is linear. Moving from eventual to causal roughly doubles latency (need to check dependencies). Moving from causal to linearizable roughly 10x's latency (need global coordination). This is why most systems use the weakest consistency that's correct for each operation.

Python: Complete Gossip Protocol Simulator

python
import random

class GossipNode:
    """A single node in a gossip network with a G-Counter."""
    def __init__(self, node_id, n_nodes):
        self.id = node_id
        self.counter = GCounter(node_id, n_nodes)
        self.inbox = []  # messages waiting to be processed

    def increment(self):
        self.counter.increment()

    def send_state(self):
        """Return current state to send to a random peer."""
        return self.counter.vector[:]

    def receive_state(self, remote_vector):
        """Merge received state with local state."""
        for i in range(len(self.counter.vector)):
            self.counter.vector[i] = max(
                self.counter.vector[i],
                remote_vector[i]
            )


def simulate_gossip(n_nodes=10, n_increments=20, n_rounds=50):
    """Simulate gossip convergence."""
    nodes = [GossipNode(i, n_nodes) for i in range(n_nodes)]

    # Phase 1: random increments on random nodes
    for _ in range(n_increments):
        node = random.choice(nodes)
        node.increment()

    # Phase 2: gossip rounds
    for round_num in range(n_rounds):
        # Each node contacts one random peer
        for node in nodes:
            peer = random.choice(nodes)
            if peer.id != node.id:
                state = node.send_state()
                peer.receive_state(state)

        # Check convergence
        values = [n.counter.value() for n in nodes]
        if len(set(values)) == 1:
            print(f"Converged in round {round_num}! Value: {values[0]}")
            return round_num

    print(f"Did not converge in {n_rounds} rounds.")
    return n_rounds

# Run simulation
rounds = [simulate_gossip() for _ in range(100)]
print(f"Average rounds to converge: {sum(rounds)/len(rounds):.1f}")
# Typically: 3-5 rounds for 10 nodes. O(log N) behavior.

Python: Causal Broadcast with Vector Clocks

python
class CausalBroadcastNode:
    """Node implementing causal broadcast via vector clocks."""
    def __init__(self, node_id, n_nodes):
        self.id = node_id
        self.n = n_nodes
        self.clock = [0] * n_nodes
        self.state = {}     # key-value store
        self.buffer = []    # undelivered messages
        self.delivered = []  # delivered operations log

    def write(self, key, val):
        self.clock[self.id] += 1
        self.state[key] = val
        # Return message to broadcast
        return {'op': (key, val),
                'sender': self.id,
                'vc': self.clock[:]}

    def receive(self, msg):
        """Buffer message, deliver when deps satisfied."""
        self.buffer.append(msg)
        self._try_deliver()

    def _can_deliver(self, msg):
        """Check if all causal dependencies are met."""
        vc = msg['vc']
        sender = msg['sender']
        # Sender's count must be exactly our count + 1
        if vc[sender] != self.clock[sender] + 1:
            return False
        # All other entries must be ≤ our clock
        for k in range(self.n):
            if k != sender and vc[k] > self.clock[k]:
                return False
        return True

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for msg in self.buffer[:]:
                if self._can_deliver(msg):
                    self.buffer.remove(msg)
                    key, val = msg['op']
                    self.state[key] = val
                    self.clock[msg['sender']] += 1
                    self.delivered.append(msg)
                    progress = True
Protocol Comparison: Same Operations, Three Protocols

Watch the same sequence of operations executed under epidemic, broadcast, and global-sequence protocols. Note the latency and ordering differences.

Click "Next Operation" to step through writes and syncs across all three protocol types.
Interview question: You're designing a social media feed. Posts have a "like count." Users across the globe like posts concurrently. Which protocol template should you use for the like counter and why?

Chapter 9: The CAP Theorem (Formally)

You've heard "pick two of three: Consistency, Availability, Partition tolerance." But what does this ACTUALLY mean? Most explanations are hand-wavy. Burckhardt gives the precise formal statement and proof. And most people who cite CAP get it wrong — either they think it means "choose any two" (it doesn't — P is not optional), or they think it applies during normal operation (it doesn't — CAP only matters during partitions), or they think it's about eventual consistency (it's not — CAP's "C" means sequential or stronger).

Getting CAP right in an interview is a differentiator. Let's build the correct understanding from the formal definitions.

Definitions (Precise)

Consistency (C): Here means specifically sequential consistency (or stronger). Every read returns a value consistent with some total order of all operations that respects each client's order. Note: this is NOT eventual consistency. CAP's "C" is the strong kind.

Availability (A): Every request to a non-failed node receives a response. Not "eventually" — within a bounded time. A system that blocks forever waiting for a partitioned node is NOT available.

Partition tolerance (P): The system continues to operate despite arbitrary message loss between nodes. Since network partitions are a fact of life (not a choice), every real distributed system must be partition-tolerant. "P" is not optional.

The real tradeoff is C vs A. Since partitions WILL happen (cables get cut, switches fail, datacenters lose connectivity), you must choose: during a partition, do you sacrifice consistency (allow stale reads) or availability (reject requests until the partition heals)? This is the ONLY real choice. "Pick 2 of 3" is misleading — it's really "during a partition, pick C or A."

The Formal Proof

The proof is by contradiction, and it's beautifully simple. Assume a system that is both sequentially consistent AND available during a partition. Show that this leads to a contradiction.

// Setup: Two nodes, N1 and N2, with a partition between them.
// No messages can pass from N1 to N2 or vice versa.

// Step 1: Client A writes X=1 to N1.
// N1 must respond (availability).
// N1 cannot inform N2 (partition).
// N1 says: "OK, X=1 stored."

// Step 2: Client B reads X from N2.
// N2 must respond (availability).
// N2 doesn't know about X=1 (partition).
// N2 returns: X=0 (initial value).

// Step 3: Check sequential consistency.
// In the real-time order: write(X,1) happened BEFORE read(X).
// Sequential consistency requires some total order consistent with
// per-client order. If write is before read in that order,
// the read must return 1 (or a later value). But it returned 0.
// If read is before write in the total order, that contradicts
// real-time (the write actually happened first).
// CONTRADICTION. ∎

// Conclusion: You cannot have sequential consistency + availability
// when a partition exists. One must give.

What Systems Actually Choose

ChoiceDuring PartitionSystems
CP (Consistent)Reject writes/reads on minority side of partition. Majority side continues.ZooKeeper, etcd, Spanner, HBase
AP (Available)All nodes accept reads/writes. May return stale data. Reconcile after partition heals.Cassandra, DynamoDB, Riak, CouchDB
CAP is about partitions, not about normal operation. When the network is healthy, most systems provide both strong consistency and high availability. CAP only forces the choice DURING a partition. Google Spanner avoids CAP's bite by making partitions extremely rare (redundant networking, GPS-synchronized clocks) — it's technically CP but practically almost never sacrifices availability because partitions almost never happen.

Beyond CAP: The PACELC Model

CAP only talks about partition scenarios. But even during normal operation, there's a tradeoff between latency and consistency. The PACELC model extends CAP:

if Partition → choose Availability or Consistency;
else → choose Latency or Consistency
SystemDuring PartitionNormal OperationPACELC
CassandraAP (available)EL (low latency)PA/EL
SpannerCP (consistent)EC (consistent)PC/EC
DynamoDBAP (available)EL (low latency)PA/EL
MongoDBCP (consistent)EC (consistent)PC/EC

How Google Spanner "Cheats" CAP

Google Spanner claims to be a globally-distributed, strongly-consistent, highly-available database. Doesn't this violate CAP? Technically no. Spanner is a CP system — during a partition, it sacrifices availability on the minority side. But Google has invested billions in network infrastructure to make partitions astronomically rare.

// Spanner's partition-avoidance strategy:

// 1. Redundant network paths: every datacenter has 3+ independent links
// 2. GPS + atomic clocks: TrueTime API gives bounded clock uncertainty
// (usually 1-7ms, worst case ~100ms during leap seconds)
// 3. If TrueTime uncertainty is U, Spanner WAITS U milliseconds before
// committing — this guarantees real-time ordering without coordination!
// 4. Reported availability: 99.9999% (5 minutes downtime per year)

// The insight: CAP says you CAN'T have C+A during partitions.
// Spanner says: "We'll make sure partitions almost never happen."
// This is a valid engineering response — just an expensive one.

CAP is a Theorem, Not a Framework

A common mistake: treating CAP as a design framework ("I'll build an AP system"). CAP is a negative result — it tells you what's impossible, not what to build. The right way to use CAP:

WrongRight
"Let's build an AP system""During partitions, which operations can tolerate stale reads?"
"We chose consistency over availability""For payment operations, we block during partitions. For feed reads, we serve stale data."
"CAP means pick 2 of 3""CAP means during a partition, each operation must choose C or A"
The per-operation view. Real systems don't make a single global CAP choice. Each operation can make its own tradeoff. A banking system might use linearizable writes for transfers (CP) but eventually consistent reads for balance display (AP). DynamoDB lets you choose per-request: strongly consistent reads (CP) or eventually consistent reads (AP, 2x cheaper). The granularity of the CAP choice is per-operation, not per-system.
Interactive CAP Proof

Two nodes with a partition. Write to one, read from the other. See why strong consistency + availability is impossible.

Enable the partition, then write and read to see the CAP impossibility in action.
Interview question: Your system uses Raft consensus (CP). During a network partition, the leader is on the minority side (1 of 3 nodes). What happens?

Chapter 10: Correctness Proofs

Saying "my CRDT converges" is easy. Proving it requires a simulation relation between the concrete protocol execution and the abstract consistency model. This is how Burckhardt verifies that a protocol actually implements the consistency guarantees it claims.

Why should you care about correctness proofs? Because distributed systems bugs are the hardest to find. They only manifest under specific timing conditions, specific failure patterns, specific message orderings. You can test for months and miss a bug that only appears when a specific replica is 200ms behind, during a garbage collection pause, while a network switch is rebooting. Proofs catch these bugs at design time, before any code is written.

The industry learned this the hard way. Amazon's DynamoDB team discovered that their original hash ring implementation had a subtle consistency violation that only occurred during simultaneous node joins and departures. MongoDB's initial causal consistency implementation had a gap in the vector clock protocol that allowed causal ordering violations during primary elections. Both bugs were found by formal analysis, not testing.

The Proof Structure

A correctness proof has three parts:

1. Define the abstract specification. What consistency model are we targeting? For a G-Counter with epidemic protocol, the target is eventual consistency: the EVIS axiom (all events eventually visible).

2. Define the simulation relation. Map each concrete state (vector values on each replica, messages in flight) to an abstract execution (events, visibility, arbitration). The simulation relation R(concrete, abstract) says: "this concrete state corresponds to this abstract execution."

3. Prove the simulation. Show that for every step the concrete protocol takes (local update, send message, receive message, merge), the corresponding abstract execution remains legal — i.e., the axioms still hold.

Worked Proof: G-Counter + Epidemic Protocol = Eventual Consistency

// CLAIM: The epidemic G-Counter protocol implements eventual consistency.

// CONCRETE STATE:
// Each replica i has vector V_i[0..N-1].
// Messages in transit: set of vectors being sent between replicas.

// ABSTRACT EXECUTION:
// Events: each increment(i) is an event e.
// Visibility: e_a vis e_b iff V_b[a.replica] ≥ a.sequence_number.
// (i.e., replica b has merged a state that includes a's increment.)

// SIMULATION RELATION R:
// R(concrete, abstract) holds iff for every pair of events (a, b):
// a vis b in abstract ⟺ V_b[a.node] ≥ a.count in concrete.

// PROOF STEPS:

// Step 1: Local increment on replica i.
// Concrete: V_i[i] += 1.
// Abstract: New event e. e is visible to itself (V_i[i] ≥ new count).
// R still holds. ✓

// Step 2: Replica i sends V_i to replica j.
// Concrete: Message (V_i) in transit to j.
// Abstract: No change yet (visibility hasn't changed).
// R still holds. ✓

// Step 3: Replica j receives V_i and merges.
// Concrete: V_j[k] = max(V_j[k], V_i[k]) for all k.
// Abstract: All events visible to i become visible to j.
// V_j[k] ≥ V_i[k] for all k after merge, so if event e was
// visible to i (V_i[e.node] ≥ e.count), now visible to j too.
// R still holds. ✓

// Step 4: EVIS axiom — eventual visibility.
// The epidemic protocol guarantees every message is eventually delivered
// (fair communication assumption). After delivery, merge ensures
// all events become visible to all replicas.
// EVIS holds. ✓

// Step 5: Convergence.
// After all messages delivered, all replicas have V[k] = final_count[k].
// value() = Σ V[k] is identical on all replicas.
// Convergence holds. ✓

// CONCLUSION: The epidemic G-Counter protocol is a correct implementation
// of an eventually consistent replicated counter. ∎

Why Proofs Matter in Practice

You might think: "I'll never write a formal proof at work." True. But understanding the proof structure helps you spot bugs. The most common distributed systems bugs violate one of the proof steps:

Proof Step ViolatedReal Bug
Merge not commutativeCassandra's counter CRDT had a non-commutative merge path that caused divergence under specific message orderings (CASSANDRA-6504)
Messages not eventually deliveredRiak's handoff mechanism could stall during high load, violating the fair delivery assumption → permanent divergence
Simulation relation broken by edge caseCockroachDB's causal ordering had a gap where replicas could deliver operations out of causal order during leader transitions
Think of the proof as a checklist. When implementing a distributed data structure, walk through each step: (1) Is my merge function commutative, associative, idempotent? (2) Does my communication layer guarantee eventual delivery? (3) Does every local operation preserve the simulation relation? If you can answer yes to all three, your system will converge. If you can't, you have a bug.

The Engineer's Verification Checklist

You don't need to write formal proofs. But you DO need to verify these properties. Here's how to test each one in practice:

python
def test_crdt_properties(crdt_class, n_trials=1000):
    """Property-based test for CRDT correctness."""
    import random

    for _ in range(n_trials):
        # Create two replicas with random operations
        a = crdt_class(0, 3)
        b = crdt_class(1, 3)

        for _ in range(random.randint(1, 10)):
            a.increment()
        for _ in range(random.randint(1, 10)):
            b.increment()

        # Test commutativity: merge(a,b) == merge(b,a)
        ab = crdt_class(0, 3); ab.vector = a.vector[:]
        ba = crdt_class(1, 3); ba.vector = b.vector[:]
        ab.merge(b)
        ba.merge(a)
        assert ab.vector == ba.vector, "Commutativity failed!"

        # Test idempotency: merge(a,a) == a
        aa = crdt_class(0, 3); aa.vector = a.vector[:]
        aa.merge(a)
        assert aa.vector == a.vector, "Idempotency failed!"

    print("All properties verified!")

test_crdt_properties(GCounter)  # Should pass

Proof Sketch: Causal Broadcast Protocol Correctness

Let's also outline the proof for a causal broadcast protocol — this is harder than the epidemic proof because we need to verify three axioms (EVIS, CVIS, TVIS) instead of just one.

// CLAIM: Vector-clock broadcast implements causal consistency.

// CONCRETE STATE:
// Each replica i has:
// - Local state S_i (applied operations)
// - Vector clock VC_i[0..N-1]
// - Buffer B_i of undelivered messages

// PROTOCOL:
// On local operation at replica i:
// VC_i[i]++; broadcast (op, VC_i) to all replicas
// On receive (op, VC_msg) at replica j:
// Add to buffer. Deliver when:
// VC_msg[k] ≤ VC_j[k] for all k ≠ sender, AND
// VC_msg[sender] = VC_j[sender] + 1
// On deliver: VC_j[sender]++; apply op to S_j

// PROOF OF EVIS (eventual visibility):
// Assume fair delivery (all messages eventually arrive).
// Every message in B_j will eventually have its dependencies
// satisfied (because dependencies are delivered first, by induction).
// So every operation eventually visible to all replicas. ✓

// PROOF OF CVIS (consistent visibility):
// If a vis b (b sees a's effect) and b so c (same client),
// then c is on the same replica as b (or later).
// c's VC includes all of b's VC entries (monotonic clock).
// So c sees everything b saw, including a. a vis c. ✓

// PROOF OF TVIS (transitive visibility):
// If a vis b and b vis c, we need a vis c.
// c delivered b, meaning VC_c[b.sender] ≥ b.VC[b.sender].
// b saw a, so b.VC[a.sender] ≥ a.VC[a.sender].
// c's delivery condition: VC_msg[k] ≤ VC_c[k] for all k.
// Since c delivered b, and b's VC includes a's operation,
// c must have already delivered a (a is a dependency of b).
// So a vis c. ✓

// CONCLUSION: Vector-clock broadcast implements causal consistency. ∎
The buffer is the key mechanism. Without buffering, messages arrive out of order and causal violations occur. The delivery condition ensures that a message is only applied when all its causal dependencies have been applied first. This is why causal consistency is more expensive than eventual — every receive requires a dependency check against the local vector clock.

Common Implementation Bugs and How the Proof Catches Them

// Bug 1: Using addition instead of max for merge
// merge([3,0], [0,2]) = [3,2] ✓
// But: merge([3,0], [3,0]) = [6,0] ✗ (idempotency broken!)
// Result: counter inflates on every gossip round

// Bug 2: OR-Set using element as tag instead of unique ID
// add("X") creates (X, "X"). remove("X") removes (X, "X").
// Concurrent add("X") on another replica creates... (X, "X")!
// Same tag! The remove kills the concurrent add. Add-wins broken.

// Bug 3: Vector clock not incremented on send
// A sends message but doesn't tick its clock.
// B receives and can't distinguish this message from the previous one.
// Causal ordering violated: B might deliver messages out of order.
Convergence Proof Visualization

Watch 5 replicas execute increments and gossip merges. Track the state vectors converging. The proof guarantee: all vectors reach the same final state.

Increment on random replicas, then run gossip rounds to watch convergence.
Interview question: You implement a custom CRDT and find that two replicas with the same set of operations applied in different orders end up with different states. Which CRDT property is broken?

Chapter 11: Connections & Cheat Sheet

Mapping to DDIA

If you've studied the DDIA lessons in this series, here's how Burckhardt's formal framework maps to Kleppmann's practical treatment. Every concept in DDIA has a precise formal counterpart here.

Burckhardt ConceptDDIA ChapterConnection
Consistency modelsCh 6 (Replication)DDIA's replication lag anomalies = violations of specific Burckhardt axioms
CRDTsCh 7 (Concurrent writes)DDIA's "concurrent writes" section motivates CRDTs as the solution
Global-sequence protocolsCh 10 (Consensus)Paxos/Raft = specific implementations of Burckhardt's global-sequence template
CAP theoremCh 9 (Consistency & Consensus)DDIA discusses CAP informally; Burckhardt formalizes it
Event graphsCh 5 (Encoding flows)Lamport's happens-before is the backbone of event graphs

Mapping to Algorithms

Burckhardt's framework doesn't exist in isolation — it builds on decades of theoretical computer science. Here are the algorithmic foundations that underpin each concept.

Burckhardt ConceptAlgorithmic Foundation
Event graphsDirected acyclic graphs (CLRS Ch 22)
Happens-before / transitive closureGraph reachability, topological sort
Vector clocksPartial orders, lattice theory
CRDT merge = join-semilatticeOrder theory, abstract algebra
Gossip convergenceRandomized algorithms, epidemic models

Complete CRDT Cheat Sheet

Pin this table. In a system design interview, you'll reach for it constantly. Each CRDT solves a specific class of conflict — choosing the right one is half the design.

CRDTTypeMergeSemanticsUse Case
G-CounterStatemax per componentGrow-only countPage views, likes
PN-CounterStateTwo G-CountersInc + DecStock levels, scores
LWW-RegisterStateHighest timestampLast write winsSensor data, cache
MV-RegisterStateKeep concurrent valuesMulti-value (app resolves)Collaborative edit
G-SetStateUnionAdd-only setSeen message IDs
2P-SetStateTwo G-SetsAdd + permanent removeTombstone tracking
OR-SetStateUnion of tagged pairsAdd wins over concurrent removeShopping cart, friends list

Consistency Models Cheat Sheet

The definitive reference. Each model is a precise set of axioms — not a marketing term, not a vague promise, but a mathematical specification with provable properties.

ModelAxiomsKey PropertyProtocol
LinearizableEVIS+CVIS+TVIS+CAR+RTOReal-time total orderGlobal-sequence
SequentialEVIS+CVIS+TVIS+CARTotal order (no real-time)Global-sequence (relaxed)
CausalEVIS+CVIS+TVISCausal order preservedBroadcast + vector clocks
EventualEVISConverge when quiescentEpidemic (gossip)
QuiescentEVIS (weak)Converge during quietLazy propagation

When to Use What: The Interview Decision Tree

// "How do you choose a consistency model?" — Common interview question

// Step 1: Can the data be wrong temporarily?
// NO → Linearizable (bank transfers, inventory reservation)
// YES → Continue...

// Step 2: Do users need to see their own writes?
// YES → At minimum: read-your-writes (part of causal)
// NO → Continue...

// Step 3: Can users see effects before causes?
// NO → Causal consistency
// YES → Eventual is fine

// Step 4: Is the data type conflict-free?
// YES → Use a CRDT (counter, set, register)
// NO → Use application-level conflict resolution

// Step 5: How many replicas?
// ≤5 → Vector clocks are cheap, use causal broadcast
// 5-100 → Gossip with CRDTs
// 100+ → Gossip with delta-CRDTs (full state sync too expensive)

Burckhardt's Lasting Impact

This monograph, published in 2014, laid the formal groundwork that the distributed systems community has built on for a decade. Before Burckhardt, "eventual consistency" was an informal promise. After, it became a precise mathematical property with provable implementations. The CRDT taxonomy he formalized (state-based, operation-based, delta-based) became the standard vocabulary. His axiom-based consistency hierarchy influenced the design of systems from Azure Cosmos DB to Redis CRDTs.

The key meta-insight: distributed consistency is not one problem — it's a SPECTRUM of problems, each solvable with known techniques. The engineer's job is to identify where on the spectrum each piece of data lives, and apply the right tool. Most real systems don't live at a single point on the spectrum — they use different consistency levels for different data, chosen based on the correctness requirements and performance constraints of each piece.

When you hear "we use eventual consistency" in an interview or design review, the right follow-up is: "for which data, and what conflict resolution strategy?" The answer reveals whether the speaker truly understands the system they're building.

Related Lessons

Interview Masterclass: The Five Most Common Distributed Consistency Questions

These are the questions that separate senior engineers from juniors in system design interviews. Each one maps directly to a concept from this lesson.

// Q1: "How would you design a distributed counter for likes?"
// Answer: G-Counter CRDT with gossip protocol.
// Each datacenter maintains its own count. Merge = component-wise max.
// Value = sum. Converges without coordination.
// Show: define the state, update, merge, value operations.
// Prove: merge is commutative, associative, idempotent.

// Q2: "What happens if two users edit the same document simultaneously?"
// Answer: Depends on the CRDT chosen.
// LWW-Register: last writer wins, one edit lost silently.
// MV-Register: both edits preserved, app shows conflict.
// RGA/YATA: character-level CRDT, both edits merge automatically.
// Show: trace through a concurrent edit with version vectors.

// Q3: "Explain the CAP theorem and how it affects your design."
// Answer: NOT "pick 2 of 3." Instead:
// - Partitions are a fact, so the real choice is C vs A during partitions.
// - The choice is PER-OPERATION, not per-system.
// - Give concrete example: payment = CP, feed display = AP.
// - Mention PACELC for the non-partition tradeoff (latency vs consistency).

// Q4: "How do you detect and resolve conflicts in an eventually consistent system?"
// Answer: Depends on the data type.
// Counters: no conflicts (G-Counter merge is deterministic).
// Registers: LWW (automatic, lossy) or MV (manual resolution).
// Sets: OR-Set (add-wins automatic resolution).
// General: version vectors detect concurrency; application resolves.

// Q5: "How do you PROVE your system is correct?"
// Answer: Define the consistency model as axioms.
// Build a simulation relation between concrete state and abstract execution.
// Prove that every protocol step preserves the axioms.
// In practice: property-based testing (commutativity, idempotency, convergence).
// In production: Jepsen testing under network partition scenarios.

The Cost of Consistency: Real Latency Numbers

// Measured latencies from production systems (p50):

// EVENTUAL CONSISTENCY (local write):
// Cassandra single-DC write: 0.5 - 2 ms
// DynamoDB eventual read: 1 - 5 ms
// Redis CRDB local write: 0.1 - 1 ms

// CAUSAL CONSISTENCY:
// MongoDB causal session read: 2 - 10 ms
// (adds vector clock check overhead)

// STRONG CONSISTENCY (same region):
// CockroachDB write: 5 - 20 ms
// etcd write (Raft consensus): 5 - 15 ms
// DynamoDB strongly consistent: 2 - 10 ms (2x eventual cost)

// STRONG CONSISTENCY (cross-region):
// Spanner cross-continent write: 50 - 200 ms
// CockroachDB cross-region: 100 - 300 ms

// The gap: 0.5ms (eventual, local) vs 200ms (linearizable, global)
// That's a 400x difference. This is why you DON'T use linearizable
// for everything — only where correctness requires it.

For replication fundamentals: DDIA Chapter 6: Database Replication

Recommended Reading

If this lesson sparked your interest, here are the primary sources. Each paper/book is foundational — they're the papers that everyone in distributed systems has read.

SourceTopicWhy Read It
Burckhardt, MSR 2014This monographThe formal foundations — everything in this lesson comes from here
Shapiro et al., INRIA 2011"A Comprehensive Study of CRDTs"The definitive CRDT catalog with 12+ data types and proofs
Lamport, 1978"Time, Clocks, and Ordering"Happens-before relation, logical clocks — the foundation of event graphs
Brewer, 2000 / Gilbert & Lynch, 2002CAP TheoremOriginal conjecture (Brewer) and formal proof (Gilbert & Lynch)
Kleppmann, DDIA Ch 5-9Distributed dataPractical perspective on everything formalized here

Real-World CRDT Deployments

SystemCRDTs UsedWhy
Redis (CRDB)Counter, Set, Register, Sorted SetActive-active geo-replication across datacenters
RiakCounter, Set, Map, Register, FlagLeaderless key-value store; CRDTs replace read-repair
Apple NotesRGA (Replicated Growable Array)Offline-first editing on iOS/macOS; sync on reconnect
FigmaCustom operation-based CRDTsReal-time collaborative design; conflict-free shape edits
SoundCloudRiak CRDTs for countersPlay counts and like counts across global CDN
Bet365Riak CRDTs for session dataActive-active across 3 datacenters; must never lose a bet
League of LegendsRiak CRDTs for player state75M+ players; game state must survive datacenter failures

The Frontier: New CRDT Research (2020-2024)

The CRDT field has exploded since Burckhardt's 2014 monograph. Key developments:

DevelopmentYearSignificance
Automerge (Kleppmann et al.)2019-2024General-purpose CRDT library for JSON documents. Powers local-first apps.
Yjs (Yata CRDT)2019High-performance CRDT for text editing. Used by many collaborative editors.
Pure operation-based CRDTs2017Eliminate need for reliable broadcast; operations carry full causal context.
Server-side CRDTs (Redis CRDB)2020CRDTs deployed in database engines, not just application code.
Local-first software (Ink & Switch)2019Movement toward offline-capable apps using CRDTs as the sync primitive.

CRDT Limitations: When NOT to Use Them

CRDTs are powerful, but they are not a universal solution. Understanding when they break down is as important as knowing how they work.

LimitationWhy It HappensAlternative
Metadata growthOR-Set tags, version vectors, tombstones accumulate over timePeriodic garbage collection (requires coordination — ironic)
No invariantsCRDTs guarantee convergence but NOT that invariants hold during convergence. A "stock count" PN-Counter can go negative.Application-level validation after merge
Semantics mismatchSome operations don't commute naturally. "Set balance to $100" from two replicas isn't meaningful as a CRDT.Transform to commutative operations: "add $10" instead of "set $100"
Complex dataLists, trees, and graphs have order-dependent operations that are hard to make conflict-freeSpecialized CRDTs (RGA for lists, TreeDoc for trees) — complex and expensive
Privacy/access controlCRDTs assume all replicas are trusted. In peer-to-peer settings, untrusted nodes could inject bad operations.Authenticated CRDTs (research frontier)
The invariant problem is fundamental. Consider a bank account that must never go below $0. With a PN-Counter CRDT, two replicas could each approve a $60 withdrawal from a $100 account (each sees $100 locally). After merge: -$20. The CRDT converged correctly (both withdrawals recorded), but the application invariant (non-negative balance) is violated. For operations that require invariant checking, you NEED coordination — there is no conflict-free way to enforce global invariants.

The rule of thumb: CRDTs are for data where "all concurrent operations should be applied" is the right semantics. Counters (count everything), sets (add everything), text (keep all edits). They are NOT for data where operations must be mutually exclusive (seat booking, account overdraft protection).

The local-first movement is the most exciting application of CRDTs. Instead of cloud-first (data lives on the server, local is a cache), local-first means your data lives on YOUR device. CRDTs enable peer-to-peer sync without any server. Your document works offline, syncs when online, and never has merge conflicts. Automerge and Yjs are making this practical. It's the most significant shift in application architecture since client-server.
Final thought: Eventual consistency is not a compromise — it's a precise engineering choice backed by rigorous mathematics. The formal framework in this monograph transforms "fuzzy" distributed systems intuition into provable guarantees. When you choose a consistency model, you're not guessing — you're selecting a specific set of axioms with known properties, implementable via known protocols, provably correct via simulation relations. That's not hand-waving. That's engineering.