CRDTs, consistency models, CAP theorem, protocol correctness — the formal foundations of distributed data.
Two travelers open the same airline app at the same moment. Flight 714 to Lisbon has exactly one seat left — 22A, window, just behind the wing. One traveler is in Chicago, the other in Berlin. Both tap "Book Now" at 14:03:27 UTC. Both get a confirmation screen. Both believe they own seat 22A.
Now what?
The airline's database has replicas in North America and Europe. The Chicago request hit the US replica. The Berlin request hit the EU replica. Between these datacenters, the network round-trip is ~85 milliseconds. That 85ms gap is where distributed systems break. During those milliseconds, neither replica knows about the other's booking. Each one independently decided the seat was free, assigned it, and confirmed to the user.
This is not a hypothetical — it happens every day at scale. This exact class of bug caused United Airlines to oversell 42 flights in a single weekend in 2017. Marriott's reservation system double-booked 10,000 hotel rooms in a merger. Amazon's inventory system sold items that didn't exist during Prime Day surges. Every company building multi-region systems eventually confronts this problem.
The naive fix is obvious: before confirming, check with ALL replicas. Make sure nobody else claimed the seat. But this means the Chicago request must wait for a round-trip to Europe (85ms), and if the European datacenter is down or unreachable, the Chicago user can't book at all. You've traded availability for consistency.
The alternative: let both bookings succeed immediately, then resolve the conflict later. Maybe the system picks the earlier timestamp. Maybe it picks randomly. Maybe it refunds one traveler. This preserves availability but sacrifices consistency — for some window of time, the system believes two people own the same seat.
Sebastian Burckhardt's 2014 monograph, Principles of Eventual Consistency, does something no textbook before it accomplished: it takes the fuzzy, hand-wavy notion of "eventual consistency" and gives it rigorous mathematical foundations. He defines a hierarchy of consistency models as sets of axioms on event graphs. He proves which guarantees are achievable under which conditions. And he provides concrete data structures — CRDTs (Conflict-free Replicated Data Types) — that provably converge without coordination.
By the end of this lesson, you'll understand not just WHAT eventual consistency means, but WHY it's defined the way it is, HOW to implement systems that guarantee convergence, and WHEN you need something stronger.
Every major distributed system has war stories about consistency bugs. These aren't theoretical — they cost real money and real trust:
| Incident | Root Cause | Impact |
|---|---|---|
| Amazon DynamoDB 2015 | Eventual consistency + stale read led to order being placed for out-of-stock item | Fulfillment failures, customer refunds |
| GitHub 2018 | MySQL replication lag caused users to see stale repository state after pushes | Users thought commits were lost, re-pushed, caused merge conflicts |
| Facebook 2019 | Cache invalidation across regions took >30 seconds; users saw stale profile data | Privacy concerns (old settings visible to others) |
| Google Cloud Spanner 2020 | Clock synchronization drift exceeded TrueTime bounds during datacenter maintenance | Brief unavailability in affected zone (CP system behaved correctly) |
The lesson from these incidents: consistency bugs are not about correctness in the abstract. They're about real users seeing impossible states — ghost data, lost writes, duplicate purchases, privacy violations. The formal framework in this lesson exists because informal reasoning about these systems is not reliable enough.
Two users try to book the same seat across a network partition. Watch the consistency-availability tradeoff in real time.
If you've built concurrent programs, your first instinct might be: "use a distributed lock." Client A acquires a lock on seat 22A, writes, releases. Client B tries to acquire, blocks until A is done. Problem solved?
Distributed locks are harder than they sound. Consider what happens if the lock holder crashes before releasing. The lock is stuck forever unless you add a timeout. But if you add a timeout, what happens if the holder is just slow (not crashed)? It loses the lock while still thinking it holds it, and another client acquires the lock. Now TWO clients think they hold the lock. This is the fencing problem.
The deeper issue: distributed locks require a consensus service (like ZooKeeper or etcd) to implement correctly. And consensus services are expensive — they require majority agreement on every lock acquisition. For a global system with thousands of operations per second, the lock service becomes a bottleneck. CRDTs and eventual consistency offer an alternative: don't lock at all, let everyone write, and resolve conflicts mathematically.
Most developers think of consistency as binary: either your system is consistent or it isn't. This is like saying temperature is binary — either hot or cold. In reality, there's a precise spectrum of consistency guarantees, each one relaxing a specific constraint to gain performance or availability.
Before Burckhardt's formalization, the consistency landscape was confusing. Different papers used the same terms to mean different things. "Eventual consistency" in Amazon's Dynamo paper meant something subtly different from "eventual consistency" in Werner Vogels's blog post, which was different again from the academic definition. Burckhardt's contribution was to pin each model down with mathematical axioms, making the definitions unambiguous.
What emerged is surprisingly elegant: the models form a clean hierarchy, each one obtained from the previous by removing exactly one axiom. Understanding this hierarchy is the single most important thing you can learn about distributed systems — it's the map that tells you exactly what you're getting and what you're giving up for every design choice you make.
Burckhardt's key insight is that each consistency model is a set of axioms about how operations relate to each other. Stronger models have more axioms. Weaker models have fewer. Fewer axioms means more behaviors are "legal" — which means the system has more freedom, which means it can be faster and more available, but also harder for programmers to reason about.
From strongest to weakest:
Linearizability (the gold standard). Every operation appears to take effect at a single instant between its invocation and completion. All observers see operations in the same order, and that order respects real-time. If operation A finishes before operation B starts, then A appears before B — for everyone. This is what single-threaded programs give you for free. In a distributed system, it requires coordination on every operation.
Sequential consistency. There exists SOME total order of all operations that is consistent with each client's local order. But this order does NOT have to respect real-time. If you write X=1, then I write X=2, a sequentially consistent system could show everyone X=1 (ordering my write before yours), as long as each individual client's operations appear in the order they issued them. The difference from linearizability: real-time ordering is not preserved across clients.
Causal consistency. If operation A could have influenced operation B (because B happened after A on the same client, or B read a value written by A), then everyone sees A before B. But operations that are concurrent — neither could have influenced the other — may appear in different orders on different replicas. This is weaker than sequential because there's no single total order of concurrent operations.
Eventual consistency. If no new updates are made, all replicas will eventually converge to the same value. That's it. No ordering guarantees during updates. No bound on how long "eventually" takes. Replicas can temporarily disagree on everything. The only promise: silence leads to agreement.
Quiescent consistency. Even weaker — convergence is only guaranteed during periods of quiescence (no operations in flight). If operations never stop, replicas may never converge. This is the absolute floor of useful consistency.
| Model | Guarantees | Gives Up | Real System |
|---|---|---|---|
| Linearizable | Real-time total order | Performance, availability | Google Spanner, ZooKeeper |
| Sequential | Total order (not real-time) | Real-time ordering across clients | POSIX file systems |
| Causal | Causally-related operations ordered | Total order of concurrent ops | MongoDB (causal sessions), COPS |
| Eventual | Replicas converge when updates stop | All ordering during updates | Cassandra, DynamoDB, Riak |
| Quiescent | Converge during quiet periods | Convergence during continuous load | DNS (with TTL-based propagation) |
Two clients, two replicas. Client A writes X=1, then Client B writes X=2. Client C reads X from both replicas.
Linearizability deserves special attention because it's what single-threaded programs give you for free — and what distributed systems spend enormous effort to approximate. The formal definition: every operation appears to take effect at a single, atomic point between its invocation and completion, and all operations are consistent with a single global total order that respects real time.
Checking whether an execution history is linearizable is NP-complete in general (you're searching for a linearization point assignment). In practice, tools like Jepsen use randomized testing to find violations.
Many practitioners argue that causal consistency is the "sweet spot" — it prevents the most confusing anomalies (seeing a reply before the original message) while remaining implementable without global coordination. The key: causal consistency requires no total order of concurrent operations, which means replicas don't need to agree on how to order unrelated events.
Click each consistency level to see what it guarantees and what it gives up. Real systems are mapped to their actual consistency level.
To reason precisely about consistency, we need precise language. Without formalism, engineers argue in circles: "our system is eventually consistent" — but what does that MEAN? When does it converge? What can a reader see during convergence? What ordering is guaranteed? These questions have precise answers, but only if we define our terms precisely.
Burckhardt's formalism starts with the simplest possible building block: an event. Every time a client reads or writes a value, that's an event. An event has an operation (what it did), a return value (what it saw), and an identity (which client, when, where). Events are the atoms of distributed computation — everything else is built from them.
The genius of this approach is its generality. Whether you're building a key-value store, a message queue, a file system, or a collaborative editor, the formal model is the same: events, visibility, session order, arbitration. The specific operations differ, but the framework for reasoning about consistency is universal.
Now the key question: which events can "see" other events? If event A writes X=5 and event B later reads X=5 on the same replica, then B can see A. We write this as A vis B — "A is visible to B." But if B is on a different replica and the write hasn't propagated yet, A is NOT visible to B. B might read an old value, or nothing.
The visibility relation is the formal way of saying "this operation knew about that operation." It captures the flow of information through the system. When you replicate a write from node A to node B, you're extending the visibility relation: events on A become visible to future events on B. The strength of a consistency model is determined by how much visibility it guarantees.
Session order (so). Events by the same client are ordered. If client 1 writes X, then reads Y, the write comes before the read in session order. This is the programmer's expectation: my operations happen in the order I issued them.
Visibility (vis). Which events can an operation "see"? If event A is visible to event B, then B's return value was computed with knowledge of A. Visibility is NOT the same as session order — an event on replica 2 might see events from replica 1 (after replication), but session order only relates events from the same client.
Arbitration (ar). A total order on events that resolves conflicts. When two events write to the same key, arbitration decides who "wins." Think of it as the tiebreaker. In a last-writer-wins register, arbitration is determined by timestamps. In a Paxos-based system, arbitration is the consensus order.
These three relations — session order, visibility, and arbitration — are the alphabet of Burckhardt's formalism. Every consistency guarantee, every protocol property, every correctness proof is written in terms of these relations. Master them, and the rest of the framework follows.
Formally, a history is a tuple H = (E, op, rval, so, vis) where:
| Symbol | Meaning | Example |
|---|---|---|
| E | Set of events | {e1, e2, e3, e4, e5} |
| op | Operation for each event | op(e1) = write(X, 5) |
| rval | Return value for each event | rval(e3) = 5 |
| so | Session order (per-client) | e1 → e2 (same client) |
| vis | Visibility relation | e1 → e3 (e3 saw e1's write) |
Lamport's famous happens-before relation is the transitive closure of session order and visibility combined. If A is in session order before B, and B is visible to C, then A happens before C — even though A and C might be on different clients and different replicas.
Two events are concurrent if neither happens before the other:
Concurrent events are the root of all distributed systems complexity. If two events are ordered by happens-before, we know which one "came first" and can resolve conflicts. If they're concurrent, there IS no "first" — both happened independently, and the system must have a strategy for resolving the disagreement.
A critical distinction. Wall-clock time (UTC timestamps) is NOT the same as happens-before ordering. Two events can happen at the same wall-clock time on different replicas and still be ordered by happens-before (if one saw the other via replication). Conversely, two events with different wall-clock times can be concurrent (if neither saw the other).
This is why Lamport invented logical clocks — to capture happens-before without relying on physical time. Vector clocks extend this to capture exactly which events each replica has seen, enabling precise concurrency detection.
Click on a replica to create write events. Watch visibility and happens-before edges form. Concurrent events are highlighted.
We have events, visibility, session order, and arbitration. Now Burckhardt's central contribution: a consistency guarantee is a SET OF AXIOMS constraining these relations. Each axiom forbids certain behaviors. More axioms = stronger consistency = fewer legal executions = harder to implement efficiently.
An abstract execution is a tuple A = (H, vis, ar) where H is a history, vis is the visibility relation, and ar is the arbitration (total conflict-resolving order). A consistency model is a predicate that says "this abstract execution is legal" or "this one is not."
Burckhardt identifies five core axioms. Every consistency model is a combination of these:
Axiom 1: Eventual Visibility (EVIS). For any two events a and b, eventually a becomes visible to b or b becomes visible to a. No event stays invisible forever.
Axiom 2: Consistent Visibility (CVIS). Visibility is consistent with session order. If a client does operation A then operation B, anything visible to A is also visible to B.
In other words: a client never "forgets" what it has seen. If you read X=5 and then read X again, you won't see X=3 (an older value). This is also called monotonic reads.
Axiom 3: Consistent Arbitration (CAR). Arbitration is consistent with visibility. If a is visible to b, then a comes before b in the arbitration order.
Axiom 4: Total Visibility (TVIS). If b sees a, then b sees everything that a saw. Visibility is transitively closed.
Axiom 5: Real-Time Order (RTO). If operation a completes before operation b starts (in wall-clock time), then a is visible to b.
| Model | Axioms Required |
|---|---|
| Quiescent | EVIS (during quiescence only) |
| Eventual | EVIS |
| Causal | EVIS + CVIS + TVIS |
| Sequential | EVIS + CVIS + TVIS + CAR + total vis |
| Linearizable | EVIS + CVIS + TVIS + CAR + total vis + RTO |
You might wonder: why define consistency models as axioms instead of algorithms? Because the same consistency guarantee can be implemented by many different protocols. Linearizability can be achieved by Paxos, Raft, viewstamped replication, or a single-node database. By defining models as axioms on abstract relations, Burckhardt separates the WHAT (the guarantee) from the HOW (the implementation). This lets you prove that a new protocol provides a known guarantee, or that a known guarantee is sufficient for your application — independently.
It's the same principle as interfaces in programming: the axioms are the interface, the protocols are the implementations. You can swap implementations without changing the guarantee, and you can analyze the guarantee without knowing the implementation.
The axioms are not independent — some imply others, and some combinations are contradictory or redundant. Understanding the dependency structure is key to navigating the consistency hierarchy:
Here's an eye-opening way to understand the consistency spectrum: count how many distinct executions are legal under each model for the same set of operations.
This is why eventual consistency is hard for programmers. If you write code that reads a value and branches on it, under linearizability there's one possible execution path. Under eventual consistency, there could be many. Your code must handle ALL of them correctly.
Toggle axioms on and off. See which executions become legal or illegal under different combinations.
Abstract executions and axioms tell us what guarantees a system provides. But they don't tell us HOW to build it. We need data structures that can live on multiple replicas, accept updates independently, and converge to the same state without coordination. These are Conflict-free Replicated Data Types — CRDTs.
The idea is deceptively simple, and the name is slightly misleading. "Conflict-free" doesn't mean conflicts don't happen — it means conflicts are resolved automatically, without human intervention, using mathematical properties of the data structure. A better name might be "automatically-convergent replicated data types," but CRDT is catchier.
A sequential data type is what you already know: a set of states, a set of operations, and a specification mapping (state, operation) to (new state, return value). A stack, a counter, a register, a set — all sequential data types. You interact with them one operation at a time, and each operation sees the result of all previous operations.
A replicated data type extends this to handle concurrent operations on multiple replicas. The question is: when two replicas independently apply different operations and then sync, do they end up in the same state? If YES for all possible orderings of all possible operations, the data type is a CRDT.
This is a remarkably strong guarantee. It says that NO matter what operations happen, in what order, on which replicas, with what delays or duplications in message delivery — the final state is ALWAYS the same. This is what "conflict-free" means: not that conflicts don't occur, but that they resolve themselves deterministically.
State-based CRDTs (CvRDTs). Each replica maintains a complete copy of the state. Replicas periodically send their entire state to other replicas, which merge the received state with their own. The merge function must be commutative, associative, and idempotent — in other words, it must form a join-semilattice.
Operation-based CRDTs (CmRDTs). Instead of sending full states, replicas broadcast operations. The key requirement: operations must commute. If replica A applies op1 then op2, and replica B applies op2 then op1, both must end up in the same state. This needs a reliable broadcast layer (every operation delivered at least once to every replica).
Delta-based CRDTs. A hybrid. Instead of full states (expensive for large data) or operations (need reliable broadcast), send only the DELTA — the part of the state that changed. Deltas must also form a join-semilattice. This is what modern implementations like Riak and Akka Distributed Data use.
Burckhardt's formal statement: a state-based CRDT with a join-semilattice merge function, where update operations are monotonically increasing (the state only moves "up" in the lattice), converges to a least upper bound under eventual delivery.
This sounds abstract, but it's just a fancy name for a set with a "combine" operation that always produces the same result regardless of order. Think of it like a bag of marbles: if you pour bag A into bag B, or bag B into bag A, you get the same combined bag. And pouring a bag into itself changes nothing.
Formally, a join-semilattice is a set S with a partial order ≤ and a binary operation ⊔ (join) such that for any a, b in S, a ⊔ b is the least upper bound — the smallest element that's greater than or equal to both a and b.
Let's prove rigorously that the G-Counter merge function satisfies all three semilattice properties. This is the kind of proof Burckhardt demands — not handwaving, but step-by-step verification.
This proof may seem trivially simple — and it is, for the G-Counter. But when you get to more complex CRDTs like OR-Sets or RGA (Replicated Growable Arrays for text editing), the proofs become intricate. The G-Counter is the "hello world" of CRDT proofs: simple enough to understand, complex enough to illustrate the technique.
State-based CRDTs send the entire state on every sync. For a G-Counter with 100 replicas, that's 100 integers — fine. For an OR-Set with 10 million entries, sending the full state is catastrophic. This is where delta-based CRDTs shine.
A delta-CRDT only sends the change since the last sync. The key requirement: deltas must also form a semilattice, so they can be merged independently. The receiver accumulates deltas and periodically merges them into its full state. If deltas are lost, the sender falls back to sending the full state — but this is rare.
Three replicas each maintain a counter vector. Increment on any replica, then merge to see convergence.
Let's implement the fundamental CRDTs from scratch. Each one follows the same pattern: define the state, define the update operations, define the merge function, prove convergence. This is the "concept to realization" bridge — by the end of this chapter, you'll be able to implement any of these from memory.
We start with the simplest (G-Counter) and build toward the most complex (MV-Register). Each builds on the previous: the PN-Counter uses two G-Counters, the LWW-Register uses timestamps like the G-Counter uses max, and the MV-Register uses version vectors — which are themselves G-Counters used for tracking causality rather than counting.
The simplest useful CRDT, and arguably the most important — because it demonstrates the core principle that all CRDTs share. Each of N replicas maintains a vector of N integers. Replica i only increments position i. The counter value is the sum of all positions. Merge is component-wise maximum.
Why a vector instead of a single integer? Because we need the merge function to be idempotent. If we used addition, merging the same state twice would double the count. With max, merging the same state twice changes nothing — which is exactly what idempotency requires.
G-Counters can only go up. What if you need decrements? The naive approach — allowing negative values in the vector — breaks the semilattice property (max of -3 and 5 is 5, but that loses the -3 decrement). The elegant solution: use TWO G-Counters. One for increments (P), one for decrements (N). The value is P.value() - N.value(). Each G-Counter independently satisfies the semilattice properties, and subtraction preserves the correct total.
This "decompose into monotonic components" pattern appears throughout CRDT design. When you need a non-monotonic operation (decrement, remove), split the state into monotonically increasing parts and derive the non-monotonic result from them.
A register holds a single value. When two replicas write concurrently, who wins? The LWW-Register uses timestamps: the write with the highest timestamp wins. Simple, but lossy — concurrent writes are silently discarded. Despite its simplicity, LWW is one of the most widely deployed CRDTs — Cassandra uses it for every cell in every table by default.
The catch: LWW requires globally unique, monotonically increasing timestamps. In a distributed system, clock skew means two replicas might assign the same timestamp to different writes. Worse, a replica with a fast clock will always "win" over a replica with a slow clock, regardless of actual event ordering. This is why systems like Cassandra use hybrid logical clocks — combining physical time with a logical counter to guarantee uniqueness.
The alternative to LWW: keep ALL concurrent values and let the application resolve the conflict. This is the MV-Register (Multi-Value Register). Amazon's original Dynamo shopping cart used this approach — concurrent adds to a shopping cart produce a set of carts, which the application merges (usually: union of all items).
The MV-Register uses version vectors (essentially vector clocks for data versions) to detect concurrency. When two writes have incomparable version vectors (neither dominates the other), they're concurrent, and both values are kept. When a later write has a version vector that dominates an earlier one, the earlier value is replaced.
python class GCounter: """Grow-only counter CRDT.""" def __init__(self, node_id, n_nodes): self.id = node_id self.vector = [0] * n_nodes def increment(self): self.vector[self.id] += 1 def value(self): return sum(self.vector) def merge(self, other): for i in range(len(self.vector)): self.vector[i] = max(self.vector[i], other.vector[i]) class PNCounter: """Positive-Negative counter = two G-Counters.""" def __init__(self, node_id, n_nodes): self.P = GCounter(node_id, n_nodes) self.N = GCounter(node_id, n_nodes) def increment(self): self.P.increment() def decrement(self): self.N.increment() def value(self): return self.P.value() - self.N.value() def merge(self, other): self.P.merge(other.P) self.N.merge(other.N) class LWWRegister: """Last-Writer-Wins Register.""" def __init__(self): self.value = None self.timestamp = 0 def write(self, val, ts): if ts > self.timestamp: self.value = val self.timestamp = ts def read(self): return self.value def merge(self, other): if other.timestamp > self.timestamp: self.value = other.value self.timestamp = other.timestamp class MVRegister: """Multi-Value Register — keeps all concurrent values.""" def __init__(self, node_id, n_nodes): self.id = node_id self.n = n_nodes self.entries = [] # list of (value, vector_clock) self.clock = [0] * n_nodes def write(self, val): self.clock[self.id] += 1 # New write dominates all current entries self.entries = [(val, self.clock[:])] def read(self): return [v for v, _ in self.entries] def _dominates(self, vc1, vc2): """Does vc1 causally dominate vc2?""" return (all(vc1[i] >= vc2[i] for i in range(self.n)) and vc1 != vc2) def merge(self, other): # Combine entries, keep only non-dominated ones all_entries = self.entries + other.entries kept = [] for v, vc in all_entries: dominated = any( self._dominates(vc2, vc) for _, vc2 in all_entries ) if not dominated: kept.append((v, vc)) # Deduplicate by vector clock seen = set() self.entries = [] for v, vc in kept: key = tuple(vc) if key not in seen: seen.add(key) self.entries.append((v, vc)) # Merge clocks for i in range(self.n): self.clock[i] = max(self.clock[i], other.clock[i]) # Demo: MV-Register concurrent writes r0 = MVRegister(0, 3) r1 = MVRegister(1, 3) r0.write("Alice") # r0: [("Alice", [1,0,0])] r1.write("Bob") # r1: [("Bob", [0,1,0])] r0.merge(r1) print(r0.read()) # ["Alice", "Bob"] — concurrent! r0.write("Carol") # resolves conflict print(r0.read()) # ["Carol"] — single value
Perform concurrent writes on two replicas. Compare LWW (one value survives) vs MV (all concurrent values preserved).
Counters and registers handle single values. Most real applications need collections: shopping carts (sets of items), user permissions (sets of roles), configuration maps (key-value pairs). Making these conflict-free is harder than you'd think, because collections have an operation that counters and registers don't: remove. And remove in a distributed system is surprisingly tricky — when you remove an element that was concurrently added on another replica, which one wins?
The simplest collection CRDT: you can add elements but never remove them. Merge is union.
This is trivially a CRDT — union is commutative, associative, and idempotent. But it's rarely useful in practice. You almost always need to remove elements.
G-Sets are perfect for tracking "have I seen this before?" — duplicate detection, read receipts, visited URLs. Anything where you only need to add elements and never remove them.
Like the PN-Counter pattern: two G-Sets, one for adds (A) and one for removes (R). An element is in the set if it's in A but NOT in R. The catch: once an element is removed, it can NEVER be re-added (it's permanently in R).
The OR-Set is the most important set CRDT. The key idea: each ADD generates a unique tag. When you REMOVE an element, you only remove the tags you've OBSERVED — tags you haven't seen (from concurrent adds on other replicas) survive.
Formally: the state is a set of (element, unique-tag) pairs. Add(e) inserts (e, fresh_tag). Remove(e) removes all pairs (e, *) currently visible on this replica. Merge is union of all pairs.
Consider Amazon's shopping cart, one of the most famous eventual-consistency use cases. You add milk to your cart on your phone. Your partner removes milk from the cart on their laptop. Concurrently, you add milk again (you realized you need two cartons). With a 2P-Set, the re-add fails — milk is in the tombstone set forever. With an OR-Set, the re-add generates a fresh tag, B's remove only killed the tags it observed, and the fresh tag survives. Milk is in the cart.
Amazon's original Dynamo paper (2007) used a similar approach: the shopping cart was an MV-Register that kept concurrent versions. The application merged them by taking the union of all items. This occasionally produced "resurrected" items (items that were removed would reappear), which Amazon preferred over lost items. The OR-Set formalizes this: add always wins over concurrent remove.
The naive OR-Set keeps growing forever — removed tags are gone from the local set but might arrive again via a stale replica. The practical fix: maintain a causal context (a vector clock summarizing all tags seen) alongside the entries. During merge, any tag that's in the causal context but NOT in the entries was explicitly removed.
python import uuid class ORSet: """Observed-Remove Set CRDT. Add-wins semantics.""" def __init__(self): self.entries = set() # set of (element, unique_tag) tuples def add(self, element): tag = str(uuid.uuid4())[:8] self.entries.add((element, tag)) def remove(self, element): # Remove only the tags we can SEE right now observed = {(e, t) for e, t in self.entries if e == element} self.entries -= observed def lookup(self, element): return any(e == element for e, _ in self.entries) def elements(self): return {e for e, _ in self.entries} def merge(self, other): # Union of all (element, tag) pairs. # Tags removed on one side but present on the other # survive — that's the "add-wins" behavior. self.entries = self.entries | other.entries # Demo: concurrent add and remove a = ORSet() b = ORSet() a.add("milk") b.merge(a) # b sees "milk" b.remove("milk") # b removes observed tags a.add("milk") # a concurrently re-adds with NEW tag a.merge(b) b.merge(a) print(a.lookup("milk")) # True — concurrent add survived print(b.lookup("milk")) # True — converged
The OR-Set's "add-wins" semantics might seem arbitrary. Why should add beat remove? Consider the alternative: "remove-wins" (if anyone removes X concurrently with someone adding X, X is removed). This is actually worse for most applications:
This principle — prefer not losing data — is why most CRDT designs bias toward preservation. It's the same philosophy behind Git's merge conflicts: show both versions and let the human decide, rather than silently picking one.
The MV-Register and OR-Set both need to track causality — "which events has this replica seen?" The standard tool is a vector clock: each replica maintains a vector of N counters, one per replica. When replica i does an operation, it increments V[i]. When it receives a message from replica j with vector V', it sets V[k] = max(V[k], V'[k]) for all k.
python class VectorClock: """Tracks causality across N replicas.""" def __init__(self, node_id, n_nodes): self.id = node_id self.clock = [0] * n_nodes def tick(self): """Local event: increment own counter.""" self.clock[self.id] += 1 def merge(self, other_clock): """Merge with received clock.""" for i in range(len(self.clock)): self.clock[i] = max(self.clock[i], other_clock[i]) self.clock[self.id] += 1 # receiving is also an event def dominates(self, other): """Does self causally dominate other?""" return (all(self.clock[i] >= other.clock[i] for i in range(len(self.clock))) and self.clock != other.clock) def concurrent(self, other): """Are self and other concurrent?""" return not self.dominates(other) and not other.dominates(self)
Add and remove elements on different replicas. Watch unique tags track adds/removes. Concurrent add + remove = add wins.
This is the payoff chapter. Everything we've learned — event graphs, axioms, CRDTs — converges in one interactive simulation. You control a distributed system with three replicas. You perform read/write operations on any replica. And you toggle between consistency models to see exactly which read results are legal under each one.
If you can predict what each model allows before clicking "Read All" — then you've internalized the formal framework. Every consistency debate in a design review reduces to this: "which behaviors does our model allow, and can our application tolerate them?"
The insight that makes distributed systems engineering hard is not that eventual consistency exists. It's that you can't tell the difference between consistency models during normal operation. When the network is fast and healthy, linearizable and eventual consistency look identical. The difference only manifests during partitions, lag, and concurrent access. This simulation forces those edge cases so you can see what each model actually guarantees.
Three replicas storing register X. Write values, read from any replica, and toggle the consistency model to see which reads are legal.
Here's a specific experiment to try. Follow these steps exactly:
The visual difference is striking. Under eventual consistency, each replica lives in its own world. Under linearizability, there's one global truth. Every level in between is a different blend of "local truth" vs "global truth."
| Model | What You'll See | Why |
|---|---|---|
| Linearizable | All replicas return the same "latest" value | Real-time total order means one global truth |
| Sequential | All replicas agree, but might not reflect real-time order | Total order exists but may reorder concurrent writes |
| Causal | Causally-related writes ordered; concurrent writes may differ across replicas | Only causal dependencies are preserved |
| Eventual | Replicas may temporarily disagree on everything, but converge when quiescent | No ordering guarantees during active updates |
Think of consistency models as constraints on a jigsaw puzzle. Linearizability says there's exactly one way to assemble the pieces — the real-time order. Sequential says any assembly that respects each person's piece order is fine. Causal says pieces that were placed "because of" earlier pieces must keep that order, but unrelated pieces can go anywhere. Eventual says you can scatter pieces across the table — as long as you eventually put them all together, the final picture will be the same for everyone.
The programmer's job is to choose the weakest model that still makes the application correct. Weaker models are faster and more available. The art is knowing what anomalies your application can tolerate.
Each weakening of the consistency model introduces specific anomalies. Understanding these anomalies is how you choose the right model for your application.
| Anomaly | Occurs Below | What Happens | Example |
|---|---|---|---|
| Stale read | Linearizable | Read returns a value that was already overwritten | You transfer $100, check balance, see old amount |
| Non-monotonic read | Causal | A later read returns an OLDER value than an earlier read | You see a comment, refresh, it disappears, refresh again, it's back |
| Read-your-writes violation | Causal | You write a value, then read and DON'T see your own write | You post a message, but it doesn't show in your feed |
| Causal violation | Sequential | You see a reply but not the original message | Bob replies to Alice's comment; you see Bob's reply but not Alice's comment |
| Total-order violation | Causal | Two replicas see concurrent writes in different orders | Replica 1 shows balance $50 then $100; Replica 2 shows $100 then $50 |
Models tell us WHAT guarantees we want. Protocols tell us HOW to achieve them. Burckhardt identifies three protocol families, each implementing a different point on the consistency spectrum. The mapping is clean: one family per level.
Understanding these protocols is critical for system design interviews. When an interviewer asks "how would you implement X," the answer is almost always one of these three patterns — or a hybrid. The choice determines your latency, availability, and programming complexity.
Each protocol can be understood as answering one question: when does a replica learn about another replica's operations? Immediately (global-sequence), when causally needed (broadcast), or eventually (epidemic). The answer determines everything about the system's behavior.
Each replica periodically picks a random partner and exchanges state. After merging, both have the same information. Over time, every update reaches every replica — like a rumor spreading through a crowd.
There are three variants of gossip: push (I send my state to you), pull (I request your state), and push-pull (we exchange states). Push-pull converges fastest because each interaction makes both parties up-to-date. Push converges in O(N log N) messages; push-pull converges in O(N log log N) — a significant improvement for large clusters.
The gossip interval (how often each node contacts a peer) creates a direct tradeoff between convergence speed and network bandwidth. Gossip every 100ms and you converge in seconds; gossip every 10 seconds and you converge in minutes. Most systems use 1-second gossip intervals, giving convergence in 5-20 seconds for clusters of 100+ nodes.
Tag each operation with a vector clock. Deliver operations in causal order: an operation is only applied when all operations it depends on have been applied. This guarantees CVIS and TVIS axioms.
The key mechanism is the delivery buffer. When a message arrives whose dependencies aren't met yet (the sender saw events that the receiver hasn't seen), the message waits in a buffer until the missing events arrive. This buffer is what makes causal broadcast more expensive than epidemic — messages can be delayed, and the buffer can grow during bursts of activity or slow networks.
A single designated node (or a consensus protocol like Paxos/Raft) assigns a global sequence number to every operation. All replicas apply operations in this sequence. This gives linearizability (with a leader) or sequential consistency (with consensus).
The fundamental cost of global-sequence protocols is the consensus round. Every write requires a majority of replicas to acknowledge before the operation is committed. This means the write latency is dominated by the slowest node in the majority — typically the one farthest away in terms of network distance. For a 3-node cluster in a single datacenter, this might be 5ms. For a 5-node cluster spanning continents, this could be 200ms.
This is why global-sequence protocols are impractical for latency-sensitive operations across geographic regions. Spanner works around this with GPS-synchronized atomic clocks (TrueTime), but this requires specialized hardware that most organizations can't afford.
| Protocol | Consistency | Latency | Availability | Example |
|---|---|---|---|---|
| Epidemic | Eventual | Local writes (~1ms) | All replicas serve reads/writes | Riak, Cassandra |
| Broadcast | Causal | Local + buffer delay | All serve writes; reads may wait | COPS, Eiger |
| Global-Seq | Linearizable | Consensus round (~10-100ms) | Majority must be up | Spanner, CockroachDB |
Epidemic protocols borrow directly from mathematical epidemiology. The spread of an update through a gossip network follows the same differential equation as the spread of an infectious disease through a population. This is not a metaphor — it's the same math.
Causal broadcast uses vector clocks with one entry per replica. With N replicas, each message carries an N-dimensional vector. For N=5, this is trivial. For N=10,000 (a large Cassandra cluster), this is 10,000 integers per message — a serious overhead.
Let's trace a write through a Raft-based global-sequence protocol. Raft is the most widely-used consensus algorithm (etcd, CockroachDB, TiKV). The key: one leader sequences all operations; followers replicate in order.
| Protocol | Same-Region | Cross-Region | Bottleneck |
|---|---|---|---|
| Epidemic (Riak) | ~1-2ms (local disk) | ~1-2ms (local disk; propagation in background) | Local I/O only |
| Broadcast (COPS) | ~2-5ms (local + buffer check) | ~5-50ms (wait for causal deps) | Causal dependency delay |
| Global-Seq (Raft) | ~5-20ms (consensus round) | ~100-300ms (cross-region RTT) | Majority acknowledgment RTT |
| Global-Seq (Spanner) | ~5-10ms (TrueTime wait) | ~10-50ms (TrueTime wait + RTT) | Clock uncertainty + RTT |
python import random class GossipNode: """A single node in a gossip network with a G-Counter.""" def __init__(self, node_id, n_nodes): self.id = node_id self.counter = GCounter(node_id, n_nodes) self.inbox = [] # messages waiting to be processed def increment(self): self.counter.increment() def send_state(self): """Return current state to send to a random peer.""" return self.counter.vector[:] def receive_state(self, remote_vector): """Merge received state with local state.""" for i in range(len(self.counter.vector)): self.counter.vector[i] = max( self.counter.vector[i], remote_vector[i] ) def simulate_gossip(n_nodes=10, n_increments=20, n_rounds=50): """Simulate gossip convergence.""" nodes = [GossipNode(i, n_nodes) for i in range(n_nodes)] # Phase 1: random increments on random nodes for _ in range(n_increments): node = random.choice(nodes) node.increment() # Phase 2: gossip rounds for round_num in range(n_rounds): # Each node contacts one random peer for node in nodes: peer = random.choice(nodes) if peer.id != node.id: state = node.send_state() peer.receive_state(state) # Check convergence values = [n.counter.value() for n in nodes] if len(set(values)) == 1: print(f"Converged in round {round_num}! Value: {values[0]}") return round_num print(f"Did not converge in {n_rounds} rounds.") return n_rounds # Run simulation rounds = [simulate_gossip() for _ in range(100)] print(f"Average rounds to converge: {sum(rounds)/len(rounds):.1f}") # Typically: 3-5 rounds for 10 nodes. O(log N) behavior.
python class CausalBroadcastNode: """Node implementing causal broadcast via vector clocks.""" def __init__(self, node_id, n_nodes): self.id = node_id self.n = n_nodes self.clock = [0] * n_nodes self.state = {} # key-value store self.buffer = [] # undelivered messages self.delivered = [] # delivered operations log def write(self, key, val): self.clock[self.id] += 1 self.state[key] = val # Return message to broadcast return {'op': (key, val), 'sender': self.id, 'vc': self.clock[:]} def receive(self, msg): """Buffer message, deliver when deps satisfied.""" self.buffer.append(msg) self._try_deliver() def _can_deliver(self, msg): """Check if all causal dependencies are met.""" vc = msg['vc'] sender = msg['sender'] # Sender's count must be exactly our count + 1 if vc[sender] != self.clock[sender] + 1: return False # All other entries must be ≤ our clock for k in range(self.n): if k != sender and vc[k] > self.clock[k]: return False return True def _try_deliver(self): progress = True while progress: progress = False for msg in self.buffer[:]: if self._can_deliver(msg): self.buffer.remove(msg) key, val = msg['op'] self.state[key] = val self.clock[msg['sender']] += 1 self.delivered.append(msg) progress = True
Watch the same sequence of operations executed under epidemic, broadcast, and global-sequence protocols. Note the latency and ordering differences.
You've heard "pick two of three: Consistency, Availability, Partition tolerance." But what does this ACTUALLY mean? Most explanations are hand-wavy. Burckhardt gives the precise formal statement and proof. And most people who cite CAP get it wrong — either they think it means "choose any two" (it doesn't — P is not optional), or they think it applies during normal operation (it doesn't — CAP only matters during partitions), or they think it's about eventual consistency (it's not — CAP's "C" means sequential or stronger).
Getting CAP right in an interview is a differentiator. Let's build the correct understanding from the formal definitions.
Consistency (C): Here means specifically sequential consistency (or stronger). Every read returns a value consistent with some total order of all operations that respects each client's order. Note: this is NOT eventual consistency. CAP's "C" is the strong kind.
Availability (A): Every request to a non-failed node receives a response. Not "eventually" — within a bounded time. A system that blocks forever waiting for a partitioned node is NOT available.
Partition tolerance (P): The system continues to operate despite arbitrary message loss between nodes. Since network partitions are a fact of life (not a choice), every real distributed system must be partition-tolerant. "P" is not optional.
The proof is by contradiction, and it's beautifully simple. Assume a system that is both sequentially consistent AND available during a partition. Show that this leads to a contradiction.
| Choice | During Partition | Systems |
|---|---|---|
| CP (Consistent) | Reject writes/reads on minority side of partition. Majority side continues. | ZooKeeper, etcd, Spanner, HBase |
| AP (Available) | All nodes accept reads/writes. May return stale data. Reconcile after partition heals. | Cassandra, DynamoDB, Riak, CouchDB |
CAP only talks about partition scenarios. But even during normal operation, there's a tradeoff between latency and consistency. The PACELC model extends CAP:
| System | During Partition | Normal Operation | PACELC |
|---|---|---|---|
| Cassandra | AP (available) | EL (low latency) | PA/EL |
| Spanner | CP (consistent) | EC (consistent) | PC/EC |
| DynamoDB | AP (available) | EL (low latency) | PA/EL |
| MongoDB | CP (consistent) | EC (consistent) | PC/EC |
Google Spanner claims to be a globally-distributed, strongly-consistent, highly-available database. Doesn't this violate CAP? Technically no. Spanner is a CP system — during a partition, it sacrifices availability on the minority side. But Google has invested billions in network infrastructure to make partitions astronomically rare.
A common mistake: treating CAP as a design framework ("I'll build an AP system"). CAP is a negative result — it tells you what's impossible, not what to build. The right way to use CAP:
| Wrong | Right |
|---|---|
| "Let's build an AP system" | "During partitions, which operations can tolerate stale reads?" |
| "We chose consistency over availability" | "For payment operations, we block during partitions. For feed reads, we serve stale data." |
| "CAP means pick 2 of 3" | "CAP means during a partition, each operation must choose C or A" |
Two nodes with a partition. Write to one, read from the other. See why strong consistency + availability is impossible.
Saying "my CRDT converges" is easy. Proving it requires a simulation relation between the concrete protocol execution and the abstract consistency model. This is how Burckhardt verifies that a protocol actually implements the consistency guarantees it claims.
Why should you care about correctness proofs? Because distributed systems bugs are the hardest to find. They only manifest under specific timing conditions, specific failure patterns, specific message orderings. You can test for months and miss a bug that only appears when a specific replica is 200ms behind, during a garbage collection pause, while a network switch is rebooting. Proofs catch these bugs at design time, before any code is written.
The industry learned this the hard way. Amazon's DynamoDB team discovered that their original hash ring implementation had a subtle consistency violation that only occurred during simultaneous node joins and departures. MongoDB's initial causal consistency implementation had a gap in the vector clock protocol that allowed causal ordering violations during primary elections. Both bugs were found by formal analysis, not testing.
A correctness proof has three parts:
1. Define the abstract specification. What consistency model are we targeting? For a G-Counter with epidemic protocol, the target is eventual consistency: the EVIS axiom (all events eventually visible).
2. Define the simulation relation. Map each concrete state (vector values on each replica, messages in flight) to an abstract execution (events, visibility, arbitration). The simulation relation R(concrete, abstract) says: "this concrete state corresponds to this abstract execution."
3. Prove the simulation. Show that for every step the concrete protocol takes (local update, send message, receive message, merge), the corresponding abstract execution remains legal — i.e., the axioms still hold.
You might think: "I'll never write a formal proof at work." True. But understanding the proof structure helps you spot bugs. The most common distributed systems bugs violate one of the proof steps:
| Proof Step Violated | Real Bug |
|---|---|
| Merge not commutative | Cassandra's counter CRDT had a non-commutative merge path that caused divergence under specific message orderings (CASSANDRA-6504) |
| Messages not eventually delivered | Riak's handoff mechanism could stall during high load, violating the fair delivery assumption → permanent divergence |
| Simulation relation broken by edge case | CockroachDB's causal ordering had a gap where replicas could deliver operations out of causal order during leader transitions |
You don't need to write formal proofs. But you DO need to verify these properties. Here's how to test each one in practice:
python def test_crdt_properties(crdt_class, n_trials=1000): """Property-based test for CRDT correctness.""" import random for _ in range(n_trials): # Create two replicas with random operations a = crdt_class(0, 3) b = crdt_class(1, 3) for _ in range(random.randint(1, 10)): a.increment() for _ in range(random.randint(1, 10)): b.increment() # Test commutativity: merge(a,b) == merge(b,a) ab = crdt_class(0, 3); ab.vector = a.vector[:] ba = crdt_class(1, 3); ba.vector = b.vector[:] ab.merge(b) ba.merge(a) assert ab.vector == ba.vector, "Commutativity failed!" # Test idempotency: merge(a,a) == a aa = crdt_class(0, 3); aa.vector = a.vector[:] aa.merge(a) assert aa.vector == a.vector, "Idempotency failed!" print("All properties verified!") test_crdt_properties(GCounter) # Should pass
Let's also outline the proof for a causal broadcast protocol — this is harder than the epidemic proof because we need to verify three axioms (EVIS, CVIS, TVIS) instead of just one.
Watch 5 replicas execute increments and gossip merges. Track the state vectors converging. The proof guarantee: all vectors reach the same final state.
If you've studied the DDIA lessons in this series, here's how Burckhardt's formal framework maps to Kleppmann's practical treatment. Every concept in DDIA has a precise formal counterpart here.
| Burckhardt Concept | DDIA Chapter | Connection |
|---|---|---|
| Consistency models | Ch 6 (Replication) | DDIA's replication lag anomalies = violations of specific Burckhardt axioms |
| CRDTs | Ch 7 (Concurrent writes) | DDIA's "concurrent writes" section motivates CRDTs as the solution |
| Global-sequence protocols | Ch 10 (Consensus) | Paxos/Raft = specific implementations of Burckhardt's global-sequence template |
| CAP theorem | Ch 9 (Consistency & Consensus) | DDIA discusses CAP informally; Burckhardt formalizes it |
| Event graphs | Ch 5 (Encoding flows) | Lamport's happens-before is the backbone of event graphs |
Burckhardt's framework doesn't exist in isolation — it builds on decades of theoretical computer science. Here are the algorithmic foundations that underpin each concept.
| Burckhardt Concept | Algorithmic Foundation |
|---|---|
| Event graphs | Directed acyclic graphs (CLRS Ch 22) |
| Happens-before / transitive closure | Graph reachability, topological sort |
| Vector clocks | Partial orders, lattice theory |
| CRDT merge = join-semilattice | Order theory, abstract algebra |
| Gossip convergence | Randomized algorithms, epidemic models |
Pin this table. In a system design interview, you'll reach for it constantly. Each CRDT solves a specific class of conflict — choosing the right one is half the design.
| CRDT | Type | Merge | Semantics | Use Case |
|---|---|---|---|---|
| G-Counter | State | max per component | Grow-only count | Page views, likes |
| PN-Counter | State | Two G-Counters | Inc + Dec | Stock levels, scores |
| LWW-Register | State | Highest timestamp | Last write wins | Sensor data, cache |
| MV-Register | State | Keep concurrent values | Multi-value (app resolves) | Collaborative edit |
| G-Set | State | Union | Add-only set | Seen message IDs |
| 2P-Set | State | Two G-Sets | Add + permanent remove | Tombstone tracking |
| OR-Set | State | Union of tagged pairs | Add wins over concurrent remove | Shopping cart, friends list |
The definitive reference. Each model is a precise set of axioms — not a marketing term, not a vague promise, but a mathematical specification with provable properties.
| Model | Axioms | Key Property | Protocol |
|---|---|---|---|
| Linearizable | EVIS+CVIS+TVIS+CAR+RTO | Real-time total order | Global-sequence |
| Sequential | EVIS+CVIS+TVIS+CAR | Total order (no real-time) | Global-sequence (relaxed) |
| Causal | EVIS+CVIS+TVIS | Causal order preserved | Broadcast + vector clocks |
| Eventual | EVIS | Converge when quiescent | Epidemic (gossip) |
| Quiescent | EVIS (weak) | Converge during quiet | Lazy propagation |
This monograph, published in 2014, laid the formal groundwork that the distributed systems community has built on for a decade. Before Burckhardt, "eventual consistency" was an informal promise. After, it became a precise mathematical property with provable implementations. The CRDT taxonomy he formalized (state-based, operation-based, delta-based) became the standard vocabulary. His axiom-based consistency hierarchy influenced the design of systems from Azure Cosmos DB to Redis CRDTs.
The key meta-insight: distributed consistency is not one problem — it's a SPECTRUM of problems, each solvable with known techniques. The engineer's job is to identify where on the spectrum each piece of data lives, and apply the right tool. Most real systems don't live at a single point on the spectrum — they use different consistency levels for different data, chosen based on the correctness requirements and performance constraints of each piece.
When you hear "we use eventual consistency" in an interview or design review, the right follow-up is: "for which data, and what conflict resolution strategy?" The answer reveals whether the speaker truly understands the system they're building.
These are the questions that separate senior engineers from juniors in system design interviews. Each one maps directly to a concept from this lesson.
For replication fundamentals: DDIA Chapter 6: Database Replication
If this lesson sparked your interest, here are the primary sources. Each paper/book is foundational — they're the papers that everyone in distributed systems has read.
| Source | Topic | Why Read It |
|---|---|---|
| Burckhardt, MSR 2014 | This monograph | The formal foundations — everything in this lesson comes from here |
| Shapiro et al., INRIA 2011 | "A Comprehensive Study of CRDTs" | The definitive CRDT catalog with 12+ data types and proofs |
| Lamport, 1978 | "Time, Clocks, and Ordering" | Happens-before relation, logical clocks — the foundation of event graphs |
| Brewer, 2000 / Gilbert & Lynch, 2002 | CAP Theorem | Original conjecture (Brewer) and formal proof (Gilbert & Lynch) |
| Kleppmann, DDIA Ch 5-9 | Distributed data | Practical perspective on everything formalized here |
| System | CRDTs Used | Why |
|---|---|---|
| Redis (CRDB) | Counter, Set, Register, Sorted Set | Active-active geo-replication across datacenters |
| Riak | Counter, Set, Map, Register, Flag | Leaderless key-value store; CRDTs replace read-repair |
| Apple Notes | RGA (Replicated Growable Array) | Offline-first editing on iOS/macOS; sync on reconnect |
| Figma | Custom operation-based CRDTs | Real-time collaborative design; conflict-free shape edits |
| SoundCloud | Riak CRDTs for counters | Play counts and like counts across global CDN |
| Bet365 | Riak CRDTs for session data | Active-active across 3 datacenters; must never lose a bet |
| League of Legends | Riak CRDTs for player state | 75M+ players; game state must survive datacenter failures |
The CRDT field has exploded since Burckhardt's 2014 monograph. Key developments:
| Development | Year | Significance |
|---|---|---|
| Automerge (Kleppmann et al.) | 2019-2024 | General-purpose CRDT library for JSON documents. Powers local-first apps. |
| Yjs (Yata CRDT) | 2019 | High-performance CRDT for text editing. Used by many collaborative editors. |
| Pure operation-based CRDTs | 2017 | Eliminate need for reliable broadcast; operations carry full causal context. |
| Server-side CRDTs (Redis CRDB) | 2020 | CRDTs deployed in database engines, not just application code. |
| Local-first software (Ink & Switch) | 2019 | Movement toward offline-capable apps using CRDTs as the sync primitive. |
CRDTs are powerful, but they are not a universal solution. Understanding when they break down is as important as knowing how they work.
| Limitation | Why It Happens | Alternative |
|---|---|---|
| Metadata growth | OR-Set tags, version vectors, tombstones accumulate over time | Periodic garbage collection (requires coordination — ironic) |
| No invariants | CRDTs guarantee convergence but NOT that invariants hold during convergence. A "stock count" PN-Counter can go negative. | Application-level validation after merge |
| Semantics mismatch | Some operations don't commute naturally. "Set balance to $100" from two replicas isn't meaningful as a CRDT. | Transform to commutative operations: "add $10" instead of "set $100" |
| Complex data | Lists, trees, and graphs have order-dependent operations that are hard to make conflict-free | Specialized CRDTs (RGA for lists, TreeDoc for trees) — complex and expensive |
| Privacy/access control | CRDTs assume all replicas are trusted. In peer-to-peer settings, untrusted nodes could inject bad operations. | Authenticated CRDTs (research frontier) |
The rule of thumb: CRDTs are for data where "all concurrent operations should be applied" is the right semantics. Counters (count everything), sets (add everything), text (keep all edits). They are NOT for data where operations must be mutually exclusive (seat booking, account overdraft protection).