CRDTs, broadcast protocols, and building systems that never need to wait for agreement.
Three data centers: Virginia, Frankfurt, Tokyo. A user in Tokyo clicks "Add to Cart." The system needs to update the shopping cart, which is replicated across all three regions. The conventional approach: send the write to a leader in Virginia, wait for acknowledgment, then respond to the user.
Round-trip from Tokyo to Virginia: 180 milliseconds. The user stares at a spinner. They click again. Now you have a duplicate write problem on top of the latency problem. Meanwhile, a user in Frankfurt doing the exact same operation gets a response in 40 milliseconds because they're closer to the leader. Same product, wildly different experience.
The fundamental issue is coordination. Every time a node must ask another node "is this okay?" before proceeding, it pays a latency tax proportional to the distance between them. In a single data center, that tax is microseconds. Across continents, it's hundreds of milliseconds. Across a congested or partitioned network, it's seconds or infinity.
What if we could design data structures and algorithms that never need to coordinate? Every replica processes writes locally, immediately, with zero round-trips to any other node. Later, when replicas sync up, they automatically converge to the same state — regardless of the order messages arrived, regardless of duplicates, regardless of which writes each replica has seen so far.
This is coordination avoidance: the art of building distributed systems where replicas can operate independently and still agree in the end. The key tools are CRDTs (Conflict-free Replicated Data Types), causal broadcast protocols, and a theoretical result called the CALM theorem that tells us exactly when coordination is — and isn't — necessary.
Let's quantify the cost. Consider a system with three replicas and two approaches:
| Approach | Write latency (Tokyo) | Availability during partition | Conflict handling |
|---|---|---|---|
| Coordinated (Raft/Paxos) | 180-360ms (must contact majority) | Unavailable if majority unreachable | No conflicts (linearizable) |
| Uncoordinated (CRDTs) | 1-5ms (local write) | Available (all replicas accept writes) | Auto-resolved by merge function |
The difference is dramatic. Coordinated writes are 50-100x slower and become unavailable during network partitions. Uncoordinated writes are always fast, always available, but they require clever data structures that can merge concurrent updates without losing information.
Watch writes propagate in a coordinated vs. uncoordinated system. Notice how the coordinated system must wait for acknowledgments, while the uncoordinated system responds immediately.
Before we build coordination-free data structures, we need to understand the consistency model they provide. When replicas operate independently, they will temporarily disagree about the current state. A user writing to Tokyo sees their update immediately, but a user reading from Frankfurt sees the old value until the update propagates. This model is called eventual consistency.
The formal definition: if no new updates are made to an object, all replicas will eventually return the last updated value. "Eventually" is the key weasel word — it could be 5 milliseconds or 5 minutes, and the definition says nothing about what you'll read in the meantime.
That sounds weak. But eventual consistency comes in different strengths, and understanding the hierarchy is crucial for designing real systems.
| Level | Guarantee | Example |
|---|---|---|
| Strong eventual consistency (SEC) | Replicas that have received the same set of updates are in the same state — regardless of order | CRDTs |
| Causal consistency | If event A caused event B, everyone sees A before B | Vector clock ordering |
| Eventual consistency | Replicas converge "eventually" after updates stop | DNS propagation |
| Strong consistency (linearizability) | Every read returns the most recent write | Raft/Paxos |
The critical distinction is between plain eventual consistency and strong eventual consistency (SEC). Plain eventual means "they'll converge somehow, eventually." SEC means "any two replicas that have processed the same set of updates are guaranteed to be in the same state, right now, with no additional communication needed." SEC is what CRDTs provide, and it's dramatically more useful.
For a merge function to guarantee convergence without coordination, it must satisfy three algebraic properties:
These three properties form a join-semilattice. That's a fancy name for a structure where merging always moves "upward" toward a common state and never moves backward. Think of it like water flowing downhill — no matter what path it takes, it always ends up at the lowest point. A semilattice guarantees that no matter what order updates arrive, the final state is always the same.
Three replicas receive updates in different orders. Watch them converge to the same state. Toggle between a commutative merge (CRDT) and a non-commutative one (naive overwrite) to see the difference.
Let's build the simplest possible CRDT: a counter that can only go up. This is the G-Counter (Grow-only Counter). Think of it as a distributed "like" button — every replica can increment it locally, and when replicas merge, the total is always correct.
The naive approach fails immediately. If three replicas each store a single number and increment it locally, merging by taking the maximum loses increments. Merging by adding counts doubles them. There's no single-number representation that works.
The insight: give each replica its own slot in a vector. Replica A only increments slot A. Replica B only increments slot B. The total count is the sum of all slots. When merging, take the component-wise maximum of each slot.
Each slot is "owned" by exactly one replica. Only replica A ever increments slot A. So slot A is a monotonically increasing counter. Taking the maximum of two values for the same slot always yields the most recent value — because it can only go up.
Let's trace through a concrete example:
python class GCounter: def __init__(self, replica_id, peers): self.id = replica_id self.counts = {p: 0 for p in peers} def increment(self): self.counts[self.id] += 1 def value(self): return sum(self.counts.values()) def merge(self, other): for k in self.counts: self.counts[k] = max(self.counts[k], other.counts[k])
Three replicas with independent G-Counters. Increment each locally, then merge pairs to watch convergence. The vector is shown inside each node.
A G-Counter only goes up. But what about a shopping cart quantity, a vote tally, or a stock level? You need to decrement too. The PN-Counter (Positive-Negative Counter) solves this by maintaining two G-Counters: one for increments (P) and one for decrements (N). The value is P - N.
Think of it as double-entry bookkeeping. Instead of erasing a number and writing a smaller one (which is not monotonic and breaks CRDTs), you record every subtraction as an addition in a separate ledger. The balance is always computable from the two ledgers, and both ledgers only grow.
python class PNCounter: def __init__(self, replica_id, peers): self.P = GCounter(replica_id, peers) self.N = GCounter(replica_id, peers) def increment(self): self.P.increment() def decrement(self): self.N.increment() # Note: incrementing the NEGATIVE counter def value(self): return self.P.value() - self.N.value() def merge(self, other): self.P.merge(other.P) self.N.merge(other.N)
Three replicas with independent PN-Counters. Increment or decrement each, then merge to see both the P and N vectors converge. The bar chart shows each replica's computed value.
Counters are elegant, but most real data isn't a single number. What about a user's display name? An email address? A set of tags? We need CRDTs for arbitrary values and collections.
The LWW-Register (Last-Writer-Wins Register) is the simplest CRDT for single values. Each write carries a timestamp. When merging, the value with the higher timestamp wins. That's it.
The OR-Set (Observed-Remove Set) is a CRDT for sets that handles concurrent add/remove correctly. The key insight: every add operation generates a unique tag. A remove operation only removes the tags it has observed. If a concurrent add creates a new tag that the remover hasn't seen, that tag survives.
Why is a set CRDT hard? Consider a plain set with concurrent operations:
Two replicas with an OR-Set. Add and remove items concurrently, then merge to see how tags determine the final set contents.
CRDTs define what data structures look like and how they merge. But they say nothing about how updates actually get from one replica to another. That's the job of broadcast protocols — the plumbing that delivers updates across the network.
There are three levels of delivery guarantees, each progressively stronger:
Send the message to every replica. If you crash, some replicas might not get it. No retries, no guarantees. This is basically UDP multicast. Useless for anything that matters.
If a non-faulty replica receives a message, then every non-faulty replica eventually receives it. Even if the original sender crashes midway through broadcasting, the other replicas re-broadcast the message to ensure everyone gets it.
Reliable broadcast plus ordering: if message A causally precedes message B (meaning B was sent after seeing A), then every replica delivers A before B. Messages that are concurrent (neither caused the other) can be delivered in any order.
Why does order matter? Consider a chat application:
Gossip protocols (also called epidemic protocols) spread updates the way rumors spread in a crowd. Each node periodically picks a random neighbor and shares its latest updates. This is remarkably effective:
| Property | Value |
|---|---|
| Messages per update | O(n log n) — each node contacts ~log(n) peers |
| Convergence time | O(log n) rounds to reach all nodes |
| Fault tolerance | Works even with high packet loss; redundant paths |
| Drawback | Non-deterministic latency; no ordering guarantee |
Watch an update spread through a network of 12 nodes via gossip. Each round, every informed node contacts one random neighbor. Green = has the update, gray = hasn't received it yet.
CRDTs guarantee convergence. Gossip guarantees delivery. But neither guarantees that updates arrive in an order that makes sense. Causal consistency fills this gap: it ensures that if one event influenced another, every replica sees them in that order.
The tool for tracking causality is the vector clock. Each replica maintains a vector of logical timestamps — one slot per replica, just like the G-Counter. When a replica performs an operation, it increments its own slot. When it receives a message, it merges the incoming vector with its own (component-wise max) and then increments its own slot.
Given two vector clocks V1 and V2:
A replica holds an incoming message in a buffer until all causally preceding messages have been delivered. The check is simple:
Three replicas send messages. The vector clock is shown at each node. Messages are held in a buffer (yellow) until causal dependencies are met, then delivered (green).
We've built coordination-free counters, registers, and sets. But how do we know in general which programs can run without coordination? Is there a test? The CALM theorem (Consistency As Logical Monotonicity) provides exactly this answer.
The theorem, proven by Hellerstein and Alvaro in 2010, states:
Let's unpack "monotonic." A function is monotonic if learning more information only adds to its output — it never takes away. Think of it as a one-way ratchet: once you've concluded something, new data can only add new conclusions, never invalidate old ones.
| Program | Monotonic? | Why |
|---|---|---|
| UNION of two tables | Yes | Adding rows to either input only adds rows to the output |
| SELECT WHERE x > 5 | Yes | Adding new rows can only add new matches, never remove existing ones |
| JOIN of two tables | Yes | New rows in one table can only create new matches |
| COUNT(*) | No! | The count can change with retractions or corrections |
| NOT EXISTS | No! | A row that "doesn't exist" could appear later, invalidating the conclusion |
| MIN/MAX | No! | A new value could become the new min/max |
| Set difference (A - B) | No! | Adding to B can remove items from the result |
CALM gives you a compiler-like test for coordination requirements. When designing a distributed system:
Sometimes you need some coordination but want to minimize it. Quorum systems let you tune the trade-off. In a system with N replicas:
Adjust W and R to see how quorums overlap (or don't). Green = overlap (consistency guaranteed). Red = no overlap (stale reads possible).
Time to put everything together. This interactive playground simulates a cluster of three replicas, each running a full set of CRDTs (G-Counter, PN-Counter, LWW-Register, OR-Set). You can perform operations on any replica, create network partitions, and watch how the CRDTs converge when replicas reconnect.
Three replicas running CRDTs. Perform operations, create partitions, merge, and watch convergence. The convergence status shows whether all replicas agree.
Notice what happens when you partition and do conflicting operations:
| Scenario | Counter behavior | Set behavior |
|---|---|---|
| A increments during partition, B decrements | Both are preserved: net = inc - dec | N/A |
| A adds "x", B adds "y" during partition | N/A | Both survive after merge: {x, y} |
| A adds "x", B removes "x" during partition | N/A | "x" survives (add-wins in OR-Set) |
Coordination avoidance is not a universal solution. It's a powerful tool for specific situations — and understanding when to use it (and when not to) is what separates a good distributed systems engineer from a great one.
| Use Case | CRDT Type | Why It Works |
|---|---|---|
| Like/upvote counters | G-Counter | Only increments, high availability needed |
| Shopping cart quantities | PN-Counter | Add/remove items, availability over accuracy |
| Collaborative text editing | Sequence CRDT (RGA, LSEQ) | Concurrent inserts/deletes merge automatically |
| User presence/status | LWW-Register | Only latest status matters |
| Tag sets, feature flags | OR-Set | Concurrent add/remove with clear semantics |
| Use Case | Why CRDTs Fail | Use Instead |
|---|---|---|
| Bank transfers | Balance can't go negative — needs coordination | Serializable transactions |
| Unique username registration | Uniqueness is inherently non-monotonic | Consensus (Raft/Paxos) |
| Inventory with stock limits | Bounded counters need periodic coordination | Escrow/reservation pattern |
| Ordered event logs | Total order requires coordination | Replicated log (Kafka, Raft) |
Where different approaches sit on the coordination-consistency spectrum.
Related lessons:
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport