No global clock, no global state, no reliable failure detection — how do distributed systems agree on what happened and when?
You are sitting at your desk with one computer. You write event A to a log file, then write event B. When you read the log, A comes before B. Always. The clock on the wall ticks forward. Always. If your program crashes, you know it crashed because it stopped responding.
Now give three computers to three engineers in three different cities. Engineer 1 writes event A. Engineer 2 writes event B. You ask: "Which happened first?" And the universe answers: "That question doesn't have a well-defined answer."
This is not a bug. It is physics. Special relativity tells us that two spatially separated events have no absolute ordering unless one causally influenced the other. Light takes 67 milliseconds to travel from New York to London. No information can travel faster. There is no cosmic "now" that all computers share.
This isn't just a theoretical concern. Consider a real scenario: you have a user in New York and a user in London, both editing the same Google Doc. New York types "Hello" at local time 12:00:00.000. London types "World" at local time 17:00:00.005 (which is 12:00:00.005 EST — 5 milliseconds later). The speed-of-light round trip between New York and London is ~67ms. So when London typed "World," the information that New York typed "Hello" could not possibly have reached London yet. These edits are concurrent in the physics sense — neither could have been influenced by the other.
Google Docs handles this using Operational Transformation (OT), which explicitly reasons about concurrent edits. But the fundamental problem remains: simultaneity is relative, and distributed systems must deal with this reality.
Before we can reason about what is possible, we need to agree on the rules of the game. A system model is a set of assumptions about timing and failures. Here are the three timing models:
| Model | Timing Guarantee | Realism |
|---|---|---|
| Synchronous | Messages arrive within a known bound Δ. Clocks drift by at most ρ. Processes take at most τ to respond. | Almost never true in real networks. Useful for theoretical lower bounds. |
| Asynchronous | No timing guarantees whatsoever. Messages can take arbitrarily long. Processes can pause indefinitely. | The safest assumption. FLP impossibility lives here: you cannot solve consensus deterministically. |
| Partially synchronous | The system behaves asynchronously for some unknown period, then eventually becomes synchronous (messages arrive within Δ). | The sweet spot. Most practical algorithms (Raft, PBFT) assume this model. |
And three failure models, from mildest to most adversarial:
| Failure Model | What Can Go Wrong | Example |
|---|---|---|
| Crash-stop | A node stops forever. It never sends incorrect messages — it just vanishes. | Power failure, OOM kill, kernel panic. |
| Crash-recovery | A node crashes but may restart later with its persistent state intact. It may have lost in-memory state. | Process restart, reboot with disk intact. |
| Byzantine | A node can do anything: send wrong messages, lie about its state, collude with other faulty nodes. | Compromised node, hardware corruption, malicious actor. |
Most practical systems assume partially synchronous timing with crash-recovery failures. This lesson focuses on that regime — it is where all the interesting engineering happens.
The simulation below shows three nodes trying to agree on the order of events. Each node has its own clock. The clocks drift. Messages take variable time. Click "Run" to see the mess, then click "Show True Order" to reveal what actually happened versus what each node thinks happened.
Three nodes each record events with their local clocks. Messages between them take random time. Watch how each node's view of "what happened when" diverges from reality.
This is the core problem of this entire lesson. We need ways to detect failures (Chapter 1), measure time (Chapters 2-5), and order events (Chapter 6) in a system where none of these things come for free.
Failure detection, time, and ordering are not separate topics — they are three facets of the same fundamental challenge. Failure detection requires timeouts, which require clocks. Ordering events requires timestamps, which require clocks. And clocks themselves are unreliable in distributed systems.
Think of it as a dependency chain:
This interconnection is why we cover all three in one lesson. You cannot understand failure detection without understanding time, and you cannot understand time without understanding the ordering problem.
In 1985, Fischer, Lynch, and Paterson proved a devastating result: in a purely asynchronous system where even one process can crash, there is no deterministic algorithm that solves consensus. This is the FLP impossibility result.
Why? Because in an asynchronous system, you cannot distinguish a crashed process from a slow one. Any algorithm that waits for a response might wait forever (the process crashed) or might incorrectly proceed without a live process (it was just slow). This fundamental ambiguity makes consensus impossible to guarantee.
The practical escape hatch: assume partial synchrony. In this model, the system is asynchronous for some unknown period, but eventually becomes synchronous. This allows algorithms like Paxos and Raft to make progress during synchronous periods while remaining safe during asynchronous ones.
The system model you assume determines what algorithms you can use and what guarantees they provide. Here is the landscape:
| System Model | Failure Model | What You Can Solve | Example Algorithms |
|---|---|---|---|
| Synchronous | Crash-stop | Everything: consensus, atomic broadcast, perfect failure detection | Synchronous consensus (textbook only) |
| Synchronous | Byzantine | Consensus with n ≥ 3f + 1 nodes | DLS (Dwork-Lynch-Stockmeyer) |
| Partially synchronous | Crash-stop | Consensus, total order broadcast (eventually) | Paxos, Raft, Viewstamped Replication |
| Partially synchronous | Byzantine | Consensus with n ≥ 3f + 1 (eventually) | PBFT, HotStuff, Tendermint |
| Asynchronous | Crash-stop | NOT consensus (FLP). CAN do reliable broadcast, causal broadcast. | Reliable broadcast protocols |
| Asynchronous | Byzantine | Very limited. Can do Byzantine reliable broadcast with n ≥ 3f + 1. | Bracha's broadcast |
The key takeaway: the partially synchronous crash-recovery model is where all practical distributed systems live. It is strong enough to solve consensus (unlike pure asynchronous) but realistic enough to model actual networks (unlike pure synchronous). This is why Raft has become the de facto standard: it explicitly assumes partial synchrony and crash-recovery.
You send a request to node B. You wait. And wait. No response. Is B dead? Is the network down? Is B just slow? You have exactly one tool to make this judgment: a timeout.
A timeout says: "If I don't hear back within T milliseconds, I will suspect the remote node has failed." Notice the word suspect. You never know. In an asynchronous system, there is no timeout value that can distinguish a crashed node from a slow one. This is not a limitation of your code — it is a fundamental impossibility.
The simplest failure detection scheme: every node periodically sends an "I'm alive" message — a heartbeat — to its peers. If you don't receive a heartbeat from node B within some timeout T, you suspect B has failed.
The trouble is choosing T. Set T too low and you get false positives: a node with a brief GC pause gets declared dead, triggering an expensive failover. Set T too high and you get slow detection: a genuinely dead node isn't noticed for seconds, during which clients see errors.
To make this concrete: suppose your heartbeat interval is 200ms and your timeout is 500ms. If the remote node has a 600ms GC pause, you will declare it dead — a false positive. Now suppose you set the timeout to 2000ms. The GC pause goes unnoticed, but if the node actually crashes, you won't know for 2 seconds. During those 2 seconds, every request routed to that dead node fails.
Real-world impact: in a load-balanced system with 10 backends and 10,000 requests/second, each backend handles 1,000 req/s. A 2-second detection delay means 2,000 failed requests. A false positive that triggers a 30-second rebalance costs 30,000 requests worth of reduced capacity. These are the numbers that drive the engineering decision.
Instead of a binary "alive or dead" decision, the phi accrual failure detector (Hayashibara et al., 2004) outputs a continuous suspicion level φ. The idea is elegant:
1. Track the inter-arrival times of heartbeats from each peer. Over time, you build a distribution of normal arrival intervals.
2. When a new heartbeat is late, compute the probability that a heartbeat would be this late if the node were still alive. The later it is, the less likely.
3. φ = -log10(probability the node is alive). So φ = 1 means there's a 10% chance the node is alive. φ = 2 means 1%. φ = 3 means 0.1%.
Where F is the CDF of the heartbeat inter-arrival distribution (typically modeled as a normal distribution with the observed mean μ and standard deviation σ).
The application decides the threshold: "suspect at φ ≥ 8" (very conservative — almost certainly dead) or "φ ≥ 1" (aggressive — might just be slow). Cassandra uses this approach with a default threshold of φ = 8.
The simulation below shows a failure detector monitoring a remote node. Drag the timeout slider to see the tradeoff. The node occasionally has GC pauses (which look like failures but aren't) and eventually actually crashes.
Adjust the timeout. Watch how short timeouts cause false alarms during GC pauses, while long timeouts delay real failure detection. The node will crash at a random time.
In 1996, Chandra and Toueg formalized the notion of failure detectors into classes based on two properties: completeness (do we eventually suspect every failed process?) and accuracy (do we avoid suspecting correct processes?).
| Class | Completeness | Accuracy | Intuition |
|---|---|---|---|
| Perfect (P) | Strong: every crashed process is eventually suspected by every correct process | Strong: no correct process is ever suspected | Only possible in synchronous systems |
| Eventually Perfect (◇P) | Strong | Eventually strong: after some unknown time, no correct process is suspected | The practical target. Achievable in partially synchronous systems. |
| Strong (S) | Strong | Weak: at least one correct process is never suspected | Enough to solve consensus! |
| Eventually Strong (◇S) | Strong | Eventually weak | The weakest detector that can solve consensus. This is what Raft/Paxos implicitly assume. |
The remarkable result: an eventually strong failure detector ◇S is sufficient to solve consensus. You don't need to perfectly detect failures — you just need to eventually stop suspecting at least one correct process. This is why Raft works in practice even with imperfect failure detection.
Real systems combine multiple signals. TCP keepalives detect broken connections. Application-level heartbeats detect process hangs. Load balancers track response latency percentiles. Kubernetes uses liveness and readiness probes — an HTTP endpoint the kubelet polls. If it returns non-200 three times in a row, the pod is restarted.
The key insight: failure detection is not a boolean. It is a spectrum of confidence. The phi accrual detector makes this explicit, but even simple timeout-based systems should treat "suspected failure" differently from "confirmed failure" (e.g., stop routing new requests immediately, but wait before reassigning the node's work).
You are designing a microservice mesh. Your p99 latency is 200ms. Your GC pauses last up to 500ms. How do you choose a heartbeat interval and timeout?
Step 1: Heartbeat interval. You want heartbeats frequent enough to detect failures quickly, but not so frequent they add significant network overhead. A common rule of thumb: heartbeat every T/3 seconds, where T is your timeout. If you have 1000 nodes, each sending heartbeats to 5 peers, that is 5000 heartbeats per interval — tolerable at ~50-byte packets.
Step 2: Timeout threshold. Your timeout must be longer than the longest non-failure delay. GC pauses reach 500ms. Network p99 is 200ms. So a heartbeat might be delayed by up to ~700ms in the worst non-failure case. Set timeout to 1000ms (1.4x the worst case) for some safety margin.
Step 3: Consequences. With a 1000ms timeout and 333ms heartbeat interval, a truly dead node is detected in 1000-1333ms (one timeout period after the last heartbeat). False positives should be rare since the timeout exceeds the worst GC + network delay.
Step 4: Tune based on data. In production, track the distribution of heartbeat inter-arrival times. If your p99.9 is 400ms but you occasionally see 900ms, consider increasing the timeout or switching to a phi accrual detector that adapts automatically.
python import time, math, statistics class PhiAccrualDetector: """Phi accrual failure detector (Hayashibara 2004).""" def __init__(self, window=100): self.intervals = [] # recent heartbeat inter-arrival times self.window = window self.last_ts = None def heartbeat(self): """Call when a heartbeat arrives.""" now = time.monotonic() if self.last_ts is not None: self.intervals.append(now - self.last_ts) if len(self.intervals) > self.window: self.intervals.pop(0) self.last_ts = now def phi(self): """Return current suspicion level.""" if not self.intervals or self.last_ts is None: return 0.0 elapsed = time.monotonic() - self.last_ts mu = statistics.mean(self.intervals) sigma = statistics.stdev(self.intervals) if len(self.intervals) > 1 else mu * 0.1 # CDF of normal distribution z = (elapsed - mu) / max(sigma, 1e-9) p_alive = 0.5 * (1 - math.erf(z / math.sqrt(2))) return -math.log10(max(p_alive, 1e-15)) # Usage: detector = PhiAccrualDetector() # In heartbeat receiver loop: detector.heartbeat() # When checking health: if detector.phi() > 8: print("Node suspected dead (phi > 8)")
Every computer has a clock. It is a quartz crystal oscillating at a known frequency — typically 32,768 Hz. A counter increments on each oscillation, and software converts the count into a timestamp. Simple, right?
The problem is that "known frequency" is only approximately known. Quartz crystals drift. Temperature changes the oscillation frequency. A typical server-grade crystal drifts by 10-20 parts per million (ppm). That is 10-20 microseconds per second, or 0.86 to 1.73 seconds per day.
Clock drift is the rate at which a clock deviates from a reference. If your clock gains 15 μs every second, your drift rate is 15 ppm. Clock skew is the absolute difference between two clocks at a point in time. If node A's clock reads 12:00:00.000 and node B's reads 12:00:00.047, the skew is 47 ms.
Drift is the derivative; skew is the integral. Even if you synchronize two clocks perfectly at time t=0, they immediately begin drifting apart. After one day at 15 ppm drift each (in opposite directions), the skew can reach 2.6 seconds.
NTP synchronizes clocks over the network. It works by exchanging timestamps with a time server:
NTP on a LAN typically achieves 1-10 ms accuracy. Over the internet, 10-100 ms. This sounds precise, but in a database doing thousands of transactions per second, 10 ms of uncertainty means thousands of transactions whose ordering is ambiguous.
Let's walk through a concrete NTP exchange. Suppose:
| Timestamp | Value | Meaning |
|---|---|---|
| t1 | 100.000 | Client sends request (client clock) |
| t2 | 100.005 | Server receives request (server clock) |
| t3 | 100.006 | Server sends response (server clock) |
| t4 | 100.010 | Client receives response (client clock) |
Round-trip delay: δ = (t4 - t1) - (t3 - t2) = (100.010 - 100.000) - (100.006 - 100.005) = 0.010 - 0.001 = 0.009 seconds.
Clock offset: θ = ((t2 - t1) + (t3 - t4)) / 2 = ((100.005 - 100.000) + (100.006 - 100.010)) / 2 = (0.005 + (-0.004)) / 2 = 0.0005 seconds.
This means the client's clock is about 0.5 ms behind the server. NTP will slew the client's clock forward by this amount. But note the hidden assumption: NTP assumes the network delay is symmetric (same in both directions). If the request took 7ms but the response took 2ms, the formula gives the wrong offset. Asymmetric delays are the primary source of NTP error in practice.
NTP organizes time sources into a hierarchy of strata:
| Stratum | Source | Typical Accuracy |
|---|---|---|
| 0 | Atomic clocks, GPS receivers (the hardware itself) | ~10 ns |
| 1 | Servers directly connected to stratum-0 hardware | ~1 μs |
| 2 | Servers syncing from stratum-1 over network | ~1-10 ms (LAN), ~10-100 ms (internet) |
| 3+ | Each additional hop adds uncertainty | Degrades with each level |
Most datacenter servers are stratum 2 or 3. Public NTP pools (pool.ntp.org) are stratum 2-3. Every network hop adds asymmetric delay that NTP cannot fully compensate for — it assumes the request and response take the same time, which is often false.
GPS receivers achieve ~100 nanosecond accuracy by triangulating signals from satellites carrying atomic clocks. Atomic clocks (cesium or rubidium) drift by less than 1 nanosecond per day. Google and Amazon put atomic clocks and GPS receivers in every datacenter.
Why both? GPS requires a roof antenna and clear sky view — it can fail if the antenna is obstructed or during rare GPS outages. Atomic clocks don't need external signals but drift (slowly). By using both, you get continuous accuracy even if GPS drops out temporarily.
Google's TrueTime API doesn't return a single timestamp. It returns a confidence interval: [earliest, latest]. The call TT.now() returns an interval that is guaranteed to contain the true time.
Where ε is typically 1-7 milliseconds. Spanner uses this to implement commit-wait: after committing a transaction at time t, it waits until TrueTime guarantees that t is in the past (i.e., waits 2ε). This ensures that any later transaction sees the commit — because its start time is guaranteed to be after t.
Spanner's commit-wait deserves a deeper look because it illustrates how physical time bounds enable strong consistency guarantees.
Here is the protocol step by step:
Why does this work? After commit-wait, we know that the real time is past s. Any future transaction's start timestamp is at least TT.now().earliest at the time it starts. Since real time is past s and TT.now().earliest ≤ real time, we can't be sure yet... but by the time the future transaction observes TT.now(), its earliest bound will be ≥ s. Therefore the future transaction's snapshot reads at timestamp > s will see our committed writes.
The cost is latency. If ε is 7ms, commit-wait takes up to 14ms. Google has driven ε down by deploying GPS receivers and atomic clocks in every datacenter, directly reducing write latency. This is why Google invests heavily in time infrastructure — better clocks = lower write latency = faster Spanner.
Three nodes start synchronized. Watch their clocks drift apart over simulated time. The slider controls the drift rate. NTP corrections periodically pull them back, but never perfectly.
Each node's clock drifts at a different rate. NTP syncs happen periodically (yellow flash) but leave residual skew. Watch the gap grow.
python import time class TrueTime: """Simplified TrueTime API (Google Spanner style).""" def __init__(self, epsilon_ms=7.0): self.epsilon = epsilon_ms / 1000.0 # convert to seconds def now(self): """Returns (earliest, latest) bounding true time.""" t = time.time() return (t - self.epsilon, t + self.epsilon) def after(self, t): """True if t is definitely in the past.""" earliest, _ = self.now() return earliest > t def before(self, t): """True if t is definitely in the future.""" _, latest = self.now() return latest < t # Commit-wait: ensures linearizability tt = TrueTime(epsilon_ms=7.0) _, commit_ts = tt.now() # pick latest bound as commit time while not tt.after(commit_ts): time.sleep(0.001) # wait until commit_ts is definitely past # Now safe to release — any future transaction will see this commit
Physical clocks measure wall time and drift. What if we gave up on measuring real time entirely and instead tracked only the order of events? This is Leslie Lamport's key insight from his 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System" — one of the most cited papers in computer science.
Lamport defined a relation called happened-before, written a → b. It means "event a could have causally influenced event b." The rules are simple:
Rule 1 (Process order): If a and b are events in the same process, and a comes before b, then a → b.
Rule 2 (Message passing): If a is the sending of a message and b is the receipt of that same message, then a → b. (A message cannot be received before it is sent.)
Rule 3 (Transitivity): If a → b and b → c, then a → c.
If neither a → b nor b → a, the events are concurrent, written a || b. This means neither could have influenced the other. There is no meaningful "first" — they happened independently.
A Lamport timestamp (or Lamport clock) is a single integer counter that each process maintains. The algorithm:
This gives us a key property: if a → b, then L(a) < L(b). Causality implies timestamp order. But the converse is NOT true: L(a) < L(b) does NOT mean a → b. Two concurrent events can have different Lamport timestamps purely by chance.
Lamport timestamps alone don't give a total order — two events can have the same timestamp (e.g., two processes both at L=3 doing local events). To break ties, append the process ID: compare by (L, pid). This gives a total order that is consistent with causality.
This total order is used in Lamport's mutual exclusion algorithm and in distributed systems that need a deterministic tiebreaker (e.g., which of two conflicting writes "wins").
But remember: this total order is arbitrary for concurrent events. If P0 and P2 both have L=3 and we use pid as tiebreaker, P0's event comes "first" — but this is a convention, not a physical reality. The events are causally independent; we just need a consistent way to order them for correctness.
Let's trace a complete execution step by step:
| Time | Event | P0 Clock | P1 Clock | P2 Clock | Rule Applied |
|---|---|---|---|---|---|
| 1 | P0: local | 1 | 0 | 0 | Increment: 0+1=1 |
| 2 | P0: send(m1) to P1 | 2 | 0 | 0 | Increment: 1+1=2. m1 carries L=2 |
| 3 | P2: local | 2 | 0 | 1 | Increment: 0+1=1 |
| 4 | P1: receive(m1) | 2 | 3 | 1 | max(0, 2)+1 = 3 |
| 5 | P1: send(m2) to P2 | 2 | 4 | 1 | Increment: 3+1=4. m2 carries L=4 |
| 6 | P2: receive(m2) | 2 | 4 | 5 | max(1, 4)+1 = 5 |
Now check: Is P2's event at step 3 (L=1) before P0's event at step 2 (L=2)? Lamport says L(3) < L(2), i.e., 1 < 2, so "yes." But are they actually causally related? No! P2's local event and P0's send are concurrent — P2 had no way to know about P0's send. Lamport clocks ordered them, but the ordering is meaningless for these two events.
The classic way to visualize distributed events. Each process is a vertical line. Time flows downward. Arrows are messages. Watch the Lamport counters update as events and messages occur.
Three processes exchange messages. Watch Lamport timestamps update according to the rules. Click "Step" to advance one event at a time, or "Play" for automatic progression.
python class LamportClock: """Lamport logical clock (1978).""" def __init__(self): self.time = 0 def local_event(self): """Internal event or send.""" self.time += 1 return self.time def send(self): """Increment and return timestamp to attach to message.""" self.time += 1 return self.time def receive(self, msg_timestamp): """Update clock on message receipt.""" self.time = max(self.time, msg_timestamp) + 1 return self.time # Example: 3 processes p1, p2, p3 = LamportClock(), LamportClock(), LamportClock() # p1 does local event (L=1), sends to p2 ts1 = p1.send() # p1.time = 1 ts2 = p2.receive(ts1) # p2.time = max(0,1)+1 = 2 # p2 sends to p3 ts3 = p2.send() # p2.time = 3 ts4 = p3.receive(ts3) # p3.time = max(0,3)+1 = 4 # p1 does local event ts5 = p1.local_event() # p1.time = 2 # p1.time=2 vs p3.time=4: is p1's event before p3's? # L(p1)=2 < L(p3)=4, but they are CONCURRENT — no causal path!
Lamport clocks give us one direction: if a → b then L(a) < L(b). But we want the converse too. We want: V(a) < V(b) if and only if a → b. To get this, each process must track not just one counter, but a vector of counters — one for every process in the system.
Each process Pi maintains a vector V[0..n-1], where n is the number of processes. V[i] counts how many events Pi has observed. The rules:
Rule 1 (Local event): Pi increments V[i] by 1.
Rule 2 (Send): Pi increments V[i] by 1, then attaches the entire vector V to the message.
Rule 3 (Receive): Pi sets V[j] = max(V[j], Vmsg[j]) for every j, then increments V[i] by 1.
The magic is in the comparison rules:
| Comparison | Definition | Meaning |
|---|---|---|
| V(a) ≤ V(b) | V(a)[i] ≤ V(b)[i] for ALL i | a happened-before b (or same event) |
| V(a) < V(b) | V(a) ≤ V(b) AND V(a) ≠ V(b) | a strictly happened-before b: a → b |
| V(a) || V(b) | Neither V(a) ≤ V(b) nor V(b) ≤ V(a) | a and b are concurrent: a || b |
Think of it this way: each process's counter tells you "how much of that process's history I have seen." If my vector is [3, 2, 1] and yours is [3, 4, 1], I have seen more of P0's history than you need, but you have seen more of P1's history than I have. Neither dominates — we have independent knowledge. Our events are concurrent.
Let's trace through a concrete scenario:
| Step | Event | P0's Vector | P1's Vector | P2's Vector |
|---|---|---|---|---|
| 1 | P0 local event | [1,0,0] | [0,0,0] | [0,0,0] |
| 2 | P0 sends to P1 | [2,0,0] | [0,0,0] | [0,0,0] |
| 3 | P1 receives from P0 | [2,0,0] | [2,1,0] | [0,0,0] |
| 4 | P2 local event | [2,0,0] | [2,1,0] | [0,0,1] |
| 5 | P1 sends to P2 | [2,0,0] | [2,2,0] | [0,0,1] |
| 6 | P2 receives from P1 | [2,0,0] | [2,2,0] | [2,2,2] |
Now compare step 4 (P2 at [0,0,1]) with step 2 (P0 at [2,0,0]). P0[0]=2 > P2[0]=0, but P2[2]=1 > P0[2]=0. Neither dominates. These events are concurrent. Correct — P2's local event had no causal connection to P0's send.
Compare step 3 (P1 at [2,1,0]) with step 1 (P0 at [1,0,0]). P1[0]=2 ≥ P0[0]=1, P1[1]=1 ≥ P0[1]=0, P1[2]=0 ≥ P0[2]=0. P1 dominates. So step 1 → step 3. Correct — there is a causal chain P0 → message → P1.
Watch vector clocks update across three processes. Compare with what Lamport clocks would show — notice cases where Lamport says "a < b" but the events are actually concurrent.
Three processes exchange messages. Each event shows both its Lamport timestamp and vector clock. Concurrent events (detectable only by vector clocks) are highlighted in yellow.
Vector clocks have a practical problem: the vector size is O(n) where n is the number of processes. In a system with 10,000 nodes, every message carries a 10,000-entry vector. This is why systems like Dynamo use version vectors (a variant that tracks only the servers that actually wrote the data) and why many systems settle for Lamport clocks plus application-level conflict resolution.
A version vector is like a vector clock but indexed by writer ID rather than process ID. If you have 10,000 nodes but only 3 replicas that can write a particular key, the version vector for that key has only 3 entries — not 10,000.
Amazon Dynamo uses this approach. When two replicas independently write to the same key, their version vectors become incomparable (concurrent). Dynamo detects this and either:
Standard version vectors have a subtle problem: sibling explosion. If two clients repeatedly write to the same key on different replicas without reading first, the number of "concurrent" versions grows unboundedly. Each write creates a new sibling because it can't dominate any existing version.
Dotted version vectors (Preguica et al., 2012) fix this by tracking not just the version counter per replica, but also a "dot" — the specific event that created each version. When a replica receives a write, it can precisely identify which previous version the write supersedes, even if the client didn't read first.
Riak adopted dotted version vectors in Riak 2.0 specifically to solve sibling explosion. The improvement was dramatic: in workloads with blind writes, the number of retained siblings dropped from hundreds to at most one per concurrent writer.
If you are implementing vector clocks in a real system, here are the engineering decisions you will face:
| Decision | Options | Recommendation |
|---|---|---|
| Vector size | Full n-entry vs. sparse (only non-zero entries) | Sparse. Use a HashMap instead of array. Most entries are zero in practice. |
| Pruning | Keep all entries vs. trim old ones | Trim entries for nodes that have been dead for > T hours. Losing an entry means you might not detect some concurrent writes, but this is rare and the space savings are significant. |
| Serialization | JSON vs. binary | Binary (protobuf or msgpack). A 1000-entry vector clock as JSON is ~10KB; as packed uint64s it's 8KB. The savings matter when every message carries a vector. |
| Comparison cost | O(n) per comparison | Unavoidable. But if n is small (version vectors with 3-5 replicas), this is trivial. For large n, consider Bloom clocks as a probabilistic alternative. |
python class SparseVectorClock: """Space-efficient vector clock using a dict.""" def __init__(self, node_id): self.node_id = node_id self.v = {} # only store non-zero entries def increment(self): self.v[self.node_id] = self.v.get(self.node_id, 0) + 1 def merge(self, other_v): """Merge another vector clock into this one.""" for k, val in other_v.items(): self.v[k] = max(self.v.get(k, 0), val) self.increment() def to_dict(self): return dict(self.v) @staticmethod def compare(a, b): """Compare two sparse vector clocks (dicts).""" all_keys = set(a) | set(b) le = all(a.get(k, 0) <= b.get(k, 0) for k in all_keys) ge = all(a.get(k, 0) >= b.get(k, 0) for k in all_keys) if le and ge: return "equal" if le: return "before" if ge: return "after" return "concurrent" # Example: 10000 nodes, but only 3 wrote this key vc = SparseVectorClock("node-7392") vc.increment() # {"node-7392": 1} — just 1 entry, not 10000 vc.merge({"node-42": 5, "node-7392": 1}) # {"node-7392": 2, "node-42": 5} — still just 2 entries
python class VectorClock: """Vector clock for n processes.""" def __init__(self, pid, n): self.pid = pid self.v = [0] * n def local_event(self): self.v[self.pid] += 1 return list(self.v) def send(self): self.v[self.pid] += 1 return list(self.v) # attach to message def receive(self, msg_v): for j in range(len(self.v)): self.v[j] = max(self.v[j], msg_v[j]) self.v[self.pid] += 1 return list(self.v) @staticmethod def compare(va, vb): """Returns 'before', 'after', 'equal', or 'concurrent'.""" le = all(a <= b for a, b in zip(va, vb)) ge = all(a >= b for a, b in zip(va, vb)) if le and ge: return "equal" if le: return "before" # va happened-before vb if ge: return "after" # vb happened-before va return "concurrent" # incomparable # Example p0 = VectorClock(0, 3) p1 = VectorClock(1, 3) p2 = VectorClock(2, 3) v_a = p0.send() # [1,0,0] v_b = p1.receive(v_a) # [1,1,0] v_c = p2.local_event() # [0,0,1] print(VectorClock.compare(v_a, v_b)) # "before" print(VectorClock.compare(v_a, v_c)) # "concurrent" print(VectorClock.compare(v_b, v_c)) # "concurrent"
We have seen three approaches: physical clocks (approximate real time, no causality guarantee), Lamport clocks (causality but no real time, cannot detect concurrency), and vector clocks (full causality + concurrency detection but O(n) overhead). Is there a middle ground?
The Hybrid Logical Clock (HLC), introduced by Kulkarni et al. in 2014, combines physical and logical time. The key idea: use physical time when possible, but fall back to a logical counter when physical time hasn't advanced enough to maintain the happened-before property.
An HLC timestamp has two components:
Where l is the best estimate of physical time (always ≥ the local physical clock) and c is a logical counter that breaks ties when l values are equal.
The algorithm:
The beauty: HLC timestamps are always within ε of physical time (where ε is the max clock skew), they never go backward, and they guarantee the happened-before property just like Lamport clocks.
CockroachDB uses HLC timestamps for MVCC (multi-version concurrency control). Every key-value pair is stamped with an HLC. Reads at timestamp T see all writes with HLC ≤ T. This lets CockroachDB provide serializable transactions across a geo-distributed cluster.
How does CockroachDB handle the fact that HLC can't detect concurrency? It uses a technique called read refreshing. If a transaction reads a key at HLC timestamp T1 and later discovers a write at T2 where T1 < T2 < T_commit, it checks whether the read value has changed. If not, the transaction can proceed with its commit timestamp pushed forward. If the value changed, the transaction must restart. This is more efficient than requiring vector clocks on every key.
Spanner uses TrueTime (physical time with error bounds) instead of HLC, but the principle is similar: combine physical time awareness with causal ordering guarantees. The key difference is that TrueTime makes the uncertainty explicit (the ε bound), while HLC hides it inside the logical counter c.
| Property | Guarantee |
|---|---|
| Causality | If e1 → e2, then HLC(e1) < HLC(e2). Same as Lamport. |
| Bounded drift | |l - pt| ≤ ε where ε is the max clock skew in the system. HLC never drifts further from real time than the worst physical clock. |
| Monotonicity | HLC timestamps at a single node are strictly increasing. No backward jumps. |
| Compactness | O(1) — just two integers (l, c) regardless of cluster size. |
| Compatibility | HLC timestamps can be compared with physical timestamps (just compare l components). This makes them usable alongside NTP-synchronized logs. |
Let's trace an HLC exchange between two nodes with drifting clocks. Node A's physical time is 100, Node B's is 95 (5ms behind).
| Step | Event | A's PT | A's HLC | B's PT | B's HLC |
|---|---|---|---|---|---|
| 1 | A: local event | 100 | (100, 0) | 95 | (0, 0) |
| 2 | A: send to B | 101 | (101, 0) | 96 | (0, 0) |
| 3 | B: recv from A | — | (101, 0) | 97 | (101, 1) |
| 4 | B: local event | — | (101, 0) | 98 | (101, 2) |
| 5 | B: local event | — | (101, 0) | 102 | (102, 0) |
At step 3, B receives A's message with (101, 0). B's physical time is only 97. So l = max(0, 101, 97) = 101 (the message's l wins). Since l tied with msg_l, c = msg_c + 1 = 1. Result: (101, 1).
At step 4, B does a local event. Its PT is 98, which is less than l=101. So l stays 101, and c increments to 2. Result: (101, 2).
At step 5, B's physical time has caught up to 102, which exceeds l=101. So l advances to 102 and c resets to 0. Result: (102, 0). The HLC has snapped back to physical time — the logical counter only accumulates when physical time isn't advancing fast enough.
Notice: the ordering is correct: (100,0) < (101,0) < (101,1) < (101,2) < (102,0). And every HLC timestamp is within 5ms of the true physical time — the bounded drift property holds.
Three nodes with drifting physical clocks. Watch how HLC timestamps track close to real time while maintaining causal ordering. Compare with raw Lamport counters below each event.
python import time class HybridLogicalClock: """Hybrid Logical Clock (Kulkarni et al., 2014).""" def __init__(self): self.l = 0 # logical physical time component self.c = 0 # logical counter def _pt(self): """Physical time in milliseconds.""" return int(time.time() * 1000) def local_event(self): """Local event or send.""" pt = self._pt() l_old = self.l self.l = max(l_old, pt) if self.l == l_old: self.c += 1 else: self.c = 0 return (self.l, self.c) def send(self): """Same as local_event — attach (l, c) to message.""" return self.local_event() def receive(self, msg_l, msg_c): """Receive message with HLC timestamp (msg_l, msg_c).""" pt = self._pt() l_old = self.l self.l = max(l_old, msg_l, pt) if self.l == l_old == msg_l: self.c = max(self.c, msg_c) + 1 elif self.l == l_old: self.c += 1 elif self.l == msg_l: self.c = msg_c + 1 else: self.c = 0 return (self.l, self.c) # HLC comparison: (l1, c1) < (l2, c2) iff l1 < l2, or l1==l2 and c1 < c2 def hlc_before(ts1, ts2): return ts1[0] < ts2[0] or (ts1[0] == ts2[0] and ts1[1] < ts2[1])
This is the big simulation. Three nodes exchange messages in real time. You can toggle between four clock models — physical time, Lamport timestamps, vector clocks, and HLC — and see exactly which events are correctly ordered, which are ambiguous, and which are flat-out wrong under each model.
Three nodes generate events and exchange messages. Switch between clock models to see how each handles ordering. Green = correctly ordered pair. Yellow = ambiguous (concurrent but clock says ordered). Red = incorrectly ordered.
1. Click "Generate Scenario" to create a random sequence of local events, sends, and receives across three nodes. Each node has a physical clock with the configured skew.
2. Use "Step" or "Play All" to reveal events one at a time. Each event is colored based on whether the currently selected clock model orders it correctly relative to the previous event.
3. Switch the clock model using the dropdown. The same events are re-evaluated under a different clock. Watch the green/yellow/red coloring change.
4. Increase the clock skew slider and generate a new scenario. Higher skew makes physical time ordering worse, but logical and hybrid clocks remain unaffected.
5. Try to find a scenario where physical time gets an ordering wrong (red) but HLC gets it right (green). This demonstrates why you should never use physical timestamps alone for causal ordering.
| Clock Model | Strengths | Weaknesses You Will See |
|---|---|---|
| Physical Time | Human-readable, small metadata | Clock skew causes events to appear in wrong order. Two events 10ms apart on different nodes might be ordered backwards. |
| Lamport | Respects causality (a → b implies L(a) < L(b)) | Labels concurrent events with an arbitrary order. You can't tell if L(5) < L(8) means causality or coincidence. |
| Vector | Full causality detection. Identifies concurrent events. | O(n) size per timestamp. The display gets crowded with 3 nodes — imagine 1000. |
| HLC | Causality + close to physical time + O(1) size | Cannot detect concurrency (same as Lamport). But timestamps are meaningful. |
Experiment 1: High skew + physical time. Set clock skew to 150ms. Generate a scenario. Step through with "Physical Time" selected. Count the red (incorrectly ordered) events. Now switch to "Lamport" — the red events should disappear (Lamport respects causality). Switch to "Vector" — some events turn yellow (concurrent, correctly identified).
Experiment 2: Zero skew. Set clock skew to 0ms. Physical time should now be mostly green, because with no skew, physical timestamps match true ordering. But watch carefully — even with zero skew, message delays can cause ordering issues if two events on different nodes happen very close together.
Experiment 3: Compare Lamport and HLC. With any skew, switch between "Lamport" and "HLC." The coloring should be identical (both provide the same causal guarantees). The difference is in the timestamp values: HLC values are close to wall time, while Lamport values are arbitrary integers. This matters for debugging — an HLC timestamp of (1716000050000, 3) tells you "about May 18, 2024 at 1:40 PM" while a Lamport timestamp of 47 tells you nothing about when the event happened.
Theory gives us Lamport clocks and vector clocks. Practice gives us NTP misconfigurations, leap seconds, and "why are my logs out of order?" This chapter is the debugging guide.
Your operating system exposes two different clocks. Understanding the difference will save you from subtle, data-corrupting bugs.
| Clock Type | What It Measures | Can It Go Backward? | Use For |
|---|---|---|---|
| Wall clock (a.k.a. real-time clock) | Time since Unix epoch (Jan 1, 1970) | Yes. NTP can step the clock backward. Leap seconds can too. | Displaying time to humans. Logging. External APIs that need "what time is it?" |
| Monotonic clock | Time since some arbitrary point (usually boot) | Never. It only goes forward, by design. | Measuring durations. Timeouts. Rate limiting. Anything where you compute elapsed time. |
end_time - start_time using time() or gettimeofday(). If NTP steps the clock between your two reads, you can get a negative duration. Use clock_gettime(CLOCK_MONOTONIC) in C, time.monotonic() in Python, or System.nanoTime() in Java.python import time # BAD: wall clock — can go backward if NTP adjusts start = time.time() do_work() elapsed = time.time() - start # might be NEGATIVE! # GOOD: monotonic clock — guaranteed non-decreasing start = time.monotonic() do_work() elapsed = time.monotonic() - start # always >= 0 # ALSO GOOD: perf_counter for high-resolution benchmarking start = time.perf_counter() do_work() elapsed = time.perf_counter() - start # highest resolution monotonic
The Earth's rotation is slowing down. To keep UTC synchronized with solar time, a leap second is occasionally inserted: 23:59:59 is followed by 23:59:60 instead of 00:00:00. This has caused real outages:
Modern mitigations: leap smearing (Google, AWS, Azure). Instead of inserting a full second, they slow down the clock by a tiny amount over a 24-hour window. The clock is technically wrong during the smear period, but it never jumps — which is far safer for software.
| Language | Wall Clock | Monotonic Clock | Notes |
|---|---|---|---|
| C | clock_gettime(CLOCK_REALTIME) | clock_gettime(CLOCK_MONOTONIC) | Also: CLOCK_MONOTONIC_RAW (Linux) for hardware clock without NTP slew |
| Python | time.time() | time.monotonic() | time.perf_counter() for highest resolution |
| Java | System.currentTimeMillis() | System.nanoTime() | nanoTime() has nanosecond resolution but is only for elapsed time |
| Go | time.Now() | time.Now() (monotonic since Go 1.9) | Go embeds both readings in Time; durations automatically use monotonic |
| Rust | SystemTime::now() | Instant::now() | Instant is explicitly monotonic; SystemTime can go backward |
Go deserves special mention. Since Go 1.9, time.Now() stores both wall clock and monotonic readings. When you compute t2.Sub(t1), Go automatically uses the monotonic component — eliminating an entire class of bugs without requiring the programmer to choose. This is excellent API design.
Distributed systems should use UTC (Coordinated Universal Time) internally, never local time zones. Time zones add complexity (daylight saving transitions, political changes) without adding information. Convert to local time only at the presentation layer.
But even UTC has warts — leap seconds. For systems requiring monotonic global time, there is TAI (International Atomic Time). TAI is UTC without leap seconds. As of 2024, TAI is 37 seconds ahead of UTC. TAI never jumps, never pauses, never goes backward. Some high-frequency trading systems use TAI internally.
The relationship:
Unix timestamps (the output of time()) nominally represent "seconds since 1970-01-01 00:00:00 UTC." But they don't count leap seconds — during a leap second, the Unix timestamp either repeats a value (for a positive leap second) or skips one (for a negative, which has never happened). This means Unix timestamps are not true UTC and not true TAI — they are their own quirky standard called POSIX time.
Some systems embed timestamps in unique IDs to get rough ordering "for free." Examples:
| ID Format | Time Bits | Ordering Guarantee | Used By |
|---|---|---|---|
| UUIDv1 | 60-bit timestamp (100ns resolution) | Roughly ordered by creation time. Not guaranteed across machines (clock skew). | Cassandra (as TimeUUID) |
| UUIDv7 | 48-bit Unix timestamp (ms resolution) | Lexicographically sorted by time. Random suffix for uniqueness. | Recommended replacement for UUIDv4 when ordering matters |
| ULID | 48-bit timestamp + 80-bit random | Lexicographically sorted. Monotonic within same ms. | Various (pre-dates UUIDv7) |
| Snowflake | 41-bit timestamp + 10-bit worker + 12-bit sequence | Roughly ordered. 69 years of timestamps at ms resolution. | Twitter, Discord |
All of these inherit the limitations of physical clocks: two IDs generated on different machines within the clock skew window may be ordered differently than their true creation order. For strong causal ordering, you still need Lamport/HLC stamps alongside the ID.
Google TrueTime uses GPS receivers and atomic clocks in every datacenter. The API returns [earliest, latest] bounds. Typical ε is 1-7 ms. Used by Spanner for linearizable global transactions.
AWS Time Sync Service provides a Chrony-based NTP service accessible to all EC2 instances via a link-local address (169.254.169.123). It uses GPS and atomic clocks at AWS's time-source fleet. Typical accuracy: <1 ms within a region. AWS also offers ClockBound, an open-source library that provides TrueTime-like confidence intervals.
Clock bugs are not theoretical — they cause real incidents in production systems. Here are three case studies worth understanding:
t2 - t1 using the wall clock. When the leap second inserted 23:59:60, the kernel briefly reported a time that was earlier than a previous reading. The subtraction produced a negative number. The code passed this to a function expecting a positive duration, which panicked. Fix: use monotonic clocks for all duration calculations.--max-offset flag (default 500ms) and rejects nodes whose clocks diverge beyond this threshold. But the 500ms of transactions processed before detection had to be retried.time.time() before the migration and compared it after saw impossible durations. AWS's solution: the Time Sync Service provides a PHC (PTP Hardware Clock) device that maintains monotonic time across live migrations. Applications using CLOCK_MONOTONIC were unaffected.Most engineers never configure NTP — it "just works" via cloud provider defaults. But understanding the configuration helps you debug problems and tune for tighter bounds.
python # /etc/chrony.conf — example for a datacenter server # Use the cloud provider's time source (AWS example) # server 169.254.169.123 prefer iburst # Fallback to public NTP pools # pool pool.ntp.org iburst maxsources 4 # Key settings: # makestep 1.0 3 — allow clock step (not slew) if offset > 1s, # but only for the first 3 updates after boot. # After that, only slew (gradual adjustment). # maxdrift 100 — assume max 100 ppm drift rate # rtcsync — sync hardware RTC to system clock periodically # Monitoring commands: # chronyc tracking — current offset and drift # chronyc sources -v — list time sources with details # chronyc sourcestats — statistics for each source
The critical setting is makestep. Without it, Chrony will only slew the clock (gradually speed up or slow down) to correct offsets. Slewing preserves monotonicity but can take hours to correct a large offset. With makestep 1.0 3, the clock can jump for the first three corrections after boot, then switches to slew-only mode. This is the right balance: fast correction at startup, safe behavior during normal operation.
You are looking at a distributed system's logs aggregated from 50 machines. Events appear out of order. Here is the systematic debugging approach:
chronyc tracking on all machines. Look at "System time" offset. If it's > 10ms, NTP is misconfigured or the network to the time server is flaky.A node measures elapsed time using both clocks. At random moments, NTP adjusts the wall clock (red flash). Watch how the wall-clock duration goes wrong while the monotonic clock stays correct.
In any production distributed system, you should have automated monitoring of clock health. Here is a minimal but effective approach:
1. Monitor NTP offset on every machine. Alert if offset exceeds your tolerance (e.g., > 50ms for most systems, > 5ms for databases).
2. Track clock jumps. Compare consecutive readings of CLOCK_REALTIME. If the difference is negative or unexpectedly large (> 100ms between successive reads), log a warning. This catches NTP steps and leap seconds.
3. Cross-check with application-level heartbeats. If node A receives a heartbeat from node B with a physical timestamp 500ms in the future, either B's clock is fast or A's is slow. Log the skew and alert if it exceeds your max-offset threshold.
4. Include clock metadata in distributed traces. Every span in your distributed trace should carry the node's current NTP offset (from chronyc tracking or equivalent). When you see out-of-order spans, the offset metadata tells you whether it's a clock problem or a real ordering anomaly.
OpenTelemetry supports custom span attributes — add host.clock_offset_ms to every span. When debugging ordering issues in your tracing UI, filter by this attribute to identify machines with poor time sync. This single practice has saved countless hours of debugging in production systems.
python # Production clock health monitor import time, logging class ClockHealthMonitor: """Monitors local clock for jumps and drift.""" def __init__(self, max_jump_ms=100, check_interval=1.0): self.max_jump = max_jump_ms / 1000.0 self.interval = check_interval self.last_wall = time.time() self.last_mono = time.monotonic() def check(self): """Call periodically. Returns (is_healthy, details).""" wall = time.time() mono = time.monotonic() wall_delta = wall - self.last_wall mono_delta = mono - self.last_mono # The difference between wall and mono deltas reveals clock adjustments drift = abs(wall_delta - mono_delta) jumped = drift > self.max_jump if jumped: logging.warning( f"Clock jump detected: wall moved {wall_delta:.3f}s " f"but monotonic moved {mono_delta:.3f}s " f"(diff: {drift*1000:.1f}ms)" ) self.last_wall = wall self.last_mono = mono return not jumped, {"drift_ms": drift * 1000}
python # NTP status check — run on every machine in your fleet # $ chronyc tracking # System time : 0.000023483 seconds fast of NTP time # Last offset : +0.000011321 seconds # RMS offset : 0.000025612 seconds # Python: check if NTP is healthy import subprocess, re def check_ntp_offset(): """Returns NTP offset in milliseconds.""" result = subprocess.run( ["chronyc", "tracking"], capture_output=True, text=True ) for line in result.stdout.split('\n'): if 'System time' in line: secs = float(re.search(r'([\d.]+) seconds', line).group(1)) return secs * 1000 return None offset_ms = check_ntp_offset() if offset_ms and offset_ms > 10: print(f"WARNING: NTP offset is {offset_ms:.2f}ms — check time sync!")
This chapter distills everything into the format interviewers expect: crisp definitions, comparison tables, coding exercises, and design reasoning.
| Concept | One-Liner | When to Use |
|---|---|---|
| Timeout | Only tool for failure detection; binary alive/dead | Every RPC call, every heartbeat check |
| Phi Accrual | Continuous suspicion level based on heartbeat distribution | Cassandra, Akka — when you want adaptive detection |
| NTP | 1-100ms accuracy; can step clock backward | General clock sync; not safe for ordering |
| TrueTime | Returns [t-ε, t+ε]; ε ~ 1-7ms with GPS+atomic | Google Spanner; commit-wait for linearizability |
| Lamport Clock | Single counter; if a→b then L(a)<L(b); can't detect concurrency | Total ordering where concurrency detection is not needed |
| Vector Clock | Vector of n counters; detects concurrent events; O(n) size | Dynamo-style conflict detection; small-n systems |
| HLC | (l, c) — physical time + logical counter; O(1) + causal | CockroachDB MVCC; when you need causal order + human-readable time |
| Monotonic clock | Never goes backward; can't compare across machines | Measuring durations, timeouts, rate limiting |
The interactive decision tree below encodes the key tradeoffs. Click through the questions to arrive at the right clock mechanism for your use case.
Answer the questions by clicking buttons. The tree narrows to the right recommendation.
Interviewers test distributed time in predictable ways. Here are the five patterns and the expected answer shape:
| Pattern | Question Shape | Expected Answer |
|---|---|---|
| Ordering | "How do you order events across services?" | Define causality. Lamport for total order, vector for conflict detection, HLC for practical systems. |
| Conflict | "Two users edit the same document simultaneously. How do you detect the conflict?" | Version vectors. Not Lamport (can't detect concurrency). Not physical time (skew makes it unreliable). |
| Consistency | "How does Spanner achieve linearizability globally?" | TrueTime + commit-wait. Assign timestamp, wait 2ε so all future reads see the write. |
| Debugging | "Logs from 50 servers are out of order. How do you fix it?" | Check NTP sync, use monotonic clocks for durations, add causal context (trace IDs, Lamport timestamps). |
| Failure | "How do you detect if a node is down?" | Heartbeats + timeout. Discuss tradeoff. Mention phi accrual for adaptive. Discuss what happens on false positive. |
Given a sequence of events, compute the Lamport timestamps. This tests whether you understand the max-then-increment rule on receive.
python # Drill: fill in the timestamps for this execution # P0: send(m1) local recv(m3) # P1: recv(m1) send(m2) send(m3) # P2: local recv(m2) # Solution: p0, p1, p2 = 0, 0, 0 # P0 sends m1 p0 += 1 # p0 = 1, m1 carries ts=1 # P2 does local event p2 += 1 # p2 = 1 # P1 receives m1 (ts=1) p1 = max(p1, 1) + 1 # p1 = max(0,1)+1 = 2 # P0 does local event p0 += 1 # p0 = 2 # P1 sends m2 p1 += 1 # p1 = 3, m2 carries ts=3 # P1 sends m3 p1 += 1 # p1 = 4, m3 carries ts=4 # P2 receives m2 (ts=3) p2 = max(p2, 3) + 1 # p2 = max(1,3)+1 = 4 # P0 receives m3 (ts=4) p0 = max(p0, 4) + 1 # p0 = max(2,4)+1 = 5 print(f"Final: P0={p0}, P1={p1}, P2={p2}") # P0=5, P1=4, P2=4
python # Drill: implement vector clock comparison def vc_compare(a, b): """ Compare two vector clocks. Returns: 'before' if a < b, 'after' if a > b, 'equal' if a == b, 'concurrent' if incomparable. """ assert len(a) == len(b) le = all(x <= y for x, y in zip(a, b)) ge = all(x >= y for x, y in zip(a, b)) if le and ge: return "equal" if le: return "before" if ge: return "after" return "concurrent" # Test cases: assert vc_compare([1,0,0], [2,1,0]) == "before" assert vc_compare([2,1,0], [1,0,0]) == "after" assert vc_compare([3,2,0], [2,3,0]) == "concurrent" assert vc_compare([1,1,1], [1,1,1]) == "equal"
python # Drill: implement HLC send and receive class HLC: def __init__(self, get_pt): self.l = 0 self.c = 0 self.get_pt = get_pt # callable returning physical time (ms) def tick(self): """Local event or send.""" pt = self.get_pt() l_old = self.l self.l = max(l_old, pt) self.c = self.c + 1 if self.l == l_old else 0 return (self.l, self.c) def recv(self, msg_l, msg_c): """Receive with sender's (l, c).""" pt = self.get_pt() l_old = self.l self.l = max(l_old, msg_l, pt) if self.l == l_old == msg_l: self.c = max(self.c, msg_c) + 1 elif self.l == l_old: self.c = self.c + 1 elif self.l == msg_l: self.c = msg_c + 1 else: self.c = 0 return (self.l, self.c) # Verify: if a -> b, then HLC(a) < HLC(b) fake_time = [100] def pt(): return fake_time[0] h1 = HLC(pt) h2 = HLC(pt) ts_send = h1.tick() # (100, 0) fake_time[0] = 99 # h2's clock is behind! ts_recv = h2.recv(*ts_send) # l = max(0, 100, 99) = 100, c = 0+1 = 1 print(ts_send, ts_recv) # (100,0) < (100,1) ✓
Failure detection and time are foundational. Every distributed algorithm you will ever study sits on top of these primitives. Here is how this lesson connects to the broader landscape.
| Mechanism | Size | Causality | Concurrency Detection | Real Time | Used By |
|---|---|---|---|---|---|
| Physical Clock | O(1) | No guarantee | No | Yes (with drift) | NTP, TrueTime |
| Lamport Clock | O(1) | Yes (one-way) | No | No | Total ordering, Paxos |
| Vector Clock | O(n) | Yes (both ways) | Yes | No | Dynamo, Riak |
| HLC | O(1) | Yes (one-way) | No | Close (≤ ε) | CockroachDB, MongoDB |
| TrueTime | O(1) | Yes (with wait) | No | Yes (± ε) | Google Spanner |
This lesson focused on the core primitives. Here are important related topics we did not cover:
| Topic | What It Is | Where to Learn More |
|---|---|---|
| Consistent Snapshots | Capturing the global state of a distributed system at a "logical instant" using vector clocks. Chandy-Lamport algorithm. | Chandy & Lamport, 1985 |
| Bloom Clocks | Probabilistic clocks using Bloom filters instead of vectors. O(1) comparison at the cost of false positives. | Gunawardhana et al., 2017 |
| Interval Tree Clocks | Generalization of version vectors that supports dynamic join/leave without reassigning IDs. Used in CRDTs. | Almeida et al., 2008 |
| SWIM | Scalable Weakly-consistent Infection-style Membership protocol. Gossip-based failure detection used by Consul and Memberlist. | Das et al., 2002 |
| Logical Time in CRDTs | Conflict-free Replicated Data Types use causal ordering (usually via vector clocks or dotted version vectors) to merge concurrent updates without coordination. | Shapiro et al., 2011 |
If you remember nothing else from this lesson, remember this decision framework: