Network faults, clock drift, process pauses — everything that can go wrong, will.
You have a single computer on your desk. You write a program that reads a file, does some math, and writes the result. If the hardware is working, the program does the same thing every time. The file is either there or it is not. The CPU either adds correctly or it is broken. There is no ambiguity.
Now imagine you have three computers connected by a network. You send a message from computer A to computer B, asking it to store a value. Then you ask computer C to read that value. Simple, right?
No. Here is what can happen, all at once, with no warning:
A single computer is deterministic: operations either work or they don't. A distributed system is nondeterministic: operations can partially succeed, silently fail, take arbitrarily long, or produce different results on different nodes at the same moment.
The simulation below shows the contrast. On the left, a single machine executes three operations in sequence: read, compute, write. Predictable. Boring. Beautiful.
On the right, three nodes try to coordinate the same operations over a network. Click "Inject Chaos" to introduce the failures that happen in real distributed systems — every single day, in every datacenter on Earth.
Left: a single machine runs operations deterministically. Right: three networked nodes. Click "Inject Chaos" to see what goes wrong.
The single machine always succeeds. The distributed system might drop a message, delay a response, or have a node pause at exactly the wrong moment. And the terrifying part: you cannot tell which failure occurred. From the sender's perspective, a dropped message looks identical to a delayed response looks identical to a crashed node.
If distributed systems are this painful, why use them? Three reasons:
| Reason | Why one machine fails | What distribution gives you |
|---|---|---|
| Scalability | One machine has finite CPU, RAM, disk | Add more machines to handle more load |
| Fault tolerance | One machine is a single point of failure | If one node dies, others continue serving |
| Latency | Users are geographically distributed | Place data close to users |
You must distribute. So you must understand what breaks.
The internet and most datacenter networks are asynchronous packet networks. When you send a message, the network gives you absolutely no guarantee about when (or whether) it will arrive. Here is the complete list of things that can go wrong when node A sends a request to node B:
From A's perspective, all five scenarios look identical: A sent a request and did not receive a response. That's it. There is no way for A to distinguish between them.
A network partition (or "netsplit") is when the network splits into two or more groups of nodes that can communicate within the group but not across groups. All nodes are alive and processing requests. They just can't see each other.
This is different from a node crashing. In a crash, the failed node stops doing things. In a partition, both sides keep operating — potentially making conflicting decisions, accepting conflicting writes, electing conflicting leaders.
Partitions are not rare. A 2011 study of Microsoft datacenters found that network failures caused about 5 link failures per day, each affecting tens of thousands of devices. Google's internal data shows similar numbers. In cloud environments, partial partitions (where some links fail but others work) are especially insidious.
The simulation below shows node A sending a request to node B. Use the buttons to inject different failure modes and observe that A always sees the same thing: silence.
Send a request from A to B. Then inject a specific failure to see what happens.
Since you can't distinguish failure modes, the only option is to set a timeout: wait for some duration, and if no response arrives, assume something went wrong. But what should the timeout be?
| Timeout | Upside | Downside |
|---|---|---|
| Short (100ms) | Fast failure detection | Healthy-but-slow nodes declared dead (false positives). Triggers unnecessary failover, increases load on healthy nodes |
| Long (30s) | Fewer false positives | Real failures take 30 seconds to detect. Users see 30 seconds of errors |
There is no universally correct timeout. It depends on your network, your workload, and how bad false positives are compared to slow detection. We'll explore this tradeoff deeply in Chapter 2.
You're running a cluster of five database replicas. One of them stops responding. Is it dead? Is it just slow? Is the network between you and it broken while it's still serving other clients? You need to decide — because if it's dead, you need to redirect traffic and maybe elect a new leader. If it's just slow, declaring it dead could cause a cascade of unnecessary failovers.
This is the failure detection problem. And your only tool is timeouts.
Let's make this concrete. Suppose your service sends heartbeats every 1 second. Normally, responses arrive within 5ms. You need to decide: how many missed heartbeats constitute a failure?
Network delays are not random noise — they have specific, mechanical causes. Understanding these helps you reason about reasonable timeout values.
| Source of delay | Typical magnitude | Why it happens |
|---|---|---|
| Switch queue | Microseconds to milliseconds | Multiple ports sending to the same output port. Packets queue in the switch's buffer. |
| TCP retransmission | 200ms to seconds | A packet is lost. TCP waits for an ACK timeout, then retransmits. Each retry doubles the wait (exponential backoff). |
| OS TCP buffer | Microseconds to milliseconds | The receiving OS accepted the packet but the application hasn't called read() yet. The packet sits in the kernel's socket buffer. |
| VM pause | Tens of milliseconds to seconds | Virtual machine live migration, or the hypervisor scheduling another VM on the same CPU core. |
| GC pause | Milliseconds to seconds | The application's garbage collector stops the world. The process cannot respond until GC completes. |
| Context switch | Microseconds to milliseconds | The OS schedules another process on the same CPU core. Under high load, your heartbeat response gets delayed. |
Notice that most of these are queueing problems. Packets, threads, and processes all queue. The more loaded the system, the longer the queues, the higher the latency. This is why the 99th percentile of response time is far worse than the median.
Rather than a fixed timeout, the phi accrual failure detector (used by Akka, Cassandra) adapts based on observed response times. Instead of a binary "alive or dead," it outputs a suspicion level (φ) that increases over time since the last heartbeat.
The beauty: if the network gets noisier (higher σ), the detector automatically becomes more lenient. If the network is rock-solid (low σ), it detects failures faster.
The simulation below shows a node sending heartbeats. The response time varies randomly. Use the timeout slider to see the tradeoff: too short and you get false positives (declaring a healthy node dead), too long and you're slow to detect a real failure.
Adjust the timeout. Watch false positives (red flashes) when it's too short, and slow detection when it's too long. The node will actually crash after ~15 heartbeats.
python import math import time from collections import deque class PhiAccrualDetector: """Phi accrual failure detector (simplified Cassandra-style).""" def __init__(self, threshold=8, window_size=100): self.threshold = threshold # phi above this = declared dead self.intervals = deque(maxlen=window_size) self.last_heartbeat = None def heartbeat(self): """Called when a heartbeat arrives.""" now = time.monotonic() if self.last_heartbeat is not None: interval = (now - self.last_heartbeat) * 1000 # ms self.intervals.append(interval) self.last_heartbeat = now def phi(self): """Current suspicion level. Higher = more likely dead.""" if self.last_heartbeat is None or len(self.intervals) < 2: return 0.0 now = time.monotonic() elapsed = (now - self.last_heartbeat) * 1000 # ms since last heartbeat # Compute mean and variance of intervals mean = sum(self.intervals) / len(self.intervals) variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals) std = math.sqrt(variance) if variance > 0 else 0.1 # CDF of normal distribution: P(X <= elapsed) z = (elapsed - mean) / std cdf = 0.5 * (1 + math.erf(z / math.sqrt(2))) # phi = -log10(1 - cdf), clamped to avoid log(0) p_late = max(1e-15, 1.0 - cdf) return -math.log10(p_late) def is_alive(self): return self.phi() < self.threshold # Usage: detector = PhiAccrualDetector(threshold=8) # In your heartbeat loop: # detector.heartbeat() — call when each heartbeat arrives # detector.is_alive() — check if the node is still alive
Every computer has a quartz crystal oscillator that ticks at a nominal frequency. But "nominal" is doing a lot of work in that sentence. Quartz crystals drift. A typical crystal oscillates at 32.768 kHz, but temperature, voltage, and manufacturing variance cause it to be slightly fast or slow. A drift of 35 parts per million (ppm) — common for consumer-grade crystals — means the clock gains or loses about 3 seconds per day.
On a single machine, this doesn't matter much. The OS periodically adjusts the clock using NTP (Network Time Protocol) and everything works fine. But in a distributed system, each node has its own clock, and they all drift independently. You cannot assume that "now" on node A is the same as "now" on node B.
Operating systems expose two fundamentally different clocks. Confusing them causes bugs that are maddeningly difficult to diagnose.
| Property | Time-of-day clock | Monotonic clock |
|---|---|---|
| What it measures | Wall-clock time (seconds since epoch) | Elapsed time since some arbitrary point |
| Python API | time.time() | time.monotonic() |
| Java API | System.currentTimeMillis() | System.nanoTime() |
| Can jump backwards? | YES. NTP step correction can move the clock backwards by seconds. | NO. Guaranteed to never decrease. |
| Synced across nodes? | Approximately (via NTP). Accuracy: 10-100ms over internet, 1ms on local network. | NO. Only meaningful on a single machine. Not comparable across machines. |
| Good for ordering events? | NO. Drift and jumps make cross-node ordering unreliable. | YES — but only on ONE machine. |
| Good for measuring durations? | NO. A jump makes a 5-second operation appear to take -3 seconds. | YES — this is exactly what it's for. |
Consider two database nodes using last-write-wins (LWW) conflict resolution. Each write is stamped with the node's local wall-clock time. The write with the highest timestamp wins.
Here's what happens when clocks are skewed:
This is not a theoretical concern. It is one of the most common sources of silent data loss in distributed databases that use LWW (including some default configurations of Cassandra).
The simulation below shows three nodes with independent clock drift. NTP corrections occur periodically but introduce jumps. Watch how two events that happen in a clear temporal order get reversed timestamps.
Three nodes with drifting clocks. NTP syncs periodically. Click "Write Event" to place events and see how timestamps can disagree with reality.
NTP gives you a timestamp. But how accurate is it? The terrifying answer: you don't know. NTP measures the round-trip time to a time server and estimates one-way delay as half the round-trip. But if the network is asymmetric (as it often is), this estimate is wrong. NTP gives you a time, but no error bars.
Google decided this was unacceptable. For their globally distributed database Spanner, they needed to know not just "what time is it?" but "what is the earliest and latest time it could be right now?"
Google's TrueTime API uses a combination of GPS receivers and atomic clocks in every datacenter. Instead of returning a single timestamp, it returns a confidence interval:
With confidence intervals, you can determine when two events are definitely ordered versus when their order is ambiguous.
Spanner uses this to implement external consistency (a stronger guarantee than serializability). When a transaction commits, Spanner deliberately waits out the uncertainty: it pauses for the length of the confidence interval before reporting the commit. This ensures that any transaction that starts after the commit will definitely see it.
Place events on two nodes. Each event has a confidence interval. The simulation shows whether the events can be ordered or are concurrent.
Click "Add Event" to place events on Node 1 and Node 2. The simulation shows whether their confidence intervals overlap (concurrent) or are separated (orderable).
You're not Google. You don't have atomic clocks. What can you do?
| Approach | Accuracy | Trade-off |
|---|---|---|
| NTP over internet | ~50-100ms | Free, but high and unpredictable error |
| NTP on local network | ~1-5ms | Requires dedicated NTP servers in your datacenter |
| PTP (Precision Time Protocol) | ~1-100μs | Requires hardware timestamping support in NICs and switches |
| GPS receiver per server | ~1μs | $50-200 per server, needs antenna with sky view |
| Logical clocks | Perfect causal ordering | Doesn't tell you wall-clock time at all — only "before/after" relationships |
You've handled unreliable networks and unreliable clocks. Surely you can at least trust that your own code, running on your own machine, executes continuously? No. Your process can stop running for seconds at any time, without warning, and resume as if nothing happened.
| Cause | Typical duration | Warning? |
|---|---|---|
| GC pause (Java/Go/.NET) | 10ms to 10+ seconds | None. The process freezes mid-instruction. |
| VM live migration | 100ms to seconds | None. The hypervisor moves your VM to another host while it's running. |
| OS context switch under load | Microseconds to milliseconds | None. The OS schedules another process. |
| Disk I/O (swapping/thrashing) | Milliseconds to seconds | None. A page fault triggers disk read. |
| Signal handling (SIGSTOP) | Indefinite | User-triggered (ctrl-Z). |
This is where process pauses become dangerous. Consider a lease — a distributed lock with a timeout. A node acquires a lease that says "you are the leader for the next 10 seconds." The node then processes requests, confident that it's the only leader.
Here's what goes wrong:
Since you can't prevent a stale leader from acting, you need the storage layer to reject stale writes. The mechanism is a fencing token: a monotonically increasing number issued with each lease grant.
This simulation shows two nodes competing for a lease. Node A gets the lease, then suffers a GC pause. Node B acquires the lease during the pause. When Node A wakes up, see what happens with and without fencing tokens.
Watch the scenario play out. Toggle fencing to see the difference: without fencing, data is corrupted. With fencing, the stale write is rejected.
python class FencedStorage: """Storage server that rejects stale writes using fencing tokens.""" def __init__(self): self.data = {} self.max_token = {} # key -> highest token seen def write(self, key, value, fencing_token): """Write only if the fencing token is >= the highest seen for this key.""" if key in self.max_token and fencing_token < self.max_token[key]: raise PermissionError( f"Stale fencing token: {fencing_token} < {self.max_token[key]}. " "Write rejected — you no longer hold the lease." ) self.max_token[key] = fencing_token self.data[key] = value return True def read(self, key): return self.data.get(key) # Demo: the GC pause scenario storage = FencedStorage() # Node A gets lease with token 34, writes successfully storage.write("account_balance", 500, fencing_token=34) print("Node A wrote 500 with token 34") # OK # Node B gets lease with token 35 (after A's lease expired during GC) storage.write("account_balance", 450, fencing_token=35) print("Node B wrote 450 with token 35") # OK # Node A wakes up from GC, tries to write with old token 34 try: storage.write("account_balance", 600, fencing_token=34) except PermissionError as e: print(f"Node A rejected: {e}") # REJECTED
So far, we've assumed nodes are honest but unreliable: they might crash, they might be slow, but they never lie. A crashed node stops sending messages. A slow node eventually responds with the correct answer. What if a node sends the wrong answer? Or sends different answers to different nodes?
This is a Byzantine fault: a node behaves in an arbitrary, potentially malicious way. It might send "yes" to node A and "no" to node B. It might report a different value from what it stored. It might impersonate another node.
In 1982, Lamport, Shostak, and Pease published a paper that framed this as a military scenario. Several Byzantine generals surround an enemy city. They need to agree on a plan: attack or retreat. They communicate by messenger. Some generals may be traitors who send conflicting messages to sabotage the consensus.
With 3 generals (A loyal, B loyal, T traitor):
Four generals try to agree on a plan. One is a traitor. Watch how the loyal generals use majority voting to reach consensus despite the traitor's conflicting messages.
The commander sends a plan. The traitor sends conflicting messages. Watch how majority voting overcomes the traitor.
Almost certainly not. BFT is expensive — you need 3f+1 nodes to tolerate f faults (versus 2f+1 for crash faults). The extra nodes and message rounds add latency and cost.
| Environment | Need BFT? | Why |
|---|---|---|
| Your datacenter | No | You control the nodes. Bugs exist, but nodes don't intentionally lie. |
| Blockchain / cryptocurrency | Yes | Nodes are operated by strangers with financial incentives to cheat. |
| Aircraft flight control | Yes | Hardware faults (cosmic rays, sensor failures) can produce arbitrary values. |
| Multi-org consortium | Maybe | Organizations might not trust each other's infrastructure. |
We've cataloged everything that can go wrong: networks drop messages, clocks drift, processes pause, and nodes can crash (or even lie). Now we need a framework for reasoning about these faults. We need system models.
A system model is a set of assumptions about what kinds of faults can happen. It's the contract between the algorithm designer and reality. "If the world behaves at least this well, then my algorithm guarantees these properties."
How do nodes fail?
| Model | Assumption | Example |
|---|---|---|
| Crash-stop | A node that crashes is gone forever. It never recovers. | A process killed with SIGKILL. A server removed from the rack. |
| Crash-recovery | A node that crashes may restart later with its durable state intact (but in-memory state is lost). | A server reboots after a power failure. The process is restarted by systemd. |
| Byzantine | A node may behave in any arbitrary way: crash, lie, send contradictory messages. | A compromised server. Hardware with corrupted memory. |
How predictable are delays?
| Model | Assumption | Real-world match |
|---|---|---|
| Synchronous | There is a known upper bound on message delay and clock drift. If a message doesn't arrive in time, the node is definitely crashed. | Hard real-time systems with dedicated networks (assembly line controllers) |
| Partially synchronous | The system behaves synchronously MOST of the time, but can occasionally exceed bounds (during network congestion, GC pauses, etc.). | Most datacenter systems. The model used by most practical algorithms. |
| Asynchronous | No timing assumptions at all. Messages can take arbitrarily long. You cannot use timeouts because there is no "too long." | Theoretical model. No real system is truly asynchronous, but it gives the strongest impossibility results. |
Five nodes in a partially synchronous network running a simplified consensus protocol. You are the chaos monkey. Crash nodes, partition the network, trigger GC pauses, and delay messages. Watch how the system handles each scenario — and when it breaks.
Click nodes to crash them. Use buttons to partition the network or trigger GC pauses. The system tries to maintain a leader and accept writes. Watch what survives and what doesn't.
Things to try:
Here is a thought experiment that crystallizes the central challenge of distributed systems.
You are a node. You just processed a request and sent a response. But your garbage collector kicked in for 15 seconds right after you sent it. When you wake up, you think only a millisecond has passed. Your lease has expired. Another node is now the leader. But you don't know any of this. From your perspective, nothing happened.
A node cannot trust its own judgment. It might be paused. Its clock might be wrong. Its network might be partitioned. It has no way to tell. This is the fundamental asymmetry of distributed systems: a node can observe its own state, but it cannot observe its own reachability or timeliness from the outside.
Since no individual node can be trusted to assess reality, the only reliable source of truth is a quorum — a majority of nodes agreeing on a fact.
| Decision | Who decides | Why not the individual? |
|---|---|---|
| "Node X is dead" | A quorum of nodes | Node X might be alive but partitioned from the observer |
| "Node Y is the leader" | A quorum of nodes | Y might think it's the leader but its lease expired during a GC pause |
| "Value V is committed" | A quorum of nodes | A single node might have a stale or corrupted copy |
| "The lock is held by Z" | A quorum of nodes (or the storage with fencing) | Z might have crashed and not know it yet |
In 1985, Fischer, Lynch, and Paterson proved one of the most important results in distributed computing. The FLP impossibility theorem states:
This sounds devastating. If consensus is impossible, how do Raft and Paxos work?
The key word is asynchronous. FLP applies to the fully asynchronous model. Real systems are partially synchronous: they have timeouts, and most of the time the network is well-behaved. Practical consensus algorithms exploit this:
Five nodes try to decide if a particular node is dead. Each has its own view (based on whether it can reach the suspect node). A quorum (3 of 5) must agree. See how partitions can cause different nodes to have different views — but the quorum still makes a consistent decision.
Click nodes to toggle their connectivity to the suspect node (Node 3). The quorum decides: if 3+ nodes can't reach Node 3, it is declared dead.
Every distributed algorithm makes a choice between two properties:
| Property | Definition | Example |
|---|---|---|
| Safety | Nothing bad ever happens. If the system gives you an answer, it's correct. | "Two leaders are never elected for the same term." "A committed value is never lost." |
| Liveness | Something good eventually happens. The system makes progress. | "A leader is eventually elected." "A submitted write is eventually committed." |
FLP tells us that in an asynchronous system, you cannot guarantee both safety and liveness in the presence of faults. Practical algorithms choose safety always, liveness when possible. During a partition, the system stops accepting writes (sacrifices liveness) rather than accepting potentially conflicting writes (sacrificing safety).
This chapter is your cheat sheet. Every failure mode, every mitigation, every pattern you need for distributed systems interviews — organized by dimension.
| Failure | Symptom | Mitigation | Interview trigger |
|---|---|---|---|
| Network partition | Some nodes unreachable, others fine | Quorum-based decisions, fencing tokens | "What happens if your services can't talk to each other?" |
| Message loss | Request or response disappears | Retries + idempotency keys | "How do you handle a failed API call?" |
| Message delay | Response arrives after timeout | Adaptive timeouts (phi accrual), deduplication | "How do you set your timeout values?" |
| Clock skew | Timestamps disagree across nodes | Logical clocks, vector clocks, TrueTime | "How do you order events across services?" |
| GC pause | Process freezes for seconds | Fencing tokens, short leases | "What if the leader is GC'd?" |
| Split brain | Two nodes both think they're leader | Epoch-based fencing, quorum leases | "How do you prevent two leaders?" |
| Partial failure | One of N steps completes | Transactions, saga pattern, idempotency | "How do you handle a crash mid-operation?" |
Drill 1: Idempotency layer
python import hashlib import time class IdempotencyLayer: """Prevents duplicate processing of requests.""" def __init__(self, ttl_seconds=3600): self.seen = {} # idempotency_key -> (result, timestamp) self.ttl = ttl_seconds def execute(self, idempotency_key, operation): """Execute operation only if this key hasn't been seen.""" self._evict_expired() if idempotency_key in self.seen: # Already processed — return cached result return self.seen[idempotency_key][0] # First time seeing this key — execute and cache result = operation() self.seen[idempotency_key] = (result, time.monotonic()) return result def _evict_expired(self): now = time.monotonic() expired = [k for k, (_, t) in self.seen.items() if now - t > self.ttl] for k in expired: del self.seen[k] # Usage: idem = IdempotencyLayer() result = idem.execute("charge-user-123-order-456", lambda: charge_credit_card(50)) # If retried with same key, returns cached result without re-charging
Drill 2: Simple lease with fencing
python import time import threading class LeaseService: """Simplified lease service with fencing tokens.""" def __init__(self): self.lock = threading.Lock() self.current_holder = None self.expiry = 0 self.next_token = 1 def acquire(self, node_id, duration_sec=10): """Try to acquire lease. Returns (success, fencing_token).""" with self.lock: now = time.monotonic() if self.current_holder is not None and now < self.expiry: return False, None # Lease held by someone else token = self.next_token self.next_token += 1 self.current_holder = node_id self.expiry = now + duration_sec return True, token def release(self, node_id): with self.lock: if self.current_holder == node_id: self.current_holder = None self.expiry = 0
| Paper | Year | Key idea |
|---|---|---|
| Time, Clocks, and the Ordering of Events in a Distributed System (Lamport) | 1978 | Logical clocks, happens-before relation. The foundation of all distributed ordering. |
| Impossibility of Distributed Consensus with One Faulty Process (Fischer, Lynch, Paterson) | 1985 | FLP impossibility. No deterministic consensus in fully asynchronous systems. |
| The Byzantine Generals Problem (Lamport, Shostak, Pease) | 1982 | BFT requires 3f+1 nodes for f faults. The original formulation. |
| Spanner: Google's Globally-Distributed Database (Corbett et al.) | 2012 | TrueTime API. GPS + atomic clocks for bounded clock uncertainty. |
| The φ Accrual Failure Detector (Hayashibara et al.) | 2004 | Adaptive failure detection based on observed heartbeat distribution. |
This chapter covered the problems of distributed systems — the hostile environment that algorithms must survive. Now you need the solutions: the algorithms and protocols that provide guarantees despite these faults.
| Chapter | Topic | Relationship |
|---|---|---|
| DDIA Ch. 6: Replication | Copying data across nodes | Replication must handle all the faults we described: network partitions, clock skew (for LWW), process pauses (for leader failover). This chapter explains WHY replication is hard. |
| DDIA Ch. 8: Transactions | Grouping operations atomically | Distributed transactions (2PC, sagas) must survive network partitions and node crashes. This chapter explains the environment that makes distributed transactions so expensive. |
| DDIA Ch. 10: Consensus | Getting nodes to agree | Consensus (Raft, Paxos) is the PRIMARY solution to the problems in this chapter. It works because it assumes partial synchrony (not full asynchrony, where FLP makes it impossible). |
After this chapter, you should have a three-layer mental model:
| Principle | One-liner |
|---|---|
| You can't distinguish failure modes | A timeout tells you nothing about why the response didn't arrive. |
| Clocks lie | Never use wall-clock timestamps to determine the order of events across machines. |
| Your process can pause | A GC pause, VM migration, or context switch can stop your code for seconds without warning. |
| Trust the quorum, not yourself | No individual node can reliably assess its own status. Only a majority decision is trustworthy. |
| Safety over liveness | Better to stop accepting writes than to accept conflicting writes. |
| Fence everything | Every write to shared storage should carry a fencing token that the storage can verify. |
| Make it idempotent | If a client might retry, the operation must produce the same result on duplicate execution. |