Designing Data-Intensive Applications — Chapter 9

The Trouble with Distributed Systems

Network faults, clock drift, process pauses — everything that can go wrong, will.

Prerequisites: Basic networking + Client-server model. That's it.
11
Chapters
9+
Simulations
5
Interview Dimensions

Chapter 0: The Problem

You have a single computer on your desk. You write a program that reads a file, does some math, and writes the result. If the hardware is working, the program does the same thing every time. The file is either there or it is not. The CPU either adds correctly or it is broken. There is no ambiguity.

Now imagine you have three computers connected by a network. You send a message from computer A to computer B, asking it to store a value. Then you ask computer C to read that value. Simple, right?

No. Here is what can happen, all at once, with no warning:

A single computer is deterministic: operations either work or they don't. A distributed system is nondeterministic: operations can partially succeed, silently fail, take arbitrarily long, or produce different results on different nodes at the same moment.

The core reality. In a distributed system, there is no shared memory, no shared clock, and no shared fate. The network is the only connection between nodes, and the network lies. This chapter is the "reality check" — every failure mode you must internalize before you can design systems that tolerate them.

A Single Machine vs. A Distributed System

The simulation below shows the contrast. On the left, a single machine executes three operations in sequence: read, compute, write. Predictable. Boring. Beautiful.

On the right, three nodes try to coordinate the same operations over a network. Click "Inject Chaos" to introduce the failures that happen in real distributed systems — every single day, in every datacenter on Earth.

Single Machine vs. Distributed Chaos

Left: a single machine runs operations deterministically. Right: three networked nodes. Click "Inject Chaos" to see what goes wrong.

Click Run to execute operations.

The single machine always succeeds. The distributed system might drop a message, delay a response, or have a node pause at exactly the wrong moment. And the terrifying part: you cannot tell which failure occurred. From the sender's perspective, a dropped message looks identical to a delayed response looks identical to a crashed node.

Why Not Just Use One Computer?

If distributed systems are this painful, why use them? Three reasons:

ReasonWhy one machine failsWhat distribution gives you
ScalabilityOne machine has finite CPU, RAM, diskAdd more machines to handle more load
Fault toleranceOne machine is a single point of failureIf one node dies, others continue serving
LatencyUsers are geographically distributedPlace data close to users

You must distribute. So you must understand what breaks.

The structure of this lesson. We'll cover the three categories of faults: network faults (messages are delayed, dropped, duplicated), clock faults (timestamps disagree), and process faults (your code stops running unexpectedly). Then we'll see what we can still guarantee in this hostile environment.
Interview question: You send an RPC from service A to service B and don't receive a response within your timeout. What do you know for certain about whether B processed the request?

Chapter 1: Unreliable Networks

The internet and most datacenter networks are asynchronous packet networks. When you send a message, the network gives you absolutely no guarantee about when (or whether) it will arrive. Here is the complete list of things that can go wrong when node A sends a request to node B:

1. Request Lost
The packet was dropped by a congested switch, a flapping link, or a misconfigured firewall. It never reaches B.
2. Request Queued
The packet arrives at B's machine but sits in a TCP buffer or application queue because B is overloaded. It will be processed... eventually.
3. Remote Node Crashed
B's process, OS, or hardware died. The packet either reached B before the crash (and B processed it but couldn't respond) or after (and was silently discarded by the OS).
4. Response Lost
B processed the request and sent a response, but the response packet was lost on the way back. B thinks it succeeded. A thinks it failed.
5. Response Delayed
B responded, but the response is stuck in a network queue. It will arrive in 30 seconds, long after A's timeout fires.

From A's perspective, all five scenarios look identical: A sent a request and did not receive a response. That's it. There is no way for A to distinguish between them.

The fundamental problem. When you don't get a response, you don't know whether your request was processed. If the operation was not idempotent (e.g., "charge this credit card"), retrying might double-charge the customer. Not retrying might mean the original charge was lost. This is why idempotency is a first-class design concern in distributed systems.

Network Partitions

A network partition (or "netsplit") is when the network splits into two or more groups of nodes that can communicate within the group but not across groups. All nodes are alive and processing requests. They just can't see each other.

This is different from a node crashing. In a crash, the failed node stops doing things. In a partition, both sides keep operating — potentially making conflicting decisions, accepting conflicting writes, electing conflicting leaders.

Partitions are not rare. A 2011 study of Microsoft datacenters found that network failures caused about 5 link failures per day, each affecting tens of thousands of devices. Google's internal data shows similar numbers. In cloud environments, partial partitions (where some links fail but others work) are especially insidious.

Simulate the Failure Modes

The simulation below shows node A sending a request to node B. Use the buttons to inject different failure modes and observe that A always sees the same thing: silence.

Network Failure Simulator

Send a request from A to B. Then inject a specific failure to see what happens.

Click Send Request, then inject a failure.

Timeouts: The Only Tool

Since you can't distinguish failure modes, the only option is to set a timeout: wait for some duration, and if no response arrives, assume something went wrong. But what should the timeout be?

TimeoutUpsideDownside
Short (100ms)Fast failure detectionHealthy-but-slow nodes declared dead (false positives). Triggers unnecessary failover, increases load on healthy nodes
Long (30s)Fewer false positivesReal failures take 30 seconds to detect. Users see 30 seconds of errors

There is no universally correct timeout. It depends on your network, your workload, and how bad false positives are compared to slow detection. We'll explore this tradeoff deeply in Chapter 2.

Real-world numbers. Within a datacenter, round-trip time (RTT) is typically under 1ms. Across availability zones: 1-5ms. Across continents: 100-300ms. But these are medians. The 99th percentile can be 10-100x higher due to queueing, TCP retransmissions, and congestion. A "1ms" datacenter network can spike to 100ms during a top-of-rack switch failover.
Interview question: Your service A calls service B. B successfully processes the request and sends a response, but the response packet is lost in the network. A's timeout fires and it retries the request. B processes the request a second time. If the request is "deduct $50 from account X," what is the consequence, and how would you prevent it?

Chapter 2: Detecting Failures

You're running a cluster of five database replicas. One of them stops responding. Is it dead? Is it just slow? Is the network between you and it broken while it's still serving other clients? You need to decide — because if it's dead, you need to redirect traffic and maybe elect a new leader. If it's just slow, declaring it dead could cause a cascade of unnecessary failovers.

This is the failure detection problem. And your only tool is timeouts.

The Timeout Dilemma: Worked Example

Let's make this concrete. Suppose your service sends heartbeats every 1 second. Normally, responses arrive within 5ms. You need to decide: how many missed heartbeats constitute a failure?

// Scenario: 5 database replicas, heartbeat every 1 second
// Normal RTT: 1-5ms (99th percentile: 50ms)

// Option A: timeout = 2 missed heartbeats = 2 seconds
False positive rate: a 99.9th percentile network spike of 2+ seconds
triggers failover for a healthy node.
In a busy datacenter, this happens ~once per day per node.

// Option B: timeout = 10 missed heartbeats = 10 seconds
False positive rate: nearly zero (10-second network spikes are very rare).
But: if a node actually crashes, it takes 10 seconds to detect.
10 seconds of requests routed to a dead node = 10 seconds of errors.

// The math of false positives:
P(false positive) = P(RTT > timeout) × heartbeats_per_second
Lower timeout → higher P(RTT > timeout) → more false positives
Higher timeout → longer detection time for real failures

Where Delays Come From

Network delays are not random noise — they have specific, mechanical causes. Understanding these helps you reason about reasonable timeout values.

Source of delayTypical magnitudeWhy it happens
Switch queueMicroseconds to millisecondsMultiple ports sending to the same output port. Packets queue in the switch's buffer.
TCP retransmission200ms to secondsA packet is lost. TCP waits for an ACK timeout, then retransmits. Each retry doubles the wait (exponential backoff).
OS TCP bufferMicroseconds to millisecondsThe receiving OS accepted the packet but the application hasn't called read() yet. The packet sits in the kernel's socket buffer.
VM pauseTens of milliseconds to secondsVirtual machine live migration, or the hypervisor scheduling another VM on the same CPU core.
GC pauseMilliseconds to secondsThe application's garbage collector stops the world. The process cannot respond until GC completes.
Context switchMicroseconds to millisecondsThe OS schedules another process on the same CPU core. Under high load, your heartbeat response gets delayed.

Notice that most of these are queueing problems. Packets, threads, and processes all queue. The more loaded the system, the longer the queues, the higher the latency. This is why the 99th percentile of response time is far worse than the median.

The Phi Accrual Failure Detector

Rather than a fixed timeout, the phi accrual failure detector (used by Akka, Cassandra) adapts based on observed response times. Instead of a binary "alive or dead," it outputs a suspicion level (φ) that increases over time since the last heartbeat.

// Phi accrual failure detector

// 1. Track recent heartbeat intervals
intervals = [1002ms, 998ms, 1005ms, 1001ms, 997ms, ...]

// 2. Compute mean and standard deviation
μ = mean(intervals) = 1000.6ms
σ = stdev(intervals) = 3.1ms

// 3. When a heartbeat is late by t ms, compute phi
φ(t) = -log10(1 - F(t))
// where F(t) is the CDF of the normal distribution N(μ, σ²)

// 4. Threshold: declare failure when φ > threshold
// φ = 1 → P(mistake) = 10%
// φ = 2 → P(mistake) = 1%
// φ = 3 → P(mistake) = 0.1%
// φ = 8 → P(mistake) = 0.00001% (Cassandra default)

The beauty: if the network gets noisier (higher σ), the detector automatically becomes more lenient. If the network is rock-solid (low σ), it detects failures faster.

Timeout Tuning Simulator

The simulation below shows a node sending heartbeats. The response time varies randomly. Use the timeout slider to see the tradeoff: too short and you get false positives (declaring a healthy node dead), too long and you're slow to detect a real failure.

Timeout Tuning Simulator

Adjust the timeout. Watch false positives (red flashes) when it's too short, and slow detection when it's too long. The node will actually crash after ~15 heartbeats.

Timeout 200ms
Adjust timeout, then click Start.

Python: Simple Phi Accrual Detector

python
import math
import time
from collections import deque

class PhiAccrualDetector:
    """Phi accrual failure detector (simplified Cassandra-style)."""

    def __init__(self, threshold=8, window_size=100):
        self.threshold = threshold       # phi above this = declared dead
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self):
        """Called when a heartbeat arrives."""
        now = time.monotonic()
        if self.last_heartbeat is not None:
            interval = (now - self.last_heartbeat) * 1000  # ms
            self.intervals.append(interval)
        self.last_heartbeat = now

    def phi(self):
        """Current suspicion level. Higher = more likely dead."""
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0

        now = time.monotonic()
        elapsed = (now - self.last_heartbeat) * 1000  # ms since last heartbeat

        # Compute mean and variance of intervals
        mean = sum(self.intervals) / len(self.intervals)
        variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = math.sqrt(variance) if variance > 0 else 0.1

        # CDF of normal distribution: P(X <= elapsed)
        z = (elapsed - mean) / std
        cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))

        # phi = -log10(1 - cdf), clamped to avoid log(0)
        p_late = max(1e-15, 1.0 - cdf)
        return -math.log10(p_late)

    def is_alive(self):
        return self.phi() < self.threshold

# Usage:
detector = PhiAccrualDetector(threshold=8)
# In your heartbeat loop:
# detector.heartbeat()  — call when each heartbeat arrives
# detector.is_alive()   — check if the node is still alive
Interview question: You're running Cassandra with a phi accrual threshold of 8. The network suddenly becomes congested — response times double, and the standard deviation triples. Does the detector immediately start declaring nodes dead?

Chapter 3: Unreliable Clocks

Every computer has a quartz crystal oscillator that ticks at a nominal frequency. But "nominal" is doing a lot of work in that sentence. Quartz crystals drift. A typical crystal oscillates at 32.768 kHz, but temperature, voltage, and manufacturing variance cause it to be slightly fast or slow. A drift of 35 parts per million (ppm) — common for consumer-grade crystals — means the clock gains or loses about 3 seconds per day.

On a single machine, this doesn't matter much. The OS periodically adjusts the clock using NTP (Network Time Protocol) and everything works fine. But in a distributed system, each node has its own clock, and they all drift independently. You cannot assume that "now" on node A is the same as "now" on node B.

Two Types of Clocks

Operating systems expose two fundamentally different clocks. Confusing them causes bugs that are maddeningly difficult to diagnose.

PropertyTime-of-day clockMonotonic clock
What it measuresWall-clock time (seconds since epoch)Elapsed time since some arbitrary point
Python APItime.time()time.monotonic()
Java APISystem.currentTimeMillis()System.nanoTime()
Can jump backwards?YES. NTP step correction can move the clock backwards by seconds.NO. Guaranteed to never decrease.
Synced across nodes?Approximately (via NTP). Accuracy: 10-100ms over internet, 1ms on local network.NO. Only meaningful on a single machine. Not comparable across machines.
Good for ordering events?NO. Drift and jumps make cross-node ordering unreliable.YES — but only on ONE machine.
Good for measuring durations?NO. A jump makes a 5-second operation appear to take -3 seconds.YES — this is exactly what it's for.
The critical rule. Use monotonic clocks for measuring elapsed time (profiling, timeouts, rate limiting). Use time-of-day clocks only when you need to communicate a time to a human or another system — and always remember that the value is approximate. NEVER use time-of-day clocks to determine the order of events across machines.

Why Timestamps Can't Order Events

Consider two database nodes using last-write-wins (LWW) conflict resolution. Each write is stamped with the node's local wall-clock time. The write with the highest timestamp wins.

Here's what happens when clocks are skewed:

// Node 1's clock is 5ms AHEAD of true time
// Node 2's clock is 5ms BEHIND true time
// Total skew: 10ms

// True time 100.000: Client writes x = 1 on Node 1
Node 1 timestamps it as t = 100.005 (clock is fast)

// True time 100.003: Client writes x = 2 on Node 2 (3ms LATER)
Node 2 timestamps it as t = 99.998 (clock is slow)

// LWW conflict resolution: max(100.005, 99.998) = 100.005
// Node 1's OLDER write (x=1) wins! The NEWER write (x=2) is silently discarded.

// The client who wrote x=2 LATER gets no error. // Their write just vanishes. Data loss. Silent corruption.

This is not a theoretical concern. It is one of the most common sources of silent data loss in distributed databases that use LWW (including some default configurations of Cassandra).

Clock Drift Simulator

The simulation below shows three nodes with independent clock drift. NTP corrections occur periodically but introduce jumps. Watch how two events that happen in a clear temporal order get reversed timestamps.

Clock Drift and Timestamp Ordering

Three nodes with drifting clocks. NTP syncs periodically. Click "Write Event" to place events and see how timestamps can disagree with reality.

Drift rate (ppm) 50 ppm
Click Start to begin clock simulation.

Worked Example: How Much Can Clocks Drift?

// Crystal oscillator spec: 35 ppm drift
// NTP sync interval: 5 minutes = 300 seconds

Max drift between syncs = 300s × 35 × 10-6 = 0.0105s = 10.5ms

// Two nodes drifting in opposite directions:
Max skew between two nodes = 2 × 10.5ms = 21ms

// With 1-hour NTP interval (laptop on WiFi):
Max drift = 3600 × 35 × 10-6 = 126ms
Max skew between two nodes = 252ms

// With a temperature excursion (server room AC fails):
// Crystal drift can increase to 100+ ppm
Max drift per hour = 3600 × 100 × 10-6 = 360ms per node
// Two nodes: 720ms of skew. Almost a full second.
The takeaway. Even with NTP, clocks can disagree by tens of milliseconds in normal conditions and hundreds of milliseconds under stress. Any system that relies on wall-clock timestamps being accurate to better than ~100ms across nodes is building on sand.
Interview question: Your distributed database uses last-write-wins (LWW) for conflict resolution with wall-clock timestamps. A developer reports that writes are "randomly disappearing." What is the most likely cause, and what would you change?

Chapter 4: Clock Synchronization and Confidence

NTP gives you a timestamp. But how accurate is it? The terrifying answer: you don't know. NTP measures the round-trip time to a time server and estimates one-way delay as half the round-trip. But if the network is asymmetric (as it often is), this estimate is wrong. NTP gives you a time, but no error bars.

Google decided this was unacceptable. For their globally distributed database Spanner, they needed to know not just "what time is it?" but "what is the earliest and latest time it could be right now?"

Google's TrueTime

Google's TrueTime API uses a combination of GPS receivers and atomic clocks in every datacenter. Instead of returning a single timestamp, it returns a confidence interval:

// Standard clock API:
time.now() → 1694012345.678 // a single number. How accurate? Who knows.

// Google TrueTime API:
TT.now() → [earliest, latest]
TT.now() → [1694012345.671, 1694012345.685]
// "The true time is somewhere in this 14ms window."

// The uncertainty (epsilon) is typically 1-7ms.
// After a GPS/atomic sync: epsilon ≈ 1ms
// Just before the next sync: epsilon ≈ 7ms (crystal drift accumulates)

// Key property: the TRUE time is GUARANTEED to be within the interval. // This is not a statistical estimate. It's a hard bound.

Using Confidence Intervals for Ordering

With confidence intervals, you can determine when two events are definitely ordered versus when their order is ambiguous.

// Event A at node 1: TT.now() = [100, 107]
// Event B at node 2: TT.now() = [110, 116]

// Do the intervals overlap?
A.latest = 107 < B.earliest = 110
// NO OVERLAP. A definitely happened before B.
// We can safely order: A → B

// Now consider:
// Event C at node 1: TT.now() = [100, 107]
// Event D at node 2: TT.now() = [103, 109]

// C.latest = 107, D.earliest = 103
// OVERLAP. We CANNOT determine the order of C and D.
// They are CONCURRENT as far as we can tell.

Spanner uses this to implement external consistency (a stronger guarantee than serializability). When a transaction commits, Spanner deliberately waits out the uncertainty: it pauses for the length of the confidence interval before reporting the commit. This ensures that any transaction that starts after the commit will definitely see it.

The cost of correctness. Spanner's "commit-wait" adds latency equal to the uncertainty interval (typically 1-7ms). This is why Google invested millions in GPS receivers and atomic clocks in every datacenter — reducing the uncertainty interval directly reduces commit latency. Physical infrastructure buys you logical guarantees.

TrueTime Ordering Simulator

Place events on two nodes. Each event has a confidence interval. The simulation shows whether the events can be ordered or are concurrent.

Confidence Interval Ordering

Click "Add Event" to place events on Node 1 and Node 2. The simulation shows whether their confidence intervals overlap (concurrent) or are separated (orderable).

Uncertainty (ε) 5ms
Add events on both nodes to compare ordering.

What If You Don't Have TrueTime?

You're not Google. You don't have atomic clocks. What can you do?

ApproachAccuracyTrade-off
NTP over internet~50-100msFree, but high and unpredictable error
NTP on local network~1-5msRequires dedicated NTP servers in your datacenter
PTP (Precision Time Protocol)~1-100μsRequires hardware timestamping support in NICs and switches
GPS receiver per server~1μs$50-200 per server, needs antenna with sky view
Logical clocksPerfect causal orderingDoesn't tell you wall-clock time at all — only "before/after" relationships
The practical answer. For most systems, use logical clocks (Lamport timestamps or vector clocks) for ordering events, and use physical clocks only for human-facing features like "show posts from the last 24 hours." Logical clocks are immune to drift and skew because they don't measure physical time — they measure causality.
Interview question: Google Spanner's TT.now() returns the interval [100, 107]. A transaction wants to commit. Why does Spanner wait 7ms before declaring the commit successful, and what guarantee does this waiting provide?

Chapter 5: Process Pauses

You've handled unreliable networks and unreliable clocks. Surely you can at least trust that your own code, running on your own machine, executes continuously? No. Your process can stop running for seconds at any time, without warning, and resume as if nothing happened.

Why Your Code Stops

CauseTypical durationWarning?
GC pause (Java/Go/.NET)10ms to 10+ secondsNone. The process freezes mid-instruction.
VM live migration100ms to secondsNone. The hypervisor moves your VM to another host while it's running.
OS context switch under loadMicroseconds to millisecondsNone. The OS schedules another process.
Disk I/O (swapping/thrashing)Milliseconds to secondsNone. A page fault triggers disk read.
Signal handling (SIGSTOP)IndefiniteUser-triggered (ctrl-Z).

The Lease Violation Problem

This is where process pauses become dangerous. Consider a lease — a distributed lock with a timeout. A node acquires a lease that says "you are the leader for the next 10 seconds." The node then processes requests, confident that it's the only leader.

Here's what goes wrong:

// Timeline of a lease violation

t = 0s: Node A acquires lease (valid for 10 seconds)
t = 1s: Node A starts processing a batch write
t = 2s: GC PAUSE BEGINS — Node A's process is frozen
t = 11s: Lease expires. Node A doesn't know — it's frozen.
t = 12s: Node B acquires the lease. Node B is now the leader.
t = 13s: Node B starts accepting writes.
t = 14s: GC PAUSE ENDS — Node A resumes execution.
Node A still thinks it holds the lease (its local variable says so).
Node A writes to the database, conflicting with Node B's writes.

// Result: two leaders, corrupted data.
// Node A had NO WAY to know it was paused. From A's perspective,
// time jumped from t=2s to t=14s in an instant.
The terrifying part. Node A cannot check the lease expiry after resuming from the GC pause, because the pause can happen between checking the lease and using it. There is no safe place to insert a check — the GC can freeze you at any point, including between two adjacent CPU instructions.

The Solution: Fencing Tokens

Since you can't prevent a stale leader from acting, you need the storage layer to reject stale writes. The mechanism is a fencing token: a monotonically increasing number issued with each lease grant.

// Fencing token protocol

t = 0s: Node A acquires lease, receives fencing token = 34
t = 2s: GC pause begins on Node A
t = 12s: Node B acquires lease, receives fencing token = 35
t = 13s: Node B writes to storage with token 35. Storage records: last_token = 35
t = 14s: Node A resumes, tries to write to storage with token 34
Storage checks: 34 < 35 (last_token). REJECTED.

// The storage server enforces ordering, not the client.
// The client can be wrong about whether it holds the lock.
// The storage server is the source of truth.

Fencing Token Simulator

This simulation shows two nodes competing for a lease. Node A gets the lease, then suffers a GC pause. Node B acquires the lease during the pause. When Node A wakes up, see what happens with and without fencing tokens.

GC Pause + Fencing Token Simulator

Watch the scenario play out. Toggle fencing to see the difference: without fencing, data is corrupted. With fencing, the stale write is rejected.

Click Play to see the GC pause scenario.

Python: Fencing Token Check

python
class FencedStorage:
    """Storage server that rejects stale writes using fencing tokens."""

    def __init__(self):
        self.data = {}
        self.max_token = {}  # key -> highest token seen

    def write(self, key, value, fencing_token):
        """Write only if the fencing token is >= the highest seen for this key."""
        if key in self.max_token and fencing_token < self.max_token[key]:
            raise PermissionError(
                f"Stale fencing token: {fencing_token} < {self.max_token[key]}. "
                "Write rejected — you no longer hold the lease."
            )
        self.max_token[key] = fencing_token
        self.data[key] = value
        return True

    def read(self, key):
        return self.data.get(key)

# Demo: the GC pause scenario
storage = FencedStorage()

# Node A gets lease with token 34, writes successfully
storage.write("account_balance", 500, fencing_token=34)
print("Node A wrote 500 with token 34")  # OK

# Node B gets lease with token 35 (after A's lease expired during GC)
storage.write("account_balance", 450, fencing_token=35)
print("Node B wrote 450 with token 35")  # OK

# Node A wakes up from GC, tries to write with old token 34
try:
    storage.write("account_balance", 600, fencing_token=34)
except PermissionError as e:
    print(f"Node A rejected: {e}")  # REJECTED
Interview question: You're designing a lease-based distributed lock for a payment processing system. A colleague says "we'll just check if the lease is still valid before each write." Why is this insufficient, and what mechanism would you add?

Chapter 6: The Byzantine Generals Problem

So far, we've assumed nodes are honest but unreliable: they might crash, they might be slow, but they never lie. A crashed node stops sending messages. A slow node eventually responds with the correct answer. What if a node sends the wrong answer? Or sends different answers to different nodes?

This is a Byzantine fault: a node behaves in an arbitrary, potentially malicious way. It might send "yes" to node A and "no" to node B. It might report a different value from what it stored. It might impersonate another node.

The Original Problem

In 1982, Lamport, Shostak, and Pease published a paper that framed this as a military scenario. Several Byzantine generals surround an enemy city. They need to agree on a plan: attack or retreat. They communicate by messenger. Some generals may be traitors who send conflicting messages to sabotage the consensus.

// The Byzantine Generals Problem

// 4 generals: A, B, C are loyal. T is a traitor.
// The commander (A) sends "ATTACK" to everyone.

A → B: "ATTACK"
A → C: "ATTACK"
A → T: "ATTACK"

// Now each general relays what they received:
B tells C: "A said ATTACK"
B tells T: "A said ATTACK"
C tells B: "A said ATTACK"
C tells T: "A said ATTACK"

// But the traitor LIES:
T tells B: "A said RETREAT" // lie!
T tells C: "A said RETREAT" // lie!

// B's view: 2 say ATTACK (from A and C), 1 says RETREAT (from T)
// B takes majority: ATTACK. Correct!
// C's view: 2 say ATTACK (from A and B), 1 says RETREAT (from T)
// C takes majority: ATTACK. Correct!

// With 4 generals and 1 traitor, majority voting works.
// Key theorem: you need 3f + 1 generals to tolerate f traitors.
// 3(1) + 1 = 4. With 3 generals and 1 traitor, consensus is IMPOSSIBLE.

Why 3f + 1?

With 3 generals (A loyal, B loyal, T traitor):

// Commander A sends "ATTACK" to B and T
// T tells B: "A said RETREAT" (lie)
// B's view: A says ATTACK, T says RETREAT. Tie! B cannot decide.

// Now consider: what if the COMMANDER is the traitor?
// Commander T sends "ATTACK" to A, "RETREAT" to B
// A relays: "T said ATTACK"
// B relays: "T said RETREAT"
// A sees: ATTACK (from T) and RETREAT (from B's relay). Tie!
// B sees: RETREAT (from T) and ATTACK (from A's relay). Tie!

// With only 3 nodes and 1 traitor, B CANNOT distinguish:
// "A is the commander, T is lying" from "T is the commander, telling me RETREAT"
// Provably impossible.

Byzantine Generals Simulator

Four generals try to agree on a plan. One is a traitor. Watch how the loyal generals use majority voting to reach consensus despite the traitor's conflicting messages.

Byzantine Generals: 4 Generals, 1 Traitor

The commander sends a plan. The traitor sends conflicting messages. Watch how majority voting overcomes the traitor.

Click Run to see the Byzantine consensus protocol.

Do You Need Byzantine Fault Tolerance?

Almost certainly not. BFT is expensive — you need 3f+1 nodes to tolerate f faults (versus 2f+1 for crash faults). The extra nodes and message rounds add latency and cost.

EnvironmentNeed BFT?Why
Your datacenterNoYou control the nodes. Bugs exist, but nodes don't intentionally lie.
Blockchain / cryptocurrencyYesNodes are operated by strangers with financial incentives to cheat.
Aircraft flight controlYesHardware faults (cosmic rays, sensor failures) can produce arbitrary values.
Multi-org consortiumMaybeOrganizations might not trust each other's infrastructure.
Practical Byzantine faults in datacenters. Even in trusted environments, some Byzantine-like scenarios exist: a node with corrupted memory (bit flip) sends wrong data but doesn't know it. A misconfigured node speaks the wrong protocol version. A software bug causes one replica to compute different results. Most systems protect against these with checksums, protocol version negotiation, and integration tests — not full BFT.
Interview question: You have a distributed system with 7 nodes. How many Byzantine (arbitrarily faulty) nodes can it tolerate, and why?

Chapter 7: System Models — The Showcase

We've cataloged everything that can go wrong: networks drop messages, clocks drift, processes pause, and nodes can crash (or even lie). Now we need a framework for reasoning about these faults. We need system models.

A system model is a set of assumptions about what kinds of faults can happen. It's the contract between the algorithm designer and reality. "If the world behaves at least this well, then my algorithm guarantees these properties."

Crash Models

How do nodes fail?

ModelAssumptionExample
Crash-stopA node that crashes is gone forever. It never recovers.A process killed with SIGKILL. A server removed from the rack.
Crash-recoveryA node that crashes may restart later with its durable state intact (but in-memory state is lost).A server reboots after a power failure. The process is restarted by systemd.
ByzantineA node may behave in any arbitrary way: crash, lie, send contradictory messages.A compromised server. Hardware with corrupted memory.

Timing Models

How predictable are delays?

ModelAssumptionReal-world match
SynchronousThere is a known upper bound on message delay and clock drift. If a message doesn't arrive in time, the node is definitely crashed.Hard real-time systems with dedicated networks (assembly line controllers)
Partially synchronousThe system behaves synchronously MOST of the time, but can occasionally exceed bounds (during network congestion, GC pauses, etc.).Most datacenter systems. The model used by most practical algorithms.
AsynchronousNo timing assumptions at all. Messages can take arbitrarily long. You cannot use timeouts because there is no "too long."Theoretical model. No real system is truly asynchronous, but it gives the strongest impossibility results.
The practical choice. Most real distributed systems assume crash-recovery + partial synchrony. This means: nodes can crash and restart (but we trust them not to lie), and the network is usually fast but occasionally slow. Algorithms designed for this model (like Raft, Paxos, ZAB) work well in practice because datacenters usually behave synchronously, and the occasional asynchronous period causes temporary unavailability, not incorrectness.

The Big Simulation: Distributed System Under Stress

Five nodes in a partially synchronous network running a simplified consensus protocol. You are the chaos monkey. Crash nodes, partition the network, trigger GC pauses, and delay messages. Watch how the system handles each scenario — and when it breaks.

Distributed System Stress Test

Click nodes to crash them. Use buttons to partition the network or trigger GC pauses. The system tries to maintain a leader and accept writes. Watch what survives and what doesn't.

Click Start to launch the 5-node cluster.

Things to try:

The key insight from this simulation. Under crash-recovery + partial synchrony, the system can always preserve safety (never give a wrong answer). But it sometimes sacrifices liveness (it might stop accepting requests until the partition heals or a new leader is elected). This is the fundamental tradeoff described by the FLP impossibility result, which we'll discuss in Chapter 8.

Chapter 8: Knowledge and Truth

Here is a thought experiment that crystallizes the central challenge of distributed systems.

You are a node. You just processed a request and sent a response. But your garbage collector kicked in for 15 seconds right after you sent it. When you wake up, you think only a millisecond has passed. Your lease has expired. Another node is now the leader. But you don't know any of this. From your perspective, nothing happened.

A node cannot trust its own judgment. It might be paused. Its clock might be wrong. Its network might be partitioned. It has no way to tell. This is the fundamental asymmetry of distributed systems: a node can observe its own state, but it cannot observe its own reachability or timeliness from the outside.

The Truth Is Defined by the Quorum

Since no individual node can be trusted to assess reality, the only reliable source of truth is a quorum — a majority of nodes agreeing on a fact.

DecisionWho decidesWhy not the individual?
"Node X is dead"A quorum of nodesNode X might be alive but partitioned from the observer
"Node Y is the leader"A quorum of nodesY might think it's the leader but its lease expired during a GC pause
"Value V is committed"A quorum of nodesA single node might have a stale or corrupted copy
"The lock is held by Z"A quorum of nodes (or the storage with fencing)Z might have crashed and not know it yet
The quorum principle. In a distributed system, truth is not what any one node believes — it is what a majority of nodes agree on. This is why consensus algorithms (Raft, Paxos) require majority votes for every decision: leader election, log commitment, membership changes. A minority can always be wrong or partitioned.

The FLP Impossibility

In 1985, Fischer, Lynch, and Paterson proved one of the most important results in distributed computing. The FLP impossibility theorem states:

FLP Impossibility. In a fully asynchronous system where even one process can crash, there is no deterministic algorithm that guarantees consensus will be reached. Put differently: if you make zero timing assumptions (no timeouts, no upper bound on message delay), then it is mathematically impossible to guarantee that a group of nodes will agree on a value, even if only one node might crash.

This sounds devastating. If consensus is impossible, how do Raft and Paxos work?

The key word is asynchronous. FLP applies to the fully asynchronous model. Real systems are partially synchronous: they have timeouts, and most of the time the network is well-behaved. Practical consensus algorithms exploit this:

// How Raft sidesteps FLP:

// FLP says: in a FULLY ASYNCHRONOUS system, you can't guarantee
// consensus if even one node can crash.

// Raft says: I use TIMEOUTS. If I don't hear from the leader
// within my election timeout, I start an election.

// During a period of asynchrony (network congestion, GC pause):
// - Raft might not make progress (no leader elected, no writes committed)
// - But it never violates SAFETY (never commits two different values)

// When synchrony returns (network stabilizes):
// - Raft elects a leader and resumes operation
// - Progress is restored

// The tradeoff: Raft gives up LIVENESS during asynchronous periods
// in exchange for SAFETY always. FLP says you can't have both.

Quorum Decision Simulator

Five nodes try to decide if a particular node is dead. Each has its own view (based on whether it can reach the suspect node). A quorum (3 of 5) must agree. See how partitions can cause different nodes to have different views — but the quorum still makes a consistent decision.

Quorum Decision Making

Click nodes to toggle their connectivity to the suspect node (Node 3). The quorum decides: if 3+ nodes can't reach Node 3, it is declared dead.

Click on nodes 1, 2, 4, or 5 to toggle connectivity to Node 3.

Safety vs. Liveness

Every distributed algorithm makes a choice between two properties:

PropertyDefinitionExample
SafetyNothing bad ever happens. If the system gives you an answer, it's correct."Two leaders are never elected for the same term." "A committed value is never lost."
LivenessSomething good eventually happens. The system makes progress."A leader is eventually elected." "A submitted write is eventually committed."

FLP tells us that in an asynchronous system, you cannot guarantee both safety and liveness in the presence of faults. Practical algorithms choose safety always, liveness when possible. During a partition, the system stops accepting writes (sacrifices liveness) rather than accepting potentially conflicting writes (sacrificing safety).

Interview question: The FLP impossibility says consensus is impossible in an asynchronous system. Yet Raft, Paxos, and ZAB all achieve consensus in practice. Are they violating the FLP result?

Chapter 9: Interview Arsenal

This chapter is your cheat sheet. Every failure mode, every mitigation, every pattern you need for distributed systems interviews — organized by dimension.

Failure Mode Reference

FailureSymptomMitigationInterview trigger
Network partitionSome nodes unreachable, others fineQuorum-based decisions, fencing tokens"What happens if your services can't talk to each other?"
Message lossRequest or response disappearsRetries + idempotency keys"How do you handle a failed API call?"
Message delayResponse arrives after timeoutAdaptive timeouts (phi accrual), deduplication"How do you set your timeout values?"
Clock skewTimestamps disagree across nodesLogical clocks, vector clocks, TrueTime"How do you order events across services?"
GC pauseProcess freezes for secondsFencing tokens, short leases"What if the leader is GC'd?"
Split brainTwo nodes both think they're leaderEpoch-based fencing, quorum leases"How do you prevent two leaders?"
Partial failureOne of N steps completesTransactions, saga pattern, idempotency"How do you handle a crash mid-operation?"

System Design Questions

"Design a distributed lock service." Key points: lease-based (not indefinite), fencing tokens for safety, quorum for leader election, handle GC pauses (client might hold a stale lock). Mention ZooKeeper or etcd as real implementations. The lock service is NOT the source of truth — the fencing token check at the storage layer is.
"Design a system that tolerates network partitions." Clarify the requirement: do you need strong consistency (CP) or availability (AP)? CP: use consensus (Raft/Paxos), accept unavailability during partition. AP: use CRDTs or LWW, accept temporary inconsistency. Mention CAP theorem but note its limitations (it's about a very specific tradeoff, not a general design framework).
"How does garbage collection affect distributed systems?" GC pauses look like node crashes to the rest of the cluster. The paused node might hold a lease that expires during the pause. When it wakes up, it doesn't know time passed. Mitigations: fencing tokens, GC-aware heartbeats (treat a long GC as a voluntary step-down), JVM tuning (G1GC, ZGC for low-pause), pre-emptive GC notification in JVM 11+.

Coding Drills

Drill 1: Idempotency layer

python
import hashlib
import time

class IdempotencyLayer:
    """Prevents duplicate processing of requests."""

    def __init__(self, ttl_seconds=3600):
        self.seen = {}  # idempotency_key -> (result, timestamp)
        self.ttl = ttl_seconds

    def execute(self, idempotency_key, operation):
        """Execute operation only if this key hasn't been seen."""
        self._evict_expired()

        if idempotency_key in self.seen:
            # Already processed — return cached result
            return self.seen[idempotency_key][0]

        # First time seeing this key — execute and cache
        result = operation()
        self.seen[idempotency_key] = (result, time.monotonic())
        return result

    def _evict_expired(self):
        now = time.monotonic()
        expired = [k for k, (_, t) in self.seen.items() if now - t > self.ttl]
        for k in expired:
            del self.seen[k]

# Usage:
idem = IdempotencyLayer()
result = idem.execute("charge-user-123-order-456", lambda: charge_credit_card(50))
# If retried with same key, returns cached result without re-charging

Drill 2: Simple lease with fencing

python
import time
import threading

class LeaseService:
    """Simplified lease service with fencing tokens."""

    def __init__(self):
        self.lock = threading.Lock()
        self.current_holder = None
        self.expiry = 0
        self.next_token = 1

    def acquire(self, node_id, duration_sec=10):
        """Try to acquire lease. Returns (success, fencing_token)."""
        with self.lock:
            now = time.monotonic()
            if self.current_holder is not None and now < self.expiry:
                return False, None  # Lease held by someone else

            token = self.next_token
            self.next_token += 1
            self.current_holder = node_id
            self.expiry = now + duration_sec
            return True, token

    def release(self, node_id):
        with self.lock:
            if self.current_holder == node_id:
                self.current_holder = None
                self.expiry = 0

Debug Scenarios

"Intermittent data corruption across replicas." Diagnosis: clock skew + LWW. Two replicas receive conflicting writes. The one with the "higher" timestamp wins — but clock skew means the newer write might get a lower timestamp. The older write wins, and the newer write is silently discarded. Evidence: check NTP sync status across nodes. Fix: switch from wall-clock LWW to logical timestamps (Lamport clocks or vector clocks), or use CRDT-based conflict resolution.
"Lease violations during high GC pressure." Diagnosis: Java services with large heaps run full GC, pausing for seconds. During the pause, their leases expire. When the GC completes, the node resumes and acts as if it still holds the lease. Evidence: correlate GC logs with lease violation alerts. Fix: fencing tokens at the storage layer, GC tuning (smaller heap, G1GC or ZGC), or proactive GC scheduling with jmap/jcmd.
"Split-brain after network partition." Diagnosis: the network splits and both halves elect a leader. When the partition heals, you have two leaders with conflicting state. Evidence: check leader election logs — two nodes claim leadership for overlapping terms. Fix: use epoch-based fencing (each leader has a monotonically increasing epoch number), quorum leases (leader must maintain contact with a majority to keep its lease), and fencing tokens for all writes.

Key Papers

PaperYearKey idea
Time, Clocks, and the Ordering of Events in a Distributed System (Lamport)1978Logical clocks, happens-before relation. The foundation of all distributed ordering.
Impossibility of Distributed Consensus with One Faulty Process (Fischer, Lynch, Paterson)1985FLP impossibility. No deterministic consensus in fully asynchronous systems.
The Byzantine Generals Problem (Lamport, Shostak, Pease)1982BFT requires 3f+1 nodes for f faults. The original formulation.
Spanner: Google's Globally-Distributed Database (Corbett et al.)2012TrueTime API. GPS + atomic clocks for bounded clock uncertainty.
The φ Accrual Failure Detector (Hayashibara et al.)2004Adaptive failure detection based on observed heartbeat distribution.
Interview question: You're designing a payment system that debits one account and credits another. The system spans two microservices (Accounts and Payments), each with its own database. The network between them can fail at any time. How do you ensure that money is never lost or double-counted?

Chapter 10: Connections

This chapter covered the problems of distributed systems — the hostile environment that algorithms must survive. Now you need the solutions: the algorithms and protocols that provide guarantees despite these faults.

Where This Chapter Fits

ChapterTopicRelationship
DDIA Ch. 6: ReplicationCopying data across nodesReplication must handle all the faults we described: network partitions, clock skew (for LWW), process pauses (for leader failover). This chapter explains WHY replication is hard.
DDIA Ch. 8: TransactionsGrouping operations atomicallyDistributed transactions (2PC, sagas) must survive network partitions and node crashes. This chapter explains the environment that makes distributed transactions so expensive.
DDIA Ch. 10: ConsensusGetting nodes to agreeConsensus (Raft, Paxos) is the PRIMARY solution to the problems in this chapter. It works because it assumes partial synchrony (not full asynchrony, where FLP makes it impossible).

The Mental Model

After this chapter, you should have a three-layer mental model:

Layer 1: The Hostile Environment
Networks are unreliable. Clocks drift. Processes pause. Nodes crash. (This chapter)
Layer 2: The Theoretical Limits
FLP impossibility: can't guarantee consensus in async systems. CAP theorem: can't have consistency + availability during partition. BFT: need 3f+1 for f traitors.
Layer 3: Practical Algorithms
Raft, Paxos, ZAB: achieve consensus under partial synchrony by trading liveness for safety. Fencing tokens, logical clocks, idempotency keys: engineering tools that make systems robust.

Summary of Key Principles

PrincipleOne-liner
You can't distinguish failure modesA timeout tells you nothing about why the response didn't arrive.
Clocks lieNever use wall-clock timestamps to determine the order of events across machines.
Your process can pauseA GC pause, VM migration, or context switch can stop your code for seconds without warning.
Trust the quorum, not yourselfNo individual node can reliably assess its own status. Only a majority decision is trustworthy.
Safety over livenessBetter to stop accepting writes than to accept conflicting writes.
Fence everythingEvery write to shared storage should carry a fencing token that the storage can verify.
Make it idempotentIf a client might retry, the operation must produce the same result on duplicate execution.
Closing thought. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport. The problems in this chapter are not bugs to be fixed. They are the fundamental nature of distributed computing. The art is not eliminating them — it is building systems that work correctly despite them.

Related Lessons

Final question: You're in a system design interview. The interviewer says: "Assume the network is perfectly reliable." How should you respond?