The Trouble with Distributed Systems — From Absolute Zero to Mastery

Chapter 0: The Problem

You have a single computer on your desk. You write a program that reads a file, does some math, and writes the result. If the hardware is working, the program does the same thing every time. The file is either there or it is not. The CPU either adds correctly or it is broken. There is no ambiguity.

Now imagine you have three computers connected by a network. You send a message from computer A to computer B, asking it to store a value. Then you ask computer C to read that value. Simple, right?

No. Here is what can happen, all at once, with no warning:

Your message to B was delayed by 3 seconds in a network switch buffer.
B received the message and stored the value, but the response got lost. You don't know if B got it or not.
C's clock is 14 milliseconds ahead of B's clock. It thinks the write hasn't happened yet.
B's garbage collector froze the process for 800ms right after storing the value. During that pause, A's timeout fired and A declared B dead.
The network partitioned: A can talk to C but not B. B can talk to C but not A. Everyone is alive. Nobody agrees on the state of the system.

A single computer is deterministic: operations either work or they don't. A distributed system is nondeterministic: operations can partially succeed, silently fail, take arbitrarily long, or produce different results on different nodes at the same moment.

The core reality. In a distributed system, there is no shared memory, no shared clock, and no shared fate. The network is the only connection between nodes, and the network lies. This chapter is the "reality check" — every failure mode you must internalize before you can design systems that tolerate them.

A Single Machine vs. A Distributed System

The simulation below shows the contrast. On the left, a single machine executes three operations in sequence: read, compute, write. Predictable. Boring. Beautiful.

On the right, three nodes try to coordinate the same operations over a network. Click "Inject Chaos" to introduce the failures that happen in real distributed systems — every single day, in every datacenter on Earth.

Single Machine vs. Distributed Chaos

Left: a single machine runs operations deterministically. Right: three networked nodes. Click "Inject Chaos" to see what goes wrong.

Click Run to execute operations.

The single machine always succeeds. The distributed system might drop a message, delay a response, or have a node pause at exactly the wrong moment. And the terrifying part: you cannot tell which failure occurred. From the sender's perspective, a dropped message looks identical to a delayed response looks identical to a crashed node.

Why Not Just Use One Computer?

If distributed systems are this painful, why use them? Three reasons:

Reason	Why one machine fails	What distribution gives you
Scalability	One machine has finite CPU, RAM, disk	Add more machines to handle more load
Fault tolerance	One machine is a single point of failure	If one node dies, others continue serving
Latency	Users are geographically distributed	Place data close to users

You must distribute. So you must understand what breaks.

The structure of this lesson. We'll cover the three categories of faults: network faults (messages are delayed, dropped, duplicated), clock faults (timestamps disagree), and process faults (your code stops running unexpectedly). Then we'll see what we can still guarantee in this hostile environment.

Interview question: You send an RPC from service A to service B and don't receive a response within your timeout. What do you know for certain about whether B processed the request?

B definitely did not process it — the timeout means B is down B definitely processed it — the request was sent successfully You know nothing. The request may have been lost, B may have processed it and the response was lost, or B may still be processing it. A timeout tells you only that you did not receive a reply in time — not WHY.

Chapter 1: Unreliable Networks

The internet and most datacenter networks are asynchronous packet networks. When you send a message, the network gives you absolutely no guarantee about when (or whether) it will arrive. Here is the complete list of things that can go wrong when node A sends a request to node B:

1. Request Lost

The packet was dropped by a congested switch, a flapping link, or a misconfigured firewall. It never reaches B.

↓

2. Request Queued

The packet arrives at B's machine but sits in a TCP buffer or application queue because B is overloaded. It will be processed... eventually.

↓

3. Remote Node Crashed

B's process, OS, or hardware died. The packet either reached B before the crash (and B processed it but couldn't respond) or after (and was silently discarded by the OS).

↓

4. Response Lost

B processed the request and sent a response, but the response packet was lost on the way back. B thinks it succeeded. A thinks it failed.

↓

5. Response Delayed

B responded, but the response is stuck in a network queue. It will arrive in 30 seconds, long after A's timeout fires.

From A's perspective, all five scenarios look identical: A sent a request and did not receive a response. That's it. There is no way for A to distinguish between them.

The fundamental problem. When you don't get a response, you don't know whether your request was processed. If the operation was not idempotent (e.g., "charge this credit card"), retrying might double-charge the customer. Not retrying might mean the original charge was lost. This is why idempotency is a first-class design concern in distributed systems.

Network Partitions

A network partition (or "netsplit") is when the network splits into two or more groups of nodes that can communicate within the group but not across groups. All nodes are alive and processing requests. They just can't see each other.

This is different from a node crashing. In a crash, the failed node stops doing things. In a partition, both sides keep operating — potentially making conflicting decisions, accepting conflicting writes, electing conflicting leaders.

Partitions are not rare. A 2011 study of Microsoft datacenters found that network failures caused about 5 link failures per day, each affecting tens of thousands of devices. Google's internal data shows similar numbers. In cloud environments, partial partitions (where some links fail but others work) are especially insidious.

Simulate the Failure Modes

The simulation below shows node A sending a request to node B. Use the buttons to inject different failure modes and observe that A always sees the same thing: silence.

Network Failure Simulator

Send a request from A to B. Then inject a specific failure to see what happens.

Click Send Request, then inject a failure.

Timeouts: The Only Tool

Since you can't distinguish failure modes, the only option is to set a timeout: wait for some duration, and if no response arrives, assume something went wrong. But what should the timeout be?

Timeout	Upside	Downside
Short (100ms)	Fast failure detection	Healthy-but-slow nodes declared dead (false positives). Triggers unnecessary failover, increases load on healthy nodes
Long (30s)	Fewer false positives	Real failures take 30 seconds to detect. Users see 30 seconds of errors

There is no universally correct timeout. It depends on your network, your workload, and how bad false positives are compared to slow detection. We'll explore this tradeoff deeply in Chapter 2.

Real-world numbers. Within a datacenter, round-trip time (RTT) is typically under 1ms. Across availability zones: 1-5ms. Across continents: 100-300ms. But these are medians. The 99th percentile can be 10-100x higher due to queueing, TCP retransmissions, and congestion. A "1ms" datacenter network can spike to 100ms during a top-of-rack switch failover.

Interview question: Your service A calls service B. B successfully processes the request and sends a response, but the response packet is lost in the network. A's timeout fires and it retries the request. B processes the request a second time. If the request is "deduct $50 from account X," what is the consequence, and how would you prevent it?

No consequence — TCP guarantees exactly-once delivery $100 is deducted instead of $50. Prevention: make the operation idempotent by using a unique request ID (idempotency key). B stores processed request IDs and returns the cached response for duplicates instead of re-processing. The retry will fail because B is already processing the first request

Chapter 2: Detecting Failures

You're running a cluster of five database replicas. One of them stops responding. Is it dead? Is it just slow? Is the network between you and it broken while it's still serving other clients? You need to decide — because if it's dead, you need to redirect traffic and maybe elect a new leader. If it's just slow, declaring it dead could cause a cascade of unnecessary failovers.

This is the failure detection problem. And your only tool is timeouts.

The Timeout Dilemma: Worked Example

Let's make this concrete. Suppose your service sends heartbeats every 1 second. Normally, responses arrive within 5ms. You need to decide: how many missed heartbeats constitute a failure?

// Scenario: 5 database replicas, heartbeat every 1 second
// Normal RTT: 1-5ms (99th percentile: 50ms)

// Option A: timeout = 2 missed heartbeats = 2 seconds
False positive rate: a 99.9th percentile network spike of 2+ seconds
triggers failover for a healthy node.
In a busy datacenter, this happens ~once per day per node.

// Option B: timeout = 10 missed heartbeats = 10 seconds
False positive rate: nearly zero (10-second network spikes are very rare).
But: if a node actually crashes, it takes 10 seconds to detect.
10 seconds of requests routed to a dead node = 10 seconds of errors.

// The math of false positives:
P(false positive) = P(RTT > timeout) × heartbeats_per_second
Lower timeout → higher P(RTT > timeout) → more false positives
Higher timeout → longer detection time for real failures

Where Delays Come From

Network delays are not random noise — they have specific, mechanical causes. Understanding these helps you reason about reasonable timeout values.

Source of delay	Typical magnitude	Why it happens
Switch queue	Microseconds to milliseconds	Multiple ports sending to the same output port. Packets queue in the switch's buffer.
TCP retransmission	200ms to seconds	A packet is lost. TCP waits for an ACK timeout, then retransmits. Each retry doubles the wait (exponential backoff).
OS TCP buffer	Microseconds to milliseconds	The receiving OS accepted the packet but the application hasn't called read() yet. The packet sits in the kernel's socket buffer.
VM pause	Tens of milliseconds to seconds	Virtual machine live migration, or the hypervisor scheduling another VM on the same CPU core.
GC pause	Milliseconds to seconds	The application's garbage collector stops the world. The process cannot respond until GC completes.
Context switch	Microseconds to milliseconds	The OS schedules another process on the same CPU core. Under high load, your heartbeat response gets delayed.

Notice that most of these are queueing problems. Packets, threads, and processes all queue. The more loaded the system, the longer the queues, the higher the latency. This is why the 99th percentile of response time is far worse than the median.

The Phi Accrual Failure Detector

Rather than a fixed timeout, the phi accrual failure detector (used by Akka, Cassandra) adapts based on observed response times. Instead of a binary "alive or dead," it outputs a suspicion level (φ) that increases over time since the last heartbeat.

// Phi accrual failure detector

// 1. Track recent heartbeat intervals
intervals = [1002ms, 998ms, 1005ms, 1001ms, 997ms, ...]

// 2. Compute mean and standard deviation
μ = mean(intervals) = 1000.6ms
σ = stdev(intervals) = 3.1ms

// 3. When a heartbeat is late by t ms, compute phi
φ(t) = -log₁₀(1 - F(t))
// where F(t) is the CDF of the normal distribution N(μ, σ²)

// 4. Threshold: declare failure when φ > threshold
// φ = 1 → P(mistake) = 10%
// φ = 2 → P(mistake) = 1%
// φ = 3 → P(mistake) = 0.1%
// φ = 8 → P(mistake) = 0.00001% (Cassandra default)

The beauty: if the network gets noisier (higher σ), the detector automatically becomes more lenient. If the network is rock-solid (low σ), it detects failures faster.

Timeout Tuning Simulator

The simulation below shows a node sending heartbeats. The response time varies randomly. Use the timeout slider to see the tradeoff: too short and you get false positives (declaring a healthy node dead), too long and you're slow to detect a real failure.

Timeout Tuning Simulator

Adjust the timeout. Watch false positives (red flashes) when it's too short, and slow detection when it's too long. The node will actually crash after ~15 heartbeats.

Timeout 200ms

Adjust timeout, then click Start.

Python: Simple Phi Accrual Detector

python
import math
import time
from collections import deque

class PhiAccrualDetector:
    """Phi accrual failure detector (simplified Cassandra-style)."""

    def __init__(self, threshold=8, window_size=100):
        self.threshold = threshold       # phi above this = declared dead
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self):
        """Called when a heartbeat arrives."""
        now = time.monotonic()
        if self.last_heartbeat is not None:
            interval = (now - self.last_heartbeat) * 1000  # ms
            self.intervals.append(interval)
        self.last_heartbeat = now

    def phi(self):
        """Current suspicion level. Higher = more likely dead."""
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0

        now = time.monotonic()
        elapsed = (now - self.last_heartbeat) * 1000  # ms since last heartbeat

        # Compute mean and variance of intervals
        mean = sum(self.intervals) / len(self.intervals)
        variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = math.sqrt(variance) if variance > 0 else 0.1

        # CDF of normal distribution: P(X <= elapsed)
        z = (elapsed - mean) / std
        cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))

        # phi = -log10(1 - cdf), clamped to avoid log(0)
        p_late = max(1e-15, 1.0 - cdf)
        return -math.log10(p_late)

    def is_alive(self):
        return self.phi() < self.threshold

# Usage:
detector = PhiAccrualDetector(threshold=8)
# In your heartbeat loop:
# detector.heartbeat()  — call when each heartbeat arrives
# detector.is_alive()   — check if the node is still alive

Interview question: You're running Cassandra with a phi accrual threshold of 8. The network suddenly becomes congested — response times double, and the standard deviation triples. Does the detector immediately start declaring nodes dead?

Yes — doubled response times will exceed any threshold It depends on the window size and how quickly the mean updates No — the phi accrual detector adapts. As the observed intervals get longer and noisier, the mean (μ) and standard deviation (σ) in the sliding window update. A heartbeat arriving at 2× the old mean but at 1× the new mean gets a low phi. The detector self-calibrates to the current network conditions. That's the whole point.

Chapter 3: Unreliable Clocks

Every computer has a quartz crystal oscillator that ticks at a nominal frequency. But "nominal" is doing a lot of work in that sentence. Quartz crystals drift. A typical crystal oscillates at 32.768 kHz, but temperature, voltage, and manufacturing variance cause it to be slightly fast or slow. A drift of 35 parts per million (ppm) — common for consumer-grade crystals — means the clock gains or loses about 3 seconds per day.

On a single machine, this doesn't matter much. The OS periodically adjusts the clock using NTP (Network Time Protocol) and everything works fine. But in a distributed system, each node has its own clock, and they all drift independently. You cannot assume that "now" on node A is the same as "now" on node B.

Two Types of Clocks

Operating systems expose two fundamentally different clocks. Confusing them causes bugs that are maddeningly difficult to diagnose.

Property	Time-of-day clock	Monotonic clock
What it measures	Wall-clock time (seconds since epoch)	Elapsed time since some arbitrary point
Python API	`time.time()`	`time.monotonic()`
Java API	`System.currentTimeMillis()`	`System.nanoTime()`
Can jump backwards?	YES. NTP step correction can move the clock backwards by seconds.	NO. Guaranteed to never decrease.
Synced across nodes?	Approximately (via NTP). Accuracy: 10-100ms over internet, 1ms on local network.	NO. Only meaningful on a single machine. Not comparable across machines.
Good for ordering events?	NO. Drift and jumps make cross-node ordering unreliable.	YES — but only on ONE machine.
Good for measuring durations?	NO. A jump makes a 5-second operation appear to take -3 seconds.	YES — this is exactly what it's for.

The critical rule. Use monotonic clocks for measuring elapsed time (profiling, timeouts, rate limiting). Use time-of-day clocks only when you need to communicate a time to a human or another system — and always remember that the value is approximate. NEVER use time-of-day clocks to determine the order of events across machines.

Why Timestamps Can't Order Events

Consider two database nodes using last-write-wins (LWW) conflict resolution. Each write is stamped with the node's local wall-clock time. The write with the highest timestamp wins.

Here's what happens when clocks are skewed:

// Node 1's clock is 5ms AHEAD of true time
// Node 2's clock is 5ms BEHIND true time
// Total skew: 10ms

// True time 100.000: Client writes x = 1 on Node 1
Node 1 timestamps it as t = 100.005 (clock is fast)

// True time 100.003: Client writes x = 2 on Node 2 (3ms LATER)
Node 2 timestamps it as t = 99.998 (clock is slow)

// LWW conflict resolution: max(100.005, 99.998) = 100.005
// Node 1's OLDER write (x=1) wins! The NEWER write (x=2) is silently discarded.

// The client who wrote x=2 LATER gets no error. // Their write just vanishes. Data loss. Silent corruption.

This is not a theoretical concern. It is one of the most common sources of silent data loss in distributed databases that use LWW (including some default configurations of Cassandra).

Clock Drift Simulator

The simulation below shows three nodes with independent clock drift. NTP corrections occur periodically but introduce jumps. Watch how two events that happen in a clear temporal order get reversed timestamps.

Clock Drift and Timestamp Ordering

Three nodes with drifting clocks. NTP syncs periodically. Click "Write Event" to place events and see how timestamps can disagree with reality.

Drift rate (ppm) 50 ppm

Click Start to begin clock simulation.

Worked Example: How Much Can Clocks Drift?

// Crystal oscillator spec: 35 ppm drift
// NTP sync interval: 5 minutes = 300 seconds

Max drift between syncs = 300s × 35 × 10^-6 = 0.0105s = 10.5ms

// Two nodes drifting in opposite directions:
Max skew between two nodes = 2 × 10.5ms = 21ms

// With 1-hour NTP interval (laptop on WiFi):
Max drift = 3600 × 35 × 10^-6 = 126ms
Max skew between two nodes = 252ms

// With a temperature excursion (server room AC fails):
// Crystal drift can increase to 100+ ppm
Max drift per hour = 3600 × 100 × 10^-6 = 360ms per node
// Two nodes: 720ms of skew. Almost a full second.

The takeaway. Even with NTP, clocks can disagree by tens of milliseconds in normal conditions and hundreds of milliseconds under stress. Any system that relies on wall-clock timestamps being accurate to better than ~100ms across nodes is building on sand.

Interview question: Your distributed database uses last-write-wins (LWW) for conflict resolution with wall-clock timestamps. A developer reports that writes are "randomly disappearing." What is the most likely cause, and what would you change?

Clock skew between nodes causes newer writes to get lower timestamps than older writes on other nodes. The "newer" write is discarded by LWW because its timestamp is lower. Replace wall-clock LWW with logical clocks (Lamport timestamps or vector clocks) that respect causal ordering regardless of physical clock skew. Network partitions are dropping write acknowledgments The database is running out of disk space and silently dropping writes

Chapter 4: Clock Synchronization and Confidence

NTP gives you a timestamp. But how accurate is it? The terrifying answer: you don't know. NTP measures the round-trip time to a time server and estimates one-way delay as half the round-trip. But if the network is asymmetric (as it often is), this estimate is wrong. NTP gives you a time, but no error bars.

Google decided this was unacceptable. For their globally distributed database Spanner, they needed to know not just "what time is it?" but "what is the earliest and latest time it could be right now?"

Google's TrueTime

Google's TrueTime API uses a combination of GPS receivers and atomic clocks in every datacenter. Instead of returning a single timestamp, it returns a confidence interval:

// Standard clock API:
time.now() → 1694012345.678 // a single number. How accurate? Who knows.

// Google TrueTime API:
TT.now() → [earliest, latest]
TT.now() → [1694012345.671, 1694012345.685]
// "The true time is somewhere in this 14ms window."

// The uncertainty (epsilon) is typically 1-7ms.
// After a GPS/atomic sync: epsilon ≈ 1ms
// Just before the next sync: epsilon ≈ 7ms (crystal drift accumulates)

// Key property: the TRUE time is GUARANTEED to be within the interval. // This is not a statistical estimate. It's a hard bound.

Using Confidence Intervals for Ordering

With confidence intervals, you can determine when two events are definitely ordered versus when their order is ambiguous.

// Event A at node 1: TT.now() = [100, 107]
// Event B at node 2: TT.now() = [110, 116]

// Do the intervals overlap?
A.latest = 107 < B.earliest = 110
// NO OVERLAP. A definitely happened before B.
// We can safely order: A → B

// Now consider:
// Event C at node 1: TT.now() = [100, 107]
// Event D at node 2: TT.now() = [103, 109]

// C.latest = 107, D.earliest = 103
// OVERLAP. We CANNOT determine the order of C and D.
// They are CONCURRENT as far as we can tell.

Spanner uses this to implement external consistency (a stronger guarantee than serializability). When a transaction commits, Spanner deliberately waits out the uncertainty: it pauses for the length of the confidence interval before reporting the commit. This ensures that any transaction that starts after the commit will definitely see it.

The cost of correctness. Spanner's "commit-wait" adds latency equal to the uncertainty interval (typically 1-7ms). This is why Google invested millions in GPS receivers and atomic clocks in every datacenter — reducing the uncertainty interval directly reduces commit latency. Physical infrastructure buys you logical guarantees.

TrueTime Ordering Simulator

Place events on two nodes. Each event has a confidence interval. The simulation shows whether the events can be ordered or are concurrent.

Confidence Interval Ordering

Click "Add Event" to place events on Node 1 and Node 2. The simulation shows whether their confidence intervals overlap (concurrent) or are separated (orderable).

Uncertainty (ε) 5ms

Add events on both nodes to compare ordering.

What If You Don't Have TrueTime?

You're not Google. You don't have atomic clocks. What can you do?

Approach	Accuracy	Trade-off
NTP over internet	~50-100ms	Free, but high and unpredictable error
NTP on local network	~1-5ms	Requires dedicated NTP servers in your datacenter
PTP (Precision Time Protocol)	~1-100μs	Requires hardware timestamping support in NICs and switches
GPS receiver per server	~1μs	$50-200 per server, needs antenna with sky view
Logical clocks	Perfect causal ordering	Doesn't tell you wall-clock time at all — only "before/after" relationships

The practical answer. For most systems, use logical clocks (Lamport timestamps or vector clocks) for ordering events, and use physical clocks only for human-facing features like "show posts from the last 24 hours." Logical clocks are immune to drift and skew because they don't measure physical time — they measure causality.

Interview question: Google Spanner's TT.now() returns the interval [100, 107]. A transaction wants to commit. Why does Spanner wait 7ms before declaring the commit successful, and what guarantee does this waiting provide?

Spanner waits until the uncertainty window has passed so that the commit timestamp is definitely in the past for ALL nodes. Any transaction that starts after the wait will have a TT.now().earliest that is after the commit timestamp, guaranteeing it sees the committed data. This is called "commit-wait" and it provides external consistency: if transaction T1 commits before T2 starts (in real time), then T2 is guaranteed to see T1's writes. Spanner waits for the network round-trip to complete Spanner waits for all replicas to acknowledge the write

Chapter 5: Process Pauses

You've handled unreliable networks and unreliable clocks. Surely you can at least trust that your own code, running on your own machine, executes continuously? No. Your process can stop running for seconds at any time, without warning, and resume as if nothing happened.

Why Your Code Stops

Cause	Typical duration	Warning?
GC pause (Java/Go/.NET)	10ms to 10+ seconds	None. The process freezes mid-instruction.
VM live migration	100ms to seconds	None. The hypervisor moves your VM to another host while it's running.
OS context switch under load	Microseconds to milliseconds	None. The OS schedules another process.
Disk I/O (swapping/thrashing)	Milliseconds to seconds	None. A page fault triggers disk read.
Signal handling (SIGSTOP)	Indefinite	User-triggered (ctrl-Z).

The Lease Violation Problem

This is where process pauses become dangerous. Consider a lease — a distributed lock with a timeout. A node acquires a lease that says "you are the leader for the next 10 seconds." The node then processes requests, confident that it's the only leader.

Here's what goes wrong:

// Timeline of a lease violation

t = 0s: Node A acquires lease (valid for 10 seconds)
t = 1s: Node A starts processing a batch write
t = 2s: GC PAUSE BEGINS — Node A's process is frozen
t = 11s: Lease expires. Node A doesn't know — it's frozen.
t = 12s: Node B acquires the lease. Node B is now the leader.
t = 13s: Node B starts accepting writes.
t = 14s: GC PAUSE ENDS — Node A resumes execution.
Node A still thinks it holds the lease (its local variable says so).
Node A writes to the database, conflicting with Node B's writes.

// Result: two leaders, corrupted data.
// Node A had NO WAY to know it was paused. From A's perspective,
// time jumped from t=2s to t=14s in an instant.

The terrifying part. Node A cannot check the lease expiry after resuming from the GC pause, because the pause can happen between checking the lease and using it. There is no safe place to insert a check — the GC can freeze you at any point, including between two adjacent CPU instructions.

The Solution: Fencing Tokens

Since you can't prevent a stale leader from acting, you need the storage layer to reject stale writes. The mechanism is a fencing token: a monotonically increasing number issued with each lease grant.

// Fencing token protocol

t = 0s: Node A acquires lease, receives fencing token = 34
t = 2s: GC pause begins on Node A
t = 12s: Node B acquires lease, receives fencing token = 35
t = 13s: Node B writes to storage with token 35. Storage records: last_token = 35
t = 14s: Node A resumes, tries to write to storage with token 34
Storage checks: 34 < 35 (last_token). REJECTED.

// The storage server enforces ordering, not the client.
// The client can be wrong about whether it holds the lock.
// The storage server is the source of truth.

Fencing Token Simulator

This simulation shows two nodes competing for a lease. Node A gets the lease, then suffers a GC pause. Node B acquires the lease during the pause. When Node A wakes up, see what happens with and without fencing tokens.

GC Pause + Fencing Token Simulator

Watch the scenario play out. Toggle fencing to see the difference: without fencing, data is corrupted. With fencing, the stale write is rejected.

Fencing enabled

Click Play to see the GC pause scenario.

Python: Fencing Token Check

python
class FencedStorage:
    """Storage server that rejects stale writes using fencing tokens."""

    def __init__(self):
        self.data = {}
        self.max_token = {}  # key -> highest token seen

    def write(self, key, value, fencing_token):
        """Write only if the fencing token is >= the highest seen for this key."""
        if key in self.max_token and fencing_token < self.max_token[key]:
            raise PermissionError(
                f"Stale fencing token: {fencing_token} < {self.max_token[key]}. "
                "Write rejected — you no longer hold the lease."
            )
        self.max_token[key] = fencing_token
        self.data[key] = value
        return True

    def read(self, key):
        return self.data.get(key)

# Demo: the GC pause scenario
storage = FencedStorage()

# Node A gets lease with token 34, writes successfully
storage.write("account_balance", 500, fencing_token=34)
print("Node A wrote 500 with token 34")  # OK

# Node B gets lease with token 35 (after A's lease expired during GC)
storage.write("account_balance", 450, fencing_token=35)
print("Node B wrote 450 with token 35")  # OK

# Node A wakes up from GC, tries to write with old token 34
try:
    storage.write("account_balance", 600, fencing_token=34)
except PermissionError as e:
    print(f"Node A rejected: {e}")  # REJECTED

Interview question: You're designing a lease-based distributed lock for a payment processing system. A colleague says "we'll just check if the lease is still valid before each write." Why is this insufficient, and what mechanism would you add?

The check is sufficient — if the lease is valid, the write is safe A GC pause (or any process pause) can occur BETWEEN checking the lease and performing the write. The node checks "am I still the leader?" — yes — then gets paused for 12 seconds. The lease expires, another node becomes leader. The first node resumes and writes, believing it's still the leader. You need fencing tokens: the lock service issues a monotonically increasing token with each lease grant, and the storage layer rejects any write with a token lower than the highest it has seen. The check is insufficient because of network latency, not process pauses

Chapter 6: The Byzantine Generals Problem

So far, we've assumed nodes are honest but unreliable: they might crash, they might be slow, but they never lie. A crashed node stops sending messages. A slow node eventually responds with the correct answer. What if a node sends the wrong answer? Or sends different answers to different nodes?

This is a Byzantine fault: a node behaves in an arbitrary, potentially malicious way. It might send "yes" to node A and "no" to node B. It might report a different value from what it stored. It might impersonate another node.

The Original Problem

In 1982, Lamport, Shostak, and Pease published a paper that framed this as a military scenario. Several Byzantine generals surround an enemy city. They need to agree on a plan: attack or retreat. They communicate by messenger. Some generals may be traitors who send conflicting messages to sabotage the consensus.

// The Byzantine Generals Problem

// 4 generals: A, B, C are loyal. T is a traitor.
// The commander (A) sends "ATTACK" to everyone.

A → B: "ATTACK"
A → C: "ATTACK"
A → T: "ATTACK"

// Now each general relays what they received:
B tells C: "A said ATTACK"
B tells T: "A said ATTACK"
C tells B: "A said ATTACK"
C tells T: "A said ATTACK"

// But the traitor LIES:
T tells B: "A said RETREAT" // lie!
T tells C: "A said RETREAT" // lie!

// B's view: 2 say ATTACK (from A and C), 1 says RETREAT (from T)
// B takes majority: ATTACK. Correct!
// C's view: 2 say ATTACK (from A and B), 1 says RETREAT (from T)
// C takes majority: ATTACK. Correct!

// With 4 generals and 1 traitor, majority voting works.
// Key theorem: you need 3f + 1 generals to tolerate f traitors.
// 3(1) + 1 = 4. With 3 generals and 1 traitor, consensus is IMPOSSIBLE.

Why 3f + 1?

With 3 generals (A loyal, B loyal, T traitor):

// Commander A sends "ATTACK" to B and T
// T tells B: "A said RETREAT" (lie)
// B's view: A says ATTACK, T says RETREAT. Tie! B cannot decide.

// Now consider: what if the COMMANDER is the traitor?
// Commander T sends "ATTACK" to A, "RETREAT" to B
// A relays: "T said ATTACK"
// B relays: "T said RETREAT"
// A sees: ATTACK (from T) and RETREAT (from B's relay). Tie!
// B sees: RETREAT (from T) and ATTACK (from A's relay). Tie!

// With only 3 nodes and 1 traitor, B CANNOT distinguish:
// "A is the commander, T is lying" from "T is the commander, telling me RETREAT"
// Provably impossible.

Byzantine Generals Simulator

Four generals try to agree on a plan. One is a traitor. Watch how the loyal generals use majority voting to reach consensus despite the traitor's conflicting messages.

Byzantine Generals: 4 Generals, 1 Traitor

The commander sends a plan. The traitor sends conflicting messages. Watch how majority voting overcomes the traitor.

Click Run to see the Byzantine consensus protocol.

Do You Need Byzantine Fault Tolerance?

Almost certainly not. BFT is expensive — you need 3f+1 nodes to tolerate f faults (versus 2f+1 for crash faults). The extra nodes and message rounds add latency and cost.

Environment	Need BFT?	Why
Your datacenter	No	You control the nodes. Bugs exist, but nodes don't intentionally lie.
Blockchain / cryptocurrency	Yes	Nodes are operated by strangers with financial incentives to cheat.
Aircraft flight control	Yes	Hardware faults (cosmic rays, sensor failures) can produce arbitrary values.
Multi-org consortium	Maybe	Organizations might not trust each other's infrastructure.

Practical Byzantine faults in datacenters. Even in trusted environments, some Byzantine-like scenarios exist: a node with corrupted memory (bit flip) sends wrong data but doesn't know it. A misconfigured node speaks the wrong protocol version. A software bug causes one replica to compute different results. Most systems protect against these with checksums, protocol version negotiation, and integration tests — not full BFT.

Interview question: You have a distributed system with 7 nodes. How many Byzantine (arbitrarily faulty) nodes can it tolerate, and why?

3 — because you need a simple majority of honest nodes 2 — because BFT requires 3f+1 nodes for f faulty nodes. With 7 nodes: 3f+1 ≤ 7, so 3f ≤ 6, so f ≤ 2. You can tolerate 2 Byzantine nodes because the remaining 5 loyal nodes form a 2/3+ supermajority needed to outvote the traitors in every round. 1 — Byzantine fault tolerance is very limited

Chapter 7: System Models — The Showcase

We've cataloged everything that can go wrong: networks drop messages, clocks drift, processes pause, and nodes can crash (or even lie). Now we need a framework for reasoning about these faults. We need system models.

A system model is a set of assumptions about what kinds of faults can happen. It's the contract between the algorithm designer and reality. "If the world behaves at least this well, then my algorithm guarantees these properties."

Crash Models

How do nodes fail?

Model	Assumption	Example
Crash-stop	A node that crashes is gone forever. It never recovers.	A process killed with SIGKILL. A server removed from the rack.
Crash-recovery	A node that crashes may restart later with its durable state intact (but in-memory state is lost).	A server reboots after a power failure. The process is restarted by systemd.
Byzantine	A node may behave in any arbitrary way: crash, lie, send contradictory messages.	A compromised server. Hardware with corrupted memory.

Timing Models

How predictable are delays?

Model	Assumption	Real-world match
Synchronous	There is a known upper bound on message delay and clock drift. If a message doesn't arrive in time, the node is definitely crashed.	Hard real-time systems with dedicated networks (assembly line controllers)
Partially synchronous	The system behaves synchronously MOST of the time, but can occasionally exceed bounds (during network congestion, GC pauses, etc.).	Most datacenter systems. The model used by most practical algorithms.
Asynchronous	No timing assumptions at all. Messages can take arbitrarily long. You cannot use timeouts because there is no "too long."	Theoretical model. No real system is truly asynchronous, but it gives the strongest impossibility results.

The practical choice. Most real distributed systems assume crash-recovery + partial synchrony. This means: nodes can crash and restart (but we trust them not to lie), and the network is usually fast but occasionally slow. Algorithms designed for this model (like Raft, Paxos, ZAB) work well in practice because datacenters usually behave synchronously, and the occasional asynchronous period causes temporary unavailability, not incorrectness.

The Big Simulation: Distributed System Under Stress

Five nodes in a partially synchronous network running a simplified consensus protocol. You are the chaos monkey. Crash nodes, partition the network, trigger GC pauses, and delay messages. Watch how the system handles each scenario — and when it breaks.

Distributed System Stress Test

Click nodes to crash them. Use buttons to partition the network or trigger GC pauses. The system tries to maintain a leader and accept writes. Watch what survives and what doesn't.

Click Start to launch the 5-node cluster.

Things to try:

Crash one follower: The system continues. The leader still has a quorum (3 of 5).
Crash two followers: The leader loses quorum. Writes stop being accepted.
Network partition (2-3 split): The side with 3 nodes elects a leader and continues. The side with 2 nodes cannot form a quorum.
GC pause on leader: The leader misses heartbeats. Followers elect a new leader. When the old leader wakes up, it discovers it's been deposed.
Delay all messages: Partial synchrony violated. Timeouts fire, elections start, but agreement is delayed (not prevented).

The key insight from this simulation. Under crash-recovery + partial synchrony, the system can always preserve safety (never give a wrong answer). But it sometimes sacrifices liveness (it might stop accepting requests until the partition heals or a new leader is elected). This is the fundamental tradeoff described by the FLP impossibility result, which we'll discuss in Chapter 8.

Chapter 8: Knowledge and Truth

Here is a thought experiment that crystallizes the central challenge of distributed systems.

You are a node. You just processed a request and sent a response. But your garbage collector kicked in for 15 seconds right after you sent it. When you wake up, you think only a millisecond has passed. Your lease has expired. Another node is now the leader. But you don't know any of this. From your perspective, nothing happened.

A node cannot trust its own judgment. It might be paused. Its clock might be wrong. Its network might be partitioned. It has no way to tell. This is the fundamental asymmetry of distributed systems: a node can observe its own state, but it cannot observe its own reachability or timeliness from the outside.

The Truth Is Defined by the Quorum

Since no individual node can be trusted to assess reality, the only reliable source of truth is a quorum — a majority of nodes agreeing on a fact.

Decision	Who decides	Why not the individual?
"Node X is dead"	A quorum of nodes	Node X might be alive but partitioned from the observer
"Node Y is the leader"	A quorum of nodes	Y might think it's the leader but its lease expired during a GC pause
"Value V is committed"	A quorum of nodes	A single node might have a stale or corrupted copy
"The lock is held by Z"	A quorum of nodes (or the storage with fencing)	Z might have crashed and not know it yet

The quorum principle. In a distributed system, truth is not what any one node believes — it is what a majority of nodes agree on. This is why consensus algorithms (Raft, Paxos) require majority votes for every decision: leader election, log commitment, membership changes. A minority can always be wrong or partitioned.

The FLP Impossibility

In 1985, Fischer, Lynch, and Paterson proved one of the most important results in distributed computing. The FLP impossibility theorem states:

FLP Impossibility. In a fully asynchronous system where even one process can crash, there is no deterministic algorithm that guarantees consensus will be reached. Put differently: if you make zero timing assumptions (no timeouts, no upper bound on message delay), then it is mathematically impossible to guarantee that a group of nodes will agree on a value, even if only one node might crash.

This sounds devastating. If consensus is impossible, how do Raft and Paxos work?

The key word is asynchronous. FLP applies to the fully asynchronous model. Real systems are partially synchronous: they have timeouts, and most of the time the network is well-behaved. Practical consensus algorithms exploit this:

// How Raft sidesteps FLP:

// FLP says: in a FULLY ASYNCHRONOUS system, you can't guarantee
// consensus if even one node can crash.

// Raft says: I use TIMEOUTS. If I don't hear from the leader
// within my election timeout, I start an election.

// During a period of asynchrony (network congestion, GC pause):
// - Raft might not make progress (no leader elected, no writes committed)
// - But it never violates SAFETY (never commits two different values)

// When synchrony returns (network stabilizes):
// - Raft elects a leader and resumes operation
// - Progress is restored

// The tradeoff: Raft gives up LIVENESS during asynchronous periods
// in exchange for SAFETY always. FLP says you can't have both.

Quorum Decision Simulator

Five nodes try to decide if a particular node is dead. Each has its own view (based on whether it can reach the suspect node). A quorum (3 of 5) must agree. See how partitions can cause different nodes to have different views — but the quorum still makes a consistent decision.

Quorum Decision Making

Click nodes to toggle their connectivity to the suspect node (Node 3). The quorum decides: if 3+ nodes can't reach Node 3, it is declared dead.

Click on nodes 1, 2, 4, or 5 to toggle connectivity to Node 3.

Safety vs. Liveness

Every distributed algorithm makes a choice between two properties:

Property	Definition	Example
Safety	Nothing bad ever happens. If the system gives you an answer, it's correct.	"Two leaders are never elected for the same term." "A committed value is never lost."
Liveness	Something good eventually happens. The system makes progress.	"A leader is eventually elected." "A submitted write is eventually committed."

FLP tells us that in an asynchronous system, you cannot guarantee both safety and liveness in the presence of faults. Practical algorithms choose safety always, liveness when possible. During a partition, the system stops accepting writes (sacrifices liveness) rather than accepting potentially conflicting writes (sacrificing safety).

Interview question: The FLP impossibility says consensus is impossible in an asynchronous system. Yet Raft, Paxos, and ZAB all achieve consensus in practice. Are they violating the FLP result?

No — FLP applies to the fully asynchronous model. Raft/Paxos/ZAB assume partial synchrony (they use timeouts). During asynchronous periods, they may fail to make progress (violating liveness), but they never violate safety. FLP is not violated because these algorithms don't guarantee liveness in all cases — they guarantee it only when the network eventually stabilizes. Yes — modern algorithms have found ways around the theoretical limit FLP only applies to Byzantine faults, not crash faults

Chapter 9: Interview Arsenal

This chapter is your cheat sheet. Every failure mode, every mitigation, every pattern you need for distributed systems interviews — organized by dimension.

Failure Mode Reference

Failure	Symptom	Mitigation	Interview trigger
Network partition	Some nodes unreachable, others fine	Quorum-based decisions, fencing tokens	"What happens if your services can't talk to each other?"
Message loss	Request or response disappears	Retries + idempotency keys	"How do you handle a failed API call?"
Message delay	Response arrives after timeout	Adaptive timeouts (phi accrual), deduplication	"How do you set your timeout values?"
Clock skew	Timestamps disagree across nodes	Logical clocks, vector clocks, TrueTime	"How do you order events across services?"
GC pause	Process freezes for seconds	Fencing tokens, short leases	"What if the leader is GC'd?"
Split brain	Two nodes both think they're leader	Epoch-based fencing, quorum leases	"How do you prevent two leaders?"
Partial failure	One of N steps completes	Transactions, saga pattern, idempotency	"How do you handle a crash mid-operation?"

System Design Questions

"Design a distributed lock service." Key points: lease-based (not indefinite), fencing tokens for safety, quorum for leader election, handle GC pauses (client might hold a stale lock). Mention ZooKeeper or etcd as real implementations. The lock service is NOT the source of truth — the fencing token check at the storage layer is.

"Design a system that tolerates network partitions." Clarify the requirement: do you need strong consistency (CP) or availability (AP)? CP: use consensus (Raft/Paxos), accept unavailability during partition. AP: use CRDTs or LWW, accept temporary inconsistency. Mention CAP theorem but note its limitations (it's about a very specific tradeoff, not a general design framework).

"How does garbage collection affect distributed systems?" GC pauses look like node crashes to the rest of the cluster. The paused node might hold a lease that expires during the pause. When it wakes up, it doesn't know time passed. Mitigations: fencing tokens, GC-aware heartbeats (treat a long GC as a voluntary step-down), JVM tuning (G1GC, ZGC for low-pause), pre-emptive GC notification in JVM 11+.

Coding Drills

Drill 1: Idempotency layer

python
import hashlib
import time

class IdempotencyLayer:
    """Prevents duplicate processing of requests."""

    def __init__(self, ttl_seconds=3600):
        self.seen = {}  # idempotency_key -> (result, timestamp)
        self.ttl = ttl_seconds

    def execute(self, idempotency_key, operation):
        """Execute operation only if this key hasn't been seen."""
        self._evict_expired()

        if idempotency_key in self.seen:
            # Already processed — return cached result
            return self.seen[idempotency_key][0]

        # First time seeing this key — execute and cache
        result = operation()
        self.seen[idempotency_key] = (result, time.monotonic())
        return result

    def _evict_expired(self):
        now = time.monotonic()
        expired = [k for k, (_, t) in self.seen.items() if now - t > self.ttl]
        for k in expired:
            del self.seen[k]

# Usage:
idem = IdempotencyLayer()
result = idem.execute("charge-user-123-order-456", lambda: charge_credit_card(50))
# If retried with same key, returns cached result without re-charging

Drill 2: Simple lease with fencing

python
import time
import threading

class LeaseService:
    """Simplified lease service with fencing tokens."""

    def __init__(self):
        self.lock = threading.Lock()
        self.current_holder = None
        self.expiry = 0
        self.next_token = 1

    def acquire(self, node_id, duration_sec=10):
        """Try to acquire lease. Returns (success, fencing_token)."""
        with self.lock:
            now = time.monotonic()
            if self.current_holder is not None and now < self.expiry:
                return False, None  # Lease held by someone else

            token = self.next_token
            self.next_token += 1
            self.current_holder = node_id
            self.expiry = now + duration_sec
            return True, token

    def release(self, node_id):
        with self.lock:
            if self.current_holder == node_id:
                self.current_holder = None
                self.expiry = 0

Debug Scenarios

"Intermittent data corruption across replicas." Diagnosis: clock skew + LWW. Two replicas receive conflicting writes. The one with the "higher" timestamp wins — but clock skew means the newer write might get a lower timestamp. The older write wins, and the newer write is silently discarded. Evidence: check NTP sync status across nodes. Fix: switch from wall-clock LWW to logical timestamps (Lamport clocks or vector clocks), or use CRDT-based conflict resolution.

"Lease violations during high GC pressure." Diagnosis: Java services with large heaps run full GC, pausing for seconds. During the pause, their leases expire. When the GC completes, the node resumes and acts as if it still holds the lease. Evidence: correlate GC logs with lease violation alerts. Fix: fencing tokens at the storage layer, GC tuning (smaller heap, G1GC or ZGC), or proactive GC scheduling with jmap/jcmd.

"Split-brain after network partition." Diagnosis: the network splits and both halves elect a leader. When the partition heals, you have two leaders with conflicting state. Evidence: check leader election logs — two nodes claim leadership for overlapping terms. Fix: use epoch-based fencing (each leader has a monotonically increasing epoch number), quorum leases (leader must maintain contact with a majority to keep its lease), and fencing tokens for all writes.

Key Papers

Paper	Year	Key idea
Time, Clocks, and the Ordering of Events in a Distributed System (Lamport)	1978	Logical clocks, happens-before relation. The foundation of all distributed ordering.
Impossibility of Distributed Consensus with One Faulty Process (Fischer, Lynch, Paterson)	1985	FLP impossibility. No deterministic consensus in fully asynchronous systems.
The Byzantine Generals Problem (Lamport, Shostak, Pease)	1982	BFT requires 3f+1 nodes for f faults. The original formulation.
Spanner: Google's Globally-Distributed Database (Corbett et al.)	2012	TrueTime API. GPS + atomic clocks for bounded clock uncertainty.
The φ Accrual Failure Detector (Hayashibara et al.)	2004	Adaptive failure detection based on observed heartbeat distribution.

Interview question: You're designing a payment system that debits one account and credits another. The system spans two microservices (Accounts and Payments), each with its own database. The network between them can fail at any time. How do you ensure that money is never lost or double-counted?

Use a distributed transaction (2PC) across both databases Use eventual consistency — replicate the write to both services Use the saga pattern with idempotency keys: (1) Debit the account with a unique transaction ID, (2) Credit the other account with the same transaction ID. Each service stores processed transaction IDs to prevent double-processing on retry. If the credit fails, run a compensating transaction to reverse the debit. Idempotency keys ensure retries are safe. The transaction ID is the fencing mechanism — each step can be safely retried without duplication.

Chapter 10: Connections

This chapter covered the problems of distributed systems — the hostile environment that algorithms must survive. Now you need the solutions: the algorithms and protocols that provide guarantees despite these faults.

Where This Chapter Fits

Chapter	Topic	Relationship
DDIA Ch. 6: Replication	Copying data across nodes	Replication must handle all the faults we described: network partitions, clock skew (for LWW), process pauses (for leader failover). This chapter explains WHY replication is hard.
DDIA Ch. 8: Transactions	Grouping operations atomically	Distributed transactions (2PC, sagas) must survive network partitions and node crashes. This chapter explains the environment that makes distributed transactions so expensive.
DDIA Ch. 10: Consensus	Getting nodes to agree	Consensus (Raft, Paxos) is the PRIMARY solution to the problems in this chapter. It works because it assumes partial synchrony (not full asynchrony, where FLP makes it impossible).

The Mental Model

After this chapter, you should have a three-layer mental model:

Layer 1: The Hostile Environment

Networks are unreliable. Clocks drift. Processes pause. Nodes crash. (This chapter)

↓

Layer 2: The Theoretical Limits

FLP impossibility: can't guarantee consensus in async systems. CAP theorem: can't have consistency + availability during partition. BFT: need 3f+1 for f traitors.

↓

Layer 3: Practical Algorithms

Raft, Paxos, ZAB: achieve consensus under partial synchrony by trading liveness for safety. Fencing tokens, logical clocks, idempotency keys: engineering tools that make systems robust.

Summary of Key Principles

Principle	One-liner
You can't distinguish failure modes	A timeout tells you nothing about why the response didn't arrive.
Clocks lie	Never use wall-clock timestamps to determine the order of events across machines.
Your process can pause	A GC pause, VM migration, or context switch can stop your code for seconds without warning.
Trust the quorum, not yourself	No individual node can reliably assess its own status. Only a majority decision is trustworthy.
Safety over liveness	Better to stop accepting writes than to accept conflicting writes.
Fence everything	Every write to shared storage should carry a fencing token that the storage can verify.
Make it idempotent	If a client might retry, the operation must produce the same result on duplicate execution.

Closing thought. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport. The problems in this chapter are not bugs to be fixed. They are the fundamental nature of distributed computing. The art is not eliminating them — it is building systems that work correctly despite them.

Related Lessons

DDIA Chapter 6: Replication — How to copy data across unreliable nodes
DDIA Chapter 8: Transactions — Atomic operations in a distributed world

Final question: You're in a system design interview. The interviewer says: "Assume the network is perfectly reliable." How should you respond?

Accept the assumption and proceed — it simplifies the design Acknowledge the simplification but note that even with a perfect network, you still face clock drift and process pauses. You'd still need fencing tokens (GC pauses can violate leases), logical clocks (physical clocks still drift), and idempotency (retries can still happen at the application level). Then ask: "Shall I design for these remaining faults, or are we simplifying those away too?" This shows the interviewer you understand the full problem space. Refuse the assumption — a reliable network is unrealistic