Distributed Systems

Failure Modes & Isolation

Everything fails. The question is how much damage it does when it does.

Prerequisites: Basic networking + Client-server model. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Things Break

You are running a web application on three servers behind a load balancer. At 3:17 AM, one server's disk fills up because a logging library wrote an unexpectedly large stack trace in a loop. The server starts returning 500 errors. The load balancer notices and routes traffic to the remaining two servers. They were already running at 40% capacity each, so now they are at 60%. No big deal.

But here is the problem: the same logging library runs on all three servers. The same bug exists on all three. The same unusual input that triggered the loop is still arriving. Within 90 seconds, all three servers have full disks. Your entire application is down.

This is not a hardware failure. The disks are fine. The CPUs are fine. The network is fine. A correlated software fault took out every replica simultaneously. Your redundancy — three whole servers! — was worthless because the failure was not independent.

The core lesson of failure engineering. Adding more replicas does not help if they all fail for the same reason at the same time. The art of building reliable systems is not about preventing failures (you cannot), but about ensuring that when one thing fails, the failure stays contained. This is called fault isolation, and it is the single most important concept in this lesson.

The simulation below shows a simple three-server setup. Each server processes requests independently. Click "Inject Fault" to see what happens when a correlated fault hits all three servers versus when only one is affected.

Correlated vs. Independent Failures

Three servers behind a load balancer. Green = healthy, red = failed. Watch how correlated faults bypass redundancy.

When one server fails independently, the load balancer shifts traffic and users barely notice. When all three fail together, there is nowhere to redirect. The system is down.

The Failure Landscape

Failures come from many sources. Some are random (a disk wears out), some are systematic (a bug triggered by a specific input), and some are environmental (a power outage in a datacenter). The key distinction is whether failures are independent (happening by random chance) or correlated (happening to multiple components for the same reason).

| Failure Type | Example | Correlation |
| --- | --- | --- |
| Hardware | Disk dies after 5 years of use | Usually independent (unless same batch) |
| Software bug | Null pointer on edge-case input | Perfectly correlated across all replicas |
| Config error | Bad DNS entry pushed to all servers | Perfectly correlated |
| Network partition | Top-of-rack switch dies | Correlated for all servers on that rack |
| Load pressure | Traffic spike overwhelms capacity | Correlated: all servers see the spike |
| Resource leak | Memory leak crashes process every 48 hours | Correlated if all replicas started at the same time |

The structure of this lesson. We will walk through each failure category (hardware, software, config, cascading), then learn the isolation strategies that contain damage: redundancy, fault domains, shuffle sharding, and cellular architecture. The final chapter is a blast radius simulator where you inject failures and see which isolation strategies contain them.

A Taxonomy of Failure

Before we dive in, let's establish vocabulary. Distributed systems distinguish between three levels of problems:

A fault is a single component deviating from its specification. One disk returns corrupted data. One process crashes. One network link drops packets. A fault is local and specific.

A failure is a system-level event where the service as a whole stops providing correct results to users. A failure is what users experience. One disk fault does not necessarily cause a system failure — if you have redundancy, the system continues working.

An outage is a sustained failure affecting users for a meaningful duration. A 200ms blip during a failover is a failure but not usually counted as an outage. A 45-minute period of 500 errors is an outage.

The goal of fault isolation is to prevent faults from becoming failures, and to contain failures so they do not become prolonged outages.

Measuring Blast Radius

Blast radius is the fraction of your system, customers, or requests affected by a single fault. It is the most important metric in failure engineering.

// Blast radius calculation:
blast_radius = affected_users / total_users

// Example: 3 servers, one crashes
// No isolation: all traffic shifts, survivors overload = 100% blast
// With redundancy: 1/3 traffic shifts, survivors absorb = 0% blast

// Example: 10 cells, one gets bad deploy
blast_radius = 1/10 = 10%
// The other 9 cells never see the bad code
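
As a rough illustration, the calculation above can be written as a small Python helper. The function names, and the assumption that users are spread evenly across cells, are ours for the sake of the sketch.

```python
def blast_radius(affected_users: int, total_users: int) -> float:
    """Fraction of users affected by a single fault."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return affected_users / total_users


def cell_blast_radius(num_cells: int, failed_cells: int = 1) -> float:
    """Blast radius when users are spread evenly across isolated cells (assumed)."""
    return blast_radius(failed_cells, num_cells)


# Redundancy absorbed the fault: nobody is affected.
print(blast_radius(0, 3000))
# One bad deploy lands in one of ten cells.
print(cell_blast_radius(10))
```

The same function works whether you count users, customers, or requests, as long as the numerator and denominator use the same unit.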

Every isolation strategy in this lesson is designed to minimize blast radius. The table below previews what we will learn:

| Strategy | Blast Radius Target | What It Protects Against |
| --- | --- | --- |
| Redundancy | 0% (if independent) | Single-component hardware faults |
| Fault domains | ≤ 1/N (N = domains) | Infrastructure failures (rack, AZ) |
| Shuffle sharding | 1-2 customers | Noisy neighbors, per-customer faults |
| Cellular architecture | 1/N (N = cells) | Everything: deploys, config, software bugs |

Quiz: You deploy the same application binary to 10 servers. A bug in the binary causes a crash when processing a specific rare request. Server 1 crashes. What happens next?

Chapter 1: Hardware Faults

Hard drives spin at 7,200 RPM, 24 hours a day, for years on end. Eventually, the bearings wear out. The magnetic surface degrades. A cosmic ray flips a bit in RAM. A power supply capacitor dries out. Hardware is physical, and physical things break.

The good news about hardware faults is that they are usually independent. When one disk dies, that does not cause the disk in the next server to die. Each failure is a random, uncorrelated event. This means that simple redundancy — having a spare — is highly effective.

Mean Time Between Failures (MTBF)

Mean Time Between Failures is the average time a component runs before it fails. For a modern enterprise hard drive, MTBF is roughly 10-50 years (depending on the model and workload). But "10 years per drive" does not mean your first failure is in 10 years if you have many drives.

// If one drive has MTBF = 10 years...
// And you have 1000 drives...

Expected failures per year = 1000 / 10 = 100 drive failures per year

// That's roughly one failure every 3.6 days.
// At scale, hardware failure is not a rare event — it's a daily routine.

Google, with millions of servers, sees thousands of hardware failures per day. Amazon, Microsoft, and Meta see similar numbers. At this scale, hardware failure is not an exception — it is the steady state.

Types of Hardware Failures

| Component | Failure Mode | Typical MTBF | Mitigation |
| --- | --- | --- | --- |
| Hard drive | Mechanical wear, bad sectors | 10-50 years | RAID, replication |
| SSD | Write endurance exhaustion | 5-10 years | Wear leveling, replication |
| RAM | Bit flips from cosmic rays or decay | Rare per stick, common at scale | ECC memory |
| Power supply | Capacitor failure, fan failure | 5-10 years | Dual PSU, UPS |
| Network interface | Port failure, cable degradation | 10+ years | Dual NIC, bonding |
| Entire server | Motherboard failure | 3-10 years | Replicas on other servers |

The simulation below shows a datacenter with 100 servers. Each server has a random MTBF. Over time, servers fail and are replaced. Watch how the failure rate becomes predictable at scale.

Hardware Failure Over Time

100 servers, each with random MTBF. Green dots = healthy, red = failed. The chart shows cumulative failures over time.

The bathtub curve. Hardware failure rates follow a "bathtub curve": higher in the first few weeks (infant mortality — manufacturing defects), very low for years (useful life), then rising again as components wear out (end of life). This is why some companies "burn in" new hardware for 48 hours before putting it into production.

The key insight: hardware faults are mostly independent. One server's disk dying does not cause another server's disk to die. This means that if you have data on two different servers, losing one disk does not lose the data. Simple replication defeats simple hardware faults.

But "mostly independent" has exceptions. A batch of drives from the same manufacturer, shipped at the same time, installed at the same time, may all reach end-of-life at roughly the same time. A power outage takes out every server on the same power circuit. These are correlated hardware faults, and we will address them with fault domains in Chapter 5.

A Worked Example: Disk Failure Probability at Scale

Let's do the math for a real datacenter. Suppose you operate 10,000 servers, each with 4 hard drives. That is 40,000 drives. Each drive has an annual failure rate (AFR) of 2% (a realistic figure for enterprise drives).

// Datacenter: 10,000 servers, 4 drives each = 40,000 drives
// AFR = 2% per drive per year

Expected failures per year = 40,000 × 0.02 = 800 drive failures
Expected failures per day = 800 / 365 = ~2.2 drive failures per day
Expected failures per week = ~15

// Probability of NO failures in a given week:
P(0 failures in week) = (1 - 0.02/52)^40000 ≈ e^(-15.4) ≈ 2 × 10^-7 (about 0.00002%)

// Translation: you WILL have hardware failures every single week.
// At scale, hardware failure is not an event — it is a constant.
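
The arithmetic above can be checked with a few lines of Python. This is a sketch under the same assumptions as the text (independent drives, a constant 2% AFR spread evenly over 52 weeks); the function name is ours.

```python
import math


def weekly_failure_stats(drives: int, afr: float) -> dict:
    """Expected drive failures and the chance of a failure-free week.

    Assumes independent drives and a failure rate spread evenly over the year.
    """
    per_year = drives * afr
    per_week = per_year / 52
    # Exact form: every drive must survive the week.
    p_none_exact = (1 - afr / 52) ** drives
    # Poisson approximation: e^(-expected failures per week).
    p_none_poisson = math.exp(-per_week)
    return {
        "per_year": per_year,
        "per_day": per_year / 365,
        "per_week": per_week,
        "p_none_exact": p_none_exact,
        "p_none_poisson": p_none_poisson,
    }


stats = weekly_failure_stats(drives=40_000, afr=0.02)
for name, value in stats.items():
    print(f"{name}: {value:.3g}")
```

The exact and approximate probabilities agree closely because the per-drive weekly failure probability is tiny, which is the regime where the Poisson approximation works well.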

This is why large-scale systems are designed with the assumption that hardware will fail continuously. The system must handle failures automatically, without human intervention, as part of its normal operation.

ECC Memory: Silent Corruption

A particularly insidious hardware fault is a silent bit flip in RAM. Cosmic rays and natural radioactive decay can flip individual bits in memory. Without ECC (Error-Correcting Code) memory, these flips are silent — the data is corrupted but no error is reported. The program continues with wrong data.

A study at Google found approximately 1 single-bit error per gigabyte of RAM per year. With 64 GB per server and 10,000 servers, that is 640,000 bit flips per year across the fleet. ECC memory detects and corrects single-bit errors automatically. This is why ECC memory is standard in production servers.

Quiz: You run a database on 3 servers, each with an independent MTBF of 5 years. What is the approximate probability that all three servers fail in the same week?

Chapter 2: Software & Config Errors

Hardware faults are well-understood and well-mitigated. The bigger threat to modern systems is software faults and configuration errors. Unlike hardware, these are almost always perfectly correlated: the same bug exists on every replica, the same bad config is pushed to every server.

Software Bugs

A software bug is dormant until triggered. It might lurk for months, waiting for a specific combination of inputs, load level, or timing. When the trigger arrives, every replica running that software hits the bug simultaneously.

1. Bug Introduced
A developer writes code that does not handle a null field in a JSON payload. Tests pass because test data never has null fields.
2. Bug Deployed
The code ships to all 50 production servers. The bug is now present on every single replica.
3. Trigger Arrives
A new client sends a request with a null field. The request is load-balanced to server 17. Server 17 crashes.
4. Correlated Failure
The client retries. The retry hits server 3. Server 3 crashes. Every retry hits a different server. Within minutes, all 50 servers have crashed.

This pattern — a "poison pill" request that crashes every server it touches — is one of the most common causes of total outages in production systems.

Configuration Errors

Configuration changes are the number one cause of outages at large internet companies. Not hardware failures. Not software bugs. Config changes.

Why? Because configuration changes are typically pushed globally and immediately. A developer changes a DNS entry, a firewall rule, a feature flag, or a database connection string, and that change propagates to every server in seconds. If the change is wrong, every server is broken in seconds.

| Config Error | Impact | Real-World Incident |
| --- | --- | --- |
| Bad DNS entry | All servers resolve to wrong IP | GitHub outage, 2016 |
| Firewall rule typo | All servers lose network access | Cloudflare outage, 2019 |
| Wrong DB credentials | All servers can't authenticate | Common in many organizations |
| Feature flag misconfigured | All servers enable broken code path | Knight Capital, $440M loss in 45 min |
| Capacity limit set too low | All servers reject valid requests | AWS S3 outage, 2017 |

Config changes are the most dangerous deployments. A code deployment at least goes through CI/CD, code review, and staged rollout. A config change often bypasses all of that. It gets applied instantly, globally, with no canary phase. This is why the best engineering organizations treat config changes with the same rigor as code changes: version control, review, staged rollout, and automatic rollback.

Resource Leaks

A resource leak is a slow-motion failure. Memory is allocated but never freed. File handles are opened but never closed. Database connections are checked out but never returned. The process works fine for hours or days, slowly consuming resources, until it hits a limit and crashes.

Resource leaks are insidious because they are time-correlated. If all your replicas started at the same time (which they usually did, because you deployed them at the same time), they all leak at the same rate, and they all crash at roughly the same time.

// Memory leak scenario:
Process starts with 2 GB RSS
Leak rate: 50 MB / hour
Memory limit: 8 GB

Time to crash = (8 GB - 2 GB) / 50 MB/hour = 120 hours = 5 days

// If all 10 replicas deployed at the same time...
// All 10 hit 8 GB at roughly hour 120.
// Staggered restarts (chaos monkey) break the correlation.
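
A minimal Python sketch of the same scenario, showing why staggering start times spreads the crashes out. It uses 1 GB = 1000 MB to match the round numbers above; the 12-hour stagger interval is an arbitrary value chosen for illustration.

```python
def hours_until_oom(start_gb: float, limit_gb: float, leak_mb_per_hour: float) -> float:
    """Hours until a steadily leaking process reaches its memory limit.

    Uses 1 GB = 1000 MB to match the round numbers in the text.
    """
    return (limit_gb - start_gb) * 1000 / leak_mb_per_hour


def crash_times(start_offsets_hours, lifetime_hours):
    """Absolute crash time of each replica, given when each one started."""
    return [offset + lifetime_hours for offset in start_offsets_hours]


lifetime = hours_until_oom(start_gb=2, limit_gb=8, leak_mb_per_hour=50)

# All 10 replicas deployed together: every crash lands at the same hour.
simultaneous = crash_times([0] * 10, lifetime)

# Replicas restarted 12 hours apart (assumed interval): crashes are spread out.
staggered = crash_times([i * 12 for i in range(10)], lifetime)

print("lifetime (hours):", lifetime)
print("simultaneous:", simultaneous)
print("staggered:   ", staggered)
```

With staggered starts, at most one replica reaches the limit at any given hour, so the remaining nine keep serving while it restarts.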

The simulation below demonstrates how resource leaks and config pushes create correlated failures. Each bar represents a server's health over time.

Correlated Failure Scenarios

Watch how different failure types affect servers. Green = healthy, yellow = degraded, red = failed.

Staggered restarts break time-correlation. If you restart replicas at different times (some companies randomly restart one replica every few hours — a "chaos monkey" approach), resource leaks never accumulate simultaneously. This is a simple, powerful defense against time-correlated failures.

The Load Pressure Problem

There is one more failure type that deserves special attention: load pressure. A traffic spike (from a viral post, a marketing campaign, a DDoS attack, or simply the normal daily peak) can overwhelm your system. Unlike an independent hardware fault, load pressure affects every server simultaneously because they all see the increased traffic.

Load pressure is correlated by definition: if one server sees the spike, they all do (because the load balancer distributes it). And load pressure interacts badly with other failure types. A server that is running at 95% capacity has zero headroom for a GC pause, a slow query, or a background job. Load pressure turns normal events into failures.

// The danger zone: utilization vs. latency

// At 50% utilization: latency = baseline (e.g., 10ms)
// At 80% utilization: latency = 2x baseline (queueing starts)
// At 90% utilization: latency = 5x baseline
// At 95% utilization: latency = 10x baseline
// At 99% utilization: latency = 50x baseline

// This is the queueing theory result: as utilization approaches 100%,
// response time grows without bound. NOT linearly: it grows roughly like
// 1 / (1 - utilization), so each step closer to 100% costs more than the last.
// A server at 90% is not "10% away from failure." It is already degraded.
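
The figures above are close to what the textbook M/M/1 queueing model predicts, where mean response time is the service time divided by (1 - utilization). The sketch below assumes that model (a single server, random arrivals, random service times), which real systems only approximate; the function name is ours.

```python
def mm1_latency_multiplier(utilization: float) -> float:
    """Mean response time relative to bare service time in an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


# Express latency relative to the 50%-utilization baseline used in the text.
baseline = mm1_latency_multiplier(0.50)
for u in (0.50, 0.80, 0.90, 0.95, 0.99):
    relative = mm1_latency_multiplier(u) / baseline
    print(f"{u:.0%} utilization: {relative:.1f}x the latency at 50%")
```

Under this model, going from 90% to 95% utilization costs as much extra latency as going from 50% all the way to 90%.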

The Interaction Between Failure Types

The most dangerous scenarios are not single failure types — they are combinations. A traffic spike (load pressure) hits while a server is restarting (reducing capacity by 1/N), during a GC pause on another server (reducing capacity further), which triggers retries (amplifying load). Each individual event is manageable. Together, they cascade.

| Combination | Why it's dangerous |
| --- | --- |
| Load spike + server restart | Spike arrives when you have N-1 capacity (during rolling deploy) |
| Bug trigger + retries | One server crashes on bad input, retries spread the bad input to all servers |
| Network partition + leader election | Both sides think they're the leader, conflicting writes |
| Memory leak + traffic spike | Leak at 90% memory, spike pushes to OOM |

This is why defense-in-depth matters. No single isolation strategy handles all failure combinations. You need multiple layers.

Quiz: A configuration change that sets the connection pool size to 0 is pushed to all 20 servers at once. How many servers are affected?

Chapter 3: Cascading Failures

One server fails. Its traffic shifts to the remaining servers. Those servers were already at 70% capacity. Now they are at 85%. Response times increase. Clients start timing out and retrying. Every retry adds load on top of the original traffic. The next server collapses. Its traffic shifts to the remaining servers. They are now at 100%. The entire cluster goes down in minutes.

This is a cascading failure: a chain reaction where one failure causes additional failures, which cause more failures, until the entire system is down. The initial trigger might be trivial — a single server restart, a small traffic spike, a GC pause. But the cascade amplifies it into a total outage.

Cascading failures are the leading cause of multi-hour outages. They are especially dangerous because they are self-reinforcing: each failure increases load on survivors, which causes more failures, which increases load further. Breaking the cascade requires understanding the positive feedback loops that drive it.

The Anatomy of a Cascade

1. Trigger
A server fails, traffic spikes, or a dependency slows down. The cause is often trivial.
2. Load Redistribution
Traffic from the failed component shifts to healthy components. Each healthy component's load increases.
3. Degradation
Overloaded components slow down. Queue depths increase. Memory usage spikes. Response times climb.
4. Retry Storm
Clients see timeouts and retry. Each retry adds load. The retry rate can be 2-10x the normal request rate.
5. Collapse
The next component fails under the combined load of real traffic + retries. The cycle repeats from step 2.
↻ repeat until total outage

The Feedback Loops

Cascading failures are driven by three positive feedback loops (positive here means self-reinforcing, not "good"):

| Feedback Loop | Mechanism | Amplification Factor |
| --- | --- | --- |
| Retry amplification | Failed requests are retried, each retry consumes server resources | 2-10x per retry layer |
| Load redistribution | Failed server's traffic shifts to survivors, pushing them closer to capacity | N/(N-1) load increase per failure |
| Resource exhaustion | Slow requests hold connections longer, reducing available capacity | Linear with response time increase |

Let's do the math for a concrete example. You have 5 servers, each handling 1000 requests per second (RPS) at 80% capacity (max capacity = 1250 RPS each).

// Initial state: 5 servers, 1000 RPS each, 80% capacity
Total load: 5000 RPS
Per-server capacity: 1250 RPS
Headroom per server: 250 RPS (20%)

// Server 1 fails. Load redistributes to 4 servers:
Per-server load: 5000 / 4 = 1250 RPS = 100% capacity

// At 100% capacity, response times increase 3x.
// Clients timeout and retry. Retry rate = 30% of requests.
Effective load: 1250 × 1.3 = 1625 RPS per server
// 1625 > 1250 capacity. Servers start dropping requests.

// Server 2 fails under load. 3 servers remain:
Per-server load: 5000 / 3 = 1667 RPS + retries = total collapse
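
The walkthrough above can be turned into a small Python model. This is a deliberately simplified sketch, not a production simulator: it assumes load spreads evenly, that retries begin once a server is at or above capacity, and that one more server fails whenever effective load exceeds capacity.

```python
def simulate_cascade(total_rps, capacity_rps, servers, retry_fraction):
    """Return (servers left alive, history) after one server fails.

    Simplifying assumptions:
    - load is spread evenly over surviving servers
    - once a server is at or above capacity, clients time out and
      retries add `retry_fraction` extra load
    - if effective load exceeds capacity, one more server fails
    """
    alive = servers - 1  # the initial trigger: one server fails
    history = []
    while alive > 0:
        per_server = total_rps / alive
        saturated = per_server >= capacity_rps
        effective = per_server * (1 + retry_fraction) if saturated else per_server
        history.append((alive, round(effective)))
        if effective <= capacity_rps:
            break  # survivors absorb the load and the cascade stops
        alive -= 1  # another server collapses under the load
    return alive, history


with_retries = simulate_cascade(5000, 1250, servers=5, retry_fraction=0.3)
without_retries = simulate_cascade(5000, 1250, servers=5, retry_fraction=0.0)
print("with retries:   ", with_retries)
print("without retries:", without_retries)
```

In this model the cluster survives the first failure when retries are disabled, because the four survivors sit exactly at capacity. With a 30% retry rate the same trigger takes every server down.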

The simulation below lets you watch a cascade in real time. Five servers process incoming traffic. Click "Kill Server" to remove one. Watch the load redistribute and the cascade begin.

Cascading Failure Simulator

5 servers processing traffic. Kill one and watch the cascade. Toggle "Retries" to see the amplification effect.

Breaking the Cascade

Every defense against cascading failures targets one of the feedback loops:

| Defense | Targets | Mechanism |
| --- | --- | --- |
| Load shedding | Resource exhaustion | Drop excess requests instead of queuing them |
| Retry budgets | Retry amplification | Limit total retries to N% of original traffic |
| Circuit breakers | Retry amplification | Stop calling a failing dependency entirely |
| Bulkheads | Load redistribution | Isolate resource pools so one failure can't consume all capacity |
| Auto-scaling | Load redistribution | Add capacity faster than the cascade propagates |

We will cover these patterns in detail in the Resiliency Patterns lesson. For now, the key insight is: cascading failures are caused by positive feedback loops, and the defense is to break those loops.

Quiz: You have 4 servers at 75% capacity. One server fails. What is the new per-server load on the remaining 3, assuming no retries?

Chapter 4: Redundancy

The simplest defense against failure is having a spare. If one disk can fail, have two. If one server can crash, have three. If one datacenter can lose power, have two datacenters. This is redundancy: duplicating components so that when one fails, another takes over.

But not all redundancy is created equal. The two major patterns are active-active and active-passive, and choosing the wrong one can be worse than having no redundancy at all.

Active-Passive

In active-passive (also called primary-standby), one component handles all traffic. The other sits idle, doing nothing, waiting for the primary to fail. When the primary fails, the passive component takes over through a process called failover.

| Aspect | Active-Passive |
| --- | --- |
| Normal operation | Primary handles all traffic. Standby is idle (or receiving replicated data). |
| Failover | Detect primary failure, promote standby, redirect traffic. Takes seconds to minutes. |
| Resource efficiency | 50%: the standby is consuming resources but doing no useful work. |
| Failover risk | The standby may not be ready (replication lag, cold cache, untested failover path). |
| Best for | Databases, stateful systems where split-brain must be avoided. |

The dangerous thing about active-passive is the cold standby problem. If the standby has never handled real traffic, how do you know it works? Its cache is cold. Its connection pools are empty. Its code paths for handling real load have never been exercised. The first time it handles traffic might be during an incident — the worst possible time to discover it doesn't work.

Active-Active

In active-active, all components handle traffic simultaneously. There is no idle standby. When one fails, the remaining components absorb its traffic.

| Aspect | Active-Active |
| --- | --- |
| Normal operation | All replicas handle traffic. Load is balanced across them. |
| Failover | Automatic: the load balancer stops sending traffic to the failed replica. No explicit failover step. |
| Resource efficiency | High: all replicas are doing useful work at all times. |
| Failover risk | Low: survivors are already warm, already handling traffic, already tested. |
| Complexity | Higher for stateful systems: need to handle concurrent writes, conflicts. |
| Best for | Stateless services, read-heavy workloads, systems where slight inconsistency is tolerable. |

Active-active is almost always preferred for stateless services. The only reason to use active-passive is when you have stateful systems that cannot handle concurrent writes (like a single-leader database). Even then, the standby should handle read traffic to stay warm.

The Cold Standby Trap: A Worked Example

Let's trace through a typical active-passive failover that goes wrong. This scenario plays out in production systems all the time.

// Timeline of a bad failover:

// 02:00 AM — Primary database serving 2000 queries/sec
// - Connection pool: 200 connections, all warm
// - Buffer pool cache: 32 GB, 99% hit rate
// - Replication lag to standby: 0.5 seconds

// 02:03 AM — Primary's disk controller fails. Primary goes down.

// 02:03:30 — Failover detection fires after 30 seconds of missed heartbeats.
// Standby promoted to primary.

// 02:03:31 — 200 application servers simultaneously open connections
// to the new primary. Connection storm. Some fail to connect.

// 02:03:35 — First queries hit the new primary.
// Buffer pool cache is COLD (was idle standby).
// Cache hit rate: 0%. Every query hits disk.
// Latency: 10x normal (50ms → 500ms per query).

// 02:04:00 — Application servers start timing out.
// Retries begin. New primary is overwhelmed.
// The "recovery" is now a cascading failure.

// 02:15:00 — Cache warms up. Queries normalize.
// Total impact: 12 minutes of degradation.
// Expected: 30 seconds. Actual: 12 minutes.

The fix: the standby should handle read traffic to keep its cache warm. This is "warm standby" vs. "cold standby." A warm standby has a hot buffer pool cache and warm connection pools. Its failover latency is seconds, not minutes.

The N+1 and N+2 Models

How many spare replicas do you need? This depends on what you are protecting against:

// N+1: survive one failure
// If you need 4 servers to handle peak load, deploy 5.
// One can fail and you still have enough capacity.

// N+2: survive two simultaneous failures
// If you need 4 servers for peak load, deploy 6.
// Used when you need to survive a failure DURING a deployment
// (one server down for deploy + one failure = 2 down).

// N+M (where M = the servers in one fault domain): survive losing a whole zone
// If you have 3 availability zones and need 6 servers,
// deploy 3 per zone = 9 total.
// Any one zone can fail (3 servers) and you still have 6.
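
For the zone-level case, the per-zone server count follows from requiring that the surviving zones alone cover peak load. A minimal sketch, assuming servers are spread evenly and only one zone fails at a time (the function name is ours):

```python
def servers_per_zone(needed: int, zones: int) -> int:
    """Servers to place in each zone so that losing any single zone
    still leaves at least `needed` servers running."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    surviving_zones = zones - 1
    # Ceiling division: the surviving zones must cover `needed` on their own.
    return -(-needed // surviving_zones)


per_zone = servers_per_zone(needed=6, zones=3)
print("per zone:", per_zone, "total:", per_zone * 3)
```

Adding zones reduces the overhead: with 4 zones and the same requirement, each zone needs only 2 servers, for 8 in total instead of 9.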

The simulation below compares active-passive and active-active redundancy. Watch how failover works in each mode and how quickly the system recovers.

Active-Active vs. Active-Passive Failover

Left: active-passive with failover delay. Right: active-active with instant redistribution.

Quiz: You run an active-passive database. The primary fails at 2:00 AM. The standby's replication was 30 seconds behind. What happens to the last 30 seconds of writes?

Chapter 5: Fault Domains

You have three replicas of your database. Smart. But you put all three on the same physical rack in the same datacenter. The rack's top-of-rack switch fails. All three replicas are unreachable. Your redundancy is useless because all your replicas share the same fault domain.

A fault domain is a group of components that share a single point of failure. Everything on the same server shares the server's power supply. Everything on the same rack shares the rack's network switch and power distribution unit. Everything in the same datacenter shares the datacenter's power feed and internet uplink.

The Hierarchy of Fault Domains

Fault domains form a hierarchy, from small (a single process) to large (an entire region):

Process
A single process crash kills everything in that process. Scope: one service instance.
Server / VM
Hardware failure or OS crash kills all processes on that server. Scope: 5-50 service instances.
Rack
Top-of-rack switch failure or PDU failure takes out all servers on the rack. Scope: 20-48 servers.
Availability Zone (AZ)
Power failure, cooling failure, or network isolation. Scope: thousands of servers in one physical building.
Region
Natural disaster, widespread power outage, or software bug affecting the entire region. Scope: all AZs in a geographic area.

The rule for fault domains is simple: replicas should be in different fault domains. If you have 3 replicas, put them on 3 different racks (to survive rack failure), in 3 different AZs (to survive AZ failure), or in 3 different regions (to survive regional failure).
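
One way to check a placement against this rule is to count how many replicas survive the worst single failure at a given level of the hierarchy. The sketch below is illustrative only; the placement format and the rack and AZ names are hypothetical.

```python
from collections import Counter


def survivors_after_worst_failure(placements, level):
    """Replicas left after the single most damaging failure at `level`.

    `placements` is a list of dicts such as {"rack": "r1", "az": "az1"},
    one per replica (a hypothetical format for this sketch).
    """
    per_domain = Counter(replica[level] for replica in placements)
    return len(placements) - max(per_domain.values())


same_rack = [{"rack": "r1", "az": "az1"}] * 3
spread = [
    {"rack": "r1", "az": "az1"},
    {"rack": "r7", "az": "az2"},
    {"rack": "r9", "az": "az3"},
]

print(survivors_after_worst_failure(same_rack, "rack"))  # all replicas share one rack
print(survivors_after_worst_failure(spread, "az"))       # one replica per AZ
```

A result of zero means the placement has a single point of failure at that level.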

The Trade-off

Spreading replicas across fault domains increases reliability but also increases latency. The further apart your replicas, the longer it takes to synchronize them.

| Spread | Survives | Synchronization Latency | Cost |
| --- | --- | --- | --- |
| Same rack | Process / server failure | < 0.1ms | Low |
| Different racks | Rack failure | < 1ms | Low |
| Different AZs | AZ failure | 1-5ms | Medium (cross-AZ network fees) |
| Different regions | Regional disaster | 50-300ms | High (inter-region replication) |

Most production systems use cross-AZ replication. This survives the most common large-scale failure (an entire building losing power) while keeping latency low enough for synchronous replication. Cross-region replication is reserved for disaster recovery and is almost always asynchronous.

The simulation below shows how spreading replicas across fault domains protects against different failure scopes. Click on a fault domain level to simulate a failure at that scope.

Fault Domain Hierarchy

3 replicas deployed across fault domains. Click a domain level to simulate failure at that scope.

Quiz: You deploy 3 database replicas in 3 different availability zones. A power outage takes out AZ-1. How many replicas remain?

Chapter 6: Shuffle Sharding

You run a multi-tenant service. Customer A and Customer B both send traffic to your system. You have 8 backend servers. In a simple setup, every customer's traffic can reach every server. If Customer A sends a traffic spike that overloads server 3, Customer B's traffic on server 3 also suffers.

The blast radius of Customer A's spike includes Customer B and every other customer on server 3. How do we shrink that blast radius?

Simple Sharding: The Starting Point

One approach: assign each customer to a fixed shard of servers. Customer A gets servers 1 and 2. Customer B gets servers 3 and 4. Customer C gets servers 5 and 6. Now if Customer A overloads their shard, only servers 1 and 2 are affected. Customer B on servers 3 and 4 is untouched.

This is better, but the blast radius of a shard failure is still large: every customer on that shard is affected. If you have 100 customers on shard 1, a problem on shard 1 hits all 100 customers.

Shuffle Sharding: Random Assignment

Shuffle sharding assigns each customer to a random subset of servers. Instead of fixed shards, each customer gets a randomly chosen pair (or trio) of servers. The magic is in the math of combinatorics.

// With 8 servers and each customer assigned to 2 random servers:
Number of possible combinations = C(8, 2) = 8! / (2! × 6!) = 28

// Customer A is assigned to servers {2, 5}
// Customer B is assigned to servers {3, 7}
// Customer C is assigned to servers {1, 5}

// If Customer A overloads servers 2 and 5:
// Customer B (on 3, 7) is COMPLETELY unaffected.
// Customer C (on 1, 5) is PARTIALLY affected (server 5 only).

// Probability that two random customers share BOTH servers:
P(same pair) = 1 / C(8, 2) = 1/28 = 3.6%

// Probability they share at LEAST ONE server:
P(overlap ≥ 1) = 1 - C(6,2)/C(8,2) = 1 - 15/28 = 46%

// But partial overlap means partial blast radius, not total.

The key insight: with fixed sharding, a shard failure has a 100% blast radius for all customers on that shard. With shuffle sharding, the blast radius is probabilistically limited. Most customers share zero or one servers with the affected customer.
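
The overlap probabilities above come from a hypergeometric count, which Python's `math.comb` makes easy to reproduce. This sketch assumes both customers' shards are drawn uniformly and independently; the function name is ours.

```python
from math import comb


def overlap_probabilities(servers: int, shard_size: int) -> dict:
    """P(exactly k shared servers) for two independently chosen random shards."""
    total = comb(servers, shard_size)
    return {
        # Choose k of the first customer's servers and the rest from the others.
        k: comb(shard_size, k) * comb(servers - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    }


probs = overlap_probabilities(servers=8, shard_size=2)
for shared, p in probs.items():
    print(f"share {shared} server(s): {p:.1%}")
print(f"share at least one: {1 - probs[0]:.1%}")
```

For 8 servers and shards of 2, most of the overlap probability is in the "share exactly one" case, which is the partial blast radius described above.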

Scaling the Numbers

As the fleet grows and each shard stays a small fraction of it, the number of possible shards grows combinatorially, and shuffle sharding becomes dramatically more powerful:

| Total Servers | Shard Size | Possible Combinations | P(exact same shard) |
| --- | --- | --- | --- |
| 8 | 2 | 28 | 3.6% |
| 16 | 2 | 120 | 0.83% |
| 100 | 2 | 4,950 | 0.02% |
| 100 | 5 | 75,287,520 | < 0.000002% |

With 100 servers and shards of size 5, there are over 75 million possible combinations. The probability that two random customers share all 5 servers is vanishingly small. Even a catastrophic "noisy neighbor" affecting all 5 of their servers impacts almost no one else.

Shuffle sharding is how AWS isolates customers. Services like Route 53 and DynamoDB use shuffle sharding to ensure that one customer's bad behavior (intentional or not) cannot cause a broad outage. It is one of the most elegant ideas in distributed systems engineering.

Shuffle Sharding: A Concrete Worked Example

Let's trace through a specific scenario. You have 8 workers (W1 through W8) and 6 customers (A through F). Each customer is assigned to 2 random workers:

// Assignment (random):
Customer A: {W2, W5}
Customer B: {W3, W7}
Customer C: {W1, W5}
Customer D: {W1, W6}
Customer E: {W4, W8}
Customer F: {W3, W6}

// Customer A sends a poison request that crashes W2 and W5.

// Impact analysis:
Customer A: FULLY affected (both workers down)
Customer B: Unaffected (W3, W7 both healthy)
Customer C: PARTIALLY affected (W5 down, but W1 still healthy)
Customer D: Unaffected (W1, W6 both healthy)
Customer E: Unaffected (W4, W8 both healthy)
Customer F: Unaffected (W3, W6 both healthy)

// Blast radius: 1 customer fully affected, 1 partially affected
// Compare with fixed sharding: if A is on Shard 1 (W1-W4),
// ALL customers on Shard 1 would be affected = 50% blast radius.

The benefit of shuffle sharding is probabilistic: as you add more workers, the probability that any two customers share their entire set of workers falls rapidly (it is 1/C(n, k) for n workers and shards of size k). With enough workers, each customer effectively has a private shard, without the operational cost of actually running private shards.
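The impact analysis in the worked example can be reproduced mechanically. The sketch below hard-codes the assignment from the example; the `impact` function and its labels ("full", "partial", "none") are illustrative choices, not part of any real tool.

```python
# Assignment copied from the worked example above.
ASSIGNMENTS = {
    "A": {"W2", "W5"},
    "B": {"W3", "W7"},
    "C": {"W1", "W5"},
    "D": {"W1", "W6"},
    "E": {"W4", "W8"},
    "F": {"W3", "W6"},
}

def impact(assignments, failed_workers):
    """Classify each customer as 'full', 'partial', or 'none' affected."""
    result = {}
    for customer, workers in assignments.items():
        down = workers & failed_workers  # this customer's workers that are down
        if down == workers:
            result[customer] = "full"
        elif down:
            result[customer] = "partial"
        else:
            result[customer] = "none"
    return result

# Customer A's poison request takes down both of A's workers.
report = impact(ASSIGNMENTS, failed_workers=ASSIGNMENTS["A"])
print(report)
```

Running it against other failure sets (for example `{"W1", "W6"}`) is a quick way to see how the blast radius shifts with the assignment.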

The simulation below lets you see shuffle sharding in action. Each customer is assigned to a random pair of servers. Click on a customer to "overload" their servers and see how the blast radius is contained.

Shuffle Sharding Visualizer

8 servers (top). Each customer (bottom) is assigned to 2 random servers (lines). Click a customer to overload their servers.

Quiz: You have 20 servers and assign each customer to a random pair of 2 servers. There are C(20,2) = 190 possible pairs. Customer X and Customer Y are assigned independently. What is the probability they share the exact same pair of servers?

Chapter 7: Cellular Architecture

Shuffle sharding limits the blast radius within a single deployment. But what if the failure is not a noisy neighbor but a systemic issue — a bad deploy, a control plane bug, a shared dependency that goes down? In these cases, shuffle sharding does not help because the failure affects the entire fleet.

Cellular architecture (also called cell-based architecture) takes isolation to the extreme: instead of one large system, you build many small independent copies of the entire system, called cells. Each cell is a complete, self-contained instance of your service with its own compute, storage, configuration, and deployment pipeline.

What Makes a Cell

A cell is not just a shard. A shard shares infrastructure (deployment pipeline, config system, monitoring) with other shards. A cell is fully independent:

| Property | Shard | Cell |
| --- | --- | --- |
| Compute | Shared cluster | Dedicated per cell |
| Storage | Shared database (different partitions) | Dedicated database per cell |
| Deployment | Deployed together | Deployed independently, one cell at a time |
| Config | Shared config | Independent config per cell |
| Failure blast radius | Entire shard (all customers in it) | One cell (one subset of customers) |
| Bad deploy blast radius | Entire service (all shards deployed together) | One cell (only the cell being deployed) |

The killer feature of cells: deployment isolation. With cells, you deploy to one cell first, observe it for 30 minutes, then deploy to the next. A bad deploy affects only one cell (maybe 5% of your customers), not all of them. Without cells, a bad deploy rolls out to 100% of your traffic.

How Routing Works

A thin cell router sits in front of all cells. It maps each request to a specific cell based on a customer ID, tenant ID, or similar stable identifier. The router itself is the only shared component, so it must be extremely simple and reliable — it does nothing but route.

1. Request arrives. Client sends request to the global endpoint.

2. Cell Router. Looks up customer ID in routing table. Forwards to Cell 3.

3. Cell 3. Processes request using its own compute, storage, and config. Returns response.
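The three steps above can be sketched as a table lookup followed by a forward. Everything named here (`Request`, `ROUTING_TABLE`, `lookup_cell`, `forward`, the customer IDs, and the hostnames) is hypothetical and exists only to illustrate the flow. A real router would forward over the network rather than call a local function.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Request:
    customer_id: str
    path: str

# Hypothetical routing table: customer ID -> cell hostname.
ROUTING_TABLE = {
    "customer-17": "cell-3.internal",
    "customer-42": "cell-1.internal",
}

def lookup_cell(customer_id: str) -> str:
    """Return the cell that owns this customer; raise KeyError if unknown."""
    return ROUTING_TABLE[customer_id]

def forward(request: Request, send: Callable[[str, Request], str]) -> str:
    """Pick the cell, then hand the request to a caller-supplied transport."""
    return send(lookup_cell(request.customer_id), request)

# Stand-in transport that only reports where the request went.
def fake_send(cell: str, request: Request) -> str:
    return f"{cell} handled {request.path}"

print(forward(Request("customer-17", "/orders"), fake_send))
```

Keeping the transport as a parameter keeps the routing decision trivially testable, which fits the goal of a router that does nothing but route.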

Real-World Cellular Architectures

| Company | System | Cell Size |
| --- | --- | --- |
| AWS | DynamoDB, Route 53 | Small cells, high count |
| Microsoft | Azure Active Directory | "Scale units" per region |
| Slack | Backend services | Per-team cells |
| Roblox | Game servers | Per-game-instance cells |

The simulation below shows a cellular architecture with 4 cells. Each cell serves a subset of customers. Deploy a bad update to one cell and see how the damage is contained.

Cellular Architecture

4 independent cells, each serving customers. Deploy a bad update to one cell and compare with a global deploy.

The trade-off of cells: operational complexity. Instead of operating one system, you operate N systems. Monitoring, deployment tooling, debugging — everything is multiplied. You also lose the ability to do cross-cell queries easily. Cells are worth it when the cost of a total outage is very high (e.g., AWS infrastructure services), but overkill for a small startup.

Designing a Cell

The hardest part of cellular architecture is deciding what constitutes a cell. The boundary must be complete: everything inside the cell is self-sufficient, with no shared mutable state across cells.

// Cell design checklist:

// 1. Compute: each cell has its own servers/containers.
Cell 1: servers {c1-s1, c1-s2, c1-s3, c1-s4}
Cell 2: servers {c2-s1, c2-s2, c2-s3, c2-s4}

// 2. Storage: each cell has its own database.
Cell 1: database-cell1.internal (primary + replicas)
Cell 2: database-cell2.internal (primary + replicas)

// 3. Config: each cell pulls config independently.
Cell 1: config deployed at 10:00 AM
Cell 2: config deployed at 10:30 AM (staged rollout)

// 4. Monitoring: each cell has its own dashboards.
Cell 1 error rate: 0.1% (healthy)
Cell 2 error rate: 5.2% (bad deploy detected!)

// The cell router is the ONLY shared component.
// It must be extremely simple, stateless, and highly available.
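
Items 3 and 4 of the checklist combine into a staged rollout: deploy to one cell, check that cell's own error rate, and stop before touching the next cell if it looks unhealthy. The sketch below is one way this could look. `staged_rollout`, the 1% threshold, and the callback shapes are assumptions for illustration, and the bake period is left as a comment.

```python
from typing import Callable, List, Optional, Sequence, Tuple

def staged_rollout(
    cells: Sequence[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # assumed threshold, tune per service
) -> Tuple[List[str], Optional[str]]:
    """Deploy cell by cell; return (healthy cells, first unhealthy cell or None)."""
    completed: List[str] = []
    for cell in cells:
        deploy(cell)
        # A real rollout would wait out a bake period here before checking.
        if error_rate(cell) > max_error_rate:
            return completed, cell  # halt: later cells are never touched
        completed.append(cell)
    return completed, None

# Fake post-deploy error rates, echoing the checklist numbers above.
observed = {"cell-1": 0.001, "cell-2": 0.052, "cell-3": 0.001}
deployed: List[str] = []
done, failed = staged_rollout(list(observed), deployed.append, observed.__getitem__)
print(done, failed, deployed)
```

In this run the rollout halts at `cell-2`, and `cell-3` never receives the bad change.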

Cell Sizing

How large should a cell be? There is a tension between isolation (more, smaller cells) and efficiency (fewer, larger cells). A common heuristic:

| Factor | Fewer Large Cells | Many Small Cells |
| --- | --- | --- |
| Blast radius per cell | Large (25-50% of traffic) | Small (2-5% of traffic) |
| Operational overhead | Low | High (N sets of everything) |
| Resource efficiency | High (better bin-packing) | Lower (each cell needs headroom) |
| Deploy safety | Risky (large blast per cell) | Safe (tiny blast per cell) |
| Cross-cell queries | Easier (less fan-out) | Harder (more fan-out) |

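A common starting point is 3-5 cells, with more added as traffic grows. A reasonable target is that a single cell failure affects no more than 10% of traffic, which takes at least 10 evenly loaded cells; with only 3-5 cells, each failure still hits 20-33% of traffic.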
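The sizing arithmetic is simple enough to write down. This sketch assumes traffic is spread evenly across cells, which real deployments only approximate; both function names are made up for the example.

```python
import math

def blast_radius_per_cell(num_cells: int) -> float:
    """Fraction of traffic lost when one of num_cells evenly loaded cells fails."""
    return 1 / num_cells

def min_cells_for_target(max_percent: int) -> int:
    """Fewest evenly loaded cells so one cell failure affects <= max_percent of traffic."""
    return math.ceil(100 / max_percent)

for n in (3, 5, 10, 20):
    print(f"{n} cells -> {blast_radius_per_cell(n):.0%} per cell")

print(min_cells_for_target(10))  # 10
```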

Cell Router Design

The cell router is the single point of failure in a cellular architecture. If it goes down, all cells are unreachable. Therefore, the router must be:

1. Stateless. No mutable state. Routing table is read from a static config that is cached locally. Even if the config system is down, the router uses its last known good config.

2. Deterministic. Given the same customer ID, always route to the same cell. This prevents split-brain where a customer's requests go to different cells and see inconsistent data.

3. Extremely simple. The router does one thing: hash the customer ID, look up the cell, forward the request. No business logic, no data transformation, no authentication. Every line of code in the router is a potential failure point.

python
# Minimal cell router
import hashlib

CELL_ASSIGNMENTS = {
    0: "cell-1.internal",
    1: "cell-2.internal",
    2: "cell-3.internal",
    3: "cell-4.internal",
}
NUM_CELLS = len(CELL_ASSIGNMENTS)

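def route(customer_id):
    """Return the cell hostname for a customer ID (a string)."""
    # Deterministic hash: same customer always goes to same cell.
    # MD5 is used only to spread IDs evenly, not for security.
    h = int(hashlib.md5(customer_id.encode()).hexdigest(), 16)
    # Caveat: changing NUM_CELLS remaps most customers, so adding a cell
    # means migrating data or switching to an explicit routing table.
    cell_idx = h % NUM_CELLS
    return CELL_ASSIGNMENTS[cell_idx]

Quiz: You have 10 cells, each serving 10% of traffic. You deploy a bad update to Cell 3. What percentage of customers experience the outage?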

Chapter 8: Blast Radius Simulator

This is the payoff chapter. Everything we have learned (fault domains, shuffle sharding, cellular architecture) exists to shrink the blast radius of a failure. Blast radius is the fraction of your system or your customers that is affected by a single failure event.

The simulator below lets you configure a distributed system with different isolation strategies, then inject failures and watch the blast radius in real time.

How to use the simulator. Choose an isolation strategy (None, Fault Domains, Shuffle Sharding, or Cells). Then inject different types of failures: a single server crash, a rack failure, a bad config push, or a noisy neighbor. Watch how the blast radius changes based on the isolation strategy. The red percentage at the top shows what fraction of customers are impacted.

Blast Radius Simulator

Choose isolation strategy, then inject failures. Red = affected. The percentage shows blast radius.


Expected Blast Radius by Strategy

| Failure | No Isolation | Fault Domains | Shuffle Sharding | Cells |
| --- | --- | --- | --- | --- |
| Server crash | ~12% (1/8 servers) | ~12% | ~12% | ~3% (1 cell) |
| Rack failure | ~50% | ~12% (spread across racks) | ~12% | ~3% |
| Bad config | 100% | 100% | 100% | ~3% (deploy to 1 cell) |
| Noisy neighbor | ~25% | ~25% | ~6% (2 servers) | ~3% |

Notice the pattern: no single strategy handles all failure types. Fault domains protect against infrastructure failures but not software failures. Shuffle sharding protects against noisy neighbors but not config pushes. Cells protect against everything — at the cost of operational complexity.

The best systems layer these strategies: cells for deployment isolation, shuffle sharding within each cell for tenant isolation, and fault domain awareness for infrastructure resilience.
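Layering cells and shuffle sharding can be sketched as a two-step placement: a stable hash picks the cell, then a seeded random draw picks the customer's shuffle shard inside that cell. This is an illustrative sketch, not any provider's implementation. The cell layout, `SHARD_SIZE`, and the `place` function are assumptions, and seeding `random.Random` from a hash is just one simple way to get a repeatable draw.

```python
import hashlib
import random
from typing import List, Tuple

# Hypothetical layout: two cells with eight servers each.
CELLS = {
    "cell-1": [f"c1-s{i}" for i in range(1, 9)],
    "cell-2": [f"c2-s{i}" for i in range(1, 9)],
}
SHARD_SIZE = 2

def _stable_hash(text: str) -> int:
    """Hash that is identical across processes (unlike the built-in hash())."""
    return int(hashlib.sha256(text.encode()).hexdigest(), 16)

def place(customer_id: str) -> Tuple[str, List[str]]:
    """Return (cell, shuffle shard within that cell) for a customer."""
    cell_names = sorted(CELLS)
    cell = cell_names[_stable_hash(customer_id) % len(cell_names)]
    # Seed a private RNG from the customer ID so the draw is repeatable.
    rng = random.Random(_stable_hash("shard:" + customer_id))
    shard = sorted(rng.sample(CELLS[cell], SHARD_SIZE))
    return cell, shard

print(place("customer-17"))
```

A bad deploy is then bounded by the cell, and a noisy neighbor is bounded by the shard inside it. Fault domain awareness would additionally constrain which servers may appear together in one shard, which this sketch leaves out.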

Real-World Blast Radius Incidents

Let's look at real incidents through the lens of blast radius and isolation:

| Incident | Cause | Blast Radius | What Would Have Helped |
| --- | --- | --- | --- |
| AWS S3, Feb 2017 | Typo in maintenance command removed too many servers | 100% of S3 in us-east-1. Cascaded to hundreds of dependent services. | Cellular architecture for S3 internal components |
| Cloudflare, Jul 2019 | Bad WAF regex caused CPU spike | 100% of traffic through Cloudflare globally | Canary deployment of WAF rules, shuffle sharding per customer |
| Facebook, Oct 2021 | BGP config change withdrew all routes | 100% of Facebook, Instagram, WhatsApp for 6 hours | Fault domain separation between config system and production |
| Google Cloud, Jun 2019 | Config change caused network overload | ~50% of Google Cloud services in multiple regions | Staged config rollout, cellular config propagation |

Notice the pattern: every outage in this table was a correlated failure caused by a change (config, deploy, maintenance) that propagated globally without isolation. Hardware failures rarely cause outages on this scale, because redundancy (RAID, replication) absorbs them. It is the human-initiated changes that bypass isolation and cause global damage.

Designing for Blast Radius

When architecting a new system, ask yourself these questions for every component:

// Blast radius design checklist:

// 1. What is the worst-case blast radius of a failure in this component?
// If the answer is "100% of traffic," you need isolation.

// 2. What is the worst-case blast radius of a bad deploy?
// If the answer is "100% of traffic," you need staged rollout + cells.

// 3. Can one customer's behavior affect other customers?
// If yes, you need shuffle sharding or per-customer isolation.

// 4. Does the system do MORE work during failure?
// If yes, you have a bimodal system — redesign for constant work.

// 5. Can a single config change affect all instances?
// If yes, you need staged config rollout.

Chapter 9: Connections

Failure modes and isolation are the foundation. You now understand what fails and how to contain the damage. But containing damage is only half the story — you also need to recover gracefully when failures happen despite your isolation.

This Lesson vs. Related Topics

| Topic | Focus | Relationship |
| --- | --- | --- |
| Failure Modes (this lesson) | What fails and how to contain it | The "why" and "where" of failures |
| Resiliency Patterns | Timeouts, retries, circuit breakers, load shedding | The "how" of surviving failures at runtime |
| Testing & Deployment | Chaos engineering, canary deploys, rollbacks | How to find failures before users do, and recover fast |
| Observability | Metrics, logs, traces, alerting | How to detect failures and understand their blast radius |

Key Takeaways

1. Hardware faults are largely independent. Redundancy works because one disk dying does not usually cause another to die. Simple replication defeats simple hardware faults.

2. Software and config faults are correlated. The same bug exists on every replica. The same bad config hits every server. Redundancy alone does not help — you need isolation.

3. Cascading failures are driven by feedback loops. Retry storms, load redistribution, and resource exhaustion amplify a small failure into a total outage. Break the loops.

4. Fault domains bound the blast radius of infrastructure failures. Put replicas in different racks, AZs, or regions.

5. Shuffle sharding bounds the blast radius of noisy neighbors. Random assignment means one customer's problem rarely affects another.

6. Cellular architecture bounds the blast radius of everything. Independent cells mean a bad deploy affects only one cell. The price is operational complexity.

"The only way to have a system that works is to plan for the ways it fails." — Werner Vogels, CTO of Amazon

Final quiz: You are designing a new multi-tenant SaaS platform. You need to protect against: (1) hardware failures, (2) noisy neighbors, and (3) bad deployments. Which combination of isolation strategies covers all three?