Everything fails. The question is how much damage it does when it does.
You are running a web application on three servers behind a load balancer. At 3:17 AM, one server's disk fills up because a logging library wrote an unexpectedly large stack trace in a loop. The server starts returning 500 errors. The load balancer notices and routes traffic to the remaining two servers. They were already running at 40% capacity each, so now they are at 60%. No big deal.
But here is the problem: the same logging library runs on all three servers. The same bug exists on all three. The same unusual input that triggered the loop is still arriving. Within 90 seconds, all three servers have full disks. Your entire application is down.
This is not a hardware failure. The disks are fine. The CPUs are fine. The network is fine. A correlated software fault took out every replica simultaneously. Your redundancy — three whole servers! — was worthless because the failure was not independent.
The simulation below shows a simple three-server setup. Each server processes requests independently. Click "Inject Fault" to see what happens when a correlated fault hits all three servers versus when only one is affected.
Three servers behind a load balancer. Green = healthy, red = failed. Watch how correlated faults bypass redundancy.
When one server fails independently, the load balancer shifts traffic and users barely notice. When all three fail together, there is nowhere to redirect. The system is down.
Failures come from many sources. Some are random (a disk wears out), some are systematic (a bug triggered by a specific input), and some are environmental (a power outage in a datacenter). The key distinction is whether failures are independent (happening by random chance) or correlated (happening to multiple components for the same reason).
| Failure Type | Example | Correlation |
|---|---|---|
| Hardware | Disk dies after 5 years of use | Usually independent (unless same batch) |
| Software bug | Null pointer on edge-case input | Perfectly correlated across all replicas |
| Config error | Bad DNS entry pushed to all servers | Perfectly correlated |
| Network partition | Top-of-rack switch dies | Correlated for all servers on that rack |
| Load pressure | Traffic spike overwhelms capacity | Correlated — all servers see the spike |
| Resource leak | Memory leak crashes process every 48 hours | Correlated if all replicas started at the same time |
Before we dive in, let's establish vocabulary. Distributed systems distinguish between three levels of problems:
A fault is a single component deviating from its specification. One disk returns corrupted data. One process crashes. One network link drops packets. A fault is local and specific.
A failure is a system-level event where the service as a whole stops providing correct results to users. A failure is what users experience. One disk fault does not necessarily cause a system failure — if you have redundancy, the system continues working.
An outage is a sustained failure affecting users for a meaningful duration. A 200ms blip during a failover is a failure but not usually counted as an outage. A 45-minute period of 500 errors is an outage.
The goal of fault isolation is to prevent faults from becoming failures, and to contain failures so they do not become prolonged outages.
Blast radius is the fraction of your system, customers, or requests affected by a single fault. It is the most important metric in failure engineering.
Every isolation strategy in this lesson is designed to minimize blast radius. The table below previews what we will learn:
| Strategy | Blast Radius Target | What It Protects Against |
|---|---|---|
| Redundancy | 0% (if independent) | Single-component hardware faults |
| Fault domains | ≤ 1/N (N = domains) | Infrastructure failures (rack, AZ) |
| Shuffle sharding | 1-2 customers | Noisy neighbors, per-customer faults |
| Cellular architecture | 1/N (N = cells) | Everything: deploys, config, software bugs |
Hard drives spin at 7,200 RPM, 24 hours a day, for years on end. Eventually, the bearings wear out. The magnetic surface degrades. A cosmic ray flips a bit in RAM. A power supply capacitor dries out. Hardware is physical, and physical things break.
The good news about hardware faults is that they are usually independent. When one disk dies, that does not cause the disk in the next server to die. Each failure is a random, uncorrelated event. This means that simple redundancy — having a spare — is highly effective.
Mean Time Between Failures is the average time a component runs before it fails. For a modern enterprise hard drive, MTBF is roughly 10-50 years (depending on the model and workload). But "10 years per drive" does not mean your first failure is in 10 years if you have many drives.
Google, with millions of servers, sees thousands of hardware failures per day. Amazon, Microsoft, and Meta see similar numbers. At this scale, hardware failure is not an exception — it is the steady state.
| Component | Failure Mode | Typical MTBF | Mitigation |
|---|---|---|---|
| Hard drive | Mechanical wear, bad sectors | 10-50 years | RAID, replication |
| SSD | Write endurance exhaustion | 5-10 years | Wear leveling, replication |
| RAM | Bit flips from cosmic rays or decay | Rare per stick, common at scale | ECC memory |
| Power supply | Capacitor failure, fan failure | 5-10 years | Dual PSU, UPS |
| Network interface | Port failure, cable degradation | 10+ years | Dual NIC, bonding |
| Entire server | Motherboard failure | 3-10 years | Replicas on other servers |
The simulation below shows a datacenter with 100 servers. Each server has a random MTBF. Over time, servers fail and are replaced. Watch how the failure rate becomes predictable at scale.
100 servers, each with random MTBF. Green dots = healthy, red = failed. The chart shows cumulative failures over time.
The key insight: hardware faults are mostly independent. One server's disk dying does not cause another server's disk to die. This means that if you have data on two different servers, losing one disk does not lose the data. Simple replication defeats simple hardware faults.
But "mostly independent" has exceptions. A batch of drives from the same manufacturer, shipped at the same time, installed at the same time, may all reach end-of-life at roughly the same time. A power outage takes out every server on the same power circuit. These are correlated hardware faults, and we will address them with fault domains in Chapter 5.
Let's do the math for a real datacenter. Suppose you operate 10,000 servers, each with 4 hard drives. That is 40,000 drives. Each drive has an annual failure rate (AFR) of 2% (a realistic figure for enterprise drives).
This is why large-scale systems are designed with the assumption that hardware will fail continuously. The system must handle failures automatically, without human intervention, as part of its normal operation.
A particularly insidious hardware fault is a silent bit flip in RAM. Cosmic rays and natural radioactive decay can flip individual bits in memory. Without ECC (Error-Correcting Code) memory, these flips are silent — the data is corrupted but no error is reported. The program continues with wrong data.
A study at Google found approximately 1 single-bit error per gigabyte of RAM per year. With 64 GB per server and 10,000 servers, that is 640,000 bit flips per year across the fleet. ECC memory detects and corrects single-bit errors automatically. This is why every production server uses ECC memory.
Hardware faults are well-understood and well-mitigated. The bigger threat to modern systems is software faults and configuration errors. Unlike hardware, these are almost always perfectly correlated: the same bug exists on every replica, the same bad config is pushed to every server.
A software bug is dormant until triggered. It might lurk for months, waiting for a specific combination of inputs, load level, or timing. When the trigger arrives, every replica running that software hits the bug simultaneously.
This pattern — a "poison pill" request that crashes every server it touches — is one of the most common causes of total outages in production systems.
Configuration changes are the number one cause of outages at large internet companies. Not hardware failures. Not software bugs. Config changes.
Why? Because configuration changes are typically pushed globally and immediately. A developer changes a DNS entry, a firewall rule, a feature flag, or a database connection string, and that change propagates to every server in seconds. If the change is wrong, every server is broken in seconds.
| Config Error | Impact | Real-World Incident |
|---|---|---|
| Bad DNS entry | All servers resolve to wrong IP | GitHub outage, 2016 |
| Firewall rule typo | All servers lose network access | Cloudflare outage, 2019 |
| Wrong DB credentials | All servers can't authenticate | Common in many organizations |
| Feature flag misconfigured | All servers enable broken code path | Knight Capital, $440M loss in 45 min |
| Capacity limit set too low | All servers reject valid requests | AWS S3 outage, 2017 |
A resource leak is a slow-motion failure. Memory is allocated but never freed. File handles are opened but never closed. Database connections are checked out but never returned. The process works fine for hours or days, slowly consuming resources, until it hits a limit and crashes.
Resource leaks are insidious because they are time-correlated. If all your replicas started at the same time (which they usually did, because you deployed them at the same time), they all leak at the same rate, and they all crash at roughly the same time.
The simulation below demonstrates how resource leaks and config pushes create correlated failures. Each bar represents a server's health over time.
Watch how different failure types affect servers. Green = healthy, yellow = degraded, red = failed.
There is one more failure type that deserves special attention: load pressure. A traffic spike — from a viral post, a marketing campaign, a DDoS attack, or simply normal daily peak — can overwhelm your system. Unlike hardware or software faults, load pressure affects every server simultaneously because they all see the increased traffic.
Load pressure is correlated by definition: if one server sees the spike, they all do (because the load balancer distributes it). And load pressure interacts badly with other failure types. A server that is running at 95% capacity has zero headroom for a GC pause, a slow query, or a background job. Load pressure turns normal events into failures.
The most dangerous scenarios are not single failure types — they are combinations. A traffic spike (load pressure) hits while a server is restarting (reducing capacity by 1/N), during a GC pause on another server (reducing capacity further), which triggers retries (amplifying load). Each individual event is manageable. Together, they cascade.
| Combination | Why it's dangerous |
|---|---|
| Load spike + server restart | Spike arrives when you have N-1 capacity (during rolling deploy) |
| Bug trigger + retries | One server crashes on bad input, retries spread the bad input to all servers |
| Network partition + leader election | Both sides think they're the leader, conflicting writes |
| Memory leak + traffic spike | Leak at 90% memory, spike pushes to OOM |
This is why defense-in-depth matters. No single isolation strategy handles all failure combinations. You need multiple layers.
One server fails. Its traffic shifts to the remaining servers. Those servers were already at 70% capacity. Now they are at 85%. Response times increase. Clients start timing out and retrying. Each retry doubles the load. The next server collapses. Its traffic shifts to the remaining servers. They are now at 100%. The entire cluster goes down in minutes.
This is a cascading failure: a chain reaction where one failure causes additional failures, which cause more failures, until the entire system is down. The initial trigger might be trivial — a single server restart, a small traffic spike, a GC pause. But the cascade amplifies it into a total outage.
Cascading failures are driven by three positive feedback loops (positive here means self-reinforcing, not "good"):
| Feedback Loop | Mechanism | Amplification Factor |
|---|---|---|
| Retry amplification | Failed requests are retried, each retry consumes server resources | 2-10x per retry layer |
| Load redistribution | Failed server's traffic shifts to survivors, pushing them closer to capacity | N/(N-1) load increase per failure |
| Resource exhaustion | Slow requests hold connections longer, reducing available capacity | Linear with response time increase |
Let's do the math for a concrete example. You have 5 servers, each handling 1000 requests per second (RPS) at 80% capacity (max capacity = 1250 RPS each).
The simulation below lets you watch a cascade in real time. Five servers process incoming traffic. Click "Kill Server" to remove one. Watch the load redistribute and the cascade begin.
5 servers processing traffic. Kill one and watch the cascade. Toggle "Retries" to see the amplification effect.
Every defense against cascading failures targets one of the feedback loops:
| Defense | Targets | Mechanism |
|---|---|---|
| Load shedding | Resource exhaustion | Drop excess requests instead of queuing them |
| Retry budgets | Retry amplification | Limit total retries to N% of original traffic |
| Circuit breakers | Retry amplification | Stop calling a failing dependency entirely |
| Bulkheads | Load redistribution | Isolate resource pools so one failure can't consume all capacity |
| Auto-scaling | Load redistribution | Add capacity faster than the cascade propagates |
We will cover these patterns in detail in the Resiliency Patterns lesson. For now, the key insight is: cascading failures are caused by positive feedback loops, and the defense is to break those loops.
The simplest defense against failure is having a spare. If one disk can fail, have two. If one server can crash, have three. If one datacenter can lose power, have two datacenters. This is redundancy: duplicating components so that when one fails, another takes over.
But not all redundancy is created equal. The two major patterns are active-active and active-passive, and choosing the wrong one can be worse than having no redundancy at all.
In active-passive (also called primary-standby), one component handles all traffic. The other sits idle, doing nothing, waiting for the primary to fail. When the primary fails, the passive component takes over through a process called failover.
| Aspect | Active-Passive |
|---|---|
| Normal operation | Primary handles all traffic. Standby is idle (or receiving replicated data). |
| Failover | Detect primary failure, promote standby, redirect traffic. Takes seconds to minutes. |
| Resource efficiency | 50% — the standby is consuming resources but doing no useful work. |
| Failover risk | The standby may not be ready (replication lag, cold cache, untested failover path). |
| Best for | Databases, stateful systems where split-brain must be avoided. |
The dangerous thing about active-passive is the cold standby problem. If the standby has never handled real traffic, how do you know it works? Its cache is cold. Its connection pools are empty. Its code paths for handling real load have never been exercised. The first time it handles traffic might be during an incident — the worst possible time to discover it doesn't work.
In active-active, all components handle traffic simultaneously. There is no idle standby. When one fails, the remaining components absorb its traffic.
| Aspect | Active-Active |
|---|---|
| Normal operation | All replicas handle traffic. Load is balanced across them. |
| Failover | Automatic — the load balancer stops sending traffic to the failed replica. No explicit failover step. |
| Resource efficiency | High — all replicas are doing useful work at all times. |
| Failover risk | Low — survivors are already warm, already handling traffic, already tested. |
| Complexity | Higher for stateful systems — need to handle concurrent writes, conflicts. |
| Best for | Stateless services, read-heavy workloads, systems where slight inconsistency is tolerable. |
Let's trace through a typical active-passive failover that goes wrong. This scenario plays out in production systems all the time.
The fix: the standby should handle read traffic to keep its cache warm. This is "warm standby" vs. "cold standby." A warm standby has a hot buffer pool cache and warm connection pools. Its failover latency is seconds, not minutes.
How many spare replicas do you need? This depends on what you are protecting against:
The simulation below compares active-passive and active-active redundancy. Watch how failover works in each mode and how quickly the system recovers.
Left: active-passive with failover delay. Right: active-active with instant redistribution.
You have three replicas of your database. Smart. But you put all three on the same physical rack in the same datacenter. The rack's top-of-rack switch fails. All three replicas are unreachable. Your redundancy is useless because all your replicas share the same fault domain.
A fault domain is a group of components that share a single point of failure. Everything on the same server shares the server's power supply. Everything on the same rack shares the rack's network switch and power distribution unit. Everything in the same datacenter shares the datacenter's power feed and internet uplink.
Fault domains form a hierarchy, from small (a single process) to large (an entire region):
The rule for fault domains is simple: replicas should be in different fault domains. If you have 3 replicas, put them on 3 different racks (to survive rack failure), in 3 different AZs (to survive AZ failure), or in 3 different regions (to survive regional failure).
Spreading replicas across fault domains increases reliability but also increases latency. The further apart your replicas, the longer it takes to synchronize them.
| Spread | Survives | Synchronization Latency | Cost |
|---|---|---|---|
| Same rack | Process / server failure | < 0.1ms | Low |
| Different racks | Rack failure | < 1ms | Low |
| Different AZs | AZ failure | 1-5ms | Medium (cross-AZ network fees) |
| Different regions | Regional disaster | 50-300ms | High (inter-region replication) |
The simulation below shows how spreading replicas across fault domains protects against different failure scopes. Click on a fault domain level to simulate a failure at that scope.
3 replicas deployed across fault domains. Click a domain level to simulate failure at that scope.
You run a multi-tenant service. Customer A and Customer B both send traffic to your system. You have 8 backend servers. In a simple setup, every customer's traffic can reach every server. If Customer A sends a traffic spike that overloads server 3, Customer B's traffic on server 3 also suffers.
The blast radius of Customer A's spike includes Customer B and every other customer on server 3. How do we shrink that blast radius?
One approach: assign each customer to a fixed shard of servers. Customer A gets servers 1 and 2. Customer B gets servers 3 and 4. Customer C gets servers 5 and 6. Now if Customer A overloads their shard, only servers 1 and 2 are affected. Customer B on servers 3 and 4 is untouched.
This is better, but the blast radius of a shard failure is still large: every customer on that shard is affected. If you have 100 customers on shard 1, a problem on shard 1 hits all 100 customers.
Shuffle sharding assigns each customer to a random subset of servers. Instead of fixed shards, each customer gets a randomly chosen pair (or trio) of servers. The magic is in the math of combinatorics.
The key insight: with fixed sharding, a shard failure has a 100% blast radius for all customers on that shard. With shuffle sharding, the blast radius is probabilistically limited. Most customers share zero or one servers with the affected customer.
As you increase the number of servers and decrease the shard size (relative to total), shuffle sharding becomes exponentially more powerful:
| Total Servers | Shard Size | Possible Combinations | P(exact same shard) |
|---|---|---|---|
| 8 | 2 | 28 | 3.6% |
| 16 | 2 | 120 | 0.83% |
| 100 | 2 | 4,950 | 0.02% |
| 100 | 5 | 75,287,520 | < 0.000002% |
With 100 servers and shards of size 5, there are over 75 million possible combinations. The probability that two random customers share all 5 servers is vanishingly small. Even a catastrophic "noisy neighbor" affecting all 5 of their servers impacts almost no one else.
Let's trace through a specific scenario. You have 8 workers (W1 through W8) and 6 customers (A through F). Each customer is assigned to 2 random workers:
The beauty of shuffle sharding is probabilistic: as you add more workers, the probability that any two customers share their entire set of workers drops exponentially. With enough workers, each customer effectively has a private shard — without the operational cost of actually running private shards.
The simulation below lets you see shuffle sharding in action. Each customer is assigned to a random pair of servers. Click on a customer to "overload" their servers and see how the blast radius is contained.
8 servers (top). Each customer (bottom) is assigned to 2 random servers (lines). Click a customer to overload their servers.
Shuffle sharding limits the blast radius within a single deployment. But what if the failure is not a noisy neighbor but a systemic issue — a bad deploy, a control plane bug, a shared dependency that goes down? In these cases, shuffle sharding does not help because the failure affects the entire fleet.
Cellular architecture (also called cell-based architecture) takes isolation to the extreme: instead of one large system, you build many small independent copies of the entire system, called cells. Each cell is a complete, self-contained instance of your service with its own compute, storage, configuration, and deployment pipeline.
A cell is not just a shard. A shard shares infrastructure (deployment pipeline, config system, monitoring) with other shards. A cell is fully independent:
| Property | Shard | Cell |
|---|---|---|
| Compute | Shared cluster | Dedicated per cell |
| Storage | Shared database (different partitions) | Dedicated database per cell |
| Deployment | Deployed together | Deployed independently, one cell at a time |
| Config | Shared config | Independent config per cell |
| Failure blast radius | Entire shard (all customers in it) | One cell (one subset of customers) |
| Bad deploy blast radius | Entire service (all shards deployed together) | One cell (only the cell being deployed) |
A thin cell router sits in front of all cells. It maps each request to a specific cell based on a customer ID, tenant ID, or similar stable identifier. The router itself is the only shared component, so it must be extremely simple and reliable — it does nothing but route.
| Company | System | Cell Size |
|---|---|---|
| AWS | DynamoDB, Route 53 | Small cells, high count |
| Microsoft | Azure Active Directory | "Scale units" per region |
| Slack | Backend services | Per-team cells |
| Roblox | Game servers | Per-game-instance cells |
The simulation below shows a cellular architecture with 4 cells. Each cell serves a subset of customers. Deploy a bad update to one cell and see how the damage is contained.
4 independent cells, each serving customers. Deploy a bad update to one cell and compare with a global deploy.
The hardest part of cellular architecture is deciding what constitutes a cell. The boundary must be complete: everything inside the cell is self-sufficient, with no shared mutable state across cells.
How large should a cell be? There is a tension between isolation (more, smaller cells) and efficiency (fewer, larger cells). A common heuristic:
| Factor | Fewer Large Cells | Many Small Cells |
|---|---|---|
| Blast radius per cell | Large (25-50% of traffic) | Small (2-5% of traffic) |
| Operational overhead | Low | High (N sets of everything) |
| Resource efficiency | High (better bin-packing) | Lower (each cell needs headroom) |
| Deploy safety | Risky (large blast per cell) | Safe (tiny blast per cell) |
| Cross-cell queries | Easier (less fan-out) | Harder (more fan-out) |
Most organizations start with 3-5 cells and increase as they grow. The goal is that a single cell failure affects no more than 10% of traffic.
The cell router is the single point of failure in a cellular architecture. If it goes down, all cells are unreachable. Therefore, the router must be:
1. Stateless. No mutable state. Routing table is read from a static config that is cached locally. Even if the config system is down, the router uses its last known good config.
2. Deterministic. Given the same customer ID, always route to the same cell. This prevents split-brain where a customer's requests go to different cells and see inconsistent data.
3. Extremely simple. The router does one thing: hash the customer ID, look up the cell, forward the request. No business logic, no data transformation, no authentication. Every line of code in the router is a potential failure point.
python # Minimal cell router import hashlib CELL_ASSIGNMENTS = { 0: "cell-1.internal", 1: "cell-2.internal", 2: "cell-3.internal", 3: "cell-4.internal", } NUM_CELLS = len(CELL_ASSIGNMENTS) def route(customer_id): # Deterministic hash: same customer always goes to same cell h = int(hashlib.md5(customer_id.encode()).hexdigest(), 16) cell_idx = h % NUM_CELLS return CELL_ASSIGNMENTS[cell_idx]
This is the payoff chapter. Everything we have learned — fault domains, shuffle sharding, cellular architecture — exists to shrink the blast radius of a failure. Blast radius is the fraction of your system or your customers that are affected by a single failure event.
The simulator below lets you configure a distributed system with different isolation strategies, then inject failures and watch the blast radius in real time.
Choose isolation strategy, then inject failures. Red = affected. The percentage shows blast radius.
| Failure | No Isolation | Fault Domains | Shuffle Sharding | Cells |
|---|---|---|---|---|
| Server crash | ~12% (1/8 servers) | ~12% | ~12% | ~3% (1 cell) |
| Rack failure | ~50% | ~12% (spread across racks) | ~12% | ~3% |
| Bad config | 100% | 100% | 100% | ~3% (deploy to 1 cell) |
| Noisy neighbor | ~25% | ~25% | ~6% (2 servers) | ~3% |
Notice the pattern: no single strategy handles all failure types. Fault domains protect against infrastructure failures but not software failures. Shuffle sharding protects against noisy neighbors but not config pushes. Cells protect against everything — at the cost of operational complexity.
The best systems layer these strategies: cells for deployment isolation, shuffle sharding within each cell for tenant isolation, and fault domain awareness for infrastructure resilience.
Let's look at real incidents through the lens of blast radius and isolation:
| Incident | Cause | Blast Radius | What Would Have Helped |
|---|---|---|---|
| AWS S3, Feb 2017 | Typo in maintenance command removed too many servers | 100% of S3 in us-east-1. Cascaded to hundreds of dependent services. | Cellular architecture for S3 internal components |
| Cloudflare, Jul 2019 | Bad WAF regex caused CPU spike | 100% of traffic through Cloudflare globally | Canary deployment of WAF rules, shuffle sharding per customer |
| Facebook, Oct 2021 | BGP config change withdrawn all routes | 100% of Facebook, Instagram, WhatsApp for 6 hours | Fault domain separation between config system and production |
| Google Cloud, Jun 2019 | Config change caused network overload | ~50% of Google Cloud services in multiple regions | Staged config rollout, cellular config propagation |
Notice the pattern: every major outage was a correlated failure caused by a change (config, deploy, maintenance) that propagated globally without isolation. Hardware failures almost never cause these outages — the isolation strategies (RAID, replication) handle those. It is the human-initiated changes that bypass isolation and cause global damage.
When architecting a new system, ask yourself these questions for every component:
Failure modes and isolation are the foundation. You now understand what fails and how to contain the damage. But containing damage is only half the story — you also need to recover gracefully when failures happen despite your isolation.
| Topic | Focus | Relationship |
|---|---|---|
| Failure Modes (this lesson) | What fails and how to contain it | The "why" and "where" of failures |
| Resiliency Patterns | Timeouts, retries, circuit breakers, load shedding | The "how" of surviving failures at runtime |
| Testing & Deployment | Chaos engineering, canary deploys, rollbacks | How to find failures before users do, and recover fast |
| Observability | Metrics, logs, traces, alerting | How to detect failures and understand their blast radius |
1. Hardware faults are independent. Redundancy works because one disk dying does not cause another to die. Simple replication defeats simple hardware faults.
2. Software and config faults are correlated. The same bug exists on every replica. The same bad config hits every server. Redundancy alone does not help — you need isolation.
3. Cascading failures are driven by feedback loops. Retry storms, load redistribution, and resource exhaustion amplify a small failure into a total outage. Break the loops.
4. Fault domains bound the blast radius of infrastructure failures. Put replicas in different racks, AZs, or regions.
5. Shuffle sharding bounds the blast radius of noisy neighbors. Random assignment means one customer's problem rarely affects another.
6. Cellular architecture bounds the blast radius of everything. Independent cells mean a bad deploy affects only one cell. The price is operational complexity.