Resiliency Patterns — From Absolute Zero to Mastery

Chapter 0: Why Resiliency

Your service calls a database. The database is slow today — maybe a long-running query is hogging resources, maybe a disk is degrading, maybe the network is congested. Whatever the reason, your database calls that normally take 5ms are now taking 30 seconds.

Your service has a thread pool of 200 threads. Each request makes a database call. Each database call now holds a thread for 30 seconds instead of 5ms. After 200 requests (which arrive in about 2 seconds at normal traffic), every thread is stuck waiting for the database. No more threads are available. Your service cannot handle any requests — not even requests that don't need the database.

Your service is down, not because it failed, but because it waited too long for a dependency that was slow.
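The arithmetic behind this scenario can be sketched in a few lines of Python. The function name and its parameters are illustrative only, and the 100 RPS arrival rate is inferred from "200 requests in about 2 seconds".

```python
from typing import Optional


def seconds_until_exhausted(pool_size: int, arrival_rps: float,
                            hold_time_s: float) -> Optional[float]:
    """Time until every thread is busy, or None if the pool never fills.

    Steady-state concurrency is arrival_rps * hold_time_s (Little's law).
    If that fits in the pool, threads are released as fast as they are taken.
    """
    if arrival_rps * hold_time_s <= pool_size:
        return None
    # Until the first slow call returns, each arrival pins one more thread.
    return pool_size / arrival_rps


# Healthy database: 100 RPS * 5 ms means about 0.5 threads busy on average.
print(seconds_until_exhausted(200, 100, 0.005))  # None
# Slow database: 100 RPS * 30 s would need 3000 threads.
print(seconds_until_exhausted(200, 100, 30.0))   # 2.0
```

The pool fills long before the first 30-second call returns, which is why the outage appears within seconds of the slowdown.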

The core problem. In a distributed system, the failure of one component often manifests as slowness in another component, which causes resource exhaustion, which causes failure, which causes slowness in the next component. This is the cascading failure pattern. Resiliency patterns exist to break this chain — to ensure that a slow or failing dependency does not drag your service down with it.

The simulation below shows a service calling a slow dependency. Without any resiliency patterns, the service's thread pool fills up and the service becomes completely unresponsive.

Thread Pool Exhaustion

A service with 20 threads calls a slow dependency. Watch threads fill up as the dependency slows down.

Start traffic, then slow down the dependency to see thread exhaustion.

The Resiliency Toolkit

Each resiliency pattern addresses a specific failure mode. Here is the complete toolkit:

Pattern	What it does	Protects against
Timeout	Limits how long you wait for a response	Slow dependencies exhausting resources
Retry	Repeats a failed request	Transient failures (network blips, brief overload)
Circuit breaker	Stops calling a failing dependency	Retry storms overwhelming a recovering service
Load shedding	Drops excess requests when overloaded	Cascading failures from resource exhaustion
Rate limiting	Limits request rate per client	Abusive or buggy clients overwhelming the service
Bulkhead	Isolates resource pools per dependency	One slow dependency consuming all shared resources
Constant work	Does the same amount of work regardless of conditions	Bimodal behavior that amplifies failures

The structure of this lesson. We will cover each pattern in depth — what it does, how to implement it, what parameters to tune, and the pitfalls. The final chapter is a live simulation of a service chain with all patterns working together, where you can inject failures and toggle patterns on and off.

Quiz: Your service has 100 threads and calls a dependency that normally responds in 10ms. The dependency starts responding in 30 seconds. Assuming you receive 50 requests per second, how long until all threads are exhausted?

2 seconds. At 50 RPS, you consume 50 threads per second (each stuck for 30s). After 2 seconds, all 100 threads are consumed. No timeout means every thread waits the full 30 seconds.
30 seconds: it takes one full timeout cycle.
It depends on the thread pool implementation.

Chapter 1: Timeouts

A timeout is the simplest resiliency pattern: "If I haven't received a response in X milliseconds, give up and free the resources." Without a timeout, a slow dependency can hold your resources indefinitely. With a timeout, the damage is bounded.

Two Types of Timeouts

There are two distinct timeouts, and confusing them is a common source of bugs:

Timeout	What it measures	Typical value
Connection timeout	How long to wait for the TCP connection to be established (the three-way handshake)	1-5 seconds
Read timeout (request timeout)	How long to wait for the response after the connection is established and the request is sent	100ms - 30s (depends on operation)

A connection timeout that fires means the server is unreachable (down, network partition, wrong address). A read timeout that fires means the server is reachable but slow (overloaded, long query, GC pause).

Choosing Timeout Values

Too short: you declare healthy-but-slow calls as failures, increasing your error rate and triggering unnecessary retries. Too long: you hold resources for too long, risking thread pool exhaustion.

The best practice is to base timeouts on observed latency percentiles:

// Rule of thumb for read timeouts:
timeout = p99_latency × 2

// Example: your DB calls have p50=5ms, p99=200ms
read_timeout = 200ms × 2 = 400ms

// This means:
// - 99% of healthy requests complete before the timeout
// - Only 1% of healthy requests are slower than p99
// - The 2x multiplier gives headroom for normal variance
// - A request that takes >400ms is almost certainly a sign of trouble

Adaptive Timeouts

Fixed timeouts are fragile. If your dependency gets faster, your timeout is wastefully long. If it gets slower (legitimately), your timeout causes false failures. An adaptive timeout adjusts based on recent observations.

The simplest adaptive timeout tracks a rolling window of recent response times and sets the timeout to the p99 of that window plus a margin:

// Adaptive timeout (simplified)
window = last 100 response times
p99 = percentile(window, 0.99)
timeout = p99 × 1.5

// As the dependency speeds up, the timeout tightens.
// As the dependency slows (legitimately), the timeout relaxes.
// But: if the dependency is truly broken, the window fills with
// slow values and the timeout becomes very long — which is BAD.
// So you also need a hard cap: timeout = min(adaptive, hard_max)
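One possible shape for this in Python is sketched below. The class name, the default values, and the lower bound `min_ms` are assumptions made for illustration; only the rolling window, the 1.5x margin, and the hard cap come from the pseudocode above.

```python
import math
from collections import deque


class AdaptiveTimeout:
    """Rolling-window p99 timeout with a floor and a hard cap (all in ms)."""

    def __init__(self, window_size=100, margin=1.5,
                 initial_ms=400.0, min_ms=10.0, hard_max_ms=2000.0):
        self.samples = deque(maxlen=window_size)  # oldest samples fall out
        self.margin = margin
        self.initial_ms = initial_ms
        self.min_ms = min_ms
        self.hard_max_ms = hard_max_ms

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def current_ms(self):
        if not self.samples:
            return self.initial_ms  # no data yet: use a fixed default
        ordered = sorted(self.samples)
        # Nearest-rank percentile: the value at rank ceil(0.99 * n).
        rank = math.ceil(0.99 * len(ordered))
        p99 = ordered[rank - 1]
        # Clamp so a broken dependency cannot stretch the timeout forever.
        return max(self.min_ms, min(self.hard_max_ms, p99 * self.margin))


timeout = AdaptiveTimeout()
for latency in [5.0] * 98 + [200.0, 200.0]:
    timeout.record(latency)
print(timeout.current_ms())  # 300.0
```

Sorting the window on every call is acceptable for 100 samples; a larger window would call for a streaming quantile estimate instead.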

The simulation below lets you adjust the timeout and see how it affects false positives (healthy requests killed by timeout) versus resource waste (threads stuck waiting).

Timeout Tuning Simulator

Requests arrive with varying latencies. Adjust the timeout to balance false positives vs. resource waste.

Generate requests, then adjust the timeout to see the trade-off.

The golden rule of timeouts. Every network call must have a timeout. No exceptions. Many HTTP clients default to waiting indefinitely (Python's requests library, for example, applies no timeout unless you pass one), which is a recipe for thread pool exhaustion. Set explicit timeouts on every call, and make them aggressive. It is better to fail fast and retry than to wait and cascade.

Timeout Implementation Patterns

python
import requests

# BAD: no timeout (default = wait forever)
response = requests.get("http://payment-service/charge")

# GOOD: explicit connection and read timeouts
response = requests.get(
    "http://payment-service/charge",
    timeout=(1, 0.4)  # (connect_timeout, read_timeout) in seconds
)

# BETTER: handle the timeout explicitly
def charge_with_fallback():
    try:
        return requests.get(
            "http://payment-service/charge",
            timeout=(1, 0.4)
        )
    except requests.Timeout:
        # Request timed out. We do NOT know if it was processed.
        # Check idempotency key before retrying.
        # fallback_response() is a placeholder for application code.
        return fallback_response()

The Deadline Propagation Pattern

In a service chain (A calls B calls C calls D), each service sets its own timeout. But if A's timeout is 500ms and B spends 400ms before calling C, then C only has 100ms to complete. If C doesn't know this, it might start a 300ms operation that will be killed by A's timeout anyway — wasting resources.

Deadline propagation passes the remaining time budget from caller to callee. Each service knows exactly how much time it has before the upstream caller gives up.

// Deadline propagation example:
A sets deadline: now() + 500ms
A calls B at t=10ms: B receives deadline = 490ms remaining
B does local work: 100ms elapsed. 390ms remaining.
B calls C at t=110ms: C receives deadline = 390ms remaining
C starts DB query: if query estimate > 390ms, don't even start it

// Without deadline propagation:
// C starts a 2-second query, wastes resources for 390ms,
// then A's timeout kills the whole chain. C's work was wasted.
// gRPC has built-in deadline propagation. HTTP services need to
// pass it as a header (e.g., X-Request-Deadline).
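For plain HTTP services, the header-based approach might look like the sketch below. The `Deadline` class and its method names are hypothetical; only the `X-Request-Deadline` header name comes from the text. The sketch stores an absolute wall-clock deadline, which assumes the services' clocks are reasonably synchronized; sending the remaining budget instead avoids that assumption.

```python
import time

DEADLINE_HEADER = "X-Request-Deadline"  # header name taken from the text


class DeadlineExceeded(Exception):
    """Raised when work cannot finish before the caller gives up."""


class Deadline:
    """Absolute deadline, stored as Unix epoch seconds."""

    def __init__(self, expires_at):
        self.expires_at = expires_at

    @classmethod
    def after(cls, budget_s, now=None):
        now = time.time() if now is None else now
        return cls(now + budget_s)

    @classmethod
    def from_headers(cls, headers, default_budget_s=1.0, now=None):
        raw = headers.get(DEADLINE_HEADER)
        if raw is None:
            # No upstream deadline: start a fresh budget.
            return cls.after(default_budget_s, now)
        return cls(float(raw))

    def remaining(self, now=None):
        now = time.time() if now is None else now
        return self.expires_at - now

    def check(self, estimated_cost_s=0.0, now=None):
        """Refuse to start work that cannot finish in the time left."""
        if self.remaining(now) <= estimated_cost_s:
            raise DeadlineExceeded()

    def to_headers(self):
        # Attach this to every outgoing downstream request.
        return {DEADLINE_HEADER: f"{self.expires_at:.3f}"}
```

A service would call `Deadline.from_headers(request.headers)` on entry, `check(estimated_cost_s=...)` before expensive work, and merge `to_headers()` into each downstream call.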

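gRPC carries deadlines on the wire. When you set a deadline on a gRPC call, the remaining time is sent to the server along with the request, so the server can check how much time is left and short-circuit expensive operations. Whether that deadline carries over to the server's own outgoing calls depends on the language: in Go and Java it typically follows the request context when you pass that context along, while in other implementations (Python, for example) you forward the remaining time yourself. This is one of gRPC's most important features for distributed systems.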

Quiz: Your service calls a dependency with p50 latency of 10ms and p99 of 200ms. You set a read timeout of 50ms. What happens?

Perfect: 50ms is well above the median, so most requests succeed.
The service is too slow and needs to be optimized.
Excessive false positives. While p50 is 10ms, p99 is 200ms, so at least 1% of perfectly healthy requests (and possibly far more, depending on the shape of the latency distribution) take longer than 50ms. You will time out on these healthy requests, causing unnecessary errors and retries. The timeout should be based on p99 (200ms), not p50 (10ms).

Chapter 2: Retries

A request fails. Maybe the network dropped a packet. Maybe the server had a momentary GC pause. Maybe a load balancer routed you to a server that was just restarting. These are transient failures — they go away if you simply try again.

Retries are the natural response: if it fails, try again. But naive retries are one of the most dangerous things in a distributed system.

The Retry Storm Problem

Imagine a service handling 1000 RPS. It calls a dependency that starts failing 50% of requests. Without retries, 500 requests per second fail. With one retry, those 500 failed requests are retried, adding 500 more requests. The dependency now sees 1500 RPS instead of 1000. With two retries, it sees 1750 RPS if half of the retries succeed, and up to 2000 RPS if the extra load makes every retry fail.

// Retry amplification math:
Original traffic: 1000 RPS
Failure rate: 50%

// With up to 3 attempts per request (the original plus 2 retries):
Attempt 1: 1000 requests (500 fail)
Attempt 2: 500 requests (250 fail)
Attempt 3: 250 requests (125 fail)
Total: 1750 RPS hitting the dependency

// The dependency was already struggling at 1000 RPS.
// Now it's getting 1750 RPS. It fails harder. More retries.
// This is a positive feedback loop = cascading failure.
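The same arithmetic, written as a small Python function for experimenting with other failure rates and attempt counts. The function name is illustrative, and it assumes every failed attempt is retried immediately and that the failure rate stays constant across attempts.

```python
def total_load(base_rps, failure_rate, max_attempts):
    """Requests/sec reaching the dependency when each failure is retried,
    up to max_attempts tries per original request."""
    load = 0.0
    attempt_rps = float(base_rps)
    for _ in range(max_attempts):
        load += attempt_rps
        attempt_rps *= failure_rate  # only failed attempts are retried
    return load


print(total_load(1000, 0.5, 1))  # 1000.0 (no retries)
print(total_load(1000, 0.5, 3))  # 1750.0 (matches the figures above)
print(total_load(1000, 1.0, 3))  # 3000.0 (dependency fully down)
```

The last line is the dangerous case: the harder the dependency fails, the more traffic it receives.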

Exponential Backoff with Jitter

Exponential backoff means waiting longer between each retry: 100ms, 200ms, 400ms, 800ms... This gives the failing dependency time to recover instead of hammering it immediately.

Jitter means adding randomness to the backoff delay. Without jitter, if 1000 clients all see the same failure at the same time, they will all retry at t+100ms, then all retry again 200ms later. These synchronized retry waves are called thundering herds. Jitter breaks the synchronization.

// Exponential backoff with full jitter:
base_delay = 100ms
max_delay = 10000ms

delay(attempt) = random(0, min(max_delay, base_delay × 2^attempt))

// Example retry delays:
Attempt 1: random(0, 200ms) → e.g. 137ms
Attempt 2: random(0, 400ms) → e.g. 289ms
Attempt 3: random(0, 800ms) → e.g. 512ms
Attempt 4: random(0, 1600ms) → e.g. 1102ms
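A direct Python translation of the formula above, as a minimal sketch. The function name and the `rng` parameter (which lets callers pass a seeded generator for repeatable runs) are assumptions for illustration.

```python
import random


def backoff_delay_ms(attempt, base_ms=100.0, max_ms=10_000.0, rng=random):
    """Full jitter: uniform in [0, min(max_ms, base_ms * 2**attempt)]."""
    ceiling = min(max_ms, base_ms * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only so repeated runs print the same delays
for attempt in range(1, 5):
    delay = backoff_delay_ms(attempt, rng=rng)
    print(f"attempt {attempt}: wait {delay:.0f}ms")
```

Because the lower bound is zero, some retries fire almost immediately. That is intentional with full jitter: what matters is that the clients as a group are spread across the whole interval.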

Retry Budgets

Even with backoff and jitter, retries increase load. A retry budget caps the total number of retries across all requests, usually as a percentage of successful traffic.

// Retry budget: max retries = 10% of successful requests

// Normal operation: 1000 RPS, 1% failure rate
Successful: 990/sec, Failed: 10/sec
Retry budget: 990 × 0.10 = 99 retries/sec available
Retries needed: 10/sec < budget → all retries allowed

// Degraded operation: 1000 RPS, 50% failure rate
Successful: 500/sec, Failed: 500/sec
Retry budget: 500 × 0.10 = 50 retries/sec available
Retries needed: 500/sec >> budget → only 50 retries allowed
// Load increase limited to 5% instead of 50%!
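A retry budget can be sketched as a pair of counters that reset every window. The class below is a simplified illustration, not a production design: the names are hypothetical, the window is a fixed one that the caller resets (for example, once per second), and the budget is expressed as an integer percentage to keep the comparison exact.

```python
class RetryBudget:
    """Allows retries up to `percent`% of the successes seen in this window."""

    def __init__(self, percent=10):
        self.percent = percent
        self.successes = 0
        self.retries = 0

    def record_success(self):
        self.successes += 1

    def try_acquire(self):
        """Return True if one more retry fits in the budget."""
        if (self.retries + 1) * 100 <= self.successes * self.percent:
            self.retries += 1
            return True
        return False

    def reset(self):
        """Call at the start of each window."""
        self.successes = 0
        self.retries = 0


# Degraded second from the example above: 500 successes, 500 failures.
budget = RetryBudget(percent=10)
for _ in range(500):
    budget.record_success()
granted = sum(budget.try_acquire() for _ in range(500))
print(granted)  # 50
```

Callers that are denied a retry simply return the failure upstream, which is what keeps the extra load capped.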

Retries require idempotency. If you retry a "deduct $50" request and both the original and the retry succeed, you deduct $100. Every operation that can be retried MUST be idempotent — producing the same result whether executed once or multiple times. Common strategies: unique request IDs (idempotency keys), conditional writes (only write if version matches), deduplication at the server.

Idempotency Implementation

python
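# Idempotency key pattern:
# Client generates a unique key for each logical operation.
# Server stores the key and result. On retry, returns cached result.
import json

import redis  # redis-py client library

store = redis.Redis()

def charge_payment(request):
    key = f"idempotency:{request.headers['Idempotency-Key']}"

    # Atomically claim the key. SET NX succeeds only for the first caller,
    # so two concurrent retries cannot both reach the payment gateway.
    placeholder = json.dumps({"status": "in_progress"})
    claimed = store.set(key, placeholder, nx=True, ex=86400)
    if not claimed:
        # Already seen: return the stored result (or "in_progress")
        return json.loads(store.get(key))

    # Process for the first time.
    # payment_gateway is assumed to be provided by the application.
    result = payment_gateway.charge(request.amount)

    # Store the final result with a TTL (e.g., 24 hours)
    store.set(key, json.dumps(result), ex=86400)
    return result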

The Retry Amplification Problem in Service Chains

Retries at each layer of a service chain multiply. If A makes up to 3 attempts per call to B (the original plus 2 retries), B does the same when calling C, and C does the same when calling D, the total number of requests hitting D can be up to 3 × 3 × 3 = 27 times the original request volume.

// Retry amplification in a 4-service chain:
// Each service makes up to 3 attempts per call

A → B: 1 request × 3 attempts = 3 requests to B
B → C: 3 requests × 3 attempts = 9 requests to C
C → D: 9 requests × 3 attempts = 27 requests to D

// D sees 27x amplification from A's single request!
// Solution: only retry at the outermost layer, or use retry budgets
// at each layer to cap total amplification.

Best practice: bound retries across the whole chain. Either retry at one layer only (usually the edge, the outermost caller), or retry at each service under a strict budget (e.g., 10% of successful traffic). Never allow unbounded retries at every layer, because the multiplicative amplification will destroy your downstream services.

The simulation shows how retries with and without backoff/jitter affect load on a struggling dependency.

Retry Strategy Comparison

A dependency starts failing. Compare immediate retries vs. exponential backoff with jitter. Y-axis = total RPS hitting the dependency.

Choose a retry strategy to see how it affects load on the dependency.

Quiz: 500 clients simultaneously discover a server is down and all set their first retry delay to 200ms. What happens at t+200ms?

The retries succeed because the server has had 200ms to recover.
All 500 clients retry simultaneously, creating a thundering herd that overwhelms the recovering server. This is why jitter is essential: if each client waits random(0, 200ms) instead of exactly 200ms, the retries are spread over the full 200ms window instead of all hitting at the same instant.
The load balancer will spread the retries evenly.

Chapter 3: Circuit Breakers

Retries help with transient failures. But what if the dependency is not transiently failing — what if it is completely down, and it is going to be down for the next 10 minutes? Every retry during those 10 minutes is wasted: it consumes resources, adds load to the already-dead dependency, and always fails.

A circuit breaker detects when a dependency is consistently failing and stops sending requests to it entirely for a cooldown period. Instead of timing out and retrying, the circuit breaker returns an immediate failure (or a cached/fallback response). This gives the dependency breathing room to recover.

The Three States

A circuit breaker is a state machine with three states:

CLOSED (normal)

All requests pass through to the dependency. The circuit breaker counts failures. When the failure count exceeds a threshold (e.g., 5 failures in 10 seconds), the circuit opens.

↓ failure threshold exceeded

OPEN (tripped)

All requests are immediately rejected without calling the dependency. A timer starts. After the cooldown period (e.g., 30 seconds), the circuit transitions to half-open.

↓ cooldown expires

HALF-OPEN (probing)

A small number of requests (e.g., 1) are allowed through as a test. If they succeed, the circuit closes. If they fail, the circuit opens again.

↑ success → CLOSED | failure → OPEN

The Parameters

Parameter	Typical Value	Too Low	Too High
Failure threshold	5-10 failures in 10s	Trips on normal noise	Takes too long to detect real outages
Cooldown period	10-60 seconds	Hammers recovering service	Stays open too long, slow recovery
Half-open probe count	1-5 requests	One fluke success closes circuit prematurely	Sends too much probe traffic

Circuit breakers + retries = defense in depth. Retries handle transient failures (one-off network blips). When failures are persistent, the circuit breaker trips and stops retries entirely. When the dependency recovers, the half-open probe detects it and traffic resumes. They work together: retries handle blips, circuit breakers handle outages.

Circuit Breaker Implementation

python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

# Simplified: counts failures with a decaying counter instead of a time
# window, does not cap concurrent probes, and is not thread-safe.
class CircuitBreaker:
    def __init__(self, failure_threshold=5,
                 cooldown_sec=30, probe_count=3):
        self.state = "closed"
        self.failures = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown_sec
        self.opened_at = 0.0
        self.probe_count = probe_count
        self.probe_successes = 0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.opened_at > self.cooldown:
                self.state = "half-open"  # Try probing
            else:
                raise CircuitOpenError()  # Fast fail

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half-open":
            self.probe_successes += 1
            if self.probe_successes >= self.probe_count:
                self.state = "closed"  # Recovered!
                self.failures = 0
        else:
            # Each success forgives one earlier failure
            self.failures = max(0, self.failures - 1)

    def _on_failure(self):
        self.failures += 1
        if self.state == "half-open" or \
           self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
            self.probe_successes = 0

What to Return When the Circuit Is Open

When the circuit breaker is open and you fast-fail requests, what do you return to the caller? There are three options:

Strategy	When to use	Example
Error	When the operation is essential and there is no substitute	Payment processing — return "payment temporarily unavailable"
Cached response	When stale data is acceptable	Product recommendations — return yesterday's recommendations
Degraded response	When a simpler version of the response exists	Search results — return popular items instead of personalized results

The simulation below shows a circuit breaker in action. The dependency alternates between healthy and failing. Watch the circuit breaker transition between states.

Circuit Breaker State Machine

Watch the circuit breaker react to dependency health. Green = closed, Red = open, Yellow = half-open.

Start traffic, then fail/recover the dependency to see circuit breaker transitions.

Quiz: A circuit breaker is in the OPEN state (the dependency is down). The cooldown expires and one probe request is sent. The probe succeeds. Should the circuit close immediately?

Yes: the probe succeeded, so the dependency is recovered.
No: wait for more probe successes to be safe.
It depends. A single success could be a fluke. Best practice: send 3-5 probe requests in half-open state. If all succeed, close the circuit. If any fail, re-open. But simple implementations do close on a single success; the trade-off is faster recovery vs. risk of premature closure.

Chapter 4: Load Shedding

Your server can handle 1000 requests per second. Right now, 2000 requests per second are arriving. What do you do?

Option A: try to process all 2000. Your response time triples. All 2000 clients experience slow responses. Many time out. They retry. Now you are at 3000 RPS. Total collapse.

Option B: shed the excess load. Accept 1000 requests, immediately reject the other 1000 with a "503 Service Unavailable" response. The 1000 accepted requests complete quickly at normal latency. The 1000 rejected clients can retry (preferably to a different server) or back off.

This is load shedding: deliberately dropping requests when overloaded to protect the requests you are still serving. It is counterintuitive — deliberately failing requests to prevent failing all requests. But it works.

LIFO Queue Ordering

Most request queues are FIFO (first in, first out). Under load, a FIFO queue means that by the time you process a request, it has been waiting so long that the client has already timed out and given up. You waste resources processing a request whose response will be ignored.

LIFO (last in, first out) processes the newest requests first. These are the ones whose clients are still waiting. Old requests at the back of the queue have probably already timed out — drop them.

CoDel: Controlled Delay. An even smarter approach: instead of LIFO, track how long each request has been in the queue. If a request has been waiting longer than a target delay (e.g., 5ms), drop it. This is the core idea of the CoDel algorithm, originally designed for network routers and now also used in service queues. (The full algorithm only starts dropping once queue delay has stayed above the target for a whole interval, typically 100ms, so short bursts are tolerated.) It keeps queue wait time bounded regardless of load.
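The two ideas in this section (serve the newest request first, discard requests that have waited too long) can be combined in a small queue. The sketch below is a simplified illustration and not the full CoDel algorithm: it applies a fixed age limit on every dequeue. The class name and the injectable `clock` parameter are assumptions made so the behavior can be tested without sleeping.

```python
import time
from collections import deque


class FreshFirstQueue:
    """Serves the newest request first and discards stale ones."""

    def __init__(self, max_wait_s=0.005, clock=time.monotonic):
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.items = deque()  # (enqueued_at, request), oldest on the left
        self.dropped = 0

    def put(self, request):
        self.items.append((self.clock(), request))

    def get(self):
        """Return the freshest request, or None if the queue is empty."""
        now = self.clock()
        # Oldest entries sit on the left; drop them while they are stale.
        while self.items and now - self.items[0][0] > self.max_wait_s:
            self.items.popleft()
            self.dropped += 1
        if not self.items:
            return None
        return self.items.pop()[1]  # LIFO: newest request first
```

A worker loop would call `get()` repeatedly and could export `dropped` as a metric, since a rising drop count is a direct signal of overload.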

Priority-Based Load Shedding

Not all requests are equal. A health check from a load balancer is more important than a background analytics event. A checkout request from a paying customer is more important than a product listing refresh. Priority-based load shedding drops low-priority requests first, preserving capacity for high-priority ones.

// Priority tiers for load shedding:

Priority 0 (Critical): Health checks, leader elections, heartbeats
Never shed. If these fail, the system becomes unmanageable.

Priority 1 (High): User-facing requests (checkout, login, search)
Shed only under extreme overload.

Priority 2 (Normal): Background API calls, feed refreshes
Shed when load exceeds 80% capacity.

Priority 3 (Low): Analytics, logging, batch processing
Shed first. These can be retried later.

Load Shedding Implementation Pattern

python
import time

class AdaptiveLoadShedder:
    def __init__(self, max_queue_time_ms=50):
        self.max_queue_time = max_queue_time_ms
        self.in_flight = 0
        self.max_in_flight = 100  # Concurrency limit

    def should_accept(self, request):
        # Check queue age (CoDel-style); enqueued_at is in seconds
        queue_time_ms = (time.time() - request.enqueued_at) * 1000
        if queue_time_ms > self.max_queue_time:
            return False  # Too stale, client probably timed out

        # Check concurrency limit
        if self.in_flight >= self.max_in_flight:
            # Shed based on priority
            if request.priority > 1:  # Normal and low priority (tiers 2-3)
                return False
            if request.priority > 0 and \
               self.in_flight > self.max_in_flight * 1.1:
                return False  # Even high priority (tier 1)

        return True

Where to Shed Load

Layer	Mechanism	Advantage
Load balancer	Reject when all backends are at capacity	Cheapest rejection — no work done at all
Application server	Reject when thread pool / queue is full	Application-aware (can prioritize by endpoint)
Request queue	Drop old/excess items from queue	Fine-grained control over queue depth

The simulation shows a server under increasing load. Toggle load shedding on and off to see how it maintains quality of service for accepted requests.

Load Shedding in Action

Incoming traffic ramps up. Without shedding, latency explodes. With shedding, accepted requests stay fast.

Ramp up traffic, then toggle load shedding to compare behavior.

Quiz: Your server uses a FIFO queue and is overloaded. A request enters the queue at t=0. The server processes it at t=5s. The client's timeout is 3s. What happens?

The server returns the response at t=5s and the client receives it.
The server processes the request and sends a response, but the client has already timed out at t=3s and moved on (possibly retried). The server's work was completely wasted. This is why LIFO or CoDel is better under load: they prioritize fresh requests whose clients are still waiting.
The TCP connection keeps the client waiting until t=5s.

Chapter 5: Rate Limiting

Load shedding protects a server from total overload. Rate limiting is more targeted: it limits how many requests each individual client can make, preventing one client from monopolizing the server's capacity.

Without rate limiting, a buggy client that sends 10,000 requests per second (instead of its normal 10) can consume all of your capacity, starving every other client. Rate limiting caps that client at, say, 100 RPS, ensuring capacity remains for everyone else.

Token Bucket

The token bucket is the most common rate limiting algorithm. Picture a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected.

// Token bucket parameters:
rate = 10 tokens/second (steady-state rate limit)
burst = 20 tokens (maximum burst size)

// The bucket starts full (20 tokens).
// A client can burst 20 requests instantly.
// After the burst, they can only send 10 requests/sec.
// If they stop sending, the bucket refills over 2 seconds.

tokens = min(burst, tokens + rate × time_elapsed)
// On each request:
if tokens ≥ 1: tokens -= 1; allow request
else: reject with 429 Too Many Requests
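The pseudocode above maps almost line for line onto a small Python class. This is a single-threaded sketch under stated assumptions: the class name, the `cost` parameter, and the injectable `clock` are illustrative, and a real service would keep one bucket per client and guard it with a lock.

```python
import time


class TokenBucket:
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # bucket capacity
        self.clock = clock
        self.tokens = float(burst)  # the bucket starts full
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill lazily on each request, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + self.rate * elapsed)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would answer 429 Too Many Requests
```

Refilling lazily inside `allow()` means no background timer is needed; the two stored numbers (`tokens`, `last_refill`) are the entire state.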

Sliding Window

The sliding window counts requests in a rolling time window. Simpler than token bucket but no burst control.

// Sliding window: max 100 requests per 60-second window
// At any moment, count requests in the last 60 seconds.
// If count >= 100, reject.

// Implementation: keep timestamps of recent requests.
// Prune entries older than window_size on each check.
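The timestamp-log implementation described in those comments might look like this in Python. As with the other sketches, the class name and the injectable `clock` are assumptions, and a real deployment would hold one limiter per client.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit=100, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.timestamps = deque()  # accepted requests, oldest first

    def allow(self):
        now = self.clock()
        # Prune entries that have left the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```

Memory grows with the limit, because every accepted request keeps a timestamp until it ages out of the window.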

Leaky Bucket

The leaky bucket smooths traffic to a constant rate. Requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are dropped.

Algorithm	Burst	Smoothing	Memory	Best for
Token bucket	Allows bursts up to bucket size	None — bursts pass through	2 numbers (tokens, last_refill)	APIs where bursts are acceptable
Leaky bucket	Absorbs bursts into queue	Output is perfectly smooth	Queue + drain rate	Systems requiring uniform rate
Sliding window	No burst control	None	Timestamps of recent requests	Simple per-client limits

The simulation below lets you compare token bucket and sliding window rate limiting under bursty traffic.

Rate Limiting: Token Bucket vs. Sliding Window

Bursty client sends requests. Green = allowed, red = rate-limited. Compare the two algorithms.

Send a burst of requests and see which are allowed by each algorithm.

Quiz: A token bucket has rate=10/sec and burst=50. A client has been idle for 10 seconds. How many requests can they send instantly?

100: 10 tokens/sec × 10 seconds = 100 tokens accumulated.
50. The bucket can hold a maximum of 50 tokens (the burst limit). Even though 100 tokens were generated during the 10-second idle period, the bucket overflows at 50. The burst parameter caps accumulated tokens, preventing a long-idle client from causing a huge burst.
10: only the tokens from the last second count.

Chapter 6: Bulkheads

A bulkhead is a wall in a ship's hull that divides the hull into separate watertight compartments. If the hull is breached, water floods one compartment but the bulkheads prevent it from flooding the entire ship. The ship stays afloat.

The same principle applies to software. Your service calls three dependencies: a database, a cache, and an external API. All three share the same thread pool of 100 threads. The external API starts responding slowly. All 100 threads get stuck waiting for the API. Now your database calls and cache calls also fail — not because the database or cache is down, but because there are no threads left to call them.

A bulkhead pattern gives each dependency its own isolated pool of resources (threads, connections, memory). If the external API pool (30 threads) is exhausted, the database pool (50 threads) and cache pool (20 threads) are unaffected.

The Titanic Analogy

The Titanic had bulkheads dividing the hull into 16 watertight compartments. It was designed to stay afloat with any 2 compartments flooded, or with the first 4 flooded. But the iceberg opened more compartments than that. Water flooded one compartment, overflowed the top of the bulkhead wall into the next, and so on. The bulkheads were not tall enough.

Software bulkheads fail the same way when they are not configured correctly. If your "database thread pool" bulkhead allows 90 out of 100 threads, it is not really a bulkhead — one slow dependency can still consume 90% of your capacity. Effective bulkheads must leave meaningful capacity for other dependencies.

// Bulkhead sizing rule of thumb:

Total thread pool: 100 threads
Dependencies: DB, Cache, External API

// BAD: DB=80, Cache=10, API=10
// DB can still consume 80% of all capacity!

// GOOD: DB=40, Cache=20, API=15, Reserve=25
// Even if DB is totally down, 60 threads remain for Cache + API + Reserve.
// The "reserve" pool handles requests that don't call any dependency.

// BETTER: Each pool sized to max_rps × timeout
DB: 200 RPS × 0.1s timeout = 20 concurrent slots
Cache: 500 RPS × 0.01s timeout = 5 concurrent slots
API: 50 RPS × 0.5s timeout = 25 concurrent slots
Reserve: 50 slots (for non-dependency requests)
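The "max_rps × timeout" rule is Little's law applied to the worst case, where every call runs until it times out. A small helper makes the sizing reproducible; the function name is illustrative, and the rounding step is only there to keep floating-point noise from bumping a result up by one slot.

```python
import math


def pool_size(max_rps, timeout_s):
    """Worst-case concurrent calls if every call runs until its timeout."""
    return math.ceil(round(max_rps * timeout_s, 9))


pools = {
    "db": pool_size(200, 0.1),
    "cache": pool_size(500, 0.01),
    "api": pool_size(50, 0.5),
}
print(pools)  # {'db': 20, 'cache': 5, 'api': 25}
```

Tightening a dependency's timeout shrinks the pool it needs, which is one more reason to keep timeouts aggressive.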

Types of Bulkheads

Bulkhead Type	What it isolates	Example
Thread pool	Threads per dependency	30 threads for DB, 20 for cache, 10 for API
Connection pool	Connections per downstream	Max 50 DB connections, max 20 API connections
Semaphore	Concurrency per operation	Max 10 concurrent calls to slow endpoint
Process/container	Entire runtime per workload	Separate container for batch jobs vs. serving

Bulkheads prevent "bad neighbor" failures within a single service. Without bulkheads, one slow dependency can consume all shared resources and take down the entire service. With bulkheads, the damage from one slow dependency is contained to its allocated pool.
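The lightest-weight bulkhead is a semaphore per dependency. The sketch below uses Python's `threading.BoundedSemaphore`; the `Bulkhead` and `BulkheadFullError` names are assumptions for illustration. It rejects immediately when the pool is full instead of queueing, so a slow dependency cannot pile up waiting callers.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's pool has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: fail fast instead of waiting behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.name)
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, sized independently.
db_bulkhead = Bulkhead("db", max_concurrent=50)
api_bulkhead = Bulkhead("api", max_concurrent=30)
```

When `api_bulkhead` is exhausted, calls through `db_bulkhead` are unaffected, which is the isolation property the pattern exists to provide.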

The simulation shows a service with shared resources vs. bulkheaded resources. One dependency slows down — compare how it affects the other dependencies.

Bulkhead Pattern

Left: shared thread pool. Right: bulkheaded pools. Slow down API to see the difference.

Slow down the API dependency to see how bulkheads protect other dependencies.

Quiz: Your service has a shared thread pool of 100 threads and calls 3 dependencies. Dependency C starts timing out (30s per request). 40 requests per second go to dependency C. How long until the shared pool is exhausted?

About 2.5 seconds. At 40 RPS with 30s timeouts, each request holds a thread for 30 seconds. After 2.5 seconds, 40 × 2.5 = 100 threads are consumed. All threads are stuck waiting for C. Dependencies A and B are now starved. With bulkheads limiting C to 30 threads, only C's pool would be exhausted, and A and B keep working.
30 seconds: the length of one timeout.
It won't be exhausted: the OS creates more threads as needed.

Chapter 7: Constant Work

Most systems do more work when things go wrong. A health checker pings every server every 10 seconds. When a server goes down, the checker starts pinging it more frequently (every 1 second) to detect when it comes back. Meanwhile, the load balancer redistributes traffic, generating more routing table updates. The monitoring system fires alerts, which trigger escalation chains, which generate more monitoring queries.

This is bimodal behavior: the system does X work during normal operation and 10X work during an incident. The problem? Incidents are exactly when your infrastructure is most stressed. Doing 10X work during a crisis is a recipe for cascading failures.

The Constant Work Pattern

The constant work pattern (sometimes called "constant-rate processing") ensures the system does the same amount of work regardless of conditions. During normal operation, some of that work is "wasted." During an incident, no extra work is needed.

// Bimodal health check (BAD):
Normal: check every 10 seconds
Failure detected: check every 1 second
// 10x spike in health check traffic during outage

// Constant work health check (GOOD):
Always: check every 2 seconds
// Same traffic during outage as during normal operation
// More work during normal (2s vs 10s) but no spike during failure
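
To make the trade-off concrete, here is a small sketch that computes health check traffic for both approaches. The fleet size of 1000 servers and the function names are assumptions for illustration; the intervals come from the comparison above.

```python
def bimodal_probes_per_sec(total: int, failed: int,
                           normal_interval: float = 10.0,
                           failure_interval: float = 1.0) -> float:
    """Healthy servers are probed slowly, failed servers are probed quickly."""
    healthy = total - failed
    return healthy / normal_interval + failed / failure_interval


def constant_probes_per_sec(total: int, interval: float = 2.0) -> float:
    """Every server is probed at the same rate, regardless of health."""
    return total / interval


for failed in (0, 100, 1000):
    print(failed,
          bimodal_probes_per_sec(1000, failed),
          constant_probes_per_sec(1000))
# 0 100.0 500.0
# 100 190.0 500.0
# 1000 1000.0 500.0
```

The bimodal checker is cheaper when everything is healthy, but its load grows with the size of the outage. The constant checker costs the same 500 probes per second in every case, so that is the only number you need to provision for.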

Examples of Constant Work

Scenario	Bimodal (bad)	Constant Work (good)
Config distribution	Push updates when config changes	Continuously push entire config every N seconds, whether or not it changed
DNS resolution	Resolve on cache miss	Periodically re-resolve all entries on a fixed schedule
Health checks	Increase frequency on failure	Constant frequency regardless of health
Membership	Broadcast when a node joins/leaves	Periodically broadcast full membership list

The key insight. Constant work trades efficiency during normal operation for predictability during incidents. You always know exactly how much work the system is doing, so you can provision for it. There are no surprises. No bimodal spikes. No "the monitoring system crashed because of the alert storm it generated about another crash."

Constant Work Implementation: Config Distribution

Let's compare bimodal and constant-work config distribution implementations:

python
import time

# all_servers, config_store, and apply_config are assumed to be provided
# by the surrounding application.

# BIMODAL (bad): push on change
def on_config_change(new_config):
    # Push to ALL servers immediately
    for server in all_servers:
        server.update_config(new_config)
    # Problem: 50 rapid changes = 50 broadcasts × N servers = 50N pushes.
    # During incidents, engineers make MANY rapid changes.

# CONSTANT WORK (good): poll on schedule
def config_sync_loop():
    # Same traffic during normal operation and during an incident.
    # 50 rapid changes? Only the latest state at each tick is applied.
    # Max staleness: about 10 seconds.
    while True:
        config = config_store.get_latest()
        apply_config(config)  # Idempotent apply
        time.sleep(10)  # Every 10 seconds, always
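
The coalescing behavior can be demonstrated without real servers. In this sketch, `ConfigStore` and `Server` are hypothetical in-memory stand-ins, and `sync_once` represents a single tick of the fixed-interval loop.

```python
class ConfigStore:
    """Hypothetical in-memory store; only the latest value is kept."""

    def __init__(self, initial):
        self._latest = initial

    def set(self, config):
        self._latest = config

    def get_latest(self):
        return self._latest


class Server:
    def __init__(self):
        self.config = None
        self.applies = 0

    def sync_once(self, store):
        # One tick of the fixed-interval loop: fetch and apply, changed or not.
        self.config = store.get_latest()
        self.applies += 1


store = ConfigStore({"version": 0})
server = Server()
server.sync_once(store)               # tick 1
for v in range(1, 51):                # 50 rapid changes between ticks
    store.set({"version": v})
server.sync_once(store)               # tick 2
print(server.config, server.applies)  # {'version': 50} 2
```

Fifty changes cost the server exactly one extra apply, the same as a tick with no changes at all.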

Constant Work: DNS Resolution

A classic bimodal DNS problem: your service resolves DNS names when establishing connections. During normal operation, DNS responses are cached. During a DNS outage, the cache expires and every new connection attempts a DNS resolution — multiplying DNS traffic exactly when the DNS server is struggling.

// Bimodal DNS resolution:
Normal: ~0 DNS queries (all cached)
DNS outage: 1000 queries/sec (all caches expired simultaneously)
// Traffic jumps from near zero to 1000 queries/sec exactly when DNS is struggling!

// Constant work DNS resolution:
Background thread re-resolves every endpoint every 30 seconds.
Normal: 100 endpoints / 30s = 3.3 queries/sec
DNS outage: 100 endpoints / 30s = 3.3 queries/sec (same!)
// Plus: keep serving stale cache entries during the outage (serve stale).
// Traffic to DNS server is identical regardless of its health.
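
A minimal sketch of this idea in Python follows. The class name `ConstantWorkResolver` and its methods are hypothetical; the default resolver uses the standard library's `socket.getaddrinfo`, and the resolver is injectable so the behavior can be exercised without real DNS.

```python
import socket
import threading
import time
from typing import Callable, Dict, Iterable, List


def system_resolve(host: str) -> List[str]:
    """Resolve via the OS resolver; raises OSError on failure."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class ConstantWorkResolver:
    def __init__(self, hosts: Iterable[str],
                 resolve: Callable[[str], List[str]] = system_resolve,
                 interval: float = 30.0):
        self._hosts = list(hosts)
        self._resolve = resolve
        self._interval = interval
        self._cache: Dict[str, List[str]] = {}
        self._lock = threading.Lock()

    def refresh_once(self) -> None:
        """One pass: exactly one query per host, whether DNS is healthy or not."""
        for host in self._hosts:
            try:
                addrs = self._resolve(host)
            except OSError:
                continue  # serve stale: keep the previous answer
            with self._lock:
                self._cache[host] = addrs

    def lookup(self, host: str) -> List[str]:
        """Never queries DNS; returns the last known answer (possibly stale)."""
        with self._lock:
            return list(self._cache.get(host, []))

    def run_forever(self) -> None:
        """Fixed-interval refresh loop, intended for a background thread."""
        while True:
            self.refresh_once()
            time.sleep(self._interval)
```

In a service, the loop would typically run as `threading.Thread(target=resolver.run_forever, daemon=True).start()`, while connection code calls only `lookup`. Because `lookup` never touches the network, request traffic cannot amplify load on the DNS server.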

The simulation shows the health check traffic pattern under normal and failure conditions, comparing bimodal and constant work approaches.

Bimodal vs. Constant Work

Health check traffic over time. Inject a server failure and compare the traffic patterns.

Start monitoring, then inject a failure to see how traffic spikes differ.

Quiz: Your config distribution system pushes config updates only when they change. During an outage, an engineer makes 50 rapid config changes trying to fix the issue. What happens?

Answer: 50 config pushes in rapid succession, each broadcast to all servers. This spike in config traffic happens at the worst possible time, when infrastructure is already stressed. A constant-work approach would continuously push the full config every N seconds, and the 50 changes would be batched into the next push cycle. Same traffic as normal, even during the incident.

Other options:
The config system queues the changes efficiently.
Only the final config state is pushed since the system deduplicates.

Chapter 8: The Big Sim

This is the payoff chapter. We have learned seven resiliency patterns: timeouts, retries, circuit breakers, load shedding, rate limiting, bulkheads, and constant work. Now we see them work together in a realistic service chain.

How to use the simulator. The system has 3 services in a chain: Frontend → Backend → Database. Traffic enters from the left. Toggle each resiliency pattern on/off with the buttons below the canvas. Then inject failures at any point in the chain. Watch how the patterns interact to contain (or fail to contain) the damage.

Service Chain with Resiliency Patterns

3 services in a chain. Toggle patterns on/off, inject failures, watch real-time metrics.

Toggle patterns on, then inject failures. Watch the metrics change in real time.

Scenarios to Try

Scenario	Expected Behavior
Slow DB, no patterns	Threads fill up on backend, frontend times out, total outage
Slow DB + timeouts	Backend recovers threads after timeout, but errors increase
Slow DB + timeouts + circuit breaker	Circuit trips after threshold, instant fail-fast, DB gets breathing room
Traffic spike + load shedding	Excess requests shed, accepted requests stay fast
Kill backend + retries	Retries succeed on remaining backends (if multiple exist)
All patterns on + any failure	Graceful degradation instead of cascading failure

How the Patterns Interact

The resiliency patterns are not independent — they interact in specific, important ways. Understanding these interactions is essential for configuring them correctly.

Interaction	How it works	Configuration tip
Timeout + Retry	A timeout triggers a retry. The worst-case total time is timeout × (1 + max_retries), plus any backoff delays.	Worst-case total time must be less than the upstream caller's timeout
Retry + Circuit Breaker	Retries feed failure count to the circuit breaker. Too many retried failures trip the circuit.	Circuit breaker threshold should account for retry-induced failures
Circuit Breaker + Load Shedding	When circuit opens, no traffic reaches the dependency. This is a form of load shedding at the caller level.	The dependency gets breathing room to recover
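Bulkhead + Timeout	Bulkhead limits concurrent calls. If all slots are busy (for example, held by calls waiting out a timeout), new requests are rejected immediately.	Upper bound on bulkhead size = normal concurrency × timeout / avg_latency (that is, RPS × timeout); size below it to shed load under degradation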
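
The Timeout + Retry row is the one most often misconfigured, so a worked calculation may help. This is a sketch with assumed values (1-second timeout, 3 retries, 100ms base backoff, 5-second upstream timeout); it treats jitter as only ever shortening a backoff delay, so the result is an upper bound.

```python
def worst_case_seconds(timeout: float, max_retries: int,
                       base_backoff: float, max_backoff: float) -> float:
    """Upper bound on total time spent across the first attempt and all retries."""
    attempts = max_retries + 1  # the original call plus each retry
    # Exponential backoff before each retry, capped at max_backoff.
    backoff = sum(min(max_backoff, base_backoff * 2 ** i)
                  for i in range(max_retries))
    return attempts * timeout + backoff


total = worst_case_seconds(timeout=1.0, max_retries=3,
                           base_backoff=0.1, max_backoff=1.0)
upstream_timeout = 5.0
fits = total < upstream_timeout
print(round(total, 2), fits)  # 4.7 True
```

If `fits` is false, the upstream caller gives up while this service is still retrying, so the retries consume resources for a response nobody is waiting for.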

// Sizing a bulkhead:
// How many concurrent requests can be in-flight to a dependency?

avg_latency = 50ms
max_rps = 200  // peak request rate to this dependency
concurrent = max_rps × avg_latency = 200 × 0.05 = 10 slots  // Little's law

// Under degradation (latency = 500ms):
concurrent = 200 × 0.5 = 100 slots

// Set bulkhead size between these values (e.g., 30).
// At 30 slots and 500ms latency, you can handle 60 RPS.
// The other 140 RPS are immediately rejected (shed).
// This prevents the degraded dependency from consuming all resources.
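
The same sizing arithmetic as a runnable sketch. The function names are illustrative, and the calculation assumes steady arrivals so that Little's law (concurrency = arrival rate × latency) applies.

```python
def slots_needed(rps: float, latency_s: float) -> float:
    """Concurrent in-flight requests at a given rate and latency (Little's law)."""
    return rps * latency_s


def sustainable_rps(slots: int, latency_s: float) -> float:
    """Throughput a fixed number of slots can sustain at a given latency."""
    return slots / latency_s


normal = slots_needed(200, 0.05)         # about 10 slots at 50ms
degraded = slots_needed(200, 0.5)        # 100 slots at 500ms
bulkhead = 30                            # chosen between the two
served = sustainable_rps(bulkhead, 0.5)  # 60.0 RPS still served when degraded
shed = 200 - served                      # 140.0 RPS rejected immediately
print(round(normal), round(degraded), served, shed)
```

Choosing a size closer to `normal` sheds more aggressively and protects the caller; choosing a size closer to `degraded` tolerates more slowness at the cost of tying up more resources.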

The Complete Resiliency Stack in Production

Here is how a well-configured service processes a request that requires an outbound dependency call, with all patterns active:

1. Rate Limit Check

Is this client within their rate limit? If not, return 429 immediately.

↓ allowed

2. Load Shedding Check

Is the server overloaded? If so, shed low-priority requests with 503.

↓ accepted

3. Bulkhead Acquire

Get a slot in the dependency-specific bulkhead. If full, return 503.

↓ slot acquired

4. Circuit Breaker Check

Is the circuit open? If so, fast-fail or return cached response.

↓ circuit closed

5. Call with Timeout

Make the request with an explicit timeout. If timeout fires, record failure.

↓ timeout or error

6. Retry with Backoff

If retry budget allows and operation is idempotent, retry with exponential backoff + jitter.

↓ final result

7. Release Bulkhead

Return the slot to the bulkhead pool.
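
The seven steps above can be sketched as a single function. Everything here is hypothetical scaffolding: `Rejected`, `CircuitBreaker`, and `call_dependency` are illustrative names, the `within_rate_limit` and `overloaded` flags stand in for a real rate limiter and load shedder, and the retry budget is simplified to a fixed `max_retries`. The breaker is re-checked before each attempt, which is one possible design choice.

```python
import random
import threading
import time


class Rejected(Exception):
    """Raised when a request is refused before reaching the dependency."""

    def __init__(self, status, reason):
        super().__init__(f"{status} {reason}")
        self.status = status


class CircuitBreaker:
    """Tiny breaker: opens after `threshold` consecutive failures, then
    allows a probe once `cooldown` seconds have passed."""

    def __init__(self, threshold=5, cooldown=10.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()


def call_dependency(call, *, within_rate_limit, overloaded, bulkhead, breaker,
                    timeout=1.0, max_retries=2, base_backoff=0.05,
                    idempotent=True):
    if not within_rate_limit:                     # 1. rate limit check
        raise Rejected(429, "rate limited")
    if overloaded:                                # 2. load shedding check
        raise Rejected(503, "shedding load")
    if not bulkhead.acquire(blocking=False):      # 3. bulkhead acquire
        raise Rejected(503, "bulkhead full")
    try:
        attempts = 1 + (max_retries if idempotent else 0)
        for attempt in range(attempts):
            if not breaker.allow():               # 4. circuit breaker check
                raise Rejected(503, "circuit open")
            try:
                result = call(timeout)            # 5. call with timeout
                breaker.record(True)
                return result
            except (TimeoutError, ConnectionError):
                breaker.record(False)
                if attempt == attempts - 1:
                    raise
                # 6. exponential backoff with full jitter before retrying
                time.sleep(random.uniform(0, base_backoff * 2 ** attempt))
    finally:
        bulkhead.release()                        # 7. release bulkhead
```

A caller would pass a `threading.BoundedSemaphore(n)` as `bulkhead` and a function that accepts a timeout in seconds as `call`. The `finally` clause ensures the slot is returned on every path, including rejections by the breaker and exhausted retries.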

Chapter 9: Connections

Resiliency patterns are the runtime defense layer. They handle failures as they happen, in real time. But they don't prevent failures from happening, and they don't tell you what is failing. For that, you need the other layers of the defense stack.

This Lesson vs. Related Topics

Topic	Focus	Relationship
Failure Modes & Isolation	What fails and how to contain it structurally	The architectural foundation that resiliency patterns build on
Resiliency Patterns (this lesson)	Runtime patterns that handle failures gracefully	The "how" of surviving failures in code
Testing & Deployment	Chaos engineering, canary deploys	How to verify that your resiliency patterns actually work
Observability	Metrics, logs, traces	How to see that patterns are activating and diagnose failures

Key Takeaways

1. Every network call needs a timeout. No timeout = unbounded resource consumption. Base timeouts on p99 latency, not median.

2. Retries must be safe. Exponential backoff + jitter prevents thundering herds. Retry budgets cap amplification. Idempotency is required.

3. Circuit breakers handle persistent failures. Stop hammering a dead service. Give it breathing room. Probe to detect recovery.

4. Load shedding saves the majority. It is better to serve 80% of requests well than 100% of requests poorly. Shed early, shed fast.

5. Rate limiting protects fairness. Token bucket allows bursts. Leaky bucket smooths. Sliding window is simple. Pick based on your traffic pattern.

6. Bulkheads contain blast radius within a service. Isolate resource pools per dependency. One slow dependency should not starve others.

7. Constant work prevents bimodal spikes. Do the same work always, not more work during crises. Provision for the constant, not the spike.

"The best way to survive a failure is to have already decided what you're going to do about it." — Nora Jones, Jeli co-founder

Final quiz: Your service chain is: User → Gateway → OrderService → PaymentService. PaymentService is down. Without any resiliency patterns, what happens? With all patterns enabled (timeouts, retries, circuit breaker, load shedding), what happens?

Answer: Without patterns, OrderService threads fill up waiting for Payment (no timeout), Gateway threads fill up waiting for Order, and users see a total outage (cascading failure). With patterns, timeouts free threads quickly, retries catch transient issues, the circuit breaker trips after its threshold (fast-failing payment requests), and load shedding on Gateway protects capacity. Result: payment-related requests fail fast with clear errors, but non-payment features remain operational.

Other options:
Without patterns, only PaymentService is affected. With patterns, the same thing but faster.
Without patterns, all services fail. With patterns, all services recover automatically.