Timeouts, retries, circuit breakers, load shedding — the toolkit for surviving failures gracefully.
Your service calls a database. The database is slow today — maybe a long-running query is hogging resources, maybe a disk is degrading, maybe the network is congested. Whatever the reason, your database calls that normally take 5ms are now taking 30 seconds.
Your service has a thread pool of 200 threads. Each request makes a database call. Each database call now holds a thread for 30 seconds instead of 5ms. After 200 requests (which arrive in about 2 seconds at normal traffic), every thread is stuck waiting for the database. No more threads are available. Your service cannot handle any requests — not even requests that don't need the database.
Your service is down, not because it failed, but because it waited too long for a dependency that was slow.
The simulation below shows a service calling a slow dependency. Without any resiliency patterns, the service's thread pool fills up and the service becomes completely unresponsive.
A service with 20 threads calls a slow dependency. Watch threads fill up as the dependency slows down.
Each resiliency pattern addresses a specific failure mode. Here is the complete toolkit:
| Pattern | What it does | Protects against |
|---|---|---|
| Timeout | Limits how long you wait for a response | Slow dependencies exhausting resources |
| Retry | Repeats a failed request | Transient failures (network blips, brief overload) |
| Circuit breaker | Stops calling a failing dependency | Retry storms overwhelming a recovering service |
| Load shedding | Drops excess requests when overloaded | Cascading failures from resource exhaustion |
| Rate limiting | Limits request rate per client | Abusive or buggy clients overwhelming the service |
| Bulkhead | Isolates resource pools per dependency | One slow dependency consuming all shared resources |
| Constant work | Does the same amount of work regardless of conditions | Bimodal behavior that amplifies failures |
A timeout is the simplest resiliency pattern: "If I haven't received a response in X milliseconds, give up and free the resources." Without a timeout, a slow dependency can hold your resources indefinitely. With a timeout, the damage is bounded.
There are two distinct timeouts, and confusing them is a common source of bugs:
| Timeout | What it measures | Typical value |
|---|---|---|
| Connection timeout | How long to wait for the TCP connection to be established (the three-way handshake) | 1-5 seconds |
| Read timeout (request timeout) | How long to wait for the response after the connection is established and the request is sent | 100ms - 30s (depends on operation) |
A connection timeout that fires means the server is unreachable (down, network partition, wrong address). A read timeout that fires means the server is reachable but slow (overloaded, long query, GC pause).
Too short: you declare healthy-but-slow calls as failures, increasing your error rate and triggering unnecessary retries. Too long: you hold resources for too long, risking thread pool exhaustion.
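The best practice is to base timeouts on observed latency percentiles rather than guesses: measure the call's p99 latency and set the timeout somewhat above it, not near the median.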
Fixed timeouts are fragile. If your dependency gets faster, your timeout is wastefully long. If it gets slower (legitimately), your timeout causes false failures. An adaptive timeout adjusts based on recent observations.
The simplest adaptive timeout tracks a rolling window of recent response times and sets the timeout to the p99 of that window plus a margin:
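A minimal sketch of that idea follows. The class name, the multiplicative margin, and the floor and ceiling values are assumptions chosen for illustration; the text does not say whether the margin is additive or multiplicative.

```python
import math
from collections import deque


class AdaptiveTimeout:
    """Derives a timeout from the p99 of a rolling window of latencies."""

    def __init__(self, window_size=1000, margin=1.5,
                 floor_ms=50.0, ceiling_ms=5000.0):
        self.samples = deque(maxlen=window_size)  # oldest samples fall out
        self.margin = margin          # assumed multiplier applied to p99
        self.floor_ms = floor_ms      # never time out faster than this
        self.ceiling_ms = ceiling_ms  # never wait longer; also the cold-start value

    def record(self, latency_ms):
        """Call after every completed request."""
        self.samples.append(latency_ms)

    def current_timeout_ms(self):
        if not self.samples:
            return self.ceiling_ms  # no data yet: be generous
        ordered = sorted(self.samples)
        # Nearest-rank p99: smallest value with at least 99% of samples at or below it.
        rank = math.ceil(99 * len(ordered) / 100) - 1
        p99 = ordered[rank]
        return min(self.ceiling_ms, max(self.floor_ms, p99 * self.margin))
```

The floor and ceiling keep a burst of unusually fast or slow samples from pushing the timeout to an extreme. Sorting on every call is fine for a sketch; a production version would more likely use a streaming quantile estimate.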
The simulation below lets you adjust the timeout and see how it affects false positives (healthy requests killed by timeout) versus resource waste (threads stuck waiting).
Requests arrive with varying latencies. Adjust the timeout to balance false positives vs. resource waste.
```python
import requests

# BAD: no timeout (default = wait forever)
response = requests.get("http://payment-service/charge")

# GOOD: explicit connection and read timeouts
response = requests.get(
    "http://payment-service/charge",
    timeout=(1, 0.4)  # (connect_timeout, read_timeout) in seconds
)


# BETTER: handle the timeout explicitly
def charge_with_fallback():
    try:
        return requests.get(
            "http://payment-service/charge",
            timeout=(1, 0.4)
        )
    except requests.Timeout:
        # Request timed out. We do NOT know if it was processed.
        # Check idempotency key before retrying.
        return fallback_response()  # fallback_response is application-defined
```
In a service chain (A calls B calls C calls D), each service sets its own timeout. But if A's timeout is 500ms and B spends 400ms before calling C, then C only has 100ms to complete. If C doesn't know this, it might start a 300ms operation that will be killed by A's timeout anyway — wasting resources.
Deadline propagation passes the remaining time budget from caller to callee. Each service knows exactly how much time it has before the upstream caller gives up.
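The sketch below shows one way to carry a remaining-time budget between services. The `Deadline` class, the `timeout_for_call` helper, and the `X-Deadline-Remaining-Ms` header name are all hypothetical; real systems use their own conventions for transmitting the budget.

```python
import time

HEADER = "X-Deadline-Remaining-Ms"  # hypothetical header name


class DeadlineExceeded(Exception):
    """The upstream caller has already given up; do not start more work."""


class Deadline:
    def __init__(self, budget_ms, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_ms / 1000.0

    @classmethod
    def from_headers(cls, headers, default_ms, clock=time.monotonic):
        # Use the caller's remaining budget if present, else a local default.
        raw = headers.get(HEADER)
        budget_ms = float(raw) if raw is not None else default_ms
        return cls(budget_ms, clock=clock)

    def remaining_ms(self):
        return max(0.0, (self._expires_at - self._clock()) * 1000.0)

    def to_headers(self):
        # Attach to every outbound call so the callee knows its budget.
        return {HEADER: str(int(self.remaining_ms()))}


def timeout_for_call(deadline, local_max_ms):
    """Timeout for the next outbound call: never longer than the budget left."""
    remaining = deadline.remaining_ms()
    if remaining <= 0:
        raise DeadlineExceeded()
    return min(local_max_ms, remaining)
```

In the A, B, C example above, B would build a `Deadline` from A's header, and by the time it calls C only about 100ms would remain, so C would see a 100ms budget instead of starting a 300ms operation.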
A request fails. Maybe the network dropped a packet. Maybe the server had a momentary GC pause. Maybe a load balancer routed you to a server that was just restarting. These are transient failures — they go away if you simply try again.
Retries are the natural response: if it fails, try again. But naive retries are one of the most dangerous things in a distributed system.
Imagine a service handling 1000 RPS. It calls a dependency that starts failing 50% of requests. Without retries, 500 requests per second fail. With one retry, those 500 failed requests are retried, adding 500 more requests. The dependency now sees 1500 RPS instead of 1000. With two retries, it could see up to 2000 RPS.
Exponential backoff means waiting longer between each retry: 100ms, 200ms, 400ms, 800ms... This gives the failing dependency time to recover instead of hammering it immediately.
Jitter means adding randomness to the backoff delay. Without jitter, if 1000 clients all retry at the same time (because they all saw the same failure at the same time), they will all retry at t+100ms, then all retry again 200ms later at t+300ms. These synchronized retry waves are called thundering herds. Jitter breaks the synchronization.
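A small sketch of backoff combined with jitter follows. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential ceiling. The function names, the `TransientError` type, and the default values are assumptions for illustration.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for whatever error the caller considers retryable."""


def backoff_delay_ms(attempt, base_ms=100, cap_ms=10_000, rng=random):
    """Delay before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap_ms, base_ms * (2 ** attempt))  # 100, 200, 400, ... capped
    return rng.uniform(0, ceiling)                   # spread clients across the interval


def call_with_retries(func, max_attempts=4, sleep=time.sleep):
    """Call func(), retrying on TransientError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(backoff_delay_ms(attempt) / 1000.0)
```

Only errors known to be transient should be retried this way; retrying a validation error or a non-idempotent write adds load without helping.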
Even with backoff and jitter, retries increase load. A retry budget caps the total number of retries across all requests, usually as a percentage of successful traffic.
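One way to implement such a budget is a credit counter: every success earns a fraction of a retry, and every retry spends a whole one. The sketch below is an assumption about how this could look, not a reference implementation; the class name and defaults are hypothetical, and credits are stored as integer hundredths to avoid floating-point drift.

```python
class RetryBudget:
    """Allows retries only up to a percentage of successful requests."""

    def __init__(self, retry_percent=10, max_banked_retries=10):
        self.retry_percent = retry_percent    # e.g. 10 => 1 retry per 10 successes
        self._cap = max_banked_retries * 100  # limit how much budget can pile up
        self._credits = 0                     # hundredths of a retry

    def record_success(self):
        self._credits = min(self._cap, self._credits + self.retry_percent)

    def try_spend_retry(self):
        """Return True if a retry is allowed right now."""
        if self._credits >= 100:
            self._credits -= 100
            return True
        return False  # budget exhausted: fail instead of retrying
```

When the dependency is healthy, successes keep the budget topped up and occasional retries go through. During an outage, successes stop, the budget drains, and retries shut off on their own.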
```python
import json

import redis

# Assumes a reachable Redis server; payment_gateway is application-defined.
cache = redis.Redis()


# Idempotency key pattern:
# Client generates a unique key for each logical operation.
# Server stores the key and result. On retry, returns cached result.
def charge_payment(request):
    key = request.headers["Idempotency-Key"]

    # Check if we already processed this request
    cached = cache.get(f"idempotency:{key}")
    if cached:
        return json.loads(cached)  # Return same result

    # Process for the first time.
    # Note: two concurrent requests with the same key can both reach this
    # point; production code needs an atomic reservation of the key first.
    result = payment_gateway.charge(request.amount)

    # Cache result with TTL (e.g., 24 hours)
    cache.set(f"idempotency:{key}", json.dumps(result), ex=86400)
    return result
```
Retries at each layer of a service chain multiply. If A makes up to 3 attempts per call to B, B makes up to 3 attempts per call to C, and C makes up to 3 attempts per call to D, the total number of requests hitting D can be up to 3 × 3 × 3 = 27 times the original request volume.
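The multiplication is easy to check for any chain. The helper below is a hypothetical illustration that takes the maximum number of attempts each layer makes per incoming call.

```python
import math


def worst_case_amplification(attempts_per_layer):
    """Upper bound on requests reaching the last service per original request."""
    return math.prod(attempts_per_layer)


# Three layers, each making up to 3 attempts per call.
chain = worst_case_amplification([3, 3, 3])

# Retrying at only one layer keeps the bound at that layer's attempt count.
single_layer = worst_case_amplification([3, 1, 1])
```

This is why a common design choice is to retry at a single layer of the chain and let the other layers fail fast.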
The simulation shows how retries with and without backoff/jitter affect load on a struggling dependency.
A dependency starts failing. Compare immediate retries vs. exponential backoff with jitter. Y-axis = total RPS hitting the dependency.
Retries help with transient failures. But what if the dependency is not transiently failing — what if it is completely down, and it is going to be down for the next 10 minutes? Every retry during those 10 minutes is wasted: it consumes resources, adds load to the already-dead dependency, and always fails.
A circuit breaker detects when a dependency is consistently failing and stops sending requests to it entirely for a cooldown period. Instead of timing out and retrying, the circuit breaker returns an immediate failure (or a cached/fallback response). This gives the dependency breathing room to recover.
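A circuit breaker is a state machine with three states: closed (calls pass through and failures are counted), open (calls fail fast without reaching the dependency), and half-open (after the cooldown, a few probe calls test whether the dependency has recovered).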
| Parameter | Typical Value | Too Low | Too High |
|---|---|---|---|
| Failure threshold | 5-10 failures in 10s | Trips on normal noise | Takes too long to detect real outages |
| Cooldown period | 10-60 seconds | Hammers recovering service | Stays open too long, slow recovery |
| Half-open probe count | 1-5 requests | One fluke success closes circuit prematurely | Sends too much probe traffic |
```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_sec=30, probe_count=3):
        self.state = "closed"
        self.failures = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown_sec
        self.opened_at = 0
        self.probe_count = probe_count
        self.probe_successes = 0

    def call(self, func, *args):
        if self.state == "open":
            # Monotonic clock: unaffected by wall-clock adjustments
            if time.monotonic() - self.opened_at > self.cooldown:
                self.state = "half-open"  # Try probing
                self.probe_successes = 0  # Start a fresh probe round
            else:
                raise CircuitOpenError()  # Fast fail

        try:
            result = func(*args)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half-open":
            self.probe_successes += 1
            if self.probe_successes >= self.probe_count:
                self.state = "closed"  # Recovered!
                self.failures = 0
        else:
            # Successes gradually forgive earlier failures
            self.failures = max(0, self.failures - 1)

    def _on_failure(self):
        self.failures += 1
        # Any failure while probing reopens the circuit immediately
        if self.state == "half-open" or self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
            self.probe_successes = 0
```
When the circuit breaker is open and you fast-fail requests, what do you return to the caller? There are three options:
| Strategy | When to use | Example |
|---|---|---|
| Error | When the operation is essential and there is no substitute | Payment processing — return "payment temporarily unavailable" |
| Cached response | When stale data is acceptable | Product recommendations — return yesterday's recommendations |
| Degraded response | When a simpler version of the response exists | Search results — return popular items instead of personalized results |
The simulation below shows a circuit breaker in action. The dependency alternates between healthy and failing. Watch the circuit breaker transition between states.
Watch the circuit breaker react to dependency health. Green = closed, Red = open, Yellow = half-open.
Your server can handle 1000 requests per second. Right now, 2000 requests per second are arriving. What do you do?
Option A: try to process all 2000. Your response time triples. All 2000 clients experience slow responses. Many time out. They retry. Now you are at 3000 RPS. Total collapse.
Option B: shed the excess load. Accept 1000 requests, immediately reject the other 1000 with a "503 Service Unavailable" response. The 1000 accepted requests complete quickly at normal latency. The 1000 rejected clients can retry (preferably to a different server) or back off.
This is load shedding: deliberately dropping requests when overloaded to protect the requests you are still serving. It is counterintuitive — deliberately failing requests to prevent failing all requests. But it works.
Most request queues are FIFO (first in, first out). Under load, a FIFO queue means that by the time you process a request, it has been waiting so long that the client has already timed out and given up. You waste resources processing a request whose response will be ignored.
LIFO (last in, first out) processes the newest requests first. These are the ones whose clients are still waiting. Old requests at the back of the queue have probably already timed out — drop them.
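A sketch of a LIFO request queue that also discards stale entries follows. The class name, the one-second age limit, and the size limit are assumptions for illustration.

```python
import time
from collections import deque


class LifoRequestQueue:
    """Serves the newest request first and drops requests that waited too long."""

    def __init__(self, max_age_sec=1.0, max_size=1000, clock=time.monotonic):
        self.max_age_sec = max_age_sec
        self.max_size = max_size
        self._clock = clock
        self._items = deque()  # left end = oldest, right end = newest
        self.dropped = 0

    def put(self, request):
        if len(self._items) >= self.max_size:
            self._items.popleft()  # full: sacrifice the oldest entry
            self.dropped += 1
        self._items.append((self._clock(), request))

    def get(self):
        """Return the newest live request, or None if the queue is empty."""
        now = self._clock()
        # Expired entries sit at the old end; their clients have likely given up.
        while self._items and now - self._items[0][0] > self.max_age_sec:
            self._items.popleft()
            self.dropped += 1
        if not self._items:
            return None
        return self._items.pop()[1]  # newest first
```

Under light load the queue is nearly empty, so LIFO and FIFO behave the same. The ordering only matters once a backlog builds up.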
Not all requests are equal. A health check from a load balancer is more important than a background analytics event. A checkout request from a paying customer is more important than a product listing refresh. Priority-based load shedding drops low-priority requests first, preserving capacity for high-priority ones.
```python
import time


class AdaptiveLoadShedder:
    def __init__(self, max_queue_time_ms=50):
        self.max_queue_time = max_queue_time_ms
        self.in_flight = 0        # Caller increments on start, decrements on finish
        self.max_in_flight = 100  # Concurrency limit

    def should_accept(self, request):
        # Check queue age (CoDel-style). enqueued_at is a time.time() value
        # in seconds, so convert the age to milliseconds before comparing.
        queue_time_ms = (time.time() - request.enqueued_at) * 1000
        if queue_time_ms > self.max_queue_time:
            return False  # Too stale, client probably timed out

        # Check concurrency limit
        if self.in_flight >= self.max_in_flight:
            # Shed based on priority (0 = high, 1 = medium, 2+ = low)
            if request.priority > 1:  # Low priority
                return False
            if request.priority > 0 and \
                    self.in_flight > self.max_in_flight * 1.1:
                return False  # Even medium priority
        return True
```
| Layer | Mechanism | Advantage |
|---|---|---|
| Load balancer | Reject when all backends are at capacity | Cheapest rejection — no work done at all |
| Application server | Reject when thread pool / queue is full | Application-aware (can prioritize by endpoint) |
| Request queue | Drop old/excess items from queue | Fine-grained control over queue depth |
The simulation shows a server under increasing load. Toggle load shedding on and off to see how it maintains quality of service for accepted requests.
Incoming traffic ramps up. Without shedding, latency explodes. With shedding, accepted requests stay fast.
Load shedding protects a server from total overload. Rate limiting is more targeted: it limits how many requests each individual client can make, preventing one client from monopolizing the server's capacity.
Without rate limiting, a buggy client that sends 10,000 requests per second (instead of its normal 10) can consume all of your capacity, starving every other client. Rate limiting caps that client at, say, 100 RPS, ensuring capacity remains for everyone else.
The token bucket is the most common rate limiting algorithm. Picture a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected.
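A compact sketch of the bucket described above follows. The class name and defaults are assumptions; the two pieces of state are the token count and the time of the last refill.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_sec=10.0, capacity=10.0, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens added per second
        self.capacity = capacity   # maximum burst size
        self._clock = clock
        self.tokens = capacity     # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self):
        """Consume one token if available; return whether the request may proceed."""
        now = self._clock()
        # Refill lazily based on elapsed time instead of running a timer.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

For per-client limiting, a service would typically keep one bucket per client ID. This sketch is not thread-safe; concurrent callers would need a lock around `allow`.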
The sliding window counts requests in a rolling time window and rejects a request once the count reaches the limit. It is simpler than a token bucket but offers no burst control.
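The sketch below uses the log variant of a sliding window, which stores one timestamp per accepted request. The class name and defaults are assumptions for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit=100, window_sec=1.0, clock=time.monotonic):
        self.limit = limit
        self.window_sec = window_sec
        self._clock = clock
        self._timestamps = deque()  # accepted requests, oldest first

    def allow(self):
        now = self._clock()
        # Forget requests that have slid out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_sec:
            self._timestamps.popleft()
        if len(self._timestamps) < self.limit:
            self._timestamps.append(now)
            return True
        return False
```

Memory grows with the limit because every accepted request in the window is remembered, which is the trade-off against the token bucket's two numbers.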
The leaky bucket smooths traffic to a constant rate. Requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are dropped.
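A minimal sketch of a leaky bucket follows, assuming a worker calls `poll()` repeatedly to pull the next request. The class and method names are hypothetical.

```python
import time
from collections import deque


class LeakyBucket:
    def __init__(self, drain_rate_per_sec=10.0, capacity=20, clock=time.monotonic):
        self.interval = 1.0 / drain_rate_per_sec  # minimum gap between releases
        self.capacity = capacity
        self._clock = clock
        self._queue = deque()
        self._next_release = clock()

    def offer(self, request):
        """Queue a request; return False if the bucket is full and it is dropped."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(request)
        return True

    def poll(self):
        """Return the next request if one is due, otherwise None."""
        now = self._clock()
        if self._queue and now >= self._next_release:
            self._next_release = now + self.interval
            return self._queue.popleft()
        return None
```

Because each release pushes the next one at least `interval` into the future, the output never exceeds the drain rate, no matter how bursty the input is.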
| Algorithm | Burst | Smoothing | Memory | Best for |
|---|---|---|---|---|
| Token bucket | Allows bursts up to bucket size | None — bursts pass through | 2 numbers (tokens, last_refill) | APIs where bursts are acceptable |
| Leaky bucket | Absorbs bursts into queue | Output is perfectly smooth | Queue + drain rate | Systems requiring uniform rate |
| Sliding window | No burst control | None | Timestamps of recent requests | Simple per-client limits |
The simulation below lets you compare token bucket and sliding window rate limiting under bursty traffic.
Bursty client sends requests. Green = allowed, red = rate-limited. Compare the two algorithms.
A bulkhead is a wall in a ship's hull that divides the hull into separate watertight compartments. If the hull is breached, water floods one compartment but the bulkheads prevent it from flooding the entire ship. The ship stays afloat.
The same principle applies to software. Your service calls three dependencies: a database, a cache, and an external API. All three share the same thread pool of 100 threads. The external API starts responding slowly. All 100 threads get stuck waiting for the API. Now your database calls and cache calls also fail — not because the database or cache is down, but because there are no threads left to call them.
A bulkhead pattern gives each dependency its own isolated pool of resources (threads, connections, memory). If the external API pool (30 threads) is exhausted, the database pool (50 threads) and cache pool (20 threads) are unaffected.
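One lightweight way to build this is a semaphore per dependency that rejects immediately when its slots are taken. The sketch below assumes that approach; the `Bulkhead` class and `BulkheadFull` exception are hypothetical names.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """All slots for this dependency are in use."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately instead of queueing behind
        # a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()


# Pool sizes taken from the example above.
api_bulkhead = Bulkhead("external-api", 30)
db_bulkhead = Bulkhead("database", 50)
cache_bulkhead = Bulkhead("cache", 20)
```

A caller would wrap each outbound call in `with api_bulkhead.slot(): ...`. If the external API stalls, at most 30 callers are held up; the database and cache slots are counted separately and stay available.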
The Titanic had bulkheads dividing its hull into 16 watertight compartments. It was designed to stay afloat with its first 4 compartments flooded. But the iceberg opened 6 compartments. Water flooded one compartment, overflowed the top of the bulkhead wall into the next, and so on. The bulkheads were not tall enough.
Software bulkheads fail the same way when they are not configured correctly. If your "database thread pool" bulkhead allows 90 out of 100 threads, it is not really a bulkhead — one slow dependency can still consume 90% of your capacity. Effective bulkheads must leave meaningful capacity for other dependencies.
| Bulkhead Type | What it isolates | Example |
|---|---|---|
| Thread pool | Threads per dependency | 50 threads for DB, 20 for cache, 30 for API |
| Connection pool | Connections per downstream | Max 50 DB connections, max 20 API connections |
| Semaphore | Concurrency per operation | Max 10 concurrent calls to slow endpoint |
| Process/container | Entire runtime per workload | Separate container for batch jobs vs. serving |
The simulation shows a service with shared resources vs. bulkheaded resources. One dependency slows down — compare how it affects the other dependencies.
Left: shared thread pool. Right: bulkheaded pools. Slow down API to see the difference.
Most systems do more work when things go wrong. A health checker pings every server every 10 seconds. When a server goes down, the checker starts pinging it more frequently (every 1 second) to detect when it comes back. Meanwhile, the load balancer redistributes traffic, generating more routing table updates. The monitoring system fires alerts, which trigger escalation chains, which generate more monitoring queries.
This is bimodal behavior: the system does X work during normal operation and 10X work during an incident. The problem? Incidents are exactly when your infrastructure is most stressed. Doing 10X work during a crisis is a recipe for cascading failures.
The constant work pattern (sometimes called "constant-rate processing") ensures the system does the same amount of work regardless of conditions. During normal operation, some of that work is "wasted." During an incident, no extra work is needed.
| Scenario | Bimodal (bad) | Constant Work (good) |
|---|---|---|
| Config distribution | Push updates when config changes | Continuously push entire config every N seconds, whether or not it changed |
| DNS resolution | Resolve on cache miss | Periodically re-resolve all entries on a fixed schedule |
| Health checks | Increase frequency on failure | Constant frequency regardless of health |
| Membership | Broadcast when a node joins/leaves | Periodically broadcast full membership list |
Let's compare bimodal and constant-work config distribution implementations:
```python
import time

# all_servers, config_store, and apply_config are application-defined.


# BIMODAL (bad): push on change
def on_config_change(new_config):
    # Push to ALL servers immediately
    for server in all_servers:
        server.update_config(new_config)
    # Problem: 50 rapid changes = 50 broadcasts to all N servers.
    # During incidents, engineers make MANY rapid changes.


# CONSTANT WORK (good): poll on schedule
def config_sync_loop():
    # Same traffic during normal operation and during an incident.
    # 50 rapid changes? Only the final state is applied.
    # Max staleness: 10 seconds.
    while True:
        config = config_store.get_latest()
        apply_config(config)  # Idempotent apply
        time.sleep(10)        # Every 10 seconds, always
```
A classic bimodal DNS problem: your service resolves DNS names when establishing connections. During normal operation, DNS responses are cached. During a DNS outage, the cache expires and every new connection attempts a DNS resolution — multiplying DNS traffic exactly when the DNS server is struggling.
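A constant-work alternative re-resolves every known name on a fixed schedule and keeps the last good answer when resolution fails. The sketch below is an assumption about how that might look; the `DnsRefresher` class is hypothetical, and the resolver is injectable so it can be tested without a network.

```python
import socket


class DnsRefresher:
    """Re-resolves a fixed set of names on every tick, healthy or not."""

    def __init__(self, hostnames, resolve=socket.gethostbyname):
        self.hostnames = list(hostnames)
        self._resolve = resolve
        self.addresses = {}  # name -> last known-good address

    def refresh_all(self):
        """One tick: the same number of lookups regardless of conditions."""
        for name in self.hostnames:
            try:
                self.addresses[name] = self._resolve(name)
            except OSError:
                # Resolver failing: keep serving the stale address rather than
                # erroring or retrying harder.
                pass

    def lookup(self, name):
        """Connections read from the local table and never trigger a lookup."""
        return self.addresses.get(name)
```

A background thread or timer would call `refresh_all()` every N seconds. DNS traffic then depends only on the number of names and the interval, not on connection rate or on whether the DNS server is struggling.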
The simulation shows the health check traffic pattern under normal and failure conditions, comparing bimodal and constant work approaches.
Health check traffic over time. Inject a server failure and compare the traffic patterns.
This is the payoff chapter. We have learned seven resiliency patterns: timeouts, retries, circuit breakers, load shedding, rate limiting, bulkheads, and constant work. Now we see them work together in a realistic service chain.
3 services in a chain. Toggle patterns on/off, inject failures, watch real-time metrics.
| Scenario | Expected Behavior |
|---|---|
| Slow DB, no patterns | Threads fill up on backend, frontend times out, total outage |
| Slow DB + timeouts | Backend recovers threads after timeout, but errors increase |
| Slow DB + timeouts + circuit breaker | Circuit trips after threshold, instant fail-fast, DB gets breathing room |
| Traffic spike + load shedding | Excess requests shed, accepted requests stay fast |
| Kill backend + retries | Retries succeed on remaining backends (if multiple exist) |
| All patterns on + any failure | Graceful degradation instead of cascading failure |
The resiliency patterns are not independent — they interact in specific, important ways. Understanding these interactions is essential for configuring them correctly.
| Interaction | How it works | Configuration tip |
|---|---|---|
| Timeout + Retry | A timeout triggers a retry. The worst-case total time is roughly timeout × number of attempts, plus any backoff delays. | Total time must be less than the upstream caller's timeout |
| Retry + Circuit Breaker | Retries feed failure count to the circuit breaker. Too many retried failures trip the circuit. | Circuit breaker threshold should account for retry-induced failures |
| Circuit Breaker + Load Shedding | When circuit opens, no traffic reaches the dependency. This is a form of load shedding at the caller level. | The dependency gets breathing room to recover |
| Bulkhead + Timeout | Bulkhead limits concurrent calls. If all slots are held by calls still waiting on their timeouts, new requests are rejected immediately. | Bulkhead size = max_concurrent × timeout / avg_latency |
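
The Timeout + Retry row can be checked with simple arithmetic before deploying a configuration. The helper names below are hypothetical, and counting exponential backoff sleeps in the total is an assumption that goes beyond the table's simpler product.

```python
def worst_case_total_ms(timeout_ms, max_attempts, base_backoff_ms=0):
    """Longest time a call can take if every attempt hits its timeout."""
    total = timeout_ms * max_attempts
    # Backoff sleeps happen between attempts: base, 2*base, 4*base, ...
    for retry in range(max_attempts - 1):
        total += base_backoff_ms * (2 ** retry)
    return total


def fits_upstream(timeout_ms, max_attempts, upstream_timeout_ms, base_backoff_ms=0):
    """True if the caller upstream will still be waiting when we finish."""
    return worst_case_total_ms(timeout_ms, max_attempts, base_backoff_ms) < upstream_timeout_ms
```

For example, a 400ms timeout with 3 attempts and 100ms base backoff can take 1200ms of waiting plus 300ms of backoff, so it only fits under an upstream timeout longer than 1500ms.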
Here is how a well-configured service processes an outbound request with all patterns active:
Resiliency patterns are the runtime defense layer. They handle failures as they happen, in real time. But they don't prevent failures from happening, and they don't tell you what is failing. For that, you need the other layers of the defense stack.
| Topic | Focus | Relationship |
|---|---|---|
| Failure Modes & Isolation | What fails and how to contain it structurally | The architectural foundation that resiliency patterns build on |
| Resiliency Patterns (this lesson) | Runtime patterns that handle failures gracefully | The "how" of surviving failures in code |
| Testing & Deployment | Chaos engineering, canary deploys | How to verify that your resiliency patterns actually work |
| Observability | Metrics, logs, traces | How to see that patterns are activating and diagnose failures |
1. Every network call needs a timeout. No timeout = unbounded resource consumption. Base timeouts on p99 latency, not median.
2. Retries must be safe. Exponential backoff + jitter prevents thundering herds. Retry budgets cap amplification. Idempotency is required.
3. Circuit breakers handle persistent failures. Stop hammering a dead service. Give it breathing room. Probe to detect recovery.
4. Load shedding saves the majority. It is better to serve 80% of requests well than 100% of requests poorly. Shed early, shed fast.
5. Rate limiting protects fairness. Token bucket allows bursts. Leaky bucket smooths. Sliding window is simple. Pick based on your traffic pattern.
6. Bulkheads contain blast radius within a service. Isolate resource pools per dependency. One slow dependency should not starve others.
7. Constant work prevents bimodal spikes. Do the same work always, not more work during crises. Provision for the constant, not the spike.