Trade-Offs in Data Systems

Chapter 0: The Problem

You open your phone and check Twitter. In the second or so it takes to load your timeline, here is what happens behind the curtain: a load balancer routes your request to one of thousands of application servers. That server checks a cache (Redis, maybe Memcached) to see if your timeline was recently computed. Cache miss. It queries a database (probably a sharded MySQL or PostgreSQL cluster) for tweets from the 400 accounts you follow. It calls a search index (Elasticsearch or a custom inverted index) to find trending topics. It pushes an analytics event into a message queue (Kafka) so a downstream batch processing pipeline can update your engagement scores overnight. Meanwhile, a stream processor is watching Kafka in real time, updating global trending topics as tweets flow in.

Six different kinds of data systems. One request. And the user just sees a feed that loaded in under a second. If any one of those systems is slow, the whole experience degrades. If any one loses data, users see corrupted timelines. If any one goes down completely, the application must decide: do we serve a degraded experience or show an error page?

These are not hypothetical questions. They are the daily reality of every engineering team building modern applications. And the answers are never obvious — they depend on the specific requirements, the specific workload, and the specific tradeoffs you are willing to make.

This is the defining reality of modern software: most applications today are data-intensive, not compute-intensive. The CPU is rarely the bottleneck. The hard problems are the amount of data, the complexity of data, and the speed at which data changes.

Martin Kleppmann opens DDIA with a simple observation: we have gotten very good at making CPUs fast. What we have not gotten good at is making data systems that are reliable, scalable, and easy to maintain. That is what DDIA is about, and that is what this lesson covers.

The Blurring Boundaries

Textbooks draw clean boxes: "this is a database," "that is a message queue." Reality is messier. Redis is a cache that can persist to disk like a database. Apache Kafka is a message queue that can store data durably for weeks, acting as a distributed log. Elasticsearch is a search engine that many teams use as their primary data store. The categories blur because each tool optimizes for a different set of tradeoffs, and sometimes those tradeoffs overlap.

When you stitch these tools together in your application code, you are designing a new data system. Your application's API hides the internal complexity. To the outside world, your service is a black box that promises certain guarantees: "This query will return results in under 200ms," "No data will be lost after a confirmed write," "The system will handle 10,000 concurrent users."

Making those promises — and keeping them — is harder than it sounds. A cache improves speed but introduces stale data. Replication improves availability but introduces consistency problems. A message queue decouples services but adds latency. Every tool you add solves one problem and creates another. The art of data systems engineering is choosing the combination of tradeoffs that best serves your users.

Those promises come down to three properties. Martin Kleppmann calls them the three concerns that are important in most software systems:

The Big Three. Every data system lives or dies by three qualities:

Reliability — The system works correctly, even when things go wrong.
Scalability — The system copes with growth (data, traffic, complexity).
Maintainability — Future engineers can work with the system productively.

The rest of this lesson unpacks each one. These three words are the vocabulary of every systems design interview you will ever do.

A Typical Data System

Before we dive into each property, let us see what a typical data-intensive application looks like. The simulation below shows data flowing between the components you just read about. Watch how a single user request touches multiple systems.

[Interactive simulation: Data System Architecture. Clicking "Send Request" traces a user request through a typical web application, showing data flow between the database, cache, search index, and message queue.]

Notice something important: the application server is the orchestrator. It decides which component to call, in what order, and what to do when one fails. If the cache is down, do you skip it and hit the database directly? If the message queue is full, do you drop the analytics event or block the user's request? These are design decisions, and every one of them is a tradeoff.
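The two decisions in the paragraph above can be written down as a small sketch. The names here (`CacheDown`, `load_timeline`, `record_event`) are hypothetical and stand in for whatever cache and queue clients an application actually uses. The choices encoded (fall back to the database when the cache is down, drop the analytics event when the queue is full) are one reasonable answer, not the only one.

```python
import queue

class CacheDown(Exception):
    """Raised by the (hypothetical) cache client when the cache is unreachable."""

def load_timeline(user_id, cache_get, db_query):
    """Try the cache first; on a miss or a cache outage, fall back to the database."""
    try:
        cached = cache_get(user_id)
        if cached is not None:
            return cached, "cache"
    except CacheDown:
        pass  # design decision: a cache outage costs speed, not correctness
    return db_query(user_id), "database"

def record_event(events: queue.Queue, event) -> bool:
    """Enqueue an analytics event without ever blocking the user's request."""
    try:
        events.put_nowait(event)
        return True
    except queue.Full:
        return False  # design decision: drop analytics rather than stall the user

def broken_cache(_user_id):
    raise CacheDown()

# The cache is down, so the request is served from the database.
result = load_timeline(42, broken_cache, lambda uid: [f"tweet for {uid}"])

# The queue holds one event; the second one is dropped instead of blocking.
events = queue.Queue(maxsize=1)
accepted = [record_event(events, "view"), record_event(events, "view")]
print(result, accepted)
```

Making the opposite choice (blocking on a full queue) is a one-line change, which is the point: the tradeoff lives in application code, not in any one component.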

Why "Data-Intensive" Not "Compute-Intensive"?

There was a time when CPU speed was the bottleneck. In the 1990s, sorting a million records was a genuine computational challenge. Today, a single laptop can sort a billion integers in under a minute. CPUs got fast enough that for most applications, raw computation is cheap.

What did not get cheap: moving data around. Reading from disk. Waiting for a network round-trip. Synchronizing state across three data centers on different continents. The speed of light imposes a hard floor: a round-trip from New York to London takes at minimum about 56 milliseconds at light speed through fiber, and real-world latency is higher still once indirect cable routes and switching delays are added. No amount of CPU improvement can fix this.
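The 56 ms floor can be checked with a few lines of arithmetic. The distance (about 5,570 km great-circle) and the fiber speed (roughly two thirds of the speed of light in vacuum) are assumptions supplied here, not figures from the text.

```python
# Assumed figures (not from the text): great-circle distance and fiber speed.
NY_LONDON_KM = 5_570        # approximate great-circle distance
C_VACUUM_KM_S = 299_792     # speed of light in vacuum
FIBER_FRACTION = 2 / 3      # light in glass travels at roughly 2/3 of c

def min_round_trip_ms(distance_km: float, speed_km_s: float) -> float:
    """Lower bound on round-trip time: there and back at the given speed."""
    return 2 * distance_km / speed_km_s * 1000

rtt_vacuum = min_round_trip_ms(NY_LONDON_KM, C_VACUUM_KM_S)
rtt_fiber = min_round_trip_ms(NY_LONDON_KM, C_VACUUM_KM_S * FIBER_FRACTION)
print(f"vacuum: {rtt_vacuum:.1f} ms, fiber: {rtt_fiber:.1f} ms")
```

Under these assumptions the fiber bound lands at about 56 ms, and no real cable follows the great-circle path exactly.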

This is why the interesting problems are data problems: how to store it (databases), how to cache it (Redis, Memcached), how to search it (Elasticsearch), how to move it (Kafka, RabbitMQ), how to process it (Spark, Flink), and how to keep it consistent across multiple copies (replication, consensus). The CPU crunches numbers. The system manages data. DDIA is about the system.

You are a data system designer. The moment you combine a database with a cache with a queue, you are not just a backend developer. You are designing a new, special-purpose data system with its own guarantees. DDIA exists to help you reason about those guarantees rigorously, not just hope they work.

Warm-up: A junior developer says "Redis is a cache, not a database." Why is this statement too simplistic?

Because Redis is written in C and databases are usually written in Java
Because Redis can persist data to disk (RDB snapshots and AOF logs), support transactions, and replicate across nodes, so the boundary between "cache" and "database" is a spectrum of durability and query capability, not a binary category
Because Redis is newer than most databases

Chapter 1: Reliability

Everybody wants a system that "just works." But what does that mean precisely? Kleppmann defines reliability as: the system continues to work correctly even when things go wrong. "Working correctly" means performing the function the user expects, at the performance the user expects, even when the user makes mistakes, even when the hardware fails, even when the software has bugs.

The key distinction is between faults and failures. A fault is a single component deviating from its spec: one disk dies, one process crashes, one network packet is lost. A failure is when the system as a whole stops providing the required service. The goal of fault-tolerant design is to prevent faults from causing failures. You cannot prevent all faults (disks will die, networks will partition), but you can build systems that tolerate them.

Counterintuitively, it can make sense to deliberately trigger faults to test fault-tolerance mechanisms. This is the philosophy behind Netflix's Chaos Monkey and the broader discipline of chaos engineering: if you have not tested what happens when a server dies, you do not actually know if your system is fault-tolerant. You only believe it is. The difference between belief and knowledge in production engineering is measured in outages.

Hardware Faults

Hard disks have a mean time to failure (MTTF) of about 10 to 50 years. That sounds reliable until you do the math: if you have a storage cluster with 10,000 disks, each with an MTTF of 30 years, you should expect on average one disk to die every day. (10,000 disks ÷ 30 years ÷ 365 days = 0.91 failures per day.) RAM develops faults. Power grids have outages. Network cables get unplugged by cleaning staff.
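The disk arithmetic above generalizes to any fleet size. This is a minimal sketch that assumes disks fail independently at a constant rate; the function name is made up for illustration.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected daily failures, assuming independent disks with a constant failure rate."""
    return num_disks / (mttf_years * 365)

print(round(expected_failures_per_day(10_000, 30), 2))   # 0.91
print(round(expected_failures_per_day(100_000, 30), 2))  # 9.13
```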

The traditional response: redundancy. RAID arrays for disks, dual power supplies, hot-swappable CPUs, diesel generators for power. If one component fails, its redundant partner takes over while the broken one is replaced. For a single server, this approach can keep a machine running uninterrupted for years.

But as data volumes have grown, more applications use larger numbers of machines, which proportionally increases the rate of hardware faults. Cloud platforms like AWS are designed so that individual virtual machine instances can become unavailable without warning. The platform prioritizes flexibility and elasticity over single-machine reliability. So the emphasis has shifted from hardware redundancy to software fault-tolerance: systems that can tolerate the loss of entire machines.

Software Faults

Hardware faults are mostly independent: one disk dying does not make another disk more likely to die. Software faults are worse because they are correlated. A bug in the Linux kernel affects every machine running that kernel. Other examples include a runaway process that consumes all CPU, memory, or disk on every node simultaneously; a service that the system depends on slowing down, becoming unresponsive, or returning corrupted data; and a cascading failure where a small fault in one component triggers a fault in another, which triggers another.

Software faults are harder to anticipate. They lie dormant for a long time until triggered by an unusual set of circumstances — a leap second, a particularly large input, a specific sequence of operations. There is no quick fix. Kleppmann's advice: thorough testing, process isolation, allowing processes to crash and restart, measuring, monitoring, and alerting.

Human Faults

Humans design and operate these systems, and humans make mistakes. Configuration errors by operators are the leading cause of outages, not hardware or software faults. Studies of large internet services found that operator error caused the majority of disruptions, while hardware faults caused only 10-25%.

How do you mitigate human error? Design systems that minimize opportunities for error (good APIs that make it easy to do the right thing and hard to do the wrong thing). Provide sandbox environments where people can experiment safely with real data without affecting production. Test thoroughly at all levels: unit tests, integration tests, manual tests, automated end-to-end tests. Allow quick and easy rollback from mistakes — every deploy should be reversible in under one minute. Set up detailed monitoring (telemetry) so you can detect problems early, before users report them.

Google's approach to mitigating human error is instructive. Every configuration change goes through a review process. Changes are rolled out gradually (1% of traffic, then 10%, then 50%, then 100%) with automated monitoring at each stage. If any metric degrades, the rollout automatically pauses and alerts the engineer. This "canary" pattern catches most human errors before they affect more than a small fraction of users.

The reliability hierarchy. Hardware faults are random and independent — redundancy handles them. Software faults are systematic and correlated — testing and monitoring handle them. Human faults are the most common — good design, sandboxes, and rollbacks handle them. In an interview, start with human error (most common), then software (most dangerous), then hardware (most studied).

Worked Example: How Faults Become Failures

Let us trace a real cascading failure to see how a small fault becomes a system-wide failure.

1. Fault: One database replica has a slow disk

The disk is degraded but not dead. Reads take 500ms instead of 5ms. The replica is technically "healthy" — health checks pass because they only check if the process is alive, not if it is fast.

↓

2. Amplification: Load balancer routes traffic to slow replica

The load balancer distributes connections evenly. Requests hitting the slow replica take 100x longer. These requests hold open connections and consume thread pool slots on the application server.

↓

3. Cascading: App server thread pool exhausted

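With 200 threads and requests that each hold a thread for 5ms, one application server can complete up to 200 ÷ 0.005s = 40K req/s. A thread stuck on a 500ms request completes only 2 req/s, so as slow requests occupy more of the pool, throughput falls toward 200 ÷ 0.5s = 400 req/s. Requests queue up. Timeouts start firing. Retry storms begin: upstream services retry failed requests, tripling the load.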

↓

4. Failure: System-wide outage

The retry storm overwhelms the remaining healthy replicas. All replicas become slow. All application servers exhaust their thread pools. The load balancer has no healthy backends. Users see 503 errors. Total duration: 47 minutes. Root cause: one slow disk.
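The throughput figures in step 3 follow from Little's law: a pool of N threads can complete at most N divided by the mean service time. The sketch below reproduces the two bounds and adds a middle case. The three-replica even split is an assumption made here for illustration, since the text does not say how many replicas there are.

```python
def max_throughput(threads: int, mean_service_time_s: float) -> float:
    """Little's law: N busy threads complete at most N / service_time requests per second."""
    return threads / mean_service_time_s

THREADS = 200
FAST_S, SLOW_S = 0.005, 0.5

healthy = max_throughput(THREADS, FAST_S)   # every replica answers in 5 ms
all_slow = max_throughput(THREADS, SLOW_S)  # every request waits 500 ms

# Assumption for illustration: 3 replicas, traffic split evenly, one of them slow.
REPLICAS = 3
mixed_time = (SLOW_S + (REPLICAS - 1) * FAST_S) / REPLICAS
mixed = max_throughput(THREADS, mixed_time)

print(f"healthy: {healthy:,.0f} req/s")
print(f"one of {REPLICAS} replicas slow: {mixed:,.0f} req/s")
print(f"all slow: {all_slow:,.0f} req/s")
```

Even with only one replica in three degraded, the mean service time rises to 170 ms and the ceiling drops to roughly 1,176 req/s, which is why a single slow disk is enough to start the queue build-up.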

Prevention requires defense in depth: (1) health checks that measure latency, not just liveness, (2) circuit breakers that stop sending traffic to slow backends, (3) retry budgets that cap the total number of retries, (4) timeouts at every network boundary. Netflix pioneered many of these patterns with their Hystrix library and the broader concept of chaos engineering — intentionally injecting faults to verify that the system handles them gracefully.
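As an illustration of item (2), here is a minimal circuit breaker. It is a sketch under simplifying assumptions (a consecutive-failure counter, a fixed cooldown, a single trial call after the cooldown) and not the Hystrix implementation. The class and parameter names are invented for this example.

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker refuses to call the backend."""

class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock      # injectable so the demo can fake time
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpen("backend skipped while circuit is open")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result

# Demo with a fake clock so the example runs instantly.
fake_now = [0.0]
breaker = CircuitBreaker(max_failures=2, cooldown_s=10.0, clock=lambda: fake_now[0])

def slow_backend():
    raise TimeoutError("backend exceeded its timeout")

outcomes = []
for _ in range(3):
    try:
        breaker.call(slow_backend)
    except TimeoutError:
        outcomes.append("timeout")
    except CircuitOpen:
        outcomes.append("skipped")

fake_now[0] = 11.0  # cooldown has elapsed, so one trial call is allowed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)  # ['timeout', 'timeout', 'skipped', 'ok']
```

The third call never reaches the backend. That is the property that stops slow requests from pinning application threads in the cascade traced above.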

Real-World Failure Rates

How often do things actually break? Here are empirically observed failure rates from large-scale systems:

Component	Failure Rate	Source
Hard disk (HDD)	2-4% annual failure rate (AFR)	Backblaze Drive Stats, 2023
SSD	0.5-1% AFR	Google fleet study, 2016
Server (any component)	2-10% AFR depending on age	Microsoft Azure study, 2018
Network link	~1 failure per year per link	Google Jupiter network paper
Rack power supply	1-2 events per year per datacenter	AWS re:Invent talks
Software deploy	~1 in 20 deploys causes a rollback	DORA State of DevOps
Configuration change	#1 cause of outages at Google, Microsoft, and Amazon	Multiple incident reports

The pattern is clear: hardware fails at predictable, low rates. Software and human errors occur at much higher rates and are harder to predict. The most reliable systems invest more in preventing human error (safe defaults, canary deploys, automated rollbacks) than in preventing hardware error (which is already well-understood).

Reliability Simulator

The simulation below models a system with five components, each with an adjustable failure rate. In a series architecture (every component must work for the system to work), the overall reliability multiplies: if each component has 99.9% availability, five in series give 99.9%⁵ = 99.5%. Toggle redundancy to add a replica for each component and watch reliability improve dramatically.

[Interactive simulation: Reliability Simulator. Adjusts individual component reliability (default 99.90%) and toggles redundancy to add replicas, showing how series composition multiplies failure probabilities.]

// Series reliability (all must work):
R_series = R₁ × R₂ × ... × R_n

// With redundancy (each component has one backup; assumes the two copies fail independently):
R_component = 1 - (1 - R_single)²
R_series_redundant = (R_component)ⁿ

// Worked example: 5 components, each 99.9%:
No redundancy: 0.999⁵ = 0.995 = 99.5% (~1.8 days of downtime/year)
With redundancy: (1 - 0.001²)⁵ = (1 - 10^-6)⁵ ≈ 99.9995% (~2.6 minutes of downtime/year)
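The same formulas in runnable form, as a sketch that assumes component failures are independent. Real replicas often share a rack, a deploy, or a bug, so this is the best case. The function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60

def series(availabilities):
    """All components must work: multiply the individual availabilities."""
    return math.prod(availabilities)

def with_replicas(availability, copies=2):
    """A component is down only if every copy is down (independence assumed)."""
    return 1 - (1 - availability) ** copies

def downtime_minutes_per_year(availability):
    return (1 - availability) * MINUTES_PER_YEAR

plain = series([0.999] * 5)
redundant = series([with_replicas(0.999)] * 5)

print(f"no redundancy:   {plain:.4%}, "
      f"{downtime_minutes_per_year(plain) / (24 * 60):.1f} days down per year")
print(f"with redundancy: {redundant:.5%}, "
      f"{downtime_minutes_per_year(redundant):.1f} minutes down per year")
```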

Interview question: Your system has 5 microservices in a critical path. Each has 99.9% availability (three nines). A product manager asks: "We have three nines on every service, so the system has three nines, right?" What is the actual system availability, and what do you recommend?

Yes, 99.9%: each service is three nines so the system is three nines
95%: you add the failure rates
99.5%: you multiply, 0.999^5 = 0.995. Series composition means the system is LESS reliable than any individual component. Recommend: add redundancy to the most critical services, or redesign the critical path to call fewer services in series

Chapter 2: Scalability

A system that works reliably today may not work reliably tomorrow if load increases tenfold. Scalability is the ability of a system to cope with increased load. But "scalability" is not a one-dimensional label — you cannot say "X is scalable" or "X doesn't scale" as an absolute statement. It is always relative to a specific kind of growth: "If load doubles, can we maintain response times by adding proportionally more resources?"

Describing Load

Before you can talk about whether a system handles growth, you must describe the current load precisely. Kleppmann calls these load parameters — numbers that capture the shape of your workload. The best choice of load parameters depends on the architecture of your system:

System Type	Key Load Parameters	Example
Web server	Requests per second, read/write ratio	1,000 req/s, 90% reads
Database	Read/write ratio, working set size	Write-heavy: 80% writes
Cache	Hit rate, eviction rate	95% hit rate, 1M keys
Chat app	Concurrent active users, messages/sec	50K concurrent, 500 msg/s
Video platform	Concurrent streams, bitrate, storage growth/day	100K streams, 5 Mbps avg

Twitter's Fan-Out Problem

The best example from DDIA is Twitter circa 2012. Twitter had two main operations: post tweet (a user writes a tweet, delivered to all followers) and read timeline (a user views their home timeline — tweets from everyone they follow). At the time, Twitter handled about 4,600 tweets/sec on average (12,000 at peak) and 300,000 timeline reads/sec.

The challenge was not the write throughput. 12,000 writes/sec is manageable for most databases. The challenge was fan-out: each tweet must be delivered to all the author's followers, and some users have tens of millions of followers.

Two approaches:

Approach 1: Fan-out on Read

When a user requests their timeline, look up everyone they follow, find all recent tweets from those users, and merge them (sorted by time). Simple to implement, but expensive at read time. For a user following 500 people, that is 500 index lookups per timeline load, times 300,000 reads/sec = 150 million database lookups per second.

vs.

Approach 2: Fan-out on Write

When a user posts a tweet, immediately insert it into the home timeline cache of every follower. Reading the timeline is now trivially fast (just read the precomputed cache), but writing is expensive. A user with 30 million followers generates 30 million cache writes per tweet. At 4,600 tweets/sec average, even a couple of hundred followers per author adds up to tens of billions of cache writes per day.

Twitter initially used Approach 1, then switched to Approach 2 because the read volume (300K/sec) was far higher than write volume (4.6K/sec), so it made sense to do the heavy work once at write time. But then the celebrity problem hit: when a user with 30 million followers tweets, that is 30 million writes. At peak, this created unsustainable write amplification.

Twitter's solution: a hybrid. Most users get fan-out on write (their tweets are pushed to followers' caches). Celebrities (users with very large follower counts) get fan-out on read — their tweets are fetched and merged at read time. This is a brilliant example of a tradeoff that only becomes visible once you quantify the load parameters.

[Interactive simulation: Twitter Fan-Out Simulator. Adjusts follower count (default 1,000) and tweet rate (default 4,600 tweets/sec) to compare write amplification (fan-out on write) with read amplification (fan-out on read) and find the crossover point where the hybrid approach wins.]

The fan-out number. In any system design interview, ask: "What is the fan-out?" Fan-out is the number of downstream operations triggered by a single incoming operation. Low fan-out (1:1 or 1:10) is easy. High fan-out (1:1,000,000) requires fundamentally different architectures. Twitter's hybrid approach recognizes that fan-out is not uniform across users — it is a heavy-tailed distribution, and you must handle the tail differently.

Worked Example: The Full Twitter Math

Let us work through the numbers that forced Twitter to change their architecture. These are approximate numbers from 2012, the era Kleppmann describes.

// Given:
Active users: 300 million
Average tweets/sec: 4,600 (peak: 12,000)
Timeline reads/sec: 300,000
Average followers per user: ~200
Celebrity followers (top 0.001%): 10-50 million

// Fan-out on Write (precompute timelines):
Average write amplification = 4,600 tweets/sec × 200 followers = 920,000 cache writes/sec
// This is manageable. But then a celebrity tweets:
Celebrity tweet = 1 tweet × 30,000,000 followers = 30,000,000 cache writes
// At 4.6K tweets/sec, ~0.005% are from celebrities, so ~0.23 celebrity tweets/sec
// Each one creates a 30M write burst on top of the 920K baseline

// Fan-out on Read (compute timeline at read time):
Each timeline = look up ~200 followed accounts × fetch recent tweets = ~200 index lookups
Total = 300,000 reads/sec × 200 lookups = 60,000,000 lookups/sec
// 60M lookups/sec is a LOT of random reads, very hard to sustain

// Hybrid solution:
Regular users: fan-out on write (920K writes/sec, no celebrity spikes)
Celebrities: fan-out on read (merged at read time, ~1-5% of timelines affected)
// Best of both worlds: predictable write load + fast reads for most users
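The arithmetic above in runnable form. The constants are the approximate 2012 figures quoted in the worked example. Treating the average follower count as the average number of accounts followed is a simplification carried over from that example.

```python
TWEETS_PER_SEC = 4_600
READS_PER_SEC = 300_000
AVG_FOLLOWERS = 200               # also used as accounts followed per user
CELEBRITY_FOLLOWERS = 30_000_000
CELEBRITY_SHARE = 0.00005         # ~0.005% of tweets come from celebrities

# Fan-out on write: each tweet is copied into every follower's timeline cache.
baseline_writes = TWEETS_PER_SEC * AVG_FOLLOWERS
celebrity_tweets = TWEETS_PER_SEC * CELEBRITY_SHARE
celebrity_writes = celebrity_tweets * CELEBRITY_FOLLOWERS

# Fan-out on read: each timeline load looks up every followed account.
read_lookups = READS_PER_SEC * AVG_FOLLOWERS

print(f"baseline cache writes/sec:       {baseline_writes:,}")
print(f"celebrity tweets/sec:            {celebrity_tweets:.2f}")
print(f"celebrity cache writes/sec, avg: {celebrity_writes:,.0f}")
print(f"fan-out-on-read lookups/sec:     {read_lookups:,}")
```

Averaged over time, the rare celebrity tweets generate about 6.9 million cache writes per second, several times the 920K baseline. That is the quantitative case for handling them separately.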

The hybrid solution is elegant because it exploits the fact that follower count follows a power law distribution. The vast majority of users have a small number of followers — fan-out on write is cheap for them. Only a tiny fraction of users have millions of followers — handling them differently at read time is a small additional cost. This is a pattern you will see repeatedly in distributed systems: handle the common case efficiently, and special-case the outliers.

Fan-Out Beyond Twitter

The fan-out concept applies far beyond social media. Here are examples from other domains:

System	Operation	Fan-Out	Strategy
Google Search	One query	Fans out to 1,000+ index shards	Return results from fastest shards, skip slow ones
CDN invalidation	One cache purge	Propagates to 200+ edge servers globally	Async propagation, accept brief staleness (seconds)
Payment processing	One charge	Triggers: auth, fraud check, ledger write, receipt email, analytics	Critical path (auth+ledger) sync, rest async via queue
CI/CD pipeline	One commit	Triggers: lint, unit tests, integration tests, deploy to staging, canary	Parallel where possible, sequential gates for safety
Ride-sharing	One ride request	Notify 10-50 nearby drivers	Fan-out on write (push to drivers), first-accept wins

In every case, the design question is the same: do you do the work eagerly (fan-out on write) or lazily (fan-out on read)? The answer depends on: (1) the ratio of writes to reads, (2) the acceptable latency for reads, (3) the cost of write amplification, and (4) whether the fan-out is uniform or heavy-tailed.

Design question: A social media platform has 100 million users. The average user has 200 followers, but the top 0.01% (10,000 users) have 10 million followers each. The platform handles 5,000 tweets/sec. If you use pure fan-out-on-write, approximately how many cache writes per second do you need to handle?

1 million writes/sec: just 5,000 × 200 average followers
5,000 writes/sec: one write per tweet
In the typical case, 5,000 × 200 = 1M writes/sec. But celebrity tweets arrive at about 0.5 per second (0.01% of 5K tweets/sec, if the 10K celebrities tweet at the average rate), and each one adds 10M writes. That is roughly 5M extra writes/sec on average, delivered in 10M-write bursts on top of the 1M baseline, so celebrity tweets dominate the load. This is exactly why Twitter uses the hybrid approach

Chapter 3: Describing Performance

Once you have described the load on your system, the next question is: what happens when load increases? Two ways to frame it: (1) keep resources fixed, how does performance degrade? (2) keep performance fixed, how much more resources do you need? Both framings require you to measure performance precisely.

Response Time vs. Latency

People use these interchangeably, but they are different. Response time is what the client sees: the time between sending a request and receiving the response. It includes network transit time, queuing time, and the actual service time. Latency is the time a request spends waiting to be handled — the time the request is "latent," sitting in a queue, awaiting service. Latency is a component of response time, not a synonym for it.

Why Averages Lie

If you are asked "what is the response time of your service?" and you answer with the average (arithmetic mean), you are hiding critical information. Imagine 100 requests: 95 take 10ms, 4 take 100ms, and 1 takes 5,000ms. The average is (95×10 + 4×100 + 1×5000) / 100 = 63.5ms. That number describes no actual user experience: nobody got a 63.5ms response. The typical user saw 10ms. The unlucky user saw 5 seconds.

Percentiles are the right tool. Sort all response times from fastest to slowest:

Percentile	Meaning	Why it matters
p50 (median)	Half of requests are faster than this	The "typical" user experience
p95	95% of requests are faster	The experience for 1 in 20 users
p99	99% of requests are faster	The experience for 1 in 100 users — often your most valuable customers (heavy users)
p999	99.9% of requests are faster	The long tail; diminishing returns to optimize further
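To see the difference on the 100-request example, the sketch below computes the mean and a few percentiles. It uses the nearest-rank definition of a percentile, which is one common convention; monitoring tools may interpolate and report slightly different values.

```python
import math
import statistics

# The 100 requests from the example: 95 fast, 4 slower, 1 very slow.
times_ms = [10] * 95 + [100] * 4 + [5000]

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

print("mean:", statistics.mean(times_ms), "ms")
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(times_ms, p)} ms")
```

The mean comes out at 63.5 ms, while p50 and p95 are both 10 ms, p99 is 100 ms, and p99.9 is 5,000 ms. The percentiles describe what users actually saw; the mean describes nobody.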

Amazon observed that a 100ms increase in response time reduced sales by 1%. They also found that the customers with the slowest response times were often those with the most data in their accounts — they had the most purchases, they were the most valuable customers. That is why Amazon cares about p999, not just p50.

Tail Latency Amplification

Here is where percentiles get dangerous. In a microservices architecture, a single user request often fans out to multiple backend services. If your request touches 10 backend services in parallel, the overall response time is determined by the slowest service. Even if each service has a fast p99, the chance that at least one of 10 calls hits its slow tail grows quickly.

// Probability that ALL 10 calls are fast (under p99):
P(all fast) = 0.99¹⁰ = 0.904

// So ~10% of requests hit at least one slow backend
// The overall p99 of the composite request is worse than p99 of each service

// With 50 backend calls:
P(all fast) = 0.99⁵⁰ = 0.605
// ~40% of requests are slow! Each service's p99 is now only about the p60 of the composite request

This is tail latency amplification. It means that a service with a "good" p99 of 100ms can cause an overall request p99 of 1,000ms+ when you fan out across many services. The more services you call, the worse it gets.
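A short sketch of the same calculation, plus its inverse: how fast each backend's tail must be for the composite request to keep a given percentile. It assumes the backend calls are independent, which real systems only approximate. The function names are illustrative.

```python
def p_any_slow(fanout: int, per_call_fast: float = 0.99) -> float:
    """Chance that at least one of `fanout` independent parallel calls hits its slow tail."""
    return 1 - per_call_fast ** fanout

def per_call_percentile_needed(fanout: int, target: float = 0.99) -> float:
    """Fraction of each backend's calls that must be fast for the composite to meet `target`."""
    return target ** (1 / fanout)

for n in (1, 10, 50):
    print(f"fan-out {n:>2}: {p_any_slow(n):.1%} of requests hit a slow call; "
          f"each backend must be fast at its p{per_call_percentile_needed(n) * 100:.2f}")
```

With a fan-out of 50, keeping the user-facing p99 intact requires every backend to be fast at roughly its 99.98th percentile, a much harder target than p99.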

Measuring Correctly

Measuring response times correctly is surprisingly tricky. Common mistakes:

Coordinated omission. If your load generator sends one request at a time and waits for the response before sending the next, it under-reports tail latency. When the server is slow, the generator slows down too, reducing load exactly when the system is stressed. Gil Tene (Azul Systems) calls this coordinated omission — the measurement tool and the system under test are coordinating to hide the problem. The fix: use a constant-rate load generator that sends requests at a fixed schedule regardless of response times. Tools like wrk2 and HdrHistogram are designed to avoid this.

Averaging percentiles. You cannot average percentiles. If server A has p99 = 50ms and server B has p99 = 500ms, the system p99 is NOT 275ms. You must merge the full response time distributions and compute the percentile from the merged set. In practice, this means your monitoring system must store histograms (like HdrHistogram or Prometheus histograms), not pre-computed percentiles.

Ignoring the client perspective. Server-side metrics miss queuing time, network latency, and client-side processing. Always measure response time from the client's perspective (end-to-end), not just server processing time. The user does not care that your server processed the request in 10ms if the total round trip took 500ms because of a congested network.
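The "averaging percentiles" mistake is easy to demonstrate. The sketch below stores each server's latencies as a histogram (a `Counter` mapping latency to count) and merges the two by adding counts. The sample data is synthetic, chosen so that the per-server p99 values match the 50 ms and 500 ms of the example.

```python
import math
from collections import Counter

def percentile_from_histogram(hist: Counter, p: float) -> int:
    """Nearest-rank percentile from a {latency_ms: count} histogram."""
    total = sum(hist.values())
    rank = math.ceil(p * total / 100)
    seen = 0
    for latency in sorted(hist):
        seen += hist[latency]
        if seen >= rank:
            return latency
    raise ValueError("empty histogram")

# Synthetic data: 1,000 requests per server.
server_a = Counter({20: 985, 50: 15})    # p99 = 50 ms
server_b = Counter({40: 985, 500: 15})   # p99 = 500 ms

p99_a = percentile_from_histogram(server_a, 99)
p99_b = percentile_from_histogram(server_b, 99)
naive = (p99_a + p99_b) / 2

merged = server_a + server_b             # histograms merge by adding counts
correct = percentile_from_histogram(merged, 99)

print(f"average of p99s: {naive} ms, p99 of merged data: {correct} ms")
```

For this data the averaged figure is 275 ms while the true system p99 is 50 ms. With a different traffic mix the error can just as easily go the other way, which is why the merged distribution is the only safe input.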

[Interactive simulation: Percentile Explorer. Draws 1,000 simulated response times from a log-normal distribution; the tail weight (default 0.70) and backend fan-out (default 1) controls show how percentiles change and how tail latency amplifies.]

Worked Example: Amazon's Tail Latency

Amazon observed that customers with the most purchases (their most valuable customers) tended to have the slowest response times. Why? Because these customers had more data in their accounts — more order history, more recommendations to compute, more wish lists to merge. The very customers Amazon most wanted to keep happy were the ones most likely to hit the slow tail.

This is why Amazon optimizes for p999 (the slowest 1-in-1000 request), not just p50. Let us do the math for their checkout page, which calls multiple backend services:

// Amazon checkout page backend calls (simplified):
1. User service (account details): p99 = 30ms
2. Cart service (items in cart): p99 = 20ms
3. Inventory service (stock check): p99 = 50ms
4. Pricing service (discounts, tax): p99 = 40ms
5. Recommendation service: p99 = 80ms
6. Shipping service (delivery estimates): p99 = 60ms

// If called in parallel, overall p99 = ?
// P(all 6 under p99) = 0.99^6 = 0.941
// So ~6% of checkout requests hit at least one slow backend
// The effective p99 of the checkout page is dominated by the
// slowest call. If the inventory service alone spikes to 2000ms
// beyond its p99, ~1% of checkout pages take 2+ seconds; if all
// six services have tails like that, ~6% do.

// Jeff Dean's "Tail at Scale" solution: hedged requests
// If a backend does not respond within the p95 latency,
// immediately send a duplicate request to a different replica.
// First response wins. This dramatically reduces tail latency
// at the cost of ~5% more backend load.
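A hedged request can be sketched with `asyncio`. This is a minimal illustration rather than a production client: the replicas are simulated coroutines, the 50 ms hedge delay stands in for the measured p95 latency, and error handling is omitted.

```python
import asyncio

async def hedged_request(replicas, hedge_after_s):
    """Call replicas[0]; if it has not answered within hedge_after_s, also call replicas[1].
    Return the first result to arrive and cancel the slower call."""
    primary = asyncio.create_task(replicas[0]())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()  # answered in time, no duplicate sent
    backup = asyncio.create_task(replicas[1]())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()            # stop paying for the loser
    return done.pop().result()

async def slow_replica():
    await asyncio.sleep(0.5)     # simulated tail-latency hit
    return "slow replica"

async def fast_replica():
    await asyncio.sleep(0.01)
    return "fast replica"

hedged = asyncio.run(hedged_request([slow_replica, fast_replica], hedge_after_s=0.05))
unhedged = asyncio.run(hedged_request([fast_replica, slow_replica], hedge_after_s=0.05))
print(hedged, "|", unhedged)
```

Because the duplicate is sent only after the hedge delay expires, the extra backend load is bounded by the fraction of requests slower than that delay, which is where the roughly 5% figure comes from when the delay is set at p95.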

SLOs and SLAs. A Service Level Objective (SLO) is an internal target: "p99 response time under 200ms." A Service Level Agreement (SLA) is a contract with customers, usually with financial penalties: "99.9% of requests under 500ms, or we refund." SLAs are typically less aggressive than SLOs because you need margin. Always set SLOs tighter than SLAs. In interviews, state your SLO explicitly and explain the percentile — "Our SLO is p99 latency under 200ms" is a complete, professional statement.

Debug scenario: Your service has a p50 of 15ms and a p99 of 800ms. A single user request calls 5 backend services in parallel. Approximately what percentage of user-facing requests will experience a response time of 800ms or more?

About 5%: the probability that at least one of 5 backends hits its p99 is 1 - 0.99^5 = 0.049, so roughly 1 in 20 user requests sees the tail
About 1%: the p99 is by definition 1%
About 0.2%: you divide 1% by 5 services

Chapter 4: Scaling Up and Scaling Out

You have measured your load, you have measured your performance. Load is growing. What do you do?

Vertical Scaling (Scaling Up)

Scaling up means moving to a more powerful machine: more CPU cores, more RAM, faster disks, better network. It is the simplest approach — your software does not change at all, you just give it more horsepower. A single high-end server with 256 cores and 2TB of RAM can handle a remarkable amount of work.

The problem: cost grows super-linearly. At the high end, a machine with twice the CPU, RAM, and disk typically costs significantly more than twice as much. And there is a hard ceiling: you cannot buy a machine with infinite resources. Eventually you hit the limit of what a single machine can do.

Horizontal Scaling (Scaling Out)

Scaling out means distributing the load across multiple smaller machines. Instead of one giant server, you have 20 commodity servers behind a load balancer. Cost grows linearly (20 servers cost 20x one server), and there is no hard ceiling — you can always add more machines.

The problem: complexity. A shared-nothing architecture (where each machine is independent and coordinates via network messages) requires your software to handle distribution explicitly. How do you split data across machines? How do you route requests? What happens when one machine fails? How do you keep data consistent across machines? These are the questions that fill the remaining chapters of DDIA.

Worked Example: The Numbers

Let us make this concrete with real cloud pricing (approximate, 2024 numbers):

Approach	Configuration	Cost/month	Capacity
Vertical	1x r6g.16xlarge (512 GB RAM, 64 vCPU)	~$6,500	~50K req/s
Horizontal	8x r6g.2xlarge (64 GB RAM, 8 vCPU each)	~$6,400	~60K req/s + redundancy
Vertical 2x	1x r6g.metal (512 GB, 64 vCPU + bare metal)	~$9,800	~55K req/s
Horizontal 2x	16x r6g.2xlarge	~$12,800	~120K req/s + redundancy

At 1x load, vertical and horizontal cost about the same. At 2x load, vertical jumps 50% in cost for only 10% more capacity (you are buying expensive headroom). Horizontal doubles in cost but also doubles in capacity, and you get redundancy for free — if one of 16 machines dies, you still have 15.
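
One way to compare the four rows is cost per unit of capacity. The sketch below only reuses the approximate figures from the table above; the helper name is made up for illustration.

```python
# Monthly cost (USD) and capacity (req/s), copied from the table above.
# Both columns are rough estimates, so treat the output as a comparison aid only.
options = {
    "Vertical": (6_500, 50_000),
    "Horizontal": (6_400, 60_000),
    "Vertical 2x": (9_800, 55_000),
    "Horizontal 2x": (12_800, 120_000),
}


def dollars_per_1k_rps(monthly_cost: float, capacity_rps: float) -> float:
    """Monthly cost of serving 1,000 requests per second."""
    return monthly_cost / (capacity_rps / 1_000)


for name, (cost, capacity) in options.items():
    print(f"{name:<14} ${dollars_per_1k_rps(cost, capacity):7.2f} per 1K req/s per month")
```

With these figures the horizontal unit cost stays flat as you grow, while the vertical unit cost rises.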

The crossover point varies by workload, but a common rule of thumb: scale vertically until you hit ~$10K/month or the largest available instance, whichever comes first. After that, horizontal scaling is almost always more economical and more resilient.

Elastic Scaling

Elastic scaling means the system automatically adds or removes machines based on detected load. More traffic? Spin up more servers. Traffic drops? Shut them down and stop paying for them. Cloud platforms (AWS Auto Scaling, Kubernetes HPA) make this feasible, but it requires your application to be designed for it: stateless request handling, externalized session state, and fast startup times.

Elastic scaling has a critical subtlety: cold start time. If your application takes 30 seconds to boot (loading models, warming caches, establishing database connections), and a traffic spike arrives, you have 30 seconds of degraded service before the new machines are ready. This is why many teams "pre-warm" capacity: keep a few extra machines running at all times and only scale down conservatively. The cost of a few idle machines is small compared to the cost of a 30-second outage during Black Friday.
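
To make the pre-warming idea concrete, here is a minimal sketch of a scaling decision. The core ratio follows the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)); the function name and the specific numbers are assumptions, and the real autoscaler adds tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # e.g. average CPU utilization in percent
    target_metric: float,    # utilization you want each replica to run at
    min_replicas: int,       # pre-warmed floor that is never scaled below
    max_replicas: int,
) -> int:
    """Proportional scaling decision, clamped to a pre-warmed minimum and a maximum."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Traffic spike: 4 replicas at 90% CPU with a 60% target means scaling out to 6.
print(desired_replicas(4, 90, 60, min_replicas=3, max_replicas=20))
# Quiet period: the formula alone would suggest 1 replica, but the floor keeps 3 warm.
print(desired_replicas(4, 10, 60, min_replicas=3, max_replicas=20))
```

The `min_replicas` floor is the pre-warmed capacity: you pay for it during quiet periods so that a spike does not have to wait for a cold start.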

Scaling Decision Matrix

Here is a decision framework for when to use each scaling approach:

Scenario	Best Approach	Why
Single DB, load < 10K req/s	Vertical scaling	Simpler, no distributed complexity, cheaper at low scale
Read-heavy workload (90%+ reads)	Read replicas	Add replicas for reads, keep single primary for writes. Linear read scaling.
Stateless web servers	Horizontal + elastic	Trivial to scale; add/remove behind load balancer based on CPU/memory
Write-heavy workload	Sharding	Split data by key, each shard handles a fraction of writes. Most complex option.
Unpredictable traffic spikes	Elastic + pre-warming	Auto-scale up fast, keep minimum capacity for cold-start protection
Global users, low latency	Multi-region	Data close to users. CDN for static, regional replicas for dynamic.

The scalability interview pattern. When an interviewer asks "How does your system scale?", they want to hear three things: (1) what is the bottleneck today (CPU? memory? disk I/O? network?), (2) what is the first scaling step (usually vertical or read replicas), and (3) what is the long-term strategy when the first step is exhausted (sharding, queuing, caching). Show that you will take the simplest step first and add complexity only when forced. "We start with a single big machine, add read replicas when reads exceed capacity, and shard only when writes exceed single-machine throughput."

The Reality: It Depends

Stateless services (web servers, API gateways, compute workers) are easy to scale horizontally. They hold no data — just add another server behind the load balancer and route requests to it. This is why the first thing you do in a system design interview is separate stateless and stateful components.

Stateful services (databases, caches with important data, coordination services) are hard to scale horizontally. This is the entire reason DDIA exists as a 500-page book. Splitting a database across multiple machines (sharding) introduces consistency problems, rebalancing complexity, cross-shard queries, and distributed transactions. We will cover all of these in later chapters.

The pragmatic answer. In practice, most systems use a mix. Stateless application servers scale horizontally behind a load balancer. The database scales vertically first (bigger machine is simpler), then eventually scales out (sharding, read replicas) when vertical limits are reached. Caches scale horizontally (consistent hashing across multiple Redis nodes). This hybrid approach gives you the simplicity of vertical scaling where it works and the capacity of horizontal scaling where you need it.
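
The consistent hashing mentioned for caches can be sketched in a few lines. This is a toy ring, not how any particular Redis client implements it; the class name, the node names, and the choice of MD5 with 100 virtual nodes per server are all assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self.vnodes = vnodes
        self._points = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns many small arcs of the ring so load spreads evenly.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if p[1] != node]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key, wrapping around.
        idx = bisect.bisect_left(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))
```

The property that matters: when a node is removed, only the keys that lived on that node move. Every other key keeps its placement, so most of the cache stays warm.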

Scaling Cost Curves

Drag the load slider to see how costs compare between vertical scaling (bigger machine) and horizontal scaling (more machines). Horizontal has constant per-unit cost but adds coordination overhead.

// Vertical scaling cost (super-linear):
Cost_vertical = base_cost × load^1.6

// Horizontal scaling cost (linear + overhead):
Cost_horizontal = (base_cost × load) + (coordination_overhead × load × log(load))

// Crossover: horizontal becomes cheaper around 4-8x the initial load
// Before that, vertical is simpler and cheaper
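
The two cost formulas above can be evaluated directly to find the crossover. The formulas leave `coordination_overhead` and the base of the logarithm unspecified, so this sketch assumes `coordination_overhead` equals `base_cost` and uses the natural logarithm; with other values the crossover moves.

```python
import math


def vertical_cost(load: float, base_cost: float = 1.0, exponent: float = 1.6) -> float:
    """Super-linear cost of one ever-bigger machine."""
    return base_cost * load ** exponent


def horizontal_cost(load: float, base_cost: float = 1.0, coordination_overhead: float = 1.0) -> float:
    """Linear machine cost plus a coordination term that grows as load * ln(load)."""
    return base_cost * load + coordination_overhead * load * math.log(load)


def crossover_load(max_load: int = 64):
    """Smallest integer load multiplier at which horizontal is cheaper, or None."""
    for load in range(2, max_load + 1):
        if horizontal_cost(load) < vertical_cost(load):
            return load
    return None


print(crossover_load())
```

Under these assumptions horizontal scaling becomes cheaper at about 5x the initial load, which falls inside the 4-8x range quoted above.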

System design: You are building a chat application. Messages are stored in a database, and users expect to see messages within 1 second of sending. Your current single PostgreSQL server handles the load, but traffic is growing 3x per year. What is your scaling strategy for the next 3 years?

Immediately shard the database across 10 machines to prepare for growth
Year 1: scale vertically (bigger PostgreSQL instance), which is simpler while the current load is manageable. Year 2: add read replicas for read-heavy queries (timeline reads), keep writes on primary. Year 3: if write volume exceeds single-machine capacity, introduce sharding by user ID. Each step adds complexity, so delay it until necessary. Meanwhile, scale the stateless chat servers horizontally from day one
Switch from PostgreSQL to a NoSQL database that scales horizontally by default

Chapter 5: Maintainability

Here is an uncomfortable truth: the majority of the cost of software is not in the initial development. It is in the ongoing maintenance — fixing bugs, keeping it running, adapting it to new requirements, repaying technical debt, adding features. Most engineers do not like maintaining legacy systems. But the reality is that most engineering work is maintenance work, not greenfield development.

Kleppmann identifies three design principles that minimize maintenance pain:

1. Operability

Operability means making it easy for operations teams to keep the system running smoothly. A system with good operability provides:

Good monitoring that provides visibility into runtime behavior and system health
Good support for automation and integration with standard tools
Avoiding dependency on individual machines (so machines can be taken down for maintenance)
Good documentation and an intuitive operational model ("if I do X, Y will happen")
Self-healing where possible, but giving administrators manual control when needed
Predictable behavior — minimizing surprises

Think of it this way: at 3 AM when something breaks, is your system helpful or hostile? Does it tell you what is wrong, or do you have to guess? Can you restart a component without taking down the whole system? Can a new team member understand the operational runbook in a day?

2. Simplicity

Simplicity means removing accidental complexity. Accidental complexity is complexity that is not inherent in the problem but arises from the implementation: tight coupling between modules, tangled dependencies, inconsistent naming, hacks that worked around problems without solving them. The extreme case is the Big Ball of Mud anti-pattern: a system with no discernible structure, where every change requires understanding the whole thing.

The best tool for removing accidental complexity is abstraction. A good abstraction hides implementation details behind a clean interface. SQL is an abstraction — you describe what data you want, not how to get it. Programming languages are abstractions over machine code. TCP is an abstraction over unreliable networks. Good abstractions are force multipliers: they let you reason about the system at a higher level without getting lost in details.

3. Evolvability

Evolvability (also called extensibility, modifiability, or plasticity) means making it easy to change the system in the future. Requirements change. You discover new facts. Regulations shift. New features are requested. A system with good evolvability can accommodate these changes without requiring a rewrite.

Evolvability is related to simplicity: simple systems are easier to change than complex ones. It is also related to good abstractions: if your system is organized into well-defined modules with clear interfaces, you can replace or modify one module without affecting others. Agile practices (test-driven development, refactoring, frequent releases) are techniques for achieving evolvability at the team process level.

A concrete example: imagine your application uses MySQL as its database. If the database is accessed through a clean repository interface (e.g., UserRepository.findById(id)), switching to PostgreSQL or DynamoDB requires changing only the repository implementation. If SQL queries are scattered throughout the codebase (in controllers, in templates, in background jobs), a database migration becomes a multi-month rewrite. The abstraction boundary determines the cost of change.
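
Here is a minimal sketch of that abstraction boundary. The `UserRepository` name comes from the example above; everything else (the `User` type, the two implementations, the `greeting` function) is hypothetical, and SQLite stands in for MySQL or PostgreSQL so the example is self-contained.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class User:
    id: int
    name: str


class UserRepository(Protocol):
    """The only interface application code is allowed to depend on."""

    def find_by_id(self, user_id: int) -> Optional[User]: ...


class InMemoryUserRepository:
    def __init__(self, users):
        self._users = {u.id: u for u in users}

    def find_by_id(self, user_id: int) -> Optional[User]:
        return self._users.get(user_id)


class SqliteUserRepository:
    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def find_by_id(self, user_id: int) -> Optional[User]:
        row = self._conn.execute(
            "SELECT id, name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return User(*row) if row else None


def greeting(repo: UserRepository, user_id: int) -> str:
    """Application code: it never sees SQL or knows which store is behind `repo`."""
    user = repo.find_by_id(user_id)
    return f"Hello, {user.name}!" if user else "Hello, stranger!"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "Ada"))

print(greeting(InMemoryUserRepository([User(1, "Ada")]), 1))
print(greeting(SqliteUserRepository(conn), 1))
```

Swapping the storage engine means writing one new class that satisfies `UserRepository`. The `greeting` function, and everything else on its side of the boundary, does not change.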

Kleppmann makes the point that we should design systems with the expectation that they will need to change. Requirements shift. Technologies evolve. What seemed like the right database choice three years ago may not be the right choice today. If your system is evolvable, adapting to change is routine. If it is not, every change is a crisis.

The maintainability test. Imagine a new engineer joins your team tomorrow. How long until they can: (1) understand the system architecture? (2) deploy a change safely? (3) debug a production issue at 3 AM? If the answer to any of these is "months," you have a maintainability problem. In interviews, always discuss how your design enables future engineers to work effectively — this signals staff-level thinking.

The Big Ball of Mud

The Big Ball of Mud is a software architecture anti-pattern identified by Brian Foote and Joseph Yoder in 1997. It describes a system with no discernible structure — a haphazardly assembled pile of code where every module depends on every other module. Changing anything requires understanding everything. Adding a feature to the payment system breaks the notification system because they share global mutable state through an undocumented side channel.

How does it happen? Never intentionally. It happens one shortcut at a time. A deadline pressure leads to a quick hack instead of a proper abstraction. The hack works, so nobody fixes it. Another feature builds on top of the hack. A third feature adds a workaround for a bug in the second feature. Six months later, the team has a system where the only person who understands module X is the person who wrote it — and they left the company.

The cost is measurable. A 2018 Stripe report (The Developer Coefficient) estimated that developers spend about 42% of their time on maintenance work such as debugging and dealing with technical debt, and that the global cost of developer time lost to bad code is about $85 billion per year. The business impact is not just developer happiness. It is speed: a company with a well-maintained codebase ships features in days, while a company drowning in technical debt ships the same features in months.

Abstraction: The Antidote

The cure for accidental complexity is abstraction. A good abstraction lets you ignore irrelevant details and focus on what matters. Here are abstractions you use every day without thinking about them:

Abstraction	Hides	Exposes
SQL	How data is stored on disk, which indexes exist, the query execution plan	A declarative language: "give me rows where X"
TCP	Packet loss, reordering, duplication, network congestion	A reliable, ordered byte stream between two endpoints
File system	Disk sectors, block allocation, wear leveling on SSDs	Named files and directories with read/write operations
REST API	The database schema, the caching layer, the message queue	Resources with CRUD operations and JSON responses

Each of these abstractions converts a complex, leaky reality into a simple, clean interface. When you design a data system, your most important job is choosing or creating the right abstractions. The right abstraction at the right boundary turns a Big Ball of Mud into a set of clean, replaceable components.

Technical Debt Simulator

The simulation below models the cost of adding features to a system over time. A well-maintained system (good abstractions, clean code, automated tests) has roughly constant feature development cost. A poorly maintained system accumulates technical debt: each shortcut makes the next feature harder, until eventually the team spends most of its time fighting the system rather than improving it.

Maintainability Cost Curve

Watch how feature development velocity diverges between a well-maintained and a poorly-maintained system. The "debt ratio" slider controls how much technical debt accumulates per feature.

Scenario: Your team ships a feature by copy-pasting and modifying an existing module instead of refactoring the shared logic into an abstraction. This saves 2 weeks now. Six months later, a bug is found in the shared logic. What is the maintenance cost?

You must fix the bug in EVERY copy: the original module, the copy, and any subsequent copies. Each copy may have diverged, so the fix is slightly different for each. Testing must cover all variants. If you miss one copy, the bug persists and customers find it. The 2 weeks "saved" has compounded into weeks of debugging across scattered code. This is accidental complexity and technical debt in action
Just fix it in one place since the logic is the same
The bug probably will not affect the copy since it was modified

Chapter 6: Thinking About Data Systems

Time to put it all together. You are designing a data system. Not in theory — right now, in this simulation. You will pick components, set SLA targets, and watch your system handle load. Then we will break it.

Here is the scenario: you are building a backend for an e-commerce platform. Customers browse products (read-heavy), add items to carts (write-heavy, must not lose data), and checkout (critical path, must be fast and reliable). You need a database for product catalog and orders, a cache for fast reads, and a message queue for order processing.

The simulation below lets you configure each component and observe the tradeoffs in real time. Set your SLA (availability and p99 latency), then inject failures and traffic spikes to see if your system holds up.

Data System Design Simulator

Configure your system, set an SLA, then hit "Run Simulation" to observe behavior under load. Use "Inject Failure" to take down a component and see how the system degrades.

Simulation settings (as shown): traffic 1,000 req/s, cache hit rate 90%, DB capacity 2,000 req/s, SLA p99 latency 200ms.

The fundamental lesson. When you combine a database, a cache, and a message queue in your application, YOU become the data system designer. Your application code is the glue that defines the guarantees. The cache improves read latency but introduces stale data. The message queue decouples producers from consumers but adds delivery delay. The database provides durability but has limited throughput. Every architectural choice is a tradeoff. Kleppmann's entire book is about understanding these tradeoffs rigorously.

Understanding the Simulation

Let us break down what is happening inside the simulation so you understand the model:

// Traffic flow:
Total requests = traffic slider (req/sec)
Cache hits = traffic × cache_hit_rate% (served in ~2-7ms)
Cache misses = traffic × (1 - cache_hit_rate%) (go to database)

// Database behavior:
DB load = cache misses (req/sec)
If DB load ≤ 80% of DB capacity: latency = 10-30ms (normal)
If DB load is between 80% and 100% of DB capacity: latency increases (congestion)
If DB load > 100% of DB capacity: requests start failing (overload)

// The key equation:
DB load = traffic × (1 - cache_hit_rate / 100)
// To keep DB under capacity:
cache_hit_rate > (1 - DB_capacity / traffic) × 100

// Example: traffic = 10,000, DB capacity = 2,000
// Minimum cache hit rate = (1 - 2000/10000) × 100 = 80%

This model is simplified (real systems have connection pools, queue depths, retry logic), but it captures the essential dynamic: the cache shields the database from the full force of traffic. Without a cache, your database capacity IS your traffic ceiling. With a cache, your ceiling is DB_capacity / (1 - hit_rate).
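
The key equation translates directly into code. This sketch expresses the hit rate as a fraction between 0 and 1 rather than a percentage; the function names are illustrative and it models only the equations above, not the simulation's latency behavior.

```python
def db_load(traffic: float, hit_rate: float) -> float:
    """Requests per second that miss the cache and reach the database."""
    return traffic * (1.0 - hit_rate)


def min_hit_rate(traffic: float, db_capacity: float) -> float:
    """Smallest cache hit rate that keeps the database at or under capacity."""
    return max(0.0, 1.0 - db_capacity / traffic)


def traffic_ceiling(db_capacity: float, hit_rate: float) -> float:
    """Highest traffic the system can take before the database is overloaded."""
    if hit_rate >= 1.0:
        return float("inf")
    return db_capacity / (1.0 - hit_rate)


print(f"DB load at 5,000 req/s, 95% hits: {db_load(5_000, 0.95):.0f} req/s")
print(f"Min hit rate for 10,000 req/s on a 2,000 req/s DB: {min_hit_rate(10_000, 2_000):.0%}")
print(f"Traffic ceiling with 95% hits: {traffic_ceiling(2_000, 0.95):,.0f} req/s")
```

Setting `hit_rate` to 0 models a cache outage: the full traffic lands on the database, which is why the ceiling collapses back to `db_capacity`.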

Play with the simulation above. Here are some experiments to try:

The cache cliff. Set traffic to 5,000 req/s with DB capacity of 2,000 req/s and cache hit rate of 50%. Watch the DB get overwhelmed (50% of 5,000 = 2,500 > 2,000 capacity). Now raise cache hit rate to 95% — the DB only sees 250 req/s, well within capacity. The cache is doing 95% of the work.
Failure injection. Set everything to comfortable levels, then inject a failure. If the cache fails, ALL traffic hits the database. If you were relying on a 95% cache hit rate, your DB suddenly sees 20x the load (100% of traffic instead of 5%). This is closely related to the thundering herd (or cache stampede) problem: the cache failure overloads the database, and the database failure becomes a total outage.
Tight SLA. Set p99 SLA under 100ms and see how little headroom you have. Cache latency is 2-7ms (fine), but any DB congestion pushes p99 over the limit. Tight SLAs demand high cache hit rates AND low DB utilization.
The capacity equation. With 10,000 req/s and 2,000 DB capacity, you need at least 80% cache hit rate. Set it to 79% and watch the system degrade. Set it to 81% and watch it stabilize. The boundary is sharp — there is almost no graceful degradation between "fine" and "overwhelmed."

Why caches are not optional. In the simulation, removing the cache (setting hit rate to 0%) makes the database the bottleneck at any meaningful traffic level. This is why nearly every data-intensive application uses a cache layer. But caches introduce their own problems: stale data, cache invalidation (famously one of the "two hard things in computer science," alongside naming things), cold start after a restart, and the thundering herd on cache failure. Every solution creates new problems. That is the nature of data systems engineering.

The SLA budget game. In real systems, you have an "error budget." If your SLA is 99.9% availability (8.76 hours of downtime per year), you can spend that budget deliberately: deploying risky changes, running chaos engineering experiments, doing maintenance. Once the budget is consumed, you freeze changes and focus on reliability. Google popularized this concept in the SRE book. It turns reliability from a vague aspiration into a quantitative engineering practice.
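
The error budget is simple arithmetic. A minimal sketch, assuming a 365-day year and downtime as the only thing that consumes budget; the function names and the freeze rule are illustrative, not taken from the SRE book.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(availability: float) -> float:
    """Hours of downtime per year that an availability target allows."""
    return (1.0 - availability) * HOURS_PER_YEAR


def budget_remaining_hours(availability: float, downtime_so_far_hours: float) -> float:
    return downtime_budget_hours(availability) - downtime_so_far_hours


def can_ship_risky_change(availability: float, downtime_so_far_hours: float) -> bool:
    """Illustrative policy: freeze risky changes once the budget is spent."""
    return budget_remaining_hours(availability, downtime_so_far_hours) > 0


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} availability allows {downtime_budget_hours(target):.2f} hours/year of downtime")
```

Each extra nine cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% changes how a team is allowed to operate, not just what the dashboard says.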

Chapter 7: Real-World Tradeoffs

Theory is necessary but not sufficient. Let us look at how real companies made the tradeoffs we have been discussing. Every one of these decisions started with a specific load parameter, a specific failure mode, and a specific business requirement.

Case Study 1: Amazon Shopping Cart (Availability over Consistency)

Amazon's shopping cart must always be available. If a customer adds an item to their cart and the system says "sorry, try again later," Amazon loses the sale. The 2007 Dynamo paper made this explicit: the shopping cart sacrifices consistency for availability. If two data centers disagree about what is in the cart, the system merges both versions (potentially adding items that were removed) rather than rejecting the write. It is better to occasionally have a phantom item in the cart (customer removes it at checkout) than to ever lose an item the customer added.
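
The "phantom item" behavior falls out of the merge rule. The sketch below reduces the idea to a set union of two divergent cart versions. It is a deliberate simplification: the Dynamo paper describes vector clocks for detecting conflicting versions and leaves the merge to the application, and none of that machinery is modeled here.

```python
def merge_carts(version_a: set, version_b: set) -> set:
    """Resolve two divergent cart versions by keeping every item seen in either."""
    return version_a | version_b


base_cart = {"book", "headphones"}

# During a partition, the customer's requests land on two different replicas.
replica_1 = base_cart - {"headphones"}   # customer removed the headphones here
replica_2 = base_cart | {"charger"}      # customer added a charger here

merged = merge_carts(replica_1, replica_2)
print(sorted(merged))
```

The added charger survives, which is the guarantee Amazon cares about. The removed headphones come back, which is the price: a union cannot tell "never removed" from "removed on the other replica."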

The business reasoning: a false positive (extra item in cart) costs Amazon a brief moment of customer confusion. A false negative (lost item from cart) costs Amazon a sale. The asymmetry in business cost drives the technical tradeoff.

Case Study 2: Google Search (Latency over Completeness)

When you search Google, your query fans out to thousands of index servers. Some servers may be slow or unresponsive. Google's response: return results within 200ms even if not all index servers have responded. The user gets 98% of the results instantly rather than 100% of the results in 2 seconds. Most users never notice the missing 2% because the most relevant results are on the fastest servers (Google optimizes for this).
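
The deadline pattern can be sketched with `asyncio`. This is a toy model under stated assumptions: the shard names, delays, and the 200ms deadline are invented for the example, and it says nothing about how Google's serving stack is actually built.

```python
import asyncio


async def query_shard(shard_id: int, delay_s: float) -> str:
    """Stand-in for a network call to one index server."""
    await asyncio.sleep(delay_s)
    return f"results-from-shard-{shard_id}"


async def search_with_deadline(shard_delays, deadline_s: float):
    """Fan out to every shard, then return whatever has arrived by the deadline."""
    tasks = [asyncio.create_task(query_shard(i, d)) for i, d in enumerate(shard_delays)]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # stop waiting on stragglers
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    return sorted(task.result() for task in done), len(pending)


# Two fast shards and one straggler that would take 5 seconds.
results, missing = asyncio.run(search_with_deadline([0.01, 0.02, 5.0], deadline_s=0.2))
print(results, f"(gave up on {missing} shard)")
```

The caller gets a response at roughly the deadline no matter how slow the slowest shard is. Completeness becomes a best-effort property instead of a blocking requirement.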

The tradeoff: completeness vs. latency. Google chooses latency because user studies show that slower results cause users to search less, even if the results are better. A 400ms delay reduces search volume by 0.6%. Fast-but-incomplete beats slow-but-complete.

Case Study 3: Banking (Consistency over Availability)

When you transfer $1,000 from checking to savings, the bank must guarantee that the money is either in checking OR savings, never in both, never in neither. This is what ACID transactions provide, atomicity in particular: the database rejects or rolls back the whole operation rather than leave the data in an inconsistent state. If the system is partitioned and cannot guarantee consistency, it becomes unavailable rather than risk double-spending or losing money.

The tradeoff: availability vs. consistency. Banks choose consistency because the cost of inconsistency (losing money, regulatory violations, fraud) vastly exceeds the cost of brief unavailability (customer waits a few seconds, tries again).

Case Study 4: Social Media (It Depends on the Feature)

Social media platforms are interesting because different features have different tradeoff profiles on the same platform:

Feature	Tradeoff	Why
Likes/comments	Availability over consistency (eventual consistency)	If a like count is off by 1 for a few seconds, nobody cares
Direct messages	Consistency over latency	Messages must arrive in order and none must be lost
Payments (creator fund)	Strong consistency	Money must be exactly right, always
Feed ranking	Latency over completeness	Show the best-guess feed fast rather than the perfect feed slow
Account deletion	Consistency over speed	Regulatory (GDPR): deletion must be complete and verifiable

Case Study 5: Netflix (Availability over Perfection)

Netflix serves 230 million subscribers across 190 countries. Their philosophy: it is better to show you something than to show you nothing. If the recommendation engine is slow or down, Netflix falls back to showing popular content in your region. If personalized artwork cannot be generated, it shows the default poster. If one microservice is failing, circuit breakers isolate it and the rest of the system continues.
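
The fallback idea can be sketched as a small circuit breaker. This is a simplified illustration, not Netflix's implementation: the class, the threshold, and the function names are assumptions, and a production breaker would also include a timed "half-open" state that periodically retries the primary.

```python
class CircuitBreaker:
    """Serve a fallback when the primary fails; stop calling the primary after repeated failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()  # do not even try the failing dependency
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        self.consecutive_failures = 0
        return result


def personalized_recommendations():
    raise TimeoutError("recommendation engine timed out")  # simulated outage


def popular_in_region():
    return ["popular-title-1", "popular-title-2"]


breaker = CircuitBreaker(failure_threshold=3)
for _ in range(5):
    print(breaker.call(personalized_recommendations, popular_in_region))
```

Every call returns something the user can watch. After three failures the breaker opens, and the broken service stops receiving traffic it cannot handle.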

Netflix pioneered chaos engineering with Chaos Monkey (randomly kills production servers) and later Chaos Kong (simulates the failure of an entire AWS region). The philosophy: if you have not tested a failure mode in production, you do not know if your system survives it. By breaking things intentionally and regularly, Netflix ensures their fallback paths actually work.

The tradeoff: graceful degradation over correctness. A slightly wrong recommendation is better than no content at all. This only works because Netflix's core product (video streaming) can tolerate imprecision. You would not want your bank to show you "approximately your balance."

The interview pattern. When asked "What consistency model would you use?" the answer is always "It depends on the feature." Show the interviewer you understand that a single application has multiple consistency requirements by listing 2-3 features with different needs. This demonstrates systems thinking, not textbook regurgitation.

The Tradeoff Framework

Every system design decision can be framed as a choice between competing goods. Here is a framework for thinking through any tradeoff:

1. Identify the competing properties

What are you trading? Consistency vs. availability? Latency vs. completeness? Simplicity vs. performance? Cost vs. reliability?

↓

2. Quantify the business cost of each side

What is the dollar cost of 1 minute of downtime? What is the dollar cost of showing stale data? What is the cost of a 100ms increase in latency? Amazon found: 100ms latency = 1% sales loss.

↓

3. Choose based on asymmetry

Almost always, one side is much more expensive than the other. The shopping cart: losing an item (cost: lost sale) vs. extra item (cost: minor confusion). The asymmetry makes the decision obvious.

↓

4. Implement different tradeoffs for different features

Not every feature in your application needs the same guarantees. Payments need strong consistency. Like counts need availability. Profile pictures can be eventually consistent.

Putting Numbers to Tradeoffs

Abstract tradeoffs become concrete when you attach dollar amounts. Here is how real companies quantify them:

Company	Metric	Cost	Source
Amazon	100ms added latency	1% revenue loss (~$5B/year at current revenue)	Greg Linden, 2006
Google	500ms added to search	20% fewer searches per user	Marissa Mayer, 2006
Walmart	1 second faster page load	2% increase in conversion	Walmart Labs, 2012
Amazon (AWS)	1 hour of S3 downtime	~$150M in affected commerce	2017 S3 outage estimate
Facebook	6 hours of total outage	~$100M in lost ad revenue	2021 BGP outage
Robinhood	Trading outage during volatility	$70M FINRA fine + class-action settlement	2020 outage, 2021 fine

These numbers change the conversation. When a product manager says "just make it work," you can respond: "We can achieve 99.9% availability for $X/month (3 replicas), or 99.99% for $3X/month (5 replicas across 3 zones). Based on our revenue, each nine of availability is worth $Y/year. Which level makes business sense?" This is engineering leadership: turning vague requirements into quantified tradeoffs.

The cost of downtime formula.
Cost per minute of downtime = (annual revenue / 525,600 minutes) × impact factor

Impact factor depends on the service: a checkout page has impact factor ~1 (lost revenue = revenue rate). A search page has impact factor ~0.1 (users retry later). An internal tool has impact factor ~0.01 (employees wait).

For a company with $1B annual revenue, 1 minute of checkout downtime costs ~$1,900. One hour costs ~$114,000. Four nines (about 4.4 minutes of downtime per month) costs roughly $8,300/month in expected lost revenue.
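
The downtime formula is easy to turn into a small calculator. The function names are illustrative, and the impact factors are the rough values suggested above rather than measured data.

```python
MINUTES_PER_YEAR = 525_600


def downtime_cost_per_minute(annual_revenue: float, impact_factor: float = 1.0) -> float:
    """Revenue lost per minute of downtime for a service with the given impact factor."""
    return annual_revenue / MINUTES_PER_YEAR * impact_factor


def expected_monthly_downtime_minutes(availability: float) -> float:
    return (1.0 - availability) * MINUTES_PER_YEAR / 12


def expected_monthly_cost(annual_revenue: float, availability: float, impact_factor: float = 1.0) -> float:
    minutes = expected_monthly_downtime_minutes(availability)
    return minutes * downtime_cost_per_minute(annual_revenue, impact_factor)


revenue = 1_000_000_000
print(f"Checkout, per minute: ${downtime_cost_per_minute(revenue):,.0f}")
print(f"Checkout, per hour:   ${downtime_cost_per_minute(revenue) * 60:,.0f}")
print(f"Search, per minute:   ${downtime_cost_per_minute(revenue, impact_factor=0.1):,.0f}")
print(f"Four nines, expected monthly cost: ${expected_monthly_cost(revenue, 0.9999):,.0f}")
```

Running the same numbers at 99.9% and 99.99% gives you the dollar value of the extra nine, which you can compare against the cost of the extra replicas.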

Design Challenge

Tradeoff Mapper

Click on each system to see where it sits on the availability-consistency spectrum and why. Each position represents a deliberate engineering decision driven by business requirements.

Design challenge: Flight booking system. You are building a flight booking system. When a customer clicks "Book," what tradeoffs do you make?

Think about: (1) Can two customers book the same seat? (Strong consistency required for seat assignment.) (2) Can the booking confirmation be delayed by a few seconds? (Availability: the user should see "confirmed" quickly.) (3) What happens if a data center goes down mid-booking? (The booking should not be lost — durability.) (4) What about showing seat availability while browsing? (Eventual consistency is fine — seats may show as available for a few seconds after being booked.)

The answer: Strong consistency for the actual booking (double-booking a seat is unacceptable), eventual consistency for the browsing view (a brief lag is fine), and high durability everywhere (losing a confirmed booking is catastrophic). This is the same pattern as social media: different features, different consistency requirements.
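
For the booking step, the simplest way to rule out double-booking is to let the database enforce uniqueness. The sketch below uses a single in-memory SQLite database as a stand-in for a strongly consistent store; the table layout and function name are hypothetical, and a multi-node deployment would need the same constraint enforced through linearizable writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings ("
    " flight TEXT, seat TEXT, customer TEXT,"
    " PRIMARY KEY (flight, seat))"  # at most one booking per seat per flight
)


def book_seat(flight: str, seat: str, customer: str) -> bool:
    """Try to book a seat; return False if someone else already holds it."""
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO bookings (flight, seat, customer) VALUES (?, ?, ?)",
                (flight, seat, customer),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(book_seat("UA100", "12A", "alice"))  # first booking wins
print(book_seat("UA100", "12A", "bob"))    # second attempt is rejected
```

The browsing view can read from a cache or replica that lags by a few seconds, because the final decision is always made here, against the authoritative constraint.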

Interview question: A startup is building a collaborative document editor (like Google Docs). Two users edit the same paragraph simultaneously. The founder says "we need strong consistency so users always see the same thing." What is wrong with this reasoning, and what would you recommend?

Strong consistency is correct: users must always see the same document
Strong consistency would make the editor feel sluggish because every keystroke would need to be confirmed by a central server before appearing on screen. Google Docs uses Operational Transformation (OT), and other collaborative editors use CRDTs, to get eventual consistency: each user sees their own edits immediately (optimistic local update), and the system converges to a consistent state shortly afterward. Users tolerate brief divergence (seeing slightly different states for 100ms) in exchange for instant responsiveness. The tradeoff is latency vs. strict consistency, and for real-time collaboration, low latency wins
Just use a locking mechanism so only one user can edit at a time

Chapter 8: The CAP Theorem

Every systems design discussion eventually invokes the CAP theorem. It is the most frequently cited — and most frequently misunderstood — result in distributed systems. Let us get it right.

The Three Properties

Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. (This is linearizability, a specific and strong form of consistency — not to be confused with the C in ACID, which means something different.)

Availability (A): Every request receives a (non-error) response, without guarantee that it contains the most recent write. The system is always responsive, even if the data it returns is stale.

Partition tolerance (P): The system continues to operate despite network partitions — messages between nodes being lost or delayed. A partition means some nodes cannot communicate with other nodes.

The Actual Theorem

The CAP theorem (proved by Gilbert and Lynch in 2002, based on Brewer's 2000 conjecture) states: in the presence of a network partition, a distributed system must choose between consistency and availability. You cannot have both.

The common phrasing "pick two out of three" is misleading. Here is why: you do not get to opt out of partition tolerance. Network partitions happen. Switches fail, cables get cut, cloud regions lose connectivity. Any distributed system must tolerate partitions. So the real choice is:

CP: Consistent + Partition-tolerant

During a partition, the system refuses to serve requests that might return stale data. It becomes unavailable to preserve consistency. Example: a banking system that returns "service unavailable" rather than risk showing a wrong balance. (PostgreSQL with synchronous replication, HBase, MongoDB with majority reads.)

AP: Available + Partition-tolerant

During a partition, the system continues serving requests but may return stale data. It sacrifices consistency for availability. Example: a shopping cart that always accepts writes, merging conflicts later. (Cassandra, DynamoDB, CouchDB.)
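
The difference between CP and AP shows up only when a replica is cut off. Here is a toy sketch of a single follower replica in each mode; the class and exception names are invented for illustration, and real systems decide this through quorum and replication settings, not a single flag.

```python
class Unavailable(Exception):
    """Raised when a CP replica cannot prove its data is current."""


class Follower:
    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.partitioned = False  # True when cut off from the leader

    def replicate(self, value) -> None:
        """Apply an update from the leader, unless the network drops it."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest write during a partition")
        return self.value  # AP: answer anyway, possibly with stale data


for mode in ("CP", "AP"):
    follower = Follower(mode)
    follower.replicate("balance=100")
    follower.partitioned = True
    follower.replicate("balance=40")  # this update never arrives
    try:
        print(mode, "read ->", follower.read())
    except Unavailable as exc:
        print(mode, "read -> error:", exc)
```

Both followers hold the same stale value. The CP follower refuses to serve it, and the AP follower serves it and lets the application deal with the consequences.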

"CA" (consistent and available, no partition tolerance) is not a real option for distributed systems. It describes a single-node system (like a single PostgreSQL server) — which is indeed both consistent and available, but only because there is no network partition possible (there is only one node). The moment you add a second node, partitions become possible.

PACELC: The Better Mental Model

CAP only describes behavior during a partition. But partitions are rare. What about the other 99.9% of the time? Daniel Abadi proposed PACELC (2012): "If there is a Partition (P), choose Availability (A) or Consistency (C). Else (E), when the system is running normally, choose Latency (L) or Consistency (C)."

System	During Partition	Normal Operation	PACELC
DynamoDB	Availability (A)	Latency (L)	PA/EL
Cassandra	Availability (A)	Latency (L)	PA/EL
PostgreSQL (sync repl.)	Consistency (C)	Consistency (C)	PC/EC
MongoDB (majority)	Consistency (C)	Latency (L)	PC/EL
Cosmos DB	Configurable	Configurable	Tunable

PACELC is more useful than CAP because it captures the latency-consistency tradeoff that systems face all the time, not just during rare partition events.

The Consistency Spectrum

CAP's binary "consistent or not" hides a rich spectrum of consistency models. In practice, you do not choose between "consistent" and "inconsistent" — you choose a point on a continuum:

Consistency Model	Guarantee	Latency Cost	Example Use Case
Linearizability (strongest)	All operations appear to execute atomically in real-time order	Highest (requires coordination)	Leader election, distributed locks
Sequential consistency	All operations appear in some total order consistent with each process's local order	High	Shared memory systems
Causal consistency	Operations that are causally related appear in order; concurrent operations may appear in any order	Medium	Social media (replies appear after the post they reply to)
Read-your-writes	A client always sees its own writes	Low	User profile updates (you see your own changes immediately)
Eventual consistency (weakest)	If no new writes occur, all replicas eventually converge to the same value	Lowest	DNS, like counts, CDN caches

Each step down the spectrum gives you better latency and availability at the cost of weaker guarantees. The art of distributed systems design is choosing the weakest consistency model that your application can tolerate — because stronger consistency is always more expensive.
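
As one illustration of a mid-spectrum guarantee, the sketch below shows a common way to provide read-your-writes on top of an asynchronously replicated store: the client remembers the version of its last write and only accepts a replica that has caught up to it. `Store`, `Session`, and the version counter are assumptions made for this example, not any particular database's API.

```python
class Store:
    """Toy primary/replica pair with manual (asynchronous) replication."""

    def __init__(self):
        self.primary = {}    # key -> (version, value)
        self.replica = {}    # lags behind until replicate() is called
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.primary[key] = (self.version, value)
        return self.version

    def replicate(self):
        """Copy the primary's state to the replica."""
        self.replica = dict(self.primary)


class Session:
    """Client session that enforces read-your-writes."""

    def __init__(self, store):
        self.store = store
        self.last_written = {}   # key -> version of this session's last write

    def write(self, key, value):
        self.last_written[key] = self.store.write(key, value)

    def read(self, key):
        needed = self.last_written.get(key, 0)
        version, value = self.store.replica.get(key, (0, None))
        if version >= needed:
            return value                      # replica is fresh enough
        return self.store.primary[key][1]     # fall back to the primary


store = Store()
alice, bob = Session(store), Session(store)
alice.write("name", "Alice B.")
print(alice.read("name"))   # Alice B. (served by the primary)
print(bob.read("name"))     # None (replica has not caught up yet)
store.replicate()
print(bob.read("name"))     # Alice B. (replicas have converged)
```

Alice pays the cost of a primary read only for keys she has just written; everyone else keeps the cheaper, eventually consistent path.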

Why CAP is oversimplified but still useful. CAP uses binary definitions: a system is either consistent or it is not, available or it is not. Reality has gradations — eventual consistency, causal consistency, read-your-writes consistency. CAP also says nothing about latency, throughput, or durability. Despite these limitations, CAP is useful as a conversation starter. In interviews, mention CAP to set the context, then immediately pivot to PACELC or specific consistency models to show depth. Never stop at "you pick two."

CAP/PACELC System Map

[Interactive figure in the original: real database systems plotted on the consistency-availability spectrum. The horizontal axis shows behavior during partitions (CP vs AP), and the vertical axis shows behavior during normal operation (low latency vs strong consistency).]

Interview question: A candidate says "We use Cassandra because it is AP, meaning we get both availability and partition tolerance." What important caveat is missing from this statement?

Cassandra is actually CP, not AP
AP does not mean the system is always correct; it just means it is always responsive
Being AP means that during a partition, Cassandra will continue serving requests but may return stale data. Reads may not reflect the most recent writes. The application must be designed to tolerate stale reads (conflict resolution, last-write-wins, read-repair). Furthermore, Cassandra's consistency is tunable via consistency levels (ONE, QUORUM, ALL): at QUORUM reads + QUORUM writes it behaves more like CP. Saying "AP" without mentioning what you sacrifice (freshness) and what you can tune (consistency level) is incomplete.
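
The tunable-consistency caveat can be made concrete with the quorum overlap rule: with N replicas, a read of R replicas is guaranteed to intersect a write acknowledged by W replicas when R + W > N. The sketch below encodes only that arithmetic. The level names mirror Cassandra's ONE, QUORUM, and ALL, but both functions are illustrative helpers, not part of any driver.

```python
def replicas_for(level, n):
    """Number of replicas that must respond for a consistency level."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]


def overlaps(read_level, write_level, n=3):
    """True when every read set must intersect every write set (R + W > N)."""
    return replicas_for(read_level, n) + replicas_for(write_level, n) > n


for r, w in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(f"read={r:6s} write={w:6s} overlap={overlaps(r, w)}")
# read=ONE    write=ONE    overlap=False
# read=QUORUM write=QUORUM overlap=True
# read=ONE    write=ALL    overlap=True
```

When the overlap holds, every read touches at least one replica that holds the latest acknowledged write. The price is waiting for more replicas on each operation, and losing availability when too few are reachable.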

Chapter 9: Interview Arsenal

This chapter is your cheat sheet. Everything from Chapters 0-8, compressed into the forms you will use in interviews and on the job.

The Three Pillars — Definitions

Pillar	One-sentence definition	Key metrics	Key techniques
Reliability	The system works correctly even when things go wrong	Availability (nines), MTTF, MTTR, error rate	Redundancy, fault isolation, chaos engineering, graceful degradation
Scalability	The system copes with growth in data, traffic, or complexity	Throughput (req/s), response time percentiles, load parameters	Vertical scaling, horizontal scaling (sharding, replicas), caching, async processing
Maintainability	Future engineers can work with the system productively	Time to onboard, time to deploy, incident resolution time	Good abstractions, documentation, automation, simplicity, evolvability

Numbers Every Engineer Should Know

Jeff Dean (Google) published a list of latency numbers that every systems engineer should have memorized. These numbers help you do back-of-envelope calculations in interviews and design discussions. Here are updated figures for modern hardware (circa 2024):

Operation	Latency	Relative Scale
L1 cache reference	~1 ns	1x
L2 cache reference	~4 ns	4x
Main memory (DRAM) reference	~100 ns	100x
SSD random read	~16 μs	16,000x
HDD random read (seek)	~2 ms	2,000,000x
Network round-trip (same datacenter)	~0.5 ms	500,000x
Network round-trip (cross-continent)	~150 ms	150,000,000x
Read 1 MB sequentially from SSD	~50 μs
Read 1 MB sequentially from HDD	~2 ms
Read 1 MB from network (10 Gbps)	~0.8 ms

The memory-disk-network hierarchy. By the table above, memory is roughly 100x faster than an SSD random read (~100 ns vs ~16 μs), which is in turn roughly 100x faster than an HDD seek (~2 ms). A round trip within a datacenter (~0.5 ms) falls between SSD and HDD; a round trip across continents (~150 ms) is nearly 100x slower than an HDD seek. This hierarchy drives every caching and data placement decision: keep hot data in memory, warm data on SSD, cold data on HDD. Every cache miss moves you one level down the hierarchy and costs roughly 100x more latency.
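
A quick way to see why hit rate dominates is to compute the expected access latency for a two-level lookup. This sketch uses the table's figures for DRAM (~100 ns) and SSD random read (~16 μs); the hit rates are made-up inputs, and the model assumes a miss pays for the failed memory lookup plus the SSD read.

```python
DRAM_NS = 100        # main memory reference, from the table above
SSD_NS = 16_000      # SSD random read, from the table above


def expected_latency_ns(hit_rate, fast_ns=DRAM_NS, slow_ns=SSD_NS):
    """Average latency when a miss in the fast tier falls through to the slow tier."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * fast_ns + (1 - hit_rate) * (fast_ns + slow_ns)


for rate in (0.50, 0.90, 0.99):
    print(f"hit rate {rate:.0%}: {expected_latency_ns(rate):,.0f} ns")
# hit rate 50%: 8,100 ns
# hit rate 90%: 1,700 ns
# hit rate 99%: 260 ns
```

Even at a 99% hit rate, the 1% of misses contribute more than half of the average (161 of 260 ns), which is why the last few points of hit rate matter so much.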

Back-of-Envelope Estimation Pattern

In interviews, you will be asked to estimate things like "How many servers do you need?" or "How much storage?" Here is the pattern:

// Step 1: Estimate the load
Users: 100 million monthly active
Daily active: ~30% = 30 million
Requests per user per day: ~50
Total: 30M × 50 = 1.5 billion req/day

// Step 2: Convert to per-second
Average: 1.5B / 86400 ≈ 17,000 req/sec
Peak (2-3x average): ~50,000 req/sec

// Step 3: Size the servers
One server handles: ~5,000 req/sec (typical for a web server)
Servers needed: 50,000 / 5,000 = 10 servers (at peak)
With 2x headroom: 20 servers

// Step 4: Size the storage
Data per request: ~1 KB (if storing)
Daily: 1.5B × 1 KB = 1.5 TB/day
Yearly: ~550 TB ≈ 0.5 PB
With replication (3x): ~1.5 PB

These estimates are deliberately rough — the point is not precision, it is getting the right order of magnitude. "We need about 20 servers and half a petabyte per year" is a useful statement in a design discussion. "We need exactly 17.3 servers" is false precision.
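
The same four steps can be written as a small function so the assumptions are explicit and easy to vary. This is only a sketch: the parameter names are invented here, the defaults restate the numbers from the example, and `peak_factor=3` picks the upper end of the 2-3x range.

```python
import math


def estimate(mau=100_000_000, dau_ratio=0.30, req_per_user=50,
             peak_factor=3, per_server_rps=5_000, headroom=2,
             bytes_per_req=1_000, replication=3):
    """Back-of-envelope sizing; every default restates a number from the text."""
    daily_requests = mau * dau_ratio * req_per_user            # Step 1
    avg_rps = daily_requests / 86_400                           # Step 2
    peak_rps = avg_rps * peak_factor
    servers = math.ceil(peak_rps / per_server_rps) * headroom   # Step 3
    tb_per_day = daily_requests * bytes_per_req / 1e12          # Step 4
    pb_per_year = tb_per_day * 365 / 1_000
    return {
        "avg_rps": avg_rps,
        "peak_rps": peak_rps,
        "servers": servers,
        "tb_per_day": tb_per_day,
        "pb_per_year": pb_per_year,
        "pb_per_year_replicated": pb_per_year * replication,
    }


for name, value in estimate().items():
    print(f"{name:>24s}: {value:,.2f}")
```

Carrying the unrounded numbers through gives 22 servers and about 0.55 PB per year (about 1.6 PB replicated). That is the same order of magnitude as the rounded figures above, which is the point of the exercise.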

Complete System Design Walkthrough

Let us walk through a full system design answer using the framework from this lesson. The question: "Design a URL shortener like bit.ly."

Step 1: Describe the load.
"Let me start by estimating the load parameters. Assume 100 million new URLs per month (write-heavy initially), and 10 billion redirects per month (read-heavy in steady state). That is ~40 writes/sec and ~4,000 reads/sec on average. Read:write ratio is 100:1. Each URL record is ~500 bytes. 100M URLs × 500 bytes = 50 GB/month, so ~600 GB/year. Data fits on a single machine for years."

Step 2: Define the SLA.
"Redirects must be fast — p99 under 50ms. This is the core user experience. Availability for reads: 99.99% (shortened URLs that do not work are useless). Availability for writes: 99.9% is fine (creating a short URL can tolerate brief downtime). Consistency: eventual consistency is acceptable — a 1-second delay between creating a URL and it being redirect-able is fine."

Step 3: Identify the bottleneck.
"Read-heavy (100:1 ratio). The bottleneck is redirect lookup latency. Reads are simple key-value lookups (short code to long URL). This is perfect for caching — most shortened URLs are accessed repeatedly. A cache with 80% hit rate means only 800 reads/sec hit the database. Any modern database handles this easily."

Step 4: Separate stateless from stateful.
"API servers are stateless — scale horizontally behind a load balancer. The database is stateful — start with a single PostgreSQL instance (50 GB fits easily). Add read replicas if read load grows beyond single-machine capacity. Add a Redis cache in front for hot URLs. Use a CDN for the HTTP 301 redirects themselves."

Notice how every answer references concepts from this lesson: load parameters, SLO/SLA, read:write ratio, percentiles, cache hit rates, stateless vs. stateful, vertical first then horizontal. The framework gives you structure; the numbers give you credibility.
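
The arithmetic behind Steps 1 and 3 is worth being able to reproduce on the spot. The sketch below recomputes it from the stated assumptions; a 30-day month is assumed, and the variable names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 (30-day month assumed)

new_urls_per_month = 100_000_000
redirects_per_month = 10_000_000_000
record_bytes = 500
cache_hit_rate = 0.80

writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH
reads_per_sec = redirects_per_month / SECONDS_PER_MONTH
db_reads_per_sec = reads_per_sec * (1 - cache_hit_rate)   # cache misses only
gb_per_month = new_urls_per_month * record_bytes / 1e9

print(f"writes/sec:      {writes_per_sec:8.1f}")
print(f"reads/sec:       {reads_per_sec:8.1f}")
print(f"read:write:      {reads_per_sec / writes_per_sec:8.0f}:1")
print(f"DB reads/sec:    {db_reads_per_sec:8.1f}")
print(f"storage/month:   {gb_per_month:8.1f} GB")
```

The unrounded values are about 39 writes/sec, 3,860 reads/sec, and 770 database reads/sec, so the rounded figures quoted in the answer (40, 4,000, 800) are safe to say out loud.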

System Design Opening Moves

When you get a system design question, start with these questions to establish the tradeoff space:

1. Describe the Load

"What are the key load parameters? Read/write ratio? Requests per second? Data volume? Growth rate?" This shows you think quantitatively.

↓

2. Define the SLA

"What is our availability target? What is our p99 latency budget? Which is more important: consistency or availability?" This shows you think about tradeoffs.

↓

3. Identify the Bottleneck

"Is the system read-heavy or write-heavy? Is the bottleneck CPU, memory, disk I/O, or network? Where is the fan-out?" This shows you think about where to optimize.

↓

4. Separate Stateless from Stateful

"Stateless services scale trivially. Stateful services are the hard part. What data do we need to persist, and what are the access patterns?" This shows you know where complexity lives.

Quick-Reference: Availability Budget

Availability	Downtime/year	Downtime/month	Typical system
99% (two nines)	3.65 days	7.3 hours	Internal tools, batch systems
99.9% (three nines)	8.76 hours	43.8 minutes	SaaS products, e-commerce
99.99% (four nines)	52.6 minutes	4.4 minutes	Payment systems, core infrastructure
99.999% (five nines)	5.26 minutes	26.3 seconds	Telecom, 911 systems

Coding Drill: Percentile Calculator

python
def percentile(data, p):
    """Calculate the p-th percentile of a list of values.
    p is between 0 and 100. Uses linear interpolation between the
    two closest ranks (not the nearest-rank method).
    """
    if not data:
        raise ValueError("Empty dataset")
    sorted_data = sorted(data)
    n = len(sorted_data)
    rank = (p / 100) * (n - 1)
    lower = int(rank)
    upper = lower + 1
    weight = rank - lower
    if upper >= n:
        return sorted_data[lower]
    return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight

# Usage:
latencies = [12, 15, 11, 14, 800, 13, 12, 16, 11, 13]
print(f"p50: {percentile(latencies, 50)}ms")   # 13.0ms
print(f"p99: {percentile(latencies, 99)}ms")   # ~729ms (interpolated between 16 and 800)
print(f"mean: {sum(latencies)/len(latencies)}ms") # 91.7ms (misleading!)
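
For comparison, here is a sketch of the nearest-rank definition, a common alternative to interpolation. It always returns a value that was actually observed, which can be easier to reason about for small samples. The function name is chosen for this example.

```python
import math


def percentile_nearest_rank(data, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank definition."""
    if not data:
        raise ValueError("Empty dataset")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    sorted_data = sorted(data)
    # 1-based rank of the smallest value with at least p% of samples at or below it
    rank = math.ceil(p * len(sorted_data) / 100)
    return sorted_data[rank - 1]


latencies = [12, 15, 11, 14, 800, 13, 12, 16, 11, 13]
print(percentile_nearest_rank(latencies, 50))   # 13
print(percentile_nearest_rank(latencies, 99))   # 800
```

With only ten samples, the two definitions disagree noticeably at p99 (800 versus an interpolated value between 16 and 800). With thousands of samples the difference becomes negligible.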

Coding Drill: Availability Calculator

python
def series_availability(components):
    """Availability of N components in series (all must work)."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(a, replicas=2):
    """Availability with N redundant replicas."""
    return 1 - (1 - a) ** replicas

# 5 services, each 99.9%
services = [0.999] * 5
print(f"Series: {series_availability(services)*100:.4f}%")
# 99.5010%  (~1.8 days downtime/year)

# Same 5 services, each with a redundant replica
redundant = [parallel_availability(0.999) for _ in range(5)]
print(f"With redundancy: {series_availability(redundant)*100:.6f}%")
# 99.999500%  (~2.6 minutes downtime/year)

Coding Drill: Simple Load Generator

python
import random

def simulate_service(base_latency_ms=10, p99_spike_ms=500):
    """Simulate a service with realistic latency distribution."""
    if random.random() < 0.01:  # 1% chance of slow response
        return base_latency_ms + random.expovariate(1/p99_spike_ms)
    return base_latency_ms + random.expovariate(1/5)

def simulate_fanout(n_backends, n_requests=10000):
    """Simulate tail latency amplification with n backend calls."""
    latencies = []
    for _ in range(n_requests):
        # Each request calls n_backends in parallel
        # Overall latency = max of all backends
        backend_latencies = [simulate_service() for _ in range(n_backends)]
        latencies.append(max(backend_latencies))
    return latencies

# Compare 1, 5, 10, and 20 backends
for n in [1, 5, 10, 20]:
    lats = simulate_fanout(n)
    lats.sort()
    print(f"Backends={n:2d}  p50={lats[4999]:.0f}ms  "
          f"p99={lats[9899]:.0f}ms  p999={lats[9989]:.0f}ms")
# Output varies from run to run (no random seed is set), the tail
# percentiles most of all. Approximate values implied by the latency
# distribution in simulate_service:
# Backends= 1  p50=~14ms  p99=  ~46ms  p999=~1160ms
# Backends= 5  p50=~21ms  p99= ~810ms  p999=~1970ms
# Backends=10  p50=~24ms  p99=~1160ms  p999=~2310ms
# Backends=20  p50=~29ms  p99=~1500ms  p999=~2660ms

Notice how p50 grows slowly (the median is stable) but p99 explodes with more backends. This is tail latency amplification in action. With 20 backends, p99 is on the order of 100x the single-backend p50. This is why microservice architectures must be designed with tail latency in mind from day one.
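
The trend in the simulation also follows from a one-line probability argument. If each backend is slow with probability p, independently of the others, then a request that waits on n backends is slow whenever at least one of them is, which happens with probability 1 - (1 - p)^n. The sketch below evaluates this for p = 1%, matching the slow-response rate in `simulate_service`; independence between backends is an assumption.

```python
def prob_any_slow(n_backends, p_slow=0.01):
    """Probability that at least one of n independent backends is slow."""
    return 1 - (1 - p_slow) ** n_backends


for n in (1, 5, 10, 20, 100):
    print(f"{n:3d} backends: {prob_any_slow(n):5.1%} of requests hit a slow backend")
```

At 20 backends roughly 18% of requests include at least one slow call, and at 100 backends it is about 63%. What was a p99 event for a single service becomes the typical experience for the composite request.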

Coding Drill: Error Budget Calculator

python
def error_budget(sla_percent, period_days=30):
    """Calculate allowed downtime for a given SLA."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - sla_percent / 100)
    return {
        'sla': f'{sla_percent}%',
        'period': f'{period_days} days',
        'budget_minutes': round(allowed_downtime, 2),
        'budget_human': format_duration(allowed_downtime)
    }

def format_duration(minutes):
    if minutes < 1: return f'{minutes*60:.1f} seconds'
    if minutes < 60: return f'{minutes:.1f} minutes'
    if minutes < 1440: return f'{minutes/60:.1f} hours'
    return f'{minutes/1440:.1f} days'

# Monthly error budgets (30-day window, so slightly below the
# average-month figures in the availability table):
for sla in [99, 99.9, 99.99, 99.999]:
    result = error_budget(sla)
    print(f"SLA {result['sla']:>8s}  Budget: {result['budget_human']:>14s}")
# SLA      99%  Budget:      7.2 hours
# SLA    99.9%  Budget:   43.2 minutes
# SLA   99.99%  Budget:    4.3 minutes
# SLA  99.999%  Budget:   25.9 seconds

The error budget mindset. Google's SRE team reframed reliability from "minimize downtime" to "spend your error budget wisely." If your SLA is 99.9% and you have used only 10 minutes of your 43.8-minute monthly budget, you have about 34 minutes left to "spend" on risky deployments, infrastructure changes, or chaos engineering experiments. This turns reliability from a fear-driven constraint into a quantitative resource to be managed.
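
Tracking the budget as a balance makes the "spend it wisely" idea operational. The helper below is a minimal sketch: the function name is invented here, and it assumes an average month of 365/12 days so that the 99.9% budget matches the 43.8 minutes quoted above.

```python
def remaining_budget(slo_percent, downtime_used_min, period_days=365 / 12):
    """Return (budget, remaining, fraction_remaining), all in minutes or as a ratio."""
    total_min = period_days * 24 * 60
    budget_min = total_min * (1 - slo_percent / 100)
    remaining_min = budget_min - downtime_used_min
    return budget_min, remaining_min, remaining_min / budget_min


budget, remaining, fraction = remaining_budget(99.9, downtime_used_min=10)
print(f"budget:    {budget:.1f} min")
print(f"remaining: {remaining:.1f} min ({fraction:.0%} of budget)")
# budget:    43.8 min
# remaining: 33.8 min (77% of budget)
```

A team might gate risky rollouts on the remaining fraction, for example pausing feature launches once it drops below some agreed threshold. The threshold itself is a policy choice, not something the arithmetic decides.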

Debug Scenarios

Scenario	Likely cause	What to check
p99 latency spiked 10x	Tail latency amplification, GC pauses, slow dependency, noisy neighbor	Trace slow requests end-to-end, check each backend's p99 individually, check GC logs, check for resource contention
System went down during traffic spike	No elastic scaling, no back-pressure, cascading failure	Check if auto-scaling was configured, check if the system has circuit breakers, check for thundering herd effects
Data inconsistency after deploy	Missing migration, stale cache, split-brain	Check cache invalidation, check replication lag, check for concurrent writes during schema migration
Intermittent "service unavailable"	Health check flapping, DNS issues, connection pool exhaustion	Check connection pool sizes, check health check configuration, check DNS TTL

Chapter 10: Connections

Chapter 1 of DDIA sets the vocabulary. Every subsequent chapter explores one dimension of the tradeoff space in depth. Here is how this lesson connects to what comes next.

This Lesson's Concept	Where It Goes Next	DDIA Chapter
Data models (the blurring boundaries)	Relational vs document vs graph models — choosing the right data model for your access patterns	Chapter 2: Data Models
Reliability (fault tolerance)	Replication: keeping copies of data on multiple machines so the system survives failures	Chapter 5: Replication
Scalability (handling growth)	Partitioning (sharding): splitting a large dataset across multiple machines so no single node is a bottleneck	Chapter 6: Partitioning
Consistency (CAP, PACELC)	Transactions: grouping reads and writes into atomic units with consistency guarantees	Chapter 7: Transactions
Fan-out (Twitter problem)	Stream processing: handling high-throughput event flows in real time	Chapter 11: Stream Processing
Batch vs. real-time	Batch processing: the economics of processing large datasets efficiently (MapReduce and beyond)	Chapter 10: Batch Processing

Key Takeaways

The one thing to remember. There is no such thing as a "best" data system. There are only tradeoffs. Every architectural decision sacrifices something to gain something else. The mark of a senior engineer is not knowing the "right answer" — it is knowing which tradeoffs are acceptable for a given business context, quantifying them, and communicating them clearly. Reliability, scalability, and maintainability are not checkboxes. They are dials. Your job is to set each dial to the right level for your specific situation.

The Vocabulary You Now Own

After this lesson, you should be able to use these terms precisely in conversation, in interviews, and in design documents. Each term has a specific meaning that we derived from first principles:

Term	Precise Meaning	Common Misuse
Fault	One component deviating from spec	Confused with "failure" (whole system down)
Failure	The system stops providing the required service	Used interchangeably with "fault"
Load parameter	A number describing the shape of workload (req/s, ratio, count)	Vague statements like "high traffic"
Fan-out	Number of downstream ops triggered by one incoming op	Used only for messaging; it applies to any amplification
Percentile (p99)	99% of requests are faster than this value	Confused with "average" or "typical"
Tail latency amplification	Fan-out to N services makes the composite p99 worse than individual p99	Ignored until production breaks
SLO vs SLA	SLO = internal target; SLA = contractual guarantee with penalties	Used interchangeably
Error budget	Allowed downtime = 1 - SLA, a resource to be spent wisely	Treated as a punishment rather than a tool
Accidental complexity	Complexity from the implementation, not the problem domain	Confused with inherent problem complexity

Beyond DDIA Chapter 1

This lesson covered the conceptual framework. The rest of the DDIA series goes deep on each mechanism:

Storage engines — How databases physically store and retrieve data (B-trees vs LSM-trees). The tradeoff: B-trees optimize for reads, LSM-trees optimize for writes. Your choice depends on your read/write ratio (a load parameter from this lesson).
Encoding and evolution — How data formats change over time without breaking systems. The tradeoff: forward and backward compatibility vs. schema simplicity. Directly related to evolvability (maintainability from this lesson).
Replication — Leader-follower, multi-leader, leaderless — each with different consistency/availability tradeoffs. This is CAP and PACELC made concrete: which consistency model, and what does it cost?
Partitioning — Splitting data across nodes: by key range, by hash, or by document. This is horizontal scaling for stateful services — the hard problem we flagged in the scaling chapter.
Transactions — ACID guarantees, isolation levels, distributed transactions. The tradeoff: stronger isolation = lower throughput = higher latency. How much isolation does your application actually need?
Consensus — Getting multiple nodes to agree on a value (Paxos, Raft, Zab). The foundation of fault-tolerant systems. Without consensus, you cannot have leader election, distributed locks, or atomic commits across partitions.

Every one of these topics is a deeper exploration of the tradeoffs we introduced here. The vocabulary from this lesson — faults vs. failures, load parameters, percentiles, fan-out, consistency vs. availability, series reliability, tail latency amplification — will appear in every subsequent chapter.

A Map of the Tradeoff Landscape

If you zoom out far enough, every decision in distributed systems is a point in a multi-dimensional tradeoff space. Here are the axes:

Axis	One End	Other End	DDIA Chapters
Consistency ↔ Availability	Linearizability (strict)	Eventual consistency (lax)	Ch 5, 7, 9
Latency ↔ Durability	In-memory only (fast, volatile)	Sync to disk + replicas (slow, durable)	Ch 3, 5
Throughput ↔ Isolation	No isolation (dirty reads, fast)	Serializable (safe, slow)	Ch 7
Simplicity ↔ Performance	Single node (simple, limited)	Distributed (complex, scalable)	Ch 5, 6, 8, 9
Freshness ↔ Cost	Real-time updates (expensive)	Batch updates (cheap, stale)	Ch 10, 11

Every system sits at a particular point in this space. There is no system that optimizes all axes simultaneously — that is the fundamental theorem of data systems design. Your job is to find the point that best serves your users and your business, and to be able to articulate why you chose that point over the alternatives.

The Closing Challenge

Here is a thought experiment to carry with you. The next time you use any application — checking email, ordering food, streaming a video, sending a payment — pause and ask yourself:

What are the load parameters of this system? (How many users are doing this right now?)
What consistency model is it using? (If I update something, will I see it immediately on another device?)
What happens if a server fails right now? (Does the system degrade gracefully?)
Where is the fan-out? (How many backend systems does my single action trigger?)
What tradeoff did the engineers choose, and why?

Once you start seeing systems through this lens, you will never look at software the same way again. Every loading spinner is a latency tradeoff. Every "try again later" is a reliability decision. Every stale notification is a consistency choice. You are now equipped to understand — and make — these decisions.

Further Reading

Resource	Topic	Why Read It
DDIA Chapters 2-12	Everything	The rest of the book builds directly on Chapter 1. Each chapter is a deep dive into one aspect of the tradeoff space.
Google SRE Book (free online)	Reliability, SLOs, error budgets	The practical handbook for running reliable systems at scale. Chapter 4 on SLOs is essential.
"The Tail at Scale" (Dean & Barroso)	Tail latency	The definitive paper on why tail latency matters and how to mitigate it (hedged requests, tied requests).
Amazon Dynamo Paper (2007)	Availability vs consistency	The paper that launched the NoSQL movement. Shows how Amazon sacrifices consistency for availability.
"Harvest, Yield, and Scalable Tolerant Systems" (Fox & Brewer)	CAP nuances	A more nuanced take on CAP from one of its creators. Introduces harvest (completeness) and yield (availability).
"Consistency Tradeoffs in Modern Distributed Database System Design" (Abadi)	PACELC	The paper that extends CAP to cover the non-partition case. Essential for understanding real system tradeoffs.
