Designing Data-Intensive Applications — Chapter 1

Trade-Offs in Data Systems

Reliability, scalability, maintainability — the vocabulary of systems design.

Prerequisites: None. This is where we start.
11
Chapters
9+
Simulations
5
Interview Dimensions

Chapter 0: The Problem

You open your phone and check Twitter. In the two seconds it takes to load your timeline, here is what happens behind the curtain: a load balancer routes your request to one of thousands of application servers. That server checks a cache (Redis, maybe Memcached) to see if your timeline was recently computed. Cache miss. It queries a database (probably a sharded MySQL or PostgreSQL cluster) for tweets from the 400 accounts you follow. It calls a search index (Elasticsearch or a custom inverted index) to find trending topics. It pushes an analytics event into a message queue (Kafka) so a downstream batch processing pipeline can update your engagement scores overnight. Meanwhile, a stream processor is watching Kafka in real time, updating global trending topics as tweets flow in.

Six different kinds of data systems. One request. And the user just sees a feed that loaded in under a second. If any one of those systems is slow, the whole experience degrades. If any one loses data, users see corrupted timelines. If any one goes down completely, the application must decide: do we serve a degraded experience or show an error page?

These are not hypothetical questions. They are the daily reality of every engineering team building modern applications. And the answers are never obvious — they depend on the specific requirements, the specific workload, and the specific tradeoffs you are willing to make.

This is the defining reality of modern software: most applications today are data-intensive, not compute-intensive. The CPU is rarely the bottleneck. The hard problems are the amount of data, the complexity of data, and the speed at which data changes.

Martin Kleppmann opens DDIA with a simple observation: we have gotten very good at making CPUs fast. What we have not gotten good at is making data systems that are reliable, scalable, and easy to maintain. That is what DDIA is about, and that is what this lesson covers.

The Blurring Boundaries

Textbooks draw clean boxes: "this is a database," "that is a message queue." Reality is messier. Redis is a cache that can persist to disk like a database. Apache Kafka is a message queue that can store data durably for weeks, acting as a distributed log. Elasticsearch is a search engine that many teams use as their primary data store. The categories blur because each tool optimizes for a different set of tradeoffs, and sometimes those tradeoffs overlap.

When you stitch these tools together in your application code, you are designing a new data system. Your application's API hides the internal complexity. To the outside world, your service is a black box that promises certain guarantees: "This query will return results in under 200ms," "No data will be lost after a confirmed write," "The system will handle 10,000 concurrent users."

Making those promises — and keeping them — is harder than it sounds. A cache improves speed but introduces stale data. Replication improves availability but introduces consistency problems. A message queue decouples services but adds latency. Every tool you add solves one problem and creates another. The art of data systems engineering is choosing the combination of tradeoffs that best serves your users.

Those promises come down to three properties. Martin Kleppmann calls them the three concerns that are important in most software systems:

The Big Three. Every data system lives or dies by three qualities:

Reliability — The system works correctly, even when things go wrong.
Scalability — The system copes with growth (data, traffic, complexity).
Maintainability — Future engineers can work with the system productively.

The rest of this lesson unpacks each one. These three words are the vocabulary of every systems design interview you will ever do.

A Typical Data System

Before we dive into each property, let us see what a typical data-intensive application looks like. The simulation below shows data flowing between the components you just read about. Watch how a single user request touches multiple systems.

Data System Architecture

Click "Send Request" to trace a user request through a typical web application. Watch data flow between the database, cache, search index, and message queue.

Notice something important: the application server is the orchestrator. It decides which component to call, in what order, and what to do when one fails. If the cache is down, do you skip it and hit the database directly? If the message queue is full, do you drop the analytics event or block the user's request? These are design decisions, and every one of them is a tradeoff.

Why "Data-Intensive" Not "Compute-Intensive"?

There was a time when CPU speed was the bottleneck. In the 1990s, sorting a million records was a genuine computational challenge. Today, a single laptop can sort a billion integers in under a minute. CPUs got fast enough that for most applications, raw computation is cheap.

What did not get cheap: moving data around. Reading from disk. Waiting for a network round-trip. Synchronizing state across three data centers on different continents. The speed of light imposes a hard floor: a round-trip from New York to London takes at minimum 56 milliseconds at light speed through fiber — and real-world latency is 2-3x that. No amount of CPU improvement can fix this.

This is why the interesting problems are data problems: how to store it (databases), how to cache it (Redis, Memcached), how to search it (Elasticsearch), how to move it (Kafka, RabbitMQ), how to process it (Spark, Flink), and how to keep it consistent across multiple copies (replication, consensus). The CPU crunches numbers. The system manages data. DDIA is about the system.

You are a data system designer. The moment you combine a database with a cache with a queue, you are not just a backend developer. You are designing a new, special-purpose data system with its own guarantees. DDIA exists to help you reason about those guarantees rigorously, not just hope they work.
Warm-up: A junior developer says "Redis is a cache, not a database." Why is this statement too simplistic?

Chapter 1: Reliability

Everybody wants a system that "just works." But what does that mean precisely? Kleppmann defines reliability as: the system continues to work correctly even when things go wrong. "Working correctly" means performing the function the user expects, at the performance the user expects, even when the user makes mistakes, even when the hardware fails, even when the software has bugs.

The key distinction is between faults and failures. A fault is a single component deviating from its spec — one disk dies, one process crashes, one network packet is lost. A failure is the whole system stopping to provide the required service. The goal of fault-tolerant design is to prevent faults from causing failures. You cannot prevent all faults (disks will die, networks will partition), but you can build systems that tolerate them.

Counterintuitively, it can make sense to deliberately trigger faults to test fault-tolerance mechanisms. This is the philosophy behind Netflix's Chaos Monkey and the broader discipline of chaos engineering: if you have not tested what happens when a server dies, you do not actually know if your system is fault-tolerant. You only believe it is. The difference between belief and knowledge in production engineering is measured in outages.

Hardware Faults

Hard disks have a mean time to failure (MTTF) of about 10 to 50 years. That sounds reliable until you do the math: if you have a storage cluster with 10,000 disks, each with an MTTF of 30 years, you should expect on average one disk to die every day. (10,000 disks ÷ 30 years ÷ 365 days = 0.91 failures per day.) RAM develops faults. Power grids have outages. Network cables get unplugged by cleaning staff.

The traditional response: redundancy. RAID arrays for disks, dual power supplies, hot-swappable CPUs, diesel generators for power. If one component fails, its redundant partner takes over while the broken one is replaced. For a single server, this approach can keep a machine running uninterrupted for years.

But as data volumes have grown, more applications use larger numbers of machines, which proportionally increases the rate of hardware faults. Cloud platforms like AWS are designed so that individual virtual machine instances can become unavailable without warning. The platform prioritizes flexibility and elasticity over single-machine reliability. So the emphasis has shifted from hardware redundancy to software fault-tolerance: systems that can tolerate the loss of entire machines.

Software Faults

Hardware faults are mostly independent — one disk dying does not make another disk more likely to die. Software faults are worse because they are correlated. A bug in the Linux kernel affects every machine running that kernel. A runaway process that consumes all CPU, memory, or disk on every node simultaneously. A service that the system depends on slowing down, becoming unresponsive, or returning corrupted data. A cascading failure where a small fault in one component triggers a fault in another, which triggers another.

Software faults are harder to anticipate. They lie dormant for a long time until triggered by an unusual set of circumstances — a leap second, a particularly large input, a specific sequence of operations. There is no quick fix. Kleppmann's advice: thorough testing, process isolation, allowing processes to crash and restart, measuring, monitoring, and alerting.

Human Faults

Humans design and operate these systems, and humans make mistakes. Configuration errors by operators are the leading cause of outages, not hardware or software faults. Studies of large internet services found that operator error caused the majority of disruptions, while hardware faults caused only 10-25%.

How do you mitigate human error? Design systems that minimize opportunities for error (good APIs that make it easy to do the right thing and hard to do the wrong thing). Provide sandbox environments where people can experiment safely with real data without affecting production. Test thoroughly at all levels: unit tests, integration tests, manual tests, automated end-to-end tests. Allow quick and easy rollback from mistakes — every deploy should be reversible in under one minute. Set up detailed monitoring (telemetry) so you can detect problems early, before users report them.

Google's approach to mitigating human error is instructive. Every configuration change goes through a review process. Changes are rolled out gradually (1% of traffic, then 10%, then 50%, then 100%) with automated monitoring at each stage. If any metric degrades, the rollout automatically pauses and alerts the engineer. This "canary" pattern catches most human errors before they affect more than a small fraction of users.

The reliability hierarchy. Hardware faults are random and independent — redundancy handles them. Software faults are systematic and correlated — testing and monitoring handle them. Human faults are the most common — good design, sandboxes, and rollbacks handle them. In an interview, start with human error (most common), then software (most dangerous), then hardware (most studied).

Worked Example: How Faults Become Failures

Let us trace a real cascading failure to see how a small fault becomes a system-wide failure.

1. Fault: One database replica has a slow disk
The disk is degraded but not dead. Reads take 500ms instead of 5ms. The replica is technically "healthy" — health checks pass because they only check if the process is alive, not if it is fast.
2. Amplification: Load balancer routes traffic to slow replica
The load balancer distributes connections evenly. Requests hitting the slow replica take 100x longer. These requests hold open connections and consume thread pool slots on the application server.
3. Cascading: App server thread pool exhausted
With 200 threads and each slow request holding a thread for 500ms instead of 5ms, throughput drops from 40K req/s to 400 req/s. Requests queue up. Timeouts start firing. Retry storms begin — upstream services retry failed requests, tripling the load.
4. Failure: System-wide outage
The retry storm overwhelms the remaining healthy replicas. All replicas become slow. All application servers exhaust their thread pools. The load balancer has no healthy backends. Users see 503 errors. Total duration: 47 minutes. Root cause: one slow disk.

Prevention requires defense in depth: (1) health checks that measure latency, not just liveness, (2) circuit breakers that stop sending traffic to slow backends, (3) retry budgets that cap the total number of retries, (4) timeouts at every network boundary. Netflix pioneered many of these patterns with their Hystrix library and the broader concept of chaos engineering — intentionally injecting faults to verify that the system handles them gracefully.

Real-World Failure Rates

How often do things actually break? Here are empirically observed failure rates from large-scale systems:

ComponentFailure RateSource
Hard disk (HDD)2-4% annual failure rate (AFR)Backblaze Drive Stats, 2023
SSD0.5-1% AFRGoogle fleet study, 2016
Server (any component)2-10% AFR depending on ageMicrosoft Azure study, 2018
Network link~1 failure per year per linkGoogle Jupiter network paper
Rack power supply1-2 events per year per datacenterAWS re:Invent talks
Software deploy~1 in 20 deploys causes a rollbackDORA State of DevOps
Configuration change#1 cause of outages at Google, Microsoft, and AmazonMultiple incident reports

The pattern is clear: hardware fails at predictable, low rates. Software and human errors fail at much higher rates but are harder to predict. The most reliable systems invest more in preventing human error (safe defaults, canary deploys, automated rollbacks) than in preventing hardware error (which is already well-understood).

Reliability Simulator

The simulation below models a system with five components, each with an adjustable failure rate. In a series architecture (every component must work for the system to work), the overall reliability multiplies: if each component has 99.9% availability, five in series give 99.9%5 = 99.5%. Toggle redundancy to add a replica for each component and watch reliability improve dramatically.

Reliability Simulator

Adjust individual component reliability. Toggle redundancy to add replicas. Watch how series composition multiplies failure probabilities.

Component reliability 99.90%
// Series reliability (all must work):
Rseries = R1 × R2 × ... × Rn

// With redundancy (each component has a backup):
Rcomponent = 1 - (1 - Rsingle)2
Rseries-redundant = Rcomponentn

// Worked example: 5 components, each 99.9%:
No redundancy: 0.9995 = 0.995 = 99.5%  (~1.8 days of downtime/year)
With redundancy: (1 - 0.0012)5 = (1 - 10-6)5 ≈ 99.9995%  (~2.6 minutes of downtime/year)
Interview question: Your system has 5 microservices in a critical path. Each has 99.9% availability (three nines). A product manager asks: "We have three nines on every service, so the system has three nines, right?" What is the actual system availability, and what do you recommend?

Chapter 2: Scalability

A system that works reliably today may not work reliably tomorrow if load increases tenfold. Scalability is the ability of a system to cope with increased load. But "scalability" is not a one-dimensional label — you cannot say "X is scalable" or "X doesn't scale" as an absolute statement. It is always relative to a specific kind of growth: "If load doubles, can we maintain response times by adding proportionally more resources?"

Describing Load

Before you can talk about whether a system handles growth, you must describe the current load precisely. Kleppmann calls these load parameters — numbers that capture the shape of your workload. The best choice of load parameters depends on the architecture of your system:

System TypeKey Load ParametersExample
Web serverRequests per second, read/write ratio1,000 req/s, 90% reads
DatabaseRead/write ratio, working set sizeWrite-heavy: 80% writes
CacheHit rate, eviction rate95% hit rate, 1M keys
Chat appConcurrent active users, messages/sec50K concurrent, 500 msg/s
Video platformConcurrent streams, bitrate, storage growth/day100K streams, 5 Mbps avg

Twitter's Fan-Out Problem

The best example from DDIA is Twitter circa 2012. Twitter had two main operations: post tweet (a user writes a tweet, delivered to all followers) and read timeline (a user views their home timeline — tweets from everyone they follow). At the time, Twitter handled about 4,600 tweets/sec on average (12,000 at peak) and 300,000 timeline reads/sec.

The challenge was not the write throughput. 12,000 writes/sec is manageable for most databases. The challenge was fan-out: each tweet must be delivered to all the author's followers, and some users have tens of millions of followers.

Two approaches:

Approach 1: Fan-out on Read
When a user requests their timeline, look up everyone they follow, find all recent tweets from those users, and merge them (sorted by time). Simple to implement, but expensive at read time. For a user following 500 people, that is 500 index lookups per timeline load, times 300,000 reads/sec = 150 million database lookups per second.
vs.
Approach 2: Fan-out on Write
When a user posts a tweet, immediately insert it into the home timeline cache of every follower. Reading the timeline is now trivially fast (just read the precomputed cache), but writing is expensive. A user with 30 million followers generates 30 million cache writes per tweet. At 4,600 tweets/sec average, that could be hundreds of billions of writes per day.

Twitter initially used Approach 1, then switched to Approach 2 because the read volume (300K/sec) was far higher than write volume (4.6K/sec), so it made sense to do the heavy work once at write time. But then the celebrity problem hit: when a user with 30 million followers tweets, that is 30 million writes. At peak, this created unsustainable write amplification.

Twitter's solution: a hybrid. Most users get fan-out on write (their tweets are pushed to followers' caches). Celebrities (users with very large follower counts) get fan-out on read — their tweets are fetched and merged at read time. This is a brilliant example of a tradeoff that only becomes visible once you quantify the load parameters.

Twitter Fan-Out Simulator

Adjust follower count and tweet rate. Compare write amplification (fan-out on write) vs read amplification (fan-out on read). Find the crossover point where the hybrid approach wins.

Followers 1,000
Tweets/sec 4,600
The fan-out number. In any system design interview, ask: "What is the fan-out?" Fan-out is the number of downstream operations triggered by a single incoming operation. Low fan-out (1:1 or 1:10) is easy. High fan-out (1:1,000,000) requires fundamentally different architectures. Twitter's hybrid approach recognizes that fan-out is not uniform across users — it is a heavy-tailed distribution, and you must handle the tail differently.

Worked Example: The Full Twitter Math

Let us work through the numbers that forced Twitter to change their architecture. These are approximate numbers from 2012, the era Kleppmann describes.

// Given:
Active users: 300 million
Average tweets/sec: 4,600 (peak: 12,000)
Timeline reads/sec: 300,000
Average followers per user: ~200
Celebrity followers (top 0.001%): 10-50 million

// Fan-out on Write (precompute timelines):
Average write amplification = 4,600 tweets/sec × 200 followers = 920,000 cache writes/sec
// This is manageable. But then a celebrity tweets:
Celebrity tweet = 1 tweet × 30,000,000 followers = 30,000,000 cache writes
// At 4.6K tweets/sec, ~0.005% are from celebrities, so ~0.23 celebrity tweets/sec
// Each one creates a 30M write burst on top of the 920K baseline

// Fan-out on Read (compute timeline at read time):
Each timeline = look up ~200 followed accounts × fetch recent tweets = ~200 index lookups
Total = 300,000 reads/sec × 200 lookups = 60,000,000 lookups/sec
// 60M lookups/sec is a LOT of random reads, very hard to sustain

// Hybrid solution:
Regular users: fan-out on write (920K writes/sec, no celebrity spikes)
Celebrities: fan-out on read (merged at read time, ~1-5% of timelines affected)
// Best of both worlds: predictable write load + fast reads for most users

The hybrid solution is elegant because it exploits the fact that follower count follows a power law distribution. The vast majority of users have a small number of followers — fan-out on write is cheap for them. Only a tiny fraction of users have millions of followers — handling them differently at read time is a small additional cost. This is a pattern you will see repeatedly in distributed systems: handle the common case efficiently, and special-case the outliers.

Fan-Out Beyond Twitter

The fan-out concept applies far beyond social media. Here are examples from other domains:

SystemOperationFan-OutStrategy
Google SearchOne queryFans out to 1,000+ index shardsReturn results from fastest shards, skip slow ones
CDN invalidationOne cache purgePropagates to 200+ edge servers globallyAsync propagation, accept brief staleness (seconds)
Payment processingOne chargeTriggers: auth, fraud check, ledger write, receipt email, analyticsCritical path (auth+ledger) sync, rest async via queue
CI/CD pipelineOne commitTriggers: lint, unit tests, integration tests, deploy to staging, canaryParallel where possible, sequential gates for safety
Ride-sharingOne ride requestNotify 10-50 nearby driversFan-out on write (push to drivers), first-accept wins

In every case, the design question is the same: do you do the work eagerly (fan-out on write) or lazily (fan-out on read)? The answer depends on: (1) the ratio of writes to reads, (2) the acceptable latency for reads, (3) the cost of write amplification, and (4) whether the fan-out is uniform or heavy-tailed.

Design question: A social media platform has 100 million users. The average user has 200 followers, but the top 0.01% (10,000 users) have 10 million followers each. The platform handles 5,000 tweets/sec. If you use pure fan-out-on-write, approximately how many cache writes per second do you need to handle?

Chapter 3: Describing Performance

Once you have described the load on your system, the next question is: what happens when load increases? Two ways to frame it: (1) keep resources fixed, how does performance degrade? (2) keep performance fixed, how much more resources do you need? Both framings require you to measure performance precisely.

Response Time vs. Latency

People use these interchangeably, but they are different. Response time is what the client sees: the time between sending a request and receiving the response. It includes network transit time, queuing time, and the actual service time. Latency is the time a request spends waiting to be handled — the time the request is "latent," sitting in a queue, awaiting service. Latency is a component of response time, not a synonym for it.

Why Averages Lie

If you are asked "what is the response time of your service?" and you answer with the average (arithmetic mean), you are hiding critical information. Imagine 100 requests: 95 take 10ms, 4 take 100ms, and 1 takes 5,000ms. The average is (95×10 + 4×100 + 1×5000) / 100 = 64.5ms. That number describes no actual user experience — nobody got a 64.5ms response. The typical user saw 10ms. The unlucky user saw 5 seconds.

Percentiles are the right tool. Sort all response times from fastest to slowest:

PercentileMeaningWhy it matters
p50 (median)Half of requests are faster than thisThe "typical" user experience
p9595% of requests are fasterThe experience for 1 in 20 users
p9999% of requests are fasterThe experience for 1 in 100 users — often your most valuable customers (heavy users)
p99999.9% of requests are fasterThe long tail; diminishing returns to optimize further

Amazon observed that a 100ms increase in response time reduced sales by 1%. They also found that the customers with the slowest response times were often those with the most data in their accounts — they had the most purchases, they were the most valuable customers. That is why Amazon cares about p999, not just p50.

Tail Latency Amplification

Here is where percentiles get dangerous. In a microservices architecture, a single user request often fans out to multiple backend services. If your request touches 10 backend services in parallel, the overall response time is determined by the slowest service. Even if each service has a fast p99, the chance that at least one of 10 calls hits its slow tail grows quickly.

// Probability that ALL 10 calls are fast (under p99):
P(all fast) = 0.9910 = 0.904

// So ~10% of requests hit at least one slow backend
// The overall p99 of the composite request is worse than p99 of each service

// With 50 backend calls:
P(all fast) = 0.9950 = 0.605
// ~40% of requests are slow! The effective p99 is closer to p80 of each service

This is tail latency amplification. It means that a service with a "good" p99 of 100ms can cause an overall request p99 of 1,000ms+ when you fan out across many services. The more services you call, the worse it gets.

Measuring Correctly

Measuring response times correctly is surprisingly tricky. Common mistakes:

Coordinated omission. If your load generator sends one request at a time and waits for the response before sending the next, it under-reports tail latency. When the server is slow, the generator slows down too, reducing load exactly when the system is stressed. Gil Tene (Azul Systems) calls this coordinated omission — the measurement tool and the system under test are coordinating to hide the problem. The fix: use a constant-rate load generator that sends requests at a fixed schedule regardless of response times. Tools like wrk2 and HdrHistogram are designed to avoid this.

Averaging percentiles. You cannot average percentiles. If server A has p99 = 50ms and server B has p99 = 500ms, the system p99 is NOT 275ms. You must merge the full response time distributions and compute the percentile from the merged set. In practice, this means your monitoring system must store histograms (like HDR Histogram or Prometheus histograms), not pre-computed percentiles.

Ignoring the client perspective. Server-side metrics miss queuing time, network latency, and client-side processing. Always measure response time from the client's perspective (end-to-end), not just server processing time. The user does not care that your server processed the request in 10ms if the total round trip took 500ms because of a congested network.

Percentile Explorer

1,000 simulated response times from a log-normal distribution. Adjust the distribution shape and fan-out count to see how percentiles change and how tail latency amplifies.

Tail weight 0.70
Backend fan-out 1

Worked Example: Amazon's Tail Latency

Amazon observed that customers with the most purchases (their most valuable customers) tended to have the slowest response times. Why? Because these customers had more data in their accounts — more order history, more recommendations to compute, more wish lists to merge. The very customers Amazon most wanted to keep happy were the ones most likely to hit the slow tail.

This is why Amazon optimizes for p999 (the slowest 1-in-1000 request), not just p50. Let us do the math for their checkout page, which calls multiple backend services:

// Amazon checkout page backend calls (simplified):
1. User service (account details): p99 = 30ms
2. Cart service (items in cart): p99 = 20ms
3. Inventory service (stock check): p99 = 50ms
4. Pricing service (discounts, tax): p99 = 40ms
5. Recommendation service: p99 = 80ms
6. Shipping service (delivery estimates): p99 = 60ms

// If called in parallel, overall p99 = ?
// P(all 6 under p99) = 0.99^6 = 0.941
// So ~6% of checkout requests hit at least one slow backend
// The effective p99 of the checkout page is dominated by the
// slowest call. If inventory service spikes to 2000ms at its p99,
// then 6% of checkout pages take 2+ seconds.

// Jeff Dean's "Tail at Scale" solution: hedged requests
// If a backend does not respond within the p95 latency,
// immediately send a duplicate request to a different replica.
// First response wins. This dramatically reduces tail latency
// at the cost of ~5% more backend load.
SLOs and SLAs. A Service Level Objective (SLO) is an internal target: "p99 response time under 200ms." A Service Level Agreement (SLA) is a contract with customers, usually with financial penalties: "99.9% of requests under 500ms, or we refund." SLAs are typically less aggressive than SLOs because you need margin. Always set SLOs tighter than SLAs. In interviews, state your SLO explicitly and explain the percentile — "Our SLO is p99 latency under 200ms" is a complete, professional statement.
Debug scenario: Your service has a p50 of 15ms and a p99 of 800ms. A single user request calls 5 backend services in parallel. Approximately what percentage of user-facing requests will experience a response time of 800ms or more?

Chapter 4: Scaling Up and Scaling Out

You have measured your load, you have measured your performance. Load is growing. What do you do?

Vertical Scaling (Scaling Up)

Scaling up means moving to a more powerful machine: more CPU cores, more RAM, faster disks, better network. It is the simplest approach — your software does not change at all, you just give it more horsepower. A single high-end server with 256 cores and 2TB of RAM can handle a remarkable amount of work.

The problem: cost grows super-linearly. A machine with twice the RAM does not cost twice as much — it costs four or eight times as much. And there is a hard ceiling: you cannot buy a machine with infinite resources. Eventually you hit the limit of what a single machine can do.

Horizontal Scaling (Scaling Out)

Scaling out means distributing the load across multiple smaller machines. Instead of one giant server, you have 20 commodity servers behind a load balancer. Cost grows linearly (20 servers cost 20x one server), and there is no hard ceiling — you can always add more machines.

The problem: complexity. A shared-nothing architecture (where each machine is independent and coordinates via network messages) requires your software to handle distribution explicitly. How do you split data across machines? How do you route requests? What happens when one machine fails? How do you keep data consistent across machines? These are the questions that fill the remaining chapters of DDIA.

Worked Example: The Numbers

Let us make this concrete with real cloud pricing (approximate, 2024 numbers):

ApproachConfigurationCost/monthCapacity
Vertical1x r6g.16xlarge (512 GB RAM, 64 vCPU)~$6,500~50K req/s
Horizontal8x r6g.2xlarge (64 GB RAM, 8 vCPU each)~$6,400~60K req/s + redundancy
Vertical 2x1x r6g.metal (512 GB, 64 vCPU + bare metal)~$9,800~55K req/s
Horizontal 2x16x r6g.2xlarge~$12,800~120K req/s + redundancy

At 1x load, vertical and horizontal cost about the same. At 2x load, vertical jumps 50% in cost for only 10% more capacity (you are buying expensive headroom). Horizontal doubles in cost but also doubles in capacity, and you get redundancy for free — if one of 16 machines dies, you still have 15.

The crossover point varies by workload, but a common rule of thumb: scale vertically until you hit ~$10K/month or the largest available instance, whichever comes first. After that, horizontal scaling is almost always more economical and more resilient.

Elastic Scaling

Elastic scaling means the system automatically adds or removes machines based on detected load. More traffic? Spin up more servers. Traffic drops? Shut them down and stop paying for them. Cloud platforms (AWS Auto Scaling, Kubernetes HPA) make this feasible, but it requires your application to be designed for it: stateless request handling, externalized session state, and fast startup times.

Elastic scaling has a critical subtlety: cold start time. If your application takes 30 seconds to boot (loading models, warming caches, establishing database connections), and a traffic spike arrives, you have 30 seconds of degraded service before the new machines are ready. This is why many teams "pre-warm" capacity: keep a few extra machines running at all times and only scale down conservatively. The cost of a few idle machines is small compared to the cost of a 30-second outage during Black Friday.

Scaling Decision Matrix

Here is a decision framework for when to use each scaling approach:

ScenarioBest ApproachWhy
Single DB, load < 10K req/sVertical scalingSimpler, no distributed complexity, cheaper at low scale
Read-heavy workload (90%+ reads)Read replicasAdd replicas for reads, keep single primary for writes. Linear read scaling.
Stateless web serversHorizontal + elasticTrivial to scale; add/remove behind load balancer based on CPU/memory
Write-heavy workloadShardingSplit data by key, each shard handles a fraction of writes. Most complex option.
Unpredictable traffic spikesElastic + pre-warmingAuto-scale up fast, keep minimum capacity for cold-start protection
Global users, low latencyMulti-regionData close to users. CDN for static, regional replicas for dynamic.
The scalability interview pattern. When an interviewer asks "How does your system scale?", they want to hear three things: (1) what is the bottleneck today (CPU? memory? disk I/O? network?), (2) what is the first scaling step (usually vertical or read replicas), and (3) what is the long-term strategy when the first step is exhausted (sharding, queuing, caching). Show that you will take the simplest step first and add complexity only when forced. "We start with a single big machine, add read replicas when reads exceed capacity, and shard only when writes exceed single-machine throughput."

The Reality: It Depends

Stateless services (web servers, API gateways, compute workers) are easy to scale horizontally. They hold no data — just add another server behind the load balancer and route requests to it. This is why the first thing you do in a system design interview is separate stateless and stateful components.

Stateful services (databases, caches with important data, coordination services) are hard to scale horizontally. This is the entire reason DDIA exists as a 500-page book. Splitting a database across multiple machines (sharding) introduces consistency problems, rebalancing complexity, cross-shard queries, and distributed transactions. We will cover all of these in later chapters.

The pragmatic answer. In practice, most systems use a mix. Stateless application servers scale horizontally behind a load balancer. The database scales vertically first (bigger machine is simpler), then eventually scales out (sharding, read replicas) when vertical limits are reached. Caches scale horizontally (consistent hashing across multiple Redis nodes). This hybrid approach gives you the simplicity of vertical scaling where it works and the capacity of horizontal scaling where you need it.
Scaling Cost Curves

Drag the load slider to see how costs compare between vertical scaling (bigger machine) and horizontal scaling (more machines). Horizontal has constant per-unit cost but adds coordination overhead.

Load multiplier 1x
// Vertical scaling cost (super-linear):
Costvertical = base_cost × load1.6

// Horizontal scaling cost (linear + overhead):
Costhorizontal = (base_cost × load) + (coordination_overhead × load × log(load))

// Crossover: horizontal becomes cheaper around 4-8x the initial load
// Before that, vertical is simpler and cheaper
System design: You are building a chat application. Messages are stored in a database, and users expect to see messages within 1 second of sending. Your current single PostgreSQL server handles the load, but traffic is growing 3x per year. What is your scaling strategy for the next 3 years?

Chapter 5: Maintainability

Here is an uncomfortable truth: the majority of the cost of software is not in the initial development. It is in the ongoing maintenance — fixing bugs, keeping it running, adapting it to new requirements, repaying technical debt, adding features. Most engineers do not like maintaining legacy systems. But the reality is that most engineering work is maintenance work, not greenfield development.

Kleppmann identifies three design principles that minimize maintenance pain:

1. Operability

Operability means making it easy for operations teams to keep the system running smoothly. A system with good operability provides:

Think of it this way: at 3 AM when something breaks, is your system helpful or hostile? Does it tell you what is wrong, or do you have to guess? Can you restart a component without taking down the whole system? Can a new team member understand the operational runbook in a day?

2. Simplicity

Simplicity means removing accidental complexity. Accidental complexity is complexity that is not inherent in the problem but arises from the implementation: tight coupling between modules, tangled dependencies, inconsistent naming, hacks that worked around problems without solving them. The Big Ball of Mud anti-pattern: a system with no discernible structure, where every change requires understanding the whole thing.

The best tool for removing accidental complexity is abstraction. A good abstraction hides implementation details behind a clean interface. SQL is an abstraction — you describe what data you want, not how to get it. Programming languages are abstractions over machine code. TCP is an abstraction over unreliable networks. Good abstractions are force multipliers: they let you reason about the system at a higher level without getting lost in details.

3. Evolvability

Evolvability (also called extensibility, modifiability, or plasticity) means making it easy to change the system in the future. Requirements change. You discover new facts. Regulations shift. New features are requested. A system with good evolvability can accommodate these changes without requiring a rewrite.

Evolvability is related to simplicity: simple systems are easier to change than complex ones. It is also related to good abstractions: if your system is organized into well-defined modules with clear interfaces, you can replace or modify one module without affecting others. Agile practices (test-driven development, refactoring, frequent releases) are techniques for achieving evolvability at the team process level.

A concrete example: imagine your application uses MySQL as its database. If the database is accessed through a clean repository interface (e.g., UserRepository.findById(id)), switching to PostgreSQL or DynamoDB requires changing only the repository implementation. If SQL queries are scattered throughout the codebase (in controllers, in templates, in background jobs), a database migration becomes a multi-month rewrite. The abstraction boundary determines the cost of change.

Kleppmann makes the point that we should design systems with the expectation that they will need to change. Requirements shift. Technologies evolve. What seemed like the right database choice three years ago may not be the right choice today. If your system is evolvable, adapting to change is routine. If it is not, every change is a crisis.

The maintainability test. Imagine a new engineer joins your team tomorrow. How long until they can: (1) understand the system architecture? (2) deploy a change safely? (3) debug a production issue at 3 AM? If the answer to any of these is "months," you have a maintainability problem. In interviews, always discuss how your design enables future engineers to work effectively — this signals staff-level thinking.

The Big Ball of Mud

The Big Ball of Mud is a software architecture anti-pattern identified by Brian Foote and Joseph Yoder in 1997. It describes a system with no discernible structure — a haphazardly assembled pile of code where every module depends on every other module. Changing anything requires understanding everything. Adding a feature to the payment system breaks the notification system because they share global mutable state through an undocumented side channel.

How does it happen? Never intentionally. It happens one shortcut at a time. A deadline pressure leads to a quick hack instead of a proper abstraction. The hack works, so nobody fixes it. Another feature builds on top of the hack. A third feature adds a workaround for a bug in the second feature. Six months later, the team has a system where the only person who understands module X is the person who wrote it — and they left the company.

The cost is measurable. A study by Stripe estimated that developers spend 42% of their time dealing with technical debt and maintenance, and that the global cost of developer time wasted on bad code is $85 billion per year. The business impact is not just developer happiness — it is speed. A company with a well-maintained codebase ships features in days. A company drowning in technical debt ships the same features in months.

Abstraction: The Antidote

The cure for accidental complexity is abstraction. A good abstraction lets you ignore irrelevant details and focus on what matters. Here are abstractions you use every day without thinking about them:

AbstractionHidesExposes
SQLHow data is stored on disk, which indexes exist, the query execution planA declarative language: "give me rows where X"
TCPPacket loss, reordering, duplication, network congestionA reliable, ordered byte stream between two endpoints
File systemDisk sectors, block allocation, wear leveling on SSDsNamed files and directories with read/write operations
REST APIThe database schema, the caching layer, the message queueResources with CRUD operations and JSON responses

Each of these abstractions converts a complex, leaky reality into a simple, clean interface. When you design a data system, your most important job is choosing or creating the right abstractions. The right abstraction at the right boundary turns a Big Ball of Mud into a set of clean, replaceable components.

Technical Debt Simulator

The simulation below models the cost of adding features to a system over time. A well-maintained system (good abstractions, clean code, automated tests) has roughly constant feature development cost. A poorly maintained system accumulates technical debt: each shortcut makes the next feature harder, until eventually the team spends most of its time fighting the system rather than improving it.

Maintainability Cost Curve

Watch how feature development velocity diverges between a well-maintained and a poorly-maintained system. The "debt ratio" slider controls how much technical debt accumulates per feature.

Debt ratio 10%
Scenario: Your team ships a feature by copy-pasting and modifying an existing module instead of refactoring the shared logic into an abstraction. This saves 2 weeks now. Six months later, a bug is found in the shared logic. What is the maintenance cost?

Chapter 6: Thinking About Data Systems

Time to put it all together. You are designing a data system. Not in theory — right now, in this simulation. You will pick components, set SLA targets, and watch your system handle load. Then we will break it.

Here is the scenario: you are building a backend for an e-commerce platform. Customers browse products (read-heavy), add items to carts (write-heavy, must not lose data), and checkout (critical path, must be fast and reliable). You need a database for product catalog and orders, a cache for fast reads, and a message queue for order processing.

The simulation below lets you configure each component and observe the tradeoffs in real time. Set your SLA (availability and p99 latency), then inject failures and traffic spikes to see if your system holds up.

Data System Design Simulator

Configure your system, set an SLA, then hit "Run Simulation" to observe behavior under load. Use "Inject Failure" to take down a component and see how the system degrades.

Traffic (req/s) 1,000
Cache hit rate 90%
DB capacity (req/s) 2,000
SLA: p99 latency (ms) 200ms
The fundamental lesson. When you combine a database, a cache, and a message queue in your application, YOU become the data system designer. Your application code is the glue that defines the guarantees. The cache improves read latency but introduces stale data. The message queue decouples producers from consumers but adds delivery delay. The database provides durability but has limited throughput. Every architectural choice is a tradeoff. Kleppmann's entire book is about understanding these tradeoffs rigorously.

Understanding the Simulation

Let us break down what is happening inside the simulation so you understand the model:

// Traffic flow:
Total requests = traffic slider (req/sec)
Cache hits = traffic × cache_hit_rate%  (served in ~2-7ms)
Cache misses = traffic × (1 - cache_hit_rate%)  (go to database)

// Database behavior:
DB load = cache misses (req/sec)
If DB load < DB capacity: latency = 10-30ms (normal)
If DB load > 80% capacity: latency increases (congestion)
If DB load > 100% capacity: requests start failing (overload)

// The key equation:
DB load = traffic × (1 - cache_hit_rate / 100)
// To keep DB under capacity:
cache_hit_rate > (1 - DB_capacity / traffic) × 100

// Example: traffic = 10,000, DB capacity = 2,000
// Minimum cache hit rate = (1 - 2000/10000) × 100 = 80%

This model is simplified (real systems have connection pools, queue depths, retry logic), but it captures the essential dynamic: the cache shields the database from the full force of traffic. Without a cache, your database capacity IS your traffic ceiling. With a cache, your ceiling is DB_capacity / (1 - hit_rate).

Play with the simulation above. Here are some experiments to try:

  1. The cache cliff. Set traffic to 5,000 req/s with DB capacity of 2,000 req/s and cache hit rate of 50%. Watch the DB get overwhelmed (50% of 5,000 = 2,500 > 2,000 capacity). Now raise cache hit rate to 95% — the DB only sees 250 req/s, well within capacity. The cache is doing 95% of the work.
  2. Failure injection. Set everything to comfortable levels, then inject a failure. If the cache fails, ALL traffic hits the database. If you were relying on a 95% cache hit rate, your DB suddenly sees 20x the load. This is the thundering herd problem — a cache failure causes a database failure causes a total outage.
  3. Tight SLA. Set p99 SLA under 100ms and see how little headroom you have. Cache latency is 2-7ms (fine), but any DB congestion pushes p99 over the limit. Tight SLAs demand high cache hit rates AND low DB utilization.
  4. The capacity equation. With 10,000 req/s and 2,000 DB capacity, you need at least 80% cache hit rate. Set it to 79% and watch the system degrade. Set it to 81% and watch it stabilize. The boundary is sharp — there is almost no graceful degradation between "fine" and "overwhelmed."
Why caches are not optional. In the simulation, removing the cache (setting hit rate to 0%) makes the database the bottleneck at any meaningful traffic level. This is why every data-intensive application uses a cache layer. But caches introduce their own problems: stale data, cache invalidation ("the two hardest problems in computer science"), cold start after a restart, and the thundering herd on cache failure. Every solution creates new problems. That is the nature of data systems engineering.
The SLA budget game. In real systems, you have an "error budget." If your SLA is 99.9% availability (8.76 hours of downtime per year), you can spend that budget deliberately: deploying risky changes, running chaos engineering experiments, doing maintenance. Once the budget is consumed, you freeze changes and focus on reliability. Google popularized this concept in the SRE book. It turns reliability from a vague aspiration into a quantitative engineering practice.

Chapter 7: Real-World Tradeoffs

Theory is necessary but not sufficient. Let us look at how real companies made the tradeoffs we have been discussing. Every one of these decisions started with a specific load parameter, a specific failure mode, and a specific business requirement.

Case Study 1: Amazon Shopping Cart (Availability over Consistency)

Amazon's shopping cart must always be available. If a customer adds an item to their cart and the system says "sorry, try again later," Amazon loses the sale. The 2007 Dynamo paper made this explicit: the shopping cart sacrifices consistency for availability. If two data centers disagree about what is in the cart, the system merges both versions (potentially adding items that were removed) rather than rejecting the write. It is better to occasionally have a phantom item in the cart (customer removes it at checkout) than to ever lose an item the customer added.

The business reasoning: a false positive (extra item in cart) costs Amazon a brief moment of customer confusion. A false negative (lost item from cart) costs Amazon a sale. The asymmetry in business cost drives the technical tradeoff.

Case Study 2: Google Search (Latency over Completeness)

When you search Google, your query fans out to thousands of index servers. Some servers may be slow or unresponsive. Google's response: return results within 200ms even if not all index servers have responded. The user gets 98% of the results instantly rather than 100% of the results in 2 seconds. Most users never notice the missing 2% because the most relevant results are on the fastest servers (Google optimizes for this).

The tradeoff: completeness vs. latency. Google chooses latency because user studies show that slower results cause users to search less, even if the results are better. A 400ms delay reduces search volume by 0.6%. Fast-but-incomplete beats slow-but-complete.

Case Study 3: Banking (Consistency over Availability)

When you transfer $1,000 from checking to savings, the bank must guarantee that the money is either in checking OR savings, never in both, never in neither. This is ACID consistency: the database will reject the operation entirely rather than leave the data in an inconsistent state. If the system is partitioned and cannot guarantee consistency, it becomes unavailable rather than risk double-spending or losing money.

The tradeoff: availability vs. consistency. Banks choose consistency because the cost of inconsistency (losing money, regulatory violations, fraud) vastly exceeds the cost of brief unavailability (customer waits a few seconds, tries again).

Case Study 4: Social Media (It Depends on the Feature)

Social media platforms are interesting because different features have different tradeoff profiles on the same platform:

FeatureTradeoffWhy
Likes/commentsAvailability over consistency (eventual consistency)If a like count is off by 1 for a few seconds, nobody cares
Direct messagesConsistency over latencyMessages must arrive in order and none must be lost
Payments (creator fund)Strong consistencyMoney must be exactly right, always
Feed rankingLatency over completenessShow the best-guess feed fast rather than the perfect feed slow
Account deletionConsistency over speedRegulatory (GDPR): deletion must be complete and verifiable

Case Study 5: Netflix (Availability over Perfection)

Netflix serves 230 million subscribers across 190 countries. Their philosophy: it is better to show you something than to show you nothing. If the recommendation engine is slow or down, Netflix falls back to showing popular content in your region. If personalized artwork cannot be generated, it shows the default poster. If one microservice is failing, circuit breakers isolate it and the rest of the system continues.

Netflix pioneered chaos engineering with Chaos Monkey (randomly kills production servers) and later Chaos Kong (simulates the failure of an entire AWS region). The philosophy: if you have not tested a failure mode in production, you do not know if your system survives it. By breaking things intentionally and regularly, Netflix ensures their fallback paths actually work.

The tradeoff: graceful degradation over correctness. A slightly wrong recommendation is better than no content at all. This only works because Netflix's core product (video streaming) can tolerate imprecision. You would not want your bank to show you "approximately your balance."

The interview pattern. When asked "What consistency model would you use?" the answer is always "It depends on the feature." Show the interviewer you understand that a single application has multiple consistency requirements by listing 2-3 features with different needs. This demonstrates systems thinking, not textbook regurgitation.

The Tradeoff Framework

Every system design decision can be framed as a choice between competing goods. Here is a framework for thinking through any tradeoff:

1. Identify the competing properties
What are you trading? Consistency vs. availability? Latency vs. completeness? Simplicity vs. performance? Cost vs. reliability?
2. Quantify the business cost of each side
What is the dollar cost of 1 minute of downtime? What is the dollar cost of showing stale data? What is the cost of a 100ms increase in latency? Amazon found: 100ms latency = 1% sales loss.
3. Choose based on asymmetry
Almost always, one side is much more expensive than the other. The shopping cart: losing an item (cost: lost sale) vs. extra item (cost: minor confusion). The asymmetry makes the decision obvious.
4. Implement different tradeoffs for different features
Not every feature in your application needs the same guarantees. Payments need strong consistency. Like counts need availability. Profile pictures can be eventually consistent.

Putting Numbers to Tradeoffs

Abstract tradeoffs become concrete when you attach dollar amounts. Here is how real companies quantify them:

CompanyMetricCostSource
Amazon100ms added latency1% revenue loss (~$5B/year at current revenue)Greg Linden, 2006
Google500ms added to search20% fewer searches per userMarissa Mayer, 2006
Walmart1 second faster page load2% increase in conversionWalmart Labs, 2012
Amazon (AWS)1 hour of S3 downtime~$150M in affected commerce2017 S3 outage estimate
Facebook6 hours of total outage~$100M in lost ad revenue2021 BGP outage
RobinhoodTrading outage during volatility$70M FINRA fine + class-action settlement2020 outage, 2021 fine

These numbers change the conversation. When a product manager says "just make it work," you can respond: "We can achieve 99.9% availability for $X/month (3 replicas), or 99.99% for $3X/month (5 replicas across 3 zones). Based on our revenue, each nine of availability is worth $Y/year. Which level makes business sense?" This is engineering leadership: turning vague requirements into quantified tradeoffs.

The cost of downtime formula.
Cost per minute of downtime = (annual revenue / 525,600 minutes) × impact factor

Impact factor depends on the service: a checkout page has impact factor ~1 (lost revenue = revenue rate). A search page has impact factor ~0.1 (users retry later). An internal tool has impact factor ~0.01 (employees wait).

For a company with $1B annual revenue, 1 minute of checkout downtime costs ~$1,900. One hour costs $114,000. Four nines (4.4 minutes/month of downtime) costs ~$8,400/month in expected lost revenue.

Design Challenge

Tradeoff Mapper

Click on each system to see where it sits on the availability-consistency spectrum and why. Each position represents a deliberate engineering decision driven by business requirements.

Design challenge: Flight booking system. You are building a flight booking system. When a customer clicks "Book," what tradeoffs do you make?

Think about: (1) Can two customers book the same seat? (Strong consistency required for seat assignment.) (2) Can the booking confirmation be delayed by a few seconds? (Availability: the user should see "confirmed" quickly.) (3) What happens if a data center goes down mid-booking? (The booking should not be lost — durability.) (4) What about showing seat availability while browsing? (Eventual consistency is fine — seats may show as available for a few seconds after being booked.)

The answer: Strong consistency for the actual booking (double-booking a seat is unacceptable), eventual consistency for the browsing view (a brief lag is fine), and high durability everywhere (losing a confirmed booking is catastrophic). This is the same pattern as social media: different features, different consistency requirements.
Interview question: A startup is building a collaborative document editor (like Google Docs). Two users edit the same paragraph simultaneously. The founder says "we need strong consistency so users always see the same thing." What is wrong with this reasoning, and what would you recommend?

Chapter 8: The CAP Theorem

Every systems design discussion eventually invokes the CAP theorem. It is the most frequently cited — and most frequently misunderstood — result in distributed systems. Let us get it right.

The Three Properties

Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. (This is linearizability, a specific and strong form of consistency — not to be confused with the C in ACID, which means something different.)

Availability (A): Every request receives a (non-error) response, without guarantee that it contains the most recent write. The system is always responsive, even if the data it returns is stale.

Partition tolerance (P): The system continues to operate despite network partitions — messages between nodes being lost or delayed. A partition means some nodes cannot communicate with other nodes.

The Actual Theorem

The CAP theorem (proved by Gilbert and Lynch in 2002, based on Brewer's 2000 conjecture) states: in the presence of a network partition, a distributed system must choose between consistency and availability. You cannot have both.

The common phrasing "pick two out of three" is misleading. Here is why: you do not get to opt out of partition tolerance. Network partitions happen. Switches fail, cables get cut, cloud regions lose connectivity. Any distributed system must tolerate partitions. So the real choice is:

CP: Consistent + Partition-tolerant
During a partition, the system refuses to serve requests that might return stale data. It becomes unavailable to preserve consistency. Example: a banking system that returns "service unavailable" rather than risk showing a wrong balance. (PostgreSQL with synchronous replication, HBase, MongoDB with majority reads.)
or
AP: Available + Partition-tolerant
During a partition, the system continues serving requests but may return stale data. It sacrifices consistency for availability. Example: a shopping cart that always accepts writes, merging conflicts later. (Cassandra, DynamoDB, CouchDB.)

"CA" (consistent and available, no partition tolerance) is not a real option for distributed systems. It describes a single-node system (like a single PostgreSQL server) — which is indeed both consistent and available, but only because there is no network partition possible (there is only one node). The moment you add a second node, partitions become possible.

PACELC: The Better Mental Model

CAP only describes behavior during a partition. But partitions are rare. What about the other 99.9% of the time? Daniel Abadi proposed PACELC (2012): "If there is a Partition (P), choose Availability (A) or Consistency (C). Else (E), when the system is running normally, choose Latency (L) or Consistency (C)."

SystemDuring PartitionNormal OperationPACELC
DynamoDBAvailability (A)Latency (L)PA/EL
CassandraAvailability (A)Latency (L)PA/EL
PostgreSQL (sync repl.)Consistency (C)Consistency (C)PC/EC
MongoDB (majority)Consistency (C)Latency (L)PC/EL
Cosmos DBConfigurableConfigurableTunable

PACELC is more useful than CAP because it captures the latency-consistency tradeoff that systems face all the time, not just during rare partition events.

The Consistency Spectrum

CAP's binary "consistent or not" hides a rich spectrum of consistency models. In practice, you do not choose between "consistent" and "inconsistent" — you choose a point on a continuum:

Consistency ModelGuaranteeLatency CostExample Use Case
Linearizability (strongest)All operations appear to execute atomically in real-time orderHighest (requires coordination)Leader election, distributed locks
Sequential consistencyAll operations appear in some total order consistent with each process's local orderHighShared memory systems
Causal consistencyOperations that are causally related appear in order; concurrent operations may appear in any orderMediumSocial media (replies appear after the post they reply to)
Read-your-writesA client always sees its own writesLowUser profile updates (you see your own changes immediately)
Eventual consistency (weakest)If no new writes occur, all replicas eventually converge to the same valueLowestDNS, like counts, CDN caches

Each step down the spectrum gives you better latency and availability at the cost of weaker guarantees. The art of distributed systems design is choosing the weakest consistency model that your application can tolerate — because stronger consistency is always more expensive.

Why CAP is oversimplified but still useful. CAP uses binary definitions: a system is either consistent or it is not, available or it is not. Reality has gradations — eventual consistency, causal consistency, read-your-writes consistency. CAP also says nothing about latency, throughput, or durability. Despite these limitations, CAP is useful as a conversation starter. In interviews, mention CAP to set the context, then immediately pivot to PACELC or specific consistency models to show depth. Never stop at "you pick two."
CAP/PACELC System Map

Real database systems plotted on the consistency-availability spectrum. Click any system to see its tradeoff details. The horizontal axis shows behavior during partitions (CP vs AP), and the vertical axis shows behavior during normal operation (low latency vs strong consistency).

Interview question: A candidate says "We use Cassandra because it is AP, meaning we get both availability and partition tolerance." What important caveat is missing from this statement?

Chapter 9: Interview Arsenal

This chapter is your cheat sheet. Everything from Chapters 0-8, compressed into the forms you will use in interviews and on the job.

The Three Pillars — Definitions

PillarOne-sentence definitionKey metricsKey techniques
ReliabilityThe system works correctly even when things go wrongAvailability (nines), MTTF, MTTR, error rateRedundancy, fault isolation, chaos engineering, graceful degradation
ScalabilityThe system copes with growth in data, traffic, or complexityThroughput (req/s), response time percentiles, load parametersVertical scaling, horizontal scaling (sharding, replicas), caching, async processing
MaintainabilityFuture engineers can work with the system productivelyTime to onboard, time to deploy, incident resolution timeGood abstractions, documentation, automation, simplicity, evolvability

Numbers Every Engineer Should Know

Jeff Dean (Google) published a list of latency numbers that every systems engineer should have memorized. These numbers help you do back-of-envelope calculations in interviews and design discussions. Here are updated figures for modern hardware (circa 2024):

OperationLatencyRelative Scale
L1 cache reference~1 ns1x
L2 cache reference~4 ns4x
Main memory (DRAM) reference~100 ns100x
SSD random read~16 μs16,000x
HDD random read (seek)~2 ms2,000,000x
Network round-trip (same datacenter)~0.5 ms500,000x
Network round-trip (cross-continent)~150 ms150,000,000x
Read 1 MB sequentially from SSD~50 μs
Read 1 MB sequentially from HDD~2 ms
Read 1 MB from network (10 Gbps)~0.8 ms
The memory-disk-network hierarchy. Memory is 1,000x faster than SSD, which is 100x faster than HDD. Network within a datacenter is comparable to SSD speed. Network across continents is comparable to HDD speed. This hierarchy drives every caching and data placement decision: keep hot data in memory, warm data on SSD, cold data on HDD. Every cache miss moves you one level down the hierarchy and costs 100-1000x more latency.

Back-of-Envelope Estimation Pattern

In interviews, you will be asked to estimate things like "How many servers do you need?" or "How much storage?" Here is the pattern:

// Step 1: Estimate the load
Users: 100 million monthly active
Daily active: ~30% = 30 million
Requests per user per day: ~50
Total: 30M × 50 = 1.5 billion req/day

// Step 2: Convert to per-second
Average: 1.5B / 86400 ≈ 17,000 req/sec
Peak (2-3x average): ~50,000 req/sec

// Step 3: Size the servers
One server handles: ~5,000 req/sec (typical for a web server)
Servers needed: 50,000 / 5,000 = 10 servers (at peak)
With 2x headroom: 20 servers

// Step 4: Size the storage
Data per request: ~1 KB (if storing)
Daily: 1.5B × 1 KB = 1.5 TB/day
Yearly: ~550 TB ≈ 0.5 PB
With replication (3x): ~1.5 PB

These estimates are deliberately rough — the point is not precision, it is getting the right order of magnitude. "We need about 20 servers and half a petabyte per year" is a useful statement in a design discussion. "We need exactly 17.3 servers" is false precision.

Complete System Design Walkthrough

Let us walk through a full system design answer using the framework from this lesson. The question: "Design a URL shortener like bit.ly."

Step 1: Describe the load.
"Let me start by estimating the load parameters. Assume 100 million new URLs per month (write-heavy initially), and 10 billion redirects per month (read-heavy in steady state). That is ~40 writes/sec and ~4,000 reads/sec on average. Read:write ratio is 100:1. Each URL record is ~500 bytes. 100M URLs × 500 bytes = 50 GB/month, so ~600 GB/year. Data fits on a single machine for years."
Step 2: Define the SLA.
"Redirects must be fast — p99 under 50ms. This is the core user experience. Availability for reads: 99.99% (shortened URLs that do not work are useless). Availability for writes: 99.9% is fine (creating a short URL can tolerate brief downtime). Consistency: eventual consistency is acceptable — a 1-second delay between creating a URL and it being redirect-able is fine."
Step 3: Identify the bottleneck.
"Read-heavy (100:1 ratio). The bottleneck is redirect lookup latency. Reads are simple key-value lookups (short code to long URL). This is perfect for caching — most shortened URLs are accessed repeatedly. A cache with 80% hit rate means only 800 reads/sec hit the database. Any modern database handles this easily."
Step 4: Separate stateless from stateful.
"API servers are stateless — scale horizontally behind a load balancer. The database is stateful — start with a single PostgreSQL instance (50 GB fits easily). Add read replicas if read load grows beyond single-machine capacity. Add a Redis cache in front for hot URLs. Use a CDN for the HTTP 301 redirects themselves."

Notice how every answer references concepts from this lesson: load parameters, SLO/SLA, read:write ratio, percentiles, cache hit rates, stateless vs. stateful, vertical first then horizontal. The framework gives you structure; the numbers give you credibility.

System Design Opening Moves

When you get a system design question, start with these questions to establish the tradeoff space:

1. Describe the Load
"What are the key load parameters? Read/write ratio? Requests per second? Data volume? Growth rate?" This shows you think quantitatively.
2. Define the SLA
"What is our availability target? What is our p99 latency budget? Which is more important: consistency or availability?" This shows you think about tradeoffs.
3. Identify the Bottleneck
"Is the system read-heavy or write-heavy? Is the bottleneck CPU, memory, disk I/O, or network? Where is the fan-out?" This shows you think about where to optimize.
4. Separate Stateless from Stateful
"Stateless services scale trivially. Stateful services are the hard part. What data do we need to persist, and what are the access patterns?" This shows you know where complexity lives.

Quick-Reference: Availability Budget

AvailabilityDowntime/yearDowntime/monthTypical system
99% (two nines)3.65 days7.3 hoursInternal tools, batch systems
99.9% (three nines)8.76 hours43.8 minutesSaaS products, e-commerce
99.99% (four nines)52.6 minutes4.4 minutesPayment systems, core infrastructure
99.999% (five nines)5.26 minutes26.3 secondsTelecom, 911 systems

Coding Drill: Percentile Calculator

python
def percentile(data, p):
    """Calculate the p-th percentile of a list of values.
    p is between 0 and 100. Uses nearest-rank method.
    """
    if not data:
        raise ValueError("Empty dataset")
    sorted_data = sorted(data)
    n = len(sorted_data)
    rank = (p / 100) * (n - 1)
    lower = int(rank)
    upper = lower + 1
    weight = rank - lower
    if upper >= n:
        return sorted_data[lower]
    return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight

# Usage:
latencies = [12, 15, 11, 14, 800, 13, 12, 16, 11, 13]
print(f"p50: {percentile(latencies, 50)}ms")   # ~13ms
print(f"p99: {percentile(latencies, 99)}ms")   # ~800ms
print(f"mean: {sum(latencies)/len(latencies)}ms") # ~91.7ms (misleading!)

Coding Drill: Availability Calculator

python
def series_availability(components):
    """Availability of N components in series (all must work)."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(a, replicas=2):
    """Availability with N redundant replicas."""
    return 1 - (1 - a) ** replicas

# 5 services, each 99.9%
services = [0.999] * 5
print(f"Series: {series_availability(services)*100:.4f}%")
# 99.5010%  (~1.8 days downtime/year)

# Same 5 services, each with a redundant replica
redundant = [parallel_availability(0.999) for _ in range(5)]
print(f"With redundancy: {series_availability(redundant)*100:.6f}%")
# 99.999500%  (~2.6 minutes downtime/year)

Coding Drill: Simple Load Generator

python
import time
import random
import statistics

def simulate_service(base_latency_ms=10, p99_spike_ms=500):
    """Simulate a service with realistic latency distribution."""
    if random.random() < 0.01:  # 1% chance of slow response
        return base_latency_ms + random.expovariate(1/p99_spike_ms)
    return base_latency_ms + random.expovariate(1/5)

def simulate_fanout(n_backends, n_requests=10000):
    """Simulate tail latency amplification with n backend calls."""
    latencies = []
    for _ in range(n_requests):
        # Each request calls n_backends in parallel
        # Overall latency = max of all backends
        backend_latencies = [simulate_service() for _ in range(n_backends)]
        latencies.append(max(backend_latencies))
    return latencies

# Compare: 1 backend vs 10 backends
for n in [1, 5, 10, 20]:
    lats = simulate_fanout(n)
    lats.sort()
    print(f"Backends={n:2d}  p50={lats[4999]:.0f}ms  "
          f"p99={lats[9899]:.0f}ms  p999={lats[9989]:.0f}ms")
# Output (typical):
# Backends= 1  p50=  15ms  p99=  52ms   p999= 487ms
# Backends= 5  p50=  22ms  p99= 498ms   p999=1203ms
# Backends=10  p50=  26ms  p99= 823ms   p999=1847ms
# Backends=20  p50=  30ms  p99=1156ms   p999=2534ms

Notice how p50 grows slowly (the median is stable) but p99 explodes with more backends. This is tail latency amplification in action. With 20 backends, p99 is 38x the single-backend p50. This is why microservice architectures must be designed with tail latency in mind from day one.

Coding Drill: Error Budget Calculator

python
def error_budget(sla_percent, period_days=30):
    """Calculate allowed downtime for a given SLA."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - sla_percent / 100)
    return {
        'sla': f'{sla_percent}%',
        'period': f'{period_days} days',
        'budget_minutes': round(allowed_downtime, 2),
        'budget_human': format_duration(allowed_downtime)
    }

def format_duration(minutes):
    if minutes < 1: return f'{minutes*60:.1f} seconds'
    if minutes < 60: return f'{minutes:.1f} minutes'
    if minutes < 1440: return f'{minutes/60:.1f} hours'
    return f'{minutes/1440:.1f} days'

# Monthly error budgets:
for sla in [99, 99.9, 99.99, 99.999]:
    result = error_budget(sla)
    print(f"SLA {result['sla']:>8s}  Budget: {result['budget_human']:>14s}")
# SLA      99%  Budget:      7.3 hours
# SLA    99.9%  Budget:     43.8 minutes
# SLA   99.99%  Budget:      4.4 minutes
# SLA  99.999%  Budget:     26.3 seconds
The error budget mindset. Google's SRE team reframed reliability from "minimize downtime" to "spend your error budget wisely." If your SLA is 99.9% and you have used only 10 minutes of your 43.8-minute monthly budget, you have 33 minutes to "spend" on risky deployments, infrastructure changes, or chaos engineering experiments. This turns reliability from a fear-driven constraint into a quantitative resource to be managed.

Debug Scenarios

ScenarioLikely causeWhat to check
p99 latency spiked 10xTail latency amplification, GC pauses, slow dependency, noisy neighborTrace slow requests end-to-end, check each backend's p99 individually, check GC logs, check for resource contention
System went down during traffic spikeNo elastic scaling, no back-pressure, cascading failureCheck if auto-scaling was configured, check if the system has circuit breakers, check for thundering herd effects
Data inconsistency after deployMissing migration, stale cache, split-brainCheck cache invalidation, check replication lag, check for concurrent writes during schema migration
Intermittent "service unavailable"Health check flapping, DNS issues, connection pool exhaustionCheck connection pool sizes, check health check configuration, check DNS TTL

Recommended Reading

Final drill: You are on-call and receive an alert: p99 latency has gone from 150ms to 1,500ms. The system has 8 microservices in the critical path. p50 is unchanged at 15ms. What is your first hypothesis, and what do you check first?

Chapter 10: Connections

Chapter 1 of DDIA sets the vocabulary. Every subsequent chapter explores one dimension of the tradeoff space in depth. Here is how this lesson connects to what comes next.

This Lesson's ConceptWhere It Goes NextDDIA Chapter
Data models (the blurring boundaries)Relational vs document vs graph models — choosing the right data model for your access patternsChapter 2: Data Models
Reliability (fault tolerance)Replication: keeping copies of data on multiple machines so the system survives failuresChapter 5: Replication
Scalability (handling growth)Partitioning (sharding): splitting a large dataset across multiple machines so no single node is a bottleneckChapter 6: Partitioning
Consistency (CAP, PACELC)Transactions: grouping reads and writes into atomic units with consistency guaranteesChapter 7: Transactions
Fan-out (Twitter problem)Stream processing: handling high-throughput event flows in real timeChapter 11: Stream Processing
Batch vs. real-timeBatch processing: the economics of processing large datasets efficiently (MapReduce and beyond)Chapter 10: Batch Processing

Key Takeaways

The one thing to remember. There is no such thing as a "best" data system. There are only tradeoffs. Every architectural decision sacrifices something to gain something else. The mark of a senior engineer is not knowing the "right answer" — it is knowing which tradeoffs are acceptable for a given business context, quantifying them, and communicating them clearly. Reliability, scalability, and maintainability are not checkboxes. They are dials. Your job is to set each dial to the right level for your specific situation.

The Vocabulary You Now Own

After this lesson, you should be able to use these terms precisely in conversation, in interviews, and in design documents. Each term has a specific meaning that we derived from first principles:

TermPrecise MeaningCommon Misuse
FaultOne component deviating from specConfused with "failure" (whole system down)
FailureThe system stops providing the required serviceUsed interchangeably with "fault"
Load parameterA number describing the shape of workload (req/s, ratio, count)Vague statements like "high traffic"
Fan-outNumber of downstream ops triggered by one incoming opUsed only for messaging; it applies to any amplification
Percentile (p99)99% of requests are faster than this valueConfused with "average" or "typical"
Tail latency amplificationFan-out to N services makes the composite p99 worse than individual p99Ignored until production breaks
SLO vs SLASLO = internal target; SLA = contractual guarantee with penaltiesUsed interchangeably
Error budgetAllowed downtime = 1 - SLA, a resource to be spent wiselyTreated as a punishment rather than a tool
Accidental complexityComplexity from the implementation, not the problem domainConfused with inherent problem complexity

Beyond DDIA Chapter 1

This lesson covered the conceptual framework. The rest of the DDIA series goes deep on each mechanism:

Every one of these topics is a deeper exploration of the tradeoffs we introduced here. The vocabulary from this lesson — faults vs. failures, load parameters, percentiles, fan-out, consistency vs. availability, series reliability, tail latency amplification — will appear in every subsequent chapter.

A Map of the Tradeoff Landscape

If you zoom out far enough, every decision in distributed systems is a point in a multi-dimensional tradeoff space. Here are the axes:

AxisOne EndOther EndDDIA Chapters
Consistency ↔ AvailabilityLinearizability (strict)Eventual consistency (lax)Ch 5, 7, 9
Latency ↔ DurabilityIn-memory only (fast, volatile)Sync to disk + replicas (slow, durable)Ch 3, 5
Throughput ↔ IsolationNo isolation (dirty reads, fast)Serializable (safe, slow)Ch 7
Simplicity ↔ PerformanceSingle node (simple, limited)Distributed (complex, scalable)Ch 5, 6, 8, 9
Freshness ↔ CostReal-time updates (expensive)Batch updates (cheap, stale)Ch 10, 11

Every system sits at a particular point in this space. There is no system that optimizes all axes simultaneously — that is the fundamental theorem of data systems design. Your job is to find the point that best serves your users and your business, and to be able to articulate why you chose that point over the alternatives.

The Closing Challenge

Here is a thought experiment to carry with you. The next time you use any application — checking email, ordering food, streaming a video, sending a payment — pause and ask yourself:

Once you start seeing systems through this lens, you will never look at software the same way again. Every loading spinner is a latency tradeoff. Every "try again later" is a reliability decision. Every stale notification is a consistency choice. You are now equipped to understand — and make — these decisions.

Further Reading

If this lesson resonated with you, here are the resources that go deeper into each topic:

ResourceTopicWhy Read It
DDIA Chapters 2-12EverythingThe rest of the book builds directly on Chapter 1. Each chapter is a deep dive into one aspect of the tradeoff space.
Google SRE Book (free online)Reliability, SLOs, error budgetsThe practical handbook for running reliable systems at scale. Chapter 4 on SLOs is essential.
"The Tail at Scale" — Jeff DeanTail latencyThe definitive paper on why tail latency matters and how to mitigate it (hedged requests, tied requests).
Amazon Dynamo Paper (2007)Availability vs consistencyThe paper that launched the NoSQL movement. Shows how Amazon sacrifices consistency for availability.
"Harvest, Yield, and Scalable Tolerant Systems" — Fox & BrewerCAP nuancesA more nuanced take on CAP from one of its creators. Introduces harvest (completeness) and yield (availability).
"Consistency Tradeoffs in Modern Distributed DB Design" — AbadiPACELCThe paper that extends CAP to cover the non-partition case. Essential for understanding real system tradeoffs.
"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise."
— Edsger W. Dijkstra

Reliability, scalability, and maintainability are not vague aspirations. They are precise engineering properties with quantitative definitions. That precision is what separates a junior engineer who says "we need to scale" from a senior engineer who says "we need to handle 10x read throughput at p99 under 200ms with 99.99% availability, and here is the cost model for three architectural approaches."