Reliability, scalability, maintainability — the vocabulary of systems design.
You open your phone and check Twitter. In the two seconds it takes to load your timeline, here is what happens behind the curtain: a load balancer routes your request to one of thousands of application servers. That server checks a cache (Redis, maybe Memcached) to see if your timeline was recently computed. Cache miss. It queries a database (probably a sharded MySQL or PostgreSQL cluster) for tweets from the 400 accounts you follow. It calls a search index (Elasticsearch or a custom inverted index) to find trending topics. It pushes an analytics event into a message queue (Kafka) so a downstream batch processing pipeline can update your engagement scores overnight. Meanwhile, a stream processor is watching Kafka in real time, updating global trending topics as tweets flow in.
Six different kinds of data systems. One request. And the user just sees a feed that loaded in under a second. If any one of those systems is slow, the whole experience degrades. If any one loses data, users see corrupted timelines. If any one goes down completely, the application must decide: do we serve a degraded experience or show an error page?
These are not hypothetical questions. They are the daily reality of every engineering team building modern applications. And the answers are never obvious — they depend on the specific requirements, the specific workload, and the specific tradeoffs you are willing to make.
This is the defining reality of modern software: most applications today are data-intensive, not compute-intensive. The CPU is rarely the bottleneck. The hard problems are the amount of data, the complexity of data, and the speed at which data changes.
Martin Kleppmann opens DDIA with a simple observation: we have gotten very good at making CPUs fast. What we have not gotten good at is making data systems that are reliable, scalable, and easy to maintain. That is what DDIA is about, and that is what this lesson covers.
Textbooks draw clean boxes: "this is a database," "that is a message queue." Reality is messier. Redis is a cache that can persist to disk like a database. Apache Kafka is a message queue that can store data durably for weeks, acting as a distributed log. Elasticsearch is a search engine that many teams use as their primary data store. The categories blur because each tool optimizes for a different set of tradeoffs, and sometimes those tradeoffs overlap.
When you stitch these tools together in your application code, you are designing a new data system. Your application's API hides the internal complexity. To the outside world, your service is a black box that promises certain guarantees: "This query will return results in under 200ms," "No data will be lost after a confirmed write," "The system will handle 10,000 concurrent users."
Making those promises — and keeping them — is harder than it sounds. A cache improves speed but introduces stale data. Replication improves availability but introduces consistency problems. A message queue decouples services but adds latency. Every tool you add solves one problem and creates another. The art of data systems engineering is choosing the combination of tradeoffs that best serves your users.
Those promises come down to three properties. Martin Kleppmann calls them the three concerns that are important in most software systems:
Before we dive into each property, let us see what a typical data-intensive application looks like. The simulation below shows data flowing between the components you just read about. Watch how a single user request touches multiple systems.
Click "Send Request" to trace a user request through a typical web application. Watch data flow between the database, cache, search index, and message queue.
Notice something important: the application server is the orchestrator. It decides which component to call, in what order, and what to do when one fails. If the cache is down, do you skip it and hit the database directly? If the message queue is full, do you drop the analytics event or block the user's request? These are design decisions, and every one of them is a tradeoff.
There was a time when CPU speed was the bottleneck. In the 1990s, sorting a million records was a genuine computational challenge. Today, a single laptop can sort a billion integers in under a minute. CPUs got fast enough that for most applications, raw computation is cheap.
What did not get cheap: moving data around. Reading from disk. Waiting for a network round-trip. Synchronizing state across three data centers on different continents. The speed of light imposes a hard floor: a round-trip from New York to London takes at minimum 56 milliseconds at light speed through fiber — and real-world latency is 2-3x that. No amount of CPU improvement can fix this.
This is why the interesting problems are data problems: how to store it (databases), how to cache it (Redis, Memcached), how to search it (Elasticsearch), how to move it (Kafka, RabbitMQ), how to process it (Spark, Flink), and how to keep it consistent across multiple copies (replication, consensus). The CPU crunches numbers. The system manages data. DDIA is about the system.
Everybody wants a system that "just works." But what does that mean precisely? Kleppmann defines reliability as: the system continues to work correctly even when things go wrong. "Working correctly" means performing the function the user expects, at the performance the user expects, even when the user makes mistakes, even when the hardware fails, even when the software has bugs.
The key distinction is between faults and failures. A fault is a single component deviating from its spec — one disk dies, one process crashes, one network packet is lost. A failure is the whole system stopping to provide the required service. The goal of fault-tolerant design is to prevent faults from causing failures. You cannot prevent all faults (disks will die, networks will partition), but you can build systems that tolerate them.
Counterintuitively, it can make sense to deliberately trigger faults to test fault-tolerance mechanisms. This is the philosophy behind Netflix's Chaos Monkey and the broader discipline of chaos engineering: if you have not tested what happens when a server dies, you do not actually know if your system is fault-tolerant. You only believe it is. The difference between belief and knowledge in production engineering is measured in outages.
Hard disks have a mean time to failure (MTTF) of about 10 to 50 years. That sounds reliable until you do the math: if you have a storage cluster with 10,000 disks, each with an MTTF of 30 years, you should expect on average one disk to die every day. (10,000 disks ÷ 30 years ÷ 365 days = 0.91 failures per day.) RAM develops faults. Power grids have outages. Network cables get unplugged by cleaning staff.
The traditional response: redundancy. RAID arrays for disks, dual power supplies, hot-swappable CPUs, diesel generators for power. If one component fails, its redundant partner takes over while the broken one is replaced. For a single server, this approach can keep a machine running uninterrupted for years.
But as data volumes have grown, more applications use larger numbers of machines, which proportionally increases the rate of hardware faults. Cloud platforms like AWS are designed so that individual virtual machine instances can become unavailable without warning. The platform prioritizes flexibility and elasticity over single-machine reliability. So the emphasis has shifted from hardware redundancy to software fault-tolerance: systems that can tolerate the loss of entire machines.
Hardware faults are mostly independent — one disk dying does not make another disk more likely to die. Software faults are worse because they are correlated. A bug in the Linux kernel affects every machine running that kernel. A runaway process that consumes all CPU, memory, or disk on every node simultaneously. A service that the system depends on slowing down, becoming unresponsive, or returning corrupted data. A cascading failure where a small fault in one component triggers a fault in another, which triggers another.
Software faults are harder to anticipate. They lie dormant for a long time until triggered by an unusual set of circumstances — a leap second, a particularly large input, a specific sequence of operations. There is no quick fix. Kleppmann's advice: thorough testing, process isolation, allowing processes to crash and restart, measuring, monitoring, and alerting.
Humans design and operate these systems, and humans make mistakes. Configuration errors by operators are the leading cause of outages, not hardware or software faults. Studies of large internet services found that operator error caused the majority of disruptions, while hardware faults caused only 10-25%.
How do you mitigate human error? Design systems that minimize opportunities for error (good APIs that make it easy to do the right thing and hard to do the wrong thing). Provide sandbox environments where people can experiment safely with real data without affecting production. Test thoroughly at all levels: unit tests, integration tests, manual tests, automated end-to-end tests. Allow quick and easy rollback from mistakes — every deploy should be reversible in under one minute. Set up detailed monitoring (telemetry) so you can detect problems early, before users report them.
Google's approach to mitigating human error is instructive. Every configuration change goes through a review process. Changes are rolled out gradually (1% of traffic, then 10%, then 50%, then 100%) with automated monitoring at each stage. If any metric degrades, the rollout automatically pauses and alerts the engineer. This "canary" pattern catches most human errors before they affect more than a small fraction of users.
Let us trace a real cascading failure to see how a small fault becomes a system-wide failure.
Prevention requires defense in depth: (1) health checks that measure latency, not just liveness, (2) circuit breakers that stop sending traffic to slow backends, (3) retry budgets that cap the total number of retries, (4) timeouts at every network boundary. Netflix pioneered many of these patterns with their Hystrix library and the broader concept of chaos engineering — intentionally injecting faults to verify that the system handles them gracefully.
How often do things actually break? Here are empirically observed failure rates from large-scale systems:
| Component | Failure Rate | Source |
|---|---|---|
| Hard disk (HDD) | 2-4% annual failure rate (AFR) | Backblaze Drive Stats, 2023 |
| SSD | 0.5-1% AFR | Google fleet study, 2016 |
| Server (any component) | 2-10% AFR depending on age | Microsoft Azure study, 2018 |
| Network link | ~1 failure per year per link | Google Jupiter network paper |
| Rack power supply | 1-2 events per year per datacenter | AWS re:Invent talks |
| Software deploy | ~1 in 20 deploys causes a rollback | DORA State of DevOps |
| Configuration change | #1 cause of outages at Google, Microsoft, and Amazon | Multiple incident reports |
The pattern is clear: hardware fails at predictable, low rates. Software and human errors fail at much higher rates but are harder to predict. The most reliable systems invest more in preventing human error (safe defaults, canary deploys, automated rollbacks) than in preventing hardware error (which is already well-understood).
The simulation below models a system with five components, each with an adjustable failure rate. In a series architecture (every component must work for the system to work), the overall reliability multiplies: if each component has 99.9% availability, five in series give 99.9%5 = 99.5%. Toggle redundancy to add a replica for each component and watch reliability improve dramatically.
Adjust individual component reliability. Toggle redundancy to add replicas. Watch how series composition multiplies failure probabilities.
A system that works reliably today may not work reliably tomorrow if load increases tenfold. Scalability is the ability of a system to cope with increased load. But "scalability" is not a one-dimensional label — you cannot say "X is scalable" or "X doesn't scale" as an absolute statement. It is always relative to a specific kind of growth: "If load doubles, can we maintain response times by adding proportionally more resources?"
Before you can talk about whether a system handles growth, you must describe the current load precisely. Kleppmann calls these load parameters — numbers that capture the shape of your workload. The best choice of load parameters depends on the architecture of your system:
| System Type | Key Load Parameters | Example |
|---|---|---|
| Web server | Requests per second, read/write ratio | 1,000 req/s, 90% reads |
| Database | Read/write ratio, working set size | Write-heavy: 80% writes |
| Cache | Hit rate, eviction rate | 95% hit rate, 1M keys |
| Chat app | Concurrent active users, messages/sec | 50K concurrent, 500 msg/s |
| Video platform | Concurrent streams, bitrate, storage growth/day | 100K streams, 5 Mbps avg |
The best example from DDIA is Twitter circa 2012. Twitter had two main operations: post tweet (a user writes a tweet, delivered to all followers) and read timeline (a user views their home timeline — tweets from everyone they follow). At the time, Twitter handled about 4,600 tweets/sec on average (12,000 at peak) and 300,000 timeline reads/sec.
The challenge was not the write throughput. 12,000 writes/sec is manageable for most databases. The challenge was fan-out: each tweet must be delivered to all the author's followers, and some users have tens of millions of followers.
Two approaches:
Twitter initially used Approach 1, then switched to Approach 2 because the read volume (300K/sec) was far higher than write volume (4.6K/sec), so it made sense to do the heavy work once at write time. But then the celebrity problem hit: when a user with 30 million followers tweets, that is 30 million writes. At peak, this created unsustainable write amplification.
Twitter's solution: a hybrid. Most users get fan-out on write (their tweets are pushed to followers' caches). Celebrities (users with very large follower counts) get fan-out on read — their tweets are fetched and merged at read time. This is a brilliant example of a tradeoff that only becomes visible once you quantify the load parameters.
Adjust follower count and tweet rate. Compare write amplification (fan-out on write) vs read amplification (fan-out on read). Find the crossover point where the hybrid approach wins.
Let us work through the numbers that forced Twitter to change their architecture. These are approximate numbers from 2012, the era Kleppmann describes.
The hybrid solution is elegant because it exploits the fact that follower count follows a power law distribution. The vast majority of users have a small number of followers — fan-out on write is cheap for them. Only a tiny fraction of users have millions of followers — handling them differently at read time is a small additional cost. This is a pattern you will see repeatedly in distributed systems: handle the common case efficiently, and special-case the outliers.
The fan-out concept applies far beyond social media. Here are examples from other domains:
| System | Operation | Fan-Out | Strategy |
|---|---|---|---|
| Google Search | One query | Fans out to 1,000+ index shards | Return results from fastest shards, skip slow ones |
| CDN invalidation | One cache purge | Propagates to 200+ edge servers globally | Async propagation, accept brief staleness (seconds) |
| Payment processing | One charge | Triggers: auth, fraud check, ledger write, receipt email, analytics | Critical path (auth+ledger) sync, rest async via queue |
| CI/CD pipeline | One commit | Triggers: lint, unit tests, integration tests, deploy to staging, canary | Parallel where possible, sequential gates for safety |
| Ride-sharing | One ride request | Notify 10-50 nearby drivers | Fan-out on write (push to drivers), first-accept wins |
In every case, the design question is the same: do you do the work eagerly (fan-out on write) or lazily (fan-out on read)? The answer depends on: (1) the ratio of writes to reads, (2) the acceptable latency for reads, (3) the cost of write amplification, and (4) whether the fan-out is uniform or heavy-tailed.
Once you have described the load on your system, the next question is: what happens when load increases? Two ways to frame it: (1) keep resources fixed, how does performance degrade? (2) keep performance fixed, how much more resources do you need? Both framings require you to measure performance precisely.
People use these interchangeably, but they are different. Response time is what the client sees: the time between sending a request and receiving the response. It includes network transit time, queuing time, and the actual service time. Latency is the time a request spends waiting to be handled — the time the request is "latent," sitting in a queue, awaiting service. Latency is a component of response time, not a synonym for it.
If you are asked "what is the response time of your service?" and you answer with the average (arithmetic mean), you are hiding critical information. Imagine 100 requests: 95 take 10ms, 4 take 100ms, and 1 takes 5,000ms. The average is (95×10 + 4×100 + 1×5000) / 100 = 64.5ms. That number describes no actual user experience — nobody got a 64.5ms response. The typical user saw 10ms. The unlucky user saw 5 seconds.
Percentiles are the right tool. Sort all response times from fastest to slowest:
| Percentile | Meaning | Why it matters |
|---|---|---|
| p50 (median) | Half of requests are faster than this | The "typical" user experience |
| p95 | 95% of requests are faster | The experience for 1 in 20 users |
| p99 | 99% of requests are faster | The experience for 1 in 100 users — often your most valuable customers (heavy users) |
| p999 | 99.9% of requests are faster | The long tail; diminishing returns to optimize further |
Amazon observed that a 100ms increase in response time reduced sales by 1%. They also found that the customers with the slowest response times were often those with the most data in their accounts — they had the most purchases, they were the most valuable customers. That is why Amazon cares about p999, not just p50.
Here is where percentiles get dangerous. In a microservices architecture, a single user request often fans out to multiple backend services. If your request touches 10 backend services in parallel, the overall response time is determined by the slowest service. Even if each service has a fast p99, the chance that at least one of 10 calls hits its slow tail grows quickly.
This is tail latency amplification. It means that a service with a "good" p99 of 100ms can cause an overall request p99 of 1,000ms+ when you fan out across many services. The more services you call, the worse it gets.
Measuring response times correctly is surprisingly tricky. Common mistakes:
Coordinated omission. If your load generator sends one request at a time and waits for the response before sending the next, it under-reports tail latency. When the server is slow, the generator slows down too, reducing load exactly when the system is stressed. Gil Tene (Azul Systems) calls this coordinated omission — the measurement tool and the system under test are coordinating to hide the problem. The fix: use a constant-rate load generator that sends requests at a fixed schedule regardless of response times. Tools like wrk2 and HdrHistogram are designed to avoid this.
Averaging percentiles. You cannot average percentiles. If server A has p99 = 50ms and server B has p99 = 500ms, the system p99 is NOT 275ms. You must merge the full response time distributions and compute the percentile from the merged set. In practice, this means your monitoring system must store histograms (like HDR Histogram or Prometheus histograms), not pre-computed percentiles.
Ignoring the client perspective. Server-side metrics miss queuing time, network latency, and client-side processing. Always measure response time from the client's perspective (end-to-end), not just server processing time. The user does not care that your server processed the request in 10ms if the total round trip took 500ms because of a congested network.
1,000 simulated response times from a log-normal distribution. Adjust the distribution shape and fan-out count to see how percentiles change and how tail latency amplifies.
Amazon observed that customers with the most purchases (their most valuable customers) tended to have the slowest response times. Why? Because these customers had more data in their accounts — more order history, more recommendations to compute, more wish lists to merge. The very customers Amazon most wanted to keep happy were the ones most likely to hit the slow tail.
This is why Amazon optimizes for p999 (the slowest 1-in-1000 request), not just p50. Let us do the math for their checkout page, which calls multiple backend services:
You have measured your load, you have measured your performance. Load is growing. What do you do?
Scaling up means moving to a more powerful machine: more CPU cores, more RAM, faster disks, better network. It is the simplest approach — your software does not change at all, you just give it more horsepower. A single high-end server with 256 cores and 2TB of RAM can handle a remarkable amount of work.
The problem: cost grows super-linearly. A machine with twice the RAM does not cost twice as much — it costs four or eight times as much. And there is a hard ceiling: you cannot buy a machine with infinite resources. Eventually you hit the limit of what a single machine can do.
Scaling out means distributing the load across multiple smaller machines. Instead of one giant server, you have 20 commodity servers behind a load balancer. Cost grows linearly (20 servers cost 20x one server), and there is no hard ceiling — you can always add more machines.
The problem: complexity. A shared-nothing architecture (where each machine is independent and coordinates via network messages) requires your software to handle distribution explicitly. How do you split data across machines? How do you route requests? What happens when one machine fails? How do you keep data consistent across machines? These are the questions that fill the remaining chapters of DDIA.
Let us make this concrete with real cloud pricing (approximate, 2024 numbers):
| Approach | Configuration | Cost/month | Capacity |
|---|---|---|---|
| Vertical | 1x r6g.16xlarge (512 GB RAM, 64 vCPU) | ~$6,500 | ~50K req/s |
| Horizontal | 8x r6g.2xlarge (64 GB RAM, 8 vCPU each) | ~$6,400 | ~60K req/s + redundancy |
| Vertical 2x | 1x r6g.metal (512 GB, 64 vCPU + bare metal) | ~$9,800 | ~55K req/s |
| Horizontal 2x | 16x r6g.2xlarge | ~$12,800 | ~120K req/s + redundancy |
At 1x load, vertical and horizontal cost about the same. At 2x load, vertical jumps 50% in cost for only 10% more capacity (you are buying expensive headroom). Horizontal doubles in cost but also doubles in capacity, and you get redundancy for free — if one of 16 machines dies, you still have 15.
The crossover point varies by workload, but a common rule of thumb: scale vertically until you hit ~$10K/month or the largest available instance, whichever comes first. After that, horizontal scaling is almost always more economical and more resilient.
Elastic scaling means the system automatically adds or removes machines based on detected load. More traffic? Spin up more servers. Traffic drops? Shut them down and stop paying for them. Cloud platforms (AWS Auto Scaling, Kubernetes HPA) make this feasible, but it requires your application to be designed for it: stateless request handling, externalized session state, and fast startup times.
Elastic scaling has a critical subtlety: cold start time. If your application takes 30 seconds to boot (loading models, warming caches, establishing database connections), and a traffic spike arrives, you have 30 seconds of degraded service before the new machines are ready. This is why many teams "pre-warm" capacity: keep a few extra machines running at all times and only scale down conservatively. The cost of a few idle machines is small compared to the cost of a 30-second outage during Black Friday.
Here is a decision framework for when to use each scaling approach:
| Scenario | Best Approach | Why |
|---|---|---|
| Single DB, load < 10K req/s | Vertical scaling | Simpler, no distributed complexity, cheaper at low scale |
| Read-heavy workload (90%+ reads) | Read replicas | Add replicas for reads, keep single primary for writes. Linear read scaling. |
| Stateless web servers | Horizontal + elastic | Trivial to scale; add/remove behind load balancer based on CPU/memory |
| Write-heavy workload | Sharding | Split data by key, each shard handles a fraction of writes. Most complex option. |
| Unpredictable traffic spikes | Elastic + pre-warming | Auto-scale up fast, keep minimum capacity for cold-start protection |
| Global users, low latency | Multi-region | Data close to users. CDN for static, regional replicas for dynamic. |
Stateless services (web servers, API gateways, compute workers) are easy to scale horizontally. They hold no data — just add another server behind the load balancer and route requests to it. This is why the first thing you do in a system design interview is separate stateless and stateful components.
Stateful services (databases, caches with important data, coordination services) are hard to scale horizontally. This is the entire reason DDIA exists as a 500-page book. Splitting a database across multiple machines (sharding) introduces consistency problems, rebalancing complexity, cross-shard queries, and distributed transactions. We will cover all of these in later chapters.
Drag the load slider to see how costs compare between vertical scaling (bigger machine) and horizontal scaling (more machines). Horizontal has constant per-unit cost but adds coordination overhead.
Here is an uncomfortable truth: the majority of the cost of software is not in the initial development. It is in the ongoing maintenance — fixing bugs, keeping it running, adapting it to new requirements, repaying technical debt, adding features. Most engineers do not like maintaining legacy systems. But the reality is that most engineering work is maintenance work, not greenfield development.
Kleppmann identifies three design principles that minimize maintenance pain:
Operability means making it easy for operations teams to keep the system running smoothly. A system with good operability provides:
Think of it this way: at 3 AM when something breaks, is your system helpful or hostile? Does it tell you what is wrong, or do you have to guess? Can you restart a component without taking down the whole system? Can a new team member understand the operational runbook in a day?
Simplicity means removing accidental complexity. Accidental complexity is complexity that is not inherent in the problem but arises from the implementation: tight coupling between modules, tangled dependencies, inconsistent naming, hacks that worked around problems without solving them. The Big Ball of Mud anti-pattern: a system with no discernible structure, where every change requires understanding the whole thing.
The best tool for removing accidental complexity is abstraction. A good abstraction hides implementation details behind a clean interface. SQL is an abstraction — you describe what data you want, not how to get it. Programming languages are abstractions over machine code. TCP is an abstraction over unreliable networks. Good abstractions are force multipliers: they let you reason about the system at a higher level without getting lost in details.
Evolvability (also called extensibility, modifiability, or plasticity) means making it easy to change the system in the future. Requirements change. You discover new facts. Regulations shift. New features are requested. A system with good evolvability can accommodate these changes without requiring a rewrite.
Evolvability is related to simplicity: simple systems are easier to change than complex ones. It is also related to good abstractions: if your system is organized into well-defined modules with clear interfaces, you can replace or modify one module without affecting others. Agile practices (test-driven development, refactoring, frequent releases) are techniques for achieving evolvability at the team process level.
A concrete example: imagine your application uses MySQL as its database. If the database is accessed through a clean repository interface (e.g., UserRepository.findById(id)), switching to PostgreSQL or DynamoDB requires changing only the repository implementation. If SQL queries are scattered throughout the codebase (in controllers, in templates, in background jobs), a database migration becomes a multi-month rewrite. The abstraction boundary determines the cost of change.
Kleppmann makes the point that we should design systems with the expectation that they will need to change. Requirements shift. Technologies evolve. What seemed like the right database choice three years ago may not be the right choice today. If your system is evolvable, adapting to change is routine. If it is not, every change is a crisis.
The Big Ball of Mud is a software architecture anti-pattern identified by Brian Foote and Joseph Yoder in 1997. It describes a system with no discernible structure — a haphazardly assembled pile of code where every module depends on every other module. Changing anything requires understanding everything. Adding a feature to the payment system breaks the notification system because they share global mutable state through an undocumented side channel.
How does it happen? Never intentionally. It happens one shortcut at a time. A deadline pressure leads to a quick hack instead of a proper abstraction. The hack works, so nobody fixes it. Another feature builds on top of the hack. A third feature adds a workaround for a bug in the second feature. Six months later, the team has a system where the only person who understands module X is the person who wrote it — and they left the company.
The cost is measurable. A study by Stripe estimated that developers spend 42% of their time dealing with technical debt and maintenance, and that the global cost of developer time wasted on bad code is $85 billion per year. The business impact is not just developer happiness — it is speed. A company with a well-maintained codebase ships features in days. A company drowning in technical debt ships the same features in months.
The cure for accidental complexity is abstraction. A good abstraction lets you ignore irrelevant details and focus on what matters. Here are abstractions you use every day without thinking about them:
| Abstraction | Hides | Exposes |
|---|---|---|
| SQL | How data is stored on disk, which indexes exist, the query execution plan | A declarative language: "give me rows where X" |
| TCP | Packet loss, reordering, duplication, network congestion | A reliable, ordered byte stream between two endpoints |
| File system | Disk sectors, block allocation, wear leveling on SSDs | Named files and directories with read/write operations |
| REST API | The database schema, the caching layer, the message queue | Resources with CRUD operations and JSON responses |
Each of these abstractions converts a complex, leaky reality into a simple, clean interface. When you design a data system, your most important job is choosing or creating the right abstractions. The right abstraction at the right boundary turns a Big Ball of Mud into a set of clean, replaceable components.
The simulation below models the cost of adding features to a system over time. A well-maintained system (good abstractions, clean code, automated tests) has roughly constant feature development cost. A poorly maintained system accumulates technical debt: each shortcut makes the next feature harder, until eventually the team spends most of its time fighting the system rather than improving it.
Watch how feature development velocity diverges between a well-maintained and a poorly-maintained system. The "debt ratio" slider controls how much technical debt accumulates per feature.
Time to put it all together. You are designing a data system. Not in theory — right now, in this simulation. You will pick components, set SLA targets, and watch your system handle load. Then we will break it.
Here is the scenario: you are building a backend for an e-commerce platform. Customers browse products (read-heavy), add items to carts (write-heavy, must not lose data), and checkout (critical path, must be fast and reliable). You need a database for product catalog and orders, a cache for fast reads, and a message queue for order processing.
The simulation below lets you configure each component and observe the tradeoffs in real time. Set your SLA (availability and p99 latency), then inject failures and traffic spikes to see if your system holds up.
Configure your system, set an SLA, then hit "Run Simulation" to observe behavior under load. Use "Inject Failure" to take down a component and see how the system degrades.
Let us break down what is happening inside the simulation so you understand the model:
This model is simplified (real systems have connection pools, queue depths, retry logic), but it captures the essential dynamic: the cache shields the database from the full force of traffic. Without a cache, your database capacity IS your traffic ceiling. With a cache, your ceiling is DB_capacity / (1 - hit_rate).
Play with the simulation above. Here are some experiments to try:
Theory is necessary but not sufficient. Let us look at how real companies made the tradeoffs we have been discussing. Every one of these decisions started with a specific load parameter, a specific failure mode, and a specific business requirement.
Amazon's shopping cart must always be available. If a customer adds an item to their cart and the system says "sorry, try again later," Amazon loses the sale. The 2007 Dynamo paper made this explicit: the shopping cart sacrifices consistency for availability. If two data centers disagree about what is in the cart, the system merges both versions (potentially adding items that were removed) rather than rejecting the write. It is better to occasionally have a phantom item in the cart (customer removes it at checkout) than to ever lose an item the customer added.
The business reasoning: a false positive (extra item in cart) costs Amazon a brief moment of customer confusion. A false negative (lost item from cart) costs Amazon a sale. The asymmetry in business cost drives the technical tradeoff.
When you search Google, your query fans out to thousands of index servers. Some servers may be slow or unresponsive. Google's response: return results within 200ms even if not all index servers have responded. The user gets 98% of the results instantly rather than 100% of the results in 2 seconds. Most users never notice the missing 2% because the most relevant results are on the fastest servers (Google optimizes for this).
The tradeoff: completeness vs. latency. Google chooses latency because user studies show that slower results cause users to search less, even if the results are better. A 400ms delay reduces search volume by 0.6%. Fast-but-incomplete beats slow-but-complete.
When you transfer $1,000 from checking to savings, the bank must guarantee that the money is either in checking OR savings, never in both, never in neither. This is ACID consistency: the database will reject the operation entirely rather than leave the data in an inconsistent state. If the system is partitioned and cannot guarantee consistency, it becomes unavailable rather than risk double-spending or losing money.
The tradeoff: availability vs. consistency. Banks choose consistency because the cost of inconsistency (losing money, regulatory violations, fraud) vastly exceeds the cost of brief unavailability (customer waits a few seconds, tries again).
Social media platforms are interesting because different features have different tradeoff profiles on the same platform:
| Feature | Tradeoff | Why |
|---|---|---|
| Likes/comments | Availability over consistency (eventual consistency) | If a like count is off by 1 for a few seconds, nobody cares |
| Direct messages | Consistency over latency | Messages must arrive in order and none must be lost |
| Payments (creator fund) | Strong consistency | Money must be exactly right, always |
| Feed ranking | Latency over completeness | Show the best-guess feed fast rather than the perfect feed slow |
| Account deletion | Consistency over speed | Regulatory (GDPR): deletion must be complete and verifiable |
Netflix serves 230 million subscribers across 190 countries. Their philosophy: it is better to show you something than to show you nothing. If the recommendation engine is slow or down, Netflix falls back to showing popular content in your region. If personalized artwork cannot be generated, it shows the default poster. If one microservice is failing, circuit breakers isolate it and the rest of the system continues.
Netflix pioneered chaos engineering with Chaos Monkey (randomly kills production servers) and later Chaos Kong (simulates the failure of an entire AWS region). The philosophy: if you have not tested a failure mode in production, you do not know if your system survives it. By breaking things intentionally and regularly, Netflix ensures their fallback paths actually work.
The tradeoff: graceful degradation over correctness. A slightly wrong recommendation is better than no content at all. This only works because Netflix's core product (video streaming) can tolerate imprecision. You would not want your bank to show you "approximately your balance."
Every system design decision can be framed as a choice between competing goods. Here is a framework for thinking through any tradeoff:
Abstract tradeoffs become concrete when you attach dollar amounts. Here is how real companies quantify them:
| Company | Metric | Cost | Source |
|---|---|---|---|
| Amazon | 100ms added latency | 1% revenue loss (~$5B/year at current revenue) | Greg Linden, 2006 |
| 500ms added to search | 20% fewer searches per user | Marissa Mayer, 2006 | |
| Walmart | 1 second faster page load | 2% increase in conversion | Walmart Labs, 2012 |
| Amazon (AWS) | 1 hour of S3 downtime | ~$150M in affected commerce | 2017 S3 outage estimate |
| 6 hours of total outage | ~$100M in lost ad revenue | 2021 BGP outage | |
| Robinhood | Trading outage during volatility | $70M FINRA fine + class-action settlement | 2020 outage, 2021 fine |
These numbers change the conversation. When a product manager says "just make it work," you can respond: "We can achieve 99.9% availability for $X/month (3 replicas), or 99.99% for $3X/month (5 replicas across 3 zones). Based on our revenue, each nine of availability is worth $Y/year. Which level makes business sense?" This is engineering leadership: turning vague requirements into quantified tradeoffs.
Click on each system to see where it sits on the availability-consistency spectrum and why. Each position represents a deliberate engineering decision driven by business requirements.
Every systems design discussion eventually invokes the CAP theorem. It is the most frequently cited — and most frequently misunderstood — result in distributed systems. Let us get it right.
Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time. (This is linearizability, a specific and strong form of consistency — not to be confused with the C in ACID, which means something different.)
Availability (A): Every request receives a (non-error) response, without guarantee that it contains the most recent write. The system is always responsive, even if the data it returns is stale.
Partition tolerance (P): The system continues to operate despite network partitions — messages between nodes being lost or delayed. A partition means some nodes cannot communicate with other nodes.
The CAP theorem (proved by Gilbert and Lynch in 2002, based on Brewer's 2000 conjecture) states: in the presence of a network partition, a distributed system must choose between consistency and availability. You cannot have both.
The common phrasing "pick two out of three" is misleading. Here is why: you do not get to opt out of partition tolerance. Network partitions happen. Switches fail, cables get cut, cloud regions lose connectivity. Any distributed system must tolerate partitions. So the real choice is:
"CA" (consistent and available, no partition tolerance) is not a real option for distributed systems. It describes a single-node system (like a single PostgreSQL server) — which is indeed both consistent and available, but only because there is no network partition possible (there is only one node). The moment you add a second node, partitions become possible.
CAP only describes behavior during a partition. But partitions are rare. What about the other 99.9% of the time? Daniel Abadi proposed PACELC (2012): "If there is a Partition (P), choose Availability (A) or Consistency (C). Else (E), when the system is running normally, choose Latency (L) or Consistency (C)."
| System | During Partition | Normal Operation | PACELC |
|---|---|---|---|
| DynamoDB | Availability (A) | Latency (L) | PA/EL |
| Cassandra | Availability (A) | Latency (L) | PA/EL |
| PostgreSQL (sync repl.) | Consistency (C) | Consistency (C) | PC/EC |
| MongoDB (majority) | Consistency (C) | Latency (L) | PC/EL |
| Cosmos DB | Configurable | Configurable | Tunable |
PACELC is more useful than CAP because it captures the latency-consistency tradeoff that systems face all the time, not just during rare partition events.
CAP's binary "consistent or not" hides a rich spectrum of consistency models. In practice, you do not choose between "consistent" and "inconsistent" — you choose a point on a continuum:
| Consistency Model | Guarantee | Latency Cost | Example Use Case |
|---|---|---|---|
| Linearizability (strongest) | All operations appear to execute atomically in real-time order | Highest (requires coordination) | Leader election, distributed locks |
| Sequential consistency | All operations appear in some total order consistent with each process's local order | High | Shared memory systems |
| Causal consistency | Operations that are causally related appear in order; concurrent operations may appear in any order | Medium | Social media (replies appear after the post they reply to) |
| Read-your-writes | A client always sees its own writes | Low | User profile updates (you see your own changes immediately) |
| Eventual consistency (weakest) | If no new writes occur, all replicas eventually converge to the same value | Lowest | DNS, like counts, CDN caches |
Each step down the spectrum gives you better latency and availability at the cost of weaker guarantees. The art of distributed systems design is choosing the weakest consistency model that your application can tolerate — because stronger consistency is always more expensive.
Real database systems plotted on the consistency-availability spectrum. Click any system to see its tradeoff details. The horizontal axis shows behavior during partitions (CP vs AP), and the vertical axis shows behavior during normal operation (low latency vs strong consistency).
This chapter is your cheat sheet. Everything from Chapters 0-8, compressed into the forms you will use in interviews and on the job.
| Pillar | One-sentence definition | Key metrics | Key techniques |
|---|---|---|---|
| Reliability | The system works correctly even when things go wrong | Availability (nines), MTTF, MTTR, error rate | Redundancy, fault isolation, chaos engineering, graceful degradation |
| Scalability | The system copes with growth in data, traffic, or complexity | Throughput (req/s), response time percentiles, load parameters | Vertical scaling, horizontal scaling (sharding, replicas), caching, async processing |
| Maintainability | Future engineers can work with the system productively | Time to onboard, time to deploy, incident resolution time | Good abstractions, documentation, automation, simplicity, evolvability |
Jeff Dean (Google) published a list of latency numbers that every systems engineer should have memorized. These numbers help you do back-of-envelope calculations in interviews and design discussions. Here are updated figures for modern hardware (circa 2024):
| Operation | Latency | Relative Scale |
|---|---|---|
| L1 cache reference | ~1 ns | 1x |
| L2 cache reference | ~4 ns | 4x |
| Main memory (DRAM) reference | ~100 ns | 100x |
| SSD random read | ~16 μs | 16,000x |
| HDD random read (seek) | ~2 ms | 2,000,000x |
| Network round-trip (same datacenter) | ~0.5 ms | 500,000x |
| Network round-trip (cross-continent) | ~150 ms | 150,000,000x |
| Read 1 MB sequentially from SSD | ~50 μs | |
| Read 1 MB sequentially from HDD | ~2 ms | |
| Read 1 MB from network (10 Gbps) | ~0.8 ms |
In interviews, you will be asked to estimate things like "How many servers do you need?" or "How much storage?" Here is the pattern:
These estimates are deliberately rough — the point is not precision, it is getting the right order of magnitude. "We need about 20 servers and half a petabyte per year" is a useful statement in a design discussion. "We need exactly 17.3 servers" is false precision.
Let us walk through a full system design answer using the framework from this lesson. The question: "Design a URL shortener like bit.ly."
Notice how every answer references concepts from this lesson: load parameters, SLO/SLA, read:write ratio, percentiles, cache hit rates, stateless vs. stateful, vertical first then horizontal. The framework gives you structure; the numbers give you credibility.
When you get a system design question, start with these questions to establish the tradeoff space:
| Availability | Downtime/year | Downtime/month | Typical system |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools, batch systems |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | SaaS products, e-commerce |
| 99.99% (four nines) | 52.6 minutes | 4.4 minutes | Payment systems, core infrastructure |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | Telecom, 911 systems |
python def percentile(data, p): """Calculate the p-th percentile of a list of values. p is between 0 and 100. Uses nearest-rank method. """ if not data: raise ValueError("Empty dataset") sorted_data = sorted(data) n = len(sorted_data) rank = (p / 100) * (n - 1) lower = int(rank) upper = lower + 1 weight = rank - lower if upper >= n: return sorted_data[lower] return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight # Usage: latencies = [12, 15, 11, 14, 800, 13, 12, 16, 11, 13] print(f"p50: {percentile(latencies, 50)}ms") # ~13ms print(f"p99: {percentile(latencies, 99)}ms") # ~800ms print(f"mean: {sum(latencies)/len(latencies)}ms") # ~91.7ms (misleading!)
python def series_availability(components): """Availability of N components in series (all must work).""" result = 1.0 for a in components: result *= a return result def parallel_availability(a, replicas=2): """Availability with N redundant replicas.""" return 1 - (1 - a) ** replicas # 5 services, each 99.9% services = [0.999] * 5 print(f"Series: {series_availability(services)*100:.4f}%") # 99.5010% (~1.8 days downtime/year) # Same 5 services, each with a redundant replica redundant = [parallel_availability(0.999) for _ in range(5)] print(f"With redundancy: {series_availability(redundant)*100:.6f}%") # 99.999500% (~2.6 minutes downtime/year)
python import time import random import statistics def simulate_service(base_latency_ms=10, p99_spike_ms=500): """Simulate a service with realistic latency distribution.""" if random.random() < 0.01: # 1% chance of slow response return base_latency_ms + random.expovariate(1/p99_spike_ms) return base_latency_ms + random.expovariate(1/5) def simulate_fanout(n_backends, n_requests=10000): """Simulate tail latency amplification with n backend calls.""" latencies = [] for _ in range(n_requests): # Each request calls n_backends in parallel # Overall latency = max of all backends backend_latencies = [simulate_service() for _ in range(n_backends)] latencies.append(max(backend_latencies)) return latencies # Compare: 1 backend vs 10 backends for n in [1, 5, 10, 20]: lats = simulate_fanout(n) lats.sort() print(f"Backends={n:2d} p50={lats[4999]:.0f}ms " f"p99={lats[9899]:.0f}ms p999={lats[9989]:.0f}ms") # Output (typical): # Backends= 1 p50= 15ms p99= 52ms p999= 487ms # Backends= 5 p50= 22ms p99= 498ms p999=1203ms # Backends=10 p50= 26ms p99= 823ms p999=1847ms # Backends=20 p50= 30ms p99=1156ms p999=2534ms
Notice how p50 grows slowly (the median is stable) but p99 explodes with more backends. This is tail latency amplification in action. With 20 backends, p99 is 38x the single-backend p50. This is why microservice architectures must be designed with tail latency in mind from day one.
python def error_budget(sla_percent, period_days=30): """Calculate allowed downtime for a given SLA.""" total_minutes = period_days * 24 * 60 allowed_downtime = total_minutes * (1 - sla_percent / 100) return { 'sla': f'{sla_percent}%', 'period': f'{period_days} days', 'budget_minutes': round(allowed_downtime, 2), 'budget_human': format_duration(allowed_downtime) } def format_duration(minutes): if minutes < 1: return f'{minutes*60:.1f} seconds' if minutes < 60: return f'{minutes:.1f} minutes' if minutes < 1440: return f'{minutes/60:.1f} hours' return f'{minutes/1440:.1f} days' # Monthly error budgets: for sla in [99, 99.9, 99.99, 99.999]: result = error_budget(sla) print(f"SLA {result['sla']:>8s} Budget: {result['budget_human']:>14s}") # SLA 99% Budget: 7.3 hours # SLA 99.9% Budget: 43.8 minutes # SLA 99.99% Budget: 4.4 minutes # SLA 99.999% Budget: 26.3 seconds
| Scenario | Likely cause | What to check |
|---|---|---|
| p99 latency spiked 10x | Tail latency amplification, GC pauses, slow dependency, noisy neighbor | Trace slow requests end-to-end, check each backend's p99 individually, check GC logs, check for resource contention |
| System went down during traffic spike | No elastic scaling, no back-pressure, cascading failure | Check if auto-scaling was configured, check if the system has circuit breakers, check for thundering herd effects |
| Data inconsistency after deploy | Missing migration, stale cache, split-brain | Check cache invalidation, check replication lag, check for concurrent writes during schema migration |
| Intermittent "service unavailable" | Health check flapping, DNS issues, connection pool exhaustion | Check connection pool sizes, check health check configuration, check DNS TTL |
Chapter 1 of DDIA sets the vocabulary. Every subsequent chapter explores one dimension of the tradeoff space in depth. Here is how this lesson connects to what comes next.
| This Lesson's Concept | Where It Goes Next | DDIA Chapter |
|---|---|---|
| Data models (the blurring boundaries) | Relational vs document vs graph models — choosing the right data model for your access patterns | Chapter 2: Data Models |
| Reliability (fault tolerance) | Replication: keeping copies of data on multiple machines so the system survives failures | Chapter 5: Replication |
| Scalability (handling growth) | Partitioning (sharding): splitting a large dataset across multiple machines so no single node is a bottleneck | Chapter 6: Partitioning |
| Consistency (CAP, PACELC) | Transactions: grouping reads and writes into atomic units with consistency guarantees | Chapter 7: Transactions |
| Fan-out (Twitter problem) | Stream processing: handling high-throughput event flows in real time | Chapter 11: Stream Processing |
| Batch vs. real-time | Batch processing: the economics of processing large datasets efficiently (MapReduce and beyond) | Chapter 10: Batch Processing |
After this lesson, you should be able to use these terms precisely in conversation, in interviews, and in design documents. Each term has a specific meaning that we derived from first principles:
| Term | Precise Meaning | Common Misuse |
|---|---|---|
| Fault | One component deviating from spec | Confused with "failure" (whole system down) |
| Failure | The system stops providing the required service | Used interchangeably with "fault" |
| Load parameter | A number describing the shape of workload (req/s, ratio, count) | Vague statements like "high traffic" |
| Fan-out | Number of downstream ops triggered by one incoming op | Used only for messaging; it applies to any amplification |
| Percentile (p99) | 99% of requests are faster than this value | Confused with "average" or "typical" |
| Tail latency amplification | Fan-out to N services makes the composite p99 worse than individual p99 | Ignored until production breaks |
| SLO vs SLA | SLO = internal target; SLA = contractual guarantee with penalties | Used interchangeably |
| Error budget | Allowed downtime = 1 - SLA, a resource to be spent wisely | Treated as a punishment rather than a tool |
| Accidental complexity | Complexity from the implementation, not the problem domain | Confused with inherent problem complexity |
This lesson covered the conceptual framework. The rest of the DDIA series goes deep on each mechanism:
Every one of these topics is a deeper exploration of the tradeoffs we introduced here. The vocabulary from this lesson — faults vs. failures, load parameters, percentiles, fan-out, consistency vs. availability, series reliability, tail latency amplification — will appear in every subsequent chapter.
If you zoom out far enough, every decision in distributed systems is a point in a multi-dimensional tradeoff space. Here are the axes:
| Axis | One End | Other End | DDIA Chapters |
|---|---|---|---|
| Consistency ↔ Availability | Linearizability (strict) | Eventual consistency (lax) | Ch 5, 7, 9 |
| Latency ↔ Durability | In-memory only (fast, volatile) | Sync to disk + replicas (slow, durable) | Ch 3, 5 |
| Throughput ↔ Isolation | No isolation (dirty reads, fast) | Serializable (safe, slow) | Ch 7 |
| Simplicity ↔ Performance | Single node (simple, limited) | Distributed (complex, scalable) | Ch 5, 6, 8, 9 |
| Freshness ↔ Cost | Real-time updates (expensive) | Batch updates (cheap, stale) | Ch 10, 11 |
Every system sits at a particular point in this space. There is no system that optimizes all axes simultaneously — that is the fundamental theorem of data systems design. Your job is to find the point that best serves your users and your business, and to be able to articulate why you chose that point over the alternatives.
Here is a thought experiment to carry with you. The next time you use any application — checking email, ordering food, streaming a video, sending a payment — pause and ask yourself:
Once you start seeing systems through this lens, you will never look at software the same way again. Every loading spinner is a latency tradeoff. Every "try again later" is a reliability decision. Every stale notification is a consistency choice. You are now equipped to understand — and make — these decisions.
If this lesson resonated with you, here are the resources that go deeper into each topic:
| Resource | Topic | Why Read It |
|---|---|---|
| DDIA Chapters 2-12 | Everything | The rest of the book builds directly on Chapter 1. Each chapter is a deep dive into one aspect of the tradeoff space. |
| Google SRE Book (free online) | Reliability, SLOs, error budgets | The practical handbook for running reliable systems at scale. Chapter 4 on SLOs is essential. |
| "The Tail at Scale" — Jeff Dean | Tail latency | The definitive paper on why tail latency matters and how to mitigate it (hedged requests, tied requests). |
| Amazon Dynamo Paper (2007) | Availability vs consistency | The paper that launched the NoSQL movement. Shows how Amazon sacrifices consistency for availability. |
| "Harvest, Yield, and Scalable Tolerant Systems" — Fox & Brewer | CAP nuances | A more nuanced take on CAP from one of its creators. Introduces harvest (completeness) and yield (availability). |
| "Consistency Tradeoffs in Modern Distributed DB Design" — Abadi | PACELC | The paper that extends CAP to cover the non-partition case. Essential for understanding real system tradeoffs. |