SLAs, percentiles, capacity planning, load testing — quantifying system quality.
Two engineers sit in a meeting room. The product manager says: "We need the checkout service to be fast and reliable." Engineer A nods and says "I'll put it on a single beefy server with an SSD." Engineer B says "We need three replicas across two availability zones with a CDN edge layer, auto-scaling, and a Redis cache." The product manager asks: "How much will each approach cost?" Engineer A says "$500/month." Engineer B says "$12,000/month."
Who is right? Neither of them knows, because nobody defined what "fast" and "reliable" actually mean.
If "fast" means p50 latency under 500ms, Engineer A's single server will do fine. If "fast" means p99 latency under 50ms under 10,000 concurrent users during Black Friday, Engineer B's architecture might not even be enough. If "reliable" means 99% uptime (3.65 days of downtime per year), a single server with good monitoring suffices. If "reliable" means 99.99% uptime (52 minutes per year), you need redundancy at every layer.
Vague requirements are the root cause of two opposite engineering disasters: over-engineering (spending $100K/month on infrastructure for a service that gets 10 requests per second) and under-engineering (deploying a single server that crashes on the first traffic spike and loses customer orders).
The simulation below shows how vague requirements lead to wildly different cost outcomes. Two teams build the same service — one with precise requirements, one without. Watch what happens.
Drag the "Actual Load" slider to see how over-engineering and under-engineering play out against a precisely-engineered system.
Notice what happens: the over-engineered system costs a fortune regardless of load. The under-engineered system is cheap until the load exceeds its capacity, at which point it crashes and costs you far more in lost revenue and emergency fixes. The precisely-engineered system scales proportionally — it costs exactly what the load demands, plus a reasonable safety margin.
This is why nonfunctional requirements matter. They are the numbers that turn an engineering problem from "build something good" into "build exactly this, for this many users, at this cost." In this lesson, we will learn the language and math of nonfunctional requirements: SLAs, percentiles, capacity planning, availability, and monitoring.
Google runs some of the most reliable services on the planet. They do not achieve this by making everything infinitely reliable — that would be infinitely expensive. Instead, they use a precise three-layer framework to define, target, and contractually guarantee system quality. Let us build it from the ground up.
A Service Level Indicator is a quantitative measurement of some aspect of your service. It is a number you can observe, record, and plot on a graph. Common SLIs include:
| SLI | What it measures | How to compute it | Example |
|---|---|---|---|
| Latency | How long a request takes | Time from request received to response sent | p99 latency = 142ms |
| Error rate | Fraction of requests that fail | HTTP 5xx count / total request count | 0.3% of requests return 500 |
| Throughput | How many requests per second | Request count / time window | 2,400 QPS at peak |
| Availability | Fraction of time the service is up | Successful requests / total requests | 99.95% of probes succeed |
| Durability | Probability that stored data is not lost | 1 - (objects lost / objects stored) over a period | 11 nines (99.999999999%) |
SLIs are measurements, not goals. They describe reality. You do not choose your SLI values — you observe them.
A Service Level Objective is an internal target for an SLI. It is a line in the sand that your team draws: "We aim for p99 latency below 200ms" or "We target 99.9% availability per month." SLOs are promises you make to yourself. No lawyer is involved. No money changes hands if you miss them.
But SLOs are not arbitrary. They are chosen based on user expectations and business needs. If your payment processing SLO is 99.9% availability (8.76 hours of downtime per year), and you serve 1,000 transactions per hour at an average of $50, each hour of downtime costs you $50,000 in lost revenue. So you had better know exactly how much reliability you are promising.
A Service Level Agreement is a contract between you and your customers that specifies what happens if you miss an SLO. SLAs have legal teeth — if you promise 99.95% uptime and deliver 99.8%, you owe your customer a refund, credit, or penalty. AWS credits your account if S3 drops below 99.9% availability in a billing period. Google Cloud credits you if Compute Engine drops below 99.95%.
If your SLO is 99.9% availability per month, that means you tolerate 0.1% unavailability. In a 30-day month (43,200 minutes), 0.1% is 43.2 minutes. That is your error budget — the maximum amount of downtime you can "spend" before violating your SLO.
Error budgets change how you think about risk. Deploying on a Friday afternoon? That might burn 20 minutes of error budget if something goes wrong. If you have already spent 30 minutes this month, you cannot afford it — deploy on Monday. If you have 43 minutes remaining on the 5th of the month, you are flush — ship with confidence.
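The deploy decision above is simple arithmetic, and a small helper makes it explicit. This is a minimal sketch under the assumptions already stated in the text (a 99.9% SLO over a 30-day window); the function names are illustrative, not part of any standard tool.

```python
def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Total downtime the SLO tolerates over the window."""
    return (1.0 - slo) * window_minutes


def can_afford(slo: float, window_minutes: float,
               spent_minutes: float, risk_minutes: float) -> bool:
    """True if a change that might burn `risk_minutes` fits in the remaining budget."""
    remaining = error_budget_minutes(slo, window_minutes) - spent_minutes
    return remaining >= risk_minutes


WINDOW = 30 * 24 * 60  # 30-day month = 43,200 minutes

budget = error_budget_minutes(0.999, WINDOW)
print(f"budget: {budget:.1f} min")                                   # budget: 43.2 min
print(can_afford(0.999, WINDOW, spent_minutes=30, risk_minutes=20))  # False
print(can_afford(0.999, WINDOW, spent_minutes=0, risk_minutes=20))   # True
```

With 30 minutes already spent, only 13.2 minutes remain, so a deploy that risks 20 minutes does not fit.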
Set the uptime percentage and see exactly how much downtime each "nine" allows per year, month, week, and day. Notice: each additional nine is 10x harder.
The table of nines is one of the most useful reference tables in systems engineering. Memorize it:
| Uptime | Downtime/year | Downtime/month | Downtime/week | Nines |
|---|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | 1.68 hours | Two nines |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min | Three nines |
| 99.95% | 4.38 hours | 21.9 min | 5.04 min | Three and a half |
| 99.99% | 52.6 min | 4.38 min | 1.01 min | Four nines |
| 99.999% | 5.26 min | 26.3 sec | 6.05 sec | Five nines |
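The table can be regenerated for any uptime target with a few lines of code. The sketch below assumes a 365-day year and a month of one twelfth of a year, so the last digit may differ slightly from a table built on a 365.25-day year; the helper names are illustrative.

```python
SECONDS_PER_PERIOD = {
    "year": 365 * 24 * 3600,
    "month": 365 * 24 * 3600 / 12,
    "week": 7 * 24 * 3600,
}


def allowed_downtime(uptime_percent: float) -> dict:
    """Seconds of downtime permitted per period at the given uptime percentage."""
    unavailable = 1 - uptime_percent / 100
    return {period: unavailable * secs for period, secs in SECONDS_PER_PERIOD.items()}


def humanize(seconds: float) -> str:
    """Render a duration with a unit that keeps the number readable."""
    for unit, size in (("days", 86_400), ("hours", 3_600), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.2f} sec"


for uptime in (99, 99.9, 99.95, 99.99, 99.999):
    row = allowed_downtime(uptime)
    print(f"{uptime}%", *(humanize(row[p]) for p in ("year", "month", "week")), sep=" | ")
```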
Your monitoring dashboard says the average response time is 45ms. Sounds great. But five of your biggest customers are threatening to leave because "the site is unbearably slow." How? Because average latency is one of the most misleading metrics in computer science.
Consider 1,000 requests to your API. 950 of them complete in 5ms. 49 of them complete in 50ms (they hit the database instead of the cache). 1 of them takes 5,000ms (a garbage collection pause plus a cold database connection). The average is:

$$\text{avg} = \frac{950 \times 5 + 49 \times 50 + 1 \times 5000}{1000} = \frac{12{,}200}{1000} = 12.2\ \text{ms}$$

But what actually happened? 95% of requests took 5ms. 4.9% took 50ms. And 0.1% took 5 seconds. That 0.1% falls disproportionately on your biggest customers: they make more requests, so they are statistically more likely to hit the tail. And one terrible experience out of a hundred makes them say "this site is slow."
A percentile tells you: "X% of requests complete faster than this value." Sort all your latency measurements from smallest to largest. The value at position N% is the Nth percentile.
Now the picture is clear. p50 is 5ms (great), p99 is 50ms (acceptable), but p99.9 is 5 seconds (terrible). The average of 12.2ms told you nothing useful. The percentiles told you everything.
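These numbers can be checked directly against the 1,000-request example. The sketch below uses a simple index-into-the-sorted-list definition of a percentile; other conventions exist (nearest-rank, interpolation) and can give a different answer for extreme percentiles on small samples, so treat the helper as one reasonable choice rather than the only definition. `Fraction` is used only to keep the index arithmetic exact for values like 99.9.

```python
from fractions import Fraction
from statistics import mean


def percentile(samples, p):
    """p-th percentile by index lookup in the sorted samples.

    `p` may be an int or a decimal string such as "99.9".
    """
    ordered = sorted(samples)
    idx = int(len(ordered) * Fraction(p) / 100)
    return ordered[min(idx, len(ordered) - 1)]


# 950 cache hits, 49 database reads, 1 pathological request (all in ms)
latencies_ms = [5] * 950 + [50] * 49 + [5000]

print("mean :", mean(latencies_ms))                # 12.2
print("p50  :", percentile(latencies_ms, 50))      # 5
print("p99  :", percentile(latencies_ms, 99))      # 50
print("p99.9:", percentile(latencies_ms, "99.9"))  # 5000
```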
In a microservices architecture, a single user request often fans out to multiple backend services. If your checkout page calls the cart service, the inventory service, the pricing service, and the payment service in parallel, the user's total latency is the maximum of all four calls. This is called tail latency amplification.
If each service has a p99 latency of 100ms and the calls are independent, the probability that all four finish within 100ms is $0.99^4 \approx 0.961$. So there is a $1 - 0.99^4 \approx 3.9\%$ chance that at least one call is slow, and 100ms is no longer the p99 of the combined request. With 10 backend calls, the chance is $1 - 0.99^{10} \approx 9.6\%$: your effective p90 is now what used to be your p99.
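The fan-out arithmetic generalizes to any number of parallel calls. A minimal sketch, assuming the backend calls are statistically independent (real services often share dependencies, which makes slowness correlated); the function name is illustrative.

```python
def prob_any_slow(per_call_quantile: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls
    exceeds the latency each call stays under with `per_call_quantile`."""
    return 1.0 - per_call_quantile ** fan_out


for n in (1, 4, 10, 100):
    print(f"{n:>3} calls: {prob_any_slow(0.99, n):.1%} of requests see a p99-or-worse call")
```

At a fan-out of 100, more than half of all user requests include at least one backend call from that backend's slowest 1%.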
Adjust the distribution shape and tail weight. Watch how the average stays low while the tail percentiles explode. The histogram shows 1,000 requests — hover over the percentile markers to see exact values.
When you load test with tools like Apache Bench or wrk, they typically send the next request only after the previous one completes. If one request takes 5 seconds, the tool waits 5 seconds before sending the next one. During those 5 seconds, zero requests are recorded. The tool under-counts how many users would have been affected — in reality, hundreds of users would have sent requests during that 5 seconds, and all of them would have experienced high latency.
This is coordinated omission: the load testing tool's measurement is "coordinated" with the system's slowness, and it "omits" the requests that real users would have sent. Tools like wrk2, and k6 when configured with an arrival-rate executor, fix this by maintaining a constant send rate regardless of response time.
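A toy simulation makes the effect concrete. The sketch below is a deliberately simplified model, not a real load tester: the server answers in 1ms except during a single 5-second stall, and requests that arrive during the stall complete right after it ends (queueing behind one another is ignored). All names and numbers are assumptions chosen for illustration.

```python
def completion_time(arrival_ms, stall_start_ms, stall_end_ms, service_ms):
    """When a request arriving at `arrival_ms` finishes (all times in ms)."""
    stalled = stall_start_ms <= arrival_ms < stall_end_ms
    start = stall_end_ms if stalled else arrival_ms
    return start + service_ms


def closed_loop(duration_ms, think_ms, server):
    """One virtual user: send, wait for the response, pause, send again."""
    latencies, now = [], 0
    while now < duration_ms:
        done = completion_time(now, **server)
        latencies.append(done - now)
        now = done + think_ms  # the next send is delayed by a slow response
    return latencies


def open_loop(duration_ms, interval_ms, server):
    """Constant rate: a request is due every `interval_ms` no matter what."""
    return [completion_time(t, **server) - t
            for t in range(0, duration_ms, interval_ms)]


def slow_count(latencies, threshold_ms=1):
    return sum(1 for x in latencies if x > threshold_ms)


SERVER = dict(stall_start_ms=10_000, stall_end_ms=15_000, service_ms=1)
closed = closed_loop(60_000, think_ms=9, server=SERVER)
open_ = open_loop(60_000, interval_ms=10, server=SERVER)

print(f"closed loop: {slow_count(closed)} slow of {len(closed)}")  # 1 slow of 5500
print(f"open loop:   {slow_count(open_)} slow of {len(open_)}")    # 500 slow of 6000
```

Both testers target roughly 100 requests per second, yet the closed-loop run records a single slow request while the constant-rate run records 500, which is much closer to what real users would have experienced.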
An interviewer says: "Design a URL shortener for 100 million users." Before you draw a single box on the whiteboard, you need numbers. How many requests per second? How much storage? How much bandwidth? How many servers? Without these, your architecture is a guess.
Capacity planning follows a five-step framework. Each step feeds into the next. Let us walk through it for the URL shortener.
| Quantity | Value | Why it matters |
|---|---|---|
| Seconds in a day | 86,400 ≈ $10^5$ | Convert daily counts to QPS |
| Seconds in a month | 2.6M ≈ $2.5 \times 10^6$ | Monthly budgets |
| 1 million requests/day | ≈ 12 QPS | Quick conversion |
| 1 KB × 1 billion | = 1 TB | Storage estimation |
| Peak : average ratio | 2-3x (typical), 10x (bursty) | Headroom planning |
| 80/20 rule | 20% of data serves 80% of reads | Cache sizing (cache the hot 20%) |
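The quick-conversion numbers in the table combine into a back-of-envelope estimator. This is a sketch with assumed inputs (10M daily active users, 20 requests and 1 KB of new data per user per day, five years of retention, a 3x peak factor), not figures for any particular system; the `Workload` class and its field names are hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    daily_active_users: int
    requests_per_user_per_day: int
    bytes_per_user_per_day: int
    retention_days: int
    peak_factor: float = 3.0  # 2-3x is typical, 10x for bursty traffic


def average_qps(w: Workload) -> float:
    return w.daily_active_users * w.requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(w: Workload) -> float:
    return average_qps(w) * w.peak_factor


def storage_bytes(w: Workload) -> int:
    return w.daily_active_users * w.bytes_per_user_per_day * w.retention_days


w = Workload(
    daily_active_users=10_000_000,
    requests_per_user_per_day=20,
    bytes_per_user_per_day=1_000,
    retention_days=5 * 365,
)
print(f"average QPS ~ {average_qps(w):,.0f}")
print(f"peak QPS    ~ {peak_qps(w):,.0f}")
print(f"storage     ~ {storage_bytes(w) / 1e12:.2f} TB")
```

The point of the exercise is the order of magnitude: a few thousand QPS and tens of terabytes call for a very different architecture than a few hundred thousand QPS and petabytes.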
Input your system parameters and see the derived capacity requirements in real time. Adjust each slider to model different scales.
You have estimated your capacity requirements. You have built your system. Now the question: does it actually handle the load? Hope is not a strategy. You need to prove it with load testing.
| Type | What it tests | How it works | What it catches |
|---|---|---|---|
| Load test | Expected traffic | Ramp to your estimated peak QPS and sustain for 10-30 min | Basic bottlenecks: slow queries, underpowered instances |
| Stress test | Breaking point | Increase QPS until the system fails. Find the knee point. | The exact QPS where latency degrades, the failure mode (OOM, connection pool, CPU) |
| Soak test | Long-running stability | Sustain moderate load for 4-24 hours | Memory leaks, connection pool exhaustion, disk fill-up, log rotation failure |
| Spike test | Sudden burst | Jump from idle to 10x peak in seconds | Auto-scaling lag, cold start latency, queue overflow |
The most important graph in load testing is the load-latency curve. Plot requests per second on the X axis and response time on the Y axis. At low load, latency is flat: the system responds in its baseline time. As load increases, latency stays nearly flat until you hit the knee point, the QPS where some resource (CPU, memory, connections, disk I/O) becomes saturated. Beyond the knee, latency climbs steeply and without bound as load approaches maximum capacity.
In the simplest queueing model (M/M/1), mean latency at load q with service rate μ is L(q) = 1/(μ - q), which works out to 1/(1 - ρ) times the unloaded latency, where ρ = q/μ is utilization. This is why you never run a latency-sensitive system at 90% capacity. At 90% utilization, latency is 10x baseline. At 70%, latency is only 3.3x baseline. The standard rule of thumb: keep peak utilization below 70% for latency-sensitive services.
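The 10x and 3.3x figures follow from the M/M/1 queueing formula, where mean latency relative to the unloaded baseline is 1/(1 - utilization). The sketch below assumes that idealized model (a single server with random arrivals and service times); real systems have different curves, but the shape near saturation is similar.

```python
def latency_multiplier(utilization: float) -> float:
    """Mean latency relative to the unloaded baseline in an M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


for rho in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"{rho:.0%} utilized -> {latency_multiplier(rho):.1f}x baseline")
```

Going from 70% to 90% utilization triples latency, and the last few percent before saturation cost far more than everything before them.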
Drag the "Current Load" slider to move along the curve. Watch latency explode as you approach the system's maximum capacity. The knee point is marked. The "Max Capacity" slider lets you simulate adding more servers.
| Tool | Language | Strengths | Coordinated omission fix? |
|---|---|---|---|
| k6 | Go (JS scripts) | Modern, CI-friendly, cloud-native | Yes (with arrival-rate executors) |
| Locust | Python | Easy scripting, distributed mode | No (but configurable) |
| Gatling | Scala | Enterprise-grade, detailed reports | Yes |
| wrk2 | C | Constant-rate, fixes coordinated omission | Yes (by design) |
| vegeta | Go | Constant-rate attack, simple CLI | Yes |
```javascript
// k6 load test script
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // ramp to 100 VUs
    { duration: '5m', target: 100 },  // hold at 100 VUs
    { duration: '2m', target: 500 },  // stress test: ramp to 500
    { duration: '5m', target: 500 },  // hold at stress level
    { duration: '2m', target: 0 },    // cool down
  ],
  thresholds: {
    http_req_duration: ['p(99)<200'], // p99 < 200ms
    http_req_failed: ['rate<0.01'],   // <1% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/checkout');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1); // 1 second between requests per VU
}
```
A user request to your checkout service travels through a load balancer, an application server, a database, and a cache. Each component has a probability of being available. What is the probability that the entire chain works? And what happens when you add redundancy?
When components are in series, meaning all of them must work for the request to succeed, you multiply their availabilities:

$$A_{\text{series}} = A_1 \times A_2 \times \dots \times A_n$$
Let us work through a real example. A request path has four components:
| Component | Individual Availability |
|---|---|
| Load Balancer | 99.99% |
| App Server | 99.9% |
| Database | 99.9% |
| Cache | 99.95% |
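Multiplying the four gives $0.9999 \times 0.999 \times 0.999 \times 0.9995 \approx 0.9974$, or about 99.74% (roughly 22.8 hours of downtime per year). Even though every individual component has three or four nines, the chain only achieves two nines. This is the brutal arithmetic of series composition: the chain is weaker than its weakest link.
When components are in parallel, meaning the request succeeds if at least one works, you compute the probability that all of them fail, then subtract from 1. For $n$ replicas that each have availability $A$ and fail independently:

$$A_{\text{parallel}} = 1 - (1 - A)^n$$

For two 99.9% servers, that is $1 - 0.001^2 = 0.999999$, or 99.9999%.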
The improvement is dramatic. A single 99.9% server gives you 8.76 hours of downtime per year. Two in parallel give you 31.5 seconds. The math is the same for any redundancy count:
| Replicas | Individual: 99.9% | Combined | Downtime/year |
|---|---|---|---|
| 1 | 99.9% | 99.9% | 8.76 hours |
| 2 | 99.9% | 99.9999% | 31.5 seconds |
| 3 | 99.9% | 99.9999999% | 0.03 seconds |
Real architectures combine both. You have components in series (LB → App → DB), and some of those components are internally redundant (2 app servers in parallel, 3 DB replicas). Compute bottom-up: first resolve each parallel group to a single availability number, then multiply the series chain.
Build a system by adjusting component availabilities and replica counts. See total system availability update in real time. Each component is in series; within each component, replicas provide parallel redundancy.
You have defined your SLOs. You have capacity-planned. You have load-tested. Your system is running in production. How do you know when something goes wrong before your users tell you?
Google's SRE book distills all monitoring into four signals. If you measure nothing else, measure these:
| Signal | What it measures | What to watch for | Example alert threshold |
|---|---|---|---|
| Latency | Time to serve a request | Distinguish successful vs failed latency. A fast 500 error is not "low latency." | p99 > 500ms for 5 min |
| Traffic | Demand on the system | QPS, active sessions, or messages/sec. Both spikes and drops matter. | QPS drops 50% in 5 min (might indicate upstream failure) |
| Errors | Rate of failed requests | HTTP 5xx, gRPC errors, or application-level failures. Include both explicit (500s) and implicit (200 with wrong data). | Error rate > 1% for 5 min |
| Saturation | How full the system is | CPU %, memory %, disk I/O utilization, connection pool usage. The signal that predicts future problems. | CPU > 80% for 10 min |
There are two philosophies for alerting. Cause-based alerting fires when an internal metric crosses a threshold: "CPU is at 90%," "disk is 85% full," "memory usage is 12 GB." The problem: many cause-based alerts never affect users. CPU spikes to 90% for 30 seconds during a batch job, then drops back. If you page an engineer for every CPU spike, you get alert fatigue — they start ignoring pages, and when a real outage happens, they are too desensitized to react.
Symptom-based alerting fires when the user-visible effect crosses a threshold: "error rate is above 1%," "p99 latency exceeded 500ms for 5 minutes," "availability dropped below 99.9% this month." These alerts are actionable — they mean something is actually broken for users.
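The "for 5 minutes" clause is what separates a real symptom from a blip, and it is easy to express in code. The sketch below is a simplified evaluator, assuming one p99 latency sample per minute; the class name and sample values are hypothetical, and production alerting systems offer this as built-in configuration rather than hand-written code.

```python
from collections import deque


class SustainedAlert:
    """Fire only when a symptom stays above its threshold for a full window."""

    def __init__(self, threshold: float, for_samples: int):
        self.threshold = threshold
        self.recent = deque(maxlen=for_samples)  # most recent breach flags

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the alert should fire now."""
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# One p99 latency sample (ms) per minute; alert on > 500ms for 5 minutes.
brief_spike = [120, 900, 130, 125, 140, 135, 128]
real_incident = [120, 620, 700, 810, 760, 690, 150]

alert = SustainedAlert(threshold=500, for_samples=5)
print([alert.observe(v) for v in brief_spike])
# [False, False, False, False, False, False, False]

alert = SustainedAlert(threshold=500, for_samples=5)
print([alert.observe(v) for v in real_incident])
# [False, False, False, False, False, True, False]
```

The one-minute spike never pages anyone, while five consecutive bad minutes do, and the alert clears as soon as latency recovers.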
A good dashboard shows the four golden signals in real time, with historical context. Below is an interactive mock — inject failures and watch how the signals respond.
Click the failure buttons to inject problems. Watch all four signals respond. This is what a real production dashboard looks like during an incident.
Every alert should link to a runbook — a document that tells the on-call engineer exactly what to do. A runbook has three sections:
"... kubectl rollout undo. If caused by traffic spike: scale horizontally via kubectl scale --replicas=10. If caused by DB: failover to replica via ..."

This chapter is your cheat sheet. Every formula, every framework, every pattern from this lesson, organized for fast recall in an interview setting.
| Formula | What it computes | Example |
|---|---|---|
| QPS = DAU × req/user / 86400 | Average queries per second | 10M × 20 / 86400 ≈ 2,315 |
| Peak QPS ≈ 2-3x avg QPS | Peak capacity to plan for | 2,315 × 3 ≈ 6,945 |
| Storage = users × data/user/day × days | Total storage needed | 10M × 1KB × 1825d = 18.25 TB |
| $A_{\text{series}} = A_1 \times A_2 \times \dots \times A_n$ | Availability of chain | $0.999^3 \approx 99.7\%$ |
| $A_{\text{parallel}} = 1 - (1 - A)^n$ | Availability with n replicas | $1 - 0.001^2 = 99.9999\%$ |
| $P(\text{tail}) = 1 - (1 - p)^n$ | Prob of hitting tail in n fan-out calls | $1 - 0.99^5 \approx 4.9\%$ |
| Error budget = (1-SLO) × window | Allowed downtime | 0.001 × 43200min = 43.2 min/mo |
| L(q) = 1/(μ - q) | Latency vs load (M/M/1) | At 90% util: 10x baseline latency |
In every system design interview, cover these before drawing boxes:
Scenario 1: "Define the SLOs for a payment processing system."
Scenario 2: "Capacity plan for a social media app with 50M DAU."
Scenario 3: "p99 latency doubled after a deploy."
Drill 1: Sliding Window Percentile Tracker
```python
import bisect
from collections import deque


class PercentileTracker:
    """Track percentiles over a sliding window of N samples."""

    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.window = deque()   # insertion order
        self.sorted_vals = []   # sorted for percentile lookup

    def add(self, value):
        # Evict oldest if window is full
        if len(self.window) >= self.window_size:
            old = self.window.popleft()
            idx = bisect.bisect_left(self.sorted_vals, old)
            self.sorted_vals.pop(idx)
        # Add new value
        self.window.append(value)
        bisect.insort(self.sorted_vals, value)

    def percentile(self, p):
        """Return the p-th percentile (0-100)."""
        if not self.sorted_vals:
            return 0
        idx = int(len(self.sorted_vals) * p / 100)
        idx = min(idx, len(self.sorted_vals) - 1)
        return self.sorted_vals[idx]


# Usage. incoming_requests and fire_alert are placeholders for your own
# latency stream (in ms) and paging hook.
incoming_requests = [5] * 950 + [50] * 49 + [5000]


def fire_alert(message):
    print(message)


tracker = PercentileTracker(window_size=1000)
for latency in incoming_requests:
    tracker.add(latency)
    if tracker.percentile(99) > 200:
        fire_alert("p99 latency exceeded 200ms")
```
Drill 2: Availability Calculator
```python
def parallel_availability(single_avail, replicas):
    """Availability of N identical replicas in parallel."""
    return 1 - (1 - single_avail) ** replicas


def series_availability(*components):
    """Availability of components in series (all must work)."""
    result = 1.0
    for a in components:
        result *= a
    return result


def system_availability(components):
    """
    components: list of (availability, replicas) tuples in series.
    Each tuple is an independent stage; replicas add parallel redundancy.
    """
    stage_avails = []
    for avail, replicas in components:
        stage_avails.append(parallel_availability(avail, replicas))
    return series_availability(*stage_avails)


# Example: LB(99.99%, 1x) → App(99.9%, 2x) → DB(99.9%, 3x)
total = system_availability([
    (0.9999, 1),  # load balancer
    (0.999, 2),   # 2 app servers
    (0.999, 3),   # 3 DB replicas
])
print(f"System availability: {total*100:.6f}%")
# Output: System availability: 99.989900%
```
Nonfunctional requirements are not an isolated topic — they thread through every chapter of system design. Here is how the concepts in this lesson connect to the rest of Designing Data-Intensive Applications.
| This lesson | Where it connects | Why it matters |
|---|---|---|
| Availability math | Ch 6: Replication | Replication is how you achieve parallel availability. Leader-follower, multi-leader, and leaderless each have different availability profiles. |
| Capacity planning (QPS, storage) | Ch 7: Sharding | When one machine cannot handle the load, you shard. Capacity planning tells you when to shard and how many shards you need. |
| Latency percentiles | Ch 4: Storage & Retrieval | B-trees vs LSM-trees have fundamentally different latency distributions. B-trees have predictable reads; LSM-trees have write amplification spikes during compaction. |
| SLAs and error budgets | Ch 8: Transactions | Transactions trade latency for correctness guarantees. Your SLO determines whether you can afford the latency cost of serializable isolation. |
| Load testing | Ch 9: Distributed Trouble | The failures you inject in load tests (slow DB, memory leak, traffic spike) are the same partial failures that haunt distributed systems. |
| Monitoring golden signals | Ch 10: Consistency & Consensus | Consensus protocols like Raft have specific latency and availability trade-offs. You need monitoring to verify your consensus layer meets its SLOs. |
This lesson teaches you to quantify requirements. It does not teach you to implement them. Knowing that you need 99.99% availability and the math to compute it is step one. Actually building a system that achieves it — through replication, failover, load balancing, circuit breakers, and graceful degradation — is the rest of this book.
We also did not cover security requirements (encryption, authentication, authorization), compliance requirements (GDPR, HIPAA, SOC2), or maintainability requirements (code complexity, deployment frequency, mean time to recovery). These are equally important nonfunctional requirements, but they deserve their own lessons.
| Resource | What it covers |
|---|---|
| Google SRE Book, Ch 4: Service Level Objectives | The definitive guide to SLIs, SLOs, and error budgets from the team that invented the framework. |
| DDIA Chapter 1: Reliability, Scalability, Maintainability | Kleppmann's original treatment of these concepts, with worked examples from real systems. |
| Jeff Dean's "Numbers Every Engineer Should Know" | The latency numbers table that powers all back-of-envelope estimation. |
| Gil Tene's "How NOT to Measure Latency" | The definitive talk on coordinated omission and why most load testing tools lie to you. |