Designing Data-Intensive Applications — Chapter 2

Nonfunctional Requirements

SLAs, percentiles, capacity planning, load testing — quantifying system quality.

Prerequisites: Basic math + Web architecture. That's it.

Chapter 0: The Problem

Two engineers sit in a meeting room. The product manager says: "We need the checkout service to be fast and reliable." Engineer A nods and says "I'll put it on a single beefy server with an SSD." Engineer B says "We need three replicas across two availability zones with a CDN edge layer, auto-scaling, and a Redis cache." The product manager asks: "How much will each approach cost?" Engineer A says "$500/month." Engineer B says "$12,000/month."

Who is right? Neither of them knows, because nobody defined what "fast" and "reliable" actually mean.

If "fast" means p50 latency under 500ms, Engineer A's single server will do fine. If "fast" means p99 latency under 50ms under 10,000 concurrent users during Black Friday, Engineer B's architecture might not even be enough. If "reliable" means 99% uptime (3.65 days of downtime per year), a single server with good monitoring suffices. If "reliable" means 99.99% uptime (52 minutes per year), you need redundancy at every layer.

Vague requirements are the root cause of two opposite engineering disasters: over-engineering (spending $100K/month on infrastructure for a service that gets 10 requests per second) and under-engineering (deploying a single server that crashes on the first traffic spike and loses customer orders).

The core problem. Without precise numbers, you cannot make engineering decisions. "It needs to be fast" is a wish. "p99 latency < 200ms at 5,000 QPS with 99.95% monthly uptime" is a requirement. This chapter teaches you to convert wishes into requirements.

The simulation below shows how vague requirements lead to wildly different cost outcomes. Two teams build the same service — one with precise requirements, one without. Watch what happens.

[Interactive simulation: The Cost of Vague Requirements. An "Actual Load" slider (default peak: 5,000 QPS) shows how over-engineering and under-engineering play out against a precisely-engineered system.]

Notice what happens: the over-engineered system costs a fortune regardless of load. The under-engineered system is cheap until the load exceeds its capacity, at which point it crashes and costs you far more in lost revenue and emergency fixes. The precisely-engineered system scales proportionally — it costs exactly what the load demands, plus a reasonable safety margin.

This is why nonfunctional requirements matter. They are the numbers that turn an engineering problem from "build something good" into "build exactly this, for this many users, at this cost." In this lesson, we will learn the language and math of nonfunctional requirements: SLAs, percentiles, capacity planning, availability, and monitoring.

Interview check: A startup CTO says "Our API needs to be reliable." You need to turn this into an engineering requirement. What is the FIRST question you ask?

Chapter 1: SLIs, SLOs, SLAs

Google runs some of the most reliable services on the planet. They do not achieve this by making everything infinitely reliable — that would be infinitely expensive. Instead, they use a precise three-layer framework to define, target, and contractually guarantee system quality. Let us build it from the ground up.

Service Level Indicators (SLIs)

A Service Level Indicator is a quantitative measurement of some aspect of your service. It is a number you can observe, record, and plot on a graph. Common SLIs include:

| SLI | What it measures | How to compute it | Example |
|---|---|---|---|
| Latency | How long a request takes | Time from request received to response sent | p99 latency = 142ms |
| Error rate | Fraction of requests that fail | HTTP 5xx count / total request count | 0.3% of requests return 500 |
| Throughput | How many requests per second | Request count / time window | 2,400 QPS at peak |
| Availability | Fraction of time the service is up | Successful requests / total requests | 99.95% of probes succeed |
| Durability | Probability that stored data is not lost | 1 - (objects lost / objects stored) over time | 11 nines (99.999999999%) |

SLIs are measurements, not goals. They describe reality. You do not choose your SLI values — you observe them.
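To make "observe, record, and plot" concrete, here is a minimal sketch of computing several of these SLIs from a window of request records. The `Request` fields, the `compute_slis` helper, and the sample data are all hypothetical names chosen for illustration; a real system would read these from access logs or a metrics pipeline.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    timestamp_s: float   # arrival time in seconds (illustrative field)
    latency_ms: float    # time from request received to response sent
    status: int          # HTTP status code

def compute_slis(requests, window_s):
    """Compute basic SLIs over one observation window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests observed in this window")
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p99: smallest value with at least 99% of samples at or below it.
    p99_rank = max(1, math.ceil(0.99 * total))
    return {
        "throughput_qps": total / window_s,
        "error_rate": errors / total,
        "availability": (total - errors) / total,
        "p99_latency_ms": latencies[p99_rank - 1],
    }

# Hypothetical sample: 200 requests in a 10-second window, one of them a slow 500.
sample = [Request(i * 0.05, 20.0, 200) for i in range(199)]
sample.append(Request(9.95, 900.0, 500))
slis = compute_slis(sample, window_s=10)
print(slis)
```

Note that the single failed request moves the error rate and availability but not the p99, which is one reason to track more than one SLI.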

Service Level Objectives (SLOs)

A Service Level Objective is an internal target for an SLI. It is a line in the sand that your team draws: "We aim for p99 latency below 200ms" or "We target 99.9% availability per month." SLOs are promises you make to yourself. No lawyer is involved. No money changes hands if you miss them.

But SLOs are not arbitrary. They are chosen based on user expectations and business needs. If your payment processing SLO is 99.9% availability (8.76 hours of downtime per year), and you serve 1,000 transactions per hour at an average of $50, each hour of downtime costs you $50,000 in lost revenue. So you had better know exactly how much reliability you are promising.

Service Level Agreements (SLAs)

A Service Level Agreement is a contract between you and your customers that specifies what happens if you miss an SLO. SLAs have legal teeth — if you promise 99.95% uptime and deliver 99.8%, you owe your customer a refund, credit, or penalty. AWS credits your account if S3 drops below 99.9% availability in a billing period. Google Cloud credits you if Compute Engine drops below 99.95%.

The hierarchy. SLIs are what you measure. SLOs are what you aim for. SLAs are what you promise (with consequences). Always set your SLO stricter than your SLA — if your SLA is 99.95%, your internal SLO should be 99.97% or higher, so you have a buffer before the contract bites.

The Error Budget

If your SLO is 99.9% availability per month, that means you tolerate 0.1% unavailability. In a 30-day month (43,200 minutes), 0.1% is 43.2 minutes. That is your error budget — the maximum amount of downtime you can "spend" before violating your SLO.

Error budgets change how you think about risk. Deploying on a Friday afternoon? That might burn 20 minutes of error budget if something goes wrong. If you have already spent 30 minutes this month, you cannot afford it — deploy on Monday. If you have 43 minutes remaining on the 5th of the month, you are flush — ship with confidence.
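The deploy decision above is simple arithmetic, and a small sketch may help. The function names `error_budget_minutes` and `can_afford` are illustrative, and the 30-day window is the same assumption used in the text.

```python
def error_budget_minutes(slo, window_days=30):
    """Total allowed downtime (minutes) for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes

def can_afford(slo, spent_minutes, risk_minutes, window_days=30):
    """True if a change that might cost `risk_minutes` fits in the remaining budget."""
    remaining = error_budget_minutes(slo, window_days) - spent_minutes
    return risk_minutes <= remaining

budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes")                                 # 43.2 minutes
print(can_afford(0.999, spent_minutes=30, risk_minutes=20))    # False
print(can_afford(0.999, spent_minutes=0, risk_minutes=20))     # True
```

This treats the risky deploy as a worst case (it spends the full 20 minutes). A real policy might weight the risk by its probability instead.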

[Interactive calculator: SLA Uptime Calculator. Set the uptime percentage (default 99.900%) and see exactly how much downtime each "nine" allows per year, month, week, and day. Notice: each additional nine is 10x harder.]

The table of nines is one of the most important tables in systems engineering. Memorize it:

| Uptime | Downtime/year | Downtime/month | Downtime/week | Nines |
|---|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | 1.68 hours | Two nines |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min | Three nines |
| 99.95% | 4.38 hours | 21.9 min | 5.04 min | Three and a half |
| 99.99% | 52.6 min | 4.38 min | 1.01 min | Four nines |
| 99.999% | 5.26 min | 26.3 sec | 6.05 sec | Five nines |

The practical implication. Going from three nines (99.9%) to four nines (99.99%) means going from 8.76 hours of annual downtime to 52.6 minutes. That is a 10x reduction. It typically requires redundancy at every layer, automated failover, and zero-downtime deployments. As a rule of thumb, each nine costs roughly 10x more in engineering effort and infrastructure spend.

Interview check: Your SLA promises 99.95% monthly availability. It is day 20 of the month and you have already had 18 minutes of downtime. Should you approve a risky deploy that might cause 5 minutes of downtime?
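The table of nines can be regenerated rather than memorized. The sketch below assumes a 365-day year and a month of one twelfth of a year, so the last digit may differ slightly from a table built on other calendar assumptions. `downtime_seconds` and `humanize` are illustrative helper names.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year

PERIODS = {
    "year": SECONDS_PER_YEAR,
    "month": SECONDS_PER_YEAR / 12,
    "week": 7 * 24 * 3600,
}

def downtime_seconds(uptime_pct, period_seconds):
    """Allowed downtime in seconds for a given uptime percentage."""
    return (1 - uptime_pct / 100) * period_seconds

def humanize(seconds):
    """Format a duration using the largest unit that keeps the value >= 1."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} min"
    return f"{seconds:.1f} sec"

for uptime in (99, 99.9, 99.95, 99.99, 99.999):
    cells = ", ".join(
        f"{humanize(downtime_seconds(uptime, secs))}/{name}"
        for name, secs in PERIODS.items()
    )
    print(f"{uptime}%: {cells}")
```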

Chapter 2: Latency & Percentiles

Your monitoring dashboard says the average response time is 45ms. Sounds great. But five of your biggest customers are threatening to leave because "the site is unbearably slow." How? Because average latency is one of the most misleading metrics in computer science.

Why Averages Lie

Consider 1,000 requests to your API. 950 of them complete in 5ms. 49 of them complete in 50ms (they hit the database instead of the cache). 1 of them takes 5,000ms (a garbage collection pause plus a cold database connection). The average is:

// Compute the average
average = (950 × 5 + 49 × 50 + 1 × 5000) / 1000
        = (4750 + 2450 + 5000) / 1000
        = 12,200 / 1000
        = 12.2 ms    looks fine!

But what actually happened? 95% of requests took 5ms. 4.9% took 50ms. And 0.1% took 5 seconds. That 0.1% disproportionately lands on your biggest customers: they make more requests, so they are statistically more likely to hit the tail. And one terrible experience out of a thousand is enough to make them say "this site is slow."

Percentiles: The Right Tool

A percentile tells you: "X% of requests complete faster than this value." Sort all your latency measurements from smallest to largest. The value at position N% is the Nth percentile.

// For 1000 sorted latency values:
p50 (median) = value at position 500    = 5ms (half are faster)
p90 = value at position 900    = 5ms (90% are faster)
p95 = value at position 950    = 5ms (barely: the 950th is the last 5ms one)
p99 = value at position 990    = 50ms (99% are faster)
p99.9 = value at position 999    = 50ms (the last request before the outlier)
max = value at position 1000   = 5000ms (the outlier)

Now the picture is clear. p50 is 5ms (great), p99 is 50ms (acceptable), but the slowest 0.1% of requests take 5 seconds (terrible), so any percentile above p99.9 reads 5,000ms. The average of 12.2ms told you nothing useful. The percentiles told you everything.
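The 1,000-request example can be checked directly. This is a minimal sketch using the nearest-rank definition ("the value at position N%" of the sorted list). Other percentile definitions interpolate between neighbours and can give slightly different answers, so the `percentile` helper here is one convention, not the only one.

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile: value at 1-indexed position ceil(p% of N)."""
    rank = max(1, math.ceil(len(sorted_vals) * p / 100))
    return sorted_vals[rank - 1]

# 950 cache hits, 49 database hits, 1 pathological request (all in ms).
latencies = sorted([5] * 950 + [50] * 49 + [5000])

mean = sum(latencies) / len(latencies)
print(f"mean = {mean} ms")                      # mean = 12.2 ms
for p in (50, 90, 95, 99):
    print(f"p{p} = {percentile(latencies, p)} ms")
print(f"max = {latencies[-1]} ms")              # max = 5000 ms
```

The mean lands between values that almost no request actually experienced, while the percentiles and the maximum each describe a real group of requests.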

Tail Latency Amplification

In a microservices architecture, a single user request often fans out to multiple backend services. If your checkout page calls the cart service, the inventory service, the pricing service, and the payment service in parallel, the user's total latency is the maximum of all four calls. This is called tail latency amplification.

If each service has a p99 latency of 100ms and the calls are independent, the probability that all four are under 100ms is 0.99^4 = 0.961. So the chance that at least one is slow is 1 - 0.99^4 = 3.9%. Your combined p99 has degraded: about 4% of user requests, not 1%, now exceed 100ms. With 10 backend calls, 1 - 0.99^10 = 9.6% of requests are slow, so your effective p90 is now what used to be your p99.

// Tail latency amplification formula
P(at least one call > p99) = 1 - (1 - 0.01)^n

// For n parallel backend calls:
n = 1:   1.0% chance of tail hit
n = 4:   3.9% chance
n = 10:  9.6% chance
n = 20: 18.2% chance
n = 50: 39.5% chance    roughly two in five user requests hit a tail!

The amplification rule. If a single user request fans out to n backend services, each with p99 latency L, then roughly n% of user requests will experience latency ≥ L (for small n). This is why microservices architectures obsess over tail latency: a long tail in one service becomes a wide body in the aggregate.

[Interactive simulation: Latency Distribution Explorer. Sliders for tail weight (default 30) and fan-out (default 1 service) reshape a histogram of 1,000 requests with percentile markers. The average stays low while the tail percentiles explode.]
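The amplification formula is a one-liner, and inverting it answers a practical question: how good must each backend be for the fan-out as a whole to hold a p99 target? The sketch below assumes the backend calls are independent, which real services sharing a database or network often are not. `p_tail_hit` and `required_per_call_quantile` are illustrative names.

```python
def p_tail_hit(n, p_slow=0.01):
    """Probability that at least one of n independent calls is in its slow tail."""
    return 1 - (1 - p_slow) ** n

def required_per_call_quantile(n, target=0.99):
    """Per-call quantile needed so that all n calls are fast with probability `target`."""
    return target ** (1 / n)

for n in (1, 4, 10, 20, 50):
    print(f"n={n:>2}: {p_tail_hit(n):.1%} chance of a tail hit")

# With 10 parallel calls, each backend needs roughly a p99.9 under the target latency.
print(f"{required_per_call_quantile(10):.4%}")
```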

Coordinated Omission

When you load test with tools like Apache Bench or wrk, they typically send the next request only after the previous one completes. If one request takes 5 seconds, the tool waits 5 seconds before sending the next one. During those 5 seconds, zero requests are recorded. The tool under-counts how many users would have been affected — in reality, hundreds of users would have sent requests during that 5 seconds, and all of them would have experienced high latency.

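This is coordinated omission: the load testing tool's measurement is "coordinated" with the system's slowness, and it "omits" the requests that real users would have sent. wrk2 fixes this by maintaining a constant send rate regardless of response time. k6 can do the same when a scenario uses one of its arrival-rate executors; its default virtual-user loop still waits for each response before starting the next iteration.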
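A small deterministic simulation can show the size of the effect. This is a sketch under simplifying assumptions that are not from the text: a single-threaded FIFO server with a 10ms service time, one 5-second stall at the 1-second mark, a 10-second test, and a constant-rate client sending every 20ms. All names and constants are illustrative.

```python
import math

SERVICE_MS = 10        # normal service time
STALL_AT_MS = 1_000    # the first request started at or after this time stalls
STALL_MS = 5_000       # duration of the one-off stall
TEST_MS = 10_000       # total test duration

def service_time(start_ms, state):
    """Service time for a request starting at start_ms; stalls exactly once."""
    if not state["stalled"] and start_ms >= STALL_AT_MS:
        state["stalled"] = True
        return STALL_MS
    return SERVICE_MS

def closed_loop():
    """One connection: send the next request only after the previous response."""
    state, now, latencies = {"stalled": False}, 0, []
    while now < TEST_MS:
        took = service_time(now, state)
        latencies.append(took)
        now += took
    return latencies

def constant_rate(interval_ms=20):
    """Requests are due on a fixed schedule; latency counts from the intended send time."""
    state, server_free_at, latencies = {"stalled": False}, 0, []
    for intended in range(0, TEST_MS, interval_ms):
        start = max(intended, server_free_at)   # queue behind earlier requests
        server_free_at = start + service_time(start, state)
        latencies.append(server_free_at - intended)
    return latencies

def p99(values):
    ordered = sorted(values)
    return ordered[max(1, math.ceil(0.99 * len(ordered))) - 1]

print("closed loop   p99:", p99(closed_loop()), "ms")
print("constant rate p99:", p99(constant_rate()), "ms")
```

The closed-loop client records exactly one slow sample, so its p99 stays at the 10ms baseline. The constant-rate client counts every request that was due during and after the stall, and its p99 lands in the seconds.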

Interview check: Your API has a p50 of 10ms and p99 of 200ms. A user request fans out to 5 backend services in parallel. What is the approximate probability that a user experiences latency ≥ 200ms?

Chapter 3: Capacity Planning

An interviewer says: "Design a URL shortener for 100 million users." Before you draw a single box on the whiteboard, you need numbers. How many requests per second? How much storage? How much bandwidth? How many servers? Without these, your architecture is a guess.

The Back-of-Envelope Framework

Capacity planning follows a five-step framework. Each step feeds into the next. Let us walk through it for the URL shortener.

1. Estimate Users & Requests
100M total users. Assume 10% are daily active = 10M DAU. Each user creates 1 short URL/day and reads 10/day. That is 10M writes/day and 100M reads/day. The read:write ratio is 10:1.
2. Compute QPS
10M writes/day = 10,000,000 / 86,400 ≈ 116 writes/sec. 100M reads/day ≈ 1,157 reads/sec. Peak is typically 2-3x average: ~350 write QPS, ~3,500 read QPS at peak.
3. Estimate Storage
Each URL mapping: 7-char short code (7 bytes) + long URL (avg 200 bytes) + metadata (50 bytes) ≈ 257 bytes. 10M new URLs/day × 365 days × 5 years = 18.25B URLs. Total: 18.25B × 257 bytes ≈ 4.7 TB.
4. Estimate Bandwidth
Read bandwidth: 3,500 reads/sec × 257 bytes ≈ 900 KB/s (trivial). Write bandwidth: 350 writes/sec × 257 bytes ≈ 90 KB/s. Network is not the bottleneck here.
5. Estimate Compute
A modern server handles ~10K simple read QPS from memory or SSD cache. At 3,500 read QPS peak, a single server might handle it, but for redundancy we want 2-3 servers. Storage: 4.7 TB fits on a single large SSD array, but for durability we want replication — 3 replicas × 4.7 TB = 14.1 TB total.
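The five steps above can be sketched as a single function so the assumptions are easy to vary. The parameter names and the `capacity_plan` helper are illustrative, and the defaults (3x peak factor, 3 replicas, 365-day years) are the same assumptions used in the walkthrough.

```python
SECONDS_PER_DAY = 86_400

def capacity_plan(total_users, dau_fraction, writes_per_user, reads_per_user,
                  bytes_per_record, retention_years, peak_factor=3, replicas=3):
    """Back-of-envelope capacity estimate; every input is an assumption to revisit."""
    dau = total_users * dau_fraction
    writes_per_day = dau * writes_per_user
    reads_per_day = dau * reads_per_user

    avg_write_qps = writes_per_day / SECONDS_PER_DAY
    avg_read_qps = reads_per_day / SECONDS_PER_DAY
    peak_read_qps = avg_read_qps * peak_factor

    records = writes_per_day * 365 * retention_years
    storage_tb = records * bytes_per_record / 1e12

    return {
        "avg_write_qps": avg_write_qps,
        "avg_read_qps": avg_read_qps,
        "peak_write_qps": avg_write_qps * peak_factor,
        "peak_read_qps": peak_read_qps,
        "storage_tb": storage_tb,
        "replicated_storage_tb": storage_tb * replicas,
        "peak_read_bandwidth_kb_s": peak_read_qps * bytes_per_record / 1e3,
    }

# URL shortener from the walkthrough: 100M users, 10% daily active,
# 1 write and 10 reads per user per day, 257-byte records, 5-year retention.
plan = capacity_plan(100_000_000, 0.10, 1, 10, 257, 5)
for name, value in plan.items():
    print(f"{name}: {value:,.1f}")
```

The printed figures are unrounded, so they come out slightly below the rounded numbers in the text (for example about 347 peak write QPS rather than ~350).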
Numbers every engineer should know. These are Jeff Dean's "Latency Numbers Every Programmer Should Know," updated for modern hardware. L1 cache: 1ns. L2 cache: 4ns. RAM: 100ns. SSD random read: 16µs. SSD sequential read 1MB: 50µs. Network round trip (same datacenter): 500µs. HDD seek: 2ms. Network round trip (cross-continent): 150ms. By these numbers, a random SSD read is about 160x slower than a RAM access and an HDD seek is about 20,000x slower, so the gap between "in memory" and "on disk" spans two to four orders of magnitude. The gap between "same datacenter" and "cross-continent" is 300x.

The Numbers You Need to Memorize

| Quantity | Value | Why it matters |
|---|---|---|
| Seconds in a day | 86,400 ≈ 10^5 | Convert daily counts to QPS |
| Seconds in a month | 2,592,000 ≈ 2.6 × 10^6 (30 days) | Monthly budgets |
| 1 million requests/day | ≈ 12 QPS | Quick conversion |
| 1 KB × 1 billion | = 1 TB | Storage estimation |
| Peak : average ratio | 2-3x (typical), 10x (bursty) | Headroom planning |
| 80/20 rule | 20% of data serves 80% of reads | Cache sizing (cache the hot 20%) |

[Interactive simulation: Capacity Planner. Sliders for DAU (default 10M), requests per user per day (default 20), bytes per request (default 500 B), and retention (default 5 years) update the derived capacity requirements in real time.]

Interview check: A social media app has 50M DAU. Each user posts 2 items/day (avg 1 KB each) and views 100 items/day. Estimate the peak read QPS and the total storage after 3 years.

Chapter 4: Load Testing

You have estimated your capacity requirements. You have built your system. Now the question: does it actually handle the load? Hope is not a strategy. You need to prove it with load testing.

Four Types of Load Tests

| Type | What it tests | How it works | What it catches |
|---|---|---|---|
| Load test | Expected traffic | Ramp to your estimated peak QPS and sustain for 10-30 min | Basic bottlenecks: slow queries, underpowered instances |
| Stress test | Breaking point | Increase QPS until the system fails. Find the knee point. | The exact QPS where latency degrades, the failure mode (OOM, connection pool, CPU) |
| Soak test | Long-running stability | Sustain moderate load for 4-24 hours | Memory leaks, connection pool exhaustion, disk fill-up, log rotation failure |
| Spike test | Sudden burst | Jump from idle to 10x peak in seconds | Auto-scaling lag, cold start latency, queue overflow |

The Load-Latency Curve

The most important graph in load testing is the load-latency curve. Plot requests per second on the X axis and response time on the Y axis. At low load, latency is flat: the system responds in its baseline time. As load increases, latency stays close to flat until you hit the knee point, the QPS where some resource (CPU, memory, connections, disk I/O) becomes saturated. Beyond the knee, latency climbs steeply and grows without bound as load approaches capacity.

// Simplified model: latency as a function of load (M/M/1 queue)
L(q) = 1 / (μ - q)    where μ = max service rate, q = current QPS

// At q = 0: L = 1/μ (baseline latency)
// At q = 0.5μ: L = 2/μ (double baseline)
// At q = 0.9μ: L = 10/μ (10x baseline!)
// At q = 0.99μ: L = 100/μ (100x baseline)
// At q = μ: L = ∞ (queue grows unbounded)

This is why you never run a system at 90% capacity. At 90% utilization, latency is 10x baseline. At 70%, latency is only 3.3x baseline. The standard rule of thumb: keep peak utilization below 70% for latency-sensitive services.
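The multipliers quoted above follow from rewriting L(q) = 1/(μ - q) as the baseline 1/μ times 1/(1 - ρ), where ρ = q/μ is utilization. A minimal sketch, keeping in mind that M/M/1 is an idealized single-queue model and real systems only approximate it; the function names are illustrative.

```python
def latency_multiplier(utilization):
    """Mean latency relative to baseline in the M/M/1 model: 1 / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1 / (1 - utilization)

def max_utilization(multiplier):
    """Highest utilization that keeps mean latency within `multiplier` x baseline."""
    if multiplier < 1:
        raise ValueError("latency cannot drop below baseline")
    return 1 - 1 / multiplier

for rho in (0.5, 0.7, 0.9, 0.99):
    print(f"{rho:.0%} utilization -> {latency_multiplier(rho):.1f}x baseline")

# How hot can the system run if latency may at most double?
print(f"{max_utilization(2):.0%}")   # 50%
```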

The knee is non-negotiable. Every system has a knee point. You cannot engineer it away; you can only move it to a higher QPS by adding capacity. Your job in load testing is to find the knee, then ensure your expected peak load is well below it (typically 50-70% of the knee QPS).

[Interactive simulation: Load-Latency Curve Simulator. A "Current Load" slider (default 300 QPS) moves along the curve, and a "Max Capacity" slider (default 1,000 QPS) simulates adding more servers. Latency explodes as load approaches maximum capacity. The knee point is marked.]

Load Testing Tools

| Tool | Language | Strengths | Coordinated omission fix? |
|---|---|---|---|
| k6 | Go (JS scripts) | Modern, CI-friendly, cloud-native | Yes (with arrival-rate executors) |
| Locust | Python | Easy scripting, distributed mode | No (but configurable) |
| Gatling | Scala | Enterprise-grade, detailed reports | Yes |
| wrk2 | C | Constant-rate, fixes coordinated omission | Yes (by design) |
| vegeta | Go | Constant-rate attack, simple CLI | Yes |

javascript (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // ramp to 100 VUs
    { duration: '5m', target: 100 },   // hold at 100 VUs
    { duration: '2m', target: 500 },   // stress test: ramp to 500
    { duration: '5m', target: 500 },   // hold at stress level
    { duration: '2m', target: 0 },     // cool down
  ],
  thresholds: {
    http_req_duration: ['p(99)<200'],  // p99 < 200ms
    http_req_failed: ['rate<0.01'],    // <1% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/checkout');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);  // 1 second between requests per VU
}
Interview check: Your load test shows the system handles 2,000 QPS with p99 = 50ms. At 2,500 QPS, p99 jumps to 800ms. At 3,000 QPS, requests start timing out. What is the knee point, and what utilization should you target for production?

Chapter 5: Availability Math

A user request to your checkout service travels through a load balancer, an application server, a database, and a cache. Each component has a probability of being available. What is the probability that the entire chain works? And what happens when you add redundancy?

Series Composition (All Must Work)

When components are in series — meaning all of them must work for the request to succeed — you multiply their availabilities:

A_total = A_1 × A_2 × A_3 × ... × A_n

Let us work through a real example. A request path has four components:

| Component | Individual Availability |
|---|---|
| Load Balancer | 99.99% |
| App Server | 99.9% |
| Database | 99.9% |
| Cache | 99.95% |

A_total = 0.9999 × 0.999 × 0.999 × 0.9995
        = 0.99740...
        ≈ 99.74%    only two nines! 22.8 hours of downtime/year

Even though every individual component has three or four nines, the chain only achieves two nines. This is the brutal arithmetic of series composition: the chain is weaker than its weakest link.

Parallel Composition (Redundancy)

When components are in parallel (the request succeeds if at least one works), you compute the probability that all of them fail, then subtract from 1. The formula assumes the replicas fail independently; replicas that share a rack, a deploy, or a bug will do worse than this.

A_parallel = 1 - (1 - A_1)(1 - A_2)...(1 - A_n)

// Example: 2 app servers, each 99.9%
A_parallel = 1 - (1 - 0.999)^2
           = 1 - (0.001)^2
           = 1 - 0.000001
           = 99.9999%    six nines! Just by adding one server

The improvement is dramatic. A single 99.9% server gives you 8.76 hours of downtime per year. Two in parallel give you 31.5 seconds. The math is the same for any redundancy count:

| Replicas | Individual | Combined | Downtime/year |
|---|---|---|---|
| 1 | 99.9% | 99.9% | 8.76 hours |
| 2 | 99.9% | 99.9999% | 31.5 seconds |
| 3 | 99.9% | 99.9999999% | 0.03 seconds |

Redundancy is the only way to exceed your weakest component. You cannot get four nines from a three-nine component without redundancy. Adding a second instance in parallel is the most cost-effective reliability improvement in all of systems engineering. It is also why every production database has at least one replica.

Combining Series and Parallel

Real architectures combine both. You have components in series (LB → App → DB), and some of those components are internally redundant (2 app servers in parallel, 3 DB replicas). Compute bottom-up: first resolve each parallel group to a single availability number, then multiply the series chain.

// Example: LB(99.99%) → 2x App(99.9%) → 3x DB(99.9%) → Cache(99.95%)

A_app = 1 - (1 - 0.999)^2 = 99.9999%
A_db = 1 - (1 - 0.999)^3 = 99.9999999%

A_total = 0.9999 × 0.999999 × 0.999999999 × 0.9995
        ≈ 99.94%    much better than the 99.74% without redundancy!

[Interactive simulation: Availability Calculator: The Showcase. Adjust component availabilities and replica counts (defaults: LB 99.990%; app server 99.90%, 1 replica; DB 99.90%, 1 replica; cache 99.95%, 1 replica) and see total system availability update in real time. Each component is in series; within each component, replicas provide parallel redundancy.]

Interview check: Your system has three components in series, each at 99.9% availability. You have budget to add ONE redundant instance to ONE component. Which component do you make redundant for the greatest improvement?

Chapter 6: Monitoring & Alerting

You have defined your SLOs. You have capacity-planned. You have load-tested. Your system is running in production. How do you know when something goes wrong before your users tell you?

The Four Golden Signals

Google's SRE book distills all monitoring into four signals. If you measure nothing else, measure these:

| Signal | What it measures | What to watch for | Example alert threshold |
|---|---|---|---|
| Latency | Time to serve a request | Distinguish successful vs failed latency. A fast 500 error is not "low latency." | p99 > 500ms for 5 min |
| Traffic | Demand on the system | QPS, active sessions, or messages/sec. Both spikes and drops matter. | QPS drops 50% in 5 min (might indicate upstream failure) |
| Errors | Rate of failed requests | HTTP 5xx, gRPC errors, or application-level failures. Include both explicit (500s) and implicit (200 with wrong data). | Error rate > 1% for 5 min |
| Saturation | How full the system is | CPU %, memory %, disk I/O utilization, connection pool usage. The signal that predicts future problems. | CPU > 80% for 10 min |

Three monitoring methodologies compared. Google's Four Golden Signals (latency, traffic, errors, saturation) are the most widely used. The RED method (Rate, Errors, Duration) is simpler and is optimized for request-driven services. The USE method (Utilization, Saturation, Errors) is optimized for infrastructure resources (CPU, disk, network). Use Golden Signals for application monitoring, USE for infrastructure, RED when you want simplicity.

Symptom-Based vs Cause-Based Alerting

There are two philosophies for alerting. Cause-based alerting fires when an internal metric crosses a threshold: "CPU is at 90%," "disk is 85% full," "memory usage is 12 GB." The problem: many cause-based alerts never affect users. CPU spikes to 90% for 30 seconds during a batch job, then drops back. If you page an engineer for every CPU spike, you get alert fatigue — they start ignoring pages, and when a real outage happens, they are too desensitized to react.

Symptom-based alerting fires when the user-visible effect crosses a threshold: "error rate is above 1%," "p99 latency exceeded 500ms for 5 minutes," "availability dropped below 99.9% this month." These alerts are actionable — they mean something is actually broken for users.

The rule: page on symptoms, log causes. Your pager should fire only when users are affected (symptoms). Cause-based metrics (CPU, memory, disk) go to dashboards for investigation after the page fires. This keeps paging volume low and signal quality high. An engineer who gets 3 real pages per week will respond instantly. An engineer who gets 30 false alarms per week will ignore the real one.
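Thresholds such as "p99 latency exceeded 500ms for 5 minutes" combine a symptom with a duration, so a single noisy minute does not page anyone. Here is a minimal sketch of that rule, assuming one p99 sample arrives per minute. `SustainedThresholdAlert` and the sample series are hypothetical, not any monitoring product's API.

```python
from collections import deque

class SustainedThresholdAlert:
    """Fire only when every sample in the last `for_minutes` exceeds the threshold."""

    def __init__(self, threshold, for_minutes):
        self.threshold = threshold
        self.recent = deque(maxlen=for_minutes)   # keeps only the newest samples

    def observe(self, value):
        """Record one per-minute value; return True if the alert should fire."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.threshold for v in self.recent)

alert = SustainedThresholdAlert(threshold=500, for_minutes=5)

# Hypothetical per-minute p99 latencies (ms): one brief spike, then a sustained problem.
per_minute_p99 = [120, 900, 130, 620, 640, 700, 810, 650, 140]
fired = [alert.observe(v) for v in per_minute_p99]
print(fired)
```

The isolated 900ms minute never fires. The alert triggers only once five consecutive minutes sit above 500ms, and it clears as soon as one minute recovers.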

The Monitoring Dashboard

A good dashboard shows the four golden signals in real time, with historical context. Below is an interactive mock — inject failures and watch how the signals respond.

[Interactive simulation: Golden Signals Dashboard. Failure buttons inject problems and all four signals respond. This is what a real production dashboard looks like during an incident.]

Runbooks

Every alert should link to a runbook — a document that tells the on-call engineer exactly what to do. A runbook has three sections:

1. What is this alert?
One sentence: "This fires when p99 latency exceeds 500ms for 5 consecutive minutes, indicating checkout is slow for at least 1% of users."
2. How to investigate
Step-by-step: (a) Check the dashboard for the specific service. (b) Look at the saturation panel — is CPU/memory/connection pool full? (c) Check recent deploys — did something change in the last hour? (d) Check downstream dependencies — is the database slow?
3. How to mitigate
Concrete actions: "If caused by recent deploy: roll back via kubectl rollout undo. If caused by traffic spike: scale horizontally via kubectl scale --replicas=10. If caused by DB: failover to replica via ..."
Interview check: An on-call engineer gets paged 40 times per week, but only 5 of those pages correspond to actual user-facing issues. What is the core problem, and how would you fix it?

Chapter 7: Interview Arsenal

This chapter is your cheat sheet. Every formula, every framework, every pattern from this lesson, organized for fast recall in an interview setting.

Quick Reference: Formulas

| Formula | What it computes | Example |
|---|---|---|
| QPS = DAU × req/user / 86400 | Average queries per second | 10M × 20 / 86400 ≈ 2,315 |
| Peak QPS ≈ 2-3x avg QPS | Peak capacity to plan for | 2,315 × 3 ≈ 6,945 |
| Storage = users × data/user/day × days | Total storage needed | 10M × 1KB × 1825d = 18.25 TB |
| A_series = A_1 × A_2 × ... × A_n | Availability of chain | 0.999^3 = 99.7% |
| A_parallel = 1 - (1-A)^n | Availability with n replicas | 1 - (0.001)^2 = 99.9999% |
| P(tail) = 1 - (1-p)^n | Prob of hitting tail in n fan-out calls | 1 - 0.99^5 = 4.9% |
| Error budget = (1-SLO) × window | Allowed downtime | 0.001 × 43200min = 43.2 min/mo |
| L(q) = 1/(μ - q) | Latency vs load (M/M/1) | At 90% util: 10x baseline latency |

System Design: Nonfunctional Requirements Checklist

In every system design interview, cover these before drawing boxes:

1. Users & Traffic
How many DAU? Read:write ratio? Peak:average ratio? Geographic distribution?
2. Latency Requirements
What is the acceptable p50? p99? For which operations? (Reads vs writes often have different targets.)
3. Availability & Durability
What uptime is required? How many nines? Is data loss acceptable? (Payments: no. Social media likes: maybe.)
4. Storage & Bandwidth
How much data per request? How long is data retained? What is the storage growth rate?
5. Cost Constraints
What is the budget? Is this a startup (optimize for cost) or Big Tech (optimize for reliability)?

Interview Scenarios

Scenario 1: "Define the SLOs for a payment processing system."

Staff answer. Availability: 99.99% (payments are money — downtime = lost revenue + chargebacks). Latency: p50 < 100ms, p99 < 500ms (users abandon checkout after 3 seconds). Error rate: < 0.01% for payment failures (each failure is a lost sale or worse, a double charge). Durability: 100% (never lose a transaction record — this is a legal requirement). Idempotency: every payment must be idempotent (retries must not double-charge). These SLOs drive the architecture: you need synchronous replication for the transaction log, at least 3 replicas across 2 AZs, and circuit breakers on every downstream call.

Scenario 2: "Capacity plan for a social media app with 50M DAU."

Staff answer. Reads: 50M × 100 views/day = 5B reads/day = 58K avg QPS, ~175K peak. Writes: 50M × 2 posts/day = 100M writes/day = 1.16K avg QPS, ~3.5K peak. Storage: 100M posts/day × 1KB × 365 × 3 years = 109.5 TB. With replication (3x) = 328.5 TB. Cache: hot 20% of data = ~22 TB in Redis (need a cluster). Network: 175K QPS × 5KB avg response = 875 MB/s = ~7 Gbps (need multiple load balancers). Servers: at 10K QPS per server for cache hits, need ~18 app servers for peak. Total estimated cost: $50-80K/month on cloud.

Scenario 3: "p99 latency doubled after a deploy."

Staff answer. (1) Check the diff — what changed? A new code path, a new DB query, a new dependency call? (2) Profile: is it CPU (new computation), memory (GC pressure from new allocations), I/O (new DB query), or network (new external call)? (3) Check if it affects all endpoints or just the changed one. (4) Check if the database query plan changed (new index? missing index?). (5) Measure: is the regression in the service itself or in a downstream dependency? (6) Quick mitigation: roll back the deploy while investigating. (7) Root cause: likely a missing database index on a new query path, or an N+1 query that was not caught in code review.

Coding Drills

Drill 1: Sliding Window Percentile Tracker

python
import bisect
import math
from collections import deque

class PercentileTracker:
    """Track percentiles over a sliding window of N samples."""

    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.window = deque()       # samples in insertion order, oldest first
        self.sorted_vals = []       # the same samples, kept sorted for lookup

    def add(self, value):
        # Evict the oldest sample once the window is full
        if len(self.window) >= self.window_size:
            old = self.window.popleft()
            idx = bisect.bisect_left(self.sorted_vals, old)
            self.sorted_vals.pop(idx)
        # Insert the new sample into both views (insort is O(n) on a list)
        self.window.append(value)
        bisect.insort(self.sorted_vals, value)

    def percentile(self, p):
        """Return the p-th percentile (0-100) by nearest rank, or None if empty."""
        if not self.sorted_vals:
            return None
        # Nearest rank: the value at 1-indexed position ceil(p% of N)
        rank = max(1, math.ceil(len(self.sorted_vals) * p / 100))
        return self.sorted_vals[rank - 1]

# Usage. incoming_latencies_ms and fire_alert are placeholders for your
# request stream and paging hook.
incoming_latencies_ms = [12, 15, 11, 240, 13]

def fire_alert(message):
    print("ALERT:", message)

tracker = PercentileTracker(window_size=1000)
for latency in incoming_latencies_ms:
    tracker.add(latency)
    if tracker.percentile(99) > 200:
        fire_alert("p99 latency exceeded 200ms")

Drill 2: Availability Calculator

python
def parallel_availability(single_avail, replicas):
    """Availability of N identical replicas in parallel."""
    return 1 - (1 - single_avail) ** replicas

def series_availability(*components):
    """Availability of components in series (all must work)."""
    result = 1.0
    for a in components:
        result *= a
    return result

def system_availability(components):
    """
    components: list of (availability, replicas) tuples in series.
    Each tuple is an independent stage; replicas add parallel redundancy.
    """
    stage_avails = []
    for avail, replicas in components:
        stage_avails.append(parallel_availability(avail, replicas))
    return series_availability(*stage_avails)

# Example: LB(99.99%, 1x) → App(99.9%, 2x) → DB(99.9%, 3x)
total = system_availability([
    (0.9999, 1),   # load balancer
    (0.999,  2),   # 2 app servers
    (0.999,  3),   # 3 DB replicas
])
print(f"System availability: {total*100:.6f}%")
# Output: System availability: 99.989900%
Interview check: An interviewer asks: "How would you define SLOs for a real-time multiplayer game server?" Which of these is the most staff-level answer?

Chapter 8: Connections

Nonfunctional requirements are not an isolated topic — they thread through every chapter of system design. Here is how the concepts in this lesson connect to the rest of Designing Data-Intensive Applications.

What We Covered vs What Comes Next

| This lesson | Where it connects | Why it matters |
|---|---|---|
| Availability math | Ch 6: Replication | Replication is how you achieve parallel availability. Leader-follower, multi-leader, and leaderless each have different availability profiles. |
| Capacity planning (QPS, storage) | Ch 7: Sharding | When one machine cannot handle the load, you shard. Capacity planning tells you when to shard and how many shards you need. |
| Latency percentiles | Ch 4: Storage & Retrieval | B-trees vs LSM-trees have fundamentally different latency distributions. B-trees have predictable reads; LSM-trees have write amplification spikes during compaction. |
| SLAs and error budgets | Ch 8: Transactions | Transactions trade latency for correctness guarantees. Your SLO determines whether you can afford the latency cost of serializable isolation. |
| Load testing | Ch 9: Distributed Trouble | The failures you inject in load tests (slow DB, memory leak, traffic spike) are the same partial failures that haunt distributed systems. |
| Monitoring golden signals | Ch 10: Consistency & Consensus | Consensus protocols like Raft have specific latency and availability trade-offs. You need monitoring to verify your consensus layer meets its SLOs. |

Limitations of This Lesson

This lesson teaches you to quantify requirements. It does not teach you to implement them. Knowing that you need 99.99% availability and the math to compute it is step one. Actually building a system that achieves it — through replication, failover, load balancing, circuit breakers, and graceful degradation — is the rest of this book.

We also did not cover security requirements (encryption, authentication, authorization), compliance requirements (GDPR, HIPAA, SOC2), or maintainability requirements (code complexity, deployment frequency, mean time to recovery). These are equally important nonfunctional requirements, but they deserve their own lessons.

Recommended Reading

| Resource | What it covers |
|---|---|
| Google SRE Book, Ch 4: Service Level Objectives | The definitive guide to SLIs, SLOs, and error budgets from the team that invented the framework. |
| DDIA Chapter 1: Reliability, Scalability, Maintainability | Kleppmann's original treatment of these concepts, with worked examples from real systems. |
| Jeff Dean's "Numbers Every Engineer Should Know" | The latency numbers table that powers all back-of-envelope estimation. |
| Gil Tene's "How NOT to Measure Latency" | The definitive talk on coordinated omission and why most load testing tools lie to you. |

"If you can't measure it, you can't improve it." (Peter Drucker). This lesson gave you the measurement tools. The rest of DDIA gives you the improvement tools. Every architectural decision in the chapters ahead (replication strategy, partitioning scheme, consistency model, transaction isolation level) will be evaluated against the nonfunctional requirements you now know how to define.