Nonfunctional Requirements

Chapter 0: The Problem

Two engineers sit in a meeting room. The product manager says: "We need the checkout service to be fast and reliable." Engineer A nods and says "I'll put it on a single beefy server with an SSD." Engineer B says "We need three replicas across two availability zones with a CDN edge layer, auto-scaling, and a Redis cache." The product manager asks: "How much will each approach cost?" Engineer A says "$500/month." Engineer B says "$12,000/month."

Who is right? Neither of them knows, because nobody defined what "fast" and "reliable" actually mean.

If "fast" means p50 latency under 500ms, Engineer A's single server will do fine. If "fast" means p99 latency under 50ms under 10,000 concurrent users during Black Friday, Engineer B's architecture might not even be enough. If "reliable" means 99% uptime (3.65 days of downtime per year), a single server with good monitoring suffices. If "reliable" means 99.99% uptime (52 minutes per year), you need redundancy at every layer.

Vague requirements are the root cause of two opposite engineering disasters: over-engineering (spending $100K/month on infrastructure for a service that gets 10 requests per second) and under-engineering (deploying a single server that crashes on the first traffic spike and loses customer orders).

The core problem. Without precise numbers, you cannot make engineering decisions. "It needs to be fast" is a wish. "p99 latency < 200ms at 5,000 QPS with 99.95% monthly uptime" is a requirement. This chapter teaches you to convert wishes into requirements.

The simulation below shows how vague requirements lead to wildly different cost outcomes. Two teams build the same service — one with precise requirements, one without. Watch what happens.

The Cost of Vague Requirements

Drag the "Actual Load" slider to see how over-engineering and under-engineering play out against a precisely-engineered system.

Control (default): Actual Peak QPS = 5,000

Notice what happens: the over-engineered system costs a fortune regardless of load. The under-engineered system is cheap until the load exceeds its capacity, at which point it crashes and costs you far more in lost revenue and emergency fixes. The precisely-engineered system scales proportionally — it costs exactly what the load demands, plus a reasonable safety margin.

This is why nonfunctional requirements matter. They are the numbers that turn an engineering problem from "build something good" into "build exactly this, for this many users, at this cost." In this lesson, we will learn the language and math of nonfunctional requirements: SLAs, percentiles, capacity planning, availability, and monitoring.

Interview check: A startup CTO says "Our API needs to be reliable." You need to turn this into an engineering requirement. What is the FIRST question you ask?

"What cloud provider are you using?" "How many servers do you have?" "What does 'reliable' mean to your users — what is the acceptable error rate and downtime per month?" "Have you considered using Kubernetes?"

Chapter 1: SLIs, SLOs, SLAs

Google runs some of the most reliable services on the planet. They do not achieve this by making everything infinitely reliable — that would be infinitely expensive. Instead, they use a precise three-layer framework to define, target, and contractually guarantee system quality. Let us build it from the ground up.

Service Level Indicators (SLIs)

A Service Level Indicator is a quantitative measurement of some aspect of your service. It is a number you can observe, record, and plot on a graph. Common SLIs include:

SLI	What it measures	How to compute it	Example
Latency	How long a request takes	Time from request received to response sent	p99 latency = 142ms
Error rate	Fraction of requests that fail	HTTP 5xx count / total request count	0.3% of requests return 500
Throughput	How many requests per second	Request count / time window	2,400 QPS at peak
Availability	Fraction of time the service is up	Successful requests / total requests	99.95% of probes succeed
Durability	Probability that stored data is not lost	1 - (objects lost / objects stored) over time	11 nines (99.999999999%)

SLIs are measurements, not goals. They describe reality. You do not choose your SLI values — you observe them.

Service Level Objectives (SLOs)

A Service Level Objective is an internal target for an SLI. It is a line in the sand that your team draws: "We aim for p99 latency below 200ms" or "We target 99.9% availability per month." SLOs are promises you make to yourself. No lawyer is involved. No money changes hands if you miss them.

But SLOs are not arbitrary. They are chosen based on user expectations and business needs. If your payment processing SLO is 99.9% availability (8.76 hours of downtime per year), and you serve 1,000 transactions per hour at an average of $50, each hour of downtime costs you $50,000 in lost revenue. So you had better know exactly how much reliability you are promising.

Service Level Agreements (SLAs)

A Service Level Agreement is a contract between you and your customers that specifies what happens if you miss an SLO. SLAs have legal teeth — if you promise 99.95% uptime and deliver 99.8%, you owe your customer a refund, credit, or penalty. AWS credits your account if S3 drops below 99.9% availability in a billing period. Google Cloud credits you if Compute Engine drops below 99.95%.

The hierarchy. SLIs are what you measure. SLOs are what you aim for. SLAs are what you promise (with consequences). Always set your SLO stricter than your SLA — if your SLA is 99.95%, your internal SLO should be 99.97% or higher, so you have a buffer before the contract bites.
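To make the hierarchy concrete, here is a minimal sketch that computes an availability SLI from request counts and compares it with an SLO and an SLA. The function names and the request counts are hypothetical; the 99.97% and 99.95% thresholds are simply the example values from the paragraph above.

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: the observed fraction of successful requests."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful / total


def classify(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA."""
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "SLO missed, SLA intact"
    return "SLA breached"


# Hypothetical month: 10,000,000 requests, 4,000 of them failed.
sli = availability_sli(9_996_000, 10_000_000)
status = classify(sli, slo=0.9997, sla=0.9995)
print(f"SLI = {sli:.4%}, status: {status}")
```

The middle state is the buffer the stricter SLO buys you: the team knows it is off target before any contract penalty applies.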

The Error Budget

If your SLO is 99.9% availability per month, that means you tolerate 0.1% unavailability. In a 30-day month (43,200 minutes), 0.1% is 43.2 minutes. That is your error budget — the maximum amount of downtime you can "spend" before violating your SLO.

Error budgets change how you think about risk. Deploying on a Friday afternoon? That might burn 20 minutes of error budget if something goes wrong. If you have already spent 30 minutes this month, you cannot afford it — deploy on Monday. If you have 43 minutes remaining on the 5th of the month, you are flush — ship with confidence.
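The deploy decision above is simple arithmetic, sketched here under the same assumptions (99.9% SLO, 30-day month, a deploy that might cost 20 minutes). The function names are illustrative, not part of any standard tool.

```python
def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Total allowed downtime in the window for an availability SLO (0 to 1)."""
    return (1 - slo) * window_minutes


def can_afford(slo: float, window_minutes: float,
               spent_minutes: float, risk_minutes: float) -> bool:
    """True if a change that may cost risk_minutes still fits in the budget."""
    remaining = error_budget_minutes(slo, window_minutes) - spent_minutes
    return risk_minutes <= remaining


WINDOW = 30 * 24 * 60  # 30-day month = 43,200 minutes
budget = error_budget_minutes(0.999, WINDOW)
print(f"budget: {budget:.1f} min")

# 30 minutes already spent: a 20-minute risk does not fit.
print(can_afford(0.999, WINDOW, spent_minutes=30, risk_minutes=20))
# Nothing spent yet: the same deploy is affordable.
print(can_afford(0.999, WINDOW, spent_minutes=0, risk_minutes=20))
```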

SLA Uptime Calculator

Set the uptime percentage and see exactly how much downtime each "nine" allows per year, month, week, and day. Notice: each additional nine is 10x harder.

Control (default): Uptime = 99.900%

The table of nines is one of the most important reference tables in systems engineering. Memorize it:

Uptime	Downtime/year	Downtime/month	Downtime/week	Nines
99%	3.65 days	7.31 hours	1.68 hours	Two nines
99.9%	8.76 hours	43.8 min	10.1 min	Three nines
99.95%	4.38 hours	21.9 min	5.04 min	Three and a half
99.99%	52.6 min	4.38 min	1.01 min	Four nines
99.999%	5.26 min	26.3 sec	6.05 sec	Five nines

The practical implication. Going from three nines (99.9%) to four nines (99.99%) means going from 8.76 hours of annual downtime to 52.6 minutes. That is a 10x reduction. It typically requires redundancy at every layer, automated failover, and zero-downtime deployments. Each nine costs roughly 10x more in engineering effort and infrastructure spend.
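If you forget the table, you can rebuild it. The sketch below assumes a 365-day year and treats a month as one twelfth of a year, which appears to be the convention the table uses; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours(uptime_percent: float) -> dict:
    """Allowed downtime in hours per year, per month (year / 12), and per week."""
    down_fraction = 1 - uptime_percent / 100
    per_year = down_fraction * HOURS_PER_YEAR
    return {
        "year": per_year,
        "month": per_year / 12,
        "week": down_fraction * 7 * 24,
    }


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    d = downtime_hours(pct)
    print(f"{pct}%: {d['year']:.2f} h/year, "
          f"{d['month'] * 60:.1f} min/month, "
          f"{d['week'] * 60:.2f} min/week")
```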

Interview check: Your SLA promises 99.95% monthly availability. It is day 20 of the month and you have already had 18 minutes of downtime. Should you approve a risky deploy that might cause 5 minutes of downtime?

No. 0.05% of 43,200 minutes = 21.6 min allowed. You have 3.6 minutes remaining, so a risky 5-minute deploy would exhaust your error budget and breach the SLA. Wait until next month or de-risk the deploy.

Yes. 18 minutes is still within the 99.95% budget, and the deploy will improve future reliability.

It depends on the day of the week.

Chapter 2: Latency & Percentiles

Your monitoring dashboard says the average response time is 45ms. Sounds great. But five of your biggest customers are threatening to leave because "the site is unbearably slow." How? Because average latency is one of the most misleading metrics in computer science.

Why Averages Lie

Consider 1,000 requests to your API. 950 of them complete in 5ms. 49 of them complete in 50ms (they hit the database instead of the cache). 1 of them takes 5,000ms (a garbage collection pause plus a cold database connection). The average is:

// Compute the average
average = (950 × 5 + 49 × 50 + 1 × 5000) / 1000
        = (4750 + 2450 + 5000) / 1000
        = 12,200 / 1000
        = 12.2 ms    looks fine!

But what actually happened? 95% of users saw 5ms. 4.9% saw 50ms. And 0.1% saw 5 seconds. That 0.1% is your biggest customers: they make more requests, so they are statistically more likely to hit the tail. And one terrible experience out of a thousand makes them say "this site is slow."

Percentiles: The Right Tool

A percentile tells you: "X% of requests complete faster than this value." Sort all your latency measurements from smallest to largest. The value at position N% is the Nth percentile.

// For 1000 sorted latency values (positions are 1-indexed):
p50 (median) = value at position 500    = 5ms (half are at or below this)
p90 = value at position 900    = 5ms
p95 = value at position 950    = 5ms (barely: the 950th is the last 5ms one)
p99 = value at position 990    = 50ms
p99.9 = value at position 999    = 50ms (the last 50ms one)
max = value at position 1000   = 5000ms (the outlier)

Now the picture is clear. p50 is 5ms (great), p99 is 50ms (acceptable), but the maximum is 5 seconds (terrible). With only 1,000 samples, a single outlier sits beyond even p99.9, which is why you track the maximum alongside the high percentiles. The average of 12.2ms told you nothing useful. The percentiles and the maximum told you everything.
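You can check the example numerically. The sketch below rebuilds the 1,000-request dataset and uses the nearest-rank definition (the value at 1-indexed position ceil(p/100 × n)), which is one of several percentile conventions; the helper name is illustrative.

```python
import math

# The example dataset: 950 requests at 5 ms, 49 at 50 ms, 1 at 5,000 ms.
latencies_ms = sorted([5] * 950 + [50] * 49 + [5000])


def nearest_rank(sorted_vals, p):
    """Nearest-rank percentile: value at 1-indexed position ceil(p * n / 100)."""
    rank = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean} ms")
for p in (50, 90, 95, 99):
    print(f"p{p} = {nearest_rank(latencies_ms, p)} ms")
print(f"max = {latencies_ms[-1]} ms")
```

The mean lands between the clusters and describes no request that actually happened, while the percentiles and the maximum each name a latency some user really saw.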

Tail Latency Amplification

In a microservices architecture, a single user request often fans out to multiple backend services. If your checkout page calls the cart service, the inventory service, the pricing service, and the payment service in parallel, the user's total latency is the maximum of all four calls. This is called tail latency amplification.

If each service has a p99 latency of 100ms and the calls are independent, the probability that all four finish under 100ms is 0.99⁴ ≈ 0.961. So 1 - 0.99⁴ ≈ 3.9% of user requests see at least one slow call: 100ms is no longer the user's p99, it is closer to their p96. With 10 backend calls, 1 - 0.99¹⁰ ≈ 9.6% of requests are slow, so what used to be your p99 is now roughly your p90.

// Tail latency amplification formula
P(at least one call > p99) = 1 - (1 - 0.01)ⁿ

// For n parallel backend calls:
n = 1: 1.0% chance of tail hit
n = 4: 3.9% chance
n = 10: 9.6% chance
n = 20: 18.2% chance
n = 50: 39.5% chance    // about two in five user requests hit a tail!

The amplification rule. If a single user request fans out to n backend services, each with p99 latency L, then roughly n% of user requests will experience latency ≥ L (for small n). This is why microservices architectures obsess over tail latency — a long tail in one service becomes a wide body in the aggregate.
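The formula is a one-liner in code. This sketch assumes the backend calls are independent, which real systems only approximate (shared databases and networks make slow calls correlated); the function name is illustrative.

```python
def tail_hit_probability(fan_out: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of fan_out independent calls is slow."""
    return 1 - (1 - p_slow) ** fan_out


for n in (1, 4, 10, 20, 50):
    print(f"n = {n:>2}: {tail_hit_probability(n):.1%}")
```

Passing p_slow=0.001 shows the other side of the rule: to keep the user-facing p99 intact with a fan-out of 10, each backend needs to hit the target at roughly its p99.9.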

Latency Distribution Explorer

Adjust the distribution shape and tail weight. Watch how the average stays low while the tail percentiles explode. The histogram shows 1,000 requests — hover over the percentile markers to see exact values.

Controls (defaults): Tail weight = 30; Fan-out (services) = 1

Coordinated Omission

When you load test with tools like Apache Bench or wrk, they typically send the next request only after the previous one completes. If one request takes 5 seconds, the tool waits 5 seconds before sending the next one. During those 5 seconds, zero requests are recorded. The tool under-counts how many users would have been affected — in reality, hundreds of users would have sent requests during that 5 seconds, and all of them would have experienced high latency.

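This is coordinated omission: the load testing tool's measurement is "coordinated" with the system's slowness, and it "omits" the requests that real users would have sent. Tools like wrk2 fix this by maintaining a constant send rate regardless of response time. k6 does the same when you use its arrival-rate executors (constant-arrival-rate, ramping-arrival-rate) rather than its default looping virtual users.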
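A toy simulation makes the effect visible. The model below is a deliberate simplification: every request takes 10 ms, except that the server stalls once for 5 seconds and any request sent during the stall completes 10 ms after the stall ends. There is no queueing, and all names and numbers are assumptions for illustration.

```python
BASE_MS = 10            # normal service time
STALL_START_MS = 1_000  # the server freezes at t = 1 s ...
STALL_END_MS = 6_000    # ... and recovers at t = 6 s
TEST_END_MS = 10_000    # total test duration


def latency_ms(send_time_ms: int) -> int:
    """Latency of a request sent at send_time_ms in this toy model."""
    if STALL_START_MS <= send_time_ms < STALL_END_MS:
        return (STALL_END_MS - send_time_ms) + BASE_MS
    return BASE_MS


def closed_loop() -> list:
    """One connection: the next request is sent only after the previous response."""
    t, samples = 0, []
    while t < TEST_END_MS:
        lat = latency_ms(t)
        samples.append(lat)
        t += lat
    return samples


def open_loop(interval_ms: int = 10) -> list:
    """Constant rate: one request every interval_ms, regardless of responses."""
    return [latency_ms(t) for t in range(0, TEST_END_MS, interval_ms)]


def slow_count(samples, threshold_ms: int = 1_000) -> int:
    """Number of samples slower than threshold_ms."""
    return sum(1 for s in samples if s > threshold_ms)


closed, opened = closed_loop(), open_loop()
print(f"closed loop: {len(closed)} samples, {slow_count(closed)} slower than 1 s")
print(f"open loop:   {len(opened)} samples, {slow_count(opened)} slower than 1 s")
```

Under this model the closed-loop client records a single slow sample out of 500, while the constant-rate client records 401 slow samples out of 1,000. Same server, same stall, very different p99.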

Interview check: Your API has a p50 of 10ms and p99 of 200ms. A user request fans out to 5 backend services in parallel. What is the approximate probability that a user experiences latency ≥ 200ms?

1%, same as each individual service's p99

~5%, because 1 - (0.99)^5 = 4.9%, so roughly 1 in 20 user requests will have at least one backend call hit its p99

5 × 1% = 5%, because latencies add linearly

Chapter 3: Capacity Planning

An interviewer says: "Design a URL shortener for 100 million users." Before you draw a single box on the whiteboard, you need numbers. How many requests per second? How much storage? How much bandwidth? How many servers? Without these, your architecture is a guess.

The Back-of-Envelope Framework

Capacity planning follows a five-step framework. Each step feeds into the next. Let us walk through it for the URL shortener.

1. Estimate Users & Requests

100M total users. Assume 10% are daily active = 10M DAU. Each user creates 1 short URL/day and reads 10/day. That is 10M writes/day and 100M reads/day. The read:write ratio is 10:1.

↓

2. Compute QPS

10M writes/day = 10,000,000 / 86,400 ≈ 116 writes/sec. 100M reads/day ≈ 1,157 reads/sec. Peak is typically 2-3x average: ~350 write QPS, ~3,500 read QPS at peak.

↓

3. Estimate Storage

Each URL mapping: 7-char short code (7 bytes) + long URL (avg 200 bytes) + metadata (50 bytes) ≈ 257 bytes. 10M new URLs/day × 365 days × 5 years = 18.25B URLs. Total: 18.25B × 257 bytes ≈ 4.7 TB.

↓

4. Estimate Bandwidth

Read bandwidth: 3,500 reads/sec × 257 bytes ≈ 900 KB/s (trivial). Write bandwidth: 350 writes/sec × 257 bytes ≈ 90 KB/s. Network is not the bottleneck here.

↓

5. Estimate Compute

A modern server handles ~10K simple read QPS from memory or SSD cache. At 3,500 read QPS peak, a single server might handle it, but for redundancy we want 2-3 servers. Storage: 4.7 TB fits on a single large SSD array, but for durability we want replication — 3 replicas × 4.7 TB = 14.1 TB total.
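The five steps above can be sketched as a small calculator. The peak factor of 3 and the replication factor of 3 are the assumptions used in the walkthrough; the function and key names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def capacity_estimate(dau, writes_per_user, reads_per_user,
                      bytes_per_record, retention_years,
                      peak_factor=3, replicas=3):
    """Back-of-envelope QPS, storage, and bandwidth; every input is an estimate."""
    write_qps = dau * writes_per_user / SECONDS_PER_DAY
    read_qps = dau * reads_per_user / SECONDS_PER_DAY
    records = dau * writes_per_user * 365 * retention_years
    storage_bytes = records * bytes_per_record
    return {
        "avg_write_qps": write_qps,
        "avg_read_qps": read_qps,
        "peak_write_qps": write_qps * peak_factor,
        "peak_read_qps": read_qps * peak_factor,
        "storage_tb": storage_bytes / 1e12,
        "replicated_storage_tb": storage_bytes * replicas / 1e12,
        "peak_read_kb_per_s": read_qps * peak_factor * bytes_per_record / 1e3,
    }


# URL shortener: 10M DAU, 1 write and 10 reads per user per day, 257 bytes, 5 years.
est = capacity_estimate(dau=10_000_000, writes_per_user=1, reads_per_user=10,
                        bytes_per_record=257, retention_years=5)
for name, value in est.items():
    print(f"{name}: {value:,.1f}")
```

The results differ slightly from the rounded figures in the walkthrough (for example about 3,470 peak read QPS rather than 3,500), which is expected: back-of-envelope math rounds at every step.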

Numbers every engineer should know. These are Jeff Dean's "Latency Numbers Every Programmer Should Know," updated for modern hardware. L1 cache: 1ns. L2 cache: 4ns. RAM: 100ns. SSD random read: 16µs. SSD sequential read 1MB: 50µs. Network round trip (same datacenter): 500µs. HDD seek: 2ms. Network round trip (cross-continent): 150ms. The gap between RAM and an SSD random read is about 160x, and between RAM and an HDD seek about 20,000x. The gap between "same datacenter" and "cross-continent" is 300x.

The Numbers You Need to Memorize

Quantity	Value	Why it matters
Seconds in a day	86,400 ≈ 10⁵	Convert daily counts to QPS
Seconds in a month	2.6M ≈ 2.5 × 10⁶	Monthly budgets
1 million requests/day	≈ 12 QPS	Quick conversion
1 KB × 1 billion	= 1 TB	Storage estimation
Peak : average ratio	2-3x (typical), 10x (bursty)	Headroom planning
80/20 rule	20% of data serves 80% of reads	Cache sizing (cache the hot 20%)

Capacity Planner

Input your system parameters and see the derived capacity requirements in real time. Adjust each slider to model different scales.

Controls (defaults): DAU = 10M; Requests/user/day = 20; Bytes per request = 500 B; Retention = 5 years

Interview check: A social media app has 50M DAU. Each user posts 2 items/day (avg 1 KB each) and views 100 items/day. Estimate the peak read QPS and the total storage after 3 years.

Peak read QPS: 58K, Storage: 109.5 TB. Reads/day = 50M × 100 = 5B. 5B/86400 ≈ 57,870 QPS. Peak ≈ 2x ≈ 116K. Wait, that's too high.

Reads: 50M × 100 / 86400 ≈ 57,870 avg QPS, peak ~170K QPS. Writes: 50M × 2 / 86400 ≈ 1,157 avg QPS. Storage: 50M × 2 × 1KB × 365 × 3 = 109.5 TB (before replication). With 3x replication: ~328.5 TB.

About 10K QPS and 10 TB storage.

Chapter 4: Load Testing

You have estimated your capacity requirements. You have built your system. Now the question: does it actually handle the load? Hope is not a strategy. You need to prove it with load testing.

Four Types of Load Tests

Type	What it tests	How it works	What it catches
Load test	Expected traffic	Ramp to your estimated peak QPS and sustain for 10-30 min	Basic bottlenecks: slow queries, underpowered instances
Stress test	Breaking point	Increase QPS until the system fails. Find the knee point.	The exact QPS where latency degrades, the failure mode (OOM, connection pool, CPU)
Soak test	Long-running stability	Sustain moderate load for 4-24 hours	Memory leaks, connection pool exhaustion, disk fill-up, log rotation failure
Spike test	Sudden burst	Jump from idle to 10x peak in seconds	Auto-scaling lag, cold start latency, queue overflow

The Load-Latency Curve

The most important graph in load testing is the load-latency curve. Plot requests per second on the X axis and response time on the Y axis. At low load, latency is flat: the system responds in its baseline time. As load increases, latency stays roughly flat until you hit the knee point, the QPS where some resource (CPU, memory, connections, disk I/O) becomes saturated. Beyond the knee, latency climbs steeply. In the simple queueing model below the growth is hyperbolic rather than exponential: latency heads toward infinity as load approaches capacity.

// Simplified model: latency as a function of load (M/M/1 queue)
L(q) = 1 / (μ - q) where μ = max service rate, q = current QPS

// At q = 0: L = 1/μ (baseline latency)
// At q = 0.5μ: L = 2/μ (double baseline)
// At q = 0.9μ: L = 10/μ (10x baseline!)
// At q = 0.99μ: L = 100/μ (100x baseline)
// At q = μ: L = ∞ (queue grows unbounded)

This is why you never run a system at 90% capacity. At 90% utilization, latency is 10x baseline. At 70%, latency is only 3.3x baseline. The standard rule of thumb: keep peak utilization below 70% for latency-sensitive services.
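Dividing L(q) = 1/(μ - q) by the baseline 1/μ gives a multiplier of 1/(1 - ρ), where ρ = q/μ is the utilization. A minimal sketch of that multiplier follows; it covers only the idealized M/M/1 model, and the function name is illustrative.

```python
def latency_multiplier(utilization: float) -> float:
    """Latency relative to baseline in the M/M/1 model: 1 / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1 / (1 - utilization)


for rho in (0.0, 0.5, 0.7, 0.9, 0.99):
    print(f"{rho:.0%} utilization -> {latency_multiplier(rho):.1f}x baseline")
```

Real systems have more than one queue and rarely match M/M/1 exactly, so treat the multiplier as a shape to expect and use a load test to find the actual knee.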

The knee is non-negotiable. Every system has a knee point. You cannot engineer it away — you can only move it to a higher QPS by adding capacity. Your job in load testing is to find the knee, then ensure your expected peak load is well below it (typically 50-70% of the knee QPS).

Load-Latency Curve Simulator

Drag the "Current Load" slider to move along the curve. Watch latency explode as you approach the system's maximum capacity. The knee point is marked. The "Max Capacity" slider lets you simulate adding more servers.

Controls (defaults): Current Load = 300 QPS; Max Capacity = 1,000 QPS

Load Testing Tools

Tool	Language	Strengths	Coordinated omission fix?
k6	Go (JS scripts)	Modern, CI-friendly, cloud-native	Yes (with arrival-rate executors)
Locust	Python	Easy scripting, distributed mode	No (but configurable)
Gatling	Scala	Enterprise-grade, detailed reports	Yes
wrk2	C	Constant-rate, fixes coordinated omission	Yes (by design)
vegeta	Go	Constant-rate attack, simple CLI	Yes

javascript (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // ramp to 100 VUs
    { duration: '5m', target: 100 },   // hold at 100 VUs
    { duration: '2m', target: 500 },   // stress test: ramp to 500
    { duration: '5m', target: 500 },   // hold at stress level
    { duration: '2m', target: 0 },     // cool down
  ],
  thresholds: {
    http_req_duration: ['p(99)<200'],  // p99 < 200ms
    http_req_failed: ['rate<0.01'],    // <1% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/checkout');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);  // 1 second between requests per VU
}

Interview check: Your load test shows the system handles 2,000 QPS with p99 = 50ms. At 2,500 QPS, p99 jumps to 800ms. At 3,000 QPS, requests start timing out. What is the knee point, and what utilization should you target for production?

The knee is around 2,000-2,500 QPS (where latency begins spiking). For production, target 50-70% of the knee, roughly 1,200 to 1,500 QPS peak, to maintain low latency with headroom for traffic spikes.

The knee is 3,000 QPS because that's where timeouts start.

You should run at 2,500 QPS because p99 of 800ms is acceptable.

Chapter 5: Availability Math

A user request to your checkout service travels through a load balancer, an application server, a database, and a cache. Each component has a probability of being available. What is the probability that the entire chain works? And what happens when you add redundancy?

Series Composition (All Must Work)

When components are in series — meaning all of them must work for the request to succeed — you multiply their availabilities:

A_total = A₁ × A₂ × A₃ × ... × A_n

Let us work through a real example. A request path has four components:

Component	Individual Availability
Load Balancer	99.99%
App Server	99.9%
Database	99.9%
Cache	99.95%

A_total = 0.9999 × 0.999 × 0.999 × 0.9995
= 0.99740...
≈ 99.74% only two nines! 22.8 hours of downtime/year

Even though every individual component has three or four nines, the chain only achieves two nines. This is the brutal arithmetic of series composition: the chain is weaker than its weakest link.

Parallel Composition (Redundancy)

When components are in parallel, meaning the request succeeds if at least one works, you compute the probability that all of them fail (assuming failures are independent), then subtract from 1:

A_parallel = 1 - (1 - A₁)(1 - A₂)...(1 - A_n)

// Example: 2 app servers, each 99.9%
A_parallel = 1 - (1 - 0.999)²
           = 1 - (0.001)²
           = 1 - 0.000001
           = 99.9999%    six nines! Just by adding one server

The improvement is dramatic. A single 99.9% server gives you 8.76 hours of downtime per year. Two in parallel give you 31.5 seconds. The math is the same for any redundancy count:

Replicas	Individual: 99.9%	Combined	Downtime/year
1	99.9%	99.9%	8.76 hours
2	99.9%	99.9999%	31.5 seconds
3	99.9%	99.9999999%	0.03 seconds

Redundancy is the only way to exceed your weakest component. You cannot get four nines from a three-nine component without redundancy. Adding a second instance in parallel is the most cost-effective reliability improvement in all of systems engineering. It is also why every production database has at least one replica.

Combining Series and Parallel

Real architectures combine both. You have components in series (LB → App → DB), and some of those components are internally redundant (2 app servers in parallel, 3 DB replicas). Compute bottom-up: first resolve each parallel group to a single availability number, then multiply the series chain.

// Example: LB(99.99%) → 2x App(99.9%) → 3x DB(99.9%) → Cache(99.95%)

A_app = 1 - (1 - 0.999)² = 99.9999%
A_db = 1 - (1 - 0.999)³ = 99.9999999%

A_total = 0.9999 × 0.999999 × 0.999999999 × 0.9995
≈ 99.94% much better than the 99.74% without redundancy!

Availability Calculator — The Showcase

Build a system by adjusting component availabilities and replica counts. See total system availability update in real time. Each component is in series; within each component, replicas provide parallel redundancy.

Controls (defaults): LB availability = 99.990%; App server availability = 99.90%, 1 replica; DB availability = 99.90%, 1 replica; Cache availability = 99.95%, 1 replica

Interview check: Your system has three components in series, each at 99.9% availability. You have budget to add ONE redundant instance to ONE component. Which component do you make redundant for the greatest improvement?

The first component in the chain, because it sees the most traffic.

The database, because it's the hardest to restart.

It doesn't matter: all three are at 99.9%, so making any one of them redundant (99.9999%) improves the chain equally. Before: 0.999^3 = 99.7%. After making any one redundant: 0.999999 × 0.999 × 0.999 = 99.8%. The gain is identical regardless of which one you pick.

Chapter 6: Monitoring & Alerting

You have defined your SLOs. You have capacity-planned. You have load-tested. Your system is running in production. How do you know when something goes wrong before your users tell you?

The Four Golden Signals

Google's SRE book distills all monitoring into four signals. If you measure nothing else, measure these:

Signal	What it measures	What to watch for	Example alert threshold
Latency	Time to serve a request	Distinguish successful vs failed latency. A fast 500 error is not "low latency."	p99 > 500ms for 5 min
Traffic	Demand on the system	QPS, active sessions, or messages/sec. Both spikes and drops matter.	QPS drops 50% in 5 min (might indicate upstream failure)
Errors	Rate of failed requests	HTTP 5xx, gRPC errors, or application-level failures. Include both explicit (500s) and implicit (200 with wrong data).	Error rate > 1% for 5 min
Saturation	How full the system is	CPU %, memory %, disk I/O utilization, connection pool usage. The signal that predicts future problems.	CPU > 80% for 10 min

Three monitoring methodologies compared. Google's Four Golden Signals (latency, traffic, errors, saturation) are the most widely used. The RED method (Rate, Errors, Duration) is simpler — optimized for request-driven services. The USE method (Utilization, Saturation, Errors) is optimized for infrastructure resources (CPU, disk, network). Use Golden Signals for application monitoring, USE for infrastructure, RED when you want simplicity.

Symptom-Based vs Cause-Based Alerting

There are two philosophies for alerting. Cause-based alerting fires when an internal metric crosses a threshold: "CPU is at 90%," "disk is 85% full," "memory usage is 12 GB." The problem: many cause-based alerts never affect users. CPU spikes to 90% for 30 seconds during a batch job, then drops back. If you page an engineer for every CPU spike, you get alert fatigue — they start ignoring pages, and when a real outage happens, they are too desensitized to react.

Symptom-based alerting fires when the user-visible effect crosses a threshold: "error rate is above 1%," "p99 latency exceeded 500ms for 5 minutes," "availability dropped below 99.9% this month." These alerts are actionable — they mean something is actually broken for users.

The rule: page on symptoms, log causes. Your pager should fire only when users are affected (symptoms). Cause-based metrics (CPU, memory, disk) go to dashboards for investigation after the page fires. This keeps paging volume low and signal quality high. An engineer who gets 3 real pages per week will respond instantly. An engineer who gets 30 false alarms per week will ignore the real one.
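A symptom-based rule such as "error rate > 1% for 5 minutes" needs a small amount of state: it should ignore brief blips and fire only on sustained breaches. Here is a minimal sketch, assuming one sample per minute; the class name and the sample data are hypothetical, and real alerting systems express this as configuration rather than application code.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires when a metric stays above a threshold for N consecutive samples."""

    def __init__(self, threshold: float, consecutive: int):
        self.threshold = threshold
        # Remember only the last `consecutive` breach flags.
        self.recent = deque(maxlen=consecutive)

    def observe(self, value: float) -> bool:
        """Record one sample (e.g. a per-minute error rate); return True to page."""
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical per-minute error rates: a one-minute blip, then a sustained incident.
error_rates = [0.002, 0.03, 0.001, 0.02, 0.02, 0.03, 0.02, 0.05]
alert = SustainedThresholdAlert(threshold=0.01, consecutive=5)
pages = [alert.observe(rate) for rate in error_rates]
print(pages)
```

The blip in minute two never pages anyone. The alert fires only on the final sample, once the error rate has been above 1% for five minutes in a row.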

The Monitoring Dashboard

A good dashboard shows the four golden signals in real time, with historical context. Below is an interactive mock — inject failures and watch how the signals respond.

Golden Signals Dashboard

Click the failure buttons to inject problems. Watch all four signals respond. This is what a real production dashboard looks like during an incident.

Runbooks

Every alert should link to a runbook — a document that tells the on-call engineer exactly what to do. A runbook has three sections:

1. What is this alert?

One sentence: "This fires when p99 latency exceeds 500ms for 5 consecutive minutes, indicating checkout is slow for at least 1% of users."

↓

2. How to investigate

Step-by-step: (a) Check the dashboard for the specific service. (b) Look at the saturation panel — is CPU/memory/connection pool full? (c) Check recent deploys — did something change in the last hour? (d) Check downstream dependencies — is the database slow?

↓

3. How to mitigate

Concrete actions: "If caused by recent deploy: roll back via kubectl rollout undo. If caused by traffic spike: scale horizontally via kubectl scale --replicas=10. If caused by DB: failover to replica via ..."

Interview check: An on-call engineer gets paged 40 times per week, but only 5 of those pages correspond to actual user-facing issues. What is the core problem, and how would you fix it?

The engineer needs to be more responsive to the other 35 pages.

Alert fatigue from cause-based alerting. Fix: convert alerts to symptom-based (page only when users are affected), raise thresholds on cause-based alerts, and move noisy low-signal alerts to dashboard-only (no page). Target: fewer than 10 actionable pages per week.

Hire more on-call engineers to distribute the load.

Chapter 7: Interview Arsenal

This chapter is your cheat sheet. Every formula, every framework, every pattern from this lesson, organized for fast recall in an interview setting.

Quick Reference: Formulas

Formula	What it computes	Example
QPS = DAU × req/user / 86400	Average queries per second	10M × 20 / 86400 ≈ 2,315
Peak QPS ≈ 2-3x avg QPS	Peak capacity to plan for	2,315 × 3 ≈ 6,945
Storage = users × data/user/day × days	Total storage needed	10M × 1KB × 1825d = 18.25 TB
A_series = A₁ × A₂ × ... × A_n	Availability of chain	0.999³ = 99.7%
A_parallel = 1 - (1-A)ⁿ	Availability with n replicas	1 - (0.001)² = 99.9999%
P(tail) = 1 - (1-p)ⁿ	Prob of hitting tail in n fan-out calls	1 - 0.99⁵ = 4.9%
Error budget = (1-SLO) × window	Allowed downtime	0.001 × 43200min = 43.2 min/mo
L(q) = 1/(μ - q)	Latency vs load (M/M/1)	At 90% util: 10x baseline latency

System Design: Nonfunctional Requirements Checklist

In every system design interview, cover these before drawing boxes:

1. Users & Traffic

How many DAU? Read:write ratio? Peak:average ratio? Geographic distribution?

↓

2. Latency Requirements

What is the acceptable p50? p99? For which operations? (Reads vs writes often have different targets.)

↓

3. Availability & Durability

What uptime is required? How many nines? Is data loss acceptable? (Payments: no. Social media likes: maybe.)

↓

4. Storage & Bandwidth

How much data per request? How long is data retained? What is the storage growth rate?

↓

5. Cost Constraints

What is the budget? Is this a startup (optimize for cost) or Big Tech (optimize for reliability)?

Interview Scenarios

Scenario 1: "Define the SLOs for a payment processing system."

Staff answer. Availability: 99.99% (payments are money — downtime = lost revenue + chargebacks). Latency: p50 < 100ms, p99 < 500ms (users abandon checkout after 3 seconds). Error rate: < 0.01% for payment failures (each failure is a lost sale or worse, a double charge). Durability: 100% (never lose a transaction record — this is a legal requirement). Idempotency: every payment must be idempotent (retries must not double-charge). These SLOs drive the architecture: you need synchronous replication for the transaction log, at least 3 replicas across 2 AZs, and circuit breakers on every downstream call.

Scenario 2: "Capacity plan for a social media app with 50M DAU."

Staff answer. Reads: 50M × 100 views/day = 5B reads/day = 58K avg QPS, ~175K peak. Writes: 50M × 2 posts/day = 100M writes/day = 1.16K avg QPS, ~3.5K peak. Storage: 100M posts/day × 1KB × 365 × 3 years = 109.5 TB. With replication (3x) = 328.5 TB. Cache: hot 20% of data = ~22 TB in Redis (need a cluster). Network: 175K QPS × 5KB avg response = 875 MB/s = ~7 Gbps (need multiple load balancers). Servers: at 10K QPS per server for cache hits, need ~18 app servers for peak. Total estimated cost: $50-80K/month on cloud.

Scenario 3: "p99 latency doubled after a deploy."

Staff answer. (1) Check the diff — what changed? A new code path, a new DB query, a new dependency call? (2) Profile: is it CPU (new computation), memory (GC pressure from new allocations), I/O (new DB query), or network (new external call)? (3) Check if it affects all endpoints or just the changed one. (4) Check if the database query plan changed (new index? missing index?). (5) Measure: is the regression in the service itself or in a downstream dependency? (6) Quick mitigation: roll back the deploy while investigating. (7) Root cause: likely a missing database index on a new query path, or an N+1 query that was not caught in code review.

Coding Drills

Drill 1: Sliding Window Percentile Tracker

python
import bisect
from collections import deque

class PercentileTracker:
    """Track percentiles over a sliding window of N samples."""

    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.window = deque()       # insertion order
        self.sorted_vals = []       # sorted for percentile lookup

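    def add(self, value):
        # Evict the oldest sample if the window is full.
        if len(self.window) >= self.window_size:
            old = self.window.popleft()
            # bisect_left finds the first occurrence of `old`, which is
            # guaranteed to exist because every window entry is in sorted_vals.
            idx = bisect.bisect_left(self.sorted_vals, old)
            self.sorted_vals.pop(idx)
        # Add the new value. The binary search is O(log n), but the list
        # insert (and the pop above) shift elements, so add() is O(n) overall.
        self.window.append(value)
        bisect.insort(self.sorted_vals, value)

    def percentile(self, p):
        """Return the p-th percentile (0-100), or 0 if there are no samples."""
        if not self.sorted_vals:
            return 0
        # Index into the sorted window; clamp so that p=100 returns the maximum.
        idx = int(len(self.sorted_vals) * p / 100)
        idx = min(idx, len(self.sorted_vals) - 1)
        return self.sorted_vals[idx]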

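# Usage (hypothetical sample data stands in for a live request stream):
tracker = PercentileTracker(window_size=1000)
incoming_requests = [120, 95, 180, 240, 110]   # latencies in ms
for latency in incoming_requests:
    tracker.add(latency)
    if tracker.percentile(99) > 200:
        print("ALERT: p99 latency exceeded 200ms")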

Drill 2: Availability Calculator

python
def parallel_availability(single_avail, replicas):
    """Availability of N identical replicas in parallel.

    Assumes replica failures are independent and that one healthy
    replica is enough to serve traffic.
    """
    return 1 - (1 - single_avail) ** replicas

def series_availability(*components):
    """Availability of components in series (all must work)."""
    result = 1.0
    for a in components:
        result *= a
    return result

def system_availability(components):
    """
    components: list of (availability, replicas) tuples in series.
    Each tuple is an independent stage; replicas add parallel redundancy.
    """
    stage_avails = []
    for avail, replicas in components:
        stage_avails.append(parallel_availability(avail, replicas))
    return series_availability(*stage_avails)

# Example: LB(99.99%, 1x) → App(99.9%, 2x) → DB(99.9%, 3x)
total = system_availability([
    (0.9999, 1),   # load balancer
    (0.999,  2),   # 2 app servers
    (0.999,  3),   # 3 DB replicas
])
print(f"System availability: {total*100:.6f}%")
# Output: System availability: 99.989900%
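
A percentage with several nines is hard to reason about, so it is often more useful to convert it into minutes of downtime. The sketch below does that for the three stages in the example above; the helper name `downtime_minutes_per_year` is illustrative, and failures are again assumed to be independent.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored

def downtime_minutes_per_year(availability):
    """Expected downtime per year implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

# Stage availabilities from the example above (independent failures assumed).
stages = {
    "load balancer (1x)": 0.9999,
    "app servers (2x)": 1 - (1 - 0.999) ** 2,
    "DB replicas (3x)": 1 - (1 - 0.999) ** 3,
}

system = 1.0
for name, avail in stages.items():
    system *= avail
    print(f"{name:>20}: {downtime_minutes_per_year(avail):8.3f} min/year")
print(f"{'whole system':>20}: {downtime_minutes_per_year(system):8.3f} min/year")
```

The whole system comes out at roughly 53 minutes of downtime per year, and about 52.6 of those minutes come from the single load balancer. Adding a fourth DB replica would change almost nothing; adding a second load balancer would remove nearly all of the budget.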

Interview check: An interviewer asks: "How would you define SLOs for a real-time multiplayer game server?" Which of these is the most staff-level answer?

"p99 latency under 100ms and 99.9% availability."

"I'd use the same SLOs as a web service: 99.95% availability and p99 < 200ms."

"Game servers have different SLI categories. Tick rate SLO: p99 server tick < 16ms (60 Hz) for the authoritative simulation. Matchmaking latency: p90 < 30s. Session availability: 99.95% (a mid-game crash is far worse than a lobby failure, so weight active sessions more heavily). Network: p99 round-trip < 80ms within a region. Measure jitter, not just latency: a steady 60ms is better than alternating 20ms and 100ms. Error budget: spend it on matchmaking downtime, never on active-session crashes."

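The jitter point in the game-server answer is easy to check with numbers. The sketch below uses one simple definition of jitter, the mean absolute difference between consecutive samples; this definition is an assumption made for illustration, and real network stacks often use smoothed estimators instead.

```python
from statistics import mean

def jitter(samples):
    """Mean absolute difference between consecutive latency samples (ms)."""
    if len(samples) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(samples, samples[1:]))

steady = [60] * 10            # a steady 60 ms round trip
alternating = [20, 100] * 5   # alternating 20 ms and 100 ms

for name, series in [("steady", steady), ("alternating", alternating)]:
    print(f"{name:>12}: mean={mean(series):.1f} ms  jitter={jitter(series):.1f} ms")
```

Both series have the same 60 ms mean, so an SLO on average latency cannot tell them apart. The jitter is 0 ms for the steady series and 80 ms for the alternating one, which is the difference a player actually feels.
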
Chapter 8: Connections

Nonfunctional requirements are not an isolated topic — they thread through every chapter of system design. Here is how the concepts in this lesson connect to the rest of Designing Data-Intensive Applications.

What We Covered vs What Comes Next

This lesson	Where it connects	Why it matters
Availability math	Ch 6: Replication	Replication is how you achieve parallel availability. Leader-follower, multi-leader, and leaderless each have different availability profiles.
Capacity planning (QPS, storage)	Ch 7: Sharding	When one machine cannot handle the load, you shard. Capacity planning tells you when to shard and how many shards you need.
Latency percentiles	Ch 4: Storage & Retrieval	B-trees and LSM-trees have fundamentally different latency distributions. B-trees give more predictable read latency; LSM-trees can show latency spikes at high percentiles when compaction competes with foreground reads and writes.
SLAs and error budgets	Ch 8: Transactions	Transactions trade latency for correctness guarantees. Your SLO determines whether you can afford the latency cost of serializable isolation.
Load testing	Ch 9: Distributed Trouble	The failures you inject in load tests (slow DB, memory leak, traffic spike) are the same partial failures that haunt distributed systems.
Monitoring golden signals	Ch 10: Consistency & Consensus	Consensus protocols like Raft have specific latency and availability trade-offs. You need monitoring to verify your consensus layer meets its SLOs.

Limitations of This Lesson

This lesson teaches you to quantify requirements. It does not teach you to implement them. Knowing that you need 99.99% availability and the math to compute it is step one. Actually building a system that achieves it — through replication, failover, load balancing, circuit breakers, and graceful degradation — is the rest of this book.

We also did not cover security requirements (encryption, authentication, authorization), compliance requirements (GDPR, HIPAA, SOC2), or maintainability requirements (code complexity, deployment frequency, mean time to recovery). These are equally important nonfunctional requirements, but they deserve their own lessons.

Resource	What it covers
Google SRE Book, Ch 4: Service Level Objectives	The definitive guide to SLIs, SLOs, and error budgets from the team that invented the framework.
DDIA Chapter 1: Reliability, Scalability, Maintainability	Kleppmann's original treatment of these concepts, with worked examples from real systems.
Jeff Dean's "Numbers Every Engineer Should Know"	The latency numbers table that powers all back-of-envelope estimation.
Gil Tene's "How NOT to Measure Latency"	The definitive talk on coordinated omission and why most load testing tools lie to you.