Distributed Systems

Observability & Operations

Metrics, SLOs, alerting, traces, dashboards — seeing inside your system before users tell you it's broken.

Prerequisites: Client-server model + Basic statistics (mean, percentile). That's it.
10 Chapters · 9+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Observability

It is 2 AM. Your phone buzzes. "CRITICAL: Payment service error rate above 5%." You open your laptop. The payment service is returning 500 errors on 8% of requests. Why?

Is it the payment service itself? The database it calls? The network between them? A bad deploy that went out 30 minutes ago? A traffic spike from a marketing campaign? A dependency that rate-limited you? Without observability, you are flying blind — guessing, grepping through log files, restarting services and hoping.

With observability, you open a dashboard. You see the error rate spiked at 1:47 AM. You click on a failing request. A trace shows the request spent 29 seconds waiting for a database query. You look at database metrics: CPU is at 98% because an unindexed query is doing a full table scan. You find the deploy that introduced the query. You roll it back. Error rate drops to 0.1% within 2 minutes. Total incident time: 11 minutes.

The difference between monitoring and observability. Monitoring tells you WHEN something is broken (alerts fire). Observability tells you WHY it is broken. A monitored system has dashboards and alerts. An observable system has metrics, logs, and traces that let you diagnose any problem without deploying new code or adding new instrumentation. If you have to add a log line to debug a production issue, your system is not observable enough.

The Three Pillars

Observability has three data types, often called the three pillars:

| Pillar | What it is | What it answers | Example |
|---|---|---|---|
| Metrics | Numeric measurements over time | "What is happening?" (rates, gauges, distributions) | Request rate: 1200 RPS, p99 latency: 230ms |
| Logs | Timestamped text records of events | "What exactly happened?" (the narrative) | "2024-01-15 01:47:23 ERROR: query timeout after 30s, query=SELECT..." |
| Traces | End-to-end path of a request through services | "Where did the time go?" (the journey) | Frontend → Gateway (2ms) → OrderService (5ms) → DB (29s) !!! |

The simulation below shows the difference between flying blind and having observability. A service starts failing. Without observability, you see "errors." With observability, you see exactly where and why.

[Interactive simulation: Blind vs. Observable. An incident occurs; compare diagnosis time with and without observability.]
The structure of this lesson. We will cover: what to measure (four golden signals, RED, USE), what targets to set (SLIs/SLOs/SLAs), when to alert (alerting philosophy), what to log (structured logging), how to trace (distributed tracing), and a live dashboard simulation that ties it all together.
Quiz: Your payment service returns 500 errors on 8% of requests. You have metrics dashboards but no distributed tracing. What can you determine?

Chapter 1: The Four Golden Signals

If you could only measure four things about your service, what would they be? The Site Reliability Engineering discipline converges on the same four:

1. Latency

How long does it take to serve a request? Specifically, the distribution of latencies, not just the average. A service with 50ms average latency might have a p99 of 2 seconds — meaning 1% of users wait 40x longer than average.

Always measure percentiles, not averages. An average of 50ms could mean everyone gets 50ms (great) or 99% get 10ms and 1% get 4 seconds (terrible). The p50, p90, p95, and p99 tell the real story. Alert on p99, not average.

Why Averages Lie: A Worked Example

// Two services with the same average latency:

// Service A: 100 requests
All requests: 50ms
Average: 50ms, p50: 50ms, p99: 50ms
// Uniform. Every user gets the same experience.

// Service B: 100 requests
99 requests: 10ms, 1 request: 4050ms
Average: (99 × 10 + 1 × 4050) / 100 = 50.4ms
p50: 10ms, p99: 4050ms
// Same average! But 1% of users wait 4 seconds.

// If a user makes 20 requests per session:
P(hitting the slow 1% at least once) = 1 - (0.99)^20 ≈ 18%
// Nearly 1 in 5 users will experience a 4-second wait.
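The arithmetic in this example can be checked with a short script. This is a minimal sketch: the two latency lists mirror Service A and Service B above, requests are assumed to be independent, and `p_hit_tail` is a hypothetical helper name rather than part of any library.

```python
from statistics import mean, median

# Latencies (ms) for the two services in the worked example.
service_a = [50] * 100
service_b = [10] * 99 + [4050]

def p_hit_tail(tail_fraction: float, requests_per_session: int) -> float:
    """Probability that a session sees at least one tail-latency request,
    assuming each request lands in the tail independently."""
    return 1 - (1 - tail_fraction) ** requests_per_session

print("A: mean", mean(service_a), "median", median(service_a))
print("B: mean", mean(service_b), "median", median(service_b))
print(f"P(slow request in a 20-request session) = {p_hit_tail(0.01, 20):.1%}")
```

Both services report nearly the same mean, while the median and the per-session probability show how different the user experience is.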

This is why p99 is the most important latency metric. It represents the experience of your worst-affected users, who are often your highest-value users (they use the product frequently, so they hit the tail more often).

Histograms vs. Summaries

There are two ways to store latency data in a metrics system:

| Type | How it works | Pros | Cons |
|---|---|---|---|
| Histogram | Counts requests in predefined buckets: 0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+ | Aggregatable across servers. Cheap to store. | Bucket boundaries must be chosen in advance. Precision limited by bucket size. |
| Summary | Estimates percentiles inside each process as requests arrive, typically with a streaming quantile estimator over a sliding time window | Percentiles without predefined buckets. | Cannot be aggregated across servers (p99 of p99s is not p99 of all). Costs more memory and CPU in the application. |
Histograms are preferred for distributed systems. Because they can be aggregated, you can compute the p99 across all servers — not just per-server. Prometheus recommends histograms over summaries for this reason.
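To make the aggregation point concrete, here is a small sketch using the bucket layout from the table. The per-server counts are made up for illustration, and `quantile_upper_bound` is a simplified stand-in that returns the upper edge of the bucket holding the quantile; real metrics systems typically interpolate inside the bucket.

```python
# Bucket upper bounds in ms: 0-10, 10-50, 50-100, 100-500, 500+
BOUNDS_MS = [10, 50, 100, 500, float("inf")]

def merge(histograms):
    """Sum per-bucket counts across servers. This is what makes histograms aggregatable."""
    return [sum(column) for column in zip(*histograms)]

def quantile_upper_bound(counts, q):
    """Upper bound of the bucket that contains the q-th quantile."""
    target = q * sum(counts)
    seen = 0
    for bound, count in zip(BOUNDS_MS, counts):
        seen += count
        if seen >= target:
            return bound
    return BOUNDS_MS[-1]

server_1 = [900, 80, 15, 5, 0]   # hypothetical fast server, 1000 requests
server_2 = [10, 20, 30, 40, 0]   # hypothetical slow server, 100 requests

fleet = merge([server_1, server_2])
print("server 1 p99 bucket:", quantile_upper_bound(server_1, 0.99))
print("server 2 p99 bucket:", quantile_upper_bound(server_2, 0.99))
print("fleet p99 bucket:   ", quantile_upper_bound(fleet, 0.99))
```

Averaging the two per-server results would weight the 100-request server the same as the 1000-request one; merging the buckets first weights every request equally.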

2. Traffic

How much demand is being placed on the system? For a web service: HTTP requests per second. For a database: queries per second. For a message queue: messages per second. Traffic tells you the current load level and helps you predict when you will need more capacity.

3. Errors

What fraction of requests fail? This includes explicit errors (HTTP 500s), implicit errors (HTTP 200 with wrong content), and policy violations (responses slower than a threshold). The error rate (errors per second or errors as a percentage of total requests) is more useful than the error count.

4. Saturation

How "full" is the system? Saturation measures how close you are to capacity: CPU utilization, memory usage, disk I/O, queue depth, connection pool usage. As saturation approaches 100%, performance degrades nonlinearly: a system at 90% CPU is not just 10% slower than one at 80%. Requests start to queue, so latency climbs far faster than utilization does.
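One common way to see the nonlinearity is the M/M/1 queueing model. That model is an assumption introduced here for illustration (single server, random arrivals and service times), not something the lesson specifies. Under it, mean response time is the service time divided by (1 − utilization).

```python
def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """Mean response time for an M/M/1 queue (illustrative model only)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)

# A request that needs 10 ms of actual work, at increasing utilization.
for u in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{u:.0%} busy -> {mm1_response_time(10, u):.0f} ms mean response time")
```

Under this model, going from 80% to 90% utilization doubles mean response time, and the last few percent before 100% are far worse.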

The simulation below shows all four golden signals for a live service. Traffic arrives at a steady rate, then spikes. Watch how the signals interact.

[Interactive simulation: Four Golden Signals, Live. Four real-time charts. Inject a traffic spike or slow dependency to see all signals react.]
Quiz: Your service's average latency is 50ms but the p99 is 3 seconds. Users complain about slowness. A coworker says "the average is fine." What is wrong with this reasoning?

Chapter 2: RED & USE Methods

The four golden signals tell you what to measure for any service. But different types of components have different natural metrics. The RED method and USE method give you targeted frameworks.

RED Method (for services)

For any request-driven service (API, web server, microservice), measure three things:

R (Rate): Requests per second. How busy is the service?
E (Errors): Errors per second (or error rate %). Is the service broken?
D (Duration): Latency distribution (p50, p99). Is the service slow?

RED is essentially the four golden signals minus saturation, focused on the user-facing experience. If you can only instrument three things per service, instrument RED.

USE Method (for resources)

For any resource (CPU, memory, disk, network, connection pool), measure three things:

U (Utilization): Percentage of time the resource is busy. CPU at 85%, disk at 60%.
S (Saturation): Amount of work queued that cannot be served. Queue depth, wait time.
E (Errors): Count of error events. ECC memory corrections, network packet drops.

RED for services, USE for resources. When debugging a slow API, start with RED metrics on the service. If error rate is high, check logs. If latency is high, check USE metrics on the underlying resources (CPU, DB connections, disk I/O) to find the bottleneck.

When to Use RED vs. USE

The debugging workflow typically starts with RED (user-facing symptoms) and drills into USE (resource causes):

1. RED detects the problem
"Rate normal, errors at 3%, duration p99 at 2s." Something is slow and failing.
↓ which resource is the bottleneck?
2. USE finds the cause
"CPU 40%, Memory 60%, Disk I/O 98% utilized." The disk is saturated.
↓ why is disk saturated?
3. Logs explain the details
"Full table scan on 50M row table. Missing index on query introduced at 01:45."

RED tells you THAT something is wrong. USE tells you WHERE the bottleneck is. Logs tell you WHY.

Mapping RED and USE to Your System

// Example: e-commerce checkout service

// RED metrics (service-level):
Rate: checkout_requests_total (counter)
Errors: checkout_errors_total (counter)
Duration: checkout_duration_seconds (histogram)

// USE metrics (resource-level):
CPU Utilization: node_cpu_seconds_total
CPU Saturation: node_load1 (1-minute load average)
Mem Utilization: node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
Mem Saturation: node_vmstat_pgmajfault (page faults)
Disk Utilization: node_disk_io_time_seconds_total
Disk Saturation: node_disk_io_time_weighted_seconds_total

The simulation shows RED and USE metrics side by side. Inject load and watch how service-level metrics (RED) correspond to resource-level metrics (USE).

[Interactive simulation: RED & USE Dashboard. Left: service metrics (RED). Right: resource metrics (USE). Inject load to see how they correlate.]
Quiz: Your API has Rate=500 RPS, Errors=0%, Duration p99=50ms. But your database CPU (USE) is at 95% utilization. What should you do?

Chapter 3: SLIs, SLOs, SLAs

Your service is "reliable." But how reliable? 99%? 99.9%? 99.99%? And reliable in what dimension — availability? Latency? Correctness? These terms are often used loosely. The SLI/SLO/SLA framework gives them precise meaning.

SLI: Service Level Indicator

An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided. It is a metric with a specific definition.

// Example SLIs:

Availability SLI = (successful requests / total requests) × 100%
// "What fraction of requests succeed?"

Latency SLI = fraction of requests completing within 300ms
// "What fraction of requests are fast enough?"

Correctness SLI = (correct responses / total responses) × 100%
// "What fraction of responses return the right data?"
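As a rough sketch, the availability and latency SLIs above can be computed from raw request records like this. The `Request` record and the rule that "successful" means any non-5xx status are assumptions made for the example; a real SLI definition should state its own success criteria.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request

def availability_sli(requests):
    """Fraction of requests that succeeded (assumed here: status below 500)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_sli(requests, threshold_ms=300):
    """Fraction of requests that completed within the threshold."""
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)

sample = [
    Request(200, 120),
    Request(200, 450),   # succeeded, but too slow
    Request(500, 30),    # fast, but failed
    Request(200, 80),
]
print(f"availability SLI: {availability_sli(sample):.1%}")
print(f"latency SLI:      {latency_sli(sample):.1%}")
```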

SLO: Service Level Objective

An SLO is a target value for an SLI. It is an internal goal that your team sets.

// Example SLOs:

Availability SLO: 99.9% of requests succeed over a 30-day window
// 0.1% failure allowed = 43.2 minutes of downtime per month

Latency SLO: 99% of requests complete within 300ms
// 1% of requests can be slower than 300ms

// SLO is NOT 100%. Never set an SLO of 100%.
// 100% means ZERO failures allowed. Ever. Impossible and counterproductive.

SLA: Service Level Agreement

An SLA is a contract with your customer that includes consequences for failing to meet it (refunds, credits, contract termination). SLAs are typically looser than SLOs because you want to breach your SLO (internal alert) before you breach your SLA (legal/financial consequences).

| Concept | Definition | Audience | Consequence of breach |
|---|---|---|---|
| SLI | The metric itself | Engineers | None (it's a measurement) |
| SLO | Target for the SLI | Engineering team | Error budget consumed, on-call pages |
| SLA | Contract with customer | Customers, legal | Refunds, credits, legal action |

The Nines

| Availability | Downtime/month (30 days) | Downtime/year | Example |
|---|---|---|---|
| 99% (two 9s) | 7.2 hours | 3.65 days | Internal tools |
| 99.9% (three 9s) | 43.2 minutes | 8.76 hours | Most SaaS products |
| 99.95% | 21.6 minutes | 4.38 hours | Cloud provider services |
| 99.99% (four 9s) | 4.32 minutes | 52.6 minutes | Payment systems, infrastructure |
| 99.999% (five 9s) | 26 seconds | 5.26 minutes | 911 systems, pacemakers |

Each additional 9 costs 10x more. Going from 99.9% to 99.99% does not require 10% more effort — it requires a fundamentally different architecture. Redundancy, automated failover, cross-region replication, formal verification. Most applications should target 99.9% or 99.95%, not more.
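The downtime figures for each availability level come from a single multiplication. A minimal sketch, assuming a 30-day month and a 365-day year; `allowed_downtime_minutes` is a name chosen for this example.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: float = 30) -> float:
    """Minutes of downtime permitted by an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_month = allowed_downtime_minutes(slo, 30)
    per_year = allowed_downtime_minutes(slo, 365)
    print(f"{slo}%: {per_month:.2f} min/month, {per_year / 60:.2f} hours/year")
```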

Choosing the Right SLO

How do you decide between 99.9% and 99.99%? It depends on your users and your dependencies:

// The dependency rule:
// Your SLO cannot be higher than your weakest dependency's SLO.

// If your database provides 99.95% availability,
// and your cloud provider provides 99.99% availability,
// your maximum achievable SLO is:

max_SLO = 99.95% × 99.99% = 99.94%

// Setting an SLO of 99.99% when your DB provides 99.95%
// is setting an impossible target. Your DB alone will violate it.

// With N dependencies, each at 99.9%:
combined = (0.999)^N
N=3: 99.7% combined availability
N=5: 99.5%
N=10: 99.0%
// More dependencies = lower combined reliability.
// This is why minimizing dependencies is a reliability strategy.
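The dependency rule is a product of availabilities. The sketch below assumes every dependency is a hard, serial dependency (the request fails if any one of them is down) and that failures are independent; `combined_availability` is an illustrative helper.

```python
from math import prod

def combined_availability(dependencies):
    """Best-case availability when every dependency must be up (independent failures)."""
    return prod(dependencies)

# Database at 99.95% and cloud provider at 99.99%.
print(f"{combined_availability([0.9995, 0.9999]):.2%}")

# N dependencies, each at 99.9%.
for n in (3, 5, 10):
    print(f"N={n}: {combined_availability([0.999] * n):.1%}")
```

Caches, fallbacks, and retries can break the "hard dependency" assumption, which is one way to push a service above the naive product.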

SLI Selection: What to Measure

Choosing the right SLI is critical. A bad SLI creates a false sense of security. Here are the most common SLIs by service type:

| Service Type | Recommended SLI | Why not something else? |
|---|---|---|
| Web API | % of requests returning 2xx within 300ms | Combines availability and latency in one metric. A slow success is a failure for the user. |
| Data pipeline | % of records processed within freshness target | Throughput alone is misleading: data can be processed but 3 hours late. |
| Storage system | % of reads returning correct data within 100ms | Durability (no data loss) + availability + latency. |
| Batch job | % of runs completing successfully within time budget | Start time is irrelevant; completion time matters. |

The simulation shows how SLO targets map to allowed downtime. Adjust the SLO slider and see how much failure budget you have.

[Interactive simulation: SLO Calculator. Set your SLO and see how much downtime you can afford per month and year.]
Quiz: Your SLO is 99.9% availability over a 30-day window. It is day 15 and you have had 35 minutes of downtime. How much error budget remains?

Chapter 4: Error Budgets

The SLO says 99.9% availability. That means 0.1% unavailability is allowed. This 0.1% is your error budget — the amount of unreliability you can tolerate before breaching your SLO.

Error budgets transform reliability from a vague aspiration into a concrete, spendable resource. You can spend your error budget on things that matter: faster feature releases, risky experiments, infrastructure migrations. As long as you stay within budget, you are meeting your reliability commitment.

How Error Budgets Work

// Monthly error budget at 99.9% SLO:
Total minutes in 30 days: 30 × 24 × 60 = 43,200 minutes
Error budget: 43,200 × 0.001 = 43.2 minutes of downtime

// Spending the budget:
Week 1: deploy caused 5 min outage → 38.2 min remaining
Week 2: DB failover caused 3 min blip → 35.2 min remaining
Week 3: deploy caused 20 min outage → 15.2 min remaining
Week 4: BUDGET LOW — freeze risky deploys → 15.2 min to last 7 days

Error Budget Policies

| Budget Level | Action |
|---|---|
| > 50% remaining | Full speed: ship features, run experiments, do migrations |
| 25-50% remaining | Caution: all deploys require canary, no risky experiments |
| < 25% remaining | Freeze: no deployments except bug fixes. Focus on reliability improvements. |
| 0% (budget exhausted) | Hard freeze: only reliability work until budget replenishes next cycle |

Error budgets align incentives. Without error budgets, product teams want to ship fast and ops teams want to minimize risk. They are in constant conflict. Error budgets give them a shared framework: "We have 30 minutes of budget left. Should we spend 10 minutes of risk on this deploy? Is the feature worth it?" The conversation shifts from "can we deploy?" to "should we spend budget?"

Error Budget-Based Decision Making

Error budgets change how organizations make engineering decisions. Here are concrete examples:

// Scenario: Should we do a major database migration?

SLO: 99.9% (43.2 min budget per month)
Current budget consumed: 10 min (23%)
Remaining: 33.2 min (77%)

Migration estimated risk: 15 min of potential downtime
Migration estimated probability of issue: 30%
Expected cost: 15 × 0.3 = 4.5 min expected

Decision: Proceed. Even worst case (15 min), we'd have 18.2 min remaining.
Expected case leaves 28.7 min. Well within budget.

// Same scenario at end of month with 8 min remaining:
Expected cost: 4.5 min. Worst case: 15 min > 8 min remaining.
Decision: Defer to next month when budget resets.
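The two scenarios can be expressed as a small function. The decision rule used here (proceed only if the worst case still fits in the remaining budget) is one reading of the example above, not a universal policy, and `migration_decision` is a hypothetical name.

```python
def migration_decision(remaining_min: float, worst_case_min: float, probability: float):
    """Return (expected_cost_min, decision) for a risky change."""
    expected_cost = worst_case_min * probability
    # Conservative rule: the worst case must fit in the remaining budget.
    decision = "proceed" if worst_case_min <= remaining_min else "defer"
    return expected_cost, decision

# Early in the month: 33.2 min of budget left.
print(migration_decision(remaining_min=33.2, worst_case_min=15, probability=0.3))

# End of the month: 8 min of budget left.
print(migration_decision(remaining_min=8, worst_case_min=15, probability=0.3))
```

A team with more risk appetite might compare the expected cost, rather than the worst case, against the remaining budget.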

What Consumes Error Budget

Not all budget consumption is equal. Some is expected (planned maintenance), some is unplanned but tolerable (transient errors), and some requires investigation.

| Source | Typical Cost | Action |
|---|---|---|
| Planned maintenance | 5-15 min | Pre-approved, scheduled during low traffic |
| Deploy rollout | 0-2 min | Normal if within canary detection window |
| Transient network blip | 0.5-2 min | Expected noise, no action needed |
| Bad deploy | 5-30 min | Post-mortem, improve canary/rollback speed |
| Dependency outage | Variable | Evaluate circuit breakers and fallbacks |
| Infrastructure failure | Variable | Evaluate fault domain spread, redundancy |

Burn Rate Alerting: The Math

Traditional alerting fires when a metric crosses a threshold. Burn rate alerting fires when you are consuming error budget too fast — even if the absolute error rate is below your threshold.

// Error budget burn rate:
// If you consumed 1 month of budget in 1 hour,
// your burn rate = 720x (30 days × 24 hours / 1 hour)

budget_total = 43.2 min (99.9% SLO over 30 days)
budget_consumed_last_hour = 2 min
burn_rate = (2 / 43.2) × (30 × 24) = 33.3x

// At 33x burn rate, budget exhausted in:
time_to_exhaust = 30 days / 33.3 = 21.6 hours

// Alert thresholds:
burn_rate > 14.4x (budget gone in 2 days) → PAGE (P1)
burn_rate > 6x (budget gone in 5 days) → TICKET (P2)
burn_rate > 3x (budget gone in 10 days) → REVIEW (P3)
burn_rate > 1x (on track to exhaust) → DASHBOARD (P4)
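The burn-rate formula and thresholds above translate directly into code. This is a sketch under the same assumptions as the example (99.9% SLO, 30-day window, 43.2-minute budget); the function names are illustrative.

```python
def burn_rate(consumed_min: float, window_hours: float,
              budget_min: float = 43.2, period_days: int = 30) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    fraction_consumed = consumed_min / budget_min
    windows_per_period = period_days * 24 / window_hours
    return fraction_consumed * windows_per_period

def severity(rate: float) -> str:
    """Map a burn rate to the alert tiers listed above."""
    if rate > 14.4:
        return "PAGE (P1)"
    if rate > 6:
        return "TICKET (P2)"
    if rate > 3:
        return "REVIEW (P3)"
    if rate > 1:
        return "DASHBOARD (P4)"
    return "OK"

rate = burn_rate(consumed_min=2, window_hours=1)
hours_to_exhaust = 30 * 24 / rate
print(f"burn rate {rate:.1f}x, budget gone in {hours_to_exhaust:.1f} hours -> {severity(rate)}")
```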

The simulation below tracks an error budget over a 30-day period. Deployments and incidents consume budget. Watch the budget drain and see when policies activate.

[Interactive simulation: Error Budget Tracker. 30-day error budget at 99.9% SLO (43.2 min). Deployments and incidents consume budget.]
Quiz: Your SLO is 99.9%. It's day 20 of the month and you have 5 minutes of error budget remaining. A product manager wants to deploy a major new feature. What do you do?

Chapter 5: Alerting

An alert wakes a human up at 3 AM. That alert better be important. If it is a false alarm — or a real alarm that does not require immediate human action — you have just wasted someone's sleep, eroded trust in your alerting system, and contributed to alert fatigue.

Symptom-Based vs. Cause-Based Alerting

The most important principle in alerting: alert on symptoms, not causes.

| Type | Example | Assessment |
|---|---|---|
| Cause-based (bad) | "CPU > 90%" | CPU at 92% might be fine if latency is normal. You page someone for a non-problem. |
| Symptom-based (good) | "Error rate > 1% for 5 minutes" | This means users are affected RIGHT NOW. Always actionable. |

The golden rule of alerting. Every page (wake-someone-up alert) must be: (1) urgent — requires action within minutes, not hours; (2) actionable — the on-call engineer can DO something about it; (3) real — not a false positive. If any of these is false, it should not be a page. Make it a ticket or a dashboard annotation instead.

Alert Severity Levels

| Level | Response Time | Channel | Example |
|---|---|---|---|
| Critical (P1) | Immediate (minutes) | Page on-call, phone call | Service completely down, data loss |
| High (P2) | Hours | Slack notification, ticket | Error rate elevated but still within the SLO |
| Medium (P3) | Days | Ticket | Disk usage growing, needs cleanup |
| Low (P4) | Weeks | Dashboard annotation | Certificate expires in 60 days |

Multi-Window, Multi-Burn-Rate Alerting

The most sophisticated alerting approach ties directly to error budgets. Instead of fixed thresholds, alert when you are burning error budget too fast:

// Error budget burn rate alerting:
Monthly budget: 43.2 minutes (99.9% SLO)

// If we burn 1 month of budget in 1 hour (burn rate = 720x):
// That's a CRITICAL incident. Page immediately.

// If we burn 1 month of budget in 3 days (burn rate = 10x):
// That's a HIGH priority. Notify the team, investigate today.

// If we burn 1 month of budget in 10 days (burn rate = 3x):
// That's a MEDIUM priority. File a ticket, fix this week.

On-Call Practices

An alert pages a human. That human is the on-call engineer — someone who has agreed to be reachable 24/7 for a rotation period (usually one week). On-call is one of the most important and demanding roles in engineering. Bad on-call practices burn out engineers and degrade incident response quality.

| Practice | Bad | Good |
|---|---|---|
| Rotation length | 1 month (burnout) | 1 week, with swap ability |
| Alert volume | >2 pages per shift (fatigue) | ≤2 pages per week |
| Runbooks | None ("figure it out") | Every alert links to a runbook with step-by-step remediation |
| Post-incident | Blame the engineer | Blameless post-mortem, focus on systemic fixes |
| Compensation | None | Extra pay or comp time for on-call shifts |

Incident Response Framework

When a page fires, the on-call engineer should follow a structured response:

1. Acknowledge (2 min)
Ack the page. Open the dashboard. Assess severity. Communicate to the team channel.
2. Mitigate (5-15 min)
Stop the bleeding. Rollback a deploy, disable a feature flag, failover to standby. Do not debug yet — mitigate first.
3. Diagnose (15-60 min)
Now find the root cause. Dashboard → traces → logs → recent changes. Identify the change or condition that triggered the incident.
4. Fix & Verify (variable)
Apply the fix. Verify metrics return to baseline. Monitor for 30 minutes.
5. Post-mortem (next business day)
Write a blameless post-mortem. Timeline, root cause, impact, action items. Share widely.
Mitigate first, diagnose second. The #1 mistake in incident response is trying to understand WHY before stopping the damage. A 10-minute mitigation (rolling back the deploy) saves more user pain than a 60-minute diagnosis that leads to a more elegant fix. Stop the bleeding, then do surgery.

The simulation shows different alert configurations and their impact on on-call burden. Compare symptom-based and cause-based alerting.

[Interactive simulation: Alert Configuration Comparison. Simulate 1 week of alerts. Compare cause-based (noisy) vs. symptom-based (accurate) alerting.]
Quiz: You have an alert "CPU > 90%." It fires 15 times this week. 13 times, the CPU spike resolved itself in 2 minutes with no user impact. 2 times, it was a real problem. What should you do?

Chapter 6: Structured Logging

A traditional log line looks like this: 2024-01-15 01:47:23 ERROR: Payment failed for user 12345, amount=$50.00, reason=timeout. It is human-readable. It is also impossible to query efficiently. Want to find all payment failures over $100 in the last hour? You are parsing strings with regex.

Structured logging emits log events as key-value pairs (usually JSON), making them queryable by any field:

// Unstructured (bad for queries):
"2024-01-15 01:47:23 ERROR: Payment failed for user 12345, amount=$50.00"

// Structured (queryable):
{
"timestamp": "2024-01-15T01:47:23Z",
"level": "ERROR",
"service": "payment-service",
"event": "payment_failed",
"user_id": "12345",
"amount": 50.00,
"currency": "USD",
"reason": "timeout",
"trace_id": "abc-123-def-456",
"duration_ms": 30042
}

What to Log

| Always Log | Never Log |
|---|---|
| Timestamp (ISO 8601 UTC) | Passwords, tokens, API keys |
| Log level (ERROR, WARN, INFO, DEBUG) | Full credit card numbers |
| Service name and version | Personal health information |
| Trace ID and span ID | Social security numbers |
| Request ID | Full request/response bodies (too large) |
| User ID (if applicable) | Raw database queries with parameters |
| Duration of the operation | Anything that violates GDPR/CCPA/HIPAA |
| Error type and message | |

Correlation is the killer feature. The trace_id in every log entry lets you find ALL log lines related to a single request, across ALL services. User reports "my payment failed." You search logs for their user_id, find the trace_id of the failing request, and see every log line from every service that request touched. Without trace_id, you are searching millions of log lines by timestamp and hoping.
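The correlation workflow can be sketched in a few lines. The log lines below are invented for illustration, and a real system would run this as a query against a log store rather than filtering in application code.

```python
import json

# Hypothetical structured log lines from three different services.
raw_lines = [
    '{"service": "gateway", "trace_id": "abc-123", "user_id": "12345", "level": "INFO", "event": "request_received"}',
    '{"service": "gateway", "trace_id": "zzz-999", "user_id": "67890", "level": "INFO", "event": "request_received"}',
    '{"service": "order-service", "trace_id": "abc-123", "level": "INFO", "event": "order_created"}',
    '{"service": "payment-service", "trace_id": "abc-123", "user_id": "12345", "level": "ERROR", "event": "payment_failed"}',
]
events = [json.loads(line) for line in raw_lines]

def failing_trace_ids(events, user_id):
    """Step 1: find the trace IDs of this user's failed requests."""
    return {e["trace_id"] for e in events
            if e.get("user_id") == user_id and e.get("level") == "ERROR"}

def events_for_trace(events, trace_id):
    """Step 2: pull every log line for that trace, across all services."""
    return [e for e in events if e.get("trace_id") == trace_id]

for trace_id in failing_trace_ids(events, "12345"):
    for e in events_for_trace(events, trace_id):
        print(e["service"], e["level"], e["event"])
```

Note that the order-service line carries no user_id at all; it is found only because it shares the trace_id.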

Structured Logging Implementation

python
import json
import time

class StructuredLogger:
    def __init__(self, service_name, version):
        self.service = service_name
        self.version = version

    def log(self, level, event, **kwargs):
        entry = {
            # gmtime() keeps the timestamp in UTC, which is what the "Z" suffix claims
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "level": level,
            "service": self.service,
            "version": self.version,
            "event": event,
            **kwargs  # trace_id, user_id, duration_ms, etc.
        }
        print(json.dumps(entry))  # stdout, captured by log agent

# Usage:
trace_id = "abc-123"  # normally taken from the incoming request's trace context
logger = StructuredLogger("payment-service", "v2.3.1")
logger.log("ERROR", "payment_failed",
    trace_id=trace_id,
    user_id="12345",
    amount=50.00,
    reason="timeout",
    duration_ms=30042
)
# Output (a single line, wrapped here; the timestamp is the current UTC time):
# {"timestamp": "2024-01-15T01:47:23Z", "level": "ERROR",
#  "service": "payment-service", "version": "v2.3.1",
#  "event": "payment_failed", "trace_id": "abc-123",
#  "user_id": "12345", "amount": 50.0, "reason": "timeout",
#  "duration_ms": 30042}

Log Levels: When to Use Each

| Level | When | Example | Volume |
|---|---|---|---|
| ERROR | Something failed that affects user experience | Payment declined, query timeout | Low (<1% of requests) |
| WARN | Something unexpected but handled | Retry succeeded, cache miss, slow query | Low-medium |
| INFO | Significant business events | Order placed, user signed up, deploy started | Medium |
| DEBUG | Detailed execution flow (disabled in production) | Query params, function entry/exit | Very high (off in prod) |

Log the boundaries, not the internals. Log when a request arrives, when it calls a dependency, and when it completes. Do not log every internal function call — that creates noise. A well-instrumented service needs 3-5 log lines per request, not 50.

The simulation shows how structured vs. unstructured logging affects incident diagnosis time.

[Interactive simulation: Structured vs. Unstructured Log Search. Search for "all payment failures over $100 in the last hour" and compare query times for unstructured (grep) vs. structured (indexed query) logs.]
Quiz: A user reports their request failed. You want to see every log line from every service involved in that request. What field in your structured logs enables this?

Chapter 7: Distributed Tracing

A user's request enters your system at the API gateway. It is routed to the order service, which calls the inventory service, which calls the database, which calls the cache. The entire round trip takes 2.3 seconds. Where did the time go?

Logs tell you what happened at each service. Metrics tell you aggregate behavior. But neither tells you the journey of a single request through the entire system. That is what distributed tracing does.

Anatomy of a Trace

A trace represents the entire lifecycle of a request. It is composed of spans, where each span represents one operation (one service call, one database query, one cache lookup).

// Trace: user checkout request
// trace_id: abc-123

Span 1: API Gateway              [0ms ---------- 2300ms]   (total: 2300ms)
  Span 2: Order Service          [5ms ------- 2295ms]      (total: 2290ms)
    Span 3: Inventory Check      [10ms -- 50ms]            (total: 40ms)
    Span 4: Payment Service      [55ms ------- 2250ms]     (total: 2195ms) !!!
      Span 5: DB Query           [60ms ------- 2200ms]     (total: 2140ms) !!!
    Span 6: Send Confirmation    [2255ms -- 2290ms]        (total: 35ms)

The trace makes it immediately obvious: the Payment Service took 2195ms, and within that, the DB query took 2140ms. That is the bottleneck. Without tracing, you would know "the request took 2.3 seconds" but not where in the chain the time was spent.

Trace Context Propagation

For tracing to work, every service must propagate the trace context (trace_id, span_id, parent_span_id) to the next service. The standard is the W3C Trace Context specification, which uses HTTP headers:

// W3C Trace Context headers:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
// version (2 hex) - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)

// Each service:
// 1. Reads traceparent from incoming request
// 2. Creates a new span with the trace_id and its own span_id
// 3. Sets parent_span_id to the incoming span_id
// 4. Includes traceparent in outgoing requests to downstream services
// 5. Reports the span to the tracing backend when the operation completes
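Steps 1 through 4 can be sketched with the standard library alone. This is an illustration of the header mechanics, assuming the W3C `traceparent` layout of 2, 32, 16, and 2 lowercase hex digits; it skips parts of the specification (such as rejecting all-zero IDs), and in practice a tracing SDK would handle this for you. The function names are hypothetical.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    """Step 1: read the incoming traceparent header."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

def start_child_span(incoming_header: str):
    """Steps 2-4: new span ID, same trace ID, header for downstream calls."""
    ctx = parse_traceparent(incoming_header)
    span = {
        "trace_id": ctx["trace_id"],          # unchanged across the whole request
        "span_id": secrets.token_hex(8),      # 8 random bytes = 16 hex digits
        "parent_span_id": ctx["parent_id"],   # the caller's span
    }
    outgoing = f"{ctx['version']}-{span['trace_id']}-{span['span_id']}-{ctx['flags']}"
    return span, outgoing

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
span, outgoing = start_child_span(incoming)
print(span)
print("outgoing traceparent:", outgoing)
```

The downstream service repeats the same procedure, so its `parent_span_id` is this service's `span_id`, which is how the backend reassembles the tree.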
The 1% problem. At high traffic, tracing every request generates enormous data volumes. Most systems use sampling: trace 1% of requests (or 100% of error requests, or 100% of slow requests). The trade-off: you might not have a trace for the specific request a user complains about. Sophisticated systems use tail-based sampling: decide whether to keep a trace after it completes, based on whether it was slow or errored.

Sampling Strategies

| Strategy | How it works | Pro | Con |
|---|---|---|---|
| Head-based (random) | Decide at the start: trace this request with 1% probability | Simple, low overhead | May miss interesting traces. Cannot decide based on outcome. |
| Tail-based | Buffer all spans. After request completes, decide to keep based on duration, error status, or other criteria | Keeps all interesting traces | Requires buffering all spans temporarily. Higher memory cost. |
| Priority-based | 100% for errors, 100% for slow, 10% for specific endpoints, 1% for everything else | Balances coverage and cost | Complex configuration |

// Trace data volume calculation:

Traffic: 10,000 RPS
Average spans per trace: 8 (typical microservice chain)
Average span size: 500 bytes
Data per trace: 8 × 500 = 4 KB

// At 100% sampling:
10,000 × 4 KB = 40 MB/second ≈ 3.5 TB/day

// At 1% sampling:
100 × 4 KB = 400 KB/second ≈ 35 GB/day

// At 1% + 100% errors (0.1% error rate):
(100 + 10) × 4 KB = 440 KB/second ≈ 38 GB/day
// Captures ALL errors for only about 38 GB of storage per day.
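The same estimate as a reusable function, using the traffic figures from the calculation above. It assumes decimal units (1 KB = 1000 bytes) and 86,400 seconds per day; `trace_volume` and its parameter names are illustrative.

```python
def trace_volume(rps: float, sample_rate: float, always_keep_per_s: float = 0,
                 spans_per_trace: int = 8, span_bytes: int = 500):
    """Return (bytes_per_second, bytes_per_day) of stored trace data."""
    traces_per_s = rps * sample_rate + always_keep_per_s
    bytes_per_s = traces_per_s * spans_per_trace * span_bytes
    return bytes_per_s, bytes_per_s * 86_400

scenarios = {
    "100% sampling": trace_volume(10_000, 1.0),
    "1% sampling": trace_volume(10_000, 0.01),
    # 0.1% of 10,000 RPS = 10 error traces per second, always kept
    "1% + all errors": trace_volume(10_000, 0.01, always_keep_per_s=10),
}
for name, (per_s, per_day) in scenarios.items():
    print(f"{name}: {per_s / 1e3:,.0f} KB/s, {per_day / 1e9:,.1f} GB/day")
```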

The Three Pillars Connected

The real power of observability comes from connecting the three pillars. A metric tells you something is wrong. A trace shows you where. A log tells you exactly what happened.

1. Alert fires
Metric: "Error rate > 1% for 5 minutes on payment-service"
↓ click "View Exemplar Trace"
2. Open trace
Trace: Request spent 29s in DB span. All other spans normal.
↓ click on DB span for logs
3. Read logs
Log: "WARN: full table scan on payments table, 2.1M rows, missing index on created_at"
↓ search for related deploys
4. Find root cause
The deploy at 01:45 added a query without an index. Rolling it back fixes the issue.

This workflow — metric alert → trace → log → root cause — takes 5-10 minutes with good observability. Without it, the same investigation takes 30-60 minutes of guessing and grepping.

The simulation below shows a distributed trace. Click on spans to see details. Inject a slow dependency to see where the bottleneck appears.

[Interactive simulation: Distributed Trace Viewer. A request flows through 4 services. Inject a slow dependency to see the bottleneck in the trace.]
Quiz: A request takes 5 seconds. The trace shows: Gateway (5ms overhead), Auth (20ms), OrderService (4900ms), with OrderService calling DB (4850ms). Where is the bottleneck?

Chapter 8: Live Dashboard Simulator

This is the showcase chapter. A live dashboard showing all four golden signals, with alert indicators, trace links, and the ability to inject incidents. This is what the on-call engineer sees at 3 AM.

How to use the dashboard. Click "Start System" to begin generating real-time metrics. The four golden signals update every second. Inject incidents to see alerts fire and metrics degrade. Click "Trace Slow Request" to see a distributed trace of a slow request.

[Interactive simulation: Live Operations Dashboard. Four golden signals with real-time data. Inject incidents, watch alerts, trace requests.]

What a Good Dashboard Shows

| Section | Contents | Why |
| --- | --- | --- |
| Top banner | Overall status (green/yellow/red), active alerts count | Instant health assessment in 1 second |
| Golden signals | Latency (p50/p99), traffic (RPS), errors (%), saturation (%) | The 4 things that matter most |
| SLO burn rate | Error budget consumption, burn rate chart | Are we on track for the month? |
| Top errors | Most common error types by count | What is breaking most? |
| Recent deployments | Timeline of deploys with status | Correlate incidents with changes |
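The burn-rate panel reduces to one ratio: the observed error ratio divided by the error budget the SLO allows. A minimal sketch follows, assuming a 30-day SLO period. The function names and the example numbers are illustrative, not taken from any particular tool.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent.

    error_ratio: fraction of failed requests in the observed window (0.01 = 1%).
    slo_target:  availability target as a fraction (0.999 = 99.9%).
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 100%")
    return error_ratio / budget


def budget_consumed(error_ratio: float, slo_target: float,
                    window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the whole period's error budget used up during the window."""
    return burn_rate(error_ratio, slo_target) * window_hours / period_hours


# Assumed example: 99.9% SLO, 1% of requests currently failing.
rate = burn_rate(0.01, 0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget used in 1 hour: {budget_consumed(0.01, 0.999, 1):.2%}")
```

A burn rate of 1 means the budget lasts exactly the period. In the assumed example the rate is about 10, so a 30-day budget would be gone in roughly 3 days if nothing changed.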

Dashboard Anti-Patterns

A bad dashboard is worse than no dashboard, because it gives a false sense of visibility. Here are the most common mistakes:

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Too many charts | 30 charts = cognitive overload. Engineer can't find the signal. | 5-7 charts max per dashboard. One dashboard per concern. |
| No context | Line goes up. Is that good or bad? | Show thresholds, baselines, and SLO targets on every chart. |
| Average-only | "Average latency is 50ms" hides p99 of 3 seconds. | Show p50, p95, p99 as separate lines or a heatmap. |
| Stale data | Dashboard shows data from 5 minutes ago. | Refresh every 10-15 seconds. Show "last updated" timestamp. |
| No drill-down | Error rate is high. Now what? | Click on a data point to jump to related traces and logs. |
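The "average-only" anti-pattern is easy to reproduce with a few lines of arithmetic. The sketch below uses a made-up sample of 100 request latencies and a simple nearest-rank percentile. Real metrics backends usually estimate percentiles from histograms instead, so treat this only as an illustration of why the mean misleads.

```python
import math
from typing import List


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests (20 ms) and 2 very slow ones (3000 ms).
latencies_ms = [20] * 98 + [3000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.1f} ms")
print(f"p50:  {percentile(latencies_ms, 50)} ms")
print(f"p95:  {percentile(latencies_ms, 95)} ms")
print(f"p99:  {percentile(latencies_ms, 99)} ms")
```

In this sample the mean stays under 80 ms and p50 and p95 are both 20 ms, while p99 is 3000 ms. A chart that plots only the average would show nothing alarming for the users stuck in the slow tail.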

The Metrics Pipeline

How do metrics get from your application to a dashboard? The pipeline has four stages:

1. Instrumentation
Application code emits metrics (counters, gauges, histograms) via a client library. Example: Prometheus client, OpenTelemetry SDK.
2. Collection
A metrics agent either pulls metrics from applications (a Prometheus server scraping an endpoint) or receives pushed metrics (StatsD, OTel Collector), typically every 10-15 seconds.
3. Storage
A time-series database (Prometheus, InfluxDB, Cortex, Thanos) stores metrics with timestamps. Retention: 30-90 days at full resolution.
4. Visualization
A dashboard tool (Grafana, Datadog, New Relic) queries the TSDB and renders charts. Engineers build queries in PromQL, InfluxQL, or similar.
# Example: instrumenting a Python service with Prometheus
from prometheus_client import Counter, Gauge, Histogram

# Counter: total requests (monotonically increasing)
request_total = Counter('http_requests_total',
                        'Total HTTP requests', ['method', 'endpoint', 'status'])

# Histogram: request duration (distribution)
request_duration = Histogram('http_request_duration_seconds',
                             'HTTP request duration',
                             buckets=[.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10])

# Gauge: current in-flight requests
in_flight = Gauge('http_requests_in_flight',
                  'Currently processing requests')

# Three metric types cover almost every use case:
# Counter for "how many?" (requests, errors, bytes)
# Histogram for "how long?" (latency, sizes)
# Gauge for "how much right now?" (queue depth, connections)
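A histogram stores only bucket counts, so a dashboard has to estimate percentiles from them. The sketch below shows one common approach, linear interpolation inside the bucket that contains the target rank, which is similar in spirit to PromQL's `histogram_quantile`. The function name and the bucket counts are hypothetical, and this is not the Prometheus implementation itself.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound,
             with the last bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    target = q * total  # rank of the observation we are looking for
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate here; fall back to the last finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (target - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for 1000 requests.
buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
print(f"p50 ~ {estimate_quantile(0.50, buckets):.3f} s")
print(f"p99 ~ {estimate_quantile(0.99, buckets):.3f} s")
```

The estimate can only be as precise as the bucket boundaries. That is why the bucket list in the instrumentation example is denser at low latencies, where most requests land.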

Chapter 9: Connections

Observability is the final piece of the distributed systems reliability stack. You now have the complete picture: what fails, how to isolate it, how to survive it at runtime, how to ship safely, and how to see what is happening in production.

The Complete Stack

LayerTopicQuestion it answers
ArchitectureFailure Modes & IsolationWhat can fail? How do we contain it?
RuntimeResiliency PatternsHow do we handle failures gracefully in code?
DeliveryTesting & DeploymentHow do we ship changes safely?
VisibilityObservability & Operations (this lesson)How do we see what's happening and respond?

Key Takeaways

1. Measure the four golden signals. Latency (percentiles!), traffic, errors, saturation. If you can only have four metrics, make them these four.

2. RED for services, USE for resources. Rate/Errors/Duration for request-driven services. Utilization/Saturation/Errors for hardware resources.

3. SLOs are budgets, not aspirations. Set them based on user needs. Never set 100%. Use error budgets to make deployment risk decisions.

4. Alert on symptoms, not causes. "Error rate > 1%" pages a human. "CPU > 90%" creates a ticket. Every page must be urgent, actionable, and real.

5. Structure your logs. JSON with trace_id lets you query efficiently and correlate across services. Never log secrets.

6. Trace the critical path. Distributed tracing shows where time is spent. Sample at 1% in production, 100% for errors and slow requests.

7. Dashboards should answer questions in seconds. Top banner for status, golden signals for detail, drill-down for investigation.
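The sampling rule in takeaway 6 can be written as a small decision function. The sketch below assumes the decision is made after the request finishes (tail-based), because the error status and duration have to be known first. The function name, the 1000 ms slow threshold, and the hashing scheme are assumptions for illustration, not the behavior of any specific tracing library.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, duration_ms: float,
               base_rate: float = 0.01, slow_threshold_ms: float = 1000.0) -> bool:
    """Decide whether to keep a finished trace.

    Errors and slow requests are always kept. Everything else is kept at
    base_rate, using a hash of the trace ID so that every service reaches
    the same decision for the same trace.
    """
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate


# Hypothetical traffic: 10,000 healthy, fast requests.
kept = sum(keep_trace(f"trace-{i}", False, 10.0) for i in range(10_000))
print(f"kept {kept} of 10000 normal traces")
print(keep_trace("trace-42", True, 10.0))     # error: always kept
print(keep_trace("trace-42", False, 4900.0))  # slow: always kept
```

Hashing the trace ID instead of calling a random number generator keeps the decision deterministic, so a trace is not kept by one service and dropped by the next.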

"Observability is not about collecting data. It is about being able to ask arbitrary questions of your system and get answers without deploying new code." — Charity Majors, CTO of Honeycomb
Final quiz: You are building a new microservices platform. What is the minimum observability setup you need before launching to production?