Observability & Operations — From Absolute Zero to Mastery

Chapter 0: Why Observability

It is 2 AM. Your phone buzzes. "CRITICAL: Payment service error rate above 5%." You open your laptop. The payment service is returning 500 errors on 8% of requests. Why?

Is it the payment service itself? The database it calls? The network between them? A bad deploy that went out 30 minutes ago? A traffic spike from a marketing campaign? A dependency that rate-limited you? Without observability, you are flying blind — guessing, grepping through log files, restarting services and hoping.

With observability, you open a dashboard. You see the error rate spiked at 1:47 AM. You click on a failing request. A trace shows the request spent 29 seconds waiting for a database query. You look at database metrics: CPU is at 98% because an unindexed query is doing a full table scan. You find the deploy that introduced the query. You roll back. Error rate drops to 0.1% within 2 minutes. Total incident time: 11 minutes.

The difference between monitoring and observability. Monitoring tells you WHEN something is broken (alerts fire). Observability tells you WHY it is broken. A monitored system has dashboards and alerts. An observable system has metrics, logs, and traces that let you diagnose any problem without deploying new code or adding new instrumentation. If you have to add a log line to debug a production issue, your system is not observable enough.

The Three Pillars

Observability has three data types, often called the three pillars:

Pillar	What it is	What it answers	Example
Metrics	Numeric measurements over time	"What is happening?" (rates, gauges, distributions)	Request rate: 1200 RPS, p99 latency: 230ms
Logs	Timestamped text records of events	"What exactly happened?" (the narrative)	"2024-01-15 01:47:23 ERROR: query timeout after 30s, query=SELECT..."
Traces	End-to-end path of a request through services	"Where did the time go?" (the journey)	Frontend → Gateway (2ms) → OrderService (5ms) → DB (29s) !!!

The simulation below shows the difference between flying blind and having observability. A service starts failing. Without observability, you see "errors." With observability, you see exactly where and why.

[Interactive simulation: Blind vs. Observable. Trigger an incident, then compare diagnosis time with and without observability.]

The structure of this lesson. We will cover: what to measure (four golden signals, RED, USE), what targets to set (SLIs/SLOs/SLAs), when to alert (alerting philosophy), what to log (structured logging), how to trace (distributed tracing), and a live dashboard simulation that ties it all together.

Quiz: Your payment service returns 500 errors on 8% of requests. You have metrics dashboards but no distributed tracing. What can you determine?

The exact root cause: metrics are enough

You can see THAT the payment service is failing and WHEN it started, but not necessarily WHERE in the call chain the failure originates. If the payment service calls 3 downstream services, metrics alone don't tell you which one is causing the errors. You need traces to follow a failing request through the entire call chain.

Nothing useful: you need logs to diagnose anything

Chapter 1: The Four Golden Signals

If you could only measure four things about your service, what would they be? The Site Reliability Engineering discipline converges on the same four:

1. Latency

How long does it take to serve a request? Specifically, the distribution of latencies, not just the average. A service with 50ms average latency might have a p99 of 2 seconds, meaning 1% of requests take at least 40x longer than the average.

Always measure percentiles, not averages. An average of 50ms could mean everyone gets 50ms (great) or 99% get 10ms and 1% get 4 seconds (terrible). The p50, p90, p95, and p99 tell the real story. Alert on p99, not average.

Why Averages Lie: A Worked Example

// Two services with the same average latency:

// Service A: 100 requests
All requests: 50ms
Average: 50ms, p50: 50ms, p99: 50ms
// Uniform. Every user gets the same experience.

// Service B: 100 requests
99 requests: 10ms, 1 request: 4050ms
Average: (99 × 10 + 1 × 4050) / 100 = 50.4ms
p50: 10ms, p99: 4050ms
// Nearly the same average, but the slowest 1% of requests take about 4 seconds.

// If a user makes 20 requests per session:
P(hitting p99 at least once) = 1 - (0.99)²⁰ = 18%
// Nearly 1 in 5 users will experience a 4-second wait.

This is why p99 is the most important latency metric. It represents the experience of your worst-affected users, who are often your highest-value users (they use the product frequently, so they hit the tail more often).
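The session arithmetic in the worked example generalizes to any tail fraction and session length. The following is a minimal sketch, assuming each request lands in the slow tail independently and with the same probability (real traffic only approximates this); the function name is illustrative and not taken from any library.

```python
def p_hit_tail(tail_fraction: float, requests_per_session: int) -> float:
    """Probability that a session sees at least one tail-latency request.

    Assumes requests are independent, which is a simplification.
    """
    return 1.0 - (1.0 - tail_fraction) ** requests_per_session


# How quickly the slowest 1% becomes a common experience:
for n in (1, 20, 100):
    print(f"{n:>3} requests/session -> {p_hit_tail(0.01, n):.1%} hit the slowest 1%")
```

At 100 requests per session, well over half of all sessions touch the tail at least once, which is why frequent users feel tail latency the most.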

Histograms vs. Summaries

There are two ways to store latency data in a metrics system:

Type	How it works	Pros	Cons
Histogram	Counts requests in predefined buckets: 0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+	Aggregatable across servers. Cheap to store.	Bucket boundaries must be chosen in advance. Precision limited by bucket size.
Summary	Calculates approximate percentiles inside each service process using a streaming quantile estimator	Percentiles without predefined buckets.	Cannot be aggregated across servers (p99 of p99s is not p99 of all). More expensive to compute in the instrumented process.

Histograms are preferred for distributed systems. Because bucket counts can be summed, you can compute the p99 across all servers, not just per server. The Prometheus documentation recommends histograms whenever you need to aggregate, for this reason.
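To see why aggregation matters, here is a small sketch of merging per-server histograms and reading a fleet-wide p99 from the merged buckets. The bucket bounds, the per-server counts, and the function names are all made up for illustration; a real metrics backend would also interpolate within the bucket rather than report only its upper bound.

```python
# Hypothetical bucket upper bounds in milliseconds (last bucket is open-ended).
BOUNDS_MS = [10, 50, 100, 500, float("inf")]


def merge(histograms):
    """Sum per-bucket counts from several servers (element-wise)."""
    return [sum(counts) for counts in zip(*histograms)]


def quantile_upper_bound(counts, q):
    """Upper bound of the bucket that contains the q-th quantile."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BOUNDS_MS, counts):
        running += count
        if running >= target:
            return bound
    return BOUNDS_MS[-1]


server_a = [900, 95, 4, 1, 0]     # 1000 requests, mostly fast
server_b = [500, 300, 100, 95, 5]  # 1000 requests, heavy tail

fleet = merge([server_a, server_b])
print("server A p99 <=", quantile_upper_bound(server_a, 0.99), "ms")
print("server B p99 <=", quantile_upper_bound(server_b, 0.99), "ms")
print("fleet    p99 <=", quantile_upper_bound(fleet, 0.99), "ms")
```

Averaging the two per-server results (50 ms and 500 ms) would suggest 275 ms, a number that describes no real request population. Summing the buckets first gives the fleet-wide answer of at most 500 ms.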

2. Traffic

How much demand is being placed on the system? For a web service: HTTP requests per second. For a database: queries per second. For a message queue: messages per second. Traffic tells you the current load level and helps you predict when you will need more capacity.

3. Errors

What fraction of requests fail? This includes explicit errors (HTTP 500s), implicit errors (HTTP 200 with wrong content), and policy violations (responses slower than a threshold). The error rate (errors per second or errors as a percentage of total requests) is more useful than the error count.

4. Saturation

How "full" is the system? Saturation measures how close you are to capacity: CPU utilization, memory usage, disk I/O, queue depth, connection pool usage. When saturation approaches 100%, performance degrades nonlinearly: a system at 90% CPU is not merely 10% slower than one at 80%. It is dramatically slower, because requests increasingly wait in queues behind other requests.
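The nonlinearity can be made concrete with a textbook queueing model. The sketch below assumes an M/M/1 queue (a single server with random arrivals and random service times), which is a simplification introduced here rather than something the lesson specifies; under that model, mean response time is the service time divided by (1 - utilization).

```python
def mm1_response_time_ms(service_time_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - rho).

    This is an idealized model; real systems differ, but the shape
    of the curve (a blow-up near 100%) is the point.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


# A request that needs 10 ms of actual work:
for rho in (0.0, 0.5, 0.8, 0.9, 0.99):
    print(f"utilization {rho:>4.0%}: ~{mm1_response_time_ms(10, rho):7.1f} ms")
```

Going from 50% to 90% utilization multiplies the modeled response time by five, and the last few percent before 100% are far worse.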

The simulation below shows all four golden signals for a live service. Traffic arrives at a steady rate, then spikes. Watch how the signals interact.

[Interactive simulation: Four Golden Signals, live. Four real-time charts; start traffic, then inject a traffic spike or a slow dependency to see all four signals react.]

Quiz: Your service's average latency is 50ms but the p99 is 3 seconds. Users complain about slowness. A coworker says "the average is fine." What is wrong with this reasoning?

Nothing: 50ms average is acceptable for most services

The p99 should be lower than the average

The average hides the tail. If you serve 1 million requests per day, 1% at p99 means 10,000 requests per day take 3 seconds or longer. A single user may make many requests, so they are quite likely to hit the 1% tail at least once per session. Averages lie. Percentiles tell the truth. Alert on p99, not mean.

Chapter 2: RED & USE Methods

The four golden signals tell you what to measure for any service. But different types of components have different natural metrics. The RED method and USE method give you targeted frameworks.

RED Method (for services)

For any request-driven service (API, web server, microservice), measure three things:

R	Rate	Requests per second. How busy is the service?
E	Errors	Errors per second (or error rate %). Is the service broken?
D	Duration	Latency distribution (p50, p99). Is the service slow?

RED is essentially the four golden signals minus saturation, focused on the user-facing experience. If you can only instrument three things per service, instrument RED.

USE Method (for resources)

For any resource (CPU, memory, disk, network, connection pool), measure three things:

U	Utilization	Percentage of time the resource is busy. CPU at 85%, disk at 60%.
S	Saturation	Amount of work queued that cannot be served. Queue depth, wait time.
E	Errors	Count of error events. ECC memory corrections, network packet drops.

RED for services, USE for resources. When debugging a slow API, start with RED metrics on the service. If error rate is high, check logs. If latency is high, check USE metrics on the underlying resources (CPU, DB connections, disk I/O) to find the bottleneck.

When to Use RED vs. USE

The debugging workflow typically starts with RED (user-facing symptoms) and drills into USE (resource causes):

1. RED detects the problem

"Rate normal, errors at 3%, duration p99 at 2s." Something is slow and failing.

↓ which resource is the bottleneck?

2. USE finds the cause

"CPU 40%, Memory 60%, Disk I/O 98% utilized." The disk is saturated.

↓ why is disk saturated?

3. Logs explain the details

"Full table scan on 50M row table. Missing index on query introduced at 01:45."

RED tells you THAT something is wrong. USE tells you WHERE the bottleneck is. Logs tell you WHY.

Mapping RED and USE to Your System

// Example: e-commerce checkout service

// RED metrics (service-level):
Rate: checkout_requests_total (counter)
Errors: checkout_errors_total (counter)
Duration: checkout_duration_seconds (histogram)

// USE metrics (resource-level):
CPU Utilization: node_cpu_seconds_total
CPU Saturation: node_load1 (1-minute load average)
Mem Utilization: node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
Mem Saturation: node_vmstat_pgmajfault (major page faults)
Disk Utilization: node_disk_io_time_seconds_total
Disk Saturation: node_disk_io_time_weighted_seconds_total
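The RED counters above only ever increase, so Rate and Errors come from the difference between two samples. A minimal sketch of that calculation follows, assuming two scrapes taken 60 seconds apart; the `Scrape` type and the sample values are hypothetical, and counter resets (for example after a restart) are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Scrape:
    """One sample of the two RED counters (hypothetical structure)."""
    timestamp_s: float
    requests_total: int
    errors_total: int


def rate_and_error_ratio(prev: Scrape, curr: Scrape):
    """Requests per second and error ratio between two scrapes.

    Does not handle counter resets; a real system must.
    """
    elapsed = curr.timestamp_s - prev.timestamp_s
    requests = curr.requests_total - prev.requests_total
    errors = curr.errors_total - prev.errors_total
    rate = requests / elapsed
    error_ratio = errors / requests if requests else 0.0
    return rate, error_ratio


prev = Scrape(timestamp_s=0, requests_total=100_000, errors_total=50)
curr = Scrape(timestamp_s=60, requests_total=130_000, errors_total=950)
rate, error_ratio = rate_and_error_ratio(prev, curr)
print(f"Rate: {rate:.0f} RPS, Errors: {error_ratio:.1%}")
```

Duration is left out because it comes from histogram buckets rather than a single counter.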

The simulation shows RED and USE metrics side by side. Inject load and watch how service-level metrics (RED) correspond to resource-level metrics (USE).

[Interactive simulation: RED & USE Dashboard. Left: service metrics (RED). Right: resource metrics (USE). Start traffic, then spike load to see the two correlate.]

Quiz: Your API has Rate=500 RPS, Errors=0%, Duration p99=50ms. But your database CPU (USE) is at 95% utilization. What should you do?

Act now, even though RED looks healthy. USE metrics show you are nearly at capacity. The next traffic increase will push the DB over the edge, and as a system approaches 100% utilization, latency grows nonlinearly (far faster than the load itself). RED is fine NOW, but USE is predicting an imminent problem. Add DB capacity or optimize queries before the load increases.

Nothing: RED metrics are healthy, so the system is fine

Increase the API timeout to handle higher latency

Chapter 3: SLIs, SLOs, SLAs

Your service is "reliable." But how reliable? 99%? 99.9%? 99.99%? And reliable in what dimension — availability? Latency? Correctness? These terms are often used loosely. The SLI/SLO/SLA framework gives them precise meaning.

SLI: Service Level Indicator

An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided. It is a metric with a specific definition.

// Example SLIs:

Availability SLI = (successful requests / total requests) × 100%
// "What fraction of requests succeed?"

Latency SLI = fraction of requests completing within 300ms
// "What fraction of requests are fast enough?"

Correctness SLI = (correct responses / total responses) × 100%
// "What fraction of responses return the right data?"
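As a concrete illustration, the availability and latency SLIs can be computed from a batch of request records. This is a sketch under stated assumptions: the record shape (status code, latency) is invented for the example, "successful" is taken to mean any status below 500, and both SLIs are returned as percentages.

```python
from typing import List, Tuple

Request = Tuple[int, float]  # (HTTP status code, latency in milliseconds)


def availability_sli(requests: List[Request]) -> float:
    """Percentage of requests that did not fail with a server error."""
    good = sum(1 for status, _ in requests if status < 500)
    return 100.0 * good / len(requests)


def latency_sli(requests: List[Request], threshold_ms: float = 300.0) -> float:
    """Percentage of requests that completed within the threshold."""
    fast = sum(1 for _, latency in requests if latency <= threshold_ms)
    return 100.0 * fast / len(requests)


sample = [
    (200, 120), (200, 280), (500, 30), (200, 450), (200, 90),
    (200, 310), (200, 50), (200, 60), (200, 70), (200, 80),
]
print(f"Availability SLI: {availability_sli(sample):.1f}%")
print(f"Latency SLI:      {latency_sli(sample):.1f}%")
```

Note that the one failed request was fast (30 ms) and two successful ones were slow, so the two SLIs measure genuinely different things.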

SLO: Service Level Objective

An SLO is a target value for an SLI. It is an internal goal that your team sets.

// Example SLOs:

Availability SLO: 99.9% of requests succeed over a 30-day window
// 0.1% failure allowed = 43.2 minutes of downtime per month

Latency SLO: 99% of requests complete within 300ms
// 1% of requests can be slower than 300ms

// SLO is NOT 100%. Never set an SLO of 100%.
// 100% means ZERO failures allowed. Ever. Impossible and counterproductive.

SLA: Service Level Agreement

An SLA is a contract with your customer that includes consequences for failing to meet it (refunds, credits, contract termination). SLAs are typically looser than SLOs because you want to breach your SLO (internal alert) before you breach your SLA (legal/financial consequences).

Concept	Definition	Audience	Consequence of breach
SLI	The metric itself	Engineers	None (it's a measurement)
SLO	Target for the SLI	Engineering team	Error budget consumed, on-call pages
SLA	Contract with customer	Customers, legal	Refunds, credits, legal action

The Nines

Availability	Downtime/month (30 days)	Downtime/year	Example
99% (two 9s)	7.2 hours	3.65 days	Internal tools
99.9% (three 9s)	43.2 minutes	8.76 hours	Most SaaS products
99.95%	21.6 minutes	4.38 hours	Cloud provider services
99.99% (four 9s)	4.32 minutes	52.6 minutes	Payment systems, infrastructure
99.999% (five 9s)	26 seconds	5.26 minutes	911 systems, pacemakers

As a rule of thumb, each additional 9 costs roughly 10x more. Going from 99.9% to 99.99% does not require 10% more effort; it requires a fundamentally different architecture: redundancy, automated failover, cross-region replication, formal verification. Most applications should target 99.9% or 99.95%, not more.
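The downtime figures in the table follow from one multiplication. Here is a short sketch that derives them, assuming a 30-day month and a 365-day year; the function name is illustrative.

```python
def allowed_downtime_seconds(availability_pct: float, window_days: float) -> float:
    """Seconds of downtime permitted in the window at a given availability."""
    return window_days * 86_400 * (100.0 - availability_pct) / 100.0


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    month_min = allowed_downtime_seconds(pct, 30) / 60
    year_hr = allowed_downtime_seconds(pct, 365) / 3600
    print(f"{pct:>7}%  {month_min:8.2f} min/month  {year_hr:7.2f} h/year")
```

Each extra nine divides the allowance by ten, which is the arithmetic behind the steep cost curve.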

Choosing the Right SLO

How do you decide between 99.9% and 99.99%? It depends on your users and your dependencies:

// The dependency rule:
// Your SLO cannot be higher than your weakest dependency's SLO.

// If your database provides 99.95% availability,
// and your cloud provider provides 99.99% availability,
// your maximum achievable SLO is:

max_SLO = 99.95% × 99.99% = 99.94%

// Setting an SLO of 99.99% when your DB provides 99.95%
// is setting an impossible target. Your DB alone will violate it.

// With N dependencies, each at 99.9%:
combined = (0.999)^N
N=3: 99.7% combined availability
N=5: 99.5%
N=10: 99.0%
// More dependencies = lower combined reliability.
// This is why minimizing dependencies is a reliability strategy.
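The dependency rule is a product of probabilities. A minimal sketch follows, assuming every dependency is a hard, serial dependency (a request fails if any one of them is down) and that failures are independent; the function name is illustrative.

```python
from math import prod


def serial_availability(availabilities):
    """Combined availability when every dependency must be up.

    Assumes independent failures, which is optimistic if
    dependencies share infrastructure.
    """
    return prod(availabilities)


print(f"DB + cloud: {serial_availability([0.9995, 0.9999]):.2%}")
for n in (3, 5, 10):
    print(f"N={n:>2} at 99.9% each: {serial_availability([0.999] * n):.1%}")
```

Dependencies with fallbacks (caches, degraded modes) do not multiply in this way, which is one reason circuit breakers and graceful degradation raise the achievable SLO.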

SLI Selection: What to Measure

Choosing the right SLI is critical. A bad SLI creates a false sense of security. Here are the most common SLIs by service type:

Service Type	Recommended SLI	Why not something else?
Web API	% of requests returning 2xx within 300ms	Combines availability and latency in one metric. A slow success is a failure for the user.
Data pipeline	% of records processed within freshness target	Throughput alone is misleading — data can be processed but 3 hours late.
Storage system	% of reads returning correct data within 100ms	Durability (no data loss) + availability + latency.
Batch job	% of runs completing successfully within time budget	Start time is irrelevant; completion time matters.

The simulation shows how SLO targets map to allowed downtime. Adjust the SLO slider and see how much failure budget you have.

[Interactive simulation: SLO Calculator. Drag the SLO slider (default 99.9%) to see how much downtime you can afford per month and per year.]

Quiz: Your SLO is 99.9% availability over a 30-day window. It is day 15 and you have had 35 minutes of downtime. How much error budget remains?

8.2 minutes. The total error budget for 30 days at 99.9% is 43.2 minutes (30 × 24 × 60 × 0.001). You have consumed 35 minutes. Remaining: 43.2 - 35 = 8.2 minutes. You are almost out of budget with 15 days left. This should trigger a freeze on risky deployments.

43.2 minutes: the budget resets each day

You already breached your SLO

Chapter 4: Error Budgets

The SLO says 99.9% availability. That means 0.1% unavailability is allowed. This 0.1% is your error budget — the amount of unreliability you can tolerate before breaching your SLO.

Error budgets transform reliability from a vague aspiration into a concrete, spendable resource. You can spend your error budget on things that matter: faster feature releases, risky experiments, infrastructure migrations. As long as you stay within budget, you are meeting your reliability commitment.

How Error Budgets Work

// Monthly error budget at 99.9% SLO:
Total minutes in 30 days: 30 × 24 × 60 = 43,200 minutes
Error budget: 43,200 × 0.001 = 43.2 minutes of downtime

// Spending the budget:
Week 1: deploy caused 5 min outage → 38.2 min remaining
Week 2: DB failover caused 3 min blip → 35.2 min remaining
Week 3: deploy caused 20 min outage → 15.2 min remaining
Week 4: BUDGET LOW (35% left), canary-only deploys → 15.2 min must last the remaining 9 days

Error Budget Policies

Budget Level	Action
> 50% remaining	Full speed: ship features, run experiments, do migrations
25-50% remaining	Caution: all deploys require canary, no risky experiments
< 25% remaining	Freeze: no deployments except bug fixes. Focus on reliability improvements.
0% (budget exhausted)	Hard freeze: only reliability work until budget replenishes next cycle
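The policy table maps directly onto a small lookup. The sketch below replays the three outages from the weekly example against a 43.2-minute budget; the tier names are shortened labels for the rows above, and the handling of the exact 25% and 50% boundaries is an assumption, since the table does not say which side they fall on.

```python
BUDGET_MIN = 43.2  # 99.9% SLO over a 30-day window


def policy_tier(remaining_fraction: float) -> str:
    """Map the remaining share of error budget to a policy tier."""
    if remaining_fraction <= 0:
        return "hard freeze"
    if remaining_fraction < 0.25:
        return "freeze"
    if remaining_fraction <= 0.50:
        return "caution"
    return "full speed"


remaining = BUDGET_MIN
for week, outage_min in enumerate([5, 3, 20], start=1):
    remaining -= outage_min
    share = remaining / BUDGET_MIN
    print(f"week {week}: {remaining:.1f} min left ({share:.0%}) -> {policy_tier(share)}")
```

After the third outage, 15.2 of 43.2 minutes is about 35% of the budget, which the table places in the caution tier.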

Error budgets align incentives. Without error budgets, product teams want to ship fast and ops teams want to minimize risk. They are in constant conflict. Error budgets give them a shared framework: "We have 30 minutes of budget left. Should we spend 10 minutes of risk on this deploy? Is the feature worth it?" The conversation shifts from "can we deploy?" to "should we spend budget?"

Error Budget-Based Decision Making

Error budgets change how organizations make engineering decisions. Here are concrete examples:

// Scenario: Should we do a major database migration?

SLO: 99.9% (43.2 min budget per month)
Current budget consumed: 10 min (23%)
Remaining: 33.2 min (77%)

Migration estimated risk: 15 min of potential downtime
Migration estimated probability of issue: 30%
Expected cost: 15 × 0.3 = 4.5 min expected

Decision: Proceed. Even worst case (15 min), we'd have 18.2 min remaining.
Expected case leaves 28.7 min. Well within budget.

// Same scenario at end of month with 8 min remaining:
Expected cost: 4.5 min. Worst case: 15 min > 8 min remaining.
Decision: Defer to next month when budget resets.

What Consumes Error Budget

Not all budget consumption is equal. Some is expected (planned maintenance), some is unplanned but tolerable (transient errors), and some requires investigation.

Source	Typical Cost	Action
Planned maintenance	5-15 min	Pre-approved, scheduled during low traffic
Deploy rollout	0-2 min	Normal if within canary detection window
Transient network blip	0.5-2 min	Expected noise, no action needed
Bad deploy	5-30 min	Post-mortem, improve canary/rollback speed
Dependency outage	Variable	Evaluate circuit breakers and fallbacks
Infrastructure failure	Variable	Evaluate fault domain spread, redundancy

Burn Rate Alerting: The Math

Traditional alerting fires when a metric crosses a threshold. Burn rate alerting fires when you are consuming error budget too fast — even if the absolute error rate is below your threshold.

// Error budget burn rate:
// If you consumed 1 month of budget in 1 hour,
// your burn rate = 720x (30 days × 24 hours / 1 hour)

budget_total = 43.2 min (99.9% SLO over 30 days)
budget_consumed_last_hour = 2 min
burn_rate = (2 / 43.2) × (30 × 24) = 33.3x

// At 33x burn rate, budget exhausted in:
time_to_exhaust = 30 days / 33.3 = 21.6 hours

// Alert thresholds:
burn_rate > 14.4x (budget gone in 2 days) → PAGE (P1)
burn_rate > 6x (budget gone in 5 days) → TICKET (P2)
burn_rate > 3x (budget gone in 10 days) → REVIEW (P3)
burn_rate > 1x (on track to exhaust) → DASHBOARD (P4)
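The burn-rate arithmetic and thresholds above can be sketched as two small functions. The function names are illustrative; the thresholds are the ones listed in the block above.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(consumed_min: float, window_hours: float, budget_min: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return (consumed_min / budget_min) * (SLO_PERIOD_HOURS / window_hours)


def severity(rate: float) -> str:
    """Map a burn rate to the alert tiers listed above."""
    if rate > 14.4:
        return "PAGE (P1)"
    if rate > 6:
        return "TICKET (P2)"
    if rate > 3:
        return "REVIEW (P3)"
    if rate > 1:
        return "DASHBOARD (P4)"
    return "OK"


rate = burn_rate(consumed_min=2, window_hours=1, budget_min=43.2)
hours_left = SLO_PERIOD_HOURS / rate  # time to burn a full budget at this pace
print(f"burn rate {rate:.1f}x -> {severity(rate)}, full budget gone in {hours_left:.1f} h")
```

A rate of exactly 1x means the budget runs out precisely at the end of the window, so anything above 1x deserves at least a look.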

The simulation below tracks an error budget over a 30-day period. Deployments and incidents consume budget. Watch the budget drain and see when policies activate.

[Interactive simulation: Error Budget Tracker. 30-day error budget at 99.9% SLO (43.2 min); ship features and handle incidents, and watch the budget drain.]

Quiz: Your SLO is 99.9%. It's day 20 of the month and you have 5 minutes of error budget remaining. A product manager wants to deploy a major new feature. What do you do?

Deploy it: 5 minutes is plenty for the remaining 10 days

Refuse: freeze all deployments

Have the trade-off conversation. With 5 minutes left over 10 days, any deployment risk must be near-zero. If the feature can wait until the next budget cycle (10 days), defer it. If it's critical, use maximum safety: feature flag off by default, canary at 1%, and have a rollback ready. The error budget forces an honest conversation about risk vs. value.

Chapter 5: Alerting

An alert wakes a human up at 3 AM. That alert better be important. If it is a false alarm — or a real alarm that does not require immediate human action — you have just wasted someone's sleep, eroded trust in your alerting system, and contributed to alert fatigue.

Symptom-Based vs. Cause-Based Alerting

The most important principle in alerting: alert on symptoms, not causes.

Type	Example	Problem
Cause-based (bad)	"CPU > 90%"	CPU at 92% might be fine if latency is normal. You page someone for a non-problem.
Symptom-based (good)	"Error rate > 1% for 5 minutes"	This means users are affected RIGHT NOW. Always actionable.

The golden rule of alerting. Every page (wake-someone-up alert) must be: (1) urgent — requires action within minutes, not hours; (2) actionable — the on-call engineer can DO something about it; (3) real — not a false positive. If any of these is false, it should not be a page. Make it a ticket or a dashboard annotation instead.

Alert Severity Levels

Level	Response Time	Channel	Example
Critical (P1)	Immediate (minutes)	Page on-call, phone call	Service completely down, data loss
High (P2)	Hours	Slack notification, ticket	Error rate elevated but still within SLO
Medium (P3)	Days	Ticket	Disk usage growing, needs cleanup
Low (P4)	Weeks	Dashboard annotation	Certificate expires in 60 days

Multi-Window, Multi-Burn-Rate Alerting

The most sophisticated alerting approach ties directly to error budgets. Instead of fixed thresholds, alert when you are burning error budget too fast:

// Error budget burn rate alerting:
Monthly budget: 43.2 minutes (99.9% SLO)

// If we burn 1 month of budget in 1 hour (burn rate = 720x):
// That's a CRITICAL incident. Page immediately.

// If we burn 1 month of budget in 3 days (burn rate = 10x):
// That's a HIGH priority. Notify the team, investigate today.

// If we burn 1 month of budget in 10 days (burn rate = 3x):
// That's a MEDIUM priority. File a ticket, fix this week.
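The "multi-window" part means each rule checks two windows: a long one to confirm the budget burn is significant, and a short one to confirm it is still happening. The sketch below is one way to express that. The window pairs and thresholds follow a commonly cited pattern from the Google SRE Workbook and should be treated as assumptions to tune, not as values this lesson prescribes; the input dictionary of error ratios per window is also hypothetical.

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO  # allowed error ratio

# (long window, short window, burn-rate threshold, action): illustrative values
RULES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "ticket"),
]


def evaluate(error_ratio_by_window):
    """Return the first action whose long AND short windows both exceed the threshold."""
    for long_w, short_w, threshold, action in RULES:
        long_burn = error_ratio_by_window[long_w] / ERROR_BUDGET
        short_burn = error_ratio_by_window[short_w] / ERROR_BUDGET
        if long_burn > threshold and short_burn > threshold:
            return action
    return None


ongoing = {"1h": 0.02, "5m": 0.03, "6h": 0.005, "30m": 0.02}
recovered = {"1h": 0.02, "5m": 0.0, "6h": 0.004, "30m": 0.001}
slow_burn = {"1h": 0.008, "5m": 0.008, "6h": 0.008, "30m": 0.008}

print(evaluate(ongoing), evaluate(recovered), evaluate(slow_burn))
```

The short window is what stops a page from firing for an incident that has already ended: in the `recovered` case the last hour still looks bad, but the last five minutes are clean.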

On-Call Practices

An alert pages a human. That human is the on-call engineer — someone who has agreed to be reachable 24/7 for a rotation period (usually one week). On-call is one of the most important and demanding roles in engineering. Bad on-call practices burn out engineers and degrade incident response quality.

Practice	Bad	Good
Rotation length	1 month (burnout)	1 week, with swap ability
Alert volume	>2 pages per shift (fatigue)	≤2 pages per week
Runbooks	None ("figure it out")	Every alert links to a runbook with step-by-step remediation
Post-incident	Blame the engineer	Blameless post-mortem, focus on systemic fixes
Compensation	None	Extra pay or comp time for on-call shifts

Incident Response Framework

When a page fires, the on-call engineer should follow a structured response:

1. Acknowledge (2 min)

Ack the page. Open the dashboard. Assess severity. Communicate to the team channel.

↓

2. Mitigate (5-15 min)

Stop the bleeding. Roll back a deploy, disable a feature flag, or fail over to a standby. Do not debug yet; mitigate first.

↓

3. Diagnose (15-60 min)

Now find the root cause. Dashboard → traces → logs → recent changes. Identify the change or condition that triggered the incident.

↓

4. Fix & Verify (variable)

Apply the fix. Verify metrics return to baseline. Monitor for 30 minutes.

↓

5. Post-mortem (next business day)

Write a blameless post-mortem. Timeline, root cause, impact, action items. Share widely.

Mitigate first, diagnose second. The #1 mistake in incident response is trying to understand WHY before stopping the damage. A 10-minute mitigation (roll back the deploy) saves more user pain than a 60-minute diagnosis that leads to a more elegant fix. Stop the bleeding, then do surgery.

The simulation shows different alert configurations and their impact on on-call burden. Compare symptom-based and cause-based alerting.

[Interactive simulation: Alert Configuration Comparison. Simulates one week of on-call alerts under cause-based (noisy) vs. symptom-based (accurate) alerting.]

Quiz: You have an alert "CPU > 90%." It fires 15 times this week. 13 times, the CPU spike resolved itself in 2 minutes with no user impact. 2 times, it was a real problem. What should you do?

Increase the threshold to 95% to reduce false alarms

Add more CPU capacity so it never hits 90%

Replace the cause-based alert (CPU > 90%) with a symptom-based alert (error rate > 1% for 5 minutes). The CPU alert fires on CPU spikes regardless of user impact. A symptom-based alert only fires when users are affected. The 2 real incidents would have triggered the symptom alert; the 13 non-incidents would not. This reduces alert volume by 87% while catching 100% of real problems.

Chapter 6: Structured Logging

A traditional log line looks like this: 2024-01-15 01:47:23 ERROR: Payment failed for user 12345, amount=$50.00, reason=timeout. It is human-readable. It is also impossible to query efficiently. Want to find all payment failures over $100 in the last hour? You are parsing strings with regex.

Structured logging emits log events as key-value pairs (usually JSON), making them queryable by any field:

// Unstructured (bad for queries):
"2024-01-15 01:47:23 ERROR: Payment failed for user 12345, amount=$50.00"

// Structured (queryable):
{
"timestamp": "2024-01-15T01:47:23Z",
"level": "ERROR",
"service": "payment-service",
"event": "payment_failed",
"user_id": "12345",
"amount": 50.00,
"currency": "USD",
"reason": "timeout",
"trace_id": "abc-123-def-456",
"duration_ms": 30042
}

What to Log

Always Log	Never Log
Timestamp (ISO 8601 UTC)	Passwords, tokens, API keys
Log level (ERROR, WARN, INFO, DEBUG)	Full credit card numbers
Service name and version	Personal health information
Trace ID and span ID	Social security numbers
Request ID	Full request/response bodies (too large)
User ID (if applicable)	Raw database queries with parameters
Duration of the operation	Anything that violates GDPR/CCPA/HIPAA
Error type and message
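One way to enforce the "Never Log" column is to scrub fields before they are serialized. This is a minimal sketch: the set of sensitive key names is hypothetical and would need to match your own field names, and only top-level keys are checked (nested objects and values embedded in free text are not handled).

```python
import json

# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {"password", "token", "api_key", "card_number", "ssn"}


def redact(fields: dict) -> dict:
    """Return a copy of the fields with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


event = {"event": "login_failed", "user_id": "12345", "password": "hunter2"}
print(json.dumps(redact(event)))
```

A deny-list is easy to start with but fails open when a new sensitive field appears; an allow-list of known-safe fields is the stricter alternative.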

Correlation is the killer feature. The trace_id in every log entry lets you find ALL log lines related to a single request, across ALL services. User reports "my payment failed." You search logs for their user_id, find the trace_id of the failing request, and see every log line from every service that request touched. Without trace_id, you are searching millions of log lines by timestamp and hoping.

Structured Logging Implementation

python
import json
from datetime import datetime, timezone

class StructuredLogger:
    """Emits one JSON object per log event on stdout."""

    def __init__(self, service_name, version):
        self.service = service_name
        self.version = version

    def log(self, level, event, **kwargs):
        entry = {
            # Use UTC explicitly so the trailing "Z" is accurate
            # (time.strftime without arguments formats local time).
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": level,
            "service": self.service,
            "version": self.version,
            "event": event,
            **kwargs,  # trace_id, user_id, duration_ms, etc.
        }
        print(json.dumps(entry))  # stdout, captured by log agent

# Usage:
logger = StructuredLogger("payment-service", "v2.3.1")
logger.log("ERROR", "payment_failed",
    trace_id="abc-123",  # in a real handler, taken from the incoming request context
    user_id="12345",
    amount=50.00,
    reason="timeout",
    duration_ms=30042,
)
# Example output (one line, wrapped here; the timestamp is the current UTC time):
# {"timestamp": "2024-01-15T01:47:23Z", "level": "ERROR",
#  "service": "payment-service", "version": "v2.3.1",
#  "event": "payment_failed", "trace_id": "abc-123",
#  "user_id": "12345", "amount": 50.0, "reason": "timeout",
#  "duration_ms": 30042}

Log Levels: When to Use Each

Level	When	Example	Volume
ERROR	Something failed that affects user experience	Payment declined, query timeout	Low (<1% of requests)
WARN	Something unexpected but handled	Retry succeeded, cache miss, slow query	Low-medium
INFO	Significant business events	Order placed, user signed up, deploy started	Medium
DEBUG	Detailed execution flow (disabled in production)	Query params, function entry/exit	Very high (off in prod)

Log the boundaries, not the internals. Log when a request arrives, when it calls a dependency, and when it completes. Do not log every internal function call — that creates noise. A well-instrumented service needs 3-5 log lines per request, not 50.

The simulation shows how structured vs. unstructured logging affects incident diagnosis time.

[Interactive simulation: Structured vs. Unstructured Log Search. Searches for "all payment failures over $100 in the last hour" and compares unstructured (grep) with structured (indexed query) search times.]

Quiz: A user reports their request failed. You want to see every log line from every service involved in that request. What field in your structured logs enables this?

The timestamp: you can search by time

The user_id: you can find all their requests

The trace_id. Every service logs the same trace_id for a single request as it propagates through the system. Searching by trace_id returns exactly the log lines from that one request, across all services, in order. Timestamp is too broad (many requests per second). User_id gives all their requests, not one specific request.

Chapter 7: Distributed Tracing

A user's request enters your system at the API gateway. It is routed to the order service, which calls the inventory service, which calls the database, which calls the cache. The entire round trip takes 2.3 seconds. Where did the time go?

Logs tell you what happened at each service. Metrics tell you aggregate behavior. But neither tells you the journey of a single request through the entire system. That is what distributed tracing does.

Anatomy of a Trace

A trace represents the entire lifecycle of a request. It is composed of spans, where each span represents one operation (one service call, one database query, one cache lookup).

// Trace: user checkout request
// trace_id: abc-123

Span 1: API Gateway [0ms ---------- 2300ms] (total: 2300ms)
  Span 2: Order Service [5ms ------- 2295ms] (total: 2290ms)
    Span 3: Inventory Check [10ms -- 50ms] (total: 40ms)
    Span 4: Payment Service [55ms ------- 2250ms] (total: 2195ms) !!!
      Span 5: DB Query [60ms ------- 2200ms] (total: 2140ms) !!!
    Span 6: Send Confirmation [2255ms -- 2290ms] (total: 35ms)

The trace makes it immediately obvious: the Payment Service took 2195ms, and within that, the DB query took 2140ms. That is the bottleneck. Without tracing, you would know "the request took 2.3 seconds" but not where in the chain the time was spent.
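Reading the bottleneck off a trace amounts to computing each span's self time: its own duration minus the time spent in its children. The sketch below applies that to the spans above. The parent links are inferred from the timings (the DB query sits inside the Payment Service, which sits inside the Order Service) and are an assumption, as is the simplification that sibling spans do not overlap.

```python
# (name, parent, start_ms, end_ms); parent links are inferred, not given.
SPANS = [
    ("API Gateway", None, 0, 2300),
    ("Order Service", "API Gateway", 5, 2295),
    ("Inventory Check", "Order Service", 10, 50),
    ("Payment Service", "Order Service", 55, 2250),
    ("DB Query", "Payment Service", 60, 2200),
    ("Send Confirmation", "Order Service", 2255, 2290),
]


def self_times(spans):
    """Duration of each span minus the durations of its direct children."""
    duration = {name: end - start for name, _, start, end in spans}
    self_ms = dict(duration)
    for name, parent, _, _ in spans:
        if parent is not None:
            self_ms[parent] -= duration[name]
    return self_ms


result = self_times(SPANS)
bottleneck = max(result, key=result.get)
for name, ms in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} self time {ms:>5} ms")
print("Bottleneck:", bottleneck)
```

Ranking by total duration would point at the API Gateway, which merely waits on everything else; ranking by self time points at the DB query.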

Trace Context Propagation

For tracing to work, every service must propagate the trace context (trace_id, span_id, parent_span_id) to the next service. The standard is the W3C Trace Context specification, which uses HTTP headers:

// W3C Trace Context headers:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)

// Each service:
// 1. Reads traceparent from incoming request
// 2. Creates a new span with the trace_id and its own span_id
// 3. Sets parent_span_id to the incoming span_id
// 4. Includes traceparent in outgoing requests to downstream services
// 5. Reports the span to the tracing backend when the operation completes
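Steps 1 through 4 can be sketched with the standard library alone. This is an illustrative simplification, not a full implementation of the specification: it accepts only the exact four-field lowercase-hex layout, ignores the companion `tracestate` header, and, when no valid incoming header exists, starts a new trace marked as sampled (an assumption). In practice a tracing SDK such as OpenTelemetry handles this for you.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Create this service's span ID and the header for outgoing calls."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    flags = ctx["flags"] if ctx else "01"
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{trace_id}-{span_id}-{flags}", span_id


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing, my_span_id = child_traceparent(incoming)
print("incoming:", incoming)
print("outgoing:", outgoing)
```

The trace ID is carried through unchanged while the parent field is replaced with this service's own span ID, which is how the backend reassembles the parent-child tree.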

The 1% problem. At high traffic, tracing every request generates enormous data volumes. Most systems use sampling: trace 1% of requests (or 100% of error requests, or 100% of slow requests). The trade-off: you might not have a trace for the specific request a user complains about. Sophisticated systems use tail-based sampling: decide whether to keep a trace after it completes, based on whether it was slow or errored.

Sampling Strategies

Strategy	How it works	Pro	Con
Head-based (random)	Decide at the start: trace this request with 1% probability	Simple, low overhead	May miss interesting traces. Cannot decide based on outcome.
Tail-based	Buffer all spans. After request completes, decide to keep based on duration, error status, or other criteria	Keeps all interesting traces	Requires buffering all spans temporarily. Higher memory cost.
Priority-based	100% for errors, 100% for slow, 10% for specific endpoints, 1% for everything else	Balances coverage and cost	Complex configuration
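
The difference between the strategies comes down to when the decision is made and what information is available at that point. The sketch below is one possible shape for those decisions. The function names, the 1000 ms slow threshold, and the use of a CRC32 hash of the trace ID are assumptions for illustration, not how any specific collector implements sampling.

```python
import zlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at the start, from the trace ID alone.

    Hashing the trace ID (instead of calling random()) means every service
    that sees the same trace ID reaches the same decision.
    """
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < rate * 10_000


def tail_keep(duration_ms: float, had_error: bool, trace_id: str,
              slow_ms: float = 1000.0, base_rate: float = 0.01) -> bool:
    """Tail-based with priorities: decide after the request has completed."""
    if had_error or duration_ms >= slow_ms:
        return True                          # keep 100% of errors and slow requests
    return head_sample(trace_id, base_rate)  # keep a small share of the rest


print(tail_keep(2300, False, "4bf92f3577b34da6a3ce929d0e0e4736"))
```

Note what the tail-based function needs that the head-based one does not: the duration and error status. Those only exist once the request has finished, which is why tail-based sampling has to buffer spans first.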

// Trace data volume calculation:

Traffic: 10,000 RPS
Average spans per trace: 8 (typical microservice chain)
Average span size: 500 bytes
Data per trace: 8 × 500 = 4 KB

// At 100% sampling:
10,000 × 4 KB = 40 MB/second ≈ 3.5 TB/day

// At 1% sampling:
100 × 4 KB = 400 KB/second ≈ 35 GB/day

// At 1% + 100% errors (0.1% error rate):
(100 + 10) × 4 KB = 440 KB/second ≈ 38 GB/day
// Captures ALL errors for roughly 38 GB/day, about 1.1% of the full-sampling volume.
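
The same arithmetic as a small function, so you can plug in your own traffic numbers. This is a back-of-the-envelope sketch: the function name is hypothetical, it uses decimal units (1 GB = 10^9 bytes), and like the hand calculation it counts error traces on top of the random sample, which slightly overestimates because a few errors are already in the sample. It prints unrounded daily totals, so they may differ by a rounding step from figures worked out by hand.

```python
SECONDS_PER_DAY = 86_400


def trace_bytes_per_day(rps: float, spans_per_trace: int, span_bytes: int,
                        sample_rate: float, error_rate: float = 0.0) -> float:
    """Daily trace volume when keeping `sample_rate` of traffic plus every error."""
    traces_per_second = rps * (sample_rate + error_rate)
    return traces_per_second * spans_per_trace * span_bytes * SECONDS_PER_DAY


scenarios = [("100%", 1.0, 0.0), ("1%", 0.01, 0.0), ("1% + all errors", 0.01, 0.001)]
for label, rate, errors in scenarios:
    gb = trace_bytes_per_day(10_000, 8, 500, rate, errors) / 1e9
    print(f"{label}: {gb:,.1f} GB/day")
```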

The Three Pillars Connected

The real power of observability comes from connecting the three pillars. A metric tells you something is wrong. A trace shows you where. A log tells you exactly what happened.

1. Alert fires

Metric: "Error rate > 1% for 5 minutes on payment-service"

↓ click "View Exemplar Trace"

2. Open trace

Trace: Request spent 29s in DB span. All other spans normal.

↓ click on DB span for logs

3. Read logs

Log: "WARN: full table scan on payments table, 2.1M rows, missing index on created_at"

↓ search for related deploys

4. Find root cause

Deploy at 01:45 added query without index. Rollback fixes the issue.

This workflow — metric alert → trace → log → root cause — takes 5-10 minutes with good observability. Without it, the same investigation takes 30-60 minutes of guessing and grepping.
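
The jump from trace to logs in step 3 only works if every log record carries the trace ID. A minimal sketch of that lookup, assuming logs are JSON lines with `trace_id` and `ts` fields (the field names, the sample records, and the `logs_for_trace` helper are illustrative, not a real log backend's API):

```python
import json


def logs_for_trace(log_lines: list[str], trace_id: str) -> list[dict]:
    """Return the parsed JSON log records for one trace, oldest first."""
    records = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines instead of failing the search
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            records.append(record)
    return sorted(records, key=lambda r: r.get("ts", ""))


# Made-up sample records for illustration.
sample_lines = [
    '{"ts": "2024-01-15T01:47:23Z", "level": "WARN", "trace_id": "t1", "msg": "full table scan on payments table"}',
    "plain text line without structure",
    '{"ts": "2024-01-15T01:47:20Z", "level": "INFO", "trace_id": "t1", "msg": "query started"}',
    '{"ts": "2024-01-15T01:47:21Z", "level": "INFO", "trace_id": "t2", "msg": "unrelated request"}',
]

for record in logs_for_trace(sample_lines, "t1"):
    print(record["ts"], record["level"], record["msg"])
```

The plain-text line is silently skipped, which is the practical argument for structured logs: a record without a `trace_id` field cannot take part in this workflow at all.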

The simulation below shows a distributed trace. Click on spans to see details. Inject a slow dependency to see where the bottleneck appears.

Distributed Trace Viewer

A request flows through 4 services. Inject a slow dependency to see the bottleneck in the trace.

Quiz: A request takes 5 seconds. The trace shows: Gateway (5ms overhead), Auth (20ms), OrderService (4900ms), with OrderService calling DB (4850ms). Where is the bottleneck?

Options:
(a) OrderService, because it has the longest span
(b) The Gateway, because it is the entry point
(c) The database

Answer: (c) The database. The DB span accounts for 4850ms of the total 5000ms. OrderService itself only adds 50ms of overhead. Gateway and Auth are fast. The fix is to optimize the DB query (add an index, reduce data scanned, cache the result), not to optimize OrderService or Gateway.

Chapter 8: Live Dashboard Simulator

This is the showcase chapter. A live dashboard showing all four golden signals, with alert indicators, trace links, and the ability to inject incidents. This is what the on-call engineer sees at 3 AM.

How to use the dashboard. Click "Start System" to begin generating real-time metrics. The four golden signals update every second. Inject incidents to see alerts fire and metrics degrade. Click "Trace Slow Request" to see a distributed trace of a slow request.

Live Operations Dashboard

Four golden signals with real-time data. Inject incidents. Watch alerts. Trace requests.

What a Good Dashboard Shows

Section	Contents	Why
Top banner	Overall status (green/yellow/red), active alerts count	Instant health assessment in 1 second
Golden signals	Latency (p50/p99), traffic (RPS), errors (%), saturation (%)	The 4 things that matter most
SLO burn rate	Error budget consumption, burn rate chart	Are we on track for the month?
Top errors	Most common error types by count	What is breaking most?
Recent deployments	Timeline of deploys with status	Correlate incidents with changes
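
The SLO burn rate row is worth making concrete. Burn rate is commonly defined as the observed error rate divided by the error budget (1 minus the SLO target): a burn rate of 1 spends the budget exactly over the SLO window, and anything above 1 exhausts it early. A minimal sketch, where the function names and the 30-day window are assumptions for illustration:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_rate / budget


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days a full budget lasts if the burn rate stays constant."""
    return float("inf") if rate <= 0 else window_days / rate


# A 1% error rate against a 99.9% SLO burns the budget about 10x too fast.
rate = burn_rate(error_rate=0.01, slo_target=0.999)
print(f"burn rate {rate:.1f}, budget gone in {days_until_exhausted(rate):.1f} days")
```

The guard against a 100% target is the code version of "never set 100%": with no budget, every single error is an infinite burn rate.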

Dashboard Anti-Patterns

A bad dashboard is worse than no dashboard, because it gives a false sense of visibility. Here are the most common mistakes:

Anti-Pattern	Problem	Fix
Too many charts	30 charts = cognitive overload. Engineer can't find the signal.	5-7 charts max per dashboard. One dashboard per concern.
No context	Line goes up. Is that good or bad?	Show thresholds, baselines, and SLO targets on every chart.
Average-only	"Average latency is 50ms" hides p99 of 3 seconds.	Show p50, p95, p99 as separate lines or a heatmap.
Stale data	Dashboard shows data from 5 minutes ago.	Refresh every 10-15 seconds. Show "last updated" timestamp.
No drill-down	Error rate is high. Now what?	Click on a data point to jump to related traces and logs.
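
The average-only anti-pattern is easy to demonstrate with numbers. The sketch below uses a made-up sample in which 2% of requests are slow, and a simple nearest-rank percentile (one of several common percentile definitions, chosen here because it is easy to verify by hand).

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample: 98 fast requests at 20 ms, 2 slow requests at 3 seconds.
latencies_ms = [20.0] * 98 + [3000.0] * 2

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean:.1f}ms p50={p50:.0f}ms p99={p99:.0f}ms")
```

The mean lands somewhere between the two groups and describes neither of them: the typical user sees 20 ms, and one user in fifty waits 3 seconds. Only the percentile lines make both facts visible.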

The Metrics Pipeline

How do metrics get from your application to a dashboard? The pipeline has four stages:

1. Instrumentation

Application code emits metrics (counters, gauges, histograms) via a client library. Example: Prometheus client, OpenTelemetry SDK.

↓

2. Collection

A metrics agent (Prometheus scraper, OTel Collector, StatsD) pulls or receives metrics from applications every 10-15 seconds.

↓

3. Storage

A time-series database (Prometheus, InfluxDB, Cortex, Thanos) stores metrics with timestamps. Retention is typically 30-90 days at full resolution, depending on configuration and storage budget.

↓

4. Visualization

A dashboard tool (Grafana, Datadog, New Relic) queries the TSDB and renders charts. Engineers build queries in PromQL, InfluxQL, or similar.

# Example: instrumenting a Python service with the Prometheus client library
from prometheus_client import Counter, Gauge, Histogram

# Counter: total requests (monotonically increasing)
request_total = Counter('http_requests_total',
    'Total HTTP requests', ['method', 'endpoint', 'status'])

# Histogram: request duration (distribution)
request_duration = Histogram('http_request_duration_seconds',
    'HTTP request duration',
    buckets=[.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10])

# Gauge: current in-flight requests
in_flight = Gauge('http_requests_in_flight',
    'Currently processing requests')

# Three metric types cover almost every use case:
# Counter for "how many?" (requests, errors, bytes)
# Histogram for "how long?" (latency, sizes)
# Gauge for "how much right now?" (queue depth, connections)
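
It helps to see what a histogram actually stores. Prometheus-style histograms do not keep individual observations; they keep one cumulative counter per bucket upper bound, plus a catch-all +Inf bucket. The sketch below mimics that bookkeeping with the standard library only. It is an illustration of the idea, not the client library's implementation, and `cumulative_counts` is a hypothetical helper.

```python
from bisect import bisect_left

# Same upper bounds (in seconds) as the Histogram defined above.
BUCKETS = [.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]


def cumulative_counts(observations: list[float]) -> dict[str, int]:
    """Count observations at or below each upper bound, plus a +Inf bucket."""
    per_bucket = [0] * (len(BUCKETS) + 1)
    for value in observations:
        # Index of the first bound >= value; past the end means +Inf.
        per_bucket[bisect_left(BUCKETS, value)] += 1

    counts: dict[str, int] = {}
    running = 0
    labels = [str(bound) for bound in BUCKETS] + ["+Inf"]
    for label, n in zip(labels, per_bucket):
        running += n
        counts[label] = running  # cumulative: includes all smaller buckets
    return counts


# Made-up request durations in seconds.
counts = cumulative_counts([0.004, 0.03, 0.03, 0.2, 12.0])
for bound, total in counts.items():
    print(f"le={bound}: {total}")
```

This is why bucket boundaries matter: a percentile computed from these counters can only be as precise as the bucket that contains it. The 12-second request above is visible only as "more than 10 seconds".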

Chapter 9: Connections

Observability is the final piece of the distributed systems reliability stack. You now have the complete picture: what fails, how to isolate it, how to survive it at runtime, how to ship safely, and how to see what is happening in production.

The Complete Stack

Layer	Topic	Question it answers
Architecture	Failure Modes & Isolation	What can fail? How do we contain it?
Runtime	Resiliency Patterns	How do we handle failures gracefully in code?
Delivery	Testing & Deployment	How do we ship changes safely?
Visibility	Observability & Operations (this lesson)	How do we see what's happening and respond?

Key Takeaways

1. Measure the four golden signals. Latency (percentiles!), traffic, errors, saturation. If you can only have four metrics, make it these four.

2. RED for services, USE for resources. Rate/Errors/Duration for request-driven services. Utilization/Saturation/Errors for hardware resources.

3. SLOs are budgets, not aspirations. Set them based on user needs. Never set 100%. Use error budgets to make deployment risk decisions.

4. Alert on symptoms, not causes. "Error rate > 1%" pages a human. "CPU > 90%" creates a ticket. Every page must be urgent, actionable, and real.

5. Structure your logs. JSON with trace_id lets you query efficiently and correlate across services. Never log secrets.

6. Trace the critical path. Distributed tracing shows where time is spent. Sample at 1% in production, 100% for errors and slow requests.

7. Dashboards should answer questions in seconds. Top banner for status, golden signals for detail, drill-down for investigation.

"Observability is not about collecting data. It is about being able to ask arbitrary questions of your system and get answers without deploying new code." — Charity Majors, CTO of Honeycomb

Final quiz: You are building a new microservices platform. What is the minimum observability setup you need before launching to production?

Options:
(a) Dashboards showing CPU and memory per service
(b) Log aggregation so you can grep for errors
(c) All three pillars

Answer: (c) All three pillars: metrics (four golden signals per service with percentile latencies, SLO tracking, alerting on symptoms), structured logs (JSON with trace_id, queryable), and distributed tracing (W3C trace context propagation, 1% sampling). Plus: a dashboard showing golden signals, error budgets, and recent deployments. Without all three, you can detect problems (metrics) but not diagnose them (traces) or understand the narrative (logs).