Metrics, SLOs, alerting, traces, dashboards — seeing inside your system before users tell you it's broken.
It is 2 AM. Your phone buzzes. "CRITICAL: Payment service error rate above 5%." You open your laptop. The payment service is returning 500 errors on 8% of requests. Why?
Is it the payment service itself? The database it calls? The network between them? A bad deploy that went out 30 minutes ago? A traffic spike from a marketing campaign? A dependency that rate-limited you? Without observability, you are flying blind — guessing, grepping through log files, restarting services and hoping.
With observability, you open a dashboard. You see the error rate spiked at 1:47 AM. You click on a failing request. A trace shows the request spent 29 seconds waiting for a database query. You look at database metrics: CPU is at 98% because an unindexed query is doing a full table scan. You find the deploy that introduced the query. You roll back. Error rate drops to 0.1% within 2 minutes. Total incident time: 11 minutes.
Observability has three data types, often called the three pillars:
| Pillar | What it is | What it answers | Example |
|---|---|---|---|
| Metrics | Numeric measurements over time | "What is happening?" (rates, gauges, distributions) | Request rate: 1200 RPS, p99 latency: 230ms |
| Logs | Timestamped text records of events | "What exactly happened?" (the narrative) | "2024-01-15 01:47:23 ERROR: query timeout after 30s, query=SELECT..." |
| Traces | End-to-end path of a request through services | "Where did the time go?" (the journey) | Frontend → Gateway (2ms) → OrderService (5ms) → DB (29s) !!! |
The simulation below shows the difference between flying blind and having observability. A service starts failing. Without observability, you see "errors." With observability, you see exactly where and why.
An incident occurs. Compare diagnosis time with and without observability.
If you could only measure four things about your service, what would they be? Site Reliability Engineering practice converges on the same four, known as the golden signals: latency, traffic, errors, and saturation.
Latency: how long does it take to serve a request? Specifically, the distribution of latencies, not just the average. A service with 50ms average latency might have a p99 of 2 seconds, meaning 1% of requests take at least 40x longer than the average.
This is why p99 is the most important latency metric. It represents the experience of your worst-affected users, who are often your highest-value users (they use the product frequently, so they hit the tail more often).
There are two ways to store latency data in a metrics system:
| Type | How it works | Pros | Cons |
|---|---|---|---|
| Histogram | Counts requests in predefined buckets: 0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+ | Aggregatable across servers. Cheap to store. | Bucket boundaries must be chosen in advance. Precision limited by bucket size. |
| Summary | Estimates percentiles inside each process over a sliding time window | Accurate percentiles without predefined buckets. | Cannot be aggregated across servers (the p99 of per-server p99s is not the p99 of all requests). Costs more memory and CPU in the application. |
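The aggregation difference in the table can be sketched in a few lines. This is a minimal illustration, not any particular metrics library: the bucket bounds mirror the example row above, and `observe`, `merge`, and `percentile_upper_bound` are hypothetical helper names.

```python
from bisect import bisect_left

# Upper bound of each bucket in milliseconds; the last bucket is open-ended.
BUCKET_BOUNDS_MS = [10, 50, 100, 500, float("inf")]


def observe(counts, latency_ms):
    """Increment the first bucket whose upper bound is >= latency_ms."""
    counts[bisect_left(BUCKET_BOUNDS_MS, latency_ms)] += 1


def merge(*histograms):
    """Histograms from different servers aggregate by adding bucket counts."""
    return [sum(bucket) for bucket in zip(*histograms)]


def percentile_upper_bound(counts, q):
    """Upper bound of the bucket containing the q-th quantile."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BUCKET_BOUNDS_MS, counts):
        running += count
        if running >= target:
            return bound
    return BUCKET_BOUNDS_MS[-1]


server_a, server_b = [0] * 5, [0] * 5
for latency in [8] * 990 + [300] * 10:
    observe(server_a, latency)
for latency in [8] * 900 + [400] * 100:
    observe(server_b, latency)

fleet = merge(server_a, server_b)
print(percentile_upper_bound(server_a, 0.99))  # 10
print(percentile_upper_bound(server_b, 0.99))  # 500
print(percentile_upper_bound(fleet, 0.99))     # 500
```

Averaging the two per-server values (10 and 500) would give 255ms, which describes neither server nor the fleet. Merging bucket counts first gives the right answer, at the cost of only knowing the percentile to within one bucket.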
Traffic: how much demand is being placed on the system? For a web service: HTTP requests per second. For a database: queries per second. For a message queue: messages per second. Traffic tells you the current load level and helps you predict when you will need more capacity.
Errors: what fraction of requests fail? This includes explicit errors (HTTP 500s), implicit errors (HTTP 200 with wrong content), and policy violations (responses slower than a threshold). The error rate (errors per second or errors as a percentage of total requests) is more useful than the error count.
Saturation: how "full" is the system? Saturation measures how close you are to capacity: CPU utilization, memory usage, disk I/O, queue depth, connection pool usage. When saturation approaches 100%, performance degrades nonlinearly: a system at 90% CPU is not 10% slower than an idle one; queueing effects make it dramatically slower.
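One way to see the nonlinearity is the textbook M/M/1 queue, where mean response time is service time divided by (1 - utilization). That model is an assumption chosen for simplicity: real services have many workers and bursty arrivals, so treat the numbers as a shape, not a prediction. The 10ms service time is also hypothetical.

```python
def mean_response_time_ms(service_time_ms, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


# Hypothetical service that needs 10 ms of work per request.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{rho:.0%} busy -> {mean_response_time_ms(10, rho):.0f} ms")
```

Under this model the same 10ms of work takes about 20ms at 50% utilization, 100ms at 90%, and 1000ms at 99%.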
The simulation below shows all four golden signals for a live service. Traffic arrives at a steady rate, then spikes. Watch how the signals interact.
Four real-time charts. Inject a traffic spike or slow dependency to see all signals react.
The four golden signals tell you what to measure for any service. But different types of components have different natural metrics. The RED method and USE method give you targeted frameworks.
RED: for any request-driven service (API, web server, microservice), measure three things:
| Letter | Metric | What it tells you |
|---|---|---|
| R | Rate | Requests per second. How busy is the service? |
| E | Errors | Errors per second (or error rate %). Is the service broken? |
| D | Duration | Latency distribution (p50, p99). Is the service slow? |
RED is essentially the four golden signals minus saturation, focused on the user-facing experience. If you can only instrument three things per service, instrument RED.
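As a rough sketch, RED can be computed from a window of request records. The `Request` shape, the choice of HTTP 5xx as "error", and the nearest-rank percentile are all assumptions made for illustration; a real service would normally get these from its metrics library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests, window_seconds):
    """Rate, Errors, Duration for one observation window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p50_ms": None, "p99_ms": None}
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 0.50),
        "p99_ms": percentile(durations, 0.99),
    }


# 98 fast successes and 2 slow failures observed over 10 seconds.
window = [Request(200, 20.0)] * 98 + [Request(500, 2000.0)] * 2
print(red_metrics(window, window_seconds=10))
```

In this sample the median stays at 20ms while the p99 jumps to 2000ms, which is why Duration is tracked as a distribution and not as an average.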
USE: for any resource (CPU, memory, disk, network, connection pool), measure three things:
| Letter | Metric | What it tells you |
|---|---|---|
| U | Utilization | Percentage of time the resource is busy. CPU at 85%, disk at 60%. |
| S | Saturation | Amount of work queued that cannot be served. Queue depth, wait time. |
| E | Errors | Count of error events. ECC memory corrections, network packet drops. |
The debugging workflow typically starts with RED (user-facing symptoms) and drills into USE (resource causes):
RED tells you THAT something is wrong. USE tells you WHERE the bottleneck is. Logs tell you WHY.
The simulation shows RED and USE metrics side by side. Inject load and watch how service-level metrics (RED) correspond to resource-level metrics (USE).
Left: service metrics (RED). Right: resource metrics (USE). Inject load to see correlation.
Your service is "reliable." But how reliable? 99%? 99.9%? 99.99%? And reliable in what dimension — availability? Latency? Correctness? These terms are often used loosely. The SLI/SLO/SLA framework gives them precise meaning.
An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided. It is a metric with a specific definition.
An SLO is a target value for an SLI. It is an internal goal that your team sets.
An SLA is a contract with your customer that includes consequences for failing to meet it (refunds, credits, contract termination). SLAs are typically looser than SLOs because you want to breach your SLO (internal alert) before you breach your SLA (legal/financial consequences).
| Concept | Definition | Audience | Consequence of breach |
|---|---|---|---|
| SLI | The metric itself | Engineers | None (it's a measurement) |
| SLO | Target for the SLI | Engineering team | Error budget consumed, on-call pages |
| SLA | Contract with customer | Customers, legal | Refunds, credits, legal action |
| Availability | Downtime/month | Downtime/year | Example |
|---|---|---|---|
| 99% (two 9s) | 7.2 hours | 3.65 days | Internal tools |
| 99.9% (three 9s) | 43.2 minutes | 8.76 hours | Most SaaS products |
| 99.95% | 21.6 minutes | 4.38 hours | Cloud provider services |
| 99.99% (four 9s) | 4.32 minutes | 52.6 minutes | Payment systems, infrastructure |
| 99.999% (five 9s) | 26 seconds | 5.26 minutes | 911 systems, pacemakers |
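The downtime figures follow from simple arithmetic. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is hypothetical):

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted by an availability target over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * 24 * 60


for target in (99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(target)
    per_year = allowed_downtime_minutes(target, period_days=365)
    print(f"{target}%: {per_month:.2f} min/month, {per_year:.1f} min/year")
```

For example, 99.9% over 30 days leaves 0.001 x 43,200 minutes = 43.2 minutes.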
How do you decide between 99.9% and 99.99%? It depends on your users and your dependencies:
Choosing the right SLI is critical. A bad SLI creates a false sense of security. Here are the most common SLIs by service type:
| Service Type | Recommended SLI | Why not something else? |
|---|---|---|
| Web API | % of requests returning 2xx within 300ms | Combines availability and latency in one metric. A slow success is a failure for the user. |
| Data pipeline | % of records processed within freshness target | Throughput alone is misleading — data can be processed but 3 hours late. |
| Storage system | % of reads returning correct data within 100ms | Durability (no data loss) + availability + latency. |
| Batch job | % of runs completing successfully within time budget | Start time is irrelevant; completion time matters. |
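The Web API row can be made concrete with a small sketch. The input shape, a sequence of `(status_code, duration_ms)` pairs, is an assumption for illustration; the 300ms threshold comes from the table.

```python
def good_request_ratio(requests, latency_threshold_ms=300):
    """Fraction of requests that returned 2xx within the latency threshold.

    `requests` is an iterable of (status_code, duration_ms) pairs.
    Returns None when there is no traffic, since the SLI is undefined.
    """
    total = good = 0
    for status, duration_ms in requests:
        total += 1
        if 200 <= status < 300 and duration_ms <= latency_threshold_ms:
            good += 1
    return good / total if total else None


sample = [(200, 120), (200, 450), (500, 30), (204, 300)]
print(good_request_ratio(sample))  # 0.5
```

The slow success `(200, 450)` counts as bad, which is the point of combining availability and latency in one SLI.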
The simulation shows how SLO targets map to allowed downtime. Adjust the SLO slider and see how much failure budget you have.
Set your SLO and see how much downtime you can afford per month and year.
The SLO says 99.9% availability. That means 0.1% unavailability is allowed. This 0.1% is your error budget — the amount of unreliability you can tolerate before breaching your SLO.
Error budgets transform reliability from a vague aspiration into a concrete, spendable resource. You can spend your error budget on things that matter: faster feature releases, risky experiments, infrastructure migrations. As long as you stay within budget, you are meeting your reliability commitment.
| Budget Level | Action |
|---|---|
| > 50% remaining | Full speed: ship features, run experiments, do migrations |
| 25-50% remaining | Caution: all deploys require canary, no risky experiments |
| < 25% remaining | Freeze: no deployments except bug fixes. Focus on reliability improvements. |
| 0% (budget exhausted) | Hard freeze: only reliability work until budget replenishes next cycle |
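The policy table can be expressed as a small decision function. This sketch assumes a request-based budget (allowed failures = (1 - SLO) x total requests), which is one common way to count; a time-based budget in minutes works the same way. The function names and policy labels are hypothetical.

```python
def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def policy(remaining):
    """Map remaining budget to the action levels in the table above."""
    if remaining <= 0:
        return "hard freeze"
    if remaining < 0.25:
        return "freeze"
    if remaining <= 0.50:
        return "caution"
    return "full speed"


# 99.9% SLO, 10 million requests so far this cycle, 6,000 of them failed.
remaining = budget_remaining(0.999, 10_000_000, 6_000)
print(f"{remaining:.0%} of budget left -> {policy(remaining)}")
```

Here about 10,000 failures are allowed, 6,000 are spent, and the 40% that remains puts the team in the "caution" band.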
Error budgets change how organizations make engineering decisions. Here are concrete examples:
Not all budget consumption is equal. Some is expected (planned maintenance), some is unplanned but tolerable (transient errors), and some requires investigation.
| Source | Typical Cost | Action |
|---|---|---|
| Planned maintenance | 5-15 min | Pre-approved, scheduled during low traffic |
| Deploy rollout | 0-2 min | Normal if within canary detection window |
| Transient network blip | 0.5-2 min | Expected noise, no action needed |
| Bad deploy | 5-30 min | Post-mortem, improve canary/rollback speed |
| Dependency outage | Variable | Evaluate circuit breakers and fallbacks |
| Infrastructure failure | Variable | Evaluate fault domain spread, redundancy |
Traditional alerting fires when a metric crosses a threshold. Burn rate alerting fires when you are consuming error budget too fast — even if the absolute error rate is below your threshold.
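Burn rate is usually defined as the observed error ratio divided by the error ratio the SLO allows. A minimal sketch under that definition, with hypothetical function names and a 30-day budget period assumed:

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def days_until_exhausted(error_ratio, slo, period_days=30):
    """Days a full budget would last if the current error ratio persisted."""
    rate = burn_rate(error_ratio, slo)
    return float("inf") if rate == 0 else period_days / rate


# A steady 0.5% error ratio never trips a "> 1%" threshold alert,
# yet against a 99.9% SLO it spends the budget about five times too fast.
print(round(burn_rate(0.005, 0.999), 2))
print(round(days_until_exhausted(0.005, 0.999), 1))
```

A burn rate of 1 means the budget lasts exactly the full period; a burn rate of about 5 means a 30-day budget is gone in roughly 6 days.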
The simulation below tracks an error budget over a 30-day period. Deployments and incidents consume budget. Watch the budget drain and see when policies activate.
30-day error budget at 99.9% SLO (43.2 min). Deployments and incidents consume budget.
An alert wakes a human up at 3 AM. That alert had better be important. If it is a false alarm, or a real alarm that does not require immediate human action, you have just wasted someone's sleep, eroded trust in your alerting system, and contributed to alert fatigue.
The most important principle in alerting: alert on symptoms, not causes.
| Type | Example | Problem |
|---|---|---|
| Cause-based (bad) | "CPU > 90%" | CPU at 92% might be fine if latency is normal. You page someone for a non-problem. |
| Symptom-based (good) | "Error rate > 1% for 5 minutes" | This means users are affected RIGHT NOW. Always actionable. |
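The "for 5 minutes" clause in the symptom-based example matters as much as the threshold, because it filters out one-off blips. A rough sketch, assuming one error-ratio sample per minute (the function name and sampling interval are hypothetical):

```python
def should_page(error_ratio_per_minute, threshold=0.01, for_minutes=5):
    """True only when the most recent `for_minutes` samples all exceed the threshold."""
    recent = error_ratio_per_minute[-for_minutes:]
    return len(recent) == for_minutes and all(r > threshold for r in recent)


blip = [0.001, 0.08, 0.002, 0.001, 0.001]
sustained = [0.001, 0.02, 0.03, 0.05, 0.04, 0.06]
print(should_page(blip))       # False
print(should_page(sustained))  # True
```

A single bad minute does not page anyone; five consecutive bad minutes does.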
| Level | Response Time | Channel | Example |
|---|---|---|---|
| Critical (P1) | Immediate (minutes) | Page on-call, phone call | Service completely down, data loss |
| High (P2) | Hours | Slack notification, ticket | Error rate elevated but still within the SLO |
| Medium (P3) | Days | Ticket | Disk usage growing, needs cleanup |
| Low (P4) | Weeks | Dashboard annotation | Certificate expires in 60 days |
The most sophisticated alerting approach ties directly to error budgets. Instead of fixed thresholds, alert when you are burning error budget too fast:
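A common form of this is multi-window burn-rate alerting: a rule fires only when both a long window and a short window exceed the same burn-rate threshold, so the alert stops soon after the problem does. The sketch below uses thresholds often quoted from the Google SRE Workbook (14.4 over 1 hour, 6 over 6 hours, 1 over 3 days); treat them, the window names, and the input dictionary shape as assumptions to tune, not fixed rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Ordered from most to least urgent.
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_ratio_by_window, slo):
    """Return the severity of the first matching rule, or None."""
    allowed = 1 - slo
    for rule in RULES:
        long_burn = error_ratio_by_window[rule.long_window] / allowed
        short_burn = error_ratio_by_window[rule.short_window] / allowed
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule.severity
    return None


ongoing = {"5m": 0.02, "30m": 0.016, "1h": 0.016, "6h": 0.004, "3d": 0.001}
recovered = {"5m": 0.0001, "30m": 0.002, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
print(evaluate(ongoing, slo=0.999))    # page
print(evaluate(recovered, slo=0.999))  # None
```

In the `recovered` case the 1-hour window is still hot, but the 5-minute window shows the errors have stopped, so nobody is paged for an incident that is already over.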
An alert pages a human. That human is the on-call engineer — someone who has agreed to be reachable 24/7 for a rotation period (usually one week). On-call is one of the most important and demanding roles in engineering. Bad on-call practices burn out engineers and degrade incident response quality.
| Practice | Bad | Good |
|---|---|---|
| Rotation length | 1 month (burnout) | 1 week, with swap ability |
| Alert volume | >2 pages per shift (fatigue) | ≤2 pages per week |
| Runbooks | None ("figure it out") | Every alert links to a runbook with step-by-step remediation |
| Post-incident | Blame the engineer | Blameless post-mortem, focus on systemic fixes |
| Compensation | None | Extra pay or comp time for on-call shifts |
When a page fires, the on-call engineer should follow a structured response:
The simulation shows different alert configurations and their impact on on-call burden. Compare symptom-based and cause-based alerting.
Simulate 1 week of alerts. Compare cause-based (noisy) vs. symptom-based (accurate) alerting.
A traditional log line looks like this: 2024-01-15 01:47:23 ERROR: Payment failed for user 12345, amount=$50.00, reason=timeout. It is human-readable. It is also impossible to query efficiently. Want to find all payment failures over $100 in the last hour? You are parsing strings with regex.
Structured logging emits log events as key-value pairs (usually JSON), making them queryable by any field:
| Always Log | Never Log |
|---|---|
| Timestamp (ISO 8601 UTC) | Passwords, tokens, API keys |
| Log level (ERROR, WARN, INFO, DEBUG) | Full credit card numbers |
| Service name and version | Personal health information |
| Trace ID and span ID | Social security numbers |
| Request ID | Full request/response bodies (too large) |
| User ID (if applicable) | Raw database queries with parameters |
| Duration of the operation | Anything that violates GDPR/CCPA/HIPAA |
| Error type and message | |
```python
import json
import time


class StructuredLogger:
    def __init__(self, service_name, version):
        self.service = service_name
        self.version = version

    def log(self, level, event, **kwargs):
        entry = {
            # gmtime() makes the timestamp UTC, so the trailing "Z" is accurate
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "level": level,
            "service": self.service,
            "version": self.version,
            "event": event,
            **kwargs,  # trace_id, user_id, duration_ms, etc.
        }
        print(json.dumps(entry))  # stdout, captured by log agent


# Usage:
logger = StructuredLogger("payment-service", "v2.3.1")
logger.log(
    "ERROR",
    "payment_failed",
    trace_id="abc-123",  # normally taken from the incoming request's trace context
    user_id="12345",
    amount=50.00,
    reason="timeout",
    duration_ms=30042,
)

# Example output (one line, wrapped here; the timestamp will vary):
# {"timestamp": "2024-01-15T01:47:23Z", "level": "ERROR",
#  "service": "payment-service", "version": "v2.3.1",
#  "event": "payment_failed", "trace_id": "abc-123",
#  "user_id": "12345", "amount": 50.0, "reason": "timeout",
#  "duration_ms": 30042}
```
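With one JSON object per line, the earlier question ("all payment failures over $100 in the last hour") becomes a field filter instead of a regex. The sketch below runs the filter in plain Python over a few hypothetical log lines; a log platform would run the equivalent query over indexed fields. The sample entries and the fixed `since` cutoff are invented for illustration.

```python
import json
from datetime import datetime, timezone


def parse_ts(value):
    """Parse the ISO 8601 UTC timestamp format used by the logger."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def payment_failures_over(lines, min_amount, since):
    """Return payment_failed entries above min_amount logged at or after `since`."""
    matches = []
    for line in lines:
        entry = json.loads(line)
        if (entry.get("event") == "payment_failed"
                and entry.get("amount", 0) > min_amount
                and parse_ts(entry["timestamp"]) >= since):
            matches.append(entry)
    return matches


LOG_LINES = [
    '{"timestamp": "2024-01-15T01:47:23Z", "event": "payment_failed", "user_id": "12345", "amount": 50.0}',
    '{"timestamp": "2024-01-15T01:48:02Z", "event": "payment_failed", "user_id": "67890", "amount": 250.0}',
    '{"timestamp": "2024-01-15T00:12:40Z", "event": "payment_failed", "user_id": "24680", "amount": 900.0}',
    '{"timestamp": "2024-01-15T01:50:11Z", "event": "order_placed", "user_id": "13579", "amount": 400.0}',
]

since = datetime(2024, 1, 15, 1, 0, 0, tzinfo=timezone.utc)
for entry in payment_failures_over(LOG_LINES, 100, since):
    print(entry["user_id"], entry["amount"])
```

Only user 67890 matches: the first failure is under $100, the third is too old, and the fourth is not a failure.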
| Level | When | Example | Volume |
|---|---|---|---|
| ERROR | Something failed that affects user experience | Payment declined, query timeout | Low (<1% of requests) |
| WARN | Something unexpected but handled | Retry succeeded, cache miss, slow query | Low-medium |
| INFO | Significant business events | Order placed, user signed up, deploy started | Medium |
| DEBUG | Detailed execution flow (disabled in production) | Query params, function entry/exit | Very high (off in prod) |
The simulation shows how structured vs. unstructured logging affects incident diagnosis time.
Search for "all payment failures over $100 in the last hour." Compare query times.
A user's request enters your system at the API gateway. It is routed to the order service, which calls the inventory service, which reads from the cache and the database. The entire round trip takes 2.3 seconds. Where did the time go?
Logs tell you what happened at each service. Metrics tell you aggregate behavior. But neither tells you the journey of a single request through the entire system. That is what distributed tracing does.
A trace represents the entire lifecycle of a request. It is composed of spans, where each span represents one operation (one service call, one database query, one cache lookup).
The trace makes it immediately obvious: the Payment Service took 2195ms, and within that, the DB query took 2140ms. That is the bottleneck. Without tracing, you would know "the request took 2.3 seconds" but not where in the chain the time was spent.
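The same reasoning can be automated by computing each span's self time: its duration minus the time spent in its direct children. The sketch below uses the Payment Service and DB query durations quoted above; the gateway span, the span IDs, and the assumption that children run one after another are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans):
    """Duration of each span minus its direct children (assumes sequential children)."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def bottleneck(spans):
    """Name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)


trace = [
    Span("a1", None, "api-gateway", 2300),
    Span("b2", "a1", "payment-service", 2195),
    Span("c3", "b2", "db-query", 2140),
]
print(self_times(trace))
print(bottleneck(trace))  # db-query
```

The payment service looks slow by total duration, but its self time is only 55ms; almost all of the 2.3 seconds is inside the database query.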
For tracing to work, every service must propagate the trace context (trace_id, span_id, parent_span_id) to the next service. The standard is the W3C Trace Context specification, which uses HTTP headers:
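In W3C Trace Context, the main header is `traceparent`, with four dash-separated hex fields: `version-traceid-parentid-flags` (2, 32, 16, and 2 hex characters). The sketch below generates, parses, and propagates that header. The helper names are hypothetical, and it skips parts of the specification such as the companion `tracestate` header and handling of future versions.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Header for an outgoing call: same trace ID, fresh span ID as the new parent."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

Every hop keeps the trace ID and replaces the span ID with its own, which is what lets the tracing backend stitch the spans into one tree.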
| Strategy | How it works | Pro | Con |
|---|---|---|---|
| Head-based (random) | Decide at the start: trace this request with 1% probability | Simple, low overhead | May miss interesting traces. Cannot decide based on outcome. |
| Tail-based | Buffer all spans. After request completes, decide to keep based on duration, error status, or other criteria | Keeps all interesting traces | Requires buffering all spans temporarily. Higher memory cost. |
| Priority-based | 100% for errors, 100% for slow, 10% for specific endpoints, 1% for everything else | Balances coverage and cost | Complex configuration |
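The first two strategies differ mainly in when the decision is made. In this sketch the head-based decision is derived from the trace ID (a deterministic variant of the random approach, so every service agrees without coordination), and the tail-based decision looks at the finished spans. The span dictionary shape, the 1-second "slow" cutoff, and the function names are assumptions.

```python
import random


def head_sample(trace_id_hex, rate=0.01):
    """Decide at request start, using the low 32 bits of the trace ID as a uniform value."""
    return int(trace_id_hex[-8:], 16) / 2**32 < rate


def tail_keep(spans, slow_ms=1000, baseline_rate=0.01, rng=random.random):
    """Decide after the request completes: keep errors and slow traces, sample the rest.

    Assumes `spans` is non-empty and the root span has the longest duration.
    """
    has_error = any(span["error"] for span in spans)
    total_ms = max(span["duration_ms"] for span in spans)
    if has_error or total_ms >= slow_ms:
        return True
    return rng() < baseline_rate


fast_ok = [{"duration_ms": 40, "error": False}]
slow_ok = [{"duration_ms": 2300, "error": False}, {"duration_ms": 2140, "error": False}]
fast_err = [{"duration_ms": 12, "error": True}]

print(tail_keep(slow_ok))   # True
print(tail_keep(fast_err))  # True
```

The head-based function never sees the outcome, so a slow request is kept only if its trace ID happens to fall in the sampled 1%. The tail-based function always keeps it, but only after every span has been buffered.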
The real power of observability comes from connecting the three pillars. A metric tells you something is wrong. A trace shows you where. A log tells you exactly what happened.
This workflow — metric alert → trace → log → root cause — takes 5-10 minutes with good observability. Without it, the same investigation takes 30-60 minutes of guessing and grepping.
The simulation below shows a distributed trace. Click on spans to see details. Inject a slow dependency to see where the bottleneck appears.
A request flows through 4 services. Inject a slow dependency to see the bottleneck in the trace.
This is the showcase chapter. A live dashboard showing all four golden signals, with alert indicators, trace links, and the ability to inject incidents. This is what the on-call engineer sees at 3 AM.
Four golden signals with real-time data. Inject incidents. Watch alerts. Trace requests.
| Section | Contents | Why |
|---|---|---|
| Top banner | Overall status (green/yellow/red), active alerts count | Instant health assessment in 1 second |
| Golden signals | Latency (p50/p99), traffic (RPS), errors (%), saturation (%) | The 4 things that matter most |
| SLO burn rate | Error budget consumption, burn rate chart | Are we on track for the month? |
| Top errors | Most common error types by count | What is breaking most? |
| Recent deployments | Timeline of deploys with status | Correlate incidents with changes |
A bad dashboard is worse than no dashboard, because it gives a false sense of visibility. Here are the most common mistakes:
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Too many charts | 30 charts = cognitive overload. Engineer can't find the signal. | 5-7 charts max per dashboard. One dashboard per concern. |
| No context | Line goes up. Is that good or bad? | Show thresholds, baselines, and SLO targets on every chart. |
| Average-only | "Average latency is 50ms" hides p99 of 3 seconds. | Show p50, p95, p99 as separate lines or a heatmap. |
| Stale data | Dashboard shows data from 5 minutes ago. | Refresh every 10-15 seconds. Show "last updated" timestamp. |
| No drill-down | Error rate is high. Now what? | Click on a data point to jump to related traces and logs. |
How do metrics get from your application to a dashboard? The pipeline has four stages:
Observability is the final piece of the distributed systems reliability stack. You now have the complete picture: what fails, how to isolate it, how to survive it at runtime, how to ship safely, and how to see what is happening in production.
| Layer | Topic | Question it answers |
|---|---|---|
| Architecture | Failure Modes & Isolation | What can fail? How do we contain it? |
| Runtime | Resiliency Patterns | How do we handle failures gracefully in code? |
| Delivery | Testing & Deployment | How do we ship changes safely? |
| Visibility | Observability & Operations (this lesson) | How do we see what's happening and respond? |
1. Measure the four golden signals. Latency (percentiles!), traffic, errors, saturation. If you can only have four metrics, make it these four.
2. RED for services, USE for resources. Rate/Errors/Duration for request-driven services. Utilization/Saturation/Errors for resources such as CPU, memory, disk, network, and connection pools.
3. SLOs are budgets, not aspirations. Set them based on user needs. Never set 100%. Use error budgets to make deployment risk decisions.
4. Alert on symptoms, not causes. "Error rate > 1%" pages a human. "CPU > 90%" creates a ticket. Every page must be urgent, actionable, and real.
5. Structure your logs. JSON with trace_id lets you query efficiently and correlate across services. Never log secrets.
6. Trace the critical path. Distributed tracing shows where time is spent. Sample at 1% in production, 100% for errors and slow requests.
7. Dashboards should answer questions in seconds. Top banner for status, golden signals for detail, drill-down for investigation.