Distributed Systems

Testing & Deployment

How to ship code without breaking production — from test pyramids to canary deploys to chaos engineering.

Prerequisites: Basic CI/CD concepts + Client-server model. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Testing Fails

You have 100% code coverage. Every unit test passes. Integration tests are green. The staging environment looks perfect. You deploy to production with confidence. Within 30 minutes, your error rate triples.

What happened? The staging environment had 2 servers. Production has 200. The race condition that occurs when 200 servers simultaneously connect to the database on startup never happened in staging. The test suite tested logic perfectly but never tested behavior under real conditions.

This is the fundamental gap in distributed systems testing: the interesting failures are not logic bugs. They are emergent behaviors that only appear when real traffic hits real infrastructure at real scale with real network conditions.

The core challenge. You cannot test a distributed system by testing its parts in isolation. A distributed system is not just its code: it is its code running on specific hardware, with specific network conditions, handling specific traffic patterns, interacting with specific dependencies. Testing the code alone covers only a fraction of the failure surface (perhaps a third, as a rough illustration rather than a measured figure). The rest requires testing the system.

The simulation below shows a system that passes all unit tests and integration tests but fails in production due to emergent behavior under load.

The Testing Gap

Left: test results (all green). Right: production behavior under load. The tests miss what matters.

Run the test suite, then deploy to see the gap between test results and reality.

What Tests Miss

| Failure Type | Can unit tests catch it? | Can integration tests? | Requires |
| --- | --- | --- | --- |
| Race condition under load | No | Rarely | Load testing, chaos engineering |
| Cascading failure | No | No | Chaos engineering, game days |
| Configuration drift | No | No | Canary deploys, staged rollout |
| Dependency version mismatch | Sometimes | Yes, if environments match | Environment parity |
| Memory leak over 48 hours | No | No | Soak testing, production monitoring |
| Performance regression | No | Rarely | Benchmark tests, canary metrics |

The structure of this lesson. We will cover three layers of defense: testing (what to test and how), deployment (how to ship safely), and rollback (how to recover when things go wrong). The final chapter is a canary deployment simulator where you deploy a new version, monitor its health, and decide whether to promote it or roll it back.
Quiz: Your test suite has 100% line coverage. Does this guarantee your code is bug-free in production?

Chapter 1: The Test Pyramid

Not all tests are equal. A unit test that checks whether a function adds two numbers correctly runs in 1 millisecond and tests exactly one thing. An end-to-end test that spins up the entire application, opens a browser, fills in a form, and verifies the result in the database takes 30 seconds and tests everything at once.

The test pyramid is a model for how to allocate your testing effort. The base is wide (many cheap, fast tests), and the top is narrow (few expensive, slow tests).

End-to-End Tests (few)
Test the entire system from user input to database. Slow (minutes), brittle, expensive. Catch integration issues between services. 5-10% of tests.
Integration Tests (moderate)
Test interactions between components: API calls, database queries, message queue consumers. Medium speed (seconds). 20-30% of tests.
Unit Tests (many)
Test individual functions and classes in isolation. Fast (milliseconds), stable, cheap. Catch logic bugs. 60-70% of tests.

Why the Pyramid Shape?

| Property | Unit | Integration | E2E |
| --- | --- | --- | --- |
| Speed | 1-10ms | 100ms-5s | 10s-5min |
| Reliability | Very stable | Occasionally flaky | Often flaky |
| Failure specificity | Pinpoints exact function | Narrows to component pair | "Something is broken" |
| Cost to write | Low | Medium | High |
| Cost to maintain | Low | Medium | Very high |
| What it catches | Logic bugs | Interface mismatches | User-visible regressions |

The anti-pattern is the ice cream cone: lots of E2E tests, few unit tests. This results in a slow, flaky test suite that everyone ignores. Teams start shipping without waiting for tests because the tests take 45 minutes and fail randomly due to browser timing issues.

The simulation below shows two test suites: a pyramid (many unit, few E2E) and an ice cream cone (many E2E, few unit). Watch how they differ in speed and reliability.

Test Pyramid vs. Ice Cream Cone

Two test suites with the same total coverage. Compare execution time and flakiness.

Run each test suite to compare speed and reliability.
Quiz: Your team's CI pipeline takes 45 minutes because 80% of the tests are end-to-end browser tests. Developers frequently skip waiting for CI. What is the most effective fix?

Chapter 2: Test Sizes

The test pyramid categorizes tests by what they test (unit vs. integration vs. E2E). Test sizes categorize tests by how many resources they consume. This is a complementary classification used by large engineering organizations to enforce test quality and CI speed.

The Three Sizes

| Size | Time Limit | Resources | Examples |
| --- | --- | --- | --- |
| Small | < 1 minute | Single process. No I/O, no network, no disk, no database. Everything mocked or in-memory. | Pure function tests, data structure tests, parser tests |
| Medium | < 5 minutes | Single machine. Can use localhost network, local database, local file system. | API tests with in-memory DB, gRPC tests, message serialization |
| Large | < 15 minutes | Multiple machines or external services. Can use real network, real databases, real dependencies. | E2E tests, performance benchmarks, chaos experiments |

Why size matters. A "unit test" that calls a real database is misclassified. It has the speed and flakiness profile of an integration test. By classifying tests by size (resource consumption) rather than scope, you get accurate predictions of how long your test suite will take and how likely it is to flake.

Why Size Matters More Than You Think

Consider a test labeled "unit test" that creates a temporary database connection. On the developer's laptop with a fast SSD and no network contention, it runs in 50ms. In CI, running alongside 500 other tests on a shared machine, it takes 3 seconds because of disk I/O contention. This test is nondeterministic — its behavior depends on the environment. It will fail intermittently (a "flaky" test), eroding confidence in the test suite.

A true small test uses no I/O, no network, no disk. It runs in 5ms on any machine, every time, deterministically. Size classification prevents environment-dependent failures from polluting your test suite.
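As a rough illustration of what "small" looks like in practice, the sketch below tests a piece of registration logic against an in-memory fake instead of a real database. The names `InMemoryUserStore` and `register_user` are invented for this example and do not come from any particular framework.

```python
import unittest


class InMemoryUserStore:
    """Hypothetical in-memory stand-in for a database-backed user store."""

    def __init__(self):
        self._users = {}

    def add(self, email):
        if email in self._users:
            raise ValueError(f"duplicate email: {email}")
        self._users[email] = {"email": email}

    def exists(self, email):
        return email in self._users


def register_user(store, email):
    """Normalize the email, validate it, then hand it to the store."""
    normalized = email.strip().lower()
    if "@" not in normalized:
        raise ValueError("invalid email")
    store.add(normalized)
    return normalized


class RegisterUserTest(unittest.TestCase):
    def test_normalizes_and_stores(self):
        store = InMemoryUserStore()
        self.assertEqual(register_user(store, "  Ada@Example.com "), "ada@example.com")
        self.assertTrue(store.exists("ada@example.com"))

    def test_rejects_duplicates(self):
        store = InMemoryUserStore()
        register_user(store, "ada@example.com")
        with self.assertRaises(ValueError):
            register_user(store, "ADA@example.com")
```

Nothing here touches disk, network, or a database process, so the test behaves the same on a laptop and on a crowded CI machine. Whether the real store honors the same contract is a question for a separate medium test.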

The Real Cost of Flaky Tests

// Flaky test cost calculation:

Test suite: 5,000 tests
Flake rate: 2% (100 tests are occasionally flaky)
Each flake has 5% chance of failing on any run

P(at least one flake per run) = 1 - (0.95)^100 = 99.4%

// Nearly EVERY CI run has at least one flaky failure.
// Developers learn to re-run failed tests without investigating.
// When a REAL failure occurs, it gets re-run too and goes unnoticed.
// This is "the boy who cried wolf" applied to CI.

// Developer time wasted per day (10 developers, 5 CI runs each):
50 runs × 99.4% flake × 3 min to re-run = 149 min/day wasted
// That's 2.5 hours of engineering time per day, every day.
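The same arithmetic can be reproduced in a few lines of Python, which makes it easy to plug in your own numbers. This is a sketch that assumes flaky tests fail independently of each other; the function names are made up for the example.

```python
def p_any_flake(flaky_tests, per_test_fail_rate):
    """Probability that at least one flaky test fails in a single run,
    assuming the flaky tests fail independently."""
    return 1 - (1 - per_test_fail_rate) ** flaky_tests


def wasted_minutes_per_day(runs_per_day, p_flake, rerun_minutes):
    """Expected minutes per day spent re-running CI because of flakes."""
    return runs_per_day * p_flake * rerun_minutes


p = p_any_flake(flaky_tests=100, per_test_fail_rate=0.05)
wasted = wasted_minutes_per_day(runs_per_day=10 * 5, p_flake=p, rerun_minutes=3)
print(f"P(at least one flake per run) = {p:.1%}")
print(f"Wasted time = {wasted:.0f} min/day")
```

Halving the number of flaky tests to 50 still leaves over 90% of runs with a spurious failure, which is why quarantining a handful of flakes rarely feels like progress until most of them are gone.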

Size Budgets

Some organizations enforce size budgets: at least 80% of tests must be small, at most 15% medium, and at most 5% large. This ensures the test suite stays fast and reliable as the codebase grows.

// Test size budget enforcement:
Total tests: 10,000
Small (< 1 min, no I/O): 8,500 (85%) → total time: ~2 min parallel
Medium (< 5 min, local): 1,200 (12%) → total time: ~8 min parallel
Large (< 15 min, external): 300 (3%) → total time: ~15 min parallel

// Compare without budget (ice cream cone):
Small: 1,000 (10%) → ~20 sec
Medium: 3,000 (30%) → ~15 min
Large: 6,000 (60%) → ~90 min, many flakes
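One way such a budget might be enforced is a small check that runs in CI and fails the build when the distribution drifts. The sketch below uses the 80/15/5 thresholds from the text; the `counts` dictionary shape and the function name are assumptions for illustration.

```python
# Budget from the text: at least 80% small, at most 15% medium, at most 5% large.
BUDGET = {"small": ("min", 0.80), "medium": ("max", 0.15), "large": ("max", 0.05)}


def check_size_budget(counts, budget=BUDGET):
    """Return a list of human-readable budget violations (empty if compliant)."""
    total = sum(counts.values())
    if total == 0:
        return []
    violations = []
    for size, (kind, limit) in budget.items():
        share = counts.get(size, 0) / total
        if kind == "min" and share < limit:
            violations.append(f"{size}: {share:.0%} is below the {limit:.0%} minimum")
        if kind == "max" and share > limit:
            violations.append(f"{size}: {share:.0%} is above the {limit:.0%} maximum")
    return violations


pyramid = {"small": 8500, "medium": 1200, "large": 300}
ice_cream_cone = {"small": 1000, "medium": 3000, "large": 6000}

print(check_size_budget(pyramid))          # compliant suite
print(check_size_budget(ice_cream_cone))   # violates all three limits
```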

The simulation below shows how test size distribution affects total CI time. Adjust the slider to change the ratio of small to large tests.

Test Size Distribution & CI Time

10,000 tests. Adjust the size distribution and see how total CI time and flake rate change.

Drag the slider to see how test size distribution affects CI performance.
Quiz: A test labeled "unit test" creates a temporary SQLite file, writes data, reads it back, and deletes the file. What size is this test?

Chapter 3: Chaos Engineering

You have tests for the happy path. You have tests for known error conditions. But what about the errors you have not imagined? What about the interaction between a network partition and a GC pause and a traffic spike — all happening simultaneously?

Chaos engineering is the practice of deliberately injecting failures into a running system to discover weaknesses before they cause real outages. Instead of waiting for failures to find you, you find them first.

The Scientific Method of Chaos

1. Form a Hypothesis
"If we kill one database replica, the system will failover to the standby within 5 seconds and users will see no errors."
2. Design the Experiment
Kill a specific database replica during normal traffic. Measure error rate, latency, and failover time.
3. Limit Blast Radius
Run the experiment during low traffic. Have a kill switch to stop the experiment instantly. Alert the on-call team.
4. Run and Observe
Execute the experiment. Monitor all metrics. Record what happens.
5. Analyze
Did the hypothesis hold? If not, what broke? What was the actual blast radius? File a ticket for every surprise.
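The five steps above can be sketched as a tiny experiment harness. This is a minimal illustration, not a real chaos tool: `inject_fault`, `observe`, and `abort_requested` are hypothetical hooks that you would wire to your own infrastructure, and the observation values in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    max_failover_seconds: float
    max_error_rate: float  # fraction of failed requests tolerated


@dataclass
class Observation:
    failover_seconds: float
    error_rate: float


def run_experiment(hypothesis, inject_fault, observe, abort_requested):
    """Inject the fault, observe, and compare the outcome with the hypothesis."""
    if abort_requested():  # kill switch is checked before any fault is injected
        return {"status": "aborted", "surprises": []}
    inject_fault()
    obs = observe()
    surprises = []
    if obs.failover_seconds > hypothesis.max_failover_seconds:
        surprises.append(f"failover took {obs.failover_seconds}s")
    if obs.error_rate > hypothesis.max_error_rate:
        surprises.append(f"error rate reached {obs.error_rate:.1%}")
    status = "refuted" if surprises else "confirmed"
    return {"status": status, "surprises": surprises}


hypothesis = Hypothesis(
    "Killing one replica fails over within 5s with no user-visible errors",
    max_failover_seconds=5.0,
    max_error_rate=0.0,
)
result = run_experiment(
    hypothesis,
    inject_fault=lambda: None,  # stand-in for actually killing a replica
    observe=lambda: Observation(failover_seconds=52.0, error_rate=0.03),
    abort_requested=lambda: False,
)
print(result)
```

Each entry in `surprises` corresponds to one ticket to file in the analysis step.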

Common Chaos Experiments

| Experiment | What it tests | Common surprises |
| --- | --- | --- |
| Kill a server | Failover, load redistribution | Failover takes 10x longer than expected |
| Network latency injection | Timeout handling, circuit breakers | No timeouts configured (infinite wait) |
| DNS failure | DNS caching, fallback resolution | DNS cache expires during outage, total failure |
| Disk full | Disk space monitoring, graceful degradation | Log files fill disk, process crashes on write |
| Clock skew | Time-dependent logic, certificates | TLS certificates rejected, lease expirations |
| CPU exhaustion | Graceful degradation under load | Health checks fail, node removed from cluster |

Start small, in staging. Do not inject chaos into production on day one. Start with staging, with a single experiment, with a known blast radius. Build confidence. Graduate to production only when you have kill switches, monitoring, and team buy-in. The goal is learning, not heroics.

Game Days

A game day is a scheduled chaos engineering exercise where the entire team participates. Unlike automated chaos experiments that run in the background, game days are planned events where engineers deliberately break the system, observe the response, and practice incident response procedures.

Game days serve a dual purpose: they test the system's resilience and the team's incident response capability. A system might automatically failover perfectly, but if the on-call engineer doesn't know how to read the dashboard or escalate correctly, the human response is the bottleneck.

| Game Day Element | Purpose |
| --- | --- |
| Pre-announced schedule | Team knows it's coming, can prepare |
| Documented hypothesis | "We expect failover in <30s with <1% error spike" |
| Kill switch | One command to stop the experiment instantly |
| Observers | Teammates watch dashboards, note surprises |
| Post-mortem | Document findings, file tickets for gaps |

Formal Verification: TLA+

For the most critical distributed algorithms (consensus protocols, database replication), chaos engineering is not enough. You cannot test every possible interleaving of events. Formal verification tools like TLA+ (created by Leslie Lamport) let you specify an algorithm precisely and have a model checker examine every reachable state of that specification, rather than only the interleavings your tests happen to hit.

Amazon uses TLA+ extensively. Their engineers have found subtle bugs in DynamoDB, S3, and other systems — bugs that would be nearly impossible to find through testing alone because they require specific, rare combinations of concurrent events.

// Why formal verification matters:
// A system with 5 nodes, 3 message types, and 4 states per node has:
Possible states = 4^5 × 3^10 = 60,466,176 state combinations

// No test suite will cover all of them.
// The TLC model checker for TLA+ exhaustively checks every reachable state
// of a finite model. Model checking of this kind exposes protocol bugs
// that occur only when several specific events happen in a specific order.
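To make "exhaustively checks every reachable state" concrete, here is a toy state-space explorer in Python. It is not TLA+ and is far simpler than a real model checker. It models two processes that each increment a shared counter with a separate read step and write step, then walks every interleaving and reports final states where an increment was lost. The model and all names are invented for this illustration.

```python
from collections import deque


def next_states(state):
    """Yield every state reachable in one step (one process takes one action)."""
    counter, procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == 0:      # read the shared counter into a local variable
            new_proc, new_counter = (1, counter), counter
        elif pc == 1:    # write local value + 1 back to the shared counter
            new_proc, new_counter = (2, tmp), tmp + 1
        else:            # this process has finished
            continue
        yield new_counter, procs[:i] + (new_proc,) + procs[i + 1:]


def explore(num_procs=2):
    """Breadth-first search over all reachable states.

    Returns (seen_states, violations), where a violation is a final state
    whose counter does not equal the number of increments performed."""
    initial = (0, tuple((0, 0) for _ in range(num_procs)))
    seen, queue, violations = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        counter, procs = state
        if all(pc == 2 for pc, _ in procs) and counter != num_procs:
            violations.append(state)  # invariant broken: an increment was lost
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, violations


seen, violations = explore()
print(len(seen), "reachable states;", len(violations), "violate the invariant")
```

A test that runs the two processes one after the other would never see the lost update. The exhaustive search finds it because it also visits the interleaving where both processes read before either writes.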

The simulation below lets you run chaos experiments on a simple distributed system. Form a hypothesis, inject a fault, and see if the system behaves as expected.

Chaos Experiment Runner

A 3-node cluster with a load balancer. Inject faults and observe the system's response.

Inject a fault and observe the cluster's response.
Quiz: You want to test whether your system handles a database failover correctly. What is the chaos engineering approach?

Chapter 4: CI/CD Pipeline

A CI/CD pipeline (Continuous Integration / Continuous Delivery) automates the path from code commit to production deployment. The goal is to make deployments frequent, small, and safe — rather than rare, large, and terrifying.

Pipeline Stages

1. Commit
Developer pushes code. Triggers the pipeline automatically.
2. Build
Compile code, build container images, generate artifacts. ~1-5 minutes.
3. Unit Tests
Run all small tests. Fast feedback on logic correctness. ~1-3 minutes.
4. Integration Tests
Run medium tests against real (or containerized) dependencies. ~3-10 minutes.
5. Security & Lint
Static analysis, dependency vulnerability scanning, code style checks. ~1-3 minutes.
6. Deploy to Staging
Automatically deploy to a staging environment. Run smoke tests. ~5-10 minutes.
7. Deploy to Production
Canary or rolling deployment to production. Monitor metrics. ~10-30 minutes.
The key metric: lead time. Lead time is the time from code commit to running in production. Elite teams achieve lead time under 1 hour. The stages above add up to roughly 20-60 minutes. Any stage that regularly takes longer than 15 minutes is a bottleneck and a candidate for optimization.

The Four Key Metrics (DORA)

The DevOps Research and Assessment (DORA) program identified four metrics that distinguish elite engineering teams from the rest:

| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment frequency | Multiple/day | Weekly-monthly | Monthly-6 months | 6+ months |
| Lead time for changes | < 1 hour | 1 day - 1 week | 1 week - 1 month | 1-6 months |
| Time to restore | < 1 hour | < 1 day | 1 day - 1 week | 1 week - 1 month |
| Change failure rate | 0-15% | 16-30% | 16-30% | 46-60% |

Speed and stability are NOT trade-offs. The DORA data shows that elite teams deploy more frequently AND have lower failure rates AND recover faster. Fast feedback loops (frequent deploys, fast CI, canary metrics) catch problems earlier, when they are cheaper to fix. Slow, batched releases accumulate risk.

Pipeline Optimization Techniques

// Optimization 1: Parallelize independent stages
Sequential: Build(3m) → Unit(5m) → Lint(2m) → Security(3m) = 13m
Parallel: Build(3m) → [Unit | Lint | Security] (5m) = 8m (38% faster)

// Optimization 2: Incremental testing
Changed files: src/payment/charge.py
Full suite: 10,000 tests (15 min)
Affected tests only: 120 tests (20 sec)
// Run affected tests pre-merge, full suite post-merge.

// Optimization 3: Build caching
Full Docker build: 8 minutes
With layer cache: 45 seconds (only changed layers rebuild)
// Use deterministic build inputs so cache hits are reliable.
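The effect of parallelizing stages (optimization 1) is easy to compute for your own pipeline: sequential stages add up, while a parallel group costs only as much as its slowest job. The sketch below uses the stage durations from the example above; the data layout is an assumption made for illustration.

```python
def pipeline_minutes(stages):
    """Total wall-clock minutes for a pipeline.

    `stages` is a list whose entries are either a (name, minutes) tuple that
    runs on its own, or a list of such tuples that run in parallel."""
    total = 0
    for stage in stages:
        if isinstance(stage, list):
            total += max(minutes for _, minutes in stage)  # slowest job dominates
        else:
            total += stage[1]
    return total


sequential = [("build", 3), ("unit", 5), ("lint", 2), ("security", 3)]
parallel = [("build", 3), [("unit", 5), ("lint", 2), ("security", 3)]]

seq = pipeline_minutes(sequential)
par = pipeline_minutes(parallel)
print(f"sequential = {seq} min, parallel = {par} min")
```

Once stages run in parallel, only the slowest job in the group matters, so further speedups have to come from shrinking that job (here, the unit tests).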

Pipeline Anti-Patterns

| Anti-Pattern | Symptom | Fix |
| --- | --- | --- |
| Gated release trains | Deploy once a week with 50 changes batched | Deploy each change independently, multiple times per day |
| Manual approval gates | Deploys wait hours for a manager to click "Approve" | Automate approval with metrics-based promotion |
| Shared staging | Teams block each other waiting for staging slots | Ephemeral environments per pull request |
| Flaky tests | Tests fail randomly, developers re-run until green | Quarantine flaky tests, fix root causes, enforce size budgets |

The simulation shows a CI/CD pipeline processing commits. Watch how different pipeline configurations affect lead time and deployment frequency.

CI/CD Pipeline Simulator

Commits flow through pipeline stages. Green = passing, red = failing. Watch lead time.

Choose a pipeline configuration to simulate.
Quiz: Your team deploys once a week, batching 30-50 changes per release. Last week's release had a bug that took 8 hours to diagnose because you couldn't tell which of the 42 changes caused it. What deployment practice would prevent this?

Chapter 5: Deployment Strategies

You have a new version of your service ready to deploy. How do you get it into production without causing an outage? There are three major strategies, each with different trade-offs.

Rolling Deployment

Replace instances one at a time. At any moment, some instances run the old version and some run the new. The rollout progresses gradually over minutes or hours.

// Rolling deployment: 10 servers
t=0: [v1 v1 v1 v1 v1 v1 v1 v1 v1 v1]
t=1: [v2 v1 v1 v1 v1 v1 v1 v1 v1 v1]
t=2: [v2 v2 v1 v1 v1 v1 v1 v1 v1 v1]
... ...
t=10: [v2 v2 v2 v2 v2 v2 v2 v2 v2 v2]
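The timeline above can be reproduced with a short simulation. This is a sketch of the idea only: the `healthy` callback stands in for whatever health check a real orchestrator would run between batches, and the function name and parameters are invented for the example.

```python
def rolling_deploy(num_servers, old="v1", new="v2", batch_size=1,
                   healthy=lambda fleet: True):
    """Yield the fleet after each step; stop early if the health check fails."""
    fleet = [old] * num_servers
    yield list(fleet)
    for start in range(0, num_servers, batch_size):
        for i in range(start, min(start + batch_size, num_servers)):
            fleet[i] = new
        yield list(fleet)
        if not healthy(fleet):
            return  # halt the rollout; a real system would now re-roll to `old`


for t, fleet in enumerate(rolling_deploy(10)):
    print(f"t={t}: [{' '.join(fleet)}]")
```

Note what happens when the health check fails partway through: the fleet is left in a mixed state, which is why rolling back a rolling deployment means rolling again in the other direction.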

Blue-Green Deployment

Run two identical environments: "blue" (current production) and "green" (new version). Deploy the new version entirely to green. When green is healthy, switch the load balancer to route all traffic from blue to green. Instant cutover.

Canary Deployment

Deploy the new version to a tiny fraction of traffic (e.g., 5%) and monitor for errors. If the canary is healthy, gradually increase traffic (10%, 25%, 50%, 100%). If the canary shows problems, roll back only the canary; the other 95% of users are never affected.

Comparison

| Strategy | Rollback Speed | Resource Cost | Version Mixing | Blast Radius |
| --- | --- | --- | --- | --- |
| Rolling | Slow (must re-roll) | Low (in-place) | Yes (mixed during rollout) | Gradual increase |
| Blue-Green | Instant (switch back) | 2x (two environments) | No (instant switch) | 100% (all-or-nothing) |
| Canary | Instant (kill canary) | Low (small canary pool) | Yes (canary vs. baseline) | 5-10% during testing |

Canary is the gold standard. It combines the best properties: low blast radius (only canary traffic is at risk), instant rollback (kill the canary), and real production validation (canary handles real traffic). Many large-scale systems use canary deployments.

Canary in Detail: The Progression

A typical canary deployment follows a multi-stage progression with automated health checks at each gate:

Stage 1: Deploy to 1%
Deploy to a single instance or 1% of traffic. Wait 5 minutes. Check error rate, latency, and resource metrics against baseline.
↓ all metrics within 2x baseline
Stage 2: Promote to 5%
Increase to 5% of traffic. Wait 10 minutes. Same health checks. This catches issues that only appear at slightly higher load.
↓ all metrics within 1.5x baseline
Stage 3: Promote to 25%
Quarter of traffic. Wait 15 minutes. At this scale, statistical significance improves — rare edge cases start surfacing.
↓ all metrics within 1.2x baseline
Stage 4: Promote to 50%
Half of traffic. Wait 15 minutes. Any version-interaction issues (old version talking to new version) will appear here.
↓ pass
Stage 5: Promote to 100%
Complete rollout. Continue monitoring for 30 minutes after full promotion.

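The key detail: the wait times grow as the rollout progresses (5, 10, then 15 minutes) while the allowed deviation from baseline tightens (2x, 1.5x, 1.2x), because some bugs only manifest after minutes of operation (e.g., slow memory leaks, connection pool exhaustion, cache warming issues).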
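The staged progression can be expressed as a loop over gates. The sketch below is one possible shape, not how any particular deployment tool works: `measure` is a hypothetical hook that shifts traffic, waits, and returns the canary's metrics, and the 1.2x gate at 50% is an assumption because the progression only says "pass" at that stage.

```python
# (traffic %, wait minutes, allowed multiple of baseline) for each gate.
# The 50% gate is given only as "pass" in the text; 1.2x is assumed here.
STAGES = [(1, 5, 2.0), (5, 10, 1.5), (25, 15, 1.2), (50, 15, 1.2)]


def within_gate(canary, baseline, max_ratio):
    """True if every canary metric is at most max_ratio times its baseline."""
    return all(canary[name] <= baseline[name] * max_ratio for name in baseline)


def run_canary(baseline, measure, stages=STAGES):
    """Walk the stages in order and stop at the first failed gate."""
    for pct, wait_minutes, max_ratio in stages:
        canary = measure(pct, wait_minutes)
        if not within_gate(canary, baseline, max_ratio):
            return f"rolled back at {pct}%"
    return "promoted to 100%"


baseline = {"error_rate": 0.001, "p99_ms": 200}


def healthy(pct, wait_minutes):
    return {"error_rate": 0.0011, "p99_ms": 210}


def degrading(pct, wait_minutes):
    # Invented behaviour: tail latency grows with the share of traffic served.
    return {"error_rate": 0.001, "p99_ms": 200 + 4 * pct}


print(run_canary(baseline, healthy))
print(run_canary(baseline, degrading))
```

The degrading version passes the loose early gates and is only caught once the threshold tightens at 25%, which is the reason for having more than one stage.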

Automated Canary Analysis

Manual canary evaluation does not scale. Automated canary analysis (ACA) compares the canary's metrics to the baseline using statistical tests:

// Automated canary analysis:

// 1. Collect metrics from canary and baseline for N minutes
canary_errors = [0.1%, 0.15%, 0.12%, 0.2%, 0.11%, ...]
baseline_errors = [0.09%, 0.11%, 0.1%, 0.13%, 0.09%, ...]

// 2. Run statistical comparison (Mann-Whitney U test)
// H0: canary and baseline have the same distribution
// If p < 0.05 AND canary is worse: FAIL the canary

// 3. Score: pass/marginal/fail for each metric
// Overall: pass if all metrics pass, fail if any metric fails
// Netflix's Kayenta and Google's Canary Analysis Service automate this.
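For readers who want to see step 2 in code, here is a standard-library sketch of a one-sided Mann-Whitney U test. It uses the normal approximation with a continuity correction and no tie correction, so the p-value is only approximate for small or heavily tied samples; a production system would more likely rely on a statistics library. The first pair of samples reuses the visible values from the example above; the "bad" samples are invented.

```python
import math


def mann_whitney_u(canary, baseline):
    """Return (U, p) for the one-sided alternative 'canary tends to be larger'.

    U counts the (canary, baseline) pairs where the canary value is larger,
    with ties counted as one half."""
    n1, n2 = len(canary), len(baseline)
    u = sum(1.0 if c > b else 0.5 if c == b else 0.0
            for c in canary for b in baseline)
    mean = n1 * n2 / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean - 0.5) / std              # continuity correction
    p = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability
    return u, p


def judge(canary, baseline, alpha=0.05):
    """Fail the canary only if it is significantly worse than the baseline."""
    _, p = mann_whitney_u(canary, baseline)
    return "fail" if p < alpha else "pass"


# Error rates in percent, one sample per minute.
canary_errors = [0.1, 0.15, 0.12, 0.2, 0.11]
baseline_errors = [0.09, 0.11, 0.1, 0.13, 0.09]
u, p = mann_whitney_u(canary_errors, baseline_errors)
print(f"U = {u}, p = {p:.3f}, verdict = {judge(canary_errors, baseline_errors)}")

bad_canary = [0.30, 0.35, 0.32, 0.40, 0.31, 0.33, 0.36, 0.38]
long_baseline = [0.09, 0.11, 0.10, 0.13, 0.09, 0.12, 0.10, 0.11]
print(judge(bad_canary, long_baseline))
```

With only five samples per side, the first canary looks slightly worse but does not reach significance. That is one reason canary stages wait several minutes before deciding.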

The simulation below compares the three deployment strategies visually. Watch how traffic shifts between versions.

Deployment Strategy Comparison

Watch how each strategy transitions from v1 (blue) to v2 (green).

Choose a deployment strategy to visualize the rollout.
Quiz: You are deploying a database schema migration that is not backward-compatible. Which deployment strategy is MOST dangerous?

Chapter 6: Feature Flags

Deployment and release are often conflated, but they are different things. Deployment means putting new code on servers. Release means enabling that code for users. A feature flag separates these: you deploy code that is disabled by default, then enable it later — for specific users, a percentage of traffic, or all at once.

Why Decouple Deploy from Release?

// Without feature flags:
Deploy = Release = Risk
// If the new feature has a bug, rollback the entire deployment.

// With feature flags:
Deploy (code on servers, flag OFF) = minimal risk (the new path is not executed)
Release (flag ON for 5%) = small risk, instant disable
// If the new feature has a bug, flip the flag. No deployment needed.

Types of Feature Flags

| Type | Lifetime | Purpose | Example |
| --- | --- | --- | --- |
| Release flag | Days to weeks | Progressive rollout of new feature | Enable new checkout flow for 10% of users |
| Experiment flag | Weeks to months | A/B testing | Show variant A to 50%, variant B to 50% |
| Ops flag | Permanent | Kill switch for risky features | Disable recommendation engine if it's slow |
| Permission flag | Permanent | User-specific access control | Enable beta features for premium users |

Feature flags are a double-edged sword. They give you incredible control over releases. But every flag is a code branch that must be maintained and eventually cleaned up. A codebase with 500 stale feature flags is a maintenance nightmare. Best practice: set an expiration date for every flag. After the feature is fully released, delete the flag and the old code path.

Feature Flag Implementation

python
# Simple feature flag with gradual rollout
import hashlib

class FeatureFlags:
    def __init__(self, config):
        self.config = config  # {"new_checkout": {"pct": 10}}

    def is_enabled(self, flag_name, user_id):
        flag = self.config.get(flag_name)
        if not flag:
            return False

        # Deterministic: same user always gets same result
        hash_val = hashlib.md5(
            f"{flag_name}:{user_id}".encode()
        ).hexdigest()
        bucket = int(hash_val[:8], 16) % 100

        return bucket < flag["pct"]

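# Usage (inside a request handler; new_checkout_flow and old_checkout_flow
# are the application's own handlers and are not defined here):
flags = FeatureFlags({"new_checkout": {"pct": 10}})

def checkout(request, user):
    if flags.is_enabled("new_checkout", user.id):
        return new_checkout_flow(request)
    return old_checkout_flow(request)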

The hash-based approach is critical: the same user always sees the same version. Without deterministic assignment, a user might see the new checkout on page load but the old checkout on form submission — a broken experience.

Feature Flag Lifecycle

1. Create Flag (off)
Register flag in config system. Set expiration date. Deploy code with flag checks.
2. Enable for Team (1%)
Internal dogfooding. Verify basic functionality.
3. Canary (5-10%)
Monitor metrics vs. control group. A/B test if applicable.
4. Ramp (25% → 50% → 100%)
Gradual increase with monitoring at each stage.
5. Clean Up
Remove flag checks from code. Delete old code path. Remove flag from config. THIS STEP IS MANDATORY.
Flag debt is real. Every feature flag is an if-else branch. With 200 flags, your code has 2^200 possible execution paths (though most are impossible). Stale flags make code harder to read, harder to test, and harder to reason about. Set a policy: flags must be cleaned up within 30 days of reaching 100% rollout.
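A cleanup policy like this is only useful if something checks it. The sketch below lists flags that have been at 100% for longer than the 30-day window; the flag schema (`pct`, `full_rollout_on`) and the example flags are hypothetical, and permanent ops or permission flags would need to be excluded in a real system.

```python
from datetime import date, timedelta

CLEANUP_WINDOW = timedelta(days=30)  # policy: clean up within 30 days of 100%


def stale_flags(flags, today):
    """Return the names of flags fully rolled out more than CLEANUP_WINDOW ago.

    `flags` maps a flag name to {"pct": int, "full_rollout_on": date or None}."""
    return sorted(
        name for name, flag in flags.items()
        if flag["pct"] == 100
        and flag["full_rollout_on"] is not None
        and today - flag["full_rollout_on"] > CLEANUP_WINDOW
    )


flags = {
    "new_checkout": {"pct": 100, "full_rollout_on": date(2024, 1, 10)},
    "dark_mode": {"pct": 100, "full_rollout_on": date(2024, 3, 1)},
    "new_search": {"pct": 25, "full_rollout_on": None},
}
print(stale_flags(flags, today=date(2024, 3, 15)))
```

Running a check like this in CI, and failing the build when the list is non-empty, turns the policy from a guideline into a gate.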

The simulation shows how feature flags enable progressive rollout and instant rollback without deployment changes.

Feature Flag Rollout

Deploy code with flag OFF, then gradually enable. Detect a problem? Disable instantly.

Deploy code, then gradually enable the feature flag.
Quiz: You deployed a new feature behind a flag. The flag is enabled for 10% of users. Those users report a bug. What do you do?

Chapter 7: Rollbacks

Things will go wrong. The question is not whether you will need to roll back, but how fast you can do it and how much damage occurs between deploying the bad version and completing the rollback.

Rollback Strategies

| Strategy | Speed | Requirements | Pitfalls |
| --- | --- | --- | --- |
| Re-deploy previous version | 5-30 minutes | Previous artifacts must be available | Full pipeline must run again |
| Instant traffic switch | Seconds | Previous version still running (blue-green) | Requires 2x infrastructure |
| Feature flag disable | Seconds | Feature behind a flag | Only works if the issue is in the flagged code |
| Database rollback | Minutes to hours | Backward-compatible schema | Data loss if writes happened during bad period |

What Makes Rollback Hard

Code rollback is easy. Data rollback is hard. If the bad deployment wrote data in a new format, changed a schema, or made irreversible state changes (sent emails, charged cards), rolling back the code does not roll back the data.

The "expand-contract" pattern for safe schema changes. Step 1 (expand): add the new column and deploy code that writes to BOTH the old and new columns. Step 2 (migrate): backfill existing data. Step 3 (contract): deploy code that only reads/writes the new column, then drop the old column. Each step is independently deployable. Steps 1 and 2 are rollback-safe because both formats remain available; dropping the old column is the point of no return, so do it only after the new code has proven stable.
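The expand and migrate steps can be tried out against an in-memory SQLite database. This is a small demonstration under assumed names (a `users` table with `email` and `email_v2` columns); it stops before the contract step, since that is the one that cannot be undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('ada@example.com')")

# Step 1 (expand): add the new column; new code writes both columns.
conn.execute("ALTER TABLE users ADD COLUMN email_v2 TEXT")


def write_user_new_code(conn, email):
    """The new release: populates the old and the new column."""
    conn.execute("INSERT INTO users (email, email_v2) VALUES (?, ?)", (email, email))


def read_emails_old_code(conn):
    """The previous release: knows nothing about email_v2."""
    return [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]


write_user_new_code(conn, "grace@example.com")

# Step 2 (migrate): backfill rows written before the expand step.
conn.execute("UPDATE users SET email_v2 = email WHERE email_v2 IS NULL")

# A rollback at this point is safe: the old reader still sees every row.
old_view = read_emails_old_code(conn)
missing = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_v2 IS NULL"
).fetchone()[0]
print(old_view, "rows missing email_v2:", missing)
```

Because the new code keeps writing the old column, rows created after the deploy remain visible to the previous release. That property is what makes steps 1 and 2 safe to roll back.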

What Makes Rollback Safe: Backward Compatibility

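A deployment is rollback-safe if the previous version of the code can continue operating correctly after the new version has run. This is a stronger requirement than backward compatibility alone: the new code must read old data, and the old code must also tolerate whatever the new code wrote.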

| Change Type | Rollback-Safe? | Why |
| --- | --- | --- |
| Add optional field | Yes | Old code ignores unknown fields |
| Remove field | No | Old code may depend on the removed field |
| Rename field | No | Old code looks for old name, new code writes new name |
| Change type | No | Old code expects int, new code writes string |
| Add new table | Yes | Old code doesn't query new table |
| Drop table | No | Old code queries the dropped table |

The pattern: adding is safe, removing or changing is not. Every breaking change must be done in multiple steps, with each step being independently rollback-safe.
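The rule can be checked mechanically by feeding records written by the new version to the old version's reader. The sketch below does this for the four field-level changes in the table; `old_reader` and the sample records are invented for illustration.

```python
def old_reader(record):
    """Previous release: reads only the fields it knows about."""
    return {"id": int(record["id"]), "name": record["name"]}


def is_rollback_safe(record_written_by_new_code):
    """True if the previous release can still process the record."""
    try:
        old_reader(record_written_by_new_code)
        return True
    except (KeyError, TypeError, ValueError):
        return False


changes = {
    "add optional field": {"id": 7, "name": "Ada", "nickname": "ada"},
    "remove field": {"id": 7},
    "rename field": {"id": 7, "full_name": "Ada"},
    "change type": {"id": "seven", "name": "Ada"},
}
for change, record in changes.items():
    verdict = "safe" if is_rollback_safe(record) else "NOT safe"
    print(f"{change}: {verdict}")
```

A check like this fits naturally into a medium test that runs the previous release's parsing code against fixtures produced by the current branch.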

Rollback Automation

The best rollback is one that happens automatically. If canary metrics degrade beyond a threshold, the deployment system should roll back without human intervention.

// Auto-rollback triggers:

// 1. Error rate > 2x baseline for 3 minutes
// 2. p99 latency > 3x baseline for 5 minutes
// 3. CPU > 90% for 5 minutes (compared to baseline at ~40%)
// 4. Memory growth > 50MB/minute (memory leak indicator)
// 5. Any crash loop (>3 restarts in 5 minutes)

// The key: auto-rollback must happen FAST.
// Detect: 3 minutes. Rollback: 2 minutes. Total: 5 minutes.
// Without automation: detect (5 min) + page on-call (5 min) +
// diagnose (15 min) + decide (5 min) + rollback (5 min) = 35 min.
// Auto-rollback cuts incident duration by about 85% (35 min → 5 min).

-- Expand-contract migration example:

-- Step 1: EXPAND (deploy writes to both)
ALTER TABLE users ADD COLUMN email_v2 VARCHAR(255);
-- App writes to both 'email' and 'email_v2'
-- Rollback: safe (old column still populated)

-- Step 2: MIGRATE (backfill)
UPDATE users SET email_v2 = email WHERE email_v2 IS NULL;
-- Rollback: safe (old column untouched)

-- Step 3: CONTRACT (switch reads, drop old)
-- App reads from 'email_v2' only
ALTER TABLE users DROP COLUMN email;
-- Rollback: NOT safe after this point (old column gone)
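The auto-rollback triggers listed earlier in this chapter all have the form "metric exceeds a threshold for N minutes". A minimal evaluator for the first two triggers might look like the sketch below; it assumes one metric sample per minute, and the function names and sample values are invented.

```python
def sustained(samples, predicate, minutes):
    """True if the last `minutes` per-minute samples all satisfy `predicate`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(predicate(s) for s in recent)


def should_roll_back(error_rates, p99s, baseline_error, baseline_p99):
    """Evaluate triggers 1 and 2; return the reason for rollback, or None."""
    if sustained(error_rates, lambda e: e > 2 * baseline_error, minutes=3):
        return "error rate > 2x baseline for 3 minutes"
    if sustained(p99s, lambda ms: ms > 3 * baseline_p99, minutes=5):
        return "p99 latency > 3x baseline for 5 minutes"
    return None


# One sample per minute; error rates as fractions, latencies in milliseconds.
errors = [0.001, 0.001, 0.004, 0.005, 0.006]
p99s = [200, 210, 220, 230, 240]
print(should_roll_back(errors, p99s, baseline_error=0.001, baseline_p99=200))

spike = [0.001, 0.001, 0.001, 0.009, 0.001]
print(should_roll_back(spike, p99s, baseline_error=0.001, baseline_p99=200))
```

Requiring the condition to hold for several consecutive minutes keeps a single noisy sample from triggering a rollback, at the cost of a few minutes of detection delay.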

The simulation below shows a bad deployment and different rollback strategies. Watch how quickly each strategy recovers and how much data impact occurs.

Rollback Speed Comparison

A bad deploy goes out. Compare rollback strategies by recovery time and blast radius.

Choose a rollback strategy to compare recovery time.
Quiz: Your new deployment added a database column and the application now writes data in a new format. You discover a bug and roll back the code to the previous version. What problem do you face?

Chapter 8: Canary Deployment Simulator

This is the showcase chapter. You are deploying a new version of your service to production using a canary strategy. Start at 5% traffic, monitor error rate and latency, and decide: promote to 100% or rollback?

How to use the simulator. Click "Deploy Canary (5%)" to start. The canary version may be good or bad — you don't know yet. Watch the error rate and latency metrics. If the canary looks healthy after observation, click "Promote" to increase traffic. If metrics degrade, click "Rollback" to kill the canary. The automatic mode will auto-promote if healthy and auto-rollback if unhealthy.

Canary Deployment Simulator

Deploy a canary, monitor metrics in real time, decide to promote or rollback.

Deploy a canary to begin. Watch metrics carefully before promoting.

What to Monitor During Canary

| Metric | Baseline | Alert Threshold | Why |
| --- | --- | --- | --- |
| Error rate | 0.1% | > 1% | New code may crash on edge cases |
| p50 latency | 50ms | > 2x baseline | New code may have performance regressions |
| p99 latency | 200ms | > 3x baseline | Tail latency often reveals resource contention |
| CPU usage | 40% | > 80% | Inefficient code or infinite loops |
| Memory usage | 2GB | > 4GB | Memory leaks in new code |

Chapter 9: Connections

Testing and deployment are how you ship changes safely. But once the code is in production, you need to observe it — to know whether it is healthy, to detect problems before users do, and to diagnose root causes when things go wrong.

This Lesson vs. Related Topics

| Topic | Focus | Relationship |
| --- | --- | --- |
| Failure Modes & Isolation | What fails and how to contain it | Testing validates your isolation strategies actually work |
| Resiliency Patterns | Runtime patterns for handling failures | Chaos engineering verifies resiliency patterns function correctly |
| Testing & Deployment (this lesson) | How to ship safely and recover fast | The "when" and "how" of getting code into production |
| Observability | Metrics, logs, traces, alerting | Canary deployments rely on observability to decide promote vs. rollback |

Key Takeaways

1. The test pyramid is an investment strategy. Many cheap, fast tests (unit). Fewer expensive, slow tests (E2E). Classify by size, not just type.

2. Chaos engineering tests the real system. Inject real failures into real infrastructure. Form hypotheses. Measure outcomes. File tickets for surprises.

3. Small, frequent deploys are safer than large, rare ones. One change is easy to diagnose. Fifty changes are a mystery.

4. Canary deployments limit blast radius. 5% traffic to canary, 95% on stable. Promote when healthy, rollback when not.

5. Feature flags decouple deploy from release. Ship code anytime. Enable for users when ready. Kill switch instantly.

6. Rollback readiness is a design requirement. Schema changes must be backward-compatible. Data changes must be reversible. Plan for rollback before you deploy.

"If you're afraid to change something, it's a sign you don't have enough automated safety nets." — Martin Fowler
Final quiz: Your team is about to ship a major rewrite of the payment processing service. You want maximum safety. What deployment strategy would you use?