Testing & Deployment — From Absolute Zero to Mastery

Chapter 0: Why Testing Fails

You have 100% code coverage. Every unit test passes. Integration tests are green. The staging environment looks perfect. You deploy to production with confidence. Within 30 minutes, your error rate triples.

What happened? The staging environment had 2 servers. Production has 200. The race condition that occurs when 200 servers simultaneously connect to the database on startup never happened in staging. The test suite tested logic perfectly but never tested behavior under real conditions.

This is the fundamental gap in distributed systems testing: the interesting failures are not logic bugs. They are emergent behaviors that only appear when real traffic hits real infrastructure at real scale with real network conditions.

The core challenge. You cannot test a distributed system by testing its parts in isolation. A distributed system is not just its code: it is its code running on specific hardware, with specific network conditions, handling specific traffic patterns, interacting with specific dependencies. Testing the code alone covers only a fraction of the failure surface (perhaps a third, as a rough rule of thumb). The rest requires testing the system.

The simulation below shows a system that passes all unit tests and integration tests but fails in production due to emergent behavior under load.

The Testing Gap

Left: test results (all green). Right: production behavior under load. The tests miss what matters.

What Tests Miss

Failure Type	Can unit tests catch it?	Can integration tests?	Requires
Race condition under load	No	Rarely	Load testing, chaos engineering
Cascading failure	No	No	Chaos engineering, game days
Configuration drift	No	No	Canary deploys, staged rollout
Dependency version mismatch	Sometimes	Yes, if environments match	Environment parity
Memory leak over 48 hours	No	No	Soak testing, production monitoring
Performance regression	No	Rarely	Benchmark tests, canary metrics

The structure of this lesson. We will cover three layers of defense: testing (what to test and how), deployment (how to ship safely), and rollback (how to recover when things go wrong). The final chapter is a canary deployment simulator where you deploy a new version, monitor its health, and decide whether to promote or roll back.

Quiz: Your test suite has 100% line coverage. Does this guarantee your code is bug-free in production?

Yes: 100% coverage means every line has been tested

No, but it means logic bugs are impossible

No. Line coverage only measures which lines were executed during tests, not whether they were executed under realistic conditions or whether the tests asserted anything meaningful about the results. Race conditions, configuration errors, cascading failures, resource leaks, and performance regressions are all invisible to line coverage. Coverage tells you what was executed; it says nothing about how well it was tested.

Chapter 1: The Test Pyramid

Not all tests are equal. A unit test that checks whether a function adds two numbers correctly runs in 1 millisecond and tests exactly one thing. An end-to-end test that spins up the entire application, opens a browser, fills in a form, and verifies the result in the database takes 30 seconds and tests everything at once.

The test pyramid is a model for how to allocate your testing effort. The base is wide (many cheap, fast tests), and the top is narrow (few expensive, slow tests).

End-to-End Tests (few)

Test the entire system from user input to database. Slow (minutes), brittle, expensive. Catch integration issues between services. 5-10% of tests.

↑

Integration Tests (moderate)

Test interactions between components: API calls, database queries, message queue consumers. Medium speed (seconds). 20-30% of tests.

↑

Unit Tests (many)

Test individual functions and classes in isolation. Fast (milliseconds), stable, cheap. Catch logic bugs. 60-70% of tests.

Why the Pyramid Shape?

Property	Unit	Integration	E2E
Speed	1-10ms	100ms-5s	10s-5min
Reliability	Very stable	Occasionally flaky	Often flaky
Failure specificity	Pinpoints exact function	Narrows to component pair	"Something is broken"
Cost to write	Low	Medium	High
Cost to maintain	Low	Medium	Very high
What it catches	Logic bugs	Interface mismatches	User-visible regressions

The anti-pattern is the ice cream cone: lots of E2E tests, few unit tests. This results in a slow, flaky test suite that everyone ignores. Teams start shipping without waiting for tests because the tests take 45 minutes and fail randomly due to browser timing issues.

The simulation below shows two test suites: a pyramid (many unit, few E2E) and an ice cream cone (many E2E, few unit). Watch how they differ in speed and reliability.

Test Pyramid vs. Ice Cream Cone

Two test suites with the same total coverage. Compare execution time and flakiness.

Quiz: Your team's CI pipeline takes 45 minutes because 80% of the tests are end-to-end browser tests. Developers frequently skip waiting for CI. What is the most effective fix?

Add more CI servers to run tests in parallel

Make E2E tests faster by using headless browsers

Rebalance toward the pyramid: replace most E2E tests with unit and integration tests that cover the same scenarios. The few remaining E2E tests run as a post-merge gate, not a pre-merge blocker. This can cut pre-merge CI from 45 minutes to under 5 minutes while maintaining coverage.

Chapter 2: Test Sizes

The test pyramid categorizes tests by what they test (unit vs. integration vs. E2E). Test sizes categorize tests by how many resources they consume. This is a complementary classification used by large engineering organizations to enforce test quality and CI speed.

The Three Sizes

Size	Time Limit	Resources	Examples
Small	< 1 minute	Single process. No I/O, no network, no disk, no database. Everything mocked or in-memory.	Pure function tests, data structure tests, parser tests
Medium	< 5 minutes	Single machine. Can use localhost network, local database, local file system.	API tests with in-memory DB, gRPC tests, message serialization
Large	< 15 minutes	Multiple machines or external services. Can use real network, real databases, real dependencies.	E2E tests, performance benchmarks, chaos experiments

Why size matters. A "unit test" that calls a real database is misclassified. It has the speed and flakiness profile of an integration test. By classifying tests by size (resource consumption) rather than scope, you get accurate predictions of how long your test suite will take and how likely it is to flake.

Why Size Matters More Than You Think

Consider a test labeled "unit test" that creates a temporary database connection. On the developer's laptop with a fast SSD and no network contention, it runs in 50ms. In CI, running alongside 500 other tests on a shared machine, it takes 3 seconds because of disk I/O contention. The test's timing depends on the environment, so it is effectively nondeterministic: whenever contention pushes it past a timeout, it fails intermittently (a "flaky" test), eroding confidence in the test suite.

A true small test uses no I/O, no network, no disk. It runs in 5ms on any machine, every time, deterministically. Size classification prevents environment-dependent failures from polluting your test suite.
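As a minimal sketch of what a small test can look like, the example below swaps the database for an in-memory fake. The names `InMemoryUserStore` and `register_user` are hypothetical and chosen only for illustration; they are not taken from any particular library.

```python
# Illustrative only: InMemoryUserStore and register_user are hypothetical names.
class InMemoryUserStore:
    """Stands in for a database; lives entirely in process memory."""

    def __init__(self):
        self._users = {}

    def add(self, email, name):
        if email in self._users:
            raise ValueError(f"duplicate email: {email}")
        self._users[email] = name

    def get(self, email):
        return self._users.get(email)


def register_user(store, email, name):
    """Logic under test: normalize the email, then store the user."""
    normalized = email.strip().lower()
    store.add(normalized, name)
    return normalized


def test_register_user_normalizes_email():
    # Small test: no disk, no network, no database process.
    store = InMemoryUserStore()
    assert register_user(store, "  Ada@Example.COM ", "Ada") == "ada@example.com"
    assert store.get("ada@example.com") == "Ada"


test_register_user_normalizes_email()
```

The same `register_user` logic could also be exercised against a real database in a separate medium test; the point of the split is that the fast, deterministic check does not depend on that slower one.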

The Real Cost of Flaky Tests

// Flaky test cost calculation:

Test suite: 5,000 tests
Flake rate: 2% (100 tests are occasionally flaky)
Each flake has 5% chance of failing on any run

P(at least one flake per run) = 1 - (0.95)¹⁰⁰ = 99.4%

// Nearly EVERY CI run has at least one flaky failure.
// Developers learn to re-run failed tests without investigating.
// When a REAL failure occurs, it gets re-run too and goes unnoticed.
// This is "the boy who cried wolf" applied to CI.

// Developer time wasted per day (10 developers, 5 CI runs each):
50 runs × 99.4% flake × 3 min to re-run = 149 min/day wasted
// That's 2.5 hours of engineering time per day, every day.
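The arithmetic above can be reproduced with a short script. This is a sketch that assumes flaky failures are independent of each other, which is the same assumption the formula makes; the function names are illustrative.

```python
def p_any_flake(flaky_tests, p_fail_each):
    """Probability that at least one flaky test fails in a single run,
    assuming the failures are independent."""
    return 1.0 - (1.0 - p_fail_each) ** flaky_tests


def wasted_minutes_per_day(runs_per_day, p_flaky_run, rerun_minutes):
    """Expected minutes spent re-running CI because of flakes."""
    return runs_per_day * p_flaky_run * rerun_minutes


p = p_any_flake(flaky_tests=100, p_fail_each=0.05)
print(f"P(at least one flake per run) = {p:.1%}")                    # 99.4%
print(f"Wasted per day: {wasted_minutes_per_day(50, p, 3):.0f} min")  # 149 min
```

Changing `flaky_tests` shows how quickly the situation improves as flaky tests are fixed or quarantined.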

Size Budgets

Some organizations enforce size budgets: at least 80% of tests must be small, at most 15% medium, and at most 5% large. This ensures the test suite stays fast and reliable as the codebase grows.

// Test size budget enforcement:
Total tests: 10,000
Small (< 1 min, no I/O): 8,500 (85%) → total time: ~2 min parallel
Medium (< 5 min, local): 1,200 (12%) → total time: ~8 min parallel
Large (< 15 min, external): 300 (3%) → total time: ~15 min parallel

// Compare without budget (ice cream cone):
Small: 1,000 (10%) → ~20 sec
Medium: 3,000 (30%) → ~15 min
Large: 6,000 (60%) → ~90 min, many flakes
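A budget like this is easy to check mechanically. The sketch below assumes test counts per size are already available (for example from test metadata); the `BUDGET` structure and `check_size_budget` are illustrative names, not part of any specific CI tool.

```python
# (minimum share, maximum share) per size; None means "no bound".
BUDGET = {
    "small": (0.80, None),
    "medium": (None, 0.15),
    "large": (None, 0.05),
}


def check_size_budget(counts):
    """Return a list of human-readable budget violations (empty if within budget)."""
    total = sum(counts.values())
    violations = []
    for size, (min_share, max_share) in BUDGET.items():
        share = counts.get(size, 0) / total
        if min_share is not None and share < min_share:
            violations.append(f"{size}: {share:.0%} is below the {min_share:.0%} minimum")
        if max_share is not None and share > max_share:
            violations.append(f"{size}: {share:.0%} is above the {max_share:.0%} maximum")
    return violations


print(check_size_budget({"small": 8500, "medium": 1200, "large": 300}))   # []
print(check_size_budget({"small": 1000, "medium": 3000, "large": 6000}))  # three violations
```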

The simulation below shows how test size distribution affects total CI time. Adjust the slider to change the ratio of small to large tests.

Test Size Distribution & CI Time

10,000 tests. Adjust the size distribution and see how total CI time and flake rate change.

Quiz: A test labeled "unit test" creates a temporary SQLite file, writes data, reads it back, and deletes the file. What size is this test?

Small: it's testing a single function

Medium. It performs disk I/O (creating and deleting a file), which means it requires a real file system. By the size classification, any test that touches the file system, the network, or an external process is at least Medium. The label "unit test" is misleading: what matters is the resources consumed, not the name.

Large: it uses a database

Chapter 3: Chaos Engineering

You have tests for the happy path. You have tests for known error conditions. But what about the errors you have not imagined? What about the interaction between a network partition and a GC pause and a traffic spike — all happening simultaneously?

Chaos engineering is the practice of deliberately injecting failures into a running system to discover weaknesses before they cause real outages. Instead of waiting for failures to find you, you find them first.

The Scientific Method of Chaos

1. Form a Hypothesis

"If we kill one database replica, the system will failover to the standby within 5 seconds and users will see no errors."

↓

2. Design the Experiment

Kill a specific database replica during normal traffic. Measure error rate, latency, and failover time.

↓

3. Limit Blast Radius

Run the experiment during low traffic. Have a kill switch to stop the experiment instantly. Alert the on-call team.

↓

4. Run and Observe

Execute the experiment. Monitor all metrics. Record what happens.

↓

5. Analyze

Did the hypothesis hold? If not, what broke? What was the actual blast radius? File a ticket for every surprise.
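The hypothesis and analysis steps can be made explicit in code, so that "did the hypothesis hold?" has a mechanical answer. The following is a sketch with hypothetical names (`Hypothesis`, `Observation`, `analyze`); the observed values would come from your monitoring system, which is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    description: str
    max_failover_seconds: float
    max_error_rate: float  # fraction of requests, e.g. 0.01 means 1%


@dataclass(frozen=True)
class Observation:
    failover_seconds: float
    error_rate: float


def analyze(hypothesis, observed):
    """Return a list of surprises; an empty list means the hypothesis held."""
    surprises = []
    if observed.failover_seconds > hypothesis.max_failover_seconds:
        surprises.append(
            f"failover took {observed.failover_seconds}s, "
            f"expected <= {hypothesis.max_failover_seconds}s"
        )
    if observed.error_rate > hypothesis.max_error_rate:
        surprises.append(
            f"error rate was {observed.error_rate:.1%}, "
            f"expected <= {hypothesis.max_error_rate:.1%}"
        )
    return surprises


hypothesis = Hypothesis(
    description="Killing one replica fails over within 5s with no user-visible errors",
    max_failover_seconds=5.0,
    max_error_rate=0.0,
)
for surprise in analyze(hypothesis, Observation(failover_seconds=48.0, error_rate=0.03)):
    print("file a ticket:", surprise)
```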

Common Chaos Experiments

Experiment	What it tests	Common surprises
Kill a server	Failover, load redistribution	Failover takes 10x longer than expected
Network latency injection	Timeout handling, circuit breakers	No timeouts configured (infinite wait)
DNS failure	DNS caching, fallback resolution	DNS cache expires during outage, total failure
Disk full	Disk space monitoring, graceful degradation	Log files fill disk, process crashes on write
Clock skew	Time-dependent logic, certificates	TLS certificates rejected, lease expirations
CPU exhaustion	Graceful degradation under load	Health checks fail, node removed from cluster

Start small, in staging. Do not inject chaos into production on day one. Start with staging, with a single experiment, with a known blast radius. Build confidence. Graduate to production only when you have kill switches, monitoring, and team buy-in. The goal is learning, not heroics.

Game Days

A game day is a scheduled chaos engineering exercise where the entire team participates. Unlike automated chaos experiments that run in the background, game days are planned events where engineers deliberately break the system, observe the response, and practice incident response procedures.

Game days serve a dual purpose: they test the system's resilience and the team's incident response capability. A system might fail over automatically and flawlessly, but if the on-call engineer doesn't know how to read the dashboard or escalate correctly, the human response is the bottleneck.

Game Day Element	Purpose
Pre-announced schedule	Team knows it's coming, can prepare
Documented hypothesis	"We expect failover in <30s with <1% error spike"
Kill switch	One command to stop the experiment instantly
Observers	Teammates watch dashboards, note surprises
Post-mortem	Document findings, file tickets for gaps

Formal Verification: TLA+

For the most critical distributed algorithms (consensus protocols, database replication), chaos engineering is not enough. You cannot test every possible interleaving of events. Formal verification tools like TLA+ (created by Leslie Lamport) let you write a precise specification of the algorithm and have a model checker explore every reachable state of that specification, checking that your invariants hold in each one.

Amazon uses TLA+ extensively. Their engineers have found subtle bugs in DynamoDB, S3, and other systems — bugs that would be nearly impossible to find through testing alone because they require specific, rare combinations of concurrent events.

// Why formal verification matters:
// A system with 5 nodes, 3 message types, and 4 states per node has:
Possible states = 4⁵ × 3¹⁰ = 60,466,176 state combinations

// No test suite will cover all of them.
// The TLA+ model checker (TLC) exhaustively checks every reachable state of a finite model.
// It found a bug in the Raft consensus protocol
// that occurs only when 3 specific events happen in a specific order.
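To give a feel for what "exhaustively checks every reachable state" means, here is a toy state-space explorer in Python. It is not TLA+ and not how TLC is implemented; it is only a sketch of the idea, applied to a deliberately tiny model: two processes that each increment a shared counter with a non-atomic read followed by a write.

```python
from collections import deque

# State: (counter, ((pc0, tmp0), (pc1, tmp1)))
# pc: 0 = about to read, 1 = about to write, 2 = done


def next_states(state):
    """Yield every state reachable by letting one process take one step."""
    counter, procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == 0:      # read the shared counter into a local variable
            new_proc, new_counter = (1, counter), counter
        elif pc == 1:    # write local value + 1 back to the shared counter
            new_proc, new_counter = (2, tmp), tmp + 1
        else:
            continue     # this process has finished
        yield (new_counter, procs[:i] + (new_proc,) + procs[i + 1:])


def explore(initial):
    """Breadth-first search over all interleavings; collect invariant violations."""
    seen = {initial}
    queue = deque([initial])
    violations = []
    while queue:
        state = queue.popleft()
        counter, procs = state
        # Invariant: once everyone is done, the counter equals the number of processes.
        if all(pc == 2 for pc, _ in procs) and counter != len(procs):
            violations.append(state)
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, violations


seen, violations = explore((0, ((0, 0), (0, 0))))
print(f"explored {len(seen)} states, violations: {violations}")
```

The search finds the lost-update interleaving (both processes read 0, then both write 1) without anyone having to think of that test case in advance, which is the property that makes model checking valuable for concurrent protocols.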

The simulation below lets you run chaos experiments on a simple distributed system. Form a hypothesis, inject a fault, and see if the system behaves as expected.

Chaos Experiment Runner

A 3-node cluster with a load balancer. Inject faults and observe the system's response.

Quiz: You want to test whether your system handles a database failover correctly. What is the chaos engineering approach?

Wait for the database to fail naturally and observe

Mock the database failure in unit tests

Form a hypothesis ("failover completes in <5s with <1% error spike"), deliberately kill the primary database in a controlled setting, measure actual failover time and error rate, compare to hypothesis, and fix any gaps. This tests the REAL system, not a mock.

Chapter 4: CI/CD Pipeline

A CI/CD pipeline (Continuous Integration / Continuous Delivery) automates the path from code commit to production deployment. The goal is to make deployments frequent, small, and safe — rather than rare, large, and terrifying.

Pipeline Stages

1. Commit

Developer pushes code. Triggers the pipeline automatically.

↓

2. Build

Compile code, build container images, generate artifacts. ~1-5 minutes.

↓

3. Unit Tests

Run all small tests. Fast feedback on logic correctness. ~1-3 minutes.

↓

4. Integration Tests

Run medium tests against real (or containerized) dependencies. ~3-10 minutes.

↓

5. Security & Lint

Static analysis, dependency vulnerability scanning, code style checks. ~1-3 minutes.

↓

6. Deploy to Staging

Automatically deploy to a staging environment. Run smoke tests. ~5-10 minutes.

↓

7. Deploy to Production

Canary or rolling deployment to production. Monitor metrics. ~10-30 minutes.

The key metric: lead time. Lead time is the time from code commit to running in production. Elite teams achieve lead time under 1 hour. The stage estimates above add up to roughly 20-60 minutes. Any stage that takes longer than 15 minutes is a bottleneck that needs optimization.

The Four Key Metrics (DORA)

The DevOps Research and Assessment (DORA) program identified four metrics that distinguish elite engineering teams from the rest. The bands below are representative; the exact thresholds shift between the annual reports:

Metric	Elite	High	Medium	Low
Deployment frequency	Multiple/day	Weekly-monthly	Monthly-6 months	6+ months
Lead time for changes	< 1 hour	1 day - 1 week	1 week - 1 month	1-6 months
Time to restore	< 1 hour	< 1 day	1 day - 1 week	1 week - 1 month
Change failure rate	0-15%	16-30%	16-30%	46-60%

Speed and stability are NOT trade-offs. The DORA data shows that elite teams deploy more frequently AND have lower failure rates AND recover faster. Fast feedback loops (frequent deploys, fast CI, canary metrics) catch problems earlier, when they are cheaper to fix. Slow, batched releases accumulate risk.
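Three of the four metrics can be computed directly from deployment records. The sketch below assumes a hypothetical `Deploy` record with commit time, deploy time, and a failure flag; time to restore would need incident data as well and is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool  # did this change need a rollback or hotfix?


def deployment_frequency(deploys, window_days):
    """Average number of deploys per day over the window."""
    return len(deploys) / window_days


def median_lead_time(deploys):
    """Median time from commit to running in production."""
    return median(d.deployed_at - d.committed_at for d in deploys)


def change_failure_rate(deploys):
    """Fraction of deploys that caused a failure."""
    return sum(d.caused_failure for d in deploys) / len(deploys)


start = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deploy(start, start + timedelta(minutes=40), False),
    Deploy(start + timedelta(hours=2), start + timedelta(hours=2, minutes=55), True),
    Deploy(start + timedelta(hours=5), start + timedelta(hours=5, minutes=30), False),
]
print(deployment_frequency(deploys, window_days=1))   # 3.0 deploys per day
print(median_lead_time(deploys))                      # 0:40:00
print(f"{change_failure_rate(deploys):.0%}")          # 33%
```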

Pipeline Optimization Techniques

// Optimization 1: Parallelize independent stages
Sequential: Build(3m) → Unit(5m) → Lint(2m) → Security(3m) = 13m
Parallel: Build(3m) → [Unit | Lint | Security] (5m) = 8m (38% faster)

// Optimization 2: Incremental testing
Changed files: src/payment/charge.py
Full suite: 10,000 tests (15 min)
Affected tests only: 120 tests (20 sec)
// Run affected tests pre-merge, full suite post-merge.

// Optimization 3: Build caching
Full Docker build: 8 minutes
With layer cache: 45 seconds (only changed layers rebuild)
// Use deterministic build inputs so cache hits are reliable.
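The parallelization numbers in Optimization 1 follow from a simple rule: stages that run side by side cost as much as the slowest of them. A minimal sketch, with illustrative function names:

```python
def sequential_minutes(stages):
    """Every stage waits for the previous one."""
    return sum(stages.values())


def parallel_minutes(build_minutes, independent_stages):
    """Stages that depend only on the build run side by side;
    the slowest one determines when the group finishes."""
    return build_minutes + max(independent_stages.values())


checks = {"unit": 5, "lint": 2, "security": 3}
seq = sequential_minutes({"build": 3, **checks})
par = parallel_minutes(3, checks)
print(f"sequential={seq}m parallel={par}m saved={(seq - par) / seq:.0%}")
# sequential=13m parallel=8m saved=38%
```

The same rule shows where further effort pays off: once stages run in parallel, only speeding up the slowest stage (here, unit tests) shortens the pipeline.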

Pipeline Anti-Patterns

Anti-Pattern	Symptom	Fix
Gated release trains	Deploy once a week with 50 changes batched	Deploy each change independently, multiple times per day
Manual approval gates	Deploys wait hours for a manager to click "Approve"	Automate approval with metrics-based promotion
Shared staging	Teams block each other waiting for staging slots	Ephemeral environments per pull request
Flaky tests	Tests fail randomly, developers re-run until green	Quarantine flaky tests, fix root causes, enforce size budgets

The simulation shows a CI/CD pipeline processing commits. Watch how different pipeline configurations affect lead time and deployment frequency.

CI/CD Pipeline Simulator

Commits flow through pipeline stages. Green = passing, red = failing. Watch lead time.

Quiz: Your team deploys once a week, batching 30-50 changes per release. Last week's release had a bug that took 8 hours to diagnose because you couldn't tell which of the 42 changes caused it. What deployment practice would prevent this?

More thorough code reviews before each release

Deploy each change independently, multiple times per day. If a bug appears, the most recent 1-2 changes are the suspects, not 42. Small, frequent deployments make bugs easy to find and easy to roll back. This is the core principle of continuous delivery.

Better staging environment testing

Chapter 5: Deployment Strategies

You have a new version of your service ready to deploy. How do you get it into production without causing an outage? There are three major strategies, each with different trade-offs.

Rolling Deployment

Replace instances one at a time. At any moment, some instances run the old version and some run the new. The rollout progresses gradually over minutes or hours.

// Rolling deployment: 10 servers
t=0: [v1 v1 v1 v1 v1 v1 v1 v1 v1 v1]
t=1: [v2 v1 v1 v1 v1 v1 v1 v1 v1 v1]
t=2: [v2 v2 v1 v1 v1 v1 v1 v1 v1 v1]
... ...
t=10: [v2 v2 v2 v2 v2 v2 v2 v2 v2 v2]
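The timeline above can be reproduced with a small generator. This is a sketch, not a real orchestrator: the `is_healthy` callback is a hypothetical stand-in for whatever health check your platform runs after each batch.

```python
def rolling_deploy(fleet, new_version, batch_size=1, is_healthy=lambda version: True):
    """Yield a snapshot of the fleet after each batch; stop early if a batch is unhealthy."""
    fleet = list(fleet)
    yield tuple(fleet)
    for start in range(0, len(fleet), batch_size):
        for i in range(start, min(start + batch_size, len(fleet))):
            fleet[i] = new_version
        yield tuple(fleet)
        if not is_healthy(new_version):
            return  # halt: the remaining instances stay on the old version


snapshots = list(rolling_deploy(["v1"] * 10, "v2"))
for t, snapshot in enumerate(snapshots):
    print(f"t={t}: [{' '.join(snapshot)}]")
```

Every snapshot between the first and the last contains both versions, which is why rolling deployments require the two versions to be compatible with each other.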

Blue-Green Deployment

Run two identical environments: "blue" (current production) and "green" (new version). Deploy the new version entirely to green. When green is healthy, switch the load balancer to route all traffic from blue to green. Instant cutover.

Canary Deployment

Deploy the new version to a tiny fraction of traffic (e.g., 5%) and monitor for errors. If the canary is healthy, gradually increase traffic (10%, 25%, 50%, 100%). If the canary shows problems, roll back only the canary; the other 95% of users were never affected.

Comparison

Strategy	Rollback Speed	Resource Cost	Version Mixing	Blast Radius
Rolling	Slow (must re-roll)	Low (in-place)	Yes (mixed during rollout)	Gradual increase
Blue-Green	Instant (switch back)	2x (two environments)	No (instant switch)	100% (all-or-nothing)
Canary	Instant (kill canary)	Low (small canary pool)	Yes (canary vs. baseline)	5-10% during testing

Canary is the gold standard. It combines the best properties: low blast radius (only canary traffic is at risk), instant rollback (kill the canary), and real production validation (canary handles real traffic). Many large-scale systems use canary deployments.

Canary in Detail: The Progression

A typical canary deployment follows a multi-stage progression with automated health checks at each gate:

Stage 1: Deploy to 1%

Deploy to a single instance or 1% of traffic. Wait 5 minutes. Check error rate, latency, and resource metrics against baseline.

↓ all metrics within 2x baseline

Stage 2: Promote to 5%

Increase to 5% of traffic. Wait 10 minutes. Same health checks. This catches issues that only appear at slightly higher load.

↓ all metrics within 1.5x baseline

Stage 3: Promote to 25%

Quarter of traffic. Wait 15 minutes. At this scale, statistical significance improves — rare edge cases start surfacing.

↓ all metrics within 1.2x baseline

Stage 4: Promote to 50%

Half of traffic. Wait 15 minutes. Any version-interaction issues (old version talking to new version) will appear here.

↓ pass

Stage 5: Promote to 100%

Complete rollout. Continue monitoring for 30 minutes after full promotion.

The key detail: the wait times lengthen as the rollout progresses (5, 10, then 15 minutes per stage, and 30 minutes of monitoring after full promotion) because some bugs only manifest after minutes of operation (e.g., slow memory leaks, connection pool exhaustion, cache warming issues).
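The staged progression can be sketched as a loop over gates. In this sketch the `observe` callback is a hypothetical hook that shifts traffic, waits, and returns the canary's metrics; the 1.2x ratio used for the 50% gate is an assumption, since the progression above only says "pass" at that step.

```python
# (traffic %, wait minutes, max allowed ratio of canary metric to baseline)
STAGES = [
    (1, 5, 2.0),
    (5, 10, 1.5),
    (25, 15, 1.2),
    (50, 15, 1.2),  # assumption: reuse the 25% gate's ratio
]


def within_gate(canary, baseline, max_ratio):
    """True if every canary metric is within max_ratio times its baseline."""
    return all(canary[name] <= baseline[name] * max_ratio for name in baseline)


def run_canary(baseline, observe):
    """Walk the stages; roll back at the first failed gate, else promote to 100%."""
    for traffic_pct, wait_minutes, max_ratio in STAGES:
        canary = observe(traffic_pct, wait_minutes)
        if not within_gate(canary, baseline, max_ratio):
            return ("rolled_back", traffic_pct)
    return ("promoted", 100)


baseline = {"error_rate": 0.001, "p99_ms": 200.0}


def healthy(traffic_pct, wait_minutes):
    return dict(baseline)


def degrades_with_load(traffic_pct, wait_minutes):
    # Hypothetical regression: tail latency grows with the share of traffic.
    return {"error_rate": 0.001, "p99_ms": 200.0 * (1 + traffic_pct / 50)}


print(run_canary(baseline, healthy))             # ('promoted', 100)
print(run_canary(baseline, degrades_with_load))  # ('rolled_back', 25)
```

The second run shows why the later stages matter: the regression is invisible at 1% and 5% and only trips a gate once the canary carries a quarter of the traffic.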

Automated Canary Analysis

Manual canary evaluation does not scale. Automated canary analysis (ACA) compares the canary's metrics to the baseline using statistical tests:

// Automated canary analysis:

// 1. Collect metrics from canary and baseline for N minutes
canary_errors = [0.1%, 0.15%, 0.12%, 0.2%, 0.11%, ...]
baseline_errors = [0.09%, 0.11%, 0.1%, 0.13%, 0.09%, ...]

// 2. Run statistical comparison (Mann-Whitney U test)
// H0: canary and baseline have the same distribution
// If p < 0.05 AND canary is worse: FAIL the canary

// 3. Score: pass/marginal/fail for each metric
// Overall: pass if all metrics pass, fail if any metric fails
// Netflix's Kayenta and Google's Canary Analysis Service automate this.
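For illustration, the comparison step can be written with only the standard library. This sketch computes the Mann-Whitney U statistic and a one-sided p-value using the normal approximation without tie correction, which is rough for small samples; production tools use more careful statistics, and this is not the implementation used by Kayenta or any other named system. The one-sided test folds "p < 0.05 AND canary is worse" into a single check.

```python
import math


def mann_whitney_one_sided(canary, baseline):
    """Return (U, p) where p is the one-sided p-value for 'canary tends to be larger'.
    Uses average ranks for ties and the normal approximation (no tie correction)."""
    combined = sorted(
        [(value, "canary") for value in canary]
        + [(value, "baseline") for value in baseline]
    )
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(canary), len(baseline)
    rank_sum = sum(r for r, (_, group) in zip(ranks, combined) if group == "canary")
    u = rank_sum - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / std
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return u, p


def canary_verdict(canary, baseline, alpha=0.05):
    _, p = mann_whitney_one_sided(canary, baseline)
    return "fail" if p < alpha else "pass"


# Error rates in percent; the first five values of each series above.
canary_errors = [0.1, 0.15, 0.12, 0.2, 0.11]
baseline_errors = [0.09, 0.11, 0.1, 0.13, 0.09]
print(canary_verdict(canary_errors, baseline_errors))
```

With only a handful of samples per group the test has little power, which is one reason canary stages need to run long enough to collect a meaningful number of measurements.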

The simulation below compares the three deployment strategies visually. Watch how traffic shifts between versions.

Deployment Strategy Comparison

Watch how each strategy transitions from v1 (blue) to v2 (green).

Quiz: You are deploying a database schema migration that is not backward-compatible. Which deployment strategy is MOST dangerous?

Rolling deployment. During the rollout, some instances run v1 (expecting the old schema) and some run v2 (expecting the new schema). If the migration has already run, v1 instances will break. If it hasn't, v2 instances will break. Rolling deployments require backward-compatible changes. For breaking schema changes, use blue-green with a migration step, or split the migration into backward-compatible steps.

Blue-green deployment

Canary deployment

Chapter 6: Feature Flags

Deployment and release are often conflated, but they are different things. Deployment means putting new code on servers. Release means enabling that code for users. A feature flag separates these: you deploy code that is disabled by default, then enable it later — for specific users, a percentage of traffic, or all at once.

Why Decouple Deploy from Release?

// Without feature flags:
Deploy = Release = Risk
// If the new feature has a bug, roll back the entire deployment.

// With feature flags:
Deploy (code on servers, flag OFF) = zero risk
Release (flag ON for 5%) = small risk, instant disable
// If the new feature has a bug, flip the flag. No deployment needed.

Types of Feature Flags

Type	Lifetime	Purpose	Example
Release flag	Days to weeks	Progressive rollout of new feature	Enable new checkout flow for 10% of users
Experiment flag	Weeks to months	A/B testing	Show variant A to 50%, variant B to 50%
Ops flag	Permanent	Kill switch for risky features	Disable recommendation engine if it's slow
Permission flag	Permanent	User-specific access control	Enable beta features for premium users

Feature flags are a double-edged sword. They give you incredible control over releases. But every flag is a code branch that must be maintained and eventually cleaned up. A codebase with 500 stale feature flags is a maintenance nightmare. Best practice: set an expiration date for every flag. After the feature is fully released, delete the flag and the old code path.

Feature Flag Implementation

python
# Simple feature flag with gradual rollout
import hashlib

class FeatureFlags:
    def __init__(self, config):
        self.config = config  # {"new_checkout": {"pct": 10}}

    def is_enabled(self, flag_name, user_id):
        flag = self.config.get(flag_name)
        if not flag:
            return False

        # Deterministic: same user always gets same result
        hash_val = hashlib.md5(
            f"{flag_name}:{user_id}".encode()
        ).hexdigest()
        bucket = int(hash_val[:8], 16) % 100

        return bucket < flag["pct"]

# Usage (new_checkout_flow and old_checkout_flow are placeholder handlers):
flags = FeatureFlags({"new_checkout": {"pct": 10}})

def checkout(request, user_id):
    if flags.is_enabled("new_checkout", user_id):
        return new_checkout_flow(request)
    return old_checkout_flow(request)

The hash-based approach is critical: the same user always sees the same version. Without deterministic assignment, a user might see the new checkout on page load but the old checkout on form submission — a broken experience.
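A second useful property of hash bucketing is that ramps are monotonic: every user enabled at 10% stays enabled at 25%, so nobody flips back to the old experience as the rollout grows. The sketch below checks this with the same bucketing scheme as above; the user IDs are simply the integers 0 to 9,999, chosen for illustration.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag_name, user_id, pct):
    return bucket(flag_name, user_id) < pct


users = range(10_000)
at_10 = {u for u in users if is_enabled("new_checkout", u, 10)}
at_25 = {u for u in users if is_enabled("new_checkout", u, 25)}

print(len(at_10), len(at_25))  # roughly 1,000 and 2,500
assert at_10 <= at_25          # everyone enabled at 10% is still enabled at 25%
```

Including the flag name in the hashed string means each flag gets its own bucketing, so the same users are not always the first to receive every new feature.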

Feature Flag Lifecycle

1. Create Flag (off)

↓

2. Enable for Team (1%)

Internal dogfooding. Verify basic functionality.

↓

3. Canary (5-10%)

Monitor metrics vs. control group. A/B test if applicable.

↓

4. Ramp (25% → 50% → 100%)

Gradual increase with monitoring at each stage.

↓

5. Clean Up

Remove flag checks from code. Delete old code path. Remove flag from config. THIS STEP IS MANDATORY.

Flag debt is real. Every feature flag is an if-else branch. With 200 independent boolean flags, there are up to 2²⁰⁰ flag combinations your code could run under (though most never occur in practice). Stale flags make code harder to read, harder to test, and harder to reason about. Set a policy: flags must be cleaned up within 30 days of reaching 100% rollout.

The simulation shows how feature flags enable progressive rollout and instant rollback without deployment changes.

Feature Flag Rollout

Deploy code with flag OFF, then gradually enable. Detect a problem? Disable instantly.

Quiz: You deployed a new feature behind a flag. The flag is enabled for 10% of users. Those users report a bug. What do you do?

Flip the flag to OFF. This disables the feature for all users within seconds, with no deployment needed. The code is still on the servers, but the flagged code path no longer runs. You can then fix the bug at your own pace and re-enable the flag after the fix is deployed. No code rollback is required, though any data the feature already wrote may still need checking.

Roll back the entire deployment to the previous version

Deploy a hotfix immediately

Chapter 7: Rollbacks

Things will go wrong. The question is not whether you will need to roll back, but how fast you can do it and how much damage occurs between deploying the bad version and completing the rollback.

Rollback Strategies

Strategy	Speed	Requirements	Pitfalls
Re-deploy previous version	5-30 minutes	Previous artifacts must be available	Full pipeline must run again
Instant traffic switch	Seconds	Previous version still running (blue-green)	Requires 2x infrastructure
Feature flag disable	Seconds	Feature behind a flag	Only works if the issue is in the flagged code
Database rollback	Minutes to hours	Backward-compatible schema	Data loss if writes happened during bad period

What Makes Rollback Hard

Code rollback is easy. Data rollback is hard. If the bad deployment wrote data in a new format, changed a schema, or made irreversible state changes (sent emails, charged cards), rolling back the code does not roll back the data.

The "expand-contract" pattern for safe schema changes. Step 1 (expand): deploy code that writes to BOTH old and new schema columns. Step 2 (migrate): backfill existing data. Step 3 (contract): deploy code that only reads/writes the new column, then drop the old column. Each step is independently deployable, and every step up to the final drop is rollback-safe because both formats are still available. Dropping the old column is the point of no return, so do it only after the new code has proven stable.

What Makes Rollback Safe: Backward Compatibility

A deployment is rollback-safe if the previous version of the code can continue operating correctly after the new version has run. Put differently, the old code must be able to read whatever data and schema the new version leaves behind.

Change Type	Rollback-Safe?	Why
Add optional field	Yes	Old code ignores unknown fields
Remove field	No	Old code may depend on the removed field
Rename field	No	Old code looks for old name, new code writes new name
Change type	No	Old code expects int, new code writes string
Add new table	Yes	Old code doesn't query new table
Drop table	No	Old code queries the dropped table

The pattern: adding is safe, removing or changing is not. Every breaking change must be done in multiple steps, with each step being independently rollback-safe.
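The first and third rows of the table can be demonstrated with a small "tolerant reader". This is a sketch with a hypothetical `parse_user_v1` function standing in for the previous version of the code; the JSON payloads are made up for illustration.

```python
import json


def parse_user_v1(payload):
    """Previous version of the code: reads only the fields it knows about."""
    data = json.loads(payload)
    return {"id": data["id"], "email": data["email"]}


# The new version ADDED an optional field: the old reader ignores it.
added = '{"id": 1, "email": "ada@example.com", "nickname": "ada"}'
print(parse_user_v1(added))  # {'id': 1, 'email': 'ada@example.com'}

# The new version RENAMED a field: the old reader breaks after a rollback.
renamed = '{"id": 1, "email_address": "ada@example.com"}'
try:
    parse_user_v1(renamed)
except KeyError as missing:
    print(f"old code cannot find field {missing}")
```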

Rollback Automation

The best rollback is one that happens automatically. If canary metrics degrade beyond a threshold, the deployment system should roll back without human intervention.

// Auto-rollback triggers:

// 1. Error rate > 2x baseline for 3 minutes
// 2. p99 latency > 3x baseline for 5 minutes
// 3. CPU > 90% for 5 minutes (compared to baseline at ~40%)
// 4. Memory growth > 50MB/minute (memory leak indicator)
// 5. Any crash loop (>3 restarts in 5 minutes)

// The key: auto-rollback must happen FAST.
// Detect: 3 minutes. Rollback: 2 minutes. Total: 5 minutes.
// Without automation: detect (5 min) + page on-call (5 min) +
// diagnose (15 min) + decide (5 min) + rollback (5 min) = 35 min.
// Auto-rollback cuts incident duration by 85%.
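Several of these triggers share one shape: a metric stays above a limit for N consecutive minutes. The sketch below assumes per-minute samples are already collected into lists (oldest first); the dictionary keys and function names are illustrative, not a real monitoring API.

```python
def sustained(samples, limit, minutes):
    """True if the most recent `minutes` per-minute samples all exceed `limit`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(sample > limit for sample in recent)


def rollback_reasons(window, baseline):
    """Return the list of triggers that fired; roll back if it is non-empty."""
    reasons = []
    if sustained(window["error_rate"], 2 * baseline["error_rate"], minutes=3):
        reasons.append("error rate > 2x baseline for 3 minutes")
    if sustained(window["p99_ms"], 3 * baseline["p99_ms"], minutes=5):
        reasons.append("p99 latency > 3x baseline for 5 minutes")
    if sustained(window["cpu_pct"], 90, minutes=5):
        reasons.append("CPU > 90% for 5 minutes")
    if window["restarts_last_5_min"] > 3:
        reasons.append("crash loop")
    return reasons


baseline = {"error_rate": 0.001, "p99_ms": 200}
window = {
    "error_rate": [0.001, 0.003, 0.004, 0.005],  # last 3 samples exceed 0.002
    "p99_ms": [210, 220, 215, 230, 225],
    "cpu_pct": [45, 50, 48, 47, 52],
    "restarts_last_5_min": 0,
}
print(rollback_reasons(window, baseline))  # ['error rate > 2x baseline for 3 minutes']
```

Requiring a sustained breach rather than a single bad sample keeps one noisy minute from triggering an unnecessary rollback.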

-- Expand-contract migration example:

-- Step 1: EXPAND (deploy writes to both)
ALTER TABLE users ADD COLUMN email_v2 VARCHAR(255);
-- App writes to both 'email' and 'email_v2'
-- Rollback: safe, old column still populated

-- Step 2: MIGRATE (backfill)
UPDATE users SET email_v2 = email WHERE email_v2 IS NULL;
-- Rollback: safe, old column untouched

-- Step 3: CONTRACT (switch reads, then drop old)
-- App reads and writes 'email_v2' only
ALTER TABLE users DROP COLUMN email;
-- Rollback: NOT safe after this point, old column gone
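The expand and migrate steps can be tried end to end with an in-memory SQLite database. This is a sketch under simplifying assumptions: `old_version_read` is a hypothetical stand-in for the previous release's read path, and the contract step is deliberately left out because it is the one that cannot be rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('old@example.com')")


def old_version_read(conn):
    """What the previous release does: it only knows the 'email' column."""
    return conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()


# Step 1: EXPAND. Add the new column; the new release writes to both columns.
conn.execute("ALTER TABLE users ADD COLUMN email_v2 TEXT")


def create_user(conn, email):
    conn.execute("INSERT INTO users (email, email_v2) VALUES (?, ?)", (email, email))


create_user(conn, "new@example.com")
after_expand = old_version_read(conn)  # the old release still sees every user

# Step 2: MIGRATE. Backfill rows that were written before the expand step.
conn.execute("UPDATE users SET email_v2 = email WHERE email_v2 IS NULL")
after_migrate = old_version_read(conn)

rows = conn.execute("SELECT email, email_v2 FROM users ORDER BY id").fetchall()
print(after_migrate)
print(rows)
```

After both steps the old read path returns the same rows as before, and both columns agree, so rolling the code back at this point loses nothing.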

The simulation below shows a bad deployment and different rollback strategies. Watch how quickly each strategy recovers and how much data impact occurs.

Rollback Speed Comparison

A bad deploy goes out. Compare rollback strategies by recovery time and blast radius.

Quiz: Your new deployment added a database column and the application now writes data in a new format. You discover a bug and roll back the code to the previous version. What problem do you face?

The old code does not know about the new column and may fail to read rows that were written in the new format. Data written during the bad deployment period may be incompatible with the old code. This is why schema changes must be backward-compatible: the old code should gracefully ignore new columns, and the new code should handle the absence of new columns.

No problem: the database automatically handles version differences

The rollback will also undo the database changes

Chapter 8: Canary Deployment Simulator

This is the showcase chapter. You are deploying a new version of your service to production using a canary strategy. Start at 5% traffic, monitor error rate and latency, and decide: promote to 100% or roll back?

How to use the simulator. Click "Deploy Canary (5%)" to start. The canary version may be good or bad — you don't know yet. Watch the error rate and latency metrics. If the canary looks healthy after observation, click "Promote" to increase traffic. If metrics degrade, click "Rollback" to kill the canary. The automatic mode will auto-promote if healthy and auto-rollback if unhealthy.

Canary Deployment Simulator

Deploy a canary, monitor metrics in real time, decide to promote or roll back.

What to Monitor During Canary

Metric	Baseline	Alert Threshold	Why
Error rate	0.1%	> 1%	New code may crash on edge cases
p50 latency	50ms	> 2x baseline	New code may have performance regressions
p99 latency	200ms	> 3x baseline	Tail latency often reveals resource contention
CPU usage	40%	> 80%	Inefficient code or infinite loops
Memory usage	2GB	> 4GB	Memory leaks in new code

Chapter 9: Connections

Testing and deployment are how you ship changes safely. But once the code is in production, you need to observe it — to know whether it is healthy, to detect problems before users do, and to diagnose root causes when things go wrong.

This Lesson vs. Related Topics

Topic	Focus	Relationship
Failure Modes & Isolation	What fails and how to contain it	Testing validates your isolation strategies actually work
Resiliency Patterns	Runtime patterns for handling failures	Chaos engineering verifies resiliency patterns function correctly
Testing & Deployment (this lesson)	How to ship safely and recover fast	The "when" and "how" of getting code into production
Observability	Metrics, logs, traces, alerting	Canary deployments rely on observability to decide promote vs. rollback

Key Takeaways

1. The test pyramid is an investment strategy. Many cheap, fast tests (unit). Fewer expensive, slow tests (E2E). Classify by size, not just type.

2. Chaos engineering tests the real system. Inject real failures into real infrastructure. Form hypotheses. Measure outcomes. File tickets for surprises.

3. Small, frequent deploys are safer than large, rare ones. One change is easy to diagnose. Fifty changes are a mystery.

4. Canary deployments limit blast radius. 5% traffic to canary, 95% on stable. Promote when healthy, roll back when not.

5. Feature flags decouple deploy from release. Ship code anytime. Enable for users when ready. Kill switch instantly.

6. Rollback readiness is a design requirement. Schema changes must be backward-compatible. Data changes must be reversible. Plan for rollback before you deploy.

"If you're afraid to change something, it's a sign you don't have enough automated safety nets." — Martin Fowler

Final quiz: Your team is about to ship a major rewrite of the payment processing service. You want maximum safety. What deployment strategy would you use?

Blue-green: instant cutover and instant rollback

Rolling deployment: gradual transition

Feature flag + canary. Deploy the rewrite behind a feature flag (zero risk deployment). Enable the flag for 1% of traffic (canary). Monitor error rate, latency, and payment success rate for 24 hours. If healthy, gradually increase to 5%, 25%, 50%, 100%. At any point, flip the flag to OFF for instant rollback. This combines canary's gradual validation with feature flag's instant kill switch.