How to ship code without breaking production — from test pyramids to canary deploys to chaos engineering.
You have 100% code coverage. Every unit test passes. Integration tests are green. The staging environment looks perfect. You deploy to production with confidence. Within 30 minutes, your error rate triples.
What happened? The staging environment had 2 servers. Production has 200. The race condition that occurs when 200 servers simultaneously connect to the database on startup never happened in staging. The test suite tested logic perfectly but never tested behavior under real conditions.
This is the fundamental gap in distributed systems testing: the interesting failures are not logic bugs. They are emergent behaviors that only appear when real traffic hits real infrastructure at real scale with real network conditions.
The simulation below shows a system that passes all unit tests and integration tests but fails in production due to emergent behavior under load.
Left: test results (all green). Right: production behavior under load. The tests miss what matters.
| Failure Type | Can unit tests catch it? | Can integration tests? | Requires |
|---|---|---|---|
| Race condition under load | No | Rarely | Load testing, chaos engineering |
| Cascading failure | No | No | Chaos engineering, game days |
| Configuration drift | No | No | Canary deploys, staged rollout |
| Dependency version mismatch | Sometimes | Yes, if environments match | Environment parity |
| Memory leak over 48 hours | No | No | Soak testing, production monitoring |
| Performance regression | No | Rarely | Benchmark tests, canary metrics |
Not all tests are equal. A unit test that checks whether a function adds two numbers correctly runs in 1 millisecond and tests exactly one thing. An end-to-end test that spins up the entire application, opens a browser, fills in a form, and verifies the result in the database takes 30 seconds and tests everything at once.
The test pyramid is a model for how to allocate your testing effort. The base is wide (many cheap, fast tests), and the top is narrow (few expensive, slow tests).
| Property | Unit | Integration | E2E |
|---|---|---|---|
| Speed | 1-10ms | 100ms-5s | 10s-5min |
| Reliability | Very stable | Occasionally flaky | Often flaky |
| Failure specificity | Pinpoints exact function | Narrows to component pair | "Something is broken" |
| Cost to write | Low | Medium | High |
| Cost to maintain | Low | Medium | Very high |
| What it catches | Logic bugs | Interface mismatches | User-visible regressions |
The anti-pattern is the ice cream cone: lots of E2E tests, few unit tests. This results in a slow, flaky test suite that everyone ignores. Teams start shipping without waiting for tests because the tests take 45 minutes and fail randomly due to browser timing issues.
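The cost difference between the two shapes can be worked out with simple arithmetic. The sketch below is illustrative only: the test counts, per-test durations, and per-test flake probabilities are hypothetical, tests are assumed to run serially, and flaky failures are assumed to be independent.

```python
def suite_stats(tests):
    """tests: list of (count, seconds_per_test, flake_probability_per_test).

    Returns (total_seconds, probability_of_at_least_one_spurious_failure).
    """
    total_seconds = sum(count * secs for count, secs, _ in tests)
    # Probability that every single test avoids a spurious failure in one run.
    p_all_clean = 1.0
    for count, _, flake in tests:
        p_all_clean *= (1.0 - flake) ** count
    return total_seconds, 1.0 - p_all_clean


# Hypothetical suites: (unit, integration, E2E) rows.
pyramid = [(900, 0.005, 0.0), (90, 1.0, 0.001), (10, 30.0, 0.01)]
ice_cream_cone = [(100, 0.005, 0.0), (100, 1.0, 0.001), (300, 30.0, 0.01)]

pyramid_time, pyramid_flake = suite_stats(pyramid)
cone_time, cone_flake = suite_stats(ice_cream_cone)
print(f"pyramid: {pyramid_time:.0f}s, spurious-failure chance {pyramid_flake:.0%}")
print(f"cone:    {cone_time:.0f}s, spurious-failure chance {cone_flake:.0%}")
```

Even a 1% flake rate per E2E test compounds quickly: with hundreds of such tests, almost every run contains at least one spurious failure, which is what trains people to ignore red builds.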
The simulation below shows two test suites: a pyramid (many unit, few E2E) and an ice cream cone (many E2E, few unit). Watch how they differ in speed and reliability.
Two test suites with the same total coverage. Compare execution time and flakiness.
The test pyramid categorizes tests by what they test (unit vs. integration vs. E2E). Test sizes categorize tests by how many resources they consume. This is a complementary classification used by large engineering organizations to enforce test quality and CI speed.
| Size | Time Limit | Resources | Examples |
|---|---|---|---|
| Small | < 1 minute | Single process. No I/O, no network, no disk, no database. Everything mocked or in-memory. | Pure function tests, data structure tests, parser tests |
| Medium | < 5 minutes | Single machine. Can use localhost network, local database, local file system. | API tests with in-memory DB, gRPC tests, message serialization |
| Large | < 15 minutes | Multiple machines or external services. Can use real network, real databases, real dependencies. | E2E tests, performance benchmarks, chaos experiments |
Consider a test labeled "unit test" that creates a temporary database connection. On the developer's laptop with a fast SSD and no network contention, it runs in 50ms. In CI, running alongside 500 other tests on a shared machine, it takes 3 seconds because of disk I/O contention. This test is nondeterministic — its behavior depends on the environment. It will fail intermittently (a "flaky" test), eroding confidence in the test suite.
A true small test uses no I/O, no network, no disk. It runs in 5ms on any machine, every time, deterministically. Size classification prevents environment-dependent failures from polluting your test suite.
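As a rough illustration of what a small test can look like, the sketch below replaces the database with an in-memory fake. `InMemoryUserStore` and `rename_user` are hypothetical names invented for this example, not part of any real library.

```python
class InMemoryUserStore:
    """Hypothetical stand-in for a database-backed store; keeps rows in a dict."""

    def __init__(self):
        self._rows = {}

    def save(self, user_id, name):
        self._rows[user_id] = name

    def get(self, user_id):
        return self._rows.get(user_id)


def rename_user(store, user_id, new_name):
    """Logic under test: reject blank names, otherwise persist the trimmed name."""
    cleaned = new_name.strip()
    if not cleaned:
        raise ValueError("name must not be blank")
    store.save(user_id, cleaned)
    return cleaned


def test_rename_user_trims_and_saves():
    store = InMemoryUserStore()
    assert rename_user(store, 1, "  Ada ") == "Ada"
    assert store.get(1) == "Ada"


def test_rename_user_rejects_blank():
    store = InMemoryUserStore()
    try:
        rename_user(store, 1, "   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Neither test touches the network, the disk, or a real database, so its result cannot depend on how busy the CI machine is.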
Some organizations enforce size budgets: at least 80% of tests must be small, at most 15% medium, and at most 5% large. This ensures the test suite stays fast and reliable as the codebase grows.
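A budget like this could be checked in CI with a few lines of code. The following is a minimal sketch, assuming each test already carries a size label; the function name and the way labels are collected are assumptions made for illustration.

```python
from collections import Counter

# Shares taken from the budget described above.
MIN_SHARE = {"small": 0.80}
MAX_SHARE = {"medium": 0.15, "large": 0.05}


def check_size_budget(sizes):
    """sizes: iterable of 'small' | 'medium' | 'large', one label per test.

    Returns a list of human-readable violations (empty if within budget).
    """
    counts = Counter(sizes)
    total = sum(counts.values())
    if total == 0:
        return []
    violations = []
    for size, floor in MIN_SHARE.items():
        share = counts[size] / total
        if share < floor:
            violations.append(f"{size}: {share:.0%} is below the {floor:.0%} minimum")
    for size, cap in MAX_SHARE.items():
        share = counts[size] / total
        if share > cap:
            violations.append(f"{size}: {share:.0%} is above the {cap:.0%} maximum")
    return violations


healthy_suite = ["small"] * 90 + ["medium"] * 8 + ["large"] * 2
drifting_suite = ["small"] * 70 + ["medium"] * 20 + ["large"] * 10
print(check_size_budget(healthy_suite))
print(check_size_budget(drifting_suite))
```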
The simulation below shows how test size distribution affects total CI time. Adjust the slider to change the ratio of small to large tests.
10,000 tests. Adjust the size distribution and see how total CI time and flake rate change.
You have tests for the happy path. You have tests for known error conditions. But what about the errors you have not imagined? What about the interaction between a network partition and a GC pause and a traffic spike — all happening simultaneously?
Chaos engineering is the practice of deliberately injecting failures into a running system to discover weaknesses before they cause real outages. Instead of waiting for failures to find you, you find them first.
| Experiment | What it tests | Common surprises |
|---|---|---|
| Kill a server | Failover, load redistribution | Failover takes 10x longer than expected |
| Network latency injection | Timeout handling, circuit breakers | No timeouts configured (infinite wait) |
| DNS failure | DNS caching, fallback resolution | DNS cache expires during outage, total failure |
| Disk full | Disk space monitoring, graceful degradation | Log files fill disk, process crashes on write |
| Clock skew | Time-dependent logic, certificates | TLS certificates rejected, lease expirations |
| CPU exhaustion | Graceful degradation under load | Health checks fail, node removed from cluster |
A game day is a scheduled chaos engineering exercise where the entire team participates. Unlike automated chaos experiments that run in the background, game days are planned events where engineers deliberately break the system, observe the response, and practice incident response procedures.
Game days serve a dual purpose: they test the system's resilience and the team's incident response capability. A system might fail over automatically and perfectly, but if the on-call engineer doesn't know how to read the dashboard or escalate correctly, the human response is the bottleneck.
| Game Day Element | Purpose |
|---|---|
| Pre-announced schedule | Team knows it's coming, can prepare |
| Documented hypothesis | "We expect failover in <30s with <1% error spike" |
| Kill switch | One command to stop the experiment instantly |
| Observers | Teammates watch dashboards, note surprises |
| Post-mortem | Document findings, file tickets for gaps |
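Writing the hypothesis down as data makes the outcome of an experiment unambiguous. The sketch below encodes the example hypothesis from the table ("failover in <30s with <1% error spike") and compares it with what was observed. The class and field names are hypothetical, and how the observations are collected is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    max_failover_seconds: float
    max_error_rate: float  # fraction, so 0.01 means 1%


@dataclass(frozen=True)
class Observation:
    failover_seconds: float
    peak_error_rate: float


def evaluate(hypothesis, observed):
    """Return the list of surprises; an empty list means the hypothesis held."""
    surprises = []
    if observed.failover_seconds >= hypothesis.max_failover_seconds:
        surprises.append(
            f"failover took {observed.failover_seconds:.0f}s, "
            f"expected under {hypothesis.max_failover_seconds:.0f}s"
        )
    if observed.peak_error_rate >= hypothesis.max_error_rate:
        surprises.append(
            f"error rate peaked at {observed.peak_error_rate:.1%}, "
            f"expected under {hypothesis.max_error_rate:.1%}"
        )
    return surprises


hypothesis = Hypothesis(max_failover_seconds=30, max_error_rate=0.01)
print(evaluate(hypothesis, Observation(failover_seconds=12, peak_error_rate=0.004)))
print(evaluate(hypothesis, Observation(failover_seconds=300, peak_error_rate=0.004)))
```

Each surprise returned here is a candidate ticket for the post-mortem.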
For the most critical distributed algorithms (consensus protocols, database replication), chaos engineering is not enough. You cannot test every possible interleaving of events. Formal verification tools like TLA+ (created by Leslie Lamport) let you write a precise specification of the algorithm and then exhaustively check, or mathematically prove, that its design preserves the required properties in every reachable state.
Amazon uses TLA+ extensively. Their engineers have found subtle bugs in DynamoDB, S3, and other systems — bugs that would be nearly impossible to find through testing alone because they require specific, rare combinations of concurrent events.
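To give a feel for what "every possible interleaving" means, here is a toy brute-force exploration in Python. It is not TLA+ and not how a real model checker is built; it only enumerates every ordering of two workers that each perform a non-atomic increment (read, then write) on a shared counter, and reports the orderings that violate the invariant "the counter ends at 2".

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that preserves each list's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (worker, op) steps against a shared counter."""
    counter = 0
    local = {}  # each worker's private copy of the value it last read
    for worker, op in schedule:
        if op == "read":
            local[worker] = counter
        else:  # "write": store the stale-or-fresh local value plus one
            counter = local[worker] + 1
    return counter


steps_a = [("A", "read"), ("A", "write")]
steps_b = [("B", "read"), ("B", "write")]

schedules = list(interleavings(steps_a, steps_b))
violations = [s for s in schedules if run(s) != 2]
print(f"{len(violations)} of {len(schedules)} interleavings lose an update")
for schedule in violations:
    print(schedule)
```

Two workers with two steps each already produce six schedules. Real protocols have far more steps and participants, which is why exhaustive exploration is delegated to dedicated tools rather than hand-written tests.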
The simulation below lets you run chaos experiments on a simple distributed system. Form a hypothesis, inject a fault, and see if the system behaves as expected.
A 3-node cluster with a load balancer. Inject faults and observe the system's response.
A CI/CD pipeline (Continuous Integration / Continuous Delivery) automates the path from code commit to production deployment. The goal is to make deployments frequent, small, and safe — rather than rare, large, and terrifying.
The DevOps Research and Assessment (DORA) program identified four metrics that distinguish elite engineering teams from the rest:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple/day | Weekly-monthly | Monthly-6 months | 6+ months |
| Lead time for changes | < 1 hour | 1 day - 1 week | 1 week - 1 month | 1-6 months |
| Time to restore | < 1 hour | < 1 day | 1 day - 1 week | 1 week - 1 month |
| Change failure rate | 0-15% | 16-30% | 16-30% | 46-60% |
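The four metrics can be derived from a log of deployments. The sketch below is a simplification under stated assumptions: the `Deploy` record and its fields are hypothetical, and time to restore is measured from the moment of the failing deployment rather than from the start of the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool = False
    restored_at: Optional[datetime] = None


def dora_metrics(deploys, window_days):
    """Summarize a non-empty list of Deploy records covering window_days days."""
    if not deploys:
        raise ValueError("need at least one deploy")
    lead_seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    failures = [d for d in deploys if d.caused_failure]
    restore_seconds = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_s": median(lead_seconds),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore_s": median(restore_seconds) if restore_seconds else None,
    }


start = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deploy(start, start + timedelta(minutes=30)),
    Deploy(start, start + timedelta(minutes=60), caused_failure=True,
           restored_at=start + timedelta(minutes=80)),
    Deploy(start, start + timedelta(minutes=90)),
]
metrics = dora_metrics(deploys, window_days=1)
print(metrics)
```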
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Gated release trains | Deploy once a week with 50 changes batched | Deploy each change independently, multiple times per day |
| Manual approval gates | Deploys wait hours for a manager to click "Approve" | Automate approval with metrics-based promotion |
| Shared staging | Teams block each other waiting for staging slots | Ephemeral environments per pull request |
| Flaky tests | Tests fail randomly, developers re-run until green | Quarantine flaky tests, fix root causes, enforce size budgets |
The simulation shows a CI/CD pipeline processing commits. Watch how different pipeline configurations affect lead time and deployment frequency.
Commits flow through pipeline stages. Green = passing, red = failing. Watch lead time.
You have a new version of your service ready to deploy. How do you get it into production without causing an outage? There are three major strategies, each with different trade-offs.
Replace instances one at a time. At any moment, some instances run the old version and some run the new. The rollout progresses gradually over minutes or hours.
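A rolling update can be sketched as a loop over batches of instances. This is a toy model, assuming the fleet is a list of version strings and that `is_healthy` is a caller-supplied check; both are stand-ins invented for the example.

```python
def rolling_update(instances, new_version, is_healthy, batch_size=1):
    """Upgrade instances in place, one batch at a time.

    instances: list of version strings, mutated in place.
    is_healthy: callable taking an instance index, returning True or False.
    Returns True if the whole fleet was upgraded, False if the rollout halted.
    """
    for start in range(0, len(instances), batch_size):
        batch = range(start, min(start + batch_size, len(instances)))
        for i in batch:
            instances[i] = new_version
        if not all(is_healthy(i) for i in batch):
            return False  # halt: the fleet is left running mixed versions
    return True


fleet = ["v1"] * 4
print(rolling_update(fleet, "v2", is_healthy=lambda i: True), fleet)

halted_fleet = ["v1"] * 4
print(rolling_update(halted_fleet, "v2", is_healthy=lambda i: i != 1), halted_fleet)
```

The halted case shows why rollback is slow with this strategy: the already-upgraded instances have to be rolled again, one batch at a time, to return to the old version.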
Run two identical environments: "blue" (current production) and "green" (new version). Deploy the new version entirely to green. When green is healthy, switch the load balancer to route all traffic from blue to green. Instant cutover.
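The cutover itself is just a pointer swap, which is why it is instant in both directions. Below is a minimal sketch with a hypothetical `BlueGreenRouter` class standing in for the load balancer configuration.

```python
class BlueGreenRouter:
    """Toy model of a load balancer that sends all traffic to one environment."""

    def __init__(self, live_version):
        self.versions = {"blue": live_version, "green": None}
        self.live = "blue"

    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        # The live environment is untouched while the idle one is updated.
        self.versions[self.idle()] = version

    def cut_over(self, idle_is_healthy):
        """Switch all traffic to the idle environment if it is deployed and healthy."""
        if self.versions[self.idle()] is None or not idle_is_healthy:
            return False
        self.live = self.idle()
        return True

    def serving(self):
        return self.versions[self.live]


router = BlueGreenRouter("v1")
router.deploy_to_idle("v2")
router.cut_over(idle_is_healthy=True)
print(router.live, router.serving())
router.cut_over(idle_is_healthy=True)  # rollback: the old environment is still running
print(router.live, router.serving())
```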
Deploy the new version to a tiny fraction of traffic (e.g., 5%) and monitor for errors. If the canary is healthy, gradually increase traffic (10%, 25%, 50%, 100%). If the canary shows problems, roll back only the canary; the other 95% of users were never affected.
| Strategy | Rollback Speed | Resource Cost | Version Mixing | Blast Radius |
|---|---|---|---|---|
| Rolling | Slow (must re-roll) | Low (in-place) | Yes (mixed during rollout) | Gradual increase |
| Blue-Green | Instant (switch back) | 2x (two environments) | No (instant switch) | 100% (all-or-nothing) |
| Canary | Instant (kill canary) | Low (small canary pool) | Yes (canary vs. baseline) | 5-10% during testing |
A typical canary deployment follows a multi-stage progression, with an automated health check gating each increase in traffic.
The key detail: the wait times increase at each stage because some bugs only manifest after minutes of operation (e.g., slow memory leaks, connection pool exhaustion, cache warming issues).
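One way to express such a progression is as a list of stages, each with a traffic share and a soak time. In the sketch below the traffic percentages follow the ones mentioned earlier, while the soak times are hypothetical; `is_healthy` and `wait` are caller-supplied stand-ins for a real metrics check and a real timer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    traffic_percent: int
    soak_minutes: int


# Soak times are illustrative; they grow so slow-burning bugs have time to surface.
STAGES = [Stage(5, 5), Stage(10, 10), Stage(25, 20), Stage(50, 30)]


def run_canary(stages, is_healthy, wait=lambda minutes: None):
    """Walk through the stages; roll back at the first failed health gate.

    is_healthy: callable taking the current traffic percent.
    wait: callable taking minutes (a no-op by default so the sketch runs instantly).
    Returns ("promoted", 100) or ("rolled_back", percent_at_failure).
    """
    for stage in stages:
        wait(stage.soak_minutes)
        if not is_healthy(stage.traffic_percent):
            return ("rolled_back", stage.traffic_percent)
    return ("promoted", 100)


print(run_canary(STAGES, is_healthy=lambda pct: True))
print(run_canary(STAGES, is_healthy=lambda pct: pct < 25))
```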
Manual canary evaluation does not scale. Automated canary analysis (ACA) compares the canary's metrics to the baseline using statistical tests, so that promotion does not depend on a person reading dashboards.
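The text does not say which statistical test to use. As one possible choice, the sketch below applies a one-sided two-proportion z-test to error counts, using only the standard library. The function names and the significance level `alpha` are assumptions made for illustration; production systems often use other tests and compare many metrics at once.

```python
import math


def error_rate_p_value(canary_errors, canary_total, base_errors, base_total):
    """One-sided p-value for 'the canary error rate is higher than the baseline'."""
    p_canary = canary_errors / canary_total
    p_base = base_errors / base_total
    pooled = (canary_errors + base_errors) / (canary_total + base_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / base_total))
    if std_err == 0:
        return 1.0  # identical all-success or all-failure samples: no evidence
    z = (p_canary - p_base) / std_err
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def canary_verdict(canary_errors, canary_total, base_errors, base_total, alpha=0.01):
    p = error_rate_p_value(canary_errors, canary_total, base_errors, base_total)
    return "fail" if p < alpha else "pass"


# Canary at 1% errors vs. baseline at 0.1%: a difference far beyond noise.
print(canary_verdict(50, 5000, 95, 95000))
# Canary and baseline both at 0.1%: no evidence of a regression.
print(canary_verdict(5, 5000, 95, 95000))
```

The point of the test is to separate a real regression from random variation: with only 5% of traffic, the canary sees far fewer requests than the baseline, so raw error rates alone can mislead.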
The simulation below compares the three deployment strategies visually. Watch how traffic shifts between versions.
Watch how each strategy transitions from v1 (blue) to v2 (green).
Deployment and release are often conflated, but they are different things. Deployment means putting new code on servers. Release means enabling that code for users. A feature flag separates these: you deploy code that is disabled by default, then enable it later — for specific users, a percentage of traffic, or all at once.
| Type | Lifetime | Purpose | Example |
|---|---|---|---|
| Release flag | Days to weeks | Progressive rollout of new feature | Enable new checkout flow for 10% of users |
| Experiment flag | Weeks to months | A/B testing | Show variant A to 50%, variant B to 50% |
| Ops flag | Permanent | Kill switch for risky features | Disable recommendation engine if it's slow |
| Permission flag | Permanent | User-specific access control | Enable beta features for premium users |
```python
# Simple feature flag with gradual rollout
import hashlib


class FeatureFlags:
    def __init__(self, config):
        self.config = config  # {"new_checkout": {"pct": 10}}

    def is_enabled(self, flag_name, user_id):
        flag = self.config.get(flag_name)
        if not flag:
            return False
        # Deterministic: same user always gets same result
        hash_val = hashlib.md5(
            f"{flag_name}:{user_id}".encode()
        ).hexdigest()
        # First 8 hex digits -> integer -> bucket in the range 0..99
        bucket = int(hash_val[:8], 16) % 100
        return bucket < flag["pct"]


# Usage (new_checkout_flow and old_checkout_flow are the
# application's own request handlers):
flags = FeatureFlags({"new_checkout": {"pct": 10}})


def checkout(request, user):
    if flags.is_enabled("new_checkout", user.id):
        return new_checkout_flow(request)
    return old_checkout_flow(request)
```
The hash-based approach is critical: the same user always sees the same version. Without deterministic assignment, a user might see the new checkout on page load but the old checkout on form submission — a broken experience.
The simulation shows how feature flags enable progressive rollout and instant rollback without deployment changes.
Deploy code with flag OFF, then gradually enable. Detect a problem? Disable instantly.
Things will go wrong. The question is not whether you will need to roll back, but how fast you can do it and how much damage occurs between deploying the bad version and completing the rollback.
| Strategy | Speed | Requirements | Pitfalls |
|---|---|---|---|
| Re-deploy previous version | 5-30 minutes | Previous artifacts must be available | Full pipeline must run again |
| Instant traffic switch | Seconds | Previous version still running (blue-green) | Requires 2x infrastructure |
| Feature flag disable | Seconds | Feature behind a flag | Only works if the issue is in the flagged code |
| Database rollback | Minutes to hours | Backward-compatible schema | Data loss if writes happened during bad period |
Code rollback is easy. Data rollback is hard. If the bad deployment wrote data in a new format, changed a schema, or made irreversible state changes (sent emails, charged cards), rolling back the code does not roll back the data.
A deployment is rollback-safe if the previous version of the code can continue operating correctly after the new version has run. This is a stronger requirement than forward compatibility.
| Change Type | Rollback-Safe? | Why |
|---|---|---|
| Add optional field | Yes | Old code ignores unknown fields |
| Remove field | No | Old code may depend on the removed field |
| Rename field | No | Old code looks for old name, new code writes new name |
| Change type | No | Old code expects int, new code writes string |
| Add new table | Yes | Old code doesn't query new table |
| Drop table | No | Old code queries the dropped table |
The pattern: adding is safe, removing or changing is not. Every breaking change must be done in multiple steps, with each step being independently rollback-safe.
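A field rename shows how the multi-step approach can work. The sketch below is one possible sequence, often called expand and contract, modeled on plain dictionaries. The field names `name` and `full_name` and the four code versions are hypothetical.

```python
OLD, NEW = "name", "full_name"  # hypothetical field names for the rename


def make_version(write_fields, read_field):
    """Build the (write, read) pair for one deployed version of the code."""
    def write(record, value):
        for field in write_fields:
            record[field] = value

    def read(record):
        return record[read_field]

    return write, read


write_v1, read_v1 = make_version([OLD], OLD)       # original code
write_v2, read_v2 = make_version([OLD, NEW], OLD)  # expand: dual-write, still read old
write_v3, read_v3 = make_version([OLD, NEW], NEW)  # switch reads (after the backfill)
write_v4, read_v4 = make_version([NEW], NEW)       # contract: stop writing old


def backfill(records):
    """Run between v2 and v3: copy the old field into the new one for every row."""
    for record in records:
        record[NEW] = record[OLD]


# Each step is rollback-safe: the previous version can read what the new one wrote.
for write_new, read_prev in [(write_v2, read_v1), (write_v3, read_v2), (write_v4, read_v3)]:
    record = {}
    write_new(record, "Ada")
    assert read_prev(record) == "Ada"

# Skipping straight from v1 to v4 is not: v1 cannot read a v4 record.
record = {}
write_v4(record, "Ada")
print(OLD in record)  # the field v1 depends on is missing
```

Reads stay on the old field until the backfill has run, so a rollback to the original code during the expand step cannot leave the new field silently stale.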
The best rollback is one that happens automatically. If canary metrics degrade beyond a threshold, the deployment system should roll back without human intervention.
The simulation below shows a bad deployment and different rollback strategies. Watch how quickly each strategy recovers and how much data impact occurs.
A bad deploy goes out. Compare rollback strategies by recovery time and blast radius.
This is the showcase chapter. You are deploying a new version of your service to production using a canary strategy. Start at 5% traffic, monitor error rate and latency, and decide: promote to 100% or roll back?
Deploy a canary, monitor metrics in real time, decide to promote or roll back.
| Metric | Baseline | Alert Threshold | Why |
|---|---|---|---|
| Error rate | 0.1% | > 1% | New code may crash on edge cases |
| p50 latency | 50ms | > 2x baseline | New code may have performance regressions |
| p99 latency | 200ms | > 3x baseline | Tail latency often reveals resource contention |
| CPU usage | 40% | > 80% | Inefficient code or infinite loops |
| Memory usage | 2GB | > 4GB | Memory leaks in new code |
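The thresholds in the table translate directly into an automated promote-or-roll-back decision. The sketch below is a minimal version of that check. The metric key names are hypothetical, and a real system would evaluate these over a time window rather than a single sample.

```python
# Alert thresholds from the table; latency limits are multiples of the baseline.
BASELINE_P50_MS = 50
BASELINE_P99_MS = 200
LIMITS = {
    "error_rate": 0.01,             # > 1%
    "p50_ms": 2 * BASELINE_P50_MS,  # > 2x baseline
    "p99_ms": 3 * BASELINE_P99_MS,  # > 3x baseline
    "cpu": 0.80,                    # > 80%
    "memory_gb": 4.0,               # > 4GB
}


def breached(metrics):
    """Return the sorted names of all metrics above their alert threshold."""
    return sorted(name for name, limit in LIMITS.items() if metrics[name] > limit)


def decide(metrics):
    """Roll back if any threshold is breached, otherwise promote."""
    return "rollback" if breached(metrics) else "promote"


healthy = {"error_rate": 0.001, "p50_ms": 55, "p99_ms": 210, "cpu": 0.45, "memory_gb": 2.1}
degraded = {"error_rate": 0.03, "p50_ms": 60, "p99_ms": 700, "cpu": 0.50, "memory_gb": 2.2}
print(decide(healthy), breached(healthy))
print(decide(degraded), breached(degraded))
```

Wiring a check like this into the deployment system is what turns a rollback from a human decision into an automatic one.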
Testing and deployment are how you ship changes safely. But once the code is in production, you need to observe it — to know whether it is healthy, to detect problems before users do, and to diagnose root causes when things go wrong.
| Topic | Focus | Relationship |
|---|---|---|
| Failure Modes & Isolation | What fails and how to contain it | Testing validates your isolation strategies actually work |
| Resiliency Patterns | Runtime patterns for handling failures | Chaos engineering verifies resiliency patterns function correctly |
| Testing & Deployment (this lesson) | How to ship safely and recover fast | The "when" and "how" of getting code into production |
| Observability | Metrics, logs, traces, alerting | Canary deployments rely on observability to decide promote vs. rollback |
1. The test pyramid is an investment strategy. Many cheap, fast tests (unit). Fewer expensive, slow tests (E2E). Classify by size, not just type.
2. Chaos engineering tests the real system. Inject real failures into real infrastructure. Form hypotheses. Measure outcomes. File tickets for surprises.
3. Small, frequent deploys are safer than large, rare ones. One change is easy to diagnose. Fifty changes are a mystery.
4. Canary deployments limit blast radius. 5% traffic to canary, 95% on stable. Promote when healthy, roll back when not.
5. Feature flags decouple deploy from release. Ship code anytime. Enable for users when ready. Disable a misbehaving feature instantly.
6. Rollback readiness is a design requirement. Schema changes must be backward-compatible. Data changes must be reversible. Plan for rollback before you deploy.