Kochenderfer et al., Chapter 1

Intro to Validation

How do you prove an autonomous system won't kill anyone? You validate it.

Prerequisites: None. This is where it all begins.
10 chapters · 3+ simulations · 10 quizzes

Chapter 0: Why Validate?

You are an engineer at a company that builds self-driving cars. Your team has spent three years training a neural network to drive. It handles highways, rain, night-time, construction zones. It passes every test you throw at it. Time to ship?

Not yet. You have shown the system works in the situations you thought of. But what about the situations you did not think of? A child chasing a ball into traffic at dusk. A mattress falling off a truck on a curve. A sensor failure during a snowstorm. The gap between "works in our tests" and "works in the real world" is where people die.

Validation is the discipline of systematically closing that gap. It asks: does this system actually satisfy its requirements in the conditions it will face? Not "does it work on our benchmark" — does it really work?

The Challenger analogy: The space shuttle Challenger did not need a better O-ring specification. NASA had the specification. The O-ring was rated for a minimum temperature. What was needed was a systematic check of whether the O-ring would work at 36°F on a January morning in Florida. The spec existed; the validation did not. Seven people died because the launch conditions were never systematically verified against the requirements. That is a validation failure.

This is not just an engineering nicety. Autonomous systems are being deployed in aviation, healthcare, finance, and defense. Each domain has catastrophic failure modes. The question is not whether to validate, but how — and whether the methods are strong enough for the stakes.

This chapter introduces the validation framework. By the end, you will understand why validation is hard, what tools exist, and why no single tool is sufficient — which is why we need a whole book of algorithms.

Design: Build a system that you believe meets requirements.
Validate: Systematically check whether it actually meets them.
Deploy: Only once you have sufficient evidence of safety.
Verification vs. validation: The classic distinction — verification asks "did we build the system right?" (does the code match the spec?). Validation asks "did we build the right system?" (does the spec actually guarantee safety in the real world?). This book focuses on validation, but many of the algorithms apply to both.
Check: What is the core difference between testing and validation?

Chapter 1: The V-Model

Building a safety-critical system is not "write code, then test." There is a structured development lifecycle, and it has a name: the V-model. Imagine the letter V. The left side goes down (decomposing requirements into design into code). The right side goes up (testing each level against its corresponding requirement).

At the top-left, you start with stakeholder needs: "the car must not hit pedestrians." This decomposes into system requirements ("detect pedestrians within 50m"), then architecture ("use LiDAR + camera fusion"), then detailed design ("run YOLO at 30fps"), then implementation (actual code). That is the left side — the descent from abstract to concrete.

The right side mirrors it. Unit tests check the detailed design of each module. Integration tests check the architecture. System tests check the system requirements. And acceptance tests check the original stakeholder needs. Each level on the right validates the corresponding level on the left.

[Interactive figure: The V-Model: Development & Validation. Each design phase on the left is paired with its validation counterpart on the right. The cost of fixing a bug multiplies roughly 10x at each level.]

The 10x rule of bug cost: As a rule of thumb, a requirements error that costs $1 to fix when caught during design costs $10 during coding, $100 during integration testing, $1,000 during system testing, and $10,000 or more after deployment. In safety-critical systems, the cost after deployment can be lives. The whole point of the V-model is to catch errors at the level where they were introduced, before they metastasize.
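The arithmetic behind the 10x rule is a geometric progression. A minimal sketch, treating the $1 base cost and the factor of 10 as the illustrative rule of thumb quoted above rather than measured data (the names `PHASES` and `fix_cost` are made up for this example):

```python
# Illustrative only: the base cost and 10x factor are a rule of thumb.
PHASES = ["design", "coding", "integration testing", "system testing", "deployment"]

def fix_cost(phase, base_cost=1, factor=10):
    """Cost of fixing a requirements error first caught in `phase`."""
    return base_cost * factor ** PHASES.index(phase)

costs = {phase: fix_cost(phase) for phase in PHASES}
```

Each phase that an error survives multiplies its cost by the same factor, so the cost grows exponentially with the number of phases, not linearly.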

Traditional V-model validation uses testing: run the system, see what happens. But for autonomous systems, testing alone is not enough. A self-driving car would need to drive billions of miles to statistically prove it is safer than humans. We cannot wait that long. This book's algorithms — falsification, reachability, importance sampling — are designed to replace brute-force testing with smarter, targeted analysis.
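To see where mileage figures of this size come from, here is a rough sketch of the statistics. It assumes failures arrive independently at a constant rate per mile (a Poisson model) and that the test fleet observes zero failures. The function name `miles_required` and the benchmark of about one fatality per 100 million miles are illustrative assumptions, not figures from the text.

```python
import math

def miles_required(max_rate_per_mile, confidence=0.95):
    """Failure-free miles needed to claim, at the given confidence,
    that the true failure rate is below max_rate_per_mile.

    With zero failures in n miles, P(no failure) = exp(-rate * n);
    we solve exp(-rate * n) = 1 - confidence for n.
    """
    return -math.log(1.0 - confidence) / max_rate_per_mile

# Assumed benchmark: roughly one fatality per 100 million miles.
human_rate = 1e-8
n = miles_required(human_rate)
```

Under these assumptions, n comes out near 300 million failure-free miles just to match the benchmark. Demonstrating a statistically meaningful improvement over it, or allowing for any observed failures, pushes the requirement substantially higher.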

V-Model Level | Left (Design) | Right (Validation)
1. Stakeholder | Needs & constraints | Acceptance testing
2. System | System requirements | System testing
3. Architecture | Subsystem design | Integration testing
4. Detailed | Module design | Unit testing
5. Code | Implementation (the bottom of the V) |
Check: Why is catching a bug at the requirements level so much cheaper than catching it after deployment?

Chapter 2: When Things Break

Validation is not abstract. Real systems have failed, and real people have been harmed. Understanding these failures is the best way to internalize why every algorithm in this book exists.

Therac-25 (1985-87): A radiation therapy machine delivered massive radiation overdoses to at least six patients, several of whom died. The cause? A software race condition that only occurred when the operator typed commands very quickly. Testing had never triggered this sequence because testers typed at normal speed. The design had removed a hardware safety interlock present in earlier models, trusting the software alone. Lesson: removing redundant safety layers based on software confidence is a validation failure.
Boeing 737 MAX (2018-19): The MCAS system, designed to prevent stalls, relied on a single angle-of-attack sensor. When that sensor failed, MCAS repeatedly pushed the nose down. Pilots were not trained on MCAS and could not override it in time. 346 people died across two crashes. Lesson: validating subsystem behavior under sensor failure is as important as validating normal operation.
Amazon hiring algorithm (2018): Amazon built an ML system to screen resumes. It penalized resumes containing the word "women's" (as in "women's chess club captain"). The model had learned from historical hiring data, which reflected existing bias. Lesson: a system can satisfy its training objective perfectly and still fail its real-world specification. Specification matters.

Notice the pattern. In every case, the system did what it was designed to do. Therac-25 ran the software correctly (the race condition was a design flaw). MCAS followed its algorithm (the single-sensor architecture was the flaw). Amazon's model optimized its loss function (the training data encoded bias). Validation failures are not about bugs in the usual sense — they are about the gap between what you built and what you should have built.

[Interactive figure: Failure Pattern Analysis. Each failure maps to a specific validation gap, summarized in the table below.]

System | What failed | Validation gap
Therac-25 | Race condition in operator input | No testing of edge-case timing
737 MAX | Single sensor dependency | No fault-tree analysis on MCAS inputs
Amazon hiring | Biased training data | No specification of fairness constraints
The lesson: Validation is not just running tests until they pass. It requires: (1) specifying what "correct" means precisely, (2) systematically searching for conditions where correctness fails, and (3) providing evidence that no such conditions exist — or mitigating them if they do. These three steps map to the three pillars of this book: specification, falsification, and formal analysis.
Check: What do the Therac-25, 737 MAX, and Amazon failures have in common?

Chapter 3: System = Agent + World

To validate a system, we first need a formal language for describing it. The book models autonomous systems as an agent interacting with a world. The agent receives observations, takes actions, and the world transitions to new states. This is the POMDP framework — a partially observable Markov decision process.

Think of a self-driving car. The state s is everything about the world: positions and velocities of every vehicle, the road layout, weather, traffic signals. The agent does not see this directly — it receives observations o through noisy sensors (cameras, LiDAR, radar). Based on observations, it takes actions a (steer, accelerate, brake). The world then transitions to a new state s' according to its dynamics.

System: agent + world   →   (S, A, O, T, O, π)

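Here, S is the state space (all possible world configurations), A is the action space (everything the agent can do), and O is the observation space (everything the agent can see). The symbol O appears twice in the tuple because it is overloaded: written O(o|s), it also denotes the observation model. T is the transition model, and π is the agent's policy, introduced in the section on observations and policies. The dynamics are governed by two functions: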

T(s'|s, a)  —  transition: probability of next state given current state and action
O(o|s)  —  observation: probability of seeing o when the true state is s
Think of it this way: The state is reality. The observation is what the agent perceives of reality — a noisy, incomplete picture. The transition function is the physics of the world. The observation function is the quality of the agent's sensors. Validation must account for uncertainty in both.
[Interactive figure: Agent-World Interaction Loop. One step of the interaction cycle: state → observe → decide → act → transition.]
Symbol | Name | Example (self-driving car)
s ∈ S | State | Positions & velocities of all vehicles
a ∈ A | Action | Steering angle, throttle, brake
o ∈ O | Observation | Camera images, LiDAR point cloud
T(s'|s,a) | Transition | Physics of vehicle motion
O(o|s) | Observation model | Sensor noise characteristics
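One way to make the tuple concrete is to store each component as plain data. The sketch below is an illustration, not the book's code: the two-state world, the probabilities, and the names `System`, `transition`, `observation`, and `policy` are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Dist = Dict[str, float]  # maps an outcome to its probability

@dataclass
class System:
    states: List[str]
    actions: List[str]
    observations: List[str]
    transition: Callable[[str, str], Dist]   # T(s' | s, a)
    observation: Callable[[str], Dist]       # O(o | s)
    policy: Callable[[str], Dist]            # pi(a | o)

def transition(s, a):
    if s == "crashed":
        return {"crashed": 1.0}              # absorbing failure state
    p_crash = 0.05 if a == "go" else 0.0
    return {"safe": 1.0 - p_crash, "crashed": p_crash}

def observation(s):
    # Hypothetical sensor: reports the true state 90% of the time.
    other = "crashed" if s == "safe" else "safe"
    return {s: 0.9, other: 0.1}

def policy(o):
    return {"go": 0.7, "stop": 0.3} if o == "safe" else {"stop": 1.0}

toy = System(["safe", "crashed"], ["go", "stop"], ["safe", "crashed"],
             transition, observation, policy)
```

Representing every distribution as a dictionary keeps the model small enough to inspect by hand, which is useful when checking that each conditional distribution sums to one.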
Check: Why does the agent not have direct access to the state s?

Chapter 4: Observations & Policies

The agent needs a strategy for choosing actions. This strategy is called a policy, written π. A policy maps from what the agent has seen (observations) to what it does (actions).

Policies can be deterministic — given an observation, the agent always picks the same action: a = π(o). Or they can be stochastic — the agent samples actions from a probability distribution: a ~ π(a|o). A stochastic policy adds randomness, which can help with exploration or robustness.

Deterministic:   a = π(o)       Stochastic:   a ~ π(a|o)
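The two policy types can be sketched in a few lines. The observation labels, action names, and probabilities here are hypothetical, chosen only to show the difference between returning an action and sampling one.

```python
import random

def deterministic_policy(o):
    """a = pi(o): the same observation always yields the same action."""
    return "brake" if o == "pedestrian" else "cruise"

def stochastic_policy(o, rng=random):
    """a ~ pi(a | o): sample an action from a distribution that depends on o."""
    if o == "pedestrian":
        actions, probs = ["brake", "cruise"], [0.99, 0.01]
    else:
        actions, probs = ["brake", "cruise"], [0.05, 0.95]
    return rng.choices(actions, weights=probs)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic policy reproducible, which matters when a failure found during validation has to be replayed.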

Why does partial observability matter for validation? Because the agent's behavior depends on what it sees, not what is true. If a pedestrian is behind an occlusion, the true state includes the pedestrian, but the observation does not. A policy that is safe given full state information may be dangerous given partial observations.

Key insight: Validation must test the closed loop — the agent interacting with the world through its sensors. Testing the policy in isolation (with perfect state information) does not capture the real failure modes. A perfect driver with a foggy windshield is not a safe system.

In the POMDP framework, the agent often maintains a belief state b — a probability distribution over possible states given all past observations. The policy then maps beliefs to actions: π(b) or π(a|b). This is how the agent deals with uncertainty: it tracks what it thinks the world looks like, and acts accordingly.
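For a small discrete state space, the belief can be tracked with a standard discrete Bayes filter. The sketch below assumes the observation is generated from the state reached after the action; the pedestrian model, its probabilities, and the function name `update_belief` are hypothetical.

```python
def update_belief(belief, a, o, states, transition, observation):
    """One step of a discrete Bayes filter.

    b'(s') is proportional to O(o | s') * sum_s T(s' | s, a) * b(s).
    """
    unnormalized = {}
    for s_next in states:
        predicted = sum(transition(s, a).get(s_next, 0.0) * belief[s] for s in states)
        unnormalized[s_next] = observation(s_next).get(o, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical two-state example: is a pedestrian in the crosswalk?
states = ["present", "absent"]

def transition(s, a):
    # Presence persists with probability 0.9, whatever the action.
    return {s: 0.9, ("absent" if s == "present" else "present"): 0.1}

def observation(s):
    # The camera detects a present pedestrian 80% of the time
    # and raises a false alarm 10% of the time.
    p_detect = 0.8 if s == "present" else 0.1
    return {"detect": p_detect, "no_detect": 1.0 - p_detect}

b0 = {"present": 0.5, "absent": 0.5}
b1 = update_belief(b0, "cruise", "detect", states, transition, observation)
```

Starting from a 50/50 prior, a single detection moves the belief that a pedestrian is present to 8/9. A policy that brakes above a belief threshold therefore depends on the sensor model as much as on the policy itself.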

Fully observable (MDP):

Agent sees the true state s directly.
Policy: π(s) or π(a|s).
Simpler to analyze.

Partially observable (POMDP):

Agent sees noisy observations o.
Policy: π(b) or π(a|o_{1:t}).
Much harder to validate.

Self-driving example: An MDP policy could be: "if the pedestrian is at position (x,y), brake." A POMDP policy must instead be: "if my belief about the pedestrian's position, given camera data, gives P(pedestrian in crosswalk) > 0.8, brake." The second is realistic but far harder to validate — you must reason about the sensor, the belief update, and the policy together.
Check: Why must validation test the closed loop (agent + sensors + world) rather than the policy alone?

Chapter 5: Trajectories

When we run a system, we get a sequence of states, observations, and actions unfolding over time. This sequence is called a trajectory, denoted τ:

τ = (s_1, o_1, a_1, s_2, o_2, a_2, …, s_T)

A trajectory is one complete "story" of the system — from initial conditions to termination. If the system is stochastic (random transitions, noisy sensors, stochastic policy), then each run produces a different trajectory. The trajectory distribution p(τ) captures the probability of every possible story.

Let's derive the trajectory density. Each trajectory is a chain of transitions, observations, and actions. By the chain rule of probability:

p(τ) = p(s_1) · ∏_{t=1}^{T−1} O(o_t|s_t) · π(a_t|o_{1:t}) · T(s_{t+1}|s_t, a_t)

Reading left to right: we start in state s_1 (with probability p(s_1)), observe o_1 (with probability O(o_1|s_1)), take action a_1 (with probability π(a_1|o_1)), transition to s_2 (with probability T(s_2|s_1, a_1)), and so on.

Worked example: A robot in a 3-state world, T=2 steps. States: {safe, risky, crashed}. Observation: {clear, unclear}. Action: {go, stop}. Suppose p(s_1=safe) = 0.8, O(clear|safe) = 0.9, π(go|clear) = 0.7, T(safe|safe, go) = 0.95. Then the trajectory τ = (safe, clear, go, safe) has probability 0.8 × 0.9 × 0.7 × 0.95 = 0.4788. That single trajectory is nearly 50% likely. Most of the mass is on "boring" safe trajectories. The rare, dangerous ones are what validation must find.
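The worked example can be checked mechanically. This sketch encodes only the four probabilities given above (the rest of the model is left unspecified) and assumes a memoryless policy π(a|o_t); the table names and `trajectory_probability` are illustrative.

```python
# Only the numbers from the worked example are filled in.
p_init = {"safe": 0.8}
obs_model = {("clear", "safe"): 0.9}            # O(o | s), keyed by (o, s)
policy = {("go", "clear"): 0.7}                 # pi(a | o), keyed by (a, o)
trans_model = {("safe", "safe", "go"): 0.95}    # T(s' | s, a), keyed by (s', s, a)

def trajectory_probability(tau):
    """tau is a flat tuple (s_1, o_1, a_1, s_2, ..., s_T)."""
    states, observations, actions = tau[0::3], tau[1::3], tau[2::3]
    p = p_init[states[0]]
    for t in range(len(states) - 1):
        s, o, a, s_next = states[t], observations[t], actions[t], states[t + 1]
        p *= obs_model[(o, s)] * policy[(a, o)] * trans_model[(s_next, s, a)]
    return p

p = trajectory_probability(("safe", "clear", "go", "safe"))
```

The loop multiplies one observation term, one policy term, and one transition term per step, mirroring the product in the trajectory density.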

Generating trajectories is called rollout. You sample s1 from the initial distribution, observe, act, transition, repeat. Each rollout is one Monte Carlo sample from p(τ). Simple, but exponentially many trajectories exist — most of the interesting (dangerous) ones are vanishingly rare.

The needle-in-a-haystack problem: If the probability of a dangerous trajectory is 10^−9, you need roughly 10^9 rollouts before you expect to see one. At 1 second per rollout, that is 31 years of simulation. The entire second half of this book (Chapters 4-10) is about finding those needles faster.
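python
import random

# Rollout: sample one trajectory tau = (s_1, o_1, a_1, ..., s_T) from p(tau).
# p_init, observe(s), pi(o), and transition(s, a) each return a dict
# mapping outcomes to probabilities.
def sample(dist):
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def rollout(T, pi, transition, observe, p_init):
    s = sample(p_init)                # s_1 ~ p(s_1)
    tau = [s]
    for t in range(T - 1):            # T states means T - 1 transitions
        o = sample(observe(s))        # o_t ~ O(o | s_t)
        a = sample(pi(o))             # a_t ~ pi(a | o_t), a memoryless policy
        s = sample(transition(s, a))  # s_{t+1} ~ T(s' | s_t, a_t)
        tau += [o, a, s]
    return tau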
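The needle-in-a-haystack arithmetic above can be reproduced directly. This is a back-of-the-envelope sketch that assumes rollouts are independent and take exactly one second each; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def expected_rollouts(p_fail):
    """Expected number of independent rollouts until the first failure (1 / p)."""
    return 1.0 / p_fail

def prob_at_least_one_failure(p_fail, n):
    """Chance that n independent rollouts contain at least one failure."""
    return 1.0 - (1.0 - p_fail) ** n

n = expected_rollouts(1e-9)
years = n * 1.0 / SECONDS_PER_YEAR      # at 1 second per rollout
p_seen = prob_at_least_one_failure(1e-9, 10**9)
```

Note that running the expected number of rollouts does not guarantee a failure is observed: after 10^9 rollouts the chance of having seen at least one is only about 63%.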
Check: In the trajectory density p(τ), which three probability distributions multiply together at each time step?

Chapter 6: Specifications

A trajectory tells us what happened. A specification tells us whether what happened was acceptable. Formally, a specification is a function that takes a trajectory and returns true (pass) or false (fail):

ψ(τ) ∈ {true, false}

Think of it as a referee. The trajectory is the game tape. The specification watches the tape and blows the whistle if anything went wrong.

Some specifications are simple: "the car never exceeds 65 mph." Others are complex: "if a pedestrian enters the crosswalk, the car must stop within 3 seconds." The complexity of the specification determines how hard it is to validate — a speed limit check is trivial, a temporal safety constraint requires reasoning about sequences of events.
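Both example specifications can be written as Boolean functions of a trajectory. The sketch below assumes a trajectory is a list of per-step dictionaries with hypothetical fields `speed_mph` and `pedestrian_in_crosswalk`, sampled every `dt` seconds. It also adopts one possible reading of the pedestrian rule: at every step where a pedestrian is in the crosswalk, the car must be stopped at some step within the deadline.

```python
def speed_limit_spec(tau, limit_mph=65.0):
    """Point-wise: true if the speed never exceeds the limit."""
    return all(step["speed_mph"] <= limit_mph for step in tau)

def pedestrian_stop_spec(tau, dt=1.0, deadline_s=3.0):
    """Temporal: whenever a pedestrian is in the crosswalk, the car must
    be stopped at some step within deadline_s seconds."""
    window = int(deadline_s / dt)
    for t, step in enumerate(tau):
        if step["pedestrian_in_crosswalk"]:
            horizon = tau[t : t + window + 1]
            if not any(s["speed_mph"] == 0.0 for s in horizon):
                return False
    return True
```

The speed check looks at each step in isolation, while the pedestrian check has to look ahead across several steps. That difference is what makes temporal specifications harder to evaluate and to validate.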

Specifications encode the meaning of "safe." Without a formal specification, validation is just vibes. You cannot prove a system is safe if you have not written down what "safe" means. The Amazon hiring example is instructive — the specification was "predict which candidates will succeed," but the real requirement was "predict which candidates will succeed, without gender or race bias." The missing specification led to a validated-but-harmful system.

This book uses specifications of increasing complexity:

Type | Example | Chapter
Point-wise | ψ(τ) = (s_t ∉ S_unsafe for all t) | 3
Temporal logic | "If pedestrian detected, eventually stop" | 3
Probabilistic | P(ψ(τ) = false) < 10^−9 | 7
Reachability | "No reachable state is in S_unsafe" | 8-10

The simplest validation task is falsification: find any trajectory where ψ(τ) = false. Even one counterexample proves the system is flawed. The hardest is formal verification: prove that ψ(τ) = true for all possible trajectories. Most validation lives between these extremes, either estimating the probability that ψ fails or bounding the set of reachable states.

Check: What does a specification ψ(τ) do?

Chapter 7: Algorithm Outputs

Different validation algorithms answer different questions. The book organizes them by what they produce:

Falsification: Find a single trajectory where ψ(τ) = false. One counterexample is enough to prove the system is flawed.
Probability Estimation: Estimate P(ψ(τ) = false). How often does the system fail? Is it 10^−3 or 10^−9?
Formal Guarantees: Prove that for ALL reachable states, ψ holds. Mathematical certainty, no sampling needed.
Explanations: Show WHY the system fails, using feature importance, counterfactuals, or surrogate models.
Runtime Assurance: Monitor the system in real time and intervene if it is about to violate ψ.
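The first two outputs can be contrasted on a toy system. The sketch below uses plain random search and direct Monte Carlo sampling, which are the simplest possible baselines and not the book's algorithms; the toy dynamics, the slip probability, and all function names are made up for illustration.

```python
import random

def rollout(rng, horizon=10, p_slip=0.02):
    """Toy system: at each step the state degrades to 'unsafe' with probability p_slip."""
    tau = ["safe"]
    for _ in range(horizon - 1):
        s = tau[-1]
        tau.append("unsafe" if s == "unsafe" or rng.random() < p_slip else "safe")
    return tau

def psi(tau):
    """Specification: the trajectory never visits an unsafe state."""
    return "unsafe" not in tau

def falsify(rng, max_rollouts=10_000):
    """Falsification by random search: return the first counterexample found, else None."""
    for _ in range(max_rollouts):
        tau = rollout(rng)
        if not psi(tau):
            return tau
    return None

def estimate_failure_probability(rng, n=20_000):
    """Direct Monte Carlo estimate of P(psi(tau) = false)."""
    return sum(not psi(rollout(rng)) for _ in range(n)) / n

rng = random.Random(0)
counterexample = falsify(rng)
p_hat = estimate_failure_probability(rng)
```

Falsification stops at the first failing trajectory and returns it for inspection, while estimation has to keep sampling to pin down a number. For this toy system the exact failure probability is 1 − 0.98^9, about 0.166, so both baselines work; they break down when failures are as rare as 10^−9.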

These are not competing approaches — they are complementary. Falsification finds bugs quickly. Probability estimation quantifies residual risk. Formal methods provide guarantees on simplified models. Explanations help engineers fix the root cause. Runtime monitoring is the last line of defense.

The strength-scope trade-off: Falsification is easy to apply but only proves something is wrong — failure to find a bug does not mean there are none. Formal verification gives iron-clad guarantees but only works for simple enough models. The art of validation is choosing the right tool for the right part of the system.
Output | Strength | Limitation | Chapters
Counterexample | Conclusive proof of flaw | No counterexample found ≠ no flaw | 4-5
Failure probability | Quantitative risk | Relies on model accuracy | 6-7
Reachable set | Mathematical guarantee | Only tractable for simple dynamics | 8-10
Explanation | Actionable insight | May miss nonlinear interactions | 11
Runtime guard | Defense in depth | Reactive, not proactive | 12
Check: If a falsification algorithm fails to find any counterexample, what can we conclude?

Chapter 8: Swiss Cheese

No single validation method catches everything. James Reason's Swiss Cheese Model of accident causation provides the right mental model: think of each validation layer as a slice of Swiss cheese. Each slice has holes — weaknesses, blind spots, things it misses. A failure occurs only when the holes in every layer happen to line up, letting a hazard pass through all defenses.

The more layers you have, the less likely it is that all holes align. Formal verification has holes (it works on simplified models, not the real system). Testing has holes (it only covers scenarios you thought of). Runtime monitoring has holes (it can only detect problems it was designed to recognize). But together, the uncovered region shrinks dramatically.

Defense in depth: This is why aviation is so safe despite using imperfect components. There is no single system that prevents crashes. Instead, there are layers: pilot training, autopilot, collision avoidance (TCAS), air traffic control, structural redundancy, maintenance schedules, accident investigation. Each is imperfect. Together, they support a design target on the order of 10^−9 catastrophic failures per flight hour.
[Interactive figure: Swiss Cheese Model: Defense in Depth. Four validation layers, each with holes. Hazard trajectories (red dots) fall from the top; a failure is any hazard that slips through every layer. Adding or removing layers changes the total coverage.]

The lesson is profound: validation is not a single activity. It is a portfolio. The book's algorithms — falsification (Chapters 4-5), failure analysis (6-7), reachability (8-10), explainability (11), runtime monitoring (12) — are not alternatives. They are layers of cheese. A responsible validation effort uses as many as the budget allows.

The formal argument: If each layer independently misses a fraction p_i of failures, then with n layers, the fraction that slips through all layers is ∏ p_i. Four layers that each miss 10% of failures together miss only 0.1^4 = 0.0001, or 0.01%, a 1000x improvement over a single layer. This is why redundancy works, even with imperfect components.
Layer | What it catches | What it misses (holes)
Formal verification | All failures in the simplified model | Model abstraction errors
Falsification testing | Concrete failure scenarios | Rare or novel scenarios
Statistical estimation | Failure probability bounds | Model distribution mismatch
Runtime monitoring | Real-time anomalies | Monitors not designed for the failure mode
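The independence argument for layered defenses can be checked numerically. This sketch assumes, as the argument does, that the layers miss failures independently; the function name `fraction_missed` is illustrative.

```python
import math

def fraction_missed(miss_rates):
    """Fraction of failures that slip through every layer, assuming the
    layers miss failures independently: the product of the p_i."""
    return math.prod(miss_rates)

single = fraction_missed([0.1])
four = fraction_missed([0.1] * 4)
improvement = single / four
```

Independence is the strong assumption here. If the holes are correlated, for example because every layer relies on the same simplified model, the product understates how many failures get through.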
Check: According to the Swiss Cheese model, when does a system failure occur?

Chapter 9: Summary

This chapter introduced the validation problem. An autonomous system is an agent interacting with a world through noisy sensors, producing trajectories. Validation asks: do these trajectories satisfy our specifications? And do we have enough evidence to trust the answer?

Concept | What it is | Why it matters
Validation | Checking requirements against real-world conditions | The gap between "works in tests" and "works in reality"
V-model | Development lifecycle with matched testing levels | Catch bugs at the level they were introduced
POMDP framework | Agent + world + partial observability | Formal language for reasoning about autonomous systems
Trajectory τ | Sequence of (s, o, a) over time | The basic unit that specifications evaluate
Specification ψ | Boolean function on trajectories | Defines what "safe" means formally
Swiss Cheese | Multiple imperfect layers → robust defense | No single method is sufficient

What comes next:

Chapter 2 builds the modeling tools. Before we can validate, we need to model the system: transition dynamics, observation models, agent policies. Maximum likelihood, Bayesian learning, and behavioral cloning give us the building blocks.

The book's roadmap:

Chapters 1-3: Framework & specification.
Chapters 4-5: Falsification.
Chapters 6-7: Failure analysis.
Chapters 8-10: Reachability.
Chapters 11-12: Deployment.

The big picture: Validation is not a checklist you complete before shipping. It is a discipline woven into the entire development process. The algorithms in this book are tools, but the mindset is what matters: assume the system is flawed until proven otherwise. The Challenger launch decision assumed the O-rings would hold. The 737 MAX design assumed one sensor was enough. Validation begins with the assumption of failure and works backward to evidence of safety.
"The first principle is that you must not fool yourself —
and you are the easiest person to fool."
— Richard Feynman
Check: What is the main takeaway from the Swiss Cheese model for validation?