Ch 1: Introduction — Algorithms for Validation

Chapter 0: Why Validate?

You are an engineer at a company that builds self-driving cars. Your team has spent three years training a neural network to drive. It handles highways, rain, night-time, construction zones. It passes every test you throw at it. Time to ship?

Not yet. You have shown the system works in the situations you thought of. But what about the situations you did not think of? A child chasing a ball into traffic at dusk. A mattress falling off a truck on a curve. A sensor failure during a snowstorm. The gap between "works in our tests" and "works in the real world" is where people die.

Validation is the discipline of systematically closing that gap. It asks: does this system actually satisfy its requirements in the conditions it will face? Not "does it work on our benchmark" — does it really work?

The Challenger analogy: The space shuttle Challenger did not need a better O-ring specification. NASA had the specification. The O-ring was rated for a minimum temperature. What they needed was someone to check whether the O-ring would work at 36°F on a January morning in Florida. The spec existed; the validation did not. Seven people died because no one systematically verified the operating conditions against the requirements. That is a validation failure.

This is not just an engineering nicety. Autonomous systems are being deployed in aviation, healthcare, finance, and defense. Each domain has catastrophic failure modes. The question is not whether to validate, but how — and whether the methods are strong enough for the stakes.

This chapter introduces the validation framework. By the end, you will understand why validation is hard, what tools exist, and why no single tool is sufficient — which is why we need a whole book of algorithms.

Design

Build a system that you believe meets requirements

↓

Validate

Systematically check whether it actually meets them

↓

Deploy

Only once you have sufficient evidence of safety

Verification vs. validation: The classic distinction — verification asks "did we build the system right?" (does the code match the spec?). Validation asks "did we build the right system?" (does the spec actually guarantee safety in the real world?). This book focuses on validation, but many of the algorithms apply to both.

Check: What is the core difference between testing and validation?

Testing runs a few examples; validation uses formal methods only
Testing checks known scenarios; validation systematically covers the gap between known scenarios and real-world conditions
There is no difference; they are synonyms

Chapter 1: The V-Model

Building a safety-critical system is not "write code, then test." There is a structured development lifecycle, and it has a name: the V-model. Imagine the letter V. The left side goes down (decomposing requirements into design into code). The right side goes up (testing each level against its corresponding requirement).

At the top-left, you start with stakeholder needs: "the car must not hit pedestrians." This decomposes into system requirements ("detect pedestrians within 50m"), then architecture ("use LiDAR + camera fusion"), then detailed design ("run YOLO at 30fps"), then implementation (actual code). That is the left side — the descent from abstract to concrete.

The right side mirrors it. Unit tests check the code. Integration tests check the architecture. System tests check the full requirements. And acceptance tests check the original stakeholder needs. Each level on the right validates the corresponding level on the left.

Figure: The V-Model: Development & Validation. Each development phase on the left is paired with its validation counterpart on the right. The cost of fixing a bug multiplies roughly 10x at each level.

The 10x rule of bug cost: A requirements error caught during design costs $1 to fix. Caught during coding: $10. During integration testing: $100. During system testing: $1,000. After deployment: $10,000 or more — or, in safety-critical systems, it costs lives. The whole point of the V-model is to catch errors at the level where they were introduced, before they metastasize.

Traditional V-model validation uses testing: run the system, see what happens. But for autonomous systems, testing alone is not enough. A self-driving car would need to drive billions of miles to statistically prove it is safer than humans. We cannot wait that long. This book's algorithms — falsification, reachability, importance sampling — are designed to replace brute-force testing with smarter, targeted analysis.
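The scale of the brute-force problem can be estimated with a standard zero-failure bound. This is only a rough sketch: it assumes failures occur independently at a constant rate per mile, so that n failure-free miles support the claim "rate ≤ r" with confidence C once n ≥ −ln(1 − C)/r. The benchmark rate used here (about one fatality per 100 million miles) is an illustrative assumption, not a figure from this chapter.

```python
import math

def failure_free_miles(rate_per_mile: float, confidence: float) -> float:
    """Miles with zero failures needed to claim rate <= rate_per_mile.

    Assumes independent failures at a constant per-mile rate, so that
    P(no failure in n miles) = exp(-rate * n).
    """
    return -math.log(1.0 - confidence) / rate_per_mile

# Illustrative assumption: a human benchmark of ~1 fatality per 100 million miles.
human_rate = 1e-8
miles = failure_free_miles(human_rate, confidence=0.95)
print(f"{miles:.3g} failure-free miles")
```

Under these assumptions the answer is roughly 300 million miles just to match the benchmark. Showing a meaningful improvement over it, or allowing for a few observed failures, pushes the requirement far higher.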

V-Model Level	Left (Design)	Right (Validation)
1 — Stakeholder	Needs & constraints	Acceptance testing
2 — System	System requirements	System testing
3 — Architecture	Subsystem design	Integration testing
4 — Detailed	Module design	Unit testing
5 — Code	Implementation (the bottom of the V)

Check: Why is catching a bug at the requirements level so much cheaper than catching it after deployment?

Because at the requirements level, only the spec needs to change; nothing has been built yet
Because requirements bugs are always small
Because deployment bugs are only found by users

Chapter 2: When Things Break

Validation is not abstract. Real systems have failed, and real people have been harmed. Understanding these failures is the best way to internalize why every algorithm in this book exists.

Therac-25 (1985-87): A radiation therapy machine delivered massive overdoses in at least six accidents, several of them fatal. The cause? A software race condition that only occurred when the operator typed commands very quickly. Testing had never triggered this sequence because testers typed at normal speed. The machine had removed a hardware safety interlock present in earlier models, trusting the software alone. Lesson: removing redundant safety layers based on software confidence is a validation failure.

Boeing 737 MAX (2018-19): The MCAS system, designed to prevent stalls, relied on a single angle-of-attack sensor. When that sensor failed, MCAS repeatedly pushed the nose down. Pilots were not trained on MCAS and could not override it in time. 346 people died across two crashes. Lesson: validating subsystem behavior under sensor failure is as important as validating normal operation.

Amazon hiring algorithm (2018): Amazon built an ML system to screen resumes. It penalized resumes containing the word "women's" (as in "women's chess club captain"). The model had learned from historical hiring data, which reflected existing bias. Lesson: a system can satisfy its training objective perfectly and still fail its real-world specification. Specification matters.

Notice the pattern. In every case, the system did what it was designed to do. Therac-25 ran the software correctly (the race condition was a design flaw). MCAS followed its algorithm (the single-sensor architecture was the flaw). Amazon's model optimized its loss function (the training data encoded bias). Validation failures are not about bugs in the usual sense — they are about the gap between what you built and what you should have built.

Failure Pattern Analysis

Each failure maps to a specific validation gap:

System	What failed	Validation gap
Therac-25	Race condition in operator input	No testing of edge-case timing
737 MAX	Single sensor dependency	No fault-tree analysis on MCAS inputs
Amazon hiring	Biased training data	No specification of fairness constraints

The lesson: Validation is not just running tests until they pass. It requires: (1) specifying what "correct" means precisely, (2) systematically searching for conditions where correctness fails, and (3) providing evidence that no such conditions exist — or mitigating them if they do. These three steps map to the three pillars of this book: specification, falsification, and formal analysis.

Check: What do the Therac-25, 737 MAX, and Amazon failures have in common?

They were all caused by hardware malfunctions
The developers did not run any tests
The system did what it was designed to do, but the design missed critical real-world conditions

Chapter 3: System = Agent + World

To validate a system, we first need a formal language for describing it. The book models autonomous systems as an agent interacting with a world. The agent receives observations, takes actions, and the world transitions to new states. This is the POMDP framework — a partially observable Markov decision process.

Think of a self-driving car. The state s is everything about the world: positions and velocities of every vehicle, the road layout, weather, traffic signals. The agent does not see this directly — it receives observations o through noisy sensors (cameras, LiDAR, radar). Based on observations, it takes actions a (steer, accelerate, brake). The world then transitions to a new state s' according to its dynamics.

System: agent + world → (S, A, O, T, O, π)

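Here, S is the state space (all possible world configurations), A is the action space (everything the agent can do), and O is the observation space (everything the agent can see). The symbol O does double duty: it also names the observation model O(o|s), which is why it appears twice in the tuple. T is the transition model and π is the agent's policy. The dynamics are governed by two functions: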

T(s'|s, a) — transition: probability of next state given current state and action

O(o|s) — observation: probability of seeing o when the true state is s

Think of it this way: The state is reality. The observation is what the agent perceives of reality — a noisy, incomplete picture. The transition function is the physics of the world. The observation function is the quality of the agent's sensors. Validation must account for uncertainty in both.

Figure: Agent-World Interaction Loop. One step of the interaction cycle: state → observe → decide → act → transition.

Symbol	Name	Example (self-driving car)
s ∈ S	State	Positions & velocities of all vehicles
a ∈ A	Action	Steering angle, throttle, brake
o ∈ O	Observation	Camera images, LiDAR point cloud
T(s'|s,a)	Transition	Physics of vehicle motion
O(o|s)	Observation model	Sensor noise characteristics
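One way to make the tuple concrete is a small container that holds each component. The sketch below is an assumption-laden illustration rather than the book's implementation: the `System` class, the `Dist` alias, and the two-state toy world with its probabilities are all made up here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Dist = Dict[str, float]  # maps an outcome to its probability

@dataclass
class System:
    states: List[str]                       # S
    actions: List[str]                      # A
    observations: List[str]                 # O (the observation space)
    transition: Callable[[str, str], Dist]  # T(s' | s, a)
    observe: Callable[[str], Dist]          # O(o | s)
    policy: Callable[[str], Dist]           # pi(a | o)

# Hypothetical two-state world; every number is invented for illustration.
toy = System(
    states=["safe", "crashed"],
    actions=["go", "stop"],
    observations=["clear", "unclear"],
    transition=lambda s, a: (
        {"crashed": 1.0} if s == "crashed"
        else {"safe": 0.95, "crashed": 0.05} if a == "go"
        else {"safe": 1.0}
    ),
    observe=lambda s: (
        {"clear": 0.9, "unclear": 0.1} if s == "safe"
        else {"clear": 0.2, "unclear": 0.8}
    ),
    policy=lambda o: (
        {"go": 0.7, "stop": 0.3} if o == "clear"
        else {"go": 0.1, "stop": 0.9}
    ),
)

def is_distribution(d: Dist) -> bool:
    """Check that probabilities are non-negative and sum to one."""
    return all(p >= 0 for p in d.values()) and abs(sum(d.values()) - 1.0) < 1e-9

# Every (state, action) pair must yield a valid distribution over next states.
assert all(is_distribution(toy.transition(s, a))
           for s in toy.states for a in toy.actions)
```

Representing T, O, and π as functions that return distributions keeps the model, the sensors, and the policy separable, which matters later when each is validated or replaced independently.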

Check: Why does the agent not have direct access to the state s?

Because states are too large to store
Because real sensors are noisy and incomplete; the agent can only observe o, a partial view of s
Because the state changes too quickly

Chapter 4: Observations & Policies

The agent needs a strategy for choosing actions. This strategy is called a policy, written π. A policy maps from what the agent has seen (observations) to what it does (actions).

Policies can be deterministic — given an observation, the agent always picks the same action: a = π(o). Or they can be stochastic — the agent samples actions from a probability distribution: a ~ π(a|o). A stochastic policy adds randomness, which can help with exploration or robustness.

Deterministic: a = π(o)
Stochastic: a ~ π(a|o)
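The two policy types differ only in whether the action is computed or sampled. A minimal sketch, with hypothetical observation and action names and made-up probabilities:

```python
import random

def deterministic_policy(o: str) -> str:
    """a = pi(o): the same observation always yields the same action."""
    return "go" if o == "clear" else "stop"

def stochastic_policy(o: str, rng: random.Random) -> str:
    """a ~ pi(a | o): sample an action from a distribution that depends on o."""
    p_go = 0.7 if o == "clear" else 0.1   # made-up probabilities
    return "go" if rng.random() < p_go else "stop"

rng = random.Random(0)
print(deterministic_policy("clear"))
print([stochastic_policy("clear", rng) for _ in range(5)])
```

For validation, the practical consequence is that a stochastic policy adds another source of randomness to every trajectory, so a single run says little about what the policy can do.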

Why does partial observability matter for validation? Because the agent's behavior depends on what it sees, not what is true. If a pedestrian is behind an occlusion, the true state includes the pedestrian, but the observation does not. A policy that is safe given full state information may be dangerous given partial observations.

Key insight: Validation must test the closed loop — the agent interacting with the world through its sensors. Testing the policy in isolation (with perfect state information) does not capture the real failure modes. A perfect driver with a foggy windshield is not a safe system.

In the POMDP framework, the agent often maintains a belief state b — a probability distribution over possible states given all past observations. The policy then maps beliefs to actions: π(b) or π(a|b). This is how the agent deals with uncertainty: it tracks what it thinks the world looks like, and acts accordingly.

Fully observable (MDP):

Agent sees the true state s directly.
Policy: π(s) or π(a|s).
Simpler to analyze.

Partially observable (POMDP):

Agent sees noisy observations o.
Policy: π(b) or π(a|o_1:t).
Much harder to validate.

Self-driving example: An MDP policy could be: "if the pedestrian is at position (x,y), brake." A POMDP policy must instead be: "if my belief about the pedestrian's position, given camera data, gives P(pedestrian in crosswalk) > 0.8, brake." The second is realistic but far harder to validate — you must reason about the sensor, the belief update, and the policy together.
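The belief-based rule above can be sketched as a discrete Bayes filter followed by a threshold policy. This is an illustration under stated assumptions: the two-state pedestrian model, the detection probabilities, and the function names are hypothetical; only the 0.8 threshold comes from the example.

```python
def belief_update(belief, action, obs, transition, observe):
    """One step of a discrete Bayes filter.

    b'(s') is proportional to O(obs | s') * sum_s T(s' | s, action) * b(s).
    """
    predicted = {}
    for s, b_s in belief.items():
        for s_next, p in transition(s, action).items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * b_s
    unnormalized = {s: observe(s).get(obs, 0.0) * p for s, p in predicted.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical pedestrian model; all numbers are illustrative.
def transition(s, a):
    return {s: 1.0}  # pedestrian presence does not change within one step

def observe(s):
    if s == "pedestrian":
        return {"detect": 0.9, "no_detect": 0.1}
    return {"detect": 0.3, "no_detect": 0.7}   # false-positive rate of 0.3

def policy(belief, threshold=0.8):
    """pi(b): brake once P(pedestrian) exceeds the threshold."""
    return "brake" if belief.get("pedestrian", 0.0) > threshold else "continue"

b = {"pedestrian": 0.5, "empty": 0.5}
for _ in range(2):
    b = belief_update(b, "continue", "detect", transition, observe)
    print(round(b["pedestrian"], 3), policy(b))
```

With these numbers a single detection raises the belief to 0.75, which is not enough to brake; the second detection raises it to 0.9. Whether that one-step delay is acceptable depends on the sensor model, the belief update, and the threshold together, which is exactly why the three cannot be validated separately.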

Check: Why must validation test the closed loop (agent + sensors + world) rather than the policy alone?

Because the agent's behavior depends on noisy observations, not the true state, so failures may only appear when the full sensing pipeline is included
Because policies are always deterministic
Because testing the policy alone is computationally cheaper

Chapter 5: Trajectories

When we run a system, we get a sequence of states, observations, and actions unfolding over time. This sequence is called a trajectory, denoted τ:

τ = (s₁, o₁, a₁, s₂, o₂, a₂, …, s_T)

A trajectory is one complete "story" of the system — from initial conditions to termination. If the system is stochastic (random transitions, noisy sensors, stochastic policy), then each run produces a different trajectory. The trajectory distribution p(τ) captures the probability of every possible story.

Let's derive the trajectory density. Each trajectory is a chain of transitions, observations, and actions. By the chain rule of probability:

p(τ) = p(s₁) · ∏_{t=1}^{T-1} O(o_t | s_t) · π(a_t | o_{1:t}) · T(s_{t+1} | s_t, a_t)

Reading left to right: we start in state s₁ (with probability p(s₁)), observe o₁ (with probability O(o₁|s₁)), take action a₁ (with probability π(a₁|o₁)), transition to s₂ (with probability T(s₂|s₁, a₁)), and so on.

Worked example: A robot in a 3-state world, T=2 steps. States: {safe, risky, crashed}. Observation: {clear, unclear}. Action: {go, stop}. Suppose p(s₁=safe) = 0.8, O(clear|safe) = 0.9, π(go|clear) = 0.7, T(safe|safe, go) = 0.95. Then the trajectory τ = (safe, clear, go, safe) has probability 0.8 × 0.9 × 0.7 × 0.95 = 0.4788. That single trajectory is nearly 50% likely. Most of the mass is on "boring" safe trajectories. The rare, dangerous ones are what validation must find.
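The arithmetic in the worked example can be checked mechanically. The sketch below stores only the four probabilities the example supplies and multiplies them along the trajectory; the dictionary layout and the function name are choices made for this illustration, and the policy is treated as depending on the current observation only, as in the example.

```python
import math

# Numbers from the worked example; only the entries needed for this one
# trajectory are filled in.
p_init = {"safe": 0.8}
O = {("clear", "safe"): 0.9}           # O(o | s), keyed by (o, s)
pi = {("go", "clear"): 0.7}            # pi(a | o), keyed by (a, o)
T = {("safe", "safe", "go"): 0.95}     # T(s' | s, a), keyed by (s', s, a)

def trajectory_probability(tau):
    """tau is the flat tuple (s1, o1, a1, s2, ..., s_T)."""
    p = p_init[tau[0]]
    # Each block (s, o, a, s_next) contributes one observation, action,
    # and transition factor.
    for i in range(0, len(tau) - 1, 3):
        s, o, a, s_next = tau[i:i + 4]
        p *= O[(o, s)] * pi[(a, o)] * T[(s_next, s, a)]
    return p

p = trajectory_probability(("safe", "clear", "go", "safe"))
assert math.isclose(p, 0.4788)
```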

Generating a trajectory is called a rollout. You sample s₁ from the initial distribution, then observe, act, and transition, and repeat. Each rollout is one Monte Carlo sample from p(τ). The procedure is simple, but the number of possible trajectories grows exponentially with the horizon, and most of the interesting (dangerous) ones are vanishingly rare.

The needle-in-a-haystack problem: If the probability of a dangerous trajectory is 10⁻⁹, you need roughly 10⁹ rollouts on average before you see the first one. At 1 second per rollout, that is about 31 years of simulation. Chapters 4-10 of this book are about finding those needles faster.
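The numbers in this estimate follow from treating each rollout as an independent trial with failure probability 10⁻⁹. A short check, assuming independence between rollouts (the variable and function names are illustrative):

```python
import math

p_fail = 1e-9            # probability that one rollout is dangerous
seconds_per_rollout = 1.0

expected_rollouts = 1.0 / p_fail   # mean of a geometric distribution
years = expected_rollouts * seconds_per_rollout / (365.25 * 24 * 3600)

def p_at_least_one(n: float) -> float:
    """Probability of seeing at least one failure in n independent rollouts."""
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n * math.log1p(-p_fail))

print(f"{years:.1f} years for {expected_rollouts:.0e} rollouts")
print(f"P(at least one failure) = {p_at_least_one(expected_rollouts):.3f}")
```

Note that even after the expected 10⁹ rollouts there is still roughly a 37% chance of having seen no failure at all, so plain Monte Carlo is both slow and unreliable at this scale.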

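python
import random

# Rollout: sample one trajectory tau = (s1, o1, a1, s2, ..., s_T) from p(tau).
# p_init is a dict {state: probability}; observe(s), pi(o), and
# transition(s, a) each return a dict {outcome: probability}.
def sample(dist):
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def rollout(T, pi, transition, observe, p_init):
    s = sample(p_init)                # s1 ~ p(s1)
    tau = [s]
    for t in range(T - 1):            # T states means T-1 transitions
        o = sample(observe(s))        # o_t ~ O(o|s_t)
        a = sample(pi(o))             # a_t ~ pi(a|o_t), a memoryless policy
        s = sample(transition(s, a))  # s_{t+1} ~ T(s'|s_t, a_t)
        tau += [o, a, s]              # keep tau flat, matching the definition
    return tau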

Check: In the trajectory density p(τ), which three probability distributions multiply together at each time step?

Chapter 6: Specifications

A trajectory tells us what happened. A specification tells us whether what happened was acceptable. Formally, a specification is a function that takes a trajectory and returns true (pass) or false (fail):

ψ(τ) ∈ {true, false}

Think of it as a referee. The trajectory is the game tape. The specification watches the tape and blows the whistle if anything went wrong.

Some specifications are simple: "the car never exceeds 65 mph." Others are complex: "if a pedestrian enters the crosswalk, the car must stop within 3 seconds." The complexity of the specification determines how hard it is to validate — a speed limit check is trivial, a temporal safety constraint requires reasoning about sequences of events.
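The two example requirements can be written directly as Boolean functions on a trajectory. The sketch below assumes a hypothetical trajectory encoding (one time, speed, and pedestrian flag per step) and reads "stop within 3 seconds" as "speed reaches zero within 3 seconds of the pedestrian entering"; both are choices made for this illustration.

```python
from typing import List, Tuple

# Assumed encoding: one (time_s, speed_mph, pedestrian_in_crosswalk) per step.
Step = Tuple[float, float, bool]

def never_exceeds(tau: List[Step], limit_mph: float = 65.0) -> bool:
    """Point-wise check: every step must satisfy the speed limit."""
    return all(speed <= limit_mph for _, speed, _ in tau)

def stops_for_pedestrian(tau: List[Step], deadline_s: float = 3.0) -> bool:
    """Temporal check: each time a pedestrian enters the crosswalk, the car
    must reach zero speed within deadline_s seconds."""
    for i, (t0, _, pedestrian) in enumerate(tau):
        entered = pedestrian and (i == 0 or not tau[i - 1][2])
        if entered and not any(speed == 0.0 and t0 <= t <= t0 + deadline_s
                               for t, speed, _ in tau):
            return False
    return True

tau = [(0.0, 30.0, False), (1.0, 30.0, True), (2.0, 15.0, True), (3.0, 0.0, True)]
print(never_exceeds(tau), stops_for_pedestrian(tau))
```

The speed check looks at each step in isolation, while the pedestrian check has to relate an event at one time to a response at a later time. That difference is what makes temporal specifications harder to evaluate and to search against.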

Specifications encode the meaning of "safe." Without a formal specification, validation is just vibes. You cannot prove a system is safe if you have not written down what "safe" means. The Amazon hiring example is instructive — the specification was "predict which candidates will succeed," but the real requirement was "predict which candidates will succeed, without gender or race bias." The missing specification led to a validated-but-harmful system.

This book uses specifications of increasing complexity:

Type	Example	Chapter
Point-wise	ψ(τ) = (s_t ∉ S_unsafe for all t)	3
Temporal logic	"If pedestrian detected, eventually stop"	3
Probabilistic	P(ψ(τ) = false) < 10⁻⁹	7
Reachability	"No reachable state is in S_unsafe"	8-10

The simplest validation task is falsification: find any trajectory where ψ(τ) = false. Even one counterexample proves the system is flawed. The hardest is formal verification: prove that ψ(τ) = true for all possible trajectories. Most validation lives between these extremes: estimating the probability that ψ fails, or bounding the set of reachable states.
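Falsification in its most basic form is a search loop: sample a scenario, simulate it, and stop at the first violation of ψ. The sketch below uses plain random search on a toy braking model. The model, its parameter ranges, and all function names are assumptions made for illustration; the later chapters describe more targeted search strategies.

```python
import random

def simulate(gap_m: float, speed_mps: float, reaction_s: float,
             decel: float = 6.0, dt: float = 0.1) -> list:
    """Toy closed loop: the car keeps its speed for reaction_s seconds, then
    brakes at a constant rate. Returns the gap to the obstacle over time."""
    gaps, t = [gap_m], 0.0
    while speed_mps > 0.0:
        if t >= reaction_s:
            speed_mps = max(0.0, speed_mps - decel * dt)
        gap_m -= speed_mps * dt
        gaps.append(gap_m)
        t += dt
    return gaps

def spec(gaps: list) -> bool:
    """psi(tau): the gap to the obstacle stays positive."""
    return min(gaps) > 0.0

def falsify(n_trials: int, rng: random.Random):
    """Random-search falsification: return the first violating scenario, or None."""
    for _ in range(n_trials):
        scenario = (rng.uniform(20, 60), rng.uniform(5, 25), rng.uniform(0.2, 1.5))
        if not spec(simulate(*scenario)):
            return scenario
    return None

counterexample = falsify(1000, random.Random(0))
print(counterexample)
```

A returned scenario is conclusive evidence of a flaw. A return value of `None` only says that this particular search, with this budget, found nothing.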

Check: What does a specification ψ(τ) do?

It generates trajectories
It takes a trajectory and returns pass/fail based on whether safety requirements were met
It optimizes the agent's policy

Chapter 7: Algorithm Outputs

Different validation algorithms answer different questions. The book organizes them by what they produce:

Falsification

Find a single trajectory where ψ(τ) = false. One counterexample is enough to prove the system is flawed.

↓

Probability Estimation

Estimate P(ψ(τ) = false). How often does the system fail? Is it 10⁻³ or 10⁻⁹?

↓

Formal Guarantees

Prove that for ALL reachable states, ψ holds. Mathematical certainty, no sampling needed.

↓

Explanations

Show WHY the system fails — feature importance, counterfactuals, surrogate models.

↓

Runtime Assurance

Monitor the system in real time and intervene if it is about to violate ψ.

These are not competing approaches — they are complementary. Falsification finds bugs quickly. Probability estimation quantifies residual risk. Formal methods provide guarantees on simplified models. Explanations help engineers fix the root cause. Runtime monitoring is the last line of defense.
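To illustrate what "quantifies residual risk" means in practice, here is a minimal Monte Carlo estimate of P(ψ(τ) = false) with a normal-approximation error bar. The toy system (ten steps, each failing independently with probability 0.01) is an assumption chosen so the exact answer is known for comparison.

```python
import math
import random

def run_once(rng: random.Random, steps: int = 10, p_step: float = 0.01) -> bool:
    """Toy system (an assumption for this sketch): each step independently
    fails with probability p_step. Returns True if psi(tau) is violated."""
    return any(rng.random() < p_step for _ in range(steps))

def estimate_failure_probability(n: int, rng: random.Random):
    """Return the sample failure rate and its standard error over n rollouts."""
    failures = sum(run_once(rng) for _ in range(n))
    p_hat = failures / n
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, std_err

p_hat, std_err = estimate_failure_probability(100_000, random.Random(0))
exact = 1.0 - (1.0 - 0.01) ** 10
print(f"estimate {p_hat:.4f} +/- {1.96 * std_err:.4f}, exact {exact:.4f}")
```

This works here because failures are common (about 10%). For failure probabilities near 10⁻⁹ the same estimator would need an impractical number of rollouts, which is the motivation for the more efficient estimators the book covers later.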

The strength-scope trade-off: Falsification is easy to apply but only proves something is wrong — failure to find a bug does not mean there are none. Formal verification gives iron-clad guarantees but only works for simple enough models. The art of validation is choosing the right tool for the right part of the system.

Output	Strength	Limitation	Chapters
Counterexample	Conclusive proof of flaw	Finding no counterexample ≠ absence of flaws	4-5
Failure probability	Quantitative risk	Relies on model accuracy	6-7
Reachable set	Mathematical guarantee	Only tractable for simple dynamics	8-10
Explanation	Actionable insight	May miss nonlinear interactions	11
Runtime guard	Defense in depth	Reactive, not proactive	12

Check: If a falsification algorithm fails to find any counterexample, what can we conclude?

The system is safe
The system has no bugs
Nothing conclusive; we just did not find a counterexample this time

Chapter 8: Swiss Cheese

No single validation method catches everything. James Reason's Swiss Cheese Model of accident causation provides the right mental model: think of each validation layer as a slice of Swiss cheese. Each slice has holes — weaknesses, blind spots, things it misses. A failure occurs only when the holes in every layer happen to line up, letting a hazard pass through all defenses.

The more layers you have, the less likely it is that all holes align. Formal verification has holes (it works on simplified models, not the real system). Testing has holes (it only covers scenarios you thought of). Runtime monitoring has holes (it can only detect problems it was designed to recognize). But together, the uncovered region shrinks dramatically.

Defense in depth: This is why aviation is so safe despite using imperfect components. There is no single system that prevents crashes. Instead, there are layers: pilot training, autopilot, collision avoidance (TCAS), air traffic control, structural redundancy, maintenance schedules, accident investigation. Each is imperfect. Together, they let commercial aviation pursue design targets on the order of 10⁻⁹ catastrophic failures per flight hour.

Figure: Swiss Cheese Model: Defense in Depth. Four validation layers, each with holes. Hazard trajectories (red dots) fall from the top; only those that find a hole in every layer slip through. Adding layers reduces how many get through.

The lesson is profound: validation is not a single activity. It is a portfolio. The book's algorithms — falsification (Chapters 4-5), failure analysis (6-7), reachability (8-10), explainability (11), runtime monitoring (12) — are not alternatives. They are layers of cheese. A responsible validation effort uses as many as the budget allows.

The formal argument: If each layer independently misses a fraction p_i of failures, then with n layers, the fraction that slips through all layers is ∏ p_i. Four layers that each miss 10% of failures together miss only 0.1⁴ = 0.01% — a 1000x improvement over a single layer. This is why redundancy works, even with imperfect components.
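The product rule is easy to tabulate. The sketch below computes the combined miss rate for one to four layers at 10% each. It relies on the independence assumption stated above; layers that share a blind spot would do worse than this calculation suggests. The function name is illustrative.

```python
import math

def combined_miss_rate(miss_rates):
    """Fraction of failures that slips through every layer, assuming the
    layers miss failures independently of one another."""
    return math.prod(miss_rates)

for n_layers in range(1, 5):
    rate = combined_miss_rate([0.1] * n_layers)
    print(n_layers, f"{rate:.4%}")
```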

Layer	What it catches	What it misses (holes)
Formal verification	All failures in the simplified model	Model abstraction errors
Falsification testing	Concrete failure scenarios	Rare or novel scenarios
Statistical estimation	Failure probability bounds	Model distribution mismatch
Runtime monitoring	Real-time anomalies	Monitors not designed for the failure mode

Check: According to the Swiss Cheese model, when does a system failure occur?

When the holes in all defense layers happen to align, letting a hazard pass through every layer
When any single layer has a hole
When the system runs out of layers

Chapter 9: Summary

This chapter introduced the validation problem. An autonomous system is an agent interacting with a world through noisy sensors, producing trajectories. Validation asks: do these trajectories satisfy our specifications? And do we have enough evidence to trust the answer?

Concept	What it is	Why it matters
Validation	Checking requirements against real-world conditions	The gap between "works in tests" and "works in reality"
V-model	Development lifecycle with matched testing levels	Catch bugs at the level they were introduced
POMDP framework	Agent + world + partial observability	Formal language for reasoning about autonomous systems
Trajectory τ	Sequence of (s, o, a) over time	The basic unit that specifications evaluate
Specification ψ	Boolean function on trajectories	Defines what "safe" means formally
Swiss Cheese	Multiple imperfect layers → robust defense	No single method is sufficient

What comes next:

Chapter 2 builds the modeling tools. Before we can validate, we need to model the system: transition dynamics, observation models, agent policies. Maximum likelihood, Bayesian learning, and behavioral cloning give us the building blocks.

The book's roadmap:

Chapters 1-3: Framework & specification.
Chapters 4-5: Falsification.
Chapters 6-7: Failure analysis.
Chapters 8-10: Reachability.
Chapters 11-12: Deployment.

The big picture: Validation is not a checklist you complete before shipping. It is a discipline woven into the entire development process. The algorithms in this book are tools, but the mindset is what matters: assume the system is flawed until proven otherwise. The Challenger launch decision assumed the O-rings were fine. The 737 MAX design assumed one sensor was enough. Validation begins with the assumption of failure and works backward to evidence of safety.

"The first principle is that you must not fool yourself —
and you are the easiest person to fool."
— Richard Feynman

Check: What is the main takeaway from the Swiss Cheese model for validation?

Use formal verification only
Use as many tests as possible
Layer multiple imperfect validation methods so their blind spots are unlikely to align
