Kochenderfer et al., Chapter 4

Falsification through Optimization

Finding failures by searching intelligently, not waiting for them to happen.

Prerequisites: Chapter 3 (Property Specification & STL Robustness). That's it.
10
Chapters
5+
Simulations
10
Quizzes

Chapter 0: What is Falsification?

Imagine you have built an autonomous landing system for a drone. You have run 10,000 simulations and it landed safely every time. Is it safe? You hope so. But you do not know so. The failures might be hiding in the 10,001st scenario — the one you did not think to test.

Falsification is the art of actively searching for failures instead of passively waiting for them to appear. Think of it as red-teaming for autonomous systems. You are not trying to prove the system is safe — you are trying to prove it is not. If you fail to find a failure despite an exhaustive, intelligent search, that is much stronger evidence of safety than any number of random tests.

The falsification mindset: Instead of asking "does the system work?", ask "how can I break it?" This adversarial perspective is the foundation of modern validation. Karl Popper formalized this in the philosophy of science: you cannot prove a theory true, only fail to prove it false. The same logic applies to autonomous systems.

Why not just run random simulations? Because failures in well-designed systems are rare events. If the failure probability is p = 10−6, you need roughly a million random simulations to expect even one failure. That may be computationally infeasible. Falsification uses intelligent search to find failures in far fewer evaluations.

Random Testing
Run N random simulations, hope to see a failure
↓ but failures are rare
Falsification
Intelligently search for scenarios that cause failures
↓ find failures faster
Validation
If intelligent search finds no failure, stronger confidence in safety

This chapter covers the progression from naive random testing to sophisticated optimization-based falsification. We start with direct sampling to establish the baseline, introduce the key abstraction of disturbances, then frame falsification as an optimization problem over disturbance sequences. The payoff: finding failures in hours that random testing would take years to discover.

MethodFailures found per simulation budgetDiversity of failures
Random samplingLow (proportional to pfail)High
FuzzingMediumMedium
Gradient optimizationHighLow (finds one mode)
Population methodsHighHigh (finds multiple modes)
Check: Why is random simulation inefficient for finding rare failures?

Chapter 1: Direct Sampling

The simplest approach: sample scenarios at random and simulate each one. If a failure occurs, you have found a counterexample. If not, run more simulations. How many do you need?

If the true failure probability is p, each simulation is an independent Bernoulli trial. The number of simulations until the first failure follows a geometric distribution:

P(first failure at trial k) = (1 − p)k−1 · p

The expected number of trials until the first failure is:

E[k] = 1 / p

If p = 10−3, you need about 1,000 simulations on average. If p = 10−6, you need a million. If p = 10−9 (the target for aviation-grade safety), you need a billion. At one simulation per second, that is 31 years.

The math is brutal: For a system designed to fail once in a billion trials, direct sampling requires a billion simulations to expect one failure. There is no way around this — it is a fundamental property of geometric distributions. Intelligent search methods must somehow beat 1/p by directing simulations toward the failure region.
Direct Sampling: Geometric Distribution

Each dot is a simulation. Green = pass, red = failure. Set the failure probability p and click Run to draw samples until a failure occurs. Watch how the number of required trials scales as 1/p.

Ready
pfail0.05

Direct sampling has one advantage: it is unbiased. The failures you find are representative of the true failure distribution. This is useful for estimating failure probability (Chapter 7). But for finding failures quickly, we need something smarter.

Confidence after N trials: If you run N simulations with no failure, you can bound the failure probability. The probability of seeing zero failures in N trials is (1−p)N. For this to be at most δ, you need p ≤ 1 − δ1/N ≈ −ln(δ)/N. With N = 3000 and δ = 0.05, you can claim p ≤ 10−3 with 95% confidence. Useful, but reaching p ≤ 10−9 this way is hopeless.
Check: If a system fails with probability p = 0.001, how many random simulations are needed on average to find one failure?

Chapter 2: Disturbances

The key to intelligent falsification is the disturbance decomposition. Most simulators have randomness: sensor noise, uncertain road conditions, random pedestrian behavior. Instead of letting the simulator sample this randomness internally, we externalize it.

Any stochastic simulation can be decomposed into a deterministic simulator plus a sequence of disturbance variables:

τ = f(x0, d1, d2, …, dT)

where τ is the trajectory, x0 is the initial state, and d1, …, dT are the disturbances at each time step. Given fixed disturbances, the trajectory is completely determined. The randomness has been pushed into the choice of d = (d1, …, dT).

The disturbance trick: Think of a stochastic simulator as a deterministic machine with knobs. Each knob controls one source of randomness (sensor noise at t=3, pedestrian speed at t=7, wind gust at t=12). In the original simulator, the knobs are set randomly. In the disturbance decomposition, YOU set the knobs. This lets you search over knob settings to find failures.
Disturbance Decomposition

A drone trying to land on a pad. The disturbance is wind at each time step. Drag the wind slider to set a constant disturbance and see how the trajectory changes. The original random simulation would sample wind from N(0, σ2).

Wind force0.0

The disturbance decomposition is more than a conceptual trick — it transforms the problem. Instead of sampling random scenarios and hoping for a failure, we can now search the disturbance space for disturbance sequences that cause failures. The failure region is a subset of disturbance space, and we can use optimization to find it.

Before decompositionAfter decomposition
Stochastic simulatorDeterministic simulator + disturbance input
Run and observeChoose disturbances and observe
Passive testingActive search
Sample from p(d)Optimize over d
Check: What does the disturbance decomposition do to a stochastic simulator?

Chapter 3: Trajectory Space

With the disturbance decomposition, every point in the disturbance space d = (d1, …, dT) maps to exactly one trajectory. The disturbance space is typically high-dimensional — if the system has T time steps and n disturbance variables per step, the space is nT-dimensional.

The trajectory τ(d) is a deterministic function of d. The metric m(τ) (e.g., minimum distance to obstacle) is a deterministic function of the trajectory. And the specification check φ(τ) (e.g., "distance always ≥ 2m") is a deterministic function of the metric. So the entire pipeline is:

Disturbance d
A point in RnT
↓ deterministic sim
Trajectory τ(d)
Complete system trace
↓ metric function
Robustness ρ(d)
Single real number (positive = safe, negative = failure)

The failure region is the set of disturbances where ρ(d) < 0. Falsification is the problem of finding any point in this region. If the failure region has non-zero volume in disturbance space, optimization methods can find it.

Key insight: The disturbance sequence determines the trajectory deterministically. This means the robustness ρ(d) is a well-defined function of d that we can evaluate, differentiate (in some cases), and optimize. Falsification has been reduced to function minimization: find d such that ρ(d) < 0.

Worked example: suppose the drone landing system has 20 time steps and one disturbance (wind) per step. The disturbance space is R20. A point d = (0.1, −0.3, 0.8, …) specifies the wind at each time step. The simulator produces a trajectory (the drone's position over time). The robustness ρ measures the minimum clearance minus the threshold. If ρ < 0, the drone crashed — we found a failure.

DimensionMeaning
d1Wind at t = 1
d2Wind at t = 2
d20Wind at t = 20
Check: After disturbance decomposition, what does falsification reduce to?

Chapter 4: Fuzzing

Before reaching for optimization, there is a middle ground between random sampling and gradient descent: fuzzing. The idea is simple — instead of sampling disturbances from the nominal distribution p(d), sample from a stress distribution that inflates the noise variance.

If the nominal disturbance distribution is N(0, σ2), fuzzing samples from N(0, kσ2) where k > 1 is the stress factor. Larger k means more extreme disturbances, which makes failures more likely to occur. This is the autonomous-systems equivalent of stress testing in engineering.

dfuzz ~ N(0, kσ2)     where k > 1

Fuzzing is easy to implement — just multiply the noise variance by k — and it dramatically increases the failure rate. If the original failure probability under nominal noise is 10−6, inflating the noise by k = 3 might push it to 10−2, requiring only 100 simulations instead of a million.

Key insight: Fuzzing trades realism for failure discovery speed. The failures you find under inflated noise are not necessarily representative of real-world failures — some may only occur under unrealistically extreme conditions. But they reveal where the system is fragile, which is valuable for debugging and improvement.
Fuzzing: Nominal vs Stressed

Left: nominal noise (σ). Right: stressed noise (kσ). Watch how increasing the stress factor k reveals failures (red dots) that are invisible under nominal conditions. Adjust k and click Run.

Ready
Stress factor k1

The limitation of fuzzing is that it is still undirected. It increases the probability of failures globally, but it does not steer toward them. If the failure region is a thin sliver in a high-dimensional space, inflating the noise may still miss it. For targeted failure search, we need optimization.

Check: What does fuzzing change about the disturbance distribution?

Chapter 5: The Optimization Framing

Here is the central idea of this chapter. We have a deterministic function ρ(d) that maps disturbance sequences to robustness values. Positive ρ means safe, negative means failure. Falsification is finding a d with ρ(d) < 0. This is a minimization problem:

minimize ρ(d)     over d ∈ RnT

If the minimum value is negative, we have found a failure. If the minimum is positive (or we cannot push it below zero), the system may be safe — but we cannot be sure (optimization can get stuck in local minima).

Falsification is minimization. This single reframing unlocks the entire toolbox of numerical optimization: gradient descent, evolutionary algorithms, Bayesian optimization, cross-entropy methods, and more. Each brings different strengths to the failure search problem.

The objective function ρ(d) is typically:

Expensive to evaluate — each evaluation requires running a full simulation.

Non-convex — the robustness landscape has multiple local minima (multiple failure modes).

High-dimensional — the disturbance space can be hundreds or thousands of dimensions.

Possibly non-differentiable — the min/max in robustness create sharp corners (but smooth robustness helps).

These properties rule out simple gradient descent on its own and motivate the use of global optimization methods, derivative-free methods, or smooth approximations of the robustness function.

Choose d
Pick a disturbance sequence
Simulate
τ = f(x0, d)
Evaluate
ρ = robustness(τ)
Update
Use ρ to choose next d (gradient, mutation, etc.)
↻ repeat until ρ < 0 or budget exhausted
Check: What makes the robustness function ρ(d) challenging to optimize?

Chapter 6: The Robustness Objective

Using STL robustness (from Chapter 3) as the objective function has a beautiful property: it provides a continuous, graded signal about how close the system is to failure. Unlike a Boolean pass/fail check, robustness tells the optimizer "you are getting warmer" or "you are getting colder."

Consider the spec □(distance ≥ 2m). If the minimum distance in a trajectory is 5m, the robustness is 3 (safely satisfied). If it is 2.1m, the robustness is 0.1 (barely satisfied). If it is 1.8m, the robustness is −0.2 (violated). The optimizer can follow the gradient from 3 down to 0.1 down to −0.2, smoothly descending into the failure region.

Key insight: A Boolean spec (pass/fail) gives the optimizer zero information about direction. It is like searching for a needle in a haystack while blindfolded. Robustness gives gradient information — it turns on the lights and shows you which direction leads to the needle. This is why STL robustness, not Boolean satisfaction, is the standard falsification objective.

For composed specifications, the robustness decomposes naturally. Recall from Chapter 3:

ρ(φ1 ∧ φ2) = min(ρ(φ1), ρ(φ2))

So the robustness of a conjunction is determined by the weakest component. The optimizer naturally focuses on whichever sub-specification is closest to violation. This aligns perfectly with the engineering intuition: the system's safety margin is determined by its weakest link.

For optimization, we often use the smooth robustness from Chapter 3, replacing min with softmin:

ρ̃(φ1 ∧ φ2) = −(1/β) · ln(e−βρ(φ1) + e−βρ(φ2))

This provides well-defined gradients everywhere, enabling gradient-based optimization. The parameter β controls the approximation quality — larger β is closer to the true min but has sharper gradients.

Check: Why is robustness a better optimization objective than Boolean satisfaction for falsification?

Chapter 7: Most Likely Failure

Minimizing robustness alone finds any failure — including unrealistic ones. If we crank the wind disturbance to 100 m/s, of course the drone crashes. But a 100 m/s gust has essentially zero probability. The failure is real but irrelevant.

We want to find plausible failures — failures that could actually occur under the nominal disturbance distribution. This requires combining the robustness objective with the likelihood of the disturbance sequence.

The likelihood of disturbance sequence d under the nominal distribution p(d) measures how "realistic" that scenario is. High likelihood = plausible scenario. Low likelihood = extreme scenario. We want to find failures that are as likely as possible:

maximize p(d)    subject to ρ(d) < 0

Equivalently, minimize the negative log-likelihood subject to the failure constraint. In practice, we combine the two objectives into a single weighted sum:

minimize λ · ρ(d) − (1 − λ) · log p(d)

The weight λ controls the tradeoff. Large λ focuses on finding any failure (even unlikely ones). Small λ focuses on plausible scenarios (even if they are safe). In between, we find the most likely failure.

Most likely failure: This is the falsification question that matters most in practice. It answers: "What is the most plausible scenario under which the system fails?" This is the scenario your engineers should worry about, the scenario regulators will ask about, and the scenario that should drive system redesign.

If the disturbances are Gaussian d ~ N(0, Σ), then −log p(d) = (1/2) dTΣ−1d + const. This is just the Mahalanobis distance from the origin. So minimizing −log p(d) while requiring ρ(d) < 0 finds the failure that is closest to the origin in Mahalanobis distance — the failure that requires the least extreme disturbances.

Most Likely Failure

The 2D plane shows disturbance space. The red region is where the system fails (ρ < 0). Contour lines show the disturbance likelihood (higher near the origin). The most likely failure is the failure point closest to the origin. Adjust λ to see the tradeoff.

λ (rob vs likelihood)0.50
Check: Why do we combine robustness with disturbance likelihood in the objective?

Chapter 8: Optimization Methods

We have framed falsification as minimizing ρ(d) (or the combined robustness-likelihood objective). Which optimization algorithm should we use? The answer depends on the properties of the problem and what we want from the search.

Gradient Descent

Update: d ← d − η∇dρ(d). Fast convergence to a local minimum. Requires differentiable robustness (use smooth robustness). Finds one failure mode efficiently but misses others.

Population Methods (CMA-ES)

Maintain a population of candidate disturbance sequences. Each generation: sample from a distribution, evaluate, update the distribution toward low-robustness regions. Finds diverse failure modes but needs more evaluations.

Gradient finds one failure. Population finds many. If you need to quickly confirm a failure exists, use gradient descent. If you need a comprehensive map of failure modes (e.g., for safety certification), use a population method like CMA-ES. The best approach often combines both: use CMA-ES to explore broadly, then refine each discovered failure mode with gradient descent.

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is the workhorse of falsification. It maintains a multivariate Gaussian distribution over disturbance space. Each generation: (1) sample λ candidates from the Gaussian, (2) evaluate robustness for each, (3) select the best μ candidates, (4) update the mean and covariance of the Gaussian to concentrate around the best candidates. The covariance matrix adapts to the local geometry of the objective, stretching the search distribution along favorable directions.

Gradient vs Population Search

2D disturbance space. Contour colors show robustness: red = failure, blue = safe. Click Gradient to run gradient descent from a random start. Click Population to run a CMA-ES-style search. Notice: gradient finds one failure; population explores multiple failure regions.

Ready
Speed80ms
MethodEvaluationsDiversityGradient needed?
Gradient descentLow (10s–100s)Low (one mode)Yes
CMA-ESMedium (100s–1000s)HighNo
Bayesian optimizationLow (10s–100s)MediumNo
Cross-entropyMediumHighNo
Diversity matters: A system might fail in multiple distinct ways — sensor failure, poor weather, adversarial pedestrian, edge-case road geometry. Gradient descent tends to find the same failure mode from different starting points (basin of attraction). Population methods can simultaneously explore multiple failure basins, giving engineers a more complete picture of system vulnerabilities.
Check: Why might a population method like CMA-ES be preferred over gradient descent for falsification?

Chapter 9: Summary

This chapter transformed the vague goal of "finding failures" into a precise optimization problem. The key steps: externalize randomness into disturbances, define robustness as the objective, and apply optimization to find failures efficiently.

ConceptRole in falsification
Disturbance decompositionTransforms stochastic sim into deterministic function of d
Robustness ρ(d)Continuous objective: positive = safe, negative = failure
Smooth robustnessDifferentiable approximation for gradient-based search
Most likely failureCombines ρ(d) with log p(d) to find plausible failures
Gradient descentFast, finds one failure mode
CMA-ESSlower, finds diverse failure modes
FuzzingSimple middle ground: inflate noise variance
The progression: Direct sampling (Chapter 1) needs ~1/p evaluations. Fuzzing (Chapter 4) reduces this by inflating noise. Optimization (Chapters 5–8) can find failures in tens to hundreds of evaluations regardless of p. The trade-off: optimization-based failures are not statistically representative (Chapter 7 of the book addresses failure probability estimation).

What comes next:

Chapter 5 extends falsification from optimization to planning — tree search, MCTS, and reinforcement learning for finding failures in systems with sequential decision-making. Instead of optimizing over a fixed disturbance vector, planning methods build the disturbance sequence step by step.

The bigger picture:

Falsification finds individual failures. But knowing one failure exists does not tell you how likely it is. Chapters 6–7 address failure distribution and probability estimation — moving from "does a failure exist?" to "how often does it occur?"

"Testing shows the presence, not the absence of bugs."
— Edsger Dijkstra
Check: What is the key advantage of optimization-based falsification over random sampling?