Finding failures by searching intelligently, not waiting for them to happen.
Imagine you have built an autonomous landing system for a drone. You have run 10,000 simulations and it landed safely every time. Is it safe? You hope so. But you do not know so. The failures might be hiding in the 10,001st scenario — the one you did not think to test.
Falsification is the art of actively searching for failures instead of passively waiting for them to appear. Think of it as red-teaming for autonomous systems. You are not trying to prove the system is safe — you are trying to prove it is not. If you fail to find a failure despite an exhaustive, intelligent search, that is much stronger evidence of safety than any number of random tests.
Why not just run random simulations? Because failures in well-designed systems are rare events. If the failure probability is p = 10−6, you need roughly a million random simulations to expect even one failure. That may be computationally infeasible. Falsification uses intelligent search to find failures in far fewer evaluations.
This chapter covers the progression from naive random testing to sophisticated optimization-based falsification. We start with direct sampling to establish the baseline, introduce the key abstraction of disturbances, then frame falsification as an optimization problem over disturbance sequences. The payoff: finding failures in hours that random testing would take years to discover.
| Method | Failures found per simulation budget | Diversity of failures |
|---|---|---|
| Random sampling | Low (proportional to pfail) | High |
| Fuzzing | Medium | Medium |
| Gradient optimization | High | Low (finds one mode) |
| Population methods | High | High (finds multiple modes) |
The simplest approach: sample scenarios at random and simulate each one. If a failure occurs, you have found a counterexample. If not, run more simulations. How many do you need?
If the true failure probability is p, each simulation is an independent Bernoulli trial. The number of simulations until the first failure follows a geometric distribution:
The expected number of trials until the first failure is:
If p = 10−3, you need about 1,000 simulations on average. If p = 10−6, you need a million. If p = 10−9 (the target for aviation-grade safety), you need a billion. At one simulation per second, that is 31 years.
Each dot is a simulation. Green = pass, red = failure. Set the failure probability p and click Run to draw samples until a failure occurs. Watch how the number of required trials scales as 1/p.
Direct sampling has one advantage: it is unbiased. The failures you find are representative of the true failure distribution. This is useful for estimating failure probability (Chapter 7). But for finding failures quickly, we need something smarter.
The key to intelligent falsification is the disturbance decomposition. Most simulators have randomness: sensor noise, uncertain road conditions, random pedestrian behavior. Instead of letting the simulator sample this randomness internally, we externalize it.
Any stochastic simulation can be decomposed into a deterministic simulator plus a sequence of disturbance variables:
where τ is the trajectory, x0 is the initial state, and d1, …, dT are the disturbances at each time step. Given fixed disturbances, the trajectory is completely determined. The randomness has been pushed into the choice of d = (d1, …, dT).
A drone trying to land on a pad. The disturbance is wind at each time step. Drag the wind slider to set a constant disturbance and see how the trajectory changes. The original random simulation would sample wind from N(0, σ2).
The disturbance decomposition is more than a conceptual trick — it transforms the problem. Instead of sampling random scenarios and hoping for a failure, we can now search the disturbance space for disturbance sequences that cause failures. The failure region is a subset of disturbance space, and we can use optimization to find it.
| Before decomposition | After decomposition |
|---|---|
| Stochastic simulator | Deterministic simulator + disturbance input |
| Run and observe | Choose disturbances and observe |
| Passive testing | Active search |
| Sample from p(d) | Optimize over d |
With the disturbance decomposition, every point in the disturbance space d = (d1, …, dT) maps to exactly one trajectory. The disturbance space is typically high-dimensional — if the system has T time steps and n disturbance variables per step, the space is nT-dimensional.
The trajectory τ(d) is a deterministic function of d. The metric m(τ) (e.g., minimum distance to obstacle) is a deterministic function of the trajectory. And the specification check φ(τ) (e.g., "distance always ≥ 2m") is a deterministic function of the metric. So the entire pipeline is:
The failure region is the set of disturbances where ρ(d) < 0. Falsification is the problem of finding any point in this region. If the failure region has non-zero volume in disturbance space, optimization methods can find it.
Worked example: suppose the drone landing system has 20 time steps and one disturbance (wind) per step. The disturbance space is R20. A point d = (0.1, −0.3, 0.8, …) specifies the wind at each time step. The simulator produces a trajectory (the drone's position over time). The robustness ρ measures the minimum clearance minus the threshold. If ρ < 0, the drone crashed — we found a failure.
| Dimension | Meaning |
|---|---|
| d1 | Wind at t = 1 |
| d2 | Wind at t = 2 |
| … | … |
| d20 | Wind at t = 20 |
Before reaching for optimization, there is a middle ground between random sampling and gradient descent: fuzzing. The idea is simple — instead of sampling disturbances from the nominal distribution p(d), sample from a stress distribution that inflates the noise variance.
If the nominal disturbance distribution is N(0, σ2), fuzzing samples from N(0, kσ2) where k > 1 is the stress factor. Larger k means more extreme disturbances, which makes failures more likely to occur. This is the autonomous-systems equivalent of stress testing in engineering.
Fuzzing is easy to implement — just multiply the noise variance by k — and it dramatically increases the failure rate. If the original failure probability under nominal noise is 10−6, inflating the noise by k = 3 might push it to 10−2, requiring only 100 simulations instead of a million.
Left: nominal noise (σ). Right: stressed noise (kσ). Watch how increasing the stress factor k reveals failures (red dots) that are invisible under nominal conditions. Adjust k and click Run.
The limitation of fuzzing is that it is still undirected. It increases the probability of failures globally, but it does not steer toward them. If the failure region is a thin sliver in a high-dimensional space, inflating the noise may still miss it. For targeted failure search, we need optimization.
Here is the central idea of this chapter. We have a deterministic function ρ(d) that maps disturbance sequences to robustness values. Positive ρ means safe, negative means failure. Falsification is finding a d with ρ(d) < 0. This is a minimization problem:
If the minimum value is negative, we have found a failure. If the minimum is positive (or we cannot push it below zero), the system may be safe — but we cannot be sure (optimization can get stuck in local minima).
The objective function ρ(d) is typically:
• Expensive to evaluate — each evaluation requires running a full simulation.
• Non-convex — the robustness landscape has multiple local minima (multiple failure modes).
• High-dimensional — the disturbance space can be hundreds or thousands of dimensions.
• Possibly non-differentiable — the min/max in robustness create sharp corners (but smooth robustness helps).
These properties rule out simple gradient descent on its own and motivate the use of global optimization methods, derivative-free methods, or smooth approximations of the robustness function.
Using STL robustness (from Chapter 3) as the objective function has a beautiful property: it provides a continuous, graded signal about how close the system is to failure. Unlike a Boolean pass/fail check, robustness tells the optimizer "you are getting warmer" or "you are getting colder."
Consider the spec □(distance ≥ 2m). If the minimum distance in a trajectory is 5m, the robustness is 3 (safely satisfied). If it is 2.1m, the robustness is 0.1 (barely satisfied). If it is 1.8m, the robustness is −0.2 (violated). The optimizer can follow the gradient from 3 down to 0.1 down to −0.2, smoothly descending into the failure region.
For composed specifications, the robustness decomposes naturally. Recall from Chapter 3:
So the robustness of a conjunction is determined by the weakest component. The optimizer naturally focuses on whichever sub-specification is closest to violation. This aligns perfectly with the engineering intuition: the system's safety margin is determined by its weakest link.
For optimization, we often use the smooth robustness from Chapter 3, replacing min with softmin:
This provides well-defined gradients everywhere, enabling gradient-based optimization. The parameter β controls the approximation quality — larger β is closer to the true min but has sharper gradients.
Minimizing robustness alone finds any failure — including unrealistic ones. If we crank the wind disturbance to 100 m/s, of course the drone crashes. But a 100 m/s gust has essentially zero probability. The failure is real but irrelevant.
We want to find plausible failures — failures that could actually occur under the nominal disturbance distribution. This requires combining the robustness objective with the likelihood of the disturbance sequence.
The likelihood of disturbance sequence d under the nominal distribution p(d) measures how "realistic" that scenario is. High likelihood = plausible scenario. Low likelihood = extreme scenario. We want to find failures that are as likely as possible:
Equivalently, minimize the negative log-likelihood subject to the failure constraint. In practice, we combine the two objectives into a single weighted sum:
The weight λ controls the tradeoff. Large λ focuses on finding any failure (even unlikely ones). Small λ focuses on plausible scenarios (even if they are safe). In between, we find the most likely failure.
If the disturbances are Gaussian d ~ N(0, Σ), then −log p(d) = (1/2) dTΣ−1d + const. This is just the Mahalanobis distance from the origin. So minimizing −log p(d) while requiring ρ(d) < 0 finds the failure that is closest to the origin in Mahalanobis distance — the failure that requires the least extreme disturbances.
The 2D plane shows disturbance space. The red region is where the system fails (ρ < 0). Contour lines show the disturbance likelihood (higher near the origin). The most likely failure is the failure point closest to the origin. Adjust λ to see the tradeoff.
We have framed falsification as minimizing ρ(d) (or the combined robustness-likelihood objective). Which optimization algorithm should we use? The answer depends on the properties of the problem and what we want from the search.
Gradient Descent
Update: d ← d − η∇dρ(d). Fast convergence to a local minimum. Requires differentiable robustness (use smooth robustness). Finds one failure mode efficiently but misses others.
Population Methods (CMA-ES)
Maintain a population of candidate disturbance sequences. Each generation: sample from a distribution, evaluate, update the distribution toward low-robustness regions. Finds diverse failure modes but needs more evaluations.
CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is the workhorse of falsification. It maintains a multivariate Gaussian distribution over disturbance space. Each generation: (1) sample λ candidates from the Gaussian, (2) evaluate robustness for each, (3) select the best μ candidates, (4) update the mean and covariance of the Gaussian to concentrate around the best candidates. The covariance matrix adapts to the local geometry of the objective, stretching the search distribution along favorable directions.
2D disturbance space. Contour colors show robustness: red = failure, blue = safe. Click Gradient to run gradient descent from a random start. Click Population to run a CMA-ES-style search. Notice: gradient finds one failure; population explores multiple failure regions.
| Method | Evaluations | Diversity | Gradient needed? |
|---|---|---|---|
| Gradient descent | Low (10s–100s) | Low (one mode) | Yes |
| CMA-ES | Medium (100s–1000s) | High | No |
| Bayesian optimization | Low (10s–100s) | Medium | No |
| Cross-entropy | Medium | High | No |
This chapter transformed the vague goal of "finding failures" into a precise optimization problem. The key steps: externalize randomness into disturbances, define robustness as the objective, and apply optimization to find failures efficiently.
| Concept | Role in falsification |
|---|---|
| Disturbance decomposition | Transforms stochastic sim into deterministic function of d |
| Robustness ρ(d) | Continuous objective: positive = safe, negative = failure |
| Smooth robustness | Differentiable approximation for gradient-based search |
| Most likely failure | Combines ρ(d) with log p(d) to find plausible failures |
| Gradient descent | Fast, finds one failure mode |
| CMA-ES | Slower, finds diverse failure modes |
| Fuzzing | Simple middle ground: inflate noise variance |
What comes next:
Chapter 5 extends falsification from optimization to planning — tree search, MCTS, and reinforcement learning for finding failures in systems with sequential decision-making. Instead of optimizing over a fixed disturbance vector, planning methods build the disturbance sequence step by step.
The bigger picture:
Falsification finds individual failures. But knowing one failure exists does not tell you how likely it is. Chapters 6–7 address failure distribution and probability estimation — moving from "does a failure exist?" to "how often does it occur?"