From finding one failure to mapping the entire landscape — sampling the distribution over failure trajectories.
Chapters 4 and 5 found individual failures. That answered the question: "Can this system fail?" But knowing a system can fail is just the beginning. An autonomous vehicle that fails once in a billion simulated miles and one that fails once in a thousand miles are both "fallible" — but the difference between them is life and death.
To understand the full picture, we need answers to harder questions. Which failures are most likely? Are there multiple distinct failure modes? What do typical failures look like — are they sudden catastrophes or gradual drifts? Answering these requires not a single failure trajectory, but a distribution over failure trajectories.
Formally, let τ be a trajectory (sequence of states and disturbances) and let p(τ) be its probability under the nominal disturbance distribution. Let "failure" be the event that the trajectory violates some safety property. We want to sample from:
This is the failure distribution. The numerator is easy: it's the trajectory probability times an indicator of whether failure occurred. The denominator p(failure) is the total probability of failure — the one number the whole book is trying to estimate. We don't know it. This makes direct sampling seem impossible.
The beautiful trick is that we don't need to know p(failure) to sample from the failure distribution. Markov chain Monte Carlo (MCMC) methods can sample from unnormalized distributions — and the failure distribution has a perfectly usable unnormalized form. The normalizing constant falls out for free.
This chapter builds up the tools one piece at a time. We start with the unnormalized trick, try (and fail with) rejection sampling, then build to MCMC. Along the way, we'll discover why sharp failure boundaries cause problems and how smoothing fixes them.
Here is the key mathematical observation that makes everything in this chapter possible. We want to sample from p(τ | failure), but we don't know the normalizing constant p(failure). No problem. Define the unnormalized failure density:
This is just the trajectory probability p(τ) multiplied by the failure indicator. For non-failure trajectories, it's zero. For failure trajectories, it equals p(τ). The key identity:
The normalizing constant Z is literally p(failure) — the total probability mass assigned to failure trajectories. We don't know Z, but here is the beautiful part: MCMC doesn't need Z. Metropolis-Hastings only uses ratios of the target density, and Z cancels in the ratio:
As long as both τ and τ' are failure trajectories, this ratio is simply p(τ') / p(τ) — the likelihood ratio, which we can always compute.
Let's ground this with a simple example. Suppose the system has a 1D disturbance x ~ N(0, 1), and failure occurs when |x| > 2. The failure distribution is:
This is a standard normal truncated to the tails beyond ±2. The unnormalized version is just the normal density times the indicator — two bumps, one on each side of zero, decaying outward. We don't need to know how much probability is in the tails to sample from this shape.
Before reaching for MCMC, let's try the simplest possible approach: rejection sampling. The idea is charmingly straightforward. Sample trajectories from the nominal distribution p(τ), and keep only those that fail. Throw away the rest. The survivors are exact samples from p(τ | failure).
Think of throwing darts at a dartboard. The board is the full trajectory space. The bullseye is the failure region. If you throw darts uniformly (well, proportional to p(τ)), the ones that hit the bullseye are distributed exactly as p(τ | failure). Simple, exact, elegant.
More generally, rejection sampling works with any proposal distribution q(τ) that covers the support of the target. We need a constant c such that c · q(τ) ≥ p̄(τ | failure) everywhere. The algorithm:
The acceptance rate is 1/c. If q = p (the nominal distribution) and the target is p(τ | failure), then c = 1/p(failure). The acceptance rate equals the failure probability itself.
Orange dots are proposals from N(0,1). Red zone is the failure region (|x| > threshold). Green dots are accepted samples. Watch the acceptance rate. Lower the threshold to see how quickly rejection becomes infeasible.
When the failure threshold is at 2.0, about 4.6% of standard normal samples exceed it — so roughly 1 in 22 proposals is accepted. Manageable. But push the threshold to 3.0 (0.27% acceptance) or 4.0 (0.006% acceptance) and you'll be waiting a long time. For rare failures, rejection sampling is hopelessly inefficient.
Rejection sampling breaks down for the same reason that naive Monte Carlo estimation of rare events breaks down: the acceptance rate is proportional to the failure probability. If failures happen once in a million trajectories, we need roughly a million proposals to get one accepted sample.
But the real killer is dimensionality. Consider an inverted pendulum with 100 timesteps of 3D disturbances. The trajectory space is 300-dimensional. Even if the marginal failure probability isn't astronomically small, the constant c that bounds the density ratio grows exponentially with dimension.
Why is c so large? Because the proposal q(τ) spreads its probability mass over the entire trajectory space — including the vast majority that doesn't lead to failure. The failure-conditional distribution concentrates its mass in a tiny region. The ratio p̄(τ | failure) / q(τ) spikes enormously at failure trajectories, forcing c to be enormous to bound it.
| Problem | Dimensions | p(failure) | c | Samples per accept |
|---|---|---|---|---|
| 1D threshold | 1 | 4.6% | 22 | ~22 |
| 10D threshold | 10 | 10−3 | 103 | ~1,000 |
| Inv. pendulum | 300 | ~10−14 | 1.22 × 1014 | ~1014 |
There is one more subtle problem. Even if we could magically improve the acceptance rate, rejection sampling produces independent samples. That sounds good, but it means we cannot use information from previous samples to guide future ones. MCMC trades independence for adaptivity: each sample informs the next, creating a chain that efficiently explores the target distribution.
Rejection sampling tries to generate independent samples and mostly fails. Markov chain Monte Carlo (MCMC) takes a different philosophy: generate correlated samples that form a chain, where each sample is a small perturbation of the previous one. The chain is designed so that its long-run behavior matches the target distribution.
Think of a drunk person wandering around a city. If the bars are in the failure region and the wanderer has a slight tendency to walk toward bars, then over time, the wanderer's position traces out the distribution of bar locations. Individual positions are correlated (each step is near the previous), but the collection of all positions visited gives us a map of where bars are.
A Markov chain is defined by a transition kernel g(τ' | τ) that says: given the current sample τ, where does the chain go next? The chain has the correct stationary distribution if running it long enough produces samples from p(τ | failure), regardless of where it started.
The sufficient condition for correct stationary distribution is detailed balance:
where T is the overall transition kernel (proposal + accept/reject). This says: the flow of probability from τ to τ' equals the flow from τ' to τ. If this holds for all pairs, the distribution is in equilibrium.
A 1D Markov chain targeting a bimodal distribution (two bumps). The green dot is the current position. Watch the trace plot (history of positions) below. The chain spends more time in high-density regions.
Key concepts to watch for in the simulation:
| Concept | What it means | What to look for |
|---|---|---|
| Burn-in | Initial samples before chain reaches target region | Early samples far from bumps |
| Mixing | How well chain explores all modes | Does it visit both bumps? |
| Thinning | Taking every k-th sample to reduce correlation | Consecutive samples are similar |
| Acceptance rate | Fraction of proposals accepted | Too low = stuck; too high = tiny steps |
How do we build an MCMC chain with the right stationary distribution? The Metropolis-Hastings algorithm gives a general recipe. Start with any proposal kernel g(τ' | τ), and add an accept/reject step that corrects for the proposal's bias.
Given the current sample τ, propose τ' ~ g(τ' | τ). Accept with probability:
Let's unpack this. The ratio p̄(τ') / p̄(τ) measures whether the proposal is in a higher-density region of the target. The ratio g(τ | τ') / g(τ' | τ) corrects for asymmetry in the proposal. If the proposal is more likely to jump from τ to τ' than vice versa, we need to accept less often to compensate.
For a symmetric proposal (like a Gaussian random walk where g(τ' | τ) = g(τ | τ')), the correction term cancels, and the acceptance probability simplifies to:
With a symmetric random walk proposal, this becomes: accept if the proposed trajectory is more probable than the current one. If the proposal is less probable, accept with a probability equal to the probability ratio. This means we mostly move "uphill" toward high-density regions but occasionally accept "downhill" moves, which prevents the chain from getting permanently stuck at a single mode.
What if the proposed trajectory is not a failure? Then 1{fail(τ')} = 0, so p̄(τ' | fail) = 0, and the acceptance probability is zero. The chain never leaves the failure region once it enters. This is correct — we're sampling the conditional distribution given failure.
pseudocode # Metropolis-Hastings for failure distribution Input: initial failure trajectory τ0, proposal kernel g, num_samples chain = [τ0] for i = 1 to num_samples: τ' ~ g(τ' | chain[i-1]) if is_failure(τ'): α = min(1, p(τ') / p(chain[i-1])) # symmetric g else: α = 0 # never accept non-failure if random() < α: chain.append(τ') # accept else: chain.append(chain[i-1]) # reject: repeat current return chain[burn_in:] # discard burn-in
There is a nasty problem with the MH algorithm as described. The failure indicator 1{fail(τ)} is a hard boundary. A trajectory either fails or it doesn't — there is no middle ground. If the current trajectory is barely a failure (robustness ρ = −0.001) and the proposal perturbs it slightly into non-failure territory (ρ = +0.001), the proposal is instantly rejected. The chain is trapped on one side of the failure boundary.
This is catastrophic when there are multiple failure modes separated by non-failure regions. The chain enters one failure mode and can never cross the safe region to reach the other. It's like being on an island — the MCMC walker can explore every beach on this island, but it can never swim to the next island because the ocean (non-failure region) has zero density.
The fix is elegant: replace the hard indicator with a smooth approximation. Instead of 1{fail(τ)}, use a Gaussian kernel centered at the failure boundary:
Here Δ(τ) is the robustness — positive means safe, negative means failure, zero is the boundary. The Gaussian kernel gives high weight to trajectories near the failure boundary (|Δ| small) and decaying weight to trajectories farther away on either side. Crucially, it assigns nonzero weight to safe trajectories near the boundary. This lets the chain cross through safe regions to reach other failure modes.
The parameter ε controls the smoothing width. Small ε approximates the hard indicator closely (good fidelity, poor mixing). Large ε smooths aggressively (poor fidelity, good mixing). The practical recipe: start with large ε, let the chain explore all modes, then gradually decrease ε toward zero to sharpen the distribution.
Two failure modes (red regions) separated by a safe zone. The green dot is the MCMC walker. Without smoothing (toggle off), the chain stays trapped in one mode. With smoothing, it can cross between modes. The trace plot shows the chain's position over time.
The effect is dramatic. Set ε = 0 (no smoothing) and run 200 steps — the chain stays entirely in one mode. Now set ε = 0.5 and reset — the chain crosses between modes freely. The trace plot tells the story: without smoothing, a flat line in one mode; with smoothing, the trace jumps between modes.
Small ε (< 0.1):
• Close to true failure dist.
• Sharp boundary = poor mixing
• Chain may get trapped
• Good for single-mode problems
Large ε (> 0.5):
• Smooth boundary = good mixing
• Includes near-failure trajectories
• Cross between modes easily
• Decrease ε after exploration
The smoothing trick enables mode crossing, but multimodal failure distributions remain one of the hardest problems in computational validation. Let's understand why, and what strategies help beyond smoothing.
Consider an autonomous vehicle. One failure mode might be "rear-ended by a fast-following vehicle." Another might be "sudden pedestrian crossing." A third might be "sensor occlusion in heavy rain." These failure modes occupy completely different regions of the disturbance space. The failure distribution has three (or more) well-separated peaks.
Several strategies address this:
| Strategy | Idea | Trade-off |
|---|---|---|
| Multiple chains | Run many chains from different starting points | Embarrassingly parallel but may miss modes |
| Tempering | Run chains at different smoothing levels, swap between them | Guaranteed mixing but expensive |
| Adaptive ε | Anneal ε from large (exploration) to small (precision) | Simple but sequence matters |
| Seed from search | Use Ch 5 methods to find modes, then MCMC within each | Depends on search finding all modes |
Parallel tempering is the most principled approach. Run K chains in parallel, each targeting the smoothed distribution with a different ε. The chain with the largest ε moves freely across the whole space. Periodically, propose swaps between adjacent chains. If the high-ε chain has discovered a new mode, the swap propagates this discovery down to the low-ε chains. The Metropolis criterion for swaps ensures detailed balance is maintained.
In practice, the most common approach for validation is the simplest: run the falsification methods from Chapter 5 (MCTS, RL adversary) to find diverse failure modes, then initialize an MCMC chain in each mode and explore locally. This hybrid approach uses each tool for what it does best — search algorithms for discovery, MCMC for characterization.
The random walk MH algorithm proposes steps in random directions. In high dimensions, most random directions are "sideways" — they neither increase nor decrease the target density. The chain shuffles along the level set instead of moving efficiently toward high-density regions.
If we can compute the gradient of the log-density, we can bias proposals toward high-density directions. This is the idea behind gradient-based MCMC.
The Metropolis-Adjusted Langevin Algorithm (MALA) uses a proposal that follows the gradient:
The first term is a gradient step: the proposal drifts toward higher density. The second term is random noise for exploration. The step size ε balances the two. Because the proposal is no longer symmetric (it's biased toward the gradient direction), MH must correct with the full acceptance ratio including the proposal asymmetry.
Hamiltonian Monte Carlo (HMC) goes further. Instead of one gradient step, it simulates a physical system: the current position τ gets an initial momentum, then follows Hamiltonian dynamics (gradient + momentum) for L steps. The result is a proposal that can be far from the current point while maintaining high acceptance probability. HMC explores high-dimensional distributions dramatically faster than random walk MH.
The No-U-Turn Sampler (NUTS) automates HMC's tuning parameters. It simulates the trajectory forward and backward in time, stopping when the trajectory starts to double back on itself (the "U-turn"). NUTS is the default sampler in modern probabilistic programming systems (Stan, Turing.jl, NumPyro) and is the current gold standard for gradient-based MCMC.
| Method | Proposals/step | Scaling with d | Requires |
|---|---|---|---|
| Random Walk MH | 1 random | O(d2) | Density evaluation |
| MALA | 1 gradient-biased | O(d1/3) | + gradient |
| HMC | L leapfrog steps | O(d1/4) | + gradient |
| NUTS | Adaptive L | O(d1/4) | + gradient + auto-tuning |
Everything in this chapter — MCMC, smoothing, gradients — can be implemented by hand. But doing so for complex systems is tedious and error-prone. Probabilistic programming languages (PPLs) automate the heavy lifting. You write the model, and the PPL handles the inference.
The key idea: encode the validation problem as a probabilistic program. The disturbances are random variables. The simulator is a deterministic function of those variables. The failure condition is an observation. The PPL's inference engine (typically NUTS) samples from the posterior — which is exactly the failure distribution.
julia using Turing @model function failure_distribution(simulator, ε) # Prior: disturbances from nominal distribution x ~ MvNormal(zeros(300), I) # Run simulator (deterministic given x) τ = simulator(x) # Compute robustness Δ = robustness(τ) # Smoothed failure indicator via @addlogprob! Turing.@addlogprob!(-Δ^2 / (2 * ε^2)) end # Sample from failure distribution model = failure_distribution(my_simulator, 0.1) chain = sample(model, NUTS(), 1000)
The elegance is striking. The entire failure distribution sampling problem — disturbance model, simulator, robustness metric, smoothing, MCMC inference — fits in about 10 lines of code. The PPL handles all the machinery: gradient computation (via automatic differentiation), NUTS tuning, burn-in, diagnostics, and sample storage.
The requirement is that the simulator must be differentiable (for NUTS) or at least evaluable (for random walk MH). Most modern PPLs support automatic differentiation through arbitrary Julia/Python code, including array operations, conditionals, and loops. If your simulator is written in a supported language, it "just works."
Strengths of PPL approach:
• Minimal implementation effort
• Automatic gradient computation
• Built-in diagnostics (R-hat, ESS)
• Swap samplers easily (NUTS, HMC, MH)
Limitations:
• Requires differentiable simulator
• Overhead of PPL machinery
• Black-box simulators need wrappers
• Multimodality still hard
What comes next: Sampling the failure distribution tells us what failures look like. Chapter 7 asks the remaining question: how likely is failure? This requires estimating p(failure) — the normalizing constant Z that we carefully avoided throughout this chapter. Importance sampling, adaptive methods, and sequential Monte Carlo provide the answer.