Kochenderfer et al., Chapter 6

Failure Distribution

From finding one failure to mapping the entire landscape — sampling the distribution over failure trajectories.

Prerequisites: Chapters 4–5 (Falsification). That's it.
10
Chapters
4+
Simulations
10
Quizzes

Chapter 0: Why Distributions?

Chapters 4 and 5 found individual failures. That answered the question: "Can this system fail?" But knowing a system can fail is just the beginning. An autonomous vehicle that fails once in a billion simulated miles and one that fails once in a thousand miles are both "fallible" — but the difference between them is life and death.

To understand the full picture, we need answers to harder questions. Which failures are most likely? Are there multiple distinct failure modes? What do typical failures look like — are they sudden catastrophes or gradual drifts? Answering these requires not a single failure trajectory, but a distribution over failure trajectories.

The shift: Falsification finds one needle in a haystack. This chapter maps the entire haystack — or at least the region where needles are concentrated. We want samples from the conditional distribution: "given that the system fails, what did the failure look like?"

Formally, let τ be a trajectory (sequence of states and disturbances) and let p(τ) be its probability under the nominal disturbance distribution. Let "failure" be the event that the trajectory violates some safety property. We want to sample from:

p(τ | failure) = p(τ) · 1{failure(τ)} / p(failure)

This is the failure distribution. The numerator is easy: it's the trajectory probability times an indicator of whether failure occurred. The denominator p(failure) is the total probability of failure — the one number the whole book is trying to estimate. We don't know it. This makes direct sampling seem impossible.

Ch 4–5: Falsification
"Does a failure exist?" → Find one counterexample
↓ go deeper
Ch 6: Failure Distribution
"What do failures look like?" → Sample the conditional distribution
↓ then estimate
Ch 7: Failure Probability
"How likely is failure?" → Estimate p(failure) via importance sampling

The beautiful trick is that we don't need to know p(failure) to sample from the failure distribution. Markov chain Monte Carlo (MCMC) methods can sample from unnormalized distributions — and the failure distribution has a perfectly usable unnormalized form. The normalizing constant falls out for free.

This chapter builds up the tools one piece at a time. We start with the unnormalized trick, try (and fail with) rejection sampling, then build to MCMC. Along the way, we'll discover why sharp failure boundaries cause problems and how smoothing fixes them.

Check: What does the failure distribution p(τ | failure) tell us that a single counterexample cannot?

Chapter 1: The Unnormalized Trick

Here is the key mathematical observation that makes everything in this chapter possible. We want to sample from p(τ | failure), but we don't know the normalizing constant p(failure). No problem. Define the unnormalized failure density:

p̄(τ | failure) = 1{failure(τ)} · p(τ)

This is just the trajectory probability p(τ) multiplied by the failure indicator. For non-failure trajectories, it's zero. For failure trajectories, it equals p(τ). The key identity:

p(τ | failure) = p̄(τ | failure) / Z    where   Z = ∫ p̄(τ | failure) dτ = p(failure)

The normalizing constant Z is literally p(failure) — the total probability mass assigned to failure trajectories. We don't know Z, but here is the beautiful part: MCMC doesn't need Z. Metropolis-Hastings only uses ratios of the target density, and Z cancels in the ratio:

p̄(τ' | fail) / p̄(τ | fail) = 1{fail(τ')} · p(τ') / (1{fail(τ)} · p(τ))

As long as both τ and τ' are failure trajectories, this ratio is simply p(τ') / p(τ) — the likelihood ratio, which we can always compute.

Key insight: We get the failure distribution "for free" from the unnormalized density. And as a bonus, if we ever figure out how many MCMC samples we drew vs. how much probability mass they represent, we can back out p(failure) itself. Chapter 7 exploits this connection explicitly through importance sampling.

Let's ground this with a simple example. Suppose the system has a 1D disturbance x ~ N(0, 1), and failure occurs when |x| > 2. The failure distribution is:

p(x | failure) ∝ 1{|x| > 2} · exp(−x2/2) / √(2π)

This is a standard normal truncated to the tails beyond ±2. The unnormalized version is just the normal density times the indicator — two bumps, one on each side of zero, decaying outward. We don't need to know how much probability is in the tails to sample from this shape.

Notation clarification: p̄ (with a bar) means "unnormalized." It is proportional to the true density but doesn't integrate to 1. This is standard in MCMC literature. When we write "sample from p̄," we mean "use MCMC with target proportional to p̄, which yields samples from the normalized version."
Check: Why can MCMC sample from the failure distribution without knowing p(failure)?

Chapter 2: Rejection Sampling

Before reaching for MCMC, let's try the simplest possible approach: rejection sampling. The idea is charmingly straightforward. Sample trajectories from the nominal distribution p(τ), and keep only those that fail. Throw away the rest. The survivors are exact samples from p(τ | failure).

Think of throwing darts at a dartboard. The board is the full trajectory space. The bullseye is the failure region. If you throw darts uniformly (well, proportional to p(τ)), the ones that hit the bullseye are distributed exactly as p(τ | failure). Simple, exact, elegant.

Rejection sampling is exact. Unlike MCMC, there is no burn-in period, no autocorrelation, no convergence diagnostics needed. Every accepted sample is an independent, exact draw from the target distribution. The catch is efficiency.

More generally, rejection sampling works with any proposal distribution q(τ) that covers the support of the target. We need a constant c such that c · q(τ) ≥ p̄(τ | failure) everywhere. The algorithm:

1. Propose
Sample τ ~ q(τ)
2. Accept/Reject
Accept with probability p̄(τ | fail) / (c · q(τ))
↻ repeat until accepted

The acceptance rate is 1/c. If q = p (the nominal distribution) and the target is p(τ | failure), then c = 1/p(failure). The acceptance rate equals the failure probability itself.

Rejection Sampling: Dartboard

Orange dots are proposals from N(0,1). Red zone is the failure region (|x| > threshold). Green dots are accepted samples. Watch the acceptance rate. Lower the threshold to see how quickly rejection becomes infeasible.

Proposed: 0 | Accepted: 0
Failure threshold |x| >2.0

When the failure threshold is at 2.0, about 4.6% of standard normal samples exceed it — so roughly 1 in 22 proposals is accepted. Manageable. But push the threshold to 3.0 (0.27% acceptance) or 4.0 (0.006% acceptance) and you'll be waiting a long time. For rare failures, rejection sampling is hopelessly inefficient.

Check: What determines the acceptance rate of rejection sampling when using the nominal distribution as the proposal?

Chapter 3: When Rejection Fails

Rejection sampling breaks down for the same reason that naive Monte Carlo estimation of rare events breaks down: the acceptance rate is proportional to the failure probability. If failures happen once in a million trajectories, we need roughly a million proposals to get one accepted sample.

But the real killer is dimensionality. Consider an inverted pendulum with 100 timesteps of 3D disturbances. The trajectory space is 300-dimensional. Even if the marginal failure probability isn't astronomically small, the constant c that bounds the density ratio grows exponentially with dimension.

The curse of rare events: For the inverted pendulum example in the textbook, the rejection constant is c ≈ 1.22 × 1014. This means you need about 122 trillion proposals to get one accepted sample. At 1 million proposals per second, that's about 4 years per sample. We need a fundamentally better approach.

Why is c so large? Because the proposal q(τ) spreads its probability mass over the entire trajectory space — including the vast majority that doesn't lead to failure. The failure-conditional distribution concentrates its mass in a tiny region. The ratio p̄(τ | failure) / q(τ) spikes enormously at failure trajectories, forcing c to be enormous to bound it.

ProblemDimensionsp(failure)cSamples per accept
1D threshold14.6%22~22
10D threshold1010−3103~1,000
Inv. pendulum300~10−141.22 × 1014~1014
Key insight: Rejection sampling needs to "cover" the entire target density from above. In high dimensions, no simple proposal can tightly envelope a thin failure region. The waste is exponential. We need methods that can walk toward high-probability regions of the failure distribution, rather than sampling everywhere and hoping to land there. This is exactly what MCMC provides.

There is one more subtle problem. Even if we could magically improve the acceptance rate, rejection sampling produces independent samples. That sounds good, but it means we cannot use information from previous samples to guide future ones. MCMC trades independence for adaptivity: each sample informs the next, creating a chain that efficiently explores the target distribution.

Check: Why does the rejection constant c grow exponentially with trajectory dimension?

Chapter 4: MCMC Basics

Rejection sampling tries to generate independent samples and mostly fails. Markov chain Monte Carlo (MCMC) takes a different philosophy: generate correlated samples that form a chain, where each sample is a small perturbation of the previous one. The chain is designed so that its long-run behavior matches the target distribution.

Think of a drunk person wandering around a city. If the bars are in the failure region and the wanderer has a slight tendency to walk toward bars, then over time, the wanderer's position traces out the distribution of bar locations. Individual positions are correlated (each step is near the previous), but the collection of all positions visited gives us a map of where bars are.

The MCMC deal: We give up independence in exchange for any-time efficiency. Rejection sampling might produce zero samples in a billion tries. MCMC produces a sample at every step — it's just that consecutive samples are correlated. The chain must "burn in" (reach the target region) and then produces useful samples every k-th step (after "thinning" to reduce correlation).

A Markov chain is defined by a transition kernel g(τ' | τ) that says: given the current sample τ, where does the chain go next? The chain has the correct stationary distribution if running it long enough produces samples from p(τ | failure), regardless of where it started.

The sufficient condition for correct stationary distribution is detailed balance:

p(τ | fail) · T(τ' | τ) = p(τ' | fail) · T(τ | τ')

where T is the overall transition kernel (proposal + accept/reject). This says: the flow of probability from τ to τ' equals the flow from τ' to τ. If this holds for all pairs, the distribution is in equilibrium.

MCMC Random Walk

A 1D Markov chain targeting a bimodal distribution (two bumps). The green dot is the current position. Watch the trace plot (history of positions) below. The chain spends more time in high-density regions.

Steps: 0 | Accepted: 0
Step size σ0.5

Key concepts to watch for in the simulation:

ConceptWhat it meansWhat to look for
Burn-inInitial samples before chain reaches target regionEarly samples far from bumps
MixingHow well chain explores all modesDoes it visit both bumps?
ThinningTaking every k-th sample to reduce correlationConsecutive samples are similar
Acceptance rateFraction of proposals acceptedToo low = stuck; too high = tiny steps
Goldilocks step size: If σ is too small, almost every proposal is accepted (acceptance rate ≈ 100%), but the chain barely moves — it takes forever to explore. If σ is too large, proposals land in low-density regions and are rejected (acceptance rate ≈ 0%) — the chain stays stuck. The sweet spot is an acceptance rate around 20–50% for typical problems. Try different σ values in the simulation above.
Check: What happens when the MCMC step size is too small?

Chapter 5: Metropolis-Hastings

How do we build an MCMC chain with the right stationary distribution? The Metropolis-Hastings algorithm gives a general recipe. Start with any proposal kernel g(τ' | τ), and add an accept/reject step that corrects for the proposal's bias.

Given the current sample τ, propose τ' ~ g(τ' | τ). Accept with probability:

α(τ, τ') = min(1,   p̄(τ' | fail) · g(τ | τ')  /  p̄(τ | fail) · g(τ' | τ) )

Let's unpack this. The ratio p̄(τ') / p̄(τ) measures whether the proposal is in a higher-density region of the target. The ratio g(τ | τ') / g(τ' | τ) corrects for asymmetry in the proposal. If the proposal is more likely to jump from τ to τ' than vice versa, we need to accept less often to compensate.

Key insight: The normalizing constant Z cancels in the ratio p̄(τ') / p̄(τ). This is why MH can sample from unnormalized distributions. We never need to evaluate p(τ | failure) directly — only the unnormalized version p̄.

For a symmetric proposal (like a Gaussian random walk where g(τ' | τ) = g(τ | τ')), the correction term cancels, and the acceptance probability simplifies to:

α(τ, τ') = min(1,   p̄(τ' | fail) / p̄(τ | fail) )

With a symmetric random walk proposal, this becomes: accept if the proposed trajectory is more probable than the current one. If the proposal is less probable, accept with a probability equal to the probability ratio. This means we mostly move "uphill" toward high-density regions but occasionally accept "downhill" moves, which prevents the chain from getting permanently stuck at a single mode.

Worked example: Current trajectory τ has p(τ) = 0.01 and is a failure. Proposed τ' has p(τ') = 0.03 and is also a failure. Acceptance: min(1, 0.03/0.01) = min(1, 3) = 1. Always accept — the proposal is more likely. Now flip it: propose from p = 0.03 to p = 0.01. Acceptance: min(1, 0.01/0.03) = 0.33. Accept with 33% probability. The chain visits the higher-probability trajectory 3× more often, exactly as the target distribution demands.

What if the proposed trajectory is not a failure? Then 1{fail(τ')} = 0, so p̄(τ' | fail) = 0, and the acceptance probability is zero. The chain never leaves the failure region once it enters. This is correct — we're sampling the conditional distribution given failure.

pseudocode
# Metropolis-Hastings for failure distribution
Input: initial failure trajectory τ0, proposal kernel g, num_samples
chain = [τ0]

for i = 1 to num_samples:
  τ' ~ g(τ' | chain[i-1])
  if is_failure(τ'):
    α = min(1, p(τ') / p(chain[i-1]))  # symmetric g
  else:
    α = 0  # never accept non-failure
  if random() < α:
    chain.append(τ')  # accept
  else:
    chain.append(chain[i-1])  # reject: repeat current

return chain[burn_in:]  # discard burn-in
Check: In MH for the failure distribution, what happens when the proposed trajectory is not a failure?

Chapter 6: The Smoothing Trick

There is a nasty problem with the MH algorithm as described. The failure indicator 1{fail(τ)} is a hard boundary. A trajectory either fails or it doesn't — there is no middle ground. If the current trajectory is barely a failure (robustness ρ = −0.001) and the proposal perturbs it slightly into non-failure territory (ρ = +0.001), the proposal is instantly rejected. The chain is trapped on one side of the failure boundary.

This is catastrophic when there are multiple failure modes separated by non-failure regions. The chain enters one failure mode and can never cross the safe region to reach the other. It's like being on an island — the MCMC walker can explore every beach on this island, but it can never swim to the next island because the ocean (non-failure region) has zero density.

The island problem: The failure distribution might have two or more "islands" of failure in trajectory space, separated by a sea of safe trajectories. Standard MH with a hard failure indicator can never cross between islands. The chain is permanently trapped in whichever failure mode it starts in, giving a biased sample that ignores all other modes.

The fix is elegant: replace the hard indicator with a smooth approximation. Instead of 1{fail(τ)}, use a Gaussian kernel centered at the failure boundary:

1{fail(τ)}  →  N(Δ(τ) | 0, ε2)  =  exp(−Δ(τ)2 / 2ε2) / √(2πε2)

Here Δ(τ) is the robustness — positive means safe, negative means failure, zero is the boundary. The Gaussian kernel gives high weight to trajectories near the failure boundary (|Δ| small) and decaying weight to trajectories farther away on either side. Crucially, it assigns nonzero weight to safe trajectories near the boundary. This lets the chain cross through safe regions to reach other failure modes.

The parameter ε controls the smoothing width. Small ε approximates the hard indicator closely (good fidelity, poor mixing). Large ε smooths aggressively (poor fidelity, good mixing). The practical recipe: start with large ε, let the chain explore all modes, then gradually decrease ε toward zero to sharpen the distribution.

MCMC with Smoothing: Mode Crossing

Two failure modes (red regions) separated by a safe zone. The green dot is the MCMC walker. Without smoothing (toggle off), the chain stays trapped in one mode. With smoothing, it can cross between modes. The trace plot shows the chain's position over time.

Steps: 0
Smoothing ε0.50

The effect is dramatic. Set ε = 0 (no smoothing) and run 200 steps — the chain stays entirely in one mode. Now set ε = 0.5 and reset — the chain crosses between modes freely. The trace plot tells the story: without smoothing, a flat line in one mode; with smoothing, the trace jumps between modes.

Why this works: The smoothed density has nonzero value everywhere near the failure boundary. The "ocean" between failure islands becomes shallow water instead of an impassable void. The chain can wade across. As ε → 0, the smoothed distribution converges to the true failure distribution, so asymptotically the samples are correct.

Small ε (< 0.1):

• Close to true failure dist.
• Sharp boundary = poor mixing
• Chain may get trapped
• Good for single-mode problems

Large ε (> 0.5):

• Smooth boundary = good mixing
• Includes near-failure trajectories
• Cross between modes easily
• Decrease ε after exploration

Check: Why does the smoothing trick enable MCMC to cross between separated failure modes?

Chapter 7: Multimodal Challenges

The smoothing trick enables mode crossing, but multimodal failure distributions remain one of the hardest problems in computational validation. Let's understand why, and what strategies help beyond smoothing.

Consider an autonomous vehicle. One failure mode might be "rear-ended by a fast-following vehicle." Another might be "sudden pedestrian crossing." A third might be "sensor occlusion in heavy rain." These failure modes occupy completely different regions of the disturbance space. The failure distribution has three (or more) well-separated peaks.

The multimodal challenge: Even with smoothing, a single MCMC chain may take exponentially long to find all modes. If the modes are separated by wide safe regions, the chain must make many consecutive "uphill" moves through low-density territory to cross from one mode to another. The crossing time grows exponentially with the distance between modes.

Several strategies address this:

StrategyIdeaTrade-off
Multiple chainsRun many chains from different starting pointsEmbarrassingly parallel but may miss modes
TemperingRun chains at different smoothing levels, swap between themGuaranteed mixing but expensive
Adaptive εAnneal ε from large (exploration) to small (precision)Simple but sequence matters
Seed from searchUse Ch 5 methods to find modes, then MCMC within eachDepends on search finding all modes

Parallel tempering is the most principled approach. Run K chains in parallel, each targeting the smoothed distribution with a different ε. The chain with the largest ε moves freely across the whole space. Periodically, propose swaps between adjacent chains. If the high-ε chain has discovered a new mode, the swap propagates this discovery down to the low-ε chains. The Metropolis criterion for swaps ensures detailed balance is maintained.

Key insight: Tempering creates a "ladder" from the easy-to-explore (large ε) distribution to the hard-to-explore (small ε) target. Information flows up the ladder (new modes) and down the ladder (precise samples). Each chain individually has poor mixing, but the ensemble explores efficiently.

In practice, the most common approach for validation is the simplest: run the falsification methods from Chapter 5 (MCTS, RL adversary) to find diverse failure modes, then initialize an MCMC chain in each mode and explore locally. This hybrid approach uses each tool for what it does best — search algorithms for discovery, MCMC for characterization.

How many modes are there? This is fundamentally unknowable with certainty. You can never prove you've found all failure modes. What you can do is run multiple search algorithms with different random seeds, monitor the number of distinct clusters in your failure samples, and check whether adding more computation keeps discovering new modes. If the rate of new mode discovery has plateaued, you have reasonable (but not guaranteed) coverage.
Check: What is the most principled approach to exploring all modes of a multimodal failure distribution?

Chapter 8: Gradient-Based MCMC

The random walk MH algorithm proposes steps in random directions. In high dimensions, most random directions are "sideways" — they neither increase nor decrease the target density. The chain shuffles along the level set instead of moving efficiently toward high-density regions.

If we can compute the gradient of the log-density, we can bias proposals toward high-density directions. This is the idea behind gradient-based MCMC.

The Metropolis-Adjusted Langevin Algorithm (MALA) uses a proposal that follows the gradient:

τ' = τ + (ε2/2) ∇ log p̄(τ) + ε · z     z ~ N(0, I)

The first term is a gradient step: the proposal drifts toward higher density. The second term is random noise for exploration. The step size ε balances the two. Because the proposal is no longer symmetric (it's biased toward the gradient direction), MH must correct with the full acceptance ratio including the proposal asymmetry.

Key insight: MALA's gradient bias is like giving the drunk wanderer a compass. Instead of stumbling randomly, each step has a preferred direction — toward higher probability. In 300-dimensional trajectory space, this is transformative. A random walk needs O(d2) steps to traverse a distribution; MALA needs O(d1/3).

Hamiltonian Monte Carlo (HMC) goes further. Instead of one gradient step, it simulates a physical system: the current position τ gets an initial momentum, then follows Hamiltonian dynamics (gradient + momentum) for L steps. The result is a proposal that can be far from the current point while maintaining high acceptance probability. HMC explores high-dimensional distributions dramatically faster than random walk MH.

Random Walk MH
Proposes τ' = τ + ε · z. Blind to density shape. O(d2) scaling.
↓ add gradient
MALA
Proposes τ' = τ + grad step + noise. Follows density slope. O(d1/3) scaling.
↓ add momentum
HMC / NUTS
Simulates Hamiltonian dynamics. Long-range moves with high acceptance. O(d1/4) scaling.

The No-U-Turn Sampler (NUTS) automates HMC's tuning parameters. It simulates the trajectory forward and backward in time, stopping when the trajectory starts to double back on itself (the "U-turn"). NUTS is the default sampler in modern probabilistic programming systems (Stan, Turing.jl, NumPyro) and is the current gold standard for gradient-based MCMC.

The gradient requirement: MALA and HMC need ∇ log p̄(τ). For the failure distribution, this requires differentiating through the simulator to compute ∇ p(τ) and differentiating the robustness metric to compute ∇ Δ(τ). This is only possible with differentiable simulators. If your simulator is a black box, you're limited to random walk MH.
MethodProposals/stepScaling with dRequires
Random Walk MH1 randomO(d2)Density evaluation
MALA1 gradient-biasedO(d1/3)+ gradient
HMCL leapfrog stepsO(d1/4)+ gradient
NUTSAdaptive LO(d1/4)+ gradient + auto-tuning
Check: Why does MALA scale better than random walk MH in high dimensions?

Chapter 9: Probabilistic Programming

Everything in this chapter — MCMC, smoothing, gradients — can be implemented by hand. But doing so for complex systems is tedious and error-prone. Probabilistic programming languages (PPLs) automate the heavy lifting. You write the model, and the PPL handles the inference.

The key idea: encode the validation problem as a probabilistic program. The disturbances are random variables. The simulator is a deterministic function of those variables. The failure condition is an observation. The PPL's inference engine (typically NUTS) samples from the posterior — which is exactly the failure distribution.

The @addlogprob! trick: In Turing.jl (and similar PPLs), you can directly add to the log-probability of the current sample. To implement the smoothed failure indicator, add −Δ(τ)2 / (2ε2) to the log-probability. The PPL's sampler will automatically produce samples that concentrate near the failure boundary. No manual MH implementation needed.
julia
using Turing

@model function failure_distribution(simulator, ε)
    # Prior: disturbances from nominal distribution
    x ~ MvNormal(zeros(300), I)

    # Run simulator (deterministic given x)
    τ = simulator(x)

    # Compute robustness
    Δ = robustness(τ)

    # Smoothed failure indicator via @addlogprob!
    Turing.@addlogprob!(-Δ^2 / (2 * ε^2))
end

# Sample from failure distribution
model = failure_distribution(my_simulator, 0.1)
chain = sample(model, NUTS(), 1000)

The elegance is striking. The entire failure distribution sampling problem — disturbance model, simulator, robustness metric, smoothing, MCMC inference — fits in about 10 lines of code. The PPL handles all the machinery: gradient computation (via automatic differentiation), NUTS tuning, burn-in, diagnostics, and sample storage.

Key insight: Probabilistic programming reduces the problem from "implement an MCMC sampler for your specific system" to "describe your system as a generative model and let the PPL do inference." This is a massive reduction in engineering effort and debugging risk. The PPL has been tested on thousands of models; your hand-coded sampler has been tested on one.

The requirement is that the simulator must be differentiable (for NUTS) or at least evaluable (for random walk MH). Most modern PPLs support automatic differentiation through arbitrary Julia/Python code, including array operations, conditionals, and loops. If your simulator is written in a supported language, it "just works."

Strengths of PPL approach:

• Minimal implementation effort
• Automatic gradient computation
• Built-in diagnostics (R-hat, ESS)
• Swap samplers easily (NUTS, HMC, MH)

Limitations:

• Requires differentiable simulator
• Overhead of PPL machinery
• Black-box simulators need wrappers
• Multimodality still hard

What comes next: Sampling the failure distribution tells us what failures look like. Chapter 7 asks the remaining question: how likely is failure? This requires estimating p(failure) — the normalizing constant Z that we carefully avoided throughout this chapter. Importance sampling, adaptive methods, and sequential Monte Carlo provide the answer.

"The purpose of computing is insight, not numbers."
— Richard Hamming
Check: What does the @addlogprob! trick accomplish in the probabilistic programming formulation?