Estimating how often a system fails — especially when failures are astronomically rare.
You've built a self-driving car. Your team has run 100,000 test scenarios, and the car crashed in 3 of them. That gives a rough estimate: pfail ≈ 3/100,000 = 3 × 10−5. But is that good enough? Regulators want to know if pfail < 10−9 — one failure in a billion drives. How many tests would you need to confidently verify that?
The answer is staggering. If the true failure probability is 10−9, you'd need roughly 109 simulation runs just to see a single failure. And to get a tight confidence interval, you'd need many more. At one second per simulation, that's 30 years of continuous computation. We need a smarter approach.
Chapter 6 taught us how to characterize what failures look like — their distribution in the disturbance space. This chapter asks a different question: how often do they happen? We want a single number, pfail, with a confidence interval tight enough to certify the system.
We'll build up from the simplest estimator (count failures and divide) through Bayesian estimation, then the full arsenal of variance reduction techniques. Each method makes the same fundamental trade: inject clever bias into the sampling, then correct for it mathematically.
The simplest possible approach: run m simulations, count n failures, report the fraction. This is the direct Monte Carlo estimator:
where n is the number of failure trajectories and m is the total number of trajectories sampled. Each trajectory τi is drawn from the nominal distribution p(τ) — the distribution that describes how the system actually operates in the real world. We check whether each trajectory triggers a failure event, and count.
This estimator is unbiased: its expected value equals the true pfail. But how precise is it? The variance tells us. Each sample is a Bernoulli trial (fail or not), so:
The standard deviation is σ = √(p(1−p)/m). For a 95% confidence interval, we need roughly p̂ ± 2σ. Let's work a concrete example.
Now try pfail = 10−6 with m = 10,000. Expected failures: 0.01. We almost certainly observe zero failures, giving p̂ = 0. The estimator is technically unbiased, but with overwhelming probability it returns zero. We'd need m ≈ 108 to reliably observe even one failure.
| pfail | m needed (10% relative error) | Time at 1 sim/sec |
|---|---|---|
| 10−2 | 10,000 | 2.8 hours |
| 10−4 | 1,000,000 | 12 days |
| 10−6 | 108 | 3.2 years |
| 10−9 | 1011 | 3,170 years |
The coefficient of variation (CV) measures relative precision: CV = σ/pfail = √((1−p)/mp). For small p, this simplifies to CV ≈ 1/√(mp). As p shrinks, the CV explodes unless m grows to compensate.
Drag the slider to change the true failure probability. Watch how the required sample count explodes as pfail decreases. The dashed line marks 10% relative error.
The direct estimator treats pfail as a fixed unknown. The Bayesian approach treats it as a random variable with a prior distribution, then updates that distribution after observing data. This gives us something the frequentist approach cannot: a full posterior distribution over pfail, complete with credible intervals.
The natural choice for the prior is the Beta distribution, because it is the conjugate prior for Bernoulli observations. If our prior is Beta(α, β) and we observe n failures in m trials, the posterior is:
That's it. No MCMC, no approximation. The update is exact. The prior parameters α and β encode our initial belief: α is like "pseudo-failures" and β is like "pseudo-successes" we've already seen. A flat prior is Beta(1,1), which says every value of pfail between 0 and 1 is equally likely.
Let's trace a worked example. Start with a uniform prior: Beta(1,1). Run 50 simulations, observe 2 failures.
| Quantity | Prior | Posterior |
|---|---|---|
| Distribution | Beta(1, 1) | Beta(3, 49) |
| Mean | 0.500 | 3/52 = 0.058 |
| Mode | undefined (flat) | 2/50 = 0.040 |
| 95% credible interval | [0.025, 0.975] | [0.013, 0.126] |
The posterior mean (0.058) is pulled slightly above the maximum likelihood estimate (2/50 = 0.04) because the prior puts some weight on higher values. With more data, the prior's influence fades. After 500 trials with 20 failures, the posterior Beta(21, 481) has mean 0.042, almost indistinguishable from the MLE.
Each click of Flip simulates one test. Watch the Beta posterior sharpen as data accumulates. Adjust the true pfail with the slider. The prior is Beta(1,1) (uniform).
Both the direct and Bayesian approaches share a fundamental limitation: they rely on actually observing failures. When pfail = 10−9, you might run a billion simulations and see zero or one failure. The posterior is barely distinguishable from the prior. No amount of Bayesian cleverness can extract information from data that doesn't exist.
This is the rare event problem, and it is the central challenge of failure probability estimation. The issue is not computational cost per se — it's that the information rate is near zero. Almost every simulation returns "system OK," contributing nothing to our understanding of failure.
Why not just make the system fail more often in simulation? We could increase noise levels, degrade sensors, or inject adversarial disturbances. The problem is that these artificial failures may not represent real failure modes. We might find failures that would never occur under normal conditions, or miss the real failure modes entirely.
What we need is a principled way to sample more failures while still estimating the probability under the original distribution. We want to change how we sample, not what we're estimating. The mathematical framework for this is importance sampling.
Here's the key identity. The failure probability is an expectation under p:
We can rewrite this by multiplying and dividing by any positive density q(τ):
where w(τ) = p(τ)/q(τ) is the importance weight. We've converted the problem from "sample from p and count" to "sample from q and take a weighted average." If q concentrates samples in the failure region, we see many failures, each downweighted by how much q oversampled that region.
Importance sampling (IS) turns the identity from Chapter 3 into a practical estimator. Draw m samples τ1, …, τm from the proposal distribution q(τ). For each sample, compute the importance weight and check for failure:
This estimator is unbiased for any q that has support everywhere p does (meaning q(τ) > 0 whenever p(τ) > 0). The magic is in the choice of q. A good proposal concentrates samples in the failure region, giving us many failure observations, each properly downweighted.
The variance of the IS estimator is:
A bad choice of q can make this variance worse than direct sampling. If q puts too little weight in the failure region, the rare failures that do appear get enormous weights, creating high-variance spikes. The variance can even be infinite if q has lighter tails than p.
The nominal distribution p (blue) is N(0,1). Failure is τ > 3. Drag the proposal mean to shift q (orange). Watch how sample weights (bars below) change. A good q produces many small-weighted failures. A bad q produces few heavy-weighted failures.
If importance sampling works with any q, some q must be better than others. Is there an optimal proposal that minimizes variance? Yes — and the answer reveals a beautiful circular dependency.
The variance of the IS estimator is zero when every sample produces the same weighted value. That happens when 1fail(τ) · p(τ)/q(τ) is constant across all τ sampled from q. Working through the math, the zero-variance proposal is:
Read this carefully. The optimal proposal q* is the nominal distribution p, restricted to the failure region, and normalized. In other words, q* is the failure distribution itself — the conditional distribution of trajectories given that they fail.
Let's verify the zero-variance claim. Under q*, every failure sample has weight:
Every sample gives exactly pfail. No variance. A single sample suffices. Of course, this requires knowing pfail, which we don't. But the insight is powerful: the closer q is to the failure distribution, the lower the variance.
| Proposal quality | Variance behavior |
|---|---|
| q = p (no change) | Same as direct MC — no benefit |
| q concentrates near failure boundary | Many failures with moderate weights — low variance |
| q = q* (failure distribution) | Zero variance — but unknowable |
| q too far from failure region | Rare huge weights — variance worse than direct MC |
A single proposal distribution may miss important failure regions. If failures can occur through multiple mechanisms — sensor noise, actuator failure, adversarial inputs — one proposal centered on one mechanism ignores the others. Multiple importance sampling (MIS) uses several proposals simultaneously.
Suppose we have K proposal distributions q1, …, qK, and we draw mk samples from each. For a sample τi drawn from qk, how should we compute its weight?
The simplest approach is standard MIS (s-MIS): weight each sample by p/qk, the proposal it came from.
This works, but has a subtle problem. If sample τ happens to lie in a region where qk has low density but another proposal qj has high density, the weight p/qk is huge — even though the sample is "explained" by qj. The variance spikes unnecessarily.
Let's trace an example. Two proposals: q1 = N(3, 0.5) targets one failure mode, q2 = N(5, 0.5) targets another. Draw 50 samples from each. Under s-MIS, a sample at τ = 4 drawn from q1 gets weight p(4)/q1(4), which is moderate. But the same sample drawn from q2 gets a different weight p(4)/q2(4). Under DM-MIS, both get the same weight p(4)/q̄(4), because the denominator accounts for both proposals.
| Method | Weight formula | Pros | Cons |
|---|---|---|---|
| s-MIS | p(τ) / qk(τ) | Simple, unbiased | High variance from proposal mismatch |
| DM-MIS | p(τ) / q̄(τ) | Lower variance, near-optimal | Must evaluate all K densities per sample |
How do we find a good proposal distribution without knowing q*? The cross-entropy (CE) method iteratively refines the proposal by fitting it to the "best" samples from each round. It's an elegant dance: sample, select elites, refit, repeat.
The algorithm works with a parametric proposal family qθ (e.g., a Gaussian with mean μ and covariance Σ). Start with θ0 set to the nominal distribution. Then iterate:
The key trick is threshold relaxation. Instead of immediately targeting the failure region, CE uses a sequence of increasingly strict thresholds. In round 1, the "elites" might be the worst 10% of trajectories (even though they don't actually fail). The proposal shifts toward them. In round 2, the new samples are worse on average, so the elite threshold tightens. Eventually the elites are actual failures, and the proposal concentrates in the failure region.
For a Gaussian proposal q = N(μ, Σ), the update is explicit:
where E is the elite set. After convergence, the final proposal qθ* is used for one final round of importance sampling to produce the pfail estimate with proper IS weights.
Cross entropy uses a single proposal (or a single parametric family). What if the failure region is multimodal — failures occur in several disconnected "pockets" of the trajectory space? A single Gaussian can't cover them all without also covering vast non-failure regions, degrading performance.
Population Monte Carlo (PMC) maintains a population of K proposals, each with its own parameters. After each round, every proposal adapts independently based on its own samples. Proposals that happen to sit near failure modes sharpen and stay. Proposals in barren regions migrate toward productive areas.
The PMC algorithm:
pseudocode Initialize K proposals: q1(0), ..., qK(0) for t = 1, 2, ... for k = 1 to K: Draw mk samples from qk(t-1) Compute DM-MIS weights: wi = p(τi) / q̄(t-1)(τi) Pool all samples and weights for k = 1 to K: Resample or refit qk(t) from weighted samples Estimate: p̂ = (1/m) ∑ 1fail(τi) wi
The adaptation step can take many forms. One common approach: each proposal qk refits to the weighted samples nearest to it (a form of weighted EM). Another: randomly resample proposals from the current population, with probability proportional to the total weight of their samples.
PMC naturally handles multimodality because different proposals can converge to different modes. It also provides a valid IS estimate at each iteration (using DM-MIS weights), so you can stop early if the estimate has converged.
| Feature | CE Method | PMC |
|---|---|---|
| Number of proposals | 1 | K (a population) |
| Multimodal failures | May miss modes | Naturally covers multiple modes |
| Adaptation | Elite-based MLE | Weighted resampling/refitting |
| Weighting | p/q at final round | DM-MIS at every round |
Cross entropy relaxes the threshold. PMC migrates proposals. Sequential Monte Carlo (SMC) takes a different approach: build a sequence of intermediate distributions that smoothly bridge the gap between the nominal distribution p and the failure distribution p|fail.
Define a sequence of distributions π0, π1, …, πT where π0 = p (easy to sample) and πT concentrates on the failure region. A natural choice is to temper the failure indicator:
where φ(τ) measures "closeness to failure" and 0 = β0 < β1 < … < βT. At β = 0, we have π0 = p. As β increases, πt concentrates on trajectories with high φ — those near failure.
The SMC algorithm moves a population of weighted particles through this sequence:
The failure probability estimate combines normalizing constant ratios:
Each factor is the average incremental weight at level t — the fraction of "mass" that survives to the next level. The product telescopes to give the ratio of normalizing constants ZT/Z0, which equals pfail.
Importance sampling works forward: sample from q, reweight to p. But what if both p and q are easy to evaluate but hard to normalize? Bridge sampling exploits samples from both distributions to estimate the ratio of their normalizing constants — which is exactly pfail when one distribution is p and the other is p|fail.
Let p0 and p1 be two unnormalized densities with normalizing constants Z0 and Z1. We want r = Z1/Z0. Bridge sampling introduces an arbitrary bridge function h(τ) and uses the identity:
Given samples from both distributions, we estimate each expectation by a sample average. The ratio gives r̂. The optimal bridge function (minimizing variance) is:
This depends on the unknowns Z0, Z1. In practice, we solve iteratively: start with a guess for r, compute h*, re-estimate r, repeat until convergence.
For failure probability estimation, the setup is: p0 = the nominal distribution p(τ), Z0 = 1 (normalized). p1 = 1fail(τ) · p(τ), Z1 = pfail. We need samples from p (easy) and from the failure distribution (obtained via Chapter 6 methods like MCMC). Bridge sampling then estimates r = pfail/1 = pfail.
Every method so far estimates pfail as a single quantity. Multilevel splitting takes a radically different approach: decompose pfail into a product of conditional probabilities, each of which is large enough to estimate directly.
Define a sequence of nested events: F0 ⊃ F1 ⊃ … ⊃ FL = F (the failure event). Each Fl is a relaxed version of failure — say, the system cost exceeds threshold γl. Then by the chain rule of probability:
The magic: each conditional probability P(Fl | Fl−1) can be chosen to be moderately large (say 0.1 to 0.5) by adjusting the thresholds. Even if pfail = 10−10, we can write it as 0.110 — ten factors of 0.1, each easily estimable with a few hundred samples.
The algorithm:
pseudocode # Multilevel splitting Initialize N samples from p(τ) Set thresholds γ0 < γ1 < ... < γL = failure threshold for l = 1 to L: # Count how many samples exceed γl nl = |{i : S(τi) ≥ γl}| p̂l = nl / N # Split: duplicate survivors, discard the rest Keep the nl survivors Resample N − nl copies from survivors # Diversify via MCMC targeting p(τ | S(τ) ≥ γl) Apply MCMC transitions to all N samples p̂fail = ∏l=1L p̂l
With adaptive thresholds, we don't predefine the γl. Instead, at each level, we set γl to the empirical quantile of the current sample — say the (1−ρ)-quantile, so a fixed fraction ρ survives. Then every conditional probability is exactly ρ, and pfail ≈ ρL, where L is the number of levels needed to reach the actual failure threshold.
Watch samples (dots) get progressively filtered through threshold levels. At each level, the surviving fraction is displayed. The product of all fractions gives pfail. Adjust the number of levels and survival fraction.
| Method | Core idea | Best for |
|---|---|---|
| Direct MC | Count failures | pfail > 10−3 |
| Importance sampling | Reweight from better proposal | Known failure modes, single mode |
| Cross entropy | Iteratively fit proposal to elites | Unknown failure region, parametric proposal |
| PMC | Population of migrating proposals | Multimodal failures |
| SMC | Sequence of bridging distributions | Complex failure boundaries |
| Multilevel splitting | Product of conditional probabilities | Extremely rare events (p < 10−8) |