Ch 7: Failure Probability — Algorithms for Validation

Chapter 0: Why Estimate Failure Probability?

You've built a self-driving car. Your team has run 100,000 test scenarios, and the car crashed in 3 of them. That gives a rough estimate: p_fail ≈ 3/100,000 = 3 × 10⁻⁵. But is that good enough? Regulators want to know if p_fail < 10⁻⁹ — one failure in a billion drives. How many tests would you need to confidently verify that?

The answer is staggering. If the true failure probability is 10⁻⁹, you'd need roughly 10⁹ simulation runs just to see a single failure. And to get a tight confidence interval, you'd need many more. At one second per simulation, that's 30 years of continuous computation. We need a smarter approach.

The core problem: Directly counting failures is feasible when failures are common (p ≈ 10⁻²). But safety-critical systems need vanishingly small failure rates. This chapter covers the mathematical toolkit for estimating p_fail when failures are too rare to count directly — from importance sampling to multilevel splitting.

Chapter 6 taught us how to characterize what failures look like — their distribution in the disturbance space. This chapter asks a different question: how often do they happen? We want a single number, p_fail, with a confidence interval tight enough to certify the system.

The Question

What is P(system fails) under the operational distribution?

↓ too rare to observe directly

Naive Approach

Run m simulations, count n failures, report n/m

↓ need 10⁹+ samples

Smart Approaches

IS, CE, PMC, SMC, multilevel splitting

We'll build up from the simplest estimator (count failures and divide) through Bayesian estimation, then the full arsenal of variance reduction techniques. Each method makes the same fundamental trade: inject clever bias into the sampling, then correct for it mathematically.

Check: Why can't we simply run 10 million simulations and count failures to verify p_fail < 10⁻⁹?

Because 10 million simulations would take too long to run Because we'd expect zero failures in 10 million samples, giving no information about p_fail Because the system model is too complex to simulate

Chapter 1: Direct Estimation

The simplest possible approach: run m simulations, count n failures, report the fraction. This is the direct Monte Carlo estimator:

p̂_fail = n / m

where n is the number of failure trajectories and m is the total number of trajectories sampled. Each trajectory τ_i is drawn from the nominal distribution p(τ) — the distribution that describes how the system actually operates in the real world. We check whether each trajectory triggers a failure event, and count.

This estimator is unbiased: its expected value equals the true p_fail. But how precise is it? The variance tells us. Each sample is a Bernoulli trial (fail or not), so:

Var(p̂_fail) = p_fail(1 − p_fail) / m

The standard deviation is σ = √(p(1−p)/m). For a 95% confidence interval, we need roughly p̂ ± 2σ. Let's work a concrete example.

Worked example: Suppose p_fail = 10⁻³ and we run m = 10,000 simulations. Expected failures: n ≈ 10. Variance: 10⁻³ × 0.999 / 10,000 ≈ 10⁻⁷. Standard deviation: ≈ 3.16 × 10⁻⁴. So our 95% CI is roughly 10⁻³ ± 6.3 × 10⁻⁴. That's a 63% relative width — not very precise, but at least we see failures.

Now try p_fail = 10⁻⁶ with m = 10,000. Expected failures: 0.01. We almost certainly observe zero failures, giving p̂ = 0. The estimator is technically unbiased, but with overwhelming probability it returns zero. We'd need m ≈ 10⁸ to reliably observe even one failure.

The rule of thumb: To get a relative error of ε on p_fail, you need roughly m ≈ 1/(ε² · p_fail) samples. For p_fail = 10⁻⁹ and 10% relative error, that's m ≈ 10¹¹. This is the rare event problem — direct estimation collapses when failures are rare.

p_fail	m needed (10% relative error)	Time at 1 sim/sec
10⁻²	10,000	2.8 hours
10⁻⁴	1,000,000	12 days
10⁻⁶	10⁸	3.2 years
10⁻⁹	10¹¹	3,170 years

The coefficient of variation (CV) measures relative precision: CV = σ/p_fail = √((1−p)/mp). For small p, this simplifies to CV ≈ 1/√(mp). As p shrinks, the CV explodes unless m grows to compensate.

Direct Estimation: Variance vs. Sample Count

Drag the slider to change the true failure probability. Watch how the required sample count explodes as p_fail decreases. The dashed line marks 10% relative error.

log₁₀(p_fail)10^-3

Check: What is the variance of the direct Monte Carlo estimate p̂ = n/m?

p_fail / m p_fail(1 − p_fail) / m p_fail² / m

Chapter 2: Bayesian Estimation

The direct estimator treats p_fail as a fixed unknown. The Bayesian approach treats it as a random variable with a prior distribution, then updates that distribution after observing data. This gives us something the frequentist approach cannot: a full posterior distribution over p_fail, complete with credible intervals.

The natural choice for the prior is the Beta distribution, because it is the conjugate prior for Bernoulli observations. If our prior is Beta(α, β) and we observe n failures in m trials, the posterior is:

p_fail | data ∼ Beta(α + n, β + m − n)

That's it. No MCMC, no approximation. The update is exact. The prior parameters α and β encode our initial belief: α is like "pseudo-failures" and β is like "pseudo-successes" we've already seen. A flat prior is Beta(1,1), which says every value of p_fail between 0 and 1 is equally likely.

Why Beta? The Beta distribution is defined on [0,1], which is exactly where probabilities live. Its shape is controlled by two parameters. Beta(1,1) is uniform. Beta(2,8) peaks near 0.2. Beta(100,9900) is a tight spike near 0.01. After m trials with n failures, the posterior mean is (α+n)/(α+β+m), which blends the prior guess α/(α+β) with the data n/m, weighted by their respective "sample sizes."

Let's trace a worked example. Start with a uniform prior: Beta(1,1). Run 50 simulations, observe 2 failures.

Quantity	Prior	Posterior
Distribution	Beta(1, 1)	Beta(3, 49)
Mean	0.500	3/52 = 0.058
Mode	undefined (flat)	2/50 = 0.040
95% credible interval	[0.025, 0.975]	[0.013, 0.126]

The posterior mean (0.058) is pulled slightly above the maximum likelihood estimate (2/50 = 0.04) because the prior puts some weight on higher values. With more data, the prior's influence fades. After 500 trials with 20 failures, the posterior Beta(21, 481) has mean 0.042, almost indistinguishable from the MLE.

Bayesian Coin-Flip Updater

Each click of Flip simulates one test. Watch the Beta posterior sharpen as data accumulates. Adjust the true p_fail with the slider. The prior is Beta(1,1) (uniform).

n=0 fails, m=0 trials

True p_fail0.10

Credible vs. confidence intervals: A 95% Bayesian credible interval says "there is a 95% probability that p_fail lies in this range, given the data." A 95% frequentist confidence interval says something more convoluted: "if we repeated the experiment many times, 95% of such intervals would contain the true value." For practical engineering decisions, the Bayesian statement is usually what you want.

Check: After observing 5 failures in 100 trials with a Beta(1,1) prior, what is the posterior distribution?

Beta(5, 95) Beta(5, 100) Beta(6, 96)

Chapter 3: The Rare Event Problem

Both the direct and Bayesian approaches share a fundamental limitation: they rely on actually observing failures. When p_fail = 10⁻⁹, you might run a billion simulations and see zero or one failure. The posterior is barely distinguishable from the prior. No amount of Bayesian cleverness can extract information from data that doesn't exist.

This is the rare event problem, and it is the central challenge of failure probability estimation. The issue is not computational cost per se — it's that the information rate is near zero. Almost every simulation returns "system OK," contributing nothing to our understanding of failure.

The information desert: Imagine fishing in an ocean where 99.9999999% of the water contains no fish. You could trawl for centuries and catch nothing. The fish exist, but your net sweeps the wrong water. The solution: don't fish randomly. Fish where the fish are, then correct for the biased sampling. That's importance sampling.

Why not just make the system fail more often in simulation? We could increase noise levels, degrade sensors, or inject adversarial disturbances. The problem is that these artificial failures may not represent real failure modes. We might find failures that would never occur under normal conditions, or miss the real failure modes entirely.

What we need is a principled way to sample more failures while still estimating the probability under the original distribution. We want to change how we sample, not what we're estimating. The mathematical framework for this is importance sampling.

Problem

Failures are too rare to observe under p(τ)

↓ can't just crank up noise

Insight

Sample from a different distribution q(τ) that produces more failures

↓ but then n/m estimates p_fail under q, not p

Fix

Reweight each sample by p(τ)/q(τ) to correct the bias

Here's the key identity. The failure probability is an expectation under p:

p_fail = E_p[1_fail(τ)] = ∫ 1_fail(τ) p(τ) dτ

We can rewrite this by multiplying and dividing by any positive density q(τ):

p_fail = ∫ 1_fail(τ) · (p(τ) / q(τ)) · q(τ) dτ = E_q[1_fail(τ) · w(τ)]

where w(τ) = p(τ)/q(τ) is the importance weight. We've converted the problem from "sample from p and count" to "sample from q and take a weighted average." If q concentrates samples in the failure region, we see many failures, each downweighted by how much q oversampled that region.

Check: Why does simply increasing noise levels in simulation NOT solve the rare event problem?

It produces failures that may not represent real failure modes under the true operational distribution It makes simulations run slower It always underestimates the true failure probability

Chapter 4: Importance Sampling

Importance sampling (IS) turns the identity from Chapter 3 into a practical estimator. Draw m samples τ₁, …, τ_m from the proposal distribution q(τ). For each sample, compute the importance weight and check for failure:

p̂_IS = (1/m) ∑_i=1^m 1_fail(τ_i) · w_i, w_i = p(τ_i) / q(τ_i)

This estimator is unbiased for any q that has support everywhere p does (meaning q(τ) > 0 whenever p(τ) > 0). The magic is in the choice of q. A good proposal concentrates samples in the failure region, giving us many failure observations, each properly downweighted.

Worked example: Suppose trajectories live in 1D, failure occurs when τ > 4, and p(τ) = N(0,1). Under p, P(τ > 4) ≈ 3.2 × 10⁻⁵. Let q(τ) = N(4,1), centered on the failure boundary. Drawing from q, about 50% of samples fail. Each failure sample gets weight w = p(τ)/q(τ) = exp(−τ²/2) / exp(−(τ−4)²/2) = exp(−4τ + 8). The weights are small (because q oversamples this region), but the product 1_fail · w averages to the correct p_fail.

The variance of the IS estimator is:

Var(p̂_IS) = (1/m) Var_q[1_fail(τ) · w(τ)]

A bad choice of q can make this variance worse than direct sampling. If q puts too little weight in the failure region, the rare failures that do appear get enormous weights, creating high-variance spikes. The variance can even be infinite if q has lighter tails than p.

Importance Sampling: Weight Visualizer

The nominal distribution p (blue) is N(0,1). Failure is τ > 3. Drag the proposal mean to shift q (orange). Watch how sample weights (bars below) change. A good q produces many small-weighted failures. A bad q produces few heavy-weighted failures.

p̂ = —

Proposal mean μ_q3.0

Self-normalized IS: In practice, we often don't know the normalizing constants of p and q. The self-normalized estimator divides by the sum of weights: p̂_SN = ∑ 1_fail · w_i / ∑ w_i. This introduces a small bias (it's a ratio of two estimates) but removes the need for normalized densities. For large m, the bias is negligible.

Check: In importance sampling, what is the importance weight w_i for sample τ_i?

p(τ_i) / q(τ_i) q(τ_i) / p(τ_i) 1_fail(τ_i) · q(τ_i)

Chapter 5: The Optimal Proposal

If importance sampling works with any q, some q must be better than others. Is there an optimal proposal that minimizes variance? Yes — and the answer reveals a beautiful circular dependency.

The variance of the IS estimator is zero when every sample produces the same weighted value. That happens when 1_fail(τ) · p(τ)/q(τ) is constant across all τ sampled from q. Working through the math, the zero-variance proposal is:

q*(τ) = 1_fail(τ) · p(τ) / p_fail

Read this carefully. The optimal proposal q* is the nominal distribution p, restricted to the failure region, and normalized. In other words, q* is the failure distribution itself — the conditional distribution of trajectories given that they fail.

The circular dependency: To construct q*, we need p_fail (the normalizing constant). But p_fail is exactly what we're trying to estimate! If we knew the answer, we wouldn't need importance sampling. This is not a flaw — it tells us what to aim for. Any proposal that approximates the failure distribution will reduce variance compared to direct sampling.

Let's verify the zero-variance claim. Under q*, every failure sample has weight:

w_i = p(τ_i) / q*(τ_i) = p(τ_i) / (1_fail(τ_i) p(τ_i) / p_fail) = p_fail

Every sample gives exactly p_fail. No variance. A single sample suffices. Of course, this requires knowing p_fail, which we don't. But the insight is powerful: the closer q is to the failure distribution, the lower the variance.

Proposal quality	Variance behavior
q = p (no change)	Same as direct MC — no benefit
q concentrates near failure boundary	Many failures with moderate weights — low variance
q = q* (failure distribution)	Zero variance — but unknowable
q too far from failure region	Rare huge weights — variance worse than direct MC

Practical takeaway: We can never achieve q*, but we can approximate it. The rest of this chapter covers methods that iteratively improve the proposal toward q*. Cross entropy minimizes KL divergence to q*. Population Monte Carlo migrates proposals toward high-weight regions. Sequential Monte Carlo builds a sequence of distributions approaching p|fail.

Check: What is the zero-variance importance sampling proposal q*?

The nominal distribution p shifted to higher variance The nominal distribution restricted to the failure region, normalized A uniform distribution over the entire trajectory space

Chapter 6: Multiple Proposals

A single proposal distribution may miss important failure regions. If failures can occur through multiple mechanisms — sensor noise, actuator failure, adversarial inputs — one proposal centered on one mechanism ignores the others. Multiple importance sampling (MIS) uses several proposals simultaneously.

Suppose we have K proposal distributions q₁, …, q_K, and we draw m_k samples from each. For a sample τ_i drawn from q_k, how should we compute its weight?

The simplest approach is standard MIS (s-MIS): weight each sample by p/q_k, the proposal it came from.

p̂_s-MIS = (1/m) ∑_k=1^K ∑_i=1^m_k 1_fail(τ_ki) · p(τ_ki) / q_k(τ_ki)

This works, but has a subtle problem. If sample τ happens to lie in a region where q_k has low density but another proposal q_j has high density, the weight p/q_k is huge — even though the sample is "explained" by q_j. The variance spikes unnecessarily.

Deterministic mixture MIS (DM-MIS): A better approach uses the mixture density in the denominator. Define the mixture q̄(τ) = (1/K) ∑_k q_k(τ). Then weight each sample by p(τ)/q̄(τ). Now a sample that any proposal would likely generate gets a moderate weight, regardless of which proposal actually produced it. This is called the balance heuristic, and it is provably near-optimal among all weighting schemes for MIS.

w_i^DM = p(τ_i) / q̄(τ_i) = p(τ_i) / ((1/K) ∑_k q_k(τ_i))

Let's trace an example. Two proposals: q₁ = N(3, 0.5) targets one failure mode, q₂ = N(5, 0.5) targets another. Draw 50 samples from each. Under s-MIS, a sample at τ = 4 drawn from q₁ gets weight p(4)/q₁(4), which is moderate. But the same sample drawn from q₂ gets a different weight p(4)/q₂(4). Under DM-MIS, both get the same weight p(4)/q̄(4), because the denominator accounts for both proposals.

Method	Weight formula	Pros	Cons
s-MIS	p(τ) / q_k(τ)	Simple, unbiased	High variance from proposal mismatch
DM-MIS	p(τ) / q̄(τ)	Lower variance, near-optimal	Must evaluate all K densities per sample

When to use MIS: Whenever you suspect multiple distinct failure modes. If all failures cluster in one region, a single well-tuned proposal suffices. But safety-critical systems often fail in qualitatively different ways — and a single Gaussian proposal can't cover them all.

Check: In DM-MIS, why is the mixture density q̄(τ) used in the denominator instead of the individual proposal q_k(τ)?

Because q̄ is easier to compute Because q̄ guarantees unbiasedness while q_k does not Because it prevents large weights when a sample is well-explained by any proposal, reducing variance

Chapter 7: Cross Entropy Method

How do we find a good proposal distribution without knowing q*? The cross-entropy (CE) method iteratively refines the proposal by fitting it to the "best" samples from each round. It's an elegant dance: sample, select elites, refit, repeat.

The algorithm works with a parametric proposal family q_θ (e.g., a Gaussian with mean μ and covariance Σ). Start with θ₀ set to the nominal distribution. Then iterate:

1. Sample

Draw m trajectories from q_{θ_t}

↓

2. Score

Evaluate cost function S(τ) for each sample

↓

3. Select

Keep top-ρ fraction (the "elites") that are closest to failure

↓

4. Refit

Set θ_t+1 by maximum likelihood on the elite samples

↻ repeat until convergence

The key trick is threshold relaxation. Instead of immediately targeting the failure region, CE uses a sequence of increasingly strict thresholds. In round 1, the "elites" might be the worst 10% of trajectories (even though they don't actually fail). The proposal shifts toward them. In round 2, the new samples are worse on average, so the elite threshold tightens. Eventually the elites are actual failures, and the proposal concentrates in the failure region.

Why "cross entropy"? Fitting q_θ to the elites by maximum likelihood is equivalent to minimizing the cross-entropy (or KL divergence) between the elite distribution and q_θ. If q is Gaussian, this reduces to computing the sample mean and covariance of the elites. The name comes from information theory, but the algorithm is just "fit a distribution to your best samples."

For a Gaussian proposal q = N(μ, Σ), the update is explicit:

μ_t+1 = (1/|E|) ∑_{τ ∈ E} τ, Σ_t+1 = (1/|E|) ∑_{τ ∈ E} (τ − μ_t+1)(τ − μ_t+1)^T

where E is the elite set. After convergence, the final proposal q_θ* is used for one final round of importance sampling to produce the p_fail estimate with proper IS weights.

Smoothing trick: In practice, the elite covariance can collapse too quickly, trapping the proposal in a local mode. A common fix is to update with smoothing: θ_t+1 = α θ̂ + (1−α) θ_t, where α ∈ (0,1) and θ̂ is the MLE from the elites. Typical α ≈ 0.7.

Check: In the cross-entropy method, what are the "elite" samples?

The top fraction of samples that are closest to (or in) the failure region The samples with the largest importance weights Samples drawn from the optimal proposal q*

Chapter 8: Population Monte Carlo

Cross entropy uses a single proposal (or a single parametric family). What if the failure region is multimodal — failures occur in several disconnected "pockets" of the trajectory space? A single Gaussian can't cover them all without also covering vast non-failure regions, degrading performance.

Population Monte Carlo (PMC) maintains a population of K proposals, each with its own parameters. After each round, every proposal adapts independently based on its own samples. Proposals that happen to sit near failure modes sharpen and stay. Proposals in barren regions migrate toward productive areas.

Think of it this way: Cross entropy sends one fishing boat. PMC sends a fleet. Each boat casts its net, keeps the good catches, and repositions based on where the fish were biting. Over time, boats cluster around the schools of fish (failure modes) and abandon empty water.

The PMC algorithm:

pseudocode
Initialize K proposals: q₁⁽⁰⁾, ..., q_K⁽⁰⁾
for t = 1, 2, ...
  for k = 1 to K:
    Draw m_k samples from q_k^(t-1)
    Compute DM-MIS weights: w_i = p(τ_i) / q̄^(t-1)(τ_i)
  Pool all samples and weights
  for k = 1 to K:
    Resample or refit q_k^(t) from weighted samples
  Estimate: p̂ = (1/m) ∑ 1_fail(τ_i) w_i

The adaptation step can take many forms. One common approach: each proposal q_k refits to the weighted samples nearest to it (a form of weighted EM). Another: randomly resample proposals from the current population, with probability proportional to the total weight of their samples.

PMC naturally handles multimodality because different proposals can converge to different modes. It also provides a valid IS estimate at each iteration (using DM-MIS weights), so you can stop early if the estimate has converged.

Feature	CE Method	PMC
Number of proposals	1	K (a population)
Multimodal failures	May miss modes	Naturally covers multiple modes
Adaptation	Elite-based MLE	Weighted resampling/refitting
Weighting	p/q at final round	DM-MIS at every round

Check: What advantage does PMC have over the cross-entropy method for multimodal failure distributions?

Multiple proposals can each converge to a different failure mode, covering the failure region more completely PMC requires fewer total samples than CE PMC does not require evaluating the nominal density p

Chapter 9: Sequential Monte Carlo

Cross entropy relaxes the threshold. PMC migrates proposals. Sequential Monte Carlo (SMC) takes a different approach: build a sequence of intermediate distributions that smoothly bridge the gap between the nominal distribution p and the failure distribution p|fail.

Define a sequence of distributions π₀, π₁, …, π_T where π₀ = p (easy to sample) and π_T concentrates on the failure region. A natural choice is to temper the failure indicator:

π_t(τ) ∝ p(τ) · φ(τ)^β_t

where φ(τ) measures "closeness to failure" and 0 = β₀ < β₁ < … < β_T. At β = 0, we have π₀ = p. As β increases, π_t concentrates on trajectories with high φ — those near failure.

The SMC algorithm moves a population of weighted particles through this sequence:

1. Initialize

Sample particles from π₀ = p

↓

2. Reweight

Update weights for next distribution π_t+1

↓

3. Resample

Duplicate high-weight particles, drop low-weight ones

↓

4. Move

Apply MCMC kernel to diversify particles (prevent collapse)

↻ next intermediate distribution

Why MCMC transitions? After resampling, many particles are copies of the same high-weight ancestors. Without diversification, the particle cloud collapses. The MCMC step (typically a few Metropolis-Hastings moves targeting π_t) spreads particles apart while keeping them distributed according to π_t. This is the key innovation of SMC over naive importance sampling through a sequence.

The failure probability estimate combines normalizing constant ratios:

p̂_fail = ∏_t=0^T-1 ((1/N) ∑_i=1^N w_i^(t))

Each factor is the average incremental weight at level t — the fraction of "mass" that survives to the next level. The product telescopes to give the ratio of normalizing constants Z_T/Z₀, which equals p_fail.

Adaptive β selection: Rather than fixing the sequence β₀, …, β_T in advance, a practical SMC algorithm chooses β_t+1 adaptively: find the next β such that the effective sample size (ESS) drops to some target fraction (e.g., 50% of N). This automatically adjusts the "step size" through the distribution sequence.

Check: Why does SMC include an MCMC transition step after resampling?

To diversify particles and prevent collapse after duplicate resampling To correct for the bias introduced by importance weights To speed up the convergence of the β sequence

Chapter 10: Bridge Sampling

Importance sampling works forward: sample from q, reweight to p. But what if both p and q are easy to evaluate but hard to normalize? Bridge sampling exploits samples from both distributions to estimate the ratio of their normalizing constants — which is exactly p_fail when one distribution is p and the other is p|fail.

Let p₀ and p₁ be two unnormalized densities with normalizing constants Z₀ and Z₁. We want r = Z₁/Z₀. Bridge sampling introduces an arbitrary bridge function h(τ) and uses the identity:

r = Z₁ / Z₀ = E₀[h(τ) p₁(τ)] / E₁[h(τ) p₀(τ)]

Given samples from both distributions, we estimate each expectation by a sample average. The ratio gives r̂. The optimal bridge function (minimizing variance) is:

h*(τ) = C / (n₀ p₀(τ) / Z₀ + n₁ p₁(τ) / Z₁)

This depends on the unknowns Z₀, Z₁. In practice, we solve iteratively: start with a guess for r, compute h*, re-estimate r, repeat until convergence.

Bridge sampling vs. IS: Standard IS uses samples from one distribution to estimate an expectation under another. Bridge sampling uses samples from both distributions. When both are available (e.g., you can sample from both the nominal and failure distributions), bridge sampling can be much more efficient, especially when the distributions have little overlap — exactly the regime where standard IS struggles.

For failure probability estimation, the setup is: p₀ = the nominal distribution p(τ), Z₀ = 1 (normalized). p₁ = 1_fail(τ) · p(τ), Z₁ = p_fail. We need samples from p (easy) and from the failure distribution (obtained via Chapter 6 methods like MCMC). Bridge sampling then estimates r = p_fail/1 = p_fail.

Intermediate bridges: When p₀ and p₁ are very different, a single bridge function may not suffice. We can insert intermediate distributions p₀, p_1/T, …, p_(T−1)/T, p₁ and estimate the product r = ∏ r_t. This is the same idea as SMC's sequence of intermediate distributions, applied to normalizing constant estimation.

Check: What unique requirement does bridge sampling have compared to standard importance sampling?

It requires a parametric form for the proposal It requires samples from BOTH distributions (nominal and failure) It requires the failure probability to be above 10⁻⁶

Chapter 11: Multilevel Splitting

Every method so far estimates p_fail as a single quantity. Multilevel splitting takes a radically different approach: decompose p_fail into a product of conditional probabilities, each of which is large enough to estimate directly.

Define a sequence of nested events: F₀ ⊃ F₁ ⊃ … ⊃ F_L = F (the failure event). Each F_l is a relaxed version of failure — say, the system cost exceeds threshold γ_l. Then by the chain rule of probability:

p_fail = P(F_L) = P(F₀) · ∏_l=1^L P(F_l | F_l−1)

The magic: each conditional probability P(F_l | F_l−1) can be chosen to be moderately large (say 0.1 to 0.5) by adjusting the thresholds. Even if p_fail = 10⁻¹⁰, we can write it as 0.1¹⁰ — ten factors of 0.1, each easily estimable with a few hundred samples.

The staircase analogy: Imagine descending a cliff in one leap (direct MC) versus walking down a staircase. Each step is manageable. The total descent is the same. Multilevel splitting builds the staircase: each level is a conditional probability that we can estimate reliably, and the product gives the full failure probability.

The algorithm:

pseudocode
# Multilevel splitting
Initialize N samples from p(τ)
Set thresholds γ₀ < γ₁ < ... < γ_L = failure threshold

for l = 1 to L:
  # Count how many samples exceed γ_l
  n_l = |{i : S(τ_i) ≥ γ_l}|
  p̂_l = n_l / N
  # Split: duplicate survivors, discard the rest
  Keep the n_l survivors
  Resample N − n_l copies from survivors
  # Diversify via MCMC targeting p(τ | S(τ) ≥ γ_l)
  Apply MCMC transitions to all N samples

p̂_fail = ∏_l=1^L p̂_l

With adaptive thresholds, we don't predefine the γ_l. Instead, at each level, we set γ_l to the empirical quantile of the current sample — say the (1−ρ)-quantile, so a fixed fraction ρ survives. Then every conditional probability is exactly ρ, and p_fail ≈ ρ^L, where L is the number of levels needed to reach the actual failure threshold.

Worked example: p_fail = 10⁻⁸. We use ρ = 0.2 (keep top 20% at each level). Then we need L = log(10⁻⁸)/log(0.2) ≈ 11.5, so about 12 levels. At each level, we estimate p̂_l ≈ 0.2. With N = 500 samples, each level's estimate has standard deviation √(0.2 × 0.8 / 500) = 0.018 — a 9% relative error per level. The relative error of the product is about √12 × 9% ≈ 31%. With N = 2000, it drops to 15%. Compare this to the 10¹⁰ samples direct MC would need!

Multilevel Splitting Staircase

Watch samples (dots) get progressively filtered through threshold levels. At each level, the surviving fraction is displayed. The product of all fractions gives p_fail. Adjust the number of levels and survival fraction.

Ready

Levels (L)5

Survival ρ0.20

Method	Core idea	Best for
Direct MC	Count failures	p_fail > 10⁻³
Importance sampling	Reweight from better proposal	Known failure modes, single mode
Cross entropy	Iteratively fit proposal to elites	Unknown failure region, parametric proposal
PMC	Population of migrating proposals	Multimodal failures
SMC	Sequence of bridging distributions	Complex failure boundaries
Multilevel splitting	Product of conditional probabilities	Extremely rare events (p < 10⁻⁸)

Check: In multilevel splitting with survival fraction ρ = 0.2 and 5 levels, what is the estimated p_fail?

0.2 × 5 = 1.0 0.2 / 5 = 0.04 0.2⁵ = 3.2 × 10⁻⁴