Kochenderfer et al., Chapter 7

Failure Probability

Estimating how often a system fails — especially when failures are astronomically rare.

Prerequisites: Chapter 6 (Failure Distributions). That's it.
12
Chapters
5+
Simulations
12
Quizzes

Chapter 0: Why Estimate Failure Probability?

You've built a self-driving car. Your team has run 100,000 test scenarios, and the car crashed in 3 of them. That gives a rough estimate: pfail ≈ 3/100,000 = 3 × 10−5. But is that good enough? Regulators want to know if pfail < 10−9 — one failure in a billion drives. How many tests would you need to confidently verify that?

The answer is staggering. If the true failure probability is 10−9, you'd need roughly 109 simulation runs just to see a single failure. And to get a tight confidence interval, you'd need many more. At one second per simulation, that's 30 years of continuous computation. We need a smarter approach.

The core problem: Directly counting failures is feasible when failures are common (p ≈ 10−2). But safety-critical systems need vanishingly small failure rates. This chapter covers the mathematical toolkit for estimating pfail when failures are too rare to count directly — from importance sampling to multilevel splitting.

Chapter 6 taught us how to characterize what failures look like — their distribution in the disturbance space. This chapter asks a different question: how often do they happen? We want a single number, pfail, with a confidence interval tight enough to certify the system.

The Question
What is P(system fails) under the operational distribution?
↓ too rare to observe directly
Naive Approach
Run m simulations, count n failures, report n/m
↓ need 109+ samples
Smart Approaches
IS, CE, PMC, SMC, multilevel splitting

We'll build up from the simplest estimator (count failures and divide) through Bayesian estimation, then the full arsenal of variance reduction techniques. Each method makes the same fundamental trade: inject clever bias into the sampling, then correct for it mathematically.

Check: Why can't we simply run 10 million simulations and count failures to verify pfail < 10−9?

Chapter 1: Direct Estimation

The simplest possible approach: run m simulations, count n failures, report the fraction. This is the direct Monte Carlo estimator:

fail = n / m

where n is the number of failure trajectories and m is the total number of trajectories sampled. Each trajectory τi is drawn from the nominal distribution p(τ) — the distribution that describes how the system actually operates in the real world. We check whether each trajectory triggers a failure event, and count.

This estimator is unbiased: its expected value equals the true pfail. But how precise is it? The variance tells us. Each sample is a Bernoulli trial (fail or not), so:

Var(p̂fail) = pfail(1 − pfail) / m

The standard deviation is σ = √(p(1−p)/m). For a 95% confidence interval, we need roughly p̂ ± 2σ. Let's work a concrete example.

Worked example: Suppose pfail = 10−3 and we run m = 10,000 simulations. Expected failures: n ≈ 10. Variance: 10−3 × 0.999 / 10,000 ≈ 10−7. Standard deviation: ≈ 3.16 × 10−4. So our 95% CI is roughly 10−3 ± 6.3 × 10−4. That's a 63% relative width — not very precise, but at least we see failures.

Now try pfail = 10−6 with m = 10,000. Expected failures: 0.01. We almost certainly observe zero failures, giving p̂ = 0. The estimator is technically unbiased, but with overwhelming probability it returns zero. We'd need m ≈ 108 to reliably observe even one failure.

The rule of thumb: To get a relative error of ε on pfail, you need roughly m ≈ 1/(ε2 · pfail) samples. For pfail = 10−9 and 10% relative error, that's m ≈ 1011. This is the rare event problem — direct estimation collapses when failures are rare.
pfailm needed (10% relative error)Time at 1 sim/sec
10−210,0002.8 hours
10−41,000,00012 days
10−61083.2 years
10−910113,170 years

The coefficient of variation (CV) measures relative precision: CV = σ/pfail = √((1−p)/mp). For small p, this simplifies to CV ≈ 1/√(mp). As p shrinks, the CV explodes unless m grows to compensate.

Direct Estimation: Variance vs. Sample Count

Drag the slider to change the true failure probability. Watch how the required sample count explodes as pfail decreases. The dashed line marks 10% relative error.

log10(pfail)10^-3
Check: What is the variance of the direct Monte Carlo estimate p̂ = n/m?

Chapter 2: Bayesian Estimation

The direct estimator treats pfail as a fixed unknown. The Bayesian approach treats it as a random variable with a prior distribution, then updates that distribution after observing data. This gives us something the frequentist approach cannot: a full posterior distribution over pfail, complete with credible intervals.

The natural choice for the prior is the Beta distribution, because it is the conjugate prior for Bernoulli observations. If our prior is Beta(α, β) and we observe n failures in m trials, the posterior is:

pfail | data ∼ Beta(α + n, β + m − n)

That's it. No MCMC, no approximation. The update is exact. The prior parameters α and β encode our initial belief: α is like "pseudo-failures" and β is like "pseudo-successes" we've already seen. A flat prior is Beta(1,1), which says every value of pfail between 0 and 1 is equally likely.

Why Beta? The Beta distribution is defined on [0,1], which is exactly where probabilities live. Its shape is controlled by two parameters. Beta(1,1) is uniform. Beta(2,8) peaks near 0.2. Beta(100,9900) is a tight spike near 0.01. After m trials with n failures, the posterior mean is (α+n)/(α+β+m), which blends the prior guess α/(α+β) with the data n/m, weighted by their respective "sample sizes."

Let's trace a worked example. Start with a uniform prior: Beta(1,1). Run 50 simulations, observe 2 failures.

QuantityPriorPosterior
DistributionBeta(1, 1)Beta(3, 49)
Mean0.5003/52 = 0.058
Modeundefined (flat)2/50 = 0.040
95% credible interval[0.025, 0.975][0.013, 0.126]

The posterior mean (0.058) is pulled slightly above the maximum likelihood estimate (2/50 = 0.04) because the prior puts some weight on higher values. With more data, the prior's influence fades. After 500 trials with 20 failures, the posterior Beta(21, 481) has mean 0.042, almost indistinguishable from the MLE.

Bayesian Coin-Flip Updater

Each click of Flip simulates one test. Watch the Beta posterior sharpen as data accumulates. Adjust the true pfail with the slider. The prior is Beta(1,1) (uniform).

n=0 fails, m=0 trials
True pfail0.10
Credible vs. confidence intervals: A 95% Bayesian credible interval says "there is a 95% probability that pfail lies in this range, given the data." A 95% frequentist confidence interval says something more convoluted: "if we repeated the experiment many times, 95% of such intervals would contain the true value." For practical engineering decisions, the Bayesian statement is usually what you want.
Check: After observing 5 failures in 100 trials with a Beta(1,1) prior, what is the posterior distribution?

Chapter 3: The Rare Event Problem

Both the direct and Bayesian approaches share a fundamental limitation: they rely on actually observing failures. When pfail = 10−9, you might run a billion simulations and see zero or one failure. The posterior is barely distinguishable from the prior. No amount of Bayesian cleverness can extract information from data that doesn't exist.

This is the rare event problem, and it is the central challenge of failure probability estimation. The issue is not computational cost per se — it's that the information rate is near zero. Almost every simulation returns "system OK," contributing nothing to our understanding of failure.

The information desert: Imagine fishing in an ocean where 99.9999999% of the water contains no fish. You could trawl for centuries and catch nothing. The fish exist, but your net sweeps the wrong water. The solution: don't fish randomly. Fish where the fish are, then correct for the biased sampling. That's importance sampling.

Why not just make the system fail more often in simulation? We could increase noise levels, degrade sensors, or inject adversarial disturbances. The problem is that these artificial failures may not represent real failure modes. We might find failures that would never occur under normal conditions, or miss the real failure modes entirely.

What we need is a principled way to sample more failures while still estimating the probability under the original distribution. We want to change how we sample, not what we're estimating. The mathematical framework for this is importance sampling.

Problem
Failures are too rare to observe under p(τ)
↓ can't just crank up noise
Insight
Sample from a different distribution q(τ) that produces more failures
↓ but then n/m estimates pfail under q, not p
Fix
Reweight each sample by p(τ)/q(τ) to correct the bias

Here's the key identity. The failure probability is an expectation under p:

pfail = Ep[1fail(τ)] = ∫ 1fail(τ) p(τ) dτ

We can rewrite this by multiplying and dividing by any positive density q(τ):

pfail = ∫ 1fail(τ) · (p(τ) / q(τ)) · q(τ) dτ = Eq[1fail(τ) · w(τ)]

where w(τ) = p(τ)/q(τ) is the importance weight. We've converted the problem from "sample from p and count" to "sample from q and take a weighted average." If q concentrates samples in the failure region, we see many failures, each downweighted by how much q oversampled that region.

Check: Why does simply increasing noise levels in simulation NOT solve the rare event problem?

Chapter 4: Importance Sampling

Importance sampling (IS) turns the identity from Chapter 3 into a practical estimator. Draw m samples τ1, …, τm from the proposal distribution q(τ). For each sample, compute the importance weight and check for failure:

IS = (1/m) ∑i=1m 1faili) · wi,     wi = p(τi) / q(τi)

This estimator is unbiased for any q that has support everywhere p does (meaning q(τ) > 0 whenever p(τ) > 0). The magic is in the choice of q. A good proposal concentrates samples in the failure region, giving us many failure observations, each properly downweighted.

Worked example: Suppose trajectories live in 1D, failure occurs when τ > 4, and p(τ) = N(0,1). Under p, P(τ > 4) ≈ 3.2 × 10−5. Let q(τ) = N(4,1), centered on the failure boundary. Drawing from q, about 50% of samples fail. Each failure sample gets weight w = p(τ)/q(τ) = exp(−τ2/2) / exp(−(τ−4)2/2) = exp(−4τ + 8). The weights are small (because q oversamples this region), but the product 1fail · w averages to the correct pfail.

The variance of the IS estimator is:

Var(p̂IS) = (1/m) Varq[1fail(τ) · w(τ)]

A bad choice of q can make this variance worse than direct sampling. If q puts too little weight in the failure region, the rare failures that do appear get enormous weights, creating high-variance spikes. The variance can even be infinite if q has lighter tails than p.

Importance Sampling: Weight Visualizer

The nominal distribution p (blue) is N(0,1). Failure is τ > 3. Drag the proposal mean to shift q (orange). Watch how sample weights (bars below) change. A good q produces many small-weighted failures. A bad q produces few heavy-weighted failures.

p̂ = —
Proposal mean μq3.0
Self-normalized IS: In practice, we often don't know the normalizing constants of p and q. The self-normalized estimator divides by the sum of weights: p̂SN = ∑ 1fail · wi / ∑ wi. This introduces a small bias (it's a ratio of two estimates) but removes the need for normalized densities. For large m, the bias is negligible.
Check: In importance sampling, what is the importance weight wi for sample τi?

Chapter 5: The Optimal Proposal

If importance sampling works with any q, some q must be better than others. Is there an optimal proposal that minimizes variance? Yes — and the answer reveals a beautiful circular dependency.

The variance of the IS estimator is zero when every sample produces the same weighted value. That happens when 1fail(τ) · p(τ)/q(τ) is constant across all τ sampled from q. Working through the math, the zero-variance proposal is:

q*(τ) = 1fail(τ) · p(τ) / pfail

Read this carefully. The optimal proposal q* is the nominal distribution p, restricted to the failure region, and normalized. In other words, q* is the failure distribution itself — the conditional distribution of trajectories given that they fail.

The circular dependency: To construct q*, we need pfail (the normalizing constant). But pfail is exactly what we're trying to estimate! If we knew the answer, we wouldn't need importance sampling. This is not a flaw — it tells us what to aim for. Any proposal that approximates the failure distribution will reduce variance compared to direct sampling.

Let's verify the zero-variance claim. Under q*, every failure sample has weight:

wi = p(τi) / q*(τi) = p(τi) / (1faili) p(τi) / pfail) = pfail

Every sample gives exactly pfail. No variance. A single sample suffices. Of course, this requires knowing pfail, which we don't. But the insight is powerful: the closer q is to the failure distribution, the lower the variance.

Proposal qualityVariance behavior
q = p (no change)Same as direct MC — no benefit
q concentrates near failure boundaryMany failures with moderate weights — low variance
q = q* (failure distribution)Zero variance — but unknowable
q too far from failure regionRare huge weights — variance worse than direct MC
Practical takeaway: We can never achieve q*, but we can approximate it. The rest of this chapter covers methods that iteratively improve the proposal toward q*. Cross entropy minimizes KL divergence to q*. Population Monte Carlo migrates proposals toward high-weight regions. Sequential Monte Carlo builds a sequence of distributions approaching p|fail.
Check: What is the zero-variance importance sampling proposal q*?

Chapter 6: Multiple Proposals

A single proposal distribution may miss important failure regions. If failures can occur through multiple mechanisms — sensor noise, actuator failure, adversarial inputs — one proposal centered on one mechanism ignores the others. Multiple importance sampling (MIS) uses several proposals simultaneously.

Suppose we have K proposal distributions q1, …, qK, and we draw mk samples from each. For a sample τi drawn from qk, how should we compute its weight?

The simplest approach is standard MIS (s-MIS): weight each sample by p/qk, the proposal it came from.

s-MIS = (1/m) ∑k=1Ki=1mk 1failki) · p(τki) / qkki)

This works, but has a subtle problem. If sample τ happens to lie in a region where qk has low density but another proposal qj has high density, the weight p/qk is huge — even though the sample is "explained" by qj. The variance spikes unnecessarily.

Deterministic mixture MIS (DM-MIS): A better approach uses the mixture density in the denominator. Define the mixture q̄(τ) = (1/K) ∑k qk(τ). Then weight each sample by p(τ)/q̄(τ). Now a sample that any proposal would likely generate gets a moderate weight, regardless of which proposal actually produced it. This is called the balance heuristic, and it is provably near-optimal among all weighting schemes for MIS.
wiDM = p(τi) / q̄(τi) = p(τi) / ((1/K) ∑k qki))

Let's trace an example. Two proposals: q1 = N(3, 0.5) targets one failure mode, q2 = N(5, 0.5) targets another. Draw 50 samples from each. Under s-MIS, a sample at τ = 4 drawn from q1 gets weight p(4)/q1(4), which is moderate. But the same sample drawn from q2 gets a different weight p(4)/q2(4). Under DM-MIS, both get the same weight p(4)/q̄(4), because the denominator accounts for both proposals.

MethodWeight formulaProsCons
s-MISp(τ) / qk(τ)Simple, unbiasedHigh variance from proposal mismatch
DM-MISp(τ) / q̄(τ)Lower variance, near-optimalMust evaluate all K densities per sample
When to use MIS: Whenever you suspect multiple distinct failure modes. If all failures cluster in one region, a single well-tuned proposal suffices. But safety-critical systems often fail in qualitatively different ways — and a single Gaussian proposal can't cover them all.
Check: In DM-MIS, why is the mixture density q̄(τ) used in the denominator instead of the individual proposal qk(τ)?

Chapter 7: Cross Entropy Method

How do we find a good proposal distribution without knowing q*? The cross-entropy (CE) method iteratively refines the proposal by fitting it to the "best" samples from each round. It's an elegant dance: sample, select elites, refit, repeat.

The algorithm works with a parametric proposal family qθ (e.g., a Gaussian with mean μ and covariance Σ). Start with θ0 set to the nominal distribution. Then iterate:

1. Sample
Draw m trajectories from qθt
2. Score
Evaluate cost function S(τ) for each sample
3. Select
Keep top-ρ fraction (the "elites") that are closest to failure
4. Refit
Set θt+1 by maximum likelihood on the elite samples
↻ repeat until convergence

The key trick is threshold relaxation. Instead of immediately targeting the failure region, CE uses a sequence of increasingly strict thresholds. In round 1, the "elites" might be the worst 10% of trajectories (even though they don't actually fail). The proposal shifts toward them. In round 2, the new samples are worse on average, so the elite threshold tightens. Eventually the elites are actual failures, and the proposal concentrates in the failure region.

Why "cross entropy"? Fitting qθ to the elites by maximum likelihood is equivalent to minimizing the cross-entropy (or KL divergence) between the elite distribution and qθ. If q is Gaussian, this reduces to computing the sample mean and covariance of the elites. The name comes from information theory, but the algorithm is just "fit a distribution to your best samples."

For a Gaussian proposal q = N(μ, Σ), the update is explicit:

μt+1 = (1/|E|) ∑τ ∈ E τ,     Σt+1 = (1/|E|) ∑τ ∈ E (τ − μt+1)(τ − μt+1)T

where E is the elite set. After convergence, the final proposal qθ* is used for one final round of importance sampling to produce the pfail estimate with proper IS weights.

Smoothing trick: In practice, the elite covariance can collapse too quickly, trapping the proposal in a local mode. A common fix is to update with smoothing: θt+1 = α θ̂ + (1−α) θt, where α ∈ (0,1) and θ̂ is the MLE from the elites. Typical α ≈ 0.7.
Check: In the cross-entropy method, what are the "elite" samples?

Chapter 8: Population Monte Carlo

Cross entropy uses a single proposal (or a single parametric family). What if the failure region is multimodal — failures occur in several disconnected "pockets" of the trajectory space? A single Gaussian can't cover them all without also covering vast non-failure regions, degrading performance.

Population Monte Carlo (PMC) maintains a population of K proposals, each with its own parameters. After each round, every proposal adapts independently based on its own samples. Proposals that happen to sit near failure modes sharpen and stay. Proposals in barren regions migrate toward productive areas.

Think of it this way: Cross entropy sends one fishing boat. PMC sends a fleet. Each boat casts its net, keeps the good catches, and repositions based on where the fish were biting. Over time, boats cluster around the schools of fish (failure modes) and abandon empty water.

The PMC algorithm:

pseudocode
Initialize K proposals: q1(0), ..., qK(0)
for t = 1, 2, ...
  for k = 1 to K:
    Draw mk samples from qk(t-1)
    Compute DM-MIS weights: wi = p(τi) / q̄(t-1)i)
  Pool all samples and weights
  for k = 1 to K:
    Resample or refit qk(t) from weighted samples
  Estimate: p̂ = (1/m) ∑ 1faili) wi

The adaptation step can take many forms. One common approach: each proposal qk refits to the weighted samples nearest to it (a form of weighted EM). Another: randomly resample proposals from the current population, with probability proportional to the total weight of their samples.

PMC naturally handles multimodality because different proposals can converge to different modes. It also provides a valid IS estimate at each iteration (using DM-MIS weights), so you can stop early if the estimate has converged.

FeatureCE MethodPMC
Number of proposals1K (a population)
Multimodal failuresMay miss modesNaturally covers multiple modes
AdaptationElite-based MLEWeighted resampling/refitting
Weightingp/q at final roundDM-MIS at every round
Check: What advantage does PMC have over the cross-entropy method for multimodal failure distributions?

Chapter 9: Sequential Monte Carlo

Cross entropy relaxes the threshold. PMC migrates proposals. Sequential Monte Carlo (SMC) takes a different approach: build a sequence of intermediate distributions that smoothly bridge the gap between the nominal distribution p and the failure distribution p|fail.

Define a sequence of distributions π0, π1, …, πT where π0 = p (easy to sample) and πT concentrates on the failure region. A natural choice is to temper the failure indicator:

πt(τ) ∝ p(τ) · φ(τ)βt

where φ(τ) measures "closeness to failure" and 0 = β0 < β1 < … < βT. At β = 0, we have π0 = p. As β increases, πt concentrates on trajectories with high φ — those near failure.

The SMC algorithm moves a population of weighted particles through this sequence:

1. Initialize
Sample particles from π0 = p
2. Reweight
Update weights for next distribution πt+1
3. Resample
Duplicate high-weight particles, drop low-weight ones
4. Move
Apply MCMC kernel to diversify particles (prevent collapse)
↻ next intermediate distribution
Why MCMC transitions? After resampling, many particles are copies of the same high-weight ancestors. Without diversification, the particle cloud collapses. The MCMC step (typically a few Metropolis-Hastings moves targeting πt) spreads particles apart while keeping them distributed according to πt. This is the key innovation of SMC over naive importance sampling through a sequence.

The failure probability estimate combines normalizing constant ratios:

fail = ∏t=0T-1 ((1/N) ∑i=1N wi(t))

Each factor is the average incremental weight at level t — the fraction of "mass" that survives to the next level. The product telescopes to give the ratio of normalizing constants ZT/Z0, which equals pfail.

Adaptive β selection: Rather than fixing the sequence β0, …, βT in advance, a practical SMC algorithm chooses βt+1 adaptively: find the next β such that the effective sample size (ESS) drops to some target fraction (e.g., 50% of N). This automatically adjusts the "step size" through the distribution sequence.
Check: Why does SMC include an MCMC transition step after resampling?

Chapter 10: Bridge Sampling

Importance sampling works forward: sample from q, reweight to p. But what if both p and q are easy to evaluate but hard to normalize? Bridge sampling exploits samples from both distributions to estimate the ratio of their normalizing constants — which is exactly pfail when one distribution is p and the other is p|fail.

Let p0 and p1 be two unnormalized densities with normalizing constants Z0 and Z1. We want r = Z1/Z0. Bridge sampling introduces an arbitrary bridge function h(τ) and uses the identity:

r = Z1 / Z0 = E0[h(τ) p1(τ)] / E1[h(τ) p0(τ)]

Given samples from both distributions, we estimate each expectation by a sample average. The ratio gives r̂. The optimal bridge function (minimizing variance) is:

h*(τ) = C / (n0 p0(τ) / Z0 + n1 p1(τ) / Z1)

This depends on the unknowns Z0, Z1. In practice, we solve iteratively: start with a guess for r, compute h*, re-estimate r, repeat until convergence.

Bridge sampling vs. IS: Standard IS uses samples from one distribution to estimate an expectation under another. Bridge sampling uses samples from both distributions. When both are available (e.g., you can sample from both the nominal and failure distributions), bridge sampling can be much more efficient, especially when the distributions have little overlap — exactly the regime where standard IS struggles.

For failure probability estimation, the setup is: p0 = the nominal distribution p(τ), Z0 = 1 (normalized). p1 = 1fail(τ) · p(τ), Z1 = pfail. We need samples from p (easy) and from the failure distribution (obtained via Chapter 6 methods like MCMC). Bridge sampling then estimates r = pfail/1 = pfail.

Intermediate bridges: When p0 and p1 are very different, a single bridge function may not suffice. We can insert intermediate distributions p0, p1/T, …, p(T−1)/T, p1 and estimate the product r = ∏ rt. This is the same idea as SMC's sequence of intermediate distributions, applied to normalizing constant estimation.
Check: What unique requirement does bridge sampling have compared to standard importance sampling?

Chapter 11: Multilevel Splitting

Every method so far estimates pfail as a single quantity. Multilevel splitting takes a radically different approach: decompose pfail into a product of conditional probabilities, each of which is large enough to estimate directly.

Define a sequence of nested events: F0 ⊃ F1 ⊃ … ⊃ FL = F (the failure event). Each Fl is a relaxed version of failure — say, the system cost exceeds threshold γl. Then by the chain rule of probability:

pfail = P(FL) = P(F0) · ∏l=1L P(Fl | Fl−1)

The magic: each conditional probability P(Fl | Fl−1) can be chosen to be moderately large (say 0.1 to 0.5) by adjusting the thresholds. Even if pfail = 10−10, we can write it as 0.110 — ten factors of 0.1, each easily estimable with a few hundred samples.

The staircase analogy: Imagine descending a cliff in one leap (direct MC) versus walking down a staircase. Each step is manageable. The total descent is the same. Multilevel splitting builds the staircase: each level is a conditional probability that we can estimate reliably, and the product gives the full failure probability.

The algorithm:

pseudocode
# Multilevel splitting
Initialize N samples from p(τ)
Set thresholds γ0 < γ1 < ... < γL = failure threshold

for l = 1 to L:
  # Count how many samples exceed γl
  nl = |{i : S(τi) ≥ γl}|
  p̂l = nl / N
  # Split: duplicate survivors, discard the rest
  Keep the nl survivors
  Resample N − nl copies from survivors
  # Diversify via MCMC targeting p(τ | S(τ) ≥ γl)
  Apply MCMC transitions to all N samples

p̂fail = ∏l=1Ll

With adaptive thresholds, we don't predefine the γl. Instead, at each level, we set γl to the empirical quantile of the current sample — say the (1−ρ)-quantile, so a fixed fraction ρ survives. Then every conditional probability is exactly ρ, and pfail ≈ ρL, where L is the number of levels needed to reach the actual failure threshold.

Worked example: pfail = 10−8. We use ρ = 0.2 (keep top 20% at each level). Then we need L = log(10−8)/log(0.2) ≈ 11.5, so about 12 levels. At each level, we estimate p̂l ≈ 0.2. With N = 500 samples, each level's estimate has standard deviation √(0.2 × 0.8 / 500) = 0.018 — a 9% relative error per level. The relative error of the product is about √12 × 9% ≈ 31%. With N = 2000, it drops to 15%. Compare this to the 1010 samples direct MC would need!
Multilevel Splitting Staircase

Watch samples (dots) get progressively filtered through threshold levels. At each level, the surviving fraction is displayed. The product of all fractions gives pfail. Adjust the number of levels and survival fraction.

Ready
Levels (L)5
Survival ρ0.20
MethodCore ideaBest for
Direct MCCount failurespfail > 10−3
Importance samplingReweight from better proposalKnown failure modes, single mode
Cross entropyIteratively fit proposal to elitesUnknown failure region, parametric proposal
PMCPopulation of migrating proposalsMultimodal failures
SMCSequence of bridging distributionsComplex failure boundaries
Multilevel splittingProduct of conditional probabilitiesExtremely rare events (p < 10−8)
Check: In multilevel splitting with survival fraction ρ = 0.2 and 5 levels, what is the estimated pfail?