Kochenderfer, Wheeler & Wray — Chapter 14

Policy Validation

Finding out your policy fails before deployment does.

Prerequisites: MDPs (Ch 7) + basic probability. That's it.
10 Chapters · 4 Simulations · 10 Quizzes

Chapter 0: Why Validate?

In December 2016, a self-driving car being tested in San Francisco ran a red light. The engineers had optimized the policy on standard scenarios. This was not a standard scenario. The policy had never seen a protected left-turn lane at that intersection. The policy was not wrong; it was untested in that region of the world.

Training a policy and validating it are different problems. Training asks: "Does this policy perform well on the distribution of scenarios I've seen?" Validation asks: "Does this policy perform well on the full distribution it will encounter, including scenarios I haven't seen, including the rare ones that matter most?"

The validation problem: You have a policy π. You want to know how it performs across all environment parameters φ (weather, road conditions, opponent behavior, sensor noise). Some values of φ are rare but catastrophic if your policy fails. Naive simulation can't find them — you'd have to run too many rollouts.

The chapter's methods address four related questions:

| Question | Method | Output |
|---|---|---|
| How good is my policy on average? | Performance metrics + Monte Carlo | E[return], variance, percentiles |
| How good is it in rare failures? | Rare event simulation, importance sampling | Probability of failure |
| How good is it as parameters vary? | Robustness analysis | Safe operating envelope |
| Which parameters matter most for failure? | Adversarial analysis | Worst-case φ |

Validation vs. testing: Software testing checks that code is correct. Policy validation checks that behavior is acceptable. A policy can be "correct" (does what you designed it to do) and still unacceptable (does the wrong thing in the tail of the distribution). This is why validation must go beyond nominal simulation.
A policy is trained on 10,000 simulated scenarios and achieves 97% success rate. Why might this not guarantee safe deployment?

Chapter 1: Performance Metrics

Before you can measure how good a policy is, you need to decide what to measure. A single scalar "return" is useful for training but hides important structure for validation.

Metrics for a single rollout: Given a trajectory $\tau = (s_0, a_0, \dots, s_T)$, you might care about:

| Metric | Formula | When to use |
|---|---|---|
| Total return | $\sum_t \gamma^t R(s_t, a_t)$ | Training objective (dense reward) |
| Success rate | $\mathbb{1}[\text{goal reached before } T]$ | Binary outcome problems |
| Time to goal | $\min\{t : s_t = s_{\text{goal}}\}$ | Efficiency-sensitive applications |
| Safety metric | $\mathbb{1}[\text{constraint violated}]$ | Safety-critical applications |
| Distance to failure | $\min_t d(s_t, \text{failure set})$ | Margins analysis |

Metrics for a distribution of rollouts: run m rollouts, collect $\{R^{(1)}, \dots, R^{(m)}\}$, then report:

Distribution-level metrics:
(1) Mean: $\mathbb{E}[R] \approx \frac{1}{m}\sum_{i=1}^{m} R^{(i)}$, for overall performance.
(2) Variance: $\mathrm{Var}[R]$, for consistency.
(3) α-quantile: the value $q_\alpha$ such that $P(R \le q_\alpha) = \alpha$, for worst-case analysis.
(4) $\mathrm{CVaR}_\alpha$ (conditional value at risk): $\mathbb{E}[R \mid R \le q_\alpha]$, the expected return in the worst α fraction of outcomes.

Why CVaR? Mean return is an optimistic metric: it can be high even when rare but catastrophic failures exist. Conditional Value at Risk (CVaR) at α = 0.05 is the average return over the 5% worst outcomes. Optimizing for CVaR directly hedges against tail risk: the return in the worst cases is explicitly maximized rather than washed out by good average performance.
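The four distribution-level metrics can be computed directly from a sample of returns. The sketch below is a minimal illustration, not a reference implementation: the function name `summarize_returns` and the bimodal return distribution (90% of rollouts near +100, 10% near -50) are assumptions made up for the example.

```python
import numpy as np

def summarize_returns(returns, alpha=0.05):
    """Mean, variance, alpha-quantile, and CVaR of a sample of returns."""
    r = np.asarray(returns, dtype=float)
    q_alpha = np.quantile(r, alpha)   # empirical alpha-quantile
    tail = r[r <= q_alpha]            # the worst alpha fraction of outcomes
    return {
        "mean": r.mean(),
        "variance": r.var(ddof=1),
        "quantile": q_alpha,
        "cvar": tail.mean(),          # average return inside the tail
    }

# Hypothetical bimodal returns: most rollouts succeed, a minority fail badly.
rng = np.random.default_rng(0)
m = 10_000
failed = rng.random(m) < 0.10
returns = np.where(failed,
                   rng.normal(-50.0, 5.0, m),
                   rng.normal(100.0, 5.0, m))

stats = summarize_returns(returns, alpha=0.05)
print(stats)
```

With these made-up numbers the mean stays high while the CVaR sits inside the failure mode, which is the gap between the two metrics that the paragraph above describes.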

Choosing α for CVaR: For aviation safety, α might be $10^{-6}$ (the worst one-in-a-million flights). For robotics, α = 0.05 is a reasonable starting point. The right α is a domain decision: how much tail risk is acceptable? This is fundamentally a values question, not a technical one. The engineer's job is to make the tradeoff explicit and transparent.

[Interactive figure: Return Distribution: Mean vs. CVaR. Adjust the tail weight to see how CVaR differs from the mean; a bimodal distribution can have a good mean but a terrible CVaR. Controls: tail failures (0.20), CVaR α (0.10).]

Policy A has mean return 80 and CVaR0.05 = -10. Policy B has mean return 70 and CVaR0.05 = 60. For an autonomous vehicle, which is safer to deploy?

Chapter 2: Nominal Simulation

The simplest validation approach: pick a set of representative test scenarios, run the policy in each, and measure performance. These scenarios are called the nominal environment — the expected operating conditions.

Nominal simulation answers: "Does my policy work in the scenarios I designed it for?" This is a necessary first check but not sufficient. It doesn't tell you about the edges of the operating envelope or the tail of the distribution.

The failure mode of nominal simulation: If your nominal scenarios cluster around the "typical" part of the environment distribution, you will systematically miss rare failures. Consider a policy with failure probability p = 0.001 tested on 1000 random rollouts. The expected number of observed failures is 1, and the standard error of the estimated failure rate is √(p(1−p)/1000) ≈ 0.001, which is as large as the quantity being estimated. You need about 100,000 rollouts to observe roughly 100 failures and estimate the failure rate to about 10% relative accuracy.

The sample complexity problem: to estimate a probability p to relative error ε with confidence 1 − δ, you need roughly $m \approx \frac{1}{\varepsilon^2 p}\log(2/\delta)$ rollouts. For $p = 10^{-6}$ (aviation-level safety), $m \sim 10^7$. At 1 second per rollout, this is 115+ days of simulation. This is why nominal simulation cannot validate safety-critical systems alone.
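The arithmetic above can be checked in a few lines. This is a rough sketch: the function names are made up for illustration, and the values ε = 0.5 and δ = 0.05 in the last call are assumptions chosen only to show the order of magnitude, since the text does not fix them.

```python
import math

def naive_mc_standard_error(p, m):
    """Standard error of the naive Monte Carlo failure-rate estimate."""
    return math.sqrt(p * (1.0 - p) / m)

def rollouts_needed(p, eps, delta):
    """Rough rollout count m = log(2/delta) / (eps^2 * p)."""
    return math.ceil(math.log(2.0 / delta) / (eps ** 2 * p))

# 1000 rollouts at p = 0.001: the standard error is about as large as p itself.
se_small = naive_mc_standard_error(0.001, 1_000)
print(se_small, se_small / 0.001)

# 100,000 rollouts: relative error drops to roughly 10%.
rel_large = naive_mc_standard_error(0.001, 100_000) / 0.001
print(rel_large)

# Aviation-level p with assumed eps and delta: on the order of 10^7 rollouts.
m = rollouts_needed(1e-6, eps=0.5, delta=0.05)
print(m, "rollouts, about", m / 86_400, "days at one second each")
```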

When nominal simulation is sufficient: For low-stakes applications where failure probabilities are moderate (p ~ 0.01–0.1) and the test distribution accurately represents deployment. Example: a navigation algorithm for indoor delivery robots where the environment is well-characterized. For safety-critical systems (aircraft, surgical robots, autonomous vehicles), nominal simulation is a starting point, not a conclusion.

A well-designed nominal test suite should:

A policy has a true failure probability of 0.001. You run 500 nominal rollouts. How many failures do you expect to observe?

Chapter 3: Rare Event Simulation

You need to estimate the probability that your policy fails in an event that happens once in a million simulations. Naive simulation is hopeless. The insight: you can artificially inflate the probability of the rare event, measure how often it leads to failure, and then correct for the inflation.

This is rare event simulation. The core idea: instead of sampling environment parameters φ from the nominal distribution p(φ), sample them from an alternative distribution q(φ) that puts more weight on dangerous regions. Then correct the estimate with an importance weight.

The importance sampling estimator: Let f(φ) be an indicator of failure (1 if the policy fails under parameters φ, 0 otherwise). The failure probability is $P(\text{fail}) = \mathbb{E}_p[f(\varphi)] = \int f(\varphi)\,p(\varphi)\,d\varphi$. Draw N samples $\varphi^{(1)}, \dots, \varphi^{(N)}$ from the importance distribution q instead: $P(\text{fail}) \approx \frac{1}{N}\sum_{i=1}^{N} f(\varphi^{(i)})\,w(\varphi^{(i)})$, where $w(\varphi) = p(\varphi)/q(\varphi)$ is the importance weight.

The importance weight corrects for the fact that we're sampling from the "wrong" distribution. Regions where q is higher than p get downweighted by w < 1. Regions where q is lower than p get upweighted by w > 1. With the correction applied, the estimator is unbiased for any q that assigns positive probability wherever f(φ)p(φ) > 0, but its variance depends critically on q.
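A one-dimensional toy problem makes the estimator concrete. In the sketch below, the nominal distribution is assumed to be a standard normal, "failure" is assumed to mean φ > 4, and the proposal q is a unit-variance normal centered on the threshold. All of these are illustrative choices, not part of the chapter's method; they are convenient because the exact answer is available in closed form.

```python
import math
import numpy as np

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2), evaluated elementwise."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def estimate_tail_naive(threshold, n, rng):
    """Naive Monte Carlo: sample from p = N(0, 1) and count failures."""
    phi = rng.normal(0.0, 1.0, n)
    return float(np.mean(phi > threshold))

def estimate_tail_is(threshold, n, rng):
    """Importance sampling: sample from q = N(threshold, 1), reweight by p/q."""
    phi = rng.normal(threshold, 1.0, n)
    f = (phi > threshold).astype(float)                 # failure indicator
    log_w = log_normal_pdf(phi, 0.0, 1.0) - log_normal_pdf(phi, threshold, 1.0)
    return float(np.mean(f * np.exp(log_w)))            # (1/N) * sum f * w

rng = np.random.default_rng(0)
threshold, n = 4.0, 10_000
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))     # closed-form P(phi > 4)
naive = estimate_tail_naive(threshold, n, rng)
is_est = estimate_tail_is(threshold, n, rng)
print("exact:", exact, "naive:", naive, "IS:", is_est)
```

At this sample count the naive estimate is built from at most a handful of failures (often none), while roughly half of the importance samples land in the failure region and each contributes a small, well-behaved weight.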

The optimal q: Theoretically, the optimal importance distribution is q*(φ) ∝ f(φ)p(φ), proportional to the failure indicator times the nominal density. In this ideal case, q* puts all its mass on failure scenarios and the IS estimator has zero variance. In practice, q* is unknown (we don't know where failures are, and its normalizing constant is the very failure probability we want to estimate), so we must approximate it.

A practical approach for policy validation: use adaptive importance sampling. Run some rollouts under the nominal distribution to identify where failures tend to occur. Then concentrate q around those regions. Repeat: estimate failure probability, update q, re-estimate. Methods like the Cross-Entropy Method (Ch 10) can be applied to fit a parametric q to the observed failures, which amounts to picking the member of the chosen family that is closest to q* in KL divergence.
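The adaptive loop can be sketched on the same kind of toy problem. The code below is a simplified cross-entropy-style scheme under stated assumptions: the nominal distribution is N(0, 1), the proposal family is N(mu, 1) with only the mean adapted (adapting the variance as well can produce a proposal with tails that are too light), and `score` is a hypothetical function whose large values indicate failure. It is one possible instance of adaptive importance sampling, not the chapter's specific algorithm.

```python
import math
import numpy as np

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2), evaluated elementwise."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def fit_proposal_mean(score, threshold, n=2000, elite_frac=0.1, iters=20, seed=0):
    """Move the proposal mean toward the failure region {score(phi) >= threshold}."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    for _ in range(iters):
        phi = rng.normal(mu, 1.0, n)
        s = score(phi)
        # Intermediate level: the elite quantile, capped at the real threshold.
        level = min(np.quantile(s, 1.0 - elite_frac), threshold)
        elite = s >= level
        # Weights p/q keep the update aimed at f * p rather than f * q.
        w = np.exp(log_normal_pdf(phi[elite], 0.0, 1.0)
                   - log_normal_pdf(phi[elite], mu, 1.0))
        mu = float(np.sum(w * phi[elite]) / np.sum(w))
        if level >= threshold:
            break
    return mu

def is_estimate(score, threshold, mu, n=5000, seed=1):
    """Final importance sampling estimate using the fitted proposal N(mu, 1)."""
    rng = np.random.default_rng(seed)
    phi = rng.normal(mu, 1.0, n)
    f = (score(phi) >= threshold).astype(float)
    w = np.exp(log_normal_pdf(phi, 0.0, 1.0) - log_normal_pdf(phi, mu, 1.0))
    return float(np.mean(f * w))

score = lambda phi: phi          # hypothetical: failure when phi itself is large
threshold = 4.0
mu = fit_proposal_mean(score, threshold)
p_fail = is_estimate(score, threshold, mu)
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
print("fitted mean:", mu, "estimate:", p_fail, "exact:", exact)
```

The intermediate levels matter: starting from the nominal distribution, almost no samples reach the true threshold, so the threshold is raised gradually until the proposal covers the failure region.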

[Interactive figure: Rare Event Estimation: Naive vs. Importance Sampling. Both methods estimate P(failure) at the same sample count. The failure region is shown in red, and IS concentrates its samples there. Controls: failure probability p (0.010), sample count N (50).]

Importance sampling uses weights w(φ) = p(φ)/q(φ). If you sample a parameter φ that is 10× more likely under q than under p, what weight do you assign?

Chapter 4: Importance Sampling in Practice

The theory of importance sampling is clean. The practice has sharp edges. Here are the key failure modes and how to handle them.

High variance from extreme weights. If q puts nearly zero probability on some region where p has significant mass, the importance weight p/q explodes. A single sample from that region gets an astronomically large weight and dominates the estimate. The variance can be infinite when q has lighter tails than f·p, and if q assigns zero probability to a region where f·p > 0, the estimator is biased because failures there are never sampled.

Self-normalized IS: Instead of using raw weights $w_i = p(\varphi_i)/q(\varphi_i)$, use normalized weights $\bar{w}_i = w_i / \sum_j w_j$. The estimator becomes $\sum_i \bar{w}_i f(\varphi_i)$. This introduces a slight bias that vanishes as N grows, but it often reduces variance substantially. It also handles un-normalized distributions: you don't need to know the partition function of p or q.

Effective sample size (ESS). When weights are very unequal, the effective sample size $\mathrm{ESS} = \left(\sum_i w_i\right)^2 / \sum_i w_i^2$ is much less than N. If ESS = 10 with N = 1000, you're effectively using 1% of your samples. Always check ESS as a diagnostic: if ESS/N < 0.1, your q is a poor approximation of the target and the estimate is unreliable.
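Two extreme weight vectors show how the ESS formula behaves. The helper name and the example weights below are made up for illustration.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Equal weights: every sample counts fully, so ESS equals N.
equal = np.ones(1000)
ess_equal = effective_sample_size(equal)

# One dominant weight: the estimate effectively rests on a single sample.
skewed = np.ones(1000)
skewed[0] = 1e6
ess_skewed = effective_sample_size(skewed)

print(ess_equal, ess_skewed)
```

Because the formula is a ratio, rescaling all weights by a constant leaves ESS unchanged, so it can be computed from un-normalized or log-shifted weights.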

Defensive importance sampling: Mix a fraction α of the nominal distribution into q: $q(\varphi) = \alpha\,p(\varphi) + (1-\alpha)\,q_0(\varphi)$. The mixing prevents extreme weights when $q_0$ misses some region that p covers: because q ≥ αp everywhere, every weight satisfies w = p/q ≤ 1/α. α = 0.1 (10% nominal sampling) is a practical default. This makes the estimator less efficient in theory but much more stable in practice.
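The weight bound from the defensive mixture is easy to verify numerically. The sketch below assumes a standard normal nominal distribution and a deliberately narrow proposal q0 = N(4, 0.5²) aimed at an assumed failure region φ > 4; these settings and the helper names are illustrative only.

```python
import math
import numpy as np

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2), evaluated elementwise."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def sample_defensive(n, alpha, q0_mean, q0_std, rng):
    """Draw from q = alpha * p + (1 - alpha) * q0, with p = N(0, 1)."""
    from_nominal = rng.random(n) < alpha
    return np.where(from_nominal,
                    rng.normal(0.0, 1.0, n),
                    rng.normal(q0_mean, q0_std, n))

def log_defensive_pdf(phi, alpha, q0_mean, q0_std):
    """Log density of the mixture, computed stably with logaddexp."""
    return np.logaddexp(math.log(alpha) + log_normal_pdf(phi, 0.0, 1.0),
                        math.log(1.0 - alpha) + log_normal_pdf(phi, q0_mean, q0_std))

rng = np.random.default_rng(0)
alpha, threshold = 0.1, 4.0
phi = sample_defensive(5000, alpha, threshold, 0.5, rng)
w = np.exp(log_normal_pdf(phi, 0.0, 1.0)
           - log_defensive_pdf(phi, alpha, threshold, 0.5))

p_fail = float(np.mean((phi > threshold) * w))
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
print("largest weight:", w.max(), "bound 1/alpha:", 1.0 / alpha)
print("estimate:", p_fail, "exact:", exact)
```

The narrow q0 on its own would have lighter tails than the nominal distribution; the 10% nominal component is what keeps every weight below 1/α.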
```python
import numpy as np

def importance_sampling_estimate(simulate, p_logpdf, q_logpdf, q_sample, N=1000):
    """
    Self-normalized importance sampling estimate of the failure probability.

    simulate: phi -> 1 (failure) or 0 (success)
    p_logpdf: log probability under nominal distribution p
    q_logpdf: log probability under proposal distribution q
    q_sample: function that samples phi from q
    Returns (p_fail, ess): the estimate and the effective sample size.
    """
    phis = [q_sample() for _ in range(N)]
    failures = np.array([simulate(phi) for phi in phis], dtype=float)

    # Log importance weights (numerically stable)
    log_w = np.array([p_logpdf(phi) - q_logpdf(phi) for phi in phis])
    log_w -= log_w.max()   # constant shift; it cancels in both ratios below
    w = np.exp(log_w)

    # Self-normalized estimate
    p_fail = (w * failures).sum() / w.sum()

    # Effective sample size diagnostic (unchanged by rescaling w)
    ess = w.sum()**2 / (w**2).sum()

    return p_fail, ess
```
You run IS with N=500 samples and find ESS=12. What does this tell you?

Chapter 5: Robustness Analysis

You've designed an autonomous vehicle policy under the assumption that sensor noise has standard deviation σ = 0.1m. What if it's actually 0.2m? What if the GPS signal degrades in a tunnel? Robustness analysis asks: over what range of environment parameters does your policy remain acceptable?

Formally: parameterize the environment by a vector φ ∈ Φ. Define an acceptability threshold: the policy is acceptable at φ if its performance metric satisfies $U(\pi, \varphi) \ge u_{\min}$. The safe operating envelope is the set of parameters where the policy is acceptable:

$\Phi_{\text{safe}} = \{ \varphi \in \Phi : U(\pi, \varphi) \ge u_{\min} \}$

Robustness analysis maps out this set. It answers: "What conditions can my policy handle, and where does it break down?"

How to map the safe envelope: Simple approach: grid search over Φ. At each grid point, estimate U(π, φ) with Monte Carlo rollouts. Color the grid by acceptability. This is expensive (exponential in dim(Φ)) but clear. For high-dimensional Φ, use adaptive sampling: run more rollouts near the boundary ∂Φsafe where uncertainty is highest.
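The grid-search approach can be sketched in a few lines. Everything specific here is an assumption for illustration: the two parameters (noise and aggression, both scaled to [0, 1]), the logistic success model inside `rollout_success`, which stands in for a real simulator, and the use of success rate as the performance metric U.

```python
import numpy as np

def rollout_success(noise, aggression, rng):
    """Hypothetical stand-in for one simulated rollout; returns 1.0 on success."""
    p_success = 1.0 / (1.0 + np.exp(10.0 * (noise + aggression - 1.0)))
    return float(rng.random() < p_success)

def map_safe_envelope(noise_grid, aggression_grid, u_min, rollouts=200, seed=0):
    """Estimate U at each grid point by Monte Carlo and threshold it at u_min."""
    rng = np.random.default_rng(seed)
    U = np.zeros((len(noise_grid), len(aggression_grid)))
    for i, noise in enumerate(noise_grid):
        for j, aggression in enumerate(aggression_grid):
            outcomes = [rollout_success(noise, aggression, rng) for _ in range(rollouts)]
            U[i, j] = np.mean(outcomes)       # Monte Carlo estimate of U(pi, phi)
    return U, U >= u_min                       # performance map and safe mask

noise_grid = np.linspace(0.0, 1.0, 11)
aggression_grid = np.linspace(0.0, 1.0, 11)
U, safe = map_safe_envelope(noise_grid, aggression_grid, u_min=0.5)
print("safe cells:", int(safe.sum()), "of", safe.size)
```

The cost is one batch of rollouts per grid cell, so an 11-point grid in d dimensions needs 11^d batches, which is the exponential scaling noted above.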

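A key metric from robustness analysis: the robustness margin $\rho = \min_{\varphi \in \partial\Phi_{\text{safe}}} \lVert \varphi - \varphi_{\text{nominal}} \rVert$, the distance from the nominal parameters to the nearest point on the boundary of the safe set. A large robustness margin means the nominal conditions are far from the edge of the safe envelope. A small margin means the policy barely handles the nominal conditions and fails under small perturbations.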
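Given a grid of evaluated parameters, the margin can be approximated as the distance from the nominal point to the nearest unsafe grid point, which is accurate up to the grid resolution. The sketch below assumes a made-up envelope (safe when noise + aggression ≤ 1) and a nominal point of (0.2, 0.2); both are hypothetical.

```python
import numpy as np

def robustness_margin(phi_grid, safe_mask, phi_nominal):
    """Approximate rho as the distance to the nearest unsafe grid point.

    phi_grid: (n, d) array of parameter vectors
    safe_mask: (n,) boolean array, True where the policy is acceptable
    """
    unsafe = phi_grid[~safe_mask]
    if len(unsafe) == 0:
        return np.inf                      # no failure found on this grid
    return float(np.min(np.linalg.norm(unsafe - phi_nominal, axis=1)))

# Hypothetical envelope on a 2D grid with spacing 0.05.
axis = np.linspace(0.0, 1.5, 31)
noise, aggression = np.meshgrid(axis, axis, indexing="ij")
phi_grid = np.column_stack([noise.ravel(), aggression.ravel()])
safe_mask = phi_grid.sum(axis=1) <= 1.0

phi_nominal = np.array([0.2, 0.2])         # assumed to lie inside the safe set
rho = robustness_margin(phi_grid, safe_mask, phi_nominal)
print("approximate robustness margin:", rho)
```

For this envelope the exact distance from (0.2, 0.2) to the line noise + aggression = 1 is 0.6/√2 ≈ 0.42, so the grid answer should land within one cell of that value.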

Parameter uncertainty matters: Robustness analysis naturally answers "how bad does model error need to be before the policy fails?" If the sensor noise in the real world is anywhere in [0.05, 0.15]m but the policy starts failing at σ = 0.18m, you have a safety margin of 3cm. If you can't guarantee your model is better than 3cm, you can't guarantee safe deployment.

[Interactive figure: Safe Operating Envelope: 2D Robustness Map. Each cell shows policy performance as two environment parameters vary. The dashed region is the safe envelope (U ≥ threshold). Control: acceptability threshold u_min (0.50).]

A policy's safe operating envelope is the set {φ : U(π, φ) ≥ umin}. If you raise umin, what happens to the safe envelope?

Chapter 6: Trade Analysis

A self-driving car can prioritize passenger comfort (smooth, gentle maneuvers) or travel time (fast, aggressive maneuvers). A medical robot can prioritize treatment efficacy (more aggressive intervention) or patient safety (conservative, low-risk). These are not the same objective — improving one hurts the other.

When you have two or more objectives that conflict, you can't find a single "best" policy. Instead, you find the set of Pareto-optimal policies: policies where you cannot improve one objective without worsening another. This set is called the Pareto frontier.

Pareto dominance: Policy π dominates policy π′ if π is at least as good as π′ on all objectives AND strictly better on at least one. The Pareto frontier consists of all non-dominated policies. Any policy not on the frontier is Pareto-dominated — there exists a strictly better alternative.

How to compute the Pareto frontier for a policy class {πθ} : sample many policies (by varying θ or by multi-objective optimization), evaluate each on all objectives via Monte Carlo rollouts, and keep only the non-dominated ones.
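The "keep only the non-dominated ones" step is a direct translation of the dominance definition. In the sketch below, the function name and the five candidate score pairs are made up for illustration, and higher is assumed to be better on every objective.

```python
import numpy as np

def pareto_front(scores):
    """Return a boolean mask of non-dominated rows (higher is better)."""
    scores = np.asarray(scores, dtype=float)
    non_dominated = np.ones(len(scores), dtype=bool)
    for i in range(len(scores)):
        # Another policy dominates i if it is at least as good on every
        # objective and strictly better on at least one.
        at_least_as_good = np.all(scores >= scores[i], axis=1)
        strictly_better = np.any(scores > scores[i], axis=1)
        if np.any(at_least_as_good & strictly_better):
            non_dominated[i] = False
    return non_dominated

# Hypothetical (performance, safety) estimates for five candidate policies.
candidates = np.array([
    [0.9, 0.3],
    [0.6, 0.6],
    [0.3, 0.9],
    [0.5, 0.5],   # dominated by (0.6, 0.6)
    [0.2, 0.2],   # dominated by several candidates
])
mask = pareto_front(candidates)
print(candidates[mask])
```

This pairwise check costs O(n²) comparisons, which is usually negligible next to the rollouts needed to estimate each policy's objectives. Because those objectives are Monte Carlo estimates, policies near the frontier can flip in or out of it from noise alone.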

The key insight for validation: a policy should be evaluated not just at its design point but across the Pareto frontier. The system designer might prefer a different tradeoff than the engineer assumed. Presenting the Pareto frontier makes the tradeoffs explicit and separable from the technical optimization.

Scalarization: One approach to multi-objective optimization is to scalarize the objectives into one: $U(\pi) = \lambda_1 U_1(\pi) + \lambda_2 U_2(\pi)$. By varying λ, you trace out the Pareto frontier. But linear scalarization implicitly assumes the frontier is convex. For non-convex frontiers, some Pareto-optimal points are unreachable by any choice of weights. Use specialized multi-objective algorithms (NSGA-II, MOEA/D) to handle non-convex frontiers.
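A three-policy example shows the limitation. The objective values below are invented for illustration: policy C is Pareto-optimal (neither A nor B dominates it) but sits in a non-convex dent of the frontier, so no weighted sum ever selects it.

```python
import numpy as np

# Hypothetical (objective 1, objective 2) values, higher is better.
objectives = np.array([
    [1.0, 0.0],   # policy A (index 0)
    [0.0, 1.0],   # policy B (index 1)
    [0.4, 0.4],   # policy C (index 2): non-dominated, but below the A-B chord
])

def scalarization_winners(objectives, num_weights=101):
    """Indices of policies that maximize lam*U1 + (1-lam)*U2 for some lam."""
    winners = set()
    for lam in np.linspace(0.0, 1.0, num_weights):
        scalarized = lam * objectives[:, 0] + (1.0 - lam) * objectives[:, 1]
        winners.add(int(np.argmax(scalarized)))
    return winners

winners = scalarization_winners(objectives)
print(winners)   # policy C (index 2) never appears
```

For any weight λ, the better of A and B scores max(λ, 1 − λ) ≥ 0.5, while C always scores 0.4, so the sweep only ever returns A or B.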

[Interactive figure: Pareto Frontier: Performance vs. Safety. Each dot is a candidate policy evaluated on two objectives. Gold dots form the Pareto frontier. Click a dot to see its tradeoff.]

Policy A achieves (performance=0.8, safety=0.6). Policy B achieves (performance=0.7, safety=0.9). Policy C achieves (performance=0.75, safety=0.65). Which policies are Pareto-optimal?

Chapter 7: Adversarial Analysis

Robustness analysis maps performance across all parameters. Adversarial analysis asks a sharper question: what is the single worst parameter setting for my policy? An adversary is trying to find φ that causes maximum damage. What will they find?

Formally, adversarial analysis solves:

$\varphi^* = \arg\min_{\varphi \in \Phi} U(\pi, \varphi)$

This is an optimization problem: minimize policy performance over environment parameters. The result φ* is the worst-case environment. U(π, φ*) is the worst-case performance. If U(π, φ*) is acceptable, then no environment in Φ can break the policy.

The adversary as a test oracle: An adversarial agent that optimally selects φ to minimize policy performance is the best possible stress test. Unlike random simulation (which might never find the failure mode) or human-designed test cases (which might miss non-obvious failure modes), an adversary systematically searches for the worst case. This is the philosophy behind adversarial testing and red-teaming.

How to solve the adversarial problem? If U(π, φ) is differentiable in φ, use gradient descent on U(π, φ) itself (equivalently, gradient ascent on the negative return). If not (because φ affects the environment structure in non-differentiable ways), use policy search methods (Ch 10): CEM or evolution strategies with U(π, φ) as the objective to minimize.
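For the gradient-free case, a basic cross-entropy search over φ looks roughly like the sketch below. The `performance` function is a hypothetical stand-in for a Monte Carlo estimate of U(π, φ), with φ = (noise, aggression) restricted to the unit square; the population size, elite fraction, and iteration count are arbitrary illustrative settings.

```python
import numpy as np

def performance(phi):
    """Hypothetical U(pi, phi); in practice this would run rollouts."""
    noise, aggression = phi
    return 1.0 - 0.5 * noise - 0.4 * aggression * (1.0 + noise)

def cem_worst_case(U, low, high, n=100, elite_frac=0.2, iters=30, seed=0):
    """Cross-entropy search for the phi in [low, high] that minimizes U."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    mean = 0.5 * (low + high)
    std = 0.5 * (high - low)
    n_elite = max(2, int(elite_frac * n))
    for _ in range(iters):
        # Sample candidate environments and keep them inside Phi.
        phi = np.clip(rng.normal(mean, std, size=(n, len(low))), low, high)
        scores = np.array([U(p) for p in phi])
        # Elites are the candidates with the LOWEST policy performance.
        elite = phi[np.argsort(scores)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6     # small floor avoids a zero-width search
    return mean, U(mean)

phi_star, u_star = cem_worst_case(performance, low=[0.0, 0.0], high=[1.0, 1.0])
print("worst-case phi:", phi_star, "performance there:", u_star)
```

In this toy problem performance decreases in both parameters, so the search should drift to the corner (1, 1). On a real policy the landscape may have several local minima, which is why rerunning from different seeds is worthwhile.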

Multi-armed adversary: In practice, φ might include multiple independent environment parameters. An adversarial agent trained with RL (using the negative of π's return as its reward) can find complex parameter combinations that no human would think to test. This is increasingly used in AI safety research to find failure modes of neural policies before deployment.
Adversarial analysis limitations: Finding the true worst-case φ is as hard as policy optimization itself. You may find a local worst case, not the global one. Multiple runs from different φ0 initializations help map the failure landscape. Also: adversarial analysis assumes an adversary constrained to Φ. If the real world can go outside Φ (distribution shift), adversarial analysis within Φ misses it.
Adversarial analysis finds φ* that minimizes U(π, φ) and gets U(π, φ*) = 0.35. The acceptability threshold is umin = 0.5. What can you conclude?

Chapter 8: Showcase — Policy Stress Test

A simple car-following policy tries to maintain a safe distance from a lead vehicle. We validate it against three environment parameters: sensor noise level, lead vehicle aggression, and road surface friction. Watch nominal simulation, importance-sampled rare events, and the adversarial worst-case all running simultaneously.

What to observe: Under nominal conditions (low noise, gentle lead car), the policy rarely fails. Increase noise or aggression to see failure rates rise. The IS estimator (gold) tracks the true rate much more efficiently than naive simulation (teal). The adversary (red) finds the worst parameter combination automatically.

[Interactive figure: Car-Following Policy Stress Test. Controls: sensor noise (0.20), lead aggression (0.20), friction (1 = dry, 0 = ice; 0.80).]

Chapter 9: Connections & What's Next

Policy validation is the bridge between algorithm development and deployment. A policy that scores well in training but hasn't been stress-tested is a liability, not an asset.

| Method | What it finds | Cost |
|---|---|---|
| Nominal simulation | Average performance in typical conditions | Low (m · T steps) |
| Importance sampling | Rare failure probability | Medium (IS overhead) |
| Robustness analysis | Safe operating envelope over Φ | High (grid over Φ) |
| Trade analysis (Pareto) | Optimal tradeoffs between objectives | High (many policy evaluations) |
| Adversarial analysis | Worst-case environment parameters | High (optimization over Φ) |

The validation stack: These methods are complementary, not competing. In practice: (1) Always run nominal simulation first as a sanity check. (2) Use importance sampling to estimate rare failure probability once you believe the policy is roughly correct. (3) Use robustness analysis to understand the operating envelope. (4) Use adversarial analysis to verify worst-case bounds. All four are needed for safety-critical deployment.
Connection to Ch 10 (Policy Search): Adversarial analysis is policy search applied to the environment parameter space: find φ* = argmin U(π, φ). The CEM and evolution strategies from Ch 10 work directly here. The "policy" being optimized is the adversary's choice of φ; the "environment" is the fixed agent policy π.

"In God we trust, all others bring data." — W. Edwards Deming. A policy that hasn't been validated is not a belief — it's faith.
You have a limited budget of 1000 simulator calls to validate a safety-critical policy. The nominal failure probability is estimated at 0.001. Which validation method gives you the most useful information?