Find out where your policy fails before deployment does.
In November 2016, a self-driving car being tested in San Francisco ran a red light. The engineers had optimized the policy on standard scenarios. This was not a standard scenario. The policy had never seen a protected left-turn lane at that intersection. The policy was not wrong — it was untested in that region of the world.
Training a policy and validating it are different problems. Training asks: "Does this policy perform well on the distribution of scenarios I've seen?" Validation asks: "Does this policy perform well on the full distribution it will encounter, including scenarios I haven't seen, including the rare ones that matter most?"
The chapter's methods address four related questions:
| Question | Method | Output |
|---|---|---|
| How good is my policy on average? | Performance metrics + Monte Carlo | E[return], variance, percentiles |
| How good is it in rare failures? | Rare event simulation, importance sampling | Probability of failure |
| How good is it as parameters vary? | Robustness analysis | Safe operating envelope |
| Which parameters matter most for failure? | Adversarial analysis | Worst-case φ |
Before you can measure how good a policy is, you need to decide what to measure. A single scalar "return" is useful for training but hides important structure for validation.
Metrics for a single rollout: Given a trajectory $\tau = (s_0, a_0, \ldots, s_T)$, you might care about:
| Metric | Formula | When to use |
|---|---|---|
| Total return | $\sum_t \gamma^t R(s_t, a_t)$ | Training objective, dense reward |
| Success rate | $\mathbb{1}[\text{goal reached before } T]$ | Binary outcome problems |
| Time to goal | $\min\{t : s_t = s_{\text{goal}}\}$ | Efficiency-sensitive applications |
| Safety metric | $\mathbb{1}[\text{constraint violated}]$ | Safety-critical applications |
| Distance to failure | $\min_t d(s_t, \text{failure set})$ | Margins analysis |
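The per-rollout metrics in the table can be computed from one recorded trajectory. The following is a minimal sketch, assuming a hypothetical 1-D setup in which the state is a position, the goal is reached once the position is at or above `goal`, and the failure set is every position at or below `unsafe`. The function name, arguments, and toy trajectory are illustrative and not taken from the chapter.

```python
import numpy as np

def rollout_metrics(states, rewards, goal, unsafe, gamma=0.95):
    """Per-rollout metrics for a 1-D trajectory (hypothetical setup).

    states:  positions s_0..s_T
    rewards: rewards r_0..r_{T-1}
    goal:    success once a state satisfies s >= goal
    unsafe:  constraint violated once a state satisfies s <= unsafe
    """
    states = np.asarray(states, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    reached = np.flatnonzero(states >= goal)  # time steps at which the goal holds
    return {
        "total_return": float(np.sum(discounts * rewards)),
        "success": bool(reached.size > 0),
        "time_to_goal": int(reached[0]) if reached.size else None,
        "violation": bool(np.any(states <= unsafe)),
        # Distance to the failure set, clipped at 0 once the set is entered.
        "distance_to_failure": float(np.min(np.maximum(states - unsafe, 0.0))),
    }

metrics = rollout_metrics(
    states=[1.0, 2.0, 3.0, 4.0], rewards=[0.0, 0.0, 1.0], goal=4.0, unsafe=0.0
)
print(metrics)
```

Keeping all five numbers per rollout, rather than only the return, lets the later distribution-level analysis ask different questions of the same set of simulations.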
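Metrics for a distribution of rollouts: run $m$ rollouts, collect $\{R^{(1)}, \ldots, R^{(m)}\}$, then report summary statistics of the returns: the mean, the variance, percentiles, and a tail-risk measure such as CVaR.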
Why CVaR? Mean return is an optimistic metric: it can be high even when rare but catastrophic failures exist. Conditional Value at Risk (CVaR) at α = 0.05 is the average return over the worst 5% of outcomes. Optimizing CVaR hedges directly against tail risk, because the objective is the average return in the worst cases, so catastrophic outcomes are not washed out by good average performance.
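To make the gap between mean and CVaR concrete, here is a small sketch that summarizes a batch of returns. The bimodal return distribution (97% good rollouts near +10, 3% crashes near -100) is invented for illustration, and `summarize_returns` is a hypothetical helper, not part of any library.

```python
import numpy as np

def summarize_returns(returns, alpha=0.05):
    """Mean, spread, percentiles, and CVaR_alpha of a batch of rollout returns."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))  # size of the worst-alpha tail
    return {
        "mean": float(returns.mean()),
        "std": float(returns.std(ddof=1)),
        "p05": float(np.percentile(returns, 5)),
        "p50": float(np.percentile(returns, 50)),
        "cvar": float(returns[:k].mean()),  # average of the k worst returns
    }

rng = np.random.default_rng(0)
# Assumed toy data: most rollouts succeed, a small fraction fail badly.
good = rng.normal(10.0, 1.0, size=9700)
bad = rng.normal(-100.0, 5.0, size=300)
stats = summarize_returns(np.concatenate([good, bad]))
print(stats)
```

In this toy batch the mean stays positive while the CVaR is strongly negative, which is the situation the paragraph above warns about.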
Interactive figure: mean versus CVaR as the tail weight varies. A bimodal distribution can have a good mean but a terrible CVaR.
The simplest validation approach: pick a set of representative test scenarios, run the policy in each, and measure performance. These scenarios are called the nominal environment — the expected operating conditions.
Nominal simulation answers: "Does my policy work in the scenarios I designed it for?" This is a necessary first check but not sufficient. It doesn't tell you about the edges of the operating envelope or the tail of the distribution.
The sample complexity problem: to estimate a probability $p$ to relative error $\varepsilon$ with confidence $1 - \delta$, you need roughly $m = \frac{1}{\varepsilon^2 p} \log(2/\delta)$ rollouts. For $p = 10^{-6}$ (aviation-level safety) and a log factor of about 10, $m \sim 10^7/\varepsilon^2$. At 1 second per rollout, that is more than 115 days of simulation even at $\varepsilon = 1$. This is why nominal simulation alone cannot validate safety-critical systems.
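The arithmetic behind that estimate can be checked directly. This sketch evaluates the rough bound quoted above; the choice $\delta = 10^{-4}$ is an assumption made here so that $\log(2/\delta) \approx 10$, since the text does not state the confidence level it used.

```python
import math

def rollouts_needed(p, eps, delta):
    """Rough rollout count m = log(2/delta) / (eps^2 * p) from the bound above."""
    return math.log(2.0 / delta) / (eps ** 2 * p)

SECONDS_PER_DAY = 86_400

# Assumed confidence level: delta = 1e-4 gives log(2/delta) of roughly 9.9.
m_loose = rollouts_needed(p=1e-6, eps=1.0, delta=1e-4)
m_tight = rollouts_needed(p=1e-6, eps=0.1, delta=1e-4)

days_loose = m_loose / SECONDS_PER_DAY  # at 1 second per rollout
days_tight = m_tight / SECONDS_PER_DAY

print(f"eps=1.0: {m_loose:.2e} rollouts, {days_loose:.0f} days")
print(f"eps=0.1: {m_tight:.2e} rollouts, {days_tight:.0f} days")
```

Tightening the relative error from 1 to 0.1 multiplies the cost by 100, because the bound scales with $1/\varepsilon^2$.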
A well-designed nominal test suite should:
You need to estimate the probability that your policy fails in an event that happens once in a million simulations. Naive simulation is hopeless. The insight: you can artificially inflate the probability of the rare event, measure how often it leads to failure, and then correct for the inflation.
This is rare event simulation. The core idea: instead of sampling environment parameters φ from the nominal distribution p(φ), sample them from an alternative distribution q(φ) that puts more weight on dangerous regions. Then correct the estimate with an importance weight.
The importance weight $w(\phi) = p(\phi)/q(\phi)$ corrects for the fact that we're sampling from the "wrong" distribution. Regions where q is higher than p get downweighted by w < 1. Regions where q is lower than p get upweighted by w > 1. With the correction applied, the estimator is unbiased for any q that puts positive probability everywhere a failure can occur under p, but its variance depends critically on q.
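Written out, the correction is a change of measure. The identity below is the standard importance sampling estimator, stated in the chapter's symbols; the indicator notation and the sample index $i$ are introduced here for illustration. The second equality requires $q(\phi) > 0$ wherever $p(\phi) > 0$ and a failure occurs. The last line gives the standard zero-variance proposal, which cannot be sampled in practice because it depends on the unknown $P(\text{fail})$.

```latex
P(\text{fail})
  = \mathbb{E}_{\phi \sim p}\big[\mathbb{1}\{\text{fail}(\phi)\}\big]
  = \mathbb{E}_{\phi \sim q}\big[w(\phi)\,\mathbb{1}\{\text{fail}(\phi)\}\big]
  \approx \frac{1}{N} \sum_{i=1}^{N} w(\phi_i)\,\mathbb{1}\{\text{fail}(\phi_i)\},
\qquad
w(\phi) = \frac{p(\phi)}{q(\phi)}, \quad \phi_i \sim q

q^*(\phi) = \frac{p(\phi)\,\mathbb{1}\{\text{fail}(\phi)\}}{P(\text{fail})}
```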
A practical approach for policy validation is adaptive importance sampling. Run some rollouts under the nominal distribution to identify where failures tend to occur, then concentrate q around those regions. Repeat: estimate the failure probability, update q, re-estimate. The Cross-Entropy Method (Ch 10) can be used to fit q within a parametric family by repeatedly refitting it to the samples that fail or come closest to failing, which approximates the zero-variance proposal $q^*(\phi) \propto p(\phi)\,\mathbb{1}[\text{fail}(\phi)]$.
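The adaptive loop can be sketched on a toy problem. The assumptions here are all illustrative: a scalar parameter with nominal distribution $p = N(0, 1)$, failure defined as $\phi > 4$, and a proposal family $q = N(\mu, 1)$ in which only the mean is adapted. Keeping the proposal's variance at the nominal value is a deliberate choice in this sketch, since a proposal narrower than p can produce exploding weights.

```python
import numpy as np
from math import erfc, sqrt

def log_weight(phi, mu):
    """log p(phi) - log q(phi) for p = N(0, 1) and q = N(mu, 1)."""
    return -mu * phi + 0.5 * mu ** 2

def adaptive_is(threshold=4.0, n=2000, rho=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    mu = 0.0  # start the proposal at the nominal distribution
    for _ in range(iters):
        phi = rng.normal(mu, 1.0, size=n)
        w = np.exp(log_weight(phi, mu))
        # Intermediate level: top-rho quantile of the samples, capped at the
        # real failure threshold.
        level = min(np.quantile(phi, 1.0 - rho), threshold)
        elite = phi >= level
        # Cross-entropy update for the mean: weighted average of elite samples.
        mu = np.sum(w[elite] * phi[elite]) / np.sum(w[elite])
        if level >= threshold:
            break
    # Final importance sampling estimate under the adapted proposal.
    phi = rng.normal(mu, 1.0, size=n)
    w = np.exp(log_weight(phi, mu))
    return float(np.mean(w * (phi > threshold))), float(mu)

p_hat, mu = adaptive_is()
true_p = 0.5 * erfc(4.0 / sqrt(2.0))  # exact Gaussian tail probability
print(f"IS estimate {p_hat:.3e}, exact {true_p:.3e}, proposal mean {mu:.2f}")
```

In a real validation setting the indicator `phi > threshold` would be replaced by a simulated rollout, and the exact tail probability used for comparison here would not be available.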
Interactive figure: naive Monte Carlo and importance sampling estimate P(failure) with the same number of samples. The failure region is shown in red, and IS concentrates its samples there.
The theory of importance sampling is clean. The practice has sharp edges. Here are the key failure modes and how to handle them.
High variance from extreme weights. If q puts nearly zero probability on some region where p has significant mass, the importance weight p/q explodes there. A single sample from that region gets an astronomically large weight and dominates the estimate. If q has lighter tails than p over the failure region, the variance can be infinite. If q assigns zero probability to a region where failures occur under p, the estimator is biased, because those failures can never be sampled.
Effective sample size (ESS). When weights are very unequal, the effective sample size $\mathrm{ESS} = \left(\sum_i w_i\right)^2 / \sum_i w_i^2$ is much less than $N$. If ESS = 10 with N = 1000, you are effectively using 1% of your samples. Always check ESS as a diagnostic: if ESS/N < 0.1, your q is a poor approximation of the target and the estimate is unreliable.
```python
import numpy as np

def importance_sampling_estimate(simulate, p_logpdf, q_logpdf, q_sample, N=1000):
    """
    simulate: phi -> 1 (failure) or 0 (success)
    p_logpdf: log probability under nominal distribution p
    q_logpdf: log probability under proposal distribution q
    q_sample: function that samples phi from q
    """
    phis = [q_sample() for _ in range(N)]
    failures = np.array([simulate(phi) for phi in phis])

    # Log importance weights (numerically stable)
    log_w = np.array([p_logpdf(phi) - q_logpdf(phi) for phi in phis])
    log_w -= log_w.max()  # shift for stability; cancels in the ratio below
    w = np.exp(log_w)

    # Self-normalized estimate
    p_fail = (w * failures).sum() / w.sum()

    # Effective sample size diagnostic
    ess = w.sum()**2 / (w**2).sum()
    return p_fail, ess
```
You've designed an autonomous vehicle policy under the assumption that sensor noise has standard deviation σ = 0.1m. What if it's actually 0.2m? What if the GPS signal degrades in a tunnel? Robustness analysis asks: over what range of environment parameters does your policy remain acceptable?
Formally: parameterize the environment by a vector $\phi \in \Phi$. Define an acceptability threshold: the policy is acceptable at $\phi$ if its performance metric $U(\pi, \phi) \ge u_{\min}$. The safe operating envelope is the set of parameters where the policy is acceptable:
$$\Phi_{\text{safe}} = \{\phi \in \Phi : U(\pi, \phi) \ge u_{\min}\}$$
Robustness analysis maps out this set. It answers: "What conditions can my policy handle, and where does it break down?"
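A key metric from robustness analysis: the robustness margin $\rho = \min_{\phi \in \partial\Phi_{\text{safe}}} \lVert \phi - \phi_{\text{nominal}} \rVert$, where $\partial\Phi_{\text{safe}}$ is the boundary of the safe set. A large robustness margin means the nominal point is far from the edge of the safe envelope. A small margin means the policy barely handles the nominal conditions and fails under small perturbations.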
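A grid sweep is one simple way to map the envelope and approximate the margin. The sketch below assumes two environment parameters (sensor noise and road friction) and a made-up closed-form utility. In practice each grid cell would be a Monte Carlo estimate of $U(\pi, \phi)$. The margin is approximated as the distance from the nominal point to the nearest unsafe grid point, which only makes sense if the parameters are scaled to comparable units.

```python
import numpy as np

def safe_envelope(utility, grid_a, grid_b, u_min, nominal):
    """Evaluate utility on a 2-D grid; return (safe mask, approximate margin)."""
    A, B = np.meshgrid(grid_a, grid_b, indexing="ij")
    U = utility(A, B)
    safe = U >= u_min
    if safe.all():
        return safe, float("inf")  # no unsafe point found on this grid
    # Distance from the nominal parameters to every grid point.
    dist = np.hypot(A - nominal[0], B - nominal[1])
    return safe, float(dist[~safe].min())

def toy_utility(noise, friction):
    """Hypothetical utility: degrades with sensor noise and with low friction."""
    return 1.0 - 2.0 * noise - (1.0 - friction)

noise_grid = np.linspace(0.0, 0.5, 51)
friction_grid = np.linspace(0.3, 1.0, 51)
nominal = (0.1, 1.0)

safe, margin = safe_envelope(toy_utility, noise_grid, friction_grid,
                             u_min=0.5, nominal=nominal)
print(f"safe fraction of grid: {safe.mean():.2f}, robustness margin: {margin:.3f}")
```

Grid sweeps scale exponentially with the number of parameters, so this approach is practical only for a handful of dimensions.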
Interactive figure: each cell shows policy performance as two environment parameters vary. The dashed region is the safe envelope (U ≥ threshold), which changes as the threshold is adjusted.
A self-driving car can prioritize passenger comfort (smooth, gentle maneuvers) or travel time (fast, aggressive maneuvers). A medical robot can prioritize treatment efficacy (more aggressive intervention) or patient safety (conservative, low-risk). These are not the same objective — improving one hurts the other.
When you have two or more objectives that conflict, you can't find a single "best" policy. Instead, you find the set of Pareto-optimal policies: policies where you cannot improve one objective without worsening another. This set is called the Pareto frontier.
How to compute the Pareto frontier for a policy class {πθ} : sample many policies (by varying θ or by multi-objective optimization), evaluate each on all objectives via Monte Carlo rollouts, and keep only the non-dominated ones.
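The "keep only the non-dominated ones" step can be sketched as a pairwise dominance filter. This assumes every objective is oriented so that larger is better, and the score matrix is made up for illustration. In practice each row would hold Monte Carlo estimates for one candidate policy.

```python
import numpy as np

def pareto_mask(scores):
    """Boolean mask of non-dominated rows.

    scores: array of shape (n_policies, n_objectives), larger is better.
    Policy j dominates policy i if j is at least as good on every objective
    and strictly better on at least one.
    """
    scores = np.asarray(scores, dtype=float)
    keep = np.ones(scores.shape[0], dtype=bool)
    for i in range(scores.shape[0]):
        at_least_as_good = np.all(scores >= scores[i], axis=1)
        strictly_better = np.any(scores > scores[i], axis=1)
        keep[i] = not np.any(at_least_as_good & strictly_better)
    return keep

# Hypothetical (comfort, speed) scores for five candidate policies.
scores = np.array([
    [0.9, 0.2],
    [0.7, 0.6],
    [0.4, 0.9],
    [0.5, 0.5],  # dominated by [0.7, 0.6]
    [0.3, 0.3],  # dominated by several others
])
mask = pareto_mask(scores)
print(scores[mask])
```

The filter is quadratic in the number of policies, which is usually negligible next to the cost of the rollouts. Because the scores are noisy estimates, policies near the frontier may swap places between runs.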
The key insight for validation: a policy should be evaluated not just at its design point but across the Pareto frontier. The system designer might prefer a different tradeoff than the engineer assumed. Presenting the Pareto frontier makes the tradeoffs explicit and separable from the technical optimization.
Interactive figure: each dot is a candidate policy evaluated on two objectives. Gold dots form the Pareto frontier, and selecting a dot shows its tradeoff.
Robustness analysis maps performance across all parameters. Adversarial analysis asks a sharper question: what is the single worst parameter setting for my policy? An adversary is trying to find φ that causes maximum damage. What will they find?
Formally, adversarial analysis solves:
$$\phi^* = \arg\min_{\phi \in \Phi} U(\pi, \phi)$$
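This is an optimization problem: minimize policy performance over environment parameters. The result φ* is the worst-case environment, and U(π, φ*) is the worst-case performance. If U(π, φ*) is acceptable and φ* is the true global minimizer, then no environment in Φ can break the policy. A search that stops at a local minimum gives only an optimistic bound on the worst case.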
How to solve the adversarial problem? If U(π, φ) is differentiable in φ, use gradient descent on U with respect to φ (equivalently, gradient ascent on the negative return). If not (because φ affects the environment structure in non-differentiable ways), use policy search methods (Ch 10): CEM or evolution strategies with U(π, φ) as the objective to minimize.
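For the non-differentiable case, a cross-entropy search over a box-shaped Φ might look like the sketch below. The utility function is a made-up deterministic stand-in. In practice `utility(phi)` would be a noisy Monte Carlo estimate of $U(\pi, \phi)$, and the result would be a candidate worst case rather than a guaranteed global minimum. All names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def cem_minimize(utility, low, high, n=200, elite_frac=0.1, iters=30, seed=0):
    """Cross-entropy search for the phi in [low, high] that minimizes utility."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    mu, sigma = (low + high) / 2.0, (high - low) / 2.0
    n_elite = max(1, int(n * elite_frac))
    best_phi, best_u = None, np.inf
    for _ in range(iters):
        # Sample candidate environments and clip them into the allowed box.
        phi = np.clip(rng.normal(mu, sigma, size=(n, len(low))), low, high)
        u = np.array([utility(p) for p in phi])
        order = np.argsort(u)  # ascending: lowest utility first
        if u[order[0]] < best_u:
            best_u, best_phi = float(u[order[0]]), phi[order[0]].copy()
        # Refit the sampling distribution to the most damaging candidates.
        elite = phi[order[:n_elite]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return best_phi, best_u

def toy_utility(phi):
    """Hypothetical utility over phi = (sensor noise, road friction)."""
    noise, friction = phi
    return 1.0 - 2.0 * noise - (1.0 - friction)

phi_star, u_star = cem_minimize(toy_utility, low=[0.0, 0.3], high=[0.5, 1.0])
print(f"worst-case phi ~ {phi_star}, utility ~ {u_star:.3f}")
```

For this toy utility the minimum sits at the corner of the box with the highest noise and lowest friction, so the search result is easy to sanity-check by hand.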
A simple car-following policy tries to maintain a safe distance from a lead vehicle. We validate it against three environment parameters: sensor noise level, lead vehicle aggression, and road surface friction. The interactive demo runs nominal simulation, importance-sampled rare events, and the adversarial worst-case search side by side.
Policy validation is the bridge between algorithm development and deployment. A policy that scores well in training but hasn't been stress-tested is a liability, not an asset.
| Method | What it finds | Cost |
|---|---|---|
| Nominal simulation | Average performance in typical conditions | Low — m · T steps |
| Importance sampling | Rare failure probability | Medium — IS overhead |
| Robustness analysis | Safe operating envelope over Φ | High — grid over Φ |
| Trade analysis (Pareto) | Optimal tradeoffs between objectives | High — many policy evaluations |
| Adversarial analysis | Worst-case environment parameters | High — optimization over Φ |
"In God we trust, all others bring data." — W. Edwards Deming. A policy that hasn't been validated is not a belief — it's faith.