Ch 14: Policy Validation — Kochenderfer et al.

Chapter 0: Why Validate?

In December 2016, a self-driving car being tested in San Francisco ran a red light. The engineers had optimized the policy on standard scenarios. This was not a standard scenario. The policy had never seen a protected left-turn lane at that intersection. The policy was not wrong; it was untested in that region of the world.

Training a policy and validating it are different problems. Training asks: "Does this policy perform well on the distribution of scenarios I've seen?" Validation asks: "Does this policy perform well on the full distribution it will encounter, including scenarios I haven't seen, including the rare ones that matter most?"

The validation problem: You have a policy π. You want to know how it performs across all environment parameters φ (weather, road conditions, opponent behavior, sensor noise). Some values of φ are rare but catastrophic if your policy fails. Naive simulation can't find them — you'd have to run too many rollouts.

The chapter's methods address four related questions:

Question	Method	Output
How good is my policy on average?	Performance metrics + Monte Carlo	E[return], variance, percentiles
How good is it in rare failures?	Rare event simulation, importance sampling	Probability of failure
How good is it as parameters vary?	Robustness analysis	Safe operating envelope
Which parameters matter most for failure?	Adversarial analysis	Worst-case φ

Validation vs. testing: Software testing checks that code is correct. Policy validation checks that behavior is acceptable. A policy can be "correct" (does what you designed it to do) and still unacceptable (does the wrong thing in the tail of the distribution). This is why validation must go beyond nominal simulation.

A policy is trained on 10,000 simulated scenarios and achieves 97% success rate. Why might this not guarantee safe deployment?

The 3% failures might be concentrated in rare but critical scenarios not covered by the training distribution
97% is too low: you need 100% for deployment
10,000 scenarios is too many to generalize well

Chapter 1: Performance Metrics

Before you can measure how good a policy is, you need to decide what to measure. A single scalar "return" is useful for training but hides important structure for validation.

Metrics for a single rollout: Given a trajectory τ = (s₀, a₀, …, s_T), you might care about:

Metric	Formula	When to use
Total return	∑_t γ^t R(s_t, a_t)	Training objective — dense reward
Success rate	1[goal reached before T]	Binary outcome problems
Time to goal	min{t : s_t = s_goal}	Efficiency-sensitive applications
Safety metric	1[constraint violated]	Safety-critical applications
Distance to failure	min_t d(s_t, failure set)	Margins analysis
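The per-rollout metrics in the table can be computed from logged trajectory data. The following is a minimal sketch, assuming each rollout is logged as per-step arrays; the function name `rollout_metrics` and its argument names are hypothetical and not part of the chapter.

```python
import numpy as np

def rollout_metrics(rewards, reached_goal, violated, dist_to_failure, gamma=0.99):
    """Per-rollout metrics for one logged trajectory (hypothetical log format).

    rewards:         reward at each step t
    reached_goal:    bool per step, True when s_t is the goal state
    violated:        bool per step, True when a constraint is violated
    dist_to_failure: d(s_t, failure set) at each step
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    goal_steps = np.flatnonzero(np.asarray(reached_goal, dtype=bool))
    return {
        "total_return": float(np.sum(discounts * rewards)),
        "success": bool(goal_steps.size > 0),
        # First step at which the goal is reached, or None if never reached
        "time_to_goal": int(goal_steps[0]) if goal_steps.size else None,
        "violation": bool(np.any(violated)),
        "min_distance_to_failure": float(np.min(dist_to_failure)),
    }

metrics = rollout_metrics(
    rewards=[0.0, 0.0, 1.0],
    reached_goal=[False, False, True],
    violated=[False, False, False],
    dist_to_failure=[2.0, 0.5, 1.5],
    gamma=0.5,
)
print(metrics)
```

Keeping all five metrics per rollout, rather than only the return, is what makes the distribution-level analysis later in this chapter possible.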

Metrics for a distribution of rollouts: run m rollouts, collect {R⁽¹⁾, …, R⁽ᵐ⁾}, then report:

Distribution-level metrics: (1) Mean: E[R] ≈ (1/m)∑R⁽ⁱ⁾ — overall performance. (2) Variance: Var[R] — consistency. (3) α-quantile: the value q such that P(R ≤ q) = α — worst-case analysis. (4) CVaR_α: conditional value at risk — E[R | R ≤ q_α], the expected return in the worst α fraction of outcomes.

Why CVaR? Mean return is an optimistic metric: it can be high even when rare but catastrophic failures exist. Conditional Value at Risk (CVaR) at α = 0.05 is the average return over the 5% worst outcomes. Optimizing for CVaR directly hedges against tail risk, because the return in the worst cases is explicitly maximized rather than washed out by good average performance.
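To make the gap between mean and CVaR concrete, here is a minimal sketch that computes the four distribution-level metrics from a sample of returns. The bimodal data is made up for illustration (90% of rollouts near 100, 10% near -200), and `distribution_metrics` is a hypothetical helper name.

```python
import numpy as np

def distribution_metrics(returns, alpha=0.05):
    """Empirical mean, variance, alpha-quantile, and CVaR_alpha of a return sample."""
    returns = np.asarray(returns, dtype=float)
    q_alpha = np.quantile(returns, alpha)
    tail = returns[returns <= q_alpha]          # the worst alpha fraction of outcomes
    return {
        "mean": float(returns.mean()),
        "variance": float(returns.var(ddof=1)),
        "quantile": float(q_alpha),
        "cvar": float(tail.mean()),             # E[R | R <= q_alpha]
    }

rng = np.random.default_rng(0)
# Made-up bimodal returns: mostly good rollouts plus a cluster of severe failures
good = rng.normal(100.0, 5.0, size=9000)
bad = rng.normal(-200.0, 5.0, size=1000)
stats = distribution_metrics(np.concatenate([good, bad]), alpha=0.05)
print(stats)
```

In this sample the mean stays near 70 while CVaR at α = 0.05 sits at or below -200, because the worst 5% of outcomes all come from the failure cluster.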

Choosing α for CVaR: For aviation safety, α might be 10⁻⁶ (failure in 1-in-a-million flights). For robotics, α = 0.05 is a reasonable starting point. The right α is a domain decision: how much tail risk is acceptable? This is fundamentally a values question, not a technical one. The engineer's job is to make the tradeoff explicit and transparent.

Interactive figure: Return Distribution, Mean vs. CVaR. Adjusting the tail-failure weight (default 0.20) and the CVaR level α (default 0.10) shows how CVaR differs from the mean: a bimodal distribution can have a good mean but a terrible CVaR.

Policy A has mean return 80 and CVaR_0.05 = -10. Policy B has mean return 70 and CVaR_0.05 = 60. For an autonomous vehicle, which is safer to deploy?

Policy A: higher mean return is better overall
Policy B: the worst-case outcomes are much better
Cannot determine without more information

Chapter 2: Nominal Simulation

The simplest validation approach: pick a set of representative test scenarios, run the policy in each, and measure performance. These scenarios are called the nominal environment — the expected operating conditions.

Nominal simulation answers: "Does my policy work in the scenarios I designed it for?" This is a necessary first check but not sufficient. It doesn't tell you about the edges of the operating envelope or the tail of the distribution.

The failure mode of nominal simulation: If your nominal scenarios cluster around the "typical" part of the environment distribution, you will systematically miss rare failures. Consider a policy tested on 1000 random rollouts with failure probability p = 0.001. Expected number of failures observed: 1. Standard error: √(p(1-p)/1000) ≈ 0.001. You need ~100,000 rollouts to observe 100 failures and estimate the failure rate to 10% accuracy.
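The arithmetic in the paragraph above can be checked directly. This is a small sketch using the binomial standard error √(p(1-p)/m); the helper name `naive_mc_summary` is an assumption for illustration.

```python
import math

def naive_mc_summary(p, m):
    """Expected failure count, standard error, and relative error for m naive rollouts."""
    expected_failures = p * m
    std_error = math.sqrt(p * (1 - p) / m)
    return expected_failures, std_error, std_error / p

for m in (1_000, 100_000):
    failures, se, rel = naive_mc_summary(0.001, m)
    print(f"m={m:>7}: expected failures={failures:.0f}, SE={se:.2e}, relative error={rel:.0%}")
```

With 1,000 rollouts the standard error is as large as p itself, so the estimate is essentially uninformative; the relative error only falls to about 10% at 100,000 rollouts.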

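The sample complexity problem: to estimate a probability p to relative error ε with confidence 1 − δ, you need roughly m = log(2/δ) / (ε² p) rollouts. For p = 10⁻⁶ (aviation-level safety), m ≈ 10⁶ · log(2/δ) / ε², which is on the order of 10⁷/ε² for small δ (log(2/δ) ≈ 10 at δ ≈ 10⁻⁴). At 1 second per rollout, 10⁷ rollouts take about 116 days, and that is only the ε = 1 case; at ε = 0.1 the count rises to 10⁹ rollouts, roughly 32 years. This is why nominal simulation alone cannot validate safety-critical systems.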
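As a rough calculator for the bound m = log(2/δ) / (ε² p) quoted above, the sketch below converts a target accuracy into rollouts and wall-clock time. The function names are hypothetical, and the bound itself is only an order-of-magnitude guide.

```python
import math

def rollouts_needed(p, eps, delta):
    """Approximate rollouts to estimate p to relative error eps with confidence 1 - delta."""
    return math.ceil(math.log(2 / delta) / (eps**2 * p))

def simulation_days(m, seconds_per_rollout=1.0):
    """Wall-clock days to run m rollouts sequentially."""
    return m * seconds_per_rollout / 86_400

for eps in (1.0, 0.1):
    m = rollouts_needed(p=1e-6, eps=eps, delta=0.05)
    print(f"eps={eps}: m={m:.2e} rollouts, {simulation_days(m):.0f} days at 1 s each")
```

Because the cost grows as 1/ε², tightening the relative error by a factor of 10 multiplies the rollout count by 100.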

When nominal simulation is sufficient: For low-stakes applications where failure probabilities are moderate (p ~ 0.01–0.1) and the test distribution accurately represents deployment. Example: a navigation algorithm for indoor delivery robots where the environment is well-characterized. For safety-critical systems (aircraft, surgical robots, autonomous vehicles), nominal simulation is a starting point, not a conclusion.

A well-designed nominal test suite should:

Cover the full range of nominal operating parameters
Include boundary conditions (maximum speed, minimum clearance, etc.)
Include edge cases from domain knowledge (known tricky scenarios)
Track performance over time (did a policy update make things worse?)

A policy has a true failure probability of 0.001. You run 500 nominal rollouts. How many failures do you expect to observe?

About 0.5: you probably observe 0 failures despite the policy failing 1-in-1000 times
About 5: proportional scaling
About 50: rare events cluster in batches

Chapter 3: Rare Event Simulation

You need to estimate the probability that your policy fails in an event that happens once in a million simulations. Naive simulation is hopeless. The insight: you can artificially inflate the probability of the rare event, measure how often it leads to failure, and then correct for the inflation.

This is rare event simulation. The core idea: instead of sampling environment parameters φ from the nominal distribution p(φ), sample them from an alternative distribution q(φ) that puts more weight on dangerous regions. Then correct the estimate with an importance weight.

The importance sampling estimator: Let f(φ) be an indicator of failure (1 if the policy fails under parameters φ, 0 otherwise). The failure probability is P(fail) = E_p[f(φ)] = ∫ f(φ)p(φ)dφ. Draw N samples from the importance distribution q instead: P(fail) ≈ (1/N) ∑_i f(φ⁽ⁱ⁾) · w(φ⁽ⁱ⁾), where w(φ) = p(φ)/q(φ) is the importance weight.

The importance weight corrects for the fact that we're sampling from the "wrong" distribution. Regions where q is higher than p get downweighted by w < 1. Regions where q is lower than p get upweighted by w > 1. With this correction the estimator is unbiased for any q that is positive wherever f(φ)p(φ) > 0, but its variance depends critically on q.
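A one-dimensional toy problem shows the estimator at work. In this sketch the nominal distribution is a standard normal, "failure" means φ > 4, and the proposal is a unit-variance normal centered on the threshold; all three choices are illustrative assumptions rather than anything prescribed by the chapter.

```python
import math
import numpy as np

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def is_failure_probability(threshold, proposal_mean, n, rng):
    """Importance sampling estimate of P(phi > threshold) under p = N(0, 1)."""
    phi = rng.normal(proposal_mean, 1.0, size=n)          # samples from q
    fails = (phi > threshold).astype(float)               # f(phi)
    log_w = log_normal_pdf(phi, 0.0, 1.0) - log_normal_pdf(phi, proposal_mean, 1.0)
    return float(np.mean(fails * np.exp(log_w)))          # (1/N) sum f * w

rng = np.random.default_rng(0)
threshold = 4.0
true_p = 0.5 * math.erfc(threshold / math.sqrt(2))        # exact Gaussian tail probability
naive = float(np.mean(rng.normal(0.0, 1.0, size=10_000) > threshold))
is_est = is_failure_probability(threshold, proposal_mean=threshold, n=10_000, rng=rng)
print(f"exact={true_p:.3e}  naive={naive:.3e}  importance sampling={is_est:.3e}")
```

With the same 10,000 samples, the naive estimate is built from at most a handful of failures (often none), while the shifted proposal places roughly half of its samples in the failure region.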

The optimal q: Theoretically, the optimal importance distribution is q*(φ) ∝ f(φ)p(φ), the nominal density restricted to the failure region. q* puts all of its mass on failure scenarios: every sample fails, every weight equals P(fail), and the IS estimator has zero variance. In practice, q* is unknown (its normalizing constant is P(fail) itself, and we don't know where failures are), so we must approximate it.

A practical approach for policy validation: use adaptive importance sampling. Run some rollouts under the nominal distribution to identify where failures tend to occur. Then concentrate q around those regions. Repeat: estimate failure probability, update q, re-estimate. The cross-entropy method (Ch 10) can be applied here: it refits a parametric q to the importance-weighted failing samples at each iteration, which minimizes the KL divergence from q* to q within the chosen family.
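The adaptive loop described above can be sketched on the same kind of toy problem: nominal p = N(0, 1), failure when φ > 4, and a proposal family q = N(μ, 1) with only the mean adapted. This follows the standard cross-entropy recipe for rare events (raise an intermediate threshold toward the failure threshold, refit μ to importance-weighted elite samples); the function names and hyperparameters are assumptions for illustration.

```python
import math
import numpy as np

def log_weight(phi, mu):
    """log p(phi) - log q(phi) for p = N(0, 1) and q = N(mu, 1)."""
    return -0.5 * phi**2 + 0.5 * (phi - mu) ** 2

def fit_proposal_mean(threshold, rng, n=2000, elite_frac=0.1, max_iters=20):
    """Adapt the proposal mean toward the failure region phi > threshold."""
    mu = 0.0                                              # start from the nominal distribution
    for _ in range(max_iters):
        phi = rng.normal(mu, 1.0, size=n)
        # Intermediate level: the elite quantile, capped at the real threshold
        level = min(np.quantile(phi, 1.0 - elite_frac), threshold)
        elite = phi[phi >= level]
        w = np.exp(log_weight(elite, mu))                 # correct back to the nominal p
        mu = float(np.sum(w * elite) / np.sum(w))         # weighted refit of the mean
        if level >= threshold:
            break
    return mu

def is_estimate(threshold, mu, rng, n=10_000):
    """Final importance sampling estimate using the fitted proposal."""
    phi = rng.normal(mu, 1.0, size=n)
    return float(np.mean((phi > threshold) * np.exp(log_weight(phi, mu))))

rng = np.random.default_rng(0)
threshold = 4.0
mu = fit_proposal_mean(threshold, rng)
estimate = is_estimate(threshold, mu, rng)
true_p = 0.5 * math.erfc(threshold / math.sqrt(2))
print(f"fitted mean={mu:.2f}, IS estimate={estimate:.3e}, exact={true_p:.3e}")
```

The intermediate levels matter: fitting directly to samples with φ > 4 would fail at the first iteration, because nominal sampling almost never produces one.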

Interactive figure: Rare Event Estimation, Naive vs. Importance Sampling. Both methods estimate P(failure) at the same sample count; the failure region is the red zone, and IS concentrates its samples there. Controls: failure probability p (default 0.010) and sample count N (default 50).

Importance sampling uses weights w(φ) = p(φ)/q(φ). If you sample a parameter φ that is 10× more likely under q than under p, what weight do you assign?

0.1: downweight because q over-represents this region compared to p
10: upweight because this region is more dangerous
1: the weight always equals 1 to maintain unbiasedness

Chapter 4: Importance Sampling in Practice

The theory of importance sampling is clean. The practice has sharp edges. Here are the key failure modes and how to handle them.

High variance from extreme weights. If q puts nearly zero probability on some region where p has significant mass, the importance weight p/q explodes. A single sample from that region gets an astronomically large weight and dominates the estimate. If q's tails are too light relative to f(φ)p(φ), the variance can be infinite; if q is exactly zero somewhere that f(φ)p(φ) > 0, failures there are never sampled and the estimator is biased.

Self-normalized IS: Instead of using raw weights w_i = p(φ_i)/q(φ_i), use normalized weights w̄_i = w_i / ∑_j w_j. The estimator becomes ∑_i w̄_i f(φ_i). This introduces a small bias that vanishes as N grows, and in exchange it often reduces variance substantially. It also handles un-normalized distributions: you don't need to know the normalizing constant (partition function) of p or q.

Effective sample size (ESS). When weights are very unequal, the effective sample size ESS = (∑_i w_i)² / ∑_i w_i² is much less than N. If ESS = 10 with N = 1000, you're effectively using 1% of your samples. Always check ESS as a diagnostic — if ESS/N < 0.1, your q is a poor approximation of the target and the estimate is unreliable.

Defensive importance sampling: Mix a fraction α of the nominal distribution into q: q(φ) = αp(φ) + (1-α)q₀(φ). Because q ≥ αp everywhere, every weight satisfies w = p/q ≤ 1/α, which prevents extreme weights when q₀ misses some region that p covers. α = 0.1 (10% nominal sampling) is a practical default. This makes the estimator less efficient in theory but much more stable in practice. (This mixing fraction α is unrelated to the CVaR level α of Chapter 1.)

```python
import numpy as np

def importance_sampling_estimate(simulate, p_logpdf, q_logpdf, q_sample, N=1000):
    """
    Self-normalized importance sampling estimate of P(fail).

    simulate: phi -> 1 (failure) or 0 (success)
    p_logpdf: log probability under nominal distribution p
    q_logpdf: log probability under proposal distribution q
    q_sample: function that samples phi from q
    Returns (p_fail, ess): the estimate and the effective sample size.
    """
    phis = [q_sample() for _ in range(N)]
    failures = np.array([simulate(phi) for phi in phis])

    # Log importance weights (numerically stable)
    log_w = np.array([p_logpdf(phi) - q_logpdf(phi) for phi in phis])
    log_w -= log_w.max()   # shift for stability; the constant cancels in the ratios below
    w = np.exp(log_w)

    # Self-normalized estimate
    p_fail = (w * failures).sum() / w.sum()

    # Effective sample size diagnostic
    ess = w.sum()**2 / (w**2).sum()

    return p_fail, ess
```
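The defensive mixture q(φ) = αp(φ) + (1-α)q₀(φ) described earlier in this chapter can be sketched on a toy problem. Here p = N(0, 1), failure means φ > 3, and q₀ = N(3.5, 0.5²) is deliberately narrow, so its tails are lighter than p's; all of these choices and the helper names are assumptions for illustration. Since q ≥ αp everywhere, no weight can exceed 1/α.

```python
import math
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def sample_defensive(rng, n, alpha, q0_mean, q0_std):
    """Draw from q = alpha * p + (1 - alpha) * q0, with p = N(0, 1)."""
    from_nominal = rng.random(n) < alpha
    return np.where(from_nominal,
                    rng.normal(0.0, 1.0, size=n),
                    rng.normal(q0_mean, q0_std, size=n))

def defensive_weights(phi, alpha, q0_mean, q0_std):
    """Importance weights p / q for the defensive mixture."""
    p = normal_pdf(phi, 0.0, 1.0)
    q = alpha * p + (1 - alpha) * normal_pdf(phi, q0_mean, q0_std)
    return p / q

rng = np.random.default_rng(0)
alpha, q0_mean, q0_std, threshold = 0.1, 3.5, 0.5, 3.0
phi = sample_defensive(rng, 20_000, alpha, q0_mean, q0_std)
w = defensive_weights(phi, alpha, q0_mean, q0_std)
estimate = float(np.mean((phi > threshold) * w))
true_p = 0.5 * math.erfc(threshold / math.sqrt(2))
print(f"max weight={w.max():.2f} (bound {1 / alpha:.0f}), estimate={estimate:.3e}, exact={true_p:.3e}")
```

Without the mixture, the weight p/q₀ would grow without bound far from 3.5; with it, the worst case is capped at 1/α = 10.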

You run IS with N=500 samples and find ESS=12. What does this tell you?

Your proposal q is a poor match to the target: only ~12 samples are contributing meaningfully to the estimate
Your estimate is based on effectively 500 samples but with 12 failures
You need to run 12 more rollouts to get a valid estimate

Chapter 5: Robustness Analysis

You've designed an autonomous vehicle policy under the assumption that sensor noise has standard deviation σ = 0.1m. What if it's actually 0.2m? What if the GPS signal degrades in a tunnel? Robustness analysis asks: over what range of environment parameters does your policy remain acceptable?

Formally: parameterize the environment by a vector φ ∈ Φ. Define an acceptability threshold: the policy is acceptable at φ if its performance metric U(π, φ) ≥ u_min. The safe operating envelope is the set of parameters where the policy is acceptable:

Φ_safe = { φ ∈ Φ : U(π, φ) ≥ u_min }

Robustness analysis maps out this set. It answers: "What conditions can my policy handle, and where does it break down?"

How to map the safe envelope: Simple approach: grid search over Φ. At each grid point, estimate U(π, φ) with Monte Carlo rollouts. Color the grid by acceptability. This is expensive (exponential in dim(Φ)) but clear. For high-dimensional Φ, use adaptive sampling: run more rollouts near the boundary ∂Φ_safe where uncertainty is highest.
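The grid-search approach can be sketched in a few lines. The rollout model below is a made-up stand-in (success probability falls linearly with two parameters called `noise` and `aggression`); in a real study `simulate_success` would run the policy in the simulator.

```python
import numpy as np

def simulate_success(noise, aggression, rng):
    """Hypothetical stand-in for one rollout: 1.0 on success, 0.0 on failure."""
    p_success = np.clip(1.0 - noise - 0.5 * aggression, 0.0, 1.0)
    return float(rng.random() < p_success)

def safe_envelope(noise_grid, aggression_grid, u_min, rollouts, rng):
    """Monte Carlo estimate of U(pi, phi) on a grid, plus the mask U >= u_min."""
    U = np.zeros((len(noise_grid), len(aggression_grid)))
    for i, noise in enumerate(noise_grid):
        for j, aggression in enumerate(aggression_grid):
            outcomes = [simulate_success(noise, aggression, rng) for _ in range(rollouts)]
            U[i, j] = np.mean(outcomes)
    return U, U >= u_min

rng = np.random.default_rng(0)
noise_grid = np.linspace(0.0, 1.0, 6)
aggression_grid = np.linspace(0.0, 1.0, 6)
U, safe = safe_envelope(noise_grid, aggression_grid, u_min=0.5, rollouts=200, rng=rng)

# Text rendering of the envelope: '#' = acceptable, '.' = not acceptable
for row in safe:
    print("".join("#" if ok else "." for ok in row))
```

The cost is (grid points) × (rollouts per point), which is why this only scales to a few dimensions; cells whose estimate lies close to u_min are the ones that deserve extra rollouts.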

A key metric from robustness analysis: the robustness margin ρ = min_{φ ∈ boundary of safe set} ||φ − φ_nominal||. Large robustness margin = far from the safe envelope edge. Small margin = the policy barely handles the nominal conditions and fails on small perturbations.
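On a sampled parameter space, the robustness margin can be approximated as the distance from the nominal point to the nearest sample that is not acceptable. The sketch below assumes the nominal point itself is safe and uses a made-up safe set (a disc of radius 0.3); `robustness_margin` is a hypothetical helper.

```python
import numpy as np

def robustness_margin(points, safe_mask, phi_nominal):
    """Approximate rho as the distance from phi_nominal to the nearest unsafe sample."""
    unsafe = points[~safe_mask]
    if unsafe.size == 0:
        return float("inf")          # no failure found anywhere in the sampled region
    return float(np.min(np.linalg.norm(unsafe - phi_nominal, axis=1)))

# Made-up 2D example: the policy is acceptable inside a disc of radius 0.3
axis = np.linspace(0.0, 1.0, 101)
points = np.array([(a, b) for a in axis for b in axis])
center = np.array([0.5, 0.5])
safe_mask = np.linalg.norm(points - center, axis=1) <= 0.3

rho = robustness_margin(points, safe_mask, phi_nominal=center)
print(f"approximate robustness margin: {rho:.3f}")
```

The result is accurate only up to the grid spacing, and in practice each parameter should be rescaled to comparable units before taking the norm.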

Parameter uncertainty matters: Robustness analysis naturally answers "how bad does model error need to be before the policy fails?" If the sensor noise in the real world is anywhere in [0.05, 0.15]m but the policy starts failing at σ = 0.18m, you have a safety margin of 0.03 m (3 cm). If you can't guarantee that your noise model is accurate to within 3 cm, you can't guarantee safe deployment.

Interactive figure: Safe Operating Envelope, 2D Robustness Map. Each cell shows policy performance as two environment parameters vary, and the dashed region is the safe envelope (U ≥ u_min). Control: acceptability threshold u_min (default 0.50).

A policy's safe operating envelope is the set {φ : U(π, φ) ≥ u_min}. If you raise u_min, what happens to the safe envelope?

It shrinks: higher requirements are harder to meet, excluding more parameter values
It expands: stricter thresholds force the policy to improve
It stays the same: the policy's performance doesn't depend on u_min

Chapter 6: Trade Analysis

A self-driving car can prioritize passenger comfort (smooth, gentle maneuvers) or travel time (fast, aggressive maneuvers). A medical robot can prioritize treatment efficacy (more aggressive intervention) or patient safety (conservative, low-risk). These are not the same objective — improving one hurts the other.

When you have two or more objectives that conflict, you can't find a single "best" policy. Instead, you find the set of Pareto-optimal policies: policies where you cannot improve one objective without worsening another. This set is called the Pareto frontier.

Pareto dominance: Policy π dominates policy π′ if π is at least as good as π′ on all objectives AND strictly better on at least one. The Pareto frontier consists of all non-dominated policies. Any policy not on the frontier is Pareto-dominated — there exists a strictly better alternative.

How to compute the Pareto frontier for a policy class {π_θ} : sample many policies (by varying θ or by multi-objective optimization), evaluate each on all objectives via Monte Carlo rollouts, and keep only the non-dominated ones.
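The final filtering step (keep only the non-dominated policies) is simple to write down. This sketch assumes every objective is "higher is better" and that the scores have already been estimated by rollouts; the four score pairs are made up.

```python
import numpy as np

def pareto_mask(scores):
    """scores[i, k] is objective k for policy i. Returns True for non-dominated policies."""
    scores = np.asarray(scores, dtype=float)
    mask = np.ones(len(scores), dtype=bool)
    for i, s in enumerate(scores):
        # A policy dominates s if it is >= on every objective and > on at least one
        dominates_s = np.all(scores >= s, axis=1) & np.any(scores > s, axis=1)
        mask[i] = not dominates_s.any()
    return mask

scores = np.array([
    [0.9, 0.2],
    [0.6, 0.7],
    [0.5, 0.6],   # dominated by [0.6, 0.7]
    [0.3, 0.9],
])
print(pareto_mask(scores))
```

This pairwise check costs O(n²) comparisons, which is fine for hundreds of candidate policies. Because the scores are Monte Carlo estimates, policies near the frontier may swap in and out between runs.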

The key insight for validation: a policy should be evaluated not just at its design point but across the Pareto frontier. The system designer might prefer a different tradeoff than the engineer assumed. Presenting the Pareto frontier makes the tradeoffs explicit and separable from the technical optimization.

Scalarization: One approach to multi-objective optimization is to scalarize the objectives into one: U(π) = λ₁U₁(π) + λ₂U₂(π). By varying λ, you trace out the Pareto frontier. But this weighted-sum scalarization implicitly assumes the frontier is convex. For non-convex frontiers, some Pareto-optimal points are unreachable for any choice of weights. Use specialized multi-objective algorithms (NSGA-II, MOEA/D) to handle non-convex frontiers.
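A three-policy example makes the non-convexity problem visible. In this sketch the made-up policy with scores (0.4, 0.4) is Pareto-optimal, since neither extreme policy dominates it, yet no weighted sum ever selects it. The weights are parameterized as λ₁ = λ and λ₂ = 1 - λ, which is an assumption made for the sweep.

```python
import numpy as np

def scalarization_winners(scores, num_weights=101):
    """Indices of policies that maximize lam * U1 + (1 - lam) * U2 for some lam in [0, 1]."""
    winners = set()
    for lam in np.linspace(0.0, 1.0, num_weights):
        utility = lam * scores[:, 0] + (1.0 - lam) * scores[:, 1]
        winners.add(int(np.argmax(utility)))
    return winners

# Made-up scores: two extreme policies and one balanced policy in a non-convex "dent"
scores = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.4, 0.4],
])
winners = scalarization_winners(scores)
print(f"policies reachable by weighted sums: {sorted(winners)}")
```

For any λ, one of the extreme policies scores at least 0.5 while the balanced policy scores exactly 0.4, so index 2 never appears among the winners.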

Interactive figure: Pareto Frontier, Performance vs. Safety. Each dot is a candidate policy evaluated on two objectives; gold dots form the Pareto frontier, and clicking a dot shows its tradeoff.

Policy A achieves (performance=0.8, safety=0.6). Policy B achieves (performance=0.7, safety=0.9). Policy C achieves (performance=0.75, safety=0.55). Which policies are Pareto-optimal?

A and B: neither dominates the other; C is dominated by A (A is better on both objectives)
A, B, and C: all three are Pareto-optimal
Only B: it has the highest safety

Chapter 7: Adversarial Analysis

Robustness analysis maps performance across all parameters. Adversarial analysis asks a sharper question: what is the single worst parameter setting for my policy? An adversary is trying to find φ that causes maximum damage. What will they find?

Formally, adversarial analysis solves:

φ* = argmin_{φ ∈ Φ} U(π, φ)

This is an optimization problem: minimize policy performance over environment parameters. The result φ* is the worst-case environment. U(π, φ*) is the worst-case performance. If U(π, φ*) is acceptable and φ* is the true global minimizer, then no environment in Φ can break the policy.

The adversary as a test oracle: An adversarial agent that optimally selects φ to minimize policy performance is the best possible stress test. Unlike random simulation (which might never find the failure mode) or human-designed test cases (which might miss non-obvious failure modes), an adversary systematically searches for the worst case. This is the philosophy behind adversarial testing and red-teaming.

How to solve the adversarial problem? If U(π, φ) is differentiable in φ, use gradient descent on U(π, φ) (equivalently, gradient ascent on the negative return). If not (because φ affects the environment structure in non-differentiable ways), use the derivative-free policy search methods from Ch 10: CEM or evolution strategies with U(π, φ) as the objective to minimize.
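For the derivative-free case, a cross-entropy search over φ might look like the sketch below. The performance function is a made-up smooth surrogate on Φ = [0, 1]² whose minimum (0.35) sits at φ = (0.9, 0.2); in a real study `performance` would be a Monte Carlo estimate from rollouts. The hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

def performance(phi):
    """Hypothetical U(pi, phi): close to 1 almost everywhere, dipping to 0.35 at (0.9, 0.2)."""
    worst = np.array([0.9, 0.2])
    return 1.0 - 0.65 * np.exp(-np.sum((phi - worst) ** 2, axis=-1) / 0.1)

def cem_adversary(U, dim, rng, iters=30, pop=100, elite=10):
    """Cross-entropy search for the phi in [0, 1]^dim that minimizes U."""
    mean, std = np.full(dim, 0.5), np.full(dim, 0.3)
    for _ in range(iters):
        phi = np.clip(rng.normal(mean, std, size=(pop, dim)), 0.0, 1.0)
        best = np.argsort(U(phi))[:elite]            # lowest performance = most damaging
        mean = phi[best].mean(axis=0)
        std = phi[best].std(axis=0) + 1e-3           # small floor avoids total collapse
    return mean, float(U(mean))

rng = np.random.default_rng(0)
phi_star, u_worst = cem_adversary(performance, dim=2, rng=rng)
u_min = 0.5
print(f"worst-case phi ~ {np.round(phi_star, 2)}, U = {u_worst:.2f}, acceptable: {u_worst >= u_min}")
```

On this single-basin surrogate one run suffices. With several failure basins, the search should be restarted from different initial means, since CEM can settle into a local worst case.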

Multi-armed adversary: In practice, φ might include multiple independent environment parameters. An adversarial agent trained with RL (using the negative of π's return as its reward) can find complex parameter combinations that no human would think to test. This is increasingly used in AI safety research to find failure modes of neural policies before deployment.

Adversarial analysis limitations: Finding the true worst-case φ is as hard as policy optimization itself. You may find a local worst case, not the global one. Multiple runs from different φ₀ initializations help map the failure landscape. Also: adversarial analysis assumes an adversary constrained to Φ. If the real world can go outside Φ (distribution shift), adversarial analysis within Φ misses it.

Adversarial analysis finds φ* that minimizes U(π, φ) and gets U(π, φ*) = 0.35. The acceptability threshold is u_min = 0.5. What can you conclude?

The policy is not robust: there exists a parameter setting in Φ where it performs below the threshold
The policy is robust: it achieves 0.35 even in the worst case
Cannot determine without knowing the nominal performance

Chapter 8: Showcase — Policy Stress Test

A simple car-following policy tries to maintain a safe distance from a lead vehicle. We validate it against three environment parameters: sensor noise level, lead vehicle aggression, and road surface friction. Watch nominal simulation, importance-sampled rare events, and the adversarial worst-case all running simultaneously.

What to observe: Under nominal conditions (low noise, gentle lead car), the policy rarely fails. Increase noise or aggression to see failure rates rise. The IS estimator (gold) tracks the true rate much more efficiently than naive simulation (teal). The adversary (red) finds the worst parameter combination automatically.

Interactive demo: Car-Following Policy Stress Test. Controls: sensor noise (default 0.20), lead aggression (default 0.20), and friction (1 = dry, 0 = ice; default 0.80).

Chapter 9: Connections & What's Next

Policy validation is the bridge between algorithm development and deployment. A policy that scores well in training but hasn't been stress-tested is a liability, not an asset.

Method	What it finds	Cost
Nominal simulation	Average performance in typical conditions	Low — m · T steps
Importance sampling	Rare failure probability	Medium — IS overhead
Robustness analysis	Safe operating envelope over Φ	High — grid over Φ
Trade analysis (Pareto)	Optimal tradeoffs between objectives	High — many policy evaluations
Adversarial analysis	Worst-case environment parameters	High — optimization over Φ

The validation stack: These methods are complementary, not competing. In practice: (1) Always run nominal simulation first as a sanity check. (2) Use importance sampling to estimate rare failure probability once you believe the policy is roughly correct. (3) Use robustness analysis to understand the operating envelope. (4) Use adversarial analysis to verify worst-case bounds. All four are needed for safety-critical deployment.

Connection to Ch 10 (Policy Search): Adversarial analysis is policy search applied to the environment parameter space: find φ* = argmin U(π, φ). The CEM and evolution strategies from Ch 10 work directly here. The "policy" being optimized is the adversary's choice of φ; the "environment" is the fixed agent policy π.

Related chapters:

Ch 10: Policy Search — CEM and ES used in adversarial analysis
Ch 13: Actor-Critic — policies being validated
Ch 15: Exploration — active validation via exploration

"In God we trust, all others bring data." — W. Edwards Deming. A policy that hasn't been validated is not a belief — it's faith.

You have a limited budget of 1000 simulator calls to validate a safety-critical policy. The nominal failure probability is estimated at 0.001. Which validation method gives you the most useful information?

Importance sampling: 1000 IS samples can estimate p=0.001 far more accurately than 1000 naive samples
Nominal simulation: 1000 random rollouts is the most unbiased estimate
Trade analysis: understanding objectives matters more than failure probability