Before you can validate a system, you need a model of how it behaves. All models are wrong. Some are useful.
You want to validate a self-driving car. The gold standard would be to drive it for a billion miles on real roads and count the crashes. But a billion miles at 30 mph takes about 3,800 years of continuous driving for a single car. And every crash during the evaluation puts real people at risk. Real testing is too slow, too expensive, and too dangerous.
The alternative: build a model — a mathematical representation of the system and its environment — and test the system inside the model. Simulate a billion miles in hours instead of millennia. Make the virtual pedestrians jump into the road, and nobody gets hurt.
This chapter covers the tools for building these models. We will learn how to describe uncertainty with probability distributions, estimate model parameters from data, handle agents whose policies we do not know, and — crucially — check whether the model we built is actually any good.
The chain of reasoning is: (1) collect data from the real system, (2) fit a model to the data, (3) validate the system inside the model. If step 2 is bad, step 3 is meaningless — you would be testing the car in a world that does not resemble reality. Model quality is the foundation everything rests on.
Models are built from probability distributions. Before we estimate anything, we need a quick refresher on the building blocks: PMFs, PDFs, and the Gaussian.
A probability mass function (PMF) describes a discrete random variable. If a die has outcomes {1,2,3,4,5,6}, the PMF assigns a probability to each: P(X = k). For a fair die, P(X = k) = 1/6 for all k. The probabilities must sum to 1.
A probability density function (PDF) describes a continuous random variable. The probability that X falls in an interval [a, b] is the area under the curve: $P(a \le X \le b) = \int_a^b f(x)\,dx$. The density f(x) itself can be greater than 1; what matters is that the total area under the curve is 1.
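The claim that a density can exceed 1 while the total area stays 1 can be checked numerically. The sketch below assumes a hypothetical Uniform(0, 0.5) variable and uses a simple midpoint rule; `uniform_pdf` and `integrate` are illustrative names, not part of any library.

```python
# Sketch: a density can exceed 1 while its total area stays 1.
# Assumption for illustration: X ~ Uniform(0, 0.5), so f(x) = 2 on [0, 0.5].

def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of a Uniform(lo, hi) random variable."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

peak_density = uniform_pdf(0.25)                   # density value, larger than 1
total_area = integrate(uniform_pdf, 0.0, 0.5)      # area under the whole curve
prob_interval = integrate(uniform_pdf, 0.1, 0.2)   # P(0.1 <= X <= 0.2)
```

The density value is 2 everywhere on the support, yet probabilities, which are areas, never exceed 1.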
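The most important continuous distribution is the Gaussian (normal distribution), with two parameters: mean $\mu$ (center) and variance $\sigma^2$ (spread):

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$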
The Gaussian is everywhere in modeling. Sensor noise and measurement errors are routinely modeled as Gaussian. The central limit theorem says that the sum of many small independent effects with finite variance is approximately Gaussian, largely regardless of the individual distributions. This is why it appears so often in practice.
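The central limit theorem can be illustrated with a small simulation. This is a minimal sketch, assuming each "small effect" is a Uniform(0, 1) draw and that 12 of them are summed (so the centered sum has mean 0 and variance 1); the function name `sum_of_uniforms` is made up for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def sum_of_uniforms(n_terms=12):
    """Centered sum of n_terms Uniform(0, 1) draws.

    Each term has variance 1/12, so with 12 terms the sum has variance 1.
    """
    return sum(random.random() for _ in range(n_terms)) - n_terms / 2

samples = [sum_of_uniforms() for _ in range(20_000)]

sample_mean = statistics.fmean(samples)      # should be close to 0
sample_var = statistics.pvariance(samples)   # should be close to 1

# For a standard Gaussian, roughly 68% of the mass lies within one
# standard deviation of the mean.
frac_within_1sd = sum(abs(s) <= 1.0 for s in samples) / len(samples)
```

None of the individual terms is Gaussian, but the sum already behaves much like one.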
Interactive figure: a Gaussian bell curve whose position and shape change with μ (mean) and σ (standard deviation).
| Distribution | Type | Parameters | Use in validation |
|---|---|---|---|
| Uniform | Continuous | a, b (range) | Uninformative priors |
| Gaussian | Continuous | $\mu, \sigma^2$ | Sensor noise, positions, velocities |
| Bernoulli | Discrete | p | Binary outcomes (crash / no crash) |
| Categorical | Discrete | $p_1, \dots, p_k$ | Discrete actions or states |
A single Gaussian is a unimodal bump. But the real world is often multimodal — a pedestrian at an intersection might go left, right, or straight, each with its own cluster of trajectories. A single Gaussian cannot capture this. We need richer distributions.
A Gaussian mixture model (GMM) solves this by combining K Gaussians, each weighted by $\pi_k$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)$$

The weights $\pi_k$ are nonnegative and sum to 1. Each component has its own mean $\mu_k$ and variance $\sigma_k^2$. The mixture can have bumps wherever it needs them. With enough components, a GMM can approximate any continuous density arbitrarily closely.
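Evaluating and sampling a mixture takes only a few lines. The sketch below assumes a hypothetical three-component mixture (the weights, means, and variances are made up for illustration) and checks numerically that the mixture density still integrates to about 1.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical mixture parameters, chosen only for illustration.
weights = [0.3, 0.5, 0.2]      # mixture weights, sum to 1
means = [-2.0, 0.0, 3.0]
variances = [0.5, 1.0, 0.8]

def gmm_pdf(x):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def gmm_sample(rng):
    """Pick a component according to the weights, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], math.sqrt(variances[k]))

rng = random.Random(0)
draws = [gmm_sample(rng) for _ in range(5)]

# Riemann-sum check over [-10, 10] that the density has total area near 1.
grid_step = 0.01
area = sum(gmm_pdf(-10 + i * grid_step) for i in range(2001)) * grid_step
```

Sampling is a two-stage process: a discrete choice of component followed by a Gaussian draw, which is what makes the result multimodal.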
For even more flexibility, normalizing flows take a simple base distribution (say, a standard Gaussian) and push it through a series of invertible transformations. Each transformation stretches and bends the distribution into a more complex shape. The key constraint: each transformation must be invertible, so we can compute both the forward mapping (sample generation) and the backward mapping (density evaluation).
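A one-dimensional toy flow makes the two directions concrete. This is a sketch under stated assumptions, not a practical flow: the transformation is a fixed affine map followed by an exponential, with hypothetical constants `A` and `B`, and the density comes from the change-of-variables rule $\log p_X(x) = \log p_Z(z) - \log\left|dx/dz\right|$.

```python
import math
import random

# Hypothetical flow parameters (A must be positive for this sketch).
A, B = 0.5, 1.0

def forward(z):
    """Forward mapping (sample generation): x = exp(A * z + B)."""
    return math.exp(A * z + B)

def inverse(x):
    """Backward mapping (needed for density evaluation): z = (log x - B) / A."""
    return (math.log(x) - B) / A

def log_density(x):
    """Log-density of x under the flow, for x > 0."""
    z = inverse(x)
    log_base = -0.5 * z * z - 0.5 * math.log(2 * math.pi)  # standard Gaussian
    log_abs_det = math.log(A * x)                          # |dx/dz| = A * x
    return log_base - log_abs_det

rng = random.Random(0)
samples = [forward(rng.gauss(0.0, 1.0)) for _ in range(5)]

# Midpoint-rule check that the transformed density integrates to about 1.
step = 0.01
area = sum(math.exp(log_density((i + 0.5) * step)) for i in range(6000)) * step
```

Real normalizing flows stack many learned invertible layers, but each layer contributes to the density in the same way: through the log of its Jacobian determinant.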
Interactive figure: a mixture of 3 Gaussians with an adjustable weight for each component. The total distribution is drawn in white.
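The models from Chapter 1 describe distributions of individual variables. But validation models need conditional distributions, which describe how one quantity depends on another. The three core conditional distributions from the POMDP framework are the transition model $T(s' \mid s, a)$, the observation model $O(o \mid s)$, and the policy $\pi(a \mid o)$.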
Each of these is a distribution parameterized by its conditioning variables. T(s'|s,a) is not a single distribution — it is a different distribution for every (s,a) pair. If there are 100 states and 5 actions, you need to specify 500 distributions (one per pair).
For continuous states, conditional distributions are often parameterized as Gaussians whose mean depends on the conditioning variables:

$$T(s' \mid s, a) = \mathcal{N}\big(s' \mid f(s, a), \Sigma\big)$$
Here f(s,a) is a function (could be a physics simulator, a neural network, or a simple linear model) that predicts the expected next state. Σ captures the noise. The entire validation pipeline hinges on getting these conditional distributions right.
| Distribution | Conditioning on | What it models |
|---|---|---|
| $T(s' \mid s, a)$ | State + action | How the world changes |
| $O(o \mid s)$ | State | What the sensors see |
| $\pi(a \mid o)$ | Observation | How the agent (or other agents) behave |
| $p(s_1)$ | Nothing | Initial conditions |
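A Gaussian transition model whose mean is a function f(s, a) can be sketched in a few lines. The dynamics below are a hypothetical one-dimensional example (position advancing at a commanded velocity); `DT`, `NOISE_STD`, and the function names are assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)

DT = 0.1          # hypothetical time step
NOISE_STD = 0.05  # hypothetical standard deviation of the transition noise

def f(s, a):
    """Hypothetical deterministic dynamics: position s advances at velocity a."""
    return s + a * DT

def sample_next_state(s, a):
    """Draw s' from a Gaussian centered on the predicted next state f(s, a)."""
    return rng.gauss(f(s, a), NOISE_STD)

def transition_logpdf(s_next, s, a):
    """Log-density of s' given (s, a) under the same Gaussian model."""
    var = NOISE_STD ** 2
    return -0.5 * math.log(2 * math.pi * var) - (s_next - f(s, a)) ** 2 / (2 * var)

# Every (s, a) pair defines its own distribution; here is one of them.
samples = [sample_next_state(2.0, 5.0) for _ in range(10_000)]
empirical_mean = sum(samples) / len(samples)   # should be close to f(2.0, 5.0)
```

Swapping `f` for a physics simulator or a learned network changes the mean function but leaves the sampling and density code untouched.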
You have data: N measurements $x_1, x_2, \dots, x_N$. You believe they come from a Gaussian with unknown mean μ and standard deviation σ. How do you find the best μ and σ?
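The idea: choose parameters that make the observed data most probable. This is maximum likelihood estimation (MLE). The likelihood is the probability of seeing your data given the parameters, assuming the measurements are independent:

$$L(\mu, \sigma) = \prod_{i=1}^{N} p(x_i \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

We want to maximize L. Since products are annoying, we take the log (the log is monotonic, so maximizing log L is the same as maximizing L):

$$\log L(\mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2$$

Taking derivatives and setting them to zero gives the MLE solutions:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\hat{\mu})^2$$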
The MLE mean is the sample mean. The MLE variance is the average squared deviation. These are the "best-fit" parameters — the Gaussian that gives the observed data the highest probability.
Interactive figure: data points (orange dots) with a Gaussian positioned by μ and σ. The log-likelihood updates as the parameters change, and a Find MLE button runs gradient ascent until it converges to the optimum.
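```python
import numpy as np

data = np.array([2.1, 3.5, 2.8, 3.2, 2.5])

# Closed-form maximum likelihood estimates for a Gaussian
mu_mle = data.mean()                       # 2.82
sig2_mle = ((data - mu_mle) ** 2).mean()   # 0.2456

def log_likelihood(data, mu, sig2):
    """Gaussian log-likelihood of data for mean mu and variance sig2."""
    n = len(data)
    return -n / 2 * np.log(2 * np.pi * sig2) - np.sum((data - mu) ** 2) / (2 * sig2)
```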
MLE gives you a single best estimate. But what if you have prior knowledge? And what if you want to know how uncertain you are about the parameters? Bayesian learning answers both questions by treating the parameters themselves as random variables with a distribution.
Bayes' theorem says:

$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})}$$
In words: the posterior (what we believe after seeing data) is proportional to the likelihood (how probable the data is given θ) times the prior (what we believed before seeing data). The denominator p(data) is just a normalizing constant.
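The Beta distribution is the natural prior for a probability parameter θ ∈ [0, 1], such as the probability that a coin lands heads. It is parameterized by α and β:

$$p(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$$

where $B(\alpha, \beta)$ is the normalizing constant (the Beta function).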
After observing h heads and t tails, the posterior is Beta(α + h, β + t). This is called a conjugate prior — the posterior has the same form as the prior, with updated parameters. It makes Bayesian computation tractable.
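The conjugate update is just addition on the two parameters. The sketch below assumes a Beta(2, 2) prior and a hypothetical run of 7 heads and 3 tails; `beta_update` and `beta_mean` are illustrative helper names.

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + h, beta + t)."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2, 2)            # assumed prior, symmetric around 0.5
heads, tails = 7, 3       # hypothetical observations

posterior = beta_update(*prior, heads, tails)   # Beta(9, 5)
posterior_mean = beta_mean(*posterior)          # 9 / 14
mle = heads / (heads + tails)                   # 7 / 10, ignores the prior
```

The posterior mean sits between the prior mean (0.5) and the maximum likelihood estimate (0.7); with more data it moves closer to the latter.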
Interactive figure: a coin-flip demo in which the posterior Beta distribution updates after every flip. The prior (dotted) is Beta(2,2). The true bias is hidden and has to be inferred from the flips.
A model that fits the training data perfectly is not necessarily a good model. Consider fitting a polynomial to 10 data points. A degree-9 polynomial passes through every single point — zero training error. But between the points, it oscillates wildly. On new data, it performs terribly. This is overfitting: the model has memorized the noise in the training data instead of learning the underlying pattern.
We detect overfitting by measuring performance on held-out test data — data the model was not trained on. The gap between training performance and test performance is the generalization gap. A large gap signals overfitting.
Interactive figure: a polynomial fit whose degree is adjustable. Low degree = underfitting (misses the pattern). High degree = overfitting (passes through every point but oscillates wildly). The sweet spot lies in between.
Cross-validation is a systematic way to tune model complexity. Split the data into K folds. Train on K−1 folds, test on the held-out fold. Repeat K times and average the test error. The model complexity (degree, number of parameters, regularization strength) that minimizes cross-validation error is the best trade-off between underfitting and overfitting.
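K-fold cross-validation for choosing a polynomial degree can be sketched as follows. The data are synthetic and hypothetical (a noisy cubic), the fold count and candidate degrees are arbitrary choices, and `cv_error` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a cubic trend plus Gaussian noise.
x = np.linspace(-2, 2, 30)
y = x**3 - x + rng.normal(0, 0.1, size=x.size)

# Split the shuffled indices into K = 5 folds, reused for every degree.
folds = np.array_split(rng.permutation(x.size), 5)

def cv_error(degree):
    """Average held-out mean squared error over the K folds."""
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(x.size), held_out)
        coeffs = np.polyfit(x[train], y[train], degree)   # fit on K-1 folds
        pred = np.polyval(coeffs, x[held_out])            # test on held-out fold
        errors.append(np.mean((pred - y[held_out]) ** 2))
    return float(np.mean(errors))

cv_errors = {d: cv_error(d) for d in range(1, 7)}
best_degree = min(cv_errors, key=cv_errors.get)
```

Reusing the same folds for every candidate keeps the comparison fair: differences in error come from model complexity, not from a different split.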
To validate an autonomous vehicle, you need to simulate not just your car, but all the other drivers on the road. Their policies π(a|o) are unknown — you cannot look inside their heads. But you have logged data: recordings of what they did in real traffic. Can you learn a policy from this data?
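Behavioral cloning (BC) does exactly this: treat the logged observation-action pairs as supervised learning data, and train a model $\pi_\theta$ to predict actions from observations. Given data $\{(o_1, a_1), \dots, (o_N, a_N)\}$, minimize:

$$\min_\theta \; \frac{1}{N}\sum_{i=1}^{N} \ell\big(\pi_\theta(o_i), a_i\big)$$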
where ℓ is a loss function (cross-entropy for discrete actions, MSE for continuous).
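For continuous actions with an MSE loss, behavioral cloning reduces to ordinary regression. The sketch below assumes a hypothetical expert that steers proportionally to its observation (gain -0.8, plus noise) and a one-parameter linear policy, for which the MSE minimizer has a closed form.

```python
import random

rng = random.Random(0)

def expert_policy(o):
    """Hypothetical expert: proportional response with a little noise."""
    return -0.8 * o + rng.gauss(0.0, 0.05)

# Logged observation-action pairs (synthetic stand-in for recorded driving).
observations = [rng.uniform(-1.0, 1.0) for _ in range(200)]
actions = [expert_policy(o) for o in observations]

# Linear policy pi_theta(o) = theta * o. Minimizing the mean squared error
# over theta gives the least-squares solution below.
theta = (sum(o * a for o, a in zip(observations, actions))
         / sum(o * o for o in observations))

def cloned_policy(o):
    """The learned imitation of the expert."""
    return theta * o

mse = sum((cloned_policy(o) - a) ** 2
          for o, a in zip(observations, actions)) / len(actions)
```

With a richer policy class such as a neural network, the closed form is replaced by gradient descent, but the objective is the same.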
Interactive figure: the expert (teal) follows a path. The cloned agent (orange) starts on the same path but accumulates small errors and drifts away. A Noise slider controls the per-step error magnitude.
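Behavioral cloning has a weakness: small errors compound, carrying the cloned agent into states that never appear in the expert's data, where its predictions are unreliable. DAgger (Dataset Aggregation) is the fix. After the cloned agent makes errors and drifts, you ask the expert to label the new states the agent visits. Add these to the dataset and retrain. Iterate. This progressively covers the states the cloned agent actually encounters, closing the distribution shift gap.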
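The DAgger loop can be sketched on a toy problem. Everything here is a simplifying assumption: a one-dimensional state, a hypothetical saturating expert, a linear learner fit by least squares, and a simplified variant in which the learner alone drives every rollout after the first (the original algorithm can also mix expert and learner actions).

```python
import random

rng = random.Random(0)

def expert(s):
    """Hypothetical expert: steer toward s = 0, with actions clipped to [-1, 1]."""
    return max(-1.0, min(1.0, -s))

def step(s, a):
    """Hypothetical dynamics: the action shifts the state, plus small noise."""
    return s + 0.5 * a + rng.gauss(0.0, 0.05)

def fit_linear_policy(dataset):
    """Behavioral cloning step: least-squares fit of a = gain * s."""
    numerator = sum(s * a for s, a in dataset)
    denominator = sum(s * s for s, _ in dataset)
    return numerator / denominator

def rollout(policy, s0, horizon=20):
    """Return the states visited when running policy from s0."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = step(s, policy(s))
    return states

# Iteration 0: plain behavioral cloning on states the expert visits.
dataset = [(s, expert(s)) for s in rollout(expert, s0=3.0)]
gain = fit_linear_policy(dataset)
sizes = [len(dataset)]

for _ in range(5):
    learner = lambda s, g=gain: g * s
    visited = rollout(learner, s0=3.0)             # states the learner reaches
    dataset += [(s, expert(s)) for s in visited]   # expert labels those states
    gain = fit_linear_policy(dataset)              # retrain on the aggregate
    sizes.append(len(dataset))
```

The key difference from plain behavioral cloning is where the training states come from: after the first iteration they are the learner's own states, labeled by the expert.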
Not every agent needs to be cloned from data. Sometimes we can build behavior models from first principles using decision-theoretic assumptions about how agents act. The key idea: agents are approximately rational, but not perfectly so.
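The softmax response model says an agent picks actions with probability proportional to the exponentiated value of each action:

$$P(a \mid s) = \frac{\exp\big(\lambda\, Q(s, a)\big)}{\sum_{a'} \exp\big(\lambda\, Q(s, a')\big)}$$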
Here Q(s, a) is the value of taking action a in state s, and λ is the rationality parameter. When λ = 0, the agent is completely random (uniform over actions). When λ → ∞, the agent always picks the best action (perfectly rational). Real agents live somewhere in between.
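The effect of λ is easy to see numerically. This sketch assumes three actions with illustrative values 3, 1, and 2, and subtracts the maximum before exponentiating for numerical stability; `softmax_response` is a hypothetical helper name.

```python
import math

def softmax_response(q_values, lam):
    """Action probabilities proportional to exp(lam * Q)."""
    scaled = [lam * q for q in q_values]
    shift = max(scaled)                       # avoids overflow for large lam
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

q_values = [3.0, 1.0, 2.0]   # illustrative action values

probs_random = softmax_response(q_values, 0.0)     # lam = 0: uniform
probs_mid = softmax_response(q_values, 2.0)        # prefers better actions
probs_rational = softmax_response(q_values, 50.0)  # nearly deterministic
```

Because only differences in λ·Q matter, shifting all values by a constant leaves the probabilities unchanged.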
Level-k reasoning takes this further. A level-0 agent acts randomly. A level-1 agent best-responds to level-0 agents. A level-2 agent best-responds to level-1 agents. And so on. This hierarchy captures the idea that not everyone thinks the same number of steps ahead.
Interactive figure: three actions with values $Q(a_1)=3$, $Q(a_2)=1$, $Q(a_3)=2$. As λ increases, the action probabilities shift from uniform (random) to peaked (rational).
| λ | Behavior | Interpretation |
|---|---|---|
| 0 | Uniform random | Completely irrational |
| 0.5 | Slight preference for good actions | Inattentive driver |
| 2 | Strong preference for good actions | Normal driver |
| ∞ | Always picks the best action | Perfectly rational |
You built a model. But is it any good? A model that does not match reality is worse than useless — it gives false confidence. This chapter covers three tools for checking model quality.
Q-Q plots (quantile-quantile plots) compare the quantiles of model outputs against the quantiles of real data. If the model is correct, the points lie on a straight diagonal line. Deviations reveal systematic mismatches: a constant offset from the diagonal means the mean is wrong, a slope different from 1 means the variance is wrong, and an S-shape or curved ends mean the tails are wrong.
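The points of a Q-Q plot are just matched quantiles, so the diagnosis can be done without drawing anything. The sketch below uses synthetic, hypothetical samples in which the model has both the wrong mean and the wrong spread, then fits a line through the quantile pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for logged real data, and samples from a deliberately wrong model.
real = rng.normal(loc=1.0, scale=2.0, size=5000)
model = rng.normal(loc=0.0, scale=1.0, size=5000)

# Matched quantiles: these pairs are the points of the Q-Q plot.
probs = np.linspace(0.05, 0.95, 19)
model_q = np.quantile(model, probs)
real_q = np.quantile(real, probs)

# A straight-line fit summarizes the mismatch: the intercept reflects a
# shift in location, the slope a mismatch in scale.
slope, intercept = np.polyfit(model_q, real_q, 1)
```

A perfect model would give a slope near 1 and an intercept near 0; here both are visibly off, matching the way the samples were generated.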
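Calibration plots check probabilistic predictions. If the model says P(crash) = 0.3, then among all situations where it predicted 0.3, about 30% should actually be crashes. A well-calibrated model has its calibration curve (observed frequency against predicted probability) on the diagonal. An overconfident model makes predictions that are too extreme, so at high predicted probabilities its curve falls below the diagonal; an underconfident model's curve lies above the diagonal there.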
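A calibration curve is built by binning predictions and comparing each bin's average prediction with the observed event frequency. The sketch below uses synthetic, hypothetical data that are calibrated by construction; `calibration_curve` here is a hand-written helper, not a library call.

```python
import random

rng = random.Random(0)

# Synthetic predictions and outcomes; each outcome is drawn with exactly
# the predicted probability, so the model is calibrated by construction.
predicted = [rng.random() for _ in range(20_000)]
outcomes = [1 if rng.random() < p else 0 for p in predicted]

def calibration_curve(predicted, outcomes, n_bins=10):
    """Return (mean predicted probability, observed frequency) per bin."""
    sum_pred = [0.0] * n_bins
    sum_event = [0] * n_bins
    count = [0] * n_bins
    for p, y in zip(predicted, outcomes):
        b = min(int(p * n_bins), n_bins - 1)   # bin index for this prediction
        sum_pred[b] += p
        sum_event[b] += y
        count[b] += 1
    return [(sum_pred[b] / count[b], sum_event[b] / count[b])
            for b in range(n_bins) if count[b] > 0]

curve = calibration_curve(predicted, outcomes)

# Largest vertical distance between the curve and the diagonal.
max_gap = max(abs(mean_p - freq) for mean_p, freq in curve)
```

For a real model, the same function applied to held-out predictions shows in which probability ranges the model can be trusted.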
KL divergence measures how different two distributions are:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

If p is the true distribution and q is the model, $D_{\mathrm{KL}}(p \,\|\, q) = 0$ means a perfect match. Larger values mean the model is farther from reality. KL divergence is asymmetric: in general $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$. In this direction it penalizes the model heavily for assigning low probability to events that are actually common (q(x) small where p(x) is large).
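For discrete distributions the KL divergence is a short sum. The sketch below uses two hypothetical five-outcome distributions and follows the usual conventions: terms with p(x) = 0 contribute nothing, and the divergence is infinite when q(x) = 0 for an outcome that p considers possible.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue          # convention: 0 * log(0 / q) = 0
        if q_x == 0.0:
            return math.inf   # model rules out something that happens
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical true distribution and two candidate models.
p = [0.4, 0.3, 0.15, 0.1, 0.05]
q_uniform = [0.2, 0.2, 0.2, 0.2, 0.2]
q_with_zero = [0.5, 0.0, 0.2, 0.2, 0.1]   # zero mass on a likely outcome

kl_self = kl_divergence(p, p)              # perfect match
kl_forward = kl_divergence(p, q_uniform)
kl_reverse = kl_divergence(q_uniform, p)   # differs from kl_forward
kl_blowup = kl_divergence(p, q_with_zero)
```

The infinite value in the last case is the numerical face of the asymmetry: a model that declares a real event impossible is penalized without bound.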
Interactive figure: the true distribution p (teal bars) has 5 outcomes, and the model distribution q (orange bars) can be adjusted to match it. The KL divergence $D_{\mathrm{KL}}(p \,\|\, q)$ is computed in real time. When q assigns zero probability to a likely outcome, the KL divergence diverges.
This chapter built the modeling toolkit. To validate a system in simulation, you need models of the world (transitions), the sensors (observations), and other agents (policies). Each model is a probability distribution estimated from data, and each must be validated against reality before trusting it.
| Technique | What it does | Strength | Weakness |
|---|---|---|---|
| MLE | Point estimate of parameters | Simple, efficient | No uncertainty quantification |
| Bayesian learning | Full posterior over parameters | Handles small data, quantifies uncertainty | Requires prior, computation cost |
| GMM / Flows | Complex distribution shapes | Captures multimodality | More parameters to estimate |
| Behavioral cloning | Learns policy from demonstrations | Simple supervised learning | Compounding errors over time |
| Softmax model | Parameterized rationality | Interpretable, controllable | Assumes Q-values are known |
What comes next:
Chapter 3 tackles specification: how do you formally write down what "safe" means? Metrics, temporal logic, reachability sets — the language of requirements.
The modeling pipeline:
Data → Choose distribution family → Fit parameters (MLE/Bayes) → Check model (Q-Q, calibration, KL) → Use for validation. If the check fails, go back and try a richer model.