Kochenderfer et al., Chapter 2

System Modeling

Before you can validate a system, you need a model of how it behaves. All models are wrong. Some are useful.

Prerequisites: Chapter 1 (validation framework, trajectories, specifications).

Chapter 0: Why Models?

You want to validate a self-driving car. The gold standard would be to drive it for a billion miles on real roads and count the crashes. But a billion miles at 30 mph is about 33 million hours of driving, or roughly 3,800 years for a single car running nonstop. And every crash during the evaluation puts real people at risk. Real-world testing alone is too slow, too expensive, and too dangerous.

The alternative: build a model — a mathematical representation of the system and its environment — and test the system inside the model. Simulate a billion miles in hours instead of millennia. Make the virtual pedestrians jump into the road, and nobody gets hurt.

George Box (1976): "All models are wrong, but some are useful." A driving simulator is not the real world. The physics are approximate, the other drivers are scripted, the sensors are idealized. But a model that captures the important failure modes — sensor noise, rare pedestrian behavior, weather degradation — is enormously useful, even if it gets the tire friction coefficient slightly wrong.

This chapter covers the tools for building these models. We will learn how to describe uncertainty with probability distributions, estimate model parameters from data, handle agents whose policies we do not know, and — crucially — check whether the model we built is actually any good.

The chain of reasoning is: (1) collect data from the real system, (2) fit a model to the data, (3) validate the system inside the model. If step 2 is bad, step 3 is meaningless — you would be testing the car in a world that does not resemble reality. Model quality is the foundation everything rests on.

Real-world data: collect trajectories from sensors, logs, and field tests.
Fit a model: learn T(s'|s,a), O(o|s), and π(a|o) from data.
Validate in the model: run the system in simulation and check specifications.
Check the model: is the model faithful to reality? (Chapter 9 of this lesson)
Check: Why can't we just test autonomous systems on real roads for validation?

Chapter 1: Probability 101

Models are built from probability distributions. Before we estimate anything, we need a quick refresher on the building blocks: PMFs, PDFs, and the Gaussian.

A probability mass function (PMF) describes a discrete random variable. If a die has outcomes {1,2,3,4,5,6}, the PMF assigns a probability to each: P(X = k). For a fair die, P(X = k) = 1/6 for all k. The probabilities must sum to 1.

A probability density function (PDF) describes a continuous random variable. The probability that X falls in an interval [a, b] is the area under the curve: $P(a \le X \le b) = \int_a^b f(x)\,dx$. The density f(x) itself can be greater than 1; what matters is that the total area under the curve is 1.

The most important continuous distribution is the Gaussian (normal distribution), with two parameters: mean μ (center) and variance σ² (spread):

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The Gaussian is everywhere in modeling. Sensor noise and measurement errors are routinely modeled as Gaussian. The central limit theorem explains why: the sum of many small independent effects, each with finite variance, is approximately Gaussian, largely regardless of the individual distributions. This is why it appears so often in practice.

[Interactive simulation: The Gaussian Distribution. Adjust μ (mean, default 0.0) and σ (standard deviation, default 1.0) to see how the bell curve changes shape and position.]
Notation: We write X ~ N(μ, σ²) to mean "X is drawn from a Gaussian with mean μ and variance σ²." The standard normal has μ = 0 and σ = 1. About 68% of the mass falls within one σ of μ, 95% within two, 99.7% within three.
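The 68/95/99.7 figures, and the remark that a density value can exceed 1, can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`gaussian_pdf`, `gaussian_cdf`, `mass_within`) are illustrative and not taken from the lesson.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    var = sigma ** 2
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def mass_within(k, mu=0.0, sigma=1.0):
    """Probability mass within k standard deviations of the mean."""
    return gaussian_cdf(mu + k * sigma, mu, sigma) - gaussian_cdf(mu - k * sigma, mu, sigma)

for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))   # 0.6827, 0.9545, 0.9973

# A narrow Gaussian has a peak density well above 1, yet its total area is still 1.
print(gaussian_pdf(0.0, 0.0, 0.1))
```

The result does not depend on μ or σ: `mass_within(2, mu=5.0, sigma=0.3)` gives the same 0.9545.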
| Distribution | Type | Parameters | Use in validation |
| --- | --- | --- | --- |
| Uniform | Continuous | a, b (range) | Uninformative priors |
| Gaussian | Continuous | μ, σ² | Sensor noise, positions, velocities |
| Bernoulli | Discrete | p | Binary outcomes (crash / no crash) |
| Categorical | Discrete | p₁, ..., pₖ | Discrete actions or states |
Check: Why is the Gaussian distribution so commonly used to model noise?

Chapter 2: Mixtures & Flows

A single Gaussian is a unimodal bump. But the real world is often multimodal — a pedestrian at an intersection might go left, right, or straight, each with its own cluster of trajectories. A single Gaussian cannot capture this. We need richer distributions.

A Gaussian mixture model (GMM) solves this by combining K Gaussians, each weighted by πk:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)$$

The weights πₖ sum to 1. Each component has its own mean μₖ and variance σₖ². The mixture can have bumps wherever it needs them. With enough components, a GMM can approximate any continuous distribution.

Think of it this way: A GMM is like a cocktail recipe. Each Gaussian component is an ingredient (μₖ sets the flavor, σₖ sets the intensity). The mixing weights πₖ say how much of each ingredient to pour. With the right recipe, you can match any taste, that is, any distribution shape.
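The mixture formula translates almost directly into code. The sketch below evaluates and samples a three-component GMM using only the standard library. The weights match the defaults of the lesson's simulation (0.40, 0.35, 0.25); the means and standard deviations are hypothetical values chosen for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gmm_pdf(x, weights, means, sigmas):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

def gmm_sample(rng, weights, means, sigmas):
    """Pick a component according to the weights, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], sigmas[k])

weights = [0.40, 0.35, 0.25]
means = [-3.0, 0.0, 3.0]    # hypothetical: "left", "straight", "right"
sigmas = [0.7, 0.5, 0.9]    # hypothetical spreads

# Sanity check: the mixture density should integrate to 1 (Riemann sum on [-10, 10]).
dx = 0.01
area = sum(gmm_pdf(-10 + i * dx, weights, means, sigmas) * dx for i in range(2001))
print(round(area, 4))

rng = random.Random(0)
samples = [gmm_sample(rng, weights, means, sigmas) for _ in range(5)]
print(samples)
```

Sampling in two stages (component first, then value) is what makes the mixture multimodal: each draw commits to one cluster.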

For even more flexibility, normalizing flows take a simple base distribution (say, a standard Gaussian) and push it through a series of invertible transformations. Each transformation stretches and bends the distribution into a more complex shape. The key constraint: each transformation must be invertible, so we can compute both the forward mapping (sample generation) and the backward mapping (density evaluation).

[Interactive simulation: Gaussian Mixture Model. A mixture of 3 Gaussians with adjustable weights (defaults π₁ = 0.40, π₂ = 0.35, π₃ = 0.25). Changing a weight changes the shape of the total distribution (white curve).]
Why flows matter for validation: Normalizing flows can model complex, high-dimensional distributions of disturbances (wind gusts, sensor glitches, other driver behavior). If you can sample from the disturbance distribution accurately, your simulated validation trajectories look more like reality.
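To make the invertibility requirement concrete, here is a one-dimensional toy flow, offered as a sketch rather than a practical architecture. It pushes a standard Gaussian through a single invertible map x = exp(a·z + b); the parameters `a` and `b` are hypothetical. Sampling uses the forward map, while density evaluation uses the inverse map plus the change-of-variables correction log|dz/dx|.

```python
import math
import random

A, B = 0.5, 1.0   # hypothetical flow parameters (A must be nonzero)

def forward(z):
    """Forward map (sample generation): base sample z -> data sample x > 0."""
    return math.exp(A * z + B)

def inverse(x):
    """Backward map (density evaluation): data point x -> base point z."""
    return (math.log(x) - B) / A

def log_density(x):
    """log p(x) = log N(z | 0, 1) + log |dz/dx| with z = inverse(x)."""
    z = inverse(x)
    log_base = -0.5 * z * z - 0.5 * math.log(2 * math.pi)
    log_det = -math.log(abs(A)) - math.log(x)   # dz/dx = 1 / (A * x)
    return log_base + log_det

rng = random.Random(0)
samples = [forward(rng.gauss(0.0, 1.0)) for _ in range(5)]

# Sanity check: the transformed density should still integrate to 1.
dx = 0.01
area = sum(math.exp(log_density((i + 0.5) * dx)) * dx for i in range(6000))
print(round(area, 4))
```

Real flows stack many such maps with learned parameters, but each layer contributes the same two pieces: an inverse and a log-determinant term.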
Check: Why would you use a Gaussian mixture instead of a single Gaussian to model pedestrian trajectories at an intersection?

Chapter 3: Conditional Distributions

The models from Chapters 1 and 2 describe distributions of individual variables. But validation models need conditional distributions, which describe how one quantity depends on another. The three core conditional distributions from the POMDP framework are:

T(s' | s, a)  —  next state depends on current state and action
O(o | s)  —  observation depends on true state
π(a | o)  —  action depends on observation

Each of these is a distribution parameterized by its conditioning variables. T(s'|s,a) is not a single distribution — it is a different distribution for every (s,a) pair. If there are 100 states and 5 actions, you need to specify 500 distributions (one per pair).

Worked example: A car on a highway with three lanes. State s = lane number {1,2,3}, numbered so that moving left decreases the lane number. Action a = {stay, left, right}. For a car in lane 3, the transition model T(s'|s, a=left) might be: P(s' = s-1) = 0.8 (successful lane change), P(s' = s) = 0.15 (aborted), P(s' = s-2) = 0.05 (overcorrected). In lanes 1 and 2 the probabilities have to be adjusted so that no mass lands outside {1,2,3}. Notice: the transition is stochastic. Even a "left" action does not guarantee you end up one lane to the left. Real systems have uncertainty.
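The lane-change example can be written as a small lookup of categorical distributions. The sketch below uses the 0.8 / 0.15 / 0.05 probabilities from the text. The boundary rule is an assumption of this sketch, not something the lesson specifies: probability mass that would leave the road is folded into the nearest valid lane.

```python
import random

LANES = (1, 2, 3)

def left_transition(s):
    """Distribution over the next lane for action 'left' taken in lane s."""
    raw = {s - 1: 0.80, s: 0.15, s - 2: 0.05}
    dist = {}
    for lane, p in raw.items():
        # Assumed boundary rule: clip off-road lanes to the nearest valid lane.
        clipped = min(max(lane, LANES[0]), LANES[-1])
        dist[clipped] = dist.get(clipped, 0.0) + p
    return dist

def sample_next(rng, dist):
    """Draw one next state from a categorical distribution."""
    lanes = list(dist)
    return rng.choices(lanes, weights=[dist[lane] for lane in lanes])[0]

rng = random.Random(0)
dist3 = left_transition(3)
counts = {lane: 0 for lane in LANES}
for _ in range(10_000):
    counts[sample_next(rng, dist3)] += 1
print(dist3)
print(counts)
```

Each (s, a) pair gets its own distribution, which is why `left_transition(3)` and `left_transition(1)` return different dictionaries.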

For continuous states, conditional distributions are often parameterized as Gaussians whose mean depends on the conditioning variables:

T(s' | s, a) = N(s' | f(s, a), Σ)

Here f(s,a) is a function (could be a physics simulator, a neural network, or a simple linear model) that predicts the expected next state. Σ captures the noise. The entire validation pipeline hinges on getting these conditional distributions right.
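As an illustration of T(s' | s, a) = N(s' | f(s, a), Σ), the sketch below uses a hypothetical point-mass model for f, with state = (position, velocity) and action = acceleration, and a diagonal Σ. The time step and noise levels are made-up values; a real model would fit them to data.

```python
import random

def f(s, a, dt=0.1):
    """Hypothetical mean dynamics: one Euler step of a point mass."""
    pos, vel = s
    return (pos + vel * dt, vel + a * dt)

def sample_transition(rng, s, a, sigma=(0.01, 0.05)):
    """Draw s' ~ N(f(s, a), diag(sigma^2))."""
    mean = f(s, a)
    return tuple(rng.gauss(m, sd) for m, sd in zip(mean, sigma))

rng = random.Random(0)
s = (0.0, 10.0)             # start at position 0 m, velocity 10 m/s
trajectory = [s]
for _ in range(50):         # 50 steps of 0.1 s with constant acceleration 1 m/s^2
    s = sample_transition(rng, s, a=1.0)
    trajectory.append(s)
print(trajectory[-1])
```

Setting `sigma=(0.0, 0.0)` recovers the deterministic function f, which shows that a fixed function is just the zero-noise special case of the conditional distribution.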

The physics analogy: T(s'|s,a) is the "physics engine" of your model. O(o|s) is the "sensor simulator." π(a|o) is the "brain simulator" for other agents. To build a complete validation environment, you need all three. Missing any one means your simulated trajectories are unrealistic.
| Distribution | Conditioning on | What it models |
| --- | --- | --- |
| T(s'\|s,a) | State + action | How the world changes |
| O(o\|s) | State | What the sensors see |
| π(a\|o) | Observation | How the agent (or other agents) behave |
| p(s₁) | Nothing | Initial conditions |
Check: Why is T(s'|s,a) a conditional distribution rather than a fixed function?

Chapter 4: Maximum Likelihood

You have data: N measurements x₁, x₂, ..., x_N. You believe they come from a Gaussian with unknown mean μ and standard deviation σ. How do you find the best μ and σ?

The idea: choose parameters that make the observed data most probable. This is maximum likelihood estimation (MLE). The likelihood is the probability of seeing your data given the parameters:

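$$L(\mu, \sigma) = \prod_{i=1}^{N} f(x_i \mid \mu, \sigma^2)$$

We want to maximize L. Since products are annoying, we take the log (the log is monotonic, so maximizing log L is the same as maximizing L):

$$\log L(\mu, \sigma) = \sum_{i=1}^{N} \log f(x_i \mid \mu, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2$$

Taking derivatives and setting them to zero gives the MLE solutions:

$$\hat{\mu}_{\mathrm{MLE}} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu}_{\mathrm{MLE}})^2$$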

The MLE mean is the sample mean. The MLE variance is the average squared deviation. These are the "best-fit" parameters — the Gaussian that gives the observed data the highest probability.

The log-likelihood trick: Taking the log turns products into sums, which are easier to differentiate and numerically more stable (a product of many small densities underflows quickly). The trick appears throughout machine learning. The log-likelihood is the fundamental object to maximize.
[Interactive simulation: Fit a Gaussian (MLE). Data points (orange dots) are shown with sliders for μ (default 0.00) and σ (default 1.50). The log-likelihood updates as the Gaussian moves, and the Find MLE button runs gradient ascent to the optimum.]
Worked numerical example: Suppose your data is {2.1, 3.5, 2.8, 3.2, 2.5}. The MLE mean is (2.1+3.5+2.8+3.2+2.5)/5 = 2.82. The MLE variance is [(2.1−2.82)² + (3.5−2.82)² + (2.8−2.82)² + (3.2−2.82)² + (2.5−2.82)²]/5 = [0.5184 + 0.4624 + 0.0004 + 0.1444 + 0.1024]/5 = 0.2456. So σ̂ = √0.2456 ≈ 0.496.
python
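import numpy as np

# Same five measurements as in the worked numerical example.
data = np.array([2.1, 3.5, 2.8, 3.2, 2.5])

# Closed-form Gaussian MLE: sample mean and average squared deviation (divide by N).
mu_mle = data.mean()                       # 2.82
sig2_mle = ((data - mu_mle) ** 2).mean()   # 0.2456

def log_likelihood(data, mu, sig2):
    """Gaussian log-likelihood of `data` for mean `mu` and variance `sig2`."""
    n = len(data)
    return -n / 2 * np.log(2 * np.pi * sig2) - np.sum((data - mu) ** 2) / (2 * sig2)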
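The closed-form answer can also be reached numerically, which is what a gradient-ascent "Find MLE" routine would do. The sketch below is one possible version in plain Python, not the lesson's actual implementation. It optimizes over (μ, log σ) so that σ stays positive; the learning rate and step count are hand-picked assumptions.

```python
import math

data = [2.1, 3.5, 2.8, 3.2, 2.5]

def log_likelihood(mu, sigma):
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -n / 2 * math.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)

def gradient_ascent(mu=0.0, sigma=1.5, lr=0.01, steps=2000):
    """Maximize the Gaussian log-likelihood over (mu, log sigma)."""
    log_sigma = math.log(sigma)
    n = len(data)
    for _ in range(steps):
        var = math.exp(2 * log_sigma)
        # d logL / d mu = sum(x - mu) / sigma^2
        d_mu = sum(x - mu for x in data) / var
        # d logL / d log(sigma) = -N + sum((x - mu)^2) / sigma^2
        d_log_sigma = -n + sum((x - mu) ** 2 for x in data) / var
        mu += lr * d_mu
        log_sigma += lr * d_log_sigma
    return mu, math.exp(log_sigma)

mu_hat, sigma_hat = gradient_ascent()
print(mu_hat, sigma_hat ** 2)
print(log_likelihood(mu_hat, sigma_hat) >= log_likelihood(0.0, 1.5))
```

For a Gaussian the closed form is obviously preferable; the iterative version matters for models such as mixtures or neural networks where no closed form exists.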
Check: What does "maximum likelihood" mean in plain English?

Chapter 5: Bayesian Learning

MLE gives you a single best estimate. But what if you have prior knowledge? And what if you want to know how uncertain you are about the parameters? Bayesian learning answers both questions by treating the parameters themselves as random variables with a distribution.

Bayes' theorem says:

$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})}$$

In words: the posterior (what we believe after seeing data) is proportional to the likelihood (how probable the data is given θ) times the prior (what we believed before seeing data). The denominator p(data) is just a normalizing constant.

The coin-flipping example: Suppose you find a coin and want to know if it is fair. Your prior: P(heads) = θ, where θ ~ Beta(2, 2) (you mildly believe it is fair). You flip 10 times and get 7 heads. The posterior is Beta(2+7, 2+3) = Beta(9, 5). The posterior mean is 9/14 ≈ 0.643 — the data shifted you toward believing it is slightly biased, but your prior pulled you back from the MLE of 0.7. With more data, the prior matters less and the posterior converges to the MLE.

The Beta distribution is the natural prior for a probability parameter θ ∈ [0, 1]. It is parameterized by α and β:

$$\mathrm{Beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

After observing h heads and t tails, the posterior is Beta(α + h, β + t). This is called a conjugate prior — the posterior has the same form as the prior, with updated parameters. It makes Bayesian computation tractable.
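Because the update is conjugate, the whole Bayesian computation reduces to adding counts. The sketch below reproduces the coin example (prior Beta(2, 2), 7 heads and 3 tails) and then repeats it with 100 times as much data in the same proportion to show the prior washing out. The function names are illustrative.

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + h, beta + t) posterior."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

prior = (2, 2)
small = beta_update(*prior, heads=7, tails=3)       # Beta(9, 5)
large = beta_update(*prior, heads=700, tails=300)   # Beta(702, 302)

for name, post in (("10 flips", small), ("1000 flips", large)):
    print(name, post, round(beta_mean(*post), 4), round(beta_variance(*post), 6))
```

With 10 flips the posterior mean (about 0.643) sits between the prior mean 0.5 and the MLE 0.7; with 1000 flips it is within 0.001 of the MLE and the posterior variance is far smaller.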

[Interactive simulation: Bayesian Coin Flipping. Each click on Flip flips the coin and updates the posterior Beta distribution. The prior (dotted) is Beta(2,2). The true bias is hidden and must be inferred from the flips.]
MLE vs. Bayes: MLE says "what single parameter value best explains the data?" Bayes says "given the data, what is the full distribution over parameter values?" The Bayesian answer is richer: it tells you not just the best guess, but how confident you should be. With little data, the prior dominates. With lots of data, the prior washes out and the posterior concentrates around the MLE.
Check: In the Beta-Bernoulli model, what happens to the posterior as you observe more and more data?

Chapter 6: Overfitting

A model that fits the training data perfectly is not necessarily a good model. Consider fitting a polynomial to 10 data points. A degree-9 polynomial passes through every single point — zero training error. But between the points, it oscillates wildly. On new data, it performs terribly. This is overfitting: the model has memorized the noise in the training data instead of learning the underlying pattern.

The parable of the exam: A student who memorizes every answer from past exams (training data) will score 100% on those exams. But give them a new exam (test data) and they fail, because they never learned the concepts. Overfitting is memorization without understanding. The validation analog: a model that perfectly reproduces logged driving data may fail on novel scenarios it has never seen.

We detect overfitting by measuring performance on held-out test data — data the model was not trained on. The gap between training performance and test performance is the generalization gap. A large gap signals overfitting.

[Interactive simulation: Polynomial Fitting: Overfitting in Action. A Degree slider (default 3) changes the polynomial order. Low degree = underfitting (misses the pattern). High degree = overfitting (passes through every point but oscillates wildly). The goal is to find the sweet spot.]

Cross-validation is a systematic way to tune model complexity. Split the data into K folds. Train on K−1 folds, test on the held-out fold. Repeat K times and average the test error. The model complexity (degree, number of parameters, regularization strength) that minimizes cross-validation error is the best trade-off between underfitting and overfitting.
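The K-fold procedure can be sketched in plain Python for the polynomial example. Everything specific here is an assumption made for illustration: the synthetic data (a cubic plus Gaussian noise), the interleaved fold assignment, and the least-squares solver based on normal equations, which is adequate for low degrees on [-1, 1] but not a numerically robust general method.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    n = degree + 1
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for r in range(col + 1, n):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= factor * ata[col][c]
            aty[r] -= factor * aty[col]
    coeffs = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(ata[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aty[r] - tail) / ata[r][r]
    return coeffs

def mse(coeffs, xs, ys):
    preds = [sum(c * x ** j for j, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def cross_val_error(xs, ys, degree, k=5):
    """Average held-out MSE over k interleaved folds."""
    n, total = len(xs), 0.0
    for fold in range(k):
        test = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test]
        train_y = [ys[i] for i in range(n) if i not in test]
        test_x = [xs[i] for i in sorted(test)]
        test_y = [ys[i] for i in sorted(test)]
        total += mse(fit_poly(train_x, train_y, degree), test_x, test_y)
    return total / k

rng = random.Random(0)
xs = [-1 + 2 * i / 29 for i in range(30)]
ys = [x ** 3 - x + rng.gauss(0.0, 0.05) for x in xs]   # hypothetical ground truth

for degree in range(7):
    train = mse(fit_poly(xs, ys, degree), xs, ys)
    print(degree, round(train, 5), round(cross_val_error(xs, ys, degree), 5))
```

Training error can only go down as the degree increases, so it cannot be used to choose the degree; the cross-validation column is the one to minimize.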

For validation models specifically: If your transition model T(s'|s,a) is overfit to training trajectories, it will produce unrealistically narrow simulated distributions. Rare events that should be possible in the model will have near-zero probability. Your validation will miss failure modes that exist in reality but not in your overfit model. Overfitting is not just a prediction problem — it is a safety problem.
Check: How do you detect overfitting?

Chapter 7: Cloning Agents

To validate an autonomous vehicle, you need to simulate not just your car, but all the other drivers on the road. Their policies π(a|o) are unknown — you cannot look inside their heads. But you have logged data: recordings of what they did in real traffic. Can you learn a policy from this data?

Behavioral cloning (BC) does exactly this: treat the logged observation-action pairs as supervised learning data, and train a model to predict actions from observations. Given data {(o₁, a₁), ..., (o_N, a_N)}, minimize:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\pi_\theta(o_i), a_i)$$

where ℓ is a loss function (cross-entropy for discrete actions, MSE for continuous).
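For discrete observations and actions, the simplest behavioral-cloning model is a lookup table, and the cross-entropy loss is minimized by the empirical action frequencies for each observation. The sketch below shows this on made-up logged data; the observation and action labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_tabular_policy(pairs):
    """Cross-entropy-optimal tabular policy: empirical action frequencies per observation."""
    counts = defaultdict(Counter)
    for obs, act in pairs:
        counts[obs][act] += 1
    return {
        obs: {act: c / sum(ctr.values()) for act, c in ctr.items()}
        for obs, ctr in counts.items()
    }

def cross_entropy(policy, pairs):
    """Average negative log-probability the policy assigns to the logged actions."""
    return -sum(math.log(policy[o][a]) for o, a in pairs) / len(pairs)

# Hypothetical driving log: (observation, action) pairs.
logged = (
    [("gap_small", "brake")] * 8 + [("gap_small", "keep")] * 2
    + [("gap_large", "keep")] * 9 + [("gap_large", "accelerate")] * 1
)

policy = fit_tabular_policy(logged)
print(policy)
print(round(cross_entropy(policy, logged), 4))
```

The table has no entry for an observation that never appears in the log, which is the tabular version of the problem discussed next: the clone has no guidance off the training distribution.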

The cascading error problem: Behavioral cloning works well when the agent stays near the training distribution. But small errors compound: the cloned agent drifts slightly from the expert's trajectory at time 1. At time 2, it sees a slightly different observation (one it was never trained on), makes a slightly larger error, drifts further, and so on. By time 100, the agent can be in a completely unfamiliar part of the state space, where its actions are essentially arbitrary. This is called compounding error, and it is a consequence of distribution shift between the states the expert visited and the states the clone visits.
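A small simulation makes the growth of the error visible. This sketch assumes a deliberately simple setup: the expert drives straight along y = 0, and the clone's heading picks up an independent Gaussian error at every step. It captures only the accumulation of per-step errors, not the additional effect of the policy getting worse on unfamiliar observations, so real drift can be faster.

```python
import math
import random

def mean_abs_deviation(noise, horizon, runs=500, seed=0):
    """Average |lateral offset| from the expert path after `horizon` steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        heading, y = 0.0, 0.0
        for _ in range(horizon):
            heading += rng.gauss(0.0, noise)   # small per-step cloning error
            y += math.sin(heading)             # unit-speed lateral motion
        total += abs(y)
    return total / runs

for horizon in (10, 50, 100):
    print(horizon, round(mean_abs_deviation(0.10, horizon), 2))
```

The deviation grows much faster than linearly in the horizon, because heading errors accumulate and then integrate into position error.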
[Interactive simulation: Behavioral Cloning: Compounding Error. The expert (teal) follows a path. The cloned agent (orange) starts on the same path but accumulates small errors. Run starts the rollout, and a Noise slider (default 0.10) controls the per-step error magnitude.]

DAgger (Dataset Aggregation) is the fix. After the cloned agent makes errors and drifts, you ask the expert to label the new states the agent visits. Add these to the dataset and retrain. Iterate. This progressively covers the states the cloned agent actually encounters, closing the distribution shift gap.
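The DAgger loop can be sketched on a toy problem. Everything in this example is hypothetical: a one-dimensional lane offset in {-5, ..., 5}, an expert that steps back toward 0, a lookup-table learner that defaults to action 0 in states it has never seen, and a random disturbance at each step. The point is only the structure of the loop: roll out the learner, have the expert label the visited states, aggregate, retrain.

```python
import random

def expert(s):
    """Expert policy: step toward the lane center at s = 0."""
    return -1 if s > 0 else (1 if s < 0 else 0)

def rollout(policy, rng, steps=50):
    """Run the learner's policy; unseen states fall back to action 0 (an assumption)."""
    s, visited = 0, []
    for _ in range(steps):
        visited.append(s)
        a = policy.get(s, 0)
        disturbance = rng.choice([-1, 0, 0, 1])
        s = max(-5, min(5, s + a + disturbance))
    return visited

def dagger(iterations=10, seed=0):
    rng = random.Random(seed)
    dataset = {0: expert(0)}        # initial demonstrations only cover s = 0
    policy = dict(dataset)          # "training" a lookup table = copying the labels
    for _ in range(iterations):
        for s in rollout(policy, rng):
            dataset[s] = expert(s)  # expert labels the states the learner visited
        policy = dict(dataset)      # retrain on the aggregated dataset
    return policy

policy = dagger()
print(sorted(policy.items()))
```

The key design choice is that rollouts use the learner's policy rather than the expert's, so the labeled states come from the distribution the learner will actually face.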

For validation: If you model other drivers with behavioral cloning, beware of cascading errors. A BC driver model that diverges from reality after 20 time steps will produce unrealistic trajectories — your validation will be testing against ghost traffic that does not behave like real drivers.
Check: Why does behavioral cloning suffer from compounding errors over long trajectories?

Chapter 8: Behavior Models

Not every agent needs to be cloned from data. Sometimes we can build behavior models from first principles using decision-theoretic assumptions about how agents act. The key idea: agents are approximately rational, but not perfectly so.

The softmax response model says an agent picks actions with probability proportional to the exponentiated value of each action:

$$\pi(a \mid s) = \frac{\exp(\lambda \, Q(s, a))}{\sum_{a'} \exp(\lambda \, Q(s, a'))}$$

Here Q(s, a) is the value of taking action a in state s, and λ is the rationality parameter. When λ = 0, the agent is completely random (uniform over actions). When λ → ∞, the agent always picks the best action (perfectly rational). Real agents live somewhere in between.

The λ dial: Think of λ as a "how much does this agent care?" knob. A distracted driver (λ low) drifts between lanes semi-randomly. An alert, experienced driver (λ high) makes near-optimal decisions. For validation, you want to simulate drivers across the full range — the worst-case failures often come from low-λ (irrational) other agents.
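The effect of λ is easy to see numerically. The sketch below evaluates the softmax response for three actions with values Q = 3, 1, 2 (the same values used in the lesson's simulation) at several λ. Subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the probabilities.

```python
import math

def softmax_policy(q_values, lam):
    """Action probabilities proportional to exp(lam * Q)."""
    scaled = [lam * q for q in q_values]
    m = max(scaled)                           # shift for numerical stability
    weights = [math.exp(v - m) for v in scaled]
    total = sum(weights)
    return [w / total for w in weights]

q = [3.0, 1.0, 2.0]
for lam in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(lam, [round(p, 3) for p in softmax_policy(q, lam)])
```

At λ = 0 all three actions get probability 1/3; at λ = 1 the best action gets about 0.665; by λ = 10 essentially all the mass is on the best action.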

Level-k reasoning takes this further. A level-0 agent acts randomly. A level-1 agent best-responds to level-0 agents. A level-2 agent best-responds to level-1 agents. And so on. This hierarchy captures the idea that not everyone thinks the same number of steps ahead.
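The level-k hierarchy can be computed recursively for a small symmetric game. The payoff numbers below are invented for illustration: two drivers at a merge each choose "yield" or "go", and both going means a crash. Level 0 is uniform random; each higher level best-responds to the level beneath it.

```python
ACTIONS = ("yield", "go")

# Hypothetical payoffs: PAYOFF[(my_action, other_action)] is my reward.
PAYOFF = {
    ("yield", "yield"): -1,   # both wait
    ("yield", "go"): 0,       # I let the other driver through
    ("go", "yield"): 2,       # I merge first
    ("go", "go"): -10,        # collision
}

def level_k_policy(k):
    """Distribution over actions for a level-k agent."""
    if k == 0:
        return {a: 1 / len(ACTIONS) for a in ACTIONS}
    opponent = level_k_policy(k - 1)

    def expected_payoff(a):
        return sum(p * PAYOFF[(a, b)] for b, p in opponent.items())

    best = max(ACTIONS, key=expected_payoff)
    return {a: 1.0 if a == best else 0.0 for a in ACTIONS}

for k in range(4):
    print(k, level_k_policy(k))
```

In this game the best response alternates: a level-1 driver yields (to avoid a random opponent), a level-2 driver goes (expecting the other to yield), and a level-3 driver yields again. Two level-2 drivers meeting each other would both go, which is exactly the kind of interaction a validation model should be able to produce.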

[Interactive simulation: Softmax Rationality: The λ Parameter. Three actions with values Q(a₁)=3, Q(a₂)=1, Q(a₃)=2. A λ slider (default 1.0) shows how the action probabilities shift from uniform (random) to peaked (rational).]
| λ | Behavior | Interpretation |
| --- | --- | --- |
| 0 | Uniform random | Completely irrational |
| 0.5 | Slight preference for good actions | Inattentive driver |
| 2 | Strong preference for good actions | Normal driver |
| ∞ | Always picks the best action | Perfectly rational |
Check: What does the rationality parameter λ control in the softmax model?

Chapter 9: Validating Models

You built a model. But is it any good? A model that does not match reality is worse than useless — it gives false confidence. This chapter covers three tools for checking model quality.

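Q-Q plots (quantile-quantile plots) compare the quantiles of model outputs against the quantiles of real data. If the model is correct, the points fall on the straight diagonal line. Deviations reveal systematic mismatches: a straight line offset from the diagonal means the mean is wrong, a straight line with slope different from 1 means the variance is wrong, an S-shape means the tails are heavier or lighter than the model assumes, and a bow-shaped curve means the skew is wrong.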
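The points of a Q-Q plot are straightforward to compute: sort the data and pair each value with the model quantile at the same probability level. The sketch below uses `statistics.NormalDist` from the standard library and a deliberately wrong model (standard normal) against synthetic data drawn from N(2, 3²). Fitting a line through the points is one simple way to summarize the mismatch numerically; a slope far from 1 or an intercept far from 0 signals a mismatch in spread or location.

```python
import random
from statistics import NormalDist

def qq_points(samples, model):
    """Pairs (model quantile, data quantile) at matching probability levels."""
    xs = sorted(samples)
    n = len(xs)
    return [(model.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

def fit_line(points):
    """Ordinary least-squares slope and intercept through the Q-Q points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(0)
real = [rng.gauss(2.0, 3.0) for _ in range(2000)]   # synthetic "real" data
model = NormalDist(0.0, 1.0)                         # deliberately wrong model

slope, intercept = fit_line(qq_points(real, model))
print(round(slope, 2), round(intercept, 2))
```

Here the slope comes out near 3 and the intercept near 2, recovering the scale and location errors of the model.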

Calibration plots check probabilistic predictions. If the model says P(crash) = 0.3, then among all situations where it predicted 0.3, about 30% should actually be crashes. Plot predicted probability on the horizontal axis and observed frequency on the vertical axis: a well-calibrated model has its calibration curve on the diagonal. Where the model overpredicts the event, the curve falls below the diagonal; where it underpredicts, the curve rises above it.
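A calibration curve is computed by binning predictions and comparing the mean prediction in each bin with the observed event frequency. The sketch below uses synthetic data in which outcomes are drawn with exactly the predicted probability, so the result should lie near the diagonal. The bin count and data are assumptions for illustration.

```python
import random

def calibration_curve(predicted, outcomes, n_bins=10):
    """Return (mean predicted probability, observed frequency, count) per non-empty bin."""
    counts = [0] * n_bins
    pred_sums = [0.0] * n_bins
    event_sums = [0] * n_bins
    for p, y in zip(predicted, outcomes):
        b = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        counts[b] += 1
        pred_sums[b] += p
        event_sums[b] += y
    return [
        (pred_sums[b] / counts[b], event_sums[b] / counts[b], counts[b])
        for b in range(n_bins) if counts[b] > 0
    ]

rng = random.Random(0)
predicted = [rng.random() for _ in range(20_000)]
outcomes = [1 if rng.random() < p else 0 for p in predicted]   # perfectly calibrated

for mean_pred, freq, count in calibration_curve(predicted, outcomes):
    print(round(mean_pred, 3), round(freq, 3), count)
```

With real model outputs, bins where the observed frequency is consistently lower than the mean prediction are where the model overstates risk, and vice versa.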

KL divergence measures how different two distributions are:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

If p is the true distribution and q is the model, D_KL(p || q) = 0 means a perfect match. Larger values mean the model is farther from reality. KL divergence is asymmetric: it penalizes the model heavily for assigning low probability to events that are actually common (q(x) small where p(x) is large).

Why KL asymmetry matters for safety: If the model assigns zero probability to a scenario that actually occurs (p(x) > 0, q(x) = 0), the KL divergence is infinite, and it grows without bound as q(x) approaches zero. This is exactly the dangerous case for validation: the model says "this will never happen" but reality disagrees. A model that underestimates the probability of rare but dangerous events is the most harmful kind of model error.
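The asymmetry and the blow-up are both visible in a few lines of code. The sketch below implements the discrete KL divergence with the usual conventions (terms with p(x) = 0 contribute nothing; any x with p(x) > 0 and q(x) = 0 makes the divergence infinite). The five-outcome distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue              # 0 * log(0 / q) is taken to be 0
        if qx == 0.0:
            return math.inf       # model rules out something that really happens
        total += px * math.log(px / qx)
    return total

p = [0.40, 0.30, 0.15, 0.10, 0.05]     # hypothetical "true" distribution
q_uniform = [0.20] * 5                  # crude model
q_blind = [0.45, 0.35, 0.15, 0.05, 0.0] # model that ignores the rare outcome

print(kl_divergence(p, p))
print(round(kl_divergence(p, q_uniform), 4), round(kl_divergence(q_uniform, p), 4))
print(kl_divergence(p, q_blind))
```

The blind model looks close to p on the common outcomes, yet its divergence is infinite because it assigns zero probability to an outcome that occurs 5% of the time.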
[Interactive simulation: KL Divergence: Model vs. Reality. The true distribution p (teal bars) has 5 outcomes. The model distribution q (orange bars) can be dragged to match, and D_KL(p || q) is computed in real time. Setting q to zero on a likely outcome makes the KL divergence blow up.]
The model validation checklist: (1) Q-Q plot: are the distributions aligned? (2) Calibration: are probabilistic predictions reliable? (3) KL divergence: how far is the model from reality overall? (4) Domain expert review: do the simulated scenarios look realistic? All four should be checked before trusting a model for validation.
Check: Why is KL divergence especially relevant for safety-critical validation models?

Chapter 10: Summary

This chapter built the modeling toolkit. To validate a system in simulation, you need models of the world (transitions), the sensors (observations), and other agents (policies). Each model is a probability distribution estimated from data, and each must be validated against reality before trusting it.

| Technique | What it does | Strength | Weakness |
| --- | --- | --- | --- |
| MLE | Point estimate of parameters | Simple, efficient | No uncertainty quantification |
| Bayesian learning | Full posterior over parameters | Handles small data, quantifies uncertainty | Requires prior, computation cost |
| GMM / Flows | Complex distribution shapes | Captures multimodality | More parameters to estimate |
| Behavioral cloning | Learns policy from demonstrations | Simple supervised learning | Compounding errors over time |
| Softmax model | Parameterized rationality | Interpretable, controllable | Assumes Q-values are known |

What comes next:

Chapter 3 tackles specification: how do you formally write down what "safe" means? Metrics, temporal logic, reachability sets — the language of requirements.

The modeling pipeline:

Data → Choose distribution family → Fit parameters (MLE/Bayes) → Check model (Q-Q, calibration, KL) → Use for validation. If the check fails, go back and try a richer model.

The big picture: A validation result is only as trustworthy as the model it was computed in. An elegant falsification algorithm running on a bad model is like a brilliant detective investigating the wrong crime scene. This chapter ensures you build the right scene before the investigation begins.
"All models are wrong, but some are useful."
— George E. P. Box, 1976
Check: Why must a model be validated before using it for system validation?