Ch 2: System Modeling — Algorithms for Validation

Chapter 0: Why Models?

You want to validate a self-driving car. The gold standard would be to drive it for a billion miles on real roads and count the crashes. But a billion miles at 30 mph is about 33 million hours, or roughly 3,800 years of continuous driving. And every crash during the evaluation puts real people at risk. Real testing is too slow, too expensive, and too dangerous.

The alternative: build a model — a mathematical representation of the system and its environment — and test the system inside the model. Simulate a billion miles in hours instead of millennia. Make the virtual pedestrians jump into the road, and nobody gets hurt.

George Box (1976): "All models are wrong, but some are useful." A driving simulator is not the real world. The physics are approximate, the other drivers are scripted, the sensors are idealized. But a model that captures the important failure modes — sensor noise, rare pedestrian behavior, weather degradation — is enormously useful, even if it gets the tire friction coefficient slightly wrong.

This chapter covers the tools for building these models. We will learn how to describe uncertainty with probability distributions, estimate model parameters from data, handle agents whose policies we do not know, and — crucially — check whether the model we built is actually any good.

The chain of reasoning is: (1) collect data from the real system, (2) fit a model to the data, (3) validate the system inside the model. If step 2 is bad, step 3 is meaningless — you would be testing the car in a world that does not resemble reality. Model quality is the foundation everything rests on.

Real World Data

Collect trajectories from sensors, logs, field tests

↓

Fit a Model

Learn T(s'|s,a), O(o|s), π(a|o) from data

↓

Validate in Model

Run the system in simulation, check specifications

↓

Check the Model

Is the model faithful to reality? (Chapter 9 of this lesson)

Check: Why can't we just test autonomous systems on real roads for validation?

Real roads are too bumpy
Achieving statistical confidence requires so many miles that real testing is too slow, expensive, and dangerous
Simulators are always more accurate than real roads

Chapter 1: Probability 101

Models are built from probability distributions. Before we estimate anything, we need a quick refresher on the building blocks: PMFs, PDFs, and the Gaussian.

A probability mass function (PMF) describes a discrete random variable. If a die has outcomes {1,2,3,4,5,6}, the PMF assigns a probability to each: P(X = k). For a fair die, P(X = k) = 1/6 for all k. The probabilities must sum to 1.

A probability density function (PDF) describes a continuous random variable. The probability that X falls in an interval [a, b] is the area under the curve: P(a ≤ X ≤ b) = ∫_a^b f(x) dx. The density f(x) itself can be greater than 1 — what matters is that the total area under the curve is 1.

The most important continuous distribution is the Gaussian (normal distribution), with two parameters: mean μ (center) and variance σ² (spread):

$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$

The Gaussian is everywhere in modeling. Sensor noise and measurement errors are routinely modeled as Gaussian. The justification is the central limit theorem: the sum of many small independent effects with finite variance is approximately Gaussian, largely regardless of the individual distributions. This is why it appears so often in practice, though it remains an approximation that should be checked against data.

[Interactive figure: The Gaussian Distribution. Sliders for μ (mean, initial value 0.0) and σ (standard deviation, initial value 1.0) show how the bell curve changes shape and position.]

Notation: We write X ~ N(μ, σ²) to mean "X is drawn from a Gaussian with mean μ and variance σ²." The standard normal has μ = 0 and σ = 1. About 68% of the mass falls within one σ of μ, 95% within two, 99.7% within three.
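
The density formula and the 68 / 95 / 99.7 rule can be checked numerically. The sketch below uses only the standard library; the function names `gaussian_pdf` and `mass_within` are illustrative choices, not part of the lesson.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    var = sigma ** 2
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mass_within(k):
    """P(|X - mu| <= k * sigma) for any Gaussian, via the error function."""
    return math.erf(k / math.sqrt(2))

# A narrow Gaussian has a peak density above 1, yet its total area is still 1.
peak_narrow = gaussian_pdf(0.0, mu=0.0, sigma=0.1)

# Probability mass within one, two, and three standard deviations of the mean.
coverage = {k: mass_within(k) for k in (1, 2, 3)}
print(peak_narrow, coverage)
```

With σ = 0.1 the peak density is close to 4, which illustrates the earlier point that a density value can exceed 1.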

Distribution	Type	Parameters	Use in validation
Uniform	Continuous	a, b (range)	Uninformative priors
Gaussian	Continuous	μ, σ²	Sensor noise, positions, velocities
Bernoulli	Discrete	p	Binary outcomes (crash / no crash)
Categorical	Discrete	p₁,...,p_k	Discrete actions or states

Check: Why is the Gaussian distribution so commonly used to model noise?

Because it is the only continuous distribution
Because noise is always exactly Gaussian
Because the central limit theorem says sums of many small independent effects converge to a Gaussian

Chapter 2: Mixtures & Flows

A single Gaussian is a unimodal bump. But the real world is often multimodal — a pedestrian at an intersection might go left, right, or straight, each with its own cluster of trajectories. A single Gaussian cannot capture this. We need richer distributions.

A Gaussian mixture model (GMM) solves this by combining K Gaussians, each weighted by π_k:

$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2) $$

The weights π_k are nonnegative and sum to 1. Each component has its own mean μ_k and variance σ_k². The mixture can have bumps wherever it needs them. With enough components, a GMM can approximate any continuous density arbitrarily closely.
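
As a rough illustration, a one-dimensional mixture can be evaluated and sampled in a few lines of NumPy. The component means and standard deviations below are hypothetical values chosen for the example; only the mixture formula itself comes from the text.

```python
import numpy as np

def gmm_pdf(x, weights, means, sigmas):
    """Mixture density: sum_k weights[k] * N(x | means[k], sigmas[k]^2)."""
    x = np.asarray(x, dtype=float)[..., None]  # add an axis to broadcast over components
    comps = np.exp(-(x - means) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return comps @ weights

def gmm_sample(n, weights, means, sigmas, rng):
    """Sample by first picking a component, then drawing from that Gaussian."""
    k = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[k], sigmas[k])

weights = np.array([0.40, 0.35, 0.25])
means = np.array([-3.0, 0.0, 4.0])   # hypothetical cluster centers
sigmas = np.array([0.8, 1.0, 0.6])   # hypothetical cluster spreads

grid = np.linspace(-10.0, 10.0, 4001)
density = gmm_pdf(grid, weights, means, sigmas)
area = np.sum(density) * (grid[1] - grid[0])   # Riemann sum, should be close to 1

samples = gmm_sample(10_000, weights, means, sigmas, np.random.default_rng(0))
```

Sampling in two stages (component first, then value) mirrors the generative story behind the formula: the weight π_k is the probability of landing in cluster k.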

Think of it this way: A GMM is like a cocktail recipe. Each Gaussian component is an ingredient (μ_k sets the flavor, σ_k sets the intensity). The mixing weights π_k say how much of each ingredient to pour. With the right recipe, you can match any taste — any distribution shape.

For even more flexibility, normalizing flows take a simple base distribution (say, a standard Gaussian) and push it through a series of invertible transformations. Each transformation stretches and bends the distribution into a more complex shape. The key constraint: each transformation must be invertible with a tractable Jacobian determinant, so we can compute both the forward mapping (sample generation) and the backward mapping (density evaluation via the change-of-variables formula).
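
A minimal one-dimensional sketch of the idea, assuming a flow made of just two invertible steps (an affine map followed by exp). The parameters `A` and `B` are hypothetical; a real flow would stack many learned layers. Density evaluation uses the change-of-variables rule log p_x(x) = log p_z(z) + log |dz/dx|.

```python
import numpy as np

A, B = 0.5, 1.0  # hypothetical flow parameters, with A > 0

def forward(z):
    """Sample generation: push base samples z through affine map, then exp."""
    return np.exp(A * z + B)

def inverse(x):
    """Backward mapping: recover the base variable z from x > 0."""
    return (np.log(x) - B) / A

def log_density(x):
    """Change of variables: log p_x(x) = log p_z(z) + log |dz/dx|."""
    z = inverse(x)
    log_base = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)  # standard normal log-density
    log_det = -np.log(A) - np.log(x)                     # dz/dx = 1 / (A * x)
    return log_base + log_det

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
x_samples = forward(z)

# Sanity check: the transformed density should still integrate to about 1.
grid = np.linspace(1e-6, 60.0, 600_001)
area = np.sum(np.exp(log_density(grid))) * (grid[1] - grid[0])
```

The result is a skewed, strictly positive distribution built from a symmetric Gaussian, which is the basic mechanism flows use to reshape a simple base density.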

[Interactive figure: Gaussian Mixture Model. A mixture of 3 Gaussians with adjustable weights (initial values π₁ = 0.40, π₂ = 0.35, π₃ = 0.25); the total distribution is drawn in white.]

Why flows matter for validation: Normalizing flows can model complex, high-dimensional distributions of disturbances (wind gusts, sensor glitches, other driver behavior). If you can sample from the disturbance distribution accurately, your simulated validation trajectories look more like reality.

Check: Why would you use a Gaussian mixture instead of a single Gaussian to model pedestrian trajectories at an intersection?

Because pedestrians may go left, right, or straight, creating multiple clusters (modes) that a single Gaussian cannot capture
Because mixtures are computationally cheaper
Because a single Gaussian has too many parameters

Chapter 3: Conditional Distributions

The models from Chapters 1 and 2 describe distributions of individual variables. But validation models need conditional distributions, which describe how one quantity depends on another. The three core conditional distributions from the POMDP (partially observable Markov decision process) framework are:

T(s' | s, a) — next state depends on current state and action

O(o | s) — observation depends on true state

π(a | o) — action depends on observation

Each of these is a distribution parameterized by its conditioning variables. T(s'|s,a) is not a single distribution — it is a different distribution for every (s,a) pair. If there are 100 states and 5 actions, you need to specify 500 distributions (one per pair).

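Worked example: A car on a highway with three lanes, numbered 1 (leftmost) to 3 (rightmost). State s = lane number {1,2,3}. Action a = {stay, left, right}. Starting from the rightmost lane (s = 3), the transition model T(s'|s=3, a=left) might be: P(s' = 2) = 0.8 (successful lane change), P(s' = 3) = 0.15 (aborted), P(s' = 1) = 0.05 (overcorrected by two lanes). In lanes 1 and 2 some of these outcomes would leave the road, so the probabilities there must be assigned differently. Notice: the transition is stochastic. Even a "left" action does not guarantee you end up one lane to the left. Real systems have uncertainty.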
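
The lane-change example can be written as a small lookup-and-sample routine. This is a sketch under one stated assumption that is not in the text: outcomes that would leave the road are clipped to the nearest valid lane. The names `transition_left` and `sample_next` are illustrative.

```python
import numpy as np

LANES = [1, 2, 3]
# Lane offsets and probabilities for a = left, taken from the worked example.
LEFT_OUTCOMES = {-1: 0.80, 0: 0.15, -2: 0.05}

def transition_left(s):
    """Return T(s' | s, a=left) as a dict {s': probability}.

    Assumption (not from the text): outcomes that would leave the road
    are clipped to the nearest valid lane.
    """
    dist = {lane: 0.0 for lane in LANES}
    for offset, p in LEFT_OUTCOMES.items():
        s_next = min(max(s + offset, LANES[0]), LANES[-1])
        dist[s_next] += p
    return dist

def sample_next(s, rng):
    """Draw one next state s' from T(. | s, a=left)."""
    dist = transition_left(s)
    return rng.choice(list(dist.keys()), p=list(dist.values()))

rng = np.random.default_rng(0)
T_from_3 = transition_left(3)
draws = [int(sample_next(3, rng)) for _ in range(5000)]
frac_success = draws.count(2) / len(draws)   # empirical P(s' = 2 | s = 3, left)
```

Calling `transition_left` for each lane gives one distribution per state, which is the "different distribution for every (s, a) pair" idea in concrete form.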

For continuous states, conditional distributions are often parameterized as Gaussians whose mean depends on the conditioning variables:

T(s' | s, a) = N(s' | f(s, a), Σ)

Here f(s,a) is a function (could be a physics simulator, a neural network, or a simple linear model) that predicts the expected next state. Σ captures the noise. The entire validation pipeline hinges on getting these conditional distributions right.
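
For a concrete picture, here is a sketch of a Gaussian transition model with a simple kinematic mean function. The time step `DT`, the noise covariance `SIGMA`, and the state layout [position, velocity] are all hypothetical choices for illustration.

```python
import numpy as np

DT = 0.1  # hypothetical time step in seconds

def f(s, a):
    """Mean next state for s = [position, velocity] under acceleration a."""
    pos, vel = s
    return np.array([pos + vel * DT, vel + a * DT])

SIGMA = np.diag([0.01, 0.04])  # hypothetical process-noise covariance

def sample_transition(s, a, rng):
    """Draw s' ~ N(f(s, a), SIGMA)."""
    return rng.multivariate_normal(f(s, a), SIGMA)

rng = np.random.default_rng(0)
s0 = np.array([0.0, 10.0])
next_states = np.array([sample_transition(s0, 1.0, rng) for _ in range(2000)])
empirical_mean = next_states.mean(axis=0)   # should be close to f(s0, 1.0)
```

Swapping `f` for a physics simulator or a learned network changes the mean prediction but leaves the sampling code untouched.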

The physics analogy: T(s'|s,a) is the "physics engine" of your model. O(o|s) is the "sensor simulator." π(a|o) is the "brain simulator" for other agents. To build a complete validation environment, you need all three. Missing any one means your simulated trajectories are unrealistic.

Distribution	Conditioning on	What it models
T(s'|s,a)	State + action	How the world changes
O(o|s)	State	What the sensors see
π(a|o)	Observation	How the agent (or other agents) behave
p(s₁)	Nothing	Initial conditions

Check: Why is T(s'|s,a) a conditional distribution rather than a fixed function?

Because transitions are always deterministic
Because real-world transitions have uncertainty: the same state and action can lead to different next states
Because conditional distributions are easier to compute

Chapter 4: Maximum Likelihood

You have data: N measurements x₁, x₂, ..., x_N. You believe they come from a Gaussian with unknown mean μ and standard deviation σ. How do you find the best μ and σ?

The idea: choose parameters that make the observed data most probable. This is maximum likelihood estimation (MLE). Assuming the measurements are independent and identically distributed, the likelihood is the joint probability density of the data, viewed as a function of the parameters:

$$ L(\mu, \sigma) = \prod_{i=1}^{N} f(x_i \mid \mu, \sigma^2) $$

We want to maximize L. Since products are annoying, we take the log (the log is monotonic, so maximizing log L is the same as maximizing L):

$$ \log L(\mu, \sigma) = \sum_{i=1}^{N} \log f(x_i \mid \mu, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 $$

Taking derivatives and setting them to zero gives the MLE solutions:

$$ \hat{\mu}_{\mathrm{MLE}} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2 $$

The MLE mean is the sample mean. The MLE variance is the average squared deviation (it divides by N rather than N − 1, so it slightly underestimates the variance on small samples). These are the "best-fit" parameters: the Gaussian that gives the observed data the highest likelihood.
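
The derivative step can be written out explicitly. The following is a short derivation sketch starting from the Gaussian log-likelihood, differentiating with respect to μ and then with respect to σ² (treated as a single variable):

```latex
\begin{align*}
\frac{\partial \log L}{\partial \mu}
  &= \frac{1}{\sigma^2} \sum_{i=1}^{N} (x_i - \mu) = 0
  &&\Rightarrow\quad \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i \\
\frac{\partial \log L}{\partial \sigma^2}
  &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{N} (x_i - \mu)^2 = 0
  &&\Rightarrow\quad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2
\end{align*}
```

The second line is solved after substituting the estimate of μ from the first.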

The log-likelihood trick: Taking the log turns products into sums, which are easier to differentiate and numerically more stable (a product of many small densities underflows quickly). The trick appears throughout machine learning derivations, and in practice the log-likelihood is the object you maximize.

[Interactive figure: Fit a Gaussian (MLE). Sliders for μ (initial value 0.00) and σ (initial value 1.50) position a Gaussian over the data points (orange dots) while the log-likelihood updates; the Find MLE button runs gradient ascent to the optimum.]

Worked numerical example: Suppose your data is {2.1, 3.5, 2.8, 3.2, 2.5}. The MLE mean is (2.1+3.5+2.8+3.2+2.5)/5 = 2.82. The MLE variance is [(2.1−2.82)² + (3.5−2.82)² + (2.8−2.82)² + (3.2−2.82)² + (2.5−2.82)²]/5 = [0.5184 + 0.4624 + 0.0004 + 0.1444 + 0.1024]/5 = 0.2456. So σ̂ = √0.2456 ≈ 0.496.

python
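import numpy as np

data = np.array([2.1, 3.5, 2.8, 3.2, 2.5])
mu_mle = data.mean()                       # sample mean, 2.82
sig2_mle = ((data - mu_mle) ** 2).mean()   # divide-by-N variance, 0.2456

def log_likelihood(data, mu, sig2):
    """Gaussian log-likelihood of i.i.d. data for mean mu and variance sig2."""
    n = len(data)
    # np.sum keeps the reduction vectorized (the built-in sum loops in Python)
    return -n / 2 * np.log(2 * np.pi * sig2) - np.sum((data - mu) ** 2) / (2 * sig2)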

Check: What does "maximum likelihood" mean in plain English?

Find the data that maximizes the model's accuracy
Find the parameters that make the observed data most probable under the model
Find the model with the most parameters

Chapter 5: Bayesian Learning

MLE gives you a single best estimate. But what if you have prior knowledge? And what if you want to know how uncertain you are about the parameters? Bayesian learning answers both questions by treating the parameters themselves as random variables with a distribution.

Bayes' theorem says:

$$ p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta) \, p(\theta)}{p(\text{data})} $$

In words: the posterior (what we believe after seeing data) is proportional to the likelihood (how probable the data is given θ) times the prior (what we believed before seeing data). The denominator p(data) is just a normalizing constant.

The coin-flipping example: Suppose you find a coin and want to know if it is fair. Your prior: P(heads) = θ, where θ ~ Beta(2, 2) (you mildly believe it is fair). You flip 10 times and get 7 heads. The posterior is Beta(2+7, 2+3) = Beta(9, 5). The posterior mean is 9/14 ≈ 0.643 — the data shifted you toward believing it is slightly biased, but your prior pulled you back from the MLE of 0.7. With more data, the prior matters less and the posterior converges to the MLE.

The Beta distribution is the natural prior for a probability parameter θ ∈ [0, 1]. It is parameterized by α and β:

$$ \mathrm{Beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} $$

After observing h heads and t tails, the posterior is Beta(α + h, β + t). This is called a conjugate prior — the posterior has the same form as the prior, with updated parameters. It makes Bayesian computation tractable.
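
The conjugate update is simple enough to sketch directly. The helper names below are illustrative; the numbers reproduce the coin example (prior Beta(2, 2), 7 heads and 3 tails), plus a hypothetical larger sample with the same 70% head rate to show the prior washing out.

```python
def beta_update(alpha, beta, heads, tails):
    """Posterior parameters after observing the given heads and tails."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

prior = (2, 2)
posterior = beta_update(*prior, heads=7, tails=3)       # Beta(9, 5)
post_mean = beta_mean(*posterior)                        # 9/14, between prior mean and MLE
mle = 7 / 10

# Hypothetical larger sample with the same head rate: the prior matters less.
big_posterior = beta_update(*prior, heads=700, tails=300)
big_mean = beta_mean(*big_posterior)
```

Comparing `beta_variance` for the two posteriors shows the distribution tightening as data accumulates.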

[Interactive figure: Bayesian Coin Flipping. Each flip updates the posterior Beta distribution; the prior (dotted) is Beta(2,2) and the true bias is hidden.]

MLE vs. Bayes: MLE says "what single parameter value best explains the data?" Bayes says "given the data, what is the full distribution over parameter values?" The Bayesian answer is richer: it tells you not just the best guess, but how confident you should be. With little data, the prior dominates. With lots of data, the prior washes out and the posterior concentrates around the MLE.

Check: In the Beta-Bernoulli model, what happens to the posterior as you observe more and more data?

It concentrates more tightly around the true value, with the prior having less and less influence
It always stays equal to the prior
It becomes uniform regardless of the data

Chapter 6: Overfitting

A model that fits the training data perfectly is not necessarily a good model. Consider fitting a polynomial to 10 data points. A degree-9 polynomial passes through every single point — zero training error. But between the points, it oscillates wildly. On new data, it performs terribly. This is overfitting: the model has memorized the noise in the training data instead of learning the underlying pattern.

The parable of the exam: A student who memorizes every answer from past exams (training data) will score 100% on those exams. But give them a new exam (test data) and they fail, because they never learned the concepts. Overfitting is memorization without understanding. The validation analog: a model that perfectly reproduces logged driving data may fail on novel scenarios it has never seen.

We detect overfitting by measuring performance on held-out test data — data the model was not trained on. The gap between training performance and test performance is the generalization gap. A large gap signals overfitting.
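
The gap can be seen in a small polynomial experiment. This is a sketch under stated assumptions: the underlying curve sin(πx) is a hypothetical choice, and a deterministic alternating perturbation of ±0.2 stands in for noise so that the result is reproducible without a random seed.

```python
import numpy as np

def true_fn(x):
    """Hypothetical underlying pattern."""
    return np.sin(np.pi * x)

# Ten training points; an alternating +/-0.2 perturbation stands in for noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_fn(x_train) + 0.2 * (-1.0) ** np.arange(10)

# Held-out points, compared against the clean curve.
x_test = np.linspace(-1.0, 1.0, 401)
y_test = true_fn(x_test)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

# Generalization gap: held-out error minus training error.
gap = {d: test - train for d, (train, test) in results.items()}
```

The degree-9 fit drives the training error to essentially zero because it interpolates all ten points, yet its held-out error is the worst of the three.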

[Interactive figure: Polynomial Fitting, Overfitting in Action. A Degree slider (initial value 3) changes the polynomial order. Low degree underfits (misses the pattern); high degree overfits (passes through every point but oscillates wildly).]

Cross-validation is a systematic way to tune model complexity. Split the data into K folds. Train on K−1 folds, test on the held-out fold. Repeat K times and average the test error. The model complexity (degree, number of parameters, regularization strength) that minimizes cross-validation error is the best trade-off between underfitting and overfitting.
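
The procedure above can be sketched for polynomial degree selection. The data-generating curve, noise level, and sample size are hypothetical, and `kfold_cv_error` is an illustrative helper rather than a library function.

```python
import numpy as np

def kfold_cv_error(x, y, degree, k, rng):
    """Average held-out MSE of a degree-`degree` polynomial over k folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Hypothetical noisy data from a smooth curve.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, 30)

# Re-seeding per degree gives every degree the same fold split.
cv_error = {d: kfold_cv_error(x, y, d, k=5, rng=np.random.default_rng(1))
            for d in range(1, 10)}
best_degree = min(cv_error, key=cv_error.get)
```

Using the same fold split for every candidate keeps the comparison fair: differences in `cv_error` then reflect model complexity rather than a lucky partition.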

For validation models specifically: If your transition model T(s'|s,a) is overfit to training trajectories, it will produce unrealistically narrow simulated distributions. Rare events that should be possible in the model will have near-zero probability. Your validation will miss failure modes that exist in reality but not in your overfit model. Overfitting is not just a prediction problem — it is a safety problem.

Check: How do you detect overfitting?

By comparing training performance to held-out test performance: a large gap indicates overfitting
By checking if the model has many parameters
By running the model for a longer time

Chapter 7: Cloning Agents

To validate an autonomous vehicle, you need to simulate not just your car, but all the other drivers on the road. Their policies π(a|o) are unknown — you cannot look inside their heads. But you have logged data: recordings of what they did in real traffic. Can you learn a policy from this data?

Behavioral cloning (BC) does exactly this: treat the logged observation-action pairs as supervised learning data, and train a model to predict actions from observations. Given data {(o₁, a₁), ..., (o_N, a_N)}, minimize:

$$ L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\pi_\theta(o_i), a_i) $$

where ℓ is a loss function (cross-entropy for discrete actions, MSE for continuous).
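
For continuous actions with an MSE loss and a linear policy class, minimizing this objective reduces to ordinary least squares. The sketch below assumes a hypothetical expert (a proportional steering rule) and synthetic logged data; neither comes from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_policy(o):
    """Hypothetical expert: steer back toward the lane center."""
    return -1.5 * o

# Synthetic log of observation-action pairs with small action noise.
obs = rng.uniform(-1.0, 1.0, size=500)
actions = expert_policy(obs) + rng.normal(0.0, 0.05, size=500)

# Linear policy pi_theta(o) = theta[0] * o + theta[1]; least squares minimizes the MSE.
features = np.column_stack([obs, np.ones_like(obs)])
theta, *_ = np.linalg.lstsq(features, actions, rcond=None)

def cloned_policy(o):
    """Policy recovered by behavioral cloning."""
    return theta[0] * o + theta[1]

bc_loss = float(np.mean((cloned_policy(obs) - actions) ** 2))
```

With a neural network policy the same loss would be minimized by gradient descent instead of a closed-form solve.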

The cascading error problem: Behavioral cloning works well when the agent stays near the training distribution. But small errors compound: the cloned agent drifts slightly from the expert's trajectory at time 1. At time 2, it sees a slightly different observation (one it was never trained on), makes a slightly larger error, drifts further, and so on. By time 100, the agent can be in a completely unfamiliar part of the state space, where its decisions are essentially arbitrary. This is called compounding error, and its root cause is distribution shift: the states the clone visits differ from the states in the training data.

[Interactive figure: Behavioral Cloning, Compounding Error. The expert (teal) follows a path; the cloned agent (orange) starts on the same path but accumulates small errors. A Noise slider (initial value 0.10) sets the per-step error magnitude.]

DAgger (Dataset Aggregation) is the fix. After the cloned agent makes errors and drifts, you ask the expert to label the new states the agent visits. Add these to the dataset and retrain. Iterate. This progressively covers the states the cloned agent actually encounters, closing the distribution shift gap.
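
A toy version of the loop can make the mechanics concrete. Everything in this sketch is a hypothetical setup chosen for illustration: a scalar state, a proportional-controller expert, a constant drift, and a 1-nearest-neighbor learner. The initial demonstrations cover only the expert's steady state, so the first clone has no idea what to do far from it.

```python
import numpy as np

DRIFT = 0.2      # hypothetical constant disturbance
HORIZON = 20

def expert_action(s):
    """Hypothetical expert: proportional controller."""
    return -0.8 * s

def step(s, a):
    """Toy dynamics: state plus action plus drift."""
    return s + a + DRIFT

def nearest_neighbor_policy(states, actions):
    """1-nearest-neighbor imitation policy built from the dataset."""
    states, actions = np.asarray(states), np.asarray(actions)
    return lambda s: actions[np.argmin(np.abs(states - s))]

def rollout(policy, s0):
    """Run a policy for HORIZON steps; return visited states and the final state."""
    visited, s = [], s0
    for _ in range(HORIZON):
        visited.append(s)
        s = step(s, policy(s))
    return visited, s

# Initial demonstrations: the expert starts at its own steady state (s = 0.25).
demo_states, _ = rollout(expert_action, 0.25)
data_s = list(demo_states)
data_a = [expert_action(s) for s in demo_states]

_, expert_final = rollout(expert_action, 4.0)
final_errors = []
for _ in range(4):                                   # DAgger iterations
    policy = nearest_neighbor_policy(data_s, data_a)
    visited, s_final = rollout(policy, 4.0)          # the learner drives
    final_errors.append(abs(s_final - expert_final))
    data_s += visited                                # aggregate visited states
    data_a += [expert_action(s) for s in visited]    # expert labels them
```

Each iteration adds expert labels exactly where the learner got stuck, so the distance between the learner's final state and the expert's shrinks from one iteration to the next.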

For validation: If you model other drivers with behavioral cloning, beware of cascading errors. A BC driver model that diverges from reality after 20 time steps will produce unrealistic trajectories — your validation will be testing against ghost traffic that does not behave like real drivers.

Check: Why does behavioral cloning suffer from compounding errors over long trajectories?

Because small per-step errors accumulate, pushing the agent into states it was never trained on, causing even larger errors
Because the neural network forgets over time
Because the loss function is wrong

Chapter 8: Behavior Models

Not every agent needs to be cloned from data. Sometimes we can build behavior models from first principles using decision-theoretic assumptions about how agents act. The key idea: agents are approximately rational, but not perfectly so.

The softmax response model says an agent picks actions with probability proportional to the exponentiated value of each action:

$$ \pi(a \mid s) = \frac{\exp(\lambda \, Q(s, a))}{\sum_{a'} \exp(\lambda \, Q(s, a'))} $$

Here Q(s, a) is the value of taking action a in state s, and λ is the rationality parameter. When λ = 0, the agent is completely random (uniform over actions). When λ → ∞, the agent always picks the best action (perfectly rational). Real agents live somewhere in between.

The λ dial: Think of λ as a "how much does this agent care?" knob. A distracted driver (λ low) drifts between lanes semi-randomly. An alert, experienced driver (λ high) makes near-optimal decisions. For validation, you want to simulate drivers across the full range — the worst-case failures often come from low-λ (irrational) other agents.

Level-k reasoning takes this further. A level-0 agent acts randomly. A level-1 agent best-responds to level-0 agents. A level-2 agent best-responds to level-1 agents. And so on. This hierarchy captures the idea that not everyone thinks the same number of steps ahead.
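
The hierarchy can be sketched with a small symmetric two-player game. The merge scenario and payoff numbers below are hypothetical; `PAYOFF[i, j]` is the payoff to an agent playing action i against an opponent playing action j.

```python
import numpy as np

ACTIONS = ["yield", "go"]
# Hypothetical merge game: both yielding wastes time, both going collides.
PAYOFF = np.array([[-1.0, 0.0],
                   [2.0, -10.0]])

def level_k_policy(k):
    """Action distribution of a level-k agent in the symmetric game."""
    if k == 0:
        return np.full(len(ACTIONS), 1.0 / len(ACTIONS))   # level-0: uniform random
    opponent = level_k_policy(k - 1)
    expected = PAYOFF @ opponent          # expected payoff of each of my actions
    best = np.zeros(len(ACTIONS))
    best[np.argmax(expected)] = 1.0       # deterministic best response
    return best

choices = {k: ACTIONS[int(np.argmax(level_k_policy(k)))] for k in (1, 2, 3)}
```

In this game the level-1 agent yields (a random opponent might go), the level-2 agent goes (it expects a yielding opponent), and the level-3 agent yields again, so the depth of reasoning visibly changes behavior.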

[Interactive figure: Softmax Rationality, the λ Parameter. Three actions with values Q(a₁)=3, Q(a₂)=1, Q(a₃)=2; a λ slider (initial value 1.0) shifts the action probabilities from uniform (random) to peaked (rational).]

λ	Behavior	Interpretation
0	Uniform random	Completely irrational
0.5	Slight preference for good actions	Inattentive driver
2	Strong preference for good actions	Normal driver
∞	Always picks the best action	Perfectly rational
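
The rows of the table can be reproduced numerically. This sketch uses the action values Q = (3, 1, 2) and a large finite λ (50) as a stand-in for λ → ∞; the function name `softmax_policy` is illustrative.

```python
import numpy as np

def softmax_policy(q_values, lam):
    """pi(a|s) proportional to exp(lam * Q(s, a)), computed stably."""
    z = lam * np.asarray(q_values, dtype=float)
    z -= z.max()                 # subtracting a constant leaves the ratios unchanged
    w = np.exp(z)
    return w / w.sum()

Q = [3.0, 1.0, 2.0]
policies = {lam: softmax_policy(Q, lam) for lam in (0.0, 0.5, 2.0, 50.0)}

for lam, probs in policies.items():
    print(f"lambda = {lam:>4}: {np.round(probs, 3)}")
```

The probability of the best action rises from 1/3 at λ = 0 to about 0.51 at λ = 0.5 and about 0.87 at λ = 2, and is essentially 1 at λ = 50.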

Check: What does the rationality parameter λ control in the softmax model?

The number of actions available
The reward for each action
How strongly the agent's action probabilities concentrate on the highest-value action

Chapter 9: Validating Models

You built a model. But is it any good? A model that does not match reality is worse than useless — it gives false confidence. This chapter covers three tools for checking model quality.

Q-Q plots (quantile-quantile plots) compare the quantiles of model outputs against the quantiles of real data. If the model is correct, the points fall on the diagonal line y = x. Deviations reveal systematic mismatches: a straight line shifted off the diagonal means the mean is wrong, a straight line with slope different from 1 means the variance is wrong, and an S-shaped or curved plot means the tails or the skewness are wrong.
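
The points of a Q-Q plot are just matched quantiles, so they can be computed without any plotting library. In this sketch the "real" data and both candidate models are synthetic Gaussians chosen for illustration, and a line fitted through the points summarizes the mismatch.

```python
import numpy as np

def qq_points(model_samples, real_samples, n_quantiles=99):
    """Matched quantiles of the two samples (x: model, y: real data)."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(model_samples, probs), np.quantile(real_samples, probs)

def qq_line(model_samples, real_samples):
    """Slope and intercept of a straight line fitted through the Q-Q points."""
    mq, rq = qq_points(model_samples, real_samples)
    slope, intercept = np.polyfit(mq, rq, 1)
    return slope, intercept

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 20_000)            # stand-in for real data
good_model = rng.normal(0.0, 1.0, 20_000)      # matches the data
narrow_model = rng.normal(0.0, 0.5, 20_000)    # spread is too small

slope_good, intercept_good = qq_line(good_model, real)
slope_narrow, intercept_narrow = qq_line(narrow_model, real)
```

The matching model gives a slope near 1, while the model whose spread is too narrow gives a slope near 2: real quantiles are twice as far from the center as the model predicts.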

Calibration plots check probabilistic predictions. If the model says P(crash) = 0.3, then among all situations where it predicted 0.3, about 30% should actually be crashes. Plotting observed frequency (vertical axis) against predicted probability (horizontal axis), a well-calibrated model has its calibration curve on the diagonal. Where the curve falls below the diagonal the model overpredicts the event; where it rises above, the model underpredicts it.
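
A calibration curve is computed by binning predictions and comparing each bin's average prediction with the observed event frequency. The data here are synthetic, and the "underpredicting" model (which reports the square of the true probability) is a hypothetical example of miscalibration.

```python
import numpy as np

def calibration_curve(pred_probs, outcomes, n_bins=10):
    """Mean predicted probability and observed frequency per non-empty bin."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(pred_probs, edges) - 1, 0, n_bins - 1)
    mean_pred, obs_freq = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(pred_probs[mask].mean())
            obs_freq.append(outcomes[mask].mean())
    return np.array(mean_pred), np.array(obs_freq)

rng = np.random.default_rng(0)
true_p = rng.uniform(0.0, 1.0, 50_000)        # hypothetical true crash probabilities
crashes = rng.random(50_000) < true_p         # simulated outcomes

pred_ok, freq_ok = calibration_curve(true_p, crashes)            # calibrated model
pred_low, freq_low = calibration_curve(true_p ** 2, crashes)     # underpredicts risk

gap_ok = float(np.max(np.abs(pred_ok - freq_ok)))
gap_low = float(np.max(np.abs(pred_low - freq_low)))
```

The calibrated model stays within a few percentage points of the diagonal in every bin, while the underpredicting model is off by more than twenty points in the middle of the range.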

KL divergence measures how different two distributions are:

$$ D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} $$

If p is the true distribution and q is the model, D_KL(p || q) = 0 means a perfect match. Larger values mean the model is farther from reality. KL divergence is asymmetric: it penalizes the model heavily for assigning low probability to events that are actually common (q(x) small where p(x) is large).

Why KL asymmetry matters for safety: If the model assigns near-zero probability to a scenario that actually occurs (p(x) > 0, q(x) ≈ 0), the corresponding term of the KL divergence becomes very large, and it is infinite when q(x) = 0 exactly. This is exactly the dangerous case for validation: the model says "this will never happen" but reality disagrees. A model that underestimates the probability of rare but dangerous events is the most harmful kind of model error.
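
The asymmetry is easy to see on a small discrete example. The five-outcome distributions below are hypothetical; `q_blind` is a model that assigns zero probability to the two rarest outcomes.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0                          # terms with p(x) = 0 contribute nothing
    if np.any(q[support] == 0):
        return float("inf")                  # q rules out something p allows
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p_true = np.array([0.50, 0.30, 0.15, 0.04, 0.01])    # hypothetical reality
q_flat = np.full(5, 0.2)                             # model: all outcomes equally likely
q_blind = np.array([0.50, 0.30, 0.20, 0.00, 0.00])   # model: rare outcomes impossible

kl_self = kl_divergence(p_true, p_true)
kl_flat = kl_divergence(p_true, q_flat)
kl_flat_reversed = kl_divergence(q_flat, p_true)
kl_blind = kl_divergence(p_true, q_blind)
kl_blind_reversed = kl_divergence(q_blind, p_true)
```

The blind model gets an infinite D_KL(p || q), yet the reversed divergence D_KL(q || p) is small and finite, which is why the direction of the divergence matters when the concern is missed rare events.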

[Interactive figure: KL Divergence, Model vs. Reality. The true distribution p (teal bars) has 5 outcomes; the model distribution q (orange bars) is adjustable, and D_KL(p || q) updates as q changes. Setting q to zero on a likely outcome makes the divergence blow up.]

The model validation checklist: (1) Q-Q plot: are the distributions aligned? (2) Calibration: are probabilistic predictions reliable? (3) KL divergence: how far is the model from reality overall? (4) Domain expert review: do the simulated scenarios look realistic? All four should be checked before trusting a model for validation.

Check: Why is KL divergence especially relevant for safety-critical validation models?

Because it penalizes most heavily when the model assigns near-zero probability to events that are actually possible, which is exactly the dangerous case for validation
Because it is the fastest metric to compute
Because it always equals zero for Gaussian models

Chapter 10: Summary

This chapter built the modeling toolkit. To validate a system in simulation, you need models of the world (transitions), the sensors (observations), and other agents (policies). Each model is a probability distribution estimated from data, and each must be validated against reality before trusting it.

Technique	What it does	Strength	Weakness
MLE	Point estimate of parameters	Simple, efficient	No uncertainty quantification
Bayesian learning	Full posterior over parameters	Handles small data, quantifies uncertainty	Requires prior, computation cost
GMM / Flows	Complex distribution shapes	Captures multimodality	More parameters to estimate
Behavioral cloning	Learns policy from demonstrations	Simple supervised learning	Compounding errors over time
Softmax model	Parameterized rationality	Interpretable, controllable	Assumes Q-values are known

What comes next:

Chapter 3 tackles specification: how do you formally write down what "safe" means? Metrics, temporal logic, reachability sets — the language of requirements.

The modeling pipeline:

Data → Choose distribution family → Fit parameters (MLE/Bayes) → Check model (Q-Q, calibration, KL) → Use for validation. If the check fails, go back and try a richer model.

The big picture: A validation result is only as trustworthy as the model it was computed in. An elegant falsification algorithm running on a bad model is like a brilliant detective investigating the wrong crime scene. This chapter ensures you build the right scene before the investigation begins.

"All models are wrong, but some are useful."
— George E. P. Box, 1976

Check: Why must a model be validated before using it for system validation?

Because models are always wrong
Because faster models are always better
Because validation results are only trustworthy if the model faithfully represents reality: a bad model leads to false confidence