Kochenderfer et al., Chapter 12

Runtime Monitoring

Offline validation ends when deployment begins. Runtime monitoring keeps watch after.

Prerequisites: Chapters 1–3 (validation framework, system modeling, property specs). That's it.
12 chapters · 5+ simulations · 12 quizzes

Chapter 0: Why Runtime Monitoring?

You've tested your autonomous vehicle in 10 million simulated miles. You've formally verified the perception module against 500 adversarial scenarios. You've computed reachable sets for the planner. You ship the car.

Day one. Rain. A garbage bag blows across the road. The lidar returns a point cloud it's never seen in training. The camera sees a dark, shiny, flapping shape that matches nothing in the object detector's vocabulary. The planner receives garbage observations and starts planning against a phantom object.

No amount of offline validation can anticipate every scenario the real world will throw at your system. The distribution shift between training/testing and deployment is the fundamental problem. Runtime monitoring exists to detect when the system is operating outside the conditions it was validated for.

The core principle: Offline validation tells you "the system is safe under these conditions." Runtime monitoring tells you "we are still inside those conditions." When the monitor triggers, the system should fall back to a safe mode — slow down, hand off to a human, or stop.

Runtime monitoring is complementary to offline validation, not a replacement. Think of it as the seat belt on top of the crash-tested frame. The frame (offline validation) protects you in known scenarios. The belt (runtime monitoring) catches the rest.

ODD Monitoring: Are we inside the validated operating conditions?
Uncertainty Monitoring: How confident is the model in its predictions?
Failure Monitoring: Is the system about to enter an unsafe state?

| Monitor type | Question it answers | Acts on |
|---|---|---|
| ODD monitor | Is this input in-distribution? | Inputs / observations |
| Uncertainty monitor | How confident is the model? | Model outputs |
| Failure monitor | Is a safety violation imminent? | System state trajectory |

Check: Why can't offline validation alone guarantee safety during deployment?

Chapter 1: Operational Design Domain

Before you can monitor whether you're inside the validated conditions, you need to define those conditions. The Operational Design Domain (ODD) is a formal specification of every condition under which the system is expected to operate safely.

An ODD for a self-driving car might include:

| Dimension | Range |
|---|---|
| Weather | Clear, light rain (no snow, fog, or heavy rain) |
| Time of day | Daytime (sun elevation > 10°) |
| Speed | ≤ 45 mph |
| Road type | Urban streets with lane markings |
| Sensor health | All cameras and lidar operational |

The ODD is typically defined by human engineers based on what scenarios were included in testing. If you only tested in dry weather, dry weather is your ODD for weather conditions. Expanding the ODD requires additional testing and validation.
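A human-readable ODD like the table above can be checked with a simple predicate. The following is a minimal sketch, assuming the current conditions are already available as labeled values; the class name, field names, and string labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical encoding of the example ODD table; labels are illustrative.
ALLOWED_WEATHER = {"clear", "light_rain"}
ALLOWED_ROADS = {"urban_marked"}


@dataclass(frozen=True)
class Conditions:
    weather: str
    sun_elevation_deg: float
    speed_mph: float
    road_type: str
    sensors_ok: bool


def odd_violations(c: Conditions) -> List[str]:
    """Return the names of the ODD dimensions that the conditions violate."""
    violations = []
    if c.weather not in ALLOWED_WEATHER:
        violations.append("weather")
    if c.sun_elevation_deg <= 10.0:  # daytime means sun elevation > 10 degrees
        violations.append("time_of_day")
    if c.speed_mph > 45.0:
        violations.append("speed")
    if c.road_type not in ALLOWED_ROADS:
        violations.append("road_type")
    if not c.sensors_ok:
        violations.append("sensor_health")
    return violations


inside = Conditions("clear", 35.0, 30.0, "urban_marked", True)
outside = Conditions("snow", 5.0, 50.0, "urban_marked", True)
print(odd_violations(inside))
print(odd_violations(outside))
```

Returning the list of violated dimensions, rather than a single boolean, makes it easier to log why the system left the ODD.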

Key insight: The ODD is not a description of where the system performs well. It's a description of where you have evidence that it performs well. A system might handle snow perfectly, but if you never tested in snow, snow is outside the ODD. This is a conservative, evidence-based approach.

In practice, defining the ODD in terms of human-readable conditions (weather, road type) is straightforward but coarse. A more principled approach defines the ODD directly in the input space — the space of sensor observations. This is what the k-NN and polytope monitors in the next two chapters do: they define the ODD as a region in observation space, based on the observations seen during validation.

ODD vs. validation scope: The ODD should be contained within the validation scope. If you validated in conditions A, B, and C, the ODD should cover at most A, B, and C. In practice, engineers often include a margin: if you validated up to 45 mph, the ODD might be set to 40 mph to provide a buffer.
Check: What does the ODD define?

Chapter 2: k-NN & Distance Monitors

The simplest way to check if a new observation is "in distribution" is to ask: how close is it to the observations we've seen before?

A k-NN ODD monitor works as follows. During validation, collect all the observations the system encountered: a set $D = \{z_1, z_2, \dots, z_N\}$ in observation space. At runtime, when a new observation z arrives, compute its distance to the k-th nearest neighbor in D. If that distance exceeds a threshold γ, flag the observation as out of distribution.

$$ \mathrm{monitor}(z) = \begin{cases} \text{in-ODD} & \text{if } d_k(z, D) \le \gamma \\ \text{out-of-ODD} & \text{if } d_k(z, D) > \gamma \end{cases} $$

Here $d_k(z, D)$ is the distance from z to its k-th nearest neighbor in D. Using k > 1 (instead of just the closest point) makes the monitor robust to outliers in the training set.
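The rule above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and a naive linear scan; the function names and the toy grid of validation points are made up for the example.

```python
import math


def kth_nn_distance(z, data, k):
    """Euclidean distance from z to its k-th nearest neighbor in data (naive O(N) scan)."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    dists = sorted(math.dist(z, p) for p in data)
    return dists[k - 1]


def knn_monitor(z, data, k, gamma):
    """Return True when z is in-ODD, i.e. d_k(z, D) <= gamma."""
    return kth_nn_distance(z, data, k) <= gamma


# Toy validation set D: a 5x5 grid with spacing 0.1 (illustrative data only).
D = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]

print(knn_monitor((0.2, 0.2), D, k=3, gamma=0.15))  # query inside the grid
print(knn_monitor((1.0, 1.0), D, k=3, gamma=0.15))  # query far from all data
```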

Why k-NN works: In-distribution observations live in dense regions of observation space — they are close to many training points. Out-of-distribution observations live in sparse regions — even their nearest neighbors are far away. The k-NN distance is a simple density proxy: low distance means high density means likely in-distribution.
Interactive demo (k-NN ODD Monitor): Blue dots are validation observations and the red point is a draggable query. The monitor computes the distance to the k-th nearest neighbor; inside the green zone the query is in-ODD, outside it is out-of-ODD. Controls: γ (threshold, default 0.15) and k (neighbors, default 3).

Choosing γ is a tradeoff. A small γ makes the ODD tight — only observations very close to training data pass. This is safe (few false negatives) but may trigger false alarms on valid observations. A large γ is permissive — fewer false alarms, but some out-of-distribution observations may slip through.

Computational cost: Naive k-NN search over N training points costs O(N) distance computations per query, which can be too slow for real-time monitoring when N is large. Practical implementations use spatial indexes (KD-trees, ball trees) or approximate nearest-neighbor methods (such as locality-sensitive hashing), which bring typical query time down to roughly O(log N) in low-dimensional spaces; the speedup degrades as the dimension grows. Some systems precompute a grid-based approximation of the ODD boundary for O(1) lookup.
Check: In a k-NN ODD monitor, what triggers an out-of-distribution alert?

Chapter 3: Polytope & Density Monitors

k-NN monitors check distance to individual points. An alternative is to define the ODD as a geometric region and check whether new observations fall inside it.

The simplest region is the convex hull of the training data — the smallest convex set containing all training points. A new observation is in-ODD if it's inside the hull, and out-of-ODD if it's outside.

Convex hull limitation: Real data distributions are rarely convex. Imagine training data forming a crescent shape. The convex hull fills in the concavity, marking a large region of empty space as "in-distribution." This leads to false negatives — out-of-distribution points that pass the monitor.

A refinement: clustered convex hulls. First cluster the training data (e.g., with k-means), then compute a separate convex hull for each cluster. A new observation is in-ODD if it's inside any of the cluster hulls. This captures non-convex distributions better, at the cost of choosing the number of clusters.

Superlevel set monitors go further. Fit a density model (kernel density estimation, Gaussian mixture) to the training data. The ODD is the set of points where the estimated density exceeds a threshold: {z : p̂(z) ≥ τ}. This adapts to non-convex and multimodal shapes, provided the density model fits the data well.

ODD = { z : p̂(z) ≥ τ }
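A superlevel set monitor can be sketched with a hand-rolled Gaussian kernel density estimate. This is only an illustration: the isotropic kernel, the bandwidth of 0.2, the threshold τ = 0.1, and the two-cluster toy data are all assumptions, not values from the text.

```python
import math


def kde_density(z, data, bandwidth):
    """Gaussian kernel density estimate at z (isotropic kernel, any dimension)."""
    d = len(z)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = sum(
        math.exp(-math.dist(z, p) ** 2 / (2.0 * bandwidth ** 2)) for p in data
    )
    return total / (len(data) * norm)


def density_monitor(z, data, bandwidth, tau):
    """Return True when z lies in the superlevel set {z : p_hat(z) >= tau}."""
    return kde_density(z, data, bandwidth) >= tau


# Two separated clusters (illustrative). The gap between them lies inside
# the convex hull of the data but has almost no density.
left = [(-1.0 + 0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
right = [(1.0 + 0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
data = left + right

print(density_monitor((-0.9, 0.1), data, bandwidth=0.2, tau=0.1))  # cluster center
print(density_monitor((0.1, 0.1), data, bandwidth=0.2, tau=0.1))   # empty gap
```

A convex hull monitor would accept the gap point, since it lies between the two clusters; the density monitor rejects it.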
| Method | Handles non-convex? | Computational cost | Tuning |
|---|---|---|---|
| Convex hull | No | Low (LP or membership test) | None |
| Clustered hulls | Partially | Medium | # clusters |
| Density superlevel | Yes | Higher | Bandwidth + τ |
| k-NN distance | Yes | Medium | k + γ |

Key insight: All these methods answer the same question — "is this observation similar to what we've seen?" — just with different definitions of "similar." Convex hulls define similarity geometrically. k-NN defines it by proximity. Density methods define it probabilistically. The best choice depends on the data distribution and computational budget.
Check: Why might a convex hull ODD monitor produce false negatives (miss out-of-distribution inputs)?

Chapter 4: High-Dimensional ODD

The methods above work beautifully in 2D or 3D. But real autonomous systems observe 100-dimensional lidar scans or 200,000-pixel images. In high dimensions, distance-based monitors face the curse of dimensionality: all points become approximately equidistant, and the k-NN distance loses its ability to distinguish in-distribution from out-of-distribution.
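The concentration of distances can be observed numerically. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative gap between the farthest and nearest point from a random query; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 20, 500):
    print(dim, round(distance_contrast(dim), 3))
```

As the dimension grows, the contrast shrinks toward zero, which is why a fixed threshold γ on raw distances stops separating near points from far ones.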

The solution is dimensionality reduction. Project the high-dimensional observations into a lower-dimensional space where distance is meaningful, then apply k-NN or density monitoring in that space.

| Method | Dimensionality reduction | Learns from data? |
|---|---|---|
| PCA | Linear projection to top principal components | Yes (unsupervised) |
| Autoencoder | Nonlinear encoder maps to bottleneck layer | Yes (unsupervised) |
| Pre-trained features | Use intermediate layer of the policy network | No extra training |

PCA (Principal Component Analysis) finds the directions of maximum variance in the data and projects onto the top d of them. It's fast and linear, but may miss nonlinear structure.

Autoencoders learn a nonlinear compression: an encoder maps the observation to a low-dimensional latent code, and a decoder reconstructs the observation from the code. If the autoencoder was trained on in-distribution data, out-of-distribution inputs will have high reconstruction error. This gives a second monitoring signal: large reconstruction error = likely out-of-ODD.

Feature collapse risk. If the autoencoder or PCA discards dimensions that matter for ODD monitoring, the monitor will miss certain distribution shifts. Example: PCA might discard a low-variance component that distinguishes "clear road" from "icy road" (both look similar in most dimensions but differ critically in surface reflectance). Always validate that the reduced features preserve the distinctions you care about.

A pragmatic approach: use the intermediate representations of the policy network itself. If the policy is a neural network, its hidden layers already compress the raw input into task-relevant features. Monitor in that feature space. This is free (no extra model to train) and automatically focuses on the features the policy uses.

Reconstruction-based monitoring: An autoencoder trained on in-distribution data learns to compress and reconstruct those observations well. An out-of-distribution observation will be poorly reconstructed (the autoencoder never learned patterns for it). The reconstruction error ||z − decode(encode(z))|| serves as a continuous out-of-distribution score.
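The reconstruction-error score can be illustrated without training a neural network. The sketch below swaps in a one-component PCA as a linear stand-in for the autoencoder: "encode" projects onto the principal direction and "decode" maps back. The 2D toy data and function names are assumptions made for the example.

```python
import math


def fit_pca_1d(data):
    """Fit the mean and top principal direction of 2D data (closed form for a 2x2 covariance)."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    a = sum((p[0] - mx) ** 2 for p in data) / n
    c = sum((p[1] - my) ** 2 for p in data) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in data) / n
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # angle of the major axis
    return (mx, my), (math.cos(theta), math.sin(theta))


def reconstruction_error(z, mean, direction):
    """||z - decode(encode(z))|| for the one-component linear model."""
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    code = dx * direction[0] + dy * direction[1]  # encode: 2D -> 1D
    rx = mean[0] + code * direction[0]            # decode: 1D -> 2D
    ry = mean[1] + code * direction[1]
    return math.hypot(z[0] - rx, z[1] - ry)


# In-distribution data lies close to the line y = x (illustrative).
train = [(0.1 * i, 0.1 * i + (0.01 if i % 2 else -0.01)) for i in range(20)]
mean, direction = fit_pca_1d(train)

print(reconstruction_error((0.5, 0.5), mean, direction))   # near the line
print(reconstruction_error((0.5, -0.5), mean, direction))  # far from the line
```

A real autoencoder replaces the linear projection with learned nonlinear maps, but the monitoring logic (threshold the reconstruction error) is the same.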
Check: Why do k-NN distance monitors struggle in high-dimensional observation spaces?

Chapter 5: Output Uncertainty

ODD monitors check the input. But you can also monitor the output — specifically, how uncertain the model is about its prediction.

Many models produce not just a point prediction, but a distribution over outputs. A regression model might predict μ = 2.5 with standard deviation σ = 0.3. A classifier might output probabilities [0.91, 0.06, 0.03] across three classes. The spread of this distribution measures uncertainty.

For a conditional Gaussian model that outputs μ(x) and σ(x):

$$ p(y \mid x) = \mathcal{N}\big(y;\ \mu(x),\ \sigma(x)^2\big) $$

The uncertainty is σ(x). When σ is large, the model is unsure. A natural monitor: flag when σ(x) exceeds a threshold.

For classifiers, entropy measures uncertainty. If the model outputs class probabilities $p_1, p_2, \dots, p_K$:

$$ H = -\sum_{k=1}^{K} p_k \log p_k $$

Entropy is maximized when all classes are equally likely (maximum confusion) and minimized when one class has probability 1 (full confidence). High entropy → uncertain → flag.
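A small sketch of an entropy-based flag, using natural logarithms. The first probability vector is the example from earlier in this chapter; the second vector and the function name are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = [0.91, 0.06, 0.03]
confused = [0.40, 0.30, 0.30]
uniform = [1 / 3, 1 / 3, 1 / 3]

print(entropy(confident))
print(entropy(confused))
print(entropy(uniform), math.log(3))  # the uniform case attains the maximum log K
```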

Confidence is not calibration. A model can be confidently wrong. A neural network trained only on cats and dogs may output 99% confidence "dog" when shown a photo of an airplane, not because it has evidence for "dog," but because softmax forces it to split all probability between its two classes. Output uncertainty monitoring is only as trustworthy as the model's calibration: its confidence should reflect its actual probability of being correct. The next chapter addresses this.

Proper scoring rules evaluate whether a model's uncertainty estimates are honest. A scoring rule S(p, y) assigns a score based on the predicted distribution p and the true outcome y. A scoring rule is proper if the expected score is maximized when p equals the true distribution. The log score $S(p, y) = \log p_y$ and the Brier score $S(p, y) = -\sum_k (p_k - \mathbb{1}[k = y])^2$ are both proper. Using a proper scoring rule as the training objective encourages the model to produce well-calibrated uncertainty estimates.
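Propriety can be checked numerically for a binary outcome. The sketch below assumes a true event probability of 0.7 (an arbitrary value) and searches a grid of reported probabilities for the one with the best expected score; the Brier score is written in its full multi-class form, summed over both classes.

```python
import math


def expected_log_score(q, p_true):
    """Expected log score when reporting q for an event with true probability p_true."""
    return p_true * math.log(q) + (1 - p_true) * math.log(1 - q)


def expected_brier_score(q, p_true):
    """Expected (negated) Brier score, summing squared errors over both classes."""
    score_if_event = -((q - 1) ** 2 + ((1 - q) - 0) ** 2)
    score_if_no_event = -((q - 0) ** 2 + ((1 - q) - 1) ** 2)
    return p_true * score_if_event + (1 - p_true) * score_if_no_event


P_TRUE = 0.7  # illustrative true probability
reports = [i / 100 for i in range(1, 100)]
best_log = max(reports, key=lambda q: expected_log_score(q, P_TRUE))
best_brier = max(reports, key=lambda q: expected_brier_score(q, P_TRUE))
print(best_log, best_brier)
```

Under both rules the best report is the true probability, so neither overstating nor understating confidence pays off in expectation.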

Check: Why can a neural network's output confidence be misleading for out-of-distribution detection?

Chapter 6: Calibration

A weather forecaster says "70% chance of rain" on 100 different days. If it actually rains on about 70 of those days, the forecaster is calibrated. If it rains on only 40, the forecaster is overconfident.

The same applies to neural networks. If the model says "90% confident" on 1,000 predictions, it should be correct on about 900 of them. If it's only correct on 700, the model is overconfident — its stated confidence exceeds its actual accuracy.

A reliability diagram visualizes calibration. Bin predictions by confidence level (0-10%, 10-20%, ..., 90-100%). For each bin, plot the average confidence (x-axis) against the actual accuracy (y-axis). A perfectly calibrated model falls on the diagonal. Points above the diagonal indicate underconfidence; below indicates overconfidence.

Interactive demo (Calibration: Reliability Diagram): A model's predictions, binned by confidence. The diagonal is perfect calibration. A λ (temperature) slider, default 1.00, applies temperature scaling so the bars move toward the diagonal, and the resulting ECE is displayed.

Temperature scaling is the simplest calibration method. Divide the model's logits (pre-softmax values) by a temperature parameter λ before applying softmax. λ > 1 makes the distribution softer (less confident). λ < 1 makes it sharper (more confident). The optimal λ minimizes the calibration error on a held-out validation set.

$$ \hat{p}_k = \mathrm{softmax}(z_k / \lambda) = \frac{\exp(z_k / \lambda)}{\sum_j \exp(z_j / \lambda)} $$

Expected Calibration Error (ECE) summarizes the reliability diagram in a single number:

$$ \mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \mathrm{acc}_b - \mathrm{conf}_b \right| $$

Where B is the number of bins, N is the total number of predictions, $n_b$ is the count in bin b, $\mathrm{acc}_b$ is the accuracy in bin b, and $\mathrm{conf}_b$ is the average confidence in bin b. ECE = 0 means perfect calibration.
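Both formulas are short enough to sketch directly. The code below assumes equal-width confidence bins and uses made-up logits and outcomes; the function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece


logits = [4.0, 1.0, 0.5]  # illustrative logits
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 2.0))  # softer, same argmax

# Ten predictions at 90% confidence, of which only seven are correct.
print(expected_calibration_error([0.9] * 10, [True] * 7 + [False] * 3))
```

In the last example all predictions fall in one bin with accuracy 0.7 and confidence 0.9, so the ECE equals the gap of 0.2.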

Key insight: Temperature scaling changes the model's stated confidence without changing its predictions (the argmax is the same). It's a post-hoc fix: train the model normally, then fit one scalar λ on validation data. One number fixes overconfidence across the entire output distribution. Remarkably effective for how simple it is.
Check: If a model says "90% confident" and is correct on only 70% of those predictions, what does temperature scaling do?

Chapter 7: Conformal Prediction

Temperature scaling fixes confidence. But what if you want something stronger — a prediction set with a statistical guarantee? "I predict the output is between 2.1 and 3.4, and I guarantee this interval contains the true value at least 90% of the time."

This is what conformal prediction provides. It wraps any predictive model in a calibration layer that produces prediction sets with guaranteed coverage.

The setup requires three ingredients:

1. A base model
Any model that makes predictions (need not be calibrated)
2. Calibration data
A held-out set with ground truth labels
3. Desired coverage 1 − α
e.g., α = 0.1 means 90% coverage

The procedure for regression:

Step 1: Compute nonconformity scores on the calibration set. The simplest score is the absolute residual:

$$ s_i = \left| y_i - \hat{\mu}(x_i) \right| $$

Step 2: Find the quantile threshold q. Sort the scores and take the ⌈(1 − α)(n + 1)⌉/n quantile:

$$ q = \mathrm{quantile}\!\left(s_1, \dots, s_n;\ \frac{\lceil (1 - \alpha)(n + 1) \rceil}{n}\right) $$

Step 3: At runtime, for a new input x, output the prediction set:

C(x) = [μ̂(x) − q, μ̂(x) + q]
The guarantee: If the calibration and test data are exchangeable (roughly: drawn from the same distribution), then P(y ∈ C(x)) ≥ 1 − α. This is a finite-sample, distribution-free guarantee, and it is marginal: it holds on average over calibration sets and test inputs, not for each individual x. It requires no assumptions about the model being correct or the data being Gaussian. The only assumption is exchangeability.
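The three steps can be sketched as follows. The quantile is implemented as the score of rank ⌈(1 − α)(n + 1)⌉ in sorted order; returning an infinite threshold when that rank exceeds n is a common convention assumed here, and the calibration scores are made up for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Score of rank ceil((1 - alpha)(n + 1)) among the sorted calibration scores."""
    n = len(scores)
    # The small tolerance guards against floating-point overshoot before ceil.
    rank = math.ceil((1 - alpha) * (n + 1) - 1e-9)
    if rank > n:
        return math.inf  # too few calibration points for the requested coverage
    return sorted(scores)[rank - 1]


def prediction_interval(mu_hat, q):
    """C(x) = [mu_hat(x) - q, mu_hat(x) + q]."""
    return (mu_hat - q, mu_hat + q)


# Illustrative absolute residuals |y_i - mu_hat(x_i)| on 19 calibration points.
scores = [0.1 * i for i in range(1, 20)]
q = conformal_quantile(scores, alpha=0.1)
print(q)
print(prediction_interval(2.5, 0.5))
print(conformal_quantile(scores[:5], alpha=0.1))  # only 5 points: threshold is infinite
```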
Interactive demo (Conformal Prediction): A regression model (blue curve) with a conformal prediction band. Gray dots are calibration points. A slider sets the coverage level 1 − α (default 0.90); lower α means wider bands and higher coverage. A histogram shows the calibration scores with the quantile threshold.

What to notice: As you increase 1 − α (demand higher coverage), the quantile threshold q increases, making the band wider. At 1 − α = 0.99, almost all test points fall inside. At 0.80, about 20% are outside. The empirical coverage tracks the target closely, but may vary slightly because we have a finite calibration set.

For classification, conformal prediction outputs a set of classes instead of an interval. The nonconformity score is typically $1 - p(y_i \mid x_i)$: one minus the model's confidence in the true class. The quantile threshold determines which classes are included: all classes k where $p(k \mid x) \ge 1 - q$.
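The classification variant follows the same pattern. In this sketch the calibration inputs are the model's probabilities for the true class of each calibration example; those values, the test-time probability vectors, and the function names are all illustrative.

```python
import math


def conformal_quantile(scores, alpha):
    """Score of rank ceil((1 - alpha)(n + 1)) among the sorted calibration scores."""
    n = len(scores)
    rank = math.ceil((1 - alpha) * (n + 1) - 1e-9)  # tolerance for float overshoot
    return math.inf if rank > n else sorted(scores)[rank - 1]


def conformal_class_set(probs, q):
    """All classes k with p(k | x) >= 1 - q."""
    return [k for k, p in enumerate(probs) if p >= 1 - q]


# Model probability assigned to the TRUE class on 9 calibration examples (made up).
true_class_probs = [0.9, 0.8, 0.95, 0.6, 0.7, 0.85, 0.5, 0.99, 0.75]
scores = [1 - p for p in true_class_probs]
q = conformal_quantile(scores, alpha=0.2)

print(q)
print(conformal_class_set([0.62, 0.30, 0.08], q))
print(conformal_class_set([0.60, 0.30, 0.10], 0.75))  # a larger q admits more classes
```

Note that with this score the set can be empty when no class reaches the threshold 1 − q, which is itself a useful signal that the input is unusual.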

Variable-width intervals: The basic method above uses a fixed width q for all inputs. More sophisticated variants (like CQR — Conformalized Quantile Regression) produce intervals that are wider where the model is uncertain and narrower where it's confident, while maintaining the same coverage guarantee. This gives more informative prediction sets.
Check: What assumption does conformal prediction require to guarantee its coverage bound?

Chapter 8: Model Uncertainty & Ensembles

There are two fundamentally different kinds of uncertainty. A coin flip has aleatoric (data) uncertainty — even if you know everything about the coin, the outcome is random. But uncertainty about which model is correct is epistemic (model) uncertainty — and this can be reduced with more data.

Aleatoric (data) uncertainty:

Inherent noise in the data. Cannot be reduced by collecting more data. Example: measurement noise from a noisy sensor.

Epistemic (model) uncertainty:

Uncertainty about which model is correct. Can be reduced with more data. Example: predictions in a region with few training points.

For runtime monitoring, epistemic uncertainty is the one we care about. High epistemic uncertainty means "we haven't seen data like this before" — exactly the signal we need to detect distribution shift.

Ensemble methods estimate epistemic uncertainty by training multiple models and measuring their disagreement. If 5 models all predict y ≈ 2.5, epistemic uncertainty is low. If they predict 2.1, 3.4, 1.8, 2.9, and 4.0, epistemic uncertainty is high.

$$ \bar{\mu} = \frac{1}{M} \sum_{m=1}^{M} \mu_m(x) \qquad \sigma^2_{\text{epistemic}} = \frac{1}{M} \sum_{m=1}^{M} \big( \mu_m(x) - \bar{\mu} \big)^2 $$

The ensemble mean μ̄ is the prediction. The ensemble variance (disagreement) is the epistemic uncertainty estimate. This can be viewed as an approximation to Bayesian model averaging, with the ensemble members playing the role of a finite set of samples from the posterior.
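Applied to the two five-model examples given earlier in this chapter, the formulas work out as follows. The function name is illustrative; the predictions are the ones quoted in the text.

```python
def ensemble_stats(predictions):
    """Return (ensemble mean, ensemble variance) using the 1/M normalization."""
    m = len(predictions)
    mean = sum(predictions) / m
    variance = sum((p - mean) ** 2 for p in predictions) / m
    return mean, variance


agree = [2.5, 2.5, 2.5, 2.5, 2.5]
disagree = [2.1, 3.4, 1.8, 2.9, 4.0]

print(ensemble_stats(agree))     # zero variance: low epistemic uncertainty
print(ensemble_stats(disagree))  # mean about 2.84, variance about 0.658
```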

Why ensembles work for OOD detection: In-distribution inputs land in regions where all models were trained, so they agree. Out-of-distribution inputs land in regions where each model extrapolated differently (based on random initialization, data order, etc.), so they disagree. Disagreement is a reliable OOD signal.
Cost: An ensemble of M models costs M times the training, memory, and inference computation of a single model. For real-time systems, M = 3-5 is typical. MC Dropout provides an approximation that is cheaper in training and memory: run the same model M times with dropout enabled at inference, producing M stochastic predictions. The variance across runs approximates epistemic uncertainty, though inference still costs roughly M forward passes through one model.
Check: Which type of uncertainty signals "we haven't seen data like this before"?

Chapter 9: Failure Monitoring

ODD monitors check inputs. Uncertainty monitors check predictions. But sometimes neither is enough. The inputs can be in-distribution, the model can be confident, and the system can still be heading toward failure — because the sequence of states is leading to a dangerous region.

Failure monitors check the system's trajectory, not individual observations. They ask: "Given the current state and trajectory, is a safety violation imminent?"

Three approaches, in order of increasing sophistication:

1. Heuristic Rules

Hand-coded safety checks: "If speed > 40 AND distance_to_obstacle < 5m, trigger alert." Simple, fast, interpretable, but incomplete. Every rule you write is a rule you thought of. The failures you miss are the ones you didn't anticipate.

2. Online Reachability Analysis

Compute the reachable set from the current state over a short time horizon (e.g., 2 seconds). If any state in the reachable set violates a safety constraint, trigger an alert. This is sound (no imminent violation is missed) as long as the computed set is exact or an over-approximation of the true reachable set, but it is expensive: reachability analysis in real time usually requires simplified dynamics models.

$$ R(t) = \{\, s(t) : s(0) = s_{\text{current}},\ s(\tau) \text{ follows the dynamics for all } \tau \in [0, t] \,\} $$

If R(t) ∩ Unsafe ≠ ∅ for any t ≤ T, the monitor triggers.
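For a very simple model, the reachable set has a closed form. The sketch below assumes a 1-D double integrator (position and velocity) whose acceleration is bounded in [a_min, a_max], with the unsafe set being every position at or beyond an obstacle. The dynamics, the parameter values, and the function names are assumptions for illustration, and the horizon is checked at discrete time steps rather than continuously.

```python
def reachable_positions(pos, vel, a_min, a_max, t):
    """Interval of positions reachable at time t with acceleration in [a_min, a_max]."""
    lo = pos + vel * t + 0.5 * a_min * t * t
    hi = pos + vel * t + 0.5 * a_max * t * t
    return lo, hi


def reachability_monitor(pos, vel, a_min, a_max, obstacle_pos, horizon, dt=0.1):
    """Trigger if the reachable interval touches the obstacle at any sampled t <= horizon."""
    steps = int(round(horizon / dt))
    for i in range(steps + 1):
        _, hi = reachable_positions(pos, vel, a_min, a_max, i * dt)
        if hi >= obstacle_pos:
            return True
    return False


# Vehicle at 0 m moving at 10 m/s, acceleration in [-6, 2] m/s^2, 2 s horizon.
print(reachable_positions(0.0, 10.0, -6.0, 2.0, 2.0))
print(reachability_monitor(0.0, 10.0, -6.0, 2.0, obstacle_pos=30.0, horizon=2.0))
print(reachability_monitor(0.0, 10.0, -6.0, 2.0, obstacle_pos=20.0, horizon=2.0))
```

Because the check considers every admissible acceleration, it is conservative: it triggers whenever some control sequence could reach the obstacle, not only when the planned one would.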

3. Lookup Tables

Precompute reachability offline. Discretize the state space and store the result of the safety analysis for each cell. At runtime, look up the current state — O(1) time. The tradeoff: you need exponential storage for high-dimensional state spaces, and the discretization introduces approximation error.
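A lookup-table monitor might look like the sketch below. It assumes a 2-D state (distance to obstacle, speed), a constant braking deceleration of 6 m/s², and a safety rule based on stopping distance v²/(2·brake); all of these, along with the grid sizes, are illustrative choices. Each cell is evaluated offline at its worst-case corner so that discretization errs on the side of caution.

```python
BRAKE = 6.0                  # assumed braking deceleration in m/s^2
D_STEP, V_STEP = 1.0, 1.0    # cell size in meters and m/s
N_D, N_V = 100, 40           # number of cells per dimension


def build_table():
    """Offline: mark a cell safe only if its worst-case corner can stop in time."""
    table = []
    for i in range(N_D):
        row = []
        for j in range(N_V):
            d_worst = i * D_STEP        # smallest distance in the cell
            v_worst = (j + 1) * V_STEP  # largest speed in the cell
            row.append(v_worst ** 2 / (2.0 * BRAKE) < d_worst)
        table.append(row)
    return table


def lookup_safe(table, distance, speed):
    """Runtime: O(1) lookup; states outside the table are treated as unsafe."""
    i = int(distance // D_STEP)
    j = int(speed // V_STEP)
    if not (0 <= i < N_D and 0 <= j < N_V):
        return False
    return table[i][j]


TABLE = build_table()
print(lookup_safe(TABLE, distance=50.0, speed=10.0))
print(lookup_safe(TABLE, distance=5.0, speed=10.0))
print(lookup_safe(TABLE, distance=10.5, speed=10.5))  # exact rule says safe; the table does not
```

The last query shows the approximation error mentioned above: the exact stopping distance is about 9.2 m, which is less than 10.5 m, but the cell's worst-case corner is unsafe, so the table raises a false alarm.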

The three approaches trade off speed, completeness, and memory. Heuristic rules are fast but incomplete. Exact reachability is complete but slow. Lookup tables are fast but memory-intensive. A practical design can combine all three: rules catch the obvious cases, lookup tables handle the well-characterized region, and online reachability fills in the gaps when the state is near a boundary.
Check: What does a failure monitor check that an ODD or uncertainty monitor does not?

Chapter 10: Online vs Lookup Tables

The choice between online computation and precomputed lookup tables is a fundamental engineering tradeoff in runtime monitoring. Let's formalize it.

Online reachability computes the safety analysis at runtime, starting from the current state. Advantages: it uses the exact current state (no discretization error) and can adapt to runtime information. Disadvantage: it must complete within the control loop's time budget — typically 10-100 milliseconds.

Lookup tables precompute the analysis for a grid of states. At runtime, look up the nearest grid cell — O(1) time. Advantages: extremely fast, deterministic latency. Disadvantages: exponential memory in state dimension, discretization error, and the table is fixed (can't adapt to new information).

| Property | Online | Lookup Table |
|---|---|---|
| Runtime cost | O(computation of reachable set) | O(1) (table lookup) |
| Memory | Low | Exponential in state dimension |
| Accuracy | Exact (continuous state) | Approximate (discretized) |
| Latency guarantee | Variable (may exceed deadline) | Deterministic |
| Adaptability | Can incorporate runtime info | Fixed at deployment |

The hybrid approach: Use lookup tables for the bulk of the state space (where the analysis is straightforward) and online computation only near decision boundaries (where the discretization error matters most). This gives O(1) performance almost everywhere, with online refinement only when needed.

For neural network verification specifically, the online approach runs a verification query (e.g., bound propagation) from the current state. The lookup approach precomputes verified regions and stores them. Tools like α,β-CROWN or ERAN can be used for either, depending on the time budget.

Dimensionality matters: A lookup table for a 3D state space with 100 cells per dimension needs $10^{6}$ entries, which is manageable. A 6D state space needs $10^{12}$ entries, on the order of terabytes. A 10D state space needs $10^{20}$, which is infeasible. Lookup tables are practical only for low-dimensional state spaces (roughly ≤ 5D).
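The entry counts follow directly from cells-per-dimension raised to the state dimension. A quick check, assuming one byte per entry for the storage estimate (the per-entry size is an assumption, not a figure from the text):

```python
def table_entries(dim, cells_per_dim=100):
    """Number of cells in a uniform grid over a dim-dimensional state space."""
    return cells_per_dim ** dim


for dim in (3, 6, 10):
    entries = table_entries(dim)
    terabytes = entries / 1e12  # assuming 1 byte per entry
    print(f"{dim}D: {entries:.0e} entries, about {terabytes:g} TB at 1 byte each")
```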
Check: What is the main advantage of lookup tables over online reachability computation?

Chapter 11: Defense in Depth

No single monitor catches everything. An ODD monitor might accept an in-distribution input that the model is uncertain about. An uncertainty monitor might be confident on an out-of-distribution input (the overconfidence problem). A failure monitor might miss a slow drift toward danger.

The solution is to layer all three:

Layer 1: ODD Monitor
Is this observation within the validated operating domain? (k-NN, polytope, autoencoder)
↓ If yes, continue
Layer 2: Uncertainty Monitor
Is the model confident and well-calibrated? (Entropy, conformal sets, ensemble)
↓ If yes, continue
Layer 3: Failure Monitor
Is the trajectory heading toward a safe state? (Rules, reachability, LUT)
↓ If yes, proceed normally
All clear
System operates normally

If any layer triggers, the system enters a safe fallback mode. The specific fallback depends on the application: slow down, request human supervision, execute a pre-programmed safe maneuver, or shut down.
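The layered check can be sketched as a short pipeline that stops at the first layer that triggers. The signal names, thresholds, and mode labels below are hypothetical; in a real system each lambda would be replaced by one of the monitors from the earlier chapters.

```python
from typing import Callable, Dict, List, Optional, Tuple

Layer = Tuple[str, Callable[[Dict[str, float]], bool]]

# Hypothetical per-layer checks; each returns True when the layer passes.
LAYERS: List[Layer] = [
    ("odd", lambda s: s["knn_distance"] <= 0.15),
    ("uncertainty", lambda s: s["entropy"] <= 0.5),
    ("failure", lambda s: s["time_to_collision_s"] >= 2.0),
]


def first_triggered_layer(signals: Dict[str, float]) -> Optional[str]:
    """Return the name of the first layer that triggers, or None if all pass."""
    for name, passes in LAYERS:
        if not passes(signals):
            return name
    return None


def choose_mode(signals: Dict[str, float]) -> str:
    """Operate normally only when every layer passes; otherwise fall back."""
    return "normal" if first_triggered_layer(signals) is None else "fallback"


nominal = {"knn_distance": 0.05, "entropy": 0.2, "time_to_collision_s": 6.0}
confused = {"knn_distance": 0.05, "entropy": 0.9, "time_to_collision_s": 6.0}
print(choose_mode(nominal), first_triggered_layer(nominal))
print(choose_mode(confused), first_triggered_layer(confused))
```

Reporting which layer triggered lets the fallback logic choose a response that matches the cause, for example slowing down for an uncertainty alert versus an emergency stop for a failure alert.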

Complementary coverage: Each monitor type covers a different failure mode. ODD monitors catch distribution shift (new environments). Uncertainty monitors catch model confusion (ambiguous inputs). Failure monitors catch unsafe trajectories (bad sequences of decisions). Together, they cover far more of the failure space than any single monitor.
| Failure mode | Caught by | Not caught by |
|---|---|---|
| Novel environment (snow, unseen road) | ODD monitor | Uncertainty (model may be overconfident) |
| Ambiguous input (shadow looks like obstacle) | Uncertainty monitor | ODD (input may be in-distribution) |
| Slow drift into danger (gradual lane departure) | Failure monitor | ODD + uncertainty (each frame looks normal) |
| Sensor degradation (fog reducing lidar range) | ODD + uncertainty (both degrade) | Failure (trajectory may look safe until crash) |

The false alarm tradeoff: More monitoring layers means more chances for false alarms. If each layer has a 5% false positive rate and they're independent, the combined false positive rate is roughly $1 - 0.95^{3} \approx 14\%$. This is the cost of safety. In safety-critical systems, an occasional unnecessary slowdown is far preferable to a missed failure. But excessive false alarms cause alarm fatigue: operators start ignoring the monitor, defeating its purpose. Tuning each layer's threshold to minimize false positives while maintaining detection is an ongoing engineering challenge.
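The combined rate under the independence assumption is one minus the probability that no layer raises a false alarm. A short check of the arithmetic (the function name is illustrative):

```python
def combined_false_positive_rate(rates):
    """P(at least one false alarm), assuming the layers are independent."""
    p_no_alarm = 1.0
    for r in rates:
        p_no_alarm *= 1.0 - r
    return 1.0 - p_no_alarm


print(combined_false_positive_rate([0.05, 0.05, 0.05]))  # about 0.1426
print(combined_false_positive_rate([0.01, 0.01, 0.01]))  # tighter layers, about 0.0297
```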
"The test of any autonomous system is not whether it can handle the expected,
but whether it knows when it's facing the unexpected."
— Mykel Kochenderfer
Check: Why is defense in depth (layering multiple monitors) better than relying on a single monitor type?