Offline validation ends when deployment begins. Runtime monitoring keeps watch after.
You've tested your autonomous vehicle in 10 million simulated miles. You've formally verified the perception module against 500 adversarial scenarios. You've computed reachable sets for the planner. You ship the car.
Day one. Rain. A garbage bag blows across the road. The lidar returns a point cloud it's never seen in training. The camera sees a dark, shiny, flapping shape that matches nothing in the object detector's vocabulary. The planner receives garbage observations and starts planning against a phantom object.
No amount of offline validation can anticipate every scenario the real world will throw at your system. The distribution shift between training/testing and deployment is the fundamental problem. Runtime monitoring exists to detect when the system is operating outside the conditions it was validated for.
Runtime monitoring is complementary to offline validation, not a replacement. Think of it as the seat belt on top of the crash-tested frame. The frame (offline validation) protects you in known scenarios. The belt (runtime monitoring) catches the rest.
| Monitor type | Question it answers | Acts on |
|---|---|---|
| ODD monitor | Is this input in-distribution? | Inputs / observations |
| Uncertainty monitor | How confident is the model? | Model outputs |
| Failure monitor | Is a safety violation imminent? | System state trajectory |
Before you can monitor whether you're inside the validated conditions, you need to define those conditions. The Operational Design Domain (ODD) is a formal specification of every condition under which the system is expected to operate safely.
An ODD for a self-driving car might include:
| Dimension | Range |
|---|---|
| Weather | Clear, light rain (no snow, fog, or heavy rain) |
| Time of day | Daytime (sun elevation > 10°) |
| Speed | ≤ 45 mph |
| Road type | Urban streets with lane markings |
| Sensor health | All cameras and lidar operational |
The ODD is typically defined by human engineers based on what scenarios were included in testing. If you only tested in dry weather, dry weather is your ODD for weather conditions. Expanding the ODD requires additional testing and validation.
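As a minimal sketch, the human-readable ODD from the table above can be encoded as a set of explicit checks. The `Conditions` fields, the string labels such as `"urban_marked"`, and the function name `odd_violations` are hypothetical names chosen for illustration; only the ranges come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conditions:
    """Hypothetical snapshot of the current operating conditions."""
    weather: str              # e.g. "clear", "light_rain", "snow"
    sun_elevation_deg: float  # sun elevation above the horizon, in degrees
    speed_mph: float
    road_type: str            # e.g. "urban_marked", "highway"
    sensors_ok: bool          # all cameras and lidar operational


ALLOWED_WEATHER = {"clear", "light_rain"}


def odd_violations(c):
    """Return the names of the ODD dimensions that the conditions violate."""
    violations = []
    if c.weather not in ALLOWED_WEATHER:
        violations.append("weather")
    if c.sun_elevation_deg <= 10.0:
        violations.append("time_of_day")
    if c.speed_mph > 45.0:
        violations.append("speed")
    if c.road_type != "urban_marked":
        violations.append("road_type")
    if not c.sensors_ok:
        violations.append("sensor_health")
    return violations


nominal = Conditions("clear", 35.0, 30.0, "urban_marked", True)
snowy_night = Conditions("snow", -5.0, 30.0, "urban_marked", True)
print(odd_violations(nominal))      # []
print(odd_violations(snowy_night))  # ['weather', 'time_of_day']
```

Returning the list of violated dimensions, rather than a single boolean, makes it easier to log why the system left its ODD.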
In practice, defining the ODD in terms of human-readable conditions (weather, road type) is straightforward but coarse. A more principled approach defines the ODD directly in the input space — the space of sensor observations. This is what the k-NN and polytope monitors in the next two chapters do: they define the ODD as a region in observation space, based on the observations seen during validation.
The simplest way to check if a new observation is "in distribution" is to ask: how close is it to the observations we've seen before?
A k-NN ODD monitor works as follows. During validation, collect all the observations the system encountered: a set $D = \{z_1, z_2, \dots, z_N\}$ in observation space. At runtime, when a new observation $z$ arrives, compute its distance to the $k$th nearest neighbor in $D$. If that distance exceeds a threshold $\gamma$, flag the observation as out of distribution:

$$z \text{ is out-of-ODD} \iff d_k(z, D) > \gamma$$

Here $d_k(z, D)$ is the distance from $z$ to its $k$th nearest neighbor in $D$. Using $k > 1$ (instead of just the closest point) makes the monitor robust to outliers in $D$: a single isolated validation point cannot by itself make its surroundings look in-distribution.
Figure (interactive): blue dots are validation observations and the red point is the query. The monitor computes the distance from the query to its kth nearest neighbor. Inside the green zone the query is in-ODD; outside it is out-of-ODD. The controls adjust γ and k.
Choosing γ is a tradeoff. A small γ makes the ODD tight — only observations very close to training data pass. This is safe (few false negatives) but may trigger false alarms on valid observations. A large γ is permissive — fewer false alarms, but some out-of-distribution observations may slip through.
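A minimal sketch of the k-NN check, assuming Euclidean distance and a brute-force search over a small validation set. The function names and the toy data are illustrative, not taken from any particular library.

```python
import math


def knn_distance(z, data, k):
    """Euclidean distance from z to its k-th nearest neighbor in data."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    distances = sorted(math.dist(z, zi) for zi in data)
    return distances[k - 1]


def in_odd(z, data, k, gamma):
    """True if the k-th nearest-neighbor distance is within the threshold gamma."""
    return knn_distance(z, data, k) <= gamma


# Four clustered validation points plus one isolated outlier at (5, 5).
validation = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]

print(in_odd((0.5, 0.5), validation, k=3, gamma=1.0))  # True
print(in_odd((5.0, 5.0), validation, k=3, gamma=1.0))  # False
print(in_odd((5.0, 5.0), validation, k=1, gamma=1.0))  # True
```

The last two calls show the effect of k: with k = 1 the lone outlier vouches for any query next to it, while with k = 3 the same query is flagged. For large validation sets the brute-force sort would normally be replaced by a spatial index.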
k-NN monitors check distance to individual points. An alternative is to define the ODD as a geometric region and check whether new observations fall inside it.
The simplest region is the convex hull of the training data — the smallest convex set containing all training points. A new observation is in-ODD if it's inside the hull, and out-of-ODD if it's outside.
A refinement: clustered convex hulls. First cluster the training data (e.g., with k-means), then compute a separate convex hull for each cluster. A new observation is in-ODD if it's inside any of the cluster hulls. This captures non-convex distributions better, at the cost of choosing the number of clusters.
Superlevel set monitors go further. Fit a density model (kernel density estimation, Gaussian mixture) to the training data. The ODD is the set of points where the estimated density exceeds a threshold: {z : p̂(z) ≥ τ}. This naturally adapts to any distribution shape.
| Method | Handles non-convex? | Computational cost | Tuning |
|---|---|---|---|
| Convex hull | No | Low (LP or membership test) | None |
| Clustered hulls | Partially | Medium | # clusters |
| Density superlevel | Yes | Higher | Bandwidth + τ |
| k-NN distance | Yes | Medium | k + γ |
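To make the density superlevel row concrete, here is a minimal sketch that assumes a Gaussian kernel density estimate with a single isotropic bandwidth. The names `kde_density` and `in_superlevel_set`, the bandwidth, and the threshold value are all illustrative assumptions.

```python
import math


def kde_density(z, data, bandwidth):
    """Gaussian kernel density estimate at point z (isotropic bandwidth)."""
    d = len(z)
    # Normalizing constant of a d-dimensional Gaussian with variance bandwidth^2.
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = 0.0
    for zi in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(z, zi))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(data) * norm)


def in_superlevel_set(z, data, bandwidth, tau):
    """True if the estimated density at z is at least tau."""
    return kde_density(z, data, bandwidth) >= tau


# Two well-separated clusters; the midpoint (2, 2) lies inside their convex hull.
data = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (4.0, 4.0), (4.1, 3.9), (3.9, 4.2)]

print(in_superlevel_set((0.0, 0.0), data, bandwidth=0.5, tau=0.05))  # True
print(in_superlevel_set((2.0, 2.0), data, bandwidth=0.5, tau=0.05))  # False
```

The midpoint between the clusters would pass a single convex hull test but is rejected here, which is the non-convex behavior the table refers to.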
The methods above work beautifully in 2D or 3D. But real autonomous systems observe 100-dimensional lidar scans or 200,000-pixel images. In high dimensions, distance-based monitors face the curse of dimensionality: all points become approximately equidistant, and the k-NN distance loses its ability to distinguish in-distribution from out-of-distribution.
The solution is dimensionality reduction. Project the high-dimensional observations into a lower-dimensional space where distance is meaningful, then apply k-NN or density monitoring in that space.
| Method | Dimensionality reduction | Learns from data? |
|---|---|---|
| PCA | Linear projection to top principal components | Yes (unsupervised) |
| Autoencoder | Nonlinear encoder maps to bottleneck layer | Yes (unsupervised) |
| Pre-trained features | Use intermediate layer of the policy network | No extra training |
PCA (Principal Component Analysis) finds the directions of maximum variance in the data and projects onto the top d of them. It's fast and linear, but may miss nonlinear structure.
Autoencoders learn a nonlinear compression: an encoder maps the observation to a low-dimensional latent code, and a decoder reconstructs the observation from the code. If the autoencoder was trained on in-distribution data, out-of-distribution inputs tend to have higher reconstruction error, although this is not guaranteed for every input. This gives a second monitoring signal: a large reconstruction error suggests the input is likely out-of-ODD.
A pragmatic approach: use the intermediate representations of the policy network itself. If the policy is a neural network, its hidden layers already compress the raw input into task-relevant features. Monitor in that feature space. This is free (no extra model to train) and automatically focuses on the features the policy uses.
ODD monitors check the input. But you can also monitor the output — specifically, how uncertain the model is about its prediction.
Many models produce not just a point prediction, but a distribution over outputs. A regression model might predict μ = 2.5 with standard deviation σ = 0.3. A classifier might output probabilities [0.91, 0.06, 0.03] across three classes. The spread of this distribution measures uncertainty.
For a conditional Gaussian model that outputs $\mu(x)$ and $\sigma(x)$:

$$p(y \mid x) = \mathcal{N}\big(y;\ \mu(x),\ \sigma(x)^2\big)$$

The uncertainty is $\sigma(x)$. When $\sigma(x)$ is large, the model is unsure. A natural monitor: flag when $\sigma(x)$ exceeds a threshold.
For classifiers, entropy measures uncertainty. If the model outputs class probabilities $p_1, p_2, \dots, p_K$:

$$H(p) = -\sum_{k=1}^{K} p_k \log p_k$$

Entropy is maximized ($H = \log K$) when all classes are equally likely (maximum confusion) and minimized ($H = 0$) when one class has probability 1 (full confidence). High entropy → uncertain → flag.
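A minimal sketch of an entropy-based monitor, using natural logarithms (so entropy is in nats). The threshold of 0.8 is an arbitrary illustrative value, not a recommendation.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_uncertain(probs, threshold):
    """Flag a prediction whose entropy exceeds the threshold."""
    return entropy(probs) > threshold


confident = [0.91, 0.06, 0.03]   # the classifier example from the text
confused = [1 / 3, 1 / 3, 1 / 3]

print(is_uncertain(confident, threshold=0.8))  # False
print(is_uncertain(confused, threshold=0.8))   # True
```

For three classes the maximum possible entropy is log 3 ≈ 1.10 nats, so a threshold is only meaningful relative to the number of classes.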
Proper scoring rules evaluate whether a model's uncertainty estimates are honest. A scoring rule $S(p, y)$ assigns a score based on the predicted distribution $p$ and the true outcome $y$. A scoring rule is proper if the expected score is maximized when $p$ equals the true distribution. The log score $S(p, y) = \log p_y$ and the Brier score $S(p, y) = -\sum_{k} \big(p_k - \mathbb{1}[y = k]\big)^2$ are both proper. (With two classes the Brier score reduces to $-2(1 - p_y)^2$; with more than two classes the single term $-(1 - p_y)^2$ alone is not proper.) Using a proper scoring rule as the training objective incentivizes the model to produce well-calibrated uncertainty estimates.
A weather forecaster says "70% chance of rain" on 100 different days. If it actually rains on about 70 of those days, the forecaster is calibrated. If it rains on only 40, the forecaster is overconfident.
The same applies to neural networks. If the model says "90% confident" on 1,000 predictions, it should be correct on about 900 of them. If it's only correct on 700, the model is overconfident — its stated confidence exceeds its actual accuracy.
A reliability diagram visualizes calibration. Bin predictions by confidence level (0-10%, 10-20%, ..., 90-100%). For each bin, plot the average confidence (x-axis) against the actual accuracy (y-axis). A perfectly calibrated model falls on the diagonal. Points above the diagonal indicate underconfidence; below indicates overconfidence.
Figure (interactive): a model's predictions, binned by confidence. The diagonal is perfect calibration. Applying temperature scaling with the λ slider moves the bars toward the diagonal.
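Temperature scaling is the simplest calibration method. Divide the model's logits (pre-softmax values) by a temperature parameter λ before applying softmax. λ > 1 makes the distribution softer (less confident). λ < 1 makes it sharper (more confident). Because dividing by a positive constant preserves the ordering of the logits, the predicted class never changes; only the confidence does. λ is fit on a held-out validation set to minimize calibration error, most commonly by minimizing the negative log-likelihood.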
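A minimal sketch of temperature scaling. It assumes λ is picked by a grid search that minimizes the average negative log-likelihood on held-out data, which is one common choice; the candidate grid and the toy logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    shift = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at this temperature."""
    losses = [
        -math.log(softmax_with_temperature(logits, temperature)[label])
        for logits, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest held-out NLL."""
    return min(candidates, key=lambda t: average_nll(logit_rows, labels, t))


# Toy held-out set: the model is very confident but right only 3 times out of 4.
held_out_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
held_out_labels = [0, 0, 1, 1]

best = fit_temperature(held_out_logits, held_out_labels, [0.5, 1.0, 2.0, 4.0, 8.0])
print(best)  # 4.0
print(max(softmax_with_temperature([4.0, 0.0], 1.0)))   # about 0.98
print(max(softmax_with_temperature([4.0, 0.0], best)))  # about 0.73
```

After scaling, the stated confidence (about 0.73) sits close to the 75% accuracy on the held-out set, while the predicted class is unchanged.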
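Expected Calibration Error (ECE) summarizes the reliability diagram in a single number:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \mathrm{acc}_b - \mathrm{conf}_b \right|$$

where $B$ is the number of bins, $N$ is the total number of predictions, $n_b$ is the count in bin $b$, $\mathrm{acc}_b$ is the accuracy in bin $b$, and $\mathrm{conf}_b$ is the average confidence in bin $b$. ECE = 0 means perfect calibration.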
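A minimal sketch of ECE with equal-width confidence bins, following the definition above. The function name and the toy data (ten predictions at 90% confidence, seven of them correct) are illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    n = len(confidences)
    counts = [0] * num_bins
    conf_sums = [0.0] * num_bins
    hit_counts = [0] * num_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        b = min(int(conf * num_bins), num_bins - 1)
        counts[b] += 1
        conf_sums[b] += conf
        hit_counts[b] += int(ok)
    ece = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_counts):
        if count > 0:  # empty bins contribute nothing
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece


confidences = [0.9] * 10
correct = [True] * 7 + [False] * 3
print(round(expected_calibration_error(confidences, correct), 3))  # 0.2
```

All ten predictions fall in one bin with accuracy 0.7 and confidence 0.9, so the ECE is the single gap of 0.2, matching the overconfident case described earlier.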
Temperature scaling fixes confidence. But what if you want something stronger — a prediction set with a statistical guarantee? "I predict the output is between 2.1 and 3.4, and I guarantee this interval contains the true value at least 90% of the time."
This is what conformal prediction provides. It wraps any predictive model in a calibration layer that produces prediction sets with guaranteed coverage.
The setup requires three ingredients:
The procedure for regression:
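Step 1: Compute nonconformity scores on the calibration set $\{(x_i, y_i)\}_{i=1}^{n}$. The simplest score is the absolute residual of the model's prediction $f(x_i)$:

$$s_i = |y_i - f(x_i)|$$

Step 2: Find the quantile threshold $q$. Sort the scores and take the $\lceil (1 - \alpha)(n + 1) \rceil / n$ quantile, that is, the $\lceil (1 - \alpha)(n + 1) \rceil$-th smallest score:

$$q = \mathrm{Quantile}\big(\{s_1, \dots, s_n\};\ \lceil (1 - \alpha)(n + 1) \rceil / n\big)$$

Step 3: At runtime, for a new input $x$, output the prediction set:

$$C(x) = [\,f(x) - q,\ f(x) + q\,]$$

If the calibration points and the new point are exchangeable, $C(x)$ contains the true value with probability at least $1 - \alpha$.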
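The three steps above can be sketched as follows. This is a minimal split-conformal example under stated assumptions: the "model" is a hypothetical fixed function `model(x) = 2x`, the calibration data are made up, and α = 0.2 targets 80% coverage.

```python
import math


def conformal_quantile(scores, alpha):
    """The ceil((1 - alpha)(n + 1))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((1.0 - alpha) * (n + 1))  # 1-indexed rank
    if rank > n:
        # Too few calibration points for this alpha: only an infinite band is valid.
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_interval(model, x, q):
    """Symmetric interval of half-width q around the model's prediction."""
    y_hat = model(x)
    return (y_hat - q, y_hat + q)


def model(x):
    """Hypothetical stand-in for a trained regression model."""
    return 2.0 * x


# Made-up calibration set: true values are the prediction plus a known residual.
residuals = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9]
calibration = [(float(i + 1), model(float(i + 1)) + r) for i, r in enumerate(residuals)]

# Step 1: nonconformity scores. Step 2: quantile. Step 3: interval at runtime.
scores = [abs(y - model(x)) for x, y in calibration]
q = conformal_quantile(scores, alpha=0.2)
low, high = prediction_interval(model, 10.0, q)
print(round(q, 3), round(low, 3), round(high, 3))  # 0.8 19.2 20.8
```

With n = 9 and α = 0.2 the rank is ⌈0.8 × 10⌉ = 8, so q is the eighth smallest score. When the rank exceeds n, the sketch returns an infinite threshold rather than silently under-covering.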
Figure (interactive): a regression model (blue curve) with its conformal prediction band. Gray dots are calibration points. Lowering the miscoverage level α widens the band and raises the coverage 1 − α. The histogram shows the calibration scores with the quantile threshold q.
For classification, conformal prediction outputs a set of classes instead of an interval. The nonconformity score is typically $1 - p(y_i \mid x_i)$, one minus the model's confidence in the true class. The quantile threshold determines which classes are included: all classes $k$ where $p(k \mid x) \ge 1 - q$.
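A minimal sketch of the classification variant. The calibration confidences are made-up numbers, and the helper names are illustrative; the quantile rule is the same one used for regression.

```python
import math


def conformal_quantile(scores, alpha):
    """The ceil((1 - alpha)(n + 1))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((1.0 - alpha) * (n + 1))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def prediction_set(probs, q):
    """All class indices k whose probability is at least 1 - q."""
    return [k for k, p in enumerate(probs) if p >= 1.0 - q]


# Made-up calibration data: the model's probability for the TRUE class of each example.
true_class_probs = [0.9, 0.8, 0.95, 0.4, 0.7, 0.85, 0.99, 0.3, 0.75]
scores = [1.0 - p for p in true_class_probs]
q = conformal_quantile(scores, alpha=0.2)

print(prediction_set([0.91, 0.06, 0.03], q))  # [0]
print(prediction_set([0.50, 0.45, 0.05], q))  # [0, 1]
```

A confident prediction yields a single class, while an ambiguous one yields a larger set. The set size itself can serve as a runtime uncertainty signal.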
There are two fundamentally different kinds of uncertainty. A coin flip has aleatoric (data) uncertainty — even if you know everything about the coin, the outcome is random. But uncertainty about which model is correct is epistemic (model) uncertainty — and this can be reduced with more data.
Aleatoric (data) uncertainty: inherent noise in the data. Cannot be reduced by collecting more data. Example: measurement noise from a noisy sensor.
Epistemic (model) uncertainty: uncertainty about which model is correct. Can be reduced with more data. Example: predictions in a region with few training points.
For runtime monitoring, epistemic uncertainty is the one we care about. High epistemic uncertainty means "we haven't seen data like this before" — exactly the signal we need to detect distribution shift.
Ensemble methods estimate epistemic uncertainty by training multiple models and measuring their disagreement. If 5 models all predict y ≈ 2.5, epistemic uncertainty is low. If they predict 2.1, 3.4, 1.8, 2.9, and 4.0, epistemic uncertainty is high.
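For $M$ models with predictions $\mu_1, \dots, \mu_M$:

$$\bar{\mu} = \frac{1}{M} \sum_{m=1}^{M} \mu_m, \qquad \hat{\sigma}^2_{\text{epi}} = \frac{1}{M} \sum_{m=1}^{M} (\mu_m - \bar{\mu})^2$$

The ensemble mean $\bar{\mu}$ is the prediction. The ensemble variance (the disagreement between members) is the epistemic uncertainty estimate. This can be read as an approximation to Bayesian model averaging that treats the ensemble members as a finite set of samples from the posterior.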
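A minimal sketch of the ensemble disagreement computation, using the two sets of predictions from the example above. It assumes the population variance (dividing by the number of models); dividing by one less is an equally reasonable convention.

```python
from statistics import fmean, pvariance


def ensemble_summary(predictions):
    """Return (mean prediction, variance across ensemble members)."""
    mean = fmean(predictions)
    variance = pvariance(predictions, mu=mean)  # divides by the number of models
    return mean, variance


agreeing = [2.5, 2.5, 2.5, 2.5, 2.5]
disagreeing = [2.1, 3.4, 1.8, 2.9, 4.0]

mean_a, var_a = ensemble_summary(agreeing)
mean_d, var_d = ensemble_summary(disagreeing)
print(round(mean_a, 2), round(var_a, 4))  # 2.5 0.0
print(round(mean_d, 2), round(var_d, 4))  # 2.84 0.6584
```

A monitor would compare the variance (or its square root) against a threshold chosen on validation data.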
ODD monitors check inputs. Uncertainty monitors check predictions. But sometimes neither is enough. The inputs can be in-distribution, the model can be confident, and the system can still be heading toward failure — because the sequence of states is leading to a dangerous region.
Failure monitors check the system's trajectory, not individual observations. They ask: "Given the current state and trajectory, is a safety violation imminent?"
Three approaches, in order of increasing sophistication:
Hand-coded safety checks: "If speed > 40 AND distance_to_obstacle < 5m, trigger alert." Simple, fast, interpretable, but incomplete. Every rule you write is a rule you thought of. The failures you miss are the ones you didn't anticipate.
Online reachability: compute the reachable set from the current state over a short time horizon (e.g., 2 seconds). If any state in the reachable set violates a safety constraint, trigger an alert. This is sound as long as the computed set contains (over-approximates) the true reachable set, but it is expensive: reachability analysis in real time requires simplified dynamics models.
If $R(t) \cap \text{Unsafe} \neq \emptyset$ for any $t \le T$, the monitor triggers, where $R(t)$ is the reachable set at time $t$ and $T$ is the horizon.
Lookup tables: precompute reachability offline. Discretize the state space and store the result of the safety analysis for each cell. At runtime, look up the current state in O(1) time. The tradeoff: storage grows exponentially with the state dimension, and the discretization introduces approximation error.
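As an illustration of the online reachability check, here is a minimal sketch under a deliberately simplified model: one-dimensional motion along the lane (a double integrator) with acceleration bounded in [a_min, a_max], and an unsafe region consisting of all positions at or beyond an obstacle. The dynamics, bounds, and function names are assumptions for the example, not a general reachability tool.

```python
def reachable_position_interval(p, v, a_min, a_max, t):
    """Positions reachable at time t when acceleration stays in [a_min, a_max].

    For a double integrator the extremes are attained by constant minimum or
    maximum acceleration, so the reachable positions form this interval.
    (The sketch does not model the speed being clipped at zero.)
    """
    low = p + v * t + 0.5 * a_min * t * t
    high = p + v * t + 0.5 * a_max * t * t
    return low, high


def reachability_alert(p, v, a_min, a_max, unsafe_from, horizon, dt=0.1):
    """True if R(t) intersects [unsafe_from, inf) at any sampled t <= horizon."""
    steps = int(round(horizon / dt))
    for i in range(steps + 1):
        _, high = reachable_position_interval(p, v, a_min, a_max, i * dt)
        if high >= unsafe_from:
            return True
    return False


# Vehicle at position 0 m moving at 10 m/s, acceleration in [-6, 2] m/s^2.
print(reachability_alert(0.0, 10.0, -6.0, 2.0, unsafe_from=50.0, horizon=2.0))  # False
print(reachability_alert(0.0, 10.0, -6.0, 2.0, unsafe_from=20.0, horizon=2.0))  # True
```

The check samples time on a grid. That is adequate here because the upper bound grows monotonically when speed and maximum acceleration are non-negative, but in general a sampled check can miss violations between samples.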
The choice between online computation and precomputed lookup tables is a fundamental engineering tradeoff in runtime monitoring. Let's formalize it.
Online reachability computes the safety analysis at runtime, starting from the current state. Advantages: it uses the exact current state (no discretization error) and can adapt to runtime information. Disadvantage: it must complete within the control loop's time budget — typically 10-100 milliseconds.
Lookup tables precompute the analysis for a grid of states. At runtime, look up the nearest grid cell — O(1) time. Advantages: extremely fast, deterministic latency. Disadvantages: exponential memory in state dimension, discretization error, and the table is fixed (can't adapt to new information).
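A minimal sketch of the lookup-table approach for a two-dimensional state (distance to obstacle, speed). It assumes a simple stopping-distance safety analysis, v² / (2 · max_brake), and marks a cell unsafe if its worst-case corner cannot stop in time. The grid sizes, the braking limit, and the function names are illustrative assumptions.

```python
def build_table(dist_step, speed_step, n_dist, n_speed, max_brake):
    """Offline: table[i][j] is True if cell (i, j) is unsafe.

    Each cell is judged by its worst corner: smallest distance, largest speed.
    """
    table = []
    for i in range(n_dist):
        nearest = i * dist_step
        row = []
        for j in range(n_speed):
            fastest = (j + 1) * speed_step
            stopping_distance = fastest ** 2 / (2.0 * max_brake)
            row.append(stopping_distance >= nearest)
        table.append(row)
    return table


def lookup_alert(table, distance, speed, dist_step, speed_step):
    """Runtime: constant-time lookup of the cell containing the state."""
    j = int(speed / speed_step)
    if j >= len(table[0]):
        return True  # faster than anything analyzed offline: alert conservatively
    # Distances beyond the grid reuse the farthest row, which is conservative.
    i = min(int(distance / dist_step), len(table) - 1)
    return table[i][j]


def exact_alert(distance, speed, max_brake):
    """The same analysis evaluated on the exact state, for comparison."""
    return speed ** 2 / (2.0 * max_brake) >= distance


TABLE = build_table(dist_step=5.0, speed_step=2.0, n_dist=20, n_speed=15, max_brake=6.0)

print(lookup_alert(TABLE, 52.0, 9.0, 5.0, 2.0))   # False
print(lookup_alert(TABLE, 14.0, 12.1, 5.0, 2.0))  # True
print(exact_alert(14.0, 12.1, 6.0))               # False
```

The last two lines show the discretization error: the exact state could stop in time, but its grid cell is marked unsafe because of the cell's worst corner. A finer grid shrinks this gap at the cost of memory, and the number of cells grows exponentially with the number of state dimensions.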
| Property | Online | Lookup Table |
|---|---|---|
| Runtime cost | O(computation of reachable set) | O(1) — table lookup |
| Memory | Low | Exponential in state dimension |
| Accuracy | Exact (continuous state) | Approximate (discretized) |
| Latency guarantee | Variable (may exceed deadline) | Deterministic |
| Adaptability | Can incorporate runtime info | Fixed at deployment |
For neural network verification specifically, the online approach runs a verification query (e.g., bound propagation) from the current state. The lookup approach precomputes verified regions and stores them. Tools like α,β-CROWN or ERAN can be used for either, depending on the time budget.
No single monitor catches everything. An ODD monitor might accept an in-distribution input that the model is uncertain about. An uncertainty monitor might be confident on an out-of-distribution input (the overconfidence problem). A failure monitor might miss a slow drift toward danger.
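The solution is to layer all three: the ODD monitor checks the inputs, the uncertainty monitor checks the model's outputs, and the failure monitor checks the state trajectory.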
If any layer triggers, the system enters a safe fallback mode. The specific fallback depends on the application: slow down, request human supervision, execute a pre-programmed safe maneuver, or shut down.
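A minimal sketch of the layering logic. It assumes each monitor's signal has already been computed for the current time step and is passed in as a dictionary; the keys, thresholds, and mode names are hypothetical.

```python
def layered_check(signals, monitors):
    """Run every monitor; enter fallback if at least one of them triggers."""
    triggered = [name for name, check in monitors if check(signals)]
    mode = "fallback" if triggered else "nominal"
    return mode, triggered


# Hypothetical thresholds; in practice each would be tuned on validation data.
MONITORS = [
    ("odd", lambda s: s["knn_distance"] > 1.0),
    ("uncertainty", lambda s: s["entropy"] > 0.8),
    ("failure", lambda s: s["reach_unsafe"]),
]

normal_frame = {"knn_distance": 0.3, "entropy": 0.2, "reach_unsafe": False}
slow_drift = {"knn_distance": 0.3, "entropy": 0.2, "reach_unsafe": True}

print(layered_check(normal_frame, MONITORS))  # ('nominal', [])
print(layered_check(slow_drift, MONITORS))    # ('fallback', ['failure'])
```

The second case mirrors slow drift: the input looks in-distribution and the model is confident, yet the failure monitor alone triggers the fallback. Recording which monitors fired also helps choose the specific fallback action.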
| Failure mode | Caught by | Not caught by |
|---|---|---|
| Novel environment (snow, unseen road) | ODD monitor | Uncertainty (model may be overconfident) |
| Ambiguous input (shadow looks like obstacle) | Uncertainty monitor | ODD (input may be in-distribution) |
| Slow drift into danger (gradual lane departure) | Failure monitor | ODD + uncertainty (each frame looks normal) |
| Sensor degradation (fog reducing lidar range) | ODD + uncertainty (both degrade) | Failure (trajectory may look safe until crash) |