When a system says "turn left," you need to know why — before you trust it with lives.
An autonomous drone decides to abort a landing. The pilot didn't command it. The sensors look fine. The wind is calm. Why did it abort?
You check the logs. The neural network controller output a single number: action = 0.87 (abort). That number came from 12 million parameters multiplied together through 40 layers. Good luck tracing the reasoning.
This is the explainability problem. A validation engineer can verify that a system meets performance metrics — 99.7% safe landings in simulation, say. But metrics don't tell you why it fails the other 0.3%, or what features it relies on, or whether it will generalize to conditions you haven't tested.
Explainability also serves trust calibration. A pilot who understands why the autopilot makes decisions can detect when it's relying on the wrong cues — like a classifier that learned to detect "wolf" by checking for snow in the background. Without explanations, operators either over-trust (complacency) or under-trust (disuse) the system.
This chapter covers the full toolkit. We start with the simplest methods — visualizing what the policy does — and build toward more principled techniques: gradient-based attribution, Shapley values, surrogate models, counterfactuals, and failure mode clustering.
| Method | Scope | What it tells you |
|---|---|---|
| Sensitivity analysis | Local | Which inputs matter at this point |
| Saliency / gradients | Local | Direction of steepest change |
| Integrated gradients | Local | Total attribution along a path |
| Shapley values | Local or global | Fair contribution of each feature |
| Surrogate models | Global | Approximate the policy with something readable |
| Decision trees | Global | If-then rules that mimic the policy |
| Counterfactuals | Local | Smallest change to flip the decision |
The simplest form of explainability is just looking at what the policy does. If your state space is low-dimensional (2D or 3D), you can plot the policy as a map: for every point in state space, color it by what action the policy takes.
Consider an autonomous car navigating a 2D plane. The state is (x-position, y-position). The policy outputs one of four actions: accelerate, brake, turn left, turn right. A policy heatmap colors each grid cell by the chosen action. At a glance, you see "the car brakes when it's near an obstacle, accelerates on clear road, and turns when it's drifting toward a wall."
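As a rough sketch of how such a heatmap is built, the snippet below evaluates a hand-written stand-in policy at the center of every grid cell and prints one character per cell. The policy rule, the obstacle position, the plane size, and all function names are illustrative assumptions, not a trained controller.

```python
# Hypothetical toy policy on a 2D plane; every position and threshold is an assumption.
OBSTACLE = (6.0, 4.0)   # assumed obstacle position
WIDTH, HEIGHT = 10, 8   # assumed extent of the plane, in grid cells

def toy_policy(x, y):
    """Return one of four actions for position (x, y)."""
    dist = ((x - OBSTACLE[0]) ** 2 + (y - OBSTACLE[1]) ** 2) ** 0.5
    if dist < 2.0:
        return "brake"        # near the obstacle
    if y < 1.0:
        return "turn_left"    # drifting toward the bottom wall
    if y > HEIGHT - 1.0:
        return "turn_right"   # drifting toward the top wall
    return "accelerate"       # clear road

SYMBOL = {"accelerate": "A", "brake": "B", "turn_left": "L", "turn_right": "R"}

def policy_grid(policy, width, height):
    """Evaluate the policy at the center of every grid cell (row index = y)."""
    return [[policy(col + 0.5, row + 0.5) for col in range(width)]
            for row in range(height)]

grid = policy_grid(toy_policy, WIDTH, HEIGHT)
for row in reversed(grid):  # print with y increasing upward
    print("".join(SYMBOL[action] for action in row))
```

The same grid of actions could be handed to a plotting library to produce a colored heatmap; the text rendering is only meant to keep the sketch dependency-free.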
For higher-dimensional states, you fix all dimensions except two and plot a slice. If the car's full state is (x, y, speed, heading, distance_to_obstacle), you might fix speed = 30, heading = north, and plot the action as a function of (x, distance_to_obstacle). Multiple slices reveal how the policy changes with each variable.
Trajectory aggregation takes a different approach. Instead of plotting the action at every state, you overlay many simulated trajectories on the same map. Color them by outcome (green = safe, red = failure). Clusters of red trajectories reveal failure regions in state space — areas where the policy consistently makes bad decisions.
Interactive figure: a toy 2D policy. Given an (x, y) position, the policy chooses one of four actions, and each cell is colored by the chosen action. Moving the obstacle (red dot) shows how a simple policy adapts.
Visualization has clear limits. It works for 2D state spaces, or 2D slices of higher-dimensional ones. For a 50-dimensional observation space (like a flattened image), you need the methods in the following chapters. But as a first pass, always start here.
You have a policy π that maps an observation x = (x1, x2, …, xn) to an action. You want to know: which features matter most? The most direct approach is to wiggle each one and see what happens.
Sensitivity analysis perturbs one input feature at a time, holding all others fixed, and measures how much the output changes. If changing x3 by ±10% causes the action to swing wildly, then x3 is a high-sensitivity feature. If changing x7 barely affects the output, x7 is low-sensitivity.
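For a small perturbation δ along feature i (with e_i the i-th unit vector), the sensitivity is

$$S_i(x) = \frac{\lvert \pi(x + \delta e_i) - \pi(x) \rvert}{\delta}$$

This is just the magnitude of the finite-difference derivative with respect to feature i. A large value means the policy is sensitive to that feature at this operating point.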
To get a global picture, evaluate sensitivity at many operating points (sampled from your test distribution) and average. This gives mean absolute sensitivity — a rough ranking of feature importance across the whole input space.
Interactive figure: a toy policy π(x₁, x₂, x₃) = 3x₁² + 0.5x₂ + 0.1x₃. Adjusting each feature shows how sensitivity changes; the bar chart shows |∂π/∂xᵢ| at the current point.
Notice something critical: the sensitivity of x₁ depends on the current value of x₁, because the derivative of 3x₁² is 6x₁. At x₁ = 0, the sensitivity to x₁ is zero! This is a fundamental property of nonlinear functions: sensitivity is a function of where you are, not a fixed property of the feature.
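The finite-difference procedure can be sketched in a few lines. The example below uses the toy policy 3·x1² + 0.5·x2 + 0.1·x3, a one-sided perturbation of size `delta`, and a uniform sampling range of [-1, 1] for the global average; the step size, the sampling range, and the function names are assumptions made for illustration.

```python
import random

def toy_policy(x):
    """Toy policy 3*x1^2 + 0.5*x2 + 0.1*x3 (x is a list of three floats)."""
    return 3 * x[0] ** 2 + 0.5 * x[1] + 0.1 * x[2]

def sensitivity(policy, x, delta=1e-4):
    """One-sided finite difference |pi(x + delta*e_i) - pi(x)| / delta per feature."""
    base = policy(x)
    result = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += delta          # perturb feature i only
        result.append(abs(policy(perturbed) - base) / delta)
    return result

def mean_abs_sensitivity(policy, points, delta=1e-4):
    """Average the local sensitivities over many operating points."""
    totals = [0.0] * len(points[0])
    for x in points:
        for i, s in enumerate(sensitivity(policy, x, delta)):
            totals[i] += s
    return [t / len(points) for t in totals]

local_at_1 = sensitivity(toy_policy, [1.0, 2.0, 3.0])   # x1 term is close to 6
local_at_0 = sensitivity(toy_policy, [0.0, 2.0, 3.0])   # x1 term is close to 0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(1000)]
global_ranking = mean_abs_sensitivity(toy_policy, points)
print(local_at_1, local_at_0, global_ranking)
```

The two local results differ only in the x1 entry, which illustrates that sensitivity belongs to an operating point. The averaged ranking depends on the sampling distribution, so it should be drawn from the conditions the system will actually see.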
Sensitivity analysis uses finite differences — perturb by δ, measure the change, divide. But if your policy is a differentiable neural network, you can compute the exact derivative using backpropagation. This is saliency mapping.
The saliency of feature i at input x is simply the partial derivative of the policy's output with respect to that input:

$$\text{Saliency}_i(x) = \frac{\partial \pi(x)}{\partial x_i}$$
Backpropagation computes all n partial derivatives in a single backward pass, at a cost comparable to one forward pass, whereas finite differences need one extra forward pass per feature. For a policy that takes a 224×224 single-channel image as input (50,176 features), saliency is fast; finite-difference sensitivity would require 50,176 separate perturbed evaluations.
The gradient tells you the direction of steepest ascent. The sign matters: a positive gradient on xi means increasing xi increases the output; negative means the opposite. The magnitude tells you how steeply.
Two common refinements:
| Variant | Formula | Why |
|---|---|---|
| Gradient × Input | xi · ∂π/∂xi | Scales by feature magnitude; zero input → zero attribution |
| SmoothGrad | Average saliency over noisy copies of x | Reduces noise in gradient estimates |
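Both help, but neither fundamentally solves the saturation problem: where the policy's output has flattened out (for example, on the plateau of a sigmoid), the local gradient is near zero even for a feature that drove the output there. For that, we need the path-based method in the next chapter.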
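A small sketch makes the two refinements and the saturation issue concrete. It assumes a one-neuron "policy" (a sigmoid of a weighted sum) whose gradient is written by hand as a stand-in for backpropagation; the weights, noise level, and sample count are illustrative choices.

```python
import math
import random

W = [4.0, 1.0]  # assumed weights of a tiny one-neuron "policy"

def policy(x):
    """Sigmoid of a weighted sum; it saturates when the sum is large."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))

def saliency(x):
    """Exact gradient d pi / d x_i, hand-written as a stand-in for backprop."""
    p = policy(x)
    return [p * (1.0 - p) * w for w in W]

def gradient_times_input(x):
    """Scale each gradient entry by the feature value itself."""
    return [xi * g for xi, g in zip(x, saliency(x))]

def smoothgrad(x, sigma=0.1, samples=200, seed=0):
    """Average the saliency over noisy copies of x (Gaussian noise, std sigma)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, g in enumerate(saliency(noisy)):
            total[i] += g
    return [t / samples for t in total]

x_mid = [0.1, 0.2]   # unsaturated region: gradients are informative
x_sat = [3.0, 0.5]   # saturated region: output near 1, gradients near 0

print(saliency(x_mid), gradient_times_input(x_mid), smoothgrad(x_mid))
print(policy(x_sat) - policy([0.0, 0.0]), saliency(x_sat))
```

At `x_sat` the output has moved almost half a unit away from its value at the origin, yet every gradient entry is tiny. Multiplying by the input or averaging over noise rescales or smooths those tiny numbers; it does not recover the missing attribution.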
Gradients at a point are like reading the slope of a mountain at your current location. Useful, but it misses the full picture. If you walked from sea level (a baseline) to your current position, the total altitude gain is what matters — not just the local slope.
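Integrated gradients formalize this idea. Instead of computing the gradient at one point, they integrate the gradient along a straight-line path from a baseline input x' (typically all zeros or a black image) to the actual input x:

$$\text{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial \pi\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha$$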
Let's unpack this. The integral walks along the path from x' to x, parameterized by α ∈ [0, 1]. At each point on the path, it samples the gradient with respect to feature i. Then it multiplies by (xi − x'i) — the total change in that feature from baseline to actual input.
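In practice, the integral is approximated by summing over m equally spaced points along the path:

$$\text{IG}_i(x) \approx (x_i - x'_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial \pi\big(x' + \tfrac{k}{m} (x - x')\big)}{\partial x_i}$$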
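```python
import torch

def integrated_gradients(model, x, baseline, m=50):
    """Approximate integrated gradients with an m-step Riemann sum.

    x, baseline: input tensors of the same shape.
    Assumes model(x) returns a single scalar (e.g. one action score).
    """
    # m interpolation coefficients in (0, 1]; alpha = 0 (the baseline) is skipped
    alphas = torch.linspace(0, 1, m + 1)[1:]
    grads = []
    for alpha in alphas:
        # Point on the straight-line path; detach so it is a fresh leaf tensor
        interp = (baseline + alpha * (x - baseline)).detach()
        interp.requires_grad_(True)
        out = model(interp)
        # Gradient of the output w.r.t. the interpolated input only, so
        # nothing accumulates in the model parameters' .grad buffers
        (grad,) = torch.autograd.grad(out, interp)
        grads.append(grad)
    avg_grad = torch.stack(grads).mean(dim=0)
    # Scale the path-averaged gradient by each feature's total change
    return (x - baseline) * avg_grad
```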
Integrated gradients satisfy two important axioms:
| Axiom | What it means |
|---|---|
| Completeness | ∑i IGi(x) = π(x) − π(x'). The attributions add up to the total output change. |
| Sensitivity | If the input and the baseline differ in one feature and produce different outputs, that feature's attribution is non-zero. |
Completeness is remarkable: it means the attributions are a complete accounting of why the output is what it is, relative to the baseline. Nothing is left unaccounted for.
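Completeness is easy to check numerically. The sketch below applies the same Riemann-sum approximation (α = k/m for k = 1..m) to an assumed toy policy 3·x1² + 0.5·x2 + 0.1·x3 with a hand-written gradient standing in for backpropagation, then compares the sum of attributions with π(x) − π(x'). The policy, the input, and the all-zeros baseline are illustrative assumptions.

```python
def toy_policy(x):
    """Assumed toy policy: 3*x1^2 + 0.5*x2 + 0.1*x3."""
    return 3 * x[0] ** 2 + 0.5 * x[1] + 0.1 * x[2]

def toy_gradient(x):
    """Hand-written gradient of toy_policy (stand-in for backprop)."""
    return [6 * x[0], 0.5, 0.1]

def integrated_gradients(grad_fn, x, baseline, m=1000):
    """Riemann-sum approximation using alpha = k/m for k = 1..m."""
    avg_grad = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / m       # running average of the gradient
    # Scale the averaged gradient by each feature's total change
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_gradient, x, baseline)
gap = sum(attributions) - (toy_policy(x) - toy_policy(baseline))
print(attributions, gap)
```

The gap is not exactly zero because the integral is discretized; it shrinks as `m` grows. The linear features receive their full contribution regardless of `m`, while the quadratic feature is where the discretization error shows up.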
Imagine three engineers — Alice, Bob, and Carol — working on a project. The project earns $100. How much credit does each person deserve?
If Alice alone produces $30, Bob alone $40, and Carol alone $10 — but Alice and Bob together produce $90 (they have synergy!) — a simple "divide by contribution" approach breaks down. Alice's value depends on whether Bob is also on the team.
Lloyd Shapley solved this in 1953 with an elegant idea from cooperative game theory. For each player, consider every possible subset of the other players. Add the player to each subset. The marginal contribution is how much the function value changes. The Shapley value is the weighted average of these marginal contributions across all subsets:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \, \big[ f(S \cup \{i\}) - f(S) \big]$$
Let's decode this. N is the set of all n features. S is a subset that does not include feature i. The weighting factor |S|!(n − |S| − 1)!/n! ensures we average uniformly over all possible orderings in which feature i could join the coalition. f(S) is the function value when only features in S are "present" (others set to baseline).
Let's work through a concrete example with n = 3 features. The function is:

$$f(x_1, x_2, x_3) = 4x_1 + 2x_2 + x_3 + 3x_1 x_2$$
The interaction term 3x1x2 means features 1 and 2 have synergy. Their combined contribution exceeds the sum of their individual contributions.
Interactive figure: f(x₁, x₂, x₃) = 4x₁ + 2x₂ + x₃ + 3x₁x₂ with baseline (0, 0, 0). Adjusting the inputs updates all 2³ = 8 subsets, the marginal contributions, and the final Shapley values.
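The enumeration can be written out directly for this three-feature function. The sketch below evaluates every subset with absent features set to the baseline and applies the subset weights |S|!(n − |S| − 1)!/n!; the input point (1, 2, 3) and the helper names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial

def f(x):
    """The worked-example function 4*x1 + 2*x2 + x3 + 3*x1*x2."""
    return 4 * x[0] + 2 * x[1] + x[2] + 3 * x[0] * x[1]

def coalition_value(f, x, baseline, subset):
    """f with features in `subset` at their actual value, all others at baseline."""
    masked = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return f(masked)

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = coalition_value(f, x, baseline, set(subset) | {i})
                without_i = coalition_value(f, x, baseline, set(subset))
                phi[i] += weight * (with_i - without_i)   # weighted marginal contribution
    return phi

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(f, x, baseline)
print(phi)  # approximately [7.0, 7.0, 3.0]
```

At this point the interaction term contributes 3·1·2 = 6, and the Shapley values split it evenly: feature 1 gets 4 + 3, feature 2 gets 4 + 3, and feature 3 keeps its own 3. The three values sum to f(x) − f(baseline) = 17. The loop visits 2^(n−1) subsets per feature, which is why enumeration stops being practical beyond a few dozen features.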
SHAP (SHapley Additive exPlanations) approximates Shapley values by sampling random subsets instead of enumerating all of them. Kernel SHAP fits a weighted linear model around the prediction. Tree SHAP exploits the structure of tree-based models for exact computation in polynomial time. These approximations make Shapley values practical for production use.
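To give a feel for the sampling idea, the sketch below estimates Shapley values by averaging marginal contributions over random feature orderings. This is plain Monte Carlo permutation sampling, shown as a simpler relative of the estimators named above; it is not Kernel SHAP or Tree SHAP, and it does not use the SHAP library. The function, input point, sample count, and seed are illustrative assumptions.

```python
import random

def f(x):
    """Assumed example function 4*x1 + 2*x2 + x3 + 3*x1*x2."""
    return 4 * x[0] + 2 * x[1] + x[2] + 3 * x[0] * x[1]

def sampled_shapley(f, x, baseline, num_permutations=2000, seed=0):
    """Estimate Shapley values from random orderings of the features."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)       # start with every feature "absent"
        prev_value = f(current)
        for i in order:
            current[i] = x[i]          # feature i joins the coalition
            value = f(current)
            phi[i] += value - prev_value   # its marginal contribution in this ordering
            prev_value = value
    return [p / num_permutations for p in phi]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
estimate = sampled_shapley(f, x, baseline)
print(estimate)
```

Each ordering's marginal contributions telescope to f(x) − f(baseline), so the estimates always sum to the right total; only the split between interacting features is noisy. The cost is `num_permutations × n` model evaluations instead of growing exponentially with n.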
What if, instead of explaining the neural network directly, we train a simpler model to mimic it — and explain that?
A surrogate model is a simple, interpretable model (typically linear regression) fitted to the neural network's predictions. The surrogate doesn't need to be perfect. It just needs to capture the important input-output relationships well enough that its coefficients are meaningful. For a linear surrogate:

$$\hat{\pi}(x) = w_0 + \sum_{i=1}^{n} w_i x_i$$
The weights wi are the surrogate's "explanation." A large positive w3 means feature 3 is strongly positively associated with the output. The sign tells direction, the magnitude tells strength.
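To ensure the surrogate highlights only the most important features, use LASSO regularization (L1 penalty). LASSO drives small weights to exactly zero, producing a sparse model where only the features that truly matter have nonzero coefficients. This is a form of automatic feature selection. Given N sampled inputs x^(1), …, x^(N), the weights minimize:

$$\min_{w_0, w} \; \sum_{j=1}^{N} \Big( \pi\big(x^{(j)}\big) - w_0 - \sum_{i=1}^{n} w_i x_i^{(j)} \Big)^2 + \lambda \sum_{i=1}^{n} |w_i|$$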
The parameter λ controls the sparsity-fidelity tradeoff. Large λ produces a very simple model (few nonzero weights) that may not fit well. Small λ produces an accurate but complex model that is harder to interpret. The best practice is to report the R² score alongside the surrogate, so the reader knows how well the simple model actually approximates the original policy.
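The following is a minimal, dependency-free sketch of a LASSO surrogate, assuming the objective "sum of squared errors plus λ times the sum of absolute weights" and a hand-rolled coordinate-descent solver. The `black_box` function stands in for the neural network, and the evaluation grid, the value of `lam`, and all names are illustrative assumptions. Library solvers often scale the penalty differently, so λ values are not directly comparable across implementations.

```python
from itertools import product

def black_box(x):
    """Stand-in for the neural network policy (assumed for illustration)."""
    return 3 * x[0] + 0.5 * x[1] + 0.02 * x[2] + 0.3 * x[0] * x[1]

def predict(x, w0, w):
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def soft_threshold(value, threshold):
    """Shrink toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso(X, y, lam, sweeps=50):
    """Coordinate descent for: sum of squared errors + lam * sum(|w_i|)."""
    n_features = len(X[0])
    w0, w = sum(y) / len(y), [0.0] * n_features
    for _ in range(sweeps):
        for i in range(n_features):
            # Residual with feature i's own contribution added back
            partial = [yj - predict(xj, w0, w) + w[i] * xj[i] for xj, yj in zip(X, y)]
            rho = sum(xj[i] * r for xj, r in zip(X, partial))
            z = sum(xj[i] ** 2 for xj in X)   # assumes no all-zero feature column
            w[i] = soft_threshold(rho, lam / 2) / z
        # The intercept is not penalized: shift it by the mean remaining error
        w0 += sum(yj - predict(xj, w0, w) for xj, yj in zip(X, y)) / len(y)
    return w0, w

def r_squared(X, y, w0, w):
    mean_y = sum(y) / len(y)
    ss_res = sum((yj - predict(xj, w0, w)) ** 2 for xj, yj in zip(X, y))
    ss_tot = sum((yj - mean_y) ** 2 for yj in y)
    return 1 - ss_res / ss_tot

levels = [-1.0, -0.5, 0.0, 0.5, 1.0]
X = [list(p) for p in product(levels, repeat=3)]   # 125 evaluation points
y = [black_box(x) for x in X]                      # fit to the model's outputs
w0, w = fit_lasso(X, y, lam=10.0)
print(w0, w, r_squared(X, y, w0, w))
```

With this λ the weakly relevant third feature is driven to exactly zero, while the two dominant weights are shrunk slightly below their true slopes. That shrinkage is the price of sparsity, which is why the R² value belongs next to the reported weights.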
Linear surrogates give weights. Decision trees give rules. "If speed > 40 and distance_to_obstacle < 10, then brake. Else if heading_error > 15°, then turn. Else accelerate." A validation engineer can read this, check it against domain knowledge, and spot problems.
A decision tree surrogate is fitted to the neural network's predictions (not the true labels). The tree is grown by recursively splitting on the feature and threshold that best separates the data, then pruned to a desired depth.
A neural network classifies 2D points into two classes. The tree surrogate approximates its decision boundary. Adjust depth to see the fidelity/interpretability tradeoff.
The canvas shows two things: the tree's decision regions (colored areas) and the neural network's decision boundary (dashed line). At depth 1, the tree makes a single vertical or horizontal cut — a crude approximation. At depth 3-4, the axis-aligned rectangles start to match the curved boundary reasonably well. At depth 6, the match is near-perfect, but the tree has up to 64 leaf nodes — no longer a simple explanation.
| Depth | Max leaves | Typical fidelity | Interpretability |
|---|---|---|---|
| 1 | 2 | Low (∼60%) | Trivial to read |
| 3 | 8 | Medium (∼85%) | Easy to read |
| 5 | 32 | High (∼95%) | Takes effort |
| 10 | 1,024 | Very high (∼99%) | Unreadable |
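The fidelity/depth tradeoff can be reproduced with a small hand-written tree. In the sketch below, `black_box` is an assumed stand-in for the neural network (class 1 inside the unit circle), the tree is grown greedily with Gini impurity, and depth is capped during growth rather than by pruning afterwards. All names and the evaluation grid are illustrative.

```python
def black_box(point):
    """Stand-in for the neural network classifier: class 1 inside the unit circle."""
    x, y = point
    return 1 if x * x + y * y < 1.0 else 0

def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_tree(points, labels, depth):
    """Greedy axis-aligned tree: a leaf label, or (feature, threshold, left, right)."""
    majority = 1 if 2 * sum(labels) > len(labels) else 0
    if depth == 0 or len(set(labels)) == 1:
        return majority
    best = None
    for feature in range(len(points[0])):
        # Skip the smallest value so both sides of every split are non-empty
        for threshold in sorted({p[feature] for p in points})[1:]:
            left = [l for p, l in zip(points, labels) if p[feature] < threshold]
            right = [l for p, l in zip(points, labels) if p[feature] >= threshold]
            impurity = len(left) * gini(left) + len(right) * gini(right)
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold)
    if best is None:
        return majority
    _, feature, threshold = best
    left = [(p, l) for p, l in zip(points, labels) if p[feature] < threshold]
    right = [(p, l) for p, l in zip(points, labels) if p[feature] >= threshold]
    return (feature, threshold,
            fit_tree([p for p, _ in left], [l for _, l in left], depth - 1),
            fit_tree([p for p, _ in right], [l for _, l in right], depth - 1))

def predict(tree, point):
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if point[feature] < threshold else right
    return tree

def fidelity(tree, points, labels):
    """Fraction of points where the tree agrees with the black box."""
    return sum(predict(tree, p) == l for p, l in zip(points, labels)) / len(points)

points = [(i / 5, j / 5) for i in range(-10, 11) for j in range(-10, 11)]
labels = [black_box(p) for p in points]   # the model's outputs, not ground truth
fidelities = {}
for depth in range(1, 7):
    fidelities[depth] = fidelity(fit_tree(points, labels, depth), points, labels)
    print(depth, round(fidelities[depth], 3))
```

Fidelity here is measured on the same points the tree was fitted to, so it can only stay level or rise with depth. A held-out set of inputs would give a more honest estimate of how well the rules mimic the network.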
An applicant is denied a loan. They don't care about Shapley values or saliency maps. They want one thing: "What would I need to change to get approved?"
A counterfactual explanation answers exactly this question. It finds the smallest change to the input that flips the model's decision. "If your income were $5,000 higher, or your debt were $2,000 lower, you would be approved."
Formally, given input x with undesired outcome y, we seek x' such that the model predicts the desired outcome y', while balancing four objectives:
| Objective | What it means | Why it matters |
|---|---|---|
| Outcome | π(x') = y' | The counterfactual must actually change the decision |
| Closeness | dist(x, x') is small | The change should be minimal |
| Sparsity | Few features change | Humans prefer "change one thing" explanations |
| Plausibility | x' is realistic | "Set your age to −5" is not helpful |
Counterfactuals are fundamentally different from the other methods in this chapter. Saliency, gradients, and Shapley values are attributive — they decompose the current decision. Counterfactuals are contrastive — they show what could be different. Both are valuable, but they answer different questions.
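A brute-force search shows how the four objectives interact in the simplest case. The sketch assumes a hypothetical linear loan rule, an applicant, step sizes, and plausibility bounds, all chosen so that the result matches the "$5,000 higher income or $2,000 lower debt" example; none of these numbers come from a real model.

```python
def approve(applicant):
    """Hypothetical loan model: approve when income - 2.5 * debt reaches 40,000."""
    return applicant["income"] - 2.5 * applicant["debt"] >= 40_000

def single_feature_counterfactuals(model, x, steps, bounds, max_steps=200):
    """For each feature, find the smallest single-feature change that flips the decision.

    Outcome: the candidate must be approved. Sparsity: one feature changes.
    Closeness: step sizes grow from small to large. Plausibility: stay within bounds.
    """
    results = {}
    for name, step in steps.items():
        low, high = bounds[name]
        for direction in (+1, -1):
            for k in range(1, max_steps + 1):
                change = direction * k * step
                candidate = dict(x)
                candidate[name] = x[name] + change
                if not low <= candidate[name] <= high:
                    break                      # implausible value: stop this direction
                if model(candidate):
                    if name not in results or abs(change) < abs(results[name]):
                        results[name] = change
                    break                      # smallest flip in this direction found
    return results

applicant = {"income": 50_000, "debt": 6_000}          # assumed denied applicant
steps = {"income": 500, "debt": 500}                   # assumed search granularity
bounds = {"income": (0, 200_000), "debt": (0, 100_000)}

counterfactuals = single_feature_counterfactuals(approve, applicant, steps, bounds)
print(counterfactuals)  # {'income': 5000, 'debt': -2000}
```

Restricting the search to one feature at a time enforces sparsity by construction. Practical methods usually optimize over several features at once and use a learned notion of plausibility rather than fixed bounds.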
The methods so far explain individual decisions. But a validation engineer often needs to understand failure patterns: "Why does the system fail, and are there distinct failure types?"
Suppose you run 10,000 simulations and 200 result in failure. Looking at each one individually is tedious. Instead, you can cluster the failure trajectories to discover groups of similar failures. Each cluster represents a distinct failure mode.
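The process has three steps: extract a feature vector from each failure trajectory, cluster those vectors (for example with k-means), and inspect each cluster to characterize the failure mode it represents.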
The key question is: what features do you extract? Simple statistics (max speed, min distance to obstacle, duration) work for basic analysis. But for temporal patterns ("the car swerved, then overcorrected, then went off-road"), you need features that capture sequence structure. This is where temporal logic comes in (next chapter).
Choosing k (the number of clusters) matters. Too few clusters merge distinct failure modes. Too many create artificial distinctions. The elbow method plots total within-cluster variance against k and looks for the "elbow" where adding more clusters gives diminishing returns. Alternatively, silhouette scores measure how well-separated the clusters are.
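The pipeline and the elbow check can be sketched without any libraries. The example assumes each failure trajectory is a list of (speed, distance_to_obstacle) samples, uses two simple features (max speed, min distance), and clusters with a basic k-means that starts from a deterministic farthest-point initialization. The trajectories and all names are made up for illustration.

```python
def extract_features(trajectory):
    """Simple statistics per trajectory: (max speed, min distance to obstacle)."""
    return (max(s for s, _ in trajectory), min(d for _, d in trajectory))

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centers)))
    for _ in range(iterations):
        assignment = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Keep the old center if a cluster ends up empty
            new_centers.append(tuple(sum(v) / len(members) for v in zip(*members))
                               if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    assignment = assign(points, centers)
    inertia = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, inertia

# Hypothetical failure trajectories: lists of (speed, distance_to_obstacle) samples
failures = [
    [(35, 5), (40, 1)], [(42, 4), (30, 1)], [(40, 6), (33, 3)], [(42, 3), (41, 5)],
    [(10, 4), (8, 1)], [(12, 1), (9, 2)], [(10, 3), (7, 5)], [(12, 6), (11, 3)],
]
features = [extract_features(t) for t in failures]
inertias = {k: kmeans(features, k)[2] for k in (1, 2, 3)}
labels, centers, _ = kmeans(features, 2)
print(inertias, labels, centers)
```

The within-cluster variance collapses between k = 1 and k = 2 and barely moves at k = 3, which is the elbow: two failure modes, one high-speed and one low-speed. Real features usually need standardizing first, since a large-valued feature such as speed would otherwise dominate the distance.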
A failure trajectory is not just a point — it's a sequence of states over time. Simple features like "max speed" throw away the temporal structure. The car might have been going fast at the beginning and slow at the end, or vice versa. These are different failure patterns that "max speed" merges together.
Signal Temporal Logic (STL) provides a language for describing temporal properties of trajectories. A few examples:
| STL formula | English |
|---|---|
| G[0,T](speed < 30) | Speed is always below 30 during [0, T] |
| F[0,5](distance < 2) | At some point in the first 5 seconds, distance drops below 2 |
| G[0,T](distance > 0.5) → F[0,T](reached_goal) | If we always stay safe, then we eventually reach the goal |
Here G means "globally" (always true in the time interval) and F means "finally" (true at some point in the interval). These are the same operators from Chapter 3's property specification.
Parametric STL (PSTL) extends this by making the thresholds into parameters. Instead of "speed < 30", you write "speed < c" where c is a free parameter. Given a trajectory, you can compute the tightest parameter that makes the formula true. For a trajectory where max speed is 42, the tightest bound for G(speed < c) is c = 42 (strictly, the formula holds for any c above 42, so 42 is the limiting value).
The robustness of an STL formula measures how much margin there is. A robustness of +5 for "always speed < 30" means the speed stayed at least 5 units below 30 (it peaked at 25). A robustness of −3 means the peak speed exceeded 30 by 3 units. Robustness values can also serve as continuous features for clustering, which is more informative than a binary "satisfied/violated."
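For sampled traces, these quantities reduce to a min or max over the samples. The sketch below assumes each signal is a plain list of values covering the whole interval [0, T]; the function names and the example traces are illustrative.

```python
def robustness_always_below(signal, threshold):
    """Robustness of G(signal < threshold): the worst-case margin over the trace."""
    return min(threshold - value for value in signal)

def robustness_eventually_below(signal, threshold):
    """Robustness of F(signal < threshold): the best-case margin over the trace."""
    return max(threshold - value for value in signal)

def tightest_upper_bound(signal):
    """PSTL parameter: the limiting c for G(signal < c), i.e. the peak value."""
    return max(signal)

safe_speeds = [20, 25, 22]          # assumed trace that stays under 30
fast_speeds = [28, 33, 30]          # assumed trace that peaks at 33
distances = [6.0, 3.5, 1.2, 4.0]    # assumed distance-to-obstacle trace

print(robustness_always_below(safe_speeds, 30))     # 5: peak is 5 below the limit
print(robustness_always_below(fast_speeds, 30))     # -3: peak is 3 above the limit
print(robustness_eventually_below(distances, 2.0))  # positive: distance dips below 2
print(tightest_upper_bound([30, 42, 38]))           # 42
```

Each trajectory can then be described by a vector of robustness values or tightest parameters, one per formula, and those vectors can be fed to the clustering step in place of simple statistics.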