Kochenderfer et al., Chapter 11

Explainability

When a system says "turn left," you need to know why — before you trust it with lives.

Prerequisites: Chapters 1–3 (validation framework, system modeling, property specs). That's it.
11
Chapters
5+
Simulations
11
Quizzes

Chapter 0: Why Explainability?

An autonomous drone decides to abort a landing. The pilot didn't command it. The sensors look fine. The wind is calm. Why did it abort?

You check the logs. The neural network controller output a single number: action = 0.87 (abort). That number came from 12 million parameters multiplied together through 40 layers. Good luck tracing the reasoning.

This is the explainability problem. A validation engineer can verify that a system meets performance metrics — 99.7% safe landings in simulation, say. But metrics don't tell you why it fails the other 0.3%, or what features it relies on, or whether it will generalize to conditions you haven't tested.

The stakes are legal, not just technical. The EU's GDPR includes a "right to explanation" for automated decisions. The US FAA requires that safety-critical avionics decisions be traceable. A black-box neural network that says "deny the loan" or "abort the landing" without explanation may violate regulations — even if it's statistically accurate.

Explainability also serves trust calibration. A pilot who understands why the autopilot makes decisions can detect when it's relying on the wrong cues — like a classifier that learned to detect "wolf" by checking for snow in the background. Without explanations, operators either over-trust (complacency) or under-trust (disuse) the system.

This chapter covers the full toolkit. We start with the simplest methods — visualizing what the policy does — and build toward more principled techniques: gradient-based attribution, Shapley values, surrogate models, counterfactuals, and failure mode clustering.

Local methods
Explain one decision: saliency, gradients, counterfactuals
Global methods
Explain the whole policy: surrogates, decision trees, Shapley
Failure methods
Explain why it breaks: clustering, temporal logic features
MethodScopeWhat it tells you
Sensitivity analysisLocalWhich inputs matter at this point
Saliency / gradientsLocalDirection of steepest change
Integrated gradientsLocalTotal attribution along a path
Shapley valuesLocal or globalFair contribution of each feature
Surrogate modelsGlobalApproximate the policy with something readable
Decision treesGlobalIf-then rules that mimic the policy
CounterfactualsLocalSmallest change to flip the decision
Check: Why are raw performance metrics insufficient for validating autonomous systems?

Chapter 1: Policy Visualization

The simplest form of explainability is just looking at what the policy does. If your state space is low-dimensional (2D or 3D), you can plot the policy as a map: for every point in state space, color it by what action the policy takes.

Consider an autonomous car navigating a 2D plane. The state is (x-position, y-position). The policy outputs one of four actions: accelerate, brake, turn left, turn right. A policy heatmap colors each grid cell by the chosen action. At a glance, you see "the car brakes when it's near an obstacle, accelerates on clear road, and turns when it's drifting toward a wall."

Key insight: Visualization is not "sophisticated," but it catches bugs that no metric will. A policy that always turns left regardless of state? Obvious on a heatmap. Invisible in aggregate accuracy numbers.

For higher-dimensional states, you fix all dimensions except two and plot a slice. If the car's full state is (x, y, speed, heading, distance_to_obstacle), you might fix speed = 30, heading = north, and plot the action as a function of (x, distance_to_obstacle). Multiple slices reveal how the policy changes with each variable.

Trajectory aggregation takes a different approach. Instead of plotting the action at every state, you overlay many simulated trajectories on the same map. Color them by outcome (green = safe, red = failure). Clusters of red trajectories reveal failure regions in state space — areas where the policy consistently makes bad decisions.

Policy Heatmap Explorer

A toy 2D policy: given (x, y) position, the policy chooses one of four actions. Each cell shows the action color. Drag the obstacle (red dot) to see how a simple policy adapts.

Drag the red dot to move the obstacle
What to notice: The policy carves state space into regions. Near the obstacle, the dominant action switches. This is exactly the kind of structure that helps a validation engineer ask: "Is this decision boundary in the right place? Does the braking region extend far enough?"

Visualization has clear limits. It works for 2D state spaces, or 2D slices of higher-dimensional ones. For a 50-dimensional observation space (like a flattened image), you need the methods in the following chapters. But as a first pass, always start here.

Check: What is the main limitation of policy visualization via heatmaps?

Chapter 2: Sensitivity Analysis

You have a policy π that maps an observation x = (x1, x2, …, xn) to an action. You want to know: which features matter most? The most direct approach is to wiggle each one and see what happens.

Sensitivity analysis perturbs one input feature at a time, holding all others fixed, and measures how much the output changes. If changing x3 by ±10% causes the action to swing wildly, then x3 is a high-sensitivity feature. If changing x7 barely affects the output, x7 is low-sensitivity.

sensitivity(xi) = | π(x1, …, xi + δ, …, xn) − π(x1, …, xi, …, xn) | / δ

This is just the magnitude of the finite-difference derivative with respect to feature i. A large value means the policy is sensitive to that feature at this operating point.

Sensitivity is local. It tells you which features matter at a specific input. The policy might be very sensitive to wind speed when cruising but insensitive to it when parked. Always specify the operating point.

To get a global picture, evaluate sensitivity at many operating points (sampled from your test distribution) and average. This gives mean absolute sensitivity — a rough ranking of feature importance across the whole input space.

Sensitivity Explorer

A toy policy: π(x1, x2, x3) = 3x12 + 0.5x2 + 0.1x3. Adjust each feature to see how sensitivity changes. The bar chart shows |∂π/∂xi| at the current point.

x11.0
x21.0
x31.0

Notice something critical: the sensitivity of x1 depends on the current value of x1, because the derivative of 3x12 is 6x1. At x1 = 0, the sensitivity to x1 is zero! This is a fundamental property of nonlinear functions — sensitivity is a function of where you are, not a fixed property of the feature.

Interaction effects: Sensitivity analysis perturbs one feature at a time. If two features interact (e.g., π depends on x1 · x2), single-feature perturbation can miss the interaction entirely. Shapley values (Chapter 5) handle interactions correctly, but at much greater computational cost.
Check: For the function π(x) = 3x12 + 0.5x2, what is the sensitivity to x1 at x1 = 2?

Chapter 3: Saliency & Gradients

Sensitivity analysis uses finite differences — perturb by δ, measure the change, divide. But if your policy is a differentiable neural network, you can compute the exact derivative using backpropagation. This is saliency mapping.

The saliency of feature i at input x is simply the partial derivative of the policy's output with respect to that input:

saliencyi(x) = ∂π(x) / ∂xi

Backpropagation computes all n partial derivatives in a single backward pass — O(1) cost per feature, compared to O(n) forward passes for finite differences. For a policy that takes a 224×224 image as input (50,176 features), saliency is fast; finite-difference sensitivity would require 50,176 separate evaluations.

Saliency maps for images produce a heatmap the same size as the input image. Bright pixels in the saliency map are pixels that, if changed slightly, would most affect the output. For an autonomous car, you hope to see high saliency on lane markings and obstacles — not on clouds or billboards.

The gradient tells you the direction of steepest ascent. The sign matters: a positive gradient on xi means increasing xi increases the output; negative means the opposite. The magnitude tells you how steeply.

The saturation problem. Gradients are local — they describe the function in an infinitesimal neighborhood of x. If the policy uses a ReLU or sigmoid that is saturated (flat) at the current input, the gradient is zero even though the feature clearly matters. Imagine a step function that switches from "brake" to "accelerate" at distance = 5m. At distance = 3m, the gradient is zero (the function is flat in the braking region), but distance is obviously critical. This is why pure gradient saliency can be misleading.

Two common refinements:

VariantFormulaWhy
Gradient × Inputxi · ∂π/∂xiScales by feature magnitude; zero input → zero attribution
SmoothGradAverage saliency over noisy copies of xReduces noise in gradient estimates

Both help, but neither solves the saturation problem fundamentally. For that, we need the path-based method in the next chapter.

Check: Why can gradient saliency give zero attribution to a feature that clearly matters?

Chapter 4: Integrated Gradients

Gradients at a point are like reading the slope of a mountain at your current location. Useful, but it misses the full picture. If you walked from sea level (a baseline) to your current position, the total altitude gain is what matters — not just the local slope.

Integrated gradients formalize this idea. Instead of computing the gradient at one point, they integrate the gradient along a straight-line path from a baseline input x' (typically all zeros or a black image) to the actual input x:

IGi(x) = (xi − x'i) × ∫01 (∂π / ∂xi)(x' + α(x − x')) dα

Let's unpack this. The integral walks along the path from x' to x, parameterized by α ∈ [0, 1]. At each point on the path, it samples the gradient with respect to feature i. Then it multiplies by (xi − x'i) — the total change in that feature from baseline to actual input.

Why this works: The integral captures the cumulative effect of each feature as we interpolate from "nothing" to the actual input. If feature i causes the output to rise somewhere along the path (even if the gradient is zero at the endpoint due to saturation), the integrated gradient picks it up. This solves the saturation problem from Chapter 3.

In practice, the integral is approximated by summing over m equally-spaced points along the path:

IGi(x) ≈ (xi − x'i) × (1/m) ∑k=1m (∂π / ∂xi)(x' + (k/m)(x − x'))
python
def integrated_gradients(model, x, baseline, m=50):
    # x, baseline: input tensors of same shape
    alphas = torch.linspace(0, 1, m+1)[1:]  # skip 0
    grads = []
    for alpha in alphas:
        interp = baseline + alpha * (x - baseline)
        interp.requires_grad_(True)
        out = model(interp)
        out.backward()
        grads.append(interp.grad.clone())
    avg_grad = torch.stack(grads).mean(dim=0)
    return (x - baseline) * avg_grad

Integrated gradients satisfy two important axioms:

AxiomWhat it means
Completenessi IGi(x) = π(x) − π(x'). The attributions add up to the total output change.
SensitivityIf changing one feature changes the output, its attribution is non-zero.

Completeness is remarkable: it means the attributions are a complete accounting of why the output is what it is, relative to the baseline. Nothing is left unaccounted for.

Baseline choice matters. The baseline x' represents "the absence of input." For images, a black image (all zeros) is common. For tabular data, using the dataset mean or zeros depends on the domain. An inappropriate baseline can produce misleading attributions — always justify your choice.
Check: What problem does integrating gradients along a path solve, compared to computing gradients at the input alone?

Chapter 5: Shapley Values

Imagine three engineers — Alice, Bob, and Carol — working on a project. The project earns $100. How much credit does each person deserve?

If Alice alone produces $30, Bob alone $40, and Carol alone $10 — but Alice and Bob together produce $90 (they have synergy!) — a simple "divide by contribution" approach breaks down. Alice's value depends on whether Bob is also on the team.

Lloyd Shapley solved this in 1953 with an elegant idea from cooperative game theory. For each player, consider every possible subset of the other players. Add the player to each subset. The marginal contribution is how much the function value changes. The Shapley value is the average marginal contribution across all subsets.

φi = ∑S ⊆ N\{i} [ |S|! (n − |S| − 1)! / n! ] × [ f(S ∪ {i}) − f(S) ]

Let's decode this. N is the set of all n features. S is a subset that does not include feature i. The weighting factor |S|!(n − |S| − 1)!/n! ensures we average uniformly over all possible orderings in which feature i could join the coalition. f(S) is the function value when only features in S are "present" (others set to baseline).

Shapley values are the unique attribution method that satisfies four fairness axioms: (1) Efficiency: attributions sum to f(x) − f(baseline). (2) Symmetry: features with identical contributions get equal credit. (3) Null player: a feature that never changes the output gets zero. (4) Linearity: attributions for a sum of functions equal the sum of attributions.

Let's work through a concrete example with n = 3 features. The function is:

f(x1, x2, x3) = 4x1 + 2x2 + x3 + 3x1x2

The interaction term 3x1x2 means features 1 and 2 have synergy. Their combined contribution exceeds the sum of their individual contributions.

Shapley Value Calculator

Function: f(x1,x2,x3) = 4x1 + 2x2 + x3 + 3x1x2. Baseline = (0,0,0). Adjust inputs and watch all 23 = 8 subsets, marginal contributions, and final Shapley values.

x12.0
x21.0
x31.0
Showing summary
The computational bomb: For n features, exact Shapley values require evaluating f on all 2n subsets. With n = 3, that's 8 — easy. With n = 10, that's 1,024 — still manageable. With n = 100 (a modest neural network input), that's 2100 ≈ 1030. The age of the universe in nanoseconds is only about 1026. Exact computation is impossible for real-world systems.

SHAP (SHapley Additive exPlanations) approximates Shapley values by sampling random subsets instead of enumerating all of them. Kernel SHAP fits a weighted linear model around the prediction. Tree SHAP exploits the structure of tree-based models for exact computation in polynomial time. These approximations make Shapley values practical for production use.

Shapley vs. integrated gradients: Both satisfy completeness (attributions sum to the output difference). Shapley values handle feature interactions correctly by averaging over all subsets. Integrated gradients follow one path (baseline → input) and can miss interactions. Shapley is more principled but far more expensive.
Check: Why is exact Shapley value computation infeasible for high-dimensional inputs?

Chapter 6: Surrogate Models

What if, instead of explaining the neural network directly, we train a simpler model to mimic it — and explain that?

A surrogate model is a simple, interpretable model (typically linear regression) fitted to the neural network's predictions. The surrogate doesn't need to be perfect. It just needs to capture the important input-output relationships well enough that its coefficients are meaningful.

π̂(x) = w0 + w1x1 + w2x2 + … + wnxn

The weights wi are the surrogate's "explanation." A large positive w3 means feature 3 is strongly positively associated with the output. The sign tells direction, the magnitude tells strength.

Global vs. local surrogates. A global surrogate is fitted to the entire input distribution — it summarizes the whole policy. A local surrogate (as in LIME) is fitted to a neighborhood around a single input — it explains one specific decision. Local surrogates can be more accurate because a linear model only needs to approximate the neural network in a small region, where it may be approximately linear.

To ensure the surrogate highlights only the most important features, use LASSO regularization (L1 penalty). LASSO drives small weights to exactly zero, producing a sparse model where only the features that truly matter have nonzero coefficients. This is a form of automatic feature selection.

minwj (π(xj) − π̂(xj))2 + λ ∑i |wi|

The parameter λ controls the sparsity-fidelity tradeoff. Large λ produces a very simple model (few nonzero weights) that may not fit well. Small λ produces an accurate but complex model that is harder to interpret. The best practice is to report the R2 score alongside the surrogate, so the reader knows how well the simple model actually approximates the original policy.

Danger: surrogates can lie. If the surrogate's R2 is 0.6, it only captures 60% of the variance. The remaining 40% may involve complex interactions that the linear model misses entirely. Always report fidelity. A surrogate with low fidelity is worse than no explanation — it gives false confidence.
Check: Why is LASSO regularization useful when building surrogate models for explainability?

Chapter 7: Decision Trees

Linear surrogates give weights. Decision trees give rules. "If speed > 40 and distance_to_obstacle < 10, then brake. Else if heading_error > 15°, then turn. Else accelerate." A validation engineer can read this, check it against domain knowledge, and spot problems.

A decision tree surrogate is fitted to the neural network's predictions (not the true labels). The tree is grown by recursively splitting on the feature and threshold that best separates the data, then pruned to a desired depth.

The depth tradeoff: A shallow tree (depth 3) has at most 8 leaf nodes. It's easy to read but may not faithfully reproduce the neural network's behavior. A deep tree (depth 20) closely matches the network but is as opaque as the network itself. The sweet spot is the shallowest tree whose fidelity (agreement with the network) is acceptably high.
Decision Tree Builder

A neural network classifies 2D points into two classes. The tree surrogate approximates its decision boundary. Adjust depth to see the fidelity/interpretability tradeoff.

Max depth2
Fidelity: --

The canvas shows two things: the tree's decision regions (colored areas) and the neural network's decision boundary (dashed line). At depth 1, the tree makes a single vertical or horizontal cut — a crude approximation. At depth 3-4, the axis-aligned rectangles start to match the curved boundary reasonably well. At depth 6, the match is near-perfect, but the tree has up to 64 leaf nodes — no longer a simple explanation.

DepthMax leavesTypical fidelityInterpretability
12Low (∼60%)Trivial to read
38Medium (∼85%)Easy to read
532High (∼95%)Takes effort
101,024Very high (∼99%)Unreadable
Trees vs. linear surrogates: Trees capture nonlinear decision boundaries (via axis-aligned splits) and interactions (via nested splits). Linear models cannot. But trees are discrete — they assign the same prediction to all points in a leaf — so they can miss smooth gradients in the policy.
Check: What happens to a decision tree surrogate as you increase its depth?

Chapter 8: Counterfactuals

A patient is denied a loan. They don't care about Shapley values or saliency maps. They want one thing: "What would I need to change to get approved?"

A counterfactual explanation answers exactly this question. It finds the smallest change to the input that flips the model's decision. "If your income were $5,000 higher, or your debt were $2,000 lower, you would be approved."

Formally, given input x with undesired outcome y, we seek x' such that the model predicts the desired outcome y', while minimizing a combination of four objectives:

minx' λ1·loss(π(x'), y') + λ2·dist(x, x') + λ3·sparsity(x − x') + λ4·implausibility(x')
ObjectiveWhat it meansWhy it matters
Outcomeπ(x') = y'The counterfactual must actually change the decision
Closenessdist(x, x') is smallThe change should be minimal
SparsityFew features changeHumans prefer "change one thing" explanations
Plausibilityx' is realistic"Set your age to −5" is not helpful
Plausibility is the hard part. Without it, the optimizer might find technically valid but absurd counterfactuals: "If you were 20 years younger and your income quadrupled..." Plausibility constraints typically require that x' lies within the data distribution, enforced by a density model or by restricting features to their observed ranges.

Counterfactuals are fundamentally different from the other methods in this chapter. Saliency, gradients, and Shapley values are attributive — they decompose the current decision. Counterfactuals are contrastive — they show what could be different. Both are valuable, but they answer different questions.

Multiple counterfactuals: There may be many valid counterfactuals. "Increase income by $5K" and "reduce debt by $2K" might both flip the decision. Presenting a diverse set of counterfactuals (varying which features change) gives the user more actionable options.
Check: What are the four objectives balanced when generating a counterfactual explanation?

Chapter 9: Failure Mode Clustering

The methods so far explain individual decisions. But a validation engineer often needs to understand failure patterns: "Why does the system fail, and are there distinct failure types?"

Suppose you run 10,000 simulations and 200 result in failure. Looking at each one individually is tedious. Instead, you can cluster the failure trajectories to discover groups of similar failures. Each cluster represents a distinct failure mode.

The process has three steps:

1. Featurize
Extract summary features from each failure trajectory
2. Cluster
Group similar failures (k-means, DBSCAN, etc.)
3. Interpret
Examine each cluster's centroid to understand the failure type

The key question is: what features do you extract? Simple statistics (max speed, min distance to obstacle, duration) work for basic analysis. But for temporal patterns ("the car swerved, then overcorrected, then went off-road"), you need features that capture sequence structure. This is where temporal logic comes in (next chapter).

k-means in feature space: After extracting m features from each failure trajectory, you have 200 points in Rm. k-means partitions them into k clusters by minimizing within-cluster variance. The cluster centroids are "prototypical failures" — average feature vectors that summarize each failure mode. If cluster 1 has high-speed + short-distance features and cluster 2 has low-visibility + wrong-lane features, you've discovered two distinct failure types.

Choosing k (the number of clusters) matters. Too few clusters merge distinct failure modes. Too many create artificial distinctions. The elbow method plots total within-cluster variance against k and looks for the "elbow" where adding more clusters gives diminishing returns. Alternatively, silhouette scores measure how well-separated the clusters are.

Actionability: Failure clustering is only useful if each cluster leads to a concrete engineering action. "Cluster 1: failures in rain with wet roads → retrain on wet-road data. Cluster 2: failures at dusk with low sun angle → add sun-glare robustness." If the clusters are not interpretable, the features need refinement.
Check: What is the purpose of clustering failure trajectories rather than examining each one individually?

Chapter 10: Temporal Logic Features

A failure trajectory is not just a point — it's a sequence of states over time. Simple features like "max speed" throw away the temporal structure. The car might have been going fast at the beginning and slow at the end, or vice versa. These are different failure patterns that "max speed" merges together.

Signal Temporal Logic (STL) provides a language for describing temporal properties of trajectories. A few examples:

STL formulaEnglish
G[0,T](speed < 30)Speed is always below 30 during [0, T]
F[0,5](distance < 2)At some point in the first 5 seconds, distance drops below 2
G[0,T](distance > 0.5) → F[0,T](reached_goal)If we always stay safe, then we eventually reach the goal

Here G means "globally" (always true in the time interval) and F means "finally" (true at some point in the interval). These are the same operators from Chapter 3's property specification.

Parametric STL (PSTL) extends this by making the thresholds into parameters. Instead of "speed < 30", you write "speed < c" where c is a free parameter. Given a trajectory, you can compute the tightest parameter that makes the formula true. For a trajectory where max speed is 42, the tightest c for G(speed < c) is c = 42.

PSTL as feature extraction: Take a library of PSTL templates (speed bounds, distance bounds, timing constraints). For each failure trajectory, compute the tightest parameters. These parameters become the features for clustering. Now your clusters are described not by abstract numbers but by temporal logic formulas: "Cluster 1: speed exceeds 35 AND distance drops below 1.5 within the first 3 seconds."

The robustness of an STL formula measures how much margin there is. A robustness of +5 for "speed < 30" means the speed stayed 5 units below 30. A robustness of −3 means it exceeded 30 by 3 units. Robustness values can also serve as continuous features for clustering — more informative than binary "satisfied/violated."

Closing the loop: PSTL-based clustering produces interpretable failure modes described in temporal logic. These descriptions can be directly converted into test specifications: "Generate more scenarios where speed exceeds 35 and distance drops below 1.5 within 3 seconds." This connects explainability back to falsification (Chapters 4-5), closing the validation loop.
"If you can't explain it simply, you don't understand it well enough."
— attributed to Richard Feynman
Check: What advantage do PSTL features offer over simple statistics (max speed, min distance) when clustering failure trajectories?