Ch 11: Explainability — Algorithms for Validation

Chapter 0: Why Explainability?

An autonomous drone decides to abort a landing. The pilot didn't command it. The sensors look fine. The wind is calm. Why did it abort?

You check the logs. The neural network controller output a single number: action = 0.87 (abort). That number came from 12 million parameters multiplied together through 40 layers. Good luck tracing the reasoning.

This is the explainability problem. A validation engineer can verify that a system meets performance metrics — 99.7% safe landings in simulation, say. But metrics don't tell you why it fails the other 0.3%, or what features it relies on, or whether it will generalize to conditions you haven't tested.

The stakes are legal, not just technical. The EU's GDPR includes a "right to explanation" for automated decisions. The US FAA requires that safety-critical avionics decisions be traceable. A black-box neural network that says "deny the loan" or "abort the landing" without explanation may violate regulations — even if it's statistically accurate.

Explainability also serves trust calibration. A pilot who understands why the autopilot makes decisions can detect when it's relying on the wrong cues — like a classifier that learned to detect "wolf" by checking for snow in the background. Without explanations, operators either over-trust (complacency) or under-trust (disuse) the system.

This chapter covers the full toolkit. We start with the simplest methods — visualizing what the policy does — and build toward more principled techniques: gradient-based attribution, Shapley values, surrogate models, counterfactuals, and failure mode clustering.

Local methods

Explain one decision: saliency, gradients, counterfactuals

↓

Global methods

Explain the whole policy: surrogates, decision trees, Shapley

↓

Failure methods

Explain why it breaks: clustering, temporal logic features

Method	Scope	What it tells you
Sensitivity analysis	Local	Which inputs matter at this point
Saliency / gradients	Local	Direction of steepest change
Integrated gradients	Local	Total attribution along a path
Shapley values	Local or global	Fair contribution of each feature
Surrogate models	Global	Approximate the policy with something readable
Decision trees	Global	If-then rules that mimic the policy
Counterfactuals	Local	Smallest change to flip the decision

Check: Why are raw performance metrics insufficient for validating autonomous systems?

Metrics are always inaccurate Metrics show how often it fails, but not why or what features it relies on Metrics cannot be computed for neural networks

Chapter 1: Policy Visualization

The simplest form of explainability is just looking at what the policy does. If your state space is low-dimensional (2D or 3D), you can plot the policy as a map: for every point in state space, color it by what action the policy takes.

Consider an autonomous car navigating a 2D plane. The state is (x-position, y-position). The policy outputs one of four actions: accelerate, brake, turn left, turn right. A policy heatmap colors each grid cell by the chosen action. At a glance, you see "the car brakes when it's near an obstacle, accelerates on clear road, and turns when it's drifting toward a wall."

Key insight: Visualization is not "sophisticated," but it catches bugs that no metric will. A policy that always turns left regardless of state? Obvious on a heatmap. Invisible in aggregate accuracy numbers.

For higher-dimensional states, you fix all dimensions except two and plot a slice. If the car's full state is (x, y, speed, heading, distance_to_obstacle), you might fix speed = 30, heading = north, and plot the action as a function of (x, distance_to_obstacle). Multiple slices reveal how the policy changes with each variable.

Trajectory aggregation takes a different approach. Instead of plotting the action at every state, you overlay many simulated trajectories on the same map. Color them by outcome (green = safe, red = failure). Clusters of red trajectories reveal failure regions in state space — areas where the policy consistently makes bad decisions.

Policy Heatmap Explorer

A toy 2D policy: given (x, y) position, the policy chooses one of four actions. Each cell shows the action color. Drag the obstacle (red dot) to see how a simple policy adapts.

Drag the red dot to move the obstacle

What to notice: The policy carves state space into regions. Near the obstacle, the dominant action switches. This is exactly the kind of structure that helps a validation engineer ask: "Is this decision boundary in the right place? Does the braking region extend far enough?"

Visualization has clear limits. It works for 2D state spaces, or 2D slices of higher-dimensional ones. For a 50-dimensional observation space (like a flattened image), you need the methods in the following chapters. But as a first pass, always start here.

Check: What is the main limitation of policy visualization via heatmaps?

It requires retraining the policy It can only show one trajectory at a time It only works for low-dimensional state spaces (or 2D slices of higher ones)

Chapter 2: Sensitivity Analysis

You have a policy π that maps an observation x = (x₁, x₂, …, x_n) to an action. You want to know: which features matter most? The most direct approach is to wiggle each one and see what happens.

Sensitivity analysis perturbs one input feature at a time, holding all others fixed, and measures how much the output changes. If changing x₃ by ±10% causes the action to swing wildly, then x₃ is a high-sensitivity feature. If changing x₇ barely affects the output, x₇ is low-sensitivity.

sensitivity(x_i) = | π(x₁, …, x_i + δ, …, x_n) − π(x₁, …, x_i, …, x_n) | / δ

This is just the magnitude of the finite-difference derivative with respect to feature i. A large value means the policy is sensitive to that feature at this operating point.

Sensitivity is local. It tells you which features matter at a specific input. The policy might be very sensitive to wind speed when cruising but insensitive to it when parked. Always specify the operating point.

To get a global picture, evaluate sensitivity at many operating points (sampled from your test distribution) and average. This gives mean absolute sensitivity — a rough ranking of feature importance across the whole input space.

Sensitivity Explorer

A toy policy: π(x₁, x₂, x₃) = 3x₁² + 0.5x₂ + 0.1x₃. Adjust each feature to see how sensitivity changes. The bar chart shows |∂π/∂x_i| at the current point.

x₁1.0

x₂1.0

x₃1.0

Notice something critical: the sensitivity of x₁ depends on the current value of x₁, because the derivative of 3x₁² is 6x₁. At x₁ = 0, the sensitivity to x₁ is zero! This is a fundamental property of nonlinear functions — sensitivity is a function of where you are, not a fixed property of the feature.

Interaction effects: Sensitivity analysis perturbs one feature at a time. If two features interact (e.g., π depends on x₁ · x₂), single-feature perturbation can miss the interaction entirely. Shapley values (Chapter 5) handle interactions correctly, but at much greater computational cost.

Check: For the function π(x) = 3x₁² + 0.5x₂, what is the sensitivity to x₁ at x₁ = 2?

3 12 (because ∂π/∂x₁ = 6x₁ = 6 × 2 = 12) 0.5

Chapter 3: Saliency & Gradients

Sensitivity analysis uses finite differences — perturb by δ, measure the change, divide. But if your policy is a differentiable neural network, you can compute the exact derivative using backpropagation. This is saliency mapping.

The saliency of feature i at input x is simply the partial derivative of the policy's output with respect to that input:

saliency_i(x) = ∂π(x) / ∂x_i

Backpropagation computes all n partial derivatives in a single backward pass — O(1) cost per feature, compared to O(n) forward passes for finite differences. For a policy that takes a 224×224 image as input (50,176 features), saliency is fast; finite-difference sensitivity would require 50,176 separate evaluations.

Saliency maps for images produce a heatmap the same size as the input image. Bright pixels in the saliency map are pixels that, if changed slightly, would most affect the output. For an autonomous car, you hope to see high saliency on lane markings and obstacles — not on clouds or billboards.

The gradient tells you the direction of steepest ascent. The sign matters: a positive gradient on x_i means increasing x_i increases the output; negative means the opposite. The magnitude tells you how steeply.

The saturation problem. Gradients are local — they describe the function in an infinitesimal neighborhood of x. If the policy uses a ReLU or sigmoid that is saturated (flat) at the current input, the gradient is zero even though the feature clearly matters. Imagine a step function that switches from "brake" to "accelerate" at distance = 5m. At distance = 3m, the gradient is zero (the function is flat in the braking region), but distance is obviously critical. This is why pure gradient saliency can be misleading.

Two common refinements:

Variant	Formula	Why
Gradient × Input	x_i · ∂π/∂x_i	Scales by feature magnitude; zero input → zero attribution
SmoothGrad	Average saliency over noisy copies of x	Reduces noise in gradient estimates

Both help, but neither solves the saturation problem fundamentally. For that, we need the path-based method in the next chapter.

Check: Why can gradient saliency give zero attribution to a feature that clearly matters?

If the function is locally flat (saturated) at the current input, the gradient is zero Gradients always underestimate importance Neural networks have too many parameters for accurate gradients

Chapter 4: Integrated Gradients

Gradients at a point are like reading the slope of a mountain at your current location. Useful, but it misses the full picture. If you walked from sea level (a baseline) to your current position, the total altitude gain is what matters — not just the local slope.

Integrated gradients formalize this idea. Instead of computing the gradient at one point, they integrate the gradient along a straight-line path from a baseline input x' (typically all zeros or a black image) to the actual input x:

IG_i(x) = (x_i − x'_i) × ∫₀¹ (∂π / ∂x_i)(x' + α(x − x')) dα

Let's unpack this. The integral walks along the path from x' to x, parameterized by α ∈ [0, 1]. At each point on the path, it samples the gradient with respect to feature i. Then it multiplies by (x_i − x'_i) — the total change in that feature from baseline to actual input.

Why this works: The integral captures the cumulative effect of each feature as we interpolate from "nothing" to the actual input. If feature i causes the output to rise somewhere along the path (even if the gradient is zero at the endpoint due to saturation), the integrated gradient picks it up. This solves the saturation problem from Chapter 3.

In practice, the integral is approximated by summing over m equally-spaced points along the path:

IG_i(x) ≈ (x_i − x'_i) × (1/m) ∑_k=1^m (∂π / ∂x_i)(x' + (k/m)(x − x'))

python
def integrated_gradients(model, x, baseline, m=50):
    # x, baseline: input tensors of same shape
    alphas = torch.linspace(0, 1, m+1)[1:]  # skip 0
    grads = []
    for alpha in alphas:
        interp = baseline + alpha * (x - baseline)
        interp.requires_grad_(True)
        out = model(interp)
        out.backward()
        grads.append(interp.grad.clone())
    avg_grad = torch.stack(grads).mean(dim=0)
    return (x - baseline) * avg_grad

Integrated gradients satisfy two important axioms:

Axiom	What it means
Completeness	∑_i IG_i(x) = π(x) − π(x'). The attributions add up to the total output change.
Sensitivity	If changing one feature changes the output, its attribution is non-zero.

Completeness is remarkable: it means the attributions are a complete accounting of why the output is what it is, relative to the baseline. Nothing is left unaccounted for.

Baseline choice matters. The baseline x' represents "the absence of input." For images, a black image (all zeros) is common. For tabular data, using the dataset mean or zeros depends on the domain. An inappropriate baseline can produce misleading attributions — always justify your choice.

Check: What problem does integrating gradients along a path solve, compared to computing gradients at the input alone?

It captures feature effects even when the function is locally flat (saturated) at the input It runs faster than a single gradient computation It eliminates the need for a neural network

Chapter 5: Shapley Values

Imagine three engineers — Alice, Bob, and Carol — working on a project. The project earns $100. How much credit does each person deserve?

If Alice alone produces $30, Bob alone $40, and Carol alone $10 — but Alice and Bob together produce $90 (they have synergy!) — a simple "divide by contribution" approach breaks down. Alice's value depends on whether Bob is also on the team.

Lloyd Shapley solved this in 1953 with an elegant idea from cooperative game theory. For each player, consider every possible subset of the other players. Add the player to each subset. The marginal contribution is how much the function value changes. The Shapley value is the average marginal contribution across all subsets.

φ_i = ∑_{S ⊆ N\{i}} [ |S|! (n − |S| − 1)! / n! ] × [ f(S ∪ {i}) − f(S) ]

Let's decode this. N is the set of all n features. S is a subset that does not include feature i. The weighting factor |S|!(n − |S| − 1)!/n! ensures we average uniformly over all possible orderings in which feature i could join the coalition. f(S) is the function value when only features in S are "present" (others set to baseline).

Shapley values are the unique attribution method that satisfies four fairness axioms: (1) Efficiency: attributions sum to f(x) − f(baseline). (2) Symmetry: features with identical contributions get equal credit. (3) Null player: a feature that never changes the output gets zero. (4) Linearity: attributions for a sum of functions equal the sum of attributions.

Let's work through a concrete example with n = 3 features. The function is:

f(x₁, x₂, x₃) = 4x₁ + 2x₂ + x₃ + 3x₁x₂

The interaction term 3x₁x₂ means features 1 and 2 have synergy. Their combined contribution exceeds the sum of their individual contributions.

Shapley Value Calculator

Function: f(x₁,x₂,x₃) = 4x₁ + 2x₂ + x₃ + 3x₁x₂. Baseline = (0,0,0). Adjust inputs and watch all 2³ = 8 subsets, marginal contributions, and final Shapley values.

x₁2.0

x₂1.0

x₃1.0

Showing summary

The computational bomb: For n features, exact Shapley values require evaluating f on all 2ⁿ subsets. With n = 3, that's 8 — easy. With n = 10, that's 1,024 — still manageable. With n = 100 (a modest neural network input), that's 2¹⁰⁰ ≈ 10³⁰. The age of the universe in nanoseconds is only about 10²⁶. Exact computation is impossible for real-world systems.

SHAP (SHapley Additive exPlanations) approximates Shapley values by sampling random subsets instead of enumerating all of them. Kernel SHAP fits a weighted linear model around the prediction. Tree SHAP exploits the structure of tree-based models for exact computation in polynomial time. These approximations make Shapley values practical for production use.

Shapley vs. integrated gradients: Both satisfy completeness (attributions sum to the output difference). Shapley values handle feature interactions correctly by averaging over all subsets. Integrated gradients follow one path (baseline → input) and can miss interactions. Shapley is more principled but far more expensive.

Check: Why is exact Shapley value computation infeasible for high-dimensional inputs?

The gradients become numerically unstable Shapley values are only defined for linear functions It requires evaluating the function on all 2ⁿ subsets, which is exponential

Chapter 6: Surrogate Models

What if, instead of explaining the neural network directly, we train a simpler model to mimic it — and explain that?

A surrogate model is a simple, interpretable model (typically linear regression) fitted to the neural network's predictions. The surrogate doesn't need to be perfect. It just needs to capture the important input-output relationships well enough that its coefficients are meaningful.

π̂(x) = w₀ + w₁x₁ + w₂x₂ + … + w_nx_n

The weights w_i are the surrogate's "explanation." A large positive w₃ means feature 3 is strongly positively associated with the output. The sign tells direction, the magnitude tells strength.

Global vs. local surrogates. A global surrogate is fitted to the entire input distribution — it summarizes the whole policy. A local surrogate (as in LIME) is fitted to a neighborhood around a single input — it explains one specific decision. Local surrogates can be more accurate because a linear model only needs to approximate the neural network in a small region, where it may be approximately linear.

To ensure the surrogate highlights only the most important features, use LASSO regularization (L1 penalty). LASSO drives small weights to exactly zero, producing a sparse model where only the features that truly matter have nonzero coefficients. This is a form of automatic feature selection.

min_w ∑_j (π(x_j) − π̂(x_j))² + λ ∑_i |w_i|

The parameter λ controls the sparsity-fidelity tradeoff. Large λ produces a very simple model (few nonzero weights) that may not fit well. Small λ produces an accurate but complex model that is harder to interpret. The best practice is to report the R² score alongside the surrogate, so the reader knows how well the simple model actually approximates the original policy.

Danger: surrogates can lie. If the surrogate's R² is 0.6, it only captures 60% of the variance. The remaining 40% may involve complex interactions that the linear model misses entirely. Always report fidelity. A surrogate with low fidelity is worse than no explanation — it gives false confidence.

Check: Why is LASSO regularization useful when building surrogate models for explainability?

It drives unimportant feature weights to exactly zero, producing a sparse, interpretable model It makes the surrogate model nonlinear It guarantees the surrogate perfectly matches the original policy

Chapter 7: Decision Trees

Linear surrogates give weights. Decision trees give rules. "If speed > 40 and distance_to_obstacle < 10, then brake. Else if heading_error > 15°, then turn. Else accelerate." A validation engineer can read this, check it against domain knowledge, and spot problems.

A decision tree surrogate is fitted to the neural network's predictions (not the true labels). The tree is grown by recursively splitting on the feature and threshold that best separates the data, then pruned to a desired depth.

The depth tradeoff: A shallow tree (depth 3) has at most 8 leaf nodes. It's easy to read but may not faithfully reproduce the neural network's behavior. A deep tree (depth 20) closely matches the network but is as opaque as the network itself. The sweet spot is the shallowest tree whose fidelity (agreement with the network) is acceptably high.

Decision Tree Builder

A neural network classifies 2D points into two classes. The tree surrogate approximates its decision boundary. Adjust depth to see the fidelity/interpretability tradeoff.

Max depth2

Fidelity: --

The canvas shows two things: the tree's decision regions (colored areas) and the neural network's decision boundary (dashed line). At depth 1, the tree makes a single vertical or horizontal cut — a crude approximation. At depth 3-4, the axis-aligned rectangles start to match the curved boundary reasonably well. At depth 6, the match is near-perfect, but the tree has up to 64 leaf nodes — no longer a simple explanation.

Depth	Max leaves	Typical fidelity	Interpretability
1	2	Low (∼60%)	Trivial to read
3	8	Medium (∼85%)	Easy to read
5	32	High (∼95%)	Takes effort
10	1,024	Very high (∼99%)	Unreadable

Trees vs. linear surrogates: Trees capture nonlinear decision boundaries (via axis-aligned splits) and interactions (via nested splits). Linear models cannot. But trees are discrete — they assign the same prediction to all points in a leaf — so they can miss smooth gradients in the policy.

Check: What happens to a decision tree surrogate as you increase its depth?

It becomes more interpretable but less accurate Fidelity to the original model increases, but interpretability decreases It always overfits and becomes useless

Chapter 8: Counterfactuals

A patient is denied a loan. They don't care about Shapley values or saliency maps. They want one thing: "What would I need to change to get approved?"

A counterfactual explanation answers exactly this question. It finds the smallest change to the input that flips the model's decision. "If your income were $5,000 higher, or your debt were $2,000 lower, you would be approved."

Formally, given input x with undesired outcome y, we seek x' such that the model predicts the desired outcome y', while minimizing a combination of four objectives:

min_x' λ₁·loss(π(x'), y') + λ₂·dist(x, x') + λ₃·sparsity(x − x') + λ₄·implausibility(x')

Objective	What it means	Why it matters
Outcome	π(x') = y'	The counterfactual must actually change the decision
Closeness	dist(x, x') is small	The change should be minimal
Sparsity	Few features change	Humans prefer "change one thing" explanations
Plausibility	x' is realistic	"Set your age to −5" is not helpful

Plausibility is the hard part. Without it, the optimizer might find technically valid but absurd counterfactuals: "If you were 20 years younger and your income quadrupled..." Plausibility constraints typically require that x' lies within the data distribution, enforced by a density model or by restricting features to their observed ranges.

Counterfactuals are fundamentally different from the other methods in this chapter. Saliency, gradients, and Shapley values are attributive — they decompose the current decision. Counterfactuals are contrastive — they show what could be different. Both are valuable, but they answer different questions.

Multiple counterfactuals: There may be many valid counterfactuals. "Increase income by $5K" and "reduce debt by $2K" might both flip the decision. Presenting a diverse set of counterfactuals (varying which features change) gives the user more actionable options.

Check: What are the four objectives balanced when generating a counterfactual explanation?

Speed, accuracy, interpretability, novelty Outcome (flip the decision), closeness, sparsity, plausibility Gradient magnitude, baseline choice, path length, noise level

Chapter 9: Failure Mode Clustering

The methods so far explain individual decisions. But a validation engineer often needs to understand failure patterns: "Why does the system fail, and are there distinct failure types?"

Suppose you run 10,000 simulations and 200 result in failure. Looking at each one individually is tedious. Instead, you can cluster the failure trajectories to discover groups of similar failures. Each cluster represents a distinct failure mode.

The process has three steps:

1. Featurize

Extract summary features from each failure trajectory

↓

2. Cluster

Group similar failures (k-means, DBSCAN, etc.)

↓

3. Interpret

Examine each cluster's centroid to understand the failure type

The key question is: what features do you extract? Simple statistics (max speed, min distance to obstacle, duration) work for basic analysis. But for temporal patterns ("the car swerved, then overcorrected, then went off-road"), you need features that capture sequence structure. This is where temporal logic comes in (next chapter).

k-means in feature space: After extracting m features from each failure trajectory, you have 200 points in R^m. k-means partitions them into k clusters by minimizing within-cluster variance. The cluster centroids are "prototypical failures" — average feature vectors that summarize each failure mode. If cluster 1 has high-speed + short-distance features and cluster 2 has low-visibility + wrong-lane features, you've discovered two distinct failure types.

Choosing k (the number of clusters) matters. Too few clusters merge distinct failure modes. Too many create artificial distinctions. The elbow method plots total within-cluster variance against k and looks for the "elbow" where adding more clusters gives diminishing returns. Alternatively, silhouette scores measure how well-separated the clusters are.

Actionability: Failure clustering is only useful if each cluster leads to a concrete engineering action. "Cluster 1: failures in rain with wet roads → retrain on wet-road data. Cluster 2: failures at dusk with low sun angle → add sun-glare robustness." If the clusters are not interpretable, the features need refinement.

Check: What is the purpose of clustering failure trajectories rather than examining each one individually?

To discover distinct failure modes — groups of failures with similar causes To reduce the total number of failures To prove the system is safe

Chapter 10: Temporal Logic Features

A failure trajectory is not just a point — it's a sequence of states over time. Simple features like "max speed" throw away the temporal structure. The car might have been going fast at the beginning and slow at the end, or vice versa. These are different failure patterns that "max speed" merges together.

Signal Temporal Logic (STL) provides a language for describing temporal properties of trajectories. A few examples:

STL formula	English
G_[0,T](speed < 30)	Speed is always below 30 during [0, T]
F_[0,5](distance < 2)	At some point in the first 5 seconds, distance drops below 2
G_[0,T](distance > 0.5) → F_[0,T](reached_goal)	If we always stay safe, then we eventually reach the goal

Here G means "globally" (always true in the time interval) and F means "finally" (true at some point in the interval). These are the same operators from Chapter 3's property specification.

Parametric STL (PSTL) extends this by making the thresholds into parameters. Instead of "speed < 30", you write "speed < c" where c is a free parameter. Given a trajectory, you can compute the tightest parameter that makes the formula true. For a trajectory where max speed is 42, the tightest c for G(speed < c) is c = 42.

PSTL as feature extraction: Take a library of PSTL templates (speed bounds, distance bounds, timing constraints). For each failure trajectory, compute the tightest parameters. These parameters become the features for clustering. Now your clusters are described not by abstract numbers but by temporal logic formulas: "Cluster 1: speed exceeds 35 AND distance drops below 1.5 within the first 3 seconds."

The robustness of an STL formula measures how much margin there is. A robustness of +5 for "speed < 30" means the speed stayed 5 units below 30. A robustness of −3 means it exceeded 30 by 3 units. Robustness values can also serve as continuous features for clustering — more informative than binary "satisfied/violated."

Closing the loop: PSTL-based clustering produces interpretable failure modes described in temporal logic. These descriptions can be directly converted into test specifications: "Generate more scenarios where speed exceeds 35 and distance drops below 1.5 within 3 seconds." This connects explainability back to falsification (Chapters 4-5), closing the validation loop.

"If you can't explain it simply, you don't understand it well enough."
— attributed to Richard Feynman

Check: What advantage do PSTL features offer over simple statistics (max speed, min distance) when clustering failure trajectories?

They capture temporal patterns (ordering, timing) that simple statistics discard They are faster to compute They eliminate the need for clustering entirely