Kochenderfer et al., Chapter 5

Falsification through Planning

From brute-force optimization to intelligent tree search — breaking autonomous systems by thinking ahead.

Prerequisites: Chapter 4 (Falsification through Optimization). That's it.
10
Chapters
4+
Simulations
10
Quizzes

Chapter 0: Why Planning?

In Chapter 4 we attacked validation as a black-box optimization problem: choose a full trajectory of disturbances, simulate, check for failure, repeat. CMA-ES and cross-entropy methods searched over the entire disturbance sequence at once. For a 100-timestep scenario with 3 disturbance dimensions per step, that is a 300-dimensional search space. It worked, but it was blunt — every candidate was a complete trajectory, and we had no way to reuse partial successes.

Think of it this way. A chess player who evaluates only complete games — playing all moves at random, then checking who won — would need astronomical luck to stumble on a winning strategy. Real chess players think sequentially: "if I move here, they might respond there, and then I could..." This is planning.

Planning-based falsification decomposes the disturbance trajectory into a sequence of decisions. At each timestep, we choose a disturbance, observe the system's response, and then choose the next disturbance. This lets us exploit the Markov structure: the next disturbance only depends on the current state, not the entire history.

The core shift: Optimization asks "what is the best complete trajectory?" Planning asks "what is the best next step from here?" By thinking one step at a time and building a search tree, we can explore far more efficiently — reusing partial paths, pruning dead ends, and focusing search on promising regions of the disturbance space.
Ch 4: Optimization
Optimize entire trajectory x1:T at once — 300+ dimensions
↓ decompose into sequential decisions
Ch 5: Planning
Choose xt given current state st — 3 dimensions at a time
↓ builds a search tree
Tree of Possibilities
Each node is a state, each edge is a disturbance choice. Failures are leaves.

The methods in this chapter span a wide range. Shooting methods bridge the gap from optimization by parameterizing short segments. Tree search methods like RRT grow trees of reachable states. Heuristic search uses cost functions to guide toward failures. MCTS adds statistical intelligence. And reinforcement learning trains an adversarial policy that generalizes across scenarios.

Each method requires different things from the simulator. Some need only forward simulation. Others need gradients. Some need the ability to set the simulator to an arbitrary state. Understanding these requirements is key to choosing the right tool.

MethodKey ideaSimulator needs
ShootingOptimize segments, penalize gapsForward sim
RRTRandomly grow a tree of statesForward sim + set state
Heuristic searchCost-guided expansionForward sim + set state
MCTSBalanced explore/exploit treeForward sim + set state
RL adversaryLearned adversarial policyForward sim (online)
Check: Why does planning-based falsification have an advantage over optimizing entire trajectories at once?

Chapter 1: Shooting Methods

Before we jump into tree search, there is an elegant bridge between Chapter 4's optimization and this chapter's planning. Shooting methods divide the trajectory into segments and optimize each segment separately, then stitch them together.

In single shooting, you parameterize the entire disturbance sequence x1:T and simulate forward from the initial state. This is exactly what Chapter 4 did. The problem: small changes to early disturbances cascade through the entire trajectory, making the optimization landscape chaotic.

Multiple shooting fixes this by breaking the trajectory into K segments. You choose "knot points" — intermediate states s1, s2, ..., sK-1 that the trajectory must pass through. Each segment is optimized independently: given sk, find disturbances that reach sk+1.

Key insight: Multiple shooting turns one long, chaotic optimization into K short, well-behaved ones. The catch: the segments might not connect. The state at the end of segment k might not match sk+1. We call this gap the defect.

The full multiple shooting objective balances two goals: find a failure trajectory, and make the segments connect. Mathematically:

minimize   f(x1:T) + λ ∑k || dk ||2

Here f is the falsification objective (e.g., robustness metric — how close to failure), dk is the defect at knot k (the gap between the end of segment k and the start of segment k+1), and λ controls the tradeoff. Large λ forces continuity at the expense of objective quality; small λ allows gaps in exchange for lower objective values.

Single vs. Multiple Shooting

Watch how single shooting propagates a full trajectory (top) while multiple shooting optimizes segments independently (bottom). Toggle between them. The red dots are knot points; purple gaps are defects.

Single shooting
λ (defect penalty)1.0
PropertySingle ShootingMultiple Shooting
Optimization variablesx1:T onlyx1:T + knot states
Sensitivity to early stepsHigh (cascading)Low (per segment)
ParallelizableNoYes (each segment)
Extra costNoneDefect penalties
When to use shooting: Shooting methods are best when you can compute gradients through the simulator (differentiable simulation). Without gradients, the high-dimensional optimization over x1:T is impractical even with multiple shooting. For non-differentiable simulators, tree search methods (next chapter) are the way to go.
Check: In multiple shooting, what is a "defect"?

Chapter 2: Tree Search Framework

Forget continuous optimization for a moment. Imagine you can set the simulator to any state, apply a disturbance, and see where it goes next. Now you can build a search tree: the root is the initial state, branches are disturbance choices, and each node is a state the system could reach.

Every tree search method follows the same two-step loop:

1. Select
Pick a node in the existing tree to expand
2. Extend
Choose a disturbance, simulate one step, add the child node
↻ repeat until failure found or budget exhausted

The magic is in how you implement Select and Extend. Different choices yield fundamentally different algorithms:

AlgorithmSelect strategyExtend strategy
RRTNearest to random goalToward goal, one step
A* / cost searchLowest cost nodeAll possible disturbances
MCTSLCB-guided traversalProgressive widening

Each node in the tree stores a state st. Each edge stores the disturbance xt that caused the transition. A path from root to any node is a partial trajectory. If a leaf node's state satisfies the failure condition, we have found a counterexample.

The key requirement: Tree search needs a resettable simulator. You must be able to set the system to state sk, apply disturbance xk, and get state sk+1. This is stronger than just "forward simulate from the initial state." Not all simulators support this, but many do — especially those built for testing.

Why is a tree better than random sampling? Because it remembers partial progress. If you found a disturbance sequence that gets the system 80% of the way to failure, the tree preserves that path. The next iteration extends it, rather than starting from scratch. Random trajectory sampling (Chapter 4) throws away all intermediate states.

Tree vs. graph: Technically, the same state might appear at multiple nodes (via different disturbance paths). True search algorithms would merge these into a graph. Most falsification methods keep the tree structure for simplicity, accepting some redundancy in exchange for cleaner implementation and guarantees.
Check: What simulator capability does tree search require beyond simple forward simulation?

Chapter 3: RRT Basics

The Rapidly-exploring Random Tree (RRT) is the simplest tree search method. Originally designed for robot motion planning, it has a beautiful property: it naturally expands toward unexplored regions of state space.

Here is the algorithm. Each iteration:

1. Sample a goal
Pick a random point g in the state space
2. Find nearest node
Find the tree node n closest to g (Euclidean distance)
3. Extend toward goal
From n, pick the disturbance that moves closest to g. Simulate one step. Add the new state as a child of n.
↻ repeat

Why does this explore well? Random goals are uniformly distributed. Nodes in sparse, unexplored regions have large Voronoi cells — meaning random goals are more likely to be closest to them. So the algorithm naturally extends toward under-explored areas. This is the key theoretical insight behind RRT.

The Voronoi bias: Think of each tree node as "owning" the region of state space closer to it than to any other node. Nodes in sparse areas own larger regions. A uniform random goal is more likely to land in a large region. So sparse nodes get selected more often, driving the tree outward into unexplored territory.

For falsification, there is a problem. Standard RRT samples goals uniformly in state space. But we don't want uniform coverage — we want to reach the failure region. A uniform sampler might spend most of its effort exploring states that are far from any failure, while ignoring the dangerous corner of state space.

Even worse, the disturbance at each step is chosen to move the state toward the goal. But moving the state isn't the same as choosing a disturbance. In many systems, the nominal controller fights against disturbances, pulling the state back toward safety. The RRT extends one step, the controller corrects, and the tree barely budges. The tree follows the nominal trajectory too closely.

The problem with basic RRT for falsification: If the system has a strong nominal controller, RRT's one-step extensions get "sucked back" to the nominal path. The tree looks like a slightly fuzzy version of the undisturbed trajectory, never reaching the failure region. We need smarter goal sampling and extension strategies — which is exactly what the next chapter addresses.
RRT: Watch the Tree Grow

Uniform goal sampling in a 2D state space. The failure region (red zone) is in the upper right. Watch how the tree explores outward but may struggle to reach the failure region. Click Step to add one node.

Nodes: 1
Check: Why does a standard RRT naturally explore toward sparse regions?

Chapter 4: Goal Heuristics

Standard RRT's uniform sampling wastes effort exploring safe regions. The fix is simple in concept: bias the goal sampling toward the failure region. Instead of picking random points in the full state space, sample points from or near the failure set.

There are several strategies, each more aggressive than the last:

StrategyHow it worksRisk
Goal biasedWith probability p, sample from failure region; else uniformMay miss paths through narrow corridors
Failure boundarySample on the boundary of the failure set (robustness = 0)Requires knowing the boundary
Gradient guidedUse ∇robustness to pick disturbances that decrease robustnessRequires differentiable robustness metric

The most practical approach combines goal biasing with a smarter extend step. When extending from node n toward goal g, don't just pick the disturbance that moves the state closest to g. Instead, pick the disturbance that maximally decreases the robustness metric — the quantitative measure of how far the system is from failure.

Key insight: The extend step is where the real leverage is. Even with perfect goal sampling, if the controller fights the disturbance and pulls the state back, we make no progress. Choosing disturbances that attack the robustness metric directly is far more effective than choosing disturbances that minimize Euclidean distance to a target state.
Goal-Biased RRT

Compare uniform vs. goal-biased sampling. The slider controls the bias probability p. Higher p means more samples from the failure region (red zone). Watch how the tree shape changes.

Nodes: 1
Goal bias p0.50

The effect is dramatic. With p = 0 (uniform sampling), the tree spreads everywhere like spilled water. With p = 0.7, it grows aggressively toward the failure region like a root seeking water. The tradeoff: too much bias can cause the tree to get stuck in local minima, repeatedly trying the same narrow approach to the failure region.

Practical balance: A goal bias of p = 0.3 to 0.5 is typical. This spends enough effort reaching toward failure while maintaining enough diversity to find indirect paths. Some implementations adaptively increase p over time as the tree matures.
Check: What is the main risk of using a very high goal bias (e.g., p = 0.95)?

Chapter 5: Coverage Metrics

Goal-biased RRT aims for one target. But what if there are many failure modes, and we want to find all of them? For this we need the tree to cover the state space thoroughly — not cluster around one region. How do we measure whether a set of explored states provides good coverage?

Three metrics quantify this:

Dispersion is the radius of the largest empty ball. Take the state space, find the point that is farthest from any explored state. That maximum distance is the dispersion. Low dispersion means no large unexplored gaps.

dispersion(S) = maxs ∈ X   mins' ∈ S   ||s − s'||

Average dispersion replaces the worst case with the average case. Instead of the maximum empty ball, it measures the average distance from a random point to its nearest explored state. This is less sensitive to outliers.

avg_dispersion(S) = Es ~ Uniform(X) [ mins' ∈ S ||s − s'|| ]

Discrepancy measures how uniform the distribution of explored states is. It compares the fraction of explored states in any rectangular sub-region to the fraction of volume that region occupies. High discrepancy means some regions are over-represented and others under-represented.

Why coverage matters for validation: Finding one failure is useful, but it doesn't tell you whether there are others. A tree with low dispersion guarantees that if a failure mode exists in a ball of radius r, and dispersion < r, at least one explored state is within distance r of it. This gives a probabilistic completeness guarantee: as dispersion goes to zero, the probability of missing any failure mode goes to zero.
Dispersion: Finding the Largest Gap

Click anywhere to add points. The blue circle shows the largest empty ball (dispersion). Adding points in sparse regions reduces dispersion most efficiently.

Points: 0 | Dispersion: —
MetricMeasuresGood for
DispersionWorst-case gapGuaranteeing no large blind spots
Avg. dispersionTypical gapOverall coverage quality
DiscrepancyDistribution uniformityDetecting over/under-sampling
Low-dispersion sequences: If you want coverage without tree search, you can use quasi-random sequences (Halton, Sobol) which have provably low discrepancy. These sample trajectories more uniformly than pure random sampling. The trade-off: they don't adapt to the system's dynamics, so they can't exploit structure the way tree search can.
Check: What does low dispersion guarantee about a set of explored states?

Chapter 6: Cost-Based Search

RRT grows trees randomly. Can we be smarter? Cost-based search assigns a cost to each node and always extends the cheapest one. This is the A* principle applied to falsification.

What should the cost function capture? We want to find the most likely failure path — the failure trajectory that has the highest probability under the disturbance distribution. Why the most likely? Because it represents the failure that is most likely to occur in reality. Finding unlikely failures is easy (stack extreme disturbances); finding plausible failures is the real challenge.

If disturbances xt are drawn from a known distribution p(xt), the probability of a trajectory is:

p(τ) = ∏t=1T p(xt)

Maximizing this probability is equivalent to minimizing the negative log probability:

cost(τ) = −log p(τ) = ∑t=1T −log p(xt)

This is a sum of per-step costs, which makes it compatible with tree search. The cost of reaching a node is the sum of edge costs along the path from the root. Each edge's cost is −log p(xt) — the surprisal of the disturbance. Nominal (likely) disturbances have low cost; extreme (unlikely) disturbances have high cost.

The log-likelihood cost: Minimizing −log p(τ) among failure trajectories finds the most likely failure. This is equivalent to finding the path through disturbance space that crosses into the failure region with the least total "surprise." Each step, we prefer disturbances that the environment model says are plausible.

With this cost function, A*-style search expands nodes in order of increasing cost. Nodes reached via likely disturbance sequences get explored first. If the cheapest node that reaches the failure region has cost c*, then any failure with cost < c* has already been found or cannot exist.

Key insight: The cost function decomposes over timesteps, so tree search naturally accumulates path cost as it extends nodes. This is exactly the structure A* exploits. A good heuristic (lower bound on remaining cost to failure) can dramatically speed up the search.
pseudocode
# Cost-based falsification search
Initialize tree with root at s0, cost = 0
priority_queue = {root}

while priority_queue not empty:
  node ← pop lowest-cost node
  if is_failure(node.state):
    return path to node  # most likely failure!
  for each disturbance x:
    s' ← simulate(node.state, x)
    edge_cost ← −log p(x)
    child_cost ← node.cost + edge_cost
    add child(s', child_cost) to tree and queue
Check: Why do we use −log p(xt) as the edge cost, rather than raw probability?

Chapter 7: Monte Carlo Tree Search

RRT grows one node at a time. Cost-based search needs to enumerate all possible disturbances at each node. Monte Carlo Tree Search (MCTS) finds a middle ground: it selectively expands the most promising parts of the tree while maintaining statistical confidence about unexplored regions.

MCTS was originally designed for game-playing (it powered AlphaGo). For falsification, we repurpose it: instead of winning a game, we're trying to find disturbance sequences that cause failures. The "opponent" is the system under test.

The core idea involves four phases per iteration:

1. Select
Traverse tree from root, using LCB to pick children
2. Expand
Add a new child via progressive widening
3. Rollout
Simulate random disturbances to end of trajectory, measure robustness
4. Backpropagate
Update visit counts and Q-values up the tree

Progressive widening controls how many children each node has. A node with N visits is allowed at most k · Nα children (typically α ≈ 0.5, k ≈ 1). This means frequently-visited nodes grow more children over time, while rarely-visited nodes keep few. It prevents the tree from branching too broadly at nodes we haven't explored enough.

The selection step uses the Lower Confidence Bound (LCB) to balance exploration and exploitation:

LCB(s, x) = Q(s, x) − c · √( ln N(s) / N(s, x) )

Here Q(s, x) is the average robustness estimate for choosing disturbance x in state s (lower is better — closer to failure). N(s) is the total visits to state s. N(s, x) is visits to the (s, x) pair. The exploration term c · √(ln N / Nchild) is large for rarely-visited children, encouraging exploration. The constant c controls the tradeoff.

LCB, not UCB: In games, MCTS uses the Upper Confidence Bound because the player wants to maximize reward. For falsification, we want to minimize robustness (find failures). So we use the Lower Confidence Bound — we're optimistic about the possibility that an unexplored branch might lead to a lower robustness value (i.e., a failure).
MCTS Falsification Tree

Each node shows visit count N and average Q-value. Edges show the LCB score. The highlighted path is the one currently selected by LCB. Adjust the exploration constant c to see exploration vs. exploitation. Click Step to run one MCTS iteration.

Iterations: 0
Exploration c1.0

Watch what happens as you adjust c. With c = 0.1, the tree dives deep along the most promising path — pure exploitation. It finds failures quickly if the first path is good, but misses alternatives. With c = 3.0, it spreads widely, visiting every child roughly equally — pure exploration. It finds diverse failures but takes longer. The sweet spot (c ≈ 1.0) balances both.

Why MCTS beats cost-based search: Cost search enumerates all disturbances at each node. In continuous disturbance spaces, this is impossible. MCTS uses progressive widening to try a few disturbances per node initially, adding more only as confidence demands. This makes it practical for continuous and high-dimensional disturbance spaces where cost-based search cannot be applied directly.

Exploitation (low c):

• Dives deep along best path
• Fast to first failure
• May miss failure modes
• Risk of getting stuck

Exploration (high c):

• Broad, shallow tree
• Finds diverse failures
• Slower convergence
• Better coverage

Check: In MCTS for falsification, why do we use LCB (Lower Confidence Bound) rather than UCB?

Chapter 8: RL as Adversary

Tree search methods build a new tree for each initial state. If the initial conditions change, you start over. What if we could learn a policy that generalizes — an adversarial agent that knows how to attack the system from any state?

This is the reinforcement learning approach to falsification. We train an adversarial RL agent whose "environment" is the system under test. The agent chooses disturbances. The reward encourages failures.

Adversarial Agent
Observes state st, outputs disturbance xt
↓ disturbance
System Under Test
Controller + plant, transitions to st+1
↓ new state + reward
Reward Signal
rt = −Δrobustness(st, st+1)
↻ loop

The reward is typically the change in robustness: if the disturbance makes the system less robust (closer to failure), the adversary gets a positive reward. Formally: rt = ρ(st) − ρ(st+1), where ρ is the robustness metric. Disturbances that decrease robustness are rewarded; disturbances that the controller easily compensates for are penalized.

Reward = robustness decrease. This is the key design choice. We don't give a big reward only at failure (that would be too sparse). Instead, every step that nudges the system toward failure gets a small reward. This shapes the RL agent to systematically chip away at the system's safety margin, step by step.

Standard RL algorithms (PPO, SAC) can be applied directly. The adversary's policy πθ(x|s) is a neural network that maps system states to disturbance distributions. After training on many episodes, the policy generalizes: it can attack the system from any initial state, not just the ones it trained on.

The generalization advantage: Tree search solves one scenario at a time. RL trains a policy across thousands of scenarios. Once trained, the policy instantly generates adversarial disturbances for new initial conditions — no search needed. This makes RL-based falsification dramatically faster at test time, at the cost of expensive upfront training.
PropertyTree Search (MCTS/RRT)RL Adversary
Per-scenario costFull search each timeOne forward pass
Upfront costNoneExpensive training
GeneralizationNone — one tree per scenarioAcross initial states
InterpretabilityFull tree is visibleBlack-box policy
Optimality guaranteeAsymptotic (MCTS)Approximate

In practice, RL adversaries are often combined with tree search: the RL policy provides a warm start or rollout policy for MCTS. The tree search refines the RL policy's suggestions, while the RL policy provides better-than-random rollouts. This hybrid approach often outperforms either method alone.

Check: What is the main advantage of an RL adversary over tree search methods?

Chapter 9: Simulator Requirements

We have seen five families of planning-based falsification methods. Each requires different capabilities from the simulator. Choosing the right method starts with understanding what your simulator can and cannot do.

CapabilityShootingRRTCost SearchMCTSRL
Forward sim
Set state
Gradientspreferred
Fast rolloutspreferred
Many episodes

The critical distinction is resettability. Can you set the simulator to an arbitrary state s and simulate from there? If yes, tree search methods (RRT, cost search, MCTS) are all available. If no, you're limited to shooting (with gradients) or RL (which only needs forward simulation from initial states).

The hierarchy of simulator capabilities:
Level 0: Forward simulation only → RL, shooting
Level 1: + Set state → + RRT, cost search, MCTS
Level 2: + Gradients → + Efficient shooting, gradient-guided RRT
Level 3: + Likelihood model → + Most-likely failure search

Real-world simulators vary enormously. A simple Python model of an inverted pendulum gives you everything: fast forward sim, state setting, automatic differentiation for gradients, and a known disturbance distribution. A complex autonomous driving simulator (CARLA, for example) typically provides forward simulation and state setting but no gradients and slow rollouts. A physical hardware-in-the-loop test bench provides only forward simulation — no state setting at all.

The practical decision tree: Can you set state? If yes, how fast are rollouts? Fast rollouts → MCTS. Slow rollouts → cost-based search or RRT. Can't set state? Have gradients → shooting. No gradients → RL. Need generalization across many scenarios? RL regardless. Need interpretable counterexamples? Tree search regardless.

Strengths of Planning:

• Reuses partial successes
• Exploits Markov structure
• Finds failure paths efficiently
• MCTS has asymptotic guarantees

Limitations of Planning:

• Requires resettable sim (mostly)
• Finds one failure at a time
• Doesn't characterize failure distribution
• Chapter 6 addresses this gap

What comes next: Planning finds individual failures efficiently. But knowing that the system can fail is only the beginning. Chapter 6 asks: what does the full distribution over failure trajectories look like? Which failures are most likely? Are there multiple distinct failure modes? This requires sampling from the failure distribution — a fundamentally different problem that calls for MCMC and probabilistic programming rather than tree search.

"The best way to break a system is to think like it thinks,
and then think one step beyond."
— paraphrased from validation folklore
Check: Which falsification method requires the fewest simulator capabilities?