From brute-force optimization to intelligent tree search — breaking autonomous systems by thinking ahead.
In Chapter 4 we attacked validation as a black-box optimization problem: choose a full trajectory of disturbances, simulate, check for failure, repeat. CMA-ES and cross-entropy methods searched over the entire disturbance sequence at once. For a 100-timestep scenario with 3 disturbance dimensions per step, that is a 300-dimensional search space. It worked, but it was blunt — every candidate was a complete trajectory, and we had no way to reuse partial successes.
Think of it this way. A chess player who evaluates only complete games — playing all moves at random, then checking who won — would need astronomical luck to stumble on a winning strategy. Real chess players think sequentially: "if I move here, they might respond there, and then I could..." This is planning.
Planning-based falsification decomposes the disturbance trajectory into a sequence of decisions. At each timestep, we choose a disturbance, observe the system's response, and then choose the next disturbance. This lets us exploit the Markov structure: the next disturbance only depends on the current state, not the entire history.
The methods in this chapter span a wide range. Shooting methods bridge the gap from optimization by parameterizing short segments. Tree search methods like RRT grow trees of reachable states. Heuristic search uses cost functions to guide toward failures. MCTS adds statistical intelligence. And reinforcement learning trains an adversarial policy that generalizes across scenarios.
Each method requires different things from the simulator. Some need only forward simulation. Others need gradients. Some need the ability to set the simulator to an arbitrary state. Understanding these requirements is key to choosing the right tool.
| Method | Key idea | Simulator needs |
|---|---|---|
| Shooting | Optimize segments, penalize gaps | Forward sim |
| RRT | Randomly grow a tree of states | Forward sim + set state |
| Heuristic search | Cost-guided expansion | Forward sim + set state |
| MCTS | Balanced explore/exploit tree | Forward sim + set state |
| RL adversary | Learned adversarial policy | Forward sim (online) |
Before we jump into tree search, there is an elegant bridge between Chapter 4's optimization and this chapter's planning. Shooting methods divide the trajectory into segments and optimize each segment separately, then stitch them together.
In single shooting, you parameterize the entire disturbance sequence x1:T and simulate forward from the initial state. This is exactly what Chapter 4 did. The problem: small changes to early disturbances cascade through the entire trajectory, making the optimization landscape chaotic.
Multiple shooting fixes this by breaking the trajectory into K segments. You choose "knot points" — intermediate states s1, s2, ..., sK-1 that the trajectory must pass through. Each segment is optimized independently: given sk, find disturbances that reach sk+1.
The full multiple shooting objective balances two goals: find a failure trajectory, and make the segments connect. Mathematically:
Here f is the falsification objective (e.g., robustness metric — how close to failure), dk is the defect at knot k (the gap between the end of segment k and the start of segment k+1), and λ controls the tradeoff. Large λ forces continuity at the expense of objective quality; small λ allows gaps in exchange for lower objective values.
Watch how single shooting propagates a full trajectory (top) while multiple shooting optimizes segments independently (bottom). Toggle between them. The red dots are knot points; purple gaps are defects.
| Property | Single Shooting | Multiple Shooting |
|---|---|---|
| Optimization variables | x1:T only | x1:T + knot states |
| Sensitivity to early steps | High (cascading) | Low (per segment) |
| Parallelizable | No | Yes (each segment) |
| Extra cost | None | Defect penalties |
Forget continuous optimization for a moment. Imagine you can set the simulator to any state, apply a disturbance, and see where it goes next. Now you can build a search tree: the root is the initial state, branches are disturbance choices, and each node is a state the system could reach.
Every tree search method follows the same two-step loop:
The magic is in how you implement Select and Extend. Different choices yield fundamentally different algorithms:
| Algorithm | Select strategy | Extend strategy |
|---|---|---|
| RRT | Nearest to random goal | Toward goal, one step |
| A* / cost search | Lowest cost node | All possible disturbances |
| MCTS | LCB-guided traversal | Progressive widening |
Each node in the tree stores a state st. Each edge stores the disturbance xt that caused the transition. A path from root to any node is a partial trajectory. If a leaf node's state satisfies the failure condition, we have found a counterexample.
Why is a tree better than random sampling? Because it remembers partial progress. If you found a disturbance sequence that gets the system 80% of the way to failure, the tree preserves that path. The next iteration extends it, rather than starting from scratch. Random trajectory sampling (Chapter 4) throws away all intermediate states.
The Rapidly-exploring Random Tree (RRT) is the simplest tree search method. Originally designed for robot motion planning, it has a beautiful property: it naturally expands toward unexplored regions of state space.
Here is the algorithm. Each iteration:
Why does this explore well? Random goals are uniformly distributed. Nodes in sparse, unexplored regions have large Voronoi cells — meaning random goals are more likely to be closest to them. So the algorithm naturally extends toward under-explored areas. This is the key theoretical insight behind RRT.
For falsification, there is a problem. Standard RRT samples goals uniformly in state space. But we don't want uniform coverage — we want to reach the failure region. A uniform sampler might spend most of its effort exploring states that are far from any failure, while ignoring the dangerous corner of state space.
Even worse, the disturbance at each step is chosen to move the state toward the goal. But moving the state isn't the same as choosing a disturbance. In many systems, the nominal controller fights against disturbances, pulling the state back toward safety. The RRT extends one step, the controller corrects, and the tree barely budges. The tree follows the nominal trajectory too closely.
Uniform goal sampling in a 2D state space. The failure region (red zone) is in the upper right. Watch how the tree explores outward but may struggle to reach the failure region. Click Step to add one node.
Standard RRT's uniform sampling wastes effort exploring safe regions. The fix is simple in concept: bias the goal sampling toward the failure region. Instead of picking random points in the full state space, sample points from or near the failure set.
There are several strategies, each more aggressive than the last:
| Strategy | How it works | Risk |
|---|---|---|
| Goal biased | With probability p, sample from failure region; else uniform | May miss paths through narrow corridors |
| Failure boundary | Sample on the boundary of the failure set (robustness = 0) | Requires knowing the boundary |
| Gradient guided | Use ∇robustness to pick disturbances that decrease robustness | Requires differentiable robustness metric |
The most practical approach combines goal biasing with a smarter extend step. When extending from node n toward goal g, don't just pick the disturbance that moves the state closest to g. Instead, pick the disturbance that maximally decreases the robustness metric — the quantitative measure of how far the system is from failure.
Compare uniform vs. goal-biased sampling. The slider controls the bias probability p. Higher p means more samples from the failure region (red zone). Watch how the tree shape changes.
The effect is dramatic. With p = 0 (uniform sampling), the tree spreads everywhere like spilled water. With p = 0.7, it grows aggressively toward the failure region like a root seeking water. The tradeoff: too much bias can cause the tree to get stuck in local minima, repeatedly trying the same narrow approach to the failure region.
Goal-biased RRT aims for one target. But what if there are many failure modes, and we want to find all of them? For this we need the tree to cover the state space thoroughly — not cluster around one region. How do we measure whether a set of explored states provides good coverage?
Three metrics quantify this:
Dispersion is the radius of the largest empty ball. Take the state space, find the point that is farthest from any explored state. That maximum distance is the dispersion. Low dispersion means no large unexplored gaps.
Average dispersion replaces the worst case with the average case. Instead of the maximum empty ball, it measures the average distance from a random point to its nearest explored state. This is less sensitive to outliers.
Discrepancy measures how uniform the distribution of explored states is. It compares the fraction of explored states in any rectangular sub-region to the fraction of volume that region occupies. High discrepancy means some regions are over-represented and others under-represented.
Click anywhere to add points. The blue circle shows the largest empty ball (dispersion). Adding points in sparse regions reduces dispersion most efficiently.
| Metric | Measures | Good for |
|---|---|---|
| Dispersion | Worst-case gap | Guaranteeing no large blind spots |
| Avg. dispersion | Typical gap | Overall coverage quality |
| Discrepancy | Distribution uniformity | Detecting over/under-sampling |
RRT grows trees randomly. Can we be smarter? Cost-based search assigns a cost to each node and always extends the cheapest one. This is the A* principle applied to falsification.
What should the cost function capture? We want to find the most likely failure path — the failure trajectory that has the highest probability under the disturbance distribution. Why the most likely? Because it represents the failure that is most likely to occur in reality. Finding unlikely failures is easy (stack extreme disturbances); finding plausible failures is the real challenge.
If disturbances xt are drawn from a known distribution p(xt), the probability of a trajectory is:
Maximizing this probability is equivalent to minimizing the negative log probability:
This is a sum of per-step costs, which makes it compatible with tree search. The cost of reaching a node is the sum of edge costs along the path from the root. Each edge's cost is −log p(xt) — the surprisal of the disturbance. Nominal (likely) disturbances have low cost; extreme (unlikely) disturbances have high cost.
With this cost function, A*-style search expands nodes in order of increasing cost. Nodes reached via likely disturbance sequences get explored first. If the cheapest node that reaches the failure region has cost c*, then any failure with cost < c* has already been found or cannot exist.
pseudocode # Cost-based falsification search Initialize tree with root at s0, cost = 0 priority_queue = {root} while priority_queue not empty: node ← pop lowest-cost node if is_failure(node.state): return path to node # most likely failure! for each disturbance x: s' ← simulate(node.state, x) edge_cost ← −log p(x) child_cost ← node.cost + edge_cost add child(s', child_cost) to tree and queue
RRT grows one node at a time. Cost-based search needs to enumerate all possible disturbances at each node. Monte Carlo Tree Search (MCTS) finds a middle ground: it selectively expands the most promising parts of the tree while maintaining statistical confidence about unexplored regions.
MCTS was originally designed for game-playing (it powered AlphaGo). For falsification, we repurpose it: instead of winning a game, we're trying to find disturbance sequences that cause failures. The "opponent" is the system under test.
The core idea involves four phases per iteration:
Progressive widening controls how many children each node has. A node with N visits is allowed at most k · Nα children (typically α ≈ 0.5, k ≈ 1). This means frequently-visited nodes grow more children over time, while rarely-visited nodes keep few. It prevents the tree from branching too broadly at nodes we haven't explored enough.
The selection step uses the Lower Confidence Bound (LCB) to balance exploration and exploitation:
Here Q(s, x) is the average robustness estimate for choosing disturbance x in state s (lower is better — closer to failure). N(s) is the total visits to state s. N(s, x) is visits to the (s, x) pair. The exploration term c · √(ln N / Nchild) is large for rarely-visited children, encouraging exploration. The constant c controls the tradeoff.
Each node shows visit count N and average Q-value. Edges show the LCB score. The highlighted path is the one currently selected by LCB. Adjust the exploration constant c to see exploration vs. exploitation. Click Step to run one MCTS iteration.
Watch what happens as you adjust c. With c = 0.1, the tree dives deep along the most promising path — pure exploitation. It finds failures quickly if the first path is good, but misses alternatives. With c = 3.0, it spreads widely, visiting every child roughly equally — pure exploration. It finds diverse failures but takes longer. The sweet spot (c ≈ 1.0) balances both.
Exploitation (low c):
• Dives deep along best path
• Fast to first failure
• May miss failure modes
• Risk of getting stuck
Exploration (high c):
• Broad, shallow tree
• Finds diverse failures
• Slower convergence
• Better coverage
Tree search methods build a new tree for each initial state. If the initial conditions change, you start over. What if we could learn a policy that generalizes — an adversarial agent that knows how to attack the system from any state?
This is the reinforcement learning approach to falsification. We train an adversarial RL agent whose "environment" is the system under test. The agent chooses disturbances. The reward encourages failures.
The reward is typically the change in robustness: if the disturbance makes the system less robust (closer to failure), the adversary gets a positive reward. Formally: rt = ρ(st) − ρ(st+1), where ρ is the robustness metric. Disturbances that decrease robustness are rewarded; disturbances that the controller easily compensates for are penalized.
Standard RL algorithms (PPO, SAC) can be applied directly. The adversary's policy πθ(x|s) is a neural network that maps system states to disturbance distributions. After training on many episodes, the policy generalizes: it can attack the system from any initial state, not just the ones it trained on.
| Property | Tree Search (MCTS/RRT) | RL Adversary |
|---|---|---|
| Per-scenario cost | Full search each time | One forward pass |
| Upfront cost | None | Expensive training |
| Generalization | None — one tree per scenario | Across initial states |
| Interpretability | Full tree is visible | Black-box policy |
| Optimality guarantee | Asymptotic (MCTS) | Approximate |
In practice, RL adversaries are often combined with tree search: the RL policy provides a warm start or rollout policy for MCTS. The tree search refines the RL policy's suggestions, while the RL policy provides better-than-random rollouts. This hybrid approach often outperforms either method alone.
We have seen five families of planning-based falsification methods. Each requires different capabilities from the simulator. Choosing the right method starts with understanding what your simulator can and cannot do.
| Capability | Shooting | RRT | Cost Search | MCTS | RL |
|---|---|---|---|---|---|
| Forward sim | ✓ | ✓ | ✓ | ✓ | ✓ |
| Set state | — | ✓ | ✓ | ✓ | — |
| Gradients | preferred | — | — | — | — |
| Fast rollouts | — | — | — | preferred | ✓ |
| Many episodes | — | — | — | — | ✓ |
The critical distinction is resettability. Can you set the simulator to an arbitrary state s and simulate from there? If yes, tree search methods (RRT, cost search, MCTS) are all available. If no, you're limited to shooting (with gradients) or RL (which only needs forward simulation from initial states).
Real-world simulators vary enormously. A simple Python model of an inverted pendulum gives you everything: fast forward sim, state setting, automatic differentiation for gradients, and a known disturbance distribution. A complex autonomous driving simulator (CARLA, for example) typically provides forward simulation and state setting but no gradients and slow rollouts. A physical hardware-in-the-loop test bench provides only forward simulation — no state setting at all.
Strengths of Planning:
• Reuses partial successes
• Exploits Markov structure
• Finds failure paths efficiently
• MCTS has asymptotic guarantees
Limitations of Planning:
• Requires resettable sim (mostly)
• Finds one failure at a time
• Doesn't characterize failure distribution
• Chapter 6 addresses this gap
What comes next: Planning finds individual failures efficiently. But knowing that the system can fail is only the beginning. Chapter 6 asks: what does the full distribution over failure trajectories look like? Which failures are most likely? Are there multiple distinct failure modes? This requires sampling from the failure distribution — a fundamentally different problem that calls for MCMC and probabilistic programming rather than tree search.