Why compute the whole policy when you only need the next action?
You're the flight computer on a commercial aircraft. A near-collision is developing. You have to pick: climb, descend, or hold. You have about two seconds.
The problem: your state space is enormous. Altitude, airspeed, vertical rate, intruder geometry — hundreds of continuous dimensions. An offline method like value iteration needs to sweep every state in the space, update, and repeat until convergence. That takes hours of computation. You have two seconds.
This is the leap from offline planning to online planning. Offline methods (chapters 7–8) compute a policy π(s) for the entire state space before deployment. Online methods compute the action for the current state at decision time, using the generative model as a simulator.
The price you pay: online planning must be fast enough to produce an action within the decision budget. The methods in this chapter each make different tradeoffs between computation and solution quality.
The methods of this chapter, in order of sophistication:
| Method | Key idea | Complexity |
|---|---|---|
| Receding horizon | Plan to depth d, execute first action, replan | O(|A|d) |
| Rollout lookahead | Estimate leaf values via random rollouts | O(m·|A|·d) |
| Forward search | Enumerate all transitions to depth d | O((|S||A|)d) |
| Branch & bound | Prune branches via value bounds | O((|S||A|)d) worst |
| Sparse sampling | Sample m states per action instead of all | O((m|A|)d) |
| MCTS | Adaptively allocate sims via UCB1 | O(m·d) per decision |
| Heuristic search | Guide search with a domain heuristic | Problem-dependent |
| Open-loop | Optimize over action sequences, not state-dependent policies | Optimization cost |
1. Computation vs. solution quality. More simulations/samples always improve quality. The question is: given a fixed budget (say, 100ms), which method extracts the most value? MCTS usually wins on this metric for stochastic problems because it concentrates computation on promising branches. Sparse sampling wastes work on every branch uniformly.
2. Model accuracy vs. sim count. If your generative model is inaccurate (e.g., a learned dynamics model with prediction error), more simulations exacerbate the model error. A small number of simulations with a rough model can outperform a huge number with the same rough model — more sims compound the model bias. This is why model-based RL (Ch 16) actively manages model uncertainty.
3. Depth vs. breadth. Given a fixed number of node visits, spend them wide (many shallow trees) or deep (few deep trees)? Wide coverage catches immediate wins. Deep coverage catches long-horizon consequences. MCTS adaptively finds the sweet spot: go deep on promising branches, wide on unexplored ones.
You're navigating a ship through fog. You can see 5 miles ahead. You plot the best route for the next 5 miles, sail 1 mile, then re-check and re-plot. You never commit to a route beyond your visibility. That is receding horizon planning: the planning window slides forward with you.
The key point: at+1 through at+d−1 are thrown away at every step. This seems wasteful. Why not just execute the whole plan? Because the world is stochastic. After taking at, the actual next state st+1 may differ from what the plan assumed. By replanning from the true st+1, we correct this discrepancy at every step.
The simplest online planning algorithm that uses the receding horizon framework: for each candidate first action a, estimate the value by simulating m rollouts. A rollout runs a simple policy (often random) from the successor state to depth d, accumulating discounted reward.
Where si′ ~ T(s, a) is a sampled successor state and rollout(π, s, d) accumulates γtR along a trajectory of length d following policy π.
python def rollout(env, s, policy, depth, gamma=0.95): """Simulate policy from s for depth steps; return discounted return.""" ret, discount = 0.0, 1.0 for _ in range(depth): a = policy(s) # e.g. random: random.choice(env.actions) s_next, r = env.step(s, a) # generative model call ret += discount * r discount *= gamma s = s_next if env.is_terminal(s): break return ret def rollout_lookahead(env, s, m=50, depth=10, gamma=0.95): """Pick the action with the highest average rollout return.""" best_a, best_q = None, float('-inf') for a in env.actions: q = 0.0 for _ in range(m): s_next, r = env.step(s, a) # first step with action a q += r + gamma * rollout(env, s_next, random_policy, depth−1, gamma) q /= m if q > best_q: best_a, best_q = a, q return best_a
Complexity: O(m × |A| × d) per decision. For |A|=4, m=100, d=10: 4,000 simulator calls per step. Typically fast enough for real-time use.
Rollout assigns zero value to states at depth d (the trajectory ends there). A better idea: at depth d, query an offline-computed value function U(s) instead of returning 0. This U(s) may be rough (a neural network approximation or linear function), but even a crude estimate of long-range value dramatically outperforms U(s) = 0.
This is the architecture behind ACAS X (collision avoidance) and AlphaGo: an offline neural network provides U(s) at the leaf, online MCTS provides precision at the root.
The formal justification for rollout lookahead is the policy improvement theorem. Let π0 be the rollout policy. Define π1 as the policy that performs one-step lookahead using π0 as the leaf estimator:
The policy improvement theorem guarantees Vπ1(s) ≥ Vπ0(s) for all s. The one-step lookahead never makes the policy worse. Repeating this improvement — using π1 as the next rollout policy — gives π2, π3, … This sequence converges to π* (optimal policy) in a finite number of steps for finite MDPs. Each iteration is one pass of policy iteration.
Online rollout lookahead does one step of this iteration at every decision. It never converges to π* (it only sees one state at a time), but it systematically improves on π0. The better the rollout policy, the closer π1 gets to π*, and the less work the online search has to do.
The most straightforward online planner: build the complete look-ahead tree to depth d, compute expected values bottom-up, and pick the best root action. No approximations. No sampling. Every transition. Every state.
pseudocode # Forward Search — returns (best_action, best_value) from state s at depth d function forward_search(s, d, U): if d == 0: return (None, U(s)) # leaf: use terminal value function best_a, best_u = None, −∞ for each a in A: # expected value over all successor states q = R(s, a) + γ · ∑s′ T(s′ | s, a) · forward_search(s′, d−1, U).value if q > best_u: best_a, best_u = a, q return (best_a, best_u)
The structure of the search tree is an AND-OR tree. Action nodes (OR nodes) branch over choices — we maximize. State nodes (AND nodes) branch over stochastic outcomes — we take expectations. The values propagate up: leaves get U(s), state nodes sum T(s′|s,a) times child values, action nodes take the max.
Forward search is the baseline. It is exact (to depth d) and provides the standard against which every approximation is measured. In practice it is only feasible for:
Small action spaces
|A| = 2–4 with shallow depths (d ≤ 4) gives ~256 leaf nodes — tractable.
Small state spaces
|S| = 10–50 with |A| = 2, d = 3 gives ~125,000 nodes — borderline tractable.
The terminal value function U(s) is critical. Setting U(s) = 0 means the planner ignores all reward beyond depth d. In a discounted infinite-horizon problem with γ = 0.9, the first 20 steps account for 86% of total discounted value, so d = 20 might seem sufficient — but for large problems, even d = 5 is too expensive. Hybrid methods break this tension.
Visualize how the forward search tree grows. Orange circles = action (OR) nodes. Teal circles = state (AND) nodes. Adjust depth and branching factors to see exponential growth.
Suppose we have a toy MDP: 3 states {s0, s1, s2}, 2 actions {aL, aR}, γ = 0.9, and the terminal value U(s) = 0 for all s.
| Action | Successor | Probability | Reward R(s0, ·) |
|---|---|---|---|
| aL | s1 | 0.7 | +2 |
| aL | s2 | 0.3 | +2 |
| aR | s1 | 0.4 | −1 |
| aR | s2 | 0.6 | −1 |
At depth d=2, we need U(s1) and U(s2) evaluated by a second level of search. Suppose that second level yields U(s1) = 3.0 and U(s2) = 1.0 (computed recursively). Then:
Best action: aL with value 4.16. This is the exact 2-step lookahead value. Forward search guarantees this answer; every other method in this chapter approximates it.
Forward search is too slow because it explores branches that cannot possibly win. Branch and bound uses value bounds to skip them provably.
You need two ingredients:
| Bound | Notation | Requirement | Example |
|---|---|---|---|
| Lower bound on value | Ulo(s) | Ulo(s) ≤ U*(s) for all s | Value of a heuristic policy (e.g., always go right) |
| Upper bound on Q-value | Qhi(s,a) | Qhi(s,a) ≥ Q*(s,a) for all s,a | Optimistic relaxation (e.g., assume no obstacles) |
The algorithm sorts actions by Qhi(s,a) descending, then searches them in that order. As soon as it has a complete sub-tree value for the first action, it knows the lower bound on the best achievable value. Any action whose upper bound is below this lower bound cannot win — prune it.
pseudocode # Branch and Bound — key pruning step function branch_and_bound(s, d, U_lo, Q_hi): if d == 0: return (None, U_lo(s)) best_a, best_u = None, U_lo(s) # initialize with lower bound actions_sorted = sort(A, key=Q_hi(s,.), descending=True) for each a in actions_sorted: if Q_hi(s, a) ≤ best_u: break # all remaining actions can only be worse; prune q = R(s,a) + γ · ∑s′ T(s′|s,a) · branch_and_bound(s′, d−1, U_lo, Q_hi).value if q > best_u: best_a, best_u = a, q return (best_a, best_u)
The break condition is critical: because actions are sorted by Qhi descending, once Qhi(s,a) ≤ best_u, every subsequent action also has Qhi ≤ best_u. Pruning is safe. The result is always identical to forward search; only the computation differs.
Building Qhi:
Relax the problem. Remove obstacles. Assume teleportation. Solve the relaxed MDP cheaply. Its Q-values are always optimistic (over-estimate) for the original problem.
Building Ulo:
Run any sub-optimal policy (random walk, greedy by feature). Evaluate that policy. Its value function is a lower bound on the optimal value.
State s0, four actions, all at depth d=2. Suppose we have bounds:
| Action | Qhi(s0,a) (upper bound) | True Q*(s0,a) |
|---|---|---|
| a1 | 12.0 | 9.5 |
| a2 | 10.0 | 8.0 |
| a3 | 7.0 | 6.5 |
| a4 | 5.0 | 4.8 |
Also suppose Ulo(s0) = 6.0. Execution order (sorted by Qhi descending: a1, a2, a3, a4):
1. Expand a1 fully. Compute its true value: Q = 9.5. Update best_u = max(6.0, 9.5) = 9.5.
2. Qhi(a2) = 10.0 > best_u = 9.5. Must expand a2. Compute Q = 8.0. best_u = max(9.5, 8.0) = 9.5.
3. Qhi(a3) = 7.0 ≤ best_u = 9.5. Prune a3. (A3's true value is 6.5 < 9.5, so this is safe.)
4. Qhi(a4) = 5.0 ≤ best_u = 9.5. Prune a4. (A4's true value is 4.8, safe.)
We computed 2 subtrees instead of 4. Savings: 50%. Return: a1 with value 9.5 (correct). With tighter bounds, we could prune a2 as well if after expanding a1, we had best_u = 9.5 and Qhi(a2) = 9.0: pruned! That would require only 1 subtree.
The hardest part of branch and bound is constructing good bounds cheaply. Here are the standard approaches:
Upper bound Qhi(s,a): Solve a relaxed problem. Remove constraints (obstacles, capacity limits). Solve the relaxed MDP using value iteration — this is typically much easier than the original. Its Q-values upper-bound the original because the relaxed problem has more options. Another approach: closed-form bounds. For a grid world with a goal state, a Manhattan distance bound: Qhi(s, any a) = Rmax/(1−γ) − (dist(s, goal) − 1) × Rmin.
Lower bound Ulo(s): Evaluate any feasible policy. Run it on the original MDP and compute Vπ(s). This is always ≤ U*(s) because π is sub-optimal. Good policies give tighter bounds. A greedy policy based on a feature approximation is a typical choice.
The exponential cost of forward search has two factors: |A|d from actions and |S|d from states. We cannot reduce |A| without losing coverage of the action space. But the state-space factor |S|d can be attacked: instead of branching over all |S| successor states, sample only m states.
Instead of the true expectation over T(s′|s,a), we approximate it by averaging m sampled transitions. The complexity drops from O((|A||S|)d) to O((|A|m)d), which is independent of |S|.
| Method | Node branching at each action | Total nodes at depth d |
|---|---|---|
| Forward search | |S| (all successor states) | O((|A|·|S|)d) |
| Sparse sampling | m (sampled states) | O((|A|·m)d) |
| Savings | (|S|/m)d – potentially enormous | — |
Compare how many nodes each method explores. Forward search explodes with |S|; sparse sampling is flat. Drag the sliders to see the effect.
The remaining limitation: sparse sampling is still exponential in d. If d = 10, m = 5, |A| = 4, the tree has (4×5)10 = 1013 nodes. For large depths, exponential in d is still intractable. MCTS solves this.
The theoretical bound (Kearns et al., 2002): to get Q estimates within ε of optimal with probability 1−δ, we need approximately:
where Vmax = Rmax/(1−γ) and C is a universal constant. Plugging in ε=1, δ=0.05, γ=0.9, Rmax=1, |A|=4, d=5: Vmax=10, so m ≥ C×100×25×log(400) ≈ 1.5M. Worst-case bounds are very loose. In practice m=50–200 samples per action give good performance in many domains because variance in the actual transition distribution is much lower than Vmax would suggest.
python def sparse_sampling(env, s, depth, m=20, gamma=0.95, U_leaf=None): """Sparse sampling: sample m transitions per action node.""" if depth == 0: return None, (U_leaf(s) if U_leaf else 0.0) best_a, best_q = None, float('-inf') for a in env.actions: q_sum = 0.0 for _ in range(m): # m samples — the only change vs forward search s_next, r = env.step(s, a) # sample one s′ ~ T(s,a) _, v = sparse_sampling(env, s_next, depth−1, m, gamma, U_leaf) q_sum += r + gamma * v q = q_sum / m # average instead of exact expectation if q > best_q: best_a, best_q = a, q return best_a, best_q
Compare to forward search: the only difference is the inner loop over m samples instead of the sum over all s′. This single change removes the |S|d factor entirely.
Sparse sampling still explodes exponentially in depth d. The problem: it allocates the same number of simulations to every branch, regardless of how promising that branch is. This is wasteful. A move that is clearly terrible gets just as many simulations as the move that looks brilliant.
Monte Carlo Tree Search (MCTS) fixes this by adaptively concentrating computation on the most promising branches. Instead of a pre-determined tree structure, MCTS grows a search tree incrementally, simulation by simulation. Branches that look better get explored more. Branches that look bad get pruned by neglect.
MCTS maintains two quantities for every (state, action) pair it has visited:
| Symbol | Meaning | Updated when |
|---|---|---|
| N(s, a) | Number of times action a was tried from state s | Every simulation that traverses (s, a) |
| Q(s, a) | Mean return observed when taking a from s | Running average after each simulation |
Each MCTS iteration has four phases. Watch the simulation below, then read the explanation.
Click Step to run one MCTS simulation (selection → expansion → rollout → backprop). The last phase executed is highlighted. Click Run 20 for bulk simulations. The tree updates live.
After m iterations, the root action is selected by the policy that maximizes Q(sroot, a) — pure exploitation, no exploration bonus, because we're committing to an action, not running more simulations.
Suppose state s0 with 2 actions {up, down}. Start: N(s0,up) = 0, Q(s0,up) = 0. Run 5 simulations, all choosing “up” (because both actions have N=0, so both have UCB1=∞ initially; break ties randomly). The rollouts return: +3, −1, +5, +2, +4.
Q-value update is an incremental (running) average. After each simulation returning qi:
| Sim # | Return qi | N(s0,up) after | Q(s0,up) after | Calculation |
|---|---|---|---|---|
| 1 | +3 | 1 | 3.00 | 0 + (3−0)/1 = 3.00 |
| 2 | −1 | 2 | 1.00 | 3 + (−1−3)/2 = 1.00 |
| 3 | +5 | 3 | 2.33 | 1 + (5−1)/3 = 2.33 |
| 4 | +2 | 4 | 2.25 | 2.33 + (2−2.33)/4 = 2.25 |
| 5 | +4 | 5 | 2.60 | 2.25 + (4−2.25)/5 = 2.60 |
The incremental formula is numerically stable and requires O(1) memory: just store N and Q. No need to keep all past returns. This is crucial when running millions of simulations in games like Go.
python def mcts_simulate(env, s, tree, depth, c=1.0): # Returns estimated value; updates tree in-place if depth == 0: return env.heuristic(s) if s not in tree: # EXPANSION: first visit, initialize all actions tree[s] = {a: {'N': 0, 'Q': 0.0} for a in env.actions} # SIMULATION: rollout from this new node return rollout(env, s, rollout_depth=5) # SELECTION: UCB1 to pick action N_s = sum(tree[s][a]['N'] for a in env.actions) def ucb1(a): n = tree[s][a]['N'] q = tree[s][a]['Q'] if n == 0: return float('inf') return q + c * (math.log(N_s) / n) ** 0.5 a_star = max(env.actions, key=ucb1) r, s_next = env.step(s, a_star) # Recurse; get return from child q = r + env.gamma * mcts_simulate(env, s_next, tree, depth−1, c) # BACKPROPAGATION: running average update tree[s][a_star]['N'] += 1 tree[s][a_star]['Q'] += (q − tree[s][a_star]['Q']) / tree[s][a_star]['N'] return q def mcts(env, s, m=1000, depth=10, c=1.0): tree = {} for _ in range(m): mcts_simulate(env, s, tree, depth, c) # Final action: pure exploitation (no UCB bonus) return max(env.actions, key=lambda a: tree[s][a]['Q'])
Below, MCTS solves a 6×6 grid navigation task. The agent starts at top-left, must reach the goal (bottom-right), and earns a large penalty for the hazard zone. Click Run Sims to accumulate simulations, then Best Move to execute the MCTS-chosen action.
Visit counts N(s,a) are shown as numbers around each cell. Larger numbers = more MCTS attention. The agent (orange circle) navigates toward the goal (teal G) avoiding the hazard (red H).
UCB1 is the heart of MCTS. Understanding it precisely is worth the effort.
Break this apart symbol by symbol:
| Term | Type | Meaning |
|---|---|---|
| Q(s, a) | Exploitation | Mean return observed from (s,a). High Q → looks good so far. Take more of this. |
| c | Hyperparameter | Exploration constant. Scales the exploration bonus relative to value estimates. |
| log N(s) | Context | Log of total visits to state s. Grows slowly — ensures the bonus never disappears entirely. |
| N(s, a) | Exploration | Visits to action a from s. Low N(s,a) → action is underexplored → big bonus. |
| √(log N(s) / N(s,a)) | Exploration bonus | Shrinks as N(s,a) grows (the action gets less under-explored). Grows as N(s) grows (each action gets a bit more budget). |
If N(s,a) = 0, the bonus is infinite — unexplored actions are always tried first. Once all actions have been tried at least once, the formula kicks in and trades off exploitation and exploration.
Three actions: Red (high Q, many visits), Blue (medium Q, few visits), Green (low Q, unvisited). See how UCB1 balances them as you change c and total visits N.
The exploration constant c must be tuned for the specific problem. Too small: MCTS exploits greedily, gets stuck on local optima. Too large: MCTS explores uniformly, wastes computation. A common heuristic: set c so the exploration bonus is comparable in magnitude to typical Q-value differences. In AlphaGo, c was tuned empirically and found to be around 5 for their Q-value normalization.
A principled approach: normalize Q values to [0, 1] by dividing by Vmax = Rmax/(1−γ). Then set c = 1 or c = √2 (the theoretically motivated value for bandit problems). This normalization removes the need to re-tune c when reward scales change — a crucial engineering improvement when applying MCTS across different problems.
c too small (c=0.01):
MCTS rapidly commits to the first good-looking action. Bonus shrinks to near-zero within a few visits. Other actions are rarely explored. Performance is good when the first action tried happens to be optimal; catastrophic when it is not.
c too large (c=1000):
MCTS visits actions round-robin. Every action gets equal simulations regardless of quality. Equivalent to uniform random sampling, ignoring all Q-value feedback. Performance degrades to blind rollout.
State s, three actions, c = 2.0, N(s) = 20 total visits so far:
| Action | N(s,a) | Q(s,a) | Bonus c√(logN(s)/N(s,a)) | UCB1 score |
|---|---|---|---|---|
| a1 | 12 | 8.5 | 2√(log20/12) = 2√(0.249) = 0.99 | 8.5 + 0.99 = 9.49 |
| a2 | 7 | 7.0 | 2√(log20/7) = 2√(0.428) = 1.31 | 7.0 + 1.31 = 8.31 |
| a3 | 1 | 9.0 | 2√(log20/1) = 2√(3.0) = 3.46 | 9.0 + 3.46 = 12.46 ← selected |
Interesting: a3 has the highest Q (9.0) and the highest exploration bonus (it's only been tried once). UCB1 correctly selects it. After this simulation, a3's bonus will shrink. If it returns value 6.0, Q(s,a3) updates to (9.0×1 + 6.0)/2 = 7.5, N(s,a3) = 2. Its new bonus = 2√(log21/2) = 2√(1.52) = 2.47. New UCB1 = 7.5 + 2.47 = 9.97 — still selected again, but less dominantly. Over time, if it keeps returning low values, a1 will take over.
All the methods so far make no assumptions about the problem structure. Heuristic search incorporates domain knowledge to guide the search toward high-value regions, dramatically reducing the number of states visited.
The core idea: maintain an approximate value function U(s) (the "heuristic"). At each step of the simulation, instead of a random rollout, follow the greedy policy with respect to U. After each simulation, update U using a Bellman backup at every state along the trajectory.
pseudocode # Heuristic Search (one simulation) function heuristic_sim(s, d, U): if d == 0: return U(s) # greedy action w.r.t. current U a = arg maxa [R(s,a) + γ ∑s′ T(s′|s,a) U(s′)] s′ ~ T(s,a) # sample one successor val = R(s,a) + γ · heuristic_sim(s′, d−1, U) # Bellman backup: update U(s) with the lookahead estimate U(s) ← maxa [R(s,a) + γ ∑s′ T(s′|s,a) U(s′)] return val
Over many simulations, U(s) converges toward U*(s) along the most-visited states. This is real-time dynamic programming (RTDP) — an online version of value iteration that only updates states on simulated trajectories.
Labeled RTDP (LRTDP) extends this with a convergence test: a state s is labeled "solved" once U(s) has stopped changing more than ε. Solved states are never re-expanded. The algorithm terminates when the root state is solved. This provides a practical stopping criterion that MCTS and RTDP lack.
| Method | Guides search how | Convergence guarantee |
|---|---|---|
| MCTS | UCB1 (value + novelty) | No — run for fixed m simulations |
| RTDP | Greedy w.r.t. heuristic | Yes — to relevant states only |
| LRTDP | Greedy + solved labeling | Yes — terminates when root solved |
RTDP has no built-in stopping criterion: it runs trials until the caller decides to stop. This is a practical weakness. Labeled RTDP (LRTDP) adds a convergence check after each trial:
After completing a trial ending at some state sleaf, walk back along the trial path. For each state s on the path, check whether the Bellman residual is small enough:
If this holds, and all successors of s are already labeled solved, then mark s as "solved." Once the root state s0 is labeled solved, LRTDP terminates. The key insight: you only need to solve states reachable from s0 under the greedy policy — which may be a tiny fraction of the full state space.
In classical graph search (pathfinding), A* uses a heuristic h(s) to prioritize promising states. The same idea works in MDPs: maintain a priority queue of states to expand, ordered by f(s) = g(s) + h(s), where g(s) is the cost-to-come and h(s) is the admissible heuristic for cost-to-go.
For discounted MDPs, the equivalent quantity is: f(s) = −Ulo(s) (negate because we maximize reward, not minimize cost). States with high potential value are expanded first. This is essentially RTDP with a priority queue instead of a stack.
python # RTDP: Real-Time Dynamic Programming def rtdp_trial(env, s0, U, gamma=0.95, depth=20): """One greedy simulation from s0, updating U along the way.""" s = s0 for _ in range(depth): if env.is_terminal(s): break # Bellman backup at s a_star = max(env.actions, key=lambda a: env.reward(s,a) + gamma * sum( T * U[s_next] for s_next, T in env.transitions(s,a))) U[s] = env.reward(s, a_star) + gamma * sum( T * U[s_next] for s_next, T in env.transitions(s, a_star)) # Greedy step: sample from T(s, a_star) s = env.sample_next(s, a_star) return U # updated in-place along the visited trajectory def rtdp(env, s0, U_init, num_trials=500): U = dict(U_init) # start from admissible heuristic for _ in range(num_trials): rtdp_trial(env, s0, U) a_star = max(env.actions, key=lambda a: env.reward(s0,a) + gamma * sum(T*U[s_next] for s_next,T in env.transitions(s0,a))) return a_star
After enough trials, U converges to U* on all states reachable from s0 under the greedy policy. States never visited are never updated — this is the efficiency gain. For problems where only a small fraction of states are relevant from the start state, RTDP is dramatically faster than full value iteration.
Every method we've discussed so far is closed-loop: at each step, the action depends on the observed state. A policy π(s) maps states to actions. This means the plan is a contingency tree — it branches on what the environment might do.
Open-loop planning drastically simplifies this. Instead of a policy (a function), find the best fixed sequence of actions a1, a2, …, ad. Ignore what states are actually visited. Just optimize the expected return of the sequence.
Where St evolves stochastically according to the dynamics: St+1 ~ T(St, at). The sequence a1:d is fixed; only the states are random. This makes the search space a tree of depth d where each node branches on only |A| actions (not |A|×|S|).
Closed-loop advantages:
• Adapts to observed states
• Optimal in stochastic environments
• Policy is a contingency plan (if s, do a)
• Quality doesn't degrade with randomness
Open-loop advantages:
• Search only d×|A| parameters
• Optimization may be convex
• Very fast per decision
• Easy to warm-start (shift the sequence)
Open-loop is most useful when:
• The environment is nearly deterministic (dynamics are predictable).
• The planning horizon is very short (little time for stochasticity to accumulate).
• Fast replanning frequency compensates for plan staleness.
• The action space is continuous and the optimization is structured (e.g., model-predictive control).
You can combine the advantages of MCTS (adaptive simulation budget) with open-loop planning (simple search space). In Open-Loop MCTS, the tree nodes are action sequences rather than states. Each path from root to leaf represents the sequence a1, a2, …, ak. UCB1 still governs selection.
The key difference: when evaluating a leaf (a sequence), you run m simulations starting from the current state, executing the sequence. The stochasticity is in the environment transitions, not the sequence itself. This averages out noise while keeping the search space compact.
python # Open-Loop MCTS: tree nodes are action sequences def evaluate_sequence(env, s0, action_seq, m=20, gamma=0.95): """Average return of action_seq over m stochastic rollouts from s0.""" total = 0.0 for _ in range(m): s, ret, discount = s0, 0.0, 1.0 for a in action_seq: s_next, r = env.step(s, a) ret += discount * r discount *= gamma s = s_next if env.is_terminal(s): break total += ret return total / m
Open-Loop MCTS is effective when the action sequence space is smaller than the state space (e.g., |A|d << |S|), or when state observations are unreliable and the tree should not branch on them. It is widely used in games with chance elements and in robotics with uncertain state estimates.
Open-loop planning fails gracefully as stochasticity increases. Consider a simple example: you're steering a car toward a target in crosswind. The wind is random, uniformly in [−W, +W] at each step.
Low wind (W = 0.1): Open-loop plan "steer left 3 times, then straight" works fine. The wind pushes you only 0.3 units off course total over 3 steps. Small error.
High wind (W = 5.0): The plan fails catastrophically. After 3 steps, you could be 15 units off course in either direction. A closed-loop policy that observes position and corrects is the only robust option.
The gap between open-loop and closed-loop value is bounded by:
where δ measures the stochasticity of the environment (specifically, the maximum variation of the next-state distribution across actions). When δ = 0 (deterministic environment), open-loop = closed-loop. As δ grows, the gap grows proportionally. The factor 1/(1−γ) amplifies long-horizon stochasticity.
Online planning is not an isolated topic. It sits at the intersection of search, reinforcement learning, control theory, and game theory. Understanding these connections reveals why online planning is structured the way it is — and where the next improvements will come from.
Every method here exploits one insight: you do not need to solve the entire MDP. Plan from where you are, to a feasible depth, and replan each step. The methods differ only in how they allocate limited computation across the search tree.
| Method | Complexity (depth d) | Best for | Key weakness |
|---|---|---|---|
| Receding horizon | Any planner | Framework for all methods | Myopic if d too small |
| Rollout lookahead | O(m|A|d) | Simple improvement over any policy | Limited by rollout quality |
| Forward search | O((|S||A|)d) | Tiny problems, baseline | Exponential in d and |S| |
| Branch & bound | O((|S||A|)d) worst | When good bounds exist | Bound quality critical |
| Sparse sampling | O((m|A|)d) | Large/continuous state spaces | Still exponential in d |
| MCTS | O(m·d) | Games, combinatorial problems | c tuning, no bounds guarantee |
| Heuristic (RTDP) | Problem-dependent | Problems with good heuristics | Needs admissible heuristic |
| Open-loop | Optimization cost | Nearly deterministic, continuous action | Suboptimal with high stochasticity |
AlphaGo & AlphaZero (DeepMind, 2016–2017):
MCTS + deep neural network for both the rollout policy (prior over actions) and the value estimate at leaf nodes (replaces random rollouts). UCB1 is modified to incorporate the prior: score(a) = Q(s,a) + c·P(a|s)·√(N(s)/(1+N(s,a))). This combination beat the world champion at Go, Chess, and Shogi — the strongest demonstration of online planning in history.
ACAS X (FAA, 2019):
Collision avoidance for commercial aviation. State space: ~2.5 billion cells (relative geometry, speeds, altitudes). Offline value iteration with function approximation compresses U(s) into a neural network. Online planning to depth ~20 seconds queries this network at leaves. Operates in 100ms on embedded hardware. First certified NN-based safety system in aviation.
MuZero (DeepMind, 2019):
MCTS without a known model. Instead, the neural network learns a latent dynamics model: given state s and action a, predict the next latent state, reward, and value. MCTS then plans in latent space. Applied to Go, Chess, Shogi, and 57 Atari games simultaneously — the first system to achieve this without domain-specific rules.
Model Predictive Path Integral (MPPI, 2017):
Open-loop planning for continuous control. Sample K action sequences from a Gaussian distribution, roll each out through the dynamics, weight by exp(return/λ), update the distribution toward high-return sequences. Used in autonomous vehicle racing and drone acrobatics. GPU-parallel rollouts make it real-time feasible.
| Variant | Key modification | Used in |
|---|---|---|
| UCT (UCB1 applied to Trees) | UCB1 at every tree node | Original MCTS for Go (Kocsis & Szepesvári 2006) |
| PUCT | UCB1 weighted by neural prior P(a|s) | AlphaGo, AlphaZero, MuZero |
| Double Progressive Widening | Gradually expands continuous action/state spaces | Continuous-action MCTS |
| POMCP | MCTS over belief-state trees for POMDPs | Partially observable planning (Ch 22) |
| Temporal Difference MCTS | Uses TD learning for leaf evaluation instead of rollout | RL + planning hybrids |
MCTS is simple to understand but has many implementation pitfalls. This checklist covers the most common bugs:
Correctness:
• Handle N(s,a)=0 explicitly (return ∞, don't divide)
• Initialize Q=0 for all new actions (not −∞)
• Use running average for Q update, not sum
• Final action selection: max Q, not max UCB1
• Don't reuse the tree across decisions (the root state changed)
Performance:
• Vectorize rollout simulation if possible
• Use dictionaries keyed by state hash, not tuples
• Limit rollout depth (5–15 steps is usually enough)
• Tune c on a small representative set of initial states
• Terminal states: return 0 without a rollout
If you want to build intuition for the probability and value-function machinery underlying these methods:
• MDP — Markov decision processes, value iteration, the Bellman equation
• RL Algorithms — Q-learning, policy gradient, actor-critic
• Bayes Filter — recursive belief update (underlying POMDP planning)
• POMDP — planning under observation uncertainty; the setting for Ch 22
"The purpose of computing is insight, not numbers." — Richard Hamming
"If you can't explain it simply, you don't understand it well enough." — Richard Feynman
Online planning is the art of understanding a problem well enough to find the best action, without understanding it perfectly. The Bellman equation tells us what the answer is; the algorithms of this chapter tell us how to compute it in time.
Each algorithm exploits a different piece of problem structure:
• Rollout lookahead assumes any policy can produce a useful value signal after one step. Requires: a rollout policy, a generative model.
• Forward search assumes the full transition distribution T(s′|s,a) is known and computable. Requires: exact model.
• Branch & bound assumes tight upper/lower bounds on value can be computed cheaply. Requires: domain-specific bounds.
• Sparse sampling assumes the generative model can sample efficiently. Requires: generative model, no transition enumeration needed.
• MCTS assumes the generative model is fast and adaptive allocation via UCB1 is beneficial. Requires: fast generative model, tuned c.
• Heuristic search assumes a cheap admissible heuristic exists. Requires: domain knowledge, admissible initial U.
• Open-loop assumes the environment is nearly deterministic or action sequences can be optimized efficiently. Requires: low stochasticity or structured dynamics.