Kochenderfer, Wheeler & Wray — Chapter 9

Online Planning

Why compute the whole policy when you only need the next action?

Prerequisites: MDPs (Ch 7) + Approximate value functions (Ch 8). That's it.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Plan Online?

You're the flight computer on a commercial aircraft. A near-collision is developing. You have to pick: climb, descend, or hold. You have about two seconds.

The problem: your state space is enormous. Altitude, airspeed, vertical rate, intruder geometry — hundreds of continuous dimensions. An offline method like value iteration needs to sweep every state in the space, update, and repeat until convergence. That takes hours of computation. You have two seconds.

The core insight: You don't need the optimal policy for every possible state in the universe. You need the best action from this state, right now. Online planning computes exactly that — and nothing more.

This is the leap from offline planning to online planning. Offline methods (chapters 7–8) compute a policy π(s) for the entire state space before deployment. Online methods compute the action for the current state at decision time, using the generative model as a simulator.

Offline
Compute π(s) for all s ∈ S — store the policy, deploy it
vs.
Online
At each step: receive s, search forward, pick best a, act

The price you pay: online planning must be fast enough to produce an action within the decision budget. The methods in this chapter each make different tradeoffs between computation and solution quality.

The generative model is the key: Online methods require a generative model — a black box that accepts (s, a) and returns a sampled next state s′ ~ T(s,a) and reward r. You don't need the full transition matrix. A simulator counts. This is why MCTS can be applied to Go, Atari, and autonomous vehicles: you can simulate the world without enumerating it.

The methods of this chapter, in order of sophistication:

MethodKey ideaComplexity
Receding horizonPlan to depth d, execute first action, replanO(|A|d)
Rollout lookaheadEstimate leaf values via random rolloutsO(m·|A|·d)
Forward searchEnumerate all transitions to depth dO((|S||A|)d)
Branch & boundPrune branches via value boundsO((|S||A|)d) worst
Sparse samplingSample m states per action instead of allO((m|A|)d)
MCTSAdaptively allocate sims via UCB1O(m·d) per decision
Heuristic searchGuide search with a domain heuristicProblem-dependent
Open-loopOptimize over action sequences, not state-dependent policiesOptimization cost
Complexity summary — what each exponent means:
Forward search O((|S||A|)d): For |S|=50, |A|=4, d=5 → 2005 = 3.2×1011 nodes. Infeasible.
Sparse sampling O((m|A|)d): For m=10, |A|=4, d=5 → 405 = 1.0×108 nodes. Slow but feasible.
MCTS O(m·d): For m=1000, d=10 → 10,000 node visits. Fast.

The Three Key Tradeoffs in Online Planning

1. Computation vs. solution quality. More simulations/samples always improve quality. The question is: given a fixed budget (say, 100ms), which method extracts the most value? MCTS usually wins on this metric for stochastic problems because it concentrates computation on promising branches. Sparse sampling wastes work on every branch uniformly.

2. Model accuracy vs. sim count. If your generative model is inaccurate (e.g., a learned dynamics model with prediction error), more simulations exacerbate the model error. A small number of simulations with a rough model can outperform a huge number with the same rough model — more sims compound the model bias. This is why model-based RL (Ch 16) actively manages model uncertainty.

3. Depth vs. breadth. Given a fixed number of node visits, spend them wide (many shallow trees) or deep (few deep trees)? Wide coverage catches immediate wins. Deep coverage catches long-horizon consequences. MCTS adaptively finds the sweet spot: go deep on promising branches, wide on unexplored ones.

Online planning computes the best action for the current state at decision time. What does this require that offline planning does not?

Chapter 1: Receding Horizon Planning

You're navigating a ship through fog. You can see 5 miles ahead. You plot the best route for the next 5 miles, sail 1 mile, then re-check and re-plot. You never commit to a route beyond your visibility. That is receding horizon planning: the planning window slides forward with you.

1. Observe
Receive current state st from the environment
2. Plan
Run any online planning algorithm from st to horizon depth d. Produces a plan at, at+1, …, at+d−1.
3. Execute
Take only the first action at from the plan
4. Observe
Receive st+1. Discard at+1, …, at+d−1. Increment t.
↻ repeat

The key point: at+1 through at+d−1 are thrown away at every step. This seems wasteful. Why not just execute the whole plan? Because the world is stochastic. After taking at, the actual next state st+1 may differ from what the plan assumed. By replanning from the true st+1, we correct this discrepancy at every step.

Depth d is the critical design parameter. Consider two extremes:
d = 1: One-step lookahead. Plans only the immediate next reward. Misses long-range consequences. An aircraft collision avoidance system planning 1 second ahead will miss a conflict developing 30 seconds out.
d = ∞: Equivalent to solving the full MDP — optimal but intractable.
d = 10–20 is a typical practical compromise. Frequent replanning compensates for shallowness.

Lookahead with Rollouts

The simplest online planning algorithm that uses the receding horizon framework: for each candidate first action a, estimate the value by simulating m rollouts. A rollout runs a simple policy (often random) from the successor state to depth d, accumulating discounted reward.

π*(s) ≈ arg maxa   (1/m) ∑i=1m [ R(s,a) + γ · rollout(πrollout, si′, d−1) ]

Where si′ ~ T(s, a) is a sampled successor state and rollout(π, s, d) accumulates γtR along a trajectory of length d following policy π.

python
def rollout(env, s, policy, depth, gamma=0.95):
    """Simulate policy from s for depth steps; return discounted return."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        a = policy(s)              # e.g. random: random.choice(env.actions)
        s_next, r = env.step(s, a) # generative model call
        ret += discount * r
        discount *= gamma
        s = s_next
        if env.is_terminal(s): break
    return ret

def rollout_lookahead(env, s, m=50, depth=10, gamma=0.95):
    """Pick the action with the highest average rollout return."""
    best_a, best_q = None, float('-inf')
    for a in env.actions:
        q = 0.0
        for _ in range(m):
            s_next, r = env.step(s, a)  # first step with action a
            q += r + gamma * rollout(env, s_next, random_policy, depth−1, gamma)
        q /= m
        if q > best_q: best_a, best_q = a, q
    return best_a

Complexity: O(m × |A| × d) per decision. For |A|=4, m=100, d=10: 4,000 simulator calls per step. Typically fast enough for real-time use.

Why does a random rollout policy improve performance? Even a terrible rollout policy produces useful signal. The one-step lookahead picks the best first action given the (biased but consistent) rollout estimates. By the policy improvement theorem, the resulting policy π* is at least as good as the rollout policy πrollout. You're improving even if the improvement is modest. Iterating — using π* as the next rollout policy — leads to further improvement each round.

Hybrid Planning: Online + Offline

Rollout assigns zero value to states at depth d (the trajectory ends there). A better idea: at depth d, query an offline-computed value function U(s) instead of returning 0. This U(s) may be rough (a neural network approximation or linear function), but even a crude estimate of long-range value dramatically outperforms U(s) = 0.

V(s, d) = U(s)   if d = 0
V(s, d) = maxa [ R(s,a) + γ · Es′~T(s,a) V(s′, d−1) ]   otherwise

This is the architecture behind ACAS X (collision avoidance) and AlphaGo: an offline neural network provides U(s) at the leaf, online MCTS provides precision at the root.

Worked example: 5-step depth, γ = 0.95.
With U(s) = 0, states at step 5 contribute nothing. Optimal value at step 0 sees only rewards at steps 1–5.
With a perfect U(s): all remaining reward is captured at step 5. Full optimality.
With a rough U(s) (error ≤ 10): error at step 0 ≤ γ5 × 10 = 0.955 × 10 ≈ 7.7. Often acceptable.

Policy Improvement: The Theory Behind Rollout Lookahead

The formal justification for rollout lookahead is the policy improvement theorem. Let π0 be the rollout policy. Define π1 as the policy that performs one-step lookahead using π0 as the leaf estimator:

π1(s) = arg maxa [ R(s,a) + γ Es′ Vπ0(s′) ]

The policy improvement theorem guarantees Vπ1(s) ≥ Vπ0(s) for all s. The one-step lookahead never makes the policy worse. Repeating this improvement — using π1 as the next rollout policy — gives π2, π3, … This sequence converges to π* (optimal policy) in a finite number of steps for finite MDPs. Each iteration is one pass of policy iteration.

Online rollout lookahead does one step of this iteration at every decision. It never converges to π* (it only sees one state at a time), but it systematically improves on π0. The better the rollout policy, the closer π1 gets to π*, and the less work the online search has to do.

In receding horizon planning, how much of the plan computed at step t is executed before replanning?

Chapter 2: Forward Search

The most straightforward online planner: build the complete look-ahead tree to depth d, compute expected values bottom-up, and pick the best root action. No approximations. No sampling. Every transition. Every state.

pseudocode
# Forward Search — returns (best_action, best_value) from state s at depth d
function forward_search(s, d, U):
  if d == 0:
    return (None, U(s))   # leaf: use terminal value function

  best_a, best_u = None, −∞
  for each a in A:
    # expected value over all successor states
    q = R(s, a) + γ · ∑s′ T(s′ | s, a) · forward_search(s′, d−1, U).value
    if q > best_u:
      best_a, best_u = a, q
  return (best_a, best_u)

The structure of the search tree is an AND-OR tree. Action nodes (OR nodes) branch over choices — we maximize. State nodes (AND nodes) branch over stochastic outcomes — we take expectations. The values propagate up: leaves get U(s), state nodes sum T(s′|s,a) times child values, action nodes take the max.

Exponential cost: At depth 1, we expand |A| actions, each to |S| successor states: |A|×|S| nodes. At depth 2: |A|2×|S|2. At depth d: O(|A|d×|S|d) = O((|A||S|)d). Even for |A|=4, |S|=100, depth d=5: 45×1005 = 1013 nodes. Intractable.

Forward search is the baseline. It is exact (to depth d) and provides the standard against which every approximation is measured. In practice it is only feasible for:

Small action spaces

|A| = 2–4 with shallow depths (d ≤ 4) gives ~256 leaf nodes — tractable.

Small state spaces

|S| = 10–50 with |A| = 2, d = 3 gives ~125,000 nodes — borderline tractable.

The terminal value function U(s) is critical. Setting U(s) = 0 means the planner ignores all reward beyond depth d. In a discounted infinite-horizon problem with γ = 0.9, the first 20 steps account for 86% of total discounted value, so d = 20 might seem sufficient — but for large problems, even d = 5 is too expensive. Hybrid methods break this tension.

AND-OR tree breakdown: Take a 3-action MDP (|A|=3) with 10 states (|S|=10), planned to depth d=2. Root: 3 action children. Each action: 10 state children. Each state: 3 action children. Each action: 10 leaf states. Total nodes: 1 + 3 + 30 + 90 + 900 = 1,024. For d=3: add one more level → 30,000. For d=4: 1,000,000. This is why d=3 or 4 is often the practical limit.
AND-OR Tree: Forward Search Structure

Visualize how the forward search tree grows. Orange circles = action (OR) nodes. Teal circles = state (AND) nodes. Adjust depth and branching factors to see exponential growth.

Depth d2
Actions |A|3
States |S|3

Worked Numerical Example (d=2)

Suppose we have a toy MDP: 3 states {s0, s1, s2}, 2 actions {aL, aR}, γ = 0.9, and the terminal value U(s) = 0 for all s.

ActionSuccessorProbabilityReward R(s0, ·)
aLs10.7+2
aLs20.3+2
aRs10.4−1
aRs20.6−1

At depth d=2, we need U(s1) and U(s2) evaluated by a second level of search. Suppose that second level yields U(s1) = 3.0 and U(s2) = 1.0 (computed recursively). Then:

Q(s0, aL) = 2 + 0.9 × (0.7 × 3.0 + 0.3 × 1.0) = 2 + 0.9 × 2.4 = 4.16
Q(s0, aR) = −1 + 0.9 × (0.4 × 3.0 + 0.6 × 1.0) = −1 + 0.9 × 1.8 = 0.62

Best action: aL with value 4.16. This is the exact 2-step lookahead value. Forward search guarantees this answer; every other method in this chapter approximates it.

Why does the depth-2 computation require the depth-1 computations for s1 and s2? Because the depth-2 Q-value of (s0, aL) depends on the best available action from s1 and s2. Those best actions come from depth-1 sub-trees. The recursion bottoms out at depth 0, where U(s) is returned directly. This is why forward search is naturally recursive: each depth calls the next level, which calls the next, down to the leaves.
Forward search is an AND-OR tree. What operation is performed at OR (action) nodes vs. AND (state) nodes?

Chapter 3: Branch and Bound

Forward search is too slow because it explores branches that cannot possibly win. Branch and bound uses value bounds to skip them provably.

You need two ingredients:

BoundNotationRequirementExample
Lower bound on valueUlo(s)Ulo(s) ≤ U*(s) for all sValue of a heuristic policy (e.g., always go right)
Upper bound on Q-valueQhi(s,a)Qhi(s,a) ≥ Q*(s,a) for all s,aOptimistic relaxation (e.g., assume no obstacles)

The algorithm sorts actions by Qhi(s,a) descending, then searches them in that order. As soon as it has a complete sub-tree value for the first action, it knows the lower bound on the best achievable value. Any action whose upper bound is below this lower bound cannot win — prune it.

pseudocode
# Branch and Bound — key pruning step
function branch_and_bound(s, d, U_lo, Q_hi):
  if d == 0: return (None, U_lo(s))

  best_a, best_u = None, U_lo(s)   # initialize with lower bound
  actions_sorted = sort(A, key=Q_hi(s,.), descending=True)

  for each a in actions_sorted:
    if Q_hi(s, a) ≤ best_u:
      break   # all remaining actions can only be worse; prune
    q = R(s,a) + γ · ∑s′ T(s′|s,a) · branch_and_bound(s′, d−1, U_lo, Q_hi).value
    if q > best_u:
      best_a, best_u = a, q

  return (best_a, best_u)

The break condition is critical: because actions are sorted by Qhi descending, once Qhi(s,a) ≤ best_u, every subsequent action also has Qhi ≤ best_u. Pruning is safe. The result is always identical to forward search; only the computation differs.

Tighter bounds = more pruning. If Qhi(s,a) = ∞ for all a (useless bound), no pruning happens and the algorithm degrades to forward search. If Qhi(s,a) exactly equals Q*(s,a), we only expand the optimal subtree and prune everything else. Most real problems lie between these extremes. The art is constructing good bounds cheaply.

Building Qhi:

Relax the problem. Remove obstacles. Assume teleportation. Solve the relaxed MDP cheaply. Its Q-values are always optimistic (over-estimate) for the original problem.

Building Ulo:

Run any sub-optimal policy (random walk, greedy by feature). Evaluate that policy. Its value function is a lower bound on the optimal value.

Correctness guarantee: Branch and bound returns the exact same action as forward search. Pruning never removes the optimal branch because: (1) the lower bound Ulo(s) ≤ U*(s), so we only prune when the upper bound Qhi(s,a) ≤ best_u, which implies Q*(s,a) ≤ Qhi(s,a) ≤ best_u. The pruned action cannot be optimal.

Concrete Worked Example: Pruning in Action

State s0, four actions, all at depth d=2. Suppose we have bounds:

ActionQhi(s0,a) (upper bound)True Q*(s0,a)
a112.09.5
a210.08.0
a37.06.5
a45.04.8

Also suppose Ulo(s0) = 6.0. Execution order (sorted by Qhi descending: a1, a2, a3, a4):

1. Expand a1 fully. Compute its true value: Q = 9.5. Update best_u = max(6.0, 9.5) = 9.5.

2. Qhi(a2) = 10.0 > best_u = 9.5. Must expand a2. Compute Q = 8.0. best_u = max(9.5, 8.0) = 9.5.

3. Qhi(a3) = 7.0 ≤ best_u = 9.5. Prune a3. (A3's true value is 6.5 < 9.5, so this is safe.)

4. Qhi(a4) = 5.0 ≤ best_u = 9.5. Prune a4. (A4's true value is 4.8, safe.)

We computed 2 subtrees instead of 4. Savings: 50%. Return: a1 with value 9.5 (correct). With tighter bounds, we could prune a2 as well if after expanding a1, we had best_u = 9.5 and Qhi(a2) = 9.0: pruned! That would require only 1 subtree.

The value of tight bounds: The gap Qhi(s,a) − Q*(s,a) determines how much of the tree we must explore. A bound that is always within ε of Q* will allow pruning of all but O(1) subtrees at each node, collapsing the exponential tree to near-linear cost. This is why tight bounds are worth significant effort to compute.

Where Bounds Come From in Practice

The hardest part of branch and bound is constructing good bounds cheaply. Here are the standard approaches:

Upper bound Qhi(s,a): Solve a relaxed problem. Remove constraints (obstacles, capacity limits). Solve the relaxed MDP using value iteration — this is typically much easier than the original. Its Q-values upper-bound the original because the relaxed problem has more options. Another approach: closed-form bounds. For a grid world with a goal state, a Manhattan distance bound: Qhi(s, any a) = Rmax/(1−γ) − (dist(s, goal) − 1) × Rmin.

Lower bound Ulo(s): Evaluate any feasible policy. Run it on the original MDP and compute Vπ(s). This is always ≤ U*(s) because π is sub-optimal. Good policies give tighter bounds. A greedy policy based on a feature approximation is a typical choice.

Branch and bound produces the exact same action as forward search. What does it sacrifice to achieve this?

Chapter 4: Sparse Sampling

The exponential cost of forward search has two factors: |A|d from actions and |S|d from states. We cannot reduce |A| without losing coverage of the action space. But the state-space factor |S|d can be attacked: instead of branching over all |S| successor states, sample only m states.

Q(s,a) ≈ R(s,a) + γ · (1/m) ∑i=1m V(si′, d−1)     si′ ~ T(s,a)

Instead of the true expectation over T(s′|s,a), we approximate it by averaging m sampled transitions. The complexity drops from O((|A||S|)d) to O((|A|m)d), which is independent of |S|.

The critical result: Ke&ouml;rding (2000) proved that sparse sampling with m ≥ O(ε−2 d2 log(d/δ) Vmax2 / ε2) gives a value estimate within ε of optimal with probability 1−δ. The key: m depends on the desired accuracy and depth, but not at all on |S|. You can have a continuous state space with uncountably infinite states, and sparse sampling still runs in finite time.
MethodNode branching at each actionTotal nodes at depth d
Forward search|S| (all successor states)O((|A|·|S|)d)
Sparse samplingm (sampled states)O((|A|·m)d)
Savings(|S|/m)d – potentially enormous
Sparse Sampling: Complexity vs. State Space Size

Compare how many nodes each method explores. Forward search explodes with |S|; sparse sampling is flat. Drag the sliders to see the effect.

Depth d3
Samples m5
Actions |A|4

The remaining limitation: sparse sampling is still exponential in d. If d = 10, m = 5, |A| = 4, the tree has (4×5)10 = 1013 nodes. For large depths, exponential in d is still intractable. MCTS solves this.

Is sampling m = |S| the same as forward search? No. Forward search iterates over every s′ with nonzero T(s′|s,a). Sparse sampling draws |S| random samples from T(s′|s,a), which will include duplicates and miss some states. Even at the same cost, sparse sampling remains an approximation.

Error Analysis: How Many Samples m Do You Need?

The theoretical bound (Kearns et al., 2002): to get Q estimates within ε of optimal with probability 1−δ, we need approximately:

m ≥   C · (Vmax / ε)2 · d2 · log(|A| d / δ)

where Vmax = Rmax/(1−γ) and C is a universal constant. Plugging in ε=1, δ=0.05, γ=0.9, Rmax=1, |A|=4, d=5: Vmax=10, so m ≥ C×100×25×log(400) ≈ 1.5M. Worst-case bounds are very loose. In practice m=50–200 samples per action give good performance in many domains because variance in the actual transition distribution is much lower than Vmax would suggest.

python
def sparse_sampling(env, s, depth, m=20, gamma=0.95, U_leaf=None):
    """Sparse sampling: sample m transitions per action node."""
    if depth == 0:
        return None, (U_leaf(s) if U_leaf else 0.0)

    best_a, best_q = None, float('-inf')
    for a in env.actions:
        q_sum = 0.0
        for _ in range(m):               # m samples — the only change vs forward search
            s_next, r = env.step(s, a)   # sample one s′ ~ T(s,a)
            _, v = sparse_sampling(env, s_next, depth−1, m, gamma, U_leaf)
            q_sum += r + gamma * v
        q = q_sum / m                    # average instead of exact expectation
        if q > best_q: best_a, best_q = a, q
    return best_a, best_q

Compare to forward search: the only difference is the inner loop over m samples instead of the sum over all s′. This single change removes the |S|d factor entirely.

Sparse sampling's complexity O((|A|·m)d) is independent of what, making it tractable for problems that kill forward search?

Chapter 5: Monte Carlo Tree Search

Sparse sampling still explodes exponentially in depth d. The problem: it allocates the same number of simulations to every branch, regardless of how promising that branch is. This is wasteful. A move that is clearly terrible gets just as many simulations as the move that looks brilliant.

Monte Carlo Tree Search (MCTS) fixes this by adaptively concentrating computation on the most promising branches. Instead of a pre-determined tree structure, MCTS grows a search tree incrementally, simulation by simulation. Branches that look better get explored more. Branches that look bad get pruned by neglect.

MCTS maintains two quantities for every (state, action) pair it has visited:

SymbolMeaningUpdated when
N(s, a)Number of times action a was tried from state sEvery simulation that traverses (s, a)
Q(s, a)Mean return observed when taking a from sRunning average after each simulation

Each MCTS iteration has four phases. Watch the simulation below, then read the explanation.

MCTS Live: Search Tree Growing in Real Time

Click Step to run one MCTS simulation (selection → expansion → rollout → backprop). The last phase executed is highlighted. Click Run 20 for bulk simulations. The tree updates live.

Sims: 0 — Phase: —
Exploration c 80
Rollout depth 6

The Four Phases, One by One

1. Selection
Start at root. Traverse the tree using UCB1, choosing the child that maximizes Q(s,a) + c√(log N(s)/N(s,a)). Stop when reaching an unexplored node.
2. Expansion
Initialize N(s,a)=0 and Q(s,a)=0 for all actions from the new node. Add it to the tree.
3. Simulation (Rollout)
From the new node, simulate a rollout policy (typically random) to depth d_rollout. Accumulate discounted reward. This is the value estimate for the new node.
4. Backpropagation
Walk back up the path taken in Selection. For each (s,a) on the path: N(s,a) += 1 and Q(s,a) += (new_return − Q(s,a)) / N(s,a) (running average update).

After m iterations, the root action is selected by the policy that maximizes Q(sroot, a) — pure exploitation, no exploration bonus, because we're committing to an action, not running more simulations.

Why does this work? UCB1 gives high scores to: (a) actions with high Q (exploitation) and (b) actions visited few times (exploration). This automatically routes most simulations to the most promising branches. States rarely visited have high exploration bonus, so the algorithm occasionally revisits them — preventing it from getting stuck on a local optimum.
MCTS convergence guarantee: As m → ∞, the UCT algorithm (MCTS with UCB1) converges to the optimal action with probability 1. The root action value Q(sroot, a*) → Q*(sroot, a*). However, convergence can be slow in practice: the tree must be explored exhaustively for the guarantee to hold. In time-constrained settings (finite m), MCTS provides the best action found, not necessarily the optimal one. The exploration constant c controls the tradeoff: small c converges faster but risks missing the optimum; large c guarantees eventual convergence but is slower in practice.

Worked Numerical: Q-Value Update Step by Step

Suppose state s0 with 2 actions {up, down}. Start: N(s0,up) = 0, Q(s0,up) = 0. Run 5 simulations, all choosing “up” (because both actions have N=0, so both have UCB1=∞ initially; break ties randomly). The rollouts return: +3, −1, +5, +2, +4.

Q-value update is an incremental (running) average. After each simulation returning qi:

Q ← Q + (q − Q) / N     (equivalently: Q = (1/N) ∑i=1N qi)
Sim #Return qiN(s0,up) afterQ(s0,up) afterCalculation
1+313.000 + (3−0)/1 = 3.00
2−121.003 + (−1−3)/2 = 1.00
3+532.331 + (5−1)/3 = 2.33
4+242.252.33 + (2−2.33)/4 = 2.25
5+452.602.25 + (4−2.25)/5 = 2.60

The incremental formula is numerically stable and requires O(1) memory: just store N and Q. No need to keep all past returns. This is crucial when running millions of simulations in games like Go.

Backpropagation propagates through the path. Each simulation traverses a path from root to leaf. Every (s,a) pair on that path gets its N and Q updated. The return q is discounted at each level: the update at the root uses the full trajectory return; at depth 1, it uses the return starting from step 2; and so on. This ensures deeper nodes influence shallower nodes through the Q values, not just through the UCB1 selection counts.
python
def mcts_simulate(env, s, tree, depth, c=1.0):
    # Returns estimated value; updates tree in-place
    if depth == 0:
        return env.heuristic(s)

    if s not in tree:
        # EXPANSION: first visit, initialize all actions
        tree[s] = {a: {'N': 0, 'Q': 0.0} for a in env.actions}
        # SIMULATION: rollout from this new node
        return rollout(env, s, rollout_depth=5)

    # SELECTION: UCB1 to pick action
    N_s = sum(tree[s][a]['N'] for a in env.actions)
    def ucb1(a):
        n = tree[s][a]['N']
        q = tree[s][a]['Q']
        if n == 0: return float('inf')
        return q + c * (math.log(N_s) / n) ** 0.5

    a_star = max(env.actions, key=ucb1)
    r, s_next = env.step(s, a_star)

    # Recurse; get return from child
    q = r + env.gamma * mcts_simulate(env, s_next, tree, depth−1, c)

    # BACKPROPAGATION: running average update
    tree[s][a_star]['N'] += 1
    tree[s][a_star]['Q'] += (q − tree[s][a_star]['Q']) / tree[s][a_star]['N']
    return q

def mcts(env, s, m=1000, depth=10, c=1.0):
    tree = {}
    for _ in range(m):
        mcts_simulate(env, s, tree, depth, c)
    # Final action: pure exploitation (no UCB bonus)
    return max(env.actions, key=lambda a: tree[s][a]['Q'])
After m MCTS simulations, which criterion selects the final action to execute?

MCTS Grid Navigation: See It Solve a Problem

Below, MCTS solves a 6×6 grid navigation task. The agent starts at top-left, must reach the goal (bottom-right), and earns a large penalty for the hazard zone. Click Run Sims to accumulate simulations, then Best Move to execute the MCTS-chosen action.

MCTS: Grid Navigation

Visit counts N(s,a) are shown as numbers around each cell. Larger numbers = more MCTS attention. The agent (orange circle) navigates toward the goal (teal G) avoiding the hazard (red H).

Sims: 0
Exploration c60

Chapter 6: UCB1 — The Exploration Bonus

UCB1 is the heart of MCTS. Understanding it precisely is worth the effort.

a* = arg maxa   Q(s,a) + c √log N(s) / N(s,a)

Break this apart symbol by symbol:

TermTypeMeaning
Q(s, a)ExploitationMean return observed from (s,a). High Q → looks good so far. Take more of this.
cHyperparameterExploration constant. Scales the exploration bonus relative to value estimates.
log N(s)ContextLog of total visits to state s. Grows slowly — ensures the bonus never disappears entirely.
N(s, a)ExplorationVisits to action a from s. Low N(s,a) → action is underexplored → big bonus.
√(log N(s) / N(s,a))Exploration bonusShrinks as N(s,a) grows (the action gets less under-explored). Grows as N(s) grows (each action gets a bit more budget).

If N(s,a) = 0, the bonus is infinite — unexplored actions are always tried first. Once all actions have been tried at least once, the formula kicks in and trades off exploitation and exploration.

Where does UCB1 come from? UCB1 was derived for the multi-armed bandit problem by Auer et al. (2002). In a bandit with K arms, UCB1 guarantees that the cumulative regret after T steps is O(K log T). In MCTS, we apply it at every tree node as if each state's actions were a separate bandit problem. This is heuristic — the theoretical guarantees don't extend to the tree setting — but empirically it works extremely well.
UCB1 Score: Exploitation vs. Exploration

Three actions: Red (high Q, many visits), Blue (medium Q, few visits), Green (low Q, unvisited). See how UCB1 balances them as you change c and total visits N.

Exploration c80
Total visits N(s)50

The exploration constant c must be tuned for the specific problem. Too small: MCTS exploits greedily, gets stuck on local optima. Too large: MCTS explores uniformly, wastes computation. A common heuristic: set c so the exploration bonus is comparable in magnitude to typical Q-value differences. In AlphaGo, c was tuned empirically and found to be around 5 for their Q-value normalization.

A principled approach: normalize Q values to [0, 1] by dividing by Vmax = Rmax/(1−γ). Then set c = 1 or c = √2 (the theoretically motivated value for bandit problems). This normalization removes the need to re-tune c when reward scales change — a crucial engineering improvement when applying MCTS across different problems.

c too small (c=0.01):

MCTS rapidly commits to the first good-looking action. Bonus shrinks to near-zero within a few visits. Other actions are rarely explored. Performance is good when the first action tried happens to be optimal; catastrophic when it is not.

c too large (c=1000):

MCTS visits actions round-robin. Every action gets equal simulations regardless of quality. Equivalent to uniform random sampling, ignoring all Q-value feedback. Performance degrades to blind rollout.

UCB1 with N(s,a) = 0: The score is ∞ for unvisited actions. This means MCTS always tries every action at least once before exploiting. This is correct behavior — you should never skip an action you've never tried. Only after all actions are visited does the formula activates. Your implementation must handle N(s,a) = 0 explicitly to avoid divide-by-zero.

Worked Numerical Example: UCB1 in Action

State s, three actions, c = 2.0, N(s) = 20 total visits so far:

ActionN(s,a)Q(s,a)Bonus c√(logN(s)/N(s,a))UCB1 score
a1128.52√(log20/12) = 2√(0.249) = 0.998.5 + 0.99 = 9.49
a277.02√(log20/7) = 2√(0.428) = 1.317.0 + 1.31 = 8.31
a319.02√(log20/1) = 2√(3.0) = 3.469.0 + 3.46 = 12.46 ← selected

Interesting: a3 has the highest Q (9.0) and the highest exploration bonus (it's only been tried once). UCB1 correctly selects it. After this simulation, a3's bonus will shrink. If it returns value 6.0, Q(s,a3) updates to (9.0×1 + 6.0)/2 = 7.5, N(s,a3) = 2. Its new bonus = 2√(log21/2) = 2√(1.52) = 2.47. New UCB1 = 7.5 + 2.47 = 9.97 — still selected again, but less dominantly. Over time, if it keeps returning low values, a1 will take over.

The logarithm is crucial. Without it: bonus = c/√(N(s,a)). Every action gets the same allocation of attention regardless of N(s). With log(N(s)): as total visits grow, the bonus for every action slowly inflates, ensuring no action is permanently abandoned. This is what gives UCB1 its O(K log T) regret guarantee in the bandit setting.
What happens to the UCB1 exploration bonus for action a as N(s,a) → ∞, while N(s) also grows proportionally (N(s) = 4×N(s,a))?

Chapter 7: Heuristic Search

All the methods so far make no assumptions about the problem structure. Heuristic search incorporates domain knowledge to guide the search toward high-value regions, dramatically reducing the number of states visited.

The core idea: maintain an approximate value function U(s) (the "heuristic"). At each step of the simulation, instead of a random rollout, follow the greedy policy with respect to U. After each simulation, update U using a Bellman backup at every state along the trajectory.

pseudocode
# Heuristic Search (one simulation)
function heuristic_sim(s, d, U):
  if d == 0: return U(s)
  # greedy action w.r.t. current U
  a = arg maxa [R(s,a) + γ ∑s′ T(s′|s,a) U(s′)]
  s′ ~ T(s,a)   # sample one successor
  val = R(s,a) + γ · heuristic_sim(s′, d−1, U)
  # Bellman backup: update U(s) with the lookahead estimate
  U(s) ← maxa [R(s,a) + γ ∑s′ T(s′|s,a) U(s′)]
  return val

Over many simulations, U(s) converges toward U*(s) along the most-visited states. This is real-time dynamic programming (RTDP) — an online version of value iteration that only updates states on simulated trajectories.

Admissible heuristics: A heuristic h(s) is admissible if h(s) ≥ U*(s) for all s (it never underestimates the optimal value). Starting from an admissible heuristic guarantees monotone convergence — U(s) can only decrease toward U*(s), never overshoot. The simplest admissible heuristic: Rmax / (1 − γ) (assume maximum reward forever). Tighter heuristics = faster convergence.

Labeled RTDP (LRTDP) extends this with a convergence test: a state s is labeled "solved" once U(s) has stopped changing more than ε. Solved states are never re-expanded. The algorithm terminates when the root state is solved. This provides a practical stopping criterion that MCTS and RTDP lack.

MethodGuides search howConvergence guarantee
MCTSUCB1 (value + novelty)No — run for fixed m simulations
RTDPGreedy w.r.t. heuristicYes — to relevant states only
LRTDPGreedy + solved labelingYes — terminates when root solved
When to use heuristic search: When you have domain knowledge that makes a good heuristic cheap to compute. For example, in route planning: straight-line distance is an admissible heuristic (actual route is always longer). In puzzle solving: number of misplaced tiles. In robotics: Euclidean distance to goal. Without a good heuristic, MCTS usually outperforms RTDP.

Convergence: When Does RTDP Stop?

RTDP has no built-in stopping criterion: it runs trials until the caller decides to stop. This is a practical weakness. Labeled RTDP (LRTDP) adds a convergence check after each trial:

After completing a trial ending at some state sleaf, walk back along the trial path. For each state s on the path, check whether the Bellman residual is small enough:

|U(s) − maxa[R(s,a) + γ ∑s′ T(s′|s,a) U(s′)]| ≤ ε

If this holds, and all successors of s are already labeled solved, then mark s as "solved." Once the root state s0 is labeled solved, LRTDP terminates. The key insight: you only need to solve states reachable from s0 under the greedy policy — which may be a tiny fraction of the full state space.

LRTDP vs. MCTS: convergence vs. anytime. LRTDP provides a termination guarantee: it finishes when the value at s0 has converged within ε. MCTS is "anytime": it can stop after any number of simulations and give the best action found so far. LRTDP is better when you need a guarantee; MCTS is better when you have a fixed time budget and want the best possible answer within it.

A* for MDPs: Connecting to Classical Search

In classical graph search (pathfinding), A* uses a heuristic h(s) to prioritize promising states. The same idea works in MDPs: maintain a priority queue of states to expand, ordered by f(s) = g(s) + h(s), where g(s) is the cost-to-come and h(s) is the admissible heuristic for cost-to-go.

For discounted MDPs, the equivalent quantity is: f(s) = −Ulo(s) (negate because we maximize reward, not minimize cost). States with high potential value are expanded first. This is essentially RTDP with a priority queue instead of a stack.

LAO* (Hansen & Zilberstein, 2001): The canonical heuristic search algorithm for stochastic MDPs. It runs A*-like expansions over the AND-OR tree, maintaining a partial solution graph. When the partial graph is consistent (every state in it is solved), LAO* terminates. Requires an admissible heuristic for correctness but is often 1–2 orders of magnitude faster than value iteration for structured problems.
python
# RTDP: Real-Time Dynamic Programming
def rtdp_trial(env, s0, U, gamma=0.95, depth=20):
    """One greedy simulation from s0, updating U along the way."""
    s = s0
    for _ in range(depth):
        if env.is_terminal(s): break
        # Bellman backup at s
        a_star = max(env.actions, key=lambda a:
            env.reward(s,a) + gamma * sum(
                T * U[s_next] for s_next, T in env.transitions(s,a)))
        U[s] = env.reward(s, a_star) + gamma * sum(
            T * U[s_next] for s_next, T in env.transitions(s, a_star))
        # Greedy step: sample from T(s, a_star)
        s = env.sample_next(s, a_star)
    return U  # updated in-place along the visited trajectory

def rtdp(env, s0, U_init, num_trials=500):
    U = dict(U_init)           # start from admissible heuristic
    for _ in range(num_trials):
        rtdp_trial(env, s0, U)
    a_star = max(env.actions, key=lambda a:
        env.reward(s0,a) + gamma * sum(T*U[s_next] for s_next,T in env.transitions(s0,a)))
    return a_star

After enough trials, U converges to U* on all states reachable from s0 under the greedy policy. States never visited are never updated — this is the efficiency gain. For problems where only a small fraction of states are relevant from the start state, RTDP is dramatically faster than full value iteration.

LRTDP "labels" states as solved. What does this label mean, and how does it speed up the algorithm?

Chapter 8: Open-Loop Planning

Every method we've discussed so far is closed-loop: at each step, the action depends on the observed state. A policy π(s) maps states to actions. This means the plan is a contingency tree — it branches on what the environment might do.

Open-loop planning drastically simplifies this. Instead of a policy (a function), find the best fixed sequence of actions a1, a2, …, ad. Ignore what states are actually visited. Just optimize the expected return of the sequence.

maximize   E[ ∑t=1d γt−1 R(St, at) ]   over action sequences a1:d

Where St evolves stochastically according to the dynamics: St+1 ~ T(St, at). The sequence a1:d is fixed; only the states are random. This makes the search space a tree of depth d where each node branches on only |A| actions (not |A|×|S|).

Closed-loop advantages:

• Adapts to observed states
• Optimal in stochastic environments
• Policy is a contingency plan (if s, do a)
• Quality doesn't degrade with randomness

Open-loop advantages:

• Search only d×|A| parameters
• Optimization may be convex
• Very fast per decision
• Easy to warm-start (shift the sequence)

Open-loop Monte Carlo: A practical open-loop algorithm: for each candidate sequence a1:d, run k Monte Carlo rollouts to estimate expected return. Search the sequence space by random sampling or evolutionary methods. After replanning at each step, the stale sequence is discarded anyway, so the suboptimality of open-loop is partly mitigated by frequent replanning.

Open-loop is most useful when:

• The environment is nearly deterministic (dynamics are predictable).
• The planning horizon is very short (little time for stochasticity to accumulate).
• Fast replanning frequency compensates for plan staleness.
• The action space is continuous and the optimization is structured (e.g., model-predictive control).

Connection to Model Predictive Control (MPC): Open-loop planning is exactly what MPC does. In each control step: optimize a d-step action sequence (often via gradient-based methods with a quadratic objective), execute the first action, observe, replan. MPC is the continuous-action open-loop planning method, and it dominates industrial control because it handles constraints naturally and is computationally efficient when the system model is linear.
In open-loop planning, what does the planner optimize, and why is this simpler than closed-loop?

Open-Loop MCTS: The Best of Both Worlds

You can combine the advantages of MCTS (adaptive simulation budget) with open-loop planning (simple search space). In Open-Loop MCTS, the tree nodes are action sequences rather than states. Each path from root to leaf represents the sequence a1, a2, …, ak. UCB1 still governs selection.

The key difference: when evaluating a leaf (a sequence), you run m simulations starting from the current state, executing the sequence. The stochasticity is in the environment transitions, not the sequence itself. This averages out noise while keeping the search space compact.

python
# Open-Loop MCTS: tree nodes are action sequences
def evaluate_sequence(env, s0, action_seq, m=20, gamma=0.95):
    """Average return of action_seq over m stochastic rollouts from s0."""
    total = 0.0
    for _ in range(m):
        s, ret, discount = s0, 0.0, 1.0
        for a in action_seq:
            s_next, r = env.step(s, a)
            ret += discount * r
            discount *= gamma
            s = s_next
            if env.is_terminal(s): break
        total += ret
    return total / m

Open-Loop MCTS is effective when the action sequence space is smaller than the state space (e.g., |A|d << |S|), or when state observations are unreliable and the tree should not branch on them. It is widely used in games with chance elements and in robotics with uncertain state estimates.

When Does Open-Loop Degrade?

Open-loop planning fails gracefully as stochasticity increases. Consider a simple example: you're steering a car toward a target in crosswind. The wind is random, uniformly in [−W, +W] at each step.

Low wind (W = 0.1): Open-loop plan "steer left 3 times, then straight" works fine. The wind pushes you only 0.3 units off course total over 3 steps. Small error.
High wind (W = 5.0): The plan fails catastrophically. After 3 steps, you could be 15 units off course in either direction. A closed-loop policy that observes position and corrects is the only robust option.

The fundamental limit of open-loop: An open-loop plan cannot exploit information that arrives during execution. Every piece of information you observe after committing to the plan is wasted. This is the core reason why closed-loop (state-contingent) policies are better in stochastic environments. Open-loop is the right choice only when: (1) replanning frequency is high, (2) stochasticity is low, or (3) the optimization structure makes closed-loop plans computationally infeasible.

Quantifying the Cost of Open-Loop

The gap between open-loop and closed-loop value is bounded by:

V*closed(s) − V*open(s) ≤ γδ/(1−γ)

where δ measures the stochasticity of the environment (specifically, the maximum variation of the next-state distribution across actions). When δ = 0 (deterministic environment), open-loop = closed-loop. As δ grows, the gap grows proportionally. The factor 1/(1−γ) amplifies long-horizon stochasticity.

Chapter 9: Connections

Online planning is not an isolated topic. It sits at the intersection of search, reinforcement learning, control theory, and game theory. Understanding these connections reveals why online planning is structured the way it is — and where the next improvements will come from.

Online planning vs. RL: Reinforcement learning (Ch 11–13) also learns to act in MDPs, but by accumulating experience from real environment interactions and updating a policy or Q-function. Online planning uses a model (the generative model) to simulate without real interactions. The tradeoff: RL requires many real interactions but no model; online planning requires a model but potentially zero real interactions. In practice, the best systems combine both: learn a model from real data, plan online using that model (this is model-based RL, Ch 16).

The Central Theme of This Chapter

Every method here exploits one insight: you do not need to solve the entire MDP. Plan from where you are, to a feasible depth, and replan each step. The methods differ only in how they allocate limited computation across the search tree.

The deepest pattern: All online planning methods can be understood as approximating the Bellman equation from the current state. Forward search does this exactly but expensively. Sparse sampling does it approximately but independently of |S|. MCTS does it adaptively, spending more computation on branches that look promising. Heuristic search does it efficiently using domain knowledge. Open-loop does it cheaply by ignoring state feedback. The Bellman equation is the target; the methods are different computational strategies to hit it.
MethodComplexity (depth d)Best forKey weakness
Receding horizonAny plannerFramework for all methodsMyopic if d too small
Rollout lookaheadO(m|A|d)Simple improvement over any policyLimited by rollout quality
Forward searchO((|S||A|)d)Tiny problems, baselineExponential in d and |S|
Branch & boundO((|S||A|)d) worstWhen good bounds existBound quality critical
Sparse samplingO((m|A|)d)Large/continuous state spacesStill exponential in d
MCTSO(m·d)Games, combinatorial problemsc tuning, no bounds guarantee
Heuristic (RTDP)Problem-dependentProblems with good heuristicsNeeds admissible heuristic
Open-loopOptimization costNearly deterministic, continuous actionSuboptimal with high stochasticity

Decision Flowchart: Which Method to Use?

Is |S| tiny (<100)?
Use Forward Search or Branch & Bound. Exact, provably optimal.
↓ No
Do you have a good domain heuristic?
Use RTDP / LRTDP. Convergence guarantee, fast on structured problems.
↓ No
Is the action space discrete and the problem stochastic?
Use MCTS. State of the art for games, combinatorial decision problems, planning under uncertainty.
↓ No: continuous actions or nearly deterministic
Use Open-Loop Planning / MPC
Optimize an action sequence. Efficient, convex if dynamics are linear.

Real-World Impact

AlphaGo & AlphaZero (DeepMind, 2016–2017):

MCTS + deep neural network for both the rollout policy (prior over actions) and the value estimate at leaf nodes (replaces random rollouts). UCB1 is modified to incorporate the prior: score(a) = Q(s,a) + c·P(a|s)·√(N(s)/(1+N(s,a))). This combination beat the world champion at Go, Chess, and Shogi — the strongest demonstration of online planning in history.

ACAS X (FAA, 2019):

Collision avoidance for commercial aviation. State space: ~2.5 billion cells (relative geometry, speeds, altitudes). Offline value iteration with function approximation compresses U(s) into a neural network. Online planning to depth ~20 seconds queries this network at leaves. Operates in 100ms on embedded hardware. First certified NN-based safety system in aviation.

MuZero (DeepMind, 2019):

MCTS without a known model. Instead, the neural network learns a latent dynamics model: given state s and action a, predict the next latent state, reward, and value. MCTS then plans in latent space. Applied to Go, Chess, Shogi, and 57 Atari games simultaneously — the first system to achieve this without domain-specific rules.

Model Predictive Path Integral (MPPI, 2017):

Open-loop planning for continuous control. Sample K action sequences from a Gaussian distribution, roll each out through the dynamics, weight by exp(return/λ), update the distribution toward high-return sequences. Used in autonomous vehicle racing and drone acrobatics. GPU-parallel rollouts make it real-time feasible.

The MCTS Algorithm Variants Family

VariantKey modificationUsed in
UCT (UCB1 applied to Trees)UCB1 at every tree nodeOriginal MCTS for Go (Kocsis & Szepesvári 2006)
PUCTUCB1 weighted by neural prior P(a|s)AlphaGo, AlphaZero, MuZero
Double Progressive WideningGradually expands continuous action/state spacesContinuous-action MCTS
POMCPMCTS over belief-state trees for POMDPsPartially observable planning (Ch 22)
Temporal Difference MCTSUses TD learning for leaf evaluation instead of rolloutRL + planning hybrids

Implementation Checklist: Getting MCTS Right

MCTS is simple to understand but has many implementation pitfalls. This checklist covers the most common bugs:

Correctness:

• Handle N(s,a)=0 explicitly (return ∞, don't divide)
• Initialize Q=0 for all new actions (not −∞)
• Use running average for Q update, not sum
• Final action selection: max Q, not max UCB1
• Don't reuse the tree across decisions (the root state changed)

Performance:

• Vectorize rollout simulation if possible
• Use dictionaries keyed by state hash, not tuples
• Limit rollout depth (5–15 steps is usually enough)
• Tune c on a small representative set of initial states
• Terminal states: return 0 without a rollout

What Comes Next

Chapter 10: Policy Search
Skip the value function. Directly optimize policy parameters θ (e.g., neural network weights) using gradient methods. REINFORCE, CEM, CMA-ES. The online planner becomes the environment for gradient estimation.
Chapter 22: Online Belief-State Planning
Extend online planning to POMDPs. State is unknown; only observations are available. The "state" is now a belief distribution b(s). POMCP runs MCTS over belief trees, sampling states and observations to estimate action values.
Chapter 23: Controller Abstractions
Macro-actions and hierarchical planning. Instead of single steps, plan over temporally extended options. The action tree is shallower; each node represents a subtask. Dramatically extends the horizon reachable by online planning.

Related Micro-Lessons

If you want to build intuition for the probability and value-function machinery underlying these methods:

MDP — Markov decision processes, value iteration, the Bellman equation
RL Algorithms — Q-learning, policy gradient, actor-critic
Bayes Filter — recursive belief update (underlying POMDP planning)
POMDP — planning under observation uncertainty; the setting for Ch 22

"The purpose of computing is insight, not numbers." — Richard Hamming

"If you can't explain it simply, you don't understand it well enough." — Richard Feynman

Online planning is the art of understanding a problem well enough to find the best action, without understanding it perfectly. The Bellman equation tells us what the answer is; the algorithms of this chapter tell us how to compute it in time.

Summary: What Each Algorithm "Knows" About the Problem

Each algorithm exploits a different piece of problem structure:

Rollout lookahead assumes any policy can produce a useful value signal after one step. Requires: a rollout policy, a generative model.
Forward search assumes the full transition distribution T(s′|s,a) is known and computable. Requires: exact model.
Branch & bound assumes tight upper/lower bounds on value can be computed cheaply. Requires: domain-specific bounds.
Sparse sampling assumes the generative model can sample efficiently. Requires: generative model, no transition enumeration needed.
MCTS assumes the generative model is fast and adaptive allocation via UCB1 is beneficial. Requires: fast generative model, tuned c.
Heuristic search assumes a cheap admissible heuristic exists. Requires: domain knowledge, admissible initial U.
Open-loop assumes the environment is nearly deterministic or action sequences can be optimized efficiently. Requires: low stochasticity or structured dynamics.

AlphaGo uses MCTS with a neural network. What two roles does the neural network play?