Don't plan everything in advance. Plan from where you are, right now. Forward search, branch-and-bound, sparse sampling, POMCP, and DESPOT for real-time POMDP decision making.
Imagine you're a robot in a building you've never seen. Offline planning says: "Before you leave home, compute the optimal policy for every possible belief you might encounter in that building." That's like memorizing responses to every possible conversation before going to a party. Exhaustive and mostly wasted effort.
Online planning takes the opposite approach. At each decision step, look at your current belief b. Build a search tree rooted at b. Expand it as far as time allows. Return the best action. Move. Get a new observation. Update your belief. Repeat. You only ever plan from beliefs you actually encounter.
Offline methods (Ch 21):
Compute π(b) for the entire belief simplex before execution. High upfront cost. Zero per-step cost. May waste computation on unreachable beliefs. Policy is a lookup table at runtime.
Online methods (this chapter):
At each step, build a search tree from the current belief. Zero upfront cost. Higher per-step cost. Only explores reachable beliefs. Policy is computed fresh at each step.
All online methods build a search tree alternating between action and observation nodes:
Before building full search trees, consider the simplest possible online method: lookahead with rollouts. For each candidate action a, run m simulations forward. In each simulation: sample a state s ~ b, take action a, sample a transition and observation, then follow a rollout policy πR for H steps. Average the returns:
Pick the action with the highest estimated Q-value. The rollout policy πR can be anything: random actions, a simple heuristic, or the QMDP policy from offline precomputation. The rollout doesn't need to be optimal — it just needs to estimate the value better than zero.
Crying Baby POMDP at belief P(hungry)=0.5. Compare Q-value estimates for 3 actions under random vs. QMDP rollout policies. More samples = less variance. Better rollout = less bias. Click Simulate to draw samples.
Rollouts estimate Q(b,a) with simulation noise. Forward search computes it exactly (within a tree), by summing over all observations at each step. The recursion:
This is exact: no sampling noise. The leaf value Uapprox comes from offline precomputation. Better leaf estimates allow shallower search: with PBVI leaves, depth 2 can be nearly as good as random-rollout depth 10.
2-action (feed/listen), 2-observation (cry/quiet) POMDP. Squares = action nodes (teal). Circles = observation nodes (orange). Each level doubles the number of branches in each dimension. Click Expand Level to watch the exponential growth.
Forward search builds the full tree and wastes time on obviously bad actions. Branch and bound uses precomputed upper and lower bounds to prune branches that cannot possibly beat the current best solution.
The algorithm maintains two quantities at each node:
Pruning rule: if Q̅(b, a) < U̲(b), skip action a entirely. It cannot beat what we already have. This is valid because:
Crying Baby POMDP at belief P(hungry)=0.5. Three actions with their upper bounds (FIB) and lower bound (PBVI). Adjust the belief to see how pruning changes. Actions whose upper bound falls below the best lower bound are pruned (grayed out).
| Method | Prunes actions? | Prunes observations? | Requires offline bounds? |
|---|---|---|---|
| Forward search | No | No | Just leaf evaluator |
| Branch and bound | Yes | No | Yes (UB + LB) |
| Gap heuristic (Ch 8) | Yes | Yes | Yes (UB + LB) |
Forward search sums over every observation. For |O| = 100 and depth 5, that's 1005 = 10 billion observation branches. Sparse sampling replaces the exact sum with a stochastic approximation: draw m observation samples instead of enumerating all of them.
where oa(i) ~ P(o|b,a) are sampled observations. The branching factor drops from |O| to m. Crucially, m doesn't depend on |O| at all. Whether |O| = 10 or |O| = 10 million, the same m gives the same approximation quality (up to statistical noise).
The tradeoff: sparse sampling introduces variance. The estimate of Q(b,a) has variance O(1/m) at each node. But the variance can be controlled by choosing m, whereas the explosion from |O| cannot. For many practical problems, m=10 to m=30 gives excellent results.
| Method | Branching factor | Total nodes at depth d | Observation space needed? |
|---|---|---|---|
| Forward search | |O| per step | O(|A|d|O|d) | Enumerable |O| |
| Sparse sampling | m per step | O(|A|dmd) | Only need to sample from P(o|b,a) |
| DESPOT (Ch 6) | K total | O(|A|dK) | Only need generative model |
Sparse sampling builds a tree of fixed depth and uniform breadth. Partially Observable Monte Carlo Planning (POMCP) adapts MCTS to POMDPs, spending more computation on promising branches and less on hopeless ones.
The core adaptation: MCTS normally indexes nodes by states. Since the agent doesn't know the state, POMCP indexes by action-observation histories h = (a1, o1, ..., at, ot). Each history node stores Q(h, a) estimates and visit counts N(h, a). The UCB action selection:
where N(h) = ∑a N(h, a) is total visits at history h. The c parameter balances exploration and exploitation.
The particle filter innovation: instead of representing beliefs explicitly, POMCP maintains a particle set at each history node — the collection of states that have been visited at that node across simulations. This implicitly approximates the belief without ever computing it.
julia (POMCP core) function pomcp_simulate(pomdp, s, h, depth, tree, c) if depth == 0; return rollout(pomdp, s, 10); end if !haskey(tree, h) # New history node: initialize Q-values tree[h] = Dict(a => (Q=0.0, N=0) for a in actions(pomdp)) return rollout(pomdp, s, depth) end # UCB action selection N_h = sum(n.N for n in values(tree[h])) a = argmax(a -> tree[h][a].Q + c*sqrt(log(N_h+1)/(tree[h][a].N+1)), actions(pomdp)) # Sample transition sp, o, r = gen(pomdp, s, a) h_new = (h..., a, o) # extend history # Add particle to new history node push!(get!(Set, particles, h_new), sp) G = r + discount(pomdp) * pomcp_simulate(pomdp, sp, h_new, depth-1, tree, c) # Backup tree[h][a] = (Q = tree[h][a].Q + (G - tree[h][a].Q)/(tree[h][a].N+1), N = tree[h][a].N + 1) return G end function pomcp_action(pomdp, belief, n_sims) tree = Dict(); particles = Dict() for _ in 1:n_sims s = rand(belief) pomcp_simulate(pomdp, s, (), 5, tree, 1.0) end return argmax(a -> tree[()][a].N, actions(pomdp)) end
POMCP draws fresh random observations at each simulation. Determinized Sparse Tree Search (DESPOT) takes a different approach: pre-generate K fixed random scenarios before search begins. A scenario φ = (φ1, φ2, ...) is a sequence of random numbers that deterministically specifies the environment's behavior (transitions and observations) at each depth.
Given action sequence a1, ..., ad and scenario φ, the entire trajectory is determined. This means actions a and a' can be compared under exactly the same random events. This is analogous to common random numbers in simulation: variance reduction by correlating random draws across alternatives.
The tree structure: DESPOT doesn't branch on observations. Instead, each scenario follows a unique path through the tree determined by its sequence of random numbers. The tree has at most K leaves at any depth, and total size O(|A|dK) — independent of |O|.
| Property | Sparse Sampling | DESPOT |
|---|---|---|
| Observation branching | m samples per node | K fixed scenarios total |
| Total leaf count | O(|A|dmd) | O(|A|dK) |
| Action comparison | Different random seeds | Same scenarios (lower variance) |
| Belief representation | Explicit or particle | Implicit via scenarios |
julia (DESPOT core) struct Scenario s0::Int # initial state sampled from belief rands::Vector{Float64} # random numbers for each depth level end function despot_value(pomdp, b, depth, scenarios, bounds) if depth == 0 return mean(bounds.lower(pomdp, s) for (s,_) in scenarios) end best_val = -Inf for a in actions(pomdp) # All K scenarios face the same action a # Group by which observation they produce obs_groups = Dict{Int, Vector}() total_r = 0.0 for (s, rand_seq) in scenarios sp, o, r = gen_deterministic(pomdp, s, a, rand_seq[depth]) total_r += r push!(get!(obs_groups, o, []), (sp, rand_seq)) end future = sum(for (o, group) in obs_groups length(group)/length(scenarios) * despot_value(pomdp, nothing, depth-1, group, bounds)) val = total_r/length(scenarios) + discount(pomdp) * future if val > best_val; best_val = val; end end return best_val end
Compare tree sizes: DESPOT (K scenarios, fixed paths) vs forward search (all observations). X-axis: depth d. Y-axis: number of leaf nodes (log scale). Adjust K and |O| to see the crossover point where DESPOT wins.
Watch POMCP build its search tree in real time on the Crying Baby POMDP. Each simulation starts at the root belief, navigates down using UCB, and backs up the return. The tree grows asymmetrically: promising branches get more visits. Run enough simulations and the most visited action at the root is reliably the best.
Root = belief P(hungry)=0.5. Actions: feed (F), ignore (I), listen (L). Squares = action nodes; brightness shows Q-value. Circles = observation nodes. Number = visit count. Click Simulate to run one MCTS iteration. Run many to see UCB focus on the best action.
Branch and bound prunes action branches using upper/lower bounds. The gap heuristic applies a similar idea to observation branches: instead of exploring all (or m random) observations at each node, selectively expand only the most informative ones.
The gap at a belief b is the difference between upper and lower bounds: gap(b) = U̅(b) − U̲(b). The gap heuristic expands the observation branch that maximizes the probability-weighted gap:
This targets observations that are both likely to occur and have high uncertainty in their value. If an observation is very likely but already well-understood (small gap), it's not worth expanding. If it's very uncertain but very unlikely, it's also not worth expanding. The product captures the joint importance.
| Online Method | Action branching | Obs branching | Termination |
|---|---|---|---|
| Rollouts | All | Sample (1) | Fixed H |
| Forward search | All | All |O| | Fixed depth |
| Branch & bound | Pruned by bounds | All |O| | Fixed depth |
| Sparse sampling | All | m samples | Fixed depth |
| POMCP | UCB-guided | Sampled per sim | Fixed n_sims |
| DESPOT | All | K scenarios | Fixed depth |
| Gap heuristic | Pruned by bounds | Gap-selected | Gap < ε |
Online POMDP planning covers a rich set of methods, each making different tradeoffs. The core tension is between: complete enumeration (forward search) and smart sampling/pruning (everything else). Here's the complete landscape:
| Method | Core idea | Scales to large |O|? | Guarantee? |
|---|---|---|---|
| Rollouts | Average returns from sampled trajectories | Yes (generative model) | No |
| Forward search | Exact tree expansion to depth d | No (enumerates |O|) | Exact at depth d |
| Branch & bound | Prune actions via offline bounds | No (enumerates |O|) | Yes (within gap) |
| Sparse sampling | Sample m observations per node | Yes | PAC guarantee |
| POMCP | UCB-guided tree with particle beliefs | Yes (generative model) | Anytime, no hard bound |
| DESPOT | Pre-fixed K scenarios | Yes (generative model) | Regularized bound |
| Gap heuristic | Gap-guided observation selection | Depends on bound eval | Yes (gap < ε) |