Kochenderfer et al., Chapter 22

Online Belief State Planning

Don't plan everything in advance. Plan from where you are, right now. Forward search, branch-and-bound, sparse sampling, POMCP, and DESPOT for real-time POMDP decision making.

Prerequisites: Chapters 20–21. Alpha vectors, PBVI lower bound, FIB upper bound.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Online vs Offline

Imagine you're a robot in a building you've never seen. Offline planning says: "Before you leave home, compute the optimal policy for every possible belief you might encounter in that building." That's like memorizing responses to every possible conversation before going to a party. Exhaustive and mostly wasted effort.

Online planning takes the opposite approach. At each decision step, look at your current belief b. Build a search tree rooted at b. Expand it as far as time allows. Return the best action. Move. Get a new observation. Update your belief. Repeat. You only ever plan from beliefs you actually encounter.

Offline methods (Ch 21):

Compute π(b) for the entire belief simplex before execution. High upfront cost. Zero per-step cost. May waste computation on unreachable beliefs. Policy is a lookup table at runtime.

Online methods (this chapter):

At each step, build a search tree from the current belief. Zero upfront cost. Higher per-step cost. Only explores reachable beliefs. Policy is computed fresh at each step.

The key advantage of online planning: The belief space reachable from the current belief is typically a tiny fraction of the full simplex. Offline methods waste computation on beliefs the agent will never visit. Online methods automatically focus on what matters. This makes them scale much better to large problems — even when the full belief simplex is astronomically large.

All online methods build a search tree alternating between action and observation nodes:

Root: current belief b
The agent's current probability distribution over states.
↓ branch on actions
Action node: choose a ∈ A
One child per action. Expected reward + discounted future value.
↓ branch on observations
Observation node: observe o ∈ O
One child per observation. Belief updates via Bayes' rule.
↓ recurse or evaluate leaf
Leaf: approximate value U(b)
Use QMDP, FIB, or PBVI from offline precomputation.
The tree size explosion: At depth d with |A| actions and |O| observations, the tree has O(|A|d|O|d) nodes. For |A|=4, |O|=10, d=5: that's 45·105 = 3.2 billion nodes. Every online method in this chapter is essentially a different strategy for managing this exponential explosion.
What is the primary computational advantage of online over offline POMDP planning?

Chapter 1: Lookahead with Rollouts

Before building full search trees, consider the simplest possible online method: lookahead with rollouts. For each candidate action a, run m simulations forward. In each simulation: sample a state s ~ b, take action a, sample a transition and observation, then follow a rollout policy πR for H steps. Average the returns:

Q(b, a) ≈ (1/m) ∑i=1mt=0H γt rt(i)

Pick the action with the highest estimated Q-value. The rollout policy πR can be anything: random actions, a simple heuristic, or the QMDP policy from offline precomputation. The rollout doesn't need to be optimal — it just needs to estimate the value better than zero.

Generative model = just a simulator: Rollout methods don't need explicit T(s'|s,a) and O(o|a,s') tables. They only need a generative model: a black-box function gen(s,a) → (s', o, r). This makes rollouts applicable to problems where the POMDP model is a simulator (a physics engine, a game engine, etc.) rather than a probability table. This is a huge practical advantage.
Rollout Quality vs. Rollout Policy

Crying Baby POMDP at belief P(hungry)=0.5. Compare Q-value estimates for 3 actions under random vs. QMDP rollout policies. More samples = less variance. Better rollout = less bias. Click Simulate to draw samples.

Run simulations to compare
Samples m50
Bias-variance tradeoff: More samples m reduces variance (estimates become consistent) but doesn't fix bias (a bad rollout policy gives systematically wrong estimates). A better rollout policy reduces bias but doesn't help variance. The ideal: use a cheap, reasonable rollout (e.g., QMDP) with moderate m (e.g., 20-50). This often outperforms random rollouts with m=1000.
When rollouts fail: If the optimal action requires a long horizon to distinguish itself (e.g., a 5-step detour that reveals hidden information), rollouts with H=3 will miss it entirely. Shallow rollouts work when the good action pays off quickly. Deep tree search (forward search, MCTS) is needed for long-horizon dependencies.
Why can rollout methods work with problems where T and O are not explicitly available?

Chapter 2: Forward Search

Rollouts estimate Q(b,a) with simulation noise. Forward search computes it exactly (within a tree), by summing over all observations at each step. The recursion:

Ud(b) = maxa Qd(b, a)
Qd(b, a) = R(b, a) + γ ∑o ∈ O P(o|b,a) · Ud−1(Update(b, a, o))
U0(b) = Uapprox(b)    (leaf: QMDP, FIB, or PBVI)

This is exact: no sampling noise. The leaf value Uapprox comes from offline precomputation. Better leaf estimates allow shallower search: with PBVI leaves, depth 2 can be nearly as good as random-rollout depth 10.

The depth-quality tradeoff: Forward search at depth d has O(|A|d|O|d) nodes. For the machine replacement problem (4 actions, 3 observations), depth 5 → ~250,000 nodes. Depth 10 → ~62 billion. In practice, d=2 or d=3 with good leaf evaluators is the sweet spot. Beyond d=4, the cost is prohibitive without pruning.
Forward Search Tree Construction

2-action (feed/listen), 2-observation (cry/quiet) POMDP. Squares = action nodes (teal). Circles = observation nodes (orange). Each level doubles the number of branches in each dimension. Click Expand Level to watch the exponential growth.

Depth: 0 | Nodes: 1
Strategies to tame depth: (1) Maximum likelihood observation: only branch on the most likely observation, not all of them. Cheap but biased. (2) Domain pruning: skip dominated actions based on domain knowledge. (3) Hybrid depth: use forward search for d=1 with MCTS for deeper exploration. (4) Branch and bound (next chapter): prune branches using precomputed bounds.
Forward search at depth d with |A| actions and |O| observations builds how many leaf nodes?

Chapter 3: Branch and Bound

Forward search builds the full tree and wastes time on obviously bad actions. Branch and bound uses precomputed upper and lower bounds to prune branches that cannot possibly beat the current best solution.

The algorithm maintains two quantities at each node:

Pruning rule: if Q̅(b, a) < U̲(b), skip action a entirely. It cannot beat what we already have. This is valid because:

Q*(b, a) ≤ Q̅(b, a) < U̲(b) ≤ U*(b)
The synergy between offline and online: Branch and bound is the canonical example of offline-online synergy. The offline phase (Ch 21) runs PBVI and FIB to get tight bounds. The online phase uses those bounds to skip large portions of the search tree. Tighter offline bounds → more pruning → deeper online search within the same time budget.
When does pruning help most? When the best action's lower bound is tight (PBVI has converged well) and the suboptimal actions' upper bounds are clearly below it. Near the corners of the belief simplex (high certainty), the optimal action is obvious and pruning is aggressive. Near the center (maximum uncertainty), all actions look similar and pruning is weak.
Branch and Bound: Pruning in Action

Crying Baby POMDP at belief P(hungry)=0.5. Three actions with their upper bounds (FIB) and lower bound (PBVI). Adjust the belief to see how pruning changes. Actions whose upper bound falls below the best lower bound are pruned (grayed out).

P(hungry)0.50
MethodPrunes actions?Prunes observations?Requires offline bounds?
Forward searchNoNoJust leaf evaluator
Branch and boundYesNoYes (UB + LB)
Gap heuristic (Ch 8)YesYesYes (UB + LB)
Branch and bound can prune action a from the search tree when:

Chapter 4: Sparse Sampling

Forward search sums over every observation. For |O| = 100 and depth 5, that's 1005 = 10 billion observation branches. Sparse sampling replaces the exact sum with a stochastic approximation: draw m observation samples instead of enumerating all of them.

Qd(b, a) ≈ R(b, a) + (γ/m) ∑i=1m Ud−1(Update(b, a, oa(i)))

where oa(i) ~ P(o|b,a) are sampled observations. The branching factor drops from |O| to m. Crucially, m doesn't depend on |O| at all. Whether |O| = 10 or |O| = 10 million, the same m gives the same approximation quality (up to statistical noise).

Why this is a big deal: Continuous observation spaces (e.g., sensor readings, camera images) have |O| = ∞. Forward search is impossible. Sparse sampling handles them trivially: just sample m=20 observations from P(o|b,a) at each node. The approximation quality scales with m, not |O|. This is what makes POMDP methods applicable to robotics problems with continuous sensors.

The tradeoff: sparse sampling introduces variance. The estimate of Q(b,a) has variance O(1/m) at each node. But the variance can be controlled by choosing m, whereas the explosion from |O| cannot. For many practical problems, m=10 to m=30 gives excellent results.

MethodBranching factorTotal nodes at depth dObservation space needed?
Forward search|O| per stepO(|A|d|O|d)Enumerable |O|
Sparse samplingm per stepO(|A|dmd)Only need to sample from P(o|b,a)
DESPOT (Ch 6)K totalO(|A|dK)Only need generative model
Sparse sampling vs MCTS: Both use sampling to manage the observation branching. The difference: sparse sampling builds a fixed-depth tree and samples observations at every level. MCTS (Ch 5) uses UCB to focus samples on promising branches and doesn't fix depth. MCTS is generally better when time is limited; sparse sampling is better when you want a fixed-depth guarantee.
Sparse sampling uses m=20 samples. How does changing |O| from 10 to 10,000 affect the approximation quality?

Chapter 5: POMCP — Monte Carlo Tree Search for POMDPs

Sparse sampling builds a tree of fixed depth and uniform breadth. Partially Observable Monte Carlo Planning (POMCP) adapts MCTS to POMDPs, spending more computation on promising branches and less on hopeless ones.

The core adaptation: MCTS normally indexes nodes by states. Since the agent doesn't know the state, POMCP indexes by action-observation histories h = (a1, o1, ..., at, ot). Each history node stores Q(h, a) estimates and visit counts N(h, a). The UCB action selection:

a* = argmaxa &left;[ Q(h, a) + c √(log N(h) / N(h, a)) &right;]

where N(h) = ∑a N(h, a) is total visits at history h. The c parameter balances exploration and exploitation.

The particle filter innovation: instead of representing beliefs explicitly, POMCP maintains a particle set at each history node — the collection of states that have been visited at that node across simulations. This implicitly approximates the belief without ever computing it.

1. Sample
Draw state s from current particle set (approximates b). If root: s ~ b0.
2. Select
Navigate tree using UCB. At unvisited nodes, initialize with Q=0, N=0.
3. Rollout
From first unvisited node, run rollout policy for H steps. Get return G.
4. Backup
Update Q(h, a) += (G − Q(h, a)) / N(h, a) along path to root.
POMCP's key insight: The belief at history h is approximated by the particles that have been simulated through h. No explicit Bayesian belief update is needed. This lets POMCP work with complex simulators where belief updates are intractable — you just simulate. Silver and Veness (2010) demonstrated POMCP on POMDPs with 1056 states (the Battleship game) where exact methods are completely impossible.
Anytime behavior: POMCP can be stopped at any time and will return the best action found so far: argmaxa N(root, a) (most visited action). More simulations = better action. This makes POMCP ideal for systems with variable time budgets: give it what you can, and it will use the time well.
julia (POMCP core)
function pomcp_simulate(pomdp, s, h, depth, tree, c)
    if depth == 0; return rollout(pomdp, s, 10); end

    if !haskey(tree, h)
        # New history node: initialize Q-values
        tree[h] = Dict(a => (Q=0.0, N=0) for a in actions(pomdp))
        return rollout(pomdp, s, depth)
    end

    # UCB action selection
    N_h = sum(n.N for n in values(tree[h]))
    a = argmax(a -> tree[h][a].Q + c*sqrt(log(N_h+1)/(tree[h][a].N+1)), actions(pomdp))

    # Sample transition
    sp, o, r = gen(pomdp, s, a)
    h_new = (h..., a, o)  # extend history

    # Add particle to new history node
    push!(get!(Set, particles, h_new), sp)

    G = r + discount(pomdp) * pomcp_simulate(pomdp, sp, h_new, depth-1, tree, c)

    # Backup
    tree[h][a] = (Q = tree[h][a].Q + (G - tree[h][a].Q)/(tree[h][a].N+1),
                  N = tree[h][a].N + 1)
    return G
end

function pomcp_action(pomdp, belief, n_sims)
    tree = Dict(); particles = Dict()
    for _ in 1:n_sims
        s = rand(belief)
        pomcp_simulate(pomdp, s, (), 5, tree, 1.0)
    end
    return argmax(a -> tree[()][a].N, actions(pomdp))
end
POMCP indexes tree nodes by action-observation histories instead of beliefs. What is the key benefit?

Chapter 6: DESPOT

POMCP draws fresh random observations at each simulation. Determinized Sparse Tree Search (DESPOT) takes a different approach: pre-generate K fixed random scenarios before search begins. A scenario φ = (φ1, φ2, ...) is a sequence of random numbers that deterministically specifies the environment's behavior (transitions and observations) at each depth.

Given action sequence a1, ..., ad and scenario φ, the entire trajectory is determined. This means actions a and a' can be compared under exactly the same random events. This is analogous to common random numbers in simulation: variance reduction by correlating random draws across alternatives.

The scenario picture: Imagine K=50 scenarios, each a "possible future." Each scenario determines what happens at every step: which state you transition to, which observation you get. For each action a, DESPOT simulates all 50 scenarios and averages the returns. Because all actions face the same 50 futures, the comparison is fair: you're testing which action performs better on the same set of challenges.

The tree structure: DESPOT doesn't branch on observations. Instead, each scenario follows a unique path through the tree determined by its sequence of random numbers. The tree has at most K leaves at any depth, and total size O(|A|dK) — independent of |O|.

PropertySparse SamplingDESPOT
Observation branchingm samples per nodeK fixed scenarios total
Total leaf countO(|A|dmd)O(|A|dK)
Action comparisonDifferent random seedsSame scenarios (lower variance)
Belief representationExplicit or particleImplicit via scenarios
DESPOT with regularization: A naive DESPOT tree can overfit to the K scenarios. The DESPOT paper introduces a regularization term that penalizes large trees, balancing tree depth against the number of scenarios. The regularized DESPOT objective is: maxtree T V(T, K-scenarios) − λ |T|. This gives a principled tradeoff between policy expressiveness and overfitting risk.
DESPOT in practice: DESPOT has been applied to autonomous driving, robotic manipulation, and target tracking. It handles continuous observation spaces (like camera images) by treating the simulator as a generative model. K=500 scenarios and d=5 depth is a common practical setting: ~500K scenario-action evaluations, easily parallelizable across CPU cores.
julia (DESPOT core)
struct Scenario
    s0::Int          # initial state sampled from belief
    rands::Vector{Float64}  # random numbers for each depth level
end

function despot_value(pomdp, b, depth, scenarios, bounds)
    if depth == 0
        return mean(bounds.lower(pomdp, s) for (s,_) in scenarios)
    end
    best_val = -Inf
    for a in actions(pomdp)
        # All K scenarios face the same action a
        # Group by which observation they produce
        obs_groups = Dict{Int, Vector}()
        total_r = 0.0
        for (s, rand_seq) in scenarios
            sp, o, r = gen_deterministic(pomdp, s, a, rand_seq[depth])
            total_r += r
            push!(get!(obs_groups, o, []), (sp, rand_seq))
        end
        future = sum(for (o, group) in obs_groups
            length(group)/length(scenarios) *
            despot_value(pomdp, nothing, depth-1, group, bounds))
        val = total_r/length(scenarios) + discount(pomdp) * future
        if val > best_val; best_val = val; end
    end
    return best_val
end
DESPOT: Scenario Branching vs Forward Search

Compare tree sizes: DESPOT (K scenarios, fixed paths) vs forward search (all observations). X-axis: depth d. Y-axis: number of leaf nodes (log scale). Adjust K and |O| to see the crossover point where DESPOT wins.

K scenarios50
|O| observations10
DESPOT pre-generates K fixed scenarios before search. What is the primary statistical advantage over sparse sampling?

Chapter 7: SHOWCASE — MCTS Search Tree Live

Watch POMCP build its search tree in real time on the Crying Baby POMDP. Each simulation starts at the root belief, navigates down using UCB, and backs up the return. The tree grows asymmetrically: promising branches get more visits. Run enough simulations and the most visited action at the root is reliably the best.

POMCP Search Tree (Crying Baby POMDP)

Root = belief P(hungry)=0.5. Actions: feed (F), ignore (I), listen (L). Squares = action nodes; brightness shows Q-value. Circles = observation nodes. Number = visit count. Click Simulate to run one MCTS iteration. Run many to see UCB focus on the best action.

Simulations: 0 | Best action: —
UCB c1.5
Speed150ms
What to observe: With c=1.5 (balanced), UCB initially tries all actions roughly equally, then concentrates on the most promising one. Increase c to force more exploration (all branches get roughly equal visits). Decrease c toward 0 for pure greedy exploitation (one branch gets almost all visits). After ~100 simulations, the most-visited action is a reliable recommendation.
The particle approximation: In the real POMCP, each observation node maintains a particle set (a collection of states from simulations that reached that node). These particles implicitly represent the posterior belief at that history. More simulations = more particles = better belief approximation. This is the key reason POMCP doesn't need explicit belief updates at runtime.
After 200 simulations in POMCP, how should you select the action to execute?

Chapter 8: Gap Heuristic Search

Branch and bound prunes action branches using upper/lower bounds. The gap heuristic applies a similar idea to observation branches: instead of exploring all (or m random) observations at each node, selectively expand only the most informative ones.

The gap at a belief b is the difference between upper and lower bounds: gap(b) = U̅(b) − U̲(b). The gap heuristic expands the observation branch that maximizes the probability-weighted gap:

o* = argmaxo P(o|b, a) · gap(Update(b, a, o))

This targets observations that are both likely to occur and have high uncertainty in their value. If an observation is very likely but already well-understood (small gap), it's not worth expanding. If it's very uncertain but very unlikely, it's also not worth expanding. The product captures the joint importance.

Gap heuristic terminates with a quality guarantee: When the gap at the root drops below threshold ε, the algorithm stops. The returned action is within ε of optimal. This gives a formal approximation quality guarantee that MCTS and sparse sampling lack. The algorithm is essentially "stop when you're sure enough."
Combining all the tricks: The state-of-the-art algorithm HSVI (Heuristic Search Value Iteration, Shani et al. 2007) and its online cousin SARSOP-online use: (1) branch and bound to prune action branches, (2) gap heuristic to focus observation branches, (3) sawtooth upper bound updated at each visited belief, and (4) PBVI lower bound. All four mechanisms work together to enable much deeper effective search than any single technique alone.
Online MethodAction branchingObs branchingTermination
RolloutsAllSample (1)Fixed H
Forward searchAllAll |O|Fixed depth
Branch & boundPruned by boundsAll |O|Fixed depth
Sparse samplingAllm samplesFixed depth
POMCPUCB-guidedSampled per simFixed n_sims
DESPOTAllK scenariosFixed depth
Gap heuristicPruned by boundsGap-selectedGap < ε
The gap heuristic selects observation o* = argmax P(o|b,a)·gap(Update(b,a,o)). Why weight the gap by probability?

Chapter 9: Summary & Connections

Online POMDP planning covers a rich set of methods, each making different tradeoffs. The core tension is between: complete enumeration (forward search) and smart sampling/pruning (everything else). Here's the complete landscape:

MethodCore ideaScales to large |O|?Guarantee?
RolloutsAverage returns from sampled trajectoriesYes (generative model)No
Forward searchExact tree expansion to depth dNo (enumerates |O|)Exact at depth d
Branch & boundPrune actions via offline boundsNo (enumerates |O|)Yes (within gap)
Sparse samplingSample m observations per nodeYesPAC guarantee
POMCPUCB-guided tree with particle beliefsYes (generative model)Anytime, no hard bound
DESPOTPre-fixed K scenariosYes (generative model)Regularized bound
Gap heuristicGap-guided observation selectionDepends on bound evalYes (gap < ε)
Which to use? For small, tabular POMDPs with known model: forward search or branch-and-bound. For medium POMDPs with moderate |O|: sparse sampling or POMCP. For large POMDPs with continuous observations (robotics, games): POMCP or DESPOT. For theoretical guarantees: gap heuristic or HSVI. For production systems that need anytime behavior: POMCP.
Looking ahead to Ch 23: All methods in this chapter maintain an explicit or implicit belief. Chapter 23 asks: what if we never compute beliefs at all? Finite state controllers are reactive policies that transition between internal nodes based on observations alone, with no Bayesian inference. This is a fundamentally different approach to the POMDP problem.
The offline-online connection: Online methods improve dramatically when combined with offline precomputation. POMCP with a PBVI rollout policy converges 5-10x faster than with random rollouts. Branch-and-bound with tight SARSOP bounds can handle 10x deeper search than with QMDP bounds. The offline work from Chapter 21 is not wasted even when you use online methods.
POMCP is described as an "anytime" algorithm. What does this mean?