Ch 22: Online Belief State Planning — Algorithms for Decision Making

Chapter 0: Online vs Offline

Imagine you're a robot in a building you've never seen. Offline planning says: "Before you leave home, compute the optimal policy for every possible belief you might encounter in that building." That's like memorizing responses to every possible conversation before going to a party. Exhaustive and mostly wasted effort.

Online planning takes the opposite approach. At each decision step, look at your current belief b. Build a search tree rooted at b. Expand it as far as time allows. Return the best action. Move. Get a new observation. Update your belief. Repeat. You only ever plan from beliefs you actually encounter.
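The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `run_online`, `plan`, `act!`, and `update` are hypothetical names, and the three callbacks stand in for whatever planner, environment interface, and belief filter a real system provides.

```julia
# Plan-act-observe-update loop. All function names are illustrative assumptions.
#   plan(b)         -> action chosen by searching from belief b
#   act!(a)         -> executes a in the world and returns an observation
#   update(b, a, o) -> new belief after taking a and observing o
function run_online(b, plan, act!, update, steps)
    for _ in 1:steps
        a = plan(b)          # search only from the belief we are actually in
        o = act!(a)          # move, then receive an observation
        b = update(b, a, o)  # belief update
    end
    return b
end

# Stub callbacks, just to exercise the loop.
history = Int[]
final_b = run_online([0.5, 0.5],
                     b -> argmax(b),
                     a -> (push!(history, a); 1),
                     (b, a, o) -> [0.2, 0.8],
                     3)
```

Every method in this chapter is a different way of filling in `plan`.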

Offline methods (Ch 21):

Compute π(b) for the entire belief simplex before execution. High upfront cost. Zero per-step cost. May waste computation on unreachable beliefs. Policy is a lookup table at runtime.

Online methods (this chapter):

At each step, build a search tree from the current belief. Zero upfront cost. Higher per-step cost. Only explores reachable beliefs. Policy is computed fresh at each step.

The key advantage of online planning: The belief space reachable from the current belief is typically a tiny fraction of the full simplex. Offline methods waste computation on beliefs the agent will never visit. Online methods automatically focus on what matters. This makes them scale much better to large problems — even when the full belief simplex is astronomically large.

All online methods build a search tree alternating between action and observation nodes:

Root: current belief b

The agent's current probability distribution over states.

↓ branch on actions

Action node: choose a ∈ A

One child per action. Expected reward + discounted future value.

↓ branch on observations

Observation node: observe o ∈ O

One child per observation. Belief updates via Bayes' rule.

↓ recurse or evaluate leaf

Leaf: approximate value U(b)

Use QMDP, FIB, or PBVI from offline precomputation.
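The "belief updates via Bayes' rule" step at each observation node can be sketched for a tabular model as follows. The array layouts `T[s, a, sp]` and `Z[a, sp, o]`, the function name `update_belief`, and the small example numbers are assumptions made for this illustration.

```julia
# Tabular model (layouts are assumptions for this sketch):
#   T[s, a, sp] = P(sp | s, a),  Z[a, sp, o] = P(o | a, sp)
function update_belief(b::Vector{Float64}, a::Int, o::Int,
                       T::Array{Float64,3}, Z::Array{Float64,3})
    nS = length(b)
    bp = zeros(nS)
    for sp in 1:nS
        # Predict: probability of landing in sp after taking action a
        pred = sum(T[s, a, sp] * b[s] for s in 1:nS)
        # Correct: weight by the likelihood of the received observation
        bp[sp] = Z[a, sp, o] * pred
    end
    total = sum(bp)
    total > 0 || error("observation has zero probability under this belief")
    return bp ./ total
end

# Two states, one action, two observations (made-up numbers).
T = zeros(2, 1, 2)
T[1, 1, :] = [0.9, 0.1]
T[2, 1, :] = [0.0, 1.0]
Z = zeros(1, 2, 2)
Z[1, 1, :] = [0.8, 0.2]
Z[1, 2, :] = [0.1, 0.9]

b_new = update_belief([0.5, 0.5], 1, 2, T, Z)
```

The normalizer `total` is exactly P(o|b,a), the weight that forward search places on the corresponding observation branch.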

The tree size explosion: At depth d with |A| actions and |O| observations, the tree has O(|A|^d|O|^d) nodes. For |A|=4, |O|=10, d=5: that's 4⁵·10⁵ ≈ 102 million nodes. Every online method in this chapter is essentially a different strategy for managing this exponential explosion.

What is the primary computational advantage of online over offline POMDP planning?

(a) Online methods have lower algorithmic complexity overall
(b) Online methods only explore beliefs reachable from the current state, avoiding wasted computation
(c) Online methods do not require a POMDP model

Chapter 1: Lookahead with Rollouts

Before building full search trees, consider the simplest possible online method: lookahead with rollouts. For each candidate action a, run m simulations forward. In each simulation: sample a state s ~ b, take action a, sample a transition and observation, then follow a rollout policy π_R for H steps. Average the returns:

Q(b, a) ≈ (1/m) ∑_{i=1}^{m} ∑_{t=0}^{H} γ^t r_t⁽ⁱ⁾

Pick the action with the highest estimated Q-value. The rollout policy π_R can be anything: random actions, a simple heuristic, or the QMDP policy from offline precomputation. The rollout doesn't need to be optimal. It just needs to give a more informative value estimate than assuming zero future return.

Generative model = just a simulator: Rollout methods don't need explicit T(s'|s,a) and O(o|a,s') tables. They only need a generative model: a black-box function gen(s,a) → (s', o, r). This makes rollouts applicable to problems where the POMDP model is a simulator (a physics engine, a game engine, etc.) rather than a probability table. This is a huge practical advantage.
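To make the procedure concrete, here is a minimal sketch of lookahead with rollouts against a black-box simulator. The toy Crying Baby simulator `gen`, its probabilities and rewards, and the names `random_policy`, `rollout`, and `rollout_q` are all assumptions for illustration. Note that the planner only ever calls `gen`; it never touches T or O tables.

```julia
using Random, Statistics

# Hypothetical Crying Baby simulator (all numbers are illustrative assumptions).
# States: 1 = hungry, 2 = sated. Actions: 1 = feed, 2 = ignore, 3 = listen.
# Observations: 1 = cry, 2 = quiet.
function gen(s::Int, a::Int, rng)
    r = (s == 1 ? -10.0 : 0.0) + (a == 1 ? -5.0 : (a == 3 ? -0.5 : 0.0))
    sp = a == 1 ? 2 : (s == 1 ? 1 : (rand(rng) < 0.1 ? 1 : 2))
    o = rand(rng) < (sp == 1 ? 0.8 : 0.1) ? 1 : 2
    return sp, o, r
end

# Rollout policy: uniformly random actions.
random_policy(rng) = rand(rng, 1:3)

# Discounted return from following `policy` for H steps starting in state s.
function rollout(s, policy, H, γ, rng)
    G, w = 0.0, 1.0
    for _ in 1:H
        s, _, r = gen(s, policy(rng), rng)
        G += w * r
        w *= γ
    end
    return G
end

# Monte Carlo estimate of Q(b, a): m simulations, each starting from s ~ b.
function rollout_q(b, a, policy, m, H, γ, rng)
    returns = zeros(m)
    for i in 1:m
        s = rand(rng) < b[1] ? 1 : 2      # sample s ~ b (two-state belief)
        sp, _, r = gen(s, a, rng)          # first step uses the candidate action
        returns[i] = r + γ * rollout(sp, policy, H, γ, rng)
    end
    return mean(returns)
end

rng = MersenneTwister(1)
q_est = [rollout_q([0.5, 0.5], a, random_policy, 200, 5, 0.9, rng) for a in 1:3]
best_action = argmax(q_est)
```

Swapping `random_policy` for a better heuristic changes the bias of `q_est`; raising `m` only changes its variance.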

Rollout Quality vs. Rollout Policy

Interactive demo: Crying Baby POMDP at belief P(hungry)=0.5. Compares Q-value estimates for three actions under random vs. QMDP rollout policies as the number of samples m varies (default m = 50). More samples = less variance. Better rollout = less bias.

Bias-variance tradeoff: Increasing the sample count m reduces variance (the estimates settle down) but doesn't fix bias (a bad rollout policy gives systematically wrong estimates). A better rollout policy reduces bias but doesn't help variance. A practical compromise: use a cheap, reasonable rollout policy (e.g., QMDP) with moderate m (e.g., 20-50). This often outperforms random rollouts with m=1000.

When rollouts fail: If the optimal action requires a long horizon to distinguish itself (e.g., a 5-step detour that reveals hidden information), rollouts with H=3 will miss it entirely. Shallow rollouts work when the good action pays off quickly. Deep tree search (forward search, MCTS) is needed for long-horizon dependencies.

Why can rollout methods work with problems where T and O are not explicitly available?

(a) They only need a generative model (simulator) that samples s', o, r given s and a
(b) They compute exact value functions from the model
(c) They use alpha vectors stored from offline computation

Chapter 2: Forward Search

Rollouts estimate Q(b,a) with simulation noise. Forward search computes it exactly (within a tree), by summing over all observations at each step. The recursion:

U_d(b) = max_a Q_d(b, a)

Q_d(b, a) = R(b, a) + γ ∑_{o ∈ O} P(o|b,a) · U_{d−1}(Update(b, a, o))

U₀(b) = U_approx(b) (leaf: QMDP, FIB, or PBVI)

This is exact: no sampling noise. The leaf value U_approx comes from offline precomputation. Better leaf estimates allow shallower search: with PBVI leaves, depth 2 can be nearly as good as random-rollout depth 10.
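The recursion above translates almost line for line into code. The following is a sketch under stated assumptions: the `TabularPOMDP` struct, its field layout, the two-action toy model, and the zero leaf evaluator are all made up for illustration (a real leaf evaluator would be QMDP, FIB, or PBVI).

```julia
# Tabular POMDP; the field layout is an assumption made for this sketch.
struct TabularPOMDP
    T::Array{Float64,3}   # T[s, a, sp] = P(sp | s, a)
    Z::Array{Float64,3}   # Z[a, sp, o] = P(o | a, sp)
    R::Matrix{Float64}    # R[s, a]
    γ::Float64
end

# Returns (P(o | b, a), Update(b, a, o)).
function belief_step(p::TabularPOMDP, b, a, o)
    nS = length(b)
    pred = [sum(p.T[s, a, sp] * b[s] for s in 1:nS) for sp in 1:nS]
    unnorm = [p.Z[a, sp, o] * pred[sp] for sp in 1:nS]
    po = sum(unnorm)
    return po, unnorm ./ max(po, eps())
end

# Depth-d forward search; `leaf` is any approximate value function U(b).
# Returns (best action, value); the action is 0 at a leaf.
function forward_search(p::TabularPOMDP, b, d, leaf)
    d == 0 && return (0, leaf(b))
    best_a, best_u = 0, -Inf
    for a in 1:size(p.R, 2)
        q = sum(p.R[s, a] * b[s] for s in eachindex(b))     # R(b, a)
        for o in 1:size(p.Z, 3)
            po, bp = belief_step(p, b, a, o)
            po > 0 || continue                               # unreachable branch
            q += p.γ * po * forward_search(p, bp, d - 1, leaf)[2]
        end
        if q > best_u
            best_a, best_u = a, q
        end
    end
    return (best_a, best_u)
end

# Toy model: states (hungry, sated), actions (feed, ignore), obs (cry, quiet).
T = zeros(2, 2, 2)
T[:, 1, :] = [0.0 1.0; 0.0 1.0]        # feed: baby ends up sated
T[:, 2, :] = [1.0 0.0; 0.1 0.9]        # ignore: sated baby may become hungry
Z = zeros(2, 2, 2)
for a in 1:2
    Z[a, :, :] = [0.8 0.2; 0.1 0.9]    # rows: hungry, sated; cols: cry, quiet
end
R = [-15.0 -10.0; -5.0 0.0]            # R[s, a]
model = TabularPOMDP(T, Z, R, 0.9)

a1, u1 = forward_search(model, [0.5, 0.5], 1, b -> 0.0)
a2, u2 = forward_search(model, [0.5, 0.5], 2, b -> 0.0)
```

With the zero leaf evaluator the two actions are nearly tied at depth 2, which is the kind of situation where a better leaf estimate matters more than extra depth.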

The depth-quality tradeoff: Forward search at depth d has O(|A|^d|O|^d) nodes. For the machine replacement problem (4 actions, 3 observations), depth 5 → ~250,000 nodes. Depth 10 → ~62 billion. In practice, d=2 or d=3 with good leaf evaluators is the sweet spot. Beyond d=4, the cost is prohibitive without pruning.
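The node counts quoted for the machine replacement problem follow directly from (|A|·|O|)^d. A short check, using a hypothetical helper `tree_nodes`:

```julia
# Number of nodes at depth d when every action and observation is expanded.
tree_nodes(nA, nO, d) = (nA * nO)^d

depth5 = tree_nodes(4, 3, 5)     # 248832, roughly 250,000
depth10 = tree_nodes(4, 3, 10)   # 61917364224, roughly 62 billion
```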

Forward Search Tree Construction

Interactive demo: 2-action (feed/listen), 2-observation (cry/quiet) POMDP. Squares = action nodes (teal). Circles = observation nodes (orange). Each level doubles the number of branches in each dimension, so expanding level by level shows the exponential growth from a single root node at depth 0.

Strategies to tame depth: (1) Maximum likelihood observation: only branch on the most likely observation, not all of them. Cheap but biased. (2) Domain pruning: skip dominated actions based on domain knowledge. (3) Hybrid depth: use forward search for d=1 with MCTS for deeper exploration. (4) Branch and bound (next chapter): prune branches using precomputed bounds.

Forward search at depth d with |A| actions and |O| observations builds how many leaf nodes?

(a) O(|A|^d |O|^d)
(b) O(d · |A| · |O|)
(c) O(|S|^d)

Chapter 3: Branch and Bound

Forward search builds the full tree and wastes time on obviously bad actions. Branch and bound uses precomputed upper and lower bounds to prune branches that cannot possibly beat the current best solution.

The algorithm maintains two quantities at each node:

Upper bound Q̅(b, a): the best this action branch could possibly achieve. Comes from FIB or the sawtooth bound.
Lower bound U̲(b): the value we've already guaranteed is achievable. Comes from PBVI or the blind policy.

Pruning rule: if Q̅(b, a) < U̲(b), skip action a entirely. It cannot beat what we already have. This is valid because:

Q*(b, a) ≤ Q̅(b, a) < U̲(b) ≤ U*(b)
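A minimal sketch of how the pruning rule plays out at a single node. The function name `branch_and_bound`, the callback `evaluate` (standing in for the recursive search below an action), and the numbers are assumptions for illustration. Actions are tried in order of decreasing upper bound, and the lower bound tightens as soon as an action has been evaluated.

```julia
# q_upper[a] : upper bound on Q*(b, a)   (assumed to come from e.g. FIB)
# u_lower    : lower bound on U*(b)      (assumed to come from e.g. PBVI)
# evaluate(a): value of action a found by searching below it (hypothetical callback)
function branch_and_bound(q_upper, u_lower, evaluate)
    best_a, best_u = 0, u_lower       # best_a stays 0 if nothing beats u_lower
    evaluated = Int[]
    for a in sortperm(q_upper, rev=true)     # most promising action first
        q_upper[a] < best_u && continue      # cannot beat what we already have
        push!(evaluated, a)
        q = evaluate(a)
        if q > best_u
            best_a, best_u = a, q            # tighter lower bound, more pruning
        end
    end
    return best_a, best_u, evaluated
end

true_q = [-5.0, -9.8, -6.2]   # made-up values the search would discover
a_best, u_best, tried = branch_and_bound([-4.0, -9.5, -6.0], -6.5, a -> true_q[a])
```

In this example only action 1 is expanded: action 2 is pruned by the offline lower bound, and action 3 is pruned by the tighter bound obtained after evaluating action 1.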

The synergy between offline and online: Branch and bound is the canonical example of offline-online synergy. The offline phase (Ch 21) runs PBVI and FIB to get tight bounds. The online phase uses those bounds to skip large portions of the search tree. Tighter offline bounds → more pruning → deeper online search within the same time budget.

When does pruning help most? When the best action's lower bound is tight (PBVI has converged well) and the suboptimal actions' upper bounds are clearly below it. Near the corners of the belief simplex (high certainty), the optimal action is obvious and pruning is aggressive. Near the center (maximum uncertainty), all actions look similar and pruning is weak.

Branch and Bound: Pruning in Action

Interactive demo: Crying Baby POMDP with an adjustable belief P(hungry) (default 0.50). Shows three actions with their upper bounds (FIB) and the lower bound (PBVI). Actions whose upper bound falls below the best lower bound are pruned (grayed out).

Method	Prunes actions?	Prunes observations?	Requires offline bounds?
Forward search	No	No	Just leaf evaluator
Branch and bound	Yes	No	Yes (UB + LB)
Gap heuristic (Ch 8)	Yes	Yes	Yes (UB + LB)

Branch and bound can prune action a from the search tree when:

(a) The upper bound on Q(b,a) is below the current lower bound on U*(b)
(b) The action has been tried before in a previous planning step
(c) The observation probability P(o|b,a) is below 0.1 for all observations

Chapter 4: Sparse Sampling

Forward search sums over every observation. For |O| = 100 and depth 5, that's 100⁵ = 10 billion observation branches. Sparse sampling replaces the exact sum with a stochastic approximation: draw m observation samples instead of enumerating all of them.

Q_d(b, a) ≈ R(b, a) + (γ/m) ∑_{i=1}^{m} U_{d−1}(Update(b, a, o_a⁽ⁱ⁾))

where o_a⁽ⁱ⁾ ~ P(o|b,a) are sampled observations. The branching factor drops from |O| to m. Crucially, m doesn't depend on |O| at all. Whether |O| = 10 or |O| = 10 million, the same m gives the same approximation quality (up to statistical noise).

Why this is a big deal: Continuous observation spaces (e.g., sensor readings, camera images) have |O| = ∞. Forward search is impossible. Sparse sampling handles them trivially: just sample m=20 observations from P(o|b,a) at each node. The approximation quality scales with m, not |O|. This is what makes POMDP methods applicable to robotics problems with continuous sensors.

The tradeoff: sparse sampling introduces variance. The estimate of Q(b,a) has variance O(1/m) at each node. But the variance can be controlled by choosing m, whereas the explosion from |O| cannot. For many practical problems, m=10 to m=30 gives excellent results.
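The sampled recursion can be sketched as follows. The `TabularPOMDP` struct, the helper names, and the toy model are assumptions for illustration. The observation distribution is tabulated here only to make sampling easy in a small example; the method itself just needs some way to draw o ~ P(o|b,a).

```julia
using Random

struct TabularPOMDP          # field layout is an assumption for this sketch
    T::Array{Float64,3}      # T[s, a, sp] = P(sp | s, a)
    Z::Array{Float64,3}      # Z[a, sp, o] = P(o | a, sp)
    R::Matrix{Float64}       # R[s, a]
    γ::Float64
end

# Categorical sample by inverse CDF; falls back to the mode on round-off.
sample_index(probs, rng) =
    something(findfirst(>(rand(rng)), cumsum(probs)), argmax(probs))

# Predicted next-state distribution and P(o | b, a) for every o.
function observation_dist(p, b, a)
    nS, nO = length(b), size(p.Z, 3)
    pred = [sum(p.T[s, a, sp] * b[s] for s in 1:nS) for sp in 1:nS]
    po = [sum(p.Z[a, sp, o] * pred[sp] for sp in 1:nS) for o in 1:nO]
    return pred, po
end

# Depth-d sparse sampling with m observation samples per action node.
function sparse_sampling(p, b, d, m, leaf, rng)
    d == 0 && return (0, leaf(b))
    best_a, best_u = 0, -Inf
    for a in 1:size(p.R, 2)
        pred, po = observation_dist(p, b, a)
        q = sum(p.R[s, a] * b[s] for s in eachindex(b))
        for _ in 1:m
            o = sample_index(po, rng)            # o ~ P(o | b, a)
            bp = p.Z[a, :, o] .* pred
            bp ./= sum(bp)                       # Update(b, a, o)
            q += p.γ / m * sparse_sampling(p, bp, d - 1, m, leaf, rng)[2]
        end
        if q > best_u
            best_a, best_u = a, q
        end
    end
    return (best_a, best_u)
end

# Toy model: states (hungry, sated), actions (feed, ignore), obs (cry, quiet).
T = zeros(2, 2, 2)
T[:, 1, :] = [0.0 1.0; 0.0 1.0]
T[:, 2, :] = [1.0 0.0; 0.1 0.9]
Z = zeros(2, 2, 2)
for a in 1:2
    Z[a, :, :] = [0.8 0.2; 0.1 0.9]
end
model = TabularPOMDP(T, Z, [-15.0 -10.0; -5.0 0.0], 0.9)

rng = MersenneTwister(7)
a1, u1 = sparse_sampling(model, [0.5, 0.5], 1, 20, b -> 0.0, rng)
a2, u2 = sparse_sampling(model, [0.5, 0.5], 2, 20, b -> 0.0, rng)
```

The inner loop runs m times regardless of how many observations the model has, which is the point of the method.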

Method	Branching factor	Total nodes at depth d	Observation space needed?
Forward search	|O| per step	O(|A|^d |O|^d)	Enumerable |O|
Sparse sampling	m per step	O(|A|^d m^d)	Only need to sample from P(o|b,a)
DESPOT (Ch 6)	K total	O(|A|^d K)	Only need generative model

Sparse sampling vs MCTS: Both use sampling to manage the observation branching. The difference: sparse sampling builds a fixed-depth tree and samples observations at every level. MCTS (Ch 5) uses UCB to focus samples on promising branches and doesn't fix depth. MCTS is generally better when time is limited; sparse sampling is better when you want a fixed-depth guarantee.

Sparse sampling uses m=20 samples. How does changing |O| from 10 to 10,000 affect the approximation quality?

(a) Quality degrades dramatically because more observations means more variance
(b) Quality improves because larger observation spaces provide more information
(c) Quality is unchanged (m=20 samples give the same approximation regardless of |O|)

Chapter 5: POMCP — Monte Carlo Tree Search for POMDPs

Sparse sampling builds a tree of fixed depth and uniform breadth. Partially Observable Monte Carlo Planning (POMCP) adapts MCTS to POMDPs, spending more computation on promising branches and less on hopeless ones.

The core adaptation: MCTS normally indexes nodes by states. Since the agent doesn't know the state, POMCP indexes by action-observation histories h = (a₁, o₁, ..., a_t, o_t). Each history node stores Q(h, a) estimates and visit counts N(h, a). The UCB action selection:

a* = argmax_a [ Q(h, a) + c √(log N(h) / N(h, a)) ]

where N(h) = ∑_a N(h, a) is total visits at history h. The c parameter balances exploration and exploitation.
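The selection rule is short enough to write out directly. In this sketch the function name `ucb_action` and the example numbers are assumptions, and untried actions are given an infinite score so that each is tried once; that is one common convention, not the only one.

```julia
# UCB1 action selection at a single history node.
# Q[a] and N[a] are the value estimate and visit count of action a.
function ucb_action(Q::Vector{Float64}, N::Vector{Int}, c::Float64)
    Nh = sum(N)
    scores = [N[a] == 0 ? Inf : Q[a] + c * sqrt(log(Nh) / N[a])
              for a in eachindex(Q)]
    return argmax(scores)
end

untried = ucb_action([-5.0, -4.0, -6.0], [10, 10, 0], 1.5)  # untried action wins
greedy  = ucb_action([-5.0, -4.0, -6.0], [50, 5, 45], 0.0)  # c = 0: pure exploitation
explore = ucb_action([1.0, 0.9], [100, 1], 1.0)             # bonus favors the rarely tried action
```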

The particle filter innovation: instead of representing beliefs explicitly, POMCP maintains a particle set at each history node — the collection of states that have been visited at that node across simulations. This implicitly approximates the belief without ever computing it.

1. Sample

Draw state s from current particle set (approximates b). If root: s ~ b₀.

↓

2. Select

Navigate tree using UCB. At unvisited nodes, initialize with Q=0, N=0.

↓

3. Rollout

From first unvisited node, run rollout policy for H steps. Get return G.

↓

4. Backup

Increment N(h, a), then update Q(h, a) += (G − Q(h, a)) / N(h, a) along the path to the root.

POMCP's key insight: The belief at history h is approximated by the particles that have been simulated through h. No explicit Bayesian belief update is needed. This lets POMCP work with complex simulators where belief updates are intractable: you just simulate. Silver and Veness (2010) demonstrated POMCP on Battleship (roughly 10¹⁸ states) and partially observable Pac-Man (roughly 10⁵⁶ states), where exact methods are completely impossible.

Anytime behavior: POMCP can be stopped at any time and will return the best action found so far: argmax_a N(root, a) (most visited action). More simulations = better action. This makes POMCP ideal for systems with variable time budgets: give it what you can, and it will use the time well.

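julia (POMCP core)
# Assumed helpers, not defined here: actions(pomdp), discount(pomdp),
# gen(pomdp, s, a) -> (sp, o, r), and rollout(pomdp, s, steps) -> return.
function pomcp_simulate(pomdp, s, h, depth, tree, particles, c)
    # Depth limit reached: estimate the remaining value with a short rollout
    if depth == 0; return rollout(pomdp, s, 10); end

    if !haskey(tree, h)
        # New history node: initialize Q-values, then estimate value by rollout
        tree[h] = Dict(a => (Q=0.0, N=0) for a in actions(pomdp))
        return rollout(pomdp, s, depth)
    end

    # UCB action selection (+1 smoothing avoids log(0) and division by zero)
    N_h = sum(n.N for n in values(tree[h]))
    a = argmax(a -> tree[h][a].Q + c*sqrt(log(N_h+1)/(tree[h][a].N+1)), actions(pomdp))

    # Sample transition
    sp, o, r = gen(pomdp, s, a)
    h_new = (h..., a, o)  # extend history

    # Add particle to new history node. A Vector keeps duplicates, so
    # particle frequencies approximate the belief (a Set would discard them).
    push!(get!(Vector{Any}, particles, h_new), sp)

    G = r + discount(pomdp) * pomcp_simulate(pomdp, sp, h_new, depth-1, tree, particles, c)

    # Backup: incremental mean of the returns seen through (h, a)
    node = tree[h][a]
    tree[h][a] = (Q = node.Q + (G - node.Q)/(node.N+1), N = node.N + 1)
    return G
end

function pomcp_action(pomdp, belief, n_sims; depth=5, c=1.0)
    # `particles` is passed explicitly so pomcp_simulate can see it
    tree = Dict(); particles = Dict()
    for _ in 1:n_sims
        s = rand(belief)
        pomcp_simulate(pomdp, s, (), depth, tree, particles, c)
    end
    # Anytime recommendation: most-visited action at the root
    return argmax(a -> tree[()][a].N, actions(pomdp))
end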

POMCP indexes tree nodes by action-observation histories instead of beliefs. What is the key benefit?

(a) No explicit belief update is needed; beliefs are implicitly represented by particle sets
(b) The tree has fewer nodes because histories are shorter than beliefs
(c) The UCB formula becomes exact rather than approximate

Chapter 6: DESPOT

POMCP draws fresh random observations at each simulation. Determinized Sparse Partially Observable Tree (DESPOT) search takes a different approach: pre-generate K fixed random scenarios before search begins. A scenario φ = (φ₁, φ₂, ...) is a sequence of random numbers that deterministically specifies the environment's behavior (transitions and observations) at each depth.

Given action sequence a₁, ..., a_d and scenario φ, the entire trajectory is determined. This means actions a and a' can be compared under exactly the same random events. This is analogous to common random numbers in simulation: variance reduction by correlating random draws across alternatives.

The scenario picture: Imagine K=50 scenarios, each a "possible future." Each scenario determines what happens at every step: which state you transition to, which observation you get. For each action a, DESPOT simulates all 50 scenarios and averages the returns. Because all actions face the same 50 futures, the comparison is fair: you're testing which action performs better on the same set of challenges.

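The tree structure: DESPOT branches only on observations that actually occur under the K scenarios, rather than on every o ∈ O. For a given action sequence, each scenario follows a single path through the tree, determined by its sequence of random numbers. Any fixed action sequence therefore reaches at most K distinct nodes at each depth, and the total size is O(|A|^d K), independent of |O|.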
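The determinization itself can be sketched as an inverse-CDF lookup: the scenario's random number u picks the joint outcome (s', o), so the same u always produces the same outcome. The function names, array layouts, and toy model below are assumptions for illustration; real implementations may determinize the simulator differently.

```julia
# Deterministic simulator: u in [0, 1) selects the joint outcome (sp, o) by
# inverse-CDF over P(sp | s, a) * P(o | a, sp). Layouts: T[s, a, sp], Z[a, sp, o], R[s, a].
function gen_deterministic(T, Z, R, s, a, u)
    c = 0.0
    nS, nO = size(T, 3), size(Z, 3)
    for sp in 1:nS, o in 1:nO
        c += T[s, a, sp] * Z[a, sp, o]
        c > u && return sp, o, R[s, a]
    end
    return nS, nO, R[s, a]    # guard against floating-point round-off
end

# Discounted return of a fixed action sequence under one scenario (s0, us).
function scenario_return(T, Z, R, γ, s0, us, plan)
    s, G = s0, 0.0
    for (t, a) in enumerate(plan)
        s, _, r = gen_deterministic(T, Z, R, s, a, us[t])
        G += γ^(t - 1) * r
    end
    return G
end

# Toy model: states (hungry, sated), actions (feed, ignore), obs (cry, quiet).
T = zeros(2, 2, 2)
T[:, 1, :] = [0.0 1.0; 0.0 1.0]
T[:, 2, :] = [1.0 0.0; 0.1 0.9]
Z = zeros(2, 2, 2)
for a in 1:2
    Z[a, :, :] = [0.8 0.2; 0.1 0.9]
end
R = [-15.0 -10.0; -5.0 0.0]

# Two action sequences compared on the same scenario (same start state, same u's).
us = [0.3, 0.7]
g_feed   = scenario_return(T, Z, R, 0.9, 1, us, [1, 2])   # feed, then ignore
g_ignore = scenario_return(T, Z, R, 0.9, 1, us, [2, 2])   # ignore twice
```

Because both plans see the same `us`, the difference `g_feed - g_ignore` reflects the actions rather than luck in the random draws.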

Property	Sparse Sampling	DESPOT
Observation branching	m samples per node	K fixed scenarios total
Total leaf count	O(|A|^d m^d)	O(|A|^d K)
Action comparison	Different random seeds	Same scenarios (lower variance)
Belief representation	Explicit or particle	Implicit via scenarios

DESPOT with regularization: A naive DESPOT tree can overfit to the K scenarios. The DESPOT paper introduces a regularization term that penalizes large trees, balancing tree depth against the number of scenarios. The regularized DESPOT objective is: max_{tree T} V(T, K-scenarios) − λ |T|. This gives a principled tradeoff between policy expressiveness and overfitting risk.

DESPOT in practice: DESPOT has been applied to autonomous driving, robotic manipulation, and target tracking. It handles continuous observation spaces (like camera images) by treating the simulator as a generative model. K=500 scenarios at depth d=5 is a common practical setting. With |A|=4, for example, that is about 4⁵·500 ≈ 512K scenario evaluations, which parallelize easily across CPU cores.

julia (DESPOT core)
using Statistics  # mean

# Assumed helpers, not defined here: actions(pomdp), discount(pomdp),
# gen_deterministic(pomdp, s, a, u) -> (sp, o, r), and bounds.lower(pomdp, s).
struct Scenario
    s::Int                  # current state (initially sampled from the belief)
    rands::Vector{Float64}  # one random number per depth level
end

function despot_value(pomdp, depth, scenarios, bounds)
    if depth == 0
        # Leaf: average a lower bound over the scenarios that reached it
        return mean(bounds.lower(pomdp, sc.s) for sc in scenarios)
    end
    best_val = -Inf
    for a in actions(pomdp)
        # All K scenarios face the same action a;
        # group them by the observation they produce
        obs_groups = Dict{Int, Vector{Scenario}}()
        total_r = 0.0
        for sc in scenarios
            sp, o, r = gen_deterministic(pomdp, sc.s, a, sc.rands[depth])
            total_r += r
            push!(get!(obs_groups, o, Scenario[]), Scenario(sp, sc.rands))
        end
        # Weight each observation branch by its share of the scenarios
        future = sum(length(group)/length(scenarios) *
                     despot_value(pomdp, depth-1, group, bounds)
                     for group in values(obs_groups))
        val = total_r/length(scenarios) + discount(pomdp) * future
        best_val = max(best_val, val)
    end
    return best_val
end

DESPOT: Scenario Branching vs Forward Search

Interactive plot: tree sizes for DESPOT (K scenarios, fixed paths) vs. forward search (all observations). X-axis: depth d. Y-axis: number of leaf nodes (log scale). Controls: K (default 50) and |O| (default 10); the crossover point shows where DESPOT wins.

DESPOT pre-generates K fixed scenarios before search. What is the primary statistical advantage over sparse sampling?

(a) Actions are compared under the same random events, reducing variance of action comparisons
(b) Fixed scenarios make the POMDP deterministic, removing all uncertainty
(c) K scenarios is always less than m samples, so the tree is smaller

Chapter 7: SHOWCASE — MCTS Search Tree Live

Watch POMCP build its search tree in real time on the Crying Baby POMDP. Each simulation starts at the root belief, navigates down using UCB, and backs up the return. The tree grows asymmetrically: promising branches get more visits. Run enough simulations and the most visited action at the root is reliably the best.

POMCP Search Tree (Crying Baby POMDP)

Interactive demo: Root = belief P(hungry)=0.5. Actions: feed (F), ignore (I), listen (L). Squares = action nodes; brightness shows Q-value. Circles = observation nodes. Number = visit count. Each step runs one MCTS iteration; over many iterations UCB focuses on the best action. Controls: UCB constant c (default 1.5) and animation speed (default 150 ms).

What to observe: With c=1.5 (balanced), UCB initially tries all actions roughly equally, then concentrates on the most promising one. Increase c to force more exploration (all branches get roughly equal visits). Decrease c toward 0 for pure greedy exploitation (one branch gets almost all visits). After ~100 simulations, the most-visited action is a reliable recommendation.

The particle approximation: In the real POMCP, each observation node maintains a particle set (a collection of states from simulations that reached that node). These particles implicitly represent the posterior belief at that history. More simulations = more particles = better belief approximation. This is the key reason POMCP doesn't need explicit belief updates at runtime.

After 200 simulations in POMCP, how should you select the action to execute?

(a) Pick the action with the highest visit count N(root, a), the most-explored action
(b) Pick the action with the highest Q-value Q(root, a) at that instant
(c) Run one more simulation and pick the action sampled in that last simulation

Chapter 8: Gap Heuristic Search

Branch and bound prunes action branches using upper/lower bounds. The gap heuristic applies a similar idea to observation branches: instead of exploring all (or m random) observations at each node, selectively expand only the most informative ones.

The gap at a belief b is the difference between upper and lower bounds: gap(b) = U̅(b) − U̲(b). The gap heuristic expands the observation branch that maximizes the probability-weighted gap:

o* = argmax_o P(o|b, a) · gap(Update(b, a, o))

This targets observations that are both likely to occur and have high uncertainty in their value. If an observation is very likely but already well-understood (small gap), it's not worth expanding. If it's very uncertain but very unlikely, it's also not worth expanding. The product captures the joint importance.
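A small sketch of the selection rule with made-up numbers. The function name `gap_observation` and the bound values are assumptions; in practice the upper and lower bounds would be evaluated at each updated belief.

```julia
# p_obs[o]          : P(o | b, a)
# upper[o], lower[o]: bounds on the value at Update(b, a, o)
function gap_observation(p_obs, upper, lower)
    weighted = p_obs .* (upper .- lower)   # probability-weighted gap per observation
    return argmax(weighted), weighted
end

# Observation 1 is likely but well understood; observation 3 is very uncertain
# but unlikely; observation 2 has the largest product and is expanded.
o_star, w = gap_observation([0.7, 0.25, 0.05],
                            [-2.0, -1.0, 0.0],
                            [-2.5, -6.0, -9.0])
```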

Gap heuristic terminates with a quality guarantee: When the gap at the root drops below threshold ε, the algorithm stops. The returned action is within ε of optimal. This gives a formal approximation quality guarantee that MCTS and sparse sampling lack. The algorithm is essentially "stop when you're sure enough."

Combining all the tricks: The HSVI algorithm (Heuristic Search Value Iteration, Smith and Simmons 2004) and its relative SARSOP use: (1) branch and bound to prune action branches, (2) gap heuristic to focus observation branches, (3) sawtooth upper bound updated at each visited belief, and (4) PBVI lower bound. All four mechanisms work together to enable much deeper effective search than any single technique alone.

Online Method	Action branching	Obs branching	Termination
Rollouts	All	Sample (1)	Fixed H
Forward search	All	All |O|	Fixed depth
Branch & bound	Pruned by bounds	All |O|	Fixed depth
Sparse sampling	All	m samples	Fixed depth
POMCP	UCB-guided	Sampled per sim	Fixed n_sims
DESPOT	All	K scenarios	Fixed depth
Gap heuristic	Pruned by bounds	Gap-selected	Gap < ε

The gap heuristic selects observation o* = argmax P(o|b,a)·gap(Update(b,a,o)). Why weight the gap by probability?

(a) Because an uncertain but unlikely observation contributes little to improving the root action choice
(b) Because P(o|b,a) is always larger than gap(b)
(c) To normalize the gap to [0,1] for comparison

Chapter 9: Summary & Connections

Online POMDP planning covers a rich set of methods, each making different tradeoffs. The core tension is between: complete enumeration (forward search) and smart sampling/pruning (everything else). Here's the complete landscape:

Method	Core idea	Scales to large |O|?	Guarantee?
Rollouts	Average returns from sampled trajectories	Yes (generative model)	No
Forward search	Exact tree expansion to depth d	No (enumerates |O|)	Exact at depth d
Branch & bound	Prune actions via offline bounds	No (enumerates |O|)	Yes (within gap)
Sparse sampling	Sample m observations per node	Yes	PAC guarantee
POMCP	UCB-guided tree with particle beliefs	Yes (generative model)	Anytime, no hard bound
DESPOT	Pre-fixed K scenarios	Yes (generative model)	Regularized bound
Gap heuristic	Gap-guided observation selection	Depends on bound eval	Yes (gap < ε)

Which to use? For small, tabular POMDPs with known model: forward search or branch-and-bound. For medium POMDPs with moderate |O|: sparse sampling or POMCP. For large POMDPs with continuous observations (robotics, games): POMCP or DESPOT. For theoretical guarantees: gap heuristic or HSVI. For production systems that need anytime behavior: POMCP.

Looking ahead to Ch 23: All methods in this chapter maintain an explicit or implicit belief. Chapter 23 asks: what if we never compute beliefs at all? Finite state controllers are reactive policies that transition between internal nodes based on observations alone, with no Bayesian inference. This is a fundamentally different approach to the POMDP problem.

The offline-online connection: Online methods improve dramatically when combined with offline precomputation. POMCP with a PBVI rollout policy can converge 5-10x faster than with random rollouts. Branch-and-bound with tight SARSOP bounds can search far deeper than with QMDP bounds in the same time budget. The offline work from Chapter 21 is not wasted even when you use online methods.

POMCP is described as an "anytime" algorithm. What does this mean?

(a) It can run at any time of day without computational overhead
(b) It can be stopped at any time and returns the best action found so far; more time means better answers
(c) It has constant per-simulation time regardless of tree size