Kochenderfer et al., Chapter 21

Offline Belief State Planning

When exact POMDP solutions are intractable, approximation saves the day. Learn to compute value-function bounds that squeeze toward optimality from both sides.

Prerequisites: Chapter 20 (Exact Belief State Planning). Alpha vectors, PWLC value functions, and belief updates.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Approximate?

You have a robot that cannot see perfectly. You've modeled it as a POMDP. You fire up the exact solver from Chapter 20 and wait. And wait. After an hour it's still running. Even a modest problem with 30 states, 5 actions, and 8 observations can produce exponentially many alpha vectors per iteration. The exact method is theoretically sound but practically hopeless.

The core obstacle is intractability. Solving a POMDP exactly is PSPACE-complete for finite horizons and undecidable for infinite horizons in general. The number of alpha vectors can grow exponentially with the horizon, each backup step is expensive, and the belief simplex has no finite grid structure to exploit. Something has to give.

The approximation strategy: Instead of computing the exact value function U*(b) over the entire belief simplex, we compute approximations from both sides. An upper bound U̅(b) ≥ U*(b) tells us "we can't do better than this." A lower bound U̲(b) ≤ U*(b) tells us "we can at least do this well." The gap U̅(b) − U̲(b) measures our uncertainty about the true value.

The approximation methods in this chapter form a hierarchy. Upper bounds start optimistically and get refined. Lower bounds start pessimistically and get improved. The hierarchy looks like this:

QMDP
Loosest upper bound. Ignores partial observability after one step.
↓ tighter
Fast Informed Bound (FIB)
Tighter upper bound. Accounts for observation structure.
↓ tighter still
Sawtooth Upper Bound
Tightest upper bound. Interpolates from belief-utility pairs.
↑ tighter
PBVI / Randomized PBVI
Increasingly tight lower bounds from point-based backup.
↑ from below
Blind Lower Bound
Weakest lower bound. Commits to a fixed action forever.

All of these methods share a common currency: alpha vectors. Each alpha vector α is a function from states to real numbers, and the value function at belief b is maxα ∈ Γ α·b. The methods differ in how they construct and update the set Γ.

The offline-online division: This chapter covers offline methods. They do all or most computation before execution. You run the algorithm overnight, get a policy, and then deploy. Chapter 22 covers online methods that plan at runtime from the current belief. Both approaches use the bounds computed here.
Why can't we solve large POMDPs exactly?

Chapter 1: QMDP

Imagine you're playing poker with imperfect information. QMDP solves poker by assuming: "After I act once, I'll magically see all the cards." This is obviously wrong, but it gives a fast, principled upper bound that often works surprisingly well in practice.

Formally, QMDP computes the MDP Q-function — the value of taking action a from state s in the fully observable version of the problem — and then uses it to score beliefs. The update for QMDP iterates the MDP Bellman equation:

αa(k+1)(s) = R(s, a) + γ ∑s' T(s'|s, a) maxa' αa'(k)(s')

Notice: no observation model O(o|a,s') appears anywhere. QMDP acts as if observing the state after one step costs nothing. The policy at belief b picks: π(b) = argmaxa αa·b.

Why it's fast: QMDP produces exactly |A| alpha vectors (one per action). Each iteration is just MDP value iteration: O(|A|²|S|²) work. Run until convergence, done.

Why it's an upper bound: Perfect future observability is always at least as good as partial observability. So QMDP's value ≥ true value. This is the classic "clairvoyant" argument.

QMDP on the Crying Baby POMDP

3-action POMDP: ignore, feed, listen. 2 states: hungry, not hungry. Click Iterate to run one QMDP update. Watch the two alpha vectors shift toward the MDP solution.

k = 0
QMDP's blind spot: Any policy requiring active information-gathering will be undervalued. In the crying baby problem, the "listen" action is purely informational — it tells you whether the baby is hungry but gives no immediate reward. QMDP gives "listen" no credit because it assumes you'll know the state anyway after one step. QMDP works well when optimal actions don't depend on resolving uncertainty; it fails when they do.
PropertyQMDP
Alpha vectors produced|A| (one per action)
Iterations to convergenceSame as MDP value iteration
Observation model usedNo
Bound typeUpper bound on U*
Best suited forProblems where uncertainty is easily resolved
julia (Kochenderfer style)
function qmdp(pomdp)
    # Initialize: one alpha vector per action, all zeros
    alphas = [zeros(length(states(pomdp))) for a in actions(pomdp)]
    for iter in 1:100
        new_alphas = deepcopy(alphas)
        for (ai, a) in enumerate(actions(pomdp))
            for (si, s) in enumerate(states(pomdp))
                future = 0.0
                for (si2, s2) in enumerate(states(pomdp))
                    # MDP Bellman: max over actions, no observations
                    best = maximum(alphas[ai2][si2] for ai2 in eachindex(actions(pomdp)))
                    future += transition(pomdp, s, a, s2) * best
                end
                new_alphas[ai][si] = reward(pomdp, s, a) + discount(pomdp) * future
            end
        end
        alphas = new_alphas
    end
    return AlphaVectorPolicy(pomdp, alphas)
end
QMDP ignores the observation model in its update. What is the consequence?

Chapter 2: Fast Informed Bound

QMDP assumes you'll know the state after one step. The fast informed bound (FIB) makes a weaker assumption: after one step, you'll know the observation, but not the full state. This is still optimistic (you might get an uninformative observation), but it's closer to reality.

The difference in the update rule makes this precise. Where QMDP has the max outside the sum over s' (assuming you know s'), FIB has the max inside the sum over observations (assuming you know o but not s'):

QMDP update:

R(s,a) + γ ∑s' T(s'|s,a) maxa' αa'(s')

Max over a' is outside everything. You "know" s'.

FIB update:

R(s,a) + γ ∑o maxa's' O(o|a,s') T(s'|s,a) αa'(s')

Max over a' is per observation. You only "know" o.

Knowing o is strictly weaker than knowing s'. Therefore FIB ≤ QMDP at every belief point. FIB is still an upper bound on U*, but it's tighter. Both methods maintain exactly |A| alpha vectors, so the computational cost is similar: O(|A|²|S|²|O|) per iteration for FIB vs O(|A|²|S|²) for QMDP.

QMDP vs FIB: Tightness Comparison

Crying baby POMDP. X-axis: P(hungry). Blue = QMDP upper bound (looser). Teal = FIB upper bound (tighter). Click Iterate Both to watch them converge from above. Notice FIB always stays below QMDP.

k = 0
Why FIB is still an upper bound: Even if you know the observation o perfectly, you're still assuming you pick the best possible action given that observation. In reality you don't know the best action from o alone without a full belief update. So FIB is still over-optimistic — but less so than QMDP.
When does FIB improve most over QMDP? When observations are highly informative about the state. If O(o|a,s') is close to deterministic (each state maps to a unique observation), then FIB is almost as good as knowing the state — and both QMDP and FIB give similar bounds. If observations are very noisy, FIB gives a meaningfully tighter bound than QMDP.
julia
function fib(pomdp)
    alphas = [zeros(length(states(pomdp))) for a in actions(pomdp)]
    for iter in 1:100
        new_alphas = deepcopy(alphas)
        for (ai, a) in enumerate(actions(pomdp))
            for (si, s) in enumerate(states(pomdp))
                obs_sum = 0.0
                for o in observations(pomdp)
                    # Best action per observation — knows o, not s'
                    best_per_obs = maximum(
                        sum(obs(pomdp,a,s2,o) * transition(pomdp,s,a,s2) * alphas[ai2][si2]
                            for (si2,s2) in enumerate(states(pomdp)))
                        for ai2 in eachindex(actions(pomdp)))
                    obs_sum += best_per_obs
                end
                new_alphas[ai][si] = reward(pomdp, s, a) + discount(pomdp) * obs_sum
            end
        end
        alphas = new_alphas
    end
    return AlphaVectorPolicy(pomdp, alphas)
end
FIB places the max over actions inside the sum over observations. What assumption does this correspond to?

Chapter 3: Fast Lower Bounds

Upper bounds tell us "we can't do better than this." We also need a floor: "we can at least achieve this." Lower bounds come from pessimistic policies — policies that commit to something suboptimal and compute that policy's value exactly.

The most straightforward lower bound is the blind policy lower bound. A blind policy commits to a single action a forever, completely ignoring all observations. It's suboptimal (hopefully), so its value is ≤ U*(b). The value of always taking action a is:

αa(k+1)(s) = R(s, a) + γ ∑s' T(s'|s, a) αa(k)(s')

Compare this to QMDP: same structure, but there's no max over actions. We're not picking the best future action; we're stuck with action a forever. The lower bound over all such blind policies is:

U̲(b) = maxa αa·b

Taking the max over the |A| blind-policy vectors gives the best achievable value under a no-observation policy. This is always ≤ U*(b) because the optimal policy always does at least as well as any blind policy.

Blind vs QMDP compared directly: QMDP update: R(s,a) + γ ∑s' T(s'|s,a) maxa' αa'(s'). Blind update: R(s,a) + γ ∑s' T(s'|s,a) αa(s') [same action a]. QMDP picks the best future action (upper bound). Blind commits to the current action (lower bound). This one structural difference flips us from upper to lower.
Blind Lower Bound: All Three Actions

Three lines = value of committing to {feed, ignore, listen} forever. The lower bound (dotted) is the max over all three. Click Iterate to converge the blind policies.

k = 0
The single-action lower bound as a seed: The simplest valid lower bound for any belief is a constant: U̲(b) = rmin/(1−γ). This is the worst possible return from any policy. For the crying baby: Rmin = −15 (baby is hungry, you ignore). So U̲ = −15/0.1 = −150. Very loose, but a valid starting point for PBVI initialization.
Why lower bounds matter beyond theory: (1) The gap U̅ − U̲ bounds the suboptimality of the greedy policy. If the gap is small, we're close to optimal. (2) Branch-and-bound online planners (Chapter 22) use lower bounds to prune entire action branches. (3) Point selection in PBVI targets beliefs where the gap is largest. Tight lower bounds directly enable better algorithms.
Why is the blind lower bound a valid lower bound on U*(b)?

Chapter 4: Point-Based Value Iteration

The upper bounds (QMDP, FIB) approximate the value function with |A| alpha vectors — very compact but crude. Can we compute a lower bound with many alpha vectors that together approximate U*(b) well everywhere? Yes. That's point-based value iteration (PBVI).

The key idea: pick a finite set of belief points B = {b1, ..., bm}. For each belief point b ∈ B, run a backup that computes the single best alpha vector for that belief given the current set Γ. Collect all the new alpha vectors into a new Γ. Repeat.

The backup at belief b has three steps. First, for each action a and observation o, find the alpha vector in Γ that best supports the updated belief Update(b, a, o):

αa,o = argmaxα ∈ Γ α · Update(b, a, o)

Second, combine these per-observation vectors into one alpha vector per action:

αa(s) = R(s, a) + γ ∑os' O(o|a, s') T(s'|s, a) αa,o(s')

Third, pick the action whose alpha vector best supports b:

α(b) = argmaxa αa · b

The resulting α(b) is added to Γ. This is a lower-bound operation: we can only pick from existing Γ for the future, so we never overestimate.

The geometry of PBVI: Each alpha vector αi defines a hyperplane over the belief simplex. The value function is the upper envelope of these hyperplanes: V(b) = maxi αi·b. Every backup adds a new hyperplane. The more backups, the tighter the approximation fits U*(b) from below. This is why PBVI provides a lower bound: you're always taking the max of an increasing set of linear functions.
PBVI Backup: Watch Alpha Vectors Accumulate

Orange lines = current alpha vectors (lower bound). X-axis: P(hungry). Each backup at a belief point adds a new line. Click Backup to run one PBVI update sweep. The envelope (max of all lines) rises toward U*.

Sweep 0 | Γ = 1 vector
Quality vs. belief point selection: PBVI's approximation is only as good as the belief set B. If B concentrates on one part of the simplex, the value function will be well-approximated there but potentially poor elsewhere. The standard strategy: start from the initial belief b0 and simulate random trajectories, collecting the visited beliefs. This focuses computation on actually reachable regions.
julia
function pbvi_backup(pomdp, belief, Γ)
    # For each action, find the best per-obs alpha and combine
    best_alpha = nothing
    best_val = -Inf
    for a in actions(pomdp)
        alpha_a = reward_vector(pomdp, a)  # R(s, a) for all s
        for o in observations(pomdp)
            b_ao = belief_update(pomdp, belief, a, o)
            # Best alpha in Γ for the updated belief
            α_ao = argmax(α -> dot(α, b_ao), Γ)
            # Accumulate: α_a(s) += γ * Σ_{s'} O(o|a,s') T(s'|s,a) α_ao(s')
            for (si, s) in enumerate(states(pomdp))
                for (si2, s2) in enumerate(states(pomdp))
                    alpha_a[si] += discount(pomdp) * obs(pomdp,a,s2,o) *
                                   transition(pomdp,s,a,s2) * α_ao[si2]
                end
            end
        end
        val = dot(alpha_a, belief)
        if val > best_val
            best_val = val; best_alpha = alpha_a
        end
    end
    return best_alpha
end

function pbvi(pomdp, B, n_iter)
    Γ = [blind_lower_bound_alpha(pomdp)]  # seed with blind LB
    for _ in 1:n_iter
        Γ_new = [pbvi_backup(pomdp, b, Γ) for b in B]
        Γ = Γ_new
    end
    return AlphaVectorPolicy(pomdp, Γ)
end
PBVI produces a lower bound on U*(b). Why?

Chapter 5: Randomized PBVI

Standard PBVI runs a backup at every belief in B every iteration. This is thorough but wasteful: some beliefs are easy to improve (their current alpha vector is far from optimal), while others are already well-supported. Can we do better by being selective?

Randomized point-based value iteration (the Perseus algorithm) does exactly this. Instead of systematically updating every belief, it repeatedly: (1) randomly selects an unimproved belief b from B, (2) runs a backup to get a new alpha vector, and (3) only adds it to Γ if it improves the value at some belief in B, not necessarily b.

Initialize
B = belief set, Γ = blind lower bound. Mark all b ∈ B as "unimproved."
Select
Randomly pick an unimproved b from B.
Backup
Compute α = backup(b, Γ). If α·b' > old best for any b' ∈ B, add α to Γ.
Mark improved
Mark all b' ∈ B where the new α improves the value as "improved."
↻ until all b ∈ B are improved, then start new round

The key insight: a single backup at b can improve the value at many other beliefs, since the resulting alpha vector is a global linear function. By only keeping vectors that actually improve something, Perseus keeps Γ smaller than standard PBVI — sometimes dramatically smaller.

Perseus vs. standard PBVI: Standard PBVI adds exactly |B| alpha vectors per round (one per backup). Perseus adds only as many as needed to improve all beliefs. If one backup at b1 happens to improve b2, b3, ..., bk as a side effect, no backups are needed at those beliefs. In practice, Perseus often converges with far fewer alpha vectors while achieving similar quality.
The convergence guarantee: Each round of randomized PBVI guarantees that every belief in B sees improvement in its value estimate. This is because we keep selecting unimproved beliefs until all are improved. The value function is monotonically non-decreasing, so convergence is guaranteed. The rate depends on how often backups produce globally useful alpha vectors.
Practical result: Perseus was the first solver to handle POMDPs with hundreds of states in reasonable time. On the Tag problem (870 states, 5 actions, 30 observations), Perseus found near-optimal policies in minutes while exact methods would have been infeasible. The randomized approach with its smaller Γ was the key enabler.
In randomized PBVI, a backup at belief b produces an alpha vector α. Under what condition is α added to Γ?

Chapter 6: Sawtooth Upper Bound

FIB provides a tight upper bound with |A| alpha vectors. Can we do better? Yes — using an entirely different representation called the sawtooth upper bound. Instead of alpha vectors, it uses belief-utility pairs {(bi, ui)} and piecewise-linear interpolation.

The construction starts from a simple observation: on the belief simplex, the corners (b = es, a unit vector concentrating all probability on state s) have known values: U(es) = us from FIB or QMDP. For any interior belief b, we can write b as a convex combination of corners: b = ∑s b(s) es.

For a specific belief-utility pair (bj, uj), the sawtooth interpolation gives an upper bound at any belief b:

Uj(b) = λj uj + (1 − λj) Ucorner(b)

where λj = mins: bj(s)>0 b(s)/bj(s) measures how "close" b is to bj relative to the corners, and Ucorner(b) = ∑s b(s) us is the linear interpolation from corner values alone.

The sawtooth upper bound at b is then the minimum of Uj(b) over all j. Each interior belief-utility pair "pulls down" the upper bound in its neighborhood. More pairs = tighter bound.

Why "sawtooth"? In 1D belief space (P(s1) from 0 to 1), each interior belief-utility pair creates a V-shaped dip in the upper bound surface. The pair at bj says: "I know the value here is exactly uj; the upper bound near bj must stay below a linear interpolation from uj and the corner values." The pointwise minimum of many such V-shapes gives a zigzag (sawtooth) profile.
Sawtooth for online planning: The sawtooth bound is particularly useful as the upper bound in online planners (Chapter 22). It provides a tighter upper bound than FIB, so branch-and-bound can prune more aggressively. The tradeoff: evaluating the sawtooth bound at a belief b requires O(|B|) comparisons, vs O(|A||S|) for FIB.
Upper Bound MethodRepresentationTightnessEval Cost
QMDP|A| alpha vectorsLooseO(|A||S|)
FIB|A| alpha vectorsMediumO(|A||S|)
Sawtooth|B| belief-utility pairsTightO(|B||S|)
julia
struct SawtoothUpperBound
    corner_vals::Vector{Float64}  # u_s for each state s (from FIB corners)
    belief_util_pairs::Vector{Tuple{Vector{Float64}, Float64}}
end

function sawtooth_eval(ub::SawtoothUpperBound, b::Vector{Float64})
    # Corner interpolation: U_corner(b) = Σ_s b(s) * u_s
    u_corner = dot(b, ub.corner_vals)
    # Minimum over all belief-utility pair interpolations
    u_min = u_corner  # trivially valid upper bound
    for (bj, uj) in ub.belief_util_pairs
        # λ_j = min_{s: bj(s)>0} b(s) / bj(s)
        λ = minimum(b[s] / bj[s] for s in eachindex(b) if bj[s] > 1e-9)
        λ = clamp(λ, 0.0, 1.0)
        u_j = λ * uj + (1 - λ) * u_corner
        u_min = min(u_min, u_j)
    end
    return u_min
end
The sawtooth upper bound uses the minimum over belief-utility pair interpolations. Why does taking the minimum give an upper bound (not a lower bound)?

Chapter 7: SHOWCASE — Bounds Converging Live

This is the payoff. Watch all five methods (QMDP upper bound, FIB upper bound, Sawtooth upper bound, Blind lower bound, and PBVI lower bound) compete on the Crying Baby POMDP. The true optimal U* lies between the upper and lower bounds. As you iterate, the gap shrinks. Use the controls to see how they interact.

Live Bounds on the Crying Baby POMDP

X-axis: P(hungry) from 0 to 1. Shaded band = guaranteed region containing U*. Click Iterate to advance one step. Click Auto to run continuously. Use sliders to control focus.

k = 0 | Gap at b=0.5: —
Speed300ms
What to notice: QMDP converges first (it's just MDP value iteration), but stays high — it ignores the cost of uncertainty. FIB drops below QMDP after a few iterations. The PBVI lower bound rises from below. The sawtooth upper bound (if populated with belief-utility pairs from PBVI) tracks between FIB and U*. The shaded region is the guaranteed gap containing U*. It shrinks with iterations but may never close completely.
The gap is largest at b ≈ 0.5: When P(hungry) = 0.5, the agent is maximally uncertain. This is where the choice of action matters most: listening gives crucial information, but feeding or ignoring has direct consequences. Upper bounds are loose here because they assume away the uncertainty. Lower bounds are loose because they commit to one action regardless. Near the corners (b close to 0 or 1), the agent is nearly certain and all bounds agree.
After many iterations, the gap between upper and lower bounds is largest near b = 0.5. What does this tell you about the value function there?

Chapter 8: Belief Point Selection

PBVI's quality depends entirely on the belief set B. A poor choice of belief points means the alpha vectors are well-tuned for beliefs the agent never actually visits. How do we choose B intelligently?

The simplest strategy is random belief expansion: start from the initial belief b0, simulate a random trajectory (random actions, random observations), and collect the resulting beliefs. Repeat many times. This automatically focuses B on beliefs reachable from b0.

Why reachability matters: The belief simplex has dimension |S|−1. For |S| = 20, that's a 19-dimensional space. Most of it is unreachable from b0. A uniform grid over the simplex would waste most computation on unreachable beliefs. Trajectory simulation naturally discovers the manifold of reachable beliefs.

More sophisticated strategies target beliefs where the current bounds are most uncertain. The greedy expansion algorithm adds the belief b* ∈ reach(B) that maximizes the gap U̅(b) − U̲(b) at each step:

b* = argmaxb ∈ reach(B) (U̅(b) − U̲(b))

Here reach(B) is the set of beliefs reachable from B in one step under any action and observation. This is the approach used in SARSOP (Successive Approximations of the Reachable Space under Optimal Policies).

Belief Point Distribution Matters

Left panel: 10 belief points chosen uniformly. Right panel: 10 points chosen by gap-greedy expansion from b0=0.5. The gap-greedy points cluster where the value function curves most. Click to compare PBVI quality.

Select a strategy to compare
SARSOP: the full picture: SARSOP combines sawtooth upper bound + PBVI lower bound + gap-greedy belief selection. At each step: (1) expand B with the highest-gap reachable belief, (2) run a PBVI backup to tighten the lower bound, (3) update the sawtooth upper bound with the new belief-utility pair, (4) prune branches where the lower bound already dominates the upper bound. This is one of the most effective offline POMDP solvers in practice.
Gap-greedy belief point selection adds the belief b* = argmax gap(b) at each step. What property of the value function does this target?

Chapter 9: Summary & Connections

We've covered the complete landscape of offline POMDP approximation methods. Here's the full picture in one table:

MethodBoundVectorsObs modelKey strength
QMDPUpper|A|NoFastest to compute
FIBUpper (tighter)|A|YesSlight tightness gain over QMDP
Blind LBLower|A|NoSimplest valid lower bound
PBVILower|B|YesScales with belief set size
Randomized PBVILower≤|B|YesSmaller Γ, same quality
SawtoothUpper (tightest)N/A (pairs)YesTightest upper bound
The big picture: Every offline method is a tradeoff between computation and quality. QMDP and FIB give quick, rough bounds. PBVI and randomized PBVI give increasingly good lower bounds by spending time at belief points. Sawtooth gives the tightest upper bound by exploiting the simplex structure. SARSOP — the state-of-the-art offline solver — combines all of these with smart belief selection.
Connection to Chapter 22 (Online Planning): The offline bounds computed here become the leaf evaluators for online planners. When online search reaches depth d, it needs to evaluate the belief at the leaf. QMDP is the cheapest evaluator. FIB is a bit more expensive but tighter. PBVI (if precomputed offline) gives near-optimal leaf estimates. The better the leaf evaluator, the shallower the online tree needs to be — this is the offline-online synergy.
When to use which method: For rapid prototyping on small problems: QMDP. For better bounds on medium problems: FIB + PBVI. For production-quality policies on large problems: randomized PBVI with gap-greedy belief selection (SARSOP). For online planning integration: precompute sawtooth upper bound + PBVI lower bound, then use in Chapter 22's branch-and-bound searcher.
SARSOP combines several methods. What does it use for its upper bound, lower bound, and belief selection strategy?