When exact POMDP solutions are intractable, approximation saves the day. Learn to compute value-function bounds that squeeze toward optimality from both sides.
You have a robot that cannot see perfectly. You've modeled it as a POMDP. You fire up the exact solver from Chapter 20 and wait. And wait. After an hour it's still running. Even a modest problem with 30 states, 5 actions, and 8 observations can produce exponentially many alpha vectors per iteration. The exact method is theoretically sound but practically hopeless.
The core obstacle is intractability. Solving a POMDP exactly is PSPACE-complete for finite horizons and undecidable for infinite horizons in general. The number of alpha vectors can grow exponentially with the horizon, each backup step is expensive, and the belief simplex has no finite grid structure to exploit. Something has to give.
The approximation methods in this chapter form a hierarchy. Upper bounds start optimistically and get refined. Lower bounds start pessimistically and get improved. The hierarchy looks like this:
All of these methods share a common currency: alpha vectors. Each alpha vector α is a function from states to real numbers, and the value function at belief b is maxα ∈ Γ α·b. The methods differ in how they construct and update the set Γ.
Imagine you're playing poker with imperfect information. QMDP solves poker by assuming: "After I act once, I'll magically see all the cards." This is obviously wrong, but it gives a fast, principled upper bound that often works surprisingly well in practice.
Formally, QMDP computes the MDP Q-function — the value of taking action a from state s in the fully observable version of the problem — and then uses it to score beliefs. The update for QMDP iterates the MDP Bellman equation:
Notice: no observation model O(o|a,s') appears anywhere. QMDP acts as if observing the state after one step costs nothing. The policy at belief b picks: π(b) = argmaxa αa·b.
Why it's fast: QMDP produces exactly |A| alpha vectors (one per action). Each iteration is just MDP value iteration: O(|A|²|S|²) work. Run until convergence, done.
Why it's an upper bound: Perfect future observability is always at least as good as partial observability. So QMDP's value ≥ true value. This is the classic "clairvoyant" argument.
3-action POMDP: ignore, feed, listen. 2 states: hungry, not hungry. Click Iterate to run one QMDP update. Watch the two alpha vectors shift toward the MDP solution.
| Property | QMDP |
|---|---|
| Alpha vectors produced | |A| (one per action) |
| Iterations to convergence | Same as MDP value iteration |
| Observation model used | No |
| Bound type | Upper bound on U* |
| Best suited for | Problems where uncertainty is easily resolved |
julia (Kochenderfer style) function qmdp(pomdp) # Initialize: one alpha vector per action, all zeros alphas = [zeros(length(states(pomdp))) for a in actions(pomdp)] for iter in 1:100 new_alphas = deepcopy(alphas) for (ai, a) in enumerate(actions(pomdp)) for (si, s) in enumerate(states(pomdp)) future = 0.0 for (si2, s2) in enumerate(states(pomdp)) # MDP Bellman: max over actions, no observations best = maximum(alphas[ai2][si2] for ai2 in eachindex(actions(pomdp))) future += transition(pomdp, s, a, s2) * best end new_alphas[ai][si] = reward(pomdp, s, a) + discount(pomdp) * future end end alphas = new_alphas end return AlphaVectorPolicy(pomdp, alphas) end
QMDP assumes you'll know the state after one step. The fast informed bound (FIB) makes a weaker assumption: after one step, you'll know the observation, but not the full state. This is still optimistic (you might get an uninformative observation), but it's closer to reality.
The difference in the update rule makes this precise. Where QMDP has the max outside the sum over s' (assuming you know s'), FIB has the max inside the sum over observations (assuming you know o but not s'):
QMDP update:
Max over a' is outside everything. You "know" s'.
FIB update:
Max over a' is per observation. You only "know" o.
Knowing o is strictly weaker than knowing s'. Therefore FIB ≤ QMDP at every belief point. FIB is still an upper bound on U*, but it's tighter. Both methods maintain exactly |A| alpha vectors, so the computational cost is similar: O(|A|²|S|²|O|) per iteration for FIB vs O(|A|²|S|²) for QMDP.
Crying baby POMDP. X-axis: P(hungry). Blue = QMDP upper bound (looser). Teal = FIB upper bound (tighter). Click Iterate Both to watch them converge from above. Notice FIB always stays below QMDP.
julia function fib(pomdp) alphas = [zeros(length(states(pomdp))) for a in actions(pomdp)] for iter in 1:100 new_alphas = deepcopy(alphas) for (ai, a) in enumerate(actions(pomdp)) for (si, s) in enumerate(states(pomdp)) obs_sum = 0.0 for o in observations(pomdp) # Best action per observation — knows o, not s' best_per_obs = maximum( sum(obs(pomdp,a,s2,o) * transition(pomdp,s,a,s2) * alphas[ai2][si2] for (si2,s2) in enumerate(states(pomdp))) for ai2 in eachindex(actions(pomdp))) obs_sum += best_per_obs end new_alphas[ai][si] = reward(pomdp, s, a) + discount(pomdp) * obs_sum end end alphas = new_alphas end return AlphaVectorPolicy(pomdp, alphas) end
Upper bounds tell us "we can't do better than this." We also need a floor: "we can at least achieve this." Lower bounds come from pessimistic policies — policies that commit to something suboptimal and compute that policy's value exactly.
The most straightforward lower bound is the blind policy lower bound. A blind policy commits to a single action a forever, completely ignoring all observations. It's suboptimal (hopefully), so its value is ≤ U*(b). The value of always taking action a is:
Compare this to QMDP: same structure, but there's no max over actions. We're not picking the best future action; we're stuck with action a forever. The lower bound over all such blind policies is:
Taking the max over the |A| blind-policy vectors gives the best achievable value under a no-observation policy. This is always ≤ U*(b) because the optimal policy always does at least as well as any blind policy.
Three lines = value of committing to {feed, ignore, listen} forever. The lower bound (dotted) is the max over all three. Click Iterate to converge the blind policies.
The upper bounds (QMDP, FIB) approximate the value function with |A| alpha vectors — very compact but crude. Can we compute a lower bound with many alpha vectors that together approximate U*(b) well everywhere? Yes. That's point-based value iteration (PBVI).
The key idea: pick a finite set of belief points B = {b1, ..., bm}. For each belief point b ∈ B, run a backup that computes the single best alpha vector for that belief given the current set Γ. Collect all the new alpha vectors into a new Γ. Repeat.
The backup at belief b has three steps. First, for each action a and observation o, find the alpha vector in Γ that best supports the updated belief Update(b, a, o):
Second, combine these per-observation vectors into one alpha vector per action:
Third, pick the action whose alpha vector best supports b:
The resulting α(b) is added to Γ. This is a lower-bound operation: we can only pick from existing Γ for the future, so we never overestimate.
Orange lines = current alpha vectors (lower bound). X-axis: P(hungry). Each backup at a belief point adds a new line. Click Backup to run one PBVI update sweep. The envelope (max of all lines) rises toward U*.
julia function pbvi_backup(pomdp, belief, Γ) # For each action, find the best per-obs alpha and combine best_alpha = nothing best_val = -Inf for a in actions(pomdp) alpha_a = reward_vector(pomdp, a) # R(s, a) for all s for o in observations(pomdp) b_ao = belief_update(pomdp, belief, a, o) # Best alpha in Γ for the updated belief α_ao = argmax(α -> dot(α, b_ao), Γ) # Accumulate: α_a(s) += γ * Σ_{s'} O(o|a,s') T(s'|s,a) α_ao(s') for (si, s) in enumerate(states(pomdp)) for (si2, s2) in enumerate(states(pomdp)) alpha_a[si] += discount(pomdp) * obs(pomdp,a,s2,o) * transition(pomdp,s,a,s2) * α_ao[si2] end end end val = dot(alpha_a, belief) if val > best_val best_val = val; best_alpha = alpha_a end end return best_alpha end function pbvi(pomdp, B, n_iter) Γ = [blind_lower_bound_alpha(pomdp)] # seed with blind LB for _ in 1:n_iter Γ_new = [pbvi_backup(pomdp, b, Γ) for b in B] Γ = Γ_new end return AlphaVectorPolicy(pomdp, Γ) end
Standard PBVI runs a backup at every belief in B every iteration. This is thorough but wasteful: some beliefs are easy to improve (their current alpha vector is far from optimal), while others are already well-supported. Can we do better by being selective?
Randomized point-based value iteration (the Perseus algorithm) does exactly this. Instead of systematically updating every belief, it repeatedly: (1) randomly selects an unimproved belief b from B, (2) runs a backup to get a new alpha vector, and (3) only adds it to Γ if it improves the value at some belief in B, not necessarily b.
The key insight: a single backup at b can improve the value at many other beliefs, since the resulting alpha vector is a global linear function. By only keeping vectors that actually improve something, Perseus keeps Γ smaller than standard PBVI — sometimes dramatically smaller.
FIB provides a tight upper bound with |A| alpha vectors. Can we do better? Yes — using an entirely different representation called the sawtooth upper bound. Instead of alpha vectors, it uses belief-utility pairs {(bi, ui)} and piecewise-linear interpolation.
The construction starts from a simple observation: on the belief simplex, the corners (b = es, a unit vector concentrating all probability on state s) have known values: U(es) = us from FIB or QMDP. For any interior belief b, we can write b as a convex combination of corners: b = ∑s b(s) es.
For a specific belief-utility pair (bj, uj), the sawtooth interpolation gives an upper bound at any belief b:
where λj = mins: bj(s)>0 b(s)/bj(s) measures how "close" b is to bj relative to the corners, and Ucorner(b) = ∑s b(s) us is the linear interpolation from corner values alone.
The sawtooth upper bound at b is then the minimum of Uj(b) over all j. Each interior belief-utility pair "pulls down" the upper bound in its neighborhood. More pairs = tighter bound.
| Upper Bound Method | Representation | Tightness | Eval Cost |
|---|---|---|---|
| QMDP | |A| alpha vectors | Loose | O(|A||S|) |
| FIB | |A| alpha vectors | Medium | O(|A||S|) |
| Sawtooth | |B| belief-utility pairs | Tight | O(|B||S|) |
julia struct SawtoothUpperBound corner_vals::Vector{Float64} # u_s for each state s (from FIB corners) belief_util_pairs::Vector{Tuple{Vector{Float64}, Float64}} end function sawtooth_eval(ub::SawtoothUpperBound, b::Vector{Float64}) # Corner interpolation: U_corner(b) = Σ_s b(s) * u_s u_corner = dot(b, ub.corner_vals) # Minimum over all belief-utility pair interpolations u_min = u_corner # trivially valid upper bound for (bj, uj) in ub.belief_util_pairs # λ_j = min_{s: bj(s)>0} b(s) / bj(s) λ = minimum(b[s] / bj[s] for s in eachindex(b) if bj[s] > 1e-9) λ = clamp(λ, 0.0, 1.0) u_j = λ * uj + (1 - λ) * u_corner u_min = min(u_min, u_j) end return u_min end
This is the payoff. Watch all five methods (QMDP upper bound, FIB upper bound, Sawtooth upper bound, Blind lower bound, and PBVI lower bound) compete on the Crying Baby POMDP. The true optimal U* lies between the upper and lower bounds. As you iterate, the gap shrinks. Use the controls to see how they interact.
X-axis: P(hungry) from 0 to 1. Shaded band = guaranteed region containing U*. Click Iterate to advance one step. Click Auto to run continuously. Use sliders to control focus.
PBVI's quality depends entirely on the belief set B. A poor choice of belief points means the alpha vectors are well-tuned for beliefs the agent never actually visits. How do we choose B intelligently?
The simplest strategy is random belief expansion: start from the initial belief b0, simulate a random trajectory (random actions, random observations), and collect the resulting beliefs. Repeat many times. This automatically focuses B on beliefs reachable from b0.
More sophisticated strategies target beliefs where the current bounds are most uncertain. The greedy expansion algorithm adds the belief b* ∈ reach(B) that maximizes the gap U̅(b) − U̲(b) at each step:
Here reach(B) is the set of beliefs reachable from B in one step under any action and observation. This is the approach used in SARSOP (Successive Approximations of the Reachable Space under Optimal Policies).
Left panel: 10 belief points chosen uniformly. Right panel: 10 points chosen by gap-greedy expansion from b0=0.5. The gap-greedy points cluster where the value function curves most. Click to compare PBVI quality.
We've covered the complete landscape of offline POMDP approximation methods. Here's the full picture in one table:
| Method | Bound | Vectors | Obs model | Key strength |
|---|---|---|---|---|
| QMDP | Upper | |A| | No | Fastest to compute |
| FIB | Upper (tighter) | |A| | Yes | Slight tightness gain over QMDP |
| Blind LB | Lower | |A| | No | Simplest valid lower bound |
| PBVI | Lower | |B| | Yes | Scales with belief set size |
| Randomized PBVI | Lower | ≤|B| | Yes | Smaller Γ, same quality |
| Sawtooth | Upper (tightest) | N/A (pairs) | Yes | Tightest upper bound |