Beliefs are continuous states. Yet the optimal POMDP value function has a beautiful structure — piecewise-linear and convex — that makes exact computation possible. Alpha vectors, pruning, and value iteration over the belief simplex.
You know how to maintain a belief (Chapter 19). Now comes the hard part: making decisions based on that belief. In a fully-observable MDP, the policy is a mapping from states to actions π(s) → a. In a POMDP, states are hidden, so the policy must map beliefs to actions: π(b) → a.
The challenge: beliefs are continuous. Even when the state space is finite (|S| states), the belief space is a (|S|−1)-dimensional simplex — an infinite set. There is no lookup table. We cannot enumerate all beliefs and compute the value of each one.
Why does this matter? A PWLC function over the belief simplex can be exactly represented by a finite set of linear functions. This finite representation is what allows dynamic programming to work in belief space — even though the space itself is continuous.
To see the PWLC property intuitively: suppose we have two policies π1 and π2. At belief b, the value of π1 is Vπ1(b) = α1·b (linear in b). The value of π2 is α2·b (also linear). The optimal value function is max(α1·b, α2·b) — the upper envelope of two linear functions. This upper envelope is piecewise-linear (two pieces) and convex (max of linear functions is always convex). Adding more policies adds more linear pieces to the envelope. At any belief b, only the topmost piece (the dominating alpha vector) determines the optimal action.
Contrast this with fully-observable MDPs, where the value function is just a table V(s) with |S| entries. For POMDPs, even though the state space is finite (|S| states), the belief space is infinite (a continuous (|S|-1)-simplex). Yet the PWLC structure gives us a finite representation. This is remarkable: we go from a continuous optimization problem to one representable with a finite set of vectors. The price is that the set of vectors can grow large with the horizon.
| Property | Fully-observable MDP | POMDP (belief-state MDP) |
|---|---|---|
| State space | Finite {s1,...,sn} | Infinite (|S|-1)-simplex |
| Value function representation | Table: n numbers | Set of alpha vectors: |Γ|×|S| numbers |
| Bellman backup complexity | O(|S|2|A|) | O(|A||Γ||O||S|2|O|) |
| Policy representation | Table: one action per state | Alpha vector: one action per "region" of belief space |
| Convergence guarantee | Yes (polynomial) | Yes (but PSPACE-hard) |
The first key insight: any POMDP can be converted into an MDP where the "state" is the belief b. This belief-state MDP has the same action space as the original POMDP, but a continuous state space (the belief simplex).
The reward function in the belief-state MDP is the expected reward over the belief:
The transition function maps beliefs to beliefs through the update equation. Given b, action a, observation o, the posterior belief is:
The probability of receiving observation o from belief b and action a is:
With this formulation, the Bellman equation for the belief-state MDP is:
Let's work through a complete crying baby example to ground the abstract equations. States: {hungry=0, sated=1}. Belief b = [bhungry, bsated]. At b = [0.7, 0.3] (70% hungry):
Immediate reward for feeding: R(b, feed) = ∑s R(s, feed) × b(s) = R(hungry, feed) × 0.7 + R(sated, feed) × 0.3 = (−15) × 0.7 + (−5) × 0.3 = −10.5 − 1.5 = −12.
Transition to next belief under "feed": Feeding always sates, so T(sated|s, feed) = 1 for all s. The predicted belief after feeding (before observation): b¨ = [0, 1]. Then if we observe "crying" (P(crying|sated) = 0.1): b' = [0, 1] (still certain baby is sated — feeding overrides the observation model for transition, but observation still arrives). The P(crying|b, feed) = 0.7×0.1 + 0.3×0.1 = 0.1. Same probability regardless of starting belief, because feeding always reaches the sated state.
| Action | R(b=[0.7,0.3],a) | P(crying|b,a) | b' after crying | b' after quiet |
|---|---|---|---|---|
| feed | −12.0 | 0.10 | [0, 1] → [0.00, 1.00] | [0, 1] → [0.00, 1.00] |
| sing | −7.5 | 0.63 | ~[0.88, 0.12] | ~[0.38, 0.62] |
| ignore | −7.0 | 0.59 | ~[0.85, 0.15] | ~[0.30, 0.70] |
Before introducing alpha vectors, we need to understand what they represent. A conditional plan is a tree-structured policy: at each node, the agent takes an action; at each edge, the agent branches based on the observation received. The plan specifies what to do at every possible observation sequence.
Formally, a plan π has:
A 1-step plan is just an action a. A 2-step plan takes action a, then branches: after o1, do subplan π1; after o2, do subplan π2; etc. The total number of h-step plans grows rapidly:
| Horizon h | |A|=2, |O|=2 | |A|=3, |O|=2 | |A|=3, |O|=3 | |A|=4, |O|=3 |
|---|---|---|---|---|
| 1 | 2 | 3 | 3 | 4 |
| 2 | 23=8 | 33=27 | 34=81 | 45=1,024 |
| 3 | 27=128 | 313≈1.6M | 340≈1019 | 421≈4×1012 |
| 4 | 215≈33K | 340≈1019 | Astronomical | Astronomical |
The conditional plan count grows as |A| to the power of a geometric series in |O|. For the crying baby (|A|=3, |O|=2), the plan count at horizon 3 is already 313 ≈ 1.6 million. Only a tiny fraction of these plans are non-dominated — POMDP value iteration with pruning discovers this fraction efficiently.
Here is the complete TIGER POMDP specified in Python, followed by the same value iteration code from Chapter 6 applied to it:
python import numpy as np from itertools import product # TIGER POMDP: states={tiger-left=0, tiger-right=1} # actions={listen=0, open-left=1, open-right=2} # observations={hear-left=0, hear-right=1} nS, nA, nO, gamma = 2, 3, 2, 0.95 # Rewards R[a][s]: positive = good R = [[-1, -1], # listen: costs 1 regardless of tiger location [+10, -100], # open-left: +10 if tiger-right, -100 if tiger-left! [-100, +10]] # open-right: -100 if tiger-right, +10 if tiger-left # Transitions T[a][s][s']: opening a door resets tiger uniformly T = [[[1,0],[0,1]], # listen: tiger stays [[0.5,0.5],[0.5,0.5]],# open-left: reset uniformly [[0.5,0.5],[0.5,0.5]]] # open-right: reset uniformly # Observations O[a][s'][o]: noisy sensor O = [[[0.85,0.15],[0.15,0.85]], # listen: 85% correct [[0.5,0.5],[0.5,0.5]], # open-left: uninformative after reset [[0.5,0.5],[0.5,0.5]]] # open-right: uninformative after reset # Initialize with 1-step alpha vectors Gamma = [{'v': np.array([float(R[a][s]) for s in range(nS)]), 'action': a} for a in range(nA)] def expand(G): result = [] for a in range(nA): for combo in product(range(len(G)), repeat=nO): alpha = np.array([R[a][s] + gamma * sum(T[a][s][sp] * sum(O[a][sp][o] * G[combo[o]]['v'][sp] for o in range(nO)) for sp in range(nS)) for s in range(nS)]) result.append({'v': alpha, 'action': a}) return result def prune(cands, STEPS=100): dominated = [False] * len(cands) for i, c in enumerate(cands): if dominated[i]: continue is_dom = True for t in range(STEPS+1): b = np.array([t/STEPS, 1-t/STEPS]) vi = c['v'] @ b mx = max((cands[j]['v']@b for j in range(len(cands)) if j!=i and not dominated[j]), default=-1e9) if vi >= mx - 1e-9: is_dom = False; break if is_dom: dominated[i] = True return [c for c,d in zip(cands, dominated) if not d] for h in range(2, 8): Gamma = prune(expand(Gamma)) V_b05 = max(g['v']@np.array([0.5,0.5]) for g in Gamma) print(f"h={h}: {len(Gamma)} vectors V(b=[0.5,0.5])={V_b05:.2f}") # h=2: 3 vectors V=-0.97 # h=3: 4 vectors V=-0.97 # h=7: 7 vectors V=-0.61 # (listen-first policy has value improving as horizon grows)
The TIGER problem (|A|=3, |O|=2) demonstrates why conditional plans must be trees, not sequences. After "listen," the observation tells you something about which door the tiger is behind. The subsequent action (listen again, open left, open right) must depend on what was heard. A flat sequence of actions cannot capture this conditionality — a decision tree structure is essential. This is why POMDPs require tree-structured plans rather than simple action sequences like MDPs.
The utility of plan π starting from state s is computed recursively:
And the utility of plan π from belief b is simply the expectation over states:
where απ is a vector with components Uπ(s). This is the key: the utility of a plan is a linear function of the belief. This linearity is the foundation of alpha vectors.
Let's trace through a concrete 2-step plan for the crying baby to see how the tree structure maps to an alpha vector. Crying baby states: {hungry=0, sated=1}. Plan: "sing at root; if cry → feed; if quiet → ignore." Here is the plan as a tree:
The leaf plans have alpha vectors: αfeed = [−15, −5] and αignore = [−10, 0]. Applying the recursive formula for s=hungry:
T(hungry|hungry,sing)=1, T(sated|hungry,sing)=0. O(cry|sing,hungry)=0.9, O(quiet|sing,hungry)=0.1:
For s=sated: T(sated|sated,sing)=0.9, T(hungry|sated,sing)=0.1. O(cry|sing,sated)=0, O(quiet|sing,sated)=1:
Alpha vector for this 2-step plan: απ = [−23.55, −1.805]. Compared to the 1-step "sing" vector [−10.5, −0.5], this plan looks worse at b=[0.5,0.5]: −23.55×0.5+(−1.805)×0.5 = −12.68 vs −10.5×0.5+(−0.5)×0.5 = −5.5. But the 2-step plan is for a 2-step problem and correctly accounts for future behavior, while the 1-step plan cannot plan ahead. The comparison only makes sense within the same horizon.
A 2-action, 2-observation POMDP. Each level adds one time step; each node is an action; edges are labeled by observation. Watch exponential growth. Teal nodes = action a1; orange nodes = action a2.
Each conditional plan π induces an alpha vector απ ∈ R|S| whose s-th component is Uπ(s) — the expected utility of the plan starting from state s. The utility from belief b is the dot product:
This is a linear function of b — it's just a weighted average of the plan's per-state utilities. Since the optimal value function is the maximum over all plans:
V*(b) is the upper envelope of a (potentially infinite) set of linear functions. The upper envelope of linear functions is always piecewise-linear and convex (PWLC).
We represent the value function compactly as a finite set Γ of alpha vectors:
The alpha vector policy extracts actions from this representation: at belief b, the optimal action is the action associated with the dominating alpha vector arg maxα ∈ Γ α⊤b.
| Object | Mathematical form | Geometric interpretation |
|---|---|---|
| One conditional plan | απ⊤ b | A hyperplane over belief simplex |
| Optimal value function | maxπ απ⊤ b | Upper envelope of hyperplanes (PWLC) |
| Alpha vector set Γ | {α1, …, αk} with actions | Finite set of hyperplanes approximating V* |
| Policy at belief b | action(α*) where α* = arg max α⊤b | Which hyperplane is highest at b |
Here is a complete worked computation of one alpha vector for the crying baby at horizon h=2. 1-step alpha vectors (horizon h=1): α1feed = [R(hungry,feed), R(sated,feed)] = [−15, −5]. Similarly: α1sing = [−10.5, −0.5], α1ignore = [−10, 0].
Now expand "ignore at root, observe crying → feed, observe quiet → ignore". The alpha vector formula:
For s=hungry (s=0): T(hungry|hungry,ign)=1, T(sated|hungry,ign)=0. So only s'=hungry contributes:
For s=sated (s=1): T(sated|sated,ign)=0.9, T(hungry|sated,ign)=0.1:
The h=2 alpha vector for this plan is [−22.6, −1.665]. At belief b = [0.7, 0.3] (70% hungry):
Compare to the h=1 "ignore" vector [−10, 0]·[0.7,0.3] = −7.0. The 2-step value is more negative because it properly accounts for the cost of continuing to manage the baby over the full horizon. Both are discount-corrected, so these values are directly comparable.
For a 2-state POMDP, the belief simplex is just the interval [0,1] — parameterized by b = P(s1). Every alpha vector α = [α(s1), α(s2)] defines a line:
At b=0 (certain state s2), the value is α(s2). At b=1 (certain state s1), the value is α(s1). In between, the value interpolates linearly. The optimal value function is the upper envelope — the maximum over all alpha vectors at each belief point.
An alpha vector α contributes to the policy only where it lies on this upper envelope. Where it falls below another alpha vector, it is dominated and contributes nothing. The belief region where α dominates is the set of beliefs where α is the best plan.
For a 2-state POMDP (belief b = [b1, 1−b1]), the crossover point between two alpha vectors α and α' can be computed analytically by setting their values equal:
Example: αignore = [−10, 0] and αfeed = [−3.7, −15] at horizon h=2. Crossover:
So: for b(hungry) < 0.70, the "ignore" alpha vector dominates (ignore is optimal). For b(hungry) > 0.70, the "feed" alpha vector dominates (feed is optimal). This crossover belief is the decision boundary — the threshold at which the optimal action switches. As the horizon h increases, this threshold typically decreases (we feed sooner, because we have more future steps to benefit from the sated state).
Each thin colored line is an alpha vector. The bold orange curve is the optimal value function (upper envelope). Click Add Vector to add random alpha vectors. Notice how dominated vectors never touch the envelope.
The TIGER problem is another canonical 2-state POMDP. A tiger is behind one of two doors (left or right). You can listen (get a noisy clue) or open a door. The optimal policy cycles: listen repeatedly until confident, then open the correct door. With pure alpha vectors:
| Action | Reward if right guess | Reward if wrong guess | After action |
|---|---|---|---|
| Listen | −1 (cost of listening) | −1 (symmetric) | Belief updates; tiger is still there |
| Open Left | +10 (tiger on right) | −100 (tiger on left!) | Episode resets, belief back to uniform |
| Open Right | +10 (tiger on left) | −100 (tiger on right!) | Episode resets, belief back to uniform |
The TIGER problem has been used to benchmark every POMDP algorithm since Cassandra et al. (1994). Its optimal policy is non-trivial: if you start at b = [0.5, 0.5] (maximum uncertainty), the optimal action is "listen" even though that costs −1 immediately. The alpha vectors correctly encode this: the "listen" alpha vector dominates at b ≈ [0.5, 0.5] because the information gain over future steps outweighs the immediate cost. This is the power of the PWLC value function — it automatically prices information correctly.
The crossover belief point where "open left" becomes optimal can be computed analytically. Setting αlisten·b = αopen-left·b and solving for b1 = P(tiger-left) gives the threshold above which opening left pays off. This threshold moves with the horizon h — at short horizons, you open sooner (less time to benefit from information); at long horizons, you listen more.
After generating candidate alpha vectors, we need to identify and remove dominated vectors. An alpha vector α is dominated if there is no belief b where it achieves the highest value. Dominated vectors can be removed without affecting the policy at any belief.
To check if α is dominated by the set Γ ∖ {α}, we solve a linear program:
If the optimal δ is negative: α is dominated (no belief makes it optimal). If δ ≥ 0: α is not dominated and the LP also gives us the belief b where α is maximally useful.
Here is a concrete numeric example. 2-state POMDP with belief b = [b1, 1−b1]. Three candidate alpha vectors after one expansion step:
To check if α3 is dominated, we ask: is there any b1 ∈ [0,1] where α3 exceeds both α1 and α2?
The intersection is b1 ∈ [0, 0.4]. At b1=0.2 for example: α3·b = −7 > −10×0.2−5×0.8=−6. Wait — −7 < −6, so α3 is actually beaten by α1 at b1=0.2 too. Let's check b1=0 (certain s2): α1·b=−5, α2·b=−12, α3·b=−7. Here α1 wins. At b1=1 (certain s1): α1·b=−10, α2·b=−4, α3·b=−7. Here α2 wins. So α3 is dominated everywhere — the LP would return δ < 0 — and can be pruned.
POMDP value iteration incrementally builds the optimal value function by expanding plans one step at a time. Starting from 1-step plans (one per action), it generates 2-step plans, 3-step plans, etc., pruning after each expansion.
The key formula: given a set Γ of alpha vectors from the (h-1)-step problem, the new alpha vector for action a with subplan assignment (αo1, αo2, …) is:
This reuses alpha vectors from the previous iteration rather than recursively evaluating full conditional plan trees. Each new alpha vector takes O(|S|2|O|) time to compute. The total expansion generates |A|×|Γ||O| candidates.
pseudocode # POMDP Value Iteration (Exact) # Initialize with 1-step plans Γ ← { αa(s) = R(s,a) for each action a } for h = 2, 3, ..., H: candidates ← {} for each action a: for each assignment (αo1, ..., αo|O|) from Γ: compute αnew(s) = R(s,a) + γ ∑s' T(s'|s,a) ∑o O(o|a,s') αo(s') candidates ← candidates ∪ {αnew} Γ ← prune(candidates) # remove dominated vectors return Γ # represents V* at horizon H
The formula αnew(s) = R(s,a) + γ ∑s' T(s'|s,a) ∑o O(o|a,s') αo(s') is the POMDP analog of the Bellman backup in MDPs. Compare:
| MDP (fully observable) | POMDP (partially observable) | |
|---|---|---|
| State | s (known exactly) | b (belief, probability distribution) |
| Value function | V(s) — table, one value per state | V(b) — PWLC, represented by alpha vectors |
| Bellman backup | V(s) = maxa[R(s,a) + γ∑s'T(s'|s,a)V(s')] | αnew(s) = R(s,a) + γ∑s'T(s'|s,a)∑oO(o|a,s')αo(s') |
| Value at belief | V(s) (state fully observed) | V(b) = maxα∈Γ α·b (dot product with belief) |
| Policy | π(s) = arg maxa Q(s,a) | π(b) = action(arg maxα∈Γ α·b) |
| Convergence | Polynomial in |S|, |A| | PSPACE-hard in general |
The POMDP Bellman backup must sum over observations because the observation received after action a determines which subplan to execute. There is no single "next state" to look up — instead, we must anticipate all possible observations and the corresponding updated beliefs, then choose the best subplan assignment for each.
Here is a complete Python implementation of POMDP value iteration for the crying baby problem. Study how the expand and prune steps connect to the equations above:
python import numpy as np from itertools import product # Crying baby POMDP parameters # States: 0=hungry, 1=sated Actions: 0=feed,1=sing,2=ignore Obs: 0=crying,1=quiet R = [[-15, -5], [-10.5, -0.5], [-10, 0]] # R[a][s] T = [[[0, 1], [0, 1]], # feed: always sated [[1, 0], [0.1, 0.9]], # sing [[1, 0], [0.1, 0.9]]] # ignore O = [[[0.1, 0.9], [0.1, 0.9]], # O[a][s'][o] [[0.9, 0.1], [0, 1 ]], [[0.8, 0.2], [0.1, 0.9]]] gamma = 0.9 nS, nA, nO = 2, 3, 2 def expand(Gamma): """Generate new alpha vectors by combining prev. vectors.""" candidates = [] for a in range(nA): # All ways to assign a subplan from Gamma to each observation for subplans in product(range(len(Gamma)), repeat=nO): alpha = np.zeros(nS) for s in range(nS): val = R[a][s] for sp in range(nS): val += gamma * T[a][s][sp] * sum( O[a][sp][o] * Gamma[subplans[o]]['v'][sp] for o in range(nO)) alpha[s] = val candidates.append({'v': alpha, 'action': a}) return candidates def prune(candidates, n_beliefs=50): """Remove dominated alpha vectors via grid scan.""" dominated = [False] * len(candidates) for i, alpha in enumerate(candidates): if dominated[i]: continue is_dominated = True for t in range(n_beliefs + 1): b1 = t / n_beliefs b = np.array([b1, 1 - b1]) vi = alpha['v'] @ b max_other = max((candidates[j]['v'] @ b for j in range(len(candidates)) if j != i and not dominated[j]), default=-1e9) if vi >= max_other - 1e-9: is_dominated = False; break if is_dominated: dominated[i] = True return [c for c, d in zip(candidates, dominated) if not d] # Initialize: 1-step plans (one per action) Gamma = [{'v': np.array([R[a][s] for s in range(nS)]), 'action': a} for a in range(nA)] print(f"h=1: {len(Gamma)} alpha vectors") for h in range(2, 9): Gamma = prune(expand(Gamma)) print(f"h={h}: {len(Gamma)} alpha vectors after pruning") # Output: h=1: 3, h=2: 4, h=3: 4, h=4: 4, ... (stays small!)
A subtle but important implementation note: the expansion step must enumerate all assignments of previous-iteration alpha vectors to observations. For |O|=2 observations and |Γ|=4 vectors, this gives 42=16 assignments per action. For |O|=3 and |Γ|=6, this gives 63=216 per action. The number explodes quickly, but crucially, each candidate is computed independently in O(|S|2|O|) time. Therefore expansion can be parallelized trivially: compute all candidates concurrently, then prune serially. Modern GPU-accelerated POMDP solvers exploit this structure.
The pruning step is the bottleneck for large problems. The LP-based pruning runs a separate linear program for each candidate, each taking O(|S|3) via standard simplex or interior-point methods. Total pruning cost: O(candidates × |S|3). The grid-scan approximation (used in our implementation) replaces each LP with a scan over n_beliefs=50–100 grid points, reducing cost to O(candidates × n_beliefs × |Γ|) but potentially missing some non-dominated vectors.
The POMDP value iteration algorithm we've developed is a specific instance of the more general dynamic programming framework. The Bellman optimality equations for POMDPs are:
The base case is V0*(b) = 0. The Bellman backup at each step is an operator T*. Applied repeatedly: Vh* = (T*)h V0*. For the infinite-horizon discounted case (γ < 1), this sequence converges to V∞* at an exponential rate: ||Vh* − V∞*||∞ ≤ γh/(1−γ) × ||V1* − V0*||∞. The same contraction rate as for MDPs — but each Bellman backup for POMDPs involves integrating over observations and maintaining the PWLC structure.
The table below tracks exact alpha vector counts for the crying baby problem. Compare the theoretical candidate count (before pruning) with the actual count after pruning:
| Horizon h | Candidates (|A|×|Γ||O|) | After pruning | Speedup factor |
|---|---|---|---|
| 1 | 3 (initial, no expansion) | 3 | 1× |
| 2 | 3 × 32 = 27 | 4 | 6.75× |
| 3 | 3 × 42 = 48 | 4 | 12× |
| 4 | 3 × 42 = 48 | 4 | 12× |
| 8 | 3 × 42 = 48 | 4–5 | ~10× |
For the crying baby, pruning is extremely effective — the non-dominated set stabilizes at 4–5 vectors. For more complex POMDPs (e.g., the TIGER problem with |S|=2, |A|=3, |O|=2), the alpha vector count can grow to hundreds by horizon 10. For problems with |S|=10 and |O|=5, exact methods become impractical beyond horizon 5–8 even with aggressive pruning.
Given a set of alpha vectors Γ representing V, we have two ways to extract a policy at belief b:
Direct extraction: Return the action of the alpha vector that dominates at b:
Fast: O(|Γ|×|S|). But only as good as the alpha vectors represent the current belief.
One-step lookahead: Evaluate all actions by looking one step ahead, using Γ for future values:
Slower: O(|A|×|O|×|Γ|×|S|). Better for short-horizon Γ used over longer horizons.
| Method | Computation | Quality | When to use |
|---|---|---|---|
| Direct (alpha arg max) | O(|Γ|×|S|) | Exact for the horizon computed | When VI ran for enough steps; tight real-time budget |
| One-step lookahead | O(|A|×|O|×|Γ|×|S|) | Better extrapolation beyond trained horizon | When Γ is from a shorter horizon; more computation budget |
| Multi-step lookahead | Exponential in lookahead depth | Approaches optimal as depth increases | Approximation to online POMDP solvers (Ch 22) |
Worked example from the text — crying baby at b = P(hungry)=0.5:
With alpha vectors αfeed = [−3.7, −15] and αignore = [−2, −21] (2-step plans, γ=0.9):
| Action | R(b,a) | γ·∑oP(o|b,a)VΓ(b') | Q(b,a) |
|---|---|---|---|
| feed | −10 | −1.80 | −11.8 ◀ |
| sing | −5.5 | −8.53 | −14.0 |
| ignore | −5 | −8.90 | −13.9 |
Feeding wins: the immediate cost (−10) plus feeding cost (−5) is offset by the excellent future (baby is now certainly sated, future rewards are near-zero cost). At b=0.5, the long-term benefit of feeding outweighs the one-step frugality of ignoring.
Here is the complete implementation of both policy extraction methods:
python import numpy as np def alpha_value(alpha_vecs, b): """V(b) = max over alpha vectors of alpha @ b.""" vals = [np.dot(alpha['v'], b) for alpha in alpha_vecs] return max(vals), np.argmax(vals) def direct_policy(alpha_vecs, b): """Direct: return action of the dominating alpha vector at b.""" _, best_idx = alpha_value(alpha_vecs, b) return alpha_vecs[best_idx]['action'] def belief_update(b, a, o, T, O): """b'(s') = eta * O(o|a,s') * sum_s T(s'|s,a) * b(s)""" n_states = len(b) bp = np.array([O[a][sp][o] * sum(T[a][s][sp] * b[s] for s in range(n_states)) for sp in range(n_states)]) total = bp.sum() return bp / total if total > 1e-12 else np.ones(n_states) / n_states def lookahead_policy(alpha_vecs, b, R, T, O, gamma): """One-step lookahead: evaluate Q(b,a) for each action.""" n_states = len(b) n_actions = len(R) n_obs = len(O[0][0]) best_a, best_Q = 0, -1e9 for a in range(n_actions): # Immediate expected reward r_ba = sum(R[s][a] * b[s] for s in range(n_states)) # Future value: sum over observations future = 0.0 for o in range(n_obs): # P(o | b, a) = sum_{s,s'} T(s'|s,a) * O(o|a,s') * b(s) p_o = sum(T[a][s][sp] * O[a][sp][o] * b[s] for s in range(n_states) for sp in range(n_states)) if p_o > 1e-12: bp = belief_update(b, a, o, T, O) v_bp, _ = alpha_value(alpha_vecs, bp) future += p_o * v_bp Q_ba = r_ba + gamma * future if Q_ba > best_Q: best_Q, best_a = Q_ba, a return best_a, best_Q
Watch POMDP value iteration run on the crying baby problem. The horizontal axis is P(hungry) ∈ [0,1]. Each colored line is an alpha vector, tagged by action (teal=feed, purple=sing, orange=ignore). The bold white line is the optimal value function (upper envelope).
At horizon h=1, there are 3 alpha vectors (one per action). At each step, new plans are generated and pruned. Notice how the "kink" in the value function — where the best action switches from "ignore" to "feed" — emerges as the horizon grows.
Horizontal axis: P(hungry). Each line = one alpha vector (one conditional plan). Bold upper envelope = optimal V(b). Click Step to advance one horizon. Watch: at h=1, "ignore" dominates everywhere. At longer horizons, "feed" takes over for high P(hungry).
The crying baby parameters:
| Reward R(s,a) | sated | hungry |
|---|---|---|
| feed | −5 | −15 |
| sing | −0.5 | −10.5 |
| ignore | 0 | −10 |
| P(crying|state,action) | sated | hungry |
|---|---|---|
| feed | 10% | 10% |
| sing | 0% | 90% |
| ignore | 10% | 80% |
Chapter 20 establishes the theoretical and computational foundations of exact POMDP planning. The core progression:
| Concept | Key result |
|---|---|
| Belief-State MDP | Any POMDP is equivalent to an MDP over beliefs. Belief is sufficient for optimal decisions. |
| Conditional Plans | Tree-structured policies. Number grows doubly exponentially with horizon. |
| Alpha Vectors | Each plan induces a linear utility function over belief space. Compact representation. |
| PWLC Property | The optimal value function is the upper envelope of alpha vectors — piecewise-linear and convex. |
| Pruning | Remove dominated alpha vectors via LP. Drastically reduces set size. |
| Value Iteration | Expand (generate candidates) + Prune = one iteration. Reuses previous alpha vectors. |
| Lookahead Policy | Extract actions by evaluating all (a, o) branches using the observation model. |
Here is a complete worked tracing of alpha vector policy extraction over 3 steps of the crying baby problem, assuming Γ is the converged set from POMDP value iteration. Starting belief b0 = [0.5, 0.5] (maximum uncertainty):
python # Assume Gamma has converged after enough iterations # Gamma = [{'v': alpha1, 'action': 'ignore'}, {'v': alpha2, 'action': 'feed'}] # alpha1 = [-10.0, 0.0] (ignore: good when sated, neutral when hungry) # alpha2 = [-3.7, -15.0] (feed: costly upfront but ensures sation) import numpy as np Gamma = [ {'v': np.array([-10.0, 0.0]), 'action': 'ignore'}, {'v': np.array([-3.7, -15.0]), 'action': 'feed'}, ] def extract_policy(Gamma, b): vals = [g['v'] @ b for g in Gamma] best = np.argmax(vals) return Gamma[best]['action'], vals[best] b = np.array([0.5, 0.5]) # b[0] = P(hungry), b[1] = P(sated) for step in range(3): action, value = extract_policy(Gamma, b) print(f"Step {step}: b=[{b[0]:.3f},{b[1]:.3f}] action={action} V={value:.2f}") # Simulate: ignore the baby, observe crying (most likely if uncertain) # T[ignore][hungry->hungry]=1, T[ignore][sated->hungry]=0.1 # Predict: b_pred = [1*b[0]+0.1*b[1], 0*b[0]+0.9*b[1]] b_pred = np.array([1.0*b[0]+0.1*b[1], 0.0*b[0]+0.9*b[1]]) # Update with crying: O(cry|ign,hungry)=0.8, O(cry|ign,sated)=0.1 unnorm = np.array([0.8*b_pred[0], 0.1*b_pred[1]]) b = unnorm / unnorm.sum() # Output: # Step 0: b=[0.500,0.500] action=ignore V=-5.00 # Step 1: b=[0.900,0.100] action=feed V=-4.83 (now very hungry, switch to feed!) # Step 2: b=[0.000,1.000] action=ignore V=0.00 (feeding guarantees sated)
Here is a high-level summary of the complete computational pipeline for solving a POMDP exactly:
pseudocode """ Complete POMDP exact solution pipeline ====================================== """ # 1. DEFINE the POMDP POMDP = { S: set of states, A: set of actions, O: set of observations, T: transition function T(s'|s,a), Obs: observation function O(o|a,s'), R: reward function R(s,a), gamma: discount factor, b0: initial belief } # 2. VERIFY tractability assert len(S) <= 20 and len(A) <= 4 and len(O) <= 5 # For larger problems, use Ch21 (PBVI) or Ch22 (POMCP) # 3. INITIALIZE: 1-step alpha vectors Gamma = [{action: a, vector: [R(s,a) for s in S]} for a in A] # 4. ITERATE: expand and prune for horizon in range(1, H+1): candidates = [] for a in A: for subplan_assignment in product(range(len(Gamma)), repeat=len(O)): alpha = expand_one(a, subplan_assignment, Gamma, T, Obs, R, gamma) candidates.append({action: a, vector: alpha}) Gamma = prune_dominated(candidates) # via LP or grid scan # 5. EXTRACT policy def policy(b): best = max(Gamma, key=lambda g: dot(g['vector'], b)) return best['action'] # 6. EXECUTE b = b0 while not done: a = policy(b) take action a, receive observation o b = belief_update(b, a, o, T, Obs)
The computational complexity of POMDP value iteration makes exact methods tractable only for small problems. The following table summarizes the scaling behavior:
| POMDP size | Horizon | Approximate time (modern hardware) | Feasibility |
|---|---|---|---|
| |S|=2, |A|=3, |O|=2 (crying baby) | h=20 | < 1 second | Trivially tractable |
| |S|=2, |A|=3, |O|=2 (TIGER) | h=20 | < 5 seconds | Tractable |
| |S|=5, |A|=4, |O|=3 | h=10 | Minutes to hours | Borderline |
| |S|=10, |A|=4, |O|=5 | h=10 | Days to years | Intractable exactly |
| |S|=100, |A|=4, |O|=10 | h=10 | Astronomical | Requires Ch 21–22 |
The three following chapters each address the scalability problem introduced here:
| Chapter | Method | Key idea | Trades off |
|---|---|---|---|
| Ch 21 | Point-Based VI (PBVI, SARSOP) | Restrict alpha expansion to a finite set of "reachable" belief points | Completeness for speed |
| Ch 22 | Online methods (POMCP, DESPOT) | Plan forward from the current belief via Monte Carlo tree search | Optimality for real-time performance |
| Ch 23 | Controller abstractions (FSCs) | Represent the policy as a finite-state controller (a fixed-size graph) | Exact representation for memory efficiency |
The PWLC value function has another important property: it is a universal approximator of belief-based utilities. Any policy π induces a linear utility function; the max over all policies gives a PWLC upper bound on all achievable utilities. This means the value function is not just a convenient representation — it is the tightest possible piecewise-linear convex bound on achievable value from any belief. No tighter bound (with the same number of pieces) can be achieved.
The number of "pieces" (non-dominated alpha vectors) at convergence characterizes the complexity of the optimal policy. Simple problems (crying baby) may converge to 2–5 vectors. Hard problems (large state spaces with rich observation structure) may require thousands of vectors even after pruning. The number of pieces is related to the problem's complexity — roughly, how many qualitatively different behaviors are needed at different beliefs.
To test your understanding: suppose you solve a crying baby POMDP exactly and obtain alpha vectors α1 = [−3.7, −15] (feed) and α2 = [−2.0, −21] (ignore). At belief b = [0.8, 0.2] (80% hungry):
α2 wins (−5.80 > −5.96): the optimal action is "ignore" at 80% probability of hunger. This seems counterintuitive! But the alpha vectors already encode the full future trajectory — "ignore" may be better here because singing (action 1) calms the baby down for future steps at lower cost than the immediate feeding penalty of −15. At b = [0.95, 0.05] (95% hungry): α1·b = −3.515−0.75=−4.265, α2·b = −1.9−1.05=−2.95. Now "feed" wins at very high hunger probability.
When does exact POMDP value iteration become intractable? The alpha vector count at horizon h is bounded by |A|(|O|h−1)/(|O|−1) before pruning — a double-exponential in h. Even with aggressive LP-based pruning, this can remain enormous. Here are concrete size estimates for the crying baby problem (|S|=2, |A|=3, |O|=2) vs. a small robot navigation problem (|S|=16, |A|=4, |O|=8):
| Problem | h=1 vectors | h=2 (before prune) | h=2 (after prune) | h=5 (feasible?) |
|---|---|---|---|---|
| Crying baby (|S|=2, |A|=3, |O|=2) | 3 | 3×32=27 | ≤10 | Yes (<100 vectors) |
| Grid world (|S|=16, |A|=4, |O|=4) | 4 | 4×44=1024 | ≤200 | Borderline |
| Robot nav (|S|=100, |A|=5, |O|=10) | 5 | 5×510≈5×107 | ≤104 after pruning | Intractable without approximation |
| Dialog system (|S|=500, |A|=10, |O|=20) | 10 | 10×1020≈1021 | Intractable before pruning | Never; use SARSOP or PBVI |
This is why Chapter 21 (offline belief planning) and Chapter 22 (online belief planning) exist. PBVI and SARSOP operate on a reachable subset of the belief simplex; POMCP uses Monte Carlo tree search to plan from the current belief without computing the full value function. Exact methods are the theoretical foundation; approximations are the engineering necessity.
What exact value iteration gives you, even when you can't run it. Understanding exact POMDP value iteration provides four benefits even if you always use approximate methods in practice:
Reading alpha vectors: what they reveal about optimal behavior. The shape of the alpha vector set encodes the structure of the optimal policy. For the two-state crying baby problem, after convergence you typically find three clusters of alpha vectors: one cluster with nearly equal values across both states (listen is good when uncertain), one with high value for the hungry state (feed is good when confident the baby is hungry), and one with high value for the sated state (do-nothing is good when confident the baby is sated). The crossover beliefs between these clusters tell you the decision boundaries.
Concretely, if the crossover between "listen" and "feed" is at b(hungry) = 0.80, that means: listen until you are 80% sure the baby is hungry, then feed. This is the optimal policy encoded geometrically in the alpha vector intersection. Any POMDP whose optimal policy has this "threshold" structure will exhibit a clean two-alpha-vector solution; more complex policies correspond to more alpha vectors and more crossover points.
| Alpha vector structure | Policy interpretation | Example POMDP |
|---|---|---|
| 1 vector (constant utility) | Same action is always optimal regardless of belief | Trivial: one action always dominates |
| 2 vectors (one crossover) | Threshold policy: action A for b<b* else action B | Simple 2-state problems |
| k vectors (k−1 crossovers) | Piecewise threshold policy with k regions | Crying baby at horizon h=3 |
| Many vectors (complex PWLC) | Rich policy that accounts for long-horizon information gathering | Tiger problem, robot navigation |
The tiger problem is particularly instructive: at convergence, the optimal policy has many alpha vectors corresponding to "listen long enough to become confident, then open the correct door." If you truncate value iteration early (too few iterations, or too small a pruning set), the policy degrades to "always open a random door" — the agent loses the information-gathering behavior entirely. This demonstrates that horizon matters fundamentally: short-horizon planning cannot reproduce information-gathering strategies.
Connecting belief planning to model-based RL. There is a deep connection between Chapters 16 and 20. In Chapter 16, the agent was uncertain about the model (T and R unknown). In Chapter 20, the agent is uncertain about the state (s unknown, T and R known). BAMDPs (Ch 16) are actually POMDPs where the observation is the reward signal and the hidden state is the (true state, model parameters) hyperstate. This unification means that belief planning techniques — alpha vectors, value iteration over the belief simplex — are in principle applicable to model uncertainty as well. In practice, the hyperstate space is far too large for exact POMDP methods, which is why PSRL uses sampling instead. But conceptually, the two chapters solve the same problem: optimal behavior under uncertainty via belief-state planning.
Summary of key results from this chapter: (1) The belief-state MDP is the canonical transformation that converts a POMDP into an MDP with a continuous (but lower-dimensional) state space. (2) The optimal value function over beliefs is piecewise-linear and convex, representable as a finite set of alpha vectors. (3) POMDP value iteration produces monotonically improving lower bounds on V*∞, converging at rate γh/(1−γ). (4) Pruning is essential: LP-based dominance testing removes vectors that never determine the optimal action anywhere in belief space. (5) The one-step lookahead policy extracts actions in O(|Gamma|×|S|) time from any belief, making deployment fast even when computing the alpha vectors was slow.
These five properties together mean that the PWLC representation is not just computationally convenient — it is the theoretically correct form for optimal POMDP value functions. The approximations in subsequent chapters (PBVI, SARSOP, POMCP) all sacrifice some aspect of this exactness in exchange for tractability on larger problems. Understanding what is being sacrificed, and when that sacrifice is acceptable, is the central design decision in applied POMDP solving.
The machinery in this chapter — belief updates, conditional plans, alpha vectors, value iteration with pruning — provides the complete mathematical foundation for all POMDP solution methods that follow. Chapters 21 and 22 are engineering variations on this foundation, trading off the size of the alpha vector set against the quality of the approximation. With this foundation solid, those chapters become straightforward extensions rather than independent leaps.
The belief is not an obstacle between you and the state — it IS the state. Alpha vectors are the map over this belief landscape. Now you know how to draw that map exactly.