Ch 16: Partially Observable MDPs — Probabilistic Robotics

Chapter 0: From Knowing to Believing

Chapter 15 gave us a powerful tool: value iteration for MDPs. A robot could plan optimal actions by computing a value function over states and following the gradient. But there was a hidden assumption lurking beneath every equation.

MDPs assume full observability — the robot knows its exact state at all times. In real robotics, this is almost never true. A mobile robot in a symmetric hallway doesn't know which direction it's facing. A manipulator reaching into a bin can't see the exact pose of the target object. Sensors are noisy projections of the true state.

The key question of this chapter: Can a robot plan optimally when it doesn't know its own state? The answer is yes, but the robot must reason about its belief — its probability distribution over possible states — and plan in this belief space. This is the Partially Observable Markov Decision Process (POMDP) framework. It handles both types of uncertainty simultaneously: uncertain action effects AND uncertain perception. The price is a dramatic increase in computational complexity.

Consider Thrun's illustrative example: a robot in a nearly symmetric corridor. The goal is at one end, a dangerous pit at the other, and the two are perceptually indistinguishable from the middle (both entrances look the same). A distinguishing landmark exists in only one alcove. An MDP planner would say "go left" or "go right" depending on which state is better. But the robot doesn't know which end is which — it doesn't even know which direction it's facing. Charging toward the assumed goal risks a 50% chance of falling into the pit.

The optimal strategy? First move to a landmark that reveals the robot's orientation, then proceed safely to the goal. The robot must actively gather information even though it costs a detour. This is the hallmark of POMDP planning: balancing exploitation (moving toward the goal) with exploration (reducing uncertainty). The payoff from a guaranteed +100 reward far exceeds the expected payoff from a coin-flip between +100 and -100.

Crucially, solving this problem by "just considering all possible states and averaging" does not work. You might think: "Just solve two MDPs (one for each possible orientation), find the optimal policy for each, and mix them." But the averaged policy still charges toward the assumed goal in each case — neither individual MDP policy visits the landmark, because in each case the state is known. The information-gathering behavior only emerges when uncertainty itself is part of the state.

This distinction is subtle but profound. The POMDP doesn't just "hedge" between two MDP solutions. It discovers entirely new behaviors — like visiting landmarks, asking questions, or performing sensing actions — that no MDP would ever produce. The optimal plan in belief space can be qualitatively different from any plan in state space. This emergent information-gathering behavior is one of the most beautiful results in probabilistic robotics.

The textbook distinguishes three planning paradigms, each handling a different level of uncertainty:

Paradigm	Handles	Output
Classical planning	No uncertainty	Fixed action sequence
MDP (Ch 15)	Action uncertainty	State → action policy
POMDP (Ch 16)	Action + perception uncertainty	Belief → action policy

Check: Why can't an MDP planner handle a robot that doesn't know its own state?

MDPs assume full observability — the policy maps states to actions, but the robot only has a belief over states, not a known state MDPs can't handle stochastic transitions MDPs are too slow to compute

Chapter 1: The POMDP Framework

A Partially Observable Markov Decision Process extends the MDP with two additions: an observation model and the concept of belief.

Component	Symbol	Description
States	S	Set of all possible states (same as MDP)
Actions	A	Set of available actions (same as MDP)
Transitions	P(s'\|a,s)	Stochastic transition model (same as MDP)
Payoff	c(s)	Immediate reward at state s (same as MDP)
Observations	O	Set of possible observations (new)
Observation model	P(o\|s)	Probability of seeing o in state s (new)
Belief	b(s)	Probability distribution over states (new)

The robot never sees s directly. Instead, after taking action a and arriving in state s', it receives an observation o drawn from P(o|s'). Observations are noisy — the same state might produce different observations on different occasions, and different states might produce the same observation. This perceptual ambiguity is exactly what makes the problem "partially observable." The robot must reason probabilistically about which state could have produced the observations it has received so far.

The belief b(s) captures everything the robot knows about its state, updated via the familiar Bayes filter from Chapter 2:

b'(s') = η P(o'|s') Σ_s P(s'|a,s) b(s)

where η is the normalizing constant, ensuring the belief sums to 1 over all states. This is the exact same Bayes filter from Chapter 2, applied at every decision step. The belief is a sufficient statistic for the robot's history — it compresses all past actions and observations into a probability distribution that is all the robot needs to make optimal decisions.

POMDP = MDP in belief space. The crucial insight: since the robot can't observe states, it must plan over beliefs. A belief is a probability distribution — an entire vector of numbers. Planning in this space is vastly harder than planning over states, but it's the price of honest uncertainty.

The policy in a POMDP maps beliefs to actions, not states to actions:

π(b) → a

Given a belief b and an action a, the robot receives an observation o' and updates its belief via the Bayes filter. The new belief b' then determines the next action. This cycle is the POMDP control loop:

1. Start with belief b.

2. Choose action a = π(b).

3. Execute a in the world, observe o'.

4. Update belief: b' = B(b, a, o') via Bayes filter.

5. Set b = b' and repeat.

Notice that the Bayes filter from Chapter 2 is embedded inside the POMDP control loop. Perception (belief updating) and planning (action selection) are interleaved at every time step. This tight coupling is what enables information-gathering behavior: the robot can choose actions specifically to make its next Bayes filter update more informative.

An important subtlety: in an MDP, the policy depends only on the current state. In a POMDP, the policy depends on the entire history of actions and observations — but only through the belief, which is a sufficient statistic for the history. This is why the Bayes filter is so central: it compresses an ever-growing history into a fixed-size belief vector that captures all the information the robot needs for optimal decision-making.

The sufficiency of the belief state follows directly from the Markov property: given the current belief (which encodes all relevant history), the future is conditionally independent of the past. A robot that has run for 1000 time steps has a belief vector of the same dimensionality as one that has run for 10 time steps. The Bayes filter is the mechanism that makes this compression possible, accumulating evidence without accumulating data.

This means the POMDP is itself an MDP — just one defined over a much larger (continuous) state space. The "states" of this meta-MDP are beliefs, the "transitions" are Bayes filter updates, and the "observations" determine which transition occurs. This perspective, known as the belief-MDP formulation, unifies the two chapters: Chapter 15's value iteration and Chapter 16's belief-space value iteration are the same algorithm, applied to different state spaces. The difficulty of POMDPs lies entirely in the size and structure of the belief space, not in any fundamentally new algorithmic idea.

Check: What are the two components that a POMDP adds beyond an MDP?

An observation model P(o|s) and the belief b(s) over states A larger state space and faster algorithm A discount factor and a planning horizon

Chapter 2: Belief Space

In an MDP, the robot plans over states. In a POMDP, the robot plans over beliefs. But what does the belief space look like?

If the state space has n discrete states, a belief b is specified by n probabilities p₁, ..., p_n that sum to 1. Since the last probability is determined by the others, the belief space is an (n-1)-dimensional simplex. For two states, the belief space is a line segment. For three states, it's a triangle. For four states, it's a tetrahedron.

Even finite worlds have continuous belief spaces. A world with just 4 states has a 3-dimensional continuous belief space. A world with 100 states has a 99-dimensional belief simplex. And for continuous state spaces (like a robot's x-y position), the belief space is infinite-dimensional — the space of all probability distributions over a continuous variable. This is why POMDPs are so much harder than MDPs: the planning space explodes in dimensionality.

A useful geometric picture: the belief simplex for n states is a convex polytope in (n-1)-dimensional space. The vertices of this simplex correspond to states of certainty (the robot is 100% sure it's in one particular state). The center corresponds to maximum ignorance (uniform distribution over all states). Movement through belief space happens via Bayes filter updates — informative observations pull the belief toward certainty (a vertex), while noisy actions spread it toward the center.

For the textbook's 4-state example, the action a₂ cleanly separates {s₁, s₂} from {s₃, s₄}. So the only uncertainty the robot encounters is between s₁ and s₂, or between s₃ and s₄. Each of these can be represented by a single parameter (p₁ or p₃). This makes the example convenient for plotting — the value function is a 1D curve instead of a surface over a high-dimensional simplex.

Practical implications of belief space dimensionality: A robot with a discrete 100-cell grid map has a belief space with 99 continuous dimensions. Each dimension represents the probability of being in one cell. Planning in this space requires representing and updating value functions over a 99-dimensional simplex — far beyond what exact methods can handle. This is why approximations (Monte Carlo, point-based, augmented) are essential for real robotics applications. The exact POMDP algorithm is primarily a theoretical tool that establishes optimality and guides the design of practical approximations.

The payoff of a belief is the expected payoff over all states, weighted by the belief:

c(b) = Σ_i c(s_i) p_i

This is linear in the belief parameters — a key property that enables exact solutions in finite worlds. The linearity of the payoff is the base case for showing that the entire value function is piecewise linear: since the value function at horizon T is built from linear payoffs and linear operations (summation over observations, maximization over actions), the result is always piecewise linear and convex. This structure is what distinguishes finite POMDPs from continuous ones — in continuous state spaces, the belief-dependent payoff is no longer a finite linear combination, and exact solutions become impossible.

The transition model in belief space works through the Bayes filter. Given a belief b, an action a, and an observation o', the new belief B(b,a,o') is computed by Bayes' rule. This means the belief transitions are deterministic given (b,a,o'), but stochastic from the robot's perspective because it doesn't know which o' it will receive.

Consider a concrete example from the textbook: a world with 4 states (s₁, s₂, s₃, s₄), 2 actions, and 2 observations.

Parameter	Value
States	s₁, s₂ (initial states), s₃ (goal, +100), s₄ (pit, -100)
Action a₁	Swap s₁ ↔ s₂ with prob 0.9/0.8 (information gathering)
Action a₂	Go to s₃ or s₄ depending on current state (commit)
Observations	o₁: prob 0.7 in s₁, 0.4 in s₂; o₂: prob 0.3 in s₁, 0.6 in s₂

The robot starts in either s₁ or s₂ with equal probability. The belief space is just a line segment parameterized by p₁ = P(s = s₁). The entire planning problem reduces to: "How confident must I be that I'm in s₁ before I risk action a₂?" Action a₁ is the information-gathering action — it changes the state (allowing new observations) without committing to the high-stakes a₂.

Check: For a world with n discrete states, what is the dimensionality of the belief space?

n - 1 dimensions (since the probabilities must sum to 1) n dimensions 2n dimensions

Chapter 3: Value Iteration in Belief Space

The Bellman equation from Chapter 15 operated over states. The POMDP version operates over beliefs:

C^T(b) = max_a ∫ [c(b') + C^T-1(b')] P(b'|a,b) db'

This looks identical in structure to the MDP Bellman equation — just replace "state s" with "belief b" everywhere. But there's a problem: integrating over all possible beliefs b' is extremely expensive because the belief space is continuous.

The key trick: integrate over observations, not beliefs. Given (b, a, o'), the posterior belief b' = B(b,a,o') is unique (determined by the Bayes filter). We can factor P(b'|a,b) = ∫ P(b'|a,b,o') P(o'|a,b) do', where P(b'|a,b,o') is a point mass at the Bayes filter output. This lets us replace the belief integral with an observation integral:

C^T(b) = max_a Σ_o' [c(B(b,a,o')) + C^T-1(B(b,a,o'))] P(o'|a,b)

where B(b,a,o') is the Bayes filter update, and P(o'|a,b) is the probability of observing o' after doing a from belief b:

P(o'|a,b) = Σ_s' P(o'|s') Σ_s P(s'|a,s) b(s)

This reformulation is crucial. In a finite world with |O| observations, the sum is finite and computable. We've turned an impossible integral over a continuous belief space into a tractable sum over observations.

To make this concrete, consider the T=2 case from the textbook's 4-state example. After executing a₁ and observing o₁, the belief updates via Bayes' rule to a new point on the belief line. The value C²(b, a₁) is the sum over both possible observations of (probability of that observation) times (value at the resulting belief under C¹). Expanding this produces a piecewise-linear function — a theme we explore next.

The mechanics step by step:

1. For each action a: Consider every possible observation o' that could follow.

2. For each observation o': The Bayes filter produces a unique posterior belief b' = B(b,a,o').

3. Look up value: The payoff of that new belief is c(b') + C^T-1(b'), the sum of immediate reward and future value.

4. Weight by probability: Multiply by P(o'|a,b), the chance of actually seeing o'.

5. Sum over observations: This gives C^T(b,a) for this specific action.

6. Max over actions: C^T(b) = max_a C^T(b,a).

The structure is exactly MDP value iteration, but one level more abstract — beliefs instead of states, observations instead of transitions. It is often useful to split the computation into per-action value functions C^T(b,a) before taking the maximum, since this decomposition enables per-action pruning of alpha-vectors.

The key conceptual leap: in an MDP, the "one-step lookahead" considers all successor states. In a POMDP, the one-step lookahead considers all possible observations, and for each observation computes what the robot would believe. The value of an action is the expected value across all observations, weighted by their probabilities. This is why POMDP planning automatically accounts for both the reward structure AND the information content of each action.

A beautiful cancellation: When substituting the Bayes filter into the value equation, the normalizer P(o'|a,b) appears in both the posterior belief formula (in the denominator) and the observation probability weighting (in the numerator). These cancel, yielding a simpler update that avoids computing the normalizer explicitly. This cancellation is a hallmark of the finite POMDP solution and is what makes exact value computation tractable: each new alpha-vector coefficient can be computed directly from the transition probabilities, observation probabilities, and previous alpha-vectors, without ever needing the normalization constant.

Check: Why do we integrate over observations instead of beliefs in the POMDP value iteration?

Because given (b, a, o'), the posterior belief is uniquely determined by the Bayes filter, making the integral over beliefs trivial Because there are fewer observations than beliefs Because observations are easier to measure

Chapter 4: Piecewise-Linear Value Functions

Here's the remarkable result for finite POMDPs: the optimal value function is piecewise linear and convex in the belief parameters.

A single linear value function over beliefs with n states would be:

C(b) = Σ_i=1..n C_i p_i

A piecewise linear and convex value function is the maximum of K such linear functions:

C(b) = max_k=1..K Σ_i=1..n C_k,i p_i

Each linear piece (sometimes called an alpha-vector) corresponds to a different conditional plan — a different tree of actions conditioned on future observations. The value at a particular belief b is determined by whichever plan performs best in expectation under that belief. The maximum over all alpha-vectors selects the best plan for each belief point.

Visually, for a 2-state world where the belief is a single number p₁, the value function looks like an inverted "V" or a series of line segments. Each segment corresponds to a region of belief space where a particular conditional plan is optimal. The boundaries between segments are the "switching points" where the robot changes its preferred strategy.

The number of alpha-vectors grows with the planning horizon because longer horizons create more distinct conditional plans. At T=1, there are at most |A| alpha-vectors (one per action). At T=2, each action can be followed by a plan conditioned on the observation, creating up to |A|^|O| distinct continuation strategies per initial action. Each unique continuation strategy produces its own alpha-vector with different coefficients.

A critical property: while the number of alpha-vectors can explode, many of them are dominated — they're never the maximum for any belief point. An alpha-vector α is dominated if for every belief b, some other alpha-vector α' achieves a higher or equal value. Pruning dominated alpha-vectors keeps the representation manageable. In the textbook's example, 4 alpha-vectors are generated at T=2 for action a₁, but 2 are immediately prunable because they never achieve the maximum anywhere in the belief simplex.

This pruning is related to the geometry of the upper envelope: the value function is the upper envelope of a set of hyperplanes. Adding a new hyperplane either expands the upper envelope (the hyperplane is "visible" from above) or is hidden beneath existing hyperplanes (dominated). Determining visibility is a linear programming problem, which is why LP is central to the exact POMDP algorithm.

Why convexity makes sense: The value function is convex because knowing is always better than not knowing. Given two possible belief states, the average of their values is always at least as large as the value of their averaged belief: βC(b) + (1-β)C(b') ≥ C(βb + (1-β)b'). Certainty (being at a vertex of the simplex) is more valuable than confusion (being in the interior).

Why piecewise linear? Consider a one-step horizon. The payoff c(b) = Σ c(s_i)p_i is linear. Taking the max over actions produces a piecewise linear function (one piece per action). For horizon T, each observation can lead to a different linear piece of C^T-1, and the max over actions produces more pieces. The number of pieces can grow exponentially with horizon T.

In the textbook's example, C¹(b) consists of two linear pieces: zero (for action a₁) and 72p₁ - 72p₂ (for action a₂, with γ = 0.9). The optimal policy at T=1 is: if p₁ ≥ 0.5, take a₂ (go for the reward); otherwise, take a₁ (stay safe). The value function looks like a "hockey stick": flat at zero on the left, rising linearly on the right.

At T=2, the value function gains a third linear piece from the information-gathering strategy: execute a₁ (which changes the state and yields an informative observation), then decide. The new piece has the form -33.048p₁ + 7.776p₂, which is positive for small p₁ — meaning that even when the robot is uncertain, it can expect positive value by gathering information first.

The policy threshold shifts: the robot can now afford to be less certain before acting, because it has time to gather information. With a longer horizon, the threshold shifts further, and the expected value increases everywhere. The longer the horizon, the more the robot benefits from information gathering — but the computational cost of representing the piecewise-linear function grows. This creates a fundamental tradeoff: longer horizons produce better policies but require exponentially more alpha-vectors.

Inductive proof: If C^T-1(b) is piecewise linear and convex, then C^T(b) is also piecewise linear and convex. The sum of piecewise-linear convex functions (over observations) is piecewise-linear and convex. The max over actions preserves piecewise linearity and convexity. Since C⁰ = 0 is trivially linear, induction gives us the result for all finite T.

Check: Why is the POMDP value function convex in belief space?

Because knowing the state exactly (certainty) is always at least as valuable as being uncertain — the value of knowledge is non-negative Because the reward function is always positive Because the discount factor is less than 1

Chapter 5: The Finite POMDP Algorithm

Piecewise linearity means we can represent the value function exactly on a computer, as a finite set of linear constraints (called alpha-vectors in the modern literature). Each alpha-vector α_k = (C_k,1, ..., C_k,n) defines a hyperplane over the belief simplex, and the value function is their upper envelope. The POMDP value iteration algorithm builds this set recursively.

The algorithm maintains a constraint set Φ^T at each horizon. The update for each horizon T produces new linear pieces from the old ones. For each action a and each assignment of observations to previous linear pieces, we get a new linear function:

C^T_l,j = γ Σ_o' Σ_i (c_i + C^T-1_k(o'),i) P(o'|s'_i) P(s'_i|a,s_j)

where k(o') selects which linear piece of C^T-1 is active for observation o'.

Horizon	Max alpha-vectors	Growth pattern
T = 1	\|A\|	One per action
T = 2	\|A\| · \|A\|^\|O\|	Each observation independently picks a piece from T=1
T = 3	\|A\| · (\|A\| · \|A\|^\|O\|)^\|O\|	Pieces from T=2 are combined per observation
General T	\|A\| · \|Φ^T-1\|^\|O\|	Doubly exponential in T

To illustrate the explosion: with |A|=2 actions and |O|=2 observations, horizon T=1 has 2 pieces, T=2 has up to 8, T=3 up to 128, and T=4 up to 2 · 128² = 32,768. After pruning, the actual number is often much smaller, but the worst case grows as a tower of exponentials.

This complexity is intrinsic to the problem, not an artifact of the algorithm. The information-theoretic reason: at each time step, the robot must plan a different strategy for each possible observation sequence it might receive. The number of distinct observation sequences grows exponentially with the horizon. Each sequence induces a different belief trajectory through belief space, potentially requiring a different optimal action. The exact algorithm faithfully represents all these contingencies; approximate methods trade optimality for tractability by considering only the most likely or most useful contingencies.

The computational curse: The number of linear pieces can grow doubly exponentially in the planning horizon T. For each observation, we independently choose which linear piece to use, giving |Φ^T-1|^|O| combinations per action. Many of these pieces are dominated (never the maximum for any belief) and can be pruned, but in the worst case the growth is unavoidable. This is why exact POMDP solving is restricted to small problems.

In practice, linear programming identifies which constraints are active (defining the value function) and which are passive (dominated). Pruning passive constraints is essential for tractability but cannot eliminate the fundamental worst-case complexity.

The algorithm works by induction on the horizon T. For T=1, the constraint set Φ¹ has one linear function per action (since the payoff c(b) is linear, each action gives one alpha-vector). For each subsequent horizon, the algorithm enumerates all actions and all assignments of observations to alpha-vectors from the previous horizon, computes the new linear coefficients using the formula above, and adds them to Φ^T. Minimizing C^T under these constraints (via linear programming) extracts the optimal value. Active constraints identify the optimal action for each region of belief space.

The algorithm can be divided into three computational phases at each horizon step:

1. Cross-sum: For each action, combine alpha-vectors from C^T-1 with observation probabilities to create observation-specific alpha-vectors.

2. Enumeration: Form all combinations across observations (one alpha-vector per observation). Each combination yields a new alpha-vector for horizon T.

3. Pruning: Use linear programming to identify and discard dominated alpha-vectors (those that are never the maximum for any belief point). A linear program tests whether a given alpha-vector is ever the best: maximize its value minus the value of all other alpha-vectors. If this maximum is non-positive, the alpha-vector can be safely discarded.

The pruning step is what makes the difference between a theoretically correct but useless algorithm and a practical one. In the textbook example, half the alpha-vectors at T=2 are dominated and can be removed immediately. In larger problems, aggressive pruning can reduce the representation by orders of magnitude while preserving optimality.

For the general case (continuous states, infinite horizon), the exact algorithm doesn't apply. The general POMDP value iteration equation becomes:

Ĉ(b) ← max_a ∫ [Ĉ(B(b,a,o')) + ∫ c(s') B(b,a,o')(s') ds'] P(o'|s') P(s'|a,s) b(s) ds ds' do'

This is an "in-principle" algorithm: it requires integration over infinite spaces and doesn't specify how Ĉ is represented. The two key open questions are: (1) how to represent a value function over probability distributions, and (2) how to compute the integrals efficiently. Making it practical requires approximation — either through Monte Carlo methods (next chapter) or by compressing the belief (Chapter 7).

For infinite horizons, the value function reaches an equilibrium described by the Bellman equation in belief space — a self-consistent equation where the value at each belief equals the best expected one-step value plus discounted future value. Just as in the MDP case, this fixed point can be approximated by iterating the backup operator until convergence. The convergence guarantee requires γ < 1, and the approximation error after k iterations is bounded by γ^k times the maximum value. In practice, convergence in belief space is typically slower than in state space because the belief space is so much larger and the value function is more complex to represent.

Modern POMDP solvers: While the exact algorithm is intractable for large problems, it inspired a family of point-based solvers that have dramatically expanded the reach of POMDP planning:
• PBVI (Point-Based Value Iteration): Maintains alpha-vectors only at a sampled set of reachable belief points, performing backups only at those points.
• Perseus: Randomized point-based solver that updates a random subset of belief points at each iteration, often converging faster than PBVI.
• SARSOP: Focuses sampling on the "optimally reachable" belief space — the beliefs the robot would actually visit under an optimal policy — achieving state-of-the-art performance on benchmark problems.
These methods scale to problems with thousands of states by exploiting the fact that the robot only encounters a small fraction of all possible beliefs during its lifetime. The full belief simplex may be 99-dimensional, but the robot's actual trajectory through belief space is a thin manifold within it.

Check: Why does the exact POMDP algorithm have doubly exponential complexity?

Each observation independently selects a linear piece from the previous horizon, so pieces multiply as |Φ|^|O| per step The state space doubles at each step The discount factor causes exponential blowup

Chapter 6: Monte Carlo POMDPs

Exact POMDP solving is limited to tiny problems — the doubly exponential growth makes it infeasible for anything beyond a handful of states and a short horizon. For realistic robot tasks with continuous states and long horizons, we need approximations.

The Monte Carlo POMDP (MC-POMDP) approach, developed by Thrun and collaborators, brings three ideas together: particle filters to represent beliefs, Monte Carlo sampling to approximate the value update, and reinforcement learning to iteratively improve the policy. The result is an algorithm that can handle continuous state spaces and long planning horizons, at the cost of approximate (not optimal) solutions.

The idea is straightforward: instead of integrating over all observations and all states, we sample. The belief is represented by a set of N weighted particles (as in Chapter 4), and the value update is approximated by Monte Carlo sampling:

1. Represent the belief b as a set of weighted particles {⟨s⁽ⁱ⁾, w⁽ⁱ⁾⟩}.

2. Sample transitions: For each action a, draw N samples: pick s from b according to weights, draw s' ~ P(s'|a,s), draw o' ~ P(o'|s').

3. Update belief: Compute posterior b' = B(b,a,o') using the particle filter from Chapter 4 — reweight particles by P(o'|s') and normalize.

4. Estimate value: C(b,a) ≈ (1/N) Σ γ[C(b') + c(s')], averaging over the N sampled transitions.

5. Select action: Pick the action with the highest estimated value.

Learning the value function with nearest neighbors: The value function C(b) must be stored somehow. MC-POMDP uses nearest-neighbor lookup: maintain a database of (belief, value) pairs and interpolate. The distance between beliefs is measured using KL divergence, approximated via Monte Carlo sampling over the particle representations. When a new belief is encountered, its value is estimated from the k nearest stored beliefs.

The value function Q(θ, a) is learned via Bellman backups in belief space:

Q(θ_t, a_t) ← E[R(o_t+1) + γ max_ā Q(θ_t+1, ā)]

where θ is the particle-based belief representation. The particle filter computes samples of the reward and successor belief, from which the right-hand side is approximated by Monte Carlo averaging.

MC-POMDP uses reinforcement learning in belief space: the robot runs episodes, updates Q-values using Bellman backups, and gradually improves its policy. A mix of exploitation (following the current best policy) and exploration (10% random actions) ensures the robot discovers useful belief states. Experience replay and counter-based exploration from the reinforcement learning literature help determine the order of belief-state updates.

Experimental results from Thrun et al.: In a "heaven and hell" environment where the agent must consult a "priest" to learn which end is heaven, MC-POMDP finds a successful policy after 17 episodes (~6,150 backups, ~29 minutes on a Pentium PC). Doubling the environment size quadrupled the state space but only doubled convergence time (33 episodes). In a robotic locate-and-retrieve task with a mono-camera system (30% detection error, 20% motion noise), the robot learned to search, localize, and grasp an object with 80%+ success after ~4,000 iterations. The robot spontaneously discovers that it must view the object from multiple angles to triangulate its 3D position — a classic information-gathering behavior.

A key challenge in MC-POMDP is representing the value function Q over beliefs. Standard function approximators (neural networks, decision trees) accept fixed-length vectors, not probability distributions of varying complexity. The solution: nearest-neighbor interpolation in density space. When a new belief is encountered, its k nearest neighbors (by KL divergence between the corresponding particle distributions) are found in a growing database, and their Q-values are linearly averaged. If there aren't enough neighbors within a threshold distance, the new belief is added to the database — so the database grows over time, concentrating resolution where the robot actually visits.

Computing KL divergence between particle sets is itself non-trivial. Strictly speaking, KL divergence is not a distance metric (it's asymmetric and doesn't satisfy the triangle inequality), but it works well as a similarity measure for beliefs. The approach maps each particle set to a continuous density (via Gaussian kernels) and then uses Monte Carlo sampling to approximate the divergence. This "distance between densities" enables function approximation in the space of probability distributions — a capability not found in standard machine learning tools.

Two key optimizations make MC-POMDP practical:

Adaptive database growth: New belief states are only added to the database when they're sufficiently different (by KL divergence) from all existing entries. This naturally concentrates resolution in frequently-visited regions of belief space — the robot builds a detailed value map only where it matters.

Experience replay: Past transitions (b, a, o', b', r) are stored and re-used for additional Bellman backups. This dramatically accelerates convergence by extracting more learning from each simulated episode. Combined with counter-based exploration (tracking how many times each belief-action pair has been visited), this produces an efficient online learning algorithm that balances exploration of new beliefs with exploitation of known good policies.

Check: How does MC-POMDP approximate the value iteration update?

By sampling states, transitions, and observations from the current belief, then averaging the resulting values over samples By discretizing the belief space into a grid By using neural networks to approximate the value function

Chapter 7: Augmented MDPs

POMDPs plan over the full belief — but that space is enormous and the exact algorithm is doubly exponential. MDPs plan over states — but they ignore uncertainty entirely and can lead to dangerous policies when the robot is unsure of its position. Is there a middle ground?

The answer, from Thrun and collaborators, is the Augmented MDP (AMDP). The key observation: in many robotics domains, the range of belief distributions the robot encounters is a small subspace of all possible beliefs. Most of the time, the belief is roughly Gaussian (unimodal, centered on the true pose). What matters for planning is not the exact shape of the belief, but a few summary statistics — particularly how uncertain the robot is.

Augmented MDPs (AMDPs) take a different approach to approximation: instead of sampling, they compress the belief into a low-dimensional statistic and plan over that statistic using standard MDP tools. The idea: the full belief b is summarized by a tuple:

b̄ = ⟨ argmax_s b(s), H[b] ⟩

where the first component is the most likely state (the MDP would just use this) and the second is the entropy of the belief:

H[b] = - ∫ b(s) ln b(s) ds

The entropy captures "how uncertain" the robot is. High entropy means high uncertainty (the belief is spread over many states); low entropy means the robot is fairly sure of its state (the belief is concentrated on a few states). For a point-mass belief (zero entropy), the AMDP reduces to a standard MDP. For maximum entropy (uniform belief), the AMDP policy reflects the most cautious behavior.

The coastal navigation insight: In mobile robot navigation, the AMDP is called the "coastal planner" — like ships that stay near shore to avoid getting lost. The name captures a deep analogy: before GPS, ships hugged coastlines because landmarks on shore were the only way to know their position. Similarly, the AMDP anticipates that crossing a large, featureless area will increase entropy (the robot loses track of its position). So it plans paths that stay near walls and landmarks, trading a longer path for lower uncertainty at the goal.

The AMDP transition model P(b̄'|a,b̄) captures how actions change both the most likely state and the entropy. Moving through a featureless hallway increases entropy (the robot drifts without landmarks to correct against). Moving past a distinctive feature decreases entropy (the Bayes filter sharpens the belief). The AMDP planner can predict these entropy changes and avoid "information deserts."

The result is intuitive: in areas with many landmarks (walls, corners, distinctive features), the AMDP policy behaves like the MDP policy — it takes the shortest path. But when the shortest path crosses a barren region where localization would degrade, the AMDP diverts to stay near information-rich features. The detour cost is weighed against the risk of arriving at the goal with high uncertainty (and thus potentially missing it or colliding with nearby obstacles).

This behavior emerges naturally from the value function: cells in open areas with high entropy have lower value than the same cells with low entropy, because the robot's future reward potential is lower when it's uncertain about its position. The value gradient thus pulls the robot toward high-certainty paths (near landmarks) even when those paths are geometrically longer.

Value iteration for AMDPs replaces the full belief b with the statistic b̄ everywhere:

Ĉ(b̄) ← max_a ∫ [c(b̄') + Ĉ(b̄')] P(b̄'|a,b̄) db̄'

The transition model P(b̄'|a,b̄) is precomputed by sampling: simulate many beliefs consistent with b̄, apply actions, observe transitions, and cache the results. This preprocessing phase "translates" the familiar P(o|s) and P(s'|s,a) into the augmented state model, after which standard value iteration applies directly.

The key mathematical justification: the representation is valid when b̄ is a sufficient statistic of b for estimating value, meaning C(b) = C(b̄) for all beliefs the robot may encounter. In practice this rarely holds exactly, but the resulting policy is often good enough — especially when the robot's primary concern is "am I near a wall or in open space?" rather than the exact shape of the posterior.

Alternative statistics beyond (mode, entropy) are also possible. One might use the mean and variance of the belief, or its top-k modes, or any other low-dimensional summary. The choice of statistic is a modeling decision that trades off expressiveness against computational cost. For mobile robot navigation, (most likely position, entropy) works well because the entropy directly captures "how lost" the robot is.

Implementation-wise, the AMDP precomputes the transition model P(b̄'|a,b̄) by stochastic simulation: sample many beliefs consistent with a given (mode, entropy) pair, propagate them through the Bayes filter for various actions and observations, compute the resulting (mode, entropy) pairs, and cache the transition probabilities in a lookup table. After this preprocessing phase, value iteration runs on the augmented state space exactly as in the MDP case — making it orders of magnitude faster than full POMDP solving.

The augmented state space is typically discretized. For mobile robot navigation, the most likely state is a grid cell (x,y), and the entropy is discretized into bins (e.g., "low," "medium," "high" uncertainty). This gives a state space that is only a small multiple of the MDP state space — say 3x larger if we use 3 entropy bins. Compare this to the full POMDP belief space, which for a 1000-cell grid would be a 999-dimensional continuum.

The computational complexity of the AMDP is polynomial in the size of the augmented state space (number of grid cells times number of entropy bins), making it tractable for the same grid resolutions used by MDP planners. The main additional cost is the preprocessing phase where P(b̄'|a,b̄) is estimated by stochastic simulation. Once this model is cached, value iteration proceeds at essentially MDP speed.

In practice, the AMDP preprocessing requires running many simulated Bayes filter episodes to estimate how entropy changes under different conditions. This is a one-time cost that depends on the map, the sensor model, and the discretization of the entropy axis. For a typical museum-scale environment, Thrun et al. report total computation times comparable to standard MDP planning.

Experimental results from Thrun et al. demonstrate the coastal planner's advantage vividly. In a museum environment with a 40-meter open area, the MDP planner takes the shortest path straight across — oblivious to the fact that the robot will lose track of its position in the featureless expanse. The coastal planner stays near walls and obstacles, adding travel time but dramatically reducing localization uncertainty at the goal.

The benefit is largest with poor sensors (short range). As the sensor range increases, the difference gradually vanishes — with long-range sensors, the robot can localize from anywhere, so the coastal detour isn't worthwhile. Quantitatively, Thrun et al. measured the entropy of the robot's belief at the goal location as a function of sensor range. The coastal planner consistently achieved lower entropy (higher confidence at the goal), with the advantage most pronounced for sensors with range under 2 meters.

This illustrates a general principle: the value of uncertainty-aware planning depends on the ratio of environment complexity to sensor capability. In richly textured environments with good sensors, MDP planning suffices. In featureless or ambiguous environments with limited sensors, POMDP or AMDP planning can make the difference between mission success and failure.

Method	Plans over	Complexity	Handles perception uncertainty?
MDP	States	Polynomial	No
AMDP	State + entropy	Polynomial (larger)	Approximately
POMDP	Full beliefs	Doubly exponential	Yes (optimal)

Check: What does the augmented MDP add to the MDP state to account for uncertainty?

A measure of uncertainty (entropy of the belief), so the planner can anticipate and avoid situations where the robot gets lost The full belief distribution over all states The robot's velocity

Chapter 8: Belief-Space Planning Demo

Explore the difference between MDP planning (ignores uncertainty) and POMDP-style planning (considers uncertainty). The robot must reach the goal but doesn't know which end of the corridor it faces. A landmark above the corridor disambiguates orientation.

The MDP planner assumes it knows its state (optimistically picks a direction). The POMDP planner reasons about its uncertainty and visits the landmark first. Run each plan multiple times to see the statistical difference — MDP wins ~50% of the time, while POMDP wins every time. The certainty bar at the bottom shows how the robot's confidence changes over time: the MDP robot stays at 50% throughout (it never gathers information), while the POMDP robot's certainty jumps to 100% upon reaching the landmark.

POMDP vs MDP Planning

The robot (blue) starts in the center with unknown orientation. The goal (green) and pit (red) are at opposite ends. A landmark (yellow diamond) disambiguates orientation. MDP planning charges toward the goal (50% pit). POMDP planning visits the landmark first.

Click a plan to start.

What to notice: The MDP planner heads straight for the assumed goal, arriving quickly but with a 50% chance of hitting the pit. The POMDP planner takes a detour to the landmark, learns its orientation, then proceeds safely to the goal. The POMDP path is longer but achieves much higher expected reward. Run each plan multiple times — the MDP succeeds roughly half the time, while the POMDP succeeds every time.

The exploration-exploitation tradeoff: The POMDP's detour to the landmark is a pure information-gathering action. It costs time (exploitation loss) but eliminates the risk of the pit (exploration gain). The optimal POMDP policy automatically computes exactly how much information gathering is worthwhile — it only seeks new information to the extent that it benefits the expected utility. If the penalty for the pit were small (say -10 instead of -100), the POMDP might skip the landmark and go directly for the goal, accepting the small risk. The tradeoff is quantitative, not binary.

This automatic balancing of exploration and exploitation is perhaps the most remarkable property of POMDP planning. Unlike heuristic approaches that separate "exploration" and "goal pursuit" into distinct modules, the POMDP integrates them into a single optimal decision process. Every action is evaluated simultaneously for its reward potential AND its information content. The robot never explores "just because" — it explores only when the information gained justifies the cost.

Check: Why does the POMDP planner visit the landmark before heading to the goal?

To reduce uncertainty about its orientation, so it can safely distinguish the goal from the pit Because the landmark gives extra reward Because the POMDP algorithm is slower

Chapter 9: Summary

Concept	Key Idea
POMDP	MDP extended with observations and beliefs — the robot plans over what it believes, not what it knows
Belief space	The space of all probability distributions over states; continuous even for finite state spaces
Belief-space value iteration	Bellman equation over beliefs, integrating over observations instead of beliefs
Piecewise-linear convexity	Optimal finite-horizon value functions are piecewise linear and convex in belief parameters
Exact POMDP	Computes optimal policy via linear programming; doubly exponential in horizon
MC-POMDP	Approximate solver using particle filters and nearest-neighbor value function
AMDP	Compresses belief to (most likely state, entropy); coastal navigation for robots
Exploration vs exploitation	POMDP optimally trades off information gathering and goal pursuit

The big picture: MDPs, AMDPs, and POMDPs form a spectrum of increasing expressiveness and cost:
• MDP: Plans over states. Polynomial. Ignores sensor uncertainty.
• AMDP: Plans over (state, entropy). Polynomial but larger. Captures some uncertainty.
• POMDP: Plans over full beliefs. Doubly exponential. Optimal information gathering.
In real robotics, the choice depends on how much the robot's uncertainty matters for the task at hand. For navigation in feature-rich environments, an MDP often suffices. For navigating featureless open areas or performing delicate manipulation, the AMDP or POMDP framework can dramatically improve performance.

This chapter completes the planning arc of the book. Chapters 1-14 addressed perception: how robots estimate their state from noisy data. Chapter 15 introduced action selection under the assumption of full observability. This chapter unifies perception and action: the robot must simultaneously estimate its state and choose actions that balance goal achievement with information gathering. This is the most general — and most challenging — problem in probabilistic robotics.

The key takeaway is not any specific algorithm, but the insight that planning in belief space enables a robot to reason about its own ignorance and act to reduce it when beneficial. Every POMDP solver — whether exact, point-based, Monte Carlo, or augmented — embodies this principle. The computational challenge of POMDPs has driven decades of algorithmic innovation, from the piecewise-linear exact solutions of the 1960s to modern point-based solvers that handle problems with millions of states.

The ideas from this chapter extend far beyond robotics. POMDPs model any sequential decision-making problem where the decision-maker has incomplete information: medical diagnosis (the patient's true condition is hidden), spoken dialogue systems (the user's intent is uncertain), autonomous driving (other drivers' intentions are unknown), and scientific experimentation (the true hypothesis is hidden). Wherever an agent must balance "acting on what it knows" with "learning what it doesn't know," the POMDP framework provides the optimal solution — if the computation can be managed.

"The optimal plan in belief space optimally gathers information,
only seeking new information to the extent that it is actually
beneficial to the expected utility of the robot's action."

Check: What is the fundamental advantage of POMDP planning over MDP planning?

POMDP planning reasons about the robot's uncertainty and can actively gather information when beneficial POMDP planning is always faster POMDP planning uses fewer states