Planning under uncertainty when you can't see: belief-space value iteration, piecewise-linear value functions, Monte Carlo POMDPs, and coastal navigation.
Chapter 15 gave us a powerful tool: value iteration for MDPs. A robot could plan optimal actions by computing a value function over states and following the gradient. But there was a hidden assumption lurking beneath every equation.
MDPs assume full observability — the robot knows its exact state at all times. In real robotics, this is almost never true. A mobile robot in a symmetric hallway doesn't know which direction it's facing. A manipulator reaching into a bin can't see the exact pose of the target object. Sensors are noisy projections of the true state.
Consider Thrun's illustrative example: a robot in a nearly symmetric corridor. The goal is at one end, a dangerous pit at the other, and the two are perceptually indistinguishable from the middle (both entrances look the same). A distinguishing landmark exists in only one alcove. An MDP planner would say "go left" or "go right" depending on which state is better. But the robot doesn't know which end is which — it doesn't even know which direction it's facing. Charging toward the assumed goal risks a 50% chance of falling into the pit.
The optimal strategy? First move to a landmark that reveals the robot's orientation, then proceed safely to the goal. The robot must actively gather information even though it costs a detour. This is the hallmark of POMDP planning: balancing exploitation (moving toward the goal) with exploration (reducing uncertainty). The payoff from a guaranteed +100 reward far exceeds the expected payoff from a coin-flip between +100 and -100.
Crucially, solving this problem by "just considering all possible states and averaging" does not work. You might think: "Just solve two MDPs (one for each possible orientation), find the optimal policy for each, and mix them." But the averaged policy still charges toward the assumed goal in each case — neither individual MDP policy visits the landmark, because in each case the state is known. The information-gathering behavior only emerges when uncertainty itself is part of the state.
This distinction is subtle but profound. The POMDP doesn't just "hedge" between two MDP solutions. It discovers entirely new behaviors — like visiting landmarks, asking questions, or performing sensing actions — that no MDP would ever produce. The optimal plan in belief space can be qualitatively different from any plan in state space. This emergent information-gathering behavior is one of the most beautiful results in probabilistic robotics.
The textbook distinguishes three planning paradigms, each handling a different level of uncertainty:
| Paradigm | Handles | Output |
|---|---|---|
| Classical planning | No uncertainty | Fixed action sequence |
| MDP (Ch 15) | Action uncertainty | State → action policy |
| POMDP (Ch 16) | Action + perception uncertainty | Belief → action policy |
A Partially Observable Markov Decision Process extends the MDP with two additions: an observation model and the concept of belief.
| Component | Symbol | Description |
|---|---|---|
| States | S | Set of all possible states (same as MDP) |
| Actions | A | Set of available actions (same as MDP) |
| Transitions | P(s'|a,s) | Stochastic transition model (same as MDP) |
| Payoff | c(s) | Immediate reward at state s (same as MDP) |
| Observations | O | Set of possible observations (new) |
| Observation model | P(o|s) | Probability of seeing o in state s (new) |
| Belief | b(s) | Probability distribution over states (new) |
The robot never sees s directly. Instead, after taking action a and arriving in state s', it receives an observation o drawn from P(o|s'). Observations are noisy — the same state might produce different observations on different occasions, and different states might produce the same observation. This perceptual ambiguity is exactly what makes the problem "partially observable." The robot must reason probabilistically about which state could have produced the observations it has received so far.
The belief b(s) captures everything the robot knows about its state, updated via the familiar Bayes filter from Chapter 2:
where η is the normalizing constant, ensuring the belief sums to 1 over all states. This is the exact same Bayes filter from Chapter 2, applied at every decision step. The belief is a sufficient statistic for the robot's history — it compresses all past actions and observations into a probability distribution that is all the robot needs to make optimal decisions.
The policy in a POMDP maps beliefs to actions, not states to actions:
Given a belief b and an action a, the robot receives an observation o' and updates its belief via the Bayes filter. The new belief b' then determines the next action. This cycle is the POMDP control loop:
1. Start with belief b.
2. Choose action a = π(b).
3. Execute a in the world, observe o'.
4. Update belief: b' = B(b, a, o') via Bayes filter.
5. Set b = b' and repeat.
Notice that the Bayes filter from Chapter 2 is embedded inside the POMDP control loop. Perception (belief updating) and planning (action selection) are interleaved at every time step. This tight coupling is what enables information-gathering behavior: the robot can choose actions specifically to make its next Bayes filter update more informative.
An important subtlety: in an MDP, the policy depends only on the current state. In a POMDP, the policy depends on the entire history of actions and observations — but only through the belief, which is a sufficient statistic for the history. This is why the Bayes filter is so central: it compresses an ever-growing history into a fixed-size belief vector that captures all the information the robot needs for optimal decision-making.
The sufficiency of the belief state follows directly from the Markov property: given the current belief (which encodes all relevant history), the future is conditionally independent of the past. A robot that has run for 1000 time steps has a belief vector of the same dimensionality as one that has run for 10 time steps. The Bayes filter is the mechanism that makes this compression possible, accumulating evidence without accumulating data.
This means the POMDP is itself an MDP — just one defined over a much larger (continuous) state space. The "states" of this meta-MDP are beliefs, the "transitions" are Bayes filter updates, and the "observations" determine which transition occurs. This perspective, known as the belief-MDP formulation, unifies the two chapters: Chapter 15's value iteration and Chapter 16's belief-space value iteration are the same algorithm, applied to different state spaces. The difficulty of POMDPs lies entirely in the size and structure of the belief space, not in any fundamentally new algorithmic idea.
In an MDP, the robot plans over states. In a POMDP, the robot plans over beliefs. But what does the belief space look like?
If the state space has n discrete states, a belief b is specified by n probabilities p1, ..., pn that sum to 1. Since the last probability is determined by the others, the belief space is an (n-1)-dimensional simplex. For two states, the belief space is a line segment. For three states, it's a triangle. For four states, it's a tetrahedron.
A useful geometric picture: the belief simplex for n states is a convex polytope in (n-1)-dimensional space. The vertices of this simplex correspond to states of certainty (the robot is 100% sure it's in one particular state). The center corresponds to maximum ignorance (uniform distribution over all states). Movement through belief space happens via Bayes filter updates — informative observations pull the belief toward certainty (a vertex), while noisy actions spread it toward the center.
For the textbook's 4-state example, the action a2 cleanly separates {s1, s2} from {s3, s4}. So the only uncertainty the robot encounters is between s1 and s2, or between s3 and s4. Each of these can be represented by a single parameter (p1 or p3). This makes the example convenient for plotting — the value function is a 1D curve instead of a surface over a high-dimensional simplex.
The payoff of a belief is the expected payoff over all states, weighted by the belief:
This is linear in the belief parameters — a key property that enables exact solutions in finite worlds. The linearity of the payoff is the base case for showing that the entire value function is piecewise linear: since the value function at horizon T is built from linear payoffs and linear operations (summation over observations, maximization over actions), the result is always piecewise linear and convex. This structure is what distinguishes finite POMDPs from continuous ones — in continuous state spaces, the belief-dependent payoff is no longer a finite linear combination, and exact solutions become impossible.
The transition model in belief space works through the Bayes filter. Given a belief b, an action a, and an observation o', the new belief B(b,a,o') is computed by Bayes' rule. This means the belief transitions are deterministic given (b,a,o'), but stochastic from the robot's perspective because it doesn't know which o' it will receive.
Consider a concrete example from the textbook: a world with 4 states (s1, s2, s3, s4), 2 actions, and 2 observations.
| Parameter | Value |
|---|---|
| States | s1, s2 (initial states), s3 (goal, +100), s4 (pit, -100) |
| Action a1 | Swap s1 ↔ s2 with prob 0.9/0.8 (information gathering) |
| Action a2 | Go to s3 or s4 depending on current state (commit) |
| Observations | o1: prob 0.7 in s1, 0.4 in s2; o2: prob 0.3 in s1, 0.6 in s2 |
The robot starts in either s1 or s2 with equal probability. The belief space is just a line segment parameterized by p1 = P(s = s1). The entire planning problem reduces to: "How confident must I be that I'm in s1 before I risk action a2?" Action a1 is the information-gathering action — it changes the state (allowing new observations) without committing to the high-stakes a2.
The Bellman equation from Chapter 15 operated over states. The POMDP version operates over beliefs:
This looks identical in structure to the MDP Bellman equation — just replace "state s" with "belief b" everywhere. But there's a problem: integrating over all possible beliefs b' is extremely expensive because the belief space is continuous.
where B(b,a,o') is the Bayes filter update, and P(o'|a,b) is the probability of observing o' after doing a from belief b:
This reformulation is crucial. In a finite world with |O| observations, the sum is finite and computable. We've turned an impossible integral over a continuous belief space into a tractable sum over observations.
To make this concrete, consider the T=2 case from the textbook's 4-state example. After executing a1 and observing o1, the belief updates via Bayes' rule to a new point on the belief line. The value C2(b, a1) is the sum over both possible observations of (probability of that observation) times (value at the resulting belief under C1). Expanding this produces a piecewise-linear function — a theme we explore next.
The mechanics step by step:
1. For each action a: Consider every possible observation o' that could follow.
2. For each observation o': The Bayes filter produces a unique posterior belief b' = B(b,a,o').
3. Look up value: The payoff of that new belief is c(b') + CT-1(b'), the sum of immediate reward and future value.
4. Weight by probability: Multiply by P(o'|a,b), the chance of actually seeing o'.
5. Sum over observations: This gives CT(b,a) for this specific action.
6. Max over actions: CT(b) = maxa CT(b,a).
The structure is exactly MDP value iteration, but one level more abstract — beliefs instead of states, observations instead of transitions. It is often useful to split the computation into per-action value functions CT(b,a) before taking the maximum, since this decomposition enables per-action pruning of alpha-vectors.
The key conceptual leap: in an MDP, the "one-step lookahead" considers all successor states. In a POMDP, the one-step lookahead considers all possible observations, and for each observation computes what the robot would believe. The value of an action is the expected value across all observations, weighted by their probabilities. This is why POMDP planning automatically accounts for both the reward structure AND the information content of each action.
Here's the remarkable result for finite POMDPs: the optimal value function is piecewise linear and convex in the belief parameters.
A single linear value function over beliefs with n states would be:
A piecewise linear and convex value function is the maximum of K such linear functions:
Each linear piece (sometimes called an alpha-vector) corresponds to a different conditional plan — a different tree of actions conditioned on future observations. The value at a particular belief b is determined by whichever plan performs best in expectation under that belief. The maximum over all alpha-vectors selects the best plan for each belief point.
Visually, for a 2-state world where the belief is a single number p1, the value function looks like an inverted "V" or a series of line segments. Each segment corresponds to a region of belief space where a particular conditional plan is optimal. The boundaries between segments are the "switching points" where the robot changes its preferred strategy.
The number of alpha-vectors grows with the planning horizon because longer horizons create more distinct conditional plans. At T=1, there are at most |A| alpha-vectors (one per action). At T=2, each action can be followed by a plan conditioned on the observation, creating up to |A||O| distinct continuation strategies per initial action. Each unique continuation strategy produces its own alpha-vector with different coefficients.
A critical property: while the number of alpha-vectors can explode, many of them are dominated — they're never the maximum for any belief point. An alpha-vector α is dominated if for every belief b, some other alpha-vector α' achieves a higher or equal value. Pruning dominated alpha-vectors keeps the representation manageable. In the textbook's example, 4 alpha-vectors are generated at T=2 for action a1, but 2 are immediately prunable because they never achieve the maximum anywhere in the belief simplex.
This pruning is related to the geometry of the upper envelope: the value function is the upper envelope of a set of hyperplanes. Adding a new hyperplane either expands the upper envelope (the hyperplane is "visible" from above) or is hidden beneath existing hyperplanes (dominated). Determining visibility is a linear programming problem, which is why LP is central to the exact POMDP algorithm.
Why piecewise linear? Consider a one-step horizon. The payoff c(b) = Σ c(si)pi is linear. Taking the max over actions produces a piecewise linear function (one piece per action). For horizon T, each observation can lead to a different linear piece of CT-1, and the max over actions produces more pieces. The number of pieces can grow exponentially with horizon T.
In the textbook's example, C1(b) consists of two linear pieces: zero (for action a1) and 72p1 - 72p2 (for action a2, with γ = 0.9). The optimal policy at T=1 is: if p1 ≥ 0.5, take a2 (go for the reward); otherwise, take a1 (stay safe). The value function looks like a "hockey stick": flat at zero on the left, rising linearly on the right.
At T=2, the value function gains a third linear piece from the information-gathering strategy: execute a1 (which changes the state and yields an informative observation), then decide. The new piece has the form -33.048p1 + 7.776p2, which is positive for small p1 — meaning that even when the robot is uncertain, it can expect positive value by gathering information first.
The policy threshold shifts: the robot can now afford to be less certain before acting, because it has time to gather information. With a longer horizon, the threshold shifts further, and the expected value increases everywhere. The longer the horizon, the more the robot benefits from information gathering — but the computational cost of representing the piecewise-linear function grows. This creates a fundamental tradeoff: longer horizons produce better policies but require exponentially more alpha-vectors.
Piecewise linearity means we can represent the value function exactly on a computer, as a finite set of linear constraints (called alpha-vectors in the modern literature). Each alpha-vector αk = (Ck,1, ..., Ck,n) defines a hyperplane over the belief simplex, and the value function is their upper envelope. The POMDP value iteration algorithm builds this set recursively.
The algorithm maintains a constraint set ΦT at each horizon. The update for each horizon T produces new linear pieces from the old ones. For each action a and each assignment of observations to previous linear pieces, we get a new linear function:
where k(o') selects which linear piece of CT-1 is active for observation o'.
| Horizon | Max alpha-vectors | Growth pattern |
|---|---|---|
| T = 1 | |A| | One per action |
| T = 2 | |A| · |A||O| | Each observation independently picks a piece from T=1 |
| T = 3 | |A| · (|A| · |A||O|)|O| | Pieces from T=2 are combined per observation |
| General T | |A| · |ΦT-1||O| | Doubly exponential in T |
To illustrate the explosion: with |A|=2 actions and |O|=2 observations, horizon T=1 has 2 pieces, T=2 has up to 8, T=3 up to 128, and T=4 up to 2 · 1282 = 32,768. After pruning, the actual number is often much smaller, but the worst case grows as a tower of exponentials.
This complexity is intrinsic to the problem, not an artifact of the algorithm. The information-theoretic reason: at each time step, the robot must plan a different strategy for each possible observation sequence it might receive. The number of distinct observation sequences grows exponentially with the horizon. Each sequence induces a different belief trajectory through belief space, potentially requiring a different optimal action. The exact algorithm faithfully represents all these contingencies; approximate methods trade optimality for tractability by considering only the most likely or most useful contingencies.
In practice, linear programming identifies which constraints are active (defining the value function) and which are passive (dominated). Pruning passive constraints is essential for tractability but cannot eliminate the fundamental worst-case complexity.
The algorithm works by induction on the horizon T. For T=1, the constraint set Φ1 has one linear function per action (since the payoff c(b) is linear, each action gives one alpha-vector). For each subsequent horizon, the algorithm enumerates all actions and all assignments of observations to alpha-vectors from the previous horizon, computes the new linear coefficients using the formula above, and adds them to ΦT. Minimizing CT under these constraints (via linear programming) extracts the optimal value. Active constraints identify the optimal action for each region of belief space.
The algorithm can be divided into three computational phases at each horizon step:
1. Cross-sum: For each action, combine alpha-vectors from CT-1 with observation probabilities to create observation-specific alpha-vectors.
2. Enumeration: Form all combinations across observations (one alpha-vector per observation). Each combination yields a new alpha-vector for horizon T.
3. Pruning: Use linear programming to identify and discard dominated alpha-vectors (those that are never the maximum for any belief point). A linear program tests whether a given alpha-vector is ever the best: maximize its value minus the value of all other alpha-vectors. If this maximum is non-positive, the alpha-vector can be safely discarded.
The pruning step is what makes the difference between a theoretically correct but useless algorithm and a practical one. In the textbook example, half the alpha-vectors at T=2 are dominated and can be removed immediately. In larger problems, aggressive pruning can reduce the representation by orders of magnitude while preserving optimality.
For the general case (continuous states, infinite horizon), the exact algorithm doesn't apply. The general POMDP value iteration equation becomes:
This is an "in-principle" algorithm: it requires integration over infinite spaces and doesn't specify how Ĉ is represented. The two key open questions are: (1) how to represent a value function over probability distributions, and (2) how to compute the integrals efficiently. Making it practical requires approximation — either through Monte Carlo methods (next chapter) or by compressing the belief (Chapter 7).
For infinite horizons, the value function reaches an equilibrium described by the Bellman equation in belief space — a self-consistent equation where the value at each belief equals the best expected one-step value plus discounted future value. Just as in the MDP case, this fixed point can be approximated by iterating the backup operator until convergence. The convergence guarantee requires γ < 1, and the approximation error after k iterations is bounded by γk times the maximum value. In practice, convergence in belief space is typically slower than in state space because the belief space is so much larger and the value function is more complex to represent.
Exact POMDP solving is limited to tiny problems — the doubly exponential growth makes it infeasible for anything beyond a handful of states and a short horizon. For realistic robot tasks with continuous states and long horizons, we need approximations.
The Monte Carlo POMDP (MC-POMDP) approach, developed by Thrun and collaborators, brings three ideas together: particle filters to represent beliefs, Monte Carlo sampling to approximate the value update, and reinforcement learning to iteratively improve the policy. The result is an algorithm that can handle continuous state spaces and long planning horizons, at the cost of approximate (not optimal) solutions.
The idea is straightforward: instead of integrating over all observations and all states, we sample. The belief is represented by a set of N weighted particles (as in Chapter 4), and the value update is approximated by Monte Carlo sampling:
1. Represent the belief b as a set of weighted particles {⟨s(i), w(i)⟩}.
2. Sample transitions: For each action a, draw N samples: pick s from b according to weights, draw s' ~ P(s'|a,s), draw o' ~ P(o'|s').
3. Update belief: Compute posterior b' = B(b,a,o') using the particle filter from Chapter 4 — reweight particles by P(o'|s') and normalize.
4. Estimate value: C(b,a) ≈ (1/N) Σ γ[C(b') + c(s')], averaging over the N sampled transitions.
5. Select action: Pick the action with the highest estimated value.
The value function Q(θ, a) is learned via Bellman backups in belief space:
where θ is the particle-based belief representation. The particle filter computes samples of the reward and successor belief, from which the right-hand side is approximated by Monte Carlo averaging.
MC-POMDP uses reinforcement learning in belief space: the robot runs episodes, updates Q-values using Bellman backups, and gradually improves its policy. A mix of exploitation (following the current best policy) and exploration (10% random actions) ensures the robot discovers useful belief states. Experience replay and counter-based exploration from the reinforcement learning literature help determine the order of belief-state updates.
A key challenge in MC-POMDP is representing the value function Q over beliefs. Standard function approximators (neural networks, decision trees) accept fixed-length vectors, not probability distributions of varying complexity. The solution: nearest-neighbor interpolation in density space. When a new belief is encountered, its k nearest neighbors (by KL divergence between the corresponding particle distributions) are found in a growing database, and their Q-values are linearly averaged. If there aren't enough neighbors within a threshold distance, the new belief is added to the database — so the database grows over time, concentrating resolution where the robot actually visits.
Computing KL divergence between particle sets is itself non-trivial. Strictly speaking, KL divergence is not a distance metric (it's asymmetric and doesn't satisfy the triangle inequality), but it works well as a similarity measure for beliefs. The approach maps each particle set to a continuous density (via Gaussian kernels) and then uses Monte Carlo sampling to approximate the divergence. This "distance between densities" enables function approximation in the space of probability distributions — a capability not found in standard machine learning tools.
Two key optimizations make MC-POMDP practical:
Adaptive database growth: New belief states are only added to the database when they're sufficiently different (by KL divergence) from all existing entries. This naturally concentrates resolution in frequently-visited regions of belief space — the robot builds a detailed value map only where it matters.
Experience replay: Past transitions (b, a, o', b', r) are stored and re-used for additional Bellman backups. This dramatically accelerates convergence by extracting more learning from each simulated episode. Combined with counter-based exploration (tracking how many times each belief-action pair has been visited), this produces an efficient online learning algorithm that balances exploration of new beliefs with exploitation of known good policies.
POMDPs plan over the full belief — but that space is enormous and the exact algorithm is doubly exponential. MDPs plan over states — but they ignore uncertainty entirely and can lead to dangerous policies when the robot is unsure of its position. Is there a middle ground?
The answer, from Thrun and collaborators, is the Augmented MDP (AMDP). The key observation: in many robotics domains, the range of belief distributions the robot encounters is a small subspace of all possible beliefs. Most of the time, the belief is roughly Gaussian (unimodal, centered on the true pose). What matters for planning is not the exact shape of the belief, but a few summary statistics — particularly how uncertain the robot is.
Augmented MDPs (AMDPs) take a different approach to approximation: instead of sampling, they compress the belief into a low-dimensional statistic and plan over that statistic using standard MDP tools. The idea: the full belief b is summarized by a tuple:
where the first component is the most likely state (the MDP would just use this) and the second is the entropy of the belief:
The entropy captures "how uncertain" the robot is. High entropy means high uncertainty (the belief is spread over many states); low entropy means the robot is fairly sure of its state (the belief is concentrated on a few states). For a point-mass belief (zero entropy), the AMDP reduces to a standard MDP. For maximum entropy (uniform belief), the AMDP policy reflects the most cautious behavior.
The AMDP transition model P(b̄'|a,b̄) captures how actions change both the most likely state and the entropy. Moving through a featureless hallway increases entropy (the robot drifts without landmarks to correct against). Moving past a distinctive feature decreases entropy (the Bayes filter sharpens the belief). The AMDP planner can predict these entropy changes and avoid "information deserts."
The result is intuitive: in areas with many landmarks (walls, corners, distinctive features), the AMDP policy behaves like the MDP policy — it takes the shortest path. But when the shortest path crosses a barren region where localization would degrade, the AMDP diverts to stay near information-rich features. The detour cost is weighed against the risk of arriving at the goal with high uncertainty (and thus potentially missing it or colliding with nearby obstacles).
This behavior emerges naturally from the value function: cells in open areas with high entropy have lower value than the same cells with low entropy, because the robot's future reward potential is lower when it's uncertain about its position. The value gradient thus pulls the robot toward high-certainty paths (near landmarks) even when those paths are geometrically longer.
Value iteration for AMDPs replaces the full belief b with the statistic b̄ everywhere:
The transition model P(b̄'|a,b̄) is precomputed by sampling: simulate many beliefs consistent with b̄, apply actions, observe transitions, and cache the results. This preprocessing phase "translates" the familiar P(o|s) and P(s'|s,a) into the augmented state model, after which standard value iteration applies directly.
The key mathematical justification: the representation is valid when b̄ is a sufficient statistic of b for estimating value, meaning C(b) = C(b̄) for all beliefs the robot may encounter. In practice this rarely holds exactly, but the resulting policy is often good enough — especially when the robot's primary concern is "am I near a wall or in open space?" rather than the exact shape of the posterior.
Alternative statistics beyond (mode, entropy) are also possible. One might use the mean and variance of the belief, or its top-k modes, or any other low-dimensional summary. The choice of statistic is a modeling decision that trades off expressiveness against computational cost. For mobile robot navigation, (most likely position, entropy) works well because the entropy directly captures "how lost" the robot is.
Implementation-wise, the AMDP precomputes the transition model P(b̄'|a,b̄) by stochastic simulation: sample many beliefs consistent with a given (mode, entropy) pair, propagate them through the Bayes filter for various actions and observations, compute the resulting (mode, entropy) pairs, and cache the transition probabilities in a lookup table. After this preprocessing phase, value iteration runs on the augmented state space exactly as in the MDP case — making it orders of magnitude faster than full POMDP solving.
The augmented state space is typically discretized. For mobile robot navigation, the most likely state is a grid cell (x,y), and the entropy is discretized into bins (e.g., "low," "medium," "high" uncertainty). This gives a state space that is only a small multiple of the MDP state space — say 3x larger if we use 3 entropy bins. Compare this to the full POMDP belief space, which for a 1000-cell grid would be a 999-dimensional continuum.
The computational complexity of the AMDP is polynomial in the size of the augmented state space (number of grid cells times number of entropy bins), making it tractable for the same grid resolutions used by MDP planners. The main additional cost is the preprocessing phase where P(b̄'|a,b̄) is estimated by stochastic simulation. Once this model is cached, value iteration proceeds at essentially MDP speed.
In practice, the AMDP preprocessing requires running many simulated Bayes filter episodes to estimate how entropy changes under different conditions. This is a one-time cost that depends on the map, the sensor model, and the discretization of the entropy axis. For a typical museum-scale environment, Thrun et al. report total computation times comparable to standard MDP planning.
Experimental results from Thrun et al. demonstrate the coastal planner's advantage vividly. In a museum environment with a 40-meter open area, the MDP planner takes the shortest path straight across — oblivious to the fact that the robot will lose track of its position in the featureless expanse. The coastal planner stays near walls and obstacles, adding travel time but dramatically reducing localization uncertainty at the goal.
The benefit is largest with poor sensors (short range). As the sensor range increases, the difference gradually vanishes — with long-range sensors, the robot can localize from anywhere, so the coastal detour isn't worthwhile. Quantitatively, Thrun et al. measured the entropy of the robot's belief at the goal location as a function of sensor range. The coastal planner consistently achieved lower entropy (higher confidence at the goal), with the advantage most pronounced for sensors with range under 2 meters.
This illustrates a general principle: the value of uncertainty-aware planning depends on the ratio of environment complexity to sensor capability. In richly textured environments with good sensors, MDP planning suffices. In featureless or ambiguous environments with limited sensors, POMDP or AMDP planning can make the difference between mission success and failure.
| Method | Plans over | Complexity | Handles perception uncertainty? |
|---|---|---|---|
| MDP | States | Polynomial | No |
| AMDP | State + entropy | Polynomial (larger) | Approximately |
| POMDP | Full beliefs | Doubly exponential | Yes (optimal) |
Explore the difference between MDP planning (ignores uncertainty) and POMDP-style planning (considers uncertainty). The robot must reach the goal but doesn't know which end of the corridor it faces. A landmark above the corridor disambiguates orientation.
The MDP planner assumes it knows its state (optimistically picks a direction). The POMDP planner reasons about its uncertainty and visits the landmark first. Run each plan multiple times to see the statistical difference — MDP wins ~50% of the time, while POMDP wins every time. The certainty bar at the bottom shows how the robot's confidence changes over time: the MDP robot stays at 50% throughout (it never gathers information), while the POMDP robot's certainty jumps to 100% upon reaching the landmark.
The robot (blue) starts in the center with unknown orientation. The goal (green) and pit (red) are at opposite ends. A landmark (yellow diamond) disambiguates orientation. MDP planning charges toward the goal (50% pit). POMDP planning visits the landmark first.
This automatic balancing of exploration and exploitation is perhaps the most remarkable property of POMDP planning. Unlike heuristic approaches that separate "exploration" and "goal pursuit" into distinct modules, the POMDP integrates them into a single optimal decision process. Every action is evaluated simultaneously for its reward potential AND its information content. The robot never explores "just because" — it explores only when the information gained justifies the cost.
| Concept | Key Idea |
|---|---|
| POMDP | MDP extended with observations and beliefs — the robot plans over what it believes, not what it knows |
| Belief space | The space of all probability distributions over states; continuous even for finite state spaces |
| Belief-space value iteration | Bellman equation over beliefs, integrating over observations instead of beliefs |
| Piecewise-linear convexity | Optimal finite-horizon value functions are piecewise linear and convex in belief parameters |
| Exact POMDP | Computes optimal policy via linear programming; doubly exponential in horizon |
| MC-POMDP | Approximate solver using particle filters and nearest-neighbor value function |
| AMDP | Compresses belief to (most likely state, entropy); coastal navigation for robots |
| Exploration vs exploitation | POMDP optimally trades off information gathering and goal pursuit |
This chapter completes the planning arc of the book. Chapters 1-14 addressed perception: how robots estimate their state from noisy data. Chapter 15 introduced action selection under the assumption of full observability. This chapter unifies perception and action: the robot must simultaneously estimate its state and choose actions that balance goal achievement with information gathering. This is the most general — and most challenging — problem in probabilistic robotics.
The key takeaway is not any specific algorithm, but the insight that planning in belief space enables a robot to reason about its own ignorance and act to reduce it when beneficial. Every POMDP solver — whether exact, point-based, Monte Carlo, or augmented — embodies this principle. The computational challenge of POMDPs has driven decades of algorithmic innovation, from the piecewise-linear exact solutions of the 1960s to modern point-based solvers that handle problems with millions of states.
The ideas from this chapter extend far beyond robotics. POMDPs model any sequential decision-making problem where the decision-maker has incomplete information: medical diagnosis (the patient's true condition is hidden), spoken dialogue systems (the user's intent is uncertain), autonomous driving (other drivers' intentions are unknown), and scientific experimentation (the true hypothesis is hidden). Wherever an agent must balance "acting on what it knows" with "learning what it doesn't know," the POMDP framework provides the optimal solution — if the computation can be managed.