How to make optimal decisions when you cannot see the full picture. The mathematical backbone of robot perception, dialogue systems, and strategic reasoning under uncertainty.
In a standard Markov Decision Process (MDP), you see the full state of the world. You know exactly where you are on the game board, exactly how much fuel is in the tank, exactly which cards are in your hand. Life is rarely so generous.
A Partially Observable MDP (POMDP) models the realistic situation: you must act, but you only get partial, noisy observations of the true state. You hear a growl behind a door but don't know which door hides the tiger. You see a blurry radar blip but don't know if it's a plane or a bird.
In an MDP you see everything. In a POMDP you see only clues. Click to reveal what's behind each door.
Since you can't see the state directly, what can you work with? Observations — noisy signals that give you partial information. Each observation shifts your belief, a probability distribution over all possible states.
In the tiger problem, you start with a 50/50 belief: the tiger is equally likely behind either door. After listening and hearing a growl from the left, your belief shifts — say 85% left, 15% right. Listen again and hear left? Now maybe 97% left. Your belief is your internal model of reality.
[0.2, 0.2, 0.2, 0.2, 0.2] — "I have no idea which state I'm in." A confident belief might be [0.95, 0.03, 0.01, 0.01, 0.0] — "I'm almost certain I'm in state 0." The tiger problem has only 2 states, so the belief is a single number: b(tiger-left) = 0.5 means total uncertainty; b = 0.97 means "almost sure the tiger is on the left."Click "Hear Left" or "Hear Right" to see how observations shift the belief. Accuracy is 85%.
These are the same equation. The POMDP belief update is a discrete Bayes filter where the action selects which transition model T to use, and the observation model Z provides the likelihood. Every POMDP agent is running a Bayes filter in its head.
What happens to this equation when states are continuous and Gaussian? You get the Kalman filter prediction+update cycle.
A POMDP is formally defined by seven components. If you already know MDPs, you'll recognize most of them — the new ingredients are the observation set O and the observation model Z.
| Component | MDP has it? | POMDP addition |
|---|---|---|
| S, A, T, R, γ | Yes | Same as MDP |
| O (observations) | No | What the agent actually sees |
| Z (observation model) | No | Links hidden state to observations |
After the agent takes action a and receives observation o, it must update its belief using Bayes' rule. The new belief b'(s') is proportional to the likelihood of the observation times the prior belief propagated through the transition model.
Here η is a normalizing constant that makes the probabilities sum to 1. In words: (1) propagate your old belief through the transition model, (2) weight by how likely the observation is in each state, (3) normalize.
The tiger problem has trivial transitions (listening doesn't change the state), so let's see the full formula with a 5-state example. A robot is in one of 5 rooms. It takes action "move-right" and observes a sensor reading z:
5-state belief update # Prior belief: mostly in room 2 b = [0.05, 0.10, 0.60, 0.20, 0.05] # Transition: "move right" shifts probability one room right # T[s, move-right, s'] — each row sums to 1.0 # Mostly moves right (0.8) with some stay (0.15) and overshoot (0.05) # Step 1: propagate belief through T # b_predicted[s'] = sum_s T[s, a, s'] * b[s] b_pred = [0.01, 0.05, 0.12, 0.56, 0.26] # shifted right # Step 2: weight by observation likelihood P(z | s') # Sensor says "room 3 signature" with noise Z_z = [0.02, 0.05, 0.10, 0.75, 0.08] # high for room 3 # Element-wise multiply raw = [0.0002, 0.0025, 0.012, 0.42, 0.0208] # Step 3: normalize (sum = 0.4555) b' = [0.000, 0.005, 0.026, 0.922, 0.046] # Belief collapsed: 92% confident we're in room 3
Walk-through # Prior belief b = [0.50, 0.50] # [tiger-left, tiger-right] # Observation likelihoods: P(hear-left | state) Z = [0.85, 0.15] # high if tiger IS left # Step 1: multiply prior × likelihood (element-wise) unnorm = [0.85 * 0.50, 0.15 * 0.50] # = [0.425, 0.075] # Step 2: normalize so it sums to 1.0 eta = 0.425 + 0.075 # = 0.500 b' = [0.425/0.5, 0.075/0.5] # = [0.85, 0.15] # One observation: 50/50 → 85/15. Hear left again? # [0.85*0.85, 0.15*0.15] → normalize → [0.97, 0.03]
One observation shifted your belief from total uncertainty to strong suspicion. Two consistent observations push it to 97% — that's the decision threshold.
Step through the Bayesian belief update for the tiger problem. Watch the bar chart shift after each observation.
You know Bayes' rule: P(A|B) = P(B|A) · P(A) / P(B). You also know the POMDP belief update formula: b'(s') = η · Z(o|s',a) · ∑s T(s'|s,a) · b(s).
Your task: Starting from Bayes' rule applied to P(s' | o, a, b), derive the full belief update formula. Show where the transition model T and observation model Z each enter.
Full derivation:
Start: b'(s') = P(s' | o, a, b) = P(o | s', a, b) · P(s' | a, b) / P(o | a, b)
Step 1: P(o | s', a, b) = Z(o | s', a) by the observation model definition (observation depends only on current state and action, not history).
Step 2: P(s' | a, b) = ∑s ∈ S P(s' | s, a) · b(s) = ∑s T(s'|s,a) · b(s). This is the "prediction" step — propagate belief through transitions.
Step 3: P(o | a, b) = ∑s' Z(o|s',a) · ∑s T(s'|s,a) · b(s). This is a constant (w.r.t. s'), so call it η-1.
Combining: b'(s') = η · Z(o|s',a) · ∑s T(s'|s,a) · b(s)
The key insight: The belief update is just Bayes' rule applied to a hidden Markov chain. The sum-product structure (sum over old states, product with likelihood) is the same predict-then-correct pattern as the Kalman filter, particle filter, and HMM forward algorithm.
b_pred = T[:, a, :].T @ b. Then element-wise multiply by Z[:, a, o] and normalize.python import numpy as np def belief_update(b, a, o, T, Z): # Step 1: Predict — propagate belief through transition model # b_pred[s'] = sum_s T[s, a, s'] * b[s] b_pred = T[:, a, :].T @ b # shape (|S|,) # Step 2: Update — weight by observation likelihood # unnormalized[s'] = Z[s', a, o] * b_pred[s'] likelihood = Z[:, a, o] # shape (|S|,) unnorm = likelihood * b_pred # Step 3: Normalize eta = unnorm.sum() if eta == 0: return np.ones_like(b) / len(b) # degenerate case return unnorm / eta
The belief b(s) = P(s | a1, o1, ..., at, ot) already encodes the entire history. By the Markov property of the underlying state transitions, once you know P(st | history), the future evolution depends only on st and your future actions — not on how you arrived at that belief. Two different histories that produce the same belief vector lead to identical optimal behavior going forward. This is why the "belief MDP" reduction works: the belief IS the state of the equivalent fully-observable problem.
Here's what makes POMDPs fascinating: sometimes the best action is to do nothing useful except gather information. In the tiger problem, opening a door immediately gives you a 50/50 chance of hitting the tiger. But listening (even at a small cost) lets you build confidence before committing.
This is the value of information: an information-gathering action has negative immediate reward (it costs you −1 to listen) but positive expected future reward because it lets you make a better decision later.
Expected reward for each action as belief changes. At 50/50, listening dominates. As belief becomes confident, opening the correct door wins.
| Action | Expected Reward at b=0.5 | Expected Reward at b=0.97 |
|---|---|---|
| Listen | −1 (always) | −1 (always) |
| Open Left | −45 (risky!) | +6.7 (safe) |
| Open Right | −45 (risky!) | −96.7 (disaster) |
Solving an MDP means finding the best action for each state. Solving a POMDP means finding the best action for each belief — and the belief space is a continuous simplex. Even with just 2 states, the belief is a number between 0 and 1 (infinite possibilities). With n states, it's an (n−1)-dimensional continuous space.
This is the curse of dimensionality: the belief space grows exponentially with the number of states. Exact solutions are computationally intractable for all but the tiniest problems. POMDP solving is PSPACE-complete — among the hardest problems in computer science.
The formal complexity: finite-horizon POMDP solving is PSPACE-complete. That puts it among the hardest problems in computer science — harder than NP-complete problems (which MDPs fall into at worst). The root cause is the belief MDP reduction: you can convert any POMDP into an MDP over belief states, but that MDP has a continuous, (|S|−1)-dimensional state space even when the original POMDP has a finite number of states.
With 3 states, the belief lives on a 2D triangle (simplex). Each corner is 100% certainty in one state. The interior is mixed belief. Click to place a belief point.
| States | MDP policy maps from | POMDP policy maps from |
|---|---|---|
| 2 | 2 discrete states | 1D line segment [0,1] |
| 10 | 10 discrete states | 9D simplex |
| 100 | 100 discrete states | 99D simplex |
The POMDP value function V(b) over the belief simplex is piecewise linear and convex (PWLC). This is the key structural property that makes point-based solvers possible.
Your task: Show why V(b) is PWLC for a finite-horizon POMDP. Start from the 1-step case and argue inductively.
Full derivation:
Base case (horizon 1): V1(b) = maxa ∑s R(s,a) · b(s) = maxa αa · b. This is the maximum of |A| linear functions of b. Max of linear functions = piecewise linear and convex.
Inductive step: Assume Vt(b) = maxi αi · b for some set of alpha-vectors {αi}. For each (action a, observation o, alpha-vector αi), define a new vector αa,o,i[s] = γ · ∑s' T(s'|s,a) · Z(o|s',a) · αi[s']. Then Vt+1(b) = maxa [αR,a · b + ∑o maxi αa,o,i · b]. Each term is linear in b; the sum and max preserve piecewise linearity and convexity.
The key insight: PWLC is preserved by the Bellman operator because the operator involves only: (1) linear combinations (sums, scalar products with b), (2) max operations, and (3) composition with the belief update (which maps linear functions of b' back to linear functions of b through the T and Z matrices). The number of alpha-vectors can grow exponentially with horizon — this is why exact solving is PSPACE-complete.
Since we can't solve for every belief point, what if we just pick a handful of representative belief points and solve for those? This is the core idea behind point-based value iteration (PBVI) and its successors like SARSOP.
The value function of a POMDP is piecewise linear and convex (PWLC) over the belief simplex. This means it can be represented as the maximum of a set of hyperplanes (called alpha-vectors). Each alpha-vector corresponds to an action and defines the value for a region of belief space.
For the 2-state tiger problem, the belief space is 1D. Each colored line is an alpha-vector. The value function (thick line) is the max over all of them.
Since exact solving is intractable for real problems, practitioners use a hierarchy of approximations, trading optimality for tractability:
| Method | Idea | Scales to |
|---|---|---|
| QMDP | Assume full observability after one step. Solve the underlying MDP, use its Q-values weighted by belief. Fast but ignores information-gathering. | 1000+ states |
| PBVI | Sample N belief points, compute alpha-vectors only at those. Approximation quality grows with N. | ~100 states |
| SARSOP | Smart PBVI: only sample reachable beliefs. State of the art for exact-ish offline solving. | ~100–500 states |
| Online planning | Don't solve the whole belief space. At each step, plan a few steps ahead from the current belief using tree search (POMCP). | 10,000+ states |
| Deep RL + RNNs | Skip belief computation entirely. Train a recurrent policy that implicitly maintains belief through its hidden state. | Continuous states |
Put it all together. You are the POMDP agent. Behind one door is a tiger (−100), behind the other is gold (+10). Listening costs −1 but gives you an 85%-accurate hint. After opening a door, the tiger is re-placed randomly and you play again.
The optimal strategy: listen until your belief is confident enough (roughly 97%), then open the safe door. Can you beat random guessing?
POMDPs are far more than a toy problem with tigers. They provide the mathematical framework for any situation where an agent must act under partial observability.
| Domain | Hidden State | Observation | Key Decision |
|---|---|---|---|
| Robot navigation | True pose (x, y, θ) | Laser scans, camera | Where to move / explore |
| Dialogue systems | User's true intent | Noisy speech recognition | What to say next |
| Medical diagnosis | Patient's true condition | Test results, symptoms | Which test to order next |
| Active perception | Object identity | Partial camera view | Where to look next |
| Network security | Attacker's presence | Log anomalies | When to raise an alert |
Nobody solves a 10,000-state POMDP exactly. In practice, the belief-MDP reduction is used as a conceptual framework, and the actual computation is heavily approximated. Here's the realistic stack for a modern robot:
Practical POMDP pipeline # 1. Belief tracking: run a particle filter # (1000 weighted samples approximate the belief) particles = [sample_state() for _ in range(1000)] # 2. Action selection: online planning from current belief # POMCP: Monte Carlo tree search over belief trees # Plans ~1 second ahead, not the full infinite horizon action = pomcp_search(particles, depth=10) # 3. Execute action, get observation obs = env.step(action) # 4. Update particles with new observation particles = particle_filter_update(particles, action, obs) # Repeat forever
You now understand the POMDP framework. Every time you make a decision without full information, you are solving a POMDP — now you know the mathematics behind it.
Real-world approach: Medical POMDP research (Hauskrecht & Fraser, 2000; Ayer et al., 2012) formulates this exactly as described. Key decisions:
Reward structure: R(treat correctly) = +large, R(treat incorrectly) = -proportional to harm, R(test) = -cost, R(delay) = -per-timestep penalty that increases with disease severity. The time pressure is encoded as worsening rewards over time (non-stationary) or as transitions that make the disease state worse each timestep.
Threshold behavior: The optimal policy typically has "test until confident, then treat" structure — but the confidence threshold depends on the relative costs. If wrong treatment costs $10k and a test costs $500, you need belief > 1 - 500/10000 = 95% to justify stopping. With time pressure, this threshold decreases over time — eventually acting on uncertain beliefs beats waiting.
In practice: Exact POMDP solving is intractable for realistic medical problems (thousands of symptom combinations). Systems use POMCP (online tree search) or train neural policies with RNNs on patient histories. The POMDP formulation provides the principled framework; approximations make it practical.
A POMDP is defined by (S, A, O, T, Z, R, γ). An MDP is (S, A, T, R, γ). They differ only in O and Z.
Your task: Show that if observations are perfect (O = S, and Z(o|s',a) = 1 if o = s', 0 otherwise), the POMDP belief update collapses the belief to a point mass, and the optimal POMDP policy is identical to the optimal MDP policy.
Full derivation:
1. With Z(o|s',a) = δ(o=s'), the belief update gives: b'(s') = η · δ(o=s') · ∑s T(s'|s,a) · b(s). This is nonzero only for s' = o, so b' = eo (point mass on the observed state).
2. Since belief is always a point mass after the first observation, the POMDP value function V(b) only needs to be evaluated at point masses: V(es).
3. The Bellman equation becomes: V(es) = maxa [R(s,a) + γ ∑s' T(s'|s,a) · 1 · V(es')]. (The "1" comes from P(o=s'|s',a) = 1.)
4. Defining V*(s) ≡ V(es), we get: V*(s) = maxa [R(s,a) + γ ∑s' T(s'|s,a) V*(s')]. This is exactly the MDP Bellman optimality equation.
The key insight: MDPs are a special case of POMDPs where observations perfectly reveal state. The belief update becomes trivial (just "look at the state"), and planning over belief space reduces to planning over state space. This is why the POMDP framework is strictly more general than MDPs.
An LSTM or Transformer processing observation history is learning to do belief tracking without being told the transition or observation models. Its hidden state ht is an implicit, learned belief representation. When deep RL works well on partially observable tasks, it's because the network has discovered an approximate sufficient statistic for the history — exactly what formal belief tracking computes analytically.
Next time you see an RL agent with a recurrent policy outperform a feedforward one, ask: what "belief" is its hidden state encoding?