How to make optimal decisions when you cannot see the full picture. The mathematical backbone of robot perception, dialogue systems, and strategic reasoning under uncertainty.
In a standard Markov Decision Process (MDP), you see the full state of the world. You know exactly where you are on the game board, exactly how much fuel is in the tank, exactly which cards are in your hand. Life is rarely so generous.
A Partially Observable MDP (POMDP) models the realistic situation: you must act, but you only get partial, noisy observations of the true state. You hear a growl behind a door but don't know which door hides the tiger. You see a blurry radar blip but don't know if it's a plane or a bird.
In an MDP you see everything. In a POMDP you see only clues. Click to reveal what's behind each door.
Since you can't see the state directly, what can you work with? Observations — noisy signals that give you partial information. Each observation shifts your belief, a probability distribution over all possible states.
In the tiger problem, you start with a 50/50 belief: the tiger is equally likely behind either door. After listening and hearing a growl from the left, your belief shifts — say 85% left, 15% right. Listen again and hear left? Now maybe 97% left. Your belief is your internal model of reality.
Click "Hear Left" or "Hear Right" to see how observations shift the belief. Accuracy is 85%.
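The 0.5 → 0.85 → ~0.97 progression is exactly what repeated Bayes updates produce. A minimal sketch for the two-door case, using the 85% listening accuracy from the problem:

```python
def listen_update(b_left, heard_left, accuracy=0.85):
    """Update P(tiger behind the left door) after one listening action."""
    # Likelihood of the observation under each hypothesis
    like_left = accuracy if heard_left else 1 - accuracy
    like_right = 1 - accuracy if heard_left else accuracy
    unnorm_left = like_left * b_left
    unnorm_right = like_right * (1 - b_left)
    return unnorm_left / (unnorm_left + unnorm_right)

b = 0.5
b = listen_update(b, heard_left=True)  # 0.85
b = listen_update(b, heard_left=True)  # ~0.97
```

Each consistent growl multiplies the odds by 0.85/0.15 ≈ 5.7, which is why confidence climbs so quickly.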
A POMDP is formally defined by seven components. If you already know MDPs, you'll recognize most of them — the new ingredients are the observation set O and the observation model Z.
| Component | MDP has it? | POMDP addition |
|---|---|---|
| S, A, T, R, γ | Yes | Same as MDP |
| O (observations) | No | What the agent actually sees |
| Z (observation model) | No | Links hidden state to observations |
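The seven components are concrete enough to write down. Here is a sketch of the tiger problem as plain Python (the state/action/observation labels are my own, the discount factor is an assumed value, and the ±100/+10/−1 rewards follow the classic formulation used later in this article):

```python
tiger_pomdp = {
    "S": ["tiger-left", "tiger-right"],          # hidden states
    "A": ["listen", "open-left", "open-right"],  # actions
    "O": ["hear-left", "hear-right"],            # observations
    "gamma": 0.95,                               # discount factor (assumed value)
    # T(s, a) -> distribution over s'. Listening leaves the tiger in place;
    # opening a door re-places it uniformly at random.
    "T": lambda s, a: {s: 1.0} if a == "listen" else {"tiger-left": 0.5, "tiger-right": 0.5},
    # Z(o | s', a): listening is 85% accurate; other actions are uninformative.
    "Z": lambda o, s, a: (0.85 if o.split("-")[1] == s.split("-")[1] else 0.15) if a == "listen" else 0.5,
    # R(s, a): -1 to listen, -100 for the tiger's door, +10 for the gold door.
    "R": lambda s, a: -1 if a == "listen" else (-100 if a == f"open-{s.split('-')[1]}" else 10),
}
```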
After the agent takes action a and receives observation o, it must update its belief using Bayes' rule:

b'(s') = η · Z(o | s', a) · Σₛ T(s' | s, a) · b(s)

The new belief b'(s') is proportional to the likelihood of the observation times the prior belief propagated through the transition model.
Here η is a normalizing constant that makes the probabilities sum to 1. In words: (1) propagate your old belief through the transition model, (2) weight by how likely the observation is in each state, (3) normalize.
Step through the Bayesian belief update for the tiger problem. Watch the bar chart shift after each observation.
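The three-step recipe maps directly to code. A generic sketch for any finite POMDP, where `T` and `Z` are functions I define here for illustration:

```python
def belief_update(b, a, o, states, T, Z):
    """Bayes filter: b(s) -> b'(s') after action a and observation o.

    T(s, a, s2) = P(s' = s2 | s, a);  Z(o, s2, a) = P(o | s' = s2, a).
    """
    # Step 1: propagate the old belief through the transition model
    predicted = {s2: sum(T(s, a, s2) * b[s] for s in states) for s2 in states}
    # Step 2: weight by how likely the observation is in each state
    unnorm = {s2: Z(o, s2, a) * predicted[s2] for s2 in states}
    # Step 3: normalize (eta = 1 / total)
    total = sum(unnorm.values())
    return {s2: p / total for s2, p in unnorm.items()}

# Tiger problem: listening doesn't move the tiger, hearing is 85% accurate.
states = ["left", "right"]
T = lambda s, a, s2: 1.0 if s == s2 else 0.0
Z = lambda o, s2, a: 0.85 if o == s2 else 0.15
b = belief_update({"left": 0.5, "right": 0.5}, "listen", "left", states, T, Z)
# b["left"] is now 0.85
```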
Here's what makes POMDPs fascinating: sometimes the best action is to do nothing useful except gather information. In the tiger problem, opening a door immediately gives you a 50/50 chance of hitting the tiger. But listening (even at a small cost) lets you build confidence before committing.
This is the value of information: an information-gathering action has negative immediate reward (it costs you −1 to listen) but positive expected future reward because it lets you make a better decision later.
Expected reward for each action as the belief changes. Here b is the probability that the tiger is behind the right door. At 50/50, listening dominates. As the belief becomes confident, opening the safe door wins.

| Action | Expected Reward at b = 0.5 | Expected Reward at b = 0.97 |
|---|---|---|
| Listen | −1 (always) | −1 (always) |
| Open Left | −45 (risky!) | +6.7 (safe) |
| Open Right | −45 (risky!) | −96.7 (disaster) |
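The table's entries are one-line expectations. A sketch, with b = P(tiger behind the right door) and the article's rewards:

```python
def expected_rewards(b, tiger=-100, gold=10, listen_cost=-1):
    """Immediate expected reward of each action; b = P(tiger behind the RIGHT door)."""
    return {
        "listen": listen_cost,
        "open-left": (1 - b) * tiger + b * gold,   # tiger is left with prob 1-b
        "open-right": b * tiger + (1 - b) * gold,
    }

print(expected_rewards(0.5))   # both doors: -45
print(expected_rewards(0.97))  # open-left: +6.7, open-right: -96.7
```

Note that listening only looks bad in isolation: its −1 buys a sharper belief, which is where its real value lies.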
Solving an MDP means finding the best action for each state. Solving a POMDP means finding the best action for each belief — and the belief space is a continuous simplex. Even with just 2 states, the belief is a number between 0 and 1 (infinite possibilities). With n states, it's an (n−1)-dimensional continuous space.
This is the curse of dimensionality in belief space: the dimension grows linearly with the number of states, but the number of belief points needed to cover the simplex grows exponentially with that dimension. Exact solutions are computationally intractable for all but the tiniest problems — solving finite-horizon POMDPs exactly is PSPACE-complete, putting them among the hardest problems in computer science.
With 3 states, the belief lives on a 2D triangle (simplex). Each corner is 100% certainty in one state. The interior is mixed belief. Click to place a belief point.
| States | MDP policy maps from | POMDP policy maps from |
|---|---|---|
| 2 | 2 discrete states | 1D line segment [0,1] |
| 10 | 10 discrete states | 9D simplex |
| 100 | 100 discrete states | 99D simplex |
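To see why even a coarse grid over the simplex is hopeless, count the grid points: discretizing an (n−1)-simplex in belief steps of 1/k gives C(n−1+k, k) points, a standard stars-and-bars count. A sketch:

```python
from math import comb

def simplex_grid_points(n_states, k):
    """Number of belief vectors over n states whose entries are multiples of 1/k."""
    return comb(n_states - 1 + k, k)

for n in (2, 10, 100):
    print(n, simplex_grid_points(n, k=10))
# 2 states: 11 points; 10 states: 92,378; 100 states: ~4e13
```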
Since we can't solve for every belief point, what if we just pick a handful of representative belief points and solve for those? This is the core idea behind point-based value iteration (PBVI) and its successors like SARSOP.
The value function of a POMDP is piecewise linear and convex (PWLC) over the belief simplex. This means it can be represented as the maximum of a set of hyperplanes (called alpha-vectors). Each alpha-vector corresponds to a conditional plan — in the simplest case, a single action — and defines the value over a region of belief space.
For the 2-state tiger problem, the belief space is 1D. Each colored line is an alpha-vector. The value function (thick line) is the max over all of them.
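Evaluating a PWLC value function is just a max over dot products. A sketch for the 2-state case; the three alpha-vectors below are illustrative values I chose for demonstration, not the solved tiger vectors:

```python
def value(b_right, alpha_vectors):
    """V(b) = max over alpha-vectors of <alpha, b>, with b = (P(left), P(right))."""
    b = (1 - b_right, b_right)
    return max(a[0] * b[0] + a[1] * b[1] for a in alpha_vectors)

# (value if tiger-left, value if tiger-right) for three hypothetical plans
alphas = [
    (-100.0, 10.0),  # open the left door
    (10.0, -100.0),  # open the right door
    (-2.0, -2.0),    # listen first (flattened into one vector for illustration)
]
print(value(0.5, alphas))   # near 50/50 the "listen" vector wins: -2.0
print(value(0.97, alphas))  # when confident, opening the safe door wins: 6.7
```

The thick line in the visualization is exactly this max: the upper envelope of all the alpha-vector lines.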
Put it all together. You are the POMDP agent. Behind one door is a tiger (−100), behind the other is gold (+10). Listening costs −1 but gives you an 85%-accurate hint. After opening a door, the tiger is re-placed randomly and you play again.
The optimal strategy: listen until your belief is confident enough (roughly 97%), then open the safe door. Can you beat random guessing?
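The threshold strategy is easy to test in simulation. A sketch under the article's parameters (−100 tiger, +10 gold, −1 listen, 85% accuracy); the seed and round count are arbitrary choices:

```python
import random

def play(threshold=0.97, accuracy=0.85, rounds=1000, seed=0):
    """Average reward per round for a listen-until-confident policy."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        tiger_left = rng.random() < 0.5
        b_left = 0.5  # belief that the tiger is behind the left door
        # Listen until we are confident enough either way
        while max(b_left, 1 - b_left) < threshold:
            total -= 1  # listening cost
            heard_left = rng.random() < (accuracy if tiger_left else 1 - accuracy)
            like_l = accuracy if heard_left else 1 - accuracy
            like_r = 1 - accuracy if heard_left else accuracy
            b_left = like_l * b_left / (like_l * b_left + like_r * (1 - b_left))
        open_left = b_left < 0.5  # open the door we believe is safe
        total += -100 if open_left == tiger_left else 10
    return total / rounds

print(play())                # listen-then-open: positive on average
print(play(threshold=0.5))   # open immediately at 50/50: about -45 per round
```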
POMDPs are far more than a toy problem with tigers. They provide the mathematical framework for any situation where an agent must act under partial observability.
| Domain | Hidden State | Observation | Key Decision |
|---|---|---|---|
| Robot navigation | True pose (x, y, θ) | Laser scans, camera | Where to move / explore |
| Dialogue systems | User's true intent | Noisy speech recognition | What to say next |
| Medical diagnosis | Patient's true condition | Test results, symptoms | Which test to order next |
| Active perception | Object identity | Partial camera view | Where to look next |
| Network security | Attacker's presence | Log anomalies | When to raise an alert |
You now understand the POMDP framework. Every time you make a decision without full information, you are solving a POMDP — now you know the mathematics behind it.