The Complete Beginner's Path

Understand the POMDP Framework

How to make optimal decisions when you cannot see the full picture. The mathematical backbone of robot perception, dialogue systems, and strategic reasoning under uncertainty.

Prerequisites: basic probability and an intuition for decision-making. Familiarity with MDPs helps but is not required.
9 Chapters · 6+ Simulations · 0 Assumed Knowledge

Chapter 0: What You Cannot See

In a standard Markov Decision Process (MDP), you see the full state of the world. You know exactly where you are on the game board, exactly how much fuel is in the tank, exactly which cards are in your hand. Life is rarely so generous.

A Partially Observable MDP (POMDP) models the realistic situation: you must act, but you only get partial, noisy observations of the true state. You hear a growl behind a door but don't know which door hides the tiger. You see a blurry radar blip but don't know if it's a plane or a bird.

The Tiger Problem: You stand before two closed doors. Behind one is a tiger (pain!), behind the other is a pot of gold (reward!). You can listen to gather information or open a door. Listening is cheap but noisy — you only hear the tiger correctly 85% of the time. What do you do?
MDP vs POMDP

In an MDP you see everything. In a POMDP you see only clues. Click to reveal what's behind each door.

Check: What makes a POMDP different from a standard MDP?

Chapter 1: Observations & Belief

Since you can't see the state directly, what can you work with? Observations — noisy signals that give you partial information. Each observation shifts your belief, a probability distribution over all possible states.

In the tiger problem, you start with a 50/50 belief: the tiger is equally likely behind either door. After listening and hearing a growl from the left, your belief shifts — say 85% left, 15% right. Listen again and hear left? Now maybe 97% left. Your belief is your internal model of reality.

Key insight: A POMDP agent doesn't make decisions based on the true state (it can't see it). It makes decisions based on its belief — a probability vector that summarizes everything it has learned from past observations.
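The belief shift described above can be computed in a few lines. This is a minimal sketch for the two-state tiger problem; the function name and structure are ours, not from any particular library.

```python
# Repeated Bayesian updates for the 2-state tiger problem.
# b_left is the agent's belief that the tiger is behind the left door.

def update(b_left, heard_left, accuracy=0.85):
    """Shift b(tiger-left) after one listen with an 85%-accurate growl."""
    # Likelihood of this observation under each hidden state
    like_left = accuracy if heard_left else 1 - accuracy
    like_right = 1 - accuracy if heard_left else accuracy
    unnorm_left = like_left * b_left
    unnorm_right = like_right * (1 - b_left)
    return unnorm_left / (unnorm_left + unnorm_right)

b = 0.5
b = update(b, heard_left=True)   # one growl from the left -> 0.85
b = update(b, heard_left=True)   # a second left growl -> ~0.97
print(round(b, 4))
```

Note how the second identical observation moves the belief less in absolute terms but much further in odds: evidence compounds multiplicatively.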
Belief as a Probability Bar

Click "Hear Left" or "Hear Right" to see how observations shift the belief. Accuracy is 85%.

b(tiger-left) = 0.50
Check: What is a "belief" in a POMDP?

Chapter 2: The POMDP Tuple

A POMDP is formally defined by seven components. If you already know MDPs, you'll recognize most of them — the new ingredients are the observation set O and the observation model Z.

S — States
The hidden states of the world (e.g., tiger-left, tiger-right)
A — Actions
What the agent can do (listen, open-left, open-right)
O — Observations
Noisy signals the agent receives (hear-left, hear-right)
T — Transitions
T(s'|s,a) — how states change after actions
Z — Observation Model
Z(o|s',a) — probability of observation given new state and action
R — Rewards
R(s,a) — immediate payoff for taking action a in state s
γ — Discount
0 ≤ γ ≤ 1 — how much to value future rewards vs immediate ones
Tiger problem mapping: S = {tiger-left, tiger-right}. A = {listen, open-left, open-right}. O = {hear-left, hear-right}. T: after opening a door, tiger resets randomly. Z: listening gives correct hint 85% of the time. R: gold = +10, tiger = −100, listen = −1.
Component | MDP has it? | POMDP addition
S, A, T, R, γ | Yes | Same as MDP
O (observations) | No | What the agent actually sees
Z (observation model) | No | Links hidden state to observations
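The tiger mapping above can be written down as plain data. The encoding below is a hypothetical one for illustration (real solvers use their own formats); the numbers follow the chapter's specification.

```python
# The tiger POMDP tuple spelled out as Python dictionaries.
S = ["tiger-left", "tiger-right"]
A = ["listen", "open-left", "open-right"]
O = ["hear-left", "hear-right"]

# T[a][s][s2] = P(s2 | s, a): listening changes nothing;
# opening a door resets the tiger uniformly at random.
T = {
    "listen": {s: {s2: 1.0 if s == s2 else 0.0 for s2 in S} for s in S},
    "open-left": {s: {s2: 0.5 for s2 in S} for s in S},
    "open-right": {s: {s2: 0.5 for s2 in S} for s in S},
}

# Z[a][s2][o] = P(o | s2, a): the 85%-accurate growl applies only to listening.
Z = {
    "listen": {
        "tiger-left": {"hear-left": 0.85, "hear-right": 0.15},
        "tiger-right": {"hear-left": 0.15, "hear-right": 0.85},
    },
}

# R[s][a]: gold +10, tiger -100, listening costs 1.
R = {
    "tiger-left": {"listen": -1, "open-left": -100, "open-right": 10},
    "tiger-right": {"listen": -1, "open-left": 10, "open-right": -100},
}

gamma = 0.95  # an assumed discount; the chapter leaves gamma unspecified
```

A quick sanity check worth running on any model you write down: every row of T and Z must sum to 1.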
Check: Which two components does a POMDP add beyond a standard MDP?

Chapter 3: Belief Update

After the agent takes action a and receives observation o, it must update its belief using Bayes' rule. The new belief b'(s') is proportional to the likelihood of the observation times the prior belief propagated through the transition model.

b'(s') = η · Z(o | s', a) · Σ_s T(s' | s, a) · b(s)

Here η is a normalizing constant that makes the probabilities sum to 1. In words: (1) propagate your old belief through the transition model, (2) weight by how likely the observation is in each state, (3) normalize.

Tiger example: You believe 50/50. You listen and hear a growl from the left. Z(hear-left | tiger-left, listen) = 0.85 and Z(hear-left | tiger-right, listen) = 0.15. After normalizing: b'(tiger-left) = 0.85. One observation shifted your belief from uncertainty to strong suspicion.
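The three-step recipe translates directly into code. This sketch transcribes the formula for the tiger problem's listen action; the model numbers come from the chapter, the names are ours.

```python
# b'(s') = eta * Z(o | s', a) * sum_s T(s' | s, a) * b(s)
S = ["tiger-left", "tiger-right"]
T_listen = {s: {s2: 1.0 if s == s2 else 0.0 for s2 in S} for s in S}
Z_listen = {"tiger-left": {"hear-left": 0.85, "hear-right": 0.15},
            "tiger-right": {"hear-left": 0.15, "hear-right": 0.85}}

def belief_update(b, T, Z, o):
    # Step 1: propagate the old belief through the transition model.
    predicted = {s2: sum(T[s][s2] * b[s] for s in b) for s2 in b}
    # Step 2: weight by the likelihood of the observation in each state.
    unnorm = {s2: Z[s2][o] * predicted[s2] for s2 in b}
    # Step 3: normalize (this is the eta in the formula).
    eta = 1.0 / sum(unnorm.values())
    return {s2: eta * p for s2, p in unnorm.items()}

b = {"tiger-left": 0.5, "tiger-right": 0.5}
b = belief_update(b, T_listen, Z_listen, "hear-left")
print(round(b["tiger-left"], 2))  # 0.85, matching the worked example
```

Because listening leaves the state unchanged, step 1 is a no-op here; it matters whenever the action can actually move the world (e.g. after opening a door, which re-randomizes the tiger).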
Interactive Belief Update

Step through the Bayesian belief update for the tiger problem. Watch the bar chart shift after each observation.

History: (none yet)
Check: In Bayesian belief update, what role does the observation model Z play?

Chapter 4: The Value of Information

Here's what makes POMDPs fascinating: sometimes the best action is to do nothing useful except gather information. In the tiger problem, opening a door immediately gives you a 50/50 chance of hitting the tiger. But listening (even at a small cost) lets you build confidence before committing.

This is the value of information: an information-gathering action has negative immediate reward (it costs you −1 to listen) but positive expected future reward because it lets you make a better decision later.

The strategic tradeoff: Every decision balances exploitation (act now with current knowledge) vs exploration (gather more info first). POMDPs formalize this tradeoff mathematically. The optimal policy automatically decides when to stop gathering info and commit.
When to Stop Listening?

Expected reward for each action as belief changes. At 50/50, listening dominates. As belief becomes confident, opening the correct door wins.

b(tiger-left) = 0.50
Action | Expected Reward at b=0.5 | Expected Reward at b=0.97
Listen | −1 (always) | −1 (always)
Open Left | −45 (risky!) | −96.7 (disaster)
Open Right | −45 (risky!) | +6.7 (safe)
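The expected rewards behind the comparison above are one-line calculations. A minimal sketch with the chapter's payoffs (+10 gold, −100 tiger, −1 listen), where b is the belief that the tiger is behind the left door:

```python
# Expected immediate reward of each action as a function of b = b(tiger-left).
def expected_reward(action, b):
    if action == "listen":
        return -1.0
    if action == "open-left":        # tiger is behind the left door w.p. b
        return b * -100 + (1 - b) * 10
    if action == "open-right":       # tiger is behind the right door w.p. 1-b
        return (1 - b) * -100 + b * 10

print(expected_reward("open-left", 0.5))           # -45.0: a coin flip is a bad bet
print(round(expected_reward("open-right", 0.97), 1))  # 6.7: confidence pays off
```

Keep in mind these are immediate rewards only; listening looks like a flat −1 here, and its real worth (a better decision later) only shows up in the full value function of Chapter 6.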
Check: Why might an optimal POMDP agent choose a costly information-gathering action?

Chapter 5: Why POMDPs Are Hard

Solving an MDP means finding the best action for each state. Solving a POMDP means finding the best action for each belief — and the belief space is a continuous simplex. Even with just 2 states, the belief is a number between 0 and 1 (infinite possibilities). With n states, it's an (n−1)-dimensional continuous space.

This is the curse of dimensionality: while the simplex's dimension grows only linearly with the number of states, the number of belief points needed to cover it grows exponentially. Exact solutions are computationally intractable for all but the tiniest problems — even finite-horizon POMDP solving is PSPACE-complete, among the hardest well-studied problems in computer science.

Scale comparison: A 10-state MDP has 10 states to solve. A 10-state POMDP has a 9-dimensional continuous belief simplex. A 100-state MDP has 100 states. A 100-state POMDP has a 99-dimensional belief space — astronomically large.
Belief Simplex (3 States)

With 3 states, the belief lives on a 2D triangle (simplex). Each corner is 100% certainty in one state. The interior is mixed belief. Click to place a belief point.

Click the triangle to set a belief point
States | MDP policy maps from | POMDP policy maps from
2 | 2 discrete states | 1D line segment [0,1]
10 | 10 discrete states | 9D simplex
100 | 100 discrete states | 99D simplex
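A back-of-the-envelope way to feel the blow-up: count how many belief points a grid needs if every coordinate is a multiple of 1/resolution. This is an illustrative counting exercise (stars and bars), not a solver.

```python
from math import comb

def simplex_grid_points(n_states, resolution):
    """Probability vectors over n_states with entries k/resolution."""
    # Stars and bars: distribute `resolution` units among n_states bins.
    return comb(resolution + n_states - 1, n_states - 1)

for n in (2, 10, 100):
    print(n, "states ->", simplex_grid_points(n, resolution=10), "grid points")
```

At resolution 1/10, 2 states need 11 points, 10 states already need tens of thousands, and 100 states need on the order of 10^13 — which is why Chapter 6 samples beliefs instead of gridding them.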
Check: Why are POMDPs so much harder to solve than MDPs?

Chapter 6: Point-Based Solvers

Since we can't solve for every belief point, what if we just pick a handful of representative belief points and solve for those? This is the core idea behind point-based value iteration (PBVI) and its successors like SARSOP.

The value function of a POMDP is piecewise linear and convex (PWLC) over the belief simplex. This means it can be represented as the maximum of a set of hyperplanes (called alpha-vectors). Each alpha-vector corresponds to an action and defines the value for a region of belief space.

1. Sample Beliefs
Pick representative points on the belief simplex
2. Backup
Compute best alpha-vector for each sampled belief
3. Prune
Remove dominated alpha-vectors (never optimal anywhere)
4. Iterate
Repeat until convergence — explore new reachable beliefs
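The PWLC idea above can be sketched in miniature: for the 2-state tiger problem an alpha-vector is just a pair of numbers, and evaluating the value function is a max of dot products. The three vectors below are illustrative one-step-reward stand-ins, not a converged PBVI/SARSOP solution.

```python
# Each alpha-vector = (value at b=0, value at b=1), where b = b(tiger-left).
alpha_vectors = {
    "open-left":  (10.0, -100.0),   # great if tiger is right (b=0), fatal if left
    "open-right": (-100.0, 10.0),
    "listen":     (-1.0, -1.0),     # flat: pay the listening cost either way
}

def value(b):
    """V(b) = max over alpha-vectors of alpha . (1-b, b)."""
    scores = {a: (1 - b) * v0 + b * v1 for a, (v0, v1) in alpha_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

print(value(0.5))    # near 50/50, the flat "listen" vector is on top
print(value(0.97))   # near certainty, an "open" vector dominates
```

Each alpha-vector carries its action with it, so the same max that gives the value also gives the policy — this is exactly what the thick line in the simulation traces out.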
Alpha-Vectors on Belief Space

For the 2-state tiger problem, the belief space is 1D. Each colored line is an alpha-vector. The value function (thick line) is the max over all of them.

SARSOP insight: Not all belief points are equally useful. SARSOP focuses sampling on reachable belief points — ones the agent would actually visit under the optimal policy. This dramatically reduces computation for practical problems.
Check: What is an alpha-vector?

Chapter 7: SHOWCASE — The Tiger Problem

Put it all together. You are the POMDP agent. Behind one door is a tiger (−100), behind the other is gold (+10). Listening costs −1 but gives you an 85%-accurate hint. After opening a door, the tiger is re-placed randomly and you play again.

The optimal strategy: listen until your belief is confident enough (roughly 97%), then open the safe door. Can you beat random guessing?

Play the Tiger Problem
Total reward: 0 Round: 1 Listens this round: 0
Strategy tip: The optimal number of listens is usually 2–4 before opening. After hearing the same side twice, your belief is about 97% — that's the sweet spot where the expected value of opening exceeds the cost of another listen.
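The threshold strategy can be stress-tested with a quick Monte Carlo simulation. This is an illustrative sketch using the showcase's parameters; a stopping threshold of 0.96 means two net same-side growls (belief ≈ 0.97) trigger the open.

```python
import random

ACC, GOLD, TIGER, LISTEN = 0.85, 10, -100, -1

def update(b_left, heard_left):
    """One Bayesian listen update on b(tiger-left)."""
    l = ACC if heard_left else 1 - ACC
    r = (1 - ACC) if heard_left else ACC
    return l * b_left / (l * b_left + r * (1 - b_left))

def play_round(threshold):
    tiger_left = random.random() < 0.5
    b, reward = 0.5, 0
    while max(b, 1 - b) < threshold:          # keep listening until confident
        heard_left = (random.random() < ACC) == tiger_left  # 85%-accurate growl
        b = update(b, heard_left)
        reward += LISTEN
    open_left = b < 0.5                       # open the door we think is safe
    return reward + (TIGER if open_left == tiger_left else GOLD)

random.seed(0)
guess = sum(play_round(threshold=0.0) for _ in range(10000)) / 10000   # open blind
smart = sum(play_round(threshold=0.96) for _ in range(10000)) / 10000  # listen first
print(f"blind guessing: {guess:.1f}, listen-first: {smart:.1f}")
```

Blind guessing averages around −45 per round, while the listen-first policy nets a positive average: the listening cost is far outweighed by the value of the information it buys.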
Check: In the tiger problem, why is it suboptimal to always listen forever?

Chapter 8: POMDPs in the Real World

POMDPs are far more than a toy problem with tigers. They provide the mathematical framework for any situation where an agent must act under partial observability.

Domain | Hidden State | Observation | Key Decision
Robot navigation | True pose (x, y, θ) | Laser scans, camera | Where to move / explore
Dialogue systems | User's true intent | Noisy speech recognition | What to say next
Medical diagnosis | Patient's true condition | Test results, symptoms | Which test to order next
Active perception | Object identity | Partial camera view | Where to look next
Network security | Attacker's presence | Log anomalies | When to raise an alert
Active perception: A robot doesn't just passively receive sensor data — it chooses where to look. This is a POMDP: the hidden state is the scene, observations are camera images, and actions include moving the camera. The robot gathers information strategically.
Connections: POMDPs connect to many other frameworks. A Bayes filter handles the belief update step. Reinforcement learning solves the decision-making when the model is unknown. The Kalman filter is belief update for continuous Gaussian POMDPs. Hidden Markov Models are POMDPs without actions.
"The real world is partially observable. The question is not whether to model uncertainty, but how well."
— Leslie Kaelbling

You now understand the POMDP framework. Every time you make a decision without full information, you are solving a POMDP — now you know the mathematics behind it.

Check: Which of these is an example of active perception?