The Complete Beginner's Path

Understand the POMDP
Framework

How to make optimal decisions when you cannot see the full picture. The mathematical backbone of robot perception, dialogue systems, and strategic reasoning under uncertainty.

Prerequisites: Basic probability + Intuition for decision-making. Familiarity with MDPs helps but is not required.
9
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: What You Cannot See

In a standard Markov Decision Process (MDP), you see the full state of the world. You know exactly where you are on the game board, exactly how much fuel is in the tank, exactly which cards are in your hand. Life is rarely so generous.

A Partially Observable MDP (POMDP) models the realistic situation: you must act, but you only get partial, noisy observations of the true state. You hear a growl behind a door but don't know which door hides the tiger. You see a blurry radar blip but don't know if it's a plane or a bird.

The Tiger Problem: You stand before two closed doors. Behind one is a tiger (pain!), behind the other is a pot of gold (reward!). You can listen to gather information or open a door. Listening is cheap but noisy — you only hear the tiger correctly 85% of the time. What do you do?
MDP vs POMDP

In an MDP you see everything. In a POMDP you see only clues. Click to reveal what's behind each door.

Check: What makes a POMDP different from a standard MDP?

Chapter 1: Observations & Belief

Since you can't see the state directly, what can you work with? Observations — noisy signals that give you partial information. Each observation shifts your belief, a probability distribution over all possible states.

In the tiger problem, you start with a 50/50 belief: the tiger is equally likely behind either door. After listening and hearing a growl from the left, your belief shifts — say 85% left, 15% right. Listen again and hear left? Now maybe 97% left. Your belief is your internal model of reality.

Key insight: A POMDP agent doesn't make decisions based on the true state (it can't see it). It makes decisions based on its belief — a probability vector that summarizes everything it has learned from past observations.
Belief as a concrete vector: For a POMDP with 5 states, the belief is a vector of length 5 that sums to 1.0. It's a point on the probability simplex. A uniform belief is [0.2, 0.2, 0.2, 0.2, 0.2] — "I have no idea which state I'm in." A confident belief might be [0.95, 0.03, 0.01, 0.01, 0.0] — "I'm almost certain I'm in state 0." The tiger problem has only 2 states, so the belief is a single number: b(tiger-left) = 0.5 means total uncertainty; b = 0.97 means "almost sure the tiger is on the left."
Belief as a Probability Bar

Click "Hear Left" or "Hear Right" to see how observations shift the belief. Accuracy is 85%.

b(tiger-left) = 0.50
Check: What is a "belief" in a POMDP?
🔗 Pattern Recognition
Belief Update IS a Bayes Filter
This Lesson (POMDP)
b'(s') = η · Z(o|s',a) · ∑s T(s'|s,a) · b(s)
Bayes Filter
bel(xt) = η · p(zt|xt) · ∫ p(xt|ut,xt-1) · bel(xt-1) dxt-1Bayes Filter lesson

These are the same equation. The POMDP belief update is a discrete Bayes filter where the action selects which transition model T to use, and the observation model Z provides the likelihood. Every POMDP agent is running a Bayes filter in its head.

What happens to this equation when states are continuous and Gaussian? You get the Kalman filter prediction+update cycle.

Chapter 2: The POMDP Tuple

A POMDP is formally defined by seven components. If you already know MDPs, you'll recognize most of them — the new ingredients are the observation set O and the observation model Z.

S — States
The hidden states of the world (e.g., tiger-left, tiger-right)
A — Actions
What the agent can do (listen, open-left, open-right)
O — Observations
Noisy signals the agent receives (hear-left, hear-right)
T — Transitions
T(s'|s,a) — how states change after actions
Z — Observation Model
Z(o|s',a) — probability of observation given new state and action
R — Rewards
R(s,a) — immediate payoff for taking action a in state s
γ — Discount
0 ≤ γ ≤ 1 — how much to value future rewards vs immediate ones
Tiger problem mapping: S = {tiger-left, tiger-right}. A = {listen, open-left, open-right}. O = {hear-left, hear-right}. T: after opening a door, tiger resets randomly. Z: listening gives correct hint 85% of the time. R: gold = +10, tiger = −100, listen = −1.
ComponentMDP has it?POMDP addition
S, A, T, R, γYesSame as MDP
O (observations)NoWhat the agent actually sees
Z (observation model)NoLinks hidden state to observations
Check: Which two components does a POMDP add beyond a standard MDP?

Chapter 3: Belief Update

After the agent takes action a and receives observation o, it must update its belief using Bayes' rule. The new belief b'(s') is proportional to the likelihood of the observation times the prior belief propagated through the transition model.

b'(s') = η · Z(o | s', a) · ∑s T(s' | s, a) · b(s)

Here η is a normalizing constant that makes the probabilities sum to 1. In words: (1) propagate your old belief through the transition model, (2) weight by how likely the observation is in each state, (3) normalize.

The Full Update with Transitions

The tiger problem has trivial transitions (listening doesn't change the state), so let's see the full formula with a 5-state example. A robot is in one of 5 rooms. It takes action "move-right" and observes a sensor reading z:

5-state belief update
# Prior belief: mostly in room 2
b     = [0.05, 0.10, 0.60, 0.20, 0.05]

# Transition: "move right" shifts probability one room right
# T[s, move-right, s'] — each row sums to 1.0
# Mostly moves right (0.8) with some stay (0.15) and overshoot (0.05)

# Step 1: propagate belief through T
# b_predicted[s'] = sum_s T[s, a, s'] * b[s]
b_pred = [0.01, 0.05, 0.12, 0.56, 0.26]  # shifted right

# Step 2: weight by observation likelihood P(z | s')
# Sensor says "room 3 signature" with noise
Z_z   = [0.02, 0.05, 0.10, 0.75, 0.08]  # high for room 3

# Element-wise multiply
raw   = [0.0002, 0.0025, 0.012, 0.42, 0.0208]

# Step 3: normalize (sum = 0.4555)
b'    = [0.000, 0.005, 0.026, 0.922, 0.046]
# Belief collapsed: 92% confident we're in room 3
Tiger example, step by step: You believe 50/50: b = [0.5, 0.5]. You listen and hear a growl from the left. The observation likelihoods are Z(hear-left | tiger-left) = 0.85 and Z(hear-left | tiger-right) = 0.15. The update:
Walk-through
# Prior belief
b = [0.50, 0.50]           # [tiger-left, tiger-right]

# Observation likelihoods: P(hear-left | state)
Z = [0.85, 0.15]           # high if tiger IS left

# Step 1: multiply prior × likelihood (element-wise)
unnorm = [0.85 * 0.50, 0.15 * 0.50]  # = [0.425, 0.075]

# Step 2: normalize so it sums to 1.0
eta = 0.425 + 0.075       # = 0.500
b' = [0.425/0.5, 0.075/0.5]  # = [0.85, 0.15]

# One observation: 50/50 → 85/15. Hear left again?
# [0.85*0.85, 0.15*0.15] → normalize → [0.97, 0.03]

One observation shifted your belief from total uncertainty to strong suspicion. Two consistent observations push it to 97% — that's the decision threshold.

Interactive Belief Update

Step through the Bayesian belief update for the tiger problem. Watch the bar chart shift after each observation.

History: (none yet)
Check: In Bayesian belief update, what role does the observation model Z play?
🔨 Derivation Derive the belief update from Bayes' rule ✓ ATTEMPTED

You know Bayes' rule: P(A|B) = P(B|A) · P(A) / P(B). You also know the POMDP belief update formula: b'(s') = η · Z(o|s',a) · ∑s T(s'|s,a) · b(s).

Your task: Starting from Bayes' rule applied to P(s' | o, a, b), derive the full belief update formula. Show where the transition model T and observation model Z each enter.

We want P(st+1 | ot+1, at, history). By Bayes' rule, this equals P(ot+1 | st+1, at) · P(st+1 | at, history) / P(ot+1 | at, history). The first term is Z. What's the second?
P(st+1 | at, history) is a prediction. Marginalize over the previous state: ∑s P(st+1 | s, at) · P(s | history). That second factor is... the old belief b(s)! And the first is T(s'|s,a).
The denominator P(o | a, history) is a normalizing constant. It equals ∑s' [Z(o|s',a) · ∑s T(s'|s,a) · b(s)]. We call it η-1 because it just makes the result sum to 1.

Full derivation:

Start: b'(s') = P(s' | o, a, b) = P(o | s', a, b) · P(s' | a, b) / P(o | a, b)

Step 1: P(o | s', a, b) = Z(o | s', a) by the observation model definition (observation depends only on current state and action, not history).

Step 2: P(s' | a, b) = ∑s ∈ S P(s' | s, a) · b(s) = ∑s T(s'|s,a) · b(s). This is the "prediction" step — propagate belief through transitions.

Step 3: P(o | a, b) = ∑s' Z(o|s',a) · ∑s T(s'|s,a) · b(s). This is a constant (w.r.t. s'), so call it η-1.

Combining: b'(s') = η · Z(o|s',a) · ∑s T(s'|s,a) · b(s)

The key insight: The belief update is just Bayes' rule applied to a hidden Markov chain. The sum-product structure (sum over old states, product with likelihood) is the same predict-then-correct pattern as the Kalman filter, particle filter, and HMM forward algorithm.

💻 Build It Implement the Belief Update Step ✓ ATTEMPTED
You've seen the formula: b'(s') = η · Z(o|s',a) · ∑s T(s'|s,a) · b(s). Now implement it in Python for an arbitrary number of states. The function should work for any POMDP, not just the tiger problem.
signature def belief_update(b, a, o, T, Z): """ b: np.array of shape (|S|,) — current belief a: int — action index o: int — observation index T: np.array of shape (|S|, |A|, |S|) — T[s, a, s'] = P(s'|s,a) Z: np.array of shape (|S|, |A|, |O|) — Z[s', a, o] = P(o|s',a) Returns: np.array of shape (|S|,) — updated belief """
Test case (Tiger problem)
b = [0.5, 0.5], a = 0 (listen), o = 0 (hear-left)
T[s, 0, s'] = identity (listening doesn't change state)
Z[0, 0, 0] = 0.85, Z[1, 0, 0] = 0.15
Expected output: [0.85, 0.15]
b_predicted[s'] = sum over s of T[s, a, s'] * b[s]. In NumPy: b_pred = T[:, a, :].T @ b. Then element-wise multiply by Z[:, a, o] and normalize.
python
import numpy as np

def belief_update(b, a, o, T, Z):
    # Step 1: Predict — propagate belief through transition model
    # b_pred[s'] = sum_s T[s, a, s'] * b[s]
    b_pred = T[:, a, :].T @ b  # shape (|S|,)

    # Step 2: Update — weight by observation likelihood
    # unnormalized[s'] = Z[s', a, o] * b_pred[s']
    likelihood = Z[:, a, o]   # shape (|S|,)
    unnorm = likelihood * b_pred

    # Step 3: Normalize
    eta = unnorm.sum()
    if eta == 0:
        return np.ones_like(b) / len(b)  # degenerate case
    return unnorm / eta
Bonus challenge: What happens if eta = 0? This means the observation was impossible given the current belief. In practice this indicates a modeling error. How would you handle it in a real system? (Hint: particle filters face this "particle depletion" problem.)
Checkpoint — Before you move on
Explain in your own words: why is the belief b a "sufficient statistic" for decision-making? That is, why does the agent not need to remember the entire history of actions and observations — just the current belief vector?
✓ Gate cleared
Model Answer

The belief b(s) = P(s | a1, o1, ..., at, ot) already encodes the entire history. By the Markov property of the underlying state transitions, once you know P(st | history), the future evolution depends only on st and your future actions — not on how you arrived at that belief. Two different histories that produce the same belief vector lead to identical optimal behavior going forward. This is why the "belief MDP" reduction works: the belief IS the state of the equivalent fully-observable problem.

Chapter 4: The Value of Information

Here's what makes POMDPs fascinating: sometimes the best action is to do nothing useful except gather information. In the tiger problem, opening a door immediately gives you a 50/50 chance of hitting the tiger. But listening (even at a small cost) lets you build confidence before committing.

This is the value of information: an information-gathering action has negative immediate reward (it costs you −1 to listen) but positive expected future reward because it lets you make a better decision later.

The strategic tradeoff: Every decision balances exploitation (act now with current knowledge) vs exploration (gather more info first). POMDPs formalize this tradeoff mathematically. The optimal policy automatically decides when to stop gathering info and commit.
When to Stop Listening?

Expected reward for each action as belief changes. At 50/50, listening dominates. As belief becomes confident, opening the correct door wins.

b(tiger-left)0.50
ActionExpected Reward at b=0.5Expected Reward at b=0.97
Listen−1 (always)−1 (always)
Open Left−45 (risky!)+6.7 (safe)
Open Right−45 (risky!)−96.7 (disaster)
Check: Why might an optimal POMDP agent choose a costly information-gathering action?

Chapter 5: Why POMDPs Are Hard

Solving an MDP means finding the best action for each state. Solving a POMDP means finding the best action for each belief — and the belief space is a continuous simplex. Even with just 2 states, the belief is a number between 0 and 1 (infinite possibilities). With n states, it's an (n−1)-dimensional continuous space.

This is the curse of dimensionality: the belief space grows exponentially with the number of states. Exact solutions are computationally intractable for all but the tiniest problems. POMDP solving is PSPACE-complete — among the hardest problems in computer science.

Scale comparison: A 10-state MDP has 10 states to solve. A 10-state POMDP has a 9-dimensional continuous belief simplex. A 100-state MDP has 100 states. A 100-state POMDP has a 99-dimensional belief space — astronomically large.

The formal complexity: finite-horizon POMDP solving is PSPACE-complete. That puts it among the hardest problems in computer science — harder than NP-complete problems (which MDPs fall into at worst). The root cause is the belief MDP reduction: you can convert any POMDP into an MDP over belief states, but that MDP has a continuous, (|S|−1)-dimensional state space even when the original POMDP has a finite number of states.

Why continuous = hard: In a regular MDP with |S| = 100, value iteration visits 100 states per sweep. In the equivalent belief MDP, the "state" is a point on a 99-dimensional simplex. You'd need to discretize this simplex to solve it — and even a coarse 10-point grid per dimension gives 1099 belief points. That's more than the atoms in the universe. This is why exact POMDP solving is limited to problems with roughly 10–20 states.
Belief Simplex (3 States)

With 3 states, the belief lives on a 2D triangle (simplex). Each corner is 100% certainty in one state. The interior is mixed belief. Click to place a belief point.

Click the triangle to set a belief point
StatesMDP policy maps fromPOMDP policy maps from
22 discrete states1D line segment [0,1]
1010 discrete states9D simplex
100100 discrete states99D simplex
Check: Why are POMDPs so much harder to solve than MDPs?
🔨 Derivation Why is the value function piecewise linear and convex? ✓ ATTEMPTED

The POMDP value function V(b) over the belief simplex is piecewise linear and convex (PWLC). This is the key structural property that makes point-based solvers possible.

Your task: Show why V(b) is PWLC for a finite-horizon POMDP. Start from the 1-step case and argue inductively.

For horizon 1, V(b) = maxas R(s,a) · b(s). Each action gives a linear function of b (it's a dot product: αa · b where αa[s] = R(s,a)). The max of linear functions is...?
V1(b) = maxa αa · b. The maximum of a finite set of linear (hence convex) functions is convex. And it's piecewise linear because each piece is one of the α-vectors. Now for the inductive step: assume Vt is PWLC. Show Vt+1 is PWLC.
Vt+1(b) = maxa [R(b,a) + γ ∑o P(o|b,a) · Vt(b')]. The sum ∑o is over a finite set. P(o|b,a) is linear in b. b' is a (nonlinear) function of b, but Vt(b') evaluated at the new belief is still a max of linear functions of the original b. Cross-multiplying produces new alpha-vectors.

Full derivation:

Base case (horizon 1): V1(b) = maxas R(s,a) · b(s) = maxa αa · b. This is the maximum of |A| linear functions of b. Max of linear functions = piecewise linear and convex.

Inductive step: Assume Vt(b) = maxi αi · b for some set of alpha-vectors {αi}. For each (action a, observation o, alpha-vector αi), define a new vector αa,o,i[s] = γ · ∑s' T(s'|s,a) · Z(o|s',a) · αi[s']. Then Vt+1(b) = maxaR,a · b + ∑o maxi αa,o,i · b]. Each term is linear in b; the sum and max preserve piecewise linearity and convexity.

The key insight: PWLC is preserved by the Bellman operator because the operator involves only: (1) linear combinations (sums, scalar products with b), (2) max operations, and (3) composition with the belief update (which maps linear functions of b' back to linear functions of b through the T and Z matrices). The number of alpha-vectors can grow exponentially with horizon — this is why exact solving is PSPACE-complete.

⚔ Adversarial: Why PSPACE and not just NP?
An MDP with |S| states can be solved in polynomial time (value iteration converges in O(|S|²|A|) per iteration). A POMDP with |S| states has a belief space of dimension |S|-1. Someone claims: "Just discretize the belief simplex into a grid and solve the resulting finite MDP." Why doesn't this reduce POMDP solving to a polynomial-size MDP?

Chapter 6: Point-Based Solvers

Since we can't solve for every belief point, what if we just pick a handful of representative belief points and solve for those? This is the core idea behind point-based value iteration (PBVI) and its successors like SARSOP.

The value function of a POMDP is piecewise linear and convex (PWLC) over the belief simplex. This means it can be represented as the maximum of a set of hyperplanes (called alpha-vectors). Each alpha-vector corresponds to an action and defines the value for a region of belief space.

1. Sample Beliefs
Pick representative points on the belief simplex
2. Backup
Compute best alpha-vector for each sampled belief
3. Prune
Remove dominated alpha-vectors (never optimal anywhere)
4. Iterate
Repeat until convergence — explore new reachable beliefs
Alpha-Vectors on Belief Space

For the 2-state tiger problem, the belief space is 1D. Each colored line is an alpha-vector. The value function (thick line) is the max over all of them.

SARSOP insight: Not all belief points are equally useful. SARSOP focuses sampling on reachable belief points — ones the agent would actually visit under the optimal policy. This dramatically reduces computation for practical problems.

Practical Approximation Ladder

Since exact solving is intractable for real problems, practitioners use a hierarchy of approximations, trading optimality for tractability:

MethodIdeaScales to
QMDPAssume full observability after one step. Solve the underlying MDP, use its Q-values weighted by belief. Fast but ignores information-gathering.1000+ states
PBVISample N belief points, compute alpha-vectors only at those. Approximation quality grows with N.~100 states
SARSOPSmart PBVI: only sample reachable beliefs. State of the art for exact-ish offline solving.~100–500 states
Online planningDon't solve the whole belief space. At each step, plan a few steps ahead from the current belief using tree search (POMCP).10,000+ states
Deep RL + RNNsSkip belief computation entirely. Train a recurrent policy that implicitly maintains belief through its hidden state.Continuous states
The QMDP shortcut: QMDP is the simplest POMDP approximation. It solves the underlying MDP to get Q*(s,a), then computes the QMDP action as argmaxas b(s) Q*(s,a). It's fast and works well when observations are informative enough that belief collapses quickly. It fails when information gathering matters — because it assumes you'll know the state after one step, it never chooses to "listen."
Check: What is an alpha-vector?
💥 Break-It Lab What Dies When You Remove POMDP Components? ✓ ATTEMPTED
A POMDP agent listens to gather information before opening a door. Watch how its performance degrades as you break different components of the framework. The chart shows average reward over 50 rounds.
Ignore observations (treat as MDP) ACTIVE
Failure mode: Without belief tracking, the agent treats b=0.5 forever. It can never build confidence, so it either gambles immediately (expected reward: -45) or listens forever (infinite -1 costs). This is what happens when you "just use RL without memory" on a POMDP.
Greedy policy (no info-gathering) ACTIVE
Failure mode: The agent always takes the immediately best action without considering future information gain. At b=0.5, all actions look equally bad, so it picks randomly. It never learns to listen first. Average reward plummets to roughly -45 per round.
Reduce observation accuracy to 55% ACTIVE
Failure mode: With near-random observations (55% vs 85%), the agent needs many more listens to reach the confidence threshold. Each listen costs -1, so total cost of information-gathering skyrockets. Eventually it's cheaper to just guess. The optimal policy degenerates toward "open immediately" when observations carry almost no information.
⚔ Adversarial: Your POMDP agent takes the same action regardless of observation history
You train a POMDP policy and notice it always executes the same action regardless of what observations it receives. The environment definitely has partial observability and information-gathering actions are available.

Chapter 7: SHOWCASE — The Tiger Problem

Put it all together. You are the POMDP agent. Behind one door is a tiger (−100), behind the other is gold (+10). Listening costs −1 but gives you an 85%-accurate hint. After opening a door, the tiger is re-placed randomly and you play again.

The optimal strategy: listen until your belief is confident enough (roughly 97%), then open the safe door. Can you beat random guessing?

Play the Tiger Problem
Total reward: 0 Round: 1 Listens this round: 0
Strategy tip: The optimal number of listens is usually 2–4 before opening. After hearing the same side twice, your belief is about 97% — that's the sweet spot where the expected value of opening exceeds the cost of another listen.
Check: In the tiger problem, why is it suboptimal to always listen forever?

Chapter 8: POMDPs in the Real World

POMDPs are far more than a toy problem with tigers. They provide the mathematical framework for any situation where an agent must act under partial observability.

DomainHidden StateObservationKey Decision
Robot navigationTrue pose (x, y, θ)Laser scans, cameraWhere to move / explore
Dialogue systemsUser's true intentNoisy speech recognitionWhat to say next
Medical diagnosisPatient's true conditionTest results, symptomsWhich test to order next
Active perceptionObject identityPartial camera viewWhere to look next
Network securityAttacker's presenceLog anomaliesWhen to raise an alert
Active perception: A robot doesn't just passively receive sensor data — it chooses where to look. This is a POMDP: the hidden state is the scene, observations are camera images, and actions include moving the camera. The robot gathers information strategically.
Connections: POMDPs connect to many other frameworks. A Bayes filter handles the belief update step. Reinforcement learning solves the decision-making when the model is unknown. The Kalman filter is belief update for continuous Gaussian POMDPs. Hidden Markov Models are POMDPs without actions.

What practitioners actually use

Nobody solves a 10,000-state POMDP exactly. In practice, the belief-MDP reduction is used as a conceptual framework, and the actual computation is heavily approximated. Here's the realistic stack for a modern robot:

Practical POMDP pipeline
# 1. Belief tracking: run a particle filter
#    (1000 weighted samples approximate the belief)
particles = [sample_state() for _ in range(1000)]

# 2. Action selection: online planning from current belief
#    POMCP: Monte Carlo tree search over belief trees
#    Plans ~1 second ahead, not the full infinite horizon
action = pomcp_search(particles, depth=10)

# 3. Execute action, get observation
obs = env.step(action)

# 4. Update particles with new observation
particles = particle_filter_update(particles, action, obs)
# Repeat forever
"The real world is partially observable. The question is not whether to model uncertainty, but how well."
— Leslie Kaelbling

You now understand the POMDP framework. Every time you make a decision without full information, you are solving a POMDP — now you know the mathematics behind it.

Check: Which of these is an example of active perception?
🏗 Design Challenge You're the Architect: Medical Diagnosis Robot ✓ ATTEMPTED
A hospital wants an AI system that decides which diagnostic tests to run on a patient. The patient has one of 5 possible conditions (hidden state). Tests vary in cost, invasiveness, and diagnostic power. The system must balance information gain against patient harm and budget.
Hidden states
5 diseases (one is "healthy" — no treatment needed)
Actions
3 cheap tests ($50, low info), 2 invasive tests ($500, high info), 5 treatments (one per disease)
Wrong treatment cost
$10,000 + patient harm. Treating "healthy" as sick: $2,000 wasted.
Budget per patient
$2,000 total for all tests before committing to treatment
Time pressure
Disease progresses: 5% worse outcomes per day of delay
1. When should the system order an invasive test vs. commit to treatment based on cheap tests alone? What belief threshold triggers the expensive test?
2. How do you encode "time pressure" in the POMDP? (Hint: think about how discount factor or transition model can capture disease progression.)
3. The system has 97% confidence the patient has disease A. Should it treat immediately or run one more confirmatory test? Under what conditions does the answer change?

Real-world approach: Medical POMDP research (Hauskrecht & Fraser, 2000; Ayer et al., 2012) formulates this exactly as described. Key decisions:

Reward structure: R(treat correctly) = +large, R(treat incorrectly) = -proportional to harm, R(test) = -cost, R(delay) = -per-timestep penalty that increases with disease severity. The time pressure is encoded as worsening rewards over time (non-stationary) or as transitions that make the disease state worse each timestep.

Threshold behavior: The optimal policy typically has "test until confident, then treat" structure — but the confidence threshold depends on the relative costs. If wrong treatment costs $10k and a test costs $500, you need belief > 1 - 500/10000 = 95% to justify stopping. With time pressure, this threshold decreases over time — eventually acting on uncertain beliefs beats waiting.

In practice: Exact POMDP solving is intractable for realistic medical problems (thousands of symptom combinations). Systems use POMCP (online tree search) or train neural policies with RNNs on patient histories. The POMDP formulation provides the principled framework; approximations make it practical.

🔨 Derivation Show that a POMDP with perfect observations is just an MDP ✓ ATTEMPTED

A POMDP is defined by (S, A, O, T, Z, R, γ). An MDP is (S, A, T, R, γ). They differ only in O and Z.

Your task: Show that if observations are perfect (O = S, and Z(o|s',a) = 1 if o = s', 0 otherwise), the POMDP belief update collapses the belief to a point mass, and the optimal POMDP policy is identical to the optimal MDP policy.

If Z(o|s',a) = δ(o, s'), then after observing o, the only state s' with nonzero likelihood is s' = o. The belief becomes b'(s') = 1 if s' = o, and 0 otherwise. A point mass!
If b is always a point mass on some state s (i.e., b = es, the one-hot vector), then V(b) = V(es) = maxa [R(s,a) + γ ∑s' T(s'|s,a) V(es')]. Define V*(s) = V(es). What equation does V* satisfy?

Full derivation:

1. With Z(o|s',a) = δ(o=s'), the belief update gives: b'(s') = η · δ(o=s') · ∑s T(s'|s,a) · b(s). This is nonzero only for s' = o, so b' = eo (point mass on the observed state).

2. Since belief is always a point mass after the first observation, the POMDP value function V(b) only needs to be evaluated at point masses: V(es).

3. The Bellman equation becomes: V(es) = maxa [R(s,a) + γ ∑s' T(s'|s,a) · 1 · V(es')]. (The "1" comes from P(o=s'|s',a) = 1.)

4. Defining V*(s) ≡ V(es), we get: V*(s) = maxa [R(s,a) + γ ∑s' T(s'|s,a) V*(s')]. This is exactly the MDP Bellman optimality equation.

The key insight: MDPs are a special case of POMDPs where observations perfectly reveal state. The belief update becomes trivial (just "look at the state"), and planning over belief space reduces to planning over state space. This is why the POMDP framework is strictly more general than MDPs.

🔗 Pattern Recognition
Deep RL Approximates POMDP Solutions with Learned Memory
POMDP (exact)
Maintain explicit belief b(s) via Bayes rule. Plan over belief space. Policy: π(b) → action.
Deep RL (approximate)
Maintain implicit belief in RNN hidden state ht. Train end-to-end. Policy: π(ht) → action. → RL Algorithms lesson

An LSTM or Transformer processing observation history is learning to do belief tracking without being told the transition or observation models. Its hidden state ht is an implicit, learned belief representation. When deep RL works well on partially observable tasks, it's because the network has discovered an approximate sufficient statistic for the history — exactly what formal belief tracking computes analytically.

Next time you see an RL agent with a recurrent policy outperform a feedforward one, ask: what "belief" is its hidden state encoding?