microPOMDP — Deciding Under Partial Observability

Chapter 1: Observations & Belief

Since you can't see the state directly, what can you work with? Observations — noisy signals that give you partial information. Each observation shifts your belief, a probability distribution over all possible states.

In the tiger problem, you start with a 50/50 belief: the tiger is equally likely behind either door. After listening and hearing a growl from the left, your belief shifts — say 85% left, 15% right. Listen again and hear left? Now maybe 97% left. Your belief is your internal model of reality.

Key insight: A POMDP agent doesn't make decisions based on the true state (it can't see it). It makes decisions based on its belief — a probability vector that summarizes everything it has learned from past observations.

Belief as a concrete vector: For a POMDP with 5 states, the belief is a vector of length 5 that sums to 1.0. It's a point on the probability simplex. A uniform belief is [0.2, 0.2, 0.2, 0.2, 0.2] — "I have no idea which state I'm in." A confident belief might be [0.95, 0.03, 0.01, 0.01, 0.0] — "I'm almost certain I'm in state 0." The tiger problem has only 2 states, so the belief is a single number: b(tiger-left) = 0.5 means total uncertainty; b = 0.97 means "almost sure the tiger is on the left."

Belief as a Probability Bar

Click "Hear Left" or "Hear Right" to see how observations shift the belief. Accuracy is 85%.

      b(tiger-left) =
      0.50
    

Check: What is a "belief" in a POMDP?

A random guess about what to do A probability distribution over all possible states The agent's preferred action

🔗 Pattern Recognition

Belief Update IS a Bayes Filter

This Lesson (POMDP)

b'(s') = η · Z(o|s',a) · ∑_s T(s'|s,a) · b(s)

Bayes Filter

bel(x_t) = η · p(z_t|x_t) · ∫ p(x_t|u_t,x_t-1) · bel(x_t-1) dx_t-1 → Bayes Filter lesson

These are the same equation. The POMDP belief update is a discrete Bayes filter where the action selects which transition model T to use, and the observation model Z provides the likelihood. Every POMDP agent is running a Bayes filter in its head.

What happens to this equation when states are continuous and Gaussian? You get the Kalman filter prediction+update cycle.

Chapter 2: The POMDP Tuple

A POMDP is formally defined by seven components. If you already know MDPs, you'll recognize most of them — the new ingredients are the observation set O and the observation model Z.

S — States

The hidden states of the world (e.g., tiger-left, tiger-right)

↓

A — Actions

What the agent can do (listen, open-left, open-right)

↓

O — Observations

Noisy signals the agent receives (hear-left, hear-right)

↓

T — Transitions

T(s'|s,a) — how states change after actions

↓

Z — Observation Model

Z(o|s',a) — probability of observation given new state and action

↓

R — Rewards

R(s,a) — immediate payoff for taking action a in state s

↓

γ — Discount

0 ≤ γ ≤ 1 — how much to value future rewards vs immediate ones

Tiger problem mapping: S = {tiger-left, tiger-right}. A = {listen, open-left, open-right}. O = {hear-left, hear-right}. T: after opening a door, tiger resets randomly. Z: listening gives correct hint 85% of the time. R: gold = +10, tiger = −100, listen = −1.

Component	MDP has it?	POMDP addition
S, A, T, R, γ	Yes	Same as MDP
O (observations)	No	What the agent actually sees
Z (observation model)	No	Links hidden state to observations

Check: Which two components does a POMDP add beyond a standard MDP?

Observations O and observation model Z Extra actions and extra states A learning rate and a temperature parameter

Chapter 3: Belief Update

After the agent takes action a and receives observation o, it must update its belief using Bayes' rule. The new belief b'(s') is proportional to the likelihood of the observation times the prior belief propagated through the transition model.

b'(s') = η · Z(o | s', a) · ∑_s T(s' | s, a) · b(s)

Here η is a normalizing constant that makes the probabilities sum to 1. In words: (1) propagate your old belief through the transition model, (2) weight by how likely the observation is in each state, (3) normalize.

The Full Update with Transitions

The tiger problem has trivial transitions (listening doesn't change the state), so let's see the full formula with a 5-state example. A robot is in one of 5 rooms. It takes action "move-right" and observes a sensor reading z:

5-state belief update
# Prior belief: mostly in room 2
b     = [0.05, 0.10, 0.60, 0.20, 0.05]

# Transition: "move right" shifts probability one room right
# T[s, move-right, s'] — each row sums to 1.0
# Mostly moves right (0.8) with some stay (0.15) and overshoot (0.05)

# Step 1: propagate belief through T
# b_predicted[s'] = sum_s T[s, a, s'] * b[s]
b_pred = [0.01, 0.05, 0.12, 0.56, 0.26]  # shifted right

# Step 2: weight by observation likelihood P(z | s')
# Sensor says "room 3 signature" with noise
Z_z   = [0.02, 0.05, 0.10, 0.75, 0.08]  # high for room 3

# Element-wise multiply
raw   = [0.0002, 0.0025, 0.012, 0.42, 0.0208]

# Step 3: normalize (sum = 0.4555)
b'    = [0.000, 0.005, 0.026, 0.922, 0.046]
# Belief collapsed: 92% confident we're in room 3

Tiger example, step by step: You believe 50/50: b = [0.5, 0.5]. You listen and hear a growl from the left. The observation likelihoods are Z(hear-left | tiger-left) = 0.85 and Z(hear-left | tiger-right) = 0.15. The update:

Walk-through
# Prior belief
b = [0.50, 0.50]           # [tiger-left, tiger-right]

# Observation likelihoods: P(hear-left | state)
Z = [0.85, 0.15]           # high if tiger IS left

# Step 1: multiply prior × likelihood (element-wise)
unnorm = [0.85 * 0.50, 0.15 * 0.50]  # = [0.425, 0.075]

# Step 2: normalize so it sums to 1.0
eta = 0.425 + 0.075       # = 0.500
b' = [0.425/0.5, 0.075/0.5]  # = [0.85, 0.15]

# One observation: 50/50 → 85/15. Hear left again?
# [0.85*0.85, 0.15*0.15] → normalize → [0.97, 0.03]

One observation shifted your belief from total uncertainty to strong suspicion. Two consistent observations push it to 97% — that's the decision threshold.

Interactive Belief Update

Step through the Bayesian belief update for the tiger problem. Watch the bar chart shift after each observation.

History: (none yet)

Check: In Bayesian belief update, what role does the observation model Z play?

It decides which action to take It weights each state by how likely the observation is in that state It transitions the agent to a new state

🔨 Derivation Derive the belief update from Bayes' rule ▶ ✓ ATTEMPTED

You know Bayes' rule: P(A|B) = P(B|A) · P(A) / P(B). You also know the POMDP belief update formula: b'(s') = η · Z(o|s',a) · ∑_s T(s'|s,a) · b(s).

Your task: Starting from Bayes' rule applied to P(s' | o, a, b), derive the full belief update formula. Show where the transition model T and observation model Z each enter.

We want P(s_t+1 | o_t+1, a_t, history). By Bayes' rule, this equals P(o_t+1 | s_t+1, a_t) · P(s_t+1 | a_t, history) / P(o_t+1 | a_t, history). The first term is Z. What's the second?

P(s_t+1 | a_t, history) is a prediction. Marginalize over the previous state: ∑_s P(s_t+1 | s, a_t) · P(s | history). That second factor is... the old belief b(s)! And the first is T(s'|s,a).

The denominator P(o | a, history) is a normalizing constant. It equals ∑_s' [Z(o|s',a) · ∑_s T(s'|s,a) · b(s)]. We call it η^-1 because it just makes the result sum to 1.

Full derivation:

Start: b'(s') = P(s' | o, a, b) = P(o | s', a, b) · P(s' | a, b) / P(o | a, b)

Step 1: P(o | s', a, b) = Z(o | s', a) by the observation model definition (observation depends only on current state and action, not history).

Step 2: P(s' | a, b) = ∑_{s ∈ S} P(s' | s, a) · b(s) = ∑_s T(s'|s,a) · b(s). This is the "prediction" step — propagate belief through transitions.

Step 3: P(o | a, b) = ∑_s' Z(o|s',a) · ∑_s T(s'|s,a) · b(s). This is a constant (w.r.t. s'), so call it η^-1.

Combining: b'(s') = η · Z(o|s',a) · ∑_s T(s'|s,a) · b(s)

The key insight: The belief update is just Bayes' rule applied to a hidden Markov chain. The sum-product structure (sum over old states, product with likelihood) is the same predict-then-correct pattern as the Kalman filter, particle filter, and HMM forward algorithm.

💻 Build It Implement the Belief Update Step ▶ ✓ ATTEMPTED

You've seen the formula: b'(s') = η · Z(o|s',a) · ∑_s T(s'|s,a) · b(s). Now implement it in Python for an arbitrary number of states. The function should work for any POMDP, not just the tiger problem.

signature def belief_update(b, a, o, T, Z): """ b: np.array of shape (|S|,) — current belief a: int — action index o: int — observation index T: np.array of shape (|S|, |A|, |S|) — T[s, a, s'] = P(s'|s,a) Z: np.array of shape (|S|, |A|, |O|) — Z[s', a, o] = P(o|s',a) Returns: np.array of shape (|S|,) — updated belief """

Test case (Tiger problem)

b = [0.5, 0.5], a = 0 (listen), o = 0 (hear-left)
T[s, 0, s'] = identity (listening doesn't change state)
Z[0, 0, 0] = 0.85, Z[1, 0, 0] = 0.15
Expected output: [0.85, 0.15]

b_predicted[s'] = sum over s of T[s, a, s'] * b[s]. In NumPy: b_pred = T[:, a, :].T @ b. Then element-wise multiply by Z[:, a, o] and normalize.

python
import numpy as np

def belief_update(b, a, o, T, Z):
    # Step 1: Predict — propagate belief through transition model
    # b_pred[s'] = sum_s T[s, a, s'] * b[s]
    b_pred = T[:, a, :].T @ b  # shape (|S|,)

    # Step 2: Update — weight by observation likelihood
    # unnormalized[s'] = Z[s', a, o] * b_pred[s']
    likelihood = Z[:, a, o]   # shape (|S|,)
    unnorm = likelihood * b_pred

    # Step 3: Normalize
    eta = unnorm.sum()
    if eta == 0:
        return np.ones_like(b) / len(b)  # degenerate case
    return unnorm / eta

Bonus challenge: What happens if eta = 0? This means the observation was impossible given the current belief. In practice this indicates a modeling error. How would you handle it in a real system? (Hint: particle filters face this "particle depletion" problem.)

Checkpoint — Before you move on

Explain in your own words: why is the belief b a "sufficient statistic" for decision-making? That is, why does the agent not need to remember the entire history of actions and observations — just the current belief vector?

✓ Gate cleared

Model Answer

The belief b(s) = P(s | a₁, o₁, ..., a_t, o_t) already encodes the entire history. By the Markov property of the underlying state transitions, once you know P(s_t | history), the future evolution depends only on s_t and your future actions — not on how you arrived at that belief. Two different histories that produce the same belief vector lead to identical optimal behavior going forward. This is why the "belief MDP" reduction works: the belief IS the state of the equivalent fully-observable problem.

Chapter 4: The Value of Information

Here's what makes POMDPs fascinating: sometimes the best action is to do nothing useful except gather information. In the tiger problem, opening a door immediately gives you a 50/50 chance of hitting the tiger. But listening (even at a small cost) lets you build confidence before committing.

This is the value of information: an information-gathering action has negative immediate reward (it costs you −1 to listen) but positive expected future reward because it lets you make a better decision later.

The strategic tradeoff: Every decision balances exploitation (act now with current knowledge) vs exploration (gather more info first). POMDPs formalize this tradeoff mathematically. The optimal policy automatically decides when to stop gathering info and commit.

When to Stop Listening?

Expected reward for each action as belief changes. At 50/50, listening dominates. As belief becomes confident, opening the correct door wins.

b(tiger-left)0.50

Action	Expected Reward at b=0.5	Expected Reward at b=0.97
Listen	−1 (always)	−1 (always)
Open Left	−45 (risky!)	+6.7 (safe)
Open Right	−45 (risky!)	−96.7 (disaster)

Check: Why might an optimal POMDP agent choose a costly information-gathering action?

Because it improves future decisions enough to offset the cost Because it has no other actions available Because immediate rewards don't matter

Chapter 5: Why POMDPs Are Hard

Solving an MDP means finding the best action for each state. Solving a POMDP means finding the best action for each belief — and the belief space is a continuous simplex. Even with just 2 states, the belief is a number between 0 and 1 (infinite possibilities). With n states, it's an (n−1)-dimensional continuous space.

This is the curse of dimensionality: the belief space grows exponentially with the number of states. Exact solutions are computationally intractable for all but the tiniest problems. POMDP solving is PSPACE-complete — among the hardest problems in computer science.

Scale comparison: A 10-state MDP has 10 states to solve. A 10-state POMDP has a 9-dimensional continuous belief simplex. A 100-state MDP has 100 states. A 100-state POMDP has a 99-dimensional belief space — astronomically large.

The formal complexity: finite-horizon POMDP solving is PSPACE-complete. That puts it among the hardest problems in computer science — harder than NP-complete problems (which MDPs fall into at worst). The root cause is the belief MDP reduction: you can convert any POMDP into an MDP over belief states, but that MDP has a continuous, (|S|−1)-dimensional state space even when the original POMDP has a finite number of states.

Why continuous = hard: In a regular MDP with |S| = 100, value iteration visits 100 states per sweep. In the equivalent belief MDP, the "state" is a point on a 99-dimensional simplex. You'd need to discretize this simplex to solve it — and even a coarse 10-point grid per dimension gives 10⁹⁹ belief points. That's more than the atoms in the universe. This is why exact POMDP solving is limited to problems with roughly 10–20 states.

Belief Simplex (3 States)

With 3 states, the belief lives on a 2D triangle (simplex). Each corner is 100% certainty in one state. The interior is mixed belief. Click to place a belief point.

Click the triangle to set a belief point

States	MDP policy maps from	POMDP policy maps from
2	2 discrete states	1D line segment [0,1]
10	10 discrete states	9D simplex
100	100 discrete states	99D simplex

Check: Why are POMDPs so much harder to solve than MDPs?

The belief space is continuous and grows exponentially with states POMDPs have more actions than MDPs The discount factor is harder to compute

🔨 Derivation Why is the value function piecewise linear and convex? ▶ ✓ ATTEMPTED

The POMDP value function V(b) over the belief simplex is piecewise linear and convex (PWLC). This is the key structural property that makes point-based solvers possible.

Your task: Show why V(b) is PWLC for a finite-horizon POMDP. Start from the 1-step case and argue inductively.

For horizon 1, V(b) = max_a ∑_s R(s,a) · b(s). Each action gives a linear function of b (it's a dot product: α_a · b where α_a[s] = R(s,a)). The max of linear functions is...?

V₁(b) = max_a α_a · b. The maximum of a finite set of linear (hence convex) functions is convex. And it's piecewise linear because each piece is one of the α-vectors. Now for the inductive step: assume V_t is PWLC. Show V_t+1 is PWLC.

V_t+1(b) = max_a [R(b,a) + γ ∑_o P(o|b,a) · V_t(b')]. The sum ∑_o is over a finite set. P(o|b,a) is linear in b. b' is a (nonlinear) function of b, but V_t(b') evaluated at the new belief is still a max of linear functions of the original b. Cross-multiplying produces new alpha-vectors.

Full derivation:

Base case (horizon 1): V₁(b) = max_a ∑_s R(s,a) · b(s) = max_a α_a · b. This is the maximum of |A| linear functions of b. Max of linear functions = piecewise linear and convex.

Inductive step: Assume V_t(b) = max_i α_i · b for some set of alpha-vectors {α_i}. For each (action a, observation o, alpha-vector α_i), define a new vector α_a,o,i[s] = γ · ∑_s' T(s'|s,a) · Z(o|s',a) · α_i[s']. Then V_t+1(b) = max_a [α_R,a · b + ∑_o max_i α_a,o,i · b]. Each term is linear in b; the sum and max preserve piecewise linearity and convexity.

The key insight: PWLC is preserved by the Bellman operator because the operator involves only: (1) linear combinations (sums, scalar products with b), (2) max operations, and (3) composition with the belief update (which maps linear functions of b' back to linear functions of b through the T and Z matrices). The number of alpha-vectors can grow exponentially with horizon — this is why exact solving is PSPACE-complete.

⚔ Adversarial: Why PSPACE and not just NP?

An MDP with |S| states can be solved in polynomial time (value iteration converges in O(|S|²|A|) per iteration). A POMDP with |S| states has a belief space of dimension |S|-1. Someone claims: "Just discretize the belief simplex into a grid and solve the resulting finite MDP." Why doesn't this reduce POMDP solving to a polynomial-size MDP?

Because discretization introduces approximation error Because the grid has exponentially many points in |S| — you need at least k^|S|-1 grid points to cover the simplex Because the discount factor makes it diverge

Chapter 6: Point-Based Solvers

Since we can't solve for every belief point, what if we just pick a handful of representative belief points and solve for those? This is the core idea behind point-based value iteration (PBVI) and its successors like SARSOP.

The value function of a POMDP is piecewise linear and convex (PWLC) over the belief simplex. This means it can be represented as the maximum of a set of hyperplanes (called alpha-vectors). Each alpha-vector corresponds to an action and defines the value for a region of belief space.

1. Sample Beliefs

Pick representative points on the belief simplex

↓

2. Backup

Compute best alpha-vector for each sampled belief

↓

3. Prune

Remove dominated alpha-vectors (never optimal anywhere)

↓

4. Iterate

Repeat until convergence — explore new reachable beliefs

Alpha-Vectors on Belief Space

For the 2-state tiger problem, the belief space is 1D. Each colored line is an alpha-vector. The value function (thick line) is the max over all of them.

SARSOP insight: Not all belief points are equally useful. SARSOP focuses sampling on reachable belief points — ones the agent would actually visit under the optimal policy. This dramatically reduces computation for practical problems.

Practical Approximation Ladder

Since exact solving is intractable for real problems, practitioners use a hierarchy of approximations, trading optimality for tractability:

Method	Idea	Scales to
QMDP	Assume full observability after one step. Solve the underlying MDP, use its Q-values weighted by belief. Fast but ignores information-gathering.	1000+ states
PBVI	Sample N belief points, compute alpha-vectors only at those. Approximation quality grows with N.	~100 states
SARSOP	Smart PBVI: only sample reachable beliefs. State of the art for exact-ish offline solving.	~100–500 states
Online planning	Don't solve the whole belief space. At each step, plan a few steps ahead from the current belief using tree search (POMCP).	10,000+ states
Deep RL + RNNs	Skip belief computation entirely. Train a recurrent policy that implicitly maintains belief through its hidden state.	Continuous states

The QMDP shortcut: QMDP is the simplest POMDP approximation. It solves the underlying MDP to get Q*(s,a), then computes the QMDP action as argmax_a ∑_s b(s) Q*(s,a). It's fast and works well when observations are informative enough that belief collapses quickly. It fails when information gathering matters — because it assumes you'll know the state after one step, it never chooses to "listen."

Check: What is an alpha-vector?

A hyperplane that defines the value function for a region of belief space A vector of observation probabilities The agent's action history

💥 Break-It Lab What Dies When You Remove POMDP Components? ▶ ✓ ATTEMPTED

A POMDP agent listens to gather information before opening a door. Watch how its performance degrades as you break different components of the framework. The chart shows average reward over 50 rounds.

Ignore observations (treat as MDP) ACTIVE

Failure mode: Without belief tracking, the agent treats b=0.5 forever. It can never build confidence, so it either gambles immediately (expected reward: -45) or listens forever (infinite -1 costs). This is what happens when you "just use RL without memory" on a POMDP.

Greedy policy (no info-gathering) ACTIVE

Failure mode: The agent always takes the immediately best action without considering future information gain. At b=0.5, all actions look equally bad, so it picks randomly. It never learns to listen first. Average reward plummets to roughly -45 per round.

Reduce observation accuracy to 55% ACTIVE

Failure mode: With near-random observations (55% vs 85%), the agent needs many more listens to reach the confidence threshold. Each listen costs -1, so total cost of information-gathering skyrockets. Eventually it's cheaper to just guess. The optimal policy degenerates toward "open immediately" when observations carry almost no information.

⚔ Adversarial: Your POMDP agent takes the same action regardless of observation history

You train a POMDP policy and notice it always executes the same action regardless of what observations it receives. The environment definitely has partial observability and information-gathering actions are available.

The agent has learned the optimal policy The transition model is deterministic The agent's belief state is not being updated (it's stuck at the prior), or its policy ignores belief

Chapter 7: SHOWCASE — The Tiger Problem

Put it all together. You are the POMDP agent. Behind one door is a tiger (−100), behind the other is gold (+10). Listening costs −1 but gives you an 85%-accurate hint. After opening a door, the tiger is re-placed randomly and you play again.

The optimal strategy: listen until your belief is confident enough (roughly 97%), then open the safe door. Can you beat random guessing?

Play the Tiger Problem

      Total reward: 0
      Round: 1
      Listens this round: 0
    

Strategy tip: The optimal number of listens is usually 2–4 before opening. After hearing the same side twice, your belief is about 97% — that's the sweet spot where the expected value of opening exceeds the cost of another listen.

Check: In the tiger problem, why is it suboptimal to always listen forever?

Each listen costs −1 and eventually the cost exceeds the benefit of more info The tiger moves after each listen You can only listen once

Chapter 8: POMDPs in the Real World

POMDPs are far more than a toy problem with tigers. They provide the mathematical framework for any situation where an agent must act under partial observability.

Domain	Hidden State	Observation	Key Decision
Robot navigation	True pose (x, y, θ)	Laser scans, camera	Where to move / explore
Dialogue systems	User's true intent	Noisy speech recognition	What to say next
Medical diagnosis	Patient's true condition	Test results, symptoms	Which test to order next
Active perception	Object identity	Partial camera view	Where to look next
Network security	Attacker's presence	Log anomalies	When to raise an alert

Active perception: A robot doesn't just passively receive sensor data — it chooses where to look. This is a POMDP: the hidden state is the scene, observations are camera images, and actions include moving the camera. The robot gathers information strategically.

Connections: POMDPs connect to many other frameworks. A Bayes filter handles the belief update step. Reinforcement learning solves the decision-making when the model is unknown. The Kalman filter is belief update for continuous Gaussian POMDPs. Hidden Markov Models are POMDPs without actions.

What practitioners actually use

Nobody solves a 10,000-state POMDP exactly. In practice, the belief-MDP reduction is used as a conceptual framework, and the actual computation is heavily approximated. Here's the realistic stack for a modern robot:

Practical POMDP pipeline
# 1. Belief tracking: run a particle filter
#    (1000 weighted samples approximate the belief)
particles = [sample_state() for _ in range(1000)]

# 2. Action selection: online planning from current belief
#    POMCP: Monte Carlo tree search over belief trees
#    Plans ~1 second ahead, not the full infinite horizon
action = pomcp_search(particles, depth=10)

# 3. Execute action, get observation
obs = env.step(action)

# 4. Update particles with new observation
particles = particle_filter_update(particles, action, obs)
# Repeat forever

"The real world is partially observable. The question is not whether to model uncertainty, but how well."

— Leslie Kaelbling

You now understand the POMDP framework. Every time you make a decision without full information, you are solving a POMDP — now you know the mathematics behind it.

Check: Which of these is an example of active perception?

A robot choosing where to point its camera to identify an object A camera passively recording video A thermostat reading the temperature

🏗 Design Challenge You're the Architect: Medical Diagnosis Robot ▶ ✓ ATTEMPTED

A hospital wants an AI system that decides which diagnostic tests to run on a patient. The patient has one of 5 possible conditions (hidden state). Tests vary in cost, invasiveness, and diagnostic power. The system must balance information gain against patient harm and budget.

Hidden states

5 diseases (one is "healthy" — no treatment needed)

Actions

3 cheap tests ($50, low info), 2 invasive tests ($500, high info), 5 treatments (one per disease)

Wrong treatment cost

$10,000 + patient harm. Treating "healthy" as sick: $2,000 wasted.

Budget per patient

$2,000 total for all tests before committing to treatment

Time pressure

Disease progresses: 5% worse outcomes per day of delay

1. When should the system order an invasive test vs. commit to treatment based on cheap tests alone? What belief threshold triggers the expensive test?

2. How do you encode "time pressure" in the POMDP? (Hint: think about how discount factor or transition model can capture disease progression.)

3. The system has 97% confidence the patient has disease A. Should it treat immediately or run one more confirmatory test? Under what conditions does the answer change?

Real-world approach: Medical POMDP research (Hauskrecht & Fraser, 2000; Ayer et al., 2012) formulates this exactly as described. Key decisions:

Reward structure: R(treat correctly) = +large, R(treat incorrectly) = -proportional to harm, R(test) = -cost, R(delay) = -per-timestep penalty that increases with disease severity. The time pressure is encoded as worsening rewards over time (non-stationary) or as transitions that make the disease state worse each timestep.

Threshold behavior: The optimal policy typically has "test until confident, then treat" structure — but the confidence threshold depends on the relative costs. If wrong treatment costs $10k and a test costs $500, you need belief > 1 - 500/10000 = 95% to justify stopping. With time pressure, this threshold decreases over time — eventually acting on uncertain beliefs beats waiting.

In practice: Exact POMDP solving is intractable for realistic medical problems (thousands of symptom combinations). Systems use POMCP (online tree search) or train neural policies with RNNs on patient histories. The POMDP formulation provides the principled framework; approximations make it practical.

🔨 Derivation Show that a POMDP with perfect observations is just an MDP ▶ ✓ ATTEMPTED

A POMDP is defined by (S, A, O, T, Z, R, γ). An MDP is (S, A, T, R, γ). They differ only in O and Z.

Your task: Show that if observations are perfect (O = S, and Z(o|s',a) = 1 if o = s', 0 otherwise), the POMDP belief update collapses the belief to a point mass, and the optimal POMDP policy is identical to the optimal MDP policy.

If Z(o|s',a) = δ(o, s'), then after observing o, the only state s' with nonzero likelihood is s' = o. The belief becomes b'(s') = 1 if s' = o, and 0 otherwise. A point mass!

If b is always a point mass on some state s (i.e., b = e_s, the one-hot vector), then V(b) = V(e_s) = max_a [R(s,a) + γ ∑_s' T(s'|s,a) V(e_s')]. Define V*(s) = V(e_s). What equation does V* satisfy?

Full derivation:

1. With Z(o|s',a) = δ(o=s'), the belief update gives: b'(s') = η · δ(o=s') · ∑_s T(s'|s,a) · b(s). This is nonzero only for s' = o, so b' = e_o (point mass on the observed state).

2. Since belief is always a point mass after the first observation, the POMDP value function V(b) only needs to be evaluated at point masses: V(e_s).

3. The Bellman equation becomes: V(e_s) = max_a [R(s,a) + γ ∑_s' T(s'|s,a) · 1 · V(e_s')]. (The "1" comes from P(o=s'|s',a) = 1.)

4. Defining V*(s) ≡ V(e_s), we get: V*(s) = max_a [R(s,a) + γ ∑_s' T(s'|s,a) V*(s')]. This is exactly the MDP Bellman optimality equation.

The key insight: MDPs are a special case of POMDPs where observations perfectly reveal state. The belief update becomes trivial (just "look at the state"), and planning over belief space reduces to planning over state space. This is why the POMDP framework is strictly more general than MDPs.

🔗 Pattern Recognition

Deep RL Approximates POMDP Solutions with Learned Memory

POMDP (exact)

Maintain explicit belief b(s) via Bayes rule. Plan over belief space. Policy: π(b) → action.

Deep RL (approximate)

Maintain implicit belief in RNN hidden state h_t. Train end-to-end. Policy: π(h_t) → action. → RL Algorithms lesson

An LSTM or Transformer processing observation history is learning to do belief tracking without being told the transition or observation models. Its hidden state h_t is an implicit, learned belief representation. When deep RL works well on partially observable tasks, it's because the network has discovered an approximate sufficient statistic for the history — exactly what formal belief tracking computes analytically.

Next time you see an RL agent with a recurrent policy outperform a feedforward one, ask: what "belief" is its hidden state encoding?

Understand the POMDP
Framework

Chapter 0: What You Cannot See

Chapter 1: Observations & Belief

Chapter 2: The POMDP Tuple

Chapter 3: Belief Update

The Full Update with Transitions

Chapter 4: The Value of Information

Chapter 5: Why POMDPs Are Hard

Chapter 6: Point-Based Solvers

Practical Approximation Ladder

Chapter 7: SHOWCASE — The Tiger Problem

Chapter 8: POMDPs in the Real World

What practitioners actually use

Understand the POMDPFramework

Chapter 0: What You Cannot See

Chapter 1: Observations & Belief

Chapter 2: The POMDP Tuple

Chapter 3: Belief Update

The Full Update with Transitions

Chapter 4: The Value of Information

Chapter 5: Why POMDPs Are Hard

Chapter 6: Point-Based Solvers

Practical Approximation Ladder

Chapter 7: SHOWCASE — The Tiger Problem

Chapter 8: POMDPs in the Real World

What practitioners actually use

Understand the POMDP
Framework