Kochenderfer et al., Chapter 19

Reasoning Under Hidden States

When you can't see the world directly, you maintain a belief — a probability distribution over what might be true. This chapter builds every major belief update method from first principles.

Prerequisites: POMDPs (basic structure), Bayes' theorem, Gaussian distributions, Linear algebra basics.
10
Chapters
6
Simulations
10
Quizzes

Chapter 0: When You Can't See the State

You're a robot navigating a dark hallway. You know your sensors are noisy. You know your motors don't always move exactly as commanded. What position are you in? You can't know for certain — but you can maintain a probability distribution over all the positions you might be in. That distribution is called a belief.

This is the central challenge of a Partially Observable Markov Decision Process (POMDP): the true state s is hidden. Instead of observing s, the agent receives an observation o drawn from O(o|a,s'), a probabilistic function of the reached state and the action taken. The agent must reason about s indirectly.

Belief as sufficient statistic: A key theorem: the belief b(s) = P(s | a0, o1, a1, o2, …) is a sufficient statistic for the entire action-observation history. This means: to make optimal decisions, you only need to know the current belief. You don't need to remember every past observation — the belief compresses all that information.

In a standard MDP: policy maps states to actions, π(s) → a. In a POMDP: policy maps beliefs to actions, π(b) → a. The belief plays the role of the state. Every chapter in this lesson is about computing b efficiently and accurately.

The five chapters of this lesson (Chapters 0–9) build up from the most exact to the most general belief representation. Chapters 0–2 establish the framework and the exact discrete filter. Chapters 3–5 explore parametric (Gaussian) filters that trade exactness for efficiency in continuous state spaces. Chapters 6–7 introduce the nonparametric particle filter and particle injection. Chapter 8 is the showcase: a fully interactive crying baby demonstration that ties all the update mechanics together. Chapter 9 summarizes and provides the decision hierarchy.

The proof that belief is a sufficient statistic uses the fact that POMDP dynamics are Markovian in the hidden state. Given the current state s and action a, the distribution over next state s' and observation o depends only on (s, a) — not on any earlier history. Therefore, the belief b = P(s|history) captures exactly the information content of the history that is relevant to future decisions. Any additional memory of the raw history beyond what is encoded in b would be redundant.

This is a fundamental result with profound practical implications: instead of maintaining an ever-growing history (which would require exponentially more memory over time), the agent only needs to maintain and update a fixed-size probability distribution over states. For a discrete state space |S|, the belief is always a vector of |S| numbers, regardless of how long the agent has been running.

The Belief Simplex for 3 States

Every possible belief over 3 states is a point inside this triangle. Corners = certainty about one state. Center = maximum uncertainty. Click anywhere inside to set a belief and see b(s1), b(s2), b(s3).

b = [0.333, 0.333, 0.333]
Why is the belief a sufficient statistic for the history of actions and observations?

Chapter 1: Belief Initialization

Before taking any action, we need a starting belief b0. Two broad representations are available:

RepresentationFormStrengthsLimitations
ParametricCategorical (discrete S), Gaussian N(μ,Σ) (continuous S)Compact, closed-form updatesCannot represent multimodal beliefs
NonparametricSet of weighted particles {(si, wi)}Handles arbitrary distributions, multimodal beliefsDiscrete approximation; resolution limited by particle count

Choosing the initial belief matters enormously. A diffuse initial belief (e.g., uniform over all states) is robust — it makes few assumptions. A confident wrong initial belief can be catastrophic: it may take a very long time to correct, even with informative observations.

The autonomous car example from the text: A car must localize itself using lidar against a known map. At startup, its position is unknown. A good initialization: place particles uniformly across the entire map (or in regions consistent with the initial lidar scan). The initial belief is spread across many candidate positions and will concentrate around the true position as more lidar observations arrive. Committing to a wrong position at startup could cause the car to drive into a wall before the belief corrects.

Two special starting cases:

Known initial state: b0(s) = 1[s = s0]. The agent starts certain. Common in simulations.

Prior from problem knowledge: Use domain knowledge. "The robot starts near the charging dock, so b0 = N(μdock, small Σ)."

In practice, initial belief construction is one of the most important engineering decisions for a POMDP system. Here is a concrete comparison of three initialization strategies for a robot localization problem in a 100×100 meter warehouse (state = (x, y) position):

StrategyHowInitial costTime to localize
Uniform particlesm = 10,000 particles uniformly over 100m×100mNone (no prior knowledge needed)Many steps until beliefs concentrate
GPS-seeded GaussianN(μGPS, Σ = 3m) — GPS is coarse but informativeRequires GPS fix (may be unavailable indoors)Fast: GPS reduces uncertainty by ~90%
Observation-consistentSample s' ~ O(o0|s') from prior — use first lidar scan to filterModerate: requires first observation at startupVery fast: only positions matching the initial scan are kept
Confident-wrongN(μwrong, Σ = 0.5m) — small variance, wrong positionDangerous: may take 50+ steps to recoverVery slow or never recovers without injection
Particle count rule of thumb: For a d-dimensional continuous state space, you typically need m ~ 10d particles to maintain adequate coverage. For d=1: m=10–100 is fine. For d=2 (robot position): m=500–5,000. For d=6 (full 3D pose + velocity): m=10,000–100,000. This is why particle filters are practical for low-dimensional problems but struggle for very high-dimensional states.
Why is a diffuse (spread-out) initial belief generally safer than a confident initial belief when the true starting state is unknown?

Chapter 2: Discrete State Filter

For finite state and observation spaces, belief updates are exact. The belief b is a probability vector of length |S|. After taking action a and receiving observation o, the new belief b' is computed in two steps:

Predict:   b¨(s') = ∑s T(s'|s,a) · b(s)
Update:   b'(s') = η · O(o|a,s') · b¨(s')

where η is a normalizing constant: η = 1 / ∑s' O(o|a,s') b¨(s').

This is recursive Bayesian estimation on a discrete state space. The prediction step propagates the belief forward through the dynamics model. The update step incorporates the observation using Bayes' rule. The normalization ensures b' is a valid probability distribution.

Intuition: Think of the predict step as "smearing" your belief through uncertainty in the dynamics. If your action moves you north but sometimes slips, the prediction spreads probability to adjacent states. The update step then "sharpens" the belief based on what you actually observed, concentrating probability where the observation is most likely.

The computational cost is O(|S|2) per update (the double sum over s and s'). For large state spaces, this becomes impractical, motivating the approximate methods in later chapters.

Here is a complete step-by-step worked example of one discrete filter update. The crying baby problem: states = {sated=0, hungry=1}, action = "ignore", observation = "crying". Starting belief b = [0.5, 0.5].

python
# Crying baby discrete filter: one step, action=ignore, obs=crying
# T[a][s][s'] = P(s'|s, a). Action: ignore=2
T_ignore = [[0.9, 0.1],  # from sated: 90% stay sated, 10% go hungry
             [0.0, 1.0]]  # from hungry: always stay hungry
# O[a][s'][o]: observation model, o=0 means crying
O_ignore = [0.1, 0.8]    # P(crying|sated)=0.1, P(crying|hungry)=0.8

b = [0.5, 0.5]              # prior belief: 50/50

# --- PREDICT STEP ---
# b_pred[s'] = sum_s T(s'|s,a) * b[s]
b_pred = [0, 0]
b_pred[0] = T_ignore[0][0]*b[0] + T_ignore[1][0]*b[1]   # 0.9*0.5 + 0.0*0.5 = 0.45
b_pred[1] = T_ignore[0][1]*b[0] + T_ignore[1][1]*b[1]   # 0.1*0.5 + 1.0*0.5 = 0.55
print(f"Predicted: {b_pred}")   # [0.45, 0.55]

# --- UPDATE STEP --- (observation = crying)
# b_new[s'] = eta * O(crying|s') * b_pred[s']
unnorm = [O_ignore[0] * b_pred[0],   # 0.1 * 0.45 = 0.045  (sated, but still crying)
          O_ignore[1] * b_pred[1]]   # 0.8 * 0.55 = 0.440  (hungry, expected crying)
eta = 1.0 / sum(unnorm)              # 1 / 0.485 = 2.062
b_new = [eta * u for u in unnorm]   # [0.093, 0.907]
print(f"Updated:   {[f'{v:.3f}' for v in b_new]}")

# Interpretation: P(hungry|ignore, crying) ≈ 90.7%
# The crying observation strongly confirms hunger.
# If we had observed "quiet" instead:
# unnorm_q = [(1-0.1)*0.45, (1-0.8)*0.55] = [0.405, 0.110] → P(hungry) ≈ 21%
The geometry of belief updates: The predict step "diffuses" probability mass: sated probability drops from 0.5 to 0.45 (some of it flows to hungry), hungry probability rises from 0.5 to 0.55. The update step "focuses" probability mass based on evidence: crying is 8x more likely from hungry than sated, so P(hungry) jumps from 0.55 to 0.907 after normalization. Repeated observations of crying will push belief toward P(hungry) = 1.0 asymptotically.
Discrete Belief Update — 4-State POMDP

A 4-state hallway. Take Left/Right actions and observe (Bright/Dim). Watch how the belief vector shifts. The triangle below a bar shows the true (hidden) state. Higher bar = more probability mass.

Step: 0 | True: s1
What does the prediction step in the discrete filter compute?

Chapter 3: Kalman Filter

For continuous states with linear Gaussian dynamics, the belief is always a Gaussian N(μ,Σ). This is not an approximation — it's exact. The dynamics and observation models are:

T(s'|s,a):   s' = Tss + Taa + εs,    εs ~ N(0, Σs)
O(o|a,s'):   o = Oss' + εo,    εo ~ N(0, Σo)

The belief update has the famous two-step structure:

Predict (Propagate)
μp = Tsμb + Taa     Σp = TsΣbTs + Σs
Update (Correct)
K = ΣpOs(OsΣpOso)−1    μb = μp+K(o−Osμp)    Σb=(I−KOsp

The matrix K is the Kalman gain. It is the weight given to the innovation (o − Osμp), the difference between the actual observation and the predicted observation. Think of K as a trust slider:

In steady state, the variance Σ converges to a fixed-point determined entirely by the noise parameters — independent of the initial belief. This is called the steady-state Kalman filter. For a 1D random walk with process variance σs2 and observation variance σo2, the steady-state gain K* and variance Σ* satisfy:

Σ* = ( σs2/2 ) + √( (σs2/2)2 + σs2σo2 )
K* = Σ* / (Σ* + σo2)

With σs2 = 0.25 and σo2 = 1.0: Σ* = 0.125 + √(0.0156 + 0.25) ≈ 0.125 + 0.5063 = 0.631. K* = 0.631/(0.631+1) ≈ 0.387. After reaching steady state, every update sets K ≈ 0.387 — a fixed fraction of the innovation is incorporated regardless of history. This is why steady-state Kalman filters are so efficient: precompute K* offline, then run at O(n) per step online.

Here is a 5-step worked example tracing the 1D Kalman filter, starting from uncertainty σ²=4 and true position 0. Process noise σs²=0.25, observation noise σo²=1.0:

StepTrue posObservationσ² beforeKμ afterσ² after
10.31.24.254.25/5.25≈0.810+0.81×1.2≈0.97(1−0.81)×4.25≈0.81
20.50.11.061.06/2.06≈0.510.97+0.51×(0.1−0.97)≈0.530.49×1.06≈0.52
30.80.90.770.77/1.77≈0.440.53+0.44×(0.9−0.53)≈0.690.56×0.77≈0.43
40.60.50.680.68/1.68≈0.400.69+0.40×(0.5−0.69)≈0.610.60×0.68≈0.41
50.91.10.660.66/1.66≈0.400.61+0.40×(1.1−0.61)≈0.800.60×0.66≈0.40

Notice that σ² starts large (4.25) and rapidly shrinks to a steady state around 0.40 — consistent with the analytic steady-state formula (Σ* ≈ 0.631 for this noise setting). The gain K also stabilizes around 0.40. The filter's mean tracks the true position, staying within ±2σ = ±2×0.63 of each observation. The estimate at step 5 (0.80) is close to the true position 0.9, within one sigma.

Why Gaussian in, Gaussian out? The product of two Gaussian densities is proportional to a Gaussian. The linear transform of a Gaussian is a Gaussian. So in the linear Gaussian case, every update preserves the Gaussian form. The Kalman filter leverages this closure property to maintain an exact, minimal-parameter belief representation at all times.

The three key properties of the Kalman filter that make it the gold standard for linear systems:

Tuning tip: In practice, Σs (process noise) and Σo (observation noise) are often estimated from data or tuned empirically. Underestimating Σs makes the filter "stiff" — slow to respond to real state changes. Overestimating Σs makes the filter "jumpy" — it trusts observations too much and amplifies noise. The ratio Σso controls the steady-state gain K*; increasing this ratio increases K*, making the filter more responsive.

Here is the complete 1D Kalman filter in Python. The belief is just two numbers — mean μ and variance σ2. Everything else is algebra:

python
import numpy as np

class KalmanFilter1D:
    """1D Kalman filter: belief = (mu, sigma2)."""
    def __init__(self, mu0=0.0, sigma2_0=1.0):
        self.mu = mu0             # belief mean
        self.sigma2 = sigma2_0    # belief variance

    def predict(self, sigma2_process):
        # Propagate belief through random walk dynamics
        # mu_p = mu  (no deterministic drift in this example)
        # sigma2_p = sigma2 + sigma2_process
        self.sigma2 += sigma2_process
        # self.mu unchanged (identity transition)

    def update(self, obs, sigma2_obs):
        # Incorporate a noisy observation
        K = self.sigma2 / (self.sigma2 + sigma2_obs)   # Kalman gain
        self.mu = self.mu + K * (obs - self.mu)         # belief mean update
        self.sigma2 = (1 - K) * self.sigma2              # belief variance shrinks
        return K  # return K so we can inspect it

# Worked example matching the textbook's robot localization scenario
kf = KalmanFilter1D(mu0=0.0, sigma2_0=4.0)  # uncertain about initial position
true_pos = 0.0
for step in range(10):
    true_pos += np.random.randn() * 0.5       # true random walk
    obs = true_pos + np.random.randn() * 1.0  # noisy observation
    kf.predict(sigma2_process=0.25)           # process noise variance
    K = kf.update(obs, sigma2_obs=1.0)        # observation noise variance
    print(f"Step {step+1}: true={true_pos:.2f}, obs={obs:.2f}, "
          f"mu={kf.mu:.2f}, sigma={kf.sigma2**.5:.2f}, K={K:.3f}")
1D Kalman Filter

Track a 1D position. Purple line = true state (random walk). Orange dots = noisy observations. Teal band = belief (μ ± 2σ). Adjust noise levels to see how the gain K adapts.

Process noise σs 0.5
Observation noise σo 1.0
Step: 0 | K = —
When observation noise Σo is very large, what happens to the Kalman gain K and why?

Chapter 4: Extended Kalman Filter

What if the dynamics or observation functions are nonlinear? The standard Kalman filter fails because a Gaussian pushed through a nonlinear function is no longer Gaussian. The Extended Kalman Filter (EKF) handles this by linearizing the nonlinear functions at the current belief mean using first-order Taylor expansions (Jacobians).

T(s'|s,a):   s' ≈ fT(s,a) + JT(s−μb) + εs
O(o|a,s'):   o ≈ fO(s') + JO(s'−μp) + εo

where JT = ∂fT/∂s |μb and JO = ∂fO/∂s' |μp are the Jacobian matrices evaluated at the current mean. The Kalman update equations then apply with JT and JO in place of Ts and Os.

StepStandard KFEKF
Predict meanμp = Tsμb + Taaμp = fTb, a)
Predict covarianceΣp = TsΣbTs + ΣsΣp = JTΣbJT + Σs
Kalman gain KΣpOs(…)−1ΣpJO(…)−1
Update meanμp + K(o − Osμp)μp + K(o − fOp))
Limitation — when EKF breaks: The linear approximation is only valid near the current mean. For highly nonlinear functions (e.g., the wrap-around angle in a bearing-only sensor), the linearization can be wildly wrong. Worse, the EKF cannot represent multimodal posteriors: if the observation is consistent with two very different positions, the EKF squashes this into a single Gaussian instead of two separate peaks.

Here is a concrete worked EKF example: 2D robot tracking with a bearing-only sensor. State s = (x, y). Observation: the angle from the origin to the robot: fO(x,y) = atan2(y,x). This is highly nonlinear.

python
import numpy as np

# Robot state: s = [x, y] = [3.0, 1.0]   (true position)
# Belief mean: mu = [3.0, 1.0]  sigma2 = [[0.5, 0], [0, 0.5]]
mu = np.array([3.0, 1.0])
Sigma = np.diag([0.5, 0.5])

# Nonlinear observation model: angle theta = atan2(y, x)
# EKF: linearize via Jacobian J_O = d/d(x,y) atan2(y,x)
# d(atan2(y,x))/dx = -y/(x^2+y^2)
# d(atan2(y,x))/dy =  x/(x^2+y^2)

x, y = mu
r2 = x**2 + y**2          # = 3^2+1^2 = 10
J_O = np.array([[-y/r2, x/r2]])   # = [[-0.1, 0.3]]  (1x2 Jacobian)
print(f"Jacobian J_O = {J_O}")   # [[-0.1, 0.3]]

# Observation noise: sigma_o = 0.1 radians
Sigma_o = np.array([[0.01]])    # (0.1 rad)^2

# Kalman gain: K = Sigma * J_O^T * (J_O * Sigma * J_O^T + Sigma_o)^{-1}
S_innov = J_O @ Sigma @ J_O.T + Sigma_o  # = [[-0.1,0.3]] [[0.5,0],[0,0.5]] [[-0.1],[0.3]] + 0.01
                                          # = [[-0.1,0.3]] [[-0.05],[0.15]] + 0.01 = [0.05] + 0.01 = [0.06]
K = Sigma @ J_O.T @ np.linalg.inv(S_innov)  # 2x1 gain vector
print(f"K = {K.flatten()}")    # approximately [-0.833, 2.5]

# Observe: true bearing = atan2(1,3) ≈ 0.3217 rad
# Noisy observation: theta_obs = 0.35 rad
theta_pred = np.arctan2(mu[1], mu[0])   # ≈ 0.3217
innovation = np.array([0.35]) - theta_pred  # ≈ 0.028 rad
mu_new = mu + K.flatten() * innovation[0]  # adjust position toward observation
Sigma_new = (np.eye(2) - K @ J_O) @ Sigma  # covariance shrinks
print(f"Updated mu: {mu_new}")
# mu changes: x decreases slightly (bearing too large implies y was underestimated)
# Key: the EKF exploits the local geometry (Jacobian) to map bearing residual
# into (x,y) correction. Works well near the true position; fails far from it.
Linear vs. Nonlinear Gaussian Push-Through

Samples from a 1D Gaussian (left) pushed through a linear function f(x)=2x (center) and a nonlinear function f(x)=sin(x)+x (right). The linear result stays Gaussian; the nonlinear result becomes skewed. The EKF approximates the nonlinear case with a local linearization (dashed).

What does the EKF use to handle nonlinear dynamics?

Chapter 5: Unscented Kalman Filter

The EKF linearizes using a derivative. But what if computing Jacobians is hard (or the function isn't differentiable)? The Unscented Kalman Filter (UKF) takes a different approach: instead of linearizing the function, it passes a carefully chosen set of sigma points through the nonlinear function and fits a Gaussian to the output.

For a state of dimension n, the UKF computes 2n+1 sigma points from the current belief N(μ,Σ):

s0 = μ     (mean)
si = μ + (√(n+λ)Σ)i    for i = 1,…,n     (positive perturbations)
sn+i = μ − (√(n+λ)Σ)i    for i = 1,…,n     (negative perturbations)

where (√(n+λ)Σ)i denotes the i-th column of the matrix square root, and λ is a spread parameter. Each sigma point is propagated through the nonlinear function, and the output mean and covariance are reconstructed as weighted sums of the transformed points.

The weights are not all equal. The mean sigma point (s0) gets a weight designed to absorb the remaining probability mass, while the off-center points get equal weights that together balance the mean:

W0m = λ/(n+λ)     W0c = λ/(n+λ) + (1−α2+β)     (mean sigma point)
Wim = Wic = 1/(2(n+λ))    for i = 1,…,2n     (off-center points)

where α controls the spread, β = 2 for Gaussian priors (optimal), and λ = α2(n+κ) − n. After propagating each sigma point through the nonlinear function f:

μ' ≈ ∑i=02n Wim f(si)
Σ' ≈ ∑i=02n Wic (f(si) − μ')(f(si) − μ') + Σs

A concrete 1D example: n=1, λ=2. Belief: μ=0, σ2=1. Sigma points: s0=0 (weight 2/3), s1=+√3≈1.73 (weight 1/6), s2=−√3≈−1.73 (weight 1/6). Propagate through f(x) = sin(x): f(0)=0, f(1.73)≈0.987, f(−1.73)≈−0.987. Reconstructed mean: (2/3)×0 + (1/6)×0.987 + (1/6)×(−0.987) = 0. Reconstructed variance: (1/6)(0.987−0)2 + (1/6)(0.987−0)2 = 0.325 + 0.325 ≈ 0.65. The UKF correctly estimates that sin(x) for x~N(0,1) has variance less than 1 (since |sin| ≤ 1 compresses the tails).

Why 2n+1 points? The sigma points are chosen to match the mean and covariance of the input distribution exactly. The spread parameter λ controls how far the off-center points go. A Gaussian is fully characterized by its first two moments; these 2n+1 points capture those moments exactly before passing them through the nonlinearity — no derivative required.
UKF vs. EKF: The UKF captures the nonlinear transformation more accurately than the EKF for "moderately" nonlinear functions, without requiring Jacobians. It is often preferred when Jacobians are difficult to compute analytically. However, both still maintain a single Gaussian belief and fail when the true posterior is multimodal.
Sigma Points Visualization

A 2D Gaussian with covariance [[1,0.5],[0.5,2]]. Yellow = mean. Orange = sigma points. Adjust λ to control spread. With n=2, we always have 2×2+1=5 sigma points.

λ (spread parameter) 2.0
For a state space of dimension n, how many sigma points does the UKF use?

Chapter 6: Particle Filter

What if none of the Gaussian assumptions hold? What if the posterior is multimodal (two plausible locations), or the dynamics are highly nonlinear, or the state space is mixed discrete-continuous? Particle filters represent the belief as a set of weighted samples (particles) that can approximate any distribution.

The update algorithm for m particles {(si, wi)}i=1m:

1. Propagate
For each particle si, sample si' ~ T(s'|si,a). The particles are moved through the dynamics stochastically.
2. Weight
Compute wi = O(o|a,si'). Particles in states consistent with the observation get high weight; others get low weight.
3. Normalize
Divide all weights by their sum: wi ← wi / ∑j wj.
4. Resample
Draw m new particles with replacement, proportional to weights. This concentrates particles in high-probability regions.

The key insight: resampling discards low-probability particles and duplicates high-probability ones. After many steps, particles concentrate near the true state. The belief estimate is approximated by the empirical distribution of particles.

There are three standard resampling algorithms. They all draw m new particles proportional to weights but differ in implementation:

AlgorithmComplexityProperties
MultinomialO(m log m)Simple: draw m uniform [0,1) random numbers, binary search in CDF. High variance.
SystematicO(m)Generate one uniform random number u ∈ [0,1/m), then use equally spaced offsets u, u+1/m, u+2/m, …. Lower variance than multinomial.
StratifiedO(m)Generate m uniform numbers, one per stratum [i/m, (i+1)/m). Similar to systematic but slightly higher variance.

Systematic resampling (also called "low-variance resampling") is strongly preferred in practice. With multinomial resampling, a particle with weight 3/m might be duplicated 0, 1, 2, 3, or more times due to random chance. With systematic resampling, it is always duplicated exactly 3 times. This determinism reduces the randomness in the particle set and improves accuracy for the same m.

python
import numpy as np

def systematic_resample(particles, weights):
    """Low-variance systematic resampling: O(m)."""
    m = len(particles)
    cumulative = np.cumsum(weights)
    u = np.random.uniform(0, 1.0/m)           # one random number
    positions = u + np.arange(m) / m            # m equally-spaced positions
    indices = np.searchsorted(cumulative, positions)  # O(m) binary searches
    return particles[indices]

def multinomial_resample(particles, weights):
    """Standard multinomial resampling: O(m log m), higher variance."""
    m = len(particles)
    indices = np.random.choice(m, size=m, replace=True, p=weights)
    return particles[indices]

# For comparison with m=5 particles, weights=[0.4, 0.3, 0.1, 0.1, 0.1]:
# Multinomial: may give [0,1,3,1,0] or [2,1,1,1,0] — high variance
# Systematic: always gives [2,1,1,1,0] or [2,2,1,0,0] — deterministic spread
Advantages of particle filters: No parametric assumption. Can represent multimodal posteriors. Can handle discontinuities, constraints, and arbitrary dynamics. Scales to high-dimensional spaces reasonably well (compared to discrete filters). The downside: requires m particles to be reasonable; may need thousands of particles for precision.

Here is a miniature worked example with m = 5 particles (normally you'd use hundreds). 1D position tracking. True position = 0.6. Particles: {0.2, 0.4, 0.5, 0.7, 0.9}. Noisy observation: z = 0.65. Observation noise σo = 0.1.

Particle siPropagated si' (add process noise)wi = N(0.65; si', 0.01)Normalized wi
0.20.21exp(−(0.65−0.21)²/0.02) ≈ 0.000560.002
0.40.42exp(−(0.65−0.42)²/0.02) ≈ 0.0730.228
0.50.51exp(−(0.65−0.51)²/0.02) ≈ 0.3751.170 → 0.370
0.70.72exp(−(0.65−0.72)²/0.02) ≈ 0.7822.440 → 0.244
0.90.89exp(−(0.65−0.89)²/0.02) ≈ 0.0570.179 → 0.057

After normalization, w = [0.002, 0.073, 0.370, 0.244, 0.057] (sum to 1). Weighted mean: 0.002×0.21 + 0.073×0.42 + 0.370×0.51 + 0.244×0.72 + 0.057×0.89 ≈ 0.60 — very close to the true position 0.6! After resampling with these weights, particles near 0.51 and 0.72 will be duplicated, while the particle at 0.21 will almost certainly be dropped.

The particle filter algorithm is remarkably concise to implement. Each "particle" is just a state sample with a weight:

python
import numpy as np

def particle_filter_update(particles, weights, action, obs,
                              transition_fn, obs_likelihood_fn, m):
    """
    particles: array of shape (m, state_dim) — current particle set
    weights:   array of shape (m,) — current particle weights
    action:    the action taken
    obs:       the observation received
    transition_fn(s, a): samples s' ~ T(s'|s,a)
    obs_likelihood_fn(o, s'): returns O(o|s') as a scalar
    Returns: new particles and uniform weights after resampling
    """
    # Step 1: Propagate — sample s_i' ~ T(s'|s_i, a)
    new_particles = np.array([transition_fn(s, action) for s in particles])

    # Step 2: Weight — w_i = O(obs | s_i')
    new_weights = np.array([obs_likelihood_fn(obs, s) for s in new_particles])

    # Step 3: Normalize
    total = new_weights.sum()
    if total < 1e-12:
        new_weights[:] = 1.0 / m  # degenerate case: reset to uniform
    else:
        new_weights /= total

    # Step 4: Resample — draw m particles with replacement
    indices = np.random.choice(m, size=m, replace=True, p=new_weights)
    resampled = new_particles[indices]
    uniform_weights = np.full(m, 1.0 / m)

    return resampled, uniform_weights

# Example: 2D position tracking with random walk dynamics
m = 500  # number of particles
particles = np.random.uniform(0, 1, (m, 2))   # uniform initial belief over [0,1]^2
weights = np.full(m, 1.0 / m)

def transition(s, a): return np.clip(s + np.random.randn(2) * 0.03, 0, 1)
def likelihood(obs, s): return np.exp(-np.sum((s - obs)**2) / (2 * 0.05**2))

# After each action + observation, call:
# particles, weights = particle_filter_update(particles, weights, a, obs,
#                                              transition, likelihood, m)
# Belief mean estimate:
# mu = particles.mean(axis=0)
What is the purpose of the resampling step in a particle filter?

Chapter 7: Particle Injection

Particle filters have a critical failure mode called particle deprivation: all particles can converge to a single region of state space, leaving no coverage of other regions. If the true state then jumps to a previously-uncovered region (due to a sudden large transition or a rare event), the filter has no particles there and cannot recover.

Particle injection addresses this by periodically adding a small fraction of randomly sampled particles (from the prior or from the observation model) to the particle set. This ensures global coverage is maintained even as most particles concentrate around the likely true state.

The crying baby kidnapping problem: The text introduces a variant of the crying baby problem where the baby can be "kidnapped" — suddenly transported to a random state with small probability. A standard particle filter loses all diversity after a few steps and cannot recover from a kidnapping. Particle injection adds a few uniformly distributed particles each step, maintaining the ability to detect sudden state changes.

The mechanics of particle deprivation can be understood quantitatively. After k steps without resampling, the effective sample size Neff = 1/(∑wi2) decreases as particles drift. After resampling, Neff resets to m (all uniform weights). However, with each predict–weight–resample cycle, some particles are duplicated and others are dropped. After many cycles in a smooth, well-observed system, the particle set collapses to a few distinct values — particle deprivation even without any sudden state jumps. Injection prevents this collapse by continuously replenishing diversity.

Two types of injected particles:

Prior particles: Sample from the initial belief distribution b0(s). Provides broad global coverage regardless of the observation.

Observation-consistent particles: Sample states s' consistent with the current observation: s' ~ P(s'|o) ∝ O(o|a,s') b0(s'). More efficient but requires knowing the observation model in a form we can invert.

The injection rate ρ (fraction of particles replaced each step) is a crucial hyperparameter. Too low: the filter cannot recover from kidnapping. Too high: the filter loses precision even in normal operation. The optimal rate depends on the expected frequency of sudden state jumps:

ScenarioRecommended ρRationale
Stable system, no sudden jumpsρ = 0% (no injection needed)Particles track well; injection adds unnecessary noise
Rare sudden jumps (<1% per step)ρ = 2–5%Enough to recover from kidnapping; doesn't hurt normal tracking
Frequent sudden jumps (>5% per step)ρ = 10–20%Must maintain global coverage; precision will be reduced
Unknown jump rateρ = 5%Conservative default; works reasonably well in most cases

An alternative to fixed-rate injection: adaptive injection. After each resampling step, compute the normalized probability of the current observation under the existing particles: p(o|particles) = ∑i wi O(o|si). If this probability is very low (the observation is poorly explained by current particles), inject at a higher rate. This adapts injection to the evidence — injecting more when the filter is likely wrong.

Particle Filter: With and Without Injection

Track a 1D target that occasionally teleports (kidnapping). Orange = particles, purple triangle = true position. Watch how particle injection (teal) prevents collapse after a kidnapping event vs. no injection (warm).

Injection fraction 5%
Step: 0
What is "particle deprivation" and why is it a problem?

Chapter 8: Showcase — The Crying Baby Filter

The crying baby problem is the canonical 2-state POMDP from the textbook. Two hidden states: {sated, hungry}. Three actions: {feed, sing, ignore}. Two observations: {crying, quiet}. The agent must maintain a belief over the baby's state and decide how to act.

DynamicsP(hungry next step)
Feed (any state)0% — feeding always sates
Sated + sing/ignore10% — baby gradually gets hungry
Hungry + sing/ignore100% — stays hungry without feeding
Observation modelP(crying | state, action)
Hungry + sing90%
Hungry + feed/ignore80%
Sated + sing0% (singing fully calms a sated baby)
Sated + feed/ignore10%

The belief is a single number b = P(hungry). Watch how each action-observation pair updates this belief. Note that feeding deterministically resets belief to P(hungry) = 0 (we know feeding always sates).

Key update from the text: Starting with b = 0.5 (maximum uncertainty), if we ignore the baby and it cries, the posterior is: P(hungry|crying,ignore) ∝ 0.8 × 0.5 + 0.1 × 0.5 = [0.4, 0.05], normalized to ≈ [0.89, 0.11]. So P(hungry) ≈ 0.89. A crying baby is very likely hungry.
Crying Baby Discrete Belief Update

You are the caretaker. Choose actions. The belief bar shows P(hungry) (warm) vs P(sated) (teal). The baby face reacts to the observation. Reward accumulates at bottom. Can you keep the baby happy while tracking the true state?

Reward: 0 | Steps: 0
From the text — worked example: At b = [0.5, 0.5] (sated=0.5, hungry=0.5), ignore the baby:
Prediction: b' = [0.5×0.9 + 0.5×0, 0.5×0.1 + 0.5×1] = [0.45, 0.55].
Then observation = crying: weights = [O(cry|sated,ignore)×0.45, O(cry|hungry,ignore)×0.55] = [0.1×0.45, 0.8×0.55] = [0.045, 0.44]. Normalize: [0.045/0.485, 0.44/0.485] ≈ [0.093, 0.907].

Chapter 9: Summary & Filter Comparison

Chapter 19 establishes the belief update machinery that underpins all POMDP planning. Every filter produces the same object (a belief distribution over states) but makes different assumptions about the structure of the problem.

All five filters covered in this chapter share a common structure: a prediction step that propagates the belief forward through the dynamics, and an update step that incorporates the observation. The filters differ in how they represent the belief and how they implement each step. Understanding this shared structure is key to adapting filters to new problems.

Common structure (all filters):

  1. Predict: b → b¨ using T(s'|s,a)
  2. Update: b¨ + o → b' using O(o|a,s')
  3. Normalize: ensure b' sums/integrates to 1

What varies:

  • Belief representation: vector vs. Gaussian vs. particles
  • Prediction method: matrix multiply vs. Riccati vs. particle propagation
  • Update method: elementwise multiply vs. Kalman gain vs. importance weighting
FilterState SpaceDynamicsBelief FormExact?
DiscreteFinite SAny discrete T, OCategorical vectorYes
KalmanContinuousLinear GaussianGaussian N(μ,Σ)Yes
Extended KalmanContinuousNonlinear Gaussian (mild)Gaussian (approx.)No
Unscented KalmanContinuousNonlinear GaussianGaussian (better approx.)No
ParticleAnyAny (sample from T)Weighted particlesAsymptotically
The decision hierarchy: Start with the most exact filter your problem permits.
  1. Discrete state space, small? → Discrete filter.
  2. Continuous, linear dynamics, Gaussian noise? → Kalman filter.
  3. Continuous, mildly nonlinear? → EKF (or UKF if Jacobians are hard).
  4. Strongly nonlinear, multimodal, or arbitrary? → Particle filter.
Looking ahead: Chapter 20 uses the belief update methods from this chapter as the foundation for solving POMDPs exactly. The belief is treated as the "state" of a continuous-state MDP, and dynamic programming is applied over it. The key structural result: the value function over belief space is piecewise-linear and convex.

Here is a side-by-side comparison of computational costs for each filter, to guide your implementation choice:

FilterMemoryPer-update costPrimary failure mode
DiscreteO(|S|)O(|S|2)|S| exponentially large for many state dimensions
KalmanO(n2)O(n3) for matrix inversionNon-Gaussian noise, nonlinear dynamics
EKFO(n2)O(n3) + Jacobian computationSevere nonlinearity, diverges far from true state
UKFO(n2)O(n3) (matrix sqrt) + 2n+1 propagationsStill Gaussian; cannot handle multimodal posteriors
Particle (m particles)O(m×n)O(m×n) propagate + weight + resampleParticle deprivation; m must be large for high-dim state

The practical choice flowchart:

Discrete state space?
If |S| is small (≤ 103): use the discrete filter. Exact and fast.
↓ No (continuous state)
Linear Gaussian?
Use the Kalman filter. Exact, O(n3), minimal assumptions needed.
↓ No (nonlinear)
Mildly nonlinear, unimodal?
Try EKF (if Jacobians computable) or UKF (if not). Both are fast approximations.
↓ No (multimodal, highly nonlinear)
Particle filter
Use particle filter with injection. Choose m based on required accuracy vs. compute budget.

Concrete robotics example: an autonomous vehicle navigating indoors must track its (x, y, θ) pose. Here is how the choice plays out:

ScenarioBest filterWhy
Straight corridor, wheel odometry + lidarKalman filterMotion is nearly linear; Gaussian noise model adequate
Turning + bearing-only sensorEKFBearing observation is nonlinear (atan2); EKF linearizes at current mean
Large open space, global localization from scratchParticle filter (1000+ particles)Belief is multimodal: could be near any wall
Recovered from GPS dropout, approximate last-known positionUKF or particle filterModerately nonlinear dynamics; belief may be slightly multimodal
Robot arm joint angles (known structure)Kalman filterJoint angle dynamics are linear; small-dimensioned state

The industry standard for mobile robot localization is the Adaptive Monte Carlo Localization (AMCL) algorithm, which is a particle filter with particle injection and adaptive particle count (more particles when uncertain, fewer when confident). AMCL is implemented in ROS (Robot Operating System) and runs on hundreds of deployed robots worldwide. It uses the discrete map as an implicit state space and samples particles from the map grid rather than continuously.

Key implementation pitfall: weight collapse. In particle filters, all weights can become negligible except one ("weight collapse"). This happens when the observation is very informative and most particles are in low-likelihood regions. The Neff = 1 / ∑wi2 statistic measures effective sample size. When Neff < m/2, trigger resampling immediately (even if not on a regular schedule). When Neff approaches 1, inject particles from the prior to restore diversity.

Common implementation pitfalls. Even experienced practitioners make these mistakes with belief filters:

BugSymptomFix
Forget normalization in discrete filterBelief sums to <1; probabilities driftAlways divide by η = ∑ p(o|s')·b'(s') after measurement update
Kalman gain computed with wrong signFilter diverges exponentiallyK = ΣHT(HΣHTO)−1; gain must be positive definite
EKF Jacobian computed at wrong pointFilter inconsistent; covariance too smallLinearize f and h at current mean μt|t−1, not at the measurement
UKF sigma points with negative weightsPredicted covariance becomes non-PDAdd small β term; use κ≥0 or enforce PD via Cholesky regularization
Particle filter: no resamplingWeight collapse after ~50 stepsResample whenever Neff = 1/∑wi2 < m/2
Particle filter: too few particles (m<30)High variance; occasional complete failureUse m≥500 for 2D problems; m≥2000 for 3D robot pose
No particle injectionFilter loses true state after teleport or large jumpInject ρ·m particles from prior; set ρ≈0.05–0.10

Numerical comparison: EKF vs UKF on a sine observation model. Suppose the state is xt = xt−1 + wt (w∼N(0,0.1)) and the observation is zt = sin(xt) + vt (v∼N(0,0.05)). Starting from x0∼N(0,1), both filters are initialized at μ0=0, Σ0=1. After one step with z1=0.7:

QuantityEKFUKFComment
Predicted μ0.00.0Same (linear prediction)
Predicted Σ1.11.1Same (linear prediction)
Jacobian / sigma-predicted obsJ=cos(0)=1.0z&hat;=E[sin(x)]=sin(0)≈0EKF linearizes; UKF propagates
Innovation covariance S1.1×12+0.05=1.15Var[sin(x)]+0.05≈0.46+0.05=0.51UKF captures nonlinear variance
Kalman gain K1.1/1.15≈0.9571.1×0.79/0.51≈1.70 (clamped)Different because S differs
Updated μ0+0.957×(0.7−0)≈0.67Depends on cross-covarianceUKF gives more accurate posterior

The key difference: EKF approximates the observation variance as 1.15 (treating sin as linear), while UKF captures the true nonlinear spread of sin(x) when x∼N(0,1.1), giving a smaller S of 0.51. This means the UKF trusts the measurement more (the observation function is more informative than the EKF thinks), and converges faster. In strongly nonlinear problems, this difference compounds over time and EKF can diverge while UKF remains stable.

When to upgrade your filter. In practice, engineers often start with the simplest adequate filter and upgrade only when it fails. Here is a principled upgrade path:

  1. Start with Kalman. Check: does the normalized innovation squared (NIS) statistic χ2-test pass? NIS = (z−z&hat;)TS−1(z−z&hat;) should be χ2(|z|) distributed. If NIS is consistently too large, the filter is inconsistent.
  2. If NIS fails due to nonlinearity: upgrade to EKF. Check: is the Jacobian stable and computable at every time step?
  3. If EKF diverges or Jacobians are ill-conditioned: upgrade to UKF. Check: are sigma point weights all positive after the choice of κ and α?
  4. If belief is multimodal (kidnapped robot, cycle ambiguity): particle filter is required. No Gaussian filter can represent multimodal posteriors.

This upgrade ladder reflects a fundamental tradeoff: Gaussian filters are fast and analytically clean, but restricted to unimodal, approximately normal posteriors. Particle filters are universal approximators of probability distributions, but require Θ(m) compute per step where m must grow with the dimension and complexity of the belief. Choose the simplest filter that passes consistency tests on your data.

"All models are wrong, but some are useful."
— George Box

Every filter in this chapter is an approximation — they differ in which aspects of reality they approximate well. Choosing the right filter means understanding which simplifications your problem can tolerate.

← Ch 18: Imitation Learning Ch 20: Exact Belief Planning →