Beliefs — Kochenderfer et al., Chapter 19

Chapter 0: When You Can't See the State

You're a robot navigating a dark hallway. You know your sensors are noisy. You know your motors don't always move exactly as commanded. What position are you in? You can't know for certain — but you can maintain a probability distribution over all the positions you might be in. That distribution is called a belief.

This is the central challenge of a Partially Observable Markov Decision Process (POMDP): the true state s is hidden. Instead of observing s, the agent receives an observation o drawn from O(o|a,s'), a probabilistic function of the reached state and the action taken. The agent must reason about s indirectly.

Belief as sufficient statistic: A key theorem: the belief b(s) = P(s | a₀, o₁, a₁, o₂, …) is a sufficient statistic for the entire action-observation history. This means: to make optimal decisions, you only need to know the current belief. You don't need to remember every past observation — the belief compresses all that information.

In a standard MDP: policy maps states to actions, π(s) → a. In a POMDP: policy maps beliefs to actions, π(b) → a. The belief plays the role of the state. Every chapter in this lesson is about computing b efficiently and accurately.

The five chapters of this lesson (Chapters 0–9) build up from the most exact to the most general belief representation. Chapters 0–2 establish the framework and the exact discrete filter. Chapters 3–5 explore parametric (Gaussian) filters that trade exactness for efficiency in continuous state spaces. Chapters 6–7 introduce the nonparametric particle filter and particle injection. Chapter 8 is the showcase: a fully interactive crying baby demonstration that ties all the update mechanics together. Chapter 9 summarizes and provides the decision hierarchy.

The proof that belief is a sufficient statistic uses the fact that POMDP dynamics are Markovian in the hidden state. Given the current state s and action a, the distribution over next state s' and observation o depends only on (s, a) — not on any earlier history. Therefore, the belief b = P(s|history) captures exactly the information content of the history that is relevant to future decisions. Any additional memory of the raw history beyond what is encoded in b would be redundant.

This is a fundamental result with profound practical implications: instead of maintaining an ever-growing history (which would require exponentially more memory over time), the agent only needs to maintain and update a fixed-size probability distribution over states. For a discrete state space |S|, the belief is always a vector of |S| numbers, regardless of how long the agent has been running.

The Belief Simplex for 3 States

Every possible belief over 3 states is a point inside this triangle. Corners = certainty about one state. Center = maximum uncertainty. Click anywhere inside to set a belief and see b(s₁), b(s₂), b(s₃).

b = [0.333, 0.333, 0.333]

Why is the belief a sufficient statistic for the history of actions and observations?

Because Bayes' rule compresses all past information into the current posterior — the future depends on the past only through the current state distribution Because observations are always perfect Because the agent forgets old observations automatically

Chapter 1: Belief Initialization

Before taking any action, we need a starting belief b₀. Two broad representations are available:

Representation	Form	Strengths	Limitations
Parametric	Categorical (discrete S), Gaussian N(μ,Σ) (continuous S)	Compact, closed-form updates	Cannot represent multimodal beliefs
Nonparametric	Set of weighted particles {(s_i, w_i)}	Handles arbitrary distributions, multimodal beliefs	Discrete approximation; resolution limited by particle count

Choosing the initial belief matters enormously. A diffuse initial belief (e.g., uniform over all states) is robust — it makes few assumptions. A confident wrong initial belief can be catastrophic: it may take a very long time to correct, even with informative observations.

The autonomous car example from the text: A car must localize itself using lidar against a known map. At startup, its position is unknown. A good initialization: place particles uniformly across the entire map (or in regions consistent with the initial lidar scan). The initial belief is spread across many candidate positions and will concentrate around the true position as more lidar observations arrive. Committing to a wrong position at startup could cause the car to drive into a wall before the belief corrects.

Two special starting cases:

Known initial state: b₀(s) = 1[s = s₀]. The agent starts certain. Common in simulations.

Prior from problem knowledge: Use domain knowledge. "The robot starts near the charging dock, so b₀ = N(μ_dock, small Σ)."

In practice, initial belief construction is one of the most important engineering decisions for a POMDP system. Here is a concrete comparison of three initialization strategies for a robot localization problem in a 100×100 meter warehouse (state = (x, y) position):

Strategy	How	Initial cost	Time to localize
Uniform particles	m = 10,000 particles uniformly over 100m×100m	None (no prior knowledge needed)	Many steps until beliefs concentrate
GPS-seeded Gaussian	N(μ_GPS, Σ = 3m) — GPS is coarse but informative	Requires GPS fix (may be unavailable indoors)	Fast: GPS reduces uncertainty by ~90%
Observation-consistent	Sample s' ~ O(o₀\|s') from prior — use first lidar scan to filter	Moderate: requires first observation at startup	Very fast: only positions matching the initial scan are kept
Confident-wrong	N(μ_wrong, Σ = 0.5m) — small variance, wrong position	Dangerous: may take 50+ steps to recover	Very slow or never recovers without injection

Particle count rule of thumb: For a d-dimensional continuous state space, you typically need m ~ 10^d particles to maintain adequate coverage. For d=1: m=10–100 is fine. For d=2 (robot position): m=500–5,000. For d=6 (full 3D pose + velocity): m=10,000–100,000. This is why particle filters are practical for low-dimensional problems but struggle for very high-dimensional states.

Why is a diffuse (spread-out) initial belief generally safer than a confident initial belief when the true starting state is unknown?

A confidently wrong initial belief may require many observations to correct and can lead to bad decisions in the interim Diffuse beliefs are computationally cheaper to update Concentrated initial beliefs cause value iteration to diverge

Chapter 2: Discrete State Filter

For finite state and observation spaces, belief updates are exact. The belief b is a probability vector of length |S|. After taking action a and receiving observation o, the new belief b' is computed in two steps:

Predict: b¨(s') = ∑_s T(s'|s,a) · b(s)

Update: b'(s') = η · O(o|a,s') · b¨(s')

where η is a normalizing constant: η = 1 / ∑_s' O(o|a,s') b¨(s').

This is recursive Bayesian estimation on a discrete state space. The prediction step propagates the belief forward through the dynamics model. The update step incorporates the observation using Bayes' rule. The normalization ensures b' is a valid probability distribution.

Intuition: Think of the predict step as "smearing" your belief through uncertainty in the dynamics. If your action moves you north but sometimes slips, the prediction spreads probability to adjacent states. The update step then "sharpens" the belief based on what you actually observed, concentrating probability where the observation is most likely.

The computational cost is O(|S|²) per update (the double sum over s and s'). For large state spaces, this becomes impractical, motivating the approximate methods in later chapters.

Here is a complete step-by-step worked example of one discrete filter update. The crying baby problem: states = {sated=0, hungry=1}, action = "ignore", observation = "crying". Starting belief b = [0.5, 0.5].

python
# Crying baby discrete filter: one step, action=ignore, obs=crying
# T[a][s][s'] = P(s'|s, a). Action: ignore=2
T_ignore = [[0.9, 0.1],  # from sated: 90% stay sated, 10% go hungry
             [0.0, 1.0]]  # from hungry: always stay hungry
# O[a][s'][o]: observation model, o=0 means crying
O_ignore = [0.1, 0.8]    # P(crying|sated)=0.1, P(crying|hungry)=0.8

b = [0.5, 0.5]              # prior belief: 50/50

# --- PREDICT STEP ---
# b_pred[s'] = sum_s T(s'|s,a) * b[s]
b_pred = [0, 0]
b_pred[0] = T_ignore[0][0]*b[0] + T_ignore[1][0]*b[1]   # 0.9*0.5 + 0.0*0.5 = 0.45
b_pred[1] = T_ignore[0][1]*b[0] + T_ignore[1][1]*b[1]   # 0.1*0.5 + 1.0*0.5 = 0.55
print(f"Predicted: {b_pred}")   # [0.45, 0.55]

# --- UPDATE STEP --- (observation = crying)
# b_new[s'] = eta * O(crying|s') * b_pred[s']
unnorm = [O_ignore[0] * b_pred[0],   # 0.1 * 0.45 = 0.045  (sated, but still crying)
          O_ignore[1] * b_pred[1]]   # 0.8 * 0.55 = 0.440  (hungry, expected crying)
eta = 1.0 / sum(unnorm)              # 1 / 0.485 = 2.062
b_new = [eta * u for u in unnorm]   # [0.093, 0.907]
print(f"Updated:   {[f'{v:.3f}' for v in b_new]}")

# Interpretation: P(hungry|ignore, crying) ≈ 90.7%
# The crying observation strongly confirms hunger.
# If we had observed "quiet" instead:
# unnorm_q = [(1-0.1)*0.45, (1-0.8)*0.55] = [0.405, 0.110] → P(hungry) ≈ 21%

The geometry of belief updates: The predict step "diffuses" probability mass: sated probability drops from 0.5 to 0.45 (some of it flows to hungry), hungry probability rises from 0.5 to 0.55. The update step "focuses" probability mass based on evidence: crying is 8x more likely from hungry than sated, so P(hungry) jumps from 0.55 to 0.907 after normalization. Repeated observations of crying will push belief toward P(hungry) = 1.0 asymptotically.

Discrete Belief Update — 4-State POMDP

A 4-state hallway. Take Left/Right actions and observe (Bright/Dim). Watch how the belief vector shifts. The triangle below a bar shows the true (hidden) state. Higher bar = more probability mass.

Step: 0 | True: s1

What does the prediction step in the discrete filter compute?

The belief propagated through the transition model, before incorporating the new observation The reward for the current action The optimal action at the current belief

Chapter 3: Kalman Filter

For continuous states with linear Gaussian dynamics, the belief is always a Gaussian N(μ,Σ). This is not an approximation — it's exact. The dynamics and observation models are:

T(s'|s,a): s' = T_ss + T_aa + ε_s, ε_s ~ N(0, Σ_s)

O(o|a,s'): o = O_ss' + ε_o, ε_o ~ N(0, Σ_o)

The belief update has the famous two-step structure:

Predict (Propagate)

μ_p = T_sμ_b + T_aa Σ_p = T_sΣ_bT_s^⊤ + Σ_s

↓

Update (Correct)

K = Σ_pO_s^⊤(O_sΣ_pO_s^⊤+Σ_o)⁻¹ μ_b = μ_p+K(o−O_sμ_p) Σ_b=(I−KO_s)Σ_p

The matrix K is the Kalman gain. It is the weight given to the innovation (o − O_sμ_p), the difference between the actual observation and the predicted observation. Think of K as a trust slider:

When Σ_o is small (observations are precise): K → large. Trust the observation more.
When Σ_o is large (observations are noisy): K → small. Trust the prediction more.
When Σ_p is large (lots of prediction uncertainty): K → large. Eager to correct.

In steady state, the variance Σ converges to a fixed-point determined entirely by the noise parameters — independent of the initial belief. This is called the steady-state Kalman filter. For a 1D random walk with process variance σ_s² and observation variance σ_o², the steady-state gain K* and variance Σ* satisfy:

Σ* = ( σ_s²/2 ) + √( (σ_s²/2)² + σ_s²σ_o² )

K* = Σ* / (Σ* + σ_o²)

With σ_s² = 0.25 and σ_o² = 1.0: Σ* = 0.125 + √(0.0156 + 0.25) ≈ 0.125 + 0.5063 = 0.631. K* = 0.631/(0.631+1) ≈ 0.387. After reaching steady state, every update sets K ≈ 0.387 — a fixed fraction of the innovation is incorporated regardless of history. This is why steady-state Kalman filters are so efficient: precompute K* offline, then run at O(n) per step online.

Here is a 5-step worked example tracing the 1D Kalman filter, starting from uncertainty σ²=4 and true position 0. Process noise σ_s²=0.25, observation noise σ_o²=1.0:

Step	True pos	Observation	σ² before	K	μ after	σ² after
1	0.3	1.2	4.25	4.25/5.25≈0.81	0+0.81×1.2≈0.97	(1−0.81)×4.25≈0.81
2	0.5	0.1	1.06	1.06/2.06≈0.51	0.97+0.51×(0.1−0.97)≈0.53	0.49×1.06≈0.52
3	0.8	0.9	0.77	0.77/1.77≈0.44	0.53+0.44×(0.9−0.53)≈0.69	0.56×0.77≈0.43
4	0.6	0.5	0.68	0.68/1.68≈0.40	0.69+0.40×(0.5−0.69)≈0.61	0.60×0.68≈0.41
5	0.9	1.1	0.66	0.66/1.66≈0.40	0.61+0.40×(1.1−0.61)≈0.80	0.60×0.66≈0.40

Notice that σ² starts large (4.25) and rapidly shrinks to a steady state around 0.40 — consistent with the analytic steady-state formula (Σ* ≈ 0.631 for this noise setting). The gain K also stabilizes around 0.40. The filter's mean tracks the true position, staying within ±2σ = ±2×0.63 of each observation. The estimate at step 5 (0.80) is close to the true position 0.9, within one sigma.

Why Gaussian in, Gaussian out? The product of two Gaussian densities is proportional to a Gaussian. The linear transform of a Gaussian is a Gaussian. So in the linear Gaussian case, every update preserves the Gaussian form. The Kalman filter leverages this closure property to maintain an exact, minimal-parameter belief representation at all times.

The three key properties of the Kalman filter that make it the gold standard for linear systems:

Optimal (minimum variance): Among all linear estimators, the Kalman filter minimizes the mean squared error. No other linear filter can do better given the same data.
Recursive: The belief at time t only depends on the previous belief — not the full history. This enables real-time streaming at O(n³) per step.
Consistent: The estimated covariance accurately reflects the true estimation uncertainty (assuming correct noise models). A well-tuned Kalman filter's 95% confidence intervals actually contain the truth ~95% of the time.

Tuning tip: In practice, Σ_s (process noise) and Σ_o (observation noise) are often estimated from data or tuned empirically. Underestimating Σ_s makes the filter "stiff" — slow to respond to real state changes. Overestimating Σ_s makes the filter "jumpy" — it trusts observations too much and amplifies noise. The ratio Σ_s/Σ_o controls the steady-state gain K*; increasing this ratio increases K*, making the filter more responsive.

Here is the complete 1D Kalman filter in Python. The belief is just two numbers — mean μ and variance σ². Everything else is algebra:

python
import numpy as np

class KalmanFilter1D:
    """1D Kalman filter: belief = (mu, sigma2)."""
    def __init__(self, mu0=0.0, sigma2_0=1.0):
        self.mu = mu0             # belief mean
        self.sigma2 = sigma2_0    # belief variance

    def predict(self, sigma2_process):
        # Propagate belief through random walk dynamics
        # mu_p = mu  (no deterministic drift in this example)
        # sigma2_p = sigma2 + sigma2_process
        self.sigma2 += sigma2_process
        # self.mu unchanged (identity transition)

    def update(self, obs, sigma2_obs):
        # Incorporate a noisy observation
        K = self.sigma2 / (self.sigma2 + sigma2_obs)   # Kalman gain
        self.mu = self.mu + K * (obs - self.mu)         # belief mean update
        self.sigma2 = (1 - K) * self.sigma2              # belief variance shrinks
        return K  # return K so we can inspect it

# Worked example matching the textbook's robot localization scenario
kf = KalmanFilter1D(mu0=0.0, sigma2_0=4.0)  # uncertain about initial position
true_pos = 0.0
for step in range(10):
    true_pos += np.random.randn() * 0.5       # true random walk
    obs = true_pos + np.random.randn() * 1.0  # noisy observation
    kf.predict(sigma2_process=0.25)           # process noise variance
    K = kf.update(obs, sigma2_obs=1.0)        # observation noise variance
    print(f"Step {step+1}: true={true_pos:.2f}, obs={obs:.2f}, "
          f"mu={kf.mu:.2f}, sigma={kf.sigma2**.5:.2f}, K={K:.3f}")

1D Kalman Filter

Track a 1D position. Purple line = true state (random walk). Orange dots = noisy observations. Teal band = belief (μ ± 2σ). Adjust noise levels to see how the gain K adapts.

Process noise σ_s 0.5

Observation noise σ_o 1.0

Step: 0 | K = —

When observation noise Σ_o is very large, what happens to the Kalman gain K and why?

K becomes small, so the filter trusts the predicted mean more and barely moves toward the noisy observation K becomes large, amplifying the noisy observation K stays constant regardless of noise levels

Chapter 4: Extended Kalman Filter

What if the dynamics or observation functions are nonlinear? The standard Kalman filter fails because a Gaussian pushed through a nonlinear function is no longer Gaussian. The Extended Kalman Filter (EKF) handles this by linearizing the nonlinear functions at the current belief mean using first-order Taylor expansions (Jacobians).

T(s'|s,a): s' ≈ f_T(s,a) + J_T(s−μ_b) + ε_s

O(o|a,s'): o ≈ f_O(s') + J_O(s'−μ_p) + ε_o

where J_T = ∂f_T/∂s |_{μ_b} and J_O = ∂f_O/∂s' |_{μ_p} are the Jacobian matrices evaluated at the current mean. The Kalman update equations then apply with J_T and J_O in place of T_s and O_s.

Step	Standard KF	EKF
Predict mean	μ_p = T_sμ_b + T_aa	μ_p = f_T(μ_b, a)
Predict covariance	Σ_p = T_sΣ_bT_s^⊤ + Σ_s	Σ_p = J_TΣ_bJ_T^⊤ + Σ_s
Kalman gain K	Σ_pO_s^⊤(…)⁻¹	Σ_pJ_O^⊤(…)⁻¹
Update mean	μ_p + K(o − O_sμ_p)	μ_p + K(o − f_O(μ_p))

Limitation — when EKF breaks: The linear approximation is only valid near the current mean. For highly nonlinear functions (e.g., the wrap-around angle in a bearing-only sensor), the linearization can be wildly wrong. Worse, the EKF cannot represent multimodal posteriors: if the observation is consistent with two very different positions, the EKF squashes this into a single Gaussian instead of two separate peaks.

Here is a concrete worked EKF example: 2D robot tracking with a bearing-only sensor. State s = (x, y). Observation: the angle from the origin to the robot: f_O(x,y) = atan2(y,x). This is highly nonlinear.

python
import numpy as np

# Robot state: s = [x, y] = [3.0, 1.0]   (true position)
# Belief mean: mu = [3.0, 1.0]  sigma2 = [[0.5, 0], [0, 0.5]]
mu = np.array([3.0, 1.0])
Sigma = np.diag([0.5, 0.5])

# Nonlinear observation model: angle theta = atan2(y, x)
# EKF: linearize via Jacobian J_O = d/d(x,y) atan2(y,x)
# d(atan2(y,x))/dx = -y/(x^2+y^2)
# d(atan2(y,x))/dy =  x/(x^2+y^2)

x, y = mu
r2 = x**2 + y**2          # = 3^2+1^2 = 10
J_O = np.array([[-y/r2, x/r2]])   # = [[-0.1, 0.3]]  (1x2 Jacobian)
print(f"Jacobian J_O = {J_O}")   # [[-0.1, 0.3]]

# Observation noise: sigma_o = 0.1 radians
Sigma_o = np.array([[0.01]])    # (0.1 rad)^2

# Kalman gain: K = Sigma * J_O^T * (J_O * Sigma * J_O^T + Sigma_o)^{-1}
S_innov = J_O @ Sigma @ J_O.T + Sigma_o  # = [[-0.1,0.3]] [[0.5,0],[0,0.5]] [[-0.1],[0.3]] + 0.01
                                          # = [[-0.1,0.3]] [[-0.05],[0.15]] + 0.01 = [0.05] + 0.01 = [0.06]
K = Sigma @ J_O.T @ np.linalg.inv(S_innov)  # 2x1 gain vector
print(f"K = {K.flatten()}")    # approximately [-0.833, 2.5]

# Observe: true bearing = atan2(1,3) ≈ 0.3217 rad
# Noisy observation: theta_obs = 0.35 rad
theta_pred = np.arctan2(mu[1], mu[0])   # ≈ 0.3217
innovation = np.array([0.35]) - theta_pred  # ≈ 0.028 rad
mu_new = mu + K.flatten() * innovation[0]  # adjust position toward observation
Sigma_new = (np.eye(2) - K @ J_O) @ Sigma  # covariance shrinks
print(f"Updated mu: {mu_new}")
# mu changes: x decreases slightly (bearing too large implies y was underestimated)
# Key: the EKF exploits the local geometry (Jacobian) to map bearing residual
# into (x,y) correction. Works well near the true position; fails far from it.

Linear vs. Nonlinear Gaussian Push-Through

Samples from a 1D Gaussian (left) pushed through a linear function f(x)=2x (center) and a nonlinear function f(x)=sin(x)+x (right). The linear result stays Gaussian; the nonlinear result becomes skewed. The EKF approximates the nonlinear case with a local linearization (dashed).

What does the EKF use to handle nonlinear dynamics?

First-order Taylor expansion (Jacobian matrix) evaluated at the current belief mean A set of carefully placed sample points Discretization of the continuous state space

Chapter 5: Unscented Kalman Filter

The EKF linearizes using a derivative. But what if computing Jacobians is hard (or the function isn't differentiable)? The Unscented Kalman Filter (UKF) takes a different approach: instead of linearizing the function, it passes a carefully chosen set of sigma points through the nonlinear function and fits a Gaussian to the output.

For a state of dimension n, the UKF computes 2n+1 sigma points from the current belief N(μ,Σ):

s₀ = μ (mean)

s_i = μ + (√(n+λ)Σ)_i for i = 1,…,n (positive perturbations)

s_n+i = μ − (√(n+λ)Σ)_i for i = 1,…,n (negative perturbations)

where (√(n+λ)Σ)_i denotes the i-th column of the matrix square root, and λ is a spread parameter. Each sigma point is propagated through the nonlinear function, and the output mean and covariance are reconstructed as weighted sums of the transformed points.

The weights are not all equal. The mean sigma point (s₀) gets a weight designed to absorb the remaining probability mass, while the off-center points get equal weights that together balance the mean:

W₀^m = λ/(n+λ) W₀^c = λ/(n+λ) + (1−α²+β) (mean sigma point)

W_i^m = W_i^c = 1/(2(n+λ)) for i = 1,…,2n (off-center points)

where α controls the spread, β = 2 for Gaussian priors (optimal), and λ = α²(n+κ) − n. After propagating each sigma point through the nonlinear function f:

μ' ≈ ∑_i=0²ⁿ W_i^m f(s_i)

Σ' ≈ ∑_i=0²ⁿ W_i^c (f(s_i) − μ')(f(s_i) − μ')^⊤ + Σ_s

A concrete 1D example: n=1, λ=2. Belief: μ=0, σ²=1. Sigma points: s₀=0 (weight 2/3), s₁=+√3≈1.73 (weight 1/6), s₂=−√3≈−1.73 (weight 1/6). Propagate through f(x) = sin(x): f(0)=0, f(1.73)≈0.987, f(−1.73)≈−0.987. Reconstructed mean: (2/3)×0 + (1/6)×0.987 + (1/6)×(−0.987) = 0. Reconstructed variance: (1/6)(0.987−0)² + (1/6)(0.987−0)² = 0.325 + 0.325 ≈ 0.65. The UKF correctly estimates that sin(x) for x~N(0,1) has variance less than 1 (since |sin| ≤ 1 compresses the tails).

Why 2n+1 points? The sigma points are chosen to match the mean and covariance of the input distribution exactly. The spread parameter λ controls how far the off-center points go. A Gaussian is fully characterized by its first two moments; these 2n+1 points capture those moments exactly before passing them through the nonlinearity — no derivative required.

UKF vs. EKF: The UKF captures the nonlinear transformation more accurately than the EKF for "moderately" nonlinear functions, without requiring Jacobians. It is often preferred when Jacobians are difficult to compute analytically. However, both still maintain a single Gaussian belief and fail when the true posterior is multimodal.

Sigma Points Visualization

A 2D Gaussian with covariance [[1,0.5],[0.5,2]]. Yellow = mean. Orange = sigma points. Adjust λ to control spread. With n=2, we always have 2×2+1=5 sigma points.

λ (spread parameter) 2.0

For a state space of dimension n, how many sigma points does the UKF use?

2n + 1: the mean plus n positive perturbations and n negative perturbations along each eigendirection n points (one per dimension) n² points to capture all cross-correlations

Chapter 6: Particle Filter

What if none of the Gaussian assumptions hold? What if the posterior is multimodal (two plausible locations), or the dynamics are highly nonlinear, or the state space is mixed discrete-continuous? Particle filters represent the belief as a set of weighted samples (particles) that can approximate any distribution.

The update algorithm for m particles {(s_i, w_i)}_i=1^m:

1. Propagate

For each particle s_i, sample s_i' ~ T(s'|s_i,a). The particles are moved through the dynamics stochastically.

↓

2. Weight

Compute w_i = O(o|a,s_i'). Particles in states consistent with the observation get high weight; others get low weight.

↓

3. Normalize

Divide all weights by their sum: w_i ← w_i / ∑_j w_j.

↓

4. Resample

Draw m new particles with replacement, proportional to weights. This concentrates particles in high-probability regions.

↻

The key insight: resampling discards low-probability particles and duplicates high-probability ones. After many steps, particles concentrate near the true state. The belief estimate is approximated by the empirical distribution of particles.

There are three standard resampling algorithms. They all draw m new particles proportional to weights but differ in implementation:

Algorithm	Complexity	Properties
Multinomial	O(m log m)	Simple: draw m uniform [0,1) random numbers, binary search in CDF. High variance.
Systematic	O(m)	Generate one uniform random number u ∈ [0,1/m), then use equally spaced offsets u, u+1/m, u+2/m, …. Lower variance than multinomial.
Stratified	O(m)	Generate m uniform numbers, one per stratum [i/m, (i+1)/m). Similar to systematic but slightly higher variance.

Systematic resampling (also called "low-variance resampling") is strongly preferred in practice. With multinomial resampling, a particle with weight 3/m might be duplicated 0, 1, 2, 3, or more times due to random chance. With systematic resampling, it is always duplicated exactly 3 times. This determinism reduces the randomness in the particle set and improves accuracy for the same m.

python
import numpy as np

def systematic_resample(particles, weights):
    """Low-variance systematic resampling: O(m)."""
    m = len(particles)
    cumulative = np.cumsum(weights)
    u = np.random.uniform(0, 1.0/m)           # one random number
    positions = u + np.arange(m) / m            # m equally-spaced positions
    indices = np.searchsorted(cumulative, positions)  # O(m) binary searches
    return particles[indices]

def multinomial_resample(particles, weights):
    """Standard multinomial resampling: O(m log m), higher variance."""
    m = len(particles)
    indices = np.random.choice(m, size=m, replace=True, p=weights)
    return particles[indices]

# For comparison with m=5 particles, weights=[0.4, 0.3, 0.1, 0.1, 0.1]:
# Multinomial: may give [0,1,3,1,0] or [2,1,1,1,0] — high variance
# Systematic: always gives [2,1,1,1,0] or [2,2,1,0,0] — deterministic spread

Advantages of particle filters: No parametric assumption. Can represent multimodal posteriors. Can handle discontinuities, constraints, and arbitrary dynamics. Scales to high-dimensional spaces reasonably well (compared to discrete filters). The downside: requires m particles to be reasonable; may need thousands of particles for precision.

Here is a miniature worked example with m = 5 particles (normally you'd use hundreds). 1D position tracking. True position = 0.6. Particles: {0.2, 0.4, 0.5, 0.7, 0.9}. Noisy observation: z = 0.65. Observation noise σ_o = 0.1.

Particle s_i	Propagated s_i' (add process noise)	w_i = N(0.65; s_i', 0.01)	Normalized w_i
0.2	0.21	exp(−(0.65−0.21)²/0.02) ≈ 0.00056	0.002
0.4	0.42	exp(−(0.65−0.42)²/0.02) ≈ 0.073	0.228
0.5	0.51	exp(−(0.65−0.51)²/0.02) ≈ 0.375	1.170 → 0.370
0.7	0.72	exp(−(0.65−0.72)²/0.02) ≈ 0.782	2.440 → 0.244
0.9	0.89	exp(−(0.65−0.89)²/0.02) ≈ 0.057	0.179 → 0.057

After normalization, w = [0.002, 0.073, 0.370, 0.244, 0.057] (sum to 1). Weighted mean: 0.002×0.21 + 0.073×0.42 + 0.370×0.51 + 0.244×0.72 + 0.057×0.89 ≈ 0.60 — very close to the true position 0.6! After resampling with these weights, particles near 0.51 and 0.72 will be duplicated, while the particle at 0.21 will almost certainly be dropped.

The particle filter algorithm is remarkably concise to implement. Each "particle" is just a state sample with a weight:

python
import numpy as np

def particle_filter_update(particles, weights, action, obs,
                              transition_fn, obs_likelihood_fn, m):
    """
    particles: array of shape (m, state_dim) — current particle set
    weights:   array of shape (m,) — current particle weights
    action:    the action taken
    obs:       the observation received
    transition_fn(s, a): samples s' ~ T(s'|s,a)
    obs_likelihood_fn(o, s'): returns O(o|s') as a scalar
    Returns: new particles and uniform weights after resampling
    """
    # Step 1: Propagate — sample s_i' ~ T(s'|s_i, a)
    new_particles = np.array([transition_fn(s, action) for s in particles])

    # Step 2: Weight — w_i = O(obs | s_i')
    new_weights = np.array([obs_likelihood_fn(obs, s) for s in new_particles])

    # Step 3: Normalize
    total = new_weights.sum()
    if total < 1e-12:
        new_weights[:] = 1.0 / m  # degenerate case: reset to uniform
    else:
        new_weights /= total

    # Step 4: Resample — draw m particles with replacement
    indices = np.random.choice(m, size=m, replace=True, p=new_weights)
    resampled = new_particles[indices]
    uniform_weights = np.full(m, 1.0 / m)

    return resampled, uniform_weights

# Example: 2D position tracking with random walk dynamics
m = 500  # number of particles
particles = np.random.uniform(0, 1, (m, 2))   # uniform initial belief over [0,1]^2
weights = np.full(m, 1.0 / m)

def transition(s, a): return np.clip(s + np.random.randn(2) * 0.03, 0, 1)
def likelihood(obs, s): return np.exp(-np.sum((s - obs)**2) / (2 * 0.05**2))

# After each action + observation, call:
# particles, weights = particle_filter_update(particles, weights, a, obs,
#                                              transition, likelihood, m)
# Belief mean estimate:
# mu = particles.mean(axis=0)

What is the purpose of the resampling step in a particle filter?

To concentrate computational effort in high-probability regions by eliminating low-weight particles and duplicating high-weight ones To add noise to prevent particle collapse To convert the nonparametric belief to a Gaussian approximation

Chapter 7: Particle Injection

Particle filters have a critical failure mode called particle deprivation: all particles can converge to a single region of state space, leaving no coverage of other regions. If the true state then jumps to a previously-uncovered region (due to a sudden large transition or a rare event), the filter has no particles there and cannot recover.

Particle injection addresses this by periodically adding a small fraction of randomly sampled particles (from the prior or from the observation model) to the particle set. This ensures global coverage is maintained even as most particles concentrate around the likely true state.

The crying baby kidnapping problem: The text introduces a variant of the crying baby problem where the baby can be "kidnapped" — suddenly transported to a random state with small probability. A standard particle filter loses all diversity after a few steps and cannot recover from a kidnapping. Particle injection adds a few uniformly distributed particles each step, maintaining the ability to detect sudden state changes.

The mechanics of particle deprivation can be understood quantitatively. After k steps without resampling, the effective sample size N_eff = 1/(∑w_i²) decreases as particles drift. After resampling, N_eff resets to m (all uniform weights). However, with each predict–weight–resample cycle, some particles are duplicated and others are dropped. After many cycles in a smooth, well-observed system, the particle set collapses to a few distinct values — particle deprivation even without any sudden state jumps. Injection prevents this collapse by continuously replenishing diversity.

Two types of injected particles:

Prior particles: Sample from the initial belief distribution b₀(s). Provides broad global coverage regardless of the observation.

Observation-consistent particles: Sample states s' consistent with the current observation: s' ~ P(s'|o) ∝ O(o|a,s') b₀(s'). More efficient but requires knowing the observation model in a form we can invert.

The injection rate ρ (fraction of particles replaced each step) is a crucial hyperparameter. Too low: the filter cannot recover from kidnapping. Too high: the filter loses precision even in normal operation. The optimal rate depends on the expected frequency of sudden state jumps:

Scenario	Recommended ρ	Rationale
Stable system, no sudden jumps	ρ = 0% (no injection needed)	Particles track well; injection adds unnecessary noise
Rare sudden jumps (<1% per step)	ρ = 2–5%	Enough to recover from kidnapping; doesn't hurt normal tracking
Frequent sudden jumps (>5% per step)	ρ = 10–20%	Must maintain global coverage; precision will be reduced
Unknown jump rate	ρ = 5%	Conservative default; works reasonably well in most cases

An alternative to fixed-rate injection: adaptive injection. After each resampling step, compute the normalized probability of the current observation under the existing particles: p(o|particles) = ∑_i w_i O(o|s_i). If this probability is very low (the observation is poorly explained by current particles), inject at a higher rate. This adapts injection to the evidence — injecting more when the filter is likely wrong.

Particle Filter: With and Without Injection

Track a 1D target that occasionally teleports (kidnapping). Orange = particles, purple triangle = true position. Watch how particle injection (teal) prevents collapse after a kidnapping event vs. no injection (warm).

Injection fraction 5%

Step: 0

What is "particle deprivation" and why is it a problem?

Too many particles are used, slowing down computation Particles are too spread out to estimate the state accurately All particles converge to one region, leaving no coverage elsewhere; if the true state moves away from that region, the filter cannot recover

Chapter 8: Showcase — The Crying Baby Filter

The crying baby problem is the canonical 2-state POMDP from the textbook. Two hidden states: {sated, hungry}. Three actions: {feed, sing, ignore}. Two observations: {crying, quiet}. The agent must maintain a belief over the baby's state and decide how to act.

Dynamics	P(hungry next step)
Feed (any state)	0% — feeding always sates
Sated + sing/ignore	10% — baby gradually gets hungry
Hungry + sing/ignore	100% — stays hungry without feeding

Observation model	P(crying \| state, action)
Hungry + sing	90%
Hungry + feed/ignore	80%
Sated + sing	0% (singing fully calms a sated baby)
Sated + feed/ignore	10%

The belief is a single number b = P(hungry). Watch how each action-observation pair updates this belief. Note that feeding deterministically resets belief to P(hungry) = 0 (we know feeding always sates).

Key update from the text: Starting with b = 0.5 (maximum uncertainty), if we ignore the baby and it cries, the posterior is: P(hungry|crying,ignore) ∝ 0.8 × 0.5 + 0.1 × 0.5 = [0.4, 0.05], normalized to ≈ [0.89, 0.11]. So P(hungry) ≈ 0.89. A crying baby is very likely hungry.

Crying Baby Discrete Belief Update

You are the caretaker. Choose actions. The belief bar shows P(hungry) (warm) vs P(sated) (teal). The baby face reacts to the observation. Reward accumulates at bottom. Can you keep the baby happy while tracking the true state?

Reward: 0 | Steps: 0

From the text — worked example: At b = [0.5, 0.5] (sated=0.5, hungry=0.5), ignore the baby:
Prediction: b' = [0.5×0.9 + 0.5×0, 0.5×0.1 + 0.5×1] = [0.45, 0.55].
Then observation = crying: weights = [O(cry|sated,ignore)×0.45, O(cry|hungry,ignore)×0.55] = [0.1×0.45, 0.8×0.55] = [0.045, 0.44]. Normalize: [0.045/0.485, 0.44/0.485] ≈ [0.093, 0.907].

Chapter 9: Summary & Filter Comparison

Chapter 19 establishes the belief update machinery that underpins all POMDP planning. Every filter produces the same object (a belief distribution over states) but makes different assumptions about the structure of the problem.

All five filters covered in this chapter share a common structure: a prediction step that propagates the belief forward through the dynamics, and an update step that incorporates the observation. The filters differ in how they represent the belief and how they implement each step. Understanding this shared structure is key to adapting filters to new problems.

Common structure (all filters):

Predict: b → b¨ using T(s'|s,a)
Update: b¨ + o → b' using O(o|a,s')
Normalize: ensure b' sums/integrates to 1

What varies:

Belief representation: vector vs. Gaussian vs. particles
Prediction method: matrix multiply vs. Riccati vs. particle propagation
Update method: elementwise multiply vs. Kalman gain vs. importance weighting

Filter	State Space	Dynamics	Belief Form	Exact?
Discrete	Finite S	Any discrete T, O	Categorical vector	Yes
Kalman	Continuous	Linear Gaussian	Gaussian N(μ,Σ)	Yes
Extended Kalman	Continuous	Nonlinear Gaussian (mild)	Gaussian (approx.)	No
Unscented Kalman	Continuous	Nonlinear Gaussian	Gaussian (better approx.)	No
Particle	Any	Any (sample from T)	Weighted particles	Asymptotically

The decision hierarchy: Start with the most exact filter your problem permits.

Discrete state space, small? → Discrete filter.
Continuous, linear dynamics, Gaussian noise? → Kalman filter.
Continuous, mildly nonlinear? → EKF (or UKF if Jacobians are hard).
Strongly nonlinear, multimodal, or arbitrary? → Particle filter.

Looking ahead: Chapter 20 uses the belief update methods from this chapter as the foundation for solving POMDPs exactly. The belief is treated as the "state" of a continuous-state MDP, and dynamic programming is applied over it. The key structural result: the value function over belief space is piecewise-linear and convex.

Here is a side-by-side comparison of computational costs for each filter, to guide your implementation choice:

Filter	Memory	Per-update cost	Primary failure mode
Discrete	O(\|S\|)	O(\|S\|²)	\|S\| exponentially large for many state dimensions
Kalman	O(n²)	O(n³) for matrix inversion	Non-Gaussian noise, nonlinear dynamics
EKF	O(n²)	O(n³) + Jacobian computation	Severe nonlinearity, diverges far from true state
UKF	O(n²)	O(n³) (matrix sqrt) + 2n+1 propagations	Still Gaussian; cannot handle multimodal posteriors
Particle (m particles)	O(m×n)	O(m×n) propagate + weight + resample	Particle deprivation; m must be large for high-dim state

The practical choice flowchart:

Discrete state space?

If |S| is small (≤ 10³): use the discrete filter. Exact and fast.

↓ No (continuous state)

Linear Gaussian?

Use the Kalman filter. Exact, O(n³), minimal assumptions needed.

↓ No (nonlinear)

Mildly nonlinear, unimodal?

Try EKF (if Jacobians computable) or UKF (if not). Both are fast approximations.

↓ No (multimodal, highly nonlinear)

Particle filter

Use particle filter with injection. Choose m based on required accuracy vs. compute budget.

Concrete robotics example: an autonomous vehicle navigating indoors must track its (x, y, θ) pose. Here is how the choice plays out:

Scenario	Best filter	Why
Straight corridor, wheel odometry + lidar	Kalman filter	Motion is nearly linear; Gaussian noise model adequate
Turning + bearing-only sensor	EKF	Bearing observation is nonlinear (atan2); EKF linearizes at current mean
Large open space, global localization from scratch	Particle filter (1000+ particles)	Belief is multimodal: could be near any wall
Recovered from GPS dropout, approximate last-known position	UKF or particle filter	Moderately nonlinear dynamics; belief may be slightly multimodal
Robot arm joint angles (known structure)	Kalman filter	Joint angle dynamics are linear; small-dimensioned state

The industry standard for mobile robot localization is the Adaptive Monte Carlo Localization (AMCL) algorithm, which is a particle filter with particle injection and adaptive particle count (more particles when uncertain, fewer when confident). AMCL is implemented in ROS (Robot Operating System) and runs on hundreds of deployed robots worldwide. It uses the discrete map as an implicit state space and samples particles from the map grid rather than continuously.

Key implementation pitfall: weight collapse. In particle filters, all weights can become negligible except one ("weight collapse"). This happens when the observation is very informative and most particles are in low-likelihood regions. The N_eff = 1 / ∑w_i² statistic measures effective sample size. When N_eff < m/2, trigger resampling immediately (even if not on a regular schedule). When N_eff approaches 1, inject particles from the prior to restore diversity.

Common implementation pitfalls. Even experienced practitioners make these mistakes with belief filters:

Bug	Symptom	Fix
Forget normalization in discrete filter	Belief sums to <1; probabilities drift	Always divide by `η = ∑ p(o\|s')·b'(s')` after measurement update
Kalman gain computed with wrong sign	Filter diverges exponentially	K = ΣH^T(HΣH^T+Σ_O)⁻¹; gain must be positive definite
EKF Jacobian computed at wrong point	Filter inconsistent; covariance too small	Linearize f and h at current mean μ_t\|t−1, not at the measurement
UKF sigma points with negative weights	Predicted covariance becomes non-PD	Add small β term; use κ≥0 or enforce PD via Cholesky regularization
Particle filter: no resampling	Weight collapse after ~50 steps	Resample whenever N_eff = 1/∑w_i² < m/2
Particle filter: too few particles (m<30)	High variance; occasional complete failure	Use m≥500 for 2D problems; m≥2000 for 3D robot pose
No particle injection	Filter loses true state after teleport or large jump	Inject ρ·m particles from prior; set ρ≈0.05–0.10

Numerical comparison: EKF vs UKF on a sine observation model. Suppose the state is x_t = x_t−1 + w_t (w∼N(0,0.1)) and the observation is z_t = sin(x_t) + v_t (v∼N(0,0.05)). Starting from x₀∼N(0,1), both filters are initialized at μ₀=0, Σ₀=1. After one step with z₁=0.7:

Quantity	EKF	UKF	Comment
Predicted μ	0.0	0.0	Same (linear prediction)
Predicted Σ	1.1	1.1	Same (linear prediction)
Jacobian / sigma-predicted obs	J=cos(0)=1.0	z&hat;=E[sin(x)]=sin(0)≈0	EKF linearizes; UKF propagates
Innovation covariance S	1.1×1²+0.05=1.15	Var[sin(x)]+0.05≈0.46+0.05=0.51	UKF captures nonlinear variance
Kalman gain K	1.1/1.15≈0.957	1.1×0.79/0.51≈1.70 (clamped)	Different because S differs
Updated μ	0+0.957×(0.7−0)≈0.67	Depends on cross-covariance	UKF gives more accurate posterior

The key difference: EKF approximates the observation variance as 1.15 (treating sin as linear), while UKF captures the true nonlinear spread of sin(x) when x∼N(0,1.1), giving a smaller S of 0.51. This means the UKF trusts the measurement more (the observation function is more informative than the EKF thinks), and converges faster. In strongly nonlinear problems, this difference compounds over time and EKF can diverge while UKF remains stable.

When to upgrade your filter. In practice, engineers often start with the simplest adequate filter and upgrade only when it fails. Here is a principled upgrade path:

Start with Kalman. Check: does the normalized innovation squared (NIS) statistic χ²-test pass? NIS = (z−z&hat;)^TS⁻¹(z−z&hat;) should be χ²(|z|) distributed. If NIS is consistently too large, the filter is inconsistent.
If NIS fails due to nonlinearity: upgrade to EKF. Check: is the Jacobian stable and computable at every time step?
If EKF diverges or Jacobians are ill-conditioned: upgrade to UKF. Check: are sigma point weights all positive after the choice of κ and α?
If belief is multimodal (kidnapped robot, cycle ambiguity): particle filter is required. No Gaussian filter can represent multimodal posteriors.

This upgrade ladder reflects a fundamental tradeoff: Gaussian filters are fast and analytically clean, but restricted to unimodal, approximately normal posteriors. Particle filters are universal approximators of probability distributions, but require Θ(m) compute per step where m must grow with the dimension and complexity of the belief. Choose the simplest filter that passes consistency tests on your data.

"All models are wrong, but some are useful."

— George Box

Every filter in this chapter is an approximation — they differ in which aspects of reality they approximate well. Choosing the right filter means understanding which simplifications your problem can tolerate.

← Ch 18: Imitation Learning Ch 20: Exact Belief Planning →

Reasoning Under Hidden States