When you can't see the world directly, you maintain a belief — a probability distribution over what might be true. This chapter builds every major belief update method from first principles.
You're a robot navigating a dark hallway. You know your sensors are noisy. You know your motors don't always move exactly as commanded. What position are you in? You can't know for certain — but you can maintain a probability distribution over all the positions you might be in. That distribution is called a belief.
This is the central challenge of a Partially Observable Markov Decision Process (POMDP): the true state s is hidden. Instead of observing s, the agent receives an observation o drawn from O(o|a,s'), a probabilistic function of the reached state and the action taken. The agent must reason about s indirectly.
In a standard MDP: policy maps states to actions, π(s) → a. In a POMDP: policy maps beliefs to actions, π(b) → a. The belief plays the role of the state. Every chapter in this lesson is about computing b efficiently and accurately.
The five chapters of this lesson (Chapters 0–9) build up from the most exact to the most general belief representation. Chapters 0–2 establish the framework and the exact discrete filter. Chapters 3–5 explore parametric (Gaussian) filters that trade exactness for efficiency in continuous state spaces. Chapters 6–7 introduce the nonparametric particle filter and particle injection. Chapter 8 is the showcase: a fully interactive crying baby demonstration that ties all the update mechanics together. Chapter 9 summarizes and provides the decision hierarchy.
The proof that belief is a sufficient statistic uses the fact that POMDP dynamics are Markovian in the hidden state. Given the current state s and action a, the distribution over next state s' and observation o depends only on (s, a) — not on any earlier history. Therefore, the belief b = P(s|history) captures exactly the information content of the history that is relevant to future decisions. Any additional memory of the raw history beyond what is encoded in b would be redundant.
This is a fundamental result with profound practical implications: instead of maintaining an ever-growing history (which would require exponentially more memory over time), the agent only needs to maintain and update a fixed-size probability distribution over states. For a discrete state space |S|, the belief is always a vector of |S| numbers, regardless of how long the agent has been running.
Every possible belief over 3 states is a point inside this triangle. Corners = certainty about one state. Center = maximum uncertainty. Click anywhere inside to set a belief and see b(s1), b(s2), b(s3).
Before taking any action, we need a starting belief b0. Two broad representations are available:
| Representation | Form | Strengths | Limitations |
|---|---|---|---|
| Parametric | Categorical (discrete S), Gaussian N(μ,Σ) (continuous S) | Compact, closed-form updates | Cannot represent multimodal beliefs |
| Nonparametric | Set of weighted particles {(si, wi)} | Handles arbitrary distributions, multimodal beliefs | Discrete approximation; resolution limited by particle count |
Choosing the initial belief matters enormously. A diffuse initial belief (e.g., uniform over all states) is robust — it makes few assumptions. A confident wrong initial belief can be catastrophic: it may take a very long time to correct, even with informative observations.
Two special starting cases:
Known initial state: b0(s) = 1[s = s0]. The agent starts certain. Common in simulations.
Prior from problem knowledge: Use domain knowledge. "The robot starts near the charging dock, so b0 = N(μdock, small Σ)."
In practice, initial belief construction is one of the most important engineering decisions for a POMDP system. Here is a concrete comparison of three initialization strategies for a robot localization problem in a 100×100 meter warehouse (state = (x, y) position):
| Strategy | How | Initial cost | Time to localize |
|---|---|---|---|
| Uniform particles | m = 10,000 particles uniformly over 100m×100m | None (no prior knowledge needed) | Many steps until beliefs concentrate |
| GPS-seeded Gaussian | N(μGPS, Σ = 3m) — GPS is coarse but informative | Requires GPS fix (may be unavailable indoors) | Fast: GPS reduces uncertainty by ~90% |
| Observation-consistent | Sample s' ~ O(o0|s') from prior — use first lidar scan to filter | Moderate: requires first observation at startup | Very fast: only positions matching the initial scan are kept |
| Confident-wrong | N(μwrong, Σ = 0.5m) — small variance, wrong position | Dangerous: may take 50+ steps to recover | Very slow or never recovers without injection |
For finite state and observation spaces, belief updates are exact. The belief b is a probability vector of length |S|. After taking action a and receiving observation o, the new belief b' is computed in two steps:
where η is a normalizing constant: η = 1 / ∑s' O(o|a,s') b¨(s').
This is recursive Bayesian estimation on a discrete state space. The prediction step propagates the belief forward through the dynamics model. The update step incorporates the observation using Bayes' rule. The normalization ensures b' is a valid probability distribution.
The computational cost is O(|S|2) per update (the double sum over s and s'). For large state spaces, this becomes impractical, motivating the approximate methods in later chapters.
Here is a complete step-by-step worked example of one discrete filter update. The crying baby problem: states = {sated=0, hungry=1}, action = "ignore", observation = "crying". Starting belief b = [0.5, 0.5].
python # Crying baby discrete filter: one step, action=ignore, obs=crying # T[a][s][s'] = P(s'|s, a). Action: ignore=2 T_ignore = [[0.9, 0.1], # from sated: 90% stay sated, 10% go hungry [0.0, 1.0]] # from hungry: always stay hungry # O[a][s'][o]: observation model, o=0 means crying O_ignore = [0.1, 0.8] # P(crying|sated)=0.1, P(crying|hungry)=0.8 b = [0.5, 0.5] # prior belief: 50/50 # --- PREDICT STEP --- # b_pred[s'] = sum_s T(s'|s,a) * b[s] b_pred = [0, 0] b_pred[0] = T_ignore[0][0]*b[0] + T_ignore[1][0]*b[1] # 0.9*0.5 + 0.0*0.5 = 0.45 b_pred[1] = T_ignore[0][1]*b[0] + T_ignore[1][1]*b[1] # 0.1*0.5 + 1.0*0.5 = 0.55 print(f"Predicted: {b_pred}") # [0.45, 0.55] # --- UPDATE STEP --- (observation = crying) # b_new[s'] = eta * O(crying|s') * b_pred[s'] unnorm = [O_ignore[0] * b_pred[0], # 0.1 * 0.45 = 0.045 (sated, but still crying) O_ignore[1] * b_pred[1]] # 0.8 * 0.55 = 0.440 (hungry, expected crying) eta = 1.0 / sum(unnorm) # 1 / 0.485 = 2.062 b_new = [eta * u for u in unnorm] # [0.093, 0.907] print(f"Updated: {[f'{v:.3f}' for v in b_new]}") # Interpretation: P(hungry|ignore, crying) ≈ 90.7% # The crying observation strongly confirms hunger. # If we had observed "quiet" instead: # unnorm_q = [(1-0.1)*0.45, (1-0.8)*0.55] = [0.405, 0.110] → P(hungry) ≈ 21%
A 4-state hallway. Take Left/Right actions and observe (Bright/Dim). Watch how the belief vector shifts. The triangle below a bar shows the true (hidden) state. Higher bar = more probability mass.
For continuous states with linear Gaussian dynamics, the belief is always a Gaussian N(μ,Σ). This is not an approximation — it's exact. The dynamics and observation models are:
The belief update has the famous two-step structure:
The matrix K is the Kalman gain. It is the weight given to the innovation (o − Osμp), the difference between the actual observation and the predicted observation. Think of K as a trust slider:
In steady state, the variance Σ converges to a fixed-point determined entirely by the noise parameters — independent of the initial belief. This is called the steady-state Kalman filter. For a 1D random walk with process variance σs2 and observation variance σo2, the steady-state gain K* and variance Σ* satisfy:
With σs2 = 0.25 and σo2 = 1.0: Σ* = 0.125 + √(0.0156 + 0.25) ≈ 0.125 + 0.5063 = 0.631. K* = 0.631/(0.631+1) ≈ 0.387. After reaching steady state, every update sets K ≈ 0.387 — a fixed fraction of the innovation is incorporated regardless of history. This is why steady-state Kalman filters are so efficient: precompute K* offline, then run at O(n) per step online.
Here is a 5-step worked example tracing the 1D Kalman filter, starting from uncertainty σ²=4 and true position 0. Process noise σs²=0.25, observation noise σo²=1.0:
| Step | True pos | Observation | σ² before | K | μ after | σ² after |
|---|---|---|---|---|---|---|
| 1 | 0.3 | 1.2 | 4.25 | 4.25/5.25≈0.81 | 0+0.81×1.2≈0.97 | (1−0.81)×4.25≈0.81 |
| 2 | 0.5 | 0.1 | 1.06 | 1.06/2.06≈0.51 | 0.97+0.51×(0.1−0.97)≈0.53 | 0.49×1.06≈0.52 |
| 3 | 0.8 | 0.9 | 0.77 | 0.77/1.77≈0.44 | 0.53+0.44×(0.9−0.53)≈0.69 | 0.56×0.77≈0.43 |
| 4 | 0.6 | 0.5 | 0.68 | 0.68/1.68≈0.40 | 0.69+0.40×(0.5−0.69)≈0.61 | 0.60×0.68≈0.41 |
| 5 | 0.9 | 1.1 | 0.66 | 0.66/1.66≈0.40 | 0.61+0.40×(1.1−0.61)≈0.80 | 0.60×0.66≈0.40 |
Notice that σ² starts large (4.25) and rapidly shrinks to a steady state around 0.40 — consistent with the analytic steady-state formula (Σ* ≈ 0.631 for this noise setting). The gain K also stabilizes around 0.40. The filter's mean tracks the true position, staying within ±2σ = ±2×0.63 of each observation. The estimate at step 5 (0.80) is close to the true position 0.9, within one sigma.
The three key properties of the Kalman filter that make it the gold standard for linear systems:
Here is the complete 1D Kalman filter in Python. The belief is just two numbers — mean μ and variance σ2. Everything else is algebra:
python import numpy as np class KalmanFilter1D: """1D Kalman filter: belief = (mu, sigma2).""" def __init__(self, mu0=0.0, sigma2_0=1.0): self.mu = mu0 # belief mean self.sigma2 = sigma2_0 # belief variance def predict(self, sigma2_process): # Propagate belief through random walk dynamics # mu_p = mu (no deterministic drift in this example) # sigma2_p = sigma2 + sigma2_process self.sigma2 += sigma2_process # self.mu unchanged (identity transition) def update(self, obs, sigma2_obs): # Incorporate a noisy observation K = self.sigma2 / (self.sigma2 + sigma2_obs) # Kalman gain self.mu = self.mu + K * (obs - self.mu) # belief mean update self.sigma2 = (1 - K) * self.sigma2 # belief variance shrinks return K # return K so we can inspect it # Worked example matching the textbook's robot localization scenario kf = KalmanFilter1D(mu0=0.0, sigma2_0=4.0) # uncertain about initial position true_pos = 0.0 for step in range(10): true_pos += np.random.randn() * 0.5 # true random walk obs = true_pos + np.random.randn() * 1.0 # noisy observation kf.predict(sigma2_process=0.25) # process noise variance K = kf.update(obs, sigma2_obs=1.0) # observation noise variance print(f"Step {step+1}: true={true_pos:.2f}, obs={obs:.2f}, " f"mu={kf.mu:.2f}, sigma={kf.sigma2**.5:.2f}, K={K:.3f}")
Track a 1D position. Purple line = true state (random walk). Orange dots = noisy observations. Teal band = belief (μ ± 2σ). Adjust noise levels to see how the gain K adapts.
What if the dynamics or observation functions are nonlinear? The standard Kalman filter fails because a Gaussian pushed through a nonlinear function is no longer Gaussian. The Extended Kalman Filter (EKF) handles this by linearizing the nonlinear functions at the current belief mean using first-order Taylor expansions (Jacobians).
where JT = ∂fT/∂s |μb and JO = ∂fO/∂s' |μp are the Jacobian matrices evaluated at the current mean. The Kalman update equations then apply with JT and JO in place of Ts and Os.
| Step | Standard KF | EKF |
|---|---|---|
| Predict mean | μp = Tsμb + Taa | μp = fT(μb, a) |
| Predict covariance | Σp = TsΣbTs⊤ + Σs | Σp = JTΣbJT⊤ + Σs |
| Kalman gain K | ΣpOs⊤(…)−1 | ΣpJO⊤(…)−1 |
| Update mean | μp + K(o − Osμp) | μp + K(o − fO(μp)) |
Here is a concrete worked EKF example: 2D robot tracking with a bearing-only sensor. State s = (x, y). Observation: the angle from the origin to the robot: fO(x,y) = atan2(y,x). This is highly nonlinear.
python import numpy as np # Robot state: s = [x, y] = [3.0, 1.0] (true position) # Belief mean: mu = [3.0, 1.0] sigma2 = [[0.5, 0], [0, 0.5]] mu = np.array([3.0, 1.0]) Sigma = np.diag([0.5, 0.5]) # Nonlinear observation model: angle theta = atan2(y, x) # EKF: linearize via Jacobian J_O = d/d(x,y) atan2(y,x) # d(atan2(y,x))/dx = -y/(x^2+y^2) # d(atan2(y,x))/dy = x/(x^2+y^2) x, y = mu r2 = x**2 + y**2 # = 3^2+1^2 = 10 J_O = np.array([[-y/r2, x/r2]]) # = [[-0.1, 0.3]] (1x2 Jacobian) print(f"Jacobian J_O = {J_O}") # [[-0.1, 0.3]] # Observation noise: sigma_o = 0.1 radians Sigma_o = np.array([[0.01]]) # (0.1 rad)^2 # Kalman gain: K = Sigma * J_O^T * (J_O * Sigma * J_O^T + Sigma_o)^{-1} S_innov = J_O @ Sigma @ J_O.T + Sigma_o # = [[-0.1,0.3]] [[0.5,0],[0,0.5]] [[-0.1],[0.3]] + 0.01 # = [[-0.1,0.3]] [[-0.05],[0.15]] + 0.01 = [0.05] + 0.01 = [0.06] K = Sigma @ J_O.T @ np.linalg.inv(S_innov) # 2x1 gain vector print(f"K = {K.flatten()}") # approximately [-0.833, 2.5] # Observe: true bearing = atan2(1,3) ≈ 0.3217 rad # Noisy observation: theta_obs = 0.35 rad theta_pred = np.arctan2(mu[1], mu[0]) # ≈ 0.3217 innovation = np.array([0.35]) - theta_pred # ≈ 0.028 rad mu_new = mu + K.flatten() * innovation[0] # adjust position toward observation Sigma_new = (np.eye(2) - K @ J_O) @ Sigma # covariance shrinks print(f"Updated mu: {mu_new}") # mu changes: x decreases slightly (bearing too large implies y was underestimated) # Key: the EKF exploits the local geometry (Jacobian) to map bearing residual # into (x,y) correction. Works well near the true position; fails far from it.
Samples from a 1D Gaussian (left) pushed through a linear function f(x)=2x (center) and a nonlinear function f(x)=sin(x)+x (right). The linear result stays Gaussian; the nonlinear result becomes skewed. The EKF approximates the nonlinear case with a local linearization (dashed).
The EKF linearizes using a derivative. But what if computing Jacobians is hard (or the function isn't differentiable)? The Unscented Kalman Filter (UKF) takes a different approach: instead of linearizing the function, it passes a carefully chosen set of sigma points through the nonlinear function and fits a Gaussian to the output.
For a state of dimension n, the UKF computes 2n+1 sigma points from the current belief N(μ,Σ):
where (√(n+λ)Σ)i denotes the i-th column of the matrix square root, and λ is a spread parameter. Each sigma point is propagated through the nonlinear function, and the output mean and covariance are reconstructed as weighted sums of the transformed points.
The weights are not all equal. The mean sigma point (s0) gets a weight designed to absorb the remaining probability mass, while the off-center points get equal weights that together balance the mean:
where α controls the spread, β = 2 for Gaussian priors (optimal), and λ = α2(n+κ) − n. After propagating each sigma point through the nonlinear function f:
A concrete 1D example: n=1, λ=2. Belief: μ=0, σ2=1. Sigma points: s0=0 (weight 2/3), s1=+√3≈1.73 (weight 1/6), s2=−√3≈−1.73 (weight 1/6). Propagate through f(x) = sin(x): f(0)=0, f(1.73)≈0.987, f(−1.73)≈−0.987. Reconstructed mean: (2/3)×0 + (1/6)×0.987 + (1/6)×(−0.987) = 0. Reconstructed variance: (1/6)(0.987−0)2 + (1/6)(0.987−0)2 = 0.325 + 0.325 ≈ 0.65. The UKF correctly estimates that sin(x) for x~N(0,1) has variance less than 1 (since |sin| ≤ 1 compresses the tails).
A 2D Gaussian with covariance [[1,0.5],[0.5,2]]. Yellow = mean. Orange = sigma points. Adjust λ to control spread. With n=2, we always have 2×2+1=5 sigma points.
What if none of the Gaussian assumptions hold? What if the posterior is multimodal (two plausible locations), or the dynamics are highly nonlinear, or the state space is mixed discrete-continuous? Particle filters represent the belief as a set of weighted samples (particles) that can approximate any distribution.
The update algorithm for m particles {(si, wi)}i=1m:
The key insight: resampling discards low-probability particles and duplicates high-probability ones. After many steps, particles concentrate near the true state. The belief estimate is approximated by the empirical distribution of particles.
There are three standard resampling algorithms. They all draw m new particles proportional to weights but differ in implementation:
| Algorithm | Complexity | Properties |
|---|---|---|
| Multinomial | O(m log m) | Simple: draw m uniform [0,1) random numbers, binary search in CDF. High variance. |
| Systematic | O(m) | Generate one uniform random number u ∈ [0,1/m), then use equally spaced offsets u, u+1/m, u+2/m, …. Lower variance than multinomial. |
| Stratified | O(m) | Generate m uniform numbers, one per stratum [i/m, (i+1)/m). Similar to systematic but slightly higher variance. |
Systematic resampling (also called "low-variance resampling") is strongly preferred in practice. With multinomial resampling, a particle with weight 3/m might be duplicated 0, 1, 2, 3, or more times due to random chance. With systematic resampling, it is always duplicated exactly 3 times. This determinism reduces the randomness in the particle set and improves accuracy for the same m.
python import numpy as np def systematic_resample(particles, weights): """Low-variance systematic resampling: O(m).""" m = len(particles) cumulative = np.cumsum(weights) u = np.random.uniform(0, 1.0/m) # one random number positions = u + np.arange(m) / m # m equally-spaced positions indices = np.searchsorted(cumulative, positions) # O(m) binary searches return particles[indices] def multinomial_resample(particles, weights): """Standard multinomial resampling: O(m log m), higher variance.""" m = len(particles) indices = np.random.choice(m, size=m, replace=True, p=weights) return particles[indices] # For comparison with m=5 particles, weights=[0.4, 0.3, 0.1, 0.1, 0.1]: # Multinomial: may give [0,1,3,1,0] or [2,1,1,1,0] — high variance # Systematic: always gives [2,1,1,1,0] or [2,2,1,0,0] — deterministic spread
Here is a miniature worked example with m = 5 particles (normally you'd use hundreds). 1D position tracking. True position = 0.6. Particles: {0.2, 0.4, 0.5, 0.7, 0.9}. Noisy observation: z = 0.65. Observation noise σo = 0.1.
| Particle si | Propagated si' (add process noise) | wi = N(0.65; si', 0.01) | Normalized wi |
|---|---|---|---|
| 0.2 | 0.21 | exp(−(0.65−0.21)²/0.02) ≈ 0.00056 | 0.002 |
| 0.4 | 0.42 | exp(−(0.65−0.42)²/0.02) ≈ 0.073 | 0.228 |
| 0.5 | 0.51 | exp(−(0.65−0.51)²/0.02) ≈ 0.375 | 1.170 → 0.370 |
| 0.7 | 0.72 | exp(−(0.65−0.72)²/0.02) ≈ 0.782 | 2.440 → 0.244 |
| 0.9 | 0.89 | exp(−(0.65−0.89)²/0.02) ≈ 0.057 | 0.179 → 0.057 |
After normalization, w = [0.002, 0.073, 0.370, 0.244, 0.057] (sum to 1). Weighted mean: 0.002×0.21 + 0.073×0.42 + 0.370×0.51 + 0.244×0.72 + 0.057×0.89 ≈ 0.60 — very close to the true position 0.6! After resampling with these weights, particles near 0.51 and 0.72 will be duplicated, while the particle at 0.21 will almost certainly be dropped.
The particle filter algorithm is remarkably concise to implement. Each "particle" is just a state sample with a weight:
python import numpy as np def particle_filter_update(particles, weights, action, obs, transition_fn, obs_likelihood_fn, m): """ particles: array of shape (m, state_dim) — current particle set weights: array of shape (m,) — current particle weights action: the action taken obs: the observation received transition_fn(s, a): samples s' ~ T(s'|s,a) obs_likelihood_fn(o, s'): returns O(o|s') as a scalar Returns: new particles and uniform weights after resampling """ # Step 1: Propagate — sample s_i' ~ T(s'|s_i, a) new_particles = np.array([transition_fn(s, action) for s in particles]) # Step 2: Weight — w_i = O(obs | s_i') new_weights = np.array([obs_likelihood_fn(obs, s) for s in new_particles]) # Step 3: Normalize total = new_weights.sum() if total < 1e-12: new_weights[:] = 1.0 / m # degenerate case: reset to uniform else: new_weights /= total # Step 4: Resample — draw m particles with replacement indices = np.random.choice(m, size=m, replace=True, p=new_weights) resampled = new_particles[indices] uniform_weights = np.full(m, 1.0 / m) return resampled, uniform_weights # Example: 2D position tracking with random walk dynamics m = 500 # number of particles particles = np.random.uniform(0, 1, (m, 2)) # uniform initial belief over [0,1]^2 weights = np.full(m, 1.0 / m) def transition(s, a): return np.clip(s + np.random.randn(2) * 0.03, 0, 1) def likelihood(obs, s): return np.exp(-np.sum((s - obs)**2) / (2 * 0.05**2)) # After each action + observation, call: # particles, weights = particle_filter_update(particles, weights, a, obs, # transition, likelihood, m) # Belief mean estimate: # mu = particles.mean(axis=0)
Particle filters have a critical failure mode called particle deprivation: all particles can converge to a single region of state space, leaving no coverage of other regions. If the true state then jumps to a previously-uncovered region (due to a sudden large transition or a rare event), the filter has no particles there and cannot recover.
Particle injection addresses this by periodically adding a small fraction of randomly sampled particles (from the prior or from the observation model) to the particle set. This ensures global coverage is maintained even as most particles concentrate around the likely true state.
The mechanics of particle deprivation can be understood quantitatively. After k steps without resampling, the effective sample size Neff = 1/(∑wi2) decreases as particles drift. After resampling, Neff resets to m (all uniform weights). However, with each predict–weight–resample cycle, some particles are duplicated and others are dropped. After many cycles in a smooth, well-observed system, the particle set collapses to a few distinct values — particle deprivation even without any sudden state jumps. Injection prevents this collapse by continuously replenishing diversity.
Two types of injected particles:
Prior particles: Sample from the initial belief distribution b0(s). Provides broad global coverage regardless of the observation.
Observation-consistent particles: Sample states s' consistent with the current observation: s' ~ P(s'|o) ∝ O(o|a,s') b0(s'). More efficient but requires knowing the observation model in a form we can invert.
The injection rate ρ (fraction of particles replaced each step) is a crucial hyperparameter. Too low: the filter cannot recover from kidnapping. Too high: the filter loses precision even in normal operation. The optimal rate depends on the expected frequency of sudden state jumps:
| Scenario | Recommended ρ | Rationale |
|---|---|---|
| Stable system, no sudden jumps | ρ = 0% (no injection needed) | Particles track well; injection adds unnecessary noise |
| Rare sudden jumps (<1% per step) | ρ = 2–5% | Enough to recover from kidnapping; doesn't hurt normal tracking |
| Frequent sudden jumps (>5% per step) | ρ = 10–20% | Must maintain global coverage; precision will be reduced |
| Unknown jump rate | ρ = 5% | Conservative default; works reasonably well in most cases |
An alternative to fixed-rate injection: adaptive injection. After each resampling step, compute the normalized probability of the current observation under the existing particles: p(o|particles) = ∑i wi O(o|si). If this probability is very low (the observation is poorly explained by current particles), inject at a higher rate. This adapts injection to the evidence — injecting more when the filter is likely wrong.
Track a 1D target that occasionally teleports (kidnapping). Orange = particles, purple triangle = true position. Watch how particle injection (teal) prevents collapse after a kidnapping event vs. no injection (warm).
The crying baby problem is the canonical 2-state POMDP from the textbook. Two hidden states: {sated, hungry}. Three actions: {feed, sing, ignore}. Two observations: {crying, quiet}. The agent must maintain a belief over the baby's state and decide how to act.
| Dynamics | P(hungry next step) |
|---|---|
| Feed (any state) | 0% — feeding always sates |
| Sated + sing/ignore | 10% — baby gradually gets hungry |
| Hungry + sing/ignore | 100% — stays hungry without feeding |
| Observation model | P(crying | state, action) |
|---|---|
| Hungry + sing | 90% |
| Hungry + feed/ignore | 80% |
| Sated + sing | 0% (singing fully calms a sated baby) |
| Sated + feed/ignore | 10% |
The belief is a single number b = P(hungry). Watch how each action-observation pair updates this belief. Note that feeding deterministically resets belief to P(hungry) = 0 (we know feeding always sates).
You are the caretaker. Choose actions. The belief bar shows P(hungry) (warm) vs P(sated) (teal). The baby face reacts to the observation. Reward accumulates at bottom. Can you keep the baby happy while tracking the true state?
Chapter 19 establishes the belief update machinery that underpins all POMDP planning. Every filter produces the same object (a belief distribution over states) but makes different assumptions about the structure of the problem.
All five filters covered in this chapter share a common structure: a prediction step that propagates the belief forward through the dynamics, and an update step that incorporates the observation. The filters differ in how they represent the belief and how they implement each step. Understanding this shared structure is key to adapting filters to new problems.
Common structure (all filters):
What varies:
| Filter | State Space | Dynamics | Belief Form | Exact? |
|---|---|---|---|---|
| Discrete | Finite S | Any discrete T, O | Categorical vector | Yes |
| Kalman | Continuous | Linear Gaussian | Gaussian N(μ,Σ) | Yes |
| Extended Kalman | Continuous | Nonlinear Gaussian (mild) | Gaussian (approx.) | No |
| Unscented Kalman | Continuous | Nonlinear Gaussian | Gaussian (better approx.) | No |
| Particle | Any | Any (sample from T) | Weighted particles | Asymptotically |
Here is a side-by-side comparison of computational costs for each filter, to guide your implementation choice:
| Filter | Memory | Per-update cost | Primary failure mode |
|---|---|---|---|
| Discrete | O(|S|) | O(|S|2) | |S| exponentially large for many state dimensions |
| Kalman | O(n2) | O(n3) for matrix inversion | Non-Gaussian noise, nonlinear dynamics |
| EKF | O(n2) | O(n3) + Jacobian computation | Severe nonlinearity, diverges far from true state |
| UKF | O(n2) | O(n3) (matrix sqrt) + 2n+1 propagations | Still Gaussian; cannot handle multimodal posteriors |
| Particle (m particles) | O(m×n) | O(m×n) propagate + weight + resample | Particle deprivation; m must be large for high-dim state |
The practical choice flowchart:
Concrete robotics example: an autonomous vehicle navigating indoors must track its (x, y, θ) pose. Here is how the choice plays out:
| Scenario | Best filter | Why |
|---|---|---|
| Straight corridor, wheel odometry + lidar | Kalman filter | Motion is nearly linear; Gaussian noise model adequate |
| Turning + bearing-only sensor | EKF | Bearing observation is nonlinear (atan2); EKF linearizes at current mean |
| Large open space, global localization from scratch | Particle filter (1000+ particles) | Belief is multimodal: could be near any wall |
| Recovered from GPS dropout, approximate last-known position | UKF or particle filter | Moderately nonlinear dynamics; belief may be slightly multimodal |
| Robot arm joint angles (known structure) | Kalman filter | Joint angle dynamics are linear; small-dimensioned state |
The industry standard for mobile robot localization is the Adaptive Monte Carlo Localization (AMCL) algorithm, which is a particle filter with particle injection and adaptive particle count (more particles when uncertain, fewer when confident). AMCL is implemented in ROS (Robot Operating System) and runs on hundreds of deployed robots worldwide. It uses the discrete map as an implicit state space and samples particles from the map grid rather than continuously.
Common implementation pitfalls. Even experienced practitioners make these mistakes with belief filters:
| Bug | Symptom | Fix |
|---|---|---|
| Forget normalization in discrete filter | Belief sums to <1; probabilities drift | Always divide by η = ∑ p(o|s')·b'(s') after measurement update |
| Kalman gain computed with wrong sign | Filter diverges exponentially | K = ΣHT(HΣHT+ΣO)−1; gain must be positive definite |
| EKF Jacobian computed at wrong point | Filter inconsistent; covariance too small | Linearize f and h at current mean μt|t−1, not at the measurement |
| UKF sigma points with negative weights | Predicted covariance becomes non-PD | Add small β term; use κ≥0 or enforce PD via Cholesky regularization |
| Particle filter: no resampling | Weight collapse after ~50 steps | Resample whenever Neff = 1/∑wi2 < m/2 |
| Particle filter: too few particles (m<30) | High variance; occasional complete failure | Use m≥500 for 2D problems; m≥2000 for 3D robot pose |
| No particle injection | Filter loses true state after teleport or large jump | Inject ρ·m particles from prior; set ρ≈0.05–0.10 |
Numerical comparison: EKF vs UKF on a sine observation model. Suppose the state is xt = xt−1 + wt (w∼N(0,0.1)) and the observation is zt = sin(xt) + vt (v∼N(0,0.05)). Starting from x0∼N(0,1), both filters are initialized at μ0=0, Σ0=1. After one step with z1=0.7:
| Quantity | EKF | UKF | Comment |
|---|---|---|---|
| Predicted μ | 0.0 | 0.0 | Same (linear prediction) |
| Predicted Σ | 1.1 | 1.1 | Same (linear prediction) |
| Jacobian / sigma-predicted obs | J=cos(0)=1.0 | z&hat;=E[sin(x)]=sin(0)≈0 | EKF linearizes; UKF propagates |
| Innovation covariance S | 1.1×12+0.05=1.15 | Var[sin(x)]+0.05≈0.46+0.05=0.51 | UKF captures nonlinear variance |
| Kalman gain K | 1.1/1.15≈0.957 | 1.1×0.79/0.51≈1.70 (clamped) | Different because S differs |
| Updated μ | 0+0.957×(0.7−0)≈0.67 | Depends on cross-covariance | UKF gives more accurate posterior |
The key difference: EKF approximates the observation variance as 1.15 (treating sin as linear), while UKF captures the true nonlinear spread of sin(x) when x∼N(0,1.1), giving a smaller S of 0.51. This means the UKF trusts the measurement more (the observation function is more informative than the EKF thinks), and converges faster. In strongly nonlinear problems, this difference compounds over time and EKF can diverge while UKF remains stable.
When to upgrade your filter. In practice, engineers often start with the simplest adequate filter and upgrade only when it fails. Here is a principled upgrade path:
This upgrade ladder reflects a fundamental tradeoff: Gaussian filters are fast and analytically clean, but restricted to unimodal, approximately normal posteriors. Particle filters are universal approximators of probability distributions, but require Θ(m) compute per step where m must grow with the dimension and complexity of the belief. Choose the simplest filter that passes consistency tests on your data.
Every filter in this chapter is an approximation — they differ in which aspects of reality they approximate well. Choosing the right filter means understanding which simplifications your problem can tolerate.