The algorithm that lets a robot figure out where it is by combining movement and sensing — the parent of every probabilistic filter you'll ever meet.
You're a robot in a hallway. You can't see yourself from above. You have no GPS. All you have is a noisy sensor that can detect whether you're next to a door, and wheels that sometimes slip. Where are you?
You don't know — not for certain. But you can believe. You start with a rough guess spread across every possible position. Then you sense something: "I see a door." Some positions have doors, some don't, so your belief shifts. Then you move forward and sense again. With each sense-move cycle, your belief sharpens until you're nearly certain where you are.
The robot starts with no idea where it is (uniform belief). Click Sense or Move to watch the belief evolve.
Before we build a filter, we need one formula. But let's not start with the formula — let's start with counting.
Imagine 100 people. 20 are programmers. Of those 20 programmers, 15 drink coffee. Of the 80 non-programmers, 40 drink coffee. You meet someone drinking coffee. What's the probability they're a programmer?
Total coffee drinkers: 15 + 40 = 55. Programmers among them: 15. So: P(programmer | coffee) = 15/55 ≈ 0.27. That's Bayes' theorem in action — you updated your belief about someone being a programmer after observing they drink coffee.
Adjust the overlap to see how P(A|B) changes. A = hypothesis, B = evidence.
You have a prior P(A) and a likelihood P(B|A). You apply Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B). But what is P(B)? In our robot scenario, A ranges over N hypotheses (cells).
Your task: Show that P(B) = Σi P(B|Ai) P(Ai), and explain why this guarantees Σi P(Ai|B) = 1.
Full derivation:
Start with Bayes for hypothesis i: P(Ai|B) = P(B|Ai) P(Ai) / P(B).
Sum both sides over all i: Σi P(Ai|B) = Σi P(B|Ai) P(Ai) / P(B) = [Σi P(B|Ai) P(Ai)] / P(B).
By the law of total probability, Σi P(B|Ai) P(Ai) = P(B). So the fraction = P(B)/P(B) = 1.
Therefore the posterior is guaranteed to be a valid probability distribution.
The key insight: Normalization is not a hack or numerical convenience — it's the law of total probability in disguise. The denominator P(B) is exactly what makes the posterior sum to 1. In code, bel /= bel.sum() computes 1/P(B) implicitly.
The robot's world is a 1D hallway divided into discrete cells. At any moment, the robot's belief is a probability distribution — a number for each cell saying "how likely am I to be here?" All the numbers sum to 1.
When the robot knows nothing, the belief is uniform: equal probability everywhere. As it gathers evidence, some cells become more probable and others shrink. The belief is the robot's internal mental model of where it might be.
[20] array summing to 1.0. At initialization (uniform), every cell = 1/20 = 0.05. After the robot senses a door and doors are at cells 2, 5, 7, 12, 15, 18, those cells spike to ~0.128 while non-door cells drop to ~0.013. The numbers change, but the sum stays 1.0. That's the entire data structure — one float per cell.python import numpy as np # 20-cell hallway, uniform belief belief = np.ones(20) / 20 # belief = [0.05, 0.05, 0.05, ..., 0.05] (sums to 1.0) # After sensing a door (doors at cells 2,5,7,12,15,18): doors = [2, 5, 7, 12, 15, 18] hit_rate = 0.9 likelihood = np.where(np.isin(np.arange(20), doors), hit_rate, 1 - hit_rate) belief = belief * likelihood belief /= belief.sum() # renormalize # Door cells: ~0.128, non-door cells: ~0.013
Click cells to manually assign probability. The histogram always renormalizes to sum to 1.
When the robot's wheels turn, it tries to move forward. But wheels slip. The robot might overshoot, undershoot, or stay put. The motion model P(x'|x,u) describes: "if I'm at cell x and I command motion u, what's the probability I end up at cell x'?"
Mathematically, applying the motion model to a belief is a convolution: each cell "spreads" its probability to neighboring cells according to the motion kernel. The result is that the belief shifts in the direction of motion and blurs due to uncertainty. Information is lost. Entropy increases.
python import numpy as np # Motion kernel: [undershoot, exact, overshoot] kernel = np.array([0.1, 0.8, 0.1]) # One predict step = convolve belief with kernel belief_pred = np.convolve(belief, kernel, mode='same') belief_pred /= belief_pred.sum() # fix boundary effects # Example: belief = [0, 0, 0.9, 0.1, 0, ...] # After predict: [0, 0.09, 0.1, 0.73, 0.08, 0, ...] # The spike shifted right and blurred out.
Start with a spike belief, then apply motion. Watch how the belief blurs and shifts with each step.
The robot has a door sensor. It returns "door" or "no door." But it's not perfect — sometimes it says "door" when there's none (false positive), and sometimes it misses a real door (false negative). The sensor model P(z|x) encodes: "if I'm truly at cell x, what's the probability of reading z?"
This is the likelihood function. For each cell, it tells us how well the sensor reading "fits" that position. Cells with doors get a high likelihood when the sensor says "door." Cells without doors get a low likelihood. This is the information that sharpens belief.
The hallway has doors at specific cells (orange). Toggle the sensor reading to see how the likelihood changes.
The predict step applies the motion model to the current belief. Every cell's probability mass gets redistributed according to how the robot might move. The result is a blurred, shifted version of the old belief.
This is mathematically a convolution: bel¯(x') = Σx P(x'|x,u) bel(x). In the discrete case, we slide a small kernel across the histogram and sum. The output is always smoother (less certain) than the input.
The teal histogram is before prediction. Click Predict to apply one motion step. Watch the histogram blur and shift right.
The predict step says: bel¯(x') = Σx P(x'|x, u) · bel(x). This is not just a formula — it's a fundamental result in probability theory called the Chapman-Kolmogorov equation (or the law of total probability applied to transitions).
Your task: Starting from the joint probability P(x', x | z1:t, u1:t), derive the predict step. Show why we must sum over ALL previous states x, not just the most likely one.
Full derivation:
1. Define what we want: bel¯(x') = P(x't+1 | z1:t, u1:t+1)
2. Marginalize over xt: = Σx P(x't+1, xt | z1:t, u1:t+1)
3. Factor the joint: = Σx P(x't+1 | xt, z1:t, u1:t+1) · P(xt | z1:t, u1:t+1)
4. Apply Markov property: P(x'|xt, z1:t, u1:t+1) = P(x'|xt, ut+1). The next state only depends on current state and action.
5. And: P(xt | z1:t, u1:t+1) = P(xt | z1:t, u1:t) = bel(xt). The future action doesn't tell us about the past state.
6. Result: bel¯(x') = Σx P(x'|x, u) · bel(x).
The key insight: The predict step is a total probability calculation. We don't know where we are, so we hedge across ALL possible previous positions, weighted by how likely each one is. This is why it's a convolution — every source cell contributes to every destination cell through the motion kernel.
The predict step convolves the belief with the motion kernel. Any kernel with non-zero probability in more than one entry spreads probability mass from concentrated cells to neighboring ones. Information theory tells us: convolution with a non-trivial kernel can only increase entropy (or keep it the same). The ONLY way entropy stays unchanged is if P(x'|x,u) is a deterministic mapping — a delta function δ(x' - f(x,u)) — meaning the robot moves to exactly one cell with probability 1. In practice, motion always has noise, so entropy always increases. This is why a robot that only moves (without sensing) inevitably loses track of its position.
The update step incorporates a sensor reading. It's pure Bayes' theorem applied to every cell: multiply the predicted belief by the sensor likelihood, then renormalize so the probabilities sum to 1.
Here η is the normalizing constant (1 / Σ P(z|xi) bel¯(xi)). The effect: cells that are consistent with the sensor reading get boosted; cells that are inconsistent get suppressed. The histogram sharpens. Uncertainty decreases.
Let's trace through with actual numbers. Suppose the predicted belief is roughly uniform at 0.05 per cell. The sensor says "door." Doors are at cells 2, 5, 7, 12, 15, 18. With hit rate 0.9:
python # Update step = element-wise multiply + normalize # bel_pred = [0.05, 0.05, 0.05, ...] (uniform) # sensor says "door" # Likelihood for each cell: # P(door | cell_2) = 0.9 (has door) # P(door | cell_3) = 0.1 (no door) # Multiply: # cell_2: 0.05 * 0.9 = 0.045 # cell_3: 0.05 * 0.1 = 0.005 # Cell 2 gets 9x the weight of cell 3! # Normalizer = sum of all products # After normalize: door cells ~0.128, non-door ~0.013 bel_updated = bel_pred * likelihood # element-wise bel_updated /= bel_updated.sum() # normalize
Blue is predicted belief. Yellow is sensor likelihood. Green is the updated belief after multiply + normalize.
The robot has received observations z1, z2, ..., zt and taken actions u1, ..., ut. That's a growing history. But the Bayes filter only stores bel(x) — a single distribution over states. It throws away the raw history.
Your task: Prove that bel(xt) = P(xt | z1:t, u1:t) contains all the information from the full history needed to compute the future belief bel(xt+1). That is, show the belief is a sufficient statistic for the history.
Full derivation:
We want to show: P(xt+1 | z1:t+1, u1:t+1) can be computed from P(xt | z1:t, u1:t) without accessing the raw history.
Start with the update: P(xt+1 | z1:t+1, u1:t+1) = η P(zt+1 | xt+1) P(xt+1 | z1:t, u1:t+1)
The second factor is the predict: P(xt+1 | z1:t, u1:t+1) = Σxt P(xt+1 | xt, ut+1) P(xt | z1:t, u1:t)
The term P(xt | z1:t, u1:t) = bel(xt). By Markov, we only need bel(xt), the current action ut+1, and the current reading zt+1.
Therefore bel(xt) is recursively sufficient: at each step we only need the previous belief plus the new data. The history z1:t-1, u1:t-1 is irrelevant given bel(xt).
The key insight: A 20-cell hallway generates an ever-growing history (hundreds of actions and readings), but the belief vector is always just 20 numbers. This compression is EXACT (not lossy) under the Markov assumption. The belief IS the state of the filter — nothing else is needed. This is why recursive filters scale to infinite time horizons with constant memory.
Now we put it all together. The Bayes filter is an infinite loop of Predict (move, blur) and Update (sense, sharpen). Below is the classic Probabilistic Robotics visualization: a 1D hallway with colored doors, a robot that moves, and a belief histogram that evolves in real-time.
Python def bayes_filter(bel, u, z, motion_model, sensor_model): """One cycle of the discrete Bayes filter.""" n = len(bel) # ── Predict (convolution) ── bel_bar = [0.0] * n for x_prime in range(n): for x in range(n): bel_bar[x_prime] += motion_model(x_prime, x, u) * bel[x] # ── Update (Bayes rule) ── for x in range(n): bel_bar[x] *= sensor_model(z, x) # ── Normalize ── eta = 1.0 / sum(bel_bar) return [b * eta for b in bel_bar]
A robot moves through a hallway with colored doors. Watch the belief histogram converge to the true position. The green bar marks the robot's true cell.
python def bayes_filter_step(bel, motion_shift, motion_noise, sensor_reading, door_cells, hit_rate, n_cells): # ── PREDICT: convolve with motion kernel ── exact = 1.0 - motion_noise slip = motion_noise / 2.0 bel_bar = [0.0] * n_cells for i in range(n_cells): bel_bar[(i + motion_shift) % n_cells] += bel[i] * exact bel_bar[(i + motion_shift + 1) % n_cells] += bel[i] * slip bel_bar[(i + motion_shift - 1) % n_cells] += bel[i] * slip # ── UPDATE: multiply by sensor likelihood ��─ for i in range(n_cells): has_door = i in door_cells if sensor_reading == "door": likelihood = hit_rate if has_door else (1 - hit_rate) else: likelihood = (1 - hit_rate) if has_door else hit_rate bel_bar[i] *= likelihood # ── NORMALIZE ── total = sum(bel_bar) return [b / total for b in bel_bar]
The Bayes filter is not one algorithm — it's the parent of an entire family. Every probabilistic filter is a special case of the Bayes filter, making different assumptions about the belief representation and the motion/sensor models.
| Filter | Belief | Assumption | Cost |
|---|---|---|---|
| Discrete Bayes | Histogram | Finite grid of cells | O(n) |
| Kalman (KF) | Gaussian (μ, Σ) | Linear + Gaussian noise | O(n³) |
| EKF | Gaussian | Locally linear (Jacobian) | O(n³) |
| UKF | Gaussian (sigma pts) | Smooth nonlinearity | O(n³) |
| Particle Filter | Weighted samples | Any distribution | O(N·n) |
python # Particle filter: 1000 particles for 2D robot pose particles = np.random.uniform(size=(1000, 3)) # [x, y, θ] weights = np.ones(1000) / 1000 # Predict: move each particle + noise particles[:, 0] += velocity * np.cos(particles[:, 2]) + np.random.randn(1000) * 0.1 # Update: weight by sensor likelihood weights *= sensor_likelihood(particles, measurement) weights /= weights.sum() # Resample: draw new particles proportional to weights indices = np.random.choice(1000, size=1000, p=weights) particles = particles[indices]
The computational cost tells you which filter to use in practice:
| Filter | Cost per Step | When to Use |
|---|---|---|
| Discrete Bayes | O(|S|²) | Small state space (<1000 cells) |
| Kalman / EKF / UKF | O(n³) | Continuous, roughly Gaussian |
| Particle Filter | O(N) | Any distribution, any nonlinearity |
The discrete Bayes filter sums over all state pairs (O(|S|²) for the predict step). The Kalman filter inverts matrices (O(n³) where n is the state dimension — cheap for 3D pose, expensive for 100D). The particle filter scales linearly in the number of particles, but you need more particles as dimensionality grows.
You now understand Bayesian state estimation. Every sensor reading is noisy, every action is uncertain — but belief persists and refines. That's the Bayes filter.
Real-world solution: Adaptive Monte Carlo Localization (AMCL) with a Kalman fallback.
Why not grid: 200m/0.05m × 100m/0.05m × 360 = 4000 × 2000 × 360 = 2.88 billion cells. At 4 bytes each = 11.5 GB. Completely infeasible. Even at 50cm resolution (400 × 200 × 36 = 2.88M cells), the predict step requires ~2.88M multiplications per destination cell — too slow at 10 Hz.
Why not pure Kalman: LIDAR multi-path creates bimodal noise (ghost reflections). The Kalman filter's Gaussian assumption means it averages between the true position and the ghost, converging to neither. In dynamic environments with sudden occlusions, the Gaussian envelope is too optimistic.
Why particle filter wins: 500-2000 particles with (x, y, θ) state = 500×3×8 bytes = 12 KB. Each update: move all particles (O(N)), weight by LIDAR likelihood (O(N × K) where K = scan points matched), resample (O(N)). With N=1000 and K=100: ~100K operations per cycle. Easily fits in 100ms on ARM.
The production trick: Use AMCL (ROS default): start with many particles (5000) when uncertain, adaptively reduce to 200-500 when converged. The KLD-sampling algorithm adjusts particle count based on the complexity of the current belief. For tracking (already localized), some teams use an EKF with particle-based re-localization on kidnap detection.
Both compute the SAME thing: weigh the prediction against the measurement. In the discrete case, we literally multiply each cell's probability by how well the evidence fits. In the Kalman case, the Kalman Gain K is the continuous equivalent — it interpolates between the predicted mean and the measurement. When belief and likelihood are both Gaussian, the discrete multiply-and-normalize becomes the elegant K formula.
When the likelihood is much more certain than the prior (R << Σ), what happens to K? What's the discrete equivalent — what does the belief look like after update when the sensor is near-perfect?
The deep structural connection: both are weighted belief aggregation. The Bayes filter takes a distribution over states and updates it by weighting each state by evidence compatibility. Attention takes a set of values and produces a weighted combination where weights come from query-key compatibility. In both cases, a "soft assignment" (probability distribution) selects information from multiple sources rather than committing to one. The HMM forward algorithm (HMM lesson) is literally the Bayes filter applied to a finite state machine with discrete emissions — making the connection explicit.
In the Bayes filter, the normalizer η ensures probabilities sum to 1. In attention, softmax does the same. What would happen to a Transformer if you removed the softmax normalization? (Hint: same pathology as removing normalization from the Bayes filter.)