microBayes — From Uncertainty to Belief

Chapter 1: Bayes' Theorem — Updating What You Believe

Before we build a filter, we need one formula. But let's not start with the formula — let's start with counting.

Imagine 100 people. 20 are programmers. Of those 20 programmers, 15 drink coffee. Of the 80 non-programmers, 40 drink coffee. You meet someone drinking coffee. What's the probability they're a programmer?

Total coffee drinkers: 15 + 40 = 55. Programmers among them: 15. So: P(programmer | coffee) = 15/55 ≈ 0.27. That's Bayes' theorem in action — you updated your belief about someone being a programmer after observing they drink coffee.

P(A|B) = P(B|A) · P(A) / P(B)

Translation: posterior = likelihood × prior / evidence. The posterior is your updated belief. The prior is what you believed before. The likelihood is how well the evidence fits each hypothesis. The evidence normalizes everything to sum to 1.

Interactive Venn Diagram

Adjust the overlap to see how P(A|B) changes. A = hypothesis, B = evidence.

P(A) prior0.30

P(B|A) likelihood0.75

P(B|¬A)0.50

      P(A|B) =
      0.39
    

Check: In Bayes' theorem, what does the "prior" represent?

The probability of the evidence Your belief before seeing any evidence The updated belief after seeing evidence

🔨 Derivation Where does the normalization constant come from? ▶ ✓ ATTEMPTED

You have a prior P(A) and a likelihood P(B|A). You apply Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B). But what is P(B)? In our robot scenario, A ranges over N hypotheses (cells).

Your task: Show that P(B) = Σ_i P(B|A_i) P(A_i), and explain why this guarantees Σ_i P(A_i|B) = 1.

P(B) = Σ_i P(B and A_i) = Σ_i P(B|A_i) P(A_i). This marginalizes out A by summing over all possible hypotheses. Each term is "the probability of seeing B if hypothesis i is true, weighted by how likely hypothesis i is."

Σ_i P(A_i|B) = Σ_i P(B|A_i) P(A_i) / P(B). Factor out 1/P(B): = (1/P(B)) Σ_i P(B|A_i) P(A_i). The numerator sum IS P(B) by the law of total probability.

In code, we never compute P(B) explicitly. We compute the unnormalized product P(B|A_i) P(A_i) for each cell, then divide by their sum. That sum IS P(B). So η = 1/P(B) = 1 / Σ_i P(B|A_i) P(A_i).

Full derivation:

Start with Bayes for hypothesis i: P(A_i|B) = P(B|A_i) P(A_i) / P(B).

Sum both sides over all i: Σ_i P(A_i|B) = Σ_i P(B|A_i) P(A_i) / P(B) = [Σ_i P(B|A_i) P(A_i)] / P(B).

By the law of total probability, Σ_i P(B|A_i) P(A_i) = P(B). So the fraction = P(B)/P(B) = 1.

Therefore the posterior is guaranteed to be a valid probability distribution.

The key insight: Normalization is not a hack or numerical convenience — it's the law of total probability in disguise. The denominator P(B) is exactly what makes the posterior sum to 1. In code, bel /= bel.sum() computes 1/P(B) implicitly.

Chapter 2: Belief — A Distribution Over States

The robot's world is a 1D hallway divided into discrete cells. At any moment, the robot's belief is a probability distribution — a number for each cell saying "how likely am I to be here?" All the numbers sum to 1.

When the robot knows nothing, the belief is uniform: equal probability everywhere. As it gathers evidence, some cells become more probable and others shrink. The belief is the robot's internal mental model of where it might be.

The belief vector, concretely: Our hallway has 20 cells, so the belief is a [20] array summing to 1.0. At initialization (uniform), every cell = 1/20 = 0.05. After the robot senses a door and doors are at cells 2, 5, 7, 12, 15, 18, those cells spike to ~0.128 while non-door cells drop to ~0.013. The numbers change, but the sum stays 1.0. That's the entire data structure — one float per cell.

python
import numpy as np

# 20-cell hallway, uniform belief
belief = np.ones(20) / 20
# belief = [0.05, 0.05, 0.05, ..., 0.05]  (sums to 1.0)

# After sensing a door (doors at cells 2,5,7,12,15,18):
doors = [2, 5, 7, 12, 15, 18]
hit_rate = 0.9
likelihood = np.where(np.isin(np.arange(20), doors), hit_rate, 1 - hit_rate)
belief = belief * likelihood
belief /= belief.sum()   # renormalize
# Door cells: ~0.128, non-door cells: ~0.013

Belief bel(x)

A probability for every cell: [0.05, 0.12, 0.03, ...]

↓

Constraint

Σ bel(x_i) = 1 — probabilities must sum to 1

↓

Visualization

A histogram with one bar per cell

Belief Histogram

Click cells to manually assign probability. The histogram always renormalizes to sum to 1.

Discrete vs continuous: In this lesson we use discrete cells (a histogram). The Kalman filter uses a continuous Gaussian. Particle filters use samples. But the concept is identical: belief = a distribution over states.

Check: What must always be true about a belief distribution?

All values must be equal The maximum value must be 1 All values must be non-negative and sum to 1

Chapter 3: The Motion Model — P(x'|x,u)

When the robot's wheels turn, it tries to move forward. But wheels slip. The robot might overshoot, undershoot, or stay put. The motion model P(x'|x,u) describes: "if I'm at cell x and I command motion u, what's the probability I end up at cell x'?"

Mathematically, applying the motion model to a belief is a convolution: each cell "spreads" its probability to neighboring cells according to the motion kernel. The result is that the belief shifts in the direction of motion and blurs due to uncertainty. Information is lost. Entropy increases.

bel¯(x') = Σ_x P(x'|x, u) · bel(x)

Intuition: Motion always makes you less certain. If you were 90% sure you were at cell 5 and you move right by 1, maybe now you're 70% at cell 6, 10% at cell 5, and 10% at cell 7. The belief smears out. Only sensing can sharpen it back.

Predict = convolution. The predict step is literally a 1D convolution. The motion kernel [0.1, 0.8, 0.1] says: 10% chance of staying put, 80% of moving right by 1, 10% of overshooting by 2. Slide this kernel across the belief vector and sum. In NumPy, one line does it:

python
import numpy as np

# Motion kernel: [undershoot, exact, overshoot]
kernel = np.array([0.1, 0.8, 0.1])

# One predict step = convolve belief with kernel
belief_pred = np.convolve(belief, kernel, mode='same')
belief_pred /= belief_pred.sum()   # fix boundary effects

# Example: belief = [0, 0, 0.9, 0.1, 0, ...]
# After predict: [0, 0.09, 0.1, 0.73, 0.08, 0, ...]
# The spike shifted right and blurred out.

Motion Convolution

Start with a spike belief, then apply motion. Watch how the belief blurs and shifts with each step.

Exact prob0.60

Overshoot0.15

Steps: 0

Check: What happens to the belief when the robot moves?

It becomes more certain It shifts and blurs — uncertainty increases It stays exactly the same

Chapter 4: The Sensor Model — P(z|x)

The robot has a door sensor. It returns "door" or "no door." But it's not perfect — sometimes it says "door" when there's none (false positive), and sometimes it misses a real door (false negative). The sensor model P(z|x) encodes: "if I'm truly at cell x, what's the probability of reading z?"

This is the likelihood function. For each cell, it tells us how well the sensor reading "fits" that position. Cells with doors get a high likelihood when the sensor says "door." Cells without doors get a low likelihood. This is the information that sharpens belief.

P(z = door | x) = 0.9 if cell x has a door, 0.1 otherwise

Likelihood vs probability: P(z|x) is NOT the probability of being at x. It's the probability of seeing z if you were at x. Confusing these is the #1 Bayes mistake. The likelihood tells us how to weight each hypothesis.

Sensor Likelihood

The hallway has doors at specific cells (orange). Toggle the sensor reading to see how the likelihood changes.

Hit = 0.9, False alarm = 0.1

Hit rate0.90

Check: If the sensor says "door" and cell 3 has a door, what is P(z="door"|x=3) with hit rate 0.9?

0.9 0.1 0.5

Chapter 5: The Predict Step

The predict step applies the motion model to the current belief. Every cell's probability mass gets redistributed according to how the robot might move. The result is a blurred, shifted version of the old belief.

This is mathematically a convolution: bel¯(x') = Σ_x P(x'|x,u) bel(x). In the discrete case, we slide a small kernel across the histogram and sum. The output is always smoother (less certain) than the input.

Before Predict

Sharp belief — peaked at a few cells

↓ apply motion model

After Predict

Blurred belief — spread across more cells

Watch the Predict Step

The teal histogram is before prediction. Click Predict to apply one motion step. Watch the histogram blur and shift right.

Motion amount1

Motion noise0.30

Entropy: low

Key insight: Prediction ALWAYS increases uncertainty. Every predict step adds entropy. If the robot just drives forward without sensing, the belief eventually becomes uniform — it has no idea where it is. Only the update step can fight this drift.

Check: After many predict steps without any updates, the belief becomes...

A sharp spike at one cell A Gaussian Approximately uniform — maximum uncertainty

🔨 Derivation The Chapman-Kolmogorov Equation (Predict Step) ▶ ✓ ATTEMPTED

The predict step says: bel¯(x') = Σ_x P(x'|x, u) · bel(x). This is not just a formula — it's a fundamental result in probability theory called the Chapman-Kolmogorov equation (or the law of total probability applied to transitions).

Your task: Starting from the joint probability P(x', x | z_1:t, u_1:t), derive the predict step. Show why we must sum over ALL previous states x, not just the most likely one.

bel¯(x') = P(x'_t+1 | z_1:t, u_1:t+1). This is the probability of being at x' AFTER acting but BEFORE sensing. We need to marginalize out the previous state x_t that we don't know for certain.

P(x'_t+1 | z_1:t, u_1:t+1) = Σ_x P(x'_t+1 | x_t, u_t+1) · P(x_t | z_1:t, u_1:t). The first term is the motion model, the second is bel(x_t). The key assumption: we used the Markov property — x'_t+1 only depends on x_t and u_t+1, not the full history.

If you only propagate from the most likely state, you lose information. Suppose bel = [0.4, 0.35, 0.25] at cells [5, 6, 7]. The most likely cell is 5, but there's a 60% chance you're NOT there. Summing over all x correctly weights every possible origin by its probability. Using only argmax would give overconfident (and wrong) predictions.

Full derivation:

1. Define what we want: bel¯(x') = P(x'_t+1 | z_1:t, u_1:t+1)

2. Marginalize over x_t: = Σ_x P(x'_t+1, x_t | z_1:t, u_1:t+1)

3. Factor the joint: = Σ_x P(x'_t+1 | x_t, z_1:t, u_1:t+1) · P(x_t | z_1:t, u_1:t+1)

4. Apply Markov property: P(x'|x_t, z_1:t, u_1:t+1) = P(x'|x_t, u_t+1). The next state only depends on current state and action.

5. And: P(x_t | z_1:t, u_1:t+1) = P(x_t | z_1:t, u_1:t) = bel(x_t). The future action doesn't tell us about the past state.

6. Result: bel¯(x') = Σ_x P(x'|x, u) · bel(x).

The key insight: The predict step is a total probability calculation. We don't know where we are, so we hedge across ALL possible previous positions, weighted by how likely each one is. This is why it's a convolution — every source cell contributes to every destination cell through the motion kernel.

Checkpoint — Before you move on

In your own words: why does the predict step ALWAYS increase entropy (uncertainty), regardless of the motion model? What would have to be true about P(x'|x,u) for entropy to stay the same?

✓ Gate cleared

Model Answer

The predict step convolves the belief with the motion kernel. Any kernel with non-zero probability in more than one entry spreads probability mass from concentrated cells to neighboring ones. Information theory tells us: convolution with a non-trivial kernel can only increase entropy (or keep it the same). The ONLY way entropy stays unchanged is if P(x'|x,u) is a deterministic mapping — a delta function δ(x' - f(x,u)) — meaning the robot moves to exactly one cell with probability 1. In practice, motion always has noise, so entropy always increases. This is why a robot that only moves (without sensing) inevitably loses track of its position.

Chapter 6: The Update Step

The update step incorporates a sensor reading. It's pure Bayes' theorem applied to every cell: multiply the predicted belief by the sensor likelihood, then renormalize so the probabilities sum to 1.

bel(x) = η · P(z|x) · bel¯(x)

Here η is the normalizing constant (1 / Σ P(z|x_i) bel¯(x_i)). The effect: cells that are consistent with the sensor reading get boosted; cells that are inconsistent get suppressed. The histogram sharpens. Uncertainty decreases.

The magic moment: Multiply and renormalize. That's the entire update. Cells where the sensor reading is likely get amplified. Cells where it's unlikely get crushed. The belief sharpens toward the truth. This is Bayesian inference in its purest form.

Let's trace through with actual numbers. Suppose the predicted belief is roughly uniform at 0.05 per cell. The sensor says "door." Doors are at cells 2, 5, 7, 12, 15, 18. With hit rate 0.9:

python
# Update step = element-wise multiply + normalize
# bel_pred = [0.05, 0.05, 0.05, ...]  (uniform)
# sensor says "door"

# Likelihood for each cell:
#   P(door | cell_2)  = 0.9  (has door)
#   P(door | cell_3)  = 0.1  (no door)

# Multiply:
#   cell_2: 0.05 * 0.9 = 0.045
#   cell_3: 0.05 * 0.1 = 0.005

# Cell 2 gets 9x the weight of cell 3!
# Normalizer = sum of all products
# After normalize: door cells ~0.128, non-door ~0.013

bel_updated = bel_pred * likelihood   # element-wise
bel_updated /= bel_updated.sum()      # normalize

Watch the Update Step

Blue is predicted belief. Yellow is sensor likelihood. Green is the updated belief after multiply + normalize.

Check: The update step works by...

Replacing the belief with the sensor reading Multiplying the belief by the likelihood, then renormalizing Averaging the belief with the likelihood

🔨 Derivation Why belief is a sufficient statistic ▶ ✓ ATTEMPTED

The robot has received observations z₁, z₂, ..., z_t and taken actions u₁, ..., u_t. That's a growing history. But the Bayes filter only stores bel(x) — a single distribution over states. It throws away the raw history.

Your task: Prove that bel(x_t) = P(x_t | z_1:t, u_1:t) contains all the information from the full history needed to compute the future belief bel(x_t+1). That is, show the belief is a sufficient statistic for the history.

Two assumptions: (1) The Markov property: P(x_t+1 | x_t, x_t-1, ...) = P(x_t+1 | x_t). The future depends on the past only through the present state. (2) Conditional independence of observations: P(z_t | x_t, z_1:t-1) = P(z_t | x_t). The sensor reading only depends on the current state.

bel(x_t+1) = η P(z_t+1|x_t+1) Σ_{x_t} P(x_t+1|x_t, u_t+1) bel(x_t). Notice: bel(x_t+1) depends on bel(x_t), the new action u_t+1, and the new observation z_t+1. It does NOT depend on z_1:t-1 or u_1:t-1 directly.

A statistic T is sufficient for a parameter if P(data | T, parameter) = P(data | T). Here: given bel(x_t), the full history (z_1:t, u_1:t) provides no additional information about x_t. The belief compresses the entire history into a fixed-size object (the N-dimensional probability vector).

Full derivation:

We want to show: P(x_t+1 | z_1:t+1, u_1:t+1) can be computed from P(x_t | z_1:t, u_1:t) without accessing the raw history.

Start with the update: P(x_t+1 | z_1:t+1, u_1:t+1) = η P(z_t+1 | x_t+1) P(x_t+1 | z_1:t, u_1:t+1)

The second factor is the predict: P(x_t+1 | z_1:t, u_1:t+1) = Σ_{x_t} P(x_t+1 | x_t, u_t+1) P(x_t | z_1:t, u_1:t)

The term P(x_t | z_1:t, u_1:t) = bel(x_t). By Markov, we only need bel(x_t), the current action u_t+1, and the current reading z_t+1.

Therefore bel(x_t) is recursively sufficient: at each step we only need the previous belief plus the new data. The history z_1:t-1, u_1:t-1 is irrelevant given bel(x_t).

The key insight: A 20-cell hallway generates an ever-growing history (hundreds of actions and readings), but the belief vector is always just 20 numbers. This compression is EXACT (not lossy) under the Markov assumption. The belief IS the state of the filter — nothing else is needed. This is why recursive filters scale to infinite time horizons with constant memory.

Chapter 7: The Full Algorithm — Robot Localization

Now we put it all together. The Bayes filter is an infinite loop of Predict (move, blur) and Update (sense, sharpen). Below is the classic Probabilistic Robotics visualization: a 1D hallway with colored doors, a robot that moves, and a belief histogram that evolves in real-time.

Initialize

bel(x) = 1/N (uniform — "I have no idea")

↓

Predict

bel¯(x') = Σ_x P(x'|x, u) bel(x)

↓

Update

bel(x) = η P(z|x) bel¯(x)

↓ repeat

The Bayes Filter in Python

Python
def bayes_filter(bel, u, z, motion_model, sensor_model):
    """One cycle of the discrete Bayes filter."""
    n = len(bel)

    # ── Predict (convolution) ──
    bel_bar = [0.0] * n
    for x_prime in range(n):
        for x in range(n):
            bel_bar[x_prime] += motion_model(x_prime, x, u) * bel[x]

    # ── Update (Bayes rule) ──
    for x in range(n):
        bel_bar[x] *= sensor_model(z, x)

    # ── Normalize ──
    eta = 1.0 / sum(bel_bar)
    return [b * eta for b in bel_bar]

Hallway Localization — The Showcase

A robot moves through a hallway with colored doors. Watch the belief histogram converge to the true position. The green bar marks the robot's true cell.

Sensor accuracy0.85

Motion noise0.20

Step: 0

Experiment: Reset and watch the uniform belief sharpen over 10-20 steps. Then crank sensor accuracy down to 55% and see how much slower convergence is. Try high motion noise — the belief re-blurs after every move, fighting the update step.

Check: In the Bayes filter, which step reduces uncertainty?

The predict step The update step (incorporating sensor data) Both steps equally

⚔ Adversarial: Your Bayes filter's belief becomes a delta function (all probability on one cell) after 50 steps, even though the sensor has 10% false alarm rate. What's wrong?

You implemented a Bayes filter for a 20-cell hallway. Sensors have hit rate 0.9 and false alarm rate 0.1. After about 50 predict-update cycles, you notice bel(x=7) = 1.0 and all other cells are exactly 0.0. The robot hasn't actually converged — it's at cell 12.

The sensor is too accurate The motion model is too noisy Numerical underflow — small probabilities rounded to 0, then normalization concentrates mass The predict step is running twice per cycle

💥 Break-It Lab What Dies When You Remove Components? ▶ ✓ ATTEMPTED

A working Bayes filter has three components: prediction (motion model), update (sensor model with normalization), and normalization. Below, the filter runs on a 20-cell hallway. Toggle off components to see specific failure modes.

Remove Normalization ACTIVE

Failure mode: Without dividing by Σ bel(x), the values grow exponentially with each update (or shrink to zero). After a few cycles, you get numbers like 0.00003 and 0.000001 — technically proportional but numerically useless. Eventually, floating-point underflow kills all cells. The belief is no longer a probability distribution.

Remove Prediction (observe only) ACTIVE

Failure mode: The belief narrows too fast and locks onto the wrong position. Without the predict step spreading probability, a few consistent sensor readings cause the belief to collapse to door positions regardless of motion. The robot "teleports" in its own mind because the belief can't track movement between observations.

Wrong Motion Model (shift=3 instead of 1) ACTIVE

Failure mode: The belief drifts away from the true position. Each predict step shifts probability 3 cells right while the robot only moves 1. After a few steps, the peak is 10+ cells away from reality. The update step tries to correct using sensor data, but the mismatch is too large — the filter "hunts" for the robot and never converges.

💻 Build It Implement the Full Predict-Update Cycle ▶ ✓ ATTEMPTED

You've seen the algorithm. Now write it from scratch. Implement a discrete Bayes filter that takes a belief vector, performs one predict step (motion), then one update step (sensor reading), and returns the new belief.

signature def bayes_filter_step(bel, motion_shift, motion_noise, sensor_reading, door_cells, hit_rate, n_cells): """ One full predict-update cycle of the discrete Bayes filter. Args: bel: list[float] - current belief (n_cells probabilities summing to 1) motion_shift: int - commanded movement (cells to shift right) motion_noise: float - probability of NOT landing exactly (split equally over/under) sensor_reading: str - "door" or "no_door" door_cells: list[int] - indices of cells that have doors hit_rate: float - P(sensor_correct | true_state), e.g. 0.9 n_cells: int - number of cells in the hallway Returns: list[float] - updated belief after predict + update """

Test case

bel = [0]*20; bel[5] = 1.0 # certain at cell 5
door_cells = [2, 5, 7, 12, 15, 18]
result = bayes_filter_step(bel, motion_shift=1, motion_noise=0.2, sensor_reading="door", door_cells=door_cells, hit_rate=0.9, n_cells=20)
# After predict: ~[0, 0, 0, 0, 0, 0.1, 0.8, 0.1, 0, ...] (shifted right, blurred)
# After update with "door" at cells 7: cell 7 gets boosted by 0.9, cells 6,8 get 0.1
# Expected: result[7] should be the highest value (~0.85), result[6] and result[8] much lower

For each source cell i, distribute its probability: exact portion goes to (i + shift) % n, overshoot to (i + shift + 1) % n, undershoot to (i + shift - 1) % n. Sum contributions into a new array. The kernel is [undershoot, exact, overshoot] = [noise/2, 1-noise, noise/2].

python
def bayes_filter_step(bel, motion_shift, motion_noise, sensor_reading,
                       door_cells, hit_rate, n_cells):
    # ── PREDICT: convolve with motion kernel ──
    exact = 1.0 - motion_noise
    slip = motion_noise / 2.0
    bel_bar = [0.0] * n_cells
    for i in range(n_cells):
        bel_bar[(i + motion_shift) % n_cells] += bel[i] * exact
        bel_bar[(i + motion_shift + 1) % n_cells] += bel[i] * slip
        bel_bar[(i + motion_shift - 1) % n_cells] += bel[i] * slip

    # ── UPDATE: multiply by sensor likelihood ��─
    for i in range(n_cells):
        has_door = i in door_cells
        if sensor_reading == "door":
            likelihood = hit_rate if has_door else (1 - hit_rate)
        else:
            likelihood = (1 - hit_rate) if has_door else hit_rate
        bel_bar[i] *= likelihood

    # ── NORMALIZE ──
    total = sum(bel_bar)
    return [b / total for b in bel_bar]

Bonus challenge: Extend to 2D (grid of cells with both x,y position). How does computational cost scale? What changes in the motion model?

Chapter 8: The Filter Family Tree

The Bayes filter is not one algorithm — it's the parent of an entire family. Every probabilistic filter is a special case of the Bayes filter, making different assumptions about the belief representation and the motion/sensor models.

Filter	Belief	Assumption	Cost
Discrete Bayes	Histogram	Finite grid of cells	O(n)
Kalman (KF)	Gaussian (μ, Σ)	Linear + Gaussian noise	O(n³)
EKF	Gaussian	Locally linear (Jacobian)	O(n³)
UKF	Gaussian (sigma pts)	Smooth nonlinearity	O(n³)
Particle Filter	Weighted samples	Any distribution	O(N·n)

The Family Tree

The unifying idea: Every filter does the same two steps — predict (apply motion, grow uncertainty) and update (incorporate measurement, shrink uncertainty). They only differ in how they represent the belief and how they compute the math. Understand the Bayes filter and you understand them all.

When the grid breaks down: Our 20-cell hallway was easy. But what about a robot in a 2D room? A 100×100 grid has 10,000 cells. Add orientation (360 bins) and you have 3.6 million cells. The particle filter solves this by representing the belief with N weighted samples instead of a full grid. Each particle is a hypothesis: "maybe I'm at (x=3.2, y=7.1, θ=45°)." Resampling draws new particles proportional to weights — likely hypotheses survive, unlikely ones die.

python
# Particle filter: 1000 particles for 2D robot pose
particles = np.random.uniform(size=(1000, 3))  # [x, y, θ]
weights = np.ones(1000) / 1000

# Predict: move each particle + noise
particles[:, 0] += velocity * np.cos(particles[:, 2]) + np.random.randn(1000) * 0.1

# Update: weight by sensor likelihood
weights *= sensor_likelihood(particles, measurement)
weights /= weights.sum()

# Resample: draw new particles proportional to weights
indices = np.random.choice(1000, size=1000, p=weights)
particles = particles[indices]

The computational cost tells you which filter to use in practice:

Filter	Cost per Step	When to Use
Discrete Bayes	O(\|S\|²)	Small state space (<1000 cells)
Kalman / EKF / UKF	O(n³)	Continuous, roughly Gaussian
Particle Filter	O(N)	Any distribution, any nonlinearity

The discrete Bayes filter sums over all state pairs (O(|S|²) for the predict step). The Kalman filter inverts matrices (O(n³) where n is the state dimension — cheap for 3D pose, expensive for 100D). The particle filter scales linearly in the number of particles, but you need more particles as dimensionality grows.

Where from here? The Kalman filter lesson covers the Gaussian continuous case in depth. Each filter in the family tree trades off expressiveness against computational cost. The discrete Bayes filter is the simplest and most intuitive — and now you understand it completely.

"Probability theory is nothing but common sense reduced to calculation."

— Pierre-Simon Laplace

You now understand Bayesian state estimation. Every sensor reading is noisy, every action is uncertain — but belief persists and refines. That's the Bayes filter.

Check: What makes the Kalman filter a special case of the Bayes filter?

It uses a different algorithm entirely It represents belief as a Gaussian and assumes linear dynamics It doesn't use Bayes' theorem

🏗 Design Challenge You're the Architect: Warehouse Robot Localization ▶ ✓ ATTEMPTED

You're designing the localization system for an autonomous forklift in a 200m × 100m warehouse. The robot has wheel encoders, a 2D LIDAR, and an ARM Cortex-A53 processor. Three approaches are on the table: discrete grid filter, particle filter, or Kalman filter. Each has trade-offs.

Map

1000 landmarks (pillars, shelves), 200m × 100m floor

Update rate

10 Hz (100ms per cycle)

Processor

ARM Cortex-A53 @ 1.2 GHz, 1 GB RAM

Required accuracy

< 5 cm position, < 2° heading

Environment

Dynamic (people, other forklifts), non-Gaussian noise from LIDAR multi-path

1. If you use a discrete grid at 5cm resolution over a 200m × 100m floor with 360 heading bins — how many cells is that? Can it fit in 1 GB RAM? Can one predict step run in 100ms?

2. A Kalman filter (3-state: x, y, θ) is computationally cheapest. But LIDAR multi-path reflections cause non-Gaussian noise. What breaks?

3. How many particles do you need for a particle filter in this space? What's the cost per update? Does it fit the 100ms budget?

4. Could you combine approaches? (Hint: what if you use a Kalman filter most of the time but switch to particles for re-localization?)

Real-world solution: Adaptive Monte Carlo Localization (AMCL) with a Kalman fallback.

Why not grid: 200m/0.05m × 100m/0.05m × 360 = 4000 × 2000 × 360 = 2.88 billion cells. At 4 bytes each = 11.5 GB. Completely infeasible. Even at 50cm resolution (400 × 200 × 36 = 2.88M cells), the predict step requires ~2.88M multiplications per destination cell — too slow at 10 Hz.

Why not pure Kalman: LIDAR multi-path creates bimodal noise (ghost reflections). The Kalman filter's Gaussian assumption means it averages between the true position and the ghost, converging to neither. In dynamic environments with sudden occlusions, the Gaussian envelope is too optimistic.

Why particle filter wins: 500-2000 particles with (x, y, θ) state = 500×3×8 bytes = 12 KB. Each update: move all particles (O(N)), weight by LIDAR likelihood (O(N × K) where K = scan points matched), resample (O(N)). With N=1000 and K=100: ~100K operations per cycle. Easily fits in 100ms on ARM.

The production trick: Use AMCL (ROS default): start with many particles (5000) when uncertain, adaptively reduce to 200-500 when converged. The KLD-sampling algorithm adjusts particle count based on the complexity of the current belief. For tracking (already localized), some teams use an EKF with particle-based re-localization on kidnap detection.

🔗 Pattern Recognition

Bayes Filter Update = Gaussian Special Case = Kalman Gain

This Lesson (Discrete Bayes)

bel(x) = η · P(z|x) · bel¯(x)
Element-wise multiply + normalize

Kalman Filter (Gaussian case)

μ = μ¯ + K(z - Hμ¯)
K = Σ¯H^T(HΣ¯H^T + R)^-1 → Kalman Filter lesson

Both compute the SAME thing: weigh the prediction against the measurement. In the discrete case, we literally multiply each cell's probability by how well the evidence fits. In the Kalman case, the Kalman Gain K is the continuous equivalent — it interpolates between the predicted mean and the measurement. When belief and likelihood are both Gaussian, the discrete multiply-and-normalize becomes the elegant K formula.

When the likelihood is much more certain than the prior (R << Σ), what happens to K? What's the discrete equivalent — what does the belief look like after update when the sensor is near-perfect?

🔗 Pattern Recognition

Belief Update = HMM Forward Algorithm = Soft Attention

Bayes Filter

bel¯(x') = Σ_x P(x'|x,u) bel(x)
bel(x') = η P(z|x') bel¯(x')

Transformer Attention

Attention(Q,K,V) = softmax(QK^T/√d) V
Weighted combination → Transformer lesson

The deep structural connection: both are weighted belief aggregation. The Bayes filter takes a distribution over states and updates it by weighting each state by evidence compatibility. Attention takes a set of values and produces a weighted combination where weights come from query-key compatibility. In both cases, a "soft assignment" (probability distribution) selects information from multiple sources rather than committing to one. The HMM forward algorithm (HMM lesson) is literally the Bayes filter applied to a finite state machine with discrete emissions — making the connection explicit.

In the Bayes filter, the normalizer η ensures probabilities sum to 1. In attention, softmax does the same. What would happen to a Transformer if you removed the softmax normalization? (Hint: same pathology as removing normalization from the Bayes filter.)

Understand the Bayes
Filter

Chapter 0: Why Bayes?

Chapter 1: Bayes' Theorem — Updating What You Believe

Chapter 2: Belief — A Distribution Over States

Chapter 3: The Motion Model — P(x'|x,u)

Chapter 4: The Sensor Model — P(z|x)

Chapter 5: The Predict Step

Chapter 6: The Update Step

Chapter 7: The Full Algorithm — Robot Localization

The Bayes Filter in Python

Chapter 8: The Filter Family Tree

Understand the BayesFilter

Chapter 0: Why Bayes?

Chapter 1: Bayes' Theorem — Updating What You Believe

Chapter 2: Belief — A Distribution Over States

Chapter 3: The Motion Model — P(x'|x,u)

Chapter 4: The Sensor Model — P(z|x)

Chapter 5: The Predict Step

Chapter 6: The Update Step

Chapter 7: The Full Algorithm — Robot Localization

The Bayes Filter in Python

Chapter 8: The Filter Family Tree

Understand the Bayes
Filter