The algorithm that guided Apollo to the Moon, helps your phone know where you are, and is the backbone of every self-driving car's perception system.
Imagine you're tracking a ball rolling across a table. You have two sources of information: a physics model that predicts where the ball should be, and a camera that measures where it actually is. The problem? The physics model is approximate (friction, air resistance, imperfect initial conditions), and the camera is noisy (pixel jitter, lighting changes, motion blur).
Neither source alone is reliable. The physics model drifts over time. The camera jumps around randomly. But what if you could combine them in the smartest possible way? That's what the Kalman filter does: it fuses noisy predictions with noisy measurements to produce an estimate that's better than either alone.
The teal line is the true trajectory. The red dots are noisy camera measurements. Notice how scattered they are.
Everything the Kalman filter does is expressed in the language of Gaussian distributions (bell curves). A Gaussian is defined by just two numbers: the mean (your best guess) and the variance (how uncertain you are). A small variance means you're confident. A large variance means you're shrugging your shoulders.
Why Gaussians? Because they have a magical property: the combination of two Gaussians is another Gaussian. This means the Kalman filter can fuse predictions and measurements using simple arithmetic — no expensive computations, no approximations. It's exact.
Drag the sliders to change the mean (μ) and standard deviation (σ). Watch how the bell curve reshapes.
Before we can filter anything, we need to decide: what are we tracking? For a ball rolling on a table, we care about its position and its velocity. Together, these form the state vector x. The state is not what we observe — it's what we believe about the world.
We also keep track of our uncertainty about the state: the covariance matrix P. If P is small, we're confident. If P is large, we're unsure. The Kalman filter updates both x and P at every step.
To make this real, let's track a ball in 2D. We track its x-position, x-velocity, y-position, and y-velocity. Here are the exact matrix shapes — every single matrix in the Kalman filter, with dimensions:
| Matrix | Shape | What it is |
|---|---|---|
| x | [4, 1] | State: [px, vx, py, vy] |
| F | [4, 4] | State transition (encodes physics) |
| P | [4, 4] | State covariance (16 entries, symmetric) |
| z | [2, 1] | Measurement: [measured px, measured py] |
| H | [2, 4] | Observation: picks out positions from state |
| R | [2, 2] | Measurement noise covariance |
| Q | [4, 4] | Process noise covariance |
| K | [4, 2] | Kalman gain (maps 2D innovation to 4D state correction) |
At each timestep, before we look at any measurement, we use physics to predict where the ball should be. If the ball was at position 5 with velocity 2, then after one timestep it should be at position 7. This is the predict step.
The state transition matrix F encodes our motion model. For constant velocity: new_position = old_position + velocity × dt. But our model isn't perfect, so we add process noise Q to account for things we can't model (wind, bumps, etc.). Uncertainty always grows during prediction.
| Symbol | Meaning |
|---|---|
| F | State transition matrix (physics model) |
| B | Control input matrix |
| u | Control input (e.g., applied force) |
| Q | Process noise covariance (model uncertainty) |
| P¯ | Predicted covariance (uncertainty after predict) |
R comes from datasheets. Your GPS says ±3 meters accuracy? That means one standard deviation is 3m, so R = 9 m² (variance = σ²). A laser rangefinder with ±0.01m gives R = 0.0001. R is a property of the sensor hardware — you look it up.
Q is tuned by the engineer. Q represents how much you trust your physics model. Big Q says "my model is approximate — expect surprises." Small Q says "I trust the model — the world is predictable." There is no datasheet for Q because it captures everything your model doesn't account for: wind gusts, bumps, unmodeled friction.
q · G @ G.T, where G is the noise input matrix and q is a scalar you tune. For our 2D tracker with dt=1:The predicted covariance is P¯ = F P F¹ + Q. In the 1D case (scalar), this simplifies to P¯ = F² P + Q. With F=1 (random walk), we get P¯ = P + Q.
Your task: Starting from the definition of variance and the predict equation x¯ = Fx + w (where w ~ N(0,Q) is independent of x), derive that Var(x¯) = F² Var(x) + Q.
Full derivation:
1. Start with the predict equation: x¯ = Fx + w, where w ~ N(0, Q) is independent of x.
2. Take the covariance: Cov(x¯) = Cov(Fx + w)
3. Since Fx and w are independent: Cov(x¯) = Cov(Fx) + Cov(w)
4. For a linear transform: Cov(Fx) = F Cov(x) F¹ = F P F¹
5. And Cov(w) = Q by definition.
6. Therefore: P¯ = F P F¹ + Q ✓
The key insight: Uncertainty MUST grow during prediction because we're adding a new source of randomness (process noise) that is independent of everything we already know. Even a perfect model (F exact) still accumulates uncertainty from unmodeled disturbances.
Prediction uses an imperfect physics model. Process noise Q accounts for everything the model doesn't capture (wind, friction changes, unmodeled forces). Without Q, the filter becomes overconfident: it trusts its prediction more and more, eventually ignoring measurements entirely. The covariance P shrinks to zero, K goes to zero, and the filter locks onto its (wrong) prediction. This is the classic "filter divergence" failure mode. Adding Q is an admission of humility: "my model isn't perfect, so keep me open to corrections from measurements."
Now the camera takes a picture and says "I see the ball at position 7.3." But we know the camera is noisy — it might be off by a bit. The measurement z has its own uncertainty, captured by the measurement noise covariance R.
The observation matrix H maps our state to what the sensor actually sees. If we're tracking [position, velocity] but the camera only measures position, then H = [1, 0] — it picks out just the position part.
The teal line is the true position. Red dots are noisy measurements. Adjust R to see the effect.
Here's the million-dollar question: how much should we trust the measurement vs our prediction? The Kalman gain K answers this. It's a number between 0 and 1 (roughly speaking) that acts as a trust slider.
When your sensor is very accurate (R is small), K is large — you trust the measurement. When your prediction is very confident (P is small), K is small — you trust the prediction. The formula automatically balances these:
Adjust measurement noise R and prediction uncertainty P to see how the Kalman gain K responds. (1D case: K = P / (P + R))
In the 1D case, the update equation is: x̂ = x̂¯ + K(z − x̂¯), and the posterior variance is P = (1 − K)² P¯ + K² R.
Your task: Find the value of K that minimizes P. Take dP/dK = 0 and solve for K. You should arrive at K = P¯ / (P¯ + R).
Full derivation:
1. The update is a weighted average: x̂ = (1−K) x̂¯ + K z
2. Since prediction error and measurement error are independent:
P = Var(x̂ − xtrue) = (1−K)² P¯ + K² R
3. Expand: P = P¯ − 2KP¯ + K²P¯ + K²R = P¯ − 2KP¯ + K²(P¯ + R)
4. Differentiate: dP/dK = −2P¯ + 2K(P¯ + R)
5. Set to zero: 2K(P¯ + R) = 2P¯
6. Solve: K = P¯ / (P¯ + R) ✓
7. Plug back in: P = (1 − P¯/(P¯+R))² P¯ + (P¯/(P¯+R))² R = P¯ R / (P¯ + R)
The key insight: The Kalman gain isn't an arbitrary formula — it's the UNIQUE value that minimizes the posterior variance. Any other weighting would give you a less certain estimate. This is why the KF is "optimal": no other linear filter can do better.
Both mechanisms solve the same fundamental problem: optimal weighting under uncertainty. The Kalman gain weights information by precision (inverse variance). Attention weights information by relevance (dot-product similarity). In both cases, you have multiple sources of information and need to decide how much to trust each one. The math differs, but the principle is identical: weight by quality of information.
When you study attention, ask: "What is the analog of R (noise) in the attention mechanism? What makes a key-value pair 'noisy' vs 'reliable'?"
Now we fuse the prediction and measurement. The innovation (or residual) is the difference between what we measured and what we expected: y = z − H x̂¯. The Kalman gain K scales this innovation and adds it to our prediction:
The updated covariance P is always smaller than the predicted covariance P¯. This is the magic: every measurement, no matter how noisy, makes our estimate more certain. Information only accumulates; it never destroys.
The blue Gaussian is the prediction. The red is the measurement. The green is the fused result — always the narrowest!
The simple form P = (I − KH) P¯ works when K is computed exactly. But with floating-point arithmetic, rounding errors can make P non-symmetric or even negative definite (variance < 0 — nonsense!).
The Joseph form is: P = (I − KH) P¯ (I − KH)¹ + K R K¹
Your task: Show that if K = P¯ H¹ (H P¯ H¹ + R)¯¹ (the optimal gain), the Joseph form reduces to P = (I − KH) P¯. Why is the Joseph form preferred in practice?
Full derivation:
1. Joseph form: P = (I−KH) P¯ (I−KH)¹ + K R K¹
2. Expand first term: P¯ − KHP¯ − P¯H¹K¹ + KHP¯H¹K¹
3. With optimal K: KHP¯ = K(HP¯) and P¯H¹K¹ = (KHP¯)¹ (symmetric), so −KHP¯ − P¯H¹K¹ = −2KHP¯
4. Also KHP¯H¹K¹ = K(S−R)K¹ (since S = HP¯H¹ + R)
5. Total: P¯ − 2KHP¯ + KSK¹ − KRK¹ + KRK¹ = P¯ − 2KHP¯ + KSK¹
6. Since K = P¯H¹S¯¹: KS = P¯H¹, so KSK¹ = P¯H¹K¹ = KHP¯
7. Final: P¯ − 2KHP¯ + KHP¯ = P¯ − KHP¯ = (I−KH)P¯ ✓
The key insight: The Joseph form is mathematically equivalent to the simple form when K is optimal. But it's ALWAYS guaranteed to produce a valid covariance matrix (symmetric positive semi-definite), even with floating-point errors or suboptimal K. Production Kalman filters always use the Joseph form to prevent numerical divergence over thousands of time steps.
That's it. The entire Kalman filter is just two steps repeated forever: Predict (use physics to guess forward, uncertainty grows) then Update (incorporate a measurement, uncertainty shrinks). The predict-update cycle is the heartbeat of the filter.
Python import numpy as np def kalman_filter(x, P, F, H, Q, R, measurements): """Run a Kalman filter over a sequence of measurements.""" estimates = [] for z in measurements: # ── Predict ── x_pred = F @ x # state prediction P_pred = F @ P @ F.T + Q # covariance prediction # ── Update ── y = z - H @ x_pred # innovation (residual) S = H @ P_pred @ H.T + R # innovation covariance K = P_pred @ H.T @ np.linalg.inv(S) # Kalman gain x = x_pred + K @ y # updated state P = (np.eye(len(x)) - K @ H) @ P_pred # updated covariance estimates.append(x.copy()) return estimates
Let's wire up the 2D constant-velocity tracker from Chapter 2. State x = [px, vx, py, vy], and the camera measures position only:
Python import numpy as np dt = 0.1 # 10 Hz update # State transition: constant velocity F = np.array([[1, dt, 0, 0], # px' = px + vx*dt [0, 1, 0, 0], # vx' = vx [0, 0, 1, dt], # py' = py + vy*dt [0, 0, 0, 1]]) # vy' = vy # Observation: camera sees position only H = np.array([[1, 0, 0, 0], # z_px = px [0, 0, 1, 0]]) # z_py = py # Measurement noise: GPS ±3m → variance = 9 R = np.diag([9.0, 9.0]) # [2,2] # Process noise: unknown acceleration q=0.5 m/s² q = 0.5 G = np.array([[dt**2/2, 0], [dt, 0], [0, dt**2/2], [0, dt]]) Q = q * G @ G.T # [4,4]
x_pred = F @ x and P_pred = F @ P @ F.T + Q. The update is K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R), then x = x_pred + K @ (z - H @ x_pred). That's it. Five lines of NumPy, and you have a working Kalman filter.| Variable | Shape | Meaning |
|---|---|---|
| x | n×1 | State estimate (what we believe) |
| P | n×n | State covariance (our uncertainty) |
| F | n×n | State transition (physics model) |
| H | m×n | Observation matrix (state → measurement) |
| Q | n×n | Process noise (model uncertainty) |
| R | m×m | Measurement noise (sensor uncertainty) |
| K | n×m | Kalman gain (trust slider) |
| z | m×1 | Measurement vector (what we observe) |
python def kalman_1d(measurements, x0=0.0, P0=100.0, F=1.0, H=1.0, Q=0.1, R=1.0): x = x0 P = P0 estimates = [] for z in measurements: # ── Predict ── x_pred = F * x # state prediction P_pred = F * F * P + Q # covariance prediction # ── Update ── K = P_pred * H / (H * P_pred * H + R) # Kalman gain x = x_pred + K * (z - H * x_pred) # state update P = (1 - K * H) * P_pred # covariance update estimates.append(x) return estimates
Time to see the Kalman filter in action. Below, a ball bounces around the canvas. The gray circle is the true ball. The red dots are noisy camera measurements. The green circle is the Kalman filter's estimate, and the green ellipse shows the uncertainty.
Play with the sliders to see what happens when the sensor is noisy or the model is uncertain. Watch how the filter smoothly tracks the ball even when individual measurements are wildly off.
Real systems rarely have just one sensor. Your phone has GPS, an accelerometer, a gyroscope, a barometer, and Wi-Fi signal strength — all measuring different things with different noise characteristics. The Kalman filter handles this naturally: each sensor provides a measurement z with its own H and R, and the filter fuses them all.
Below, two sensors track a 1D target. Sensor A (like GPS) has low update rate but decent accuracy. Sensor B (like an accelerometer) updates fast but drifts. The Kalman filter combines both to outperform either alone.
Real-world solution (PX4/ArduPilot approach):
State vector: 15-24 states: [pos(3), vel(3), orientation(4 quaternion), accel_bias(3), gyro_bias(3), optionally baro_bias(1), wind(3)]. IMU biases are critical — they let the filter learn and subtract the drift online.
Rate handling: Predict at IMU rate (1kHz) using the IMU as a "control input" (the B·u term). This is called a "loosely-coupled" IMU/GPS fusion. The full update (expensive matrix inversion) only runs when GPS/baro arrive. Between updates, the high-rate predict keeps latency under 1ms.
GPS dropout: Q is NOT changed during dropout. Instead, the predict step naturally grows P. When GPS returns, the large P means K → 1, and the filter snaps back to GPS. The system gracefully degrades: uncertainty grows but doesn't diverge because the IMU biases were estimated when GPS was available.
Barometer: Add a barometer bias state that absorbs the absolute offset. The filter estimates the bias online. Only the RELATIVE changes from the barometer contribute to altitude — the absolute reference comes from GPS.
Key engineering insight: The real trick is NOT the KF equations (those are standard). It's choosing the state vector and the Q matrix. Adding IMU biases to the state is what makes the system work. Without bias estimation, IMU drift kills you in seconds.
The Kalman filter is optimal — but only under two assumptions: the system is linear and the noise is Gaussian. What if your system is nonlinear (a rocket re-entering the atmosphere, a car turning)? What if the noise distribution has heavy tails or multiple modes?
Several extensions address these limitations. Each trades off accuracy against computational cost. The original KF remains the gold standard when its assumptions hold — which is surprisingly often in practice.
The Kalman filter can fail silently — it keeps producing estimates even when those estimates are garbage. Here are the three most common failure modes practitioners hit:
| Problem | What happens | Fix |
|---|---|---|
| P initialized too small | Filter is overconfident from the start, ignores measurements, locks onto a wrong estimate | Initialize P large (e.g., 1000 · I). Let the filter learn. |
| Q too small | Filter trusts its model too much. Can't track maneuvering targets (turns, accelerations). Estimate lags behind. | Increase Q. Watch the innovation sequence — if it shows trends, Q is too small. |
| System is nonlinear | KF gives biased estimates because F·x and H·x are wrong models. Covariance P doesn't reflect true uncertainty. | Switch to EKF (linearize) or UKF (sigma points). |
| Filter | Handles | How | Cost |
|---|---|---|---|
| KF | Linear + Gaussian | Exact equations | Very low |
| EKF | Mildly nonlinear | Linearize with Jacobians | Low |
| UKF | More nonlinear | Sigma points (no Jacobians) | Medium |
| Particle | Any distribution | Monte Carlo sampling | High |
The Kalman filter IS a Bayes filter with a specific restriction: everything is Gaussian and linear. This restriction is what makes the math tractable — the predict and update integrals collapse to simple matrix equations. Remove the Gaussian/linear assumption and you need approximations: EKF (linearize), UKF (sigma points), or particle filters (sample). The predict→update structure is universal; only the representation of belief changes.
When you study the Bayes filter, ask: "Where does the Gaussian assumption enter, and what mathematical simplification does it buy us?"
You now understand optimal estimation. Every sensor reading you encounter is noisy. Now you know how to see through the noise.