The Complete Beginner's Path

Understand the Kalman
Filter

The algorithm that guided Apollo to the Moon, helps your phone know where you are, and is the backbone of every self-driving car's perception system.

Prerequisites: Basic algebra + Intuition for uncertainty. That's it.
11
Chapters
10+
Simulations
0
Assumed Knowledge

Chapter 0: Why Estimation?

Imagine you're tracking a ball rolling across a table. You have two sources of information: a physics model that predicts where the ball should be, and a camera that measures where it actually is. The problem? The physics model is approximate (friction, air resistance, imperfect initial conditions), and the camera is noisy (pixel jitter, lighting changes, motion blur).

Neither source alone is reliable. The physics model drifts over time. The camera jumps around randomly. But what if you could combine them in the smartest possible way? That's what the Kalman filter does: it fuses noisy predictions with noisy measurements to produce an estimate that's better than either alone.

The core idea: You have two uncertain guesses. The Kalman filter combines them, weighting each by how confident you are, to produce an optimal estimate. It's like asking two friends for directions — you trust the local more than the tourist.
Noisy Measurements vs Truth

The teal line is the true trajectory. The red dots are noisy camera measurements. Notice how scattered they are.

Check: Why can't we just use the raw camera measurements?

Chapter 1: Gaussians — The Language of Uncertainty

Everything the Kalman filter does is expressed in the language of Gaussian distributions (bell curves). A Gaussian is defined by just two numbers: the mean (your best guess) and the variance (how uncertain you are). A small variance means you're confident. A large variance means you're shrugging your shoulders.

Why Gaussians? Because they have a magical property: the combination of two Gaussians is another Gaussian. This means the Kalman filter can fuse predictions and measurements using simple arithmetic — no expensive computations, no approximations. It's exact.

p(x) = (1 / √(2πσ²)) · exp( −(x − μ)² / 2σ² )
Interactive Gaussian

Drag the sliders to change the mean (μ) and standard deviation (σ). Watch how the bell curve reshapes.

Mean μ0.0
Std dev σ1.0
Key insight: Small σ = tall, narrow peak = "I'm very sure." Large σ = short, wide hump = "I have no idea." The Kalman filter is fundamentally about shrinking σ by combining information.
Check: What do the two parameters of a Gaussian represent?

Chapter 2: The State — What Are We Tracking?

Before we can filter anything, we need to decide: what are we tracking? For a ball rolling on a table, we care about its position and its velocity. Together, these form the state vector x. The state is not what we observe — it's what we believe about the world.

We also keep track of our uncertainty about the state: the covariance matrix P. If P is small, we're confident. If P is large, we're unsure. The Kalman filter updates both x and P at every step.

State Vector x
x = [position, velocity]¹
Covariance P
P = [[σ²pos, σpv], [σpv, σ²vel]]
Together
A Gaussian belief: N(x, P)
1D example: If x = [5, 2], we believe the ball is at position 5 moving at speed 2. The covariance P tells us how confident we are in each of those numbers, and whether errors in position and velocity are correlated.

Concrete Shapes: 2D Constant-Velocity Tracker

To make this real, let's track a ball in 2D. We track its x-position, x-velocity, y-position, and y-velocity. Here are the exact matrix shapes — every single matrix in the Kalman filter, with dimensions:

MatrixShapeWhat it is
x[4, 1]State: [px, vx, py, vy]
F[4, 4]State transition (encodes physics)
P[4, 4]State covariance (16 entries, symmetric)
z[2, 1]Measurement: [measured px, measured py]
H[2, 4]Observation: picks out positions from state
R[2, 2]Measurement noise covariance
Q[4, 4]Process noise covariance
K[4, 2]Kalman gain (maps 2D innovation to 4D state correction)
Shape check trick: The Kalman gain K is always [n, m] — state dimension by measurement dimension. It takes a measurement-space error (what we got vs what we expected) and converts it into a state-space correction. If K is [4, 2], we observe 2 things and correct 4 state variables.
Check: What does the state vector represent?

Chapter 3: The Predict Step

At each timestep, before we look at any measurement, we use physics to predict where the ball should be. If the ball was at position 5 with velocity 2, then after one timestep it should be at position 7. This is the predict step.

The state transition matrix F encodes our motion model. For constant velocity: new_position = old_position + velocity × dt. But our model isn't perfect, so we add process noise Q to account for things we can't model (wind, bumps, etc.). Uncertainty always grows during prediction.

x̂¯ = F · x̂ + B · u       P¯ = F · P · F¹ + Q
Intuition: Prediction is like dead reckoning. A sailor says "I was here, going this fast, so I must be there now." It works short-term but drifts without corrections. The longer you predict without measuring, the more uncertain you become.
Predict step grows uncertainty (blue), Update step shrinks it (green)
SymbolMeaning
FState transition matrix (physics model)
BControl input matrix
uControl input (e.g., applied force)
QProcess noise covariance (model uncertainty)
Predicted covariance (uncertainty after predict)

Q and R in Practice: Where Do They Come From?

R comes from datasheets. Your GPS says ±3 meters accuracy? That means one standard deviation is 3m, so R = 9 m² (variance = σ²). A laser rangefinder with ±0.01m gives R = 0.0001. R is a property of the sensor hardware — you look it up.

Q is tuned by the engineer. Q represents how much you trust your physics model. Big Q says "my model is approximate — expect surprises." Small Q says "I trust the model — the world is predictable." There is no datasheet for Q because it captures everything your model doesn't account for: wind gusts, bumps, unmodeled friction.

The Q dilemma: For a constant-velocity model, Q is often built as q · G @ G.T, where G is the noise input matrix and q is a scalar you tune. For our 2D tracker with dt=1:

G = [dt²/2, 0; dt, 0; 0, dt²/2; 0, dt]

This encodes the assumption that unknown accelerations are the main source of model error. The scalar q controls the magnitude of those unknown accelerations. In practice: start with q = 1, run the filter, and adjust until the residuals look white (random, uncorrelated).
Check: What happens to uncertainty during the predict step?
🔨 Derivation Why Does Variance Add During Prediction? ✓ ATTEMPTED

The predicted covariance is P¯ = F P F¹ + Q. In the 1D case (scalar), this simplifies to P¯ = F² P + Q. With F=1 (random walk), we get P¯ = P + Q.

Your task: Starting from the definition of variance and the predict equation x¯ = Fx + w (where w ~ N(0,Q) is independent of x), derive that Var(x¯) = F² Var(x) + Q.

Var(aX + b) = a² Var(X). Scaling a random variable by 'a' scales its variance by a². Here F plays the role of 'a'.
Var(X + Y) = Var(X) + Var(Y) when X and Y are independent. The process noise w is independent of the current state x — so their variances simply add.
x¯ = Fx + w. Since Fx and w are independent: Var(x¯) = Var(Fx) + Var(w) = F² Var(x) + Q. Now generalize to matrices: Var(Fx) = F Var(x) F¹.

Full derivation:

1. Start with the predict equation: x¯ = Fx + w, where w ~ N(0, Q) is independent of x.

2. Take the covariance: Cov(x¯) = Cov(Fx + w)

3. Since Fx and w are independent: Cov(x¯) = Cov(Fx) + Cov(w)

4. For a linear transform: Cov(Fx) = F Cov(x) F¹ = F P F¹

5. And Cov(w) = Q by definition.

6. Therefore: P¯ = F P F¹ + Q ✓

The key insight: Uncertainty MUST grow during prediction because we're adding a new source of randomness (process noise) that is independent of everything we already know. Even a perfect model (F exact) still accumulates uncertainty from unmodeled disturbances.

Checkpoint — Before you move on
In your own words: why must uncertainty grow during the predict step, and what would go wrong if we skipped adding Q?
✓ Gate cleared
Model Answer

Prediction uses an imperfect physics model. Process noise Q accounts for everything the model doesn't capture (wind, friction changes, unmodeled forces). Without Q, the filter becomes overconfident: it trusts its prediction more and more, eventually ignoring measurements entirely. The covariance P shrinks to zero, K goes to zero, and the filter locks onto its (wrong) prediction. This is the classic "filter divergence" failure mode. Adding Q is an admission of humility: "my model isn't perfect, so keep me open to corrections from measurements."

Chapter 4: The Measurement

Now the camera takes a picture and says "I see the ball at position 7.3." But we know the camera is noisy — it might be off by a bit. The measurement z has its own uncertainty, captured by the measurement noise covariance R.

The observation matrix H maps our state to what the sensor actually sees. If we're tracking [position, velocity] but the camera only measures position, then H = [1, 0] — it picks out just the position part.

z = H · x + v      where v ~ N(0, R)
Example: Our state is [pos, vel] = [7, 2]. The camera measures position only, so H = [1, 0]. The true measurement would be 7, but the camera reports z = 7.3 because of noise with variance R.
Measurement Noise

The teal line is the true position. Red dots are noisy measurements. Adjust R to see the effect.

Noise R1.0
Check: What does the matrix H do?

Chapter 5: The Kalman Gain

Here's the million-dollar question: how much should we trust the measurement vs our prediction? The Kalman gain K answers this. It's a number between 0 and 1 (roughly speaking) that acts as a trust slider.

When your sensor is very accurate (R is small), K is large — you trust the measurement. When your prediction is very confident (P is small), K is small — you trust the prediction. The formula automatically balances these:

K = P¯ H¹ (H P¯ H¹ + R)¯¹
K is a TRUST SLIDER. K near 1 = "I believe the sensor." K near 0 = "I believe my prediction." The beauty is that K is computed automatically from the uncertainties — you never have to tune it by hand.
Interactive Kalman Gain

Adjust measurement noise R and prediction uncertainty P to see how the Kalman gain K responds. (1D case: K = P / (P + R))

Pred. uncert. P2.0
Meas. noise R2.0
K = P / (P + R) = 0.50
Check: If your sensor is very noisy (large R), what happens to K?
🔨 Derivation Derive the Kalman Gain by Minimizing Posterior Variance ✓ ATTEMPTED

In the 1D case, the update equation is: x̂ = x̂¯ + K(z − x̂¯), and the posterior variance is P = (1 − K)² P¯ + K² R.

Your task: Find the value of K that minimizes P. Take dP/dK = 0 and solve for K. You should arrive at K = P¯ / (P¯ + R).

P = (1-K)² P¯ + K² R. Expand: P = P¯ − 2K P¯ + K² P¯ + K² R = P¯ − 2K P¯ + K²(P¯ + R).
dP/dK = −2P¯ + 2K(P¯ + R). Set this equal to zero.
0 = −2P¯ + 2K(P¯ + R) ⇒ K(P¯ + R) = P¯ ⇒ K = P¯ / (P¯ + R).

Full derivation:

1. The update is a weighted average: x̂ = (1−K) x̂¯ + K z

2. Since prediction error and measurement error are independent:
P = Var(x̂ − xtrue) = (1−K)² P¯ + K² R

3. Expand: P = P¯ − 2KP¯ + K²P¯ + K²R = P¯ − 2KP¯ + K²(P¯ + R)

4. Differentiate: dP/dK = −2P¯ + 2K(P¯ + R)

5. Set to zero: 2K(P¯ + R) = 2P¯

6. Solve: K = P¯ / (P¯ + R)

7. Plug back in: P = (1 − P¯/(P¯+R))² P¯ + (P¯/(P¯+R))² R = P¯ R / (P¯ + R)

The key insight: The Kalman gain isn't an arbitrary formula — it's the UNIQUE value that minimizes the posterior variance. Any other weighting would give you a less certain estimate. This is why the KF is "optimal": no other linear filter can do better.

⚔ Adversarial: What happens when P¯ → ∞?
Your filter has been predicting without measurements for 1000 steps. P¯ has grown enormous. Finally, a measurement arrives with R = 5.
🔗 Pattern Recognition
Kalman Gain = Attention Weight
Kalman Filter
K = P¯ / (P¯ + R) — weight measurement by its relative precision
Transformer Attention
αij = softmax(qi · kj / √d) — weight value by relevance → Transformer lesson

Both mechanisms solve the same fundamental problem: optimal weighting under uncertainty. The Kalman gain weights information by precision (inverse variance). Attention weights information by relevance (dot-product similarity). In both cases, you have multiple sources of information and need to decide how much to trust each one. The math differs, but the principle is identical: weight by quality of information.

When you study attention, ask: "What is the analog of R (noise) in the attention mechanism? What makes a key-value pair 'noisy' vs 'reliable'?"

Chapter 6: The Update Step

Now we fuse the prediction and measurement. The innovation (or residual) is the difference between what we measured and what we expected: y = z − H x̂¯. The Kalman gain K scales this innovation and adds it to our prediction:

x̂ = x̂¯ + K(z − H x̂¯)       P = (I − K H) P¯

The updated covariance P is always smaller than the predicted covariance P¯. This is the magic: every measurement, no matter how noisy, makes our estimate more certain. Information only accumulates; it never destroys.

Key insight: The posterior Gaussian is ALWAYS narrower than either the prediction or the measurement alone. Two uncertain estimates combined yield a more certain result. This is provably optimal for linear Gaussian systems.
Gaussian Fusion

The blue Gaussian is the prediction. The red is the measurement. The green is the fused result — always the narrowest!

Predict μ-1.0
Predict σ1.5
Measure μ1.0
Measure σ1.0
Check: After the update step, how does the uncertainty compare?
🔨 Derivation The Joseph Form — A Numerically Stable P Update ✓ ATTEMPTED

The simple form P = (I − KH) P¯ works when K is computed exactly. But with floating-point arithmetic, rounding errors can make P non-symmetric or even negative definite (variance < 0 — nonsense!).

The Joseph form is: P = (I − KH) P¯ (I − KH)¹ + K R K¹

Your task: Show that if K = P¯ H¹ (H P¯ H¹ + R)¯¹ (the optimal gain), the Joseph form reduces to P = (I − KH) P¯. Why is the Joseph form preferred in practice?

Expand (I−KH) P¯ (I−KH)¹ = P¯ − KHP¯ − P¯H¹K¹ + KHP¯H¹K¹. The cross terms are where the magic happens.
With K = P¯H¹S¯¹ where S = HP¯H¹ + R, note that KS = P¯H¹. So KHP¯H¹K¹ = K(S−R)K¹ = KSK¹ − KRK¹.
The Joseph form is guaranteed to produce a symmetric positive semi-definite P (it's a sum of two PSD matrices: A P¯ A¹ + B R B¹). The simple form (I−KH)P¯ can lose symmetry due to rounding.

Full derivation:

1. Joseph form: P = (I−KH) P¯ (I−KH)¹ + K R K¹

2. Expand first term: P¯ − KHP¯ − P¯H¹K¹ + KHP¯H¹K¹

3. With optimal K: KHP¯ = K(HP¯) and P¯H¹K¹ = (KHP¯)¹ (symmetric), so −KHP¯ − P¯H¹K¹ = −2KHP¯

4. Also KHP¯H¹K¹ = K(S−R)K¹ (since S = HP¯H¹ + R)

5. Total: P¯ − 2KHP¯ + KSK¹ − KRK¹ + KRK¹ = P¯ − 2KHP¯ + KSK¹

6. Since K = P¯H¹S¯¹: KS = P¯H¹, so KSK¹ = P¯H¹K¹ = KHP¯

7. Final: P¯ − 2KHP¯ + KHP¯ = P¯ − KHP¯ = (I−KH)P¯ ✓

The key insight: The Joseph form is mathematically equivalent to the simple form when K is optimal. But it's ALWAYS guaranteed to produce a valid covariance matrix (symmetric positive semi-definite), even with floating-point errors or suboptimal K. Production Kalman filters always use the Joseph form to prevent numerical divergence over thousands of time steps.

Chapter 7: The Full Algorithm

That's it. The entire Kalman filter is just two steps repeated forever: Predict (use physics to guess forward, uncertainty grows) then Update (incorporate a measurement, uncertainty shrinks). The predict-update cycle is the heartbeat of the filter.

Initialize
Set x̂ (initial guess) and P (initial uncertainty)
Predict
x̂¯ = F x̂ + B u    P¯ = F P F¹ + Q
Kalman Gain
K = P¯ H¹ (H P¯ H¹ + R)¯¹
Update
x̂ = x̂¯ + K(z − H x̂¯)    P = (I − KH) P¯
↓ repeat

The Complete KF in Python

Python
import numpy as np

def kalman_filter(x, P, F, H, Q, R, measurements):
    """Run a Kalman filter over a sequence of measurements."""
    estimates = []
    for z in measurements:
        # ── Predict ──
        x_pred = F @ x                    # state prediction
        P_pred = F @ P @ F.T + Q          # covariance prediction

        # ── Update ──
        y = z - H @ x_pred                # innovation (residual)
        S = H @ P_pred @ H.T + R          # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
        x = x_pred + K @ y                # updated state
        P = (np.eye(len(x)) - K @ H) @ P_pred  # updated covariance

        estimates.append(x.copy())
    return estimates

Plugging In Real Numbers

Let's wire up the 2D constant-velocity tracker from Chapter 2. State x = [px, vx, py, vy], and the camera measures position only:

Python
import numpy as np

dt = 0.1   # 10 Hz update

# State transition: constant velocity
F = np.array([[1, dt, 0,  0],   # px' = px + vx*dt
              [0,  1, 0,  0],   # vx' = vx
              [0,  0, 1, dt],   # py' = py + vy*dt
              [0,  0, 0,  1]])  # vy' = vy

# Observation: camera sees position only
H = np.array([[1, 0, 0, 0],    # z_px = px
              [0, 0, 1, 0]])   # z_py = py

# Measurement noise: GPS ±3m → variance = 9
R = np.diag([9.0, 9.0])    # [2,2]

# Process noise: unknown acceleration q=0.5 m/s²
q = 0.5
G = np.array([[dt**2/2, 0],
              [dt,      0],
              [0, dt**2/2],
              [0,      dt]])
Q = q * G @ G.T              # [4,4]
Every line maps to a matrix equation. The predict step is literally x_pred = F @ x and P_pred = F @ P @ F.T + Q. The update is K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R), then x = x_pred + K @ (z - H @ x_pred). That's it. Five lines of NumPy, and you have a working Kalman filter.

Variable Reference

VariableShapeMeaning
xn×1State estimate (what we believe)
Pn×nState covariance (our uncertainty)
Fn×nState transition (physics model)
Hm×nObservation matrix (state → measurement)
Qn×nProcess noise (model uncertainty)
Rm×mMeasurement noise (sensor uncertainty)
Kn×mKalman gain (trust slider)
zm×1Measurement vector (what we observe)
Check: In what order do the two main steps execute?
💻 Build It Implement the Full Predict-Update Cycle ✓ ATTEMPTED
You've seen the complete algorithm. Now implement it from scratch. Given the matrices F, H, Q, R, and initial state x and P, run the filter over a list of measurements and return the estimated positions.
signature def kalman_1d(measurements, x0=0.0, P0=100.0, F=1.0, H=1.0, Q=0.1, R=1.0): """ 1D Kalman filter. Args: measurements: list of scalar observations x0: initial state estimate P0: initial uncertainty F: state transition (scalar, constant velocity=1) H: observation model (scalar, direct observation=1) Q: process noise variance R: measurement noise variance Returns: list of filtered state estimates (same length as measurements) """
Test case
measurements = [1.2, 0.8, 1.1, 0.9, 1.0] # noisy obs of true value 1.0 result = kalman_1d(measurements, x0=0.0, P0=100.0, Q=0.01, R=1.0) # result[-1] should be approximately 1.0 (within 0.1) # result[0] should be close to 1.2 (first measurement dominates when P0 is large)
In 1D with F=1: x_pred = F * x (just copy the state). P_pred = F*F*P + Q = P + Q. Then the gain: K = P_pred / (P_pred + R). Update: x = x_pred + K * (z - H * x_pred), P = (1 - K*H) * P_pred.
python
def kalman_1d(measurements, x0=0.0, P0=100.0, F=1.0, H=1.0, Q=0.1, R=1.0):
    x = x0
    P = P0
    estimates = []
    for z in measurements:
        # ── Predict ──
        x_pred = F * x           # state prediction
        P_pred = F * F * P + Q   # covariance prediction

        # ── Update ──
        K = P_pred * H / (H * P_pred * H + R)  # Kalman gain
        x = x_pred + K * (z - H * x_pred)      # state update
        P = (1 - K * H) * P_pred               # covariance update

        estimates.append(x)
    return estimates
Bonus challenge: Extend to 2D (track position AND velocity). You'll need F = [[1, dt], [0, 1]] and H = [[1, 0]]. Use NumPy matrices instead of scalars.

Chapter 8: Track a Moving Object

Time to see the Kalman filter in action. Below, a ball bounces around the canvas. The gray circle is the true ball. The red dots are noisy camera measurements. The green circle is the Kalman filter's estimate, and the green ellipse shows the uncertainty.

Play with the sliders to see what happens when the sensor is noisy or the model is uncertain. Watch how the filter smoothly tracks the ball even when individual measurements are wildly off.

Live Kalman Tracker
Process noise Q0.50
Meas. noise R15.0
Step: 0
Experiment: Crank R way up (noisy sensor). Notice the green estimate barely flinches — it trusts the model. Now crank R way down (perfect sensor). The estimate snaps to each measurement. That's the Kalman gain at work.
Check: When you increase measurement noise R, the Kalman filter...
💥 Break-It Lab What Dies When You Remove Components? ✓ ATTEMPTED
A 1D Kalman filter tracks a sinusoidal signal (teal). Toggle off components to see exactly how each one prevents a specific failure mode. The green line is the filter estimate.
Set Q = 0 (no process noise) ACTIVE
Failure mode: Without Q, the filter becomes overconfident in its prediction. P shrinks to zero, K → 0, and the filter stops listening to measurements entirely. After any model mismatch (the target maneuvers), the estimate drifts away forever. This is "filter divergence" — the most common failure in production KFs.
Set R = 0 (infinite sensor trust) ACTIVE
Failure mode: With R=0, K=1 always. The filter ignores its model and copies the raw measurement. You get zero smoothing — the output IS the noise. The filter degenerates into "just use the sensor reading." All the benefit of prediction is lost.
Remove prediction step ACTIVE
Failure mode: Without prediction, the filter has no model of how the system evolves. It can only react to measurements, never anticipate. The estimate lags behind the true signal because there's no "I expect it to be HERE next" to bridge between measurements. Latency increases dramatically.
Fix K = 0.5 (constant gain) ACTIVE
Failure mode: A fixed K can't adapt. Early on (when P is large and the filter is uncertain), K should be large to learn quickly. Later (when P is small and the filter is confident), K should shrink. A constant K=0.5 oversmooths during the transient and undersmooths at steady state. The adaptive gain is what makes the KF optimal.

Chapter 9: Sensor Fusion

Real systems rarely have just one sensor. Your phone has GPS, an accelerometer, a gyroscope, a barometer, and Wi-Fi signal strength — all measuring different things with different noise characteristics. The Kalman filter handles this naturally: each sensor provides a measurement z with its own H and R, and the filter fuses them all.

Below, two sensors track a 1D target. Sensor A (like GPS) has low update rate but decent accuracy. Sensor B (like an accelerometer) updates fast but drifts. The Kalman filter combines both to outperform either alone.

Two-Sensor Fusion
Sensor A noise2.0
Sensor B noise5.0
Why not just average? Averaging treats both sensors equally. The Kalman filter weights them by their precision: a sensor with half the noise gets twice the influence. It's the optimal weighting — provably, no algorithm can do better.
Check: When fusing two sensors, the Kalman filter weights each sensor by...
🏗 Design Challenge You're the Architect: Drone Sensor Fusion ✓ ATTEMPTED
You're building the state estimation system for a delivery drone. It must know its 3D position and velocity at all times. You have three sensors with different characteristics, and must decide how to fuse them.
GPS
10 Hz, ±3m horizontal, ±5m vertical, drops out near buildings
IMU (accel + gyro)
1000 Hz, low noise short-term, drifts 0.1 m/s² over 10s
Barometric altimeter
50 Hz, ±0.5m relative, ±10m absolute (weather-dependent)
Compute budget
ARM Cortex-M7 @ 400MHz, 512KB RAM
Latency requirement
State estimate within 5ms of IMU sample
1. What's your state vector? (Position only? Pos + vel? Pos + vel + accel? Pos + vel + IMU biases?) Bigger state = better but more compute.
2. How do you handle the 100x rate difference between IMU (1kHz) and GPS (10Hz)? Run the full KF at 1kHz? Predict at 1kHz but update only when GPS arrives?
3. When GPS drops out (urban canyon), the IMU drifts. How do you set Q to handle both "GPS available" and "GPS denied" modes?
4. The barometer gives good relative altitude but bad absolute. How do you use it without corrupting your GPS-based altitude?

Real-world solution (PX4/ArduPilot approach):

State vector: 15-24 states: [pos(3), vel(3), orientation(4 quaternion), accel_bias(3), gyro_bias(3), optionally baro_bias(1), wind(3)]. IMU biases are critical — they let the filter learn and subtract the drift online.

Rate handling: Predict at IMU rate (1kHz) using the IMU as a "control input" (the B·u term). This is called a "loosely-coupled" IMU/GPS fusion. The full update (expensive matrix inversion) only runs when GPS/baro arrive. Between updates, the high-rate predict keeps latency under 1ms.

GPS dropout: Q is NOT changed during dropout. Instead, the predict step naturally grows P. When GPS returns, the large P means K → 1, and the filter snaps back to GPS. The system gracefully degrades: uncertainty grows but doesn't diverge because the IMU biases were estimated when GPS was available.

Barometer: Add a barometer bias state that absorbs the absolute offset. The filter estimates the bias online. Only the RELATIVE changes from the barometer contribute to altitude — the absolute reference comes from GPS.

Key engineering insight: The real trick is NOT the KF equations (those are standard). It's choosing the state vector and the Q matrix. Adding IMU biases to the state is what makes the system work. Without bias estimation, IMU drift kills you in seconds.

Chapter 10: Limitations & Beyond

The Kalman filter is optimal — but only under two assumptions: the system is linear and the noise is Gaussian. What if your system is nonlinear (a rocket re-entering the atmosphere, a car turning)? What if the noise distribution has heavy tails or multiple modes?

Several extensions address these limitations. Each trades off accuracy against computational cost. The original KF remains the gold standard when its assumptions hold — which is surprisingly often in practice.

What Breaks (and How to Tell)

The Kalman filter can fail silently — it keeps producing estimates even when those estimates are garbage. Here are the three most common failure modes practitioners hit:

ProblemWhat happensFix
P initialized too smallFilter is overconfident from the start, ignores measurements, locks onto a wrong estimateInitialize P large (e.g., 1000 · I). Let the filter learn.
Q too smallFilter trusts its model too much. Can't track maneuvering targets (turns, accelerations). Estimate lags behind.Increase Q. Watch the innovation sequence — if it shows trends, Q is too small.
System is nonlinearKF gives biased estimates because F·x and H·x are wrong models. Covariance P doesn't reflect true uncertainty.Switch to EKF (linearize) or UKF (sigma points).
The innovation test: The innovation y = z − H·x̂¯ should be zero-mean white noise if the filter is healthy. If the innovations are biased (consistently positive or negative) or correlated (trending), something is wrong — usually Q is too small or the model is wrong.
FilterHandlesHowCost
KFLinear + GaussianExact equationsVery low
EKFMildly nonlinearLinearize with JacobiansLow
UKFMore nonlinearSigma points (no Jacobians)Medium
ParticleAny distributionMonte Carlo samplingHigh
Where from here? The Kalman filter is a gateway to a huge family of estimation techniques. If you understand predict→update and the idea of fusing uncertain information, you have the conceptual foundation for all of them.
In the wild: Apollo navigation (1960s), GPS receivers, drone autopilots, self-driving cars (sensor fusion), financial time series, weather prediction, robot SLAM, phone orientation tracking — the Kalman filter is everywhere.
⚔ Adversarial: Your Kalman filter diverges after 100 steps
You're tracking a car. Q = I (identity) and R = I (identity). The filter works for 100 steps, then the estimate starts drifting away from reality. The innovations (z − Hx) are consistently positive and growing. Q and R seem reasonable. What's the most likely cause?
🔗 Pattern Recognition
Kalman Filter = Gaussian Special Case of Bayes Filter
Kalman Filter
Belief is N(μ, Σ). Predict and update use matrix algebra. Exact, O(n³) in state dimension.
Bayes Filter
Belief is ANY distribution. Predict = Chapman-Kolmogorov integral. Update = Bayes rule. General but intractable for most distributions. → Bayes Filter lesson

The Kalman filter IS a Bayes filter with a specific restriction: everything is Gaussian and linear. This restriction is what makes the math tractable — the predict and update integrals collapse to simple matrix equations. Remove the Gaussian/linear assumption and you need approximations: EKF (linearize), UKF (sigma points), or particle filters (sample). The predict→update structure is universal; only the representation of belief changes.

When you study the Bayes filter, ask: "Where does the Gaussian assumption enter, and what mathematical simplification does it buy us?"

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem."
— John Tukey

You now understand optimal estimation. Every sensor reading you encounter is noisy. Now you know how to see through the noise.