The Kalman filter assumes straight lines. Real systems curve, twist, and wrap around. The EKF handles all of that by approximating curves with tangent lines at each step.
The standard Kalman filter is beautiful — but it makes a critical assumption: the system must be linear. That means the next state is a matrix times the current state, and measurements are a matrix times the state. In math: x' = Fx and z = Hx.
Real systems almost never obey this. A radar station measures range and bearing to an aircraft — that's a square root and an arctangent, not a matrix multiply. A robot turns by an angle and drives forward — that involves sines and cosines. A satellite re-entering the atmosphere experiences exponential drag. All nonlinear.
The blue Gaussian is the input. Watch how a linear function preserves the bell shape, but a nonlinear function distorts it. Toggle between them.
Examples of nonlinear systems in the wild:
| System | Nonlinearity |
|---|---|
| Radar tracking | Range = sqrt(x² + y²), bearing = atan2(y, x) |
| Robot navigation | Motion involves sin(θ), cos(θ) |
| Spacecraft re-entry | Drag is exponential in altitude |
| Pendulum | Restoring force is sin(θ), not θ |
The key idea of the EKF is simple: if the function is curved, approximate it with a straight line at the current operating point. This is exactly what a first-order Taylor expansion does. Near any point, a smooth curve looks like its tangent line.
For a 1D function f(x), the linearization around point a is: f(x) ≈ f(a) + f'(a)(x - a). The derivative f'(a) is the slope of the tangent line. For multi-variable functions, we use the Jacobian matrix instead of a simple derivative.
Drag the slider to move the linearization point along the curve. The orange line is the tangent. Notice how it matches well nearby but diverges further away.
x³ is convex for x > 0 and concave for x < 0. Around x = 1, points slightly above 1 get cubed to much larger values than the tangent predicts (the curve accelerates away). Points below 1 get cubed to slightly smaller values. The result is a right-skewed distribution — it has a long tail to the right that a Gaussian cannot capture. The EKF's tangent line at slope 3 predicts a symmetric Gaussian with σ = 6, but the true output is asymmetric with more mass at high values. This is exactly the "banana-shaped" distribution problem the UKF was designed to handle.
When your function takes a vector in and produces a vector out, a single derivative number isn't enough. You need the Jacobian matrix: a matrix where each entry is a partial derivative. Row i, column j contains ∂fi/∂xj — how much the i-th output changes when you wiggle the j-th input.
The Jacobian is the multi-dimensional version of "the slope." It tells you the best linear approximation of a vector function near a point. In the EKF, we compute two Jacobians at every step:
For f(x) = [r · cos(θ), r · sin(θ)] (polar to Cartesian), explore how the Jacobian changes with θ.
The polar-to-Cartesian conversion is f(r, θ) = [r·cos(θ), r·sin(θ)]. This is a function from R² → R².
Your task: Derive the full 2×2 Jacobian matrix. Then evaluate it at r = 2, θ = π/3 and interpret what each column means geometrically.
Full derivation:
Given f(r, θ) = [r cosθ, r sinθ]:
J = [[∂x/∂r, ∂x/∂θ], [∂y/∂r, ∂y/∂θ]] = [[cosθ, −r sinθ], [sinθ, r cosθ]]
At r = 2, θ = π/3: cos(π/3) = 0.5, sin(π/3) = 0.866
J = [[0.5, −1.732], [0.866, 1.0]]
Column 1: [0.5, 0.866] — pointing in the radial direction (outward at 60°). Unit length because it's ∂f/∂r.
Column 2: [−1.732, 1.0] — tangent to the circle, scaled by r=2. Its magnitude is r=2. Points perpendicular to the radial direction (counterclockwise).
The key insight: The determinant |J| = r·cos²θ + r·sin²θ = r. This is why the area element in polar coordinates is r·dr·dθ — the Jacobian determinant IS the r factor. Linearization and coordinate transforms are deeply connected.
In the standard KF, the predict step is: x̄ = Fx. The state transition is a matrix multiply. In the EKF, we replace this with the actual nonlinear function f(x). But for the covariance, we still need a matrix — so we use the Jacobian FJ evaluated at the current estimate.
| Kalman Filter | Extended Kalman Filter | |
|---|---|---|
| State | x̄ = F · x | x̄ = f(x) |
| Covariance | P̄ = F P FT + Q | P̄ = FJ P FJT + Q |
| F meaning | Constant matrix | FJ = ∂f/∂x at current x |
Python def ekf_predict(x, P, f, F_jacobian, Q): """EKF predict step.""" x_pred = f(x) # nonlinear state transition F_j = F_jacobian(x) # Jacobian at current estimate P_pred = F_j @ P @ F_j.T + Q # covariance propagation return x_pred, P_pred
Robot at (2, 3) heading θ=0.5 rad, drives forward at v=1 m/s, dt=0.1s:
Python import numpy as np # Current state x = np.array([2.0, 3.0, 0.5]) # [x, y, θ] v, omega, dt = 1.0, 0.0, 0.1 # Nonlinear state transition x_pred = np.array([ x[0] + v * np.cos(x[2]) * dt, # 2.0 + 1.0 * 0.878 * 0.1 = 2.088 x[1] + v * np.sin(x[2]) * dt, # 3.0 + 1.0 * 0.479 * 0.1 = 3.048 x[2] + omega * dt # 0.5 + 0.0 = 0.5 ]) # Jacobian at current state (NOT at prediction) F_J = np.array([ [1, 0, -v * np.sin(x[2]) * dt], # [1, 0, -0.048] [0, 1, v * np.cos(x[2]) * dt], # [0, 1, 0.088] [0, 0, 1] # [0, 0, 1 ] ]) # Covariance prediction P_pred = F_J @ P @ F_J.T + Q
The EKF predict is: x̄ = f(x) but P̄ = FJ P FJT + Q. Why does the covariance need the Jacobian while the state uses the actual nonlinear function? This asymmetry is the heart of the EKF.
Your task: Starting from the definition of covariance Cov(y) = E[(y − μy)(y − μy)T], show why a linear approximation is needed for uncertainty propagation but not for mean propagation.
Full derivation:
1. State (mean): The best estimate of the next state is simply f(x̂). No approximation needed — we evaluate the function at one point. This is exact for the mean (to first order: E[f(x)] ≈ f(E[x]) when f is smooth).
2. Covariance (spread): We need P̄ = E[(f(x) − f(x̂))(f(x) − f(x̂))T]. This requires knowing what f does to every point in the distribution — not just the mean.
3. The linearization trick: Taylor-expand f(x) around x̂: f(x) ≈ f(x̂) + FJ(x − x̂). Now f(x) − f(x̂) ≈ FJ(x − x̂).
4. Plug in: P̄ = E[FJ(x − x̂)(x − x̂)TFJT] = FJ · E[(x − x̂)(x − x̂)T] · FJT = FJ P FJT.
The key insight: The mean is a point — you can push a point through any function. The covariance describes a cloud of points — pushing a cloud through a nonlinear function is intractable, so we approximate the function as linear locally. This is why the EKF is only accurate when the cloud (P) is small relative to the curvature of f.
Both share the predict-update cycle: predict forward, incorporate measurement, repeat. The KF does this with fixed matrices; the EKF does it with matrices that change at every timestep (because the Jacobian depends on where you are). The structure is identical — only the "F" and "H" are now functions of state.
The diffusion denoising process also iteratively refines: each step takes the current estimate, predicts the noise, and updates. Where's the "linearization" equivalent there?
Same story for the update step. In the KF, the expected measurement is z̄ = Hx. In the EKF, we use the nonlinear measurement function h(x) to compute what we expect to see. But the Kalman gain uses the Jacobian HJ to propagate uncertainty.
| Kalman Filter | Extended Kalman Filter | |
|---|---|---|
| Innovation | y = z − H x | y = z − h(x) |
| Innov. cov. | S = H P HT + R | S = HJ P HJT + R |
| Gain | K = P HT S−1 | K = P HJT S−1 |
| H meaning | Constant matrix | HJ = ∂h/∂x at predicted x |
Python def ekf_update(x_pred, P_pred, z, h, H_jacobian, R): """EKF update step.""" H_j = H_jacobian(x_pred) # Jacobian at predicted state y = z - h(x_pred) # innovation S = H_j @ P_pred @ H_j.T + R # innovation covariance K = P_pred @ H_j.T @ np.linalg.inv(S) # Kalman gain x = x_pred + K @ y # updated state P = (np.eye(len(x)) - K @ H_j) @ P_pred return x, P
Predict: x̄ = f(x̂) [TRUE nonlinear] | P̄ = FJ P FJT + Q [JACOBIAN]
Update: y = z − h(x̄) [TRUE nonlinear] | S = HJ P̄ HJT + R [JACOBIAN] | K = P̄ HJT S−1 [JACOBIAN] | x̂ = x̄ + Ky [no approximation — linear combination] | P = (I − KHJ)P̄ [JACOBIAN]
The pattern: the mean (state estimate) always uses true functions because we're evaluating at a single point. The covariance (uncertainty spread) always uses Jacobians because we need to describe how an entire distribution transforms, which requires linearity. The Kalman gain K bridges the two: it's computed from the linearized covariance but applied to the nonlinear innovation.
Let's work through a concrete example: a robot at position (x, y) with heading θ observes a landmark at (lx, ly). The measurement is range and bearing:
To get HJ, we take partial derivatives with respect to x, y, and θ. Let dx = lx − x, dy = ly − y, and q = dx² + dy²:
Move the robot (drag the slider) and see how the Jacobian entries change. The landmark is fixed at the orange star.
For a differential-drive robot with state [x, y, θ] and control [v, ω]:
Let's make this completely concrete. A robot drives at v = 1 m/s, heading θ = 45° (π/4 rad), with dt = 0.1s. What does the Jacobian look like?
Python import numpy as np v, theta, dt = 1.0, np.pi/4, 0.1 F_J = np.array([ [1, 0, -v * np.sin(theta) * dt], # -1.0 * 0.707 * 0.1 = -0.0707 [0, 1, v * np.cos(theta) * dt], # 1.0 * 0.707 * 0.1 = 0.0707 [0, 0, 1] ]) # F_J = [[1.000, 0.000, -0.071], # [0.000, 1.000, 0.071], # [0.000, 0.000, 1.000]]
The bearing measurement is: bearing = atan2(ly − y, lx − x) − θ. You need ∂bearing/∂x, ∂bearing/∂y, and ∂bearing/∂θ.
Your task: Using the fact that d/du[atan2(v, u)] = −v / (u² + v²) and d/dv[atan2(v, u)] = u / (u² + v²), derive the second row of HJ. Verify the result matches the formula given above.
Full derivation:
Let dx = lx − x, dy = ly − y, q = dx² + dy².
bearing = atan2(dy, dx) − θ
∂bearing/∂x: dx depends on x (ddx/dx = −1). Using the atan2 derivative rule: ∂/∂x[atan2(dy, dx)] = (−dy/q) · (−1) = dy/q
∂bearing/∂y: dy depends on y (ddy/dy = −1). ∂/∂y[atan2(dy, dx)] = (dx/q) · (−1) = −dx/q
∂bearing/∂θ = −1 (direct)
Second row of HJ: [dy/q, −dx/q, −1] ✓
The key insight: The 1/q factor means bearing sensitivity decreases with distance squared. When the landmark is far away (large q), small position changes barely affect the bearing. This is why nearby landmarks provide more information — their Jacobian entries are larger, making the Kalman gain higher.
python import numpy as np def ekf_predict_robot(state, P, v, omega, dt, Q): x, y, theta = state # 1. Nonlinear state transition state_pred = np.array([ x + v * np.cos(theta) * dt, y + v * np.sin(theta) * dt, theta + omega * dt ]) # 2. Jacobian of motion model w.r.t. state F_J = np.array([ [1, 0, -v * np.sin(theta) * dt], [0, 1, v * np.cos(theta) * dt], [0, 0, 1] ]) # 3. Covariance propagation P_pred = F_J @ P @ F_J.T + Q return state_pred, P_pred
This is the showcase. A robot moves in 2D space with 3 known landmarks. It can only measure the bearing (angle) to each landmark — not the distance. The EKF estimates the robot's position and heading from these bearing-only measurements.
Use the buttons to drive the robot. Watch the green uncertainty ellipse shrink as the filter incorporates measurements. The blue lines show bearing measurements to landmarks.
The EKF is an approximation. It assumes the nonlinear function is well-described by its tangent line within the region of uncertainty. When this assumption breaks, the filter can diverge: the estimated covariance shrinks (the filter thinks it's certain), but the actual error grows. The filter becomes overconfident and wrong.
Increase the nonlinearity to see how the tangent-line approximation degrades. The teal region is the true transformed distribution (sampled). The orange ellipse is the EKF's Gaussian approximation.
This is precisely why the Unscented Kalman Filter (UKF) was invented. Instead of linearizing, the UKF passes carefully chosen sample points through the true nonlinear function, then reconstructs a Gaussian from the results. No Jacobians required, and it captures more of the curvature.
| Aspect | EKF | UKF |
|---|---|---|
| Approach | Linearize with Jacobian | Sample with sigma points |
| Accuracy | 1st order (tangent line) | 2nd order (captures curvature) |
| Jacobians needed? | Yes | No |
| When it fails | High curvature, large uncertainty | Extreme nonlinearity, multimodal |
1. Bias estimation is critical. IMU accelerometers drift ~0.1 mg and gyros drift ~10 °/hr. Without bias states, after 30s of GPS dropout: position error from uncompensated accel bias = 0.5 · 0.001 · 30² = 0.45m. With bias estimation, the EKF learns the bias from GPS updates and compensates during dropout. This is why the state vector is 15, not 9.
2. Linearization is fine at 200 Hz. At 36 m/s and dt = 5ms, the car moves 0.18m and rotates at most ~0.01 rad per step. The Jacobian entries like −v·sin(θ)·dt are tiny (<0.04), so FJ ≈ I. The EKF is well within its linear regime per step. It would break at lower IMU rates (e.g., 10 Hz: 3.6m per step with significant rotation).
3. During GPS dropout: Velocity states (from integrated accelerometer) provide short-term position. Uncertainty grows as O(t³) due to integrated accelerometer noise: σpos(t) ≈ σaccel·t²/2. With good IMU (~0.01 m/s² noise): after 30s, σpos ≈ 4.5m. Bias estimation slows this — production systems achieve ~2m error after 30s.
4. Computation: The GPS update inverts a 3x3 matrix (position measurement) not 15x15 — the innovation covariance S = H P HT + R is measurement-sized (3x3 for XYZ GPS). The 15x3 Kalman gain computation is ~15²·3 = 675 multiplies. Easily under 1ms on modern hardware. The dominant cost is actually the 15x15 covariance predict at 200 Hz: ~15³ = 3375 multiplies per step = ~0.67M multiplies/second. Trivial for even an embedded ARM processor.
The EKF sits at the center of a family of nonlinear estimation algorithms. Understanding it gives you the vocabulary and intuition to explore the rest.
In EKF-SLAM, every landmark gets appended to the state vector. The state becomes [robot_x, robot_y, robot_θ, lm1_x, lm1_y, lm2_x, lm2_y, ...]. With 100 landmarks:
| Quantity | Shape | Size |
|---|---|---|
| State x | [203, 1] | 3 robot + 200 landmark coords |
| Covariance P | [203, 203] | 41,209 entries |
| Kalman gain K | [203, 2] | (one landmark observation at a time) |
Both solve the same problem: "how do I propagate uncertainty through a nonlinear function?" The EKF linearizes the function (replace f with its tangent). The UKF instead samples the distribution (replace the Gaussian with representative points). Same goal, dual approaches. EKF-SLAM was the first SLAM algorithm, directly augmenting this state vector with landmark positions. → Classical SLAM
Neural networks are highly nonlinear functions. When we propagate uncertainty through a neural net (Bayesian deep learning), we face this same problem. Why do methods like MC Dropout resemble the UKF more than the EKF?
The EKF's model of nonlinear functions is wrong (tangent lines aren't curves). But it's useful enough to navigate robots, track missiles, and land spacecraft. You now understand why — and when it breaks.