Beyond Linearity

Understand the Extended
Kalman Filter

The Kalman filter assumes straight lines. Real systems curve, twist, and wrap around. The EKF handles all of that by approximating curves with tangent lines at each step.

Prerequisites: Kalman Filter basics + Derivatives (calculus I). Familiarity with matrices helps.
9
Chapters
6+
Simulations
1
Live Robot Demo

Chapter 0: When the World Is Nonlinear

The standard Kalman filter is beautiful — but it makes a critical assumption: the system must be linear. That means the next state is a matrix times the current state, and measurements are a matrix times the state. In math: x' = Fx and z = Hx.

Real systems almost never obey this. A radar station measures range and bearing to an aircraft — that's a square root and an arctangent, not a matrix multiply. A robot turns by an angle and drives forward — that involves sines and cosines. A satellite re-entering the atmosphere experiences exponential drag. All nonlinear.

The problem: The Kalman filter's predict and update equations assume linearity. If you plug nonlinear functions into F and H, the math breaks — Gaussians pushed through nonlinear functions are no longer Gaussian. We need a workaround.
Linear vs Nonlinear Transformation

The blue Gaussian is the input. Watch how a linear function preserves the bell shape, but a nonlinear function distorts it. Toggle between them.

Examples of nonlinear systems in the wild:

SystemNonlinearity
Radar trackingRange = sqrt(x² + y²), bearing = atan2(y, x)
Robot navigationMotion involves sin(θ), cos(θ)
Spacecraft re-entryDrag is exponential in altitude
PendulumRestoring force is sin(θ), not θ
Check: Why can't we use a standard Kalman filter for radar tracking?

Chapter 1: Linearization — Tangent Lines to the Rescue

The key idea of the EKF is simple: if the function is curved, approximate it with a straight line at the current operating point. This is exactly what a first-order Taylor expansion does. Near any point, a smooth curve looks like its tangent line.

For a 1D function f(x), the linearization around point a is: f(x) ≈ f(a) + f'(a)(x - a). The derivative f'(a) is the slope of the tangent line. For multi-variable functions, we use the Jacobian matrix instead of a simple derivative.

f(x) ≈ f(a) + f'(a) · (x − a)
Intuition: Imagine you're standing on a hillside and the fog is thick. You can only see the ground right at your feet. The slope you feel under your boots is the derivative. You assume the hill continues at that slope — that's linearization. It works well nearby but fails for distant points.
Function and Tangent Line

Drag the slider to move the linearization point along the curve. The orange line is the tangent. Notice how it matches well nearby but diverges further away.

Point a0.00
The EKF strategy: At each timestep, linearize the nonlinear functions around the current best estimate. Then run the normal Kalman filter equations using these linear approximations. Re-linearize at the next step around the new estimate.
Check: Linearization approximates a curve with...
Checkpoint — Before you move on
A Gaussian with σ = 2 passes through f(x) = x³. The tangent at x = 1 has slope 3. Without computing anything, explain qualitatively why the EKF approximation will be poor here — what shape will the true output have that the Gaussian misses?
✓ Gate cleared
Model Answer

x³ is convex for x > 0 and concave for x < 0. Around x = 1, points slightly above 1 get cubed to much larger values than the tangent predicts (the curve accelerates away). Points below 1 get cubed to slightly smaller values. The result is a right-skewed distribution — it has a long tail to the right that a Gaussian cannot capture. The EKF's tangent line at slope 3 predicts a symmetric Gaussian with σ = 6, but the true output is asymmetric with more mass at high values. This is exactly the "banana-shaped" distribution problem the UKF was designed to handle.

Chapter 2: Jacobians — Derivatives for Vectors

When your function takes a vector in and produces a vector out, a single derivative number isn't enough. You need the Jacobian matrix: a matrix where each entry is a partial derivative. Row i, column j contains ∂fi/∂xj — how much the i-th output changes when you wiggle the j-th input.

J = ∂f / ∂x = [ [∂f₁/∂x₁, ∂f₁/∂x₂], [∂f₂/∂x₁, ∂f₂/∂x₂] ]

The Jacobian is the multi-dimensional version of "the slope." It tells you the best linear approximation of a vector function near a point. In the EKF, we compute two Jacobians at every step:

FJ — Process Jacobian
∂f / ∂x evaluated at current estimate
HJ — Measurement Jacobian
∂h / ∂x evaluated at predicted state
Example: If f(x, y) = [x² + y, sin(x)], the Jacobian is [[2x, 1], [cos(x), 0]]. Each entry is a partial derivative. At the point (1, 3), the Jacobian evaluates to [[2, 1], [0.54, 0]].
Jacobian Explorer

For f(x) = [r · cos(θ), r · sin(θ)] (polar to Cartesian), explore how the Jacobian changes with θ.

r2.0
θ0.80
Check: What does entry (i, j) of the Jacobian matrix represent?
🔨 Derivation Derive the polar-to-Cartesian Jacobian ✓ ATTEMPTED

The polar-to-Cartesian conversion is f(r, θ) = [r·cos(θ), r·sin(θ)]. This is a function from R² → R².

Your task: Derive the full 2×2 Jacobian matrix. Then evaluate it at r = 2, θ = π/3 and interpret what each column means geometrically.

You need ∂x/∂r, ∂x/∂θ, ∂y/∂r, and ∂y/∂θ. For the first one: x = r·cos(θ), so ∂x/∂r = cos(θ) since θ is treated as constant.
Differentiate r·cos(θ) with respect to θ, treating r as constant: ∂x/∂θ = −r·sin(θ). The negative sign is crucial — increasing θ moves x in the negative direction (counterclockwise).
Column 1 (the ∂f/∂r column) = [cosθ, sinθ] is the radial direction — the unit vector pointing outward from the origin. Column 2 (the ∂f/∂θ column) = [−r sinθ, r cosθ] is the tangential direction scaled by r — how fast the point moves when θ changes.

Full derivation:

Given f(r, θ) = [r cosθ, r sinθ]:

J = [[∂x/∂r, ∂x/∂θ], [∂y/∂r, ∂y/∂θ]] = [[cosθ, −r sinθ], [sinθ, r cosθ]]

At r = 2, θ = π/3: cos(π/3) = 0.5, sin(π/3) = 0.866

J = [[0.5, −1.732], [0.866, 1.0]]

Column 1: [0.5, 0.866] — pointing in the radial direction (outward at 60°). Unit length because it's ∂f/∂r.

Column 2: [−1.732, 1.0] — tangent to the circle, scaled by r=2. Its magnitude is r=2. Points perpendicular to the radial direction (counterclockwise).

The key insight: The determinant |J| = r·cos²θ + r·sin²θ = r. This is why the area element in polar coordinates is r·dr·dθ — the Jacobian determinant IS the r factor. Linearization and coordinate transforms are deeply connected.

Chapter 3: EKF Predict

In the standard KF, the predict step is: x̄ = Fx. The state transition is a matrix multiply. In the EKF, we replace this with the actual nonlinear function f(x). But for the covariance, we still need a matrix — so we use the Jacobian FJ evaluated at the current estimate.

x̄ = f(x̂)       P̄ = FJ · P · FJT + Q
Key difference from KF: The state is propagated through the true nonlinear function f(x). Only the covariance uses the linearized Jacobian FJ. This gives us the best of both worlds: accurate state prediction + tractable uncertainty propagation.

Side by Side: KF vs EKF Predict

Kalman FilterExtended Kalman Filter
Statex̄ = F · xx̄ = f(x)
CovarianceP̄ = F P FT + QP̄ = FJ P FJT + Q
F meaningConstant matrixFJ = ∂f/∂x at current x

EKF Predict in Code

Python
def ekf_predict(x, P, f, F_jacobian, Q):
    """EKF predict step."""
    x_pred = f(x)                        # nonlinear state transition
    F_j = F_jacobian(x)                  # Jacobian at current estimate
    P_pred = F_j @ P @ F_j.T + Q        # covariance propagation
    return x_pred, P_pred

Full EKF Predict With Numbers

Robot at (2, 3) heading θ=0.5 rad, drives forward at v=1 m/s, dt=0.1s:

Python
import numpy as np

# Current state
x = np.array([2.0, 3.0, 0.5])  # [x, y, θ]
v, omega, dt = 1.0, 0.0, 0.1

# Nonlinear state transition
x_pred = np.array([
    x[0] + v * np.cos(x[2]) * dt,  # 2.0 + 1.0 * 0.878 * 0.1 = 2.088
    x[1] + v * np.sin(x[2]) * dt,  # 3.0 + 1.0 * 0.479 * 0.1 = 3.048
    x[2] + omega * dt               # 0.5 + 0.0 = 0.5
])

# Jacobian at current state (NOT at prediction)
F_J = np.array([
    [1, 0, -v * np.sin(x[2]) * dt],  # [1, 0, -0.048]
    [0, 1,  v * np.cos(x[2]) * dt],  # [0, 1,  0.088]
    [0, 0,  1]                       # [0, 0,  1    ]
])

# Covariance prediction
P_pred = F_J @ P @ F_J.T + Q
Check: In the EKF predict, the state is propagated using...
🔨 Derivation Why P uses the Jacobian but x uses f(x) ✓ ATTEMPTED

The EKF predict is: x̄ = f(x) but P̄ = FJ P FJT + Q. Why does the covariance need the Jacobian while the state uses the actual nonlinear function? This asymmetry is the heart of the EKF.

Your task: Starting from the definition of covariance Cov(y) = E[(y − μy)(y − μy)T], show why a linear approximation is needed for uncertainty propagation but not for mean propagation.

If y = f(x), then μy = E[f(x)]. For the mean, we can simply evaluate f at the mean of x (first-order approximation), OR we can just compute f(x̂) directly — no matrix needed. The nonlinear function gives us the best single-point estimate.
Covariance measures the spread of a distribution. To compute Cov(f(x)), we'd need to evaluate E[f(x)f(x)T] − E[f(x)]E[f(x)]T, which requires integrating f over the entire distribution. That's intractable for arbitrary f. But if we approximate f(x) ≈ f(μ) + FJ(x − μ), the covariance of a linear transform is simply FJ P FJT.
Let δx = x − x̂. Then f(x) ≈ f(x̂) + FJδx. So y − μy ≈ FJδx. Therefore Cov(y) = E[FJδx · δxTFJT] = FJ E[δx δxT] FJT = FJ P FJT.

Full derivation:

1. State (mean): The best estimate of the next state is simply f(x̂). No approximation needed — we evaluate the function at one point. This is exact for the mean (to first order: E[f(x)] ≈ f(E[x]) when f is smooth).

2. Covariance (spread): We need P̄ = E[(f(x) − f(x̂))(f(x) − f(x̂))T]. This requires knowing what f does to every point in the distribution — not just the mean.

3. The linearization trick: Taylor-expand f(x) around x̂: f(x) ≈ f(x̂) + FJ(x − x̂). Now f(x) − f(x̂) ≈ FJ(x − x̂).

4. Plug in: P̄ = E[FJ(x − x̂)(x − x̂)TFJT] = FJ · E[(x − x̂)(x − x̂)T] · FJT = FJ P FJT.

The key insight: The mean is a point — you can push a point through any function. The covariance describes a cloud of points — pushing a cloud through a nonlinear function is intractable, so we approximate the function as linear locally. This is why the EKF is only accurate when the cloud (P) is small relative to the curvature of f.

🔗 Pattern Recognition
Iterative Refinement: Re-linearize at Every Step
EKF (This Lesson)
Linearize f at current x̂, propagate, get new x̂, re-linearize at the new point. Each step uses a fresh tangent line.
Kalman Filter
F and H are constant matrices — no re-linearization needed. The EKF reduces to the KF when f(x) = Fx and h(x) = Hx. → Kalman Filter

Both share the predict-update cycle: predict forward, incorporate measurement, repeat. The KF does this with fixed matrices; the EKF does it with matrices that change at every timestep (because the Jacobian depends on where you are). The structure is identical — only the "F" and "H" are now functions of state.

The diffusion denoising process also iteratively refines: each step takes the current estimate, predicts the noise, and updates. Where's the "linearization" equivalent there?

Chapter 4: EKF Update

Same story for the update step. In the KF, the expected measurement is z̄ = Hx. In the EKF, we use the nonlinear measurement function h(x) to compute what we expect to see. But the Kalman gain uses the Jacobian HJ to propagate uncertainty.

y = z − h(x̄)      S = HJ P̄ HJT + R      K = P̄ HJT S−1
x̂ = x̄ + K y      P = (I − K HJ) P̄
Why h(x) matters: Consider a radar measuring range r = sqrt(x² + y²). If the predicted state is (3, 4), the expected range is h(x) = 5, not [some matrix] · [3, 4]. The nonlinear function gives the right expected measurement; the Jacobian handles the uncertainty math around it.

Side by Side: KF vs EKF Update

Kalman FilterExtended Kalman Filter
Innovationy = z − H xy = z − h(x)
Innov. cov.S = H P HT + RS = HJ P HJT + R
GainK = P HT S−1K = P HJT S−1
H meaningConstant matrixHJ = ∂h/∂x at predicted x

EKF Update in Code

Python
def ekf_update(x_pred, P_pred, z, h, H_jacobian, R):
    """EKF update step."""
    H_j = H_jacobian(x_pred)             # Jacobian at predicted state
    y = z - h(x_pred)                    # innovation
    S = H_j @ P_pred @ H_j.T + R        # innovation covariance
    K = P_pred @ H_j.T @ np.linalg.inv(S)  # Kalman gain
    x = x_pred + K @ y                  # updated state
    P = (np.eye(len(x)) - K @ H_j) @ P_pred
    return x, P
Check: In the EKF update, the innovation y is computed as...
Checkpoint — Before you move on
Without looking back, write out the full EKF loop (predict + update) in your own notation. For each equation, annotate which part uses the true nonlinear function and which uses the Jacobian approximation. Why is this split necessary?
✓ Gate cleared
Model Answer

Predict: x̄ = f(x̂) [TRUE nonlinear] | P̄ = FJ P FJT + Q [JACOBIAN]

Update: y = z − h(x̄) [TRUE nonlinear] | S = HJ P̄ HJT + R [JACOBIAN] | K = P̄ HJT S−1 [JACOBIAN] | x̂ = x̄ + Ky [no approximation — linear combination] | P = (I − KHJ)P̄ [JACOBIAN]

The pattern: the mean (state estimate) always uses true functions because we're evaluating at a single point. The covariance (uncertainty spread) always uses Jacobians because we need to describe how an entire distribution transforms, which requires linearity. The Kalman gain K bridges the two: it's computed from the linearized covariance but applied to the nonlinear innovation.

Chapter 5: Computing Jacobians in Practice

Let's work through a concrete example: a robot at position (x, y) with heading θ observes a landmark at (lx, ly). The measurement is range and bearing:

h(x) = [ sqrt((lx−x)² + (ly−y)²) , atan2(ly−y, lx−x) − θ ]

To get HJ, we take partial derivatives with respect to x, y, and θ. Let dx = lx − x, dy = ly − y, and q = dx² + dy²:

HJ = [ [−dx/√q, −dy/√q, 0], [dy/q, −dx/q, −1] ]
Step by step:
1. ∂range/∂x = −dx / sqrt(q)  (chain rule on the square root)
2. ∂range/∂y = −dy / sqrt(q)
3. ∂range/∂θ = 0  (range doesn't depend on heading)
4. ∂bearing/∂x = dy / q  (derivative of atan2)
5. ∂bearing/∂y = −dx / q
6. ∂bearing/∂θ = −1  (bearing is relative to heading)
Range-Bearing Jacobian Visualizer

Move the robot (drag the slider) and see how the Jacobian entries change. The landmark is fixed at the orange star.

Robot x100
Robot y200

Process Model Jacobian (Robot Motion)

For a differential-drive robot with state [x, y, θ] and control [v, ω]:

f(x) = [x + v·cos(θ)·dt,   y + v·sin(θ)·dt,   θ + ω·dt]
FJ = [ [1, 0, −v·sin(θ)·dt], [0, 1, v·cos(θ)·dt], [0, 0, 1] ]

Plugging In Real Numbers

Let's make this completely concrete. A robot drives at v = 1 m/s, heading θ = 45° (π/4 rad), with dt = 0.1s. What does the Jacobian look like?

Python
import numpy as np

v, theta, dt = 1.0, np.pi/4, 0.1

F_J = np.array([
    [1, 0, -v * np.sin(theta) * dt],  # -1.0 * 0.707 * 0.1 = -0.0707
    [0, 1,  v * np.cos(theta) * dt],  #  1.0 * 0.707 * 0.1 =  0.0707
    [0, 0,  1]
])
# F_J = [[1.000, 0.000, -0.071],
#        [0.000, 1.000,  0.071],
#        [0.000, 0.000,  1.000]]
Reading the Jacobian: The top-right entry −0.071 says: "if we're wrong about θ by 1 radian, the x-position prediction is off by 0.071 meters." The Jacobian quantifies how each state error propagates into other state errors. A near-identity matrix (like this one with small dt) means errors don't propagate much per step.
Why linearize at the current estimate, not the prediction? The estimate is our best guess of the current state — the point where the tangent line is most accurate. If we linearized at zero or at the prediction, the Jacobian would be evaluated at the wrong point and the approximation would be worse. Each EKF step re-linearizes around the freshest estimate.
Check: Why is ∂range/∂θ = 0?
🔨 Derivation Derive ∂bearing/∂x from first principles ✓ ATTEMPTED

The bearing measurement is: bearing = atan2(ly − y, lx − x) − θ. You need ∂bearing/∂x, ∂bearing/∂y, and ∂bearing/∂θ.

Your task: Using the fact that d/du[atan2(v, u)] = −v / (u² + v²) and d/dv[atan2(v, u)] = u / (u² + v²), derive the second row of HJ. Verify the result matches the formula given above.

Let u = lx − x and v = ly − y. Then bearing = atan2(v, u) − θ. For ∂bearing/∂x, you need (d/du[atan2]) · (du/dx). Since u = lx − x, du/dx = −1.
∂bearing/∂x = (d/du[atan2(v,u)]) · (−1) = (−v/(u²+v²)) · (−1) = v/(u²+v²) = (ly−y) / q = dy/q.
The bearing formula is atan2(...) − θ. Since θ only appears in that −θ term, its derivative is simply −1. The absolute angle to the landmark doesn't change with heading, but the relative bearing does — linearly.

Full derivation:

Let dx = lx − x, dy = ly − y, q = dx² + dy².

bearing = atan2(dy, dx) − θ

∂bearing/∂x: dx depends on x (ddx/dx = −1). Using the atan2 derivative rule: ∂/∂x[atan2(dy, dx)] = (−dy/q) · (−1) = dy/q

∂bearing/∂y: dy depends on y (ddy/dy = −1). ∂/∂y[atan2(dy, dx)] = (dx/q) · (−1) = −dx/q

∂bearing/∂θ = −1 (direct)

Second row of HJ: [dy/q, −dx/q, −1] ✓

The key insight: The 1/q factor means bearing sensitivity decreases with distance squared. When the landmark is far away (large q), small position changes barely affect the bearing. This is why nearby landmarks provide more information — their Jacobian entries are larger, making the Kalman gain higher.

💻 Build It Implement the EKF Predict Step from Scratch ✓ ATTEMPTED
You've seen the EKF predict equations and the robot motion model. Now implement the full predict step for a differential-drive robot with state [x, y, θ] and control input [v, ω]. You must compute both the predicted state AND the Jacobian, then propagate the covariance.
signature def ekf_predict_robot(state, P, v, omega, dt, Q): """ EKF predict for differential-drive robot. Args: state: np.array([x, y, theta]) P: 3x3 covariance matrix v: linear velocity (m/s) omega: angular velocity (rad/s) dt: timestep (s) Q: 3x3 process noise covariance Returns: state_pred: np.array([x', y', theta']) P_pred: 3x3 predicted covariance """
Test case
state = [2.0, 3.0, 0.5], v = 1.0, omega = 0.1, dt = 0.1
Expected state_pred ≈ [2.088, 3.048, 0.51]
Expected F_J[0,2] ≈ −0.048 (= −v·sin(θ)·dt)
The motion model is: x' = x + v·cos(θ)·dt, y' = y + v·sin(θ)·dt, θ' = θ + ω·dt. Taking ∂/∂θ of each: ∂x'/∂θ = −v·sin(θ)·dt, ∂y'/∂θ = v·cos(θ)·dt, ∂θ'/∂θ = 1. All ∂/∂x and ∂/∂y derivatives of the position equations are 1 on diagonal, 0 elsewhere.
python
import numpy as np

def ekf_predict_robot(state, P, v, omega, dt, Q):
    x, y, theta = state

    # 1. Nonlinear state transition
    state_pred = np.array([
        x + v * np.cos(theta) * dt,
        y + v * np.sin(theta) * dt,
        theta + omega * dt
    ])

    # 2. Jacobian of motion model w.r.t. state
    F_J = np.array([
        [1, 0, -v * np.sin(theta) * dt],
        [0, 1,  v * np.cos(theta) * dt],
        [0, 0,  1]
    ])

    # 3. Covariance propagation
    P_pred = F_J @ P @ F_J.T + Q

    return state_pred, P_pred
Bonus challenge: What happens if v is very large and dt is not small? The Jacobian is evaluated at the current θ, but the robot has rotated ω·dt during the step. For large ω·dt, this creates a mismatch. Implement a version that subdivides the step into N smaller sub-steps for better accuracy.

Chapter 6: EKF Robot Localization

This is the showcase. A robot moves in 2D space with 3 known landmarks. It can only measure the bearing (angle) to each landmark — not the distance. The EKF estimates the robot's position and heading from these bearing-only measurements.

Use the buttons to drive the robot. Watch the green uncertainty ellipse shrink as the filter incorporates measurements. The blue lines show bearing measurements to landmarks.

EKF Robot Localization
Meas. noise0.08
Step: 0 | x: 200 y: 220 θ: 0.00
Observe: When the robot is far from landmarks, the ellipse is large. When it moves to a spot where the landmarks are spread out in different directions, the ellipse shrinks rapidly — the geometry of the landmarks matters. This is called observability.
Check: Why does the uncertainty ellipse shrink faster when landmarks are spread out?

Chapter 7: Consistency & Divergence

The EKF is an approximation. It assumes the nonlinear function is well-described by its tangent line within the region of uncertainty. When this assumption breaks, the filter can diverge: the estimated covariance shrinks (the filter thinks it's certain), but the actual error grows. The filter becomes overconfident and wrong.

When the EKF breaks:
1. High curvature: If the function bends sharply within one standard deviation, the tangent-line approximation is poor.
2. Wrong linearization point: If the estimate is far from truth, the Jacobian is evaluated at the wrong point.
3. Large uncertainty: The tangent line only works "locally." If P is huge, the Gaussian spreads over a region where the tangent is a bad fit.
EKF Divergence Demo

Increase the nonlinearity to see how the tangent-line approximation degrades. The teal region is the true transformed distribution (sampled). The orange ellipse is the EKF's Gaussian approximation.

Nonlinearity0.50
Input spread σ0.50
Concrete failure: Imagine a robot turning 90° in a single timestep. The Jacobian at the old heading has cos(θ) and sin(θ) terms for the old direction. But the robot is now facing perpendicular to that — the linearization is completely wrong. The covariance propagates in the wrong direction, the filter becomes overconfident in the wrong state, and subsequent updates pull the estimate further away from truth. This is why EKFs struggle with high angular rates.

This is precisely why the Unscented Kalman Filter (UKF) was invented. Instead of linearizing, the UKF passes carefully chosen sample points through the true nonlinear function, then reconstructs a Gaussian from the results. No Jacobians required, and it captures more of the curvature.

AspectEKFUKF
ApproachLinearize with JacobianSample with sigma points
Accuracy1st order (tangent line)2nd order (captures curvature)
Jacobians needed?YesNo
When it failsHigh curvature, large uncertaintyExtreme nonlinearity, multimodal
Check: EKF divergence happens when...
💥 Break-It Lab Three Ways to Kill an EKF ✓ ATTEMPTED
A tracking EKF estimates a 1D state using a nonlinear measurement h(x) = sin(x). The green line is true state, orange is the EKF estimate, and the shaded region is the EKF's reported ±2σ confidence. Watch how each failure mode makes the EKF overconfident and wrong.
Use wrong Jacobian (constant cos(0) = 1 instead of cos(x)) ACTIVE
Failure mode: The Jacobian is evaluated at x=0 instead of the current estimate. When the true state drifts to x=π/2, cos(x)≈0 but the filter thinks the measurement sensitivity is 1.0. It grossly overestimates how informative the measurement is, shrinks P too fast, and locks onto a wrong value. The filter is certain and wrong.
Skip re-linearization (use first-step Jacobian forever) ACTIVE
Failure mode: The Jacobian from step 0 becomes stale as the state evolves. The filter's model of "how measurements relate to state" drifts from reality. Covariance propagates in the wrong direction, the gain points the wrong way, and corrections push the estimate further from truth instead of closer.
Start with huge initial uncertainty (P0 = 100 instead of 1) ACTIVE
Failure mode: The tangent line is only valid locally, but P=100 means the distribution spans many wavelengths of sin(x). The first update computes a huge Kalman gain (because P is big relative to R), makes a massive correction based on a linearization that's only valid nearby, and overshoots wildly. Can sometimes recover — but often oscillates or diverges.
⚔ Adversarial: Pendulum vs spring-mass divergence
You implement an EKF for a simple pendulum (θ'' = −(g/L)sinθ) and a separate EKF for a spring-mass system (x'' = −(k/m)x). Both use the same sensor noise R = 0.01, same timestep dt = 0.01s, same initial P. The spring-mass EKF tracks perfectly forever. The pendulum EKF diverges after 3 seconds. Same code structure, same noise. What's fundamentally different?
🏗 Design Challenge You're the Architect: GPS+IMU Fusion for Autonomous Driving ✓ ATTEMPTED
You're designing the state estimation system for a Level 4 autonomous vehicle. The car has a GPS receiver (10 Hz, ±1.5m accuracy) and a 6-axis IMU (200 Hz, accelerometer + gyroscope). The car drives at up to 130 km/h on highways and must maintain lane-level accuracy (<0.3m lateral error).
State vector
Must track: position (3D), velocity (3D), orientation (3D), accelerometer bias (3D), gyroscope bias (3D) = 15 states
IMU rate
200 Hz — predict step runs at this rate
GPS rate
10 Hz — update step runs at this rate (20x slower than predict)
Latency budget
<5ms per predict step, <20ms per update step
Failure mode
GPS dropout in tunnels (up to 30 seconds)
1. Why include accelerometer and gyro biases in the state? What happens if you don't estimate them?
2. At 130 km/h (36 m/s), the car moves 0.18m per IMU step. Is the EKF linearization valid? When does it break?
3. During a 30-second GPS dropout, how does position uncertainty grow? Which states keep you alive?
4. The 15x15 covariance matrix has 225 entries. Is the computational cost of the update step (involving matrix inversion) feasible at 10 Hz?

1. Bias estimation is critical. IMU accelerometers drift ~0.1 mg and gyros drift ~10 °/hr. Without bias states, after 30s of GPS dropout: position error from uncompensated accel bias = 0.5 · 0.001 · 30² = 0.45m. With bias estimation, the EKF learns the bias from GPS updates and compensates during dropout. This is why the state vector is 15, not 9.

2. Linearization is fine at 200 Hz. At 36 m/s and dt = 5ms, the car moves 0.18m and rotates at most ~0.01 rad per step. The Jacobian entries like −v·sin(θ)·dt are tiny (<0.04), so FJ ≈ I. The EKF is well within its linear regime per step. It would break at lower IMU rates (e.g., 10 Hz: 3.6m per step with significant rotation).

3. During GPS dropout: Velocity states (from integrated accelerometer) provide short-term position. Uncertainty grows as O(t³) due to integrated accelerometer noise: σpos(t) ≈ σaccel·t²/2. With good IMU (~0.01 m/s² noise): after 30s, σpos ≈ 4.5m. Bias estimation slows this — production systems achieve ~2m error after 30s.

4. Computation: The GPS update inverts a 3x3 matrix (position measurement) not 15x15 — the innovation covariance S = H P HT + R is measurement-sized (3x3 for XYZ GPS). The 15x3 Kalman gain computation is ~15²·3 = 675 multiplies. Easily under 1ms on modern hardware. The dominant cost is actually the 15x15 covariance predict at 200 Hz: ~15³ = 3375 multiplies per step = ~0.67M multiplies/second. Trivial for even an embedded ARM processor.

Chapter 8: Connections & Where to Go

The EKF sits at the center of a family of nonlinear estimation algorithms. Understanding it gives you the vocabulary and intuition to explore the rest.

Kalman Filter
The foundation. Linear + Gaussian = exact solution. The EKF reduces to the KF when f and h are linear.
↓ generalize
Extended Kalman Filter
You are here. Linearize with Jacobians. Works for mildly nonlinear systems.
↓ avoid Jacobians
Unscented Kalman Filter
Sigma-point sampling. Better accuracy, no derivatives needed. Handles stronger nonlinearity.
↓ arbitrary distributions
Particle Filter
Monte Carlo sampling. Handles any distribution, any nonlinearity. Expensive but powerful.
EKF + Mapping = SLAM. When the landmarks are unknown and must be estimated simultaneously with the robot's pose, you get Simultaneous Localization and Mapping. The EKF was the original SLAM algorithm (EKF-SLAM), augmenting the state vector with landmark positions.

EKF-SLAM: When the State Vector Explodes

In EKF-SLAM, every landmark gets appended to the state vector. The state becomes [robot_x, robot_y, robot_θ, lm1_x, lm1_y, lm2_x, lm2_y, ...]. With 100 landmarks:

QuantityShapeSize
State x[203, 1]3 robot + 200 landmark coords
Covariance P[203, 203]41,209 entries
Kalman gain K[203, 2](one landmark observation at a time)
The scaling wall: The update step inverts a matrix derived from P — that's O(n³) per update. At 100 landmarks (n=203), each update touches ~41K covariance entries. At 1000 landmarks (n=2003), the covariance has 4 million entries. This is why EKF-SLAM doesn't scale past ~100 landmarks, and why modern SLAM systems use graph-based optimization or particle filters instead.

The EKF Algorithm Summary

Initialize
Set x̂ (initial guess), P (initial covariance)
Predict
x̄ = f(x̂)    FJ = ∂f/∂x    P̄ = FJ P FJT + Q
Gain
HJ = ∂h/∂x    S = HJ P̄ HJT + R    K = P̄ HJT S−1
Update
y = z − h(x̄)    x̂ = x̄ + K y    P = (I − K HJ) P̄
↓ repeat
🔗 Pattern Recognition
Approximating Nonlinear Transforms: Jacobians vs Sigma Points
EKF (This Lesson)
Approximate f with tangent line at μ. Propagate covariance as FJ P FJT. Cost: one function eval + one Jacobian. Accuracy: 1st-order (misses curvature).
UKF (Sigma Points)
Pick 2n+1 carefully chosen points that capture the mean and covariance. Push each through f. Reconstruct μ and P from the results. Cost: 2n+1 function evals, no Jacobian. Accuracy: 2nd-order. → UKF Lesson

Both solve the same problem: "how do I propagate uncertainty through a nonlinear function?" The EKF linearizes the function (replace f with its tangent). The UKF instead samples the distribution (replace the Gaussian with representative points). Same goal, dual approaches. EKF-SLAM was the first SLAM algorithm, directly augmenting this state vector with landmark positions. → Classical SLAM

Neural networks are highly nonlinear functions. When we propagate uncertainty through a neural net (Bayesian deep learning), we face this same problem. Why do methods like MC Dropout resemble the UKF more than the EKF?

"All models are wrong, but some are useful."
— George Box

The EKF's model of nonlinear functions is wrong (tangent lines aren't curves). But it's useful enough to navigate robots, track missiles, and land spacecraft. You now understand why — and when it breaks.