Beyond Linearization

Understand the Unscented
Kalman Filter

No Jacobians. No tangent-line approximations. Just carefully chosen sample points that capture what a Gaussian looks like after it's been squeezed through a nonlinear function.

Prerequisites: Kalman Filter basics + Familiarity with EKF limitations.
9
Chapters
6+
Simulations
0
Jacobians

Chapter 0: Beyond Linearization

The Extended Kalman Filter (EKF) handles nonlinear systems by computing Jacobians — local tangent-line approximations. It works, but it has real problems: you need to derive the Jacobian analytically, and the first-order linear approximation can be wildly inaccurate when the function curves sharply.

In 1995, Simon Julier and Jeffrey Uhlmann had a striking insight: it's easier to approximate a probability distribution than to approximate an arbitrary nonlinear function. Instead of linearizing the function, why not pick a small set of sample points that capture the distribution's statistics, push them through the exact nonlinear function, and reconstruct a Gaussian from the results?

The core idea: Approximating a nonlinear function with a tangent line is fragile. Approximating a Gaussian with a handful of carefully chosen points is robust. The UKF approximates the distribution, not the function.
EKF Approach
Linearize f(x) with Jacobian → propagate Gaussian through the linear approximation
vs
UKF Approach
Pick sigma points → propagate each through exact f(x) → reconstruct Gaussian
Why "unscented"? Uhlmann reportedly chose the name because it "doesn't stink" — a tongue-in-cheek jab at the EKF's linearization assumptions. The name has no mathematical meaning.
Check: What does the UKF approximate?

Chapter 1: Sigma Points

A Gaussian in n dimensions is fully described by its mean (n numbers) and covariance (an n×n matrix). The UKF selects 2n+1 deterministic sample points — called sigma points — that exactly match these first two moments. For a 2D state, that's just 5 points.

The sigma points are the mean itself, plus pairs of points offset along each eigenvector of the covariance, scaled by a factor that depends on the tuning parameters α, β, and κ.

χ0 = μ      χi = μ + (√((n+λ)P))i      χi+n = μ − (√((n+λ)P))i

where λ = α²(n + κ) − n, and (√((n+λ)P))i is the i-th column of the matrix square root of (n+λ)P.

Interactive Sigma Points

A 2D Gaussian with 2n+1 = 5 sigma points. Drag the sliders to change the covariance shape. The orange dots are sigma points; the teal ellipse is the 2-sigma contour.

Var X (σ²x)1.5
Var Y (σ²y)0.8
Correlation ρ0.4

Sigma Points Concretely: The Cholesky Connection

The formula says √((n+λ)P). What does the "square root" of a matrix even mean? It means the Cholesky decomposition: finding a lower-triangular matrix L such that LL¹ = (n+λ)P. In NumPy, one line:

Python
import numpy as np

n = 4   # 2D tracker: [px, vx, py, vy]
alpha, kappa = 1e-3, 0
lam = alpha**2 * (n + kappa) - n  # ≈ -4.0

# Cholesky decomposition gives the "matrix square root"
L = np.linalg.cholesky((n + lam) * P)  # L is [4,4]

# Generate 2n+1 = 9 sigma points
sigmas = [x.copy()]                    # χ₀ = mean
for i in range(n):
    sigmas.append(x + L[:, i])         # χᵢ = mean + i-th column of L
    sigmas.append(x - L[:, i])         # χᵢ₊ₙ = mean - i-th column of L
# sigmas has 9 points, each is [4,1]
Why exactly 2n+1? It's the minimum number of points needed to capture the mean and covariance of an n-dimensional Gaussian exactly. The center point captures the mean. Each pair of symmetric points (one in the +Li direction, one in the −Li direction) captures one axis of the covariance ellipsoid. For n=4: 1 center + 4 pairs = 9 points. More points would help capture higher-order moments (skewness, kurtosis) — but that's the particle filter's job.
Key property: The sigma points are not random samples. They are deterministic. The same input always produces the same points. This makes the UKF reproducible and avoids the sampling noise of particle filters.
Check: For a state of dimension n=3, how many sigma points does the UKF use?
🔨 Derivation Derive the sigma point weights from moment-matching constraints ✓ ATTEMPTED

We have 2n+1 sigma points: one at the mean μ, and n symmetric pairs at μ ± (√((n+λ)P))i. Each point has a mean weight wim and a covariance weight wic.

Your task: Derive the weight formulas by enforcing two constraints: (1) the weighted sum of sigma points equals the mean, and (2) the weighted sum of outer products equals the covariance P.

Write Σ wim χi = μ. Expand by substituting χ0 = μ, χi = μ + Li, χi+n = μ − Li. The symmetric pairs cancel in pairs (Li − Li = 0), leaving just the requirement that all weights sum to 1: w0m + 2n · wim = 1.
Write Σ wici − μ)(χi − μ)T = P. The center point contributes zero (since χ0 − μ = 0). Each pair contributes 2 wic Li LiT. Since L is from chol((n+λ)P), we get L LT = (n+λ)P. So: 2n · wic · (n+λ)P = P, giving wic = 1/(2(n+λ)).
The center point's covariance weight w0c has an extra term: w0c = w0m + (1 − α² + β). This β term captures prior knowledge about the fourth moment (kurtosis) of the distribution. For a Gaussian, kurtosis = 3, and β = 2 is optimal because it makes the covariance estimate exact for the Gaussian case. This isn't derivable from first/second moment matching alone — it's a tuning knob for higher-order accuracy.

Full derivation:

Step 1 (mean constraint): Σ wim χi = w0mμ + Σi=1..n wim(μ + Li) + Σi=1..n wi+nm(μ − Li) = μ. Since the symmetric terms cancel: (w0m + 2n · wim) · μ = μ, so weights sum to 1.

Step 2 (covariance constraint): Σ wici−μ)(χi−μ)T = 2 · n · wic · LiLiT. But L comes from chol((n+λ)P), so each column satisfies LiLiT contribution sums to (n+λ)P. Thus 2n · wic · (n+λ) = 1, giving wic = 1/(2(n+λ)).

Step 3 (solve for w0m): From sum-to-1: w0m = 1 − 2n/(2(n+λ)) = 1 − n/(n+λ) = λ/(n+λ).

The key insight: The satellite weights are completely determined by the covariance constraint alone. The center weight is the leftover from the sum-to-1 constraint. This is why the center weight can go negative (when λ < 0) — it's not a probability, it's a moment-matching coefficient from Mercer's theorem on quadrature rules.

Checkpoint — Before you move on
Explain in your own words: why do we need exactly 2n+1 sigma points (not fewer, not more) to capture a Gaussian's first two moments? What would go wrong with 2n points (no center)?
✓ Gate cleared
Model Answer

The n symmetric pairs capture n axes of the covariance ellipsoid — each pair spans one eigenvector direction. Without the center point, the weights must still sum to 1 for the mean constraint, but then 2n · wim = 1 forces wim = 1/(2n). Now the covariance constraint becomes: 2n · (1/(2n)) · (n+λ)P = (n+λ)P ≠ P unless λ = 1−n. You lose a free parameter (λ) that controls the spread, meaning you can't independently tune how far the sigma points reach into the tails. The center point adds one more degree of freedom, decoupling the spread (controlled by λ) from the weight distribution. Fewer than 2n+1 points is under-determined for both moments; more than 2n+1 is over-determined (helpful for higher moments — that's cubature Kalman filters).

Chapter 2: The Unscented Transform

The unscented transform is the heart of the UKF. Given a Gaussian and a nonlinear function, it approximates the output distribution in three steps:

  1. Generate sigma points from the input Gaussian.
  2. Propagate each sigma point through the nonlinear function.
  3. Reconstruct the output Gaussian from the weighted transformed points.
μy = Σ wim · f(χi)      Py = Σ wic (f(χi) − μy)(f(χi) − μy

Each sigma point carries two weights: wm for the mean and wc for the covariance. The center point typically gets a different weight than the satellite points.

Unscented Transform: Banana Nonlinearity

Sigma points from a 2D Gaussian pass through a nonlinear "banana" transform. The teal ellipse is the input. The orange points become purple points after the transform. The green ellipse is the reconstructed output Gaussian.

Curvature0.20
Input spread σ1.20

The Weighted Recombination in NumPy

After pushing all sigma points through f, we reconstruct a Gaussian by computing the weighted mean and covariance:

Python
# sigmas_f = [f(χ₀), f(χ₁), ..., f(χ₂ₙ)]  — transformed points
# wm, wc = weight arrays (length 2n+1)

# Weighted mean: just a weighted average
mu_out = sum(wm[i] * sigmas_f[i] for i in range(2*n+1))

# Weighted covariance: outer products of deviations
P_out = np.zeros((n, n))
for i in range(2*n+1):
    d = sigmas_f[i] - mu_out
    P_out += wc[i] * np.outer(d, d)
What's different from Monte Carlo? A particle filter uses thousands of random samples and equal weights. The UKF uses just 2n+1 deterministic samples with carefully computed weights. Fewer points, but each one is placed exactly where it needs to be. This is why the UKF is fast enough for real-time applications where particle filters are too expensive.
Why it works: A Gaussian passed through a nonlinear function is no longer Gaussian. But the unscented transform captures the true mean and covariance to at least second-order accuracy — one order better than the EKF's first-order Jacobian approximation.
Check: What accuracy does the unscented transform achieve for Gaussian inputs?
🔨 Derivation Why the unscented transform is exact for linear systems ✓ ATTEMPTED

Consider a linear function f(x) = Ax + b applied to a Gaussian with mean μ and covariance P. The true output distribution has mean Aμ + b and covariance APAT.

Your task: Show that the unscented transform recovers the exact output mean and covariance for any linear f — proving it's at least as good as the standard Kalman filter for linear systems.

f(χi) = Aχi + b. For the center: Aμ + b. For satellite i: A(μ + Li) + b = Aμ + b + ALi. The transformed sigma points are just the original sigma points for a Gaussian with mean Aμ+b and "square root" AL.
μy = Σ wim (Aχi + b) = A(Σ wim χi) + b · (Σ wim). Since the weights match the input mean and sum to 1: μy = Aμ + b. Exact!
Py = Σ wic (f(χi) − μy)(f(χi) − μy)T = Σ wic A(χi − μ)(χi − μ)TAT = A [Σ wici−μ)(χi−μ)T] AT = APAT. Since we designed the weights to match P exactly!

Full derivation:

Mean: μy = Σ wim f(χi) = Σ wim (Aχi + b) = A(Σ wim χi) + b = Aμ + b. This is the exact linear transform of the mean.

Covariance: Each deviation is f(χi) − μy = A(χi − μ). So Py = Σ wic A(χi−μ)(χi−μ)TAT = A[Σ wici−μ)(χi−μ)T]AT = APAT.

For nonlinear f: Taylor-expand f(χi) = f(μ) + J(χi−μ) + ½H(χi−μ)² + ... The UT captures the first-order term exactly (same as EKF) AND captures the second-order Hessian term because the sigma points sample the curvature. The EKF drops everything beyond the Jacobian J.

The key insight: Linearity is the special case where the UT is trivially exact. For nonlinear functions, the sigma points "probe" the curvature in each direction, capturing second-order effects that a single tangent line (Jacobian) misses entirely.

💻 Build It Implement the Unscented Transform from Scratch ✓ ATTEMPTED
You've seen the sigma point generation and the weighted recombination. Now implement the full unscented transform: given a mean, covariance, and nonlinear function, output the transformed mean and covariance.
signature def unscented_transform(mu, P, f, alpha=1e-3, beta=2, kappa=0): """ Apply the unscented transform to propagate a Gaussian through f. Args: mu: np.array, shape (n,) — input mean P: np.array, shape (n, n) — input covariance f: callable, takes (n,) returns (m,) — nonlinear function alpha, beta, kappa: UKF tuning parameters Returns: mu_y: np.array, shape (m,) — output mean P_y: np.array, shape (m, m) — output covariance P_xy: np.array, shape (n, m) — cross-covariance """
Test case
mu = np.array([1.0, 2.0])
P = np.array([[1.0, 0.3], [0.3, 0.5]])
f = lambda x: np.array([x[0]**2, x[0]*x[1]]) # nonlinear!
mu_y, P_y, P_xy = unscented_transform(mu, P, f)
# mu_y should be approximately [2.0, 2.0] (since E[x²] = mu² + var)
# mu_y[0] ≈ 1² + 1.0 = 2.0
The cross-covariance uses deviations from the input mean (not the output mean) in one factor: P_xy = Σ wici − μ)(f(χi) − μy)T. This is needed for the UKF update step to compute the Kalman gain.
python
import numpy as np

def unscented_transform(mu, P, f, alpha=1e-3, beta=2, kappa=0):
    n = len(mu)
    lam = alpha**2 * (n + kappa) - n

    # Generate sigma points
    sqrt_P = np.linalg.cholesky((n + lam) * P)
    sigmas = np.zeros((2*n+1, n))
    sigmas[0] = mu
    for i in range(n):
        sigmas[i+1]   = mu + sqrt_P[:, i]
        sigmas[i+n+1] = mu - sqrt_P[:, i]

    # Weights
    wm = np.full(2*n+1, 1/(2*(n+lam)))
    wc = np.full(2*n+1, 1/(2*(n+lam)))
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)

    # Propagate
    sigmas_f = np.array([f(s) for s in sigmas])
    m = sigmas_f.shape[1]

    # Weighted mean
    mu_y = wm @ sigmas_f  # (m,)

    # Weighted covariance and cross-covariance
    P_y = np.zeros((m, m))
    P_xy = np.zeros((n, m))
    for i in range(2*n+1):
        dy = sigmas_f[i] - mu_y
        dx = sigmas[i] - mu
        P_y  += wc[i] * np.outer(dy, dy)
        P_xy += wc[i] * np.outer(dx, dy)

    return mu_y, P_y, P_xy
Bonus challenge: Modify this to return the propagated sigma points as well (needed for the augmented UKF which handles correlated process/measurement noise).

Chapter 3: UKF Predict

The predict step applies the unscented transform to the process model f(x). We generate sigma points from the current belief, push them through f, and reconstruct the predicted Gaussian. Then add process noise Q.

1. Generate Sigma Points
χ = sigmaPoints(μ, P)
2. Propagate
χi* = f(χi) for each i
3. Reconstruct
μ¯ = Σ wim χi*    P¯ = Σ wici* − μ¯)(χi* − μ¯)¹ + Q

UKF Predict in Python

Python
import numpy as np

def ukf_predict(x, P, f, Q, alpha=1e-3, beta=2, kappa=0):
    """UKF predict step."""
    n = len(x)
    lam = alpha**2 * (n + kappa) - n

    # Generate sigma points
    sqrt_P = np.linalg.cholesky((n + lam) * P)
    sigmas = [x.copy()]
    for i in range(n):
        sigmas.append(x + sqrt_P[:, i])
        sigmas.append(x - sqrt_P[:, i])

    # Weights
    wm = [lam / (n + lam)] + [1 / (2*(n + lam))] * 2*n
    wc = [lam / (n + lam) + (1 - alpha**2 + beta)] + [1 / (2*(n + lam))] * 2*n

    # Propagate through f
    sigmas_f = [f(s) for s in sigmas]

    # Reconstruct mean and covariance
    x_pred = sum(w * s for w, s in zip(wm, sigmas_f))
    P_pred = Q.copy()
    for i, (w, s) in enumerate(zip(wc, sigmas_f)):
        d = s - x_pred
        P_pred += w * np.outer(d, d)

    return x_pred, P_pred

Concrete Example: 2D Tracker Predict

For a 2D tracker (n=4), the predict step generates 9 sigma points, pushes each through the motion model, and recombines. Here's the data flow with actual shapes:

StepOperationShape
1Cholesky: L = chol((n+λ)P)[4, 4] lower-triangular
2Generate 2n+1 = 9 sigma points9 vectors of [4, 1]
3Propagate each: χi* = f(χi)9 vectors of [4, 1]
4Weighted mean → x̄[4, 1]
5Weighted outer products → P̄[4, 4]
6Add Q → P̄ + Q[4, 4]
Notice: No Jacobians anywhere! The function f is used as a black box. This means you can use the UKF even when f is a simulation, a neural network, or any function you can evaluate but cannot differentiate.
Check: In the UKF predict step, what happens to the sigma points?

Chapter 4: UKF Update

The update step is similar: generate sigma points from the predicted state, push them through the measurement model h(x), and compute the cross-covariance between state and measurement spaces. This gives us the Kalman gain.

1. Sigma Points
χ = sigmaPoints(μ¯, P¯)
2. Predicted Measurements
Zi = h(χi) for each i
3. Measurement Statistics
μz = Σ wim Zi    S = Σ wic (Zi − μz)(Zi − μz)¹ + R
4. Cross-Covariance & Gain
Pxz = Σ wici − μ¯)(Zi − μz)¹    K = Pxz S¯¹
5. Update
μ = μ¯ + K(z − μz)    P = P¯ − K S K¹

UKF Update in Python

Python
def ukf_update(x_pred, P_pred, z, h, R, alpha=1e-3, beta=2, kappa=0):
    """UKF update step."""
    n = len(x_pred)
    lam = alpha**2 * (n + kappa) - n

    # Regenerate sigma points from predicted state
    sqrt_P = np.linalg.cholesky((n + lam) * P_pred)
    sigmas = [x_pred.copy()]
    for i in range(n):
        sigmas.append(x_pred + sqrt_P[:, i])
        sigmas.append(x_pred - sqrt_P[:, i])

    wm = [lam / (n + lam)] + [1/(2*(n+lam))] * 2*n
    wc = [lam / (n + lam) + 1 - alpha**2 + beta] + [1/(2*(n+lam))] * 2*n

    # Push through measurement model
    z_sigmas = [h(s) for s in sigmas]
    z_mean = sum(w*zs for w, zs in zip(wm, z_sigmas))

    # Innovation covariance S and cross-covariance Pxz
    S = R.copy()
    Pxz = np.zeros((n, len(z)))
    for w, s, zs in zip(wc, sigmas, z_sigmas):
        dz = zs - z_mean
        S  += w * np.outer(dz, dz)
        dx = s - x_pred
        Pxz += w * np.outer(dx, dz)

    # Kalman gain and update
    K = Pxz @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - z_mean)
    P_new = P_pred - K @ S @ K.T
    return x_new, P_new
Same structure as the KF: Predict grows uncertainty, Update shrinks it. The only difference is how we propagate through nonlinear functions — sigma points instead of matrices or Jacobians.
Check: What additional quantity does the UKF update compute that the standard KF does not?
🏗 Design Challenge You're the Architect: Spacecraft Attitude Estimation ✓ ATTEMPTED
You're designing the attitude determination system for a CubeSat. The spacecraft orientation is represented as a quaternion (4 components, but constrained to unit norm). Sensors: a star tracker (accurate but slow, 1 Hz) and a gyroscope (fast at 100 Hz, but drifts). The processor is a radiation-hardened BAE RAD750 running at 133 MHz with 256 MB RAM.
State representation
Quaternion (4D) + gyro bias (3D) = 7D state, but quaternion has unit-norm constraint
Compute budget
133 MHz RAD750 — each UKF cycle must complete in <5ms at 100 Hz
Sigma points at n=7
2(7)+1 = 15 sigma points per cycle — 15 quaternion propagations
Constraint problem
After weighted recombination, the mean quaternion may not have unit norm
1. The quaternion constraint: do you (a) renormalize after each update, (b) use a Multiplicative EKF (MEKF) with 3D error state, or (c) use a modified UKF that generates sigma points in the tangent space?
2. Multi-rate sensors: the gyro runs 100x faster than the star tracker. Do you (a) run the full UKF at 100 Hz, (b) propagate-only between star tracker updates, or (c) use a separate predict-only loop?
3. Given 133 MHz and 15 sigma points needing quaternion integration, can you meet the 5ms budget? If not, what do you sacrifice?

Real-world solution (used on ISS, Mars rovers, and CubeSats):

1. State representation: The industry standard is the Multiplicative UKF (MUKF). Instead of putting the full quaternion in the state vector, you estimate a 3D rotation error (Modified Rodrigues Parameters or Gibbs vector) that gets composed with a reference quaternion. This keeps the state 6D (3 error angles + 3 gyro biases), avoids the constraint entirely, and reduces sigma points from 15 to 13.

2. Multi-rate: Standard practice is propagate at gyro rate, update at star tracker rate. Between star tracker measurements, you just integrate the gyro (cheap — one quaternion multiplication per step). The full UKF machinery only runs when a star tracker measurement arrives (1 Hz). This is 100x cheaper than running the full filter at 100 Hz.

3. Compute: At 133 MHz, you get ~665,000 cycles per 5ms window. A single quaternion ODE integration takes ~50 FLOPs. 13 sigma points × 50 = 650 FLOPs for propagation — trivial. The bottleneck is the Cholesky decomposition of the 6×6 covariance (~200 FLOPs) and the matrix inversions. Total: ~3,000 FLOPs per update cycle — well within budget. The real concern is radiation-induced bit flips corrupting the covariance matrix, which is why the SR-UKF (guaranteed PSD) is preferred on flight hardware.

Chapter 5: Choosing Parameters

The UKF has three tuning parameters that control how the sigma points are placed and weighted:

ParameterTypical ValueEffect
α10-3 to 1Controls the spread of sigma points. Small α keeps them close to the mean.
β2 (optimal for Gaussians)Incorporates prior knowledge about the distribution. β=2 is optimal when the state is truly Gaussian.
κ0 or 3−nSecondary scaling parameter. κ=3−n ensures positive semi-definiteness of the covariance.
λ = α²(n + κ) − n

The composite parameter λ determines the distance of the sigma points from the mean. Larger λ pushes them further out, capturing more of the tails. Smaller λ clusters them near the center, which is safer for highly nonlinear functions.

Parameter Explorer

See how α, β, and κ affect sigma point placement and weights for n=2. The center weight can even go negative for large λ.

α0.50
β2.0
κ0.0
λ = -1.50 w0m = -3.00
In practice: Many implementations just use α=1, β=2, κ=0 and it works fine. The UKF is surprisingly robust to these parameters. Only tune them if you see numerical issues (e.g., non-positive-definite covariance).
Check: What is the optimal value of beta for Gaussian distributions?
💥 Break-It Lab What Dies When Parameters Go Wrong? ✓ ATTEMPTED
A UKF tracking a target moving in a figure-8 (highly nonlinear). Toggle each failure mode to see what happens to the filter's covariance and tracking error over 200 timesteps.
Force negative center weight (alpha=0.001, kappa=0) ACTIVE
Failure mode: With λ ≈ −n, the center weight w0c becomes large and negative. The weighted covariance can produce a non-positive-definite P — Cholesky fails on the next step. Watch the covariance trace explode then the filter crashes (NaN).
Sigma points too far out (alpha=1, kappa=3) ACTIVE
Failure mode: Large alpha pushes sigma points far from the mean. When the nonlinearity is strong (like our figure-8), far-flung points land in completely different regimes. The reconstructed Gaussian is huge and inaccurate — the filter becomes overconfident in the wrong direction.
Drop velocity sigma points (only propagate position) ACTIVE
Failure mode: If we only generate sigma points for position (not velocity), we lose the cross-covariance between position and velocity. The filter can't anticipate motion — it always lags behind the target, and the covariance grows unbounded because velocity uncertainty is never captured.

Chapter 6: UKF vs EKF Comparison

When the nonlinearity is mild, EKF and UKF produce nearly identical results. But as the function curves more sharply, the EKF's tangent-line approximation falls apart while the UKF's sigma points continue to track accurately. Below, both filters track the same nonlinear system.

Side-by-Side: EKF vs UKF

A target moves in a circle. Both filters use range+bearing measurements. The gray path is truth, red is the EKF, green is the UKF. Increase nonlinearity to see the EKF diverge.

Nonlinearity1.5
Meas. noise0.5
EKF err: 0.00 | UKF err: 0.00
When EKF fails: The EKF linearizes around its current estimate. If that estimate is wrong (because the linearization was bad), the next linearization is even worse. This positive feedback loop causes divergence. The UKF avoids this because it never linearizes.
FeatureEKFUKF
Requires JacobianYesNo
Accuracy1st order2nd order+
Black-box f, hNoYes
ComputationJacobian eval2n+1 function evals
Divergence riskHigherLower

Computational Cost: Concrete Numbers

Both filters share the O(n³) matrix algebra (multiplying P, inverting S). The difference is what else they compute:

State dim nSigma points (2n+1)KF costEKF extraUKF extra
4 (2D tracker)9O(64) matrix ops+ Jacobian derivation+ 9 model evaluations
6 (3D pose)13O(216)+ larger Jacobian+ 13 evaluations
200 (SLAM)401O(8M)+ 200×200 Jacobian+ 401 evaluations
The tradeoff: For small state dimensions (n < 20), the UKF's extra function evaluations are cheap — well worth the accuracy gain. For large state dimensions (n > 100), running the model 201+ times per step gets expensive. But recall: the EKF requires you to derive the Jacobian analytically, which is an engineering cost the UKF avoids entirely. Many practitioners choose the UKF simply because it's easier to implement correctly.
Check: Why does the EKF diverge under strong nonlinearity?
🔗 Pattern Recognition
Two Strategies for the Same Nonlinear Problem
EKF (this lesson's rival)
Linearize the function: f(x) ≈ f(μ) + J(x − μ). Apply exact Gaussian propagation through the linear approximation. Error: O(||x−μ||²).
UKF (this lesson)
Keep the exact function, approximate the distribution: N(μ,P) ≈ {χ0,...,χ2n}. Propagate exact points through exact f. Error: O(||x−μ||³). → EKF lesson

Both are solving the same problem: "How do I propagate uncertainty through a nonlinear function?" The EKF approximates the function to keep the distribution exact. The UKF approximates the distribution to keep the function exact. This is a dual relationship — a recurring pattern in estimation theory where you can choose which side to simplify.

Where else in engineering do you see this "approximate the model vs approximate the data" duality? (Hint: think of variational inference vs sampling methods in machine learning.)

Chapter 7: Square-Root UKF

The standard UKF recomputes the Cholesky decomposition of P at every step. Rounding errors can cause P to lose positive semi-definiteness, crashing the filter. The square-root UKF (SR-UKF) propagates the Cholesky factor S directly, where P = SS¹.

Standard UKF
Store P → Cholesky at each step → risk of non-PSD P
vs
Square-Root UKF
Store S (where P = SS¹) → update S directly via QR/Cholesky updates → P always PSD by construction

The SR-UKF uses QR decomposition and rank-1 Cholesky updates instead of full matrix operations. This guarantees numerical stability and is roughly the same computational cost.

When to use it: If your state dimension is large (n > 10) or your system runs for many thousands of timesteps, prefer the SR-UKF. For small problems or short runs, the standard UKF is usually fine.

Key SR-UKF Operations

OperationStandard UKFSR-UKF
Store uncertaintyP (n×n)S (n×n, lower triangular)
Sigma point spreadchol((n+λ)P)√(n+λ) · S (already available)
Covariance updateWeighted outer productsQR decomposition + Cholesky rank-1 update
Guaranteed PSDNoYes, by construction
Check: What is the main advantage of the square-root UKF?
⚔ Adversarial: Your UKF's covariance matrix becomes non-positive-definite after 200 steps
You're running a standard (non-square-root) UKF with n=6 states, α=0.001, β=2, κ=0. The filter tracks well for 200 steps, then Cholesky decomposition fails with "Matrix is not positive definite." You check: Q and R are valid PSD matrices. The process model is smooth. What's the most likely root cause?
🔨 Derivation Why does storing the Cholesky factor guarantee positive semi-definiteness? ✓ ATTEMPTED

The standard UKF stores P and computes P = Σ wic didiT + Q. This can lose PSD due to floating-point cancellation. The SR-UKF stores S where P = SST and uses QR decomposition + rank-1 Cholesky updates.

Your task: Explain why the QR-based update of S never produces a non-PSD covariance, even with negative weights.

QR factors a matrix A into Q (orthogonal) and R (upper triangular). The key property: RTR = ATA. If we feed the matrix [√w1 d1, √w2 d2, ..., √Q] into QR, we get R such that RTR = Σ wi didiT + Q. But only for positive weights.
The negative center weight is handled via a rank-1 Cholesky downdate: given S such that SST = P, find S' such that S'S'T = P − vvT. The downdate algorithm (cholupdate in MATLAB) computes this directly by modifying S's entries. If the result would be non-PSD, the algorithm explicitly fails — you catch the error instead of silently propagating garbage.
Summing wi didiT involves catastrophic cancellation: adding large positive terms then subtracting a large negative term. The residual can have negative eigenvalues due to floating-point errors. The QR approach never forms the full sum — it maintains the factored form throughout, so P = SST is PSD by construction (product of a matrix with its transpose is always PSD).

Full derivation:

Step 1 (QR for positive weights): Form the compound matrix A = [√w1(f(χ1)−μ), ..., √w2n(f(χ2n)−μ), √Q]. Take QR: A = QR. Then S = RT (lower triangular), and SST = RTR = ATA = Σi=1..2n wi didiT + Q. This is always PSD because ATA is always PSD.

Step 2 (Cholesky downdate for w0): If w0c < 0, we perform: S ← choldowndate(S, √|w0c| · d0). This subtracts the rank-1 term |w0c| d0d0T from SST while maintaining the triangular factor. If the subtraction would make P non-PSD, the algorithm raises an error — a detectable failure, not silent corruption.

Step 3 (why this is numerically superior): The standard approach computes P = big_positive − big_negative + Q. With 64-bit floats, if big_positive ≈ 106 and the true residual is O(1), you lose ~6 digits of precision. The QR approach never forms this sum — it works directly on the factor, preserving full precision.

The key insight: "PSD by construction" means the representation cannot express a non-PSD matrix. SST is PSD for any real S. By never converting back to full P, we never lose this structural guarantee.

Chapter 8: Connections

The UKF sits in a family of estimation algorithms, each making different tradeoffs between accuracy, generality, and computational cost. Understanding where it fits helps you choose the right tool.

FilterHandlesApproachCost
Kalman FilterLinear + GaussianExact matrix equationsVery low
EKFMildly nonlinearJacobian linearizationLow
UKFModerately nonlinearSigma pointsMedium
Particle FilterAny distributionMonte Carlo samplingHigh
The estimation ladder: Start with the KF if your system is linear. Move to the EKF if it's mildly nonlinear and you can compute Jacobians. Use the UKF when Jacobians are unavailable or the nonlinearity is too strong. Resort to particle filters when the distribution is truly non-Gaussian (multimodal, heavy-tailed).
UKF in SLAM: The UKF is widely used in simultaneous localization and mapping (SLAM), where a robot must estimate its own position while building a map of the environment. The measurement model (range and bearing to landmarks) is highly nonlinear, making the UKF a natural choice over the EKF.
🔗 Pattern Recognition
Sigma Points Are Deterministic Particles
UKF (this lesson)
2n+1 deterministic points, weights from moment-matching. Captures mean + covariance exactly. Assumes output is Gaussian.
Particle Filter
N random samples, weights from likelihood. Captures arbitrary distributions. No Gaussian assumption. → Bayes Filter lesson

Both represent a distribution as weighted point masses. The UKF uses the minimum points for 2nd-order accuracy; the particle filter uses many points for asymptotic accuracy. As you add more sigma points with higher-order quadrature rules (cubature Kalman filter, Gauss-Hermite filter), you approach the particle filter from the "structured" side. They're the same idea at different points on the compute-vs-accuracy tradeoff.

If you had a 3-state system but suspected bimodal distributions, would you reach for a UKF with more sigma points, or switch to a particle filter? At what dimensionality does this decision flip?

Continue learning:

"It is easier to approximate a probability distribution than to approximate an arbitrary nonlinear function or transformation."
— Simon Julier & Jeffrey Uhlmann

You now understand the unscented Kalman filter. No Jacobians, no linearization — just carefully chosen points that capture the shape of uncertainty.