microUKF — The Unscented Kalman Filter

Chapter 0: Beyond Linearization

The Extended Kalman Filter (EKF) handles nonlinear systems by computing Jacobians — local tangent-line approximations. It works, but it has real problems: you need to derive the Jacobian analytically, and the first-order linear approximation can be wildly inaccurate when the function curves sharply.

In 1995, Simon Julier and Jeffrey Uhlmann had a striking insight: it's easier to approximate a probability distribution than to approximate an arbitrary nonlinear function. Instead of linearizing the function, why not pick a small set of sample points that capture the distribution's statistics, push them through the exact nonlinear function, and reconstruct a Gaussian from the results?

The core idea: Approximating a nonlinear function with a tangent line is fragile. Approximating a Gaussian with a handful of carefully chosen points is robust. The UKF approximates the distribution, not the function.

EKF Approach

Linearize f(x) with Jacobian → propagate Gaussian through the linear approximation

UKF Approach

Pick sigma points → propagate each through exact f(x) → reconstruct Gaussian

Why "unscented"? Uhlmann reportedly chose the name because it "doesn't stink" — a tongue-in-cheek jab at the EKF's linearization assumptions. The name has no mathematical meaning.

Check: What does the UKF approximate?

The nonlinear function, using a tangent line The probability distribution, using sample points The measurement noise, using Monte Carlo

Chapter 1: Sigma Points

A Gaussian in n dimensions is fully described by its mean (n numbers) and covariance (an n×n matrix). The UKF selects 2n+1 deterministic sample points — called sigma points — that exactly match these first two moments. For a 2D state, that's just 5 points.

The sigma points are the mean itself, plus pairs of points offset along each eigenvector of the covariance, scaled by a factor that depends on the tuning parameters α, β, and κ.

χ₀ = μ χ_i = μ + (√((n+λ)P))_i χ_i+n = μ − (√((n+λ)P))_i

where λ = α²(n + κ) − n, and (√((n+λ)P))_i is the i-th column of the matrix square root of (n+λ)P.

Interactive Sigma Points

A 2D Gaussian with 2n+1 = 5 sigma points. Drag the sliders to change the covariance shape. The orange dots are sigma points; the teal ellipse is the 2-sigma contour.

Var X (σ²_x)1.5

Var Y (σ²_y)0.8

Correlation ρ0.4

Sigma Points Concretely: The Cholesky Connection

The formula says √((n+λ)P). What does the "square root" of a matrix even mean? It means the Cholesky decomposition: finding a lower-triangular matrix L such that LL¹ = (n+λ)P. In NumPy, one line:

Python
import numpy as np

n = 4   # 2D tracker: [px, vx, py, vy]
alpha, kappa = 1e-3, 0
lam = alpha**2 * (n + kappa) - n  # ≈ -4.0

# Cholesky decomposition gives the "matrix square root"
L = np.linalg.cholesky((n + lam) * P)  # L is [4,4]

# Generate 2n+1 = 9 sigma points
sigmas = [x.copy()]                    # χ₀ = mean
for i in range(n):
    sigmas.append(x + L[:, i])         # χᵢ = mean + i-th column of L
    sigmas.append(x - L[:, i])         # χᵢ₊ₙ = mean - i-th column of L
# sigmas has 9 points, each is [4,1]

Why exactly 2n+1? It's the minimum number of points needed to capture the mean and covariance of an n-dimensional Gaussian exactly. The center point captures the mean. Each pair of symmetric points (one in the +L_i direction, one in the −L_i direction) captures one axis of the covariance ellipsoid. For n=4: 1 center + 4 pairs = 9 points. More points would help capture higher-order moments (skewness, kurtosis) — but that's the particle filter's job.

Key property: The sigma points are not random samples. They are deterministic. The same input always produces the same points. This makes the UKF reproducible and avoids the sampling noise of particle filters.

Check: For a state of dimension n=3, how many sigma points does the UKF use?

3 6 7 (2n+1)

🔨 Derivation Derive the sigma point weights from moment-matching constraints ▶ ✓ ATTEMPTED

We have 2n+1 sigma points: one at the mean μ, and n symmetric pairs at μ ± (√((n+λ)P))_i. Each point has a mean weight w_i^m and a covariance weight w_i^c.

Your task: Derive the weight formulas by enforcing two constraints: (1) the weighted sum of sigma points equals the mean, and (2) the weighted sum of outer products equals the covariance P.

Write Σ w_i^m χ_i = μ. Expand by substituting χ₀ = μ, χ_i = μ + L_i, χ_i+n = μ − L_i. The symmetric pairs cancel in pairs (L_i − L_i = 0), leaving just the requirement that all weights sum to 1: w₀^m + 2n · w_i^m = 1.

Write Σ w_i^c (χ_i − μ)(χ_i − μ)^T = P. The center point contributes zero (since χ₀ − μ = 0). Each pair contributes 2 w_i^c L_i L_i^T. Since L is from chol((n+λ)P), we get L L^T = (n+λ)P. So: 2n · w_i^c · (n+λ)P = P, giving w_i^c = 1/(2(n+λ)).

The center point's covariance weight w₀^c has an extra term: w₀^c = w₀^m + (1 − α² + β). This β term captures prior knowledge about the fourth moment (kurtosis) of the distribution. For a Gaussian, kurtosis = 3, and β = 2 is optimal because it makes the covariance estimate exact for the Gaussian case. This isn't derivable from first/second moment matching alone — it's a tuning knob for higher-order accuracy.

Full derivation:

Step 1 (mean constraint): Σ w_i^m χ_i = w₀^mμ + Σ_i=1..n w_i^m(μ + L_i) + Σ_i=1..n w_i+n^m(μ − L_i) = μ. Since the symmetric terms cancel: (w₀^m + 2n · w_i^m) · μ = μ, so weights sum to 1.

Step 2 (covariance constraint): Σ w_i^c(χ_i−μ)(χ_i−μ)^T = 2 · n · w_i^c · L_iL_i^T. But L comes from chol((n+λ)P), so each column satisfies L_iL_i^T contribution sums to (n+λ)P. Thus 2n · w_i^c · (n+λ) = 1, giving w_i^c = 1/(2(n+λ)).

Step 3 (solve for w₀^m): From sum-to-1: w₀^m = 1 − 2n/(2(n+λ)) = 1 − n/(n+λ) = λ/(n+λ).

The key insight: The satellite weights are completely determined by the covariance constraint alone. The center weight is the leftover from the sum-to-1 constraint. This is why the center weight can go negative (when λ < 0) — it's not a probability, it's a moment-matching coefficient from Mercer's theorem on quadrature rules.

Checkpoint — Before you move on

Explain in your own words: why do we need exactly 2n+1 sigma points (not fewer, not more) to capture a Gaussian's first two moments? What would go wrong with 2n points (no center)?

✓ Gate cleared

Model Answer

The n symmetric pairs capture n axes of the covariance ellipsoid — each pair spans one eigenvector direction. Without the center point, the weights must still sum to 1 for the mean constraint, but then 2n · w_i^m = 1 forces w_i^m = 1/(2n). Now the covariance constraint becomes: 2n · (1/(2n)) · (n+λ)P = (n+λ)P ≠ P unless λ = 1−n. You lose a free parameter (λ) that controls the spread, meaning you can't independently tune how far the sigma points reach into the tails. The center point adds one more degree of freedom, decoupling the spread (controlled by λ) from the weight distribution. Fewer than 2n+1 points is under-determined for both moments; more than 2n+1 is over-determined (helpful for higher moments — that's cubature Kalman filters).

Chapter 2: The Unscented Transform

The unscented transform is the heart of the UKF. Given a Gaussian and a nonlinear function, it approximates the output distribution in three steps:

Generate sigma points from the input Gaussian.
Propagate each sigma point through the nonlinear function.
Reconstruct the output Gaussian from the weighted transformed points.

μ_y = Σ w_i^m · f(χ_i) P_y = Σ w_i^c (f(χ_i) − μ_y)(f(χ_i) − μ_y)¹

Each sigma point carries two weights: w^m for the mean and w^c for the covariance. The center point typically gets a different weight than the satellite points.

Unscented Transform: Banana Nonlinearity

Sigma points from a 2D Gaussian pass through a nonlinear "banana" transform. The teal ellipse is the input. The orange points become purple points after the transform. The green ellipse is the reconstructed output Gaussian.

Curvature0.20

Input spread σ1.20

The Weighted Recombination in NumPy

After pushing all sigma points through f, we reconstruct a Gaussian by computing the weighted mean and covariance:

Python
# sigmas_f = [f(χ₀), f(χ₁), ..., f(χ₂ₙ)]  — transformed points
# wm, wc = weight arrays (length 2n+1)

# Weighted mean: just a weighted average
mu_out = sum(wm[i] * sigmas_f[i] for i in range(2*n+1))

# Weighted covariance: outer products of deviations
P_out = np.zeros((n, n))
for i in range(2*n+1):
    d = sigmas_f[i] - mu_out
    P_out += wc[i] * np.outer(d, d)

What's different from Monte Carlo? A particle filter uses thousands of random samples and equal weights. The UKF uses just 2n+1 deterministic samples with carefully computed weights. Fewer points, but each one is placed exactly where it needs to be. This is why the UKF is fast enough for real-time applications where particle filters are too expensive.

Why it works: A Gaussian passed through a nonlinear function is no longer Gaussian. But the unscented transform captures the true mean and covariance to at least second-order accuracy — one order better than the EKF's first-order Jacobian approximation.

Check: What accuracy does the unscented transform achieve for Gaussian inputs?

First-order (same as EKF) At least second-order (better than EKF) Exact

🔨 Derivation Why the unscented transform is exact for linear systems ▶ ✓ ATTEMPTED

Consider a linear function f(x) = Ax + b applied to a Gaussian with mean μ and covariance P. The true output distribution has mean Aμ + b and covariance APA^T.

Your task: Show that the unscented transform recovers the exact output mean and covariance for any linear f — proving it's at least as good as the standard Kalman filter for linear systems.

f(χ_i) = Aχ_i + b. For the center: Aμ + b. For satellite i: A(μ + L_i) + b = Aμ + b + AL_i. The transformed sigma points are just the original sigma points for a Gaussian with mean Aμ+b and "square root" AL.

μ_y = Σ w_i^m (Aχ_i + b) = A(Σ w_i^m χ_i) + b · (Σ w_i^m). Since the weights match the input mean and sum to 1: μ_y = Aμ + b. Exact!

P_y = Σ w_i^c (f(χ_i) − μ_y)(f(χ_i) − μ_y)^T = Σ w_i^c A(χ_i − μ)(χ_i − μ)^TA^T = A [Σ w_i^c (χ_i−μ)(χ_i−μ)^T] A^T = APA^T. Since we designed the weights to match P exactly!

Full derivation:

Mean: μ_y = Σ w_i^m f(χ_i) = Σ w_i^m (Aχ_i + b) = A(Σ w_i^m χ_i) + b = Aμ + b. This is the exact linear transform of the mean.

Covariance: Each deviation is f(χ_i) − μ_y = A(χ_i − μ). So P_y = Σ w_i^c A(χ_i−μ)(χ_i−μ)^TA^T = A[Σ w_i^c(χ_i−μ)(χ_i−μ)^T]A^T = APA^T.

For nonlinear f: Taylor-expand f(χ_i) = f(μ) + J(χ_i−μ) + ½H(χ_i−μ)² + ... The UT captures the first-order term exactly (same as EKF) AND captures the second-order Hessian term because the sigma points sample the curvature. The EKF drops everything beyond the Jacobian J.

The key insight: Linearity is the special case where the UT is trivially exact. For nonlinear functions, the sigma points "probe" the curvature in each direction, capturing second-order effects that a single tangent line (Jacobian) misses entirely.

💻 Build It Implement the Unscented Transform from Scratch ▶ ✓ ATTEMPTED

You've seen the sigma point generation and the weighted recombination. Now implement the full unscented transform: given a mean, covariance, and nonlinear function, output the transformed mean and covariance.

signature def unscented_transform(mu, P, f, alpha=1e-3, beta=2, kappa=0): """ Apply the unscented transform to propagate a Gaussian through f. Args: mu: np.array, shape (n,) — input mean P: np.array, shape (n, n) — input covariance f: callable, takes (n,) returns (m,) — nonlinear function alpha, beta, kappa: UKF tuning parameters Returns: mu_y: np.array, shape (m,) — output mean P_y: np.array, shape (m, m) — output covariance P_xy: np.array, shape (n, m) — cross-covariance """

Test case

mu = np.array([1.0, 2.0])
P = np.array([[1.0, 0.3], [0.3, 0.5]])
f = lambda x: np.array([x[0]**2, x[0]*x[1]]) # nonlinear!
mu_y, P_y, P_xy = unscented_transform(mu, P, f)
# mu_y should be approximately [2.0, 2.0] (since E[x²] = mu² + var)
# mu_y[0] ≈ 1² + 1.0 = 2.0

The cross-covariance uses deviations from the input mean (not the output mean) in one factor: P_xy = Σ w_i^c (χ_i − μ)(f(χ_i) − μ_y)^T. This is needed for the UKF update step to compute the Kalman gain.

python
import numpy as np

def unscented_transform(mu, P, f, alpha=1e-3, beta=2, kappa=0):
    n = len(mu)
    lam = alpha**2 * (n + kappa) - n

    # Generate sigma points
    sqrt_P = np.linalg.cholesky((n + lam) * P)
    sigmas = np.zeros((2*n+1, n))
    sigmas[0] = mu
    for i in range(n):
        sigmas[i+1]   = mu + sqrt_P[:, i]
        sigmas[i+n+1] = mu - sqrt_P[:, i]

    # Weights
    wm = np.full(2*n+1, 1/(2*(n+lam)))
    wc = np.full(2*n+1, 1/(2*(n+lam)))
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)

    # Propagate
    sigmas_f = np.array([f(s) for s in sigmas])
    m = sigmas_f.shape[1]

    # Weighted mean
    mu_y = wm @ sigmas_f  # (m,)

    # Weighted covariance and cross-covariance
    P_y = np.zeros((m, m))
    P_xy = np.zeros((n, m))
    for i in range(2*n+1):
        dy = sigmas_f[i] - mu_y
        dx = sigmas[i] - mu
        P_y  += wc[i] * np.outer(dy, dy)
        P_xy += wc[i] * np.outer(dx, dy)

    return mu_y, P_y, P_xy

Bonus challenge: Modify this to return the propagated sigma points as well (needed for the augmented UKF which handles correlated process/measurement noise).

Chapter 3: UKF Predict

The predict step applies the unscented transform to the process model f(x). We generate sigma points from the current belief, push them through f, and reconstruct the predicted Gaussian. Then add process noise Q.

1. Generate Sigma Points

χ = sigmaPoints(μ, P)

↓

2. Propagate

χ_i* = f(χ_i) for each i

↓

3. Reconstruct

μ¯ = Σ w_i^m χ_i* P¯ = Σ w_i^c (χ_i* − μ¯)(χ_i* − μ¯)¹ + Q

UKF Predict in Python

Python
import numpy as np

def ukf_predict(x, P, f, Q, alpha=1e-3, beta=2, kappa=0):
    """UKF predict step."""
    n = len(x)
    lam = alpha**2 * (n + kappa) - n

    # Generate sigma points
    sqrt_P = np.linalg.cholesky((n + lam) * P)
    sigmas = [x.copy()]
    for i in range(n):
        sigmas.append(x + sqrt_P[:, i])
        sigmas.append(x - sqrt_P[:, i])

    # Weights
    wm = [lam / (n + lam)] + [1 / (2*(n + lam))] * 2*n
    wc = [lam / (n + lam) + (1 - alpha**2 + beta)] + [1 / (2*(n + lam))] * 2*n

    # Propagate through f
    sigmas_f = [f(s) for s in sigmas]

    # Reconstruct mean and covariance
    x_pred = sum(w * s for w, s in zip(wm, sigmas_f))
    P_pred = Q.copy()
    for i, (w, s) in enumerate(zip(wc, sigmas_f)):
        d = s - x_pred
        P_pred += w * np.outer(d, d)

    return x_pred, P_pred

Concrete Example: 2D Tracker Predict

For a 2D tracker (n=4), the predict step generates 9 sigma points, pushes each through the motion model, and recombines. Here's the data flow with actual shapes:

Step	Operation	Shape
1	Cholesky: L = chol((n+λ)P)	[4, 4] lower-triangular
2	Generate 2n+1 = 9 sigma points	9 vectors of [4, 1]
3	Propagate each: χ_i* = f(χ_i)	9 vectors of [4, 1]
4	Weighted mean → x̄	[4, 1]
5	Weighted outer products → P̄	[4, 4]
6	Add Q → P̄ + Q	[4, 4]

Notice: No Jacobians anywhere! The function f is used as a black box. This means you can use the UKF even when f is a simulation, a neural network, or any function you can evaluate but cannot differentiate.

Check: In the UKF predict step, what happens to the sigma points?

Each sigma point is passed through the exact nonlinear function f They are multiplied by the Jacobian of f They are randomly sampled from the prior

Chapter 4: UKF Update

The update step is similar: generate sigma points from the predicted state, push them through the measurement model h(x), and compute the cross-covariance between state and measurement spaces. This gives us the Kalman gain.

1. Sigma Points

χ = sigmaPoints(μ¯, P¯)

↓

2. Predicted Measurements

Z_i = h(χ_i) for each i

↓

3. Measurement Statistics

μ_z = Σ w_i^m Z_i S = Σ w_i^c (Z_i − μ_z)(Z_i − μ_z)¹ + R

↓

4. Cross-Covariance & Gain

P_xz = Σ w_i^c (χ_i − μ¯)(Z_i − μ_z)¹ K = P_xz S¯¹

↓

5. Update

μ = μ¯ + K(z − μ_z) P = P¯ − K S K¹

UKF Update in Python

Python
def ukf_update(x_pred, P_pred, z, h, R, alpha=1e-3, beta=2, kappa=0):
    """UKF update step."""
    n = len(x_pred)
    lam = alpha**2 * (n + kappa) - n

    # Regenerate sigma points from predicted state
    sqrt_P = np.linalg.cholesky((n + lam) * P_pred)
    sigmas = [x_pred.copy()]
    for i in range(n):
        sigmas.append(x_pred + sqrt_P[:, i])
        sigmas.append(x_pred - sqrt_P[:, i])

    wm = [lam / (n + lam)] + [1/(2*(n+lam))] * 2*n
    wc = [lam / (n + lam) + 1 - alpha**2 + beta] + [1/(2*(n+lam))] * 2*n

    # Push through measurement model
    z_sigmas = [h(s) for s in sigmas]
    z_mean = sum(w*zs for w, zs in zip(wm, z_sigmas))

    # Innovation covariance S and cross-covariance Pxz
    S = R.copy()
    Pxz = np.zeros((n, len(z)))
    for w, s, zs in zip(wc, sigmas, z_sigmas):
        dz = zs - z_mean
        S  += w * np.outer(dz, dz)
        dx = s - x_pred
        Pxz += w * np.outer(dx, dz)

    # Kalman gain and update
    K = Pxz @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - z_mean)
    P_new = P_pred - K @ S @ K.T
    return x_new, P_new

Same structure as the KF: Predict grows uncertainty, Update shrinks it. The only difference is how we propagate through nonlinear functions — sigma points instead of matrices or Jacobians.

Check: What additional quantity does the UKF update compute that the standard KF does not?

Cross-covariance P_xz between state and measurement sigma points The Jacobian of h A random seed for the sigma points

🏗 Design Challenge You're the Architect: Spacecraft Attitude Estimation ▶ ✓ ATTEMPTED

You're designing the attitude determination system for a CubeSat. The spacecraft orientation is represented as a quaternion (4 components, but constrained to unit norm). Sensors: a star tracker (accurate but slow, 1 Hz) and a gyroscope (fast at 100 Hz, but drifts). The processor is a radiation-hardened BAE RAD750 running at 133 MHz with 256 MB RAM.

State representation

Quaternion (4D) + gyro bias (3D) = 7D state, but quaternion has unit-norm constraint

Compute budget

133 MHz RAD750 — each UKF cycle must complete in <5ms at 100 Hz

Sigma points at n=7

2(7)+1 = 15 sigma points per cycle — 15 quaternion propagations

Constraint problem

After weighted recombination, the mean quaternion may not have unit norm

1. The quaternion constraint: do you (a) renormalize after each update, (b) use a Multiplicative EKF (MEKF) with 3D error state, or (c) use a modified UKF that generates sigma points in the tangent space?

2. Multi-rate sensors: the gyro runs 100x faster than the star tracker. Do you (a) run the full UKF at 100 Hz, (b) propagate-only between star tracker updates, or (c) use a separate predict-only loop?

3. Given 133 MHz and 15 sigma points needing quaternion integration, can you meet the 5ms budget? If not, what do you sacrifice?

Real-world solution (used on ISS, Mars rovers, and CubeSats):

1. State representation: The industry standard is the Multiplicative UKF (MUKF). Instead of putting the full quaternion in the state vector, you estimate a 3D rotation error (Modified Rodrigues Parameters or Gibbs vector) that gets composed with a reference quaternion. This keeps the state 6D (3 error angles + 3 gyro biases), avoids the constraint entirely, and reduces sigma points from 15 to 13.

2. Multi-rate: Standard practice is propagate at gyro rate, update at star tracker rate. Between star tracker measurements, you just integrate the gyro (cheap — one quaternion multiplication per step). The full UKF machinery only runs when a star tracker measurement arrives (1 Hz). This is 100x cheaper than running the full filter at 100 Hz.

3. Compute: At 133 MHz, you get ~665,000 cycles per 5ms window. A single quaternion ODE integration takes ~50 FLOPs. 13 sigma points × 50 = 650 FLOPs for propagation — trivial. The bottleneck is the Cholesky decomposition of the 6×6 covariance (~200 FLOPs) and the matrix inversions. Total: ~3,000 FLOPs per update cycle — well within budget. The real concern is radiation-induced bit flips corrupting the covariance matrix, which is why the SR-UKF (guaranteed PSD) is preferred on flight hardware.

Chapter 5: Choosing Parameters

The UKF has three tuning parameters that control how the sigma points are placed and weighted:

Parameter	Typical Value	Effect
α	10^-3 to 1	Controls the spread of sigma points. Small α keeps them close to the mean.
β	2 (optimal for Gaussians)	Incorporates prior knowledge about the distribution. β=2 is optimal when the state is truly Gaussian.
κ	0 or 3−n	Secondary scaling parameter. κ=3−n ensures positive semi-definiteness of the covariance.

λ = α²(n + κ) − n

The composite parameter λ determines the distance of the sigma points from the mean. Larger λ pushes them further out, capturing more of the tails. Smaller λ clusters them near the center, which is safer for highly nonlinear functions.

Parameter Explorer

See how α, β, and κ affect sigma point placement and weights for n=2. The center weight can even go negative for large λ.

α0.50

β2.0

κ0.0

      λ = -1.50
      w0m = -3.00
    

In practice: Many implementations just use α=1, β=2, κ=0 and it works fine. The UKF is surprisingly robust to these parameters. Only tune them if you see numerical issues (e.g., non-positive-definite covariance).

Check: What is the optimal value of beta for Gaussian distributions?

0 2 It depends on the dimensionality n

💥 Break-It Lab What Dies When Parameters Go Wrong? ▶ ✓ ATTEMPTED

A UKF tracking a target moving in a figure-8 (highly nonlinear). Toggle each failure mode to see what happens to the filter's covariance and tracking error over 200 timesteps.

Force negative center weight (alpha=0.001, kappa=0) ACTIVE

Failure mode: With λ ≈ −n, the center weight w₀^c becomes large and negative. The weighted covariance can produce a non-positive-definite P — Cholesky fails on the next step. Watch the covariance trace explode then the filter crashes (NaN).

Sigma points too far out (alpha=1, kappa=3) ACTIVE

Failure mode: Large alpha pushes sigma points far from the mean. When the nonlinearity is strong (like our figure-8), far-flung points land in completely different regimes. The reconstructed Gaussian is huge and inaccurate — the filter becomes overconfident in the wrong direction.

Drop velocity sigma points (only propagate position) ACTIVE

Failure mode: If we only generate sigma points for position (not velocity), we lose the cross-covariance between position and velocity. The filter can't anticipate motion — it always lags behind the target, and the covariance grows unbounded because velocity uncertainty is never captured.

Chapter 6: UKF vs EKF Comparison

When the nonlinearity is mild, EKF and UKF produce nearly identical results. But as the function curves more sharply, the EKF's tangent-line approximation falls apart while the UKF's sigma points continue to track accurately. Below, both filters track the same nonlinear system.

Side-by-Side: EKF vs UKF

A target moves in a circle. Both filters use range+bearing measurements. The gray path is truth, red is the EKF, green is the UKF. Increase nonlinearity to see the EKF diverge.

Nonlinearity1.5

Meas. noise0.5

EKF err: 0.00 | UKF err: 0.00

When EKF fails: The EKF linearizes around its current estimate. If that estimate is wrong (because the linearization was bad), the next linearization is even worse. This positive feedback loop causes divergence. The UKF avoids this because it never linearizes.

Feature	EKF	UKF
Requires Jacobian	Yes	No
Accuracy	1st order	2nd order+
Black-box f, h	No	Yes
Computation	Jacobian eval	2n+1 function evals
Divergence risk	Higher	Lower

Computational Cost: Concrete Numbers

Both filters share the O(n³) matrix algebra (multiplying P, inverting S). The difference is what else they compute:

State dim n	Sigma points (2n+1)	KF cost	EKF extra	UKF extra
4 (2D tracker)	9	O(64) matrix ops	+ Jacobian derivation	+ 9 model evaluations
6 (3D pose)	13	O(216)	+ larger Jacobian	+ 13 evaluations
200 (SLAM)	401	O(8M)	+ 200×200 Jacobian	+ 401 evaluations

The tradeoff: For small state dimensions (n < 20), the UKF's extra function evaluations are cheap — well worth the accuracy gain. For large state dimensions (n > 100), running the model 201+ times per step gets expensive. But recall: the EKF requires you to derive the Jacobian analytically, which is an engineering cost the UKF avoids entirely. Many practitioners choose the UKF simply because it's easier to implement correctly.

Check: Why does the EKF diverge under strong nonlinearity?

Linearization errors compound: bad estimate leads to worse linearization It runs out of sigma points The Kalman gain becomes infinite

🔗 Pattern Recognition

Two Strategies for the Same Nonlinear Problem

EKF (this lesson's rival)

Linearize the function: f(x) ≈ f(μ) + J(x − μ). Apply exact Gaussian propagation through the linear approximation. Error: O(||x−μ||²).

UKF (this lesson)

Keep the exact function, approximate the distribution: N(μ,P) ≈ {χ₀,...,χ_2n}. Propagate exact points through exact f. Error: O(||x−μ||³). → EKF lesson

Both are solving the same problem: "How do I propagate uncertainty through a nonlinear function?" The EKF approximates the function to keep the distribution exact. The UKF approximates the distribution to keep the function exact. This is a dual relationship — a recurring pattern in estimation theory where you can choose which side to simplify.

Where else in engineering do you see this "approximate the model vs approximate the data" duality? (Hint: think of variational inference vs sampling methods in machine learning.)

Chapter 7: Square-Root UKF

The standard UKF recomputes the Cholesky decomposition of P at every step. Rounding errors can cause P to lose positive semi-definiteness, crashing the filter. The square-root UKF (SR-UKF) propagates the Cholesky factor S directly, where P = SS¹.

Standard UKF

Store P → Cholesky at each step → risk of non-PSD P

Square-Root UKF

Store S (where P = SS¹) → update S directly via QR/Cholesky updates → P always PSD by construction

The SR-UKF uses QR decomposition and rank-1 Cholesky updates instead of full matrix operations. This guarantees numerical stability and is roughly the same computational cost.

When to use it: If your state dimension is large (n > 10) or your system runs for many thousands of timesteps, prefer the SR-UKF. For small problems or short runs, the standard UKF is usually fine.

Key SR-UKF Operations

Operation	Standard UKF	SR-UKF
Store uncertainty	P (n×n)	S (n×n, lower triangular)
Sigma point spread	chol((n+λ)P)	√(n+λ) · S (already available)
Covariance update	Weighted outer products	QR decomposition + Cholesky rank-1 update
Guaranteed PSD	No	Yes, by construction

Check: What is the main advantage of the square-root UKF?

It uses fewer sigma points Guaranteed numerical stability (covariance stays positive semi-definite) It handles non-Gaussian noise

⚔ Adversarial: Your UKF's covariance matrix becomes non-positive-definite after 200 steps

You're running a standard (non-square-root) UKF with n=6 states, α=0.001, β=2, κ=0. The filter tracks well for 200 steps, then Cholesky decomposition fails with "Matrix is not positive definite." You check: Q and R are valid PSD matrices. The process model is smooth. What's the most likely root cause?

The process noise Q is too small Negative center weight (w₀^c < 0) accumulates numerical error in P = Σ w_i^c d_id_i^T until a diagonal goes negative The measurement noise R is too large The state dimension is odd, which breaks symmetry

🔨 Derivation Why does storing the Cholesky factor guarantee positive semi-definiteness? ▶ ✓ ATTEMPTED

The standard UKF stores P and computes P = Σ w_i^c d_id_i^T + Q. This can lose PSD due to floating-point cancellation. The SR-UKF stores S where P = SS^T and uses QR decomposition + rank-1 Cholesky updates.

Your task: Explain why the QR-based update of S never produces a non-PSD covariance, even with negative weights.

QR factors a matrix A into Q (orthogonal) and R (upper triangular). The key property: R^TR = A^TA. If we feed the matrix [√w₁ d₁, √w₂ d₂, ..., √Q] into QR, we get R such that R^TR = Σ w_i d_id_i^T + Q. But only for positive weights.

The negative center weight is handled via a rank-1 Cholesky downdate: given S such that SS^T = P, find S' such that S'S'^T = P − vv^T. The downdate algorithm (cholupdate in MATLAB) computes this directly by modifying S's entries. If the result would be non-PSD, the algorithm explicitly fails — you catch the error instead of silently propagating garbage.

Summing w_i d_id_i^T involves catastrophic cancellation: adding large positive terms then subtracting a large negative term. The residual can have negative eigenvalues due to floating-point errors. The QR approach never forms the full sum — it maintains the factored form throughout, so P = SS^T is PSD by construction (product of a matrix with its transpose is always PSD).

Full derivation:

Step 1 (QR for positive weights): Form the compound matrix A = [√w₁(f(χ₁)−μ), ..., √w_2n(f(χ_2n)−μ), √Q]. Take QR: A = QR. Then S = R^T (lower triangular), and SS^T = R^TR = A^TA = Σ_i=1..2n w_i d_id_i^T + Q. This is always PSD because A^TA is always PSD.

Step 2 (Cholesky downdate for w₀): If w₀^c < 0, we perform: S ← choldowndate(S, √|w₀^c| · d₀). This subtracts the rank-1 term |w₀^c| d₀d₀^T from SS^T while maintaining the triangular factor. If the subtraction would make P non-PSD, the algorithm raises an error — a detectable failure, not silent corruption.

Step 3 (why this is numerically superior): The standard approach computes P = big_positive − big_negative + Q. With 64-bit floats, if big_positive ≈ 10⁶ and the true residual is O(1), you lose ~6 digits of precision. The QR approach never forms this sum — it works directly on the factor, preserving full precision.

The key insight: "PSD by construction" means the representation cannot express a non-PSD matrix. SS^T is PSD for any real S. By never converting back to full P, we never lose this structural guarantee.

Chapter 8: Connections

The UKF sits in a family of estimation algorithms, each making different tradeoffs between accuracy, generality, and computational cost. Understanding where it fits helps you choose the right tool.

Filter	Handles	Approach	Cost
Kalman Filter	Linear + Gaussian	Exact matrix equations	Very low
EKF	Mildly nonlinear	Jacobian linearization	Low
UKF	Moderately nonlinear	Sigma points	Medium
Particle Filter	Any distribution	Monte Carlo sampling	High

The estimation ladder: Start with the KF if your system is linear. Move to the EKF if it's mildly nonlinear and you can compute Jacobians. Use the UKF when Jacobians are unavailable or the nonlinearity is too strong. Resort to particle filters when the distribution is truly non-Gaussian (multimodal, heavy-tailed).

UKF in SLAM: The UKF is widely used in simultaneous localization and mapping (SLAM), where a robot must estimate its own position while building a map of the environment. The measurement model (range and bearing to landmarks) is highly nonlinear, making the UKF a natural choice over the EKF.

🔗 Pattern Recognition

Sigma Points Are Deterministic Particles

UKF (this lesson)

2n+1 deterministic points, weights from moment-matching. Captures mean + covariance exactly. Assumes output is Gaussian.

Particle Filter

N random samples, weights from likelihood. Captures arbitrary distributions. No Gaussian assumption. → Bayes Filter lesson

Both represent a distribution as weighted point masses. The UKF uses the minimum points for 2nd-order accuracy; the particle filter uses many points for asymptotic accuracy. As you add more sigma points with higher-order quadrature rules (cubature Kalman filter, Gauss-Hermite filter), you approach the particle filter from the "structured" side. They're the same idea at different points on the compute-vs-accuracy tradeoff.

If you had a 3-state system but suspected bimodal distributions, would you reach for a UKF with more sigma points, or switch to a particle filter? At what dimensionality does this decision flip?

Continue learning:

Kalman Filter — the linear foundation everything builds on
Extended Kalman Filter — the Jacobian-based rival
Bayes Filter — the general recursive estimation framework
Hidden Markov Models — discrete-state estimation

"It is easier to approximate a probability distribution than to approximate an arbitrary nonlinear function or transformation."

— Simon Julier & Jeffrey Uhlmann

You now understand the unscented Kalman filter. No Jacobians, no linearization — just carefully chosen points that capture the shape of uncertainty.

Understand the UnscentedKalman Filter

Chapter 0: Beyond Linearization

Chapter 1: Sigma Points

Sigma Points Concretely: The Cholesky Connection

Chapter 2: The Unscented Transform

The Weighted Recombination in NumPy

Chapter 3: UKF Predict

UKF Predict in Python

Concrete Example: 2D Tracker Predict

Chapter 4: UKF Update

UKF Update in Python

Chapter 5: Choosing Parameters

Chapter 6: UKF vs EKF Comparison

Computational Cost: Concrete Numbers

Chapter 7: Square-Root UKF

Key SR-UKF Operations

Chapter 8: Connections

Understand the Unscented
Kalman Filter