No Jacobians. No tangent-line approximations. Just carefully chosen sample points that capture what a Gaussian looks like after it's been squeezed through a nonlinear function.
The Extended Kalman Filter (EKF) handles nonlinear systems by computing Jacobians — local tangent-line approximations. It works, but it has real problems: you need to derive the Jacobian analytically, and the first-order linear approximation can be wildly inaccurate when the function curves sharply.
In 1995, Simon Julier and Jeffrey Uhlmann had a striking insight: it's easier to approximate a probability distribution than to approximate an arbitrary nonlinear function. Instead of linearizing the function, why not pick a small set of sample points that capture the distribution's statistics, push them through the exact nonlinear function, and reconstruct a Gaussian from the results?
A Gaussian in n dimensions is fully described by its mean (n numbers) and covariance (an n×n matrix). The UKF selects 2n+1 deterministic sample points — called sigma points — that exactly match these first two moments. For a 2D state, that's just 5 points.
The sigma points are the mean itself, plus pairs of points offset along each eigenvector of the covariance, scaled by a factor that depends on the tuning parameters α, β, and κ.
where λ = α²(n + κ) − n, and (√((n+λ)P))i is the i-th column of the matrix square root of (n+λ)P.
A 2D Gaussian with 2n+1 = 5 sigma points. Drag the sliders to change the covariance shape. The orange dots are sigma points; the teal ellipse is the 2-sigma contour.
The formula says √((n+λ)P). What does the "square root" of a matrix even mean? It means the Cholesky decomposition: finding a lower-triangular matrix L such that LL¹ = (n+λ)P. In NumPy, one line:
Python import numpy as np n = 4 # 2D tracker: [px, vx, py, vy] alpha, kappa = 1e-3, 0 lam = alpha**2 * (n + kappa) - n # ≈ -4.0 # Cholesky decomposition gives the "matrix square root" L = np.linalg.cholesky((n + lam) * P) # L is [4,4] # Generate 2n+1 = 9 sigma points sigmas = [x.copy()] # χ₀ = mean for i in range(n): sigmas.append(x + L[:, i]) # χᵢ = mean + i-th column of L sigmas.append(x - L[:, i]) # χᵢ₊ₙ = mean - i-th column of L # sigmas has 9 points, each is [4,1]
We have 2n+1 sigma points: one at the mean μ, and n symmetric pairs at μ ± (√((n+λ)P))i. Each point has a mean weight wim and a covariance weight wic.
Your task: Derive the weight formulas by enforcing two constraints: (1) the weighted sum of sigma points equals the mean, and (2) the weighted sum of outer products equals the covariance P.
Full derivation:
Step 1 (mean constraint): Σ wim χi = w0mμ + Σi=1..n wim(μ + Li) + Σi=1..n wi+nm(μ − Li) = μ. Since the symmetric terms cancel: (w0m + 2n · wim) · μ = μ, so weights sum to 1.
Step 2 (covariance constraint): Σ wic(χi−μ)(χi−μ)T = 2 · n · wic · LiLiT. But L comes from chol((n+λ)P), so each column satisfies LiLiT contribution sums to (n+λ)P. Thus 2n · wic · (n+λ) = 1, giving wic = 1/(2(n+λ)).
Step 3 (solve for w0m): From sum-to-1: w0m = 1 − 2n/(2(n+λ)) = 1 − n/(n+λ) = λ/(n+λ).
The key insight: The satellite weights are completely determined by the covariance constraint alone. The center weight is the leftover from the sum-to-1 constraint. This is why the center weight can go negative (when λ < 0) — it's not a probability, it's a moment-matching coefficient from Mercer's theorem on quadrature rules.
The n symmetric pairs capture n axes of the covariance ellipsoid — each pair spans one eigenvector direction. Without the center point, the weights must still sum to 1 for the mean constraint, but then 2n · wim = 1 forces wim = 1/(2n). Now the covariance constraint becomes: 2n · (1/(2n)) · (n+λ)P = (n+λ)P ≠ P unless λ = 1−n. You lose a free parameter (λ) that controls the spread, meaning you can't independently tune how far the sigma points reach into the tails. The center point adds one more degree of freedom, decoupling the spread (controlled by λ) from the weight distribution. Fewer than 2n+1 points is under-determined for both moments; more than 2n+1 is over-determined (helpful for higher moments — that's cubature Kalman filters).
The unscented transform is the heart of the UKF. Given a Gaussian and a nonlinear function, it approximates the output distribution in three steps:
Each sigma point carries two weights: wm for the mean and wc for the covariance. The center point typically gets a different weight than the satellite points.
Sigma points from a 2D Gaussian pass through a nonlinear "banana" transform. The teal ellipse is the input. The orange points become purple points after the transform. The green ellipse is the reconstructed output Gaussian.
After pushing all sigma points through f, we reconstruct a Gaussian by computing the weighted mean and covariance:
Python # sigmas_f = [f(χ₀), f(χ₁), ..., f(χ₂ₙ)] — transformed points # wm, wc = weight arrays (length 2n+1) # Weighted mean: just a weighted average mu_out = sum(wm[i] * sigmas_f[i] for i in range(2*n+1)) # Weighted covariance: outer products of deviations P_out = np.zeros((n, n)) for i in range(2*n+1): d = sigmas_f[i] - mu_out P_out += wc[i] * np.outer(d, d)
Consider a linear function f(x) = Ax + b applied to a Gaussian with mean μ and covariance P. The true output distribution has mean Aμ + b and covariance APAT.
Your task: Show that the unscented transform recovers the exact output mean and covariance for any linear f — proving it's at least as good as the standard Kalman filter for linear systems.
Full derivation:
Mean: μy = Σ wim f(χi) = Σ wim (Aχi + b) = A(Σ wim χi) + b = Aμ + b. This is the exact linear transform of the mean.
Covariance: Each deviation is f(χi) − μy = A(χi − μ). So Py = Σ wic A(χi−μ)(χi−μ)TAT = A[Σ wic(χi−μ)(χi−μ)T]AT = APAT.
For nonlinear f: Taylor-expand f(χi) = f(μ) + J(χi−μ) + ½H(χi−μ)² + ... The UT captures the first-order term exactly (same as EKF) AND captures the second-order Hessian term because the sigma points sample the curvature. The EKF drops everything beyond the Jacobian J.
The key insight: Linearity is the special case where the UT is trivially exact. For nonlinear functions, the sigma points "probe" the curvature in each direction, capturing second-order effects that a single tangent line (Jacobian) misses entirely.
python import numpy as np def unscented_transform(mu, P, f, alpha=1e-3, beta=2, kappa=0): n = len(mu) lam = alpha**2 * (n + kappa) - n # Generate sigma points sqrt_P = np.linalg.cholesky((n + lam) * P) sigmas = np.zeros((2*n+1, n)) sigmas[0] = mu for i in range(n): sigmas[i+1] = mu + sqrt_P[:, i] sigmas[i+n+1] = mu - sqrt_P[:, i] # Weights wm = np.full(2*n+1, 1/(2*(n+lam))) wc = np.full(2*n+1, 1/(2*(n+lam))) wm[0] = lam / (n + lam) wc[0] = lam / (n + lam) + (1 - alpha**2 + beta) # Propagate sigmas_f = np.array([f(s) for s in sigmas]) m = sigmas_f.shape[1] # Weighted mean mu_y = wm @ sigmas_f # (m,) # Weighted covariance and cross-covariance P_y = np.zeros((m, m)) P_xy = np.zeros((n, m)) for i in range(2*n+1): dy = sigmas_f[i] - mu_y dx = sigmas[i] - mu P_y += wc[i] * np.outer(dy, dy) P_xy += wc[i] * np.outer(dx, dy) return mu_y, P_y, P_xy
The predict step applies the unscented transform to the process model f(x). We generate sigma points from the current belief, push them through f, and reconstruct the predicted Gaussian. Then add process noise Q.
Python import numpy as np def ukf_predict(x, P, f, Q, alpha=1e-3, beta=2, kappa=0): """UKF predict step.""" n = len(x) lam = alpha**2 * (n + kappa) - n # Generate sigma points sqrt_P = np.linalg.cholesky((n + lam) * P) sigmas = [x.copy()] for i in range(n): sigmas.append(x + sqrt_P[:, i]) sigmas.append(x - sqrt_P[:, i]) # Weights wm = [lam / (n + lam)] + [1 / (2*(n + lam))] * 2*n wc = [lam / (n + lam) + (1 - alpha**2 + beta)] + [1 / (2*(n + lam))] * 2*n # Propagate through f sigmas_f = [f(s) for s in sigmas] # Reconstruct mean and covariance x_pred = sum(w * s for w, s in zip(wm, sigmas_f)) P_pred = Q.copy() for i, (w, s) in enumerate(zip(wc, sigmas_f)): d = s - x_pred P_pred += w * np.outer(d, d) return x_pred, P_pred
For a 2D tracker (n=4), the predict step generates 9 sigma points, pushes each through the motion model, and recombines. Here's the data flow with actual shapes:
| Step | Operation | Shape |
|---|---|---|
| 1 | Cholesky: L = chol((n+λ)P) | [4, 4] lower-triangular |
| 2 | Generate 2n+1 = 9 sigma points | 9 vectors of [4, 1] |
| 3 | Propagate each: χi* = f(χi) | 9 vectors of [4, 1] |
| 4 | Weighted mean → x̄ | [4, 1] |
| 5 | Weighted outer products → P̄ | [4, 4] |
| 6 | Add Q → P̄ + Q | [4, 4] |
The update step is similar: generate sigma points from the predicted state, push them through the measurement model h(x), and compute the cross-covariance between state and measurement spaces. This gives us the Kalman gain.
Python def ukf_update(x_pred, P_pred, z, h, R, alpha=1e-3, beta=2, kappa=0): """UKF update step.""" n = len(x_pred) lam = alpha**2 * (n + kappa) - n # Regenerate sigma points from predicted state sqrt_P = np.linalg.cholesky((n + lam) * P_pred) sigmas = [x_pred.copy()] for i in range(n): sigmas.append(x_pred + sqrt_P[:, i]) sigmas.append(x_pred - sqrt_P[:, i]) wm = [lam / (n + lam)] + [1/(2*(n+lam))] * 2*n wc = [lam / (n + lam) + 1 - alpha**2 + beta] + [1/(2*(n+lam))] * 2*n # Push through measurement model z_sigmas = [h(s) for s in sigmas] z_mean = sum(w*zs for w, zs in zip(wm, z_sigmas)) # Innovation covariance S and cross-covariance Pxz S = R.copy() Pxz = np.zeros((n, len(z))) for w, s, zs in zip(wc, sigmas, z_sigmas): dz = zs - z_mean S += w * np.outer(dz, dz) dx = s - x_pred Pxz += w * np.outer(dx, dz) # Kalman gain and update K = Pxz @ np.linalg.inv(S) x_new = x_pred + K @ (z - z_mean) P_new = P_pred - K @ S @ K.T return x_new, P_new
Real-world solution (used on ISS, Mars rovers, and CubeSats):
1. State representation: The industry standard is the Multiplicative UKF (MUKF). Instead of putting the full quaternion in the state vector, you estimate a 3D rotation error (Modified Rodrigues Parameters or Gibbs vector) that gets composed with a reference quaternion. This keeps the state 6D (3 error angles + 3 gyro biases), avoids the constraint entirely, and reduces sigma points from 15 to 13.
2. Multi-rate: Standard practice is propagate at gyro rate, update at star tracker rate. Between star tracker measurements, you just integrate the gyro (cheap — one quaternion multiplication per step). The full UKF machinery only runs when a star tracker measurement arrives (1 Hz). This is 100x cheaper than running the full filter at 100 Hz.
3. Compute: At 133 MHz, you get ~665,000 cycles per 5ms window. A single quaternion ODE integration takes ~50 FLOPs. 13 sigma points × 50 = 650 FLOPs for propagation — trivial. The bottleneck is the Cholesky decomposition of the 6×6 covariance (~200 FLOPs) and the matrix inversions. Total: ~3,000 FLOPs per update cycle — well within budget. The real concern is radiation-induced bit flips corrupting the covariance matrix, which is why the SR-UKF (guaranteed PSD) is preferred on flight hardware.
The UKF has three tuning parameters that control how the sigma points are placed and weighted:
| Parameter | Typical Value | Effect |
|---|---|---|
| α | 10-3 to 1 | Controls the spread of sigma points. Small α keeps them close to the mean. |
| β | 2 (optimal for Gaussians) | Incorporates prior knowledge about the distribution. β=2 is optimal when the state is truly Gaussian. |
| κ | 0 or 3−n | Secondary scaling parameter. κ=3−n ensures positive semi-definiteness of the covariance. |
The composite parameter λ determines the distance of the sigma points from the mean. Larger λ pushes them further out, capturing more of the tails. Smaller λ clusters them near the center, which is safer for highly nonlinear functions.
See how α, β, and κ affect sigma point placement and weights for n=2. The center weight can even go negative for large λ.
When the nonlinearity is mild, EKF and UKF produce nearly identical results. But as the function curves more sharply, the EKF's tangent-line approximation falls apart while the UKF's sigma points continue to track accurately. Below, both filters track the same nonlinear system.
A target moves in a circle. Both filters use range+bearing measurements. The gray path is truth, red is the EKF, green is the UKF. Increase nonlinearity to see the EKF diverge.
| Feature | EKF | UKF |
|---|---|---|
| Requires Jacobian | Yes | No |
| Accuracy | 1st order | 2nd order+ |
| Black-box f, h | No | Yes |
| Computation | Jacobian eval | 2n+1 function evals |
| Divergence risk | Higher | Lower |
Both filters share the O(n³) matrix algebra (multiplying P, inverting S). The difference is what else they compute:
| State dim n | Sigma points (2n+1) | KF cost | EKF extra | UKF extra |
|---|---|---|---|---|
| 4 (2D tracker) | 9 | O(64) matrix ops | + Jacobian derivation | + 9 model evaluations |
| 6 (3D pose) | 13 | O(216) | + larger Jacobian | + 13 evaluations |
| 200 (SLAM) | 401 | O(8M) | + 200×200 Jacobian | + 401 evaluations |
Both are solving the same problem: "How do I propagate uncertainty through a nonlinear function?" The EKF approximates the function to keep the distribution exact. The UKF approximates the distribution to keep the function exact. This is a dual relationship — a recurring pattern in estimation theory where you can choose which side to simplify.
Where else in engineering do you see this "approximate the model vs approximate the data" duality? (Hint: think of variational inference vs sampling methods in machine learning.)
The standard UKF recomputes the Cholesky decomposition of P at every step. Rounding errors can cause P to lose positive semi-definiteness, crashing the filter. The square-root UKF (SR-UKF) propagates the Cholesky factor S directly, where P = SS¹.
The SR-UKF uses QR decomposition and rank-1 Cholesky updates instead of full matrix operations. This guarantees numerical stability and is roughly the same computational cost.
| Operation | Standard UKF | SR-UKF |
|---|---|---|
| Store uncertainty | P (n×n) | S (n×n, lower triangular) |
| Sigma point spread | chol((n+λ)P) | √(n+λ) · S (already available) |
| Covariance update | Weighted outer products | QR decomposition + Cholesky rank-1 update |
| Guaranteed PSD | No | Yes, by construction |
The standard UKF stores P and computes P = Σ wic didiT + Q. This can lose PSD due to floating-point cancellation. The SR-UKF stores S where P = SST and uses QR decomposition + rank-1 Cholesky updates.
Your task: Explain why the QR-based update of S never produces a non-PSD covariance, even with negative weights.
Full derivation:
Step 1 (QR for positive weights): Form the compound matrix A = [√w1(f(χ1)−μ), ..., √w2n(f(χ2n)−μ), √Q]. Take QR: A = QR. Then S = RT (lower triangular), and SST = RTR = ATA = Σi=1..2n wi didiT + Q. This is always PSD because ATA is always PSD.
Step 2 (Cholesky downdate for w0): If w0c < 0, we perform: S ← choldowndate(S, √|w0c| · d0). This subtracts the rank-1 term |w0c| d0d0T from SST while maintaining the triangular factor. If the subtraction would make P non-PSD, the algorithm raises an error — a detectable failure, not silent corruption.
Step 3 (why this is numerically superior): The standard approach computes P = big_positive − big_negative + Q. With 64-bit floats, if big_positive ≈ 106 and the true residual is O(1), you lose ~6 digits of precision. The QR approach never forms this sum — it works directly on the factor, preserving full precision.
The key insight: "PSD by construction" means the representation cannot express a non-PSD matrix. SST is PSD for any real S. By never converting back to full P, we never lose this structural guarantee.
The UKF sits in a family of estimation algorithms, each making different tradeoffs between accuracy, generality, and computational cost. Understanding where it fits helps you choose the right tool.
| Filter | Handles | Approach | Cost |
|---|---|---|---|
| Kalman Filter | Linear + Gaussian | Exact matrix equations | Very low |
| EKF | Mildly nonlinear | Jacobian linearization | Low |
| UKF | Moderately nonlinear | Sigma points | Medium |
| Particle Filter | Any distribution | Monte Carlo sampling | High |
Both represent a distribution as weighted point masses. The UKF uses the minimum points for 2nd-order accuracy; the particle filter uses many points for asymptotic accuracy. As you add more sigma points with higher-order quadrature rules (cubature Kalman filter, Gauss-Hermite filter), you approach the particle filter from the "structured" side. They're the same idea at different points on the compute-vs-accuracy tradeoff.
If you had a 3-state system but suspected bimodal distributions, would you reach for a UKF with more sigma points, or switch to a particle filter? At what dimensionality does this decision flip?
Continue learning:
You now understand the unscented Kalman filter. No Jacobians, no linearization — just carefully chosen points that capture the shape of uncertainty.