microVIO — Classical Visual-Inertial Odometry

Chapter 1: IMU Fundamentals

An IMU contains two sensors: an accelerometer that measures linear acceleration (including gravity!) and a gyroscope that measures angular velocity. Together they provide 6-axis measurements at high rates (100–1000 Hz).

The catch: both sensors are corrupted by bias (a slowly drifting offset) and noise (random jitter). To get velocity from acceleration, you integrate once. To get position, you integrate twice. Each integration amplifies errors — this is why IMU-only position drifts so fast.

a_meas = a_true + b_a + n_a ω_meas = ω_true + b_g + n_g

Accelerometer

Measures specific force (acceleration + gravity). 3 axes. Units: m/s².

↓

Gyroscope

Measures angular velocity. 3 axes. Units: rad/s.

↓

Integration

Accel → velocity → position. Each step amplifies noise and bias.

IMU Integration Drift

Watch how double-integrating noisy acceleration leads to quadratic position drift. Adjust bias and noise levels.

Accel bias0.10

Accel noise0.30

Symbol	Meaning	Typical Value
a_meas	Measured acceleration (what the sensor reports)	—
b_a	Accelerometer bias (slowly drifting offset)	~0.01–0.1 m/s²
n_a	Accelerometer noise density	~0.1 m/s²/√Hz
ω_meas	Measured angular velocity	—
b_g	Gyroscope bias	~0.001–0.01 rad/s
n_g	Gyroscope noise density	~0.01 rad/s/√Hz

Rule of thumb: After 1 second of pure IMU integration (no corrections), a MEMS IMU accumulates ~5 cm position error and ~0.6° rotation error. After 10 seconds: ~5 m position error. This is why you cannot dead-reckon with an IMU alone for more than a few seconds. The camera must correct frequently.

Check: Why does IMU-only position drift so quickly?

The sensors are too slow Double integration of noisy/biased acceleration amplifies errors quadratically The gyroscope interferes with the accelerometer

🔗 Pattern Recognition

IMU Propagation IS the Kalman Predict Step

Kalman Filter

x̂_k|k-1 = F · x̂_k-1 + B · u — propagate state forward using dynamics model.

VIO (IMU propagation)

p_k+1 = p_k + v_kΔt + ½(a_meas − b_a)Δt² — propagate pose forward using IMU. → Kalman Filter

The IMU integration between camera frames IS the Kalman prediction step. The control input u is the IMU measurement. The camera observation IS the Kalman update step. VIO is literally a Kalman filter where the prediction uses inertial data and the correction uses visual data.

In the Kalman filter, what determines how much you trust the prediction vs the measurement? How does the same concept appear in VIO as "tight coupling"?

Chapter 2: Camera Models

To use a camera for geometry, we need a mathematical model of how 3D points project onto the 2D image. The pinhole model is the simplest: light from a 3D point passes through a tiny hole and hits the image plane. The intrinsic parameters (f_x, f_y, c_x, c_y) describe the camera's internal geometry.

Real lenses also introduce distortion — straight lines in the world become curved in the image. We model this with radial and tangential distortion coefficients, then undistort images before processing.

u = f_x · X/Z + c_x v = f_y · Y/Z + c_y

Intuition: f_x, f_y are the focal lengths (how much the camera "zooms"). c_x, c_y are the principal point (where the optical axis hits the image). Together, they form the intrinsic matrix K.

Pinhole Projection

Move the 3D point by adjusting X, Y, Z. Watch where it projects on the image plane. The projection is simply (fX/Z, fY/Z).

Point X0.5

Point Y-0.3

Depth Z2.0

Focal length150

Distortion: Fish-eye lenses have extreme radial distortion. Before running any VIO pipeline, images are typically undistorted so the pinhole model holds. This is a one-time calibration step.

Check: What do the intrinsic parameters describe?

The position and orientation of the camera in the world The camera's internal geometry: focal length and principal point The resolution of the image sensor

Chapter 3: Visual Feature Tracking

To estimate camera motion, we track features across consecutive frames. A feature is a distinctive patch — typically a corner — that's easy to find again. The KLT (Kanade-Lucas-Tomasi) tracker finds these corners and follows them using optical flow: it assumes the brightness of a pixel patch doesn't change between frames.

Corner detection (e.g., Shi-Tomasi, FAST) identifies pixels with strong gradients in two directions. Optical flow then estimates how each corner moved. Losing features is inevitable — occlusion, leaving the frame, deformation — so new features are regularly detected.

Why corners? An edge can slide along its direction — ambiguous. A flat region has no gradient at all. Only corners are unique in both directions, making them trackable.

The Optical Flow Assumption

KLT tracking rests on the brightness constancy assumption: a pixel's intensity doesn't change between frames. Mathematically:

I(x, y, t) = I(x + δx, y + δy, t + δt)

Taylor-expanding and dividing by δt gives the optical flow equation: I_x u + I_y v + I_t = 0, where u, v are the pixel velocities. This is one equation in two unknowns — underdetermined. KLT resolves this by assuming the flow is constant over a small window (typically 21×21 pixels) and solving the resulting overdetermined system via least squares.

In practice, a VIO system tracks 100–200 features simultaneously. Each frame, features are tracked with KLT, new corners are detected to replace lost ones (via Shi-Tomasi or FAST), and the surviving tracks provide the visual constraints for the estimator. A healthy track length is 10–30 frames before a feature is lost or leaves the field of view.

Feature Tracking Simulator

Features (dots) are tracked across frames. Green = successfully tracked. Red = lost. Watch features enter and leave the field of view.

Tracked: 0

Check: Why are corners preferred over edges for tracking?

Corners have gradients in two directions, making them uniquely localizable Corners are brighter than edges There are more corners than edges in an image

🔨 Derivation The Epipolar Constraint — from two views to geometry ▶ ✓ ATTEMPTED

A 3D point P is observed from two camera positions. In frame 1, it projects to pixel p₁. In frame 2 (after rotation R and translation t), it projects to p₂. The normalized image coordinates are x₁ = K^-1p₁ and x₂ = K^-1p₂.

Your task: Derive the epipolar constraint x₂^T E x₁ = 0, where E = [t]_× R is the essential matrix. Explain geometrically why this constraint holds.

If P has coordinates X₁ in frame 1, then in frame 2: X₂ = R X₁ + t. The normalized image coordinates are x₁ = X₁/Z₁ and x₂ = X₂/Z₂. So X₁ = λ₁ x₁ and X₂ = λ₂ x₂ for some depths λ.

The vectors t, R x₁, and x₂ all lie in the same plane (the epipolar plane through P and both camera centers). Three coplanar vectors have zero scalar triple product: x₂ · (t × R x₁) = 0.

The cross product t × v can be written as [t]_× v, where [t]_× is the skew-symmetric matrix of t. So t × (R x₁) = [t]_× R x₁. The dot product becomes: x₂^T [t]_× R x₁ = 0.

Full derivation:

1. Point P in frame 1: X₁ = λ₁ x₁ (ray from camera 1 through the pixel).

2. Same point in frame 2: X₂ = R X₁ + t = λ₁ R x₁ + t.

3. Also X₂ = λ₂ x₂. So: λ₂ x₂ = λ₁ R x₁ + t.

4. Take cross product with t on both sides: λ₂ (t × x₂) = λ₁ (t × R x₁) + (t × t). The last term is zero.

5. Dot both sides with x₂: λ₂ x₂ · (t × x₂) = λ₁ x₂ · (t × R x₁). The left side is zero (scalar triple product with repeated vector). So: x₂^T [t]_× R x₁ = 0.

6. Define E = [t]_× R. We get: x₂^T E x₁ = 0.

The key insight: The epipolar constraint eliminates depth entirely! Even though we don't know how far away P is, the constraint between x₁ and x₂ must hold for ANY valid correspondence. This is why we can recover camera motion (E) from feature matches alone, without knowing 3D structure. Each match gives one equation; 5 matches suffice to solve for E (the 5-point algorithm).

Checkpoint — Before you move on

Explain in your own words: if a camera only rotates (no translation), can you recover the 3D structure of the scene from feature matches? Why or why not? What does this mean for VIO initialization?

✓ Gate cleared

Model Answer

Pure rotation produces no baseline between views. Without translation, all epipolar lines collapse to a single point (the epipole is undefined), and the essential matrix becomes degenerate (rank 0 instead of rank 2). You cannot triangulate depth from features because there's no parallax — a point 1m away and 100m away both move identically under rotation.

For VIO, this means the initialization phase REQUIRES translational motion. This is why AR systems tell you to "move your phone" at startup — pure rotation gives orientation but no scale or depth. The IMU can detect translation (via acceleration), confirming when sufficient parallax exists for visual initialization.

Chapter 4: Loosely-Coupled Fusion

The simplest approach: run visual odometry (VO) and IMU integration separately, then fuse their pose estimates. Each system outputs a pose — position and orientation — and the fusion layer (often a Kalman filter or complementary filter) combines them.

This is easy to implement: each subsystem is a black box. But it's suboptimal — information is lost when each system compresses its raw data into a single pose estimate before fusion. Correlations between visual and inertial measurements are ignored.

Visual Odometry

Tracks features, estimates pose from images → outputs T_cam

↓

IMU Integration

Integrates accel & gyro → outputs T_imu

↓

Fusion (KF)

Combines T_cam and T_imu using uncertainties → T_fused

Loose vs Tight Fusion

The teal line is the true path. Orange is loosely-coupled. Green is tightly-coupled. Notice the loose estimate is noisier.

Analogy: Loose coupling is like asking two people to solve a jigsaw puzzle separately, then voting on the result. Tight coupling is like having them work on the same puzzle together — sharing every piece of information.

What Information Is Lost?

In loose coupling, each subsystem reports a 6-DOF pose and its covariance. But it discards everything else — which features were tracked, how confident each feature match was, which direction the uncertainty is largest. The fusion layer sees two Gaussian blobs and averages them. It can't know that the VO system was tracking features on a distant building (high depth uncertainty) while the nearby ground had no texture (no features at all).

Tight coupling preserves all this structure. The raw 2D pixel measurements enter the optimizer directly, along with their individual uncertainties. An uncertain feature at 50 meters contributes less than a confident feature at 2 meters. This per-measurement weighting is impossible with loose coupling.

Check: What is the main disadvantage of loosely-coupled fusion?

Information is lost when each system compresses data to a single pose before fusion It requires more computation It only works with stereo cameras

Chapter 5: Tightly-Coupled Fusion

In tightly-coupled VIO, we jointly optimize over raw measurements from both sensors in a single estimator. Instead of combining two pose estimates, we combine individual pixel observations and IMU readings directly. This preserves correlations and produces significantly better results.

The state vector now includes the camera pose, velocity, IMU biases, and possibly feature positions. All measurements — pixel coordinates from the camera and accelerations/angular velocities from the IMU — constrain this joint state simultaneously.

x = [p, v, q, b_a, b_g] (position, velocity, orientation, biases)

Key advantage: When the camera loses tracking on some features, IMU measurements still constrain velocity and orientation. When the IMU drifts, pixel reprojection errors pull the state back. Every measurement helps constrain everything.

What's Actually in the State Vector?

The full state per keyframe is 16 parameters: position p (3), velocity v (3), orientation quaternion q (4), accelerometer bias b_a (3), and gyroscope bias b_g (3). With N keyframes in the sliding window plus M 3D landmarks, the optimizer solves over 16N + 3M parameters simultaneously.

x_full = [p₁, v₁, q₁, b_a1, b_g1, …, p_N, v_N, q_N, b_aN, b_gN, ℓ₁, …, ℓ_M]

For a typical window of 10 keyframes with 50 landmarks: 16 × 10 + 3 × 50 = 310 parameters. The optimizer must find values for all 310 simultaneously. This is why efficient solvers (Gauss-Newton, Levenberg-Marquardt) and sparse matrix structure matter so much.

Approach	Inputs	State	Accuracy
Loosely-coupled	VO pose + IMU pose	Pose only	Good
Tightly-coupled (filter)	Raw pixels + raw IMU	Pose + biases + (features)	Better
Tightly-coupled (optim.)	Raw pixels + raw IMU	Pose + biases + features	Best

Joint State Estimation

Watch how the tightly-coupled estimator simultaneously adjusts pose, velocity, and IMU bias. Drag the bias slider to simulate drift.

IMU bias drift0.20

Visual features10

Check: What does the tightly-coupled state vector typically include?

Only position and orientation Position, velocity, orientation, and IMU biases Only IMU measurements

Chapter 6: Preintegration

Between two camera keyframes (say, 100 ms apart), the IMU produces dozens of measurements. Naively, if we change our linearization point for the pose, we'd have to re-integrate all those IMU samples. This is expensive. Preintegration is the clever trick that avoids it.

The idea: integrate IMU measurements in a local frame relative to the starting pose. The resulting "preintegrated measurement" summarizes the relative motion (Δp, Δv, Δq) between keyframes. When the linearization point changes, we only need a cheap first-order correction — no re-integration.

Δp_ij = Σ_k [Δv_ik · δt + ½ (α_k − b_a) · δt²]

IMU samples (high rate)

a₁, a₂, ..., a_N and ω₁, ω₂, ..., ω_N between keyframes

↓

Preintegrate

Compute Δp, Δv, Δq in local frame. Also compute Jacobians w.r.t. bias.

↓

Bias correction

When bias estimate changes, apply first-order correction instead of re-integrating.

Preintegration vs Naive

The bar chart shows computational cost. Red = naive re-integration (grows with IMU rate). Green = preintegration (constant cost after initial computation).

IMU rate (Hz)400

Optimization iters5

Why it matters: In optimization-based VIO, the solver iterates many times, each time updating the linearization point. Without preintegration, each iteration re-integrates hundreds of IMU samples. With it, the cost drops to a single cheap matrix multiply.

Concrete Numbers

Between two keyframes at 15–30 Hz, the IMU fires at 200–1000 Hz. That means 7–67 IMU samples per keyframe interval. The preintegrated measurement is a compact tuple: ΔR (relative rotation), Δv (relative velocity), and Δp (relative position), all expressed in the local frame of the interval's starting pose.

When the optimizer updates the starting pose, you do not re-integrate those 7–67 IMU samples. Instead, you apply a first-order correction using the precomputed Jacobians with respect to bias:

ΔR_corrected ≈ ΔR · Exp(J_bg · δb_g) Δv_corrected ≈ Δv + J_ba^v · δb_a + J_bg^v · δb_g

Typical IMU noise densities: gyroscope ~0.01 rad/s/√Hz, accelerometer ~0.1 m/s²/√Hz. After 1 second of dead-reckoning without any visual correction, you accumulate roughly 0.6° rotation error and ~5 cm position error. This is why keyframes must arrive frequently — preintegration bridges the gap cheaply.

Check: What problem does preintegration solve?

It reduces IMU noise It avoids re-integrating IMU samples when the linearization point changes It removes the need for a camera

🔨 Derivation IMU Preintegration — bias correction via first-order Jacobians ▶ ✓ ATTEMPTED

Between keyframes i and j, you preintegrate N IMU samples to get Δv_ij (relative velocity change). This preintegrated measurement depends on the bias b_a used during integration. When the optimizer updates b_a to b_a + δb_a, you need to correct Δv without re-integrating.

Your task: Show that Δv_corrected ≈ Δv + J_ba^v · δb_a, and derive the form of the Jacobian J_ba^v = ∂Δv / ∂b_a.

The preintegrated velocity is: Δv_ij = Σ_k=i^j-1 ΔR_ik (a_k − b_a) δt, where ΔR_ik is the preintegrated rotation from i to k. The key: b_a appears linearly inside the sum.

If b_a changes to b_a + δb_a, the velocity becomes: Δv(b_a + δb_a) = Σ_k ΔR_ik (a_k − b_a − δb_a) δt = Δv(b_a) − (Σ_k ΔR_ik δt) δb_a.

Comparing with Δv_corrected = Δv + J · δb_a, the Jacobian is J_ba^v = −Σ_k=i^j-1 ΔR_ik δt. It's a 3×3 matrix accumulated during preintegration.

Full derivation:

1. Discrete preintegration formula: Δv_ij = Σ_k=i^j-1 ΔR_ik (a_k − b_a) δt

2. Perturb bias: b_a → b_a + δb_a

3. New velocity: Δv' = Σ_k ΔR_ik (a_k − b_a − δb_a) δt = Δv_ij − [Σ_k ΔR_ik δt] δb_a

4. Therefore: J_ba^v = ∂Δv/∂b_a = −Σ_k=i^j-1 ΔR_ik · δt

5. This Jacobian is computed during preintegration (accumulated alongside Δv) at zero extra cost. When the optimizer changes b_a by δb_a, the correction is a single matrix-vector multiply: J_ba^v · δb_a.

The key insight: Bias enters the integration LINEARLY (it's subtracted from each acceleration measurement). This linearity is what makes first-order correction exact — there's no second-order term to worry about. For the rotation preintegration, bias correction is more complex because rotations are nonlinear (Lie group), requiring the exponential map: ΔR_corrected ≈ ΔR · Exp(J_bg^R δb_g).

Chapter 7: MSCKF

The Multi-State Constraint Kalman Filter (MSCKF) is one of the most successful VIO algorithms. Its key insight: don't keep features in the state vector. Instead, keep a sliding window of camera poses and use features only to create constraints between those poses before marginalizing them out.

When a feature is tracked across N frames, it creates 2N measurement equations (x and y pixel coordinates). Once the feature leaves the view, MSCKF uses these equations to constrain the N poses, then discards the feature. This keeps the state vector small — only O(window size) instead of O(number of features).

State: Sliding Window of Poses

x = [IMU state, pose₁, pose₂, ..., pose_N]

↓

Feature Lost

Feature tracked in frames 3,4,5,6 is no longer visible

↓

Multi-State Constraint

Triangulate feature, project into each frame, create residual constraints on poses 3-6

↓

Marginalize Feature

Feature is removed. Poses are updated. State stays compact.

MSCKF State Size

Compare state size: red = keeping all features in state (grows unboundedly). green = MSCKF with pose-only state (bounded).

Window size15

The Math Behind the Savings

MSCKF keeps the IMU state (16 parameters) plus a sliding window of N camera poses (7 parameters each: position + quaternion). Total state size: 16 + 7N. With N=20 poses: 156 parameters. The EKF covariance matrix is 156×156 — easily manageable.

By contrast, EKF-SLAM keeps features in the state. With 200 landmarks at 3 parameters each: 16 + 600 = 616 parameters. The covariance is 616×616. Updating it costs O(616²) per measurement. And the state keeps growing. MSCKF's state is bounded — no matter how many features pass through, the state never exceeds 16 + 7N.

MSCKF vs EKF-SLAM: Traditional EKF-SLAM keeps features in the state (O(N²) cost). MSCKF marginalizes features immediately, maintaining O(M) cost where M is the window size. This makes it viable for real-time on mobile devices.

Check: How does MSCKF keep computational cost bounded?

By marginalizing features and keeping only poses in the state By reducing the camera frame rate By ignoring IMU measurements

Chapter 8: Sliding Window Optimization

Instead of a Kalman filter, many modern VIO systems use nonlinear optimization over a sliding window of recent states. The idea: collect all visual and inertial measurements in a window, build a factor graph, and solve for the poses that best explain all the data.

Each factor (constraint) in the graph connects variables: IMU preintegration factors connect consecutive poses, visual factors connect a 3D landmark to camera poses that observed it. Marginalization removes old states while preserving their information as a prior.

Filter vs Optimization: Filters (MSCKF, EKF) process measurements once. Optimization re-linearizes and iterates, getting closer to the true maximum-likelihood solution. The tradeoff: optimization is more accurate but more expensive. Modern VIO usually chooses optimization.

How Marginalization Works

When a keyframe leaves the sliding window, we can't just discard it — it carried information about the remaining states. Marginalization via the Schur complement converts the old frame's constraints into a dense prior on the states that remain. Concretely, if the full information matrix is partitioned into "keep" and "drop" blocks:

Λ_prior = Λ_kk − Λ_kd Λ_dd⁻¹ Λ_dk

This prior becomes a new factor in the graph. The cost of one marginalization step is O(N³) in window size N — not O(K³) in total keyframes K seen so far. With a window of 10–20 keyframes, this is fast. Without it (full bundle adjustment over all keyframes), the system would grind to a halt after a few seconds of operation.

Factor Graph Visualization

Circles are pose variables. Squares are factors (constraints). Orange = IMU factors. Teal = visual factors. Purple = marginalization prior.

Window size6

Landmarks4

Method	Type	Accuracy	Latency
EKF-VIO	Filter	Good	Very low
MSCKF	Filter	Better	Low
Sliding Window BA	Optimization	Best	Medium
Full BA	Optimization	Optimal	High

Factor Types in a VIO Graph

Each factor encodes a measurement and its uncertainty (covariance). The optimizer minimizes the sum of squared Mahalanobis distances across all factors:

χ* = argmin_χ ∑_i ||r_i(χ)||²_{Σ_i}

Factor Type	Connects	Residual
IMU preintegration	Pose_i ↔ Pose_i+1	Δp, Δv, Δq vs predicted from poses
Visual reprojection	Pose_j ↔ Landmark_k	Measured pixel − projected pixel (2D error)
Marginalization prior	First pose in window	Dense prior from Schur complement
Bias random walk	Pose_i ↔ Pose_i+1	Constrains bias change between keyframes

The solver (typically Gauss-Newton or Levenberg-Marquardt) linearizes all residuals, forms the normal equations H · δχ = −b, and solves for the state update. The Hessian H is sparse because each factor only connects a few variables — this sparsity is what makes sliding window optimization tractable.

Check: Why does sliding window optimization outperform filtering?

It re-linearizes and iterates toward the maximum-likelihood solution It uses fewer measurements It ignores old data

💥 Break-It Lab What Dies When You Remove VIO Components? ▶ ✓ ATTEMPTED

A drone flies a figure-8 pattern using tightly-coupled VIO. The system has: IMU propagation, visual features, and bias estimation. Toggle each off to see the trajectory degrade.

Remove IMU (camera-only) ACTIVE

Failure mode: Without IMU, monocular scale is unobservable. The trajectory slowly shrinks or expands (scale drift). Fast motions cause lost tracking between frames because there's no motion prediction. The system has no way to know how far it moved between frames — scale drifts by 10-30% per minute. Also: no metric units (the "1 meter" could be 0.7m or 1.5m).

Degenerate Motion (pure rotation only) ACTIVE

Failure mode: With pure rotation, there's zero parallax — features move on the image but carry no depth information. The essential matrix becomes rank-deficient. Feature triangulation fails. The system can track orientation perfectly but position is completely unobservable. Landmarks fly to infinity. The visual "measurements" provide no constraint on translation.

Wrong Camera-IMU Calibration (5° error) ACTIVE

Failure mode: The extrinsic rotation between camera and IMU is baked into every measurement. A 5° error means IMU-predicted feature positions are systematically wrong. The optimizer compensates by warping the trajectory — straight lines become curves, and the bias estimates absorb the calibration error (biased biases!). Drift increases 3-5x. The system doesn't crash but produces smoothly incorrect results.

🔨 Derivation The Schur Complement — why landmark marginalization is efficient ▶ ✓ ATTEMPTED

In sliding window BA, the Hessian H has a specific sparsity pattern. Poses connect to each other (via IMU factors) and to landmarks (via visual factors), but landmarks don't connect to each other. The Hessian partitions into pose block H_pp, landmark block H_ll, and cross-block H_pl.

Your task: Show that eliminating the landmark variables from the normal equations Hδx = −b produces the reduced system H_pp^* δx_p = −b_p^* where H_pp^* = H_pp − H_pl H_ll⁻¹ H_lp. Why is H_ll⁻¹ cheap?

The normal equations are: [H_pp H_pl; H_lp H_ll] [δx_p; δx_l] = −[b_p; b_l]. This gives two equations: H_ppδx_p + H_plδx_l = −b_p and H_lpδx_p + H_llδx_l = −b_l.

From the second equation: δx_l = H_ll⁻¹(−b_l − H_lpδx_p). Substitute this into the first equation to eliminate δx_l.

Each landmark only appears in reprojection factors connecting it to poses. No factor ever connects two landmarks directly. Therefore H_ll is block-diagonal with 3×3 blocks (one per landmark). Inverting a block-diagonal matrix is just inverting each 3×3 block independently — O(M) for M landmarks, not O(M³).

Full derivation:

1. From equation 2: δx_l = −H_ll⁻¹(b_l + H_lpδx_p)

2. Substitute into equation 1: H_ppδx_p + H_pl[−H_ll⁻¹(b_l + H_lpδx_p)] = −b_p

3. Rearrange: (H_pp − H_plH_ll⁻¹H_lp)δx_p = −(b_p − H_plH_ll⁻¹b_l)

4. Define: H_pp^* = H_pp − H_plH_ll⁻¹H_lp (the Schur complement)

5. The reduced system is much smaller: if we have 10 poses (160 dims) and 50 landmarks (150 dims), the original system is 310×310. The reduced system is only 160×160. We solve for poses first, then back-substitute to get landmarks.

The key insight: H_ll is block-diagonal because landmarks are conditionally independent given the poses. No two landmarks ever share a factor. This means H_ll⁻¹ costs O(M) not O(M³). The entire Schur complement costs O(N²M) where N is pose dimensions and M is landmark count — the standard trick that makes visual SLAM computationally tractable.

⚔ Adversarial: The stiffening marginalization prior

Your VIO system runs with a sliding window of 10 keyframes. After 5 minutes of operation, you notice the system becomes increasingly resistant to corrections — even clear loop closures barely shift the trajectory. The marginalization prior seems to "stiffen" over time. The state estimates are still reasonable but suboptimal.

The IMU biases are saturated First-Estimate Jacobians (FEJ) violation: marginalization used inconsistent linearization points, accumulating information that over-constrains the remaining states The camera intrinsics drifted Too many features are being tracked

Chapter 9: VIO Systems

Several landmark VIO systems have defined the field. Each makes different design choices — filter vs optimization, feature handling, state representation — but they all fuse visual and inertial data in a tightly-coupled manner.

System	Year	Approach	Key Innovation
MSCKF	2007	EKF, no features in state	Efficient multi-state constraints
OKVIS	2015	Keyframe-based optimization	Sliding window with marginalization
ROVIO	2015	EKF with direct photometric	No feature extraction needed
VINS-Mono	2018	Optimization + loop closure	Complete system with relocalization
Basalt	2019	Optimization, visual-inertial	Non-linear factor recovery for marginalization

VINS-Mono is arguably the most influential open-source VIO system. It includes initialization (recovering scale, gravity, biases), tightly-coupled sliding window optimization, loop closure with DBoW2, and pose graph optimization. It's the complete package.

What "Initialization" Actually Means

A monocular VIO system faces a chicken-and-egg problem at startup: you need depth to triangulate features, but you need features to estimate depth. The initialization procedure (typically 1–2 seconds) does this:

Collect 10–20 frames of visual-inertial data during motion
SfM initialization: Estimate relative poses from vision (up to an unknown scale)
IMU alignment: Align the visual scale, gravity direction, and initial velocity with IMU measurements by solving a linear system
Bias estimation: Estimate initial accelerometer and gyroscope biases
Scale recovery: The IMU provides absolute scale (meters, not "visual units") because gravity is a known acceleration

This is why VIO systems need the user to move during startup — pure rotation provides no translation for scale estimation. ARKit's "move your phone around" instruction at launch is literally the VIO initialization phase.

System Comparison Radar

Compare VIO systems across five axes: accuracy, speed, robustness, features, and ease of use.

Where from here? Classical VIO is mature and reliable, but it struggles with textureless surfaces, dramatic lighting changes, and fast motion. Modern VIO replaces handcrafted features with learned ones, uses neural IMU models, and pushes toward end-to-end systems.

Check: What distinguishes VINS-Mono from earlier VIO systems?

It only uses an IMU, no camera It uses loosely-coupled fusion It includes initialization, optimization, loop closure, and relocalization

🏗 Design Challenge You're the Architect: VIO for AR Glasses ▶ ✓ ATTEMPTED

A consumer electronics company is building AR glasses with 6-DOF head tracking. The VIO system must deliver sub-millimeter jitter at display refresh rate while fitting within extreme power and thermal constraints. The glasses will be worn 8+ hours daily in varied environments: offices, outdoors, public transit.

Latency budget

<5 ms total (IMU-to-photon)

Camera FPS

30 Hz mono or 15 Hz stereo

IMU rate

800 Hz (low-cost MEMS)

Compute

2 ARM cores + small DSP, 1.5W total power

Thermal

Surface temp <41°C (skin contact)

Accuracy

<0.5mm jitter, <1% drift over 10m walk

1. Filter (MSCKF-style) or optimization (VINS-style)? The latency budget is 5ms. Optimization with 10 iterations takes ~15ms. What can you do?

2. The 5ms IMU-to-photon budget means the visual pipeline CAN'T be in the critical path. How do you architect the system so the display always has a fresh pose?

3. Thermal throttling at minute 45: the CPU clocks down 30%. Your system must degrade gracefully. Which components do you shed first?

4. User walks from bright office into dark elevator. How do you handle complete visual failure for 5-10 seconds?

Real-world solution (Apple ARKit / Meta Aria):

1. Dual-rate architecture: IMU propagation runs at 800-1000 Hz on a dedicated core with <1ms latency. This is the "prediction" that drives the display. Visual corrections run asynchronously at 30 Hz on the other core. The display always gets the IMU-propagated pose; visual corrections arrive "late" and retroactively fix drift. This decouples display latency from visual processing time.

2. MSCKF for the critical path, not optimization. The EKF update is a single matrix multiply (~0.5ms for 30 features). The optimization runs in a background thread for relocalization and map maintenance, but never blocks the display loop.

3. Graceful degradation: Under thermal throttle: reduce feature count (100 → 30), increase keyframe spacing (30Hz → 15Hz), skip outlier rejection (trust RANSAC less). The IMU-only propagation continues at full rate regardless.

4. IMU-only bridging: During visual blackout, trust IMU propagation. The key: preintegrated covariance tells you EXACTLY how much drift to expect. After 10 seconds of IMU-only: ~50cm position uncertainty. When vision recovers, the large innovation is expected (high covariance) and the system absorbs the correction smoothly. The alternative (reset) would cause visible "jumps" in the AR content.

💻 Build It Implement IMU Preintegration from Scratch ▶ ✓ ATTEMPTED

Given a sequence of IMU measurements (accelerometer + gyroscope) between two keyframes, compute the preintegrated relative position, velocity, and rotation. Also compute the bias Jacobians for first-order correction.

signature def preintegrate(imu_samples, bias_a, bias_g, dt): """ imu_samples: list of (accel[3], gyro[3]) tuples bias_a: np.array [3] - current accelerometer bias estimate bias_g: np.array [3] - current gyroscope bias estimate dt: float - time between IMU samples Returns: delta_p: np.array [3] - preintegrated position delta_v: np.array [3] - preintegrated velocity delta_R: np.array [3,3] - preintegrated rotation J_ba_v: np.array [3,3] - Jacobian of delta_v w.r.t. bias_a J_ba_p: np.array [3,3] - Jacobian of delta_p w.r.t. bias_a """

Test case

Constant accel = [0, 0, 1] m/s² (above bias), zero gyro, 10 samples at dt=0.01s: Expected: delta_v = [0, 0, 0.1], delta_p = [0, 0, 0.005], delta_R = I. With bias correction δb_a = [0.1, 0, 0]: corrected delta_v[0] should be −0.01.

For each sample k: (1) corrected_accel = accel - bias_a, (2) corrected_gyro = gyro - bias_g, (3) delta_R = delta_R @ exp_so3(corrected_gyro * dt), (4) delta_v += delta_R @ corrected_accel * dt, (5) delta_p += delta_v * dt + 0.5 * delta_R @ corrected_accel * dt^2, (6) J_ba_v += -delta_R * dt, (7) J_ba_p += J_ba_v * dt - 0.5 * delta_R * dt^2.

python
import numpy as np
from scipy.spatial.transform import Rotation

def exp_so3(omega):
    """Exponential map: rotation vector -> rotation matrix."""
    angle = np.linalg.norm(omega)
    if angle < 1e-10:
        return np.eye(3) + skew(omega)
    axis = omega / angle
    K = skew(axis)
    return np.eye(3) + np.sin(angle)*K + (1-np.cos(angle))*(K @ K)

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def preintegrate(imu_samples, bias_a, bias_g, dt):
    delta_R = np.eye(3)
    delta_v = np.zeros(3)
    delta_p = np.zeros(3)
    J_ba_v = np.zeros((3, 3))
    J_ba_p = np.zeros((3, 3))

    for accel, gyro in imu_samples:
        # Subtract biases
        a_corr = np.array(accel) - bias_a
        w_corr = np.array(gyro) - bias_g

        # Rotate acceleration into local frame
        a_local = delta_R @ a_corr

        # Accumulate Jacobians (BEFORE updating delta_v/delta_p)
        J_ba_v += -delta_R * dt
        J_ba_p += J_ba_v * dt - 0.5 * delta_R * (dt**2)

        # Integrate position (uses current velocity)
        delta_p += delta_v * dt + 0.5 * a_local * (dt**2)

        # Integrate velocity
        delta_v += a_local * dt

        # Integrate rotation
        delta_R = delta_R @ exp_so3(w_corr * dt)

    return delta_p, delta_v, delta_R, J_ba_v, J_ba_p

Bonus challenge: Add covariance propagation. The preintegrated measurement has uncertainty that grows with the number of samples. Track the 9×9 covariance matrix [δR, δv, δp] using the discrete-time noise propagation: Σ_k+1 = F_k Σ_k F_k^T + G_k Q G_k^T.

🔗 Pattern Recognition

VIO is SLAM Without Map Reuse

Classical VIO

Sliding window of poses + current features. Old poses and features are marginalized away. No persistent map.

Classical SLAM

All poses + all landmarks kept in state. Loop closures possible. Map persists for relocalization. → Classical SLAM

VIO trades map persistence for bounded computation. SLAM trades computation for the ability to recognize previously visited places. VINS-Mono bridges both: it runs VIO for real-time tracking and maintains a separate pose graph for loop closure. When the robot returns to a known place, the pose graph corrects accumulated drift — something pure VIO can never do.

If you added loop closure to a VIO system, what additional components would you need? (Hint: place recognition, pose graph, and what happens to the marginalization prior when a loop is detected?)

Understand Classical
VIO

Chapter 0: Why IMU + Camera?

Chapter 1: IMU Fundamentals

Chapter 2: Camera Models

Chapter 3: Visual Feature Tracking

The Optical Flow Assumption

Chapter 4: Loosely-Coupled Fusion

What Information Is Lost?

Chapter 5: Tightly-Coupled Fusion

What's Actually in the State Vector?

Chapter 6: Preintegration

Concrete Numbers

Chapter 7: MSCKF

The Math Behind the Savings

Chapter 8: Sliding Window Optimization

How Marginalization Works

Factor Types in a VIO Graph

Chapter 9: VIO Systems

What "Initialization" Actually Means

Understand ClassicalVIO

Chapter 0: Why IMU + Camera?

Chapter 1: IMU Fundamentals

Chapter 2: Camera Models

Chapter 3: Visual Feature Tracking

The Optical Flow Assumption

Chapter 4: Loosely-Coupled Fusion

What Information Is Lost?

Chapter 5: Tightly-Coupled Fusion

What's Actually in the State Vector?

Chapter 6: Preintegration

Concrete Numbers

Chapter 7: MSCKF

The Math Behind the Savings

Chapter 8: Sliding Window Optimization

How Marginalization Works

Factor Types in a VIO Graph

Chapter 9: VIO Systems

What "Initialization" Actually Means

Understand Classical
VIO