The Complete Beginner's Path

Understand Classical
VIO

The system that fuses camera and IMU to track your position in 3D — powering AR headsets, drones, and autonomous robots even when GPS fails.

Prerequisites: Basic linear algebra + Intuition for 3D geometry. That's it.
10
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why IMU + Camera?

A camera can see the world in rich detail, tracking features frame-to-frame. But it struggles in the dark, with motion blur, or when there's nothing interesting to look at. An IMU (Inertial Measurement Unit) measures accelerations and angular velocities directly — it never loses tracking, even in complete darkness. But it drifts fast: double-integrating noisy accelerations yields meters of error within seconds.

Each sensor fails where the other succeeds. The camera drifts slowly but can lose tracking entirely. The IMU drifts fast but never loses tracking. Together, they form a complementary pair: the camera corrects the IMU's drift, and the IMU bridges the camera's gaps.

The core idea: Cameras provide slow, accurate position corrections. IMUs provide fast, noisy motion measurements. Fusing them gives you robust, low-latency, drift-corrected pose estimation. This is Visual-Inertial Odometry (VIO).
Sensor Drift Comparison

The teal line is the true path. Red is IMU-only (fast drift). Blue is camera-only (sudden loss). Green is VIO fusion.

Check: Why do we combine a camera with an IMU?

Chapter 1: IMU Fundamentals

An IMU contains two sensors: an accelerometer that measures linear acceleration (including gravity!) and a gyroscope that measures angular velocity. Together they provide 6-axis measurements at high rates (100–1000 Hz).

The catch: both sensors are corrupted by bias (a slowly drifting offset) and noise (random jitter). To get velocity from acceleration, you integrate once. To get position, you integrate twice. Each integration amplifies errors — this is why IMU-only position drifts so fast.

ameas = atrue + ba + na      ωmeas = ωtrue + bg + ng
Accelerometer
Measures specific force (acceleration + gravity). 3 axes. Units: m/s².
Gyroscope
Measures angular velocity. 3 axes. Units: rad/s.
Integration
Accel → velocity → position. Each step amplifies noise and bias.
IMU Integration Drift

Watch how double-integrating noisy acceleration leads to quadratic position drift. Adjust bias and noise levels.

Accel bias0.10
Accel noise0.30
SymbolMeaningTypical Value
ameasMeasured acceleration (what the sensor reports)
baAccelerometer bias (slowly drifting offset)~0.01–0.1 m/s²
naAccelerometer noise density~0.1 m/s²/√Hz
ωmeasMeasured angular velocity
bgGyroscope bias~0.001–0.01 rad/s
ngGyroscope noise density~0.01 rad/s/√Hz
Rule of thumb: After 1 second of pure IMU integration (no corrections), a MEMS IMU accumulates ~5 cm position error and ~0.6° rotation error. After 10 seconds: ~5 m position error. This is why you cannot dead-reckon with an IMU alone for more than a few seconds. The camera must correct frequently.
Check: Why does IMU-only position drift so quickly?
🔗 Pattern Recognition
IMU Propagation IS the Kalman Predict Step
Kalman Filter
k|k-1 = F · x̂k-1 + B · u — propagate state forward using dynamics model.
VIO (IMU propagation)
pk+1 = pk + vkΔt + ½(ameas − ba)Δt² — propagate pose forward using IMU. → Kalman Filter

The IMU integration between camera frames IS the Kalman prediction step. The control input u is the IMU measurement. The camera observation IS the Kalman update step. VIO is literally a Kalman filter where the prediction uses inertial data and the correction uses visual data.

In the Kalman filter, what determines how much you trust the prediction vs the measurement? How does the same concept appear in VIO as "tight coupling"?

Chapter 2: Camera Models

To use a camera for geometry, we need a mathematical model of how 3D points project onto the 2D image. The pinhole model is the simplest: light from a 3D point passes through a tiny hole and hits the image plane. The intrinsic parameters (fx, fy, cx, cy) describe the camera's internal geometry.

Real lenses also introduce distortion — straight lines in the world become curved in the image. We model this with radial and tangential distortion coefficients, then undistort images before processing.

u = fx · X/Z + cx      v = fy · Y/Z + cy
Intuition: fx, fy are the focal lengths (how much the camera "zooms"). cx, cy are the principal point (where the optical axis hits the image). Together, they form the intrinsic matrix K.
Pinhole Projection

Move the 3D point by adjusting X, Y, Z. Watch where it projects on the image plane. The projection is simply (fX/Z, fY/Z).

Point X0.5
Point Y-0.3
Depth Z2.0
Focal length150
Distortion: Fish-eye lenses have extreme radial distortion. Before running any VIO pipeline, images are typically undistorted so the pinhole model holds. This is a one-time calibration step.
Check: What do the intrinsic parameters describe?

Chapter 3: Visual Feature Tracking

To estimate camera motion, we track features across consecutive frames. A feature is a distinctive patch — typically a corner — that's easy to find again. The KLT (Kanade-Lucas-Tomasi) tracker finds these corners and follows them using optical flow: it assumes the brightness of a pixel patch doesn't change between frames.

Corner detection (e.g., Shi-Tomasi, FAST) identifies pixels with strong gradients in two directions. Optical flow then estimates how each corner moved. Losing features is inevitable — occlusion, leaving the frame, deformation — so new features are regularly detected.

Why corners? An edge can slide along its direction — ambiguous. A flat region has no gradient at all. Only corners are unique in both directions, making them trackable.

The Optical Flow Assumption

KLT tracking rests on the brightness constancy assumption: a pixel's intensity doesn't change between frames. Mathematically:

I(x, y, t) = I(x + δx, y + δy, t + δt)

Taylor-expanding and dividing by δt gives the optical flow equation: Ix u + Iy v + It = 0, where u, v are the pixel velocities. This is one equation in two unknowns — underdetermined. KLT resolves this by assuming the flow is constant over a small window (typically 21×21 pixels) and solving the resulting overdetermined system via least squares.

In practice, a VIO system tracks 100–200 features simultaneously. Each frame, features are tracked with KLT, new corners are detected to replace lost ones (via Shi-Tomasi or FAST), and the surviving tracks provide the visual constraints for the estimator. A healthy track length is 10–30 frames before a feature is lost or leaves the field of view.

Feature Tracking Simulator

Features (dots) are tracked across frames. Green = successfully tracked. Red = lost. Watch features enter and leave the field of view.

Tracked: 0
Check: Why are corners preferred over edges for tracking?
🔨 Derivation The Epipolar Constraint — from two views to geometry ✓ ATTEMPTED

A 3D point P is observed from two camera positions. In frame 1, it projects to pixel p1. In frame 2 (after rotation R and translation t), it projects to p2. The normalized image coordinates are x1 = K-1p1 and x2 = K-1p2.

Your task: Derive the epipolar constraint x2T E x1 = 0, where E = [t]× R is the essential matrix. Explain geometrically why this constraint holds.

If P has coordinates X1 in frame 1, then in frame 2: X2 = R X1 + t. The normalized image coordinates are x1 = X1/Z1 and x2 = X2/Z2. So X1 = λ1 x1 and X2 = λ2 x2 for some depths λ.
The vectors t, R x1, and x2 all lie in the same plane (the epipolar plane through P and both camera centers). Three coplanar vectors have zero scalar triple product: x2 · (t × R x1) = 0.
The cross product t × v can be written as [t]× v, where [t]× is the skew-symmetric matrix of t. So t × (R x1) = [t]× R x1. The dot product becomes: x2T [t]× R x1 = 0.

Full derivation:

1. Point P in frame 1: X1 = λ1 x1 (ray from camera 1 through the pixel).

2. Same point in frame 2: X2 = R X1 + t = λ1 R x1 + t.

3. Also X2 = λ2 x2. So: λ2 x2 = λ1 R x1 + t.

4. Take cross product with t on both sides: λ2 (t × x2) = λ1 (t × R x1) + (t × t). The last term is zero.

5. Dot both sides with x2: λ2 x2 · (t × x2) = λ1 x2 · (t × R x1). The left side is zero (scalar triple product with repeated vector). So: x2T [t]× R x1 = 0.

6. Define E = [t]× R. We get: x2T E x1 = 0.

The key insight: The epipolar constraint eliminates depth entirely! Even though we don't know how far away P is, the constraint between x1 and x2 must hold for ANY valid correspondence. This is why we can recover camera motion (E) from feature matches alone, without knowing 3D structure. Each match gives one equation; 5 matches suffice to solve for E (the 5-point algorithm).

Checkpoint — Before you move on
Explain in your own words: if a camera only rotates (no translation), can you recover the 3D structure of the scene from feature matches? Why or why not? What does this mean for VIO initialization?
✓ Gate cleared
Model Answer

Pure rotation produces no baseline between views. Without translation, all epipolar lines collapse to a single point (the epipole is undefined), and the essential matrix becomes degenerate (rank 0 instead of rank 2). You cannot triangulate depth from features because there's no parallax — a point 1m away and 100m away both move identically under rotation.

For VIO, this means the initialization phase REQUIRES translational motion. This is why AR systems tell you to "move your phone" at startup — pure rotation gives orientation but no scale or depth. The IMU can detect translation (via acceleration), confirming when sufficient parallax exists for visual initialization.

Chapter 4: Loosely-Coupled Fusion

The simplest approach: run visual odometry (VO) and IMU integration separately, then fuse their pose estimates. Each system outputs a pose — position and orientation — and the fusion layer (often a Kalman filter or complementary filter) combines them.

This is easy to implement: each subsystem is a black box. But it's suboptimal — information is lost when each system compresses its raw data into a single pose estimate before fusion. Correlations between visual and inertial measurements are ignored.

Visual Odometry
Tracks features, estimates pose from images → outputs Tcam
IMU Integration
Integrates accel & gyro → outputs Timu
Fusion (KF)
Combines Tcam and Timu using uncertainties → Tfused
Loose vs Tight Fusion

The teal line is the true path. Orange is loosely-coupled. Green is tightly-coupled. Notice the loose estimate is noisier.

Analogy: Loose coupling is like asking two people to solve a jigsaw puzzle separately, then voting on the result. Tight coupling is like having them work on the same puzzle together — sharing every piece of information.

What Information Is Lost?

In loose coupling, each subsystem reports a 6-DOF pose and its covariance. But it discards everything else — which features were tracked, how confident each feature match was, which direction the uncertainty is largest. The fusion layer sees two Gaussian blobs and averages them. It can't know that the VO system was tracking features on a distant building (high depth uncertainty) while the nearby ground had no texture (no features at all).

Tight coupling preserves all this structure. The raw 2D pixel measurements enter the optimizer directly, along with their individual uncertainties. An uncertain feature at 50 meters contributes less than a confident feature at 2 meters. This per-measurement weighting is impossible with loose coupling.

Check: What is the main disadvantage of loosely-coupled fusion?

Chapter 5: Tightly-Coupled Fusion

In tightly-coupled VIO, we jointly optimize over raw measurements from both sensors in a single estimator. Instead of combining two pose estimates, we combine individual pixel observations and IMU readings directly. This preserves correlations and produces significantly better results.

The state vector now includes the camera pose, velocity, IMU biases, and possibly feature positions. All measurements — pixel coordinates from the camera and accelerations/angular velocities from the IMU — constrain this joint state simultaneously.

x = [p, v, q, ba, bg]     (position, velocity, orientation, biases)
Key advantage: When the camera loses tracking on some features, IMU measurements still constrain velocity and orientation. When the IMU drifts, pixel reprojection errors pull the state back. Every measurement helps constrain everything.

What's Actually in the State Vector?

The full state per keyframe is 16 parameters: position p (3), velocity v (3), orientation quaternion q (4), accelerometer bias ba (3), and gyroscope bias bg (3). With N keyframes in the sliding window plus M 3D landmarks, the optimizer solves over 16N + 3M parameters simultaneously.

xfull = [p1, v1, q1, ba1, bg1, …, pN, vN, qN, baN, bgN, ℓ1, …, ℓM]

For a typical window of 10 keyframes with 50 landmarks: 16 × 10 + 3 × 50 = 310 parameters. The optimizer must find values for all 310 simultaneously. This is why efficient solvers (Gauss-Newton, Levenberg-Marquardt) and sparse matrix structure matter so much.

ApproachInputsStateAccuracy
Loosely-coupledVO pose + IMU posePose onlyGood
Tightly-coupled (filter)Raw pixels + raw IMUPose + biases + (features)Better
Tightly-coupled (optim.)Raw pixels + raw IMUPose + biases + featuresBest
Joint State Estimation

Watch how the tightly-coupled estimator simultaneously adjusts pose, velocity, and IMU bias. Drag the bias slider to simulate drift.

IMU bias drift0.20
Visual features10
Check: What does the tightly-coupled state vector typically include?

Chapter 6: Preintegration

Between two camera keyframes (say, 100 ms apart), the IMU produces dozens of measurements. Naively, if we change our linearization point for the pose, we'd have to re-integrate all those IMU samples. This is expensive. Preintegration is the clever trick that avoids it.

The idea: integrate IMU measurements in a local frame relative to the starting pose. The resulting "preintegrated measurement" summarizes the relative motion (Δp, Δv, Δq) between keyframes. When the linearization point changes, we only need a cheap first-order correction — no re-integration.

Δpij = Σk [Δvik · δt + ½ (αk − ba) · δt²]
IMU samples (high rate)
a1, a2, ..., aN and ω1, ω2, ..., ωN between keyframes
Preintegrate
Compute Δp, Δv, Δq in local frame. Also compute Jacobians w.r.t. bias.
Bias correction
When bias estimate changes, apply first-order correction instead of re-integrating.
Preintegration vs Naive

The bar chart shows computational cost. Red = naive re-integration (grows with IMU rate). Green = preintegration (constant cost after initial computation).

IMU rate (Hz)400
Optimization iters5
Why it matters: In optimization-based VIO, the solver iterates many times, each time updating the linearization point. Without preintegration, each iteration re-integrates hundreds of IMU samples. With it, the cost drops to a single cheap matrix multiply.

Concrete Numbers

Between two keyframes at 15–30 Hz, the IMU fires at 200–1000 Hz. That means 7–67 IMU samples per keyframe interval. The preintegrated measurement is a compact tuple: ΔR (relative rotation), Δv (relative velocity), and Δp (relative position), all expressed in the local frame of the interval's starting pose.

When the optimizer updates the starting pose, you do not re-integrate those 7–67 IMU samples. Instead, you apply a first-order correction using the precomputed Jacobians with respect to bias:

ΔRcorrected ≈ ΔR · Exp(Jbg · δbg)    Δvcorrected ≈ Δv + Jbav · δba + Jbgv · δbg

Typical IMU noise densities: gyroscope ~0.01 rad/s/√Hz, accelerometer ~0.1 m/s²/√Hz. After 1 second of dead-reckoning without any visual correction, you accumulate roughly 0.6° rotation error and ~5 cm position error. This is why keyframes must arrive frequently — preintegration bridges the gap cheaply.

Check: What problem does preintegration solve?
🔨 Derivation IMU Preintegration — bias correction via first-order Jacobians ✓ ATTEMPTED

Between keyframes i and j, you preintegrate N IMU samples to get Δvij (relative velocity change). This preintegrated measurement depends on the bias ba used during integration. When the optimizer updates ba to ba + δba, you need to correct Δv without re-integrating.

Your task: Show that Δvcorrected ≈ Δv + Jbav · δba, and derive the form of the Jacobian Jbav = ∂Δv / ∂ba.

The preintegrated velocity is: Δvij = Σk=ij-1 ΔRik (ak − ba) δt, where ΔRik is the preintegrated rotation from i to k. The key: ba appears linearly inside the sum.
If ba changes to ba + δba, the velocity becomes: Δv(ba + δba) = Σk ΔRik (ak − ba − δba) δt = Δv(ba) − (Σk ΔRik δt) δba.
Comparing with Δvcorrected = Δv + J · δba, the Jacobian is Jbav = −Σk=ij-1 ΔRik δt. It's a 3×3 matrix accumulated during preintegration.

Full derivation:

1. Discrete preintegration formula: Δvij = Σk=ij-1 ΔRik (ak − ba) δt

2. Perturb bias: ba → ba + δba

3. New velocity: Δv' = Σk ΔRik (ak − ba − δba) δt = Δvij − [Σk ΔRik δt] δba

4. Therefore: Jbav = ∂Δv/∂ba = −Σk=ij-1 ΔRik · δt

5. This Jacobian is computed during preintegration (accumulated alongside Δv) at zero extra cost. When the optimizer changes ba by δba, the correction is a single matrix-vector multiply: Jbav · δba.

The key insight: Bias enters the integration LINEARLY (it's subtracted from each acceleration measurement). This linearity is what makes first-order correction exact — there's no second-order term to worry about. For the rotation preintegration, bias correction is more complex because rotations are nonlinear (Lie group), requiring the exponential map: ΔRcorrected ≈ ΔR · Exp(JbgR δbg).

Chapter 7: MSCKF

The Multi-State Constraint Kalman Filter (MSCKF) is one of the most successful VIO algorithms. Its key insight: don't keep features in the state vector. Instead, keep a sliding window of camera poses and use features only to create constraints between those poses before marginalizing them out.

When a feature is tracked across N frames, it creates 2N measurement equations (x and y pixel coordinates). Once the feature leaves the view, MSCKF uses these equations to constrain the N poses, then discards the feature. This keeps the state vector small — only O(window size) instead of O(number of features).

State: Sliding Window of Poses
x = [IMU state, pose1, pose2, ..., poseN]
Feature Lost
Feature tracked in frames 3,4,5,6 is no longer visible
Multi-State Constraint
Triangulate feature, project into each frame, create residual constraints on poses 3-6
Marginalize Feature
Feature is removed. Poses are updated. State stays compact.
MSCKF State Size

Compare state size: red = keeping all features in state (grows unboundedly). green = MSCKF with pose-only state (bounded).

Window size15

The Math Behind the Savings

MSCKF keeps the IMU state (16 parameters) plus a sliding window of N camera poses (7 parameters each: position + quaternion). Total state size: 16 + 7N. With N=20 poses: 156 parameters. The EKF covariance matrix is 156×156 — easily manageable.

By contrast, EKF-SLAM keeps features in the state. With 200 landmarks at 3 parameters each: 16 + 600 = 616 parameters. The covariance is 616×616. Updating it costs O(616²) per measurement. And the state keeps growing. MSCKF's state is bounded — no matter how many features pass through, the state never exceeds 16 + 7N.

MSCKF vs EKF-SLAM: Traditional EKF-SLAM keeps features in the state (O(N²) cost). MSCKF marginalizes features immediately, maintaining O(M) cost where M is the window size. This makes it viable for real-time on mobile devices.
Check: How does MSCKF keep computational cost bounded?

Chapter 8: Sliding Window Optimization

Instead of a Kalman filter, many modern VIO systems use nonlinear optimization over a sliding window of recent states. The idea: collect all visual and inertial measurements in a window, build a factor graph, and solve for the poses that best explain all the data.

Each factor (constraint) in the graph connects variables: IMU preintegration factors connect consecutive poses, visual factors connect a 3D landmark to camera poses that observed it. Marginalization removes old states while preserving their information as a prior.

Filter vs Optimization: Filters (MSCKF, EKF) process measurements once. Optimization re-linearizes and iterates, getting closer to the true maximum-likelihood solution. The tradeoff: optimization is more accurate but more expensive. Modern VIO usually chooses optimization.

How Marginalization Works

When a keyframe leaves the sliding window, we can't just discard it — it carried information about the remaining states. Marginalization via the Schur complement converts the old frame's constraints into a dense prior on the states that remain. Concretely, if the full information matrix is partitioned into "keep" and "drop" blocks:

Λprior = Λkk − Λkd Λdd−1 Λdk

This prior becomes a new factor in the graph. The cost of one marginalization step is O(N³) in window size N — not O(K³) in total keyframes K seen so far. With a window of 10–20 keyframes, this is fast. Without it (full bundle adjustment over all keyframes), the system would grind to a halt after a few seconds of operation.

Factor Graph Visualization

Circles are pose variables. Squares are factors (constraints). Orange = IMU factors. Teal = visual factors. Purple = marginalization prior.

Window size6
Landmarks4
MethodTypeAccuracyLatency
EKF-VIOFilterGoodVery low
MSCKFFilterBetterLow
Sliding Window BAOptimizationBestMedium
Full BAOptimizationOptimalHigh

Factor Types in a VIO Graph

Each factor encodes a measurement and its uncertainty (covariance). The optimizer minimizes the sum of squared Mahalanobis distances across all factors:

χ* = argminχi ||ri(χ)||2Σi
Factor TypeConnectsResidual
IMU preintegrationPosei ↔ Posei+1Δp, Δv, Δq vs predicted from poses
Visual reprojectionPosej ↔ LandmarkkMeasured pixel − projected pixel (2D error)
Marginalization priorFirst pose in windowDense prior from Schur complement
Bias random walkPosei ↔ Posei+1Constrains bias change between keyframes

The solver (typically Gauss-Newton or Levenberg-Marquardt) linearizes all residuals, forms the normal equations H · δχ = −b, and solves for the state update. The Hessian H is sparse because each factor only connects a few variables — this sparsity is what makes sliding window optimization tractable.

Check: Why does sliding window optimization outperform filtering?
💥 Break-It Lab What Dies When You Remove VIO Components? ✓ ATTEMPTED
A drone flies a figure-8 pattern using tightly-coupled VIO. The system has: IMU propagation, visual features, and bias estimation. Toggle each off to see the trajectory degrade.
Remove IMU (camera-only) ACTIVE
Failure mode: Without IMU, monocular scale is unobservable. The trajectory slowly shrinks or expands (scale drift). Fast motions cause lost tracking between frames because there's no motion prediction. The system has no way to know how far it moved between frames — scale drifts by 10-30% per minute. Also: no metric units (the "1 meter" could be 0.7m or 1.5m).
Degenerate Motion (pure rotation only) ACTIVE
Failure mode: With pure rotation, there's zero parallax — features move on the image but carry no depth information. The essential matrix becomes rank-deficient. Feature triangulation fails. The system can track orientation perfectly but position is completely unobservable. Landmarks fly to infinity. The visual "measurements" provide no constraint on translation.
Wrong Camera-IMU Calibration (5° error) ACTIVE
Failure mode: The extrinsic rotation between camera and IMU is baked into every measurement. A 5° error means IMU-predicted feature positions are systematically wrong. The optimizer compensates by warping the trajectory — straight lines become curves, and the bias estimates absorb the calibration error (biased biases!). Drift increases 3-5x. The system doesn't crash but produces smoothly incorrect results.
🔨 Derivation The Schur Complement — why landmark marginalization is efficient ✓ ATTEMPTED

In sliding window BA, the Hessian H has a specific sparsity pattern. Poses connect to each other (via IMU factors) and to landmarks (via visual factors), but landmarks don't connect to each other. The Hessian partitions into pose block Hpp, landmark block Hll, and cross-block Hpl.

Your task: Show that eliminating the landmark variables from the normal equations Hδx = −b produces the reduced system Hpp* δxp = −bp* where Hpp* = Hpp − Hpl Hll−1 Hlp. Why is Hll−1 cheap?

The normal equations are: [Hpp Hpl; Hlp Hll] [δxp; δxl] = −[bp; bl]. This gives two equations: Hppδxp + Hplδxl = −bp and Hlpδxp + Hllδxl = −bl.
From the second equation: δxl = Hll−1(−bl − Hlpδxp). Substitute this into the first equation to eliminate δxl.
Each landmark only appears in reprojection factors connecting it to poses. No factor ever connects two landmarks directly. Therefore Hll is block-diagonal with 3×3 blocks (one per landmark). Inverting a block-diagonal matrix is just inverting each 3×3 block independently — O(M) for M landmarks, not O(M³).

Full derivation:

1. From equation 2: δxl = −Hll−1(bl + Hlpδxp)

2. Substitute into equation 1: Hppδxp + Hpl[−Hll−1(bl + Hlpδxp)] = −bp

3. Rearrange: (Hpp − HplHll−1Hlp)δxp = −(bp − HplHll−1bl)

4. Define: Hpp* = Hpp − HplHll−1Hlp (the Schur complement)

5. The reduced system is much smaller: if we have 10 poses (160 dims) and 50 landmarks (150 dims), the original system is 310×310. The reduced system is only 160×160. We solve for poses first, then back-substitute to get landmarks.

The key insight: Hll is block-diagonal because landmarks are conditionally independent given the poses. No two landmarks ever share a factor. This means Hll−1 costs O(M) not O(M³). The entire Schur complement costs O(N²M) where N is pose dimensions and M is landmark count — the standard trick that makes visual SLAM computationally tractable.

⚔ Adversarial: The stiffening marginalization prior
Your VIO system runs with a sliding window of 10 keyframes. After 5 minutes of operation, you notice the system becomes increasingly resistant to corrections — even clear loop closures barely shift the trajectory. The marginalization prior seems to "stiffen" over time. The state estimates are still reasonable but suboptimal.

Chapter 9: VIO Systems

Several landmark VIO systems have defined the field. Each makes different design choices — filter vs optimization, feature handling, state representation — but they all fuse visual and inertial data in a tightly-coupled manner.

SystemYearApproachKey Innovation
MSCKF2007EKF, no features in stateEfficient multi-state constraints
OKVIS2015Keyframe-based optimizationSliding window with marginalization
ROVIO2015EKF with direct photometricNo feature extraction needed
VINS-Mono2018Optimization + loop closureComplete system with relocalization
Basalt2019Optimization, visual-inertialNon-linear factor recovery for marginalization
VINS-Mono is arguably the most influential open-source VIO system. It includes initialization (recovering scale, gravity, biases), tightly-coupled sliding window optimization, loop closure with DBoW2, and pose graph optimization. It's the complete package.

What "Initialization" Actually Means

A monocular VIO system faces a chicken-and-egg problem at startup: you need depth to triangulate features, but you need features to estimate depth. The initialization procedure (typically 1–2 seconds) does this:

  1. Collect 10–20 frames of visual-inertial data during motion
  2. SfM initialization: Estimate relative poses from vision (up to an unknown scale)
  3. IMU alignment: Align the visual scale, gravity direction, and initial velocity with IMU measurements by solving a linear system
  4. Bias estimation: Estimate initial accelerometer and gyroscope biases
  5. Scale recovery: The IMU provides absolute scale (meters, not "visual units") because gravity is a known acceleration

This is why VIO systems need the user to move during startup — pure rotation provides no translation for scale estimation. ARKit's "move your phone around" instruction at launch is literally the VIO initialization phase.

System Comparison Radar

Compare VIO systems across five axes: accuracy, speed, robustness, features, and ease of use.

Where from here? Classical VIO is mature and reliable, but it struggles with textureless surfaces, dramatic lighting changes, and fast motion. Modern VIO replaces handcrafted features with learned ones, uses neural IMU models, and pushes toward end-to-end systems.
Check: What distinguishes VINS-Mono from earlier VIO systems?
🏗 Design Challenge You're the Architect: VIO for AR Glasses ✓ ATTEMPTED
A consumer electronics company is building AR glasses with 6-DOF head tracking. The VIO system must deliver sub-millimeter jitter at display refresh rate while fitting within extreme power and thermal constraints. The glasses will be worn 8+ hours daily in varied environments: offices, outdoors, public transit.
Latency budget
<5 ms total (IMU-to-photon)
Camera FPS
30 Hz mono or 15 Hz stereo
IMU rate
800 Hz (low-cost MEMS)
Compute
2 ARM cores + small DSP, 1.5W total power
Thermal
Surface temp <41°C (skin contact)
Accuracy
<0.5mm jitter, <1% drift over 10m walk
1. Filter (MSCKF-style) or optimization (VINS-style)? The latency budget is 5ms. Optimization with 10 iterations takes ~15ms. What can you do?
2. The 5ms IMU-to-photon budget means the visual pipeline CAN'T be in the critical path. How do you architect the system so the display always has a fresh pose?
3. Thermal throttling at minute 45: the CPU clocks down 30%. Your system must degrade gracefully. Which components do you shed first?
4. User walks from bright office into dark elevator. How do you handle complete visual failure for 5-10 seconds?

Real-world solution (Apple ARKit / Meta Aria):

1. Dual-rate architecture: IMU propagation runs at 800-1000 Hz on a dedicated core with <1ms latency. This is the "prediction" that drives the display. Visual corrections run asynchronously at 30 Hz on the other core. The display always gets the IMU-propagated pose; visual corrections arrive "late" and retroactively fix drift. This decouples display latency from visual processing time.

2. MSCKF for the critical path, not optimization. The EKF update is a single matrix multiply (~0.5ms for 30 features). The optimization runs in a background thread for relocalization and map maintenance, but never blocks the display loop.

3. Graceful degradation: Under thermal throttle: reduce feature count (100 → 30), increase keyframe spacing (30Hz → 15Hz), skip outlier rejection (trust RANSAC less). The IMU-only propagation continues at full rate regardless.

4. IMU-only bridging: During visual blackout, trust IMU propagation. The key: preintegrated covariance tells you EXACTLY how much drift to expect. After 10 seconds of IMU-only: ~50cm position uncertainty. When vision recovers, the large innovation is expected (high covariance) and the system absorbs the correction smoothly. The alternative (reset) would cause visible "jumps" in the AR content.

💻 Build It Implement IMU Preintegration from Scratch ✓ ATTEMPTED
Given a sequence of IMU measurements (accelerometer + gyroscope) between two keyframes, compute the preintegrated relative position, velocity, and rotation. Also compute the bias Jacobians for first-order correction.
signature def preintegrate(imu_samples, bias_a, bias_g, dt): """ imu_samples: list of (accel[3], gyro[3]) tuples bias_a: np.array [3] - current accelerometer bias estimate bias_g: np.array [3] - current gyroscope bias estimate dt: float - time between IMU samples Returns: delta_p: np.array [3] - preintegrated position delta_v: np.array [3] - preintegrated velocity delta_R: np.array [3,3] - preintegrated rotation J_ba_v: np.array [3,3] - Jacobian of delta_v w.r.t. bias_a J_ba_p: np.array [3,3] - Jacobian of delta_p w.r.t. bias_a """
Test case
Constant accel = [0, 0, 1] m/s² (above bias), zero gyro, 10 samples at dt=0.01s: Expected: delta_v = [0, 0, 0.1], delta_p = [0, 0, 0.005], delta_R = I. With bias correction δba = [0.1, 0, 0]: corrected delta_v[0] should be −0.01.
For each sample k: (1) corrected_accel = accel - bias_a, (2) corrected_gyro = gyro - bias_g, (3) delta_R = delta_R @ exp_so3(corrected_gyro * dt), (4) delta_v += delta_R @ corrected_accel * dt, (5) delta_p += delta_v * dt + 0.5 * delta_R @ corrected_accel * dt^2, (6) J_ba_v += -delta_R * dt, (7) J_ba_p += J_ba_v * dt - 0.5 * delta_R * dt^2.
python
import numpy as np
from scipy.spatial.transform import Rotation

def exp_so3(omega):
    """Exponential map: rotation vector -> rotation matrix."""
    angle = np.linalg.norm(omega)
    if angle < 1e-10:
        return np.eye(3) + skew(omega)
    axis = omega / angle
    K = skew(axis)
    return np.eye(3) + np.sin(angle)*K + (1-np.cos(angle))*(K @ K)

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def preintegrate(imu_samples, bias_a, bias_g, dt):
    delta_R = np.eye(3)
    delta_v = np.zeros(3)
    delta_p = np.zeros(3)
    J_ba_v = np.zeros((3, 3))
    J_ba_p = np.zeros((3, 3))

    for accel, gyro in imu_samples:
        # Subtract biases
        a_corr = np.array(accel) - bias_a
        w_corr = np.array(gyro) - bias_g

        # Rotate acceleration into local frame
        a_local = delta_R @ a_corr

        # Accumulate Jacobians (BEFORE updating delta_v/delta_p)
        J_ba_v += -delta_R * dt
        J_ba_p += J_ba_v * dt - 0.5 * delta_R * (dt**2)

        # Integrate position (uses current velocity)
        delta_p += delta_v * dt + 0.5 * a_local * (dt**2)

        # Integrate velocity
        delta_v += a_local * dt

        # Integrate rotation
        delta_R = delta_R @ exp_so3(w_corr * dt)

    return delta_p, delta_v, delta_R, J_ba_v, J_ba_p
Bonus challenge: Add covariance propagation. The preintegrated measurement has uncertainty that grows with the number of samples. Track the 9×9 covariance matrix [δR, δv, δp] using the discrete-time noise propagation: Σk+1 = Fk Σk FkT + Gk Q GkT.
🔗 Pattern Recognition
VIO is SLAM Without Map Reuse
Classical VIO
Sliding window of poses + current features. Old poses and features are marginalized away. No persistent map.
Classical SLAM
All poses + all landmarks kept in state. Loop closures possible. Map persists for relocalization. → Classical SLAM

VIO trades map persistence for bounded computation. SLAM trades computation for the ability to recognize previously visited places. VINS-Mono bridges both: it runs VIO for real-time tracking and maintains a separate pose graph for loop closure. When the robot returns to a known place, the pose graph corrects accumulated drift — something pure VIO can never do.

If you added loop closure to a VIO system, what additional components would you need? (Hint: place recognition, pose graph, and what happens to the marginalization prior when a loop is detected?)