The Complete Beginner's Path

Understand Classical VIO

The system that fuses camera and IMU to track your position in 3D — powering AR headsets, drones, and autonomous robots even when GPS fails.

Prerequisites: Basic linear algebra + Intuition for 3D geometry. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why IMU + Camera?

A camera can see the world in rich detail, tracking features frame-to-frame. But it struggles in the dark, with motion blur, or when there's nothing interesting to look at. An IMU (Inertial Measurement Unit) measures accelerations and angular velocities directly — it never loses tracking, even in complete darkness. But it drifts fast: double-integrating noisy accelerations yields meters of error within seconds.

Each sensor fails where the other succeeds. The camera drifts slowly but can lose tracking entirely. The IMU drifts fast but never loses tracking. Together, they form a complementary pair: the camera corrects the IMU's drift, and the IMU bridges the camera's gaps.

The core idea: Cameras provide slow, accurate position corrections. IMUs provide fast, noisy motion measurements. Fusing them gives you robust, low-latency, drift-corrected pose estimation. This is Visual-Inertial Odometry (VIO).
Sensor Drift Comparison

The teal line is the true path. Red is IMU-only (fast drift). Blue is camera-only (sudden loss). Green is VIO fusion.

Check: Why do we combine a camera with an IMU?

Chapter 1: IMU Fundamentals

An IMU contains two sensors: an accelerometer that measures linear acceleration (including gravity!) and a gyroscope that measures angular velocity. Together they provide 6-axis measurements at high rates (100–1000 Hz).

The catch: both sensors are corrupted by bias (a slowly drifting offset) and noise (random jitter). To get velocity from acceleration, you integrate once. To get position, you integrate twice. Each integration amplifies errors — this is why IMU-only position drifts so fast.
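The quadratic blow-up is easy to see in a few lines of NumPy. A minimal sketch, with illustrative bias and noise values and a platform that is actually standing still:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, T = 0.005, 10.0                      # 200 Hz IMU for 10 seconds
n = int(T / dt)
bias, noise_std = 0.1, 0.3               # m/s^2 (illustrative values)

a_true = np.zeros(n)                     # the platform is actually stationary
a_meas = a_true + bias + noise_std * rng.standard_normal(n)

v = np.cumsum(a_meas) * dt               # first integration: velocity
p = np.cumsum(v) * dt                    # second integration: position

# A constant bias b alone produces 0.5 * b * t^2 of position error:
print(f"drift after {T:.0f} s: {p[-1]:.2f} m "
      f"(bias term alone: {0.5 * bias * T**2:.2f} m)")
```

Even with zero true motion, a 0.1 m/s² bias alone puts the estimate 5 m off after 10 seconds, and the noise adds a random walk on top.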

a_meas = a_true + b_a + n_a      ω_meas = ω_true + b_g + n_g
Accelerometer
Measures specific force (acceleration + gravity). 3 axes. Units: m/s².
Gyroscope
Measures angular velocity. 3 axes. Units: rad/s.
Integration
Accel → velocity → position. Each step amplifies noise and bias.
IMU Integration Drift

Watch how double-integrating noisy acceleration leads to quadratic position drift. Adjust bias and noise levels.

Accel bias: 0.10 · Accel noise: 0.30
Symbol | Meaning
a_meas | Measured acceleration (what the sensor reports)
b_a | Accelerometer bias (slowly drifting offset)
n_a | Accelerometer noise (random jitter)
ω_meas | Measured angular velocity
b_g | Gyroscope bias
n_g | Gyroscope noise (random jitter)
Check: Why does IMU-only position drift so quickly?

Chapter 2: Camera Models

To use a camera for geometry, we need a mathematical model of how 3D points project onto the 2D image. The pinhole model is the simplest: light from a 3D point passes through a tiny hole and hits the image plane. The intrinsic parameters (fx, fy, cx, cy) describe the camera's internal geometry.

Real lenses also introduce distortion — straight lines in the world become curved in the image. We model this with radial and tangential distortion coefficients, then undistort images before processing.

u = fx · X/Z + cx      v = fy · Y/Z + cy
Intuition: fx, fy are the focal lengths (how much the camera "zooms"). cx, cy are the principal point (where the optical axis hits the image). Together, they form the intrinsic matrix K.
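The projection equations are two lines of code. A minimal sketch with assumed intrinsics (focal length 150 as in the demo below, principal point at the center of a 640×480 image):

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame) to pixel coordinates."""
    X, Y, Z = point_cam
    assert Z > 0, "the point must be in front of the camera"
    return fx * X / Z + cx, fy * Y / Z + cy

# Assumed intrinsics: focal length 150, principal point of a 640x480 image
u, v = project((0.5, -0.3, 2.0), fx=150.0, fy=150.0, cx=320.0, cy=240.0)
print(u, v)   # → 357.5 217.5
```

Doubling Z halves the point's offset from the principal point, which is the familiar perspective effect of distant objects crowding toward the image center.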
Pinhole Projection

Move the 3D point by adjusting X, Y, Z. Watch where it projects on the image plane. The projection is simply (fX/Z, fY/Z).

Point X0.5
Point Y-0.3
Depth Z2.0
Focal length150
Distortion: Fish-eye lenses have extreme radial distortion. Before running any VIO pipeline, images are typically undistorted so the pinhole model holds. This is a one-time calibration step.
Check: What do the intrinsic parameters describe?

Chapter 3: Visual Feature Tracking

To estimate camera motion, we track features across consecutive frames. A feature is a distinctive patch — typically a corner — that's easy to find again. The KLT (Kanade-Lucas-Tomasi) tracker finds these corners and follows them using optical flow: it assumes the brightness of a pixel patch doesn't change between frames.

Corner detection (e.g., Shi-Tomasi, FAST) identifies pixels with strong gradients in two directions. Optical flow then estimates how each corner moved. Losing features is inevitable — occlusion, leaving the frame, deformation — so new features are regularly detected.

Why corners? An edge can slide along its direction — ambiguous. A flat region has no gradient at all. Only corners are unique in both directions, making them trackable.
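This intuition is exactly what the Shi-Tomasi score measures: the smaller eigenvalue of the patch's gradient structure tensor. A sketch on three synthetic 8×8 patches (the patches are made up for illustration):

```python
import numpy as np

def shi_tomasi_score(patch):
    """Smaller eigenvalue of the structure tensor: high only at corners."""
    gy, gx = np.gradient(patch.astype(float))
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(M)[0]   # eigvalsh returns ascending order

flat = np.ones((8, 8))
edge = np.tile([0, 0, 0, 0, 1, 1, 1, 1], (8, 1))   # vertical edge
corner = np.zeros((8, 8))
corner[4:, 4:] = 1                                  # L-shaped corner

for name, p in [("flat", flat), ("edge", edge), ("corner", corner)]:
    print(name, round(shi_tomasi_score(p), 2))
# flat ≈ 0 (no gradient), edge ≈ 0 (one gradient direction), corner > 0
```

Only the corner patch has strong gradients in two independent directions, so only it gets a nonzero minimum eigenvalue: precisely the "unique in both directions" property described above.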
Feature Tracking Simulator

Features (dots) are tracked across frames. Green = successfully tracked. Red = lost. Watch features enter and leave the field of view.

Tracked: 0
Check: Why are corners preferred over edges for tracking?

Chapter 4: Loosely-Coupled Fusion

The simplest approach: run visual odometry (VO) and IMU integration separately, then fuse their pose estimates. Each system outputs a pose — position and orientation — and the fusion layer (often a Kalman filter or complementary filter) combines them.

This is easy to implement: each subsystem is a black box. But it's suboptimal — information is lost when each system compresses its raw data into a single pose estimate before fusion. Correlations between visual and inertial measurements are ignored.
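At its simplest, the fusion step is a variance-weighted average. A one-dimensional sketch (the pose values and variances are made up for illustration):

```python
def fuse(pose_a, var_a, pose_b, var_b):
    """Variance-weighted average of two independent pose estimates,
    the core of a one-dimensional Kalman update."""
    w = var_b / (var_a + var_b)              # trust the less-uncertain source more
    fused_var = var_a * var_b / (var_a + var_b)
    return w * pose_a + (1 - w) * pose_b, fused_var

# Camera: accurate but slow. IMU: fast but drifting. (Illustrative numbers.)
p, var = fuse(pose_a=1.00, var_a=0.01, pose_b=1.30, var_b=0.09)
print(round(p, 3), round(var, 4))   # → 1.03 0.009
```

Note the fused variance is smaller than either input's: combining estimates always helps. What loose coupling throws away is everything upstream of these two numbers.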

Visual Odometry
Tracks features, estimates pose from images → outputs T_cam
IMU Integration
Integrates accel & gyro → outputs T_imu
Fusion (KF)
Combines T_cam and T_imu using uncertainties → T_fused
Loose vs Tight Fusion

The teal line is the true path. Orange is loosely-coupled. Green is tightly-coupled. Notice the loose estimate is noisier.

Analogy: Loose coupling is like asking two people to solve a jigsaw puzzle separately, then voting on the result. Tight coupling is like having them work on the same puzzle together — sharing every piece of information.
Check: What is the main disadvantage of loosely-coupled fusion?

Chapter 5: Tightly-Coupled Fusion

In tightly-coupled VIO, we jointly optimize over raw measurements from both sensors in a single estimator. Instead of combining two pose estimates, we combine individual pixel observations and IMU readings directly. This preserves correlations and produces significantly better results.

The state vector now includes the camera pose, velocity, IMU biases, and possibly feature positions. All measurements — pixel coordinates from the camera and accelerations/angular velocities from the IMU — constrain this joint state simultaneously.

x = [p, v, q, b_a, b_g]     (position, velocity, orientation, biases)
Key advantage: When the camera loses tracking on some features, IMU measurements still constrain velocity and orientation. When the IMU drifts, pixel reprojection errors pull the state back. Every measurement helps constrain everything.
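To make "every measurement constrains the joint state" concrete, here is a hedged sketch of a tightly-coupled state and one visual residual. Two simplifications are assumed: orientation is kept as a rotation matrix rather than a quaternion, and the camera sits at the body origin:

```python
import numpy as np

# A minimal tightly-coupled state:
state = {
    "p":   np.zeros(3),    # position
    "v":   np.zeros(3),    # velocity
    "R":   np.eye(3),      # orientation (world <- body)
    "b_a": np.zeros(3),    # accelerometer bias
    "b_g": np.zeros(3),    # gyroscope bias
}

def reprojection_residual(state, landmark_w, pixel, fx, fy, cx, cy):
    """Visual factor: observed pixel minus the projection of a world
    landmark under the current state estimate."""
    X, Y, Z = state["R"].T @ (landmark_w - state["p"])   # world -> camera
    predicted = np.array([fx * X / Z + cx, fy * Y / Z + cy])
    return predicted - np.asarray(pixel)

r = reprojection_residual(state, np.array([0.5, -0.3, 2.0]),
                          pixel=(357.5, 217.5), fx=150, fy=150, cx=320, cy=240)
print(r)   # zero residual: this state perfectly explains the observation
```

In a real estimator, residuals like this one and the IMU terms are stacked into a single update, so a pixel error can correct the bias estimate and vice versa.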
Approach | Inputs | State | Accuracy
Loosely-coupled | VO pose + IMU pose | Pose only | Good
Tightly-coupled (filter) | Raw pixels + raw IMU | Pose + biases + (features) | Better
Tightly-coupled (optim.) | Raw pixels + raw IMU | Pose + biases + features | Best
Joint State Estimation

Watch how the tightly-coupled estimator simultaneously adjusts pose, velocity, and IMU bias. Drag the bias slider to simulate drift.

IMU bias drift: 0.20 · Visual features: 10
Check: What does the tightly-coupled state vector typically include?

Chapter 6: Preintegration

Between two camera keyframes (say, 100 ms apart), the IMU produces dozens of measurements. Naively, if we change our linearization point for the pose, we'd have to re-integrate all those IMU samples. This is expensive. Preintegration is the clever trick that avoids it.

The idea: integrate IMU measurements in a local frame relative to the starting pose. The resulting "preintegrated measurement" summarizes the relative motion (Δp, Δv, Δq) between keyframes. When the linearization point changes, we only need a cheap first-order correction — no re-integration.

Δp_ij = Σ_k [ Δv_ik · δt + ½ · ΔR_ik (a_k − b_a) · δt² ]
IMU samples (high rate)
a_1, a_2, ..., a_N and ω_1, ω_2, ..., ω_N between keyframes
Preintegrate
Compute Δp, Δv, Δq in local frame. Also compute Jacobians w.r.t. bias.
Bias correction
When bias estimate changes, apply first-order correction instead of re-integrating.
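The steps above can be sketched in one dimension, where there is no rotation, so the first-order bias correction is actually exact (on SO(3) it is only a first-order approximation). The sample values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.005
a = 0.2 + 0.05 * rng.standard_normal(40)   # raw accel samples between keyframes

def integrate(samples, b_a, dt):
    """The expensive path: re-integrate every sample with bias b_a."""
    dv, dp = 0.0, 0.0
    for ak in samples:
        dp += dv * dt + 0.5 * (ak - b_a) * dt**2
        dv += (ak - b_a) * dt
    return dv, dp

# Preintegrate once at a nominal bias of zero, and store the bias Jacobians:
dv0, dp0 = integrate(a, 0.0, dt)
N = len(a)
J_v = -N * dt                   # d(Δv)/d(b_a)
J_p = -0.5 * N**2 * dt**2       # d(Δp)/d(b_a)

# When the optimizer updates the bias, apply the cheap correction instead:
b_new = 0.03
dv_corr = dv0 + J_v * b_new
dp_corr = dp0 + J_p * b_new

dv_ref, dp_ref = integrate(a, b_new, dt)   # expensive reference
print(np.isclose(dv_corr, dv_ref), np.isclose(dp_corr, dp_ref))   # → True True
```

The correction is two multiply-adds regardless of how many IMU samples were summarized; that is the entire payoff of preintegration.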
Preintegration vs Naive

The bar chart shows computational cost. Red = naive re-integration (grows with IMU rate). Green = preintegration (constant cost after initial computation).

IMU rate (Hz): 400 · Optimization iters: 5
Why it matters: In optimization-based VIO, the solver iterates many times, each time updating the linearization point. Without preintegration, each iteration re-integrates hundreds of IMU samples. With it, the cost drops to a single cheap matrix multiply.
Check: What problem does preintegration solve?

Chapter 7: MSCKF

The Multi-State Constraint Kalman Filter (MSCKF) is one of the most successful VIO algorithms. Its key insight: don't keep features in the state vector. Instead, keep a sliding window of camera poses and use features only to create constraints between those poses before marginalizing them out.

When a feature is tracked across N frames, it creates 2N measurement equations (x and y pixel coordinates). Once the feature leaves the view, MSCKF uses these equations to constrain the N poses, then discards the feature. This keeps the state vector small — only O(window size) instead of O(number of features).
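The "use, then discard" step is implemented as a left-null-space projection: multiplying the stacked residual by a basis for the left null space of the feature Jacobian makes the feature's position drop out, leaving constraints on the poses alone. A sketch with random matrices standing in for real Jacobians (all dimensions and values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
n_poses, pose_dim = 4, 6
r   = rng.standard_normal(2 * n_poses)                      # stacked pixel residuals
H_x = rng.standard_normal((2 * n_poses, n_poses * pose_dim))  # Jacobian w.r.t. poses
H_f = rng.standard_normal((2 * n_poses, 3))                 # Jacobian w.r.t. feature

# Columns of U beyond rank(H_f) = 3 span the left null space of H_f:
U, _, _ = np.linalg.svd(H_f, full_matrices=True)
A = U[:, 3:]                                                # shape (2N, 2N-3)

r_o = A.T @ r        # residual no longer depends on the feature position
H_o = A.T @ H_x      # constrains only the camera poses
print(np.linalg.norm(A.T @ H_f))   # ~0: the feature dimension is eliminated
print(r_o.shape, H_o.shape)        # 2N-3 equations remain for the poses
```

Each feature tracked over N frames thus contributes 2N − 3 constraints on the window of poses, and then vanishes from the estimator entirely.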

State: Sliding Window of Poses
x = [IMU state, pose_1, pose_2, ..., pose_N]
Feature Lost
Feature tracked in frames 3,4,5,6 is no longer visible
Multi-State Constraint
Triangulate feature, project into each frame, create residual constraints on poses 3-6
Marginalize Feature
Feature is removed. Poses are updated. State stays compact.
MSCKF State Size

Compare state size: red = keeping all features in state (grows unboundedly). green = MSCKF with pose-only state (bounded).

Window size: 15
MSCKF vs EKF-SLAM: Traditional EKF-SLAM keeps features in the state (O(N²) cost). MSCKF marginalizes features immediately, maintaining O(M) cost where M is the window size. This makes it viable for real-time on mobile devices.
Check: How does MSCKF keep computational cost bounded?

Chapter 8: Sliding Window Optimization

Instead of a Kalman filter, many modern VIO systems use nonlinear optimization over a sliding window of recent states. The idea: collect all visual and inertial measurements in a window, build a factor graph, and solve for the poses that best explain all the data.

Each factor (constraint) in the graph connects variables: IMU preintegration factors connect consecutive poses, visual factors connect a 3D landmark to camera poses that observed it. Marginalization removes old states while preserving their information as a prior.

Filter vs Optimization: Filters (MSCKF, EKF) process measurements once. Optimization re-linearizes and iterates, getting closer to the true maximum-likelihood solution. The tradeoff: optimization is more accurate but more expensive. Modern VIO usually chooses optimization.
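A toy linear "factor graph" makes the joint-solve idea concrete: each row of the system is one factor, and least squares adjusts all poses at once. This uses 1-D poses, unit information weights, and made-up measurements:

```python
import numpy as np

# Three 1-D poses with three factors:
#   prior:     x0 = 0
#   odometry:  x1 - x0 = 1.0
#   odometry:  x2 - x1 = 1.2
J = np.array([
    [ 1.0,  0.0, 0.0],   # prior on x0
    [-1.0,  1.0, 0.0],   # odometry factor x0 -> x1
    [ 0.0, -1.0, 1.0],   # odometry factor x1 -> x2
])
b = np.array([0.0, 1.0, 1.2])

# Add a visual-style factor that observes x2 directly and disagrees
# slightly with the odometry chain:
J = np.vstack([J, [0.0, 0.0, 1.0]])
b = np.append(b, 2.0)

x, *_ = np.linalg.lstsq(J, b, rcond=None)
print(np.round(x, 3))   # the disagreement is spread across all three poses
```

Note how the new factor on x2 shifts x0 and x1 too: in a joint solve, every measurement influences every variable it is connected to through the graph. Real VIO does the same with nonlinear factors, iterating Gauss-Newton steps that each solve a linear system like this one.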
Factor Graph Visualization

Circles are pose variables. Squares are factors (constraints). Orange = IMU factors. Teal = visual factors. Purple = marginalization prior.

Window size: 6 · Landmarks: 4
Method | Type | Accuracy | Latency
EKF-VIO | Filter | Good | Very low
MSCKF | Filter | Better | Low
Sliding Window BA | Optimization | Best | Medium
Full BA | Optimization | Optimal | High
Check: Why does sliding window optimization outperform filtering?

Chapter 9: VIO Systems

Several landmark VIO systems have defined the field. Each makes different design choices — filter vs optimization, feature handling, state representation — but they all fuse visual and inertial data in a tightly-coupled manner.

System | Year | Approach | Key Innovation
MSCKF | 2007 | EKF, no features in state | Efficient multi-state constraints
OKVIS | 2015 | Keyframe-based optimization | Sliding window with marginalization
ROVIO | 2015 | EKF with direct photometric error | No feature extraction needed
VINS-Mono | 2018 | Optimization + loop closure | Complete system with relocalization
Basalt | 2019 | Optimization, visual-inertial | Non-linear factor recovery for marginalization
VINS-Mono is arguably the most influential open-source VIO system. It includes initialization (recovering scale, gravity, biases), tightly-coupled sliding window optimization, loop closure with DBoW2, and pose graph optimization. It's the complete package.
System Comparison Radar

Compare VIO systems across five axes: accuracy, speed, robustness, features, and ease of use.

Where from here? Classical VIO is mature and reliable, but it struggles with textureless surfaces, dramatic lighting changes, and fast motion. Modern VIO replaces handcrafted features with learned ones, uses neural IMU models, and pushes toward end-to-end systems.
Check: What distinguishes VINS-Mono from earlier VIO systems?