The system that fuses camera and IMU to track your position in 3D — powering AR headsets, drones, and autonomous robots even when GPS fails.
A camera can see the world in rich detail, tracking features frame-to-frame. But it struggles in the dark, with motion blur, or when there's nothing interesting to look at. An IMU (Inertial Measurement Unit) measures accelerations and angular velocities directly — it never loses tracking, even in complete darkness. But it drifts fast: double-integrating noisy accelerations yields meters of error within seconds.
Each sensor fails where the other succeeds. The camera drifts slowly but can lose tracking entirely. The IMU drifts fast but never loses tracking. Together, they form a complementary pair: the camera corrects the IMU's drift, and the IMU bridges the camera's gaps.
The teal line is the true path. Red is IMU-only (fast drift). Blue is camera-only (sudden loss). Green is VIO fusion.
An IMU contains two sensors: an accelerometer that measures linear acceleration (more precisely, specific force, so gravity is included in the reading) and a gyroscope that measures angular velocity. Together they provide 6-axis measurements at high rates (100–1000 Hz).
The catch: both sensors are corrupted by bias (a slowly drifting offset) and noise (random jitter). To get velocity from acceleration, you integrate once. To get position, you integrate twice. Each integration amplifies errors — this is why IMU-only position drifts so fast.
Watch how double-integrating noisy acceleration leads to quadratic position drift. Adjust bias and noise levels.
| Symbol | Meaning |
|---|---|
| a_meas | Measured acceleration (what the sensor reports) |
| b_a | Accelerometer bias (slowly drifting offset) |
| n_a | Accelerometer noise (random jitter) |
| ω_meas | Measured angular velocity |
| b_g | Gyroscope bias |
| n_g | Gyroscope noise (random jitter) |
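To see why the drift is quadratic, here is a minimal 1-D simulation of the measurement model above, assuming a constant bias and white noise; the `integrate_imu` helper and all numeric values are illustrative:

```python
import numpy as np

def integrate_imu(a_true, bias, noise_std, dt, rng):
    """Double-integrate noisy, biased accelerometer samples.

    a_meas = a_true + b_a + n_a (the model from the table above;
    everything is 1-D here for clarity).
    """
    n = len(a_true)
    a_meas = a_true + bias + rng.normal(0.0, noise_std, n)
    v = np.cumsum(a_meas) * dt   # first integration: velocity
    p = np.cumsum(v) * dt        # second integration: position
    return p

rng = np.random.default_rng(0)
dt = 0.01                        # 100 Hz IMU
t = np.arange(0, 5, dt)          # 5 seconds, standing perfectly still
a_true = np.zeros_like(t)        # true acceleration is zero

p = integrate_imu(a_true, bias=0.05, noise_std=0.02, dt=dt, rng=rng)
# A constant bias b integrates to p(t) ~ 0.5*b*t^2: about 0.6 m after 5 s,
# even though the sensor error is only 0.05 m/s^2.
print(f"position drift after 5 s: {p[-1]:.2f} m")
```

Even a tiny constant bias dominates after a few seconds, which is exactly the quadratic drift the demo above shows.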
To use a camera for geometry, we need a mathematical model of how 3D points project onto the 2D image. The pinhole model is the simplest: light from a 3D point passes through a tiny hole and hits the image plane. The intrinsic parameters (fx, fy, cx, cy) describe the camera's internal geometry.
Real lenses also introduce distortion — straight lines in the world become curved in the image. We model this with radial and tangential distortion coefficients, then undistort images before processing.
Move the 3D point by adjusting X, Y, Z. Watch where it projects on the image plane. The projection is (f_x·X/Z + c_x, f_y·Y/Z + c_y); with the principal point at the origin, that reduces to (fX/Z, fY/Z).
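The projection step fits in a few lines. This is a sketch with a hypothetical `project` helper; the intrinsics below are assumed values for a 640×480 camera, and lens distortion is ignored:

```python
import numpy as np

def project(point_3d, fx, fy, cx, cy):
    """Pinhole projection: 3D point in the camera frame -> 2D pixel.
    (Illustrative helper; a real pipeline would also undistort.)"""
    X, Y, Z = point_3d
    assert Z > 0, "point must be in front of the camera"
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return np.array([u, v])

# Assumed intrinsics for a 640x480 camera: f = 500 px, center at (320, 240).
uv = project((0.5, -0.2, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(uv)  # a point 2 m ahead and 0.5 m to the right lands right of center
```

Note the division by Z: doubling the depth halves the pixel offset from the principal point, which is why distant points barely move in the image.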
To estimate camera motion, we track features across consecutive frames. A feature is a distinctive patch — typically a corner — that's easy to find again. The KLT (Kanade-Lucas-Tomasi) tracker finds these corners and follows them using optical flow: it assumes the brightness of a pixel patch doesn't change between frames.
Corner detection (e.g., Shi-Tomasi, FAST) identifies pixels with strong gradients in two directions. Optical flow then estimates how each corner moved. Losing features is inevitable — occlusion, leaving the frame, deformation — so new features are regularly detected.
Features (dots) are tracked across frames. Green = successfully tracked. Red = lost. Watch features enter and leave the field of view.
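As a rough sketch of the detection side: the Shi-Tomasi score is the smaller eigenvalue of the structure tensor, large when gradients are strong in both directions (a corner) and near zero along an edge. A toy version with central-difference gradients on a synthetic image, not a production detector:

```python
import numpy as np

def shi_tomasi_response(img, x, y, half=2):
    """Shi-Tomasi corner score at (x, y): the smaller eigenvalue of the
    structure tensor over a (2*half+1)^2 window."""
    Ix = (np.roll(img, -1, axis=1) - np.roll(img, 1, axis=1)) / 2.0
    Iy = (np.roll(img, -1, axis=0) - np.roll(img, 1, axis=0)) / 2.0
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    Sxx, Syy = np.sum(Ix[win] ** 2), np.sum(Iy[win] ** 2)
    Sxy = np.sum(Ix[win] * Iy[win])
    # Smaller eigenvalue of [[Sxx, Sxy], [Sxy, Syy]].
    tr, det = Sxx + Syy, Sxx * Syy - Sxy ** 2
    return tr / 2 - np.sqrt(max(tr ** 2 / 4 - det, 0.0))

# Synthetic 16x16 image: bright square on a dark background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
corner = shi_tomasi_response(img, 4, 4)  # square corner: gradients in 2 dirs
edge = shi_tomasi_response(img, 8, 4)    # top edge: gradient in 1 dir only
print(corner > edge)
```

An edge scores near zero because a patch can slide along it without changing appearance; only a true corner pins down motion in both directions, which is what makes it trackable.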
The simplest approach: run visual odometry (VO) and IMU integration separately, then fuse their pose estimates. Each system outputs a pose — position and orientation — and the fusion layer (often a Kalman filter or complementary filter) combines them.
This is easy to implement: each subsystem is a black box. But it's suboptimal — information is lost when each system compresses its raw data into a single pose estimate before fusion. Correlations between visual and inertial measurements are ignored.
The teal line is the true path. Orange is loosely-coupled. Green is tightly-coupled. Notice the loose estimate is noisier.
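A 1-D sketch of such a fusion layer, using a complementary filter with a made-up gain (`alpha`): the IMU's high-rate increments propagate the estimate, and the noisy but drift-free VO pose pulls it back:

```python
import numpy as np

def complementary_fuse(vo_pos, imu_pos, alpha=0.8):
    """Loosely-coupled fusion sketch (1-D position, illustrative gain):
    trust the IMU at high frequency, the VO estimate at low frequency."""
    fused = np.empty_like(imu_pos)
    fused[0] = vo_pos[0]
    for k in range(1, len(imu_pos)):
        imu_delta = imu_pos[k] - imu_pos[k - 1]        # high-rate increment
        prediction = fused[k - 1] + imu_delta          # propagate with IMU
        fused[k] = alpha * prediction + (1 - alpha) * vo_pos[k]  # VO correction
    return fused

# True path: constant 1 m/s. IMU dead-reckoning drifts; VO is noisy.
rng = np.random.default_rng(1)
t = np.arange(0, 10, 0.1)
true_pos = 1.0 * t
imu_pos = true_pos + 0.05 * t ** 2               # quadratic IMU drift
vo_pos = true_pos + rng.normal(0, 0.3, len(t))   # noisy, drift-free VO

fused = complementary_fuse(vo_pos, imu_pos)
imu_err = abs(imu_pos[-1] - true_pos[-1])        # meters of drift at t = 10 s
fused_err = abs(fused[-1] - true_pos[-1])
print(imu_err, fused_err)
```

The fused estimate inherits the IMU's smoothness and the camera's boundedness, but note the black-box limitation from above: the filter only ever sees two pose numbers, not the raw measurements behind them.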
In tightly-coupled VIO, we jointly optimize over raw measurements from both sensors in a single estimator. Instead of combining two pose estimates, we combine individual pixel observations and IMU readings directly. This preserves correlations and produces significantly better results.
The state vector now includes the camera pose, velocity, IMU biases, and possibly feature positions. All measurements — pixel coordinates from the camera and accelerations/angular velocities from the IMU — constrain this joint state simultaneously.
| Approach | Inputs | State | Accuracy |
|---|---|---|---|
| Loosely-coupled | VO pose + IMU pose | Pose only | Good |
| Tightly-coupled (filter) | Raw pixels + raw IMU | Pose + biases + (features) | Better |
| Tightly-coupled (optim.) | Raw pixels + raw IMU | Pose + biases + features | Best |
Watch how the tightly-coupled estimator simultaneously adjusts pose, velocity, and IMU bias. Drag the bias slider to simulate drift.
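The joint state can be sketched as a plain container. The 15-D layout below is typical for error-state estimators, but it is an assumption here; real systems differ in details such as quaternion vs. axis-angle orientation:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class VioState:
    """Joint state of a tightly-coupled estimator (illustrative layout)."""
    position: np.ndarray = field(default_factory=lambda: np.zeros(3))
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))
    orientation: np.ndarray = field(default_factory=lambda: np.zeros(3))  # axis-angle
    accel_bias: np.ndarray = field(default_factory=lambda: np.zeros(3))
    gyro_bias: np.ndarray = field(default_factory=lambda: np.zeros(3))

    def as_vector(self):
        """Stack into the 15-D vector that the estimator updates. Both pixel
        residuals and IMU residuals constrain this one vector jointly."""
        return np.concatenate([self.position, self.velocity, self.orientation,
                               self.accel_bias, self.gyro_bias])

x = VioState()
print(x.as_vector().shape)
```

Because the biases live in the same vector as the pose, a pixel measurement can indirectly correct the accelerometer bias, something the loosely-coupled design structurally cannot do.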
Between two camera keyframes (say, 100 ms apart), the IMU produces dozens of measurements. Naively, if we change our linearization point for the pose, we'd have to re-integrate all those IMU samples. This is expensive. Preintegration is the clever trick that avoids it.
The idea: integrate IMU measurements in a local frame relative to the starting pose. The resulting "preintegrated measurement" summarizes the relative motion (Δp, Δv, Δq) between keyframes. When the linearization point changes, we only need a cheap first-order correction — no re-integration.
The bar chart shows computational cost. Red = naive re-integration (grows with IMU rate). Green = preintegration (constant cost after initial computation).
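A translation-only sketch of the idea (rotation, Δq, and the bias-correction Jacobians are omitted): the deltas are computed once from the raw samples, then applied to any starting state without touching those samples again:

```python
import numpy as np

def preintegrate(accels, dt):
    """Integrate IMU samples in the local frame of the first keyframe.
    One pass over the raw data; returns the relative motion (dp, dv)."""
    dv = np.zeros(3)
    dp = np.zeros(3)
    for a in accels:
        dp = dp + dv * dt + 0.5 * a * dt ** 2
        dv = dv + a * dt
    return dp, dv

def predict(p0, v0, dp, dv, T, gravity):
    """Apply the preintegrated deltas to ANY starting state; no
    re-integration of the raw samples is needed."""
    p1 = p0 + v0 * T + 0.5 * gravity * T ** 2 + dp
    v1 = v0 + gravity * T + dv
    return p1, v1

# 10 IMU samples at 100 Hz between two keyframes (gravity-compensated here).
accels = [np.array([1.0, 0.0, 0.0])] * 10
dp, dv = preintegrate(accels, dt=0.01)

# The same deltas serve two different starting states:
p1, _ = predict(np.zeros(3), np.zeros(3), dp, dv, T=0.1, gravity=np.zeros(3))
p2, _ = predict(np.ones(3), np.zeros(3), dp, dv, T=0.1, gravity=np.zeros(3))
print(p1[0], p2[0])
```

In a real implementation the first-order bias correction mentioned above is what lets the same trick survive a change of linearization point, not just a change of starting pose.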
The Multi-State Constraint Kalman Filter (MSCKF) is one of the most successful VIO algorithms. Its key insight: don't keep features in the state vector. Instead, keep a sliding window of camera poses and use features only to create constraints between those poses before marginalizing them out.
When a feature is tracked across N frames, it creates 2N measurement equations (x and y pixel coordinates). Once the feature leaves the view, MSCKF uses these equations to constrain the N poses, then discards the feature. This keeps the state vector small — only O(window size) instead of O(number of features).
Compare state size: red = keeping all features in the state (grows unboundedly). Green = MSCKF with pose-only state (bounded).
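A back-of-the-envelope comparison of the two bookkeeping strategies; the 15-D core IMU state and 6-DoF pose clones are typical sizes, assumed here for illustration:

```python
def state_size_full_slam(n_poses, n_features):
    """EKF-SLAM style: every feature lives in the state (grows unboundedly)."""
    return 15 + 3 * n_features     # core IMU state + 3-D feature positions

def state_size_msckf(window):
    """MSCKF: core IMU state plus a sliding window of cloned camera poses."""
    return 15 + 6 * window         # 6 DoF per clone, features excluded

def n_measurement_rows(n_frames):
    """A feature tracked over N frames still contributes 2N rows to the
    update: it constrains the poses without ever entering the state."""
    return 2 * n_frames            # one (u, v) pixel pair per frame

print(state_size_full_slam(1, 500))  # features dominate the state
print(state_size_msckf(10))          # bounded by the window size
print(n_measurement_rows(8))         # constraint rows from one 8-frame track
```

Since EKF cost scales roughly cubically with state size, keeping 500 features in the state versus a 10-pose window is the difference between an offline and a real-time system.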
Instead of a Kalman filter, many modern VIO systems use nonlinear optimization over a sliding window of recent states. The idea: collect all visual and inertial measurements in a window, build a factor graph, and solve for the poses that best explain all the data.
Each factor (constraint) in the graph connects variables: IMU preintegration factors connect consecutive poses, visual factors connect a 3D landmark to camera poses that observed it. Marginalization removes old states while preserving their information as a prior.
Circles are pose variables. Squares are factors (constraints). Orange = IMU factors. Teal = visual factors. Purple = marginalization prior.
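The bookkeeping can be sketched as a toy graph structure. This shows structure only: a real solver would linearize each factor and run Gauss-Newton, and the "prior" here merely records which variables it touches rather than storing actual information matrices:

```python
class FactorGraph:
    """Minimal sliding-window factor graph bookkeeping (illustrative)."""

    def __init__(self):
        self.factors = []  # list of (kind, connected variable names)

    def add_imu_factor(self, pose_i, pose_j):
        self.factors.append(("imu", (pose_i, pose_j)))      # consecutive poses

    def add_visual_factor(self, landmark, pose):
        self.factors.append(("visual", (landmark, pose)))   # observation

    def marginalize(self, pose):
        """Drop factors touching `pose`; summarize them as a prior on the
        other variables they touched (information kept, state removed)."""
        touched, kept = [], []
        for kind, vars_ in self.factors:
            (touched if pose in vars_ else kept).append((kind, vars_))
        neighbors = {v for _, vars_ in touched for v in vars_ if v != pose}
        self.factors = kept + [("prior", tuple(sorted(neighbors)))]

g = FactorGraph()
for i in range(3):
    g.add_imu_factor(f"x{i}", f"x{i+1}")      # IMU preintegration factors
for p in ("x0", "x1", "x2"):
    g.add_visual_factor("l0", p)              # landmark l0 seen from 3 poses
g.marginalize("x0")                           # oldest pose leaves the window
print(g.factors)
```

After marginalizing `x0`, the graph no longer mentions it, but a prior now couples `x1` and `l0`: the information from the dropped factors is not thrown away, it is compressed.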
| Method | Type | Accuracy | Latency |
|---|---|---|---|
| EKF-VIO | Filter | Good | Very low |
| MSCKF | Filter | Better | Low |
| Sliding Window BA | Optimization | Best | Medium |
| Full BA | Optimization | Optimal | High |
Several landmark VIO systems have defined the field. Each makes different design choices — filter vs optimization, feature handling, state representation — but they all fuse visual and inertial data in a tightly-coupled manner.
| System | Year | Approach | Key Innovation |
|---|---|---|---|
| MSCKF | 2007 | EKF, no features in state | Efficient multi-state constraints |
| OKVIS | 2015 | Keyframe-based optimization | Sliding window with marginalization |
| ROVIO | 2015 | EKF with direct photometric error | No feature extraction needed |
| VINS-Mono | 2018 | Optimization + loop closure | Complete system with relocalization |
| Basalt | 2019 | Optimization, visual-inertial mapping | Non-linear factor recovery for marginalization |
Compare VIO systems across five axes: accuracy, speed, robustness, features, and ease of use.