The system that fuses camera and IMU to track your position in 3D — powering AR headsets, drones, and autonomous robots even when GPS fails.
A camera can see the world in rich detail, tracking features frame-to-frame. But it struggles in the dark, with motion blur, or when there's nothing interesting to look at. An IMU (Inertial Measurement Unit) measures accelerations and angular velocities directly — it never loses tracking, even in complete darkness. But it drifts fast: double-integrating noisy accelerations yields meters of error within seconds.
Each sensor fails where the other succeeds. The camera drifts slowly but can lose tracking entirely. The IMU drifts fast but never loses tracking. Together, they form a complementary pair: the camera corrects the IMU's drift, and the IMU bridges the camera's gaps.
The teal line is the true path. Red is IMU-only (fast drift). Blue is camera-only (sudden loss). Green is VIO fusion.
An IMU contains two sensors: an accelerometer that measures linear acceleration (including gravity!) and a gyroscope that measures angular velocity. Together they provide 6-axis measurements at high rates (100–1000 Hz).
The catch: both sensors are corrupted by bias (a slowly drifting offset) and noise (random jitter). To get velocity from acceleration, you integrate once. To get position, you integrate twice. Each integration amplifies errors — this is why IMU-only position drifts so fast.
Watch how double-integrating noisy acceleration leads to quadratic position drift. Adjust bias and noise levels.
| Symbol | Meaning | Typical Value |
|---|---|---|
| ameas | Measured acceleration (what the sensor reports) | — |
| ba | Accelerometer bias (slowly drifting offset) | ~0.01–0.1 m/s² |
| na | Accelerometer noise density | ~0.1 m/s²/√Hz |
| ωmeas | Measured angular velocity | — |
| bg | Gyroscope bias | ~0.001–0.01 rad/s |
| ng | Gyroscope noise density | ~0.01 rad/s/√Hz |
The IMU integration between camera frames IS the Kalman prediction step. The control input u is the IMU measurement. The camera observation IS the Kalman update step. VIO is literally a Kalman filter where the prediction uses inertial data and the correction uses visual data.
In the Kalman filter, what determines how much you trust the prediction vs the measurement? How does the same concept appear in VIO as "tight coupling"?
To use a camera for geometry, we need a mathematical model of how 3D points project onto the 2D image. The pinhole model is the simplest: light from a 3D point passes through a tiny hole and hits the image plane. The intrinsic parameters (fx, fy, cx, cy) describe the camera's internal geometry.
Real lenses also introduce distortion — straight lines in the world become curved in the image. We model this with radial and tangential distortion coefficients, then undistort images before processing.
Move the 3D point by adjusting X, Y, Z. Watch where it projects on the image plane. The projection is simply (fX/Z, fY/Z).
To estimate camera motion, we track features across consecutive frames. A feature is a distinctive patch — typically a corner — that's easy to find again. The KLT (Kanade-Lucas-Tomasi) tracker finds these corners and follows them using optical flow: it assumes the brightness of a pixel patch doesn't change between frames.
Corner detection (e.g., Shi-Tomasi, FAST) identifies pixels with strong gradients in two directions. Optical flow then estimates how each corner moved. Losing features is inevitable — occlusion, leaving the frame, deformation — so new features are regularly detected.
KLT tracking rests on the brightness constancy assumption: a pixel's intensity doesn't change between frames. Mathematically:
Taylor-expanding and dividing by δt gives the optical flow equation: Ix u + Iy v + It = 0, where u, v are the pixel velocities. This is one equation in two unknowns — underdetermined. KLT resolves this by assuming the flow is constant over a small window (typically 21×21 pixels) and solving the resulting overdetermined system via least squares.
In practice, a VIO system tracks 100–200 features simultaneously. Each frame, features are tracked with KLT, new corners are detected to replace lost ones (via Shi-Tomasi or FAST), and the surviving tracks provide the visual constraints for the estimator. A healthy track length is 10–30 frames before a feature is lost or leaves the field of view.
Features (dots) are tracked across frames. Green = successfully tracked. Red = lost. Watch features enter and leave the field of view.
A 3D point P is observed from two camera positions. In frame 1, it projects to pixel p1. In frame 2 (after rotation R and translation t), it projects to p2. The normalized image coordinates are x1 = K-1p1 and x2 = K-1p2.
Your task: Derive the epipolar constraint x2T E x1 = 0, where E = [t]× R is the essential matrix. Explain geometrically why this constraint holds.
Full derivation:
1. Point P in frame 1: X1 = λ1 x1 (ray from camera 1 through the pixel).
2. Same point in frame 2: X2 = R X1 + t = λ1 R x1 + t.
3. Also X2 = λ2 x2. So: λ2 x2 = λ1 R x1 + t.
4. Take cross product with t on both sides: λ2 (t × x2) = λ1 (t × R x1) + (t × t). The last term is zero.
5. Dot both sides with x2: λ2 x2 · (t × x2) = λ1 x2 · (t × R x1). The left side is zero (scalar triple product with repeated vector). So: x2T [t]× R x1 = 0.
6. Define E = [t]× R. We get: x2T E x1 = 0.
The key insight: The epipolar constraint eliminates depth entirely! Even though we don't know how far away P is, the constraint between x1 and x2 must hold for ANY valid correspondence. This is why we can recover camera motion (E) from feature matches alone, without knowing 3D structure. Each match gives one equation; 5 matches suffice to solve for E (the 5-point algorithm).
Pure rotation produces no baseline between views. Without translation, all epipolar lines collapse to a single point (the epipole is undefined), and the essential matrix becomes degenerate (rank 0 instead of rank 2). You cannot triangulate depth from features because there's no parallax — a point 1m away and 100m away both move identically under rotation.
For VIO, this means the initialization phase REQUIRES translational motion. This is why AR systems tell you to "move your phone" at startup — pure rotation gives orientation but no scale or depth. The IMU can detect translation (via acceleration), confirming when sufficient parallax exists for visual initialization.
The simplest approach: run visual odometry (VO) and IMU integration separately, then fuse their pose estimates. Each system outputs a pose — position and orientation — and the fusion layer (often a Kalman filter or complementary filter) combines them.
This is easy to implement: each subsystem is a black box. But it's suboptimal — information is lost when each system compresses its raw data into a single pose estimate before fusion. Correlations between visual and inertial measurements are ignored.
The teal line is the true path. Orange is loosely-coupled. Green is tightly-coupled. Notice the loose estimate is noisier.
In loose coupling, each subsystem reports a 6-DOF pose and its covariance. But it discards everything else — which features were tracked, how confident each feature match was, which direction the uncertainty is largest. The fusion layer sees two Gaussian blobs and averages them. It can't know that the VO system was tracking features on a distant building (high depth uncertainty) while the nearby ground had no texture (no features at all).
Tight coupling preserves all this structure. The raw 2D pixel measurements enter the optimizer directly, along with their individual uncertainties. An uncertain feature at 50 meters contributes less than a confident feature at 2 meters. This per-measurement weighting is impossible with loose coupling.
In tightly-coupled VIO, we jointly optimize over raw measurements from both sensors in a single estimator. Instead of combining two pose estimates, we combine individual pixel observations and IMU readings directly. This preserves correlations and produces significantly better results.
The state vector now includes the camera pose, velocity, IMU biases, and possibly feature positions. All measurements — pixel coordinates from the camera and accelerations/angular velocities from the IMU — constrain this joint state simultaneously.
The full state per keyframe is 16 parameters: position p (3), velocity v (3), orientation quaternion q (4), accelerometer bias ba (3), and gyroscope bias bg (3). With N keyframes in the sliding window plus M 3D landmarks, the optimizer solves over 16N + 3M parameters simultaneously.
For a typical window of 10 keyframes with 50 landmarks: 16 × 10 + 3 × 50 = 310 parameters. The optimizer must find values for all 310 simultaneously. This is why efficient solvers (Gauss-Newton, Levenberg-Marquardt) and sparse matrix structure matter so much.
| Approach | Inputs | State | Accuracy |
|---|---|---|---|
| Loosely-coupled | VO pose + IMU pose | Pose only | Good |
| Tightly-coupled (filter) | Raw pixels + raw IMU | Pose + biases + (features) | Better |
| Tightly-coupled (optim.) | Raw pixels + raw IMU | Pose + biases + features | Best |
Watch how the tightly-coupled estimator simultaneously adjusts pose, velocity, and IMU bias. Drag the bias slider to simulate drift.
Between two camera keyframes (say, 100 ms apart), the IMU produces dozens of measurements. Naively, if we change our linearization point for the pose, we'd have to re-integrate all those IMU samples. This is expensive. Preintegration is the clever trick that avoids it.
The idea: integrate IMU measurements in a local frame relative to the starting pose. The resulting "preintegrated measurement" summarizes the relative motion (Δp, Δv, Δq) between keyframes. When the linearization point changes, we only need a cheap first-order correction — no re-integration.
The bar chart shows computational cost. Red = naive re-integration (grows with IMU rate). Green = preintegration (constant cost after initial computation).
Between two keyframes at 15–30 Hz, the IMU fires at 200–1000 Hz. That means 7–67 IMU samples per keyframe interval. The preintegrated measurement is a compact tuple: ΔR (relative rotation), Δv (relative velocity), and Δp (relative position), all expressed in the local frame of the interval's starting pose.
When the optimizer updates the starting pose, you do not re-integrate those 7–67 IMU samples. Instead, you apply a first-order correction using the precomputed Jacobians with respect to bias:
Typical IMU noise densities: gyroscope ~0.01 rad/s/√Hz, accelerometer ~0.1 m/s²/√Hz. After 1 second of dead-reckoning without any visual correction, you accumulate roughly 0.6° rotation error and ~5 cm position error. This is why keyframes must arrive frequently — preintegration bridges the gap cheaply.
Between keyframes i and j, you preintegrate N IMU samples to get Δvij (relative velocity change). This preintegrated measurement depends on the bias ba used during integration. When the optimizer updates ba to ba + δba, you need to correct Δv without re-integrating.
Your task: Show that Δvcorrected ≈ Δv + Jbav · δba, and derive the form of the Jacobian Jbav = ∂Δv / ∂ba.
Full derivation:
1. Discrete preintegration formula: Δvij = Σk=ij-1 ΔRik (ak − ba) δt
2. Perturb bias: ba → ba + δba
3. New velocity: Δv' = Σk ΔRik (ak − ba − δba) δt = Δvij − [Σk ΔRik δt] δba
4. Therefore: Jbav = ∂Δv/∂ba = −Σk=ij-1 ΔRik · δt
5. This Jacobian is computed during preintegration (accumulated alongside Δv) at zero extra cost. When the optimizer changes ba by δba, the correction is a single matrix-vector multiply: Jbav · δba.
The key insight: Bias enters the integration LINEARLY (it's subtracted from each acceleration measurement). This linearity is what makes first-order correction exact — there's no second-order term to worry about. For the rotation preintegration, bias correction is more complex because rotations are nonlinear (Lie group), requiring the exponential map: ΔRcorrected ≈ ΔR · Exp(JbgR δbg).
The Multi-State Constraint Kalman Filter (MSCKF) is one of the most successful VIO algorithms. Its key insight: don't keep features in the state vector. Instead, keep a sliding window of camera poses and use features only to create constraints between those poses before marginalizing them out.
When a feature is tracked across N frames, it creates 2N measurement equations (x and y pixel coordinates). Once the feature leaves the view, MSCKF uses these equations to constrain the N poses, then discards the feature. This keeps the state vector small — only O(window size) instead of O(number of features).
Compare state size: red = keeping all features in state (grows unboundedly). green = MSCKF with pose-only state (bounded).
MSCKF keeps the IMU state (16 parameters) plus a sliding window of N camera poses (7 parameters each: position + quaternion). Total state size: 16 + 7N. With N=20 poses: 156 parameters. The EKF covariance matrix is 156×156 — easily manageable.
By contrast, EKF-SLAM keeps features in the state. With 200 landmarks at 3 parameters each: 16 + 600 = 616 parameters. The covariance is 616×616. Updating it costs O(616²) per measurement. And the state keeps growing. MSCKF's state is bounded — no matter how many features pass through, the state never exceeds 16 + 7N.
Instead of a Kalman filter, many modern VIO systems use nonlinear optimization over a sliding window of recent states. The idea: collect all visual and inertial measurements in a window, build a factor graph, and solve for the poses that best explain all the data.
Each factor (constraint) in the graph connects variables: IMU preintegration factors connect consecutive poses, visual factors connect a 3D landmark to camera poses that observed it. Marginalization removes old states while preserving their information as a prior.
When a keyframe leaves the sliding window, we can't just discard it — it carried information about the remaining states. Marginalization via the Schur complement converts the old frame's constraints into a dense prior on the states that remain. Concretely, if the full information matrix is partitioned into "keep" and "drop" blocks:
This prior becomes a new factor in the graph. The cost of one marginalization step is O(N³) in window size N — not O(K³) in total keyframes K seen so far. With a window of 10–20 keyframes, this is fast. Without it (full bundle adjustment over all keyframes), the system would grind to a halt after a few seconds of operation.
Circles are pose variables. Squares are factors (constraints). Orange = IMU factors. Teal = visual factors. Purple = marginalization prior.
| Method | Type | Accuracy | Latency |
|---|---|---|---|
| EKF-VIO | Filter | Good | Very low |
| MSCKF | Filter | Better | Low |
| Sliding Window BA | Optimization | Best | Medium |
| Full BA | Optimization | Optimal | High |
Each factor encodes a measurement and its uncertainty (covariance). The optimizer minimizes the sum of squared Mahalanobis distances across all factors:
| Factor Type | Connects | Residual |
|---|---|---|
| IMU preintegration | Posei ↔ Posei+1 | Δp, Δv, Δq vs predicted from poses |
| Visual reprojection | Posej ↔ Landmarkk | Measured pixel − projected pixel (2D error) |
| Marginalization prior | First pose in window | Dense prior from Schur complement |
| Bias random walk | Posei ↔ Posei+1 | Constrains bias change between keyframes |
The solver (typically Gauss-Newton or Levenberg-Marquardt) linearizes all residuals, forms the normal equations H · δχ = −b, and solves for the state update. The Hessian H is sparse because each factor only connects a few variables — this sparsity is what makes sliding window optimization tractable.
In sliding window BA, the Hessian H has a specific sparsity pattern. Poses connect to each other (via IMU factors) and to landmarks (via visual factors), but landmarks don't connect to each other. The Hessian partitions into pose block Hpp, landmark block Hll, and cross-block Hpl.
Your task: Show that eliminating the landmark variables from the normal equations Hδx = −b produces the reduced system Hpp* δxp = −bp* where Hpp* = Hpp − Hpl Hll−1 Hlp. Why is Hll−1 cheap?
Full derivation:
1. From equation 2: δxl = −Hll−1(bl + Hlpδxp)
2. Substitute into equation 1: Hppδxp + Hpl[−Hll−1(bl + Hlpδxp)] = −bp
3. Rearrange: (Hpp − HplHll−1Hlp)δxp = −(bp − HplHll−1bl)
4. Define: Hpp* = Hpp − HplHll−1Hlp (the Schur complement)
5. The reduced system is much smaller: if we have 10 poses (160 dims) and 50 landmarks (150 dims), the original system is 310×310. The reduced system is only 160×160. We solve for poses first, then back-substitute to get landmarks.
The key insight: Hll is block-diagonal because landmarks are conditionally independent given the poses. No two landmarks ever share a factor. This means Hll−1 costs O(M) not O(M³). The entire Schur complement costs O(N²M) where N is pose dimensions and M is landmark count — the standard trick that makes visual SLAM computationally tractable.
Several landmark VIO systems have defined the field. Each makes different design choices — filter vs optimization, feature handling, state representation — but they all fuse visual and inertial data in a tightly-coupled manner.
| System | Year | Approach | Key Innovation |
|---|---|---|---|
| MSCKF | 2007 | EKF, no features in state | Efficient multi-state constraints |
| OKVIS | 2015 | Keyframe-based optimization | Sliding window with marginalization |
| ROVIO | 2015 | EKF with direct photometric | No feature extraction needed |
| VINS-Mono | 2018 | Optimization + loop closure | Complete system with relocalization |
| Basalt | 2019 | Optimization, visual-inertial | Non-linear factor recovery for marginalization |
A monocular VIO system faces a chicken-and-egg problem at startup: you need depth to triangulate features, but you need features to estimate depth. The initialization procedure (typically 1–2 seconds) does this:
This is why VIO systems need the user to move during startup — pure rotation provides no translation for scale estimation. ARKit's "move your phone around" instruction at launch is literally the VIO initialization phase.
Compare VIO systems across five axes: accuracy, speed, robustness, features, and ease of use.
Real-world solution (Apple ARKit / Meta Aria):
1. Dual-rate architecture: IMU propagation runs at 800-1000 Hz on a dedicated core with <1ms latency. This is the "prediction" that drives the display. Visual corrections run asynchronously at 30 Hz on the other core. The display always gets the IMU-propagated pose; visual corrections arrive "late" and retroactively fix drift. This decouples display latency from visual processing time.
2. MSCKF for the critical path, not optimization. The EKF update is a single matrix multiply (~0.5ms for 30 features). The optimization runs in a background thread for relocalization and map maintenance, but never blocks the display loop.
3. Graceful degradation: Under thermal throttle: reduce feature count (100 → 30), increase keyframe spacing (30Hz → 15Hz), skip outlier rejection (trust RANSAC less). The IMU-only propagation continues at full rate regardless.
4. IMU-only bridging: During visual blackout, trust IMU propagation. The key: preintegrated covariance tells you EXACTLY how much drift to expect. After 10 seconds of IMU-only: ~50cm position uncertainty. When vision recovers, the large innovation is expected (high covariance) and the system absorbs the correction smoothly. The alternative (reset) would cause visible "jumps" in the AR content.
python import numpy as np from scipy.spatial.transform import Rotation def exp_so3(omega): """Exponential map: rotation vector -> rotation matrix.""" angle = np.linalg.norm(omega) if angle < 1e-10: return np.eye(3) + skew(omega) axis = omega / angle K = skew(axis) return np.eye(3) + np.sin(angle)*K + (1-np.cos(angle))*(K @ K) def skew(v): return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]]) def preintegrate(imu_samples, bias_a, bias_g, dt): delta_R = np.eye(3) delta_v = np.zeros(3) delta_p = np.zeros(3) J_ba_v = np.zeros((3, 3)) J_ba_p = np.zeros((3, 3)) for accel, gyro in imu_samples: # Subtract biases a_corr = np.array(accel) - bias_a w_corr = np.array(gyro) - bias_g # Rotate acceleration into local frame a_local = delta_R @ a_corr # Accumulate Jacobians (BEFORE updating delta_v/delta_p) J_ba_v += -delta_R * dt J_ba_p += J_ba_v * dt - 0.5 * delta_R * (dt**2) # Integrate position (uses current velocity) delta_p += delta_v * dt + 0.5 * a_local * (dt**2) # Integrate velocity delta_v += a_local * dt # Integrate rotation delta_R = delta_R @ exp_so3(w_corr * dt) return delta_p, delta_v, delta_R, J_ba_v, J_ba_p
VIO trades map persistence for bounded computation. SLAM trades computation for the ability to recognize previously visited places. VINS-Mono bridges both: it runs VIO for real-time tracking and maintains a separate pose graph for loop closure. When the robot returns to a known place, the pose graph corrects accumulated drift — something pure VIO can never do.
If you added loop closure to a VIO system, what additional components would you need? (Hint: place recognition, pose graph, and what happens to the marginalization prior when a loop is detected?)