microModernVIO — Deep Learning Meets Visual-Inertial Odometry

Chapter 0: Learning Meets Inertial

Classical VIO systems like VINS-Mono and MSCKF work remarkably well in "nice" conditions — good lighting, textured surfaces, slow motion. But in the real world, conditions are harsh: think helmet-mounted cameras on firefighters, drones in rain, or AR glasses transitioning from indoor to outdoor. Classical pipelines break down because their hand-crafted components can't generalize.

Deep learning offers a compelling answer: learn from data what the classical pipeline hard-codes. Learn which pixels are good features. Learn how to model IMU noise. Learn the entire odometry pipeline end-to-end. The question isn't whether to use learning, but where in the pipeline.

Before we dive in, let's ground ourselves in the numbers. A typical VIO setup: camera at 30 Hz, IMU at 200 Hz. Between two camera frames (~33ms apart), the IMU delivers ~7 measurements. Each IMU measurement is 6 numbers: accelerometer [a_x, a_y, a_z] in m/s² and gyroscope [ω_x, ω_y, ω_z] in rad/s. Preintegration is the trick that makes VIO efficient: accumulate those 7 IMU samples into a single rotation/velocity/position delta, without needing to know the absolute pose. When the visual system later updates the pose, the preintegrated measurement is "applied" instantly — no need to re-integrate all 7 samples. This saves enormous computation in the optimization loop.
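To make the preintegration idea concrete, here is a minimal sketch in NumPy. It assumes simple Euler integration, ignores noise covariance and bias Jacobians, and leaves gravity to be handled later in the residual (as is common in preintegration formulations). The function names `so3_exp` and `preintegrate` are illustrative, not taken from any particular VIO library.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation vector (rad) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation for tiny angles
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def preintegrate(accel, gyro, dt, b_a=np.zeros(3), b_g=np.zeros(3)):
    """Accumulate IMU samples into (dR, dv, dp), expressed in the body
    frame of the first sample. No absolute pose is needed."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for a, w in zip(accel, gyro):
        a_body = dR @ (a - b_a)              # rotate into the first frame
        dp = dp + dv * dt + 0.5 * a_body * dt**2
        dv = dv + a_body * dt
        dR = dR @ so3_exp((w - b_g) * dt)    # integrate angular velocity
    return dR, dv, dp

# Hypothetical data: 7 samples at 200 Hz, constant 1 m/s^2 along x, no rotation.
dt = 1.0 / 200.0
accel = np.tile([1.0, 0.0, 0.0], (7, 1))
gyro = np.zeros((7, 3))
dR, dv, dp = preintegrate(accel, gyro, dt)
print(dv, dp)
```

The three returned deltas are what the optimizer consumes as a single factor between two camera frames, instead of seven separate IMU constraints.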

The spectrum: At one extreme, replace a single component (e.g., feature detection). At the other extreme, replace the entire VIO pipeline with a single neural network. Most successful modern systems sit somewhere in the middle — learning the hard parts, keeping the geometry.

The VIO state vector: To ground the discussion, here is what a VIO system actually estimates at each timestep: [position(3), velocity(3), rotation(4, quaternion), accel_bias(3), gyro_bias(3)] = 16 dimensions. The biases are slowly drifting offsets in the IMU readings that must be estimated alongside pose. Without bias estimation, even a small uncorrected accelerometer offset (typically 0.01–0.1 m/s²) is integrated twice, so position error grows quadratically and the trajectory diverges within seconds.

The Learning Spectrum

Drag the slider to see which components are classical (blue) vs learned (green) at different points on the spectrum.


Check: Why are classical VIO systems being augmented with deep learning?

Deep learning is always faster than classical methods
Hand-crafted components fail to generalize to harsh real-world conditions
Deep learning doesn't require calibration

Chapter 1: Deep Visual Features for VIO

The first and most natural place to inject learning: replace FAST/ORB feature detection with a learned detector like SuperPoint. In a VIO context, this means the visual front-end (feature detection, description, and matching) uses a neural network while the back-end (optimization, IMU integration) remains classical.

This is a drop-in replacement that's easy to integrate. Systems like SuperPoint + SuperGlue + VINS-Mono back-end get the best of both worlds: learned robustness in the visual front-end and proven geometric rigor in the optimization.

What does "drop-in" actually mean in practice? Classical VINS-Mono's front-end outputs tracked feature positions [N, 2] across frames. The back-end only cares about these 2D tracks; it doesn't know or care how they were produced. So you swap the classical corner detection + KLT tracking (VINS-Mono uses Shi-Tomasi corners rather than FAST) for SuperPoint detection + SuperGlue matching, output the same [N, 2] tracks, and the back-end works unchanged. The main engineering consideration is compute: SuperPoint+SuperGlue needs a GPU (~10ms per frame), while corner detection + KLT runs on CPU (~3ms). If your platform has a GPU, the added robustness comes at little extra cost; if not, look at LightGlue (2023), which is reported to be roughly 3x faster than SuperGlue with comparable accuracy.

Classical Front-End

FAST corners → ORB descriptors → brute-force matching

↓ replace with

Learned Front-End

SuperPoint keypoints → 256-dim descriptors → SuperGlue matching

↓ keep

Classical Back-End

IMU preintegration → sliding window optimization → pose output

Feature Quality Under Degradation

Track count over time. Red = classical features (crash when count drops to zero). Green = learned features (maintain tracks through difficulties).

Key insight: You don't need to rewrite the whole VIO system. Swapping in SuperPoint/SuperGlue as the visual front-end can improve robustness dramatically with minimal architectural changes. This is the lowest-risk way to modernize a classical VIO pipeline.

Data flow (hybrid VIO): Each visual frame at 30 Hz: image [H, W, 3] → SuperPoint → keypoints [N, 2] + descriptors [N, 256]. Between frames: ~7 IMU samples at 200 Hz, each [accel(3), gyro(3)] → preintegrated into ΔR, Δv, Δp. Both feed into a sliding-window optimizer (typically 10–15 frames). The optimizer jointly minimizes: (1) visual reprojection error from keypoint matches, and (2) IMU preintegration residuals. State per frame: [p(3), v(3), q(4), b_a(3), b_g(3)] = 16 dims. Total optimization variables for a 10-frame window: ~160 state dims + map points.

The phrase "visual reprojection error" is the heart of every VIO back-end — classical or learned. Take a known 3D landmark, run it through the camera, and you get a predicted pixel. Compare that to where the feature was actually observed. The gap is the reprojection residual, and the optimizer's whole job is to wiggle the poses and landmarks until those gaps shrink. Let's build it.

π(P_c) = [ f_x·X/Z + c_x , f_y·Y/Z + c_y ], where P_c = (X, Y, Z) is the landmark in camera coordinates

r_i = u_obs,i − π(P_c,i)
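The projection and residual above can be sketched in a few lines of NumPy. The intrinsics, the five landmarks, and the pose below are made-up illustration values, and the world-to-camera convention P_c = R·P_w + t is an assumption of this sketch.

```python
import numpy as np

def project(P_c, fx, fy, cx, cy):
    """Pinhole projection of camera-frame points [N, 3] to pixels [N, 2]."""
    X, Y, Z = P_c[:, 0], P_c[:, 1], P_c[:, 2]
    return np.stack([fx * X / Z + cx, fy * Y / Z + cy], axis=1)

def reprojection_residuals(P_w, R_cw, t_cw, u_obs, K):
    """r_i = u_obs,i - pi(P_c,i) with P_c = R_cw @ P_w + t_cw."""
    P_c = P_w @ R_cw.T + t_cw
    return u_obs - project(P_c, *K)

# Hypothetical setup: 5 landmarks, identity rotation, small translation.
K = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (assumed values)
P_w = np.array([[-1.0, -0.5, 4.0], [1.0, -0.5, 5.0], [0.0, 0.5, 6.0],
                [-0.5, 1.0, 7.0], [1.5, 0.8, 8.0]])
R = np.eye(3)
t_true = np.array([0.1, -0.05, 0.2])
u_obs = project(P_w @ R.T + t_true, *K)   # noise-free observations

r_true = reprojection_residuals(P_w, R, t_true, u_obs, K)
r_bad = reprojection_residuals(P_w, R, np.zeros(3), u_obs, K)
print(np.linalg.norm(r_true), np.linalg.norm(r_bad))
```

At the true pose the residuals vanish; at a wrong pose they do not, and that gap is the signal the optimizer uses.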

Check: What is the simplest way to add learning to a classical VIO system?

Replace the visual feature front-end with a learned detector and matcher
Retrain the entire system from scratch
Remove the IMU entirely

Chapter 2: Learned Inertial Models

Classical VIO models IMU errors with simple parametric models: constant bias + white noise. In reality, IMU errors are complex — temperature-dependent, orientation-dependent, with correlated noise. A neural network can learn these complex error patterns from data.

IMU data flow, concretely: An IMU produces two streams: accelerometer [3] (m/s², measuring linear acceleration + gravity) and gyroscope [3] (rad/s, measuring angular velocity). Typical rate: 200 Hz. Visual frames arrive at ~30 Hz. That means ~7 IMU measurements between each pair of visual frames. Preintegration accumulates these 7 measurements into a single rotation/velocity/position delta — without needing to know the absolute pose yet. When the pose is later updated by the visual system, preintegrated measurements are "applied" directly. No re-integration needed.

Neural inertial models take raw IMU data and output corrected measurements, or directly predict the noise parameters. Some approaches train a network to predict the residual between the simple model and reality, allowing the classical pipeline to benefit from learned corrections without replacing the physics.

The residual approach is elegant and safe. The classical model predicts: a_corrected = a_raw − b (constant bias). The neural model predicts: a_corrected = a_raw − b − f_θ(...). The network f_θ only needs to learn the difference between the simple model and reality. If the network fails or outputs garbage, the system falls back to classical bias-only correction. This graceful degradation is critical for safety-critical applications.

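a_corrected = a_raw − b − f_θ(a_raw, ω_raw, T, t), where b is the classical constant bias and f_θ is a learned correction that takes the raw accelerometer and gyroscope readings, temperature T, and time t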
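The graceful-degradation idea can be sketched as a small guard around the learned term. Everything here is hypothetical: the function name `correct_accel`, the `max_residual` threshold, and the assumption that the network output arrives as a plain 3-vector.

```python
import numpy as np

def correct_accel(a_raw, bias, learned_residual=None, max_residual=0.5):
    """Bias-corrected acceleration with an optional learned correction.

    Falls back to classical bias-only correction when the learned residual
    is missing, non-finite, or implausibly large (threshold is an assumed
    tuning parameter in m/s^2)."""
    a_classical = np.asarray(a_raw, dtype=float) - np.asarray(bias, dtype=float)
    if learned_residual is None:
        return a_classical
    r = np.asarray(learned_residual, dtype=float)
    if not np.all(np.isfinite(r)) or np.linalg.norm(r) > max_residual:
        return a_classical          # network output rejected
    return a_classical - r

a_raw = np.array([0.30, -0.10, 9.85])
bias = np.array([0.02, -0.01, 0.03])
print(correct_accel(a_raw, bias, learned_residual=[0.01, 0.0, -0.02]))
print(correct_accel(a_raw, bias, learned_residual=[np.nan, 0.0, 0.0]))
```

Because the network only adds a bounded correction on top of the classical model, a bad prediction costs at most the learned improvement, never the whole estimate.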

Why learn IMU models? Consumer-grade IMUs (in phones, drones) have complex, non-stationary error characteristics. A neural network trained on data from that specific hardware can model errors that no simple bias+noise model captures. The bias drifts slowly over minutes — a constant-bias model sees this as signal, a learned model recognizes the drift pattern.

To see why this matters, consider the numbers: a typical MEMS accelerometer has bias instability of ~0.04 mg and velocity random walk of ~0.08 m/s/√hr. But the actual errors include vibration coupling, cross-axis sensitivity, and temperature coefficients of ~0.3 mg/°C. When your phone warms up by 10°C during use, that's 3 mg (≈0.029 m/s²) of unmodeled bias, which double-integrates to ½·0.029·10² ≈ 1.5 m of position error over 10 seconds. A neural correction network, trained on IMU data paired with ground-truth poses, learns to subtract these complex, environment-dependent errors before they compound.

Why does an uncorrected bias hurt so much? Because the IMU is integrated twice: acceleration → velocity → position. A constant offset in acceleration becomes a linearly-growing velocity error, which becomes a quadratically-growing position error. The only way to feel this is to integrate it yourself — that is exactly what preintegration does for every IMU sample between two camera frames.
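A few lines of Python are enough to feel the double integration. This sketch integrates a constant bias sample by sample at an assumed 200 Hz rate; the bias value of 0.1 m/s² is just an illustrative number at the high end of consumer-grade accelerometers.

```python
def integrate_bias(bias, duration_s, rate_hz=200):
    """Integrate a constant acceleration error twice, one IMU sample at a time.
    Returns (velocity_error, position_error)."""
    dt = 1.0 / rate_hz
    v, p = 0.0, 0.0
    for _ in range(int(round(duration_s * rate_hz))):
        p += v * dt + 0.5 * bias * dt * dt   # position uses the old velocity
        v += bias * dt                       # velocity error grows linearly
    return v, p

for t in (1.0, 2.0, 5.0, 10.0):
    v_err, p_err = integrate_bias(0.1, t)
    print(f"t={t:4.1f}s  velocity error={v_err:.3f} m/s  position error={p_err:.3f} m")
```

Doubling the duration doubles the velocity error but quadruples the position error, which matches the closed form ½·b·t².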

Classical vs Learned IMU Model

The teal line is the true acceleration. Red = classical correction (constant bias only). Green = neural correction (adapts to complex patterns).


Check: What limitation of classical IMU models do neural networks address?

Classical models run too slowly
Classical models only work with one type of IMU
Classical constant bias + noise models can't capture complex, non-stationary errors

Chapter 3: Deep VIO

Going further: replace the entire VIO pipeline with a neural network that takes raw images and IMU data and outputs poses. DeepVIO and similar systems use CNN encoders for images, LSTM/GRU networks for IMU sequences, and learn to fuse them end-to-end.

The appeal is simplicity: no feature extraction, no descriptor matching, no optimization. Just images + IMU in, poses out. But the challenge is generalization — these systems tend to overfit to their training environment and struggle in new settings.

Why does overfitting happen here specifically? A pure end-to-end network must learn both visual representation and geometric reasoning from the same training data. The geometric reasoning (epipolar geometry, perspective projection) is universal — it holds in all environments. But the network has no way to separate this from environment-specific visual patterns. It memorizes "in this office, this visual pattern corresponds to a 10cm forward motion" rather than learning the general principle of triangulation. The best current approach: end-to-end learning with geometric priors baked in (e.g., differentiable BA as in DROID-SLAM extended with IMU). The geometric priors force the network to learn representations that are useful for geometry, not just for memorizing the training set.

Tightly-coupled vs loosely-coupled: Loosely-coupled systems run visual odometry and IMU integration separately, then fuse the two pose estimates (e.g., with an EKF). Simple but suboptimal — the visual system doesn't know about IMU constraints and vice versa. Tightly-coupled systems (OKVIS, VINS-Mono) put visual and IMU measurements into the same optimization. IMU preintegration factors and visual reprojection factors share the same graph, jointly constraining all state variables. Tightly-coupled is harder to implement but consistently more accurate, especially under fast motion where IMU measurements prevent visual tracking loss.

Image Encoder

CNN/ViT extracts visual features from each frame

↓

IMU Encoder

LSTM/GRU processes ~7 accel [3] + gyro [3] samples between frames at 200 Hz

↓

Fusion + Pose Regression

Cross-attention or concatenation → MLP → relative pose [Δp, Δq]

Learned VIO (DPVO/DROID with IMU): The frontier is fusing DROID-SLAM's dense optical flow with inertial measurements. DPVO (2023) learns visual patch tracking with iterative updates (similar to DROID but sparser). Adding IMU: the preintegrated IMU factor constrains the relative pose between frames, while the dense flow constrains depth and fine pose. The IMU provides absolute scale (monocular vision alone cannot) and prevents drift during visual tracking failures. This is tightly-coupled and learned — the best of all worlds, though at 4–8 GB GPU memory cost.

How does IMU fusion work concretely in DPVO? The visual system predicts per-patch flow and depth updates. Between frames, the IMU preintegration provides a strong prior on the relative rotation and a weaker prior on translation (because acceleration is double-integrated, making it noisy). In the joint optimization:

Visual Cost

Reprojection error from dense flow: "predicted 2D position of 3D point" vs "observed 2D position." Constrains depth + pose.

IMU Cost

Preintegration residual: "IMU-predicted relative motion" vs "optimized relative motion." Constrains rotation (strongly) + velocity + position.

↓

Joint Solution

Gauss-Newton or LM solver minimizes both costs simultaneously. IMU terms regularize during visual degeneracies.

Let's open up that "Gauss-Newton solver" box and see the one step that actually corrects the pose. The setup is the bundle-adjustment core: we have a camera at an unknown translation, the same 5 landmarks and pixels you reprojected in Chapter 1, and a wrong initial guess for where the camera is. Gauss-Newton turns the reprojection residuals into a pose correction.

H δ = J^T r, with H = J^T J

t ← t + δ

Here r stacks all 10 reprojection residuals (2 per feature), J is how each pixel moves when the camera translates, H = J^TJ is the Gauss-Newton approximation to the Hessian, and δ is the pose update that best cancels the residual. One solve, one step.
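The single Gauss-Newton step can be written out directly. This is a sketch under simplifying assumptions: the rotation is known (identity), only the translation t is estimated, P_c = P_w + t, and the landmarks and intrinsics are made-up values. J is the derivative of the projected pixel with respect to t, so with r = u_obs − π the update has a plus sign.

```python
import numpy as np

def project(P_c, f, c):
    """Pinhole projection; f = (fx, fy), c = (cx, cy) as arrays."""
    return f * P_c[:, :2] / P_c[:, 2:3] + c

def gauss_newton_step(t, P_w, u_obs, f, c):
    """One step: build r and J, solve H delta = J^T r, return t + delta."""
    P_c = P_w + t
    r = (u_obs - project(P_c, f, c)).reshape(-1)   # [u0, v0, u1, v1, ...]
    X, Y, Z = P_c.T
    J = np.zeros((2 * len(P_w), 3))                # d(pixel)/d(translation)
    J[0::2, 0] = f[0] / Z
    J[0::2, 2] = -f[0] * X / Z**2
    J[1::2, 1] = f[1] / Z
    J[1::2, 2] = -f[1] * Y / Z**2
    H = J.T @ J                                    # Gauss-Newton Hessian
    delta = np.linalg.solve(H, J.T @ r)
    return t + delta, np.linalg.norm(r)

f = np.array([500.0, 500.0])
c = np.array([320.0, 240.0])
P_w = np.array([[-1.0, -0.5, 4.0], [1.0, -0.5, 5.0], [0.0, 0.5, 6.0],
                [-0.5, 1.0, 7.0], [1.5, 0.8, 8.0]])
t_true = np.array([0.1, -0.05, 0.2])
u_obs = project(P_w + t_true, f, c)

t = np.zeros(3)                                    # wrong initial guess
for i in range(5):
    t, err = gauss_newton_step(t, P_w, u_obs, f, c)
    print(i, err, t)
```

With noise-free observations the residual norm should collapse within a few iterations; with real data you would add a robust loss and stop on a tolerance rather than a fixed count.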

End-to-End vs Hybrid Architecture

Compare trajectory accuracy: Red = pure end-to-end (overfits). Orange = classical. Green = hybrid (learned features + geometric optimization).

The generalization gap: End-to-end systems shine in their training domain but degrade in new environments. Hybrid systems that preserve geometric structure (epipolar constraints, BA) generalize much better. This is the central tension of deep VIO.

Check: What is the main challenge of end-to-end deep VIO systems?

They require too many cameras
They don't use IMU data
They tend to overfit and struggle to generalize to new environments

🔨 Derivation: Sliding Window Marginalization (bounding compute in VIO)

A sliding-window VIO system maintains poses x₁...x_N. When a new frame x_{N+1} arrives, the oldest frame x₁ must be removed. The full Hessian has a factor between x₁ and x₂ (IMU), and factors between x₁ and some landmarks.

Your task: Show that marginalizing x₁ creates a dense prior factor on the variables that x₁ was connected to. Why does this prior make the problem "fill in" over time? What's the computational implication?

Before marginalization, the Hessian is sparse: x₁ connects to x₂ (IMU factor, band-diagonal entry) and to landmarks ℓ_a, ℓ_b (visual factors). Most entries are zero because x₁ doesn't directly connect to x₅ or x₈.

Marginalizing x₁ produces: Λ^* = Λ_rr − Λ_r1Λ₁₁⁻¹Λ_1r, where Λ_rr is the block over the remaining variables. The term Λ_r1Λ₁₁⁻¹Λ_1r is the outer product of Λ_r1 with itself (scaled by Λ₁₁⁻¹). It creates non-zero entries between ALL variables that were connected to x₁: x₂, ℓ_a, ℓ_b now all share a dense factor.

After marginalizing x₁, x₂ is now connected (via the marginalization prior) to landmarks that x₁ saw. When x₂ is later marginalized, it spreads these connections further. Over time, the marginalization prior becomes a dense factor connecting ALL remaining variables. This "fill-in" destroys the sparsity that makes BA fast.

Full derivation:

1. Before marginalizing x₁, the Hessian has non-zeros at: (1,2), (1,ℓ_a), (1,ℓ_b).

2. Schur complement: Λ^* = Λ_rr − Λ_r1Λ₁₁⁻¹Λ_1r.

3. The outer product Λ_r1Λ₁₁⁻¹Λ_1r creates entries at: (2, ℓ_a), (2, ℓ_b), (ℓ_a, ℓ_b). These are NEW non-zero entries that didn't exist before. This is "fill-in."

4. After N marginalizations, the prior factor connects all surviving variables that share ANY historical ancestor. The prior's Hessian block is dense: O(k²) where k is the number of variables touched by the prior.

5. Computational cost: Without fill-in, the normal equations solve in roughly O(N) time (sparse Cholesky on a band-structured system). With a dense prior over k variables, factoring that block alone costs O(k³), and k grows with each marginalization until the prior touches every variable left in the window. Practical systems limit fill-in by: (a) marginalizing landmarks first (they don't connect to each other), (b) approximating the prior (dropping small off-diagonal terms), or (c) using partial marginalization (only remove the oldest pose, keep its landmarks for the next frame to also observe).
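The fill-in from steps 1 to 3 is easy to reproduce numerically. The toy problem below uses scalar variables and unit-weight "difference" factors, which is an illustrative simplification (real poses are multi-dimensional blocks); the variable names and the weak prior on x1 are assumptions of this sketch.

```python
import numpy as np

names = ["x1", "x2", "la", "lb", "x3"]
idx = {n: i for i, n in enumerate(names)}
Lam = np.zeros((5, 5))

def add_factor(Lam, i, j, w=1.0):
    """Add a scalar factor (x_i - x_j) to the information matrix in place."""
    J = np.zeros((1, Lam.shape[0]))
    J[0, i], J[0, j] = 1.0, -1.0
    Lam += w * (J.T @ J)

# x1-x2 (IMU), x1-la and x1-lb (visual), x2-x3 (IMU)
for a, b in [("x1", "x2"), ("x1", "la"), ("x1", "lb"), ("x2", "x3")]:
    add_factor(Lam, idx[a], idx[b])
Lam[0, 0] += 1.0  # weak prior on x1 so that its block is well conditioned

def marginalize_first(Lam):
    """Schur complement: Lam_rr - Lam_r1 Lam_11^-1 Lam_1r."""
    L11, L1r = Lam[:1, :1], Lam[:1, 1:]
    Lr1, Lrr = Lam[1:, :1], Lam[1:, 1:]
    return Lrr - Lr1 @ np.linalg.inv(L11) @ L1r

Lam_star = marginalize_first(Lam)   # rows/cols now: x2, la, lb, x3
print("before:\n", Lam[1:, 1:])
print("after:\n", Lam_star)
```

Before marginalization the (x2, la), (x2, lb), and (la, lb) entries are zero; afterwards they are not, while x3 (never connected to x1) picks up no new links to the landmarks.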

The key insight: Marginalization is information-preserving but destroys sparsity. This is the fundamental tension of sliding-window VIO: you MUST remove old states (bounded memory), but removing them FILLS IN the remaining structure (unbounded connections). Every practical VIO system manages this tension differently. OKVIS approximates the prior. VINS-Mono carefully orders marginalizations. Basalt uses non-linear factor recovery to maintain sparsity.

🔗 Pattern Recognition

Classical VIO (Filter) vs Modern VIO (Optimization)

Classical VIO (MSCKF)

EKF-based. Process measurements once. Can never re-visit past decisions. O(N²) in state size per update. Marginalize features via null-space projection.

Modern VIO (Optimization)

Factor graph + iterative optimization. Re-linearize at each iteration. Can recover from bad initial estimates. O(N) per iteration with sparse structure. Marginalize via Schur complement.

Both solve the same problem (estimate pose from visual+inertial data) but with different computational tradeoffs. The filter linearizes ONCE and processes data sequentially: fast but suboptimal. The optimizer re-linearizes and iterates: slower, but it converges toward the MAP estimate for the window (a local optimum, given a reasonable initial guess). The filter can never correct a past mistake; the optimizer can (within the window). Modern systems use optimization because compute is now cheap enough, and the accuracy gain justifies the cost.

Under what conditions would the filter's accuracy actually match the optimizer's? (Hint: think about when re-linearization provides no benefit.)

Chapter 4: Transformer-Based Odometry

Transformers have revolutionized NLP and vision — and now they're coming to odometry. The self-attention mechanism is a natural fit for VIO: it can model temporal dependencies across frames, cross-modal relationships between visual and inertial data, and spatial context within each frame.

Systems like AirVO and transformer-based extensions of DROID-SLAM use attention to aggregate features over time windows, weight relevant past observations, and fuse multi-modal inputs. The key advantage over RNNs: attention can look at any time step directly, without information bottlenecks.

The practical tradeoff: transformers' O(N²) memory means you can attend over ~50-100 frames before hitting GPU limits. For a 30 Hz system, that's roughly 1.7–3.3 seconds of history. LSTMs can theoretically remember further back (O(1) memory per step) but in practice lose information after ~20 frames due to vanishing gradients. The sweet spot for VIO: a sliding window transformer that attends to the most recent ~50 frames with full attention, plus a memory bank of compressed representations from older keyframes.

One subtlety that matters for real-time VIO: the self-attention mechanism must be causal — frame t can only attend to frames 0 through t, never to future frames. This is non-negotiable for online systems. But during training, you can optionally use bidirectional attention (looking forward and backward) to learn better representations, then mask future frames at inference. This train-bidirectional-infer-causal pattern gives a small but consistent accuracy boost (~5% lower RPE).

The memory bank for older keyframes works like this: when a frame exits the sliding window, compress its representation from, say, 768 dims to 64 dims via a learned projection. Store these compressed tokens. The attention mechanism can still attend to them but at much lower memory cost. This gives the system "long-term memory" for loop closure detection — recognizing that you've returned to a previously visited location, even minutes later.

Attention(Q, K, V) = softmax(QK^T / √d_k) V
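The formula, together with the causal constraint needed for online VIO, fits in a short NumPy sketch. This is single-head attention with no learned projections; the sequence length and feature size are arbitrary illustration values.

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a causal mask:
    frame t may only attend to frames 0..t."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 6, 8                      # 6 frames, 8-dim features (assumed sizes)
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
out, w = causal_attention(Q, K, V)
print(np.round(w, 2))            # lower-triangular attention weights
```

Each row of the weight matrix sums to one and has zeros to the right of the diagonal, so the first frame can only attend to itself.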

Why transformers for VIO? In a VIO sequence, some past frames are highly informative (e.g., loop closures, revisited areas) and others are redundant. Attention lets the network dynamically weight past information — unlike LSTMs which compress everything through a fixed-size hidden state.

Cross-modal attention in practice: Visual features [T, D_v] and IMU features [T×7, D_i] (7 IMU samples per visual frame) are projected into a shared embedding space. Cross-attention: IMU queries attend to visual keys, and vice versa. This lets the network learn, for example, that during fast rotation (high gyro readings), it should weight visual features from frames with small motion blur. The attention weights become interpretable: you can see the network "looking back" to informative frames and ignoring redundant ones.

Attention Over Time

Each column is a time step. Brightness shows attention weight — which past frames the current step attends to. Notice: attention is high for informative frames, not just recent ones.


Architecture	Temporal Model	Memory	Long-Range?
LSTM/GRU	Recurrent	O(1) per step	Poor (vanishing gradients)
1D CNN	Convolutional	O(window)	Limited by kernel size
Transformer	Self-attention	O(N²)	Excellent (direct access)

Check: What advantage do transformers have over LSTMs for VIO?

Direct access to any past time step, no information bottleneck
They always use less memory
They don't need training data

Chapter 5: Foundation Models for Ego-Motion

The latest frontier: use large pretrained vision models (DINOv2, SAM, Depth Anything) as the visual backbone for VIO. These models have been trained on tens to hundreds of millions of images and have learned rich representations of geometry, semantics, and spatial relationships.

What makes a "foundation model" different from a VIO-specific encoder? Scale and diversity. A VIO encoder might be trained on 500K frames from 50 indoor sequences. DINOv2 was trained on 142M curated images spanning a very wide range of visual domains. Its features encode visual concepts that no VIO dataset could teach, and many of these concepts (depth from perspective, texture vs geometry, lighting invariance) are exactly what VIO needs.

Depth Anything deserves special mention for VIO. It predicts dense depth [H, W] from a single image at ~30ms on GPU. For monocular VIO, this is transformative: depth predictions provide the scale prior that monocular vision alone cannot recover. Without depth (or IMU), monocular VIO cannot distinguish "small room, small motion" from "large room, large motion." With a depth prior, even an affine-invariant one, the system can anchor the scale. The remaining alignment step (finding the correct scale α and shift β) is a cheap least-squares solve on a handful of overlapping points.
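The scale-and-shift alignment is an ordinary linear least-squares problem. The sketch below assumes you already have a handful of pixels where both the network prediction and a metric reference (for example, triangulated VIO landmarks) are available; whether the fit is done in depth or inverse depth depends on what the model outputs, and the values here are synthetic.

```python
import numpy as np

def align_scale_shift(d_pred, d_metric):
    """Solve min over (alpha, beta) of || alpha * d_pred + beta - d_metric ||^2."""
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(A, d_metric, rcond=None)
    return alpha, beta

# Synthetic example: 8 overlapping points, true alpha=2.5, beta=0.3, mild noise.
rng = np.random.default_rng(1)
d_pred = rng.uniform(0.2, 1.0, size=8)          # affine-invariant prediction
d_metric = 2.5 * d_pred + 0.3 + rng.normal(scale=1e-3, size=8)

alpha, beta = align_scale_shift(d_pred, d_metric)
d_aligned = alpha * d_pred + beta
print(alpha, beta)
```

In practice you would wrap this in an outlier-robust loop, since a few bad landmark depths can skew a plain least-squares fit.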

Instead of training a VIO-specific visual encoder from scratch, you freeze a foundation model and train only a lightweight adapter on top. The foundation model provides features that are robust to domain shift — they've "seen everything" during pretraining. This is the emerging path to VIO systems that work across all environments.

The paradigm shift: Earlier learned VIO systems train on specific datasets (EuRoC, TUM). Foundation model-based VIO inherits knowledge from internet-scale visual pretraining. The gap between "training domain" and "deployment domain" shrinks dramatically.

Why freeze the backbone? DINOv2-Large has ~300M parameters trained on 142M images. Your VIO dataset might have 100K frames. Fine-tuning 300M params on 100K frames = catastrophic overfitting. Instead: freeze DINOv2, extract features [H/14, W/14, 1024] from its intermediate layers, then train a small adapter (2–5M params) that maps these to flow, depth, or match predictions. The adapter learns VIO-specific outputs; the backbone provides domain-general visual understanding. Cost: ~4 GB GPU for DINOv2 inference + ~0.5 GB for the adapter. Benefit: features that work in offices, forests, underwater, and at night — because DINOv2 has seen them all.

Feature Quality: Trained-from-Scratch vs Foundation Model

Test accuracy across environments. Red = VIO-specific features. Green = foundation model features. Notice how the green line stays high across all domains.

Foundation Model (frozen)

DINOv2 / Depth Anything: pretrained on web-scale image collections (142M images for DINOv2). Rich, general features.

↓

Lightweight Adapter

Small trainable head: maps foundation features to VIO-relevant outputs (flow, depth, matches).

↓

Geometric Back-End

Classical or differentiable BA: exploits rich features for precise pose estimation.

Check: Why are foundation models beneficial for VIO?

They provide domain-general visual features learned from web-scale pretraining, reducing domain shift
They eliminate the need for an IMU
They are faster than classical features

Chapter 6: Robustness & Generalization

The central challenge of learned VIO: it must work in environments never seen during training. Domain shift — the difference between training and deployment data — is the biggest enemy. A system trained on indoor offices may fail in a forest. A model trained in California may break in snow.

Strategies to improve generalization include: domain randomization (train on synthetic data with random variations), self-supervised learning (learn from structure in unlabeled data), test-time adaptation (fine-tune on the fly in the new environment), and uncertainty estimation (know when you don't know).

Test-time adaptation deserves special attention for VIO. The idea: as the system runs in a new environment, it continues to update its own weights using self-supervised losses (photometric consistency between frames, IMU integration consistency). No ground truth needed. The visual encoder slowly adapts to the lighting, texture, and geometry of the current scene. The risk: adapt too aggressively and the model becomes overspecialized to the current frame; too conservatively and it doesn't adapt at all. Practical systems use very small learning rates (1e-5 to 1e-6) and only update the adapter layers, not the backbone.

Uncertainty estimation is the safety net. When the learned VIO system encounters something truly out-of-distribution (a mirror, a strobe light, heavy fog), it should know it doesn't know and fall back to IMU-only propagation or reduce trust in the visual estimate. Monte Carlo dropout or ensemble disagreement can estimate this uncertainty at ~2x computational cost — expensive but essential for safety-critical applications like autonomous driving.
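One way to picture the fallback logic is a gate on ensemble disagreement. This is a deliberately simplified sketch: the function name, the 5 cm threshold, and the use of the per-axis standard deviation of translation predictions as the uncertainty measure are all assumptions for illustration.

```python
import numpy as np

def fuse_or_fallback(ensemble_translations, imu_translation, max_std=0.05):
    """Use the ensemble mean when members agree; otherwise fall back to the
    IMU-propagated translation. Returns (translation, spread, source)."""
    preds = np.asarray(ensemble_translations, dtype=float)   # [M, 3]
    spread = preds.std(axis=0).max()                         # worst-axis std (m)
    if spread > max_std:
        return np.asarray(imu_translation, dtype=float), spread, "imu_only"
    return preds.mean(axis=0), spread, "visual"

imu = np.array([0.10, 0.00, 0.01])
agree = [[0.101, 0.001, 0.010], [0.099, -0.001, 0.011], [0.100, 0.000, 0.009]]
disagree = [[0.10, 0.00, 0.01], [0.45, 0.20, -0.10], [-0.20, 0.05, 0.30]]

print(fuse_or_fallback(agree, imu))
print(fuse_or_fallback(disagree, imu))
```

A production system would feed the spread into the optimizer as a measurement covariance rather than making a hard switch, but the gating idea is the same.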

The robustness hierarchy: (1) Augmentation and diverse training data. (2) Architectural inductive biases (geometric constraints). (3) Test-time adaptation. (4) Graceful degradation with uncertainty awareness. The most robust systems use all four layers.

What Breaks VIO Systems

Even tightly-coupled learned VIO has specific failure modes worth understanding concretely:

Failure Mode	Why It Happens	Mitigation
IMU initialization	Estimating gravity direction requires motion. A stationary device cannot observe gravity from IMU alone (it's confounded with accelerometer bias). Systems need 2–5 seconds of varied motion to initialize.	Static initialization with known gravity, or delayed VIO start
Degenerate motion	Pure rotation (looking around without translating) → IMU cannot observe scale. The system has no way to distinguish "small scene, small motion" from "large scene, large motion."	Detect degenerate motion and reduce trust in scale estimate
Temperature drift	Consumer gyroscope biases drift ~0.01°/s per °C (accelerometer biases drift with temperature too). A phone going from a cold car to a warm building sees its biases shift mid-sequence. The classical constant-bias model lags behind.	Learned temperature-dependent bias models, online bias recalibration
Vibration	Motors, engines, or even walking create high-frequency vibration that saturates the accelerometer. The IMU reads 4g when actual motion is 0.01g. Low-pass filtering helps but adds latency.	Learned vibration rejection, frequency-domain filtering, vibration-aware loss functions
Magnetic interference	Some VIO systems use magnetometer for heading. Near metal structures or electronics, the magnetic field is distorted. Heading estimates jump by 30–90°.	Magnetometer-free VIO (most modern systems), or learned distortion models

The most insidious failure mode is IMU initialization. To estimate gravity direction, the system needs to observe how the accelerometer reading changes with motion. A stationary device measures gravity plus accelerometer bias — but it cannot tell which is which without moving. Most systems require 2–5 seconds of "excitation" (walking, moving the device) to initialize. During this window, the VIO output is unreliable. Some systems explicitly detect and warn about this state; others silently output bad poses.

Robustness Under Domain Shift

Drag the domain shift slider. Watch how different strategies maintain performance. Red = naive. Orange = augmented. Green = full robustness stack.


Strategy	When Applied	Effect
Domain randomization	Training	Exposes model to wide variations synthetically
Self-supervised pretrain	Training	Learns visual structure without labels
Geometric constraints	Architecture	Bakes in physics that holds across domains
Test-time adaptation	Deployment	Fine-tunes online to new environment
Uncertainty estimation	Deployment	Flags unreliable predictions

Check: What is the most fundamental challenge of learned VIO?

Generalizing to environments not seen during training (domain shift)
The high cost of GPUs
The need for stereo cameras

💥 Break-It Lab: What Dies in Modern VIO When You Break Components?

A learned VIO system runs with sliding-window optimization, learned features, and IMU bias estimation. Toggle components off to see the trajectory degrade.

Remove Marginalization (unbounded window)

Failure mode: Without marginalization, the optimization window grows linearly with time. After 30 seconds: 900 frames × 16 state dims = 14,400 variables. The Hessian is 14400×14400. Each Gauss-Newton iteration takes seconds instead of milliseconds. The system falls behind real-time, accumulating a latency backlog. Eventually: OOM crash. This is why EVERY real-time VIO system marginalizes — it's not optional, it's structural.

Wrong Feature Tracking (50% outlier matches)

Failure mode: With 50% outlier feature matches, half the reprojection residuals point in wrong directions. If RANSAC is disabled, the optimizer tries to satisfy contradictory constraints, and the pose estimate oscillates wildly between iterations without converging. Even with RANSAC, a 50% outlier rate is costly: the chance that a random minimal sample contains only inliers falls as w^s (inlier ratio w, sample size s), so the number of iterations needed for a high-confidence solution grows rapidly, and robust estimators such as least median of squares reach their breakdown point at 50% outliers. The result: intermittent tracking failures and sudden jumps.
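The cost of a high outlier rate can be quantified with the standard RANSAC iteration formula N = log(1 − p) / log(1 − w^s). The sketch below assumes a 5-point minimal solver (s = 5) and a 99% success target; both are illustrative choices.

```python
import math

def ransac_iterations(inlier_ratio, sample_size, success_prob=0.99):
    """Iterations needed so that, with probability success_prob, at least
    one minimal sample contains only inliers."""
    p_good = inlier_ratio ** sample_size       # chance one sample is clean
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - p_good))

for w in (0.9, 0.8, 0.5, 0.3):
    print(f"inlier ratio {w:.1f}: {ransac_iterations(w, 5)} iterations")
```

Going from 80% to 50% inliers raises the iteration count by roughly an order of magnitude, which is where a real-time budget starts to hurt.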

Disable IMU Bias Estimation

Failure mode: IMU biases are real (~0.01-0.1 m/s² for accelerometer). Without estimating them, the preintegrated measurements are systematically wrong. The system interprets bias as real acceleration: it "thinks" it's constantly accelerating even when stationary. Position drifts quadratically with time (½·b·t²): at 0.1 m/s² bias, ~5 cm after the first second and ~5 m after 10 seconds. The optimizer cannot fix this because the IMU constraints overwhelm the visual corrections (IMU runs at 200 Hz, vision at 30 Hz).

🔨 Derivation: Visual Feature Triangulation (from 2D pixels to 3D position)

A 3D point P is observed in two frames with known poses T₁ = [R₁|t₁] and T₂ = [R₂|t₂]. The normalized pixel observations are x₁ and x₂. The point lies along the ray: P = T_i⁻¹(d_i · x_i) for some depth d_i.

Your task: Set up the linear triangulation problem using the DLT (Direct Linear Transform) and show how to solve for P using SVD. When does triangulation fail?

For each observation: the measured ray x_i = [u_i, v_i, 1]^T and the projected point T_i P should be parallel, so their cross product is zero: x_i × (T_i P) = 0. Writing the rows of the projection matrix as T_i = [t_i1^T; t_i2^T; t_i3^T]: u_i(t_i3^TP) − (t_i1^TP) = 0 and v_i(t_i3^TP) − (t_i2^TP) = 0.

Each view gives 2 equations linear in P (in homogeneous coordinates). Two views give a 4×4 system A P = 0 where P = [X, Y, Z, 1]^T. Solve via SVD: P is the last column of V (right singular vector corresponding to the smallest singular value).

Triangulation fails when: (1) The baseline is zero (same camera position, no parallax → depth is unobservable). (2) The point is at infinity (all rays are parallel). (3) The rays are nearly parallel (small baseline relative to depth → large depth uncertainty). The condition number of A indicates quality: ill-conditioned = unreliable triangulation.

Full derivation:

1. Projection: λ_i [u_i; v_i; 1] = T_i [X; Y; Z; 1] where T_i is 3×4.

2. Eliminate λ_i via cross product: x_i × (T_i P) = 0.

3. This gives 2 independent equations per view (the third is linearly dependent):

u_i t_i3^TP − t_i1^TP = 0

v_i t_i3^TP − t_i2^TP = 0

4. Stack: A = [u₁t₁₃^T − t₁₁^T; v₁t₁₃^T − t₁₂^T; u₂t₂₃^T − t₂₁^T; v₂t₂₃^T − t₂₂^T]. This is 4×4.

5. Solve A P = 0: SVD of A = UΣV^T. P = last column of V (dehomogenize: divide by last element).

The key insight: Triangulation accuracy depends on the baseline/depth ratio. To first order, the depth error per pixel of measurement error is Z²/(f·b), so it grows quadratically with depth Z and shrinks with baseline b. At 10m depth with a 10cm baseline, the parallax angle is ~0.6°, and a 1-pixel measurement error translates to ~1m depth error (assuming a focal length of about 1000 px). At 1m depth with the same baseline, the parallax is ~6° and depth error is ~1cm. This is why VIO with a single camera (small "baseline" between frames) struggles with distant features and why stereo cameras (fixed baseline) give much better depth at close range.
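These numbers can be reproduced with the first-order stereo relation Z = f·b/d, which gives |δZ| ≈ Z²/(f·b)·|δd| for a disparity error δd. The sketch below assumes a focal length of 1000 px, which is an assumption chosen to match the figures in the text; with a different focal length the depth errors scale accordingly.

```python
import math

def parallax_deg(baseline_m, depth_m):
    """Angle subtended by the baseline as seen from the 3D point."""
    return math.degrees(math.atan2(baseline_m, depth_m))

def depth_error_m(depth_m, baseline_m, focal_px, pixel_err=1.0):
    """First-order depth error from Z = f*b/d: |dZ| = Z^2 / (f*b) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * pixel_err

FOCAL_PX = 1000.0  # assumed focal length, not specified in the text
for depth in (1.0, 10.0):
    print(f"depth {depth:4.1f} m: parallax {parallax_deg(0.1, depth):.2f} deg, "
          f"depth error {depth_error_m(depth, 0.1, FOCAL_PX):.2f} m per pixel")
```

A 10x increase in depth costs a 100x increase in depth error for the same baseline, which is the practical reason distant features contribute mostly to rotation and very little to translation or scale.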

⚔ Adversarial: The unobservable bias

Your VIO system runs for 60 seconds with the device placed flat on a table (zero motion). You examine the estimated accelerometer bias and find it has converged to b_a = [0.02, -0.01, 9.81] m/s². The z-component is clearly wrong (it "absorbed" gravity). But the system reports low uncertainty on this estimate!

The accelerometer is broken

Without motion, gravity direction and accelerometer z-bias are unobservable (confounded): the system cannot tell them apart

The gyroscope noise is too high

The optimization has diverged

Checkpoint — Before you move on

You're choosing between three architectures for a drone VIO system: (A) Pure end-to-end learning (image+IMU -> pose), (B) Learned features + classical BA, (C) Classical features + learned depth prior. The drone has a Jetson Orin (8GB GPU) and must fly outdoors in varied weather. Which do you pick and why? What's the failure mode of each?


Model Answer

Best choice: (B) Learned features + classical BA. Here's why:

(A) Pure end-to-end: Will overfit to training environments. Outdoor weather variation (rain, sun, fog, snow) is almost impossible to cover in training data. When it fails, there's no geometric fallback — the output is just wrong, often confidently wrong. Not appropriate for safety-critical drone flight.

(B) Learned features + classical BA: SuperPoint/LightGlue handles lighting and weather variation (trained on diverse data). Classical BA provides geometric guarantees (if the features are correct, the pose IS correct). IMU preintegration gives absolute scale and bridges visual gaps. Failure mode: if BOTH learned features fail AND IMU drifts, you get trouble — but that's a double failure. Jetson Orin easily handles LightGlue (~10ms) + BA (~12ms).

(C) Classical features + learned depth: ORB features still fail in rain/fog (water droplets on lens = no gradients). The depth prior helps with scale but doesn't fix the core issue of feature tracking failure in adverse weather.

The key principle: Put learning where environment variation matters most (visual features), keep geometry where guarantees matter most (optimization). The drone can recover from brief feature loss (IMU bridges), but cannot recover from a fundamentally wrong pose estimate (end-to-end failure).

Chapter 7: Deploying Modern VIO

The gap between "works on a workstation with a GPU" and "works on a drone / phone / AR headset" is enormous. Deployment requires meeting strict latency constraints (a 33ms per-frame budget at 30 Hz for AR, of which feature extraction gets roughly 10ms), power budgets (milliwatts on a headset), and memory limits. This chapter is about making modern VIO practical.

Key techniques include: model distillation (train a small student from a large teacher), quantization (INT8 or even INT4 inference), pruning (remove unnecessary network connections), and hardware-aware architecture search (design networks specifically for the target chip).

Distillation for VIO works like this: train a large "teacher" model (e.g., full SuperPoint + SuperGlue at 70ms) on a large dataset. Then train a tiny "student" model (e.g., a MobileNet backbone + lightweight matcher at 8ms) to mimic the teacher's outputs rather than learning from raw data. The student learns the teacher's "knowledge" of what makes a good feature and a good match. The student usually falls short of the teacher's accuracy, but it gets 80-90% of the way there at roughly 1/8th the cost. For VIO, that tradeoff is worth it: you need to run at 30 fps or not at all.
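The "mimic the teacher" objective can be written down in a few lines. The sketch below is a generic temperature-scaled distillation loss in NumPy, not the training code of any specific VIO system; the array shapes, the temperature of 2.0, and the random logits standing in for keypoint-score outputs are all illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over softened distributions.

    The temperature**2 factor keeps the loss scale comparable
    across different temperatures.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(np.mean(kl) * temperature ** 2)

# Hypothetical per-cell score logits: 100 image cells, 65 classes each.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((100, 65))
student = teacher + 0.5 * rng.standard_normal((100, 65))  # imperfect mimic

print(distillation_loss(student, teacher))  # positive: student deviates
print(distillation_loss(teacher, teacher))  # zero: perfect mimic
```

In practice this term is usually combined with a loss on descriptors or matches, and the student is trained with gradient descent in a deep learning framework; the NumPy version only shows what is being minimized.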

Quantization is the other big lever. Most neural network weights are stored as FP32 (4 bytes). INT8 quantization stores them as 8-bit integers (1 byte): 4x smaller, and 2–4x faster on hardware with INT8 support. Naive quantization (just round everything) can cost noticeable accuracy. Quantization-aware training (QAT) simulates rounding during training, letting the network learn to be robust to quantization noise. For VIO feature extractors, QAT typically loses <1% accuracy while doubling inference speed.
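To see what "store weights as 8-bit integers" means in practice, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. It illustrates the rounding step only (the same rounding QAT simulates during training); the function names and the random weight tensor are assumptions, and real toolchains add per-channel scales, calibration, and activation quantization.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical layer weights.
rng = np.random.default_rng(0)
w = (rng.standard_normal((64, 64)) * 0.1).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))

print(f"FP32 bytes: {w.nbytes}, INT8 bytes: {q.nbytes}")
print(f"scale: {scale:.6f}, max rounding error: {max_err:.6f}")
```

The storage drops by exactly 4x, and the per-weight error is bounded by half a quantization step. Whether that error matters depends on how sensitive the network is, which is the gap QAT is meant to close.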

The ultimate deployment target: dedicated hardware. Apple's Neural Engine, Qualcomm's Hexagon DSP, and Google's Edge TPU run INT8 models at 10–100x the speed of the CPU on the same chip. The catch: each target has different supported operations and memory layouts. A model optimized for Qualcomm may need modifications for Apple. This is why hardware-aware architecture search matters — design the network for the hardware from day one.

The deployment gap: SuperPoint runs at 70ms on a phone GPU. That's too slow for 30fps VIO (budget: 33ms total, of which features get maybe 10ms). But a distilled version can run at 8ms. The challenge is preserving accuracy while hitting the latency target. This is engineering, not research — and it's where most teams spend their time.

Latency budget breakdown for AR VIO at 30 Hz: Total: 33ms per frame. Feature extraction: ~8ms. Feature matching: ~4ms. IMU preintegration: ~0.5ms (CPU, trivial). Optimization (BA solve for 10-frame window): ~12ms. Overhead (memory, scheduling): ~3ms. That leaves ~5ms margin. Every component must hit its target or the system drops frames. A single dropped frame means the IMU must dead-reckon across the gap. With an uncompensated acceleration error of 0.5–1 m/s², one missed frame at 30 Hz (a ~67ms gap) adds only a millimeter or two of position error, but the error grows quadratically with the gap, so a burst of dropped frames quickly reaches centimeters.
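A budget like this is worth encoding as a check that runs alongside your profiling. A minimal sketch using the numbers above (the dictionary keys and variable names are made up for illustration):

```python
FRAME_RATE_HZ = 30.0

# Per-stage latency targets in milliseconds, taken from the breakdown above.
budget_ms = {
    "feature_extraction": 8.0,
    "feature_matching": 4.0,
    "imu_preintegration": 0.5,
    "optimization": 12.0,
    "overhead": 3.0,
}

frame_period_ms = 1000.0 / FRAME_RATE_HZ
used_ms = sum(budget_ms.values())
margin_ms = frame_period_ms - used_ms

print(f"frame period: {frame_period_ms:.1f} ms")
print(f"allocated:    {used_ms:.1f} ms")
print(f"margin:       {margin_ms:.1f} ms")

if margin_ms < 0:
    raise RuntimeError("Latency budget exceeds the frame period")
```

Replacing the target values with measured worst-case timings from the device turns this into a quick regression check for each new model or build.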

Accuracy vs Latency Tradeoff

[Interactive plot: each dot is a system configuration. The goal: maximize accuracy (up) while minimizing latency (left). The green zone is the deployable region. Latency budget slider: 15 ms.]

Technique	Speedup	Accuracy Loss	Effort
FP16 inference	1.5–2x	Negligible	Low
INT8 quantization	2–4x	Small	Medium
Model distillation	3–8x	Small–moderate	High
Architecture search	5–10x	Variable	Very high
Hardware accelerator	10–100x	None (same model)	Very high (custom HW)

"A mediocre algorithm that runs in real-time on the actual hardware beats a perfect algorithm that runs 10x too slow."

— Systems engineering wisdom

You now understand the frontier of visual-inertial odometry. The field is moving from hand-crafted to learned, from rigid to adaptive, and from lab to deployment. The future belongs to systems that combine geometric rigor with learned robustness.

The concrete takeaway: start with a classical tightly-coupled VIO system (VINS-Mono). Replace the visual front-end with SuperPoint/LightGlue. Add a learned depth prior if your platform has a GPU. Add a neural IMU correction if you have device-specific training data. Each step is incremental, testable, and reversible. Only go fully end-to-end if you have massive training data and a tolerance for domain-specific fine-tuning.

Check: What is typically the biggest bottleneck when deploying modern VIO on edge devices?

Lack of training data

Meeting latency and power constraints while preserving accuracy

The camera resolution is too low

🏗 Design Challenge You're the Architect: VIO for a Mars Rover

A Mars rover must navigate autonomously across rocky terrain. No GPS, no communication during traversals (up to ~20-minute Earth-Mars signal delay), sparse visual features (rocky desert with few distinct landmarks), potential wheel slip on sand. The VIO system must survive a 10+ year mission on radiation-hardened hardware from 2018.

Constraint	Specification
Compute	RAD750 processor (~200 MHz, 256MB RAM) + FPGA co-processor
Camera	Stereo pair, 1 Mpx, 1 Hz (due to power/data limits)
IMU	Navigation-grade (low noise) but 100 Hz max
Wheel encoders	Available but slip on sand (up to 30% slip)
Terrain	Rocky desert: some texture but highly repetitive
Speed	~0.04 m/s max (4 cm/s, very slow)
Accuracy needed	<5% dead-reckoning error over 100m traverse

1. At 1 Hz camera rate, the rover moves 4cm between frames. Is the baseline sufficient for stereo triangulation? What about for visual odometry (motion estimation)?

2. Wheel slip is 30% on sand. How do you detect and compensate for it? Can the visual system help?

3. No learned features possible (RAD750 can't run neural networks). What classical approach maximizes robustness on rocky desert terrain?

4. Communication delay means no human-in-the-loop for navigation decisions. How do you handle "lost" situations? What's the recovery strategy?

Real-world solution (Mars 2020 / Perseverance, AutoNav):

1. Stereo visual odometry at 1 Hz: At 0.04 m/s and 1 Hz, the rover moves 4cm between frames. The 24cm stereo baseline gives excellent depth for nearby rocks (1-5m). For VO, 4cm of motion at 3m distance gives ~0.8° parallax — detectable but small. Solution: match features across MULTIPLE frames (accumulate 5-10 frames = 20-40cm baseline for VO).

2. Wheel slip detection: Compare wheel odometry (distance from encoders) with visual odometry (distance from feature tracking). If they disagree by >10%, flag slip. The visual estimate is trusted as ground truth. For sand traps: if VO shows zero progress but wheels are spinning, STOP. Alert mission control. The rover has survived by detecting slip early.

3. Classical features on RAD750: Harris corners + normalized cross-correlation matching. No SIFT/SURF (too expensive). Feature count is limited to ~200 per frame. Key trick: rock edges and shadows provide texture on Mars. The system is HEAVILY tuned for this specific terrain type. RANSAC with 5-point algorithm for relative pose.

4. Autonomy and recovery: The rover builds a local terrain model (stereo DEM within 20m). Hazard detection classifies rocks/slopes/sand. If "lost" (VO fails for >3 steps): stop, rotate in place to gather stereo imagery from multiple angles, attempt relocalization against the last good map. If that fails: wait for ground contact. The rover NEVER blindly continues when uncertain: stopping is always safe.
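The slip check in item 2 boils down to comparing two distance estimates per step. Here is a minimal sketch of that logic; the function name, the return labels, and the 2 mm stall threshold are illustrative assumptions, while the 10% disagreement threshold comes from the text. It is not flight software.

```python
def classify_slip(wheel_dist_m, vo_dist_m, slip_threshold=0.10, stall_eps_m=0.002):
    """Compare wheel odometry against visual odometry for one step.

    Returns "ok", "slip" (disagreement above threshold), or
    "stop" (wheels turning but VO shows essentially no progress).
    """
    if wheel_dist_m <= stall_eps_m:
        return "ok"  # wheels barely moved, nothing to compare
    if vo_dist_m <= stall_eps_m:
        return "stop"  # spinning in place: likely a sand trap
    disagreement = abs(wheel_dist_m - vo_dist_m) / wheel_dist_m
    return "slip" if disagreement > slip_threshold else "ok"

# One step at 4 cm of commanded wheel travel.
print(classify_slip(0.040, 0.039))  # small disagreement
print(classify_slip(0.040, 0.028))  # 30% slip on sand
print(classify_slip(0.040, 0.000))  # no visual progress at all
```

Because VO is treated as the reference, the same comparison also yields a slip ratio that can be fed back to scale the wheel odometry on soft terrain.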

🔗 Pattern Recognition

VIO + Relocalization = SLAM

Modern VIO

Sliding window, no persistent map. Drift accumulates over time. Cannot recognize previously visited places. Bounded computation.

Full SLAM

VIO for local tracking + persistent keyframe database for place recognition + pose graph for loop closure. Drift is periodically corrected.

Modern VIO and modern SLAM share most of their pipeline. The difference is one component: a place recognition database that detects when you've returned to a previously visited location. When detected, a loop closure constraint is added to a pose graph, and the entire trajectory is re-optimized to eliminate accumulated drift. VINS-Mono does exactly this: real-time VIO for tracking + DBoW2 for place recognition + pose graph optimization for drift correction.
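What a loop closure constraint does to the trajectory is easiest to see in one dimension. The sketch below is a toy pose graph, not how VINS-Mono implements it: poses are scalars, odometry edges have unit weight, and a single heavily weighted loop edge says "the last pose is the first pose". The function name, the weight of 100, and the 2 cm per-step drift are assumptions for illustration.

```python
import numpy as np

def close_loop_1d(odometry, loop_weight=100.0):
    """Least-squares 1-D pose graph with x_0 fixed at 0.

    Constraints: x_{i+1} - x_i = odometry[i]   (weight 1)
                 x_n - x_0 = 0                 (loop closure, weight loop_weight)
    """
    n = len(odometry)
    A = np.zeros((n + 1, n))  # unknowns are x_1 .. x_n
    rhs = np.zeros(n + 1)
    for i, d in enumerate(odometry):
        A[i, i] = 1.0           # +x_{i+1}
        if i > 0:
            A[i, i - 1] = -1.0  # -x_i (x_0 is fixed, so no column for it)
        rhs[i] = d
    A[n, n - 1] = loop_weight   # loop closure row, target 0
    x, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return np.concatenate([[0.0], x])

# Walk 5 m out and 5 m back; each step over-reports by 2 cm (drift).
odometry = np.array([1.02] * 5 + [-0.98] * 5)
dead_reckoned = np.concatenate([[0.0], np.cumsum(odometry)])
optimized = close_loop_1d(odometry)

print(f"end pose, dead reckoning: {dead_reckoned[-1]:.3f} m")
print(f"end pose, after closure:  {optimized[-1]:.3f} m")
print(f"turnaround pose:          {optimized[5]:.3f} m")
```

The 20 cm of accumulated drift is spread evenly over all ten odometry edges, so every pose along the way is corrected, not only the last one. Real systems do the same thing in SE(3) with a nonlinear solver.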

If your VIO system already uses learned features (SuperPoint descriptors), how would you build place recognition on top? Would you need a separate representation, or can you reuse the feature descriptors?

💻 Build It Implement Linear Triangulation (DLT)

Given two camera projection matrices and corresponding 2D pixel observations, triangulate the 3D point position using the DLT method with SVD.

signature
def triangulate_dlt(P1, P2, x1, x2):
    """
    P1: np.array [3,4] - projection matrix of camera 1 (K @ [R|t])
    P2: np.array [3,4] - projection matrix of camera 2
    x1: np.array [2] - pixel observation in camera 1 (u, v)
    x2: np.array [2] - pixel observation in camera 2 (u, v)

    Returns:
        X: np.array [3] - 3D point in world coordinates
        quality: float - ratio of smallest to second-smallest singular value
                 (close to 0 = good, close to 1 = degenerate)
    """

Test case

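Camera 1 at origin, Camera 2 translated 1m right. Point at (0.5, 0, 5): P1 = K @ [I|0], P2 = K @ [I|-[1,0,0]]. With K = [[500,0,320],[0,500,240],[0,0,1]], the observations are x1 = (370, 240) and x2 = (270, 240). Expected: X = [0.5, 0, 5], quality near 0 (well-conditioned). Caveat: with exact, noise-free pixels the two rays intersect exactly, so the smallest singular value is zero up to round-off and quality stays near 0 even for a very distant point such as (0, 0, 10000). To flag weak geometry, also check the parallax angle between the two rays or a near-zero homogeneous component X_homo[3].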

For each view i with pixel (u, v) and projection matrix P (rows p1, p2, p3): row 1 = u*p3 - p1, row 2 = v*p3 - p2. Stack all rows into A (4x4 for 2 views). SVD of A = USV^T. Solution = last column of V (corresponding to smallest singular value). Dehomogenize: X = V[:3, -1] / V[3, -1].

python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation via Direct Linear Transform."""
    u1, v1 = x1[0], x1[1]
    u2, v2 = x2[0], x2[1]

    # Build 4x4 matrix A (2 rows per view)
    A = np.zeros((4, 4))
    A[0] = u1 * P1[2] - P1[0]  # u1*p1_3 - p1_1
    A[1] = v1 * P1[2] - P1[1]  # v1*p1_3 - p1_2
    A[2] = u2 * P2[2] - P2[0]  # u2*p2_3 - p2_1
    A[3] = v2 * P2[2] - P2[1]  # v2*p2_3 - p2_2

    # SVD: solution is last column of V
    _, S, Vt = np.linalg.svd(A)
    X_homo = Vt[-1]  # last row of Vt = last col of V

    # Dehomogenize
    X = X_homo[:3] / X_homo[3]

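    # Quality metric: ratio of smallest to 2nd smallest SV
    # Intended reading: close to 0 = well-determined, close to 1 = degenerate
    # Caveat: exact (noise-free) inputs give S[3] ~ 0 regardless of geometry,
    # so pair this with a parallax check before trusting the depth.
    quality = S[3] / (S[2] + 1e-10)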

    return X, quality

Bonus challenge: Extend to N views (more than 2). The A matrix becomes 2N x 4. With more views, the solution is over-determined and more robust. Also implement the reprojection error: project the triangulated point back into each camera and measure the pixel distance. This is how VIO systems decide whether a feature is an inlier or outlier.
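One possible take on the bonus challenge is sketched below. It stacks two rows per view into a 2N×4 matrix, solves with SVD exactly as in the two-view case, and then measures per-view reprojection error in pixels. The function names, the helper `make_P`, and the three-camera test setup are assumptions for illustration; the intrinsics reuse the K from the test case above.

```python
import numpy as np

def triangulate_nview(Ps, xs):
    """DLT triangulation from N >= 2 views.

    Ps: list of [3,4] projection matrices, xs: list of (u, v) pixels.
    """
    A = np.zeros((2 * len(Ps), 4))
    for i, (P, (u, v)) in enumerate(zip(Ps, xs)):
        A[2 * i] = u * P[2] - P[0]
        A[2 * i + 1] = v * P[2] - P[1]
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]  # right singular vector of the smallest singular value
    return X_h[:3] / X_h[3]

def project(P, X):
    """Project a 3D point to pixel coordinates."""
    x_h = P @ np.append(X, 1.0)
    return x_h[:2] / x_h[2]

def reprojection_errors(Ps, xs, X):
    """Pixel distance between each observation and the reprojected point."""
    return np.array([np.linalg.norm(project(P, X) - np.asarray(x))
                     for P, x in zip(Ps, xs)])

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])

def make_P(cam_x):
    """Camera with identity rotation, centered at (cam_x, 0, 0)."""
    t = -np.array([[cam_x], [0.0], [0.0]])
    return K @ np.hstack([np.eye(3), t])

Ps = [make_P(c) for c in (0.0, 1.0, 2.0)]
X_true = np.array([0.5, 0.0, 5.0])
xs = [project(P, X_true) for P in Ps]

X_est = triangulate_nview(Ps, xs)
errs = reprojection_errors(Ps, xs, X_est)
print(X_est)
print(errs)
```

With noisy observations, a feature whose reprojection error exceeds a chosen pixel threshold in some view would be treated as an outlier for that view; the threshold itself is a tuning choice.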
