Every concept, system design, algorithm, failure mode, and frontier paper a staff 3D vision engineer needs — from camera models to neural reconstruction.
You are a 3D geometry engineer at a robotics company. Your morning starts with a Slack message: "Stereo calibration drifted overnight, VO is diverging on the warehouse floor." By afternoon you are debugging why the offline reconstruction pipeline dropped 400 images from a 10,000-image drone capture. After dinner you review a PR that replaces your COLMAP-based pipeline with DUSt3R and need to decide: ship it or push back?
This is the job. You own the full pipeline from photons hitting the sensor to 3D point clouds consumed by downstream planners, world models, and manipulation policies. Every component is load-bearing. A 2-pixel calibration error propagates through triangulation, corrupts the map, and sends the robot into a shelf.
A day in the life, concretely:
The interview dimensions for every chapter in this lesson:
| Dimension | What it tests | Signal |
|---|---|---|
| CONCEPT | Can you derive it on a whiteboard? | First principles, not memorized formulas |
| DESIGN | Where does it fit in a production system? | Data flow, tradeoffs, failure boundaries |
| CODE | Can you implement the core algorithm? | From-scratch numpy, not API calls |
| DEBUG | What breaks and how do you fix it? | Symptom → diagnosis → fix |
| FRONTIER | What replaces this in 2 years? | Latest papers, repos, tradeoffs |
Click any component to highlight its inputs and outputs. This is the full perception pipeline you would own as a staff geometry engineer.
The daily reality: You will spend 40% of your time on calibration and data pipeline issues (not glamorous, but load-bearing). 30% on algorithm development and evaluation. 20% on debugging production failures. 10% on reading papers and prototyping new approaches. The staff-level signal is owning the full stack end-to-end, not just the ML model in the middle.
The tech stack you should know: C++ for production (Ceres, g2o, OpenCV, GTSAM). Python for prototyping (numpy, torch, open3d). ROS for sensor integration. Docker for deployment. COLMAP/hloc for offline pipelines. Nerfstudio/gsplat for neural rendering.
Data flow through the pipeline with tensor shapes:
Typical numbers from a real robot deployment: 6 cameras × 1280×960 @ 30fps = 6 × 3.7MB/frame = 22.2 MB/frame = 666 MB/s raw throughput. Feature extraction: ~500 keypoints/camera/frame × 6 = 3000 keypoints/frame. Matching: 15 camera pairs × ~200 matches = 3000 correspondences/frame. VO: 1 SE(3) pose per frame (6 DOF). Map: 50K-500K 3D landmarks after 10 minutes of operation.
Every 3D vision algorithm starts with a camera model — the mathematical function that maps a 3D world point to a 2D pixel. The simplest is the pinhole camera: light passes through an infinitely small hole and hits a flat sensor. No lens, no distortion, just geometry. Why start here? Because every error in the camera model propagates through triangulation, SfM, SLAM, and reconstruction. Get calibration wrong by 1 pixel, and your robot's 3D map is wrong by centimeters at arm's reach.
The math has three stages. First, transform from world to camera coordinates using the extrinsic matrix [R|t]. Then divide by depth (the perspective projection). Finally, scale to pixels using the intrinsic matrix K.
fx, fy = focal length in pixels (how much the camera "zooms"). cx, cy = principal point (where the optical axis hits the sensor, usually near image center). R (3×3) = rotation from world to camera. t (3×1) = translation. Together, [R|t] says "where is the camera and which way is it looking?"
Real lenses introduce distortion. The Brown-Conrady model corrects radial (barrel/pincushion) and tangential components:
Why does distortion matter in practice? At the image center, distortion is near zero (r is small, so k1r² ≈ 0). At the edges (r=1 for a normalized image), even k1=0.1 shifts points by 10% of their distance from center. On a 1280×960 sensor, that's ~50 pixels at the corners. If you triangulate features at the edges without undistortion, your 3D points are wrong by centimeters. Always undistort before geometric computation.
Distortion model hierarchy: Radial-only (k1, k2): sufficient for most machine vision lenses. Full Brown-Conrady (k1, k2, k3, p1, p2): standard for OpenCV calibration. Kannala-Brandt: for fisheye lenses (FOV > 120°). UCM/Double Sphere: unified model for catadioptric and omnidirectional cameras. Neural distortion: learn a small MLP mapping (x,y) → (x',y') for non-parametric lenses. Each model adds complexity but handles wider FOV and more exotic optics.
Data flow from 3D point to pixel:
Where does calibration fit? Before everything else. At a robotics company, you calibrate once at the factory, verify monthly in the field, and run continuous self-calibration checks during operation. Thermal drift changes focal length by 0.1–1 pixel per 10°C. Rolling shutter CMOS sensors expose rows sequentially — fast motion causes wobble. Use global shutter cameras for robotics.
Rolling shutter vs global shutter, quantified. A rolling shutter camera with 33ms readout time (full frame) exposes each row at a different instant. A robot rotating at 90°/s shifts the bottom of the image by 90 × 0.033 = 3° relative to the top. At 800px focal length, that's ~42 pixels of skew. This breaks the pinhole assumption completely. Global shutter cameras expose all pixels simultaneously — no skew. Cost difference: global shutter is typically 2-3x the price of rolling shutter. Rolling shutter can be modeled in BA (continuous-time formulation), but it adds complexity and computation. Staff recommendation: global shutter for anything moving faster than walking speed.
Coordinate frame conventions are a constant source of bugs. OpenCV uses right-handed with Z forward, Y down. OpenGL uses Z backward, Y up. ROS uses X forward, Z up. Every time you integrate a new library, check which convention it uses.
| Convention | Forward | Right | Up | Used by |
|---|---|---|---|---|
| OpenCV | +Z | +X | -Y | OpenCV, COLMAP, most vision libraries |
| OpenGL | -Z | +X | +Y | OpenGL, Three.js, Blender camera |
| ROS REP-103 | +X | -Y | +Z | ROS, URDF, tf2 |
| Unreal Engine | +X | +Y | +Z | Unreal, some game engines |
The conversion trap. When you read a pose from COLMAP (OpenCV convention) and use it in a ROS pipeline, you need to rotate the coordinate frame: Rros = Rcv2ros · Rcolmap. The conversion matrix flips Y and Z: Rcv2ros = [[0,0,1],[−1,0,0],[0,−1,0]]. Get this wrong and the robot drives backward into a wall. Debug tip: visualize a single pose with known orientation (camera looking down a hallway) and verify the axes manually before processing the full dataset.
python import numpy as np def project(X, K, R, t, dist=None): """Project 3D point X (3,) to pixel (2,). dist: [k1, k2, p1, p2] distortion coefficients.""" X_cam = R @ X + t # (3,) world → camera x_n = X_cam[:2] / X_cam[2] # (2,) perspective divide if dist is not None: k1, k2, p1, p2 = dist r2 = x_n[0]**2 + x_n[1]**2 radial = 1 + k1*r2 + k2*r2**2 x_n = x_n * radial # radial distortion x_n[0] += 2*p1*x_n[0]*x_n[1] + p2*(r2 + 2*x_n[0]**2) x_n[1] += p1*(r2 + 2*x_n[1]**2) + 2*p2*x_n[0]*x_n[1] px = K[:2,:2] @ x_n + K[:2,2] # (2,) apply intrinsics return px # Test K = np.array([[800,0,320],[0,800,240],[0,0,1]], dtype='float64') px = project(np.array([1.0, 2.0, 5.0]), K, np.eye(3), np.zeros(3)) # px = [480., 560.] ✓ # Batch projection for N points def project_batch(X_batch, K, R, t, dist=None): """Project (N,3) points to (N,2) pixels.""" X_cam = (R @ X_batch.T).T + t # (N,3) x_n = X_cam[:,:2] / X_cam[:, 2:3] # (N,2) perspective divide if dist is not None: k1, k2, p1, p2 = dist r2 = np.sum(x_n**2, axis=1, keepdims=True) radial = 1 + k1*r2 + k2*r2**2 x_n = x_n * radial px = x_n @ K[:2,:2].T + K[:2,2] # (N,2) apply intrinsics return px # Undistort: inverse mapping (iterative) def undistort_point(px, K, dist, n_iter=10): """Pixel (2,) → undistorted normalized coords (2,).""" x_d = (px - K[:2,2]) / np.diag(K[:2,:2]) x = x_d.copy() k1, k2, p1, p2 = dist for _ in range(n_iter): r2 = x[0]**2 + x[1]**2 radial = 1 + k1*r2 + k2*r2**2 x = x_d / radial # fixed-point iteration return x
| Symptom | Diagnosis | Fix |
|---|---|---|
| Images warped at edges | Wrong distortion coefficients | Recalibrate with more board images; ensure corners cover image edges |
| Points don't align between cameras | Extrinsic calibration is stale | Recalibrate extrinsics; check if camera mount shifted |
| Projection off by constant offset | Wrong principal point | Check cx, cy; ensure calibration images span full FOV |
| Projected points mirrored/rotated | Coordinate frame mismatch | Check OpenCV vs OpenGL vs ROS conventions |
Self-calibration from video (no checkerboard): estimate intrinsics from feature tracks across frames by solving the absolute conic constraint. Requires the camera to undergo sufficient rotation diversity (at least 3 rotations about different axes). Works in practice for handheld video (natural wrist rotation) but fails for car-mounted cameras (mostly translation, little rotation). Neural lens models: learn the full distortion function as a small MLP, mapping normalized coordinates (x,y) to distorted (x',y'). Handles non-Brown-Conrady distortion from fisheye and catadioptric lenses that parametric models can't fit. Training: minimize reprojection error on a dense calibration target. Differentiable camera models in PyTorch (e.g., in DROID-SLAM) allow end-to-end optimization of intrinsics alongside pose estimation — the focal length becomes a learnable parameter, jointly optimized with poses and depths.
Event cameras (DVS): asynchronous per-pixel brightness change sensors with microsecond resolution. Each pixel independently fires when it detects a brightness change exceeding a threshold. No frames, no exposure time, no motion blur. Extreme dynamic range (>120dB vs ~60dB for standard cameras). The geometry is the same (pinhole + distortion), but the calibration procedure is different: you can't use a static checkerboard (no brightness changes). Instead, use a blinking checkerboard or moving patterns. Event cameras are increasingly used for high-speed robotics (drone racing, industrial pick-and-place at >10Hz) where conventional cameras produce motion blur.
Camera selection for robotics: Staff-level question. Mono vs stereo vs fisheye vs event? The answer depends on: speed (event cameras for >100Hz, global shutter for <30Hz), environment (fisheye for tight spaces, narrow FOV for long-range), compute budget (stereo needs rectification + matching, mono needs depth estimation), and whether you have an IMU (VIO gives you scale without stereo).
A 3D wireframe cube projected through a pinhole camera. Adjust focal length and distortion to see what breaks.
Two cameras see the same 3D point. In camera 1, it appears at pixel p1. Where must it appear in camera 2? Without any constraint, we search the entire second image. Epipolar geometry reduces this to a single line.
The 3D point X, the two camera centers C1 and C2, and the two image points all lie in a single epipolar plane. The intersection of this plane with image 2 is the epipolar line. The matching point must lie on this line.
Derivation from scratch. Camera 1 is at origin. Camera 2 is at rotation R, translation t. A 3D point X1 in camera 1 coords becomes X2 = RX1 + t in camera 2 coords. The three vectors X2, t, and RX1 are coplanar. Coplanarity: X2 · (t × RX1) = 0. Since the normalized image coordinate x = X/Z, and dividing by Z does not change zero, we get:
E = [t]×R is the essential matrix (5 DOF: 3 rotation + 2 translation direction; scale is unrecoverable). [t]× is the skew-symmetric matrix such that [t]×v = t × v.
Worked example of E. Camera 2 is translated 1m to the right: t = (1,0,0), R = I (same orientation). Then [t]× = [[0,0,0],[0,0,-1],[0,1,0]], so E = [t]×R = [[0,0,0],[0,0,-1],[0,1,0]]. Given a point at normalized coords x1 = (0.5, 0.3, 1) in camera 1, the epipolar line in camera 2 is l2 = Ex1 = (0, -1, 0.3). This is a horizontal line at y = 0.3 — makes sense! Pure horizontal translation means epipolar lines are horizontal (the standard stereo rectification assumption).
The fundamental matrix F is the uncalibrated version, working on raw pixel coordinates:
F has 7 DOF (9 entries − 1 scale − 1 rank-2 constraint). Given F and point p1, the epipolar line in image 2 is l2 = Fp1.
In a SfM pipeline, epipolar geometry sits right after feature matching. You estimate F (or E if calibrated) from correspondences, then use it to: (1) reject outlier matches (points violating the epipolar constraint), (2) recover relative pose R, t, (3) initialize triangulation. RANSAC is mandatory — even 10% outlier correspondences will corrupt the estimate.
How many inliers do you need? The 5-point algorithm needs 5 correct correspondences per RANSAC sample. The probability of getting a clean sample is p = w5 where w is the inlier ratio. For 99% success probability, you need k = log(0.01)/log(1-w5) iterations. With 50% inliers: k = 145. With 30%: k = 4,121. With 20%: k = ~29,000. This exponential blowup is why matching quality is the #1 bottleneck in practice.
Essential vs Fundamental: when to use which? If you know intrinsics K (calibrated camera): use E. It has fewer DOF (5 vs 7), so you need fewer correspondences (5-point vs 8-point), and the estimate is more stable. If cameras are uncalibrated or K is uncertain: use F. In practice, most modern systems use calibrated cameras, so E is the standard.
RANSAC iteration count — the math you should know. To find one clean sample of s points from data with inlier ratio w, the probability of failure after k iterations is: Pfail = (1 - ws)k. Solving for k at Pfail = 0.01:
| Inlier ratio w | 5-point (s=5) | 8-point (s=8) | Time @ 1ms/iter |
|---|---|---|---|
| 80% | 8 iterations | 14 iterations | ~14ms |
| 60% | 57 iterations | 177 iterations | ~177ms |
| 40% | 435 iterations | 4,177 iterations | ~4.2s |
| 20% | 14,365 iterations | ~717K iterations | ~12min |
This table shows two things: (1) the 5-point algorithm is vastly more RANSAC-efficient than 8-point, and (2) below 40% inliers, you need thousands of iterations. This is why matching quality (the step before RANSAC) is the #1 factor determining pipeline speed. Improving the inlier ratio from 40% to 60% with a better matcher (e.g., LightGlue vs brute-force ratio test) saves you 8x iterations.
RANSAC scoring. Not all correct models are equal. After finding an inlier set, score the model by the number of inliers within a threshold (typically 1-3 pixels for F estimation). Better variants: MSAC: score = sum of capped squared residuals (incentivizes tighter fits). MAGSAC: marginalizes over all thresholds (no threshold parameter needed). GC-RANSAC: adds local optimization on the inlier set after each RANSAC iteration. OpenCV's `findFundamentalMat` uses RANSAC by default, but `cv::USAC_MAGSAC` is strongly recommended for production.
python def eight_point_F(pts1, pts2): """pts1, pts2: (N,2) correspondences, N ≥ 8.""" def normalize(pts): mu = pts.mean(axis=0) d = np.linalg.norm(pts - mu, axis=1).mean() s = np.sqrt(2) / (d + 1e-8) T = np.array([[s,0,-s*mu[0]],[0,s,-s*mu[1]],[0,0,1]]) ph = np.hstack([pts, np.ones((len(pts),1))]) return (T @ ph.T).T[:,:2], T n1, T1 = normalize(pts1) n2, T2 = normalize(pts2) u1, v1 = n1[:,0], n1[:,1] u2, v2 = n2[:,0], n2[:,1] A = np.stack([u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, np.ones(len(u1))], axis=1) _, _, Vt = np.linalg.svd(A) F = Vt[-1].reshape(3,3) U, S, Vt2 = np.linalg.svd(F) # enforce rank 2 S[2] = 0 F = U @ np.diag(S) @ Vt2 return T2.T @ F @ T1 # denormalize def ransac_F(pts1, pts2, n_iter=500, thresh=3.0): """RANSAC wrapper for robust F estimation.""" best_F, best_inliers = None, [] N = len(pts1) for _ in range(n_iter): idx = np.random.choice(N, 8, replace=False) F_cand = eight_point_F(pts1[idx], pts2[idx]) # Compute Sampson distance for all points p1h = np.hstack([pts1, np.ones((N,1))]) p2h = np.hstack([pts2, np.ones((N,1))]) Fp1 = (F_cand @ p1h.T).T # (N,3) Ftp2 = (F_cand.T @ p2h.T).T # (N,3) num = np.sum(p2h * Fp1, axis=1)**2 # (p2^T F p1)^2 denom = Fp1[:,0]**2 + Fp1[:,1]**2 + Ftp2[:,0]**2 + Ftp2[:,1]**2 dist = num / (denom + 1e-8) inliers = np.where(dist < thresh**2)[0] if len(inliers) > len(best_inliers): best_inliers = inliers best_F = eight_point_F(pts1[inliers], pts2[inliers]) return best_F, best_inliers
| Symptom | Diagnosis | Fix |
|---|---|---|
| F estimation gives garbage | Missing Hartley normalization | Always normalize; condition number of A should be ~1 |
| Degenerate F (all epipolar lines converge) | Points are coplanar | Use homography H instead; recover normal + distance |
| Too few RANSAC inliers (<30%) | Feature matching has too many outliers | Tighten Lowe's ratio test, use learned matcher, check image overlap |
| E decomposition gives wrong R,t | Cheirality check failed — picked wrong solution | E decomposes to 4 (R,t) solutions; pick the one where triangulated points have positive depth in both cameras |
Deep fundamental matrix estimation: OANet (Yi et al., CVPR 2019) learns to weight correspondences, replacing RANSAC. LoFTR + F estimation: skip feature detection entirely, predict matches with a transformer, estimate F from dense correspondences. DUSt3R (CVPR 2024): bypasses the fundamental matrix completely — directly predicts 3D pointmaps from two images.
Click in the left image to see the corresponding epipolar line in the right image. Drag baseline to see degenerate case when cameras coincide.
Harris corners. At each pixel, compute the structure tensor M — a 2×2 matrix of image gradient products summed over a window. Its eigenvalues λ1, λ2 diagnose the local structure: both large = corner (good feature), one large = edge, both small = flat. Harris avoids computing eigenvalues directly with R = det(M) − k·trace(M)².
SIFT (Scale-Invariant Feature Transform): Build a Difference-of-Gaussians pyramid (blur at multiple scales, subtract adjacent levels). Extrema in this 3D scale-space are keypoints. For each, compute a 128-dimensional descriptor from gradient histograms in a 16×16 patch, rotated to the dominant orientation. Scale + rotation invariant. Slow but extremely distinctive.
ORB (Oriented FAST + Rotated BRIEF): FAST corner detection (16-pixel circle intensity test) plus BRIEF binary descriptor (256 bits from pixel comparisons). 100× faster than SIFT but less distinctive. Binary descriptors use Hamming distance (count differing bits with XOR + popcount) instead of L2, making matching 10x faster than float descriptors.
The evolution to learned features. Classical features (SIFT, ORB) are hand-designed — fixed receptive field, fixed descriptor computation. SuperPoint (DeTone et al., CVPRW 2018) trains a CNN to detect keypoints and compute 256-dim descriptors jointly. Trained with homographic adaptation: warp images with random homographies, check if the same corners are detected. The result: more repeatable keypoints in challenging conditions (low light, blur, viewpoint change) than any classical detector.
Dense matchers skip keypoints entirely. LoFTR (Sun et al., CVPR 2021) processes two images with a ViT backbone, computes cross-attention between all feature positions, and outputs dense correspondences. No keypoint detection, no descriptor computation, no ratio test. Works on textureless surfaces (walls, floors) where classical features find nothing. The cost: ~80ms on a 3090 vs ~5ms for ORB. Worth it when matching quality is the bottleneck.
Matching strategy at scale. For 100 images: brute-force all pairs (O(N²) = 5K pairs). For 10,000 images: vocabulary tree retrieval to find the top-50 candidate pairs per image (O(N·50) = 500K pairs vs O(N²) = 50M). For 1M images: learned retrieval (NetVLAD, CosPlace) to find top-20 candidates.
Why brute-force L2 matching is expensive. For N descriptors of dimension D: N×N distance computations, each D multiplications + additions. With 2000 SIFT features (D=128) in each image: 2000×2000×128 = 512M FLOPs per pair. With 5000 pairs for 100 images: 2.56 TFLOPs total. That's why approximate nearest neighbors (FLANN, FAISS) or learned matchers (which attend to spatial context and exit early when confident) are essential at scale.
Feature distribution matters. 2000 features clustered on a textured poster leave the rest of the image unmatched. Good practice: grid-based detection — divide the image into 8×8 cells, detect top-N features per cell. This ensures coverage across the entire FOV, which improves triangulation geometry (wide spread of rays) and makes reconstruction more complete. ORB-SLAM3 does exactly this with its "distribute keypoints with quadtree" strategy.
| Method | Descriptor | Speed | Accuracy | GPU? |
|---|---|---|---|---|
| SIFT | 128-dim float | Slow (~1s/img) | Good | No |
| ORB | 256-bit binary | Real-time | Moderate | No |
| SuperPoint + LightGlue | 256-dim float | Fast (~30ms) | Excellent | Yes |
| LoFTR / RoMa | Dense (no keypoints) | Moderate (~80ms) | Best for low-texture | Yes |
python def harris_corners(img_gray, k=0.04, threshold=0.01): """Harris corner detection from scratch. img_gray: (H,W) float32, range [0,1].""" # Image gradients Ix = np.gradient(img_gray, axis=1) # (H,W) Iy = np.gradient(img_gray, axis=0) # (H,W) # Structure tensor components (Gaussian-weighted) from scipy.ndimage import gaussian_filter Ixx = gaussian_filter(Ix * Ix, sigma=1.5) Iyy = gaussian_filter(Iy * Iy, sigma=1.5) Ixy = gaussian_filter(Ix * Iy, sigma=1.5) # Harris response: R = det(M) - k * trace(M)^2 det_M = Ixx * Iyy - Ixy ** 2 trace_M = Ixx + Iyy R = det_M - k * trace_M ** 2 # Non-maximum suppression + threshold corners = (R > threshold * R.max()) return np.argwhere(corners)[:, ::-1] # (N, 2) as (x, y) def ratio_test(desc1, desc2, ratio=0.75): """Lowe's ratio test. desc1: (N,D), desc2: (M,D).""" dists = np.linalg.norm(desc1[:, None] - desc2[None], axis=2) idx1 = np.argsort(dists, axis=1)[:, :2] # top-2 matches d1 = dists[np.arange(len(desc1)), idx1[:, 0]] d2 = dists[np.arange(len(desc1)), idx1[:, 1]] mask = d1 / (d2 + 1e-8) < ratio return np.stack([np.arange(len(desc1))[mask], idx1[:, 0][mask]], axis=1)
| Symptom | Diagnosis | Fix |
|---|---|---|
| Too few matches (<50) | Images have little overlap or large scale change | Lower ratio threshold to 0.8; use more keypoints; try LoFTR |
| Many wrong matches | Repetitive texture (warehouse shelving) | Tighten ratio test to 0.6; use geometric verification (RANSAC on F) |
| Matching too slow (>1s) | Brute-force on too many features | Use FLANN approximate NN; reduce to top-2000 features by response |
| Features cluster in one region | One textured object dominates detection | Use grid-based detection (N features per cell); ensure whole-image coverage |
SuperPoint + LightGlue (Lindenberger et al., ICCV 2023): the current production standard. CNN keypoints + GNN matcher with adaptive early exit. RoMa (Edstedt et al., CVPR 2024): coarse-to-fine dense matching with robust training. GIM (Xuelun et al., ICLR 2024): generalized image matching across domains. MASt3R (Leroy et al., ECCV 2024): matching + 3D in one model.
Two views with detected features. Toggle ratio test to see outlier rejection. Green = correct, red = outlier.
The problem: Given N images of a scene with feature correspondences, recover ALL camera poses (position + orientation) AND ALL 3D point positions. Two unknowns solved simultaneously.
Incremental SfM builds the reconstruction step by step:
Global SfM takes a different path: estimate all pairwise rotations, solve global rotations via rotation averaging, estimate translations, triangulate, run BA once. Faster but more sensitive to outlier relative poses.
Initialization is the hardest part. The initial two-view reconstruction sets the coordinate system and scale for the entire reconstruction. Bad initialization corrupts everything downstream. Choose images with: (1) sufficient baseline (not too small — noisy triangulation, not too large — matching fails), (2) enough shared features (>100 inlier matches), (3) non-degenerate geometry (not pure rotation, not all coplanar points). COLMAP tries many initial pairs, scores them by triangulation angle and match count, and picks the best one.
The 5-point algorithm (Nistér, 2003) is the standard for calibrated cameras. Given 5 correspondences + known K, recover E (up to 10 solutions). Use cheirality check (depth must be positive in both cameras) to reduce to 1–2 solutions. Combined with RANSAC, this is the workhorse of modern SfM initialization and relative pose estimation.
PnP variants and when to use each: P3P: minimal solver (3 points → up to 4 solutions), used inside RANSAC for maximum efficiency. DLT (6+ points): linear, no iteration, good for initialization. EPnP (Lepetit et al., 2009): O(n) complexity, accurate, the default in OpenCV. Iterative PnP: refine with Levenberg-Marquardt after initial estimate. In production SLAM, the pipeline is: P3P inside RANSAC to find inliers → EPnP or iterative refinement on all inliers for the final pose.
Triangulation quality depends on baseline. Two views at distance B with focal length f and pixel noise σ give depth uncertainty: σZ = Z² · σ / (f · B). At Z=5m, f=800px, σ=0.5px, B=0.5m: σZ = 25 × 0.5 / (800 × 0.5) = 0.031m = 3.1cm. Double the baseline to 1m: 1.6cm. This is why wide baselines help, but too wide means matching fails (appearance change).
Scaling to millions of images: Exhaustive pairwise matching is O(N²). For 10K images, that's 50M pairs. Solutions: (1) image retrieval (vocabulary tree, NetVLAD) to find top-k candidates, (2) hierarchical SfM — cluster images by GPS/time/visual similarity, run SfM per cluster, merge at boundaries, (3) distributed processing with Theia or OpenSfM.
COLMAP's registration strategy in detail. For each unregistered image, COLMAP counts how many of its features match to existing 3D points. The image with the most matches registers first (greedy strategy). Then: (1) solve PnP with P3P inside RANSAC (typically 100 iterations, 8px inlier threshold). (2) If >30 inliers, register the image. (3) Triangulate new 3D points from this camera + existing cameras (require triangulation angle >1.5°). (4) Run local BA on the 10 most recent cameras + all observed 3D points. (5) Every 10 registrations, run global BA. (6) Filter points: remove any with reprojection error >4px or track length <2.
Why initialization is so fragile. The first two cameras set the global coordinate frame and scale. If the baseline is too small (<5° parallax), triangulated points have huge depth uncertainty. If points are coplanar, E is degenerate. If there's pure rotation (no translation), triangulation is impossible (all rays from both cameras are parallel). COLMAP's trick: try up to 100 initial pairs, score each by (1) number of inlier matches, (2) mean triangulation angle, (3) median reprojection error. The pair with the best composite score wins. This heuristic is critical and rarely discussed in papers.
PnP (Perspective-n-Point): Given N 3D-2D correspondences, solve for camera pose. DLT formulation:
python def pnp_dlt(pts_3d, pts_2d, K): """PnP via Direct Linear Transform. pts_3d: (N,3), pts_2d: (N,2), K: (3,3). Returns R (3,3), t (3,).""" # Normalize 2D points by K^-1 K_inv = np.linalg.inv(K) pts_n = (K_inv @ np.hstack([pts_2d, np.ones((len(pts_2d),1))]).T).T # (N,3) # Build DLT matrix (2N x 12) A = [] for i in range(len(pts_3d)): X, Y, Z = pts_3d[i] u, v, w = pts_n[i] A.append([X,Y,Z,1,0,0,0,0,-u*X,-u*Y,-u*Z,-u]) A.append([0,0,0,0,X,Y,Z,1,-v*X,-v*Y,-v*Z,-v]) A = np.array(A) _, _, Vt = np.linalg.svd(A) P = Vt[-1].reshape(3,4) # projection matrix R_approx = P[:,:3] t_approx = P[:,3] # Enforce R ∈ SO(3) via SVD U, S, Vt2 = np.linalg.svd(R_approx) R = U @ Vt2 if np.linalg.det(R) < 0: R = -R; t_approx = -t_approx t = t_approx / S.mean() # scale correction return R, t
Triangulation from two views via DLT:
python def triangulate(P1, P2, pts1, pts2): """P1, P2: (3,4) projection matrices. pts: (N,2). Returns (N,3) 3D points.""" points_3d = [] for i in range(len(pts1)): u1, v1 = pts1[i] u2, v2 = pts2[i] A = np.array([ u1*P1[2] - P1[0], v1*P1[2] - P1[1], u2*P2[2] - P2[0], v2*P2[2] - P2[1] ]) _, _, Vt = np.linalg.svd(A) X = Vt[-1] points_3d.append(X[:3] / X[3]) # dehomogenize return np.array(points_3d)
| Symptom | Diagnosis | Fix |
|---|---|---|
| Reconstruction has 2+ disconnected components | Missing connections between image clusters | Add more retrieval candidates; ensure some images bridge the gap |
| Many images fail to register | Not enough 3D-2D correspondences for PnP | Lower matching thresholds; add more features; ensure overlap |
| Scale drift over long sequences | Incremental SfM accumulates BA error | Use global SfM, or add GPS/known-distance constraints |
| Initialization fails | First pair has degenerate geometry | Try multiple initial pairs; check for pure rotation or coplanarity |
DUSt3R (Wang et al., CVPR 2024): two images in, per-pixel 3D pointmaps out. No matching, no RANSAC, no triangulation. A single ViT-based network. MASt3R (Leroy et al., ECCV 2024): adds local feature extraction to DUSt3R for matching. VGGT (Wang et al., CVPR 2025): N images in one forward pass → cameras, 3D points, depth. The "foundation model" for geometry.
Step through SfM. Camera frustums (teal) and points (yellow) appear as images register. "Inject Noise" shows what happens without BA.
Bundle adjustment (BA) simultaneously refines ALL camera poses AND ALL 3D points to minimize total reprojection error:
xij = observed pixel of point j in camera i. π(Ci, Xj) = projection of 3D point Xj through camera Ci. ρ = robust loss function (Huber or Cauchy) to downweight outliers. The name: adjusting the bundles of rays until they intersect as closely as possible.
Why is it tractable? Each observation xij depends on only ONE camera (6 params) and ONE point (3 params). The Jacobian is 99%+ zeros. This sparsity is the key to everything.
Worked example: 100 cameras × 6 = 600 camera params. 10K points × 3 = 30K point params. Full system: 30,600 × 30,600. After Schur complement: 600 × 600. That's 2,600× smaller.
Memory comparison: The full Hessian for 100 cameras + 10K points = 30,600² × 8 bytes = ~7.5 GB. The Schur-reduced camera system = 600² × 8 bytes = 2.9 MB. That's a 2,500× reduction. For a city-scale reconstruction (10K cameras + 10M points), the full system would be 30B entries — 240 GB. Schur-reduced: 60K² = 28.8 GB. Still large, so you'd use iterative solvers (conjugate gradient on the Schur complement) instead of direct factorization.
The Jacobian structure in detail: Each observation contributes a 2×9 Jacobian block (2 pixel residuals, differentiated w.r.t. 6 camera params + 3 point params). For 1M observations, J is 2M × 30,600 but with only 2M × 9 = 18M nonzeros out of 61.2B entries. That's 99.97% sparse. The block structure means JTJ has camera-camera blocks (dense 6×6), point-point blocks (3×3, independent), and camera-point cross-terms (sparse). The Schur complement exploits that the point-point block V is block-diagonal: V-1 costs O(Npoints) not O(Npoints3).
The optimizer: Levenberg-Marquardt blends Gauss-Newton (fast near minimum; approximates Hessian as JTJ) with gradient descent (safe far from minimum). Damping λ controls: large λ → gradient descent, small λ → Gauss-Newton. LM adapts λ automatically: decrease if step reduces cost, increase if it doesn't.
LM step-by-step: (1) Compute Jacobian J and residual vector r. (2) Form normal equations: (JTJ + λI)Δ = -JTr. (3) Solve for Δ (the parameter update). (4) Compute new cost with trial parameters x + Δ. (5) If cost decreased: accept step, decrease λ by factor 10. (6) If cost increased: reject step, increase λ by factor 10, retry. The initial λ matters: too small → Gauss-Newton step may diverge if far from minimum. Too large → tiny gradient descent steps, slow convergence. Rule of thumb: λ0 = 10-3 × max(diag(JTJ)).
Convergence criteria: Stop when: (1) relative cost change < 10-6, or (2) step size ||Δ|| < 10-8, or (3) max iterations reached (typically 50-100). In practice, BA converges in 10-30 LM iterations for a well-initialized problem. Poorly initialized: may need 100+ or diverge entirely.
When to run BA: In incremental SfM: after every N images (N=5–10) to prevent error accumulation, plus a full global BA at the end. In SLAM: local BA on a sliding window of keyframes (real-time, ~10ms for 10 keyframes), plus global BA on loop closure (can take seconds, runs in background thread).
Local vs global BA in SLAM. Local BA: optimize only the K most recent keyframes and all 3D points they observe. Fix all other keyframes (treat their poses as constants). This is fast (O(K)) and prevents distant keyframes from being affected by local noise. Global BA: optimize everything. Needed after loop closure because the correction must propagate to all poses. ORB-SLAM3's strategy: local BA after every keyframe insertion (10 keyframes, ~50ms). Global BA only on loop closure (runs in a dedicated thread, doesn't block tracking). If global BA changes poses by more than a threshold, merge the correction into the tracking thread's local map.
Gauge freedom. The BA cost function is invariant to a 7-DOF similarity transform (3 rotation + 3 translation + 1 scale). This means the solution is underdetermined — infinitely many camera+point configurations give the same reprojection error. Fix gauge by: (1) freezing the first camera at the origin with identity rotation, (2) fixing the scale by setting the baseline between cameras 1 and 2 to 1.0. Without gauge fixing, the Hessian is rank-deficient and the solver either fails or produces an arbitrary solution. Ceres handles this automatically if you mark parameters as constant.
Memory footprint: The reduced camera system after Schur complement is Nc×6 × Nc×6. For 1000 cameras: 6000×6000 × 8 bytes = ~288MB. Fits in RAM easily. The full unreduced system would be 3M unknowns — the Hessian would be 3M×3M = 72TB. Obviously impossible without the Schur trick.
Solver choice matters. Ceres (Google): the gold standard, supports automatic differentiation, multiple linear solvers, robust losses. Automatic differentiation means you write the cost function once and Ceres computes Jacobians for you — no manual derivative computation. g2o: graph optimization library, commonly used in SLAM backends (ORB-SLAM, LSD-SLAM). Lightweight, fast, but less flexible than Ceres. GTSAM (Georgia Tech): factor graph framework, good for incremental updates (iSAM2 algorithm adds new factors without resolving from scratch — ideal for real-time SLAM where you add one keyframe at a time). For large-scale BA (>10K cameras), use Ceres with SPARSE_SCHUR or ITERATIVE_SCHUR solver type. For real-time SLAM BA, use g2o or GTSAM with iSAM2.
Automatic differentiation vs analytic Jacobians. Ceres' autodiff is convenient (write cost function, get derivatives for free) but 2-3x slower than hand-coded analytic Jacobians. For production SLAM at 30Hz, the difference matters. ORB-SLAM3 uses analytic Jacobians for all BA cost functions. For prototyping or offline SfM, autodiff is fine. The Jacobian of the reprojection function w.r.t. camera parameters (6 DOF) and point position (3 DOF) is a 2×9 matrix — not that hard to derive by hand, and doing so deepens your understanding of the geometry.
Robust loss functions handle outlier observations that would otherwise pull the solution. Huber loss: quadratic for small residuals, linear for large (reduces outlier influence). Cauchy loss: even more aggressive downweighting. In practice, Cauchy with threshold = 2× median residual works well. Always start with L2 (square loss) for the first few iterations to get close, then switch to robust for refinement.
python import torch def rodrigues(rvec): """Rodrigues rotation: (3,) angle-axis → (3,3) matrix.""" theta = torch.norm(rvec) if theta < 1e-8: return torch.eye(3) k = rvec / theta K = torch.tensor([[0,-k[2],k[1]],[k[2],0,-k[0]],[-k[1],k[0],0]]) return torch.eye(3) + torch.sin(theta)*K + (1-torch.cos(theta))*(K@K) def ba_cost(cameras, points, observations, K): """cameras: (Nc,6), points: (Np,3), obs: list of (ci,pi,u,v). Returns scalar reprojection error.""" total = 0.0 for ci, pi, u, v in observations: R = rodrigues(cameras[ci, :3]) t = cameras[ci, 3:] Xc = R @ points[pi] + t proj = K[:2,:2] @ (Xc[:2]/Xc[2]) + K[:2,2] total += torch.sum((proj - torch.tensor([u,v]))**2) return total / len(observations)
| Symptom | Diagnosis | Fix |
|---|---|---|
| BA diverges (cost increases) | Bad initialization or λ too small | Increase initial λ; check for NaN in Jacobian; verify initial poses |
| BA converges to wrong solution | Outlier observations pulling the solution | Use Huber or Cauchy robust loss; pre-filter with RANSAC |
| BA is too slow | Not using Schur complement | Use Ceres Solver with SPARSE_SCHUR; check that sparsity pattern is correct |
| Points fly to infinity after BA | Under-constrained points (seen from <2 cameras) | Remove points with <3 observations; fix gauge freedom |
Differentiable BA in neural pipelines: DROID-SLAM (Teed & Deng, NeurIPS 2021) runs BA as a differentiable layer, backpropagating through the optimization. End-to-end learned alternatives: DUSt3R doesn't need BA at all — the network directly predicts globally consistent 3D. Distributed BA: Theia runs Schur-complement BA across multiple machines for city-scale reconstructions.
Cameras (teal) and 3D points (yellow) from above. "Perturb" adds noise. "Run BA" optimizes. Watch error converge. Toggle robust loss to handle outliers.
Visual Odometry (VO) estimates camera motion frame-by-frame from video. Think of it as a visual speedometer — chain relative poses to build a trajectory. SLAM goes further: build a persistent map and use it to correct drift when revisiting places.
Feature-based VO: detect features → match to previous frame → estimate E → decompose to R, t. Robust but ignores textureless regions. Direct VO (LSD-SLAM, DSO): minimize photometric error directly — pixel intensity difference after warping by estimated motion. Uses all gradient information.
Scale drift in monocular SLAM, quantified. Without any scale constraint, monocular SLAM's scale estimate drifts by 1-5% per 100m of travel. Over 1km, that's 10-50m of accumulated scale error — the map could be 10% too large or too small. This means distances to objects are wrong by 10%, which is unacceptable for manipulation (need <1cm accuracy) and marginal for navigation (need <1m accuracy at 10m range). Solutions: (1) VIO (IMU provides scale, drift <0.1% per 100m). (2) Stereo (known baseline, no scale drift at all). (3) Loop closure at places with known geometry (re-anchor scale).
SLAM = VO + map + loop closure. Camera poses are nodes in a pose graph. Odometry constraints are edges between consecutive nodes (relative pose from VO). Loop closures are edges between non-consecutive nodes (relative pose from place recognition + geometric verification). When you revisit a place (detected by DBoW2 vocabulary tree or NetVLAD global descriptor), verify geometrically (estimate relative pose from feature matches, check consistency), add the loop closure edge, and optimize the entire pose graph. The optimization distributes the accumulated drift correction across all poses in the loop. This kills drift.
How loop closure detection works in ORB-SLAM3: Each keyframe is converted to a bag-of-words vector using DBoW2 (a vocabulary tree trained offline on a large feature dataset). To check for loop closure: query the database for the keyframes most similar to the current one, excluding recent keyframes (which are similar by construction, not by revisiting). If a candidate has similarity score > threshold AND passes geometric verification (at least 20 inlier matches with consistent relative pose), declare a loop closure. The temporal consistency check requires 3 consecutive keyframes to all match the same candidate region, preventing false positives from repetitive textures.
Visual-Inertial fusion: IMU provides high-frequency rotation + acceleration (pre-integrated between keyframes). Camera provides low-frequency but drift-free bearing measurements. Complementary: IMU handles fast motion and gives scale; camera prevents IMU drift.
IMU pre-integration (Lupton & Susskind, 2012; Forster et al., TRO 2017): Between two keyframes (e.g., 0.1s apart at 30fps), the IMU produces 100+ measurements (at 1000Hz). Naively, you'd integrate all 100 accelerometer+gyroscope readings into a single relative pose constraint. But if BA changes a keyframe's pose, you'd have to re-integrate all 100 measurements. Pre-integration avoids this by computing the relative motion in the body frame (independent of absolute pose), then rotating into the world frame at optimization time. The pre-integrated measurement ΔR, Δv, Δp becomes a factor in the pose graph, with a covariance propagated from IMU noise specs.
IMU initialization. The IMU has 15 unknowns: 3 gravity direction, 3 velocity, 3 accelerometer bias, 3 gyroscope bias, and scale (for monocular). You need ~10s of diverse motion (not just straight-line, need rotation) to observe all modes. ORB-SLAM3 runs a parallel IMU initialization thread that collects keyframe poses from visual-only SLAM, then jointly solves for IMU biases + gravity + scale. If initialization fails (robot is stationary or moving in a straight line), VIO falls back to visual-only until sufficient motion is observed.
SLAM system architecture: Three threads running concurrently. (1) Tracking thread (real-time, 30Hz): estimate current pose by matching features against the local map. Must complete in <33ms. Use motion model prediction to narrow search. (2) Local mapping thread (~5Hz): triangulate new points from recent keyframes, run local BA on a sliding window of 10–20 keyframes. Can take 200ms. (3) Loop closure thread (background, ~1Hz): query place recognition database, verify candidates geometrically, optimize pose graph.
Keyframe selection is critical. Too many: wasted compute, tiny baselines (noisy triangulation). Too few: large appearance changes (tracking fails). Heuristic: new keyframe when parallax > threshold OR tracked features < minimum. ORB-SLAM3's criteria: (1) at least 20 frames since last keyframe, (2) local mapping is idle (not still processing previous keyframe), (3) current frame tracks fewer than 90% of the points in the reference keyframe (significant viewpoint change), and (4) current frame tracks more than 15 points (not lost). These heuristics prevent keyframe spam in static scenes while ensuring coverage during motion.
Covisibility graph. Two keyframes are "covisible" if they share at least 15 observed 3D points. This graph encodes which keyframes see similar parts of the scene. Local BA operates on the covisibility neighborhood of the current keyframe (typically 10-20 keyframes). Loop closure candidates are found among keyframes that are NOT covisible with the current one (you've been here before, but not recently). The covisibility graph is the backbone of ORB-SLAM's architecture.
Map management: The map grows over time. For a robot operating 8 hours/day, you might accumulate 50,000 keyframes. Strategies: cull redundant keyframes (too similar to neighbors), prune low-quality points (high reprojection error), merge duplicate landmarks (same 3D point observed from different keyframes). ORB-SLAM3's "atlas" allows multiple disconnected maps that can be merged later.
| System | Type | Input | Key Feature |
|---|---|---|---|
| ORB-SLAM3 | Feature-based | Mono/Stereo/VI | Multi-map, IMU init, atlas |
| DROID-SLAM | Learned dense | Mono/Stereo | Differentiable BA, dense flow |
| Kimera | VI + semantic | Stereo + IMU | 3D mesh + semantic labels |
| DSO | Direct sparse | Mono | Photometric BA, no features |
Simplified pose graph optimization: 4 poses in a loop with noisy odometry and one loop closure constraint. The key insight: odometry edges encode relative poses between consecutive frames (noisy). A loop closure edge says "pose 0 and pose 3 are actually the same place." By minimizing the sum of squared edge errors, the optimizer distributes the accumulated drift evenly across all poses.
Worked example: Robot moves in a square: right 10m, up 10m, left 10m, down 10m (back to start). But odometry has 2% drift, so after 4 legs it thinks it's at (0.8, -0.6) instead of (0,0). Loop closure detects "I've been here before" and creates a constraint: pose_4 - pose_0 = (0,0). The optimizer spreads the (0.8, -0.6) error across all 4 edges: each pose shifts by ~(0.2, -0.15).
python import numpy as np def pose_graph_optimize(poses, edges, n_iter=20, lr=0.3): """poses: (N,2) initial xy positions. edges: list of (i, j, dx, dy) constraints. Simple gradient descent on sum of squared edge errors.""" poses = poses.copy().astype('float64') for _ in range(n_iter): grad = np.zeros_like(poses) for i, j, dx, dy in edges: err_x = (poses[j,0] - poses[i,0]) - dx err_y = (poses[j,1] - poses[i,1]) - dy grad[i,0] -= err_x; grad[i,1] -= err_y grad[j,0] += err_x; grad[j,1] += err_y poses[1:] -= lr * grad[1:] # fix first pose return poses
| Symptom | Diagnosis | Fix |
|---|---|---|
| Trajectory drifts over time | No loop closures or VO has bias | Add loop closure detection; fuse IMU for scale; check feature quality |
| Sudden jumps in trajectory | False loop closure | Add geometric verification; increase place recognition threshold |
| Tracking lost frequently | Fast motion or motion blur | Higher framerate; global shutter; use IMU for prediction; reduce exposure time |
| Map inconsistent after loop closure | Pose graph has conflicting constraints | Use robust kernel in graph optimization; check edge covariances |
DROID-SLAM (Teed & Deng, NeurIPS 2021): replaces hand-crafted feature matching with a learned dense correlation layer, and runs bundle adjustment as a differentiable layer. Key innovation: all-pairs correlation between features in the current frame and keyframe features, followed by a GRU that predicts dense correspondence updates. The BA layer backpropagates through the optimization, so the feature extractor learns to produce features that are easy to optimize. Result: state-of-the-art accuracy on TartanAir, EuRoC, and TUM, with significantly fewer failure cases than ORB-SLAM3.
Gaussian Splatting SLAM (MonoGS, SplaTAM, Photo-SLAM, 2024): represent the map as 3D Gaussians, run SLAM within the Gaussian framework. The idea: instead of maintaining a sparse point cloud map, maintain a dense Gaussian cloud that can be rendered from any viewpoint. Tracking: render the expected view from the current pose estimate, compare to the actual image, optimize pose to minimize photometric + geometric error. Mapping: add new Gaussians for under-represented regions, prune bad ones. The map IS a renderable 3D model. This unifies SLAM and scene reconstruction into one system.
Foundation-model relocalization: NetVLAD → AnyLoc → CricaVPR (2024) — learn universal place descriptors across domains (indoor, outdoor, underwater, aerial). AnyLoc uses DINOv2 features aggregated with VLAD, achieving cross-domain place recognition without fine-tuning. CricaVPR adds cross-image correlation for better distinctiveness. These models enable SLAM loop closure in environments where BoW-based methods (DBoW2) fail due to domain shift.
Robot traces a loop. Trajectory (orange) drifts from ground truth (teal). After completing the loop, "Detect Loop" triggers pose graph optimization.
Intrinsic calibration (Zhang's method, TPAMI 2000): show a checkerboard from 10–20 angles. Detect corners. Solve for K and distortion coefficients that minimize reprojection error across all views. Key constraint: different viewing angles break the focal-length / principal-point ambiguity.
Zhang's method step by step: (1) For each checkerboard image, detect all inner corners (e.g., 9×6 = 54 corners). Their 3D positions are known (grid pattern, Z=0 plane). (2) Compute homography Hi from the 3D plane to image i using DLT. (3) From ≥3 homographies, solve for K using the constraint h1TK-TK-1h2 = 0 (orthogonality of column vectors). This gives fx, fy, cx, cy. (4) Extract Ri, ti from each Hi using K. (5) Nonlinear refinement: minimize reprojection error over all corners in all views, jointly optimizing K, distortion coefficients, and all Ri, ti.
Practical calibration tips: (1) Use at least 15 images. (2) Tilt the board at diverse angles (not just fronto-parallel). (3) Cover the entire image, especially corners and edges (that's where distortion is strongest). (4) Keep the board stationary during capture (motion blur ruins corner detection). (5) Check reprojection error: <0.3px = excellent, 0.3-0.8px = acceptable, >1px = recapture. (6) Use a large board (A2 or bigger) for better accuracy.
Extrinsic calibration finds the rigid transform between sensors. Methods: (1) Shared checkerboard — both cameras see the same board simultaneously. Both cameras detect the board's corners, giving two sets of 3D-2D correspondences. Solve for each camera's pose relative to the board, then compute the relative transform between cameras: T12 = Tboard←cam2 · Tcam1←board. (2) Hand-eye calibration (AX = XB): camera on robot arm, A = arm motion (from FK), B = observed camera motion (from visual pose estimation), X = unknown camera-to-hand transform. At least 2 non-parallel motions needed; 10+ give a robust estimate. (3) Target-less from natural scene correspondences — match features across cameras viewing the same scene, estimate the relative pose from correspondences. Less accurate than target-based but doesn't require physical targets.
Stereo rectification is a special case of extrinsic calibration for stereo pairs. After calibrating the relative pose between two cameras, compute rectification homographies that make epipolar lines horizontal and aligned between images. This transforms stereo matching from a 2D search to a 1D search along each row, enabling efficient block matching. OpenCV's `stereoRectify` computes these homographies from K1, K2, R, t. The quality check: rectified images should have perfectly horizontal epipolar lines. Verify by drawing horizontal lines across the rectified pair — corresponding features should align.
Temporal synchronization is critical and often overlooked. Cameras must fire simultaneously or motion creates alignment errors. Hardware trigger: sub-microsecond sync. Software timestamp matching: 1–5ms typical. At 1 m/s robot speed, 5ms desync = 5mm error.
Synchronization methods ranked by accuracy: (1) Hardware trigger (GPIO pulse from microcontroller): <1μs sync. Gold standard. (2) PTP (Precision Time Protocol): <100μs over Ethernet. Good for multi-camera rigs without a dedicated trigger line. (3) NTP (Network Time Protocol): ~1ms. Acceptable for coarse alignment only. (4) Software timestamps: 1-10ms depending on OS scheduling jitter. Never good enough for fast-moving robots.
Time offset estimation: If you can't get hardware sync, estimate the time offset between cameras from data. Kalibr does this by fitting continuous-time B-spline trajectories to IMU data and finding the temporal offset that minimizes reprojection error. Typical result: offset = 3.2ms ± 0.1ms between USB camera and IMU. You then shift timestamps by this constant offset before fusion.
A typical robot sensor rig: 2–6 cameras + IMU + optional LiDAR. Coordinate frame conventions: define one "body frame" (usually IMU), express all sensors in body frame coordinates.
python def hand_eye_AX_XB(A_list, B_list): """Solve AX = XB for hand-eye calibration. A_list: list of (4,4) transforms (robot motions). B_list: list of (4,4) transforms (camera motions). Returns X: (4,4) camera-to-hand transform.""" # Tsai-Lenz method (rotation-first) # 1. Solve for rotation: log(R_A) x = log(R_B) # Collect rotation axes M = np.zeros((3, 3)) for A, B in zip(A_list, B_list): alpha = rotation_to_axis_angle(A[:3,:3]) # (3,) beta = rotation_to_axis_angle(B[:3,:3]) M += np.outer(beta, alpha) # SVD to find best rotation U, _, Vt = np.linalg.svd(M) R_X = U @ Vt if np.linalg.det(R_X) < 0: R_X = -R_X # 2. Solve for translation A_stack, b_stack = [], [] for A, B in zip(A_list, B_list): A_stack.append(A[:3,:3] - np.eye(3)) b_stack.append(R_X @ B[:3,3] - A[:3,3]) t_X = np.linalg.lstsq(np.vstack(A_stack), np.hstack(b_stack), rcond=None)[0] X = np.eye(4) X[:3,:3] = R_X; X[:3,3] = t_X return X
| Symptom | Diagnosis | Fix |
|---|---|---|
| Stereo doesn't match after calibration | Rectification error; extrinsics drifted | Recalibrate extrinsics; check for mount vibration |
| IMU and camera disagree | Time sync offset | Check trigger latency; use Kalibr for time offset estimation |
| Calibration works in lab, fails on robot | Vibration/thermal drift/cable stress changing camera pose | Use rigid mounts; run online self-calibration; monitor reprojection error |
| Rolling shutter wobble | CMOS row-sequential exposure + fast motion | Use global shutter cameras; or model rolling shutter in BA |
Kalibr (ETH, Furgale et al.): the gold standard for camera-IMU calibration. Uses continuous-time B-spline trajectories to represent the motion, which allows: (1) estimating the time offset between camera and IMU (typically 1-10ms), (2) handling rolling shutter by modeling per-row timestamps, (3) jointly optimizing camera intrinsics, extrinsics, and IMU biases. Requires an aprilgrid target and ~60 seconds of diverse hand-held motion. Output: camera-IMU transform (SE(3)), time offset (scalar), IMU noise parameters (accelerometer + gyro random walk and white noise).
Online self-calibration: recent work on continuously refining intrinsics/extrinsics during SLAM operation. The idea: monitor reprojection residuals in a sliding window. If the median residual exceeds a baseline (e.g., 2x the factory calibration residual), trigger an automatic recalibration using the current keyframe set. This catches thermal drift, mount shifts, and gradual lens degradation without human intervention.
Neural extrinsic calibration: learning camera-to-camera transforms from visual data without checkerboards — train a network on paired images to predict the relative SE(3) transform. Infrastructure-free calibration from natural scene correspondences (no targets needed, just drive around). The trend is toward fully automatic, continuous calibration that never requires human intervention. The challenge: neural methods give ~1-2px accuracy, but production systems need <0.5px. Hybrid approaches (neural initialization + geometric refinement) bridge this gap.
A note on fisheye and omnidirectional cameras: The pinhole + Brown-Conrady model fails for FOV > 120°. Fisheye models (equidistant, stereographic, Kannala-Brandt) handle 180°+ FOV. Modern robots increasingly use fisheye cameras for maximum coverage — a single 190° fisheye replaces three narrow-FOV cameras. The calibration tools (Kalibr, OpenCV fisheye module) support these models.
Adjust radial distortion k1. Positive = barrel, negative = pincushion. Gray grid = undistorted, teal = distorted.
SfM gives sparse point clouds. For robotics, we need dense 3D — every surface represented. Three paradigms:
Multi-View Stereo (MVS): For each pixel in a reference image, search along epipolar lines in neighbors for the photometric match. Output: dense depth map per view. Fuse into a point cloud, run Poisson reconstruction for a mesh. No training needed. Struggles with textureless surfaces.
NeRF (Mildenhall et al., ECCV 2020): represent the scene as a continuous function F(x,y,z,θ,φ) → (color, density). Train an MLP per scene on photometric loss via volume rendering:
Ti = transmittance (how much light hasn't been absorbed). αi = opacity from density σi and step size Δi. ci = color at sample i. Beautiful results, but slow (hours to train, seconds to render).
3D Gaussian Splatting (3DGS) (Kerbl et al., SIGGRAPH 2023): scene as millions of 3D Gaussians, each with position, covariance (shape), color (spherical harmonics), opacity. Rendering: sort by depth, project each Gaussian as 2D ellipse, alpha-blend front-to-back. Differentiable rasterization — no ray marching. 100× faster to render than NeRF. Real-time at 1080p.
Per-Gaussian parameters: position μ (3), covariance Σ (6, stored as rotation quaternion q (4) + scale s (3) = 7 params, Σ = RSSTRT), color via spherical harmonics (16 coefficients per channel × 3 = 48 for order 3, or 3 for order 0), opacity α (1). Total: ~60 params per Gaussian. A typical scene has 500K–5M Gaussians. At 60 × 4 bytes = 240 bytes/Gaussian, a 1M-Gaussian scene is ~240 MB.
The rendering pipeline in detail: (1) Frustum culling: discard Gaussians outside the camera view. (2) Project each 3D Gaussian to a 2D Gaussian on the image plane via the local affine approximation (Jacobian of the projection). (3) Tile-based sorting: divide the image into 16×16 tiles, assign each Gaussian to tiles it overlaps, sort per-tile by depth. (4) Alpha compositing: for each pixel, blend Gaussians front-to-back: C = ∑ ci αi Ti where Ti = ∏j<i(1-αj). Stop when transmittance drops below 1/255. All steps are differentiable.
Input requirements compared:
| Method | Input | Preprocessing | Min images | GPU required? |
|---|---|---|---|---|
| MVS | Images + poses + K | COLMAP SfM | 3+ | Optional (CPU ok) |
| NeRF | Images + poses + K | COLMAP SfM | 20+ | Yes (training) |
| 3DGS | Images + poses + K + sparse points | COLMAP SfM | 20+ | Yes (training + rendering) |
| DUSt3R → 3DGS | Images only | None | 2+ | Yes |
Note: DUSt3R + 3DGS skips COLMAP entirely — DUSt3R provides poses and initial points, 3DGS trains on them. This is the fastest path from images to renderable 3D, but accuracy is lower than COLMAP-initialized 3DGS.
| Method | Representation | Train time | Render | Quality | Edit? |
|---|---|---|---|---|---|
| MVS | Explicit mesh | None | Real-time | Good | Easy |
| NeRF | Implicit MLP | Hours | ~5s/frame | Excellent | Hard |
| 3DGS | Explicit Gaussians | ~10min | Real-time | Excellent | Easy |
3DGS training procedure: (1) Initialize Gaussians at COLMAP sparse points (typically 10K-100K). Each gets position from 3D point, color from the point's observed color, scale = mean distance to 3 nearest neighbors, rotation = identity, opacity = inverse sigmoid(0.1). (2) Optimization loop (30K iterations, ~10min on RTX 4090): sample a random training view, rasterize all visible Gaussians, compute L1 + D-SSIM loss against ground truth image, backpropagate through the differentiable rasterizer. (3) Densification (every 100 iterations for first 15K): if a Gaussian has large view-space gradient (>0.0002), clone it (if small) or split it (if large). This adaptively adds detail. (4) Pruning (every 100 iterations): remove Gaussians with opacity <0.005 or that are too large (world-space size >scene extent/20). (5) Opacity reset (every 3000 iterations): set all opacities to 0.01, forcing re-learning — this kills floaters.
3DGS data flow:
python def volume_render(densities, colors, deltas): """NeRF volume rendering equation. densities: (N,) sigma values along ray. colors: (N,3) RGB at each sample. deltas: (N,) step sizes. Returns: (3,) final pixel color.""" alphas = 1.0 - np.exp(-densities * deltas) # (N,) transmittance = np.cumprod( np.concatenate([[1.0], 1.0 - alphas[:-1]]) # T_i = prod(1-alpha_j) for j < i ) weights = transmittance * alphas # (N,) color = (weights[:, None] * colors).sum(axis=0) # (3,) return color # 3DGS: project a single Gaussian to 2D def splat_gaussian(mean_3d, cov_3d, R_cam, t_cam, K): """Project 3D Gaussian to 2D image-space Gaussian.""" mean_cam = R_cam @ mean_3d + t_cam # (3,) in camera frame z = mean_cam[2] J = np.array([[K[0,0]/z, 0, -K[0,0]*mean_cam[0]/z**2], [0, K[1,1]/z, -K[1,1]*mean_cam[1]/z**2]]) # (2,3) Jacobian cov_cam = R_cam @ cov_3d @ R_cam.T cov_2d = J @ cov_cam @ J.T # (2,2) projected covariance mean_2d = K[:2,:2] @ (mean_cam[:2]/z) + K[:2,2] return mean_2d, cov_2d
| Symptom | Diagnosis | Fix |
|---|---|---|
| NeRF has artifacts on reflections | Specular surfaces violate Lambertian assumption | Use Ref-NeRF (separate diffuse/specular) or mask reflective regions |
| 3DGS has "floaters" in air | Insufficient pruning; noisy initialization | Enable opacity reset every N iterations; increase pruning threshold |
| MVS has holes in textureless areas | Photometric matching fails without texture | Use PatchMatch-based MVS; add regularization; fill with depth completion network |
| 3DGS training diverges | Learning rate too high; bad SfM initialization | Reduce position LR; check COLMAP output for outlier points |
| 3DGS renders are blurry | Under-densification; not enough Gaussians | Lower split/clone gradient threshold; increase densification iterations |
4DGS (dynamic scenes with deformation fields — model time-varying Gaussians for moving objects). 2DGS (2D surfel Gaussians for better surface quality — flat discs instead of 3D ellipsoids give cleaner geometry). SplaTAM / MonoGS (SLAM with 3DGS as the map representation — track camera AND update Gaussians simultaneously). Zip-NeRF (anti-aliased hash-grid NeRF, training in minutes). LRM / Zero123++ (single-image 3D — feed one photo, get a 3D model via large reconstruction model).
The integration with robotics is the frontier. Gaussian Splatting SLAM means your robot's map IS a Gaussian cloud — you can render novel views for planning, check collisions against Gaussians, segment objects by clustering Gaussians, and track deformable objects as Gaussian motion fields. This unifies mapping, rendering, and scene understanding into one representation.
Three methods reconstruct the same scene. MVS scans depth columns. NeRF accumulates density along rays. 3DGS splats Gaussians. Slide iterations to watch quality improve. "Add Noise" to compare robustness.
Optical flow estimates dense 2D motion: for each pixel in frame t, where did it move in frame t+1? Classical: Lucas-Kanade (sparse, 5×5 window, least squares on brightness constancy I(x,y,t) = I(x+dx,y+dy,t+dt)). Horn-Schunck (dense, adds smoothness). Both fail with occlusion, large displacements, illumination changes.
RAFT (Teed & Deng, ECCV 2020): builds a 4D correlation volume between all feature pairs in two frames, then iteratively refines flow via a GRU that indexes into this volume. 12–32 refinement iterations — the network learns "gradient descent" through correlation space. The architecture: (1) shared feature encoder (CNN, 1/8 resolution) extracts features from both frames. (2) All-pairs correlation: compute dot product between every pair of feature vectors → 4D tensor (H/8 × W/8 × H/8 × W/8). (3) Correlation pyramid: pool at multiple scales for large displacements. (4) Update GRU: at each iteration, look up correlations around the current flow estimate, combine with context features, output flow delta. The iterative refinement is key: more iterations = better flow, and you can trade compute for accuracy at test time.
Long-range tracking extends beyond consecutive frames. CoTracker (Karaev et al., NeurIPS 2023) tracks multiple points jointly with a transformer, exploiting correlations between nearby tracks. Key insight: if point A moves right, nearby point B probably does too.
TAPIR (Doersch et al., ICCV 2023): Tracking Any Point with per-frame Initialization and temporal Refinement. Two stages: (1) match each query point independently to every frame (coarse, handles occlusion by checking all frames). (2) Temporal refinement: smooth tracks over time, predict visibility score. The key advantage over CoTracker: TAPIR handles very long videos (1000+ frames) because the per-frame matching is independent. CoTracker's joint attention has quadratic cost in sequence length. For short manipulation videos (50-200 frames), CoTracker is better. For long autonomous driving sequences, TAPIR wins.
Scene flow = 3D optical flow. For each point, estimate its 3D motion vector. Requires depth + 2D flow. Critical for dynamic scene understanding. Given 2D flow (dx, dy) and depth maps D1, D2, the 3D motion is: unproject pixel (x,y) at depth D1(x,y) to get point P1. Unproject (x+dx, y+dy) at depth D2(x+dx, y+dy) to get P2. Scene flow = P2 - P1. Noise in depth estimation dominates scene flow quality. Use stereo depth (<1% error) over monocular (>5% error) when possible.
Where tracking fits in robotics:
| Application | What to track | Tracker choice | Accuracy need |
|---|---|---|---|
| Contact point tracking | Gripper-object contact | CoTracker (joint, 5-10 pts) | <2px at 30fps |
| Object state estimation | Object keypoints | CoTracker or TAPIR | <5px, through occlusion |
| Data annotation | Labeled keypoints | TAPIR (long-range) | <3px, 100+ frames |
| World model training | Dense motion vectors | RAFT (dense flow) | Subpixel, every pixel |
| Sim-to-real alignment | ArUco markers / features | LK flow or TAPIR | <1px, metric |
| Deformable object tracking | Surface points on cloth/rope | CoTracker (50+ pts) | <5px, handle topology change |
Tracking through occlusion is the hard problem. Independent trackers lose the point when it goes behind another object. They have no way to predict where it will reappear. CoTracker uses neighboring tracks to interpolate: if 3 of 5 gripper points are visible, infer the other 2 from the group's motion pattern. TAPIR adds explicit occlusion prediction: the network outputs a visibility score alongside the position. SpatialTracker (2024) reasons in 3D to handle depth ordering — it lifts 2D tracks to 3D using monocular depth, then re-projects.
Tracking through occlusion is the hard problem. Independent trackers lose the point when it goes behind another object. They have no way to predict where it will reappear. CoTracker uses neighboring tracks to interpolate: if 3 of 5 gripper points are visible, infer the other 2 from the group's motion pattern. TAPIR adds explicit occlusion prediction: the network outputs a visibility score alongside the position. SpatialTracker (2024) reasons in 3D to handle depth ordering — it lifts 2D tracks to 3D using monocular depth, then re-projects.
Performance budget: RAFT runs at ~10Hz on a mid-range GPU for 1080p flow. CoTracker tracks 50 points at ~15Hz. For real-time robotics (>30Hz), you may need lighter alternatives: sparse LK flow (~1ms), or run the neural tracker at lower frequency and interpolate between keyframes.
python def lucas_kanade(I1, I2, pts, win=5): """Sparse optical flow via Lucas-Kanade. I1, I2: (H,W) grayscale. pts: (N,2) points to track. Returns: (N,2) new positions.""" Ix = np.gradient(I1, axis=1) Iy = np.gradient(I1, axis=0) It = I2.astype('float64') - I1.astype('float64') hw = win // 2 new_pts = [] for px, py in pts: x, y = int(round(px)), int(round(py)) if y-hw < 0 or y+hw+1 > I1.shape[0] or x-hw < 0 or x+hw+1 > I1.shape[1]: new_pts.append([px, py]); continue ix = Ix[y-hw:y+hw+1, x-hw:x+hw+1].flatten() iy = Iy[y-hw:y+hw+1, x-hw:x+hw+1].flatten() it = It[y-hw:y+hw+1, x-hw:x+hw+1].flatten() A = np.stack([ix, iy], axis=1) # (win^2, 2) b = -it # (win^2,) # Least squares: (A^T A) v = A^T b ATA = A.T @ A if np.linalg.det(ATA) < 1e-6: new_pts.append([px, py]); continue v = np.linalg.solve(ATA, A.T @ b) # (2,) flow new_pts.append([px + v[0], py + v[1]]) return np.array(new_pts)
| Symptom | Diagnosis | Fix |
|---|---|---|
| Tracking drifts over long sequences | Accumulated small errors | Use long-range tracker (CoTracker, TAPIR); add periodic re-initialization |
| Lost object during occlusion | Independent tracking can't hallucinate | Use CoTracker (joint tracking) or TAPIR (occlusion-aware) |
| Scene flow is very noisy | Depth estimation noise | Smooth depth first; use stereo instead of monocular depth; temporal filtering |
CoTracker v2 (Karaev et al., 2024): faster architecture (causal sliding window instead of full attention), supports online tracking (process frames as they arrive), and better occlusion handling. SpatialTracker (Xiao et al., 2024): lifts 2D tracks to 3D using monocular depth estimation, reasons about depth ordering to handle occlusion geometrically rather than learning it. TAPIR (Doersch et al., ICCV 2023): per-frame independent matching + temporal refinement. Best for very long sequences (1000+ frames) where CoTracker's attention is too expensive. BootsTAP (2024): bootstrapped training — use easy tracks to generate pseudo-labels for hard tracks, improving robustness without manual annotation. DOT (Lemoing et al., 2024): dense optical tracking, tracks every pixel, bridging optical flow and point tracking.
The tracking-for-policy frontier. Robot learning increasingly depends on point tracking for: (1) Imitation learning data: track expert's hand and object contact points to extract manipulation primitives. (2) Reward functions: track goal regions, compute distance-to-goal as reward signal. (3) State estimation: track object keypoints as a compact object state representation. (4) Sim-to-real transfer: track markers in real video, align with simulated trajectories, measure transfer gap. The quality of tracking directly limits the quality of learned policies.
Click to place tracking points. Objects move. Watch points follow. Occluded points shown as dashed circles. Toggle "Joint" to see CoTracker-style joint reasoning.
Vocabulary trees solve matching scalability. Hierarchical k-means on descriptors builds a tree. Each image = bag-of-words histogram. Comparing histograms is O(1), so finding candidate pairs goes from O(N²) to O(N log N).
Hierarchical SfM: partition images into clusters (by GPS/time/visual similarity) → SfM per cluster → merge at boundaries through shared landmarks. The merge step is non-trivial: find 3D-3D correspondences between clusters (points seen in both), compute similarity transform (7 DOF: rotation + translation + scale), apply, then run global BA to refine. If the overlap between clusters is thin (<10 shared 3D points), the merge is fragile.
Scaling numbers that interviewers expect you to know:
| Scale | Matching strategy | SfM approach | Expected time | Output quality |
|---|---|---|---|---|
| 100 images | Exhaustive (5K pairs) | Incremental (COLMAP) | ~10 min | Sub-pixel |
| 1K images | Exhaustive or retrieval (top-50) | Incremental | ~2 hours | Sub-pixel |
| 10K images | Retrieval (top-50) | Hierarchical | ~12 hours | ~1px |
| 100K images | Retrieval (top-30) | Hierarchical + distributed | ~3 days (cluster) | ~1-2px |
| 1M images | Retrieval (top-20) | Hierarchical + distributed | ~1 week (cluster) | ~2-5px |
DUSt3R (Wang et al., CVPR 2024): feed two images into a ViT, get per-pixel 3D point maps in a shared coordinate frame. No features, no F estimation, no triangulation. A single network. Architecture: two ViT-Large encoders (shared weights) process each image, then cross-attention decoders predict per-pixel 3D coordinates and confidence maps. The output is in the coordinate frame of image 1, so point maps from multiple pairs must be aligned via global optimization. Training data: 12M image pairs from ScanNet++, CO3D, and synthetic datasets. Inference: ~200ms per pair on A100.
MASt3R (Leroy et al., ECCV 2024) extends DUSt3R by adding local feature heads: alongside 3D pointmaps, it outputs per-pixel descriptors for matching. This bridges the gap — you get both learned 3D and classical-style correspondences in one forward pass. For N images, MASt3R-SfM runs all N(N-1)/2 pairs, then aligns pointmaps globally.
VGGT (Wang et al., CVPR 2025): the end-game. Feed N images (up to ~30) in a single forward pass. Multi-image cross-attention across all views simultaneously. Outputs: camera intrinsics, extrinsics, per-pixel depth, per-pixel 3D, and camera-to-world transforms. No iteration, no optimization, no BA. The catch: quadratic memory in N (attention), so N>30 requires chunking. And accuracy still doesn't match geometric BA for precision tasks (<0.5px).
Processing 1M images end-to-end: (1) Run feature extraction on all images in parallel (GPU cluster, ~100 images/GPU/sec with SuperPoint). (2) Build global image retrieval index (NetVLAD or CosPlace). (3) For each image, retrieve top-50 candidates. (4) Match each candidate pair (LightGlue, ~30ms/pair). (5) Geometric verification (estimate F, reject pairs with <15 inliers). (6) Cluster images into subproblems by connectivity. (7) Run SfM per cluster. (8) Merge clusters through shared 3D landmarks. Total: ~24h on a cluster for 1M images.
python # Bag-of-words image retrieval sketch def build_vocab_tree(all_descriptors, k=10, levels=6): """Hierarchical k-means on feature descriptors. Returns tree structure for fast retrieval.""" from sklearn.cluster import KMeans tree = {} def build_level(descs, depth, node_id): if depth == levels or len(descs) < k: tree[node_id] = {'leaf': True, 'weight': np.log(N_images / (len(descs)+1))} return km = KMeans(n_clusters=k).fit(descs) tree[node_id] = {'centers': km.cluster_centers_} for i in range(k): mask = km.labels_ == i build_level(descs[mask], depth+1, node_id*k+i) build_level(all_descriptors, 0, 0) return tree # DUSt3R inference pseudocode # model = DUSt3R.from_pretrained('naver/DUSt3R_ViTLarge') # out = model(img1, img2) # returns pointmaps, confidence # pts1 = out['pts3d'][0] # (H, W, 3) per-pixel 3D # pts2 = out['pts3d'][1] # already in same coord frame
| Symptom | Diagnosis | Fix |
|---|---|---|
| Large-scale SfM has disconnected components | Insufficient overlap between regions | Add more retrieval candidates; capture bridging images |
| DUSt3R fails on outdoor scenes | Trained mostly on indoor data; domain gap | Fine-tune on outdoor data; fall back to classical for outdoor |
| VGGT runs out of memory on N>20 images | Quadratic attention on large sets | Process in overlapping subsets; merge with global alignment |
VGGT (CVPR 2025): feed N images, get everything in one pass. Trained on millions of video frames + synthetic data. The model learns camera intrinsics, extrinsics, depth, and 3D points as separate output heads from the same transformer backbone. Achieves competitive results with COLMAP on indoor scenes, but still behind on outdoor scenes with large viewpoint changes.
MASt3R-SfM (Duisterhof et al., 2024): combines MASt3R pairwise predictions with classical global alignment. For each pair, MASt3R predicts pointmaps + features. Then: build a pose graph from pairwise transforms, optimize globally (like classical SfM, but starting from much better initialization). Bridges the speed of learned methods with the precision of geometric optimization.
Metric3Dv2 (Hu et al., 2024): metric depth from a single image. Uses a ViT backbone with a canonical camera model to predict absolute metric depth (not just relative). Accuracy: ~5% relative error on indoor scenes. This enables monocular SLAM with metric scale — no stereo or IMU needed for coarse depth. Limitation: accuracy degrades significantly outdoors at distances >20m.
Depth Pro (Apple, 2024): fast monocular metric depth at 2K resolution. Key innovation: multi-scale ViT with a "focal length estimation" head that predicts intrinsics from the image, enabling metric depth without camera calibration. Runs at ~1fps on M2 Max for 2048×1536 input. Useful for: casual 3D capture (iPhone), approximate 3D from web images, depth supervision for 3DGS training when no LiDAR is available.
Watch classical SfM (multi-step) race against DUSt3R (single forward pass). The learned approach finishes at ~30% of classical time.
This is the capstone. You have a multi-camera robot perception pipeline. Every stage is interconnected. Break one, watch the cascade. Fix it, watch recovery. This is what debugging feels like on a real robot team.
Why cascade effects are multiplicative, not additive. Calibration error of 1px causes features to project to slightly wrong locations. This doesn't just add 1px to matching error — it causes some correct matches to fail the epipolar constraint (they become "outliers"), which reduces inlier ratio, which forces RANSAC to run more iterations, which either slows down the system or causes it to give up. A 1px calibration error can reduce matching quality by 20-30%, which reduces reconstruction completeness by 40%, which creates holes in the map, which causes the planner to take a longer path, which means more opportunities for drift. The staff engineer's insight: always debug upstream first.
Error propagation formula (rule of thumb): If calibration is off by σcalib pixels, triangulation error at depth Z is approximately σ3D ≈ Z² · σcalib / (f · B). At Z=5m, f=800, B=0.2m, σcalib=2px: σ3D = 25 × 2 / 160 = 0.31m. That's 31cm at 5 meters — enough to miss a doorway. This calculation is what you show the team when arguing for better calibration.
Adjust noise at each pipeline stage. Watch metrics propagate. "Auto-run" traces a trajectory and shows real-time quality. Click any stage to see its effect in isolation.
What to notice as you experiment:
The classic production debugging protocol:
Common cascade patterns from production:
| Root cause | First symptom | Cascade effect | How to isolate |
|---|---|---|---|
| Camera shifted 0.5mm | Reproj error 3px | Stereo fails → VO drift 5x → map corrupt | Project checkerboard, compare to stored baseline |
| Time sync cable loose | 5ms desync | Stereo matches are wrong → depth errors → phantom obstacles | Compare timestamps, look for systematic match offset |
| Lens dirty | Feature count drops 60% | Matching fails on dirty regions → incomplete reconstruction | Feature heatmap: are they concentrated in one area? |
| New environment (outdoor) | Sky saturates, features sparse | VO fails in open areas → fall back to IMU only → scale drift | Check feature distribution: top of image blank? |
| Firmware update changed exposure | Images brighter, motion blur | Features less repeatable → matching quality drops → RANSAC needs more iterations | Compare feature repeatability metric before/after |
The "classical vs learned" switch: Your pipeline has a decision point at each stage. When the classical component fails (feature count too low, RANSAC inlier ratio below threshold, VO drift exceeding budget), you can dynamically switch to a learned alternative. This hybrid approach is what production systems actually use. Pure classical is fragile. Pure learned is unpredictable. The combination gives reliability with graceful degradation.
Monitoring and alerting in production: A staff engineer doesn't just build the pipeline — they instrument it. Key metrics to track continuously:
| Metric | Normal range | Alert threshold | Action |
|---|---|---|---|
| Reprojection error (px) | < 0.5 | > 1.5 | Trigger recalibration |
| Feature count per frame | > 200 | < 80 | Check lens, exposure, scene |
| Match inlier ratio | > 60% | < 30% | Tighten ratio test, switch matcher |
| VO relative pose error | < 0.5°/frame | > 2°/frame | Fall back to IMU-only, flag for review |
| Loop closure frequency | 1 per loop | 0 after full loop | Check place recognition, verify map |
| Gaussian count (3DGS) | 100K-2M | > 5M or < 10K | Check densification params, input quality |
Graceful degradation hierarchy: When a component fails, the system doesn't crash — it falls back. Full pipeline working: stereo VIO + 3DGS mapping + CoTracker. Feature extraction fails: switch to direct VO (photometric). Stereo matching fails: fall back to mono + IMU (lose metric reconstruction). IMU fails: mono VO only (lose scale, accept drift). Everything fails: dead reckoning from last known pose + wheel odometry. A staff engineer designs all these fallback paths before the first failure happens.
| Concept | 30-sec explanation | Key equation | Tool | Classic paper | Modern paper |
|---|---|---|---|---|---|
| Pinhole camera | 3D → 2D via perspective divide + intrinsics K | p = K[R|t]X | OpenCV | Hartley&Zisserman 2003 | Self-calibration via DUSt3R |
| Epipolar constraint | Matching point lies on a line in other image | x2TEx1=0 | OpenCV | Longuet-Higgins 1981 | OANet (CVPR 2019) |
| Feature matching | Detect keypoints, compute descriptors, match | ratio test < 0.75 | hloc | SIFT (Lowe 2004) | LightGlue (ICCV 2023) |
| SfM | Images → camera poses + sparse 3D points | PnP + BA | COLMAP | COLMAP (CVPR 2016) | DUSt3R (CVPR 2024) |
| Bundle adjustment | Minimize reprojection error over all views | min ∑ ‖x-π(C,X)‖² | Ceres | Triggs et al. 2000 | DROID-SLAM (2021) |
| SLAM | VO + map + loop closure | pose graph optimization | ORB-SLAM3 | ORB-SLAM (TRO 2015) | Gaussian SLAM (2024) |
| NeRF | Scene as implicit MLP, train via volume rendering | C=∑Tiαici | Nerfstudio | NeRF (ECCV 2020) | Zip-NeRF (2023) |
| 3DGS | Scene as explicit Gaussians, differentiable rasterization | splat + alpha blend | gsplat | 3DGS (SIGGRAPH 2023) | 2DGS / 4DGS (2024) |
| Optical flow | Per-pixel 2D motion between consecutive frames | I(x,t) = I(x+dx,t+dt) | RAFT | Horn-Schunck 1981 | RAFT (ECCV 2020) |
| Point tracking | Track any point across many frames | correlation + GRU | CoTracker | KLT (1981) | CoTracker v2 (2024) |
X_cam = R @ X + t; x_n = X_cam[:2] / X_cam[2]; px = K[:2,:2] @ x_n + K[:2,2]
| Task | Classical | Modern (Learned) | When to use which |
|---|---|---|---|
| Features + matching | SIFT + brute-force | SuperPoint + LightGlue | Learned for accuracy; classical for no-GPU or real-time on CPU |
| Relative pose | 5-pt + RANSAC | DUSt3R pointmaps | Classical for sub-pixel precision; learned for speed and robustness |
| SfM | COLMAP | DUSt3R / VGGT | Classical for metric accuracy; learned for rapid prototyping |
| Dense reconstruction | MVS + Poisson mesh | 3DGS / NeRF | MVS for offline mapping; 3DGS for real-time rendering + simulation |
| SLAM | ORB-SLAM3 | DROID-SLAM | ORB-SLAM3 for production (stable); DROID for research (accuracy) |
| Optical flow | Lucas-Kanade | RAFT | LK for sparse real-time; RAFT for dense accuracy |
| Point tracking | KLT tracker | CoTracker | KLT for speed; CoTracker for occlusion robustness |
| Place recognition | DBoW2 (vocab tree) | AnyLoc / CricaVPR | DBoW2 for low compute; learned for cross-domain / condition change |
THE book: Hartley & Zisserman, Multiple View Geometry in Computer Vision (2003). Read chapters 9-12 (fundamental matrix, cameras, 3D reconstruction, structure computation) cover-to-cover. The rest is reference material. Also recommended: Szeliski, Computer Vision: Algorithms and Applications (2nd ed., 2022) — broader scope, more accessible, free online.
5 must-read papers:
5 must-study repos:
Deep-dive into specific topics:
Staff-level interviewers love timing questions. Know these numbers:
| Operation | CPU time | GPU time | Notes |
|---|---|---|---|
| FAST corner detection (2K pts) | ~2ms | — | Per 1280×960 image |
| ORB descriptor (2K pts) | ~5ms | — | 256-bit binary |
| SuperPoint (2K pts) | ~50ms | ~5ms | 256-dim float |
| Brute-force matching (2K×2K) | ~15ms | ~1ms | L2 distance |
| LightGlue matching | — | ~30ms | With early exit |
| RANSAC + 5-pt (500 matches) | ~5ms | — | 100 iterations |
| PnP + RANSAC | ~2ms | — | P3P, 50 iterations |
| Local BA (10 keyframes) | ~50ms | — | Ceres, sparse Schur |
| 3DGS rendering (1080p) | — | ~5ms | Real-time |
| DUSt3R inference (1 pair) | — | ~200ms | ViT-Large, A100 |
| RAFT optical flow (1080p) | — | ~100ms | 12 iterations |
| CoTracker (50 pts, 50 frames) | — | ~300ms | Batch processing |
| Misconception | Reality | Why it matters |
|---|---|---|
| "More features = better reconstruction" | Quality matters more than quantity. 500 well-distributed features beat 5000 clustered on one texture. | Guides feature extraction tuning |
| "NeRF understands 3D geometry" | NeRF optimizes for appearance (photometric loss). Geometry is a byproduct, often inaccurate. | Don't trust NeRF depth for collision checking |
| "SLAM and SfM are the same thing" | SfM is offline (all images available). SLAM is online (causal, can't look ahead). Different optimization strategies. | System design implications |
| "Learned models will replace classical geometry" | Classical gives guarantees (metric accuracy, consistency). Learned gives robustness. Best systems are hybrid. | Staff-level perspective |
| "Calibration is a solved problem" | Static calibration is solved. Online calibration under thermal drift, vibration, and wear is an active research area. | Production vs research gap |
| "BA always converges" | BA is non-convex. Bad initialization, insufficient observations, or wrong distortion model can cause divergence. | Debug skill |
| "3DGS is always better than NeRF" | 3DGS struggles with thin structures, transparency, and view-dependent effects. NeRF handles these better with continuous volume sampling. | Method selection |
The difference between a senior and a staff engineer in 3D geometry is not algorithmic knowledge — it's systems thinking. A senior engineer implements the 8-point algorithm correctly. A staff engineer asks: "Should we even be estimating F, or should we use a learned matcher that gives us correspondences good enough to skip geometric verification entirely? What's the failure rate? What's the latency budget? What happens when the learned model encounters an out-of-distribution scene? What's the fallback?"
The five habits of a staff geometry engineer: