Staff-Engineer Interview Prep

Master Multi-View
Geometry

Every concept, system design, algorithm, failure mode, and frontier paper a staff 3D vision engineer needs — from camera models to neural reconstruction.

Prerequisites: Linear algebra + Basic Python. That's it.
13
Chapters
13+
Simulations
5
Dimensions per Ch

Chapter 0: The Role — What You'd Actually Build

You are a 3D geometry engineer at a robotics company. Your morning starts with a Slack message: "Stereo calibration drifted overnight, VO is diverging on the warehouse floor." By afternoon you are debugging why the offline reconstruction pipeline dropped 400 images from a 10,000-image drone capture. After dinner you review a PR that replaces your COLMAP-based pipeline with DUSt3R and need to decide: ship it or push back?

This is the job. You own the full pipeline from photons hitting the sensor to 3D point clouds consumed by downstream planners, world models, and manipulation policies. Every component is load-bearing. A 2-pixel calibration error propagates through triangulation, corrupts the map, and sends the robot into a shelf.

A day in the life, concretely:

The system you own: Multi-camera rig → intrinsic + extrinsic calibration → real-time visual odometry / SLAM during operation → offline dense reconstruction for scene understanding → point tracking for object state estimation → training data pipeline for learned policies. Each box in this diagram is an interview topic.

The interview dimensions for every chapter in this lesson:

DimensionWhat it testsSignal
CONCEPTCan you derive it on a whiteboard?First principles, not memorized formulas
DESIGNWhere does it fit in a production system?Data flow, tradeoffs, failure boundaries
CODECan you implement the core algorithm?From-scratch numpy, not API calls
DEBUGWhat breaks and how do you fix it?Symptom → diagnosis → fix
FRONTIERWhat replaces this in 2 years?Latest papers, repos, tradeoffs
System Architecture

Click any component to highlight its inputs and outputs. This is the full perception pipeline you would own as a staff geometry engineer.

Click a component to see its data flow

The daily reality: You will spend 40% of your time on calibration and data pipeline issues (not glamorous, but load-bearing). 30% on algorithm development and evaluation. 20% on debugging production failures. 10% on reading papers and prototyping new approaches. The staff-level signal is owning the full stack end-to-end, not just the ML model in the middle.

The tech stack you should know: C++ for production (Ceres, g2o, OpenCV, GTSAM). Python for prototyping (numpy, torch, open3d). ROS for sensor integration. Docker for deployment. COLMAP/hloc for offline pipelines. Nerfstudio/gsplat for neural rendering.

Data flow through the pipeline with tensor shapes:

Raw images
(N, H, W, 3) uint8 — N cameras × H×W resolution × RGB
↓ undistort via K, dist_coeffs
Undistorted images
(N, H, W, 3) uint8 — pinhole model now applies exactly
↓ feature extraction (SuperPoint)
Keypoints + Descriptors
per image: keypoints (Ki, 2) float32, descriptors (Ki, 256) float32
↓ matching (LightGlue) for each image pair
Correspondences
per pair: matches (Mij, 2) int — indices into keypoint arrays
↓ geometric verification (RANSAC + E/F)
Verified matches + relative poses
inlier mask (Mij,) bool, Rij (3,3), tij (3,)
↓ SfM (incremental or global)
Camera poses + 3D points
poses: (N, 4, 4) SE(3), points: (P, 3) float64, colors: (P, 3) uint8
↓ dense reconstruction (3DGS / NeRF / MVS)
Dense model
3DGS: (G, 60) per-Gaussian params | NeRF: MLP weights | MVS: (N, H, W) depth maps

Typical numbers from a real robot deployment: 6 cameras × 1280×960 @ 30fps = 6 × 3.7MB/frame = 22.2 MB/frame = 666 MB/s raw throughput. Feature extraction: ~500 keypoints/camera/frame × 6 = 3000 keypoints/frame. Matching: 15 camera pairs × ~200 matches = 3000 correspondences/frame. VO: 1 SE(3) pose per frame (6 DOF). Map: 50K-500K 3D landmarks after 10 minutes of operation.

Interview tip: When asked "tell me about your work," structure it as: problemsystem I builtkey technical decisionsimpact. "We needed real-time 6-DOF pose for a mobile manipulator. I built a stereo VIO pipeline fusing ORB features with IMU pre-integration, achieving 0.3% drift over 100m. The key decision was using DSO-style direct alignment for initialization because the warehouse floor is textureless. This enabled autonomous pick-and-place at 50 picks/hour."
System design: A colleague proposes skipping extrinsic calibration by using a learned model (DUSt3R) that predicts 3D directly from image pairs. What's the tradeoff?

Chapter 1: Camera Models — The Foundation Everything Builds On

CONCEPT

Every 3D vision algorithm starts with a camera model — the mathematical function that maps a 3D world point to a 2D pixel. The simplest is the pinhole camera: light passes through an infinitely small hole and hits a flat sensor. No lens, no distortion, just geometry. Why start here? Because every error in the camera model propagates through triangulation, SfM, SLAM, and reconstruction. Get calibration wrong by 1 pixel, and your robot's 3D map is wrong by centimeters at arm's reach.

The math has three stages. First, transform from world to camera coordinates using the extrinsic matrix [R|t]. Then divide by depth (the perspective projection). Finally, scale to pixels using the intrinsic matrix K.

p = K [R | t] Xh    where   K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]

fx, fy = focal length in pixels (how much the camera "zooms"). cx, cy = principal point (where the optical axis hits the sensor, usually near image center). R (3×3) = rotation from world to camera. t (3×1) = translation. Together, [R|t] says "where is the camera and which way is it looking?"

Worked example: K = [[800,0,320],[0,800,240],[0,0,1]], camera at origin looking down Z (R=I, t=0). Project X=(1,2,5). Step 1: camera coords = (1,2,5). Step 2: divide by Z → (0.2, 0.4). Step 3: u = 800×0.2 + 320 = 480, v = 800×0.4 + 240 = 560. Pixel: (480, 560).

Real lenses introduce distortion. The Brown-Conrady model corrects radial (barrel/pincushion) and tangential components:

r' = r(1 + k1r² + k2r&sup4;)    +   tangential terms (p1, p2)

Why does distortion matter in practice? At the image center, distortion is near zero (r is small, so k1r² ≈ 0). At the edges (r=1 for a normalized image), even k1=0.1 shifts points by 10% of their distance from center. On a 1280×960 sensor, that's ~50 pixels at the corners. If you triangulate features at the edges without undistortion, your 3D points are wrong by centimeters. Always undistort before geometric computation.

Distortion model hierarchy: Radial-only (k1, k2): sufficient for most machine vision lenses. Full Brown-Conrady (k1, k2, k3, p1, p2): standard for OpenCV calibration. Kannala-Brandt: for fisheye lenses (FOV > 120°). UCM/Double Sphere: unified model for catadioptric and omnidirectional cameras. Neural distortion: learn a small MLP mapping (x,y) → (x',y') for non-parametric lenses. Each model adds complexity but handles wider FOV and more exotic optics.

Data flow from 3D point to pixel:

World point X (3,)
(X, Y, Z) in meters
↓ Xc = R·X + t
Camera coords (3,)
(Xc, Yc, Zc)
↓ divide by Zc
Normalized image coords (2,)
(Xc/Zc, Yc/Zc)
↓ apply distortion, then K
Pixel (u, v)
Integer pixel coordinates

DESIGN

Where does calibration fit? Before everything else. At a robotics company, you calibrate once at the factory, verify monthly in the field, and run continuous self-calibration checks during operation. Thermal drift changes focal length by 0.1–1 pixel per 10°C. Rolling shutter CMOS sensors expose rows sequentially — fast motion causes wobble. Use global shutter cameras for robotics.

Rolling shutter vs global shutter, quantified. A rolling shutter camera with 33ms readout time (full frame) exposes each row at a different instant. A robot rotating at 90°/s shifts the bottom of the image by 90 × 0.033 = 3° relative to the top. At 800px focal length, that's ~42 pixels of skew. This breaks the pinhole assumption completely. Global shutter cameras expose all pixels simultaneously — no skew. Cost difference: global shutter is typically 2-3x the price of rolling shutter. Rolling shutter can be modeled in BA (continuous-time formulation), but it adds complexity and computation. Staff recommendation: global shutter for anything moving faster than walking speed.

Coordinate frame conventions are a constant source of bugs. OpenCV uses right-handed with Z forward, Y down. OpenGL uses Z backward, Y up. ROS uses X forward, Z up. Every time you integrate a new library, check which convention it uses.

ConventionForwardRightUpUsed by
OpenCV+Z+X-YOpenCV, COLMAP, most vision libraries
OpenGL-Z+X+YOpenGL, Three.js, Blender camera
ROS REP-103+X-Y+ZROS, URDF, tf2
Unreal Engine+X+Y+ZUnreal, some game engines

The conversion trap. When you read a pose from COLMAP (OpenCV convention) and use it in a ROS pipeline, you need to rotate the coordinate frame: Rros = Rcv2ros · Rcolmap. The conversion matrix flips Y and Z: Rcv2ros = [[0,0,1],[−1,0,0],[0,−1,0]]. Get this wrong and the robot drives backward into a wall. Debug tip: visualize a single pose with known orientation (camera looking down a hallway) and verify the axes manually before processing the full dataset.

CODE

python
import numpy as np

def project(X, K, R, t, dist=None):
    """Project 3D point X (3,) to pixel (2,).
    dist: [k1, k2, p1, p2] distortion coefficients."""
    X_cam = R @ X + t                   # (3,) world → camera
    x_n = X_cam[:2] / X_cam[2]          # (2,) perspective divide
    if dist is not None:
        k1, k2, p1, p2 = dist
        r2 = x_n[0]**2 + x_n[1]**2
        radial = 1 + k1*r2 + k2*r2**2
        x_n = x_n * radial               # radial distortion
        x_n[0] += 2*p1*x_n[0]*x_n[1] + p2*(r2 + 2*x_n[0]**2)
        x_n[1] += p1*(r2 + 2*x_n[1]**2) + 2*p2*x_n[0]*x_n[1]
    px = K[:2,:2] @ x_n + K[:2,2]      # (2,) apply intrinsics
    return px

# Test
K = np.array([[800,0,320],[0,800,240],[0,0,1]], dtype='float64')
px = project(np.array([1.0, 2.0, 5.0]), K, np.eye(3), np.zeros(3))
# px = [480., 560.] ✓

# Batch projection for N points
def project_batch(X_batch, K, R, t, dist=None):
    """Project (N,3) points to (N,2) pixels."""
    X_cam = (R @ X_batch.T).T + t           # (N,3)
    x_n = X_cam[:,:2] / X_cam[:, 2:3]      # (N,2) perspective divide
    if dist is not None:
        k1, k2, p1, p2 = dist
        r2 = np.sum(x_n**2, axis=1, keepdims=True)
        radial = 1 + k1*r2 + k2*r2**2
        x_n = x_n * radial
    px = x_n @ K[:2,:2].T + K[:2,2]        # (N,2) apply intrinsics
    return px

# Undistort: inverse mapping (iterative)
def undistort_point(px, K, dist, n_iter=10):
    """Pixel (2,) → undistorted normalized coords (2,)."""
    x_d = (px - K[:2,2]) / np.diag(K[:2,:2])
    x = x_d.copy()
    k1, k2, p1, p2 = dist
    for _ in range(n_iter):
        r2 = x[0]**2 + x[1]**2
        radial = 1 + k1*r2 + k2*r2**2
        x = x_d / radial                    # fixed-point iteration
    return x

DEBUG

SymptomDiagnosisFix
Images warped at edgesWrong distortion coefficientsRecalibrate with more board images; ensure corners cover image edges
Points don't align between camerasExtrinsic calibration is staleRecalibrate extrinsics; check if camera mount shifted
Projection off by constant offsetWrong principal pointCheck cx, cy; ensure calibration images span full FOV
Projected points mirrored/rotatedCoordinate frame mismatchCheck OpenCV vs OpenGL vs ROS conventions

FRONTIER

Self-calibration from video (no checkerboard): estimate intrinsics from feature tracks across frames by solving the absolute conic constraint. Requires the camera to undergo sufficient rotation diversity (at least 3 rotations about different axes). Works in practice for handheld video (natural wrist rotation) but fails for car-mounted cameras (mostly translation, little rotation). Neural lens models: learn the full distortion function as a small MLP, mapping normalized coordinates (x,y) to distorted (x',y'). Handles non-Brown-Conrady distortion from fisheye and catadioptric lenses that parametric models can't fit. Training: minimize reprojection error on a dense calibration target. Differentiable camera models in PyTorch (e.g., in DROID-SLAM) allow end-to-end optimization of intrinsics alongside pose estimation — the focal length becomes a learnable parameter, jointly optimized with poses and depths.

Event cameras (DVS): asynchronous per-pixel brightness change sensors with microsecond resolution. Each pixel independently fires when it detects a brightness change exceeding a threshold. No frames, no exposure time, no motion blur. Extreme dynamic range (>120dB vs ~60dB for standard cameras). The geometry is the same (pinhole + distortion), but the calibration procedure is different: you can't use a static checkerboard (no brightness changes). Instead, use a blinking checkerboard or moving patterns. Event cameras are increasingly used for high-speed robotics (drone racing, industrial pick-and-place at >10Hz) where conventional cameras produce motion blur.

Camera selection for robotics: Staff-level question. Mono vs stereo vs fisheye vs event? The answer depends on: speed (event cameras for >100Hz, global shutter for <30Hz), environment (fisheye for tight spaces, narrow FOV for long-range), compute budget (stereo needs rectification + matching, mono needs depth estimation), and whether you have an IMU (VIO gives you scale without stereo).

Interview tip: When an interviewer asks "explain the camera model," start by drawing the pinhole diagram on the whiteboard. Label world frame, camera frame, image plane. Write K with actual numbers (fx=800, cx=320, cy=240). Then project a concrete point: (1,2,5) → divide by Z → multiply by K → pixel (480,560). Then mention distortion as a follow-up. The concrete example demonstrates mastery more than reciting formulas.
Camera Projection

A 3D wireframe cube projected through a pinhole camera. Adjust focal length and distortion to see what breaks.

Focal length800
Distortion k10.00
Interview: Your robot's camera calibration was done at room temperature (22°C). The robot operates in a warehouse at 40°C. What goes wrong and how do you fix it?

Chapter 2: Epipolar Geometry — The Constraint That Makes 3D Possible

CONCEPT

Two cameras see the same 3D point. In camera 1, it appears at pixel p1. Where must it appear in camera 2? Without any constraint, we search the entire second image. Epipolar geometry reduces this to a single line.

The 3D point X, the two camera centers C1 and C2, and the two image points all lie in a single epipolar plane. The intersection of this plane with image 2 is the epipolar line. The matching point must lie on this line.

Derivation from scratch. Camera 1 is at origin. Camera 2 is at rotation R, translation t. A 3D point X1 in camera 1 coords becomes X2 = RX1 + t in camera 2 coords. The three vectors X2, t, and RX1 are coplanar. Coplanarity: X2 · (t × RX1) = 0. Since the normalized image coordinate x = X/Z, and dividing by Z does not change zero, we get:

x2T [t]× R x1 = 0   ⇒   x2T E x1 = 0

E = [t]×R is the essential matrix (5 DOF: 3 rotation + 2 translation direction; scale is unrecoverable). [t]× is the skew-symmetric matrix such that [t]×v = t × v.

Worked example of E. Camera 2 is translated 1m to the right: t = (1,0,0), R = I (same orientation). Then [t]× = [[0,0,0],[0,0,-1],[0,1,0]], so E = [t]×R = [[0,0,0],[0,0,-1],[0,1,0]]. Given a point at normalized coords x1 = (0.5, 0.3, 1) in camera 1, the epipolar line in camera 2 is l2 = Ex1 = (0, -1, 0.3). This is a horizontal line at y = 0.3 — makes sense! Pure horizontal translation means epipolar lines are horizontal (the standard stereo rectification assumption).

The fundamental matrix F is the uncalibrated version, working on raw pixel coordinates:

p2T F p1 = 0    where   F = K2−T E K1−1

F has 7 DOF (9 entries − 1 scale − 1 rank-2 constraint). Given F and point p1, the epipolar line in image 2 is l2 = Fp1.

DESIGN

In a SfM pipeline, epipolar geometry sits right after feature matching. You estimate F (or E if calibrated) from correspondences, then use it to: (1) reject outlier matches (points violating the epipolar constraint), (2) recover relative pose R, t, (3) initialize triangulation. RANSAC is mandatory — even 10% outlier correspondences will corrupt the estimate.

How many inliers do you need? The 5-point algorithm needs 5 correct correspondences per RANSAC sample. The probability of getting a clean sample is p = w5 where w is the inlier ratio. For 99% success probability, you need k = log(0.01)/log(1-w5) iterations. With 50% inliers: k = 145. With 30%: k = 4,121. With 20%: k = ~29,000. This exponential blowup is why matching quality is the #1 bottleneck in practice.

Essential vs Fundamental: when to use which? If you know intrinsics K (calibrated camera): use E. It has fewer DOF (5 vs 7), so you need fewer correspondences (5-point vs 8-point), and the estimate is more stable. If cameras are uncalibrated or K is uncertain: use F. In practice, most modern systems use calibrated cameras, so E is the standard.

RANSAC iteration count — the math you should know. To find one clean sample of s points from data with inlier ratio w, the probability of failure after k iterations is: Pfail = (1 - ws)k. Solving for k at Pfail = 0.01:

Inlier ratio w5-point (s=5)8-point (s=8)Time @ 1ms/iter
80%8 iterations14 iterations~14ms
60%57 iterations177 iterations~177ms
40%435 iterations4,177 iterations~4.2s
20%14,365 iterations~717K iterations~12min

This table shows two things: (1) the 5-point algorithm is vastly more RANSAC-efficient than 8-point, and (2) below 40% inliers, you need thousands of iterations. This is why matching quality (the step before RANSAC) is the #1 factor determining pipeline speed. Improving the inlier ratio from 40% to 60% with a better matcher (e.g., LightGlue vs brute-force ratio test) saves you 8x iterations.

RANSAC scoring. Not all correct models are equal. After finding an inlier set, score the model by the number of inliers within a threshold (typically 1-3 pixels for F estimation). Better variants: MSAC: score = sum of capped squared residuals (incentivizes tighter fits). MAGSAC: marginalizes over all thresholds (no threshold parameter needed). GC-RANSAC: adds local optimization on the inlier set after each RANSAC iteration. OpenCV's `findFundamentalMat` uses RANSAC by default, but `cv::USAC_MAGSAC` is strongly recommended for production.

CODE

The 8-point algorithm step by step: (1) Normalize coordinates (Hartley normalization — translate centroid to origin, scale so mean distance = √2). (2) Build A matrix where each row is [u2u1, u2v1, u2, v2u1, v2v1, v2, u1, v1, 1]. (3) SVD of A, f = last column of V. (4) Reshape to 3×3, enforce rank-2 by zeroing smallest singular value. (5) Denormalize.
python
def eight_point_F(pts1, pts2):
    """pts1, pts2: (N,2) correspondences, N ≥ 8."""
    def normalize(pts):
        mu = pts.mean(axis=0)
        d = np.linalg.norm(pts - mu, axis=1).mean()
        s = np.sqrt(2) / (d + 1e-8)
        T = np.array([[s,0,-s*mu[0]],[0,s,-s*mu[1]],[0,0,1]])
        ph = np.hstack([pts, np.ones((len(pts),1))])
        return (T @ ph.T).T[:,:2], T
    n1, T1 = normalize(pts1)
    n2, T2 = normalize(pts2)
    u1, v1 = n1[:,0], n1[:,1]
    u2, v2 = n2[:,0], n2[:,1]
    A = np.stack([u2*u1, u2*v1, u2, v2*u1, v2*v1,
                  v2, u1, v1, np.ones(len(u1))], axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3,3)
    U, S, Vt2 = np.linalg.svd(F)        # enforce rank 2
    S[2] = 0
    F = U @ np.diag(S) @ Vt2
    return T2.T @ F @ T1                 # denormalize

def ransac_F(pts1, pts2, n_iter=500, thresh=3.0):
    """RANSAC wrapper for robust F estimation."""
    best_F, best_inliers = None, []
    N = len(pts1)
    for _ in range(n_iter):
        idx = np.random.choice(N, 8, replace=False)
        F_cand = eight_point_F(pts1[idx], pts2[idx])
        # Compute Sampson distance for all points
        p1h = np.hstack([pts1, np.ones((N,1))])
        p2h = np.hstack([pts2, np.ones((N,1))])
        Fp1 = (F_cand @ p1h.T).T              # (N,3)
        Ftp2 = (F_cand.T @ p2h.T).T           # (N,3)
        num = np.sum(p2h * Fp1, axis=1)**2   # (p2^T F p1)^2
        denom = Fp1[:,0]**2 + Fp1[:,1]**2 + Ftp2[:,0]**2 + Ftp2[:,1]**2
        dist = num / (denom + 1e-8)
        inliers = np.where(dist < thresh**2)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
            best_F = eight_point_F(pts1[inliers], pts2[inliers])
    return best_F, best_inliers

DEBUG

SymptomDiagnosisFix
F estimation gives garbageMissing Hartley normalizationAlways normalize; condition number of A should be ~1
Degenerate F (all epipolar lines converge)Points are coplanarUse homography H instead; recover normal + distance
Too few RANSAC inliers (<30%)Feature matching has too many outliersTighten Lowe's ratio test, use learned matcher, check image overlap
E decomposition gives wrong R,tCheirality check failed — picked wrong solutionE decomposes to 4 (R,t) solutions; pick the one where triangulated points have positive depth in both cameras
Interview tip: The derivation of the epipolar constraint is a classic whiteboard question. Practice writing it in 4 lines: (1) X2 = RX1 + t. (2) Coplanarity: X2 · (t × RX1) = 0. (3) Substitute normalized coords x = X/Z. (4) Group as x2T[t]×Rx1 = 0. Then immediately state the practical consequence: "this reduces a 2D search to a 1D search along a line."

FRONTIER

Deep fundamental matrix estimation: OANet (Yi et al., CVPR 2019) learns to weight correspondences, replacing RANSAC. LoFTR + F estimation: skip feature detection entirely, predict matches with a transformer, estimate F from dense correspondences. DUSt3R (CVPR 2024): bypasses the fundamental matrix completely — directly predicts 3D pointmaps from two images.

Epipolar Geometry

Click in the left image to see the corresponding epipolar line in the right image. Drag baseline to see degenerate case when cameras coincide.

Baseline100
Interview: You're estimating F between two drone images taken 1 second apart. Both images show mostly the ground plane. What goes wrong and what's the fix?

Chapter 3: Feature Detection, Description & Matching

CONCEPT

Harris corners. At each pixel, compute the structure tensor M — a 2×2 matrix of image gradient products summed over a window. Its eigenvalues λ1, λ2 diagnose the local structure: both large = corner (good feature), one large = edge, both small = flat. Harris avoids computing eigenvalues directly with R = det(M) − k·trace(M)².

SIFT (Scale-Invariant Feature Transform): Build a Difference-of-Gaussians pyramid (blur at multiple scales, subtract adjacent levels). Extrema in this 3D scale-space are keypoints. For each, compute a 128-dimensional descriptor from gradient histograms in a 16×16 patch, rotated to the dominant orientation. Scale + rotation invariant. Slow but extremely distinctive.

ORB (Oriented FAST + Rotated BRIEF): FAST corner detection (16-pixel circle intensity test) plus BRIEF binary descriptor (256 bits from pixel comparisons). 100× faster than SIFT but less distinctive. Binary descriptors use Hamming distance (count differing bits with XOR + popcount) instead of L2, making matching 10x faster than float descriptors.

The evolution to learned features. Classical features (SIFT, ORB) are hand-designed — fixed receptive field, fixed descriptor computation. SuperPoint (DeTone et al., CVPRW 2018) trains a CNN to detect keypoints and compute 256-dim descriptors jointly. Trained with homographic adaptation: warp images with random homographies, check if the same corners are detected. The result: more repeatable keypoints in challenging conditions (low light, blur, viewpoint change) than any classical detector.

Dense matchers skip keypoints entirely. LoFTR (Sun et al., CVPR 2021) processes two images with a ViT backbone, computes cross-attention between all feature positions, and outputs dense correspondences. No keypoint detection, no descriptor computation, no ratio test. Works on textureless surfaces (walls, floors) where classical features find nothing. The cost: ~80ms on a 3090 vs ~5ms for ORB. Worth it when matching quality is the bottleneck.

Lowe's ratio test: For each feature in image 1, find the two nearest matches in image 2 by descriptor distance. If ratio = best/second_best > 0.75, reject. Why? A correct match has one strong candidate; an ambiguous match has two similar ones. The ratio catches this elegantly.
Interview tip: If asked "walk me through feature matching end to end," give concrete numbers: "I'd use SuperPoint (2048 keypoints, 256-dim descriptors) + LightGlue (30ms on a 3090). For fallback without GPU: ORB (2000 keypoints, 256-bit binary descriptors) + brute-force Hamming distance (5ms on CPU). Ratio test at 0.7 for ORB, 0.8 for learned features. Then geometric verification with RANSAC on F to kill remaining outliers." Numbers show you've actually shipped this.

DESIGN

Matching strategy at scale. For 100 images: brute-force all pairs (O(N²) = 5K pairs). For 10,000 images: vocabulary tree retrieval to find the top-50 candidate pairs per image (O(N·50) = 500K pairs vs O(N²) = 50M). For 1M images: learned retrieval (NetVLAD, CosPlace) to find top-20 candidates.

Why brute-force L2 matching is expensive. For N descriptors of dimension D: N×N distance computations, each D multiplications + additions. With 2000 SIFT features (D=128) in each image: 2000×2000×128 = 512M FLOPs per pair. With 5000 pairs for 100 images: 2.56 TFLOPs total. That's why approximate nearest neighbors (FLANN, FAISS) or learned matchers (which attend to spatial context and exit early when confident) are essential at scale.

Feature distribution matters. 2000 features clustered on a textured poster leave the rest of the image unmatched. Good practice: grid-based detection — divide the image into 8×8 cells, detect top-N features per cell. This ensures coverage across the entire FOV, which improves triangulation geometry (wide spread of rays) and makes reconstruction more complete. ORB-SLAM3 does exactly this with its "distribute keypoints with quadtree" strategy.

MethodDescriptorSpeedAccuracyGPU?
SIFT128-dim floatSlow (~1s/img)GoodNo
ORB256-bit binaryReal-timeModerateNo
SuperPoint + LightGlue256-dim floatFast (~30ms)ExcellentYes
LoFTR / RoMaDense (no keypoints)Moderate (~80ms)Best for low-textureYes

CODE

python
def harris_corners(img_gray, k=0.04, threshold=0.01):
    """Harris corner detection from scratch.
    img_gray: (H,W) float32, range [0,1]."""
    # Image gradients
    Ix = np.gradient(img_gray, axis=1)   # (H,W)
    Iy = np.gradient(img_gray, axis=0)   # (H,W)
    # Structure tensor components (Gaussian-weighted)
    from scipy.ndimage import gaussian_filter
    Ixx = gaussian_filter(Ix * Ix, sigma=1.5)
    Iyy = gaussian_filter(Iy * Iy, sigma=1.5)
    Ixy = gaussian_filter(Ix * Iy, sigma=1.5)
    # Harris response: R = det(M) - k * trace(M)^2
    det_M = Ixx * Iyy - Ixy ** 2
    trace_M = Ixx + Iyy
    R = det_M - k * trace_M ** 2
    # Non-maximum suppression + threshold
    corners = (R > threshold * R.max())
    return np.argwhere(corners)[:, ::-1]  # (N, 2) as (x, y)

def ratio_test(desc1, desc2, ratio=0.75):
    """Lowe's ratio test. desc1: (N,D), desc2: (M,D)."""
    dists = np.linalg.norm(desc1[:, None] - desc2[None], axis=2)
    idx1 = np.argsort(dists, axis=1)[:, :2]  # top-2 matches
    d1 = dists[np.arange(len(desc1)), idx1[:, 0]]
    d2 = dists[np.arange(len(desc1)), idx1[:, 1]]
    mask = d1 / (d2 + 1e-8) < ratio
    return np.stack([np.arange(len(desc1))[mask], idx1[:, 0][mask]], axis=1)

DEBUG

The "repetitive texture" failure mode is the #1 matching problem in industrial environments. Warehouse shelving, server racks, parking structures — all have regular, repeating visual patterns. Every shelf bracket looks identical. SIFT descriptors match equally well to 50 different locations. Ratio test is useless (many equally-good matches). The fix cascade: (1) Tighten ratio to 0.6 (helps slightly). (2) Add geometric verification — RANSAC on F rejects matches that don't form a consistent geometry. (3) Use spatial context — LightGlue's attention mechanism considers the spatial arrangement of features, not just individual descriptors. (4) Use cross-image information — LoFTR/RoMa process both images jointly, so the matcher "sees" the full scene layout. (5) Fall back to direct methods — photometric alignment doesn't need distinctive features, just gradients.
SymptomDiagnosisFix
Too few matches (<50)Images have little overlap or large scale changeLower ratio threshold to 0.8; use more keypoints; try LoFTR
Many wrong matchesRepetitive texture (warehouse shelving)Tighten ratio test to 0.6; use geometric verification (RANSAC on F)
Matching too slow (>1s)Brute-force on too many featuresUse FLANN approximate NN; reduce to top-2000 features by response
Features cluster in one regionOne textured object dominates detectionUse grid-based detection (N features per cell); ensure whole-image coverage

FRONTIER

SuperPoint + LightGlue (Lindenberger et al., ICCV 2023): the current production standard. CNN keypoints + GNN matcher with adaptive early exit. RoMa (Edstedt et al., CVPR 2024): coarse-to-fine dense matching with robust training. GIM (Xuelun et al., ICLR 2024): generalized image matching across domains. MASt3R (Leroy et al., ECCV 2024): matching + 3D in one model.

Feature Matching

Two views with detected features. Toggle ratio test to see outlier rejection. Green = correct, red = outlier.

Interview: Your SfM pipeline works in offices but fails in a warehouse with repetitive shelving. What's happening and how do you fix it?

Chapter 4: Structure from Motion — From Images to 3D

CONCEPT

The problem: Given N images of a scene with feature correspondences, recover ALL camera poses (position + orientation) AND ALL 3D point positions. Two unknowns solved simultaneously.

Incremental SfM builds the reconstruction step by step:

1. Initialize
Pick two images with good baseline. Estimate E → decompose to R, t → triangulate initial 3D points.
2. Register next image
Match features to existing 3D points. Solve camera pose via PnP + RANSAC.
3. Triangulate new points
Features matched across newly registered image and existing images → new 3D points.
4. Bundle adjustment
Jointly refine ALL poses + ALL points to minimize reprojection error.
↻ repeat 2–4

Global SfM takes a different path: estimate all pairwise rotations, solve global rotations via rotation averaging, estimate translations, triangulate, run BA once. Faster but more sensitive to outlier relative poses.

Initialization is the hardest part. The initial two-view reconstruction sets the coordinate system and scale for the entire reconstruction. Bad initialization corrupts everything downstream. Choose images with: (1) sufficient baseline (not too small — noisy triangulation, not too large — matching fails), (2) enough shared features (>100 inlier matches), (3) non-degenerate geometry (not pure rotation, not all coplanar points). COLMAP tries many initial pairs, scores them by triangulation angle and match count, and picks the best one.

The 5-point algorithm (Nistér, 2003) is the standard for calibrated cameras. Given 5 correspondences + known K, recover E (up to 10 solutions). Use cheirality check (depth must be positive in both cameras) to reduce to 1–2 solutions. Combined with RANSAC, this is the workhorse of modern SfM initialization and relative pose estimation.

PnP variants and when to use each: P3P: minimal solver (3 points → up to 4 solutions), used inside RANSAC for maximum efficiency. DLT (6+ points): linear, no iteration, good for initialization. EPnP (Lepetit et al., 2009): O(n) complexity, accurate, the default in OpenCV. Iterative PnP: refine with Levenberg-Marquardt after initial estimate. In production SLAM, the pipeline is: P3P inside RANSAC to find inliers → EPnP or iterative refinement on all inliers for the final pose.

Triangulation quality depends on baseline. Two views at distance B with focal length f and pixel noise σ give depth uncertainty: σZ = Z² · σ / (f · B). At Z=5m, f=800px, σ=0.5px, B=0.5m: σZ = 25 × 0.5 / (800 × 0.5) = 0.031m = 3.1cm. Double the baseline to 1m: 1.6cm. This is why wide baselines help, but too wide means matching fails (appearance change).

DESIGN

COLMAP (Schönberger & Frahm, CVPR 2016) is the gold standard. Pipeline: exhaustive SIFT matching → geometric verification → incremental SfM → BA. Input: folder of images. Output: camera poses + sparse point cloud. It starts nearly every NeRF and 3DGS project.

Scaling to millions of images: Exhaustive pairwise matching is O(N²). For 10K images, that's 50M pairs. Solutions: (1) image retrieval (vocabulary tree, NetVLAD) to find top-k candidates, (2) hierarchical SfM — cluster images by GPS/time/visual similarity, run SfM per cluster, merge at boundaries, (3) distributed processing with Theia or OpenSfM.

COLMAP's registration strategy in detail. For each unregistered image, COLMAP counts how many of its features match to existing 3D points. The image with the most matches registers first (greedy strategy). Then: (1) solve PnP with P3P inside RANSAC (typically 100 iterations, 8px inlier threshold). (2) If >30 inliers, register the image. (3) Triangulate new 3D points from this camera + existing cameras (require triangulation angle >1.5°). (4) Run local BA on the 10 most recent cameras + all observed 3D points. (5) Every 10 registrations, run global BA. (6) Filter points: remove any with reprojection error >4px or track length <2.

Why initialization is so fragile. The first two cameras set the global coordinate frame and scale. If the baseline is too small (<5° parallax), triangulated points have huge depth uncertainty. If points are coplanar, E is degenerate. If there's pure rotation (no translation), triangulation is impossible (all rays from both cameras are parallel). COLMAP's trick: try up to 100 initial pairs, score each by (1) number of inlier matches, (2) mean triangulation angle, (3) median reprojection error. The pair with the best composite score wins. This heuristic is critical and rarely discussed in papers.

CODE

PnP (Perspective-n-Point): Given N 3D-2D correspondences, solve for camera pose. DLT formulation:

python
def pnp_dlt(pts_3d, pts_2d, K):
    """PnP via Direct Linear Transform.
    pts_3d: (N,3), pts_2d: (N,2), K: (3,3).
    Returns R (3,3), t (3,)."""
    # Normalize 2D points by K^-1
    K_inv = np.linalg.inv(K)
    pts_n = (K_inv @ np.hstack([pts_2d, np.ones((len(pts_2d),1))]).T).T  # (N,3)
    # Build DLT matrix (2N x 12)
    A = []
    for i in range(len(pts_3d)):
        X, Y, Z = pts_3d[i]
        u, v, w = pts_n[i]
        A.append([X,Y,Z,1,0,0,0,0,-u*X,-u*Y,-u*Z,-u])
        A.append([0,0,0,0,X,Y,Z,1,-v*X,-v*Y,-v*Z,-v])
    A = np.array(A)
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3,4)               # projection matrix
    R_approx = P[:,:3]
    t_approx = P[:,3]
    # Enforce R ∈ SO(3) via SVD
    U, S, Vt2 = np.linalg.svd(R_approx)
    R = U @ Vt2
    if np.linalg.det(R) < 0: R = -R; t_approx = -t_approx
    t = t_approx / S.mean()               # scale correction
    return R, t

Triangulation from two views via DLT:

python
def triangulate(P1, P2, pts1, pts2):
    """P1, P2: (3,4) projection matrices. pts: (N,2).
    Returns (N,3) 3D points."""
    points_3d = []
    for i in range(len(pts1)):
        u1, v1 = pts1[i]
        u2, v2 = pts2[i]
        A = np.array([
            u1*P1[2] - P1[0],
            v1*P1[2] - P1[1],
            u2*P2[2] - P2[0],
            v2*P2[2] - P2[1]
        ])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        points_3d.append(X[:3] / X[3])  # dehomogenize
    return np.array(points_3d)

DEBUG

SymptomDiagnosisFix
Reconstruction has 2+ disconnected componentsMissing connections between image clustersAdd more retrieval candidates; ensure some images bridge the gap
Many images fail to registerNot enough 3D-2D correspondences for PnPLower matching thresholds; add more features; ensure overlap
Scale drift over long sequencesIncremental SfM accumulates BA errorUse global SfM, or add GPS/known-distance constraints
Initialization failsFirst pair has degenerate geometryTry multiple initial pairs; check for pure rotation or coplanarity
Interview tip: When asked "how does SfM work?", draw the incremental loop on the whiteboard: init pair → register next → triangulate → BA → repeat. Then mention the three critical failure points: (1) bad initialization (coplanar, pure rotation), (2) disconnected components (missing overlap), (3) scale drift (no global constraint). Knowing failure modes signals production experience.

FRONTIER

DUSt3R (Wang et al., CVPR 2024): two images in, per-pixel 3D pointmaps out. No matching, no RANSAC, no triangulation. A single ViT-based network. MASt3R (Leroy et al., ECCV 2024): adds local feature extraction to DUSt3R for matching. VGGT (Wang et al., CVPR 2025): N images in one forward pass → cameras, 3D points, depth. The "foundation model" for geometry.

Incremental SfM

Step through SfM. Camera frustums (teal) and points (yellow) appear as images register. "Inject Noise" shows what happens without BA.

Images: 2/8 | Points: 0 | Error: 0.0px
Interview: You need to reconstruct a building from 10,000 drone images. Pairwise matching is O(N²) = 100M pairs. How do you make this tractable?

Chapter 5: Bundle Adjustment — The Optimization That Makes Everything Work

CONCEPT

Bundle adjustment (BA) simultaneously refines ALL camera poses AND ALL 3D points to minimize total reprojection error:

min ∑ij ρ( ‖ xij − π(Ci, Xj) ‖² )

xij = observed pixel of point j in camera i. π(Ci, Xj) = projection of 3D point Xj through camera Ci. ρ = robust loss function (Huber or Cauchy) to downweight outliers. The name: adjusting the bundles of rays until they intersect as closely as possible.

Why is it tractable? Each observation xij depends on only ONE camera (6 params) and ONE point (3 params). The Jacobian is 99%+ zeros. This sparsity is the key to everything.

The Schur complement trick, step by step. The normal equations JTJΔ = −JTr have block structure:
[[U, W], [WT, V]] [Δc, Δp]T = [εc, εp]T
U = camera-camera block (dense, 6Nc × 6Nc). V = point-point block (block diagonal, each point's 3×3 block is independent). W = camera-point cross-terms.
Since V is block diagonal, we can invert it cheaply: Δc = (U − WV−1WT)−1(ec − WV−1ep). Then back-substitute: Δp = V−1(ep − WTΔc).

Worked example: 100 cameras × 6 = 600 camera params. 10K points × 3 = 30K point params. Full system: 30,600 × 30,600. After Schur complement: 600 × 600. That's 2,600× smaller.

Memory comparison: The full Hessian for 100 cameras + 10K points = 30,600² × 8 bytes = ~7.5 GB. The Schur-reduced camera system = 600² × 8 bytes = 2.9 MB. That's a 2,500× reduction. For a city-scale reconstruction (10K cameras + 10M points), the full system would be 30B entries — 240 GB. Schur-reduced: 60K² = 28.8 GB. Still large, so you'd use iterative solvers (conjugate gradient on the Schur complement) instead of direct factorization.

The Jacobian structure in detail: Each observation contributes a 2×9 Jacobian block (2 pixel residuals, differentiated w.r.t. 6 camera params + 3 point params). For 1M observations, J is 2M × 30,600 but with only 2M × 9 = 18M nonzeros out of 61.2B entries. That's 99.97% sparse. The block structure means JTJ has camera-camera blocks (dense 6×6), point-point blocks (3×3, independent), and camera-point cross-terms (sparse). The Schur complement exploits that the point-point block V is block-diagonal: V-1 costs O(Npoints) not O(Npoints3).

The optimizer: Levenberg-Marquardt blends Gauss-Newton (fast near minimum; approximates Hessian as JTJ) with gradient descent (safe far from minimum). Damping λ controls: large λ → gradient descent, small λ → Gauss-Newton. LM adapts λ automatically: decrease if step reduces cost, increase if it doesn't.

LM step-by-step: (1) Compute Jacobian J and residual vector r. (2) Form normal equations: (JTJ + λI)Δ = -JTr. (3) Solve for Δ (the parameter update). (4) Compute new cost with trial parameters x + Δ. (5) If cost decreased: accept step, decrease λ by factor 10. (6) If cost increased: reject step, increase λ by factor 10, retry. The initial λ matters: too small → Gauss-Newton step may diverge if far from minimum. Too large → tiny gradient descent steps, slow convergence. Rule of thumb: λ0 = 10-3 × max(diag(JTJ)).

Convergence criteria: Stop when: (1) relative cost change < 10-6, or (2) step size ||Δ|| < 10-8, or (3) max iterations reached (typically 50-100). In practice, BA converges in 10-30 LM iterations for a well-initialized problem. Poorly initialized: may need 100+ or diverge entirely.

DESIGN

When to run BA: In incremental SfM: after every N images (N=5–10) to prevent error accumulation, plus a full global BA at the end. In SLAM: local BA on a sliding window of keyframes (real-time, ~10ms for 10 keyframes), plus global BA on loop closure (can take seconds, runs in background thread).

Local vs global BA in SLAM. Local BA: optimize only the K most recent keyframes and all 3D points they observe. Fix all other keyframes (treat their poses as constants). This is fast (O(K)) and prevents distant keyframes from being affected by local noise. Global BA: optimize everything. Needed after loop closure because the correction must propagate to all poses. ORB-SLAM3's strategy: local BA after every keyframe insertion (10 keyframes, ~50ms). Global BA only on loop closure (runs in a dedicated thread, doesn't block tracking). If global BA changes poses by more than a threshold, merge the correction into the tracking thread's local map.

Gauge freedom. The BA cost function is invariant to a 7-DOF similarity transform (3 rotation + 3 translation + 1 scale). This means the solution is underdetermined — infinitely many camera+point configurations give the same reprojection error. Fix gauge by: (1) freezing the first camera at the origin with identity rotation, (2) fixing the scale by setting the baseline between cameras 1 and 2 to 1.0. Without gauge fixing, the Hessian is rank-deficient and the solver either fails or produces an arbitrary solution. Ceres handles this automatically if you mark parameters as constant.

Memory footprint: The reduced camera system after Schur complement is Nc×6 × Nc×6. For 1000 cameras: 6000×6000 × 8 bytes = ~288MB. Fits in RAM easily. The full unreduced system would be 3M unknowns — the Hessian would be 3M×3M = 72TB. Obviously impossible without the Schur trick.

Solver choice matters. Ceres (Google): the gold standard, supports automatic differentiation, multiple linear solvers, robust losses. Automatic differentiation means you write the cost function once and Ceres computes Jacobians for you — no manual derivative computation. g2o: graph optimization library, commonly used in SLAM backends (ORB-SLAM, LSD-SLAM). Lightweight, fast, but less flexible than Ceres. GTSAM (Georgia Tech): factor graph framework, good for incremental updates (iSAM2 algorithm adds new factors without resolving from scratch — ideal for real-time SLAM where you add one keyframe at a time). For large-scale BA (>10K cameras), use Ceres with SPARSE_SCHUR or ITERATIVE_SCHUR solver type. For real-time SLAM BA, use g2o or GTSAM with iSAM2.

Automatic differentiation vs analytic Jacobians. Ceres' autodiff is convenient (write cost function, get derivatives for free) but 2-3x slower than hand-coded analytic Jacobians. For production SLAM at 30Hz, the difference matters. ORB-SLAM3 uses analytic Jacobians for all BA cost functions. For prototyping or offline SfM, autodiff is fine. The Jacobian of the reprojection function w.r.t. camera parameters (6 DOF) and point position (3 DOF) is a 2×9 matrix — not that hard to derive by hand, and doing so deepens your understanding of the geometry.

Robust loss functions handle outlier observations that would otherwise pull the solution. Huber loss: quadratic for small residuals, linear for large (reduces outlier influence). Cauchy loss: even more aggressive downweighting. In practice, Cauchy with threshold = 2× median residual works well. Always start with L2 (square loss) for the first few iterations to get close, then switch to robust for refinement.

Interview tip: The Schur complement is THE key insight that makes BA tractable. Be ready to explain it with numbers: "1000 cameras × 6 params = 6K camera unknowns. 500K points × 3 = 1.5M point unknowns. The full system is 1.5M × 1.5M. But the point-point block is block-diagonal (each point's 3×3 block is independent of other points), so I can eliminate all points in O(Npoints) and solve a 6K × 6K camera-only system. Then back-substitute for points." This answer shows you understand why BA is practical, not just that it exists.

CODE

python
import torch

def rodrigues(rvec):
    """Rodrigues rotation: (3,) angle-axis → (3,3) matrix."""
    theta = torch.norm(rvec)
    if theta < 1e-8:
        return torch.eye(3)
    k = rvec / theta
    K = torch.tensor([[0,-k[2],k[1]],[k[2],0,-k[0]],[-k[1],k[0],0]])
    return torch.eye(3) + torch.sin(theta)*K + (1-torch.cos(theta))*(K@K)

def ba_cost(cameras, points, observations, K):
    """cameras: (Nc,6), points: (Np,3), obs: list of (ci,pi,u,v).
    Returns scalar reprojection error."""
    total = 0.0
    for ci, pi, u, v in observations:
        R = rodrigues(cameras[ci, :3])
        t = cameras[ci, 3:]
        Xc = R @ points[pi] + t
        proj = K[:2,:2] @ (Xc[:2]/Xc[2]) + K[:2,2]
        total += torch.sum((proj - torch.tensor([u,v]))**2)
    return total / len(observations)

DEBUG

SymptomDiagnosisFix
BA diverges (cost increases)Bad initialization or λ too smallIncrease initial λ; check for NaN in Jacobian; verify initial poses
BA converges to wrong solutionOutlier observations pulling the solutionUse Huber or Cauchy robust loss; pre-filter with RANSAC
BA is too slowNot using Schur complementUse Ceres Solver with SPARSE_SCHUR; check that sparsity pattern is correct
Points fly to infinity after BAUnder-constrained points (seen from <2 cameras)Remove points with <3 observations; fix gauge freedom

FRONTIER

Differentiable BA in neural pipelines: DROID-SLAM (Teed & Deng, NeurIPS 2021) runs BA as a differentiable layer, backpropagating through the optimization. End-to-end learned alternatives: DUSt3R doesn't need BA at all — the network directly predicts globally consistent 3D. Distributed BA: Theia runs Schur-complement BA across multiple machines for city-scale reconstructions.

Bundle Adjustment Convergence

Cameras (teal) and 3D points (yellow) from above. "Perturb" adds noise. "Run BA" optimizes. Watch error converge. Toggle robust loss to handle outliers.

Damping λ0.10
Reproj error: 0.00 px
Interview: Your BA has 500 cameras and 1M points. The full Jacobian would be 2M×1.5M. Explain how the Schur complement makes this tractable.

Chapter 6: Visual Odometry & SLAM

CONCEPT

Visual Odometry (VO) estimates camera motion frame-by-frame from video. Think of it as a visual speedometer — chain relative poses to build a trajectory. SLAM goes further: build a persistent map and use it to correct drift when revisiting places.

Feature-based VO: detect features → match to previous frame → estimate E → decompose to R, t. Robust but ignores textureless regions. Direct VO (LSD-SLAM, DSO): minimize photometric error directly — pixel intensity difference after warping by estimated motion. Uses all gradient information.

Scale ambiguity: Monocular VO cannot recover absolute scale — everything could be twice as big and twice as far. Solutions: stereo (known baseline gives metric depth via triangulation), IMU (accelerometer gives metric acceleration, integrate twice for displacement), known object sizes (detect ArUco markers or objects with known dimensions), or learned metric depth (Metric3Dv2, Depth Pro). This is why mono VO always drifts in scale — there's no constraint anchoring the estimated scale to reality.

Scale drift in monocular SLAM, quantified. Without any scale constraint, monocular SLAM's scale estimate drifts by 1-5% per 100m of travel. Over 1km, that's 10-50m of accumulated scale error — the map could be 10% too large or too small. This means distances to objects are wrong by 10%, which is unacceptable for manipulation (need <1cm accuracy) and marginal for navigation (need <1m accuracy at 10m range). Solutions: (1) VIO (IMU provides scale, drift <0.1% per 100m). (2) Stereo (known baseline, no scale drift at all). (3) Loop closure at places with known geometry (re-anchor scale).

SLAM = VO + map + loop closure. Camera poses are nodes in a pose graph. Odometry constraints are edges between consecutive nodes (relative pose from VO). Loop closures are edges between non-consecutive nodes (relative pose from place recognition + geometric verification). When you revisit a place (detected by DBoW2 vocabulary tree or NetVLAD global descriptor), verify geometrically (estimate relative pose from feature matches, check consistency), add the loop closure edge, and optimize the entire pose graph. The optimization distributes the accumulated drift correction across all poses in the loop. This kills drift.

How loop closure detection works in ORB-SLAM3: Each keyframe is converted to a bag-of-words vector using DBoW2 (a vocabulary tree trained offline on a large feature dataset). To check for loop closure: query the database for the keyframes most similar to the current one, excluding recent keyframes (which are similar by construction, not by revisiting). If a candidate has similarity score > threshold AND passes geometric verification (at least 20 inlier matches with consistent relative pose), declare a loop closure. The temporal consistency check requires 3 consecutive keyframes to all match the same candidate region, preventing false positives from repetitive textures.

Visual-Inertial fusion: IMU provides high-frequency rotation + acceleration (pre-integrated between keyframes). Camera provides low-frequency but drift-free bearing measurements. Complementary: IMU handles fast motion and gives scale; camera prevents IMU drift.

IMU pre-integration (Lupton & Susskind, 2012; Forster et al., TRO 2017): Between two keyframes (e.g., 0.1s apart at 30fps), the IMU produces 100+ measurements (at 1000Hz). Naively, you'd integrate all 100 accelerometer+gyroscope readings into a single relative pose constraint. But if BA changes a keyframe's pose, you'd have to re-integrate all 100 measurements. Pre-integration avoids this by computing the relative motion in the body frame (independent of absolute pose), then rotating into the world frame at optimization time. The pre-integrated measurement ΔR, Δv, Δp becomes a factor in the pose graph, with a covariance propagated from IMU noise specs.

IMU initialization. The IMU has 15 unknowns: 3 gravity direction, 3 velocity, 3 accelerometer bias, 3 gyroscope bias, and scale (for monocular). You need ~10s of diverse motion (not just straight-line, need rotation) to observe all modes. ORB-SLAM3 runs a parallel IMU initialization thread that collects keyframe poses from visual-only SLAM, then jointly solves for IMU biases + gravity + scale. If initialization fails (robot is stationary or moving in a straight line), VIO falls back to visual-only until sufficient motion is observed.

DESIGN

SLAM system architecture: Three threads running concurrently. (1) Tracking thread (real-time, 30Hz): estimate current pose by matching features against the local map. Must complete in <33ms. Use motion model prediction to narrow search. (2) Local mapping thread (~5Hz): triangulate new points from recent keyframes, run local BA on a sliding window of 10–20 keyframes. Can take 200ms. (3) Loop closure thread (background, ~1Hz): query place recognition database, verify candidates geometrically, optimize pose graph.

Keyframe selection is critical. Too many: wasted compute, tiny baselines (noisy triangulation). Too few: large appearance changes (tracking fails). Heuristic: new keyframe when parallax > threshold OR tracked features < minimum. ORB-SLAM3's criteria: (1) at least 20 frames since last keyframe, (2) local mapping is idle (not still processing previous keyframe), (3) current frame tracks fewer than 90% of the points in the reference keyframe (significant viewpoint change), and (4) current frame tracks more than 15 points (not lost). These heuristics prevent keyframe spam in static scenes while ensuring coverage during motion.

Covisibility graph. Two keyframes are "covisible" if they share at least 15 observed 3D points. This graph encodes which keyframes see similar parts of the scene. Local BA operates on the covisibility neighborhood of the current keyframe (typically 10-20 keyframes). Loop closure candidates are found among keyframes that are NOT covisible with the current one (you've been here before, but not recently). The covisibility graph is the backbone of ORB-SLAM's architecture.

Map management: The map grows over time. For a robot operating 8 hours/day, you might accumulate 50,000 keyframes. Strategies: cull redundant keyframes (too similar to neighbors), prune low-quality points (high reprojection error), merge duplicate landmarks (same 3D point observed from different keyframes). ORB-SLAM3's "atlas" allows multiple disconnected maps that can be merged later.

Interview tip: If asked "explain the difference between VO and SLAM in one sentence": "VO chains relative poses frame-to-frame and accumulates drift; SLAM maintains a map and uses loop closure to eliminate drift globally." Follow up with threading architecture: "tracking at 30Hz, local BA at 5Hz, loop closure at 1Hz — three threads." Then mention the key metric: drift rate (VO alone: 1-5% translation error per meter; with loop closure: bounded, approaching 0 over revisits).
SystemTypeInputKey Feature
ORB-SLAM3Feature-basedMono/Stereo/VIMulti-map, IMU init, atlas
DROID-SLAMLearned denseMono/StereoDifferentiable BA, dense flow
KimeraVI + semanticStereo + IMU3D mesh + semantic labels
DSODirect sparseMonoPhotometric BA, no features

CODE

Simplified pose graph optimization: 4 poses in a loop with noisy odometry and one loop closure constraint. The key insight: odometry edges encode relative poses between consecutive frames (noisy). A loop closure edge says "pose 0 and pose 3 are actually the same place." By minimizing the sum of squared edge errors, the optimizer distributes the accumulated drift evenly across all poses.

Worked example: Robot moves in a square: right 10m, up 10m, left 10m, down 10m (back to start). But odometry has 2% drift, so after 4 legs it thinks it's at (0.8, -0.6) instead of (0,0). Loop closure detects "I've been here before" and creates a constraint: pose_4 - pose_0 = (0,0). The optimizer spreads the (0.8, -0.6) error across all 4 edges: each pose shifts by ~(0.2, -0.15).

python
import numpy as np

def pose_graph_optimize(poses, edges, n_iter=20, lr=0.3):
    """poses: (N,2) initial xy positions.
    edges: list of (i, j, dx, dy) constraints.
    Simple gradient descent on sum of squared edge errors."""
    poses = poses.copy().astype('float64')
    for _ in range(n_iter):
        grad = np.zeros_like(poses)
        for i, j, dx, dy in edges:
            err_x = (poses[j,0] - poses[i,0]) - dx
            err_y = (poses[j,1] - poses[i,1]) - dy
            grad[i,0] -= err_x;  grad[i,1] -= err_y
            grad[j,0] += err_x;  grad[j,1] += err_y
        poses[1:] -= lr * grad[1:]  # fix first pose
    return poses

DEBUG

SymptomDiagnosisFix
Trajectory drifts over timeNo loop closures or VO has biasAdd loop closure detection; fuse IMU for scale; check feature quality
Sudden jumps in trajectoryFalse loop closureAdd geometric verification; increase place recognition threshold
Tracking lost frequentlyFast motion or motion blurHigher framerate; global shutter; use IMU for prediction; reduce exposure time
Map inconsistent after loop closurePose graph has conflicting constraintsUse robust kernel in graph optimization; check edge covariances

FRONTIER

DROID-SLAM (Teed & Deng, NeurIPS 2021): replaces hand-crafted feature matching with a learned dense correlation layer, and runs bundle adjustment as a differentiable layer. Key innovation: all-pairs correlation between features in the current frame and keyframe features, followed by a GRU that predicts dense correspondence updates. The BA layer backpropagates through the optimization, so the feature extractor learns to produce features that are easy to optimize. Result: state-of-the-art accuracy on TartanAir, EuRoC, and TUM, with significantly fewer failure cases than ORB-SLAM3.

Gaussian Splatting SLAM (MonoGS, SplaTAM, Photo-SLAM, 2024): represent the map as 3D Gaussians, run SLAM within the Gaussian framework. The idea: instead of maintaining a sparse point cloud map, maintain a dense Gaussian cloud that can be rendered from any viewpoint. Tracking: render the expected view from the current pose estimate, compare to the actual image, optimize pose to minimize photometric + geometric error. Mapping: add new Gaussians for under-represented regions, prune bad ones. The map IS a renderable 3D model. This unifies SLAM and scene reconstruction into one system.

Foundation-model relocalization: NetVLAD → AnyLoc → CricaVPR (2024) — learn universal place descriptors across domains (indoor, outdoor, underwater, aerial). AnyLoc uses DINOv2 features aggregated with VLAD, achieving cross-domain place recognition without fine-tuning. CricaVPR adds cross-image correlation for better distinctiveness. These models enable SLAM loop closure in environments where BoW-based methods (DBoW2) fail due to domain shift.

SLAM Loop Closure

Robot traces a loop. Trajectory (orange) drifts from ground truth (teal). After completing the loop, "Detect Loop" triggers pose graph optimization.

Drift noise2.0
Status: Ready
Interview: You're building SLAM for an indoor delivery robot. The environment has long, featureless corridors. What failure modes do you expect and how do you address each?

Chapter 7: Multi-Camera Rigs & Sensor Fusion

CONCEPT

Intrinsic calibration (Zhang's method, TPAMI 2000): show a checkerboard from 10–20 angles. Detect corners. Solve for K and distortion coefficients that minimize reprojection error across all views. Key constraint: different viewing angles break the focal-length / principal-point ambiguity.

Zhang's method step by step: (1) For each checkerboard image, detect all inner corners (e.g., 9×6 = 54 corners). Their 3D positions are known (grid pattern, Z=0 plane). (2) Compute homography Hi from the 3D plane to image i using DLT. (3) From ≥3 homographies, solve for K using the constraint h1TK-TK-1h2 = 0 (orthogonality of column vectors). This gives fx, fy, cx, cy. (4) Extract Ri, ti from each Hi using K. (5) Nonlinear refinement: minimize reprojection error over all corners in all views, jointly optimizing K, distortion coefficients, and all Ri, ti.

Practical calibration tips: (1) Use at least 15 images. (2) Tilt the board at diverse angles (not just fronto-parallel). (3) Cover the entire image, especially corners and edges (that's where distortion is strongest). (4) Keep the board stationary during capture (motion blur ruins corner detection). (5) Check reprojection error: <0.3px = excellent, 0.3-0.8px = acceptable, >1px = recapture. (6) Use a large board (A2 or bigger) for better accuracy.

Extrinsic calibration finds the rigid transform between sensors. Methods: (1) Shared checkerboard — both cameras see the same board simultaneously. Both cameras detect the board's corners, giving two sets of 3D-2D correspondences. Solve for each camera's pose relative to the board, then compute the relative transform between cameras: T12 = Tboard←cam2 · Tcam1←board. (2) Hand-eye calibration (AX = XB): camera on robot arm, A = arm motion (from FK), B = observed camera motion (from visual pose estimation), X = unknown camera-to-hand transform. At least 2 non-parallel motions needed; 10+ give a robust estimate. (3) Target-less from natural scene correspondences — match features across cameras viewing the same scene, estimate the relative pose from correspondences. Less accurate than target-based but doesn't require physical targets.

Stereo rectification is a special case of extrinsic calibration for stereo pairs. After calibrating the relative pose between two cameras, compute rectification homographies that make epipolar lines horizontal and aligned between images. This transforms stereo matching from a 2D search to a 1D search along each row, enabling efficient block matching. OpenCV's `stereoRectify` computes these homographies from K1, K2, R, t. The quality check: rectified images should have perfectly horizontal epipolar lines. Verify by drawing horizontal lines across the rectified pair — corresponding features should align.

Temporal synchronization is critical and often overlooked. Cameras must fire simultaneously or motion creates alignment errors. Hardware trigger: sub-microsecond sync. Software timestamp matching: 1–5ms typical. At 1 m/s robot speed, 5ms desync = 5mm error.

Synchronization methods ranked by accuracy: (1) Hardware trigger (GPIO pulse from microcontroller): <1μs sync. Gold standard. (2) PTP (Precision Time Protocol): <100μs over Ethernet. Good for multi-camera rigs without a dedicated trigger line. (3) NTP (Network Time Protocol): ~1ms. Acceptable for coarse alignment only. (4) Software timestamps: 1-10ms depending on OS scheduling jitter. Never good enough for fast-moving robots.

Time offset estimation: If you can't get hardware sync, estimate the time offset between cameras from data. Kalibr does this by fitting continuous-time B-spline trajectories to IMU data and finding the temporal offset that minimizes reprojection error. Typical result: offset = 3.2ms ± 0.1ms between USB camera and IMU. You then shift timestamps by this constant offset before fusion.

DESIGN

A typical robot sensor rig: 2–6 cameras + IMU + optional LiDAR. Coordinate frame conventions: define one "body frame" (usually IMU), express all sensors in body frame coordinates.

Factory calibration
Controlled environment, checkerboard, all sensors simultaneously
Field verification
Monthly check: project known 3D landmarks, measure reprojection error
Online self-calibration
Continuous refinement during operation (detect drift, flag alerts)

CODE

python
def hand_eye_AX_XB(A_list, B_list):
    """Solve AX = XB for hand-eye calibration.
    A_list: list of (4,4) transforms (robot motions).
    B_list: list of (4,4) transforms (camera motions).
    Returns X: (4,4) camera-to-hand transform."""
    # Tsai-Lenz method (rotation-first)
    # 1. Solve for rotation: log(R_A) x = log(R_B)
    # Collect rotation axes
    M = np.zeros((3, 3))
    for A, B in zip(A_list, B_list):
        alpha = rotation_to_axis_angle(A[:3,:3])  # (3,)
        beta = rotation_to_axis_angle(B[:3,:3])
        M += np.outer(beta, alpha)
    # SVD to find best rotation
    U, _, Vt = np.linalg.svd(M)
    R_X = U @ Vt
    if np.linalg.det(R_X) < 0: R_X = -R_X
    # 2. Solve for translation
    A_stack, b_stack = [], []
    for A, B in zip(A_list, B_list):
        A_stack.append(A[:3,:3] - np.eye(3))
        b_stack.append(R_X @ B[:3,3] - A[:3,3])
    t_X = np.linalg.lstsq(np.vstack(A_stack), np.hstack(b_stack), rcond=None)[0]
    X = np.eye(4)
    X[:3,:3] = R_X; X[:3,3] = t_X
    return X

DEBUG

SymptomDiagnosisFix
Stereo doesn't match after calibrationRectification error; extrinsics driftedRecalibrate extrinsics; check for mount vibration
IMU and camera disagreeTime sync offsetCheck trigger latency; use Kalibr for time offset estimation
Calibration works in lab, fails on robotVibration/thermal drift/cable stress changing camera poseUse rigid mounts; run online self-calibration; monitor reprojection error
Rolling shutter wobbleCMOS row-sequential exposure + fast motionUse global shutter cameras; or model rolling shutter in BA

FRONTIER

Kalibr (ETH, Furgale et al.): the gold standard for camera-IMU calibration. Uses continuous-time B-spline trajectories to represent the motion, which allows: (1) estimating the time offset between camera and IMU (typically 1-10ms), (2) handling rolling shutter by modeling per-row timestamps, (3) jointly optimizing camera intrinsics, extrinsics, and IMU biases. Requires an aprilgrid target and ~60 seconds of diverse hand-held motion. Output: camera-IMU transform (SE(3)), time offset (scalar), IMU noise parameters (accelerometer + gyro random walk and white noise).

Online self-calibration: recent work on continuously refining intrinsics/extrinsics during SLAM operation. The idea: monitor reprojection residuals in a sliding window. If the median residual exceeds a baseline (e.g., 2x the factory calibration residual), trigger an automatic recalibration using the current keyframe set. This catches thermal drift, mount shifts, and gradual lens degradation without human intervention.

Neural extrinsic calibration: learning camera-to-camera transforms from visual data without checkerboards — train a network on paired images to predict the relative SE(3) transform. Infrastructure-free calibration from natural scene correspondences (no targets needed, just drive around). The trend is toward fully automatic, continuous calibration that never requires human intervention. The challenge: neural methods give ~1-2px accuracy, but production systems need <0.5px. Hybrid approaches (neural initialization + geometric refinement) bridge this gap.

A note on fisheye and omnidirectional cameras: The pinhole + Brown-Conrady model fails for FOV > 120°. Fisheye models (equidistant, stereographic, Kannala-Brandt) handle 180°+ FOV. Modern robots increasingly use fisheye cameras for maximum coverage — a single 190° fisheye replaces three narrow-FOV cameras. The calibration tools (Kalibr, OpenCV fisheye module) support these models.

Interview tip: Calibration is 40% of the job but 5% of the discussion in papers. Show you know the unglamorous parts: "At my last role, factory calibration took 30 minutes per rig. We automated it: robotic arm positions checkerboard, script runs Zhang's method + stereo extrinsics + IMU-camera via Kalibr. Online, we monitor reprojection error against a rolling baseline. If it exceeds 1.5px, we flag for recalibration. The #1 failure mode was thermal drift — focal length shifts 0.5px per 10°C."
Lens Distortion

Adjust radial distortion k1. Positive = barrel, negative = pincushion. Gray grid = undistorted, teal = distorted.

k10.20
Interview: You have a 6-camera rig on a robot arm. After the arm moves to a new configuration, stereo pairs no longer align. What's the most likely cause?

Chapter 8: Dense Reconstruction — NeRF, 3DGS, and MVS [SHOWCASE]

CONCEPT

SfM gives sparse point clouds. For robotics, we need dense 3D — every surface represented. Three paradigms:

Multi-View Stereo (MVS): For each pixel in a reference image, search along epipolar lines in neighbors for the photometric match. Output: dense depth map per view. Fuse into a point cloud, run Poisson reconstruction for a mesh. No training needed. Struggles with textureless surfaces.

NeRF (Mildenhall et al., ECCV 2020): represent the scene as a continuous function F(x,y,z,θ,φ) → (color, density). Train an MLP per scene on photometric loss via volume rendering:

C(r) = ∑i Ti · αi · ci   where   Ti = ∏j<i(1 − αj),   αi = 1 − exp(−σiΔi)

Ti = transmittance (how much light hasn't been absorbed). αi = opacity from density σi and step size Δi. ci = color at sample i. Beautiful results, but slow (hours to train, seconds to render).

3D Gaussian Splatting (3DGS) (Kerbl et al., SIGGRAPH 2023): scene as millions of 3D Gaussians, each with position, covariance (shape), color (spherical harmonics), opacity. Rendering: sort by depth, project each Gaussian as 2D ellipse, alpha-blend front-to-back. Differentiable rasterization — no ray marching. 100× faster to render than NeRF. Real-time at 1080p.

Per-Gaussian parameters: position μ (3), covariance Σ (6, stored as rotation quaternion q (4) + scale s (3) = 7 params, Σ = RSSTRT), color via spherical harmonics (16 coefficients per channel × 3 = 48 for order 3, or 3 for order 0), opacity α (1). Total: ~60 params per Gaussian. A typical scene has 500K–5M Gaussians. At 60 × 4 bytes = 240 bytes/Gaussian, a 1M-Gaussian scene is ~240 MB.

The rendering pipeline in detail: (1) Frustum culling: discard Gaussians outside the camera view. (2) Project each 3D Gaussian to a 2D Gaussian on the image plane via the local affine approximation (Jacobian of the projection). (3) Tile-based sorting: divide the image into 16×16 tiles, assign each Gaussian to tiles it overlaps, sort per-tile by depth. (4) Alpha compositing: for each pixel, blend Gaussians front-to-back: C = ∑ ci αi Ti where Ti = ∏j<i(1-αj). Stop when transmittance drops below 1/255. All steps are differentiable.

DESIGN

When to use which: MVS for offline mapping (no training, works out of the box, gives explicit mesh). 3DGS for real-time novel views (robot simulation, viewpoint planning, interactive editing). NeRF for maximum quality on static scenes (offline rendering, product photography). For robotics, 3DGS is winning because it's real-time, explicit (easy to edit/segment), and supports collision checking against Gaussian ellipsoids.

Input requirements compared:

MethodInputPreprocessingMin imagesGPU required?
MVSImages + poses + KCOLMAP SfM3+Optional (CPU ok)
NeRFImages + poses + KCOLMAP SfM20+Yes (training)
3DGSImages + poses + K + sparse pointsCOLMAP SfM20+Yes (training + rendering)
DUSt3R → 3DGSImages onlyNone2+Yes

Note: DUSt3R + 3DGS skips COLMAP entirely — DUSt3R provides poses and initial points, 3DGS trains on them. This is the fastest path from images to renderable 3D, but accuracy is lower than COLMAP-initialized 3DGS.

MethodRepresentationTrain timeRenderQualityEdit?
MVSExplicit meshNoneReal-timeGoodEasy
NeRFImplicit MLPHours~5s/frameExcellentHard
3DGSExplicit Gaussians~10minReal-timeExcellentEasy

3DGS training procedure: (1) Initialize Gaussians at COLMAP sparse points (typically 10K-100K). Each gets position from 3D point, color from the point's observed color, scale = mean distance to 3 nearest neighbors, rotation = identity, opacity = inverse sigmoid(0.1). (2) Optimization loop (30K iterations, ~10min on RTX 4090): sample a random training view, rasterize all visible Gaussians, compute L1 + D-SSIM loss against ground truth image, backpropagate through the differentiable rasterizer. (3) Densification (every 100 iterations for first 15K): if a Gaussian has large view-space gradient (>0.0002), clone it (if small) or split it (if large). This adaptively adds detail. (4) Pruning (every 100 iterations): remove Gaussians with opacity <0.005 or that are too large (world-space size >scene extent/20). (5) Opacity reset (every 3000 iterations): set all opacities to 0.01, forcing re-learning — this kills floaters.

3DGS data flow:

Images + COLMAP poses
Sparse SfM points as initialization
↓ init Gaussians from SfM points
Optimize Gaussians
Position, scale, rotation, SH color, opacity — differentiable rendering loss
↓ densify (clone/split) + prune (low opacity)
100K–5M Gaussians
Real-time renderable, exportable

CODE

python
def volume_render(densities, colors, deltas):
    """NeRF volume rendering equation.
    densities: (N,) sigma values along ray.
    colors: (N,3) RGB at each sample.
    deltas: (N,) step sizes.
    Returns: (3,) final pixel color."""
    alphas = 1.0 - np.exp(-densities * deltas)       # (N,)
    transmittance = np.cumprod(
        np.concatenate([[1.0], 1.0 - alphas[:-1]])   # T_i = prod(1-alpha_j) for j < i
    )
    weights = transmittance * alphas                   # (N,)
    color = (weights[:, None] * colors).sum(axis=0)    # (3,)
    return color

# 3DGS: project a single Gaussian to 2D
def splat_gaussian(mean_3d, cov_3d, R_cam, t_cam, K):
    """Project 3D Gaussian to 2D image-space Gaussian."""
    mean_cam = R_cam @ mean_3d + t_cam               # (3,) in camera frame
    z = mean_cam[2]
    J = np.array([[K[0,0]/z, 0, -K[0,0]*mean_cam[0]/z**2],
                  [0, K[1,1]/z, -K[1,1]*mean_cam[1]/z**2]])  # (2,3) Jacobian
    cov_cam = R_cam @ cov_3d @ R_cam.T
    cov_2d = J @ cov_cam @ J.T                       # (2,2) projected covariance
    mean_2d = K[:2,:2] @ (mean_cam[:2]/z) + K[:2,2]
    return mean_2d, cov_2d

DEBUG

The "floater" problem in 3DGS is the #1 quality issue in production. Cause: during optimization, some Gaussians drift into empty space because they lower photometric loss from certain viewpoints (by partially covering occluded regions). Diagnosis: render from novel viewpoints not in training set — floaters appear as blobs suspended in air. Fixes: (1) opacity reset every 3000 iterations forces all Gaussians to re-earn their opacity from scratch. (2) Prune Gaussians with opacity < 0.005. (3) Add depth regularization from LiDAR or MVS depth maps. (4) Use 2DGS (flat surfels) which geometrically can't represent volumetric floaters.
SymptomDiagnosisFix
NeRF has artifacts on reflectionsSpecular surfaces violate Lambertian assumptionUse Ref-NeRF (separate diffuse/specular) or mask reflective regions
3DGS has "floaters" in airInsufficient pruning; noisy initializationEnable opacity reset every N iterations; increase pruning threshold
MVS has holes in textureless areasPhotometric matching fails without textureUse PatchMatch-based MVS; add regularization; fill with depth completion network
3DGS training divergesLearning rate too high; bad SfM initializationReduce position LR; check COLMAP output for outlier points
3DGS renders are blurryUnder-densification; not enough GaussiansLower split/clone gradient threshold; increase densification iterations

FRONTIER

4DGS (dynamic scenes with deformation fields — model time-varying Gaussians for moving objects). 2DGS (2D surfel Gaussians for better surface quality — flat discs instead of 3D ellipsoids give cleaner geometry). SplaTAM / MonoGS (SLAM with 3DGS as the map representation — track camera AND update Gaussians simultaneously). Zip-NeRF (anti-aliased hash-grid NeRF, training in minutes). LRM / Zero123++ (single-image 3D — feed one photo, get a 3D model via large reconstruction model).

The integration with robotics is the frontier. Gaussian Splatting SLAM means your robot's map IS a Gaussian cloud — you can render novel views for planning, check collisions against Gaussians, segment objects by clustering Gaussians, and track deformable objects as Gaussian motion fields. This unifies mapping, rendering, and scene understanding into one representation.

Interview tip: When comparing NeRF vs 3DGS, don't just say "3DGS is faster." Give the system design reasoning: "For a robot sim environment, I need: (1) real-time rendering for viewpoint planning (>30fps) — eliminates NeRF. (2) Explicit geometry for collision checking — Gaussian positions + scales give me an occupancy proxy. (3) Segmentation by object — I can cluster Gaussians spatially. (4) Editability — if the scene changes, I can add/remove Gaussians locally. NeRF wins on quality for static offline rendering (product photography, architectural viz), but for anything interactive or physical, 3DGS dominates."
Dense Reconstruction: MVS vs NeRF vs 3DGS

Three methods reconstruct the same scene. MVS scans depth columns. NeRF accumulates density along rays. 3DGS splats Gaussians. Slide iterations to watch quality improve. "Add Noise" to compare robustness.

Iterations1
Method: MVS | Depth columns: 0
Interview: You're building a digital twin of a kitchen for robot simulation. You have 50 posed RGB images. Compare MVS, NeRF, and 3DGS — which do you choose and why?

Chapter 9: Point Tracking & Scene Flow

CONCEPT

Optical flow estimates dense 2D motion: for each pixel in frame t, where did it move in frame t+1? Classical: Lucas-Kanade (sparse, 5×5 window, least squares on brightness constancy I(x,y,t) = I(x+dx,y+dy,t+dt)). Horn-Schunck (dense, adds smoothness). Both fail with occlusion, large displacements, illumination changes.

RAFT (Teed & Deng, ECCV 2020): builds a 4D correlation volume between all feature pairs in two frames, then iteratively refines flow via a GRU that indexes into this volume. 12–32 refinement iterations — the network learns "gradient descent" through correlation space. The architecture: (1) shared feature encoder (CNN, 1/8 resolution) extracts features from both frames. (2) All-pairs correlation: compute dot product between every pair of feature vectors → 4D tensor (H/8 × W/8 × H/8 × W/8). (3) Correlation pyramid: pool at multiple scales for large displacements. (4) Update GRU: at each iteration, look up correlations around the current flow estimate, combine with context features, output flow delta. The iterative refinement is key: more iterations = better flow, and you can trade compute for accuracy at test time.

Long-range tracking extends beyond consecutive frames. CoTracker (Karaev et al., NeurIPS 2023) tracks multiple points jointly with a transformer, exploiting correlations between nearby tracks. Key insight: if point A moves right, nearby point B probably does too.

TAPIR (Doersch et al., ICCV 2023): Tracking Any Point with per-frame Initialization and temporal Refinement. Two stages: (1) match each query point independently to every frame (coarse, handles occlusion by checking all frames). (2) Temporal refinement: smooth tracks over time, predict visibility score. The key advantage over CoTracker: TAPIR handles very long videos (1000+ frames) because the per-frame matching is independent. CoTracker's joint attention has quadratic cost in sequence length. For short manipulation videos (50-200 frames), CoTracker is better. For long autonomous driving sequences, TAPIR wins.

Scene flow = 3D optical flow. For each point, estimate its 3D motion vector. Requires depth + 2D flow. Critical for dynamic scene understanding. Given 2D flow (dx, dy) and depth maps D1, D2, the 3D motion is: unproject pixel (x,y) at depth D1(x,y) to get point P1. Unproject (x+dx, y+dy) at depth D2(x+dx, y+dy) to get P2. Scene flow = P2 - P1. Noise in depth estimation dominates scene flow quality. Use stereo depth (<1% error) over monocular (>5% error) when possible.

DESIGN

Where tracking fits in robotics:

ApplicationWhat to trackTracker choiceAccuracy need
Contact point trackingGripper-object contactCoTracker (joint, 5-10 pts)<2px at 30fps
Object state estimationObject keypointsCoTracker or TAPIR<5px, through occlusion
Data annotationLabeled keypointsTAPIR (long-range)<3px, 100+ frames
World model trainingDense motion vectorsRAFT (dense flow)Subpixel, every pixel
Sim-to-real alignmentArUco markers / featuresLK flow or TAPIR<1px, metric
Deformable object trackingSurface points on cloth/ropeCoTracker (50+ pts)<5px, handle topology change

Tracking through occlusion is the hard problem. Independent trackers lose the point when it goes behind another object. They have no way to predict where it will reappear. CoTracker uses neighboring tracks to interpolate: if 3 of 5 gripper points are visible, infer the other 2 from the group's motion pattern. TAPIR adds explicit occlusion prediction: the network outputs a visibility score alongside the position. SpatialTracker (2024) reasons in 3D to handle depth ordering — it lifts 2D tracks to 3D using monocular depth, then re-projects.

Tracking through occlusion is the hard problem. Independent trackers lose the point when it goes behind another object. They have no way to predict where it will reappear. CoTracker uses neighboring tracks to interpolate: if 3 of 5 gripper points are visible, infer the other 2 from the group's motion pattern. TAPIR adds explicit occlusion prediction: the network outputs a visibility score alongside the position. SpatialTracker (2024) reasons in 3D to handle depth ordering — it lifts 2D tracks to 3D using monocular depth, then re-projects.

Performance budget: RAFT runs at ~10Hz on a mid-range GPU for 1080p flow. CoTracker tracks 50 points at ~15Hz. For real-time robotics (>30Hz), you may need lighter alternatives: sparse LK flow (~1ms), or run the neural tracker at lower frequency and interpolate between keyframes.

Interview tip: "Implement Lucas-Kanade" is a common coding drill. The key insight to state aloud: "Brightness constancy I(x,t) = I(x+dx,t+1) linearizes to Ixdx + Iydy + It = 0. One equation, two unknowns — underdetermined. LK's trick: assume all pixels in a small window have the same flow, giving a 25×2 overdetermined system solved by least squares." Then write the 10-line implementation. Mention when it fails: aperture problem (edges give only normal flow), large displacements (violate linearization).

CODE

python
def lucas_kanade(I1, I2, pts, win=5):
    """Sparse optical flow via Lucas-Kanade.
    I1, I2: (H,W) grayscale. pts: (N,2) points to track.
    Returns: (N,2) new positions."""
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2.astype('float64') - I1.astype('float64')
    hw = win // 2
    new_pts = []
    for px, py in pts:
        x, y = int(round(px)), int(round(py))
        if y-hw < 0 or y+hw+1 > I1.shape[0] or x-hw < 0 or x+hw+1 > I1.shape[1]:
            new_pts.append([px, py]); continue
        ix = Ix[y-hw:y+hw+1, x-hw:x+hw+1].flatten()
        iy = Iy[y-hw:y+hw+1, x-hw:x+hw+1].flatten()
        it = It[y-hw:y+hw+1, x-hw:x+hw+1].flatten()
        A = np.stack([ix, iy], axis=1)           # (win^2, 2)
        b = -it                                     # (win^2,)
        # Least squares: (A^T A) v = A^T b
        ATA = A.T @ A
        if np.linalg.det(ATA) < 1e-6:
            new_pts.append([px, py]); continue
        v = np.linalg.solve(ATA, A.T @ b)          # (2,) flow
        new_pts.append([px + v[0], py + v[1]])
    return np.array(new_pts)

DEBUG

SymptomDiagnosisFix
Tracking drifts over long sequencesAccumulated small errorsUse long-range tracker (CoTracker, TAPIR); add periodic re-initialization
Lost object during occlusionIndependent tracking can't hallucinateUse CoTracker (joint tracking) or TAPIR (occlusion-aware)
Scene flow is very noisyDepth estimation noiseSmooth depth first; use stereo instead of monocular depth; temporal filtering

FRONTIER

CoTracker v2 (Karaev et al., 2024): faster architecture (causal sliding window instead of full attention), supports online tracking (process frames as they arrive), and better occlusion handling. SpatialTracker (Xiao et al., 2024): lifts 2D tracks to 3D using monocular depth estimation, reasons about depth ordering to handle occlusion geometrically rather than learning it. TAPIR (Doersch et al., ICCV 2023): per-frame independent matching + temporal refinement. Best for very long sequences (1000+ frames) where CoTracker's attention is too expensive. BootsTAP (2024): bootstrapped training — use easy tracks to generate pseudo-labels for hard tracks, improving robustness without manual annotation. DOT (Lemoing et al., 2024): dense optical tracking, tracks every pixel, bridging optical flow and point tracking.

The tracking-for-policy frontier. Robot learning increasingly depends on point tracking for: (1) Imitation learning data: track expert's hand and object contact points to extract manipulation primitives. (2) Reward functions: track goal regions, compute distance-to-goal as reward signal. (3) State estimation: track object keypoints as a compact object state representation. (4) Sim-to-real transfer: track markers in real video, align with simulated trajectories, measure transfer gap. The quality of tracking directly limits the quality of learned policies.

Point Tracking

Click to place tracking points. Objects move. Watch points follow. Occluded points shown as dashed circles. Toggle "Joint" to see CoTracker-style joint reasoning.

Click to add points. Points: 0 | Frame: 0
Interview: You need to track a robot gripper's contact points on a deformable object across a 10-second manipulation video. What tracker do you use and why?

Chapter 10: Scaling & Modern Foundations

CONCEPT

Vocabulary trees solve matching scalability. Hierarchical k-means on descriptors builds a tree. Each image = bag-of-words histogram. Comparing histograms is O(1), so finding candidate pairs goes from O(N²) to O(N log N).

Hierarchical SfM: partition images into clusters (by GPS/time/visual similarity) → SfM per cluster → merge at boundaries through shared landmarks. The merge step is non-trivial: find 3D-3D correspondences between clusters (points seen in both), compute similarity transform (7 DOF: rotation + translation + scale), apply, then run global BA to refine. If the overlap between clusters is thin (<10 shared 3D points), the merge is fragile.

Scaling numbers that interviewers expect you to know:

ScaleMatching strategySfM approachExpected timeOutput quality
100 imagesExhaustive (5K pairs)Incremental (COLMAP)~10 minSub-pixel
1K imagesExhaustive or retrieval (top-50)Incremental~2 hoursSub-pixel
10K imagesRetrieval (top-50)Hierarchical~12 hours~1px
100K imagesRetrieval (top-30)Hierarchical + distributed~3 days (cluster)~1-2px
1M imagesRetrieval (top-20)Hierarchical + distributed~1 week (cluster)~2-5px

DUSt3R (Wang et al., CVPR 2024): feed two images into a ViT, get per-pixel 3D point maps in a shared coordinate frame. No features, no F estimation, no triangulation. A single network. Architecture: two ViT-Large encoders (shared weights) process each image, then cross-attention decoders predict per-pixel 3D coordinates and confidence maps. The output is in the coordinate frame of image 1, so point maps from multiple pairs must be aligned via global optimization. Training data: 12M image pairs from ScanNet++, CO3D, and synthetic datasets. Inference: ~200ms per pair on A100.

MASt3R (Leroy et al., ECCV 2024) extends DUSt3R by adding local feature heads: alongside 3D pointmaps, it outputs per-pixel descriptors for matching. This bridges the gap — you get both learned 3D and classical-style correspondences in one forward pass. For N images, MASt3R-SfM runs all N(N-1)/2 pairs, then aligns pointmaps globally.

VGGT (Wang et al., CVPR 2025): the end-game. Feed N images (up to ~30) in a single forward pass. Multi-image cross-attention across all views simultaneously. Outputs: camera intrinsics, extrinsics, per-pixel depth, per-pixel 3D, and camera-to-world transforms. No iteration, no optimization, no BA. The catch: quadratic memory in N (attention), so N>30 requires chunking. And accuracy still doesn't match geometric BA for precision tasks (<0.5px).

DESIGN

Classical Pipeline
Features
SIFT / SuperPoint
Matching
Brute-force + ratio
RANSAC
Estimate F, reject outliers
Triangulate + BA
Iterative optimization
DUSt3R / VGGT
Images
Raw input pair/set
Neural Network
Single forward pass
3D + Poses
Done. No iteration.

Processing 1M images end-to-end: (1) Run feature extraction on all images in parallel (GPU cluster, ~100 images/GPU/sec with SuperPoint). (2) Build global image retrieval index (NetVLAD or CosPlace). (3) For each image, retrieve top-50 candidates. (4) Match each candidate pair (LightGlue, ~30ms/pair). (5) Geometric verification (estimate F, reject pairs with <15 inliers). (6) Cluster images into subproblems by connectivity. (7) Run SfM per cluster. (8) Merge clusters through shared 3D landmarks. Total: ~24h on a cluster for 1M images.

Interview tip: The "classical vs learned" question is a staff-level differentiator. Don't say "it depends." Have a concrete framework: "I use classical pipelines (COLMAP + SuperPoint/LightGlue) for production systems requiring metric accuracy, reproducibility, and constrained compute. I use DUSt3R/VGGT for rapid prototyping, coarse reconstruction, and cases where matching fails (textureless, repetitive). In production, I hybrid: learned features + classical geometric verification + classical BA. This gives the robustness of learned matching with the precision of geometric optimization."

CODE

python
# Bag-of-words image retrieval sketch
def build_vocab_tree(all_descriptors, k=10, levels=6):
    """Hierarchical k-means on feature descriptors.
    Returns tree structure for fast retrieval."""
    from sklearn.cluster import KMeans
    tree = {}
    def build_level(descs, depth, node_id):
        if depth == levels or len(descs) < k:
            tree[node_id] = {'leaf': True, 'weight': np.log(N_images / (len(descs)+1))}
            return
        km = KMeans(n_clusters=k).fit(descs)
        tree[node_id] = {'centers': km.cluster_centers_}
        for i in range(k):
            mask = km.labels_ == i
            build_level(descs[mask], depth+1, node_id*k+i)
    build_level(all_descriptors, 0, 0)
    return tree

# DUSt3R inference pseudocode
# model = DUSt3R.from_pretrained('naver/DUSt3R_ViTLarge')
# out = model(img1, img2)  # returns pointmaps, confidence
# pts1 = out['pts3d'][0]   # (H, W, 3) per-pixel 3D
# pts2 = out['pts3d'][1]   # already in same coord frame

DEBUG

SymptomDiagnosisFix
Large-scale SfM has disconnected componentsInsufficient overlap between regionsAdd more retrieval candidates; capture bridging images
DUSt3R fails on outdoor scenesTrained mostly on indoor data; domain gapFine-tune on outdoor data; fall back to classical for outdoor
VGGT runs out of memory on N>20 imagesQuadratic attention on large setsProcess in overlapping subsets; merge with global alignment

FRONTIER

VGGT (CVPR 2025): feed N images, get everything in one pass. Trained on millions of video frames + synthetic data. The model learns camera intrinsics, extrinsics, depth, and 3D points as separate output heads from the same transformer backbone. Achieves competitive results with COLMAP on indoor scenes, but still behind on outdoor scenes with large viewpoint changes.

MASt3R-SfM (Duisterhof et al., 2024): combines MASt3R pairwise predictions with classical global alignment. For each pair, MASt3R predicts pointmaps + features. Then: build a pose graph from pairwise transforms, optimize globally (like classical SfM, but starting from much better initialization). Bridges the speed of learned methods with the precision of geometric optimization.

Metric3Dv2 (Hu et al., 2024): metric depth from a single image. Uses a ViT backbone with a canonical camera model to predict absolute metric depth (not just relative). Accuracy: ~5% relative error on indoor scenes. This enables monocular SLAM with metric scale — no stereo or IMU needed for coarse depth. Limitation: accuracy degrades significantly outdoors at distances >20m.

Depth Pro (Apple, 2024): fast monocular metric depth at 2K resolution. Key innovation: multi-scale ViT with a "focal length estimation" head that predicts intrinsics from the image, enabling metric depth without camera calibration. Runs at ~1fps on M2 Max for 2048×1536 input. Useful for: casual 3D capture (iPhone), approximate 3D from web images, depth supervision for 3DGS training when no LiDAR is available.

Classical vs Learned Pipeline Race

Watch classical SfM (multi-step) race against DUSt3R (single forward pass). The learned approach finishes at ~30% of classical time.

Click "Run Comparison" to start
Interview: A colleague suggests replacing your entire COLMAP pipeline with DUSt3R. What are the tradeoffs? When would you agree and when would you push back?

Chapter 11: Full System Simulation [SHOWCASE]

This is the capstone. You have a multi-camera robot perception pipeline. Every stage is interconnected. Break one, watch the cascade. Fix it, watch recovery. This is what debugging feels like on a real robot team.

Why cascade effects are multiplicative, not additive. Calibration error of 1px causes features to project to slightly wrong locations. This doesn't just add 1px to matching error — it causes some correct matches to fail the epipolar constraint (they become "outliers"), which reduces inlier ratio, which forces RANSAC to run more iterations, which either slows down the system or causes it to give up. A 1px calibration error can reduce matching quality by 20-30%, which reduces reconstruction completeness by 40%, which creates holes in the map, which causes the planner to take a longer path, which means more opportunities for drift. The staff engineer's insight: always debug upstream first.

Error propagation formula (rule of thumb): If calibration is off by σcalib pixels, triangulation error at depth Z is approximately σ3D ≈ Z² · σcalib / (f · B). At Z=5m, f=800, B=0.2m, σcalib=2px: σ3D = 25 × 2 / 160 = 0.31m. That's 31cm at 5 meters — enough to miss a doorway. This calculation is what you show the team when arguing for better calibration.

The pipeline: Cameras → Calibration → Feature extraction → Matching → Pose estimation (VO) → Mapping (3D reconstruction) → Tracking. Each stage has a noise slider. Crank one up and watch everything downstream degrade.
Perception Pipeline Simulator

Adjust noise at each pipeline stage. Watch metrics propagate. "Auto-run" traces a trajectory and shows real-time quality. Click any stage to see its effect in isolation.

Calibration noise0.0
Feature noise0.0
Matching outlier %10
Motion speed3
Reproj: 0.0px | Drift: 0.0% | Recon: 100% | Tracking: 100%

What to notice as you experiment:

The classic production debugging protocol:

1. Check calibration
Project known 3D landmarks, measure reprojection error. <0.5px = good, >2px = recalibrate.
2. Check time sync
Verify hardware trigger is firing. Check timestamp differences between cameras. <1ms = good.
3. Check features
Count features per frame. <100 = problem (dirty lens? exposure too high? featureless scene?).
4. Check matching
Inlier ratio after RANSAC. >60% = good, <30% = matching/scene problem.
5. Check VO output
Compare trajectory to ground truth (if available). RPE per frame should be <1° rotation, <1% translation.
The debugging strategy: When the pipeline fails, check stages in order. Start with calibration (cheapest to verify). Then check feature counts. Then matching quality. Then VO output. Don't jump to "the neural network is wrong" — 90% of the time it's calibration or sync.
Interview tip: When you get a "the system is broken, debug it" question, resist the urge to jump to the ML component. State your protocol: "Step 1: check calibration (reprojection error on known targets). Step 2: check time sync (compare timestamps). Step 3: check features (count per frame, distribution). Step 4: check matching (inlier ratio). Step 5: check VO (drift rate against any ground truth)." This structured approach shows you've actually triaged production systems, not just trained models.

Common cascade patterns from production:

Root causeFirst symptomCascade effectHow to isolate
Camera shifted 0.5mmReproj error 3pxStereo fails → VO drift 5x → map corruptProject checkerboard, compare to stored baseline
Time sync cable loose5ms desyncStereo matches are wrong → depth errors → phantom obstaclesCompare timestamps, look for systematic match offset
Lens dirtyFeature count drops 60%Matching fails on dirty regions → incomplete reconstructionFeature heatmap: are they concentrated in one area?
New environment (outdoor)Sky saturates, features sparseVO fails in open areas → fall back to IMU only → scale driftCheck feature distribution: top of image blank?
Firmware update changed exposureImages brighter, motion blurFeatures less repeatable → matching quality drops → RANSAC needs more iterationsCompare feature repeatability metric before/after

The "classical vs learned" switch: Your pipeline has a decision point at each stage. When the classical component fails (feature count too low, RANSAC inlier ratio below threshold, VO drift exceeding budget), you can dynamically switch to a learned alternative. This hybrid approach is what production systems actually use. Pure classical is fragile. Pure learned is unpredictable. The combination gives reliability with graceful degradation.

Monitoring and alerting in production: A staff engineer doesn't just build the pipeline — they instrument it. Key metrics to track continuously:

MetricNormal rangeAlert thresholdAction
Reprojection error (px)< 0.5> 1.5Trigger recalibration
Feature count per frame> 200< 80Check lens, exposure, scene
Match inlier ratio> 60%< 30%Tighten ratio test, switch matcher
VO relative pose error< 0.5°/frame> 2°/frameFall back to IMU-only, flag for review
Loop closure frequency1 per loop0 after full loopCheck place recognition, verify map
Gaussian count (3DGS)100K-2M> 5M or < 10KCheck densification params, input quality

Graceful degradation hierarchy: When a component fails, the system doesn't crash — it falls back. Full pipeline working: stereo VIO + 3DGS mapping + CoTracker. Feature extraction fails: switch to direct VO (photometric). Stereo matching fails: fall back to mono + IMU (lose metric reconstruction). IMU fails: mono VO only (lose scale, accept drift). Everything fails: dead reckoning from last known pose + wheel odometry. A staff engineer designs all these fallback paths before the first failure happens.

Interview: Your robot's perception pipeline suddenly starts producing noisy 3D reconstructions. The code hasn't changed. Where do you look first?

Chapter 12: Interview Arsenal

1. Cheat Sheet

Concept30-sec explanationKey equationToolClassic paperModern paper
Pinhole camera3D → 2D via perspective divide + intrinsics Kp = K[R|t]XOpenCVHartley&Zisserman 2003Self-calibration via DUSt3R
Epipolar constraintMatching point lies on a line in other imagex2TEx1=0OpenCVLonguet-Higgins 1981OANet (CVPR 2019)
Feature matchingDetect keypoints, compute descriptors, matchratio test < 0.75hlocSIFT (Lowe 2004)LightGlue (ICCV 2023)
SfMImages → camera poses + sparse 3D pointsPnP + BACOLMAPCOLMAP (CVPR 2016)DUSt3R (CVPR 2024)
Bundle adjustmentMinimize reprojection error over all viewsmin ∑ ‖x-π(C,X)‖²CeresTriggs et al. 2000DROID-SLAM (2021)
SLAMVO + map + loop closurepose graph optimizationORB-SLAM3ORB-SLAM (TRO 2015)Gaussian SLAM (2024)
NeRFScene as implicit MLP, train via volume renderingC=∑TiαiciNerfstudioNeRF (ECCV 2020)Zip-NeRF (2023)
3DGSScene as explicit Gaussians, differentiable rasterizationsplat + alpha blendgsplat3DGS (SIGGRAPH 2023)2DGS / 4DGS (2024)
Optical flowPer-pixel 2D motion between consecutive framesI(x,t) = I(x+dx,t+dt)RAFTHorn-Schunck 1981RAFT (ECCV 2020)
Point trackingTrack any point across many framescorrelation + GRUCoTrackerKLT (1981)CoTracker v2 (2024)

2. System Design Questions

Q1: "Design a perception system for a mobile manipulator in a warehouse."
Answer framework: Sensors (stereo pair + fisheye + IMU on base; wrist camera for manipulation). Calibration (factory intrinsic/extrinsic; online self-calibration check every hour). Localization (VIO for real-time pose; SLAM with vocabulary tree loop closure). Mapping (3DGS for digital twin; occupancy grid for planning). Tracking (CoTracker for contact points during grasping). Tradeoffs: classical VO for reliability vs learned for accuracy; metric vs visual SLAM; real-time constraints.
Q2: "You need to build a 3D reconstruction pipeline that processes 10K images from a drone."
Answer: Image retrieval (NetVLAD/CosPlace, top-50 per image) → feature extraction (SuperPoint) → matching (LightGlue on candidate pairs) → COLMAP incremental SfM → 3DGS for dense reconstruction. Key decisions: hierarchical clustering by GPS to parallelize, global BA at the end, Schur complement solver, output both mesh (Poisson) and Gaussian representation.
Q3: "How would you add metric scale to a monocular SLAM system?"
Answer: Three options ranked by reliability: (1) IMU fusion (VIO) — accelerometer gives metric acceleration, integrate for scale. (2) Known landmarks — AprilTags or known object dimensions. (3) Learned metric depth (Metric3Dv2, Depth Pro) as a prior. IMU is the standard in industry because it's always available and doesn't require scene knowledge.
Q4: "Your team wants to replace the classical VO with DPVO. Walk me through the evaluation."
Answer: Run both on internal benchmark (representative of deployment). Metrics: ATE (absolute trajectory error), RPE (relative pose error), runtime, latency, GPU memory. Check failure modes: fast rotation, low light, featureless environments. Run A/B on the robot with safety driver. Only deploy if ATE improves by >20% and no new failure modes appear. Keep classical as fallback.
Q5: "Design a multi-floor indoor mapping system for a delivery robot."
Answer framework: Sensors: stereo fisheye pair (maximum coverage in corridors) + IMU + wheel encoders. Localization: VIO for within-floor navigation. Floor transitions: detect elevator entry/exit via IMU (acceleration spike), switch to new sub-map. Map representation: per-floor 2D occupancy grid for planning + 3D sparse map for localization. Global consistency: maintain an "atlas" (ORB-SLAM3 style) with inter-floor loop closures at elevator landings. Key challenge: long featureless corridors cause VO failure — use wheel odometry + IMU dead reckoning as fallback, with loop closure at distinctive locations (room entrances, intersections).
Q6: "How would you evaluate whether a new dense reconstruction method (e.g., 2DGS) is ready for production?"
Answer: Define success criteria upfront: (1) geometric accuracy: Chamfer distance < 5cm against LiDAR ground truth. (2) Completeness: < 5% missing surface area. (3) Training time: < 15min for a single room. (4) Rendering speed: > 30fps at deployment resolution. (5) Robustness: must handle at least 3 of these: reflective surfaces, textureless walls, thin structures, moving objects. Run on 20 representative scenes from actual deployment environments, not just benchmark datasets. Compare against current method (MVS, NeRF, or 3DGS). Only ship if it beats on the primary metric AND doesn't regress on any secondary metric by > 10%.

3. Coding Drills

Drill 1: "Implement projection." (Core 5 lines + discuss coordinate conventions)
X_cam = R @ X + t; x_n = X_cam[:2] / X_cam[2]; px = K[:2,:2] @ x_n + K[:2,2]
While coding, discuss: OpenCV vs OpenGL coords, distortion, when Z<0 means behind camera.
Drill 2: "Implement the 8-point algorithm." (Core 15 lines)
Build A matrix from correspondences, SVD solve, enforce rank-2, Hartley normalization.
While coding: why normalization matters (condition number), what rank-2 means geometrically.
Drill 3: "Implement triangulation from two views." (Core 10 lines)
DLT: build 4×4 A from projection matrices and points, SVD, dehomogenize.
While coding: when triangulation fails (small baseline), geometric interpretation.
Drill 4: "Implement Lucas-Kanade optical flow." (Core 12 lines)
Gradient computation, windowed least squares on brightness constancy.
While coding: aperture problem, window size tradeoff, when ATA is singular.
Drill 5: "Implement volume rendering." (Core 8 lines)
Transmittance, opacity from density, weighted color sum along ray.
While coding: why transmittance is a cumulative product, why we stop early (transmittance < threshold), how this makes NeRF differentiable.
Drill 6: "Implement Rodrigues rotation." (Core 6 lines)
Angle-axis to rotation matrix: R = I + sinθ[k]× + (1-cosθ)[k]ײ.
While coding: singularity at θ=0 (handle with Taylor approximation), why axis-angle is better than Euler angles for optimization (no gimbal lock, minimal parameterization).

4. Debugging Scenarios

Scenario 1: "Images from two cameras show the same scene, but reprojecting 3D points from one camera into the other gives 5-pixel error."
Diagnosis chain: Check extrinsic calibration (most likely). Verify by projecting checkerboard corners. If correct, check intrinsics (focal length drift from temperature). If still wrong, check time sync (are cameras triggered simultaneously?).
Scenario 2: "SfM reconstruction of a building has two halves that don't connect."
Diagnosis: Insufficient image overlap between the two sides. Image retrieval didn't find cross-connections. Fix: capture images that see both sides (bridging images). Or: manually specify additional image pairs for matching.
Scenario 3: "VO drift rate increased from 0.5% to 3% after updating the feature extractor."
Diagnosis: New features are less repeatable or produce worse matches. Check: number of features per frame, inlier ratio after RANSAC, angular error per frame. Compare feature distributions (are they clustered? spread across image?). Roll back and A/B test.
Scenario 4: "3DGS training produces good novel views but the geometry (depth) is wrong."
Diagnosis: 3DGS optimizes for appearance, not geometry. Gaussians can "cheat" by being large, semi-transparent, and overlapping. Fix: add depth supervision (from LiDAR or MVS depth). Use 2DGS (flat surfels) for better geometry. Increase opacity regularization.
Scenario 5: "Your SLAM system works perfectly in the office but fails in the parking garage."
Diagnosis chain: (1) Lighting: parking garages have extreme dynamic range (dark interior, bright entry). Check exposure — is auto-exposure hunting? Fix: manual exposure or HDR. (2) Features: concrete walls are textureless. Check feature count per frame. If <100, switch to direct VO or add a wider-FOV camera. (3) Repetitive structure: parking garage pillars all look identical. Loop closure will fire false positives. Fix: add geometric verification with higher inlier threshold, or use NetVLAD with a parking-garage-specific vocabulary.
Scenario 6: "Point tracking with CoTracker works in simulation but fails on real robot video."
Diagnosis: Sim-to-real gap. Possible causes: (1) Motion blur from real camera (sim has none) — CoTracker was trained on clean video. Fix: reduce exposure time, add motion blur augmentation at fine-tuning. (2) Resolution mismatch: sim renders at 512×512, real camera is 1280×960. Resize or crop to match training distribution. (3) Occluder behavior: real objects have complex self-occlusion patterns not in sim training data. Fix: fine-tune on real video with pseudo-ground-truth from TAPIR + manual correction.

5. Classical vs Modern

TaskClassicalModern (Learned)When to use which
Features + matchingSIFT + brute-forceSuperPoint + LightGlueLearned for accuracy; classical for no-GPU or real-time on CPU
Relative pose5-pt + RANSACDUSt3R pointmapsClassical for sub-pixel precision; learned for speed and robustness
SfMCOLMAPDUSt3R / VGGTClassical for metric accuracy; learned for rapid prototyping
Dense reconstructionMVS + Poisson mesh3DGS / NeRFMVS for offline mapping; 3DGS for real-time rendering + simulation
SLAMORB-SLAM3DROID-SLAMORB-SLAM3 for production (stable); DROID for research (accuracy)
Optical flowLucas-KanadeRAFTLK for sparse real-time; RAFT for dense accuracy
Point trackingKLT trackerCoTrackerKLT for speed; CoTracker for occlusion robustness
Place recognitionDBoW2 (vocab tree)AnyLoc / CricaVPRDBoW2 for low compute; learned for cross-domain / condition change

6. Recommended Reading

THE book: Hartley & Zisserman, Multiple View Geometry in Computer Vision (2003). Read chapters 9-12 (fundamental matrix, cameras, 3D reconstruction, structure computation) cover-to-cover. The rest is reference material. Also recommended: Szeliski, Computer Vision: Algorithms and Applications (2nd ed., 2022) — broader scope, more accessible, free online.

5 must-read papers:

  1. Triggs et al., "Bundle Adjustment — A Modern Synthesis" (2000) — optimization fundamentals
  2. ORB-SLAM3 (Campos et al., TRO 2021) — production SLAM system design
  3. NeRF (Mildenhall et al., ECCV 2020) — neural scene representation
  4. 3D Gaussian Splatting (Kerbl et al., SIGGRAPH 2023) — real-time neural rendering
  5. DUSt3R (Wang et al., CVPR 2024) — end-to-end 3D from image pairs

5 must-study repos:

  1. COLMAP (github.com/colmap/colmap) — SfM + MVS gold standard
  2. hloc (github.com/cvg/Hierarchical-Localization) — SuperPoint + LightGlue + retrieval
  3. gsplat (github.com/nerfstudio-project/gsplat) — 3DGS research framework
  4. DROID-SLAM (github.com/princeton-vl/DROID-SLAM) — learned dense SLAM
  5. DUSt3R (github.com/naver/dust3r) — feed-forward 3D from images

7. Related Lessons

Deep-dive into specific topics:

8. Latency Budget Reference

Staff-level interviewers love timing questions. Know these numbers:

OperationCPU timeGPU timeNotes
FAST corner detection (2K pts)~2msPer 1280×960 image
ORB descriptor (2K pts)~5ms256-bit binary
SuperPoint (2K pts)~50ms~5ms256-dim float
Brute-force matching (2K×2K)~15ms~1msL2 distance
LightGlue matching~30msWith early exit
RANSAC + 5-pt (500 matches)~5ms100 iterations
PnP + RANSAC~2msP3P, 50 iterations
Local BA (10 keyframes)~50msCeres, sparse Schur
3DGS rendering (1080p)~5msReal-time
DUSt3R inference (1 pair)~200msViT-Large, A100
RAFT optical flow (1080p)~100ms12 iterations
CoTracker (50 pts, 50 frames)~300msBatch processing

Bonus: Quick Reference Equations

Projection: p = K [R | t] Xh
Epipolar: x2T E x1 = 0   where   E = [t]×R
Fundamental: F = K2−T E K1−1
BA cost: min ∑i,j ρ(‖ xij − π(Ci, Xj) ‖²)
Schur: Δc = (U − WV−1WT)−1(ec − WV−1ep)
Volume rendering: C(r) = ∑i Ti αi ci
Brightness constancy: I(x,y,t) = I(x+dx, y+dy, t+dt)
Harris: R = λ1λ2 − k(λ12
Rodrigues: R = I + sinθ [k]× + (1−cosθ) [k]ײ
Triangulation uncertainty: σZ = Z² · σpx / (f · B)
RANSAC iterations: k = log(0.01) / log(1 − ws)
LM update: (JTJ + λI) Δ = −JTr
3DGS alpha compositing: C = ∑i ci αij<i(1 − αj)
Gaussian projection: Σ2D = J · R · Σ3D · RT · JT
Pre-integration: ΔRij = ∏k=ij-1 Exp((ωk − bg) Δt)
Scale drift (mono): σscale ≈ 1–5% per 100m without IMU
Sampson distance: dS = (p2TFp1)² / ((Fp1)1² + (Fp1)2² + (FTp2)1² + (FTp2)2²)

Bonus: Interview Day Checklist

Before the interview:
• Be able to draw the full pipeline (cameras → calibration → features → matching → pose → BA → reconstruction) from memory
• Have 2–3 war stories ready: "A time I debugged a subtle geometry bug" (calibration drift, coordinate frame mismatch, RANSAC failure)
• Know the tradeoff table (classical vs learned) cold
• Be ready to implement: projection, 8-point, triangulation, Lucas-Kanade on a whiteboard
• Have an opinion on DUSt3R/VGGT vs classical (not "it depends" but "here's when I'd use each and why")
• Know actual numbers: COLMAP processes ~100 images/hour, 3DGS trains in 10min, ORB-SLAM3 runs at 30fps, LightGlue matches in 30ms
• Understand the Schur complement well enough to explain it in 2 minutes with a worked example
• Know coordinate frame conventions: OpenCV (Z-forward, Y-down), OpenGL (Z-backward, Y-up), ROS (X-forward, Z-up)
• Know memory/compute costs: full Hessian vs Schur-reduced, feature extraction per image, BA per keyframe
• Practice the debugging protocol: calibration → sync → features → matching → pose → map
During the interview:
• Think aloud. "I'd start by checking calibration because that's the cheapest to verify and causes the most downstream damage."
• State assumptions. "I'm assuming we have calibrated cameras, so I'll use E instead of F."
• Give numbers. "RANSAC with 40% outliers needs ~400 iterations at ~1ms each, so ~400ms per pair."
• Mention failure modes proactively. "This works unless the scene is planar, in which case F is degenerate and I'd switch to a homography."
• Connect to production. "In practice, I'd use Ceres with SPARSE_SCHUR solver and Cauchy robust loss."

9. Common Misconceptions (and Correct Answers)

MisconceptionRealityWhy it matters
"More features = better reconstruction"Quality matters more than quantity. 500 well-distributed features beat 5000 clustered on one texture.Guides feature extraction tuning
"NeRF understands 3D geometry"NeRF optimizes for appearance (photometric loss). Geometry is a byproduct, often inaccurate.Don't trust NeRF depth for collision checking
"SLAM and SfM are the same thing"SfM is offline (all images available). SLAM is online (causal, can't look ahead). Different optimization strategies.System design implications
"Learned models will replace classical geometry"Classical gives guarantees (metric accuracy, consistency). Learned gives robustness. Best systems are hybrid.Staff-level perspective
"Calibration is a solved problem"Static calibration is solved. Online calibration under thermal drift, vibration, and wear is an active research area.Production vs research gap
"BA always converges"BA is non-convex. Bad initialization, insufficient observations, or wrong distortion model can cause divergence.Debug skill
"3DGS is always better than NeRF"3DGS struggles with thin structures, transparency, and view-dependent effects. NeRF handles these better with continuous volume sampling.Method selection

10. The Staff Engineer Mindset

The difference between a senior and a staff engineer in 3D geometry is not algorithmic knowledge — it's systems thinking. A senior engineer implements the 8-point algorithm correctly. A staff engineer asks: "Should we even be estimating F, or should we use a learned matcher that gives us correspondences good enough to skip geometric verification entirely? What's the failure rate? What's the latency budget? What happens when the learned model encounters an out-of-distribution scene? What's the fallback?"

The five habits of a staff geometry engineer:

  1. Instrument everything. Log reprojection error, feature count, inlier ratio, pose delta, and rendering quality at every stage. Anomalies in these metrics catch failures before the robot crashes.
  2. Design for failure. Every component has a fallback. Feature matching fails? Fall back to direct alignment. VO fails? Fall back to IMU. IMU fails? Fall back to wheel odometry. Define the cascade before the first failure.
  3. Benchmark on YOUR data. Academic benchmarks (KITTI, ETH3D, ScanNet) are necessary but not sufficient. Your warehouse/factory/kitchen has unique challenges. Build an internal benchmark from production data.
  4. Know the numbers. How fast is each component? How much memory? What's the accuracy? A staff engineer can estimate system latency and memory from first principles: "6 cameras × 1.2M pixels × 3 channels × 4 bytes = 86MB/frame. Feature extraction at 5ms/image = 30ms. Total perception latency budget: 50ms to stay at 20Hz."
  5. Read papers, but ship code. Follow CVPR/ECCV/ICCV, but evaluate every paper against: "Does this solve a problem I have? What's the integration cost? Does it work on my data domain?" Most papers don't survive contact with production.
"The fundamental matrix is the algebraic representation of epipolar geometry." — Richard Hartley
"What I cannot create, I do not understand." — Richard Feynman. Every algorithm in this lesson has a from-scratch implementation. If you can code it, you can debug it. If you can debug it, you can ship it.
Final check: You're 30 minutes into a staff interview. The interviewer draws a box labeled "SfM" and says "tell me how you'd make this work for 100K images on a drone survey where GPS is unreliable and the terrain is mostly flat farmland." What's the first failure mode you call out?