Recovering per-pixel depth from stereo pairs, multiple views, structured light, and time-of-flight sensors.
Hold one finger at arm's length and close one eye, then the other. Your finger appears to jump sideways. That jump — the disparity — is smaller for far objects and larger for close ones. Your brain fuses these two views into a sense of depth.
A stereo camera pair does the same thing. Given two images of the same scene taken from slightly different positions, we can compute a disparity map that tells us how far each pixel shifted, and from that, how far each surface point is from the camera.
A closer object has a larger disparity (shift between views).
In a general stereo pair, epipolar lines can run in any direction. Searching along arbitrary lines for matches is slow and error-prone. Rectification warps both images so that corresponding epipolar lines become horizontal scan lines. After rectification, matching reduces to a simple 1D search along each row.
The relationship between depth Z and disparity d in a rectified pair is simple:
d = f·B/Z
where f is the focal length (in pixels), B is the baseline (distance between the camera centers), Z is the depth, and d is the disparity (in pixels). A larger baseline gives a larger disparity at the same depth, and therefore better depth precision.
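Rearranging the rectified-pair relation d = f·B/Z for depth gives Z = f·B/d. The following is a minimal sketch of that conversion; the function name and the example rig values (f = 700 px, B = 0.12 m) are illustrative assumptions, not values from this chapter.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (metres) via Z = f * B / d."""
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(d.shape, np.inf)
    valid = d > 0  # zero disparity corresponds to a point at infinity
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Hypothetical rig: f = 700 px, B = 0.12 m.
depths = disparity_to_depth([84.0, 42.0, 8.4, 0.0], focal_px=700.0, baseline_m=0.12)
print(depths)
```

Halving the disparity doubles the depth, and pixels with no measurable shift are reported as infinitely far away rather than dividing by zero.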
| Step | What Happens |
|---|---|
| 1. Rotate cameras | Warp both images so their optical axes are perpendicular to the baseline. |
| 2. Align vertically | Make the up vectors perpendicular to the baseline, so epipolar lines are horizontal. |
| 3. Scale | Re-scale if focal lengths differ, to avoid aliasing. |
After rectification, matching is a 1D search along horizontal scan lines. The green line shows the matching scan line.
Depth resolution: Since d = f·B/Z, a disparity change Δd corresponds to a depth change of approximately ΔZ ≈ (Z²/(f·B))·Δd. For integer-pixel disparities (Δd = 1), the depth quantization at distance Z is therefore ΔZ ≈ Z²/(f·B), which grows quadratically with distance. This means far objects have poor depth resolution, the classic limitation of stereo. Wider baselines help, but make matching harder (more occlusion, more appearance change).
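The quantization formula follows from differentiating Z = f·B/d with respect to d: |dZ/dd| = f·B/d² = Z²/(f·B). A short sketch makes the quadratic growth concrete; the rig values are the same hypothetical ones (f = 700 px, B = 0.12 m), and the function name is an assumption.

```python
def depth_step(depth_m, focal_px, baseline_m, disparity_step_px=1.0):
    """Approximate depth change for one disparity step: dZ ~ Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_step_px

# Depth quantization at a few distances for the hypothetical rig.
for z in (1.0, 5.0, 20.0):
    print(f"Z = {z:5.1f} m  ->  dZ ~ {depth_step(z, 700.0, 0.12):.3f} m")
```

Doubling the distance quadruples the depth step, while doubling the baseline or using half-pixel disparity steps halves it.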
The first step in any stereo algorithm is deciding how to compare pixels. Given a pixel in the left image, how do you score each candidate match in the right image? This is the matching cost.
| Metric | Formula | Trade-off |
|---|---|---|
| SSD | (I_L − I_R)² | Simple, sensitive to brightness changes |
| SAD | \|I_L − I_R\| | More robust than SSD to outliers |
| NCC | Normalized cross-correlation | Handles brightness/contrast variation |
| Census | Bit pattern of neighbors above/below center | Robust to illumination changes |
| Learned | Neural network patch comparison | Best discriminability, needs training data |
The census transform converts each pixel's neighborhood into a binary string: each neighbor gets a 1 if it is brighter than the center pixel, 0 otherwise. Two pixels are compared by counting how many bits differ (Hamming distance). Because only the ordering of intensities matters, the measure is invariant to any monotonically increasing change in brightness, which makes it well suited to outdoor stereo where lighting varies between cameras.
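A minimal NumPy sketch of the census transform and its Hamming-distance cost is given here. The window radius, the edge-replicating border handling, and the function names are assumptions made for illustration.

```python
import numpy as np

def census_transform(img, radius=1):
    """One bit per neighbor: True where the neighbor is brighter than the center."""
    img = np.asarray(img, dtype=np.float64)
    padded = np.pad(img, radius, mode="edge")  # assumed border handling
    h, w = img.shape
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # the center pixel is not compared with itself
            neighbor = padded[radius + dy : radius + dy + h,
                              radius + dx : radius + dx + w]
            bits.append(neighbor > img)
    return np.stack(bits, axis=-1)  # shape (H, W, number_of_neighbors)

def hamming_cost(census_a, census_b):
    """Number of differing bits between two census descriptors."""
    return np.count_nonzero(census_a != census_b, axis=-1)
```

Applying a gain and offset to the image (for example `2 * img + 10`) leaves every comparison unchanged, so the descriptors and the resulting costs are identical.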
The cost profile along one scan line. The correct disparity has the lowest cost — but noise creates false minima.
The simplest approach: sum the matching cost over a window around each pixel, then pick the disparity with the lowest total cost. This is winner-take-all (WTA) with an aggregation window.
The choice of window matters enormously:
| Method | Idea |
|---|---|
| Fixed square window | Sum costs over N×N block. Fast (box filter), but blurs edges. |
| Shiftable window | Try all positions of a window around the pixel, take the minimum. Slightly edge-aware. |
| Adaptive weight | Weight neighbors by color similarity. Sharp edges, higher cost. |
| Guided filter | Use the color image to guide cost aggregation. State of the art among local methods. |
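The fixed-square-window variant with winner-take-all selection can be sketched as follows. This is an illustrative implementation under stated assumptions: SAD as the matching cost, a box sum computed with an integral image, and a constant penalty of 255 where the candidate match falls outside the right image. The function names are hypothetical.

```python
import numpy as np

def box_sum(cost, radius):
    """Sum a 2D cost slice over a (2r+1) x (2r+1) window using an integral image."""
    k = 2 * radius + 1
    padded = np.pad(cost, radius, mode="edge")
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    return (integral[k:, k:] - integral[:-k, k:]
            - integral[k:, :-k] + integral[:-k, :-k])

def block_match_wta(left, right, max_disp, radius=2):
    """Integer disparity for the left image: SAD cost, box aggregation, WTA."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    volume = np.empty((max_disp + 1, h, w))
    for d in range(max_disp + 1):
        # A left pixel at column x is compared with the right pixel at column x - d.
        cost = np.full((h, w), 255.0)  # assumed penalty where x - d is off the image
        cost[:, d:] = np.abs(left[:, d:] - right[:, : w - d])
        volume[d] = box_sum(cost, radius)
    return volume.argmin(axis=0), volume
```

The function returns both the disparity map and the aggregated cost volume, since the volume is what sub-pixel refinement and confidence estimation operate on. A larger `radius` suppresses noise in textureless regions at the price of blurred depth edges.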
Sub-pixel refinement: Most local methods estimate integer disparities. For smooth rendering, you fit a parabola to the cost curve around the winning disparity and find the sub-pixel minimum. Without this, view synthesis shows distracting "layering" artifacts.
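The parabola fit has a closed form: with costs c₋, c₀, c₊ at disparities d−1, d, d+1, the vertex lies at d + (c₋ − c₊) / (2·(c₋ − 2c₀ + c₊)). A minimal sketch, with a hypothetical function name and the assumption that border and non-convex cases simply keep the integer estimate:

```python
def subpixel_disparity(cost_curve, d):
    """Refine integer disparity d with a parabola through costs at d-1, d, d+1."""
    if d <= 0 or d >= len(cost_curve) - 1:
        return float(d)  # a neighbor is missing on one side; keep the integer value
    c_prev, c_best, c_next = cost_curve[d - 1], cost_curve[d], cost_curve[d + 1]
    curvature = c_prev - 2.0 * c_best + c_next
    if curvature <= 0:
        return float(d)  # flat or non-convex: the parabola has no minimum
    return d + 0.5 * (c_prev - c_next) / curvature

# Costs sampled from a parabola whose true minimum lies between integers.
costs = [(x - 3.3) ** 2 for x in range(8)]
best = min(range(len(costs)), key=costs.__getitem__)
print(best, subpixel_disparity(costs, best))
```

When the cost curve really is quadratic near the minimum, the fit recovers the true location exactly; for real cost curves it is an approximation whose offset always stays within half a pixel of the winning disparity.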
Compare a small vs large aggregation window near a depth edge. Small windows preserve the edge; large windows blur it.
Confidence and uncertainty: After computing disparities, how confident are you in each estimate? The curvature of the cost function at the winning disparity indicates confidence. A sharp, deep minimum means a strong match (high texture, low ambiguity). A flat minimum means the match is uncertain (textureless region, repetitive pattern). Stereo confidence maps are critical for downstream tasks: uncertain depths should be smoothed, filled, or excluded.
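One simple way to turn the curvature idea into a confidence map is the second difference of the cost curve at the winning disparity. The sketch below assumes a cost volume of shape (D, H, W) with at least three disparity levels; the function name and the choice to assign zero confidence to winners on the first or last disparity are assumptions.

```python
import numpy as np

def curvature_confidence(volume):
    """Second difference of the cost curve at the winning disparity, per pixel.

    Larger values mean a sharper minimum. Winners on the first or last
    disparity level have no two-sided neighborhood and get confidence 0.
    """
    volume = np.asarray(volume, dtype=np.float64)
    num_disp = volume.shape[0]
    best = volume.argmin(axis=0)
    inner = np.clip(best, 1, num_disp - 2)  # keep both neighbors inside the volume

    def cost_at(index):
        return np.take_along_axis(volume, index[None], axis=0)[0]

    curvature = cost_at(inner - 1) - 2.0 * cost_at(inner) + cost_at(inner + 1)
    curvature[best != inner] = 0.0
    return np.maximum(curvature, 0.0)
```

Thresholding this map gives a mask of pixels whose depths can be trusted; the remaining pixels are candidates for smoothing, filling, or exclusion.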
Instead of making a local decision at each pixel, global methods define an energy function over the entire disparity map and find the assignment that minimizes it:
E(d) = Σ_p C(p, d_p) + λ · Σ_(p,q)∈N V(d_p, d_q)
The first term is the data cost: how well each pixel p matches at its assigned disparity d_p. The second is the smoothness cost: a penalty V for neighboring pixels p and q (the pairs in N) having different disparities, weighted by λ.
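To make the two terms concrete, the sketch below evaluates such an energy for a given integer disparity map. It assumes a precomputed cost volume of shape (D, H, W) for the data term, a 4-connected neighborhood, and an absolute-difference smoothness penalty |d_p − d_q|; the text does not fix the penalty, so that choice and the function name are assumptions.

```python
import numpy as np

def disparity_energy(volume, disparity, lam):
    """Data cost at the assigned disparities plus lam times a smoothness penalty.

    volume:    (D, H, W) matching costs.
    disparity: (H, W) integer disparity assignment.
    The smoothness penalty is |d_p - d_q| over 4-connected neighbor pairs
    (an assumed choice for this sketch).
    """
    disparity = np.asarray(disparity)
    data = np.take_along_axis(volume, disparity[None], axis=0).sum()
    smooth = (np.abs(np.diff(disparity, axis=0)).sum()
              + np.abs(np.diff(disparity, axis=1)).sum())
    return data + lam * smooth
```

With `lam = 0` the minimizer is just per-pixel winner-take-all; increasing `lam` trades matching fidelity for smoother disparity maps.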
| Optimizer | Approach | Quality vs Speed |
|---|---|---|
| Dynamic programming | Optimal 1D path per scan line | Fast, but "streaking" artifacts between rows |
| SGM | Multiple 1D paths summed | Fast + good quality. Industry standard. |
| Graph cuts | Min-cut on a graph | High quality, slow |
| Belief propagation | Message passing on MRF | High quality, slow |
SGM runs 1D optimization along multiple directions and sums the results. More paths = better approximation of 2D global optimum.
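The per-path recurrence can be sketched on a single scan line. This is a simplified illustration, not a full SGM implementation: it aggregates along only two directions (left-to-right and right-to-left) instead of the 8 or 16 image paths typically used, and the penalties `p1` (disparity change of one level) and `p2` (larger jumps) follow common SGM convention but are assumptions here, as are their default values.

```python
import numpy as np

def aggregate_path(cost_row, p1, p2):
    """Aggregate matching costs along one scan line in one direction.

    cost_row has shape (W, D): cost per pixel and disparity level.
    """
    w, num_disp = cost_row.shape
    agg = np.empty((w, num_disp))
    agg[0] = cost_row[0]
    for x in range(1, w):
        prev = agg[x - 1]
        prev_min = prev.min()
        padded = np.pad(prev, 1, constant_values=np.inf)
        step = np.minimum(padded[:-2], padded[2:]) + p1  # change by one level
        jump = prev_min + p2                             # any larger change
        best_transition = np.minimum(np.minimum(prev, step), jump)
        # Subtracting prev_min keeps the aggregated values from growing along the path.
        agg[x] = cost_row[x] + best_transition - prev_min
    return agg

def sgm_scanline(cost_row, p1=1.0, p2=8.0):
    """Sum two opposite 1D aggregations, then pick the minimum per pixel."""
    cost_row = np.asarray(cost_row, dtype=np.float64)
    forward = aggregate_path(cost_row, p1, p2)
    backward = aggregate_path(cost_row[::-1], p1, p2)[::-1]
    return (forward + backward).argmin(axis=1)
```

A pixel whose raw cost has a spurious minimum at the wrong disparity is pulled back toward its neighbors' disparity, because switching levels along the path costs `p1` or `p2`.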
Graph cuts vs belief propagation: For the full 2D energy, graph cuts (Boykov et al., 2001) find a strong local minimum by solving a max-flow/min-cut problem on a graph where nodes are pixels and edges encode both data and smoothness costs. Belief propagation (Sun et al., 2003) passes messages between neighboring pixels iteratively, each message encoding the cost distribution. Both are slower than SGM but can produce higher quality results on complex scenes.
PatchMatch stereo: Instead of testing all disparities at every pixel, randomly initialize, then propagate good solutions to neighbors. If your neighbor found a good slanted plane, try a similar plane at your pixel. This random search + propagation strategy converges rapidly and naturally handles slanted surfaces, which are problematic for discrete disparity methods.
Deep learning has transformed stereo matching. Modern networks learn to extract features, compute matching costs, aggregate context, and regularize the disparity map — all end-to-end from data. Where classical methods require careful tuning of window sizes, penalty functions, and post-processing steps, a single trained network replaces the entire pipeline.
| Network | Key Innovation |
|---|---|
| MC-CNN | Learned patch matching cost (Zbontar & LeCun, 2016) |
| GC-Net | End-to-end with 3D cost volume convolutions (Kendall et al., 2017) |
| PSMNet | Spatial pyramid pooling for large context (Chang & Chen, 2018) |
| RAFT-Stereo | Iterative refinement with GRU, sub-pixel accuracy (Lipson et al., 2021) |
| CREStereo | Cascade recurrent estimation, top-ranked on ETH3D at publication (Li et al., 2022) |
Monocular depth: Remarkably, networks can estimate depth from a single image by learning statistical priors about scene geometry. Models like MiDaS and DPT predict relative depth maps that are useful for video effects, though they lack the metric accuracy of stereo.
Iterative refinement (RAFT-Stereo): Instead of predicting a single disparity map, iteratively refine it. Build a correlation volume, then use a GRU (gated recurrent unit) to produce a sequence of disparity updates, each correcting the previous one. After 10-20 iterations, the result converges to sub-pixel accuracy. This mirrors the coarse-to-fine philosophy of classical methods but is fully differentiable.
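The correlation-volume step can be illustrated in isolation. The sketch below computes, for each image row, the dot product between every left feature and every right feature on that row. It covers only the volume construction, not the GRU update loop; the feature maps are assumed to come from some feature extractor, and the scaling by the square root of the channel count is an assumed normalization choice.

```python
import numpy as np

def correlation_volume(feat_left, feat_right):
    """All-pairs feature correlation along each row of a rectified pair.

    feat_left, feat_right: (H, W, C) feature maps.
    Returns (H, W, W): out[y, xl, xr] is the scaled dot product of the left
    feature at (y, xl) and the right feature at (y, xr).
    """
    channels = feat_left.shape[-1]
    return np.einsum("yic,yjc->yij", feat_left, feat_right) / np.sqrt(channels)
```

Because the pair is rectified, candidates are restricted to the same row, so the volume is H × W × W rather than a full 4D all-pairs volume. Looking up entry `[y, x, x - d]` gives the matching score for disparity `d` at left pixel `(y, x)`.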
Two views give one disparity map. But adding more cameras dramatically improves quality: you get more coverage (fewer occlusions), more constraints (better accuracy), and the ability to reconstruct complete 3D objects.
Multi-view stereo (MVS) takes the sparse point cloud from SfM (Chapter 11) and densifies it into a complete 3D model. The input is a set of images with known camera poses; the output is a dense depth map per view or a fused 3D point cloud.
| Approach | How It Works | Output |
|---|---|---|
| Depth map fusion | Estimate per-view depth maps, fuse into one point cloud | Point cloud or mesh |
| Volumetric (voxel) | Carve or score a voxel grid using photo-consistency | Voxel grid → mesh via Marching Cubes |
| Shape from silhouettes | Intersect visual cones from object silhouettes | Visual hull (conservative bound) |
| Patch-based (PMVS) | Grow oriented patches across views | Dense oriented point cloud |
COLMAP (Schonberger et al., 2016) remains the reference pipeline: it runs SfM first, then per-view depth estimation using photometric and geometric consistency, then fusion. It handles hundreds of unordered photos and produces meshes suitable for rendering.
Passive stereo relies on scene texture. In a featureless white room, there is nothing to match. Active depth sensors solve this by projecting their own patterns onto the scene.
| Technology | How It Works | Example |
|---|---|---|
| Structured light | Project a known pattern (stripes, dots). Decode the pattern in the camera to determine per-pixel depth via triangulation. | Kinect v1, Intel RealSense |
| Time of flight (ToF) | Emit modulated IR light. Measure the phase shift of the return signal, which gives the round-trip time Δt, and compute the distance as c · Δt / 2. | Kinect v2, iPhone LiDAR |
| LiDAR | Pulsed laser measures round-trip time per point. Mechanically or electronically scanned. | Velodyne, Ouster (autonomous vehicles) |
| Laser stripe | Sweep a plane of laser light. The stripe deformation encodes surface shape via triangulation. | Industrial 3D scanners |
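The time-of-flight rows reduce to short formulas. For a pulsed sensor, distance is c · Δt / 2. For a continuous-wave sensor with modulation frequency f_mod, a phase shift φ corresponds to Δt = φ / (2π · f_mod). The sketch below encodes both; the function names and the 30 MHz example frequency are illustrative assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def distance_from_round_trip(delta_t_s):
    """Pulsed ToF / LiDAR: half the distance light travels in the round-trip time."""
    return SPEED_OF_LIGHT * delta_t_s / 2.0

def distance_from_phase(phase_rad, mod_freq_hz):
    """Continuous-wave ToF: phase shift -> round-trip time -> distance.

    Unambiguous only up to c / (2 * mod_freq_hz), because the phase wraps at 2*pi.
    """
    delta_t = phase_rad / (2.0 * math.pi * mod_freq_hz)
    return distance_from_round_trip(delta_t)

# Hypothetical 30 MHz modulation: a quarter-cycle phase shift.
print(distance_from_phase(math.pi / 2, 30e6))
```

The phase wrap means a single modulation frequency cannot distinguish a target from one a full unambiguous range farther away, which is one reason such sensors trade range against precision when choosing the modulation frequency.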
See how a stereo algorithm turns a pair of rectified images into a depth map. The left image has a scene with objects at different depths. Slide the disparity to see which parts of the scene "light up" (match well) at each depth level.
Sweep through disparities to find where objects match. The scene has three depth layers. Watch the cost drop at correct disparities.
| Concept | Used In |
|---|---|
| Epipolar geometry & rectification | Ch 11 (SfM), Ch 14 (view interpolation) |
| Disparity maps | Ch 14 (IBR), Ch 9 (motion estimation), autonomous driving |
| Multi-view stereo | Ch 13 (3D reconstruction), Ch 14 (neural rendering / NeRFs) |
| Cost volume / DSI | Deep stereo networks, optical flow networks (RAFT) |
| Active depth sensors | Ch 13 (3D scanning), robotics, AR/VR, autonomous vehicles |
| Global optimization / MRF | Ch 4 (model fitting), Ch 6 (semantic segmentation) |