Szeliski, Chapter 12

Depth Estimation

Recovering per-pixel depth from stereo pairs, multiple views, structured light, and time-of-flight sensors.

Prerequisites: Chapter 2 (image formation), Chapter 11 (epipolar geometry, SfM).
10 chapters · 5+ simulations · 0 assumed CV knowledge

Chapter 0: Why Depth?

Hold one finger at arm's length and close one eye, then the other. Your finger appears to jump sideways. That jump — the disparity — is smaller for far objects and larger for close ones. Your brain fuses these two views into a sense of depth.

A stereo camera pair does the same thing. Given two images of the same scene taken from slightly different positions, we can compute a disparity map that tells us how far each pixel shifted, and from that, how far each surface point is from the camera.

Why it matters: Depth estimation powers autonomous driving (obstacle detection), AR/VR (occlusion handling), robotics (grasping), 3D photography, and video post-processing (background blur, z-keying). The iPhone's Portrait Mode, the Kinect, and every self-driving car depend on it.

[Interactive figure: Disparity and Depth. A closer object has a larger disparity (shift between views). Slider: object depth.]

Why does a closer object have a larger disparity in a stereo pair?

Chapter 1: Rectification

In a general stereo pair, epipolar lines can run in any direction. Searching along arbitrary lines for matches is slow and error-prone. Rectification warps both images so that corresponding epipolar lines become horizontal scan lines. After rectification, matching reduces to a simple 1D search along each row.

The relationship between depth Z and disparity d in a rectified pair is beautifully simple:

d = f · B / Z

where f is the focal length (in pixels), B is the baseline (distance between cameras), and Z is the depth. Larger baseline gives larger disparity at the same depth — better depth precision.
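
To make the relation concrete, here is a minimal sketch that inverts d = f · B / Z to recover depth from disparity. The function name and the rig values (f = 700 px, B = 0.12 m) are illustrative assumptions, not taken from the text.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Invert d = f * B / Z to get depth Z in metres.

    Non-positive disparities (points at infinity or invalid matches) map to inf.
    """
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(d.shape, np.inf)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Hypothetical rig: focal length 700 px, baseline 0.12 m, so f * B = 84 px*m.
depths = disparity_to_depth([84.0, 42.0, 8.4, 0.0], focal_px=700.0, baseline_m=0.12)
# Disparities of 84, 42 and 8.4 px correspond to roughly 1 m, 2 m and 10 m.
```

Halving the disparity doubles the depth, which is why small disparity errors matter far more for distant points.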

Disparity Space Image (DSI): After rectification, for every pixel (x, y) we test all possible disparities d and store a matching cost C(x, y, d). This 3D volume is the DSI. Finding the depth map means finding the optimal surface d(x, y) through this volume — the surface where matching cost is lowest.
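
A DSI can be sketched in a few lines, assuming grayscale rectified images and a per-pixel absolute-difference cost (one possible choice among many). The name `build_dsi` and the convention that left pixel x matches right pixel x − d are assumptions of this sketch.

```python
import numpy as np

def build_dsi(left, right, max_disp):
    """Return C[y, x, d] = |left[y, x] - right[y, x - d]| for d in 0..max_disp."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    # Candidates that fall outside the right image get infinite cost so they never win.
    cost = np.full((h, w, max_disp + 1), np.inf)
    for d in range(max_disp + 1):
        cost[:, d:, d] = np.abs(left[:, d:] - right[:, :w - d])
    return cost

# Synthetic pair: the left image is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right_img = rng.random((4, 16))
left_img = np.roll(right_img, 3, axis=1)
dsi = build_dsi(left_img, right_img, max_disp=5)
best = np.argmin(dsi, axis=2)   # lowest-cost disparity per pixel
```

On this noise-free synthetic pair the per-pixel minimum already recovers the shift; on real images the raw minimum is much less reliable.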

| Step | What happens |
|---|---|
| 1. Rotate cameras | Warp both images so their optical axes are perpendicular to the baseline. |
| 2. Align vertically | Make the up vectors perpendicular to the baseline, so epipolar lines are horizontal. |
| 3. Scale | Re-scale if focal lengths differ, to avoid aliasing. |

Plane sweep: An alternative to rectification. Sweep a set of virtual planes through the scene at different depths. At each depth, warp all images onto that plane and measure photoconsistency. This works with any camera configuration — not just side-by-side pairs — and is the basis for many multi-view stereo methods.

[Interactive figure: Rectified Stereo Pair. After rectification, matching is a 1D search along horizontal scan lines. The green line shows the matching scan line. Slider: scan line y.]

Depth resolution: From d = f·B/Z, the depth resolution depends on the disparity resolution. Differentiating Z = f·B/d gives |dZ/dd| = f·B/d² = Z²/(f·B), so for integer-pixel disparities (a step of one pixel) the depth quantization at distance Z is ΔZ ≈ Z²/(f·B). The error grows quadratically with distance, so far objects have poor depth resolution: the classic limitation of stereo. Wider baselines help, but make matching harder (more occlusion, more appearance change).
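
The quadratic growth is easy to check numerically. The sketch below compares the first-order formula with the exact depth change for a one-pixel disparity step; the function names and the rig values (f = 700 px, B = 0.12 m) are illustrative assumptions.

```python
def depth_step(z_m, focal_px, baseline_m, disp_step_px=1.0):
    """First-order depth quantization: dZ = Z^2 / (f * B) * dd."""
    return z_m ** 2 / (focal_px * baseline_m) * disp_step_px

def exact_depth_step(z_m, focal_px, baseline_m, disp_step_px=1.0):
    """Exact depth change when the disparity at z_m decreases by disp_step_px."""
    fb = focal_px * baseline_m
    d = fb / z_m
    return fb / (d - disp_step_px) - z_m

# Hypothetical rig: f = 700 px, B = 0.12 m.
near = depth_step(2.0, 700.0, 0.12)    # about 0.05 m at 2 m
far = depth_step(20.0, 700.0, 0.12)    # about 4.8 m at 20 m
```

Ten times the distance gives a hundred times the depth step. The exact step is always somewhat larger than the first-order value, and the gap widens as the disparity approaches one pixel.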

What does rectification achieve in stereo matching?

Chapter 2: Matching Cost

The first step in any stereo algorithm is deciding how to compare pixels. Given a pixel in the left image, how do you score each candidate match in the right image? This is the matching cost.

| Metric | Formula | Trade-off |
|---|---|---|
| SSD | (I_L − I_R)² | Simple, sensitive to brightness changes |
| SAD | \|I_L − I_R\| | More robust than SSD to outliers |
| NCC | Normalized cross-correlation | Handles brightness/contrast variation |
| Census | Bit pattern of neighbors above/below center | Robust to illumination changes |
| Learned | Neural network patch comparison | Best discriminability, needs training data |
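
The first three rows of the table can be sketched on small patches as follows. The zero-mean form of NCC is assumed here, and the gain/offset values are arbitrary test inputs.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences (lower is better)."""
    return float(np.sum((a - b) ** 2))

def sad(a, b):
    """Sum of absolute differences (lower is better)."""
    return float(np.sum(np.abs(a - b)))

def ncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation in [-1, 1] (higher is better)."""
    a0 = a - a.mean()
    b0 = b - b.mean()
    return float(np.sum(a0 * b0) / (np.linalg.norm(a0) * np.linalg.norm(b0) + eps))

rng = np.random.default_rng(0)
patch = rng.random((5, 5))
brighter = 1.5 * patch + 0.2   # same patch seen with a different gain and offset

ssd_score = ssd(patch, brighter)   # clearly non-zero: SSD is fooled by the change
ncc_score = ncc(patch, brighter)   # very close to 1: NCC ignores gain and offset
```

Because NCC is a similarity rather than a cost, a stereo pipeline would typically store something like 1 − NCC in the cost volume.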

The census transform converts each pixel's neighborhood into a binary string: each neighbor gets a 1 if it is brighter than the center pixel, 0 otherwise. Two pixels are compared by counting how many bits differ (Hamming distance). This makes the measure invariant to any monotonic change in brightness — perfect for outdoor stereo where lighting varies between cameras.
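
A minimal sketch of the census transform and its Hamming-distance cost, assuming a square (2r+1) × (2r+1) neighborhood and edge padding at the image border (both choices are assumptions of this example):

```python
import numpy as np

def census_transform(img, radius=1):
    """Return an (H, W, K) boolean array; bit k is True where neighbor k is brighter than the center."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # the center pixel is the reference, not a bit
            neighbor = padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            bits.append(neighbor > img)
    return np.stack(bits, axis=-1)

def hamming_cost(census_a, census_b):
    """Per-pixel Hamming distance between two census codes."""
    return np.count_nonzero(census_a != census_b, axis=-1)

rng = np.random.default_rng(0)
img = rng.random((6, 6))
relit = np.exp(3.0 * img) + 5.0   # a monotonically increasing brightness change
cost = hamming_cost(census_transform(img), census_transform(relit))
```

The cost is zero everywhere because a monotonically increasing remapping preserves every brighter-than comparison, which is exactly the invariance described above.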

Key insight: The matching cost alone is noisy. A textureless wall has the same cost at every disparity. A repeating pattern has many equally good matches. That is why we need aggregation (local methods) or regularization (global methods) to pick the right answer — the topics of the next two chapters.

[Figure: Matching Cost Comparison. The cost profile along one scan line. The correct disparity has the lowest cost, but noise creates false minima.]

Why is the census transform particularly robust for outdoor stereo?

Chapter 3: Local Methods

The simplest approach: sum the matching cost over a window around each pixel, then pick the disparity with the lowest total cost. This is winner-take-all (WTA) with an aggregation window.
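
As a rough sketch, fixed-window aggregation followed by winner-take-all looks like this. It assumes a finite cost volume indexed as cost[y, x, d]; the function names are illustrative, and a production version would use a box filter or integral images rather than explicit loops.

```python
import numpy as np

def box_aggregate(cost, radius):
    """Sum cost[y, x, d] over a (2r+1) x (2r+1) spatial window, separately for each d."""
    h, w, _ = cost.shape
    padded = np.pad(cost, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(cost)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out

def winner_take_all(cost):
    """Pick the disparity with the lowest cost at every pixel."""
    return np.argmin(cost, axis=2)

# Noisy synthetic cost volume whose true disparity is 2 everywhere.
rng = np.random.default_rng(0)
noisy_cost = rng.random((20, 20, 5))
noisy_cost[:, :, 2] -= 0.3
raw = winner_take_all(noisy_cost)
smoothed = winner_take_all(box_aggregate(noisy_cost, radius=2))
```

On this fronto-parallel synthetic scene the aggregated result is far more accurate than the raw per-pixel minimum, because the window averages out the noise.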

The choice of window matters enormously:

| Method | Idea |
|---|---|
| Fixed square window | Sum costs over N×N block. Fast (box filter), but blurs edges. |
| Shiftable window | Try all positions of a window around the pixel, take the minimum. Slightly edge-aware. |
| Adaptive weight | Weight neighbors by color similarity. Sharp edges, higher cost. |
| Guided filter | Use the color image to guide cost aggregation. State of the art among local methods. |

Adaptive windows: Smart local methods adapt the window to image content. Bilateral aggregation (Yoon and Kweon, 2006) weights each neighbor by color similarity and spatial distance, exactly like a bilateral filter. Pixels that look similar (same surface) get high weight; pixels across a depth edge (different color) get low weight. This preserves sharp boundaries without sacrificing aggregation power.

Sub-pixel refinement: Most local methods estimate integer disparities. For smooth rendering, you fit a parabola to the cost curve around the winning disparity and find the sub-pixel minimum. Without this, view synthesis shows distracting "layering" artifacts.
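
The parabola fit has a closed form. With costs c₋, c₀, c₊ at disparities d − 1, d, d + 1, the vertex lies at d + (c₋ − c₊) / (2 (c₋ − 2c₀ + c₊)). A minimal sketch, with an illustrative function name:

```python
def subpixel_disparity(costs, d):
    """Refine integer disparity d with a parabola through costs[d-1], costs[d], costs[d+1]."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)            # no neighbor on one side: keep the integer estimate
    c_minus, c_zero, c_plus = costs[d - 1], costs[d], costs[d + 1]
    curvature = c_minus - 2.0 * c_zero + c_plus
    if curvature <= 0:
        return float(d)            # flat or non-convex: the parabola has no minimum
    return d + 0.5 * (c_minus - c_plus) / curvature

# A cost curve that is exactly quadratic with its minimum at 3.3.
costs = [(k - 3.3) ** 2 for k in range(7)]
refined = subpixel_disparity(costs, d=3)   # close to 3.3
```

The `curvature` term is the second difference of the cost curve, so it can double as a simple sharpness measure for the match.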

[Interactive figure: Aggregation Window Effect. Compare a small vs large aggregation window near a depth edge. Small windows preserve the edge; large windows blur it. Slider: window radius.]

Confidence and uncertainty: After computing disparities, how confident are you in each estimate? The curvature of the cost function at the winning disparity indicates confidence. A sharp, deep minimum means a strong match (high texture, low ambiguity). A flat minimum means the match is uncertain (textureless region, repetitive pattern). Stereo confidence maps are critical for downstream tasks: uncertain depths should be smoothed, filled, or excluded.

Cross-checking: Compute left-to-right and right-to-left disparity maps. Pixels where the two maps disagree are likely occluded or mismatched. This simple consistency check catches most errors, and the flagged pixels can be filled from neighbors.
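
A left-right consistency check might be sketched as follows, assuming the convention that a left pixel (x, y) with disparity d corresponds to right pixel (x − d, y). The one-pixel tolerance is an arbitrary choice for this example.

```python
import numpy as np

def cross_check(disp_left, disp_right, tol=1.0):
    """Return a boolean mask that is True where the two disparity maps agree."""
    h, w = disp_left.shape
    ys, xs = np.indices((h, w))
    # Follow each left pixel to its claimed match in the right image.
    x_right = np.rint(xs - disp_left).astype(int)
    inside = (x_right >= 0) & (x_right < w)
    x_clamped = np.clip(x_right, 0, w - 1)
    agree = np.abs(disp_left - disp_right[ys, x_clamped]) <= tol
    return inside & agree

disp_l = np.full((3, 8), 2.0)
disp_r = np.full((3, 8), 2.0)
disp_l[1, 5] = 4.0                 # one deliberately wrong estimate
mask = cross_check(disp_l, disp_r)
```

Pixels where `mask` is False (the injected error, plus the columns whose match falls outside the right image) are the ones to fill from neighbors.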
What is the fundamental trade-off in choosing the aggregation window size?

Chapter 4: Global Optimization

Instead of making a local decision at each pixel, global methods define an energy function over the entire disparity map and find the assignment that minimizes it:

E(d) = Σ_p C(p, d_p) + λ Σ_{(p,q)∈N} V(d_p, d_q)

The first term is the data cost — how well each pixel matches at its assigned disparity. The second is the smoothness cost — a penalty for neighboring pixels having different disparities, weighted by λ.

Semi-Global Matching (SGM): Exact 2D optimization is NP-hard for most energy formulations. SGM (Hirschmüller, 2008) approximates it by running 1D dynamic programming along 8 or 16 directions across the image and summing the aggregated path costs at each pixel and disparity. It is fast, parallelizes well (including on GPUs), and produces results close to those of full 2D optimization. SGM is a workhorse of autonomous-driving stereo and is widely used in production systems.

| Optimizer | Approach | Quality vs speed |
|---|---|---|
| Dynamic programming | Optimal 1D path per scan line | Fast, but "streaking" artifacts between rows |
| SGM | Multiple 1D paths summed | Fast + good quality. Industry standard. |
| Graph cuts | Min-cut on a graph | High quality, slow |
| Belief propagation | Message passing on MRF | High quality, slow |

Smoothness penalty design: A simple L1 penalty |d_p − d_q| encourages piecewise-smooth disparity maps. A truncated penalty min(|d_p − d_q|, τ) allows sharp depth discontinuities without excessive cost. SGM further modulates the penalty by the image gradient: across an edge in the color image, the smoothness penalty is reduced because a depth jump is expected there.
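
One SGM path can be sketched with the usual recurrence L(p, d) = C(p, d) + min(L(p−r, d), L(p−r, d±1) + P1, min_k L(p−r, k) + P2) − min_k L(p−r, k), where P1 penalizes one-level disparity changes and P2 penalizes larger jumps. The sketch below handles only the left-to-right direction, assumes a finite cost volume, and uses illustrative penalty values; a full implementation repeats this for every direction and sums the results.

```python
import numpy as np

def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate cost[y, x, d] along one left-to-right path per row."""
    h, w, n = cost.shape
    agg = np.empty((h, w, n), dtype=np.float64)
    agg[:, 0, :] = cost[:, 0, :]
    for x in range(1, w):
        prev = agg[:, x - 1, :]                        # shape (h, n)
        prev_min = prev.min(axis=1, keepdims=True)     # best cost at the previous pixel
        # Costs at neighboring disparities d-1 and d+1, padded with inf at the ends.
        minus = np.pad(prev[:, :-1], ((0, 0), (1, 0)), mode="constant", constant_values=np.inf)
        plus = np.pad(prev[:, 1:], ((0, 0), (0, 1)), mode="constant", constant_values=np.inf)
        jump = np.broadcast_to(prev_min + p2, prev.shape)
        best = np.minimum.reduce([prev, minus + p1, plus + p1, jump])
        # Subtracting prev_min keeps the values from growing along the path.
        agg[:, x, :] = cost[:, x, :] + best - prev_min
    return agg

# One row whose true disparity is 2, with a spurious minimum at d = 0 for x = 5.
row_cost = np.ones((1, 10, 4))
row_cost[0, :, 2] = 0.0
row_cost[0, 5, 0] = 0.0
row_cost[0, 5, 2] = 0.5
path = aggregate_left_to_right(row_cost)
```

The raw cost at x = 5 prefers the wrong disparity 0, while the aggregated cost prefers 2 because the path history makes the isolated jump expensive.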

[Interactive figure: SGM Path Aggregation. SGM runs 1D optimization along multiple directions and sums the results. More paths give a better approximation of the 2D global optimum. Slider: number of paths.]

Graph cuts vs belief propagation: For the full 2D energy, graph cuts (Boykov et al., 2001) find a strong local minimum by solving a sequence of max-flow/min-cut problems on a graph where nodes are pixels and edges encode both data and smoothness costs. Belief propagation (Sun et al., 2003) passes messages between neighboring pixels iteratively, each message encoding the cost distribution. Both are slower than SGM but can produce higher quality results on complex scenes.

PatchMatch stereo: Instead of testing all disparities at every pixel, randomly initialize, then propagate good solutions to neighbors. If your neighbor found a good slanted plane, try a similar plane at your pixel. This random search + propagation strategy converges rapidly and naturally handles slanted surfaces, which are problematic for discrete disparity methods.

How does Semi-Global Matching (SGM) approximate global optimization efficiently?

Chapter 5: Deep Stereo Networks

Deep learning has transformed stereo matching. Modern networks learn to extract features, compute matching costs, aggregate context, and regularize the disparity map — all end-to-end from data. Where classical methods require careful tuning of window sizes, penalty functions, and post-processing steps, a single trained network replaces the entire pipeline.

1. Feature extraction: A CNN extracts per-pixel features from both images.
2. Cost volume: Left and right features are correlated at each disparity to build a 3D volume.
3. 3D aggregation: 3D convolutions (or recurrent units) smooth the cost volume.
4. Disparity regression: Soft-argmin produces a differentiable, sub-pixel disparity estimate.

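
The soft-argmin in the last stage is the expected disparity under a softmax over negated costs. Here is a NumPy sketch of the forward computation only; a real network would use an autodiff framework, and placing the disparity axis last is an assumption of this example.

```python
import numpy as np

def soft_argmin(cost):
    """Expected disparity under softmax(-cost), taken along the last axis."""
    cost = np.asarray(cost, dtype=np.float64)
    logits = -cost
    logits -= logits.max(axis=-1, keepdims=True)    # stabilize the exponentials
    prob = np.exp(logits)
    prob /= prob.sum(axis=-1, keepdims=True)
    disparities = np.arange(cost.shape[-1])
    return (prob * disparities).sum(axis=-1)

# A cost curve symmetric about 3.5: hard argmin must pick 3 or 4, soft-argmin returns 3.5.
symmetric = [(k - 3.5) ** 2 for k in range(8)]
estimate = soft_argmin(symmetric)
```

Unlike a hard argmin, the output varies smoothly with every cost value, so gradients can flow back through it to the cost volume.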
Learned matching cost: Zbontar and LeCun (2016) showed that training a CNN to compare patches drastically outperformed hand-crafted costs like census or NCC. This single insight launched the deep stereo era. Later networks (GC-Net, PSMNet, RAFT-Stereo) moved to end-to-end training where the cost volume is constructed and refined jointly.

| Network | Key innovation |
|---|---|
| MC-CNN | Learned patch matching cost (Zbontar & LeCun, 2016) |
| GC-Net | End-to-end with 3D cost volume convolutions (Kendall et al., 2017) |
| PSMNet | Spatial pyramid pooling for large context (Chang & Chen, 2018) |
| RAFT-Stereo | Iterative refinement with GRU, sub-pixel accuracy (Lipson et al., 2021) |
| CREStereo | Cascade recurrent estimation, top on ETH3D (Li et al., 2022) |

Monocular depth: Remarkably, networks can estimate depth from a single image by learning statistical priors about scene geometry. Models like MiDaS and DPT predict relative depth maps that are useful for video effects, though they lack the metric accuracy of stereo.

Self-supervised stereo: You do not always need ground-truth depth maps for training. Project the left image to the right using the predicted disparity map. If the prediction is correct, the warped image should match the actual right image. This photometric loss enables training on unlabeled stereo video, which suits autonomous driving, where ground truth is expensive. MonoDepth (Godard et al., 2017) popularized this approach for monocular depth networks trained on stereo pairs.
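
The photometric loss can be sketched in NumPy as below. This version reconstructs the left view by sampling the right image at x − d, which is one common convention; the opposite warping direction works the same way. It uses a plain L1 difference and 1D linear interpolation, whereas real training code would use a differentiable sampler and usually a more elaborate image-similarity term.

```python
import numpy as np

def warp_right_to_left(right, disp_left):
    """Reconstruct the left view by sampling right[y, x - d(x, y)] with linear interpolation."""
    h, w = right.shape
    xs = np.arange(w, dtype=np.float64)
    recon = np.empty((h, w), dtype=np.float64)
    for y in range(h):
        # np.interp clamps sample positions outside the row to the border values.
        recon[y] = np.interp(xs - disp_left[y], xs, right[y])
    return recon

def photometric_loss(left, right, disp_left):
    """Mean absolute difference between the left image and its reconstruction."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disp_left))))

rng = np.random.default_rng(0)
right_img = rng.random((4, 32))
left_img = np.roll(right_img, 3, axis=1)          # true disparity is 3 px
good = photometric_loss(left_img, right_img, np.full((4, 32), 3.0))
bad = photometric_loss(left_img, right_img, np.zeros((4, 32)))
```

The correct disparity gives a much lower loss than the wrong one, and that difference is the training signal. The small residual for the correct disparity comes from border columns that have no valid match.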

Iterative refinement (RAFT-Stereo): Instead of predicting a single disparity map, iteratively refine it. Build a correlation volume, then use a GRU (gated recurrent unit) to produce a sequence of disparity updates, each correcting the previous one. After 10-20 iterations, the result converges to sub-pixel accuracy. This mirrors the coarse-to-fine philosophy of classical methods but is fully differentiable.

What makes the soft-argmin operation important in deep stereo networks?

Chapter 6: Multi-View Stereo

Two views give one disparity map. But adding more cameras dramatically improves quality: you get more coverage (fewer occlusions), more constraints (better accuracy), and the ability to reconstruct complete 3D objects.

Multi-view stereo (MVS) takes the sparse point cloud from SfM (Chapter 11) and densifies it into a complete 3D model. The input is a set of images with known camera poses; the output is a dense depth map per view or a fused 3D point cloud.

Volumetric fusion: One powerful approach discretizes space into a 3D voxel grid. For each voxel, project it into every camera and measure photoconsistency. Voxels where all cameras agree become the surface. This naturally handles arbitrary camera configurations and complex topology. TSDF (Truncated Signed Distance Function) fusion, as used in KinectFusion, accumulates signed distances from multiple depth maps into a single coherent volume.
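
TSDF fusion boils down to a per-voxel running weighted average of truncated signed distances. The sketch below fuses observations along a single camera ray; the sign convention (positive in front of the surface), the truncation distance, and the function name are assumptions of this example.

```python
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, trunc, obs_weight=1.0):
    """Fuse one signed-distance observation per voxel into a running weighted average."""
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)        # truncate and normalize to [-1, 1]
    # Voxels far behind the observed surface are occluded and carry no information.
    valid = sdf_obs > -trunc
    new_weight = weight + obs_weight * valid
    averaged = (tsdf * weight + d * obs_weight) / np.maximum(new_weight, 1e-12)
    return np.where(valid, averaged, tsdf), new_weight

# Voxels along one ray at depths 0.8 .. 1.2 m, and two noisy depth readings of one surface.
voxel_depth = np.linspace(0.8, 1.2, 9)
tsdf = np.zeros(9)
weight = np.zeros(9)
for measured_depth in (1.02, 0.98):
    sdf = measured_depth - voxel_depth             # positive in front of the surface
    tsdf, weight = tsdf_update(tsdf, weight, sdf, trunc=0.1)
```

After both updates the zero crossing of `tsdf` sits at 1.0 m, the average of the two noisy readings. The surface is later extracted at that zero crossing.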

| Approach | How it works | Output |
|---|---|---|
| Depth map fusion | Estimate per-view depth maps, fuse into one point cloud | Point cloud or mesh |
| Volumetric (voxel) | Carve or score a voxel grid using photo-consistency | Voxel grid → mesh via Marching Cubes |
| Shape from silhouettes | Intersect visual cones from object silhouettes | Visual hull (conservative bound) |
| Patch-based (PMVS) | Grow oriented patches across views | Dense oriented point cloud |

COLMAP (Schönberger et al., 2016) remains a widely used reference pipeline: it runs SfM first, then per-view depth estimation using photometric and geometric consistency, then fusion. It handles hundreds of unordered photos and produces meshes suitable for rendering.

What advantage does multi-view stereo have over two-view stereo?

Chapter 7: Active Depth Sensing

Passive stereo relies on scene texture. In a featureless white room, there is nothing to match. Active depth sensors solve this by projecting their own patterns onto the scene.

| Technology | How it works | Example |
|---|---|---|
| Structured light | Project a known pattern (stripes, dots). Decode the pattern in the camera to determine per-pixel depth via triangulation. | Kinect v1, Intel RealSense |
| Time of flight (ToF) | Emit modulated IR light. Measure the phase shift of the return signal, which encodes the round-trip delay Δt, to compute distance: d = c · Δt / 2. | Kinect v2, iPhone LiDAR |
| LiDAR | Pulsed laser measures round-trip time per point. Mechanically or electronically scanned. | Velodyne, Ouster (autonomous vehicles) |
| Laser stripe | Sweep a plane of laser light. The stripe deformation encodes surface shape via triangulation. | Industrial 3D scanners |

Structured light in detail: The Kinect v1 projects ~30,000 IR dots in a pseudo-random pattern. Each local neighborhood of dots is unique, so matching a received dot pattern to the stored reference gives a per-dot disparity — the same triangulation math as stereo, but the "texture" is guaranteed by the projector. This works even on blank walls.
ToF vs structured light: ToF is faster (single shot, no sweeping), works at longer range, but has lower resolution and multi-path interference (bounced light). Structured light gives higher resolution and accuracy at close range, but fails in bright sunlight (IR floods the pattern). Apple uses both: the front TrueDepth camera is structured light, the rear LiDAR is ToF.
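
For a continuous-wave ToF sensor, the phase shift φ relates to the round-trip delay by φ = 2π · f_mod · Δt, so d = c · Δt / 2 becomes d = c · φ / (4π · f_mod). A small sketch follows; the 20 MHz modulation frequency is a hypothetical value chosen for illustration, and d here denotes distance, not disparity.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def tof_distance_from_time(round_trip_s):
    """Distance from a measured round-trip delay: d = c * dt / 2."""
    return C * round_trip_s / 2.0

def tof_distance_from_phase(phase_rad, mod_freq_hz):
    """Distance from a measured phase shift: d = c * phase / (4 * pi * f_mod)."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

def unambiguous_range(mod_freq_hz):
    """Phase wraps at 2*pi, so measured distances repeat every c / (2 * f_mod)."""
    return C / (2.0 * mod_freq_hz)

f_mod = 20e6                                   # hypothetical modulation frequency
delay = 20e-9                                  # a 20 ns round trip
phase = 2.0 * math.pi * f_mod * delay
by_time = tof_distance_from_time(delay)        # about 3.0 m
by_phase = tof_distance_from_phase(phase, f_mod)
max_range = unambiguous_range(f_mod)           # about 7.5 m
```

Both routes give the same distance. The wrap-around at `max_range` is why phase-based sensors trade modulation frequency (precision) against unambiguous range.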
Why does structured light work in textureless scenes where passive stereo fails?

Chapter 8: Showcase — Stereo Matching Demo

See how a stereo algorithm turns a pair of rectified images into a depth map. The left image has a scene with objects at different depths. Slide the disparity to see which parts of the scene "light up" (match well) at each depth level.

[Interactive figure: Disparity Space Sweep. Sweep through disparities to find where objects match. The scene has three depth layers. Watch the cost drop at correct disparities. Sliders: disparity d, window size.]

What to notice: At the correct disparity for each object, its matching cost drops sharply (bright green). With a small window, you get precise edges but noisy results. With a large window, edges blur but noise decreases. This is the fundamental aggregation trade-off from Chapter 3 — visible right before your eyes.

Chapter 9: Connections

| Concept | Used in |
|---|---|
| Epipolar geometry & rectification | Ch 11 (SfM), Ch 14 (view interpolation) |
| Disparity maps | Ch 14 (IBR), Ch 9 (motion estimation), autonomous driving |
| Multi-view stereo | Ch 13 (3D reconstruction), Ch 14 (neural rendering / NeRFs) |
| Cost volume / DSI | Deep stereo networks, optical flow networks (RAFT) |
| Active depth sensors | Ch 13 (3D scanning), robotics, AR/VR, autonomous vehicles |
| Global optimization / MRF | Ch 4 (model fitting), Ch 6 (semantic segmentation) |

Szeliski's perspective: "Stereo correspondence is perhaps the oldest problem in computer vision, dating back to the 19th century stereoscope. Yet it remains remarkably relevant: the same disparity estimation principles now power neural view synthesis, autonomous driving, and the depth sensors in every smartphone."
Which concept from stereo matching was later adopted by optical flow networks like RAFT?