Depth Estimation — Szeliski, Chapter 12

Chapter 0: Why Depth?

Hold one finger at arm's length and close one eye, then the other. Your finger appears to jump sideways. That jump — the disparity — is smaller for far objects and larger for close ones. Your brain fuses these two views into a sense of depth.

A stereo camera pair does the same thing. Given two images of the same scene taken from slightly different positions, we can compute a disparity map that tells us how far each pixel shifted, and from that, how far each surface point is from the camera.

Why it matters: Depth estimation powers autonomous driving (obstacle detection), AR/VR (occlusion handling), robotics (grasping), 3D photography, and video post-processing (background blur, z-keying). The iPhone's Portrait Mode, the Kinect, and every self-driving car depend on it.

Disparity and Depth

A closer object has a larger disparity (shift between views).

Why does a closer object have a larger disparity in a stereo pair?

Because it shifts more between the two camera viewpoints: the parallax effect is proportional to the inverse of distance
Because it is larger in the image
Because it has more texture

Chapter 1: Rectification

In a general stereo pair, epipolar lines can run in any direction. Searching along arbitrary lines for matches is slow and error-prone. Rectification warps both images so that corresponding epipolar lines become horizontal scan lines. After rectification, matching reduces to a simple 1D search along each row.

The relationship between depth Z and disparity d in a rectified pair is beautifully simple:

d = f · B / Z

where f is the focal length (in pixels), B is the baseline (distance between cameras), and Z is the depth. Larger baseline gives larger disparity at the same depth — better depth precision.
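
Inverting the relation gives Z = f · B / d, which is how a disparity map becomes a depth map. The following is a minimal sketch, assuming NumPy; the function name and the rig values (f = 700 px, B = 0.12 m) are made up for illustration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (metres) via Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0          # zero disparity corresponds to a point at infinity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical rig: f = 700 px, B = 0.12 m, so f * B = 84.
depth = disparity_to_depth([84.0, 42.0, 8.4, 0.0], focal_px=700.0, baseline_m=0.12)
```

Halving the disparity doubles the depth, and pixels with no measurable shift are left at infinity rather than divided by zero.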

Disparity Space Image (DSI): After rectification, for every pixel (x, y) we test all possible disparities d and store a matching cost C(x, y, d). This 3D volume is the DSI. Finding the depth map means finding the optimal surface d(x, y) through this volume — the surface where matching cost is lowest.
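
As a rough illustration of the DSI, the sketch below builds C(x, y, d) from per-pixel absolute differences on grayscale images. It assumes NumPy, a left reference image, and the convention that a left pixel at column x matches the right pixel at column x − d; the function name is hypothetical.

```python
import numpy as np

def build_dsi(left, right, max_disp):
    """Return C[y, x, d] = |left[y, x] - right[y, x - d]| for d in [0, max_disp]."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    dsi = np.full((h, w, max_disp + 1), np.inf)   # inf marks "no valid match"
    for d in range(max_disp + 1):
        # A left pixel at column x is compared with the right pixel at column x - d.
        dsi[:, d:, d] = np.abs(left[:, d:] - right[:, : w - d])
    return dsi
```

Taking `dsi.argmin(axis=2)` picks the lowest-cost disparity independently at each pixel, which is the crudest possible surface through the volume.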

Step	What Happens
1. Rotate cameras	Warp both images so their optical axes are perpendicular to the baseline.
2. Align vertically	Make the up vectors perpendicular to the baseline, so epipolar lines are horizontal.
3. Scale	Re-scale if focal lengths differ, to avoid aliasing.

Plane sweep: An alternative to rectification. Sweep a set of virtual planes through the scene at different depths. At each depth, warp all images onto that plane and measure photoconsistency. This works with any camera configuration — not just side-by-side pairs — and is the basis for many multi-view stereo methods.

Rectified Stereo Pair

After rectification, matching is a 1D search along horizontal scan lines. The green line shows the matching scan line.

Depth resolution: From d = f·B/Z, the depth resolution depends on the disparity resolution. Differentiating Z = f·B/d gives |dZ/dd| = f·B/d² = Z²/(f·B), so a one-pixel disparity step at distance Z corresponds to a depth step of roughly ΔZ ≈ Z²/(f·B). Depth error therefore grows quadratically with distance, which is the classic limitation of stereo. Wider baselines help, but make matching harder (more occlusion, more appearance change).
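
To put numbers on ΔZ = Z²/(f·B), here is a small sketch in plain Python. The function name and the rig values (f = 700 px, B = 0.12 m) are assumptions chosen only for illustration.

```python
def depth_step(depth_m, focal_px, baseline_m, disp_step_px=1.0):
    """Approximate depth change caused by a disparity change of disp_step_px."""
    return depth_m ** 2 * disp_step_px / (focal_px * baseline_m)

# Hypothetical rig: f = 700 px, B = 0.12 m.
near = depth_step(2.0, 700.0, 0.12)    # depth step at 2 m
far = depth_step(20.0, 700.0, 0.12)    # depth step at 20 m
```

With these values the step is about 5 cm at 2 m and almost 5 m at 20 m: ten times the distance costs a hundred times the depth uncertainty.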

What does rectification achieve in stereo matching?

It warps both images so that epipolar lines become horizontal scan lines, reducing the matching search from 2D to 1D
It removes lens distortion
It increases image resolution

Chapter 2: Matching Cost

The first step in any stereo algorithm is deciding how to compare pixels. Given a pixel in the left image, how do you score each candidate match in the right image? This is the matching cost.

Metric	Formula	Trade-off
SSD	(I_L − I_R)²	Simple, sensitive to brightness changes
SAD	|I_L − I_R|	More robust than SSD to outliers
NCC	Normalized cross-correlation	Handles brightness/contrast variation
Census	Bit pattern of neighbors above/below center	Robust to illumination changes
Learned	Neural network patch comparison	Best discriminability, needs training data
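
The first three rows of the table can be sketched on small patches as follows. This assumes NumPy and uses the zero-mean variant of NCC, which is an assumption on my part; the function names are illustrative.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences: lower means more similar."""
    return float(np.sum((a - b) ** 2))

def sad(a, b):
    """Sum of absolute differences: lower means more similar."""
    return float(np.sum(np.abs(a - b)))

def ncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation in [-1, 1]: higher means more similar."""
    a0 = a - a.mean()
    b0 = b - b.mean()
    return float(np.sum(a0 * b0) / (np.sqrt(np.sum(a0 ** 2) * np.sum(b0 ** 2)) + eps))

patch = np.arange(9.0).reshape(3, 3)
brighter = 2.0 * patch + 10.0      # same structure, different gain and offset
```

Comparing `patch` with `brighter` gives a large SSD and SAD but an NCC very close to 1, which is the brightness and contrast tolerance the table refers to.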

The census transform converts each pixel's neighborhood into a binary string: each neighbor gets a 1 if it is brighter than the center pixel, 0 otherwise. Two pixels are compared by counting how many bits differ (Hamming distance). Because only the ordering of intensities is encoded, the measure is invariant to any monotonically increasing change in brightness, which suits outdoor stereo where exposure and lighting differ between cameras.
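
A minimal sketch of the census transform and its Hamming cost, assuming NumPy, a grayscale image, and edge padding at the borders (the padding choice and function names are assumptions):

```python
import numpy as np

def census_transform(img, radius=1):
    """Return an (H, W, K) boolean array: True where a neighbour is brighter than the centre."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue                     # the centre pixel is not compared with itself
            neighbour = padded[radius + dy : radius + dy + h,
                               radius + dx : radius + dx + w]
            bits.append(neighbour > img)
    return np.stack(bits, axis=-1)

def hamming_cost(census_a, census_b):
    """Per-pixel Hamming distance between two census descriptors."""
    return np.count_nonzero(census_a != census_b, axis=-1)
```

Applying any increasing brightness mapping to the image, such as 3·I + 20, leaves every bit unchanged, so the Hamming cost between the original and the remapped image is zero.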

Key insight: The matching cost alone is noisy. A textureless wall has the same cost at every disparity. A repeating pattern has many equally good matches. That is why we need aggregation (local methods) or regularization (global methods) to pick the right answer — the topics of the next two chapters.

Matching Cost Comparison

The cost profile along one scan line. The correct disparity has the lowest cost — but noise creates false minima.

Why is the census transform particularly robust for outdoor stereo?

It encodes only the relative ordering of pixel intensities, making it invariant to any monotonic brightness change
It uses color information
It only works on edges

Chapter 3: Local Methods

The simplest approach: sum the matching cost over a window around each pixel, then pick the disparity with the lowest total cost. This is winner-take-all (WTA) with an aggregation window.

The choice of window matters enormously:

Too small: Not enough evidence. Noise dominates. Lots of wrong matches.
Too large: Averages across depth boundaries. Foreground objects "bleed" into the background, and fine details vanish.
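
Window aggregation followed by winner-take-all can be sketched like this. It assumes NumPy and a cost volume indexed as C[y, x, d]; the function names and edge padding are illustrative choices.

```python
import numpy as np

def box_aggregate(cost_volume, radius):
    """Sum C[y, x, d] over a (2r+1) x (2r+1) window at each disparity (edge-padded)."""
    h, w, _ = cost_volume.shape
    padded = np.pad(cost_volume, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    total = np.zeros_like(cost_volume, dtype=np.float64)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            total += padded[dy : dy + h, dx : dx + w, :]
    return total

def winner_take_all(cost_volume):
    """Pick the disparity with the lowest cost at each pixel."""
    return np.argmin(cost_volume, axis=2)

# Toy volume: disparity 1 is correct everywhere, but one pixel has a noisy cost.
cost = np.ones((7, 7, 2))
cost[:, :, 1] = 0.0
cost[3, 3, 1] = 5.0
```

On the toy volume, raw winner-take-all gets the noisy pixel wrong, while a radius-1 window lets its eight neighbors outvote the outlier. The same averaging is what blurs results when the window straddles a depth edge.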

Adaptive windows: Smart local methods adapt the window to image content. Bilateral aggregation (Yoon and Kweon, 2006) weights each neighbor by color similarity and spatial distance — exactly like a bilateral filter. Pixels that look similar (same surface) get high weight; pixels across a depth edge (different color) get low weight. This preserves sharp boundaries without sacrificing aggregation power.
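
The weighting idea can be sketched for a single grayscale patch as below. This is a simplified illustration, not the exact Yoon and Kweon formulation: it assumes NumPy, uses intensity instead of color, takes weights from one image only, and the sigma values are arbitrary.

```python
import numpy as np

def support_weights(patch, sigma_color=10.0, sigma_space=3.0):
    """Bilateral-style weights for a square grayscale patch, relative to its centre pixel."""
    patch = np.asarray(patch, dtype=np.float64)
    r = patch.shape[0] // 2
    ys, xs = np.mgrid[-r : r + 1, -r : r + 1]
    spatial = np.exp(-np.sqrt(ys ** 2 + xs ** 2) / sigma_space)   # nearer pixels count more
    color = np.exp(-np.abs(patch - patch[r, r]) / sigma_color)     # similar pixels count more
    return spatial * color

def weighted_cost(cost_patch, weights):
    """Weighted average of per-pixel matching costs inside the window."""
    return float(np.sum(weights * cost_patch) / np.sum(weights))
```

For a patch whose left columns are dark and whose center and right columns are bright, the dark pixels receive almost no weight, so their matching costs barely influence the aggregate at the center.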

Method	Idea
Fixed square window	Sum costs over N×N block. Fast (box filter), but blurs edges.
Shiftable window	Try all positions of a window around the pixel, take the minimum. Slightly edge-aware.
Adaptive weight	Weight neighbors by color similarity. Sharp edges, higher cost.
Guided filter	Use the color image to guide cost aggregation. State of the art among local methods.

Sub-pixel refinement: Most local methods estimate integer disparities. For smooth rendering, you fit a parabola to the cost curve around the winning disparity and find the sub-pixel minimum. Without this, view synthesis shows distracting "layering" artifacts.
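
The parabola fit has a closed form using the three costs around the winning disparity. A minimal sketch, assuming a 1D cost curve for one pixel (the function name is hypothetical):

```python
def subpixel_disparity(costs, d):
    """Refine integer disparity d by fitting a parabola through costs[d-1], costs[d], costs[d+1]."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)                      # no neighbour on one side
    c_prev, c_here, c_next = costs[d - 1], costs[d], costs[d + 1]
    denom = c_prev - 2.0 * c_here + c_next   # curvature of the fitted parabola
    if denom <= 0:
        return float(d)                      # flat or non-convex: keep the integer estimate
    return d + 0.5 * (c_prev - c_next) / denom

# A cost curve that is exactly quadratic with its minimum at 3.3.
costs = [(x - 3.3) ** 2 for x in range(8)]
refined = subpixel_disparity(costs, 3)
```

For this curve the integer minimum is at d = 3 and the refined estimate recovers 3.3. The `denom` term is the local curvature of the cost curve, so a small value signals a flat, ambiguous minimum.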

Aggregation Window Effect

Compare a small vs large aggregation window near a depth edge. Small windows preserve the edge; large windows blur it.

Confidence and uncertainty: After computing disparities, how confident are you in each estimate? The curvature of the cost function at the winning disparity indicates confidence. A sharp, deep minimum means a strong match (high texture, low ambiguity). A flat minimum means the match is uncertain (textureless region, repetitive pattern). Stereo confidence maps are critical for downstream tasks: uncertain depths should be smoothed, filled, or excluded.

Cross-checking: Compute left-to-right and right-to-left disparity maps. Pixels where the two maps disagree are likely occluded or mismatched. This simple consistency check catches many errors, and the flagged pixels can be filled from neighbors.
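
A sketch of the left-right consistency check, assuming NumPy, disparity maps of equal size, and the convention that a left pixel at column x with disparity d corresponds to column x − d in the right image. The tolerance of one pixel is an arbitrary choice.

```python
import numpy as np

def cross_check(disp_left, disp_right, tol=1):
    """Return a boolean mask that is True where the left and right disparity maps agree."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    # A left pixel (x, y) with disparity d lands at (x - d, y) in the right image.
    x_right = xs - np.rint(disp_left).astype(int)
    in_bounds = (x_right >= 0) & (x_right < w)
    x_safe = np.clip(x_right, 0, w - 1)
    agree = np.abs(disp_left - disp_right[ys, x_safe]) <= tol
    return in_bounds & agree
```

Pixels where the mask is False are the candidates for hole filling; pixels that map outside the right image are flagged too, since they cannot be verified.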

What is the fundamental trade-off in choosing the aggregation window size?

Larger windows give more evidence (reducing noise) but blur across depth boundaries, while smaller windows preserve edges but suffer from noise
Larger windows are always better
Window size does not matter

Chapter 4: Global Optimization

Instead of making a local decision at each pixel, global methods define an energy function over the entire disparity map and find the assignment that minimizes it:

E(d) = ∑_p C(p, d_p) + λ ∑_(p,q)∈N V(d_p, d_q)

The first term is the data cost — how well each pixel matches at its assigned disparity. The second is the smoothness cost — a penalty for neighboring pixels having different disparities, weighted by λ.
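
To make the energy concrete, the sketch below evaluates E(d) for a given disparity map. It assumes NumPy, a 4-connected neighborhood, a signed-integer disparity map, and an L1 smoothness term with optional truncation; these choices and the function name are assumptions, since the formula itself leaves V and N open.

```python
import numpy as np

def stereo_energy(cost_volume, disparity, lam=1.0, tau=None):
    """E(d) = sum_p C(p, d_p) + lam * sum over 4-neighbour pairs of V(d_p, d_q)."""
    h, w, _ = cost_volume.shape
    ys, xs = np.mgrid[0:h, 0:w]
    data = cost_volume[ys, xs, disparity].sum()              # data cost at the assigned disparities
    dx = np.abs(np.diff(disparity, axis=1)).astype(np.float64)
    dy = np.abs(np.diff(disparity, axis=0)).astype(np.float64)
    if tau is not None:                                      # truncated L1 penalty
        dx, dy = np.minimum(dx, tau), np.minimum(dy, tau)
    return float(data + lam * (dx.sum() + dy.sum()))
```

A global method searches for the disparity map that makes this number as small as possible; the function above only scores one candidate.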

Semi-Global Matching (SGM): Exact minimization of the 2D energy is NP-hard for most discontinuity-preserving smoothness terms. SGM (Hirschmüller, 2008) approximates it by running 1D dynamic programming along 8 or 16 directions across the image and summing the aggregated path costs. It is fast, parallelizes well (it runs on GPUs), and produces results close to those of full 2D optimization. SGM is a workhorse of automotive stereo and is widely used in production systems.
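
A single SGM path can be sketched as the recurrence below, here for the left-to-right direction only. It assumes NumPy and a cost volume indexed as C[y, x, d]; P1 penalizes a one-level disparity change and P2 any larger jump, and the default values are arbitrary. A full implementation would repeat this for the other directions and sum the results.

```python
import numpy as np

def aggregate_left_to_right(cost_volume, p1=1.0, p2=8.0):
    """Aggregate costs along the left-to-right path using the SGM recurrence."""
    h, w, n_disp = cost_volume.shape
    agg = np.empty((h, w, n_disp), dtype=np.float64)
    agg[:, 0, :] = cost_volume[:, 0, :]
    for x in range(1, w):
        prev = agg[:, x - 1, :]                          # shape (h, n_disp)
        prev_min = prev.min(axis=1, keepdims=True)       # best previous cost per row
        minus = np.full_like(prev, np.inf)
        plus = np.full_like(prev, np.inf)
        minus[:, 1:] = prev[:, :-1] + p1                 # previous pixel at d - 1
        plus[:, :-1] = prev[:, 1:] + p1                  # previous pixel at d + 1
        best = np.minimum(np.minimum(prev, prev_min + p2), np.minimum(minus, plus))
        # Subtracting prev_min keeps the values from growing along the path.
        agg[:, x, :] = cost_volume[:, x, :] + best - prev_min
    return agg
```

When a pixel's own costs are ambiguous, the path carries in the preference of the pixels before it, which is how SGM fills textureless gaps without a full 2D solve.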

Optimizer	Approach	Quality vs Speed
Dynamic programming	Optimal 1D path per scan line	Fast, but "streaking" artifacts between rows
SGM	Multiple 1D paths summed	Fast + good quality. Industry standard.
Graph cuts	Min-cut on a graph	High quality, slow
Belief propagation	Message passing on MRF	High quality, slow

Smoothness penalty design: A simple L1 penalty |d_p − d_q| encourages piecewise-smooth disparity maps. A truncated penalty min(|d_p − d_q|, τ) allows sharp depth discontinuities without excessive cost. SGM further modulates the penalty by the image gradient: across an edge in the color image, the smoothness penalty is reduced because a depth jump is expected there.

SGM Path Aggregation

SGM runs 1D optimization along multiple directions and sums the results. More paths give a better approximation of the 2D global optimum.

Graph cuts vs belief propagation: For the full 2D energy, graph cuts (Boykov et al., 2001) find a strong local minimum by solving a max-flow/min-cut problem on a graph where nodes are pixels and edges encode both data and smoothness costs. Belief propagation (Sun et al., 2003) passes messages between neighboring pixels iteratively, each message encoding the cost distribution. Both are slower than SGM but can produce higher quality results on complex scenes.

PatchMatch stereo: Instead of testing all disparities at every pixel, randomly initialize, then propagate good solutions to neighbors. If your neighbor found a good slanted plane, try a similar plane at your pixel. This random search + propagation strategy converges rapidly and naturally handles slanted surfaces, which are problematic for discrete disparity methods.

How does Semi-Global Matching (SGM) approximate global optimization efficiently?

It runs 1D dynamic programming along 8-16 directions across the image and sums the aggregated costs, achieving near-global results without full 2D optimization
It uses a very large window
It processes only every other pixel

Chapter 5: Deep Stereo Networks

Deep learning has transformed stereo matching. Modern networks learn to extract features, compute matching costs, aggregate context, and regularize the disparity map — all end-to-end from data. Where classical methods require careful tuning of window sizes, penalty functions, and post-processing steps, a single trained network replaces the entire pipeline.

Feature Extraction

CNN extracts per-pixel features from both images

↓

Cost Volume

Correlate left/right features at each disparity to build a 3D volume

↓

3D Aggregation

3D convolutions (or recurrent units) smooth the cost volume

↓

Disparity Regression

Soft-argmin produces a differentiable, sub-pixel disparity estimate
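
Soft-argmin is a softmax over negated costs followed by an expected value. The sketch below shows the arithmetic in NumPy, assuming the disparity axis is last; in a real network the same expression would be written in an autodiff framework so that gradients flow through it.

```python
import numpy as np

def soft_argmin(cost_volume):
    """Expected disparity under softmax(-cost) along the last axis."""
    cost = np.asarray(cost_volume, dtype=np.float64)
    logits = -cost
    logits -= logits.max(axis=-1, keepdims=True)      # stabilise the exponentials
    prob = np.exp(logits)
    prob /= prob.sum(axis=-1, keepdims=True)
    disparities = np.arange(cost.shape[-1])
    return (prob * disparities).sum(axis=-1)
```

When two adjacent disparities share the lowest cost, the estimate lands halfway between them, which is where the sub-pixel precision comes from. A hard argmin would have to pick one of the two and has no useful gradient.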

Learned matching cost: Zbontar and LeCun (2016) showed that training a CNN to compare patches drastically outperformed hand-crafted costs like census or NCC. This single insight launched the deep stereo era. Later networks (GC-Net, PSMNet, RAFT-Stereo) moved to end-to-end training where the cost volume is constructed and refined jointly.

Network	Key Innovation
MC-CNN	Learned patch matching cost (Zbontar & LeCun, 2016)
GC-Net	End-to-end with 3D cost volume convolutions (Kendall et al., 2017)
PSMNet	Spatial pyramid pooling for large context (Chang & Chen, 2018)
RAFT-Stereo	Iterative refinement with GRU, sub-pixel accuracy (Lipson et al., 2021)
CREStereo	Cascade recurrent estimation, top-ranked on ETH3D at publication (Li et al., 2022)

Monocular depth: Remarkably, networks can estimate depth from a single image by learning statistical priors about scene geometry. Models like MiDaS and DPT predict relative depth maps that are useful for video effects, though they lack the metric accuracy of stereo.

Self-supervised stereo: You do not always need ground-truth depth maps for training. Warp one image of the pair into the other view using the predicted disparity map. If the prediction is correct, the warped image should match the actual image from that view. This photometric loss enables training on unlabeled stereo video, which suits autonomous driving where ground truth is expensive. MonoDepth (Godard et al., 2017) popularized this approach, using stereo pairs to train a single-image depth network.
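
One common way to set up the photometric loss is to reconstruct the left view by sampling the right image at x − d and comparing it with the real left image. The sketch below assumes NumPy, grayscale images, linear interpolation along rows, and a plain L1 difference; practical losses usually add structural-similarity and smoothness terms, which are omitted here.

```python
import numpy as np

def warp_right_to_left(right, disp_left):
    """Reconstruct the left view by sampling right[y, x - d] with linear interpolation."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disp_left             # sampling positions in the right image
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

def photometric_loss(left, right, disp_left):
    """Mean absolute difference between the left image and its reconstruction."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disp_left))))
```

A correct disparity map drives the loss toward zero, so the images themselves act as the supervision signal.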

Iterative refinement (RAFT-Stereo): Instead of predicting a single disparity map, iteratively refine it. Build a correlation volume, then use a GRU (gated recurrent unit) to produce a sequence of disparity updates, each correcting the previous one. After 10-20 iterations, the result converges to sub-pixel accuracy. This mirrors the coarse-to-fine philosophy of classical methods but is fully differentiable.

What makes the soft-argmin operation important in deep stereo networks?

It produces a differentiable sub-pixel disparity estimate from the cost volume, enabling end-to-end training with gradient descent
It reduces the number of parameters
It increases the resolution of the output

Chapter 6: Multi-View Stereo

Two views give one disparity map. But adding more cameras dramatically improves quality: you get more coverage (fewer occlusions), more constraints (better accuracy), and the ability to reconstruct complete 3D objects.

Multi-view stereo (MVS) takes the sparse point cloud from SfM (Chapter 11) and densifies it into a complete 3D model. The input is a set of images with known camera poses; the output is a dense depth map per view or a fused 3D point cloud.

Volumetric fusion: One powerful approach discretizes space into a 3D voxel grid. For each voxel, project it into every camera and measure photoconsistency. Voxels where all cameras agree become the surface. This naturally handles arbitrary camera configurations and complex topology. TSDF (Truncated Signed Distance Function) fusion, as used in KinectFusion, accumulates signed distances from multiple depth maps into a single coherent volume.
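
The core of TSDF fusion is a running weighted average per voxel. The sketch below reduces it to voxels sampled along a single camera ray, assuming NumPy, unit weight per measurement, and a truncation distance of 0.1 m; it illustrates the update rule only and is not the KinectFusion implementation.

```python
import numpy as np

def tsdf_update(tsdf, weight, voxel_depths, measured_depth, trunc=0.1):
    """Fuse one depth measurement into voxels sampled along a single camera ray."""
    sdf = measured_depth - voxel_depths           # positive in front of the surface
    valid = sdf >= -trunc                         # skip voxels far behind the surface
    new = np.clip(sdf / trunc, -1.0, 1.0)         # truncate and normalise to [-1, 1]
    fused = (tsdf * weight + new) / (weight + 1.0)
    tsdf = np.where(valid, fused, tsdf)
    weight = np.where(valid, weight + 1.0, weight)
    return tsdf, weight

# Three noisy measurements of a surface near 1.0 m.
voxel_depths = np.linspace(0.5, 1.5, 101)
tsdf = np.zeros_like(voxel_depths)
weight = np.zeros_like(voxel_depths)
for z in (0.98, 1.00, 1.02):
    tsdf, weight = tsdf_update(tsdf, weight, voxel_depths, z)
```

After fusing, the zero crossing of `tsdf` sits at the averaged surface position, and voxels well behind the surface are never updated.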

Approach	How It Works	Output
Depth map fusion	Estimate per-view depth maps, fuse into one point cloud	Point cloud or mesh
Volumetric (voxel)	Carve or score a voxel grid using photo-consistency	Voxel grid → mesh via Marching Cubes
Shape from silhouettes	Intersect visual cones from object silhouettes	Visual hull (conservative bound)
Patch-based (PMVS)	Grow oriented patches across views	Dense oriented point cloud

COLMAP (Schonberger et al., 2016) remains the reference pipeline: it runs SfM first, then per-view depth estimation using photometric and geometric consistency, then fusion. It handles hundreds of unordered photos and produces meshes suitable for rendering.

What advantage does multi-view stereo have over two-view stereo?

More viewpoints reduce occlusions, provide more matching constraints, and enable complete 3D object reconstruction
It uses less memory
It does not require camera calibration

Chapter 7: Active Depth Sensing

Passive stereo relies on scene texture. In a featureless white room, there is nothing to match. Active depth sensors solve this by projecting their own patterns onto the scene.

Technology	How It Works	Example
Structured light	Project a known pattern (stripes, dots). Decode the pattern in the camera to determine per-pixel depth via triangulation.	Kinect v1, Intel RealSense
Time of flight (ToF)	Emit modulated or pulsed IR light. Measure the phase shift or round-trip delay Δt of the return signal to compute distance: distance = c · Δt / 2.	Kinect v2, iPhone LiDAR
LiDAR	Pulsed laser measures round-trip time per point. Mechanically or electronically scanned.	Velodyne, Ouster (autonomous vehicles)
Laser stripe	Sweep a plane of laser light. The stripe deformation encodes surface shape via triangulation.	Industrial 3D scanners
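
The time-of-flight arithmetic in the table can be sketched as follows. The pulsed form is the c · Δt / 2 relation; the phase form follows from Δt = φ / (2π · f_mod) for a modulation frequency f_mod. The function names and the idea of a single modulation frequency are assumptions made for illustration.

```python
import math

C = 299_792_458.0   # speed of light in m/s

def distance_from_round_trip(delta_t_s):
    """Pulsed time of flight: the light travels out and back, hence the factor 1/2."""
    return C * delta_t_s / 2.0

def distance_from_phase(phase_rad, mod_freq_hz):
    """Continuous-wave time of flight: phase shift of the modulation signal."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

def unambiguous_range(mod_freq_hz):
    """Distance at which the phase wraps around to zero again."""
    return C / (2.0 * mod_freq_hz)
```

Because phase wraps every 2π, a phase-based sensor cannot tell apart distances that differ by a multiple of `unambiguous_range`; lowering the modulation frequency extends the range at the cost of precision.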

Structured light in detail: The Kinect v1 projects ~30,000 IR dots in a pseudo-random pattern. Each local neighborhood of dots is unique, so matching a received dot pattern to the stored reference gives a per-dot disparity — the same triangulation math as stereo, but the "texture" is guaranteed by the projector. This works even on blank walls.

ToF vs structured light: ToF is faster (single shot, no sweeping), works at longer range, but has lower resolution and suffers from multi-path interference (light that bounces more than once before returning). Structured light gives higher resolution and accuracy at close range, but fails in bright sunlight, where ambient IR washes out the projected pattern. Apple uses both: the front TrueDepth camera is structured light, the rear LiDAR is ToF.

Why does structured light work in textureless scenes where passive stereo fails?

It projects its own known pattern onto the scene, providing the "texture" needed for matching, even on blank surfaces
It uses a higher-resolution camera
It takes more photos

Chapter 8: Showcase — Stereo Matching Demo

See how a stereo algorithm turns a pair of rectified images into a depth map. The left image has a scene with objects at different depths. Slide the disparity to see which parts of the scene "light up" (match well) at each depth level.

Disparity Space Sweep

Sweep through disparities to find where objects match. The scene has three depth layers. Watch the cost drop at correct disparities. Controls: disparity d and window size.

What to notice: At the correct disparity for each object, its matching cost drops sharply (bright green). With a small window, you get precise edges but noisy results. With a large window, edges blur but noise decreases. This is the fundamental aggregation trade-off from Chapter 3 — visible right before your eyes.

Chapter 9: Connections

Concept	Used In
Epipolar geometry & rectification	Ch 11 (SfM), Ch 14 (view interpolation)
Disparity maps	Ch 14 (IBR), Ch 9 (motion estimation), autonomous driving
Multi-view stereo	Ch 13 (3D reconstruction), Ch 14 (neural rendering / NeRFs)
Cost volume / DSI	Deep stereo networks, optical flow networks (RAFT)
Active depth sensors	Ch 13 (3D scanning), robotics, AR/VR, autonomous vehicles
Global optimization / MRF	Ch 4 (model fitting), Ch 6 (semantic segmentation)

Szeliski's perspective: "Stereo correspondence is perhaps the oldest problem in computer vision, dating back to the 19th century stereoscope. Yet it remains remarkably relevant: the same disparity estimation principles now power neural view synthesis, autonomous driving, and the depth sensors in every smartphone."

Which concept from stereo matching was later adopted by optical flow networks like RAFT?

The cost volume: correlating features at multiple displacement hypotheses to build a 3D (or 4D) volume, then using iterative refinement to find the best match
The census transform
Laser scanning