Szeliski, Chapter 13

3D Reconstruction

From depth images and point clouds to watertight surfaces: shape-from-X cues, scanning, surface fitting, volumetric methods, and model-based reconstruction.

Prerequisites: Chapter 2 (image formation), Chapter 12 (depth estimation), Chapter 11 (SfM).
10
Chapters
5+
Simulations
0
Assumed CV Knowledge

Chapter 0: Why 3D Reconstruction?

You have a handful of depth maps from a stereo camera, or a point cloud from a LiDAR scan, or surface normals from photometric stereo. Each is a partial, noisy view of an object. How do you stitch them into a single, complete, watertight 3D model you can render, print, or simulate?

That is the 3D reconstruction problem. It spans everything from inferring shape from a single photograph (using shading, texture, or focus cues) to merging millions of laser scans into a museum-quality digital replica of Michelangelo's David.

Why it matters: 3D reconstruction is the backbone of digital heritage (preserving ancient sites), medical imaging (organ models from CT/MRI), manufacturing (quality inspection), VFX (digital doubles), gaming (photogrammetric assets), and autonomous driving (HD maps).
From Scans to Surface

Multiple partial scans (colored) overlap. Merging them produces a single coherent surface.

What is the core challenge of 3D reconstruction?

Chapter 1: Shape from Shading

Look at a smooth white sphere under a single light source. Even from one image, you can see its shape from the way brightness varies across the surface. Bright where the surface faces the light, dark where it curves away. This is shape from shading.

For a diffuse (Lambertian) surface, brightness depends on the angle between the surface normal n and the light direction v:

I(x, y) = ρ · max(0, n̂ · v)

where ρ is the albedo (surface reflectance). Since each pixel gives one equation but the surface normal has two unknowns (the surface slopes p = ∂z/∂x, q = ∂z/∂y), the problem is underdetermined. A smoothness constraint is needed to regularize it.

Reflectance map: The function R(p, q) that maps surface slopes to image brightness is called the reflectance map. For a known light direction and Lambertian surface, it forms a smooth dome in (p, q) space. Each iso-brightness contour on this dome corresponds to a family of possible orientations. Without additional constraints, you cannot tell which one is correct — hence the need for smoothness or additional images.
Limitations: Shape from shading assumes uniform albedo, known lighting, and no inter-reflections. Real objects violate all three. That is why it is rarely used alone — but combined with stereo (for coarse shape) it fills in fine detail beautifully, since shading varies at a finer scale than stereo can resolve.
Shape from Shading

A sphere's brightness varies with the angle between the surface normal and the light direction. Adjust the light to see how shading reveals shape.

Light elevation 45°

Shape from texture: A regularly textured surface (like a brick wall viewed at an angle) reveals its orientation through the foreshortening of the texture pattern. Where the surface tilts away from you, the texture appears compressed. Measuring this compression gives the local surface normal. It is a weaker cue than shading but works on textured surfaces where shading is ambiguous.

Shape from focus: Take many photos at different focus settings. Each pixel is sharpest when the focal plane hits the surface at that point. Plotting sharpness vs. focus distance and finding the peak gives depth. Telecentric optics eliminate the magnification change between focus settings, enabling sub-millimeter depth from defocus.

Why is shape from shading underdetermined from a single image?

Chapter 2: Photometric Stereo

Shape from shading with a single light is underdetermined. The fix is elegant: take multiple images of the same scene, each with a different light direction. This is photometric stereo (Woodham, 1981).

For a diffuse surface with unknown albedo ρ and normal n, each light direction vk gives:

Ik = ρ · n̂ · vk

With three or more non-coplanar light directions, you can solve for ρn (albedo times normal) using linear least squares. Then normalize to get the normal and albedo separately.

Photometric Stereo

A sphere lit from different directions. Each lighting reveals different surface normals. Combined, they uniquely determine every normal.

Light angle 45°
Beyond Lambertian: Specular surfaces violate the Lambertian assumption. Highlights from specular reflection can be detected and excluded, or more sophisticated BRDF models can be fit. Recent deep learning approaches handle arbitrary materials by learning the mapping from multiple images to normals directly, without assuming a specific reflectance model.
VariantIdea
Classical (3 lights)Minimum configuration for Lambertian surfaces. Closed-form solution.
Many lightsOver-determined system. Robust to shadows, specularities via RANSAC or robust fitting.
OutdoorUse the sun as a moving light source across a day. Webcam-based photometric stereo.
Multi-view + photometricCombine stereo for coarse shape with photometric stereo for fine normals (e.g., Logothetis et al., 2019).
How many non-coplanar light directions are needed to solve for the surface normal and albedo?

Chapter 3: 3D Scanning

Active illumination takes shape recovery out of the "clever algorithm" regime and into engineering precision. A 3D scanner projects controlled light and triangulates the reflected signal.

TechnologyPrincipleAccuracy
Laser stripeSweep a laser plane across the object. The deformed stripe in the camera encodes depth via triangulation.Sub-millimeter
Structured lightProject a coded pattern (Gray codes, phase shifting). Decode each pixel's pattern to find correspondences.10s of microns
Time of flightEmit modulated light, measure phase shift of the return.~1 cm
Shadow scanningWave a stick casting a shadow. Track the shadow's sweep across known background planes (Bouguet & Perona, 1999).~1 mm (DIY)
Temporal peak detection: In laser stripe scanning, the best accuracy comes from finding not where the stripe is in any single frame, but the exact time each pixel reaches peak brightness as the stripe sweeps past. This sub-pixel temporal interpolation (Curless and Levoy, 1995) was critical for the Digital Michelangelo Project, which scanned the 5-meter David statue at 0.29 mm resolution — 2 billion polygons.
Shape from focus: Another active cue: take a stack of images at different focus settings. Each pixel is sharpest when the focus plane hits the surface. Plotting sharpness vs focus distance and finding the peak gives depth. This works even on transparent or specular objects where triangulation-based methods fail.
How does a laser stripe scanner recover 3D depth?

Chapter 4: Range Data Merging

A single scan sees only the side facing the scanner. To reconstruct a complete object, you scan from many directions, align (register) the scans, and merge them into one surface.

Registration brings all scans into a common coordinate system. The gold standard is Iterative Closest Point (ICP):

1. Find closest points
For each point in scan A, find the nearest point in scan B
2. Estimate transform
Compute the rigid transform (R, t) that minimizes the sum of squared distances
3. Apply and repeat
Transform scan A, then iterate until convergence
↻ repeat until convergence
Volumetric merging (TSDF): After alignment, how do you merge overlapping scans? The elegant answer (Curless and Levoy, 1996): maintain a 3D voxel grid of truncated signed distance function (TSDF) values. Each scan updates each voxel with its signed distance to the surface (positive outside, negative inside), truncated near the surface. Average all scans' TSDFs together. The zero-crossing of the averaged field is the merged surface. Extract it with Marching Cubes.

This approach is remarkably robust to noise: outlier scans are diluted by the averaging. It is also the same method used in real-time systems like KinectFusion, which fuses 30 depth frames per second into a growing TSDF volume.

ICP Registration

Two overlapping point clouds (blue and orange) are aligned iteratively. Watch ICP converge by clicking Step or Auto.

Global registration: ICP needs a reasonable initial alignment to converge (it is a local optimizer). For scans with no initial alignment, use feature-based methods first: extract 3D keypoints (e.g., FPFH descriptors), match them between scans, and use RANSAC to find a coarse alignment. Then refine with ICP. This two-stage approach handles even scans taken from completely different angles.
What does the zero-crossing of a TSDF represent?

Chapter 5: Surface Representations

Once you have raw 3D data (points, normals, depth maps), you need a surface representation to store, render, and manipulate the model. The choice of representation profoundly affects what you can do with the model.

RepresentationWhat It StoresStrengthsWeaknesses
Triangle meshVertices + trianglesGPU-friendly, standard for renderingTopology must be consistent, hard to merge
Point cloud3D positions (+ normals, colors)Simple, no topology neededNo surface continuity, holes
Implicit (SDF)Signed distance at grid pointsEasy to merge, topologically flexibleMemory-heavy, resolution-limited
Subdivision surfaceCoarse control mesh + rulesSmooth, multi-resolution, compactRequires clean topology
NURBS / splineControl points + knotsSmooth, compact, exact for CADPatching is hard for complex shapes
Surface simplification: A raw scan mesh may have millions of triangles — far too many for real-time rendering or streaming. Progressive meshes (Hoppe, 1996) simplify the mesh by iteratively collapsing edges, prioritized by the geometric error each collapse introduces. You can stream a model from coarse to fine, or pick the right level of detail for the viewing distance.
Geometry images: An elegant idea (Gu, Gortler, and Hoppe, 2002): cut and flatten the mesh onto a square, then store the x, y, z coordinates as an image (R=x, G=y, B=z). Now standard image compression (JPEG-like) compresses the geometry. GPU texture mapping hardware renders it. Geometry becomes just another image.

Surface interpolation: Raw data is noisy and has gaps. Surface interpolation fills holes and smooths noise by fitting a smooth function (thin-plate spline, radial basis function, or variational surface) through the data points. The key is balancing data fidelity (stay close to the observations) against smoothness (do not overfit noise). This is a regularization trade-off — the same principle from Chapter 4.

Why are implicit (SDF) representations especially useful for merging multiple scans?

Chapter 6: Point Cloud Processing

Sometimes you do not need a full mesh. A point cloud — a set of 3D points with optional normals and colors — is the rawest 3D representation and often the most practical.

Key operations on point clouds:

Point-based rendering (splatting): You can render a point cloud directly without meshing. Each point is drawn as a small oriented disc (a "splat") that covers the surface. With enough points and proper blending, the result looks like a smooth surface. This is much simpler than mesh reconstruction and avoids topology errors. The surfels representation (Pfister et al., 2000) formalized this idea.
Point Cloud Normal Estimation

Each point's normal is estimated from its local neighborhood. Adjust k to see how neighborhood size affects smoothness vs detail.

Neighbors k 6
How are surface normals typically estimated from an unstructured point cloud?

Chapter 7: Volumetric Methods

Volumetric methods represent 3D space on a regular grid and extract the surface as the boundary between "inside" and "outside." This is the most topologically flexible approach — it handles arbitrary genus, self-intersections, and merging naturally.

Marching Cubes: The workhorse algorithm (Lorensen and Cline, 1987). Walk through each cube (8 neighboring voxels) in the grid. If some corners are inside and others outside, the surface passes through that cube. A lookup table of 256 cases (28 sign combinations, reduced to 15 by symmetry) determines the triangle(s) to emit. The result is a watertight mesh.
MethodImplicit FunctionKey Idea
TSDF fusionAveraged signed distanceAverage multiple depth observations. Fast, incremental (KinectFusion).
Poisson reconIndicator functionSolve a Poisson equation using oriented normals. Produces smooth, watertight meshes (Kazhdan et al., 2006).
Level setsEvolving SDFSurface evolves under PDE forces (curvature, data fit). Topology changes handled naturally.
Space carvingOccupancyStart with a full volume. Remove voxels inconsistent with silhouettes or photo-consistency.
Neural implicit (DeepSDF)Learned SDFA neural network predicts signed distance for any query point. Continuous, compact, learned from data.
Poisson surface reconstruction: Given oriented point normals, the indicator function (1 inside, 0 outside) has a gradient that equals the normal field at the surface. So reconstructing the indicator from normals is a Poisson problem: ∇2χ = ∇ · V. Solve with a hierarchical octree solver. The result is a smooth, watertight mesh that fills holes and cleans noise — one of the most widely used algorithms in 3D reconstruction.
How does Marching Cubes extract a surface from a volumetric field?

Chapter 8: Showcase — TSDF Fusion

Watch how multiple depth observations are fused into a signed distance field. Each scan updates the volume. The zero-crossing converges to the true surface.

Volumetric TSDF Fusion (2D Cross-Section)

A 2D slice of a volume. Blue = positive (outside), red = negative (inside). The white zero-crossing is the surface. Add scans to see it converge.

Noise level 1
What to notice: With one scan, the field is noisy and the zero-crossing is rough. As more scans are added, averaging smooths the noise and the surface sharpens. Increase noise to see how volumetric fusion gracefully degrades — even noisy scans contribute useful signal when averaged.

Chapter 9: Connections

ConceptUsed In
Shape from shading / photometric stereoHigh-detail scanning, industrial inspection, combining with stereo (Ch 12)
ICP registrationSLAM (Ch 11), scan alignment, autonomous driving (LiDAR odometry)
TSDF fusionKinectFusion, real-time 3D reconstruction, robotics mapping
Marching CubesMedical imaging (CT/MRI surfaces), volumetric rendering, neural implicits
Neural implicit surfacesNeRF (Ch 14), DeepSDF, Occupancy Networks, generative 3D models
Point cloud processingLiDAR perception, autonomous driving, PointNet/PointNet++
Surface representationsGame assets, VFX, 3D printing, CAD/CAM
Szeliski's perspective: "3D reconstruction sits at the crossroads of computer vision and computer graphics. Vision provides the observations; graphics provides the representations. Recent advances in neural implicit surfaces have blurred this boundary entirely — the same network that encodes the geometry also renders the view."
What recent approach represents 3D surfaces as neural networks that predict signed distance for any query point?