3D Reconstruction — Szeliski, Chapter 13

Chapter 0: Why 3D Reconstruction?

You have a handful of depth maps from a stereo camera, or a point cloud from a LiDAR scan, or surface normals from photometric stereo. Each is a partial, noisy view of an object. How do you stitch them into a single, complete, watertight 3D model you can render, print, or simulate?

That is the 3D reconstruction problem. It spans everything from inferring shape from a single photograph (using shading, texture, or focus cues) to merging millions of laser scans into a museum-quality digital replica of Michelangelo's David.

Why it matters: 3D reconstruction is the backbone of digital heritage (preserving ancient sites), medical imaging (organ models from CT/MRI), manufacturing (quality inspection), VFX (digital doubles), gaming (photogrammetric assets), and autonomous driving (HD maps).

From Scans to Surface

Multiple partial scans (colored) overlap. Merging them produces a single coherent surface.

What is the core challenge of 3D reconstruction?

Combining multiple partial, noisy observations (depth maps, scans, normals) into a single complete and consistent 3D surface model Making images brighter Classifying objects in images

Chapter 1: Shape from Shading

Look at a smooth white sphere under a single light source. Even from one image, you can see its shape from the way brightness varies across the surface. Bright where the surface faces the light, dark where it curves away. This is shape from shading.

For a diffuse (Lambertian) surface, brightness depends on the angle between the surface normal n and the light direction v:

I(x, y) = ρ · max(0, n̂ · v)

where ρ is the albedo (surface reflectance). Since each pixel gives one equation but the surface normal has two unknowns (the surface slopes p = ∂z/∂x, q = ∂z/∂y), the problem is underdetermined. A smoothness constraint is needed to regularize it.

Reflectance map: The function R(p, q) that maps surface slopes to image brightness is called the reflectance map. For a known light direction and Lambertian surface, it forms a smooth dome in (p, q) space. Each iso-brightness contour on this dome corresponds to a family of possible orientations. Without additional constraints, you cannot tell which one is correct — hence the need for smoothness or additional images.

Limitations: Shape from shading assumes uniform albedo, known lighting, and no inter-reflections. Real objects violate all three. That is why it is rarely used alone — but combined with stereo (for coarse shape) it fills in fine detail beautifully, since shading varies at a finer scale than stereo can resolve.

Shape from Shading

A sphere's brightness varies with the angle between the surface normal and the light direction. Adjust the light to see how shading reveals shape.

Light elevation 45°

Shape from texture: A regularly textured surface (like a brick wall viewed at an angle) reveals its orientation through the foreshortening of the texture pattern. Where the surface tilts away from you, the texture appears compressed. Measuring this compression gives the local surface normal. It is a weaker cue than shading but works on textured surfaces where shading is ambiguous.

Shape from focus: Take many photos at different focus settings. Each pixel is sharpest when the focal plane hits the surface at that point. Plotting sharpness vs. focus distance and finding the peak gives depth. Telecentric optics eliminate the magnification change between focus settings, enabling sub-millimeter depth from defocus.

Why is shape from shading underdetermined from a single image?

Each pixel gives one brightness value but the surface normal has two unknowns (p and q), so there are more unknowns than equations The image is too noisy The camera is not calibrated

Chapter 2: Photometric Stereo

Shape from shading with a single light is underdetermined. The fix is elegant: take multiple images of the same scene, each with a different light direction. This is photometric stereo (Woodham, 1981).

For a diffuse surface with unknown albedo ρ and normal n, each light direction v_k gives:

I_k = ρ · n̂ · v_k

With three or more non-coplanar light directions, you can solve for ρn (albedo times normal) using linear least squares. Then normalize to get the normal and albedo separately.

Photometric Stereo

A sphere lit from different directions. Each lighting reveals different surface normals. Combined, they uniquely determine every normal.

Light angle 45°

Beyond Lambertian: Specular surfaces violate the Lambertian assumption. Highlights from specular reflection can be detected and excluded, or more sophisticated BRDF models can be fit. Recent deep learning approaches handle arbitrary materials by learning the mapping from multiple images to normals directly, without assuming a specific reflectance model.

Variant	Idea
Classical (3 lights)	Minimum configuration for Lambertian surfaces. Closed-form solution.
Many lights	Over-determined system. Robust to shadows, specularities via RANSAC or robust fitting.
Outdoor	Use the sun as a moving light source across a day. Webcam-based photometric stereo.
Multi-view + photometric	Combine stereo for coarse shape with photometric stereo for fine normals (e.g., Logothetis et al., 2019).

How many non-coplanar light directions are needed to solve for the surface normal and albedo?

At least three, because the unknown ρn has three components and each light direction provides one equation One is enough Exactly ten

Chapter 3: 3D Scanning

Active illumination takes shape recovery out of the "clever algorithm" regime and into engineering precision. A 3D scanner projects controlled light and triangulates the reflected signal.

Technology	Principle	Accuracy
Laser stripe	Sweep a laser plane across the object. The deformed stripe in the camera encodes depth via triangulation.	Sub-millimeter
Structured light	Project a coded pattern (Gray codes, phase shifting). Decode each pixel's pattern to find correspondences.	10s of microns
Time of flight	Emit modulated light, measure phase shift of the return.	~1 cm
Shadow scanning	Wave a stick casting a shadow. Track the shadow's sweep across known background planes (Bouguet & Perona, 1999).	~1 mm (DIY)

Temporal peak detection: In laser stripe scanning, the best accuracy comes from finding not where the stripe is in any single frame, but the exact time each pixel reaches peak brightness as the stripe sweeps past. This sub-pixel temporal interpolation (Curless and Levoy, 1995) was critical for the Digital Michelangelo Project, which scanned the 5-meter David statue at 0.29 mm resolution — 2 billion polygons.

Shape from focus: Another active cue: take a stack of images at different focus settings. Each pixel is sharpest when the focus plane hits the surface. Plotting sharpness vs focus distance and finding the peak gives depth. This works even on transparent or specular objects where triangulation-based methods fail.

How does a laser stripe scanner recover 3D depth?

A laser plane illuminates the surface, and the camera observes how the stripe deforms — triangulation between the known laser plane and the camera gives per-point depth It measures the color of reflected light It counts the number of photons

Chapter 4: Range Data Merging

A single scan sees only the side facing the scanner. To reconstruct a complete object, you scan from many directions, align (register) the scans, and merge them into one surface.

Registration brings all scans into a common coordinate system. The gold standard is Iterative Closest Point (ICP):

1. Find closest points

For each point in scan A, find the nearest point in scan B

↓

2. Estimate transform

Compute the rigid transform (R, t) that minimizes the sum of squared distances

↓

3. Apply and repeat

Transform scan A, then iterate until convergence

↻ repeat until convergence

Volumetric merging (TSDF): After alignment, how do you merge overlapping scans? The elegant answer (Curless and Levoy, 1996): maintain a 3D voxel grid of truncated signed distance function (TSDF) values. Each scan updates each voxel with its signed distance to the surface (positive outside, negative inside), truncated near the surface. Average all scans' TSDFs together. The zero-crossing of the averaged field is the merged surface. Extract it with Marching Cubes.

This approach is remarkably robust to noise: outlier scans are diluted by the averaging. It is also the same method used in real-time systems like KinectFusion, which fuses 30 depth frames per second into a growing TSDF volume.

ICP Registration

Two overlapping point clouds (blue and orange) are aligned iteratively. Watch ICP converge by clicking Step or Auto.

Global registration: ICP needs a reasonable initial alignment to converge (it is a local optimizer). For scans with no initial alignment, use feature-based methods first: extract 3D keypoints (e.g., FPFH descriptors), match them between scans, and use RANSAC to find a coarse alignment. Then refine with ICP. This two-stage approach handles even scans taken from completely different angles.

What does the zero-crossing of a TSDF represent?

The surface itself — where the signed distance transitions from positive (outside) to negative (inside), which is extracted as a mesh using Marching Cubes The center of the object The camera position

Chapter 5: Surface Representations

Once you have raw 3D data (points, normals, depth maps), you need a surface representation to store, render, and manipulate the model. The choice of representation profoundly affects what you can do with the model.

Representation	What It Stores	Strengths	Weaknesses
Triangle mesh	Vertices + triangles	GPU-friendly, standard for rendering	Topology must be consistent, hard to merge
Point cloud	3D positions (+ normals, colors)	Simple, no topology needed	No surface continuity, holes
Implicit (SDF)	Signed distance at grid points	Easy to merge, topologically flexible	Memory-heavy, resolution-limited
Subdivision surface	Coarse control mesh + rules	Smooth, multi-resolution, compact	Requires clean topology
NURBS / spline	Control points + knots	Smooth, compact, exact for CAD	Patching is hard for complex shapes

Surface simplification: A raw scan mesh may have millions of triangles — far too many for real-time rendering or streaming. Progressive meshes (Hoppe, 1996) simplify the mesh by iteratively collapsing edges, prioritized by the geometric error each collapse introduces. You can stream a model from coarse to fine, or pick the right level of detail for the viewing distance.

Geometry images: An elegant idea (Gu, Gortler, and Hoppe, 2002): cut and flatten the mesh onto a square, then store the x, y, z coordinates as an image (R=x, G=y, B=z). Now standard image compression (JPEG-like) compresses the geometry. GPU texture mapping hardware renders it. Geometry becomes just another image.

Surface interpolation: Raw data is noisy and has gaps. Surface interpolation fills holes and smooths noise by fitting a smooth function (thin-plate spline, radial basis function, or variational surface) through the data points. The key is balancing data fidelity (stay close to the observations) against smoothness (do not overfit noise). This is a regularization trade-off — the same principle from Chapter 4.

Why are implicit (SDF) representations especially useful for merging multiple scans?

Multiple signed distance fields can be averaged together trivially, with the merged surface emerging as the zero-crossing — no explicit mesh stitching needed They use less memory They render faster

Chapter 6: Point Cloud Processing

Sometimes you do not need a full mesh. A point cloud — a set of 3D points with optional normals and colors — is the rawest 3D representation and often the most practical.

Key operations on point clouds:

Normal estimation: Fit a local plane to each point's k-nearest neighbors. The plane normal is the surface normal.
Filtering: Remove outliers via statistical or radius-based tests.
Downsampling: Voxel grid filter — divide space into voxels, keep one point per voxel.
Registration: Align two clouds using ICP or feature-based methods.
Segmentation: Cluster points into planar patches, objects, or semantic classes.

Point-based rendering (splatting): You can render a point cloud directly without meshing. Each point is drawn as a small oriented disc (a "splat") that covers the surface. With enough points and proper blending, the result looks like a smooth surface. This is much simpler than mesh reconstruction and avoids topology errors. The surfels representation (Pfister et al., 2000) formalized this idea.

Point Cloud Normal Estimation

Each point's normal is estimated from its local neighborhood. Adjust k to see how neighborhood size affects smoothness vs detail.

Neighbors k 6

How are surface normals typically estimated from an unstructured point cloud?

By fitting a local plane to each point's k-nearest neighbors — the plane's normal is the estimated surface normal By computing the color gradient By measuring the distance to the camera

Chapter 7: Volumetric Methods

Volumetric methods represent 3D space on a regular grid and extract the surface as the boundary between "inside" and "outside." This is the most topologically flexible approach — it handles arbitrary genus, self-intersections, and merging naturally.

Marching Cubes: The workhorse algorithm (Lorensen and Cline, 1987). Walk through each cube (8 neighboring voxels) in the grid. If some corners are inside and others outside, the surface passes through that cube. A lookup table of 256 cases (2⁸ sign combinations, reduced to 15 by symmetry) determines the triangle(s) to emit. The result is a watertight mesh.

Method	Implicit Function	Key Idea
TSDF fusion	Averaged signed distance	Average multiple depth observations. Fast, incremental (KinectFusion).
Poisson recon	Indicator function	Solve a Poisson equation using oriented normals. Produces smooth, watertight meshes (Kazhdan et al., 2006).
Level sets	Evolving SDF	Surface evolves under PDE forces (curvature, data fit). Topology changes handled naturally.
Space carving	Occupancy	Start with a full volume. Remove voxels inconsistent with silhouettes or photo-consistency.
Neural implicit (DeepSDF)	Learned SDF	A neural network predicts signed distance for any query point. Continuous, compact, learned from data.

Poisson surface reconstruction: Given oriented point normals, the indicator function (1 inside, 0 outside) has a gradient that equals the normal field at the surface. So reconstructing the indicator from normals is a Poisson problem: ∇²χ = ∇ · V. Solve with a hierarchical octree solver. The result is a smooth, watertight mesh that fills holes and cleans noise — one of the most widely used algorithms in 3D reconstruction.

How does Marching Cubes extract a surface from a volumetric field?

It examines each cube of 8 voxels, and wherever the field changes sign (inside to outside), it emits triangles using a 256-case lookup table, producing a watertight mesh It projects the voxels onto a plane It removes all voxels above a threshold

Chapter 8: Showcase — TSDF Fusion

Watch how multiple depth observations are fused into a signed distance field. Each scan updates the volume. The zero-crossing converges to the true surface.

Volumetric TSDF Fusion (2D Cross-Section)

A 2D slice of a volume. Blue = positive (outside), red = negative (inside). The white zero-crossing is the surface. Add scans to see it converge.

Noise level 1

What to notice: With one scan, the field is noisy and the zero-crossing is rough. As more scans are added, averaging smooths the noise and the surface sharpens. Increase noise to see how volumetric fusion gracefully degrades — even noisy scans contribute useful signal when averaged.

Chapter 9: Connections

Concept	Used In
Shape from shading / photometric stereo	High-detail scanning, industrial inspection, combining with stereo (Ch 12)
ICP registration	SLAM (Ch 11), scan alignment, autonomous driving (LiDAR odometry)
TSDF fusion	KinectFusion, real-time 3D reconstruction, robotics mapping
Marching Cubes	Medical imaging (CT/MRI surfaces), volumetric rendering, neural implicits
Neural implicit surfaces	NeRF (Ch 14), DeepSDF, Occupancy Networks, generative 3D models
Point cloud processing	LiDAR perception, autonomous driving, PointNet/PointNet++
Surface representations	Game assets, VFX, 3D printing, CAD/CAM

Szeliski's perspective: "3D reconstruction sits at the crossroads of computer vision and computer graphics. Vision provides the observations; graphics provides the representations. Recent advances in neural implicit surfaces have blurred this boundary entirely — the same network that encodes the geometry also renders the view."

What recent approach represents 3D surfaces as neural networks that predict signed distance for any query point?

Neural implicit surfaces (e.g., DeepSDF), which learn a continuous signed distance field parameterized by a neural network — compact, resolution-independent, and directly usable for rendering Triangle meshes Voxel grids