Hartley & Zisserman, Chapter 18

N-View Computational Methods

Bundle adjustment, the factorization algorithm, projective factorization, non-rigid factorization, and reconstruction from image sequences.

Prerequisites: Chapter 10 (Reconstruction) + Chapter 12 (Triangulation).

Chapter 0: Why N Views?

So far we have handled two and three views. In practice, scenes are captured from many viewpoints, and a video sequence may produce hundreds of frames. How do we estimate all camera parameters and the 3D structure jointly from that many views?

The answer is bundle adjustment: a nonlinear optimization that minimizes the total reprojection error across all views and all points simultaneously. It is the gold standard for multi-view reconstruction. Everything else (pairwise F estimation, triangulation, trifocal tensors) provides initialization for bundle adjustment.

This chapter also introduces factorization methods, which exploit the algebraic structure of the multi-view problem for efficient initialization, including for deforming (non-rigid) scenes.

What is the gold standard method for multi-view reconstruction?

Bundle adjustment (joint nonlinear optimization of all cameras and points)
The 8-point algorithm applied to all pairs
The trifocal tensor applied to all triples

Chapter 1: Bundle Adjustment

Bundle adjustment minimizes the total reprojection error over all cameras P_i and all 3D points X_j, summed over every pair (i, j) for which point j is observed in view i:

min_{{P_i}, {X_j}} Σ_i,j d(x_ij, P_i X_j)²

This is a large sparse nonlinear least-squares problem, solved via Levenberg-Marquardt. The key insight is that the normal equations matrix has a special sparse structure (only cameras and points that "see" each other are coupled), enabling efficient solution.
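As a minimal sketch of the cost being minimized, the snippet below evaluates the sum of squared reprojection distances for a toy problem. The function name `total_reprojection_error`, the dictionary layout of the observations, and the two example cameras are illustrative assumptions, not part of the original text.

```python
import numpy as np

def total_reprojection_error(cameras, points, observations):
    """Sum of squared image distances d(x_ij, P_i X_j)^2 over observed pairs.

    cameras:      list of 3x4 projection matrices P_i
    points:       (n, 3) array of inhomogeneous 3D points X_j
    observations: dict mapping (i, j) -> observed 2D point x_ij
    """
    total = 0.0
    for (i, j), x_obs in observations.items():
        X_h = np.append(points[j], 1.0)     # homogeneous 3D point
        x_h = cameras[i] @ X_h              # homogeneous image point
        x_proj = x_h[:2] / x_h[2]           # homogeneous division
        total += float(np.sum((x_proj - np.asarray(x_obs)) ** 2))
    return total

def project(P, X):
    """Project one inhomogeneous 3D point with a 3x4 camera."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy data (assumed for illustration): two cameras, two points, no noise.
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
cameras = [P0, P1]
points = np.array([[0.0, 0.0, 4.0], [1.0, 1.0, 5.0]])
observations = {(i, j): project(cameras[i], points[j])
                for i in range(2) for j in range(2)}
exact_error = total_reprojection_error(cameras, points, observations)

# Shift one observation by 0.1 along u: the cost rises by 0.1^2 = 0.01.
observations[(0, 0)] = observations[(0, 0)] + np.array([0.1, 0.0])
perturbed_error = total_reprojection_error(cameras, points, observations)
```

A bundle adjuster would hand the individual residuals, rather than their sum, to a Levenberg-Marquardt solver; the scalar cost is shown here only to make the objective concrete.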

The Schur complement trick: The normal equations have block structure: camera-camera, camera-point, and point-point blocks. By eliminating the point variables via the Schur complement, the problem reduces to solving a much smaller system involving only the camera parameters. This makes bundle adjustment practical even for thousands of cameras and millions of points.
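The elimination can be sketched on a small synthetic system. The example below assumes 3 cameras with 6 parameters each, 20 points, a random Jacobian with the bundle adjustment sparsity pattern, and a small damping term; all sizes and names are illustrative. It checks that solving the reduced camera system and back-substituting reproduces the solution of the full normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cams, n_pts = 3, 20            # assumed toy sizes
cam_dim, pt_dim = 6, 3
nc, n_p = n_cams * cam_dim, n_pts * pt_dim

# Jacobian with the BA pattern: each 2-row observation block touches
# exactly one camera block and one point block.
J = np.zeros((2 * n_cams * n_pts, nc + n_p))
row = 0
for i in range(n_cams):
    for j in range(n_pts):
        J[row:row + 2, i * cam_dim:(i + 1) * cam_dim] = rng.standard_normal((2, cam_dim))
        J[row:row + 2, nc + j * pt_dim:nc + (j + 1) * pt_dim] = rng.standard_normal((2, pt_dim))
        row += 2
r = rng.standard_normal(J.shape[0])          # residual vector

# Damped normal equations  [A B; B^T C] [dc; dp] = [g_c; g_p].
H = J.T @ J + 1e-3 * np.eye(nc + n_p)
g = -J.T @ r
A, B, C = H[:nc, :nc], H[:nc, nc:], H[nc:, nc:]
g_c, g_p = g[:nc], g[nc:]

# C is block diagonal (one 3x3 block per point), so it is inverted blockwise.
C_inv = np.zeros_like(C)
for j in range(n_pts):
    s = slice(j * pt_dim, (j + 1) * pt_dim)
    C_inv[s, s] = np.linalg.inv(C[s, s])

# Reduced camera system via the Schur complement of C, then back-substitution.
S = A - B @ C_inv @ B.T
dc = np.linalg.solve(S, g_c - B @ C_inv @ g_p)
dp = C_inv @ (g_p - B.T @ dc)

full_solution = np.linalg.solve(H, g)
schur_solution = np.concatenate([dc, dp])
```

Dense arrays are used here for readability only; production solvers store the blocks sparsely and never form the full matrix.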

Parameter	Typical count
Cameras (6 DOF each for calibrated, 11 for uncalibrated)	10 to 10,000
3D points (3 DOF each)	1,000 to 10,000,000
Observations (2 DOF each)	10,000 to 100,000,000

What special property of the bundle adjustment normal equations enables efficient solution?

The matrix has sparse block structure, enabling the Schur complement to eliminate point variables
The matrix is diagonal
The matrix has rank 1

Chapter 2: Affine Factorization

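For affine cameras, the projection equations x_ij = M_i X_j + t_i are bilinear in the camera matrices M_i and the points X_j. After centring the image points (subtracting, in each view, the centroid of that view's points), the translations t_i drop out and the measurement matrix W factorizes as:

W = M̂ · Ŝ

where W is 2m × n (m views, n points), M̂ is 2m × 3 (stacked camera matrices), and Ŝ is 3 × n (3D points). Since W has rank at most 3, the SVD gives the factorization; with noisy data, truncating the SVD to rank 3 gives the closest rank-3 matrix in the Frobenius norm.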

The Tomasi-Kanade factorization:
(1) Form the centred 2m×n measurement matrix W.
(2) Compute SVD: W = UDV^T.
(3) Truncate to rank 3: M̂ = U₃D₃, Ŝ = V₃^T.
(4) Apply a 3×3 metric correction A to get M̂A and A⁻¹Ŝ.
Steps (1) to (3) give an affine reconstruction in closed form, without any iterative optimization; step (4) upgrades it to a metric one.
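Steps (1) to (3) can be sketched in a few lines of NumPy. The function name `affine_factorization` and the synthetic data are assumptions made for illustration, and the metric correction of step (4) is deliberately left out.

```python
import numpy as np

def affine_factorization(W):
    """Rank-3 factorization of a 2m x n measurement matrix (steps 1 to 3).

    Returns M_hat (2m x 3), S_hat (3 x n) and the row centroids t (2m x 1),
    so that W is approximated by M_hat @ S_hat + t.
    """
    t = W.mean(axis=1, keepdims=True)        # centroid of each image row
    W_centred = W - t                        # step 1: centre the measurements
    U, d, Vt = np.linalg.svd(W_centred, full_matrices=False)   # step 2
    M_hat = U[:, :3] * d[:3]                 # step 3: M_hat = U_3 D_3
    S_hat = Vt[:3, :]                        #         S_hat = V_3^T
    return M_hat, S_hat, t

# Synthetic check (assumed data): m = 4 affine views of n = 12 rigid points.
rng = np.random.default_rng(0)
m, n = 4, 12
X = rng.standard_normal((3, n))              # 3D points as columns
M = rng.standard_normal((2 * m, 3))          # stacked 2x3 affine cameras
t_true = rng.standard_normal((2 * m, 1))     # stacked image translations
W = M @ X + t_true                           # noise-free measurements

M_hat, S_hat, t = affine_factorization(W)
centred_rank = np.linalg.matrix_rank(W - t, tol=1e-9)
max_residual = np.abs(M_hat @ S_hat + t - W).max()
```

On noise-free data the centred matrix has rank 3 and the residual is at machine precision. The recovered M̂ and Ŝ generally differ from the true cameras and points by an unknown invertible 3×3 matrix, which is the ambiguity the metric correction resolves.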

What rank does the centred measurement matrix W have for rigid scenes viewed by affine cameras?

Rank 3 (or less)
Full rank
Rank 4

Chapter 3: The Measurement Matrix

The measurement matrix organizes all observations into a single matrix:

W = [x₁₁ x₁₂ ... x_1n ; x₂₁ ... x_2n ; ... ; x_m1 ... x_mn]

Each column is a point's trajectory across all views. Each row-pair is one camera's observations. The low-rank structure (rank 3 for affine, rank 4 for projective) is the key to factorization.

Trajectories and subspaces: Each point's trajectory (its image position across m views) is a 2m-vector. For a rigid scene, all trajectories lie in a 3-dimensional subspace (or 4-dimensional for projective cameras). This low-dimensional structure is what makes factorization work.

Missing data: When some points are not visible in all views, the measurement matrix has missing entries. Standard SVD does not apply directly. Iterative methods (alternation between estimating cameras and points) can handle missing data, though global convergence is no longer guaranteed.

For a rigid scene, point trajectories across views lie in a subspace of what dimension?

3 (for affine cameras) or 4 (for projective cameras)
2m (the full dimension)
n (the number of points)

Chapter 4: Non-Rigid Factorization

For a deforming object, the shape at each time step is a linear combination of l basis shapes:

shape_i = Σ_k αⁱ_k B_k

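The measurement matrix now has rank at most 3l instead of 3, because each view's rows are a combination of the l basis shapes, each contributing up to 3 dimensions. The SVD still provides a factorization, but recovering the individual basis shapes and camera motions is more involved.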
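The rank claim can be checked numerically. The sketch below assumes centred affine cameras (translations omitted), random basis shapes and random coefficients; the helper name `nonrigid_measurement_matrix` and all sizes are illustrative.

```python
import numpy as np

def nonrigid_measurement_matrix(cameras, coeffs, basis):
    """Stack the affine images of a deforming shape.

    cameras: (m, 2, 3) affine camera matrices M_i (translation already removed)
    coeffs:  (m, l)    coefficients alpha^i_k of each frame
    basis:   (l, 3, n) basis shapes B_k
    Returns the (2m, n) measurement matrix.
    """
    rows = []
    for M_i, alpha_i in zip(cameras, coeffs):
        shape_i = np.tensordot(alpha_i, basis, axes=1)   # sum_k alpha_k B_k, (3, n)
        rows.append(M_i @ shape_i)
    return np.vstack(rows)

def rank_for(l, m=10, n=30, seed=1):
    """Numerical rank of a random measurement matrix with l basis shapes."""
    rng = np.random.default_rng(seed)
    cameras = rng.standard_normal((m, 2, 3))
    coeffs = rng.standard_normal((m, l))
    basis = rng.standard_normal((l, 3, n))
    W = nonrigid_measurement_matrix(cameras, coeffs, basis)
    return np.linalg.matrix_rank(W, tol=1e-9)

rank_rigid = rank_for(1)        # one basis shape: the rigid case
rank_two_basis = rank_for(2)    # two basis shapes
```

With generic random data the rigid case gives rank 3 and two basis shapes give rank 6, matching the 3l bound.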

Application: Face tracking. A face changes expression while the camera moves. The shape can be decomposed into a mean face plus a few deformation modes (smile, frown, surprise). With l = 2 basis shapes, trajectories live in a 6-dimensional subspace. Three views supply 2 × 3 = 6 image coordinates per point, enough to fix the six subspace coefficients, so given a point's positions in 3 views its position in all other views can be predicted, even for a deforming face.

Independently moving objects also create a higher-rank measurement matrix. Two independent rigid objects contribute rank 3 each, for a total rank of 6. Segmenting the matrix into rank-3 blocks identifies the separate objects.

If a deforming shape is a combination of l = 2 basis shapes, what rank does the measurement matrix have?

6 (= 3 × 2)
3
2

Chapter 5: Projective Factorization

For projective (perspective) cameras, the projection x_ij = P_i X_j is not bilinear because of the homogeneous division. But if we know the projective depths λ_ij such that λ_ij x_ij = P_i X_j, then the weighted measurement matrix has rank 4.

The chicken-and-egg problem: To factorize, we need the depths λ_ij. To compute the depths, we need the reconstruction. The solution: iterate. Start with λ_ij = 1, factorize, reproject to estimate new depths, repeat. Convergence is not guaranteed, but the iteration works well in practice once the rows and columns of the weighted measurement matrix are normalized.

Step	Action
1	Normalize image coordinates
2	Initialize depths λ_ij (e.g., = 1 or from a preliminary reconstruction)
3	Normalize depths (rows and columns to unit norm)
4	Form weighted matrix, truncate SVD to rank 4, extract P_i and X_j
5	Reproject to update depths, iterate from step 3
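The rank-4 property behind step 4 can be illustrated with synthetic data. The sketch below assumes random cameras and points, uses the true depths in place of the iterated estimates, and compares against the λ_ij = 1 initialization; it does not implement the full iteration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 20                                          # assumed toy sizes

# Homogeneous 3D points (4 x n) and random 3x4 cameras in front of the scene.
X = np.vstack([rng.uniform(-1, 1, (3, n)), np.ones((1, n))])
cameras = []
for _ in range(m):
    left = np.eye(3) + 0.2 * rng.standard_normal((3, 3))
    right = np.array([[rng.normal()], [rng.normal()], [5.0]])
    cameras.append(np.hstack([left, right]))

projections = [P @ X for P in cameras]                # each equals lambda_ij * x_ij
depths = np.vstack([p[2] for p in projections])       # m x n true depths lambda_ij
x_h = [p / p[2] for p in projections]                 # image points scaled to (u, v, 1)

# Weighted measurement matrix with the true depths, and with all depths = 1.
W_true = np.vstack([lam * x for lam, x in zip(depths, x_h)])   # 3m x n
W_unit = np.vstack(x_h)

rank_true = np.linalg.matrix_rank(W_true, tol=1e-9)
rank_unit = np.linalg.matrix_rank(W_unit, tol=1e-9)

# Step 4 on the correctly weighted matrix: truncate the SVD to rank 4.
U, d, Vt = np.linalg.svd(W_true, full_matrices=False)
P_hat = U[:, :4] * d[:4]                              # stacked cameras, 3m x 4
X_hat = Vt[:4, :]                                     # points, 4 x n
```

With the true depths the matrix has rank 4 and `P_hat @ X_hat` reproduces it, while unit depths generally leave a higher rank, which is what the iteration tries to repair. The recovered cameras and points are defined only up to a 4×4 projective transformation.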

What rank does the correctly-weighted measurement matrix have for projective cameras?

4
3
11

Chapter 6: Reconstruction Using Planes

If a plane is visible in all views (providing homographies between each pair), the reconstruction problem simplifies dramatically. The plane-induced homographies determine the 3×3 submatrices M_i of the camera matrices P_i = [M_i | t_i]. Only the translation columns t_i remain unknown.

Linear solution with planes: Each off-plane point correspondence across two views gives a linear equation in the unknown translation parameters. With enough off-plane correspondences, all translations (and hence all cameras) are determined linearly. No iterative optimization needed for initialization.
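One way to set up such a linear system is sketched below. It is an assumed formulation, not necessarily the one in the book: with the M_i known, each observation gives the constraint [x_ij]× (M_i X_j + t_i) = 0, which is linear in the translations and the point coordinates together. The sketch fixes t_0 = 0 to remove the choice of world origin and solves for everything up to a common scale; the function names and synthetic data are illustrative.

```python
import numpy as np

def skew(v):
    """3x3 matrix [v]_x such that skew(v) @ w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def solve_translations_and_points(M_list, x):
    """Solve [x_ij]_x (M_i X_j + t_i) = 0 with t_0 fixed to zero.

    M_list: m known 3x3 matrices; x: (m, n, 2) image points.
    Returns (t, X) with t of shape (m-1, 3) and X of shape (n, 3),
    both defined up to one common scale factor.
    """
    m, n = x.shape[:2]
    n_t = 3 * (m - 1)
    A = np.zeros((3 * m * n, n_t + 3 * n))
    row = 0
    for i in range(m):
        for j in range(n):
            Sx = skew(np.array([x[i, j, 0], x[i, j, 1], 1.0]))
            if i > 0:                                   # t_0 = 0 has no column
                A[row:row + 3, 3 * (i - 1):3 * i] = Sx
            A[row:row + 3, n_t + 3 * j:n_t + 3 * j + 3] = Sx @ M_list[i]
            row += 3
    _, _, Vt = np.linalg.svd(A)
    sol = Vt[-1]                                        # null vector of A
    return sol[:n_t].reshape(m - 1, 3), sol[n_t:].reshape(n, 3)

# Synthetic test data (assumed): 3 views, 8 points in front of the cameras.
rng = np.random.default_rng(3)
m, n = 3, 8
M_list = [np.eye(3)] + [np.eye(3) + 0.2 * rng.standard_normal((3, 3))
                        for _ in range(m - 1)]
t_true = np.vstack([np.zeros(3), rng.standard_normal((m - 1, 3))])
X_true = rng.uniform(-1, 1, (n, 3)) + np.array([0.0, 0.0, 5.0])
x = np.zeros((m, n, 2))
for i in range(m):
    p = X_true @ M_list[i].T + t_true[i]                # rows are M_i X_j + t_i
    x[i] = p[:, :2] / p[:, 2:3]

t_est, X_est = solve_translations_and_points(M_list, x)

# Compare with the ground truth after fixing the free scale.
est = np.concatenate([t_est.ravel(), X_est.ravel()])
truth = np.concatenate([t_true[1:].ravel(), X_true.ravel()])
scale = (est @ truth) / (est @ est)
max_error = np.abs(scale * est - truth).max()
```

The single SVD replaces any iterative optimization, which is the point of the plane-based approach; with noisy data the result would serve as an initialization for bundle adjustment.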

If plane-induced homographies between all view pairs are known, what part of the camera matrices remains to be estimated?

Only the translation columns t_i
The full camera matrices
Nothing — the cameras are fully determined

Chapter 7: Reconstruction from Sequences

Video sequences add temporal structure: frames are ordered, and the camera moves smoothly. The reconstruction pipeline for sequences:

Step	Action
1	Establish initial reconstruction from a pair of well-separated keyframes
2	Incrementally add new cameras by resectioning from known 3D points
3	Triangulate new points visible in the newly added camera
4	Periodically run bundle adjustment to refine everything
5	Loop closure: detect when the camera returns to a previously seen location
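Steps 2 and 3 of the pipeline can be sketched with the standard linear (DLT) formulations of resectioning and triangulation. The function names, the synthetic scene, and the use of unnormalized coordinates are assumptions for illustration; a real system would add coordinate normalization, robust estimation, and nonlinear refinement.

```python
import numpy as np

def project(P, X):
    """Project (n, 3) points with a 3x4 camera; returns (n, 2) image points."""
    X_h = np.hstack([X, np.ones((len(X), 1))])
    x_h = X_h @ P.T
    return x_h[:, :2] / x_h[:, 2:3]

def resection_dlt(X, x):
    """Estimate a 3x4 camera from n >= 6 non-coplanar 3D-2D correspondences."""
    rows = []
    for Xj, (u, v) in zip(X, x):
        Xh = np.append(Xj, 1.0)
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(3, 4)              # defined up to scale

def triangulate_dlt(cameras, image_points):
    """Linear triangulation of one point from two or more views."""
    rows = []
    for P, (u, v) in zip(cameras, image_points):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]

# Synthetic scene (assumed): known 3D points, one known camera, one new camera.
rng = np.random.default_rng(4)
X_known = rng.uniform(-1, 1, (10, 3)) + np.array([0.0, 0.0, 5.0])
P_known = np.hstack([np.eye(3), np.zeros((3, 1))])
angle = 0.2                                   # rotation about the y axis, radians
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
P_new_true = np.hstack([R, np.array([[-0.5], [0.1], [0.2]])])

# Step 2: resection the new camera from the known 3D points.
x_new = project(P_new_true, X_known)
P_new = resection_dlt(X_known, x_new)

# Step 3: triangulate a point not yet in the reconstruction.
X_extra_true = np.array([0.3, -0.2, 4.5])
obs = [project(P, X_extra_true[None, :])[0] for P in (P_known, P_new_true)]
X_extra = triangulate_dlt([P_known, P_new], obs)
```

Both solvers minimize an algebraic error, so their output is best treated as a starting point for the periodic bundle adjustment in step 4.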

Drift and loop closure: Incremental reconstruction accumulates error. When the camera revisits a location, the "loop closure" correction redistributes the accumulated error across the entire sequence. Bundle adjustment is essential for this correction.

What is the purpose of loop closure in sequential reconstruction?

To detect revisited locations and redistribute accumulated drift error
To end the reconstruction process
To remove outlier correspondences

Chapter 8: Sparse Methods

The Jacobian matrix in bundle adjustment is extremely sparse: each observation (x_ij) depends on only one camera (P_i) and one point (X_j). This sparsity must be exploited for efficiency.

The reduced camera system: After eliminating point variables via the Schur complement, the remaining system has size proportional to the number of cameras (not points). For 100 calibrated cameras and 10,000 points, the full system is 30,600 × 30,600 (6 × 100 camera parameters plus 3 × 10,000 point parameters), but the reduced system is only 600 × 600. This makes bundle adjustment practical for large-scale reconstruction.
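The sizes follow from counting parameters. A small helper makes the arithmetic explicit, assuming 6 parameters per calibrated camera and 3 per point; the function name is hypothetical.

```python
def ba_system_sizes(n_cameras, n_points, cam_dof=6, point_dof=3):
    """Return (full system size, reduced system size, nonzeros per Jacobian row)."""
    n_full = cam_dof * n_cameras + point_dof * n_points   # all unknowns
    n_reduced = cam_dof * n_cameras                       # cameras only, after Schur
    nonzeros_per_row = cam_dof + point_dof                # one camera, one point
    return n_full, n_reduced, nonzeros_per_row

n_full, n_reduced, nnz_row = ba_system_sizes(100, 10_000)
# n_full == 30600 and n_reduced == 600

# Fraction of nonzero entries in each Jacobian row: 9 out of 30,600.
jacobian_row_density = nnz_row / n_full
```

Each residual row touches only 9 of the 30,600 unknowns, which is the sparsity the reduced camera system exploits.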

Modern implementations (Ceres, g2o, GTSAM) use these sparse methods. Structure-from-motion systems like COLMAP and VisualSFM routinely process thousands of images using sparse bundle adjustment.

After applying the Schur complement in bundle adjustment, the reduced system size is proportional to:

The number of cameras (not points)
The number of 3D points
The total number of observations

Chapter 9: Connections

Link	Connection
All prior chapters → Ch 18	Bundle adjustment is the final refinement step; all earlier methods provide initialization
Ch 18 → Ch 19	Self-calibration can be integrated into bundle adjustment as additional constraints
Ch 18 → Modern SfM	COLMAP, OpenSfM, VisualSFM all implement the sequential pipeline described here

"Bundle adjustment is the gold standard of multi-view reconstruction."

— Hartley & Zisserman, Chapter 18

What does the Tomasi-Kanade factorization algorithm assume about the camera model?

Affine (orthographic or weak perspective) cameras
Full projective cameras
Calibrated pinhole cameras
