Hartley & Zisserman, Chapter 18

N-View Computational Methods

Bundle adjustment, the factorization algorithm, projective factorization, non-rigid factorization, and reconstruction from image sequences.

Prerequisites: Chapter 10 (Reconstruction) + Chapter 12 (Triangulation).
10
Chapters
4+
Simulations

Chapter 0: Why N Views?

So far we have handled two and three views. In practice, scenes are captured from many viewpoints — a video sequence may produce hundreds of frames. How do we jointly estimate all camera parameters and 3D structure from many views simultaneously?

The answer is bundle adjustment: a nonlinear optimization that minimizes the total reprojection error across all views and all points simultaneously. It is the gold standard for multi-view reconstruction. Everything else (pairwise F estimation, triangulation, trifocal tensors) provides initialization for bundle adjustment.

This chapter also introduces factorization methods, which exploit the algebraic structure of the multi-view problem for efficient initialization, including for deforming (non-rigid) scenes.

What is the gold standard method for multi-view reconstruction?

Chapter 1: Bundle Adjustment

Bundle adjustment minimizes the total reprojection error over all cameras Pi and all 3D points Xj:

min{Pi}, {Xj} Σi,j d(xij, Pi Xj)2

This is a large sparse nonlinear least-squares problem, solved via Levenberg-Marquardt. The key insight is that the normal equations matrix has a special sparse structure (only cameras and points that "see" each other are coupled), enabling efficient solution.

The Schur complement trick: The normal equations have block structure: camera-camera, camera-point, and point-point blocks. By eliminating the point variables via the Schur complement, the problem reduces to solving a much smaller system involving only the camera parameters. This makes bundle adjustment practical even for thousands of cameras and millions of points.
ParameterTypical count
Cameras (6 DOF each for calibrated, 11 for uncalibrated)10 to 10,000
3D points (3 DOF each)1,000 to 10,000,000
Observations (2 DOF each)10,000 to 100,000,000
What special property of the bundle adjustment normal equations enables efficient solution?

Chapter 2: Affine Factorization

For affine cameras, the projection equations xij = Mi Xj + ti are bilinear. After centring the image points (subtracting the mean), the measurement matrix W factorizes as:

W = M̂ · Ŝ

where W is 2m × n (m views, n points), M̂ is 2m × 3 (stacked camera matrices), and Ŝ is 3 × n (3D points). Since W has rank at most 3, the SVD gives the factorization.

The Tomasi-Kanade factorization:
(1) Form the centred 2m×n measurement matrix W.
(2) Compute SVD: W = UDVT.
(3) Truncate to rank 3: M̂ = U3D3, Ŝ = V3T.
(4) Apply a 3×3 metric correction A to get M̂A and A−1Ŝ.
This gives an affine reconstruction in closed form, without any iterative optimization.
What rank does the centred measurement matrix W have for rigid scenes viewed by affine cameras?

Chapter 3: The Measurement Matrix

The measurement matrix organizes all observations into a single matrix:

W = [x11 x12 ... x1n ; x21 ... x2n ; ... ; xm1 ... xmn]

Each column is a point's trajectory across all views. Each row-pair is one camera's observations. The low-rank structure (rank 3 for affine, rank 4 for projective) is the key to factorization.

Trajectories and subspaces: Each point's trajectory (its image position across m views) is a 2m-vector. For a rigid scene, all trajectories lie in a 3-dimensional subspace (or 4-dimensional for projective cameras). This low-dimensional structure is what makes factorization work.

Missing data: When some points are not visible in all views, the measurement matrix has missing entries. Standard SVD does not apply directly. Iterative methods (alternation between estimating cameras and points) can handle missing data, though global convergence is no longer guaranteed.

For a rigid scene, point trajectories across views lie in a subspace of what dimension?

Chapter 4: Non-Rigid Factorization

For a deforming object, the shape at each time step is a linear combination of l basis shapes:

shapei = Σk αik Bk

The measurement matrix now has rank 3l instead of 3. The SVD still provides a factorization, but recovering the individual basis shapes and camera motions is more involved.

Application: Face tracking. A face changes expression while the camera moves. The shape can be decomposed into a mean face plus a few deformation modes (smile, frown, surprise). With l = 2 basis shapes, trajectories live in a 6-dimensional subspace. Given positions in 3 views, the position in all other views can be predicted — even for a deforming face.

Independently moving objects also create a higher-rank measurement matrix. Two independent rigid objects contribute rank 3 each, for a total rank of 6. Segmenting the matrix into rank-3 blocks identifies the separate objects.

If a deforming shape is a combination of l = 2 basis shapes, what rank does the measurement matrix have?

Chapter 5: Projective Factorization

For projective (perspective) cameras, the projection xij = Pi Xj is not bilinear because of the homogeneous division. But if we know the projective depths λij such that λij xij = Pi Xj, then the weighted measurement matrix has rank 4.

The chicken-and-egg problem: To factorize, we need the depths λij. To compute the depths, we need the reconstruction. The solution: iterate. Start with λij = 1, factorize, reproject to estimate new depths, repeat. Convergence is not guaranteed but works well in practice after normalizing rows and columns of the weighted measurement matrix.
StepAction
1Normalize image coordinates
2Initialize depths λij (e.g., = 1 or from a preliminary reconstruction)
3Normalize depths (rows and columns to unit norm)
4Form weighted matrix, truncate SVD to rank 4, extract Pi and Xj
5Reproject to update depths, iterate from step 3
What rank does the correctly-weighted measurement matrix have for projective cameras?

Chapter 6: Reconstruction Using Planes

If a plane is visible in all views (providing homographies between each pair), the reconstruction problem simplifies dramatically. The plane-induced homographies determine the 3×3 submatrices Mi of the camera matrices Pi = [Mi | ti]. Only the translation columns ti remain unknown.

Linear solution with planes: Each off-plane point correspondence across two views gives a linear equation in the unknown translation parameters. With enough off-plane correspondences, all translations (and hence all cameras) are determined linearly. No iterative optimization needed for initialization.
If plane-induced homographies between all view pairs are known, what part of the camera matrices remains to be estimated?

Chapter 7: Reconstruction from Sequences

Video sequences add temporal structure: frames are ordered, and the camera moves smoothly. The reconstruction pipeline for sequences:

StepAction
1Establish initial reconstruction from a pair of well-separated keyframes
2Incrementally add new cameras by resectioning from known 3D points
3Triangulate new points visible in the newly added camera
4Periodically run bundle adjustment to refine everything
5Loop closure: detect when the camera returns to a previously seen location
Drift and loop closure: Incremental reconstruction accumulates error. When the camera revisits a location, the "loop closure" correction redistributes the accumulated error across the entire sequence. Bundle adjustment is essential for this correction.
What is the purpose of loop closure in sequential reconstruction?

Chapter 8: Sparse Methods

The Jacobian matrix in bundle adjustment is extremely sparse: each observation (xij) depends on only one camera (Pi) and one point (Xj). This sparsity must be exploited for efficiency.

The reduced camera system: After eliminating point variables via the Schur complement, the remaining system has size proportional to the number of cameras (not points). For 100 cameras and 10,000 points, the full system is 30,200 × 30,200, but the reduced system is only 600 × 600. This makes bundle adjustment practical for large-scale reconstruction.

Modern implementations (Ceres, g2o, GTSAM) use these sparse methods. Structure-from-motion systems like COLMAP and VisualSFM routinely process thousands of images using sparse bundle adjustment.

After applying the Schur complement in bundle adjustment, the reduced system size is proportional to:

Chapter 9: Connections

LinkConnection
All prior chapters → Ch 18Bundle adjustment is the final refinement step; all earlier methods provide initialization
Ch 18 → Ch 19Self-calibration can be integrated into bundle adjustment as additional constraints
Ch 18 → Modern SfMCOLMAP, OpenSfM, VisualSFM all implement the sequential pipeline described here
"Bundle adjustment is the gold standard of multi-view reconstruction."
— Hartley & Zisserman, Chapter 18
What does the Tomasi-Kanade factorization algorithm assume about the camera model?
← Chapter 15 Chapter 19: Auto-Calibration →