Bundle adjustment, the factorization algorithm, projective factorization, non-rigid factorization, and reconstruction from image sequences.
So far we have handled two and three views. In practice, scenes are captured from many viewpoints — a video sequence may produce hundreds of frames. How do we jointly estimate all camera parameters and 3D structure from many views simultaneously?
This chapter also introduces factorization methods, which exploit the algebraic structure of the multi-view problem for efficient initialization, including for deforming (non-rigid) scenes.
Bundle adjustment minimizes the total reprojection error over all cameras Pi and all 3D points Xj:
This is a large sparse nonlinear least-squares problem, solved via Levenberg-Marquardt. The key insight is that the normal equations matrix has a special sparse structure (only cameras and points that "see" each other are coupled), enabling efficient solution.
| Parameter | Typical count |
|---|---|
| Cameras (6 DOF each for calibrated, 11 for uncalibrated) | 10 to 10,000 |
| 3D points (3 DOF each) | 1,000 to 10,000,000 |
| Observations (2 DOF each) | 10,000 to 100,000,000 |
For affine cameras, the projection equations xij = Mi Xj + ti are bilinear. After centring the image points (subtracting the mean), the measurement matrix W factorizes as:
where W is 2m × n (m views, n points), M̂ is 2m × 3 (stacked camera matrices), and Ŝ is 3 × n (3D points). Since W has rank at most 3, the SVD gives the factorization.
The measurement matrix organizes all observations into a single matrix:
Each column is a point's trajectory across all views. Each row-pair is one camera's observations. The low-rank structure (rank 3 for affine, rank 4 for projective) is the key to factorization.
Missing data: When some points are not visible in all views, the measurement matrix has missing entries. Standard SVD does not apply directly. Iterative methods (alternation between estimating cameras and points) can handle missing data, though global convergence is no longer guaranteed.
For a deforming object, the shape at each time step is a linear combination of l basis shapes:
The measurement matrix now has rank 3l instead of 3. The SVD still provides a factorization, but recovering the individual basis shapes and camera motions is more involved.
Independently moving objects also create a higher-rank measurement matrix. Two independent rigid objects contribute rank 3 each, for a total rank of 6. Segmenting the matrix into rank-3 blocks identifies the separate objects.
For projective (perspective) cameras, the projection xij = Pi Xj is not bilinear because of the homogeneous division. But if we know the projective depths λij such that λij xij = Pi Xj, then the weighted measurement matrix has rank 4.
| Step | Action |
|---|---|
| 1 | Normalize image coordinates |
| 2 | Initialize depths λij (e.g., = 1 or from a preliminary reconstruction) |
| 3 | Normalize depths (rows and columns to unit norm) |
| 4 | Form weighted matrix, truncate SVD to rank 4, extract Pi and Xj |
| 5 | Reproject to update depths, iterate from step 3 |
If a plane is visible in all views (providing homographies between each pair), the reconstruction problem simplifies dramatically. The plane-induced homographies determine the 3×3 submatrices Mi of the camera matrices Pi = [Mi | ti]. Only the translation columns ti remain unknown.
Video sequences add temporal structure: frames are ordered, and the camera moves smoothly. The reconstruction pipeline for sequences:
| Step | Action |
|---|---|
| 1 | Establish initial reconstruction from a pair of well-separated keyframes |
| 2 | Incrementally add new cameras by resectioning from known 3D points |
| 3 | Triangulate new points visible in the newly added camera |
| 4 | Periodically run bundle adjustment to refine everything |
| 5 | Loop closure: detect when the camera returns to a previously seen location |
The Jacobian matrix in bundle adjustment is extremely sparse: each observation (xij) depends on only one camera (Pi) and one point (Xj). This sparsity must be exploited for efficiency.
Modern implementations (Ceres, g2o, GTSAM) use these sparse methods. Structure-from-motion systems like COLMAP and VisualSFM routinely process thousands of images using sparse bundle adjustment.
| Link | Connection |
|---|---|
| All prior chapters → Ch 18 | Bundle adjustment is the final refinement step; all earlier methods provide initialization |
| Ch 18 → Ch 19 | Self-calibration can be integrated into bundle adjustment as additional constraints |
| Ch 18 → Modern SfM | COLMAP, OpenSfM, VisualSFM all implement the sequential pipeline described here |