Estimating the camera projection matrix from 3D↔2D correspondences. The DLT algorithm, geometric error minimization, restricted cameras, and radial distortion correction.
In the previous chapter we defined the camera matrix P and saw how it encodes the projection from 3D to 2D. Now the question becomes practical: given a set of known 3D points and their observed image locations, how do we estimate the camera matrix?
This is the problem of camera resectioning — recovering the camera's position, orientation, and internal parameters from world-to-image correspondences Xi ↔ xi.
Known 3D points (cubes) project through the camera to image points (circles). Drag the slider to add noise and see how the estimated P degrades.
Each correspondence Xi ↔ xi gives us xi = P Xi, which in homogeneous coordinates means xi × (P Xi) = 0. Expanding this cross product yields:
where p is the 12-vector of entries of P, and xi = (xi, yi, wi)T. Each point gives 2 independent equations. Stacking n correspondences gives a 2n × 12 matrix A, and we solve Ap = 0.
For the minimal case (exactly 5.5 points), A is 11 × 12 with rank 11, and p is its 1D null space. With more points and noise, we minimize ||Ap|| subject to ||p|| = 1 via SVD.
The Direct Linear Transformation (DLT) algorithm is the simplest method for estimating P. It is the same algorithm we used for 2D homographies in Chapter 4, extended to the 3 × 4 camera matrix.
| Step | Action |
|---|---|
| 1 | For each correspondence Xi ↔ xi, form two rows of the matrix A using the equations from Chapter 1 |
| 2 | Stack all rows to get the 2n × 12 matrix A |
| 3 | Compute SVD: A = UDVT |
| 4 | P is the last column of V (the right singular vector for the smallest singular value) |
| 5 | Reshape the 12-vector into a 3 × 4 matrix |
The residual ||Ap|| is called the algebraic error. It is fast to compute but has a subtle bias: it weights distant points differently from nearby ones, and the result depends on the coordinate system used.
Raw coordinates (e.g., pixels in [0, 1024] and world points in metres) create terrible conditioning for the DLT. The matrix A has wildly different magnitudes in its entries. The fix is normalization.
| Data | Normalization |
|---|---|
| Image points xi | Translate centroid to origin, scale so RMS distance from origin is √2 |
| 3D points Xi | Translate centroid to origin, scale so RMS distance from origin is √3 |
After computing P from normalized data, denormalize: P = T−1 P̃ U, where T normalizes image points and U normalizes 3D points.
For scenes with large depth variation (e.g., points near the camera and points at infinity as vanishing points), the standard centroid+scaling normalization may not work well. More sophisticated normalization strategies exist for such cases.
Algebraic error (what the DLT minimizes) does not correspond directly to anything meaningful in image space. Geometric error is the quantity we really care about: the distance in pixels between the measured point xi and the projected point PXi.
This is the reprojection error. If measurement errors are Gaussian, minimizing this sum of squared geometric distances gives the Maximum Likelihood Estimate (MLE) of P.
When both image and 3D point locations have errors, we minimize a combined cost that includes estimated "true" 3D positions as additional unknowns, weighted by their respective covariances (Mahalanobis distance).
The Gold Standard algorithm combines the DLT (for initialization) with iterative minimization of geometric error (for refinement). It is the recommended approach for camera resectioning.
| Step | Action |
|---|---|
| 1 | Normalize image points (T) and 3D points (U) |
| 2 | Run DLT on normalized data to get initial P̃ |
| 3 | Refine P̃ by minimizing Σ d(x̃i, P̃ X̃i)2 using Levenberg-Marquardt |
| 4 | Denormalize: P = T−1 P̃ U |
The iterative step uses P's 12 entries as the parameter vector. Levenberg-Marquardt optimizes a function from IR12 → IR2n (or just IR9 → IR2n for a restricted camera).
Often we know something about the camera's internals — for example, that pixels are square (s = 0, αx = αy) or that the principal point is at the image centre. These constraints reduce the number of unknowns.
| Constraint | Reduces DOF by |
|---|---|
| Skew s = 0 | 1 |
| Square pixels: αx = αy | 1 |
| Known principal point (x0, y0) | 2 |
| Known full K | 5 (only 6 DOF remain: R and t) |
For affine cameras (last row of P is (0, 0, 0, 1)), the estimation is even simpler: algebraic and geometric errors are equal, so a linear algorithm gives the MLE directly with only 4 correspondences needed.
Real lenses are not perfect pinholes. The most visible deviation is radial distortion: straight lines in the world appear curved in the image, especially near the edges. This violates the linear model x = PX.
The standard correction model is:
where xc is the distortion centre (approximately the principal point), r = ||x − xc||, and L(r) = 1 + κ1r2 + κ2r4 + ...
Distortion correction can be estimated simultaneously with P, or applied as a pre-processing step. The key diagnostic: if straight world lines appear curved in the image, radial distortion needs correction before any geometry computation.
Some arrangements of 3D points and cameras make P impossible to determine uniquely, even with unlimited data. These are critical configurations.
| Configuration | Problem |
|---|---|
| Camera + all points on a twisted cubic | Camera can slide along the cubic without changing images |
| Points on a plane + a line through the camera centre | Camera can move along the line freely |
| Near-planar scenes (aerial nadir view) | Close to the planar degeneracy; P is poorly conditioned |
Line correspondences can supplement point correspondences. Each 3D↔2D line correspondence contributes 2 linear equations in the entries of P, and can help resolve ambiguities when points alone are insufficient.
Camera resectioning is the bridge between knowing a camera model (Chapter 6) and using cameras for reconstruction (Chapters 9–13). The methods here extend directly to multi-view problems.
| Link | How P computation connects |
|---|---|
| Ch 4 → Ch 7 | DLT for homographies generalizes directly to DLT for camera matrices |
| Ch 7 → Ch 8 | The computed P reveals vanishing points, calibration K, and single-view geometry |
| Ch 7 → Ch 9 | Two computed P's define the fundamental matrix F = [e']× P' P+ |
| Ch 7 → Ch 12 | Known P's enable triangulation: recover 3D structure from 2D correspondences |
| Ch 7 → Ch 18 | Bundle adjustment refines all P's and X's simultaneously across N views |