Hartley & Zisserman, Chapter 7

Computation of the Camera Matrix P

Estimating the camera projection matrix from 3D↔2D correspondences. The DLT algorithm, geometric error minimization, restricted cameras, and radial distortion correction.

Prerequisites: Chapter 4 (Estimation) + Chapter 6 (Camera Models).
10
Chapters
5+
Simulations

Chapter 0: Why Compute P?

In the previous chapter we defined the camera matrix P and saw how it encodes the projection from 3D to 2D. Now the question becomes practical: given a set of known 3D points and their observed image locations, how do we estimate the camera matrix?

This is the problem of camera resectioning — recovering the camera's position, orientation, and internal parameters from world-to-image correspondences Xi ↔ xi.

The central equation: xi = P Xi. Given enough correspondences between 3D points Xi and image points xi, we can solve for the 11 unknowns of P. Each correspondence gives 2 equations, so at minimum 5.5 (i.e. 6) point correspondences are needed.
Camera Resectioning

Known 3D points (cubes) project through the camera to image points (circles). Drag the slider to add noise and see how the estimated P degrades.

Noise0
What is the minimum number of 3D↔2D point correspondences needed to estimate P?

Chapter 1: Basic Equations

Each correspondence Xi ↔ xi gives us xi = P Xi, which in homogeneous coordinates means xi × (P Xi) = 0. Expanding this cross product yields:

[0T −wiXiT yiXiT ; wiXiT 0T −xiXiT] · p = 0

where p is the 12-vector of entries of P, and xi = (xi, yi, wi)T. Each point gives 2 independent equations. Stacking n correspondences gives a 2n × 12 matrix A, and we solve Ap = 0.

Degrees of freedom: P has 12 entries but is defined only up to scale, so 11 DOF. Each point contributes 2 equations. We need at least 11 equations ⇒ 6 points (using only 1 equation from the 6th).

For the minimal case (exactly 5.5 points), A is 11 × 12 with rank 11, and p is its 1D null space. With more points and noise, we minimize ||Ap|| subject to ||p|| = 1 via SVD.

How many independent equations does each 3D↔2D point correspondence contribute?

Chapter 2: The DLT Algorithm

The Direct Linear Transformation (DLT) algorithm is the simplest method for estimating P. It is the same algorithm we used for 2D homographies in Chapter 4, extended to the 3 × 4 camera matrix.

StepAction
1For each correspondence Xi ↔ xi, form two rows of the matrix A using the equations from Chapter 1
2Stack all rows to get the 2n × 12 matrix A
3Compute SVD: A = UDVT
4P is the last column of V (the right singular vector for the smallest singular value)
5Reshape the 12-vector into a 3 × 4 matrix
What is being minimized? The DLT minimizes algebraic error ||Ap||, not geometric error. Algebraic error has no direct geometric meaning — it depends on the coordinate system. This is why normalization (next chapter) is critical.

The residual ||Ap|| is called the algebraic error. It is fast to compute but has a subtle bias: it weights distant points differently from nearby ones, and the result depends on the coordinate system used.

What does the DLT algorithm minimize?

Chapter 3: Data Normalization

Raw coordinates (e.g., pixels in [0, 1024] and world points in metres) create terrible conditioning for the DLT. The matrix A has wildly different magnitudes in its entries. The fix is normalization.

DataNormalization
Image points xiTranslate centroid to origin, scale so RMS distance from origin is √2
3D points XiTranslate centroid to origin, scale so RMS distance from origin is √3
Normalization is essential, not optional. Without it, the DLT can give wildly wrong results. With it, the DLT often performs nearly as well as iterative methods. This is the single most important practical lesson in the whole estimation framework.

After computing P from normalized data, denormalize: P = T−1 P̃ U, where T normalizes image points and U normalizes 3D points.

For scenes with large depth variation (e.g., points near the camera and points at infinity as vanishing points), the standard centroid+scaling normalization may not work well. More sophisticated normalization strategies exist for such cases.

What is the RMS distance from the origin that 2D image points should be scaled to?

Chapter 4: Geometric Error

Algebraic error (what the DLT minimizes) does not correspond directly to anything meaningful in image space. Geometric error is the quantity we really care about: the distance in pixels between the measured point xi and the projected point PXi.

minP Σi d(xi, P Xi)2

This is the reprojection error. If measurement errors are Gaussian, minimizing this sum of squared geometric distances gives the Maximum Likelihood Estimate (MLE) of P.

Geometric interpretation of algebraic error: When points are normalized so xi = (x, y, 1)T, the algebraic error equals wi · d(xi, x̂i), where wi is proportional to the depth of Xi. So the DLT implicitly weights each point by its depth — a subtle bias towards minimizing focal length.

When both image and 3D point locations have errors, we minimize a combined cost that includes estimated "true" 3D positions as additional unknowns, weighted by their respective covariances (Mahalanobis distance).

If image measurement errors are Gaussian, what does minimizing geometric reprojection error give?

Chapter 5: The Gold Standard Algorithm

The Gold Standard algorithm combines the DLT (for initialization) with iterative minimization of geometric error (for refinement). It is the recommended approach for camera resectioning.

StepAction
1Normalize image points (T) and 3D points (U)
2Run DLT on normalized data to get initial P̃
3Refine P̃ by minimizing Σ d(x̃i, P̃ X̃i)2 using Levenberg-Marquardt
4Denormalize: P = T−1 P̃ U
When is the Gold Standard worth it? For clean data (e.g., a well-machined calibration grid), the DLT alone often suffices — the iterative refinement improves accuracy by less than 1/1000 of a pixel. For noisier data, the Gold Standard can make a significant difference.

The iterative step uses P's 12 entries as the parameter vector. Levenberg-Marquardt optimizes a function from IR12 → IR2n (or just IR9 → IR2n for a restricted camera).

What optimization algorithm is used in the iterative step of the Gold Standard?

Chapter 6: Restricted Camera Estimation

Often we know something about the camera's internals — for example, that pixels are square (s = 0, αx = αy) or that the principal point is at the image centre. These constraints reduce the number of unknowns.

ConstraintReduces DOF by
Skew s = 01
Square pixels: αx = αy1
Known principal point (x0, y0)2
Known full K5 (only 6 DOF remain: R and t)
Two strategies for enforcing constraints:
(1) Minimize geometric error over the reduced parameter set (e.g., 9 params for a pinhole camera with s = 0, αx = αy).
(2) Minimize algebraic error using a reduced measurement matrix  (a 12 × 12 matrix such that ||Ap|| = ||Âp|| for all p). This reduces to an IR9 → IR12 minimization, independent of the number of points.

For affine cameras (last row of P is (0, 0, 0, 1)), the estimation is even simpler: algebraic and geometric errors are equal, so a linear algorithm gives the MLE directly with only 4 correspondences needed.

For an affine camera, how many point correspondences are needed at minimum?

Chapter 7: Radial Distortion

Real lenses are not perfect pinholes. The most visible deviation is radial distortion: straight lines in the world appear curved in the image, especially near the edges. This violates the linear model x = PX.

The standard correction model is:

x̂ = xc + L(r)(x − xc)

where xc is the distortion centre (approximately the principal point), r = ||x − xc||, and L(r) = 1 + κ1r2 + κ2r4 + ...

Barrel vs. pincushion: If κ1 < 0, you get barrel distortion (lines bow outward — common in wide-angle lenses). If κ1 > 0, you get pincushion distortion (lines bow inward — common in telephoto lenses). Usually only κ1 is needed; κ2 helps with extreme wide-angle lenses.

Distortion correction can be estimated simultaneously with P, or applied as a pre-processing step. The key diagnostic: if straight world lines appear curved in the image, radial distortion needs correction before any geometry computation.

What type of distortion makes straight lines bow outward near the image edges?

Chapter 8: Degenerate Configurations

Some arrangements of 3D points and cameras make P impossible to determine uniquely, even with unlimited data. These are critical configurations.

ConfigurationProblem
Camera + all points on a twisted cubicCamera can slide along the cubic without changing images
Points on a plane + a line through the camera centreCamera can move along the line freely
Near-planar scenes (aerial nadir view)Close to the planar degeneracy; P is poorly conditioned
Practical warning: Near-degenerate configurations are more common than true degeneracies. A near-nadir aerial view of flat terrain, or a scene where most points lie on a plane, will give a poorly determined P even with many points and low noise. Use objects with substantial 3D extent for reliable calibration.

Line correspondences can supplement point correspondences. Each 3D↔2D line correspondence contributes 2 linear equations in the entries of P, and can help resolve ambiguities when points alone are insufficient.

Which of the following is a critical configuration for camera resectioning?

Chapter 9: Connections

Camera resectioning is the bridge between knowing a camera model (Chapter 6) and using cameras for reconstruction (Chapters 9–13). The methods here extend directly to multi-view problems.

LinkHow P computation connects
Ch 4 → Ch 7DLT for homographies generalizes directly to DLT for camera matrices
Ch 7 → Ch 8The computed P reveals vanishing points, calibration K, and single-view geometry
Ch 7 → Ch 9Two computed P's define the fundamental matrix F = [e']× P' P+
Ch 7 → Ch 12Known P's enable triangulation: recover 3D structure from 2D correspondences
Ch 7 → Ch 18Bundle adjustment refines all P's and X's simultaneously across N views
"There are few circumstances under which [gold standard estimation] should not be used in preference to a DLT method."
— Hartley & Zisserman, Chapter 7
What advantage does normalization give the DLT algorithm?
← Chapter 6 Chapter 8: Single View Geometry →