Hartley & Zisserman, Chapter 7

Computation of the Camera Matrix P

Estimating the camera projection matrix from 3D↔2D correspondences. The DLT algorithm, geometric error minimization, restricted cameras, and radial distortion correction.

Prerequisites: Chapter 4 (Estimation) + Chapter 6 (Camera Models).

Chapters

Simulations

Chapter 0: Why Compute P?

In the previous chapter we defined the camera matrix P and saw how it encodes the projection from 3D to 2D. Now the question becomes practical: given a set of known 3D points and their observed image locations, how do we estimate the camera matrix?

This is the problem of camera resectioning — recovering the camera's position, orientation, and internal parameters from world-to-image correspondences X_i ↔ x_i.

The central equation: x_i = P X_i. Given enough correspondences between 3D points X_i and image points x_i, we can solve for the 11 unknowns of P. Each correspondence gives 2 equations, so at minimum 5.5 (i.e. 6) point correspondences are needed.

Camera Resectioning

Known 3D points (cubes) project through the camera to image points (circles). Drag the slider to add noise and see how the estimated P degrades.

Noise0

What is the minimum number of 3D↔2D point correspondences needed to estimate P?

4 points 6 points (5.5 rounded up) 8 points

Chapter 1: Basic Equations

Each correspondence X_i ↔ x_i gives us x_i = P X_i, which in homogeneous coordinates means x_i × (P X_i) = 0. Expanding this cross product yields:

[0^T −w_iX_i^T y_iX_i^T ; w_iX_i^T 0^T −x_iX_i^T] · p = 0

where p is the 12-vector of entries of P, and x_i = (x_i, y_i, w_i)^T. Each point gives 2 independent equations. Stacking n correspondences gives a 2n × 12 matrix A, and we solve Ap = 0.

Degrees of freedom: P has 12 entries but is defined only up to scale, so 11 DOF. Each point contributes 2 equations. We need at least 11 equations ⇒ 6 points (using only 1 equation from the 6th).

For the minimal case (exactly 5.5 points), A is 11 × 12 with rank 11, and p is its 1D null space. With more points and noise, we minimize ||Ap|| subject to ||p|| = 1 via SVD.

How many independent equations does each 3D↔2D point correspondence contribute?

2 equations 3 equations 1 equation

Chapter 2: The DLT Algorithm

The Direct Linear Transformation (DLT) algorithm is the simplest method for estimating P. It is the same algorithm we used for 2D homographies in Chapter 4, extended to the 3 × 4 camera matrix.

Step	Action
1	For each correspondence X_i ↔ x_i, form two rows of the matrix A using the equations from Chapter 1
2	Stack all rows to get the 2n × 12 matrix A
3	Compute SVD: A = UDV^T
4	P is the last column of V (the right singular vector for the smallest singular value)
5	Reshape the 12-vector into a 3 × 4 matrix

What is being minimized? The DLT minimizes algebraic error ||Ap||, not geometric error. Algebraic error has no direct geometric meaning — it depends on the coordinate system. This is why normalization (next chapter) is critical.

The residual ||Ap|| is called the algebraic error. It is fast to compute but has a subtle bias: it weights distant points differently from nearby ones, and the result depends on the coordinate system used.

What does the DLT algorithm minimize?

Algebraic error ||Ap|| subject to ||p|| = 1 Geometric reprojection error Maximum likelihood error

Chapter 3: Data Normalization

Raw coordinates (e.g., pixels in [0, 1024] and world points in metres) create terrible conditioning for the DLT. The matrix A has wildly different magnitudes in its entries. The fix is normalization.

Data	Normalization
Image points x_i	Translate centroid to origin, scale so RMS distance from origin is √2
3D points X_i	Translate centroid to origin, scale so RMS distance from origin is √3

Normalization is essential, not optional. Without it, the DLT can give wildly wrong results. With it, the DLT often performs nearly as well as iterative methods. This is the single most important practical lesson in the whole estimation framework.

After computing P from normalized data, denormalize: P = T⁻¹ P̃ U, where T normalizes image points and U normalizes 3D points.

For scenes with large depth variation (e.g., points near the camera and points at infinity as vanishing points), the standard centroid+scaling normalization may not work well. More sophisticated normalization strategies exist for such cases.

What is the RMS distance from the origin that 2D image points should be scaled to?

√2 1 √3

Chapter 4: Geometric Error

Algebraic error (what the DLT minimizes) does not correspond directly to anything meaningful in image space. Geometric error is the quantity we really care about: the distance in pixels between the measured point x_i and the projected point PX_i.

min_P Σ_i d(x_i, P X_i)²

This is the reprojection error. If measurement errors are Gaussian, minimizing this sum of squared geometric distances gives the Maximum Likelihood Estimate (MLE) of P.

Geometric interpretation of algebraic error: When points are normalized so x_i = (x, y, 1)^T, the algebraic error equals w_i · d(x_i, x̂_i), where w_i is proportional to the depth of X_i. So the DLT implicitly weights each point by its depth — a subtle bias towards minimizing focal length.

When both image and 3D point locations have errors, we minimize a combined cost that includes estimated "true" 3D positions as additional unknowns, weighted by their respective covariances (Mahalanobis distance).

If image measurement errors are Gaussian, what does minimizing geometric reprojection error give?

The Maximum Likelihood Estimate of P The least-squares algebraic estimate The MAP estimate

Chapter 5: The Gold Standard Algorithm

The Gold Standard algorithm combines the DLT (for initialization) with iterative minimization of geometric error (for refinement). It is the recommended approach for camera resectioning.

Step	Action
1	Normalize image points (T) and 3D points (U)
2	Run DLT on normalized data to get initial P̃
3	Refine P̃ by minimizing Σ d(x̃_i, P̃ X̃_i)² using Levenberg-Marquardt
4	Denormalize: P = T⁻¹ P̃ U

When is the Gold Standard worth it? For clean data (e.g., a well-machined calibration grid), the DLT alone often suffices — the iterative refinement improves accuracy by less than 1/1000 of a pixel. For noisier data, the Gold Standard can make a significant difference.

The iterative step uses P's 12 entries as the parameter vector. Levenberg-Marquardt optimizes a function from IR¹² → IR²ⁿ (or just IR⁹ → IR²ⁿ for a restricted camera).

What optimization algorithm is used in the iterative step of the Gold Standard?

Levenberg-Marquardt Gradient descent Newton's method

Chapter 6: Restricted Camera Estimation

Often we know something about the camera's internals — for example, that pixels are square (s = 0, α_x = α_y) or that the principal point is at the image centre. These constraints reduce the number of unknowns.

Constraint	Reduces DOF by
Skew s = 0	1
Square pixels: α_x = α_y	1
Known principal point (x₀, y₀)	2
Known full K	5 (only 6 DOF remain: R and t)

Two strategies for enforcing constraints:
(1) Minimize geometric error over the reduced parameter set (e.g., 9 params for a pinhole camera with s = 0, α_x = α_y).
(2) Minimize algebraic error using a reduced measurement matrix Â (a 12 × 12 matrix such that ||Ap|| = ||Âp|| for all p). This reduces to an IR⁹ → IR¹² minimization, independent of the number of points.

For affine cameras (last row of P is (0, 0, 0, 1)), the estimation is even simpler: algebraic and geometric errors are equal, so a linear algorithm gives the MLE directly with only 4 correspondences needed.

For an affine camera, how many point correspondences are needed at minimum?

4 6 8

Chapter 7: Radial Distortion

Real lenses are not perfect pinholes. The most visible deviation is radial distortion: straight lines in the world appear curved in the image, especially near the edges. This violates the linear model x = PX.

The standard correction model is:

x̂ = x_c + L(r)(x − x_c)

where x_c is the distortion centre (approximately the principal point), r = ||x − x_c||, and L(r) = 1 + κ₁r² + κ₂r⁴ + ...

Barrel vs. pincushion: If κ₁ < 0, you get barrel distortion (lines bow outward — common in wide-angle lenses). If κ₁ > 0, you get pincushion distortion (lines bow inward — common in telephoto lenses). Usually only κ₁ is needed; κ₂ helps with extreme wide-angle lenses.

Distortion correction can be estimated simultaneously with P, or applied as a pre-processing step. The key diagnostic: if straight world lines appear curved in the image, radial distortion needs correction before any geometry computation.

What type of distortion makes straight lines bow outward near the image edges?

Barrel distortion (κ₁ < 0) Pincushion distortion (κ₁ > 0) Tangential distortion

Chapter 8: Degenerate Configurations

Some arrangements of 3D points and cameras make P impossible to determine uniquely, even with unlimited data. These are critical configurations.

Configuration	Problem
Camera + all points on a twisted cubic	Camera can slide along the cubic without changing images
Points on a plane + a line through the camera centre	Camera can move along the line freely
Near-planar scenes (aerial nadir view)	Close to the planar degeneracy; P is poorly conditioned

Practical warning: Near-degenerate configurations are more common than true degeneracies. A near-nadir aerial view of flat terrain, or a scene where most points lie on a plane, will give a poorly determined P even with many points and low noise. Use objects with substantial 3D extent for reliable calibration.

Line correspondences can supplement point correspondences. Each 3D↔2D line correspondence contributes 2 linear equations in the entries of P, and can help resolve ambiguities when points alone are insufficient.

Which of the following is a critical configuration for camera resectioning?

Camera and all 3D points lying on a twisted cubic curve Points distributed uniformly in 3D space Points at multiple different depths

Chapter 9: Connections

Camera resectioning is the bridge between knowing a camera model (Chapter 6) and using cameras for reconstruction (Chapters 9–13). The methods here extend directly to multi-view problems.

Link	How P computation connects
Ch 4 → Ch 7	DLT for homographies generalizes directly to DLT for camera matrices
Ch 7 → Ch 8	The computed P reveals vanishing points, calibration K, and single-view geometry
Ch 7 → Ch 9	Two computed P's define the fundamental matrix F = [e']_× P' P⁺
Ch 7 → Ch 12	Known P's enable triangulation: recover 3D structure from 2D correspondences
Ch 7 → Ch 18	Bundle adjustment refines all P's and X's simultaneously across N views

"There are few circumstances under which [gold standard estimation] should not be used in preference to a DLT method."

— Hartley & Zisserman, Chapter 7

What advantage does normalization give the DLT algorithm?

It makes the result invariant to the choice of coordinate system, dramatically improving conditioning It makes the DLT run faster It eliminates the need for iterative refinement

← Chapter 6 Chapter 8: Single View Geometry →