Hartley & Zisserman, Chapter 1

A Tour of Multiple View Geometry

Projective geometry everywhere: how cameras steal a dimension, and how we steal it back from two or more photographs.

Prerequisites: High-school geometry + Matrix multiplication. That's it.

Chapter 0: Why Multiple View Geometry?

You take a photograph of a building. The rectangular windows look trapezoidal. The parallel walls converge toward a point. A dimension has been lost — the camera crushed 3D reality onto a flat 2D sensor. How do you recover what was thrown away?

Now take a second photograph from a different position. Suddenly, the lost depth reappears: each point in the scene is visible from two angles, and the difference between those angles encodes how far away it is. This is stereo vision, and it is the starting point for all of multi-view geometry.

The mathematics that ties these views together is projective geometry — a framework where parallel lines are allowed to meet (at infinity!), where shapes change but straightness is preserved, and where camera projections become simple matrix multiplications.

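The core promise: Given only photographs, taken by unknown cameras at unknown positions, projective geometry lets us recover the 3D shape of the scene and the location of every camera, up to a well-understood ambiguity (a projective transformation in general, shrinking to an overall scale once mild assumptions about the cameras are added). No rulers, no GPS, no calibration patterns.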

The Projection Problem

A 3D cube is projected through a camera center onto a 2D image plane. Drag the slider to rotate the cube and watch how parallel edges converge to vanishing points.

Why can't a single photograph tell you the 3D position of a point?

The camera sensor is too small
Projection collapses a ray of 3D points onto a single 2D pixel, so depth is lost
Photographs are always blurry

Chapter 1: Central Projection

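The simplest camera model is the pinhole camera. Every ray of light passes through a single point, the camera centre, and meets a flat image plane at distance f from it, where f is the focal length. Placing the plane in front of the centre, at Z = f, avoids the upside-down image of a physical pinhole, and a 3D point (X, Y, Z) then maps to image coordinates (fX/Z, fY/Z).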

Notice the division by Z. That is where depth is lost: all points along the same ray from the camera centre collapse to the same image pixel. A mountain far away and a coin held up close can project to the same image point.

(X, Y, Z) → (fX/Z, fY/Z)

This is a non-linear mapping because of that division by Z. One of the central tricks of this book is to make it linear by switching to homogeneous coordinates, which we'll meet in the next chapter.
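A small Python sketch may make the depth loss concrete. The function name `project_pinhole` and the two sample points are illustrative assumptions, not part of the original text; the points are deliberately chosen to lie on the same ray through the camera centre.

```python
def project_pinhole(point, f):
    """Central projection (X, Y, Z) -> (f*X/Z, f*Y/Z) for a point in front of the camera."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


f = 100.0
near = (1.0, 2.0, 4.0)        # a "coin" close to the camera
far = (250.0, 500.0, 1000.0)  # a "mountain" on the same ray, 250 times farther away

near_px = project_pinhole(near, f)
far_px = project_pinhole(far, f)

# Both points land on the same image location, so the image alone cannot
# tell them apart.
print(near_px, far_px)
```

Scaling (X, Y, Z) by any positive factor leaves the result unchanged, which is exactly the ambiguity that homogeneous coordinates later turn into a feature.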

Key insight: Central projection preserves straight lines. A straight line in 3D projects to a straight line in the image (the one exception is a line through the camera centre, which collapses to a single point). This is the most fundamental preserved property, and projective geometry is built on it.

Pinhole Camera

A point at depth Z is projected through the camera centre onto the image plane at distance f. Move the sliders to change focal length and point depth.

What geometric property is preserved by central projection?

Angles between lines
Lengths of line segments
Straightness: lines map to lines

Chapter 2: Homogeneous Coordinates

In ordinary Euclidean coordinates, a 2D point is a pair (x, y). We now add a third number and write (x, y, 1). Seems harmless — we can always get back to (x, y) by dropping the 1. But here's the trick: we declare that (kx, ky, k) represents the same point for any non-zero k.

These are homogeneous coordinates. Now ask: what about (x, y, 0)? If we try to divide by the last coordinate, we get infinity. These are the points at infinity — they represent directions, not locations. Two parallel lines, which never meet in Euclidean space, now meet at a point at infinity.

(x, y, w) ≡ (x/w, y/w) when w ≠ 0 | w = 0 ⇒ point at infinity

Why this matters: In homogeneous coordinates, projection becomes a matrix multiplication: x = PX, where P is a 3×4 matrix. The ugly division by Z is absorbed into the framework. Everything becomes linear algebra.
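The rules above can be tried out in a few lines of Python with NumPy. This is a sketch under one extra assumption that the text has not stated: in homogeneous coordinates, the line through two points and the intersection of two lines are both given by a cross product (a standard fact of 2D projective geometry). The helper names are illustrative.

```python
import numpy as np


def dehomogenize(p):
    """Map (x, y, w) to the Euclidean point (x/w, y/w); only defined for w != 0."""
    x, y, w = p
    if w == 0:
        raise ValueError("w = 0: this is a point at infinity")
    return (x / w, y / w)


def line_through(p, q):
    """Homogeneous line (a, b, c), meaning ax + by + c = 0, through two points."""
    return np.cross(p, q)


def intersection(l, m):
    """Homogeneous intersection point of two lines."""
    return np.cross(l, m)


# (kx, ky, k) represents the same point for any non-zero k.
a = dehomogenize((3.0, 6.0, 1.0))
b = dehomogenize((6.0, 12.0, 2.0))

# Two parallel lines with direction (1, 2): y = 2x and y = 2x + 1.
l1 = line_through(np.array([0.0, 0.0, 1.0]), np.array([1.0, 2.0, 1.0]))
l2 = line_through(np.array([0.0, 1.0, 1.0]), np.array([1.0, 3.0, 1.0]))
meet = intersection(l1, l2)

# The last coordinate is 0 and the first two are proportional to (1, 2):
# the lines meet at the point at infinity in their common direction.
print(a, b, meet)
```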

Homogeneous Coordinates

The teal point shows where (x, y, w) maps when de-homogenized. Drag w toward 0 to watch the point race to infinity.

What does the homogeneous coordinate (3, 6, 0) represent?

The Euclidean point (3, 6)
A point at infinity in the direction (3, 6), where parallel lines with that direction meet
An undefined, invalid coordinate

Chapter 3: The Camera Matrix

Now we can describe the full camera projection as a single matrix equation. A 3D point X in homogeneous coordinates is a 4-vector (X, Y, Z, 1). The image point x is a 3-vector (x, y, w). The camera is a 3×4 matrix P such that:

x = P · X

The matrix P packs together everything about the camera: where it is, which way it's pointing, and its internal optics (focal length, pixel size, principal point). We decompose P as P = K[R | t], where K is the calibration matrix (internal parameters), R is a rotation (camera orientation), and t is a translation vector. Note that t is not the camera position itself: it is the world origin expressed in camera coordinates, and the camera centre is C = −Rᵀt.

Symbol	Meaning	DOF
K	Calibration matrix (focal length, principal point, skew)	5
R	Rotation matrix (camera orientation)	3
t	Translation vector (fixes the camera centre via C = −Rᵀt)	3
P	Full camera matrix K[R | t]	11

Degrees of freedom: P has 12 entries, but is only defined up to scale (multiplying by a constant doesn't change the projection). That leaves 11 DOF: 5 internal (K) + 3 rotation + 3 translation.
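To see the decomposition at work, the following NumPy sketch assembles P = K[R | t] and projects one point. All numeric values (focal length 800, principal point (320, 240), a 10 degree rotation about the Y axis, the translation) are made-up assumptions chosen for illustration. The sketch also checks two standard facts: the camera centre C = −Rᵀt projects to the zero vector, and rescaling P does not move the image point.

```python
import numpy as np

# Assumed internal parameters: focal length 800 px, principal point (320, 240),
# square pixels, zero skew.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Assumed pose: 10 degree rotation about the Y axis plus a translation.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, -0.2, 2.0])

P = K @ np.hstack([R, t.reshape(3, 1)])   # 3x4 camera matrix

X = np.array([0.5, 0.25, 4.0, 1.0])       # homogeneous 3D point
x = P @ X                                 # homogeneous image point (x, y, w)
pixel = x[:2] / x[2]                      # the division by depth happens only here

# The same result, step by step: world -> camera frame -> pinhole -> pixels.
Xc = R @ X[:3] + t
pixel_check = np.array([800.0 * Xc[0] / Xc[2] + 320.0,
                        800.0 * Xc[1] / Xc[2] + 240.0])

# The camera centre is the one point that P sends to (0, 0, 0).
C = -R.T @ t
centre_image = P @ np.append(C, 1.0)

# P is only defined up to scale: 5P gives the same pixel.
x_scaled = (5.0 * P) @ X
pixel_scaled = x_scaled[:2] / x_scaled[2]
```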

What does the camera calibration matrix K encode?

Internal properties: focal length, principal point, pixel shape
The camera's position in the world
The 3D structure of the scene

Chapter 4: Two-View Geometry

When the same scene is photographed twice, every 3D point creates a correspondence: a pair of image points (x, x') related by geometry. The key constraint is epipolar geometry.

Pick a point x in image 1. It could be the projection of any point along a ray in 3D. That entire ray, seen from camera 2, projects to a line in image 2 — the epipolar line. The matching point x' must lie somewhere on this line. This is encoded in the fundamental matrix F:

x'^T F x = 0

F is a 3×3 matrix with rank 2 and 7 degrees of freedom: 9 entries, minus 1 for the arbitrary overall scale, minus 1 for the rank constraint det F = 0. It captures the entire relative geometry between two views without knowing anything about the 3D scene.

The epipolar constraint: For every pair of matching points, x'^TFx = 0. This single equation is the foundation of two-view geometry. It says the matching point must lie on a specific line, cutting the search from 2D down to 1D.
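The constraint can be checked numerically on synthetic data. The sketch below assumes two cameras with a shared, known K, written as K[I | 0] and K[R | t]; under that assumption F has the standard closed form K⁻ᵀ [t]ₓ R K⁻¹, where [t]ₓ is the cross-product matrix of t. The camera values and the random scene are invented for illustration. In practice F is estimated from the correspondences themselves rather than built from known cameras.

```python
import numpy as np


def skew(v):
    """3x3 matrix [v]_x such that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def project(P, X):
    """Project rows of X (N x 4) and scale each image point so its last entry is 1."""
    x = X @ P.T
    return x / x[:, 2:3]


# Assumed setup: camera 1 is K[I | 0], camera 2 is K[R | t].
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-1.0, 0.0, 0.1])

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t.reshape(3, 1)])

K_inv = np.linalg.inv(K)
F = K_inv.T @ skew(t) @ R @ K_inv

# Random scene points in front of both cameras (homogeneous, N x 4).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, (20, 2)),
                     rng.uniform(4, 8, 20),
                     np.ones(20)])
x1 = project(P1, X)
x2 = project(P2, X)

# x'^T F x for every correspondence; all values should vanish.
residuals = np.einsum("ij,jk,ik->i", x2, F, x1)

# Epipolar lines l' = F x in image 2, and the pixel distance of x' to its line.
lines = x1 @ F.T
distances = np.abs(np.sum(lines * x2, axis=1)) / np.hypot(lines[:, 0], lines[:, 1])
```

The `distances` array is the quantity a matcher would use: a candidate match far from its epipolar line can be rejected without ever looking at the rest of the image.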

Epipolar Geometry

Click in the left image to place a point. The right image shows the corresponding epipolar line — the matching point must lie on it.

What does the fundamental matrix F encode?

The 3D positions of all scene points
The focal length of both cameras
The epipolar geometry relating corresponding points in two views

Chapter 5: Three-View Geometry

With two views, a point correspondence gives us the epipolar constraint. With three views, we get something richer: the trifocal tensor.

Consider a line L visible in all three images as l, l', l''. The three planes back-projected from these image lines must all intersect in a single 3D line. This incidence condition is captured by a set of three 3×3 matrices {T₁, T₂, T₃} — the trifocal tensor — through the relation:

l_i = l'^T T_i l''

The tensor has 18 degrees of freedom (three cameras with 11 DOF each, minus the 15-DOF projective ambiguity). It encodes all the geometry of three views, including the ability to transfer a point from two views to the third.

Key power of three views: Given a point in two images, the tensor tells you exactly where it appears in the third image. Two-view geometry cannot do this on its own: a point in one image only constrains its match in the other to an epipolar line, not to a single point.
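The line relation l_i = l'^T T_i l'' can be verified on synthetic cameras. The sketch below assumes the first camera is [I | 0] and the other two are [A | a₄] and [B | b₄], in which case the tensor slices have the standard closed form T_i = a_i b₄ᵀ − a₄ b_iᵀ (a_i and b_i being the columns of A and B). The random cameras and the two 3D points defining the line are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed canonical setup: P1 = [I | 0]; P2 and P3 are arbitrary projective cameras.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = rng.normal(size=(3, 4))
P3 = rng.normal(size=(3, 4))

A, a4 = P2[:, :3], P2[:, 3]
B, b4 = P3[:, :3], P3[:, 3]

# Trifocal tensor as three 3x3 slices: T_i = a_i b4^T - a4 b_i^T.
T = np.stack([np.outer(A[:, i], b4) - np.outer(a4, B[:, i]) for i in range(3)])

# A 3D line, given by two homogeneous points on it.
X1 = np.array([0.3, -0.2, 2.0, 1.0])
X2 = np.array([-0.5, 0.4, 3.0, 1.0])


def image_line(P):
    """Image of the 3D line: the 2D line through the two projected points."""
    return np.cross(P @ X1, P @ X2)


def unit(v):
    return v / np.linalg.norm(v)


l1, l2, l3 = image_line(P1), image_line(P2), image_line(P3)

# Transfer the line from views 2 and 3 into view 1: l_i = l'^T T_i l''.
l1_transfer = np.array([l2 @ T[i] @ l3 for i in range(3)])

# Homogeneous lines are equal up to scale, so compare directions:
# the value is 1.0 when the transferred line coincides with the true one.
alignment = abs(unit(l1) @ unit(l1_transfer))
```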

What new capability does three-view geometry provide over two-view geometry?

Point transfer: predicting a point's location in a third image from correspondences in two
Higher resolution images
The ability to measure absolute distances in meters

Chapter 6: Transfer & Homographies

When points lie on a plane in 3D, something special happens: the mapping between two images becomes a homography — an invertible 3×3 matrix H such that x' = Hx. No depth ambiguity, no epipolar lines needed.

Why? Because a point on a known plane has only two degrees of freedom: the ray through x meets the plane in exactly one 3D point, and that point projects to exactly one x'. That extra constraint pins down the mapping completely. Homographies are everywhere in practice: floor planes, building facades, tabletops.

x' = Hx, with H a 3×3 matrix defined up to scale (8 DOF; 4 point correspondences, no three of them collinear, determine H)

Homography vs fundamental matrix: F relates any pair of views but gives only a line constraint. H relates views of a plane and gives exact point-to-point mapping. If the scene is planar, H is all you need.
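As an illustration of how few measurements a homography needs, here is a sketch of the direct linear transformation (DLT, listed in the roadmap later on) applied to exactly four correspondences. Each correspondence contributes two linear equations in the nine entries of H, and the solution is the null vector of the stacked system. The function names, the test homography `H_true` and the normalisation by H[2, 2] (which assumes that entry is non-zero) are choices made for this example; a production version would also normalise the coordinates first.

```python
import numpy as np


def homography_dlt(src, dst):
    """Estimate H (up to scale) with dst ~ H @ src from >= 4 pairs of (x, y) points."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # From u = (h1 . p) / (h3 . p) and v = (h2 . p) / (h3 . p), p = (x, y, 1).
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)   # right singular vector of the smallest singular value
    return H / H[2, 2]         # fix the free scale (assumes H[2, 2] != 0)


def apply_homography(H, pts):
    """Map an N x 2 array of points through H and de-homogenize."""
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


# Assumed ground-truth homography, used only to generate test data.
H_true = np.array([[1.2, 0.1, 0.5],
                   [-0.2, 0.9, 0.3],
                   [0.1, 0.2, 1.0]])

src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # unit square
dst = apply_homography(H_true, src)

H_est = homography_dlt(src, dst)

# A fifth point that was not used in the estimate maps correctly as well.
extra = np.array([[0.4, 0.7]])
extra_mapped = apply_homography(H_est, extra)
extra_expected = apply_homography(H_true, extra)
```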

Homography Warping

A square grid on a plane is warped by a perspective homography. Drag the sliders to change the tilt and rotation angles.

When does the mapping between two views reduce to a homography?

When the cameras have the same focal length
When all scene points lie on a common plane
When the images have the same resolution

Chapter 7: 3D Reconstruction

The grand payoff: given point correspondences across two or more images, recover the 3D positions of every point and the camera that took each photo. The procedure is called reconstruction, and it happens in layers:

1. Projective

From F alone: recover structure up to a 15-DOF projective ambiguity

↓

2. Affine

Identify the plane at infinity → fix parallelism

↓

3. Metric (similarity)

Identify the absolute conic → fix angles and length ratios

↓

4. Euclidean

One known distance → fix absolute scale

Each step adds information. With uncalibrated cameras and no further knowledge, a projective reconstruction is the best you can get. If you know something about the cameras (constant focal length, square pixels), you can upgrade all the way to metric.

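The reconstruction theorem: From two uncalibrated images and enough point correspondences to determine F uniquely, you can recover the 3D scene up to an unknown projective transformation. Seven correspondences already narrow F down to at most three candidates; eight or more in general position give a unique linear solution. This is remarkable: no calibration needed.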
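Why the ambiguity is exactly projective can be seen in a few lines: for any invertible 4×4 matrix H, the cameras P H⁻¹ and the points H X produce the same images as P and X, because (P H⁻¹)(H X) = P X. A 4×4 matrix up to scale has 15 free parameters, which is the 15-DOF ambiguity mentioned above. The cameras, the scene and the particular H in this sketch are invented for illustration.

```python
import numpy as np


def project(P, X):
    """Project rows of X (N x 4) with camera P and return N x 2 pixel coordinates."""
    x = X @ P.T
    return x[:, :2] / x[:, 2:3]


# Assumed pair of cameras and a random scene in front of them.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(-1, 1, (10, 2)),
                     rng.uniform(3, 6, 10),
                     np.ones(10)])

# An arbitrary invertible projective transformation of 3D space
# (strictly diagonally dominant, hence invertible; its last row makes it non-affine).
H = np.array([[1.0, 0.2, 0.0, 0.5],
              [0.0, 1.0, 0.3, -0.2],
              [0.1, 0.0, 1.0, 0.4],
              [0.05, -0.1, 0.02, 1.0]])
H_inv = np.linalg.inv(H)

# Distorted scene and compensating cameras.
X_alt = X @ H.T
P1_alt = P1 @ H_inv
P2_alt = P2 @ H_inv

same_view1 = np.allclose(project(P1, X), project(P1_alt, X_alt))
same_view2 = np.allclose(project(P2, X), project(P2_alt, X_alt))

# The 3D points themselves are genuinely different.
X_alt_euclidean = X_alt[:, :3] / X_alt[:, 3:4]
scene_changed = not np.allclose(X_alt_euclidean, X[:, :3])
```

Since both reconstructions explain the images equally well, nothing in the pixel data can prefer one over the other; only extra knowledge about the cameras or the scene can.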

Stratified Reconstruction

A set of 3D points is shown under different reconstruction ambiguities. Toggle between layers to see how adding constraints progressively removes distortion.

What additional information is needed to upgrade a projective reconstruction to metric?

More point correspondences
Knowledge of the camera calibration (or constraints on internal parameters)
A higher-resolution camera

Chapter 8: Auto-Calibration

What if you don't have a calibration pattern? What if the cameras are unknown? Auto-calibration (or self-calibration) extracts the camera's internal parameters from the images alone, by exploiting constraints that must hold across views.

The key player is the absolute conic — an imaginary conic living on the plane at infinity. Every camera projects this conic into its image, and the projected shape (the image of the absolute conic, or IAC) encodes the calibration matrix K via ω = (KK^T)^-1.

The magical constraint: If you assume the camera has square pixels, meaning zero skew and unit aspect ratio (usually close to true), each image gives you two linear equations on ω, the image of the absolute conic. Three images with constant parameters give enough equations to solve for K.

Assumption	Constraints per view
Known principal point	2
Zero skew	1
Known aspect ratio	1
Fixed (unknown) focal length	Shared across views
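The link between ω and K can be made tangible with a short sketch. It assumes a calibration matrix with zero skew and square pixels (the numbers are invented), forms ω = (KKᵀ)⁻¹, shows that these two assumptions appear as the linear conditions ω₁₂ = 0 and ω₁₁ = ω₂₂, and then recovers K from ω alone. The recovery uses a Cholesky factorisation: since ω = K⁻ᵀK⁻¹ and K⁻ᵀ is lower triangular with positive diagonal, it is the Cholesky factor of ω. The function name `calibration_from_iac` is hypothetical, and the sketch does not cover the harder step of estimating ω from images.

```python
import numpy as np

# Assumed calibration: focal length 800 px, principal point (320, 240),
# zero skew, square pixels.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

K_inv = np.linalg.inv(K)
omega = K_inv.T @ K_inv            # image of the absolute conic, (K K^T)^-1

# The assumptions on K show up as linear conditions on the entries of omega.
skew_residual = omega[0, 1]                  # zero skew      -> omega_12 = 0
aspect_residual = omega[0, 0] - omega[1, 1]  # square pixels  -> omega_11 = omega_22


def calibration_from_iac(omega):
    """Recover K (normalised so K[2, 2] = 1) from a positive-definite omega."""
    L = np.linalg.cholesky(omega)  # omega = L L^T, with L proportional to K^-T
    K = np.linalg.inv(L.T)
    return K / K[2, 2]


# omega is only defined up to scale, so an arbitrary positive factor is harmless.
K_recovered = calibration_from_iac(3.7 * omega)
```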

What does auto-calibration recover without any calibration pattern?

The camera's internal parameters (focal length, principal point, etc.)
The 3D positions of all scene points in absolute meters
The brand and model of the camera

Chapter 9: Connections

This first chapter gave you the 30,000-foot view of multiple view geometry. The rest of the book fills in every detail. Here is the roadmap:

Part	Topic	Key Object
0	Projective geometry & estimation	Homogeneous coordinates, DLT, RANSAC
I	Single view	Camera matrix P, calibration K, vanishing points
II	Two views	Fundamental matrix F, essential matrix E, triangulation
III	Three views	Trifocal tensor T
IV	N views	Bundle adjustment, factorization

What connects it all: The same pattern repeats at every level. You observe image correspondences. You write a constraint equation (x'^TFx = 0, or a tensor contraction). You estimate the geometry from those equations. You reconstruct 3D from the estimated geometry. Projective first, then upgrade to metric.

"We may define a projective transformation of a plane as any mapping of the points on the plane that preserves straight lines."

— Hartley & Zisserman, Chapter 1

What is the recurring pattern across all levels of multi-view geometry?

Observe correspondences → write constraint equations → estimate geometry → reconstruct 3D
Calibrate camera → take photos → measure distances → build model
Train neural network → run inference → output depth map
