Projective geometry everywhere: how cameras steal a dimension, and how we steal it back from two or more photographs.
You take a photograph of a building. The rectangular windows look trapezoidal. The parallel walls converge toward a point. A dimension has been lost — the camera crushed 3D reality onto a flat 2D sensor. How do you recover what was thrown away?
Now take a second photograph from a different position. Suddenly, the lost depth reappears: each point in the scene is visible from two angles, and the difference between those angles encodes how far away it is. This is stereo vision, and it is the starting point for all of multi-view geometry.
The mathematics that ties these views together is projective geometry — a framework where parallel lines are allowed to meet (at infinity!), where shapes change but straightness is preserved, and where camera projections become simple matrix multiplications.
A 3D cube is projected through a camera center onto a 2D image plane. Drag the slider to rotate the cube and watch how parallel edges converge to vanishing points.
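The simplest camera model is the pinhole camera. Every ray of light passes through a single point, the camera centre, and hits a flat image plane at distance f from it. A physical pinhole forms an inverted image on a plane behind the centre; placing the plane in front of the centre, as is conventional, gives the same geometry without the flip. A 3D point (X, Y, Z), expressed in the camera's coordinate frame, maps to image coordinates (fX/Z, fY/Z), where f is the focal length.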
Notice the division by Z. That is where depth is lost: all points along the same ray from the camera centre collapse to the same image pixel. A mountain far away and a coin held up close can project to the same image point.
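To make the lost depth concrete, here is a minimal sketch of the pinhole mapping in plain Python. The function name `project_pinhole`, the focal length, and the two sample points are illustrative assumptions rather than anything fixed by the text; the only thing taken from above is the formula (fX/Z, fY/Z).

```python
def project_pinhole(point, f):
    """Map a 3D point (X, Y, Z) in the camera frame to (f*X/Z, f*Y/Z)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


f = 2.0  # focal length, arbitrary illustrative value

coin = (0.5, 0.25, 1.0)            # small and close
mountain = (500.0, 250.0, 1000.0)  # on the same ray, 1000 times further away

coin_image = project_pinhole(coin, f)
mountain_image = project_pinhole(mountain, f)

print(coin_image)      # (1.0, 0.5)
print(mountain_image)  # (1.0, 0.5)
```

Both points land on the same image coordinates, so nothing in a single image distinguishes them.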
This is a non-linear mapping because of that division by Z. One of the central tricks of this book is to make it linear by switching to homogeneous coordinates, which we'll meet in the next chapter.
A point at depth Z is projected through the camera centre onto the image plane at distance f. Move the sliders to change focal length and point depth.
In ordinary Euclidean coordinates, a 2D point is a pair (x, y). We now add a third number and write (x, y, 1). Seems harmless — we can always get back to (x, y) by dropping the 1. But here's the trick: we declare that (kx, ky, k) represents the same point for any non-zero k.
These are homogeneous coordinates. Now ask: what about (x, y, 0)? If we try to divide by the last coordinate, we get infinity. These are the points at infinity — they represent directions, not locations. Two parallel lines, which never meet in Euclidean space, now meet at a point at infinity.
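A small sketch can show parallel lines meeting at infinity. It relies on a standard fact that the text above does not state: if a line ax + by + c = 0 is written as the triple (a, b, c), the intersection of two lines is their cross product. The helper names `cross` and `dehomogenize` and the three sample lines are assumptions made for illustration.

```python
def cross(u, v):
    """Cross product of two 3-vectors given as tuples."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dehomogenize(p):
    """Return (x/w, y/w), or None when w == 0 (a point at infinity)."""
    x, y, w = p
    if w == 0:
        return None
    return (x / w, y / w)


# Lines written as (a, b, c), meaning a*x + b*y + c = 0.
line_a = (1, -1, 0)   # y = x
line_b = (1, -1, 2)   # y = x + 2, parallel to line_a
line_c = (1, 1, -4)   # x + y = 4

meet_ac = cross(line_a, line_c)
meet_ab = cross(line_a, line_b)

print(meet_ac, dehomogenize(meet_ac))  # (4, 4, 2) (2.0, 2.0)
print(meet_ab, dehomogenize(meet_ab))  # (-2, -2, 0) None
```

The triple (4, 4, 2) is the ordinary point (2, 2), the same point as (2, 2, 1) under the scaling rule. The parallel pair meets at (-2, -2, 0): the last coordinate is zero, and the first two give the shared direction of the lines.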
The teal point shows where (x, y, w) maps when de-homogenized. Drag w toward 0 to watch the point race to infinity.
Now we can describe the full camera projection as a single matrix equation. A 3D point X in homogeneous coordinates is a 4-vector (X, Y, Z, 1). The image point x is a 3-vector (x, y, w). The camera is a 3×4 matrix P such that:

$$x = PX$$
The matrix P packs together everything about the camera: where it is, which way it's pointing, and its internal optics (focal length, pixel size, principal point). We decompose P as P = K[R | t], where K is the calibration matrix (internal parameters), R is a rotation (camera orientation), and t is a translation. Together, R and t fix the camera position: the camera centre is $C = -R^\top t$, so t equals the position itself only when R is the identity and the sign is flipped.
| Symbol | Meaning | DOF |
|---|---|---|
| K | Calibration matrix (focal length, aspect ratio, principal point, skew) | 5 |
| R | Rotation matrix (camera orientation) | 3 |
| t | Translation vector (camera position) | 3 |
| P | Full camera matrix K[R \| t] | 11 |
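As a rough sketch of how the pieces in the table fit together, the NumPy snippet below assembles a camera from K, R and t and projects one point. All numeric values (an 800-pixel focal length, a principal point at (320, 240), the translation) are made up for illustration.

```python
import numpy as np

# Illustrative values only; none of these numbers come from the text.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])   # calibration: focal length, principal point
R = np.eye(3)                           # orientation: aligned with the world axes
t = np.array([0.0, 0.0, 5.0])           # translation

# P = K [R | t], a 3x4 matrix.
P = K @ np.hstack([R, t.reshape(3, 1)])

# Project a homogeneous 3D point and divide by the last coordinate.
X = np.array([1.0, 0.5, 5.0, 1.0])
x = P @ X
pixel = x[:2] / x[2]

# The camera centre C = -R^T t is the one point that P maps to (0, 0, 0).
C = -R.T @ t
centre_image = P @ np.append(C, 1.0)

print(pixel)         # [400. 280.]
print(centre_image)  # [0. 0. 0.]
```

The camera centre has no image at all, which is one way to see that t and the camera position are related but not identical.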
When the same scene is photographed twice, every 3D point creates a correspondence: a pair of image points (x, x') related by geometry. The key constraint is epipolar geometry.
Pick a point x in image 1. It could be the projection of any point along a ray in 3D. That entire ray, seen from camera 2, projects to a line in image 2, the epipolar line. The matching point x' must lie somewhere on this line. This is encoded in the fundamental matrix F:

$$x'^\top F x = 0$$
F is a 3×3 matrix with rank 2 and 7 degrees of freedom. It captures the entire relative geometry between two views without knowing anything about the 3D scene.
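The epipolar constraint and the rank-2 property can be checked numerically. The sketch below assumes two cameras P1 = K[I | 0] and P2 = K[R | t] with a shared, made-up calibration K, and uses the standard closed form F = K^-T [t]x R K^-1 for that configuration. That formula is not derived in this chapter, so treat it as an assumption of the example.

```python
import numpy as np


def skew(v):
    """Matrix [v]_x such that skew(v) @ u equals np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


# Illustrative calibration and relative pose (not taken from the text).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-1.0, 0.0, 0.1])

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t.reshape(3, 1)])

K_inv = np.linalg.inv(K)
F = K_inv.T @ skew(t) @ R @ K_inv

# Random scene points in front of both cameras.
rng = np.random.default_rng(0)
points = np.column_stack([rng.uniform(-1, 1, 20),
                          rng.uniform(-1, 1, 20),
                          rng.uniform(4, 6, 20),
                          np.ones(20)])

residuals = []
for X in points:
    x1 = P1 @ X
    x1 = x1 / x1[2]
    x2 = P2 @ X
    x2 = x2 / x2[2]
    epipolar_line = F @ x1           # line in image 2 on which x2 must lie
    residuals.append(abs(x2 @ epipolar_line))
max_residual = max(residuals)

# Count singular values that are not numerically zero.
singular_values = np.linalg.svd(F, compute_uv=False)
rank_F = int(np.sum(singular_values > 1e-10 * singular_values[0]))

print(max_residual, rank_F)
```

The 3D points are used only to generate test correspondences; F itself is built from the cameras alone, which is the sense in which it is independent of the scene.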
Click in the left image to place a point. The right image shows the corresponding epipolar line — the matching point must lie on it.
With two views, a point correspondence gives us the epipolar constraint. With three views, we get something richer: the trifocal tensor.
Consider a 3D line L, visible in the three images as l, l', l''. The three planes back-projected from these image lines must all intersect in a single 3D line. This incidence condition is captured by a set of three 3×3 matrices {T1, T2, T3}, the trifocal tensor, through the relation:

$$l_i = l'^\top T_i \, l'', \quad i = 1, 2, 3$$

Because l is a homogeneous vector, the equality holds up to a common scale factor.
The tensor has 18 degrees of freedom (three cameras with 11 DOF each, minus the 15-DOF projective ambiguity). It encodes all the geometry of three views, including the ability to transfer a point from two views to the third.
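Here is a hedged numerical check of the line relation. It assumes the first camera is the canonical P1 = [I | 0] and uses the standard construction T_i = a_i b4^T - a4 b_i^T, where a_i and b_i are the columns of the second and third cameras. That construction is not given in this chapter, and the random cameras and the two points defining the line are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Canonical first camera and two generic (random, illustrative) cameras.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = rng.normal(size=(3, 4))
P3 = rng.normal(size=(3, 4))

A, a4 = P2[:, :3], P2[:, 3]
B, b4 = P3[:, :3], P3[:, 3]

# Three 3x3 slices: T_i = a_i b4^T - a4 b_i^T.
T = np.stack([np.outer(A[:, i], b4) - np.outer(a4, B[:, i]) for i in range(3)])

# A 3D line defined by two points on it.
X1 = np.array([0.3, -0.2, 4.0, 1.0])
X2 = np.array([-0.5, 0.4, 6.0, 1.0])


def image_line(P):
    """Image of the 3D line: the 2D line through the two projected points."""
    return np.cross(P @ X1, P @ X2)


l1, l2, l3 = image_line(P1), image_line(P2), image_line(P3)

# Transfer the line from views 2 and 3 into view 1: l_i = l'^T T_i l''.
l1_pred = np.array([l2 @ T[i] @ l3 for i in range(3)])

# Homogeneous vectors agree up to scale, so compare directions only.
u = l1 / np.linalg.norm(l1)
v = l1_pred / np.linalg.norm(l1_pred)
residual = np.linalg.norm(np.cross(u, v))

print(residual)
```

A residual near machine precision means the line predicted from the other two views coincides with the line actually observed in the first.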
When points lie on a plane in 3D, something special happens: the mapping between two images becomes a homography — an invertible 3×3 matrix H such that x' = Hx. No depth ambiguity, no epipolar lines needed.
Why? Because a point on a plane has only two degrees of freedom: the plane equation fixes one coordinate in terms of the other two (for most planes, Z is determined by X and Y). That extra constraint pins down the mapping completely. Homographies are everywhere in practice: floor planes, building facades, tabletops.
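One way to see the homography appear is to put the plane at Z = 0 in world coordinates. The third column of [R | t] then multiplies zero and drops out, leaving the 3×3 matrix K[r1 r2 t] for each camera, and the image-to-image map is the composition of one such matrix with the inverse of the other. The sketch below checks this numerically; the poses, calibration and helper names are illustrative assumptions.

```python
import numpy as np


def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def plane_to_image(K, R, t):
    """Map plane coordinates (X, Y, 1) on Z = 0 to the image: K [r1 r2 t]."""
    return K @ np.column_stack([R[:, 0], R[:, 1], t])


# Illustrative calibration and two camera poses looking at the plane Z = 0.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R1, t1 = rot_x(0.3), np.array([0.0, 0.0, 5.0])
R2, t2 = rot_y(-0.4), np.array([0.5, 0.0, 6.0])

P1 = K @ np.hstack([R1, t1.reshape(3, 1)])
P2 = K @ np.hstack([R2, t2.reshape(3, 1)])

# Image 1 -> plane -> image 2.
H = plane_to_image(K, R2, t2) @ np.linalg.inv(plane_to_image(K, R1, t1))

max_error = 0.0
for X, Y in [(-1, -1), (1, -1), (1, 1), (-1, 1), (0.2, 0.7)]:
    Xw = np.array([X, Y, 0.0, 1.0])   # a point on the plane
    x1 = P1 @ Xw
    x2 = P2 @ Xw
    x2_from_H = H @ x1                # predicted from image 1 alone
    error = np.linalg.norm(x2[:2] / x2[2] - x2_from_H[:2] / x2_from_H[2])
    max_error = max(max_error, error)

print(max_error)
```

The prediction uses only the image-1 point and H, with no depth involved.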
A square grid on a plane is warped by a perspective homography. Drag the sliders to change the tilt and rotation angles.
The grand payoff: given point correspondences across two or more images, recover the 3D positions of every point and the camera that took each photo. The procedure is called reconstruction, and it happens in layers: first a projective reconstruction, then an affine one, and finally a metric one.
Each step adds information. With uncalibrated cameras you can only get projective reconstruction. If you know something about the cameras (constant focal length, square pixels), you can upgrade all the way to metric.
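Why do uncalibrated cameras stop at a projective reconstruction? A short sketch: apply any invertible 4×4 matrix H to the points and its inverse to the camera, and every image stays exactly the same. The matrix H, the camera and the cube corners below are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative camera and scene: the corners of a box in front of the camera.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
corners = np.array([[x, y, z, 1.0]
                    for x in (-0.5, 0.5)
                    for y in (-0.5, 0.5)
                    for z in (4.0, 5.0)]).T   # shape (4, 8)

# An arbitrary invertible projective transformation of 3D space.
H = np.array([[1.0, 0.2, 0.0, 0.1],
              [0.0, 1.0, 0.3, 0.0],
              [0.1, 0.0, 1.0, 0.2],
              [0.05, -0.1, 0.02, 1.0]])

P_alt = P @ np.linalg.inv(H)   # a different camera ...
corners_alt = H @ corners      # ... viewing a distorted scene


def to_pixels(x):
    """Divide homogeneous image points (3 x N) by their last row."""
    return x[:2] / x[2]


pixels = to_pixels(P @ corners)
pixels_alt = to_pixels(P_alt @ corners_alt)

shape_true = corners[:3] / corners[3]
shape_alt = corners_alt[:3] / corners_alt[3]

print(np.allclose(pixels, pixels_alt))     # True
print(np.allclose(shape_true, shape_alt))  # False
```

The two scenes differ in 3D yet produce identical pixels, so the images alone cannot choose between them. Each later layer restricts the class of H that is still allowed.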
A set of 3D points is shown under different reconstruction ambiguities. Toggle between layers to see how adding constraints progressively removes distortion.
What if you don't have a calibration pattern? What if the cameras are unknown? Auto-calibration (or self-calibration) extracts the camera's internal parameters from the images alone, by exploiting constraints that must hold across views.
The key player is the absolute conic, an imaginary conic living on the plane at infinity. Every camera projects this conic into its image, and the projected shape (the image of the absolute conic, or IAC) encodes the calibration matrix K via $\omega = (K K^\top)^{-1}$.
| Assumption | Constraints per view |
|---|---|
| Known principal point | 2 |
| Zero skew | 1 |
| Known aspect ratio | 1 |
| Fixed (unknown) focal length | Shared across views |
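To see how ω encodes K, here is a minimal round-trip sketch: build ω from a made-up K, then recover K with a Cholesky factorization. It assumes ω has already been estimated (possibly only up to scale) and is positive definite; estimating ω from real images is the hard part and is not shown. The function name `calibration_from_iac` is hypothetical.

```python
import numpy as np


def calibration_from_iac(omega):
    """Recover an upper-triangular K with K[2, 2] = 1 from omega = (K K^T)^-1.

    omega may be scaled by any positive factor.
    """
    # omega = L L^T with L lower-triangular; here L is K^-T up to scale.
    L = np.linalg.cholesky(omega)
    K = np.linalg.inv(L.T)
    return K / K[2, 2]


# Illustrative calibration with zero skew (not taken from the text).
K_true = np.array([[800.0, 0.0, 320.0],
                   [0.0, 800.0, 240.0],
                   [0.0, 0.0, 1.0]])

omega = np.linalg.inv(K_true @ K_true.T)

# Zero skew shows up as a single linear constraint: omega[0, 1] == 0.
skew_entry = omega[0, 1]

# Recovery works even if omega is only known up to a positive scale.
K_recovered = calibration_from_iac(3.7 * omega)

print(skew_entry)
print(K_recovered)
```

Each assumption in the table above turns into linear constraints of this kind on the entries of ω, which is why stacking enough views and assumptions lets you solve for it.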
This first chapter gave you the 30,000-foot view of multiple view geometry. The rest of the book fills in every detail. Here is the roadmap:
| Part | Topic | Key Object |
|---|---|---|
| 0 | Projective geometry & estimation | Homogeneous coordinates, DLT, RANSAC |
| I | Single view | Camera matrix P, calibration K, vanishing points |
| II | Two views | Fundamental matrix F, essential matrix E, triangulation |
| III | Three views | Trifocal tensor T |
| IV | N views | Bundle adjustment, factorization |