How the 3D world becomes a 2D photograph: geometric projections, lens optics, photometry, and the digital sensor.
Before we can analyze images, we need to understand how they are created. An image is not a random grid of numbers — it is the result of a precise physical process involving geometry, optics, and sensor electronics.
A point in the 3D world passes through a series of transformations. First, its 3D coordinates are projected through the camera's lens system onto a 2D image plane. Then, the light sources and surface materials in the scene determine how much light that point sends toward the camera. Finally, the continuous light field is sampled into discrete pixel values by the sensor.
Understanding this pipeline is essential because every vision algorithm must either exploit or compensate for these formation effects. Lens distortion warps straight lines. Sensor noise corrupts measurements. Perspective projection makes distant objects appear small. An algorithm that ignores these effects inherits their errors in its measurements.
Watch a 3D point travel through the formation pipeline to become a pixel value.
Before tackling 3D-to-2D projection, let's master 2D-to-2D transforms. These are the building blocks for everything that follows and are directly used in image alignment (Chapter 8).
The hierarchy of 2D transformations, from simplest to most complex:
| Transform | DoF | Preserves |
|---|---|---|
| Translation | 2 | Orientation, shape, size |
| Rigid (Euclidean) | 3 | Shape, size (adds rotation) |
| Similarity | 4 | Shape (adds uniform scale) |
| Affine | 6 | Parallelism (adds shear, non-uniform scale) |
| Projective (Homography) | 8 | Straight lines only |
Each transform can be expressed as a matrix multiplication. In homogeneous coordinates, even translation becomes a matrix multiply: we add an extra coordinate (always 1) so that (x, y) becomes (x, y, 1), and a 3×3 matrix can encode any of the transforms above.
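As a minimal sketch of the homogeneous-coordinate idea, the NumPy snippet below builds two of the transforms from the table as 3×3 matrices and applies them to the corners of a unit square. The helper names (`translation`, `similarity`, `apply_transform`) are illustrative choices, not part of any library.

```python
import numpy as np

def translation(tx, ty):
    """3x3 homogeneous matrix for a 2D translation (2 DoF)."""
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def similarity(scale, theta, tx, ty):
    """3x3 homogeneous matrix for uniform scale + rotation + translation (4 DoF)."""
    c, s = scale * np.cos(theta), scale * np.sin(theta)
    return np.array([[c,  -s,  tx],
                     [s,   c,  ty],
                     [0.0, 0.0, 1.0]])

def apply_transform(H, points):
    """Apply a 3x3 transform to an (N, 2) array of 2D points."""
    ones = np.ones((len(points), 1))
    pts_h = np.hstack([points, ones])      # (x, y) -> (x, y, 1)
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]        # divide by w (w == 1 unless H is projective)

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
moved = apply_transform(translation(2.0, 3.0), square)
rotated = apply_transform(similarity(2.0, np.pi / 2, 0.0, 0.0), square)
```

Because every transform is a matrix, composing two of them is a single matrix product, for example `translation(2, 3) @ similarity(2, np.pi / 2, 0, 0)`.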
Select a transform type and drag the slider to see its effect on a square.
The 3D world requires 3D transformations. A camera's position and orientation in the world are described by a rigid body transformation (also called a Euclidean transform): a rotation R plus a translation t.
The rotation matrix R is a 3×3 orthogonal matrix (RᵀR = I, det(R) = 1). It has only 3 degrees of freedom despite having 9 entries, because the columns must be orthonormal. Common parameterizations include Euler angles (roll, pitch, yaw), axis-angle, and quaternions.
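To make the constraints concrete, here is a small sketch that builds R from the axis-angle parameterization using Rodrigues' formula and then checks orthogonality and the determinant numerically. The function name `rotation_from_axis_angle` is an illustrative choice.

```python
import numpy as np

def rotation_from_axis_angle(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about `axis` (Rodrigues' formula)."""
    axis = np.asarray(axis, dtype=float)
    x, y, z = axis / np.linalg.norm(axis)
    # Skew-symmetric cross-product matrix of the unit axis
    K = np.array([[0.0, -z,   y],
                  [z,   0.0, -x],
                  [-y,  x,   0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

R = rotation_from_axis_angle([0.0, 0.0, 1.0], np.pi / 2)  # 90 degrees about the z axis

is_orthogonal = np.allclose(R.T @ R, np.eye(3))
has_unit_det = np.isclose(np.linalg.det(R), 1.0)
```

The three numbers that define this rotation (a unit axis scaled by an angle) match the 3 degrees of freedom noted above.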
In homogeneous coordinates, a 3D rigid transform becomes a 4×4 matrix:

$$T = \begin{bmatrix} R & t \\ 0^\top & 1 \end{bmatrix}$$
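A short sketch of how such a 4×4 matrix can be assembled from R and t and applied to a point follows. The helper names are illustrative; the inverse uses the fact that the inverse of a rotation is its transpose.

```python
import numpy as np

def rigid_transform(R, t):
    """Pack a 3x3 rotation R and a 3-vector t into a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def invert_rigid(T):
    """Invert a rigid transform without a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

# 90-degree rotation about z, followed by a translation of (1, 2, 3)
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = rigid_transform(R, [1.0, 2.0, 3.0])

p = np.array([1.0, 0.0, 0.0, 1.0])   # the point (1, 0, 0) in homogeneous form
p_moved = T @ p                      # rotate, then translate
p_back = invert_rigid(T) @ p_moved   # recovers the original point
```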
The heart of image formation is perspective projection: how a 3D point maps to a 2D image coordinate. A pinhole camera is the simplest model — light passes through a single point (the center of projection) and hits the image plane behind it.
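For a point at 3D coordinates (X, Y, Z), the perspective projection onto the image plane at focal length f is (using the common convention of a virtual image plane in front of the pinhole, which avoids a sign flip):

$$x = f \frac{X}{Z}, \qquad y = f \frac{Y}{Z}$$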
This simple division by Z is what makes distant objects appear smaller. It also means we lose all depth information — the fundamental ambiguity from Chapter 1.
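The division by Z can be sketched in a few lines. The example assumes points are given in camera coordinates with Z > 0 and uses an illustrative function name `project_pinhole`. It also shows the depth ambiguity: two different 3D points on the same ray land on the same image coordinate.

```python
import numpy as np

def project_pinhole(points, f):
    """Project (N, 3) camera-frame points onto the image plane at focal length f."""
    points = np.asarray(points, dtype=float)
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    if np.any(Z <= 0):
        raise ValueError("points must lie in front of the camera (Z > 0)")
    return np.stack([f * X / Z, f * Y / Z], axis=1)

pts = np.array([[1.0, 1.0, 2.0],    # near point
                [1.0, 1.0, 4.0],    # same X, Y but twice as far: projects half as far from center
                [2.0, 2.0, 4.0]])   # on the same ray as the first point
img = project_pinhole(pts, f=1.0)
```

Here the first and third points project to the same location, which is why a single image cannot recover depth.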
Drag the depth slider to move the 3D point closer or farther. Watch how its projected position changes.
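The full projection model combines the intrinsic matrix K (focal length, principal point, skew) with the extrinsic matrix [R|t] (camera pose):

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \mid t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

Here (u, v) are pixel coordinates, (X, Y, Z) are world coordinates, and λ is the depth of the point in the camera frame, which is divided out to obtain (u, v).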
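The following sketch chains the two stages: world points are moved into the camera frame with R and t, then mapped to pixels with K. The parameter names (`fx`, `fy`, `cx`, `cy`, `skew`) and the numeric values are assumptions chosen for illustration, not calibration results from a real camera.

```python
import numpy as np

def intrinsic_matrix(fx, fy, cx, cy, skew=0.0):
    """Build K from focal lengths (in pixels), principal point, and skew."""
    return np.array([[fx,  skew, cx],
                     [0.0, fy,   cy],
                     [0.0, 0.0,  1.0]])

def project(K, R, t, X_world):
    """Project (N, 3) world points to (N, 2) pixel coordinates."""
    X_cam = X_world @ R.T + t          # extrinsics: world frame -> camera frame
    x = X_cam @ K.T                    # intrinsics: camera frame -> homogeneous pixels
    return x[:, :2] / x[:, 2:3]        # perspective division by depth

K = intrinsic_matrix(fx=800.0, fy=800.0, cx=320.0, cy=240.0)
R = np.eye(3)                          # camera aligned with the world axes
t = np.zeros(3)                        # camera at the world origin

X_world = np.array([[0.0, 0.0, 5.0],   # on the optical axis
                    [1.0, 0.0, 5.0]])  # one unit to the right at the same depth
pixels = project(K, R, t, X_world)
```

A point on the optical axis lands on the principal point (cx, cy), and the offset of the second point is fx · X / Z pixels.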
Real cameras are not pinhole cameras. They use lenses to gather more light, but lenses introduce distortion — straight lines in the world appear curved in the image.
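The two main types are barrel distortion, where straight lines bow outward from the image center, and pincushion distortion, where they bow inward.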
Drag k1 to see barrel (negative) and pincushion (positive) distortion on a grid.
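The text does not give a distortion formula, so the sketch below assumes the widely used polynomial radial model, in which a normalized image point is scaled by 1 + k1·r² + k2·r⁴. The function name `distort_radial` and the coefficient values are illustrative.

```python
import numpy as np

def distort_radial(xy, k1, k2=0.0):
    """Apply polynomial radial distortion to (N, 2) normalized image points.

    Assumed model: x_d = x * (1 + k1 * r^2 + k2 * r^4), with r measured
    from the distortion center at (0, 0).
    """
    xy = np.asarray(xy, dtype=float)
    r2 = np.sum(xy ** 2, axis=1, keepdims=True)
    return xy * (1.0 + k1 * r2 + k2 * r2 ** 2)

pts = np.array([[0.0, 0.0],    # the center never moves
                [0.5, 0.0]])   # a point halfway toward the edge

barrel = distort_radial(pts, k1=-0.2)      # negative k1 pulls points toward the center
pincushion = distort_radial(pts, k1=0.2)   # positive k1 pushes points outward
```

Because the displacement grows with r², points near the image edge move more than points near the center, which is what bends straight lines.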
The brightness of a pixel depends not just on geometry but on how light interacts with surfaces. Two key concepts:
Lambertian reflectance: A matte surface reflects light equally in all directions. The brightness depends only on the angle between the unit surface normal n and the unit light direction l:

$$I = \rho \, \max(0, \; n \cdot l)$$
where ρ is the albedo (surface reflectivity, 0 to 1). When the light hits the surface head-on (n · l = 1), it is brightest. When the surface faces away (n · l ≤ 0), it is in shadow.
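A minimal sketch of Lambertian shading, assuming a single distant light of unit intensity and an illustrative function name `lambertian`:

```python
import numpy as np

def lambertian(normals, light_dir, albedo):
    """Lambertian brightness for (N, 3) surface normals under one directional light."""
    normals = np.asarray(normals, dtype=float)
    light_dir = np.asarray(light_dir, dtype=float)
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    l = light_dir / np.linalg.norm(light_dir)
    # Clamp negative dot products: surfaces facing away from the light are in shadow
    return albedo * np.clip(n @ l, 0.0, None)

normals = np.array([
    [0.0, 0.0, 1.0],                                          # facing the light head-on
    [np.sin(np.radians(60)), 0.0, np.cos(np.radians(60))],    # tilted 60 degrees away
    [0.0, 0.0, -1.0],                                         # facing away from the light
])
brightness = lambertian(normals, light_dir=[0.0, 0.0, 1.0], albedo=0.8)
```

The view direction does not appear anywhere in the function, which is the defining property of a matte surface.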
Specular reflectance: Shiny surfaces have bright highlights where the light reflects directly toward the camera. The Phong model adds a specular term based on the angle between the reflected light direction and the view direction, raised to a shininess exponent.
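The specular term described above can be sketched as follows. The coefficient name `k_s`, the default shininess of 32, and the helper names are assumptions for illustration; the reflected direction is computed by mirroring the light direction about the normal.

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def phong_specular(normal, light_dir, view_dir, k_s=0.5, shininess=32.0):
    """Phong specular term for a single surface point."""
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    n_dot_l = n @ l
    if n_dot_l <= 0.0:
        return 0.0                      # light is behind the surface: no highlight
    r = 2.0 * n_dot_l * n - l           # mirror reflection of l about n
    return k_s * max(0.0, r @ v) ** shininess

normal = [0.0, 0.0, 1.0]
light = [1.0, 0.0, 1.0]                 # light arriving at 45 degrees

at_mirror = phong_specular(normal, light, view_dir=[-1.0, 0.0, 1.0])  # viewer on the mirror direction
off_mirror = phong_specular(normal, light, view_dir=[0.0, 0.0, 1.0])  # viewer 45 degrees off
```

A larger shininess exponent makes the highlight fall off faster as the viewer moves away from the mirror direction, giving a smaller, sharper spot.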
Drag the light direction slider. Watch how the sphere brightness changes based on the angle between surface normal and light.
Light is continuous, but computers need discrete numbers. The digital sensor converts photons into pixel values through two steps: spatial sampling (dividing the image plane into a grid of sensor cells) and quantization (converting the analog signal to an integer).
Each sensor cell accumulates photons during the exposure time, and more photons produce a higher value. But there is always noise: photon shot noise (random arrival of photons), read noise (from the readout electronics), and dark current (thermally generated electrons that accumulate even without light).
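One common way to simulate these noise sources is sketched below: shot noise and dark current as a Poisson process, read noise as additive Gaussian noise, followed by quantization to integers. The function name, parameter names, and default values are all illustrative assumptions, not measurements from a real sensor.

```python
import numpy as np

def simulate_sensor(photon_rate, exposure, read_noise_std=2.0,
                    dark_rate=0.1, gain=1.0, bit_depth=8, rng=None):
    """Simulate raw pixel values for an array of per-pixel photon rates.

    photon_rate : expected photo-electrons per second at each pixel
    exposure    : exposure time in seconds
    dark_rate   : expected dark-current electrons per second
    """
    rng = np.random.default_rng() if rng is None else rng
    expected = (np.asarray(photon_rate, dtype=float) + dark_rate) * exposure
    # Shot noise and dark current: electron counts follow a Poisson distribution
    electrons = np.asarray(rng.poisson(expected), dtype=float)
    # Read noise: zero-mean Gaussian added by the readout electronics
    electrons = electrons + rng.normal(0.0, read_noise_std, size=electrons.shape)
    # Quantization: scale, round to integers, and clip to the available range
    levels = np.round(gain * electrons)
    return np.clip(levels, 0, 2 ** bit_depth - 1).astype(np.int64)

rng = np.random.default_rng(0)
flat_patch = np.full((100, 100), 100.0)     # uniformly lit region
raw = simulate_sensor(flat_patch, exposure=1.0, rng=rng)
```

Even though the patch is uniformly lit, `raw` varies from pixel to pixel. For Poisson noise the variance equals the mean, so brighter regions are noisier in absolute terms but have a better signal-to-noise ratio.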
Most color cameras use a Bayer filter array: a mosaic of red, green, and blue color filters over the sensor cells. There are twice as many green cells as red or blue, because human eyes are most sensitive to green. The missing color values at each pixel are filled in by demosaicing interpolation.
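The sketch below simulates a Bayer mosaic and a very simple demosaicing step. It assumes an RGGB layout and fills each missing value with the average of the known samples of that color in the surrounding 3×3 neighborhood; real cameras use more sophisticated, edge-aware interpolation. All function names are illustrative.

```python
import numpy as np

def bayer_masks(h, w):
    """Boolean masks (R, G, B) for an assumed RGGB layout."""
    rows, cols = np.indices((h, w))
    r = (rows % 2 == 0) & (cols % 2 == 0)
    b = (rows % 2 == 1) & (cols % 2 == 1)
    g = ~(r | b)
    return r, g, b

def mosaic(rgb):
    """Keep only the color channel that each sensor cell's filter passes."""
    h, w, _ = rgb.shape
    r, g, b = bayer_masks(h, w)
    return rgb[..., 0] * r + rgb[..., 1] * g + rgb[..., 2] * b

def box_sum3(img):
    """Sum over each pixel's 3x3 neighborhood, with zero padding at the borders."""
    h, w = img.shape
    p = np.pad(img, 1)
    return sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))

def demosaic(raw):
    """Fill missing colors with the mean of known same-color neighbors."""
    h, w = raw.shape
    out = np.zeros((h, w, 3))
    for c, mask in enumerate(bayer_masks(h, w)):
        m = mask.astype(float)
        avg = box_sum3(raw * m) / box_sum3(m)   # average of known samples only
        out[..., c] = np.where(mask, raw, avg)  # keep measured values untouched
    return out

rgb = np.zeros((4, 6, 3))
rgb[...] = [0.2, 0.5, 0.8]      # a flat-colored test image
raw = mosaic(rgb)               # one value per pixel
restored = demosaic(raw)        # three values per pixel again
```

On a flat-colored image this recovers the original exactly. Near sharp edges, neighborhood averaging mixes colors from both sides, which is the source of the color fringing that better demosaicing methods try to avoid.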
The sensor captures each pixel through only one color filter. Click "Demosaic" to interpolate the missing values.
When a continuous image is sampled on a discrete grid, information can be lost. If the image contains details finer than the pixel spacing can represent, those details get corrupted into false patterns called aliasing.
The Nyquist-Shannon sampling theorem tells us: to perfectly reconstruct a band-limited signal, we must sample at a rate greater than twice the highest frequency present. If the signal has frequencies above half the sampling rate (the Nyquist frequency), those frequencies fold back to lower frequencies and create artifacts.
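The folding can be checked numerically. In the sketch below (with an illustrative helper `alias_frequency`), a 9 Hz sine sampled at 10 Hz produces exactly the same samples as a sign-flipped 1 Hz sine, so after sampling the two are indistinguishable.

```python
import numpy as np

def alias_frequency(f, fs):
    """Apparent frequency of a sinusoid of frequency f when sampled at rate fs."""
    return abs(f - fs * np.round(f / fs))

fs = 10.0                       # sampling rate in Hz, so the Nyquist frequency is 5 Hz
n = np.arange(20)               # sample indices
t = n / fs                      # sample times in seconds

samples_9hz = np.sin(2 * np.pi * 9.0 * t)   # above the Nyquist frequency
samples_1hz = np.sin(2 * np.pi * 1.0 * t)   # below the Nyquist frequency

apparent = alias_frequency(9.0, fs)                   # the 9 Hz tone shows up at 1 Hz
same_samples = np.allclose(samples_9hz, -samples_1hz) # identical up to a sign flip
```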
In practice, many cameras use an anti-aliasing filter (a slight optical blur) in front of the sensor to attenuate frequencies above the Nyquist limit before sampling.
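The same principle applies when downsampling an image that is already digital. The sketch below is a digital analogy to the optical filter, not a model of it: a 1D signal alternating between 1 and 0 is reduced by a factor of 2, once directly and once after a small blur. The kernel `[0.25, 0.5, 0.25]` is an assumed, commonly used low-pass choice.

```python
import numpy as np

# Finest possible detail: alternating bright and dark samples (average brightness 0.5)
signal = np.tile([1.0, 0.0], 8)

# Naive downsampling keeps every second sample and only ever sees the bright ones
naive = signal[::2]

# Blur first to remove detail the coarser grid cannot represent, then downsample
kernel = np.array([0.25, 0.5, 0.25])
blurred = np.convolve(signal, kernel, mode="same")
filtered = blurred[::2]
```

The naive result is a constant 1.0, a false pattern that misreports the brightness of the region, while the pre-filtered result is a constant 0.5, the true local average.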
Let's combine everything into a complete interactive camera model. A 3D scene with objects at various depths is projected through a lens onto a sensor, with control over focal length, aperture, distortion, and noise.
Adjust focal length, sensor noise, and distortion to see their combined effect on the captured image.
Image formation is the foundation for everything that follows. Here is how the concepts connect to later chapters:
| Concept | Used In |
|---|---|
| Perspective projection | Ch 11 (SfM), Ch 12 (Stereo), Ch 8 (Stitching) |
| Intrinsic matrix K | Ch 11 (Calibration), Ch 12 (Rectification) |
| Lens distortion | Ch 11 (Calibration), Ch 8 (Alignment) |
| Lighting & reflectance | Ch 10 (Computational Photography), Ch 13 (Shape from shading) |
| Sensor noise | Ch 3 (Denoising), Ch 10 (HDR) |
| Color & Bayer patterns | Ch 3 (Color processing), Ch 10 (Demosaicing) |