Image Formation — Szeliski, Chapter 2

Chapter 0: Why Study Image Formation?

Before we can analyze images, we need to understand how they are created. An image is not a random grid of numbers — it is the result of a precise physical process involving geometry, optics, and sensor electronics.

A point in the 3D world reaches the image through a series of transformations. First, its 3D coordinates are projected through the camera's lens system onto a 2D image plane. Then, the light arriving from that point is determined by how the light sources interact with the surface material. Finally, the continuous light field is sampled into discrete pixel values by the sensor.

Understanding this pipeline is essential because every vision algorithm must either exploit or compensate for these formation effects. Lens distortion warps straight lines. Sensor noise corrupts measurements. Perspective projection makes distant objects small. An algorithm that ignores these effects will produce systematically wrong results.

The image formation equation: Pixel value = f(scene geometry, camera pose, lens optics, lighting, surface reflectance, sensor response). Understanding each factor lets you invert the process — recovering the 3D world from the 2D image.

Image Formation Pipeline

Watch a 3D point travel through the formation pipeline to become a pixel value.

Why must vision algorithms model the image formation process?

To make images look better
To correctly invert the process and recover 3D scene properties from 2D observations
Because cameras are expensive

Chapter 1: 2D Transformations

Before tackling 3D-to-2D projection, let's master 2D-to-2D transforms. These are the building blocks for everything that follows and are directly used in image alignment (Chapter 8).

The hierarchy of 2D transformations, from simplest to most complex:

Transform	DoF	Preserves
Translation	2	Orientation, shape, size
Rigid (Euclidean)	3	Shape, size (adds rotation)
Similarity	4	Shape (adds uniform scale)
Affine	6	Parallelism (adds shear, non-uniform scale)
Projective (Homography)	8	Straight lines only

Each transform can be expressed as a matrix multiplication. In homogeneous coordinates, even translation becomes a matrix multiply: we add an extra coordinate (always 1) so that (x, y) becomes (x, y, 1), and a 3×3 matrix can encode any of the transforms above.

x' = H · x where x = (x, y, 1)^T
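As a minimal sketch of the homogeneous-coordinate idea, the snippet below builds a similarity transform by multiplying 3×3 matrices and applies it to a point. The helper names (matmul3, apply_h, translation, rotation, scaling) are illustrative and not taken from the text; only the Python standard library is assumed.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_h(h, point):
    """Apply a 3x3 homogeneous transform to a 2D point (x, y)."""
    x, y = point
    xh = [h[r][0] * x + h[r][1] * y + h[r][2] for r in range(3)]
    # Divide by the third coordinate: it stays 1 for affine transforms,
    # but generally differs from 1 for a projective transform.
    return (xh[0] / xh[2], xh[1] / xh[2])

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def scaling(s):
    return [[s, 0.0, 0.0], [0.0, s, 0.0], [0.0, 0.0, 1.0]]

# Similarity transform: scale by 2, rotate by 90 degrees, then translate by (1, 0).
# The rightmost matrix is applied first.
H = matmul3(translation(1.0, 0.0), matmul3(rotation(math.pi / 2), scaling(2.0)))
corner = apply_h(H, (1.0, 0.0))  # approximately (1.0, 2.0)
```

Because every transform is a 3×3 matrix, chaining them is just matrix multiplication, which is the main practical payoff of the extra coordinate.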

2D Transform Explorer

Select a transform type and drag the slider to see its effect on a square.

How many degrees of freedom does an affine transformation have?

3
4
6: it adds shear and non-uniform scale to similarity transforms

Chapter 2: 3D Transformations

The 3D world requires 3D transformations. A camera's position and orientation in the world is described by a rigid body transformation (also called a Euclidean transform): a rotation R plus a translation t.

p' = R · p + t

The rotation matrix R is a 3×3 orthogonal matrix (R^TR = I, det(R) = 1). It has only 3 degrees of freedom despite having 9 entries, because the columns must be orthonormal. Common parameterizations include Euler angles (roll, pitch, yaw), axis-angle, and quaternions.

Why quaternions? Euler angles suffer from gimbal lock — certain orientations cause a loss of one degree of freedom. Quaternions (4D unit vectors) avoid this and interpolate smoothly, which is why game engines and robotics use them.
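To make the quaternion parameterization concrete, here is a small sketch that converts an axis-angle rotation to a unit quaternion and then to a 3×3 rotation matrix using the standard conversion formula. The function names and the (w, x, y, z) component ordering are choices made for this example; libraries differ on ordering.

```python
import math

def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)

def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

# A 90-degree rotation about the z-axis.
q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
R_q = quat_to_matrix(q)  # approximately [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
```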

In homogeneous coordinates, a 3D rigid transform becomes a 4×4 matrix:

[R | t]
[0 | 1]
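The 4×4 form can be sketched as follows: assemble the matrix from R and t, apply it to a point, and check the two rotation-matrix conditions (R^T R = I and det(R) = 1). All function names here are illustrative, and rotation about the z-axis is used only because it is easy to verify by hand.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rigid_matrix(R, t):
    """Assemble the 4x4 homogeneous matrix [R | t; 0 | 1]."""
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

def transform_point(T, p):
    """Apply a 4x4 rigid transform to a 3D point p = (X, Y, Z)."""
    ph = [p[0], p[1], p[2], 1.0]
    out = [sum(T[r][c] * ph[c] for c in range(4)) for r in range(4)]
    return out[:3]

def is_rotation(R, tol=1e-9):
    """Check R^T R = I and det(R) = +1 within a tolerance."""
    for i in range(3):
        for j in range(3):
            col_dot = sum(R[k][i] * R[k][j] for k in range(3))
            if abs(col_dot - (1.0 if i == j else 0.0)) > tol:
                return False
    det = (R[0][0] * (R[1][1] * R[2][2] - R[1][2] * R[2][1])
           - R[0][1] * (R[1][0] * R[2][2] - R[1][2] * R[2][0])
           + R[0][2] * (R[1][0] * R[2][1] - R[1][1] * R[2][0]))
    return abs(det - 1.0) <= tol

R = rot_z(math.pi / 2)
T = rigid_matrix(R, [0.0, 0.0, 5.0])
p_new = transform_point(T, (1.0, 0.0, 0.0))  # approximately (0, 1, 5)
```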

A 3D rotation matrix has 9 entries but only how many degrees of freedom?

3: the orthonormality constraints remove 6 of the 9 degrees of freedom
6
9

Chapter 3: Perspective Projection

The heart of image formation is perspective projection: how a 3D point maps to a 2D image coordinate. A pinhole camera is the simplest model — light passes through a single point (the center of projection) and hits the image plane behind it.

For a point at 3D coordinates (X, Y, Z), the perspective projection onto the image plane at focal length f is:

x = f · X / Z
y = f · Y / Z

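This simple division by Z is what makes distant objects appear smaller. It also means a single image loses depth: every point along the same ray through the center of projection lands on the same image coordinate. This is the fundamental ambiguity that makes recovering the 3D world from a 2D image (the inversion described in Chapter 0) hard.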
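A quick numeric sketch of the pinhole equations shows both effects: doubling the depth halves the projected coordinates, and two different 3D points on the same ray project to the same place. The function name project_pinhole and the sample values are illustrative.

```python
def project_pinhole(point, f):
    """Project a camera-frame 3D point (X, Y, Z) with focal length f."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)

near = project_pinhole((1.0, 0.5, 5.0), f=80.0)
far = project_pinhole((1.0, 0.5, 10.0), f=80.0)       # same X, Y but twice as far
same_ray = project_pinhole((2.0, 1.0, 10.0), f=80.0)  # twice as far AND twice as large
```

Here far is half of near in both coordinates, while same_ray equals near, so the image alone cannot tell a small nearby object from a large distant one.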

The pinhole model: Think of poking a tiny hole in a cardboard box. Light from each scene point passes through the hole and projects onto the back wall. Nearer objects project larger; farther objects project smaller. The focal length f is the distance from the hole to the wall.

Perspective Projection

Drag the depth slider to move the 3D point closer or farther. Watch how its projected position changes.

The full projection model combines the intrinsic matrix K (focal length, principal point, skew) with the extrinsic matrix [R|t] (camera pose):

x = K [R | t] X
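The full model can be sketched in a few lines: transform the world point into the camera frame with [R | t], multiply by K, then divide by the third homogeneous coordinate to get pixel coordinates. The intrinsic values below (fx, fy, cx, cy, zero skew) are made-up numbers for illustration, not parameters of any particular camera.

```python
def project(K, R, t, Xw):
    """Project a world point Xw = (X, Y, Z) to pixel coordinates via K [R | t]."""
    # Extrinsics: world frame -> camera frame.
    Xc = [sum(R[r][c] * Xw[c] for c in range(3)) + t[r] for r in range(3)]
    # Intrinsics: camera frame -> homogeneous pixel coordinates.
    u, v, w = (sum(K[r][c] * Xc[c] for c in range(3)) for r in range(3))
    # The result is only defined up to scale, so normalize by w.
    return (u / w, v / w)

# Hypothetical intrinsics: focal lengths in pixels and the principal point.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]

# Camera at the world origin, looking down +Z (identity pose).
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

pixel = project(K, R, t, (0.5, -0.25, 4.0))  # (420.0, 190.0)
```

With an identity pose this reduces to u = fx · X / Z + cx and v = fy · Y / Z + cy, which is the pinhole equation shifted to the principal point.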

In perspective projection, what causes distant objects to appear smaller?

The lens magnifies nearby objects
The image coordinates are divided by the depth Z, so larger Z means smaller projection
The sensor has fewer pixels for far objects

Chapter 4: Lens Distortion

Real cameras are not pinhole cameras. They use lenses to gather more light, but lenses introduce distortion — straight lines in the world appear curved in the image.

The two main types:

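Radial distortion: Points are displaced radially from the image center. Barrel distortion pulls points toward the center, more strongly the farther out they are, so straight lines bow outward (common in wide-angle lenses); pincushion distortion pushes points away from the center, so straight lines bow inward (common in telephoto lenses).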
Tangential distortion: Caused by lens elements not being perfectly aligned. Usually much smaller than radial distortion.

x_d = x(1 + k₁r² + k₂r⁴)
y_d = y(1 + k₁r² + k₂r⁴)
where r² = x² + y² and (x, y) are undistorted coordinates measured from the image center
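A minimal sketch of the radial model, assuming coordinates are measured from the image center and the same factor is applied to both x and y. The function name distort_radial is illustrative.

```python
def distort_radial(x, y, k1, k2=0.0):
    """Apply the polynomial radial distortion model to a point (x, y)."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return (x * factor, y * factor)

# With r^2 = 0.5, a negative k1 gives a factor below 1 (point moves toward the
# center, barrel); a positive k1 gives a factor above 1 (pincushion).
barrel = distort_radial(0.5, 0.5, k1=-0.2)      # approximately (0.45, 0.45)
pincushion = distort_radial(0.5, 0.5, k1=0.2)   # approximately (0.55, 0.55)
center = distort_radial(0.0, 0.0, k1=-0.2)      # the center never moves
```

Because the displacement grows with r², points near the center barely move while points near the corners move the most, which is why straight lines near the image edge appear curved.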

Lens Distortion Simulator

Drag k₁ to see barrel (negative) and pincushion (positive) distortion on a grid.

Calibration removes distortion. Camera calibration estimates the distortion coefficients (k₁, k₂, ...) from images of known patterns (like checkerboards). Once known, distortion can be removed — "undistortion" — making the image match the ideal pinhole model.
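The polynomial distortion model has no simple closed-form inverse, so undistortion is usually done numerically. The sketch below uses a fixed-point iteration, which is one common choice and an assumption of this example rather than something the text prescribes; it works well for mild distortion. The function names and the iteration count are illustrative.

```python
def distort_radial(x, y, k1, k2=0.0):
    """Forward model: ideal (x, y) -> distorted (x_d, y_d)."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return (x * factor, y * factor)

def undistort_radial(xd, yd, k1, k2=0.0, iterations=20):
    """Invert the forward model by fixed-point iteration (assumes mild distortion)."""
    x, y = xd, yd  # initial guess: the distorted point itself
    for _ in range(iterations):
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        # Re-estimate the ideal point using the factor at the current guess.
        x, y = xd / factor, yd / factor
    return (x, y)

ideal = (0.3, 0.2)
observed = distort_radial(*ideal, k1=-0.2)
recovered = undistort_radial(*observed, k1=-0.2)  # close to (0.3, 0.2)
```

In a real pipeline the coefficients would come from calibration, and the inverse mapping is typically precomputed once per camera and reused for every frame.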

What type of distortion makes straight lines bow outward, typical of wide-angle lenses?

Barrel distortion (k₁ < 0)
Pincushion distortion
Tangential distortion

Chapter 5: Lighting and Reflectance

The brightness of a pixel depends not just on geometry but on how light interacts with surfaces. Two key concepts:

Lambertian reflectance: A matte surface reflects light equally in all directions, so its brightness does not depend on the viewing direction. It depends only on the angle between the surface normal n and the light direction l (both unit vectors):

I = ρ · max(0, n · l)

where ρ is the albedo (surface reflectivity, 0 to 1). When the light hits the surface head-on (n · l = 1), it is brightest. When the surface faces away (n · l ≤ 0), it is in shadow.
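The Lambertian formula can be checked with a few hand-picked directions. This sketch normalizes its inputs so that n · l is the cosine of the angle between them; the function names and the albedo value 0.8 are illustrative.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def lambertian(albedo, normal, light):
    """Lambertian intensity: albedo * max(0, n . l) with unit n and l."""
    n, l = normalize(normal), normalize(light)
    n_dot_l = sum(a * b for a, b in zip(n, l))
    return albedo * max(0.0, n_dot_l)

up = (0.0, 0.0, 1.0)
angle = math.radians(60)

head_on = lambertian(0.8, up, (0.0, 0.0, 1.0))                          # 0.8
oblique = lambertian(0.8, up, (math.sin(angle), 0.0, math.cos(angle)))  # about 0.4
facing_away = lambertian(0.8, up, (0.0, 0.0, -1.0))                     # 0.0
```

A light 60° off the normal gives half the head-on brightness because cos 60° = 0.5.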

Specular reflectance: Shiny surfaces have bright highlights where the light reflects directly toward the camera. The Phong model adds a specular term based on the angle between the reflected light direction and the view direction, raised to a shininess exponent.
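A sketch of the Phong specular term described above: reflect the light direction about the normal, take the cosine of the angle to the view direction, and raise it to a shininess exponent. The specular coefficient k_s and the exponent value are arbitrary illustrative defaults, and the function names are not from the text.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phong_specular(normal, light, view, k_s=0.5, shininess=10.0):
    """Phong specular term: k_s * max(0, r . v) ** shininess."""
    n, l, v = normalize(normal), normalize(light), normalize(view)
    n_dot_l = dot(n, l)
    if n_dot_l <= 0.0:
        return 0.0  # light is behind the surface: no highlight
    # Mirror reflection of the light direction about the normal.
    r = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    return k_s * max(0.0, dot(r, v)) ** shininess

up = (0.0, 0.0, 1.0)
angle = math.radians(60)

mirror = phong_specular(up, up, up)  # viewer exactly on the reflection direction
off_axis = phong_specular(up, up, (math.sin(angle), 0.0, math.cos(angle)))
```

A larger exponent makes the highlight fall off faster as the viewer moves away from the mirror direction, which is what makes a surface look glossier.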

Lambertian Shading

Drag the light direction slider. Watch how the sphere brightness changes based on the angle between surface normal and light.

For a Lambertian surface, what determines pixel brightness?

The distance from the camera
The dot product of the surface normal and light direction (times albedo)
The color of the camera lens

Chapter 6: The Digital Sensor

Light is continuous, but computers need discrete numbers. The digital sensor converts photons into pixel values through two steps: spatial sampling (dividing the image plane into a grid of sensor cells) and quantization (converting the analog signal to an integer).

Each sensor cell accumulates photons during the exposure time. More photons = higher value. But there is always noise: photon shot noise (random arrival of photons), read noise (electronics), and dark current (thermal electrons).
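The following toy simulation, under stated assumptions, shows how these noise sources combine for a single sensor cell. Shot noise is approximated by a Gaussian whose variance equals the expected electron count (reasonable at high counts, inaccurate in very dark regions), and every parameter value (read noise, dark rate, gain, bit depth) is made up for illustration.

```python
import math
import random

def simulate_pixel(photon_rate, exposure, rng, read_noise=2.0,
                   dark_rate=0.1, gain=0.05, bits=8):
    """Return one quantized pixel value for a cell under constant illumination."""
    signal = photon_rate * exposure   # expected photo-electrons
    dark = dark_rate * exposure       # expected thermal (dark current) electrons
    expected = signal + dark
    # Shot noise: Gaussian approximation of a Poisson count (assumption).
    electrons = rng.gauss(expected, math.sqrt(expected))
    # Read noise from the electronics: zero-mean Gaussian.
    electrons += rng.gauss(0.0, read_noise)
    # Quantization: scale to digital numbers, round, and clip to the bit depth.
    value = round(gain * electrons)
    return min(max(value, 0), 2 ** bits - 1)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
dim = [simulate_pixel(1000.0, 1.0, rng) for _ in range(500)]
bright = [simulate_pixel(4000.0, 1.0, rng) for _ in range(500)]
mean_dim = sum(dim) / len(dim)           # close to 0.05 * 1000 = 50
mean_bright = sum(bright) / len(bright)  # close to 0.05 * 4000 = 200
```

Individual samples scatter around the mean, and the scatter grows with brightness because shot noise scales with the square root of the signal.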

Most color cameras use a Bayer filter array: a mosaic of red, green, and blue color filters over the sensor cells. There are twice as many green cells as red or blue, because human eyes are most sensitive to green. The missing color values at each pixel are filled in by demosaicing interpolation.

Why more green? The human visual system has peak sensitivity in the green wavelengths. The Bayer pattern (RGGB) mirrors this by sampling green at twice the rate. This gives better luminance resolution, which matters more perceptually than chrominance.
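As a rough sketch of the RGGB layout and the simplest possible demosaicing step, the code below assigns a filter color to each cell, keeps only that channel from a full-color image, and estimates the missing green value at a red or blue cell by averaging its four direct neighbors (which are always green in a Bayer mosaic). The function names are illustrative, and real demosaicing algorithms are considerably more sophisticated.

```python
def bayer_channel(row, col):
    """Filter color at (row, col) for an RGGB pattern."""
    if row % 2 == 0 and col % 2 == 0:
        return "R"
    if row % 2 == 1 and col % 2 == 1:
        return "B"
    return "G"

def mosaic(rgb_image):
    """Keep only the channel that each cell's filter lets through."""
    index = {"R": 0, "G": 1, "B": 2}
    return [[pixel[index[bayer_channel(r, c)]] for c, pixel in enumerate(row)]
            for r, row in enumerate(rgb_image)]

def green_at(raw, row, col):
    """Green value at a cell: measured if available, else the neighbor average."""
    if bayer_channel(row, col) == "G":
        return raw[row][col]
    height, width = len(raw), len(raw[0])
    neighbors = [raw[r][c]
                 for r, c in ((row - 1, col), (row + 1, col),
                              (row, col - 1), (row, col + 1))
                 if 0 <= r < height and 0 <= c < width]
    return sum(neighbors) / len(neighbors)

# Count the filter colors in a 4x4 tile.
counts = {"R": 0, "G": 0, "B": 0}
for r in range(4):
    for c in range(4):
        counts[bayer_channel(r, c)] += 1

# A flat-colored test image: every pixel is (R, G, B) = (10, 200, 30).
flat = [[(10, 200, 30)] * 4 for _ in range(4)]
raw = mosaic(flat)
green_estimate = green_at(raw, 0, 0)  # interpolated green at a red cell
```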

Bayer Pattern

The sensor captures each pixel through only one color filter. Click "Demosaic" to interpolate the missing values.

Why does the Bayer filter array have twice as many green pixels as red or blue?

Human vision is most sensitive to green, so more green samples give better perceived quality
Green sensors are cheaper to manufacture
It reduces noise in the blue channel

Chapter 7: Sampling and Aliasing

When a continuous image is sampled on a discrete grid, information can be lost. If the image contains details finer than the pixel spacing can represent, those details get corrupted into false patterns called aliasing.

The Nyquist-Shannon sampling theorem tells us: to perfectly reconstruct a band-limited signal, we must sample at a rate of at least twice the highest frequency present. If the signal has frequencies above half the sampling rate (the Nyquist frequency), those frequencies fold back onto lower frequencies and create artifacts.

f_sample ≥ 2 · f_max

In practice, cameras use an anti-aliasing filter (a slight blur) in front of the sensor to attenuate frequencies above the Nyquist limit before sampling.
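Frequency folding is easy to see in one dimension. In this sketch a 9 Hz sine sampled at 10 Hz yields exactly the same samples as a sign-flipped 1 Hz sine, so after sampling the two are indistinguishable. The function names and the chosen frequencies are illustrative.

```python
import math

def sample_sine(freq, sample_rate, n):
    """Return n samples of sin(2*pi*freq*t) taken at the given sample rate."""
    return [math.sin(2.0 * math.pi * freq * k / sample_rate) for k in range(n)]

def alias_frequency(freq, sample_rate):
    """Apparent frequency after sampling, folded into [0, sample_rate / 2]."""
    folded = freq % sample_rate
    return min(folded, sample_rate - folded)

fs = 10.0                        # Nyquist frequency is fs / 2 = 5 Hz
high = sample_sine(9.0, fs, 20)  # above Nyquist: will alias
low = sample_sine(1.0, fs, 20)   # the frequency it masquerades as

# sin(2*pi*9*k/10) = sin(2*pi*k - 2*pi*k/10) = -sin(2*pi*k/10)
max_difference = max(abs(h + l) for h, l in zip(high, low))  # essentially zero
apparent = alias_frequency(9.0, fs)  # 1.0
```

A blur applied before sampling removes the 9 Hz content so that it cannot masquerade as 1 Hz; once the samples are taken, the aliased component cannot be separated from a genuine 1 Hz signal.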

You see aliasing everywhere: Moiré patterns on striped shirts in photos, jagged diagonal lines ("jaggies") in low-res images, and wagon wheels appearing to spin backwards in video are all aliasing artifacts.

What does the Nyquist theorem require for alias-free sampling?

More pixels than the image is wide
The sampling rate must be at least twice the highest frequency in the signal
The camera must be in focus

Chapter 8: Showcase — Full Camera Model

Let's combine everything into a complete interactive camera model. A 3D scene with objects at various depths is projected through a lens onto a sensor, with control over focal length, aperture, distortion, and noise.

Interactive Camera Simulator

Adjust focal length, sensor noise, and distortion to see their combined effect on the captured image.

The complete model: Every pixel value is the end result of: 3D geometry → perspective projection → lens distortion → photometric interaction (lighting × reflectance) → sensor sampling + noise. Chapters 3-14 build on this foundation to extract useful information back out.

Chapter 9: Connections

Image formation is the foundation for everything that follows. Here is how the concepts connect to later chapters:

Concept	Used In
Perspective projection	Ch 11 (SfM), Ch 12 (Stereo), Ch 8 (Stitching)
Intrinsic matrix K	Ch 11 (Calibration), Ch 12 (Rectification)
Lens distortion	Ch 11 (Calibration), Ch 8 (Alignment)
Lighting & reflectance	Ch 10 (Computational Photography), Ch 13 (Shape from shading)
Sensor noise	Ch 3 (Denoising), Ch 10 (HDR)
Color & Bayer patterns	Ch 3 (Color processing), Ch 10 (Demosaicing)

Szeliski's philosophy: "Before we can intelligently analyze and manipulate images, we need to establish a vocabulary for describing the geometry of a scene." This chapter provided that vocabulary. Now we can start processing images.

Which image formation concept is most critical for stereo depth estimation?

Perspective projection: it relates 3D depth to 2D disparity between stereo views
Bayer patterns
JPEG compression