Szeliski, Chapter 3

Image Processing

The pixel-level toolkit: point operators, convolution, Fourier transforms, image pyramids, and geometric warps.

Prerequisites: Chapter 2 (Image Formation) + basic calculus. That's it.
10 chapters, 7+ simulations, 0 assumed CV knowledge

Chapter 0: Why Image Processing?

You have an image — a grid of pixel values produced by the formation pipeline from Chapter 2. Now what? Before any high-level understanding can happen (recognition, 3D reconstruction, tracking), the raw pixel data usually needs to be cleaned up, enhanced, or transformed.

Image processing is the set of operations that take an image as input and produce a modified image as output. Brighten a dark photo? That is a point operator. Blur out noise? That is a linear filter. Sharpen edges? That is another filter. Resize or rotate the image? That is a geometric warp.

These operations are the building blocks for everything that follows in computer vision. Edge detection, feature matching, image stitching, super-resolution — all of them rely on the tools in this chapter.

The processing pipeline: Raw image → noise reduction → contrast enhancement → edge sharpening → geometric correction. Each step uses a different class of operator, and understanding them lets you build any pipeline you need.
[Interactive demo: Image Processing Operations. Shows the effect of each operation on a synthetic image.]

Why is image processing typically the first step before higher-level vision tasks?

Chapter 1: Point Operators

The simplest image processing operations work on individual pixels independently. A point operator transforms each pixel value without looking at its neighbors:

g(x, y) = h(f(x, y))

where f is the input image, g is the output, and h is some function applied to each pixel.

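Common point operators include brightness adjustment (adding a constant), contrast adjustment (multiplying by a gain), and gamma correction (raising to a power).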

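Gamma correction matters: Cameras and displays have nonlinear responses, so stored pixel values are not proportional to light intensity. With a gamma of 2.2, a pixel value of 128 corresponds to roughly 22% of the intensity of 255, not 50%. Gamma correction (typically γ ≈ 2.2) compensates for this nonlinearity, so that equal steps in pixel value look like roughly equal steps in brightness. Without it, shadows look too dark and highlights look washed out.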
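As a rough illustration of g(x, y) = h(f(x, y)), the sketch below applies brightness, contrast, and gamma to a gradient with NumPy. The function name `adjust`, its parameters, and the convention that gamma is applied as an exponent of 1/γ on values in [0, 1] are assumptions made for this example, not something the chapter prescribes.

```python
import numpy as np

def adjust(image, brightness=0.0, contrast=1.0, gamma=1.0):
    """Point operator on an image with values in [0, 1] (hypothetical helper)."""
    out = contrast * image + brightness      # gain and bias: h(f) = a * f + b
    out = np.clip(out, 0.0, 1.0)             # keep values in the valid range
    return out ** (1.0 / gamma)              # assumed convention: gamma > 1 lifts mid-tones

# A horizontal gradient from black to white.
gradient = np.tile(np.linspace(0.0, 1.0, 256), (32, 1))

brighter = adjust(gradient, brightness=0.2)
punchier = adjust(gradient, contrast=1.5)
gamma_encoded = adjust(gradient, gamma=2.2)

# Decoding the 8-bit value 128 with gamma 2.2 gives about 22% linear intensity.
linear_128 = (128 / 255) ** 2.2
```

Every output pixel depends only on the input pixel at the same location, which is why the whole operation can be written as one array expression with no neighborhood loops.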
[Interactive demo: Point Operator Explorer. Sliders for brightness (initially 0), contrast (initially 1.0), and gamma (initially 1.0) show their effect on a gradient.]
What makes a point operator different from other image operations?

Chapter 2: Histograms and Equalization

A histogram counts how many pixels have each intensity value. It tells you at a glance whether an image is dark (values clustered at the low end), bright (clustered high), or low-contrast (narrow spread).

Histogram equalization remaps pixel values so that they spread across the full intensity range as evenly as possible. The algorithm is simple: compute the cumulative distribution function (CDF) of the histogram, normalized so that it runs from 0 to 1, then use it as a lookup table to remap each pixel.

g(x,y) = CDF(f(x,y)) × (L − 1)

where L is the number of intensity levels (256 for 8-bit images). This spreads out intensities that are bunched together, whether the image is too dark or overexposed, and so improves contrast automatically.
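The lookup-table recipe can be sketched in a few lines of NumPy. This is a minimal version for 8-bit grayscale images; the function name `equalize` and the synthetic dark test image are assumptions for illustration.

```python
import numpy as np

def equalize(image, levels=256):
    """Histogram-equalize an 8-bit image with values in [0, levels - 1]."""
    hist = np.bincount(image.ravel(), minlength=levels)   # pixels per intensity level
    cdf = np.cumsum(hist) / image.size                    # normalized CDF, ends at 1
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)   # g = CDF(f) * (L - 1)
    return lut[image]                                     # remap every pixel via the table

# A synthetic dark image: all values sit in the lowest quarter of the range.
rng = np.random.default_rng(0)
dark = rng.integers(0, 64, size=(64, 64), dtype=np.uint8)
flat = equalize(dark)
```

Because the CDF is non-decreasing, the remapping preserves the ordering of intensities: a pixel that was brighter than another never becomes darker than it.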

Adaptive equalization: Global equalization can over-amplify noise in uniform regions. CLAHE (Contrast Limited Adaptive Histogram Equalization) applies equalization locally in tiles, with a clipping limit to prevent noise amplification. It is widely used in medical imaging.
[Interactive demo: Histogram Equalizer. Shows the original dark image with its histogram, and the result after equalization redistributes the values.]

What does histogram equalization achieve?

Chapter 3: Linear Filtering

Unlike point operators, a linear filter computes each output pixel from a weighted combination of its neighbors. The weights are stored in a small grid called a kernel (or filter mask), and the operation of sliding this kernel across the image is called convolution.

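g(x, y) = ∑_{i,j} f(x − i, y − j) · k(i, j)

The minus signs flip the kernel, which is what distinguishes convolution from correlation (the same sum with f(x + i, y + j)). For symmetric kernels such as the box and Gaussian filters, the two give identical results.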
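A direct, unoptimized version of the sliding-kernel sum might look like the sketch below. It assumes odd-sized kernels and zero padding at the borders, and it flips the kernel so that the result is a true convolution; dropping the flip would compute correlation instead. The name `convolve2d` is a local helper, not a library call.

```python
import numpy as np

def convolve2d(image, kernel):
    """Direct 2D convolution with zero padding; output has the input's shape."""
    kh, kw = kernel.shape                          # assumed odd, e.g. 3 x 3
    ph, pw = kh // 2, kw // 2
    h, w = image.shape
    padded = np.pad(image, ((ph, ph), (pw, pw)))   # zeros outside the image
    flipped = kernel[::-1, ::-1]                   # flip: convolution, not correlation
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Add one shifted, weighted copy of the image per kernel tap.
            out += flipped[i, j] * padded[i:i + h, j:j + w]
    return out

box = np.full((3, 3), 1.0 / 9.0)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

# A vertical step edge: zeros on the left, ones on the right.
step = np.zeros((5, 6))
step[:, 3:] = 1.0

blurred = convolve2d(step, box)      # the edge becomes a ramp
edges = convolve2d(step, sobel_x)    # nonzero only near the edge

# Convolving a single bright pixel reproduces the kernel itself.
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
response = convolve2d(impulse, sobel_x)
```

Looping over kernel taps rather than over pixels keeps the Python loop short (nine iterations for a 3×3 kernel) and lets NumPy do the per-pixel work.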

Key kernels and what they do:

| Kernel | Effect | Size |
|---|---|---|
| Box filter | Averages all neighbors equally → blur | Any |
| Gaussian | Weighted average, more weight to center → smooth blur | Depends on σ |
| Sobel | Approximates derivative → edge detection | 3×3 |
| Laplacian | Second derivative → edge + blob detection | 3×3 |
| Sharpen | Enhances high frequencies → crisper edges | 3×3 |

Separability saves computation: A 2D Gaussian is the product of two 1D Gaussians. Instead of applying an N×N kernel (N² multiplications per pixel), you can apply a 1×N horizontal pass followed by an N×1 vertical pass (2N multiplications). For a 15×15 kernel, that is 30 vs 225 operations per pixel.
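The separability claim can be checked numerically. The sketch below builds a 15-tap 1D Gaussian, blurs with two 1D passes, and compares the result against the equivalent 15×15 kernel; the helper names and the choice of σ = 2 are assumptions for this example.

```python
import numpy as np

def gaussian_1d(sigma, radius):
    """Sampled, normalized 1D Gaussian of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

def blur_separable(image, sigma, radius):
    """Gaussian blur as a horizontal 1D pass followed by a vertical 1D pass."""
    g = gaussian_1d(sigma, radius)
    rows = np.apply_along_axis(np.convolve, 1, image, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, g, mode="same")

g = gaussian_1d(sigma=2.0, radius=7)       # 15 taps
kernel_2d = np.outer(g, g)                 # the equivalent 15 x 15 kernel
ops_2d = kernel_2d.size                    # multiplications per pixel, full 2D kernel
ops_separable = 2 * g.size                 # multiplications per pixel, two 1D passes

# Blurring a single bright pixel should reproduce the 2D kernel around it.
impulse = np.zeros((31, 31))
impulse[15, 15] = 1.0
spread = blur_separable(impulse, sigma=2.0, radius=7)
```

Here `spread[8:23, 8:23]` matches `kernel_2d`, which shows that the two 1D passes are equivalent to the full 2D kernel at a fraction of the cost.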
[Interactive demo: Convolution Explorer. Select a kernel type to see the convolution result alongside the kernel weights.]

Why is the Gaussian filter preferred over the box filter for blurring?

Chapter 4: Nonlinear Filters

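Linear filters are powerful, but they have a fundamental limitation: a linear filter that smooths away noise also blurs edges, because it applies the same weights everywhere regardless of image content. A nonlinear filter can preserve edges while still smoothing noise.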

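The most important nonlinear filters include the bilateral filter, which weights each neighbor by both its distance and its intensity difference from the center pixel:

g(x) = ∑_y f(y) · G_{σs}(||x − y||) · G_{σr}(|f(x) − f(y)|)  /  W(x)

where G_{σs} is the spatial Gaussian (nearby pixels matter more) and G_{σr} is the range Gaussian (similar-valued pixels matter more). W(x) is a normalizing factor: the sum of the combined weights over all y, so that the weights add up to 1.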
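A brute-force sketch of the bilateral formula follows. It assumes a 2D float image, truncates the sum to a square window of the given radius, and replicates edge pixels at the border; the function name and parameter values are illustrative, and production libraries use much faster approximations.

```python
import numpy as np

def bilateral(image, sigma_s, sigma_r, radius):
    """Brute-force bilateral filter for a 2D float image."""
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")
    num = np.zeros((h, w), dtype=float)
    den = np.zeros((h, w), dtype=float)          # W(x), the sum of the weights
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbor f(y) for every pixel x at offset (dy, dx).
            shifted = padded[radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s**2))
            similarity = np.exp(-(shifted - image) ** 2 / (2.0 * sigma_r**2))
            weight = spatial * similarity
            num += weight * shifted
            den += weight
    return num / den

# A noisy step edge: dark left half, bright right half.
rng = np.random.default_rng(0)
step = np.zeros((32, 32))
step[:, 16:] = 1.0
noisy = step + rng.normal(0.0, 0.05, step.shape)
smoothed = bilateral(noisy, sigma_s=3.0, sigma_r=0.2, radius=6)
```

With σr = 0.2, pixels across the unit-height edge receive a range weight close to zero, so each side is averaged only with itself: the noise drops while the jump between columns 15 and 16 stays close to 1.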

Bilateral filtering is everywhere: It is the basis of many computational photography algorithms: HDR tone mapping, skin smoothing, flash/no-flash denoising. The idea of "only average pixels that look similar" is simple but extraordinarily effective.
[Interactive demo: Bilateral vs Gaussian. Compares Gaussian blur (which blurs edges) with bilateral filtering (which preserves them).]

How does the bilateral filter preserve edges while smoothing?

Chapter 5: The Fourier Transform

Every image can be decomposed into a sum of sinusoidal patterns at different frequencies and orientations. The Fourier transform converts an image from the spatial domain (pixel values at locations) to the frequency domain (amplitudes and phases of sinusoids).

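F(u, v) = ∑_{x,y} f(x, y) · e^{−j2π(ux/M + vy/N)}

where the image is M pixels wide and N pixels high, and (u, v) indexes the horizontal and vertical frequency.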

In the frequency domain, low frequencies correspond to smooth, slowly varying regions. High frequencies correspond to edges, textures, and noise. This decomposition is powerful because convolution in the spatial domain equals multiplication in the frequency domain.
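The split into low and high frequencies can be tried on a 1D signal. The sketch below builds a signal from one slow and one fast sinusoid, then zeroes out part of its spectrum with an ideal (hard) cutoff; the signal, the cutoff of 10, and the variable names are assumptions for this example.

```python
import numpy as np

n = 256
t = np.arange(n)
slow = np.sin(2 * np.pi * 3 * t / n)           # 3 cycles across the signal
fast = 0.5 * np.sin(2 * np.pi * 40 * t / n)    # 40 cycles across the signal
signal = slow + fast

cutoff = 10
spectrum = np.fft.rfft(signal)                 # one coefficient per frequency 0..n/2

low = spectrum.copy()
low[cutoff + 1:] = 0                           # low-pass: keep frequencies 0..cutoff
high = spectrum.copy()
high[:cutoff + 1] = 0                          # high-pass: keep frequencies above cutoff

low_passed = np.fft.irfft(low, n)
high_passed = np.fft.irfft(high, n)
```

Because both sinusoids complete a whole number of cycles, the low-pass output recovers `slow` and the high-pass output recovers `fast` up to floating-point error. On real signals a hard cutoff causes ringing, which is one reason smooth filters such as the Gaussian are usually preferred.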

The convolution theorem: f * g = F^{−1}(F(f) · F(g)), where F denotes the Fourier transform and F^{−1} its inverse. This means you can blur an image by multiplying its frequency spectrum by the Gaussian's spectrum. For large kernels, doing this with the FFT (Fast Fourier Transform) is faster than direct convolution.
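The theorem is easy to verify numerically in 1D. One caveat the sketch makes explicit: the discrete Fourier transform yields circular convolution, so both inputs are zero-padded to the full output length before multiplying. The signal and kernel here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random(64)                                 # arbitrary 1D signal
k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0     # small binomial blur kernel

direct = np.convolve(f, k)                         # full linear convolution

n = len(f) + len(k) - 1                            # pad so circular equals linear
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(k, n), n)
```

Both routes give the same 68 samples. For a 5-tap kernel the direct route is cheaper; the FFT route pays off as the kernel grows, since its cost does not depend on the kernel size.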

Practical applications:

[Interactive demo: Frequency Domain Filtering. A cutoff-frequency slider (initially 10) shows low-pass and high-pass filtering effects on a 1D signal.]
What does the convolution theorem tell us?

Chapter 6: Pyramids and Wavelets

Many vision tasks need to operate at multiple scales. A face might be 20 pixels wide in one image and 200 pixels in another. Image pyramids provide a multi-resolution representation that handles this elegantly.

The Gaussian pyramid is built by repeatedly blurring the image and downsampling it by a factor of 2. Each level has half the width and height of the one below, and so a quarter of the pixels.

The Laplacian pyramid stores the difference between each Gaussian level and an upsampled copy of the next coarser level, plus the coarsest Gaussian level itself so that the image can be rebuilt. Each level captures details at a specific scale. This is the basis of multi-resolution blending: you can seamlessly composite images by blending at each pyramid level independently.

Laplacian blending is magical: To seamlessly stitch two images, build their Laplacian pyramids, blend each level with a mask smoothed to match that level's scale, then collapse the result. Low-frequency content transitions gradually while high-frequency details stay sharp, so the result has no visible seam. This is the technique behind Burt and Adelson's 1983 "apple-orange" demo.
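The pyramid construction and the blending recipe can be sketched together. This version assumes square images whose side is a power of two, uses a 5-tap binomial kernel as the blur, and upsamples by pixel repetition followed by the same blur; all function names are local helpers, and these choices are simplifications rather than the exact filters of Burt and Adelson.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # 5-tap binomial, roughly Gaussian

def blur(image):
    """Separable 5-tap blur with reflected borders; shape is preserved."""
    padded = np.pad(image, 2, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, KERNEL, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, KERNEL, mode="valid")

def downsample(image):
    return blur(image)[::2, ::2]            # blur first, then keep every other pixel

def upsample(image):
    return blur(np.repeat(np.repeat(image, 2, axis=0), 2, axis=1))

def gaussian_pyramid(image, levels):
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def laplacian_pyramid(image, levels):
    gauss = gaussian_pyramid(image, levels)
    detail = [fine - upsample(coarse) for fine, coarse in zip(gauss, gauss[1:])]
    return detail + [gauss[-1]]             # keep the coarsest Gaussian level

def collapse(laplace):
    image = laplace[-1]
    for detail in reversed(laplace[:-1]):
        image = upsample(image) + detail    # undo one level at a time
    return image

def blend(a, b, mask, levels):
    """Take a where mask is 1 and b where mask is 0, blending level by level."""
    la, lb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    gm = gaussian_pyramid(mask, levels)     # the mask gets smoother at coarse levels
    return collapse([m * x + (1.0 - m) * y for x, y, m in zip(la, lb, gm)])

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = rng.random((64, 64))
mask = np.zeros((64, 64))
mask[:, :32] = 1.0                          # left half from a, right half from b
blended = blend(a, b, mask, levels=4)
rebuilt = collapse(laplacian_pyramid(a, 4))
```

Collapsing an unmodified Laplacian pyramid returns the original image, because each level stores exactly what the upsampled coarser level is missing. That property holds for any upsampling operator as long as the same one is used when building and collapsing.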
[Interactive demo: Gaussian Pyramid. Steps through the pyramid levels, starting at level 0 (full resolution); each level blurs and halves the resolution.]
What does the Laplacian pyramid store at each level?

Chapter 7: Geometric Warps

Sometimes you need to change the geometry of an image rather than its pixel values. Geometric transformations (from Chapter 2) move pixels to new locations: translation, rotation, scaling, affine, and projective warps.

The key question is: when a pixel lands between grid positions, how do you compute its value? This is the interpolation problem.

| Method | Quality | Speed |
|---|---|---|
| Nearest neighbor | Blocky, aliased | Fastest |
| Bilinear | Smooth, slight blurring | Fast |
| Bicubic | Sharp, minimal artifacts | Moderate |

In practice, you use inverse warping: for each output pixel, compute where it came from in the input, then interpolate. This avoids holes in the output.

Forward vs inverse warping: Forward warping maps each input pixel to the output, but multiple inputs can land on the same output pixel (collisions) and some output pixels may receive no input at all (holes). Inverse warping maps each output pixel back to the input, so every output pixel whose source falls inside the input image gets exactly one interpolated value. Prefer inverse warping whenever the inverse transform is available.
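Inverse warping with bilinear interpolation can be sketched as follows for an affine transform. The names `bilinear` and `warp_affine`, the 2×3 matrix convention (input coordinates to output coordinates), and the choice to clamp out-of-range coordinates to the border are assumptions for this example; real libraries offer several border modes.

```python
import numpy as np

def bilinear(image, x, y):
    """Sample a 2D image at float coordinates; out-of-range coordinates are clamped."""
    h, w = image.shape
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0                       # fractional position in the cell
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

def warp_affine(image, A, out_shape):
    """Inverse warp: A is a 2x3 matrix mapping input (x, y, 1) to output (x, y)."""
    M_inv = np.linalg.inv(np.vstack([A, [0.0, 0.0, 1.0]]))
    ys, xs = np.mgrid[0:out_shape[0], 0:out_shape[1]]   # every output pixel
    src_x = M_inv[0, 0] * xs + M_inv[0, 1] * ys + M_inv[0, 2]
    src_y = M_inv[1, 0] * xs + M_inv[1, 1] * ys + M_inv[1, 2]
    return bilinear(image, src_x, src_y)                # look up where it came from

image = np.array([[0.0, 1.0],
                  [2.0, 3.0]])
scale2 = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
zoomed = warp_affine(image, scale2, (4, 4))
```

The loop runs over output pixels, not input pixels, so `zoomed` is filled completely with no holes or collisions. Output pixel (row 1, column 1) maps back to the input location (0.5, 0.5), halfway between all four input pixels, and receives their average.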
[Interactive demo: Interpolation Comparison. Zooms into a small image patch using different interpolation methods.]

Why is inverse warping preferred over forward warping?

Chapter 8: Showcase — Image Processing Pipeline

Let's chain operations together into a full processing pipeline. Start with a noisy, low-contrast image and apply a sequence of operations to clean it up.

[Interactive demo: Processing Pipeline Builder. Toggle each processing step to see its cumulative effect on the image. Controls: noise level (initially 30), blur σ (initially 0), contrast (initially 1.0), sharpen (initially 0.0).]
Order matters: Denoise before sharpening (or you amplify noise). Enhance contrast after denoising (or you push noisy pixels to extremes). The typical pipeline is: denoise → white balance → contrast → sharpen → geometric correction.
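The effect of ordering can be measured on a synthetic image. The sketch below uses a 3×3 box blur as a stand-in denoiser, a linear stretch for contrast, and an unsharp mask for sharpening; these particular operators, the noise level, and the helper names are assumptions chosen to keep the example short.

```python
import numpy as np

def box_blur(image):
    """3 x 3 mean filter with edge replication, used here as the denoiser."""
    h, w = image.shape
    p = np.pad(image, 1, mode="edge")
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def stretch(image):
    """Linear contrast stretch to the range [0, 1]."""
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo)

def sharpen(image, amount=1.0):
    """Unsharp mask: add back the detail that blurring removes."""
    return image + amount * (image - box_blur(image))

# A noisy, low-contrast step edge.
rng = np.random.default_rng(0)
clean = np.full((64, 64), 0.4)
clean[:, 32:] = 0.6
noisy = clean + rng.normal(0.0, 0.03, clean.shape)

pipeline = sharpen(stretch(box_blur(noisy)))      # denoise -> contrast -> sharpen

# Noise measured in a flat region, away from the edge and the borders.
flat = (slice(None), slice(4, 28))
noise_in = noisy[flat].std()
noise_sharpen_only = sharpen(noisy)[flat].std()
noise_denoise_then_sharpen = sharpen(box_blur(noisy))[flat].std()
```

Sharpening the raw image raises the noise level in the flat region above that of the input, while denoising first leaves it below the input level even after sharpening.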

Chapter 9: Connections

Image processing tools underpin nearly every technique in the rest of this book. Here is the map:

| Concept | Used In |
|---|---|
| Convolution / filtering | Ch 5 (CNNs), Ch 7 (Feature detection), Ch 9 (Optical flow) |
| Gaussian blur | Ch 7 (Scale space for SIFT), Ch 8 (Image stitching blending) |
| Image pyramids | Ch 8 (Multi-scale stitching), Ch 9 (Coarse-to-fine flow), Ch 10 (Super-resolution) |
| Fourier transforms | Ch 10 (Deconvolution), Ch 12 (Frequency-based stereo) |
| Geometric warps | Ch 8 (Image alignment), Ch 9 (Motion compensation), Ch 14 (View synthesis) |
| Histogram methods | Ch 10 (HDR tone mapping), Ch 6 (Feature histograms like HOG) |
| Bilateral filtering | Ch 10 (HDR), Ch 12 (Stereo cost filtering) |

Szeliski's observation: "Image processing operations are so fundamental that most vision libraries (OpenCV, PIL, scikit-image) implement them as optimized primitives. Understanding what they do is essential; reimplementing them from scratch usually is not."
Which image processing technique is the foundation of convolutional neural networks?