Image Processing — Szeliski, Chapter 3

Chapter 0: Why Image Processing?

You have an image — a grid of pixel values produced by the formation pipeline from Chapter 2. Now what? Before any high-level understanding can happen (recognition, 3D reconstruction, tracking), the raw pixel data usually needs to be cleaned up, enhanced, or transformed.

Image processing is the set of operations that take an image as input and produce a modified image as output. Brighten a dark photo? That is a point operator. Blur out noise? That is a linear filter. Sharpen edges? That is another filter. Resize or rotate the image? That is a geometric warp.

These operations are the building blocks for everything that follows in computer vision. Edge detection, feature matching, image stitching, super-resolution — all of them rely on the tools in this chapter.

The processing pipeline: Raw image → noise reduction → contrast enhancement → edge sharpening → geometric correction. Each step uses a different class of operator, and understanding them lets you build any pipeline you need.

Image Processing Operations

Click each operation to see its effect on a synthetic image.

Why is image processing typically the first step before higher-level vision tasks?

Because raw images are always black and white
Because raw pixel data needs cleaning, enhancement, and transformation before algorithms can reliably extract information
Because cameras always produce corrupted images

Chapter 1: Point Operators

The simplest image processing operations work on individual pixels independently. A point operator transforms each pixel value without looking at its neighbors:

g(x, y) = h(f(x, y))

where f is the input image, g is the output, and h is some function applied to each pixel.

Common point operators include:

Brightness adjustment: g = f + b (shift all values up or down)
Contrast adjustment: g = a · f (scale values around zero)
Gamma correction: g = f^γ, with f normalized to [0, 1] (nonlinear remapping to match display response; γ < 1 brightens midtones, γ > 1 darkens them)
Thresholding: g = 1 if f > t, else 0 (binarize the image)
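
The four operators above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: it assumes pixel values normalized to [0, 1], and the helper names (clamp01, apply_point_op, and so on) and the clamping step are choices made for this example only.

```python
def clamp01(v):
    """Clip a value to the displayable range [0, 1] (an assumption of this sketch)."""
    return max(0.0, min(1.0, v))

def brightness(f, b):
    return clamp01(f + b)          # g = f + b

def contrast(f, a):
    return clamp01(a * f)          # g = a * f

def gamma_correct(f, g):
    return clamp01(f) ** g         # g = f^gamma

def threshold(f, t):
    return 1.0 if f > t else 0.0   # binarize

def apply_point_op(image, h):
    """Apply h to every pixel independently: g(x, y) = h(f(x, y))."""
    return [[h(p) for p in row] for row in image]

image = [[0.0, 0.25], [0.5, 1.0]]
brighter = apply_point_op(image, lambda p: brightness(p, 0.25))
binary = apply_point_op(image, lambda p: threshold(p, 0.4))
```

Because h never looks at neighboring pixels, any of these functions can also be precomputed once as a lookup table over the possible pixel values.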

Gamma correction matters: Cameras and displays have nonlinear responses. A pixel value of 128 is not half as bright as 255: on a typical display with a gamma of about 2.2, it emits only about 22% as much light. Gamma correction compensates by encoding values with the inverse exponent (about 1/2.2 ≈ 0.45), so that the combined camera-to-display response is roughly linear. Without it, midtones and shadows look too dark.

Point Operator Explorer

Adjust brightness, contrast, and gamma to see their effect on a gradient.

What makes a point operator different from other image operations?

It operates on each pixel independently, without looking at neighbors
It only works on single-channel images
It always increases brightness

Chapter 2: Histograms and Equalization

A histogram counts how many pixels have each intensity value. It tells you at a glance whether an image is dark (values clustered at the low end), bright (clustered high), or low-contrast (narrow spread).

Histogram equalization redistributes pixel values so they span the full range uniformly. The algorithm is simple: compute the cumulative distribution function (CDF) of the histogram, then use it as a lookup table to remap each pixel.

g(x,y) = CDF(f(x,y)) × (L − 1)

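where CDF is the cumulative histogram normalized to [0, 1] and L is the number of intensity levels (256 for 8-bit images). This spreads the crowded part of the histogram across the full range, whether the image is too dark or overexposed, automatically improving contrast.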
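
Here is a minimal sketch of that lookup-table approach, assuming an 8-bit grayscale image stored as a list of rows of integers. The function name equalize and the rounding rule are illustrative choices, not taken from any library.

```python
def equalize(image, levels=256):
    """Histogram equalization: remap each pixel through the normalized CDF."""
    pixels = [p for row in image for p in row]

    # Histogram: how many pixels have each intensity value
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution function, normalized to [0, 1]
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running / len(pixels))

    # Lookup table: g = CDF(f) * (L - 1)
    lut = [round(c * (levels - 1)) for c in cdf]
    return [[lut[p] for p in row] for row in image]

# A dark image: all values sit in the range 10..40
dark = [[10, 10, 20, 20], [20, 30, 30, 40]]
equalized = equalize(dark)
```

After equalization the brightest input value maps to 255, and the ordering of intensities is preserved because the CDF never decreases.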

Adaptive equalization: Global equalization can over-amplify noise in uniform regions. CLAHE (Contrast Limited Adaptive Histogram Equalization) applies equalization locally in tiles, with a clipping limit to prevent noise amplification. It is widely used in medical imaging.

Histogram Equalizer

The left shows the original dark image and its histogram. Click Equalize to redistribute values.

What does histogram equalization achieve?

It removes noise from the image
It redistributes pixel intensities so they span the full range uniformly, improving contrast
It converts the image to grayscale

Chapter 3: Linear Filtering

Unlike point operators, a linear filter computes each output pixel from a weighted combination of its neighbors. The weights are stored in a small grid called a kernel (or filter mask), and the operation of sliding this kernel across the image is called convolution.

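g(x, y) = ∑_{i,j} f(x−i, y−j) · k(i, j)

Writing f(x+i, y+j) instead gives cross-correlation, which skips the kernel flip. The two are identical for symmetric kernels such as the box and Gaussian filters.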

Key kernels and what they do:

Kernel	Effect	Size
Box filter	Averages all neighbors equally → blur	Any
Gaussian	Weighted average, more weight to center → smooth blur	Depends on σ
Sobel	Approximates derivative → edge detection	3×3
Laplacian	Second derivative → edge + blob detection	3×3
Sharpen	Enhances high frequencies → crisper edges	3×3
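
To make the sliding-kernel idea concrete, here is a small, unoptimized sketch in plain Python. It flips the kernel, which is the textbook definition of convolution (without the flip the same loop computes cross-correlation, and the two agree for symmetric kernels). The border rule (replicating edge pixels) and the names convolve2d, BOX, and SOBEL_X are assumptions of this example.

```python
def convolve2d(image, kernel):
    """2D convolution with edge replication at the borders."""
    h, w = len(image), len(image[0])
    ry, rx = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(-ry, ry + 1):
                for i in range(-rx, rx + 1):
                    # Convolution reads f(x - i, y - j); clamp to replicate edges
                    yy = min(max(y - j, 0), h - 1)
                    xx = min(max(x - i, 0), w - 1)
                    acc += image[yy][xx] * kernel[j + ry][i + rx]
            out[y][x] = acc
    return out

BOX = [[1 / 9] * 3 for _ in range(3)]            # equal-weight average
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal derivative

# Vertical step edge: dark on the left, bright on the right
step = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
blurred = convolve2d(step, BOX)
edges = convolve2d(step, SOBEL_X)
```

The Sobel response is zero in the flat regions and nonzero only in the two columns that straddle the edge, while the box filter turns the hard step into a ramp.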

Separability saves computation: A 2D Gaussian is the product of two 1D Gaussians. Instead of applying an N×N kernel (N² multiplications per pixel), you can apply a 1×N horizontal pass followed by an N×1 vertical pass (2N multiplications). For a 15×15 kernel, that is 30 vs 225 operations per pixel.
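
The sketch below checks this claim numerically: it filters a small test image once with the full 2D Gaussian and once with two 1D passes, then compares the results. Plain Python lists are used to keep it dependency-free; all function names and the replicated-border rule are assumptions of this example.

```python
import math

def gaussian_1d(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def filter_rows(image, k):
    """Apply a 1D kernel along each row (replicated borders)."""
    r, w = len(k) // 2, len(image[0])
    return [[sum(row[min(max(x + i, 0), w - 1)] * k[i + r] for i in range(-r, r + 1))
             for x in range(w)] for row in image]

def transpose(image):
    return [list(col) for col in zip(*image)]

def separable_filter(image, k):
    """Horizontal 1xN pass followed by a vertical Nx1 pass."""
    horizontal = filter_rows(image, k)
    return transpose(filter_rows(transpose(horizontal), k))

def full_filter(image, kernel):
    """Direct NxN filtering, for comparison."""
    r, h, w = len(kernel) // 2, len(image), len(image[0])
    return [[sum(image[min(max(y + j, 0), h - 1)][min(max(x + i, 0), w - 1)] * kernel[j + r][i + r]
                 for j in range(-r, r + 1) for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

k1 = gaussian_1d(sigma=1.0, radius=2)
k2 = [[a * b for b in k1] for a in k1]   # 2D Gaussian = outer product of two 1D Gaussians

image = [[(x * y) % 7 for x in range(7)] for y in range(6)]
two_pass = separable_filter(image, k1)
one_pass = full_filter(image, k2)
max_diff = max(abs(a - b) for ra, rb in zip(two_pass, one_pass) for a, b in zip(ra, rb))

n = 15
ops_direct, ops_separable = n * n, 2 * n   # multiplications per pixel
```

The two outputs agree up to floating-point rounding, so the saving in multiplications costs nothing in accuracy.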

Convolution Explorer

Select a kernel type and see the convolution result. The kernel weights are shown on the right.

Why is the Gaussian filter preferred over the box filter for blurring?

The box filter is slower
The Gaussian gives more weight to nearby pixels, producing smoother results without ringing artifacts, and it is separable
The box filter only works on square images

Chapter 4: Nonlinear Filters

Linear filters are powerful, but linear smoothing has a fundamental limitation: it blurs edges along with noise. A nonlinear filter can preserve edges while still smoothing noise.

The most important nonlinear filters:

Median filter: Replace each pixel with the median of its neighbors. Excellent at removing salt-and-pepper noise while preserving edges. Not separable, not linear.
Bilateral filter: Like a Gaussian, but with a second weight based on intensity difference. Pixels that are similar in value get more weight; pixels across an edge get less. This smooths flat regions while keeping edges sharp.

g(x) = ∑_y f(y) · G_σs(||x−y||) · G_σr(|f(x)−f(y)|) / W(x)

where G_σs is the spatial Gaussian (nearby pixels matter more) and G_σr is the range Gaussian (similar-valued pixels matter more). W(x) is a normalizing factor: the sum of G_σs · G_σr over all y, so that the weights add up to 1.
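
Both filters can be sketched on a 1D signal, which keeps the code short; the 2D versions loop over a square window instead of an interval. The function names, window radius, and test signals below are assumptions made for illustration.

```python
import math

def median_filter(signal, radius=1):
    """Replace each sample with the median of its window (replicated borders)."""
    n = len(signal)
    out = []
    for x in range(n):
        window = sorted(signal[min(max(x + i, 0), n - 1)] for i in range(-radius, radius + 1))
        out.append(window[len(window) // 2])
    return out

def gauss(d, sigma):
    return math.exp(-(d * d) / (2 * sigma * sigma))

def bilateral_filter(signal, radius, sigma_s, sigma_r):
    """g(x) = sum_y f(y) * G_s(|x - y|) * G_r(|f(x) - f(y)|) / W(x)."""
    n = len(signal)
    out = []
    for x in range(n):
        num, norm = 0.0, 0.0   # norm accumulates W(x)
        for y in range(max(0, x - radius), min(n, x + radius + 1)):
            w = gauss(x - y, sigma_s) * gauss(signal[x] - signal[y], sigma_r)
            num += w * signal[y]
            norm += w
        out.append(num / norm)
    return out

# Step edge with one salt-and-pepper spike at index 2
spiky = [0, 0, 9, 0, 0, 10, 10, 10, 10, 10]
despiked = median_filter(spiky)

# Noisy step edge between index 4 and index 5
noisy_step = [0.1, -0.1, 0.1, -0.1, 0.1, 9.9, 10.1, 9.9, 10.1, 9.9]
edge_preserving = bilateral_filter(noisy_step, radius=2, sigma_s=2.0, sigma_r=1.0)
# A huge sigma_r makes every range weight about 1, leaving a plain Gaussian blur
gaussian_like = bilateral_filter(noisy_step, radius=2, sigma_s=2.0, sigma_r=1e9)
```

Next to the edge, the bilateral result stays close to 0 on one side and close to 10 on the other, while the Gaussian-like result drifts toward the middle.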

Bilateral filtering is everywhere: It is the basis of many computational photography algorithms: HDR tone mapping, skin smoothing, flash/no-flash denoising. The idea of "only average pixels that look similar" is simple but extraordinarily effective.

Bilateral vs Gaussian

Compare Gaussian blur (which blurs edges) with bilateral filtering (which preserves them).

How does the bilateral filter preserve edges while smoothing?

It detects edges first and skips them
It uses two weighting functions: spatial distance AND intensity similarity, so pixels across edges get very low weight
It uses a smaller kernel near edges

Chapter 5: The Fourier Transform

Every image can be decomposed into a sum of sinusoidal patterns at different frequencies and orientations. The Fourier transform converts an image from the spatial domain (pixel values at locations) to the frequency domain (amplitudes and phases of sinusoids).

F(u, v) = ∑_x,y f(x, y) · e^{−j2π(ux/M + vy/N)}

In the frequency domain, low frequencies correspond to smooth, slowly varying regions. High frequencies correspond to edges, textures, and noise. This decomposition is powerful because convolution in the spatial domain equals multiplication in the frequency domain.
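
A 1D analogue of the formula above makes the idea easy to check. This sketch uses a direct O(N²) sum rather than an FFT, and the test signal (a strong 3-cycle sinusoid plus a weaker 10-cycle one) is an assumption chosen so the answer is known in advance.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform: F(u) = sum_x f(x) * exp(-j 2 pi u x / N)."""
    n = len(signal)
    return [sum(signal[x] * cmath.exp(-2j * cmath.pi * u * x / n) for x in range(n))
            for u in range(n)]

n = 32
signal = [math.sin(2 * math.pi * 3 * x / n) + 0.5 * math.sin(2 * math.pi * 10 * x / n)
          for x in range(n)]

magnitudes = [abs(c) for c in dft(signal)]
# Indices of the two strongest components in the first half of the spectrum
strongest = sorted(range(n // 2), key=lambda u: magnitudes[u], reverse=True)[:2]
```

The spectrum is essentially zero everywhere except at frequencies 3 and 10 (and their mirror images in the second half), with the slow component twice as strong as the fast one.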

The convolution theorem: f * g = F⁻¹(F(f) · F(g)). This means you can blur an image by multiplying its frequency spectrum by the Gaussian's spectrum. For large kernels, doing this with the FFT (Fast Fourier Transform) is faster than direct convolution.
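
The theorem can be verified numerically on a short 1D signal. Two caveats for this sketch: a naive O(N²) DFT stands in for the FFT, so it shows the equivalence but not the speedup, and the discrete theorem holds for circular convolution, where indices wrap around. The helper names and example values are assumptions.

```python
import cmath

def dft(x, inverse=False):
    """Naive DFT or inverse DFT (the inverse divides by N)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * u * k / n) for k in range(n))
           for u in range(n)]
    return [v / n for v in out] if inverse else out

def circular_convolve(f, g):
    """Direct convolution with wrap-around indexing."""
    n = len(f)
    return [sum(f[(x - i) % n] * g[i] for i in range(n)) for x in range(n)]

f = [1, 2, 3, 4, 0, 0, 0, 0]
g = [0.5, 0.25, 0, 0, 0, 0, 0, 0.25]   # small blur kernel, centered on index 0

direct = circular_convolve(f, g)
product = [a * b for a, b in zip(dft(f), dft(g))]        # multiply the spectra
via_spectrum = [v.real for v in dft(product, inverse=True)]
```

Both routes give the same samples up to rounding. In practice, zero-padding the signal and kernel avoids the wrap-around when ordinary (linear) convolution is wanted.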

Practical applications:

Low-pass filtering: Zero out high frequencies → blur
High-pass filtering: Zero out low frequencies → edge detection
Band-pass filtering: Keep only a range of frequencies → texture analysis
Noise removal: If noise has a known frequency (like periodic patterns), remove just that frequency
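
The low-pass and high-pass cases can be sketched on a 1D signal by zeroing spectrum entries directly. This uses an ideal (hard) cutoff and a naive DFT for brevity; the function names and the test signal are assumptions of the example.

```python
import cmath
import math

def dft(x, inverse=False):
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * u * k / n) for k in range(n))
           for u in range(n)]
    return [v / n for v in out] if inverse else out

def low_pass(signal, cutoff):
    """Keep only frequencies at or below the cutoff."""
    n = len(signal)
    spectrum = dft(signal)
    # Indices u and n - u hold the same frequency (positive and negative halves)
    kept = [c if min(u, n - u) <= cutoff else 0 for u, c in enumerate(spectrum)]
    return [v.real for v in dft(kept, inverse=True)]

def high_pass(signal, cutoff):
    """Zeroing the low frequencies equals subtracting the low-pass result."""
    return [s - l for s, l in zip(signal, low_pass(signal, cutoff))]

n = 64
slow = [math.sin(2 * math.pi * 2 * x / n) for x in range(n)]          # 2 cycles
fast = [0.3 * math.sin(2 * math.pi * 20 * x / n) for x in range(n)]   # 20 cycles
mixed = [a + b for a, b in zip(slow, fast)]

smooth = low_pass(mixed, cutoff=10)
detail = high_pass(mixed, cutoff=10)
```

With the cutoff at 10, the low-pass output recovers the slow sinusoid and the high-pass output recovers the fast one. A hard cutoff like this can cause ringing on real images, which is one reason smooth filters such as the Gaussian are usually preferred.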

Frequency Domain Filtering

Drag the cutoff frequency to see low-pass and high-pass filtering effects on a 1D signal.

What does the convolution theorem tell us?

Convolution in the spatial domain is equivalent to multiplication in the frequency domain, enabling faster filtering via FFT
All images can be perfectly reconstructed from their Fourier transform
The Fourier transform is always real-valued

Chapter 6: Pyramids and Wavelets

Many vision tasks need to operate at multiple scales. A face might be 20 pixels wide in one image and 200 pixels in another. Image pyramids provide a multi-resolution representation that handles this elegantly.

The Gaussian pyramid is built by repeatedly blurring and downsampling the image by a factor of 2. Each level has half the width and height of the one below it.

The Laplacian pyramid stores the difference between each Gaussian level and an upsampled copy of the next coarser level. Each level captures details at a specific scale. This is the basis of multi-resolution blending: you can seamlessly composite images by blending at each pyramid level independently.
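
Here is a 1D sketch of both pyramids. It assumes a 5-tap binomial kernel as a stand-in for the Gaussian and linear interpolation for upsampling; these choices and all function names are illustrative. Each Laplacian level is computed against the upsampled coarser level, which is what makes exact reconstruction possible.

```python
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]   # binomial stand-in for a Gaussian

def blur(s):
    n = len(s)
    return [sum(s[min(max(x + i - 2, 0), n - 1)] * KERNEL[i] for i in range(5))
            for x in range(n)]

def downsample(s):
    """Blur, then keep every other sample."""
    return blur(s)[::2]

def upsample(s, size):
    """Linearly interpolate s back up to `size` samples."""
    out = []
    for x in range(size):
        i, t = x // 2, (x % 2) / 2
        j = min(i + 1, len(s) - 1)
        out.append((1 - t) * s[i] + t * s[j])
    return out

def gaussian_pyramid(s, levels):
    pyramid = [list(s)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def laplacian_pyramid(s, levels):
    g = gaussian_pyramid(s, levels)
    lap = [[a - b for a, b in zip(g[k], upsample(g[k + 1], len(g[k])))]
           for k in range(levels - 1)]
    return lap + [g[-1]]   # the coarsest level keeps the low-pass residual

def collapse(lap):
    """Rebuild the signal by adding details back from coarse to fine."""
    s = lap[-1]
    for detail in reversed(lap[:-1]):
        s = [d + u for d, u in zip(detail, upsample(s, len(detail)))]
    return s

signal = [float((x * 7) % 11) for x in range(16)]
gp = gaussian_pyramid(signal, 3)
restored = collapse(laplacian_pyramid(signal, 3))
```

Collapsing the Laplacian pyramid returns the original signal up to floating-point rounding, so the representation loses no information.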

Laplacian blending is magical: To seamlessly stitch two images, build their Laplacian pyramids, blend each level with a smooth mask, then collapse. Low-frequency transitions are gradual; high-frequency details are sharp. The result has no visible seam. This is the technique behind Burt and Adelson's 1983 "apple-orange" demo.
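
The build, blend, and collapse steps can be sketched in 1D. To keep the result easy to verify, the two "images" here are flat signals (bright and dark) joined by a hard mask; the binomial kernel, linear upsampling, and every function name are assumptions of this example rather than the original implementation.

```python
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]   # binomial stand-in for a Gaussian

def blur(s):
    n = len(s)
    return [sum(s[min(max(x + i - 2, 0), n - 1)] * KERNEL[i] for i in range(5))
            for x in range(n)]

def upsample(s, size):
    out = []
    for x in range(size):
        i, t = x // 2, (x % 2) / 2
        j = min(i + 1, len(s) - 1)
        out.append((1 - t) * s[i] + t * s[j])
    return out

def gaussian_pyramid(s, levels):
    pyramid = [list(s)]
    for _ in range(levels - 1):
        pyramid.append(blur(pyramid[-1])[::2])
    return pyramid

def laplacian_pyramid(s, levels):
    g = gaussian_pyramid(s, levels)
    lap = [[a - b for a, b in zip(g[k], upsample(g[k + 1], len(g[k])))]
           for k in range(levels - 1)]
    return lap + [g[-1]]

def collapse(lap):
    s = lap[-1]
    for detail in reversed(lap[:-1]):
        s = [d + u for d, u in zip(detail, upsample(s, len(detail)))]
    return s

def pyramid_blend(a, b, mask, levels):
    la, lb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    gm = gaussian_pyramid(mask, levels)   # the mask gets smoother at coarser levels
    blended = [[m * x + (1 - m) * y for x, y, m in zip(pa, pb, pm)]
               for pa, pb, pm in zip(la, lb, gm)]
    return collapse(blended)

a = [1.0] * 16                  # bright signal
b = [0.0] * 16                  # dark signal
mask = [1.0] * 8 + [0.0] * 8    # take the left half from a, the right half from b

seamless = pyramid_blend(a, b, mask, levels=3)
hard_cut = [m * x + (1 - m) * y for x, y, m in zip(a, b, mask)]

def largest_jump(s):
    return max(abs(p - q) for p, q in zip(s, s[1:]))
```

The hard cut jumps from 1 to 0 between two neighboring samples, while the pyramid blend spreads the same transition over many samples because the coarse levels are mixed with a smoothed mask.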

Gaussian Pyramid

Watch the image shrink through pyramid levels. Each level blurs and halves the resolution.

What does the Laplacian pyramid store at each level?

A blurred copy of the image
The difference between consecutive Gaussian pyramid levels, capturing details at each scale
The edge map at each resolution

Chapter 7: Geometric Warps

Sometimes you need to change the geometry of an image rather than its pixel values. Geometric transformations (from Chapter 2) move pixels to new locations: translation, rotation, scaling, affine, and projective warps.

The key question is: when a pixel lands between grid positions, how do you compute its value? This is the interpolation problem.

Method	Quality	Speed
Nearest neighbor	Blocky, aliased	Fastest
Bilinear	Smooth, slight blurring	Fast
Bicubic	Sharp, minimal artifacts	Moderate
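
The first two methods in the table are short enough to sketch directly. This example assumes a grayscale image stored as a list of rows, with coordinates clamped to the image bounds; bicubic is omitted because it needs a 4×4 neighborhood and a cubic weighting function. The function names are illustrative.

```python
import math

def nearest(image, x, y):
    """Take the value of the closest grid position."""
    h, w = len(image), len(image[0])
    xi = min(max(math.floor(x + 0.5), 0), w - 1)
    yi = min(max(math.floor(y + 0.5), 0), h - 1)
    return image[yi][xi]

def bilinear(image, x, y):
    """Blend the four surrounding pixels, weighted by distance."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * image[y0][x0] + tx * image[y0][x1]
    bottom = (1 - tx) * image[y1][x0] + tx * image[y1][x1]
    return (1 - ty) * top + ty * bottom

patch = [[0, 10],
         [20, 30]]

center_nearest = nearest(patch, 0.4, 0.6)    # snaps to the pixel at column 0, row 1
center_bilinear = bilinear(patch, 0.5, 0.5)  # average of all four pixels
```

Nearest neighbor can only ever return one of the original pixel values, which is where the blocky look comes from; bilinear produces in-between values and therefore a smoother result.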

In practice, you use inverse warping: for each output pixel, compute where it came from in the input, then interpolate. This avoids holes in the output.

Forward vs inverse warping: Forward warping maps each input pixel to the output, but multiple inputs can land on the same output pixel (collision) or no input may land on certain outputs (holes). Inverse warping maps each output pixel back to the input, guaranteeing every output pixel gets a value. Always use inverse warping.
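
The difference is easy to see when upscaling a 2×2 image to 4×4. The sketch below assumes nearest-neighbor sampling and a pixel-center coordinate convention (pixel i covers the interval from i − 0.5 to i + 0.5); the function names and the None marker for holes are choices made for this example.

```python
import math

def forward_warp(image, forward_map, out_w, out_h):
    """Push every input pixel to its output location; unvisited outputs stay None."""
    out = [[None] * out_w for _ in range(out_h)]
    for y, row in enumerate(image):
        for x, p in enumerate(row):
            u, v = forward_map(x, y)
            ui, vi = math.floor(u + 0.5), math.floor(v + 0.5)
            if 0 <= ui < out_w and 0 <= vi < out_h:
                out[vi][ui] = p
    return out

def inverse_warp(image, inverse_map, out_w, out_h, fill=0):
    """For every output pixel, look up where it came from in the input."""
    h, w = len(image), len(image[0])
    out = [[fill] * out_w for _ in range(out_h)]
    for v in range(out_h):
        for u in range(out_w):
            x, y = inverse_map(u, v)
            xi, yi = math.floor(x + 0.5), math.floor(y + 0.5)   # nearest neighbor
            if 0 <= xi < w and 0 <= yi < h:
                out[v][u] = image[yi][xi]
    return out

s = 2   # scale factor
def scale(x, y):
    return ((x + 0.5) * s - 0.5, (y + 0.5) * s - 0.5)

def unscale(u, v):
    return ((u + 0.5) / s - 0.5, (v + 0.5) / s - 0.5)

small = [[1, 2],
         [3, 4]]

pushed = forward_warp(small, scale, 4, 4)
pulled = inverse_warp(small, unscale, 4, 4)
holes = sum(p is None for row in pushed for p in row)
```

Forward warping fills only 4 of the 16 output pixels, while inverse warping fills all of them. Output pixels that map outside the input still need a border rule, which is what the fill argument handles here.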

Interpolation Comparison

Zoom into a small image patch using different interpolation methods.

Why is inverse warping preferred over forward warping?

It guarantees every output pixel gets a value by mapping from output back to input, avoiding holes
It is faster
It does not require interpolation

Chapter 8: Showcase — Image Processing Pipeline

Let's chain operations together into a full processing pipeline. Start with a noisy, low-contrast image and apply a sequence of operations to clean it up.

Processing Pipeline Builder

Toggle each processing step to see its cumulative effect on the image.

Order matters: Denoise before sharpening (or you amplify noise). Enhance contrast after denoising (or you push noisy pixels to extremes). The typical pipeline is: denoise → white balance → contrast → sharpen → geometric correction.
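
The first rule can be checked numerically on a flat, noisy 1D signal. In this sketch a box blur stands in for the denoiser and unsharp masking for the sharpening step; the noise level, the filter sizes, and the fixed random seed are assumptions chosen for the example.

```python
import random
import statistics

def box_blur(signal, radius):
    """Simple denoiser: average over a window (replicated borders)."""
    n = len(signal)
    size = 2 * radius + 1
    return [sum(signal[min(max(x + i, 0), n - 1)] for i in range(-radius, radius + 1)) / size
            for x in range(n)]

def sharpen(signal, amount):
    """Unsharp masking: add back the detail removed by a small blur."""
    blurred = box_blur(signal, 1)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]

rng = random.Random(0)   # fixed seed so the run is repeatable
noisy = [0.5 + rng.gauss(0, 0.1) for _ in range(500)]   # flat gray plus noise

sharpen_only = sharpen(noisy, 1.0)
denoise_then_sharpen = sharpen(box_blur(noisy, 2), 1.0)

# The true signal is flat, so the standard deviation measures the remaining noise
noise_in = statistics.pstdev(noisy)
noise_sharpened = statistics.pstdev(sharpen_only)
noise_pipeline = statistics.pstdev(denoise_then_sharpen)
```

Sharpening the raw signal increases the noise level, while denoising first leaves the sharpened result with less noise than the input.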

Chapter 9: Connections

Image processing tools underpin nearly every technique in the rest of this book. Here is the map:

Concept	Used In
Convolution / filtering	Ch 5 (CNNs), Ch 7 (Feature detection), Ch 9 (Optical flow)
Gaussian blur	Ch 7 (Scale space for SIFT), Ch 8 (Image stitching blending)
Image pyramids	Ch 8 (Multi-scale stitching), Ch 9 (Coarse-to-fine flow), Ch 10 (Super-resolution)
Fourier transforms	Ch 10 (Deconvolution), Ch 12 (Frequency-based stereo)
Geometric warps	Ch 8 (Image alignment), Ch 9 (Motion compensation), Ch 14 (View synthesis)
Histogram methods	Ch 10 (HDR tone mapping), Ch 6 (Feature histograms like HOG)
Bilateral filtering	Ch 10 (HDR), Ch 12 (Stereo cost filtering)

Szeliski's observation: "Image processing operations are so fundamental that most vision libraries (OpenCV, PIL, scikit-image) implement them as optimized primitives. Understanding what they do is essential; reimplementing them from scratch usually is not."

Which image processing technique is the foundation of convolutional neural networks?

Convolution: CNNs learn their filter kernels from data instead of using hand-designed ones
Histogram equalization
Geometric warping