Szeliski, Chapter 7

Feature Detection and Matching

Finding distinctive points, describing them, and matching them across images: corners, SIFT, edges, lines, and segmentation.

Prerequisites: Chapter 3 (image processing, gradients), Chapter 4 (optimization) helpful.

Chapter 0: Why Features?

Imagine you have two photos of the same building taken from different angles. How do you figure out how they relate? You cannot compare raw pixels — the viewpoints are different. Instead, you find distinctive points that appear in both images: a window corner, a door handle, a rooftop edge. These are features.

Feature detection and matching is a three-step pipeline that underpins most of classical computer vision:

Detect: Find interesting points in each image (corners, blobs, edges)
Describe: Build a compact descriptor for each point that captures its local appearance
Match: Find corresponding descriptors across images

Features are the bridge between pixels and geometry. Raw pixels are fragile — they change with lighting, viewpoint, and scale. Good features are invariant: they can be detected reliably regardless of these changes. This makes it possible to stitch panoramas, reconstruct 3D scenes, track objects, and recognize places.

Feature Detection Pipeline

The three stages of feature-based matching: detect, describe, match.

Why can't we simply compare raw pixel values to match two images of the same scene?

Pixel values change with viewpoint, lighting, and scale, so the same scene point has different pixel values in different images
Pixels are too large
Computers cannot read pixel values

Chapter 1: Harris Corner Detector

What makes a good feature point? Consider sliding a small window across the image:

Flat region: Moving the window in any direction shows no change. Not distinctive at all.
Edge: Moving along the edge shows no change, but moving perpendicular does. Only one direction is distinctive.
Corner: Moving in any direction shows significant change. This is distinctive!

The Harris detector formalizes this intuition. It computes the structure tensor (second moment matrix) from image gradients:

$$M = \sum_{w} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

where I_x and I_y are image gradients. The eigenvalues λ₁, λ₂ of M tell the story:

Eigenvalues	Interpretation	Feature?
λ₁ ≈ 0, λ₂ ≈ 0	Flat region (no gradient)	No
λ₁ ≫ 0, λ₂ ≈ 0	Edge (gradient in one direction)	No
λ₁ ≫ 0, λ₂ ≫ 0	Corner (gradients in two directions)	Yes!

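Harris corner response: Rather than computing the eigenvalues explicitly, Harris uses a shortcut that needs only the determinant and trace of M: R = det(M) − k · trace(M)² = λ₁λ₂ − k(λ₁ + λ₂)², where k is an empirical constant, typically 0.04 to 0.06. When R is large and positive, both eigenvalues are large → corner. When R is negative, one eigenvalue dominates → edge. When |R| is small → flat.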
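
The response can be sketched in a few lines of NumPy. This is a minimal illustration rather than a tuned detector: the helper name `box_sum`, the unweighted 3×3 window, and k = 0.05 are assumptions made for the example (practical implementations usually weight the window with a Gaussian and add non-maximum suppression).

```python
import numpy as np

def box_sum(a, radius):
    """Sum each pixel's (2r+1)x(2r+1) neighbourhood, zero-padded at the borders."""
    padded = np.pad(a, radius)
    out = np.zeros_like(a, dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(image, k=0.05, radius=1):
    """Harris response R = det(M) - k * trace(M)^2 at every pixel."""
    image = np.asarray(image, dtype=float)
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    iy, ix = np.gradient(image)
    # Entries of the structure tensor M, summed over the window w.
    sxx = box_sum(ix * ix, radius)
    syy = box_sum(iy * iy, radius)
    sxy = box_sum(ix * iy, radius)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2

# Synthetic image: a bright quadrant whose corner sits at row 8, column 8.
img = np.zeros((20, 20))
img[8:, 8:] = 1.0
r = harris_response(img)

print("corner (8, 8):", r[8, 8])    # large and positive
print("edge  (15, 8):", r[15, 8])   # negative
print("flat   (2, 2):", r[2, 2])    # zero
```

On this synthetic image the strongest response lands on the corner of the quadrant, the response along the straight boundary is negative, and the flat background gives zero, matching the three cases above.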

Corner vs. Edge vs. Flat

Slide a window across different image regions. Watch how the eigenvalue plot changes.

What do the eigenvalues of the structure tensor tell us about a pixel's neighborhood?

Both eigenvalues large means strong gradients in two directions — a corner. One large means an edge. Both small means flat.
They measure the brightness of the pixel
They tell you the color of the pixel

Chapter 2: Scale Invariance

Harris corners are invariant to rotation — rotating the image does not change the eigenvalues. But they are not invariant to scale. A corner detected at one resolution may look like an edge when you zoom in, or disappear when you zoom out.

The solution: detect features at multiple scales using a scale-space pyramid.

The idea (pioneered by Lindeberg, refined by Lowe for SIFT):

Blur the image with Gaussians of increasing σ
Compute Difference of Gaussians (DoG) between adjacent blur levels
Find extrema (maxima/minima) in the 3D space of (x, y, scale)
Each extremum is a feature at a specific location and scale

Difference of Gaussians (DoG) approximates the scale-normalized Laplacian of Gaussian (LoG), a classic blob detector, at much lower cost, since it only requires subtracting adjacent blurred images. A blob of a given size produces its strongest DoG response at a matching σ. By searching across scales, you find each feature at its characteristic scale: the scale at which it is most prominent.
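
To make the idea of a characteristic scale concrete, here is a small NumPy sketch. It is a simplification: it builds a DoG stack for a synthetic Gaussian blob and looks only at the response at the blob centre, rather than running the full extremum search over (x, y, scale). The function names, the √2 spacing between blur levels, and reporting the geometric mean of the two blur levels as the scale estimate are all assumptions made for the example.

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with reflected borders (kernel truncated at 3 sigma)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def dog_responses(image, sigmas):
    """Difference of Gaussians between each pair of adjacent blur levels."""
    blurred = [gaussian_blur(image, s) for s in sigmas]
    return [b1 - b0 for b0, b1 in zip(blurred, blurred[1:])]

# Synthetic blob: a Gaussian bump with standard deviation 3 pixels.
yy, xx = np.mgrid[0:65, 0:65]
blob_sigma = 3.0
image = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2 * blob_sigma ** 2))

sigmas = [2 ** (i / 2) for i in range(7)]   # 1, 1.41, 2, ..., 8
dogs = dog_responses(image, sigmas)

# Strength of the DoG response at the blob centre, one value per scale level.
centre = [abs(d[32, 32]) for d in dogs]
best = int(np.argmax(centre))
characteristic_scale = (sigmas[best] * sigmas[best + 1]) ** 0.5
print("estimated characteristic scale:", round(characteristic_scale, 2))
```

The estimate should land near the blob's true size, up to the coarseness of the scale sampling. Changing `blob_sigma` moves the peak to a different level of the stack.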

Scale-Space Pyramid

A feature (blob) appears at different sizes across scales. The DoG response peaks at the characteristic scale.

Why do we need to detect features at multiple scales?

A feature visible at one resolution may appear different at another scale due to zoom/distance changes, so we must find its characteristic scale for reliable matching
Multiple scales make the algorithm faster
It reduces the number of features detected

Chapter 3: SIFT Descriptors

Detecting a keypoint gives you a location and scale. But to match keypoints across images, you need a descriptor — a compact summary of the local appearance around each keypoint.

SIFT (Scale-Invariant Feature Transform, Lowe 2004) builds a 128-dimensional descriptor by:

Taking a 16×16 patch around the keypoint, oriented by the dominant gradient direction
Dividing it into a 4×4 grid of cells
Computing an 8-bin gradient orientation histogram for each cell
Concatenating: 4×4×8 = 128 dimensions
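
The four steps above can be sketched as follows. This is a deliberately simplified, SIFT-like descriptor, not Lowe's full method: it assumes the 16×16 patch has already been resampled at the keypoint's scale and rotated to its dominant orientation, and it omits the Gaussian weighting and histogram interpolation of the original. The function name is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch):
    """128-D descriptor: 4x4 cells, each with an 8-bin gradient orientation histogram."""
    patch = np.asarray(patch, dtype=float)
    if patch.shape != (16, 16):
        raise ValueError("expected a 16x16 patch")
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    # Quantise orientation into 8 bins of 45 degrees each.
    bins = np.minimum((angle / (2 * np.pi) * 8).astype(int), 7)
    desc = np.zeros((4, 4, 8))
    for y in range(16):
        for x in range(16):
            # Each pixel votes into its cell's histogram, weighted by gradient magnitude.
            desc[y // 4, x // 4, bins[y, x]] += magnitude[y, x]
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(16, 16)).astype(float)
desc = sift_like_descriptor(patch)
print(desc.shape)   # (128,)

# Brightness offset and contrast gain leave the normalised descriptor unchanged.
print(np.allclose(desc, sift_like_descriptor(2 * patch + 10)))
```

Because the descriptor is built from gradients and then normalised to unit length, adding a constant to the patch or scaling its contrast does not change the result, which is one source of the illumination robustness.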

Why SIFT works so well: By normalizing for position, scale, and orientation before computing the descriptor, SIFT achieves invariance to all three. The gradient histogram representation is robust to small shifts and illumination changes. For over a decade (2004-2015), SIFT was the dominant feature descriptor in computer vision.

Modern alternatives:

Descriptor	Key Idea	Speed
SIFT	Gradient histograms, 128D	Moderate
SURF	Haar wavelet approximation, 64D	Fast
ORB	Binary descriptor (BRIEF + oriented FAST)	Very fast
SuperPoint	Learned CNN descriptor	GPU-fast

How does SIFT achieve rotation invariance?

It assigns a dominant gradient orientation to each keypoint and rotates the descriptor patch to align with it before computing the histogram
It uses circular filters
It ignores rotation entirely

Chapter 4: Feature Matching

Given descriptors from two images, how do you find correspondences? The simplest approach: for each descriptor in image A, find the nearest neighbor in image B (minimum Euclidean distance).

But nearest-neighbor matching produces many false matches. A simple and widely used test filters most of them out:

Lowe's ratio test: Compare the distance to the best match d₁ with the distance to the second-best match d₂. If d₁/d₂ < 0.7, the best match is much better than any alternative → likely correct. If the ratio is close to 1, the match is ambiguous → reject it. Thresholds between 0.7 and 0.8 are typical; Lowe (2004) reported that a threshold of 0.8 eliminated about 90% of false matches while discarding fewer than 5% of correct ones.
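
A brute-force version of nearest-neighbor matching with the ratio test fits in a few lines of standard-library Python. The function name and the toy 2D "descriptors" are made up for illustration; real descriptors would be 128-dimensional.

```python
import math

def ratio_test_matches(desc_a, desc_b, ratio=0.7):
    """Return (index_a, index_b) pairs that pass Lowe's ratio test."""
    matches = []
    for i, d in enumerate(desc_a):
        # Distances from this descriptor to every descriptor in image B, nearest first.
        dists = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the test needs a second-best candidate
        (d1, j1), (d2, _) = dists[0], dists[1]
        # Equivalent to d1 / d2 < ratio, but safe when d2 is zero.
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches

desc_a = [(1.0, 0.0), (0.5, 0.5)]
desc_b = [(0.9, 0.1), (0.0, 1.0), (1.0, 1.0)]
print(ratio_test_matches(desc_a, desc_b))   # [(0, 0)]
```

The first descriptor has one clearly closest neighbor (ratio about 0.14) and is kept. The second is almost equally close to all three candidates (ratio 0.8), so its best match is rejected as ambiguous.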

For large-scale matching (millions of images), brute-force nearest neighbor is too slow. Solutions:

KD-trees: Partition descriptor space for fast approximate search
Locality-sensitive hashing (LSH): Hash similar descriptors to the same bucket
Bag of Visual Words: Quantize descriptors to visual words, use inverted index (like a search engine)

Ratio Test

Adjust the ratio threshold. Lower threshold = fewer matches but higher precision.

What does Lowe's ratio test check?

Whether the best match is significantly closer than the second-best match, indicating an unambiguous correspondence
Whether the feature has high contrast
Whether the images have the same resolution

Chapter 5: Edges and Contours

Not all features are points. Edges — sharp boundaries between regions — carry crucial information about object boundaries, depth discontinuities, and surface orientation.

The classic edge detection pipeline:

Smooth the image with a Gaussian (reduce noise)
Compute gradients using Sobel or derivative-of-Gaussian filters
Non-maximum suppression: Thin edges to 1-pixel width by keeping only gradient maxima along the gradient direction
Hysteresis thresholding: Use two thresholds. Strong edges (above high threshold) are kept. Weak edges (between thresholds) are kept only if connected to a strong edge.

This is the Canny edge detector (1986), still widely used today.

Why hysteresis? A single threshold creates problems. Too high: edges have gaps. Too low: noise creates spurious edges. Hysteresis solves both. Strong edges anchor the detection, and connected weak edges fill in gaps along true boundaries without adding isolated noise.
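
The hysteresis step on its own can be sketched as a flood fill that starts from the strong pixels. This assumes the gradient magnitudes have already been thinned by non-maximum suppression; the function name, the use of 8-connectivity, and the toy values are choices made for the example.

```python
from collections import deque

def hysteresis(magnitude, low, high):
    """Keep strong pixels (>= high) plus weak pixels (>= low) connected to them."""
    h, w = len(magnitude), len(magnitude[0])
    edges = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the search with every strong pixel.
    for y in range(h):
        for x in range(w):
            if magnitude[y][x] >= high:
                edges[y][x] = True
                queue.append((y, x))
    # Grow outwards through 8-connected neighbours that clear the low threshold.
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w and not edges[ny][nx]
                        and magnitude[ny][nx] >= low):
                    edges[ny][nx] = True
                    queue.append((ny, nx))
    return edges

mag = [
    [0.9, 0.4, 0.4, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
edges = hysteresis(mag, low=0.3, high=0.8)
print(edges[0])   # [True, True, True, False, False]
```

The two weak pixels next to the strong one are kept because they extend a real edge, while the isolated weak pixel at the end of the row is discarded as noise.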

Modern learned edge detectors (HED, Holistically-Nested Edge Detection) use deep networks to predict edges at multiple scales simultaneously, achieving more semantically meaningful boundaries than gradient-based methods.

What is the purpose of hysteresis thresholding in the Canny edge detector?

It uses two thresholds to keep strong edges and extend them through connected weak edges, giving continuous boundaries without noise
It makes edges thicker
It converts edges to color

Chapter 6: Lines and Vanishing Points

Many man-made scenes are dominated by straight lines: buildings, roads, furniture. Detecting lines and finding where parallel lines converge (vanishing points) reveals the 3D structure of the scene.

The Hough transform detects lines by transforming each edge point into a family of possible lines through it. In parameter space (ρ, θ), each point votes for all lines passing through it. Lines appear as peaks in the Hough accumulator.
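
The voting scheme can be sketched with the normal-form line equation ρ = x cos θ + y sin θ. The version below is a minimal illustration that stores votes in a dictionary instead of a dense accumulator array; the function name, the 1° angular resolution, and the unit-width ρ bins are assumptions made for the example.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes in (rho, theta) space for a list of (x, y) edge points."""
    votes = Counter()
    for x, y in points:
        # Each point votes once per angle, for the line at that angle passing through it.
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(round(rho / rho_step), t)] += 1
    return votes

# Ten points on the horizontal line y = 5, plus two unrelated points.
points = [(x, 5) for x in range(0, 100, 10)] + [(30, 40), (70, 80)]
votes = hough_lines(points)
(rho_bin, t), count = votes.most_common(1)[0]
print(rho_bin, t, count)   # 5 90 10
```

The peak sits at θ = 90° and ρ = 5, which is the line y = 5 in normal form, and its vote count equals the number of points on that line. The two stray points cast votes too, but they never pile up in one bin.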

Vanishing points and 3D orientation: In a perspective image, parallel 3D lines converge at a vanishing point. The vanishing points of three mutually orthogonal directions (one for each scene axis) give you the camera's orientation relative to the scene. This is the basis of many single-image 3D methods: detect lines → find vanishing points → infer the 3D "up" direction and room layout.
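
One common way to compute a vanishing point from two image lines uses homogeneous coordinates, where the line through two points and the intersection of two lines are both cross products. The sketch below assumes exactly two noise-free line segments; with many noisy lines one would instead fit the point by least squares or voting. All function names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def intersection(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        return None  # the vanishing point lies at infinity
    return (x / w, y / w)

# Two image segments, e.g. the left and right rails of a track receding from the camera.
left = line_through((0, 0), (4, 8))
right = line_through((10, 0), (6, 8))
print(intersection(left, right))   # (5.0, 10.0)
```

Segments that are also parallel in the image, such as two horizontal lines seen head-on, return `None` because their vanishing point is at infinity.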

Hough Transform

Edge points vote for lines in parameter space. Peaks correspond to dominant lines.

How does the Hough transform detect lines?

Each edge point votes for all possible lines through it in parameter space; true lines accumulate many votes and appear as peaks
It connects the nearest edge points with straight lines
It uses deep learning

Chapter 7: Segmentation

Segmentation groups pixels into coherent regions. Unlike semantic segmentation (Chapter 6: Recognition), classical segmentation is unsupervised — it groups pixels by appearance similarity without knowing object categories.

Key approaches:

Method	Key Idea
Graph-based	Build a graph where pixels are nodes and edges are weighted by similarity. Merge regions where internal variation is smaller than boundary differences.
Mean shift	Iteratively move each point toward the local density maximum in color-position space. Points converging to the same mode form a segment.
Normalized cuts	Partition the pixel graph to minimize the normalized cut cost, balancing similarity within groups and dissimilarity between groups.
Superpixels (SLIC)	Oversegment the image into small, roughly uniform regions for downstream processing.

Superpixels as a preprocessing step: Instead of processing individual pixels (millions of them), group them into a few hundred superpixels first. Each superpixel is a small, coherent region. This reduces computation by orders of magnitude while preserving object boundaries. SLIC (Simple Linear Iterative Clustering) is one of the most popular methods: essentially k-means in a 5D space of (x, y, L, a, b), with each cluster's search restricted to a local window so the result stays compact and fast to compute.
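
The "k-means in 5D" idea can be sketched with NumPy as below. This is a simplified stand-in for SLIC, not the published algorithm: it runs plain k-means over all pixels instead of restricting each centre to a local window, so it is only practical for tiny images. The function name, the grid seeding, and the compactness weight `m` are assumptions made for the example, and the input is assumed to already be in Lab colour space.

```python
import numpy as np

def slic_like(lab, k_side=2, m=10.0, iters=5):
    """Cluster pixels in (L, a, b, x, y) space; returns an (h, w) label map."""
    h, w, _ = lab.shape
    s = np.sqrt(h * w / k_side ** 2)          # expected spacing between centres
    ys, xs = np.mgrid[0:h, 0:w]
    # Scale the spatial coordinates by m / s so colour and position are comparable.
    spatial = (m / s) * np.stack([xs.ravel(), ys.ravel()], axis=1)
    feats = np.concatenate([lab.reshape(-1, 3), spatial], axis=1)
    # Seed the centres on a regular k_side x k_side grid.
    cy = ((np.arange(k_side) + 0.5) * h / k_side).astype(int)
    cx = ((np.arange(k_side) + 0.5) * w / k_side).astype(int)
    centres = feats[[y * w + x for y in cy for x in cx]].copy()
    for _ in range(iters):
        # Assign every pixel to its nearest centre in the 5D feature space.
        d2 = ((feats[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centre to the mean of its members.
        for c in range(len(centres)):
            members = feats[labels == c]
            if len(members):
                centres[c] = members.mean(axis=0)
    return labels.reshape(h, w)

# Toy Lab image: dark left half, bright right half.
lab = np.zeros((8, 8, 3))
lab[:, :4, 0] = 20.0
lab[:, 4:, 0] = 80.0
labels = slic_like(lab)
print(labels)
```

On this toy image the four superpixels respect the colour boundary: no label appears on both sides of the vertical edge. Raising `m` trades boundary adherence for more compact, regular regions.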

What is the practical benefit of superpixels?

They reduce the number of elements to process from millions of pixels to hundreds of coherent regions, while preserving boundaries
They increase image resolution
They classify each pixel into an object category

Chapter 8: Showcase — Feature Matching Demo

Let's match features between two views. The detector finds corners in both images, computes descriptors, and draws lines connecting matches. Good matches are consistent with a geometric transformation.

Two-Image Feature Matching

Features are detected in both views and matched by descriptor similarity. Correct matches follow a consistent pattern.

Geometric verification: Even after the ratio test, some matches are wrong. RANSAC (Ch 8) fits a geometric model (homography) to the matches and rejects outliers. Only matches consistent with the model survive. This is what makes feature matching robust enough for real-world applications.
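
The RANSAC loop is easiest to see with a simpler model than a homography. The sketch below swaps in a pure 2D translation, which needs only one match per sample (a homography needs four and a linear solve), but the hypothesise-and-verify structure is the same. The function name, the inlier threshold, and the toy matches are assumptions made for the example.

```python
import random

def ransac_translation(matches, threshold=2.0, iters=100, seed=0):
    """matches: list of ((xa, ya), (xb, yb)). Returns (translation, inlier indices)."""
    rng = random.Random(seed)
    best_t, best_inliers = None, []
    for _ in range(iters):
        # Minimal sample: one match fully determines a translation.
        (xa, ya), (xb, yb) = rng.choice(matches)
        tx, ty = xb - xa, yb - ya
        # Count the matches that agree with this hypothesis.
        inliers = [
            i for i, ((ua, va), (ub, vb)) in enumerate(matches)
            if (ua + tx - ub) ** 2 + (va + ty - vb) ** 2 <= threshold ** 2
        ]
        if len(inliers) > len(best_inliers):
            best_t, best_inliers = (tx, ty), inliers
    return best_t, best_inliers

# Six correct matches related by a shift of (10, -3), plus two wrong matches.
pts = [(0, 0), (5, 1), (2, 7), (9, 4), (3, 3), (8, 8)]
good = [((x, y), (x + 10, y - 3)) for x, y in pts]
bad = [((1, 1), (40, 40)), ((6, 2), (0, 30))]
translation, inliers = ransac_translation(good + bad)
print(translation, inliers)   # (10, -3) [0, 1, 2, 3, 4, 5]
```

A hypothesis drawn from a wrong match explains only itself, while one drawn from any correct match explains all six, so the consensus set isolates the correct matches and the two outliers are rejected.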

Chapter 9: Connections

Feature detection and matching is the foundation for many downstream tasks:

Concept	Used In
Harris / SIFT features	Ch 8 (image stitching), Ch 11 (SfM), Ch 6 (instance recognition)
Feature matching + RANSAC	Ch 8 (alignment), Ch 11 (pose estimation), Ch 9 (motion)
Edge detection	Ch 3 (image processing), Ch 10 (matting), Ch 13 (shape from contours)
Hough transform / lines	Ch 11 (vanishing points for calibration), Ch 8 (panorama alignment)
Segmentation	Ch 6 (semantic seg.), Ch 10 (matting), Ch 12 (stereo)
Learned features (SuperPoint)	Ch 11 (visual SLAM), Ch 6 (recognition)

The deep learning transition: Classical features (SIFT, Harris) dominated for more than a decade. Now learned detectors, descriptors, and matchers (SuperPoint, SuperGlue) outperform hand-crafted ones on many benchmarks. But the principles remain the same: detect → describe → match. Deep learning replaces each component with a trained network, but the pipeline structure endures.

Which feature detection concept is essential for panorama stitching (Ch 8)?

Keypoint detection and matching with RANSAC-based geometric verification to align overlapping images
Edge detection only
Segmentation only