Szeliski, Chapter 7

Feature Detection and Matching

Finding distinctive points, describing them, and matching them across images: corners, SIFT, edges, lines, and segmentation.

Prerequisites: Chapter 3 (image processing, gradients), Chapter 4 (optimization) helpful.
10
Chapters
7+
Simulations
0
Assumed CV Knowledge

Chapter 0: Why Features?

Imagine you have two photos of the same building taken from different angles. How do you figure out how they relate? You cannot compare raw pixels — the viewpoints are different. Instead, you find distinctive points that appear in both images: a window corner, a door handle, a rooftop edge. These are features.

Feature detection and matching is a three-step pipeline that underpins most of classical computer vision:

Features are the bridge between pixels and geometry. Raw pixels are fragile — they change with lighting, viewpoint, and scale. Good features are invariant: they can be detected reliably regardless of these changes. This makes it possible to stitch panoramas, reconstruct 3D scenes, track objects, and recognize places.
Feature Detection Pipeline

The three stages of feature-based matching: detect, describe, match.

Why can't we simply compare raw pixel values to match two images of the same scene?

Chapter 1: Harris Corner Detector

What makes a good feature point? Consider sliding a small window across the image:

The Harris detector formalizes this intuition. It computes the structure tensor (second moment matrix) from image gradients:

M = ∑w ⎡ Ix2   IxIy
    ⎣ IxIy   Iy2

where Ix and Iy are image gradients. The eigenvalues λ1, λ2 of M tell the story:

EigenvaluesInterpretationFeature?
λ1 ≈ 0, λ2 ≈ 0Flat region (no gradient)No
λ1 ≫ 0, λ2 ≈ 0Edge (gradient in one direction)No
λ1 ≫ 0, λ2 ≫ 0Corner (gradients in two directions)Yes!
Harris corner response: Computing eigenvalues is expensive. Harris uses a clever shortcut: R = det(M) − k · trace(M)2 = λ1λ2 − k(λ1 + λ2)2. When R is large and positive, both eigenvalues are large → corner. When R is negative → edge. When |R| is small → flat.
Corner vs. Edge vs. Flat

Slide a window across different image regions. Watch how the eigenvalue plot changes.

What do the eigenvalues of the structure tensor tell us about a pixel's neighborhood?

Chapter 2: Scale Invariance

Harris corners are invariant to rotation — rotating the image does not change the eigenvalues. But they are not invariant to scale. A corner detected at one resolution may look like an edge when you zoom in, or disappear when you zoom out.

The solution: detect features at multiple scales using a scale-space pyramid.

The idea (pioneered by Lindeberg, refined by Lowe for SIFT):

Difference of Gaussians (DoG) approximates the Laplacian of Gaussian (LoG), which is the optimal blob detector. A blob at scale σ produces a strong response in the DoG at that scale. By searching across scales, you find each feature at its characteristic scale — the scale at which it is most prominent.
Scale-Space Pyramid

A feature (blob) appears at different sizes across scales. The DoG response peaks at the characteristic scale.

Scale σ 5
Why do we need to detect features at multiple scales?

Chapter 3: SIFT Descriptors

Detecting a keypoint gives you a location and scale. But to match keypoints across images, you need a descriptor — a compact summary of the local appearance around each keypoint.

SIFT (Scale-Invariant Feature Transform, Lowe 2004) builds a 128-dimensional descriptor by:

Why SIFT works so well: By normalizing for position, scale, and orientation before computing the descriptor, SIFT achieves invariance to all three. The gradient histogram representation is robust to small shifts and illumination changes. For over a decade (2004-2015), SIFT was the dominant feature descriptor in computer vision.

Modern alternatives:

DescriptorKey IdeaSpeed
SIFTGradient histograms, 128DModerate
SURFHaar wavelet approximation, 64DFast
ORBBinary descriptor (BRIEF + oriented FAST)Very fast
SuperPointLearned CNN descriptorGPU-fast
How does SIFT achieve rotation invariance?

Chapter 4: Feature Matching

Given descriptors from two images, how do you find correspondences? The simplest approach: for each descriptor in image A, find the nearest neighbor in image B (minimum Euclidean distance).

But nearest-neighbor matching produces many false matches. Two crucial tests filter them out:

Lowe's ratio test: Compare the distance to the best match d1 with the distance to the second-best match d2. If d1/d2 < 0.7, the best match is much better than any alternative → likely correct. If the ratio is close to 1, the match is ambiguous → reject it. This simple test eliminates ~90% of false matches while keeping ~95% of correct ones.

For large-scale matching (millions of images), brute-force nearest neighbor is too slow. Solutions:

Ratio Test

Adjust the ratio threshold. Lower threshold = fewer matches but higher precision.

Ratio threshold 0.70
What does Lowe's ratio test check?

Chapter 5: Edges and Contours

Not all features are points. Edges — sharp boundaries between regions — carry crucial information about object boundaries, depth discontinuities, and surface orientation.

The classic edge detection pipeline:

This is the Canny edge detector (1986), still widely used today.

Why hysteresis? A single threshold creates problems. Too high: edges have gaps. Too low: noise creates spurious edges. Hysteresis solves both. Strong edges anchor the detection, and connected weak edges fill in gaps along true boundaries without adding isolated noise.

Modern learned edge detectors (HED, Holistically-Nested Edge Detection) use deep networks to predict edges at multiple scales simultaneously, achieving more semantically meaningful boundaries than gradient-based methods.

What is the purpose of hysteresis thresholding in the Canny edge detector?

Chapter 6: Lines and Vanishing Points

Many man-made scenes are dominated by straight lines: buildings, roads, furniture. Detecting lines and finding where parallel lines converge (vanishing points) reveals the 3D structure of the scene.

The Hough transform detects lines by transforming each edge point into a family of possible lines through it. In parameter space (ρ, θ), each point votes for all lines passing through it. Lines appear as peaks in the Hough accumulator.

Vanishing points and 3D orientation: In a perspective image, parallel 3D lines converge at a vanishing point. A set of orthogonal vanishing points (one for each axis) gives you the camera's orientation relative to the scene. This is how single-image 3D estimation works: detect lines → find vanishing points → infer the 3D "up" direction and room layout.
Hough Transform

Edge points vote for lines in parameter space. Peaks correspond to dominant lines.

How does the Hough transform detect lines?

Chapter 7: Segmentation

Segmentation groups pixels into coherent regions. Unlike semantic segmentation (Chapter 6: Recognition), classical segmentation is unsupervised — it groups pixels by appearance similarity without knowing object categories.

Key approaches:

MethodKey Idea
Graph-basedBuild a graph where pixels are nodes and edges are weighted by similarity. Merge regions where internal variation is smaller than boundary differences.
Mean shiftIteratively move each point toward the local density maximum in color-position space. Points converging to the same mode form a segment.
Normalized cutsPartition the pixel graph to minimize the normalized cut cost, balancing similarity within groups and dissimilarity between groups.
Superpixels (SLIC)Oversegment the image into small, roughly uniform regions for downstream processing.
Superpixels as a preprocessing step: Instead of processing individual pixels (millions of them), group them into a few hundred superpixels first. Each superpixel is a small, coherent region. This reduces computation by orders of magnitude while preserving object boundaries. SLIC (Simple Linear Iterative Clustering) is the most popular method — essentially k-means in a 5D space of (x, y, L, a, b).
What is the practical benefit of superpixels?

Chapter 8: Showcase — Feature Matching Demo

Let's match features between two views. The detector finds corners in both images, computes descriptors, and draws lines connecting matches. Good matches are consistent with a geometric transformation.

Two-Image Feature Matching

Features are detected in both views and matched by descriptor similarity. Correct matches follow a consistent pattern.

Geometric verification: Even after the ratio test, some matches are wrong. RANSAC (Ch 8) fits a geometric model (homography) to the matches and rejects outliers. Only matches consistent with the model survive. This is what makes feature matching robust enough for real-world applications.

Chapter 9: Connections

Feature detection and matching is the foundation for many downstream tasks:

ConceptUsed In
Harris / SIFT featuresCh 8 (image stitching), Ch 11 (SfM), Ch 6 (instance recognition)
Feature matching + RANSACCh 8 (alignment), Ch 11 (pose estimation), Ch 9 (motion)
Edge detectionCh 3 (image processing), Ch 10 (matting), Ch 13 (shape from contours)
Hough transform / linesCh 11 (vanishing points for calibration), Ch 8 (panorama alignment)
SegmentationCh 6 (semantic seg.), Ch 10 (matting), Ch 12 (stereo)
Learned features (SuperPoint)Ch 11 (visual SLAM), Ch 6 (recognition)
The deep learning transition: Classical features (SIFT, Harris) dominated for a decade. Now learned features (SuperPoint, SuperGlue) outperform hand-crafted ones on most benchmarks. But the principles remain the same: detect → describe → match. Deep learning replaces each component with a trained network, but the pipeline structure endures.
Which feature detection concept is essential for panorama stitching (Ch 8)?