Finding distinctive points, describing them, and matching them across images: corners, SIFT, edges, lines, and segmentation.
Imagine you have two photos of the same building taken from different angles. How do you figure out how they relate? You cannot compare raw pixels — the viewpoints are different. Instead, you find distinctive points that appear in both images: a window corner, a door handle, a rooftop edge. These are features.
Feature detection and matching is a three-step pipeline that underpins most of classical computer vision:
The three stages of feature-based matching: detect, describe, match.
What makes a good feature point? Consider sliding a small window across the image:
The Harris detector formalizes this intuition. It computes the structure tensor (second moment matrix) from image gradients:
where Ix and Iy are image gradients. The eigenvalues λ1, λ2 of M tell the story:
| Eigenvalues | Interpretation | Feature? |
|---|---|---|
| λ1 ≈ 0, λ2 ≈ 0 | Flat region (no gradient) | No |
| λ1 ≫ 0, λ2 ≈ 0 | Edge (gradient in one direction) | No |
| λ1 ≫ 0, λ2 ≫ 0 | Corner (gradients in two directions) | Yes! |
Slide a window across different image regions. Watch how the eigenvalue plot changes.
Harris corners are invariant to rotation — rotating the image does not change the eigenvalues. But they are not invariant to scale. A corner detected at one resolution may look like an edge when you zoom in, or disappear when you zoom out.
The solution: detect features at multiple scales using a scale-space pyramid.
The idea (pioneered by Lindeberg, refined by Lowe for SIFT):
A feature (blob) appears at different sizes across scales. The DoG response peaks at the characteristic scale.
Detecting a keypoint gives you a location and scale. But to match keypoints across images, you need a descriptor — a compact summary of the local appearance around each keypoint.
SIFT (Scale-Invariant Feature Transform, Lowe 2004) builds a 128-dimensional descriptor by:
Modern alternatives:
| Descriptor | Key Idea | Speed |
|---|---|---|
| SIFT | Gradient histograms, 128D | Moderate |
| SURF | Haar wavelet approximation, 64D | Fast |
| ORB | Binary descriptor (BRIEF + oriented FAST) | Very fast |
| SuperPoint | Learned CNN descriptor | GPU-fast |
Given descriptors from two images, how do you find correspondences? The simplest approach: for each descriptor in image A, find the nearest neighbor in image B (minimum Euclidean distance).
But nearest-neighbor matching produces many false matches. Two crucial tests filter them out:
For large-scale matching (millions of images), brute-force nearest neighbor is too slow. Solutions:
Adjust the ratio threshold. Lower threshold = fewer matches but higher precision.
Not all features are points. Edges — sharp boundaries between regions — carry crucial information about object boundaries, depth discontinuities, and surface orientation.
The classic edge detection pipeline:
This is the Canny edge detector (1986), still widely used today.
Modern learned edge detectors (HED, Holistically-Nested Edge Detection) use deep networks to predict edges at multiple scales simultaneously, achieving more semantically meaningful boundaries than gradient-based methods.
Many man-made scenes are dominated by straight lines: buildings, roads, furniture. Detecting lines and finding where parallel lines converge (vanishing points) reveals the 3D structure of the scene.
The Hough transform detects lines by transforming each edge point into a family of possible lines through it. In parameter space (ρ, θ), each point votes for all lines passing through it. Lines appear as peaks in the Hough accumulator.
Edge points vote for lines in parameter space. Peaks correspond to dominant lines.
Segmentation groups pixels into coherent regions. Unlike semantic segmentation (Chapter 6: Recognition), classical segmentation is unsupervised — it groups pixels by appearance similarity without knowing object categories.
Key approaches:
| Method | Key Idea |
|---|---|
| Graph-based | Build a graph where pixels are nodes and edges are weighted by similarity. Merge regions where internal variation is smaller than boundary differences. |
| Mean shift | Iteratively move each point toward the local density maximum in color-position space. Points converging to the same mode form a segment. |
| Normalized cuts | Partition the pixel graph to minimize the normalized cut cost, balancing similarity within groups and dissimilarity between groups. |
| Superpixels (SLIC) | Oversegment the image into small, roughly uniform regions for downstream processing. |
Let's match features between two views. The detector finds corners in both images, computes descriptors, and draws lines connecting matches. Good matches are consistent with a geometric transformation.
Features are detected in both views and matched by descriptor similarity. Correct matches follow a consistent pattern.
Feature detection and matching is the foundation for many downstream tasks:
| Concept | Used In |
|---|---|
| Harris / SIFT features | Ch 8 (image stitching), Ch 11 (SfM), Ch 6 (instance recognition) |
| Feature matching + RANSAC | Ch 8 (alignment), Ch 11 (pose estimation), Ch 9 (motion) |
| Edge detection | Ch 3 (image processing), Ch 10 (matting), Ch 13 (shape from contours) |
| Hough transform / lines | Ch 11 (vanishing points for calibration), Ch 8 (panorama alignment) |
| Segmentation | Ch 6 (semantic seg.), Ch 10 (matting), Ch 12 (stereo) |
| Learned features (SuperPoint) | Ch 11 (visual SLAM), Ch 6 (recognition) |