Szeliski, Chapter 6

Visual Recognition

From recognizing specific objects to understanding entire scenes: classification, detection, segmentation, and vision+language.

Prerequisites: Chapter 5 (deep learning basics). Chapter 7 (features) is helpful but not required.
10 chapters · 7+ simulations · 0 assumed CV knowledge

Chapter 0: Why Recognition?

You glance at a photo and instantly know: that is a golden retriever sitting on a red couch in a living room. You identified the object class (dog), the specific breed (golden retriever), located it in space (center of frame, on the couch), and understood the scene context (indoors, living room). All in a fraction of a second.

Teaching computers to do this is the central challenge of visual recognition. It spans a hierarchy of tasks, from simple to complex:

Task | Question Answered | Output
Instance recognition | Is this the exact same object? | Match / no match
Image classification | What category is in this image? | Class label
Object detection | Where are the objects? | Bounding boxes + labels
Semantic segmentation | What class is each pixel? | Per-pixel labels
Instance segmentation | Which pixels belong to which object? | Per-pixel instance masks

The recognition hierarchy: Each task builds on the one before it. Classification tells you what. Detection tells you where. Segmentation tells you the exact shape. Modern systems solve all of these jointly — a single network can classify, detect, and segment simultaneously.
Figure (Recognition Tasks): what each recognition task produces from the same image.

What is the key difference between object detection and semantic segmentation?

Chapter 1: Instance Recognition

Instance recognition asks: "Is this the exact same object I have seen before?" Not "is this a dog?" but "is this my dog?" This matters for tasks like finding a specific product on a shelf, matching a building from different angles, or verifying an identity.

The classical pipeline has three steps:

Bag of Visual Words: A breakthrough idea from the early 2000s. Extract local features (such as SIFT descriptors), cluster them into a "visual vocabulary" using k-means, then represent each image as a histogram of visual word frequencies. This turns image matching into text-style retrieval: with an inverted index, a query can be matched against millions of images in a fraction of a second.
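The histogram step described above can be sketched in a few lines. This is a minimal illustration, assuming the vocabulary (the k-means cluster centers) has already been learned. The 2-D descriptors and the function names are made up for the example; real SIFT descriptors are 128-dimensional, and real vocabularies hold thousands of words or more.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the closest visual word (cluster center), by squared Euclidean distance."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, vocabulary[k])),
    )

def bovw_histogram(descriptors, vocabulary):
    """L2-normalized histogram of visual-word counts for one image."""
    counts = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine_similarity(hist_a, hist_b):
    """Dot product of two L2-normalized histograms."""
    return sum(a * b for a, b in zip(hist_a, hist_b))

# Hypothetical 3-word vocabulary and toy local descriptors for three images.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
image_a = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1)]
image_b = [(0.0, 0.1), (1.0, 0.0), (0.9, 0.0)]
image_c = [(0.0, 0.9), (0.1, 1.1), (0.0, 1.0)]

hist_a = bovw_histogram(image_a, vocabulary)
hist_b = bovw_histogram(image_b, vocabulary)
hist_c = bovw_histogram(image_c, vocabulary)

sim_ab = cosine_similarity(hist_a, hist_b)  # same word counts, so high similarity
sim_ac = cosine_similarity(hist_a, hist_c)  # no shared words, so zero similarity
```

Images A and B use the same visual words in the same proportions, so their histograms match even though the raw descriptors differ slightly. Image C shares no words with A.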

Modern approaches use deep learned descriptors. A CNN trained with a contrastive or triplet loss maps each image to a compact vector. Two images of the same object should produce nearby vectors despite viewpoint or lighting changes. The entire image becomes a single descriptor, so retrieval reduces to nearest-neighbor search with no local feature extraction or matching.

$$L_{\text{triplet}} = \max\bigl(0,\ \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha\bigr)$$

The triplet loss pulls the anchor a toward a positive match p and pushes it away from a negative n, until the negative is farther from the anchor than the positive by at least a margin α.
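As a rough sketch, the loss can be evaluated directly on embedding vectors. The code below assumes the exponent in the formula denotes a squared Euclidean distance. The margin value and the 2-D embeddings are illustrative choices, not values from the text.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin), with d the squared distance."""
    return max(
        0.0,
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
    )

anchor = (0.0, 0.0)
positive = (0.1, 0.0)       # same object, nearby embedding
easy_negative = (1.0, 0.0)  # different object, already far away
hard_negative = (0.2, 0.0)  # different object, still too close

easy_loss = triplet_loss(anchor, positive, easy_negative)  # margin satisfied: zero loss
hard_loss = triplet_loss(anchor, positive, hard_negative)  # margin violated: positive loss
```

The easy negative contributes nothing to training because it is already beyond the margin. Only the hard negative produces a gradient, which is why triplet selection matters in practice.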

What does the triplet loss encourage a learned descriptor to do?

Chapter 2: Image Classification

Image classification answers a simpler question: "What category does this image belong to?" Given a photo, output a label like "cat", "car", or "mountain." This is the task that launched the deep learning revolution.

The milestone moment: ImageNet 2012. AlexNet, a deep CNN, cut the top-5 error rate from about 26% to about 16%, a relative reduction of roughly 40%. Every subsequent winner used deeper networks:

Year | Model | Top-5 Error | Key Idea
2012 | AlexNet | 16.4% | Deep CNN + GPU training
2014 | VGGNet | 7.3% | Very deep, small filters
2014 | GoogLeNet | 6.7% | Inception modules
2015 | ResNet | 3.6% | Skip connections (152 layers)
2020 | ViT | ~1.5% | Vision transformer (patches + attention)

Transfer learning: You rarely train a classifier from scratch. Instead, take a network pre-trained on ImageNet (millions of images, 1000 classes), freeze the early layers (which detect generic edges and textures), and fine-tune only the last few layers for your specific task. This works because early features are universal — edges look the same whether you are classifying dogs or X-rays.

Face recognition is closely related to classification. Modern systems like ArcFace learn embeddings where faces of the same person cluster together, so identities can be compared by distance rather than by a fixed set of class labels. They report accuracy that matches or exceeds human performance on some verification benchmarks, but raise important questions about privacy, bias, and consent.

Figure (Feature Transfer): Features learned for one task transfer to others. Early layers are generic; deep layers specialize.

Why does transfer learning work so well for image classification?

Chapter 3: Object Detection

Classification tells you what is in the image. Object detection tells you where: it outputs a bounding box (rectangle) and a class label for every object instance.

The evolution of detectors:

Era | Method | Key Idea
2001 | Viola-Jones | Haar features + cascaded classifiers. Real-time face detection.
2005 | HOG + SVM | Histograms of oriented gradients + linear SVM. Pedestrian detection.
2014 | R-CNN | Region proposals + CNN features. Two-stage pipeline.
2015 | Faster R-CNN | Learned region proposals (RPN). End-to-end trainable.
2016 | YOLO / SSD | Single-shot: predict boxes and classes in one pass. Real-time.

Two-stage vs. one-stage: Faster R-CNN first proposes regions ("where might objects be?"), then classifies each region. YOLO skips the proposal step: it divides the image into a grid and directly predicts boxes and classes for each cell. One-stage detectors have typically been faster; two-stage detectors have typically been more accurate. Modern detectors blur this line.

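Evaluation: Detection accuracy is measured by mean Average Precision (mAP). Under the common PASCAL VOC criterion, a prediction counts as correct if its class matches and its bounding box has an IoU of at least 0.5 with a ground-truth box. COCO-style evaluation averages over IoU thresholds from 0.5 to 0.95.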

IoU = |A ∩ B| / |A ∪ B|

Intersection over Union measures how well two boxes overlap. An IoU of 1.0 means perfect alignment; 0.0 means no overlap at all.
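A minimal sketch of the IoU computation for axis-aligned boxes follows. The `(x1, y1, x2, y2)` corner format is an assumption for this example; other conventions such as `(x, y, width, height)` are also common.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

perfect = iou((0, 0, 2, 2), (0, 0, 2, 2))   # identical boxes
partial = iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap area 1, union area 7
disjoint = iou((0, 0, 1, 1), (5, 5, 6, 6))  # no overlap
```

Note how quickly IoU drops: two equal squares that share a quarter of their area score only 1/7, well below the usual 0.5 threshold.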

Figure (IoU Calculator): IoU changes as the predicted box (dashed) is moved and its overlap changes.

What is the main advantage of single-shot detectors (YOLO) over two-stage detectors (Faster R-CNN)?

Chapter 4: Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in the image. Instead of a single label ("street scene") or bounding boxes ("car at position X"), you get a dense map: this pixel is road, this pixel is car, this pixel is sky.
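To make the "dense map" concrete, here is a small sketch of the final step of a segmentation network: turning per-class score maps into a label for every pixel. The `scores[class][row][column]` layout, the class names, and the numbers are all assumptions chosen for illustration.

```python
def label_map(scores, class_names):
    """Pick the highest-scoring class at every pixel.

    scores[c][y][x] is the network's score for class c at pixel (y, x).
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [
            class_names[max(range(num_classes), key=lambda c: scores[c][y][x])]
            for x in range(width)
        ]
        for y in range(height)
    ]

class_names = ["sky", "road", "car"]
# Toy 2x3 score maps, one per class.
scores = [
    [[0.90, 0.80, 0.70], [0.10, 0.20, 0.10]],  # sky
    [[0.05, 0.10, 0.20], [0.80, 0.30, 0.70]],  # road
    [[0.05, 0.10, 0.10], [0.10, 0.50, 0.20]],  # car
]
labels = label_map(scores, class_names)
```

The output has the same height and width as the input scores, with one class name per pixel instead of one label per image.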

The breakthrough architecture: Fully Convolutional Networks (FCN). Take a classification CNN (like VGGNet), replace the final fully connected layers with convolutional layers, and upsample back to full resolution. Now the network can accept any input size and output a per-pixel prediction.

The encoder-decoder pattern: Most segmentation networks follow this structure. The encoder (backbone CNN) progressively downsamples, capturing "what" is in the image. The decoder progressively upsamples, recovering "where" things are. Skip connections bridge the two, combining high-level semantics with low-level spatial precision. U-Net popularized this pattern.
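One way to see how the pieces fit is to trace feature-map shapes through such a network. The sketch below is bookkeeping only, with no learned weights. It assumes a U-Net-style layout in which each encoder level halves the resolution and doubles the channels, and each decoder level upsamples, concatenates the matching encoder output, and reduces the channels again. Those conventions, the function name, and the default sizes are assumptions, not details from the text.

```python
def encoder_decoder_shapes(height, width, base_channels=64, depth=3):
    """Return a list of (stage, channels, height, width) for a U-Net-style network."""
    factor = 2 ** depth
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by 2**depth")

    trace, skips = [], []
    c, h, w = base_channels, height, width

    # Encoder: remember each level's output for the skip connection, then downsample.
    for level in range(depth):
        trace.append((f"enc{level}", c, h, w))
        skips.append((c, h, w))
        c, h, w = c * 2, h // 2, w // 2
    trace.append(("bottleneck", c, h, w))

    # Decoder: upsample, concatenate the matching encoder features, reduce channels.
    for level in reversed(range(depth)):
        skip_c, skip_h, skip_w = skips[level]
        c, h, w = c // 2, h * 2, w * 2
        assert (h, w) == (skip_h, skip_w)  # the skip must match spatially
        trace.append((f"dec{level}_concat", c + skip_c, h, w))
        c = skip_c
        trace.append((f"dec{level}", c, h, w))
    return trace

trace = encoder_decoder_shapes(64, 64)
```

For a 64x64 input the bottleneck is only 8x8, which is why the decoder needs the full-resolution encoder features to place boundaries accurately.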

Key architectures:

Figure (Encoder-Decoder Architecture): Data flows through an encoder-decoder with skip connections. The encoder captures semantics; the decoder recovers spatial detail.

Why are skip connections important in segmentation networks like U-Net?

Chapter 5: Instance & Panoptic Segmentation

Semantic segmentation labels pixels by class, but it cannot distinguish between individual objects of the same class. If two cars overlap, all car pixels get the same label. Instance segmentation solves this: it gives each object its own mask.

Mask R-CNN (2017) extended Faster R-CNN with a mask prediction branch. For each detected object, it predicts not just a bounding box and class, but a pixel-level mask inside the box. The key innovation was RoIAlign, which samples the feature map with bilinear interpolation at exact fractional positions instead of rounding each region of interest to the feature grid.
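The difference between rounding and interpolating can be shown on a tiny feature map. This sketch covers only the sampling idea behind RoIAlign, not the full operator, which also pools several sample points per output bin. The function names and the 2x2 feature map are illustrative.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location."""
    height, width = len(feature), len(feature[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def rounded_sample(feature, y, x):
    """Quantized lookup: snap the coordinate to the nearest grid cell."""
    return feature[int(math.floor(y + 0.5))][int(math.floor(x + 0.5))]

feature = [[0.0, 10.0],
           [20.0, 30.0]]
aligned = bilinear_sample(feature, 0.25, 0.75)    # blends all four neighbors
quantized = rounded_sample(feature, 0.25, 0.75)   # reads cell (0, 1) only
```

The quantized lookup returns the same value for any coordinate that snaps to the same cell, so small shifts of the region are invisible to it. The interpolated value changes smoothly with position, which matters when predicting masks at pixel precision.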

Panoptic segmentation unifies semantic and instance segmentation into one task. Every pixel gets a class label, and pixels of countable objects also get an instance ID. "Stuff" classes (sky, road, grass) get only semantic labels. "Thing" classes (car, person, dog) get instance-level masks. The result is a single, non-overlapping labeling of the whole scene from one model.
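A small sketch of the output format may help. It merges a semantic label map with an instance ID map into per-pixel `(class, instance)` pairs, dropping the instance ID for stuff classes. The data layout, the use of `None` for stuff pixels, and the toy maps are assumptions for illustration; real benchmarks encode this differently.

```python
def panoptic_map(semantic, instance_ids, stuff_classes):
    """Combine per-pixel class names and instance IDs into (class, id) pairs.

    Pixels of stuff classes get id None; pixels of thing classes keep their id.
    """
    return [
        [
            (cls, None if cls in stuff_classes else inst)
            for cls, inst in zip(sem_row, inst_row)
        ]
        for sem_row, inst_row in zip(semantic, instance_ids)
    ]

stuff_classes = {"sky", "road", "grass"}
semantic = [["sky", "sky", "sky"],
            ["car", "car", "road"]]
instance_ids = [[0, 0, 0],
                [1, 2, 0]]  # the two car pixels belong to different cars

panoptic = panoptic_map(semantic, instance_ids, stuff_classes)
```

Semantic segmentation alone would label both car pixels identically. The panoptic output keeps them apart as car 1 and car 2 while leaving sky and road as undivided regions.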

Pose estimation takes this further by predicting the spatial configuration of a person's body. Instead of just a mask, the network outputs the 2D location of keypoints: shoulders, elbows, wrists, hips, knees, ankles. Systems like OpenPose detect multiple people's poses in real time, enabling applications from sports analytics to sign language recognition.

What does panoptic segmentation provide that semantic segmentation alone does not?

Chapter 6: Video Understanding

Images are snapshots. Videos are sequences. Understanding video requires reasoning about time: actions unfold over frames, objects move and interact, scenes change.

Key video understanding tasks:

Architectures for video must process the temporal dimension. Two-stream networks process appearance (RGB frames) and motion (optical flow) separately, then fuse. 3D convolutions (C3D, I3D) extend 2D conv kernels to space+time. Video transformers (TimeSformer, ViViT) apply attention across both spatial and temporal tokens.

The optical flow shortcut: Early video networks fed pre-computed optical flow as a second input stream. This explicitly provided motion information, boosting accuracy. But computing optical flow is expensive. Modern end-to-end networks learn to extract motion cues directly from raw frames, making the two-stream trick unnecessary — but it took years to match its accuracy.
Why is processing video harder than processing individual images?

Chapter 7: Vision and Language

Vision does not exist in isolation. Humans describe what they see in words, ask questions about images, and follow instructions that reference visual content. Vision-language models bridge the gap between seeing and saying.

Task | Input | Output
Image captioning | Image | "A dog playing fetch in a park"
Visual QA | Image + "What color is the car?" | "Red"
Visual grounding | Image + "the red car" | Bounding box around the red car
Text-to-image | "A cat wearing a hat" | Generated image

CLIP and the alignment revolution: OpenAI's CLIP (2021) was trained on 400 million image-text pairs from the internet. It learns a shared embedding space where images and their text descriptions are close together. This enables zero-shot classification: describe any category in words, and CLIP can recognize it without ever seeing a labeled example. "A photo of a golden retriever" matches golden retriever images, even though CLIP was never trained with an explicit label for that class.
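Zero-shot classification in a shared embedding space comes down to comparing one image vector against one text vector per candidate prompt. The sketch below uses hand-made 3-D vectors in place of real encoder outputs, so it shows only the comparison step. The function names, the toy embeddings, and the temperature value are assumptions for this example.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Softmax over cosine similarities between an image and each text prompt."""
    image = normalize(image_embedding)
    logits = {
        prompt: sum(a * b for a, b in zip(image, normalize(embedding))) / temperature
        for prompt, embedding in text_embeddings.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {prompt: math.exp(value - peak) for prompt, value in logits.items()}
    total = sum(exps.values())
    return {prompt: value / total for prompt, value in exps.items()}

# Hypothetical embeddings; a real system would obtain these from trained encoders.
text_embeddings = {
    "a photo of a golden retriever": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
image_embedding = [0.9, 0.1, 0.0]

probabilities = zero_shot_classify(image_embedding, text_embeddings)
prediction = max(probabilities, key=probabilities.get)
```

Adding a new category requires no retraining in this scheme: a new prompt is embedded and added to the dictionary.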

Modern vision-language models (GPT-4V, Gemini, LLaVA) go further. They combine a vision encoder (like ViT) with a large language model, enabling free-form conversation about images. You can ask "What is wrong with this circuit diagram?" and get a detailed technical answer.

Figure (Embedding Space Alignment): CLIP aligns image and text embeddings in a shared space. Matching pairs cluster together.

What makes CLIP's approach to recognition fundamentally different from traditional classifiers?

Chapter 8: Showcase — Detection Playground

Let's visualize how a sliding-window detector works. The detector slides across the image at multiple scales, scoring each window. High-scoring windows become detections. Non-maximum suppression (NMS) removes overlapping boxes, keeping only the best one per object.
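The scanning loop itself is short. The sketch below enumerates square windows at several sizes and keeps those whose score passes a threshold. The scoring function is a stand-in that simply rewards windows centered near a fixed point; a real detector would run a classifier on the window's pixels. All names and parameter values here are hypothetical.

```python
import math

def sliding_windows(image_width, image_height, window_sizes, stride):
    """Yield (x1, y1, x2, y2) square windows at every scale and position."""
    for size in window_sizes:
        for y in range(0, image_height - size + 1, stride):
            for x in range(0, image_width - size + 1, stride):
                yield (x, y, x + size, y + size)

def toy_score(box, target_center=(48.0, 48.0)):
    """Stand-in scorer: 1.0 when the window is centered on the target, lower farther away."""
    center_x = (box[0] + box[2]) / 2.0
    center_y = (box[1] + box[3]) / 2.0
    distance = math.hypot(center_x - target_center[0], center_y - target_center[1])
    return max(0.0, 1.0 - distance / 64.0)

def detect(image_width, image_height, score_fn, window_sizes, stride, threshold):
    """Score every window and keep those at or above the confidence threshold."""
    detections = []
    for box in sliding_windows(image_width, image_height, window_sizes, stride):
        score = score_fn(box)
        if score >= threshold:
            detections.append((box, score))
    return detections

windows = list(sliding_windows(64, 64, window_sizes=(32, 64), stride=16))
detections = detect(64, 64, toy_score, window_sizes=(32, 64), stride=16, threshold=0.7)
best_box, best_score = max(detections, key=lambda d: d[1])
```

Even on this 64x64 toy image, three neighboring windows pass the threshold for a single target, which is the duplicate problem that non-maximum suppression addresses.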

Figure (Sliding Window Detection + NMS): The detector scans the image. High-confidence boxes appear, then NMS removes duplicates.

Non-maximum suppression (NMS): A detector often produces many overlapping boxes for the same object. NMS greedily keeps the highest-confidence box, removes all boxes that overlap it above an IoU threshold, and repeats on the boxes that remain. This simple post-processing step is part of most detection pipelines.
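Greedy NMS fits in a few lines. This is a minimal sketch for a single class, assuming boxes in `(x1, y1, x2, y2)` format and detections given as `(box, score)` pairs. Production implementations usually run per class and are vectorized.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs; returns kept detections, best first."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-confidence box still in play
        kept.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept

detections = [
    ((12, 12, 52, 52), 0.8),      # near-duplicate of the first object
    ((10, 10, 50, 50), 0.9),      # best box for the first object
    ((100, 100, 140, 140), 0.7),  # a separate object
]
kept = non_max_suppression(detections)
```

The threshold is a trade-off: set it too low and nearby distinct objects are merged into one detection; set it too high and duplicates survive.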

Chapter 9: Connections

Recognition is the crown jewel of computer vision. Nearly every other topic connects to it:

Concept | Used In
CNN backbones (ResNet, ViT) | Ch 5 (architectures), Ch 9 (flow estimation), Ch 12 (depth)
Instance recognition / retrieval | Ch 7 (feature matching), Ch 8 (panorama recognition)
Object detection | Ch 9 (tracking), Ch 11 (SLAM), autonomous driving
Semantic segmentation | Ch 10 (matting), Ch 12 (stereo), Ch 13 (3D reconstruction)
Vision-language models | Ch 14 (neural rendering), robotics, embodied AI
Pose estimation | Ch 13 (body modeling), Ch 11 (camera pose)

Szeliski's perspective: "The distinction between recognition tasks is blurring. Modern architectures like Mask2Former and SAM (Segment Anything) handle classification, detection, and segmentation with a single model. The trend is toward general-purpose visual understanding systems that can answer any question about any image."
Which recognition technique enables a self-driving car to understand its surroundings at the pixel level?