Classification tells you "there's a cat." Detection tells you where. Segmentation tells you which pixels. Understanding tells you why the network thinks so.
Image classification answers a single question: "What is in this image?" You feed in a photo of a park, and the network says "dog." That's useful, but it throws away almost everything. Where is the dog? How big is it? What about the cat behind the tree? Which pixels belong to grass versus sky?
Computer vision has four core tasks, each progressively richer:
| Task | Output | Key Question |
|---|---|---|
| Classification | Single label | What is in this image? |
| Semantic Segmentation | Label per pixel | What class does each pixel belong to? |
| Object Detection | Bounding boxes + labels | Where are the objects, and what are they? |
| Instance Segmentation | Masks per object | Which pixels belong to which specific object? |
Notice the progression. Classification has no spatial extent — it just says "cat." Semantic segmentation labels every pixel but doesn't distinguish individual objects: two dogs next to each other are both "dog-pixels," indistinguishable. Object detection finds individual objects with bounding boxes. Instance segmentation combines the best of both: per-pixel masks for each individual object.
Richer outputs require richer architectures. Classification needs one vector. Segmentation needs an output the same size as the input image. Detection needs a variable-length list of boxes. Each step up in richness introduces new architectural challenges — and that's what this lecture is about.
Consider an autonomous vehicle. It needs classification to identify traffic signs. It needs semantic segmentation to know where the road surface is versus sidewalk. It needs object detection to locate and track every car, pedestrian, and cyclist with bounding boxes. And it needs instance segmentation to distinguish "the pedestrian crossing now" from "the pedestrian waiting on the curb."
No single task is sufficient. Modern vision systems stack these capabilities, and the architectures we'll study form the backbone of essentially every vision system deployed in the real world.
Spatial extent: the physical region in the image that an output describes. Classification has none (it describes the whole image). A bounding box defines a rectangle. A segmentation mask defines an arbitrary shape, pixel by pixel.
The goal is deceptively simple: given an image of size H×W, produce an output of size H×W where each pixel is assigned one of C class labels — sky, road, car, person, tree, etc. No notion of individual objects. Just: "this pixel is grass, that pixel is cat."
The naive approach: extract a small patch around each pixel, run it through a classification CNN, and assign the center pixel whatever class the CNN predicts. This works in principle — each patch provides local context. But it's catastrophically slow. For a 640×480 image, you'd need 307,200 independent forward passes. And you're re-computing features for nearly-identical overlapping patches.
Better idea: design a network made entirely of convolutional layers (no fully connected layers). Feed in the whole image, get out a tensor of scores with shape C×H×W. Take argmax across the C dimension at each pixel to get the segmentation map.
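The problem? If you maintain full resolution throughout, every layer operates at H×W. With 64 filters at 640×480, a single layer already produces 64 × 640 × 480 ≈ 19.7 million activations per image, and a deep network stacks many such layers. Deep networks at full resolution are prohibitively expensive in both memory and compute.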
The architecture that actually works is the Fully Convolutional Network (FCN), introduced by Long, Shelhamer, and Darrell in 2015. The insight: use the standard classification backbone (with pooling and strided convolutions) to downsample the image into rich, compact feature maps. Then use learned upsampling to expand those features back to full resolution.
The downsampling path uses the same operations you already know: convolutions, ReLUs, and pooling. It builds up semantic understanding — "this region contains a cat" — but at low spatial resolution. The upsampling path recovers spatial detail — "the cat's boundary is here" — and produces per-pixel predictions.
Downsampling is cheap and builds receptive field fast. A few pooling layers let each neuron "see" a large region of the image. Upsampling is the hard part: you need to recover fine-grained spatial detail from a compressed representation. The next chapter is entirely about how to do this well.
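Semantic segmentation uses per-pixel cross-entropy. For each pixel (h,w), compute the softmax over the C class scores s_{c,h,w}, then take the cross-entropy with the ground-truth label y_{h,w}:

$$L_{h,w} = -\log \frac{\exp\left(s_{y_{h,w},h,w}\right)}{\sum_{c=1}^{C} \exp\left(s_{c,h,w}\right)}$$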
This is identical to classification cross-entropy, just applied independently at every pixel. The total loss is the average over all H×W pixels.
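As a minimal sketch of this loss in pure Python, assume the scores are stored as nested lists indexed `[c][h][w]`. The function name `pixelwise_cross_entropy` and the toy inputs are illustrative, not taken from any library.

```python
import math

def pixelwise_cross_entropy(scores, labels):
    """Mean per-pixel cross-entropy.

    scores: nested list indexed [c][h][w] of raw class scores (C x H x W).
    labels: nested list indexed [h][w] of integer ground-truth class ids.
    """
    num_classes = len(scores)
    height, width = len(labels), len(labels[0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            pixel_scores = [scores[c][h][w] for c in range(num_classes)]
            # Log-sum-exp with max subtraction for numerical stability.
            m = max(pixel_scores)
            log_z = m + math.log(sum(math.exp(s - m) for s in pixel_scores))
            # -log softmax(pixel_scores)[label]
            total += log_z - pixel_scores[labels[h][w]]
    # Average over all H x W pixels.
    return total / (height * width)

# Toy example: 2 classes, a 1x2 image, all scores equal.
scores = [[[0.0, 0.0]], [[0.0, 0.0]]]
labels = [[0, 1]]
loss = pixelwise_cross_entropy(scores, labels)
```

With uniform scores every pixel contributes log(2), so the toy loss is about 0.693, the same value a two-class classifier gives when it has no preference.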
Semantic segmentation treats all instances of a class identically. If two cats sit side by side, both regions are labeled "cat" with no way to tell them apart. This is a fundamental limitation — and it's why we need instance segmentation (Chapter 7).
Image: 3×224×224 (RGB). Classes: C = 21 (PASCAL VOC). FCN output: 21×224×224. At pixel (100, 150), the 21-dim vector might be [−2.1, 0.3, −0.8, ..., 4.7, ...]. The index of 4.7 (say index 8 = "cat") becomes the predicted label for that pixel.
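The argmax step can be sketched in a few lines of pure Python. The helper name `argmax_segmentation` and the tiny 3-class, 2×2 score tensor are made up for illustration.

```python
def argmax_segmentation(scores):
    """Turn C x H x W class scores into an H x W map of class indices."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][h][w])
         for w in range(width)]
        for h in range(height)
    ]

# Hypothetical scores: 3 classes on a 2x2 image.
scores = [
    [[0.1, 2.0], [0.3, 0.2]],  # class 0
    [[1.5, 0.4], [0.1, 0.9]],  # class 1
    [[0.2, 0.1], [4.7, 0.3]],  # class 2
]
seg_map = argmax_segmentation(scores)
```

Here `seg_map` holds one class index per pixel, which is exactly the segmentation map described above.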
The downsampling path of an FCN is well-understood: pooling and strided convolutions compress spatial dimensions. But how do you go back up? How do you turn a 512×7×7 feature map into a C×224×224 prediction? This chapter covers three approaches, each more sophisticated than the last.
Nearest-neighbor unpooling simply repeats each value to fill a larger grid. A 2×2 region becomes a 4×4 region by copying each value into a 2×2 block. It's simple but blocky — no new information is created.
"Bed of nails" unpooling places each value in the top-left corner of its 2×2 block and fills the rest with zeros. It preserves position but creates a sparse, jagged output.
Max unpooling remembers where each max came from during max-pooling. During upsampling, it places each value back in its original position and fills the rest with zeros. This preserves spatial structure much better.
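The three fixed unpooling schemes are easy to compare on a tiny single-channel input. The following is a rough sketch in pure Python with 2×2 windows only; all function names are illustrative.

```python
def nearest_neighbor_unpool(x):
    """Repeat every value into a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def bed_of_nails_unpool(x):
    """Put every value in the top-left corner of a 2x2 block, zeros elsewhere."""
    out = []
    for row in x:
        top = []
        for v in row:
            top.extend([v, 0])
        out.append(top)
        out.append([0] * len(top))
    return out

def max_pool_with_indices(x):
    """2x2 max pooling that also records the (row, col) of each max."""
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, position = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def max_unpool(values, indices, height, width):
    """Place each value back at its remembered position, zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for value_row, index_row in zip(values, indices):
        for v, (r, c) in zip(value_row, index_row):
            out[r][c] = v
    return out

small = [[1, 2], [3, 4]]
nn = nearest_neighbor_unpool(small)
nails = bed_of_nails_unpool(small)

image = [[1, 2, 6, 3], [3, 5, 2, 1], [1, 2, 2, 1], [7, 3, 4, 8]]
pooled, where = max_pool_with_indices(image)
restored = max_unpool(pooled, where, 4, 4)
```

In `restored`, the four maxima land back where they originally sat in `image`, whereas `nails` always uses the top-left corner regardless of where the information came from.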
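The key insight: if convolution can be written as multiplication by a matrix $X$, then multiplying by the transpose $X^T$ maps in the opposite direction, from the small output shape back to the large input shape. This is called a transposed convolution (sometimes misleadingly called "deconvolution": it restores the shape, not the original values).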
Concretely, a transposed convolution with stride 2 takes each input value, multiplies it by a learned kernel, and places the result in a larger output grid with stride-2 spacing. Overlapping regions are summed. The weights are learned end-to-end through backpropagation, so the network discovers the best upsampling strategy for the task.
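A minimal single-channel sketch of this "stamp and sum" view, in pure Python with no padding. The function name is illustrative, and in a real network the kernel values would be learned rather than fixed.

```python
def transposed_conv2d(x, kernel, stride):
    """Single-channel transposed convolution without padding.

    Each input value scales the kernel, and the scaled kernel is added
    into the output at an offset of (i * stride, j * stride).
    """
    in_h, in_w = len(x), len(x[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            for a in range(k_h):
                for b in range(k_w):
                    # Overlapping stamps are summed.
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out

x = [[1.0, 2.0], [3.0, 4.0]]
ones3 = [[1.0] * 3 for _ in range(3)]
y = transposed_conv2d(x, ones3, stride=2)
```

With a 3×3 kernel and stride 2 the 2×2 input becomes 5×5, and the center output pixel receives all four stamps while the corners receive only one. That uneven overlap is worth keeping in mind when choosing kernel size and stride.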
Transposed convolutions can produce checkerboard patterns when the kernel size is not divisible by the stride (e.g., kernel=3, stride=2), because neighboring output pixels then receive contributions from different numbers of overlapping kernel placements. The fix: use kernel sizes that are divisible by the stride (e.g., kernel=4, stride=2), or use bilinear upsampling followed by a regular convolution.
There's a fundamental tension in the downsample-then-upsample design. Downsampling builds semantic understanding (what things are) but destroys spatial detail (where exact boundaries are). Upsampling tries to recover that spatial detail — but the information was already thrown away by pooling.
The U-Net (Ronneberger et al., 2015) solves this with skip connections. At each resolution level, it copies feature maps from the downsampling path directly to the corresponding level in the upsampling path, concatenating them with the upsampled features.
Skip connections: direct connections that copy feature maps from the encoder (downsampling path) to the decoder (upsampling path) at matching resolutions. They provide high-resolution spatial detail to the decoder, compensating for information lost during pooling.
The architecture is symmetric — shaped like the letter U. The encoder and decoder have the same number of levels. Each skip connection carries the encoder's fine-grained spatial features to the decoder, so the decoder doesn't have to "reinvent" edge locations from scratch.
Without skip connections, the decoder sees only the bottleneck features (14×14 with 1024 channels). It knows "there's a cat somewhere in the top-left quadrant" but can't pinpoint the exact boundary. Skip connections from the 224×224 level provide edge maps, texture detail, and spatial precision. The decoder fuses "what" (from the bottleneck) with "where" (from the skips).
Input: 3×256×256. Encoder: Level 1: 64×256×256 → pool → Level 2: 128×128×128 → pool → Level 3: 256×64×64 → pool → Level 4: 512×32×32 → pool → Bottleneck: 1024×16×16.
Decoder: upsample to 512×32×32, concat with encoder Level 4 (512×32×32) = 1024×32×32 → conv to 512×32×32. Repeat upward. Final: 64×256×256 → 1×1 conv → C×256×256.
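The shape bookkeeping in this example can be checked with a short script. This sketch only tracks tensor shapes and contains no actual layers; it assumes channels double at every pooling step and that every upsampling step halves them, as in the walkthrough above. The function name `unet_shapes` is illustrative.

```python
def unet_shapes(in_size=256, base_channels=64, num_pools=4):
    """Track (C, H, W) shapes through a U-Net-style encoder and decoder."""
    encoder = []
    channels, size = base_channels, in_size
    for _ in range(num_pools):
        encoder.append((channels, size, size))
        # Pooling halves the resolution; the next level doubles the channels.
        channels, size = channels * 2, size // 2
    bottleneck = (channels, size, size)

    decoder = []
    for skip_c, skip_h, skip_w in reversed(encoder):
        upsampled = (channels // 2, skip_h, skip_w)
        # Concatenate along the channel axis with the matching encoder level.
        concat = (upsampled[0] + skip_c, skip_h, skip_w)
        fused = (skip_c, skip_h, skip_w)  # after the convs that follow the concat
        decoder.append({"upsampled": upsampled, "concat": concat, "fused": fused})
        channels = skip_c
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shapes()
for level in decoder:
    print(level)
```

The first decoder level prints an upsampled shape of (512, 32, 32) and a concatenated shape of (1024, 32, 32), matching the numbers above, and the last level ends at (64, 256, 256) just before the final 1×1 convolution.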
Object detection adds a fundamentally harder requirement: not only must you classify objects, but you must localize each one with a bounding box (x, y, width, height). And the number of objects varies per image — one image might have 2 dogs, another might have 15 cars. You can't just predict a fixed-size output.
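If there's exactly one object, the problem is simple. Take your classification CNN, attach two heads: one predicting class scores (softmax loss), and one predicting four box coordinates (L2 regression loss). Train with a multitask loss, a weighted sum of the two:

$$L = L_{\text{cls}} + \lambda\, L_{\text{box}}$$

Here $L_{\text{cls}}$ is the softmax loss, $L_{\text{box}}$ is the L2 regression loss, and $\lambda$ is a hyperparameter that balances the two terms.
This works well for single-object localization. But it fails when there are multiple objects, because you'd need a variable number of output boxes.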
One approach: try every possible crop of the image at every position, scale, and aspect ratio. Run each crop through a classifier. If it says "dog" with high confidence, that crop is a detection. This is computationally insane — there are millions of possible crops.
A smarter approach: use a fast, cheap algorithm to propose ~2000 "blobby" image regions that are likely to contain objects. Selective Search groups similar pixels into segments, then merges segments into larger regions based on color, texture, and containment. It runs in a few seconds on CPU and gives you a manageable set of candidate boxes.
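The original R-CNN (Girshick et al., 2014) combines region proposals with CNN features in a straightforward pipeline: (1) run Selective Search to get ~2000 region proposals; (2) crop each proposal and warp it to the CNN's fixed input size; (3) run each warped crop through the CNN to extract features; (4) classify each region's features with per-class SVMs; (5) refine each box with a learned bounding-box regressor.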
Each image requires ~2000 independent forward passes through the CNN (one per proposal). At test time, this takes ~47 seconds per image on a GPU. Training is a multi-stage mess: first train the CNN, then extract features for all proposals, then train SVMs, then train box regressors. Nothing is end-to-end.
Intersection over Union (IoU): the standard metric for measuring bounding box overlap. Given two boxes A and B: IoU(A, B) = Area(A ∩ B) / Area(A ∪ B). IoU = 1 means perfect overlap. IoU = 0 means no overlap. Typically, a detection is "correct" if IoU with the ground-truth box is ≥ 0.5.
Box A: top-left (10, 10), bottom-right (50, 50). Area = 40 × 40 = 1600.
Box B: top-left (30, 30), bottom-right (70, 70). Area = 40 × 40 = 1600.
Intersection: top-left (30, 30), bottom-right (50, 50). Area = 20 × 20 = 400.
Union: 1600 + 1600 − 400 = 2800.
IoU = 400 / 2800 = 0.143. These boxes barely overlap — this would not count as a correct detection.
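The same calculation as a small function, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates (the function name and box format are choices made for this sketch):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

box_a = (10, 10, 50, 50)
box_b = (30, 30, 70, 70)
overlap = iou(box_a, box_b)  # 400 / 2800
```

Clamping the overlap width and height at zero is what makes disjoint boxes return 0 instead of a spurious positive area.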
R-CNN's bottleneck is obvious: running the CNN 2000 times per image. The fix is equally obvious in hindsight: run the CNN once on the whole image, then crop features for each region from the shared feature map.
Fast R-CNN (Girshick, 2015) does exactly this. The full image passes through a backbone CNN (e.g., VGG-16) to produce a feature map. For each region proposal, it crops and resizes the corresponding region from the feature map using RoI Pooling, then feeds the fixed-size features through a small per-region network that outputs class scores and box corrections.
RoI Pooling: projects a region proposal onto the feature map, snaps to grid cells, divides into a fixed grid (e.g., 7×7), and max-pools within each sub-region. Output is always the same size regardless of proposal size. The "snap to grid" step introduces slight misalignment.
RoI Align: used in Mask R-CNN. Instead of snapping to grid cells, it uses bilinear interpolation to sample feature values at exact (fractional) positions. This eliminates misalignment artifacts and is critical for pixel-accurate tasks like instance segmentation.
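The difference between snapping and interpolating comes down to how a single fractional location is read from the feature map. The sketch below shows only that sampling step on a single-channel map; a full RoI Align would also divide the region into bins and pool several samples per bin. The helper name `bilinear_sample` is illustrative.

```python
import math

def bilinear_sample(feature, y, x):
    """Read a 2D feature map at a fractional (y, x) position."""
    height, width = len(feature), len(feature[0])
    # Clamp to the valid range so border samples stay inside the map.
    y = min(max(y, 0.0), height - 1)
    x = min(max(x, 0.0), width - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature = [[0.0, 10.0], [20.0, 30.0]]
aligned = bilinear_sample(feature, 0.5, 0.5)  # blends all four neighbors
snapped = feature[int(0.5)][int(0.5)]         # snap-to-grid reads one cell
```

At the position (0.5, 0.5) the interpolated value is the average of the four surrounding cells, while the snapped read returns only the top-left cell. That discrepancy is the misalignment RoI Align removes.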
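Fast R-CNN is trained end-to-end with a multitask loss that sums a classification term and a box-regression term for each region:

$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda\,[u \ge 1]\,L_{\text{loc}}(t^u, v)$$

Here $p$ is the predicted class distribution, $u$ is the ground-truth class (0 = background), $t^u$ is the predicted box correction for class $u$, $v$ is the target correction, and the indicator $[u \ge 1]$ switches off the box loss for background regions.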
Fast R-CNN made the per-region computation fast. But Selective Search still runs on CPU and takes ~2 seconds per image — now dominating runtime. What if the CNN itself could propose regions?
Faster R-CNN (Ren et al., 2015) replaces Selective Search with a learned Region Proposal Network (RPN) that shares computation with the detection backbone. The RPN slides a small network over the feature map and, at each position, predicts whether each of K anchor boxes contains an object and what corrections to apply.
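Anchor boxes: a set of predefined bounding boxes with different scales and aspect ratios, centered at each position of the feature map. Typical: 3 scales × 3 ratios = 9 anchors per position. The RPN doesn't predict boxes from scratch; it predicts corrections (offsets) from these anchors.
At each of the H×W positions in the feature map, the RPN outputs K objectness scores (one per anchor: "object" or "not object"; the original paper uses 2K scores with a two-way softmax) and 4K box offsets (four corrections per anchor).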
For a feature map of 20×15 with K=9 anchors, that's 20 × 15 × 9 = 2700 candidate boxes. The top ~300 by objectness score are kept as proposals for the second stage.
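To make the anchor count concrete, here is a rough sketch that enumerates anchors for a 20×15 feature map. The stride of 32 (which would correspond to a 640×480 image), the three scales, and the convention that a ratio means width/height are assumptions chosen for illustration, not values stated in the text.

```python
import math

def make_anchors(feat_w, feat_h, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2) in image coordinates."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = width / height (assumed convention)
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)  # keeps area = scale ** 2
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(
    feat_w=20, feat_h=15, stride=32,
    scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0),
)
print(len(anchors))  # 20 * 15 * 9
```

Scaling width and height by the square root of the ratio keeps each anchor's area fixed at `scale ** 2`, so the scale controls size and the ratio controls only shape.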
Non-Maximum Suppression (NMS): after detection, many overlapping boxes predict the same object. NMS removes redundant detections: sort all boxes by confidence, take the top box, remove all remaining boxes with IoU > threshold (typically 0.5) with it, repeat. This keeps only the most confident, non-overlapping predictions.
Faster R-CNN is a two-stage detector. Stage 1 (RPN): proposes regions (class-agnostic, just "object or not"). Stage 2 (per-region head): classifies each proposal and refines its box. This two-stage design is slower but more accurate than single-stage alternatives — the second stage operates on focused, pre-screened regions.
Three detections for "dog": Box A (conf=0.95), Box B (conf=0.88, IoU with A = 0.72), Box C (conf=0.60, IoU with A = 0.15). NMS threshold = 0.5.
Step 1: Select A (highest confidence). Remove B (IoU 0.72 > 0.5). Keep C (IoU 0.15 < 0.5).
Step 2: Select C. No remaining boxes.
Output: Boxes A and C. NMS correctly suppressed the duplicate B while keeping the distinct detection C.
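The greedy procedure from this example fits in a few lines of pure Python. The box coordinates below are hypothetical, chosen only so that the IoUs with box A come out close to the 0.72 and 0.15 used above.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    # Candidate indices, most confident first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the chosen one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [
    (0, 0, 100, 100),    # A
    (10, 7, 110, 107),   # B: heavy overlap with A
    (50, 48, 150, 148),  # C: slight overlap with A
]
scores = [0.95, 0.88, 0.60]
kept = nms(boxes, scores, iou_threshold=0.5)
```

`kept` contains the indices of A and C, mirroring the two steps above. Note that the loop is quadratic in the number of boxes, which is one reason detectors filter by score before running NMS.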
Two-stage detectors like Faster R-CNN are accurate but have a fundamental speed bottleneck: the per-region second stage. What if you skipped the proposal step entirely and predicted boxes and classes directly from the feature map in a single pass?
YOLO (Redmon et al., 2016) divides the image into an S×S grid. For each grid cell, it predicts B bounding boxes, each with 5 values (x, y, w, h, confidence), plus one set of C class probabilities shared by all boxes in that cell. The entire output is a single tensor of shape S×S×(5B+C), produced by one forward pass through the network.
Each grid cell is "responsible" for detecting objects whose center falls within it. The confidence score reflects both the probability that the box contains an object and how well the box fits: confidence = P(object) × IoU(pred, truth).
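A small sketch of the output-shape arithmetic and the "responsible cell" rule. The function names are illustrative; S=7, B=2, C=20 are the settings the YOLO paper uses for PASCAL VOC, and the object center and the 448×448 input size are just example inputs.

```python
def yolo_output_shape(S, B, C):
    """Shape of the YOLO prediction tensor: S x S x (5B + C)."""
    return (S, S, 5 * B + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Grid cell (row, col) containing the object center (cx, cy)."""
    col = min(int(cx * S / img_w), S - 1)
    row = min(int(cy * S / img_h), S - 1)
    return row, col

def box_confidence(p_object, iou_with_truth):
    """Training target for the confidence score: P(object) * IoU."""
    return p_object * iou_with_truth

shape = yolo_output_shape(S=7, B=2, C=20)
cell = responsible_cell(cx=300, cy=100, img_w=448, img_h=448, S=7)
```

With these settings each cell emits 30 numbers: two boxes of five values each, plus twenty class probabilities.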
YOLO runs at 45 FPS (Fast YOLO: 155 FPS) versus Faster R-CNN's ~7 FPS. The trade-off: YOLO is less accurate, especially for small objects and objects near each other. The coarse S×S grid means each cell can only detect one object, and the fixed number of boxes per cell limits detection capacity.
SSD (Liu et al., 2016) improves on YOLO by making predictions at multiple feature map scales. Early layers (high resolution) detect small objects. Later layers (low resolution) detect large objects. At each position of each feature map, SSD predicts offsets and class scores for a set of anchor boxes — like running an RPN at every scale, but predicting actual classes instead of just "object/not."
Single-stage detectors suffer from extreme class imbalance: out of thousands of anchor boxes, only a handful contain objects. The vast majority are background. Standard cross-entropy loss treats all samples equally, so the model is overwhelmed by easy negatives.
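Focal Loss (Lin et al., 2017) down-weights easy examples and focuses training on hard ones:

$$\mathrm{FL}(p_t) = -\alpha\,(1 - p_t)^{\gamma}\,\log(p_t)$$

Here $p_t$ is the probability the model assigns to the true class, $\gamma \ge 0$ controls how strongly easy examples are down-weighted ($\gamma = 0$ recovers weighted cross-entropy), and $\alpha$ is a class-balancing weight.
With focal loss, RetinaNet (the single-stage detector introduced in the same paper) matches two-stage detector accuracy while running at single-stage speed.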
DETR (Carion et al., 2020) takes a radically different approach: no anchors, no NMS, no hand-designed components. A CNN backbone extracts features, a Transformer encoder-decoder processes them, and a set of learned "object queries" directly predict a fixed-size set of N detections. Training uses bipartite matching (Hungarian algorithm) to assign each prediction to a ground-truth object.
Bipartite matching: a one-to-one assignment between predicted boxes and ground-truth boxes that minimizes total matching cost (a combination of classification loss and box distance). The Hungarian algorithm solves this optimally. Unmatched predictions are trained to predict "no object."
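To illustrate what the matching computes, the sketch below finds the minimum-cost one-to-one assignment by brute force over permutations. This is not the Hungarian algorithm (which reaches the same optimum in polynomial time); it is only practical for a handful of boxes. The cost values are made up, and in practice a solver such as SciPy's `linear_sum_assignment` would be used instead.

```python
from itertools import permutations

def match_predictions(cost):
    """Minimum-cost one-to-one matching by exhaustive search.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Assumes at least as many predictions as ground-truth objects.
    Returns ({prediction index: ground-truth index}, total cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Each candidate picks a distinct prediction for every ground-truth object.
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best_total, best_assignment = total, preds
    matches = {p: g for g, p in enumerate(best_assignment)}
    return matches, best_total

# Hypothetical costs: 3 predictions (rows) vs 2 ground-truth objects (columns).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
matches, total = match_predictions(cost)
unmatched = [i for i in range(len(cost)) if i not in matches]
```

Prediction 1 is matched to object 0 and prediction 0 to object 1, while prediction 2 is left unmatched and would be trained toward the "no object" class.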
| Detector | Stages | Proposals | NMS? | Time per image | Accuracy |
|---|---|---|---|---|---|
| R-CNN | Multi | Selective Search | Yes | ~47s/img | Moderate |
| Fast R-CNN | 2 | Selective Search | Yes | ~0.3s | Good |
| Faster R-CNN | 2 | RPN (learned) | Yes | ~0.14s | Very good |
| YOLO | 1 | Grid cells | Yes | ~0.02s | Moderate |
| SSD | 1 | Multi-scale anchors | Yes | ~0.02s | Good |
| RetinaNet | 1 | Anchors + focal loss | Yes | ~0.07s | Very good |
| DETR | 1 | Object queries | No | ~0.04s | Very good |
Consider γ = 2, α = 0.25. An easy background example: p_t = 0.99. Standard loss: −log(0.99) = 0.01. Focal loss: −0.25 × (0.01)² × log(0.99) = 0.25 × 0.0001 × 0.01 = 2.5 × 10⁻⁷. Nearly zero!
A hard misclassified example: p_t = 0.1. Standard loss: −log(0.1) = 2.30. Focal loss: −0.25 × (0.9)² × log(0.1) = 0.25 × 0.81 × 2.30 = 0.47. Still significant. Focal loss lets the network focus on what matters.
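These two numbers can be reproduced with a one-line function. This is a sketch for a single example with a known true-class probability `p_t`; the function name and default arguments are illustrative.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Focal loss for one example, given the true-class probability p_t."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    """Standard cross-entropy for comparison."""
    return -math.log(p_t)

easy = focal_loss(0.99)  # about 2.5e-7
hard = focal_loss(0.1)   # about 0.47
# How much more the hard example weighs than the easy one under each loss.
ratio_ce = cross_entropy(0.1) / cross_entropy(0.99)
ratio_fl = hard / easy
```

Under plain cross-entropy the hard example weighs roughly 230 times more than the easy one; under focal loss with these settings the ratio grows to well over a million, which is the re-balancing effect in numbers.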
Semantic segmentation labels every pixel but doesn't distinguish individual objects. Object detection finds individual objects but only with coarse bounding boxes. Instance segmentation gives you the best of both worlds: a precise, pixel-level mask for each individual object in the image.
Mask R-CNN (He et al., 2017) is stunningly simple. Take Faster R-CNN, and add a third head: a small fully-convolutional network that predicts a binary segmentation mask for each detected object. That's it. Three heads, three tasks, one unified architecture.
A crucial design decision: the mask head predicts C separate binary masks (one per class), not one multi-class mask. Only the mask corresponding to the predicted class is used. This decouples mask prediction from classification — the mask branch doesn't need to "decide" what class the object is.
RoI Pooling snaps region boundaries to integer coordinates, introducing ~0.5 pixel misalignment per pooling level. For classification, this is irrelevant. For mask prediction, it's devastating — your mask is literally shifted relative to the object. RoI Align uses bilinear interpolation at exact fractional coordinates, eliminating this misalignment. Switching from Pool to Align improved mask AP by 3+ points.
Instance segmentation ignores "stuff" categories (sky, road, grass) — it only masks "things" (countable objects). Panoptic segmentation unifies both: every pixel gets both a class label (semantic) and, for "things" classes, an instance ID. It's the ultimate vision task — complete pixel-level understanding of the scene.
Things: countable objects with well-defined shape (car, person, cat). Stuff: amorphous regions (sky, grass, road, water). Semantic segmentation covers both. Instance segmentation covers only things. Panoptic segmentation covers everything.
Image with 5 detected objects. Each gets RoI Aligned features of 256×14×14. Mask head: two 3×3 conv layers (256 channels each), then a 2×2 transposed conv (upsamples to 28×28), then a 1×1 conv to C channels. Output per object: 80×28×28 (one binary mask per COCO class). At inference, use the predicted class to select which of the 80 masks to threshold.
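The final selection step can be sketched as follows, with mask probabilities stored as nested lists indexed `[class][h][w]`. The helper name, the tiny 2-class 2×2 input, and the 0.5 threshold are assumptions for illustration.

```python
def select_instance_mask(mask_probs, predicted_class, threshold=0.5):
    """Pick the predicted class's mask and binarize it.

    mask_probs: per-class mask probabilities, indexed [class][h][w].
    """
    chosen = mask_probs[predicted_class]
    return [[1 if p > threshold else 0 for p in row] for row in chosen]

# Hypothetical head output for one object: 2 classes, 2x2 masks.
mask_probs = [
    [[0.9, 0.2], [0.6, 0.1]],  # mask for class 0
    [[0.1, 0.7], [0.3, 0.8]],  # mask for class 1
]
binary_mask = select_instance_mask(mask_probs, predicted_class=1)
```

The masks for all other classes are simply ignored, which is what lets the mask branch avoid competing across classes.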
Neural networks are often called "black boxes." But we can peer inside them. This chapter covers four techniques for understanding what a CNN has learned and why it makes specific predictions.
The simplest approach: visualize the weights of the first convolutional layer directly. Since first-layer filters operate on RGB pixels, each filter is a small 3×k×k image (k is the kernel size) that you can display. In virtually every trained network, first-layer filters learn oriented edges, color gradients, and blob detectors, reminiscent of the features found in the biological visual cortex.
This only works for the first layer. Later filters operate on abstract feature maps, not raw pixels, so their weights aren't directly interpretable as images.
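Question: which pixels in the input image are most responsible for the network's prediction? Compute the gradient of the class score $S_c$ with respect to the input image pixels $I$, then take the maximum absolute value across the color channels:

$$M(h,w) = \max_{c'} \left| \frac{\partial S_c}{\partial I_{c',h,w}} \right|$$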
The result is a grayscale map highlighting which image regions the network "looks at." Saliency maps typically highlight object boundaries and discriminative features (e.g., a dog's face, a car's wheels).
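Class Activation Mapping (CAM) (Zhou et al., 2016) exploits the structure of networks ending in global average pooling + fully connected layer. The class score $S_c$ can be decomposed as a weighted sum of the last conv layer's feature maps:

$$S_c = \sum_k w_k^c \cdot \frac{1}{HW} \sum_{h,w} A_{h,w,k} = \frac{1}{HW} \sum_{h,w} M_c(h,w), \qquad M_c(h,w) = \sum_k w_k^c\, A_{h,w,k}$$

Here $A_{h,w,k}$ is channel $k$ of the last conv layer's activations (of spatial size H×W), $w_k^c$ is the FC weight connecting channel $k$ to class $c$, and $M_c$ is the class activation map.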
CAM's limitation: it only works at the last conv layer and requires a specific architecture (global average pooling before the classifier).
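Grad-CAM (Selvaraju et al., 2017) generalizes CAM to any convolutional layer of any CNN. Instead of using the FC layer weights, it uses the gradients of the class score with respect to the layer's activations:

$$\alpha_k = \frac{1}{HW} \sum_{h,w} \frac{\partial S_c}{\partial A_{h,w,k}}, \qquad M(h,w) = \mathrm{ReLU}\Big(\sum_k \alpha_k\, A_{h,w,k}\Big)$$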
The ReLU keeps only positive contributions. We want to highlight regions that increase the class score (features that support the prediction), not regions that decrease it. Negative values would correspond to features that support other classes.
A disturbing discovery: adding a tiny, imperceptible perturbation to an image can flip the network's prediction entirely. An image of a panda, classified correctly with 57.7% confidence, becomes a "gibbon" with 99.3% confidence after adding noise invisible to the human eye.
These adversarial examples are generated by gradient ascent on the loss with respect to input pixels — essentially asking "what small change to the pixels would most increase the loss for the correct class?"
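One concrete variant of this idea is the fast gradient sign method (FGSM), which takes a single step of size ε in the direction of the sign of the input gradient. The sketch below applies it to a toy logistic-regression "network" whose input gradient can be written by hand; the weights, input, and ε are made up for illustration and stand in for a real CNN and image.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(x, w, b, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    # For a logistic model, d(loss)/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def fgsm_step(x, w, b, y, eps):
    """Move every input value by eps in the direction that increases the loss."""
    _, grad = loss_and_input_grad(x, w, b, y)
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

x = [0.2, -0.1, 0.4]   # toy "image" with three pixels
w = [1.0, -2.0, 0.5]   # toy model weights
b, y = 0.0, 1          # bias and true label
x_adv = fgsm_step(x, w, b, y, eps=0.1)

loss_before, _ = loss_and_input_grad(x, w, b, y)
loss_after, _ = loss_and_input_grad(x_adv, w, b, y)
```

No pixel moves by more than ε, yet the loss on the true label goes up. In a high-dimensional image, many such tiny per-pixel changes add up to a large change in the score.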
Adversarial examples transfer between models: an attack crafted for ResNet often fools VGG too. They work in the physical world: printed adversarial patches can fool autonomous vehicle classifiers. This is an active area of security research with no complete solution yet.
Layer activations $A$: 14×14×512. Target class: "cat" (index 281). Compute $\partial S_{281} / \partial A$ (same shape: 14×14×512). Global-average-pool the gradient over the 14×14 spatial positions of each channel to get $\alpha \in \mathbb{R}^{512}$. Weighted sum: $M(h,w) = \mathrm{ReLU}\big(\sum_{k=1}^{512} \alpha_k A_{h,w,k}\big)$, with $M \in \mathbb{R}^{14 \times 14}$. Upsample to 224×224 via bilinear interpolation. The result highlights the cat's face and body, the regions driving the "cat" prediction.
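Given the activations and the gradients, the Grad-CAM arithmetic itself is short. The sketch below assumes both are already available as nested lists indexed `[h][w][k]` (in a real framework they would come from a forward and backward pass), and it omits the final upsampling step. The toy 2×2×2 values are made up.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from layer activations and class-score gradients.

    Both inputs are nested lists indexed [h][w][k] with identical shapes.
    Returns (alpha, cam): per-channel weights and the H x W heatmap.
    """
    height, width = len(activations), len(activations[0])
    channels = len(activations[0][0])
    # alpha_k: spatial average of the gradient for channel k.
    alpha = [
        sum(gradients[h][w][k] for h in range(height) for w in range(width))
        / (height * width)
        for k in range(channels)
    ]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        [max(0.0, sum(alpha[k] * activations[h][w][k] for k in range(channels)))
         for w in range(width)]
        for h in range(height)
    ]
    return alpha, cam

# Toy 2x2 feature map with 2 channels.
activations = [[[1.0, 0.0], [0.0, 1.0]],
               [[2.0, 1.0], [0.0, 0.0]]]
gradients = [[[0.5, -1.0], [0.5, -1.0]],
             [[0.5, -1.0], [0.5, -1.0]]]
alpha, cam = grad_cam(activations, gradients)
```

Channel 0 supports the class (positive weight) and channel 1 opposes it, so only the location where channel 0 dominates survives the ReLU.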
Time to put it all together. The full object detection pipeline turns raw pixels into classified, localized objects in a sequence of stages.
In Faster R-CNN, all seven stages are differentiable (or made approximately differentiable). The backbone, RPN, and detection head are trained jointly. Gradients flow from the final classification loss all the way back to the first conv layer. The network learns to extract features that are simultaneously good for proposing regions AND classifying them.
Real detection systems add several refinements beyond the basic pipeline:
| Technique | What It Does | Why It Helps |
|---|---|---|
| Feature Pyramid Network (FPN) | Creates top-down feature maps at multiple scales | Detects objects at all sizes — small objects from high-res features, large objects from low-res features |
| Multi-scale training | Randomly resizes training images | Teaches the network to handle objects at varied scales |
| Soft-NMS | Reduces confidence instead of removing boxes | Handles highly-overlapping objects (e.g., a crowd) |
| Test-time augmentation | Run detection on flipped/scaled copies, merge | Improves accuracy at the cost of speed |
| Class-agnostic NMS | Run NMS across all classes | Prevents multiple classes from claiming the same box |
Input: 3×800×600. Backbone (ResNet-50 + FPN): features at scales {1/4, 1/8, 1/16, 1/32, 1/64}. RPN: 9 anchors per position per scale → ~360,000 candidate boxes (about 40,000 feature-map positions across the five scales × 9). After NMS + top-300: 300 proposals. RoI Align: 300 × 256×7×7 feature tensors. Per-region head: 300 class vectors (81-dim for COCO) + 300 box vectors (4×81). After score thresholding (keep conf > 0.05) + class-wise NMS: ~20-50 final detections. Total inference time: ~90ms on a V100 GPU.
This lecture covered the full spectrum of dense prediction tasks in computer vision, from labeling every pixel to finding and masking individual objects, plus techniques for understanding what the network has learned.
| Year | Method | Key Innovation | Time per image |
|---|---|---|---|
| 2014 | R-CNN | CNN features for detection | ~47s/img |
| 2015 | Fast R-CNN | Shared backbone + RoI Pooling | ~0.3s |
| 2015 | Faster R-CNN | Learned region proposals (RPN) | ~0.14s |
| 2016 | YOLO | Single-stage grid-based detection | ~0.02s |
| 2016 | SSD | Multi-scale single-stage detection | ~0.02s |
| 2017 | RetinaNet | Focal loss for class imbalance | ~0.07s |
| 2017 | Mask R-CNN | Instance masks + RoI Align | ~0.2s |
| 2020 | DETR | Transformer, no anchors, no NMS | ~0.04s |
| Concept | What It Solves |
|---|---|
| FCN + Downsample/Upsample | Efficient per-pixel prediction at full resolution |
| Transposed Convolution | Learnable upsampling (better than fixed interpolation) |
| U-Net Skip Connections | Preserves spatial detail lost during downsampling |
| Anchor Boxes | Efficient multi-scale, multi-ratio box proposals |
| RoI Align | Pixel-accurate feature cropping (no quantization) |
| NMS | Removes duplicate detections |
| Focal Loss | Handles extreme class imbalance in single-stage detectors |
| Saliency Maps | Which input pixels matter for the prediction |
| Grad-CAM | Which spatial regions at any layer drive the prediction |
The encoder-decoder architecture of U-Net reappears everywhere: in diffusion models (the U-Net denoises images), in image generation (VAEs encode then decode), and in depth estimation. Feature Pyramid Networks (Lin et al., 2017), typically paired with detectors like Faster R-CNN, are used in nearly every modern vision system. And Mask R-CNN's approach of adding task-specific heads to a shared backbone is the design pattern behind multi-task learning in vision.
DETR opened the door to vision-language models: if you can use Transformer queries for detection, you can use language tokens as queries for open-vocabulary detection. Models like OWL-ViT and Grounding DINO build on DETR-style set prediction and its successors.
Detection and segmentation turn classification's "what" into "where" and "which pixels" — and the key ideas (anchor boxes, RoI Align, multi-scale features, learned upsampling) recur in every modern vision system. Understanding these building blocks is non-negotiable.