Stanford CS 231n · Lecture 9 · Detection, Segmentation & Understanding

Detection, Segmentation & Understanding

Classification tells you "there's a cat." Detection tells you where. Segmentation tells you which pixels. Understanding tells you why the network thinks so.

Semantic Segmentation · R-CNN Family · YOLO & SSD · Mask R-CNN · Grad-CAM

Chapter 01

Beyond Classification

Image classification answers a single question: "What is in this image?" You feed in a photo of a park, and the network says "dog." That's useful, but it throws away almost everything. Where is the dog? How big is it? What about the cat behind the tree? What pixels belong to grass versus sky?

Computer vision has four core tasks, each progressively richer:

| Task | Output | Key Question |
| --- | --- | --- |
| Classification | Single label | What is in this image? |
| Semantic Segmentation | Label per pixel | What class does each pixel belong to? |
| Object Detection | Bounding boxes + labels | Where are the objects, and what are they? |
| Instance Segmentation | Masks per object | Which pixels belong to which specific object? |

Notice the progression. Classification has no spatial extent — it just says "cat." Semantic segmentation labels every pixel but doesn't distinguish individual objects: two dogs next to each other are both "dog-pixels," indistinguishable. Object detection finds individual objects with bounding boxes. Instance segmentation combines the best of both: per-pixel masks for each individual object.

The Central Trade-off

Richer outputs require richer architectures. Classification needs one vector. Segmentation needs an output the same size as the input image. Detection needs a variable-length list of boxes. Each step up in richness introduces new architectural challenges — and that's what this lecture is about.

A Self-Driving Car Needs All Four

Consider an autonomous vehicle. It needs classification to identify traffic signs. It needs semantic segmentation to know where the road surface is versus sidewalk. It needs object detection to locate and track every car, pedestrian, and cyclist with bounding boxes. And it needs instance segmentation to distinguish "the pedestrian crossing now" from "the pedestrian waiting on the curb."

No single task is sufficient. Modern vision systems stack these capabilities, and the architectures we'll study form the backbone of essentially every vision system deployed in the real world.

Definition
Spatial Extent

The physical region in the image that an output describes. Classification has none (it describes the whole image). A bounding box defines a rectangle. A segmentation mask defines an arbitrary shape, pixel by pixel.

Chapter 02

Semantic Segmentation

The goal is deceptively simple: given an image of size H×W, produce an output of size H×W where each pixel is assigned one of C class labels — sky, road, car, person, tree, etc. No notion of individual objects. Just: "this pixel is grass, that pixel is cat."

Attempt 1: Sliding Window

The naive approach: extract a small patch around each pixel, run it through a classification CNN, and assign the center pixel whatever class the CNN predicts. This works in principle — each patch provides local context. But it's catastrophically slow. For a 640×480 image, you'd need 307,200 independent forward passes. And you're re-computing features for nearly-identical overlapping patches.

Attempt 2: Fully Convolutional (No Downsampling)

Better idea: design a network made entirely of convolutional layers (no fully connected layers). Feed in the whole image, get out a tensor of scores with shape C×H×W. Take argmax across the C dimension at each pixel to get the segmentation map.

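The problem? If you maintain full resolution throughout, every layer operates at H×W. With 64 filters at 640×480, a single layer already produces 64 × 640 × 480 ≈ 19.7 million activations, and a deep network stacks many such layers. Deep networks at full resolution are prohibitively expensive in both memory and compute.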

The Winning Idea: Downsample, Then Upsample

The architecture that actually works is the Fully Convolutional Network (FCN), introduced by Long, Shelhamer, and Darrell in 2015. The insight: use the standard classification backbone (with pooling and strided convolutions) to downsample the image into rich, compact feature maps. Then use learned upsampling to expand those features back to full resolution.

FCN Pipeline
Input (3×H×W) → Downsample (D×H/32×W/32) → Upsample (C×H×W) → argmax → Prediction (H×W)

The downsampling path uses the same operations you already know: convolutions, ReLUs, and pooling. It builds up semantic understanding — "this region contains a cat" — but at low spatial resolution. The upsampling path recovers spatial detail — "the cat's boundary is here" — and produces per-pixel predictions.

Why This Works

Downsampling is cheap and builds receptive field fast. A few pooling layers let each neuron "see" a large region of the image. Upsampling is the hard part: you need to recover fine-grained spatial detail from a compressed representation. The next chapter is entirely about how to do this well.

The Loss Function

Semantic segmentation uses per-pixel cross-entropy. For each pixel (h,w), compute the softmax over C class scores, then take the cross-entropy with the ground-truth label:

Per-Pixel Cross-Entropy
$$L = -\frac{1}{HW} \sum_{h,w} \log p(y_{h,w} \mid x)$$
where $y_{h,w}$ is the ground-truth class at pixel $(h,w)$ and $p$ is the softmax probability

This is identical to classification cross-entropy, just applied independently at every pixel. The total loss is the average over all H×W pixels.
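
As a minimal sketch of the loss above, the following pure-Python functions compute per-pixel softmax cross-entropy and the per-pixel argmax for scores stored as nested lists in C×H×W layout. The function names and the toy 1×2 image are illustrative assumptions, not part of the lecture; a real implementation would use a tensor library.

```python
import math

def per_pixel_cross_entropy(scores, labels):
    """scores: C x H x W nested lists of raw class scores.
    labels: H x W nested lists of ground-truth class indices.
    Returns the mean softmax cross-entropy over all H*W pixels."""
    num_classes = len(scores)
    height, width = len(labels), len(labels[0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            pixel_scores = [scores[c][h][w] for c in range(num_classes)]
            # Log-sum-exp with max subtraction for numerical stability.
            m = max(pixel_scores)
            log_z = m + math.log(sum(math.exp(s - m) for s in pixel_scores))
            # -log softmax probability of the true class at this pixel.
            total += log_z - pixel_scores[labels[h][w]]
    return total / (height * width)

def predict(scores):
    """Per-pixel argmax over the class dimension (ties go to the lowest index)."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda c: scores[c][h][w])
             for w in range(width)] for h in range(height)]

# Toy example (hypothetical values): C = 2 classes, 1 x 2 image.
scores = [[[2.0, 0.0]],   # class 0 scores
          [[0.0, 0.0]]]   # class 1 scores
labels = [[0, 1]]
loss = per_pixel_cross_entropy(scores, labels)
print(loss, predict(scores))
```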

No Object Identity

Semantic segmentation treats all instances of a class identically. If two cats sit side by side, both regions are labeled "cat" with no way to tell them apart. This is a fundamental limitation — and it's why we need instance segmentation (Chapter 7).

Worked Example — Output Tensor

Image: 3×224×224 (RGB). Classes: C = 21 (PASCAL VOC). FCN output: 21×224×224. At pixel (100, 150), the 21-dim vector might be [−2.1, 0.3, −0.8, ..., 4.7, ...]. The index of 4.7 (say index 8 = "cat") becomes the predicted label for that pixel.

Chapter 03

Upsampling & U-Net

The downsampling path of an FCN is well-understood: pooling and strided convolutions compress spatial dimensions. But how do you go back up? How do you turn a 512×7×7 feature map into a C×224×224 prediction? This chapter covers three approaches, each more sophisticated than the last.

Approach 1: Unpooling

Nearest-neighbor unpooling simply repeats each value to fill a larger grid. A 2×2 region becomes a 4×4 region by copying each value into a 2×2 block. It's simple but blocky — no new information is created.

"Bed of nails" unpooling places each value in the top-left corner of its 2×2 block and fills the rest with zeros. It preserves position but creates a sparse, jagged output.

Max unpooling remembers where each max came from during max-pooling. During upsampling, it places each value back in its original position and fills the rest with zeros. This preserves spatial structure much better.
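
The three unpooling variants can be sketched on small nested lists. This is an illustrative example for a 2×2 window with stride 2; the function names and the sample values are assumptions made for the demonstration.

```python
def nearest_neighbor_unpool(x):
    """Copy each value into a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def bed_of_nails_unpool(x):
    """Put each value in the top-left of its 2x2 block, zeros elsewhere."""
    out = []
    for row in x:
        top = []
        for v in row:
            top += [v, 0]
        out.append(top)
        out.append([0] * len(top))
    return out

def max_pool_with_indices(x):
    """2x2 max pooling that also records the (row, col) of each max."""
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        prow, irow = [], []
        for j in range(0, len(x[0]), 2):
            cells = [(x[i + di][j + dj], (i + di, j + dj))
                     for di in (0, 1) for dj in (0, 1)]
            value, pos = max(cells, key=lambda t: t[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices

def max_unpool(pooled, indices, height, width):
    """Place each pooled value back at its remembered position."""
    out = [[0] * width for _ in range(height)]
    for prow, irow in zip(pooled, indices):
        for value, (r, c) in zip(prow, irow):
            out[r][c] = value
    return out

x = [[1, 2], [3, 4]]
nn = nearest_neighbor_unpool(x)
nails = bed_of_nails_unpool(x)

feat = [[1, 5, 2, 0], [3, 4, 1, 6], [0, 2, 9, 1], [7, 1, 3, 2]]
pooled, indices = max_pool_with_indices(feat)
restored = max_unpool(pooled, indices, 4, 4)
```

In `restored`, each maximum sits where it was found in `feat`, whereas bed-of-nails always uses the top-left corner.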

Approach 2: Transposed Convolution (Learnable Upsampling)

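The key insight: a convolution can be written as multiplication by a matrix W built from the kernel weights. Multiplying by the transpose $W^{\top}$ maps in the opposite direction, from the small output shape back to the large input shape. This is called a transposed convolution (sometimes misleadingly called "deconvolution"; it reverses the shapes, not the values, so it is not a true inverse).

Transposed Convolution
$$y = W x \quad \text{(normal convolution, downsamples)}$$
$$\hat{x} = W^{\top} y \quad \text{(transposed convolution, upsamples)}$$
W is the convolution matrix. $W^{\top}$ has learnable parameters, so the network learns how to upsample.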

Concretely, a transposed convolution with stride 2 takes each input value, multiplies it by a learned kernel, and places the result in a larger output grid with stride-2 spacing. Overlapping regions are summed. The weights are learned end-to-end through backpropagation, so the network discovers the best upsampling strategy for the task.
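
The "multiply, place with stride, sum overlaps" description can be sketched in one dimension. This is a minimal illustration with no padding; the function name and the kernel values are hypothetical, chosen so each contribution is easy to trace in the output.

```python
def transposed_conv1d(x, kernel, stride):
    """1D transposed convolution without padding.
    Each input value scales a copy of the kernel, which is written into the
    output at offset i * stride; overlapping writes are summed."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

# Two inputs, kernel of size 3, stride 2 -> output of length 5.
y = transposed_conv1d([1.0, 2.0], [1.0, 10.0, 100.0], stride=2)
print(y)  # [1.0, 10.0, 102.0, 20.0, 200.0]
```

The middle output, 102, is where the two kernel copies overlap: 100 from the first input plus 2 from the second.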

Checkerboard Artifacts

Transposed convolutions can produce checkerboard patterns when the kernel size is not divisible by the stride (e.g., kernel=3, stride=2), because overlapping regions receive uneven contributions. The fix: use kernel sizes that are divisible by the stride (e.g., kernel=4, stride=2), or use bilinear upsampling followed by a regular convolution.
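
One way to see the uneven contributions is to count how many kernel copies touch each output position. The sketch below does this in 1D; the helper name is an assumption made for illustration.

```python
def overlap_counts(input_len, kernel_size, stride):
    """Number of kernel copies that write to each output position
    of a 1D transposed convolution (no padding)."""
    out = [0] * ((input_len - 1) * stride + kernel_size)
    for i in range(input_len):
        for j in range(kernel_size):
            out[i * stride + j] += 1
    return out

uneven = overlap_counts(4, kernel_size=3, stride=2)
even = overlap_counts(4, kernel_size=4, stride=2)
print(uneven)  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
print(even)    # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

With kernel=3 and stride=2 the interior alternates between one and two contributions, which is the 1D analogue of a checkerboard. With kernel=4 and stride=2 every interior position receives exactly two.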

Approach 3: U-Net — Skip Connections to the Rescue

There's a fundamental tension in the downsample-then-upsample design. Downsampling builds semantic understanding (what things are) but destroys spatial detail (where exact boundaries are). Upsampling tries to recover that spatial detail — but the information was already thrown away by pooling.

The U-Net (Ronneberger et al., 2015) solves this with skip connections. At each resolution level, it copies feature maps from the downsampling path directly to the corresponding level in the upsampling path, concatenating them with the upsampled features.

Definition
Skip Connections (U-Net)

Direct connections that copy feature maps from the encoder (downsampling path) to the decoder (upsampling path) at matching resolutions. They provide high-resolution spatial detail to the decoder, compensating for information lost during pooling.

U-Net Architecture
  1. Encoder (contracting path): Repeated blocks of [conv → conv → max pool]. Resolution halves at each level (e.g., 224 → 112 → 56 → 28 → 14).
  2. Bottleneck: Deepest level with the most channels and smallest spatial resolution.
  3. Decoder (expanding path): At each level: upsample (transposed conv) → concatenate with skip features from encoder → conv → conv.
  4. Output: 1×1 convolution to map to C class channels. Softmax + argmax for final segmentation.

The architecture is symmetric — shaped like the letter U. The encoder and decoder have the same number of levels. Each skip connection carries the encoder's fine-grained spatial features to the decoder, so the decoder doesn't have to "reinvent" edge locations from scratch.

Why Skip Connections Are Essential

Without skip connections, the decoder sees only the bottleneck features (14×14 with 1024 channels). It knows "there's a cat somewhere in the top-left quadrant" but can't pinpoint the exact boundary. Skip connections from the 224×224 level provide edge maps, texture detail, and spatial precision. The decoder fuses "what" (from the bottleneck) with "where" (from the skips).

Worked Example — Feature Dimensions in U-Net

Input: 3×256×256. Encoder: Level 1: 64×256×256 → pool → Level 2: 128×128×128 → pool → Level 3: 256×64×64 → pool → Level 4: 512×32×32 → pool → Bottleneck: 1024×16×16.

Decoder: upsample to 512×32×32, concat with encoder Level 4 (512×32×32) = 1024×32×32 → conv to 512×32×32. Repeat upward. Final: 64×256×256 → 1×1 conv → C×256×256.
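
The shape bookkeeping in this worked example can be traced with a few lines of Python. This sketch only tracks (channels, height, width) tuples and assumes size-preserving convolutions, as the example does; the function name and the choice of 21 output classes are assumptions.

```python
def unet_shapes(in_size=256, base_channels=64, levels=4, num_classes=21):
    """Trace (channels, height, width) through a U-Net-style network."""
    encoder = []
    channels, size = base_channels, in_size
    for _ in range(levels):
        encoder.append((channels, size, size))
        # Pooling halves the resolution; the next level doubles the channels.
        channels, size = channels * 2, size // 2
    bottleneck = (channels, size, size)

    decoder = []
    for skip in reversed(encoder):
        # Upsampling doubles the resolution and halves the channels.
        channels, size = channels // 2, size * 2
        # Concatenating the skip features adds the encoder's channels.
        concat = (channels + skip[0], size, size)
        decoder.append({"concat": concat, "out": (channels, size, size)})

    # Final 1x1 convolution maps to one channel per class.
    output = (num_classes, size, size)
    return encoder, bottleneck, decoder, output

encoder, bottleneck, decoder, output = unet_shapes()
for stage in decoder:
    print(stage["concat"], "->", stage["out"])
```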

Chapter 04

Object Detection & R-CNN

Object detection adds a fundamentally harder requirement: not only must you classify objects, but you must localize each one with a bounding box (x, y, width, height). And the number of objects varies per image — one image might have 2 dogs, another might have 15 cars. You can't just predict a fixed-size output.

Single Object: Classification + Localization

If there's exactly one object, the problem is simple. Take your classification CNN, attach two heads: one predicting class scores (softmax loss), and one predicting four box coordinates (L2 regression loss). Train with a multitask loss:

Multitask Loss (Single Object)
$$L = L_{\text{cls}}(\text{class scores}, y) + \lambda \cdot L_{\text{loc}}(\text{predicted box}, \text{ground-truth box})$$
$L_{\text{cls}}$ = cross-entropy, $L_{\text{loc}}$ = L2 or smooth-L1 loss on (x, y, w, h)

This works well for single-object localization. But it breaks down when there are multiple objects: the network has a fixed number of outputs, and you'd need a variable number of output boxes.

The Sliding Window Disaster

One approach: try every possible crop of the image at every position, scale, and aspect ratio. Run each crop through a classifier. If it says "dog" with high confidence, that crop is a detection. This is computationally insane — there are millions of possible crops.

Region Proposals: Selective Search

A smarter approach: use a fast, cheap algorithm to propose ~2000 "blobby" image regions that are likely to contain objects. Selective Search groups similar pixels into segments, then merges segments into larger regions based on color, texture, and containment. It runs in a few seconds on CPU and gives you a manageable set of candidate boxes.

R-CNN: Regions with CNN Features

The original R-CNN (Girshick et al., 2014) combines region proposals with CNN features in a straightforward pipeline:

R-CNN Pipeline
  1. Region proposals: Run Selective Search on the input image. Get ~2000 Region of Interest (RoI) boxes.
  2. Warp and forward: Resize each RoI to 224×224 pixels. Run each through a CNN (e.g., AlexNet) independently. This produces a 4096-dim feature vector per region.
  3. Classify: Train an SVM on each region's feature vector to predict object class.
  4. Bounding box regression: Train a linear regressor to predict corrections (dx, dy, dw, dh) from the proposal box to the ground-truth box.
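
The corrections in step 4 can be sketched as follows. The text above does not spell out the parameterization; this example assumes the one used in the R-CNN paper, where translations are scaled by the proposal size and width/height corrections live in log space. Boxes are (center x, center y, width, height), and the function names are illustrative.

```python
import math

def compute_deltas(proposal, target):
    """Regression targets (dx, dy, dw, dh) that map proposal -> target."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_deltas(proposal, deltas):
    """Apply predicted corrections to a proposal box."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (px + pw * dx, py + ph * dy,
            pw * math.exp(dw), ph * math.exp(dh))

proposal = (50.0, 50.0, 40.0, 20.0)   # hypothetical proposal box
target = (54.0, 48.0, 50.0, 25.0)     # hypothetical ground-truth box
deltas = compute_deltas(proposal, target)
refined = apply_deltas(proposal, deltas)
```

Applying the exact targets recovers the ground-truth box; a trained regressor would only approximate them.
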
R-CNN Is Painfully Slow

Each image requires ~2000 independent forward passes through the CNN (one per proposal). At test time, this takes ~47 seconds per image on a GPU. Training is a multi-stage mess: first train the CNN, then extract features for all proposals, then train SVMs, then train box regressors. Nothing is end-to-end.

Definition
IoU — Intersection over Union

The standard metric for measuring bounding box overlap. Given two boxes A and B: IoU(A, B) = Area(A ∩ B) / Area(A ∪ B). IoU = 1 means perfect overlap. IoU = 0 means no overlap. Typically, a detection is "correct" if IoU with the ground-truth box is ≥ 0.5.

IoU Formula
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
Worked Example — Computing IoU

Box A: top-left (10, 10), bottom-right (50, 50). Area = 40 × 40 = 1600.

Box B: top-left (30, 30), bottom-right (70, 70). Area = 40 × 40 = 1600.

Intersection: top-left (30, 30), bottom-right (50, 50). Area = 20 × 20 = 400.

Union: 1600 + 1600 − 400 = 2800.

IoU = 400 / 2800 = 0.143. These boxes barely overlap — this would not count as a correct detection.
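
The same computation as a short function, as a sketch assuming boxes are given as (x1, y1, x2, y2) with the top-left corner first. The function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

value = iou((10, 10, 50, 50), (30, 30, 70, 70))
print(round(value, 3))  # 0.143
```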

Chapter 05

Fast & Faster R-CNN

R-CNN's bottleneck is obvious: running the CNN 2000 times per image. The fix is equally obvious in hindsight: run the CNN once on the whole image, then crop features for each region from the shared feature map.

Fast R-CNN

Fast R-CNN (Girshick, 2015) does exactly this. The full image passes through a backbone CNN (e.g., VGG-16) to produce a feature map. For each region proposal, it crops and resizes the corresponding region from the feature map using RoI Pooling, then feeds the fixed-size features through a small per-region network that outputs class scores and box corrections.

Definition
RoI Pooling

Projects a region proposal onto the feature map, snaps to grid cells, divides into a fixed grid (e.g., 7×7), and max-pools within each sub-region. Output is always the same size regardless of proposal size. The "snap to grid" step introduces slight misalignment.

Definition
RoI Align (improvement)

Used in Mask R-CNN. Instead of snapping to grid cells, it uses bilinear interpolation to sample feature values at exact (fractional) positions. This eliminates misalignment artifacts and is critical for pixel-accurate tasks like instance segmentation.
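
The difference between snapping and interpolating can be sketched with a single sample point. This is only the sampling primitive: a full RoI Align would take several such samples per output bin and pool them. The function names and the 2×2 feature map are assumptions made for illustration.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample an H x W feature map at a fractional (y, x) position."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the far neighbors so samples on the last row/column stay in bounds.
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom

def snapped_sample(feature, y, x):
    """RoI-Pooling-style lookup: round to the nearest grid cell first."""
    return feature[int(round(y))][int(round(x))]

feature = [[0.0, 10.0],
           [20.0, 30.0]]
aligned = bilinear_sample(feature, 0.25, 0.75)
snapped = snapped_sample(feature, 0.25, 0.75)
print(aligned, snapped)  # 12.5 10.0
```

The snapped lookup ignores the fractional offset entirely, while the interpolated value blends all four neighbors according to distance.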

Fast R-CNN is trained end-to-end with a multitask loss:

Fast R-CNN Loss
$$L = L_{\text{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\text{loc}}(t^{u}, v)$$
p = predicted class probabilities, u = true class, $t^{u}$ = predicted box offsets for class u, v = true box offsets. [u ≥ 1] means only regress boxes for non-background proposals.
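
A minimal sketch of this loss for a single proposal, assuming class 0 is background and using the smooth-L1 form from the Fast R-CNN paper (quadratic below 1, linear above). The function names and sample numbers are hypothetical.

```python
import math

def smooth_l1(x):
    """0.5 * x^2 for |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def fast_rcnn_loss(class_probs, true_class, pred_offsets, true_offsets, lam=1.0):
    """class_probs: probabilities over classes (index 0 = background).
    pred_offsets: one 4-vector of box offsets per class.
    true_offsets: the 4-vector regression target for this proposal."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class >= 1:
        # Only the offsets predicted for the true class are penalized.
        loc_loss = sum(smooth_l1(p - t)
                       for p, t in zip(pred_offsets[true_class], true_offsets))
    else:
        # Background proposals have no box to regress.
        loc_loss = 0.0
    return cls_loss + lam * loc_loss

offsets = [[0.0] * 4, [0.5, 0.0, 0.0, 2.0]]
background = fast_rcnn_loss([0.5, 0.5], 0, offsets, [0.0] * 4)
foreground = fast_rcnn_loss([0.2, 0.8], 1, offsets, [0.0] * 4)
```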

The Region Proposal Bottleneck

Fast R-CNN made the per-region computation fast. But Selective Search still runs on CPU and takes ~2 seconds per image — now dominating runtime. What if the CNN itself could propose regions?

Faster R-CNN: Region Proposal Networks

Faster R-CNN (Ren et al., 2015) replaces Selective Search with a learned Region Proposal Network (RPN) that shares computation with the detection backbone. The RPN slides a small network over the feature map and, at each position, predicts whether each of K anchor boxes contains an object and what corrections to apply.

Definition
Anchor Boxes

A set of predefined bounding boxes with different scales and aspect ratios, centered at each position of the feature map. Typical: 3 scales × 3 ratios = 9 anchors per position. The RPN doesn't predict boxes from scratch — it predicts corrections (offsets) from these anchors.


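At each of the H×W positions in the feature map, the RPN outputs K objectness scores (one per anchor: object or background) and 4K box offsets (four corrections per anchor).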

For a feature map of 20×15 with K=9 anchors, that's 20 × 15 × 9 = 2700 candidate boxes. The top ~300 by objectness score are kept as proposals for the second stage.
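
Anchor generation can be sketched as a triple loop over positions, scales, and ratios. The scales (128, 256, 512) and ratios (1:2, 1:1, 2:1) follow the Faster R-CNN paper; the stride of 16 and the function name are assumptions made for this example.

```python
def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2) in image coordinates."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in image pixels.
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    # Keep the area at scale^2 while changing the shape.
                    w = scale / ratio ** 0.5
                    h = scale * ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(15, 20, stride=16,
                       scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 2700
```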

Faster R-CNN: Four Losses
  1. RPN classification: Is each anchor an object or background? (binary cross-entropy)
  2. RPN regression: Box offset from positive anchors to ground-truth boxes. (smooth L1)
  3. Detection classification: What class is each proposal? (softmax cross-entropy over C+1 classes including background)
  4. Detection regression: Box offset from each proposal to the matched ground-truth box. (smooth L1)
Definition
Non-Maximum Suppression (NMS)

After detection, many overlapping boxes predict the same object. NMS removes redundant detections: sort all boxes by confidence, take the top box, remove all remaining boxes with IoU > threshold (typically 0.5) with it, repeat. This keeps only the most confident, non-overlapping predictions.

Non-Maximum Suppression
  1. Sort all detections by confidence score (descending).
  2. Select the top detection. Add it to the final output list.
  3. Remove all remaining detections whose IoU with the selected detection exceeds the threshold.
  4. Repeat steps 2-3 until no detections remain.
Two-Stage Architecture

Faster R-CNN is a two-stage detector. Stage 1 (RPN): proposes regions (class-agnostic, just "object or not"). Stage 2 (per-region head): classifies each proposal and refines its box. This two-stage design is typically slower but more accurate than early single-stage detectors, because the second stage operates on focused, pre-screened regions.

Worked Example — NMS

Three detections for "dog": Box A (conf=0.95), Box B (conf=0.88, IoU with A = 0.72), Box C (conf=0.60, IoU with A = 0.15). NMS threshold = 0.5.

Step 1: Select A (highest confidence). Remove B (IoU 0.72 > 0.5). Keep C (IoU 0.15 < 0.5).

Step 2: Select C. No remaining boxes.

Output: Boxes A and C. NMS correctly suppressed the duplicate B while keeping the distinct detection C.
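
The same procedure as a short greedy loop. The box coordinates are hypothetical, chosen so the IoU values come out close to those in the example (about 0.72 for A and B, about 0.14 for A and C); the function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, threshold=0.5):
    """Greedy non-maximum suppression over dicts with 'box' and 'score'."""
    remaining = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)   # highest-confidence box still standing
        kept.append(best)
        # Drop every box that overlaps the selected one too much.
        remaining = [d for d in remaining
                     if iou(best["box"], d["box"]) <= threshold]
    return kept

detections = [
    {"name": "A", "box": (0, 0, 100, 100), "score": 0.95},
    {"name": "B", "box": (8, 9, 108, 109), "score": 0.88},
    {"name": "C", "box": (50, 50, 150, 150), "score": 0.60},
]
kept_names = [d["name"] for d in nms(detections)]
print(kept_names)  # ['A', 'C']
```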

Chapter 06

Single-Stage Detectors

Two-stage detectors like Faster R-CNN are accurate but have a fundamental speed bottleneck: the per-region second stage. What if you skipped the proposal step entirely and predicted boxes and classes directly from the feature map in a single pass?

YOLO: You Only Look Once

YOLO (Redmon et al., 2016) divides the image into an S×S grid. Each grid cell predicts B bounding boxes, each with 5 values (x, y, w, h, confidence), plus one set of C class probabilities shared by the cell. The entire output is a single tensor of shape S×S×(5B+C), produced by one forward pass through the network.

YOLO Output Tensor
Output: S × S × (5B + C)
S=7, B=2, C=20 (PASCAL VOC) → 7 × 7 × 30 = 1470 output values (7 × 7 × 2 = 98 boxes) in one shot

Each grid cell is "responsible" for detecting objects whose center falls within it. The confidence score reflects both the probability that the box contains an object and how well the box fits: confidence = P(object) × IoU(pred, truth).
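
To make the tensor layout concrete, the sketch below computes the output shape and splits one cell's vector into its boxes and class probabilities. The ordering inside the vector (boxes first, then classes) is an assumption made for illustration, as are the function names.

```python
def yolo_output_shape(S, B, C):
    """Shape of the YOLO prediction tensor."""
    return (S, S, 5 * B + C)

def split_cell(cell, B, C):
    """Split one cell's vector into B boxes of (x, y, w, h, confidence)
    and C class probabilities. Assumes boxes are stored first."""
    boxes = [tuple(cell[5 * b: 5 * b + 5]) for b in range(B)]
    class_probs = cell[5 * B: 5 * B + C]
    return boxes, class_probs

shape = yolo_output_shape(S=7, B=2, C=20)
num_values = shape[0] * shape[1] * shape[2]
num_boxes = shape[0] * shape[1] * 2

# A dummy cell vector whose entries are their own indices.
boxes, class_probs = split_cell(list(range(30)), B=2, C=20)
print(shape, num_values, num_boxes)  # (7, 7, 30) 1470 98
```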

Speed vs. Accuracy

YOLO runs at 45 FPS (Fast YOLO: 155 FPS) versus Faster R-CNN's ~7 FPS. The trade-off: YOLO is less accurate, especially for small objects and objects near each other. The coarse S×S grid means each cell can only detect one object, and the fixed number of boxes per cell limits detection capacity.

SSD: Single Shot MultiBox Detector

SSD (Liu et al., 2016) improves on YOLO by making predictions at multiple feature map scales. Early layers (high resolution) detect small objects. Later layers (low resolution) detect large objects. At each position of each feature map, SSD predicts offsets and class scores for a set of anchor boxes — like running an RPN at every scale, but predicting actual classes instead of just "object/not."

RetinaNet and Focal Loss

Single-stage detectors suffer from extreme class imbalance: out of thousands of anchor boxes, only a handful contain objects. The vast majority are background. Standard cross-entropy loss treats all samples equally, so the model is overwhelmed by easy negatives.

Focal Loss (Lin et al., 2017) down-weights easy examples and focuses training on hard ones:

Focal Loss
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
$p_t$ = predicted probability for the true class. γ = focusing parameter (typically 2). When $p_t$ is large (easy example), $(1 - p_t)^{\gamma}$ is near zero, suppressing the loss.

With focal loss, RetinaNet matches two-stage detector accuracy while running at single-stage speed.

DETR: Detection with Transformers

DETR (Carion et al., 2020) takes a radically different approach: no anchors, no NMS, no hand-designed components. A CNN backbone extracts features, a Transformer encoder-decoder processes them, and a set of learned "object queries" directly predict a fixed-size set of N detections. Training uses bipartite matching (Hungarian algorithm) to assign each prediction to a ground-truth object.

Definition
Bipartite Matching

A one-to-one assignment between predicted boxes and ground-truth boxes that minimizes total matching cost (a combination of classification loss and box distance). The Hungarian algorithm solves this optimally. Unmatched predictions are trained to predict "no object."
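
For intuition, the matching can be found on a tiny example by brute force. This sketch tries every assignment, which is only feasible for a handful of boxes and is a stand-in for the Hungarian algorithm, not an implementation of it; in practice a solver such as SciPy's `linear_sum_assignment` would be used. The cost values and the function name are hypothetical.

```python
from itertools import permutations

def best_matching(cost):
    """cost[i][j]: cost of assigning prediction i to ground-truth j.
    Assumes at least as many predictions as ground-truth objects.
    Returns (match, total), where match[i] is the ground-truth index
    assigned to prediction i, or None for "no object"."""
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    # preds[g] is the prediction assigned to ground-truth g.
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best_total, best_assign = total, preds
    matched = {p: g for g, p in enumerate(best_assign)}
    return [matched.get(p) for p in range(n_pred)], best_total

# Three predictions, two ground-truth objects.
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6]]
match, total = best_matching(cost)
print(match)  # [1, 0, None]
```

Prediction 2 is left unmatched, so it would be trained to output "no object".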

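| Detector | Stages | Proposals | NMS? | Speed | Accuracy |
| --- | --- | --- | --- | --- | --- |
| R-CNN | Multi | Selective Search | Yes | ~47s/img | Moderate |
| Fast R-CNN | 2 | Selective Search | Yes | ~0.3s | Good |
| Faster R-CNN | 2 | RPN (learned) | Yes | ~0.14s | Very good |
| YOLO | 1 | Grid cells | Yes | ~0.02s | Moderate |
| SSD | 1 | Multi-scale anchors | Yes | ~0.02s | Good |
| RetinaNet | 1 | Anchors + focal loss | Yes | ~0.07s | Very good |
| DETR | 1 | Object queries | No | ~0.04s | Very good |
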
Worked Example — Focal Loss

Consider γ = 2 and $\alpha_t = 0.25$ (the same weight is used for both examples to keep the arithmetic simple). An easy background example: $p_t = 0.99$. Standard loss: −log(0.99) = 0.01. Focal loss: $-0.25 \times (0.01)^2 \times \log(0.99) = 0.25 \times 0.0001 \times 0.01 = 2.5 \times 10^{-7}$. Nearly zero!

A hard misclassified example: $p_t = 0.1$. Standard loss: −log(0.1) = 2.30. Focal loss: $-0.25 \times (0.9)^2 \times \log(0.1) = 0.25 \times 0.81 \times 2.30 = 0.47$. Still significant. Focal loss lets the network focus on what matters.
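
The two cases can be checked numerically with a short function. This is a sketch using the natural logarithm and the same $\alpha_t = 0.25$ and γ = 2 as the worked example; the function names are illustrative.

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy for the true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t)^gamma."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.99)   # well-classified example
hard = focal_loss(0.1)    # badly misclassified example

# How much more the hard example weighs than the easy one under each loss.
ce_ratio = cross_entropy(0.1) / cross_entropy(0.99)
focal_ratio = hard / easy
print(easy, hard, ce_ratio, focal_ratio)
```

Under plain cross-entropy the hard example counts a few hundred times more than the easy one; under focal loss the gap grows to over a million, which is what keeps the flood of easy negatives from dominating training.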

Chapter 07

Instance Segmentation

Semantic segmentation labels every pixel but doesn't distinguish individual objects. Object detection finds individual objects but only with coarse bounding boxes. Instance segmentation gives you the best of both worlds: a precise, pixel-level mask for each individual object in the image.

Mask R-CNN

Mask R-CNN (He et al., 2017) is stunningly simple. Take Faster R-CNN, and add a third head: a small fully-convolutional network that predicts a binary segmentation mask for each detected object. That's it. Three heads, three tasks, one unified architecture.

Mask R-CNN Architecture
  1. Backbone + FPN: Extract multi-scale feature maps from the image.
  2. RPN: Propose ~300 candidate regions.
  3. RoI Align: Crop and resize features to 14×14 (or 7×7) for each proposal. RoI Align (not Pool) is critical for pixel accuracy.
  4. Classification head: Predict object class + box offsets (same as Faster R-CNN).
  5. Mask head: Apply convolutions to the 14×14 features to produce a C×28×28 binary mask (one mask per class).
Mask R-CNN Loss
$$L = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}}$$
$$L_{\text{mask}} = -\frac{1}{m^2} \sum_{i,j} \left[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \right]$$
Binary cross-entropy on the 28×28 mask (m = 28). Only the mask for the ground-truth class contributes to the loss (class-specific masks prevent competition between classes).

A crucial design decision: the mask head predicts C separate binary masks (one per class), not one multi-class mask. At inference, only the mask corresponding to the predicted class is used; during training, only the mask for the ground-truth class receives a loss. This decouples mask prediction from classification: the mask branch doesn't need to "decide" what class the object is.
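
A minimal sketch of the mask loss for one RoI, assuming the mask head's outputs have already been passed through a sigmoid. It selects the ground-truth class's mask and averages binary cross-entropy over its m×m pixels; the function name and the tiny 2×2 masks are hypothetical.

```python
import math

def mask_loss(mask_probs, gt_mask, gt_class):
    """mask_probs: C x m x m per-class mask probabilities for one RoI.
    gt_mask: m x m binary ground-truth mask.
    gt_class: index of the ground-truth class; other masks are ignored."""
    probs = mask_probs[gt_class]
    m = len(gt_mask)
    total = 0.0
    for i in range(m):
        for j in range(m):
            p, y = probs[i][j], gt_mask[i][j]
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / (m * m)

gt_mask = [[1, 0],
           [1, 0]]
mask_probs = [
    [[0.5, 0.5], [0.5, 0.5]],   # class 0 mask (ignored here)
    [[0.9, 0.1], [0.8, 0.2]],   # class 1 mask (ground-truth class)
]
loss = mask_loss(mask_probs, gt_mask, gt_class=1)

# Changing another class's mask leaves the loss untouched.
mask_probs[0] = [[0.01, 0.99], [0.99, 0.01]]
same_loss = mask_loss(mask_probs, gt_mask, gt_class=1)
```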

RoI Align Is Critical

RoI Pooling snaps region boundaries to integer coordinates, introducing ~0.5 pixel misalignment per pooling level. For classification, this is irrelevant. For mask prediction, it's devastating — your mask is literally shifted relative to the object. RoI Align uses bilinear interpolation at exact fractional coordinates, eliminating this misalignment. Switching from Pool to Align improved mask AP by 3+ points.

Panoptic Segmentation

Instance segmentation ignores "stuff" categories (sky, road, grass) — it only masks "things" (countable objects). Panoptic segmentation unifies both: every pixel gets both a class label (semantic) and, for "things" classes, an instance ID. It's the ultimate vision task — complete pixel-level understanding of the scene.

Definition
Things vs. Stuff

Things: countable objects with well-defined shape (car, person, cat). Stuff: amorphous regions (sky, grass, road, water). Semantic segmentation covers both. Instance segmentation covers only things. Panoptic segmentation covers everything.

Worked Example — Mask Dimensions

Image with 5 detected objects. Each gets RoI Aligned features of 256×14×14. Mask head: two 3×3 conv layers (256 channels each), then a 2×2 transposed conv (upsamples to 28×28), then a 1×1 conv to C channels. Output per object: 80×28×28 (one binary mask per COCO class). At inference, use the predicted class to select which of the 80 masks to threshold.

Chapter 08

Understanding CNNs

Neural networks are often called "black boxes." But we can peer inside them. This chapter covers four techniques for understanding what a CNN has learned and why it makes specific predictions.

Technique 1: Visualizing First-Layer Filters

The simplest approach: visualize the weights of the first convolutional layer directly. Since first-layer filters operate on RGB pixels, each filter is a small 3×k×k image (for example, 3×11×11 in AlexNet) that you can display. In virtually every trained network, first-layer filters learn oriented edges, color gradients, and blob detectors, similar to the features found in biological visual cortex.

This only works for the first layer. Later filters operate on abstract feature maps, not raw pixels, so their weights aren't directly interpretable as images.

Technique 2: Saliency Maps via Backpropagation

Question: which pixels in the input image are most responsible for the network's prediction? Compute the gradient of the class score $S_c$ with respect to the input image pixels:

Saliency Map
$$\text{Saliency} = \left| \frac{\partial S_c}{\partial x} \right|$$
Take the absolute value, then the max over the RGB channels. High-gradient pixels are pixels that, if changed slightly, would most affect the class score.

The result is a grayscale map highlighting which image regions the network "looks at." Saliency maps typically highlight object boundaries and discriminative features (e.g., a dog's face, a car's wheels).

Technique 3: CAM & Grad-CAM

Class Activation Mapping (CAM) (Zhou et al., 2016) exploits the structure of networks ending in global average pooling + fully connected layer. The class score $S_c$ can be decomposed as a weighted sum of the last conv layer's feature maps:

CAM
$$M_c(h, w) = \sum_k w_{k,c} \, f_{h,w,k}$$
$w_{k,c}$ = weight connecting feature channel k to class c. $f_{h,w,k}$ = activation at position (h,w) in channel k. The result is a heatmap showing which spatial regions contribute most to class c.

CAM's limitation: it only works at the last conv layer and requires a specific architecture (global average pooling before the classifier).
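
The decomposition can be verified on a toy example: with global average pooling followed by a linear layer (no bias assumed here), the class score equals the spatial mean of the CAM heatmap. The function names and the 2×2×2 feature values are hypothetical.

```python
def cam(features, weights, c):
    """features[h][w][k]: last conv layer activations.
    weights[k][c]: classifier weight from channel k to class c.
    Returns the H x W class activation map for class c."""
    return [[sum(weights[k][c] * f for k, f in enumerate(pixel))
             for pixel in row] for row in features]

def class_score(features, weights, c):
    """Global average pooling followed by a bias-free linear layer."""
    H, W, K = len(features), len(features[0]), len(features[0][0])
    pooled = [sum(features[h][w][k] for h in range(H) for w in range(W)) / (H * W)
              for k in range(K)]
    return sum(weights[k][c] * pooled[k] for k in range(K))

features = [[[1.0, 0.0], [2.0, 1.0]],
            [[0.0, 3.0], [1.0, 1.0]]]
weights = [[1.0, -1.0],
           [0.5, 2.0]]

heatmap = cam(features, weights, c=0)
score = class_score(features, weights, c=0)
heatmap_mean = sum(sum(row) for row in heatmap) / 4
print(heatmap, score)  # [[1.0, 2.5], [1.5, 1.5]] 1.625
```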

Grad-CAM (Selvaraju et al., 2017) generalizes CAM to any layer of any network. Instead of using the FC layer weights, it uses the gradients of the class score with respect to the layer's activations:

Grad-CAM Algorithm
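  1. Pick a layer with activations $A \in \mathbb{R}^{H \times W \times K}$.
  2. Compute gradients: $\partial S_c / \partial A \in \mathbb{R}^{H \times W \times K}$.
  3. Global average pool the gradients to get importance weights: $\alpha_k = \frac{1}{HW} \sum_{h,w} \partial S_c / \partial A_{h,w,k}$.
  4. Weighted combination + ReLU: $M_c(h,w) = \mathrm{ReLU}\left(\sum_k \alpha_k A_{h,w,k}\right)$.
  5. Upsample $M_c$ to image resolution. Overlay as a heatmap.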
Why ReLU?

The ReLU keeps only positive contributions. We want to highlight regions that increase the class score (features that support the prediction), not regions that decrease it. Negative values would correspond to features that support other classes.

Technique 4: Adversarial Examples

A disturbing discovery: adding a tiny, imperceptible perturbation to an image can flip the network's prediction entirely. An image of a panda, classified correctly with 57.7% confidence, becomes a "gibbon" with 99.3% confidence after adding noise invisible to the human eye.

These adversarial examples are generated by gradient ascent on the loss with respect to input pixels — essentially asking "what small change to the pixels would most increase the loss for the correct class?"

FGSM Attack
$$x_{\text{adv}} = x + \varepsilon \cdot \mathrm{sign}\left(\nabla_x L(x, y_{\text{true}})\right)$$
ε is tiny (e.g., 0.007 on [0,1] pixels). The perturbation is imperceptible but pushes the input across a decision boundary.
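
The FGSM step can be sketched on a model small enough to differentiate by hand. The example below uses binary logistic regression as a stand-in for a CNN, so the input gradient has a closed form; the weights, input, and function names are all hypothetical, and clipping to the valid pixel range is omitted.

```python
import math

def sign(v):
    """Return -1, 0, or +1."""
    return (v > 0) - (v < 0)

def loss_and_grad(x, y, w, b):
    """Cross-entropy loss of logistic regression and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))          # P(y = 1 | x)
    loss = -math.log(p if y == 1 else 1.0 - p)
    # dL/dz = p - y and dz/dx_i = w_i, so dL/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def fgsm(x, y, w, b, eps):
    """One FGSM step: move each input by eps in the loss-increasing direction."""
    _, grad = loss_and_grad(x, y, w, b)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

x = [0.2, 0.7, 0.5]
w, b, y = [1.0, -2.0, 0.5], 0.3, 1
x_adv = fgsm(x, y, w, b, eps=0.007)

loss_clean, _ = loss_and_grad(x, y, w, b)
loss_adv, _ = loss_and_grad(x_adv, y, w, b)
print(loss_clean, loss_adv)
```

Every input moves by exactly ε, yet the loss for the true class goes up. In a deep network the same step is applied to the gradient obtained by backpropagation.
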
Real-World Implications

Adversarial examples transfer between models: an attack crafted for ResNet often fools VGG too. They work in the physical world: printed adversarial patches can fool autonomous vehicle classifiers. This is an active area of security research with no complete solution yet.

Worked Example — Grad-CAM Computation

Layer activations A: 14×14×512. Target class: "cat" (index 281). Compute $\partial S_{281} / \partial A$ (same shape: 14×14×512). Global average pool each channel: $\alpha \in \mathbb{R}^{512}$. Weighted sum: $M(h,w) = \mathrm{ReLU}\left(\sum_{k=1}^{512} \alpha_k A_{h,w,k}\right) \in \mathbb{R}^{14 \times 14}$. Upsample to 224×224 via bilinear interpolation. The result highlights the cat's face and body, the regions driving the "cat" prediction.
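
The pooling and weighting steps can be sketched on a tiny example. Here the activations and gradients are hypothetical hand-written 2×2×2 arrays; in practice the gradients would come from backpropagation in an autodiff framework, and the final upsampling step is omitted.

```python
def grad_cam(activations, gradients):
    """activations, gradients: H x W x K nested lists of the same shape.
    Returns (heatmap, alphas): the H x W Grad-CAM map before upsampling
    and the per-channel importance weights."""
    H, W, K = len(activations), len(activations[0]), len(activations[0][0])
    # Global average pool the gradients over the spatial dimensions.
    alphas = [sum(gradients[h][w][k] for h in range(H) for w in range(W)) / (H * W)
              for k in range(K)]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    heatmap = [[max(0.0, sum(alphas[k] * activations[h][w][k] for k in range(K)))
                for w in range(W)] for h in range(H)]
    return heatmap, alphas

activations = [[[1.0, 2.0], [0.0, 1.0]],
               [[3.0, 0.0], [1.0, 1.0]]]
gradients = [[[1.0, -1.0], [1.0, -1.0]],
             [[1.0, -1.0], [1.0, -1.0]]]

heatmap, alphas = grad_cam(activations, gradients)
print(alphas)   # [1.0, -1.0]
print(heatmap)  # [[0.0, 0.0], [3.0, 0.0]]
```

Only the bottom-left location, where the positively weighted channel is strongly active, survives the ReLU.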

Chapter 09

Showcase: Detection Pipeline

Time to put it all together. This chapter walks through the entire object detection pipeline, step by step, showing how raw pixels become classified, localized objects.

Here's what happens at each stage:

Detection Pipeline Stages
  1. Input Image: Raw pixels, 3×H×W. A scene with multiple objects at different scales.
  2. Backbone Features: Pass through ResNet/VGG. Produces multi-scale feature maps (C4, C5). Rich semantic features, reduced spatial resolution.
  3. Region Proposals (RPN): Anchor boxes + objectness scores at each feature position. Top ~300 proposals selected by confidence.
  4. RoI Align: For each proposal, crop and resize features to a fixed 7×7 grid using bilinear interpolation.
  5. Per-Region Classification: Each proposal gets a class label and confidence score. Background proposals are discarded.
  6. Non-Maximum Suppression: Remove duplicate detections. Keep only the highest-confidence box for each distinct object.
  7. Final Output: Clean set of bounding boxes with class labels and confidence scores.
The Elegance of End-to-End Training

In Faster R-CNN, the learned components (backbone, RPN, and detection head) are trained jointly. Gradients flow from the final classification and box losses back through the per-region features to the first conv layer. Proposal selection and NMS are not differentiable; they act as fixed selection steps between the learned parts. The network learns to extract features that are simultaneously good for proposing regions and for classifying them.

Detection in Practice

Real detection systems add several refinements beyond the basic pipeline:

Technique | What It Does | Why It Helps
Feature Pyramid Network (FPN) | Creates top-down feature maps at multiple scales | Detects objects at all sizes: small objects from high-res features, large objects from low-res features
Multi-scale training | Randomly resizes training images | Teaches the network to handle objects at varied scales
Soft-NMS | Reduces confidence instead of removing boxes | Handles highly-overlapping objects (e.g., a crowd)
Test-time augmentation | Runs detection on flipped/scaled copies, then merges | Improves accuracy at the cost of speed
Class-agnostic NMS | Runs NMS across all classes | Prevents multiple classes from claiming the same box
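To make the Soft-NMS row concrete, here is a small sketch that runs the same greedy loop with two rescoring rules: hard suppression (score set to zero above an IoU threshold) and the Gaussian decay used by Soft-NMS. The function names, the `sigma=0.5` default, and the `score_threshold` cutoff are illustrative choices, not a reference implementation.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rescore(overlap, method, iou_threshold, sigma):
    """Multiplier applied to a neighbour's score after a box is selected."""
    if method == "hard":
        return 0.0 if overlap > iou_threshold else 1.0
    return math.exp(-(overlap ** 2) / sigma)  # Gaussian Soft-NMS decay


def nms(boxes, scores, method="hard", iou_threshold=0.5, sigma=0.5,
        score_threshold=0.001):
    """Greedy NMS. Returns (index, final_score) pairs in selection order."""
    scores = list(scores)  # work on a copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        for i in remaining:
            scores[i] *= rescore(iou(boxes[best], boxes[i]), method,
                                 iou_threshold, sigma)
        # Boxes whose score has decayed to (almost) nothing are dropped.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return kept


# Two heavily overlapping boxes plus one far away.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print("hard:", nms(boxes, scores, method="hard"))
print("soft:", nms(boxes, scores, method="gaussian"))
```

With hard suppression the second box disappears entirely. With the Gaussian rule it survives with a reduced score, which helps when two real objects overlap heavily.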
Worked Example — Full Pipeline Numbers

  1. Input: 3×800×600.
  2. Backbone (ResNet-50 + FPN): features at scales {1/4, 1/8, 1/16, 1/32, 1/64}, about 40,000 spatial positions in total.
  3. RPN: 9 anchors per position per scale → ~360,000 candidate boxes.
  4. After NMS + top-300: 300 proposals.
  5. RoI Align: 300 × 256×7×7 feature tensors.
  6. Per-region head: 300 class vectors (81-dim for COCO) + 300 box vectors (4×81).
  7. After score thresholding (keep conf > 0.05) + class-wise NMS: ~20-50 final detections.
  8. Total inference time: ~90ms on a V100 GPU.
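The counts in this example can be checked with a few lines of arithmetic. The sketch below assumes H=800 and W=600, feature-map sizes rounded up at each stride, and anchors placed at every position of every level. Under those assumptions, 9 anchors per position gives about 360,000 candidates, and 3 per position (a common FPN setting) gives about 120,000. The helper names are illustrative.

```python
import math

# Values taken from the worked example above.
HEIGHT, WIDTH = 800, 600
STRIDES = [4, 8, 16, 32, 64]          # scales 1/4 ... 1/64
NUM_PROPOSALS, CHANNELS, ROI_SIZE = 300, 256, 7
NUM_CLASSES = 81                      # 80 COCO classes + background


def positions_per_level(height, width, strides):
    """Spatial positions on each pyramid level (assumes ceil rounding)."""
    return [math.ceil(height / s) * math.ceil(width / s) for s in strides]


def total_anchors(height, width, strides, anchors_per_position):
    """Anchor count if every position on every level carries the same anchors."""
    return anchors_per_position * sum(positions_per_level(height, width, strides))


levels = positions_per_level(HEIGHT, WIDTH, STRIDES)
roi_values = NUM_PROPOSALS * CHANNELS * ROI_SIZE * ROI_SIZE
class_outputs = NUM_PROPOSALS * NUM_CLASSES
box_outputs = NUM_PROPOSALS * 4 * NUM_CLASSES

print("positions per level:", levels)
print("anchors (9 per position):", total_anchors(HEIGHT, WIDTH, STRIDES, 9))
print("anchors (3 per position):", total_anchors(HEIGHT, WIDTH, STRIDES, 3))
print("RoI Align output values:", roi_values)
print("class / box outputs:", class_outputs, box_outputs)
```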

Chapter 10

Summary & Connections

This lecture covered the full spectrum of dense prediction tasks in computer vision, from labeling every pixel to finding and masking individual objects, plus techniques for understanding what the network has learned.

The Evolution of Object Detection

Year | Method | Key Innovation | Time per image (approx.)
2014 | R-CNN | CNN features for detection | ~47 s
2015 | Fast R-CNN | Shared backbone + RoI Pooling | ~0.3 s
2015 | Faster R-CNN | Learned region proposals (RPN) | ~0.14 s
2016 | YOLO | Single-stage grid-based detection | ~0.02 s
2016 | SSD | Multi-scale single-stage detection | ~0.02 s
2017 | RetinaNet | Focal loss for class imbalance | ~0.07 s
2017 | Mask R-CNN | Instance masks + RoI Align | ~0.2 s
2020 | DETR | Transformer, no anchors, no NMS | ~0.04 s

Key Concepts Summary

Concept | What It Solves
FCN + Downsample/Upsample | Efficient per-pixel prediction at full resolution
Transposed Convolution | Learnable upsampling (better than fixed interpolation)
U-Net Skip Connections | Preserves spatial detail lost during downsampling
Anchor Boxes | Efficient multi-scale, multi-ratio box proposals
RoI Align | Pixel-accurate feature cropping (no quantization)
NMS | Removes duplicate detections
Focal Loss | Handles extreme class imbalance in single-stage detectors
Saliency Maps | Which input pixels matter for the prediction
Grad-CAM | Which spatial regions at any layer drive the prediction
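The focal loss entry is easiest to appreciate with numbers. The sketch below implements the binary form FL(p_t) = -α_t (1 - p_t)^γ log(p_t) with the commonly quoted defaults α = 0.25 and γ = 2, and compares it with plain cross-entropy on one easy background example and one hard foreground example. The function names and the `eps` clamp are illustrative.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, eps))


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


# Easy negative: background anchor that the model already scores at 1%.
easy = (0.01, 0)
# Hard positive: foreground anchor that the model scores at only 10%.
hard = (0.10, 1)

for name, (p, y) in [("easy negative", easy), ("hard positive", hard)]:
    ce, fl = cross_entropy(p, y), focal_loss(p, y)
    print(f"{name}: CE={ce:.6f}  FL={fl:.8f}  FL/CE={fl / ce:.6f}")
```

The easy negative is down-weighted by several orders of magnitude while the hard positive keeps a sizeable share of its loss. That keeps the thousands of easy background anchors in a single-stage detector from dominating the gradient.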

Connections to Other Topics

The encoder-decoder architecture of U-Net reappears everywhere: in diffusion models (the U-Net denoises images), in image generation (VAEs encode then decode), and in depth estimation. Feature Pyramid Networks, introduced as a backbone add-on for Faster R-CNN-style detectors, are now used in many modern detection and segmentation systems. And Mask R-CNN's approach of adding task-specific heads to a shared backbone is the design pattern behind multi-task learning in vision.

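DETR opened the door to vision-language models: if you can use Transformer queries for detection, you can condition those queries on language for open-vocabulary detection. Models like OWL-ViT and Grounding DINO build on DETR-style set prediction: OWL-ViT adopts DETR's bipartite matching loss, and Grounding DINO extends the DETR-derived DINO detector with text inputs.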

The One Sentence

Detection and segmentation turn classification's "what" into "where" and "which pixels" — and the key ideas (anchor boxes, RoI Align, multi-scale features, learned upsampling) recur in every modern vision system. Understanding these building blocks is non-negotiable.

References

  1. Long, Shelhamer, Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
  2. Ronneberger, Fischer, Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI 2015.
  3. Girshick, Donahue, Darrell, Malik. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." CVPR 2014.
  4. Girshick. "Fast R-CNN." ICCV 2015.
  5. Ren, He, Girshick, Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NeurIPS 2015.
  6. Redmon, Divvala, Girshick, Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." CVPR 2016.
  7. Liu, Anguelov, Erhan, et al. "SSD: Single Shot MultiBox Detector." ECCV 2016.
  8. Lin, Goyal, Girshick, He, Dollár. "Focal Loss for Dense Object Detection." ICCV 2017.
  9. He, Gkioxari, Dollár, Girshick. "Mask R-CNN." ICCV 2017.
  10. Carion, Massa, Synnaeve, Usunier, Kirillov, Zagoruyko. "End-to-End Object Detection with Transformers." ECCV 2020.
  11. Selvaraju, Cogswell, Das, Vedantam, Parikh, Batra. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." ICCV 2017.
  12. Zhou, Khosla, Lapedriza, Oliva, Torralba. "Learning Deep Features for Discriminative Localization." CVPR 2016.
  13. Simonyan, Vedaldi, Zisserman. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." ICLR Workshop 2014.
  14. Goodfellow, Shlens, Szegedy. "Explaining and Harnessing Adversarial Examples." ICLR 2015.