Fast R-CNN — Veanors

Chapter 0: The Problem

R-CNN proved that deep CNN features could crush hand-crafted features for object detection. But it had a dirty secret: it was painfully slow. To detect objects in a single image, R-CNN extracts ~2000 region proposals, warps each one to 227x227, and runs each through the entire CNN independently. That's 2000 forward passes through AlexNet or VGG-16 per image.

With VGG-16, detection took 47 seconds per image on a GPU. That's not a typo. Nearly a minute to process one photograph. And training was even worse — a multi-stage pipeline that first fine-tuned the CNN, then trained SVMs on cached features, then trained bounding-box regressors. Features had to be written to disk, requiring hundreds of gigabytes of storage.

SPP-net (He et al., 2014) partially fixed the speed problem by computing a convolutional feature map once for the whole image and using spatial pyramid pooling to extract features per region. This sped up test time by 10-100x. But SPP-net still had the multi-stage training pipeline, still cached features to disk, and, critically, could not update the convolutional layers during fine-tuning. Back-propagating through the spatial pyramid pooling layer was highly inefficient under its training scheme, in which every RoI in a mini-batch came from a different image, so the conv layers stayed frozen.

The wish list: We want a detector that (1) processes the image only once, (2) trains end-to-end in a single stage, (3) jointly learns classification AND bounding-box regression, (4) can update ALL network layers (including conv layers), and (5) needs no disk storage for cached features. Fast R-CNN delivers all five.

What is the fundamental reason R-CNN is slow at test time?

It runs the entire CNN independently on each of ~2000 region proposals, so no computation is shared between regions
The CNN architecture is too deep
Non-maximum suppression is computationally expensive

Chapter 1: The Key Insight

Fast R-CNN's insight is beautifully simple: process the entire image through the CNN once, producing a shared convolutional feature map. Then, for each region proposal, just "reach into" that shared feature map and pool out a fixed-size feature vector.

Think of it this way. R-CNN is like a restaurant where every customer gets their own personal chef who cooks their meal from scratch. Fast R-CNN is like a buffet — one kitchen prepares all the food, and each customer just takes a plate and fills it from the shared spread. Same food, vastly less cooking.

The mechanism that makes this possible is called RoI pooling (Region of Interest pooling). Given any rectangular region in the feature map — regardless of its size or aspect ratio — RoI pooling divides it into a fixed H x W grid and max-pools each cell, producing a feature vector of exactly the same dimensionality regardless of the input region's size.

Why this is profound: The convolutional layers are the expensive part. A single VGG-16 forward pass involves billions of multiply-accumulate operations, most of them in the conv layers, while the fully connected layers are cheap per region by comparison. By sharing the conv computation across all regions, Fast R-CNN reduces VGG-16 test time from 47 seconds to 0.32 seconds per image, a 146x speedup (proposal generation excluded). RoI pooling plus the FC layers add comparatively little cost per region.

But RoI pooling enables something even more important than speed: end-to-end training. Because RoI pooling is differentiable, gradients can flow all the way from the classification and regression losses back through the RoI pooling layer into the convolutional layers. This means we can fine-tune the entire network — including conv layers — for the detection task, which SPP-net could not do.

What are the two key benefits of processing the image once and pooling features per region?

Massive speedup from shared conv computation AND end-to-end training since RoI pooling is differentiable
Better accuracy AND simpler code
Smaller model size AND less GPU memory

Chapter 2: The R-CNN Bottleneck

Let's quantify exactly how wasteful R-CNN is. Selective Search generates about 2000 region proposals per image. In R-CNN, every single one gets warped to 227x227 pixels and fed through the entire CNN — all 5 conv layers and 3 FC layers of AlexNet, or all 13 conv layers and 3 FC layers of VGG-16.

Here's the critical observation: most of these regions overlap heavily. Two proposals might share 80% of their pixels. Yet R-CNN computes conv features for both from scratch. The same pixels get convolved through the same filters thousands of times.


As an illustrative estimate, suppose a single VGG-16 forward pass takes ~23ms on a GPU. Multiply by 2000 regions: that's 46 seconds just for feature extraction, in line with the 47 seconds per image measured for R-CNN. Fast R-CNN instead runs the VGG-16 conv layers once (~180ms), then pools and classifies each region in ~0.01ms.

The numbers: R-CNN: 2000 x 23ms = 46,000ms. Fast R-CNN: 180ms + 2000 x 0.01ms = 200ms, roughly 230x less computation in this simplified model. The measured end-to-end figure is 0.32 seconds per image, a 146x speedup over R-CNN's 47 seconds, using the same region proposals and with slightly higher accuracy.
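
The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the simplified cost model: the timings are the illustrative figures from the text rather than measurements, and the function and constant names are made up for this example.

```python
# Simplified cost model using the illustrative timings quoted in the text.
NUM_PROPOSALS = 2000     # Selective Search proposals per image
FULL_PASS_MS = 23.0      # one full VGG-16 forward pass on a warped region
SHARED_CONV_MS = 180.0   # one pass of the conv layers over the whole image
PER_ROI_MS = 0.01        # RoI pooling + FC layers for a single region


def rcnn_cost_ms(num_proposals: int) -> float:
    """R-CNN: every proposal pays for a full forward pass."""
    return num_proposals * FULL_PASS_MS


def fast_rcnn_cost_ms(num_proposals: int) -> float:
    """Fast R-CNN: the conv cost is paid once, plus a small per-RoI cost."""
    return SHARED_CONV_MS + num_proposals * PER_ROI_MS


rcnn = rcnn_cost_ms(NUM_PROPOSALS)
fast = fast_rcnn_cost_ms(NUM_PROPOSALS)
print(f"R-CNN: {rcnn:.0f} ms, Fast R-CNN: {fast:.0f} ms, ratio: {rcnn / fast:.0f}x")
```

Note how the Fast R-CNN cost is dominated by the fixed conv term: doubling the number of proposals in this model adds only 20 ms.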

Why does R-CNN waste so much computation?

Region proposals overlap heavily, so the same pixels are convolved through the same filters thousands of times, because each region is processed independently with no sharing
The SVM classifier is too slow
Selective Search generates too many proposals

Chapter 3: RoI Pooling

RoI pooling is the heart of Fast R-CNN. It solves a specific problem: region proposals come in all sizes and aspect ratios, but the fully connected layers that follow expect a fixed-size input. We need a way to convert any region — tall, wide, square, big, small — into a fixed H x W feature map.

Here's how it works, step by step:

Project the region onto the feature map. If the original image region is at coordinates (x, y, w, h), divide by the total stride of the conv layers (16 for VGG-16) to get the corresponding location on the feature map.
Divide into an H x W grid. Take the projected region on the feature map and split it into H x W sub-windows. For VGG-16, H = W = 7, so you get 49 sub-windows. The sub-windows may have slightly different sizes if the region dimensions aren't evenly divisible.
Max-pool each cell. Within each sub-window, take the maximum activation value. This produces exactly one value per cell per channel.
Output: H x W x C feature map. For VGG-16 with 512 channels and H=W=7, every region — regardless of its original size — becomes a 7 x 7 x 512 tensor, which gets flattened to a 25,088-dimensional vector and fed into the FC layers.
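
The four steps above can be sketched for a single channel in plain Python. This is a minimal illustration, not the paper's implementation: the function name `roi_max_pool`, the rounding used for projection, and the floor/ceil rule for sub-window boundaries are assumptions made here (the rule lets neighbouring cells overlap by one position when the region does not divide evenly).

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # one channel, indexed [row][col]


def roi_max_pool(
    feature_map: FeatureMap,
    roi_xyxy: Tuple[float, float, float, float],
    out_h: int = 7,
    out_w: int = 7,
    stride: int = 16,
) -> List[List[float]]:
    """Max-pool one image-space RoI (x1, y1, x2, y2) into an out_h x out_w grid."""
    rows, cols = len(feature_map), len(feature_map[0])

    # Step 1: project onto the feature map (rounding is an assumption here).
    x1, y1, x2, y2 = (round(c / stride) for c in roi_xyxy)
    x1 = min(max(x1, 0), cols - 1)
    y1 = min(max(y1, 0), rows - 1)
    x2 = min(max(x2, x1), cols - 1)
    y2 = min(max(y2, y1), rows - 1)
    roi_w, roi_h = x2 - x1 + 1, y2 - y1 + 1

    pooled = []
    for i in range(out_h):
        # Step 2: floor for the start, ceil for the end, so no cell is empty.
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + ((i + 1) * roi_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + ((j + 1) * roi_w + out_w - 1) // out_w
            # Step 3: take the maximum activation inside the sub-window.
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled  # Step 4: fixed out_h x out_w output for this channel


fm = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
print(roi_max_pool(fm, (0, 0, 48, 48), out_h=2, out_w=2))    # [[9.0, 11.0], [25.0, 27.0]]
print(roi_max_pool(fm, (16, 0, 111, 47), out_h=2, out_w=2))  # [[12.0, 15.0], [28.0, 31.0]]
```

The two calls use regions of different size and aspect ratio (4x4 and 7x4 on the feature map), yet both produce a 2x2 output. A real layer would apply the same pooling independently to each of the C channels.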

RoI pooling is just a special case of spatial pyramid pooling with a single pyramid level. SPP-net used multiple pyramid levels (e.g., 1x1, 2x2, 4x4) concatenated together. Fast R-CNN uses a single level (7x7 for VGG-16), and because the layer is plain max pooling over sub-windows, gradients can be back-propagated through it, which is what enables end-to-end training.


Backpropagation through RoI pooling

The gradient computation is straightforward. RoI pooling is just max pooling over irregular sub-windows. During the forward pass, we record which input position was the max in each sub-window (the "argmax switch"). During backprop, the gradient from each output cell routes back to its argmax input position. If an input position was the max for multiple RoIs (because regions overlap), its gradient is the sum of all contributing output gradients:

∂L/∂x_i = ∑_r ∑_j [i = i*(r,j)] · ∂L/∂y_rj

Where i*(r,j) is the argmax index for the j-th output of the r-th RoI. The Iverson bracket [i = i*(r,j)] is 1 only for the position that "won" the max pooling.
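
The gradient routing in the formula above can be sketched as follows. To keep the example short, the feature map is flattened to a 1-D list and each RoI is given directly as a list of sub-windows (lists of input indices); this representation and the function names are assumptions for illustration only.

```python
from typing import List, Tuple


def roi_pool_forward(x: List[float], windows: List[List[int]]) -> Tuple[List[float], List[int]]:
    """Max over each sub-window; also return the argmax 'switch' per output."""
    argmax = [max(win, key=lambda i: x[i]) for win in windows]
    y = [x[i] for i in argmax]
    return y, argmax


def roi_pool_backward(
    n_inputs: int,
    argmax_per_roi: List[List[int]],
    grad_y_per_roi: List[List[float]],
) -> List[float]:
    """dL/dx_i = sum over RoIs r and outputs j of [i == i*(r, j)] * dL/dy_rj."""
    grad_x = [0.0] * n_inputs
    for argmax, grad_y in zip(argmax_per_roi, grad_y_per_roi):
        for i_star, g in zip(argmax, grad_y):
            grad_x[i_star] += g  # only the winning position receives gradient
    return grad_x


x = [0.1, 0.9, 0.3, 0.5, 0.2]
# Two overlapping RoIs, each with two sub-windows.
_, argmax_a = roi_pool_forward(x, [[0, 1], [2, 3]])
_, argmax_b = roi_pool_forward(x, [[1, 2], [3, 4]])
grad_x = roi_pool_backward(len(x), [argmax_a, argmax_b], [[1.0, 2.0], [10.0, 20.0]])
print(grad_x)  # [0.0, 11.0, 0.0, 22.0, 0.0]
```

Positions 1 and 3 win the max in both RoIs, so each accumulates gradient from two outputs, while the non-winning positions receive zero.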

How does RoI pooling handle regions of different sizes?

It divides any region into a fixed H x W grid of sub-windows (adapting sub-window size to the region), then max-pools each cell to produce a fixed-size output
It resizes all regions to the same size before pooling
It pads smaller regions with zeros

Chapter 4: Multi-Task Loss

R-CNN trained in three separate stages: (1) fine-tune the CNN with softmax, (2) train SVM classifiers on cached features, (3) train bounding-box regressors on cached features. Each stage operated independently, and the features had to be stored on disk between stages. This was slow, inelegant, and — as Fast R-CNN shows — suboptimal.

Fast R-CNN replaces all three stages with a single multi-task loss that jointly trains classification and bounding-box regression end-to-end:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)

Let's unpack each piece:

L_cls(p, u) = −log p_u — Standard cross-entropy loss. p is the softmax probability distribution over K+1 classes (K objects + 1 background), and u is the true class label.
[u ≥ 1] — Iverson bracket. Equals 1 for foreground classes (u ≥ 1), equals 0 for background (u = 0). We only regress boxes for actual objects, not background.
L_loc(t^u, v) — Bounding-box regression loss between predicted offsets t^u and target offsets v for the true class u.
λ — Balancing hyperparameter, set to 1 in all experiments.

Smooth L1 Loss

For the bounding-box regression, Girshick introduces smooth L1 loss instead of the L2 (squared error) loss used in R-CNN:

smooth_L1(x) = { 0.5x² if |x| < 1, |x| − 0.5 otherwise }

Why not L2? Because the regression targets are unbounded, a large error under L2 loss produces a proportionally large gradient, so training can require careful learning-rate tuning to avoid exploding gradients. Smooth L1 behaves like L2 near zero (good for small errors) but transitions to L1 for large errors (gradient magnitude capped at 1), which makes it less sensitive to outliers.
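
A minimal sketch of the loss for a single RoI, following the definitions above. It assumes that L_loc sums smooth L1 over the four box coordinates (x, y, w, h), as in the paper; the function names and the plain-list inputs are choices made for this example.

```python
import math
from typing import Sequence


def smooth_l1(x: float) -> float:
    """0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def smooth_l1_grad(x: float) -> float:
    """Derivative: x inside (-1, 1), sign(x) outside, so |gradient| <= 1."""
    if abs(x) < 1.0:
        return x
    return 1.0 if x > 0 else -1.0


def multi_task_loss(
    p: Sequence[float],    # softmax probabilities over K+1 classes
    u: int,                # true class, 0 = background
    t_u: Sequence[float],  # predicted offsets for class u
    v: Sequence[float],    # regression targets
    lam: float = 1.0,
) -> float:
    """L = -log p_u + lam * [u >= 1] * sum_i smooth_l1(t_u[i] - v[i])."""
    l_cls = -math.log(p[u])
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc


# Foreground RoI: classification and localization terms both contribute.
print(multi_task_loss([0.2, 0.8], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0)))
# Background RoI: the box offsets are ignored entirely.
print(multi_task_loss([0.5, 0.5], 0, (9.0, 9.0, 9.0, 9.0), (0.0, 0.0, 0.0, 0.0)))
```

For comparison, the gradient of an L2 loss 0.5x² at x = 10 would be 10, whereas `smooth_l1_grad(10.0)` returns 1.0.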

Smooth L1 vs L1 vs L2 — notice how Smooth L1 combines L2's smoothness near zero with L1's robustness to outliers

Why multi-task learning helps: Table 6 in the paper shows that, for VGG-16, jointly training classification and bbox regression (multi-task) gives 66.9% mAP, compared to 64.0% for stage-wise training, where the bbox regressors are trained separately after the classifier. Joint training gives better features because the conv layers learn representations useful for BOTH tasks. Even with bbox regression switched off at test time, the multi-task model classifies better than one trained for classification alone (63.4% vs 62.6% mAP).

Why does Fast R-CNN use smooth L1 loss instead of L2 for bounding-box regression?

L2 loss produces huge gradients when regression targets are large, risking gradient explosion, whereas smooth L1 caps the gradient magnitude at 1 for large errors while staying smooth near zero
Smooth L1 is faster to compute
L2 loss cannot be differentiated

Chapter 5: Architecture

The Fast R-CNN architecture is elegant in its simplicity. Here's the full pipeline:

Input: An entire image + a set of region proposals (from Selective Search).
Convolutional backbone: The image passes through all conv layers of a pre-trained network (e.g., VGG-16's 13 conv layers + 4 max-pool layers), producing a feature map. For a 1000x600 input image with VGG-16, the feature map is roughly 62x37x512.
RoI projection: Each region proposal's coordinates are divided by the total stride (16 for VGG-16) to find the corresponding rectangle on the feature map.
RoI pooling: Each projected region is pooled into a 7x7x512 fixed-size tensor.
FC layers: The 25,088-d vector goes through two fully connected layers (4096-d each with ReLU), producing a 4096-d feature vector per RoI.
Two sibling heads:
- Classification head: FC layer → (K+1)-way softmax → class probabilities
- Regression head: FC layer → 4K outputs → per-class bounding-box offsets (t_x, t_y, t_w, t_h)
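
The tensor sizes along the pipeline above can be traced with a little arithmetic. This is a rough sketch: the helper name `fast_rcnn_shapes` is invented here, and the feature-map size is approximated by integer division by the stride, so exact sizes in a real network may differ by a pixel depending on how pooling layers round.

```python
from typing import Dict


def fast_rcnn_shapes(
    img_w: int,
    img_h: int,
    num_classes: int,    # K object classes, background excluded
    stride: int = 16,    # total stride of the VGG-16 conv layers
    channels: int = 512,
    pool: int = 7,       # RoI pooling grid, H = W = 7
    fc_dim: int = 4096,
) -> Dict[str, object]:
    """Approximate sizes of each stage for one image and one RoI."""
    return {
        "feature_map": (img_w // stride, img_h // stride, channels),
        "roi_pooled": (pool, pool, channels),
        "fc_input": pool * pool * channels,
        "fc_output": fc_dim,
        "cls_outputs": num_classes + 1,   # K classes + background
        "bbox_outputs": 4 * num_classes,  # (t_x, t_y, t_w, t_h) per class
    }


shapes = fast_rcnn_shapes(1000, 600, num_classes=20)  # 20 classes as in PASCAL VOC
for name, value in shapes.items():
    print(f"{name}: {value}")
```

With 20 classes the classifier head has 21 outputs and the regression head has 80, while every RoI enters the FC layers as the same 25,088-d vector.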

Fast R-CNN architecture — shared conv features, RoI pooling, two-head output

Initialization from pre-trained networks

Three transformations convert a classification network into a Fast R-CNN detector:

The last max pooling layer is replaced by an RoI pooling layer (with H=W=7 to match the first FC layer's expected input).
The final FC layer and 1000-way softmax are replaced by the two sibling heads (K+1 classifier + 4K regressor).
The network is modified to accept two inputs: a list of images and a list of RoIs.

Which layers to fine-tune? The paper experiments with fine-tuning different numbers of conv layers. For VGG-16, fine-tuning from conv3_1 onward (9 of the 13 conv layers) gives the best trade-off. Fine-tuning only the FC layers (like SPP-net was forced to do) reduces mAP by 5.5 points, from 66.9% to 61.4%. Fine-tuning even earlier conv layers brings little further gain while making training slower and more memory-hungry, since the early layers learn generic features (edges, textures) that are useful as-is.

Why does Fast R-CNN use per-class bounding-box regressors (4K outputs) rather than class-agnostic regressors (4 outputs)?

Different object classes have different shapes and aspect ratios (a person is tall and narrow, a car is wide and short), so class-specific offsets give slightly better localization (66.9 vs 65.6 mAP)
Class-agnostic regressors cannot be trained end-to-end
Per-class regressors are faster at inference

Chapter 6: Training

Fast R-CNN's training strategy has a clever trick that makes it far more efficient than R-CNN or SPP-net: hierarchical mini-batch sampling.

The sampling problem

In R-CNN and SPP-net, each training example (RoI) came from a different image. This was terrible for efficiency — if you sample 128 RoIs from 128 different images, you effectively run the conv layers 128 times per mini-batch. No computation sharing.

Hierarchical sampling

Fast R-CNN samples mini-batches hierarchically:

Sample N = 2 images per mini-batch.
Sample R/N = 64 RoIs per image, for a total of R = 128 RoIs per mini-batch.

Since all 64 RoIs from the same image share the same conv feature map, the forward and backward passes share computation. This makes training roughly 64x faster than sampling one RoI from each of 128 images.

Positive/negative sampling

Of the 64 RoIs per image:

25% are positives (foreground): proposals with IoU ≥ 0.5 with a ground-truth box, labeled with the matched class.
75% are negatives (background): proposals whose maximum IoU with the ground-truth boxes lies in 0.1 ≤ IoU < 0.5. The lower bound of 0.1 appears to act as a heuristic for hard example mining: it concentrates training on proposals that are "close to an object but wrong" rather than "definitely not an object."
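
The per-image sampling rule above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the names `iou` and `sample_rois` are invented here, boxes are (x1, y1, x2, y2) tuples, at least one ground-truth box is assumed, and when an image has fewer than 16 positives the sketch simply fills the remainder with negatives.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_rois(
    proposals: List[Box],
    gt_boxes: List[Box],
    gt_labels: List[int],
    rois_per_image: int = 64,
    fg_fraction: float = 0.25,
    rng: Optional[random.Random] = None,
) -> List[Tuple[Box, int]]:
    """Return (box, label) pairs for one image; label 0 means background."""
    rng = rng or random.Random(0)
    fg, bg = [], []
    for box in proposals:
        overlaps = [iou(box, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        if overlaps[best] >= 0.5:
            fg.append((box, gt_labels[best]))  # positive, labeled with matched class
        elif overlaps[best] >= 0.1:
            bg.append((box, 0))                # negative in [0.1, 0.5)
        # proposals with max IoU < 0.1 are not sampled at all
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg) + rng.sample(bg, n_bg)


gt = [(0.0, 0.0, 100.0, 100.0)]
props = [(float(k), 0.0, 100.0 + k, 100.0) for k in range(30)]               # IoU >= 0.5
props += [(50.0 + 0.1 * k, 0.0, 150.0 + 0.1 * k, 100.0) for k in range(100)]  # 0.1 <= IoU < 0.5
props += [(500.0, 500.0, 600.0, 600.0)]                                       # IoU = 0
batch = sample_rois(props, gt, gt_labels=[3])
print(len(batch), sum(1 for _, label in batch if label > 0))  # 64 16
```

Calling `sample_rois` on N = 2 images and concatenating the results gives the R = 128 mini-batch, with only two conv forward passes needed.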

Hierarchical sampling: 2 images, 64 RoIs each — positive (green) vs negative (red)

Doesn't correlated sampling cause problems? You might worry that RoIs from the same image are correlated, which could slow convergence or cause overfitting. The paper addresses this directly: "This concern does not appear to be a practical issue and we achieve good results with N=2 and R=128 using fewer SGD iterations than R-CNN." The efficiency gains vastly outweigh any correlation effects.

Training details

SGD with momentum 0.9, weight decay 0.0005
Learning rate: 0.001 for 30k iterations, then 0.0001 for 10k more (on VOC)
New sibling output layers initialized from zero-mean Gaussians: σ = 0.01 (cls), σ = 0.001 (bbox); biases initialized to 0
Data augmentation: horizontal flip only (probability 0.5)
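
The schedule and optimizer settings above can be written down as a small sketch. The update rule shown is one common formulation of SGD with momentum and L2 weight decay, applied to a single scalar weight for clarity; the paper does not spell out the rule, so treat the exact form and the function names as assumptions.

```python
from typing import Tuple


def learning_rate(iteration: int) -> float:
    """Step schedule for VOC: 0.001 for the first 30k iterations, then 0.0001."""
    return 0.001 if iteration < 30_000 else 0.0001


def sgd_momentum_step(
    w: float,
    grad: float,
    velocity: float,
    lr: float,
    momentum: float = 0.9,
    weight_decay: float = 0.0005,
) -> Tuple[float, float]:
    """One update of a scalar weight; weight decay is added to the gradient."""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity


w, v = 1.0, 0.0
for it in (0, 1, 2):
    w, v = sgd_momentum_step(w, grad=0.5, velocity=v, lr=learning_rate(it))
    print(f"iter {it}: w={w:.7f}, velocity={v:.7f}")
```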

Why does Fast R-CNN sample only 2 images per mini-batch (with 64 RoIs each) instead of 128 images (with 1 RoI each)?

RoIs from the same image share the conv feature map in forward and backward passes, so sampling 64 RoIs per image means only 2 conv computations per mini-batch instead of 128, making training ~64x faster
It reduces overfitting
It improves classification accuracy

Chapter 7: Results

Fast R-CNN delivers on every dimension — accuracy, training speed, and test speed — compared to both R-CNN and SPP-net.

Detection accuracy (PASCAL VOC)

On VOC 2007 test set:

R-CNN (VGG-16): 66.0% mAP
SPP-net (ZF): 63.1% mAP
Fast R-CNN (VGG-16), trained on VOC 07: 66.9% mAP, single model at a single scale
Fast R-CNN (VGG-16), trained on VOC 07+12: 70.0% mAP

On VOC 2010 test set:

Fast R-CNN (VGG-16), trained on VOC 07++12 (VOC07 trainval and test plus VOC12 trainval): 68.8% mAP

mAP comparison on PASCAL VOC 2007 test set

Speed comparison (VGG-16)

Training: R-CNN takes 84 hours. Fast R-CNN takes 9.5 hours. That's 9x faster training.
Test time: R-CNN takes 47 seconds per image. Fast R-CNN takes 0.32 seconds without truncated SVD or 0.22 seconds with it. That's 146x to 213x faster inference (proposal time excluded).

Key ablation findings

Multi-task loss helps: Joint training (cls + bbox) gives 66.9% mAP vs. 64.0% for stage-wise training. Joint training produces better conv features.
Fine-tuning conv layers matters: Freezing conv layers (like SPP-net) gives 61.4% mAP. Fine-tuning from conv3_1 gives 66.9% — a 5.5 point improvement.
Softmax beats SVM: The end-to-end softmax classifier gets 66.9% mAP vs 66.8% for post-hoc SVMs. R-CNN needed SVMs because it couldn't fine-tune jointly; with joint training, softmax is just as good and much simpler.
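Single-scale works: Multi-scale training and testing (image pyramid) was only feasible for the paper's smaller models, where it gave a small boost of roughly 1 to 1.5 mAP points at a large cost in compute time. VGG-16 was run at a single scale, which the paper judges the best speed/accuracy trade-off.
More proposals don't help: As the number of Selective Search proposals per image grows, mAP rises and then falls slightly. Replacing Selective Search with dense (sliding-window style) boxes, about 45k per image, gives clearly worse mAP.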

Which design choice gives the single largest accuracy improvement in Fast R-CNN?

Fine-tuning the convolutional layers: unfreezing conv layers improves mAP by 5.5 points (from 61.4% to 66.9%), which SPP-net could not do
Using more region proposals
Multi-scale testing

Chapter 8: Truncated SVD

Even after sharing conv computation, the fully connected layers still take significant time during detection. Why? Because there are many RoIs to process (~2000 per image), and each one passes through two 4096-d FC layers. The paper reports that FC layers account for nearly half the forward pass time.

Girshick applies truncated SVD to compress the FC layers. A weight matrix W of size u x v is approximately factorized as:

W ≈ U Σ_t V^T

Where U is u x t, Σ_t is t x t diagonal, and V is v x t. The single FC layer with uv parameters is replaced by two FC layers with t(u + v) parameters total — no non-linearity between them.

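For VGG-16's fc6 (25088 x 4096 = 102.8M parameters) and fc7 (4096 x 4096 = 16.8M parameters), truncated SVD with t = 1024 on both layers would reduce these to roughly 29.9M + 8.4M = 38.3M parameters, a 3.1x reduction. (The paper keeps the top 1024 singular values for fc6 and only the top 256 for fc7, which compresses fc7 further.)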
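
The parameter arithmetic above can be checked directly from the t(u + v) formula. A short sketch, with helper names invented for this example and biases ignored:

```python
def fc_params(u: int, v: int) -> int:
    """Weights in a single u x v fully connected layer."""
    return u * v


def svd_params(u: int, v: int, t: int) -> int:
    """Weights after factorizing into two layers of sizes t x v and u x t."""
    return t * (u + v)


fc6 = (25088, 4096)
fc7 = (4096, 4096)
t = 1024  # rank kept for both layers in this illustration

before = fc_params(*fc6) + fc_params(*fc7)
after = svd_params(*fc6, t) + svd_params(*fc7, t)
print(f"before: {before:,}")                 # before: 119,537,664
print(f"after:  {after:,}")                  # after:  38,273,024
print(f"reduction: {before / after:.1f}x")   # reduction: 3.1x
```

Because the saving depends on t(u + v) versus uv, compression pays off most when t is much smaller than both dimensions of the matrix.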

Truncated SVD: one large matrix becomes two smaller ones

Speed vs accuracy trade-off: Keeping the top 1024 singular values of fc6 and the top 256 of fc7 cuts detection time by more than 30% with only a 0.3 point mAP drop (66.9% → 66.6%), and requires no additional fine-tuning after compression. This brings test time from 0.32s to 0.22s per image, the headline 213x faster than R-CNN.

Why is truncated SVD particularly effective for Fast R-CNN's FC layers during detection?

With ~2000 RoIs per image, each passing through large FC layers, the FC computation takes a large share of detection time, so compressing them with SVD gives large wall-clock savings with minimal accuracy loss
SVD makes the network easier to train
SVD reduces the number of conv layer operations

Chapter 9: Connections

Looking backward

R-CNN (Girshick et al., 2014) — The predecessor. Established the "region proposals + CNN features" paradigm but was slow due to per-region CNN computation and multi-stage training. Fast R-CNN solves both problems.
SPP-net (He et al., 2014) — Introduced the idea of computing conv features once and extracting per-region features from the shared feature map via spatial pyramid pooling. Fast R-CNN's RoI pooling is a single-level version of SPP that enables end-to-end fine-tuning.
OverFeat (Sermanet et al., 2013) — Showed that multi-scale sliding window detection with shared conv features was viable. Used a regression-based approach rather than region proposals.

Looking forward

Faster R-CNN (Ren et al., 2015) — The natural next step. Fast R-CNN still depends on an external region proposal method (Selective Search), which runs on CPU and takes ~2 seconds per image — now the new bottleneck. Faster R-CNN replaces Selective Search with a Region Proposal Network (RPN) that shares conv features with the detector, making the entire pipeline end-to-end and nearly cost-free for proposals.
RoI Align (He et al., 2017, Mask R-CNN) — RoI pooling quantizes region boundaries to integer coordinates on the feature map, losing sub-pixel precision. RoI Align uses bilinear interpolation instead of snapping to grid cells, which is critical for pixel-level tasks like instance segmentation.
Feature Pyramid Networks (FPN) (Lin et al., 2017) — Instead of pooling from a single feature map, FPN builds a multi-scale feature pyramid. Each RoI is assigned to the pyramid level matching its scale, giving better features for small objects.
Deformable RoI Pooling (Dai et al., 2017) — Learns offsets for each pooling bin, so the pooling grid adapts to the object's shape rather than using a rigid rectangular grid.
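
The difference between RoI pooling's quantization and RoI Align's interpolation, mentioned in the list above, can be illustrated at a single sampling point. This is only a sketch of the core idea with invented function names: actual RoI Align samples several such points per bin and aggregates them, and the code assumes the point lies inside the feature map.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # one channel, indexed [row][col]


def quantized_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """RoI pooling style: snap the continuous location to the nearest cell."""
    return fmap[round(y)][round(x)]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """RoI Align style: blend the four surrounding cells by distance."""
    x0, y0 = math.floor(x), math.floor(y)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    y1 = min(y0 + 1, len(fmap) - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [[0.0, 10.0],
        [20.0, 30.0]]
# Image-space point (4, 6.4) with stride 16 lands at (0.25, 0.4) on the feature map.
print(quantized_sample(fmap, 0.25, 0.4))  # snaps to cell (0, 0)
print(bilinear_sample(fmap, 0.25, 0.4))   # weighted blend of all four cells
```

The quantized version returns the same value for every point that rounds to the same cell, which is the loss of sub-pixel precision that matters for pixel-level tasks.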

The R-CNN family tree: R-CNN (2014) → Fast R-CNN (2015) → Faster R-CNN (2015) → Mask R-CNN (2017). Fast R-CNN removes the redundant per-region computation, Faster R-CNN removes the external proposal step, and Mask R-CNN extends the detector to instance segmentation. Fast R-CNN is the pivotal middle step that proved end-to-end training and shared computation could work together.

After Fast R-CNN, what was the remaining speed bottleneck, and what solved it?

The external Selective Search proposal generation (~2s per image on CPU), which Faster R-CNN replaced with a Region Proposal Network (RPN) that shares conv features with the detector
The FC layers were still too slow, solved by depthwise convolutions
The RoI pooling layer, solved by average pooling