Girshick — Microsoft Research, 2015

Fast R-CNN

Process the entire image once through a CNN, then pool features per region — eliminating the massive redundant computation that made R-CNN impractical. 9x faster training, 213x faster inference, higher accuracy.

Prerequisites: CNNs + R-CNN pipeline + Object detection basics
10 Chapters  |  6 Simulations

Chapter 0: The Problem

R-CNN proved that deep CNN features could crush hand-crafted features for object detection. But it had a dirty secret: it was painfully slow. To detect objects in a single image, R-CNN extracts ~2000 region proposals, warps each one to 227x227, and runs each through the entire CNN independently. That's 2000 forward passes through AlexNet or VGG-16 per image.

With VGG-16, detection took 47 seconds per image on a GPU. That's not a typo. Nearly a minute to process one photograph. And training was even worse — a multi-stage pipeline that first fine-tuned the CNN, then trained SVMs on cached features, then trained bounding-box regressors. Features had to be written to disk, requiring hundreds of gigabytes of storage.

SPP-net (He et al., 2014) partially fixed the speed problem by computing a convolutional feature map once for the whole image and using spatial pyramid pooling to extract features per region. This sped up test time by 10-100x. But SPP-net still had the multi-stage training pipeline, still cached features to disk, and, critically, could not update the convolutional layers during fine-tuning. Back-propagation through the spatial pyramid pooling layer is possible in principle, but it is highly inefficient when every training RoI comes from a different image, so the conv layers were left frozen.

The wish list: We want a detector that (1) processes the image only once, (2) trains end-to-end in a single stage, (3) jointly learns classification AND bounding-box regression, (4) can update ALL network layers (including conv layers), and (5) needs no disk storage for cached features. Fast R-CNN delivers all five.
What is the fundamental reason R-CNN is slow at test time?

Chapter 1: The Key Insight

Fast R-CNN's insight is beautifully simple: process the entire image through the CNN once, producing a shared convolutional feature map. Then, for each region proposal, just "reach into" that shared feature map and pool out a fixed-size feature vector.

Think of it this way. R-CNN is like a restaurant where every customer gets their own personal chef who cooks their meal from scratch. Fast R-CNN is like a buffet — one kitchen prepares all the food, and each customer just takes a plate and fills it from the shared spread. Same food, vastly less cooking.

The mechanism that makes this possible is called RoI pooling (Region of Interest pooling). Given any rectangular region in the feature map — regardless of its size or aspect ratio — RoI pooling divides it into a fixed H x W grid and max-pools each cell, producing a feature vector of exactly the same dimensionality regardless of the input region's size.

Why this is profound: The convolutional layers are the expensive part, costing billions of multiply-accumulate operations per VGG-16 forward pass. The fully connected layers are cheap by comparison for a single region. By sharing the conv computation across all regions, Fast R-CNN reduces test time from 47 seconds to 0.32 seconds per image on VGG-16, a 146x speedup. RoI pooling and the FC layers still run once per region, but that per-region cost is small next to a full forward pass.

But RoI pooling enables something even more important than speed: end-to-end training. Because RoI pooling is differentiable, gradients can flow all the way from the classification and regression losses back through the RoI pooling layer into the convolutional layers. This means we can fine-tune the entire network — including conv layers — for the detection task, which SPP-net could not do.

What are the two key benefits of processing the image once and pooling features per region?

Chapter 2: The R-CNN Bottleneck

Let's quantify exactly how wasteful R-CNN is. Selective Search generates about 2000 region proposals per image. In R-CNN, every single one gets warped to 227x227 pixels and fed through the entire CNN — all 5 conv layers and 3 FC layers of AlexNet, or all 13 conv layers and 3 FC layers of VGG-16.

Here's the critical observation: most of these regions overlap heavily. Two proposals might share 80% of their pixels. Yet R-CNN computes conv features for both from scratch. The same pixels get convolved through the same filters thousands of times.


With VGG-16, suppose a single forward pass on a warped region takes ~23ms on a GPU. Multiply by 2000 regions: that's 46 seconds just for feature extraction, in line with the 47 seconds reported for R-CNN. Fast R-CNN runs the VGG-16 conv layers once on the full image (~180ms), then pools and classifies each region in roughly 0.01ms.

The numbers: R-CNN: 2000 x 23ms = 46,000ms. Fast R-CNN: 180ms + 2000 x 0.01ms = 200ms. Same proposals, but about 230x less computation by this rough estimate. The measured end-to-end time in the paper is 0.32 seconds per image (a 146x speedup over 47 seconds), mostly because the per-region FC cost is larger in practice (see Chapter 8).
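The back-of-envelope comparison above can be written as a tiny cost model. This is only a sketch: the three timing constants are the illustrative figures quoted in this chapter, not measurements, and the function names are made up for the example.

```python
# Back-of-envelope cost model using the illustrative timings from this chapter.
NUM_PROPOSALS = 2000      # Selective Search proposals per image
FULL_PASS_MS = 23.0       # one full VGG-16 forward pass on a warped region (illustrative)
SHARED_CONV_MS = 180.0    # conv layers on the whole image, run once (illustrative)
PER_ROI_MS = 0.01         # RoI pooling + FC layers for one region (illustrative)


def rcnn_cost_ms(num_proposals: int) -> float:
    """R-CNN: one full forward pass per proposal."""
    return num_proposals * FULL_PASS_MS


def fast_rcnn_cost_ms(num_proposals: int) -> float:
    """Fast R-CNN: one shared conv pass plus a small per-region cost."""
    return SHARED_CONV_MS + num_proposals * PER_ROI_MS


rcnn = rcnn_cost_ms(NUM_PROPOSALS)
fast = fast_rcnn_cost_ms(NUM_PROPOSALS)
print(f"R-CNN:      {rcnn:.0f} ms")
print(f"Fast R-CNN: {fast:.0f} ms")
print(f"Estimated speedup: {rcnn / fast:.0f}x")
```

Notice which term depends on the number of proposals: in R-CNN it is the expensive one, in Fast R-CNN it is the cheap one.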
Why does R-CNN waste so much computation?

Chapter 3: RoI Pooling

RoI pooling is the heart of Fast R-CNN. It solves a specific problem: region proposals come in all sizes and aspect ratios, but the fully connected layers that follow expect a fixed-size input. We need a way to convert any region — tall, wide, square, big, small — into a fixed H x W feature map.

Here's how it works, step by step:

  1. Project the region onto the feature map. If the original image region is at coordinates (x, y, w, h), divide by the total stride of the conv layers (16 for VGG-16) to get the corresponding location on the feature map.
  2. Divide into an H x W grid. Take the projected region on the feature map and split it into H x W sub-windows. For VGG-16, H = W = 7, so you get 49 sub-windows. The sub-windows may have slightly different sizes if the region dimensions aren't evenly divisible.
  3. Max-pool each cell. Within each sub-window, take the maximum activation value. This produces exactly one value per cell per channel.
  4. Output: H x W x C feature map. For VGG-16 with 512 channels and H=W=7, every region — regardless of its original size — becomes a 7 x 7 x 512 tensor, which gets flattened to a 25,088-dimensional vector and fed into the FC layers.
RoI pooling is a special case of spatial pyramid pooling with a single pyramid level. SPP-net used multiple pyramid levels (e.g., 1x1, 2x2, 4x4) concatenated together. Girshick found that a single level (7x7) is sufficient. A single level does not unlock training by itself: what makes back-propagation through the pooling layer practical is sampling many RoIs from the same image (Chapter 6), and that is what enables end-to-end training.
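The four steps above can be sketched in plain Python for a single channel. Treat this as a minimal illustration under a few assumptions: boxes are (x1, y1, x2, y2) with inclusive corners, projection uses floor rounding, and sub-window edges use floor for the start and ceil for the end. The helper names `project_roi` and `roi_max_pool` are hypothetical, and real implementations may round differently.

```python
import math
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive corners


def project_roi(box: Box, stride: int = 16) -> Box:
    """Step 1: map image coordinates to feature-map cells (floor rounding assumed)."""
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)


def roi_max_pool(fmap: List[List[float]], roi: Box, out_h: int, out_w: int) -> List[List[float]]:
    """Steps 2-4: split the projected RoI into an out_h x out_w grid and max-pool each cell."""
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for ph in range(out_h):
        # Floor for the start and ceil for the end keeps every sub-window
        # non-empty, even when the RoI is smaller than the output grid.
        r0 = y1 + math.floor(ph * roi_h / out_h)
        r1 = y1 + math.ceil((ph + 1) * roi_h / out_h)
        row = []
        for pw in range(out_w):
            c0 = x1 + math.floor(pw * roi_w / out_w)
            c1 = x1 + math.ceil((pw + 1) * roi_w / out_w)
            window = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy single-channel feature map: 6 rows x 8 columns, value = row * 8 + col.
fmap = [[float(r * 8 + c) for c in range(8)] for r in range(6)]

small = project_roi((16, 0, 95, 63))   # (1, 0, 5, 3): 5 cells wide, 4 tall
large = project_roi((0, 0, 127, 95))   # (0, 0, 7, 5): the whole feature map

print(roi_max_pool(fmap, small, 2, 2))  # [[11.0, 13.0], [27.0, 29.0]]
print(roi_max_pool(fmap, large, 2, 2))  # [[19.0, 23.0], [43.0, 47.0]]
```

Both RoIs come out as 2x2 grids even though one covers 20 cells and the other 48. With a real VGG-16 feature map you would do this per channel with a 7x7 grid.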

Backpropagation through RoI pooling

The gradient computation is straightforward. RoI pooling is just max pooling over irregular sub-windows. During the forward pass, we record which input position was the max in each sub-window (the "argmax switch"). During backprop, the gradient from each output cell routes back to its argmax input position. If an input position was the max for multiple RoIs (because regions overlap), its gradient is the sum of all contributing output gradients:

∂L/∂x_i = Σ_r Σ_j [i = i*(r, j)] · ∂L/∂y_{rj}

Here x_i is the i-th activation fed into the RoI pooling layer, y_{rj} is the j-th output for the r-th RoI, and i*(r, j) is the argmax index for that output. The Iverson bracket [i = i*(r, j)] is 1 only for the position that "won" the max pooling.
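To see the gradient routing in action, here is a deliberately simplified sketch that pools over a 1-D input instead of a 2-D feature map. The argmax bookkeeping and the summation over overlapping RoIs work the same way in 2-D. The function names and the toy numbers are assumptions made for the example.

```python
import math
from typing import List, Tuple


def roi_pool_forward_1d(x: List[float], rois: List[Tuple[int, int]], bins: int):
    """Max-pool each RoI (start, end-exclusive) into `bins` outputs.

    Returns the outputs y[r][j] and the argmax indices i*(r, j).
    """
    outputs, argmax = [], []
    for start, end in rois:
        length = end - start
        y_r, a_r = [], []
        for j in range(bins):
            lo = start + (j * length) // bins
            hi = start + math.ceil((j + 1) * length / bins)
            best = max(range(lo, hi), key=lambda i: x[i])  # the "argmax switch"
            y_r.append(x[best])
            a_r.append(best)
        outputs.append(y_r)
        argmax.append(a_r)
    return outputs, argmax


def roi_pool_backward_1d(n: int, argmax: List[List[int]], grad_y: List[List[float]]) -> List[float]:
    """Route each output gradient to its argmax input, summing over RoIs."""
    grad_x = [0.0] * n
    for r, a_r in enumerate(argmax):
        for j, i_star in enumerate(a_r):
            grad_x[i_star] += grad_y[r][j]
    return grad_x


x = [1.0, 5.0, 2.0, 4.0, 3.0, 0.0]
rois = [(0, 4), (1, 5)]                  # two overlapping regions
y, argmax = roi_pool_forward_1d(x, rois, bins=2)
grad_y = [[1.0, 2.0], [10.0, 20.0]]      # pretend upstream gradients
grad_x = roi_pool_backward_1d(len(x), argmax, grad_y)

print(y)        # [[5.0, 4.0], [5.0, 4.0]]
print(argmax)   # [[1, 3], [1, 3]]
print(grad_x)   # [0.0, 11.0, 0.0, 22.0, 0.0, 0.0]
```

Positions 1 and 3 win the max in both RoIs, so each receives the sum of two upstream gradients. Every other position gets zero.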

How does RoI pooling handle regions of different sizes?

Chapter 4: Multi-Task Loss

R-CNN trained in three separate stages: (1) fine-tune the CNN with softmax, (2) train SVM classifiers on cached features, (3) train bounding-box regressors on cached features. Each stage operated independently, and the features had to be stored on disk between stages. This was slow, inelegant, and — as Fast R-CNN shows — suboptimal.

Fast R-CNN replaces all three stages with a single multi-task loss that jointly trains classification and bounding-box regression end-to-end:

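L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

Let's unpack each piece:

  • p = (p_0, ..., p_K): the softmax probabilities over K object classes plus background (class 0).
  • u: the ground-truth class of the RoI.
  • t^u: the predicted bounding-box offsets (t_x, t_y, t_w, t_h) for class u.
  • v: the ground-truth bounding-box regression target.
  • L_cls(p, u) = −log p_u: the log loss for the true class.
  • [u ≥ 1]: an Iverson bracket that is 1 for object classes and 0 for background, so background RoIs contribute no localization loss.
  • L_loc(t^u, v): the sum of smooth L1 losses over the four box coordinates, defined below.
  • λ: balances the two tasks; the paper uses λ = 1.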

Smooth L1 Loss

For the bounding-box regression, Girshick introduces smooth L1 loss instead of the L2 (squared error) loss used in R-CNN:

smooth_L1(x) = { 0.5 x^2 if |x| < 1;  |x| − 0.5 otherwise }

Why not L2? Because the gradient of L2 loss grows linearly with the error. When the regression targets are unbounded, a few large errors can produce exploding gradients, and training needs careful learning-rate tuning to stay stable. Smooth L1 behaves like L2 near zero (good for small errors) but transitions to L1 for large errors (gradient magnitude capped at 1), which makes it less sensitive to outliers.
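Here is a small sketch of the loss for a single RoI. It assumes the definitions from the paper: L_cls is the log loss −log p_u, L_loc sums smooth L1 over the four box coordinates, and λ defaults to 1. The function names and the example numbers are made up for illustration.

```python
import math
from typing import Sequence


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def smooth_l1_grad(x: float) -> float:
    """Derivative of smooth L1: equals x near zero, clipped to +/-1 elsewhere."""
    if abs(x) < 1.0:
        return x
    return 1.0 if x > 0 else -1.0


def multi_task_loss(probs: Sequence[float], u: int,
                    t_u: Sequence[float], v: Sequence[float],
                    lam: float = 1.0) -> float:
    """L = L_cls + lam * [u >= 1] * L_loc for one RoI."""
    l_cls = -math.log(probs[u])
    # Background RoIs (u == 0) have no ground-truth box, so L_loc is skipped.
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc


probs = [0.1, 0.7, 0.2]            # background + 2 object classes
pred = [0.5, -0.2, 2.0, 0.0]       # predicted offsets for class u
target = [0.0, 0.0, 0.0, 0.0]      # regression target

print(f"foreground RoI: {multi_task_loss(probs, 1, pred, target):.3f}")  # 2.002
print(f"background RoI: {multi_task_loss(probs, 0, pred, target):.3f}")  # 2.303

# For an error of 10, the L2 gradient (d/dx of 0.5 x^2) would be 10;
# the smooth L1 gradient stays at 1.
print(smooth_l1_grad(10.0))  # 1.0
```

The last line is the whole argument for smooth L1 in one number: an outlier ten times larger does not push ten times harder.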

Smooth L1 vs L1 vs L2 — notice how Smooth L1 combines L2's smoothness near zero with L1's robustness to outliers
Why multi-task learning helps: Table 6 in the paper shows that jointly training classification and bbox regression (multi-task) gives 66.9% mAP with VGG-16, compared to 64.0% for stage-wise training, where the bbox regressors are trained separately after the classifier. Joint training gives better features because the conv layers learn representations useful for BOTH tasks. The gradient signals reinforce each other.
Why does Fast R-CNN use smooth L1 loss instead of L2 for bounding-box regression?

Chapter 5: Architecture

The Fast R-CNN architecture is elegant in its simplicity. Here's the full pipeline:

  1. Input: An entire image + a set of region proposals (from Selective Search).
  2. Convolutional backbone: The image passes through all conv layers of a pre-trained network (e.g., VGG-16's 13 conv layers + 4 max-pool layers), producing a feature map. For a 1000x600 input image with VGG-16, the feature map is roughly 62x37x512.
  3. RoI projection: Each region proposal's coordinates are divided by the total stride (16 for VGG-16) to find the corresponding rectangle on the feature map.
  4. RoI pooling: Each projected region is pooled into a 7x7x512 fixed-size tensor.
  5. FC layers: The 25,088-d vector goes through two fully connected layers (4096-d each with ReLU), producing a 4096-d feature vector per RoI.
  6. Two sibling heads:
    • Classification head: FC layer → (K+1)-way softmax → class probabilities
    • Regression head: FC layer → 4K outputs → per-class bounding-box offsets (tx, ty, tw, th)
Fast R-CNN architecture — shared conv features, RoI pooling, two-head output
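To make the tensor sizes in the pipeline concrete, here is a rough shape walk-through. It is a sketch only: it approximates the feature-map size by integer division by the stride (the exact size depends on padding and rounding in the pooling layers), and it assumes K = 20 classes as in PASCAL VOC. The function name and dictionary keys are invented for the example.

```python
def fast_rcnn_shapes(img_w: int, img_h: int, num_classes: int,
                     stride: int = 16, channels: int = 512,
                     pool: int = 7, fc_dim: int = 4096) -> dict:
    """Approximate tensor sizes at each stage of the VGG-16 pipeline."""
    return {
        "feature_map": (img_w // stride, img_h // stride, channels),
        "roi_pooled": (pool, pool, channels),
        "fc_input": pool * pool * channels,   # flattened RoI feature
        "fc_output": fc_dim,                  # after the two FC layers
        "cls_outputs": num_classes + 1,       # K classes + background
        "bbox_outputs": 4 * num_classes,      # (tx, ty, tw, th) per class
    }


shapes = fast_rcnn_shapes(1000, 600, num_classes=20)
for name, value in shapes.items():
    print(f"{name:13s} {value}")
```

For a 1000x600 image this gives a 62x37x512 feature map, a 25,088-d vector per RoI, 21 class scores, and 80 box outputs.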

Initialization from pre-trained networks

Three transformations convert a classification network into a Fast R-CNN detector:

  1. The last max pooling layer is replaced by an RoI pooling layer (with H=W=7 to match the first FC layer's expected input).
  2. The final FC layer and 1000-way softmax are replaced by the two sibling heads (K+1 classifier + 4K regressor).
  3. The network is modified to accept two inputs: a list of images and a list of RoIs.
Which layers to fine-tune? The paper experiments with fine-tuning different numbers of conv layers. For VGG-16, it fine-tunes from conv3_1 onward (9 of the 13 conv layers), which gives 66.9% mAP. Fine-tuning only the FC layers (as SPP-net was forced to do) drops mAP to 61.4%, a loss of 5.5 points. Starting earlier, at conv2_1, gains only 0.3 points (67.2%) while slowing training by about 1.3x. The earliest layers learn generic features (edges, textures) that are useful as-is.
Why does Fast R-CNN use per-class bounding-box regressors (4K outputs) rather than class-agnostic regressors (4 outputs)?

Chapter 6: Training

Fast R-CNN's training strategy has a clever trick that makes it far more efficient than R-CNN or SPP-net: hierarchical mini-batch sampling.

The sampling problem

In R-CNN and SPP-net, each training example (RoI) came from a different image. This was terrible for efficiency — if you sample 128 RoIs from 128 different images, you effectively run the conv layers 128 times per mini-batch. No computation sharing.

Hierarchical sampling

Fast R-CNN samples mini-batches hierarchically:

  1. Sample N = 2 images per mini-batch.
  2. Sample R/N = 64 RoIs per image, for a total of R = 128 RoIs per mini-batch.

Since all 64 RoIs from the same image share the same conv feature map, the forward and backward passes share computation. This makes training roughly 64x faster than sampling one RoI from each of 128 images.

Positive/negative sampling

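Of the 64 RoIs per image:

  • 25% (16 RoIs) are positives: proposals whose IoU with a ground-truth box is at least 0.5. They are labeled with that object's class (u ≥ 1).
  • 75% (48 RoIs) are negatives: proposals whose maximum IoU with ground truth falls in [0.1, 0.5). They are labeled as background (u = 0).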

Hierarchical sampling: 2 images, 64 RoIs each — positive (green) vs negative (red)
Doesn't correlated sampling cause problems? You might worry that RoIs from the same image are correlated, which could slow convergence or cause overfitting. The paper addresses this directly: "This concern does not appear to be a practical issue and we achieve good results with N=2 and R=128 using fewer SGD iterations than R-CNN." The efficiency gains vastly outweigh any correlation effects.
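The sampling scheme can be sketched in a few lines. This is an illustration, not the paper's code: it follows the paper's rule of 25% foreground RoIs (IoU ≥ 0.5) and background RoIs with IoU in [0.1, 0.5), but the proposal pool is synthetic random data and every name here is hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Proposal = Tuple[int, float]  # (box_id, max IoU with any ground-truth box)


def sample_minibatch(proposals_by_image: Dict[str, List[Proposal]],
                     n_images: int = 2, rois_per_batch: int = 128,
                     fg_fraction: float = 0.25,
                     rng: Optional[random.Random] = None):
    """Pick N images first, then R/N RoIs from each of them."""
    if rng is None:
        rng = random.Random(0)
    rois_per_image = rois_per_batch // n_images
    fg_quota = round(fg_fraction * rois_per_image)
    batch = []
    for image_id in rng.sample(sorted(proposals_by_image), n_images):
        props = proposals_by_image[image_id]
        fg = [p for p in props if p[1] >= 0.5]
        bg = [p for p in props if 0.1 <= p[1] < 0.5]
        n_fg = min(fg_quota, len(fg))
        n_bg = min(rois_per_image - n_fg, len(bg))
        chosen = rng.sample(fg, n_fg) + rng.sample(bg, n_bg)
        batch.extend((image_id, box_id, iou >= 0.5) for box_id, iou in chosen)
    return batch


# Synthetic pool: 5 images with 2000 proposals each and random IoUs.
rng = random.Random(0)
proposals = {f"img{k}": [(i, rng.random()) for i in range(2000)] for k in range(5)}

batch = sample_minibatch(proposals, rng=rng)
images_used = {image_id for image_id, _, _ in batch}
num_fg = sum(1 for _, _, is_fg in batch if is_fg)
print(len(batch), len(images_used), num_fg)  # 128 2 32

# Conv passes needed per mini-batch: one per distinct image.
print(f"speedup vs 128 separate images: {128 // len(images_used)}x")  # 64x
```

The second print is the key point: the number of conv passes per mini-batch is the number of distinct images, not the number of RoIs.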

Training details

Why does Fast R-CNN sample only 2 images per mini-batch (with 64 RoIs each) instead of 128 images (with 1 RoI each)?

Chapter 7: Results

Fast R-CNN delivers on every dimension — accuracy, training speed, and test speed — compared to both R-CNN and SPP-net.

Detection accuracy (PASCAL VOC)

On VOC 2007 test set:

mAP comparison on PASCAL VOC 2007 test set

Speed comparison (VGG-16)

Training: R-CNN takes 84 hours. Fast R-CNN takes 9.5 hours. That's 9x faster training.
Test time: R-CNN takes 47 seconds per image. Fast R-CNN takes 0.32 seconds with the full FC layers or 0.22 seconds with truncated SVD. That's 146x to 213x faster inference.
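As a quick sanity check, the headline speedups follow directly from the timings quoted above. The variable names are made up; the ratios are truncated rather than rounded to match the figures as quoted.

```python
# Timings quoted above for VGG-16.
rcnn_train_h, fast_train_h = 84.0, 9.5
rcnn_test_s, fast_test_s, fast_svd_test_s = 47.0, 0.32, 0.22

train_speedup = rcnn_train_h / fast_train_h
test_speedup = rcnn_test_s / fast_test_s
test_speedup_svd = rcnn_test_s / fast_svd_test_s

print(f"training:           {train_speedup:.1f}x")      # 8.8x
print(f"test (full FC):     {int(test_speedup)}x")      # 146x
print(f"test (trunc. SVD):  {int(test_speedup_svd)}x")  # 213x
```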

Key ablation findings

Which design choice gives the single largest accuracy improvement in Fast R-CNN?

Chapter 8: Truncated SVD

Even after sharing conv computation, the fully connected layers still take significant time during detection. Why? Because there are many RoIs to process (~2000 per image), and each one passes through two 4096-d FC layers. The paper reports that FC layers account for nearly half the forward pass time.

Girshick applies truncated SVD to compress the FC layers. A weight matrix W of size u x v is approximately factorized as:

W ≈ U Σ_t V^T

where U is u x t, Σ_t is t x t diagonal, and V is v x t. The single FC layer with uv parameters is replaced by two FC layers with t(u + v) parameters total, with no non-linearity between them.

For VGG-16's fc6 (25088 x 4096 ≈ 102.8M parameters) and fc7 (4096 x 4096 ≈ 16.8M parameters), applying truncated SVD with t = 1024 to both layers would leave roughly 29.9M + 8.4M = 38.3M parameters, a 3.1x reduction.
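The parameter arithmetic, and the idea of swapping one FC layer for two, can be checked with a short sketch. It does not compute an actual SVD. The tiny matrices below are hand-picked rank-1 factors, chosen only to show that applying Σ_t V^T and then U gives the same output as applying W directly. All names are assumptions made for the example.

```python
from typing import List


def full_params(u: int, v: int) -> int:
    """Weights in a single u x v FC layer."""
    return u * v


def svd_params(u: int, v: int, t: int) -> int:
    """Weights in the two-layer replacement: t x v followed by u x t."""
    return t * (u + v)


layers = {"fc6": (25088, 4096), "fc7": (4096, 4096)}
t = 1024
before = sum(full_params(u, v) for u, v in layers.values())
after = sum(svd_params(u, v, t) for u, v in layers.values())
print(f"{before / 1e6:.1f}M -> {after / 1e6:.1f}M ({before / after:.1f}x)")
# 119.5M -> 38.3M (3.1x)


def matvec(m: List[List[float]], x: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


# Hand-picked rank-1 factors (u = 3, v = 2, t = 1), not a computed SVD.
U = [[1.0], [2.0], [3.0]]        # second layer: u x t
SVt = [[2.0, -1.0]]              # first layer: (Sigma_t V^T), t x v
W = [[U[i][0] * SVt[0][j] for j in range(2)] for i in range(3)]

x = [1.0, 3.0]
one_layer = matvec(W, x)
two_layers = matvec(U, matvec(SVt, x))
print(one_layer, two_layers)     # [-1.0, -2.0, -3.0] [-1.0, -2.0, -3.0]
```

In this toy case the factorization is exact because W really is rank 1. For a trained FC layer, truncation discards the smaller singular values, so the two-layer output is only an approximation.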

Truncated SVD: one large matrix becomes two smaller ones
Speed vs accuracy trade-off: Keeping the top 1024 singular values of fc6 and the top 256 of fc7, the paper cuts detection time by more than 30% with only a 0.3-point mAP drop (66.9% → 66.6%). The FC layers themselves become roughly 70% cheaper, which brings test time from 0.32s to 0.22s per image: the headline 213x faster than R-CNN.
Why is truncated SVD particularly effective for Fast R-CNN's FC layers during detection?

Chapter 9: Connections

Looking backward

Looking forward

The R-CNN family tree: R-CNN (2014) → Fast R-CNN (2015) → Faster R-CNN (2015) → Mask R-CNN (2017). Fast R-CNN removes the redundant per-region computation, Faster R-CNN removes the dependence on external proposals, and Mask R-CNN extends the detector to instance segmentation. Fast R-CNN is the pivotal middle step that proved end-to-end training and shared computation could work together.
After Fast R-CNN, what was the remaining speed bottleneck, and what solved it?
Girshick, R. "Fast R-CNN." ICCV, 2015.  |  arXiv: 1504.08083