Process the entire image once through a CNN, then pool features per region — eliminating the massive redundant computation that made R-CNN impractical. 9x faster training, 213x faster inference, higher accuracy.
R-CNN proved that deep CNN features could crush hand-crafted features for object detection. But it had a dirty secret: it was painfully slow. To detect objects in a single image, R-CNN extracts ~2000 region proposals, warps each one to 227x227, and runs each through the entire CNN independently. That's 2000 forward passes through AlexNet or VGG-16 per image.
With VGG-16, detection took 47 seconds per image on a GPU. That's not a typo. Nearly a minute to process one photograph. And training was even worse — a multi-stage pipeline that first fine-tuned the CNN, then trained SVMs on cached features, then trained bounding-box regressors. Features had to be written to disk, requiring hundreds of gigabytes of storage.
SPP-net (He et al., 2014) partially fixed the speed problem by computing a convolutional feature map once for the whole image and using spatial pyramid pooling to extract features per region. This sped up test time by 10-100x. But SPP-net still had the multi-stage training pipeline, still cached features to disk, and, critically, its fine-tuning procedure could not update the convolutional layers below the spatial pyramid pooling layer. The pooling layer itself does not block gradients. The problem, as the Fast R-CNN paper explains, is that backpropagating through it is highly inefficient when each training RoI comes from a different image, because a single RoI's receptive field can span most of the input image.
Fast R-CNN's insight is beautifully simple: process the entire image through the CNN once, producing a shared convolutional feature map. Then, for each region proposal, just "reach into" that shared feature map and pool out a fixed-size feature vector.
Think of it this way. R-CNN is like a restaurant where every customer gets their own personal chef who cooks their meal from scratch. Fast R-CNN is like a buffet — one kitchen prepares all the food, and each customer just takes a plate and fills it from the shared spread. Same food, vastly less cooking.
The mechanism that makes this possible is called RoI pooling (Region of Interest pooling). Given any rectangular region in the feature map — regardless of its size or aspect ratio — RoI pooling divides it into a fixed H x W grid and max-pools each cell, producing a feature vector of exactly the same dimensionality regardless of the input region's size.
But RoI pooling enables something even more important than speed: end-to-end training. Because RoI pooling is differentiable, gradients can flow all the way from the classification and regression losses back through the RoI pooling layer into the convolutional layers. This means we can fine-tune the entire network — including conv layers — for the detection task, which SPP-net could not do.
Let's quantify exactly how wasteful R-CNN is. Selective Search generates about 2000 region proposals per image. In R-CNN, every single one gets warped to 227x227 pixels and fed through the entire CNN — all 5 conv layers and 3 FC layers of AlexNet, or all 13 conv layers and 3 FC layers of VGG-16.
Here's the critical observation: most of these regions overlap heavily. Two proposals might share 80% of their pixels. Yet R-CNN computes conv features for both from scratch. The same pixels get convolved through the same filters thousands of times.
With VGG-16, a single forward pass takes ~23ms on a GPU. Multiply by 2000 regions: that's 46 seconds just for feature extraction. Fast R-CNN runs VGG-16 once (~180ms for the conv layers), then pools and classifies each region in roughly 0.07ms (~140ms for all 2000). Total: ~0.32 seconds per image, against 47 seconds for R-CNN. That's a 146x speedup.
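The arithmetic is easy to check. A small sketch using only figures quoted in this article; the constant and function names are made up for illustration:

```python
# Back-of-the-envelope check of the timing figures quoted in the text.
NUM_PROPOSALS = 2000        # Selective Search proposals per image
SECONDS_PER_PASS = 0.023    # ~23 ms per VGG-16 forward pass
RCNN_SECONDS = 47.0         # reported R-CNN detection time per image
FAST_RCNN_SECONDS = 0.32    # reported Fast R-CNN detection time per image


def rcnn_feature_seconds(num_proposals, seconds_per_pass):
    """Time R-CNN spends running one CNN forward pass per proposal."""
    return num_proposals * seconds_per_pass


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    print(round(rcnn_feature_seconds(NUM_PROPOSALS, SECONDS_PER_PASS), 1))
    print(round(speedup(RCNN_SECONDS, FAST_RCNN_SECONDS), 1))
```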
RoI pooling is the heart of Fast R-CNN. It solves a specific problem: region proposals come in all sizes and aspect ratios, but the fully connected layers that follow expect a fixed-size input. We need a way to convert any region — tall, wide, square, big, small — into a fixed H x W feature map.
Here's how it works, step by step:
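In outline: project the proposal onto the feature map (divide its pixel coordinates by the network stride, 16 for VGG-16's last conv layer), split the projected window into an H x W grid of roughly equal sub-windows, and take the max inside each sub-window, independently per channel. Below is a minimal single-channel sketch in plain Python, assuming the RoI is already in feature-map coordinates. The function name, the inclusive (r0, c0, r1, c1) box format, and the floor/ceil bin edges are choices made for this illustration, not details taken from the paper.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature_map: list of equal-length rows of numbers.
    roi: (r0, c0, r1, c1) in feature-map cells, inclusive on both ends.
    Returns (pooled, argmax), where argmax[i][j] is the (row, col) of the
    input cell that produced pooled[i][j].
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0 + 1, c1 - c0 + 1
    pooled, argmax = [], []
    for i in range(out_h):
        # Floor the start and ceil the end so no bin is ever empty.
        row_start = r0 + (i * h) // out_h
        row_end = r0 + -(-((i + 1) * h) // out_h)
        pooled_row, argmax_row = [], []
        for j in range(out_w):
            col_start = c0 + (j * w) // out_w
            col_end = c0 + -(-((j + 1) * w) // out_w)
            # Pair each value with its position so max() also yields the argmax.
            best = max(
                (feature_map[r][c], (r, c))
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            )
            pooled_row.append(best[0])
            argmax_row.append(best[1])
        pooled.append(pooled_row)
        argmax.append(argmax_row)
    return pooled, argmax


# A 4 x 6 toy feature map pooled to a fixed 2 x 2 output.
toy_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
pooled, argmax = roi_max_pool(toy_map, (0, 0, 3, 5), 2, 2)
print(pooled)  # [[9, 12], [21, 24]]
```

A 3 x 3 RoI pooled to 2 x 2 produces overlapping bins under this scheme, which is harmless for max pooling. The recorded argmax positions are what the backward pass needs.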
The gradient computation is straightforward. RoI pooling is just max pooling over irregular sub-windows. During the forward pass, we record which input position was the max in each sub-window (the "argmax switch"). During backprop, the gradient from each output cell routes back to its argmax input position. If an input position was the max for multiple RoIs (because regions overlap), its gradient is the sum of all contributing output gradients:
$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \left[ i = i^*(r, j) \right] \frac{\partial L}{\partial y_{rj}}$$
Where x_i is the i-th input activation to the RoI pooling layer, y_rj is the j-th output of the r-th RoI, and i*(r,j) is the argmax index for that output. The Iverson bracket [i = i*(r,j)] is 1 only for the position that "won" the max pooling, and 0 otherwise.
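The argmax routing can be sketched in a few lines of plain Python. This is an illustration for a single channel, not the paper's implementation; the function name and the nested-list layout of the argmax switches and output gradients are assumptions made here.

```python
def roi_pool_backward(map_shape, argmaxes, grad_outputs):
    """Route output gradients back to the feature map via argmax switches.

    map_shape: (rows, cols) of the feature map.
    argmaxes: one grid per RoI; argmaxes[r][i][j] is the (row, col) that
        won the max for output cell (i, j) of RoI r.
    grad_outputs: same layout, holding dL/dy for each output cell.
    Returns dL/dx as a list of rows.
    """
    rows, cols = map_shape
    grad_input = [[0.0] * cols for _ in range(rows)]
    for argmax_grid, grad_grid in zip(argmaxes, grad_outputs):   # sum over r
        for argmax_row, grad_row in zip(argmax_grid, grad_grid):
            for (row, col), grad in zip(argmax_row, grad_row):   # sum over j
                # Only the winning position receives this output's gradient.
                grad_input[row][col] += grad
    return grad_input


# Two overlapping RoIs, each with a 1 x 1 output, whose max is the same cell.
grads = roi_pool_backward(
    (3, 3),
    argmaxes=[[[(1, 1)]], [[(1, 1)]]],
    grad_outputs=[[[0.5]], [[2.0]]],
)
print(grads[1][1])  # 2.5
```

Positions that never won a max keep a zero gradient, and a position that won for several RoIs accumulates all of their gradients.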
R-CNN trained in three separate stages: (1) fine-tune the CNN with softmax, (2) train SVM classifiers on cached features, (3) train bounding-box regressors on cached features. Each stage operated independently, and the features had to be stored on disk between stages. This was slow, inelegant, and — as Fast R-CNN shows — suboptimal.
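Fast R-CNN replaces all three stages with a single multi-task loss that jointly trains classification and bounding-box regression end-to-end:
$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^u, v)$$
Let's unpack each piece:
p = (p_0, ..., p_K) is the softmax distribution over K object classes plus background (class 0).
u is the ground-truth class of the RoI, and L_cls(p, u) = -log p_u is the log loss for that class.
t^u = (t_x, t_y, t_w, t_h) is the predicted bounding-box offset for class u, and v is the ground-truth regression target.
[u >= 1] is an Iverson bracket: 1 for object classes and 0 for background, so background RoIs contribute no localization loss.
λ balances the two terms; the paper uses λ = 1.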
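For the bounding-box regression, Girshick introduces smooth L1 loss instead of the L2 (squared error) loss used in R-CNN:
$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
Here t^u holds the predicted box offsets for the true class u, and v holds the regression targets.
Why not L2? Because the regression targets are unbounded and the L2 gradient grows linearly with the error, so a few badly mislocalized RoIs can produce huge gradients and make training diverge unless the learning rate is tuned carefully. Smooth L1 behaves like L2 near zero (good for small errors) but transitions to L1 for large errors (gradient magnitude capped at 1). It's the best of both worlds.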
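To make the gradient-capping argument concrete, here is a small sketch in plain Python. It assumes the loss has the form used in the Fast R-CNN paper: log loss on the true class plus a λ-weighted smooth L1 term on the four box offsets, applied only to non-background RoIs (class 0 is background). The function names and argument layout are illustrative, not taken from any library.

```python
import math


def smooth_l1(x):
    """Quadratic for |x| < 1, linear beyond that."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def smooth_l1_grad(x):
    """Derivative of smooth_l1: equals x near zero, clipped to +/-1 elsewhere."""
    if abs(x) < 1.0:
        return x
    return 1.0 if x > 0 else -1.0


def l2_grad(x):
    """Derivative of the squared-error loss 0.5 * x**2 (grows without bound)."""
    return x


def multitask_loss(probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Classification log loss plus localization loss for one RoI."""
    cls_loss = -math.log(probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so no localization term.
        return cls_loss
    loc_loss = sum(
        smooth_l1(t - v) for t, v in zip(pred_offsets, target_offsets)
    )
    return cls_loss + lam * loc_loss


# An outlier error of 10 gives a gradient of 10 under L2 but only 1 here.
print(l2_grad(10.0), smooth_l1_grad(10.0))  # 10.0 1.0
```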
The Fast R-CNN architecture is elegant in its simplicity. Here's the full pipeline:
Three transformations convert a classification network into a Fast R-CNN detector:
Fast R-CNN's training strategy has a clever trick that makes it far more efficient than R-CNN or SPP-net: hierarchical mini-batch sampling.
In R-CNN and SPP-net, each training example (RoI) came from a different image. This was terrible for efficiency — if you sample 128 RoIs from 128 different images, you effectively run the conv layers 128 times per mini-batch. No computation sharing.
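Fast R-CNN samples mini-batches hierarchically: first it samples N images, then R/N RoIs from each image. With N = 2 and R = 128, each mini-batch holds 64 RoIs from each of 2 images.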
Since all 64 RoIs from the same image share the same conv feature map, the forward and backward passes share computation. This makes training roughly 64x faster than sampling one RoI from each of 128 images.
Of the 64 RoIs per image:
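In the paper's setup, a quarter of each image's RoIs are foreground examples (proposals whose IoU with a ground-truth box is at least 0.5) and the rest are background (proposals whose maximum IoU falls in [0.1, 0.5)). A minimal sketch of that sampling rule follows. The function name, the (roi_id, max_iou) input format, and the handling of images with too few candidates are assumptions for illustration.

```python
import random


def sample_rois(rois_with_iou, rois_per_image=64, fg_fraction=0.25,
                fg_thresh=0.5, bg_low=0.1, rng=random):
    """Pick foreground and background RoIs for one image.

    rois_with_iou: list of (roi_id, max_iou_with_any_ground_truth_box).
    Returns (foreground_ids, background_ids).
    """
    foreground = [r for r, iou in rois_with_iou if iou >= fg_thresh]
    background = [r for r, iou in rois_with_iou if bg_low <= iou < fg_thresh]
    # Cap each group at what is actually available in this image.
    num_fg = min(int(rois_per_image * fg_fraction), len(foreground))
    num_bg = min(rois_per_image - num_fg, len(background))
    return rng.sample(foreground, num_fg), rng.sample(background, num_bg)


# 2000 fake proposals with IoUs cycling through 0.00, 0.01, ..., 0.99.
proposals = [(i, (i % 100) / 100) for i in range(2000)]
fg_ids, bg_ids = sample_rois(proposals, rng=random.Random(0))
print(len(fg_ids), len(bg_ids))  # 16 48
```

Two such images per mini-batch give the 128 RoIs used for each SGD step.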
Fast R-CNN delivers on every dimension — accuracy, training speed, and test speed — compared to both R-CNN and SPP-net.
On VOC 2007 test set:
Even after sharing conv computation, the fully connected layers still take significant time during detection. Why? Because there are many RoIs to process (~2000 per image), and each one passes through two 4096-d FC layers. The paper reports that FC layers account for nearly half the forward pass time.
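Girshick applies truncated SVD to compress the FC layers. A weight matrix W of size u x v is approximately factorized as:
$$W \approx U \Sigma_t V^T$$
Where U is u x t (the first t left-singular vectors of W), Σt is a t x t diagonal matrix holding the top t singular values, and V is v x t (the first t right-singular vectors). The single FC layer with uv parameters is replaced by two FC layers with t(u + v) parameters total, with no non-linearity between them: the first uses the weight matrix Σt V^T and the second uses U.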
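For VGG-16's fc6 (25088 x 4096 = 102.8M parameters) and fc7 (4096 x 4096 = 16.8M parameters), truncated SVD with t = 1024 reduces these to roughly 29.9M + 8.4M = 38.3M parameters, a 3.1x reduction. (The paper itself keeps the top 1024 singular values for fc6 but only the top 256 for fc7, which shrinks fc7 further.)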
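The parameter counts follow directly from the uv versus t(u + v) formulas. A quick check in plain Python, with helper names invented for this example and biases ignored:

```python
def fc_params(u, v):
    """Weights in a dense layer with a u x v weight matrix."""
    return u * v


def svd_params(u, v, t):
    """Weights in the two stacked layers that replace it: t*v plus u*t."""
    return t * (u + v)


fc6 = fc_params(25088, 4096)
fc7 = fc_params(4096, 4096)
fc6_svd = svd_params(25088, 4096, 1024)
fc7_svd = svd_params(4096, 4096, 1024)

reduction = (fc6 + fc7) / (fc6_svd + fc7_svd)
print(fc6, fc7)                 # 102760448 16777216
print(fc6_svd, fc7_svd)         # 29884416 8388608
print(round(reduction, 1))      # 3.1
```

The saving grows as t shrinks, since the compressed size is linear in t.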