Replace the 2-second selective search bottleneck with a learned network that proposes regions in 10ms — sharing convolutional features with the detector for near real-time object detection.
By 2015, object detection had a speed problem — but not where you'd expect. Fast R-CNN could classify proposed regions at near real-time speed. The actual bottleneck was finding those regions in the first place.
Selective Search, the dominant region proposal method, works like this: segment the image into superpixels, then greedily merge them based on hand-crafted features (color, texture, size). It produces ~2,000 candidate boxes per image. The problem? It takes about 2 seconds per image on CPU — orders of magnitude slower than the neural network that follows it.
Think about what that means. You've built this brilliant deep neural network that can classify regions in milliseconds. But before it ever runs, you're waiting 2 seconds for a hand-engineered algorithm from 2012 to finish grinding through superpixels. The CNN sits idle, twiddling its thumbs.
Even EdgeBoxes, a faster alternative, still takes ~200ms per image. And both methods share a deeper problem: they're completely separate from the detection network. The proposal algorithm doesn't benefit from the learned features, and the detector can't influence what gets proposed.
Compare where time is spent in the old pipeline vs. Faster R-CNN.
Here's the observation that makes Faster R-CNN possible: the convolutional feature maps that Fast R-CNN already computes for detection contain all the information needed to propose regions too.
Think about it. A VGG-16 backbone produces a rich feature map from the input image — 512 channels encoding edges, textures, parts, and objects at every spatial location. If a human can look at these features and say "there's probably an object here," a small neural network can learn to do the same.
So instead of running a separate algorithm (selective search) on the raw pixels, why not add a small network on top of the same conv features the detector already uses? This is the Region Proposal Network (RPN) — a lightweight fully convolutional network that predicts object proposals directly from the shared feature map.
The paper describes the RPN as an "attention" mechanism: it tells the detector where to look. The RPN and detector share a backbone, forming a single unified network. The marginal cost of computing proposals? Just a few extra convolutional layers — about 10ms per image.
The Region Proposal Network is beautifully simple. It slides a small 3×3 convolutional window over the shared feature map. At every spatial position, this window looks at a 3×3 patch of the feature map and asks two questions: is there an object here, and if so, how should a reference box be adjusted to fit it?
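Concretely, the 3×3 conv produces a 256-d (ZFNet) or 512-d (VGG-16) feature vector at each position. This vector feeds into two sibling 1×1 convolutional layers: a classification layer that outputs 2k objectness scores (object vs. background for each of k reference boxes) and a regression layer that outputs 4k box deltas.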
Because the 3×3 conv slides across the entire feature map, the same weights are shared at every position. This is just a fully convolutional network — no fully-connected layers, no position-specific parameters. The RPN can handle images of any size.
The 3×3 window slides over the feature map. At each position it produces objectness scores and box deltas for k=9 anchors.
Here's the problem the RPN faces: objects come in wildly different sizes and shapes. A person is tall and narrow. A bus is wide and squat. A ball is small and square. How can a single 3×3 sliding window detect all of them?
The answer is anchors — a set of k reference boxes centered at each sliding window position. Each anchor has a predefined scale and aspect ratio. The RPN doesn't predict boxes from scratch. Instead, it predicts how to adjust each anchor to better fit the actual object.
The default configuration uses 3 scales (128², 256², 512² pixels) × 3 aspect ratios (1:1, 1:2, 2:1) = 9 anchors per position. On a typical feature map of ~40×60, that's roughly 20,000 anchors covering the entire image at multiple scales.
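The anchor count above can be checked with a short sketch. It assumes a stride of 16 between feature-map cells, anchors centered in the middle of each cell, and `ratio` meaning height divided by width; the function names are illustrative, not taken from the paper's code.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2; `ratio` is height / width (an assumption
    of this sketch).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def all_anchors(feat_h, feat_w, stride=16):
    """Tile the anchor set over every feature-map cell, in image pixels."""
    boxes = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx = j * stride + stride / 2  # assumed cell-center convention
            cy = i * stride + stride / 2
            boxes.extend(make_anchors(cx, cy))
    return boxes

anchors = all_anchors(40, 60)
print(len(anchors))  # 40 * 60 * 9 = 21600
```

Many of these boxes extend past the image border, which is expected: the anchors are only references that the regression branch adjusts.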
The box parameterization is:
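Following the paper's formulation, the offsets can be written as below. Starred quantities denote the matched ground-truth box, notation that this text does not otherwise introduce.

```latex
\begin{aligned}
t_x &= (x - x_a)/w_a,       & t_y &= (y - y_a)/h_a, \\
t_w &= \log(w/w_a),         & t_h &= \log(h/h_a), \\
t_x^* &= (x^* - x_a)/w_a,   & t_y^* &= (y^* - y_a)/h_a, \\
t_w^* &= \log(w^*/w_a),     & t_h^* &= \log(h^*/h_a)
\end{aligned}
```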
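Here (x, y, w, h) is the predicted box (center coordinates, width, height) and (x_a, y_a, w_a, h_a) is the anchor. Predicting log-ratios guarantees that decoded widths and heights are positive, and normalizing by the anchor's size makes the targets scale-invariant.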
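A minimal sketch of the encode/decode round trip, assuming the center/size parameterization from the paper (center offsets normalized by anchor size, log-ratios for width and height). The function names and example boxes are illustrative.

```python
import math

def encode(box, anchor):
    """Turn a box into deltas relative to an anchor. Both are (cx, cy, w, h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(deltas, anchor):
    """Apply predicted deltas to an anchor to recover a box (cx, cy, w, h)."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)   # made-up 128x128 anchor
gt = (110.0, 90.0, 160.0, 96.0)         # made-up ground-truth box
deltas = encode(gt, anchor)
recovered = decode(deltas, anchor)
print(deltas)
print(recovered)
```

Because widths and heights pass through `exp`, any real-valued prediction decodes to a positive size.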
Faster R-CNN is trained with two losses — one for the RPN and one for the Fast R-CNN detection head. Each loss itself has two terms: classification and regression.
For the RPN, each anchor gets a binary label: positive (object) or negative (background). An anchor is positive if it either (a) has the highest IoU with some ground-truth box among all anchors, or (b) has IoU > 0.7 with any ground-truth box. Anchors whose IoU is below 0.3 for every ground-truth box are negative. Everything in between is ignored and contributes nothing to the loss.
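With these labels, the RPN loss for one image, as given in the paper, is written below. N_cls, N_reg, and λ are the paper's normalization terms and balancing weight, which this text does not otherwise define.

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```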
In the RPN loss, p_i is the predicted objectness for anchor i, p_i* is its ground-truth label (1 for positive, 0 for negative), t_i is the predicted box delta, and t_i* is the target box delta. The factor p_i* in front of L_reg means we only regress boxes for positive anchors: there's no point refining an anchor that's labeled background.
The Fast R-CNN detection head has its own multi-task loss: multi-class classification (not just object/background, but which class) plus bounding box regression for each class. This is identical to the original Fast R-CNN paper.
Each mini-batch comes from a single image. The paper samples 256 anchors per image with a 1:1 ratio of positives to negatives. If there are fewer than 128 positive anchors (common), the rest are filled with negatives.
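The labeling rule and the sampling scheme can be sketched together in plain Python. This is an illustration under simplifying assumptions: boxes are (x1, y1, x2, y2) tuples, at least one ground-truth box is present, and the names `label_anchors` and `sample_minibatch` are hypothetical.

```python
import random

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        labels.append(1 if best > hi else 0 if best < lo else -1)
    # Rule (a): the best-overlapping anchor(s) for each ground-truth box are
    # positive even when their IoU is below the 0.7 threshold.
    for j in range(len(gt_boxes)):
        col_best = max(row[j] for row in ious)
        if col_best > 0:
            for i, row in enumerate(ious):
                if row[j] == col_best:
                    labels[i] = 1
    return labels

def sample_minibatch(labels, batch_size=256, rng=random):
    """Pick up to batch_size // 2 positives, then fill the rest with negatives."""
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    pos = rng.sample(pos, min(len(pos), batch_size // 2))
    neg = rng.sample(neg, min(len(neg), batch_size - len(pos)))
    return pos, neg

gt = [(0, 0, 10, 10)]
anchors = [(0, 0, 10, 10), (0, 0, 9, 10), (20, 20, 30, 30), (5, 0, 15, 10)]
print(label_anchors(anchors, gt))  # [1, 1, 0, -1]
```

The last anchor overlaps the ground truth with IoU of about 0.33, which falls between the two thresholds, so it is ignored.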
Compare how L2 and smooth L1 behave as the prediction error grows. Notice how L2 explodes for large errors while smooth L1 grows linearly.
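The same comparison in numbers, as a small sketch. The smooth L1 definition is the one from the Fast R-CNN paper. The 0.5 factor on the L2 loss is an assumption made here so that the two curves coincide for small errors.

```python
def l2_loss(x):
    """Squared error with a 0.5 factor (assumed here for easy comparison)."""
    return 0.5 * x * x

def smooth_l1(x):
    """Quadratic for |x| < 1, linear beyond that."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

for err in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"error={err:5.1f}  L2={l2_loss(err):7.3f}  smooth L1={smooth_l1(err):6.3f}")
```

At an error of 10 the L2 loss is 50 while smooth L1 is 9.5, so a single badly matched box has much less influence on the gradient.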
The RPN and the Fast R-CNN detector both sit on top of a shared convolutional backbone. But if you train them independently, they'll each pull the backbone in different directions. How do you make them agree?
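The paper proposes a 4-step alternating training procedure:

1. Train the RPN, starting from an ImageNet-pretrained backbone and fine-tuning end to end for region proposals.
2. Train a separate Fast R-CNN detector on the proposals from the Step 1 RPN, again starting from an ImageNet-pretrained backbone.
3. Re-initialize RPN training from the detector's backbone, freeze the shared convolutional layers, and fine-tune only the RPN-specific layers.
4. Keeping the shared layers frozen, fine-tune only the Fast R-CNN-specific layers.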
After Step 2, the two networks have separate backbones. Steps 3 and 4 are the crucial ones: they freeze the shared convolutional layers and only fine-tune the task-specific layers on top. This ensures the backbone serves both tasks well.
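One way to keep the schedule straight is to write down which parameter groups each step updates. The sketch below is only a bookkeeping aid based on the paper's description of the four steps. The group names (`backbone`, `rpn_head`, `det_head`) and field names are hypothetical labels, not identifiers from any real codebase.

```python
# Which parameter groups are updated at each step of alternating training.
SCHEDULE = [
    {"step": 1, "trains": "rpn", "backbone_init": "imagenet",
     "proposals_from": None, "update": {"backbone", "rpn_head"}},
    {"step": 2, "trains": "detector", "backbone_init": "imagenet",
     "proposals_from": 1, "update": {"backbone", "det_head"}},
    {"step": 3, "trains": "rpn", "backbone_init": "step 2 detector",
     "proposals_from": None, "update": {"rpn_head"}},
    {"step": 4, "trains": "detector", "backbone_init": "step 2 detector",
     "proposals_from": 3, "update": {"det_head"}},
]

def backbone_frozen(step):
    """True when the shared conv layers are not updated in the given step."""
    entry = next(s for s in SCHEDULE if s["step"] == step)
    return "backbone" not in entry["update"]

for s in SCHEDULE:
    state = "frozen" if backbone_frozen(s["step"]) else "trained"
    print(f"step {s['step']}: train {s['trains']:8s} backbone {state}")
```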
The result is a unified network where the backbone computes features once, the RPN reads those features to generate proposals, and the Fast R-CNN head reads the same features (pooled from the proposed regions) to classify and refine boxes. One forward pass through the backbone serves both purposes.
Let's put all the pieces together and trace the full Faster R-CNN pipeline from image to detections.
The paper experiments with two backbones: ZFNet (5 conv layers, fast) and VGG-16 (13 conv layers, accurate). Both are pretrained on ImageNet. The last conv layer produces a feature map that's roughly 1/16th the spatial resolution of the input image (stride 16).
The RPN head is a 3×3 conv (512-d for VGG) followed by two sibling 1×1 convs: one outputting 2×9 = 18 objectness scores, one outputting 4×9 = 36 box deltas. Total RPN parameters: ~2.8×10⁴ for the output layers alone.
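The parameter count follows directly from the layer shapes. A quick check, counting weights only (biases are ignored, which is an assumption of this sketch) for the VGG-16 configuration with k = 9:

```python
def rpn_param_counts(in_channels=512, mid_channels=512, k=9):
    """Weight counts for the RPN head: 3x3 conv, 1x1 cls layer, 1x1 reg layer."""
    conv3x3 = 3 * 3 * in_channels * mid_channels
    cls = mid_channels * 2 * k   # 2 objectness scores per anchor
    reg = mid_channels * 4 * k   # 4 box deltas per anchor
    return conv3x3, cls, reg

conv3x3, cls, reg = rpn_param_counts()
print(cls + reg)            # 27648, about 2.8e4
print(conv3x3 + cls + reg)  # 2386944, about 2.4e6
```

Nearly all of the head's weights sit in the 3×3 conv, while the two output layers are tiny by comparison.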
After the RPN produces ~20,000 proposals, Non-Maximum Suppression (NMS) with IoU threshold 0.7 reduces them to ~2,000. Then the top 300 (by objectness score) are passed to the detector. At test time, only 300 proposals are needed — far fewer than selective search's 2,000.
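The filtering step can be sketched as greedy NMS followed by a top-N cut. This is a plain-Python illustration, not an optimized implementation; boxes are assumed to be (x1, y1, x2, y2) tuples and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7, top_n=300):
    """Return indices of kept boxes, highest score first, at most top_n of them."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU of about 0.82, above the 0.7 threshold, so it is suppressed. Stopping once `top_n` boxes are kept gives the same result as running NMS to completion and then truncating, because boxes are visited in score order.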
Each proposal is projected onto the feature map, RoI pooled to a fixed size (7×7×512 for VGG), then fed through fully-connected layers for multi-class classification and per-class bounding box regression.
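A simplified sketch of projecting a proposal onto the feature map and max-pooling it into a fixed grid. It handles a single channel stored as a list of lists, whereas the real layer does this for all 512 channels. The rounding and clipping conventions here are assumptions, not the reference implementation.

```python
def ceil_div(a, b):
    return -(-a // b)

def roi_pool(feature, roi, out_size=7, stride=16):
    """Max-pool one RoI into an out_size x out_size grid.

    feature: 2-D list (rows x cols) for one channel.
    roi: (x1, y1, x2, y2) in image pixels.
    """
    rows, cols = len(feature), len(feature[0])
    # Project image coordinates onto the feature map and clip to its bounds.
    x1, y1, x2, y2 = (int(round(v / stride)) for v in roi)
    x1, y1 = min(max(x1, 0), cols - 1), min(max(y1, 0), rows - 1)
    x2, y2 = min(max(x2, x1), cols - 1), min(max(y2, y1), rows - 1)
    h, w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_size):
        ys, ye = y1 + (i * h) // out_size, y1 + ceil_div((i + 1) * h, out_size)
        row = []
        for j in range(out_size):
            xs, xe = x1 + (j * w) // out_size, x1 + ceil_div((j + 1) * w, out_size)
            row.append(max(feature[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

feature = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
print(roi_pool(feature, (0, 0, 48, 48), out_size=2))  # [[5, 7], [13, 15]]
```

When the projected region is smaller than the output grid, neighboring bins overlap and repeat values, which is how a small proposal still yields a full fixed-size output.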
Full pipeline from input image to final detections. Follow the data flow from left to right.
Faster R-CNN delivered on both promises: better accuracy and dramatically faster speed.
Using VGG-16 with shared features, Faster R-CNN achieves 69.9% mAP on VOC 2007 — beating the selective search baseline (66.9% mAP). With additional training data (VOC 2007 + 2012), it reaches 73.2% mAP. With COCO pretraining: 78.8% mAP.
| System | Backbone | Proposal Time | Total Time | FPS |
|---|---|---|---|---|
| SS + Fast R-CNN | VGG-16 | 1,510 ms | 1,830 ms | 0.5 |
| RPN + Fast R-CNN | VGG-16 | 10 ms | 198 ms | 5 |
| RPN + Fast R-CNN | ZFNet | 3 ms | 59 ms | 17 |
The headline number: proposals drop from 1,510ms to 10ms — a 150× speedup. Total system throughput goes from 0.5 fps to 5 fps with VGG-16, or 17 fps with ZFNet. And it uses only 300 proposals instead of 2,000.
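The figures quoted above can be re-derived from the table. A short check using only the timings listed there:

```python
timings_ms = {
    "SS + Fast R-CNN (VGG-16)": {"proposal": 1510, "total": 1830},
    "RPN + Fast R-CNN (VGG-16)": {"proposal": 10, "total": 198},
    "RPN + Fast R-CNN (ZFNet)": {"proposal": 3, "total": 59},
}

for name, t in timings_ms.items():
    fps = 1000 / t["total"]
    share = 100 * t["proposal"] / t["total"]
    print(f"{name:27s} {fps:5.1f} fps, proposals = {share:4.1f}% of total time")

proposal_speedup = 1510 / 10
total_speedup = 1830 / 198
print(f"proposal speedup: {proposal_speedup:.0f}x, end-to-end: {total_speedup:.1f}x")
```

Proposals account for more than 80% of the time in the Selective Search pipeline and only about 5% with the RPN. The end-to-end gain with VGG-16 is roughly 9×, smaller than the 150× on proposals alone, because the backbone and detection head still have to run.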
Time breakdown for each detection system (log scale). Note how proposals dominate the old pipeline.
The anchor mechanism is one of Faster R-CNN's most influential contributions. Let's examine the design choices and why they matter.
Because anchors are defined relative to each sliding window position, and the same convolutional weights are shared everywhere, the system is translation invariant: if you shift an object in the image, the same anchor at the new position will detect it. Contrast this with MultiBox, which uses 800 k-means-generated anchors at fixed image positions — shifting an object may not produce the same proposal.
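Before Faster R-CNN, there were two approaches to multi-scale detection: image pyramids, where the image is resized to several scales and features are computed for each one, and filter pyramids, where sliding windows of several sizes are run over a single feature map.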
Anchors are a third approach: regression reference pyramids. A single-scale image produces a single feature map, and the anchors at each position cover multiple scales and ratios. The network learns scale-specific regressors — one per anchor type — so each handles its assigned scale range.
| Setting | Scales | Ratios | mAP |
|---|---|---|---|
| 1 scale, 1 ratio | 128² | 1:1 | 65.8% |
| 1 scale, 1 ratio | 256² | 1:1 | 66.7% |
| 1 scale, 3 ratios | 128² | {2:1, 1:1, 1:2} | 68.8% |
| 1 scale, 3 ratios | 256² | {2:1, 1:1, 1:2} | 67.9% |
| 3 scales, 1 ratio | {128², 256², 512²} | 1:1 | 69.8% |
| 3 scales, 3 ratios | {128², 256², 512²} | {2:1, 1:1, 1:2} | 69.9% |
Multiple scales matter more than multiple ratios (69.8% with 3 scales / 1 ratio vs. 68.8% with 1 scale / 3 ratios). But both together give the best result. Importantly, even a single anchor (65.8%) is only about 4 mAP points behind the full 9-anchor setup, so the core mechanism is robust.
Compare image pyramids, filter pyramids, and anchor pyramids.
Faster R-CNN sits at a pivotal point in the history of object detection. It introduced anchors and RPNs — ideas that shaped nearly every detector that followed.
| Method | Year | Key Change | Speed |
|---|---|---|---|
| R-CNN | 2013 | CNN features for detection | ~47s/image |
| Fast R-CNN | 2015 | Shared conv features + RoI pooling | ~2.3s (with SS) |
| Faster R-CNN | 2015 | Learned proposals (RPN) + shared backbone | ~0.2s (5 fps) |
| FPN | 2017 | Multi-scale feature pyramids for better small-object detection | ~0.2s |
| Mask R-CNN | 2017 | + instance segmentation branch | ~0.2s |
Feature Pyramid Networks (FPN) addressed Faster R-CNN's weakness on small objects by building a top-down feature pyramid with lateral connections. The RPN runs on every level of the pyramid, so small objects get proposals from high-resolution feature maps. FPN + Faster R-CNN became the standard two-stage detector.
Mask R-CNN added a pixel-level segmentation branch alongside the classification and box regression branches. Same RPN, same backbone sharing — just one more output head.
Anchor-free methods (FCOS, CenterNet) later challenged the anchor paradigm entirely, predicting objects from center points or per-pixel outputs instead of reference boxes. They showed that anchors aren't strictly necessary, though by then anchor-style "prior boxes" had spread widely, including to YOLO v2+, SSD, and RetinaNet.
DETR (2020) replaced both the RPN and NMS with a Transformer that directly predicts a set of detections. No anchors, no NMS, and far fewer hand-designed components. It's a radical departure, but it arrived five years after Faster R-CNN, and a well-tuned Faster R-CNN was still the baseline it had to match.