Ren, He, Girshick, Sun — Microsoft Research, 2015

Faster R-CNN: Region Proposal Networks

Replace the 2-second selective search bottleneck with a learned network that proposes regions in 10ms — sharing convolutional features with the detector for near real-time object detection.

Prerequisites: CNNs (VGG/ZFNet) + Fast R-CNN basics
10 Chapters · 6+ Simulations

Chapter 0: The Problem

By 2015, object detection had a speed problem — but not where you'd expect. Fast R-CNN could classify proposed regions at near real-time speed. The actual bottleneck was finding those regions in the first place.

Selective Search, the dominant region proposal method, works like this: segment the image into superpixels, then greedily merge them based on hand-crafted features (color, texture, size). It produces ~2,000 candidate boxes per image. The problem? It takes about 2 seconds per image on CPU, roughly an order of magnitude slower than the neural network that follows it.

Think about what that means. You've built this brilliant deep neural network that can classify regions in milliseconds. But before it ever runs, you're waiting 2 seconds for a hand-engineered algorithm from 2012 to finish grinding through superpixels. The CNN sits idle, twiddling its thumbs.

The bottleneck in numbers: In a Fast R-CNN + VGG-16 system, the CNN takes ~320ms total (conv features + region classification). Selective Search takes ~1,500ms in the paper's own timing, or about 80% of the total. The "slow" part isn't the neural network; it's the classical algorithm that feeds it.

Even EdgeBoxes, a faster alternative, still takes ~200ms per image. And both methods share a deeper problem: they're completely separate from the detection network. The proposal algorithm doesn't benefit from the learned features, and the detector can't influence what gets proposed.

Detection Pipeline Timing

Compare where time is spent in the old pipeline vs. Faster R-CNN.

Why is selective search a bottleneck in Fast R-CNN, even though the detection network is the "deep" part?

Chapter 1: The Key Insight

Here's the observation that makes Faster R-CNN possible: the convolutional feature maps that Fast R-CNN already computes for detection contain all the information needed to propose regions too.

Think about it. A VGG-16 backbone produces a rich feature map from the input image — 512 channels encoding edges, textures, parts, and objects at every spatial location. If a human can look at these features and say "there's probably an object here," a small neural network can learn to do the same.

So instead of running a separate algorithm (selective search) on the raw pixels, why not add a small network on top of the same conv features the detector already uses? This is the Region Proposal Network (RPN) — a lightweight fully convolutional network that predicts object proposals directly from the shared feature map.

The key insight in one sentence: Don't compute proposals from raw pixels with a hand-crafted algorithm. Compute them from learned conv features with a tiny neural network that shares those features with the detector. Proposals become nearly free — 10ms instead of 2,000ms.

The paper describes the RPN as an "attention" mechanism: it tells the detector where to look. The RPN and detector share a backbone, forming a single unified network. The marginal cost of computing proposals? Just a few extra convolutional layers — about 10ms per image.

Old Pipeline
Image → Selective Search (2s, CPU) → 2000 proposals → CNN classifier
↓ replace with
Faster R-CNN
Image → Shared Conv → RPN (10ms) → 300 proposals → Fast R-CNN head

What makes RPN proposals "nearly free" compared to selective search?

Chapter 2: The RPN

The Region Proposal Network is beautifully simple. It slides a small 3×3 convolutional window over the shared feature map. At every spatial position, this window looks at a 3×3 patch of the feature map and asks two questions:

  1. Is there an object here? (objectness classification)
  2. Where exactly is it? (bounding box refinement)

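Concretely, the 3×3 conv produces a 256-d (ZFNet) or 512-d (VGG-16) feature vector at each position. This vector feeds into two sibling 1×1 convolutional layers:

  1. A classification layer (cls) that outputs 2k objectness scores: object vs. background for each of the k anchors at that position.
  2. A regression layer (reg) that outputs 4k box deltas: four coordinates for each of the k anchors (k = 9 by default; see Chapter 3).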

Because the 3×3 conv slides across the entire feature map, the same weights are shared at every position. This is just a fully convolutional network — no fully-connected layers, no position-specific parameters. The RPN can handle images of any size.

Why 3×3? The effective receptive field of a 3×3 window on the last VGG-16 conv layer is 228 pixels on the input image. That's enough to "see" the context around even large objects. And because anchors handle multi-scale detection (Chapter 3), the network doesn't need large filters.
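The 228-pixel figure can be reproduced with the standard receptive-field recurrence. The sketch below assumes the usual VGG-16 layout (3×3 convs with stride 1, 2×2 max-pools with stride 2, and no pool after conv5); the layer list and the helper name `receptive_field` are illustrative, not taken from the paper's code.

```python
# VGG-16 layers up to conv5_3 as (kernel_size, stride) pairs.
CONV = (3, 1)  # 3x3 convolution, stride 1
POOL = (2, 2)  # 2x2 max-pool, stride 2
VGG16_CONV_LAYERS = (
    [CONV] * 2 + [POOL]    # conv1_x, pool1
    + [CONV] * 2 + [POOL]  # conv2_x, pool2
    + [CONV] * 3 + [POOL]  # conv3_x, pool3
    + [CONV] * 3 + [POOL]  # conv4_x, pool4
    + [CONV] * 3           # conv5_x (the final pool is not used)
)


def receptive_field(layers):
    """Return (receptive field, cumulative stride), both in input pixels."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * current stride
        jump *= stride
    return rf, jump


# Append the RPN's own 3x3 sliding window on top of conv5_3.
rf, stride = receptive_field(VGG16_CONV_LAYERS + [CONV])
print(rf, stride)  # 228 16
```

The same run also recovers the stride of 16 that maps feature-map cells back to image pixels.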
RPN Sliding Window

The 3×3 window slides over the feature map. At each position it produces objectness scores and box deltas for k=9 anchors. Click to move the window.

Why is the RPN implemented as a fully convolutional network (3×3 conv + 1×1 conv) rather than fully-connected layers?

Chapter 3: Anchors

Here's the problem the RPN faces: objects come in wildly different sizes and shapes. A person is tall and narrow. A bus is wide and squat. A ball is small and square. How can a single 3×3 sliding window detect all of them?

The answer is anchors — a set of k reference boxes centered at each sliding window position. Each anchor has a predefined scale and aspect ratio. The RPN doesn't predict boxes from scratch. Instead, it predicts how to adjust each anchor to better fit the actual object.

The default configuration uses 3 scales (128², 256², 512² pixels) × 3 aspect ratios (1:1, 1:2, 2:1) = 9 anchors per position. On a typical feature map of ~40×60, that's roughly 20,000 anchors covering the entire image at multiple scales.
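A minimal sketch of how the 9 anchors and the ~20,000 total might be generated. The function names, the height/width convention for ratios, and placing each anchor center in the middle of its stride-16 cell are assumptions made for illustration.

```python
import math

SCALES = (128, 256, 512)  # side length of the square anchor at each scale, in pixels
RATIOS = (0.5, 1.0, 2.0)  # height / width (assumed convention)


def anchors_at(cx, cy, scales=SCALES, ratios=RATIOS):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)  # change the shape while keeping the area at s * s
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def all_anchors(feat_h, feat_w, stride=16):
    """Place the anchors at every feature-map cell, mapped back to image pixels."""
    return [
        box
        for row in range(feat_h)
        for col in range(feat_w)
        for box in anchors_at(col * stride + stride / 2, row * stride + stride / 2)
    ]


print(len(anchors_at(0, 0)))     # 9
print(len(all_anchors(40, 60)))  # 21600
```

Note that many of these boxes extend past the image border; this sketch does not filter them.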

The anchor trick: Instead of predicting absolute box coordinates (hard), predict small offsets from well-chosen reference boxes (easy). The RPN outputs tx, ty, tw, th — deltas that shift and resize each anchor. This is bounding box regression, and it works because most anchors are already close to the right answer.

The box parameterization is:

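$$t_x = \frac{x - x_a}{w_a}, \qquad t_y = \frac{y - y_a}{h_a}$$
$$t_w = \log\frac{w}{w_a}, \qquad t_h = \log\frac{h}{h_a}$$

Where (x, y) is the center of the predicted box and (w, h) its width and height, and $(x_a, y_a, w_a, h_a)$ are the same quantities for the anchor. Dividing the center offsets by the anchor size and taking the log of the size ratios makes the targets scale-invariant, and decoding with an exponential guarantees the predicted width and height are always positive.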
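The parameterization can be sketched as a pair of inverse functions. The names `encode` and `decode` are hypothetical, and boxes are assumed to be given in (center x, center y, width, height) form.

```python
import math


def encode(box, anchor):
    """Map a box to deltas (tx, ty, tw, th) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    """Invert encode(): apply predicted deltas to an anchor to get a box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 128.0, 128.0)
box = (116.0, 92.0, 160.0, 96.0)
deltas = encode(box, anchor)
print(deltas[:2])              # (0.125, -0.0625)
print(decode(deltas, anchor))  # recovers the box up to floating-point error
```

Because `decode` exponentiates `tw` and `th`, any real-valued network output yields a positive width and height.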

Interactive Anchor Explorer

Click anywhere on the "image" to place an anchor center. Toggle scales and ratios to see different anchor configurations. Drag the offset sliders to see box regression.

Why does the RPN predict offsets from anchor boxes rather than predicting absolute box coordinates?

Chapter 4: Multi-Task Loss

Faster R-CNN is trained with two losses — one for the RPN and one for the Fast R-CNN detection head. Each loss itself has two terms: classification and regression.

RPN Loss

For the RPN, each anchor gets a binary label: positive (object) or negative (background). An anchor is positive if it either (a) has the highest IoU with a given ground-truth box among all anchors, or (b) has IoU > 0.7 with any ground-truth box. Rule (a) guarantees that every object gets at least one positive anchor even when no anchor clears 0.7. Non-positive anchors with IoU < 0.3 for all ground-truth boxes are negative. Everything in between is ignored during training.
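The labeling rules can be sketched as follows. This is an illustrative pure-Python version with hypothetical helper names (`iou`, `label_anchors`), using -1 to mark ignored anchors; it is not the paper's implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row, default=0.0)
        labels.append(1 if best > hi else 0 if best < lo else -1)
    # Rule (a): the best-overlapping anchor(s) for each ground-truth box are positive.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in overlaps]
        top = max(column, default=0.0)
        if top > 0:
            for i, value in enumerate(column):
                if value == top:
                    labels[i] = 1
    return labels


gt = [(0, 0, 100, 100), (400, 400, 500, 500)]
anchors = [(0, 0, 100, 100), (0, 0, 100, 50), (200, 200, 300, 300), (400, 400, 480, 480)]
print(label_anchors(anchors, gt))  # [1, -1, 0, 1]
```

The last anchor has IoU 0.64 with the second object, below the 0.7 threshold, but it is still labeled positive because it is that object's best match.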

$$L_{RPN} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)$$

Where $p_i$ is the predicted objectness of anchor i, $p_i^*$ is the ground-truth label (1 for positive, 0 for negative), $t_i$ is the predicted box delta, and $t_i^*$ is the target box delta. The factor $p_i^*$ in front of $L_{reg}$ means we only regress boxes for positive anchors: there is no point refining an anchor that's labeled background.

Normalization: $N_{cls}$ = 256 (mini-batch size), $N_{reg}$ ≈ 2,400 (number of anchor locations), and λ = 10. This makes both terms roughly equally weighted (1/256 vs. 10/2,400 ≈ 1/240). The paper shows results are insensitive to λ across two orders of magnitude (1 to 100).
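To make the two terms concrete, here is a minimal sketch of the loss for a handful of anchors. It assumes objectness is given as a probability and scored with binary log loss, and that the regression term sums smooth L1 over the four deltas; `rpn_loss` and its argument names are illustrative.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear beyond |x| = 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p: predicted objectness probabilities; p_star: 0/1 labels;
    t, t_star: predicted and target (tx, ty, tw, th) tuples, one per anchor."""
    eps = 1e-12  # guards against log(0)
    cls = sum(
        -(ps * math.log(pi + eps) + (1 - ps) * math.log(1 - pi + eps))
        for pi, ps in zip(p, p_star)
    )
    # Multiplying by ps zeroes the regression term for negative anchors.
    reg = sum(
        ps * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
        for ps, ti, tsi in zip(p_star, t, t_star)
    )
    return cls / n_cls + lam * reg / n_reg


zero = (0.0, 0.0, 0.0, 0.0)
# A negative anchor: wildly wrong deltas contribute nothing to the loss.
print(rpn_loss([0.5], [0], [(5.0, 5.0, 5.0, 5.0)], [zero]))
# A positive anchor: both terms are active.
print(rpn_loss([0.5], [1], [(0.5, 0.0, 0.0, 2.0)], [zero]))
```

With the default constants, the per-anchor weights are 1/256 for classification and 10/2400 for regression, which is why the two terms end up on a similar footing.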

Detection Head Loss

The Fast R-CNN detection head has its own multi-task loss: multi-class classification (not just object/background, but which class) plus bounding box regression for each class. This is identical to the original Fast R-CNN paper.

Training Sampling

Each mini-batch comes from a single image. The paper samples 256 anchors per image, with a ratio of positives to negatives of up to 1:1. If there are fewer than 128 positive anchors (common), the batch is padded with negatives.
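The sampling rule can be sketched in a few lines. The function name `sample_minibatch` and the label encoding (1 positive, 0 negative, -1 ignored) are assumptions for illustration.

```python
import random


def sample_minibatch(labels, batch_size=256, max_pos_fraction=0.5, rng=random):
    """Return (positive indices, negative indices) for one image's mini-batch."""
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(pos), int(batch_size * max_pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)  # negatives pad whatever positives leave free
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)


# Typical case: only a few positives among thousands of anchors.
labels = [1] * 30 + [0] * 5000 + [-1] * 1000
pos_idx, neg_idx = sample_minibatch(labels, rng=random.Random(0))
print(len(pos_idx), len(neg_idx))  # 30 226
```

Capping positives at half the batch keeps the much more numerous negatives from dominating the loss more than they already would.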

Why smooth L1 loss for regression? The regression loss uses smooth L1 (a Huber-style loss) instead of L2. Smooth L1 is less sensitive to outliers: when the predicted box is very far from the target, the L2 gradient grows in proportion to the error and can destabilize training. Smooth L1 transitions from quadratic (near zero) to linear (far away), so its gradient magnitude never exceeds 1.
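A small numeric comparison makes the gradient argument concrete. The sketch assumes the common definitions with the transition point at |x| = 1 and L2 written as 0.5·x²; the function names are illustrative.

```python
def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def smooth_l1_grad(x):
    # Equal to the error itself near zero, clipped to +/-1 for large errors.
    return x if abs(x) < 1.0 else (1.0 if x > 0 else -1.0)


def l2(x):
    return 0.5 * x * x


def l2_grad(x):
    # Grows without bound as the error grows.
    return x


for err in (0.1, 0.5, 1.0, 5.0, 50.0):
    print(
        f"error={err:5.1f}  "
        f"smooth_l1={smooth_l1(err):8.3f} (grad {smooth_l1_grad(err):4.1f})  "
        f"l2={l2(err):9.3f} (grad {l2_grad(err):5.1f})"
    )
```

At an error of 50, the L2 gradient is 50 times larger than the smooth L1 gradient, which is the kind of outlier update the paragraph above warns about.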
Smooth L1 vs L2 Loss

Compare how L2 and smooth L1 behave as the prediction error grows. Notice how L2 explodes for large errors while smooth L1 grows linearly.

Why is the regression loss only computed for positive anchors (those matched to a ground-truth object)?

Chapter 5: Feature Sharing

The RPN and the Fast R-CNN detector both sit on top of a shared convolutional backbone. But if you train them independently, they'll each pull the backbone in different directions. How do you make them agree?

The paper proposes a 4-step alternating training procedure:

Step 1: Train RPN from ImageNet-pretrained backbone. Learn to propose regions.
Step 2: Train Fast R-CNN detector using Step 1 proposals. Separate backbone, no sharing yet.
Step 3: Re-initialize RPN with detector's backbone. Freeze shared conv layers. Fine-tune only RPN-specific layers.
Step 4: Freeze shared conv layers. Fine-tune only Fast R-CNN-specific layers. Now both share the same backbone.

After Step 2, the two networks have separate backbones. Steps 3 and 4 are the crucial ones: they freeze the shared convolutional layers and only fine-tune the task-specific layers on top. This ensures the backbone serves both tasks well.

Why not just train jointly? The paper explored "approximate joint training" (merging both networks and backpropagating both losses), which produces close results and reduces training time by about 25-50% compared with alternating training. But it ignores the gradient with respect to the proposal box coordinates, so the update is only approximate. The paper adopts the 4-step procedure for its experiments. Later work showed joint training works fine in practice.

The result is a unified network where the backbone computes features once, the RPN reads those features to generate proposals, and the Fast R-CNN head reads the same features (pooled from the proposed regions) to classify and refine boxes. One forward pass through the backbone serves both purposes.

What happens in Steps 3 and 4 of the alternating training procedure?

Chapter 6: Architecture

Let's put all the pieces together and trace the full Faster R-CNN pipeline from image to detections.

Backbone

The paper experiments with two backbones: ZFNet (5 conv layers, fast) and VGG-16 (13 conv layers, accurate). Both are pretrained on ImageNet. The last conv layer produces a feature map that's roughly 1/16th the spatial resolution of the input image (stride 16).

RPN Head

A 3×3 conv (512-d for VGG) followed by two sibling 1×1 convs: one outputting 2×9 = 18 objectness scores, one outputting 4×9 = 36 box deltas. Total RPN parameters: ~2.8×10⁴ for the output layers alone.
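The parameter counts follow directly from the layer shapes. The sketch below counts weights only (biases are ignored, an assumption that matches the rounded figure) and assumes 512 input channels to the 3×3 conv, as for VGG-16.

```python
def rpn_output_params(channels=512, k=9):
    """Weight counts of the two sibling 1x1 conv layers (biases ignored)."""
    cls_params = channels * 2 * k  # two objectness scores per anchor
    reg_params = channels * 4 * k  # four box deltas per anchor
    return cls_params, reg_params


cls_p, reg_p = rpn_output_params()
print(cls_p, reg_p, cls_p + reg_p)  # 9216 18432 27648

# The shared 3x3 conv that feeds both heads is much larger than the heads themselves.
intermediate = 3 * 3 * 512 * 512
print(intermediate)  # 2359296
```

The output-layer total of 27,648 is the ~2.8×10⁴ quoted above.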

NMS + Top-N Selection

After the RPN produces ~20,000 proposals, Non-Maximum Suppression (NMS) with IoU threshold 0.7 reduces them to ~2,000. Then the top 300 (by objectness score) are passed to the detector. At test time, only 300 proposals are needed — far fewer than selective search's 2,000.
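A greedy version of this NMS-then-top-N step might look like the following. It is a simple O(n²) sketch with hypothetical names (`iou`, `nms_top_n`), not an optimized or official implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_top_n(boxes, scores, iou_threshold=0.7, top_n=300):
    """Return indices of up to top_n surviving boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap an already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep


boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300), (0, 0, 100, 50)]
scores = [0.9, 0.8, 0.7, 0.6]
print(nms_top_n(boxes, scores))           # [0, 2, 3]
print(nms_top_n(boxes, scores, top_n=2))  # [0, 2]
```

Because boxes are visited in score order, stopping after `top_n` survivors is equivalent to running NMS to completion and then keeping the top-scoring results.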

Fast R-CNN Head

Each proposal is projected onto the feature map, RoI pooled to a fixed size (7×7×512 for VGG), then fed through fully-connected layers for multi-class classification and per-class bounding box regression.
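The projection and pooling step can be sketched for a single channel. This is a simplified illustration: the rounding rule, the clamping, and the bin boundaries are assumptions, and `roi_max_pool` is a hypothetical name rather than a library function.

```python
import math


def roi_max_pool(feature, roi, out_size=7, stride=16):
    """feature: 2-D list (one channel); roi: (x1, y1, x2, y2) in image pixels."""
    rows, cols = len(feature), len(feature[0])
    # Project the proposal onto the feature map by dividing by the backbone stride.
    x1, y1, x2, y2 = (round(v / stride) for v in roi)
    x1, x2 = max(0, min(x1, cols - 1)), max(0, min(x2, cols - 1))
    y1, y2 = max(0, min(y1, rows - 1)), max(0, min(y2, rows - 1))
    w, h = max(x2 - x1 + 1, 1), max(y2 - y1 + 1, 1)
    pooled = []
    for by in range(out_size):
        ys = y1 + math.floor(by * h / out_size)
        ye = y1 + math.ceil((by + 1) * h / out_size)
        row = []
        for bx in range(out_size):
            xs = x1 + math.floor(bx * w / out_size)
            xe = x1 + math.ceil((bx + 1) * w / out_size)
            # Max over the cells in this bin; bins may overlap when the RoI is small.
            row.append(max(feature[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature = [[y * 8 + x for x in range(8)] for y in range(8)]
print(roi_max_pool(feature, (0, 0, 48, 48), out_size=2))  # [[9, 11], [25, 27]]
```

Whatever the proposal's size, the output grid has a fixed shape, which is what lets the fully-connected layers that follow accept every proposal.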

Faster R-CNN Architecture

Full pipeline from input image to final detections. Follow the data flow from left to right.

Parameter efficiency: The RPN output layer has only 2.8×10⁴ parameters (512 × 6 × 9), compared to MultiBox's 6.1×10⁶ parameters. This is two orders of magnitude smaller, because anchors are translation-invariant: the same small conv filters work at every position.

After the RPN generates ~20,000 raw proposals, how are they reduced to the ~300 used by the detector?

Chapter 7: Results

Faster R-CNN delivered on both promises: better accuracy and dramatically faster speed.

PASCAL VOC 2007

Using VGG-16 with shared features, Faster R-CNN achieves 69.9% mAP on VOC 2007 — beating the selective search baseline (66.9% mAP). With additional training data (VOC 2007 + 2012), it reaches 73.2% mAP. With COCO pretraining: 78.8% mAP.

Speed

| System | Backbone | Proposal Time | Total Time | FPS |
|---|---|---|---|---|
| SS + Fast R-CNN | VGG-16 | 1,510 ms | 1,830 ms | 0.5 |
| RPN + Fast R-CNN | VGG-16 | 10 ms | 198 ms | 5 |
| RPN + Fast R-CNN | ZFNet | 3 ms | 59 ms | 17 |

The headline number: proposals drop from 1,510ms to 10ms — a 150× speedup. Total system throughput goes from 0.5 fps to 5 fps with VGG-16, or 17 fps with ZFNet. And it uses only 300 proposals instead of 2,000.
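The headline figures can be checked with a few lines of arithmetic on the table values. The dictionary layout and variable names are illustrative.

```python
# (proposal time, total time) in milliseconds, taken from the speed table.
timings_ms = {
    "SS + Fast R-CNN (VGG-16)": (1510, 1830),
    "RPN + Fast R-CNN (VGG-16)": (10, 198),
    "RPN + Fast R-CNN (ZFNet)": (3, 59),
}

for name, (proposal, total) in timings_ms.items():
    fps = 1000 / total
    share = 100 * proposal / total
    print(f"{name}: {fps:.1f} fps, proposals take {share:.0f}% of total time")

proposal_speedup = timings_ms["SS + Fast R-CNN (VGG-16)"][0] / timings_ms["RPN + Fast R-CNN (VGG-16)"][0]
total_speedup = timings_ms["SS + Fast R-CNN (VGG-16)"][1] / timings_ms["RPN + Fast R-CNN (VGG-16)"][1]
print(f"proposal speedup: {proposal_speedup:.0f}x, end-to-end speedup: {total_speedup:.1f}x")
```

The proposal stage speeds up by about 150×, but the end-to-end gain with VGG-16 is closer to 9×, because the detection network's own cost now dominates.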

Fewer proposals, better results: Faster R-CNN with just 300 proposals outperforms selective search with 2,000 proposals (69.9% vs 66.9%). The learned proposals are simply better targeted than hand-crafted ones.

Key Ablation Findings

Speed Comparison

Time breakdown for each detection system (log scale). Note how proposals dominate the old pipeline.

How does Faster R-CNN achieve higher accuracy (69.9% mAP) with fewer proposals (300) than selective search achieves with 2,000?

Chapter 8: Anchor Design

The anchor mechanism is one of Faster R-CNN's most influential contributions. Let's examine the design choices and why they matter.

Translation Invariance

Because anchors are defined relative to each sliding window position, and the same convolutional weights are shared everywhere, the system is translation invariant: if you shift an object in the image, the same anchor at the new position will detect it. Contrast this with MultiBox, which uses 800 k-means-generated anchors at fixed image positions — shifting an object may not produce the same proposal.

Multi-Scale via Anchors (not Image Pyramids)

Before Faster R-CNN, there were two approaches to multi-scale detection:

  1. Image pyramids: Resize the image to multiple scales, compute features at each scale. Accurate but slow (N× the computation).
  2. Filter pyramids: Use filters of multiple sizes on the feature map. Requires training separate models per scale.

Anchors are a third approach: regression reference pyramids. A single-scale image produces a single feature map, and the anchors at each position cover multiple scales and ratios. The network learns scale-specific regressors — one per anchor type — so each handles its assigned scale range.

Why this is elegant: Image pyramids multiply computation. Filter pyramids multiply parameters. Anchor pyramids multiply neither: they're just reference coordinates that the same small conv network regresses from. The computational cost is nearly the same whether you have 1 anchor or 9 anchors per position, because only the output dimension of the 1×1 convs changes.

Scale and Ratio Ablations

| Setting | Scales | Ratios | mAP |
|---|---|---|---|
| 1 scale, 1 ratio | 128² | 1:1 | 65.8% |
| 1 scale, 1 ratio | 256² | 1:1 | 66.7% |
| 1 scale, 3 ratios | 128² | {2:1, 1:1, 1:2} | 68.8% |
| 1 scale, 3 ratios | 256² | {2:1, 1:1, 1:2} | 67.9% |
| 3 scales, 1 ratio | {128², 256², 512²} | 1:1 | 69.8% |
| 3 scales, 3 ratios | {128², 256², 512²} | {2:1, 1:1, 1:2} | 69.9% |

Multiple scales matter more than multiple ratios (69.8% with 3 scales / 1 ratio vs. 68.8% with 1 scale / 3 ratios). But both together give the best result. Importantly, even a single anchor (65.8%) is only about 4 mAP points behind the full 9-anchor setup, so the core mechanism is robust.

Three Approaches to Multi-Scale Detection

Compare image pyramids, filter pyramids, and anchor pyramids. Click each tab to see the approach.

What is the main advantage of "anchor pyramids" over image pyramids for multi-scale detection?

Chapter 9: Connections

Faster R-CNN sits at a pivotal point in the history of object detection. It introduced anchors and RPNs — ideas that shaped nearly every detector that followed.

The R-CNN Family Tree

| Method | Year | Key Change | Speed |
|---|---|---|---|
| R-CNN | 2013 | CNN features for detection | ~47s/image |
| Fast R-CNN | 2015 | Shared conv features + RoI pooling | ~2.3s (with SS) |
| Faster R-CNN | 2015 | Learned proposals (RPN) + shared backbone | ~0.2s (5 fps) |
| FPN | 2017 | Multi-scale feature pyramids for better small-object detection | ~0.2s |
| Mask R-CNN | 2017 | + instance segmentation branch | ~0.2s |

What Came After

Feature Pyramid Networks (FPN) addressed Faster R-CNN's weakness on small objects by building a top-down feature pyramid with lateral connections. The RPN runs on every level of the pyramid, so small objects get proposals from high-resolution feature maps. FPN + Faster R-CNN became the standard two-stage detector.

Mask R-CNN added a pixel-level segmentation branch alongside the classification and box regression branches. Same RPN, same backbone sharing — just one more output head.

Anchor-free methods (FCOS, CenterNet) later challenged the anchor paradigm entirely, predicting objects as center points or per-pixel predictions. They showed that anchors aren't strictly necessary. By then, though, the anchor concept had already spread well beyond two-stage detectors, appearing as anchor or "prior" boxes in SSD, YOLO v2+, and RetinaNet.

DETR (2020) replaced both the RPN and NMS with a Transformer that directly predicts a set of detections. No anchors, no NMS, no hand-designed components. It's a radical departure, but it arrived five years after Faster R-CNN, and the original DETR needed a much longer training schedule to reach comparable accuracy.

Faster R-CNN's lasting legacy: The anchor mechanism and the idea of sharing learned features between proposal and detection became the default recipe for object detection. Even "anchor-free" methods define themselves in opposition to this paper. Faster R-CNN and the RPN were the foundation of the 1st-place entries in several tracks of the ILSVRC and COCO 2015 competitions, and its codebase became the basis for virtually all two-stage detectors that followed.

What limitation of Faster R-CNN did Feature Pyramid Networks (FPN) address?