Faster R-CNN — Veanors

Chapter 0: The Problem

By 2015, object detection had a speed problem — but not where you'd expect. Fast R-CNN could classify proposed regions at near real-time speed. The actual bottleneck was finding those regions in the first place.

Selective Search, the dominant region proposal method, works like this: segment the image into superpixels, then greedily merge them based on hand-crafted features (color, texture, size). It produces ~2,000 candidate boxes per image. The problem? It takes about 2 seconds per image on CPU, several times longer than the neural network that follows it.

Think about what that means. You've built this brilliant deep neural network that can classify regions in milliseconds. But before it ever runs, you're waiting 2 seconds for a hand-engineered algorithm from 2012 to finish grinding through superpixels. The CNN sits idle, twiddling its thumbs.

The bottleneck in numbers: In a Fast R-CNN + VGG-16 system, the CNN takes ~320ms total (conv features + region classification). Selective Search takes ~1,500ms. The "slow" part isn't the neural network — it's the classical algorithm that feeds it.

Even EdgeBoxes, a faster alternative, still takes ~200ms per image. And both methods share a deeper problem: they're completely separate from the detection network. The proposal algorithm doesn't benefit from the learned features, and the detector can't influence what gets proposed.

Detection Pipeline Timing

Compare where time is spent in the old pipeline vs. Faster R-CNN.

Why is selective search a bottleneck in Fast R-CNN, even though the detection network is the "deep" part?

Selective search takes ~2 seconds per image on CPU, while the detection CNN runs in ~320ms on GPU, so the proposal step dominates total runtime
The CNN is too shallow to detect objects
Selective search produces too few proposals

Chapter 1: The Key Insight

Here's the observation that makes Faster R-CNN possible: the convolutional feature maps that Fast R-CNN already computes for detection contain all the information needed to propose regions too.

Think about it. A VGG-16 backbone produces a rich feature map from the input image — 512 channels encoding edges, textures, parts, and objects at every spatial location. If a human can look at these features and say "there's probably an object here," a small neural network can learn to do the same.

So instead of running a separate algorithm (selective search) on the raw pixels, why not add a small network on top of the same conv features the detector already uses? This is the Region Proposal Network (RPN) — a lightweight fully convolutional network that predicts object proposals directly from the shared feature map.

The key insight in one sentence: Don't compute proposals from raw pixels with a hand-crafted algorithm. Compute them from learned conv features with a tiny neural network that shares those features with the detector. Proposals become nearly free — 10ms instead of 2,000ms.

The paper describes the RPN as an "attention" mechanism: it tells the detector where to look. The RPN and detector share a backbone, forming a single unified network. The marginal cost of computing proposals? Just a few extra convolutional layers — about 10ms per image.

Old Pipeline

Image → Selective Search (2s, CPU) → 2000 proposals → CNN classifier

↓ replace with

Faster R-CNN

Image → Shared Conv → RPN (10ms) → 300 proposals → Fast R-CNN head

What makes RPN proposals "nearly free" compared to selective search?

The RPN reuses the convolutional feature maps already computed by the detection backbone, so it only needs a few extra lightweight layers (~10ms) instead of processing raw pixels from scratch (~2s)
The RPN uses fewer proposals
The RPN runs on a faster GPU

Chapter 2: The RPN

The Region Proposal Network is beautifully simple. It slides a small 3×3 convolutional window over the shared feature map. At every spatial position, this window looks at a 3×3 patch of the feature map and asks two questions:

Is there an object here? (objectness classification)
Where exactly is it? (bounding box refinement)

Concretely, the 3×3 conv produces a 256-d (ZFNet) or 512-d (VGG-16) feature vector at each position. This vector feeds into two sibling 1×1 convolutional layers:

A cls layer that outputs 2k scores (object vs. not-object for each of k anchors)
A reg layer that outputs 4k coordinates (bounding box deltas for each of k anchors)

Because the 3×3 conv slides across the entire feature map, the same weights are shared at every position. This is just a fully convolutional network — no fully-connected layers, no position-specific parameters. The RPN can handle images of any size.
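To make the layer shapes concrete, here is a minimal sketch that only does the bookkeeping. The function name `rpn_head_shapes` is hypothetical, and it assumes the 3×3 conv uses stride 1 with padding 1 so the spatial size is preserved (the text above does not state the padding).

```python
def rpn_head_shapes(feat_h, feat_w, k=9, mid_channels=512):
    """Return (channels, height, width) for each RPN layer's output.

    Assumption of this sketch: the 3x3 conv keeps the spatial size
    (stride 1, padding 1). mid_channels is 512 for VGG-16, 256 for ZFNet.
    """
    return {
        "intermediate": (mid_channels, feat_h, feat_w),  # 3x3 conv output
        "cls": (2 * k, feat_h, feat_w),  # object / not-object score per anchor
        "reg": (4 * k, feat_h, feat_w),  # (t_x, t_y, t_w, t_h) per anchor
    }


shapes = rpn_head_shapes(40, 60)
for name, shape in shapes.items():
    print(name, shape)
```

Because every output keeps the feature map's height and width, the same function works for any input size, which is the point of keeping the head fully convolutional.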

Why 3×3? The effective receptive field of a 3×3 window on the last VGG-16 conv layer is 228 pixels on the input image. That's enough to "see" the context around even large objects. And because anchors handle multi-scale detection (Chapter 3), the network doesn't need large filters.

RPN Sliding Window

The 3×3 window slides over the feature map. At each position it produces objectness scores and box deltas for k=9 anchors.

Why is the RPN implemented as a fully convolutional network (3×3 conv + 1×1 conv) rather than fully-connected layers?

Because convolutional layers share weights across all spatial positions, making the network translation-invariant and able to handle any input size
Fully-connected layers are slower
Convolutional layers use more parameters

Chapter 3: Anchors

Here's the problem the RPN faces: objects come in wildly different sizes and shapes. A person is tall and narrow. A bus is wide and squat. A ball is small and square. How can a single 3×3 sliding window detect all of them?

The answer is anchors — a set of k reference boxes centered at each sliding window position. Each anchor has a predefined scale and aspect ratio. The RPN doesn't predict boxes from scratch. Instead, it predicts how to adjust each anchor to better fit the actual object.

The default configuration uses 3 scales (128², 256², 512² pixels) × 3 aspect ratios (1:1, 1:2, 2:1) = 9 anchors per position. On a typical feature map of ~40×60, that's roughly 20,000 anchors covering the entire image at multiple scales.
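The anchor grid can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: ratios are read as height / width, each anchor keeps an area of scale², and anchor centers sit at the middle of each stride-16 cell. The names `make_anchors` and `all_anchors` are hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centered at (cx, cy).

    Assumptions: ratio = height / width, and each anchor has area scale**2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def all_anchors(feat_h, feat_w, stride=16):
    """Tile the 9 anchors over every feature-map position."""
    out = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Cell center in input-image pixels (an assumed convention).
            cx = j * stride + stride / 2
            cy = i * stride + stride / 2
            out.extend(make_anchors(cx, cy))
    return out


anchors = all_anchors(40, 60)
print(len(anchors))  # 40 * 60 * 9 = 21600
```

Many of these anchors extend past the image border; how they are handled (clipping or ignoring) is a separate choice this sketch leaves out.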

The anchor trick: Instead of predicting absolute box coordinates (hard), predict small offsets from well-chosen reference boxes (easy). The RPN outputs t_x, t_y, t_w, t_h — deltas that shift and resize each anchor. This is bounding box regression, and it works because most anchors are already close to the right answer.

The box parameterization is:

t_x = (x − x_a) / w_a, t_y = (y − y_a) / h_a

t_w = log(w / w_a), t_h = log(h / h_a)

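Where (x, y) is the center of the predicted box and (w, h) its width and height, and (x_a, y_a, w_a, h_a) are the same quantities for the anchor. Dividing the center offsets by the anchor size makes them scale-invariant, and because width and height are recovered through an exponential (w = w_a · exp(t_w)), they are always positive.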
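A small round-trip sketch of this parameterization follows. The helper names `encode` and `decode` are hypothetical; boxes are given as (center x, center y, width, height), matching the formulas above.

```python
import math


def encode(box, anchor):
    """Turn a box into deltas (t_x, t_y, t_w, t_h) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    """Invert encode: apply deltas to an anchor to recover a box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 128.0, 128.0)
box = (110.0, 90.0, 160.0, 96.0)

deltas = encode(box, anchor)
recovered = decode(deltas, anchor)
print(deltas)
print(recovered)
```

All-zero deltas decode back to the anchor itself, so an anchor that already fits its object needs no correction.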

Interactive Anchor Explorer

Interactive figure: place an anchor center on the image, toggle scales and ratios to see different anchor configurations, and adjust the t_x, t_y, t_w, t_h offsets to see box regression.

Why does the RPN predict offsets from anchor boxes rather than predicting absolute box coordinates?

Predicting small offsets from well-placed references is much easier to learn than predicting absolute coordinates, and the anchors already cover multiple scales and aspect ratios
Absolute coordinates require more parameters
Anchor boxes are faster to compute

Chapter 4: Multi-Task Loss

Faster R-CNN is trained with two losses — one for the RPN and one for the Fast R-CNN detection head. Each loss itself has two terms: classification and regression.

RPN Loss

For the RPN, each anchor gets a binary label: positive (object) or negative (background). An anchor is positive if it either (a) has the highest IoU with a given ground-truth box among all anchors, or (b) has IoU > 0.7 with any ground-truth box. Rule (a) guarantees that every ground-truth box gets at least one positive anchor, even when no anchor clears 0.7. Anchors with IoU < 0.3 for all ground-truth boxes are negative. Everything in between is ignored.
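The labeling rules can be sketched as follows. This is a simplified illustration: it assumes at least one ground-truth box, uses corner-format (x1, y1, x2, y2) boxes, and breaks ties in rule (a) by taking the first best anchor. The names `iou` and `label_anchors` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each anchor."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        # Rule (b) for positives, IoU < lo for negatives, otherwise ignored.
        labels.append(1 if best > hi else 0 if best < lo else -1)
    # Rule (a): the best-overlapping anchor for each ground-truth box is positive.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in ious]
        best = max(column)
        if best > 0:
            labels[column.index(best)] = 1
    return labels


anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30), (100, 100, 110, 110)]
gt_boxes = [(0, 0, 10, 10), (22, 22, 34, 34)]
labels = label_anchors(anchors, gt_boxes)
print(labels)  # [1, -1, 1, 0]
```

The third anchor overlaps its ground-truth box with IoU of only about 0.36, yet it is labeled positive because it is the best match that box has.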

L_RPN = (1/N_cls) ∑_i L_cls(p_i, p_i^*) + λ (1/N_reg) ∑_i p_i^* L_reg(t_i, t_i^*)

Where p_i is the predicted objectness, p_i^* is the ground-truth label (1 for positive, 0 for negative), t_i is the predicted box delta, and t_i^* is the target box delta. The term p_i^* in front of L_reg means we only regress boxes for positive anchors — no point refining an anchor that's labeled background.

Normalization: N_cls = 256 (mini-batch size), N_reg ≈ 2,400 (number of anchor locations), and λ = 10. This makes both terms roughly equally weighted. The paper shows results are insensitive to λ across two orders of magnitude (1 to 100).
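A plain-Python sketch of this loss, under stated assumptions: the classification term is written as binary cross-entropy on a predicted object probability (equivalent to a two-class log loss), and the regression term uses smooth L1 with the quadratic-to-linear switch at 1. The function names are hypothetical.

```python
import math


def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p: predicted object probabilities, p_star: 0/1 labels,
    t / t_star: predicted and target (t_x, t_y, t_w, t_h) tuples."""
    eps = 1e-12  # guards against log(0)
    cls_sum = 0.0
    reg_sum = 0.0
    for pi, ps, ti, ts in zip(p, p_star, t, t_star):
        cls_sum += -(ps * math.log(pi + eps) + (1 - ps) * math.log(1 - pi + eps))
        # p_star gates the regression term: negatives contribute nothing.
        reg_sum += ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
    return cls_sum / n_cls + lam * reg_sum / n_reg


zero = (0.0, 0.0, 0.0, 0.0)
far = (5.0, 5.0, 5.0, 5.0)
loss_negative_far = rpn_loss([0.5], [0], [far], [zero])
loss_negative_exact = rpn_loss([0.5], [0], [zero], [zero])
loss_positive = rpn_loss([0.5], [1], [(0.5, 0.0, 0.0, 0.0)], [zero])
print(loss_negative_far == loss_negative_exact)  # True
```

With these defaults, 1/N_cls is about 0.0039 and λ/N_reg is about 0.0042, which is why the two terms end up roughly equally weighted.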

Detection Head Loss

The Fast R-CNN detection head has its own multi-task loss: multi-class classification (not just object/background, but which class) plus bounding box regression for each class. This is identical to the original Fast R-CNN paper.

Training Sampling

Each mini-batch comes from a single image. The paper samples 256 anchors per image with up to a 1:1 ratio of positives to negatives. If there are fewer than 128 positive anchors (common), the rest are filled with negatives.
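The sampling rule might look like this in code. It is a sketch: the label convention (1 positive, 0 negative, -1 ignored) and the name `sample_minibatch` are assumptions for illustration.

```python
import random


def sample_minibatch(labels, batch_size=256, max_pos_fraction=0.5, rng=None):
    """Pick anchor indices for one image's mini-batch.

    At most half the batch is positive; any shortfall is filled with negatives.
    """
    rng = rng or random.Random(0)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(pos), int(batch_size * max_pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)


# A typical image: few positives, many negatives, some ignored anchors.
labels = [1] * 30 + [0] * 5000 + [-1] * 1000
pos_idx, neg_idx = sample_minibatch(labels)
print(len(pos_idx), len(neg_idx))  # 30 226
```

Without the cap on positives and the subsampling of negatives, the thousands of background anchors would dominate the loss.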

Why smooth L1 loss for regression? The regression loss uses smooth L1 (Huber loss) instead of L2. Smooth L1 is less sensitive to outliers — when the predicted box is very far from the target, L2 produces huge gradients that can destabilize training. Smooth L1 transitions from quadratic (near zero) to linear (far away), giving stable gradients everywhere.
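To see the difference in gradients directly, here is a small comparison. It assumes the usual smooth L1 form with the switch at |x| = 1 and writes L2 as 0.5·x² so both losses agree near zero; both choices are conventions of this sketch.

```python
def smooth_l1(x):
    """Quadratic for |x| < 1, linear beyond that."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def smooth_l1_grad(x):
    if abs(x) < 1.0:
        return x
    return 1.0 if x > 0 else -1.0


def l2(x):
    return 0.5 * x * x


def l2_grad(x):
    return x


for err in (0.1, 1.0, 10.0, 100.0):
    print(
        f"error={err:>6}: "
        f"smooth_l1={smooth_l1(err):.3f} (grad {smooth_l1_grad(err):.1f}), "
        f"l2={l2(err):.3f} (grad {l2_grad(err):.1f})"
    )
```

The smooth L1 gradient never exceeds 1 in magnitude, while the L2 gradient grows with the error.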

Smooth L1 vs L2 Loss

Compare how L2 and smooth L1 behave as the prediction error grows. Notice how L2 explodes for large errors while smooth L1 grows linearly.

Why is the regression loss only computed for positive anchors (those matched to a ground-truth object)?

There is no meaningful target box for negative anchors: you can't regress toward an object that doesn't exist, so the regression loss would be undefined or harmful
Negative anchors are too numerous
The regression layer doesn't process negative anchors

Chapter 5: Feature Sharing

The RPN and the Fast R-CNN detector both sit on top of a shared convolutional backbone. But if you train them independently, they'll each pull the backbone in different directions. How do you make them agree?

The paper proposes a 4-step alternating training procedure:

Step 1

Train RPN from ImageNet-pretrained backbone. Learn to propose regions.

↓

Step 2

Train Fast R-CNN detector using Step 1 proposals. Separate backbone — no sharing yet.

↓

Step 3

Re-initialize RPN with detector's backbone. Freeze shared conv layers. Fine-tune only RPN-specific layers.

↓

Step 4

Freeze shared conv layers. Fine-tune only Fast R-CNN-specific layers. Now both share the same backbone.

After Step 2, the two networks have separate backbones. Steps 3 and 4 are the crucial ones: they freeze the shared convolutional layers and only fine-tune the task-specific layers on top. This ensures the backbone serves both tasks well.
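One way to keep the four steps straight is to write them down as a schedule of which parameter groups are trainable. The group names (`backbone`, `rpn_head`, `det_head`) and the dictionary layout are hypothetical, chosen only for this sketch; it also assumes, as in the paper, that the Step 2 detector starts from the ImageNet-pretrained model.

```python
# Hypothetical parameter-group names: "backbone", "rpn_head", "det_head".
SCHEDULE = [
    {"step": 1, "network": "rpn", "backbone_init": "imagenet",
     "trainable": {"backbone", "rpn_head"}},
    {"step": 2, "network": "detector", "backbone_init": "imagenet",
     "trainable": {"backbone", "det_head"}},
    {"step": 3, "network": "rpn", "backbone_init": "step 2 detector",
     "trainable": {"rpn_head"}},
    {"step": 4, "network": "detector", "backbone_init": "shared (frozen)",
     "trainable": {"det_head"}},
]


def backbone_is_frozen(step):
    """True when the shared conv layers receive no updates in this step."""
    return "backbone" not in SCHEDULE[step - 1]["trainable"]


for entry in SCHEDULE:
    frozen = backbone_is_frozen(entry["step"])
    print(entry["step"], entry["network"], "backbone frozen:", frozen)
```

In a real framework, freezing would mean excluding the backbone parameters from the optimizer or disabling their gradients; that part is left out here.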

Why not just train jointly? The paper explored "approximate joint training" (merging both networks and backpropagating both losses), which gives close results and cuts training time by about 25-50%. But it ignores the gradient with respect to the proposal box coordinates, so the update is only approximate. The 4-step procedure avoids that approximation. Later work showed joint training works fine in practice.

The result is a unified network where the backbone computes features once, the RPN reads those features to generate proposals, and the Fast R-CNN head reads the same features (pooled from the proposed regions) to classify and refine boxes. One forward pass through the backbone serves both purposes.

What happens in Steps 3 and 4 of the alternating training procedure?

The shared convolutional layers are frozen, and only the task-specific layers (RPN head in Step 3, detector head in Step 4) are fine-tuned, ensuring both tasks share a common backbone
Both networks are trained from scratch
The RPN is discarded and only the detector is kept

Chapter 6: Architecture

Let's put all the pieces together and trace the full Faster R-CNN pipeline from image to detections.

Backbone

The paper experiments with two backbones: ZFNet (5 conv layers, fast) and VGG-16 (13 conv layers, accurate). Both are pretrained on ImageNet. The last conv layer produces a feature map that's roughly 1/16th the spatial resolution of the input image (stride 16).

RPN Head

A 3×3 conv (512-d for VGG) followed by two sibling 1×1 convs: one outputting 2×9 = 18 objectness scores, one outputting 4×9 = 36 box deltas. The two output layers together have ~2.8×10⁴ parameters.

NMS + Top-N Selection

After the RPN produces ~20,000 proposals, Non-Maximum Suppression (NMS) with IoU threshold 0.7 reduces them to ~2,000. Then the top 300 (by objectness score) are passed to the detector. At test time, only 300 proposals are needed — far fewer than selective search's 2,000.
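A minimal greedy NMS followed by top-N selection can be sketched like this. It is a plain-Python illustration with quadratic cost, not an optimized implementation; the names `iou` and `nms_top_n` are hypothetical, and boxes are (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_top_n(boxes, scores, iou_thresh=0.7, top_n=300):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (0, 0, 10, 5), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
print(nms_top_n(boxes, scores))           # [0, 2, 3]
print(nms_top_n(boxes, scores, top_n=2))  # [0, 2]
```

The second box is dropped because its IoU with the top-scoring box is about 0.82, above the 0.7 threshold. Because boxes are visited in score order, stopping after `top_n` kept boxes gives the same result as running NMS to completion and then truncating.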

Fast R-CNN Head

Each proposal is projected onto the feature map, RoI pooled to a fixed size (7×7×512 for VGG), then fed through fully-connected layers for multi-class classification and per-class bounding box regression.
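The projection and pooling step can be illustrated on a single-channel feature map. This is a simplified sketch: it rounds the projected coordinates to whole cells and max-pools each bin, in the spirit of RoI pooling; the exact rounding rules and the name `roi_max_pool` are assumptions.

```python
import math


def roi_max_pool(feature, box, stride=16, out_size=7):
    """Max-pool one channel of `feature` (a 2-D list) over an image-space box.

    box is (x1, y1, x2, y2) in input-image pixels; returns out_size x out_size.
    """
    fh, fw = len(feature), len(feature[0])
    # Project onto the feature map and clamp to its bounds.
    x1 = min(max(round(box[0] / stride), 0), fw - 1)
    y1 = min(max(round(box[1] / stride), 0), fh - 1)
    x2 = min(max(round(box[2] / stride), x1), fw - 1)
    y2 = min(max(round(box[3] / stride), y1), fh - 1)
    roi_w, roi_h = x2 - x1 + 1, y2 - y1 + 1

    pooled = []
    for i in range(out_size):
        ys = y1 + math.floor(i * roi_h / out_size)
        ye = y1 + math.ceil((i + 1) * roi_h / out_size)
        row = []
        for j in range(out_size):
            xs = x1 + math.floor(j * roi_w / out_size)
            xe = x1 + math.ceil((j + 1) * roi_w / out_size)
            row.append(max(feature[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Tiny example: a 4x4 map pooled to 2x2 so the result is easy to check by hand.
feature = [[4 * y + x for x in range(4)] for y in range(4)]
small = roi_max_pool(feature, (0, 0, 63, 63), out_size=2)
print(small)  # [[5, 7], [13, 15]]
```

Whatever the proposal's size, the output grid is fixed, which is what lets the fully-connected layers that follow accept every proposal.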

Faster R-CNN Architecture

Full pipeline from input image to final detections. Follow the data flow from left to right.

Parameter efficiency: The RPN output layer has only 2.8×10⁴ parameters (512 × 6 × 9), compared to MultiBox's 6.1×10⁶ parameters. This is two orders of magnitude smaller, because anchors are translation-invariant — the same small conv filters work at every position.
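The two parameter counts are simple products. In the sketch below, the RPN figure uses the 512 × 6 × 9 breakdown given above; the MultiBox breakdown (1536 × (4 + 1) × 800) is taken from the paper rather than from the text above, and biases are ignored in both. The function names are hypothetical.

```python
def rpn_output_params(channels=512, k=9):
    """Weights in the cls (2k) and reg (4k) 1x1 conv layers, biases ignored."""
    return channels * (2 + 4) * k


def multibox_output_params(channels=1536, k=800):
    """Weights in MultiBox's fully-connected output layer, as quoted in the paper."""
    return channels * (4 + 1) * k


rpn = rpn_output_params()
multibox = multibox_output_params()
print(rpn)                    # 27648
print(multibox)               # 6144000
print(round(multibox / rpn))  # 222
```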

After the RPN generates ~20,000 raw proposals, how are they reduced to the ~300 used by the detector?

Non-Maximum Suppression (NMS) with IoU threshold 0.7 removes overlapping boxes, then the top 300 by objectness score are selected
Random sampling of 300 proposals
Clustering proposals by position

Chapter 7: Results

Faster R-CNN delivered on both promises: better accuracy and dramatically faster speed.

PASCAL VOC 2007

Using VGG-16 with shared features, Faster R-CNN achieves 69.9% mAP on VOC 2007 — beating the selective search baseline (66.9% mAP). With additional training data (VOC 2007 + 2012), it reaches 73.2% mAP. With COCO pretraining: 78.8% mAP.

Speed

System	Backbone	Proposal Time	Total Time	FPS
SS + Fast R-CNN	VGG-16	1,510 ms	1,830 ms	0.5
RPN + Fast R-CNN	VGG-16	10 ms	198 ms	5
RPN + Fast R-CNN	ZFNet	3 ms	59 ms	17

The headline number: proposals drop from 1,510ms to 10ms — a 150× speedup. Total system throughput goes from 0.5 fps to 5 fps with VGG-16, or 17 fps with ZFNet. And it uses only 300 proposals instead of 2,000.
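The figures above follow directly from the table. A short sketch that recomputes frames per second, the proposal share of runtime, and the proposal speedup from the table's numbers:

```python
timings_ms = {
    "SS + Fast R-CNN (VGG-16)": {"proposal": 1510, "total": 1830},
    "RPN + Fast R-CNN (VGG-16)": {"proposal": 10, "total": 198},
    "RPN + Fast R-CNN (ZFNet)": {"proposal": 3, "total": 59},
}

for name, t in timings_ms.items():
    fps = 1000 / t["total"]
    share = t["proposal"] / t["total"]
    print(f"{name}: {fps:.1f} fps, proposals take {share:.0%} of runtime")

speedup = (
    timings_ms["SS + Fast R-CNN (VGG-16)"]["proposal"]
    / timings_ms["RPN + Fast R-CNN (VGG-16)"]["proposal"]
)
print(f"proposal speedup: {speedup:.0f}x")
```

In the old pipeline the proposal step is most of the runtime; with the RPN it is a small fraction.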

Fewer proposals, better results: Faster R-CNN with just 300 proposals outperforms selective search with 2,000 proposals (69.9% vs 66.9%). The learned proposals are simply better targeted than hand-crafted ones.

Key Ablation Findings

Sharing matters: Shared features (69.9%) beat unshared (68.5%) because the detector-tuned features help the RPN propose better regions.
Both cls and reg matter: In the ZF-based ablation (59.9% baseline), removing the cls head (no ranking/NMS) drops mAP to 44.6-55.8%, depending on how many proposals are kept. Removing the reg head drops it to 51.3-52.1%. The cls head is especially important for ranking the top proposals.
One-stage detection is worse: A one-stage OverFeat-style system (dense sliding windows as "proposals") gets only 53.9% mAP vs. the two-stage 58.7%, so the proposal + detection cascade is worth ~5 points of mAP.

Speed Comparison

Time breakdown for each detection system (log scale). Note how proposals dominate the old pipeline.

How does Faster R-CNN achieve higher accuracy (69.9% mAP) with fewer proposals (300) than selective search achieves with 2,000?

The RPN learns to propose high-quality regions using deep features shared with the detector, so 300 learned proposals are more targeted than 2,000 hand-crafted ones
The VGG-16 backbone is simply better at classification
The RPN uses a larger input image resolution

Chapter 8: Anchor Design

The anchor mechanism is one of Faster R-CNN's most influential contributions. Let's examine the design choices and why they matter.

Translation Invariance

Because anchors are defined relative to each sliding window position, and the same convolutional weights are shared everywhere, the system is translation invariant: if you shift an object in the image, the same anchor at the new position will detect it. Contrast this with MultiBox, which uses 800 k-means-generated anchors at fixed image positions — shifting an object may not produce the same proposal.

Multi-Scale via Anchors (not Image Pyramids)

Before Faster R-CNN, there were two approaches to multi-scale detection:

Image pyramids: Resize the image to multiple scales, compute features at each scale. Accurate but slow (N× the computation).
Filter pyramids: Use filters of multiple sizes on the feature map. Requires training separate models per scale.

Anchors are a third approach: regression reference pyramids. A single-scale image produces a single feature map, and the anchors at each position cover multiple scales and ratios. The network learns scale-specific regressors — one per anchor type — so each handles its assigned scale range.

Why this is elegant: Image pyramids multiply computation. Filter pyramids multiply parameters. Anchor pyramids multiply neither: they're just reference coordinates that the same small conv network regresses from. The computational cost is almost the same whether you have 1 anchor or 9 anchors per position; only the output dimension of the 1×1 convs changes.
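A rough multiply-accumulate count per feature-map position shows how little the anchor count matters. The sketch assumes the VGG-16 configuration (512 input channels, 512-d intermediate layer) and ignores biases; `rpn_macs_per_position` is a hypothetical helper.

```python
def rpn_macs_per_position(k, in_channels=512, mid_channels=512):
    """Multiply-accumulates for the RPN at one spatial position.

    Returns (cost of the shared 3x3 conv, cost of the two 1x1 output convs).
    """
    conv3x3 = 3 * 3 * in_channels * mid_channels
    heads = mid_channels * (2 * k + 4 * k)
    return conv3x3, heads


for k in (1, 9):
    conv3x3, heads = rpn_macs_per_position(k)
    print(f"k={k}: 3x3 conv {conv3x3:,} MACs, output layers {heads:,} MACs")

total_1 = sum(rpn_macs_per_position(1))
total_9 = sum(rpn_macs_per_position(9))
print(f"cost ratio, 9 anchors vs 1: {total_9 / total_1:.3f}")
```

Going from 1 to 9 anchors raises the RPN's cost by about 1%, because the 3×3 conv dominates and does not depend on k.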

Scale and Ratio Ablations

Setting	Scales	Ratios	mAP
1 scale, 1 ratio	128²	1:1	65.8%
1 scale, 1 ratio	256²	1:1	66.7%
1 scale, 3 ratios	128²	{2:1, 1:1, 1:2}	68.8%
1 scale, 3 ratios	256²	{2:1, 1:1, 1:2}	67.9%
3 scales, 1 ratio	{128², 256², 512²}	1:1	69.8%
3 scales, 3 ratios	{128², 256², 512²}	{2:1, 1:1, 1:2}	69.9%

Multiple scales matter more than multiple ratios (69.8% with 3 scales / 1 ratio vs. 68.8% with 1 scale / 3 ratios). But both together give the best result. Importantly, even a single anchor (65.8%) is only ~4 points behind the full 9-anchor setup, so the core mechanism is robust.

Three Approaches to Multi-Scale Detection

Compare image pyramids, filter pyramids, and anchor pyramids.

What is the main advantage of "anchor pyramids" over image pyramids for multi-scale detection?

Anchor pyramids handle multiple scales using a single-scale feature map with multiple reference boxes, avoiding the N-times computational cost of processing multiple image resolutions
Anchor pyramids produce more proposals
Anchor pyramids require deeper networks

Chapter 9: Connections

Faster R-CNN sits at a pivotal point in the history of object detection. It introduced anchors and RPNs — ideas that shaped nearly every detector that followed.

The R-CNN Family Tree

Method	Year	Key Change	Speed
R-CNN	2013	CNN features for detection	~47s/image
Fast R-CNN	2015	Shared conv features + RoI pooling	~2.3s (with SS)
Faster R-CNN	2015	Learned proposals (RPN) + shared backbone	~0.2s (5 fps)
FPN	2017	Multi-scale feature pyramids for better small-object detection	~0.2s
Mask R-CNN	2017	+ instance segmentation branch	~0.2s

What Came After

Feature Pyramid Networks (FPN) addressed Faster R-CNN's weakness on small objects by building a top-down feature pyramid with lateral connections. The RPN runs on every level of the pyramid, so small objects get proposals from high-resolution feature maps. FPN + Faster R-CNN became the standard two-stage detector.

Mask R-CNN added a pixel-level segmentation branch alongside the classification and box regression branches. Same RPN, same backbone sharing — just one more output head.

Anchor-free methods (FCOS, CenterNet) later challenged the anchor paradigm entirely, predicting objects as center points or per-pixel predictions. They showed that anchors aren't strictly necessary — but the anchor concept evolved into "prior boxes" that appear everywhere, including in YOLO v2+, SSD, and RetinaNet.

DETR (2020) replaced both the RPN and NMS with a Transformer that directly predicts a set of detections. No anchors, no NMS, and far fewer hand-designed components. It's a radical departure, yet it arrived five years after Faster R-CNN, which was still the well-tuned baseline DETR measured itself against.

Faster R-CNN's lasting legacy: The anchor mechanism and the idea of sharing learned features between proposal and detection became the default recipe for object detection. Even "anchor-free" methods define themselves in opposition to this paper. Faster R-CNN and the RPN were the foundation of the 1st-place entries in several tracks of the ILSVRC and COCO 2015 competitions, and the design became the basis for most two-stage detectors that followed.

What limitation of Faster R-CNN did Feature Pyramid Networks (FPN) address?

Faster R-CNN uses only the last conv layer's feature map (low resolution), making it weak on small objects; FPN builds a multi-scale feature pyramid so the RPN can propose small objects from high-resolution features
Faster R-CNN is too slow
Faster R-CNN can't do instance segmentation

Faster R-CNN: Region Proposal Networks