Girshick, Donahue, Darrell, Malik — UC Berkeley, 2013

R-CNN: Regions with CNN Features

The paper that brought deep learning to object detection: it combined selective search region proposals with CNN feature extraction and improved mAP by more than 30% relative to the previous best result on PASCAL VOC 2012.

Prerequisites: CNNs (AlexNet) + Image classification basics

Chapter 0: The Problem

By 2012, object detection had hit a wall. The best systems on the PASCAL VOC benchmark were gaining maybe 1-2% per year by assembling increasingly baroque ensembles of hand-crafted features: HOG descriptors, SIFT keypoints, deformable part models (DPMs), spatial pyramids. Leading hand-crafted pipelines such as DPM and the UVA selective search detector scored around 33-35% mAP on VOC 2010.

Meanwhile, something dramatic had happened in image classification. At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in September 2012, Alex Krizhevsky's CNN — AlexNet — smashed the competition, cutting the top-5 error rate from 26% to 15%. A deep neural network had learned features that crushed hand-crafted ones on the classification task.

But classification and detection are fundamentally different problems. Classification answers "what is in this image?" Detection answers "what objects are in this image, and where are they?" You need to both recognize AND localize, potentially many objects per image.

The central question of this paper: Can the CNN features that revolutionized image classification also work for object detection? And if so, how do you bridge the gap between a classifier (one label per image) and a detector (many boxes per image, each with a label and location)?

The leading detection approach at the time — the Deformable Part Model (DPM) — used multi-scale HOG features fed into an SVM with sliding windows and part-based reasoning. It was elegant but fundamentally limited by the expressiveness of HOG. HOG computes local orientation histograms — essentially edge statistics — and that's all it can represent. No amount of clever engineering on top of HOG could learn the rich, hierarchical features that a deep CNN learns automatically.

The question wasn't academic. Object detection is the gateway to visual understanding: autonomous driving, robotics, medical imaging, surveillance — all depend on knowing what is where. Breaking the HOG ceiling would change the field.

Why couldn't the AlexNet revolution in classification be directly applied to object detection?

Chapter 1: The Key Insight

R-CNN's answer is deceptively simple: what if you just propose a bunch of candidate regions in the image, crop each one out, resize it, and run a CNN classifier on each crop independently?

That's it. That's the whole idea. Three concepts fused together:

  1. Region proposals — use an existing algorithm (selective search) to generate ~2000 candidate bounding boxes that might contain objects
  2. CNN feature extraction — warp each proposed region to 227×227 pixels and run it through AlexNet to get a 4096-dimensional feature vector
  3. Linear classification — train one SVM per object class on the CNN features
Why this is brilliant: Instead of trying to redesign the CNN architecture for detection (which nobody knew how to do in 2013), R-CNN reduces detection to a series of classification problems. Each region proposal is treated as a mini classification task: "Is there a dog in this crop? A car? A person?" The CNN doesn't need to know it's doing detection — it just classifies crops.

The second key contribution is transfer learning. In 2013, training a deep CNN from scratch on a small detection dataset like PASCAL VOC (a few thousand images) was hopeless — the network would massively overfit. R-CNN showed that you could:

  1. Pre-train the CNN on ImageNet (1.2 million images, 1000 classes)
  2. Fine-tune the last layers on your detection dataset (replacing the 1000-way classifier with an (N+1)-way classifier for N object classes + background)

This "supervised pre-training + domain-specific fine-tuning" paradigm boosted mAP by 8 percentage points. Today we call this transfer learning and it's standard practice — but in 2013 it was a novel contribution.

What are the two key insights that make R-CNN work?

Chapter 2: The R-CNN Pipeline

R-CNN processes an image in four stages. Each stage is a separate module — this is a pipeline, not an end-to-end system.

Stage 1: Input
Take the input image (any size)
Stage 2: Propose
Run selective search → ~2000 candidate bounding boxes
Stage 3: Extract
Warp each region to 227×227, run through CNN → 4096-dim feature vector per region
Stage 4: Classify
Score each feature vector with per-class SVMs + bounding box regression → final detections

At test time, the main computational bottleneck is Stage 3: running the CNN ~2000 times per image. On a GPU this takes about 13 seconds per image; on a CPU, 53 seconds. The SVM classification in Stage 4 is nearly instant — just a matrix multiply of the 2000×4096 feature matrix with the 4096×N weight matrix.

The R-CNN Pipeline

The four-stage detection pipeline. Each region proposal is independently cropped, warped, and classified.

Non-maximum suppression (NMS)

After scoring all ~2000 regions, many overlapping boxes will fire on the same object. R-CNN applies greedy non-maximum suppression per class: sort detections by score, accept the top one, then reject any remaining detection whose IoU (intersection over union) with an accepted detection exceeds a threshold. This collapses the cloud of overlapping boxes into a single tight box per object.
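
The greedy procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: boxes are assumed to be (x1, y1, x2, y2) tuples, and the 0.3 threshold and the names `iou` and `greedy_nms` are placeholders chosen for the example. In R-CNN it would be run once per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Return indices of kept boxes, highest score first.

    iou_threshold is an assumed value for illustration only.
    """
    # Visit detections from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Accept a box only if it does not overlap any accepted box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Two near-duplicate boxes on one object, plus one separate object.
boxes = [(10, 10, 110, 110), (15, 12, 112, 108), (200, 50, 260, 120)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU of about 0.89, so it is suppressed; the third box does not overlap anything and survives.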

The pipeline nature is key: Each module is trained separately. The region proposer, the CNN feature extractor, the SVMs, and the bounding box regressor are all optimized independently. This makes the system modular, but it also means error signals from later stages can't propagate backward to improve earlier ones. A later paper (Fast R-CNN) will unify Stages 3 and 4 into a single trainable network.

Why does R-CNN need non-maximum suppression (NMS)?

Chapter 3: Region Proposals

The first module in R-CNN generates category-independent region proposals — bounding boxes that might contain any object. R-CNN uses selective search, an algorithm from Uijlings et al. (2013).

How selective search works

Selective search is a bottom-up grouping algorithm. It starts with a fine-grained oversegmentation of the image (using Felzenszwalb's algorithm), then iteratively merges similar neighboring regions:

  1. Oversegment — break the image into thousands of tiny regions based on pixel similarity
  2. Compute similarity — for each pair of neighboring regions, compute similarity based on color, texture, size, and fill (how well regions fit together)
  3. Merge greedily — merge the most similar pair, recompute similarities for the new merged region
  4. Collect boxes — at every level of the merging hierarchy, record the bounding box of each merged region
  5. Output ~2000 proposals — the union of bounding boxes at all scales
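
The five steps above can be sketched as a toy grouping loop. This is a heavily simplified stand-in, not the real selective search: it assumes the oversegmentation is already given, describes each region by a single mean gray value instead of color and texture histograms, and omits the texture term entirely. The region dictionaries and the function names are hypothetical.

```python
def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def similarity(r1, r2, image_area):
    """Sum of simplified color, size, and fill similarities (texture omitted)."""
    color = 1.0 - abs(r1["color"] - r2["color"])
    size = 1.0 - (r1["size"] + r2["size"]) / image_area  # favors merging small regions
    bx = union_box(r1["box"], r2["box"])
    bbox_area = (bx[2] - bx[0]) * (bx[3] - bx[1])
    fill = 1.0 - (bbox_area - r1["size"] - r2["size"]) / image_area  # penalizes gaps
    return color + size + fill


def hierarchical_grouping(regions, neighbor_pairs, image_area):
    """Greedily merge the most similar neighbors; collect a box at every level."""
    regions = dict(regions)  # region id -> {"box", "size", "color"}
    neighbors = {frozenset(p) for p in neighbor_pairs}
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1

    def pair_similarity(pair):
        i, j = sorted(pair)
        return similarity(regions[i], regions[j], image_area)

    while neighbors:
        # Sort first so that ties are broken deterministically.
        best = max(sorted(neighbors, key=sorted), key=pair_similarity)
        i, j = sorted(best)
        # Regions adjacent to either merged region become neighbors of the new one.
        touching = {k for p in neighbors if p & best for k in p} - best
        neighbors = {p for p in neighbors if not (p & best)}
        a, b = regions.pop(i), regions.pop(j)
        total = a["size"] + b["size"]
        regions[next_id] = {
            "box": union_box(a["box"], b["box"]),
            "size": total,
            "color": (a["color"] * a["size"] + b["color"] * b["size"]) / total,
        }
        neighbors |= {frozenset((next_id, k)) for k in touching}
        proposals.append(regions[next_id]["box"])
        next_id += 1
    return proposals


# A 20x20 image split into four 10x10 cells: dark top row, bright bottom row.
regions = {
    0: {"box": (0, 0, 10, 10), "size": 100, "color": 0.10},
    1: {"box": (10, 0, 20, 10), "size": 100, "color": 0.12},
    2: {"box": (0, 10, 10, 20), "size": 100, "color": 0.90},
    3: {"box": (10, 10, 20, 20), "size": 100, "color": 0.80},
}
pairs = [(0, 1), (0, 2), (1, 3), (2, 3)]
proposals = hierarchical_grouping(regions, pairs, image_area=400)
print(len(proposals))  # 7
```

In this toy run the two dark cells merge first, then the two bright cells, then the two rows, so the output contains the four original boxes, two row-shaped boxes, and the full image.
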

Selective Search: Hierarchical Grouping

Regions merge from fine to coarse. Each merge level produces candidate bounding boxes.

Why selective search and not sliding windows? A sliding window at all positions, scales, and aspect ratios produces millions of candidates — far too many for expensive CNN processing. Selective search is smarter: it uses image structure (edges, color, texture) to propose only ~2000 plausible regions. This is 1000x fewer candidates while capturing 95%+ of actual objects. The "proposal" strategy trades exhaustive coverage for computational tractability.

Why ~2000 proposals?

The paper found that 2000 proposals achieve very high recall — the fraction of ground-truth objects that overlap with at least one proposal at IoU > 0.5. Going beyond 2000 gives diminishing returns while linearly increasing CNN computation. Going below 1000 risks missing objects.
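
To make the recall definition concrete, here is a small sketch that counts how many ground-truth boxes are covered by at least one proposal at IoU > 0.5. The boxes are made-up (x1, y1, x2, y2) tuples and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(ground_truth, proposals, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for g in ground_truth
        if any(iou(g, p) > iou_threshold for p in proposals)
    )
    return hits / len(ground_truth)


ground_truth = [(10, 10, 60, 60), (100, 100, 150, 180)]
proposals = [(12, 8, 62, 58), (90, 90, 120, 120), (300, 300, 320, 320)]
print(proposal_recall(ground_truth, proposals))  # 0.5
```

The first object is covered (IoU of about 0.85), the second is only grazed by a proposal (IoU of about 0.09), so recall is 0.5. A detector built on these proposals could never find the second object, which is why proposal recall bounds detection performance.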

Why does R-CNN use selective search instead of sliding windows for region proposals?

Chapter 4: CNN Feature Extraction

For each of the ~2000 region proposals, R-CNN needs to extract a feature vector. Here's the process:

Step 1: Warp to fixed size

The CNN (AlexNet) requires a fixed 227×227 pixel input. Region proposals come in all shapes and sizes: a tall, narrow person; a wide, flat bus; a tiny bottle. R-CNN handles this with the simplest possible approach, anisotropic warping. It dilates the bounding box so that the warped result has 16 pixels of surrounding image context on each side, then stretches (or squishes) the dilated box to exactly 227×227. This distorts the aspect ratio, but the CNN learns to handle it.
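
The padding arithmetic is easy to get wrong, so here is a hedged sketch of it. It assumes the 16 context pixels are measured in the warped 227×227 frame, which means the original box occupies the inner 195×195 pixels after warping. Clipping at the image border and the actual pixel resampling are left out, and the function names are hypothetical.

```python
def dilate_box_for_context(box, warped_size=227, context=16):
    """Grow (x1, y1, x2, y2) so the warped crop has `context` pixels per side."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    inner = warped_size - 2 * context  # pixels the original box fills after warping
    pad_x = context * w / inner        # padding in original-image pixels
    pad_y = context * h / inner
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


def warp_scales(box, warped_size=227):
    """Horizontal and vertical scale factors applied when warping the box."""
    x1, y1, x2, y2 = box
    return warped_size / (x2 - x1), warped_size / (y2 - y1)


tall_person = (100, 50, 295, 440)  # 195 wide, 390 tall
dilated = dilate_box_for_context(tall_person)
print(dilated)               # (84.0, 18.0, 311.0, 472.0)
print(warp_scales(dilated))  # (1.0, 0.5)
```

The two scale factors differ (1.0 horizontally, 0.5 vertically), which is exactly the aspect-ratio distortion the text describes: the tall box is squished to half its height.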

Step 2: Forward pass through AlexNet

The warped region is mean-subtracted and passed through AlexNet's five convolutional layers and two fully connected layers. The output of the second fully connected layer (fc7) is a 4096-dimensional feature vector. This is the representation R-CNN uses for classification.

Warp & Extract: Region to Feature Vector

Each region proposal (any shape) is warped to 227×227 and passed through AlexNet. The fc7 layer outputs a 4096-dim feature vector.

What makes CNN features so much better than HOG?

HOG captures oriented edge histograms, essentially first-order gradient statistics in local patches. That's all it can represent. A CNN learns hierarchical features, with each layer building on the output of the one before it.

The paper's ablation study showed that fine-tuning improves fc6 and fc7 much more than pool5. That suggests the convolutional features are fairly general, and most of the fine-tuning gain comes from learning domain-specific classifiers on top of them in the fully connected layers.

The 4096-dim feature vector: This is dramatically more compact than the features used by the most comparable prior system (UVA), which also used selective search proposals but relied on 360,000-dimensional features based on SIFT bag-of-words. R-CNN's features are roughly two orders of magnitude smaller and far more discriminative. Smaller features also mean the SVM classification step is nearly instantaneous.

Why does R-CNN use fc7 features instead of HOG features?

Chapter 5: Classification & Regression

Given a 4096-dimensional feature vector for each region proposal, R-CNN needs to (a) decide what object class it is (if any), and (b) refine the bounding box location.

Per-class SVMs

R-CNN trains one linear SVM per object class. For PASCAL VOC with 20 classes, that's 20 binary SVMs. Each SVM answers: "Does this region contain a [car / dog / person / ...]?"

At test time, all SVMs score all regions simultaneously via a single matrix multiplication: scores = features × W, where features is 2000×4096 and W is 4096×20. This produces a 2000×20 score matrix — every region scored against every class — in milliseconds.
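
As a toy illustration of scores = features × W, the sketch below multiplies a tiny feature matrix by a tiny weight matrix in plain Python and then counts the multiply-adds at the full 2000×4096×20 scale. The numbers in the small matrices are invented, SVM bias terms are omitted, and a real implementation would use an optimized matrix library rather than nested loops.

```python
def score_regions(features, weights):
    """Multiply an R x D feature matrix by a D x C weight matrix.

    Both arguments are lists of lists; the result is an R x C score matrix.
    """
    num_classes = len(weights[0])
    return [
        [sum(f * w_row[c] for f, w_row in zip(row, weights))
         for c in range(num_classes)]
        for row in features
    ]


features = [[1.0, 0.0, 2.0],   # region 0
            [0.0, 1.0, 1.0]]   # region 1
weights = [[1.0, -1.0],        # one column per class
           [0.5, 0.5],
           [0.0, 2.0]]
scores = score_regions(features, weights)
print(scores)  # [[1.0, 3.0], [0.5, 2.5]]

# Work at full scale: regions x feature dims x classes.
multiply_adds = 2000 * 4096 * 20
print(multiply_adds)  # 163840000
```

Roughly 164 million multiply-adds is small next to the ~2000 CNN forward passes that produced the features, which is why this stage is not the bottleneck.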

Why SVMs instead of softmax? The paper tried using the fine-tuned CNN's softmax outputs directly but got worse results (50.9% vs 54.2% mAP on VOC 2007). The likely reasons: fine-tuning used a "loose" definition of positives (any proposal with IoU ≥ 0.5, to get enough training data and avoid overfitting), which does not emphasize precise localization, and the softmax was trained on randomly sampled negatives. SVM training instead used only ground-truth boxes as positives, treated proposals with IoU < 0.3 as negatives, and applied hard negative mining. That procedure was better suited to the detection task. This awkward split would later be eliminated by Fast R-CNN.

Bounding box regression

Even good region proposals are rarely perfectly aligned with objects. R-CNN learns a linear regressor that adjusts the proposal box to better fit the object. For each class, four regressors predict corrections to the box coordinates:

tx = (Gx − Px) / Pw    ty = (Gy − Py) / Ph
tw = ln(Gw / Pw)    th = ln(Gh / Ph)

Where P is the proposal box, G is the ground-truth box, (x, y) are box centers, w and h are width and height, and the t values are the regression targets. The regressor learns to predict these t values from the pool5 features rather than fc7; pool5 retains more spatial information, which plausibly helps localization.

The log-scale for width and height means the regressor predicts relative changes — a prediction of tw=0 means "don't change the width," regardless of the absolute box size. This makes the regression scale-invariant.
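
A short sketch of the target encoding and its inverse may help here. It assumes boxes are given as (x1, y1, x2, y2) corners and that the x and y in the formulas are box centers; the function names are hypothetical. The last lines check the scale-invariance claim by scaling both boxes by 3.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h


def encode(proposal, ground_truth):
    """Regression targets (tx, ty, tw, th) that map the proposal onto the target."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode(proposal, t):
    """Apply predicted (tx, ty, tw, th) to a proposal; inverse of encode."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = t
    cx, cy = px + pw * tx, py + ph * ty
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


proposal = (0, 0, 100, 100)
ground_truth = (10, 20, 110, 220)  # same width, twice the height
targets = encode(proposal, ground_truth)
print(targets)  # tx = 0.1, ty = 0.7, tw = 0.0, th = ln 2

# Scaling both boxes by 3 leaves the targets unchanged.
scaled = encode((0, 0, 300, 300), (30, 60, 330, 660))
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(targets, scaled)))
```

Here tw = 0 because the widths already match, and th = ln 2 because the height must double, regardless of whether the box is 100 or 300 pixels tall.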

Bounding Box Regression

The blue box is the proposal, the green box is ground truth. The regressor learns to predict corrections (tx, ty, tw, th) that shift and resize the proposal to match.

Why does R-CNN use log-scale targets for width and height in bounding box regression?

Chapter 6: Training

R-CNN's training is a four-stage process. Each stage is trained independently, which is one of the paper's acknowledged limitations.

Stage 1: ImageNet Pre-training
Train AlexNet on ILSVRC 2012 (1.2M images, 1000 classes). Standard classification training with softmax cross-entropy loss.
Stage 2: Domain-Specific Fine-Tuning
Replace the 1000-way fc8 with (N+1)-way fc8 (N classes + background). Fine-tune on warped region proposals from detection dataset. LR = 0.001 (1/10 of pre-training). Positive: IoU ≥ 0.5. Mini-batch: 32 positives + 96 negatives = 128.
Stage 3: SVM Training
Extract fc7 features for all regions. Train one linear SVM per class using hard negative mining. Positive: ground-truth boxes only. Negative: IoU < 0.3. (Note: different threshold than fine-tuning!)
Stage 4: Bbox Regressor Training
Train class-specific linear regressors on pool5 features. Only train on proposals with IoU ≥ 0.6 (close proposals only — the regressor is meant for refinement, not large corrections).

Why the inconsistent IoU thresholds?

This is one of the messiest parts of R-CNN, and the paper is transparent about it. Fine-tuning uses IoU ≥ 0.5 for positives because with limited data, using only ground-truth boxes (IoU = 1.0) would give too few positive examples and the network would overfit. For SVM training, positives are ground-truth boxes only and negatives are proposals with IoU < 0.3; proposals in between are ignored. The paper found that the CNN's own softmax outputs (trained with the 0.5 threshold) gave inferior results to fresh SVMs trained with these stricter labels and hard negative mining.
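
The two labeling rules are easier to compare side by side. The sketch below encodes them as two small functions, using the thresholds from the training stages listed earlier. It assumes the proposal's maximum IoU with any ground-truth box of the class has already been computed, and that SVM proposals that are neither ground truth nor clear negatives are left out of training; the function names are hypothetical.

```python
def finetune_label(max_iou, gt_class, pos_threshold=0.5):
    """Class index used for CNN fine-tuning; 0 means background."""
    return gt_class if max_iou >= pos_threshold else 0


def svm_label(max_iou, is_ground_truth, neg_threshold=0.3):
    """Label for one class's SVM: +1, -1, or None (not used for training)."""
    if is_ground_truth:
        return 1
    if max_iou < neg_threshold:
        return -1
    return None


# The same proposal can be treated very differently by the two stages.
for max_iou in (0.6, 0.4, 0.1):
    print(max_iou, finetune_label(max_iou, gt_class=7), svm_label(max_iou, False))
```

A proposal with IoU 0.6 is a positive for fine-tuning but is not a positive for the SVM, and a proposal with IoU 0.4 is background for fine-tuning but is not a negative for the SVM. That mismatch is the inconsistency this section describes.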

The four-stage elephant in the room: Each stage has its own loss function, its own definition of positive/negative, and its own hyperparameters. This makes R-CNN complex and slow to train. The features from Stage 2 can't be updated based on errors in Stage 3 or 4 — there's no gradient flowing backward through the whole system. Fast R-CNN (Girshick, 2015) will solve this by replacing SVMs with a softmax layer and training the CNN, classifier, and regressor jointly in a single stage.

Hard negative mining

For SVM training, the negative examples (background regions) vastly outnumber the positives. R-CNN uses hard negative mining: train the SVM, then re-score all negatives. The ones the SVM gets wrong (false positives) are the "hard negatives" — the most confusing background regions. Add these to the training set, retrain. This focuses the SVM on the decision boundary where it matters most. In practice, one pass through the data is sufficient.
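
The train, re-score, add, retrain loop can be shown on a deliberately tiny problem. The sketch below swaps the real 4096-dimensional linear SVM for a one-dimensional hard-margin separator, whose solution is just the midpoint between the closest positive and the closest negative. It assumes all positives lie above all negatives; the data and function names are invented for the example.

```python
def train_1d_svm(positives, negatives):
    """Hard-margin separator in 1-D: midpoint between the closest pos and neg.

    Stand-in for a real linear SVM; assumes every positive exceeds every negative.
    """
    return (min(positives) + max(negatives)) / 2.0


def mine_hard_negatives(positives, negative_pool, initial_count=2, max_rounds=10):
    """Grow the negative working set with false positives until none remain."""
    working = list(negative_pool[:initial_count])
    threshold = train_1d_svm(positives, working)
    for _ in range(max_rounds):
        # Hard negatives: background examples the current model scores as positive.
        hard = [x for x in negative_pool if x >= threshold and x not in working]
        if not hard:
            break
        working.extend(hard)
        threshold = train_1d_svm(positives, working)
    return threshold, working


positives = [5.0, 6.0]
negative_pool = [0.0, 1.0, 2.0, 3.0, 4.0, 0.5]
threshold, working = mine_hard_negatives(positives, negative_pool)
print(threshold)  # 4.5
print(working)    # [0.0, 1.0, 3.0, 4.0]
```

Starting from two easy negatives, the first model puts the boundary at 3.0 and misclassifies 3.0 and 4.0. Adding those moves the boundary to 4.5, after which no false positives remain. The easy negatives 2.0 and 0.5 are never needed, which is the point: only the confusing examples shape the decision boundary.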

Why does R-CNN train SVMs separately instead of using the fine-tuned CNN's softmax output directly?

Chapter 7: Results

R-CNN's results were a paradigm shift. On PASCAL VOC 2010, R-CNN with bounding box regression achieved 53.7% mAP, compared to 35.1% for UVA, the prior system that used the same region proposals with hand-crafted features. That's a relative improvement of over 50%. On VOC 2012, R-CNN reached a similar 53.3% mAP.

R-CNN vs Prior Art on PASCAL VOC 2010

Mean Average Precision (mAP) comparison. R-CNN with bounding box regression (BB) set a new state of the art.

Key results

Ablation: What matters most?

The paper ran careful ablation experiments on VOC 2007:

The pool5 finding is stunning: With just 5 convolutional layers and no fully connected layers (only about 6% of the CNN parameters), the features already beat the best hand-crafted features. This suggests that CNN features aren't better because of more parameters, but because the hierarchical learned representation captures visual structure that HOG cannot.

Per-class results

R-CNN improved on nearly every class, but the gains were especially dramatic for classes with high visual complexity — animals (dog: 17.8 → 70.0), vehicles (car: 49.7 → 60.0), and articulated objects (person: 47.7 → 58.1). Classes that are already distinctive shapes (bottle, chair) saw smaller improvements.

What did the ablation study reveal about where R-CNN's improvement comes from?

Chapter 8: Limitations

R-CNN was a breakthrough, but it has serious practical limitations that subsequent papers would address.

1. Painfully slow inference

Running the CNN ~2000 times per image takes 13-53 seconds depending on hardware. Each region is processed independently, even though neighboring proposals share most of their pixels. This massive redundancy is the core bottleneck.

2. Multi-stage training

Four separate training stages (pre-training, fine-tuning, SVM training, bbox regressor) with inconsistent positive/negative definitions. Training is complex, slow, and inelegant. Errors in later stages can't improve earlier ones.

3. Feature storage

To train the SVMs via hard negative mining, you need to extract and store features for every region proposal in every training image. For VOC 2007 (5k images × 2000 regions × 4096 floats), that's ~150 GB of feature data on disk.
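
The storage estimate follows from simple arithmetic, reproduced below. It assumes 32-bit floats (4 bytes each), which the text does not state explicitly, and uses the rounded counts of 5,000 images and 2,000 regions per image.

```python
def feature_storage_bytes(images=5000, regions_per_image=2000,
                          feature_dims=4096, bytes_per_float=4):
    """Bytes needed to cache one feature vector per region (float32 assumed)."""
    return images * regions_per_image * feature_dims * bytes_per_float


total_bytes = feature_storage_bytes()
print(total_bytes)            # 163840000000
print(total_bytes / 1e9)      # 163.84 (decimal gigabytes)
print(total_bytes / 2 ** 30)  # 152.587890625 (binary gibibytes)
```

Depending on whether you count in decimal or binary units, that is about 164 GB or about 153 GiB, in line with the ~150 GB figure above.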

4. No end-to-end training

The selective search module is fixed — it can't be improved by the CNN's feedback. The CNN features can't be updated based on classification errors. The system is a pipeline of independently optimized components, not a jointly optimized whole.

5. Warping distortion

Anisotropic warping distorts aspect ratios. A tall, thin person gets squished. A wide car gets stretched. The CNN sees distorted inputs and must learn to be invariant to this distortion — a wasted capacity that could be used for more useful invariances.

R-CNN Computational Bottleneck

Each of ~2000 regions runs through the full CNN independently — massive redundant computation on overlapping pixels. This is what Fast R-CNN and Faster R-CNN will fix.

The key insight for the next generation: Most of the CNN computation is shared across regions. If you run the CNN once on the entire image to get a feature map, then crop features from that shared map for each region, you avoid all the redundancy. That's exactly what SPP-net and Fast R-CNN will do. With the deeper VGG16 network, Fast R-CNN reports cutting per-image inference from 47 seconds for R-CNN to about 0.3 seconds, not counting proposal time.

What is the single biggest computational waste in R-CNN's design?

Chapter 9: Connections

What R-CNN built on

HOG / SIFT / DPM (Dalal & Triggs 2005, Lowe 2004, Felzenszwalb et al. 2010): The hand-crafted feature paradigm that R-CNN replaced. HOG captures oriented edge histograms; DPM extends this with deformable parts. These were the best features for a decade, but their ceiling was about 33% mAP on VOC.

AlexNet (Krizhevsky et al., 2012): The CNN that won ILSVRC 2012 and proved deep learned features can dramatically outperform hand-crafted ones — but only for classification. R-CNN showed the features transfer to detection.

Selective Search (Uijlings et al., 2013): The region proposal algorithm R-CNN uses. It generates category-independent proposals via hierarchical grouping, enabling the "recognition using regions" paradigm.

What R-CNN directly enabled

SPP-net (He et al., 2014): Introduced spatial pyramid pooling to extract features from arbitrary-sized regions on a shared CNN feature map, eliminating the need to run the CNN per-region. Reported as roughly 10x to 100x faster than R-CNN at test time.

Fast R-CNN (Girshick, 2015): Unified the CNN, classifier, and bbox regressor into a single jointly-trained network. RoI pooling extracts features from the shared feature map. Replaced SVMs with softmax. 200x faster than R-CNN, single-stage training.

Faster R-CNN (Ren et al., 2015): Replaced selective search with a Region Proposal Network (RPN), a small CNN that shares features with the detector and generates proposals in ~10ms. The full pipeline runs at about 5 fps, bringing the R-CNN approach close to real time.

The broader lineage

YOLO (Redmon et al., 2016): Abandoned the proposal-then-classify paradigm entirely. A single CNN predicts boxes and classes in one forward pass, enabling 45 fps real-time detection. Sacrificed some accuracy for dramatic speed gains.

Feature Pyramid Networks / FPN (Lin et al., 2017): Multi-scale feature extraction that handles objects at different sizes. Built into many modern detectors.

DETR (Carion et al., 2020): End-to-end detection with Transformers — no proposals, no NMS, no anchors. The logical endpoint of the "simplify the pipeline" trajectory that R-CNN started.

R-CNN's legacy: R-CNN didn't just beat HOG; it ended the hand-crafted feature era in object detection. Virtually every competitive detection system since 2014 uses learned features. The paper also helped establish transfer learning as standard practice: pre-train on ImageNet, fine-tune on your task. This paradigm has been inherited by modern Vision Transformers (ViTs) and even large language models. R-CNN was the ImageNet moment for object detection.

Cheat sheet

Core idea
Selective search proposals → warp to 227×227 → CNN features (fc7, 4096-dim) → per-class SVMs + bbox regression
Key numbers
~2000 proposals, 4096-dim features, 53.7% mAP on VOC 2010 (vs 35.1% for UVA), 53.3% mAP on VOC 2012, 13-53s per image
Training
4 stages: ImageNet pre-train → detection fine-tune → SVM training → bbox regressor
Contribution
Proved CNN features crush hand-crafted features for detection; established transfer learning for vision tasks
Lineage
HOG/DPM → R-CNN → SPP-net → Fast R-CNN → Faster R-CNN → FPN → DETR
How did Fast R-CNN solve R-CNN's main computational bottleneck?