R-CNN — Veanors

Chapter 0: The Problem

By 2012, object detection had hit a wall. The best systems on the PASCAL VOC benchmark were getting maybe 1-2% better per year, assembling increasingly baroque ensembles of hand-crafted features — HOG descriptors, SIFT keypoints, deformable part models (DPMs), spatial pyramids. The state of the art on VOC 2012 was around 33-35% mAP.

Meanwhile, something dramatic had happened in image classification. At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in September 2012, Alex Krizhevsky's CNN — AlexNet — smashed the competition, cutting the top-5 error rate from 26% to 15%. A deep neural network had learned features that crushed hand-crafted ones on the classification task.

But classification and detection are fundamentally different problems. Classification answers "what is in this image?" Detection answers "what objects are in this image, and where are they?" You need to both recognize AND localize, potentially many objects per image.

The central question of this paper: Can the CNN features that revolutionized image classification also work for object detection? And if so, how do you bridge the gap between a classifier (one label per image) and a detector (many boxes per image, each with a label and location)?

The leading detection approach at the time — the Deformable Part Model (DPM) — used multi-scale HOG features fed into an SVM with sliding windows and part-based reasoning. It was elegant but fundamentally limited by the expressiveness of HOG. HOG computes local orientation histograms — essentially edge statistics — and that's all it can represent. No amount of clever engineering on top of HOG could learn the rich, hierarchical features that a deep CNN learns automatically.

The question wasn't academic. Object detection is the gateway to visual understanding: autonomous driving, robotics, medical imaging, surveillance — all depend on knowing what is where. Breaking the HOG ceiling would change the field.

Why couldn't the AlexNet revolution in classification be directly applied to object detection?

Detection requires both recognizing AND localizing potentially many objects per image. A classifier only gives one label per image, so you need additional mechanisms for localization
CNNs are too slow for detection
AlexNet was not accurate enough for detection

Chapter 1: The Key Insight

R-CNN's answer is deceptively simple: what if you just propose a bunch of candidate regions in the image, crop each one out, resize it, and run a CNN classifier on each crop independently?

That's it. That's the whole idea. Three concepts fused together:

Region proposals — use an existing algorithm (selective search) to generate ~2000 candidate bounding boxes that might contain objects
CNN feature extraction — warp each proposed region to 227×227 pixels and run it through AlexNet to get a 4096-dimensional feature vector
Linear classification — train one SVM per object class on the CNN features

Why this is brilliant: Instead of trying to redesign the CNN architecture for detection (which nobody knew how to do in 2013), R-CNN reduces detection to a series of classification problems. Each region proposal is treated as a mini classification task: "Is there a dog in this crop? A car? A person?" The CNN doesn't need to know it's doing detection — it just classifies crops.

The second key contribution is transfer learning. In 2013, training a deep CNN from scratch on a small detection dataset like PASCAL VOC (a few thousand images) was hopeless — the network would massively overfit. R-CNN showed that you could:

Pre-train the CNN on ImageNet (1.2 million images, 1000 classes)
Fine-tune the last layers on your detection dataset (replacing the 1000-way classifier with an (N+1)-way classifier for N object classes + background)

This "supervised pre-training + domain-specific fine-tuning" paradigm boosted mAP by 8 percentage points. Today we call this transfer learning and it's standard practice — but in 2013 it was a novel contribution.

What are the two key insights that make R-CNN work?

(1) Reduce detection to classification by running a CNN on each region proposal independently, and (2) use transfer learning: pre-train on ImageNet, fine-tune on detection data
(1) Use a sliding window and (2) train end-to-end
(1) Use deeper networks and (2) more data augmentation

Chapter 2: The R-CNN Pipeline

R-CNN processes an image in four stages. Each stage is a separate module — this is a pipeline, not an end-to-end system.

Stage 1: Input

Take the input image (any size)

↓

Stage 2: Propose

Run selective search → ~2000 candidate bounding boxes

↓

Stage 3: Extract

Warp each region to 227×227, run through CNN → 4096-dim feature vector per region

↓

Stage 4: Classify

Score each feature vector with per-class SVMs + bounding box regression → final detections

At test time, the main computational bottleneck is Stage 3: running the CNN ~2000 times per image. On a GPU this takes about 13 seconds per image; on a CPU, 53 seconds. The SVM classification in Stage 4 is nearly instant — just a matrix multiply of the 2000×4096 feature matrix with the 4096×N weight matrix.
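The four stages can be sketched as one function that wires together separately supplied components. This is a minimal sketch, not the authors' code: `propose`, `extract`, `score`, and `refine` are hypothetical callables standing in for selective search, the warped-crop CNN, the per-class SVMs, and the box regressor.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def rcnn_detect(
    image,
    propose: Callable[[object], List[Box]],
    extract: Callable[[object, Box], Sequence[float]],
    score: Callable[[Sequence[float]], Sequence[float]],
    refine: Callable[[Box, Sequence[float], int], Box],
    threshold: float = 0.0,
):
    """Run the stages in order; every callable is a hypothetical stand-in."""
    detections = []
    for box in propose(image):                 # Stage 2: candidate boxes
        feature = extract(image, box)          # Stage 3: one CNN pass per box
        class_scores = score(feature)          # Stage 4: one score per class
        for cls, s in enumerate(class_scores):
            if s > threshold:
                detections.append((refine(box, feature, cls), cls, s))
    return detections  # per-class NMS would follow as a separate step
```

The loop makes the cost structure visible: `extract` is called once per proposal, which is why Stage 3 dominates the runtime.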

The R-CNN Pipeline

The four-stage detection pipeline. Each region proposal is independently cropped, warped, and classified.

Non-maximum suppression (NMS)

After scoring all ~2000 regions, many overlapping boxes will fire on the same object. R-CNN applies greedy non-maximum suppression per class: sort detections by score, accept the top one, then reject any remaining detection whose IoU (intersection over union) with an accepted detection exceeds a threshold. This collapses the cloud of overlapping boxes into a single tight box per object.
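The greedy procedure described above can be sketched in a few lines. This is an illustrative version for a single class; the default `iou_threshold` of 0.3 is an assumption chosen for the example, since the text only says "a threshold".

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS for one class; returns indices of kept boxes, best first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Accept a box only if it does not overlap an already accepted one too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

Here the second box overlaps the first heavily and is suppressed, while the third box is far away and survives.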

The pipeline nature is key: Each module is trained separately — the region proposer, the CNN feature extractor, the SVMs, and the bounding box regressor are all optimized independently. This makes the system modular but also means errors can't propagate backward. A later paper (Fast R-CNN) will unify Stages 3 and 4 into a single trainable network.

Why does R-CNN need non-maximum suppression (NMS)?

Many overlapping region proposals will fire on the same object. NMS keeps the highest-scoring box and removes boxes with high IoU overlap, collapsing duplicates into one detection
To make the CNN faster
To reduce the number of region proposals before CNN processing

Chapter 3: Region Proposals

The first module in R-CNN generates category-independent region proposals — bounding boxes that might contain any object. R-CNN uses selective search, an algorithm from Uijlings et al. (2013).

How selective search works

Selective search is a bottom-up grouping algorithm. It starts with a fine-grained oversegmentation of the image (using Felzenszwalb's algorithm), then iteratively merges similar neighboring regions:

Oversegment — break the image into thousands of tiny regions based on pixel similarity
Compute similarity — for each pair of neighboring regions, compute similarity based on color, texture, size, and fill (how well regions fit together)
Merge greedily — merge the most similar pair, recompute similarities for the new merged region
Collect boxes — at every level of the merging hierarchy, record the bounding box of each merged region
Output ~2000 proposals — the union of bounding boxes at all scales
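The merging loop above can be sketched on toy data. This is a heavily simplified illustration, not the real algorithm: each region is a hypothetical dict with a bounding box, a pixel count, and a single mean intensity standing in for the color histogram; texture is omitted; neighbors are approximated by touching bounding boxes; and all similarities are recomputed on every pass rather than only for the new region.

```python
from itertools import combinations

def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def touching(a, b):
    """True if two (x1, y1, x2, y2) boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def similarity(r, s, image_area):
    color = 1.0 - abs(r["color"] - s["color"])         # stand-in for histogram similarity
    size = 1.0 - (r["size"] + s["size"]) / image_area  # favors merging small regions first
    bb = union_box(r["box"], s["box"])
    bb_area = (bb[2] - bb[0]) * (bb[3] - bb[1])
    fill = 1.0 - (bb_area - r["size"] - s["size"]) / image_area  # favors snug fits
    return color + size + fill

def merge(r, s):
    total = r["size"] + s["size"]
    color = (r["color"] * r["size"] + s["color"] * s["size"]) / total
    return {"box": union_box(r["box"], s["box"]), "size": total, "color": color}

def hierarchical_grouping(regions, image_area):
    """Greedily merge the most similar neighboring pair, recording every box."""
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        pairs = [(i, j) for i, j in combinations(range(len(regions)), 2)
                 if touching(regions[i]["box"], regions[j]["box"])]
        if not pairs:
            break
        i, j = max(pairs, key=lambda p: similarity(regions[p[0]], regions[p[1]], image_area))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals

toy = [
    {"box": (0, 0, 2, 2), "size": 4, "color": 0.10},
    {"box": (2, 0, 4, 2), "size": 4, "color": 0.12},
    {"box": (4, 0, 6, 2), "size": 4, "color": 0.90},
]
proposals = hierarchical_grouping(toy, image_area=12)
```

In the toy input, the two similarly colored regions merge first, then the result merges with the third, so the proposal list contains boxes at three scales.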

Selective Search: Hierarchical Grouping

Regions merge from fine to coarse. Each merge level produces candidate bounding boxes.

Why selective search and not sliding windows? A sliding window at all positions, scales, and aspect ratios produces millions of candidates — far too many for expensive CNN processing. Selective search is smarter: it uses image structure (edges, color, texture) to propose only ~2000 plausible regions. This is 1000x fewer candidates while capturing 95%+ of actual objects. The "proposal" strategy trades exhaustive coverage for computational tractability.
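To get a feel for the "millions of candidates" claim, here is a rough counting sketch. The image size, scales, aspect ratios, and stride are all illustrative assumptions, not values from the paper; the total changes a lot with those choices.

```python
def count_windows(img_w, img_h, scales, aspect_ratios, stride):
    """Count sliding-window placements over all scales and aspect ratios."""
    total = 0
    for s in scales:                  # s is the square root of the window area
        for ar in aspect_ratios:      # ar is width divided by height
            w = round(s * ar ** 0.5)
            h = round(s / ar ** 0.5)
            if w > img_w or h > img_h:
                continue              # window does not fit in the image
            nx = (img_w - w) // stride + 1
            ny = (img_h - h) // stride + 1
            total += nx * ny
    return total

# Hypothetical setup: a 500x375 image, 4 scales, 3 aspect ratios, dense stride.
dense = count_windows(500, 375, [32, 64, 128, 256], [0.5, 1.0, 2.0], stride=1)
print(dense, "windows vs ~2000 selective search proposals")
```

Even this modest grid of scales and ratios yields over a million windows at stride 1, each of which would need its own CNN forward pass.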

Why ~2000 proposals?

The paper found that 2000 proposals achieve very high recall — the fraction of ground-truth objects that overlap with at least one proposal at IoU > 0.5. Going beyond 2000 gives diminishing returns while linearly increasing CNN computation. Going below 1000 risks missing objects.
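Recall as defined above is straightforward to compute. A minimal sketch, assuming boxes are `(x1, y1, x2, y2)` tuples:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall(gt_boxes, proposals, iou_threshold=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for g in gt_boxes
        if any(iou(g, p) > iou_threshold for p in proposals)
    )
    return hits / len(gt_boxes)

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
r = recall(gt, proposals=[(1, 1, 10, 10)])
```

In this example one of the two objects is covered, so `r` is 0.5. An object that no proposal covers can never be detected, which is why proposal recall caps the detector's performance.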

Why does R-CNN use selective search instead of sliding windows for region proposals?

Sliding windows at all positions/scales/ratios produce millions of candidates, too many for expensive CNN processing. Selective search uses image structure to propose only ~2000 plausible regions while maintaining high recall.
Sliding windows can't detect small objects
Selective search is more accurate than sliding windows

Chapter 4: CNN Feature Extraction

For each of the ~2000 region proposals, R-CNN needs to extract a feature vector. Here's the process:

Step 1: Warp to fixed size

The CNN (AlexNet) requires a fixed 227×227 pixel input. Region proposals come in all shapes and sizes — a tall, narrow person; a wide, flat bus; a tiny bottle. R-CNN handles this with the simplest possible approach: anisotropic warping. It takes the bounding box, adds 16 pixels of context padding, and stretches (or squishes) the result to exactly 227×227. This distorts the aspect ratio, but the CNN learns to handle it.
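A minimal sketch of the warp, under two assumptions that go beyond the text: the 16 pixels of context are measured in the warped 227×227 frame (so the tight box fills the inner 195×195), and pixels that fall outside the image get a constant `fill` value. Nearest-neighbor sampling on a nested list keeps the example dependency-free; a real pipeline would use an image library.

```python
import math

def padded_box(box, out_size=227, pad=16):
    """Expand (x1, y1, x2, y2) so the tight box fills out_size - 2*pad pixels after warping."""
    x1, y1, x2, y2 = box
    inner = out_size - 2 * pad        # pixels the tight box occupies after warping
    px = (x2 - x1) * pad / inner      # horizontal context, in source pixels
    py = (y2 - y1) * pad / inner      # vertical context, in source pixels
    return (x1 - px, y1 - py, x2 + px, y2 + py)

def warp_nearest(image, box, out_size=227, fill=0):
    """Anisotropically resample `box` from a 2-D list image to out_size x out_size."""
    x1, y1, x2, y2 = box
    h, w = len(image), len(image[0])
    sx = (x2 - x1) / out_size         # x and y are scaled independently,
    sy = (y2 - y1) / out_size         # so the aspect ratio is not preserved
    out = []
    for r in range(out_size):
        src_y = math.floor(y1 + (r + 0.5) * sy)
        row = []
        for c in range(out_size):
            src_x = math.floor(x1 + (c + 0.5) * sx)
            inside = 0 <= src_x < w and 0 <= src_y < h
            row.append(image[src_y][src_x] if inside else fill)
        out.append(row)
    return out

# A 4x2 image squeezed into 2x2: the wide region loses half its columns.
tiny = warp_nearest([[1, 2, 3, 4], [5, 6, 7, 8]], (0, 0, 4, 2), out_size=2)
```

Typical use would be `warp_nearest(image, padded_box(proposal))`, giving a 227×227 crop regardless of the proposal's shape.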

Step 2: Forward pass through AlexNet

The warped region is mean-subtracted and passed through AlexNet's five convolutional layers and two fully connected layers. The output of the second fully connected layer (fc7) is a 4096-dimensional feature vector. This is the representation R-CNN uses for classification.
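To see how a 227×227 input shrinks on its way to the fully connected layers, the spatial sizes can be traced with the usual output-size formula. The kernel, stride, and padding values below are the commonly cited AlexNet configuration and are an assumption here, since the text does not list them.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer on an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, output channels); pooling keeps the channel count.
LAYERS = [
    ("conv1", 11, 4, 0, 96), ("pool1", 3, 2, 0, 96),
    ("conv2", 5, 1, 2, 256), ("pool2", 3, 2, 0, 256),
    ("conv3", 3, 1, 1, 384), ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256), ("pool5", 3, 2, 0, 256),
]

def trace(input_size=227):
    """Return (name, channels, height, width) after each layer."""
    shapes, n = [], input_size
    for name, k, s, p, ch in LAYERS:
        n = out_size(n, k, s, p)
        shapes.append((name, ch, n, n))
    return shapes

for name, ch, h, w in trace():
    print(f"{name}: {ch} x {h} x {w}")
```

Under these assumptions pool5 ends up as 256 × 6 × 6 = 9216 values, which the two fully connected layers then map to the 4096-dimensional fc7 vector.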

Warp & Extract: Region to Feature Vector

Each region proposal (any shape) is warped to 227×227 and passed through AlexNet. The fc7 layer outputs a 4096-dim feature vector.

What makes CNN features so much better than HOG?

HOG captures oriented edge histograms — essentially first-order gradient statistics in local patches. That's all it can represent. A CNN learns hierarchical features:

Layer 1: edges and color blobs (similar to HOG, actually)
Layer 2: textures, corners, simple patterns
Layer 3: object parts — wheels, eyes, fur textures
Layer 4: more complex parts and spatial arrangements
Layer 5 / fc6-fc7: whole-object representations and compositions

The paper's ablation study showed that most of the gain from fine-tuning comes from layers fc6 and fc7, the fully connected layers that encode semantic meaning far beyond anything HOG can capture.

The 4096-dim feature vector: This is dramatically more compact than the features used by the previous best system (UVA). UVA used 360,000-dimensional features based on SIFT bag-of-words. R-CNN's features are 100x smaller and far more discriminative. Smaller features also mean the SVM classification step is nearly instantaneous.

Why does R-CNN use fc7 features instead of HOG features?

CNN features encode a learned hierarchy, from edges to textures to object parts to whole objects, that is far more discriminative than HOG's hand-crafted edge histograms, while also being 100x more compact
fc7 features are faster to compute
HOG features require more memory

Chapter 5: Classification & Regression

Given a 4096-dimensional feature vector for each region proposal, R-CNN needs to (a) decide what object class it is (if any), and (b) refine the bounding box location.

Per-class SVMs

R-CNN trains one linear SVM per object class. For PASCAL VOC with 20 classes, that's 20 binary SVMs. Each SVM answers: "Does this region contain a [car / dog / person / ...]?"

At test time, all SVMs score all regions simultaneously via a single matrix multiplication: scores = features × W, where features is 2000×4096 and W is 4096×20. This produces a 2000×20 score matrix — every region scored against every class — in milliseconds.
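The scoring step is just a matrix product. Here is a dependency-free sketch with tiny dimensions; the per-class bias term is an assumption added for completeness (the text only mentions `features × W`), and at the real 2000×4096 size you would use an optimized linear algebra library instead of nested loops.

```python
def score_regions(features, weights, biases):
    """features: R x D, weights: D x C, biases: length C. Returns an R x C score matrix."""
    n_classes = len(biases)
    return [
        [
            sum(f * w_row[c] for f, w_row in zip(feat, weights)) + biases[c]
            for c in range(n_classes)
        ]
        for feat in features
    ]

# 2 regions, 2-dim features, 2 classes (stand-ins for 2000 x 4096 and 4096 x 20).
scores = score_regions(
    features=[[1.0, 2.0], [0.0, 1.0]],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[0.0, -1.0],
)
```

Row `i` of `scores` holds region `i`'s score for every class, so per-class NMS can then work on one column at a time.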

Why SVMs instead of softmax? The paper tried using the fine-tuned CNN's softmax outputs directly but got worse results (50.9% vs 54.2% mAP on VOC 2007). The key reason: fine-tuning used a "loose" definition of positives, any proposal with IoU ≥ 0.5 (to avoid overfitting with limited data), and sampled negatives at random. SVM training instead used only ground-truth boxes as positives, treated proposals with IoU < 0.3 as negatives, and applied hard negative mining. The SVM training procedure was better matched to the detection task. This awkward split would later be eliminated by Fast R-CNN.

Bounding box regression

Even good region proposals are rarely perfectly aligned with objects. R-CNN learns a linear regressor that adjusts the proposal box to better fit the object. For each class, four regressors predict corrections to the box coordinates:

t_x = (G_x − P_x) / P_w
t_y = (G_y − P_y) / P_h
t_w = ln(G_w / P_w)
t_h = ln(G_h / P_h)

Where P is the proposal box, G is the ground-truth box, and the t values are the regression targets. The regressor learns to predict these t values from the pool5 features (not fc7 — the paper found pool5 worked better for localization since it retains more spatial information).

The log-scale for width and height means the regressor predicts relative changes — a prediction of t_w=0 means "don't change the width," regardless of the absolute box size. This makes the regression scale-invariant.
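The targets and their inverse can be written out directly from the four formulas. A small sketch, assuming boxes are given as `(x1, y1, x2, y2)` corners and that the x and y in the formulas refer to box centers:

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h

def encode(proposal, gt):
    """Regression targets (t_x, t_y, t_w, t_h) that map proposal P onto ground truth G."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def decode(proposal, t):
    """Apply predicted corrections to a proposal and return the refined corner box."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = t
    cx, cy = px + pw * tx, py + ph * ty
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

small = encode((0, 0, 10, 10), (2, 0, 12, 20))
large = encode((0, 0, 100, 100), (20, 0, 120, 200))
```

`small` and `large` come out the same even though the second pair of boxes is ten times bigger, which is the scale invariance the log-scale and the division by P_w and P_h are meant to provide.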

Bounding Box Regression

The blue box is the proposal, the green box is ground truth. The regressor learns to predict corrections (t_x, t_y, t_w, t_h) that shift and resize the proposal to match. The figure shows how proposal quality (IoU with the ground truth) affects correction magnitude.

Why does R-CNN use log-scale targets for width and height in bounding box regression?

Log-scale makes the regression targets relative: t_w=0 always means "don't change width" regardless of absolute box size, making the regressor scale-invariant
Log-scale is computationally cheaper
Log-scale prevents negative widths

Chapter 6: Training

R-CNN's training is a four-stage process. Each stage is trained independently, a limitation that later work would address.

Stage 1: ImageNet Pre-training

Train AlexNet on ILSVRC 2012 (1.2M images, 1000 classes). Standard classification training with softmax cross-entropy loss.

↓

Stage 2: Domain-Specific Fine-Tuning

Replace the 1000-way fc8 with (N+1)-way fc8 (N classes + background). Fine-tune on warped region proposals from detection dataset. LR = 0.001 (1/10 of pre-training). Positive: IoU ≥ 0.5. Mini-batch: 32 positives + 96 negatives = 128.

↓

Stage 3: SVM Training

Extract fc7 features for all regions. Train one linear SVM per class using hard negative mining. Positive: ground-truth boxes only. Negative: IoU < 0.3. (Note: different threshold than fine-tuning!)

↓

Stage 4: Bbox Regressor Training

Train class-specific linear regressors on pool5 features. Only train on proposals with IoU ≥ 0.6 (close proposals only — the regressor is meant for refinement, not large corrections).

Why the inconsistent IoU thresholds?

This is one of the messiest parts of R-CNN, and the paper is transparent about it. Fine-tuning uses IoU ≥ 0.5 for positives because with limited data, using only ground-truth boxes (IoU = 1.0) would give too few positive examples and the network would overfit. But for SVM training, they found that using the CNN's own softmax outputs (trained with the 0.5 threshold) gave inferior results to training fresh SVMs with ground-truth-only positives, negatives defined as IoU < 0.3, and hard negative mining.
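Putting the three labeling rules from the training stages side by side makes the inconsistency concrete. This is a sketch: `max_iou` is a proposal's best overlap with any ground-truth box of the relevant class, class 0 as background is an illustrative convention, and treating the in-between SVM proposals as ignored is an assumption the text does not spell out.

```python
def finetune_label(max_iou, gt_class):
    """CNN fine-tuning: class of the best-overlapping ground truth if IoU >= 0.5, else background (0)."""
    return gt_class if max_iou >= 0.5 else 0

def svm_label(max_iou, is_ground_truth):
    """SVM training for one class: +1, -1, or None when the proposal is left out."""
    if is_ground_truth:
        return 1          # only ground-truth boxes are positives
    if max_iou < 0.3:
        return -1         # clear background
    return None           # neither clean positive nor clean negative (assumed ignored)

def use_for_regressor(max_iou):
    """Bounding box regressor: train only on proposals already close to an object."""
    return max_iou >= 0.6

# The same proposal (IoU 0.55 with a class-7 object) under each rule:
labels = (finetune_label(0.55, 7), svm_label(0.55, False), use_for_regressor(0.55))
```

A proposal with IoU 0.55 is a positive for fine-tuning, unused by the SVM, and skipped by the regressor, all within one system.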

The four-stage elephant in the room: Each stage has its own loss function, its own definition of positive/negative, and its own hyperparameters. This makes R-CNN complex and slow to train. The features from Stage 2 can't be updated based on errors in Stage 3 or 4 — there's no gradient flowing backward through the whole system. Fast R-CNN (Girshick, 2015) will solve this by replacing SVMs with a softmax layer and training the CNN, classifier, and regressor jointly in a single stage.

Hard negative mining

For SVM training, the negative examples (background regions) vastly outnumber the positives. R-CNN uses hard negative mining: train the SVM, then re-score all negatives. The ones the SVM gets wrong (false positives) are the "hard negatives" — the most confusing background regions. Add these to the training set, retrain. This focuses the SVM on the decision boundary where it matters most. In practice, one pass through the data is sufficient.
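The mining loop can be sketched on toy data. Everything here is illustrative: `train_linear_svm` is a tiny hinge-loss trainer written for this example (not the solver the paper used), and the `initial` and `rounds` parameters are hypothetical knobs.

```python
def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(pos, neg, epochs=200, lr=0.1, reg=0.01):
    """Tiny hinge-loss linear classifier trained by full-batch subgradient descent."""
    w, b = [0.0] * len(pos[0]), 0.0
    data = [(x, 1.0) for x in pos] + [(x, -1.0) for x in neg]
    for _ in range(epochs):
        gw, gb = [reg * wi for wi in w], 0.0
        for x, y in data:
            if y * decision(w, b, x) < 1.0:   # example violates the margin
                gw = [g - y * xi / len(data) for g, xi in zip(gw, x)]
                gb -= y / len(data)
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def mine_hard_negatives(pos, all_neg, initial=2, rounds=3):
    """Train on a few negatives, then repeatedly add the false positives and retrain."""
    cache = list(all_neg[:initial])
    w, b = train_linear_svm(pos, cache)
    for _ in range(rounds):
        hard = [x for x in all_neg if x not in cache and decision(w, b, x) > 0.0]
        if not hard:
            break                              # no background region is misclassified
        cache.extend(hard)
        w, b = train_linear_svm(pos, cache)
    return (w, b), cache

pos = [(2.0, 2.0), (3.0, 2.5)]
all_neg = [(-2.0, -2.0), (-3.0, -1.0), (0.5, 0.5), (1.0, 1.2)]
(w, b), cache = mine_hard_negatives(pos, all_neg)
```

The point of the cache is that the classifier never has to train on all background regions at once, only on the initial sample plus the ones it has gotten wrong so far.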

Why does R-CNN train SVMs separately instead of using the fine-tuned CNN's softmax output directly?

The fine-tuning used a loose IoU threshold of 0.5 for positives (to avoid overfitting), but SVMs trained with ground-truth-only positives, IoU < 0.3 negatives, and hard negative mining were better matched to the detection task, giving about 3.3 points better mAP
SVMs are inherently better classifiers than softmax
The softmax layer was too slow at test time

Chapter 7: Results

R-CNN's results were a paradigm shift. On PASCAL VOC 2010, R-CNN achieved 53.7% mAP, compared to 35.1% for UVA, which used the same region proposals but with hand-crafted features. That's a relative improvement of over 50%. On VOC 2012 it reached a similar 53.3% mAP.

R-CNN vs Prior Art on PASCAL VOC 2010

Mean Average Precision (mAP) comparison. R-CNN with bounding box regression (BB) set a new state of the art.

Key results

VOC 2010: 53.7% mAP (with BB regression) vs 40.4% for SegDPM (previous best) — a 13.3 point improvement
VOC 2010 (same region proposals): 53.7% mAP vs 35.1% for UVA
VOC 2012: 53.3% mAP, in line with the VOC 2010 result
ILSVRC 2013: 31.4% mAP vs 24.3% for OverFeat — despite OverFeat being specifically designed for detection
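Absolute gains in mAP points and relative gains in percent are easy to mix up, so here is a small helper that reports both for the SegDPM and OverFeat comparisons above. The function name is just for this example.

```python
def improvement(new, old):
    """Return (absolute gain in mAP points, relative gain in percent), rounded to 0.1."""
    return round(new - old, 1), round(100.0 * (new - old) / old, 1)

voc2010 = improvement(53.7, 40.4)   # R-CNN with BB regression vs SegDPM
ilsvrc = improvement(31.4, 24.3)    # R-CNN vs OverFeat
print(voc2010, ilsvrc)
```

The same point gain looks larger in relative terms when the baseline is lower, which is worth keeping in mind when comparing headline numbers across benchmarks.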

Ablation: What matters most?

The paper ran careful ablation experiments on VOC 2007:

fc6 features without fine-tuning: 46.2% mAP (fc7 is slightly lower at 44.7%), already far above HOG baselines, proving CNN features are inherently more discriminative
fc7 features with fine-tuning: 54.2%. Fine-tuning gives +8 points over the best non-fine-tuned result, proving transfer learning works for detection
+ bounding box regression: 58.5% — bbox regression gives another +4.3 points
pool5 features (no FC layers): 44.2% — still beats HOG! This means the convolutional layers alone, without any class-specific fully connected processing, already produce better features than years of feature engineering

The pool5 finding is stunning: With just 5 convolutional layers and no fully connected layers — only 6% of the CNN parameters — the features already beat the best hand-crafted features. This proves that CNN features aren't better because of more parameters, but because the hierarchical learned representation captures visual structure that HOG fundamentally cannot.

Per-class results

R-CNN improved on nearly every class, but the gains were especially dramatic for classes with high visual complexity — animals (dog: 17.8 → 70.0), vehicles (car: 49.7 → 60.0), and articulated objects (person: 47.7 → 58.1). Classes that are already distinctive shapes (bottle, chair) saw smaller improvements.

What did the ablation study reveal about where R-CNN's improvement comes from?

Even without fine-tuning, CNN features beat HOG baselines by a large margin, and pool5 features alone (no FC layers, 6% of parameters) still beat hand-crafted features, proving the hierarchical learned representation is the key advantage
Most of the improvement comes from bounding box regression
The SVM classifier is the main source of improvement

Chapter 8: Limitations

R-CNN was a breakthrough, but it has serious practical limitations that subsequent papers would address.

1. Painfully slow inference

Running the CNN ~2000 times per image takes 13-53 seconds depending on hardware. Each region is processed independently, even though neighboring proposals share most of their pixels. This massive redundancy is the core bottleneck.

2. Multi-stage training

Four separate training stages (pre-training, fine-tuning, SVM training, bbox regressor) with inconsistent positive/negative definitions. Training is complex, slow, and inelegant. Errors in later stages can't improve earlier ones.

3. Feature storage

To train the SVMs via hard negative mining, you need to extract and store features for every region proposal in every training image. For VOC 2007 (5k images × 2000 regions × 4096 floats), that's ~150 GB of feature data on disk.
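The storage estimate is simple arithmetic. A quick check, assuming each feature value is stored as a 4-byte float32 (the text says "floats" without giving a width):

```python
def feature_storage_bytes(images=5000, regions=2000, dims=4096, bytes_per_value=4):
    """Bytes needed to cache one feature vector per region proposal."""
    return images * regions * dims * bytes_per_value

total = feature_storage_bytes()
print(f"{total / 1e9:.0f} GB ({total / 2**30:.0f} GiB)")
```

Under that assumption the total is about 164 GB, or roughly 153 GiB, which matches the ~150 GB figure and explains why the features have to live on disk rather than in memory.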

4. No end-to-end training

The selective search module is fixed — it can't be improved by the CNN's feedback. The CNN features can't be updated based on classification errors. The system is a pipeline of independently optimized components, not a jointly optimized whole.

5. Warping distortion

Anisotropic warping distorts aspect ratios. A tall, thin person gets squished. A wide car gets stretched. The CNN sees distorted inputs and must learn to be invariant to this distortion — a wasted capacity that could be used for more useful invariances.

R-CNN Computational Bottleneck

Each of ~2000 regions runs through the full CNN independently — massive redundant computation on overlapping pixels. This is what Fast R-CNN and Faster R-CNN will fix.


The key insight for the next generation: Most of the CNN computation is shared across regions. If you run the CNN once on the entire image to get a feature map, then crop features from that shared map for each region, you avoid all the redundancy. That's exactly what SPP-net and Fast R-CNN will do. Fast R-CNN reports cutting inference from 47 seconds per image (R-CNN with the deeper VGG16 network) to about 0.3 seconds, excluding proposal time.

What is the single biggest computational waste in R-CNN's design?

Running the CNN ~2000 times per image, once per region proposal, even though overlapping regions share most of their pixels: massive redundant computation that could be avoided by computing features once on the full image
The selective search algorithm is too slow
The SVM training takes too long

Chapter 9: Connections

What R-CNN built on

HOG / SIFT / DPM (Dalal & Triggs 2005, Lowe 2004, Felzenszwalb et al. 2010): The hand-crafted feature paradigm that R-CNN replaced. HOG captures oriented edge histograms; DPM extends this with deformable parts. These were the best features for a decade, but their ceiling was about 33% mAP on VOC.

AlexNet (Krizhevsky et al., 2012): The CNN that won ILSVRC 2012 and proved deep learned features can dramatically outperform hand-crafted ones — but only for classification. R-CNN showed the features transfer to detection.

Selective Search (Uijlings et al., 2013): The region proposal algorithm R-CNN uses. It generates category-independent proposals via hierarchical grouping, enabling the "recognition using regions" paradigm.

What R-CNN directly enabled

SPP-net (He et al., 2014): Introduced spatial pyramid pooling to extract features from arbitrary-sized regions on a shared CNN feature map — eliminating the need to run the CNN per-region. 20x faster than R-CNN.

Fast R-CNN (Girshick, 2015): Unified the CNN, classifier, and bbox regressor into a single jointly-trained network. RoI pooling extracts features from the shared feature map. Replaced SVMs with softmax. 200x faster than R-CNN, single-stage training.

Faster R-CNN (Ren et al., 2015): Replaced selective search with a Region Proposal Network (RPN), a small CNN that shares features with the detector and generates proposals in ~10ms. The full pipeline runs at about 5 fps, bringing the R-CNN approach close to real-time.

The broader lineage

YOLO (Redmon et al., 2016): Abandoned the proposal-then-classify paradigm entirely. A single CNN predicts boxes and classes in one forward pass, enabling 45 fps real-time detection. Sacrificed some accuracy for dramatic speed gains.

Feature Pyramid Networks / FPN (Lin et al., 2017): Multi-scale feature extraction that handles objects at different sizes. Built into many modern detectors.

DETR (Carion et al., 2020): End-to-end detection with Transformers — no proposals, no NMS, no anchors. The logical endpoint of the "simplify the pipeline" trajectory that R-CNN started.

R-CNN's legacy: R-CNN didn't just beat HOG — it ended the hand-crafted feature era in object detection. Every detection system since 2014 uses CNN features. The paper also established transfer learning as standard practice: pre-train on ImageNet, fine-tune on your task. This paradigm has been inherited by modern Vision Transformers (ViTs) and even large language models. R-CNN was the ImageNet moment for object detection.

Cheat sheet

Core idea

Selective search proposals → warp to 227×227 → CNN features (fc7, 4096-dim) → per-class SVMs + bbox regression

Key numbers

~2000 proposals, 4096-dim features, 53.7% mAP on VOC 2010 (vs 35.1% for UVA), 53.3% on VOC 2012, 13-53s per image

Training

4 stages: ImageNet pre-train → detection fine-tune → SVM training → bbox regressor

Contribution

Proved CNN features crush hand-crafted features for detection; established transfer learning for vision tasks

Lineage

HOG/DPM → R-CNN → SPP-net → Fast R-CNN → Faster R-CNN → FPN → DETR

How did Fast R-CNN solve R-CNN's main computational bottleneck?

Instead of running the CNN ~2000 times (once per region), Fast R-CNN runs the CNN once on the entire image to produce a shared feature map, then crops features from that map for each region using RoI pooling, eliminating redundant computation
By using a faster GPU
By reducing the number of region proposals

R-CNN: Regions with CNN Features