Redmon, Divvala, Girshick, Farhadi — 2015

You Only Look Once

A single neural network predicts bounding boxes and class probabilities from full images in one evaluation — real-time object detection at 45 fps by framing detection as regression.

Prerequisites: CNNs + Object detection basics

Chapter 0: The Problem

By 2015, the dominant approach to object detection was a two-stage pipeline. First, generate thousands of region proposals — candidate bounding boxes that might contain objects. Then, run a CNN classifier on each proposal individually. This was the R-CNN family: R-CNN, Fast R-CNN, Faster R-CNN.

The problem? Speed. R-CNN needed tens of seconds per image at test time. Fast R-CNN improved things to about 0.5 fps, but it still relied on a separate region proposal step. Even Faster R-CNN, which integrated proposals into the network, ran at only about 7 fps with its most accurate model. For applications like autonomous driving or robotics, you need real-time detection: at least 30 fps, ideally more.

These two-stage detectors were also fundamentally disjointed. Each component — proposal generation, feature extraction, classification, bounding box regression, non-max suppression — was trained or tuned separately. The pipeline couldn't be optimized end-to-end for the actual goal: detecting objects.

The wish list: We want an object detector that (1) runs in real-time (30+ fps), (2) sees the entire image at once (not just local patches), (3) can be optimized end-to-end as a single network, and (4) generalizes well to new domains. YOLO delivers all four.
Why were two-stage detectors like R-CNN too slow for real-time applications?

Chapter 1: The Key Insight

YOLO's breakthrough is deceptively simple: frame object detection as a single regression problem. Instead of the detect-then-classify pipeline, a single neural network takes the entire image and directly outputs bounding box coordinates and class probabilities — all in one forward pass.

Here is the entire YOLO pipeline:

  1. Resize the input image to 448 × 448
  2. Run it through a single convolutional neural network
  3. Threshold the resulting detections by the model's confidence, then apply non-max suppression to remove duplicates

That's it. No region proposals. No separate classifiers. No multi-stage pipeline. One network, one evaluation, one set of predictions.

Why "You Only Look Once": Two-stage detectors look at an image thousands of times — once per region proposal. YOLO looks at the image exactly once. The entire image. This is where the name comes from, and it's the reason YOLO is fast.

Because YOLO sees the entire image during both training and testing, it implicitly encodes contextual information. It knows that a "person" is more likely near a "bicycle" than floating in the sky. Fast R-CNN, which only sees local patches, makes more than twice as many background false-positive errors as YOLO — it mistakes random textures for objects because it can't see the bigger picture.

The cost of this speed? Accuracy. YOLO makes more localization errors — it sometimes gets the bounding box slightly wrong. But it makes far fewer background errors. This tradeoff turns out to be favorable for many real-world applications where speed matters more than pixel-perfect boxes.

What is the fundamental difference between YOLO and two-stage detectors like R-CNN?

Chapter 2: The Grid

How does a single network predict potentially many objects at different locations? YOLO's answer: divide the image into an S × S grid.

For PASCAL VOC, S = 7, giving us 7 × 7 = 49 grid cells. The rule is simple: if the center of an object falls into a grid cell, that cell is responsible for detecting that object.

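Each grid cell makes two kinds of predictions:

• B bounding boxes, each with five values: x, y, w, h, and a confidence score
• C conditional class probabilities, Pr(Class_i | Object), shared by all boxes in the cell

The coordinates (x, y) give the center of the box relative to the bounds of the grid cell, so they range from 0 to 1 within that cell. The width and height (w, h) are relative to the whole image, also normalized to [0, 1]. The confidence score represents Pr(Object) × $\text{IOU}^{\text{truth}}_{\text{pred}}$: it encodes both whether an object is present and how well the predicted box fits the ground truth.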
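The encoding described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `encode_box` and its pixel-based inputs are assumptions made for the example.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a ground-truth box (pixel centre and size) to YOLO-style targets.

    Returns (row, col, x, y, w_norm, h_norm): (row, col) is the responsible
    grid cell, (x, y) is the centre offset inside that cell, and
    (w_norm, h_norm) is the size relative to the whole image.
    """
    # Centre position measured in grid-cell units.
    gx = cx * S / img_w
    gy = cy * S / img_h
    # The responsible cell is the one containing the centre
    # (clamped so a centre exactly on the far edge stays inside the grid).
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h


target = encode_box(cx=224, cy=160, w=112, h=224, img_w=448, img_h=448)
print(target)  # (2, 3, 0.5, 0.5, 0.25, 0.5)
```

Here the box centre lands in the middle of cell (row 2, column 3), so both offsets are 0.5, while the width and height are fractions of the full 448-pixel image.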

Why a grid? The grid enforces spatial diversity. Instead of having the network predict an arbitrary number of boxes (which would require complex sorting), we get exactly S×S×B predictions at fixed spatial locations. This is what makes the output a clean, fixed-size tensor — perfect for regression.
What determines which grid cell is responsible for detecting an object?

Chapter 3: The Prediction Tensor

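Let's make this concrete. For PASCAL VOC, YOLO uses S=7, B=2, C=20. Each grid cell predicts:

• B = 2 boxes × 5 values (x, y, w, h, confidence) = 10 values
• C = 20 conditional class probabilities

That's B×5 + C = 2×5 + 20 = 30 values per cell.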

With a 7×7 grid, the full output is a 7 × 7 × 30 tensor. That's 1,470 predictions, produced by a single forward pass through the network.

Output shape: S × S × (B × 5 + C) = 7 × 7 × 30
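To make the shape arithmetic tangible, here is a small NumPy sketch that builds a placeholder output tensor and splits it into boxes and class probabilities. The channel ordering (all box values first, then class probabilities) is an assumption for illustration; the text above fixes only the total depth.

```python
import numpy as np

S, B, C = 7, 2, 20
depth = B * 5 + C                      # 30 values per cell
output = np.zeros((S, S, depth))       # stand-in for the network output

# Assumed channel layout: [box 0 | box 1 | class probabilities].
boxes = output[..., : B * 5].reshape(S, S, B, 5)   # x, y, w, h, confidence
class_probs = output[..., B * 5 :]                 # Pr(Class_i | Object)

print(output.shape, boxes.shape, class_probs.shape, output.size)
```

The total element count, `output.size`, is the 1,470 figure quoted above.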

At test time, to get class-specific confidence scores for each box, we multiply each box's confidence by the cell's class probabilities:

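$$
\Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}
$$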
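In array terms, this multiplication is a single broadcast. The sketch below uses random numbers in place of real network outputs, and the array names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, B, C = 7, 2, 20
box_conf = rng.random((S, S, B))        # Pr(Object) * IOU, one per box
class_probs = rng.random((S, S, C))     # Pr(Class_i | Object), one set per cell

# Broadcast (S, S, B, 1) * (S, S, 1, C) -> (S, S, B, C):
# every box in a cell is scored against that cell's class probabilities.
class_scores = box_conf[..., :, None] * class_probs[..., None, :]
scores = class_scores.reshape(-1, C)    # one row of class scores per box

print(scores.shape)  # (98, 20)
```

Because the class probabilities are predicted per cell rather than per box, both boxes in a cell share the same class distribution and differ only in their confidence.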

This gives us 7 × 7 × 2 = 98 bounding boxes, each with 20 class scores. We then apply non-max suppression to eliminate duplicate detections, keeping only the most confident box for each detected object.
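Non-max suppression itself is a short greedy loop. The following is a generic per-class sketch rather than the paper's exact procedure: the corner format `(x1, y1, x2, y2)` and the 0.5 IOU threshold are assumptions chosen for the example.

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for a single class; returns indices of the boxes kept."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 110, 110],
                  [12, 8, 112, 108],
                  [200, 200, 260, 260]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first almost entirely and is suppressed, while the third box does not overlap either and survives.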

The elegance: The entire detection problem — where are the objects, what are they, and how confident are we — is encoded as a single fixed-size tensor. No variable-length lists. No sorting. Just a tensor that a CNN can directly regress to.

For YOLO on PASCAL VOC (S=7, B=2, C=20), how many values does each grid cell predict?

Chapter 4: The Loss Function

YOLO uses sum-squared error (SSE) across all outputs. Simple and easy to optimize, but raw SSE has problems. Three, specifically:

Problem 1: Most cells are empty

In a 7×7 grid, most cells don't contain an object. These cells all push their confidence toward zero, overwhelming the gradient from the few cells that do contain objects. Training becomes unstable.

Solution: Weight the confidence loss from empty cells lower. Use λnoobj = 0.5 for cells without objects, while cells with objects use weight 1.0.

Problem 2: Localization needs more weight

SSE weights localization error equally with classification error, which does not align well with the real goal of maximizing average precision.

Solution: Weight localization loss higher with λcoord = 5.

Problem 3: Large vs small boxes

A 2-pixel error in a large box barely matters. The same error in a small box is devastating for IOU. But SSE doesn't know the difference.

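Solution: Predict the square root of width and height instead of the raw values. Because √w flattens out as w grows, the same absolute error in w yields a smaller squared error for a large box than for a small one. This only partially addresses the problem, but it shifts the penalty toward small boxes, where a deviation hurts IOU most.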
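A quick numeric check makes the effect visible. The widths and the error size below are arbitrary illustrative values in image-relative units, not numbers from the paper.

```python
import math

def sq_err_raw(w_true, w_pred):
    """Squared error on the raw width."""
    return (w_true - w_pred) ** 2

def sq_err_sqrt(w_true, w_pred):
    """Squared error on the square root of the width."""
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

delta = 0.05  # the same absolute width error for both boxes
for w_true in (0.1, 0.8):
    raw = sq_err_raw(w_true, w_true + delta)
    rooted = sq_err_sqrt(w_true, w_true + delta)
    print(f"w={w_true}: raw={raw:.5f}, sqrt={rooted:.5f}")
```

The raw penalty is identical for both boxes, whereas the square-root penalty is several times larger for the small box than for the large one.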

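The full loss function has five terms:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$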

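Here $\mathbb{1}^{\text{obj}}_{ij}$ is 1 when the j-th box predictor in cell i is "responsible" for an object (it has the highest IOU with the ground truth among that cell's predictors) and 0 otherwise, $\mathbb{1}^{\text{noobj}}_{ij}$ is its complement, and $\mathbb{1}^{\text{obj}}_{i}$ is 1 when an object appears in cell i. C is the box confidence and p(c) is the class probability for class c.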
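The five terms translate fairly directly into array code. The sketch below is a NumPy illustration for a single image, not the original training code. It assumes the responsibility mask `resp` and the confidence targets have already been computed, and the argument names and tensor layout are choices made for this example.

```python
import numpy as np

def yolo_loss(pred_boxes, pred_cls, true_boxes, true_cls, resp, conf_target,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLO-style loss for one image.

    pred_boxes:  (S, S, B, 5) predicted x, y, w, h, confidence
    pred_cls:    (S, S, C)    predicted class probabilities
    true_boxes:  (S, S, 4)    ground-truth x, y, w, h (used where resp == 1)
    true_cls:    (S, S, C)    one-hot class targets
    resp:        (S, S, B)    1 where predictor j of cell i is responsible
    conf_target: (S, S, B)    confidence target for responsible predictors
    """
    noobj = 1.0 - resp
    obj_cell = resp.max(axis=-1)              # 1 if the cell contains an object
    t = true_boxes[:, :, None, :]             # broadcast the target over B

    xy_err = ((pred_boxes[..., 0:2] - t[..., 0:2]) ** 2).sum(axis=-1)
    # Clip so the square root is defined even if the raw output is negative.
    pred_wh = np.clip(pred_boxes[..., 2:4], 0.0, None)
    wh_err = ((np.sqrt(pred_wh) - np.sqrt(t[..., 2:4])) ** 2).sum(axis=-1)
    conf = pred_boxes[..., 4]
    cls_err = ((pred_cls - true_cls) ** 2).sum(axis=-1)

    return float(
        lambda_coord * (resp * xy_err).sum()
        + lambda_coord * (resp * wh_err).sum()
        + (resp * (conf - conf_target) ** 2).sum()
        + lambda_noobj * (noobj * conf ** 2).sum()   # target confidence is 0
        + (obj_cell * cls_err).sum()
    )

S, B, C = 7, 2, 20
pred_boxes, pred_cls = np.zeros((S, S, B, 5)), np.zeros((S, S, C))
true_boxes, true_cls = np.zeros((S, S, 4)), np.zeros((S, S, C))
resp, conf_target = np.zeros((S, S, B)), np.zeros((S, S, B))

# One object in cell (3, 3), assigned to predictor 0 and predicted perfectly.
true_boxes[3, 3] = [0.5, 0.5, 0.25, 0.5]
true_cls[3, 3, 11] = 1.0
resp[3, 3, 0] = conf_target[3, 3, 0] = 1.0
pred_boxes[3, 3, 0] = [0.5, 0.5, 0.25, 0.5, 1.0]
pred_cls[3, 3, 11] = 1.0
loss_perfect = yolo_loss(pred_boxes, pred_cls, true_boxes, true_cls, resp, conf_target)

# A fully confident box in an empty cell costs only lambda_noobj * 1.0.
pred_boxes[0, 0, 1, 4] = 1.0
loss_spurious = yolo_loss(pred_boxes, pred_cls, true_boxes, true_cls, resp, conf_target)
print(loss_perfect, loss_spurious)
```

The second call shows the down-weighting at work: a confident false alarm in an empty cell contributes 0.5 to the loss instead of 1.0.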

Why does YOLO predict the square root of width and height instead of the raw values?

Chapter 5: The Architecture

YOLO's architecture is inspired by GoogLeNet but simpler. Instead of inception modules, it uses alternating 1×1 and 3×3 convolutional layers — the 1×1 layers reduce the feature space, and the 3×3 layers expand it again.

Network structure

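The full network has 24 convolutional layers followed by 2 fully connected layers. The first layer uses large 7×7 filters at stride 2, and max-pooling layers plus one later stride-2 convolution progressively reduce spatial resolution while the channel count grows: 448→224→112→56→28→14→7. By the end, you have a 7×7 feature map with 1024 channels, matching the 7×7 grid. The two fully connected layers then produce the 7 × 7 × 30 output tensor.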
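The resolution chain can be reproduced by counting the stride-2 stages. The list below follows the paper's architecture figure as best understood here. Stride-1 layers are left out because they keep the resolution, and padding is assumed to make each stage an exact halving.

```python
# Stride-2 stages of the full network, in order.
stride2_stages = [
    "7x7 conv, stride 2",
    "2x2 max-pool",
    "2x2 max-pool",
    "2x2 max-pool",
    "2x2 max-pool",
    "3x3 conv, stride 2",
]

size = 448
sizes = [size]
for stage in stride2_stages:
    size //= 2            # each stage halves height and width
    sizes.append(size)

print(sizes)  # [448, 224, 112, 56, 28, 14, 7]
```

Six halvings take 448 down to 7, which is why the input resolution and the grid size S = 7 fit together.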

Fast YOLO: A lighter version uses only 9 convolutional layers (instead of 24) and fewer filters per layer. It runs at 155 fps, more than 3x faster than full YOLO, while still achieving 52.7% mAP, more than double the mAP of other real-time detectors of the era.

Activation function

All layers except the final one use leaky ReLU:

φ(x) = x   if x > 0,   0.1x   otherwise
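As a one-line NumPy function, assuming the 0.1 slope given above:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Identity for positive inputs, a small linear slope otherwise."""
    return np.where(x > 0, x, slope * x)

activated = leaky_relu(np.array([-2.0, 0.0, 3.0]))
print(activated)
```

Negative inputs are scaled down rather than zeroed, so a gradient still flows through units that would be "dead" under a plain ReLU.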

The final layer uses a linear activation: its outputs are the regressed coordinates, confidences, and class probabilities, so they are passed through without a nonlinearity.

Pretraining

The first 20 conv layers are pretrained on ImageNet (1000 classes) at 224×224 resolution. This pretraining takes about a week and achieves 88% top-5 accuracy — competitive with GoogLeNet. Then four more conv layers and two FC layers are added for detection, and the input resolution is doubled to 448×448 to capture fine-grained details.

How many convolutional layers does full YOLO have, and what is the final output shape?

Chapter 6: Training

YOLO is trained on PASCAL VOC 2007 and 2012 for about 135 epochs. The training recipe is straightforward but has several important details.

Learning rate schedule

Training starts with a careful warm-up: the learning rate slowly rises from 10⁻³ to 10⁻² over the first few epochs. Why? Starting at a high learning rate causes the model to diverge due to unstable gradients in the early stages, when predictions are essentially random. After warm-up, training continues at 10⁻² for 75 epochs, then 10⁻³ for 30 epochs, and finally 10⁻⁴ for 30 epochs.
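Written as a function of the epoch, the schedule might look like the sketch below. The step values (10⁻² up to epoch 75, then 10⁻³ for 30 epochs, then 10⁻⁴ for 30 epochs) follow the paper. The linear ramp, the five-epoch warm-up length, and counting the warm-up inside the first 75 epochs are assumptions, since the source only says "the first few epochs".

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch in a 135-epoch run."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 1e-3 towards 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2
    if epoch < 105:
        return 1e-3
    return 1e-4

for epoch in (0, 2, 10, 80, 120):
    print(epoch, learning_rate(epoch))
```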

Regularization

Dropout: A dropout layer with rate 0.5 after the first fully connected layer prevents co-adaptation between layers.

Data augmentation: Random scaling and translations up to 20% of the image size. Random adjustments to exposure and saturation (up to 1.5x in HSV color space). These augmentations force the network to learn invariant representations.

Optimizer settings

Batch size 64, momentum 0.9, weight decay 0.0005. Standard SGD with momentum — nothing exotic. The simplicity is part of the appeal.
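For readers who want to see what those hyperparameters do, here is one common formulation of a momentum update with L2 weight decay. Frameworks differ in the exact form, and the source does not say which variant the original implementation used, so treat this as an illustration only.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One parameter update; weight decay is folded into the gradient."""
    grad = grad + weight_decay * w
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.array([1.0])
velocity = np.zeros_like(w)
w, velocity = sgd_momentum_step(w, grad=np.array([0.5]), velocity=velocity, lr=0.01)
print(w, velocity)
```

In a real training loop, `grad` would be the gradient averaged over a batch of 64 images.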

Box specialization: Each cell predicts B=2 boxes, but at training time only one box per cell is "responsible" for each object — the one with the highest current IOU with the ground truth. Over time, this causes the two predictors to specialize: one might get better at tall, narrow objects while the other handles wide, flat objects. This emergent specialization improves overall recall.
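Picking the responsible predictor comes down to an IOU comparison. The sketch below uses center-format boxes `(cx, cy, w, h)`. The helper names and the example boxes are illustrative assumptions.

```python
def iou_xywh(a, b):
    """IOU of two boxes given as (cx, cy, w, h) in the same units."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def responsible_predictor(pred_boxes, truth):
    """Index of the predicted box with the highest IOU, plus all IOUs."""
    ious = [iou_xywh(p, truth) for p in pred_boxes]
    return max(range(len(ious)), key=ious.__getitem__), ious

truth = (0.5, 0.5, 0.2, 0.6)              # tall, narrow object
preds = [(0.5, 0.5, 0.6, 0.2),            # predictor 0: wide, flat box
         (0.5, 0.5, 0.25, 0.5)]           # predictor 1: tall box
idx, ious = responsible_predictor(preds, truth)
print(idx, ious)
```

Only the predictor that wins this comparison receives coordinate gradients for the object, which is how the two predictors in a cell can drift toward different shapes over training.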
Why does YOLO start with a low learning rate and slowly increase it?

Chapter 7: Results

YOLO's results on PASCAL VOC 2007 tell a story of speed vs. accuracy tradeoffs:

The headline numbers:
• YOLO: 45 fps at 63.4% mAP
• Fast YOLO: 155 fps at 52.7% mAP
• Faster R-CNN: 7 fps at 73.2% mAP
• Fast R-CNN: 0.5 fps at 70.0% mAP

YOLO is about 6x faster than Faster R-CNN at the cost of a roughly 10-point mAP gap. For many applications, such as real-time video, robotics, and autonomous driving, that tradeoff is worth it. And Fast YOLO is 22x faster than Faster R-CNN while still far ahead of older methods like DPM (whose fastest variant reached 30.4% mAP).
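The ratios quoted above follow directly from the headline numbers. A quick check, using only the figures from the list:

```python
results = {                      # (fps, mAP %) on PASCAL VOC 2007
    "YOLO": (45, 63.4),
    "Fast YOLO": (155, 52.7),
    "Faster R-CNN": (7, 73.2),
}

base_fps, base_map = results["Faster R-CNN"]
comparison = {
    name: (fps / base_fps, base_map - m)
    for name, (fps, m) in results.items()
    if name != "Faster R-CNN"
}
for name, (speedup, gap) in comparison.items():
    print(f"{name}: {speedup:.1f}x faster, {gap:.1f} mAP points lower")
```

This gives a 6.4x speedup with a 9.8-point gap for YOLO, and a 22.1x speedup with a 20.5-point gap for Fast YOLO.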

Error analysis

When YOLO makes mistakes, they tend to be localization errors — the bounding box is in roughly the right place but not precise enough. Fast R-CNN, by contrast, makes many more background errors — confidently predicting objects where there's just background texture.

This makes YOLO and Fast R-CNN complementary. When YOLO is used to rescore the detections of the best Fast R-CNN model (71.8% mAP on VOC 2007), the combination achieves 75.0% mAP, a 3.2-point boost over Fast R-CNN alone. YOLO acts as a "background detector" that kills false positives.

Generalization to artwork

One striking result: when trained on natural photos and tested on Picasso paintings and People-Art, YOLO dramatically outperforms R-CNN and DPM. Because YOLO sees the entire image and learns general spatial relationships, it generalizes better to unfamiliar visual domains. It doesn't just memorize textures — it learns what objects look like in context.

What type of error does YOLO make more often compared to Fast R-CNN?

Chapter 8: Strengths & Limitations

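Strengths

• Speed: 45 fps for the full model and 155 fps for Fast YOLO
• Global context: the network sees the whole image, so it makes far fewer background errors than Fast R-CNN
• End-to-end training: a single network optimized directly for detection
• Generalization: strong results when moving from natural photos to artwork

Limitations

• Localization: imprecise box placement is the main source of error
• Small objects in groups: each grid cell predicts only B = 2 boxes and one set of class probabilities, so several small objects whose centers fall in the same cell cannot all be detected
• Accuracy: mAP trails the best two-stage detectors of the time (63.4% vs. 73.2% for Faster R-CNN)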

The localization vs. background tradeoff: YOLO and Fast R-CNN fail in complementary ways. YOLO localizes less precisely but rarely hallucinates objects. Fast R-CNN localizes well but often sees objects where there are none. This complementarity is why combining them works so well, and it motivated much of the subsequent one-stage detector research.
Why does YOLO struggle with small objects appearing in groups?

Chapter 9: Connections

What came before

What came after

The one-stage revolution: YOLO didn't just create a faster detector — it proved that detection didn't need to be a two-stage problem. This insight spawned the entire family of one-stage detectors (SSD, RetinaNet, CenterNet, FCOS) that now dominate real-time detection. The accuracy gap between one-stage and two-stage detectors has essentially closed, vindicating YOLO's core bet.

Key ideas that persist

What fundamental paradigm shift did YOLO introduce to the object detection field?