YOLO — Veanors

Chapter 0: The Problem

By 2015, the dominant approach to object detection was a two-stage pipeline. First, generate thousands of region proposals — candidate bounding boxes that might contain objects. Then, run a CNN classifier on each proposal individually. This was the R-CNN family: R-CNN, Fast R-CNN, Faster R-CNN.

The problem? Speed. R-CNN ran at less than 1 frame per second. Fast R-CNN improved things, but at test time you still needed a separate region proposal step. Even Faster R-CNN, which integrated proposals into the network, topped out at about 7 fps. For applications like autonomous driving or robotics, you need real-time detection — at least 30 fps, ideally more.

These two-stage detectors were also fundamentally disjointed. Each component — proposal generation, feature extraction, classification, bounding box regression, non-max suppression — was trained or tuned separately. The pipeline couldn't be optimized end-to-end for the actual goal: detecting objects.

The wish list: We want an object detector that (1) runs in real-time (30+ fps), (2) sees the entire image at once (not just local patches), (3) can be optimized end-to-end as a single network, and (4) generalizes well to new domains. YOLO delivers all four.

Why were two-stage detectors like R-CNN too slow for real-time applications?

They run a classifier on each of thousands of region proposals separately, making the pipeline too slow for real-time use
They use too much GPU memory
They require labeled bounding boxes

Chapter 1: The Key Insight

YOLO's breakthrough is deceptively simple: frame object detection as a single regression problem. Instead of the detect-then-classify pipeline, a single neural network takes the entire image and directly outputs bounding box coordinates and class probabilities — all in one forward pass.

Here is the entire YOLO pipeline:

Resize the input image to 448 × 448
Run it through a single convolutional neural network
Threshold the resulting detections by confidence (followed by non-max suppression to remove duplicates)

That's it. No region proposals. No separate classifiers. No multi-stage pipeline. One network, one evaluation, one set of predictions.

Why "You Only Look Once": Two-stage detectors look at an image thousands of times — once per region proposal. YOLO looks at the image exactly once. The entire image. This is where the name comes from, and it's the reason YOLO is fast.

Because YOLO sees the entire image during both training and testing, it implicitly encodes contextual information. It knows that a "person" is more likely near a "bicycle" than floating in the sky. Fast R-CNN, which only sees local patches, makes more than twice as many background false-positive errors as YOLO — it mistakes random textures for objects because it can't see the bigger picture.

The cost of this speed? Accuracy. YOLO makes more localization errors — it sometimes gets the bounding box slightly wrong. But it makes far fewer background errors. This tradeoff turns out to be favorable for many real-world applications where speed matters more than pixel-perfect boxes.

What is the fundamental difference between YOLO and two-stage detectors like R-CNN?

YOLO frames detection as a single regression problem: one network predicts all boxes and classes in one forward pass, instead of using separate proposal and classification stages
YOLO uses a bigger network
YOLO uses more training data

Chapter 2: The Grid

How does a single network predict potentially many objects at different locations? YOLO's answer: divide the image into an S × S grid.

For PASCAL VOC, S = 7, giving us 7 × 7 = 49 grid cells. The rule is simple: if the center of an object falls into a grid cell, that cell is responsible for detecting that object.

Each grid cell makes two kinds of predictions:

B bounding boxes (B = 2 in the paper), each with 5 values: x, y, w, h, and a confidence score
C class probabilities (C = 20 for VOC), shared across all boxes in that cell

The coordinates (x, y) represent the center of the box relative to the grid cell — so they range from 0 to 1 within that cell. The width and height (w, h) are relative to the whole image, also normalized to [0, 1]. The confidence score represents Pr(Object) × IOU_pred^truth — it encodes both whether an object is present and how good the predicted box is.
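As a minimal sketch of this encoding, assuming a ground-truth box given by its center and size in pixels (the function name `encode_box` and its argument layout are illustrative, not from the paper):

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Map a pixel-space box (center cx, cy; size bw, bh) to YOLO-style targets.

    Returns (row, col, x, y, w, h): (row, col) is the responsible grid cell,
    x and y are the center offsets within that cell in [0, 1], and
    w and h are fractions of the full image size.
    """
    # Position of the box center measured in grid-cell units.
    gx = cx / img_w * S
    gy = cy / img_h * S
    # The cell containing the center; clamp so a center on the far edge
    # still lands in the last cell.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # Offsets of the center inside the responsible cell.
    x = gx - col
    y = gy - row
    # Width and height relative to the whole image.
    w = bw / img_w
    h = bh / img_h
    return row, col, x, y, w, h


# A 112 x 56 box centered at (224, 100) in a 448 x 448 image.
row, col, x, y, w, h = encode_box(cx=224, cy=100, bw=112, bh=56,
                                  img_w=448, img_h=448)
```

The center lands in one specific cell, and only that cell's target is filled in; every other cell treats this object as absent.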

Why a grid? The grid enforces spatial diversity. Instead of having the network predict an arbitrary number of boxes (which would require complex sorting), we get exactly S×S×B predictions at fixed spatial locations. This is what makes the output a clean, fixed-size tensor — perfect for regression.

What determines which grid cell is responsible for detecting an object?

The cell where the center of the object falls: that cell is responsible for predicting the object's bounding box
The cell with the largest overlap with the object
All cells that the object touches

Chapter 3: The Prediction Tensor

Let's make this concrete. For PASCAL VOC, YOLO uses S=7, B=2, C=20. Each grid cell predicts:

Box 1: x₁, y₁, w₁, h₁, conf₁ (5 values)
Box 2: x₂, y₂, w₂, h₂, conf₂ (5 values)
Class probs: p(aeroplane), p(bicycle), ..., p(tvmonitor) (20 values)

That's B×5 + C = 2×5 + 20 = 30 values per cell.

With a 7×7 grid, the full output is a 7 × 7 × 30 tensor. That's 1,470 output values, produced by a single forward pass through the network.

Output shape: S × S × (B × 5 + C) = 7 × 7 × 30
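The arithmetic can be checked with a short sketch. The ordering of values inside a cell (boxes first, then class probabilities) is an assumption made for illustration; the text above fixes only the sizes. `split_cell` is a hypothetical helper name.

```python
S, B, C = 7, 2, 20
VALUES_PER_CELL = B * 5 + C              # 2 * 5 + 20
TOTAL_VALUES = S * S * VALUES_PER_CELL   # whole output tensor


def split_cell(cell):
    """Split one cell's flat prediction vector into boxes and class probs.

    Assumed layout (illustrative): B groups of (x, y, w, h, conf)
    followed by C class probabilities.
    """
    if len(cell) != VALUES_PER_CELL:
        raise ValueError("expected %d values, got %d"
                         % (VALUES_PER_CELL, len(cell)))
    boxes = [tuple(cell[5 * b: 5 * b + 5]) for b in range(B)]
    class_probs = list(cell[5 * B:])
    return boxes, class_probs


# Dummy cell whose values are just their own indices, to show the slicing.
boxes, class_probs = split_cell(list(range(VALUES_PER_CELL)))
```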

At test time, to get class-specific confidence scores for each box, we multiply each box's confidence by the cell's class probabilities:

Pr(Class_i) × IOU_pred^truth = Pr(Class_i | Object) × Pr(Object) × IOU_pred^truth

This gives us 7 × 7 × 2 = 98 bounding boxes, each with 20 class scores. We then apply non-max suppression to eliminate duplicate detections, keeping only the most confident box for each detected object.
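A rough sketch of these two test-time steps follows. It assumes boxes have already been converted to corner format (x1, y1, x2, y2) and uses an IOU threshold of 0.5; both are illustrative choices, not values given in the text. The suppression shown is the common greedy variant, run per class.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_scores(box_conf, class_probs):
    """Class-specific confidence: box confidence times each class probability."""
    return [box_conf * p for p in class_probs]


def nms(detections, iou_thresh=0.5):
    """Greedy non-max suppression for one class.

    detections: list of (score, box). Keeps the highest-scoring box, drops
    every remaining box that overlaps it by more than iou_thresh, and repeats.
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_thresh]
    return kept


scores = class_scores(0.8, [0.5, 0.25])
dets = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # heavy overlap with the first box
    (0.7, (20, 20, 30, 30)),  # a separate object
]
kept = nms(dets)
```

In the example, the second box overlaps the first heavily and is suppressed, while the third box does not overlap and survives.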

The elegance: The entire detection problem — where are the objects, what are they, and how confident are we — is encoded as a single fixed-size tensor. No variable-length lists. No sorting. Just a tensor that a CNN can directly regress to.


For YOLO on PASCAL VOC (S=7, B=2, C=20), how many values does each grid cell predict?

30: two boxes of 5 values each (x, y, w, h, confidence) plus 20 class probabilities
20: just the class probabilities
50: two full sets of 25 predictions each

Chapter 4: The Loss Function

YOLO uses sum-squared error (SSE) across all outputs. Simple and easy to optimize, but raw SSE has problems. Three, specifically:

Problem 1: Most cells are empty

In a 7×7 grid, most cells don't contain an object. These cells all push their confidence toward zero, overwhelming the gradient from the few cells that do contain objects. Training becomes unstable.

Solution: Weight the confidence loss from empty cells lower. Use λ_noobj = 0.5 for cells without objects, while cells with objects use weight 1.0.

Problem 2: Localization needs more weight

SSE weights localization error equally with classification error, which may not be ideal when accurate boxes are central to the task.

Solution: Weight localization loss higher with λ_coord = 5.

Problem 3: Large vs small boxes

A 2-pixel error in a large box barely matters. The same error in a small box is devastating for IOU. But SSE doesn't know the difference.

Solution: Predict the square root of width and height instead of the raw values. Because √w flattens out as w grows, the same absolute error in w produces a smaller change in √w, and therefore a smaller loss, for a large box than for a small one.
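A small numeric check makes the effect visible. The widths and the error size below are made-up values chosen for illustration, expressed as fractions of the image width.

```python
import math


def sqrt_penalty(true_w, pred_w):
    """Squared error between square roots, as in the width/height loss term."""
    return (math.sqrt(true_w) - math.sqrt(pred_w)) ** 2


# Both predictions overshoot the true width by the same absolute amount.
err = 0.02
small = sqrt_penalty(0.05, 0.05 + err)   # small box
large = sqrt_penalty(0.80, 0.80 + err)   # large box

# For comparison: plain squared error cannot tell the two cases apart.
raw_small = (0.05 - (0.05 + err)) ** 2
raw_large = (0.80 - (0.80 + err)) ** 2
```

With these numbers the square-root penalty for the small box comes out more than ten times larger than for the large box, while the raw squared errors are the same.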

The full loss function has five terms:

λ_coord ∑_i∑_j 1^obj_ij [(x_i−x̂_i)² + (y_i−ŷ_i)²]

+ λ_coord ∑_i∑_j 1^obj_ij [(√w_i−√ŵ_i)² + (√h_i−√ĥ_i)²]

+ ∑_i∑_j 1^obj_ij (C_i−Ĉ_i)²

+ λ_noobj ∑_i∑_j 1^noobj_ij (C_i−Ĉ_i)²

+ ∑_i 1^obj_i ∑_c (p_i(c)−p̂_i(c))²

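Here 1^obj_ij is 1 when the j-th box predictor in cell i is "responsible" for an object (it has the highest IOU with the ground truth) and 0 otherwise; 1^noobj_ij is its complement, equal to 1 when that predictor is not responsible; and 1^obj_i is 1 when an object's center falls in cell i.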
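The five terms can be sketched in plain Python. This is an illustrative reading of the formula, not the authors' code: the dict layout, the name `yolo_loss`, and passing in a precomputed responsible index and target confidence are all assumptions, and the example uses only 2 classes to stay short.

```python
import math

LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5


def yolo_loss(cells):
    """Sum the five loss terms over a list of grid cells.

    Each cell is a dict with:
      "boxes":   list of B predicted (x, y, w, h, conf) tuples
      "classes": list of C predicted class probabilities
      "target":  None for an empty cell, otherwise a dict with
                 "box" (x, y, w, h), "classes" (one-hot list),
                 "responsible" (index j of the responsible predictor),
                 "conf" (target confidence for that predictor).
    Assumes predicted w and h are non-negative so the square root is defined.
    """
    loss = 0.0
    for cell in cells:
        target = cell["target"]
        for j, (x, y, w, h, conf) in enumerate(cell["boxes"]):
            if target is not None and j == target["responsible"]:
                tx, ty, tw, th = target["box"]
                # Terms 1 and 2: center error and square-root size error.
                loss += LAMBDA_COORD * ((tx - x) ** 2 + (ty - y) ** 2)
                loss += LAMBDA_COORD * ((math.sqrt(tw) - math.sqrt(w)) ** 2
                                        + (math.sqrt(th) - math.sqrt(h)) ** 2)
                # Term 3: confidence error for the responsible predictor.
                loss += (target["conf"] - conf) ** 2
            else:
                # Term 4: push every non-responsible confidence toward 0.
                loss += LAMBDA_NOOBJ * (0.0 - conf) ** 2
        if target is not None:
            # Term 5: class error, only for cells that contain an object.
            loss += sum((t - p) ** 2
                        for t, p in zip(target["classes"], cell["classes"]))
    return loss


cells = [
    {   # cell with an object, predicted perfectly by predictor 0
        "boxes": [(0.5, 0.5, 0.25, 0.25, 1.0), (0.0, 0.0, 0.0, 0.0, 0.0)],
        "classes": [1.0, 0.0],
        "target": {"box": (0.5, 0.5, 0.25, 0.25), "classes": [1.0, 0.0],
                   "responsible": 0, "conf": 1.0},
    },
    {   # empty cell whose first predictor is wrongly confident
        "boxes": [(0.1, 0.1, 0.1, 0.1, 0.4), (0.2, 0.2, 0.2, 0.2, 0.0)],
        "classes": [0.5, 0.5],
        "target": None,
    },
]
example_loss = yolo_loss(cells)
```

In the example only the empty cell contributes: its stray confidence of 0.4 is penalized with weight λ_noobj, and its class probabilities are ignored because the cell contains no object.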

Why does YOLO predict the square root of width and height instead of the raw values?

So that the same absolute error is penalized less for a large box than for a small one: a 2-pixel error matters less for a large box
To make the values positive
To reduce the number of parameters

Chapter 5: The Architecture

YOLO's architecture is inspired by GoogLeNet but simpler. Instead of inception modules, it uses alternating 1×1 and 3×3 convolutional layers — the 1×1 layers reduce the feature space, and the 3×3 layers expand it again.

Network structure

24 convolutional layers for feature extraction
2 fully connected layers for prediction
Input: 448 × 448 × 3 (RGB image)
Output: 7 × 7 × 30 tensor

The first layer uses a large 7×7 filter at stride 2; together with the max-pooling layers and a later strided convolution, the network progressively reduces spatial resolution while increasing channels: 448→112→56→28→14→7. By the end, you have a 7×7 feature map with 1024 channels, which maps perfectly to the 7×7 grid.
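The downsampling can be traced with a few lines. The sketch assumes six stride-2 stages in this order: one strided 7×7 convolution, four 2×2 max-pools, and one strided 3×3 convolution. That ordering is an assumption for illustration; the text above gives only the resulting sizes. Stride-1 layers are omitted because they leave the resolution unchanged.

```python
# (description, stride) for each stage that halves the resolution.
stages = [
    ("7x7 conv, stride 2", 2),
    ("2x2 max-pool", 2),
    ("2x2 max-pool", 2),
    ("2x2 max-pool", 2),
    ("2x2 max-pool", 2),
    ("3x3 conv, stride 2", 2),
]

size = 448
sizes = [size]
for _name, stride in stages:
    size //= stride          # each stage halves height and width
    sizes.append(size)

total_downsampling = sizes[0] // sizes[-1]
```

Six halvings give a total factor of 64, which is what turns a 448-pixel side into the 7 cells of the grid. The first two stages together account for the 448→112 step.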

Fast YOLO: A lighter version uses only 9 convolutional layers (instead of 24) and fewer filters per layer. It runs at an astounding 155 fps, more than 3x faster than full YOLO, while still achieving 52.7% mAP, more than double that of other real-time detectors of the era.

Activation function

All layers except the final one use leaky ReLU:

φ(x) = x if x > 0, 0.1x otherwise
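Written out as code, the activation is a one-liner (the function name and the `slope` parameter are illustrative):

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: identity for positive inputs, a small slope otherwise."""
    return x if x > 0 else slope * x


outputs = [leaky_relu(v) for v in (-3.0, 0.0, 2.0)]
```

Unlike a plain ReLU, negative inputs are scaled down rather than zeroed, so they still carry a gradient.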

The final layer uses a linear activation: since we're regressing to coordinates and probabilities, the outputs are passed through without any nonlinearity.

Pretraining

The first 20 conv layers are pretrained on ImageNet (1000 classes) at 224×224 resolution. This pretraining takes about a week and achieves 88% top-5 accuracy — competitive with GoogLeNet. Then four more conv layers and two FC layers are added for detection, and the input resolution is doubled to 448×448 to capture fine-grained details.

How many convolutional layers does full YOLO have, and what is the final output shape?

24 conv layers + 2 FC layers, outputting a 7×7×30 tensor
9 conv layers + 2 FC layers, outputting a 14×14×30 tensor
50 layers outputting class labels only

Chapter 6: Training

YOLO is trained on PASCAL VOC 2007 and 2012 for about 135 epochs. The training recipe is straightforward but has several important details.

Learning rate schedule

Training starts with a careful warm-up: the learning rate slowly rises from 10⁻³ to 10⁻² over the first few epochs. Why? Starting at a high learning rate causes the model to diverge due to unstable gradients in the early stages when predictions are essentially random. After warm-up:

10⁻² for 75 epochs (the main training phase)
10⁻³ for 30 epochs (fine-tuning)
10⁻⁴ for 30 epochs (final polish)
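The schedule can be written as a small function. Two details are assumptions, since the text only says "the first few epochs": the warm-up length (`warmup_epochs=5` is a hypothetical value) and the linear shape of the ramp. The sketch also counts the warm-up as part of the first 75 epochs so that the total stays at 135.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch under the schedule above.

    warmup_epochs and the linear ramp are illustrative assumptions.
    """
    if epoch < warmup_epochs:
        # Ramp from 1e-3 toward 1e-2 during warm-up.
        frac = epoch / warmup_epochs
        return 1e-3 + frac * (1e-2 - 1e-3)
    if epoch < 75:
        return 1e-2   # main training phase
    if epoch < 105:
        return 1e-3   # fine-tuning
    return 1e-4       # final polish


schedule = [learning_rate(e) for e in range(135)]
```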

Regularization

Dropout: A dropout layer with rate 0.5 after the first fully connected layer prevents co-adaptation between layers.

Data augmentation: Random scaling and translations up to 20% of the image size. Random adjustments to exposure and saturation (up to 1.5x in HSV color space). These augmentations force the network to learn invariant representations.

Optimizer settings

Batch size 64, momentum 0.9, weight decay 0.0005. Standard SGD with momentum — nothing exotic. The simplicity is part of the appeal.

Box specialization: Each cell predicts B=2 boxes, but at training time only one box per cell is "responsible" for each object — the one with the highest current IOU with the ground truth. Over time, this causes the two predictors to specialize: one might get better at tall, narrow objects while the other handles wide, flat objects. This emergent specialization improves overall recall.
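The responsibility rule can be sketched as follows, assuming all boxes are (cx, cy, w, h) tuples already expressed in the same image-normalized frame (the function names and example values are illustrative):

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """IOU of two (cx, cy, w, h) boxes in the same coordinate frame."""
    ax1, ay1, ax2, ay2 = to_corners(a)
    bx1, by1, bx2, by2 = to_corners(b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, truth):
    """Return (index, iou) of the predictor that best overlaps the truth."""
    ious = [iou(p, truth) for p in pred_boxes]
    best = max(range(len(ious)), key=lambda j: ious[j])
    return best, ious[best]


truth = (0.5, 0.5, 0.2, 0.6)          # a tall, narrow object
preds = [
    (0.5, 0.5, 0.6, 0.2),             # predictor 0 guesses wide and flat
    (0.5, 0.5, 0.2, 0.5),             # predictor 1 guesses tall and narrow
]
j, best_iou = responsible_predictor(preds, truth)
```

Here the tall prediction wins, so only predictor 1 receives the coordinate gradient for this object. Repeated over many examples, this is the mechanism behind the specialization described above.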

Why does YOLO start with a low learning rate and slowly increase it?

Starting at a high learning rate causes divergence due to unstable gradients when initial predictions are random: warm-up stabilizes early training
To save computation in early epochs
To train the FC layers first

Chapter 7: Results

YOLO's results on PASCAL VOC 2007 tell a story of speed vs. accuracy tradeoffs:

The headline numbers:
• YOLO: 45 fps at 63.4% mAP
• Fast YOLO: 155 fps at 52.7% mAP
• Faster R-CNN: 7 fps at 73.2% mAP
• Fast R-CNN: 0.5 fps at 70.0% mAP

YOLO is 6x faster than Faster R-CNN with a 10-point mAP gap. For many applications (real-time video, robotics, autonomous driving) that tradeoff is worth it. And Fast YOLO is 22x faster than Faster R-CNN while still being far more accurate than older methods like DPM (which got 30.4% mAP).
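The ratios quoted here follow directly from the headline numbers; a quick check, using Faster R-CNN as the baseline:

```python
# (frames per second, mAP in percent) from the list above.
results = {
    "YOLO": (45, 63.4),
    "Fast YOLO": (155, 52.7),
    "Faster R-CNN": (7, 73.2),
    "Fast R-CNN": (0.5, 70.0),
}

base_fps, base_map = results["Faster R-CNN"]

# For each detector: (speedup over Faster R-CNN, mAP difference in points).
comparison = {
    name: (fps / base_fps, m - base_map)
    for name, (fps, m) in results.items()
}

yolo_speedup, yolo_map_gap = comparison["YOLO"]
fast_yolo_speedup, _ = comparison["Fast YOLO"]
```

Rounded, these give a speedup of about 6.4x for YOLO at a gap of 9.8 mAP points, and about 22x for Fast YOLO.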

Error analysis

When YOLO makes mistakes, they tend to be localization errors — the bounding box is in roughly the right place but not precise enough. Fast R-CNN, by contrast, makes many more background errors — confidently predicting objects where there's just background texture.

This makes YOLO and Fast R-CNN complementary. When combined, Fast R-CNN + YOLO achieves 75.0% mAP, a 3.2-point boost over the best Fast R-CNN model alone (71.8%). YOLO acts as a "background detector" that kills false positives.

Generalization to artwork

One striking result: when trained on natural photos and tested on Picasso paintings and People-Art, YOLO dramatically outperforms R-CNN and DPM. Because YOLO sees the entire image and learns general spatial relationships, it generalizes better to unfamiliar visual domains. It doesn't just memorize textures — it learns what objects look like in context.

What type of error does YOLO make more often compared to Fast R-CNN?

Localization errors: the box is roughly right but not precisely placed, while Fast R-CNN makes more background false-positive errors
Classification errors
Background errors

Chapter 8: Strengths & Limitations

Strengths

Speed: 45 fps (full) / 155 fps (fast) — genuinely real-time on 2015 hardware
Global reasoning: Sees the whole image, so it understands context. Fewer background errors than R-CNN family.
Generalization: Outperforms R-CNN and DPM on artwork — learns transferable representations
Simplicity: One network, one forward pass, end-to-end training. No complex multi-stage pipeline.
Complementary: Combining YOLO with Fast R-CNN boosts mAP by 3+ points

Limitations

Spatial constraint: Each cell predicts only 2 boxes and 1 class. Groups of small objects (flocks of birds) overwhelm this capacity.
Aspect ratio generalization: Since the model learns box shapes from data, it struggles with unusual aspect ratios not seen during training.
Coarse features: Multiple downsampling layers (448→7) lose fine-grained spatial information. Small objects suffer most.
Localization: The main source of error. The square-root trick for width/height only partially solves the large-vs-small box problem.
Loss function mismatch: SSE doesn't perfectly align with mAP. Equal weighting of all error types isn't optimal.

The localization vs. background tradeoff: YOLO and R-CNN fail in complementary ways. YOLO localizes less precisely but rarely hallucinates objects. R-CNN localizes well but often sees objects where there are none. This complementarity is why combining them works so well, and it motivated much of the subsequent one-stage detector research.

Why does YOLO struggle with small objects appearing in groups?

Each grid cell can only predict 2 boxes with 1 class: when many small objects cluster in one cell, most go undetected
Small objects are blurry
The network is too shallow

Chapter 9: Connections

What came before

DPM (2010): Sliding window + deformable parts. Slow, shallow features. ~30% mAP.
R-CNN (2014): Region proposals + CNN features. Accurate (58% mAP) but extremely slow (<1 fps). Multi-stage pipeline.
Fast R-CNN (2015): Shared CNN features, faster classification. Still needs external proposals.
Faster R-CNN (2015): Integrated Region Proposal Network. 7 fps at 73% mAP. Still two-stage.
OverFeat (2014): Multi-scale sliding window CNN. Faster than R-CNN but still not real-time.

What came after

SSD (2016): Multi-scale feature maps for detection at different resolutions. Addresses YOLO's small-object weakness. 59 fps at 74.3% mAP.
YOLOv2/YOLO9000 (2017): Batch normalization, anchor boxes, multi-scale training, Darknet-19 backbone. 67 fps at 76.8% mAP. YOLO9000 detects 9000+ categories.
YOLOv3 (2018): Feature pyramid network (FPN) for multi-scale detection. Much better on small objects. Darknet-53 backbone.
YOLOv4-v11 (2020-2024): Continuous improvements — CSP backbones, mosaic augmentation, anchor-free heads, transformer components, and more.
DETR (2020): Transformers for detection — no anchors, no NMS, set prediction with Hungarian matching. A completely different paradigm.

The one-stage revolution: YOLO didn't just create a faster detector — it proved that detection didn't need to be a two-stage problem. This insight spawned the entire family of one-stage detectors (SSD, RetinaNet, CenterNet, FCOS) that now dominate real-time detection. The accuracy gap between one-stage and two-stage detectors has essentially closed, vindicating YOLO's core bet.

Key ideas that persist

Detection as regression: Now standard. Even transformer-based detectors predict box coordinates directly.
Grid-based prediction: Evolved into anchor boxes (YOLOv2), then anchor-free centers (CenterNet, FCOS).
End-to-end training: No more disjoint pipelines. Modern detectors are trained start-to-finish.
Speed/accuracy Pareto frontier: YOLO helped popularize evaluating detectors on a speed-accuracy curve, not just mAP.

What fundamental paradigm shift did YOLO introduce to the object detection field?

It proved that detection could be framed as a single-stage regression problem, spawning the entire family of one-stage detectors that now dominate real-time applications
It introduced convolutional neural networks
It invented data augmentation