Carion, Massa, Synnaeve, Usunier, Kirillov, Zagoruyko — Facebook AI, 2020

End-to-End Object Detection with Transformers

Eliminate anchors, NMS, and all hand-designed post-processing. Treat detection as a set prediction problem, solve with a transformer and bipartite matching. Elegant simplicity.

Prerequisites: CNNs + Transformers (attention) + Object detection basics

Chapter 0: The Problem

Object detection sounds simple: draw a box around every object and label it. But by 2020, the machinery required to do this well had become astonishingly complex.

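Every major detector (Faster R-CNN, YOLO, SSD, RetinaNet) relies on some combination of hand-designed components: anchor boxes or region proposals, heuristic rules for assigning predictions to ground-truth boxes, and non-maximum suppression (NMS) to remove duplicates.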

Each component has hyperparameters. Each hyperparameter requires tuning. Each interaction between components creates subtle bugs. The result: modern detectors are powerful, but fragile and complex.

The question DETR asks: Can we build an object detector that has none of these components? No anchors, no NMS, no proposal generation, no hand-designed assignment rules. Just a neural network that takes an image and directly outputs a set of detections. Truly end-to-end.
What is the main problem with traditional object detectors that DETR aims to solve?

Chapter 1: The Key Insight

Here is DETR's core idea, and it is beautifully simple: treat object detection as a set prediction problem.

What does that mean? Instead of asking "is there an object at anchor position (i, j, k)?", DETR asks: "given this image, predict a set of N objects." The output is a fixed-size set — typically N=100 — where each element is either a detected object (class + bounding box) or a special "no object" (∅) token.

This reframing changes everything:

But how do we train a set predictor? The key challenge is assignment: which predicted box should be compared to which ground-truth box? Traditional detectors use anchor-matching heuristics. DETR uses something far more elegant: the Hungarian algorithm for optimal bipartite matching.

The elegance: Given N predictions and M ground-truth objects (M < N), the Hungarian algorithm finds the optimal one-to-one assignment that minimizes the total matching cost. The remaining N−M predictions are matched to ∅ ("no object"). This is mathematically principled, globally optimal, and produces no duplicates by construction.

The entire DETR pipeline: CNN backbone extracts features → transformer encoder-decoder reasons about them → N parallel prediction heads output boxes and classes. That's it. No anchors. No NMS. No post-processing.

Why does DETR not need Non-Maximum Suppression (NMS)?

Chapter 2: The Architecture

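DETR's architecture has three main components: a CNN backbone, an encoder-decoder transformer, and prediction heads. They are described below in four parts, with the transformer split into its encoder and decoder: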

1. CNN Backbone

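A standard ResNet (typically ResNet-50) takes the input image $x_{\text{img}} \in \mathbb{R}^{3 \times H_0 \times W_0}$ and produces a feature map $f \in \mathbb{R}^{C \times H \times W}$, where $C = 2048$ and the spatial dimensions are reduced by 32× ($H = H_0/32$, $W = W_0/32$). A 1×1 convolution then projects this to a lower channel dimension $d = 256$.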
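As a quick sanity check on the shapes above, the following sketch computes the feature-map size and the number of tokens the encoder will see. The function name `backbone_output_shape` is hypothetical, and rounding up for sizes that are not multiples of 32 is an assumption about how the strided layers behave.

```python
import math

def backbone_output_shape(h0, w0, stride=32, d_model=256):
    """Return (H, W, num_tokens, d_model) for an input image of size h0 x w0."""
    # Assumption: sizes that are not multiples of the stride round up.
    h = math.ceil(h0 / stride)
    w = math.ceil(w0 / stride)
    # After flattening, the encoder sees one token per spatial position.
    return h, w, h * w, d_model

h, w, tokens, d = backbone_output_shape(800, 1216)
print(h, w, tokens, d)  # 25 38 950 256
```

Because encoder self-attention compares every token with every other token, its cost grows with the square of `tokens`, which is why the input resolution matters so much for compute.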

2. Transformer Encoder

The feature map is flattened to a sequence of H×W tokens, each of dimension d=256. Fixed sinusoidal positional encodings are added (2D, one for each spatial position). The encoder applies 6 layers of multi-head self-attention + FFN, letting every spatial position attend to every other. This builds a globally-aware feature representation.

3. Transformer Decoder

The decoder takes N=100 learned object queries as input. Each decoder layer applies: (a) self-attention among the object queries (so they can coordinate to avoid predicting the same object), then (b) cross-attention from object queries to encoder features (so each query can "look at" the image). After 6 decoder layers, each query outputs a d-dimensional embedding.
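To make the cross-attention step concrete, here is a minimal single-head sketch in plain Python. It is only an illustration: it omits the learned projections, multiple heads, residual connections, and layer normalization of a real transformer layer, and the function name `attend` and the toy vectors are assumptions.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention: each query returns a weighted mix of values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the keys.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Two "object queries" attending over three "image tokens".
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
print(attend(queries, tokens, tokens))
```

In the decoder, the same operation is used twice per layer: once with the queries attending to each other (self-attention) and once with the queries attending to the encoder output (cross-attention).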

4. Prediction Heads

Each of the N=100 output embeddings is independently passed through shared prediction heads: (a) a linear layer followed by softmax predicts the class label (including ∅), and (b) a 3-layer FFN predicts the bounding box as 4 normalized coordinates (center x, center y, width, height).
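The sketch below shows how such raw head outputs might be turned into final detections. It assumes the last class index stands for ∅ and that boxes are normalized (center x, center y, width, height); the function names and the 0.7 score threshold are illustrative choices, not values taken from the text above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)

def decode(logits_per_query, boxes_per_query, img_w, img_h, threshold=0.7):
    """Keep queries whose best real-class probability exceeds the threshold."""
    detections = []
    for logits, box in zip(logits_per_query, boxes_per_query):
        probs = softmax(logits)
        real = probs[:-1]  # assumption: last index is the "no object" class
        score = max(real)
        label = real.index(score)
        if score > threshold:
            detections.append((label, score, cxcywh_to_xyxy(box, img_w, img_h)))
    return detections

logits = [[4.0, 0.0, 0.0],   # confident class 0
          [0.0, 0.0, 5.0]]   # confident "no object"
boxes = [[0.5, 0.5, 0.2, 0.4], [0.1, 0.1, 0.1, 0.1]]
print(decode(logits, boxes, img_w=100, img_h=200))
```

Note that there is no NMS step here: each query contributes at most one detection, and filtering is a simple per-query threshold.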

Implementation simplicity: The entire DETR inference pipeline can be written in fewer than 50 lines of PyTorch code. No custom CUDA kernels, no specialized libraries, no anchor grid generation. Just a CNN, a standard transformer, and a linear layer.
What are the three main components of the DETR architecture?

Chapter 3: Object Queries

Object queries are perhaps the most novel component of DETR. They are N=100 learned embedding vectors, each of dimension d=256. They are not derived from the image — they are parameters of the model, learned during training.

Think of each object query as a "detection slot." During training, each slot learns to specialize:

The decoder processes all 100 queries in parallel. Through self-attention, queries communicate with each other: "I'm already detecting the dog, so you don't need to." Through cross-attention to the encoder output, each query "looks at" the relevant parts of the image to refine its prediction.

Analogy: Imagine 100 expert spotters, each stationed at a different vantage point. They can all see the whole scene (cross-attention), and they can talk to each other to coordinate (self-attention). Each spotter either reports an object or says "nothing here" (∅). No two spotters report the same object.

The paper visualizes the boxes predicted by individual query slots across the COCO validation set and finds that different slots specialize in different image regions and box sizes. This specialization emerges from training alone; no spatial prior is manually encoded.

What are object queries in DETR?

Chapter 4: Bipartite Matching

Here is the training challenge: DETR outputs N=100 predictions, but the image might contain only 3 objects. Which 3 of the 100 predictions should we compare to the 3 ground-truth boxes?

Traditional detectors use heuristic rules: "assign each anchor to the ground-truth box with the highest IoU, if the IoU exceeds 0.5." DETR replaces this with a principled, globally optimal solution: the Hungarian algorithm.

Step 1: Compute the cost matrix

For every possible (prediction, ground-truth) pair, compute a matching cost:

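$$\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$$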

This cost combines: (a) the predicted probability for the correct class (higher is better, hence the minus sign), and (b) a box distance (lower is better). The result is an N×M cost matrix.
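A minimal sketch of building this cost matrix in plain Python follows. As a simplifying assumption, the box term here is a plain L1 distance rather than the full box loss, and the names `matching_cost_matrix` and `box_weight` are hypothetical.

```python
def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def matching_cost_matrix(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=1.0):
    """Return an N x M matrix: cost[i][j] pairs prediction i with ground truth j."""
    cost = []
    for probs, pbox in zip(pred_probs, pred_boxes):
        row = []
        for label, gbox in zip(gt_labels, gt_boxes):
            # Higher probability of the correct class lowers the cost;
            # a larger box distance raises it.
            row.append(-probs[label] + box_weight * l1_distance(pbox, gbox))
        cost.append(row)
    return cost

# 3 predictions, 2 ground-truth objects (boxes are normalized cx, cy, w, h).
pred_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
pred_boxes = [[0.5, 0.5, 0.2, 0.2], [0.3, 0.3, 0.1, 0.1], [0.8, 0.8, 0.1, 0.1]]
gt_labels = [1, 0]
gt_boxes = [[0.3, 0.3, 0.1, 0.1], [0.5, 0.5, 0.2, 0.2]]
for row in matching_cost_matrix(pred_probs, pred_boxes, gt_labels, gt_boxes):
    print([round(c, 2) for c in row])
```

The lowest entries in each column identify the predictions that agree with a ground-truth object in both class and location.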

Step 2: Hungarian algorithm

The Hungarian algorithm finds the permutation σ that minimizes the total cost — the optimal one-to-one assignment. This runs in O(N³) time, but with N=100 it is negligible compared to the network forward pass.
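To illustrate what "optimal one-to-one assignment" means, the sketch below finds it by brute force over all possible assignments. This is not the Hungarian algorithm: it is exponentially slow and only suitable for toy sizes, but it returns the same optimum. The function name `optimal_assignment` is hypothetical; real implementations typically call a solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations

def optimal_assignment(cost):
    """cost is N x M with N >= M. Returns (total_cost, assignment),
    where assignment[j] is the prediction index matched to ground truth j."""
    n, m = len(cost), len(cost[0])
    best_total, best = float("inf"), None
    # Try every way of picking M distinct predictions, in order, for the M targets.
    for chosen in permutations(range(n), m):
        total = sum(cost[i][j] for j, i in enumerate(chosen))
        if total < best_total:
            best_total, best = total, chosen
    return best_total, list(best)

cost = [[0.55, -0.9],
        [-0.8, 0.5],
        [1.2, 0.9]]
total, assignment = optimal_assignment(cost)
unmatched = [i for i in range(len(cost)) if i not in assignment]
print(assignment, unmatched)  # [1, 0] [2]
```

Predictions left in `unmatched` are the ones trained toward the ∅ class. Because each prediction index appears at most once in `assignment`, two predictions can never be rewarded for the same ground-truth object.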

Why this is better than heuristic matching: The Hungarian algorithm finds the globally optimal assignment. Anchor-based matching is greedy and local — it matches each anchor independently, which can lead to suboptimal assignments. Bipartite matching considers all predictions jointly and guarantees uniqueness by construction.
What does the Hungarian algorithm do in DETR's training?

Chapter 5: The Set Prediction Loss

Once the Hungarian algorithm has found the optimal assignment σ̂, we compute the Hungarian loss on the matched pairs:

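$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$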

This loss has two parts for each matched pair:

1. Classification loss

Standard cross-entropy. For predictions matched to a real object, this encourages the correct class. For predictions matched to ∅, this encourages the "no object" class. The ∅ class is down-weighted by a factor of 10 to handle class imbalance (most of the 100 slots will be empty).
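The down-weighting can be sketched as a weighted negative log-likelihood. This is a simplified illustration: the function name is hypothetical, the weight of 0.1 is one reading of "down-weighted by a factor of 10", and normalization over the batch is left out.

```python
import math

def classification_loss(pred_probs, target_labels, no_object_index, no_object_weight=0.1):
    """Sum of -log p(target) over all slots, with 'no object' targets down-weighted."""
    total = 0.0
    for probs, target in zip(pred_probs, target_labels):
        weight = no_object_weight if target == no_object_index else 1.0
        total += -weight * math.log(probs[target])
    return total

# Two slots, two real classes plus "no object" at index 2.
pred_probs = [[0.8, 0.1, 0.1],   # matched to a real object of class 0
              [0.2, 0.1, 0.7]]   # matched to "no object"
print(classification_loss(pred_probs, [0, 2], no_object_index=2))
```

Without the smaller weight, the many empty slots would dominate the gradient and push the model toward predicting ∅ everywhere.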

2. Box loss

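A combination of two complementary losses: a generalized IoU (GIoU) loss and an L1 loss. The L1 loss alone penalizes large boxes more than small ones for the same relative error, while the GIoU loss is scale-invariant, so the two are combined:

$$\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{\text{iou}}\, \mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}}\, \lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$$

Hyperparameters: $\lambda_{\text{iou}} = 2$, $\lambda_{\text{L1}} = 5$. Both losses are normalized by the number of objects in the batch.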
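The box loss can be sketched for a single matched pair as follows, assuming the IoU term is the generalized IoU loss (1 − GIoU) and that boxes are normalized (cx, cy, w, h) with positive width and height. Function names are hypothetical, and batch normalization of the loss is omitted.

```python
def cxcywh_to_xyxy(box):
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def generalized_iou(a, b):
    ax0, ay0, ax1, ay1 = cxcywh_to_xyxy(a)
    bx0, by0, bx1, by1 = cxcywh_to_xyxy(b)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    # Intersection is zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def box_loss(target, pred, lambda_iou=2.0, lambda_l1=5.0):
    l1 = sum(abs(t - p) for t, p in zip(target, pred))
    return lambda_iou * (1.0 - generalized_iou(target, pred)) + lambda_l1 * l1

print(box_loss([0.5, 0.5, 0.2, 0.2], [0.5, 0.5, 0.2, 0.2]))      # identical boxes
print(box_loss([0.25, 0.25, 0.5, 0.5], [0.75, 0.75, 0.5, 0.5]))  # disjoint boxes
```

Unlike plain IoU, GIoU stays informative for boxes that do not overlap at all, because the enclosing-box term still shrinks as the boxes move closer together.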

Key difference from Faster R-CNN: Faster R-CNN assigns multiple anchors to each ground-truth object, so it can get multiple gradient signals per object. DETR assigns exactly one prediction per object. This is cleaner but means DETR needs longer training to converge (the model has fewer chances to learn each object per image).
Why does DETR use both L1 loss and GIoU loss for bounding boxes?

Chapter 6: Training

DETR has an unusually demanding training recipe. Understanding why reveals deep lessons about set prediction.

300 epochs (!)

While Faster R-CNN converges in ~36 epochs (the "3x schedule"), DETR needs 300 epochs on COCO. Why? Two reasons:

Auxiliary decoding losses

DETR adds prediction FFNs and a Hungarian loss after every decoder layer, not just the last one. These prediction FFNs share their parameters across all 6 decoder layers (the decoder layers themselves do not share weights). This intermediate supervision helps the model output the correct number of objects early in the decoder.

Optimizer: AdamW

Learning rates: $10^{-4}$ for the transformer, $10^{-5}$ for the backbone (10x smaller since the backbone is pre-trained on ImageNet). Weight decay: $10^{-4}$. The learning rate drops by 10x after 200 epochs.
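The schedule can be written down as a small function. This is a sketch of the step schedule only, not an optimizer; the function name, the group names, and the choice of 0-indexed epochs are assumptions.

```python
def learning_rates(epoch, transformer_lr=1e-4, backbone_lr=1e-5, drop_epoch=200, gamma=0.1):
    """Per-group learning rates at a given (0-indexed) epoch."""
    # Both groups are scaled by the same factor once the drop epoch is reached.
    factor = gamma if epoch >= drop_epoch else 1.0
    return {"transformer": transformer_lr * factor,
            "backbone": backbone_lr * factor}

for epoch in (0, 199, 200, 299):
    print(epoch, learning_rates(epoch))
```

In PyTorch, this would typically map onto `torch.optim.AdamW` with two parameter groups plus a step scheduler such as `torch.optim.lr_scheduler.StepLR`.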

Positional encoding

DETR uses fixed 2D sinusoidal encodings for the spatial positions in the encoder (one sine/cosine pair per x and y coordinate). The object queries in the decoder are learned positional embeddings. Positional encodings are added at every attention layer, not just the first.
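A simplified sketch of a fixed 2D sinusoidal encoding is shown below: half of the channels encode the row (y) and half encode the column (x). It uses raw integer positions and a temperature of 10000 as assumptions; the reference implementation may normalize coordinates or order channels differently. `d_model` is assumed to be divisible by 4.

```python
import math

def sine_encoding_1d(position, num_channels, temperature=10000.0):
    """Interleaved sin/cos features of one coordinate at geometrically spaced frequencies."""
    enc = []
    for k in range(num_channels // 2):
        freq = temperature ** (2 * k / num_channels)
        enc.append(math.sin(position / freq))
        enc.append(math.cos(position / freq))
    return enc

def positional_encoding_2d(height, width, d_model=256):
    """Return a height x width grid of d_model-dimensional encodings."""
    half = d_model // 2
    return [[sine_encoding_1d(y, half) + sine_encoding_1d(x, half)
             for x in range(width)]
            for y in range(height)]

pe = positional_encoding_2d(25, 38)
print(len(pe), len(pe[0]), len(pe[0][0]))  # 25 38 256
```

Since the encoding is a fixed function of position, it adds no learned parameters and can be computed for any feature-map size.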

Data augmentation

Scale augmentation (shortest side 480-800, longest side capped at 1333) plus random crop with 50% probability. The random crop is crucial — it forces the encoder's self-attention to learn global spatial relationships, improving AP by ~1 point.
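The scale augmentation arithmetic can be sketched as follows. The function names are hypothetical, and sampling the shortest side uniformly from the integers 480 to 800 is an assumption; only the range and the 1333 cap come from the text.

```python
import random

def resized_size(width, height, short_side, max_side=1333):
    """Scale so the shortest side equals short_side, unless the longest side would exceed max_side."""
    scale = short_side / min(width, height)
    if max(width, height) * scale > max_side:
        # Cap the longest side instead; the shortest side ends up smaller than requested.
        scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)

def sample_training_size(width, height, rng=random):
    short_side = rng.randint(480, 800)  # assumption: uniform over integers
    return resized_size(width, height, short_side)

print(resized_size(640, 480, 800))    # (1067, 800)
print(resized_size(2000, 500, 800))   # (1333, 333)
```

The second call shows the cap in action: a very wide image cannot reach a shortest side of 800 without its longest side exceeding 1333.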

The 300-epoch cost: The paper reports that training the baseline DETR model for 300 epochs takes about 3 days on 16 V100 GPUs. This was a significant concern, and reducing DETR's training time became a major research direction (leading to Deformable DETR, DAB-DETR, DINO, etc.).
Why does DETR require 300 training epochs while Faster R-CNN converges in ~36?

Chapter 7: Results

DETR was evaluated on COCO 2017, the standard object detection benchmark, against highly-tuned Faster R-CNN baselines from Detectron2.

Overall performance

DETR with ResNet-50 achieves 42.0 AP, matching the extensively tuned Faster R-CNN-FPN+ baseline (42.0 AP). With ResNet-101, DETR-R101 reaches 43.5 AP. The models were chosen to have comparable parameter counts.

The size story

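DETR shows a striking pattern across object sizes: compared with the Faster R-CNN baseline, it is markedly stronger on large objects and weaker on small ones. The strength on large objects is plausibly due to the global reasoning of the transformer's self-attention.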

The FPN gap: Faster R-CNN uses a Feature Pyramid Network (FPN) that provides multi-scale features. DETR operates on a single-scale feature map (32x downsampled). This explains the small-object weakness. Adding a dilated backbone (DC5) helps somewhat but at 2x compute cost.

NMS is truly unnecessary

The paper verifies that adding NMS to DETR's outputs does not improve performance. The bipartite matching training has successfully taught the model to produce non-redundant predictions.

On what type of objects does DETR significantly outperform Faster R-CNN, and why?

Chapter 8: Panoptic Segmentation

One of DETR's most compelling results: it extends trivially to panoptic segmentation, the task of assigning every pixel in the image to either a "thing" (countable object instance) or "stuff" (amorphous region like sky or grass).

The extension

A small mask head is added on top of the frozen, pre-trained DETR:

  1. Take each decoder output embedding (one per detected object)
  2. Feed it through a multi-head attention module that attends to the encoder's feature maps, producing an attention heatmap per object
  3. Upsample these heatmaps using a lightweight FPN-style decoder with skip connections from the backbone
  4. For each object, output a binary mask at stride 4 resolution
  5. Merge masks using pixel-wise argmax to produce the panoptic map
Why this is remarkable: The mask head is trained for only 25 epochs with the rest of DETR frozen. Despite this simplicity, DETR-Panoptic achieves 46.0 PQ on COCO panoptic, outperforming the Panoptic FPN baseline (41.6 PQ on things). The unified architecture handles both things and stuff classes with the same mechanism.
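The final merging step (pixel-wise argmax over the per-object masks) can be sketched in plain Python. This toy version assumes each mask is a small H×W grid of scores and skips any filtering of low-confidence or empty masks; the function name `merge_masks` is hypothetical.

```python
def merge_masks(mask_scores):
    """mask_scores: K masks, each an H x W grid of scores.
    Returns an H x W grid holding the index of the highest-scoring mask at each pixel."""
    height, width = len(mask_scores[0]), len(mask_scores[0][0])
    panoptic = []
    for y in range(height):
        row = []
        for x in range(width):
            scores = [mask[y][x] for mask in mask_scores]
            # Each pixel is claimed by exactly one mask, so segments cannot overlap.
            row.append(scores.index(max(scores)))
        panoptic.append(row)
    return panoptic

masks = [
    [[0.9, 0.2], [0.8, 0.1]],   # object 0 dominates the left column
    [[0.1, 0.7], [0.3, 0.6]],   # object 1 dominates the right column
]
print(merge_masks(masks))  # [[0, 1], [0, 1]]
```

Because every pixel receives exactly one label, the output is a valid panoptic map by construction, mirroring how one-to-one matching avoids duplicate boxes.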

This demonstrates DETR's versatility: the object query / set prediction framework generalizes beyond bounding boxes. In principle, each query can produce other structured outputs, such as masks, by changing the prediction head.

How does DETR extend to panoptic segmentation?

Chapter 9: Connections

What came before

What came after

DETR's legacy: DETR did not immediately surpass Faster R-CNN on all metrics. Its impact was conceptual: it showed that detection can be trained end-to-end, without anchors or NMS. Subsequent DETR variants improved the recipe, and by 2023 DETR-family models were among the top performers on COCO detection.
What was the main limitation of the original DETR that subsequent work addressed?