Eliminate anchors, NMS, and all hand-designed post-processing. Treat detection as a set prediction problem and solve it with a transformer and bipartite matching. Elegant simplicity.
Object detection sounds simple: draw a box around every object and label it. But by 2020, the machinery required to do this well had become astonishingly complex.
Every major detector of that era (Faster R-CNN, YOLO, SSD, RetinaNet) relies on a cascade of hand-designed components, including anchor boxes and non-maximum suppression (NMS).
Each component has hyperparameters. Each hyperparameter requires tuning. Each interaction between components creates subtle bugs. The result: modern detectors are powerful, but fragile and complex.
Here is DETR's core idea, and it is beautifully simple: treat object detection as a set prediction problem.
What does that mean? Instead of asking "is there an object at anchor position (i, j, k)?", DETR asks: "given this image, predict a set of N objects." The output is a fixed-size set — typically N=100 — where each element is either a detected object (class + bounding box) or a special "no object" (∅) token.
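A minimal sketch of reading such a fixed-size output, assuming each slot carries a list of class probabilities whose last entry is the ∅ class. The data layout and the `decode_set` helper are illustrative assumptions, not DETR's actual API:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), normalized to [0, 1]

def decode_set(slots: List[Tuple[List[float], Box]]) -> List[Tuple[int, float, Box]]:
    """Keep only the slots whose most likely class is a real object.

    Assumption: the last probability in each slot is the "no object" class.
    """
    detections = []
    for probs, box in slots:
        no_object = len(probs) - 1
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != no_object:
            detections.append((label, probs[label], box))
    return detections

# Three slots, two real classes (0 and 1) plus "no object" (index 2).
slots = [
    ([0.90, 0.05, 0.05], (0.5, 0.5, 0.2, 0.3)),
    ([0.10, 0.10, 0.80], (0.1, 0.1, 0.1, 0.1)),  # empty slot
    ([0.20, 0.70, 0.10], (0.7, 0.4, 0.3, 0.2)),
]
detections = decode_set(slots)
```

The output always has the same number of slots; the number of detections varies only through how many slots choose ∅.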
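This reframing changes everything: with one prediction per object, there are no anchors to design and no duplicate detections to suppress with NMS.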
But how do we train a set predictor? The key challenge is assignment: which predicted box should be compared to which ground-truth box? Traditional detectors use anchor-matching heuristics. DETR uses something far more elegant: the Hungarian algorithm for optimal bipartite matching.
The entire DETR pipeline: CNN backbone extracts features → transformer encoder-decoder reasons about them → a shared prediction head maps each of the N decoder outputs, in parallel, to a box and a class. That's it. No anchors. No NMS. No hand-designed post-processing.
DETR's architecture has three main components (a CNN backbone, an encoder-decoder transformer, and the prediction heads), each remarkably simple:
A standard ResNet (typically ResNet-50) takes the input image x_img ∈ ℝ^(3×H₀×W₀) and produces a feature map f ∈ ℝ^(C×H×W), where C = 2048 and the spatial dimensions are reduced by 32× (H = H₀/32, W = W₀/32). A 1×1 convolution projects this to a lower dimension d = 256.
The feature map is flattened to a sequence of H×W tokens, each of dimension d=256. Fixed sinusoidal positional encodings are added (2D, one for each spatial position). The encoder applies 6 layers of multi-head self-attention + FFN, letting every spatial position attend to every other. This builds a globally-aware feature representation.
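The shape bookkeeping for the backbone and the flattening step can be sketched in a few lines. The helper name `detr_token_shapes` is hypothetical, and rounding the spatial size up for inputs not divisible by 32 is an assumption:

```python
import math

def detr_token_shapes(h0: int, w0: int, stride: int = 32, c: int = 2048, d: int = 256):
    """Return (backbone map shape, projected map shape, encoder sequence shape)."""
    # Assumption: spatial sizes that are not multiples of the stride round up.
    h, w = math.ceil(h0 / stride), math.ceil(w0 / stride)
    return (c, h, w), (d, h, w), (h * w, d)

backbone, projected, sequence = detr_token_shapes(800, 1216)
# backbone  -> (2048, 25, 38)
# projected -> (256, 25, 38)
# sequence  -> (950, 256)
```

For an 800×1216 input the encoder sees 950 tokens, so each self-attention head compares 950 × 950 = 902,500 token pairs. This quadratic growth is why the 32× downsampling matters.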
The decoder takes N=100 learned object queries as input. Each decoder layer applies: (a) self-attention among the object queries (so they can coordinate to avoid predicting the same object), then (b) cross-attention from object queries to encoder features (so each query can "look at" the image). After 6 decoder layers, each query outputs a d-dimensional embedding.
Each of the N=100 output embeddings is independently passed through shared prediction heads: (a) a linear layer followed by softmax predicts the class label (including ∅), and (b) a 3-layer FFN predicts the bounding box as 4 normalized coordinates (center x, center y, width, height).
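Because the boxes are normalized (center x, center y, width, height), turning one into pixel corners is a small conversion. A sketch, with the function name `cxcywh_to_xyxy_pixels` chosen here for illustration:

```python
def cxcywh_to_xyxy_pixels(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * img_w
    y0 = (cy - h / 2) * img_h
    x1 = (cx + w / 2) * img_w
    y1 = (cy + h / 2) * img_h
    return (x0, y0, x1, y1)

# A box centered in a 640x480 image, a quarter as wide and half as tall.
corners = cxcywh_to_xyxy_pixels((0.5, 0.5, 0.25, 0.5), img_w=640, img_h=480)
# corners -> (240.0, 120.0, 400.0, 360.0)
```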
Object queries are perhaps the most novel component of DETR. They are N=100 learned embedding vectors, each of dimension d=256. They are not derived from the image — they are parameters of the model, learned during training.
Think of each object query as a "detection slot." During training, each slot learns to specialize in certain image regions and box sizes.
The decoder processes all 100 queries in parallel. Through self-attention, queries communicate with each other: "I'm already detecting the dog, so you don't need to." Through cross-attention to the encoder output, each query "looks at" the relevant parts of the image to refine its prediction.
The paper visualizes the boxes predicted by individual query slots over the COCO validation set and finds that different slots specialize in different image regions and box sizes. This specialization emerges from training alone; no spatial prior is manually encoded.
Here is the training challenge: DETR outputs N=100 predictions, but the image might contain only 3 objects. Which 3 of the 100 predictions should we compare to the 3 ground-truth boxes?
Traditional detectors use heuristic rules: "assign each anchor to the ground-truth box with the highest IoU, if the IoU exceeds 0.5." DETR replaces this with a principled, globally optimal solution: the Hungarian algorithm.
For every possible (prediction, ground-truth) pair, compute a matching cost. This cost combines: (a) the predicted probability for the correct class (higher is better, hence the minus sign), and (b) a box distance (lower is better). The result is an N×M cost matrix, where M is the number of ground-truth objects in the image.
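In the notation of the DETR paper, where c_i and b_i are the class and box of ground-truth element i, and p̂ and b̂ are the class probabilities and box of the prediction assigned to it by σ, the pairwise cost can be written as follows. The indicator zeroes the cost for the ∅ entries used to pad the ground truth up to N elements:

```latex
\mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)
  = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\,
    \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\sigma(i)}\bigr),
\qquad
\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N}
  \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)
```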
The Hungarian algorithm finds the permutation σ that minimizes the total cost — the optimal one-to-one assignment. This runs in O(N³) time, but with N=100 it is negligible compared to the network forward pass.
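To make the assignment step concrete, here is a toy sketch that finds the same optimum by brute force over all one-to-one assignments. It is a stand-in for the Hungarian algorithm, not an implementation of it, and it simplifies the box cost to a plain L1 distance. All names are illustrative. In practice, a solver such as `scipy.optimize.linear_sum_assignment` handles the same linear assignment problem efficiently.

```python
from itertools import permutations

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def matching_cost(pred, target):
    """Cost of pairing one prediction with one ground-truth object."""
    probs, box = pred
    label, gt_box = target
    # Simplification: box distance is plain L1 here.
    return -probs[label] + l1(box, gt_box)

def best_assignment(preds, targets):
    """Try every one-to-one assignment; assign[j] is the prediction for target j.

    Predictions left unassigned are implicitly matched to "no object".
    """
    best, best_cost = None, float("inf")
    for assign in permutations(range(len(preds)), len(targets)):
        cost = sum(matching_cost(preds[i], targets[j]) for j, i in enumerate(assign))
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Four predictions (class probabilities, box); last class index is "no object".
preds = [
    ([0.10, 0.80, 0.10], (0.70, 0.40, 0.30, 0.20)),
    ([0.05, 0.05, 0.90], (0.10, 0.10, 0.10, 0.10)),
    ([0.90, 0.05, 0.05], (0.50, 0.50, 0.20, 0.30)),
    ([0.60, 0.20, 0.20], (0.52, 0.50, 0.20, 0.30)),  # near-duplicate of slot 2
]
targets = [(0, (0.50, 0.50, 0.20, 0.30)), (1, (0.70, 0.40, 0.30, 0.20))]
assign, total_cost = best_assignment(preds, targets)
```

Here target 0 is matched to prediction 2 and target 1 to prediction 0. The near-duplicate prediction 3 is left unmatched, so training pushes it toward ∅, which is how the one-to-one constraint discourages duplicates.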
Once the Hungarian algorithm has found the optimal assignment σ̂, we compute the Hungarian loss on the matched pairs:
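In the paper's notation, with the ground truth y padded with ∅ to N elements, c_i and b_i the class and box of element i, and p̂ and b̂ the predicted class probabilities and box, the loss reads:

```latex
\mathcal{L}_{\text{Hungarian}}(y, \hat{y})
  = \sum_{i=1}^{N} \Bigl[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\,
    \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\hat{\sigma}(i)}\bigr) \Bigr]
```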
This loss has two parts for each matched pair:
Classification loss: standard cross-entropy. For predictions matched to a real object, this encourages the correct class. For predictions matched to ∅, this encourages the "no object" class. The ∅ class is down-weighted by a factor of 10 to handle class imbalance (most of the 100 slots will be empty).
Box loss: a weighted combination of two complementary losses, an L1 distance on the box coordinates and a generalized IoU (GIoU) loss, which unlike plain L1 is scale-invariant.
Hyperparameters: λ_iou = 2, λ_L1 = 5. Both losses are normalized by the number of objects in the batch.
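A sketch of the box loss for a single matched pair, assuming the two terms are an L1 distance and a generalized IoU (GIoU) loss as in the DETR paper, combined as λ_iou · (1 − GIoU) + λ_L1 · L1. The function names are illustrative, and the batch-level normalization is left out:

```python
def to_xyxy(box):
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def generalized_iou(a, b):
    """GIoU of two (x0, y0, x1, y1) boxes with positive area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, lambda_iou=2.0, lambda_l1=5.0):
    """Weighted GIoU + L1 loss for one pair of normalized (cx, cy, w, h) boxes."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    giou_loss = 1.0 - generalized_iou(to_xyxy(pred), to_xyxy(target))
    return lambda_iou * giou_loss + lambda_l1 * l1

perfect = box_loss((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))
disjoint = box_loss((0.2, 0.2, 0.2, 0.2), (0.8, 0.8, 0.2, 0.2))
```

A perfect prediction gives a loss of 0. The disjoint pair gives about 9.75, and the GIoU term still provides a useful signal there because it depends on how far apart the boxes are, whereas plain IoU would be 0 for any non-overlapping pair.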
DETR has an unusually demanding training recipe. Understanding why reveals deep lessons about set prediction.
While Faster R-CNN converges in ~36 epochs (the "3x schedule"), DETR needs 300 epochs on COCO. Why? Two reasons:
DETR adds prediction FFNs and Hungarian loss after every decoder layer, not just the last one. These prediction FFNs share their parameters across all 6 decoder layers. This provides intermediate supervision that helps the model output the correct number of objects early in the decoder.
Learning rates: 10⁻⁴ for the transformer, 10⁻⁵ for the backbone (10x smaller since the backbone is pre-trained on ImageNet). Weight decay: 10⁻⁴. Learning rate drops by 10x at epoch 200.
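The schedule above amounts to a single step decay. A small sketch, assuming 0-based epoch indexing and that the drop applies to both parameter groups (the function and key names are illustrative):

```python
def learning_rates(epoch, drop_epoch=200, base_lr=1e-4, backbone_lr=1e-5, gamma=0.1):
    """Per-group learning rates under a single 10x step decay."""
    # Assumption: the same decay factor scales both groups once drop_epoch is reached.
    scale = gamma if epoch >= drop_epoch else 1.0
    return {"transformer": base_lr * scale, "backbone": backbone_lr * scale}

early = learning_rates(0)    # before the drop
late = learning_rates(250)   # after the drop
```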
DETR uses fixed 2D sinusoidal encodings for the spatial positions in the encoder: each of the two spatial coordinates is encoded independently with d/2 sine and cosine functions of different frequencies, and the two halves are concatenated. The object queries in the decoder act as learned positional embeddings. Positional encodings are added at every attention layer, not just the first.
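A simplified sketch of a fixed 2D sinusoidal encoding for one grid position. The channel ordering (y half first, then x), the temperature of 10000, and the absence of coordinate normalization are assumptions made for illustration. They follow the usual transformer recipe and do not reproduce the official code exactly:

```python
import math

def sine_position_2d(x, y, d=256, temperature=10000.0):
    """Encode one (x, y) grid position as a d-dimensional vector.

    The first d/2 channels encode y and the last d/2 encode x. Within each
    half, sine and cosine channels alternate and share a frequency pairwise.
    """
    half = d // 2

    def encode(pos):
        out = []
        for i in range(half):
            freq = temperature ** (2 * (i // 2) / half)
            angle = pos / freq
            out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        return out

    return encode(y) + encode(x)

pe = sine_position_2d(x=3, y=7)  # one vector per spatial position, length d
```

Because the encoding is a fixed function of position, it adds no learned parameters and can be recomputed for any feature-map size.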
Scale augmentation (shortest side 480-800, longest side capped at 1333) plus random crop with 50% probability. The random crop helps the encoder's self-attention learn global spatial relationships and improves AP by about 1 point.
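The resize rule can be sketched as follows. Sampling the target shortest side uniformly from the integers 480 to 800 is an assumption of this sketch, both helper names are hypothetical, and the random-crop step is omitted:

```python
import random

def resized_size(w, h, short_side, max_side=1333):
    """Target (width, height): shorter side equals short_side unless that
    would push the longer side past max_side, in which case the cap wins."""
    scale = short_side / min(w, h)
    if max(w, h) * scale > max_side:
        scale = max_side / max(w, h)
    return round(w * scale), round(h * scale)

def sample_train_size(w, h, rng=random):
    # Assumption: shortest side drawn uniformly from [480, 800].
    short_side = rng.randint(480, 800)
    return resized_size(w, h, short_side)

regular = resized_size(640, 480, 800)      # -> (1067, 800)
elongated = resized_size(1000, 400, 800)   # -> (1333, 533), cap applies
```

For elongated images the 1333 cap takes priority, so the shortest side can end up below the sampled target.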
DETR was evaluated on COCO 2017, the standard object detection benchmark, against highly-tuned Faster R-CNN baselines from Detectron2.
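DETR with ResNet-50 achieves 42.0 AP, matching the 42.0 AP of Faster R-CNN-FPN+, a Faster R-CNN baseline that was itself strengthened with a longer schedule and extra tuning. With ResNet-101, DETR-R101 reaches 43.5 AP. The ResNet-50 models have a comparable number of parameters (about 41M for DETR), while their reported FLOPs differ: 86 GFLOPs for DETR versus 180 GFLOPs for the FPN baseline.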
DETR shows a striking pattern across object sizes:
The paper verifies that adding NMS to DETR's outputs does not improve performance. The bipartite matching training has successfully taught the model to produce non-redundant predictions.
One of DETR's most compelling results: it extends trivially to panoptic segmentation, the task of assigning every pixel in the image to either a "thing" (countable object instance) or "stuff" (amorphous region like sky or grass).
A small mask head is added on top of the frozen, pre-trained DETR:
This demonstrates DETR's versatility: the object query / set prediction framework generalizes beyond bounding boxes. Each query can produce any structured output — boxes, masks, keypoints — by simply changing the prediction head.