Train one supernet. Search thousands of sub-architectures for free. The first real-time detector to break 60 AP on COCO — and it generalizes to 100 diverse real-world datasets.
You have a custom dataset — maybe aerial imagery of solar panels, or underwater footage of coral, or warehouse shelves full of parcels. You need a fast, accurate object detector. What do you do?
Option A: Use a vision-language model like GroundingDINO. It is powerful out of the box — just describe what you want to find in text. But it runs at 428ms per image. That is not real-time. Fine-tune it on your data and performance improves, but the text encoder still drags latency far beyond what interactive applications need.
Option B: Use a specialist detector like YOLOv11 or D-FINE. These run in 2–6ms. Fast enough. But there is a hidden problem: these models have been implicitly overfit to COCO.
What does "overfit to COCO" mean? It means their architectures, learning rate schedules, augmentation pipelines, and even model sizes have been tuned relentlessly on the 80 categories and ~118k images of the COCO benchmark. When you deploy them on a real-world dataset with different statistics — more objects per image, fewer training examples, unusual aspect ratios, domain-specific classes — performance collapses.
There is a deeper issue. Every time you switch hardware (T4 GPU → Jetson Nano → mobile NPU), you want a different accuracy-latency tradeoff. With traditional detectors, that means training a completely new model for each target point. YOLOv11 has nano, small, medium, large, and extra-large variants — five separate training runs, five separate architectures.
So: can we build a detector that (1) inherits internet-scale knowledge for strong transfer, (2) runs in real-time, and (3) lets us discover thousands of accuracy-latency tradeoffs from a single training run?
RF-DETR's core idea can be stated in one sentence: train a single supernet that contains thousands of sub-architectures inside it, then search for the best one for your target dataset and hardware — without any retraining.
Let's unpack what that means.
Imagine you train a detector with 6 decoder layers. At inference, you can just... drop the last 3 layers and use only the first 3. The predictions from layer 3 are still valid because the model was trained with a loss at every decoder layer (not just the final one). You get a faster model for free. No retraining needed.
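Now extend this idea to every architectural dimension: patch size, input resolution, number of decoder layers, number of query tokens, and number of attention windows.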
During training, every iteration randomly samples a different combination of these settings. The supernet learns to perform well at ALL configurations simultaneously. This is the weight-sharing NAS from OFA (Once-for-All), applied end-to-end to object detection for the first time.
Weight-sharing NAS tells you how to search. But you still need a strong starting point. RF-DETR initializes its backbone with DINOv2 — a vision foundation model trained on 142M images with self-supervised learning. This gives the detector rich, transferable features even before seeing a single detection label. On small datasets (some RF100-VL datasets have fewer than 100 images), this pre-training is the difference between a working detector and random chance.
The combination is powerful: DINOv2 provides internet-scale knowledge. Weight-sharing NAS provides hardware-adaptive flexibility. Together, you get one training run that yields thousands of Pareto-optimal detectors.
RF-DETR builds on LW-DETR (Lightweight DETR) but modernizes every component. Let's trace an image through the entire pipeline.
The input image $x \in \mathbb{R}^{3 \times H \times W}$ is split into non-overlapping patches of size P×P (default P=16). Each patch is linearly projected to a d-dimensional token. For an image of resolution 640×640 with patch size 16, that is (640/16)² = 1,600 tokens.
These tokens pass through the DINOv2 ViT-S encoder (12 transformer layers, d=384). The backbone is initialized with DINOv2's self-supervised weights — 142 million images of knowledge baked in before we see a single bounding box.
The projector takes the single-scale ViT output and creates a multi-scale feature pyramid. It reshapes the 1D token sequence back to 2D spatial maps, then produces features at multiple resolutions through strided convolutions and upsampling. These multi-scale features feed into the decoder's deformable cross-attention (which needs to sample from different scales).
A critical engineering choice: the projector uses layer normalization instead of batch normalization. Why? Batch norm statistics are unreliable with small batches, and RF-DETR needs gradient accumulation on consumer GPUs (which means effective batch size varies). Layer norm is batch-size-independent. This costs ~1% AP but enables training on a single GPU.
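The multi-scale features pass through an encoder that interleaves two types of attention blocks: windowed blocks, which restrict self-attention to local windows, and non-windowed blocks, which attend globally across the whole image.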
The default pattern is 2 windowed blocks followed by 1 non-windowed block. This balances global context with computational efficiency.
The decoder takes N query tokens (default N=300) and refines them through multiple decoder layers. Each decoder layer has:
Each of the N query outputs is independently decoded by two shared heads: a classification head that produces class logits and a box head that regresses the bounding box.
Crucially, the detection loss is applied at every decoder layer, not just the last one. This is what enables decoder layer dropping at inference — every intermediate output is trained to be a valid detection.
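To make the layer-dropping idea concrete, here is a minimal toy sketch. It is not RF-DETR's code: `run_decoder` and `make_toy_layer` are hypothetical names, and each "layer" is a stand-in that simply moves an estimate halfway toward a target. The point is only that truncating the layer list still yields a usable output when every layer's output is a valid prediction.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]

def run_decoder(queries: Sequence[float], layers: Sequence[Layer], num_layers: int) -> List[float]:
    """Apply only the first `num_layers` decoder layers (0 skips the decoder entirely)."""
    if not 0 <= num_layers <= len(layers):
        raise ValueError("num_layers must be between 0 and len(layers)")
    out = list(queries)
    for layer in layers[:num_layers]:
        out = layer(out)
    return out

def make_toy_layer(target: float) -> Layer:
    # Hypothetical stand-in for a trained layer: each call halves the remaining error.
    return lambda xs: [x + 0.5 * (target - x) for x in xs]

layers = [make_toy_layer(1.0) for _ in range(6)]
full = run_decoder([0.0], layers, num_layers=6)
truncated = run_decoder([0.0], layers, num_layers=3)
print(full, truncated)  # [0.984375] [0.875]
```

In the toy, the 3-layer output is less refined than the 6-layer one but still close to the target, which mirrors the accuracy-for-latency trade described above.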
Here is a problem that might not be obvious at first: ViT backbones output features at a single scale. Every token corresponds to one P×P patch. But objects in an image vary wildly in size — a person might span 300 pixels while a traffic light spans 20. A single-scale feature map cannot efficiently represent both.
Traditional CNN-based detectors (Faster R-CNN, YOLO, multi-scale DETR variants) typically solve this with a Feature Pyramid Network (FPN) or a similar neck that builds multi-scale representations from different stages of a CNN backbone (stride 8, 16, 32, 64). But ViTs don't have "stages": all tokens live at the same resolution.
The projector takes the ViT output (40×40 spatial map for 640×640 input with P=16) and creates three scales:
All three scales have the same channel dimension (256) but different spatial resolutions. The deformable cross-attention in the decoder samples from all three scales, so small objects can be detected from the high-resolution Scale 1, and large objects from the low-resolution Scale 3.
Here is a clever design choice: the segmentation head also needs spatial features, but instead of building a separate pathway, it bilinearly upsamples the same projector output. This ensures that detection and segmentation see the same spatial organization of features, and adding segmentation doesn't double the feature extraction cost.
The original LW-DETR uses batch normalization in the projector. RF-DETR switches to layer normalization. Let's trace why this matters:
With batch norm, you need large batch sizes for stable running statistics. DINOv2-based RF-DETR is larger than CAEv2-based LW-DETR, so fitting a large batch on a single consumer GPU (say, 24GB RTX 4090) is harder. The solution is gradient accumulation — process 4 micro-batches of 2 images each and accumulate gradients to simulate batch size 8. But batch norm computes statistics per micro-batch (size 2), which is too noisy. Layer norm computes per-token statistics, independent of batch size. Problem solved.
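The batch-size argument can be checked with a small pure-Python sketch. It is a simplification under stated assumptions: no learnable scale or shift, training-mode batch statistics only, and a "token" is just a list of four channel values. The function names are illustrative, not taken from any library.

```python
from statistics import fmean

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list of numbers."""
    mean = fmean(values)
    var = fmean([(v - mean) ** 2 for v in values])
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def layer_norm(batch):
    # Statistics come from each token's own channels, so other samples never matter.
    return [normalize(token) for token in batch]

def batch_norm(batch):
    # Statistics come from the batch dimension, one channel at a time.
    columns = [normalize(list(channel)) for channel in zip(*batch)]
    return [list(token) for token in zip(*columns)]

token = [1.0, 2.0, 3.0, 4.0]
micro_batch_a = [token, [0.0, 0.0, 0.0, 0.0]]  # same token, different neighbour
micro_batch_b = [token, [5.0, 5.0, 5.0, 5.0]]

ln_a, ln_b = layer_norm(micro_batch_a)[0], layer_norm(micro_batch_b)[0]
bn_a, bn_b = batch_norm(micro_batch_a)[0], batch_norm(micro_batch_b)[0]
print(ln_a == ln_b)  # True: layer norm ignores the rest of the micro-batch
print(bn_a, bn_b)    # batch norm output flips sign depending on the neighbour
```

With a micro-batch of two, the batch-norm output for the same token is driven almost entirely by whichever other image happens to share the micro-batch, which is the noise the paragraph above refers to.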
The cost: switching from batch norm to layer norm drops AP by about 1%. But it enables training on consumer hardware, which is the whole point of making detection accessible.
This is the heart of RF-DETR. The NAS search space defines five "tunable knobs" that you can twist at inference time to slide along the accuracy-latency Pareto curve. Each knob trades compute for quality in a different way. Let's understand each one deeply.
ViT divides the image into patches of P×P pixels. The number of tokens is (H/P)×(W/P). Self-attention is O(tokens²), so patch size has a quadratic effect on compute.
| Patch Size | Tokens (640×640) | Self-Attention Cost | Effect |
|---|---|---|---|
| 8×8 | 6,400 | 41M ops | Fine-grained but very expensive |
| 14×14 | ~2,090 | 4.4M ops | Default DINOv2 balance (640 is not a multiple of 14, so the count is approximate) |
| 16×16 | 1,600 | 2.6M ops | RF-DETR default |
| 32×32 | 400 | 160K ops | Coarse but very fast |
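
The numbers in the table follow directly from the token-count formula. The sketch below recomputes them for the patch sizes that divide 640 evenly; it counts token pairs only, which is a rough proxy for attention cost rather than a FLOP count, and the helper names are illustrative.

```python
def token_count(height, width, patch):
    """Number of ViT tokens, N = (H/P) x (W/P), for sizes divisible by the patch."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch) * (width // patch)

def attention_pairs(num_tokens):
    # Global self-attention scores every token against every token.
    return num_tokens ** 2

for patch in (8, 16, 32):
    n = token_count(640, 640, patch)
    print(f"P={patch:>2}: {n:>5} tokens, {attention_pairs(n):>10,} token pairs")
```

Halving the patch size quadruples the token count and multiplies the pair count by sixteen, which is why patch size is such a strong latency knob.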
But DINOv2 was pre-trained with P=14. How do you use P=16 or P=32 without retraining? FlexiViT-style patch embedding interpolation. The original patch embedding is a [384, 3, 14, 14] convolution kernel. To use P=16, you bilinearly resize the kernel to [384, 3, 16, 16]. The positional embeddings are similarly interpolated to match the new grid size.
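As a rough illustration of the kernel resize, here is a pure-Python bilinear interpolation of a single 14×14 slice to 16×16. This is a sketch under assumptions: it handles one 2D slice (a real [384, 3, 14, 14] kernel would be resized slice by slice), it keeps corner values aligned, and it does not rescale values. Actual implementations may use a different resize operator or normalization.

```python
def bilinear_resize(grid, new_h, new_w):
    """Bilinearly resize a 2D list of floats, keeping the four corners fixed."""
    old_h, old_w = len(grid), len(grid[0])
    out = []
    for i in range(new_h):
        # Map output row i to a fractional source row.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            # Blend the four neighbouring source values.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

# Toy 14x14 "kernel slice" with made-up values.
kernel_14 = [[float(r * 14 + c) for c in range(14)] for r in range(14)]
kernel_16 = bilinear_resize(kernel_14, 16, 16)
print(len(kernel_16), len(kernel_16[0]))  # 16 16
```

The same function could be applied to a 2D positional-embedding grid, one channel at a time, to match a new token grid size.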
RF-DETR trains with N decoder layers (e.g., 6) and applies the detection loss at every layer's output. At inference, you can use anywhere from 0 to N layers:
Query tokens set the maximum number of detections. If your dataset averages 5 objects per image (like a robotics grasping task), you don't need 300 queries — 50 might suffice. Fewer queries means less self-attention cost and less post-processing.
The key question: which queries do you keep when dropping? RF-DETR orders queries by the maximum sigmoid of their class logits at the encoder output. The most confident queries survive. This is an implicit form of top-K selection based on the model's own confidence.
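The selection rule described above can be sketched in a few lines. The function name and the toy logits are illustrative assumptions; the only thing taken from the text is the ranking criterion, the maximum sigmoid over each query's class logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def select_top_queries(class_logits, k):
    """Return indices of the k queries with the highest max class probability."""
    scores = [max(sigmoid(logit) for logit in logits) for logits in class_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Four queries, two classes each (made-up logits).
logits = [[-2.0, 0.1], [3.0, -1.0], [0.5, 0.4], [-3.0, -4.0]]
print(select_top_queries(logits, k=2))  # [1, 2]
```

Because sigmoid is monotonic, ranking by the maximum raw logit would give the same order; the sigmoid is kept here only to match the description.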
Higher resolution captures more detail for small objects but generates more tokens. RF-DETR pre-allocates a large positional embedding grid (for the maximum resolution) and bilinearly interpolates it for smaller resolutions. This allows arbitrary resolution at test time.
Each windowed attention block restricts self-attention to a local window. More windows (smaller windows) means less compute but less global context. Fewer windows (larger windows) means each token can "see" more of the image. At the extreme, 1 window = full global attention.
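A back-of-the-envelope sketch of why more windows means less compute. It assumes tokens are split into equally sized windows and that attention is computed only within each window; how RF-DETR actually partitions the grid is not specified here, so treat the function as an approximation with a hypothetical name.

```python
def windowed_attention_pairs(num_tokens, num_windows=1):
    """Token pairs scored when attention is restricted to equal-sized windows."""
    if num_tokens % num_windows:
        raise ValueError("num_tokens must split evenly into windows")
    per_window = num_tokens // num_windows
    return num_windows * per_window ** 2

for windows in (1, 2, 4):
    print(windows, windowed_attention_pairs(1600, windows))
```

Under these assumptions the pair count falls as N²/w, so going from 1 window to 4 cuts the attention work to a quarter while each token sees a quarter of the image.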
Your hardware budget: 5ms per frame on a T4 GPU. The grid search reveals that the best configuration at ~5ms is:
Result: 54.7 AP on COCO. A different target (say 2.3ms) yields: patch size 16, resolution 560, 3 decoder layers, 100 queries, 2 windows → 48.0 AP. All from the same trained weights.
We've seen what the five knobs are. Now let's understand how the supernet is trained to be good at all configurations simultaneously.
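At every training iteration, one value for each of the five knobs is sampled uniformly at random, and the forward and backward pass run through the resulting sub-architecture.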
The uniform random sampling is important. If you always trained at high resolution, the model would underperform at low resolution. If you always used all decoder layers, the early layers wouldn't learn to produce good standalone predictions. By sampling uniformly, every configuration gets roughly equal training signal.
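A minimal sketch of the per-iteration sampling, assuming the training-time choices match the grid used in the search loop later in this article (that equivalence is an assumption, as are the dictionary keys). It does not check that the resolution is divisible by the patch size, which a real trainer would need to handle.

```python
import random

# Assumed search space; values mirror the search loop shown later in the article.
SEARCH_SPACE = {
    "patch_size": [8, 12, 16, 20, 24, 32],
    "resolution": [384, 448, 512, 576, 640],
    "num_decoder_layers": list(range(0, 7)),
    "num_queries": [50, 100, 200, 300],
    "num_windows": [1, 2, 4],
}

def sample_config(rng):
    """Draw each knob independently and uniformly from its list of choices."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for step in range(3):
    config = sample_config(rng)
    # A real training loop would run forward/backward on this sub-architecture here.
    print(step, config)
```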
Here is the counterintuitive finding from Table 5 of the paper. Adding weight-sharing NAS improves the base configuration (patch size 14, all decoder layers, full resolution) by 0.3 AP, even though patch size 14 is not in the NAS search space.
The likely explanation: architecture augmentation acts as a regularizer. Training with random sub-architectures prevents the model from relying on fragile co-adaptations between components. The same reason dropout works — but applied to the computational graph structure rather than individual activations.
RF-DETR's training pipeline has three stages, the first of which is inherited from DINOv2:
| Stage | Dataset | Epochs | Purpose |
|---|---|---|---|
| 1. Backbone init | DINOv2 (142M images) | — | Self-supervised visual features |
| 2. Detection pre-train | Objects365 (2M images) | 60 | Detection-specific knowledge |
| 3. Target fine-tune | COCO / RF100-VL / yours | 100+ | Domain adaptation with NAS |
The Objects365 pre-training is crucial. It teaches the projector and decoder to detect objects before fine-tuning on the target domain. Without it, the DINOv2 features would need to learn detection from scratch, requiring many more epochs on small target datasets.
After training completes, the NAS search is trivially simple:
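```python
results = []
for patch_size in [8, 12, 16, 20, 24, 32]:
    for resolution in [384, 448, 512, 576, 640]:
        for n_dec in range(0, 7):
            for n_queries in [50, 100, 200, 300]:
                for n_windows in [1, 2, 4]:
                    # Bundle the five knobs into one inference-time configuration.
                    config = dict(patch_size=patch_size, resolution=resolution,
                                  n_dec=n_dec, n_queries=n_queries, n_windows=n_windows)
                    ap = evaluate(model, val_set, config)
                    lat = measure_latency(model, config)
                    results.append((ap, lat, config))
```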
That is 6×5×7×4×3 = 2,520 configurations. Each evaluation is a fast forward pass on the validation set. The entire search completes in minutes to hours depending on validation set size.
From these results, extract the Pareto frontier: for each latency value, keep only the configuration with the highest AP.
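One way to extract the frontier from a list of `(ap, latency, config)` tuples is sketched below. It keeps a configuration only if no configuration that is at least as fast has equal or higher AP, which is slightly stricter than "best AP per latency value". The sample numbers are made up for illustration and are not results from the paper.

```python
def pareto_frontier(results):
    """Return non-dominated (ap, latency_ms, config) tuples, fastest first."""
    frontier = []
    best_ap = float("-inf")
    # Sort by latency; among equal latencies, try the most accurate first.
    for ap, latency, config in sorted(results, key=lambda r: (r[1], -r[0])):
        if ap > best_ap:
            frontier.append((ap, latency, config))
            best_ap = ap
    return frontier

# Toy results: (AP, latency in ms, config label). Values are invented.
toy_results = [
    (40.0, 2.0, "A"),
    (39.0, 2.5, "B"),  # slower and worse than A: dominated
    (50.0, 4.0, "C"),
    (49.0, 4.0, "D"),  # same latency as C but lower AP: dominated
    (45.0, 5.0, "E"),  # slower and worse than C: dominated
]
print([config for _, _, config in pareto_frontier(toy_results)])  # ['A', 'C']
```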
This chapter addresses a problem that most papers don't even acknowledge: training schedulers are a form of overfitting to benchmark characteristics.
Almost every modern detector uses a cosine learning rate schedule. This decays the learning rate at step $t$ as:

$$\mathrm{lr}(t) = \mathrm{lr}_{\min} + \tfrac{1}{2}\left(\mathrm{lr}_{\max} - \mathrm{lr}_{\min}\right)\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

where T is the total number of training steps. The problem: you must know T in advance. This works fine for COCO (118k images, always trained for 12/24/36 epochs). But for RF100-VL's datasets, which range from 50 to 50,000 images, the ideal T varies wildly. A cosine schedule tuned for 100 epochs on a 50-image dataset produces a completely different learning curve than 100 epochs on a 50,000-image dataset.
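The dependence on the horizon is easy to see numerically. The sketch below implements the standard cosine schedule (the same formula listed in the summary table at the end of this article) and evaluates the same step under two different values of T; the learning-rate values are arbitrary examples.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-decayed learning rate at `step`, given a known horizon `total_steps`."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

lr_max = 1e-4  # arbitrary example value
short_run = cosine_lr(step=500, total_steps=1_000, lr_max=lr_max)
long_run = cosine_lr(step=500, total_steps=10_000, lr_max=lr_max)
print(f"step 500 of  1,000: {short_run:.2e}")
print(f"step 500 of 10,000: {long_run:.2e}")
```

At the same step, the short run has already halved its learning rate while the long run has barely started to decay, so a wrong guess for T changes the whole optimization trajectory.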
State-of-the-art detectors use aggressive augmentations: VerticalFlip, RandomResize, CachedMixUp, Mosaic, HSV jitter. Each augmentation encodes domain assumptions:
RF-DETR uses only two augmentations:
That's it. No vertical flip, no mosaic, no HSV jitter, no mixup. The rationale: with DINOv2's internet-scale pre-training and weight-sharing NAS as architectural regularization, aggressive augmentation is unnecessary. The model already has diverse visual knowledge baked in.
LW-DETR applies random resize per image, then pads each image in the batch to match the largest one. The result: most images have significant padding, which wastes computation and introduces window artifacts in the attention mechanism.
RF-DETR resizes at the batch level: all images in a batch get the same random resolution. No padding needed. This has two benefits:
Detection gives you bounding boxes. But what if you need precise pixel-level masks for each object? RF-DETR-Seg adds a lightweight segmentation head that shares the same projector output as the detection head.
The segmentation head is surprisingly simple. Here is the complete data flow:
The mask for each detected object is the dot product of its query embedding with every pixel embedding. High dot product = this pixel belongs to this object. Low dot product = background. That's it — no mask head, no ROI pooling, no crop-and-resize.
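The dot-product mask can be sketched without any framework. Here `pixel_embeddings[y][x]` is a d-dimensional vector and the query is a d-dimensional vector. Thresholding the raw dot product at 0 (equivalent to a sigmoid above 0.5) is an assumption made for the example, as are the function name and the toy values.

```python
def predict_mask(query, pixel_embeddings, threshold=0.0):
    """Binary mask: 1 where the query/pixel dot product exceeds the threshold."""
    return [
        [1 if sum(q * e for q, e in zip(query, pixel)) > threshold else 0 for pixel in row]
        for row in pixel_embeddings
    ]

query = [1.0, -1.0]
pixel_embeddings = [
    [[2.0, 0.0], [0.0, 2.0]],  # dot products:  2.0, -2.0
    [[1.0, 1.0], [3.0, 1.0]],  # dot products:  0.0,  2.0
]
print(predict_mask(query, pixel_embeddings))  # [[1, 0], [0, 1]]
```

Running this for N queries amounts to one matrix multiplication between the query embeddings and the flattened pixel embeddings, which is why the head adds so little latency.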
MaskDINO, the prior state-of-the-art, incorporates multi-scale backbone features into the segmentation head for better spatial detail. RF-DETR deliberately avoids this. Why?
Segmentation requires instance masks for training. COCO has them, but Objects365 (used for pre-training) does not. RF-DETR-Seg pre-trains on Objects365 using pseudo-labels from SAM2 — Meta's Segment Anything Model 2 generates high-quality masks for each detection box. This gives RF-DETR-Seg the benefit of large-scale segmentation pre-training without manually annotated masks.
| Model | Size | Latency | APmask |
|---|---|---|---|
| YOLOv11-Seg | X-Large | 6.9ms | 38.5 |
| RF-DETR-Seg | Nano | 3.4ms | 40.3 |
| FastInst | R50 | 39.6ms | 34.9 |
| MaskDINO | R50 | 242ms | 46.3 |
| RF-DETR-Seg | Medium | 5.9ms | 45.3 |
RF-DETR-Seg (nano) beats all YOLOv8 and YOLOv11 segmentation variants at all sizes. It beats FastInst by 5.4 AP (40.3 vs. 34.9) while running more than 10× faster (3.4ms vs. 39.6ms). And the NAS-based scaling means you get the full Pareto curve for segmentation too.
Let's examine what RF-DETR actually achieves and, more importantly, why the results look the way they do.
| Model | Size | Latency | AP | APS | APL |
|---|---|---|---|---|---|
| D-FINE | Nano | 2.1ms | 42.7 | 22.9 | 62.1 |
| RF-DETR | Nano | 2.3ms | 48.0 | 25.2 | 70.0 |
| D-FINE | Medium | 5.4ms | 55.0 | 37.6 | 71.7 |
| RF-DETR | Medium | 4.4ms | 54.7 | 36.1 | 73.8 |
| GroundingDINO | Tiny | 428ms | 58.2 | — | — |
| RF-DETR | 2XL | 17.2ms | 60.1 | 43.2 | 76.2 |
Three observations:
COCO has 80 categories and 118k training images. Real-world datasets look nothing like this. RF100-VL contains 100 diverse datasets spanning medical imaging, satellite imagery, manufacturing inspection, wildlife monitoring, and more. Some have 50 images. Some have 50,000. Some have 2 classes. Some have 50.
| Model | Size | Latency | AP | AP50 |
|---|---|---|---|---|
| YOLOv8 | Medium | 5.4ms | 56.5 | 82.3 |
| YOLOv11 | Medium | 5.1ms | 57.0 | 82.5 |
| D-FINE | Medium | 5.6ms | 60.6 | 85.5 |
| RT-DETR | Medium | 4.3ms | 59.6 | 85.7 |
| RF-DETR | Medium | 4.6ms | 61.7 | 88.0 |
| GroundingDINO | Tiny | 310ms | 62.3 | 88.8 |
| RF-DETR | 2XL | 15.6ms | 63.5 | 89.0 |
Key takeaway: RF-DETR (2XL) beats GroundingDINO on RF100-VL while running 20× faster. Internet-scale pre-training (via DINOv2) provides generalization. Weight-sharing NAS provides efficiency. No text encoder needed.
Table 5 in the paper reveals the incremental impact of each design choice, starting from LW-DETR (M) at 52.6 AP:
| Change | Δ AP | Cumulative AP | Latency |
|---|---|---|---|
| Gentler hyperparameters | −1.0 | 51.6 | 4.4ms |
| + DINOv2 backbone | +2.0 | 53.6 | 4.7ms |
| + Additional O365 pre-training | +0.7 | 54.3 | 4.7ms |
| + Weight-sharing NAS | +0.3 | 54.6 | 4.7ms |
| + NAS-mined config (patch 16, res 640) | −0.2 | 54.4 | 4.7ms |
| + Res 576, 2 windows, 4 layers | +0.3 | 54.7 | 4.4ms |
Read the story: we start by making training gentler (lower LR, layer norm) and lose 1 AP. Then DINOv2 more than compensates (+2.0), and additional Objects365 pre-training adds another 0.7. Weight-sharing NAS acts as a regularizer (+0.3). Finally, the NAS-mined configuration brings latency back down from 4.7ms to 4.4ms while keeping the accuracy. Net result: +2.1 AP (52.6 → 54.7) at the same latency.
The paper makes an important meta-contribution: they identify that GPU power throttling causes up to 25% variance in latency measurements between papers. The fix is dead simple: insert a 200ms buffer between consecutive forward passes. This prevents power over-draw and yields reproducible numbers. They also insist on reporting accuracy and latency from the same model artifact (FP16 accuracy with FP16 latency, not FP32 accuracy with FP16 latency).
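A minimal sketch of the buffered timing loop, under stated assumptions: `forward` is any callable that blocks until the result is ready (with asynchronous GPU frameworks that means synchronizing inside it), the median is used as the summary statistic, and the function and parameter names are illustrative rather than taken from the paper's benchmarking code.

```python
import time
from statistics import median

def measure_latency_ms(forward, num_runs=20, buffer_s=0.2):
    """Median latency of `forward` in ms, idling `buffer_s` seconds between runs."""
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        forward()  # must block until the work is finished
        timings.append((time.perf_counter() - start) * 1000.0)
        time.sleep(buffer_s)  # idle gap intended to avoid power throttling
    return median(timings)

def dummy_forward():
    # Hypothetical stand-in for a blocking model call.
    sum(range(100_000))

print(f"{measure_latency_ms(dummy_forward, num_runs=3):.3f} ms")
```

The sleep is excluded from the timed region, so it changes only the device's thermal and power state between runs, not the measured work.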
RF-DETR sits at the intersection of three research threads:
| Concept | Formula | What It Means |
|---|---|---|
| Token count | $N = (H/P) \times (W/P)$ | Smaller patch P → more tokens |
| Self-attention cost | $O(N^2 \cdot d)$ | Quadratic in token count |
| Deformable attention | $O(N_q \cdot K \cdot S)$ | K sample points, S scales; much cheaper |
| Cosine LR (avoided) | $\mathrm{lr}_{\min} + \tfrac{1}{2}(\mathrm{lr}_{\max} - \mathrm{lr}_{\min})(1 + \cos(\pi t / T))$ | Requires known horizon T |
| Mask generation | $M_i = q_i \cdot E_{\text{pixel}}^{\top}$ | Dot product of query with pixel embeddings |