Robinson, Robicheaux, Popov, Ramanan, Peri — Roboflow & CMU, ICLR 2026

RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

Train one supernet. Search thousands of sub-architectures for free. The first real-time detector to break 60 AP on COCO — and it generalizes to 100 diverse real-world datasets.

Prerequisites: Object detection basics + Transformers (attention) + DETR

Chapter 0: The Problem

You have a custom dataset — maybe aerial imagery of solar panels, or underwater footage of coral, or warehouse shelves full of parcels. You need a fast, accurate object detector. What do you do?

Option A: Use a vision-language model like GroundingDINO. It is powerful out of the box — just describe what you want to find in text. But it runs at 428ms per image. That is not real-time. Fine-tune it on your data and performance improves, but the text encoder still drags latency far beyond what interactive applications need.

Option B: Use a specialist detector like YOLOv11 or D-FINE. These run in 2–6ms. Fast enough. But there is a hidden problem: these models have been implicitly overfit to COCO.

What does "overfit to COCO" mean? It means their architectures, learning rate schedules, augmentation pipelines, and even model sizes have been tuned relentlessly on the 80 categories and ~118k images of the COCO benchmark. When you deploy them on a real-world dataset with different statistics — more objects per image, fewer training examples, unusual aspect ratios, domain-specific classes — performance collapses.

The gap is real. Roboflow100-VL (RF100-VL) is a benchmark of 100 diverse real-world datasets. On COCO, YOLOv8 looks competitive. On RF100-VL, it lags behind DETR-based detectors by roughly 3 to 5 AP at the medium size (see Chapter 8), and scaling to larger YOLO sizes doesn't help. The bespoke COCO tuning doesn't transfer.

There is a deeper issue. Every time you switch hardware (T4 GPU → Jetson Nano → mobile NPU), you want a different accuracy-latency tradeoff. With traditional detectors, that means training a completely new model for each target point. YOLOv11 has nano, small, medium, large, and extra-large variants — five separate training runs, five separate architectures.

So: can we build a detector that (1) inherits internet-scale knowledge for strong transfer, (2) runs in real-time, and (3) lets us discover thousands of accuracy-latency tradeoffs from a single training run?

Why do specialist detectors like YOLOv8 often fail on real-world datasets outside of COCO?

Chapter 1: The Key Insight

RF-DETR's core idea can be stated in one sentence: train a single supernet that contains thousands of sub-architectures inside it, then search for the best one for your target dataset and hardware — without any retraining.

Let's unpack what that means.

The Weight-Sharing Trick

Imagine you train a detector with 6 decoder layers. At inference, you can just... drop the last 3 layers and use only the first 3. The predictions from layer 3 are still valid because the model was trained with a loss at every decoder layer (not just the final one). You get a faster model for free. No retraining needed.

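Now extend this idea to every architectural dimension: patch size, number of decoder layers, number of query tokens, image resolution, and number of windows per attention block.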

During training, every iteration randomly samples a different combination of these settings. The supernet learns to perform well at ALL configurations simultaneously. This is the weight-sharing NAS from OFA (Once-for-All), applied end-to-end to object detection for the first time.

Architecture augmentation: Here is the surprise. Training with random sub-architectures doesn't just save you from retraining — it actually improves the base model's performance. Think of it as a regularizer: at each iteration, the model sees a different "view" of itself. This is analogous to dropout, but at the architectural level. Sub-architectures that were never explicitly sampled during training still perform well (Appendix F of the paper).

The Second Ingredient: Internet-Scale Priors

Weight-sharing NAS tells you how to search. But you still need a strong starting point. RF-DETR initializes its backbone with DINOv2 — a vision foundation model trained on 142M images with self-supervised learning. This gives the detector rich, transferable features even before seeing a single detection label. On small datasets (some RF100-VL datasets have fewer than 100 images), this pre-training is the difference between a working detector and random chance.

The combination is powerful: DINOv2 provides internet-scale knowledge. Weight-sharing NAS provides hardware-adaptive flexibility. Together, you get one training run that yields thousands of Pareto-optimal detectors.

Why can RF-DETR evaluate thousands of sub-architectures without retraining each one?

Chapter 2: The Architecture

RF-DETR builds on LW-DETR (Lightweight DETR) but modernizes every component. Let's trace an image through the entire pipeline.

Step 1: ViT Backbone (DINOv2)

The input image x ∈ ℝ^{3×H×W} is split into non-overlapping patches of size P×P (default P=16). Each patch is linearly projected to a d-dimensional token. For an image of resolution 640×640 with patch size 16, that is (640/16)² = 1,600 tokens.

These tokens pass through the DINOv2 ViT-S encoder (12 transformer layers, d=384). The backbone is initialized with DINOv2's self-supervised weights — 142 million images of knowledge baked in before we see a single bounding box.

Data flow: Image [3, 640, 640] → patch embed [1600, 384] → 12 ViT layers → features [1600, 384]. The backbone outputs features at a single scale (stride 16). But detection needs multi-scale features to handle objects of different sizes.

Step 2: Multi-Scale Projector

The projector takes the single-scale ViT output and creates a multi-scale feature pyramid. It reshapes the 1D token sequence back to 2D spatial maps, then produces features at multiple resolutions through strided convolutions and upsampling. These multi-scale features feed into the decoder's deformable cross-attention (which needs to sample from different scales).

A critical engineering choice: the projector uses layer normalization instead of batch normalization. Why? Batch norm statistics are unreliable with small batches, and RF-DETR needs gradient accumulation on consumer GPUs (which means effective batch size varies). Layer norm is batch-size-independent. This costs ~1% AP but enables training on a single GPU.

Step 3: Encoder (Windowed + Non-Windowed Attention)

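The multi-scale features pass through an encoder that interleaves two types of attention blocks: windowed blocks, which restrict self-attention to local windows, and non-windowed blocks, which attend globally across the whole feature map.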

The default pattern is 2 windowed blocks followed by 1 non-windowed block. This balances global context with computational efficiency.

Step 4: Decoder (Deformable Cross-Attention)

The decoder takes N query tokens (default N=300) and refines them through multiple decoder layers. Each decoder layer has:

  1. Self-attention among queries (so they coordinate to avoid duplicate predictions)
  2. Deformable cross-attention from queries to multi-scale encoder features (each query attends to a sparse set of learned sampling points, not the full feature map)
  3. FFN (feed-forward network)

Why deformable? Standard cross-attention has O(N×HW) cost: every query attends to every spatial position. With 300 queries and 1,600+ spatial tokens across multiple scales, this is prohibitive for real-time. Deformable attention learns to sample only K=4 points per query per scale, making it O(N×K×S) where S is the number of scales. Much faster.
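To make the gap concrete, here is a back-of-the-envelope sketch that counts query-to-feature interactions for both schemes. The three-scale token counts (40×40, 20×20, 10×10) come from the projector layout in Chapter 3. Counting one query-feature pair as one unit of cost is a simplifying assumption, not a FLOP count.

```python
def dense_cross_attention_pairs(num_queries, tokens_per_scale):
    """Every query attends to every spatial token on every scale."""
    return num_queries * sum(tokens_per_scale)


def deformable_cross_attention_pairs(num_queries, points_per_scale, num_scales):
    """Every query attends to K sampled points on each of S scales."""
    return num_queries * points_per_scale * num_scales


# Assumed three-scale pyramid for a 640x640 input (see Chapter 3).
tokens_per_scale = [40 * 40, 20 * 20, 10 * 10]

dense = dense_cross_attention_pairs(300, tokens_per_scale)
sparse = deformable_cross_attention_pairs(300, points_per_scale=4, num_scales=3)

print(f"dense:      {dense:,} pairs")
print(f"deformable: {sparse:,} pairs")
print(f"reduction:  {dense / sparse:.0f}x")
```

Under these assumptions the deformable variant touches 3,600 pairs instead of 630,000.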

Step 5: Detection Head

Each of the N query outputs is independently decoded by two shared heads: a classification head that outputs class logits and a box head that regresses the bounding box.

Crucially, the detection loss is applied at every decoder layer, not just the last one. This is what enables decoder layer dropping at inference — every intermediate output is trained to be a valid detection.
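A toy sketch of this deep-supervision idea follows. The "layers", the head, and the loss are scalar stand-ins invented for illustration and are not RF-DETR's actual modules. The point is only that the loss is summed over every layer's decoded output, so truncating the decoder still yields trained predictions.

```python
def run_decoder(queries, layers, head, num_layers=None):
    """Run the first `num_layers` decoder layers and decode every intermediate output."""
    active = layers if num_layers is None else layers[:num_layers]
    outputs = []
    for layer in active:
        queries = layer(queries)
        outputs.append(head(queries))  # the shared head decodes each layer's output
    return outputs


def deep_supervision_loss(outputs, target, loss_fn):
    """Sum the loss over every decoder layer's prediction, not just the last one."""
    return sum(loss_fn(pred, target) for pred in outputs)


# Toy stand-ins (assumptions): a "layer" moves a scalar halfway toward 1.0,
# the head is the identity, and the loss is squared error.
layers = [lambda q: q + 0.5 * (1.0 - q)] * 6
head = lambda q: q
squared_error = lambda pred, target: (pred - target) ** 2

train_outputs = run_decoder(0.0, layers, head)                # all 6 layers
loss = deep_supervision_loss(train_outputs, 1.0, squared_error)
fast_outputs = run_decoder(0.0, layers, head, num_layers=3)   # drop the last 3 layers
```

Because the first three outputs were part of the training loss, `fast_outputs` is simply the prefix of `train_outputs`.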

Why does RF-DETR apply detection loss at every decoder layer, not just the final one?

Chapter 3: The Multi-Scale Projector

Here is a problem that might not be obvious at first: ViT backbones output features at a single scale. Every token corresponds to one P×P patch. But objects in an image vary wildly in size — a person might span 300 pixels while a traffic light spans 20. A single-scale feature map cannot efficiently represent both.

CNN-based detectors such as Faster R-CNN and YOLO solve this with a Feature Pyramid Network (FPN) or a similar neck that builds multi-scale representations from different stages of a CNN backbone (stride 8, 16, 32, 64). But ViTs don't have "stages": all tokens live at the same resolution.

How RF-DETR Builds Multi-Scale Features from a Single-Scale ViT

The projector takes the ViT output (40×40 spatial map for 640×640 input with P=16) and creates three scales:

  1. ViT output: [40×40, 384]
  2. 1×1 conv → project to 256 dims
  3a. Scale 1 (stride 16): [40×40, 256], directly from the projection
  3b. Scale 2 (stride 32): [20×20, 256], 2×2 strided conv on Scale 1
  3c. Scale 3 (stride 64): [10×10, 256], 2×2 strided conv on Scale 2

All three scales have the same channel dimension (256) but different spatial resolutions. The deformable cross-attention in the decoder samples from all three scales, so small objects can be detected from the high-resolution Scale 1, and large objects from the low-resolution Scale 3.
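The shape bookkeeping above can be sketched as a small helper. It assumes a square input and that each additional scale halves the previous one, as in the stride 16/32/64 layout described here. The function name and signature are illustrative.

```python
def pyramid_shapes(image_size, patch_size=16, channels=256, num_scales=3):
    """Return (stride, height, width, channels) for each projector scale.

    Assumes a square input and that each extra scale halves the previous one.
    """
    side = image_size // patch_size
    stride = patch_size
    shapes = []
    for _ in range(num_scales):
        shapes.append((stride, side, side, channels))
        side //= 2
        stride *= 2
    return shapes


for stride, h, w, c in pyramid_shapes(640):
    print(f"stride {stride}: [{h}x{w}, {c}]")
```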

The Segmentation Branch Shares This Projector

Here is a clever design choice: the segmentation head also needs spatial features, but instead of building a separate pathway, it bilinearly upsamples the same projector output. This ensures that detection and segmentation see the same spatial organization of features, and adding segmentation doesn't double the feature extraction cost.

Data flow for segmentation: Projector output [40×40, 256] → bilinear upsample → [160×160, 256] → lightweight conv projector → pixel embedding map [160×160, d_seg]. The final mask is computed as the dot product of each query embedding with this pixel map.

LayerNorm vs BatchNorm: A Practical Trade-Off

The original LW-DETR uses batch normalization in the projector. RF-DETR switches to layer normalization. Let's trace why this matters:

With batch norm, you need large batch sizes for stable running statistics. DINOv2-based RF-DETR is larger than CAEv2-based LW-DETR, so fitting a large batch on a single consumer GPU (say, 24GB RTX 4090) is harder. The solution is gradient accumulation — process 4 micro-batches of 2 images each and accumulate gradients to simulate batch size 8. But batch norm computes statistics per micro-batch (size 2), which is too noisy. Layer norm computes per-token statistics, independent of batch size. Problem solved.
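A minimal illustration of the batch-size dependence, using made-up numbers and only the mean (the variance behaves the same way). The batch-norm statistic changes when the batch shrinks to a micro-batch, while the layer-norm statistic of a given token does not depend on the batch at all.

```python
from statistics import fmean


def batch_norm_mean(batch, channel):
    """Batch norm averages one channel over every token of every image in the batch."""
    return fmean(token[channel] for image in batch for token in image)


def layer_norm_mean(token):
    """Layer norm averages over the channels of a single token."""
    return fmean(token)


# Toy data (made up): 4 images, each with 2 tokens of 3 channels.
full_batch = [
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    [[5.0, 6.0, 7.0], [6.0, 7.0, 8.0]],
    [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]],
    [[9.0, 8.0, 7.0], [8.0, 7.0, 6.0]],
]
micro_batch = full_batch[:2]  # what a single accumulation step actually sees

token = full_batch[0][0]
print(batch_norm_mean(full_batch, 0), batch_norm_mean(micro_batch, 0))  # these differ
print(layer_norm_mean(token))  # identical no matter which batch the token sits in
```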

The cost: switching from batch norm to layer norm drops AP by about 1%. But it enables training on consumer hardware, which is the whole point of making detection accessible.

Why does RF-DETR use layer normalization instead of batch normalization in the projector?

Chapter 4: The Five Knobs — NAS Search Space

This is the heart of RF-DETR. The NAS search space defines five "tunable knobs" that you can twist at inference time to slide along the accuracy-latency Pareto curve. Each knob trades compute for quality in a different way. Let's understand each one deeply.

Knob 1: Patch Size

ViT divides the image into patches of P×P pixels. The number of tokens is (H/P)×(W/P). Self-attention is O(tokens²), so patch size has a quadratic effect on compute.

| Patch Size | Tokens (640×640) | Self-Attention Cost | Effect |
| --- | --- | --- | --- |
| 8×8 | 6,400 | 41M ops | Fine-grained but very expensive |
| 14×14 | 2,089 | 4.4M ops | Default DINOv2 balance |
| 16×16 | 1,600 | 2.6M ops | RF-DETR default |
| 32×32 | 400 | 160K ops | Coarse but very fast |
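The table's numbers follow directly from the two formulas above, as this short sketch shows. Counting "ops" as tokens squared is the same simplification the table uses. P=14 does not divide 640 evenly, so its token count comes out fractional (about 2,090).

```python
def token_count(height, width, patch):
    """Number of ViT tokens; fractional when the patch size does not divide the image."""
    return (height / patch) * (width / patch)


def attention_pairs(tokens):
    """Self-attention compares every token with every token: tokens squared."""
    return tokens ** 2


for patch in (8, 14, 16, 32):
    n = token_count(640, 640, patch)
    print(f"P={patch:>2}: {n:8.0f} tokens, {attention_pairs(n) / 1e6:6.2f}M pairs")

# Halving the patch size quadruples the tokens and multiplies attention cost by 16.
ratio = attention_pairs(token_count(640, 640, 8)) / attention_pairs(token_count(640, 640, 16))
```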

But DINOv2 was pre-trained with P=14. How do you use P=16 or P=32 without retraining? FlexiViT-style patch embedding interpolation. The original patch embedding is a [384, 3, 14, 14] convolution kernel. To use P=16, you bilinearly resize the kernel to [384, 3, 16, 16]. The positional embeddings are similarly interpolated to match the new grid size.

Why this works: The patch embedding learns to extract low-level features (edges, textures, colors). These features are spatially smooth — a 14×14 filter and a 16×16 filter that see similar image regions will extract similar features. Bilinear interpolation preserves this spatial structure. It is not perfect, but the NAS training procedure fine-tunes the interpolated weights across all patch sizes.
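Here is a dependency-free sketch of the plain bilinear kernel resize described above, applied to a single 2D filter slice. It aligns the corner samples, which is one of several possible conventions and an assumption here. Note that the FlexiViT paper also describes a pseudo-inverse resize; this sketch only illustrates the simpler bilinear variant.

```python
def bilinear_resize(kernel, new_size):
    """Resize a square 2D kernel (list of rows) to new_size x new_size, corners aligned."""
    old = len(kernel)
    out = []
    for i in range(new_size):
        # Fractional source row that output row i maps to.
        y = i * (old - 1) / (new_size - 1) if new_size > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old - 1)
        wy = y - y0
        row = []
        for j in range(new_size):
            x = j * (old - 1) / (new_size - 1) if new_size > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old - 1)
            wx = x - x0
            # Blend the four neighbouring weights.
            top = kernel[y0][x0] * (1 - wx) + kernel[y0][x1] * wx
            bottom = kernel[y1][x0] * (1 - wx) + kernel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A made-up 14x14 filter slice (a diagonal ramp) resized to 16x16.
kernel_14 = [[float(r + c) for c in range(14)] for r in range(14)]
kernel_16 = bilinear_resize(kernel_14, 16)
```

In a real patch embedding the same resize would be applied to every (output channel, input channel) slice of the [384, 3, 14, 14] kernel.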

Knob 2: Number of Decoder Layers

RF-DETR trains with N decoder layers (e.g., 6) and applies the detection loss at every layer's output. At inference, you can run anywhere from 0 to N layers. Fewer layers mean lower latency; more layers give the queries more rounds of refinement.

Knob 3: Number of Query Tokens

Query tokens set the maximum number of detections. If your dataset averages 5 objects per image (like a robotics grasping task), you don't need 300 queries — 50 might suffice. Fewer queries means less self-attention cost and less post-processing.

The key question: which queries do you keep when dropping? RF-DETR orders queries by the maximum sigmoid of their class logits at the encoder output. The most confident queries survive. This is an implicit form of top-K selection based on the model's own confidence.
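The selection rule can be sketched in a few lines. The logits are invented for illustration and `select_top_queries` is a hypothetical helper name, not an RF-DETR API.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_top_queries(class_logits, keep):
    """Return indices of the `keep` queries with the highest max class probability.

    class_logits: one list of per-class logits per query.
    """
    confidence = [max(sigmoid(v) for v in logits) for logits in class_logits]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    return ranked[:keep]


# Made-up logits for 4 queries and 3 classes.
logits = [
    [-2.0, 0.1, -1.0],
    [3.0, -1.0, 0.0],
    [-3.0, -2.5, -4.0],
    [0.5, 1.5, -0.5],
]
print(select_top_queries(logits, keep=2))
```

Query 1 (max logit 3.0) and query 3 (max logit 1.5) are the two most confident, so they survive.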

Knob 4: Image Resolution

Higher resolution captures more detail for small objects but generates more tokens. RF-DETR pre-allocates a large positional embedding grid (for the maximum resolution) and bilinearly interpolates it for smaller resolutions. This allows arbitrary resolution at test time.

Knob 5: Number of Windows per Attention Block

Each windowed attention block restricts self-attention to a local window. More windows (smaller windows) means less compute but less global context. Fewer windows (larger windows) means each token can "see" more of the image. At the extreme, 1 window = full global attention.
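A rough cost model for this knob is sketched below. It assumes the tokens are split evenly into the given number of windows and that each window attends only within itself. Whether the knob counts windows per image or per side is not stated here, so treat the numbers as illustrative.

```python
def windowed_attention_pairs(tokens, num_windows):
    """Token pairs compared when tokens are split evenly into windows.

    Each window holds tokens // num_windows tokens and attends only within itself.
    """
    per_window = tokens // num_windows
    return num_windows * per_window ** 2


for w in (1, 2, 4):
    print(f"{w} window(s): {windowed_attention_pairs(1600, w):,} pairs")
```

Under this model, doubling the number of windows halves the attention cost.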

The Pareto search is exhaustive but cheap. After training completes, RF-DETR evaluates all combinations of these 5 knobs on the validation set using grid search. Each evaluation needs only forward passes over the validation set: no gradients, no backprop. Thousands of configurations can be evaluated in minutes to hours, depending on validation set size. The output is a Pareto curve: for any target latency, it gives the most accurate configuration.

Worked Example: Picking an Operating Point

Your hardware budget: 5ms per frame on a T4 GPU. The grid search reveals that the best configuration at ~5ms is:

Result: 54.7 AP on COCO. A different target (say 2.3ms) yields: patch size 16, resolution 560, 3 decoder layers, 100 queries, 2 windows → 48.0 AP. All from the same trained weights.
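Picking an operating point is then a simple filter-and-max over the search results. The sketch below uses the three COCO operating points quoted in this lesson (nano, medium, 2XL). The helper name and tuple layout are assumptions made for illustration.

```python
def best_under_budget(results, budget_ms):
    """Return the most accurate (ap, latency_ms, label) entry within the latency budget."""
    feasible = [r for r in results if r[1] <= budget_ms]
    return max(feasible, key=lambda r: r[0]) if feasible else None


# (COCO AP, T4 latency in ms, label) taken from the numbers quoted in this lesson.
results = [
    (48.0, 2.3, "nano"),
    (54.7, 4.4, "medium"),
    (60.1, 17.2, "2XL"),
]
print(best_under_budget(results, budget_ms=5.0))
```

With a 5ms budget this selects the 54.7 AP configuration. A budget below 2.3ms returns `None` because nothing in this short list is fast enough.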

If you decrease the patch size from 16 to 8 on a 640×640 image, what happens to the self-attention cost?

Chapter 5: Weight-Sharing NAS — How Training Works

We've seen what the five knobs are. Now let's understand how the supernet is trained to be good at all configurations simultaneously.

The Training Loop

At every training iteration:

  1. Sample a random configuration: (patch_size, n_decoder_layers, n_queries, resolution, n_windows)
  2. Resize the batch to the sampled resolution (batch-level resize, not per-image)
  3. Interpolate patch embedding weights to the sampled patch size (FlexiViT)
  4. Forward pass through the backbone and encoder
  5. Run only the first k decoder layers (k = sampled n_decoder_layers)
  6. Select the top-j queries (j = sampled n_queries) by encoder confidence
  7. Compute detection loss at every active decoder layer. Backpropagate.

The uniform random sampling is important. If you always trained at high resolution, the model would underperform at low resolution. If you always used all decoder layers, the early layers wouldn't learn to produce good standalone predictions. By sampling uniformly, every configuration gets roughly equal training signal.
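Step 1 of the loop can be sketched as follows. The candidate values mirror the search grid listed in "The Search Phase" later in this chapter. Using the same values during training is an assumption of this sketch, as are the dictionary layout and function name.

```python
import random
from itertools import product

# Candidate values copied from the search grid shown later in this chapter
# (assumed here to also be the values sampled during training).
SEARCH_SPACE = {
    "patch_size": [8, 12, 16, 20, 24, 32],
    "resolution": [384, 448, 512, 576, 640],
    "n_decoder_layers": list(range(0, 7)),
    "n_queries": [50, 100, 200, 300],
    "n_windows": [1, 2, 4],
}


def sample_config(rng):
    """Draw each knob independently and uniformly, once per training iteration."""
    return {knob: rng.choice(values) for knob, values in SEARCH_SPACE.items()}


rng = random.Random(0)
for step in range(3):
    print(step, sample_config(rng))

# Total number of distinct sub-architectures in this grid.
num_configs = len(list(product(*SEARCH_SPACE.values())))
```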

Analogy: architectural dropout. Recall that standard dropout randomly zeroes neurons during training, forcing the network to be robust to missing features. Weight-sharing NAS is the same idea at the architecture level: randomly remove decoder layers, reduce resolution, change patch size. The model learns to be robust to missing architecture, not just missing neurons.

Why Does This Improve the Base Model?

Here is the counterintuitive finding from Table 5 of the paper. Adding weight-sharing NAS improves the base configuration (patch size 14, all decoder layers, full resolution) by 0.3 AP, even though patch size 14 is not in the NAS search space.

The likely explanation: architecture augmentation acts as a regularizer. Training with random sub-architectures prevents the model from relying on fragile co-adaptations between components. The same reason dropout works — but applied to the computational graph structure rather than individual activations.

Pre-Training Pipeline

RF-DETR uses a three-stage training pipeline:

| Stage | Dataset | Epochs | Purpose |
| --- | --- | --- | --- |
| 1. Backbone init | DINOv2 (142M images) | - | Self-supervised visual features |
| 2. Detection pre-train | Objects365 (2M images) | 60 | Detection-specific knowledge |
| 3. Target fine-tune | COCO / RF100-VL / yours | 100+ | Domain adaptation with NAS |

The Objects365 pre-training is crucial. It teaches the projector and decoder to detect objects before fine-tuning on the target domain. Without it, the DINOv2 features would need to learn detection from scratch, requiring many more epochs on small target datasets.

The Search Phase

After training completes, the NAS search is trivially simple:

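# evaluate() and measure_latency() stand in for your own validation and timing code.
results = []  # one (AP, latency, config) tuple per sub-architecture
for patch_size in [8, 12, 16, 20, 24, 32]:
  for resolution in [384, 448, 512, 576, 640]:
    for n_dec in range(0, 7):  # 0 to 6 decoder layers
      for n_queries in [50, 100, 200, 300]:
        for n_windows in [1, 2, 4]:
          config = (patch_size, n_dec, n_queries, resolution, n_windows)
          ap = evaluate(model, val_set, config)  # forward passes only, no gradients
          lat = measure_latency(model, config)
          results.append((ap, lat, config))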

That is 6×5×7×4×3 = 2,520 configurations. Each evaluation is a fast forward pass on the validation set. The entire search completes in minutes to hours depending on validation set size.

From these results, extract the Pareto frontier: keep a configuration only if no other configuration is both at least as fast and more accurate.
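Extracting the frontier takes one sort and one pass, as this sketch shows. It expects tuples whose first two entries are AP and latency, matching the result tuples collected in the search loop. The measurements below are made up for illustration.

```python
def pareto_frontier(results):
    """Keep entries that no faster-or-equal entry matches or beats in AP.

    Each entry is a tuple whose first two fields are (ap, latency).
    """
    frontier = []
    best_ap = float("-inf")
    # Sort fastest first; when latencies tie, the higher AP comes first.
    for entry in sorted(results, key=lambda r: (r[1], -r[0])):
        if entry[0] > best_ap:  # strictly more accurate than everything faster
            frontier.append(entry)
            best_ap = entry[0]
    return frontier


# Made-up (AP, latency in ms) measurements for illustration.
results = [(48.0, 2.3), (47.1, 2.9), (54.7, 4.4), (53.0, 4.4), (54.0, 6.0), (60.1, 17.2)]
print(pareto_frontier(results))
```

Here the 47.1, 53.0, and 54.0 AP entries are dropped because a faster or equally fast entry is more accurate.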

During training, weight-sharing NAS improves even the base configuration's accuracy (which isn't in the search space). Why?

Chapter 6: Scheduler-Free Training

This chapter addresses a problem that most papers don't even acknowledge: training schedulers are a form of overfitting to benchmark characteristics.

The Cosine Schedule Problem

Almost every modern detector uses a cosine learning rate schedule. This decays the learning rate as:

lr(t) = lr_min + ½ (lr_max − lr_min) (1 + cos(π · t / T))

where T is the total number of training steps. The problem: you must know T in advance. This works fine for COCO (118k images, conventionally trained for 12, 24, or 36 epochs). But for RF100-VL's datasets, which range from 50 to 50,000 images, the ideal T varies wildly. At a fixed batch size, 100 epochs on a 50-image dataset is 1,000× fewer optimizer steps than 100 epochs on a 50,000-image dataset, so a schedule tuned for one horizon does not carry over to the other.
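The dependence on T is easy to see numerically. The sketch below implements the cosine formula directly. The learning-rate bounds and the two horizons are illustrative assumptions, not values from the paper.

```python
import math


def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    """Cosine decay from lr_max at step 0 to lr_min at step total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))


# Same step index, very different learning rates once the horizon T changes.
# The lr_max / lr_min defaults and the two horizons are illustrative assumptions.
for total in (1_000, 100_000):
    print(f"T={total:>7}: lr at step 500 = {cosine_lr(500, total):.2e}")
```

With T=1,000 the schedule is already halfway down at step 500, while with T=100,000 it has barely moved from lr_max.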

The hidden bias: When papers report "we train for 36 epochs with cosine decay," they are implicitly encoding the assumption that 36 epochs on COCO's 118k images is the right amount of optimization. Change the dataset size, change the number of classes, change the average objects per image, and this assumption breaks. D-FINE, built on RT-DETR, extensively tunes schedules on COCO. The result? On RF100-VL it trails RT-DETR at AP50 (85.5 vs. 85.7, Table 4) despite beating it by a wide margin on COCO. The schedule overfit.

The Augmentation Bias Problem

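State-of-the-art detectors use aggressive augmentations: VerticalFlip, RandomResize, CachedMixUp, Mosaic, HSV jitter. Each augmentation encodes domain assumptions. A vertical flip, for example, implicitly assumes that upside-down scenes are plausible, which is reasonable for overhead imagery but not for street-level photos.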

RF-DETR's Approach: Minimal Augmentation

RF-DETR uses only two augmentations:

  1. Horizontal flip (50% probability)
  2. Random crop

That's it. No vertical flip, no mosaic, no HSV jitter, no mixup. The rationale: with DINOv2's internet-scale pre-training and weight-sharing NAS as architectural regularization, aggressive augmentation is unnecessary. The model already has diverse visual knowledge baked in.

Batch-Level Resize vs. Per-Image Resize

LW-DETR applies random resize per image, then pads each image in the batch to match the largest one. The result: most images have significant padding, which wastes computation and introduces window artifacts in the attention mechanism.
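How much is wasted? The sketch below estimates the padded fraction of a batch when every image gets its own random size, and compares it with giving the whole batch one shared size. It assumes square images, and the list of candidate sizes is invented for illustration.

```python
import random


def padded_waste(sizes):
    """Fraction of pixels that are padding when each square image is padded to the largest."""
    target = max(sizes) ** 2
    used = sum(s * s for s in sizes)
    return 1 - used / (target * len(sizes))


rng = random.Random(0)
candidates = [384, 448, 512, 576, 640]  # assumed resize choices

per_image = [rng.choice(candidates) for _ in range(8)]  # one size per image
batch_level = [rng.choice(candidates)] * 8              # one size for the whole batch

print(f"per-image resize:   {padded_waste(per_image):.0%} padding")
print(f"batch-level resize: {padded_waste(batch_level):.0%} padding")
```

A single shared size always gives zero padding. With per-image sizes, the waste grows with the spread between the smallest and largest image in the batch.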

RF-DETR resizes at the batch level: all images in a batch get the same random resolution. No padding needed. This has two benefits:

  1. No wasted computation on padded regions
  2. All positional encodings are equally likely to be seen during training, which is important for NAS (the model needs to handle all resolutions)

Why does RF-DETR use only horizontal flip and random crop as augmentations?

Chapter 7: Instance Segmentation

Detection gives you bounding boxes. But what if you need precise pixel-level masks for each object? RF-DETR-Seg adds a lightweight segmentation head that shares the same projector output as the detection head.

The Architecture

The segmentation head is surprisingly simple. Here is the complete data flow:

  1. Projector output [H', W', 256] (same features as detection)
  2. Bilinear upsample to [4H', 4W', 256] (4× the spatial resolution)
  3. Lightweight conv projector → pixel embeddings [4H', 4W', d_seg]
  4. Each query embedding q_i [d_seg] from decoder layer k, transformed by FFN
  5. Mask_i = q_i · pixel_embeddings^T → [4H', 4W'] per-object mask

The mask for each detected object is the dot product of its query embedding with every pixel embedding. High dot product = this pixel belongs to this object. Low dot product = background. That's it — no mask head, no ROI pooling, no crop-and-resize.

Segmentation prototypes. You can interpret the pixel embedding map as a set of segmentation prototypes (YOLACT-style). Each pixel's embedding encodes "what kind of object region am I part of?" Each query's embedding encodes "what spatial pattern does my object have?" The dot product computes their compatibility. This is elegant because you compute the pixel embeddings once and reuse them for all N queries.
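The dot-product mask can be sketched on a tiny, made-up embedding map. Thresholding the raw dot product at 0 (equivalent to a sigmoid above 0.5) is an assumption of this sketch, and all values and names are illustrative.

```python
def predict_mask(query, pixel_embeddings, threshold=0.0):
    """Binary mask: 1 where the query/pixel dot product exceeds the threshold."""
    return [
        [1 if sum(q * p for q, p in zip(query, pixel)) > threshold else 0 for pixel in row]
        for row in pixel_embeddings
    ]


# Made-up 2x3 pixel embedding map with d_seg = 2, computed once for the image.
pixel_embeddings = [
    [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]],
    [[0.8, -0.2], [-0.5, 1.0], [-1.0, 0.0]],
]
query_a = [1.0, 0.0]  # responds to the first embedding channel
query_b = [0.0, 1.0]  # responds to the second embedding channel

print(predict_mask(query_a, pixel_embeddings))
print(predict_mask(query_b, pixel_embeddings))
```

The same `pixel_embeddings` map is reused for both queries; only the query vector changes.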

Why Bilinear Upsample from the Same Projector?

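MaskDINO, the prior state-of-the-art, incorporates multi-scale backbone features into the segmentation head for better spatial detail. RF-DETR deliberately avoids this. Why? Reusing the single projector output keeps detection and segmentation on the same spatial organization of features and avoids a second feature-extraction pathway, so adding masks does not double the feature extraction cost.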

Training with Pseudo-Labels

Segmentation requires instance masks for training. COCO has them, but Objects365 (used for pre-training) does not. RF-DETR-Seg pre-trains on Objects365 using pseudo-labels from SAM2 — Meta's Segment Anything Model 2 generates high-quality masks for each detection box. This gives RF-DETR-Seg the benefit of large-scale segmentation pre-training without manually annotated masks.

Results Snapshot

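| Model | Size | Latency | AP (mask) |
| --- | --- | --- | --- |
| YOLOv11-Seg | X-Large | 6.9ms | 38.5 |
| RF-DETR-Seg | Nano | 3.4ms | 40.3 |
| FastInst | R50 | 39.6ms | 34.9 |
| MaskDINO | R50 | 242ms | 46.3 |
| RF-DETR-Seg | Medium | 5.9ms | 45.3 |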

RF-DETR-Seg (nano) beats every YOLOv8 and YOLOv11 segmentation variant, including the largest ones. It beats FastInst by 5.4 AP while running more than 10× faster (3.4ms vs. 39.6ms). And the NAS-based scaling means you get the full Pareto curve for segmentation too.

How does RF-DETR-Seg generate per-object segmentation masks?

Chapter 8: Experiments

Let's examine what RF-DETR actually achieves and, more importantly, why the results look the way they do.

COCO Detection: The Headline Numbers

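| Model | Size | Latency | AP | AP_S | AP_L |
| --- | --- | --- | --- | --- | --- |
| D-FINE | Nano | 2.1ms | 42.7 | 22.9 | 62.1 |
| RF-DETR | Nano | 2.3ms | 48.0 | 25.2 | 70.0 |
| D-FINE | Medium | 5.4ms | 55.0 | 37.6 | 71.7 |
| RF-DETR | Medium | 4.4ms | 54.7 | 36.1 | 73.8 |
| GroundingDINO | Tiny | 428ms | 58.2 | - | - |
| RF-DETR | 2XL | 17.2ms | 60.1 | 43.2 | 76.2 |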

Three observations:

  1. RF-DETR (nano) beats D-FINE (nano) by 5.3 AP at similar latency. That is an enormous gap at the same speed — almost the difference between YOLOv8-nano and YOLOv8-medium.
  2. RF-DETR (2XL) is the first real-time detector above 60 AP on COCO. It outperforms GroundingDINO (fine-tuned) while being 25× faster. The 60 AP barrier was previously only achievable by slow, heavyweight VLMs.
  3. Small object AP (AP_S) scales strongly with NAS. RF-DETR (2XL) hits 43.2 AP_S vs. 37.6 for D-FINE (medium). The multi-resolution NAS knob lets large models use higher resolution where it matters most.

RF100-VL: The Real Test of Generalization

COCO has 80 categories and 118k training images. Real-world datasets look nothing like this. RF100-VL contains 100 diverse datasets spanning medical imaging, satellite imagery, manufacturing inspection, wildlife monitoring, and more. Some have 50 images. Some have 50,000. Some have 2 classes. Some have 50.

| Model | Size | Latency | AP | AP50 |
| --- | --- | --- | --- | --- |
| YOLOv8 | Medium | 5.4ms | 56.5 | 82.3 |
| YOLOv11 | Medium | 5.1ms | 57.0 | 82.5 |
| D-FINE | Medium | 5.6ms | 60.6 | 85.5 |
| RT-DETR | Medium | 4.3ms | 59.6 | 85.7 |
| RF-DETR | Medium | 4.6ms | 61.7 | 88.0 |
| GroundingDINO | Tiny | 310ms | 62.3 | 88.8 |
| RF-DETR | 2XL | 15.6ms | 63.5 | 89.0 |

The overfitting signal: Notice that D-FINE beats RT-DETR on COCO (55.0 vs 49.0 AP) but loses on RF100-VL at AP50 (85.5 vs 85.7). D-FINE builds on RT-DETR with extensive hyperparameter tuning on COCO. Those tuned hyperparameters hurt on diverse real-world data. RF-DETR's scheduler-free, minimal-augmentation approach avoids this trap entirely.

Key takeaway: RF-DETR (2XL) beats GroundingDINO on RF100-VL while running 20× faster. Internet-scale pre-training (via DINOv2) provides generalization. Weight-sharing NAS provides efficiency. No text encoder needed.

The Ablation Story

Table 5 in the paper reveals the incremental impact of each design choice, starting from LW-DETR (M) at 52.6 AP:

| Change | Δ AP | Cumulative AP | Latency |
| --- | --- | --- | --- |
| Gentler hyperparameters | −1.0 | 51.6 | 4.4ms |
| + DINOv2 backbone | +2.0 | 53.6 | 4.7ms |
| + Additional O365 pre-training | +0.7 | 54.3 | 4.7ms |
| + Weight-sharing NAS | +0.3 | 54.6 | 4.7ms |
| + NAS-mined config (patch 16, res 640) | −0.2 | 54.4 | 4.7ms |
| + Res 576, 2 windows, 4 layers | +0.3 | 54.7 | 4.4ms |

Read the story: We start by making training gentler (lower LR, layer norm) and lose 1 AP. Then DINOv2 more than compensates (+2.0), and additional Objects365 pre-training adds another +0.7. NAS acts as a regularizer (+0.3). Finally, the NAS-mined configuration recovers the base latency while keeping the accuracy. Net result: +2.1 AP (52.6 → 54.7) at the same latency.
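The cumulative column can be checked with a running sum over the deltas, starting from the 52.6 AP baseline quoted above. This is only an arithmetic check on the numbers in this lesson.

```python
from itertools import accumulate

baseline_ap = 52.6  # LW-DETR (M), as quoted above
deltas = [
    ("Gentler hyperparameters", -1.0),
    ("DINOv2 backbone", 2.0),
    ("Additional O365 pre-training", 0.7),
    ("Weight-sharing NAS", 0.3),
    ("NAS-mined config (patch 16, res 640)", -0.2),
    ("Res 576, 2 windows, 4 layers", 0.3),
]

# Running total after each change (drop the leading baseline entry).
running = list(accumulate((d for _, d in deltas), initial=baseline_ap))[1:]
for (name, delta), ap in zip(deltas, running):
    print(f"{name:<40} {delta:+.1f} -> {ap:.1f}")

net_gain = round(running[-1] - baseline_ap, 1)
```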

Standardizing Latency Benchmarks

The paper makes an important meta-contribution: they identify that GPU power throttling causes up to 25% variance in latency measurements between papers. The fix is dead simple: insert a 200ms buffer between consecutive forward passes. This prevents power over-draw and yields reproducible numbers. They also insist on reporting accuracy and latency from the same model artifact (FP16 accuracy with FP16 latency, not FP32 accuracy with FP16 latency).
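A generic timing harness with the pause between runs might look like the sketch below. It is not the paper's benchmarking code. The warmup count, the use of the median, and the stand-in workload are assumptions, and for a GPU model the timed function must itself wait for the device to finish.

```python
import statistics
import time


def benchmark(fn, n_runs=20, pause_s=0.2, warmup=3):
    """Time `fn` n_runs times, sleeping `pause_s` seconds before each timed run.

    Returns the median latency in milliseconds. For GPU models, `fn` itself must
    block until the device has finished (this harness cannot check that).
    """
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(n_runs):
        time.sleep(pause_s)  # cool-down buffer between forward passes
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings_ms)


# Stand-in workload; replace with a call to your model's forward pass.
latency = benchmark(lambda: sum(range(10_000)), n_runs=5, pause_s=0.01)
print(f"median latency: {latency:.3f} ms")
```

The default `pause_s=0.2` matches the 200ms buffer described above. The example call shortens it only so the demo finishes quickly.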

D-FINE beats RT-DETR significantly on COCO (55.0 vs 49.0 AP). What happens on RF100-VL?

Chapter 9: Connections

Where RF-DETR Fits in the Detection Timeline

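RF-DETR sits at the intersection of three research threads: real-time detection transformers (LW-DETR, RT-DETR, D-FINE), self-supervised vision foundation models (DINOv2), and weight-sharing neural architecture search (Once-for-All).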

Key Equations Cheat Sheet

| Concept | Formula | What It Means |
| --- | --- | --- |
| Token count | N = (H/P) × (W/P) | Smaller patch P → more tokens |
| Self-attention cost | O(N² · d) | Quadratic in token count |
| Deformable attention | O(N_q · K · S) | K sample points, S scales; much cheaper |
| Cosine LR (avoided) | lr_min + ½ (lr_max − lr_min) (1 + cos(π · t / T)) | Requires known horizon T |
| Mask generation | M_i = q_i · E_pixel^T | Dot product of query with pixel embeddings |

What the Paper Doesn't Say

The broader trend: RF-DETR represents a shift from "design one architecture per speed tier" (YOLO-nano, YOLO-small, YOLO-medium...) to "train one model, mine all tiers." This weight-sharing NAS paradigm will likely spread beyond detection to segmentation, pose estimation, and other dense prediction tasks.

What is a notable limitation of RF-DETR despite its strong accuracy-latency tradeoff?