A unified model that detects, segments, and tracks all instances of a visual concept in images and videos, prompted by text, image exemplars, or both. It more than doubles the best prior system's score on open-vocabulary concept segmentation (cgF1 on SA-Co/Gold).
You are editing a video of a city street. You need to blur every single license plate. There are 47 cars, some barely visible, some reflected in glass, some at weird angles. With SAM 2, you would have to find every plate yourself and click on each one individually: 47 prompts, one per plate. Miss one and it stays unblurred.
What you actually want is to say “license plate” and have the model find, segment, and track every single instance automatically.
This is the gap that SAM 3 fills. SAM 1 introduced promptable segmentation with clicks. SAM 2 extended it to video with memory. But both systems share a fundamental limitation: they can only segment one object per prompt. You point at something, you get that one thing. You cannot express a concept.
A concept is a category of visual things describable by a short noun phrase: “yellow school bus”, “striped cat”, “traffic cone”. Concept prompting means you describe what you want and the model finds ALL instances — not just one. And it tracks every instance across every frame in video, assigning unique IDs so you know which cat is which.
Existing open-vocabulary detectors like OWLv2 or GroundingDINO can do something like this on images, but they have serious problems:
Before building the model, SAM 3 formalizes exactly what problem it solves. This formalization is called Promptable Concept Segmentation (PCS), and it is distinct from the Promptable Visual Segmentation (PVS) task that SAM 1 and SAM 2 solved.
PVS (SAM 1/2): You give a spatial prompt — a click, a box, or a mask — on a specific object. The model segments that one object. If you want 10 objects, you give 10 prompts.
PCS (SAM 3): You give a concept prompt — a short noun phrase like “red apple”, an image exemplar (a bounding box around an example object), or both. The model finds and segments every instance of that concept in the entire image or video, with unique IDs for tracking.
SAM 3 restricts concepts to simple noun phrases (NPs): a noun with optional modifiers. Examples:
This restriction is intentional. Simple NPs are unambiguous enough to annotate at scale, yet expressive enough to cover the vast majority of real-world segmentation needs. SAM 3’s training set has 4 million unique noun phrases.
| Prompt Type | Input | When to use |
|---|---|---|
| Text NP | “traffic cone” | You know the name of the concept |
| Image exemplar | Bounding box on an example object (+ or −) | Easier to show than describe (“that kind of flower”) |
| Text + exemplar | Both combined | Text is ambiguous; exemplar disambiguates |
Crucially, PCS also supports interactive refinement. After the initial detection, you can add positive exemplars (to catch objects the model missed) or negative exemplars (to suppress false positives). And you can refine individual masks with clicks, just like SAM 2.
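To make the three prompt types and the refinement loop concrete, here is a minimal sketch of how a concept prompt could be represented. The class names, field names, and box convention are hypothetical, chosen for illustration; they are not SAM 3's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical convention: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Exemplar:
    box: Box
    positive: bool = True  # False marks a negative exemplar


@dataclass
class ConceptPrompt:
    text: Optional[str] = None
    exemplars: List[Exemplar] = field(default_factory=list)

    def kind(self) -> str:
        """Name the row of the prompt-type table this prompt falls into."""
        if self.text and self.exemplars:
            return "text + exemplar"
        if self.text:
            return "text"
        if self.exemplars:
            return "exemplar"
        raise ValueError("a concept prompt needs text, an exemplar, or both")


# Start with text only, then refine interactively.
prompt = ConceptPrompt(text="traffic cone")
initial_kind = prompt.kind()

# Positive box on a cone the model missed, negative box on a false positive.
prompt.exemplars.append(Exemplar(box=(40, 60, 80, 140), positive=True))
prompt.exemplars.append(Exemplar(box=(200, 50, 260, 120), positive=False))
refined_kind = prompt.kind()
print(initial_kind, "->", refined_kind)
```

The point of the sketch is that refinement only appends to the prompt: the text stays fixed while positive and negative exemplars accumulate.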
Open vocabulary introduces inherent ambiguity. “Mouse”: the device or the animal? “Mirror”: does it include the frame? SAM 3 addresses this in three ways: (1) it collects test annotations from three independent experts, (2) it evaluates with an oracle metric that picks the best-matching ground truth, and (3) it includes an ambiguity module in the model that predicts multiple valid interpretations.
SAM 3 is built from two major components that share a single vision backbone: a detector for image-level concept segmentation, and a tracker for propagating detections through video. Let’s trace the full data flow.
Both detector and tracker share a single pre-trained vision-language backbone called Perception Encoder (PE). PE produces aligned image and text embeddings. This is the foundation that gives SAM 3 its open-vocabulary understanding.
For any input image $I \in \mathbb{R}^{H \times W \times 3}$, PE outputs:
For a text prompt “yellow school bus”, PE outputs:
where L is the number of text tokens and C is the embedding dimension. The key property: image and text embeddings live in the same vector space, so the model can match visual patterns to language descriptions.
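A shared embedding space means a text embedding and an image-token embedding can be compared directly. The sketch below uses cosine similarity on toy 4-dimensional vectors; the numbers are invented for illustration and do not come from PE, and cosine similarity is only one plausible way to compare the two.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (made up): one text vector and two image-patch vectors.
text_embedding = [0.9, 0.1, 0.0, 0.4]  # stands in for "yellow school bus"
patch_on_bus = [0.8, 0.2, 0.1, 0.5]
patch_on_sky = [0.0, 0.9, 0.4, 0.0]

sim_bus = cosine_similarity(text_embedding, patch_on_bus)
sim_sky = cosine_similarity(text_embedding, patch_on_sky)
print(f"bus patch: {sim_bus:.2f}, sky patch: {sim_sky:.2f}")
```

A patch that depicts the concept should score closer to the text vector than an unrelated patch, which is the property the rest of the pipeline builds on.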
The detector and tracker have deliberately decoupled designs. Why? Because detection and tracking have conflicting objectives:
Forcing a single model to do both simultaneously creates task conflict — a well-documented problem where one objective degrades the other. SAM 3 avoids this by running detection and tracking as separate heads on the same backbone.
For a single video frame at time t:
This is SAM 3’s most elegant architectural innovation. It solves a fundamental tension in object detection: each proposal query must simultaneously answer two different questions:
In standard DETR, each of the (say 300) learned object queries must answer both questions. Each query must look at the entire image for global context (“is there a cat anywhere?”) while also focusing on a specific local region (“is there a cat at position (x, y)?”). These objectives actively conflict.
SAM 3 introduces a single learned presence token. This token is solely responsible for answering the “what?” question:
Each regular proposal query $q_i$ then only needs to answer the “where?” question, conditioned on the concept being present:
The final score for each proposal is the product:
Imagine you prompt with “giraffe” on an image of a kitchen. Without the presence head:
With the presence head:
The presence head shines because SAM 3 is trained with hard negatives — noun phrases that are plausible but NOT present in the image. For example, an image of a tabby cat might be prompted with “leopard”, “tiger”, or “bobcat”. Without the presence head, hard negatives make training unstable because queries cannot easily learn to globally reject a concept. With the presence head, the global token learns to reject and the local queries can focus on localization.
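The factorized scoring described above can be sketched in a few lines. The probabilities are invented for the “giraffe in a kitchen” scenario, and setting the presence probability to 1.0 is used here only as a stand-in for a model without a presence head.

```python
def final_scores(p_presence, p_queries):
    """Final score per proposal: global presence times per-query match."""
    return [p_presence * p for p in p_queries]


def detections(p_presence, p_queries, threshold=0.5):
    """Indices of the proposals whose final score clears the threshold."""
    scores = final_scores(p_presence, p_queries)
    return [i for i, s in enumerate(scores) if s >= threshold]


# "giraffe" prompted on a kitchen image. One query is fooled locally
# (say, by a tall spotted lamp); the values are made up.
kitchen_queries = [0.70, 0.10, 0.05]

# Presence fixed at 1.0 mimics having no presence head at all.
without_presence = detections(1.0, kitchen_queries)

# A presence token that has learned "no giraffe here" vetoes every query.
with_presence = detections(0.02, kitchen_queries)

print(without_presence, with_presence)
```

Because the presence probability multiplies every query's score, a single confident global rejection removes all false positives at once, however convincing any individual region looks.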
| Config | cgF1 | IL_MCC | pmF1 |
|---|---|---|---|
| No presence head | 50.7 | 0.77 | 65.4 |
| With presence head | 52.2 | 0.82 | 63.4 |
Notice that pmF1 (localization quality) actually drops slightly, from 65.4 to 63.4. This is expected: the presence head is about recognition, not localization. The cgF1 gain (50.7 to 52.2) comes from the IL_MCC improvement (0.77 to 0.82), which means fewer images where the model detects a concept that is not actually there.
Let’s trace a concrete example through the entire detector. We have an image of a park with 3 dogs of different breeds, and the text prompt “dog.”
The Perception Encoder processes both inputs:
The image is 896×896 at patch size 14, giving 64×64 = 4,096 spatial tokens. The text “dog” tokenizes to 3 tokens including special tokens.
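The token count follows directly from the image size and patch size. A small helper makes the arithmetic explicit; the function name is illustrative only.

```python
def num_patch_tokens(height, width, patch_size):
    """Return (rows, cols, total) of non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols


rows, cols, total = num_patch_tokens(896, 896, 14)
print(rows, cols, total)  # 64 64 4096
```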
The unconditioned image features must be conditioned on the concept prompt. The fusion encoder does this with cross-attention:
Each image token cross-attends to the text tokens. After fusion, every image token “knows” what concept we are looking for. The fusion encoder uses multiple layers of self-attention (among image tokens) and cross-attention (image → text). This is where the model learns to highlight regions that match “dog” and suppress everything else.
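The image-to-text cross-attention step can be illustrated with a stripped-down, single-head version in plain Python. This is a sketch under strong simplifying assumptions: there are no learned query/key/value projections, no self-attention, and only one layer, whereas the real fusion encoder has all of these.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(image_tokens, text_tokens):
    """Each image token attends over all text tokens (single head).

    Simplification: tokens are used directly as queries, keys, and values.
    """
    dim = len(text_tokens[0])
    fused = []
    for q in image_tokens:
        # Scaled dot-product similarity between this image token and each text token.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in text_tokens]
        weights = softmax(logits)
        # Weighted average of the text tokens.
        context = [sum(w * k[d] for w, k in zip(weights, text_tokens)) for d in range(dim)]
        # Residual connection: the image token keeps its content and gains text context.
        fused.append([x + c for x, c in zip(q, context)])
    return fused


image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_tokens = [[0.9, 0.1], [0.2, 0.8]]
fused_tokens = cross_attend(image_tokens, text_tokens)
print(len(fused_tokens), len(fused_tokens[0]))
```

The output has one fused vector per image token, so the spatial layout is preserved while each location now carries information about the prompt.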
A set of learned object queries (say 300) cross-attend to the fused image features. Each query tries to latch onto one potential object:
SAM 3 uses deformable attention with box-region-positional bias to help each query focus on its local region. Each decoder layer predicts:
Queries that survive thresholding get masks via a MaskFormer-style mask head. The dot product between each query’s embedding and the per-pixel features gives a per-pixel logit, which is thresholded to produce a binary mask:
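A MaskFormer-style mask head reduces to a dot product per pixel followed by a sigmoid and a threshold. The sketch below shows that mechanic on a 2×2 feature grid with invented 2-dimensional features; the real head operates on learned, much higher-dimensional features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_mask(query_embedding, pixel_features, threshold=0.5):
    """pixel_features is an H x W grid of C-dimensional feature vectors."""
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            # Per-pixel logit: dot product of the query with the pixel feature.
            logit = sum(q * f for q, f in zip(query_embedding, feat))
            mask_row.append(1 if sigmoid(logit) > threshold else 0)
        mask.append(mask_row)
    return mask


# Toy values (made up): logits are 2.0, -1.0, 1.0, -2.0 in reading order.
query = [2.0, -1.0]
pixels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
mask = predict_mask(query, pixels)
print(mask)  # [[1, 0], [1, 0]]
```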
There is also a semantic segmentation head that predicts a single binary mask for ALL pixels matching the concept, regardless of instance boundaries. This provides a complementary training signal.
The presence token predicts $p_{\text{presence}} = 0.97$ (yes, there are dogs). Each query’s score is multiplied by this. Queries landing on the 3 dogs get final scores of ~0.92, ~0.88, ~0.85. Queries on non-dog objects get scores below 0.01. We threshold at 0.5 and output 3 instance masks with 3 bounding boxes.
The detector uses Hungarian matching (like DETR) to assign predicted objects to ground truth, then optimizes:
The detector finds objects frame by frame. But in video, you need continuity. The same dog must keep the same ID across 500 frames, even when it runs behind a tree and reappears. This is the tracker’s job.
SAM 3’s tracker is essentially SAM 2’s video segmentation pipeline. It uses the same PE backbone (frozen during tracker training), plus:
On each frame t, the tracker performs single-frame propagation:
The tracker cross-attends current frame features to memory bank features. The memory encoder uses self-attention across the current frame’s visual features and cross-attention from visual features to the spatial memory. For each tracked object, it predicts 3 output masks at different confidence levels and selects the most confident one — handling ambiguity (e.g., when an object is partially occluded).
Now we have two sets of masks on frame $t$: tracked masks $\hat{M}_t$ from the tracker and new detections $O_t$ from the detector. We need to reconcile them:
The matching uses IoU-based association:
Crowded scenes create ambiguities. SAM 3 uses two strategies:
1. Masklet detection score: Measures how consistently a tracked object is matched to a detection within a temporal window. If the score drops below a threshold (the object hasn’t been detected for many frames), the masklet is suppressed — it was likely a false positive.
2. Periodic re-prompting: Periodically, high-confidence detections replace the tracker’s predictions in the memory bank. This prevents drift — if the tracker slowly warps a mask over time, the detector resets it to a fresh, accurate detection.
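The first strategy can be sketched as a sliding-window counter. This is a guess at the mechanics, not the paper's definition: it assumes the score is simply the fraction of recent frames in which the masklet was matched to a detection, and the window length and threshold are made-up values.

```python
from collections import deque


class MaskletScore:
    """Tracks how often a masklet was matched to a detection recently."""

    def __init__(self, window=8, min_score=0.25):
        # Hypothetical hyperparameters; the real values are not given here.
        self.history = deque(maxlen=window)
        self.min_score = min_score

    def update(self, matched_to_detection):
        """Record one frame's outcome and return the updated score."""
        self.history.append(1 if matched_to_detection else 0)
        return self.score()

    def score(self):
        return sum(self.history) / len(self.history) if self.history else 0.0

    def should_suppress(self):
        # Only judge once a full window of evidence is available.
        full = len(self.history) == self.history.maxlen
        return full and self.score() < self.min_score


tracker_state = MaskletScore(window=4, min_score=0.5)
for matched in [True, False, False, False]:
    tracker_state.update(matched)

print(tracker_state.score(), tracker_state.should_suppress())  # 0.25 True
```

Waiting for a full window before suppressing is a design choice in this sketch: it avoids discarding a new masklet after a single missed detection.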
During inference, the memory bank only stores frames where the object is confidently present. Occluded frames (where the tracker is uncertain) are excluded. This prevents corrupted memories from degrading future predictions. When the object reappears, the detector picks it up as a fresh detection and re-links it to the existing tracklet via IoU matching.
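IoU-based association between tracked masks and fresh detections can be sketched as follows. Masks are represented as sets of pixel coordinates for simplicity, the matching is greedy by descending IoU, and unmatched detections are treated as candidates for new tracks. The greedy strategy and the 0.5 threshold are assumptions for illustration, not details confirmed by the source.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def associate(tracked, detected, iou_threshold=0.5):
    """Greedily match tracked masks to detections by IoU.

    tracked:  dict mapping track id -> mask
    detected: list of masks
    Returns (matches, new) where matches maps track id -> detection index
    and new lists detection indices that matched no existing track.
    """
    pairs = sorted(
        ((mask_iou(t, d), tid, j)
         for tid, t in tracked.items()
         for j, d in enumerate(detected)),
        reverse=True,
    )
    matches, used = {}, set()
    for iou, tid, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    new = [j for j in range(len(detected)) if j not in used]
    return matches, new


tracked = {1: {(0, 0), (0, 1), (1, 0), (1, 1)}}
detected = [
    {(5, 5), (5, 6)},          # far away from track 1
    {(0, 0), (0, 1), (1, 0)},  # overlaps track 1 with IoU 3/4
]
matches, new = associate(tracked, detected)
print(matches, new)  # {1: 1} [0]
```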
Text prompts are powerful but not always sufficient. Sometimes you cannot easily name what you want — a specific type of wildflower, a particular style of bracket, a rarely-described industrial component. SAM 3 lets you show the model what you mean using image exemplars.
An image exemplar is a bounding box drawn on an example object, labeled as positive (this is what I want) or negative (this is NOT what I want). The model then finds all other instances that match the positive examples while avoiding the negative ones.
Technically, each exemplar is encoded by the exemplar encoder using three components:
These are concatenated and processed by a small transformer, then concatenated with text tokens to form the complete prompt tokens fed to the fusion encoder.
This is a subtle but critical distinction. In SAM 1/2, pointing at a dog gives you that one dog. In SAM 3, giving an image exemplar of a dog gives you all dogs. The exemplar defines the concept, and the model generalizes from it.
| Prompt Type | Input | Output |
|---|---|---|
| PVS click (SAM 2) | Click on dog #1 | Mask for dog #1 only |
| PCS exemplar (SAM 3) | Box around dog #1 | Masks for ALL dogs |
| PCS text (SAM 3) | “dog” | Masks for ALL dogs |
| PCS text + exemplar | “dog” + box on dog #1 | Masks for ALL dogs, disambiguated |
The real power emerges in the interactive workflow:
Even a single exemplar dramatically improves results over text-only. On COCO, text-only achieves 56.4 AP. Adding 1 image exemplar jumps to 76.8 AP+. Adding both text + exemplar reaches 78.1 AP+. SAM 3 outperforms the prior state-of-the-art T-Rex2 by +18.3 AP on COCO with 1 exemplar.
Architecture is only half the story. The reason SAM 3 doubles the accuracy of prior systems is its data engine — a human-AI annotation pipeline that produced 4M unique noun phrases and 52M masks at quality levels impossible with human annotation alone.
To train a great concept segmentor, you need concept-level mask annotations. But manually annotating “every instance of ‘traffic cone’ in this image” for millions of images is prohibitively expensive. The data engine solves this by iterating between model training and data collection — each round of SAM 3 proposes masks, humans verify and correct them, and the improved data trains a better SAM 3.
Phase 1: Human verification (bootstrap). Start with simple captioners to propose NPs. Use SAM 2 + off-the-shelf detector to propose masks. Humans verify everything. Produces 4.3M image-NP pairs. Train initial SAM 3.
Phase 2: Human + AI verification. The breakthrough. Fine-tune Llama 3.2 on human accept/reject labels to create AI verifiers that automatically judge mask quality and exhaustivity. This doubles throughput compared to human-only annotation. Additionally, an upgraded NP proposal pipeline generates hard negatives adversarial to the current SAM 3. Re-train SAM 3 six times. Produces 122M image-NP pairs.
Phase 3: Scaling and domain expansion. Expand to 15 diverse visual domains. Mine long-tail concepts from a 22.4M node ontology based on Wikidata. Re-train SAM 3 seven times and AI verifiers three times. Produces 19.5M additional image-NP pairs.
Phase 4: Video annotation. Extend to video. Use mature image SAM 3 to annotate video frames. Focus human effort on crowded scenes and tracking failures. Produces 52.5K videos with 467K masklets.
Every mask goes through two quality gates:
| Step | Question | Failure mode caught |
|---|---|---|
| Mask Verification (MV) | Is this mask correct for this NP? | Bad mask quality, wrong object |
| Exhaustivity Verification (EV) | Are ALL instances of this NP masked? | Missed instances |
Only image-NP pairs that fail exhaustivity verification go to expensive human correction. Pairs that pass both checks are used directly as training data.
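The routing rule for the two quality gates can be written down as a small function. This is a sketch under assumptions: the source does not say what happens to individual masks that fail mask verification, so the code simply drops them, and the function and label names are hypothetical.

```python
def route_pair(mask_verdicts, exhaustive):
    """Decide where an image-NP pair goes after MV and EV.

    mask_verdicts: list of booleans, one MV verdict per proposed mask
    exhaustive:    EV verdict for the pair as a whole
    Returns (destination, indices of masks that passed MV).
    """
    # Assumption: masks rejected by MV are dropped before routing.
    kept = [i for i, ok in enumerate(mask_verdicts) if ok]
    if not exhaustive:
        # Some instances are missing: send to human correction.
        return "human_correction", kept
    return "training_data", kept


clean = route_pair([True, True], exhaustive=True)
incomplete = route_pair([True, False], exhaustive=False)
print(clean, incomplete)
```

The economic point is in the second branch: human time is spent only where the automatic checks say something is missing.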
Both high-quality (HQ) and synthetic (SYN) data show clean power-law scaling. On SA-Co/Gold, going from EXT-only (external, pre-existing datasets) to EXT+SYN improves cgF1 by +8.8. Adding HQ data on top improves it by another +14.6. The total training set comprises 52M human-verified masks, 1.4B synthetic masks, and 4M unique NPs, orders of magnitude more concept diversity than any prior dataset.
You cannot measure progress on a task without a good benchmark. Prior open-vocabulary benchmarks had at most a few thousand categories. SAM 3 introduces SA-Co — Segment Anything with Concepts — with 207K unique phrases across 120K images and 1.7K videos.
| Split | Domains | Annotation | Purpose |
|---|---|---|---|
| SA-Co/Gold | 7 | 3 annotators per pair | Measure human ceiling, handle ambiguity |
| SA-Co/Silver | 10 | 1 annotator per pair | Larger-scale evaluation |
| SA-Co/Bronze | 9 | Existing + SAM 2 masks | Cross-benchmark evaluation |
| SA-Co/Bio | Bio | Existing annotations | Specialized domain evaluation |
| SA-Co/VEval | 3 | Video NP pairs | Video concept segmentation |
COCO has 80 categories. LVIS has 1,203. These are fixed vocabularies — models memorize the category list. SA-Co has 207K unique phrases from open vocabulary, including fine-grained concepts (“Boston terrier” vs “French bulldog”), rare objects, and domain-specific terms. It also includes hard negative NPs to test calibration.
SAM 3 introduces a metric designed for practical use. Standard detection AP does not account for calibration — a model might have high AP but require custom thresholds per category to be usable. SAM 3’s metrics enforce a fixed threshold of 0.5:
Where:
Both components must be good for cgF1 to be high. A model that always predicts “present” gets low IL_MCC. A model that produces bad masks gets low pmF1.
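How the two components interact can be checked numerically. The sketch assumes cgF1 is the product of pmF1 (on a 0 to 100 scale) and IL_MCC, which agrees with the presence-head ablation table to within rounding of IL_MCC; the Matthews correlation coefficient itself is the standard definition over a binary confusion matrix.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cg_f1(pm_f1, il_mcc):
    """Assumed combination: localization quality scaled by image-level calibration."""
    return pm_f1 * il_mcc


# Ablation rows: (pmF1, IL_MCC) -> compare with the reported cgF1.
no_head = cg_f1(65.4, 0.77)    # reported cgF1: 50.7
with_head = cg_f1(63.4, 0.82)  # reported cgF1: 52.2
print(round(no_head, 1), round(with_head, 1))

# A model that says "present" for every image-NP pair (50 positives, 50 negatives)
# has no true negatives, so its MCC collapses to 0 regardless of mask quality.
always_present = mcc(tp=50, fp=50, tn=0, fn=0)
perfect = mcc(tp=50, fp=0, tn=50, fn=0)
print(always_present, perfect)
```

Under this assumed formula, a perfect pmF1 is worth nothing if IL_MCC is 0, which is exactly the calibration pressure the metric is meant to apply.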
SAM 3 sets a new state of the art on most of the benchmarks below, though not all: on COCO box AP, APE-D* still scores higher (59.6 vs 53.6).
| Model | LVIS AP (mask) | SA-Co/Gold cgF1 | COCO AP (box) |
|---|---|---|---|
| OWLv2* | 29.3 | 24.6 | 45.5 |
| GroundingDINO-T | 14.7 | 3.3 | 20.5 |
| LLMDet-L | 35.1 | 6.5 | 42.0 |
| APE-D* | — | 16.4 | 59.6 |
| DINO-X | 38.5 | 21.3 | 52.4 |
| Gemini 2.5 Flash | 13.4 | 13.0 | — |
| SAM 3 | 48.5 | 54.1 | 53.6 |
| Human | — | 72.8 | — |
On LVIS, SAM 3 achieves 48.5 AP — a +10 point jump over the best prior system (DINO-X at 38.5). On SA-Co/Gold, SAM 3 more than doubles the best baseline’s cgF1 (54.1 vs 24.6 for OWLv2*). SAM 3 reaches 74% of estimated human performance.
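On video, SAM 3 outperforms every listed baseline, both on its own SA-Co/VEval benchmark (the SA-V columns) and on the LVVIS and BURST benchmarks. “T-by-D” denotes a tracking-by-detection baseline built on the SAM 3 detector: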
| Model | SA-V cgF1 | SA-V pHOTA | LVVIS mAP | BURST HOTA |
|---|---|---|---|---|
| GLEE (one NP) | 0.1 | 11.8 | 20.8 | 28.4 |
| LLMDet + SAM 3 Tracker | 2.3 | 30.1 | 15.2 | 33.3 |
| SAM 3 Detector + T-by-D | 25.7 | 55.7 | 35.9 | 39.7 |
| SAM 3 | 30.3 | 58.0 | 36.3 | 44.5 |
| Human | 53.1 | 70.5 | — | — |
SAM 3 doesn’t just add a new capability; it also improves the original task. On the challenging MOSEv2 video object segmentation benchmark, SAM 3 outperforms SAM 2.1 by +12.4 J&F (60.3 vs 47.9). On DAVIS17 it reaches 92.2 vs 90.7. On interactive image segmentation (SA-37 benchmark), SAM 3 outperforms SAM 2.1 on average mIoU at 1, 3, and 5 clicks.
SAM 3 handles simple NPs natively. For complex queries like “the object the person in red is holding,” SAM 3 pairs with an MLLM. The MLLM decomposes the query into simple NPs, prompts SAM 3 iteratively, and analyzes returned masks. On ReasonSeg, SAM 3 Agent with Gemini 2.5 Pro achieves 77.0 gIoU — crushing prior zero-shot approaches (best was 65.0).
Because SAM 3 finds all instances, it can count objects by counting masks. On CountBench, SAM 3 achieves 93.8% accuracy with MAE of 0.12 — beating Gemini 2.5 Pro (92.4%, MAE 0.24). And unlike MLLMs that just output a number, SAM 3 provides the actual segmentation masks for each counted object.
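Counting by segmentation amounts to taking the length of the mask list, and the two reported metrics are exact-match accuracy and mean absolute error over images. The sketch below computes both on invented counts; it does not reproduce CountBench.

```python
def count_instances(masks):
    """The predicted count is simply the number of instance masks."""
    return len(masks)


def counting_metrics(predicted, actual):
    """Exact-match accuracy and mean absolute error over a set of images."""
    n = len(actual)
    accuracy = sum(p == a for p, a in zip(predicted, actual)) / n
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    return accuracy, mae


# Made-up example: four images, one of them off by one.
predicted_counts = [3, 5, 2, 7]
true_counts = [3, 5, 2, 6]
accuracy, mae = counting_metrics(predicted_counts, true_counts)
print(accuracy, mae)  # 0.75 0.25
```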
On an H200 GPU, SAM 3 runs in 30 ms per image with 100+ detected objects. In video, latency scales with number of tracked objects, sustaining near real-time for ~5 concurrent objects.
SAM 1 (2023): Introduced promptable image segmentation with points/boxes/masks. Could segment one object per prompt. Trained on SA-1B (1B masks from 11M images). No text understanding, no video.
SAM 2 (2024): Extended to video via streaming memory architecture. Still one object per prompt. Introduced SA-V dataset for video. Real-time interactive video segmentation.
SAM 3 (2025): Extends to concept-level segmentation via text/exemplar prompts. Finds ALL instances. DETR-based detector + SAM 2 tracker sharing PE backbone. 4M NPs, 52M masks. Doubles prior accuracy.
SAM 3’s detector follows the DETR paradigm: learned object queries, Hungarian matching, set prediction loss. Key innovations it borrows: deformable attention (Deformable DETR), box-region positional bias, dual supervision from DAC-DETR, align loss, and MaskFormer-style mask heads.
The Perception Encoder backbone comes from aligned vision-language pre-training (similar to CLIP but with richer features). This is what enables SAM 3 to match text descriptions to visual content in the first place — without it, the model has no way to interpret “yellow school bus.”
Limitations the authors acknowledge: SAM 3 struggles with out-of-domain concepts (specialized domains not in training), very long tail concepts, and referring expressions that require reasoning (though MLLM pairing helps). Domain adaptation via synthetic data (no human labels needed) partially addresses the first issue.
Broader trend: SAM 3 is part of the convergence of detection, segmentation, tracking, and language into unified models. Vision Banana (2026) showed that even image generators can be turned into segmentors. The era of single-task vision models is ending; foundation models that handle everything are becoming the norm.