Carion, Gustafson, Hu, Debnath, Hu, Suris, Ryali et al. — Meta Superintelligence Labs, 2025

SAM 3: Segment Anything with Concepts

A unified model that detects, segments, and tracks all instances of a visual concept in images and videos, prompted by text, image exemplars, or both. On open-vocabulary concept segmentation it roughly doubles the accuracy of the best prior system (54.1 vs 24.6 cgF1 on SA-Co/Gold).

Prerequisites: SAM / SAM 2 + DETR + Vision-Language alignment
11 chapters · 10+ simulations

Chapter 0: The Problem

You are editing a video of a city street. You need to blur every single license plate. There are 47 cars, some barely visible, some reflected in glass, some at weird angles. With SAM 2, you could click on each plate individually — one by one, 47 times, per frame. You would need to find every plate yourself and click on it. Miss one and it stays unblurred.

What you actually want is to say “license plate” and have the model find, segment, and track every single instance automatically.

This is the gap that SAM 3 fills. SAM 1 introduced promptable segmentation with clicks. SAM 2 extended it to video with memory. But both systems share a fundamental limitation: they can only segment one object per prompt. You point at something, you get that one thing. You cannot express a concept.

What “concept prompting” means

A concept is a category of visual things describable by a short noun phrase: “yellow school bus”, “striped cat”, “traffic cone”. Concept prompting means you describe what you want and the model finds ALL instances — not just one. And it tracks every instance across every frame in video, assigning unique IDs so you know which cat is which.

Existing open-vocabulary detectors like OWLv2 or GroundingDINO can do something like this on images, but they have serious problems:

The core challenge: Build a single unified model that takes a text description or image example of a concept, then detects, segments, and tracks every instance of that concept across images and video — with pixel-perfect masks, unique identity tracking, and the ability to interactively refine results when the model makes mistakes.
What is the key limitation of SAM 1 and SAM 2 that SAM 3 addresses?

Chapter 1: Promptable Concept Segmentation (PCS)

Before building the model, SAM 3 formalizes exactly what problem it solves. This formalization is called Promptable Concept Segmentation (PCS), and it is distinct from the Promptable Visual Segmentation (PVS) task that SAM 1 and SAM 2 solved.

PVS vs PCS

PVS (SAM 1/2): You give a spatial prompt — a click, a box, or a mask — on a specific object. The model segments that one object. If you want 10 objects, you give 10 prompts.

PCS (SAM 3): You give a concept prompt — a short noun phrase like “red apple”, an image exemplar (a bounding box around an example object), or both. The model finds and segments every instance of that concept in the entire image or video, with unique IDs for tracking.

What counts as a “concept”?

SAM 3 restricts concepts to simple noun phrases (NPs): a noun with optional modifiers. Examples used in this article include “yellow school bus”, “striped cat”, “traffic cone”, and “red apple”.

This restriction is intentional. Simple NPs are unambiguous enough to annotate at scale, yet expressive enough to cover the vast majority of real-world segmentation needs. SAM 3’s training set has 4 million unique noun phrases.

The three prompt types

Prompt Type | Input | When to use
Text NP | “traffic cone” | You know the name of the concept
Image exemplar | Bounding box on an example object (+ or −) | Easier to show than describe (“that kind of flower”)
Text + exemplar | Both combined | Text is ambiguous; exemplar disambiguates

Crucially, PCS also supports interactive refinement. After the initial detection, you can add positive exemplars (to catch objects the model missed) or negative exemplars (to suppress false positives). And you can refine individual masks with clicks, just like SAM 2.
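
As a rough illustration of the three prompt types and of interactive refinement, a concept prompt can be modeled as optional text plus a list of signed exemplar boxes. The class names, field names, and box convention below are hypothetical and chosen for this sketch; they are not SAM 3's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical box convention for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Exemplar:
    box: Box
    positive: bool = True  # False marks a negative exemplar


@dataclass
class ConceptPrompt:
    text: Optional[str] = None  # short noun phrase, e.g. "traffic cone"
    exemplars: List[Exemplar] = field(default_factory=list)

    def kind(self) -> str:
        """Name the prompt type, following the table above."""
        if self.text and self.exemplars:
            return "text+exemplar"
        if self.text:
            return "text"
        if self.exemplars:
            return "exemplar"
        raise ValueError("a concept prompt needs text, an exemplar, or both")


prompt = ConceptPrompt(text="fish")
# Interactive refinement: a positive box on a missed instance,
# then a negative box on a false positive.
prompt.exemplars.append(Exemplar(box=(40, 60, 120, 110), positive=True))
prompt.exemplars.append(Exemplar(box=(300, 200, 380, 260), positive=False))
print(prompt.kind())  # prints "text+exemplar"
```

Keeping positive and negative exemplars in one list mirrors the idea that refinement only ever adds evidence to the same prompt.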

Handling ambiguity

Open vocabulary introduces inherent ambiguity. “Mouse” — the device or the animal? “Mirror” — does it include the frame? SAM 3 addresses this three ways: (1) collect test annotations from three independent experts, (2) evaluate with an oracle metric that picks the best-matching ground truth, and (3) include an ambiguity module in the model that predicts multiple valid interpretations.

Key insight: PCS generalizes PVS. Any PVS task (segment this one object) is a special case of PCS (segment all objects of this concept, but there happens to be just one). SAM 3 supports both tasks in a single model — PCS with the new detector, PVS with the inherited SAM 2 click-based interface.
Why does SAM 3 restrict concept prompts to simple noun phrases rather than allowing arbitrary language?

Chapter 2: The Architecture

SAM 3 is built from two major components that share a single vision backbone: a detector for image-level concept segmentation, and a tracker for propagating detections through video. Let’s trace the full data flow.

The Shared Backbone: Perception Encoder (PE)

Both detector and tracker share a single pre-trained vision-language backbone called Perception Encoder (PE). PE produces aligned image and text embeddings. This is the foundation that gives SAM 3 its open-vocabulary understanding.

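For any input image $I \in \mathbb{R}^{H \times W \times 3}$, PE outputs:

$F_{\text{img}} = \mathrm{PE}_{\text{vision}}(I) \in \mathbb{R}^{(H/14) \times (W/14) \times C}$

For a text prompt “yellow school bus”, PE outputs:

$F_{\text{text}} = \mathrm{PE}_{\text{text}}(\text{“yellow school bus”}) \in \mathbb{R}^{L \times C}$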

where L is the number of text tokens and C is the embedding dimension. The key property: image and text embeddings live in the same vector space, so the model can match visual patterns to language descriptions.
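
To make the shared-space property concrete, here is a toy sketch that compares a text embedding with two image-patch embeddings using cosine similarity. The 3-dimensional vectors are made up for illustration, and cosine similarity is an assumed stand-in for whatever matching the model learns internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings in a shared 3-d space (real ones have C dimensions).
text_bus = [0.9, 0.1, 0.0]   # text token for "yellow school bus"
patch_bus = [0.8, 0.2, 0.1]  # image patch covering the bus
patch_sky = [0.0, 0.1, 0.9]  # image patch covering the sky

# The bus patch sits much closer to the text embedding than the sky patch.
print(cosine(text_bus, patch_bus) > cosine(text_bus, patch_sky))  # prints True
```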

Two-path architecture

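The detector and tracker have deliberately decoupled designs. Why? Because detection and tracking have conflicting objectives: detecting new objects requires semantic focus (does this region match the concept?), while tracking existing ones requires identity focus (which instance is which?).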

Forcing a single model to do both simultaneously creates task conflict — a well-documented problem where one objective degrades the other. SAM 3 avoids this by running detection and tracking as separate heads on the same backbone.

High-level data flow

For a single video frame at time t:

  1. Encode: PE processes the frame to produce image features $F_{\text{img}}$
  2. Detect: The detector takes $F_{\text{img}}$ + prompt tokens and produces new object masks $O_t$
  3. Track: The tracker propagates previous masklets $M_{t-1}$ to the current frame, producing tracked masks $\hat{M}_t$
  4. Match: Associate tracked masks $\hat{M}_t$ with new detections $O_t$ via IoU matching
  5. Update: Final masks $M_t$ combine tracked and newly detected objects

Think of it this way: The detector is like a spotter. It scans each frame independently and calls out every object matching the concept. The tracker is like a bookkeeper. It keeps a running list of known objects and their locations, updating them frame by frame. The matching step reconciles these two signals: “the spotter found 5 cats, the bookkeeper was tracking 4, so there must be 1 new cat.”
Why does SAM 3 decouple detection and tracking into separate heads rather than using one end-to-end model?

Chapter 3: The Presence Head

This is SAM 3’s most elegant architectural innovation. It solves a fundamental tension in object detection: each proposal query must simultaneously answer two different questions:

  1. “What?” — Is this concept even present in the image? (Recognition)
  2. “Where?” — Where exactly is each instance? (Localization)

In standard DETR, each of the (say 300) learned object queries must answer both questions. Each query must look at the entire image for global context (“is there a cat anywhere?”) while also focusing on a specific local region (“is there a cat at position (x, y)?”). These objectives actively conflict.

The decoupling trick

SAM 3 introduces a single learned presence token. This token is solely responsible for answering the “what?” question:

$p_{\text{presence}} = \sigma(\mathrm{MLP}(\text{presence\_token})) = P(\text{NP is present in image})$

Each regular proposal query $q_i$ then only needs to answer the “where?” question, conditioned on the concept being present:

$p_i = P(q_i \text{ is a match} \mid \text{NP is present in image})$

The final score for each proposal is the product:

$\text{score}_i = p_{\text{presence}} \times p_i$
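
The product rule is simple enough to sketch directly. The function names and all probability values below are illustrative assumptions; the 0.5 cut-off is the fixed threshold the article uses elsewhere.

```python
def final_scores(p_presence, p_match):
    """Gate every per-query match probability by the global presence probability."""
    return [p_presence * p for p in p_match]


def keep(scores, threshold=0.5):
    """Indices of proposals whose final score clears the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Concept absent: a low presence probability suppresses every proposal,
# even a query that locally looks like a fairly good match (0.7).
absent = final_scores(0.02, [0.7, 0.4, 0.1])
print(keep(absent))  # prints []

# Concept present: confident local matches survive, weak ones do not.
present = final_scores(0.97, [0.95, 0.90, 0.01])
print(keep(present))  # prints [0, 1]
```

Because the gate is a single multiplication, one global decision can veto every query at once without any query having to learn global rejection itself.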

Why does this matter? A worked example.

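Imagine you prompt with “giraffe” on an image of a kitchen. Without the presence head: every query must decide on its own, from the region it attends to, that no giraffe exists anywhere in the image. Any query that fires on a vaguely giraffe-like pattern becomes a false positive.

With the presence head: the presence token looks at the whole image and outputs a low $p_{\text{presence}}$. Because every final score is multiplied by it, all proposals are suppressed at once, whatever the individual queries predict.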

The ablation supports this. Adding the presence head raises cgF1 by 1.5 points on SA-Co/Gold (50.7 to 52.2), with the image-level classification metric IL_MCC improving from 0.77 to 0.82. That +0.05 in MCC reflects fewer false positives when prompting with concepts that are not in the image, which is exactly the hard-negative scenario that matters most in real applications.

Training with hard negatives

The presence head shines because SAM 3 is trained with hard negatives — noun phrases that are plausible but NOT present in the image. For example, an image of a tabby cat might be prompted with “leopard”, “tiger”, or “bobcat”. Without the presence head, hard negatives make training unstable because queries cannot easily learn to globally reject a concept. With the presence head, the global token learns to reject and the local queries can focus on localization.

Config | cgF1 | IL_MCC | pmF1
No presence head | 50.7 | 0.77 | 65.4
With presence head | 52.2 | 0.82 | 63.4

Notice that pmF1 (localization quality) actually drops slightly, from 65.4 to 63.4. This is expected: the presence head is about recognition, not localization. The cgF1 improvement comes from the IL_MCC gain, which means fewer false-positive detections.

How does the presence head compute the final score for each proposal query?

Chapter 4: The Detector Pipeline

Let’s trace a concrete example through the entire detector. We have an image of a park with 3 dogs of different breeds, and the text prompt “dog.”

Step 1: Encode image and text

The Perception Encoder processes both inputs:

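$F_{\text{img}} = \mathrm{PE}_{\text{vision}}(\text{image}) \in \mathbb{R}^{64 \times 64 \times 1280}$
$F_{\text{text}} = \mathrm{PE}_{\text{text}}(\text{“dog”}) \in \mathbb{R}^{3 \times 1280}$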

The image is 896×896 at patch size 14, giving 64×64 = 4,096 spatial tokens. The text “dog” tokenizes to 3 tokens including special tokens.
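
The token count follows directly from the image size and patch size. A minimal check of that arithmetic, where the helper name is hypothetical:

```python
def num_patch_tokens(height, width, patch=14):
    """Return (rows, cols, total tokens) for a non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


print(num_patch_tokens(896, 896))  # prints (64, 64, 4096)
```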

Step 2: Fusion encoder

The unconditioned image features must be conditioned on the concept prompt. The fusion encoder does this with cross-attention:

$F_{\text{fused}} = \mathrm{FusionEncoder}(F_{\text{img}}, F_{\text{text}}) \in \mathbb{R}^{4096 \times 1280}$

Each image token cross-attends to the text tokens. After fusion, every image token “knows” what concept we are looking for. The fusion encoder uses multiple layers of self-attention (among image tokens) and cross-attention (image → text). This is where the model learns to highlight regions that match “dog” and suppress everything else.

Step 3: DETR decoder

A set of learned object queries (say 300) cross-attend to the fused image features. Each query tries to latch onto one potential object:

$\{q_1, \ldots, q_{300}\} = \mathrm{DETRDecoder}(\text{queries}, F_{\text{fused}})$

SAM 3 uses deformable attention with box-region-positional bias to help each query focus on its local region. Each decoder layer predicts:

Step 4: Mask head

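Queries that survive thresholding get masks via a MaskFormer-style mask head. The dot product of each query’s embedding with the per-pixel features, passed through a sigmoid, gives a per-pixel mask probability that is then thresholded into a binary mask:

$\text{mask}_i = \sigma(F_{\text{pixel}} \cdot q_i^{\top}) \in \mathbb{R}^{H \times W}$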
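
A minimal pure-Python sketch of this mask formula on a toy 2×2 feature map with two channels. The feature values, the query vector, and the 0.5 binarization threshold are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_mask(pixel_features, query, threshold=0.5):
    """pixel_features: H x W x C nested lists; query: length-C list.

    Returns (probabilities, binary mask), both H x W.
    """
    probs = [
        [sigmoid(sum(f * q for f, q in zip(pixel, query))) for pixel in row]
        for row in pixel_features
    ]
    binary = [[p > threshold for p in row] for row in probs]
    return probs, binary


# Toy inputs: the query responds to channel 0 and is repelled by channel 1.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 1.0], [2.0, 0.0]],
]
query = [2.0, -2.0]

probs, mask = predict_mask(features, query)
print(mask)  # prints [[True, False], [False, True]]
```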

There is also a semantic segmentation head that predicts a single binary mask for ALL pixels matching the concept, regardless of instance boundaries. This provides a complementary training signal.

Step 5: Scoring with presence head

The presence token predicts $p_{\text{presence}} = 0.97$ (yes, there are dogs). Each query’s score is multiplied by this. Queries landing on the 3 dogs get final scores of ~0.92, ~0.88, ~0.85. Queries on non-dog objects get scores below 0.01. We threshold at 0.5 and output 3 instance masks with 3 bounding boxes.

Training losses

The detector uses Hungarian matching (like DETR) to assign predicted objects to ground truth, then optimizes:

The full pipeline in one sentence: PE encodes image + text → fusion encoder conditions image on text → DETR decoder queries latch onto candidate objects → mask head produces pixel-level masks → presence head globally gates the scores → threshold at 0.5 to get final instance masks.
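
The assignment step used in training can be illustrated on a tiny example. The sketch below finds the optimal one-to-one assignment by brute force over permutations rather than with the Hungarian algorithm itself (both reach the same optimum, but brute force is only practical for a handful of boxes). It scores pairs by box IoU alone, which is a simplifying assumption: DETR-style matching costs normally also include a classification term.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(pred_boxes, gt_boxes):
    """Return (pred_idx, gt_idx) pairs maximizing total IoU.

    Assumes there are at least as many predictions as ground-truth boxes.
    """
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(box_iou(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if total > best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs


preds = [(0, 0, 10, 10), (50, 50, 60, 60), (20, 20, 30, 30)]
gts = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(sorted(match(preds, gts)))  # prints [(0, 1), (2, 0)]
```

Prediction 1 is left unmatched, which in a set-prediction loss means it is trained toward "no object".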
What is the role of the fusion encoder in SAM 3’s detector?

Chapter 5: Tracker & Video Architecture

The detector finds objects frame by frame. But in video, you need continuity. The same dog must keep the same ID across 500 frames, even when it runs behind a tree and reappears. This is the tracker’s job.

Inherited from SAM 2

SAM 3’s tracker is essentially SAM 2’s video segmentation pipeline. It uses the same PE backbone (frozen during tracker training), plus:

Frame-by-frame propagation

On each frame t, the tracker performs single-frame propagation:

$\hat{M}_t = \mathrm{propagate}(M_{t-1})$

The tracker cross-attends current frame features to memory bank features. The memory encoder uses self-attention across the current frame’s visual features and cross-attention from visual features to the spatial memory. For each tracked object, it predicts 3 output masks at different confidence levels and selects the most confident one — handling ambiguity (e.g., when an object is partially occluded).

The critical matching step

Now we have two sets of masks on frame t: tracked masks $\hat{M}_t$ from the tracker and new detections $O_t$ from the detector. We need to reconcile them:

$M_t = \mathrm{match\_and\_update}(\hat{M}_t, O_t)$

The matching uses IoU-based association:

  1. Compute pairwise IoU between tracked masks and detected masks
  2. If IoU > threshold, the tracked mask and detection are the same object
  3. Matched objects update their masks and memories
  4. Unmatched detections spawn new masklets (new objects that just appeared)
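
The four steps above can be sketched with masks represented as sets of pixel coordinates. This is a greedy illustration under stated assumptions, not SAM 3's implementation: the 0.5 IoU threshold is a placeholder (the article only says "threshold"), a matched object simply takes the detector's mask, and memory updates are omitted.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_and_update(tracked, detections, iou_threshold=0.5):
    """tracked: dict of object id -> mask; detections: list of masks.

    Returns a dict of object id -> mask for the current frame.
    """
    updated = dict(tracked)
    next_id = max(tracked, default=-1) + 1
    claimed = set()  # tracked ids already matched to a detection
    for det in detections:
        best_id, best_iou = None, iou_threshold
        for obj_id, mask in tracked.items():
            if obj_id in claimed:
                continue
            iou = mask_iou(mask, det)
            if iou > best_iou:
                best_id, best_iou = obj_id, iou
        if best_id is None:
            updated[next_id] = det  # unmatched detection spawns a new masklet
            next_id += 1
        else:
            updated[best_id] = det  # assumption: detection replaces the tracked mask
            claimed.add(best_id)
    return updated


def rect(r0, c0, r1, c1):
    """Helper: a filled rectangle as a set of pixels (end-exclusive)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


tracked = {0: rect(0, 0, 4, 4), 1: rect(10, 10, 14, 14)}
detections = [rect(0, 1, 4, 5), rect(20, 20, 24, 24)]
result = match_and_update(tracked, detections)
print(sorted(result))  # prints [0, 1, 2]
```

The first detection overlaps object 0 with IoU 0.6 and updates it; the second overlaps nothing and becomes new object 2; object 1 keeps its tracked mask.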

Temporal disambiguation

Crowded scenes create ambiguities. SAM 3 uses two strategies:

1. Masklet detection score: Measures how consistently a tracked object is matched to a detection within a temporal window. If the score drops below a threshold (the object hasn’t been detected for many frames), the masklet is suppressed — it was likely a false positive.

2. Periodic re-prompting: Periodically, high-confidence detections replace the tracker’s predictions in the memory bank. This prevents drift — if the tracker slowly warps a mask over time, the detector resets it to a fresh, accurate detection.
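
One way to picture the first strategy is a sliding window over per-frame match outcomes. The sketch below assumes the score is the fraction of recent frames in which the masklet was matched to a detection; the window length, the threshold, and that exact definition are illustrative choices, not values taken from SAM 3.

```python
from collections import deque


class MaskletDetectionScore:
    """Tracks how consistently a masklet is matched within a temporal window."""

    def __init__(self, window=10, min_score=0.3):
        self.history = deque(maxlen=window)  # most recent match outcomes
        self.min_score = min_score

    def update(self, matched):
        """Record whether the masklet matched a detection on this frame."""
        self.history.append(bool(matched))
        return self.score

    @property
    def score(self):
        return sum(self.history) / len(self.history) if self.history else 0.0

    def suppressed(self):
        """Suppress only once a full window of evidence is available."""
        return len(self.history) == self.history.maxlen and self.score < self.min_score


mds = MaskletDetectionScore(window=4, min_score=0.5)
for matched in [True, False, False, False]:
    mds.update(matched)
print(mds.score, mds.suppressed())  # prints 0.25 True
```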

Why separate detector + tracker beats end-to-end tracking: In crowded scenes, end-to-end trackers (like TrackFormer) suffer from the detection-tracking conflict — detecting new objects requires semantic focus while tracking existing ones requires identity focus. SAM 3’s decoupled design lets each component excel at its own objective, then reconciles them via simple IoU matching.

Only confident frames enter memory

During inference, the memory bank only stores frames where the object is confidently present. Occluded frames (where the tracker is uncertain) are excluded. This prevents corrupted memories from degrading future predictions. When the object reappears, the detector picks it up as a fresh detection and re-links it to the existing tracklet via IoU matching.

What happens when the tracker loses an object due to occlusion in SAM 3?

Chapter 6: Image Exemplars & Interactivity

Text prompts are powerful but not always sufficient. Sometimes you cannot easily name what you want — a specific type of wildflower, a particular style of bracket, a rarely-described industrial component. SAM 3 lets you show the model what you mean using image exemplars.

How exemplars work

An image exemplar is a bounding box drawn on an example object, labeled as positive (this is what I want) or negative (this is NOT what I want). The model then finds all other instances that match the positive examples while avoiding the negative ones.

Technically, each exemplar is encoded by the exemplar encoder using three components:

  1. Position embedding: where the box is in the image
  2. Label embedding: positive or negative
  3. ROI-pooled features: visual features cropped from the bounding box region

These are concatenated and processed by a small transformer, then concatenated with text tokens to form the complete prompt tokens fed to the fusion encoder.

Exemplar vs. PVS prompts

This is a subtle but critical distinction. In SAM 1/2, pointing at a dog gives you that one dog. In SAM 3, giving an image exemplar of a dog gives you all dogs. The exemplar defines the concept, and the model generalizes from it.

Prompt Type | Input | Output
PVS click (SAM 2) | Click on dog #1 | Mask for dog #1 only
PCS exemplar (SAM 3) | Box around dog #1 | Masks for ALL dogs
PCS text (SAM 3) | “dog” | Masks for ALL dogs
PCS text + exemplar | “dog” + box on dog #1 | Masks for ALL dogs, disambiguated

Interactive refinement loop

The real power emerges in the interactive workflow:

  1. User prompts: “fish”
  2. SAM 3 finds 8 fish, but misses 2 partially occluded ones
  3. User draws a positive box on a missed fish → SAM 3 now finds 10
  4. But it also wrongly includes a fish-shaped rock → user draws a negative box on it
  5. SAM 3 refines: 9 correct fish masks. User can now switch to PVS clicks to refine individual mask boundaries.

Quantitative impact: On SA-Co/Gold, starting with text-only gives cgF1 = 54.1. Adding 3 interactive exemplar clicks boosts it by 21.6 cgF1 points. Performance plateaus after ~4 clicks, after which switching to PVS-style per-mask refinement yields further gains. The hybrid approach (PCS, then PVS) outperforms either alone.

The 1-exemplar experiment

Even a single exemplar dramatically improves results over text-only. On COCO, text-only achieves 56.4 AP. Adding 1 image exemplar jumps to 76.8 AP+. Adding both text + exemplar reaches 78.1 AP+. SAM 3 outperforms the prior state-of-the-art T-Rex2 by +18.3 AP on COCO with 1 exemplar.

How does an image exemplar prompt differ from a PVS click prompt?

Chapter 7: The Data Engine

Architecture is only half the story. The reason SAM 3 doubles the accuracy of prior systems is its data engine — a human-AI annotation pipeline that produced 4M unique noun phrases and 52M masks at quality levels impossible with human annotation alone.

The chicken-and-egg problem

To train a great concept segmentor, you need concept-level mask annotations. But manually annotating “every instance of ‘traffic cone’ in this image” for millions of images is prohibitively expensive. The data engine solves this by iterating between model training and data collection — each round of SAM 3 proposes masks, humans verify and correct them, and the improved data trains a better SAM 3.

The four phases

Phase 1: Human verification (bootstrap). Start with simple captioners to propose NPs. Use SAM 2 + off-the-shelf detector to propose masks. Humans verify everything. Produces 4.3M image-NP pairs. Train initial SAM 3.

Phase 2: Human + AI verification. The breakthrough. Fine-tune Llama 3.2 on human accept/reject labels to create AI verifiers that automatically judge mask quality and exhaustivity. This doubles throughput compared to human-only annotation. Additionally, an upgraded NP proposal pipeline generates hard negatives adversarial to the current SAM 3. Re-train SAM 3 six times. Produces 122M image-NP pairs.

Phase 3: Scaling and domain expansion. Expand to 15 diverse visual domains. Mine long-tail concepts from a 22.4M node ontology based on Wikidata. Re-train SAM 3 seven times and AI verifiers three times. Produces 19.5M additional image-NP pairs.

Phase 4: Video annotation. Extend to video. Use mature image SAM 3 to annotate video frames. Focus human effort on crowded scenes and tracking failures. Produces 52.5K videos with 467K masklets.

The two verification steps

Every mask goes through two quality gates:

Step | Question | Failure mode caught
Mask Verification (MV) | Is this mask correct for this NP? | Bad mask quality, wrong object
Exhaustivity Verification (EV) | Are ALL instances of this NP masked? | Missed instances

Only image-NP pairs that fail exhaustivity verification go to expensive human correction. Pairs that pass both checks are used directly as training data.

Hard negatives are crucial. Adding hard negatives to training improves IL_MCC from 0.44 to 0.68 — a massive jump. Without hard negatives, the model has never seen concepts that are NOT in the image, so it learns to always say “yes.” Hard negatives teach it to say “no” confidently. The NP proposal pipeline generates adversarial NPs by finding concepts that are visually or semantically similar to what IS in the image.

Scaling behavior

Both high-quality (HQ) and synthetic (SYN) data show clean power-law scaling. On SA-Co/Gold, going from EXT-only (existing external datasets) to EXT+SYN improves cgF1 by +8.8. Adding HQ data on top improves it by another +14.6. The total training set comprises 52M human-verified masks, 1.4B synthetic masks, and 4M unique NPs, orders of magnitude more concept diversity than any prior dataset.

What role do AI verifiers play in SAM 3’s data engine?

Chapter 8: The SA-Co Benchmark

You cannot measure progress on a task without a good benchmark. Prior open-vocabulary benchmarks had at most a few thousand categories. SAM 3 introduces SA-Co — Segment Anything with Concepts — with 207K unique phrases across 120K images and 1.7K videos.

Benchmark splits

Split | Domains | Annotation | Purpose
SA-Co/Gold | 7 | 3 annotators per pair | Measure human ceiling, handle ambiguity
SA-Co/Silver | 10 | 1 annotator per pair | Larger-scale evaluation
SA-Co/Bronze | 9 | Existing + SAM 2 masks | Cross-benchmark evaluation
SA-Co/Bio | Bio | Existing annotations | Specialized domain evaluation
SA-Co/VEval | 3 | Video NP pairs | Video concept segmentation

Why existing benchmarks are insufficient

COCO has 80 categories. LVIS has 1,203. These are fixed vocabularies — models memorize the category list. SA-Co has 207K unique phrases from open vocabulary, including fine-grained concepts (“Boston terrier” vs “French bulldog”), rare objects, and domain-specific terms. It also includes hard negative NPs to test calibration.

The metrics: cgF1

SAM 3 introduces a metric designed for practical use. Standard detection AP does not account for calibration — a model might have high AP but require custom thresholds per category to be usable. SAM 3’s metrics enforce a fixed threshold of 0.5:

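$\text{cgF1} = 100 \times \text{pmF1} \times \text{IL\_MCC}$

Where:

pmF1 scores localization quality: how well the predicted masks match the ground-truth instances.

IL_MCC is the image-level Matthews Correlation Coefficient: how well the model decides whether the concept is present in the image at all.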

Both components must be good for cgF1 to be high. A model that always predicts “present” gets low IL_MCC. A model that produces bad masks gets low pmF1.

Why MCC instead of accuracy? Matthews Correlation Coefficient handles class imbalance properly. If 90% of concept queries are negative (the concept isn’t in the image), a model that always says “no” gets 90% accuracy but MCC = 0. MCC ranges from −1 to 1, where 1 = perfect, 0 = random, −1 = perfectly wrong. It requires the model to do well on both positives and negatives.
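
The accuracy-versus-MCC contrast and the cgF1 product are easy to verify numerically. In the sketch below, the confusion-matrix counts are made up to match the 90%-negative scenario, MCC is defined as 0 when its denominator vanishes (a common convention), and pmF1 is assumed to enter the formula as a fraction in [0, 1], which is what the factor of 100 suggests.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 by convention if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def cg_f1(pm_f1, il_mcc):
    """cgF1 = 100 * pmF1 * IL_MCC, with pmF1 given as a fraction in [0, 1]."""
    return 100.0 * pm_f1 * il_mcc


# 100 queries, 90 of them negative; the model always answers "not present".
always_no = dict(tp=0, tn=90, fp=0, fn=10)
print(accuracy(**always_no), mcc(**always_no))  # prints 0.9 0.0

# A perfect classifier on the same data.
print(mcc(tp=10, tn=90, fp=0, fn=0))  # prints 1.0

# Good masks cannot rescue poor presence calibration: the product collapses.
print(cg_f1(0.65, 0.0))  # prints 0.0
```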
Why does SAM 3 use cgF1 instead of standard AP (Average Precision)?

Chapter 9: Results

SAM 3 does not just improve — it redefines the state of the art across nearly every benchmark it touches.

Image PCS with text

Model | LVIS AP (mask) | SA-Co/Gold cgF1 | COCO AP (box)
OWLv2* | 29.3 | 24.6 | 45.5
GroundingDINO-T | 14.7 | 3.3 | 20.5
LLMDet-L | 35.1 | 6.5 | 42.0
APE-D* | – | 16.4 | 59.6
DINO-X | – | 21.3 | 52.4
Gemini 2.5 Flash | 13.4 | 13.0 | –
SAM 3 | 48.5 | 54.1 | 53.6
Human | – | 72.8 | –

On LVIS, SAM 3 achieves 48.5 AP — a +10 point jump over the best prior system (DINO-X at 38.5). On SA-Co/Gold, SAM 3 more than doubles the best baseline’s cgF1 (54.1 vs 24.6 for OWLv2*). SAM 3 reaches 74% of estimated human performance.
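
For reference, the ratios behind those three claims, computed from the numbers quoted in the paragraph above:

```latex
\[
\frac{54.1}{24.6} \approx 2.20, \qquad
\frac{54.1}{72.8} \approx 0.743, \qquad
48.5 - 38.5 = 10.0
\]
```

So "more than doubles" corresponds to a factor of about 2.2 over OWLv2*, and "74% of human performance" is the ratio of SAM 3's cgF1 to the human cgF1 on SA-Co/Gold.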

Video PCS with text

On SAM 3’s own SA-Co/VEval benchmark, it massively outperforms all baselines:

Model | SA-V cgF1 | SA-V pHOTA | LVVIS mAP | BURST HOTA
GLEE (one NP) | 0.1 | 11.8 | 20.8 | 28.4
LLMDet + SAM 3 Tracker | 2.3 | 30.1 | 15.2 | 33.3
SAM 3 Detector + T-by-D | 25.7 | 55.7 | 35.9 | 39.7
SAM 3 | 30.3 | 58.0 | 36.3 | 44.5
Human | 53.1 | 70.5 | – | –

Visual prompting (PVS) — SAM 3 also beats SAM 2

SAM 3 doesn’t just add a new capability; it also improves the original task. On the challenging MOSEv2 video object segmentation benchmark, SAM 3 outperforms SAM 2.1 by +12.4 J&F (60.3 vs 47.9). On DAVIS17 it reaches 92.2 vs 90.7. On interactive image segmentation (SA-37 benchmark), SAM 3 outperforms SAM 2.1 on average mIoU at 1, 3, and 5 clicks.

SAM 3 Agent — complex queries via MLLM

SAM 3 handles simple NPs natively. For complex queries like “the object the person in red is holding,” SAM 3 pairs with an MLLM. The MLLM decomposes the query into simple NPs, prompts SAM 3 iteratively, and analyzes returned masks. On ReasonSeg, SAM 3 Agent with Gemini 2.5 Pro achieves 77.0 gIoU — crushing prior zero-shot approaches (best was 65.0).

Object counting

Because SAM 3 finds all instances, it can count objects by counting masks. On CountBench, SAM 3 achieves 93.8% accuracy with MAE of 0.12 — beating Gemini 2.5 Pro (92.4%, MAE 0.24). And unlike MLLMs that just output a number, SAM 3 provides the actual segmentation masks for each counted object.

Inference speed

On an H200 GPU, SAM 3 runs in 30 ms per image with 100+ detected objects. In video, latency scales with number of tracked objects, sustaining near real-time for ~5 concurrent objects.

The headline number: SAM 3 doubles the accuracy of the best existing system on open-vocabulary concept segmentation (54.1 vs 24.6 cgF1 on SA-Co/Gold). And it does this while ALSO improving SAM 2’s visual prompting capabilities across the board.
By how much does SAM 3 improve over the best prior system on LVIS mask AP?

Chapter 10: Connections

SAM lineage

SAM 1 (2023): Introduced promptable image segmentation with points/boxes/masks. Could segment one object per prompt. Trained on SA-1B (1B masks from 11M images). No text understanding, no video.

SAM 2 (2024): Extended to video via streaming memory architecture. Still one object per prompt. Introduced SA-V dataset for video. Real-time interactive video segmentation.

SAM 3 (2025): Extends to concept-level segmentation via text/exemplar prompts. Finds ALL instances. DETR-based detector + SAM 2 tracker sharing PE backbone. 4M NPs, 52M masks. Doubles prior accuracy.

DETR lineage

SAM 3’s detector follows the DETR paradigm: learned object queries, Hungarian matching, set prediction loss. Key innovations it borrows: deformable attention (Deformable DETR), box-region positional bias, dual supervision from DAC-DETR, align loss, and MaskFormer-style mask heads.

Vision-language alignment

The Perception Encoder backbone comes from aligned vision-language pre-training (similar to CLIP but with richer features). This is what enables SAM 3 to match text descriptions to visual content in the first place — without it, the model has no way to interpret “yellow school bus.”

What comes next

Limitations the authors acknowledge: SAM 3 struggles with out-of-domain concepts (specialized domains not in training), very long tail concepts, and referring expressions that require reasoning (though MLLM pairing helps). Domain adaptation via synthetic data (no human labels needed) partially addresses the first issue.

Broader trend: SAM 3 is part of the convergence of detection, segmentation, tracking, and language into unified models. Vision Banana (2026) showed that even image generators can be turned into segmentors. The era of single-task vision models is ending; foundation models that handle everything are becoming the norm.

The big picture: SAM 1 solved “segment what I point at.” SAM 2 solved “segment what I point at, across video.” SAM 3 solves “segment everything that matches what I describe.” Each step dramatically expands the space of tasks a single model can handle, reducing the need for specialized pipelines.
What is the key architectural difference between SAM 3 and SAM 2?