SAM 3 — Veanors

Chapter 0: The Problem

You are editing a video of a city street. You need to blur every single license plate. There are 47 cars, some barely visible, some reflected in glass, some at weird angles. With SAM 2, you would have to find every plate yourself and click on each one individually, 47 times; its memory can then follow each plate through the video, but only the plates you pointed at. Miss one and it stays unblurred.

What you actually want is to say “license plate” and have the model find, segment, and track every single instance automatically.

This is the gap that SAM 3 fills. SAM 1 introduced promptable segmentation with clicks. SAM 2 extended it to video with memory. But both systems share a fundamental limitation: they can only segment one object per prompt. You point at something, you get that one thing. You cannot express a concept.

What “concept prompting” means

A concept is a category of visual things describable by a short noun phrase: “yellow school bus”, “striped cat”, “traffic cone”. Concept prompting means you describe what you want and the model finds ALL instances — not just one. And it tracks every instance across every frame in video, assigning unique IDs so you know which cat is which.

Existing open-vocabulary detectors like OWLv2 or GroundingDINO can do something like this on images, but they have serious problems:

Poor calibration and coarse outputs. Their confidence scores often need per-category thresholds, they produce boxes rather than masks, and they struggle with unusual or fine-grained concepts.
No video support. No tracking, no temporal identity, no frame-to-frame consistency.
No interactivity. You cannot point at a missed object and say “I also mean this one.”
Limited vocabulary. Many are trained on COCO’s 80 or LVIS’s 1,203 categories. SAM 3 handles 4M unique noun phrases.

The core challenge: Build a single unified model that takes a text description or image example of a concept, then detects, segments, and tracks every instance of that concept across images and video — with pixel-perfect masks, unique identity tracking, and the ability to interactively refine results when the model makes mistakes.

What is the key limitation of SAM 1 and SAM 2 that SAM 3 addresses?

They can only process low-resolution images
They segment one object per prompt: you cannot describe a concept and get all matching instances
They do not support video at all

Chapter 1: Promptable Concept Segmentation (PCS)

Before building the model, SAM 3 formalizes exactly what problem it solves. This formalization is called Promptable Concept Segmentation (PCS), and it is distinct from the Promptable Visual Segmentation (PVS) task that SAM 1 and SAM 2 solved.

PVS vs PCS

PVS (SAM 1/2): You give a spatial prompt — a click, a box, or a mask — on a specific object. The model segments that one object. If you want 10 objects, you give 10 prompts.

PCS (SAM 3): You give a concept prompt — a short noun phrase like “red apple”, an image exemplar (a bounding box around an example object), or both. The model finds and segments every instance of that concept in the entire image or video, with unique IDs for tracking.

What counts as a “concept”?

SAM 3 restricts concepts to simple noun phrases (NPs): a noun with optional modifiers. Examples:

“cat” — bare noun
“striped cat” — adjective + noun
“yellow school bus” — multiple modifiers
“person wearing a red hat” — this is not a simple NP. Too complex. SAM 3 handles these by pairing with an MLLM.

This restriction is intentional. Simple NPs are unambiguous enough to annotate at scale, yet expressive enough to cover the vast majority of real-world segmentation needs. SAM 3’s training set has 4 million unique noun phrases.

The three prompt types

Prompt Type	Input	When to use
Text NP	“traffic cone”	You know the name of the concept
Image exemplar	Bounding box on an example object (+ or −)	Easier to show than describe (“that kind of flower”)
Text + exemplar	Both combined	Text is ambiguous; exemplar disambiguates

Crucially, PCS also supports interactive refinement. After the initial detection, you can add positive exemplars (to catch objects the model missed) or negative exemplars (to suppress false positives). And you can refine individual masks with clicks, just like SAM 2.

Handling ambiguity

Open vocabulary introduces inherent ambiguity. “Mouse”: the device or the animal? “Mirror”: does it include the frame? SAM 3 addresses this in three ways: (1) collect test annotations from three independent experts, (2) evaluate with an oracle metric that picks the best-matching ground truth, and (3) include an ambiguity module in the model that predicts multiple valid interpretations.

Key insight: PCS generalizes PVS. Any PVS task (segment this one object) is a special case of PCS (segment all objects of this concept, but there happens to be just one). SAM 3 supports both tasks in a single model — PCS with the new detector, PVS with the inherited SAM 2 click-based interface.

Why does SAM 3 restrict concept prompts to simple noun phrases rather than allowing arbitrary language?

Simple NPs are unambiguous enough to annotate at scale and expressive enough for most real-world segmentation needs; complex queries are handled by pairing SAM 3 with an MLLM
The text encoder cannot handle long sentences
Arbitrary language would make the model too slow

Chapter 2: The Architecture

SAM 3 is built from two major components that share a single vision backbone: a detector for image-level concept segmentation, and a tracker for propagating detections through video. Let’s trace the full data flow.

The Shared Backbone: Perception Encoder (PE)

Both detector and tracker share a single pre-trained vision-language backbone called Perception Encoder (PE). PE produces aligned image and text embeddings. This is the foundation that gives SAM 3 its open-vocabulary understanding.

For any input image I ∈ R^{H×W×3}, PE outputs:

F_img = PE_vision(I) ∈ R^{(H/14)×(W/14)×C}

For a text prompt “yellow school bus”, PE outputs:

F_text = PE_text(“yellow school bus”) ∈ R^{L×C}

where L is the number of text tokens and C is the embedding dimension. The key property: image and text embeddings live in the same vector space, so the model can match visual patterns to language descriptions.

Two-path architecture

The detector and tracker have deliberately decoupled designs. Why? Because detection and tracking have conflicting objectives:

Detection must be identity-agnostic. When you ask “find all cats,” the detector doesn’t care which cat is which. It just needs to find every one.
Tracking must separate identities. Once detected, the tracker must maintain unique IDs across frames — cat #1 stays cat #1, even during occlusion.

Forcing a single model to do both simultaneously creates task conflict — a well-documented problem where one objective degrades the other. SAM 3 avoids this by running detection and tracking as separate heads on the same backbone.

High-level data flow

For a single video frame at time t:

Encode: PE processes the frame to produce image features F_img
Detect: The detector takes F_img + prompt tokens and produces new object masks O_t
Track: The tracker propagates previous masklets M_{t-1} to the current frame, producing tracked masks M̂_t
Match: Associate tracked masks M̂_t with new detections O_t via IoU matching
Update: Final masks M_t combine tracked and newly detected objects

Think of it this way: The detector is like a spotter — it scans each frame independently and calls out every object matching the concept. The tracker is like a bookkeeper — it keeps a running list of known objects and their locations, updating them frame by frame. The matching step reconciles these two signals: “the spotter found 5 cats, the bookkeeper was tracking 4 — there must be 1 new cat.”

Why does SAM 3 decouple detection and tracking into separate heads rather than using one end-to-end model?

Detection is identity-agnostic (find ALL cats) while tracking must separate identities (THIS cat vs THAT cat); these conflicting objectives degrade each other in a single head
Separate heads are faster at inference
The vision backbone cannot handle video natively

Chapter 3: The Presence Head

This is SAM 3’s most elegant architectural innovation. It solves a fundamental tension in object detection: each proposal query must simultaneously answer two different questions:

“What?” — Is this concept even present in the image? (Recognition)
“Where?” — Where exactly is each instance? (Localization)

In standard DETR, each of the (say 300) learned object queries must answer both questions. Each query must look at the entire image for global context (“is there a cat anywhere?”) while also focusing on a specific local region (“is there a cat at position (x, y)?”). These objectives actively conflict.

The decoupling trick

SAM 3 introduces a single learned presence token. This token is solely responsible for answering the “what?” question:

p_presence = σ(MLP(presence_token)) = P(“NP is present in image”)

Each regular proposal query q_i then only needs to answer the “where?” question, conditioned on the concept being present:

p_i = P(q_i is a match | NP is present in image)

The final score for each proposal is the product:

score_i = p_presence × p_i

Why does this matter? A worked example.

Imagine you prompt with “giraffe” on an image of a kitchen. Without the presence head:

Each of 300 queries independently evaluates “is there a giraffe at my position?”
Even low-confidence queries (0.05, 0.03) might pass the detection threshold after NMS
You get false positive detections on vaguely giraffe-shaped objects (a tall lamp, a patterned curtain)

With the presence head:

The presence token scans the entire image and predicts p_presence = 0.01 (“almost certainly no giraffe”)
All proposal scores are multiplied by 0.01
The tall lamp at p_i = 0.05 becomes score = 0.05 × 0.01 = 0.0005 — completely suppressed
Zero false positives
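
The gating arithmetic in this example can be sketched in a few lines of Python. This is a minimal illustration, not SAM 3's implementation: the helper name `gate_scores`, the list-based inputs, and the 0.5 default threshold are assumptions made for the sketch, and the probabilities are the ones from the giraffe example.

```python
def gate_scores(p_presence, query_probs, threshold=0.5):
    """Multiply every per-query probability by the global presence probability.

    Returns the gated scores and the indices of queries that clear the threshold.
    (Hypothetical helper for illustration; the 0.5 default is an assumption.)
    """
    scores = [p_presence * p for p in query_probs]
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return scores, kept


# "giraffe" prompted on a kitchen image: presence is near zero, so even the
# strongest local match (the tall lamp at 0.05) is scaled down by 100x.
scores, kept = gate_scores(p_presence=0.01, query_probs=[0.05, 0.03])
print(scores)  # both values are on the order of 1e-4
print(kept)    # []
```

Because the gate is a single multiplication, one global decision scales every proposal at once; no individual query has to learn to reject the concept on its own.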

The ablation proves it. Adding the presence head boosts cgF1 by +1.5 on SA-Co/Gold, with the image-level classification metric IL_MCC improving from 0.77 to 0.82. That +0.05 in MCC means dramatically fewer false positives when prompting with concepts that are not in the image — which is exactly the hard negative scenario that matters most in real applications.

Training with hard negatives

The presence head shines because SAM 3 is trained with hard negatives — noun phrases that are plausible but NOT present in the image. For example, an image of a tabby cat might be prompted with “leopard”, “tiger”, or “bobcat”. Without the presence head, hard negatives make training unstable because queries cannot easily learn to globally reject a concept. With the presence head, the global token learns to reject and the local queries can focus on localization.

Config	cgF1	IL_MCC	pmF1
No presence head	50.7	0.77	65.4
With presence head	52.2	0.82	63.4

Notice that pmF1 (localization quality) actually drops slightly. This is expected — the presence head is about recognition, not localization. The cgF1 improvement comes from the dramatic IL_MCC improvement which means far fewer false positive detections.

How does the presence head compute the final score for each proposal query?

It averages the presence score with each query score
It multiplies the global presence probability by each query’s local match probability: score = p_presence × p_i
It uses the maximum of the presence score and query score

Chapter 4: The Detector Pipeline

Let’s trace a concrete example through the entire detector. We have an image of a park with 3 dogs of different breeds, and the text prompt “dog.”

Step 1: Encode image and text

The Perception Encoder processes both inputs:

F_img = PE_vision(image) ∈ R^{64×64×1280}

F_text = PE_text(“dog”) ∈ R^{3×1280}

The image is 896×896 at patch size 14, giving 64×64 = 4,096 spatial tokens. The text “dog” tokenizes to 3 tokens including special tokens.
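
The token-count arithmetic can be checked with a short Python sketch. The helper `patch_grid` is a hypothetical name introduced here; it only assumes a ViT-style backbone that splits the image into non-overlapping square patches.

```python
def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int, int]:
    """Return (rows, cols, total tokens) for a non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


print(patch_grid(896, 896))  # (64, 64, 4096)
```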

Step 2: Fusion encoder

The unconditioned image features must be conditioned on the concept prompt. The fusion encoder does this with cross-attention:

F_fused = FusionEncoder(F_img, F_text) ∈ R^{4096×1280}

Each image token cross-attends to the text tokens. After fusion, every image token “knows” what concept we are looking for. The fusion encoder uses multiple layers of self-attention (among image tokens) and cross-attention (image → text). This is where the model learns to highlight regions that match “dog” and suppress everything else.
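
To make the image → text cross-attention concrete, here is a toy single-head version in plain Python. It is a sketch under strong simplifying assumptions: there are no learned query/key/value projections, no layer norm, no self-attention among image tokens, and only one layer. The function names and the toy vectors are illustrative and not taken from SAM 3.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(image_tokens, text_tokens):
    """Single-head image -> text cross-attention with a residual connection.

    image_tokens: list of N vectors (length C), used as queries.
    text_tokens:  list of L vectors (length C), used as both keys and values.
    """
    dim = len(text_tokens[0])
    fused = []
    for q in image_tokens:
        # Scaled dot-product similarity between this image token and each text token.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in text_tokens]
        weights = softmax(logits)
        # Weighted average of the text tokens, then add it back onto the image token.
        context = [sum(w * k[d] for w, k in zip(weights, text_tokens)) for d in range(dim)]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # N = 3 toy image tokens
text_tokens = [[2.0, 0.0], [0.0, 0.1]]               # L = 2 toy text tokens
fused = cross_attend(image_tokens, text_tokens)
print(len(fused), len(fused[0]))  # 3 2
```

The output keeps the shape of the image tokens (N×C), which matches F_fused having one row per spatial token.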

Step 3: DETR decoder

A set of learned object queries (say 300) cross-attend to the fused image features. Each query tries to latch onto one potential object:

{q₁, ..., q₃₀₀} = DETRDecoder(queries, F_fused)

SAM 3 uses deformable attention with box-region-positional bias to help each query focus on its local region. Each decoder layer predicts:

A binary classification logit: “does this query match the prompted concept?”
A bounding box delta from the previous layer’s prediction

Step 4: Mask head

Queries that survive thresholding get masks via a MaskFormer-style mask head. Each query’s embedding is dot-producted with the per-pixel features to produce a binary mask:

mask_i = σ(F_pixel ⋅ q_i^T) ∈ R^{H×W}
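
The dot-product mask formula can be written out directly. The sketch below uses nested Python lists in place of tensors; `predict_mask` and the 0.5 binarization threshold are assumptions for illustration, and the 2×2 feature grid is made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_mask(pixel_features, query, threshold=0.5):
    """pixel_features: H x W grid of C-dim vectors; query: one C-dim embedding.

    Returns (probabilities, binary mask), both as H x W nested lists.
    """
    probs = [[sigmoid(sum(f * q for f, q in zip(px, query))) for px in row]
             for row in pixel_features]
    binary = [[p > threshold for p in row] for row in probs]
    return probs, binary


# Toy 2x2 feature map with C = 2; the query points along the first channel.
features = [[[2.0, 0.0], [-2.0, 0.0]],
            [[1.0, 5.0], [-1.0, 5.0]]]
probs, mask = predict_mask(features, query=[3.0, 0.0])
print(mask)  # [[True, False], [True, False]]
```

Pixels whose features align with the query embedding get a positive logit and land inside the mask; the rest fall outside.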

There is also a semantic segmentation head that predicts a single binary mask for ALL pixels matching the concept, regardless of instance boundaries. This provides a complementary training signal.

Step 5: Scoring with presence head

The presence token predicts p_presence = 0.97 (yes, there are dogs). Each query’s score is multiplied by this. Queries landing on the 3 dogs get final scores of ~0.92, ~0.88, ~0.85. Queries on non-dog objects get scores below 0.01. We threshold at 0.5 and output 3 instance masks with 3 bounding boxes.

Training losses

The detector uses Hungarian matching (like DETR) to assign predicted objects to ground truth, then optimizes:

Classification loss: focal loss for the binary “match or not” label
Box loss: L1 + GIoU between predicted and ground truth boxes
Mask loss: BCE + dice loss on the predicted masks
Presence loss: binary cross-entropy on the presence token’s prediction
Semantic loss: per-pixel BCE for the semantic segmentation head
Dual supervision (DAC-DETR): an auxiliary decoder trained with one-to-many label assignment, alongside the main one-to-one matching
Align loss: ties each query’s classification score to its localization quality (IoU with the matched ground truth)
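
As one concrete instance from this list, the box term (L1 + GIoU) can be sketched as follows. The corner format (x1, y1, x2, y2), the unit loss weights, and the function names are assumptions for the example; DETR-style models typically apply the L1 term to normalized box coordinates, and the weights SAM 3 uses are not stated here.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(a, b):
    """Generalized IoU for two (x1, y1, x2, y2) boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    # Smallest axis-aligned box that encloses both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Weighted sum of the L1 distance and (1 - GIoU); weights are illustrative."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))


print(box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))  # 0.0
```

Unlike plain IoU, GIoU stays informative for boxes that do not overlap: it goes negative as the boxes move apart, so the loss still changes with the distance between them.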

The full pipeline in one sentence: PE encodes image + text → fusion encoder conditions image on text → DETR decoder queries latch onto candidate objects → mask head produces pixel-level masks → presence head globally gates the scores → threshold at 0.5 to get final instance masks.

What is the role of the fusion encoder in SAM 3’s detector?

It cross-attends image tokens to text tokens, conditioning every image feature on the concept prompt so the model knows what to look for
It fuses features from different resolution levels like an FPN
It compresses the image to a smaller spatial resolution

Chapter 5: Tracker & Video Architecture

The detector finds objects frame by frame. But in video, you need continuity. The same dog must keep the same ID across 500 frames, even when it runs behind a tree and reappears. This is the tracker’s job.

Inherited from SAM 2

SAM 3’s tracker is essentially SAM 2’s video segmentation pipeline. It uses the same PE backbone (frozen during tracker training), plus:

Prompt encoder: encodes spatial prompts (clicks, boxes, masks)
Mask decoder: a two-way transformer that produces mask predictions
Memory encoder: encodes each processed frame’s features and predicted mask into object memories
Memory bank: maintains a sliding window of memory features

Frame-by-frame propagation

On each frame t, the tracker performs single-frame propagation:

M̂_t = propagate(M_{t-1})

The tracker cross-attends current frame features to memory bank features: a memory attention module applies self-attention across the current frame’s visual features and cross-attention from those features to the spatial memories in the bank. For each tracked object, the mask decoder predicts 3 candidate masks, each with a confidence score, and selects the most confident one. This handles ambiguity (e.g., when an object is partially occluded).

The critical matching step

Now we have two sets of masks on frame t: tracked masks M̂_t from the tracker and new detections O_t from the detector. We need to reconcile them:

M_t = match_and_update(M̂_t, O_t)

The matching uses IoU-based association:

Compute pairwise IoU between tracked masks and detected masks
If IoU > threshold, the tracked mask and detection are the same object
Matched objects update their masks and memories
Unmatched detections spawn new masklets (new objects that just appeared)
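
The four steps above can be sketched as a small matching routine. This is an illustrative version, not SAM 3's code: masks are represented as sets of (row, col) pixels, matching is greedy from highest IoU down, the 0.5 threshold is a placeholder, and leaving unmatched tracked masks unchanged is an assumption.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_and_update(tracked, detections, iou_threshold=0.5):
    """tracked: {object_id: mask}; detections: list of masks.

    Returns a new {object_id: mask} dict for the current frame.
    """
    # Score every (tracked, detected) pair, best overlap first.
    pairs = sorted(
        ((mask_iou(m, d), tid, j)
         for tid, m in tracked.items()
         for j, d in enumerate(detections)),
        reverse=True,
    )
    updated = dict(tracked)
    used_ids, used_dets = set(), set()
    for iou, tid, j in pairs:
        if iou <= iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if tid in used_ids or j in used_dets:
            continue
        updated[tid] = detections[j]  # matched: refresh the mask with the detection
        used_ids.add(tid)
        used_dets.add(j)
    # Unmatched detections become new masklets with fresh IDs.
    next_id = max(tracked, default=-1) + 1
    for j, d in enumerate(detections):
        if j not in used_dets:
            updated[next_id] = d
            next_id += 1
    return updated


tracked = {0: {(0, 0), (0, 1)}, 1: {(5, 5)}}
detections = [{(0, 0), (0, 1), (0, 2)}, {(9, 9)}]
print(sorted(match_and_update(tracked, detections)))  # [0, 1, 2]
```

In the usage example, object 0 is matched (IoU 2/3) and takes the detector's mask, object 1 has no detection and keeps its propagated mask, and the detection at (9, 9) spawns a new object with ID 2.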

Temporal disambiguation

Crowded scenes create ambiguities. SAM 3 uses two strategies:

1. Masklet detection score: Measures how consistently a tracked object is matched to a detection within a temporal window. If the score drops below a threshold (the object hasn’t been detected for many frames), the masklet is suppressed — it was likely a false positive.

2. Periodic re-prompting: Periodically, high-confidence detections replace the tracker’s predictions in the memory bank. This prevents drift — if the tracker slowly warps a mask over time, the detector resets it to a fresh, accurate detection.
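
One simple way to realize the masklet detection score is a sliding-window match rate, sketched below. The definition used here (fraction of recent frames in which the masklet was matched to a detection), the window length, and the suppression threshold are all assumptions made for the sketch; the text above only says the score measures how consistently a masklet is matched within a temporal window.

```python
from collections import deque


class MaskletDetectionScore:
    """Tracks how often a masklet was matched to a detection in recent frames."""

    def __init__(self, window=8, min_score=0.25):
        self.history = deque(maxlen=window)  # one bool per recent frame
        self.min_score = min_score

    def update(self, matched_to_detection):
        self.history.append(bool(matched_to_detection))
        return self.score

    @property
    def score(self):
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)

    @property
    def suppressed(self):
        # Only judge a masklet once a full window has been observed.
        return len(self.history) == self.history.maxlen and self.score < self.min_score


tracker_score = MaskletDetectionScore(window=4, min_score=0.5)
for matched in [True, False, False, False]:
    tracker_score.update(matched)
print(tracker_score.score, tracker_score.suppressed)  # 0.25 True
```

Waiting for a full window before suppressing is a design choice in this sketch: it avoids killing a genuinely new object after a single missed detection.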

Why separate detector + tracker beats end-to-end tracking: In crowded scenes, end-to-end trackers (like TrackFormer) suffer from the detection-tracking conflict — detecting new objects requires semantic focus while tracking existing ones requires identity focus. SAM 3’s decoupled design lets each component excel at its own objective, then reconciles them via simple IoU matching.

Only confident frames enter memory

During inference, the memory bank only stores frames where the object is confidently present. Occluded frames (where the tracker is uncertain) are excluded. This prevents corrupted memories from degrading future predictions. When the object reappears, the detector picks it up as a fresh detection and re-links it to the existing tracklet via IoU matching.
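
The confidence gate on the memory bank can be sketched with a bounded queue. The class name `MemoryBank`, the capacity, and the confidence threshold are hypothetical values chosen for illustration; the source does not give the actual settings.

```python
from collections import deque


class MemoryBank:
    """Sliding window of per-frame memories, gated by tracker confidence."""

    def __init__(self, capacity=7, min_confidence=0.5):
        self.entries = deque(maxlen=capacity)  # oldest entry is evicted first
        self.min_confidence = min_confidence

    def maybe_add(self, frame_index, features, confidence):
        """Store the frame only if the object is confidently present."""
        if confidence < self.min_confidence:
            return False  # e.g. an occluded frame: keep it out of memory
        self.entries.append((frame_index, features))
        return True

    def frame_indices(self):
        return [idx for idx, _ in self.entries]


bank = MemoryBank(capacity=3)
for t, conf in enumerate([0.9, 0.8, 0.2, 0.1, 0.95, 0.9]):
    bank.maybe_add(t, features=None, confidence=conf)
print(bank.frame_indices())  # [1, 4, 5]
```

Frames 2 and 3 (low confidence, as during an occlusion) never enter the bank, so the memories used for later frames come only from frames where the object was clearly visible.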

What happens when the tracker loses an object due to occlusion in SAM 3?

The object is permanently lost and cannot be recovered
The model restarts tracking from scratch on every frame
The detector re-detects the object when it reappears, IoU matching re-links it to the existing tracklet, and periodic re-prompting keeps memory fresh

Chapter 6: Image Exemplars & Interactivity

Text prompts are powerful but not always sufficient. Sometimes you cannot easily name what you want — a specific type of wildflower, a particular style of bracket, a rarely-described industrial component. SAM 3 lets you show the model what you mean using image exemplars.

How exemplars work

An image exemplar is a bounding box drawn on an example object, labeled as positive (this is what I want) or negative (this is NOT what I want). The model then finds all other instances that match the positive examples while avoiding the negative ones.

Technically, each exemplar is encoded by the exemplar encoder using three components:

Position embedding: where the box is in the image
Label embedding: positive or negative
ROI-pooled features: visual features cropped from the bounding box region

These are concatenated and processed by a small transformer, then concatenated with text tokens to form the complete prompt tokens fed to the fusion encoder.

Exemplar vs. PVS prompts

This is a subtle but critical distinction. In SAM 1/2, pointing at a dog gives you that one dog. In SAM 3, giving an image exemplar of a dog gives you all dogs. The exemplar defines the concept, and the model generalizes from it.

Prompt Type	Input	Output
PVS click (SAM 2)	Click on dog #1	Mask for dog #1 only
PCS exemplar (SAM 3)	Box around dog #1	Masks for ALL dogs
PCS text (SAM 3)	“dog”	Masks for ALL dogs
PCS text + exemplar	“dog” + box on dog #1	Masks for ALL dogs, disambiguated

Interactive refinement loop

The real power emerges in the interactive workflow:

User prompts: “fish”
SAM 3 finds 8 fish, but misses 2 partially occluded ones
User draws a positive box on a missed fish → SAM 3 now finds 10
But it also wrongly includes a fish-shaped rock → user draws a negative box on it
SAM 3 refines: 9 correct fish masks. User can now switch to PVS clicks to refine individual mask boundaries.

Quantitative impact: On SA-Co/Gold, starting with text-only gives cgF1 = 54.1. Adding 3 interactive exemplar prompts boosts it by 21.6 cgF1 points. Performance plateaus after ~4 prompts, after which switching to PVS-style per-mask refinement yields further gains. The hybrid approach (PCS then PVS) is strictly better than either alone.

The 1-exemplar experiment

Even a single exemplar dramatically improves results over text-only. On COCO, text-only achieves 56.4 AP. Adding 1 image exemplar jumps to 76.8 AP+. Adding both text + exemplar reaches 78.1 AP+. SAM 3 outperforms the prior state-of-the-art T-Rex2 by +18.3 AP on COCO with 1 exemplar.

How does an image exemplar prompt differ from a PVS click prompt?

An exemplar defines a CONCEPT and the model finds all matching instances; a PVS click specifies a single object and returns only that object’s mask
They are identical in function
Exemplars only work on video, clicks only on images

Chapter 7: The Data Engine

Architecture is only half the story. The reason SAM 3 doubles the accuracy of prior systems is its data engine — a human-AI annotation pipeline that produced 4M unique noun phrases and 52M masks at quality levels impossible with human annotation alone.

The chicken-and-egg problem

To train a great concept segmentor, you need concept-level mask annotations. But manually annotating “every instance of ‘traffic cone’ in this image” for millions of images is prohibitively expensive. The data engine solves this by iterating between model training and data collection — each round of SAM 3 proposes masks, humans verify and correct them, and the improved data trains a better SAM 3.

The four phases

Phase 1: Human verification (bootstrap). Start with simple captioners to propose NPs. Use SAM 2 + off-the-shelf detector to propose masks. Humans verify everything. Produces 4.3M image-NP pairs. Train initial SAM 3.

Phase 2: Human + AI verification. The breakthrough. Fine-tune Llama 3.2 on human accept/reject labels to create AI verifiers that automatically judge mask quality and exhaustivity. This doubles throughput compared to human-only annotation. Additionally, an upgraded NP proposal pipeline generates hard negatives adversarial to the current SAM 3. Re-train SAM 3 six times. Produces 122M image-NP pairs.

Phase 3: Scaling and domain expansion. Expand to 15 diverse visual domains. Mine long-tail concepts from a 22.4M node ontology based on Wikidata. Re-train SAM 3 seven times and AI verifiers three times. Produces 19.5M additional image-NP pairs.

Phase 4: Video annotation. Extend to video. Use mature image SAM 3 to annotate video frames. Focus human effort on crowded scenes and tracking failures. Produces 52.5K videos with 467K masklets.

The two verification steps

Every mask goes through two quality gates:

Step	Question	Failure mode caught
Mask Verification (MV)	Is this mask correct for this NP?	Bad mask quality, wrong object
Exhaustivity Verification (EV)	Are ALL instances of this NP masked?	Missed instances

Only image-NP pairs that fail exhaustivity verification go to expensive human correction. Pairs that pass both checks are used directly as training data.

Hard negatives are crucial. Adding hard negatives to training improves IL_MCC from 0.44 to 0.68 — a massive jump. Without hard negatives, the model has never seen concepts that are NOT in the image, so it learns to always say “yes.” Hard negatives teach it to say “no” confidently. The NP proposal pipeline generates adversarial NPs by finding concepts that are visually or semantically similar to what IS in the image.

Scaling behavior

Both high-quality (HQ) and synthetic (SYN) data show clean power-law scaling. On SA-Co/Gold, going from EXT-only (training on external datasets alone) to EXT+SYN improves cgF1 by +8.8. Adding HQ data on top improves it by another +14.6. The total training set comprises 52M human-verified masks, 1.4B synthetic masks, and 4M unique NPs, orders of magnitude more concept diversity than any prior dataset.

What role do AI verifiers play in SAM 3’s data engine?

They automatically judge mask quality (MV) and completeness (EV) using fine-tuned Llama models, doubling annotation throughput by letting humans focus only on the hardest cases
They generate the images used for training
They replace human annotators entirely

Chapter 8: The SA-Co Benchmark

You cannot measure progress on a task without a good benchmark. Prior open-vocabulary benchmarks had at most a few thousand categories. SAM 3 introduces SA-Co — Segment Anything with Concepts — with 207K unique phrases across 120K images and 1.7K videos.

Benchmark splits

Split	Domains	Annotation	Purpose
SA-Co/Gold	7	3 annotators per pair	Measure human ceiling, handle ambiguity
SA-Co/Silver	10	1 annotator per pair	Larger-scale evaluation
SA-Co/Bronze	9	Existing + SAM 2 masks	Cross-benchmark evaluation
SA-Co/Bio	Bio	Existing annotations	Specialized domain evaluation
SA-Co/VEval	3	Video NP pairs	Video concept segmentation

Why existing benchmarks are insufficient

COCO has 80 categories. LVIS has 1,203. These are fixed vocabularies — models memorize the category list. SA-Co has 207K unique phrases from open vocabulary, including fine-grained concepts (“Boston terrier” vs “French bulldog”), rare objects, and domain-specific terms. It also includes hard negative NPs to test calibration.

The metrics: cgF1

SAM 3 introduces a metric designed for practical use. Standard detection AP does not account for calibration — a model might have high AP but require custom thresholds per category to be usable. SAM 3’s metrics enforce a fixed threshold of 0.5:

cgF1 = 100 × pmF1 × IL_MCC

Where:

pmF1 (positive micro F1): measures mask quality on images that DO contain the concept. In the formula above it is a fraction in [0, 1]; the tables report it multiplied by 100
IL_MCC (image-level Matthews Correlation Coefficient): measures binary classification accuracy at the image level — “is the concept present at all?”

Both components must be good for cgF1 to be high. A model that always predicts “present” gets low IL_MCC. A model that produces bad masks gets low pmF1.

Why MCC instead of accuracy? Matthews Correlation Coefficient handles class imbalance properly. If 90% of concept queries are negative (the concept isn’t in the image), a model that always says “no” gets 90% accuracy but MCC = 0. MCC ranges from −1 to 1, where 1 = perfect, 0 = random, −1 = perfectly wrong. It requires the model to do well on both positives and negatives.
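
The "always says no" case is easy to verify numerically. The sketch below computes MCC from confusion-matrix counts using the standard formula and the common convention of returning 0 when the denominator is zero, then combines it with pmF1 as in the cgF1 formula. The function names and the example counts are illustrative, and `pm_f1` is assumed to be a fraction in [0, 1].

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def cgf1(pm_f1, il_mcc):
    """cgF1 = 100 * pmF1 * IL_MCC, with pmF1 given as a fraction in [0, 1]."""
    return 100.0 * pm_f1 * il_mcc


# 100 image-NP pairs, 90 of them negative; the model always answers "absent".
always_no = dict(tp=0, fp=0, tn=90, fn=10)
print(accuracy(**always_no))  # 0.9
print(mcc(**always_no))       # 0.0
```

Since cgF1 multiplies the two terms, a model with IL_MCC of 0 scores a cgF1 of 0 no matter how good its masks are.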

Why does SAM 3 use cgF1 instead of standard AP (Average Precision)?

cgF1 evaluates at a fixed 0.5 threshold, enforcing good calibration: models must be usable in practice without per-category threshold tuning, unlike AP, which aggregates over all thresholds
AP is too slow to compute
cgF1 gives higher numbers which look better in papers

Chapter 9: Results

SAM 3 does not just improve — it redefines the state of the art across nearly every benchmark it touches.

Image PCS with text

Model	LVIS AP (mask)	SA-Co/Gold cgF1	COCO AP (box)
OWLv2*	29.3	24.6	45.5
GroundingDINO-T	14.7	3.3	20.5
LLMDet-L	35.1	6.5	42.0
APE-D*	—	16.4	59.6
DINO-X	38.5	21.3	52.4
Gemini 2.5 Flash	13.4	13.0	—
SAM 3	48.5	54.1	53.6
Human	—	72.8	—

On LVIS, SAM 3 achieves 48.5 AP — a +10 point jump over the best prior system (DINO-X at 38.5). On SA-Co/Gold, SAM 3 more than doubles the best baseline’s cgF1 (54.1 vs 24.6 for OWLv2*). SAM 3 reaches 74% of estimated human performance.

Video PCS with text

On the SA-V split of SAM 3’s own SA-Co/VEval benchmark, and on the public LVVIS and BURST benchmarks, it outperforms every baseline in the table:

Model	SA-V cgF1	SA-V pHOTA	LVVIS mAP	BURST HOTA
GLEE (one NP)	0.1	11.8	20.8	28.4
LLMDet + SAM 3 Tracker	2.3	30.1	15.2	33.3
SAM 3 Detector + T-by-D	25.7	55.7	35.9	39.7
SAM 3	30.3	58.0	36.3	44.5
Human	53.1	70.5	—	—

Visual prompting (PVS) — SAM 3 also beats SAM 2

SAM 3 doesn’t just add a new capability; it also improves the original task. On the challenging MOSEv2 video object segmentation benchmark, SAM 3 outperforms SAM 2.1 by +12.4 J&F (60.3 vs 47.9). On DAVIS17 it reaches 92.2 vs 90.7. On interactive image segmentation (SA-37 benchmark), SAM 3 outperforms SAM 2.1 on average mIoU at 1, 3, and 5 clicks.

SAM 3 Agent — complex queries via MLLM

SAM 3 handles simple NPs natively. For complex queries like “the object the person in red is holding,” SAM 3 pairs with an MLLM. The MLLM decomposes the query into simple NPs, prompts SAM 3 iteratively, and analyzes returned masks. On ReasonSeg, SAM 3 Agent with Gemini 2.5 Pro achieves 77.0 gIoU — crushing prior zero-shot approaches (best was 65.0).

Object counting

Because SAM 3 finds all instances, it can count objects by counting masks. On CountBench, SAM 3 achieves 93.8% accuracy with MAE of 0.12 — beating Gemini 2.5 Pro (92.4%, MAE 0.24). And unlike MLLMs that just output a number, SAM 3 provides the actual segmentation masks for each counted object.

Inference speed

On an H200 GPU, SAM 3 runs in 30 ms per image with 100+ detected objects. In video, latency scales with number of tracked objects, sustaining near real-time for ~5 concurrent objects.

The headline number: SAM 3 doubles the accuracy of the best existing system on open-vocabulary concept segmentation (54.1 vs 24.6 cgF1 on SA-Co/Gold). And it does this while ALSO improving SAM 2’s visual prompting capabilities across the board.

By how much does SAM 3 improve over the best prior system on LVIS mask AP?

+2.3 points
+10 points (48.5 vs 38.5 from DINO-X), a 26% relative improvement
SAM 3 matches but does not surpass prior systems on LVIS

Chapter 10: Connections

SAM lineage

SAM 1 (2023): Introduced promptable image segmentation with points/boxes/masks. Could segment one object per prompt. Trained on SA-1B (1B masks from 11M images). No text understanding, no video.

SAM 2 (2024): Extended to video via streaming memory architecture. Still one object per prompt. Introduced SA-V dataset for video. Real-time interactive video segmentation.

SAM 3 (2025): Extends to concept-level segmentation via text/exemplar prompts. Finds ALL instances. DETR-based detector + SAM 2 tracker sharing PE backbone. 4M NPs, 52M masks. Doubles prior accuracy.

DETR lineage

SAM 3’s detector follows the DETR paradigm: learned object queries, Hungarian matching, set prediction loss. Key innovations it borrows: deformable attention (Deformable DETR), box-region positional bias, dual supervision from DAC-DETR, align loss, and MaskFormer-style mask heads.

Vision-language alignment

The Perception Encoder backbone comes from aligned vision-language pre-training (similar to CLIP but with richer features). This is what enables SAM 3 to match text descriptions to visual content in the first place — without it, the model has no way to interpret “yellow school bus.”

What comes next

Limitations the authors acknowledge: SAM 3 struggles with out-of-domain concepts (specialized domains not in training), very long tail concepts, and referring expressions that require reasoning (though MLLM pairing helps). Domain adaptation via synthetic data (no human labels needed) partially addresses the first issue.

Broader trend: SAM 3 is part of the convergence of detection, segmentation, tracking, and language into unified models. Vision Banana (2026) showed that even image generators can be turned into segmentors. The era of single-task vision models is ending; foundation models that handle everything are becoming the norm.

The big picture: SAM 1 solved “segment what I point at.” SAM 2 solved “segment what I point at, across video.” SAM 3 solves “segment everything that matches what I describe.” Each step dramatically expands the space of tasks a single model can handle, reducing the need for specialized pipelines.

What is the key architectural difference between SAM 3 and SAM 2?

SAM 3 adds a DETR-based concept detector that shares a vision-language backbone with the SAM 2-style tracker, enabling text/exemplar prompts to find all instances, while SAM 2 only had a tracker with spatial prompts for single objects
SAM 3 uses a larger backbone
SAM 3 removes the memory mechanism

SAM 3: Segment Anything with Concepts