A foundation model for promptable visual segmentation: it extends SAM from images to video with a streaming architecture, a memory bank, and the largest video segmentation dataset to date.
SAM (Segment Anything Model) was a breakthrough: click on any object in an image, and you get a high-quality mask. It made promptable image segmentation practical. But it had a fundamental limitation: SAM only works on single images.
The real world is not a photograph. Objects move, get occluded, change appearance, and reappear. A huge fraction of visual data is video — from autonomous driving to AR/VR to video editing. If you want to segment an object in a video, SAM forces you to re-segment it independently in every single frame. No memory. No tracking. No temporal awareness.
What about combining SAM with a separate video tracker? People tried this — use SAM to get a mask on one frame, then feed it to a tracker like XMem++ or Cutie. But this pipeline has serious problems:
SAM 2's core insight is deceptively simple: treat video as a stream of frames, and give the model a memory.
Instead of processing an entire video at once (which would be prohibitively expensive for long videos), SAM 2 processes frames one at a time in a streaming fashion. Each frame passes through the same pipeline: image encoder, memory attention, prompt encoder and mask decoder, then memory encoder.
This is a natural generalization of SAM. When you apply SAM 2 to a single image, the memory bank is empty, and the model behaves like SAM. When you apply it to a video, memories accumulate frame by frame, carrying object identity forward through time.
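The streaming loop described above can be sketched in a few lines. This is a minimal illustration, not SAM 2's actual code: the function `segment_video` and its four callable arguments are hypothetical stand-ins for the real components, and memory eviction is omitted for brevity.

```python
def segment_video(frames, prompts, encode_image, attend_to_memory,
                  decode_mask, encode_memory):
    """Process frames one at a time, carrying a memory bank forward.

    All four callables are hypothetical placeholders for the real modules.
    `prompts` maps a frame index to a user prompt (missing index = no prompt).
    """
    memory_bank = []  # grows without bound here; a real system would evict
    masks = []
    for idx, frame in enumerate(frames):
        features = encode_image(frame)                        # per-frame, no temporal info
        conditioned = attend_to_memory(features, memory_bank)  # condition on the past
        mask = decode_mask(conditioned, prompts.get(idx))      # prompt is optional
        memory_bank.append(encode_memory(features, mask))      # store for future frames
        masks.append(mask)
    return masks


# Toy stand-ins so the loop can be run end to end.
masks = segment_video(
    frames=[1, 2, 3],
    prompts={0: "click"},
    encode_image=lambda frame: frame * 10,
    attend_to_memory=lambda feats, bank: feats + len(bank),
    decode_mask=lambda cond, prompt: (cond, prompt),
    encode_memory=lambda feats, mask: feats,
)
print(masks)  # [(10, 'click'), (21, None), (32, None)]
```

With an empty memory bank (the first frame, or a single image), the conditioning step has nothing to add, which mirrors the image-only case described above.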
The streaming design means SAM 2 can handle arbitrarily long videos: per-frame cost does not grow with video length, and there is no quadratic blowup from attending over all frames. Memory is managed through a FIFO queue: the most recent N frames are kept and older ones are evicted. Prompted frames are stored separately, so the model has both a short-term memory (recent frame details) and a long-term memory (the prompted frames).
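The memory policy can be illustrated with a small sketch. The class name `MemoryBank`, its methods, and the capacities used below are hypothetical and chosen only for illustration; they are not taken from the SAM 2 code.

```python
from collections import deque


class MemoryBank:
    """Toy memory bank: a FIFO queue of recent frames plus a store of prompted frames.

    Hypothetical names and capacities, for illustration only.
    """

    def __init__(self, max_recent=6, max_prompted=2):
        # deque(maxlen=...) drops the oldest entry automatically when full (FIFO).
        self.recent = deque(maxlen=max_recent)
        self.prompted = deque(maxlen=max_prompted)

    def add(self, frame_idx, memory, prompted=False):
        target = self.prompted if prompted else self.recent
        target.append((frame_idx, memory))

    def frames(self):
        """Frame indices currently held, prompted frames first."""
        return [i for i, _ in self.prompted] + [i for i, _ in self.recent]


bank = MemoryBank(max_recent=3, max_prompted=1)
bank.add(0, "mem0", prompted=True)  # the user prompted frame 0
for t in range(1, 6):               # frames 1..5 are propagated
    bank.add(t, f"mem{t}")
print(bank.frames())                # [0, 3, 4, 5]
```

Frame 0 survives because it was prompted, while frames 1 and 2 have been evicted from the recent queue.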
And because prompts can be provided on any frame — past, present, or future — users can interactively refine the segmentation at any point. Made a mistake? Click on the frame where the model went wrong, and the correction propagates to all other frames through the memory.
SAM 2 has four main components, each with a clean, well-defined role. Let's walk through what happens when a new video frame arrives.
Each frame is independently encoded by a Hiera image encoder — a hierarchical vision transformer pre-trained with MAE (Masked Autoencoder). The key word is "independently": the image encoder runs once per frame with no temporal information. Its job is to produce unconditioned feature embeddings — a rich spatial representation of the frame before any memory or prompt conditioning.
Hiera is hierarchical, meaning it produces features at multiple spatial scales. This is critical for segmentation: you need coarse features for understanding what the object is and fine features for precise mask boundaries.
This is the temporal heart of SAM 2. The memory attention module takes the unconditioned frame embedding from the image encoder and conditions it on the memory bank — information about what the model has seen and predicted in previous frames.
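It stacks L transformer blocks. Each block performs self-attention over the current frame's features, then cross-attention to the memories stored in the memory bank, followed by an MLP.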
The output is a conditioned frame embedding — the current frame's features, now enriched with knowledge of the target object's history.
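To make the conditioning step concrete, here is a toy version of one memory attention block in plain Python. It assumes the usual transformer layout (self-attention, then cross-attention to memory, then a feed-forward step) with residual connections. Learned projections, multiple heads, and normalization are left out, and every function name is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def add(xs, ys):
    """Residual connection: element-wise sum of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def memory_attention_block(frame_tokens, memory_tokens):
    # 1) Self-attention: frame tokens attend to each other.
    x = add(frame_tokens, attend(frame_tokens, frame_tokens, frame_tokens))
    # 2) Cross-attention: frame tokens query the memory bank.
    x = add(x, attend(x, memory_tokens, memory_tokens))
    # 3) Feed-forward step; a ReLU stands in for the learned MLP.
    x = add(x, [[max(0.0, v) for v in tok] for tok in x])
    return x


frame = [[1.0, 0.0], [0.0, 1.0]]   # two frame tokens, 2 channels each
memory = [[0.5, 0.5]]              # one memory token
conditioned = memory_attention_block(frame, memory)
```

The output has one vector per frame token, so the conditioned embedding keeps the spatial layout of the unconditioned one.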
The prompt encoder and mask decoder closely follow SAM's design. The prompt encoder converts user inputs (clicks, boxes, masks) into embeddings. The mask decoder takes the conditioned frame embedding plus prompt embeddings and predicts segmentation masks. For ambiguous prompts (a single click could refer to multiple objects), it predicts multiple candidate masks and selects the one with the highest predicted IoU.
A key addition over SAM: an occlusion head that predicts whether the target object is even visible on the current frame. In video, objects frequently disappear behind other objects — the model needs to know when to output "nothing here" rather than hallucinating a mask.
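The two decisions described above, picking the best candidate and suppressing the output when the object is not visible, can be sketched as follows. The function name `select_output` and the sign convention for the occlusion score are assumptions made for this example.

```python
def select_output(candidate_masks, predicted_ious, occlusion_logit):
    """Return the candidate with the highest predicted IoU, or None if occluded.

    Assumption for this sketch: a negative occlusion_logit means
    "the object is not visible on this frame".
    """
    if occlusion_logit < 0:
        return None  # output "nothing here" instead of a hallucinated mask
    best = max(range(len(candidate_masks)), key=lambda i: predicted_ious[i])
    return candidate_masks[best]


candidates = ["whole_person", "arm", "hand"]  # placeholder masks
print(select_output(candidates, [0.71, 0.55, 0.88], occlusion_logit=2.3))   # hand
print(select_output(candidates, [0.71, 0.55, 0.88], occlusion_logit=-1.0))  # None
```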
After predicting a mask, the memory encoder creates a memory to store for future frames. It downsamples the predicted mask, fuses it element-wise with the unconditioned frame embedding, and passes it through lightweight convolutions. This compact memory representation captures both what the frame looked like and what was segmented in it.
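A toy version of this fusion step is shown below. It is a sketch under simplifying assumptions: average pooling stands in for the learned downsampling convolution, the final convolutional layers are omitted, and the function names are hypothetical.

```python
def downsample_2x(mask):
    """Average-pool a 2D mask by a factor of 2 (stand-in for a learned strided conv)."""
    h, w = len(mask) // 2, len(mask[0]) // 2
    return [[(mask[2 * i][2 * j] + mask[2 * i][2 * j + 1]
              + mask[2 * i + 1][2 * j] + mask[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]


def encode_memory(frame_embedding, predicted_mask):
    """Add the downsampled mask to every channel of the frame embedding."""
    small = downsample_2x(predicted_mask)
    return [[[c + small[i][j] for c in frame_embedding[i][j]]
             for j in range(len(small[0]))] for i in range(len(small))]


embedding = [[[0.0, 1.0], [2.0, 3.0]]]  # 1x2 grid, 2 channels per location
mask = [[1, 1, 0, 0],
        [1, 1, 0, 1]]                   # 2x4 predicted mask
print(encode_memory(embedding, mask))   # [[[1.0, 2.0], [2.25, 3.25]]]
```

Each memory location now carries both appearance (the original channels) and how much of that location was covered by the predicted mask.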
The memory system is what makes SAM 2 fundamentally different from SAM. It's carefully designed to balance three competing needs: retaining enough history to track objects, staying efficient enough for real-time processing, and being flexible enough to handle interactive corrections.
The memory bank is a structured store that holds two types of memories: memories of up to N recent frames and memories of up to M prompted frames.
For the standard VOS setting (mask on the first frame only), the memory bank always keeps the first frame plus up to N recent frames. This means the model always remembers what the object originally looked like, while also tracking how it has changed.
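Each memory has two parts: a spatial memory (the feature map produced by the memory encoder) and an object pointer (a lightweight vector summarizing the target object).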
The memory attention module cross-attends to both spatial memories and object pointers simultaneously. Spatial memories tell the model where to look, and object pointers tell it what to look for.
Recent frame memories get temporal position embeddings — the model knows "this memory is from 3 frames ago" versus "this is from 7 frames ago." This allows the model to reason about motion: an object that moved left between frames t-3 and t-1 is likely to continue moving left.
Prompted frame memories do not get temporal embeddings. Why? Because prompted frames can come from arbitrary points in the video (even "from the future" during interactive refinement), so fixed temporal position information would be misleading rather than helpful.
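The asymmetry between the two memory types can be sketched as below. The lookup table `temporal_embedding` stands in for learned parameters, and the function name and data layout are hypothetical.

```python
def build_memory_tokens(current_idx, recent, prompted, temporal_embedding):
    """Assemble memory tokens for cross-attention.

    recent / prompted: lists of (frame_idx, feature_vector).
    temporal_embedding: dict mapping "frames ago" to a vector (learned in a
    real model; a hand-written table in this sketch).
    """
    tokens = []
    for _, feat in prompted:
        tokens.append(list(feat))      # prompted frames: no temporal embedding
    for idx, feat in recent:
        offset = current_idx - idx     # how many frames ago this memory was made
        emb = temporal_embedding[offset]
        tokens.append([f + e for f, e in zip(feat, emb)])
    return tokens


table = {1: [0.5, 0.0], 3: [0.25, 0.0]}
tokens = build_memory_tokens(
    current_idx=10,
    recent=[(7, [1.0, 1.0]), (9, [2.0, 2.0])],
    prompted=[(0, [5.0, 5.0])],
    temporal_embedding=table,
)
print(tokens)  # [[5.0, 5.0], [1.25, 1.0], [2.5, 2.0]]
```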
SAM 2 inherits SAM's flexible prompt interface and extends it to the temporal domain. You can prompt the model on any frame of a video, not just the first one. Three prompt types are supported: points (clicks), boxes, and masks.
Points are the simplest interaction. Click on the object to segment (positive click) or click on the background to exclude (negative click). Points are encoded as positional encodings summed with learned embeddings that distinguish positive from negative.
A single click is ambiguous — it could refer to the whole object, a part, or a subpart. SAM 2 handles this by predicting multiple candidate masks and selecting the one with the highest predicted IoU score. In video, ambiguity can persist across frames: if you click on a person's hand, does the model track the hand or the whole person? SAM 2 propagates the highest-confidence interpretation unless a follow-up prompt resolves the ambiguity.
With a box prompt, you draw a rectangle around the target object. A box is less ambiguous than a single click, but still leaves some freedom (tight box = precise object, loose box = object + context). It is encoded as two corner points with special box-type embeddings.
With a mask prompt, you provide a full binary mask of the target object. This is the least ambiguous prompt: you are telling the model exactly what to segment. Masks are embedded using convolutions and summed directly with the frame embedding. This is the prompt type used in the classical VOS (video object segmentation) task, where a ground-truth mask is given on the first frame.
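The point and box encodings can be illustrated with a toy sketch. The sinusoidal encoding and the 4-dimensional type embeddings below are illustrative stand-ins (in a trained model the type embeddings are learned), and all names are hypothetical.

```python
import math

# Hypothetical "learned" type embeddings, hand-written for this sketch.
TYPE_EMBEDDING = {
    "negative": [0.0, 0.0, 0.0, 0.0],
    "positive": [1.0, 0.0, 0.0, 0.0],
    "box_top_left": [0.0, 1.0, 0.0, 0.0],
    "box_bottom_right": [0.0, 0.0, 1.0, 0.0],
}


def positional_encoding(x, y):
    """4-dim sinusoidal encoding of coordinates normalized to [0, 1]."""
    return [math.sin(2 * math.pi * x), math.cos(2 * math.pi * x),
            math.sin(2 * math.pi * y), math.cos(2 * math.pi * y)]


def encode_point(x, y, kind):
    """Positional encoding summed with the embedding for the point's type."""
    return [p + t for p, t in zip(positional_encoding(x, y), TYPE_EMBEDDING[kind])]


def encode_box(x0, y0, x1, y1):
    """A box becomes two corner points, each with its own type embedding."""
    return [encode_point(x0, y0, "box_top_left"),
            encode_point(x1, y1, "box_bottom_right")]


print(encode_point(0.0, 0.0, "positive"))  # [1.0, 1.0, 0.0, 1.0]
```

Because the type embedding is added to the same positional code, a positive and a negative click at the same location produce different prompt tokens.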
The real power comes from iterative refinement across frames. Imagine you click on a dog in frame 1, and SAM 2 tracks it through the video. In frame 50, the dog goes behind a tree and the model loses it. With a SAM + tracker pipeline, you'd have to restart from scratch in frame 50 with multiple clicks. With SAM 2, you give a single corrective click in frame 50, and the memory from frame 1 is still available — the model knows what the dog looks like, so one click is enough to recover it.
A model is only as good as its training data. To build a model that can segment anything in video, you need a dataset that covers everything in video. Existing video segmentation datasets were far too small and narrow:
None of these datasets capture the diversity needed for "segment anything in video." They focus on whole objects (not parts), center on common categories, and contain relatively few masks.
SAM 2's dataset was built using an iterative data engine — a virtuous cycle where the model helps annotators, and the annotations improve the model. Three phases:
Phase 1 — SAM per frame: Annotators used SAM to segment every frame independently. No tracking, no temporal help. High quality but painfully slow: 37.8 seconds per frame. This produced 16K masklets across 1.4K videos — a small but clean seed dataset.
Phase 2 — SAM + SAM 2 Mask: An early SAM 2 was added to propagate masks across frames. Annotators segmented the first frame with SAM, then SAM 2 Mask propagated to other frames. Annotators corrected mistakes and re-propagated. Annotation time dropped to 7.4 s/frame (5.1x faster). Produced 63.5K masklets. But corrections still required re-segmenting from scratch — no memory of previous corrections.
Phase 3 — Full SAM 2: The fully-featured SAM 2 with memory was put in the loop. Now annotators could refine with simple corrective clicks, and the model remembered the object across frames. Annotation time: 4.5 s/frame (8.4x faster than Phase 1). Produced 197.0K masklets.
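The speedup figures follow directly from the per-frame annotation times quoted for the three phases, as this short check shows.

```python
# Per-frame annotation time in seconds, as quoted for each phase.
seconds_per_frame = {"phase1": 37.8, "phase2": 7.4, "phase3": 4.5}

speedups = {
    phase: round(seconds_per_frame["phase1"] / seconds_per_frame[phase], 1)
    for phase in ("phase2", "phase3")
}
print(speedups)  # {'phase2': 5.1, 'phase3': 8.4}
```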
Automatic masklets were generated by prompting SAM 2 with a regular grid of points on the first frame. These candidates were then verified by human annotators — satisfactory ones were kept, and unsatisfactory ones (model failure cases) were sent back for manual refinement. This dual purpose both increases coverage and identifies weaknesses.
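A regular grid of point prompts can be generated as in the sketch below. The function name, the cell-centered placement, and the grid density are assumptions for illustration; the text above does not specify them.

```python
def point_grid(width, height, points_per_side):
    """Return (x, y) prompts at the centers of a regular grid of cells."""
    step_x = width / points_per_side
    step_y = height / points_per_side
    return [((i + 0.5) * step_x, (j + 0.5) * step_y)
            for j in range(points_per_side)
            for i in range(points_per_side)]


print(point_grid(4, 4, 2))  # [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
```

Each point would then be used as a separate single-click prompt on the first frame, yielding one candidate masklet per point.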
A separate team of verifiers checked every masklet as "satisfactory" (correctly tracks the object across all frames) or "unsatisfactory" (sent back for refinement). Objects without clear boundaries were rejected entirely.
SAM 2 is trained jointly on image and video data, and this joint training is crucial. The model needs to be excellent at both single-frame segmentation (the SAM task) and multi-frame tracking (the video task). Training on only one would compromise the other.
The training data combines:
During training, SAM 2 simulates the interactive annotation process. For each training sample:
The training task is to sequentially predict the ground-truth masklet across all 8 frames of the sampled sequence, using the simulated prompts and the model's own memories. This directly mirrors how the model will be used at inference time.
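One way to picture the prompt simulation is the sketch below. Only the sequence length of 8 comes from the text above. The number of prompted frames, the prompt types, and their sampling weights are illustrative assumptions, and the function name is hypothetical.

```python
import random


def simulate_prompts(num_frames=8, max_prompted=2, rng=None):
    """Pick which frames get a simulated prompt and what kind of prompt.

    The cap on prompted frames and the type weights are assumptions made
    for this sketch, not confirmed training hyperparameters.
    """
    rng = rng or random.Random()
    k = rng.randint(1, max_prompted)                   # how many frames to prompt
    frames = sorted(rng.sample(range(num_frames), k))  # which frames
    kinds = rng.choices(["mask", "click", "box"], weights=[0.5, 0.25, 0.25], k=k)
    return dict(zip(frames, kinds))


plan = simulate_prompts(rng=random.Random(0))
print(plan)  # maps frame index -> prompt type for one training sample
```

The model would then be run over all frames in order, receiving a prompt only on the selected frames and relying on its memory everywhere else.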
For ambiguous prompts (a single click), SAM 2 predicts multiple candidate masks, just like SAM. During training, the loss is computed only on the mask that best matches the ground truth. This encourages the model to maintain multiple hypotheses rather than averaging them into a blurry compromise.
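The "supervise only the best candidate" rule can be sketched as follows. Binary cross-entropy is used here purely as a simple stand-in for the actual mask loss, and the function names are hypothetical.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean BCE between predicted probabilities and a binary target."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)


def multimask_loss(candidates, target):
    """Return (index, loss) of the candidate that best matches the target.

    Only this candidate would receive a mask-loss gradient.
    """
    losses = [binary_cross_entropy(c, target) for c in candidates]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]


candidates = [[0.9, 0.1], [0.2, 0.8]]  # two flattened candidate masks
best, loss = multimask_loss(candidates, target=[1, 0])
print(best)  # 0
```

Because the other candidates are not penalized for disagreeing with this particular ground truth, they remain free to represent alternative interpretations of the same click.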
The model includes an occlusion prediction head that is trained to predict whether the target object is visible on each frame. This is critical for video: objects frequently disappear behind other objects, leave the camera's field of view, or are otherwise temporarily invisible. The model must learn to output "not present" rather than hallucinating a mask where none should exist.
SAM 2 was evaluated on 17 video segmentation datasets and 37 image segmentation datasets, and the results are consistently strong across both.
The headline result: SAM 2 achieves better segmentation accuracy with 3x fewer interactions than the best SAM + tracker baselines (SAM+XMem++ and SAM+Cutie). In both offline evaluation (multiple passes through the video) and online evaluation (single forward pass), SAM 2 comes out ahead of both baselines across all 9 test datasets.
This is the payoff of the unified architecture — memory-based refinement is fundamentally more efficient than the restart-from-scratch approach of a decoupled pipeline.
Even in the classical VOS setting (ground-truth mask on the first frame, track through the rest), SAM 2 sets new state-of-the-art results. On 17 zero-shot video datasets:
SAM 2 outperforms dedicated VOS models at their own task, even though VOS is just a special case of SAM 2's more general PVS task.
On SAM's original 23 image benchmarks, SAM 2 achieves higher accuracy and is 6x faster than SAM. With 1-click prompting: 58.9 mIoU vs. SAM's 58.1 mIoU. This improvement comes largely from the Hiera image encoder, which is smaller but more effective than SAM's ViT-H encoder.
When trained on the full SA-1B + video data mix, accuracy jumps further to 61.4 mIoU — showing that video training data actually helps image segmentation performance.
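For readers unfamiliar with the metric, mIoU is the mean intersection-over-union between predicted and ground-truth masks. A minimal sketch on flattened binary masks is shown below. The function names are hypothetical, and treating two empty masks as a perfect match is a convention chosen for this example.

```python
def iou(pred, target):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # both empty: count as a match


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


print(iou([1, 1, 0, 0], [1, 0, 0, 0]))                     # 0.5
print(mean_iou([([1, 1], [1, 1]), ([1, 0], [1, 1])]))      # 0.75
```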
Against dedicated state-of-the-art VOS methods on established benchmarks (DAVIS, YouTube-VOS, MOSE, LVOS, etc.), SAM 2 with Hiera-L achieves significant improvements across the board while running at 30.2 FPS on a single A100 GPU — real-time speed.
A segmentation model that takes minutes per frame can be useful for research but is impractical for interactive applications. SAM 2 was designed from the ground up for real-time video processing.
On a single NVIDIA A100 GPU (batch size 1):
Several design choices make SAM 2 efficient:
Streaming architecture. The model processes one frame at a time with constant memory overhead. No need to load the entire video into GPU memory. No quadratic attention over all frames.
Hiera image encoder. SAM used ViT-H (632M parameters). SAM 2 uses Hiera, which is hierarchical and pre-trained with MAE, achieving better accuracy with fewer parameters and FLOPs. The Hiera-B+ variant is particularly efficient.
Efficient memory attention. SAM 2 uses vanilla attention operations for both self-attention and cross-attention, allowing it to leverage highly optimized attention kernels such as FlashAttention-2 (Dao, 2023). There are no custom attention patterns and no sparse attention tricks, just fast, well-optimized dense attention.
Lightweight memory encoder. Converting a prediction into a memory is cheap: a downsampling convolution, an element-wise addition, and a few convolutional layers. This is a tiny fraction of the total computation.
FIFO memory management. The memory bank has a fixed maximum size (N recent + M prompted frames). This means the cost of memory attention is bounded regardless of video length. A 10-second video costs the same per frame as a 10-hour video.
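The bound on memory attention cost can be made concrete with a little arithmetic. The capacities and the number of tokens per memory used below are hypothetical values picked for the example.

```python
def memory_tokens_attended(frames_seen, prompted_seen,
                           n_recent, m_prompted, tokens_per_memory):
    """Number of memory tokens cross-attended to on the current frame."""
    recent = min(frames_seen, n_recent)        # FIFO caps recent memories at N
    prompted = min(prompted_seen, m_prompted)  # prompted memories capped at M
    return (recent + prompted) * tokens_per_memory


# Hypothetical settings: N = 6, M = 2, 4096 tokens per stored memory.
short = memory_tokens_attended(10, 1, 6, 2, 4096)
long = memory_tokens_attended(1_000_000, 1, 6, 2, 4096)
print(short, long)  # 28672 28672
```

Once the recent queue is full, the count stops growing, which is why per-frame cost is independent of how many frames have already been processed.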
SAM 2 sits at the intersection of several research threads. Understanding where it came from — and what it enables — helps you see the bigger picture.
SAM is SAM 2's direct predecessor. It introduced promptable image segmentation with the SA-1B dataset (1 billion masks). SAM 2 generalizes SAM to video by adding memory and a streaming architecture. When the memory bank is empty, SAM 2 reduces to SAM. SAM 2 inherits SAM's prompt encoder and mask decoder designs almost unchanged.
XMem introduced multi-store memory architectures for video object segmentation — separating sensory memory (very recent frames), working memory (recent context), and long-term memory (persistent features). SAM 2's memory bank echoes this idea with its split between recent frame memories and prompted frame memories, though with a simpler FIFO management strategy instead of XMem's more complex memory consolidation.
Cutie is a state-of-the-art VOS model that SAM 2 outperforms. Cutie uses object-level memory for compact representation of the target. SAM 2's object pointers serve a similar purpose: they are lightweight vectors summarizing the target object's identity. The key difference is that SAM 2 integrates prompting and tracking into a single model, while Cutie requires a separate module (like SAM) for initial segmentation.
The classical VOS task — given a mask on the first frame, track it through the video — is a special case of SAM 2's Promptable Visual Segmentation (PVS) task. Early VOS methods used online fine-tuning (slow), then shifted to memory-based approaches with transformers. SAM 2 subsumes VOS and extends it to interactive, multi-frame prompting.
Previous iVOS approaches often used a modular pipeline: SAM for spatial segmentation, a separate tracker for temporal propagation. SAM 2 replaces this with a single end-to-end model. The unified memory means corrections are far cheaper — one click instead of re-segmenting from scratch.
The model-in-the-loop data engine follows SAM's approach: use the model to help annotators, use the annotations to improve the model, iterate. This flywheel has become a standard strategy for building foundation models at scale. SAM 2's three-phase engine reduced per-frame annotation time from 37.8 s to 4.5 s while maintaining quality.