SAM 2 — Veanors

Chapter 0: The Problem

SAM (Segment Anything Model) was a breakthrough: click on any object in an image, and you get a pixel-perfect mask. It solved promptable image segmentation. But it had a fundamental limitation: SAM only works on single images.

The real world is not a photograph. Objects move, get occluded, change appearance, and reappear. A huge fraction of visual data is video — from autonomous driving to AR/VR to video editing. If you want to segment an object in a video, SAM forces you to re-segment it independently in every single frame. No memory. No tracking. No temporal awareness.

What about combining SAM with a separate video tracker? People tried this — use SAM to get a mask on one frame, then feed it to a tracker like XMem++ or Cutie. But this pipeline has serious problems:

The tracker may lose the object (occlusion, fast motion, deformation)
SAM was designed for curated images, not noisy video frames with motion blur
There is no mechanism to correct the tracker's mistakes interactively — you must restart from scratch
Two separate models with incompatible representations cannot share information

The core challenge: We need a single unified model that does promptable segmentation across both images and video — one that can take clicks, boxes, or masks on any frame, track objects through time, handle occlusions, and allow interactive refinement. And it needs to process video in real time.

Why can't you just run SAM independently on each frame of a video?

SAM has no memory: each frame is segmented independently, so there's no temporal consistency, no tracking, and no way to propagate identity across frames
SAM is too slow to process video frames
SAM only works on natural images, not video frames

Chapter 1: The Key Insight

SAM 2's core insight is deceptively simple: treat video as a stream of frames, and give the model a memory.

Instead of processing an entire video at once (which would be prohibitively expensive for long videos), SAM 2 processes frames one at a time in a streaming fashion. Each frame passes through the same pipeline:

Encode the current frame with an image encoder (just like SAM)
Condition the frame embedding on memories of previously seen frames
Decode a segmentation mask, optionally incorporating user prompts
Store the result in a memory bank for future frames

This is a natural generalization of SAM. When you apply SAM 2 to a single image, the memory bank is empty, and it behaves exactly like SAM. When you apply it to a video, memories accumulate frame by frame, carrying object identity forward through time.
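The four steps above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the released implementation: `encode`, `condition`, `decode`, and `make_memory` are hypothetical stand-ins for the real components, and `max_memories` is an arbitrary cap chosen for the example.

```python
from collections import deque

def segment_stream(frames, encode, condition, decode, make_memory, max_memories=6):
    """Process frames one at a time, carrying a bounded memory forward."""
    memory_bank = deque(maxlen=max_memories)  # oldest entry is dropped when full
    masks = []
    for frame in frames:
        features = encode(frame)                              # 1. encode the frame on its own
        conditioned = condition(features, list(memory_bank))  # 2. condition on stored memories
        mask = decode(conditioned)                            # 3. predict a mask
        memory_bank.append(make_memory(features, mask))       # 4. store a memory for later frames
        masks.append(mask)
    return masks

# Toy stand-ins (hypothetical) that only record how many memories each frame could see.
def encode(frame):
    return frame

def condition(features, memories):
    return (features, len(memories))

def decode(conditioned):
    return conditioned

def make_memory(features, mask):
    return mask

video_masks = segment_stream(["f0", "f1", "f2"], encode, condition, decode, make_memory, max_memories=2)
image_masks = segment_stream(["img"], encode, condition, decode, make_memory)
```

In the single-image call the only frame sees zero memories, which is the "empty memory bank" case described above.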

The beautiful unification: An image is just a single-frame video with an empty memory bank. SAM 2 generalizes SAM: the same model handles both images and videos, with memory as the only difference. Because both tasks share one set of weights, training on video can also benefit image segmentation.

The streaming design means SAM 2 can handle arbitrarily long videos: there is no fixed context window and no attention cost that grows quadratically with video length. Memory is managed through FIFO queues: the most recent N unprompted frames are kept and older ones are evicted, while prompted frames are held in a separate queue of up to M entries. This gives the model both a short-term memory (recent frame details) and a longer-term memory (prompted frames, which are not displaced by newer unprompted frames).

And because prompts can be provided on any frame — past, present, or future — users can interactively refine the segmentation at any point. Made a mistake? Click on the frame where the model went wrong, and the correction propagates to all other frames through the memory.

What happens to SAM 2 when it is applied to a single image (no video)?

The memory bank is empty, so SAM 2 behaves exactly like SAM; it's a natural superset
It uses a special image-only mode with different weights
It creates synthetic memory from data augmentation

Chapter 2: The Architecture

SAM 2 has four main components, each with a clean, well-defined role. Let's walk through what happens when a new video frame arrives.

1. Image Encoder (Hiera)

Each frame is independently encoded by a Hiera image encoder — a hierarchical vision transformer pre-trained with MAE (Masked Autoencoder). The key word is "independently": the image encoder runs once per frame with no temporal information. Its job is to produce unconditioned feature embeddings — a rich spatial representation of the frame before any memory or prompt conditioning.

Hiera is hierarchical, meaning it produces features at multiple spatial scales. This is critical for segmentation: you need coarse features for understanding what the object is and fine features for precise mask boundaries.

2. Memory Attention

This is the temporal heart of SAM 2. The memory attention module takes the unconditioned frame embedding from the image encoder and conditions it on the memory bank — information about what the model has seen and predicted in previous frames.

It stacks L transformer blocks. Each block performs:

Self-attention on the current frame features (spatial reasoning)
Cross-attention to memories from previous frames and object pointers (temporal reasoning)
An MLP to mix the information

The output is a conditioned frame embedding — the current frame's features, now enriched with knowledge of the target object's history.
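The block structure described above (self-attention, then cross-attention to memory, then an MLP) can be sketched in plain Python. This is a toy under several assumptions that the text does not specify: a single attention head, no learned projections, no normalization or positional encodings, residual connections around each step, and cross-attention simply skipped when the memory bank is empty. All function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def residual(xs, updates):
    return [[a + b for a, b in zip(x, u)] for x, u in zip(xs, updates)]

def memory_attention_block(frame_tokens, memory_tokens, mlp):
    # Self-attention: current-frame tokens attend to each other (spatial reasoning).
    x = residual(frame_tokens, attend(frame_tokens, frame_tokens, frame_tokens))
    # Cross-attention: frame tokens query the memory tokens (temporal reasoning).
    if memory_tokens:
        x = residual(x, attend(x, memory_tokens, memory_tokens))
    # Per-token MLP to mix the information.
    return residual(x, [mlp(token) for token in x])

def memory_attention(frame_tokens, memory_tokens, mlp, num_blocks):
    x = frame_tokens
    for _ in range(num_blocks):  # stack L blocks
        x = memory_attention_block(x, memory_tokens, mlp)
    return x
```

In this sketch `memory_tokens` would hold both the spatial memories and the object pointers, concatenated into one list of vectors.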

3. Prompt Encoder + Mask Decoder

Closely follows SAM's design. The prompt encoder converts user inputs (clicks, boxes, masks) into embeddings. The mask decoder takes the conditioned frame embedding plus prompt embeddings and predicts segmentation masks. For ambiguous prompts (a single click could refer to a whole object, a part, or a subpart), it predicts multiple candidate masks and selects the one with the highest predicted IoU.

A key addition over SAM: an occlusion head that predicts whether the target object is even visible on the current frame. In video, objects frequently disappear behind other objects — the model needs to know when to output "nothing here" rather than hallucinating a mask.
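A minimal sketch of how these two decoder outputs might be combined at inference time. The function name, the `visibility_logit` argument, and the sign convention (positive means visible) are assumptions made for illustration, not the model's actual interface.

```python
def select_output(candidate_masks, predicted_ious, visibility_logit, threshold=0.0):
    """Return the candidate with the highest predicted IoU, or None if the
    object is judged not visible on this frame."""
    if visibility_logit < threshold:  # assumed convention: positive logit = visible
        return None                   # output "nothing here" instead of a mask
    best = max(range(len(candidate_masks)), key=lambda i: predicted_ious[i])
    return candidate_masks[best]

visible = select_output(["whole", "part", "subpart"], [0.6, 0.9, 0.3], visibility_logit=2.0)
occluded = select_output(["whole", "part", "subpart"], [0.6, 0.9, 0.3], visibility_logit=-1.0)
```

Here `visible` is the "part" candidate and `occluded` is `None`.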

4. Memory Encoder

After predicting a mask, the memory encoder creates a memory to store for future frames. It downsamples the predicted mask, fuses it element-wise with the unconditioned frame embedding, and passes it through lightweight convolutions. This compact memory representation captures both what the frame looked like and what was segmented in it.
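The three steps (downsample the mask, add it element-wise to the frame embedding, fuse) can be sketched on nested lists. This is a simplification under assumptions: average pooling stands in for the learned downsampling, the embedding has a single channel, and `fuse` is a hypothetical placeholder for the lightweight convolutions.

```python
def downsample(mask, factor):
    """Average-pool a 2D mask by an integer factor (stand-in for a learned module)."""
    height, width = len(mask) // factor, len(mask[0]) // factor
    return [[sum(mask[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(width)]
            for y in range(height)]

def encode_memory(mask, frame_embedding, factor, fuse=lambda value: value):
    small_mask = downsample(mask, factor)
    # Element-wise sum of the mask signal and the unconditioned frame embedding,
    # followed by the (placeholder) fusion step.
    return [[fuse(m + f) for m, f in zip(mask_row, feat_row)]
            for mask_row, feat_row in zip(small_mask, frame_embedding)]

mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
embedding = [[0.5, 0.5],
             [0.5, 0.5]]
memory = encode_memory(mask, embedding, factor=2)
# memory == [[1.5, 0.5], [0.5, 0.5]]
```

The top-left cell ends up larger than the others: the memory records both what the frame looked like (0.5 everywhere) and where the object was.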

The full pipeline per frame: Image encoder produces unconditioned features → Memory attention conditions them on the memory bank → Mask decoder (with optional prompts) predicts a mask → Memory encoder stores the result → Memory bank is updated for the next frame. Rinse and repeat.

What is the role of the memory attention module?

It conditions the current frame's features on past memories via cross-attention, injecting temporal context about the target object
It stores new memories into the memory bank
It encodes the user's prompts

Chapter 3: The Memory Mechanism

The memory system is what makes SAM 2 fundamentally different from SAM. It's carefully designed to balance three competing needs: retaining enough history to track objects, staying efficient enough for real-time processing, and being flexible enough to handle interactive corrections.

The Memory Bank

The memory bank is a structured storage that holds two types of memories:

Recent frame memories — a FIFO queue of up to N memories from the most recent unprompted frames. These capture the object's recent appearance and short-term motion. When the queue is full, the oldest memory is evicted.
Prompted frame memories — a FIFO queue of up to M memories from frames where the user provided prompts (clicks, boxes, masks). These are especially valuable because they contain human-verified information about the target object.

For the standard VOS setting (mask on the first frame only), the memory bank always keeps the first frame plus up to N recent frames. This means the model always remembers what the object originally looked like, while also tracking how it has changed.
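A rough sketch of the two-queue structure, using `collections.deque` with `maxlen` for FIFO eviction. The class and method names are hypothetical, and the sizes are arbitrary values for the example.

```python
from collections import deque

class MemoryBank:
    """Two FIFO queues: recent unprompted frames and prompted frames."""

    def __init__(self, max_recent, max_prompted):
        self.recent = deque(maxlen=max_recent)      # up to N recent memories
        self.prompted = deque(maxlen=max_prompted)  # up to M prompted memories

    def add(self, frame_index, memory, prompted=False):
        queue = self.prompted if prompted else self.recent
        queue.append((frame_index, memory))  # a full deque drops its oldest entry

    def frames(self):
        return [i for i, _ in self.prompted] + [i for i, _ in self.recent]

# VOS setting: a mask prompt on frame 0, then nine unprompted frames.
bank = MemoryBank(max_recent=3, max_prompted=2)
bank.add(0, "mask prompt", prompted=True)
for t in range(1, 10):
    bank.add(t, f"memory {t}")
# bank.frames() == [0, 7, 8, 9]
```

Frame 0 survives because it lives in the prompted queue, while the recent queue has rolled forward to the last three frames.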

Spatial Memories vs. Object Pointers

Each memory has two parts:

Spatial feature maps — detailed spatial representations from the memory encoder. These are cross-attended to during memory attention, providing pixel-level context about where the object was and what it looked like.
Object pointers — lightweight vectors extracted from the mask decoder output tokens. These are high-level semantic summaries: "this is a dog" or "this is a hand." They provide a compact, global description of the target object.

The memory attention module cross-attends to both spatial memories and object pointers simultaneously. Spatial memories tell the model where to look, and object pointers tell it what to look for.

Temporal Position Encoding

Recent frame memories get temporal position embeddings — the model knows "this memory is from 3 frames ago" versus "this is from 7 frames ago." This allows the model to reason about motion: an object that moved left between frames t-3 and t-1 is likely to continue moving left.

Prompted frame memories do not get temporal embeddings. Why? Because prompted frames can come from arbitrary points in the video (even "from the future" during interactive refinement), so fixed temporal position information would be misleading rather than helpful.
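A small sketch of that asymmetry. The lookup table `temporal_embedding` stands in for learned embeddings indexed by "how many frames ago"; its contents, and the function name, are hypothetical.

```python
def tag_memories(current_index, recent, prompted, temporal_embedding):
    """Pair each memory with a temporal embedding, for recent memories only."""
    tagged = []
    for _frame_index, memory in prompted:
        tagged.append((memory, None))  # prompted frames carry no temporal position
    for frame_index, memory in recent:
        offset = current_index - frame_index  # how many frames ago this memory was made
        tagged.append((memory, temporal_embedding[offset]))
    return tagged

table = {1: "emb(t-1)", 3: "emb(t-3)"}  # placeholder for learned vectors
tagged = tag_memories(
    current_index=10,
    recent=[(7, "m7"), (9, "m9")],
    prompted=[(42, "p42")],  # a prompt from a later frame, as in interactive refinement
    temporal_embedding=table,
)
# tagged == [("p42", None), ("m7", "emb(t-3)"), ("m9", "emb(t-1)")]
```

Note that the prompted memory comes from frame 42, which is "in the future" relative to frame 10, so an offset-based embedding would have no sensible value for it.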

Why FIFO works: You might wonder why a simple queue suffices — shouldn't the model decide which frames to remember? In practice, recency is a strong heuristic for video segmentation. The most recent frames almost always contain the most relevant information about the object's current appearance. And prompted frames (kept separately) handle the cases where a distant frame is important.

Why do prompted frame memories NOT receive temporal position embeddings?

Because prompted frames can come from arbitrary points in the video (even future frames), so fixed temporal positions would be misleading
To save memory and computation
Because prompted frames are always from the first frame

Chapter 4: Prompt Modes

SAM 2 inherits SAM's flexible prompt interface and extends it to the temporal domain. You can prompt the model on any frame of a video, not just the first one. Three prompt types are supported:

Clicks (Points)

The simplest interaction. Click on the object to segment (positive click) or click on the background to exclude (negative click). Points are encoded as positional encodings summed with learned embeddings that distinguish positive from negative.

A single click is ambiguous — it could refer to the whole object, a part, or a subpart. SAM 2 handles this by predicting multiple candidate masks and selecting the one with the highest predicted IoU score. In video, ambiguity can persist across frames: if you click on a person's hand, does the model track the hand or the whole person? SAM 2 propagates the highest-confidence interpretation unless a follow-up prompt resolves the ambiguity.

Bounding Boxes

Draw a rectangle around the target object. Less ambiguous than a single click, but still allows some freedom (tight box = precise object, loose box = object + context). Encoded as two corner points with special box-type embeddings.
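Since a box is encoded as two corner points, clicks and boxes can share one representation: a list of labelled points. The sketch below shows only that conversion. The label ids are illustrative assumptions; in the model each label would select a learned embedding that is summed with the point's positional encoding.

```python
# Hypothetical label ids, one per kind of point.
NEGATIVE, POSITIVE, BOX_TOP_LEFT, BOX_BOTTOM_RIGHT = 0, 1, 2, 3

def click_to_points(x, y, positive=True):
    return [(x, y, POSITIVE if positive else NEGATIVE)]

def box_to_points(x0, y0, x1, y1):
    # Normalize so the corners are correct regardless of drag direction.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return [(left, top, BOX_TOP_LEFT), (right, bottom, BOX_BOTTOM_RIGHT)]

points = click_to_points(120, 64) + box_to_points(50, 80, 10, 20)
# points == [(120, 64, 1), (10, 20, 2), (50, 80, 3)]
```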

Masks

Provide a full binary mask of the target object. This is the least ambiguous prompt — you are telling the model exactly what to segment. Masks are embedded using convolutions and summed directly with the frame embedding. This is the prompt type used in the classical VOS (video object segmentation) task, where a ground-truth mask is given on the first frame.

Interactive Refinement

The real power comes from iterative refinement across frames. Imagine you click on a dog in frame 1, and SAM 2 tracks it through the video. In frame 50, the dog goes behind a tree and the model loses it. With a SAM + tracker pipeline, you'd have to restart from scratch in frame 50 with multiple clicks. With SAM 2, you give a single corrective click in frame 50, and the memory from frame 1 is still available — the model knows what the dog looks like, so one click is enough to recover it.

Key advantage over SAM + tracker: When the tracker loses an object, a decoupled pipeline must re-segment from scratch (no memory). SAM 2's unified architecture retains memories across the entire video, so a single corrective click is often sufficient to recover a lost object. In experiments, SAM 2 achieved better accuracy with 3x fewer interactions.

Why does SAM 2 need fewer corrective clicks than a SAM + tracker pipeline when an object is lost?

SAM 2 retains memories of the object from earlier frames, so a single click can recover it, while a pipeline must re-segment from scratch with no memory context
SAM 2 uses a better image encoder
SAM 2 processes video at higher resolution

Chapter 5: The SA-V Dataset

A model is only as good as its training data. To build a model that can segment anything in video, you need a dataset that covers everything in video. Existing video segmentation datasets were far too small and narrow:

DAVIS 2017: 200 videos, 400 masklets — beautiful quality, but tiny
YouTube-VOS: 4,500 videos — larger, but limited to specific object categories (people, animals, vehicles)
MOSE: 2,100 videos — focused on complex occlusions but still limited

None of these datasets capture the diversity needed for "segment anything in video." They focus on whole objects (not parts), center on common categories, and contain relatively few masks.

The Data Engine

SAM 2's dataset was built using an iterative data engine — a virtuous cycle where the model helps annotators, and the annotations improve the model. Three phases:

Phase 1 — SAM per frame: Annotators used SAM to segment every frame independently. No tracking, no temporal help. High quality but painfully slow: 37.8 seconds per frame. This produced 16K masklets across 1.4K videos — a small but clean seed dataset.

Phase 2 — SAM + SAM 2 Mask: An early SAM 2 was added to propagate masks across frames. Annotators segmented the first frame with SAM, then SAM 2 Mask propagated to other frames. Annotators corrected mistakes and re-propagated. Annotation time dropped to 7.4 s/frame (5.1x faster). Produced 63.5K masklets. But corrections still required re-segmenting from scratch — no memory of previous corrections.

Phase 3 — Full SAM 2: The fully-featured SAM 2 with memory was put in the loop. Now annotators could refine with simple corrective clicks, and the model remembered the object across frames. Annotation time: 4.5 s/frame (8.4x faster than Phase 1). Produced 197.0K masklets.
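The quoted speedups follow directly from the per-frame annotation times; a quick check:

```python
seconds_per_frame = {"Phase 1": 37.8, "Phase 2": 7.4, "Phase 3": 4.5}
baseline = seconds_per_frame["Phase 1"]

# Speedup of each phase relative to Phase 1, rounded to one decimal place.
speedup = {phase: round(baseline / seconds, 1)
           for phase, seconds in seconds_per_frame.items()}
# speedup == {"Phase 1": 1.0, "Phase 2": 5.1, "Phase 3": 8.4}
```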

SA-V by the Numbers

Scale: 50.9K videos, 642.6K masklets (190.9K manual + 451.7K automatic), 35.5M masks total. That is 53x more masks than any existing video segmentation dataset. Videos average 14 seconds, covering 54% indoor and 46% outdoor scenes.

Automatic masklets were generated by prompting SAM 2 with a regular grid of points on the first frame. These candidates were then verified by human annotators: satisfactory ones were kept, and unsatisfactory ones (model failure cases) were sent back for manual refinement. The automatic pass therefore serves two purposes: it increases coverage, and it surfaces the model's weaknesses.

A separate team of verifiers labeled every masklet as "satisfactory" (correctly tracks the object across all frames) or "unsatisfactory" (sent back for refinement). Objects without clear boundaries were rejected entirely.

How much faster was Phase 3 (full SAM 2 in the loop) compared to Phase 1 (SAM per frame)?

8.4x faster: from 37.8 s/frame to 4.5 s/frame, because SAM 2's memory allowed simple corrective clicks instead of re-segmenting from scratch
2x faster
20x faster

Chapter 6: Training

SAM 2 is trained jointly on images and video data — and this joint training is crucial. The model needs to be excellent at both single-frame segmentation (the SAM task) and multi-frame tracking (the video task). Training on only one would compromise the other.

Training Data Mix

The training data combines:

SA-1B — SAM's original 1 billion mask image dataset (treated as single-frame videos)
SA-V — the new video dataset (50.9K videos, 642.6K masklets)
Existing VOS datasets — DAVIS, YouTube-VOS, MOSE, and others
Internal video data — 62.9K additional licensed videos

Simulated Interactive Prompting

During training, SAM 2 simulates the interactive annotation process. For each training sample:

Sample a sequence of 8 frames from a video
Randomly select up to 2 frames to receive prompts
The initial prompt can be a ground-truth mask (50% probability), a positive click (25%), or a bounding box (25%)
During the sequence, the model may receive corrective clicks — simulated by comparing the predicted mask to ground truth and clicking on the largest error region
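The sampling steps above can be sketched as follows. Several details are assumptions made for the example, not statements about the actual training code: frames are taken as a consecutive clip, at least one frame is prompted, masks are small nested lists of 0/1, error regions are 4-connected, and the click lands on the first pixel found in the largest region. All names are hypothetical.

```python
import random

def sample_training_plan(num_video_frames, rng, clip_length=8, max_prompted=2):
    start = rng.randrange(num_video_frames - clip_length + 1)
    frames = list(range(start, start + clip_length))
    prompted = sorted(rng.sample(frames, rng.randint(1, max_prompted)))
    # Initial prompt: mask 50%, positive click 25%, bounding box 25%.
    kind = rng.choices(["mask", "click", "box"], weights=[0.5, 0.25, 0.25])[0]
    return {"frames": frames, "prompted_frames": prompted, "initial_prompt": kind}

def largest_error_click(pred, gt):
    """Return (row, col, is_positive) in the largest error region, or None if pred == gt."""
    height, width = len(pred), len(pred[0])
    seen = [[False] * width for _ in range(height)]
    best = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or pred[r][c] == gt[r][c]:
                continue
            kind = gt[r][c]  # 1 = missed object pixel, 0 = false positive
            region, stack = [], [(r, c)]
            seen[r][c] = True
            while stack:  # flood fill over error pixels of the same kind
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width and not seen[ny][nx]
                            and pred[ny][nx] != gt[ny][nx] and gt[ny][nx] == kind):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(region) > len(best):
                best = region
    if not best:
        return None
    row, col = best[0]
    # Positive click where the object was missed, negative where it was hallucinated.
    return row, col, bool(gt[row][col])

plan = sample_training_plan(100, random.Random(0))
gt = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
pred = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
click = largest_error_click(pred, gt)
# click == (0, 0, True): the missed 2x2 object region outweighs the single false positive
```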

The training task is to sequentially predict the ground-truth masklet across all 8 frames, using the simulated prompts and its own memories. This directly mirrors how the model will be used at inference time.

Multi-Mask Prediction

For ambiguous prompts (a single click), SAM 2 predicts multiple candidate masks, just like SAM. During training, the loss is computed only on the mask that best matches the ground truth. This encourages the model to maintain multiple hypotheses rather than averaging them into a blurry compromise.
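A minimal sketch of this "supervise only the best candidate" rule. The per-mask loss here (fraction of disagreeing pixels on flat 0/1 lists) is a toy stand-in chosen for readability, not the loss used to train the model.

```python
def mask_loss(pred, gt):
    """Toy loss: fraction of pixels where prediction and ground truth disagree."""
    return sum(p != g for p, g in zip(pred, gt)) / len(gt)

def multimask_loss(candidates, gt):
    losses = [mask_loss(candidate, gt) for candidate in candidates]
    best = min(range(len(losses)), key=losses.__getitem__)
    # Only the best-matching candidate contributes to the training loss.
    return losses[best], best

gt = [1, 1, 0, 0]
candidates = [[1, 1, 1, 1],   # too large
              [1, 1, 0, 0],   # exact match
              [1, 0, 0, 0]]   # too small
loss, index = multimask_loss(candidates, gt)
# loss == 0.0, index == 1
```

Because the other candidates receive no penalty for differing from this ground truth, they remain free to represent alternative interpretations of the same click.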

Occlusion Handling

The model includes an occlusion prediction head that is trained to predict whether the target object is visible on each frame. This is critical for video: objects frequently disappear behind other objects, leave the camera's field of view, or are otherwise temporarily invisible. The model must learn to output "not present" rather than hallucinating a mask where none should exist.

Training ablation: Each data engine phase produced measurably better training data. On the SA-V val set, accuracy (J&F) went from 50.0 (VOS+SA-1B only) to 53.0 (+Phase 1) to 58.8 (+Phase 2) to 62.5 (+Phase 3) to 63.2 (+Auto). On 9 zero-shot benchmarks, accuracy climbed from 62.5 to 71.5. More and better data consistently improved the model.

How does SAM 2 simulate interactive prompting during training?

It samples 8-frame sequences, randomly prompts up to 2 frames (with masks, clicks, or boxes), and simulates corrective clicks by comparing predictions to ground truth
It uses pre-recorded human annotation sessions as training data
It trains on single frames only, then fine-tunes on video

Chapter 7: Results

SAM 2 was evaluated across a massive range of benchmarks: 17 video segmentation datasets and 37 image segmentation datasets. The results are comprehensive and consistently strong.

Interactive Video Segmentation

The headline result: SAM 2 achieves better segmentation accuracy with 3x fewer interactions than the best SAM + tracker baselines (SAM+XMem++ and SAM+Cutie). In both offline evaluation (multiple passes through the video) and online evaluation (single forward pass), SAM 2 dominates across all 9 test datasets.

This is the payoff of the unified architecture — memory-based refinement is fundamentally more efficient than the restart-from-scratch approach of a decoupled pipeline.

Semi-Supervised VOS (First-Frame Mask)

Even when prompts are given only on the first frame, as in the classical VOS setting, SAM 2 outperforms the SAM + tracker baselines. Average J&F on 17 zero-shot video datasets, by type of first-frame prompt:

1-click prompting: SAM 2 scores 64.7 J&F vs. 56.9 (SAM+XMem++) and 56.7 (SAM+Cutie)
3-click prompting: 75.3 vs. 68.4 / 70.1
Ground-truth mask: 79.3 vs. 72.7 / 74.1

SAM 2 outperforms dedicated VOS models at their own task, even though VOS is just a special case of SAM 2's more general PVS task.

Image Segmentation

On SAM's original 23 image benchmarks, SAM 2 achieves higher accuracy and is 6x faster than SAM. With 1-click prompting: 58.9 mIoU vs. SAM's 58.1 mIoU. This improvement comes largely from the Hiera image encoder, which is smaller but more effective than SAM's ViT-H encoder.

When trained on the full SA-1B + video data mix, accuracy jumps further to 61.4 mIoU — showing that video training data actually helps image segmentation performance.

State-of-the-Art VOS Comparison

Against dedicated state-of-the-art VOS methods on established benchmarks (DAVIS, YouTube-VOS, MOSE, LVOS, etc.), SAM 2 with Hiera-L achieves significant improvements across the board while running at 30.2 FPS on a single A100 GPU — real-time speed.

The surprising finding: Video training data improves image segmentation. SAM 2 trained on both images and videos outperforms SAM trained on images alone. The temporal diversity in video data exposes the model to more object appearances, viewpoints, and deformations than static image datasets can provide.

How does SAM 2's image segmentation accuracy compare to the original SAM?

SAM 2 is more accurate (58.9 vs 58.1 mIoU with 1 click) AND 6x faster, thanks to the more efficient Hiera encoder, and video training data further boosts image performance
SAM 2 sacrifices image accuracy for video capability
They are exactly the same on images

Chapter 8: Real-Time Performance

A segmentation model that takes minutes per frame is useful for research but useless for real applications. SAM 2 was designed from the ground up for real-time video processing.

Speed Benchmarks

On a single NVIDIA A100 GPU (batch size 1):

SAM 2 (Hiera-B+): 43.8 FPS — well above real-time (30 FPS)
SAM 2 (Hiera-L): 30.2 FPS — at real-time, with higher accuracy
Original SAM (ViT-H): roughly 6x slower than SAM 2 on image segmentation
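To relate these frame rates to a real-time budget, convert FPS to milliseconds per frame. The 30 FPS target is the real-time threshold used above; the rest is arithmetic.

```python
fps = {"Hiera-B+": 43.8, "Hiera-L": 30.2}

# Time spent per frame, in milliseconds, rounded to one decimal place.
ms_per_frame = {name: round(1000.0 / rate, 1) for name, rate in fps.items()}
budget_ms = round(1000.0 / 30, 1)  # time available per frame at 30 FPS

within_budget = {name: ms <= budget_ms for name, ms in ms_per_frame.items()}
# ms_per_frame == {"Hiera-B+": 22.8, "Hiera-L": 33.1}, budget_ms == 33.3
```

Hiera-B+ leaves roughly 10 ms of headroom per frame, while Hiera-L sits just inside the budget.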

Why Is It Fast?

Several design choices make SAM 2 efficient:

Streaming architecture. The model processes one frame at a time with constant memory overhead. No need to load the entire video into GPU memory. No quadratic attention over all frames.

Hiera image encoder. SAM used ViT-H (632M parameters). SAM 2 uses Hiera, which is hierarchical and pre-trained with MAE, achieving better accuracy with fewer parameters and FLOPs. The Hiera-B+ variant is particularly efficient.

Efficient memory attention. SAM 2 uses vanilla attention operations for both self-attention and cross-attention, allowing it to leverage highly optimized attention kernels like FlashAttention (Dao, 2023). No custom attention patterns, no sparse attention tricks — just fast, well-optimized dense attention.

Lightweight memory encoder. Converting a prediction into a memory is cheap: a downsampling convolution, an element-wise addition, and a few convolutional layers. This is a tiny fraction of the total computation.

FIFO memory management. The memory bank has a fixed maximum size (N recent + M prompted frames). This means the cost of memory attention is bounded regardless of video length. A 10-second video costs the same per frame as a 10-hour video.
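A quick simulation of that bound: count how many memories each frame would cross-attend to, for a short and a long video. The queue size and the single prompted frame are arbitrary values for the example, and the function name is hypothetical.

```python
from collections import deque

def attended_memories_per_frame(num_frames, max_recent, num_prompted):
    recent = deque(maxlen=max_recent)
    counts = []
    for t in range(num_frames):
        counts.append(len(recent) + num_prompted)  # memories visible at frame t
        recent.append(t)                           # evicts the oldest once full
    return counts

short_video = attended_memories_per_frame(300, max_recent=6, num_prompted=1)
long_video = attended_memories_per_frame(30_000, max_recent=6, num_prompted=1)
# max(short_video) == max(long_video) == 7
```

The count ramps up over the first few frames and then stays flat, so the memory attention cost per frame does not depend on how long the video runs.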

The speed-accuracy tradeoff: SAM 2 offers two main operating points: Hiera-B+ for speed (43.8 FPS, good accuracy) and Hiera-L for accuracy (30.2 FPS, best accuracy). Both run at or above real-time on a single GPU, making SAM 2 practical for interactive video editing, robotics, and AR/VR applications.

Why does SAM 2's per-frame cost stay constant regardless of video length?

The FIFO memory bank has a fixed maximum size: old memories are evicted, so memory attention cost is bounded whether the video is 10 seconds or 10 hours
It uses sparse attention that scales sub-linearly
Longer videos are processed at lower resolution

Chapter 9: Connections

SAM 2 sits at the intersection of several research threads. Understanding where it came from — and what it enables — helps you see the bigger picture.

SAM (Segment Anything Model)

SAM 2's direct predecessor. SAM introduced promptable image segmentation with the SA-1B dataset (1 billion masks). SAM 2 generalizes SAM to video by adding memory and a streaming architecture. When the memory bank is empty, SAM 2 reduces to SAM. The model inherits SAM's prompt encoder and mask decoder designs almost identically.

XMem / XMem++

XMem introduced multi-store memory architectures for video object segmentation — separating sensory memory (very recent frames), working memory (recent context), and long-term memory (persistent features). SAM 2's memory bank echoes this idea with its split between recent frame memories and prompted frame memories, though with a simpler FIFO management strategy instead of XMem's more complex memory consolidation.

Cutie

A state-of-the-art VOS model that SAM 2 outperforms. Cutie uses object-level memory for compact representation of the target. SAM 2's object pointers serve a similar purpose — lightweight vectors summarizing the target object's identity. The key difference is that SAM 2 integrates prompting and tracking into a single model, while Cutie requires a separate module (like SAM) for initial segmentation.

Video Object Segmentation (VOS)

The classical VOS task — given a mask on the first frame, track it through the video — is a special case of SAM 2's Promptable Visual Segmentation (PVS) task. Early VOS methods used online fine-tuning (slow), then shifted to memory-based approaches with transformers. SAM 2 subsumes VOS and extends it to interactive, multi-frame prompting.

Interactive Video Segmentation (iVOS)

Previous iVOS approaches often used a modular pipeline: SAM for spatial segmentation, a separate tracker for temporal propagation. SAM 2 replaces this with a single end-to-end model. The unified memory means corrections are far cheaper — one click instead of re-segmenting from scratch.

Data Engines

The model-in-the-loop data engine follows SAM's approach: use the model to help annotators, use the annotations to improve the model, iterate. This flywheel has become a standard strategy for building foundation models at scale. SAM 2's three-phase engine reduced per-frame annotation time from 37.8s to 4.5s while maintaining quality.

Where SAM 2 leads: SAM 2's promptable video segmentation opens the door to video-grounded language models, robotic manipulation (segment and track objects to grasp), video editing (select and modify objects across frames), medical video analysis (track anatomy across ultrasound or surgical video), and embodied AI (segment the world as you move through it). The open-source release of model, data, and code is designed to catalyze all of these directions.

Paper

SAM 2: Segment Anything in Images and Videos (Ravi et al., 2024)

Architecture

Streaming memory: Hiera encoder → memory attention → mask decoder → memory encoder → memory bank

Key Numbers

50.9K videos, 642.6K masklets, 35.5M masks. 6x faster than SAM. 3x fewer interactions than SAM+tracker. 43.8 FPS real-time.

Impact

Foundation model for video segmentation — open-sourced model, data (CC BY 4.0), and training code (Apache 2.0).

How does SAM 2 relate to the original SAM?

SAM 2 is a strict superset: it adds streaming memory for video, but when the memory bank is empty (single image), it reduces to SAM with the same prompt encoder and mask decoder
SAM 2 is a completely different architecture with no shared components
SAM 2 replaces SAM's decoder with a video-specific decoder