Li, Li, Wang, He, Wang, Wang, Qiao — OpenGVLab / Shanghai AI Lab, 2024

VideoMamba for Efficient Video Understanding

Apply Mamba's selective state space model to video: linear-complexity scanning replaces quadratic attention, handling long videos where transformers run out of memory.

Prerequisites: Vision Transformers (ViT) + State Space Models (Mamba) + Video classification basics

Chapter 0: The Problem

You want a model that watches a 2-minute cooking video and tells you what dish is being prepared. The video is 30 fps, so that is 3,600 frames. Even if you sample every 4th frame, you still have 900 frames. Each frame, at 224×224 with patch size 16, produces 196 spatial tokens. Multiply: 900 × 196 = 176,400 tokens.

Now try feeding 176,400 tokens into a transformer. Self-attention computes a score between every pair of tokens. That is 176,400² ≈ 31 billion dot products. Per layer. Your GPU does not have enough memory to even store the attention matrix, let alone compute it.

The quadratic wall: A video with T frames and N spatial tokens per frame produces L = T · N total tokens. Full spatiotemporal attention costs $O(L^2) = O(T^2 \cdot N^2)$ in memory and compute. Double the video length and cost quadruples. This is why video transformers either use very few frames (8–16) or resort to tricks like divided attention that sacrifice joint spatiotemporal reasoning.
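The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the helper names are illustrative, and the 2-bytes-per-score figure assumes fp16 storage, which the text does not specify.

```python
def video_tokens(num_frames: int, height: int, width: int, patch: int = 16) -> int:
    """Total tokens L = T * (H / patch) * (W / patch) for non-overlapping patches."""
    return num_frames * (height // patch) * (width // patch)


def attention_matrix_gib(num_tokens: int, bytes_per_score: int = 2) -> float:
    """Memory for one L x L score matrix (one head, one layer), in GiB.

    bytes_per_score=2 is an assumption (fp16 scores).
    """
    return num_tokens ** 2 * bytes_per_score / 2 ** 30


# The cooking-video example: 2 minutes at 30 fps, keeping every 4th frame.
frames = (2 * 60 * 30) // 4
tokens = video_tokens(frames, 224, 224)
pairs = tokens ** 2

print(frames, tokens, pairs)
print(f"{attention_matrix_gib(tokens):.1f} GiB per head per layer")
```

Even a single score matrix at this length is far larger than the memory of a typical GPU, before counting heads, layers, or activations.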

Existing approaches handle this in three ways, each with a drawback:

What if there were an operator that processes the entire sequence — all L tokens, jointly — in linear time? That is exactly what state space models offer.

Attention vs. SSM: Scaling Cost

[Interactive figure: memory cost versus total tokens L. Attention grows quadratically, the SSM grows linearly, and the gap widens sharply past roughly 10K tokens.]
Why can't standard video transformers process 64-frame, high-resolution videos end-to-end?

Chapter 1: The Key Insight

State space models process a sequence of L tokens in $O(L)$ time and memory. Not $O(L^2)$. Not $O(L \log L)$. Just $O(L)$. If you double the video length, cost doubles rather than quadruples.

The insight behind VideoMamba is disarmingly simple: take the Mamba SSM, which was designed for 1D language sequences, and apply it to the flattened spatiotemporal token sequence of a video.

Step 1: Patchify
Slice the video into 3D tubelets (small cubes across space and time). Each tubelet becomes a token. A video of T frames at H×W with patch size 16 yields L = T × (H/16) × (W/16) tokens.
Step 2: Flatten
Arrange all L tokens into a single 1D sequence using a spatial-first scan order: all spatial tokens of frame 1, then all spatial tokens of frame 2, and so on.
Step 3: Scan with Mamba
Run a bidirectional Mamba block over the full sequence. Forward scan sees past context; backward scan sees future context. Both are O(L).
Step 4: Classify
A CLS token aggregates information from the entire video. A linear head maps it to class logits.
Why this matters: VideoMamba processes 64-frame videos at 384×384 resolution where TimeSformer (a video transformer) runs out of memory. At 64 frames, VideoMamba runs 6× faster and uses 40× less GPU memory than TimeSformer. The linear scaling means you can keep feeding in more frames without hitting a memory wall.

There is one important subtlety. The original Mamba is causal: the output at token t depends only on tokens 1 through t, never on later ones. This makes sense for language (you predict the next word from previous words). But for video classification, every frame matters equally. You want frame 50 to know about both frame 10 and frame 90. So VideoMamba uses a bidirectional variant: it runs Mamba forward and backward, then sums the outputs. We will examine this in Chapter 4.

Concept → Realization: The concept is "SSMs give linear-time sequence modeling." The realization is "flatten video into a 1D sequence of spatiotemporal tokens, apply bidirectional Mamba, and get a linear-cost video backbone that matches or beats quadratic-cost transformers."
What is the key property of SSMs that makes them suitable for long video understanding?

Chapter 2: Mamba Recap

Before we see how VideoMamba applies Mamba to video, let's make sure the Mamba mechanism itself is clear. State Space Models (SSMs) are continuous systems that map an input signal x(t) to an output y(t) through a hidden state h(t):

h'(t) = A h(t) + B x(t)
y(t) = C h(t)

Here A is the evolution matrix (how the hidden state evolves on its own), B controls how the input gets injected, and C controls how we read out the output. Think of it as a differential equation: the hidden state is a dynamical system that gets "pushed" by the input and "read" by the output.

Discretization: From Continuous to Tokens

We don't have a continuous signal; we have discrete tokens $x_1, x_2, \ldots, x_L$. So we discretize the continuous ODE using a time-step parameter Δ:

$\bar{A} = \exp(\Delta A)$
$\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$

Then the discrete recurrence is:

$h_t = \bar{A} h_{t-1} + \bar{B} x_t$
$y_t = C h_t$

This is just a linear recurrence. Processing L tokens takes L steps — O(L) time. No pairwise comparisons needed.
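To make the recurrence concrete, here is a minimal sketch of the discretization and scan for a single input channel. It assumes a diagonal A (so each state entry is a scalar and the matrix inverse becomes a division); the function names are illustrative.

```python
import math


def discretize(a, b, delta):
    """Zero-order-hold discretization for a diagonal A.

    a, b: lists of per-state scalars (a must be non-zero).
    Returns (a_bar, b_bar) with a_bar = exp(delta * a) and
    b_bar = (delta * a)^-1 * (exp(delta * a) - 1) * delta * b.
    """
    a_bar = [math.exp(delta * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]
    return a_bar, b_bar


def ssm_scan(xs, a, b, c, delta):
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t, y_t = c . h_t over a sequence."""
    a_bar, b_bar = discretize(a, b, delta)
    h = [0.0] * len(a)
    ys = []
    for x in xs:  # one step per token: O(L) overall
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# One-dimensional state with a = -1: an impulse decays by exp(-delta) per step.
ys = ssm_scan([1.0, 0.0, 0.0], a=[-1.0], b=[1.0], c=[1.0], delta=0.5)
print(ys)
```

The loop touches each token once and carries only the state `h` forward, which is where the linear cost comes from.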

The "Selective" in Selective State Space

Traditional SSMs use fixed A, B, C: the same parameters for every input. Mamba's key innovation is making B, C, and Δ input-dependent. For each token $x_t$, a linear projection produces $B_t$, $C_t$, and $\Delta_t$ specific to that token. This is the Selective Scan Mechanism (S6).

Why selectivity matters: With fixed parameters, the SSM must use the same "memory policy" for every token. With input-dependent parameters, it can choose: "This token is important — large Δ, let it strongly update the hidden state" vs. "This token is noise — small Δ, barely update." This is analogous to how attention selectively focuses on relevant tokens, but achieved through gating rather than pairwise comparison.
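A toy version of this selectivity, for a single scalar channel, might look like the sketch below. The projection weights (`w_b`, `w_c`, `w_delta`, `bias_delta`) are hypothetical stand-ins for Mamba's learned linear projections, the softplus on Δ is an assumption of this sketch, and A is again taken to be diagonal.

```python
import math


def step_size(x, w_delta, bias_delta):
    """Input-dependent step size; softplus keeps it positive."""
    return math.log1p(math.exp(w_delta * x + bias_delta))


def selective_scan(xs, a, w_b, w_c, w_delta, bias_delta):
    """SSM scan where B, C and delta are recomputed from each token."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        delta = step_size(x, w_delta, bias_delta)
        b_t = [w * x for w in w_b]  # token-specific input matrix
        c_t = [w * x for w in w_c]  # token-specific readout
        a_bar = [math.exp(delta * ai) for ai in a]
        b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b_t)]
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c_t, h)))
    return ys


ys = selective_scan([1.0, 0.0, 2.0], a=[-1.0, -2.0],
                    w_b=[1.0, 0.5], w_c=[1.0, 1.0],
                    w_delta=1.0, bias_delta=0.0)
print(ys)
```

With these toy weights, a larger token value produces a larger step size, so it overwrites more of the old state; a zero token writes nothing and reads nothing.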

The Mamba Block

A single Mamba block wraps the SSM in a gated architecture:

  1. Input x goes through a linear projection, expanding the channel dimension by 2×
  2. One branch passes through a 1D convolution (local context), then the SSM
  3. The other branch passes through SiLU activation (gating)
  4. The two branches are multiplied element-wise, then projected back down

The 1D convolution before the SSM provides local context (like a small attention window), while the SSM captures long-range dependencies. The multiplicative gating lets the network control information flow.

Complexity comparison: Self-attention over L tokens: $O(L^2 \cdot D)$ where D is the embedding dimension. Mamba over L tokens: $O(L \cdot D \cdot N)$ where N is the state dimension (typically 16). Since $N \ll L$ for long sequences, Mamba is vastly cheaper.
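As a worked instance of this comparison, take a 64-frame clip at 224×224 (L = 64 × 196 = 12,544 tokens) with the default state dimension N = 16. Constant factors are ignored, so this is an order-of-magnitude sketch rather than a measured speedup:

```latex
\frac{\text{attention cost}}{\text{Mamba cost}}
  \approx \frac{L^2 \cdot D}{L \cdot D \cdot N}
  = \frac{L}{N}
  = \frac{12{,}544}{16}
  = 784
```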
What makes Mamba's SSM "selective" compared to traditional SSMs?

Chapter 3: Video Tokenization

To feed a video into Mamba, we first need to convert it into a sequence of tokens. VideoMamba uses tubelet embedding — a 3D convolution that carves the video into small spatiotemporal cubes.

The Input

A video clip is a 4D tensor: $X_v \in \mathbb{R}^{3 \times T \times H \times W}$, where 3 is RGB channels, T is the number of frames, and H × W is the spatial resolution. For example, a 16-frame clip at 224×224: shape [3, 16, 224, 224].

Tubelet Embedding

A single 3D convolution with kernel size 1×16×16 and stride 1×16×16 converts the video into non-overlapping patches:

$X_p = \mathrm{Conv3D}(X_v) \in \mathbb{R}^{L \times C}$

where L = t × h × w, with t = T (temporal kernel is 1, so every frame is preserved), h = H/16, and w = W/16. For our 16-frame, 224×224 example: L = 16 × 14 × 14 = 3,136 tokens, each a C-dimensional vector.

Why kernel 1×16×16? The temporal kernel is 1 — no temporal downsampling. Each frame independently produces h×w = 14×14 = 196 spatial tokens. This preserves full temporal resolution. Spatial downsampling is 16×, matching ViT's standard patch size. The result is identical to applying ViT's 2D patch embedding to each frame independently.
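The patch extraction can be sketched without any deep-learning library. A Conv3D whose stride equals its kernel is a linear projection of each flattened tubelet, so the sketch below stops at the flattened tubelets and omits the learned projection. The nested-list layout `video[c][t][y][x]` and the function names are illustrative assumptions.

```python
def token_count(frames, height, width, patch=16):
    """L = t * h * w with t = T (temporal kernel 1), h = H / patch, w = W / patch."""
    return frames * (height // patch) * (width // patch)


def tubelet_tokens(video, patch):
    """Split video[c][t][y][x] into non-overlapping 1 x patch x patch tubelets.

    Assumes height and width are multiples of patch. Tokens are returned in
    frame, row, column order; each token is the flattened patch values.
    """
    channels = len(video)
    frames = len(video[0])
    height = len(video[0][0])
    width = len(video[0][0][0])
    tokens = []
    for t in range(frames):
        for py in range(0, height, patch):
            for px in range(0, width, patch):
                tokens.append([video[c][t][y][x]
                               for c in range(channels)
                               for y in range(py, py + patch)
                               for x in range(px, px + patch)])
    return tokens


# Tiny example: 1 channel, 2 frames, 4x4 pixels, patch size 2.
video = [[[[t * 100 + y * 10 + x for x in range(4)]
           for y in range(4)] for t in range(2)]]
tokens = tubelet_tokens(video, patch=2)
print(len(tokens), tokens[0], tokens[4])
print(token_count(16, 224, 224))
```

Because the temporal kernel is 1, no token mixes pixels from two frames, which is why the result matches per-frame 2D patch embedding.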

Position Embeddings

SSMs are position-sensitive — the recurrence naturally encodes ordering. But VideoMamba still adds explicit position embeddings to help the model distinguish spatial locations and temporal positions:

$X = [X_{cls}, X_p] + p_s + p_t$

A learnable CLS token $X_{cls}$ is prepended, just like in ViT, and $p_s$ and $p_t$ are the learnable spatial and temporal position embeddings. After all layers, the CLS token's representation is used for classification.
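One way to picture how the two embeddings are indexed is to list, for every slot in the sequence, which spatial and temporal entry it would receive. The sketch below assumes the CLS token takes the first spatial entry and no temporal entry; that detail is an assumption of this example and is not stated in the text above.

```python
def position_indices(t, h, w):
    """Return (spatial_index, temporal_index) for each slot of [CLS, tokens].

    Slot 0 is the CLS token: spatial index 0 and no temporal index (None).
    Patch tokens share a spatial index across frames and a temporal index
    within a frame, so both embeddings are reused by broadcasting.
    """
    slots = [(0, None)]
    for frame in range(t):
        for s in range(h * w):
            slots.append((1 + s, frame))
    return slots


slots = position_indices(t=2, h=2, w=2)
print(slots)
```

The same spatial index recurs once per frame, which is why a spatial embedding learned on single images can be reused for every frame of a clip.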

Tubelet Tokenization

[Interactive figure: token count as a function of the number of frames T and the resolution. The 3D conv carves the video into non-overlapping tubelets.]
A 32-frame video at 384×384 with patch size 16 produces how many tokens?

Chapter 4: Bidirectional Mamba for Video

This is the core technical contribution. The original Mamba is causal: it scans left-to-right, and each token only receives context from itself and earlier tokens. This is perfect for autoregressive language modeling (predict the next word), but video classification needs global context: every token should know about every other token, regardless of position.

Bidirectional Mamba (B-Mamba)

VideoMamba uses the bidirectional Mamba block from Vision Mamba. The idea is simple:

  1. Forward scan: run the SSM from token 1 to token L. Token t sees context from tokens 1..t.
  2. Backward scan: run a separate SSM from token L to token 1. Token t sees context from tokens t..L.
  3. Sum: add the forward and backward outputs element-wise.

$y_t = \mathrm{SSM}_{\mathrm{fwd}}(x_{1:t}) + \mathrm{SSM}_{\mathrm{bwd}}(x_{t:L})$

Now every token has context from the entire sequence, in both directions. And both scans are O(L), so the total cost is still O(L).
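The forward, backward, and sum steps can be sketched with a deliberately simple stand-in for the SSM: a scalar linear recurrence with a fixed decay and gain. The real block uses the selective scan with separate learned parameters per direction; the `(decay, gain)` pairs here are placeholders.

```python
def linear_scan(xs, decay, gain):
    """Causal scan: h_t = decay * h_{t-1} + gain * x_t, output h_t."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out


def bidirectional_scan(xs, fwd=(0.5, 1.0), bwd=(0.5, 1.0)):
    """Sum of a left-to-right scan and a right-to-left scan."""
    forward = linear_scan(xs, *fwd)
    # Reverse the input, scan, then reverse back so index t lines up again.
    backward = linear_scan(xs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
print(linear_scan(impulse, 0.5, 1.0))
print(bidirectional_scan(impulse))
```

With the causal scan alone, the two positions before the impulse never see it; after summing both directions, every position carries some trace of it, and the total work is still two passes over the sequence.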

Scan Order: How to Flatten 3D into 1D

A video has three dimensions: width, height, and time. To feed it into a 1D SSM, we must choose a scan order. The authors tested four strategies:

| Scan Type | Order | SSv2 Acc |
|---|---|---|
| Spatial-First (SF) | All spatial tokens of frame 1, then frame 2, ... | 65.1% |
| Temporal-First (TF) | Token (0,0) across all frames, then (0,1), ... | 62.4% |
| Spatiotemporal v1 | Half layers SF, half layers TF | 63.9% |
| Spatiotemporal v2 | Full SF + full TF (2× compute) | 64.2% |

Spatial-First wins: Scanning frame-by-frame (spatial-first) outperforms all alternatives. Why? Because it aligns with the ImageNet-pretrained 2D Mamba: the model already knows how to scan a single image's spatial tokens. Stacking frames in sequence is the most natural extension — each frame's spatial scan is familiar, and the temporal transitions happen at frame boundaries.
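The difference between the two basic scan orders is only the nesting of the loops that flatten the (frame, row, column) grid. A minimal sketch, with illustrative function names:

```python
def spatial_first(frames, height, width):
    """All tokens of frame 0 (row by row), then all tokens of frame 1, ..."""
    return [(t, y, x)
            for t in range(frames)
            for y in range(height)
            for x in range(width)]


def temporal_first(frames, height, width):
    """One spatial location across all frames, then the next location, ..."""
    return [(t, y, x)
            for y in range(height)
            for x in range(width)
            for t in range(frames)]


sf = spatial_first(2, 1, 2)
tf = temporal_first(2, 1, 2)
print(sf)
print(tf)
```

In the spatial-first order, the first height × width entries are exactly the scan of a single image, which is the ordering an image-pretrained model has already seen.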

Why Not Just Use Divided Attention?

TimeSformer and ViViT use divided attention: spatial attention within each frame, then temporal attention across frames. This reduces $O(T^2 \cdot N^2)$ to $O(T \cdot N^2 + T^2 \cdot N)$, but it is still quadratic in both T and N separately. VideoMamba's B-Mamba is $O(T \cdot N)$, linear in the total token count. And it processes spatial and temporal information jointly, not in separate passes.
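Plugging numbers into these three expressions shows how far apart they are. The sketch below counts only the leading terms (pairwise scores for attention, scan steps for the SSM) and ignores the embedding and state dimensions, so the values are relative sizes, not FLOPs.

```python
def pairwise_costs(frames, tokens_per_frame):
    """Leading cost terms for T frames with N spatial tokens each."""
    t, n = frames, tokens_per_frame
    return {
        "joint_attention": (t * n) ** 2,              # T^2 * N^2
        "divided_attention": t * n ** 2 + n * t ** 2,  # T*N^2 + T^2*N
        "linear_scan": t * n,                          # T * N
    }


costs = pairwise_costs(frames=64, tokens_per_frame=196)
for name, value in costs.items():
    print(f"{name:>18}: {value:,}")
```

At 64 frames and 224×224, divided attention is roughly 48× cheaper than joint attention in this count, yet still about 260× the number of steps in a linear scan.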

Attention vs. Mamba: Memory Scaling

[Interactive figure: GPU memory usage versus frames and spatial resolution for joint attention, divided attention, and VideoMamba's B-Mamba. The joint-attention curve explodes while the B-Mamba curve stays flat by comparison.]

Concrete numbers from the paper: At 8 frames, 224×224, TimeSformer uses ~2× more memory than VideoMamba. At 64 frames, 224×224: TimeSformer uses 40× more GPU memory and runs 6× slower. The gap grows with sequence length because attention is $O(L^2)$ and Mamba is $O(L)$.
Why does VideoMamba use spatial-first scan order rather than temporal-first?

Chapter 5: The Architecture

VideoMamba deliberately follows the vanilla ViT design as closely as possible. No downsampling layers, no hierarchical features, no window attention. Just a stack of identical blocks. This is called an isotropic architecture.

The Full Pipeline

  1. 3D Patch Embedding: Conv3D (1×16×16) maps the video to L tokens of dimension C
  2. CLS Token + Position Embeddings: prepend CLS, add spatial + temporal embeddings
  3. L × B-Mamba Blocks (here L is the depth listed in the table below, not the token count): each block is normalization → bidirectional Mamba → residual
  4. Classification Head: layer norm on CLS token → linear projection to class logits

Model Variants

| Model | Depth (L) | Embed Dim (C) | Parameters |
|---|---|---|---|
| VideoMamba-Ti | 24 | 192 | 7M |
| VideoMamba-S | 24 | 384 | 26M |
| VideoMamba-M | 32 | 576 | 74M |

The SSM uses default Mamba hyperparameters: state dimension N = 16, expansion ratio 2. VideoMamba-Ti has only 7 million parameters — dramatically smaller than TimeSformer-L's 121M or ViViT-L's 311M.

Self-Distillation for Scaling

A problem emerged during experiments: larger VideoMamba models overfit. VideoMamba-Base (98M parameters) performed worse than VideoMamba-S (26M). The same issue was observed in VMamba. The solution is self-distillation:

  1. Train a smaller model (VideoMamba-S) to convergence — it generalizes well
  2. Use it as a "teacher" to train the larger model (VideoMamba-M)
  3. Align the student's final feature map to the teacher's via L2 loss
Why does Mamba overfit more than transformers? The authors hypothesize that Mamba's selective scan, with its input-dependent gating, gives the model more capacity to memorize training data. The teacher's feature map acts as a soft target that regularizes the student — it can't just memorize, it must produce features similar to a well-generalizing smaller model. This is simple, cheap (just an L2 loss on the last layer), and effective: VideoMamba-M goes from underperforming to SOTA with self-distillation.
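The alignment term itself is small enough to write out. This sketch assumes the student and teacher features already have the same shape (for example after a projection, which the text does not specify) and that the teacher is frozen; the `weight` parameter and the way the term is added to the task loss are illustrative assumptions.

```python
def l2_alignment_loss(student_feats, teacher_feats):
    """Mean squared difference over all tokens and channels.

    Both arguments are lists of per-token feature lists with equal shapes.
    The teacher features are treated as constants (no gradient).
    """
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_feats, teacher_feats):
        for s, t in zip(s_tok, t_tok):
            total += (s - t) ** 2
            count += 1
    return total / count


def distillation_objective(task_loss, student_feats, teacher_feats, weight=1.0):
    """Task loss plus a weighted feature-alignment penalty (weight is hypothetical)."""
    return task_loss + weight * l2_alignment_loss(student_feats, teacher_feats)


student = [[1.0, 2.0], [3.0, 4.0]]
teacher = [[1.0, 0.0], [3.0, 2.0]]
print(l2_alignment_loss(student, teacher))
print(distillation_objective(0.5, student, teacher, weight=0.1))
```

The penalty is zero only when the larger model reproduces the smaller model's features exactly, which is what pulls it away from solutions that merely memorize the training set.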
Comparison to ViT: VideoMamba strictly follows the isotropic ViT design — no downsampling layers (unlike Video Swin), no depthwise convolutions (unlike VMamba), no middle CLS token or RoPE (unlike Vision Mamba). This simplicity makes it a clean, fair comparison: the only difference from ViT is replacing self-attention with B-Mamba.
VideoMamba Block Diagram

[Figure: the full pipeline from raw video to class prediction. Each B-Mamba block runs forward and backward scans in parallel.]

Why does VideoMamba use self-distillation when scaling to larger models?

Chapter 6: Training

VideoMamba uses a two-stage training pipeline: pretrain on images, then fine-tune on video.

Stage 1: ImageNet Pretraining

All models are first trained on ImageNet-1K (1.28M images, 1000 classes) for image classification. This provides a strong spatial feature extractor before any video data is seen.

ImageNet results: VideoMamba-M achieves 84.0% top-1 at 576×576 resolution with 74M params. This beats DeiT-B (81.8%, 87M params) and is competitive with Swin-B (83.5%, 88M params) — all without hierarchical features or downsampling layers.

Stage 2: Video Fine-tuning

The ImageNet-pretrained model is fine-tuned on video datasets. The key trick: the 2D spatial position embeddings transfer directly because VideoMamba's 3D patch embedding with temporal kernel 1 produces the same spatial tokens as 2D ViT. Only the temporal position embeddings are new.

Stage 2b: Masked Pretraining (Optional)

For even better performance, VideoMamba can be pretrained with masked alignment inspired by UMT. This masks 80% of video tokens and aligns the unmasked tokens with CLIP-ViT features.

A key finding: the masking strategy matters for Mamba. Because the B-Mamba block includes a 1D convolution before the SSM, it prefers contiguous unmasked tokens. Random masking (which works well for transformers) disrupts the 1D conv's local receptive field.

| Masking Strategy | SSv2 Accuracy |
|---|---|
| Random | 67.4% |
| Tube | 66.3% |
| Clip-Row | 68.2% |
| Frame-Row | 67.8% |
| Attention masking | 68.5% |

Row masking for Mamba: Row masking masks entire rows of the spatial grid, keeping tokens within a row contiguous. This preserves the 1D conv's local receptive field while still removing substantial information. Attention masking goes further: it masks tokens but preserves adjacency structure, letting the 1D conv see meaningful local context. Both are novel to VideoMamba's architecture.
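The contiguity argument can be checked on a single 14×14 frame. The sketch below generates a random mask and a row mask and measures the lengths of the visible runs in the flattened sequence. It covers only the single-frame case; the exact clip-row and frame-row variants from the table are not reproduced here, and the function names are illustrative.

```python
import random


def random_mask(h, w, mask_ratio, rng):
    """Mask individual tokens uniformly at random. True means masked."""
    n = h * w
    masked = set(rng.sample(range(n), round(n * mask_ratio)))
    return [i in masked for i in range(n)]


def row_mask(h, w, mask_ratio, rng):
    """Mask whole rows of the spatial grid. True means masked."""
    masked_rows = set(rng.sample(range(h), round(h * mask_ratio)))
    return [(i // w) in masked_rows for i in range(h * w)]


def visible_runs(mask):
    """Lengths of consecutive unmasked stretches in the flattened sequence."""
    runs, current = [], 0
    for is_masked in mask:
        if is_masked:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


rng = random.Random(0)
rand = random_mask(14, 14, 0.8, rng)
rows = row_mask(14, 14, 0.8, rng)
print("random runs:", visible_runs(rand))
print("row runs:   ", visible_runs(rows))
```

With row masking, every visible run is at least one full row (14 tokens) long, so a short 1D convolution always has real neighbors to look at; with random masking at the same ratio, visible tokens are scattered and the runs are not guaranteed to be longer than a single token.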
Why does row masking work better than random masking for VideoMamba?

Chapter 7: Results

VideoMamba is evaluated against 3D CNNs, video transformers, and hybrid models across multiple benchmarks. The results show a consistent pattern: competitive or better accuracy at dramatically lower compute.

Kinetics-400 (Scene-Related Actions)

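| Model | Type | Frames | Params | FLOPs | Top-1 |
|---|---|---|---|---|---|
| TimeSformer-L | Trans. | 96 | 121M | 2380G | 80.7% |
| ViViT-L | Trans. | 16 | 311M | 3992G | 81.3% |
| UniFormer-B | CNN+Trans. | 32 | 50M | 259G | 83.0% |
| VideoMamba-M | SSM | 64 | 74M | 2368G | 83.3% |
| VideoMamba-M* | SSM+CLIP | 64 | 74M | 2368G | 85.0% |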

* With masked pretraining using CLIP-400M teacher.

Key takeaway for K400: VideoMamba-M at 64 frames matches UniFormer-B's accuracy while using an isotropic (non-hierarchical) architecture. With masked pretraining, it reaches 85.0% — approaching UMT's 85.7% with a fundamentally different (linear-complexity) backbone.

Something-Something V2 (Temporal-Sensitive Actions)

| Model | Frames | Params | Top-1 |
|---|---|---|---|
| TimeSformer-HR | 16 | 121M | 62.5% |
| ViViT-L | 16 | 311M | 65.4% |
| MViTv2-B | 32 | 51M | 70.5% |
| VideoMamba-M | 16 | 74M | 68.4% |
| VideoMamba-M* | 16 | 74M | 71.4% |

SSv2 requires understanding fine-grained temporal differences (e.g., "opening" vs "closing"). VideoMamba-M outperforms TimeSformer by +5.9% and ViViT-L by +3.0%. With masked pretraining, it surpasses even MViTv2-B.

Results Comparison: K400

[Figure: top-1 accuracy vs. parameter count on Kinetics-400 (supervised). VideoMamba achieves high accuracy with relatively few parameters.]

The parameter efficiency story: VideoMamba-Ti (7M params) achieves 80.3% on K400 with 64 frames, within 0.4 points of TimeSformer-L (121M params, 80.7%) while using about 17× fewer parameters. The SSM also compresses history into a fixed-size state vector rather than storing all pairwise attention scores, which is what keeps its memory cost linear in sequence length.
On Something-Something V2, VideoMamba-M outperforms TimeSformer-HR by how much?

Chapter 8: The Long Video Advantage

This is where VideoMamba truly shines. Long videos — cooking demonstrations, movies, procedural tasks — produce tens of thousands of tokens. Transformers either cannot process them at all or must rely on expensive pre-extracted features. VideoMamba handles them end-to-end.

Breakfast (Cooking Activities)

77 hours of cooking videos, 1,712 clips, 10 activity categories. Videos average several minutes long.

| Method | End-to-End | Backbone | Top-1 |
|---|---|---|---|
| ViS4mer | No | Swin-B features | 88.2% |
| Turbo (32 frames) | Yes | VideoMAE-B | 91.3% |
| VideoMamba-S (64f) | Yes | VideoMamba-S | 97.4% |
| VideoMamba-M* (64f) | Yes | VideoMamba-M | 97.9% |

Going by the table above, VideoMamba-M* is +9.7 points over ViS4mer and +6.6 points over Turbo; VideoMamba-S alone is already +6.1 points over Turbo. Feature-based methods (ViS4mer) extract features with a pretrained model frame-by-frame, losing fine-grained temporal information. VideoMamba's end-to-end processing preserves all spatiotemporal relationships across the full video length.

COIN (Procedural Tasks)

11,827 videos across 180 procedural tasks, averaging 2.36 minutes.

| Method | Top-1 |
|---|---|
| ViS4mer (Swin-B features) | 88.4% |
| Turbo (VideoMAE-B) | 87.5% |
| VideoMamba-M* (64f) | 90.4% |

LVU (Long-Form Video Understanding)

30K movie clips, 1–3 minutes each, with 9 diverse tasks: relationship prediction, scene classification, director identification, and more. Even tiny VideoMamba-Ti beats all previous methods on most tasks.

LVU highlights (VideoMamba-Ti vs previous SOTA ViS4mer): Relationship: 62.5% vs 57.1% (+5.4). Scene: 70.4% vs 67.4% (+3.0). Director: 67.3% vs 62.6% (+4.7). Writer: 52.98% vs 48.8% (+4.2). These are all end-to-end results with a 7M-parameter model, versus feature-based methods using 88M+ parameter backbones.

Why Does Linear Complexity Matter Here?

At 64 frames and 224×224, the total token count is 12,544. The attention matrix for a transformer would be 12,544 × 12,544 ≈ 157 million entries — per layer, per head. At 64 frames and 384×384, it is 36,864 tokens and 1.36 billion entries. TimeSformer literally runs out of memory.

VideoMamba's state vector is fixed-size (dimension N = 16) regardless of sequence length. Whether processing 1,000 or 100,000 tokens, the state carried from one token to the next stays the same size, and there is no L × L matrix to store. Token activations and the sequential scan cost still grow with L, but only linearly.

Throughput and Memory: VideoMamba vs TimeSformer

[Interactive figure: GPU memory usage and throughput versus frame count. Beyond 32 frames, TimeSformer's memory explodes while VideoMamba stays manageable.]

Why does VideoMamba achieve such large improvements over feature-based methods on long video benchmarks?

Chapter 9: Connections

VideoMamba sits at the intersection of state space models and video understanding. Let's map where it fits.

Relation to Mamba (S6)

VideoMamba is a direct application of the Mamba selective scan mechanism to video. The core SSM operator is unchanged — the contribution is showing it works for 3D spatiotemporal sequences, not just 1D text. The bidirectional extension comes from Vision Mamba (Vim).

Relation to Vision Mamba (Vim)

Vision Mamba applied bidirectional Mamba to 2D images. VideoMamba extends this to 3D video by: (a) using 3D patch embedding (tubelets), (b) adding temporal position embeddings, and (c) analyzing scan orders for spatiotemporal tokens. It also simplifies Vim by removing the middle CLS token and RoPE, gaining +0.8% on ImageNet.

Relation to TimeSformer / ViViT

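TimeSformer and ViViT are the attention-based video backbones that VideoMamba aims to replace. Both use divided spatiotemporal attention to reduce quadratic cost. VideoMamba achieves better accuracy with linear cost and fewer parameters: VideoMamba-M (74M) outperforms the TimeSformer (121M) and ViViT-L (311M) entries in the K400 and SSv2 tables, and the 7M-parameter VideoMamba-Ti comes within 0.4 points of TimeSformer-L on K400.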

Relation to VideoMAE

VideoMAE is a self-supervised pretraining method for video transformers using masked autoencoding. VideoMamba's masked alignment (which Chapter 6 traces to UMT) belongs to the same family of masked video pretraining, but requires different masking strategies (row/attention masking vs random/tube) due to Mamba's 1D conv sensitivity.

Cheat Sheet

| Aspect | VideoMamba |
|---|---|
| Input | Video clip [3, T, H, W] |
| Tokenization | 3D Conv (1×16×16) → tubelets |
| Core operator | Bidirectional Mamba (S6) |
| Scan order | Spatial-first (frame by frame) |
| Complexity | O(L) where L = T · (H/16) · (W/16) |
| Architecture | Isotropic (like ViT, no downsampling) |
| Sizes | Ti (7M), S (26M), M (74M) |
| Training | ImageNet-1K → K400/SSv2 fine-tune |
| Scaling trick | Self-distillation (S teaches M) |
| Masking trick | Row/attention masking (not random) |
| K400 accuracy | 83.3% supervised, 85.0% w/ CLIP |
| SSv2 accuracy | 68.4% supervised, 71.4% w/ CLIP |
| Long video | 97.9% Breakfast, 90.4% COIN |
| Efficiency | 6× faster, 40× less memory than TimeSformer at 64 frames |

The broader lesson: When your bottleneck is sequence length, replacing the quadratic operator (attention) with a linear one (SSM) is not just an optimization — it unlocks entirely new regimes. VideoMamba can process 64-frame high-res videos end-to-end where transformers cannot. This is a qualitative capability difference, not just a quantitative speedup.
What is the key architectural simplification VideoMamba makes over Vision Mamba (Vim)?