Apply Mamba's selective state space model to video: linear-complexity scanning replaces quadratic attention, handling long videos where transformers run out of memory.
You want a model that watches a 2-minute cooking video and tells you what dish is being prepared. The video is 30 fps, so that is 3,600 frames. Even if you sample every 4th frame, you still have 900 frames. Each frame, at 224×224 with patch size 16, produces 196 spatial tokens. Multiply: 900 × 196 = 176,400 tokens.
Now try feeding 176,400 tokens into a transformer. Self-attention computes a score between every pair of tokens. That is 176,400² ≈ 31 billion dot products. Per layer. Your GPU does not have enough memory to even store the attention matrix, let alone compute it.
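The arithmetic above is easy to check with a few lines of Python. This is only a worked example; the function name `video_token_count` is made up for illustration.

```python
def video_token_count(num_frames: int, height: int = 224, width: int = 224, patch: int = 16) -> int:
    """Tokens produced when every frame is cut into patch x patch tiles."""
    return num_frames * (height // patch) * (width // patch)

duration_s, fps, stride = 120, 30, 4
frames = duration_s * fps // stride      # 2-minute video, every 4th frame
tokens = video_token_count(frames)       # frames * 14 * 14
pairs = tokens ** 2                      # entries in one full attention matrix

print(frames, tokens, pairs)
```

The pair count is per attention matrix, so it applies again for every layer.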
Existing approaches handle this in three ways, each with a drawback:
What if there were an operator that processes the entire sequence — all L tokens, jointly — in linear time? That is exactly what state space models offer.
Interactive figure: memory cost versus token count. Attention grows quadratically; the SSM grows linearly, and the gap widens sharply past ~10K tokens.
State space models process a sequence of L tokens in O(L) time and memory. Not O(L²). Not O(L·log L). Just O(L). If you double the video length, cost doubles rather than quadruples.
The insight behind VideoMamba is disarmingly simple: take the Mamba SSM, which was designed for 1D language sequences, and apply it to the flattened spatiotemporal token sequence of a video.
There is one important subtlety. The original Mamba is causal: the output at token t depends only on tokens 1 through t, never on later ones. This makes sense for language (you predict the next word from previous words). But for video classification, context from later frames matters as much as context from earlier ones. You want frame 50 to know about both frame 10 and frame 90. So VideoMamba uses a bidirectional variant: it runs Mamba forward and backward, then sums the outputs. We will examine this in Chapter 4.
Before we see how VideoMamba applies Mamba to video, let's make sure the Mamba mechanism itself is clear. State Space Models (SSMs) are continuous systems that map an input signal x(t) to an output y(t) through a hidden state h(t):
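In the standard formulation used by S4 and Mamba, and assuming that is the form intended here, the map is a pair of linear equations:

```latex
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
```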
Here A is the evolution matrix (how the hidden state evolves on its own), B controls how the input gets injected, and C controls how we read out the output. Think of it as a differential equation: the hidden state is a dynamical system that gets "pushed" by the input and "read" by the output.
We don't have a continuous signal; we have discrete tokens x_1, x_2, ..., x_L. So we discretize the continuous ODE using a time-step parameter Δ:
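Mamba uses the zero-order hold rule for this step. Assuming that is the rule meant here, the discrete parameters (written with a bar) are:

```latex
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B
```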
Then the discrete recurrence is:
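Writing \bar{A} and \bar{B} for the discretized versions of A and B, the recurrence takes the standard form:

```latex
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t
```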
This is just a linear recurrence. Processing L tokens takes L steps — O(L) time. No pairwise comparisons needed.
Traditional SSMs use fixed A, B, C: the same parameters for every input. Mamba's key innovation is making B, C, and Δ input-dependent. For each token x_t, a linear projection produces B_t, C_t, and Δ_t specific to that token. This is the Selective Scan Mechanism (S6).
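A minimal NumPy sketch of such a selective scan may make the idea concrete. It is a slow sequential loop for illustration, not Mamba's fused GPU kernel. The projection names `W_B`, `W_C`, `W_dt`, the diagonal per-channel A, and the softplus on the step size are assumptions chosen to keep the example short, and B is discretized with the first-order approximation Δ·B.

```python
import numpy as np

def softplus(v):
    return np.log1p(np.exp(v))

def selective_scan(x, A, W_B, W_C, W_dt):
    """Sequential selective scan over one sequence.

    x:    (L, D) input tokens
    A:    (D, N) diagonal state matrix per channel (negative entries keep it stable)
    W_B:  (D, N) projection producing B_t from x_t
    W_C:  (D, N) projection producing C_t from x_t
    W_dt: (D, D) projection producing the per-channel step size from x_t
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                    # hidden state, fixed size for any L
    y = np.empty((L, D))
    for t in range(L):
        B_t = x[t] @ W_B                    # (N,) input-dependent B
        C_t = x[t] @ W_C                    # (N,) input-dependent C
        dt = softplus(x[t] @ W_dt)          # (D,) positive, input-dependent step size
        A_bar = np.exp(dt[:, None] * A)     # (D, N) discretized A
        B_bar = dt[:, None] * B_t[None, :]  # (D, N) first-order approximation of discretized B
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C_t                      # (D,) read out through C_t
    return y

rng = np.random.default_rng(0)
L, D, N = 12, 4, 16
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))
W_B, W_C = rng.standard_normal((D, N)), rng.standard_normal((D, N))
W_dt = rng.standard_normal((D, D))
y = selective_scan(x, A, W_B, W_C, W_dt)
print(y.shape)
```

The loop runs L steps and the state `h` never grows, which is where the linear cost comes from.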
A single Mamba block wraps the SSM in a gated architecture:
The 1D convolution before the SSM provides local context (like a small attention window), while the SSM captures long-range dependencies. The multiplicative gating lets the network control information flow.
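The block described above can be sketched in NumPy as follows. This is a simplified illustration based on the description here and the general Mamba design: normalization, the residual connection, and biases are left out, and `toy_ssm` is a hypothetical placeholder standing in for the selective scan.

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def causal_depthwise_conv1d(x, kernel):
    """x: (L, D), kernel: (K, D). Each channel is filtered over its K most recent tokens."""
    K = kernel.shape[0]
    padded = np.vstack([np.zeros((K - 1, x.shape[1])), x])
    return np.stack([(padded[t:t + K] * kernel).sum(axis=0) for t in range(x.shape[0])])

def toy_ssm(u, decay=0.9):
    """Placeholder causal scan: h_t = decay * h_{t-1} + u_t."""
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for t in range(u.shape[0]):
        h = decay * h + u[t]
        out[t] = h
    return out

def mamba_block(x, W_in, conv_kernel, ssm, W_out):
    """x: (L, C) -> (L, C). `ssm` is any callable mapping (L, D) -> (L, D)."""
    u, z = np.split(x @ W_in, 2, axis=1)               # expand, then split into SSM and gate branches
    u = silu(causal_depthwise_conv1d(u, conv_kernel))  # local context from the 1D conv
    y = ssm(u)                                         # long-range mixing
    y = y * silu(z)                                    # multiplicative gating
    return y @ W_out                                   # project back to the embedding size

rng = np.random.default_rng(0)
L, C, K = 10, 8, 4
D = 2 * C                                              # expansion ratio 2
x = rng.standard_normal((L, C))
W_in = rng.standard_normal((C, 2 * D)) / np.sqrt(C)
W_out = rng.standard_normal((D, C)) / np.sqrt(D)
conv_kernel = rng.standard_normal((K, D))
out = mamba_block(x, W_in, conv_kernel, toy_ssm, W_out)
print(out.shape)
```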
To feed a video into Mamba, we first need to convert it into a sequence of tokens. VideoMamba uses tubelet embedding — a 3D convolution that carves the video into small spatiotemporal cubes.
A video clip is a 4D tensor: X^v ∈ R^(3×T×H×W), where 3 is RGB channels, T is the number of frames, and H × W is the spatial resolution. For example, a 16-frame clip at 224×224: shape [3, 16, 224, 224].
A single 3D convolution with kernel size 1×16×16 and stride 1×16×16 converts the video into non-overlapping patches:
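In symbols, with X as an assumed name for the resulting token matrix:

```latex
X = \mathrm{Conv3d}(X^v) \in \mathbb{R}^{L \times C}
```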
where L = t × h × w, with t = T (temporal kernel is 1, so every frame is preserved), h = H/16, and w = W/16. For our 16-frame, 224×224 example: L = 16 × 14 × 14 = 3,136 tokens, each a C-dimensional vector.
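Because kernel and stride are equal, this convolution is equivalent to cutting each frame into 16×16 patches and applying one shared linear projection. The NumPy sketch below uses that equivalence to confirm the token count. The function name `tubelet_embed` and the random weights are illustrative assumptions, and the bias is omitted.

```python
import numpy as np

def tubelet_embed(video, weight, patch=16):
    """video: (3, T, H, W); weight: (3 * patch * patch, C). Returns (T * h * w, C)."""
    ch, T, H, W = video.shape
    h, w = H // patch, W // patch
    x = video.reshape(ch, T, h, patch, w, patch)
    x = x.transpose(1, 2, 4, 0, 3, 5)                # (T, h, w, ch, patch, patch)
    x = x.reshape(T * h * w, ch * patch * patch)     # one flattened patch per row
    return x @ weight                                # shared linear projection

rng = np.random.default_rng(0)
video = rng.standard_normal((3, 16, 224, 224)).astype(np.float32)
weight = rng.standard_normal((3 * 16 * 16, 192)).astype(np.float32)
tokens = tubelet_embed(video, weight)
print(tokens.shape)
```

The rows come out frame by frame, which is already the spatial-first order.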
SSMs are position-sensitive — the recurrence naturally encodes ordering. But VideoMamba still adds explicit position embeddings to help the model distinguish spatial locations and temporal positions:
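As best I can reconstruct it from the VideoMamba paper, the input sequence is formed as follows, with p_s the learnable spatial position embedding and p_t the temporal one:

```latex
X = [X_{cls},\, X] + p_s + p_t, \qquad
p_s \in \mathbb{R}^{(hw+1) \times C}, \quad p_t \in \mathbb{R}^{t \times C}
```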
A learnable CLS token Xcls is prepended, just like in ViT. After all layers, the CLS token's representation is used for classification.
Interactive figure: token count as a function of frame count and resolution. The 3D conv carves the video into non-overlapping tubelets.
This is the core technical contribution. The original Mamba is causal: it scans left-to-right, and each token can only attend to previous tokens. This is perfect for autoregressive language modeling (predict the next word), but video classification needs global context — every token should know about every other token, regardless of position.
VideoMamba uses the bidirectional Mamba block from Vision Mamba. The idea is simple:
Now every token has context from the entire sequence, in both directions. And both scans are O(L), so the total cost is still O(L).
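The forward-plus-backward idea can be sketched in a few lines of NumPy. Here `causal_scan` is a toy stand-in for a Mamba scan, and both directions share it for brevity; actual bidirectional blocks would likely use separate parameters per direction.

```python
import numpy as np

def causal_scan(x, decay=0.9):
    """Toy causal SSM: h_t = decay * h_{t-1} + x_t, y_t = h_t."""
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        y[t] = h
    return y

def bidirectional_scan(x):
    forward = causal_scan(x)                # token t sees tokens 0..t
    backward = causal_scan(x[::-1])[::-1]   # scan the reversed sequence, then restore the order
    return forward + backward               # token t now sees the whole sequence

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 3))
y = bidirectional_scan(x)
print(y.shape)
```

Two O(L) passes are still O(L), so the cost only picks up a constant factor.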
A video has three dimensions: width, height, and time. To feed it into a 1D SSM, we must choose a scan order. The authors tested four strategies:
| Scan Type | Order | SSv2 Acc |
|---|---|---|
| Spatial-First (SF) | All spatial tokens of frame 1, then frame 2, ... | 65.1% |
| Temporal-First (TF) | Token (0,0) across all frames, then (0,1), ... | 62.4% |
| Spatiotemporal v1 | Half layers SF, half layers TF | 63.9% |
| Spatiotemporal v2 | Full SF + full TF (2× compute) | 64.2% |
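The two basic orders in the table differ only in which axis varies fastest when the token grid is flattened. A tiny grid (sizes chosen purely for readability) makes this visible:

```python
import numpy as np

T, H, W = 2, 2, 3                                    # tiny grid so the orders are easy to read
ids = np.arange(T * H * W).reshape(T, H, W)          # token id = position in spatial-first order

spatial_first = ids.reshape(-1)                      # all of frame 0, then all of frame 1
temporal_first = ids.transpose(1, 2, 0).reshape(-1)  # location (0,0) across frames, then (0,1), ...

print(spatial_first.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(temporal_first.tolist())  # [0, 6, 1, 7, 2, 8, 3, 9, 4, 10, 5, 11]
```

In spatial-first order, neighboring tokens in the sequence are neighbors within a frame. In temporal-first order, they are the same location in consecutive frames.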
TimeSformer and ViViT use divided attention: spatial attention within each frame, then temporal attention across frames. This reduces O(T²·N²) to O(T·N² + T²·N), but it is still quadratic in both T and N separately. VideoMamba's B-Mamba is O(T·N), linear in the total token count. And it processes spatial and temporal information jointly, not in separate passes.
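To get a feel for these growth rates, the snippet below evaluates the three expressions for a few frame counts at N = 196 tokens per frame. Constant factors (heads, channels, state size) are ignored, so the numbers compare scaling rather than actual FLOPs.

```python
def joint_attention_pairs(T, N):
    return (T * N) ** 2              # every token against every token

def divided_attention_pairs(T, N):
    return T * N**2 + N * T**2       # spatial within frames + temporal across frames

def linear_scan_steps(T, N):
    return T * N                     # one step per token

N = 196
for T in (8, 16, 32, 64):
    print(T, joint_attention_pairs(T, N), divided_attention_pairs(T, N), linear_scan_steps(T, N))
```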
Interactive figure: GPU memory usage for joint attention, divided attention, and VideoMamba's B-Mamba as frames and resolution scale. The orange (joint attention) line explodes; the teal (B-Mamba) line stays nearly flat by comparison.
VideoMamba deliberately follows the vanilla ViT design as closely as possible. No downsampling layers, no hierarchical features, no window attention. Just a stack of identical blocks. This is called an isotropic architecture.
| Model | Depth (L) | Embed Dim (C) | Parameters |
|---|---|---|---|
| VideoMamba-Ti | 24 | 192 | 7M |
| VideoMamba-S | 24 | 384 | 26M |
| VideoMamba-M | 32 | 576 | 74M |
The SSM uses default Mamba hyperparameters: state dimension N = 16, expansion ratio 2. VideoMamba-Ti has only 7 million parameters — dramatically smaller than TimeSformer-L's 121M or ViViT-L's 311M.
A problem emerged during experiments: larger VideoMamba models overfit. VideoMamba-Base (98M parameters) performed worse than VideoMamba-S (26M). The same issue was observed in VMamba. The solution is self-distillation:
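One plausible form of such a self-distillation objective is sketched below: a smaller, already trained teacher is frozen, and the larger student is trained with its usual classification loss plus a term that pulls its final features toward the teacher's. The L2 alignment, the projection `proj` that bridges the different embedding sizes, and the weighting are all assumptions for illustration, not details confirmed by the text above.

```python
import numpy as np

def feature_alignment_loss(student_feats, teacher_feats, proj):
    """Mean squared error between projected student features and frozen teacher features."""
    return float(np.mean((student_feats @ proj - teacher_feats) ** 2))

def total_loss(ce_loss, student_feats, teacher_feats, proj, weight=1.0):
    """Classification loss plus a weighted distillation term (the weight is hypothetical)."""
    return ce_loss + weight * feature_alignment_loss(student_feats, teacher_feats, proj)

rng = np.random.default_rng(0)
student = rng.standard_normal((196, 576))        # e.g. features from a larger student
teacher = rng.standard_normal((196, 384))        # e.g. features from a smaller teacher
proj = rng.standard_normal((576, 384)) / np.sqrt(576)
print(total_loss(1.25, student, teacher, proj))
```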
The full pipeline from raw video to class prediction. Each B-Mamba block runs forward and backward scans in parallel.
VideoMamba uses a two-stage training pipeline: pretrain on images, then fine-tune on video.
All models are first trained on ImageNet-1K (1.28M images, 1000 classes) for image classification. This provides a strong spatial feature extractor before any video data is seen.
The ImageNet-pretrained model is fine-tuned on video datasets. The key trick: the 2D spatial position embeddings transfer directly because VideoMamba's 3D patch embedding with temporal kernel 1 produces the same spatial tokens as 2D ViT. Only the temporal position embeddings are new.
For even better performance, VideoMamba can be pretrained with masked alignment inspired by UMT. This masks 80% of video tokens and aligns the unmasked tokens with CLIP-ViT features.
A key finding: the masking strategy matters for Mamba. Because the B-Mamba block includes a 1D convolution before the SSM, it prefers contiguous unmasked tokens. Random masking (which works well for transformers) disrupts the 1D conv's local receptive field.
| Masking Strategy | SSv2 Accuracy |
|---|---|
| Random | 67.4% |
| Tube | 66.3% |
| Clip-Row | 68.2% |
| Frame-Row | 67.8% |
| Attention masking | 68.5% |
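To illustrate why contiguity matters, the sketch below compares random masking with a simple row-based scheme at the same 80% ratio and measures how long the runs of visible tokens are in spatial-first order. The `row_mask` function is a hypothetical simplification (whole rows masked independently per frame), not the exact clip-row or frame-row procedure behind the table.

```python
import numpy as np

def random_mask(T, H, W, ratio, rng):
    """True = masked. Masks a random `ratio` of all tokens."""
    n = T * H * W
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(round(ratio * n)), replace=False)] = True
    return mask.reshape(T, H, W)

def row_mask(T, H, W, ratio, rng):
    """True = masked. Masks whole rows, chosen independently for each frame."""
    n_masked_rows = int(round(ratio * H))
    mask = np.zeros((T, H, W), dtype=bool)
    for t in range(T):
        mask[t, rng.choice(H, size=n_masked_rows, replace=False), :] = True
    return mask

def mean_visible_run(mask):
    """Average length of consecutive visible tokens in spatial-first order."""
    runs, current = [], 0
    for visible in ~mask.reshape(-1):
        if visible:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return sum(runs) / len(runs)

rng = np.random.default_rng(0)
T, H, W = 8, 10, 10
rand, rows = random_mask(T, H, W, 0.8, rng), row_mask(T, H, W, 0.8, rng)
print(mean_visible_run(rand), mean_visible_run(rows))
```

With row masking, every visible run spans at least one full row, so a short 1D convolution mostly sees real neighbors. With random masking, visible tokens are mostly isolated.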
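VideoMamba is evaluated against 3D CNNs, video transformers, and hybrid models across multiple benchmarks. The results show a consistent pattern: competitive or better accuracy, with fewer parameters than the pure-transformer baselines. The first table below reports Kinetics-400; the second reports Something-Something V2 (SSv2).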
| Model | Type | Frames | Params | FLOPs | Top-1 |
|---|---|---|---|---|---|
| TimeSformer-L | Trans. | 96 | 121M | 2380G | 80.7% |
| ViViT-L | Trans. | 16 | 311M | 3992G | 81.3% |
| UniFormer-B | CNN+Trans. | 32 | 50M | 259G | 83.0% |
| VideoMamba-M | SSM | 64 | 74M | 2368G | 83.3% |
| VideoMamba-M* | SSM+CLIP | 64 | 74M | 2368G | 85.0% |
* With masked pretraining using CLIP-400M teacher.
| Model | Frames | Params | Top-1 |
|---|---|---|---|
| TimeSformer-HR | 16 | 121M | 62.5% |
| ViViT-L | 16 | 311M | 65.4% |
| MViTv2-B | 32 | 51M | 70.5% |
| VideoMamba-M | 16 | 74M | 68.4% |
| VideoMamba-M* | 16 | 74M | 71.4% |
SSv2 requires understanding fine-grained temporal differences (e.g., "opening" vs "closing"). VideoMamba-M outperforms TimeSformer-HR by 5.9 points and ViViT-L by 3.0 points. With masked pretraining, it surpasses even MViTv2-B (71.4% vs 70.5%).
Top-1 accuracy vs. parameter count on Kinetics-400 (supervised). VideoMamba achieves high accuracy with relatively few parameters.
This is where VideoMamba truly shines. Long videos — cooking demonstrations, movies, procedural tasks — produce tens of thousands of tokens. Transformers either cannot process them at all or must rely on expensive pre-extracted features. VideoMamba handles them end-to-end.
Breakfast: 77 hours of cooking videos, 1,712 clips, 10 activity categories. Videos average several minutes long.
| Method | End-to-End | Backbone | Top-1 |
|---|---|---|---|
| ViS4mer | No | Swin-B features | 88.2% |
| Turbo (32 frames) | Yes | VideoMAE-B | 91.3% |
| VideoMamba-S (64f) | Yes | VideoMamba-S | 97.4% |
| VideoMamba-M* (64f) | Yes | VideoMamba-M | 97.9% |
COIN: 11,827 videos across 180 procedural tasks, averaging 2.36 minutes.
| Method | Top-1 |
|---|---|
| ViS4mer (Swin-B features) | 88.4% |
| Turbo (VideoMAE-B) | 87.5% |
| VideoMamba-M* (64f) | 90.4% |
30K movie clips, 1–3 minutes each, with 9 diverse tasks: relationship prediction, scene classification, director identification, and more. Even tiny VideoMamba-Ti beats all previous methods on most tasks.
At 64 frames and 224×224, the total token count is 12,544. The attention matrix for a transformer would be 12,544 × 12,544 ≈ 157 million entries — per layer, per head. At 64 frames and 384×384, it is 36,864 tokens and 1.36 billion entries. TimeSformer literally runs out of memory.
VideoMamba's recurrent state is fixed-size (state dimension N = 16 per channel) regardless of sequence length. Whether processing 1,000 or 100,000 tokens, the state carried from one token to the next has the same size in every layer. Token activations and scan cost still grow with sequence length, but only linearly.
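A back-of-the-envelope comparison shows the difference in scale. The snippet assumes fp16 storage, counts a single attention matrix (one head, one layer), and takes the SSM state to be (expansion ratio × embedding dim) × N values for one scan direction; these are simplifying assumptions, not measured numbers.

```python
def attention_matrix_bytes(tokens, bytes_per_value=2):
    """One attention matrix (one head, one layer) stored in fp16."""
    return tokens * tokens * bytes_per_value

def ssm_state_bytes(embed_dim, state_dim=16, expand=2, bytes_per_value=2):
    """Recurrent state for one scan direction of one layer; independent of sequence length."""
    return expand * embed_dim * state_dim * bytes_per_value

for frames, side in ((64, 224), (64, 384)):
    tokens = frames * (side // 16) ** 2
    print(frames, side, tokens, attention_matrix_bytes(tokens), ssm_state_bytes(576))
```

The state figure leaves out per-token activations, which grow linearly with the token count.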
Interactive figure: GPU memory usage and throughput versus frame count. Beyond 32 frames, TimeSformer's memory explodes while VideoMamba stays manageable.
VideoMamba sits at the intersection of state space models and video understanding. Let's map where it fits.
VideoMamba is a direct application of the Mamba selective scan mechanism to video. The core SSM operator is unchanged — the contribution is showing it works for 3D spatiotemporal sequences, not just 1D text. The bidirectional extension comes from Vision Mamba (Vim).
Vision Mamba applied bidirectional Mamba to 2D images. VideoMamba extends this to 3D video by: (a) using 3D patch embedding (tubelets), (b) adding temporal position embeddings, and (c) analyzing scan orders for spatiotemporal tokens. It also simplifies Vim by removing the middle CLS token and RoPE, gaining +0.8% on ImageNet.
TimeSformer and ViViT are the attention-based video backbones that VideoMamba aims to replace. Both use divided spatiotemporal attention to reduce quadratic cost. VideoMamba-M achieves better Kinetics-400 accuracy with linear cost and fewer parameters (74M vs 121M–311M), and the smallest variant, VideoMamba-Ti, has only 7M.
VideoMAE is a self-supervised pretraining method for video transformers using masked autoencoding. VideoMamba's masked alignment (inspired by UMT, as noted earlier) shares the masked-modeling idea but requires different masking strategies (row/attention masking vs random/tube) due to Mamba's 1D conv sensitivity.
| Aspect | VideoMamba |
|---|---|
| Input | Video clip [3, T, H, W] |
| Tokenization | 3D Conv (1×16×16) → tubelets |
| Core operator | Bidirectional Mamba (S6) |
| Scan order | Spatial-first (frame by frame) |
| Complexity | O(L) where L = T · (H/16) · (W/16) |
| Architecture | Isotropic (like ViT, no downsampling) |
| Sizes | Ti (7M), S (26M), M (74M) |
| Training | ImageNet-1K → K400/SSv2 fine-tune |
| Scaling trick | Self-distillation (S teaches M) |
| Masking trick | Row/attention masking (not random) |
| K400 accuracy | 83.3% supervised, 85.0% w/ CLIP |
| SSv2 accuracy | 68.4% supervised, 71.4% w/ CLIP |
| Long video | 97.9% Breakfast, 90.4% COIN |
| Efficiency | 6× faster, 40× less memory than TimeSformer at 64 frames |