VideoMamba — Veanors

Chapter 0: The Problem

You want a model that watches a 2-minute cooking video and tells you what dish is being prepared. The video is 30 fps, so that is 3,600 frames. Even if you sample every 4th frame, you still have 900 frames. Each frame, at 224×224 with patch size 16, produces 196 spatial tokens. Multiply: 900 × 196 = 176,400 tokens.

Now try feeding 176,400 tokens into a transformer. Self-attention computes a score between every pair of tokens. That is 176,400² ≈ 31 billion dot products. Per layer. Your GPU does not have enough memory to even store the attention matrix, let alone compute it.

The quadratic wall: A video with T frames and N spatial tokens per frame produces L = T · N total tokens. Full spatiotemporal attention costs O(L²) = O(T² · N²) in memory and compute. Double the video length and cost quadruples. This is why video transformers either use very few frames (8–16) or resort to tricks like divided attention that sacrifice joint spatiotemporal reasoning.
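To make the arithmetic above concrete, here is a small sketch that counts pairwise scores and estimates the size of a single attention matrix for the cooking-video example. The function names and the 2-bytes-per-score (fp16) figure are illustrative assumptions, not something the paper specifies.

```python
def attention_scores(num_tokens: int) -> int:
    """Number of pairwise scores in one full self-attention matrix."""
    return num_tokens ** 2


def attention_matrix_gib(num_tokens: int, bytes_per_score: int = 2) -> float:
    """Size of one L x L score matrix (one head, one layer) in GiB.

    bytes_per_score=2 assumes fp16 storage, which is an illustrative choice.
    """
    return attention_scores(num_tokens) * bytes_per_score / 2 ** 30


frames, tokens_per_frame = 900, 196
L = frames * tokens_per_frame
print(L, attention_scores(L))                            # 176400 31116960000
print(attention_scores(2 * L) // attention_scores(L))    # 4: double the length, 4x the cost
print(f"{attention_matrix_gib(L):.1f} GiB per head per layer")
```

Under these assumptions a single score matrix already needs tens of GiB, before counting multiple heads, layers, or activations for backpropagation.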

Existing approaches handle this in three ways, each with a drawback:

3D CNNs (I3D, SlowFast): use local convolution kernels. Efficient, but the receptive field grows slowly — they struggle with long-range temporal dependencies like "she picked up the knife two minutes ago and now uses it."
Divided attention (TimeSformer, ViViT): factorize attention into spatial-only and temporal-only passes. Cheaper at O(T·N² + T²·N), but the spatial and temporal streams never jointly attend, losing cross-spatiotemporal patterns.
Window attention (Video Swin): restrict attention to local windows. Efficient, but again limited in modeling global dependencies.

What if there were an operator that processes the entire sequence — all L tokens, jointly — in linear time? That is exactly what state space models offer.

[Interactive figure: Attention vs. SSM scaling cost. Memory cost versus total tokens L; attention grows quadratically, the SSM linearly, and the gap widens sharply past ~10K tokens.]

Why can't standard video transformers process 64-frame, high-resolution videos end-to-end?

Self-attention has O(L²) cost, and the token count L = T · N becomes so large that the attention matrix exhausts GPU memory
Video frames are too noisy for transformers to process
Transformers can only handle 1D text, not 2D images

Chapter 1: The Key Insight

State space models process a sequence of L tokens in O(L) time and memory. Not O(L²). Not O(L·log L). Just O(L). If you double the video length, cost doubles — not quadruples.

The insight behind VideoMamba is disarmingly simple: take the Mamba SSM, which was designed for 1D language sequences, and apply it to the flattened spatiotemporal token sequence of a video.

Step 1: Patchify

Slice the video into 3D tubelets (small cubes across space and time). Each tubelet becomes a token. A video of T frames at H×W with patch size 16 yields L = T × (H/16) × (W/16) tokens.

↓

Step 2: Flatten

Arrange all L tokens into a single 1D sequence using a spatial-first scan order: all spatial tokens of frame 1, then all spatial tokens of frame 2, and so on.

↓

Step 3: Scan with Mamba

Run a bidirectional Mamba block over the full sequence. Forward scan sees past context; backward scan sees future context. Both are O(L).

↓

Step 4: Classify

A CLS token aggregates information from the entire video. A linear head maps it to class logits.

Why this matters: VideoMamba processes 64-frame videos at 384×384 resolution where TimeSformer (a video transformer) runs out of memory. At 64 frames, VideoMamba runs 6× faster and uses 40× less GPU memory than TimeSformer. The linear scaling means you can keep feeding in more frames without hitting a memory wall.

There is one important subtlety. The original Mamba is causal: token t can only see tokens 1 through t (itself and everything before it). This makes sense for language (you predict the next word from previous words). But for video classification, every frame matters equally. You want frame 50 to know about both frame 10 and frame 90. So VideoMamba uses a bidirectional variant: it runs Mamba forward and backward, then sums the outputs. We will examine this in Chapter 4.

Concept → Realization: The concept is "SSMs give linear-time sequence modeling." The realization is "flatten video into a 1D sequence of spatiotemporal tokens, apply bidirectional Mamba, and get a linear-cost video backbone that matches or beats quadratic-cost transformers."

What is the key property of SSMs that makes them suitable for long video understanding?

They process sequences in O(L) time and memory, so doubling the video length only doubles the cost instead of quadrupling it
They use 3D convolutions internally
They downsample the video to fewer tokens before processing

Chapter 2: Mamba Recap

Before we see how VideoMamba applies Mamba to video, let's make sure the Mamba mechanism itself is clear. State Space Models (SSMs) are continuous systems that map an input signal x(t) to an output y(t) through a hidden state h(t):

h'(t) = A h(t) + B x(t)
y(t) = C h(t)

Here A is the evolution matrix (how the hidden state evolves on its own), B controls how the input gets injected, and C controls how we read out the output. Think of it as a differential equation: the hidden state is a dynamical system that gets "pushed" by the input and "read" by the output.

Discretization: From Continuous to Tokens

We don't have a continuous signal — we have discrete tokens x₁, x₂, ..., x_L. So we discretize the continuous ODE using a time-step parameter Δ:

Ā = exp(ΔA)
B̄ = (ΔA)⁻¹ (exp(ΔA) − I) · ΔB

Then the discrete recurrence is:

h_t = Ā h_{t−1} + B̄ x_t
y_t = C h_t

This is just a linear recurrence. Processing L tokens takes L steps — O(L) time. No pairwise comparisons needed.
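The recurrence is easy to try out in the scalar case. The sketch below applies the discretization formulas with a one-dimensional state, writing the discretized parameters as `a_bar` and `b_bar`. The default values for `a`, `b`, `c`, and `delta` are arbitrary illustrative choices, not values from Mamba.

```python
import math


def discretize(a: float, b: float, delta: float) -> tuple[float, float]:
    """Zero-order-hold discretization of a scalar SSM (a must be nonzero)."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)
    return a_bar, b_bar


def ssm_scan(xs, a=-1.0, b=1.0, c=1.0, delta=0.1):
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t and y_t = c * h_t over xs."""
    a_bar, b_bar = discretize(a, b, delta)
    h, ys = 0.0, []
    for x in xs:                  # one constant-cost step per token: O(L)
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


impulse = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(impulse)   # the response decays by a factor of a_bar at every step
```

Each step touches only the previous state and the current token, which is why the loop runs in time proportional to the sequence length.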

The "Selective" in Selective State Space

Traditional SSMs use fixed A, B, C — the same parameters for every input. Mamba's key innovation is making B, C, and Δ input-dependent. For each token x_t, a linear projection produces B_t, C_t, and Δ_t specific to that token. This is the Selective Scan Mechanism (S6).

Why selectivity matters: With fixed parameters, the SSM must use the same "memory policy" for every token. With input-dependent parameters, it can choose: "This token is important — large Δ, let it strongly update the hidden state" vs. "This token is noise — small Δ, barely update." This is analogous to how attention selectively focuses on relevant tokens, but achieved through gating rather than pairwise comparison.
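A toy scalar example shows the effect of an input-dependent Δ. Everything here is an illustrative assumption: the "projection" is a hand-picked function of the token's magnitude (`w_delta`, `bias`), and B and C are fixed to 1 so that only Δ varies, whereas in Mamba all three come from learned linear projections.

```python
import math


def softplus(z: float) -> float:
    """Smooth positive function, used here to keep the step size positive."""
    return math.log1p(math.exp(z))


def selective_scan(xs, w_delta=4.0, bias=-2.0, a=-1.0):
    """Scalar selective scan where delta_t is computed from each token.

    B_t and C_t are fixed to 1 to isolate the effect of delta_t.
    """
    h, states = 0.0, []
    for x in xs:
        delta = softplus(w_delta * abs(x) + bias)   # toy "importance" score
        a_bar = math.exp(delta * a)                 # large delta: forget old state
        b_bar = (a_bar - 1.0) / a                   # zero-order hold with B = 1
        h = a_bar * h + b_bar * x
        states.append(h)
    return states


print(selective_scan([2.0, 0.01, 0.01]))   # strong token first, then near-noise
print(selective_scan([0.01, 0.01, 2.0]))   # near-noise first, then strong token
```

With these toy values, a token of magnitude 2.0 gets Δ ≈ 6 and almost fully overwrites the state, while a 0.01 token gets Δ ≈ 0.13 and barely changes it.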

The Mamba Block

A single Mamba block wraps the SSM in a gated architecture:

Input x goes through a linear projection, expanding the channel dimension by 2×
One branch passes through a 1D convolution (local context), then the SSM
The other branch passes through SiLU activation (gating)
The two branches are multiplied element-wise, then projected back down

The 1D convolution before the SSM provides local context (like a small attention window), while the SSM captures long-range dependencies. The multiplicative gating lets the network control information flow.

Complexity comparison: Self-attention over L tokens: O(L² · D) where D is the embedding dimension. Mamba over L tokens: O(L · D · N) where N is the state dimension (typically 16). Since N ≪ L for long sequences, Mamba is vastly cheaper.

What makes Mamba's SSM "selective" compared to traditional SSMs?

The parameters B, C, and Δ are computed from the input token itself, so the model can dynamically decide how much each token updates the hidden state
It selects which tokens to process and skips the rest
It uses a learnable mask to remove irrelevant tokens before processing

Chapter 3: Video Tokenization

To feed a video into Mamba, we first need to convert it into a sequence of tokens. VideoMamba uses tubelet embedding — a 3D convolution that carves the video into small spatiotemporal cubes.

The Input

A video clip is a 4D tensor: X_v ∈ R^{3 × T × H × W}, where 3 is RGB channels, T is the number of frames, and H × W is the spatial resolution. For example, a 16-frame clip at 224×224: shape [3, 16, 224, 224].

Tubelet Embedding

A single 3D convolution with kernel size 1×16×16 and stride 1×16×16 converts the video into non-overlapping patches:

X_p = Conv3D(X_v) ∈ R^{L × C}

where L = t × h × w, with t = T (temporal kernel is 1, so every frame is preserved), h = H/16, and w = W/16. For our 16-frame, 224×224 example: L = 16 × 14 × 14 = 3,136 tokens, each a C-dimensional vector.

Why kernel 1×16×16? The temporal kernel is 1 — no temporal downsampling. Each frame independently produces h×w = 14×14 = 196 spatial tokens. This preserves full temporal resolution. Spatial downsampling is 16×, matching ViT's standard patch size. The result is identical to applying ViT's 2D patch embedding to each frame independently.
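The token-grid arithmetic can be checked with a few lines. This sketch only computes the output grid of a non-overlapping 3D convolution whose stride equals its kernel; it does not perform the convolution itself, and the function names are made up for illustration.

```python
def tubelet_grid(frames: int, height: int, width: int, kernel=(1, 16, 16)):
    """Output grid (t, h, w) of a 3D conv whose stride equals its kernel."""
    kt, kh, kw = kernel
    if frames % kt or height % kh or width % kw:
        raise ValueError("input size must be divisible by the kernel size")
    return frames // kt, height // kh, width // kw


def num_tokens(frames: int, height: int, width: int, kernel=(1, 16, 16)) -> int:
    """Total sequence length L = t * h * w (excluding the CLS token)."""
    t, h, w = tubelet_grid(frames, height, width, kernel)
    return t * h * w


print(tubelet_grid(16, 224, 224))   # (16, 14, 14)
print(num_tokens(16, 224, 224))     # 3136
```

Because the temporal kernel is 1, the first entry of the grid always equals the number of input frames.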

Position Embeddings

SSMs are position-sensitive — the recurrence naturally encodes ordering. But VideoMamba still adds explicit position embeddings to help the model distinguish spatial locations and temporal positions:

Spatial position embedding p_s ∈ R^{(hw+1) × C}: shared across all frames. Tells the model "this token is at row 3, column 7 of the frame."
Temporal position embedding p_t ∈ R^{t × C}: shared across all spatial positions within a frame. Tells the model "this token is from frame 5."

X = [X_cls, X_p] + p_s + p_t

A learnable CLS token X_cls is prepended, just like in ViT. After all layers, the CLS token's representation is used for classification.
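One way to read the formula is as an index lookup: each token picks a row of p_s from its spatial position and a row of p_t from its frame. The sketch below uses plain lists and assumes the CLS token takes row 0 of p_s and receives no temporal embedding. That assumption, and the helper names, are illustrative rather than confirmed by the text above.

```python
def position_indices(num_frames: int, tokens_per_frame: int):
    """For each sequence slot, return (spatial_index, frame_index).

    Slot 0 is the CLS token: spatial row 0 and, by assumption, no frame index.
    """
    indices = [(0, None)]
    for frame in range(num_frames):
        for s in range(tokens_per_frame):
            indices.append((s + 1, frame))   # the same spatial row in every frame
    return indices


def add_position_embeddings(tokens, p_s, p_t, tokens_per_frame):
    """Add spatial and temporal embeddings to [CLS] + patch tokens."""
    num_frames = (len(tokens) - 1) // tokens_per_frame
    out = []
    for tok, (s, f) in zip(tokens, position_indices(num_frames, tokens_per_frame)):
        temporal = p_t[f] if f is not None else [0.0] * len(tok)
        out.append([x + a + b for x, a, b in zip(tok, p_s[s], temporal)])
    return out


# Tiny example: 2 frames, 3 spatial tokens per frame, embedding dimension 1.
tokens = [[0.0] for _ in range(1 + 2 * 3)]
p_s = [[100.0], [1.0], [2.0], [3.0]]     # (hw + 1) rows: CLS + 3 spatial slots
p_t = [[10.0], [20.0]]                   # one row per frame
print(add_position_embeddings(tokens, p_s, p_t, tokens_per_frame=3))
```

The spatial rows repeat from frame to frame while the temporal row changes, which is exactly the sharing pattern described for p_s and p_t.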

[Interactive figure: Tubelet tokenization. Token count as a function of the number of frames T and the spatial resolution; the 3D conv carves the video into non-overlapping tubelets.]

A 32-frame video at 384×384 with patch size 16 produces how many tokens?

32 × 14 × 14 = 6,272
32 × 24 × 24 = 18,432
32 × 384 × 384 = 4,718,592

Chapter 4: Bidirectional Mamba for Video

This is the core technical contribution. The original Mamba is causal: it scans left-to-right, and each token's output depends only on that token and the ones before it. This is perfect for autoregressive language modeling (predict the next word), but video classification needs global context: every token should know about every other token, regardless of position.

Bidirectional Mamba (B-Mamba)

VideoMamba uses the bidirectional Mamba block from Vision Mamba. The idea is simple:

Forward scan: run the SSM from token 1 to token L. Token t sees context from tokens 1..t.
Backward scan: run a separate SSM from token L to token 1. Token t sees context from tokens t..L.
Sum: add the forward and backward outputs element-wise.

y_t = SSM_fwd(x_{1:t}) + SSM_bwd(x_{t:L})

Now every token has context from the entire sequence, in both directions. And both scans are O(L), so the total cost is still O(L).
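The forward-plus-backward idea can be sketched with a fixed-parameter scalar recurrence standing in for each SSM. The coefficients below are arbitrary, and the real B-Mamba block uses selective, multi-channel scans, so treat this only as an illustration of how the two directions combine.

```python
def scan(xs, a_bar=0.9, b_bar=0.1):
    """Causal linear recurrence h_t = a_bar * h_{t-1} + b_bar * x_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(h)
    return ys


def bidirectional_scan(xs, fwd=(0.9, 0.1), bwd=(0.9, 0.1)):
    """Sum of a forward scan and a separately parameterized backward scan."""
    forward = scan(xs, *fwd)
    backward = scan(xs[::-1], *bwd)[::-1]   # scan right-to-left, then restore order
    return [f + b for f, b in zip(forward, backward)]


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
print(scan(impulse))                 # positions before the impulse stay at 0
print(bidirectional_scan(impulse))   # every position now sees the impulse
```

Two O(L) passes and one element-wise sum keep the total cost linear in the sequence length.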

Scan Order: How to Flatten 3D into 1D

A video has three dimensions: width, height, and time. To feed it into a 1D SSM, we must choose a scan order. The authors tested four strategies:

Scan Type	Order	SSv2 Acc
Spatial-First (SF)	All spatial tokens of frame 1, then frame 2, ...	65.1%
Temporal-First (TF)	Token (0,0) across all frames, then (0,1), ...	62.4%
Spatiotemporal v1	Half layers SF, half layers TF	63.9%
Spatiotemporal v2	Full SF + full TF (2× compute)	64.2%

Spatial-First wins: Scanning frame-by-frame (spatial-first) outperforms all alternatives. Why? Because it aligns with the ImageNet-pretrained 2D Mamba: the model already knows how to scan a single image's spatial tokens. Stacking frames in sequence is the most natural extension — each frame's spatial scan is familiar, and the temporal transitions happen at frame boundaries.
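The two basic scan orders differ only in loop nesting. In this sketch each token is identified by a (frame, row, column) triple; the function names are hypothetical.

```python
def spatial_first_order(t: int, h: int, w: int):
    """All tokens of frame 0 in raster order, then frame 1, and so on."""
    return [(f, r, c) for f in range(t) for r in range(h) for c in range(w)]


def temporal_first_order(t: int, h: int, w: int):
    """One spatial location across all frames, then the next location."""
    return [(f, r, c) for r in range(h) for c in range(w) for f in range(t)]


sf = spatial_first_order(4, 14, 14)
tf = temporal_first_order(4, 14, 14)

# Sequence distance between horizontally adjacent patches of the same frame:
print(sf.index((0, 0, 1)) - sf.index((0, 0, 0)))   # 1
print(tf.index((0, 0, 1)) - tf.index((0, 0, 0)))   # 4
```

In spatial-first order, neighbouring patches within a frame stay adjacent in the sequence, just as they are when scanning a single image. Temporal-first order separates them by the number of frames.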

Why Not Just Use Divided Attention?

TimeSformer and ViViT use divided attention: spatial attention within each frame, then temporal attention across frames. This reduces O(T²·N²) to O(T·N² + T²·N), but it is still quadratic in both T and N separately. VideoMamba's B-Mamba is O(T·N) — linear in the total token count. And it processes spatial and temporal information jointly, not in separate passes.
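The three scaling regimes can be compared with leading-order counts. These functions ignore constants, the embedding dimension, and the SSM state size, so they only indicate growth rates, not real FLOPs or memory.

```python
def joint_attention_cost(t: int, n: int) -> int:
    """Full spatiotemporal attention: (T * N)^2 token pairs."""
    return (t * n) ** 2


def divided_attention_cost(t: int, n: int) -> int:
    """Spatial attention within each frame plus temporal attention per location."""
    return t * n ** 2 + t ** 2 * n


def linear_scan_cost(t: int, n: int) -> int:
    """One pass over all T * N tokens."""
    return t * n


n = 196   # spatial tokens per 224x224 frame
for t in (8, 16, 32, 64):
    print(t, joint_attention_cost(t, n), divided_attention_cost(t, n),
          linear_scan_cost(t, n))
```

Doubling the number of frames multiplies the first column by four and the last column by two, while divided attention lands in between.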

[Interactive figure: Attention vs. Mamba memory scaling. GPU memory for joint attention, divided attention, and VideoMamba's B-Mamba as frames and spatial resolution increase; joint attention grows fastest, while B-Mamba stays comparatively flat.]

Concrete numbers from the paper: At 8 frames, 224×224, TimeSformer uses ~2× more memory than VideoMamba. At 64 frames, 224×224: TimeSformer uses 40× more GPU memory and runs 6× slower. The gap grows with sequence length because attention is O(L²) and Mamba is O(L).

Why does VideoMamba use spatial-first scan order rather than temporal-first?

Spatial-first aligns with the ImageNet-pretrained 2D scan pattern, letting the model leverage pretrained spatial knowledge, and it achieves the best accuracy
Spatial-first uses fewer parameters
Temporal-first causes gradient vanishing

Chapter 5: The Architecture

VideoMamba deliberately follows the vanilla ViT design as closely as possible. No downsampling layers, no hierarchical features, no window attention. Just a stack of identical blocks. This is called an isotropic architecture.

The Full Pipeline

3D Patch Embedding: Conv3D (1×16×16) maps the video to L tokens of dimension C
CLS Token + Position Embeddings: prepend CLS, add spatial + temporal embeddings
Stacked B-Mamba Blocks (24 or 32, per the Depth column under Model Variants): each block is normalization → bidirectional Mamba → residual
Classification Head: layer norm on CLS token → linear projection to class logits

Model Variants

Model	Depth (L)	Embed Dim (C)	Parameters
VideoMamba-Ti	24	192	7M
VideoMamba-S	24	384	26M
VideoMamba-M	32	576	74M

The SSM uses default Mamba hyperparameters: state dimension N = 16, expansion ratio 2. VideoMamba-Ti has only 7 million parameters — dramatically smaller than TimeSformer-L's 121M or ViViT-L's 311M.

Self-Distillation for Scaling

A problem emerged during experiments: larger VideoMamba models overfit. VideoMamba-Base (98M parameters) performed worse than VideoMamba-S (26M). The same issue was observed in VMamba. The solution is self-distillation:

Train a smaller model (VideoMamba-S) to convergence — it generalizes well
Use it as a "teacher" to train the larger model (VideoMamba-M)
Align the student's final feature map to the teacher's via L2 loss
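The alignment term in the last step can be sketched as a mean squared difference between two feature maps. This assumes the student and teacher features have already been brought to the same shape; how the width difference between the two models (576 vs 384) is handled is not covered here, and the `weight` parameter is a hypothetical knob.

```python
def l2_feature_loss(student, teacher) -> float:
    """Mean squared difference between two feature maps.

    Each feature map is a list of token vectors with identical shapes.
    """
    if len(student) != len(teacher):
        raise ValueError("feature maps must have the same number of tokens")
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student, teacher):
        if len(s_tok) != len(t_tok):
            raise ValueError("token vectors must have the same dimension")
        for s, t in zip(s_tok, t_tok):
            total += (s - t) ** 2
            count += 1
    return total / count


def distillation_objective(task_loss: float, student, teacher, weight: float = 1.0) -> float:
    """Task loss plus a weighted alignment term (weight is an assumption)."""
    return task_loss + weight * l2_feature_loss(student, teacher)


print(l2_feature_loss([[1.0, 2.0]], [[1.0, 4.0]]))   # 2.0
```

In an actual training loop the teacher's features would be computed without gradients, so only the student is updated.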

Why does Mamba overfit more than transformers? The authors hypothesize that Mamba's selective scan, with its input-dependent gating, gives the model more capacity to memorize training data. The teacher's feature map acts as a soft target that regularizes the student — it can't just memorize, it must produce features similar to a well-generalizing smaller model. This is simple, cheap (just an L2 loss on the last layer), and effective: VideoMamba-M goes from underperforming to SOTA with self-distillation.

Comparison to ViT: VideoMamba strictly follows the isotropic ViT design — no downsampling layers (unlike Video Swin), no depthwise convolutions (unlike VMamba), no middle CLS token or RoPE (unlike Vision Mamba). This simplicity makes it a clean, fair comparison: the only difference from ViT is replacing self-attention with B-Mamba.

VideoMamba Block Diagram

The full pipeline from raw video to class prediction. Each B-Mamba block runs forward and backward scans in parallel.

Why does VideoMamba use self-distillation when scaling to larger models?

Larger Mamba models overfit more easily, and a well-trained smaller model's features serve as a regularization target that prevents the larger model from memorizing the training set
Self-distillation makes training faster
It is needed to initialize the weights of the larger model

Chapter 6: Training

VideoMamba uses a two-stage training pipeline: pretrain on images, then fine-tune on video.

Stage 1: ImageNet Pretraining

All models are first trained on ImageNet-1K (1.28M images, 1000 classes) for image classification. This provides a strong spatial feature extractor before any video data is seen.

Optimizer: AdamW with weight decay 0.05
Schedule: Cosine learning rate decay over 300 epochs, LR = 1e-3
Warmup: 5 epochs of linear warmup
Precision: BFloat16 for stability (no EMA needed)
Stochastic depth: 0 / 0.15 / 0.5 for Ti / S / M
Self-distillation: for VideoMamba-M, a pretrained VideoMamba-S serves as teacher (L2 loss on final features)
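The schedule in the list above (linear warmup for 5 epochs, then cosine decay to epoch 300 from a peak of 1e-3) can be sketched as a single function. Starting the warmup from zero, decaying to `min_lr=0`, and evaluating per epoch rather than per step are all assumptions made for illustration.

```python
import math


def learning_rate(epoch: float, base_lr: float = 1e-3, warmup_epochs: int = 5,
                  total_epochs: int = 300, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 5, 100, 200, 300):
    print(epoch, f"{learning_rate(epoch):.6f}")
```

The learning rate peaks at the end of warmup and reaches `min_lr` at the final epoch.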

ImageNet results: VideoMamba-M achieves 84.0% top-1 at 576×576 resolution with 74M params. This beats DeiT-B (81.8%, 87M params) and is competitive with Swin-B (83.5%, 88M params) — all without hierarchical features or downsampling layers.

Stage 2: Video Fine-tuning

The ImageNet-pretrained model is fine-tuned on video datasets. The key trick: the 2D spatial position embeddings transfer directly because VideoMamba's 3D patch embedding with temporal kernel 1 produces the same spatial tokens as 2D ViT. Only the temporal position embeddings are new.

Kinetics-400: 10-second clips, scene-related actions. LR scaled as 2e-4 · batch/256
Something-Something V2: 4-second clips, temporal-sensitive actions (opening, closing). LR scaled as 4e-4 · batch/256
Stochastic depth: 0.8 for VideoMamba-M
Epochs: 50 (K400) or 30 (SSv2)

Stage 2b: Masked Pretraining (Optional)

For even better performance, VideoMamba can be pretrained with masked alignment inspired by UMT. This masks 80% of video tokens and aligns the unmasked tokens with CLIP-ViT features.

A key finding: the masking strategy matters for Mamba. Because the B-Mamba block includes a 1D convolution before the SSM, it prefers contiguous unmasked tokens. Random masking (which works well for transformers) disrupts the 1D conv's local receptive field.

Masking Strategy	SSv2 Accuracy
Random	67.4%
Tube	66.3%
Clip-Row	68.2%
Frame-Row	67.8%
Attention masking	68.5%

Row masking for Mamba: Row masking masks entire rows of the spatial grid, keeping tokens within a row contiguous. This preserves the 1D conv's local receptive field while still removing substantial information. Attention masking goes further: it masks tokens but preserves adjacency structure, letting the 1D conv see meaningful local context. Both are novel to VideoMamba's architecture.
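A toy comparison shows why contiguity differs between the two masking styles. The sketch masks one 14×14 frame at roughly 80% in two ways and measures the longest run of visible tokens in raster order. It does not reproduce the paper's Clip-Row, Frame-Row, or attention masking procedures; the functions and the run-length metric are illustrative assumptions.

```python
import random


def random_mask(h: int, w: int, ratio: float, rng: random.Random):
    """Mask individual tokens uniformly at random (True means masked)."""
    n = h * w
    masked = set(rng.sample(range(n), round(n * ratio)))
    return [i in masked for i in range(n)]


def row_mask(h: int, w: int, ratio: float, rng: random.Random):
    """Mask whole rows of the spatial grid (True means masked)."""
    rows = set(rng.sample(range(h), round(h * ratio)))
    return [(i // w) in rows for i in range(h * w)]


def longest_visible_run(mask) -> int:
    """Length of the longest stretch of consecutive unmasked tokens."""
    best = run = 0
    for is_masked in mask:
        run = 0 if is_masked else run + 1
        best = max(best, run)
    return best


rng = random.Random(0)
print("random:", longest_visible_run(random_mask(14, 14, 0.8, rng)))
print("row:   ", longest_visible_run(row_mask(14, 14, 0.8, rng)))
```

Row masking always leaves at least one full row of 14 adjacent visible tokens, whereas random masking at this ratio typically leaves only short fragments for a 1D convolution to work with.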

Why does row masking work better than random masking for VideoMamba?

Because VideoMamba's B-Mamba block includes a 1D convolution that needs contiguous tokens; random masking creates gaps that disrupt the conv's local receptive field
Because rows contain more semantic information than random patches
Because row masking removes more tokens, providing stronger regularization

Chapter 7: Results

VideoMamba is evaluated against 3D CNNs, video transformers, and hybrid models across multiple benchmarks. The results show a consistent pattern: competitive or better accuracy than the transformer baselines, usually with fewer parameters.

Kinetics-400 (Scene-Related Actions)

Model	Type	Frames	Params	FLOPs	Top-1
TimeSformer-L	Trans.	96	121M	2380G	80.7%
ViViT-L	Trans.	16	311M	3992G	81.3%
UniFormer-B	CNN+Trans.	32	50M	259G	83.0%
VideoMamba-M	SSM	64	74M	2368G	83.3%
VideoMamba-M*	SSM+CLIP	64	74M	2368G	85.0%

* With masked pretraining using CLIP-400M teacher.

Key takeaway for K400: VideoMamba-M at 64 frames slightly exceeds UniFormer-B's accuracy (83.3% vs 83.0%) while using an isotropic (non-hierarchical) architecture, though at a much higher FLOP count in this table (2368G vs 259G). With masked pretraining, it reaches 85.0%, approaching UMT's 85.7% with a fundamentally different (linear-complexity) backbone.

Something-Something V2 (Temporal-Sensitive Actions)

Model	Frames	Params	Top-1
TimeSformer-HR	16	121M	62.5%
ViViT-L	16	311M	65.4%
MViTv2-B	32	51M	70.5%
VideoMamba-M	16	74M	68.4%
VideoMamba-M*	16	74M	71.4%

SSv2 requires understanding fine-grained temporal differences (e.g., "opening" vs "closing"). VideoMamba-M outperforms TimeSformer-HR by 5.9 percentage points and ViViT-L by 3.0 points. With masked pretraining (71.4%), it also surpasses MViTv2-B (70.5%).

Results Comparison: K400

Top-1 accuracy vs. parameter count on Kinetics-400 (supervised). VideoMamba achieves high accuracy with relatively few parameters.

The parameter efficiency story: VideoMamba-Ti (7M params) achieves 80.3% on K400 with 64 frames, within 0.4 points of TimeSformer-L (121M params, 80.7%) at roughly 17× fewer parameters. One plausible contributor is that the SSM compresses history into a fixed-size state rather than computing pairwise attention scores, although that property mainly reduces memory and compute rather than parameter count.

On Something-Something V2, VideoMamba-M outperforms TimeSformer-HR by how much?

+5.9 percentage points (68.4% vs 62.5%), showing that Mamba's linear scan captures fine-grained temporal differences more effectively than divided attention
+2.0 percentage points
+0.5 percentage points

Chapter 8: The Long Video Advantage

This is where VideoMamba truly shines. Long videos — cooking demonstrations, movies, procedural tasks — produce tens of thousands of tokens. Transformers either cannot process them at all or must rely on expensive pre-extracted features. VideoMamba handles them end-to-end.

Breakfast (Cooking Activities)

77 hours of cooking videos, 1,712 clips, 10 activity categories. Videos average several minutes long.

Method	End-to-End	Backbone	Top-1
ViS4mer	No	Swin-B features	88.2%
Turbo (32 frames)	Yes	VideoMAE-B	91.3%
VideoMamba-S (64f)	Yes	VideoMamba-S	97.4%
VideoMamba-M (64f)	Yes	VideoMamba-M	97.9%

+9.7 points over ViS4mer and +6.6 points over Turbo for VideoMamba-M (97.9% vs 88.2% and 91.3%). Feature-based methods (ViS4mer) extract features with a pretrained model frame-by-frame, losing fine-grained temporal information. VideoMamba's end-to-end processing preserves all spatiotemporal relationships across the full video length.

COIN (Procedural Tasks)

11,827 videos across 180 procedural tasks, averaging 2.36 minutes.

Method	Top-1
ViS4mer (Swin-B features)	88.4%
Turbo (VideoMAE-B)	87.5%
VideoMamba-M (64f)	90.4%

LVU (Long-Form Video Understanding)

30K movie clips, 1–3 minutes each, with 9 diverse tasks: relationship prediction, scene classification, director identification, and more. Even tiny VideoMamba-Ti beats all previous methods on most tasks.

LVU highlights (VideoMamba-Ti vs previous SOTA ViS4mer): Relationship: 62.5% vs 57.1% (+5.4). Scene: 70.4% vs 67.4% (+3.0). Director: 67.3% vs 62.6% (+4.7). Writer: 52.98% vs 48.8% (+4.2). These are all end-to-end results with a 7M-parameter model, versus feature-based methods using 88M+ parameter backbones.

Why Does Linear Complexity Matter Here?

At 64 frames and 224×224, the total token count is 12,544. The attention matrix for a transformer would be 12,544 × 12,544 ≈ 157 million entries — per layer, per head. At 64 frames and 384×384, it is 36,864 tokens and 1.36 billion entries. TimeSformer literally runs out of memory.

VideoMamba's recurrent state is fixed-size (state dimension N = 16 per channel) regardless of sequence length. Whether processing 1,000 or 100,000 tokens, the state carried from one token to the next stays the same size. Activations and scan cost still grow with the number of tokens, but only linearly.

[Interactive figure: Throughput and memory, VideoMamba vs TimeSformer, as a function of frame count. Beyond 32 frames, TimeSformer's memory grows steeply while VideoMamba's stays manageable.]

Why does VideoMamba achieve such large improvements over feature-based methods on long video benchmarks?

End-to-end training preserves fine-grained spatiotemporal relationships that are lost when features are extracted frame-by-frame, and linear complexity makes end-to-end processing feasible even for 64+ frame videos
VideoMamba uses a larger training dataset
Feature-based methods cannot use GPU acceleration

Chapter 9: Connections

VideoMamba sits at the intersection of state space models and video understanding. Let's map where it fits.

Relation to Mamba (S6)

VideoMamba is a direct application of the Mamba selective scan mechanism to video. The core SSM operator is unchanged — the contribution is showing it works for 3D spatiotemporal sequences, not just 1D text. The bidirectional extension comes from Vision Mamba (Vim).

Relation to Vision Mamba (Vim)

Vision Mamba applied bidirectional Mamba to 2D images. VideoMamba extends this to 3D video by: (a) using 3D patch embedding (tubelets), (b) adding temporal position embeddings, and (c) analyzing scan orders for spatiotemporal tokens. It also simplifies Vim by removing the middle CLS token and RoPE, gaining +0.8% on ImageNet.

Relation to TimeSformer / ViViT

TimeSformer and ViViT are the attention-based video backbones that VideoMamba aims to replace. Both use divided spatiotemporal attention to reduce quadratic cost. In the results tables, VideoMamba-M achieves better accuracy than both with linear cost and fewer parameters (74M vs 121M–311M), and even the 7M VideoMamba-Ti comes within 0.4 points of TimeSformer-L on K400.

Relation to VideoMAE

VideoMAE is a self-supervised pretraining method for video transformers using masked autoencoding. VideoMamba's masked alignment (which follows UMT, as described in Chapter 6) belongs to the same family of masked pretraining but requires different masking strategies (row/attention masking vs random/tube) due to Mamba's 1D conv sensitivity.

Cheat Sheet

Aspect	VideoMamba
Input	Video clip [3, T, H, W]
Tokenization	3D Conv (1×16×16) → tubelets
Core operator	Bidirectional Mamba (S6)
Scan order	Spatial-first (frame by frame)
Complexity	O(L) where L = T · (H/16) · (W/16)
Architecture	Isotropic (like ViT, no downsampling)
Sizes	Ti (7M), S (26M), M (74M)
Training	ImageNet-1K → K400/SSv2 fine-tune
Scaling trick	Self-distillation (S teaches M)
Masking trick	Row/attention masking (not random)
K400 accuracy	83.3% supervised, 85.0% w/ CLIP
SSv2 accuracy	68.4% supervised, 71.4% w/ CLIP
Long video	97.9% Breakfast, 90.4% COIN
Efficiency	6× faster, 40× less memory than TimeSformer at 64 frames

The broader lesson: When your bottleneck is sequence length, replacing the quadratic operator (attention) with a linear one (SSM) is not just an optimization — it unlocks entirely new regimes. VideoMamba can process 64-frame high-res videos end-to-end where transformers cannot. This is a qualitative capability difference, not just a quantitative speedup.

What is the key architectural simplification VideoMamba makes over Vision Mamba (Vim)?

It removes the middle CLS token and Rotary Position Embedding (RoPE), gaining accuracy while being simpler
It uses fewer layers
It replaces the SSM with standard attention

VideoMamba for Efficient Video Understanding