VideoMAE V2 — Veanors

Chapter 0: The Problem

You want a single video model that understands actions, detects people in space, localizes events in time, and transfers to dozens of downstream tasks. Language models achieved this by scaling transformers to billions of parameters on massive text corpora. Images followed with MAE and BEiT at billion-parameter scale. But video is stuck.

Why? A single 16-frame video clip at 224×224 produces 8×14×14 = 1568 tokens with a ViT patch size of 16 and a temporal patch size of 2. A single 224×224 image produces 14×14 = 196 tokens, so the clip has 8× as many. A billion-parameter ViT-giant running self-attention on 1568 tokens is extremely expensive in both FLOPs and GPU memory.
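The token arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `count_tokens` is illustrative and not taken from any VideoMAE codebase.

```python
def count_tokens(frames, height, width, t_patch, s_patch):
    """Number of non-overlapping patches (tokens) for a clip or an image."""
    return (frames // t_patch) * (height // s_patch) * (width // s_patch)


# 16-frame clip, temporal patch size 2, spatial patch size 16
video_tokens = count_tokens(16, 224, 224, t_patch=2, s_patch=16)
# A single image is treated as a one-frame clip with temporal patch size 1
image_tokens = count_tokens(1, 224, 224, t_patch=1, s_patch=16)

# Self-attention cost grows with the square of the sequence length
attention_ratio = (video_tokens / image_tokens) ** 2

print(video_tokens, image_tokens, attention_ratio)
```

Because attention is quadratic in sequence length, the gap in attention cost is much larger than the gap in token count.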

VideoMAE (V1) made video self-supervised pre-training feasible by masking 90% of tokens in the encoder. The encoder only processes ~157 tokens instead of 1568. But the decoder still sees all 1568 tokens (the encoded visible ones plus learnable mask tokens). At billion-parameter scale, this full-length decoder pass remains a bottleneck.

The core tension: Scaling video transformers to 1B+ parameters requires enormous compute. VideoMAE V1's encoder masking helps the encoder side, but the decoder still processes the full token sequence. Pre-training ViT-g with V1 takes over two weeks on 64 A100 GPUs. VideoMAE V2's insight: mask the decoder too.

Beyond compute, there is a data bottleneck. The largest public video dataset (Kinetics-400) has only 240K clips. ImageNet-22K has 14.2M images. JFT-3B has 3 billion. A billion-parameter video model will overfit on 240K videos. We need million-scale diverse video data.

Full data flow at a glance: Video clip I ∈ R^3×T×H×W (T=16 frames, H=W=224, stride τ) → cube embedding Φ_emb produces N = T'×H'×W' tokens (e.g., 8×14×14 = 1568) → encoder mask M_e at ratio ρ_e=90% keeps N^e=157 visible tokens → ViT-giant encoder (40 blocks, dim 1408, 16 heads, 1.01B params) processes visible tokens → decoder mask M_d at ratio ρ_d=50% selects a subset via running cell masking → lightweight decoder (4 blocks, dim 512) reconstructs pixels only for tokens invisible to encoder → MSE loss on normalized masked pixels.

The Scaling Wall

Figure: memory cost of encoder-only masking (VideoMAE V1) vs dual masking (V2) as model size grows.

Why is scaling video masked autoencoders harder than scaling image MAEs?

Video clips produce ~8x more tokens than images due to the temporal dimension, making both compute and memory much more expensive
Video datasets have lower resolution than image datasets
Video transformers require different attention mechanisms

Chapter 1: VideoMAE Revisited

Before we can understand V2, we need the original VideoMAE's design locked in. It is a direct extension of MAE (He et al., 2022) from images to video.

Cube Embedding

A video clip I ∈ R^3×T×H×W is divided into non-overlapping spatiotemporal cubes. With temporal patch size 2, spatial patch size 16, and input 16×224×224, you get T'=8 temporal positions × 14×14 spatial positions = 1568 cubes. Each cube is flattened and linearly projected to a d-dimensional token, then added with a learnable 3D positional embedding.

Tube Masking (Encoder)

A random spatial mask is generated once and applied identically across all T'=8 temporal positions. This creates tubes — a masked spatial position stays masked for the entire clip. The masking ratio is extreme: ρ_e = 90%. Only 157 out of 1568 cubes survive.

Why tubes? Adjacent frames in video are nearly identical. If you mask position (3,5) in frame 1 but leave it visible in frame 2, the model just copies the pixel values across time. That makes reconstruction trivially easy — no learning happens. Tube masking forces the model to truly reconstruct from distant spatial context.
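Tube masking can be sketched in a few lines. This is an illustrative version under assumed names (`tube_mask`, `seed`), not the reference implementation. It draws one spatial mask and repeats it for every temporal position. Note that rounding the mask count per frame (176 of 196 masked) leaves 20 × 8 = 160 visible tokens, slightly more than the ~157 implied by an exact 10%.

```python
import random


def tube_mask(t_pos=8, h_pos=14, w_pos=14, ratio=0.9, seed=0):
    """Return mask[t][s] (True = masked) with one spatial pattern shared across time."""
    rng = random.Random(seed)
    n_spatial = h_pos * w_pos
    n_masked = int(ratio * n_spatial)  # masked positions per temporal slice
    masked = set(rng.sample(range(n_spatial), n_masked))
    frame_mask = [s in masked for s in range(n_spatial)]
    # Repeating the same spatial mask at every temporal position forms the tubes
    return [list(frame_mask) for _ in range(t_pos)]


mask = tube_mask()
visible = sum(not m for row in mask for m in row)
print(visible)
```

Every row of `mask` is identical, so a masked spatial position can never be recovered by copying from a neighboring frame.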

Asymmetric Encoder-Decoder

The encoder is a large ViT that only processes the 10% visible tokens. This is where the compute savings come from — self-attention is O(N²), so 10% of tokens means ~1% of attention cost. The decoder is a smaller ViT that receives all N tokens: the 10% encoded tokens plus 90% learnable [MASK] tokens with positional embeddings. The loss is MSE between reconstructed and original pixels, computed only on masked positions.

Tensor shapes (ViT-Base example): Input: (B, 3, 16, 224, 224). After cube embed: (B, 1568, 768). After encoder masking (ρ=90%): (B, 157, 768) → encoder output. Decoder input: concat encoded (157) + mask tokens (1411) = (B, 1568, 384). Decoder output: (B, 1568, 1536) — reconstructed pixel values per cube (2×16×16×3=1536 per cube). Loss computed on 1411 masked cubes only.
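The shape bookkeeping above can be reproduced without any deep learning framework. The sketch below only does the arithmetic; the function name and dictionary keys are hypothetical, and the batch size of 2 is an arbitrary choice.

```python
def videomae_v1_shapes(batch=2, frames=16, size=224, t_patch=2, s_patch=16,
                       enc_dim=768, dec_dim=384, mask_ratio=0.9, channels=3):
    """Return the tensor shapes at each stage of a VideoMAE V1 forward pass."""
    n_tokens = (frames // t_patch) * (size // s_patch) ** 2
    n_visible = round(n_tokens * (1 - mask_ratio))
    n_masked = n_tokens - n_visible
    # Each cube covers t_patch frames of s_patch x s_patch pixels in all channels
    cube_values = t_patch * s_patch * s_patch * channels
    return {
        "input": (batch, channels, frames, size, size),
        "cube_embed": (batch, n_tokens, enc_dim),
        "encoder_out": (batch, n_visible, enc_dim),
        "decoder_in": (batch, n_tokens, dec_dim),
        "decoder_out": (batch, n_tokens, cube_values),
        "loss_positions": n_masked,
    }


shapes = videomae_v1_shapes()
for name, shape in shapes.items():
    print(f"{name:>15}: {shape}")
```

The change from width 768 to 384 between encoder output and decoder input implies a linear projection between the two stages.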

The bottleneck insight: Encoder processes 157 tokens with a big ViT — cheap. Decoder processes 1568 tokens with a smaller ViT — but 1568 tokens is still a lot. For ViT-Base, the decoder costs more FLOPs than the encoder (despite having fewer layers) simply because it has 10× more tokens. When you scale to ViT-giant, this imbalance becomes crippling.
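A rough cost model makes the imbalance concrete. The sketch below counts multiply-accumulates per transformer block using the standard approximation (four d×d projections, two attention matrix products, an MLP with expansion ratio 4). The decoder depth of 4 blocks for the ViT-Base setup and the omission of embeddings, norms, and the prediction head are assumptions, so treat the numbers as estimates only.

```python
def block_macs(tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one transformer block."""
    projections = 4 * tokens * dim * dim       # Q, K, V and output projections
    attention = 2 * tokens * tokens * dim      # QK^T and attention-weighted sum of V
    mlp = 2 * mlp_ratio * tokens * dim * dim   # two MLP linear layers
    return projections + attention + mlp


# ViT-Base encoder on the visible tokens only
encoder_macs = 12 * block_macs(tokens=157, dim=768)
# Assumed decoder: 4 blocks of width 384 on the full sequence
decoder_macs = 4 * block_macs(tokens=1568, dim=384)

print(f"encoder ~{encoder_macs / 1e9:.1f} GMACs, decoder ~{decoder_macs / 1e9:.1f} GMACs")
```

Under these assumptions the narrow, shallow decoder still costs more than the encoder, purely because of its sequence length.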

Why does VideoMAE use tube masking instead of random per-frame masking?

Because adjacent video frames are nearly identical: independent per-frame masks would let the model trivially copy pixels across time, making the reconstruction task too easy for meaningful learning
Because tube masking is faster to compute on GPUs
Because random masking causes gradient instability during training

Chapter 2: Dual Masking

VideoMAE V2's central contribution is embarrassingly simple: mask the decoder too.

The encoder already drops 90% of tokens via tube masking. The decoder receives the 10% encoded visible tokens plus the 90% learnable [MASK] tokens. V2 introduces a second mask M_d on the decoder side that drops an additional 50% of the decoder's token sequence.

How the Two Masks Interact

The encoder mask M_e and decoder mask M_d are generated by different strategies with different goals:

Encoder mask M_e: Random tube masking at ρ_e=90%. Goal: prevent temporal information leakage. Same mask applied across all frames.
Decoder mask M_d: Running cell masking at ρ_d=50%. Goal: select a diverse, representative subset of positions for reconstruction. Designed to complement rather than duplicate the encoder's visible tokens.

The decoder receives: (1) all encoded visible tokens from the encoder, plus (2) learnable [MASK] tokens for positions selected by M_d that were invisible to the encoder. The reconstruction loss is computed only on decoder output tokens that were invisible to the encoder — specifically the intersection M_d ∩ M_e.

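Z^c = Z ∪ {M_i}_{i ∈ M_d}

ℓ = (1 / ((1−ρ_d)N)) Σ_{i ∈ M_d ∩ M_e} |I_i − Î_i|²

Here Z is the set of encoded visible tokens, M_i is the learnable [MASK] token (with its positional embedding) at position i, Z^c is the combined decoder input, I_i and Î_i are the original and reconstructed pixels of cube i, and N is the total number of tokens.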

Why only supervise encoder-invisible tokens? If you also compute loss on tokens the encoder saw, the decoder can cheat — it just passes through the already-encoded visible patches. The visible tokens carry ground-truth pixel information directly from the encoder. Supervising only invisible tokens ensures the decoder must truly reconstruct from context.
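The supervision rule can be written out on toy data. This is a minimal sketch with hypothetical names; it averages the squared error over the supervised positions, which matches the (1−ρ_d)N normalization only under the assumption that the decoder-selected positions are all encoder-invisible.

```python
def dual_masking_loss(target, prediction, encoder_visible, decoder_selected):
    """MSE over positions the decoder reconstructs and the encoder never saw."""
    supervised = sorted(set(decoder_selected) - set(encoder_visible))
    if not supervised:
        raise ValueError("no position is both decoder-selected and encoder-invisible")
    total = 0.0
    for i in supervised:
        squared = [(p - t) ** 2 for p, t in zip(prediction[i], target[i])]
        total += sum(squared) / len(squared)
    return total / len(supervised), supervised


# Toy example: 8 tokens with 2 pixel values each
target = [[float(i)] * 2 for i in range(8)]
# Odd positions are off by 1.0, even positions by 2.0
prediction = [[t + (1.0 if i % 2 else 2.0) for t in row] for i, row in enumerate(target)]

loss, supervised = dual_masking_loss(
    target, prediction, encoder_visible={0, 4}, decoder_selected=[1, 4, 6]
)
print(loss, supervised)
```

Position 4 is selected by the decoder mask but was visible to the encoder, so it contributes nothing to the loss.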

Concrete numbers (ViT-g): Original VideoMAE decoder: processes all 1568 tokens. Memory for feature maps: 1753 MB. V2 dual masking decoder: processes ~784 tokens (50% of 1568). Memory: 1050 MB — a 40% reduction. Training time for 1200 epochs on 64 A100 GPUs: 356h (V1 estimated) vs 241h (V2) — 1.48× speedup. Top-1 accuracy on Something-Something V2: essentially identical (77.0% both).

Dual Masking Token Flow

The encoder sees only visible tokens (orange). The decoder sees encoded tokens plus a selected subset of masked tokens (teal). Reconstruction loss applies only to decoder tokens that were invisible to the encoder (purple). Shown at decoder mask ratio ρ_d = 50%.

What does dual masking primarily save compared to encoder-only masking?

Encoder FLOPs
Decoder memory and compute: the decoder processes fewer tokens (e.g., 50% instead of 100%), cutting memory by ~40% and training time by ~30%, with negligible accuracy loss
Data loading time

Chapter 3: Running Cell Masking

The decoder mask cannot be random. Random masking might cluster selections in one spatial region and miss others entirely. It cannot be frame-level masking (keeping half the frames) because that destroys temporal diversity. The decoder needs a mask that selects maximally diverse spatiotemporal positions.

How It Works

Running cell masking divides the spatiotemporal grid into cells and systematically samples one position per cell, with the sampling offset shifting across cells. Think of it like a stratified sampling scheme that guarantees uniform coverage of the entire video volume.

Concretely, if you want 50% of tokens, you divide the T'×H'×W' grid into 2×1×1 cells (temporal stride 2). Within each cell, you pick the position using a running offset that changes from cell to cell. This ensures every temporal position, every spatial row, and every spatial column is represented in the selected subset.
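One way to realize the 2×1×1 description above is sketched below. The offset rule `(cell + h + w) % 2` is an assumption chosen for illustration; the released implementation may define its cells and running offsets differently.

```python
def running_cell_mask(t_pos=8, h_pos=14, w_pos=14):
    """Return keep[t][h][w] (True = decoder reconstructs this position), keeping 50%."""
    keep = [[[False] * w_pos for _ in range(h_pos)] for _ in range(t_pos)]
    for cell in range(t_pos // 2):  # each cell spans temporal positions 2*cell and 2*cell + 1
        for h in range(h_pos):
            for w in range(w_pos):
                # Running offset: alternates between neighboring cells
                offset = (cell + h + w) % 2
                keep[2 * cell + offset][h][w] = True
    return keep


keep = running_cell_mask()
per_slice = [sum(sum(row) for row in plane) for plane in keep]
print(sum(per_slice), per_slice)
```

In this sketch every temporal slice keeps exactly half of its spatial positions in a checkerboard pattern, so no row, column, or time step is left uncovered.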

Why Not Other Strategies?

The ablation study is revealing:

Decoder Masking	ρ_d	Top-1 Acc	FLOPs
None (original V1)	0%	70.28	35.48G
Frame masking	50%	69.76	25.87G
Random masking	50%	64.87	25.87G
Running cell	50%	70.15	25.87G
Running cell	75%	70.01	21.06G

Key insight: Random masking is catastrophic (−5.4%). Frame masking loses 0.5%. Running cell masking loses only 0.13% while saving 27% of FLOPs. The difference is coverage: random masking leaves spatial holes. Frame masking destroys half the temporal information. Running cell masking preserves both spatial and temporal diversity while still halving the token count.
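The deltas quoted above follow directly from the table. A short script that recomputes them from the table values (names such as `ABLATION` and `summary` are illustrative):

```python
# (strategy, decoder mask ratio in %, top-1 accuracy, GFLOPs), copied from the table
ABLATION = [
    ("None (original V1)", 0, 70.28, 35.48),
    ("Frame masking", 50, 69.76, 25.87),
    ("Random masking", 50, 64.87, 25.87),
    ("Running cell", 50, 70.15, 25.87),
    ("Running cell", 75, 70.01, 21.06),
]

base_acc, base_flops = ABLATION[0][2], ABLATION[0][3]
summary = []
for name, ratio, acc, gflops in ABLATION[1:]:
    acc_drop = base_acc - acc              # accuracy lost vs no decoder masking
    flops_saved = 1 - gflops / base_flops  # fraction of FLOPs removed
    summary.append((name, ratio, acc_drop, flops_saved))
    print(f"{name} @ {ratio}%: -{acc_drop:.2f} acc, {flops_saved:.1%} fewer FLOPs")
```

All three 50% strategies save the same compute; only the choice of which positions to keep separates them.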

What degrades with higher decoder masking: At ρ_d=50%, accuracy drops 0.13%. At ρ_d=75%, it drops 0.27%. The degradation is graceful because video has extreme redundancy — neighboring cubes contain similar information. But beyond 75%, the decoder starts missing critical spatial structure and reconstruction quality suffers.

Running Cell vs Random Masking

Compare coverage patterns. Running cell masking (left) guarantees uniform spatiotemporal coverage. Random masking (right) leaves gaps.

Why does random decoder masking perform much worse than running cell masking?

Random masking leaves spatial and temporal gaps, missing entire regions of the video, while running cell masking guarantees uniform coverage across the full spatiotemporal volume
Random masking is slower to compute
Random masking causes the loss to be noisier

Chapter 4: ViT-giant Backbone

VideoMAE V2 scales the encoder from ViT-Base (86M params) all the way to ViT-giant (1.01B params). This is the first billion-parameter model ever pre-trained in the video domain.

Backbone	Layers	Dim	Heads	Patch	Params
ViT-Base	12	768	12	16	86M
ViT-Large	24	1024	16	16	304M
ViT-Huge	32	1280	16	16	632M
ViT-giant	40	1408	16	14	1,011M

Key Architecture Details

ViT-giant uses patch size 14 (not 16 like smaller models). With 224×224 spatial input, this gives 16×16 = 256 spatial positions per frame. With temporal patch size 2 and 16 frames, that is 8×16×16 = 2048 tokens total. The smaller patch size captures finer spatial details but increases token count by 31% compared to patch-16 models (2048 vs 1568).

The decoder is deliberately lightweight: only 4 transformer blocks with dimension 512 for ViT-g. This asymmetry is critical — the encoder does the heavy representational lifting, while the decoder is just a reconstruction head that gets discarded after pre-training.

Frozen vs. Trained: During self-supervised pre-training: the entire encoder (1.01B params) and decoder (~25M params) are trained from scratch. During downstream fine-tuning: only the encoder is kept; the decoder is discarded. A task-specific head is added (e.g., a linear classifier for action recognition). The encoder is fine-tuned end-to-end.

Compute budget: ViT-g encoder with dual masking processes ~205 visible tokens (10% of 2048). Counting the two attention matrix products at 2 FLOPs per multiply-add, self-attention costs about 4 × 205² × 1408 × 40 ≈ 9.5 GFLOPs per clip. Decoder with dual masking processes ~1024 tokens: 4 × 1024² × 512 × 4 ≈ 8.6 GFLOPs. These attention terms are only a small part of the total; the linear projections and MLP layers dominate. Total with dual masking: 241.61 GFLOPs. Without dual masking (full decoder): 263.93 GFLOPs. Memory: 1050 MB vs 1753 MB, the difference between fitting in GPU memory and not.
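The attention terms in this budget can be evaluated directly. The sketch assumes a cost of 4·N²·d FLOPs per layer (two matrix products, 2 FLOPs per multiply-add) and ignores softmax and per-head bookkeeping; the function name is illustrative.

```python
def attention_flops(tokens, dim, layers):
    """FLOPs of the QK^T and attention-times-V products, 2 FLOPs per multiply-add."""
    return 4 * tokens ** 2 * dim * layers


encoder_attn = attention_flops(tokens=205, dim=1408, layers=40)
decoder_attn = attention_flops(tokens=1024, dim=512, layers=4)

# Reported totals and memory figures from the text above
flops_saved = 1 - 241.61 / 263.93
memory_saved = 1 - 1050 / 1753

print(f"encoder attention ~{encoder_attn / 1e9:.1f} GFLOPs")
print(f"decoder attention ~{decoder_attn / 1e9:.1f} GFLOPs")
print(f"total FLOPs saved {flops_saved:.1%}, memory saved {memory_saved:.1%}")
```

For ViT-g the FLOP saving from dual masking is modest because the encoder dominates, while the memory saving is much larger.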

Model Scaling Comparison

Accuracy on Kinetics-400 and Something-Something V2 as model capacity scales from ViT-B to ViT-g. Notice diminishing returns at the largest scale.

Why does ViT-giant use patch size 14 instead of 16?

Smaller patches capture finer spatial details, which matters more at the billion-parameter scale where the model has capacity to use that extra spatial resolution
Patch size 14 divides evenly into 224
It reduces the total number of parameters

Chapter 5: Data Scaling

A billion-parameter model needs far more pre-training data than existing video benchmarks provide. The largest public video datasets are tiny compared to image datasets: Kinetics-400 has 240K clips, Something-Something V2 has 170K. A ViT-g will overfit on these.

UnlabeledHybrid: 1.35M Clips

VideoMAE V2 builds UnlabeledHybrid, a diverse unlabeled video dataset by mixing clips from multiple public sources:

Source	Domain	Clips
Kinetics (all versions)	Web video (actions)	~650K
Something-Something	Manual recordings (object interactions)	~220K
AVA	Movies (multi-person)	~80K
WebVid2M	Web video (diverse)	~200K
Instagram (crawled)	Social media	~200K

The key word is diversity. These sources cover first-person, third-person, scripted, unscripted, indoor, outdoor, object-centric, and person-centric videos. Labels are discarded — the pre-training is purely self-supervised (pixel reconstruction).
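The mixture in the table can be tallied as a quick sanity check. The clip counts below are the approximate figures from the table, and the dictionary name is illustrative.

```python
# Approximate clip counts per source, copied from the table above
UNLABELED_HYBRID = {
    "Kinetics (all versions)": 650_000,
    "Something-Something": 220_000,
    "AVA": 80_000,
    "WebVid2M": 200_000,
    "Instagram (crawled)": 200_000,
}

total_clips = sum(UNLABELED_HYBRID.values())
shares = {name: clips / total_clips for name, clips in UNLABELED_HYBRID.items()}

for name, share in shares.items():
    print(f"{name:<26} {share:.1%}")
print(f"total: {total_clips / 1e6:.2f}M clips")
```

Kinetics makes up just under half of the mix, so more than half of the clips come from outside the action-recognition web-video domain.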

LabeledHybrid: 0.66M Clips (for post-pre-training)

A second dataset merges the training splits of multiple Kinetics versions (K400, K600, K700) with aligned label semantics and deduplication, totaling 710 categories and 0.66M clips. This is used in the intermediate supervised stage (Chapter 6).

Data scaling effect (Something-Something V2): ViT-H pre-trained on SSv2 alone (170K clips): 74.8% top-1. ViT-H pre-trained on UnlabeledHybrid (1.35M clips): 76.8% top-1 — a full 2.0% improvement just from more diverse data. The gap widens with model size: for ViT-B the gap is 0.4%, for ViT-L it is 1.4%, for ViT-H it is 2.0%. Bigger models are hungrier for data.

Comparison to MAE-ST: MAE-ST tried pre-training on 1M uncurated Instagram clips and got worse performance than pre-training on 240K Kinetics clips (84.4% vs 84.8% on K400 with ViT-L). VideoMAE V2 with 1.35M UnlabeledHybrid gets 85.4%. The difference: data diversity. MAE-ST used only Instagram; V2 mixes five diverse sources covering different visual domains.

Why does MAE-ST's 1M Instagram pre-training underperform V2's 1.35M UnlabeledHybrid pre-training?

Instagram has lower resolution videos
MAE-ST used a single source (Instagram), while V2 mixes five diverse sources covering different visual domains; data diversity matters as much as data quantity
MAE-ST used a different masking strategy

Chapter 6: Progressive Training

You have a 1B-parameter model pre-trained via masked autoencoding on 1.35M unlabeled videos. Now you want to fine-tune it on Kinetics-400 (240K clips, 400 classes). Directly fine-tuning 1B parameters on 240K samples is a recipe for overfitting.

VideoMAE V2 uses a three-stage progressive training pipeline, borrowing from the intermediate fine-tuning technique used in image models (BEiT, EVA):

Stage 1: Unsupervised Pre-training

Train the full VideoMAE (encoder + decoder) on UnlabeledHybrid (1.35M clips) with pixel reconstruction. No labels. 1200 epochs for ViT-g. This learns general spatiotemporal features.

↓

Stage 2: Post-Pre-training (Intermediate)

Discard the decoder. Add a classification head. Fine-tune on LabeledHybrid (0.66M clips, 710 classes). This bridges the gap between reconstruction features and semantic features using diverse supervised data.

↓

Stage 3: Target Fine-tuning

Fine-tune on the specific downstream dataset (e.g., K400, SSv2, AVA). The model has been progressively adapted from general reconstruction → broad semantics → task-specific knowledge.
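The three stages can be summarized as a small configuration structure. This is a hypothetical sketch for orientation only: the `Stage` class and its fields are invented for illustration and do not correspond to any released training script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    supervised: bool
    uses_decoder: bool


PIPELINE = [
    Stage("pre-training", "UnlabeledHybrid (1.35M clips)", supervised=False, uses_decoder=True),
    Stage("post-pre-training", "LabeledHybrid (0.66M clips, 710 classes)", supervised=True, uses_decoder=False),
    Stage("target fine-tuning", "K400 / SSv2 / AVA", supervised=True, uses_decoder=False),
]


def describe(pipeline):
    """Return one human-readable line per stage."""
    lines = []
    for index, stage in enumerate(pipeline, start=1):
        objective = "supervised task loss" if stage.supervised else "masked pixel reconstruction"
        modules = "encoder + decoder" if stage.uses_decoder else "encoder + task head"
        lines.append(f"Stage {index}: {stage.name} on {stage.dataset} ({objective}, {modules})")
    return lines


for line in describe(PIPELINE):
    print(line)
```

Only the encoder weights carry over from one stage to the next; the decoder exists solely in the first stage.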

Why three stages instead of two? ViT-H fine-tuned directly (stages 1 → 3, skipping post-pre-training): 86.9% on K400. ViT-H with progressive training (stages 1 → 2 → 3): 88.6% on K400 — a 1.7% improvement. The intermediate supervised stage acts as a bridge: it softens the distribution shift between self-supervised features (optimized for pixel prediction) and task-specific features (optimized for classification). With 710 categories from multiple Kinetics versions, the model learns broad action semantics before specializing.

What degrades with smaller-scale pre-training: Applying the same post-pre-training to a V1 pre-trained ViT-H gives 88.1% on K400, below V2's 88.6%. This confirms that large-scale unsupervised pre-training (1.35M vs 240K clips) provides a better initialization that post-pre-training can build upon.

Progressive Training Pipeline

Three stages of training, each building on the previous. The decoder is discarded after stage 1.

What role does the post-pre-training stage (Stage 2) play in the progressive pipeline?

It bridges reconstruction features and task-specific features by exposing the model to broad semantic supervision from 710 diverse action categories, reducing overfitting when fine-tuning on smaller target datasets
It reduces the model size for faster downstream inference
It replaces the self-supervised features with supervised features

Chapter 7: Results

VideoMAE V2 achieves new state-of-the-art results on the two main action recognition benchmarks and competitive results across many others.

Action Recognition (End-to-End Fine-tuning)

Method	Backbone	K400 Top-1	SSv2 Top-1
VideoMAE V1	ViT-H	86.6	74.8
MAE-ST	ViT-H	86.8	—
Video Swin-L	Swin-L	84.9	—
MViTv2-L	MViT-L	86.1	—
VideoMAE V2	ViT-H	88.6	76.8
VideoMAE V2	ViT-g	88.5	77.0
VideoMAE V2-g (64 frames, 266²)	ViT-g	90.0	—

The 90.0% on Kinetics-400 (with 64 frames at 266×266 resolution) is a landmark result — the first method to break the 90% barrier without using proprietary labeled data.

Something-Something V2

SSv2 tests temporal understanding — you need to tell "pushing something left" from "pushing something right." Here the V2 improvements are even more dramatic: ViT-H jumps from 74.8% (V1) to 76.8% (V2), a 2.0% gain driven by better data scaling. ViT-g reaches 77.0%.

Scaling returns are diminishing: ViT-B → ViT-L: +4% on K400. ViT-L → ViT-H: +1.5%. ViT-H → ViT-g: −0.1% on K400 (88.6 vs 88.5) and +0.2% on SSv2. The billion-parameter model barely outperforms the 632M model on standard resolution. The real gain comes at higher resolution and more frames: 64 frames at 266² pushes ViT-g to 90.0%.

Engineering decision: resolution vs frames vs parameters. At standard 16 frames, 224², going from ViT-H to ViT-g gains almost nothing. But at 64 frames, 266², ViT-g pulls ahead to 90.0%. The billion parameters are useful when there is enough input resolution for them to exploit. The gain is bought with input and compute scaling: more frames and pixels per clip, with inference cost jumping from 17.88 to 160.30 TFLOPs.

Accuracy vs Model Scale

Top-1 accuracy on Kinetics-400 and SSv2 across model scales. Notice the diminishing returns at the largest scale.

Where does the ViT-giant model most clearly outperform ViT-Huge?

At higher input resolution and more frames (64 frames, 266x266) where ViT-g reaches 90.0% on K400, because the extra parameters need more input data to show their advantage
On Something-Something V2 with standard 16 frames
On small downstream datasets like UCF-101

Chapter 8: Downstream Tasks

A foundation model must generalize beyond its pre-training task. VideoMAE V2 is evaluated on three categories of downstream tasks beyond standard action classification.

Spatial Action Detection (AVA)

AVA requires detecting which person is performing which action at each frame — a spatial grounding task. VideoMAE V2 provides the backbone features, combined with a detection head.

Method	AVA mAP	AVA-K mAP
SlowFast	27.4	—
VideoMAE V1 (ViT-H)	37.0	—
TubeR	33.4	—
VideoMAE V2 (ViT-g)	40.1	46.1

Temporal Action Detection (THUMOS14, FineAction)

Temporal action detection localizes when actions start and end in untrimmed video. The pre-trained model must capture temporal evolution of features, not just per-frame classification.

Method	THUMOS14 mAP	FineAction mAP
I3D + Flow	66.8	17.6
VideoMAE V1 (ViT-H)	68.1	—
VideoMAE V2 (ViT-g)	75.5	27.2

The THUMOS14 improvement from 68.1 to 75.5 is massive — a 7.4 point gain. On FineAction (fine-grained temporal detection), V2 achieves 27.2%, demonstrating that the billion-parameter model captures subtle temporal boundaries.

The generalization argument: VideoMAE V2 was pre-trained with pixel reconstruction, using no action labels, no detection labels, and no temporal boundaries. Yet it achieves SOTA on spatial detection (where the action is), temporal detection (when the action happens), and classification (what the action is). Self-supervised spatiotemporal features are genuinely general-purpose.

What degrades on small datasets: On UCF-101 (13K clips, 101 classes), ViT-g does not significantly outperform ViT-H. The 1B model's advantage only manifests when paired with sufficient downstream data or when transferring across diverse task types. Small datasets saturate quickly regardless of model scale.

What does VideoMAE V2's strong performance on temporal action detection demonstrate about its self-supervised features?

That pixel reconstruction pre-training produces features that capture temporal evolution and boundary information, not just per-frame appearance; the features are genuinely spatiotemporal
That the decoder masking strategy improves temporal sensitivity
That the larger patch size of ViT-g captures longer temporal dependencies

Chapter 9: Connections

VideoMAE V2 sits at the intersection of several major lines of work. Let's map the landscape.

Relation to MAE (Images)

VideoMAE V2 is a direct descendant of MAE (He et al., 2022). The core idea — asymmetric encoder-decoder with high masking ratio — is identical. The innovations are video-specific: tube masking (from V1), dual masking (V2), and running cell masking (V2). The progressive training is borrowed from BEiT/EVA's intermediate fine-tuning.

Relation to InternVideo2

InternVideo2 uses VideoMAE V2-g as one of its two expert teacher models in stage 1. Where V2 learns purely from pixel reconstruction, InternVideo2 adds multimodal alignment (video-text-audio) and next-token prediction. V2 is a single-modality specialist; InternVideo2 is a multi-modality generalist built on top of V2's features.

Relation to V-JEPA

V-JEPA (Bardes et al., 2024) also masks video and predicts representations, but in latent space rather than pixel space. V-JEPA argues that pixel reconstruction wastes capacity on irrelevant high-frequency details. V2 disagrees implicitly — its pixel reconstruction still achieves competitive or superior results, especially on motion-centric tasks (SSv2).

Cheat Sheet

Aspect	VideoMAE V2
Input	Video clip 16×224×224 (up to 64×266×266)
Pre-training task	Masked pixel reconstruction (MSE loss)
Encoder masking	Tube masking at 90%
Decoder masking	Running cell masking at 50%
Largest backbone	ViT-giant: 40 layers, dim 1408, 1.01B params
Pre-training data	UnlabeledHybrid: 1.35M clips from 5 sources
Progressive training	3 stages: unsupervised → supervised intermediate → target
Key result	90.0% K400, 77.0% SSv2, 40.1 AVA mAP
Training cost	241h for 1200 epochs, 64× A100 GPUs (dual masking)
Memory savings	40% less decoder memory vs V1

The broader lesson: Video's extreme temporal redundancy is a curse (expensive) and a blessing (you can mask aggressively without losing information). Dual masking exploits this redundancy in both encoder and decoder. The real bottleneck for video foundation models is data diversity, not just data quantity.

How does InternVideo2 build on VideoMAE V2?

InternVideo2 replaces VideoMAE V2's pixel reconstruction with text prediction
InternVideo2 uses VideoMAE V2-g as one of its expert teacher models, distilling V2's spatiotemporal features into a larger multimodal framework that adds video-text alignment and next-token prediction
InternVideo2 uses V2's dual masking strategy for text encoding

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking