Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong et al. — Nanjing University & Shanghai AI Lab, 2023

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

The first billion-parameter video transformer, trained by masking 90% of tokens in the encoder and another 50% in the decoder. SOTA on Kinetics, Something-Something, AVA, and temporal action detection.

Prerequisites: Vision Transformers (ViT) + Masked Autoencoders (MAE) + Self-supervised learning basics

Chapter 0: The Problem

You want a single video model that understands actions, detects people in space, localizes events in time, and transfers to dozens of downstream tasks. Language models achieved this by scaling transformers to billions of parameters on massive text corpora. Images followed with MAE and BEiT at billion-parameter scale. But video is stuck.

Why? A single 16-frame video clip at 224×224 produces 8×14×14 = 1568 tokens with a ViT patch size of 16. That is 8× as many tokens as a single 224×224 image (14×14 = 196). A billion-parameter ViT-giant running self-attention on 1568 tokens is extremely expensive, both in FLOPs and GPU memory.
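The token arithmetic above can be checked with a few lines. This is a minimal sketch; the helper name `cube_tokens` is hypothetical, and it assumes non-overlapping cubes with a temporal patch size of 2 for video.

```python
def cube_tokens(frames, height, width, patch=16, tubelet=2):
    """Number of tokens produced by non-overlapping cube embedding."""
    return (frames // tubelet) * (height // patch) * (width // patch)

video = cube_tokens(16, 224, 224)             # 8 * 14 * 14
image = cube_tokens(1, 224, 224, tubelet=1)   # 14 * 14
print(video, image, video // image)
```

Under these assumptions a clip carries 1568 tokens against 196 for one image, which is where the extra attention cost comes from.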

VideoMAE (V1) made video self-supervised pre-training feasible by masking 90% of tokens in the encoder. The encoder only processes ~157 tokens instead of 1568. But the decoder still sees all 1568 tokens (the encoded visible ones plus learnable mask tokens). For a billion-parameter model, the decoder is still the bottleneck.

The core tension: Scaling video transformers to 1B+ parameters requires enormous compute. VideoMAE V1's encoder masking helps the encoder side, but the decoder still processes the full token sequence. Pre-training ViT-g with V1 takes over two weeks on 64 A100 GPUs. VideoMAE V2's insight: mask the decoder too.

Beyond compute, there is a data bottleneck. A standard video pre-training set such as Kinetics-400 has only 240K clips. ImageNet-22K has 14.2M images. JFT-3B has 3 billion. A billion-parameter video model will overfit on 240K videos. We need million-scale diverse video data.

Full data flow at a glance: Video clip I ∈ ℝ^(3×T×H×W) (T=16 frames, H=W=224, stride τ) → cube embedding Φ_emb produces N = T'×H'×W' tokens (e.g., 8×14×14 = 1568) → encoder mask M_e at ratio ρ_e = 90% keeps N_e = 157 visible tokens → ViT-giant encoder (40 blocks, dim 1408, 16 heads, 1.01B params) processes visible tokens → decoder mask M_d at ratio ρ_d = 50% selects a subset via running cell masking → lightweight decoder (4 blocks, dim 512) reconstructs pixels only for tokens invisible to encoder → MSE loss on normalized masked pixels.
[Interactive figure: The Scaling Wall. Compares the memory cost of encoder-only masking (VideoMAE V1) with dual masking (V2) as model size grows.]
Why is scaling video masked autoencoders harder than scaling image MAEs?

Chapter 1: VideoMAE Revisited

Before we can understand V2, we need the original VideoMAE's design locked in. It is a direct extension of MAE (He et al., 2022) from images to video.

Cube Embedding

A video clip I ∈ ℝ^(3×T×H×W) is divided into non-overlapping spatiotemporal cubes. With temporal patch size 2, spatial patch size 16, and input 16×224×224, you get T'=8 temporal positions × 14×14 spatial positions = 1568 cubes. Each cube is flattened and linearly projected to a d-dimensional token, then summed with a learnable 3D positional embedding.

Tube Masking (Encoder)

A random spatial mask is generated once and applied identically across all T'=8 temporal positions. This creates tubes: a masked spatial position stays masked for the entire clip. The masking ratio is extreme: ρ_e = 90%. Only about 157 out of 1568 cubes survive.

Why tubes? Adjacent frames in video are nearly identical. If you mask position (3,5) in frame 1 but leave it visible in frame 2, the model just copies the pixel values across time. That makes reconstruction trivially easy — no learning happens. Tube masking forces the model to truly reconstruct from distant spatial context.
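Tube masking as described above can be sketched in plain Python. This is an illustrative version, not the authors' code: the function name `tube_mask`, the seed argument, and the rounding rule `int(ratio * H' * W')` are assumptions.

```python
import random

def tube_mask(t=8, h=14, w=14, ratio=0.9, seed=0):
    """Return masked (t, y, x) indices; one spatial mask is shared by all temporal positions."""
    rng = random.Random(seed)
    spatial = [(y, x) for y in range(h) for x in range(w)]
    n_masked = int(ratio * len(spatial))           # 176 of 196 spatial positions
    masked_2d = set(rng.sample(spatial, n_masked))
    # Repeat the same spatial pattern at every temporal position to form tubes.
    return {(ti, y, x) for ti in range(t) for (y, x) in masked_2d}

masked = tube_mask()
visible = 8 * 14 * 14 - len(masked)
print(len(masked), visible)
```

Because whole tubes are kept or dropped, the visible count is a multiple of T'. With this rounding rule it comes out as 160 rather than the nominal 157 (10% of 1568).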

Asymmetric Encoder-Decoder

The encoder is a large ViT that only processes the 10% visible tokens. This is where the compute savings come from: self-attention is O(N²), so 10% of tokens means ~1% of attention cost. The decoder is a smaller ViT that receives all N tokens: the 10% encoded tokens plus 90% learnable [MASK] tokens with positional embeddings. The loss is MSE between reconstructed and original pixels, computed only on masked positions.

Tensor shapes (ViT-Base example):
Input: (B, 3, 16, 224, 224)
After cube embed: (B, 1568, 768)
After encoder masking (ρ = 90%): (B, 157, 768), which is also the encoder output shape
Decoder input: encoded tokens (157) + mask tokens (1411) = (B, 1568, 384), after projecting to the decoder width
Decoder output: (B, 1568, 1536), the reconstructed pixel values per cube (2×16×16×3 = 1536)
Loss: computed on the 1411 masked cubes only
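The shape bookkeeping can be reproduced without any tensor library. The sketch below only computes shapes. The function name `videomae_shapes` and its defaults are hypothetical, chosen to match the ViT-Base example.

```python
def videomae_shapes(batch=2, frames=16, size=224, patch=16, tubelet=2,
                    enc_dim=768, dec_dim=384, mask_ratio=0.9):
    """Trace tensor shapes through a VideoMAE-style encoder/decoder (shapes only)."""
    n = (frames // tubelet) * (size // patch) ** 2   # total cubes
    n_vis = round(n * (1 - mask_ratio))              # cubes the encoder sees
    cube_pixels = tubelet * patch * patch * 3        # pixel values per cube
    return {
        "input": (batch, 3, frames, size, size),
        "tokens": (batch, n, enc_dim),
        "encoder_out": (batch, n_vis, enc_dim),
        "decoder_in": (batch, n, dec_dim),
        "decoder_out": (batch, n, cube_pixels),
        "loss_positions": n - n_vis,
    }

for name, shape in videomae_shapes().items():
    print(f"{name:>15}: {shape}")
```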
The bottleneck insight: Encoder processes 157 tokens with a big ViT — cheap. Decoder processes 1568 tokens with a smaller ViT — but 1568 tokens is still a lot. For ViT-Base, the decoder costs more FLOPs than the encoder (despite having fewer layers) simply because it has 10× more tokens. When you scale to ViT-giant, this imbalance becomes crippling.
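A back-of-envelope estimate makes the imbalance concrete. The sketch below counts multiply-accumulates (MACs) for the transformer blocks only, ignoring embeddings, norms, and the output head. The 4-block decoder depth for ViT-Base is an assumption, and `vit_macs` is a hypothetical helper.

```python
def vit_macs(tokens, dim, layers, mlp_ratio=4):
    """Approximate MACs of a plain ViT stack on a given number of tokens."""
    linear = tokens * (4 + 2 * mlp_ratio) * dim * dim   # QKV, output projection, MLP
    attention = 2 * tokens * tokens * dim               # QK^T and attention-weighted sum
    return layers * (linear + attention)

encoder = vit_macs(tokens=157, dim=768, layers=12)    # 10% of tokens, wide and deep
decoder = vit_macs(tokens=1568, dim=384, layers=4)    # all tokens, narrow and shallow
print(f"encoder ~{encoder / 1e9:.1f} GMACs, decoder ~{decoder / 1e9:.1f} GMACs")
```

Under these assumptions the decoder comes out at about 18.7 GMACs against about 13.8 for the encoder. The quadratic attention term accounts for a large share of the decoder cost, and it is the term that shrinks fastest when decoder tokens are dropped.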
Why does VideoMAE use tube masking instead of random per-frame masking?

Chapter 2: Dual Masking

VideoMAE V2's central contribution is embarrassingly simple: mask the decoder too.

The encoder already drops 90% of tokens via tube masking. The decoder receives the 10% encoded visible tokens plus the 90% learnable [MASK] tokens. V2 introduces a second mask M_d on the decoder side that drops an additional 50% of the decoder's token sequence.

How the Two Masks Interact

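The encoder mask M_e and decoder mask M_d are generated by different strategies with different goals: M_e uses tube masking at ρ_e = 90% so the encoder cannot copy content across frames, while M_d uses running cell masking at ρ_d = 50% so the decoder's reconstruction targets stay spread across the whole clip.

The decoder receives: (1) all encoded visible tokens from the encoder, plus (2) learnable [MASK] tokens for positions selected by M_d that were invisible to the encoder. The reconstruction loss is computed only on decoder output tokens that were invisible to the encoder, specifically the intersection M_d ∩ M_e.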

$$Z_c = Z \cup \{M_i\}_{i \in M_d}$$
$$\ell = \frac{1}{(1-\rho_d)N} \sum_{i \in M_d \cap M_e} |I_i - \hat{I}_i|^2$$
Why only supervise encoder-invisible tokens? If you also compute loss on tokens the encoder saw, the decoder can cheat — it just passes through the already-encoded visible patches. The visible tokens carry ground-truth pixel information directly from the encoder. Supervising only invisible tokens ensures the decoder must truly reconstruct from context.
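The loss restriction can be illustrated on a toy sequence with one scalar "pixel" per token. This is a sketch under stated assumptions: the function name `dual_mask_loss` is hypothetical, `enc_masked` stands for the encoder-invisible positions (M_e), and `dec_selected` stands for the positions chosen by the decoder mask (M_d).

```python
def dual_mask_loss(target, pred, enc_masked, dec_selected, rho_d):
    """Squared error over M_d ∩ M_e, normalized by (1 - rho_d) * N as in the formula above."""
    n = len(target)
    supervised = enc_masked & dec_selected   # reconstructed by the decoder AND unseen by the encoder
    total = sum((target[i] - pred[i]) ** 2 for i in supervised)
    return total / ((1 - rho_d) * n), supervised

target = [1.0] * 8
pred = [0.0] * 8
loss, supervised = dual_mask_loss(
    target, pred,
    enc_masked={1, 2, 3, 5, 6, 7},   # encoder saw only tokens 0 and 4
    dec_selected={0, 2, 4, 6},       # decoder keeps half of the positions
    rho_d=0.5,
)
print(sorted(supervised), loss)
```

Tokens 0 and 4 are selected by the decoder mask but were visible to the encoder, so they contribute nothing to the loss.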
Concrete numbers (ViT-g): The original VideoMAE decoder processes all 2048 tokens (ViT-g uses patch size 14); feature-map memory is 1753 MB. The V2 dual-masking decoder processes ~1024 tokens (50% of 2048); memory is 1050 MB, a 40% reduction. Training time for 1200 epochs on 64 A100 GPUs: 356h (V1, estimated) vs 241h (V2), a 1.48× speedup. Top-1 accuracy on Something-Something V2: essentially identical (77.0% both).
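The two headline ratios follow directly from the quoted measurements. The snippet below only redoes that arithmetic; the variable names are illustrative.

```python
mem_v1_mb, mem_v2_mb = 1753, 1050    # decoder feature-map memory, V1 vs V2
hours_v1, hours_v2 = 356, 241        # 1200 epochs on 64 A100 GPUs (V1 is an estimate)

memory_reduction = 1 - mem_v2_mb / mem_v1_mb
speedup = hours_v1 / hours_v2
print(f"{memory_reduction:.1%} less memory, {speedup:.2f}x faster")
```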
[Interactive figure: Dual Masking Token Flow. The encoder sees only visible tokens (orange). The decoder sees encoded tokens plus a selected subset of masked tokens (teal). Reconstruction loss applies only to decoder tokens that were invisible to the encoder (purple).]
What does dual masking primarily save compared to encoder-only masking?

Chapter 3: Running Cell Masking

The decoder mask cannot be random. Random masking might cluster selections in one spatial region and miss others entirely. It cannot be frame-level masking (keeping half the frames) because that destroys temporal diversity. The decoder needs a mask that selects maximally diverse spatiotemporal positions.

How It Works

Running cell masking divides the spatiotemporal grid into cells and systematically samples one position per cell, with the sampling offset shifting across cells. Think of it like a stratified sampling scheme that guarantees uniform coverage of the entire video volume.

Concretely, if you want 50% of tokens, you divide the T'×H'×W' grid into 2×1×1 cells (temporal stride 2). Within each cell, you pick the position using a running offset that changes from cell to cell. This ensures every temporal position, every spatial row, and every spatial column is represented in the selected subset.
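One way to realize the 2×1×1 scheme described above is sketched below. The offset rule `(cell + y + x) % 2` is a hypothetical choice made for illustration; the exact running pattern used by the authors is not given here, so treat this as one instance of the idea rather than the reference implementation.

```python
def running_cell_mask(t=8, h=14, w=14):
    """Select 50% of cube positions: one per 2x1x1 temporal cell, with a shifting offset."""
    selected = set()
    for cell in range(t // 2):               # each cell spans temporal positions 2*cell and 2*cell + 1
        for y in range(h):
            for x in range(w):
                offset = (cell + y + x) % 2  # assumed running rule: alternate between neighbors
                selected.add((2 * cell + offset, y, x))
    return selected

sel = running_cell_mask()
per_frame = [sum(1 for (ti, _, _) in sel if ti == k) for k in range(8)]
print(len(sel), per_frame)
```

With this rule every temporal position contributes the same number of tokens and every spatial location is picked once per cell, which is the uniform coverage that random sampling does not guarantee.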

Why Not Other Strategies?

The ablation study is revealing:

Decoder Masking | ρ_d | Top-1 Acc (%) | FLOPs
--- | --- | --- | ---
None (original V1) | 0% | 70.28 | 35.48G
Frame masking | 50% | 69.76 | 25.87G
Random masking | 50% | 64.87 | 25.87G
Running cell | 50% | 70.15 | 25.87G
Running cell | 75% | 70.01 | 21.06G
Key insight: Random masking is catastrophic (−5.4%). Frame masking loses 0.5%. Running cell masking loses only 0.13% while saving 27% of FLOPs. The difference is coverage: random masking leaves spatial holes. Frame masking destroys half the temporal information. Running cell masking preserves both spatial and temporal diversity while still halving the token count.
What degrades with higher decoder masking: At ρ_d = 50%, accuracy drops 0.13%. At ρ_d = 75%, it drops 0.27%. The degradation is graceful because video has extreme redundancy: neighboring cubes contain similar information. But beyond 75%, the decoder starts missing critical spatial structure and reconstruction quality suffers.
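The deltas quoted above can be recomputed from the ablation table. The dictionary below simply copies the table values; the key names are illustrative.

```python
ablation = {  # name: (top-1 accuracy in %, GFLOPs)
    "none": (70.28, 35.48),
    "frame_50": (69.76, 25.87),
    "random_50": (64.87, 25.87),
    "running_cell_50": (70.15, 25.87),
    "running_cell_75": (70.01, 21.06),
}

base_acc, base_flops = ablation["none"]
report = {
    name: (round(base_acc - acc, 2), round(1 - flops / base_flops, 3))
    for name, (acc, flops) in ablation.items()
}
for name, (drop, saved) in report.items():
    print(f"{name:>16}: -{drop:.2f} pts, {saved:.1%} FLOPs saved")
```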
[Interactive figure: Running Cell vs Random Masking. Running cell masking (left) guarantees uniform spatiotemporal coverage. Random masking (right) leaves gaps.]

Why does random decoder masking perform much worse than running cell masking?

Chapter 4: ViT-giant Backbone

VideoMAE V2 scales the encoder from ViT-Base (86M params) all the way to ViT-giant (1.01B params). This is the first billion-parameter model ever pre-trained in the video domain.

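Backbone | Layers | Dim | Heads | Patch | Params
--- | --- | --- | --- | --- | ---
ViT-Base | 12 | 768 | 12 | 16 | 86M
ViT-Large | 24 | 1024 | 16 | 16 | 304M
ViT-Huge | 32 | 1280 | 16 | 16 | 632M
ViT-giant | 40 | 1408 | 16 | 14 | 1,011M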
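The parameter counts in the table can be roughly reproduced from depth and width alone. The sketch below counts only attention and MLP weight matrices (no biases, norms, or embeddings). The MLP widths are assumptions taken from the common ViT configurations (4× the width for B/L/H, 6144 for ViT-giant); they are not stated in this text.

```python
def vit_params(layers, dim, mlp_dim):
    """Approximate weight count of a ViT stack: attention (4*d^2) plus MLP (2*d*mlp_dim) per block."""
    return layers * (4 * dim * dim + 2 * dim * mlp_dim)

configs = {  # name: (layers, dim, assumed mlp_dim, reported params in millions)
    "ViT-Base": (12, 768, 3072, 86),
    "ViT-Large": (24, 1024, 4096, 304),
    "ViT-Huge": (32, 1280, 5120, 632),
    "ViT-giant": (40, 1408, 6144, 1011),
}

estimates = {name: vit_params(l, d, m) / 1e6 for name, (l, d, m, _) in configs.items()}
for name, est in estimates.items():
    print(f"{name:>10}: ~{est:.0f}M estimated vs {configs[name][3]}M reported")
```

The estimates land within a few percent of the reported figures, which suggests the block weights dominate the parameter budget.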

Key Architecture Details

ViT-giant uses patch size 14 (not 16 like smaller models). With 224×224 spatial input, this gives 16×16 = 256 spatial positions per frame. With temporal patch size 2 and 16 frames, that is 8×16×16 = 2048 tokens total. The smaller patch size captures finer spatial details but increases token count by 31% compared to patch-16 models.
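The effect of the smaller patch on sequence length is easy to verify. A minimal sketch, assuming square inputs and a temporal patch size of 2; `cube_tokens` is a hypothetical helper.

```python
def cube_tokens(frames, size, patch, tubelet=2):
    """Token count for a square clip under non-overlapping cube embedding."""
    return (frames // tubelet) * (size // patch) ** 2

tokens_p16 = cube_tokens(16, 224, patch=16)   # ViT-B/L/H
tokens_p14 = cube_tokens(16, 224, patch=14)   # ViT-giant
increase = tokens_p14 / tokens_p16 - 1
print(tokens_p16, tokens_p14, f"+{increase:.0%}")
```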

The decoder is deliberately lightweight: only 4 transformer blocks with dimension 512 for ViT-g. This asymmetry is critical — the encoder does the heavy representational lifting, while the decoder is just a reconstruction head that gets discarded after pre-training.

Frozen vs. Trained: During self-supervised pre-training: the entire encoder (1.01B params) and decoder (~25M params) are trained from scratch. During downstream fine-tuning: only the encoder is kept; the decoder is discarded. A task-specific head is added (e.g., a linear classifier for action recognition). The encoder is fine-tuned end-to-end.
Compute budget: ViT-g encoder with dual masking processes ~205 visible tokens (10% of 2048). Self-attention cost, counting 4·N²·d FLOPs per block: 4 × 205² × 1408 × 40 ≈ 9.5 GFLOPs per clip. Decoder with dual masking processes ~1024 tokens: 4 × 1024² × 512 × 4 ≈ 8.6 GFLOPs. These are attention-only terms; the reported totals also include the linear layers. Total with dual masking: 241.61 GFLOPs. Without dual masking (full decoder): 263.93 GFLOPs. Memory: 1050 MB vs 1753 MB, the difference between fitting in GPU memory and not.
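The attention-only terms can be re-derived as follows. The sketch counts 4·N²·d FLOPs per block (two matrix products, two FLOPs per multiply-add), which is an assumed convention; `attention_gflops` is a hypothetical helper.

```python
def attention_gflops(tokens, dim, layers):
    """FLOPs of the QK^T and attention-weighted-sum products only, in GFLOPs."""
    return 4 * tokens ** 2 * dim * layers / 1e9

enc = attention_gflops(tokens=205, dim=1408, layers=40)   # ViT-g encoder on visible tokens
dec = attention_gflops(tokens=1024, dim=512, layers=4)    # decoder under 50% decoder masking
print(f"encoder attention ~{enc:.2f} GFLOPs, decoder attention ~{dec:.2f} GFLOPs")
```

Under this convention the two attention terms come out at roughly 9.5 and 8.6 GFLOPs, so a 4-block decoder on 1024 tokens spends about as much on attention as the 40-block encoder does on 205.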
[Interactive figure: Model Scaling Comparison. Accuracy on Kinetics-400 and Something-Something V2 as model capacity scales from ViT-B to ViT-g, with diminishing returns at the largest scale.]

Why does ViT-giant use patch size 14 instead of 16?

Chapter 5: Data Scaling

A billion-parameter model needs a million-scale dataset. But the common public video datasets are tiny compared to image datasets: Kinetics-400 has 240K clips, Something-Something V2 has 170K. A ViT-g will overfit on these.

UnlabeledHybrid: 1.35M Clips

VideoMAE V2 builds UnlabeledHybrid, a diverse unlabeled video dataset by mixing clips from multiple public sources:

Source | Domain | Clips
--- | --- | ---
Kinetics (all versions) | Web video (actions) | ~650K
Something-Something | Manual recordings (object interactions) | ~220K
AVA | Movies (multi-person) | ~80K
WebVid2M | Web video (diverse) | ~200K
Instagram (crawled) | Social media | ~200K

The key word is diversity. These sources cover first-person, third-person, scripted, unscripted, indoor, outdoor, object-centric, and person-centric videos. Labels are discarded — the pre-training is purely self-supervised (pixel reconstruction).

LabeledHybrid: 0.66M Clips (for post-pre-training)

A second dataset merges the training splits of multiple Kinetics versions (K400, K600, K700) with aligned label semantics and deduplication, totaling 710 categories and 0.66M clips. This is used in the intermediate supervised stage (Chapter 6).

Data scaling effect (Something-Something V2): ViT-H pre-trained on SSv2 alone (170K clips): 74.8% top-1. ViT-H pre-trained on UnlabeledHybrid (1.35M clips): 76.8% top-1 — a full 2.0% improvement just from more diverse data. The gap widens with model size: for ViT-B the gap is 0.4%, for ViT-L it is 1.4%, for ViT-H it is 2.0%. Bigger models are hungrier for data.
Comparison to MAE-ST: MAE-ST tried pre-training on 1M uncurated Instagram clips and got worse performance than pre-training on 240K Kinetics clips (84.4% vs 84.8% on K400 with ViT-L). VideoMAE V2 with 1.35M UnlabeledHybrid gets 85.4%. The difference: data diversity. MAE-ST used only Instagram; V2 mixes five diverse sources covering different visual domains.
Why does MAE-ST's 1M Instagram pre-training underperform V2's 1.35M UnlabeledHybrid pre-training?

Chapter 6: Progressive Training

You have a 1B-parameter model pre-trained via masked autoencoding on 1.35M unlabeled videos. Now you want to fine-tune it on Kinetics-400 (240K clips, 400 classes). Directly fine-tuning 1B parameters on 240K samples is a recipe for overfitting.

VideoMAE V2 uses a three-stage progressive training pipeline, borrowing from the intermediate fine-tuning technique used in image models (BEiT, EVA):

Stage 1: Unsupervised Pre-training
Train the full VideoMAE (encoder + decoder) on UnlabeledHybrid (1.35M clips) with pixel reconstruction. No labels. 1200 epochs for ViT-g. This learns general spatiotemporal features.
Stage 2: Post-Pre-training (Intermediate)
Discard the decoder. Add a classification head. Fine-tune on LabeledHybrid (0.66M clips, 710 classes). This bridges the gap between reconstruction features and semantic features using diverse supervised data.
Stage 3: Target Fine-tuning
Fine-tune on the specific downstream dataset (e.g., K400, SSv2, AVA). The model has been progressively adapted from general reconstruction → broad semantics → task-specific knowledge.
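The three stages can be written down as a small configuration table. This is purely illustrative: the `Stage` class and its field names are hypothetical and do not correspond to any released training script.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    clips: Optional[int]    # None when the size depends on the target dataset
    objective: str
    keeps_decoder: bool     # the reconstruction decoder exists only during pre-training

PIPELINE = [
    Stage("pre-training", "UnlabeledHybrid", 1_350_000, "masked pixel reconstruction (MSE)", True),
    Stage("post-pre-training", "LabeledHybrid", 660_000, "710-way classification", False),
    Stage("target fine-tuning", "K400 / SSv2 / AVA", None, "task-specific head", False),
]

for prev, nxt in zip(PIPELINE, PIPELINE[1:]):
    print(f"{prev.name} ({prev.data}) -> {nxt.name} ({nxt.data})")
```

Each stage starts from the encoder weights of the previous one; only the head and the data change.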
Why three stages instead of two? ViT-H fine-tuned directly (stages 1 → 3, skipping post-pre-training): 86.9% on K400. ViT-H with progressive training (stages 1 → 2 → 3): 88.6% on K400 — a 1.7% improvement. The intermediate supervised stage acts as a bridge: it softens the distribution shift between self-supervised features (optimized for pixel prediction) and task-specific features (optimized for classification). With 710 categories from multiple Kinetics versions, the model learns broad action semantics before specializing.
What degrades without large-scale pre-training: A V1 pre-trained ViT-H given the same post-pre-training reaches 88.1% on K400, worse than V2's 88.6%. This confirms that large-scale unsupervised pre-training (1.35M vs 240K) provides a better initialization that post-pre-training can build upon.
[Interactive figure: Progressive Training Pipeline. Three stages of training, each building on the previous. The decoder is discarded after stage 1.]

What role does the post-pre-training stage (Stage 2) play in the progressive pipeline?

Chapter 7: Results

VideoMAE V2 achieves new state-of-the-art results on the two main action recognition benchmarks and competitive results across many others.

Action Recognition (End-to-End Fine-tuning)

Method | Backbone | K400 Top-1 | SSv2 Top-1
--- | --- | --- | ---
VideoMAE V1 | ViT-H | 86.6 | 74.8
MAE-ST | ViT-H | 86.8 | -
Video Swin-L | Swin-L | 84.9 | -
MViTv2-L | MViT-L | 86.1 | -
VideoMAE V2 | ViT-H | 88.6 | 76.8
VideoMAE V2 | ViT-g | 88.5 | 77.0
VideoMAE V2-g (64 frames, 266²) | ViT-g | 90.0 | -

The 90.0% on Kinetics-400 (with 64 frames at 266×266 resolution) is a landmark result — the first method to break the 90% barrier without using proprietary labeled data.

Something-Something V2

SSv2 tests temporal understanding — you need to tell "pushing something left" from "pushing something right." Here the V2 improvements are even more dramatic: ViT-H jumps from 74.8% (V1) to 76.8% (V2), a 2.0% gain driven by better data scaling. ViT-g reaches 77.0%.

Scaling returns are diminishing: ViT-B → ViT-L: +4% on K400. ViT-L → ViT-H: +1.5%. ViT-H → ViT-g: −0.1% on K400 (88.6 vs 88.5) and +0.2% on SSv2. The billion-parameter model is roughly on par with the 632M model at standard resolution. The real gain comes at higher resolution and more frames: 64 frames at 266² pushes ViT-g to 90.0%.
Engineering decision: resolution vs frames vs parameters. At the standard 16 frames and 224², going from ViT-H to ViT-g gains almost nothing. But at 64 frames and 266², ViT-g reaches 90.0%. The billion parameters are useful when there is enough input resolution for them to exploit. This is a form of test-time compute scaling: more frames and pixels at inference, with TFLOPs jumping from 17.88 to 160.30.
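The cost of the larger input can be put in numbers. The sketch assumes ViT-g's patch size of 14 and a temporal patch size of 2, and takes the two TFLOPs figures as quoted; `cube_tokens` is a hypothetical helper.

```python
def cube_tokens(frames, size, patch=14, tubelet=2):
    """Token count for a square clip under non-overlapping cube embedding."""
    return (frames // tubelet) * (size // patch) ** 2

tokens_std = cube_tokens(16, 224)    # standard input
tokens_big = cube_tokens(64, 266)    # 64 frames at 266x266
token_ratio = tokens_big / tokens_std
tflops_ratio = 160.30 / 17.88
print(tokens_std, tokens_big, f"{token_ratio:.2f}x tokens, {tflops_ratio:.2f}x TFLOPs")
```

The sequence grows by about 5.6× while the quoted compute grows by about 9×, which is the direction one would expect once the quadratic attention term starts to matter.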
[Interactive figure: Accuracy vs Model Scale. Top-1 accuracy on Kinetics-400 and SSv2 across model scales, with diminishing returns at the largest scale.]

Where does the ViT-giant model most clearly outperform ViT-Huge?

Chapter 8: Downstream Tasks

A foundation model must generalize beyond its pre-training task. VideoMAE V2 is evaluated on three categories of downstream tasks beyond standard action classification.

Spatial Action Detection (AVA)

AVA requires detecting which person is performing which action at each frame — a spatial grounding task. VideoMAE V2 provides the backbone features, combined with a detection head.

Method | AVA mAP | AVA-K mAP
--- | --- | ---
SlowFast | 27.4 | -
VideoMAE V1 (ViT-H) | 37.0 | -
TubeR | 33.4 | -
VideoMAE V2 (ViT-g) | 40.1 | 46.1

Temporal Action Detection (THUMOS14, FineAction)

Temporal action detection localizes when actions start and end in untrimmed video. The pre-trained model must capture temporal evolution of features, not just per-frame classification.

Method | THUMOS14 mAP | FineAction mAP
--- | --- | ---
I3D + Flow | 66.8 | 17.6
VideoMAE V1 (ViT-H) | 68.1 | -
VideoMAE V2 (ViT-g) | 75.5 | 27.2

The THUMOS14 improvement from 68.1 to 75.5 is massive — a 7.4 point gain. On FineAction (fine-grained temporal detection), V2 achieves 27.2%, demonstrating that the billion-parameter model captures subtle temporal boundaries.

The generalization argument: VideoMAE V2 was pre-trained with pixel reconstruction only, with no action labels, no detection labels, and no temporal boundaries. Yet it achieves SOTA on spatial detection (where the action is), temporal detection (when the action happens), and classification (what the action is). Self-supervised spatiotemporal features are genuinely general-purpose.
What degrades on small datasets: On UCF-101 (13K clips, 101 classes), ViT-g does not significantly outperform ViT-H. The 1B model's advantage only manifests when paired with sufficient downstream data or when transferring across diverse task types. Small datasets saturate quickly regardless of model scale.
What does VideoMAE V2's strong performance on temporal action detection demonstrate about its self-supervised features?

Chapter 9: Connections

VideoMAE V2 sits at the intersection of several major lines of work. Let's map the landscape.

Relation to MAE (Images)

VideoMAE V2 is a direct descendant of MAE (He et al., 2022). The core idea — asymmetric encoder-decoder with high masking ratio — is identical. The innovations are video-specific: tube masking (from V1), dual masking (V2), and running cell masking (V2). The progressive training is borrowed from BEiT/EVA's intermediate fine-tuning.

Relation to InternVideo2

InternVideo2 uses VideoMAE V2-g as one of its two expert teacher models in stage 1. Where V2 learns purely from pixel reconstruction, InternVideo2 adds multimodal alignment (video-text-audio) and next-token prediction. V2 is a single-modality specialist; InternVideo2 is a multi-modality generalist built on top of V2's features.

Relation to V-JEPA

V-JEPA (Bardes et al., 2024) also masks video and predicts representations, but in latent space rather than pixel space. V-JEPA argues that pixel reconstruction wastes capacity on irrelevant high-frequency details. V2 disagrees implicitly — its pixel reconstruction still achieves competitive or superior results, especially on motion-centric tasks (SSv2).

Cheat Sheet

Aspect | VideoMAE V2
--- | ---
Input | Video clip 16×224×224 (up to 64×266×266)
Pre-training task | Masked pixel reconstruction (MSE loss)
Encoder masking | Tube masking at 90%
Decoder masking | Running cell masking at 50%
Largest backbone | ViT-giant: 40 layers, dim 1408, 1.01B params
Pre-training data | UnlabeledHybrid: 1.35M clips from 5 sources
Progressive training | 3 stages: unsupervised → supervised intermediate → target
Key result | 90.0% K400, 77.0% SSv2, 40.1 AVA mAP
Training cost | 241h for 1200 epochs, 64× A100 GPUs (dual masking)
Memory savings | 40% less decoder memory vs V1
The broader lesson: Video's extreme temporal redundancy is a curse (expensive) and a blessing (you can mask aggressively without losing information). Dual masking exploits this redundancy in both encoder and decoder. The real bottleneck for video foundation models is data diversity, not just data quantity.
How does InternVideo2 build on VideoMAE V2?