The first billion-parameter video transformer, trained by masking 90% of tokens in the encoder and another 50% in the decoder. SOTA on Kinetics, Something-Something, AVA, and temporal action detection.
You want a single video model that understands actions, detects people in space, localizes events in time, and transfers to dozens of downstream tasks. Language models achieved this by scaling transformers to billions of parameters on massive text corpora. Images followed with MAE and BEiT at billion-parameter scale. But video is stuck.
Why? A single 16-frame video clip at 224×224 produces 8×14×14 = 1568 tokens with a ViT patch size of 16 and a temporal patch size of 2. That is 8× as many tokens as a single 224×224 image (14×14 = 196). A billion-parameter ViT-giant running self-attention on 1568 tokens is extremely expensive in both FLOPs and GPU memory.
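The token arithmetic above can be checked with a short sketch. The helper name `num_tokens` and its parameters are illustrative, not taken from any VideoMAE codebase; it assumes non-overlapping patches and a temporal patch size of 2 for video.

```python
def num_tokens(frames: int, height: int, width: int,
               patch: int = 16, tubelet: int = 1) -> int:
    """Count non-overlapping patch tokens for a clip (or an image when frames == 1)."""
    return (frames // tubelet) * (height // patch) * (width // patch)


video = num_tokens(16, 224, 224, patch=16, tubelet=2)  # 8 * 14 * 14
image = num_tokens(1, 224, 224, patch=16, tubelet=1)   # 14 * 14
print(video, image, video // image)  # 1568 196 8
```

Under these assumptions a clip carries eight images' worth of tokens, and because self-attention cost grows quadratically with sequence length, the attention cost gap is larger still.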
VideoMAE (V1) made video self-supervised pre-training feasible by masking 90% of tokens in the encoder. The encoder only processes ~157 tokens instead of 1568. But the decoder still sees all 1568 tokens (the encoded visible ones plus learnable mask tokens). When scaling to a billion-parameter model, this full-length decoder is still the bottleneck.
Beyond compute, there is a data bottleneck. Kinetics-400, the standard public video benchmark, has only 240K clips. ImageNet-22K has 14.2M images. JFT-3B has 3 billion. A billion-parameter video model will overfit on 240K videos. We need million-scale diverse video data.
Memory cost of encoder-only masking (VideoMAE V1) vs dual masking (V2): the savings grow with model scale.
Before we can understand V2, we need the original VideoMAE's design locked in. It is a direct extension of MAE (He et al., 2022) from images to video.
A video clip I ∈ ℝ^(3×T×H×W) is divided into non-overlapping spatiotemporal cubes. With temporal patch size 2, spatial patch size 16, and input 16×224×224, you get T'=8 temporal positions × 14×14 spatial positions = 1568 cubes. Each cube is flattened and linearly projected to a d-dimensional token, then added to a learnable 3D positional embedding.
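As a rough illustration of the cube layout, the sketch below computes the token grid, the number of raw values in each flattened cube, and the (t, h, w) position of a flat token index. The function names are hypothetical, and the time-major token ordering is an assumption made for the example, not something stated above.

```python
def cube_grid(frames=16, height=224, width=224, channels=3, tubelet=2, patch=16):
    """Return the (T', H', W') token grid and the length of one flattened cube."""
    grid = (frames // tubelet, height // patch, width // patch)
    values_per_cube = channels * tubelet * patch * patch
    return grid, values_per_cube


def token_to_position(index, grid):
    """Map a flat token index to (t, h, w), assuming time-major ordering."""
    _, h_dim, w_dim = grid
    t, rest = divmod(index, h_dim * w_dim)
    h, w = divmod(rest, w_dim)
    return t, h, w


grid, values_per_cube = cube_grid()
print(grid, values_per_cube)           # (8, 14, 14) 1536
print(token_to_position(1567, grid))   # (7, 13, 13)
```

Each cube therefore holds 3×2×16×16 = 1536 pixel values before the linear projection to d dimensions.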
A random spatial mask is generated once and applied identically across all T'=8 temporal positions. This creates tubes: a masked spatial position stays masked for the entire clip. The masking ratio is extreme: ρ_e = 90%. Only about 157 out of 1568 cubes survive.
Why tubes? Adjacent frames in video are nearly identical. If you mask position (3,5) in frame 1 but leave it visible in frame 2, the model just copies the pixel values across time. That makes reconstruction trivially easy — no learning happens. Tube masking forces the model to truly reconstruct from distant spatial context.
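A minimal sketch of tube masking, assuming an 8×14×14 token grid: one random spatial mask is sampled and then reused at every temporal position. The function name `tube_mask` and the per-frame rounding are choices made for this example; with rounding applied per frame, 20 of 196 spatial positions stay visible, giving 160 visible cubes rather than exactly 10% of 1568.

```python
import random


def tube_mask(t_dim=8, h_dim=14, w_dim=14, mask_ratio=0.9, seed=None):
    """Return mask[t][h][w], True where a cube is hidden from the encoder."""
    rng = random.Random(seed)
    n_spatial = h_dim * w_dim
    n_masked = round(mask_ratio * n_spatial)
    masked = set(rng.sample(range(n_spatial), n_masked))
    frame = [[(h * w_dim + w) in masked for w in range(w_dim)]
             for h in range(h_dim)]
    # Reuse the same spatial pattern at every temporal position: a "tube".
    return [[row[:] for row in frame] for _ in range(t_dim)]


mask = tube_mask(seed=0)
visible = sum(not m for frame in mask for row in frame for m in row)
print(visible)  # 160 with these defaults (20 visible positions x 8 time steps)
```

Because every temporal slice shares the pattern, a masked position can never be recovered by copying the same location from a neighbouring frame.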
The encoder is a large ViT that only processes the 10% visible tokens. This is where the compute savings come from: self-attention is O(N²), so 10% of tokens means ~1% of attention cost. The decoder is a smaller ViT that receives all N tokens: the 10% encoded tokens plus 90% learnable [MASK] tokens with positional embeddings. The loss is MSE between reconstructed and original pixels, computed only on masked positions.
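The quadratic saving can be made concrete with a small calculation. This sketch assumes attention cost is proportional to N² and counts nothing else; the helper name is illustrative.

```python
def relative_attention_cost(n_kept: int, n_total: int) -> float:
    """Cost of the N x N attention map on n_kept tokens, relative to n_total tokens."""
    return (n_kept / n_total) ** 2


n_total = 1568
n_visible = round(0.10 * n_total)   # 157 visible tokens
attn = relative_attention_cost(n_visible, n_total)
linear = n_visible / n_total        # per-token layers (MLP, projections) scale linearly
print(f"attention: {attn:.3f}, per-token layers: {linear:.3f}")
# attention: 0.010, per-token layers: 0.100
```

Only the attention map shrinks to about 1%. The per-token layers shrink to about 10%, so the overall encoder saving sits between the two figures.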
VideoMAE V2's central contribution is embarrassingly simple: mask the decoder too.
The encoder already drops 90% of tokens via tube masking. In V1, the decoder receives the 10% encoded visible tokens plus the 90% learnable [MASK] tokens. V2 introduces a second mask M_d on the decoder side that drops an additional 50% of the decoder's token sequence.
The encoder mask M_e and decoder mask M_d are generated by different strategies with different goals: M_e uses tube masking to stop the model from copying pixels across frames, while M_d uses running cell masking to keep the decoder's tokens spread evenly over the video volume.
The decoder receives: (1) all encoded visible tokens from the encoder, plus (2) learnable [MASK] tokens for positions selected by M_d that were invisible to the encoder. The reconstruction loss is computed only on decoder output tokens that were invisible to the encoder, specifically the intersection M_d ∩ M_e.
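The bookkeeping of the two masks can be sketched with plain sets. This is a simplified illustration, not the paper's implementation: both selections are drawn uniformly at random here as a stand-in for tube masking and running cell masking, and all names are hypothetical.

```python
import random


def dual_masking_sets(n_tokens=1568, enc_mask_ratio=0.9, dec_mask_ratio=0.5, seed=0):
    """Return (encoder_visible, decoder_input, loss_positions) as sets of token indices."""
    rng = random.Random(seed)
    positions = list(range(n_tokens))

    # Encoder side: keep ~10% of positions (random stand-in for tube masking).
    n_visible = round((1 - enc_mask_ratio) * n_tokens)
    encoder_visible = set(rng.sample(positions, n_visible))
    encoder_masked = set(positions) - encoder_visible

    # Decoder side: select ~50% of positions (random stand-in for running cell masking).
    n_selected = round((1 - dec_mask_ratio) * n_tokens)
    decoder_selected = set(rng.sample(positions, n_selected))

    # [MASK] tokens are only added where the decoder selects a position the encoder never saw.
    mask_token_positions = decoder_selected & encoder_masked
    decoder_input = encoder_visible | mask_token_positions
    loss_positions = mask_token_positions  # loss only where the encoder was blind
    return encoder_visible, decoder_input, loss_positions


enc, dec, loss = dual_masking_sets()
print(len(enc), len(dec), len(loss))
```

The decoder sequence is always shorter than the full 1568 tokens, and no loss term ever falls on a position the encoder could see.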
The encoder sees only visible tokens. The decoder sees encoded tokens plus a selected subset of masked tokens. Reconstruction loss applies only to decoder tokens that were invisible to the encoder.
The decoder mask cannot be random. Random masking might cluster selections in one spatial region and miss others entirely. It cannot be frame-level masking (keeping half the frames) because that destroys temporal diversity. The decoder needs a mask that selects maximally diverse spatiotemporal positions.
Running cell masking divides the spatiotemporal grid into cells and systematically samples one position per cell, with the sampling offset shifting across cells. Think of it like a stratified sampling scheme that guarantees uniform coverage of the entire video volume.
Concretely, if you want 50% of tokens, you divide the T'×H'×W' grid into 2×1×1 cells (temporal stride 2). Within each cell, you pick the position using a running offset that changes from cell to cell. This ensures every temporal position, every spatial row, and every spatial column is represented in the selected subset.
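One way to picture the scheme is the sketch below, which keeps exactly one position from each 2×1×1 cell. The offset rule `(cell + h + w) % cell_t` is an assumption chosen for illustration; the text above only says that the offset changes from cell to cell, so the real rule may differ.

```python
def running_cell_keep(t_dim=8, h_dim=14, w_dim=14, cell_t=2):
    """Return the set of (t, h, w) positions kept for the decoder.

    The grid is split into cells of cell_t x 1 x 1, and one position per cell
    is kept. The running offset below is a hypothetical choice.
    """
    keep = set()
    for cell in range(t_dim // cell_t):
        for h in range(h_dim):
            for w in range(w_dim):
                offset = (cell + h + w) % cell_t  # shifts from one cell to the next
                keep.add((cell * cell_t + offset, h, w))
    return keep


kept = running_cell_keep()
print(len(kept), len(kept) / (8 * 14 * 14))  # 784 0.5
```

With this rule the kept positions form a 3D checkerboard, so every temporal index, row, and column appears in the selection. Setting `cell_t=4` keeps one position in four, which corresponds to a 75% decoder mask ratio.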
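The ablation study shows that the masking strategy matters more than the ratio. Running cell masking at 50% stays within 0.13 points of the unmasked decoder (70.15 vs 70.28) while cutting FLOPs from 35.48G to 25.87G, whereas random masking at the same ratio falls to 64.87:
| Decoder masking | Decoder mask ratio ρ_d | Top-1 acc. (%) | FLOPs |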
|---|---|---|---|
| None (original V1) | 0% | 70.28 | 35.48G |
| Frame masking | 50% | 69.76 | 25.87G |
| Random masking | 50% | 64.87 | 25.87G |
| Running cell | 50% | 70.15 | 25.87G |
| Running cell | 75% | 70.01 | 21.06G |
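The trade-off in the table is easier to read as relative changes against the unmasked decoder. The sketch below only re-computes numbers already listed in the table; the dictionary keys and helper names are illustrative.

```python
# (top-1 accuracy in %, FLOPs in G), copied from the ablation table above
ABLATION = {
    "none (V1)": (70.28, 35.48),
    "frame, 50%": (69.76, 25.87),
    "random, 50%": (64.87, 25.87),
    "running cell, 50%": (70.15, 25.87),
    "running cell, 75%": (70.01, 21.06),
}
BASE_ACC, BASE_FLOPS = ABLATION["none (V1)"]


def flops_saving(name: str) -> float:
    """Percentage of FLOPs saved relative to the unmasked decoder."""
    return 100 * (1 - ABLATION[name][1] / BASE_FLOPS)


def accuracy_delta(name: str) -> float:
    """Change in top-1 accuracy (points) relative to the unmasked decoder."""
    return ABLATION[name][0] - BASE_ACC


for name in ABLATION:
    print(f"{name:18s} acc {accuracy_delta(name):+.2f}  FLOPs saved {flops_saving(name):.1f}%")
```

Running cell masking at 50% saves about 27% of FLOPs for a 0.13-point drop, and at 75% it saves about 41% for a 0.27-point drop.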
Coverage patterns compared: running cell masking guarantees uniform spatiotemporal coverage, while random masking leaves gaps.
VideoMAE V2 scales the encoder from ViT-Base (86M params) all the way to ViT-giant (1.01B params). This is the first billion-parameter model ever pre-trained in the video domain.
| Backbone | Layers | Dim | Heads | Patch | Params |
|---|---|---|---|---|---|
| ViT-Base | 12 | 768 | 12 | 16 | 86M |
| ViT-Large | 24 | 1024 | 16 | 16 | 304M |
| ViT-Huge | 32 | 1280 | 16 | 16 | 632M |
| ViT-giant | 40 | 1408 | 16 | 14 | 1,011M |
ViT-giant uses patch size 14 (not 16 like smaller models). With 224×224 spatial input, this gives 16×16 = 256 spatial positions per frame. With temporal stride 2 and 16 frames, that is 8×16×16 = 2048 tokens total. The smaller patch size captures finer spatial details but increases token count by 31% compared to patch-16 models.
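The parameter counts in the table can be roughly reproduced from depth and width alone. This is a back-of-the-envelope sketch: it counts only the attention and MLP weight matrices, ignoring biases, LayerNorm, and embeddings. The MLP hidden sizes are assumptions taken from the standard ViT configurations, since the table above does not list them.

```python
# name: (layers, width, assumed MLP hidden size, params in millions from the table)
BACKBONES = {
    "ViT-Base": (12, 768, 3072, 86),
    "ViT-Large": (24, 1024, 4096, 304),
    "ViT-Huge": (32, 1280, 5120, 632),
    "ViT-giant": (40, 1408, 6144, 1011),
}


def approx_params(layers: int, width: int, mlp_hidden: int) -> int:
    """Approximate transformer parameter count from weight matrices only."""
    attention = 4 * width * width   # Q, K, V and output projections
    mlp = 2 * width * mlp_hidden    # two linear layers
    return layers * (attention + mlp)


for name, (layers, width, mlp_hidden, table_m) in BACKBONES.items():
    estimate_m = approx_params(layers, width, mlp_hidden) / 1e6
    print(f"{name:10s} estimate {estimate_m:7.1f}M  table {table_m}M")
```

Under these assumptions the estimate for ViT-giant comes to about 1.009B, close to the 1.01B in the table, with the small remainder coming from the terms the sketch leaves out.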
The decoder is deliberately lightweight: only 4 transformer blocks with dimension 512 for ViT-g. This asymmetry is critical — the encoder does the heavy representational lifting, while the decoder is just a reconstruction head that gets discarded after pre-training.
Accuracy on Kinetics-400 and Something-Something V2 as model capacity scales from ViT-B to ViT-g. Notice diminishing returns at the largest scale.
A billion-parameter model needs pre-training data at a matching scale. But the standard public video datasets are tiny compared to image datasets: Kinetics-400 has 240K clips, Something-Something V2 has 170K. A ViT-g will overfit on these.
VideoMAE V2 builds UnlabeledHybrid, a diverse unlabeled video dataset by mixing clips from multiple public sources:
| Source | Domain | Clips |
|---|---|---|
| Kinetics (all versions) | Web video (actions) | ~650K |
| Something-Something | Manual recordings (object interactions) | ~220K |
| AVA | Movies (multi-person) | ~80K |
| WebVid2M | Web video (diverse) | ~200K |
| Instagram (crawled) | Social media | ~200K |
The key word is diversity. These sources cover first-person, third-person, scripted, unscripted, indoor, outdoor, object-centric, and person-centric videos. Labels are discarded — the pre-training is purely self-supervised (pixel reconstruction).
A second dataset merges the training splits of multiple Kinetics versions (K400, K600, K700) with aligned label semantics and deduplication, totaling 710 categories and 0.66M clips. This is used in the intermediate supervised stage (Chapter 6).
You have a 1B-parameter model pre-trained via masked autoencoding on 1.35M unlabeled videos. Now you want to fine-tune it on Kinetics-400 (240K clips, 400 classes). Directly fine-tuning 1B parameters on 240K samples is a recipe for overfitting.
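VideoMAE V2 uses a three-stage progressive training pipeline, borrowing the intermediate fine-tuning technique used in image models (BEiT, EVA): (1) self-supervised pre-training on the unlabeled hybrid dataset, (2) supervised intermediate fine-tuning on the merged, labeled Kinetics dataset, and (3) fine-tuning on the target dataset.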
Three stages of training, each building on the previous. The decoder is discarded after stage 1.
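The pipeline can be written down as a small configuration sketch. The `Stage` class and its field names are hypothetical and exist only to make the structure explicit; the dataset descriptions repeat what the text above states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    objective: str
    uses_labels: bool
    uses_decoder: bool


PIPELINE = [
    Stage("pre-training", "UnlabeledHybrid (1.35M clips)",
          "masked pixel reconstruction (MSE)", uses_labels=False, uses_decoder=True),
    Stage("intermediate fine-tuning", "merged Kinetics (710 classes, 0.66M clips)",
          "supervised classification", uses_labels=True, uses_decoder=False),
    Stage("target fine-tuning", "target dataset, e.g. Kinetics-400",
          "supervised task loss", uses_labels=True, uses_decoder=False),
]

for i, stage in enumerate(PIPELINE, start=1):
    print(f"stage {i}: {stage.name} on {stage.data}")
```

Only the first stage needs the decoder, which is why it can be dropped before any labeled data is used.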
VideoMAE V2 achieves new state-of-the-art results on the two main action recognition benchmarks and competitive results across many others.
| Method | Backbone | K400 Top-1 | SSv2 Top-1 |
|---|---|---|---|
| VideoMAE V1 | ViT-H | 86.6 | 74.8 |
| MAE-ST | ViT-H | 86.8 | — |
| Video Swin-L | Swin-L | 84.9 | — |
| MViTv2-L | MViT-L | 86.1 | — |
| VideoMAE V2 | ViT-H | 88.6 | 76.8 |
| VideoMAE V2 | ViT-g | 88.5 | 77.0 |
| VideoMAE V2-g (64 frames, 266²) | ViT-g | 90.0 | — |
The 90.0% on Kinetics-400 (with 64 frames at 266×266 resolution) is a landmark result — the first method to break the 90% barrier without using proprietary labeled data.
SSv2 tests temporal understanding: you need to tell "pushing something left" from "pushing something right." Here ViT-H improves from 74.8% (V1) to 76.8% (V2), a 2.0-point gain that matches the 2.0-point gain on Kinetics-400 (86.6% to 88.6%) and is attributed to better data scaling. ViT-g reaches 77.0%.
Top-1 accuracy on Kinetics-400 and SSv2 across model scales. Notice the diminishing returns at the largest scale.
A foundation model must generalize beyond its pre-training task. VideoMAE V2 is evaluated on two categories of downstream tasks beyond standard action classification: spatial action detection and temporal action detection.
AVA requires detecting which person is performing which action at each frame — a spatial grounding task. VideoMAE V2 provides the backbone features, combined with a detection head.
| Method | AVA mAP | AVA-K mAP |
|---|---|---|
| SlowFast | 27.4 | — |
| VideoMAE V1 (ViT-H) | 37.0 | — |
| TubeR | 33.4 | — |
| VideoMAE V2 (ViT-g) | 40.1 | 46.1 |
Temporal action detection localizes when actions start and end in untrimmed video. The pre-trained model must capture temporal evolution of features, not just per-frame classification.
| Method | THUMOS14 mAP | FineAction mAP |
|---|---|---|
| I3D + Flow | 66.8 | 17.6 |
| VideoMAE V1 (ViT-H) | 68.1 | — |
| VideoMAE V2 (ViT-g) | 75.5 | 27.2 |
The THUMOS14 improvement from 68.1 (V1, ViT-H) to 75.5 (V2, ViT-g) is a 7.4-point gain. On FineAction (fine-grained temporal detection), V2 reaches 27.2 mAP, 9.6 points above I3D + Flow (17.6), which suggests that the billion-parameter model captures subtle temporal boundaries.
VideoMAE V2 sits at the intersection of several major lines of work. Let's map the landscape.
VideoMAE V2 is a direct descendant of MAE (He et al., 2022). The core idea — asymmetric encoder-decoder with high masking ratio — is identical. The innovations are video-specific: tube masking (from V1), dual masking (V2), and running cell masking (V2). The progressive training is borrowed from BEiT/EVA's intermediate fine-tuning.
InternVideo2 uses VideoMAE V2-g as one of its two expert teacher models in stage 1. Where V2 learns purely from pixel reconstruction, InternVideo2 adds multimodal alignment (video-text-audio) and next-token prediction. V2 is a single-modality specialist; InternVideo2 is a multi-modality generalist built on top of V2's features.
V-JEPA (Bardes et al., 2024) also masks video and predicts representations, but in latent space rather than pixel space. V-JEPA argues that pixel reconstruction wastes capacity on irrelevant high-frequency details. V2 disagrees implicitly — its pixel reconstruction still achieves competitive or superior results, especially on motion-centric tasks (SSv2).
| Aspect | VideoMAE V2 |
|---|---|
| Input | Video clip 16×224×224 (up to 64×266×266) |
| Pre-training task | Masked pixel reconstruction (MSE loss) |
| Encoder masking | Tube masking at 90% |
| Decoder masking | Running cell masking at 50% |
| Largest backbone | ViT-giant: 40 layers, dim 1408, 1.01B params |
| Pre-training data | UnlabeledHybrid: 1.35M clips from 5 sources |
| Progressive training | 3 stages: unsupervised → supervised intermediate → target |
| Key result | 90.0% K400, 77.0% SSv2, 40.1 AVA mAP |
| Training cost | 241h for 1200 epochs, 64× A100 GPUs (dual masking) |
| Memory savings | 40% less decoder memory vs V1 |