A 6B-parameter video encoder trained in three stages — masked reconstruction, multimodal contrastive learning, and next-token prediction — achieving SOTA on 60+ benchmarks spanning recognition, retrieval, QA, dialogue, and temporal grounding.
You want a single video model that can do everything: classify actions, retrieve videos from text queries, answer questions about video content, detect temporal boundaries, segment objects, and hold natural-language conversations about what happens in a clip. No existing model does all of these well.
The challenge is that these tasks require fundamentally different kinds of understanding:
Previous models pick one or two of these. InternVideo2 unifies all three.
Different tasks require different capabilities. InternVideo2 unifies all three training paradigms to cover the full spectrum.
InternVideo2 is built through a strict progressive training pipeline. The video encoder grows in capability with each stage, and each stage initializes from the checkpoint of the previous.
Stage 1 trains the video encoder from scratch to build spatiotemporal understanding. But unlike VideoMAE, which reconstructs raw pixels, InternVideo2 reconstructs teacher features from two expert models.
The two teachers encode complementary knowledge:
Both teachers are frozen. The student video encoder learns to match their outputs at the token level.
For each input video V:
Here f^V is the student video encoder, h is InternViT-6B, g is VideoMAE V2-g, p indexes the unmasked token positions, and Z is a normalization factor.
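The symbol definitions above point to a per-token feature-matching objective. One form consistent with them is sketched below. The squared L2 distance, the equal weighting of the two teacher terms, and the notation V_p for the token at position p are assumptions made for illustration, not details stated in the text.

```latex
\mathcal{L} \;=\; \frac{1}{Z} \sum_{p}
\Big(
  \big\lVert f^{V}(V_p) - h(V_p) \big\rVert^{2}
  \;+\;
  \big\lVert f^{V}(V_p) - g(V_p) \big\rVert^{2}
\Big)
```

Each term pulls the student's feature at an unmasked position toward the corresponding feature of one frozen teacher.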
The student encoder learns from two frozen teachers. InternViT provides semantic features, VideoMAE V2 provides motion features. Alignment happens at multiple layers.
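The token-level matching described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it assumes the student and both teachers already share a feature dimension (in practice projection layers would be needed), it covers a single layer rather than multi-layer alignment, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared L2 distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dual_teacher_loss(student, teacher_h, teacher_g, unmasked):
    """Average student-to-teacher distance over the unmasked token positions.

    student, teacher_h, teacher_g: lists of per-token feature vectors.
    unmasked: indices of the tokens the student actually sees.
    """
    total = 0.0
    for p in unmasked:
        total += squared_distance(student[p], teacher_h[p])  # semantic teacher
        total += squared_distance(student[p], teacher_g[p])  # motion teacher
    # Assumption: the normalizer Z is simply the number of unmasked tokens.
    return total / len(unmasked)


# Toy example: 3 tokens with 2-dim features; token 1 is masked out and ignored.
student = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
teacher_h = [[1.0, 0.0], [9.0, 9.0], [0.0, 0.0]]
teacher_g = [[0.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
loss = dual_teacher_loss(student, teacher_h, teacher_g, unmasked=[0, 2])
print(loss)
```

Because the masked position is skipped, the large mismatch at token 1 does not contribute to the loss.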
After Stage 1, the video encoder understands spatiotemporal structure but has no concept of language or audio. Stage 2 bridges this gap by aligning video representations with text, audio, and speech.
Three new modules are introduced:
Stage 2 uses three complementary losses:
Contrastive loss (L_CON): Standard InfoNCE. Pull matching video-text pairs together and push non-matching pairs apart. Applied across modality pairs: {video, text}, {image, text}, {video, VAS-caption}, {video+audio, VAS-caption}.
Matching loss (L_MAC): Binary classification of whether a given video-text pair is matched. Uses the multimodal decoder with cross-attention between video and text features.
Masked language modeling (L_MLM): Mask tokens in the text caption and predict them conditioned on the video. This forces fine-grained video-language grounding.
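The contrastive term is the easiest of the three losses to illustrate. Below is a minimal sketch of a symmetric InfoNCE loss over a small batch, written in plain Python. The temperature value of 0.07 and the function names are assumptions for illustration; the text does not state the settings InternVideo2 uses.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE where (video_embs[i], text_embs[i]) is the matching pair."""
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    n = len(v)
    # Cosine similarities scaled by the temperature.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    video_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_video = sum(
        cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is near zero when each video embedding lines up with its own caption and large when the captions are swapped, which is the behavior the contrastive objective rewards.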
Stage 2 is split into two phases:
Stages 1 and 2 produce a video encoder with excellent features for classification and retrieval. But it cannot generate text — it cannot answer "What is the person doing and why?" in natural language. Stage 3 connects the video encoder to a Large Language Model.
The bridge between video and language is a QFormer (from BLIP-2). It takes the video encoder's output tokens and compresses them into a small set of query tokens that the LLM can process. The LLM is Vicuna (a fine-tuned LLaMA).
To improve fine-grained and long-video understanding, InternVideo2 adds an HD post-training stage. Each input video is split into up to 6 sub-videos at 224×224 plus 1 global resized sub-video. Training proceeds in two epochs: first with 8 frames per sub-video, then with 16 frames. During this stage, both the video encoder and the QFormer are updated, while the LLM is adapted with LoRA.
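The sub-video split can be pictured as tiling the frame with a small grid. The sketch below picks a grid of at most 6 tiles whose aspect ratio is closest to the input and then adds the one global view. The grid-selection rule and the function names are assumptions for illustration only; the text states the limits (up to 6 tiles of 224×224 plus 1 global view) but not how the layout is chosen.

```python
def choose_grid(width, height, max_tiles=6):
    """Pick (cols, rows) with cols * rows <= max_tiles closest to the input aspect ratio."""
    target = width / height
    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            error = abs(cols / rows - target)
            key = (error, -cols * rows)  # closer ratio first, then more tiles
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


def sub_video_boxes(width, height, tile=224, max_tiles=6):
    """Return the grid and the crop boxes (left, top, right, bottom) on the resized frame."""
    cols, rows = choose_grid(width, height, max_tiles)
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
    return (cols, rows), boxes


grid, boxes = sub_video_boxes(1280, 720)
num_sub_videos = len(boxes) + 1  # local tiles plus one global resized view
print(grid, num_sub_videos)
```

Under this rule a 16:9 input maps to a 2×1 grid, while a square input maps to 2×2; other selection rules would give different layouts within the same 6-tile budget.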
Video tokens are compressed by QFormer into a small query set, then concatenated with text instruction tokens and fed to the LLM for next-token prediction.
At the heart of InternVideo2 is a massive Vision Transformer scaled to 6 billion parameters. Let's look at the architectural details.
| Property | InternVideo2-1B | InternVideo2-6B |
|---|---|---|
| Layers | 40 | 48 |
| Hidden dim | 1408 | 3200 |
| Attention heads | 16 | 25 |
| Patch size | 14×14 | 14×14 |
| Input frames | 8 (sparse sampling) | 8 (sparse sampling) |
| Tokens per clip | 2048 + 1 [CLS] | 2048 + 1 [CLS] |
| Parameters | ~1B | ~6B |
InternVideo2 samples only 8 frames from each video, sparsely distributed. At 224×224 input resolution with 14×14 spatial patches, each frame produces (224/14)×(224/14) = 16×16 = 256 tokens. Eight frames give 8×256 = 2048 tokens, plus one [CLS] token. 3D positional embeddings encode the spatial and temporal position of each token.
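The token count above follows directly from the frame count, resolution, and patch size. A small helper makes the arithmetic explicit; the function name and its defaults are illustrative, and it assumes each patch covers a single frame (no temporal grouping of frames into one token).

```python
def tokens_per_clip(frames=8, image_size=224, patch_size=14, cls_tokens=1):
    """Count tokens for one clip: frames x (patches per side)^2, plus [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size  # 224 / 14 = 16
    per_frame = per_side * per_side      # 16 x 16 = 256
    return frames * per_frame + cls_tokens


total = tokens_per_clip()
print(total)
```

Doubling the frame count to 16 roughly doubles the sequence length, which is why sparse sampling matters for a 6B-parameter encoder.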
After the ViT encoder, an attention pooling layer aggregates the 2048 token-level features into a single video-level embedding. This is used for contrastive learning and retrieval. For token-level tasks (temporal action detection), the per-token features are used directly.
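Attention pooling can be sketched as a single query vector that scores every token and returns the weighted average. The version below is a deliberately minimal illustration in plain Python: it omits the key/value projections and multiple heads a real pooling layer would have, and the function name is hypothetical.

```python
import math


def attention_pool(tokens, query):
    """Pool token features into one vector using softmax(query . token / sqrt(d)) weights."""
    d = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens
    ]
    m = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)
    weights = [e / z for e in exp_scores]
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(tokens, query=[0.0, 0.0])
print(pooled, weights)
```

With an all-zero query every token gets the same weight, so the result is a plain mean; a trained query instead learns which tokens matter for the video-level embedding.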
The video encoder sits at the center, feeding different downstream modules depending on the task.
InternVideo2 is trained on a massive multimodal dataset of 402M entries. The data design is as important as the model architecture.
| Stage | Dataset | Size | Type |
|---|---|---|---|
| Stage 1 | KMash | 2M clips | Unlabeled video |
| Stage 2 | LAION etc. | 300M | Image-text |
| Stage 2 | WebVid + InternVid | 50M | Video-text |
| Stage 2 | InternVid2 | 50M | Video-audio-speech-text |
| Stage 3 | LLaVA, MVBench, etc. | 2.1M | Instruction tuning |
For Stage 1 (reconstruction), 2M video clips are curated from action recognition datasets (Kinetics, Something-Something, Moments in Time, ActivityNet, HACS). An extended version KMash2M adds 844K YouTube videos for diversity. Labels are discarded — only raw video is used.
The most novel data contribution is InternVid2: 100M video clips with video-audio-speech captions. Building it required an automated annotation system called VidCap:
The amount and type of data vary dramatically across the three stages. Stage 2 dominates in total volume (400M of the 402M pre-training entries).
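As a quick check, the per-stage sizes from the dataset table add up to the 402M total. The dictionary keys below are illustrative labels, not dataset names from the source.

```python
# Entry counts in millions, copied from the dataset table above.
pretraining_data = {
    "stage1_video": 2,
    "stage2_image_text": 300,
    "stage2_video_text": 50,
    "stage2_video_audio_speech_text": 50,
}

stage2_total = sum(
    size for name, size in pretraining_data.items() if name.startswith("stage2")
)
total = sum(pretraining_data.values())
stage2_share = stage2_total / total
print(stage2_total, total, round(stage2_share, 3))
```

The 2.1M Stage 3 instruction-tuning samples are not part of the 402M figure.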
InternVideo2 achieves state-of-the-art results on over 60 video and audio benchmarks. Here are the headline numbers.
| Method | K400 | K600 | SSv2 | MiT |
|---|---|---|---|---|
| VideoMAE V2-g | 90.0 (64f) | 89.9 | 77.0 | — |
| CoCa-g | 88.9 | — | — | 49.0 |
| InternVideo (v1) | 91.1 | 91.3 | — | — |
| InternVideo2-6B | 92.1 | 91.9 | 77.5 | 51.2 |
92.1% on Kinetics-400 with only 16 frames at 224×224; the previous SOTA required ensembles, higher resolution (576×576), or 64 frames.
This measures feature quality directly — freeze the backbone and train only a lightweight attention pooling + linear head:
| Method | K400 | K600 | SSv2 |
|---|---|---|---|
| ViT-22B | 88.0 | — | 67.7 |
| DINOv2-g | 83.4 | — | 50.0 |
| VideoPrism-g | 87.2 | — | 68.5 |
| InternVideo2-6B | 88.8 | 89.1 | 72.2 |
InternVideo2-clip achieves 72.7% zero-shot on K400, 3.7 points behind VideoPrism-g (76.4%), which was trained on 311M manually labeled videos. On UCF-101 and HMDB-51, InternVideo2 leads.
InternVideo2-6B vs previous best on key benchmarks. End-to-end fine-tuning results.
A true video foundation model must generalize across many task types. InternVideo2 is evaluated on tasks far beyond action classification.
Detect when actions start and end in untrimmed videos. InternVideo2 uses features from layer 7 (not the final layer) with ActionFormer as the detection head.
| Method | THUMOS14 | ActivityNet | HACS | FineAction |
|---|---|---|---|---|
| I3D + Flow | 66.8 | 35.6 | — | 17.6 |
| VideoMAE V2-g | 69.5 | 39.0 | 42.4 | 18.2 |
| InternVideo2-6B | 72.0 | 41.2 | 43.3 | 27.7 |
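Temporal action detection is commonly scored with mean average precision, where a predicted segment counts as correct only if its temporal IoU with a ground-truth segment exceeds a threshold. The sketch below shows just that overlap measure; the function name is illustrative, and the exact thresholds used for each benchmark in the table are not stated in the text.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


overlapping = temporal_iou((2.0, 6.0), (4.0, 8.0))  # 2 s shared out of 6 s covered
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))
print(overlapping, disjoint)
```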
Using Mask2Former with InternVideo2 as backbone achieves 63.4 mAP on YouTube-VIS 2019 (vs 60.3 for Swin-L), and a video-tuned version reaches 65.1 mAP.
InternVideo2's Stage 2 contrastive features excel at video retrieval. On MSR-VTT, it achieves 55.9% R@1 (text-to-video), outperforming VideoCoCa and other baselines.
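R@1 here is the fraction of text queries whose matching video is ranked first by embedding similarity. A minimal sketch of Recall@K over a toy similarity matrix follows; the function name and the scores are made up for illustration and are not benchmark data.

```python
def recall_at_k(similarity, k=1):
    """similarity[i][j] scores text query i against video j; video i is the true match."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 text queries against 3 videos.
sim = [
    [0.9, 0.1, 0.0],  # query 0: correct video ranked first
    [0.2, 0.3, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
print(r1, r2)
```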
The Stage 3 model handles open-ended video QA. On EgoSchema, a long-form video understanding benchmark, InternVideo2 achieves 63.2% accuracy, demonstrating the ability to reason over extended temporal contexts.
InternVideo2 sits at the convergence of several major research threads. Let's map the landscape.
VideoMAE V2 provides one of InternVideo2's two teacher models in Stage 1. Where VideoMAE V2 reconstructs raw pixels, InternVideo2 reconstructs teacher features. V2 is a single-modality specialist optimized for pixel reconstruction; InternVideo2 is a multi-modality generalist that uses V2's motion-aware features as a learning signal.
InternViT-6B (the vision encoder of the InternVL image-language model) provides the other teacher in Stage 1 and shapes the feature space. Stage 2's contrastive learning is inspired by CLIP but extended to video-audio-speech-text. The result is a "video CLIP" that understands not just visual-text alignment but audio and speech as well.
VideoPrism (Google, 2024) is the closest competitor. It also uses masked reconstruction + contrastive learning, but with a two-stage scheme (vs three). VideoPrism's key advantage is proprietary data (311M manually labeled videos). InternVideo2 compensates with its third stage (dialogue) and auto-generated VAS captions, achieving competitive or superior results on most benchmarks.
Stage 3 directly uses the BLIP-2 architecture for bridging vision and language. QFormer compresses the video encoder's 2048 tokens into ~32 tokens that the LLM can digest. This is not a novel contribution — InternVideo2's innovation is in the video encoder itself, not the LLM bridge.
| Aspect | InternVideo2 |
|---|---|
| Video encoder | ViT-6B (48 layers, dim 3200, patch 14) |
| Input | 8 frames, 224×224, sparse sampling |
| Stage 1 | Dual-teacher token reconstruction (InternViT-6B + VideoMAE V2-g) |
| Stage 2 | Video-text-audio contrastive + matching + MLM |
| Stage 3 | QFormer → Vicuna LLM, instruction tuning |
| Data total | 402M entries (2M video, 300M image-text, 50M video-text, 50M video-audio-speech-text) |
| Key results | 92.1% K400, 77.5% SSv2, 72.0 THUMOS14 mAP, 63.2% EgoSchema |
| Training cost | Stage 1: 256 A100, 18 days. Stage 2: 256 A100, 14 days. Stage 3: 64 A100, 3 days |
| Total compute | ~32 days on 256 A100 GPUs (Stage 1 + 2), plus 3 days on 64 A100 GPUs (Stage 3) |
| Benchmarks | SOTA on 60+ tasks across recognition, retrieval, QA, dialogue, segmentation |
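The training-cost row can be converted into GPU-days to compare the stages on one scale. The figures below are copied from that row; the GPU-day framing is an illustrative convenience, not a number reported in the text.

```python
# GPU count and wall-clock days per stage, copied from the training-cost row above.
stages = {
    "stage1": {"gpus": 256, "days": 18},
    "stage2": {"gpus": 256, "days": 14},
    "stage3": {"gpus": 64, "days": 3},
}

gpu_days = {name: s["gpus"] * s["days"] for name, s in stages.items()}
total_gpu_days = sum(gpu_days.values())
days_on_256_gpus = stages["stage1"]["days"] + stages["stage2"]["days"]
print(gpu_days, total_gpu_days, days_on_256_gpus)
```

Stages 1 and 2 together occupy the 256-GPU cluster for 32 days, and Stage 3 accounts for only a small fraction of the overall GPU-days.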