Yi Wang, Kunchang Li, Xinhao Li et al. — Shanghai AI Lab & Nanjing University, 2024

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

A 6B-parameter video encoder trained in three stages — masked reconstruction, multimodal contrastive learning, and next-token prediction — achieving SOTA on 60+ benchmarks spanning recognition, retrieval, QA, dialogue, and temporal grounding.

Prerequisites: Vision Transformers (ViT) + Contrastive learning (CLIP) + Masked autoencoders + LLMs basics

Chapter 0: The Problem

You want a single video model that can do everything: classify actions, retrieve videos from text queries, answer questions about video content, detect temporal boundaries, segment objects, and hold natural-language conversations about what happens in a clip. No existing model does all of these well.

The challenge is that these tasks require fundamentally different kinds of understanding:

Previous models pick one or two of these. InternVideo2 unifies all three.

The core insight: These three learning objectives — reconstruction, contrastive alignment, and next-token prediction — are complementary, not competing. A progressive training scheme that applies them in sequence builds a video encoder that is perceptive (stage 1), semantically grounded (stage 2), and capable of reasoning (stage 3). Each stage initializes from the previous, compounding the learned representations.
Full data flow at a glance: Video V (8 sparse frames, 224×224) → patch embed with 14×14 patches, temporal stride → 8×16×16 = 2048 tokens + 1 [CLS] + 3D positional embeddings → ViT-6B encoder (48 layers, dim 3200, 25 heads) → Stage 1: align unmasked token outputs to InternViT-6B + VideoMAE V2-g via MSE loss → Stage 2: attention pooling → contrastive loss with text encoder (BERT-Large) + audio encoder (BEATs) → Stage 3: QFormer bridge to LLM (Vicuna) for next-token prediction → outputs: video embeddings for retrieval, features for detection, dialogue responses for QA.
Figure: The Video Understanding Landscape. Different tasks require different capabilities. InternVideo2 unifies all three training paradigms to cover the full spectrum.

Why does InternVideo2 use three different training objectives instead of just one?

Chapter 1: The Three-Stage Design

InternVideo2 is built through a strict progressive training pipeline. The video encoder grows in capability with each stage, and each stage initializes from the checkpoint of the previous.

Stage 1: Token Reconstruction
Train the video encoder from scratch. Use two expert models (InternViT-6B, VideoMAE V2-g) as teachers. Mask 80% of tokens. The student learns to match the teachers' token-level outputs for unmasked positions. Objective: spatiotemporal perception. Data: 2M video clips (KMash).
Stage 2: Multimodal Alignment
Initialize from Stage 1 checkpoint. Add text encoder (BERT-Large) and audio encoder (BEATs). Train with contrastive + matching + masked language modeling losses. Objective: semantic grounding. Data: 300M image-text + 50M video-text + 50M video-audio-speech-text pairs.
Stage 3: Video-Centric Dialogue
Initialize from Stage 2 checkpoint. Connect video encoder to LLM (Vicuna) via QFormer bridge. Fine-tune with instruction data. Objective: open-ended reasoning. Data: 2.1M instruction-tuning samples from 34+ sources.
Why this order matters: Stage 1 builds a foundation of spatiotemporal understanding from unlabeled video alone. Stage 2 layers semantics on top by aligning video, text, and audio. Stage 3 adds reasoning and language generation. Reversing the order would not work — you cannot ground language in a video encoder that does not yet understand spatiotemporal structure.
Frozen vs. trained across stages: Stage 1: video encoder trained from scratch, expert teachers frozen. Stage 2 (masked phase): video encoder and text encoder trained, audio encoder frozen. Stage 2 (unmasked post-pretraining): video encoder frozen, audio and text encoders trained. Stage 3: video encoder and QFormer trained, LLM updated via LoRA only. The schedule follows one rule: freeze a module once it is mature, to prevent catastrophic forgetting.
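The schedule above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary layout, the phase keys, and the helper `trainable_modules` are hypothetical names, not part of any InternVideo2 code.

```python
# Trainability per phase, transcribed from the schedule described in the text.
# Phase keys and module names are illustrative, not taken from the paper's code.
SCHEDULE = {
    "stage1": {"video_encoder": "train", "teachers": "frozen"},
    "stage2_masked": {
        "video_encoder": "train", "audio_encoder": "frozen", "text_encoder": "train",
    },
    "stage2_unmasked": {
        "video_encoder": "frozen", "audio_encoder": "train", "text_encoder": "train",
    },
    "stage3": {"video_encoder": "train", "qformer": "train", "llm": "lora"},
}


def trainable_modules(phase):
    """Return modules that receive gradient updates (LoRA counts as trainable)."""
    return sorted(name for name, mode in SCHEDULE[phase].items() if mode != "frozen")


for phase in SCHEDULE:
    print(phase, "->", trainable_modules(phase))
```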
Why is the video encoder frozen during Stage 2's unmasked post-pretraining phase?

Chapter 2: Stage 1 — Token Reconstruction

Stage 1 trains the video encoder from scratch to build spatiotemporal understanding. But unlike VideoMAE, which reconstructs raw pixels, InternVideo2 reconstructs teacher features from two expert models.

Dual Teacher Distillation

The two teachers encode complementary knowledge:

Both teachers are frozen. The student video encoder learns to match their outputs at the token level.

The Training Procedure

For each input video V:

  1. Feed full frames into both teachers (no masking for teachers) to get reference features.
  2. Mask 80% of tokens for the student encoder.
  3. The student processes only the 20% unmasked tokens.
  4. Align student outputs to teacher outputs for unmasked positions only via MSE loss.
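$$L = \frac{1}{Z} \sum_{p} \left( \alpha_1 \left| f^V(V_p) - h(V_p) \right|^2 + \alpha_2 \left| f^V(V_p) - g(V_p) \right|^2 \right)$$

Where $f^V$ is the student, $h$ is InternViT-6B, $g$ is VideoMAE V2-g, $\alpha_1$ and $\alpha_2$ weight the two teacher terms, $p$ indexes unmasked token positions, and $Z$ is the normalization factor.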
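A minimal NumPy sketch of this loss may help make the masking and the two teacher terms concrete. It rests on several assumptions: the student outputs have already been projected to each teacher's dimension, Z is taken to be the number of unmasked tokens, and the feature dimensions are toy values. The function name `dual_teacher_loss` is hypothetical.

```python
import numpy as np


def dual_teacher_loss(student_h, student_g, teacher_h, teacher_g, keep_idx,
                      alpha1=1.0, alpha2=1.0):
    """Weighted squared error against two teachers, over unmasked tokens only.

    student_h / student_g: (K, Dh) and (K, Dg) student outputs for the K kept
        tokens, assumed already projected to each teacher's feature dimension.
    teacher_h / teacher_g: (N, Dh) and (N, Dg) teacher outputs for all N tokens.
    keep_idx: (K,) indices of the unmasked tokens.
    """
    target_h = teacher_h[keep_idx]
    target_g = teacher_g[keep_idx]
    err_h = np.sum((student_h - target_h) ** 2, axis=-1)
    err_g = np.sum((student_g - target_g) ** 2, axis=-1)
    # Assumption: Z is the number of unmasked tokens, so this is a plain mean.
    return float(np.mean(alpha1 * err_h + alpha2 * err_g))


rng = np.random.default_rng(0)
num_tokens, dim_h, dim_g = 2048, 8, 4        # toy feature dims for the demo
num_keep = round(0.2 * num_tokens)           # 80% masked, 20% kept
keep_idx = np.sort(rng.permutation(num_tokens)[:num_keep])
teacher_h = rng.normal(size=(num_tokens, dim_h))
teacher_g = rng.normal(size=(num_tokens, dim_g))

# A student that reproduces both teachers exactly has zero loss.
perfect = dual_teacher_loss(teacher_h[keep_idx], teacher_g[keep_idx],
                            teacher_h, teacher_g, keep_idx)
print(num_keep, perfect)
```

Only the kept indices enter the loss, which is why the student never needs to process the masked 80% of tokens.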

Why reconstruct features, not pixels? Pixel reconstruction wastes capacity learning irrelevant details — exact RGB values, texture noise, high-frequency artifacts. Feature reconstruction forces the student to learn the semantic essence that expert models have already distilled. The InternViT teacher embeds language priors (from its image-text pre-training), making the student multimodal-ready even before Stage 2.
Multi-layer alignment: The loss is not just at the final output. InternVideo2 aligns the last 6 layers of InternViT, the last 4 layers of VideoMAE V2, and the final [CLS] token of InternViT to corresponding layers of the student encoder via learnable MLP projection layers. After pre-training, these projection layers are discarded. This multi-layer supervision provides stronger gradient signal and forces all layers (not just the last) to learn useful representations.
Tensor shapes (6B model): Input: (B, 3, 8, 224, 224). Patch embed with 14×14: 8×16×16 = 2048 tokens of dim 3200. Masking at 80%: student processes (B, 410, 3200). InternViT teacher: processes full frames independently, produces (B, 8, 256, 3200) — 256 = 16×16 patches per frame. VideoMAE V2 teacher: processes full video, produces (B, 2048, 1408). Projection MLPs map student dim 3200 to teacher dims. MSE computed over 410 aligned token pairs per sample.
Figure: Dual Teacher Distillation. The student encoder learns from two frozen teachers. InternViT provides semantic features, VideoMAE V2 provides motion features. Alignment happens at multiple layers.

Why does InternVideo2 use both InternViT and VideoMAE V2 as teachers rather than just one?

Chapter 3: Stage 2 — Multimodal Alignment

After Stage 1, the video encoder understands spatiotemporal structure but has no concept of language or audio. Stage 2 bridges this gap by aligning video representations with text, audio, and speech.

Architecture Additions

Three new modules are introduced:

Three Training Losses

Stage 2 uses three complementary losses:

$$L = L_{\mathrm{CON}} + L_{\mathrm{MAC}} + L_{\mathrm{MLM}}$$

Contrastive loss ($L_{\mathrm{CON}}$): Standard InfoNCE. Pull matching video-text pairs together, push non-matching pairs apart. Applied across modality pairs: {video, text}, {image, text}, {video, VAS-caption}, {video+audio, VAS-caption}.

$$L_{\mathrm{CON}} = -\frac{1}{N} \sum_{i} \log \frac{\exp\left(\mathrm{sim}(f_i^V, f_i^T)/\tau\right)}{\sum_{j} \exp\left(\mathrm{sim}(f_i^V, f_j^T)/\tau\right)}$$

Matching loss ($L_{\mathrm{MAC}}$): Binary classification: is this video-text pair matched or not? Uses the multimodal decoder with cross-attention between video and text features.

Masked language modeling ($L_{\mathrm{MLM}}$): Mask tokens in the text caption and predict them conditioned on the video. Forces fine-grained video-language grounding.
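
As a rough illustration of the contrastive term, the sketch below implements the video-to-text InfoNCE direction written above in NumPy. The temperature value `tau=0.07` and the use of cosine similarity for `sim` are assumptions for the example, and `info_nce` is a hypothetical name.

```python
import numpy as np


def info_nce(video_emb, text_emb, tau=0.07):
    """Video-to-text InfoNCE over a batch of N paired embeddings.

    Row i of video_emb and row i of text_emb form the positive pair; every
    other row in the batch serves as a negative. tau=0.07 is an assumed value.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / tau                                  # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


aligned = info_nce(np.eye(4), np.eye(4))                        # pairs match
shuffled = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))   # pairs broken
print(aligned < shuffled)
```

When every embedding in the batch is identical, the loss reduces to log N, the value for a model that cannot tell positives from negatives.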

Two-Phase Training

Stage 2 is split into two phases:

  1. Masked alignment: Video encoder trained (with 80% masking for efficiency), audio encoder frozen. Uses the full dataset: 300M image-text + 50M video-text + 50M video-audio-speech-text.
  2. Unmasked post-pretraining: Video encoder frozen, audio encoder trained. Uses a smaller subset (25M image + video, 0.5M audio, 50M audio-video). No masking — ensures consistency with inference where no tokens are masked.
Why freeze the video encoder in phase 2? During phase 1, the video encoder updates rapidly under contrastive gradients. By phase 2, its features are mature. Freezing it lets the audio and text encoders catch up and align to stable video features. Without freezing, continued updates to the 6B encoder could destabilize the alignment, since the audio encoder (90M) is much smaller and adapts more slowly.
What degrades without audio alignment: InternVideo2 without audio data loses performance on audio-visual tasks and also on some pure-video tasks. The audio signal provides complementary temporal cues — a "sizzling" sound tells you cooking is happening even when the visual is ambiguous. On video retrieval benchmarks, the audio-aligned model consistently outperforms the video-only version.
Why does Stage 2 use three losses (contrastive, matching, MLM) instead of just contrastive?

Chapter 4: Stage 3 — Video-Centric Dialogue

Stages 1 and 2 produce a video encoder with excellent features for classification and retrieval. But it cannot generate text — it cannot answer "What is the person doing and why?" in natural language. Stage 3 connects the video encoder to a Large Language Model.

Architecture

The bridge between video and language is a QFormer (from BLIP-2). It takes the video encoder's output tokens and compresses them into a small set of query tokens that the LLM can process. The LLM is Vicuna (a fine-tuned LLaMA).

Video Encoder (6B)
Produces token-level features from the input video. Initialized from Stage 2.
QFormer Bridge
Cross-attends from a small set of learnable queries to the video encoder's output. Compresses 2048 video tokens into ~32 query tokens compatible with the LLM's text embedding space.
LLM (Vicuna)
Takes the 32 video query tokens + text instruction tokens. Generates responses via next-token prediction. Updated with LoRA only.
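
The compression step can be sketched as a single cross-attention layer. This is a deliberate simplification: a real QFormer stacks several transformer blocks with multi-head attention, and the feature dimensions below are toy values chosen to keep the demo light. The function `cross_attend` and the weight names are hypothetical.

```python
import numpy as np


def cross_attend(queries, tokens, w_k, w_v):
    """Single-head cross-attention from learnable queries to video tokens.

    queries: (Q, Dq) learnable query vectors
    tokens:  (N, Dv) video encoder outputs
    w_k, w_v: (Dv, Dq) key and value projections
    Returns the (Q, Dq) compressed tokens and the (Q, N) attention weights.
    """
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
num_tokens, d_video, num_queries, d_query = 2048, 64, 32, 16   # toy dims
tokens = rng.normal(size=(num_tokens, d_video))
queries = rng.normal(size=(num_queries, d_query))
w_k = rng.normal(size=(d_video, d_query)) / np.sqrt(d_video)
w_v = rng.normal(size=(d_video, d_query)) / np.sqrt(d_video)

compressed, weights = cross_attend(queries, tokens, w_k, w_v)
print(tokens.shape, "->", compressed.shape)
```

The output length depends only on the number of queries, so the LLM sees 32 tokens regardless of how many video tokens the encoder produced.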

High-Definition Post-Training

To improve fine-grained and long-video understanding, InternVideo2 adds an HD post-training stage. Each input video is split into up to 6 sub-videos at 224×224 plus 1 global resized sub-video. Training proceeds in two epochs: first with 8 frames per sub-video, then 16 frames. During this stage, both the video encoder and QFormer are updated, while the LLM uses LoRA.

Engineering trade-off: LoRA vs full fine-tuning. The LLM has ~7B parameters. Full fine-tuning would require enormous memory and risk catastrophic forgetting of the LLM's language capabilities. LoRA adds only ~0.1% trainable parameters (low-rank adaptation matrices) and preserves the LLM's pre-trained knowledge. The video encoder (6B) is fully fine-tuned here — its features should adapt to the dialogue task, since it was designed for video understanding.
Instruction tuning data: 2.1M samples from 34+ sources covering conversation, captioning, VQA, reasoning, and classification. Key sources include LLaVA, MVBench, PerceptionTestQA, TVQA, EgoTaskQA, and grounding datasets (DiDeMo, COCO). The data is intentionally diverse across tasks to prevent the model from overfitting to a single dialogue format.
Tensor shapes through Stage 3: Video: (B, 3, 8, 224, 224) → ViT-6B encoder: (B, 2049, 3200) [2048 patch tokens + 1 CLS]. QFormer: 32 learnable queries of dim 768 cross-attend to encoder output → (B, 32, 768). LLM embedding projection: (B, 32, 768) → (B, 32, 4096) to match Vicuna's hidden dim. Instruction text: tokenized and embedded to (B, L, 4096). LLM input: concat video queries + text tokens = (B, 32+L, 4096). LLM generates output autoregressively. LoRA rank: 16, applied to Q and V projection matrices in all LLM layers. For a 7B LLaMA-style model with 32 layers, each adapted matrix adds 2 × 16 × 4096 parameters, for ~8.4M trainable params out of 7B total (~0.12%), in line with the ~0.1% figure above.
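A quick count of the LoRA parameters under the stated configuration (rank 16 on the Q and V projections, hidden dim 4096) is sketched below. The layer count of 32 is an assumption based on the standard 7B LLaMA-style architecture, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(hidden_dim, num_layers, rank, matrices_per_layer):
    """Trainable parameters added by LoRA on square hidden_dim x hidden_dim matrices.

    Each adapted matrix gets two low-rank factors: A (rank x hidden_dim)
    and B (hidden_dim x rank).
    """
    per_matrix = 2 * rank * hidden_dim
    return per_matrix * matrices_per_layer * num_layers


# Assumption: 32 transformer layers, as in a 7B LLaMA-style model.
count = lora_param_count(hidden_dim=4096, num_layers=32, rank=16, matrices_per_layer=2)
share = 100 * count / 7e9
print(f"{count:,} trainable LoRA params, {share:.2f}% of 7B")
```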
Figure: Stage 3 Dialogue Pipeline. Video tokens are compressed by QFormer into a small query set, then concatenated with text instruction tokens and fed to the LLM for next-token prediction.

Why is the LLM updated with LoRA instead of full fine-tuning in Stage 3?

Chapter 5: The Video Encoder

At the heart of InternVideo2 is a massive Vision Transformer scaled to 6 billion parameters. Let's look at the architectural details.

| Property | InternVideo2-1B | InternVideo2-6B |
|---|---|---|
| Layers | ~24 | 48 |
| Hidden dim | ~1280 | 3200 |
| Attention heads | ~16 | 25 |
| Patch size | 14×14 | 14×14 |
| Input frames | 8 (sparse sampling) | 8 (sparse sampling) |
| Tokens per clip | 2048 + 1 [CLS] | 2048 + 1 [CLS] |
| Parameters | ~1B | ~6B |

Sparse Frame Sampling

InternVideo2 samples only 8 frames from each video, sparsely distributed. With 14×14 spatial patches, each frame produces 16×16 = 256 tokens. Eight frames give 8×256 = 2048 tokens plus one [CLS] token. 3D positional embeddings encode the temporal position of each frame.
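
The token arithmetic can be checked in a few lines. The sketch assumes each frame is patchified independently (temporal patch size 1), which is what the 8×256 count implies; `clip_token_count` is a hypothetical helper.

```python
def clip_token_count(num_frames=8, image_size=224, patch_size=14):
    """Return (patches per side, patches per frame, patch tokens, tokens incl. [CLS])."""
    patches_per_side = image_size // patch_size
    patches_per_frame = patches_per_side ** 2
    patch_tokens = num_frames * patches_per_frame
    return patches_per_side, patches_per_frame, patch_tokens, patch_tokens + 1


print(clip_token_count())               # 8 frames, the training setting
print(clip_token_count(num_frames=16))  # 16 frames doubles the sequence length
```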

Attention Pooling

After the ViT encoder, an attention pooling layer aggregates the 2048 token-level features into a single video-level embedding. This is used for contrastive learning and retrieval. For token-level tasks (temporal action detection), the per-token features are used directly.
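
A stripped-down version of attention pooling is sketched below: one learnable query scores every token, and the softmax weights produce a single pooled vector. Real attention pooling layers typically add key/value projections and multiple heads, so treat this as an assumption-laden illustration; `attention_pool` is a hypothetical name and the feature dimension is a toy value.

```python
import numpy as np


def attention_pool(tokens, query):
    """Pool (N, D) token features into one (D,) vector using a single query."""
    scores = tokens @ query / np.sqrt(tokens.shape[-1])   # (N,) relevance scores
    scores = scores - scores.max()                        # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ tokens, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(2048, 32))      # toy feature dim for the demo
query = rng.normal(size=32)
pooled, weights = attention_pool(tokens, query)
print(tokens.shape, "->", pooled.shape)
```

With an all-zero query the weights become uniform and the result is plain mean pooling, which shows that the learned query is what lets the model emphasize informative tokens.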

Engineering decision: 8 frames vs more. Eight frames is surprisingly few, but sparse sampling covers a wide temporal span: for a 10-second clip, frames are ~1.25 seconds apart. The model learns to reason across large temporal gaps. At test time, InternVideo2 can use 16 frames for higher temporal resolution. The 6B model at 16 frames and 224×224 is already at the limit of what fits in A100 GPU memory during training.
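One common way to implement sparse sampling is to split the video into equal segments and take the middle frame of each. The sketch below uses that scheme and a 30 fps clip as assumptions; the exact sampling rule used by InternVideo2 is not specified here, and `sparse_frame_indices` is a hypothetical helper.

```python
def sparse_frame_indices(total_frames, num_samples=8):
    """Pick the middle frame of each of num_samples equal-length segments."""
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


fps = 30                                   # assumed frame rate
indices = sparse_frame_indices(10 * fps)   # a 10-second clip
gap_seconds = (10 * fps / 8) / fps         # spacing between sampled frames
print(indices, gap_seconds)
```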
How features are extracted for different tasks: Action classification: [CLS] token or attention-pooled embedding → linear classifier. Video retrieval: attention-pooled embedding → cosine similarity with text embedding. Temporal action detection: per-token features from layer 7 (an intermediate layer, not the last!) → fed into ActionFormer detection head. Video QA/dialogue: full token sequence → QFormer → LLM. The choice of which features to extract and from which layer matters — intermediate layers retain more spatial detail, final layers have more semantic abstraction.
Figure: InternVideo2 Architecture Overview. The video encoder sits at the center, feeding different downstream modules depending on the task.

Why does temporal action detection use features from an intermediate layer (layer 7) rather than the final layer?

Chapter 6: The Data Pipeline

InternVideo2 is trained on a massive multimodal dataset of 402M entries. The data design is as important as the model architecture.

| Stage | Dataset | Size | Type |
|---|---|---|---|
| Stage 1 | KMash | 2M clips | Unlabeled video |
| Stage 2 | LAION etc. | 300M | Image-text |
| Stage 2 | WebVid + InternVid | 50M | Video-text |
| Stage 2 | InternVid2 | 50M | Video-audio-speech-text |
| Stage 3 | LLaVA, MVBench, etc. | 2.1M | Instruction tuning |
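
The 402M total can be reproduced by summing the Stage 1 and Stage 2 rows (the 2.1M instruction-tuning samples are not counted in it). The dictionary keys below are just labels for this check.

```python
# Pre-training data volumes in millions of entries, taken from the table above.
ENTRIES_M = {
    "KMash (video only)": 2,
    "image-text": 300,
    "video-text": 50,
    "video-audio-speech-text": 50,
}

total_m = sum(ENTRIES_M.values())
stage2_m = total_m - ENTRIES_M["KMash (video only)"]
print(f"total: {total_m}M, Stage 2 share: {stage2_m}M")
```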

KMash: Video-Only Data

For Stage 1 (reconstruction), video clips are curated from action recognition datasets (Kinetics, Something-Something, Moments in Time, ActivityNet, HACS). The extended version, KMash2M, adds 844K YouTube videos for diversity, bringing the total to about 2M clips. Labels are discarded; only raw video is used.

InternVid2: Multimodal Video Captions

The most novel data contribution is InternVid2: 100M video clips with video-audio-speech captions. Building it required an automated annotation system called VidCap:

Video Captioner
Generates visual descriptions: "A green tractor with cables attached to it"
Audio Captioner
Generates audio descriptions: "A man is speaking and engine operates in background"
Speech Captioner (Whisper)
Transcribes speech: "I'll show you what they do"
LLM Fusion (Vicuna)
Fuses all three: "As the man talks about the tractor's capabilities, it is attached by cables and engine operates in the background"
Temporal consistency matters: Videos must be segmented into clips at semantically meaningful boundaries. InternVideo2 uses AutoShot (a temporal boundary detection model) instead of FFmpeg's SceneDet filter. AutoShot predicts boundaries based on semantic variations, not pixel differences. This prevents clips from mixing frames with inconsistent context — a critical quality factor for learning aligned representations.
What degrades with poor captions: The quality of video-text alignment directly impacts Stage 2 performance. WebVid's alt-text captions are noisy and often unrelated to the video. InternVid's generated captions are better but miss audio and speech context. InternVid2's fused VAS captions capture what you see, hear, and what is said — all three perspectives. Ablations show each additional modality in the caption improves downstream retrieval and QA performance.
Figure: Data Scale by Training Stage. The amount and type of data varies dramatically across the three stages. Stage 2 dominates in total volume (400M+ entries).

Why does InternVideo2 use semantically-segmented clips (via AutoShot) instead of fixed-duration clips?

Chapter 7: Results

InternVideo2 achieves state-of-the-art results on over 60 video and audio benchmarks. Here are the headline numbers.

Action Recognition (End-to-End Fine-tuning)

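| Method | K400 | K600 | SSv2 | MiT |
|---|---|---|---|---|
| VideoMAE V2-g | 90.0 (64f) | 89.9 | 77.0 | - |
| CoCa-g | 88.9 | - | - | 49.0 |
| InternVideo (v1) | 91.1 | 91.3 | - | - |
| InternVideo2-6B | 92.1 | 91.9 | 77.5 | 51.2 |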

92.1% on Kinetics-400 with only 16 frames at 224×224. The previous SOTA required ensembles, higher resolution (576×576), or 64 frames.

Attentive Probing (Frozen Backbone)

This measures feature quality directly — freeze the backbone and train only a lightweight attention pooling + linear head:

| Method | K400 | K600 | SSv2 |
|---|---|---|---|
| ViT-22B | 88.0 | - | 67.7 |
| DINOv2-g | 83.4 | - | 50.0 |
| VideoPrism-g | 87.2 | - | 68.5 |
| InternVideo2-6B | 88.8 | 89.1 | 72.2 |
The SSv2 gap is telling: InternVideo2-6B achieves 72.2% on Something-Something V2 with a frozen backbone, 3.7 points above VideoPrism-g (68.5%). SSv2 specifically tests temporal understanding. This suggests that the three-stage training produces genuinely temporal features, not just spatial features that happen to work on video.

Zero-Shot Recognition

InternVideo2-clip achieves 72.7% on K400 zero-shot — competitive with VideoPrism-g (76.4%), which was trained on 311M manually-labeled videos. On UCF-101 and HMDB-51, InternVideo2 leads.

Figure: SOTA Comparison. InternVideo2-6B vs previous best on key benchmarks (end-to-end fine-tuning results).

What does InternVideo2's strong frozen-backbone performance on SSv2 demonstrate?

Chapter 8: Downstream Tasks

A true video foundation model must generalize across many task types. InternVideo2 is evaluated on tasks far beyond action classification.

Temporal Action Localization

Detect when actions start and end in untrimmed videos. InternVideo2 uses features from layer 7 (not the final layer) with ActionFormer as the detection head.

| Method | THUMOS14 | ActivityNet | HACS | FineAction |
|---|---|---|---|---|
| I3D + Flow | 66.8 | 35.6 | - | 17.6 |
| VideoMAE V2-g | 69.5 | 39.0 | 42.4 | 18.2 |
| InternVideo2-6B | 72.0 | 41.2 | 43.3 | 27.7 |

Video Instance Segmentation

Using Mask2Former with InternVideo2 as backbone achieves 63.4 mAP on YouTube-VIS 2019 (vs 60.3 for Swin-L), and a video-tuned version reaches 65.1 mAP.

Video-Text Retrieval

InternVideo2's Stage 2 contrastive features excel at video retrieval. On MSR-VTT, it achieves 55.9% R@1 (text-to-video), outperforming VideoCoCa and other baselines.
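
For readers unfamiliar with the metric, text-to-video R@1 is the fraction of text queries whose ground-truth video is ranked first by similarity. The sketch below computes Recall@K from a similarity matrix, assuming query i is paired with video i; the scores are made-up toy values and `recall_at_k` is a hypothetical helper.

```python
import numpy as np


def recall_at_k(similarity, k=1):
    """Fraction of queries whose paired item appears among the top-k results.

    similarity[i, j] is the score of text query i against video j, and the
    ground-truth match for query i is assumed to be video i.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())


# Toy scores: queries 0 and 1 rank their own video first, query 2 does not.
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.6],
])
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
print(r1, r2)
```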

Video Question Answering

The Stage 3 model handles open-ended video QA. On benchmarks like EgoSchema (long-form video understanding), InternVideo2 achieves 63.2% accuracy, demonstrating the ability to reason over extended temporal contexts.

The universality argument: InternVideo2 holds SOTA or near-SOTA on action recognition (what), temporal detection (when), instance segmentation (where), retrieval (matching), QA (reasoning), and dialogue (generation). No previous single model covered all these. The three-stage training creates a truly general-purpose video encoder because each stage adds a different dimension of understanding.
What degrades at the extremes: On Kinetics zero-shot, InternVideo2-6B is slightly weaker than InternVideo2-1B (72.7 vs 73.1 on K400). The larger model, trained with more diverse data in Stage 2, partially forgets the distribution of Stage 1 pre-training data. On FineAction (fine-grained temporal detection), scaling from 1B to 6B barely helps (27.7 vs 27.2), suggesting fine-grained discrimination may need better data annotations rather than more parameters.
Why does InternVideo2-6B sometimes underperform InternVideo2-1B on zero-shot recognition?

Chapter 9: Connections

InternVideo2 sits at the convergence of several major research threads. Let's map the landscape.

Relation to VideoMAE V2

VideoMAE V2 provides one of InternVideo2's two teacher models in Stage 1. Where VideoMAE V2 reconstructs raw pixels, InternVideo2 reconstructs teacher features. VideoMAE V2 is a single-modality specialist optimized for pixel reconstruction; InternVideo2 is a multi-modality generalist that uses VideoMAE V2's motion-aware features as a learning signal.

Relation to CLIP / InternVL

InternViT-6B (the vision encoder of the InternVL image-language model) provides the other teacher in Stage 1 and shapes the feature space. Stage 2's contrastive learning is inspired by CLIP but extended to video-audio-speech-text. The result is a "video CLIP" that understands not just visual-text alignment but audio and speech as well.

Relation to VideoPrism

VideoPrism (Google, 2024) is the closest competitor. It also uses masked reconstruction + contrastive learning, but with a two-stage scheme (vs three). VideoPrism's key advantage is proprietary data (311M manually-labeled videos). InternVideo2 compensates with its third stage (dialogue) and auto-generated VAS captions, achieving competitive or superior results on most benchmarks.

Relation to BLIP-2 / QFormer

Stage 3 directly uses the BLIP-2 architecture for bridging vision and language. QFormer compresses the video encoder's 2048 tokens into ~32 tokens that the LLM can digest. This is not a novel contribution — InternVideo2's innovation is in the video encoder itself, not the LLM bridge.

Cheat Sheet

| Aspect | InternVideo2 |
|---|---|
| Video encoder | ViT-6B (48 layers, dim 3200, patch 14) |
| Input | 8 frames, 224×224, sparse sampling |
| Stage 1 | Dual-teacher token reconstruction (InternViT-6B + VideoMAE V2-g) |
| Stage 2 | Video-text-audio contrastive + matching + MLM |
| Stage 3 | QFormer → Vicuna LLM, instruction tuning |
| Data total | 402M entries (2M video, 300M image-text, 50M video-text, 50M video-audio-speech-text) |
| Key results | 92.1% K400, 77.5% SSv2, 72.0 THUMOS14 mAP, 63.2% EgoSchema |
| Training cost | Stage 1: 256 A100, 18 days. Stage 2: 256 A100, 14 days. Stage 3: 64 A100, 3 days |
| Total compute | ~32 days on 256 A100 GPUs (Stages 1 + 2), plus 3 days on 64 A100 GPUs (Stage 3) |
| Benchmarks | SOTA on 60+ tasks across recognition, retrieval, QA, dialogue, segmentation |
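
As a rough tally, the per-stage training cost can be converted to GPU-days. The sketch assumes the reported durations are wall-clock days with every GPU busy throughout, which is a simplification.

```python
# (stage, number of A100 GPUs, days) as listed in the training cost row.
STAGES = [("Stage 1", 256, 18), ("Stage 2", 256, 14), ("Stage 3", 64, 3)]

gpu_days = {name: gpus * days for name, gpus, days in STAGES}
total_gpu_days = sum(gpu_days.values())
wall_clock_days = sum(days for _, _, days in STAGES)

for name, value in gpu_days.items():
    print(f"{name}: {value} GPU-days")
print(f"total: {total_gpu_days} GPU-days over {wall_clock_days} days")
```

Stage 3 accounts for only a small fraction of the total, so nearly all of the compute goes into building the encoder in Stages 1 and 2.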
The broader lesson: The video foundation model problem is not just about bigger models or more data. It requires a progressive training recipe that layers complementary objectives. Each stage builds a different capability. The encoder that emerges is more than the sum of its parts — it inherits spatiotemporal perception from reconstruction, semantic grounding from contrastive learning, and reasoning from language model alignment.
What is the key architectural difference between InternVideo2 and VideoPrism?