InternVideo2 — Veanors

Chapter 0: The Problem

You want a single video model that can do everything: classify actions, retrieve videos from text queries, answer questions about video content, detect temporal boundaries, segment objects, and hold natural-language conversations about what happens in a clip. No existing model does all of these well.

The challenge is that these tasks require fundamentally different kinds of understanding:

Spatiotemporal perception: Recognizing that someone is "stirring soup" requires understanding spatial structure and temporal motion. Pure reconstruction objectives (like VideoMAE) excel here.
Semantic alignment: Matching a video of a cat jumping to the text "a cat leaps off a table" requires cross-modal understanding. Contrastive learning (like CLIP) excels here.
Open-ended reasoning: Answering "why did the person laugh?" requires high-level reasoning and language generation. Next-token prediction with LLMs excels here.

Previous models pick one or two of these. InternVideo2 unifies all three.

The core insight: These three learning objectives — reconstruction, contrastive alignment, and next-token prediction — are complementary, not competing. A progressive training scheme that applies them in sequence builds a video encoder that is perceptive (stage 1), semantically grounded (stage 2), and capable of reasoning (stage 3). Each stage initializes from the previous, compounding the learned representations.

Full data flow at a glance: Video V (8 sparse frames, 224×224) → patch embed with 14×14 patches, temporal stride → 8×16×16 = 2048 tokens + 1 [CLS] + 3D positional embeddings → ViT-6B encoder (48 layers, dim 3200, 25 heads) → Stage 1: align unmasked token outputs to InternViT-6B + VideoMAE V2-g via MSE loss → Stage 2: attention pooling → contrastive loss with text encoder (BERT-Large) + audio encoder (BEATs) → Stage 3: QFormer bridge to LLM (Vicuna) for next-token prediction → outputs: video embeddings for retrieval, features for detection, dialogue responses for QA.

The Video Understanding Landscape

Different tasks require different capabilities. InternVideo2 unifies all three training paradigms to cover the full spectrum.

Why does InternVideo2 use three different training objectives instead of just one?

Because different video tasks require fundamentally different capabilities (reconstruction builds spatiotemporal perception, contrastive learning builds semantic grounding, and next-token prediction enables reasoning), and these objectives are complementary
Because each stage uses a different dataset size
Because three objectives converge faster than one

Chapter 1: The Three-Stage Design

InternVideo2 is built through a strict progressive training pipeline. The video encoder grows in capability with each stage, and each stage initializes from the checkpoint of the previous.

Stage 1: Token Reconstruction

Train the video encoder from scratch. Use two expert models (InternViT-6B, VideoMAE V2-g) as teachers. Mask 80% of tokens. The student learns to match the teachers' token-level outputs for unmasked positions. Objective: spatiotemporal perception. Data: 2M video clips (KMash).

↓

Stage 2: Multimodal Alignment

Initialize from Stage 1 checkpoint. Add text encoder (BERT-Large) and audio encoder (BEATs). Train with contrastive + matching + masked language modeling losses. Objective: semantic grounding. Data: 300M image-text + 50M video-text + 50M video-audio-speech-text pairs.

↓

Stage 3: Video-Centric Dialogue

Initialize from Stage 2 checkpoint. Connect video encoder to LLM (Vicuna) via QFormer bridge. Fine-tune with instruction data. Objective: open-ended reasoning. Data: 2.1M instruction-tuning samples from 34+ sources.

Why this order matters: Stage 1 builds a foundation of spatiotemporal understanding from unlabeled video alone. Stage 2 layers semantics on top by aligning video, text, and audio. Stage 3 adds reasoning and language generation. Reversing the order would not work — you cannot ground language in a video encoder that does not yet understand spatiotemporal structure.

Frozen vs. trained across stages:
Stage 1: video encoder trained from scratch; expert teachers frozen.
Stage 2 (masked phase): video encoder trained; text encoder trained; audio encoder frozen.
Stage 2 (unmasked post-pretraining): video encoder frozen; audio encoder trained; text encoder trained.
Stage 3: video encoder trained; QFormer trained; LLM updated via LoRA only.
The freezing/training schedule is deliberate: a module is frozen once it is mature, to prevent catastrophic forgetting.
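The schedule above can be written down as a small lookup table. This is only a sketch: the stage keys, module names, and the `trainable_modules` helper are illustrative and are not taken from any released training code.

```python
# Which modules receive gradient updates in each stage, as listed in the text.
# All key and module names here are hypothetical labels for illustration.
TRAINING_SCHEDULE = {
    "stage1": {"video_encoder": "train", "teachers": "frozen"},
    "stage2_masked": {
        "video_encoder": "train",
        "text_encoder": "train",
        "audio_encoder": "frozen",
    },
    "stage2_unmasked": {
        "video_encoder": "frozen",
        "text_encoder": "train",
        "audio_encoder": "train",
    },
    "stage3": {"video_encoder": "train", "qformer": "train", "llm": "lora"},
}


def trainable_modules(stage):
    """Return the modules that are updated (fully or via LoRA) in a stage."""
    modes = TRAINING_SCHEDULE[stage]
    return sorted(name for name, mode in modes.items() if mode != "frozen")


print(trainable_modules("stage2_unmasked"))
```

In a real framework each entry would translate into toggling gradient tracking on the corresponding module before building the optimizer.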

Why is the video encoder frozen during Stage 2's unmasked post-pretraining phase?

To prevent catastrophic forgetting of the spatiotemporal features learned in Stage 1 and the masked alignment phase, while allowing the audio and text encoders to align to the now-mature video representations
To reduce training cost
Because the video encoder has already converged

Chapter 2: Stage 1 — Token Reconstruction

Stage 1 trains the video encoder from scratch to build spatiotemporal understanding. But unlike VideoMAE, which reconstructs raw pixels, InternVideo2 reconstructs teacher features from two expert models.

Dual Teacher Distillation

The two teachers encode complementary knowledge:

InternViT-6B: A 6B-parameter image ViT from InternVL, pre-trained on massive image-text data. Provides rich multimodal-friendly semantic features. Knows what things look like and how they relate to language.
VideoMAE V2-g: A 1B-parameter video ViT pre-trained with masked autoencoding. Provides motion-aware temporal features. Knows how things move.

Both teachers are frozen. The student video encoder learns to match their outputs at the token level.

The Training Procedure

For each input video V:

Feed full frames into both teachers (no masking for teachers) to get reference features.
Mask 80% of tokens for the student encoder.
The student processes only the 20% unmasked tokens.
Align student outputs to teacher outputs for unmasked positions only via MSE loss.

L = (1/Z) Σ_p (α₁|f^V(V_p) − h(V_p)|² + α₂|f^V(V_p) − g(V_p)|²)

Where f^V is the student, h is InternViT-6B, g is VideoMAE V2-g, p indexes unmasked token positions, and Z normalizes.
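A minimal pure-Python sketch of this loss for a single clip follows. It assumes the student features have already been projected to each teacher's dimension, and it takes Z to be the number of unmasked tokens; both are assumptions made for illustration, and the function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared L2 distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dual_teacher_loss(student_to_h, student_to_g, teacher_h, teacher_g,
                      alpha1=1.0, alpha2=1.0):
    """Masked dual-teacher alignment loss for one clip.

    student_to_h / student_to_g: dict {token position p: student feature
        projected to that teacher's dimension}. Only unmasked positions
        appear, because the student never sees masked tokens.
    teacher_h / teacher_g: lists of teacher features for ALL positions.
    """
    unmasked = sorted(student_to_h)
    total = 0.0
    for p in unmasked:
        total += alpha1 * squared_distance(student_to_h[p], teacher_h[p])
        total += alpha2 * squared_distance(student_to_g[p], teacher_g[p])
    return total / len(unmasked)  # assumption: Z = number of unmasked tokens


# Toy clip with 4 token positions; the student sees positions 1 and 3 only.
teacher_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
teacher_g = [[0.5], [0.5], [0.5], [0.5]]
student_to_h = {1: [0.0, 0.0], 3: [0.0, 0.0]}
student_to_g = {1: [0.5], 3: [1.5]}

loss = dual_teacher_loss(student_to_h, student_to_g, teacher_h, teacher_g)
print(loss)
```

Masked positions contribute nothing to the sum: the teachers compute features for them, but the student has no output there to compare against.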

Why reconstruct features, not pixels? Pixel reconstruction wastes capacity learning irrelevant details — exact RGB values, texture noise, high-frequency artifacts. Feature reconstruction forces the student to learn the semantic essence that expert models have already distilled. The InternViT teacher embeds language priors (from its image-text pre-training), making the student multimodal-ready even before Stage 2.

Multi-layer alignment: The loss is not just at the final output. InternVideo2 aligns the last 6 layers of InternViT, the last 4 layers of VideoMAE V2, and the final [CLS] token of InternViT to corresponding layers of the student encoder via learnable MLP projection layers. After pre-training, these projection layers are discarded. This multi-layer supervision provides stronger gradient signal and forces all layers (not just the last) to learn useful representations.

Tensor shapes (6B model): Input: (B, 3, 8, 224, 224). Patch embed with 14×14: 8×16×16 = 2048 tokens of dim 3200. Masking at 80%: student processes (B, 410, 3200). InternViT teacher: processes full frames independently, produces (B, 8, 256, 3200) — 256 = 16×16 patches per frame. VideoMAE V2 teacher: processes full video, produces (B, 2048, 1408). Projection MLPs map student dim 3200 to teacher dims. MSE computed over 410 aligned token pairs per sample.
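The token counts quoted above follow from simple arithmetic, which this short sketch reproduces. The helper name `token_budget` is hypothetical, and rounding the visible-token count to the nearest integer is an assumption about how the 20% is discretized.

```python
def token_budget(frames=8, height=224, width=224, patch=14, mask_ratio=0.8):
    """Return (tokens_per_frame, total_tokens, visible_tokens) for one clip."""
    tokens_per_frame = (height // patch) * (width // patch)  # 16 * 16
    total_tokens = frames * tokens_per_frame
    # The student only processes the unmasked fraction of the tokens.
    visible_tokens = round(total_tokens * (1.0 - mask_ratio))
    return tokens_per_frame, total_tokens, visible_tokens


per_frame, total, visible = token_budget()
print(per_frame, total, visible)
```

With the defaults this gives 256 tokens per frame, 2048 tokens per clip, and 410 tokens seen by the student.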

Dual Teacher Distillation

The student encoder learns from two frozen teachers. InternViT provides semantic features, VideoMAE V2 provides motion features. Alignment happens at multiple layers.

Why does InternVideo2 use both InternViT and VideoMAE V2 as teachers rather than just one?

They encode complementary knowledge: InternViT provides multimodal-friendly semantic features (what things look like and how they relate to language), while VideoMAE V2 provides motion-aware temporal features (how things move)
Using two teachers doubles the training speed
Two teachers prevent mode collapse during training

Chapter 3: Stage 2 — Multimodal Alignment

After Stage 1, the video encoder understands spatiotemporal structure but has no concept of language or audio. Stage 2 bridges this gap by aligning video representations with text, audio, and speech.

Architecture Additions

Three new modules are introduced:

Text encoder: First 19 layers of BERT-Large, initialized from pre-trained weights. Encodes text captions into embeddings.
Multimodal decoder: Remaining 5 layers of BERT-Large with cross-attention layers. Fuses video and text for matching and MLM tasks.
Audio encoder: 12-layer transformer initialized from BEATs (90M params). Takes 64-dim log Mel spectrograms from 10-second clips.

Three Training Losses

Stage 2 uses three complementary losses:

L = L_CON + L_MAC + L_MLM

Contrastive loss (L_CON): Standard InfoNCE. Pull matching video-text pairs together, push non-matching apart. Applied across modality pairs: {video, text}, {image, text}, {video, VAS-caption}, {video+audio, VAS-caption}.

L_CON = −(1/N) Σ_i log(exp(sim(f_i^V, f_i^T)/τ) / Σ_j exp(sim(f_i^V, f_j^T)/τ))
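The formula can be traced with a small pure-Python sketch of the video-to-text direction. It assumes cosine similarity for sim(·,·) and a temperature of τ = 0.07; the temperature value is an illustrative assumption, not a figure from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, text_embs, tau=0.07):
    """Video-to-text InfoNCE: pair i is the positive for video i."""
    n = len(video_embs)
    loss = 0.0
    for i in range(n):
        logits = [cosine(video_embs[i], t) / tau for t in text_embs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss -= logits[i] - log_denominator
    return loss / n


videos = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(videos, matched_texts)
loss_swapped = info_nce(videos, swapped_texts)
print(loss_matched < loss_swapped)
```

Correctly paired embeddings give a loss near zero, while swapping the captions makes every positive the least similar candidate and drives the loss up.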

Matching loss (L_MAC): Binary classification — is this video-text pair matched or not? Uses the multimodal decoder with cross-attention between video and text features.

Masked language modeling (L_MLM): Mask tokens in the text caption and predict them conditioned on the video. Forces fine-grained video-language grounding.

Two-Phase Training

Stage 2 is split into two phases:

Masked alignment: Video encoder trained (with 80% masking for efficiency), audio encoder frozen. Uses the full dataset: 300M image-text + 50M video-text + 50M video-audio-speech-text.
Unmasked post-pretraining: Video encoder frozen, audio encoder trained. Uses a smaller subset (25M image + video, 0.5M audio, 50M audio-video). No masking — ensures consistency with inference where no tokens are masked.

Why freeze the video encoder in phase 2? During phase 1, the video encoder updates rapidly under contrastive gradients. By phase 2, its features are mature. Freezing it lets the audio and text encoders catch up and align to stable video features. Without freezing, continued updates to the 6B encoder could destabilize the alignment, since the audio encoder (90M) is much smaller and adapts more slowly.

What degrades without audio alignment: InternVideo2 without audio data loses performance on audio-visual tasks and also on some pure-video tasks. The audio signal provides complementary temporal cues — a "sizzling" sound tells you cooking is happening even when the visual is ambiguous. On video retrieval benchmarks, the audio-aligned model consistently outperforms the video-only version.

Why does Stage 2 use three losses (contrastive, matching, MLM) instead of just contrastive?

Contrastive loss provides coarse global alignment (video and text embeddings should be close), matching loss adds fine-grained cross-modal reasoning (is this specific pair matched?), and MLM forces token-level grounding (predict masked words from video context)
Three losses train three times faster
Each loss handles a different modality (video, audio, text)

Chapter 4: Stage 3 — Video-Centric Dialogue

Stages 1 and 2 produce a video encoder with excellent features for classification and retrieval. But it cannot generate text — it cannot answer "What is the person doing and why?" in natural language. Stage 3 connects the video encoder to a Large Language Model.

Architecture

The bridge between video and language is a QFormer (from BLIP-2). It takes the video encoder's output tokens and compresses them into a small set of query tokens that the LLM can process. The LLM is Vicuna (a fine-tuned LLaMA).

Video Encoder (6B)

Produces token-level features from the input video. Initialized from Stage 2.

↓

QFormer Bridge

Cross-attends from a small set of learnable queries to the video encoder's output. Compresses 2048 video tokens into ~32 query tokens compatible with the LLM's text embedding space.

↓

LLM (Vicuna)

Takes the 32 video query tokens + text instruction tokens. Generates responses via next-token prediction. Updated with LoRA only.

High-Definition Post-Training

To improve fine-grained and long-video understanding, InternVideo2 adds an HD post-training stage. Each input video is split into up to 6 sub-videos at 224×224 plus 1 global resized sub-video. Training proceeds in two epochs: first with 8 frames per sub-video, then 16 frames. During this stage, both the video encoder and QFormer are updated, while the LLM uses LoRA.

Engineering trade-off: LoRA vs full fine-tuning. The LLM has ~7B parameters. Full fine-tuning would require enormous memory and risk catastrophic forgetting of the LLM's language capabilities. LoRA adds only ~0.1% trainable parameters (low-rank adaptation matrices) and preserves the LLM's pre-trained knowledge. The video encoder (6B) is fully fine-tuned here — its features should adapt to the dialogue task, since it was designed for video understanding.

Instruction tuning data: 2.1M samples from 34+ sources covering conversation, captioning, VQA, reasoning, and classification. Key sources include LLaVA, MVBench, PerceptionTestQA, TVQA, EgoTaskQA, and grounding datasets (DiDeMo, COCO). The data is intentionally diverse across tasks to prevent the model from overfitting to a single dialogue format.

Tensor shapes through Stage 3: Video: (B, 3, 8, 224, 224) → ViT-6B encoder: (B, 2049, 3200) [2048 patch tokens + 1 CLS]. QFormer: 32 learnable queries of dim 768 cross-attend to encoder output → (B, 32, 768). LLM embedding projection: (B, 32, 768) → (B, 32, 4096) to match Vicuna's hidden dim. Instruction text: tokenized to (B, L, 4096). LLM input: concat video queries + text tokens = (B, 32+L, 4096). LLM generates output autoregressively. LoRA rank: 16, applied to Q and V projection matrices in all LLM layers. With 32 layers of 4096×4096 projections, this adds ~8.4M trainable params out of 7B total (about 0.12%).
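As a back-of-the-envelope check on the LoRA overhead, the sketch below counts adapter parameters. It assumes 32 transformer layers with square 4096×4096 Q and V projections (the Vicuna-7B configuration) and the usual LoRA factorization into a d×r and an r×d matrix; the helper name is hypothetical.

```python
def lora_param_count(hidden_dim, rank, num_layers, matrices_per_layer):
    """Trainable parameters added by LoRA on square projection matrices."""
    # Each adapted matrix gets A (hidden_dim x rank) and B (rank x hidden_dim).
    per_matrix = 2 * hidden_dim * rank
    return per_matrix * matrices_per_layer * num_layers


# Rank 16 on the Q and V projections (2 matrices) in every layer.
lora_params = lora_param_count(
    hidden_dim=4096, rank=16, num_layers=32, matrices_per_layer=2
)
fraction = lora_params / 7e9  # relative to roughly 7B base parameters

print(lora_params, f"{fraction:.2%}")
```

Under these assumptions the adapters add about 8.4M parameters, roughly 0.12% of the base model, which is in line with the ~0.1% figure quoted in the LoRA trade-off discussion.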

Stage 3 Dialogue Pipeline

Video tokens are compressed by QFormer into a small query set, then concatenated with text instruction tokens and fed to the LLM for next-token prediction.

Why is the LLM updated with LoRA instead of full fine-tuning in Stage 3?

Full fine-tuning of a 7B LLM would require enormous memory and risk catastrophic forgetting of language capabilities; LoRA adds minimal trainable parameters while preserving the pre-trained knowledge
LoRA makes the model smaller at inference
The LLM does not need to learn from video data

Chapter 5: The Video Encoder

At the heart of InternVideo2 is a massive Vision Transformer scaled to 6 billion parameters. Let's look at the architectural details.

Property	InternVideo2-1B	InternVideo2-6B
Layers	~24	48
Hidden dim	~1280	3200
Attention heads	~16	25
Patch size	14×14	14×14
Input frames	8 (sparse sampling)	8 (sparse sampling)
Tokens per clip	2048 + 1 [CLS]	2048 + 1 [CLS]
Parameters	~1B	~6B

Sparse Frame Sampling

InternVideo2 samples only 8 frames from each video, sparsely distributed. With 14×14 spatial patches, each frame produces 16×16 = 256 tokens. Eight frames give 8×256 = 2048 tokens plus one [CLS] token. 3D positional embeddings encode each token's spatial location and the temporal position of its frame.
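One common way to realize sparse sampling is to split the video into equal segments and take the middle frame of each. The sketch below uses that scheme as an assumption; the text does not specify the exact sampling rule, and the function name is hypothetical.

```python
def sparse_frame_indices(num_video_frames, num_samples=8):
    """Pick the centre frame of each of num_samples equal-length segments."""
    segment = num_video_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# A 10-second clip at 30 fps has 300 frames.
indices = sparse_frame_indices(300, num_samples=8)
print(indices)
```

For this 300-frame clip the selected frames are about 37.5 frames apart, which is 1.25 seconds at 30 fps.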

Attention Pooling

After the ViT encoder, an attention pooling layer aggregates the 2048 token-level features into a single video-level embedding. This is used for contrastive learning and retrieval. For token-level tasks (temporal action detection), the per-token features are used directly.
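A stripped-down sketch of attention pooling with a single query vector is shown below. It omits the learned key, value, and output projections and the multiple heads that a real pooling layer would normally have, so it should be read as an illustration of the weighting idea only.

```python
import math


def attention_pool(tokens, query):
    """Softmax-weighted average of token features, scored against one query."""
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
        for token in tokens
    ]
    # Numerically stable softmax over the token axis.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    return [
        sum(w * token[k] for w, token in zip(weights, tokens))
        for k in range(dim)
    ]


tokens = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention_pool(tokens, query=[0.0, 0.0])    # equal weights
focused = attention_pool(tokens, query=[10.0, 0.0])   # favours the first token
print(uniform, focused)
```

A zero query reduces to plain average pooling, while a trained query lets the model emphasize the tokens most useful for the video-level embedding.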

Engineering decision: 8 frames vs more. Eight frames is surprisingly few, but sparse sampling covers a wide temporal span: for a 10-second clip, frames are ~1.25 seconds apart. The model learns to reason across large temporal gaps. At test time, InternVideo2 can use 16 frames for finer temporal resolution. The 6B model at 16 frames and 224×224 is already at the limit of what fits in A100 GPU memory during training.

How features are extracted for different tasks:
Action classification: [CLS] token or attention-pooled embedding → linear classifier.
Video retrieval: attention-pooled embedding → cosine similarity with text embedding.
Temporal action detection: per-token features from layer 7 (an intermediate layer, not the last) → ActionFormer detection head.
Video QA/dialogue: full token sequence → QFormer → LLM.
Which features are extracted, and from which layer, matters: intermediate layers retain more spatial detail, while final layers carry more semantic abstraction.

InternVideo2 Architecture Overview

The video encoder sits at the center, feeding different downstream modules depending on the task.

Why does temporal action detection use features from an intermediate layer (layer 7) rather than the final layer?

Intermediate layers retain more spatial and temporal detail needed for precise boundary localization, while final layers are more semantically abstract and better suited for classification
Layer 7 is faster to compute
The final layer has already been compressed by attention pooling

Chapter 6: The Data Pipeline

InternVideo2 is trained on a massive multimodal dataset of 402M entries. The data design is as important as the model architecture.

Stage	Dataset	Size	Type
Stage 1	KMash	2M clips	Unlabeled video
Stage 2	LAION etc.	300M	Image-text
Stage 2	WebVid + InternVid	50M	Video-text
Stage 2	InternVid2	50M	Video-audio-speech-text
Stage 3	LLaVA, MVBench, etc.	2.1M	Instruction tuning
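The 402M headline figure can be checked against the pre-training rows of the table. The dictionary keys below are illustrative labels; the sizes are the ones listed above, in millions.

```python
# Pre-training data sizes from the table, in millions of entries.
PRETRAIN_DATA_MILLIONS = {
    "stage1_video_only": 2,
    "stage2_image_text": 300,
    "stage2_video_text": 50,
    "stage2_video_audio_speech_text": 50,
}

total_millions = sum(PRETRAIN_DATA_MILLIONS.values())
stage2_millions = sum(
    size for name, size in PRETRAIN_DATA_MILLIONS.items()
    if name.startswith("stage2")
)

print(total_millions, stage2_millions)
```

Stages 1 and 2 together account for the 402M entries; the 2.1M Stage 3 instruction samples are counted separately.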

KMash: Video-Only Data

For Stage 1 (reconstruction), video clips are curated from action recognition datasets (Kinetics, Something-Something, Moments in Time, ActivityNet, HACS). The extended version, KMash2M, adds 844K YouTube videos for diversity, bringing the total to about 2M clips. Labels are discarded; only raw video is used.

InternVid2: Multimodal Video Captions

The most novel data contribution is InternVid2: 100M video clips with video-audio-speech captions. Building it required an automated annotation system called VidCap:

Video Captioner

Generates visual descriptions: "A green tractor with cables attached to it"

↓

Audio Captioner

Generates audio descriptions: "A man is speaking and engine operates in background"

↓

Speech Captioner (Whisper)

Transcribes speech: "I'll show you what they do"

↓

LLM Fusion (Vicuna)

Fuses all three: "As the man talks about the tractor's capabilities, it is attached by cables and engine operates in the background"
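The final fusion step can be pictured as assembling one prompt from the three captions and handing it to the LLM. The sketch below only builds such a prompt; the wording of the instruction and the function name are hypothetical, and the actual VidCap prompt is not given in the text.

```python
def build_fusion_prompt(video_caption, audio_caption, speech_transcript):
    """Assemble a single LLM prompt from the three per-modality captions."""
    return (
        "Merge the following descriptions of one video clip "
        "into a single caption.\n"
        f"Visual: {video_caption}\n"
        f"Audio: {audio_caption}\n"
        f"Speech: {speech_transcript}\n"
        "Caption:"
    )


prompt = build_fusion_prompt(
    video_caption="A green tractor with cables attached to it",
    audio_caption="A man is speaking and engine operates in background",
    speech_transcript="I'll show you what they do",
)
print(prompt)
```

The LLM's completion of this prompt would play the role of the fused video-audio-speech caption shown in the example above.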

Temporal consistency matters: Videos must be segmented into clips at semantically meaningful boundaries. InternVideo2 uses AutoShot (a temporal boundary detection model) instead of FFmpeg's SceneDet filter. AutoShot predicts boundaries based on semantic variations, not pixel differences. This prevents clips from mixing frames with inconsistent context — a critical quality factor for learning aligned representations.

What degrades with poor captions: The quality of video-text alignment directly impacts Stage 2 performance. WebVid's alt-text captions are noisy and often unrelated to the video. InternVid's generated captions are better but miss audio and speech context. InternVid2's fused VAS captions capture what you see, hear, and what is said — all three perspectives. Ablations show each additional modality in the caption improves downstream retrieval and QA performance.

Data Scale by Training Stage

The amount and type of data varies dramatically across the three stages. Stage 2 dominates in total volume (roughly 400M entries).

Why does InternVideo2 use semantically-segmented clips (via AutoShot) instead of fixed-duration clips?

Because semantic boundaries prevent clips from mixing frames with inconsistent context, which would confuse the video-text alignment and degrade the learned representations
Because fixed-duration clips are harder to process on GPUs
Because AutoShot produces clips of uniform length

Chapter 7: Results

InternVideo2 achieves state-of-the-art results on over 60 video and audio benchmarks. Here are the headline numbers.

Action Recognition (End-to-End Fine-tuning)

Method	K400	K600	SSv2	MiT
VideoMAE V2-g	90.0 (64f)	89.9	77.0	—
CoCa-g	88.9	—	—	49.0
InternVideo (v1)	91.1	91.3	—	—
InternVideo2-6B	92.1	91.9	77.5	51.2

92.1% on Kinetics-400 with only 16 frames at 224×224 — the previous SOTA required ensembles, higher resolution (576²), or 64 frames.

Attentive Probing (Frozen Backbone)

This measures feature quality directly — freeze the backbone and train only a lightweight attention pooling + linear head:

Method	K400	K600	SSv2
ViT-22B	88.0	—	67.7
DINOv2-g	83.4	—	50.0
VideoPrism-g	87.2	—	68.5
InternVideo2-6B	88.8	89.1	72.2

The SSv2 gap is telling: InternVideo2-6B achieves 72.2% on Something-Something V2 with a frozen backbone, 3.7 percentage points above VideoPrism-g (68.5%). SSv2 specifically tests temporal understanding, so this result suggests that the three-stage training produces genuinely temporal features, not just spatial features that happen to work on video.

Zero-Shot Recognition

InternVideo2-clip achieves 72.7% on K400 zero-shot, within 3.7 points of VideoPrism-g (76.4%), which was trained on 311M manually-labeled videos. On UCF-101 and HMDB-51, InternVideo2 leads.

SOTA Comparison

InternVideo2-6B vs previous best on key benchmarks. End-to-end fine-tuning results.

What does InternVideo2's strong frozen-backbone performance on SSv2 demonstrate?

That the three-stage training produces genuinely temporal features baked into the backbone, not just spatial features that are adapted during fine-tuning; the temporal understanding is intrinsic, not learned at the head level
That SSv2 is an easy benchmark
That attention pooling is better than average pooling

Chapter 8: Downstream Tasks

A true video foundation model must generalize across many task types. InternVideo2 is evaluated on tasks far beyond action classification.

Temporal Action Localization

Detect when actions start and end in untrimmed videos. InternVideo2 uses features from layer 7 (not the final layer) with ActionFormer as the detection head.

Method	THUMOS14	ActivityNet	HACS	FineAction
I3D + Flow	66.8	35.6	—	17.6
VideoMAE V2-g	69.5	39.0	42.4	18.2
InternVideo2-6B	72.0	41.2	43.3	27.7

Video Instance Segmentation

Mask2Former with an InternVideo2 backbone achieves 63.4 mAP on YouTube-VIS 2019 (vs 60.3 for Swin-L), and a video-tuned version reaches 65.1 mAP.

Video-Text Retrieval

InternVideo2's Stage 2 contrastive features excel at video retrieval. On MSR-VTT, it achieves 55.9% R@1 (text-to-video), outperforming VideoCoCa and other baselines.

Video Question Answering

The Stage 3 model handles open-ended video QA. On benchmarks like EgoSchema (long-form video understanding), InternVideo2 achieves 63.2% accuracy, demonstrating the ability to reason over extended temporal contexts.

The universality argument: InternVideo2 holds SOTA or near-SOTA on action recognition (what), temporal detection (when), instance segmentation (where), retrieval (matching), QA (reasoning), and dialogue (generation). No previous single model covered all these. The three-stage training creates a truly general-purpose video encoder because each stage adds a different dimension of understanding.

What degrades at the extremes: On Kinetics zero-shot, InternVideo2-6B is slightly weaker than InternVideo2-1B (72.7 vs 73.1 on K400). The larger model, trained with more diverse data in Stage 2, partially forgets the distribution of Stage 1 pre-training data. On FineAction (fine-grained temporal detection), scaling from 1B to 6B barely helps (27.7 vs 27.2), suggesting fine-grained discrimination may need better data annotations rather than more parameters.

Why does InternVideo2-6B sometimes underperform InternVideo2-1B on zero-shot recognition?

The 6B model uses a more diverse pre-training corpus in Stage 2, which causes partial forgetting of the visual distributions learned in Stage 1, a classic trade-off between breadth and depth of representation
The 6B model is too large for zero-shot evaluation
Zero-shot evaluation does not use the video encoder

Chapter 9: Connections

InternVideo2 sits at the convergence of several major research threads. Let's map the landscape.

Relation to VideoMAE V2

VideoMAE V2 provides one of InternVideo2's two teacher models in Stage 1. Where VideoMAE V2 reconstructs raw pixels, InternVideo2 reconstructs teacher features. V2 is a single-modality specialist optimized for pixel reconstruction; InternVideo2 is a multi-modality generalist that uses V2's motion-aware features as a learning signal.

Relation to CLIP / InternVL

InternViT-6B (the vision encoder of the InternVL image-language model) provides the other teacher in Stage 1 and shapes the feature space. Stage 2's contrastive learning is inspired by CLIP but extended to video-audio-speech-text. The result is a "video CLIP" that aligns video not only with text but also with audio and speech.

Relation to VideoPrism

VideoPrism (Google, 2024) is the closest competitor. It also uses masked reconstruction + contrastive learning, but with a two-stage scheme (vs three). VideoPrism's key advantage is proprietary data (311M manually-labeled videos). InternVideo2 compensates with its third stage (dialogue) and auto-generated VAS captions, achieving competitive or superior results on most benchmarks.

Relation to BLIP-2 / QFormer

Stage 3 directly uses the BLIP-2 architecture for bridging vision and language. QFormer compresses the video encoder's 2048 tokens into ~32 tokens that the LLM can digest. This is not a novel contribution — InternVideo2's innovation is in the video encoder itself, not the LLM bridge.

Cheat Sheet

Aspect	InternVideo2
Video encoder	ViT-6B (48 layers, dim 3200, patch 14)
Input	8 frames, 224×224, sparse sampling
Stage 1	Dual-teacher token reconstruction (InternViT-6B + VideoMAE V2-g)
Stage 2	Video-text-audio contrastive + matching + MLM
Stage 3	QFormer → Vicuna LLM, instruction tuning
Data total	402M entries (2M video, 300M image-text, 50M video-text, 50M video-audio-speech-text)
Key results	92.1% K400, 77.5% SSv2, 72.0 THUMOS14 mAP, 63.2% EgoSchema
Training cost	Stage 1: 256 A100, 18 days. Stage 2: 256 A100, 14 days. Stage 3: 64 A100, 3 days
Total compute	~32 days on 256 A100 GPUs (Stage 1 + 2)
Benchmarks	SOTA on 60+ tasks across recognition, retrieval, QA, dialogue, segmentation

The broader lesson: The video foundation model problem is not just about bigger models or more data. It requires a progressive training recipe that layers complementary objectives. Each stage builds a different capability. The encoder that emerges is more than the sum of its parts — it inherits spatiotemporal perception from reconstruction, semantic grounding from contrastive learning, and reasoning from language model alignment.

What is the key architectural difference between InternVideo2 and VideoPrism?

VideoPrism uses a bigger model
InternVideo2 adds a third training stage (LLM dialogue) that VideoPrism lacks, enabling open-ended video reasoning, while VideoPrism relies on a two-stage scheme with superior proprietary labeled data
InternVideo2 uses a CNN backbone while VideoPrism uses ViT

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
