Chen, Wu, Wang, Su, Chen et al. — Shanghai AI Lab, CVPR 2024

InternVL: Scaling Up Vision Foundation Models

Scale the vision encoder to 6 billion parameters and progressively align it with an LLM through contrastive, generative, and instruction tuning stages — bridging the parameter gap between vision and language.

Prerequisites: Vision Transformers + CLIP contrastive learning + LLM basics
10 chapters, 5+ simulations

Chapter 0: The Problem

By late 2023, large language models had exploded to hundreds of billions of parameters. GPT-4 was rumored at over a trillion. LLaMA-65B, Vicuna-13B, InternLM-7B — language models were massive and getting bigger.

But when these LLMs needed to see, what vision encoder did they use? Almost always a CLIP ViT-L (300M parameters) or at most an EVA-CLIP ViT-G (1B parameters). That is a 10-100x parameter gap between the vision encoder and the language model it feeds into.

Think about what this means. You have an enormous language model with rich, nuanced understanding of concepts — and you're feeding it visual features from a comparatively tiny encoder. The vision side is a bottleneck. It's like having a world-class translator who receives messages through a tin can telephone.

The parameter gap: In 2023's best VLLMs, the LLM decoder had 7-65B parameters while the vision encoder had 0.3-1B. This 10-100x mismatch meant the LLM's capacity was underutilized — the visual features simply weren't rich enough to leverage it. Additionally, these vision encoders were aligned with BERT-style text encoders, not with LLMs, creating a representation inconsistency.

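There were three specific problems:

  1. Parameter disparity. The vision encoder was 10-100x smaller than the LLM it fed.
  2. Inconsistent representations. The vision encoders had been aligned with BERT-style text encoders rather than with LLMs.
  3. Inefficient connection. The "glue" layers between vision and language were typically lightweight and randomly initialized.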

The Parameter Gap

Compare the parameter counts of vision encoders vs. the LLMs they feed into. The gap reveals a fundamental bottleneck in multimodal systems.

Why is a 300M-parameter vision encoder a bottleneck when paired with a 13B-parameter LLM?

Chapter 1: The Key Insight

InternVL's solution has two parts: scale up the vision encoder to match the LLM's parameter count, and progressively align their representations through a three-stage training pipeline.

Part 1: Scale the vision encoder to 6B

Instead of a 300M or 1B vision encoder, InternVL uses InternViT-6B, a vanilla Vision Transformer scaled to 6 billion parameters. This is about 6x larger than the 1B EVA-CLIP ViT-G and approaches the parameter count of the LLMs it will connect to (7B-class models such as InternLM-7B and LLaMA-7B).

Why does scale matter? Because a larger vision encoder can learn richer, more fine-grained visual features — textures, spatial relationships, subtle semantic distinctions — that a small encoder simply doesn't have the capacity to represent. With 6B parameters, the vision encoder is no longer the bottleneck.

Part 2: Progressive three-stage alignment

You can't just train a 6B vision encoder from scratch and plug it into an LLM. The training is unstable, and the feature spaces won't match. InternVL uses a progressive strategy:

  1. Stage 1: Contrastive learning — Train on 5B noisy web image-text pairs using CLIP-style contrastive loss. This gives InternViT-6B a strong visual representation aligned with language.
  2. Stage 2: Generative learning — Connect to QLLaMA (an 8B language middleware) and train with contrastive + matching + generation losses on 1B curated pairs. This bridges the representation gap with LLMs.
  3. Stage 3: Instruction tuning — Fine-tune with 4M high-quality instruction samples for dialogue, VQA, captioning, and grounding. This unlocks practical capabilities.

The core insight: Don't just build a bigger vision encoder; build one that is progressively aligned with an LLM at each training stage, using increasingly high-quality data. Start noisy and broad (5B web pairs), then refine with curated data (1B filtered pairs), then specialize with instructions (4M samples). Each stage builds on the previous one's representations.

The result is a "Swiss Army knife" model. InternViT-6B alone works as a powerful vision backbone for classification, segmentation, and retrieval. Combined with QLLaMA, it handles generative tasks like captioning. Connected to a full LLM decoder, it enables multimodal dialogue.

What are the two key ideas behind InternVL?

Chapter 2: InternViT-6B Architecture

InternViT-6B is a vanilla Vision Transformer — no fancy architectural innovations, just scale. The authors deliberately chose standard ViT because its simplicity makes scaling predictable and stable.

Architecture details

The key hyperparameters were found through a search optimizing for accuracy, speed, and training stability. The final model uses depth 48 and width 3200.

Design search

The authors explored the design space systematically. They varied depth in {32, 48, 64, 80}, head dimension in {64, 128}, and MLP ratio in {4, 8}, keeping total parameters at ~6B. Using contrastive learning on a 100M subset of LAION-en, they found:

Comparison to other large ViTs: ViT-22B (Google) has 21.7B parameters but uses the private JFT-3B dataset. InternViT-6B achieves competitive results with 3.7x fewer parameters and only public data. ViT-6.5B (MAWS) has a similar parameter count but uses depth=32 and width=4096 — a very different tradeoff. InternViT-6B's depth=48 and width=3200 proved more stable.
InternViT-6B Architecture

The ViT architecture at 6B scale. An image is split into 14x14-pixel patches, each projected to a 3200-dim token, and the token sequence is processed through 48 transformer layers.
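
As a rough sanity check on the "~6B" figure, the sketch below counts only the attention and MLP weight matrices per transformer block. The depth (48) and width (3200) come from the text; the MLP ratio of 4 is an assumption (the design search covered both 4 and 8), and biases, normalization layers, and the patch embedding are ignored. The second call applies the same assumed ratio to the depth=32, width=4096 configuration mentioned for ViT-6.5B.

```python
def vit_block_params(width: int, mlp_ratio: int) -> int:
    """Approximate weights in one transformer block (biases and norms ignored)."""
    attention = 4 * width * width            # Q, K, V, and output projections
    mlp = 2 * width * (mlp_ratio * width)    # up-projection and down-projection
    return attention + mlp


def vit_params(depth: int, width: int, mlp_ratio: int) -> int:
    """Approximate total weights across all blocks."""
    return depth * vit_block_params(width, mlp_ratio)


# Depth and width are from the text; mlp_ratio=4 is an assumed value.
internvit = vit_params(depth=48, width=3200, mlp_ratio=4)
wide_shallow = vit_params(depth=32, width=4096, mlp_ratio=4)

print(f"depth=48, width=3200: {internvit / 1e9:.2f}B")
print(f"depth=32, width=4096: {wide_shallow / 1e9:.2f}B")
```

Under these assumptions the deeper, narrower configuration lands just under 6B and the shallower, wider one near 6.4B, which is in line with the parameter counts quoted for the two models.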

Why did the InternVL authors choose a vanilla ViT architecture instead of a novel design?

Chapter 3: Progressive Alignment

This is the heart of InternVL. Training a 6B vision encoder from scratch and aligning it with an LLM is hard — the modality gap is enormous, the data is heterogeneous, and training can be unstable. InternVL solves this with a three-stage progressive strategy, where each stage builds on the previous one.

Stage 1: Vision-language contrastive training

The first stage uses CLIP-style contrastive learning on massive, noisy web data. InternViT-6B (randomly initialized) is paired with LLaMA-7B (pre-trained) as the text encoder. They are trained jointly to maximize cosine similarity between matched image-text pairs and minimize it for unmatched pairs.

$$\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp\left(\mathrm{sim}(I, T^{+})/\tau\right)}{\sum_{j} \exp\left(\mathrm{sim}(I, T_{j})/\tau\right)}$$

This stage consumes 4.98 billion image-text pairs (filtered from 6.03B original). Both InternViT-6B and LLaMA-7B are fully trainable. The result: a vision encoder with strong zero-shot capabilities for classification and retrieval.
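
The loss above can be sketched in a few lines of plain Python. This is a toy illustration, not the training code: it implements only the image-to-text direction written in the formula (CLIP-style training usually averages it with the text-to-image direction), and the temperature value 0.07 and the two-dimensional embeddings are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_to_text_loss(image_embs, text_embs, tau=0.07):
    """Mean of -log softmax over texts, where text i is the match for image i."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / tau for txt in text_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)


# Hypothetical toy embeddings: each image points roughly the same way as its caption.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = image_to_text_loss(images, texts)
shuffled_loss = image_to_text_loss(images, texts[::-1])
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The loss is near zero when each image is most similar to its own caption and grows large when the pairing is swapped, which is the signal that pulls matched pairs together during Stage 1.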

Stage 2: Vision-language generative training

Now InternViT-6B has good visual representations, but they're aligned with a text encoder, not an LLM's generative representations. Stage 2 bridges this gap by introducing QLLaMA — a language middleware that connects the vision encoder to LLM-style generation.

QLLaMA inherits the LLaMA-7B weights from stage 1 and adds 96 learnable queries plus cross-attention layers (1B new parameters). In this stage, InternViT-6B and the original QLLaMA layers are frozen — only the new cross-attention layers and queries are trained.
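
To make the role of the learnable queries concrete, here is a minimal single-head cross-attention sketch in plain Python. It is a simplification under stated assumptions: there are no learned projection matrices, the embedding size (8) and the number of image tokens (256) are made-up toy values, and only the query count (96) comes from the text. The point it illustrates is that a fixed set of queries turns a variable number of image tokens into a fixed number of outputs.

```python
import math
import random


def cross_attend(queries, image_tokens):
    """Each query returns a softmax-weighted average of the image tokens."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in image_tokens]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w * tok[j] for w, tok in zip(weights, image_tokens)) / norm
            for j in range(dim)
        ])
    return outputs


rng = random.Random(0)
DIM = 8            # toy embedding size (assumption)
NUM_QUERIES = 96   # from the text
NUM_TOKENS = 256   # assumed number of image tokens

queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

attended = cross_attend(queries, image_tokens)
print(len(attended), len(attended[0]))
```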

The training uses three losses simultaneously: a contrastive loss, an image-text matching loss, and a text generation loss.

This stage uses 1.03 billion curated pairs (filtered from the 4.98B in stage 1 — removing low-quality captions).

Stage 3: Supervised fine-tuning (SFT)

Finally, InternVL is connected to a full LLM decoder (Vicuna-13B or InternLM) through an MLP layer. The system is fine-tuned on 4 million high-quality instruction samples covering captioning, VQA, OCR, grounding, and dialogue.

Because QLLaMA already produces LLM-compatible features (thanks to its LLaMA initialization), the LLM decoder can often remain frozen — only the MLP bridge and optionally QLLaMA are trained. This preserves the LLM's original language capabilities while adding vision.

The progressive data funnel: Notice the data quality-quantity tradeoff across stages: Stage 1 uses 5B noisy pairs (quantity over quality), Stage 2 uses 1B curated pairs (balanced), Stage 3 uses 4M high-quality instructions (quality over quantity). Each stage narrows the funnel — starting broad to build general representations, then refining for specific capabilities.
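
The three stages described above can be summarized as a small configuration table. The sketch below is only a bookkeeping aid, not the project's actual config format: the `Stage` class and its field names are hypothetical, while the component names, sample counts, and frozen/trainable splits are taken from the text. For Stage 3 it records the common case in which the LLM decoder stays frozen and only the MLP bridge is trained (QLLaMA may optionally be trained too).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    samples: float     # number of training pairs or instruction samples
    trainable: tuple   # components updated in this stage
    frozen: tuple      # components kept fixed in this stage


STAGES = (
    Stage("1: contrastive", 4.98e9, ("InternViT-6B", "LLaMA-7B"), ()),
    Stage("2: generative", 1.03e9,
          ("learnable queries", "cross-attention layers"),
          ("InternViT-6B", "inherited LLaMA layers")),
    # QLLaMA can optionally be trained in this stage as well.
    Stage("3: SFT", 4e6, ("MLP bridge",), ("LLM decoder",)),
)

for stage in STAGES:
    frozen = ", ".join(stage.frozen) or "none"
    print(f"{stage.name:15s} samples={stage.samples:.2e} "
          f"train=[{', '.join(stage.trainable)}] frozen=[{frozen}]")
```
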

[Interactive figure: Three-Stage Progressive Alignment. For each stage it shows which components are trained, what data is used, and which losses drive learning.]

Why does Stage 2 freeze InternViT-6B and only train the new cross-attention layers and queries?

Chapter 4: Vision-Language Connector

How do you actually connect a 6B vision encoder to an LLM? InternVL introduces two approaches, each suited to different use cases.

QLLaMA: The heavy-duty middleware

QLLaMA is an 8B-parameter language middleware initialized from a multilingual LLaMA. It has three key additions over a standard LLaMA:

Why is QLLaMA 42x larger than QFormer (used in BLIP-2)? Because a larger middleware can capture richer cross-modal interactions. The LLaMA initialization means it starts with strong language understanding, so its representations are already compatible with LLM decoders.

MLP bridge: The lightweight option

For simpler setups, InternVL can skip QLLaMA entirely. InternViT-6B's output tokens are projected through a small MLP directly into the LLM's embedding space. This is similar to LLaVA's approach but with a much stronger vision encoder.
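
A minimal sketch of such a bridge, in plain Python, is shown below. The text only says "a small MLP", so the two-layer shape with a GELU in between is an assumption borrowed from common LLaVA-style connectors, and the dimensions are toy values (the real vision width is 3200, and the LLM width depends on the decoder). All function names are hypothetical.

```python
import math
import random


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_layer(rng, n_in, n_out):
    """Randomly initialized weights (n_out x n_in) and zero biases."""
    weight = [[rng.gauss(0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def mlp_bridge(tokens, layer1, layer2):
    """Project every vision token into the LLM embedding space."""
    return [linear([gelu(h) for h in linear(t, *layer1)], *layer2) for t in tokens]


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_TOKENS = 8, 12, 5   # toy sizes (assumptions)

layer1 = make_layer(rng, VISION_DIM, LLM_DIM)
layer2 = make_layer(rng, LLM_DIM, LLM_DIM)
vision_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_TOKENS)]

llm_inputs = mlp_bridge(vision_tokens, layer1, layer2)
print(len(llm_inputs), len(llm_inputs[0]))
```

The bridge keeps one output per vision token and changes only the width, so the LLM sees the visual tokens as if they were extra word embeddings.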

Four configurations

InternVL supports four different model configurations:

InternVL-C: InternViT-6B + QLLaMA text encoder → Contrastive retrieval (uses attention pooling + [EOS] similarity)
InternVL-G: InternViT-6B + QLLaMA queries → Generative captioning + stronger retrieval (queries reorganize visual features)
InternVL-Chat (simple): InternViT-6B + MLP + LLM decoder → Multimodal dialogue without QLLaMA
InternVL-Chat (full): InternViT-6B + QLLaMA + MLP + LLM decoder → Full-power multimodal dialogue
Why QLLaMA over QFormer? QFormer (BLIP-2) has ~188M parameters and is randomly initialized. QLLaMA has 8B parameters and is initialized from a pre-trained LLM. This means (1) it starts with strong language understanding, (2) its representations naturally align with LLM decoders, and (3) it can reorganize visual features based on language instructions rather than just compressing them.
What advantage does initializing QLLaMA from a pre-trained LLaMA give over a randomly initialized connector like QFormer?

Chapter 5: Training Data

InternVL's training data follows the progressive quality funnel. Each stage uses different data at different scales and quality levels.

Stage 1: Web-scale contrastive data (4.98B pairs)

The raw dataset contains 6.03 billion image-text pairs from multiple sources:

After quality filtering, 4.98B pairs remain (82.6% of original). The filtering removes extremely low-quality data but is intentionally lenient — quantity matters at this stage.

Stage 2: Curated generative data (1.03B pairs)

Stage 2 applies much stricter filtering, reducing to 1.03 billion pairs (17.0% of the original 6.03B). The biggest cuts happen to the noisiest sources:

The synthetic-caption datasets (LAION-COCO) and academic datasets are mostly kept because their captions are already high quality.
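
The retention percentages quoted in this chapter can be rechecked with a few divisions. The sketch below uses only the rounded counts given in the text (including the 2.3B to 91M figure for LAION-en), so it should be read as approximate: rounded inputs reproduce the quoted percentages to within about 0.1 point, not exactly.

```python
RAW_PAIRS = 6.03e9      # before any filtering
STAGE1_PAIRS = 4.98e9   # after lenient filtering
STAGE2_PAIRS = 1.03e9   # after strict filtering


def retention(kept, original):
    """Percentage of the original pairs that survive filtering."""
    return 100.0 * kept / original


stage1_kept = retention(STAGE1_PAIRS, RAW_PAIRS)
stage2_kept = retention(STAGE2_PAIRS, RAW_PAIRS)
laion_en_kept = retention(91e6, 2.3e9)

print(f"Stage 1 keeps {stage1_kept:.1f}% of the raw pairs")
print(f"Stage 2 keeps {stage2_kept:.1f}% of the raw pairs")
print(f"LAION-en keeps {laion_en_kept:.1f}% in Stage 2")
```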

Stage 3: Instruction data (4M samples)

The SFT stage uses ~4 million high-quality annotated samples across six task categories:

Conversation (1.4M): LLaVA-150K, SVIT, VisDial, LRV-Instruction, LLaVA-Mix-665K
VQA (1.1M): VQAv2, OKVQA, A-OKVQA, IconQA, AI2D, GQA
Captioning (588K): COCO Caption, TextCaps
Grounding (323K): RefCOCO/+/g, Toloka
OCR (294K): OCR-VQA, ChartQA, DocVQA, ST-VQA, InfoVQA, LLaVAR
Grounded Cap. (284K): RefCOCO/+/g
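
As a quick check that the six categories add up to the "~4 million" total, the sketch below sums the rounded sizes listed above and prints each category's share. The counts are the rounded figures from this chapter, so the shares are approximate.

```python
SFT_MIX = {
    "Conversation": 1_400_000,
    "VQA": 1_100_000,
    "Captioning": 588_000,
    "Grounding": 323_000,
    "OCR": 294_000,
    "Grounded Cap.": 284_000,
}

sft_total = sum(SFT_MIX.values())
sft_shares = {name: 100.0 * count / sft_total for name, count in SFT_MIX.items()}

print(f"total: {sft_total:,} samples")
for name, share in sft_shares.items():
    print(f"{name:14s} {share:5.1f}%")
```
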
The data funnel in numbers: 6.03B → 4.98B → 1.03B → 4M. Each stage shrinks the data volume while raising quality: lenient filtering keeps 82.6% of the raw pairs, stricter filtering keeps about a fifth of those, and the instruction set is smaller again by more than two orders of magnitude. Stage 1 is about building broad visual understanding from noisy web data. Stage 2 refines with cleaner captions. Stage 3 specializes with human-curated instruction data. This mirrors how humans learn: first absorb broadly, then refine, then practice specific skills.

Why does Stage 2 reduce the LAION-en dataset from 2.3B to just 91M pairs (4.0%)?

Chapter 6: Results

InternVL achieves state-of-the-art or competitive results across a remarkably wide range of tasks — from pure vision to multimodal dialogue.

Image classification (linear probing)

On ImageNet-1K with a frozen backbone and only a linear classifier trained, InternViT-6B reaches 88.2% accuracy.

InternViT-6B beats all models trained on public data. It trails ViT-22B by only 1.3 percentage points despite having 3.7x fewer parameters and using no private data.

Semantic segmentation

On ADE20K with UperNet decoder (full fine-tuning): InternViT-6B achieves 58.9 mIoU, surpassing ViT-22B's 55.3. With a frozen backbone and UperNet, it scores 54.9 vs ViT-22B's 52.7. A plausible explanation is that the 48-layer encoder, pre-trained contrastively on image-text pairs, yields stronger pixel-level features, although these results alone do not isolate the cause.

Zero-shot classification

InternVL-C achieves 83.2% on ImageNet-1K zero-shot, beating EVA-02-CLIP-E+ (82.0%) and all other public-data models. On multilingual ImageNet it achieves 64.0% average across 5 languages (EN, ZH, JP, AR, IT), far exceeding OpenCLIP-XLM-R-H (55.9%).

Image-text retrieval

On Flickr30K (English), InternVL-G achieves 95.7% R@1 for image-to-text retrieval, the best among the public-data models compared. BLIP-2 scores higher at 97.6%, but it relies on private data filtering. On COCO, InternVL-G reaches 74.9% R@1 for image-to-text.

InternVL Performance Comparison

Linear probe accuracy on ImageNet-1K and variants. InternViT-6B (public data only) comes within 1.3 points of ViT-22B, a model 3.7x its size trained on private data.

Multimodal dialogue

InternVL-Chat (with Vicuna-13B decoder) achieves competitive results on dialogue benchmarks:

The key takeaway: InternVL demonstrates that scaling the vision encoder yields consistent improvements across the board — classification, segmentation, retrieval, captioning, and dialogue all benefit. The vision encoder was indeed the bottleneck, and removing that bottleneck improves everything downstream.
How does InternViT-6B compare to ViT-22B on ImageNet-1K linear probing?

Chapter 7: Dynamic Resolution

Standard ViTs process images at a fixed resolution — typically 224x224 or 336x336. This is problematic for real-world images that come in all sizes and aspect ratios. A tall document, a wide panorama, or a small icon all get squished to the same square.

The tile-based approach

The InternVL family, starting with the follow-up InternVL 1.5 and 2, handles this with dynamic resolution through tile-based processing:

  1. Divide the image into tiles. Given an input image of any resolution, divide it into non-overlapping tiles of the base resolution (e.g., 448x448). The number of tiles adapts to the image — a 448x896 image becomes 2 tiles, a 1344x448 image becomes 3 tiles.
  2. Add a thumbnail. Resize the entire image to one tile to preserve global context. This gives the model both fine-grained details (from tiles) and the big picture (from thumbnail).
  3. Process each tile independently. Each tile goes through InternViT-6B separately, producing a set of visual tokens.
  4. Concatenate all tokens. The tokens from all tiles plus the thumbnail are concatenated and fed to the LLM.
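
The four steps above can be sketched as a small planning function. This is a simplified illustration rather than the released preprocessing code: it assumes edge tiles are simply clipped when the image size is not a multiple of 448 (real pipelines typically resize the image to a tile-aligned size first), and the function names are hypothetical. It only computes crop boxes; no image library is involved.

```python
import math

TILE = 448  # base tile size from the text


def tile_boxes(width, height, tile=TILE):
    """Return (left, top, right, bottom) crop boxes in row-major order."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [
        (c * tile, r * tile, min((c + 1) * tile, width), min((r + 1) * tile, height))
        for r in range(rows)
        for c in range(cols)
    ]


def plan_views(width, height, tile=TILE):
    """Detail tiles plus one thumbnail view covering the whole image."""
    views = [("tile", box) for box in tile_boxes(width, height, tile)]
    # The thumbnail is the full image, to be resized down to one tile.
    views.append(("thumbnail", (0, 0, width, height)))
    return views


for size in [(448, 896), (1344, 448), (1344, 896)]:
    views = plan_views(*size)
    print(size, "->", len(views) - 1, "tiles + 1 thumbnail")
```

Each view would then be encoded separately by the vision encoder, and the resulting token sequences concatenated in this order before being passed to the LLM.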

Why this matters

Dynamic resolution is crucial for document understanding, chart reading, and any task where fine details matter. A standard 224x224 input squishes a dense document page until most of its text is illegible. With dynamic resolution, the model can spend 6 or more tiles on the same page and process it at close to native resolution, keeping small characters readable.

[Interactive figure: Dynamic Resolution Tiling. Shows how an image is divided into tiles as its aspect ratio changes; each tile is processed independently by InternViT-6B.]

Token count scales with tiles: Each 448x448 tile is cut into a 32x32 grid of 14x14-pixel patches (1,024 patches), which InternVL 1.5 merges down to 256 visual tokens with a pixel-shuffle step. With 6 tiles + 1 thumbnail, that's 7 x 256 = 1,792 visual tokens fed to the LLM. This is why efficient attention mechanisms in the LLM become important for high-resolution inputs.
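
The token arithmetic can be written out explicitly. A 448x448 tile cut into 14x14-pixel patches gives a 32x32 grid, or 1,024 patches; arriving at 256 tokens per tile requires merging each 2x2 group of patches into one token. That merge factor is an assumption in this sketch (InternVL 1.5 describes a pixel-shuffle step of this kind), and the function names are hypothetical.

```python
def tokens_per_tile(tile=448, patch=14, merge=2):
    """Visual tokens per tile after merging merge x merge patch groups."""
    patches_per_side = tile // patch          # 448 // 14 = 32
    return (patches_per_side // merge) ** 2   # (32 // 2) ** 2 = 256


def visual_tokens(num_tiles, thumbnail=True):
    """Total visual tokens for the detail tiles plus an optional thumbnail."""
    views = num_tiles + (1 if thumbnail else 0)
    return views * tokens_per_tile()


for n in (1, 2, 6, 12):
    print(f"{n:2d} tiles + thumbnail -> {visual_tokens(n):5d} visual tokens")
```
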
What is the purpose of adding a thumbnail tile alongside the detailed tiles?

Chapter 8: Scaling Analysis

A central claim of InternVL is that scaling the vision encoder matters — that bigger vision models produce better multimodal systems. Let's examine the evidence.

Vision encoder size vs. downstream performance

The linear probing results on ImageNet tell a clear story:

Performance scales roughly log-linearly with parameters. Each ~4x increase in parameters adds about 2 percentage points. But the gains come with diminishing returns: ViT-22B has 3.7x as many parameters as InternViT-6B but gains only 1.3 points.
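
One way to see the diminishing returns is to express both figures as accuracy points gained per doubling of parameters. The sketch below uses only the two numbers quoted in the paragraph above; the helper name is hypothetical, and two data points are of course too few to fit a real scaling law.

```python
import math


def points_per_doubling(gain_points, param_ratio):
    """Accuracy points gained per 2x increase in parameter count."""
    return gain_points / math.log2(param_ratio)


# Rule of thumb from the text: about 2 points per 4x parameters.
rule_of_thumb = points_per_doubling(2.0, 4.0)

# InternViT-6B to ViT-22B: 1.3 points for 3.7x parameters.
top_end = points_per_doubling(1.3, 3.7)

print(f"rule of thumb: {rule_of_thumb:.2f} points per doubling")
print(f"6B -> 22B:     {top_end:.2f} points per doubling")
```

At the top end each doubling buys roughly 0.7 points instead of 1, which is the flattening the scaling curve shows.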

Semantic segmentation scaling

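On ADE20K, the size of the gap depends on how much of the network is tuned. With full fine-tuning InternViT-6B scores 58.9 mIoU against ViT-22B's 55.3, and with a frozen backbone plus UperNet it scores 54.9 against 52.7.

The difference is largest in the linear probing setting, where InternViT-6B dramatically outperforms ViT-22B (47.2 vs 34.6). This suggests that InternViT-6B's contrastive pre-training produces better pixel-level features than ViT-22B's JFT-trained features, despite fewer parameters.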

The efficiency sweet spot

InternViT-6B appears to be at an efficiency sweet spot. It achieves:

Vision Encoder Scaling Curve

ImageNet-1K linear probe accuracy vs. vision encoder parameters (log scale). Performance scales log-linearly but with diminishing returns at extreme scale.

Why 6B is the sweet spot: Below 1B parameters, the vision encoder clearly limits multimodal systems. At 6B, the vision encoder roughly matches the LLM's parameter count (7-8B), and the representations are rich enough to make fuller use of the LLM's capacity. Going to 22B yields only marginal gains at 3.7x the parameter count. The lesson: balance the vision and language components rather than scaling only one.

What does the scaling analysis suggest about the relationship between vision encoder size and downstream performance?

Chapter 9: Connections

What InternVL built on

CLIP (Radford et al., 2021): The foundational contrastive vision-language model. InternVL's Stage 1 uses CLIP-style contrastive learning but with a far larger vision encoder (6B vs. 300M for CLIP ViT-L) and LLaMA as the text encoder instead of a small transformer.

EVA / EVA-CLIP (Fang et al., 2023): Explored scaling vision encoders with masked image modeling + CLIP training. EVA-02-CLIP-E+ (4.4B) was the largest public CLIP model before InternVL. InternVL's InternViT-6B extends this scaling trajectory.

BLIP-2 (Li et al., 2023): Introduced QFormer as a lightweight bridge between frozen vision encoder and frozen LLM. InternVL's QLLaMA is the "heavy-duty" version of this idea — 8B parameters initialized from LLaMA instead of 188M randomly initialized.

LLaVA (Liu et al., 2023): Showed that a simple MLP connector + visual instruction tuning can enable strong multimodal dialogue. InternVL adopts this approach for its Stage 3 fine-tuning but with a much stronger vision encoder.

SigLIP (Zhai et al., 2023): Replaced CLIP's softmax contrastive loss with a sigmoid loss that scales better. Later InternVL versions adopt SigLIP-style training for better efficiency.

What InternVL enabled

InternVL 1.5 / 2.0 (2024): The direct successors, which added dynamic resolution, stronger instruction data, and improved training recipes. InternVL 2.0 became one of the top-performing open-source multimodal models.

Qwen-VL (Bai et al., 2023): A concurrent effort from Alibaba taking a similar approach — scaling the vision encoder and aligning with a large LLM (Qwen). Demonstrates that the "scale vision + align progressively" paradigm generalizes.

InternVL's impact: InternVL demonstrated that the vision encoder was a critical bottleneck in VLLMs. By scaling it to 6B parameters and aligning progressively with LLMs, it showed that better vision encoders improve everything — classification, segmentation, retrieval, captioning, and dialogue. This insight shaped the design of subsequent models like InternVL 2.0, Cambrian-1, and many others that now use larger vision backbones.

Cheat sheet

Core idea: Scale vision encoder to 6B params, progressively align with LLM through 3 training stages
Architecture: InternViT-6B (vanilla ViT, depth=48, width=3200) + QLLaMA (8B middleware from LLaMA)
Training stages: Contrastive (5B pairs) → Generative (1B pairs) → SFT (4M samples)
Key results: 88.2% ImageNet linear probe, 58.9 ADE20K mIoU, 83.2% zero-shot IN-1K, SOTA retrieval
Legacy: Proved that scaling vision encoders removes the VLLM bottleneck; spawned InternVL 2.0 family

How does InternVL's QLLaMA differ from BLIP-2's QFormer?