Scale the vision encoder to 6 billion parameters and progressively align it with an LLM through contrastive, generative, and instruction tuning stages — bridging the parameter gap between vision and language.
By late 2023, large language models had exploded to hundreds of billions of parameters. GPT-4 was rumored at over a trillion. LLaMA-65B, Vicuna-13B, InternLM-7B — language models were massive and getting bigger.
But when these LLMs needed to see, what vision encoder did they use? Almost always a CLIP ViT-L (300M parameters) or at most an EVA-CLIP ViT-G (1B parameters). That is a 10-100x parameter gap between the vision encoder and the language model it feeds into.
Think about what this means. You have an enormous language model with rich, nuanced understanding of concepts — and you're feeding it visual features from a comparatively tiny encoder. The vision side is a bottleneck. It's like having a world-class translator who receives messages through a tin can telephone.
There were three specific problems:
Compare the parameter counts of vision encoders vs. the LLMs they feed into. The gap reveals a fundamental bottleneck in multimodal systems.
InternVL's solution has two parts: scale up the vision encoder to match the LLM's parameter count, and progressively align their representations through a three-stage training pipeline.
Instead of a 300M or 1B vision encoder, InternVL uses InternViT-6B, a vanilla Vision Transformer scaled to 6 billion parameters. This is 6x larger than EVA-CLIP ViT-G (1B) and approaches the parameter count of the LLMs it will connect to (7B InternLM, 7B LLaMA).
Why does scale matter? Because a larger vision encoder can learn richer, more fine-grained visual features — textures, spatial relationships, subtle semantic distinctions — that a small encoder simply doesn't have the capacity to represent. With 6B parameters, the vision encoder is no longer the bottleneck.
You can't just train a 6B vision encoder from scratch and plug it into an LLM. The training is unstable, and the feature spaces won't match. InternVL uses a progressive strategy:
The result is a "Swiss Army knife" model. InternViT-6B alone works as a powerful vision backbone for classification, segmentation, and retrieval. Combined with QLLaMA, it handles generative tasks like captioning. Connected to a full LLM decoder, it enables multimodal dialogue.
InternViT-6B is a vanilla Vision Transformer — no fancy architectural innovations, just scale. The authors deliberately chose standard ViT because its simplicity makes scaling predictable and stable.
The key hyperparameters were found through a search optimizing for accuracy, speed, and training stability:
The authors explored the design space systematically. They varied depth in {32, 48, 64, 80}, head dimension in {64, 128}, and MLP ratio in {4, 8}, keeping total parameters at ~6B. Using contrastive learning on a 100M subset of LAION-en, they found:
The ViT architecture at 6B scale. An image is split into 14x14-pixel patches, each patch is projected to a 3200-dim token, and the tokens are processed through 48 transformer layers.
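As a rough sanity check on the search space above, the sketch below solves for the width that lands near 6B parameters at each depth, head dimension, and MLP ratio. It counts only the attention and MLP weight matrices (biases, layer norms, and embeddings are ignored), and it assumes the final 48-layer model uses MLP ratio 4 and head dimension 128. The helper names are illustrative and not taken from the InternVL codebase.

```python
import math

TARGET_PARAMS = 6e9  # approximate budget stated in the text


def block_params(width, mlp_ratio):
    """Weights in one transformer block: 4*d^2 for the Q, K, V and output
    projections, plus 2*mlp_ratio*d^2 for the two MLP layers."""
    return (4 + 2 * mlp_ratio) * width * width


def width_for_budget(depth, mlp_ratio, head_dim, target=TARGET_PARAMS):
    """Width closest to the budget that is a whole number of attention heads."""
    raw_width = math.sqrt(target / (depth * (4 + 2 * mlp_ratio)))
    heads = round(raw_width / head_dim)
    return heads * head_dim, heads


for depth in (32, 48, 64, 80):
    for head_dim in (64, 128):
        for mlp_ratio in (4, 8):
            width, heads = width_for_budget(depth, mlp_ratio, head_dim)
            total = depth * block_params(width, mlp_ratio)
            print(f"depth={depth} head_dim={head_dim} mlp_ratio={mlp_ratio} "
                  f"width={width} heads={heads} params={total / 1e9:.2f}B")

# The configuration described in the figure: 48 layers, 3200-dim tokens.
print(width_for_budget(48, 4, 128))   # (3200, 25)
print(48 * block_params(3200, 4))     # 5898240000, i.e. about 5.9B

# Token count for a 224x224 input with 14x14-pixel patches.
print((224 // 14) ** 2)               # 256
```

Under these assumptions the 48-layer point of the search space lands on exactly the 3200-dim width quoted above, which suggests the "~6B" budget refers mostly to the transformer blocks themselves.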
This is the heart of InternVL. Training a 6B vision encoder from scratch and aligning it with an LLM is hard — the modality gap is enormous, the data is heterogeneous, and training can be unstable. InternVL solves this with a three-stage progressive strategy, where each stage builds on the previous one.
The first stage uses CLIP-style contrastive learning on massive, noisy web data. InternViT-6B (randomly initialized) is paired with LLaMA-7B (pre-trained) as the text encoder. They are trained jointly to maximize cosine similarity between matched image-text pairs and minimize it for unmatched pairs.
This stage consumes 4.98 billion image-text pairs (filtered from 6.03B original). Both InternViT-6B and LLaMA-7B are fully trainable. The result: a vision encoder with strong zero-shot capabilities for classification and retrieval.
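The contrastive objective can be sketched in a few lines of plain Python. This is a minimal, generic CLIP-style symmetric loss on toy vectors, not InternVL's training code; the temperature of 0.07 is CLIP's well-known initial value and is an assumption here, as are the function names.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.
    Pair k is (image_embs[k], text_embs[k]); all other pairings are negatives."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same computation over columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(images, aligned_texts)
swapped_loss = contrastive_loss(images, swapped_texts)
print(aligned_loss < swapped_loss)  # True
```

Minimizing this loss pulls matched pairs together and pushes mismatched pairs apart, which is the behavior described for stage 1.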
Now InternViT-6B has good visual representations, but they're aligned with a text encoder, not an LLM's generative representations. Stage 2 bridges this gap by introducing QLLaMA — a language middleware that connects the vision encoder to LLM-style generation.
QLLaMA inherits the LLaMA-7B weights from stage 1 and adds 96 learnable queries plus cross-attention layers (1B new parameters). In this stage, InternViT-6B and the original QLLaMA layers are frozen — only the new cross-attention layers and queries are trained.
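To see why a fixed set of learnable queries is useful, here is a toy single-head cross-attention in plain Python. It is only a sketch: the real layers have learned projection matrices, multiple heads, and far larger dimensions, all omitted here. The query count of 96 comes from the text; the feature size of 8 and the function names are arbitrary choices for illustration.

```python
import math
import random

NUM_QUERIES = 96  # from the text
DIM = 8           # toy feature size, chosen only to keep the example fast


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_tokens):
    """Each query attends over all image tokens and returns a weighted
    average of them (projection matrices omitted for brevity)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, tok)) / math.sqrt(dim)
                  for tok in image_tokens]
        weights = softmax(scores)
        outputs.append([sum(w * tok[j] for w, tok in zip(weights, image_tokens))
                        for j in range(dim)])
    return outputs


rng = random.Random(0)


def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


queries = random_vectors(NUM_QUERIES)
small = cross_attention(queries, random_vectors(256))    # e.g. a 224x224 image
large = cross_attention(queries, random_vectors(1024))   # 4x as many tokens
print(len(small), len(large))  # 96 96
```

However many image tokens come in, the output always has 96 rows, so the language side sees a fixed-length visual summary.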
The training uses three losses simultaneously:
This stage uses 1.03 billion curated pairs (filtered from the 4.98B in stage 1 — removing low-quality captions).
Finally, InternVL is connected to a full LLM decoder (Vicuna-13B or InternLM) through an MLP layer. The system is fine-tuned on 4 million high-quality instruction samples covering captioning, VQA, OCR, grounding, and dialogue.
Because QLLaMA already produces LLM-compatible features (thanks to its LLaMA initialization), the LLM decoder can often remain frozen — only the MLP bridge and optionally QLLaMA are trained. This preserves the LLM's original language capabilities while adding vision.
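The three stages can be summarized as a small configuration table. The sketch below encodes only what the text states about which components are trained or frozen; the `Stage` class and its field names are hypothetical, and components the text does not mention for a given stage are simply left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple
    frozen: tuple
    objective: str


STAGES = (
    Stage(
        name="1: contrastive pre-training",
        trainable=("InternViT-6B", "LLaMA-7B text encoder"),
        frozen=(),
        objective="CLIP-style contrastive loss",
    ),
    Stage(
        name="2: generative pre-training",
        trainable=("learnable queries", "new cross-attention layers"),
        frozen=("InternViT-6B", "inherited LLaMA-7B layers"),
        objective="three losses trained jointly",
    ),
    Stage(
        name="3: supervised fine-tuning",
        trainable=("MLP bridge", "QLLaMA (optional)"),
        frozen=("LLM decoder (often)",),
        objective="instruction tuning",
    ),
)

for stage in STAGES:
    print(f"Stage {stage.name}")
    print(f"  train : {', '.join(stage.trainable)}")
    print(f"  freeze: {', '.join(stage.frozen) or 'nothing'}")
    print(f"  loss  : {stage.objective}")
```

Reading the table top to bottom shows the pattern: each stage freezes what the previous stage learned and trains only the newly added connective parts.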
How do you actually connect a 6B vision encoder to an LLM? InternVL introduces two approaches, each suited to different use cases.
QLLaMA is an 8B-parameter language middleware initialized from a multilingual LLaMA. It has three key additions over a standard LLaMA:
Why is QLLaMA 42x larger than QFormer (used in BLIP-2)? Because a larger middleware can capture richer cross-modal interactions. The LLaMA initialization means it starts with strong language understanding, so its representations are already compatible with LLM decoders.
For simpler setups, InternVL can skip QLLaMA entirely. InternViT-6B's output tokens are projected through a small MLP directly into the LLM's embedding space. This is similar to LLaVA's approach but with a much stronger vision encoder.
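A minimal sketch of such a projector is shown below, using toy dimensions so it runs instantly in plain Python. The two-layer shape with a GELU in between is an assumption (the text only says "a small MLP"), and the real input would be 3200-dim InternViT tokens mapped to the LLM's hidden size.

```python
import math
import random

VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12  # toy sizes, not the real ones


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(rng, in_dim, out_dim):
    """Random weights of shape (out_dim, in_dim) and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def project_tokens(vision_tokens, layer1, layer2):
    """Map each vision token independently into the LLM embedding space."""
    projected = []
    for token in vision_tokens:
        hidden = [gelu(h) for h in linear(token, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
layer1 = init_linear(rng, VISION_DIM, HIDDEN_DIM)
layer2 = init_linear(rng, HIDDEN_DIM, LLM_DIM)
vision_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                 for _ in range(5)]

llm_inputs = project_tokens(vision_tokens, layer1, layer2)
print(len(llm_inputs), len(llm_inputs[0]))  # 5 12
```

Unlike the query-based route, the number of tokens is unchanged here: every vision token becomes one LLM input embedding.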
InternVL supports four different model configurations:
InternVL's training data follows the progressive quality funnel. Each stage uses different data at different scales and quality levels.
The raw dataset contains 6.03 billion image-text pairs from multiple sources:
After quality filtering, 4.98B pairs remain (82.6% of original). The filtering removes extremely low-quality data but is intentionally lenient — quantity matters at this stage.
Stage 2 applies much stricter filtering, reducing to 1.03 billion pairs (about 17% of the original 6.03B). The biggest cuts happen to the noisiest sources:
The synthetic-caption datasets (LAION-COCO) and academic datasets are mostly kept because their captions are already high quality.
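The funnel is easy to check with the rounded totals quoted in this section. The snippet below is simple arithmetic on those figures; the variable names are illustrative, and the results inherit the rounding of the inputs.

```python
RAW_PAIRS = 6.03e9      # before any filtering
STAGE1_PAIRS = 4.98e9   # after the lenient stage 1 filter
STAGE2_PAIRS = 1.03e9   # after the strict stage 2 filter
SFT_SAMPLES = 4e6       # approximate instruction-tuning set size


def percent(part, whole):
    return 100.0 * part / whole


funnel = {
    "stage 1 kept, of raw": percent(STAGE1_PAIRS, RAW_PAIRS),
    "stage 2 kept, of raw": percent(STAGE2_PAIRS, RAW_PAIRS),
    "stage 2 kept, of stage 1": percent(STAGE2_PAIRS, STAGE1_PAIRS),
    "SFT size relative to stage 2": percent(SFT_SAMPLES, STAGE2_PAIRS),
}

for label, value in funnel.items():
    print(f"{label}: {value:.2f}%")
```

Stage 2 keeps roughly one pair in five from stage 1, and the instruction-tuning set is well under 1% the size of the stage 2 data.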
The SFT stage uses ~4 million high-quality annotated samples across six task categories:
InternVL achieves state-of-the-art or competitive results across a remarkably wide range of tasks — from pure vision to multimodal dialogue.
On ImageNet-1K with a frozen backbone and only a linear classifier trained:
InternViT-6B beats all models trained on public data. It trails ViT-22B by only 1.3 percentage points despite having 3.7x fewer parameters and using no private data.
On ADE20K with UperNet decoder (full fine-tuning): InternViT-6B achieves 58.9 mIoU, surpassing ViT-22B's 55.3. With a frozen backbone and UperNet, it scores 54.9 vs ViT-22B's 52.7. The larger effective receptive field from 48 transformer layers gives it superior pixel-level perception.
InternVL-C achieves 83.2% on ImageNet-1K zero-shot, beating EVA-02-CLIP-E+ (82.0%) and all other public-data models. On multilingual ImageNet it achieves 64.0% average across 5 languages (EN, ZH, JP, AR, IT), far exceeding OpenCLIP-XLM-R-H (55.9%).
On Flickr30K (English), InternVL-G achieves 95.7% R@1 for image-to-text retrieval. That is below BLIP-2's 97.6%, which relies on private data filtering, but ahead of all other public-data models. On COCO, it reaches 74.9% R@1 for image-to-text.
Linear probe accuracy on ImageNet-1K and variants. InternViT-6B (public data only) comes within 1.3 points of ViT-22B, a model 3.7x its size trained on private data.
InternVL-Chat (with Vicuna-13B decoder) achieves competitive results on dialogue benchmarks:
Standard ViTs process images at a fixed resolution — typically 224x224 or 336x336. This is problematic for real-world images that come in all sizes and aspect ratios. A tall document, a wide panorama, or a small icon all get squished to the same square.
InternVL (and especially the follow-up InternVL 1.5/2) introduces dynamic resolution through tile-based processing:
Dynamic resolution is crucial for document understanding, chart reading, and any task where fine details matter. A standard 224x224 input would squash a dense document page and render most of its text illegible. With dynamic resolution, the model can use 6+ tiles to process the document at close to native resolution, keeping the text readable.
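One simple way to pick a tile grid is to choose the column and row counts whose aspect ratio is closest to the image's, within a tile budget. The sketch below does that; it is an illustration of the idea, not InternVL's preprocessing code. The 448-pixel tile size and the 12-tile budget are assumptions, and `choose_tile_grid` and `tile_boxes` are hypothetical names.

```python
def choose_tile_grid(width, height, tile=448, max_tiles=12):
    """Return (cols, rows) whose aspect ratio best matches the image.
    Ties are broken by how closely the grid's pixel area matches the image."""
    aspect = width / height
    candidates = [(cols, rows)
                  for cols in range(1, max_tiles + 1)
                  for rows in range(1, max_tiles + 1)
                  if cols * rows <= max_tiles]

    def cost(grid):
        cols, rows = grid
        aspect_gap = abs(cols / rows - aspect)
        area_gap = abs(cols * rows * tile * tile - width * height)
        return (aspect_gap, area_gap)

    return min(candidates, key=cost)


def tile_boxes(cols, rows, tile=448):
    """Crop boxes (left, top, right, bottom) after resizing the image
    to (cols * tile) x (rows * tile) pixels."""
    return [(c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            for r in range(rows) for c in range(cols)]


# A portrait document page (A4 proportions at about 150 dpi).
page_grid = choose_tile_grid(1240, 1754)
print(page_grid, len(tile_boxes(*page_grid)))  # (2, 3) 6

# A small square icon and a wide panorama.
print(choose_tile_grid(448, 448))    # (1, 1)
print(choose_tile_grid(1792, 448))   # (4, 1)
```

Under these assumptions a portrait page maps to a 2x3 grid, which matches the six-tile document example, while a square icon stays a single tile instead of being upsampled.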
Each tile is processed independently by InternViT-6B, preserving detail at any resolution.
A central claim of InternVL is that scaling the vision encoder matters — that bigger vision models produce better multimodal systems. Let's examine the evidence.
The linear probing results on ImageNet tell a clear story:
Performance scales roughly log-linearly with parameters: each ~4x increase adds about 2 percentage points. The returns diminish at the top end, though. ViT-22B has 3.7x more parameters than InternViT-6B but gains only 1.3 points.
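Putting both figures on the same footing, accuracy points gained per doubling of parameters, makes the diminishing returns concrete. This is plain arithmetic on the numbers quoted above, with illustrative variable names.

```python
import math

# Typical trend quoted in the text: about 2 points per 4x parameters.
typical_gain_per_doubling = 2.0 / math.log2(4)

# Top end: ViT-22B vs. InternViT-6B, 1.3 points for 3.7x parameters.
top_end_gain_per_doubling = 1.3 / math.log2(3.7)

print(f"typical: {typical_gain_per_doubling:.2f} points per doubling")
print(f"top end: {top_end_gain_per_doubling:.2f} points per doubling")
```

The last step from 6B to 22B buys about 0.69 points per doubling, roughly 30% less than the 1.0 points per doubling implied by the overall trend.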
On ADE20K semantic segmentation, the gap is widest when the backbone is frozen:
In the linear probing setting, InternViT-6B dramatically outperforms ViT-22B (47.2 vs 34.6 mIoU). This suggests that InternViT-6B's contrastive pre-training produces better pixel-level features than ViT-22B's JFT-trained features, despite fewer parameters.
InternViT-6B appears to be at an efficiency sweet spot. It achieves:
ImageNet-1K linear probe accuracy vs. vision encoder parameters (log scale). Performance scales log-linearly but with diminishing returns at extreme scale.
CLIP (Radford et al., 2021): The foundational contrastive vision-language model. InternVL's Stage 1 uses CLIP-style contrastive learning, but with a vision encoder roughly 20x larger than CLIP ViT-L (6B vs. 300M parameters) and LLaMA as the text encoder instead of a small transformer.
EVA / EVA-CLIP (Fang et al., 2023): Explored scaling vision encoders with masked image modeling + CLIP training. EVA-02-CLIP-E+ (4.4B) was the largest public CLIP model before InternVL. InternVL's InternViT-6B extends this scaling trajectory.
BLIP-2 (Li et al., 2023): Introduced QFormer as a lightweight bridge between a frozen vision encoder and a frozen LLM. InternVL's QLLaMA is the "heavy-duty" version of this idea: 8B parameters initialized from LLaMA, versus QFormer's 188M parameters initialized from BERT-base.
LLaVA (Liu et al., 2023): Showed that a simple MLP connector + visual instruction tuning can enable strong multimodal dialogue. InternVL adopts this approach for its Stage 3 fine-tuning but with a much stronger vision encoder.
SigLIP (Zhai et al., 2023): Replaced CLIP's softmax contrastive loss with a sigmoid loss that scales better. Later InternVL versions adopt SigLIP-style training for better efficiency.
InternVL 1.5 / 2.0 (2024): The direct successors, which added dynamic resolution, stronger instruction data, and improved training recipes. InternVL 2.0 became one of the top-performing open-source multimodal models.
Qwen-VL (Bai et al., 2023): A concurrent effort from Alibaba taking a similar approach — scaling the vision encoder and aligning with a large LLM (Qwen). Demonstrates that the "scale vision + align progressively" paradigm generalizes.