Vision-Language Models
Teaching language models to see. Connect a vision encoder to an LLM and unlock multimodal reasoning over images, documents, and video.
What Is a Vision-Language Model?
A Vision-Language Model (VLM) connects a vision encoder (ViT, CLIP, SigLIP) to a large language model (LLaMA, GPT, Gemma) so the system can process images alongside text.
Input: an image + a text prompt (e.g., "What is unusual about this photo?").
Output: a free-form text response grounded in the visual content of the image.
The vision encoder extracts spatial features from the image and converts them into a sequence of visual tokens that live in the same embedding space as the LLM's text tokens. The language model then attends over both visual and text tokens and generates a response — exactly like normal autoregressive text generation, but now the context includes "seeing."
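Concretely, the "merge" step is plain sequence concatenation. A toy numpy sketch (all dimensions are illustrative, not taken from any real model):

```python
import numpy as np

# Toy dimensions; a real VLM uses far more, e.g. 576 visual tokens of dim 4096.
num_vis, num_txt, d_model = 4, 3, 8

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(num_vis, d_model))  # projected image features
text_tokens = rng.normal(size=(num_txt, d_model))    # embedded prompt tokens

# The LLM sees one flat sequence: [visual tokens] + [text tokens],
# then generates autoregressively over that combined context.
context = np.concatenate([visual_tokens, text_tokens], axis=0)
print(context.shape)  # (7, 8)
```

From the language model's perspective, the visual tokens are just more positions in the context window; nothing downstream changes.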
The VLM Pipeline
The image flows through a frozen vision encoder, is projected into the LLM's token space, and is concatenated with the text token embeddings before the language model generates a response.
Core Design Decisions
Four choices define every VLM: the vision backbone, how the visual and text streams merge, the multi-stage training recipe, and how grounding outputs (e.g., bounding boxes) are represented.
CLIP ViT-L/14 — the default. Trained with contrastive image-text loss; produces features well-aligned to language. Used by LLaVA, InternVL.
SigLIP — sigmoid loss instead of softmax; better calibrated; used by PaLI-X, PaliGemma.
DINOv2 — self-supervised; richer spatial features for dense tasks (segmentation, depth). No language alignment built in.
Linear / MLP projection — simplest. Map vision features to LLM dim with one or two linear layers. Used by LLaVA.
Cross-attention — insert new cross-attn layers into the LLM that attend to vision features. Used by Flamingo.
Q-Former — learnable query tokens extract fixed-size summaries from the encoder. Used by BLIP-2, InstructBLIP.
Stage 1 — Alignment pre-training. Freeze vision encoder + LLM. Train only the projection on large-scale image-caption pairs (e.g., CC3M, LAION).
Stage 2 — Instruction tuning. Unfreeze the LLM (and optionally the projector). Fine-tune on visual Q&A, conversation, and instruction-following data.
For grounding and referring tasks, VLMs produce bounding boxes as special tokens in the output vocabulary — e.g., <box>x1,y1,x2,y2</box>.
Some models (Kosmos-2, Shikra, Qwen-VL) add location tokens directly; others (Ferret, CogVLM) use specialized heads. This lets a single autoregressive model both describe and locate objects.
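To illustrate how such outputs are consumed downstream, here is a small parser for a hypothetical `<box>x1,y1,x2,y2</box>` format; each model defines its own exact tag syntax and coordinate convention (normalized vs. pixel coordinates), so treat this as a sketch:

```python
import re

def parse_boxes(text):
    """Extract <box>x1,y1,x2,y2</box> spans from a VLM's text output.
    The tag format is illustrative; real models each define their own."""
    boxes = []
    for match in re.finditer(r"<box>([\d.,\s]+)</box>", text):
        coords = [float(v) for v in match.group(1).split(",")]
        if len(coords) == 4:  # ignore malformed spans
            boxes.append(tuple(coords))
    return boxes

output = "The cat <box>0.12,0.30,0.55,0.88</box> sits on a mat."
print(parse_boxes(output))  # [(0.12, 0.3, 0.55, 0.88)]
```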
Simplest approach. A 2-layer MLP maps each vision token from D_v to D_llm dimensions. Visual tokens are concatenated with text tokens and processed by standard self-attention. Used by LLaVA, LLaVA-1.5, PaliGemma. Fast to train, easy to scale.
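A minimal numpy sketch of such a projector. The GELU nonlinearity between the two layers is an assumption in the spirit of LLaVA-1.5's MLP projector, and the dimensions are toy-sized (LLaVA-1.5 maps roughly 1024-dim CLIP features to a 4096-dim LLM space over 576 tokens); a real implementation trains these weights rather than sampling them:

```python
import numpy as np

d_vision, d_llm, num_tokens = 64, 128, 9  # toy sizes; real: ~1024 -> ~4096, 576 tokens

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(d_vision, d_llm))
b1 = np.zeros(d_llm)
W2 = rng.normal(scale=0.02, size=(d_llm, d_llm))
b2 = np.zeros(d_llm)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(vision_features):
    """Map (num_tokens, d_vision) -> (num_tokens, d_llm), one token at a time."""
    return gelu(vision_features @ W1 + b1) @ W2 + b2

vis = rng.normal(size=(num_tokens, d_vision))
projected = project(vis)
print(projected.shape)  # (9, 128)
```

The projector is applied per token, so the number of visual tokens entering the LLM equals the number of patch tokens leaving the encoder.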
Interleaved cross-attention layers are inserted between existing LLM self-attention blocks. Text tokens provide Q, vision features provide K and V. Does not increase the context length. Used by Flamingo, Otter, IDEFICS. More expressive but harder to train.
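A single-head numpy sketch of that asymmetry. The output has the same length as the text stream, which is exactly why cross-attention does not grow the context; multi-head projections, gating, and layer norms are omitted:

```python
import numpy as np

def cross_attention(text_h, vision_f, Wq, Wk, Wv):
    """Single-head cross-attention: text hidden states query vision features."""
    Q = text_h @ Wq            # (T_txt, d)  queries come from text
    K = vision_f @ Wk          # (T_vis, d)  keys come from vision
    V = vision_f @ Wv          # (T_vis, d)  values come from vision
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over vision tokens
    return weights @ V         # (T_txt, d): same length as the text stream

rng = np.random.default_rng(0)
d, t_txt, t_vis = 16, 5, 9
out = cross_attention(
    rng.normal(size=(t_txt, d)), rng.normal(size=(t_vis, d)),
    *(rng.normal(size=(d, d)) for _ in range(3)),
)
print(out.shape)  # (5, 16)
```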
Q-Former uses a small set of learnable query tokens (typically 32) that cross-attend to the vision encoder's output, producing a fixed-size compressed visual summary. Used by BLIP-2, InstructBLIP. Efficient (few visual tokens) but can lose spatial detail.
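The compression step can be sketched as one cross-attention pass. A real Q-Former stacks several BERT-style blocks with trained weights; this toy keeps only the core idea that a fixed set of queries summarizes a variable number of patch tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries, num_patches = 16, 32, 257  # 32 queries vs. 257 ViT patch tokens

queries = rng.normal(size=(num_queries, d))     # learnable query tokens
vision_out = rng.normal(size=(num_patches, d))  # frozen encoder output

# Queries attend to all patch tokens and compress them into a
# fixed-size summary, regardless of how many patches came in.
scores = queries @ vision_out.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
summary = weights @ vision_out
print(summary.shape)  # (32, 16): always 32 tokens
```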
Key Models
A comparative look at the most influential VLMs, their architectural choices, and what makes each one distinctive.
| Model | Vision Encoder | Fusion | LLM Backbone | Key Innovation | Architecture |
|---|---|---|---|---|---|
| LLaVA-1.5 | CLIP ViT-L/14 @336px | 2-layer MLP | Vicuna-13B | Simple projection + 2-stage training; surprisingly strong baseline | img -> CLIP -> MLP -> [vis+txt] -> Vicuna -> out |
| GPT-4V | Proprietary (likely CLIP-scale) | Unknown (likely cross-attn) | GPT-4 | SOTA multimodal reasoning; multi-image; system-level safety | img -> encoder -> ? -> GPT-4 -> out |
| Gemini 1.5 | Native multimodal (not bolt-on) | Natively interleaved | MoE Transformer | 1M-token context; natively multimodal (image/audio/video) from pre-training | img+txt -> unified tokenizer -> MoE -> out |
| Qwen-VL | ViT-bigG @448px | Single-layer cross-attn + 256 queries | Qwen-7B | High-res input; grounding with bbox tokens; multi-image/video | img -> ViT-bigG -> x-attn(256q) -> Qwen -> out |
| InternVL-1.5 | InternViT-6B | MLP (dynamic resolution) | InternLM2-20B | Largest open-source vision encoder (6B); dynamic high-res with tile splitting | img -> InternViT-6B -> MLP -> InternLM2 -> out |
Training Pipeline
VLMs are trained in stages, progressively unfreezing components and shifting from alignment to instruction-following to preference optimization.
Stage 1 — Feature Alignment Pre-training
Train only the projection layer on large-scale image-caption pairs (CC3M, LAION-400M, COYO). The goal is to align vision features into the LLM's embedding space without disrupting either pre-trained model.
Typically ~600K–2M image-text pairs. A few hours on 8 GPUs. Loss: next-token prediction on the caption, conditioned on the image.
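A toy sketch of the stage-1 freezing logic: only the projector's parameter group receives gradient updates, while the vision encoder and LLM are skipped. The parameter names and the SGD step are illustrative, not from any real codebase:

```python
import numpy as np

# Toy "model": three parameter groups, as in stage-1 alignment training.
params = {
    "vision_encoder.w": np.ones((2, 2)),
    "projector.w": np.ones((2, 2)),
    "llm.w": np.ones((2, 2)),
}
trainable = {"projector.w"}  # freeze everything except the projection

def sgd_step(grads, lr=0.1):
    for name, g in grads.items():
        if name in trainable:  # frozen groups are skipped entirely
            params[name] -= lr * g

grads = {name: np.ones_like(w) for name, w in params.items()}
sgd_step(grads)

assert np.allclose(params["projector.w"], 0.9)       # updated
assert np.allclose(params["vision_encoder.w"], 1.0)  # frozen
assert np.allclose(params["llm.w"], 1.0)             # frozen
```

In a real framework the same effect comes from setting `requires_grad=False` on the frozen modules and passing only the projector's parameters to the optimizer.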
Stage 2 — Visual Instruction Tuning
Unfreeze the LLM and fine-tune on high-quality visual instruction-following data: visual Q&A, multi-turn conversations about images, chart/diagram reasoning, OCR tasks, and referring/grounding data.
Datasets: LLaVA-Instruct-150K, ShareGPT4V, TextVQA, GQA, OKVQA, etc. The projection layer continues training alongside the LLM. The vision encoder typically stays frozen (though some recipes unfreeze it too).
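For concreteness, a single training record in a LLaVA-Instruct-like layout. The field names, image path, and the `<image>` placeholder convention are illustrative, not copied from the dataset:

```python
# One record in a LLaVA-Instruct-style format (illustrative, not verbatim):
# an image reference plus a multi-turn conversation, where an <image>
# placeholder marks where the visual tokens are spliced into the prompt.
record = {
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this photo?"},
        {"from": "gpt", "value": "A man is ironing clothes on the back of a taxi."},
    ],
}
print(record["conversations"][0]["value"].startswith("<image>"))  # True
```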
Stage 3 — RLHF / DPO (Optional)
Apply preference optimization to reduce hallucination and improve helpfulness. Collect human rankings of model responses to visual questions, then train with RLHF (PPO) or DPO on the preference pairs.
LLaVA-RLHF showed this significantly reduces object hallucination. Some teams use AI feedback (RLAIF) instead of human feedback. Silkie and RLHF-V are key works in this space.
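The DPO objective on one preference pair can be sketched as follows, following the standard formulation; beta and the log-prob values are illustrative:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: chosen (w) vs. rejected (l) response.
    logp_* are summed token log-probs under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# If the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2) (the zero-margin value).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
print(loss < np.log(2))  # True
```

Minimizing this pushes the policy to increase the chosen response's likelihood relative to the rejected one, measured against the reference model, without a separate reward model.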
What Can VLMs Do?
Modern VLMs are surprisingly versatile. The major capability axes include captioning, visual Q&A, OCR and document understanding, chart/diagram reasoning, grounding, and video understanding, with general-purpose and specialized models trading strength across them.
Known Weaknesses
Despite rapid progress, VLMs still struggle with several fundamental challenges. Understanding these failure modes is critical for responsible deployment.
- Hallucination (high severity): VLMs confidently describe objects, attributes, and relationships that do not exist in the image. POPE and CHAIR benchmarks show even top models hallucinate 20–40% of the time.
- Counting & numeracy (high severity): Counting objects in crowded scenes remains unreliable. Models often guess small counts correctly but fail above ~5 objects. Tally marks and dense grids are especially hard.
- Spatial reasoning (medium severity): "Is the cup to the left or right of the plate?" Relative spatial relationships, depth ordering, and 3D scene understanding remain weak. ViT patch-based encoding loses precise positional information.
- Fine-grained visual details (medium severity): Small text, subtle textures, and fine-grained differences (e.g., bird species, car models) are often missed. Resolution limits (224–448px) are a root cause, though dynamic resolution helps.