Architecture Atlas — 08

Vision-Language Models

Teaching language models to see. Connect a vision encoder to an LLM and unlock multimodal reasoning over images, documents, and video.

Key Years: 2022–24
Key Works: LLaVA, GPT-4V, Gemini
Category: Multimodal

What Is a Vision-Language Model?

A Vision-Language Model (VLM) connects a vision encoder (ViT, CLIP, SigLIP) to a large language model (LLaMA, GPT, Gemma) so the system can process images alongside text.

Input: an image + a text prompt (e.g., "What is unusual about this photo?").
Output: a free-form text response grounded in the visual content of the image.

The vision encoder extracts spatial features from the image and converts them into a sequence of visual tokens that live in the same embedding space as the LLM's text tokens. The language model then attends over both visual and text tokens and generates a response — exactly like normal autoregressive text generation, but now the context includes "seeing."

The VLM Pipeline

Image flows through a frozen vision encoder, gets projected into the LLM's token space, and is concatenated with the text token embeddings before the language model generates a response.

VLM Architecture Pipeline
Image (224x224 or 336x336, 14x14 patches) -> Vision Encoder (ViT-L / CLIP / SigLIP / DINOv2, frozen) -> Projection (MLP / Q-Former / cross-attn) -> concatenate with the tokenized text prompt ([vis][vis][vis][txt][txt][txt]) -> LLM (LLaMA / GPT / Gemma / Qwen, tuned) -> Text Response
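The same flow in code, as a minimal sketch with toy stand-in modules. Every name and dimension below is illustrative rather than taken from any particular model; a real VLM would load pretrained weights for each component.

import torch
import torch.nn as nn

# Illustrative sizes only: D_V = vision feature dim, D_LLM = LLM hidden dim.
D_V, D_LLM, VOCAB, N_PATCH, N_TXT = 1024, 2048, 32000, 256, 16

vision_encoder = nn.Linear(14 * 14 * 3, D_V)          # stand-in ViT: flattened patch -> feature
projector = nn.Linear(D_V, D_LLM)                     # maps vision features into the LLM token space
text_embed = nn.Embedding(VOCAB, D_LLM)               # the LLM's input embedding table
llm = nn.TransformerEncoder(                          # stand-in for a decoder-only LLM
    nn.TransformerEncoderLayer(D_LLM, nhead=8, batch_first=True), num_layers=2)
lm_head = nn.Linear(D_LLM, VOCAB)

image_patches = torch.randn(1, N_PATCH, 14 * 14 * 3)  # a 224x224 image as 256 flattened 14x14 patches
prompt_ids = torch.randint(0, VOCAB, (1, N_TXT))      # tokenized text prompt

vis_tokens = projector(vision_encoder(image_patches)) # (1, 256, D_LLM) "visual tokens"
txt_tokens = text_embed(prompt_ids)                   # (1, 16, D_LLM) text token embeddings
seq = torch.cat([vis_tokens, txt_tokens], dim=1)      # [vis]...[vis][txt]...[txt]

causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
hidden = llm(seq, mask=causal)                        # the LLM attends over both modalities
next_token_logits = lm_head(hidden[:, -1])            # decode the response autoregressively from here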

Core Design Decisions

Four choices that define every VLM: the vision backbone, how visual and text streams merge, the multi-stage training recipe, and how grounded outputs such as bounding boxes are represented.

👁
Vision Encoder Choice

CLIP ViT-L/14 — the default. Trained with contrastive image-text loss; produces features well-aligned to language. Used by LLaVA, InternVL.

SigLIP — sigmoid loss instead of softmax; better calibrated; used by PaLI-X, PaliGemma.

DINOv2 — self-supervised; richer spatial features for dense tasks (segmentation, depth). No language alignment built in.

CLIP SigLIP DINOv2
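As a concrete illustration of the default choice above, here is a short sketch of pulling patch features from the public openai/clip-vit-large-patch14-336 checkpoint with Hugging Face transformers. Exact feature selection (which layer, whether the CLS token is kept) varies by recipe, and the image path below is a placeholder.

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Sketch of extracting patch features from the CLIP vision tower used by LLaVA-1.5.
# The checkpoint name is OpenAI's public release; the image path is a placeholder.
ckpt = "openai/clip-vit-large-patch14-336"
encoder = CLIPVisionModel.from_pretrained(ckpt)
processor = CLIPImageProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")                                    # placeholder path
pixels = processor(images=image, return_tensors="pt").pixel_values   # (1, 3, 336, 336)

with torch.no_grad():                            # the encoder stays frozen in most recipes
    feats = encoder(pixels).last_hidden_state    # (1, 577, 1024): CLS + 24x24 patch tokens

patch_feats = feats[:, 1:]                       # LLaVA-style: drop CLS, keep the 576 patch tokens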
🔗
Fusion Method

Linear / MLP projection — simplest. Map vision features to LLM dim with one or two linear layers. Used by LLaVA.

Cross-attention — insert new cross-attn layers into the LLM that attend to vision features. Used by Flamingo.

Q-Former — learnable query tokens extract fixed-size summaries from the encoder. Used by BLIP-2, InstructBLIP.

MLP Cross-Attn Q-Former
🎯
Training Recipe

Stage 1 — Alignment pre-training. Freeze vision encoder + LLM. Train only the projection on large-scale image-caption pairs (e.g., CC3M, LAION).

Stage 2 — Instruction tuning. Unfreeze the LLM (and optionally the projector). Fine-tune on visual Q&A, conversation, and instruction-following data.

Alignment Instruct Tune RLHF (optional)
📌
Action Space (Grounding)

For grounding and referring tasks, VLMs produce bounding boxes as special tokens in the output vocabulary — e.g., <box>x1,y1,x2,y2</box>.

Some models (Kosmos-2, Shikra, Qwen-VL) add location tokens directly; others (Ferret, CogVLM) use specialized heads. This lets a single autoregressive model both describe and locate objects.

BBox tokens Point tokens Region refs
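A small illustration of consuming such grounded output, assuming the model emits boxes as <box>x1,y1,x2,y2</box> with coordinates normalized to a 0-1000 grid. That convention is common but not universal; real models differ in exact format, so the regex and scaling here are a sketch.

import re

# Hypothetical helper: turn grounded VLM output like
#   "The dog is here <box>112,305,540,871</box>"
# into pixel-space boxes, assuming 0-1000 normalized coordinates.
BOX_RE = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def extract_boxes(text: str, img_w: int, img_h: int) -> list[tuple[int, int, int, int]]:
    boxes = []
    for x1, y1, x2, y2 in BOX_RE.findall(text):
        boxes.append((
            round(int(x1) / 1000 * img_w), round(int(y1) / 1000 * img_h),
            round(int(x2) / 1000 * img_w), round(int(y2) / 1000 * img_h),
        ))
    return boxes

print(extract_boxes("A red car <box>120,450,380,720</box>", img_w=1280, img_h=960))
# -> [(154, 432, 486, 691)]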
Fusion Methods Comparison
MLP projection: Vision Encoder (N x D_v tokens) -> MLP (D_v -> D_llm) -> concatenate [text] + [visual] tokens -> LLM self-attention -> output

Simplest approach. A 2-layer MLP maps each vision token from D_v to D_llm dimensions. Visual tokens are concatenated with text tokens and processed by standard self-attention. Used by LLaVA, LLaVA-1.5, PaliGemma. Fast to train, easy to scale.
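A sketch of such a projector under LLaVA-1.5-like assumptions: two linear layers with a GELU in between. The 1024 and 4096 dimensions are illustrative (CLIP ViT-L features and a 7B LLM, respectively).

import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, d_vision) -> (batch, num_patches, d_llm)
        return self.net(vision_feats)

proj = MLPProjector()
print(proj(torch.randn(1, 576, 1024)).shape)   # torch.Size([1, 576, 4096])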

Cross-attention: Vision Encoder (N x D_v) supplies K and V; text tokens supply Q; cross-attn layers (Q=text, KV=vis) and FFNs are interleaved with the LLM's self-attn layers -> output

Interleaved cross-attention layers are inserted between existing LLM self-attention blocks. Text tokens provide Q, vision features provide K and V. Does not increase the context length. Used by Flamingo, Otter, IDEFICS. More expressive but harder to train.
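A rough sketch of one such block in the Flamingo style, with zero-initialized tanh gates so the pretrained LLM starts out unchanged. Dimensions are illustrative, and the real model also resamples vision features before they reach these blocks.

import torch
import torch.nn as nn

class GatedCrossAttnBlock(nn.Module):
    def __init__(self, d_model: int = 2048, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.gate_attn = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: block starts as a no-op
        self.gate_ffn = nn.Parameter(torch.zeros(1))

    def forward(self, text_h: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.xattn(query=text_h, key=vis_feats, value=vis_feats)
        text_h = text_h + torch.tanh(self.gate_attn) * attn_out
        text_h = text_h + torch.tanh(self.gate_ffn) * self.ffn(text_h)
        return text_h    # same shape as the input text states; context length unchanged

block = GatedCrossAttnBlock()
out = block(torch.randn(1, 16, 2048), torch.randn(1, 256, 2048))
print(out.shape)   # torch.Size([1, 16, 2048])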

Q-Former: Vision Encoder (257 x 1024) -> Q-Former (32 learned queries cross-attend to the vision output, 257 -> 32 tokens) -> Linear (32 x D_llm) -> LLM (32 visual + N text tokens)

Q-Former uses a small set of learnable query tokens (typically 32) that cross-attend to the vision encoder's output, producing a fixed-size compressed visual summary. Used by BLIP-2, InstructBLIP. Efficient (few visual tokens) but can lose spatial detail.
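A heavily simplified sketch of the idea. The real Q-Former in BLIP-2 is a full BERT-style transformer with self- and cross-attention; here a single cross-attention step over learnable queries stands in for it, and the dimensions are illustrative.

import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    def __init__(self, n_queries: int = 32, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_vision) * 0.02)  # learned query tokens
        self.xattn = nn.MultiheadAttention(d_vision, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(d_vision, d_llm)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        summary, _ = self.xattn(query=q, key=vis_feats, value=vis_feats)   # 257 tokens -> 32
        return self.to_llm(summary)          # (batch, 32, d_llm): a fixed-size visual summary

resampler = QueryResampler()
print(resampler(torch.randn(1, 257, 1024)).shape)   # torch.Size([1, 32, 4096])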

Key Models

A comparative look at the most influential VLMs, their architectural choices, and what makes each one distinctive.

Model | Vision Encoder | Fusion | LLM Backbone | Key Innovation | Architecture
LLaVA-1.5 | CLIP ViT-L/14 @336px | 2-layer MLP | Vicuna-13B | Simple projection + 2-stage training; surprisingly strong baseline | img -> CLIP -> MLP -> [vis+txt] -> Vicuna -> out
GPT-4V | Proprietary (likely CLIP-scale) | Unknown (likely cross-attn) | GPT-4 | SOTA multimodal reasoning; multi-image; system-level safety | img -> encoder -> ? -> GPT-4 -> out
Gemini 1.5 | Native multimodal (not bolt-on) | Natively interleaved | MoE Transformer | 1M token context; native image/audio/video; natively multimodal from pre-training | img+txt -> unified tokenizer -> MoE -> out
Qwen-VL | ViT-bigG @448px | Single-layer cross-attn + 256 queries | Qwen-7B | High-res input; grounding with bbox tokens; multi-image/video | img -> ViT-bigG -> x-attn(256q) -> Qwen -> out
InternVL-1.5 | InternViT-6B | MLP (dynamic resolution) | InternLM2-20B | Largest open-source vision encoder (6B); dynamic high-res with tile splitting | img -> InternViT-6B -> MLP -> InternLM2 -> out

Training Pipeline

VLMs are trained in stages, progressively unfreezing components and shifting from alignment to instruction-following to preference optimization.

Training Stages

Stage 1 — Feature Alignment Pre-training

Train only the projection layer on large-scale image-caption pairs (CC3M, LAION-400M, COYO). The goal is to align vision features into the LLM's embedding space without disrupting either pre-trained model.

Typically ~600K–2M image-text pairs. A few hours on 8 GPUs. Loss: next-token prediction on the caption, conditioned on the image.

Vision Encoder: FROZEN · Projection: TRAINABLE · LLM: FROZEN
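A hypothetical training step for this stage, with placeholder module arguments (none of these names come from a real codebase). The llm stand-in is assumed to accept input embeddings directly and apply its own causal mask, as decoder-only LMs do.

import torch
import torch.nn as nn
import torch.nn.functional as F

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        p.requires_grad = False
    return module

def stage1_step(vision_encoder, projector, embed, llm, lm_head, optimizer,
                image, caption_ids):
    vis = projector(vision_encoder(image))         # (B, V, d_llm); encoder is frozen
    txt = embed(caption_ids[:, :-1])               # teacher-forced caption prefix
    hidden = llm(torch.cat([vis, txt], dim=1))     # attend over visual + text tokens
    logits = lm_head(hidden)[:, vis.size(1) - 1:]  # positions whose next token is a caption token
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption_ids.reshape(-1))
    loss.backward()                                # gradients reach only the projector
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example setup (module construction omitted): freeze everything except the projector.
# for m in (vision_encoder, embed, llm, lm_head): freeze(m)
# optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)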

Stage 2 — Visual Instruction Tuning

Unfreeze the LLM and fine-tune on high-quality visual instruction-following data: visual Q&A, multi-turn conversations about images, chart/diagram reasoning, OCR tasks, and referring/grounding data.

Datasets: LLaVA-Instruct-150K, ShareGPT4V, TextVQA, GQA, OKVQA, etc. The projection layer continues training alongside the LLM. The vision encoder typically stays frozen (though some recipes unfreeze it too).

Vision Encoder: FROZEN · Projection: TRAINABLE · LLM: TRAINABLE
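A small, hypothetical example of how one such instruction-tuning example might be serialized in a chat template, with loss restricted to the assistant turns. The exact template, role markers, and image-placeholder token differ across models, and the conversation content here is toy data.

IMAGE_TOKEN = "<image>"   # placeholder token that the projected visual tokens replace

def format_example(conversation: list[dict]) -> tuple[str, list[tuple[int, int]]]:
    """Returns the serialized prompt and the (start, end) character spans to supervise."""
    text, supervised = "", []
    for turn in conversation:
        if turn["role"] == "user":
            text += f"USER: {turn['content']}\n"      # user turns are masked out of the loss
        else:
            start = len(text) + len("ASSISTANT: ")
            text += f"ASSISTANT: {turn['content']}\n"
            supervised.append((start, len(text)))      # only assistant tokens get supervised
    return text, supervised

prompt, spans = format_example([
    {"role": "user", "content": f"{IMAGE_TOKEN}\nWhat is unusual about this photo?"},
    {"role": "assistant", "content": "A man is ironing clothes on a board attached to a moving taxi."},
])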

Stage 3 — RLHF / DPO (Optional)

Apply preference optimization to reduce hallucination and improve helpfulness. Collect human rankings of model responses to visual questions, then train with RLHF (PPO) or DPO on the preference pairs.

LLaVA-RLHF showed this significantly reduces object hallucination. Some teams use AI feedback (RLAIF) instead of human feedback. Silkie and RLHF-V are key works in this space.

Vision Encoder: FROZEN · Projection: TRAINABLE · LLM: TRAINABLE (LoRA or full)
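For the DPO variant, the objective on a batch of preference pairs is compact. This sketch assumes you have already computed the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model, conditioned on the same image and question.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Standard DPO: push the policy's chosen/rejected log-ratio above the reference's.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with dummy log-probabilities for a batch of 2 preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.4]),
                torch.tensor([-13.0, -8.0]), torch.tensor([-14.2, -9.0]))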

What Can VLMs Do?

Modern VLMs are surprisingly versatile. Here are the major capability axes, shown as a radar chart comparing a frontier model (GPT-4V / Gemini) with an open-source baseline (LLaVA-1.5).

Capability Radar: GPT-4V / Gemini (frontier) vs. LLaVA-1.5 (open-source)
💬
Visual Question Answering
Answer open-ended questions about images. "How many people are in this photo?" "What brand is on the sign?"
📄
Document Understanding
Parse charts, tables, receipts, and forms without OCR pre-processing. Read infographics, scientific figures, and diagrams.
🎯
Visual Grounding
Locate objects by outputting bounding boxes. "Find the red car" returns coordinates. Powers referring expression comprehension.
🎥
Video Understanding
Process sampled frames from video to answer temporal questions. "What happened after the ball was thrown?" Multi-frame reasoning.
🖥
GUI Agents
Navigate software interfaces by understanding screenshots. Click buttons, fill forms, browse the web. Powers computer-use agents.
📝
OCR & Text Extraction
Read text in natural images, handwriting, license plates, and scene text. No dedicated OCR module needed — the VLM reads directly.

Known Weaknesses

Despite rapid progress, VLMs still struggle with several fundamental challenges. Understanding these failure modes is critical for responsible deployment.

  • Hallucination (high severity)
    VLMs confidently describe objects, attributes, and relationships that do not exist in the image. POPE and CHAIR benchmarks show even top models hallucinate 20–40% of the time.
  • Counting & Numeracy (high severity)
    Counting objects in crowded scenes remains unreliable. Models often guess small counts correctly but fail above ~5 objects. Tally marks and dense grids are especially hard.
  • Spatial Reasoning (medium severity)
    "Is the cup to the left or right of the plate?" Relative spatial relationships, depth ordering, and 3D scene understanding remain weak. ViT patch-based encoding loses precise positional info.
  • Fine-Grained Visual Details (medium severity)
    Small text, subtle textures, and fine-grained differences (e.g., bird species, car models) are often missed. Resolution limitations (224–448px) are a root cause, though dynamic resolution helps.