Architecture Atlas — 08

Vision-Language Models

Teaching language models to see. Connect a vision encoder to an LLM and unlock multimodal reasoning over images, documents, and video.

Key Years: 2022–24
Key Works: LLaVA, GPT-4V, Gemini
Category: Multimodal

What Is a Vision-Language Model?

A Vision-Language Model (VLM) connects a vision encoder (ViT, CLIP, SigLIP) to a large language model (LLaMA, GPT, Gemma) so the system can process images alongside text.

Input: an image + a text prompt (e.g., "What is unusual about this photo?").
Output: a free-form text response grounded in the visual content of the image.

The vision encoder extracts spatial features from the image and converts them into a sequence of visual tokens that live in the same embedding space as the LLM's text tokens. The language model then attends over both visual and text tokens and generates a response — exactly like normal autoregressive text generation, but now the context includes "seeing."

The VLM Pipeline

Image flows through a frozen vision encoder, gets projected into the LLM's token space, and is concatenated with the text token embeddings before the language model generates a response.

VLM Architecture Pipeline
Image (224x224 or 336x336, 14x14 patches) -> Vision Encoder (ViT-L / CLIP / SigLIP / DINOv2, frozen) -> Projection (MLP / Q-Former / cross-attn) -> concatenate with the tokenized text prompt ([vis][vis][vis][txt][txt][txt]) -> LLM (LLaMA / GPT / Gemma / Qwen, tuned) -> Text Response
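The same flow in code, as a minimal sketch with toy stand-in modules. Every name and dimension below is illustrative rather than taken from any particular model; a real VLM would load pretrained weights for each component.

import torch
import torch.nn as nn

# Illustrative sizes only: D_V = vision feature dim, D_LLM = LLM hidden dim.
D_V, D_LLM, VOCAB, N_PATCH, N_TXT = 1024, 2048, 32000, 256, 16

vision_encoder = nn.Linear(14 * 14 * 3, D_V)          # stand-in ViT: flattened patch -> feature
projector = nn.Linear(D_V, D_LLM)                     # maps vision features into the LLM token space
text_embed = nn.Embedding(VOCAB, D_LLM)               # the LLM's input embedding table
llm = nn.TransformerEncoder(                          # stand-in for a decoder-only LLM
    nn.TransformerEncoderLayer(D_LLM, nhead=8, batch_first=True), num_layers=2)
lm_head = nn.Linear(D_LLM, VOCAB)

image_patches = torch.randn(1, N_PATCH, 14 * 14 * 3)  # a 224x224 image as 256 flattened 14x14 patches
prompt_ids = torch.randint(0, VOCAB, (1, N_TXT))      # tokenized text prompt

vis_tokens = projector(vision_encoder(image_patches)) # (1, 256, D_LLM) "visual tokens"
txt_tokens = text_embed(prompt_ids)                   # (1, 16, D_LLM) text token embeddings
seq = torch.cat([vis_tokens, txt_tokens], dim=1)      # [vis]...[vis][txt]...[txt]

causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
hidden = llm(seq, mask=causal)                        # the LLM attends over both modalities
next_token_logits = lm_head(hidden[:, -1])            # decode the response autoregressively from here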

Core Design Decisions

Four choices that define every VLM: the vision backbone, how visual and text streams merge, the multi-stage training recipe, and how grounded outputs such as bounding boxes are represented.

👁
Vision Encoder Choice

CLIP ViT-L/14 — the default. Trained with contrastive image-text loss; produces features well-aligned to language. Used by LLaVA, InternVL.

SigLIP — sigmoid loss instead of softmax; better calibrated; used by PaLI-X, PaliGemma.

DINOv2 — self-supervised; richer spatial features for dense tasks (segmentation, depth). No language alignment built in.

CLIP SigLIP DINOv2
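As a concrete illustration of the default choice above, here is a short sketch of pulling patch features from the public openai/clip-vit-large-patch14-336 checkpoint with Hugging Face transformers. Exact feature selection (which layer, whether the CLS token is kept) varies by recipe, and the image path below is a placeholder.

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Sketch of extracting patch features from the CLIP vision tower used by LLaVA-1.5.
# The checkpoint name is OpenAI's public release; the image path is a placeholder.
ckpt = "openai/clip-vit-large-patch14-336"
encoder = CLIPVisionModel.from_pretrained(ckpt)
processor = CLIPImageProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")                                    # placeholder path
pixels = processor(images=image, return_tensors="pt").pixel_values   # (1, 3, 336, 336)

with torch.no_grad():                            # the encoder stays frozen in most recipes
    feats = encoder(pixels).last_hidden_state    # (1, 577, 1024): CLS + 24x24 patch tokens

patch_feats = feats[:, 1:]                       # LLaVA-style: drop CLS, keep the 576 patch tokens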
🔗
Fusion Method

Linear / MLP projection — simplest. Map vision features to LLM dim with one or two linear layers. Used by LLaVA.

Cross-attention — insert new cross-attn layers into the LLM that attend to vision features. Used by Flamingo.

Q-Former — learnable query tokens extract fixed-size summaries from the encoder. Used by BLIP-2, InstructBLIP.

MLP Cross-Attn Q-Former
🎯
Training Recipe

Stage 1 — Alignment pre-training. Freeze vision encoder + LLM. Train only the projection on large-scale image-caption pairs (e.g., CC3M, LAION).

Stage 2 — Instruction tuning. Unfreeze the LLM (and optionally the projector). Fine-tune on visual Q&A, conversation, and instruction-following data.

Alignment Instruct Tune RLHF (optional)
📌
Action Space (Grounding)

For grounding and referring tasks, VLMs produce bounding boxes as special tokens in the output vocabulary — e.g., <box>x1,y1,x2,y2</box>.

Some models (Kosmos-2, Shikra, Qwen-VL) add location tokens directly; others (Ferret, CogVLM) use specialized heads. This lets a single autoregressive model both describe and locate objects.

BBox tokens Point tokens Region refs
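A small illustration of consuming such grounded output, assuming the model emits boxes as <box>x1,y1,x2,y2</box> with coordinates normalized to a 0-1000 grid. That convention is common but not universal; real models differ in exact format, so the regex and scaling here are a sketch.

import re

# Hypothetical helper: turn grounded VLM output like
#   "The dog is here <box>112,305,540,871</box>"
# into pixel-space boxes, assuming 0-1000 normalized coordinates.
BOX_RE = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def extract_boxes(text: str, img_w: int, img_h: int) -> list[tuple[int, int, int, int]]:
    boxes = []
    for x1, y1, x2, y2 in BOX_RE.findall(text):
        boxes.append((
            round(int(x1) / 1000 * img_w), round(int(y1) / 1000 * img_h),
            round(int(x2) / 1000 * img_w), round(int(y2) / 1000 * img_h),
        ))
    return boxes

print(extract_boxes("A red car <box>120,450,380,720</box>", img_w=1280, img_h=960))
# -> [(154, 432, 486, 691)]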
Fusion Methods Comparison
MLP projection: Vision Encoder (N x D_v tokens) -> MLP (D_v -> D_llm) -> concatenate [text] + [visual] tokens -> LLM self-attention -> output

Simplest approach. A 2-layer MLP maps each vision token from D_v to D_llm dimensions. Visual tokens are concatenated with text tokens and processed by standard self-attention. Used by LLaVA, LLaVA-1.5, PaliGemma. Fast to train, easy to scale.
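A sketch of such a projector under LLaVA-1.5-like assumptions: two linear layers with a GELU in between. The 1024 and 4096 dimensions are illustrative (CLIP ViT-L features and a 7B LLM, respectively).

import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, d_vision) -> (batch, num_patches, d_llm)
        return self.net(vision_feats)

proj = MLPProjector()
print(proj(torch.randn(1, 576, 1024)).shape)   # torch.Size([1, 576, 4096])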

Cross-attention: Vision Encoder (N x D_v) supplies K and V; text tokens supply Q; cross-attn layers (Q=text, KV=vis) and FFNs are interleaved with the LLM's self-attn layers -> output

Interleaved cross-attention layers are inserted between existing LLM self-attention blocks. Text tokens provide Q, vision features provide K and V. Does not increase the context length. Used by Flamingo, Otter, IDEFICS. More expressive but harder to train.
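A rough sketch of one such block in the Flamingo style, with zero-initialized tanh gates so the pretrained LLM starts out unchanged. Dimensions are illustrative, and the real model also resamples vision features before they reach these blocks.

import torch
import torch.nn as nn

class GatedCrossAttnBlock(nn.Module):
    def __init__(self, d_model: int = 2048, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.gate_attn = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: block starts as a no-op
        self.gate_ffn = nn.Parameter(torch.zeros(1))

    def forward(self, text_h: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.xattn(query=text_h, key=vis_feats, value=vis_feats)
        text_h = text_h + torch.tanh(self.gate_attn) * attn_out
        text_h = text_h + torch.tanh(self.gate_ffn) * self.ffn(text_h)
        return text_h    # same shape as the input text states; context length unchanged

block = GatedCrossAttnBlock()
out = block(torch.randn(1, 16, 2048), torch.randn(1, 256, 2048))
print(out.shape)   # torch.Size([1, 16, 2048])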

Q-Former: Vision Encoder (257 x 1024) -> Q-Former (32 learned queries cross-attend to the vision output, 257 -> 32 tokens) -> Linear (32 x D_llm) -> LLM (32 visual + N text tokens)

Q-Former uses a small set of learnable query tokens (typically 32) that cross-attend to the vision encoder's output, producing a fixed-size compressed visual summary. Used by BLIP-2, InstructBLIP. Efficient (few visual tokens) but can lose spatial detail.
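A heavily simplified sketch of the idea. The real Q-Former in BLIP-2 is a full BERT-style transformer with self- and cross-attention; here a single cross-attention step over learnable queries stands in for it, and the dimensions are illustrative.

import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    def __init__(self, n_queries: int = 32, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_vision) * 0.02)  # learned query tokens
        self.xattn = nn.MultiheadAttention(d_vision, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(d_vision, d_llm)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        summary, _ = self.xattn(query=q, key=vis_feats, value=vis_feats)   # 257 tokens -> 32
        return self.to_llm(summary)          # (batch, 32, d_llm): a fixed-size visual summary

resampler = QueryResampler()
print(resampler(torch.randn(1, 257, 1024)).shape)   # torch.Size([1, 32, 4096])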

Key Models

A comparative look at the most influential VLMs, their architectural choices, and what makes each one distinctive.

Model | Vision Encoder | Fusion | LLM Backbone | Key Innovation | Architecture
LLaVA-1.5 | CLIP ViT-L/14 @336px | 2-layer MLP | Vicuna-13B | Simple projection + 2-stage training; surprisingly strong baseline | img -> CLIP -> MLP -> [vis+txt] -> Vicuna -> out
GPT-4V | Proprietary (likely CLIP-scale) | Unknown (likely cross-attn) | GPT-4 | SOTA multimodal reasoning; multi-image; system-level safety | img -> encoder -> ? -> GPT-4 -> out
Gemini 1.5 | Native multimodal (not bolt-on) | Natively interleaved | MoE Transformer | 1M token context; native image/audio/video; natively multimodal from pre-training | img+txt -> unified tokenizer -> MoE -> out
Qwen-VL | ViT-bigG @448px | Single-layer cross-attn + 256 queries | Qwen-7B | High-res input; grounding with bbox tokens; multi-image/video | img -> ViT-bigG -> x-attn(256q) -> Qwen -> out
InternVL-1.5 | InternViT-6B | MLP (dynamic resolution) | InternLM2-20B | Largest open-source vision encoder (6B); dynamic high-res with tile splitting | img -> InternViT-6B -> MLP -> InternLM2 -> out

Training Pipeline

VLMs are trained in stages, progressively unfreezing components and shifting from alignment to instruction-following to preference optimization.

Training Stages

Stage 1 — Feature Alignment Pre-training

Train only the projection layer on large-scale image-caption pairs (CC3M, LAION-400M, COYO). The goal is to align vision features into the LLM's embedding space without disrupting either pre-trained model.

Typically ~600K–2M image-text pairs. A few hours on 8 GPUs. Loss: next-token prediction on the caption, conditioned on the image.

Vision Encoder: FROZEN · Projection: TRAINABLE · LLM: FROZEN
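A hypothetical training step for this stage, with placeholder module arguments (none of these names come from a real codebase). The llm stand-in is assumed to accept input embeddings directly and apply its own causal mask, as decoder-only LMs do.

import torch
import torch.nn as nn
import torch.nn.functional as F

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        p.requires_grad = False
    return module

def stage1_step(vision_encoder, projector, embed, llm, lm_head, optimizer,
                image, caption_ids):
    vis = projector(vision_encoder(image))         # (B, V, d_llm); encoder is frozen
    txt = embed(caption_ids[:, :-1])               # teacher-forced caption prefix
    hidden = llm(torch.cat([vis, txt], dim=1))     # attend over visual + text tokens
    logits = lm_head(hidden)[:, vis.size(1) - 1:]  # positions whose next token is a caption token
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption_ids.reshape(-1))
    loss.backward()                                # gradients reach only the projector
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example setup (module construction omitted): freeze everything except the projector.
# for m in (vision_encoder, embed, llm, lm_head): freeze(m)
# optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)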

Stage 2 — Visual Instruction Tuning

Unfreeze the LLM and fine-tune on high-quality visual instruction-following data: visual Q&A, multi-turn conversations about images, chart/diagram reasoning, OCR tasks, and referring/grounding data.

Datasets: LLaVA-Instruct-150K, ShareGPT4V, TextVQA, GQA, OKVQA, etc. The projection layer continues training alongside the LLM. The vision encoder typically stays frozen (though some recipes unfreeze it too).

Vision Encoder: FROZEN · Projection: TRAINABLE · LLM: TRAINABLE
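A small, hypothetical example of how one such instruction-tuning example might be serialized in a chat template, with loss restricted to the assistant turns. The exact template, role markers, and image-placeholder token differ across models, and the conversation content here is toy data.

IMAGE_TOKEN = "<image>"   # placeholder token that the projected visual tokens replace

def format_example(conversation: list[dict]) -> tuple[str, list[tuple[int, int]]]:
    """Returns the serialized prompt and the (start, end) character spans to supervise."""
    text, supervised = "", []
    for turn in conversation:
        if turn["role"] == "user":
            text += f"USER: {turn['content']}\n"      # user turns are masked out of the loss
        else:
            start = len(text) + len("ASSISTANT: ")
            text += f"ASSISTANT: {turn['content']}\n"
            supervised.append((start, len(text)))      # only assistant tokens get supervised
    return text, supervised

prompt, spans = format_example([
    {"role": "user", "content": f"{IMAGE_TOKEN}\nWhat is unusual about this photo?"},
    {"role": "assistant", "content": "A man is ironing clothes on a board attached to a moving taxi."},
])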

Stage 3 — RLHF / DPO (Optional)

Apply preference optimization to reduce hallucination and improve helpfulness. Collect human rankings of model responses to visual questions, then train with RLHF (PPO) or DPO on the preference pairs.

LLaVA-RLHF showed this significantly reduces object hallucination. Some teams use AI feedback (RLAIF) instead of human feedback. Silkie and RLHF-V are key works in this space.

Vision Encoder: FROZEN · Projection: TRAINABLE · LLM: TRAINABLE (LoRA or full)
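For the DPO variant, the objective on a batch of preference pairs is compact. This sketch assumes you have already computed the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model, conditioned on the same image and question.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Standard DPO: push the policy's chosen/rejected log-ratio above the reference's.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with dummy log-probabilities for a batch of 2 preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.4]),
                torch.tensor([-13.0, -8.0]), torch.tensor([-14.2, -9.0]))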

What Can VLMs Do?

Modern VLMs are surprisingly versatile. Here are the major capability axes, shown as a radar chart comparing a frontier model (GPT-4V / Gemini) with an open-source baseline (LLaVA-1.5).

Capability Radar: GPT-4V / Gemini (frontier) vs. LLaVA-1.5 (open-source)
💬
Visual Question Answering
Answer open-ended questions about images. "How many people are in this photo?" "What brand is on the sign?"
📄
Document Understanding
Parse charts, tables, receipts, and forms without OCR pre-processing. Read infographics, scientific figures, and diagrams.
🎯
Visual Grounding
Locate objects by outputting bounding boxes. "Find the red car" returns coordinates. Powers referring expression comprehension.
🎥
Video Understanding
Process sampled frames from video to answer temporal questions. "What happened after the ball was thrown?" Multi-frame reasoning.
🖥
GUI Agents
Navigate software interfaces by understanding screenshots. Click buttons, fill forms, browse the web. Powers computer-use agents.
📝
OCR & Text Extraction
Read text in natural images, handwriting, license plates, and scene text. No dedicated OCR module needed — the VLM reads directly.

Known Weaknesses

Despite rapid progress, VLMs still struggle with several fundamental challenges. Understanding these failure modes is critical for responsible deployment.

  • Hallucination (high severity)
    VLMs confidently describe objects, attributes, and relationships that do not exist in the image. POPE and CHAIR benchmarks show even top models hallucinate 20–40% of the time.
  • Counting & Numeracy (high severity)
    Counting objects in crowded scenes remains unreliable. Models often guess small counts correctly but fail above ~5 objects. Tally marks and dense grids are especially hard.
  • Spatial Reasoning (medium severity)
    "Is the cup to the left or right of the plate?" Relative spatial relationships, depth ordering, and 3D scene understanding remain weak. ViT patch-based encoding loses precise positional info.
  • Fine-Grained Visual Details (medium severity)
    Small text, subtle textures, and fine-grained differences (e.g., bird species, car models) are often missed. Resolution limitations (224–448px) are a root cause, though dynamic resolution helps.