How modern AI learned to see and speak at the same time — fusing pixels and words into a single reasoning engine that can describe, answer, and understand images.
Humans effortlessly combine what they see with what they read. You look at a chart and describe trends. You glance at a photo and answer questions about it. For decades, vision and language lived in separate AI systems. A Vision-Language Model (VLM) merges them into a single architecture that processes images and text together.
The core challenge: images are grids of pixels (spatial, continuous) while text is a sequence of tokens (discrete, symbolic). A VLM must bridge these two very different representations and allow them to interact so the model can reason about visual content using language.
Here's the entire VLM pipeline in concrete tensor shapes (we'll unpack each box in later chapters):
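The shapes can be traced with a few lines of arithmetic. This is a minimal sketch, not any library's API: `trace_pipeline` is a hypothetical helper, and the default `llm_dim=4096` and `n_text_tokens=50` are placeholder assumptions to replace with your own LLM's hidden size and prompt length.

```python
def trace_pipeline(image_size=336, patch_size=14, vision_dim=1024,
                   llm_dim=4096, n_text_tokens=50):
    """Return (stage name, tensor shape) pairs for one image and one prompt."""
    grid = image_size // patch_size          # patches per side
    n_vis = grid * grid                      # vision tokens
    return [
        ("raw image (C, H, W)", (3, image_size, image_size)),
        ("vision encoder output", (n_vis, vision_dim)),
        ("projected vision tokens", (n_vis, llm_dim)),
        ("text embeddings", (n_text_tokens, llm_dim)),
        ("fused LLM input", (n_vis + n_text_tokens, llm_dim)),
    ]


stages = trace_pipeline()
for name, shape in stages:
    print(f"{name:<26} {shape}")
```

The only stage that changes the feature width is the projector; every other step changes (or concatenates along) the sequence dimension.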
The vision encoder converts a raw image into a sequence of feature vectors. The dominant approach is ViT (Vision Transformer): chop the image into fixed-size patches (e.g., 14×14 pixels), flatten each patch into a vector, add positional embeddings, and run them through a transformer.
After encoding, an image becomes a grid of high-dimensional vectors — essentially the same shape as a sequence of text tokens. This structural similarity is what makes vision-language fusion possible.
Let's trace the exact shapes through the most common vision encoder. CLIP ViT-L/14 takes a 336×336 image and uses 14×14 pixel patches:
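A small sketch of that trace, using only the numbers stated above (336×336 input, 14×14 patches, 1024-dim hidden size). The helper name `vit_shape_trace` is hypothetical. It also assumes the encoder prepends one class (CLS) token that the VLM later drops, which is how CLIP-style ViTs are commonly used but should be checked against the specific checkpoint.

```python
def vit_shape_trace(image_size=336, patch_size=14, channels=3,
                    hidden_dim=1024, has_cls_token=True):
    """Shapes inside a ViT encoder for a single image (no batch dimension)."""
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    grid = image_size // patch_size
    n_patches = grid * grid
    patch_values = patch_size * patch_size * channels   # values per flattened patch
    seq_len = n_patches + (1 if has_cls_token else 0)
    return {
        "patch_grid": (grid, grid),
        "flattened_patches": (n_patches, patch_values),
        "patch_embeddings": (n_patches, hidden_dim),
        "encoder_sequence": (seq_len, hidden_dim),       # includes CLS if present
        "tokens_passed_to_vlm": (n_patches, hidden_dim), # CLS dropped (assumption)
    }


shapes = vit_shape_trace()
for name, shape in shapes.items():
    print(f"{name:<22} {shape}")
```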
Each patch becomes one token for the transformer. Smaller patches preserve more detail but lengthen the sequence; larger patches shorten the sequence at the cost of resolution.
| Encoder | Params | Used By | Patch Size | Output Dim |
|---|---|---|---|---|
| CLIP ViT-L/14 | 304M | LLaVA, many VLMs | 14×14 | 1024 |
| SigLIP (SO400M) | 400M | PaliGemma, newer VLMs | 14×14 | 1152 |
| EVA-CLIP (ViT-G) | 1.0B | InternVL | 14×14 | 1408 |
| DINOv2 (ViT-L) | 304M | Research models | 14×14 | 1024 |
The VLM keeps all 576 tokens because it needs spatial detail. CLIP discards spatial info because matching only needs global semantics. Same backbone, different output usage. This is why VLMs use the penultimate layer (before pooling destroys spatial info).
Where else do you see the same encoder used with different amounts of its output? (Hint: think about how the Transformer lesson shows hidden states at every layer.)
For a 672×672 input with 14×14 patches: 672/14 = 48 patches per side, so 48×48 = 2,304 tokens, each of dimension 1024. That's 4× more tokens than 336×336 (576 tokens), and attention cost scales quadratically: 2304²/576² = 16× more compute.
If you double patch size to 28×28: you get (672/28)² = 24×24 = 576 tokens (same as the standard 336px case). Each patch now covers 28×28×3 = 2,352 pixel values instead of 588, so 4× more information is compressed into the same 1024-dim vector. You lose fine-grained detail but save massive compute. This is the fundamental resolution/compute trade-off in VLMs.
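The same arithmetic as a reusable function, to make the trade-off easy to explore. This is an illustrative sketch: `resolution_tradeoff` is a hypothetical helper, and "relative attention cost" here means only the quadratic token-count term, measured against the 576-token baseline.

```python
def resolution_tradeoff(image_size, patch_size, base_tokens=576, channels=3):
    """Token count, values per patch, and attention cost relative to base_tokens."""
    grid = image_size // patch_size
    tokens = grid * grid
    return {
        "tokens": tokens,
        "values_per_patch": patch_size * patch_size * channels,
        "relative_attention_cost": (tokens / base_tokens) ** 2,
    }


high_res = resolution_tradeoff(672, 14)      # more tokens, same patch size
big_patches = resolution_tradeoff(672, 28)   # same tokens, coarser patches
print(high_res)
print(big_patches)
```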
The vision encoder outputs feature vectors in its embedding space. The language model expects tokens in its embedding space. These spaces don't match! We need a projection layer to translate between them.
The simplest bridge is a linear projection (a single matrix multiply). More sophisticated ones use an MLP (two linear layers with an activation) or a cross-attention resampler (like Flamingo's Perceiver). The projection doesn't just resize — it aligns visual features with the language model's semantic space.
LLaVA-1.5 uses a 2-layer MLP to project from CLIP's 1024-dim space to Vicuna-13B's 5120-dim space:
| Projection Type | Architecture | Params | Token Count | Used By |
|---|---|---|---|---|
| Linear | Single W: [d_vis, d_llm] | ~4M | Same as input (576) | LLaVA v1 |
| MLP (2-layer) | Linear + GELU + Linear | ~21M | Same as input (576) | LLaVA-1.5 |
| Perceiver Resampler | Cross-attention with K learned queries | ~100M | K (e.g., 64 or 256) | Flamingo, Qwen-VL |
| C-Abstractor | Conv layers + adaptive pooling | ~50M | Reduced (e.g., 144) | Honeybee |
LLaVA-1.5 bridges CLIP ViT-L (output dim = 1024) to the LLM with a 2-layer MLP whose hidden and output widths both equal the LLM's hidden size. Take an LLM with 4096-dim token embeddings (the hidden size of Vicuna-7B). The architecture is: Linear(1024, 4096) → GELU → Linear(4096, 4096).
Your task: Calculate the total parameter count. Why must the output dimension equal the LLM's hidden size rather than some other convenient width?
Full derivation:
Layer 1: Linear(1024, 4096) = 1024 × 4096 + 4096 = 4,194,304 + 4,096 = 4,198,400 params
Layer 2: Linear(4096, 4096) = 4096 × 4096 + 4096 = 16,777,216 + 4,096 = 16,781,312 params
Total: 4,198,400 + 16,781,312 = 20,979,712 ≈ 21M parameters
The key insight: The output dimension must match the LLM's token-embedding dimension, because visual tokens enter the transformer in the same space as text embeddings. For a 4096-dim LLM that gives ≈21M parameters. Vicuna-13B's hidden size is 5120 (its FFN intermediate width is larger still, at 13,824), so the 13B projector is Linear(1024, 5120) → GELU → Linear(5120, 5120), about 31.5M parameters. Either way the bridge is well under 1% of the LLM's parameters, yet it carries all of the alignment burden.
```python
import torch
import torch.nn as nn


class VisionProjection(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # 1024 -> 4096
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),     # 4096 -> 4096
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, n_patches, vision_dim]
        return self.mlp(x)  # [batch, n_patches, llm_dim]
```
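To check the parameter arithmetic without instantiating a model, the count can be computed directly. This is a dependency-free sketch with hypothetical helper names; it evaluates the two LLM widths mentioned in the text (4096 and 5120) and assumes both linear layers have bias terms.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameters in one fully connected layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_projector_params(vision_dim, llm_dim):
    """Linear(vision_dim, llm_dim) -> GELU -> Linear(llm_dim, llm_dim)."""
    return linear_params(vision_dim, llm_dim) + linear_params(llm_dim, llm_dim)


for llm_dim in (4096, 5120):
    print(f"llm_dim={llm_dim}: {mlp_projector_params(1024, llm_dim):,} params")
```

GELU has no parameters, so it does not appear in the count.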
Once vision tokens and text tokens are in the same space, how do we let them interact? There are several architectural strategies, each with different trade-offs:
| Strategy | How It Works | Example |
|---|---|---|
| Early Fusion | Concatenate vision + text tokens, feed to one transformer | LLaVA |
| Cross-Attention | Text attends to vision via extra cross-attention layers | Flamingo |
| Perceiver Resampler | Learned queries compress vision tokens to fixed count | Flamingo, Qwen-VL |
| Interleaved | Vision tokens inserted at corresponding text positions | Fuyu, Gemini |
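In early fusion, the LLM's self-attention sees a sequence like [v1, v2, ..., v576, t1, t2, ..., t50]. Because attention is causal, each vision token attends only to the vision tokens before it, while every text token attends to all 576 vision tokens plus the text tokens that precede it.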
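A toy causal mask makes the visibility pattern concrete. The sketch below uses 4 vision tokens and 3 text tokens instead of 576 and 50 so the result is easy to inspect; `causal_mask` and `visible_counts` are illustrative names, not part of any framework.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def visible_counts(n_vision, n_text):
    """For each position: (kind, vision tokens visible, text tokens visible)."""
    mask = causal_mask(n_vision + n_text)
    rows = []
    for i, row in enumerate(mask):
        kind = "vision" if i < n_vision else "text"
        rows.append((kind, sum(row[:n_vision]), sum(row[n_vision:])))
    return rows


rows = visible_counts(n_vision=4, n_text=3)
for kind, seen_vision, seen_text in rows:
    print(f"{kind:<6} sees {seen_vision} vision, {seen_text} text")
```

Every text row reports the full vision count, which is why placing the image before the question lets the whole answer condition on it.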
Each approach makes fundamentally different engineering choices about what's frozen, what's trained, and how vision enters the LLM:
| Model | Bridge Type | Vision Encoder | LLM | Vision Tokens to LLM | Total Params |
|---|---|---|---|---|---|
| LLaVA-1.5 | MLP (2-layer) | CLIP ViT-L (frozen) | Vicuna-13B (Stage 2: tuned) | 576 (all patches) | ~13.4B |
| Flamingo | Gated cross-attn | NFNet (frozen) | Chinchilla (frozen) | 64 (resampled) | ~80B |
| Qwen-VL | Cross-attn resampler | ViT-G (trained) | Qwen-7B (tuned) | 256 (compressed) | ~9.6B |
| InternVL | QLLaMA adapter | EVA-CLIP-G (tuned) | InternLM-20B (tuned) | 256 (resampled) | ~20B |
Flamingo inserts cross-attention layers into its LLM (Chinchilla-70B, d_model = 8192). The vision encoder outputs 64 tokens of dimension 1024 (after a Perceiver resampler). Cross-attention uses Q from text (d = 8192), K and V from vision (d = 1024), with num_heads = 64 and head_dim = 128.
Your task: Derive the projection dimensions for W_Q, W_K, W_V, and W_O. What is the total parameter cost of one cross-attention layer?
Full derivation:
W_Q: [d_text, n_heads × head_dim] = [8192, 8192] = 67.1M params
W_K: [d_vision, n_heads × head_dim] = [1024, 8192] = 8.4M params
W_V: [d_vision, n_heads × head_dim] = [1024, 8192] = 8.4M params
W_O: [n_heads × head_dim, d_text] = [8192, 8192] = 67.1M params
Total per cross-attention layer: 67.1 + 8.4 + 8.4 + 67.1 = ~151M parameters
The key insight: W_K and W_V are much smaller (8.4M each) because they project from the 1024-dim vision space, not the 8192-dim text space. All four matrices are new parameters, since these cross-attention layers are inserted alongside the LLM's existing self-attention rather than reusing its weights; the text-side projections (W_Q and W_O) account for about 89% of the ~151M total. The asymmetry also helps at run time: keys and values are computed for only 64 vision tokens.
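The derivation can be checked mechanically. A minimal sketch under the exercise's stated dimensions, ignoring bias terms and the gating scalars; `cross_attention_params` is a hypothetical helper.

```python
def cross_attention_params(d_text, d_vision, n_heads, head_dim):
    """Weight shapes and parameter counts for one cross-attention layer (no biases)."""
    d_attn = n_heads * head_dim
    shapes = {
        "W_Q": (d_text, d_attn),     # queries come from text hidden states
        "W_K": (d_vision, d_attn),   # keys come from vision tokens
        "W_V": (d_vision, d_attn),   # values come from vision tokens
        "W_O": (d_attn, d_text),     # output projects back to the text width
    }
    counts = {name: rows * cols for name, (rows, cols) in shapes.items()}
    return shapes, counts


shapes, counts = cross_attention_params(d_text=8192, d_vision=1024,
                                        n_heads=64, head_dim=128)
total = sum(counts.values())
for name in shapes:
    print(f"{name}: {shapes[name]} -> {counts[name] / 1e6:.1f}M")
print(f"total: {total / 1e6:.1f}M")
```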
Having the right architecture isn't enough — the model needs to learn when and how to use visual information. Visual instruction tuning trains the model on (image, instruction, response) triples so it can follow natural language commands about images.
The key innovation: instead of training on just captions ("A cat on a sofa"), you train on diverse tasks: "What color is the cat?", "Count the cushions", "Write a poem about this scene", "Is this image safe for children?" This teaches the model to be a general-purpose visual assistant.
A model trained only on image-caption pairs learns to describe, but it can't reason. It can say "a cat on a sofa" but can't answer "Is the sofa large enough for two people?" because captioning datasets never ask questions that require spatial reasoning, counting, comparison, or inference about intent.
| Training Data Type | What the Model Learns | What It Can't Do |
|---|---|---|
| Captions only | "A dog playing in the park" | Answer questions, count objects, reason spatially |
| VQA datasets | Answer factual questions | Free-form conversation, creative tasks |
| Instruction tuning | All of the above + follow instructions | Out-of-distribution tasks (but fewer gaps) |
During instruction tuning, the model receives a sequence like:
```text
[IMG_1] [IMG_2] ... [IMG_576] [SYS] You are a helpful assistant. [USR] What animal is in this image? [ASST]
# Loss is computed ONLY on the assistant's response tokens:
"This image shows a golden retriever lying on grass..."
# Vision tokens and user tokens: loss = 0 (no gradient)
# Assistant tokens: standard next-token prediction loss
```
The input sequence is: [v1...v576][SYS][USR question][ASST response tokens r1...rT]. Standard language model training computes cross-entropy loss on every token. Visual instruction tuning modifies this.
Your task: Write out the loss function. Which tokens contribute to the loss? Which tokens have their loss masked to zero? Why is this masking critical?
Full derivation:
Let the full sequence be $x = [v_1 \dots v_{576},\ s_1 \dots s_K,\ r_1 \dots r_T]$, where $s$ = system+user tokens and $r$ = response tokens.
Define the mask: $m_i = 1$ if token $i$ is a response token, $0$ otherwise.
Loss: $\mathcal{L} = -\frac{1}{T} \sum_i m_i \log P_\theta(x_i \mid x_{<i})$
Concretely: $m_i = 0$ for all 576 vision tokens, $0$ for all system/user tokens, and $1$ only for $r_1$ through $r_T$.
The key insight: The model sees all tokens during the forward pass (vision tokens condition the response via attention), but only learns to generate the response. This is what separates instruction tuning from pretraining: the model learns to use visual context without trying to predict it. If you accidentally trained on vision tokens, the model would waste capacity trying to predict patch embeddings — a regression task the autoregressive head isn't designed for.
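The masked loss is short enough to write out by hand. This sketch works on a plain list of per-token log-probabilities instead of model logits, so it runs without a deep learning framework. The helper names and the made-up probabilities (0.5 for masked positions, 0.25 for response tokens) are purely illustrative.

```python
import math


def build_loss_mask(n_vision, n_prompt, n_response):
    """0 for vision and system/user positions, 1 for response positions."""
    return [0] * (n_vision + n_prompt) + [1] * n_response


def masked_nll(token_log_probs, loss_mask):
    """Average negative log-likelihood over positions where the mask is 1."""
    assert len(token_log_probs) == len(loss_mask)
    n_targets = sum(loss_mask)
    if n_targets == 0:
        raise ValueError("mask selects no tokens")
    total = -sum(lp * m for lp, m in zip(token_log_probs, loss_mask))
    return total / n_targets


mask = build_loss_mask(n_vision=576, n_prompt=20, n_response=4)
# Made-up log-probabilities: masked positions get 0.5, response tokens 0.25.
log_probs = [math.log(0.5)] * (576 + 20) + [math.log(0.25)] * 4
loss = masked_nll(log_probs, mask)
print(f"{sum(mask)} supervised tokens, loss = {loss:.4f}")
```

The 596 masked positions have no effect on the value: changing their log-probabilities leaves the loss untouched, which is exactly the behavior the mask is meant to enforce.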
LLaVA (Large Language and Vision Assistant) showed that a surprisingly simple recipe works remarkably well: take a pretrained CLIP vision encoder, a pretrained LLM (like Vicuna/LLaMA), and connect them with a single linear projection. That's it.
The image (resized to 336×336) goes through CLIP ViT-L/14 → produces 576 vision tokens (a 24×24 grid from the penultimate layer) → each is projected to the LLM's dimension (one linear layer in the original LLaVA, a 2-layer MLP in LLaVA-1.5) → prepended to the text tokens → the LLM generates the response autoregressively.
Let's trace exactly what the LLM receives. For a question like "What is in this image?":
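One way to picture the assembled input is as a prompt with an image placeholder that gets expanded into vision-token slots. The sketch below is illustrative only: the `<image>` marker, the whitespace-style tokenization, and `splice_vision_tokens` are assumptions for the example, and a real tokenizer plus chat template would add system-prompt and role tokens on top.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where the image goes


def splice_vision_tokens(prompt_tokens, n_vision_tokens=576):
    """Expand each image placeholder into n_vision_tokens vision slots."""
    fused = []
    for tok in prompt_tokens:
        if tok == IMAGE_PLACEHOLDER:
            fused.extend(("vision", i) for i in range(n_vision_tokens))
        else:
            fused.append(("text", tok))
    return fused


prompt = [IMAGE_PLACEHOLDER, "What", "is", "in", "this", "image", "?"]
fused = splice_vision_tokens(prompt)
n_vision = sum(1 for kind, _ in fused if kind == "vision")
print(f"{len(fused)} positions: {n_vision} vision + {len(fused) - n_vision} text")
```

In the actual model each vision slot holds a projected patch embedding and each text slot holds a looked-up word embedding, so the fused sequence is a single [sequence, llm_dim] tensor.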
Where does the time go when a VLM answers a question about an image?
| Stage | Time | What Happens |
|---|---|---|
| Image preprocessing | ~5 ms | Resize to 336×336, normalize to [-1, 1] |
| CLIP forward pass | ~15 ms | 24 transformer layers on 576 tokens (304M params) |
| MLP projection | ~0.1 ms | Two linear layers, 21M params — trivial cost |
| LLM prefill | ~30 ms | Process all 626 tokens through 40 layers in parallel |
| LLM decode | ~2,000 ms | ~20 ms/token × 100 tokens of output (sequential!) |
| Total | ~2.05 s | Autoregressive decode dominates by 40× |
VLMs are trained in stages, not all at once. Each stage has a different purpose and different parts of the model are frozen or unfrozen:
| Stage | Data | What Trains | Purpose |
|---|---|---|---|
| 1. Pretraining alignment | 595K image-caption pairs | Projection only | Align vision ↔ language spaces |
| 2. Instruction tuning | 158K visual conversations | Projection + LLM | Teach instruction following |
| Component | Params | Stage 1 | Stage 2 |
|---|---|---|---|
| CLIP ViT-L/14 | 304M | Frozen | Frozen |
| MLP Projector | ~21M | Training | Training |
| Vicuna-13B | 13,000M | Frozen | Training |
| Stage 1 trainable | 21M / 13,325M = 0.16% of total | | |
| Stage 2 trainable | 13,021M / 13,325M = 97.7% of total | | |
| Stage | Trainable Params | Data | Wall-Clock Time (8×A100) | Cost (~$2/GPU-hr) |
|---|---|---|---|---|
| Stage 1: Alignment | 21M (0.16%) | 595K image-caption pairs | ~4 hours | ~$64 |
| Stage 2: Instruction | 13,021M (97.7%) | 158K visual conversations | ~20 hours | ~$320 |
| Total | | | ~24 hours | ~$384 to train a competitive VLM (given pretrained CLIP + LLM) |
Grounding means connecting words to specific regions in the image. When a VLM says "the red car on the left," grounding means it can also point to where that car is. This requires the model to output spatial coordinates, not just text.
Approaches include: outputting bounding box coordinates as text tokens (e.g., "[0.2, 0.3, 0.5, 0.7]"), using special location tokens, or predicting segmentation masks. Models like Kosmos-2 and Shikra showed VLMs can be trained to both describe and locate objects.
| Level | Output | Example Task | Difficulty |
|---|---|---|---|
| Pointing | Single (x, y) coordinate | "Point to the cat's nose" | Easiest — one point |
| Bounding box | (x1, y1, x2, y2) | "Draw a box around each person" | Medium — 4 numbers per object |
| Segmentation | Pixel-level mask | "Segment the dog from background" | Hardest — needs a mask decoder |
In coordinate-as-text models (Kosmos-2, Shikra, Qwen2.5-VL), the model literally generates bounding boxes as number tokens:
```text
# Input: "Where is the dog in this image?"
# Output:
The dog is on the left side of the image. <box>102, 384, 298, 571</box>
# Coordinates: [x1, y1, x2, y2] normalized to [0, 1000]
# Real position: top-left (10.2%, 38.4%) to bottom-right (29.8%, 57.1%)
```
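Downstream code has to turn that generated text back into geometry. A minimal sketch, assuming the `<box>x1, y1, x2, y2</box>` format and the 0 to 1000 normalization shown in the example; real models each define their own tag and coordinate conventions, so treat the pattern and the 640×480 image size as placeholders.

```python
import re

# Matches the box format from the example output; other models use other tags.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_boxes(text):
    """Extract every (x1, y1, x2, y2) tuple of normalized integer coordinates."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


def to_pixels(box, image_width, image_height, scale=1000):
    """Convert coordinates normalized to [0, scale] into pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 / scale * image_width, y1 / scale * image_height,
            x2 / scale * image_width, y2 / scale * image_height)


output = "The dog is on the left. <box>102, 384, 298, 571</box>"
boxes = parse_boxes(output)
pixels = to_pixels(boxes[0], image_width=640, image_height=480)
print(boxes, [round(p, 1) for p in pixels])
```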
Documents are a special challenge for VLMs: they contain dense text, tables, charts, and layouts where spatial arrangement matters. "Revenue" next to "$5M" means something different from "Revenue" in a section header.
Key innovations: high-resolution encoding (documents need more pixels than photos), OCR-free reading (the vision encoder learns to read text directly), and layout-aware attention (understanding that rows and columns create relationships).
A standard letter-size page at 300 DPI is roughly 2,550 × 3,300 pixels. Resizing it to 336×336 shrinks it by roughly 8-10× per dimension, so a 12-point font (about 50 pixels tall at 300 DPI) ends up around 5 pixels tall, with individual strokes well under a pixel wide. That is effectively unreadable, which is why document understanding requires special resolution handling:
| Approach | How It Handles High-Res | Token Cost |
|---|---|---|
| Resize to 336×336 | Shrink entire page (text unreadable) | 576 tokens |
| AnyRes tiling | Split into 4-12 tiles at 336×336 each | 2,304-6,912 tokens |
| Dynamic cropping | Process only regions of interest at high-res | Variable (576-2,304) |
| Token pruning | Encode high-res but drop redundant tokens | ~1,000 (compressed from 5K+) |
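The tiling row can be sketched as a grid search: pick the tile layout whose aspect ratio best matches the page, subject to a tile budget, then multiply by 576 tokens per tile. This is a simplified illustration, not the selection rule of any specific model. `choose_tile_grid` and `anyres_token_cost` are hypothetical names, and ties are broken in favor of more tiles.

```python
def choose_tile_grid(width, height, max_tiles=12):
    """Pick (cols, rows) with cols * rows <= max_tiles closest to the aspect ratio."""
    target = width / height
    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Smaller aspect mismatch wins; on ties, prefer more tiles.
            key = (abs(cols / rows - target), -(cols * rows))
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


def anyres_token_cost(width, height, max_tiles=12, tokens_per_tile=576,
                      add_global_view=False):
    """Visual tokens for the chosen grid, optionally plus one low-res overview tile."""
    cols, rows = choose_tile_grid(width, height, max_tiles)
    n_tiles = cols * rows + (1 if add_global_view else 0)
    return n_tiles * tokens_per_tile


grid = choose_tile_grid(2550, 3300)
print(grid, anyres_token_cost(2550, 3300))
```

For the letter-size page this selects a 3×4 grid, the 12-tile upper end of the range in the table.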
A VLM must understand that spatial layout encodes meaning.
| Model | Approach | Strength |
|---|---|---|
| DocOwl | Layout-aware pretraining | Tables, forms |
| TextMonkey | High-res with token pruning | Dense text |
| Nougat | OCR-free academic PDF reading | Equations, LaTeX |
| GPT-4V / Gemini | Native multi-resolution | General documents |
VLMs are impressive, but they fail in predictable ways. Understanding these failures tells you a lot about the architecture:
| Failure Mode | Why It Happens | Example |
|---|---|---|
| Small object blindness | 14×14 patches are too coarse. A small object occupying one patch gets just one token of representation. | "How many buttons on the shirt?" → wrong count |
| Hallucination | The LLM's language prior overrides visual evidence. It describes objects that are statistically likely but not present. | Model describes a "vase of flowers" on a table that has none |
| Spatial confusion | Patch position embeddings encode rough grid position but not precise spatial relationships. | "Is the cup to the left or right of the plate?" → wrong answer |
| Video token explosion | Naive approach: N frames × 576 tokens = massive sequence. 10 frames = 5,760 tokens. | A 2-second video at 5 fps already exceeds a 4K-token context window |
| Long document overflow | OCR of a full page → thousands of text tokens on top of visual tokens. | 10-page PDF exceeds context even at low resolution |
The field is evolving rapidly. Today's frontier models handle video, multi-image reasoning, interleaved image-text, and even generate images alongside text. Here's where things stand:
| Model | Key Innovation | Scale | Strengths |
|---|---|---|---|
| LLaVA-NeXT | AnyRes dynamic resolution, SGLang-optimized | 7B-110B | Strong general VQA |
| InternVL 2.5 | Dynamic tiling, multi-scale ViT | 1B-108B | OCR, documents, multilingual |
| Qwen2.5-VL | NaViT-style packing, native video | 3B-72B | Video, grounding, agentic tasks |
| Molmo | Pointing as text coordinates, high-quality data | 1B-72B | Spatial grounding, UI understanding |
| Model | Key Innovation | Notable Capability |
|---|---|---|
| GPT-4o | Native multimodal from pretraining | Omni: text + image + audio + video in one model |
| Gemini 2.0 | 2M token context, native video understanding | Long video reasoning, interleaved generation |
| Claude Opus 4 | Strong spatial reasoning, chart understanding | Complex document analysis, precise counting |
The biggest architectural shift in 2024-2025 has been dynamic resolution. Instead of resizing every image to 336×336, modern VLMs handle arbitrary aspect ratios and resolutions.
This table shows why resolution management is the defining engineering challenge for production VLMs:
| Input Resolution | Patches (14px) | Visual Tokens | Attn Cost (relative) | Can Read? |
|---|---|---|---|---|
| 224 × 224 | 16 × 16 | 256 | 1× | No (too blurry) |
| 336 × 336 | 24 × 24 | 576 | 5× | Large text only |
| 672 × 672 | 48 × 48 | 2,304 | 81× | Most text |
| 1344 × 1344 | 96 × 96 | 9,216 | 1,296× | Fine print |
| AnyRes (9 tiles) | 9 × 24 × 24 | 5,184 | 410× | Yes (efficient) |
Extending VLMs to video is the next frontier, but it creates massive engineering challenges:
| Input | Frames | Visual Tokens | With Text | Challenge |
|---|---|---|---|---|
| Single image | 1 | 576 | ~626 | Manageable |
| 4 keyframes | 4 | 2,304 | ~2,354 | Needs longer context |
| 1 FPS × 30s video | 30 | 17,280 | ~17,330 | Exceeds most LLM context windows |
| 30 FPS × 10s video | 300 | 172,800 | ~172,850 | Impossible without compression |
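The table's arithmetic, plus the simplest mitigation (uniform frame subsampling to fit a token budget), fits in a few lines. A rough sketch: the 576 tokens per frame and ~50 text tokens come from the table, while the helper names and the 4,096-token budget are assumptions for illustration.

```python
def video_token_count(n_frames, tokens_per_frame=576, n_text_tokens=50):
    """Total sequence length when every frame is encoded independently."""
    return n_frames * tokens_per_frame + n_text_tokens


def max_frames_for_context(context_len, tokens_per_frame=576, n_text_tokens=50):
    """Largest number of full frames that fits alongside the text."""
    return max(0, (context_len - n_text_tokens) // tokens_per_frame)


def uniform_frame_indices(total_frames, n_keep):
    """Indices of n_keep frames spread evenly across the clip (bin centers)."""
    if n_keep <= 0:
        return []
    if n_keep >= total_frames:
        return list(range(total_frames))
    step = total_frames / n_keep
    return [int(i * step + step / 2) for i in range(n_keep)]


budget_frames = max_frames_for_context(4096)
kept = uniform_frame_indices(total_frames=300, n_keep=budget_frames)
print(video_token_count(300), budget_frames, kept)
```

Subsampling trades temporal coverage for sequence length; per-frame token compression (pooling or resampling) attacks the other factor in the product.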
VLMs connect to many other topics:
• CLIP & Contrastive Learning — The vision encoder that makes VLMs possible. CLIP pretraining aligns images and text before the VLM is even built.
• VLAs (Vision-Language-Action) — VLMs extended with action outputs for robotics. Same architecture, but the output includes motor commands.
• Transformers — Both the vision encoder (ViT) and the language model (LLaMA/Vicuna) are transformers. Understanding self-attention is key.
• Diffusion Models — Some VLMs (Emu, DALL-E 3) can generate images too, using diffusion decoders alongside language generation.
You now understand how machines learned to see and speak. The fusion of vision and language is one of the most consequential advances in AI history.
Real-world solution (as seen in InternVL 2.5 and Qwen2.5-VL):
1. Hierarchical tiling with saliency: Encode a low-res global view (576 tokens) + high-res crops only where a saliency detector flags detail (lesion candidates). A lightweight CNN identifies suspicious regions. Only 4-6 high-res tiles get encoded, giving ~3,000-3,500 visual tokens total.
2. Long-context LLM (Qwen2.5-72B supports 128K): Use a smaller model (7B) with RoPE-extended context to 32K. Summarize patient history into 2K tokens using a separate text-only pass. Visual tokens + summary fit easily.
3. Token merging: After the ViT encodes each tile, apply token merging (ToMe) — adjacent tokens with cosine similarity > 0.95 get averaged. Pathology backgrounds are highly redundant, so this achieves 2-3x compression with <1% quality loss on diagnostic tasks.
4. Memory budget: 7B model in AWQ 4-bit = 3.5GB. CLIP ViT-L in FP16 = 0.6GB. KV cache for 4K tokens at 4-bit in a 7B model ≈ 1.5GB. Activations ≈ 2GB peak. Total ≈ 8GB — well within 24GB with room for batch size >1.
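Step 3 above can be illustrated with a deliberately simplified merge rule: walk a 1-D token sequence and average each adjacent pair whose cosine similarity exceeds the threshold. This is a stand-in for the idea, not the ToMe algorithm itself (which uses bipartite matching inside the transformer blocks), and it ignores vertical neighbors in the 2-D patch grid. The toy 2-D vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_adjacent_tokens(tokens, threshold=0.95):
    """Greedily average neighboring tokens whose similarity exceeds threshold."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and cosine_similarity(tokens[i], tokens[i + 1]) > threshold:
            merged.append([(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])])
            i += 2  # both tokens consumed by the merge
        else:
            merged.append(list(tokens[i]))
            i += 1
    return merged


tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 0.98], [1.0, 1.0]]
merged = merge_adjacent_tokens(tokens)
print(f"{len(tokens)} tokens -> {len(merged)} tokens")
```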
The structural insight: a VLA is literally a VLM where the output vocabulary includes discretized robot actions (joint angles, gripper commands). The same projection bridge, same fusion mechanism, same autoregressive generation — just targeting a different output space. RT-2 proved this by fine-tuning a VLM to output action tokens with zero architectural changes.
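The "actions as vocabulary" idea comes down to binning: each continuous action dimension is clipped to a range and mapped to one of a fixed number of integer tokens. A minimal sketch, assuming 256 uniform bins and a [-1, 1] range; both values and the function names are illustrative assumptions, not a specification of any particular robot policy.

```python
def action_to_token(value, low, high, n_bins=256):
    """Map a continuous value in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)  # keep value == high in the last bin


def token_to_action(token, low, high, n_bins=256):
    """Map a bin index back to the center of its bin."""
    return low + (token + 0.5) * (high - low) / n_bins


gripper = 0.0
token = action_to_token(gripper, low=-1.0, high=1.0)
recovered = token_to_action(token, low=-1.0, high=1.0)
print(token, recovered)
```

The round trip loses at most half a bin width, which bounds the precision of any action the model can express.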
What other domains could you target by simply changing what the LLM generates? (Hint: think music notation, CAD coordinates, chemical structures...)