The Complete Beginner's Path

Understand Vision-Language Models

How modern AI learned to see and speak at the same time — fusing pixels and words into a single reasoning engine that can describe, answer, and understand images.

Prerequisites: Basic transformer intuition + Curiosity about multimodal AI. That's it.
10 Chapters · 8+ Simulations · 0 Jargon Assumed

Chapter 0: Two Modalities, One Model

Humans effortlessly combine what they see with what they read. You look at a chart and describe trends. You glance at a photo and answer questions about it. For decades, vision and language lived in separate AI systems. A Vision-Language Model (VLM) merges them into a single architecture that processes images and text together.

The core challenge: images are grids of pixels (spatial, continuous) while text is a sequence of tokens (discrete, symbolic). A VLM must bridge these two very different representations and allow them to interact so the model can reason about visual content using language.

The big picture: A VLM = a vision encoder (eyes) + a language model (brain) + a bridge that translates between them. The magic is in how the bridge aligns pixel features with word meanings.

The Data Flow at a Glance

Here's the entire VLM pipeline in concrete tensor shapes (we'll unpack each box in later chapters):

Image
[3, 336, 336] — RGB pixels
↓ CLIP ViT-L/14 (304M params, frozen)
Visual Features
[576, 1024] — 576 patch tokens
↓ MLP Projection (~21M params)
Projected Tokens
[576, 4096] — in LLM space
↓ concat with text tokens [~50, 4096]
LLM (hidden size 4096, as in Vicuna-7B; Vicuna-13B uses 5120)
[~626, 4096] → autoregressive generation
Two Modalities Merging

Watch image patches and text tokens flow into a shared representation space. The teal nodes are visual, orange nodes are textual.

Check: What is the core challenge of building a VLM?

Chapter 1: Vision Encoders — Teaching AI to See

The vision encoder converts a raw image into a sequence of feature vectors. The dominant approach is ViT (Vision Transformer): chop the image into fixed-size patches (e.g., 14×14 pixels), flatten each patch into a vector, map it to the model dimension with a learned linear projection, add positional embeddings, and run the resulting tokens through a transformer.

After encoding, an image becomes a grid of high-dimensional vectors — essentially the same shape as a sequence of text tokens. This structural similarity is what makes vision-language fusion possible.

patches = split(image, patch_size)  →  z = ViT(patches + pos_embed)
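A minimal sketch of the split step, assuming a single-channel image stored as a list of rows. The `patchify` helper and the toy 4×4 image are illustrative names, not part of any ViT library; a real encoder would follow this with the learned projection and positional embeddings.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row into a single vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel values are 0..15 (hypothetical data).
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
print(len(toy_patches), "patches, each of length", len(toy_patches[0]))
```

Each returned vector is one patch, and the list order is the token order the transformer will see.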

Tensor Shapes: CLIP ViT-L/14 Walkthrough

Let's trace the exact shapes through the most common vision encoder. CLIP ViT-L/14 takes a 336×336 image and uses 14×14 pixel patches:

Input Image
[1, 3, 336, 336] — RGB, 336px square
↓ split into 14×14 patches
Patch Embedding
(336/14) × (336/14) = 24 × 24 = 576 patches
Each patch: 14×14×3 = 588 values (196 pixels × 3 channels) → projected to d=1024
Shape: [1, 576, 1024]
↓ + positional embeddings [576, 1024]
Transformer (24 layers)
Self-attention over 576 tokens of dim 1024
Output: [1, 576, 1024] — 576 visual tokens
The resolution scaling problem: At 336×336, we get 576 tokens. Double the resolution to 672×672 and we get (672/14)^2 = 2,304 tokens. Self-attention cost scales quadratically with sequence length: 576^2 ≈ 331K token pairs vs 2304^2 ≈ 5.3M token pairs, a 16× increase. This is why dynamic resolution (AnyRes in LLaVA-1.6, NaViT-style packing in Qwen2-VL) is critical for handling high-res images without blowing up compute.
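The arithmetic above can be checked in a few lines. This is a sketch that assumes square images and uses the number of token pairs as a stand-in for attention cost; the helper names are hypothetical.

```python
def num_patch_tokens(image_size, patch_size=14):
    """Number of patch tokens for a square image."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(n_tokens):
    """Query-key pairs in full self-attention (quadratic in length)."""
    return n_tokens * n_tokens


base = attention_pairs(num_patch_tokens(336))
for size in (224, 336, 672):
    n = num_patch_tokens(size)
    ratio = attention_pairs(n) / base
    print(f"{size}px: {n} tokens, {ratio:.2f}x the attention pairs of 336px")
```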
Image Patchification

See how an image is sliced into patches. Each patch becomes a token for the transformer. Adjust patch size to see the trade-off between resolution and sequence length.

Key trade-off: Smaller patches = more tokens = more detail but quadratic attention cost. Larger patches = fewer tokens = faster but lose fine detail. Most VLMs use 14×14 patches with a 224×224 or 336×336 image.
Encoder | Params | Used By | Patch Size | Output Dim
CLIP ViT-L/14 | 304M | LLaVA, many VLMs | 14×14 | 1024
SigLIP (SO400M) | 400M | PaliGemma, newer VLMs | 14×14 | 1152
EVA-CLIP (ViT-G) | 1.0B | InternVL | 14×14 | 1408
DINOv2 (ViT-L) | 304M | Research models | 14×14 | 1024
Check: What does a Vision Transformer do to an image first?
🔗 Pattern Recognition
The Vision Encoder Is Just CLIP's Image Side
This Lesson (VLM)
CLIP ViT-L/14 takes image → [576, 1024] patch tokens. The spatial structure (24×24 grid) encodes position.
CLIP & Contrastive Learning
Same encoder, but CLIP reduces the sequence to one global embedding (taken from the CLS token) for image-text matching. → CLIP lesson

The VLM keeps all 576 tokens because it needs spatial detail. CLIP discards spatial info because matching only needs global semantics. Same backbone, different output usage. This is why VLMs typically take patch features from the penultimate layer rather than the final global embedding.

Where else do you see the same encoder used with different amounts of its output? (Hint: think about how the Transformer lesson shows hidden states at every layer.)

Checkpoint — Before you move on
A 672×672 image with 14×14 patches produces how many tokens? What dimension is each token? If you doubled the patch size to 28×28, what happens to token count and per-token information?
Model Answer

672/14 = 48 patches per side, so 48×48 = 2,304 tokens, each of dimension 1024. That's 4× more tokens than 336×336 (576 tokens), and attention cost scales quadratically: 2304^2 / 576^2 = 16× more compute.

If you double patch size to 28×28: you get (672/28)^2 = 24×24 = 576 tokens (same as the standard 336px case). Each patch now covers 28×28×3 = 2,352 values instead of 588, so 4× more raw input is compressed into the same 1024-dim vector. You lose fine-grained detail but save massive compute. This is the fundamental resolution/compute trade-off in VLMs.

Chapter 2: The Projection Bridge

The vision encoder outputs feature vectors in its embedding space. The language model expects tokens in its embedding space. These spaces don't match! We need a projection layer to translate between them.

The simplest bridge is a linear projection (a single matrix multiply). More sophisticated ones use an MLP (two linear layers with an activation) or a cross-attention resampler (like Flamingo's Perceiver). The projection doesn't just resize — it aligns visual features with the language model's semantic space.

LLaVA-1.5 Projection: Exact Shapes

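LLaVA-1.5 uses a 2-layer MLP to project from CLIP's 1024-dim space into the LLM's hidden size. The shapes below use 4096 (the hidden size of Vicuna-7B); Vicuna-13B's hidden size is 5120, so substitute 5120 for 4096 in that case: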

Vision Encoder Output
[576, 1024] — 576 tokens, each 1024-dim
Linear(1024, 4096)
W1: [1024, 4096] + bias [4096] = 4.2M params
↓ GELU activation
Linear(4096, 4096)
W2: [4096, 4096] + bias [4096] = 16.8M params
Projected Tokens
[576, 4096] — now in LLM's embedding space
Parameter count: At these dimensions the entire projection MLP is ~21M parameters (about 31M when the LLM hidden size is 5120, as in Vicuna-13B). Either way it is well under 1% of the LLM's parameters. This tiny bridge carries all the weight of aligning vision and language, and it can stay tiny because both CLIP and the LLM are already strong: the bridge only needs to learn a translation, not understanding.

Projection Strategies Compared

Projection Type | Architecture | Params | Token Count | Used By
Linear | Single W: [d_vis, d_llm] | ~4M | Same as input (576) | LLaVA v1
MLP (2-layer) | Linear + GELU + Linear | ~21M | Same as input (576) | LLaVA-1.5
Perceiver Resampler | Cross-attention with K learned queries | ~100M | K (e.g., 64 or 256) | Flamingo, Qwen-VL
C-Abstractor | Conv layers + adaptive pooling | ~50M | Reduced (e.g., 144) | Honeybee
The Perceiver trade-off: A Perceiver compresses 576 vision tokens into just 64 fixed queries. This makes the LLM's job easier (64 tokens instead of 576 to attend to), but the compression is lossy — fine spatial details get smoothed out. LLaVA's approach keeps all 576 tokens at the cost of more attention compute. For OCR and spatial grounding, keeping all tokens wins. For general VQA, compression is usually fine.
Why not just concatenate? Vision features live in a completely different vector space than word embeddings. Without projection, the language model sees garbage. The projection learns to map "this patch has a furry texture" into the same region of embedding space as the word "cat."
Projection Alignment

Points in vision space get projected to language space. Drag the slider to rotate the projection and see alignment change.

Check: What does the projection layer do?
🔨 Derivation: Projection MLP Parameter Count

LLaVA-1.5 uses a 2-layer MLP to bridge CLIP ViT-L (output dim = 1024) to the LLM's hidden size d. For Vicuna-7B, d = 4096; for Vicuna-13B, d = 5120. The architecture is: Linear(1024, d) → GELU → Linear(d, d).

Your task: Calculate the total parameter count for d = 4096. Why must the projector's output dimension equal the LLM's hidden size, and what does the count become for d = 5120?

A Linear(in, out) layer has in × out weight parameters + out bias parameters. So Linear(1024, 4096) = 1024×4096 + 4096 = ?
Layer 1: 1024×4096 + 4096. Layer 2: 4096×4096 + 4096. Sum them. Note that GELU has zero learnable parameters.
Think about the LLM's embedding table. The LLM's word embeddings project vocab tokens into what dimension? The visual tokens need to land in the same space as embedded words.

Full derivation:

Layer 1: Linear(1024, 4096) = 1024 × 4096 + 4096 = 4,194,304 + 4,096 = 4,198,400 params

Layer 2: Linear(4096, 4096) = 4096 × 4096 + 4096 = 16,777,216 + 4,096 = 16,781,312 params

Total: 4,198,400 + 16,781,312 = 20,979,712 ≈ 21M parameters

The key insight: The projector's output dimension must equal the LLM's hidden size, because visual tokens are concatenated with text embeddings at the input to the transformer and have to live in the same space. That is 4096 for Vicuna-7B. Vicuna-13B's hidden size is 5120 (its FFN intermediate width is larger still, 13,824), so the 13B projector is Linear(1024, 5120) → GELU → Linear(5120, 5120): 5,248,000 + 26,219,520 = 31,467,520 ≈ 31.5M parameters. Either way, the bridge is well under 1% of the LLM's parameters, yet it carries ALL the alignment burden.
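The Linear(in, out) counting rule from this derivation can be checked with a short sketch. The helper names are hypothetical; 4096 is the width used in the walkthrough, and 5120 is the Vicuna-13B hidden size quoted in this exercise.

```python
def linear_params(d_in, d_out, bias=True):
    """Weights plus optional bias of one Linear(d_in, d_out) layer."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_projector_params(vision_dim, llm_dim):
    """Linear(vision_dim, llm_dim) -> GELU -> Linear(llm_dim, llm_dim).

    GELU contributes no learnable parameters.
    """
    return linear_params(vision_dim, llm_dim) + linear_params(llm_dim, llm_dim)


for llm_dim in (4096, 5120):
    total = mlp_projector_params(1024, llm_dim)
    print(f"llm_dim={llm_dim}: {total:,} parameters")
```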

💻 Build It: Implement the Vision-to-Language Projection
You've learned that LLaVA's bridge is a 2-layer MLP with GELU activation. Now implement it. The module takes [batch, n_patches, vision_dim] and outputs [batch, n_patches, llm_dim]. There is no token count reduction: every vision token maps 1:1 to an LLM input token.
signature
class VisionProjection(nn.Module):
    """
    Projects vision encoder output to LLM embedding space.

    Args:
        vision_dim: dimension of vision encoder output (e.g., 1024)
        llm_dim: dimension of LLM embedding space (e.g., 4096)

    Forward:
        Input:  [batch, n_patches, vision_dim]
        Output: [batch, n_patches, llm_dim]
    """
    def __init__(self, vision_dim: int, llm_dim: int): ...
    def forward(self, x: torch.Tensor) -> torch.Tensor: ...
Test case
proj = VisionProjection(1024, 4096)
x = torch.randn(2, 576, 1024) # batch=2, 576 patches, CLIP dim
out = proj(x)
assert out.shape == (2, 576, 4096) # same patch count, LLM dim
print(sum(p.numel() for p in proj.parameters())) # ~20.98M
LLaVA-1.5 uses llm_dim as the intermediate dimension too. So both layers output llm_dim. The architecture is: Linear(vision_dim, llm_dim) → GELU() → Linear(llm_dim, llm_dim).
python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # 1024 -> 4096
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),     # 4096 -> 4096
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, n_patches, vision_dim]
        return self.mlp(x)  # [batch, n_patches, llm_dim]
Bonus challenge: Modify this to include a Perceiver-style resampler that compresses 576 tokens down to 64 learned queries using cross-attention. What new parameters do you need?

Chapter 3: Fusion Strategies

Once vision tokens and text tokens are in the same space, how do we let them interact? There are several architectural strategies, each with different trade-offs:

Strategy | How It Works | Example
Early Fusion | Concatenate vision + text tokens, feed to one transformer | LLaVA
Cross-Attention | Text attends to vision via extra cross-attention layers | Flamingo
Perceiver Resampler | Learned queries compress vision tokens to fixed count | Flamingo, Qwen-VL
Interleaved | Vision tokens inserted at corresponding text positions | Fuyu, Gemini
Fusion Architecture Comparison

Toggle between strategies to see how vision (teal) and text (orange) tokens interact.

Early fusion is simplest: Just prepend vision tokens to the text sequence and let self-attention handle everything. The LLM's existing attention mechanism naturally learns vision-language interactions. This is why LLaVA works so well with so little architectural change.

What Happens Inside: The Attention Pattern

In early fusion, the LLM's self-attention sees a sequence like [v1, v2, ..., v576, t1, t2, ..., t50]. Because it's causal attention:

Visual tokens (positions 1-576):
• Attend to themselves and to earlier visual tokens (causal self-attention within the image)
• Cannot see any text tokens (causal mask)
• This means their representations don't depend on the question: the same image yields the same visual states whatever you ask
Text tokens (positions 577+):
• Attend to ALL vision tokens
• Attend to prior text tokens
• This is where cross-modal reasoning happens: the word "cat" attending to furry-texture patches
An important subtlety: In LLaVA, the visual tokens never see the question. The model can only ground its visual processing through the text side. This is a limitation: the vision encoder processes the image the same way regardless of whether you ask "What color is the car?" or "How many people are there?" Instruction-aware designs such as InstructBLIP's Q-Former address this by feeding the question into the module that extracts visual features. Flamingo-style cross-attention does not: there, too, text attends to vision and not the other way around.
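To make the causal pattern concrete, here is a small sketch that builds the attention mask for a toy sequence of 4 vision tokens followed by 3 text tokens. The sizes and the `causal_mask` helper are illustrative assumptions, not LLaVA code.

```python
def causal_mask(n_vision, n_text):
    """mask[q][k] is True when query position q may attend to key position k."""
    n = n_vision + n_text
    return [[k <= q for k in range(n)] for q in range(n)]


n_vision, n_text = 4, 3          # toy sizes instead of 576 and ~50
mask = causal_mask(n_vision, n_text)

last_vision = n_vision - 1
first_text = n_vision

# No vision token can see any text token, because text comes later.
vision_sees_text = any(mask[last_vision][first_text:])
# Every text token can see every vision token.
text_sees_all_vision = all(mask[first_text][:n_vision])

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print("vision sees text:", vision_sees_text)
print("text sees all vision:", text_sees_all_vision)
```

The lower-left block of the printed grid (text rows, vision columns) is fully filled, while the upper-right block is empty.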

Architecture Comparison: Who Builds the Bridge Differently?

Each approach makes fundamentally different engineering choices about what's frozen, what's trained, and how vision enters the LLM:

Model | Bridge Type | Vision Encoder | LLM | Vision Tokens to LLM | Total Params
LLaVA-1.5 | MLP (2-layer) | CLIP ViT-L (frozen) | Vicuna-13B (Stage 2: tuned) | 576 (all patches) | ~13.4B
Flamingo | Gated cross-attn | NFNet (frozen) | Chinchilla (frozen) | 64 (resampled) | ~80B
Qwen-VL | Cross-attn resampler | ViT-G (trained) | Qwen-7B (tuned) | 256 (compressed) | ~9.6B
InternVL | QLLaMA adapter | EVA-CLIP-G (tuned) | InternLM-20B (tuned) | 256 (resampled) | ~20B
The token count trade-off: LLaVA sends all 576 vision tokens to the LLM — maximum information but expensive attention. Flamingo compresses to 64 learned queries — faster but lossy. Qwen-VL uses 256 as a middle ground. Fewer tokens = faster inference but more information loss, especially for fine-grained tasks like OCR.
Check: In early fusion, how do vision and text tokens interact?
⚔ Adversarial: Your VLM hallucinates objects not in the image
You fine-tuned a VLM using LLaVA's architecture. The base text-only LLM (Vicuna-13B) almost never hallucinates facts. But your VLM confidently describes objects that aren't in the image — "a vase of flowers on the table" when the table is empty. The vision encoder (CLIP) correctly encodes the image. What architectural property causes this?
🔨 Derivation: Cross-Attention Dimensions in Flamingo

Flamingo inserts cross-attention layers into its LLM (Chinchilla-70B, d_model = 8192). The vision encoder outputs 64 tokens of dimension 1024 (after a Perceiver resampler). Cross-attention uses Q from text (d = 8192), K and V from vision (d = 1024), with num_heads = 64 and head_dim = 128.

Your task: Derive the projection dimensions for W_Q, W_K, W_V, and W_O. What is the total parameter cost of one cross-attention layer?

Q comes from the text stream (d = 8192). K and V come from the vision stream (d = 1024). All project to num_heads × head_dim = 64 × 128 = 8192 internal dim. So W_Q: [8192, 8192], but W_K and W_V: [1024, ?]
W_K: [1024, 64×128] = [1024, 8192]. Same for W_V. The output projection W_O maps attention output back to text dim: [8192, 8192].
W_Q: 8192×8192. W_K: 1024×8192. W_V: 1024×8192. W_O: 8192×8192. Sum them up.

Full derivation:

W_Q: [d_text, n_heads × head_dim] = [8192, 8192] = 67.1M params

W_K: [d_vision, n_heads × head_dim] = [1024, 8192] = 8.4M params

W_V: [d_vision, n_heads × head_dim] = [1024, 8192] = 8.4M params

W_O: [n_heads × head_dim, d_text] = [8192, 8192] = 67.1M params

Total per cross-attention layer: 67.1 + 8.4 + 8.4 + 67.1 ≈ 151M parameters

The key insight: W_K and W_V are much smaller (8.4M each) because they project from the 1024-dim vision space, not the 8192-dim text space. Note that all four matrices are new parameters: Flamingo inserts its gated cross-attention blocks between the frozen LLM layers rather than reusing the LLM's self-attention weights. The per-layer cost is therefore dominated by the text-side projections (Q and O), and a narrower vision feature only shrinks the K/V share.
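The derivation can be reproduced with a short sketch. It counts weight matrices only (no biases, gates, or layer norms, which this exercise also ignores), and `cross_attention_params` is a hypothetical helper.

```python
def cross_attention_params(d_text, d_vision, n_heads, head_dim):
    """Weight counts for the four projections of one cross-attention layer."""
    inner = n_heads * head_dim
    return {
        "W_Q": d_text * inner,     # queries come from the text stream
        "W_K": d_vision * inner,   # keys come from the vision stream
        "W_V": d_vision * inner,   # values come from the vision stream
        "W_O": inner * d_text,     # output maps back to the text width
    }


counts = cross_attention_params(d_text=8192, d_vision=1024,
                                n_heads=64, head_dim=128)
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n / 1e6:.1f}M")
print(f"total: {total / 1e6:.1f}M")
```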

Chapter 4: Visual Instruction Tuning

Having the right architecture isn't enough — the model needs to learn when and how to use visual information. Visual instruction tuning trains the model on (image, instruction, response) triples so it can follow natural language commands about images.

The key innovation: instead of training on just captions ("A cat on a sofa"), you train on diverse tasks: "What color is the cat?", "Count the cushions", "Write a poem about this scene", "Is this image safe for children?" This teaches the model to be a general-purpose visual assistant.

Why Captioning Alone Isn't Enough

A model trained only on image-caption pairs learns to describe, but it can't reason. It can say "a cat on a sofa" but can't answer "Is the sofa large enough for two people?" because captioning datasets never ask questions that require spatial reasoning, counting, comparison, or inference about intent.

Training Data Type | What the Model Learns | What It Can't Do
Captions only | "A dog playing in the park" | Answer questions, count objects, reason spatially
VQA datasets | Answer factual questions | Free-form conversation, creative tasks
Instruction tuning | All of the above + follow instructions | Out-of-distribution tasks (but fewer gaps)
Input
[Image] + "How many people are in this photo?"
VLM Processes
Vision encoder + projection + LLM forward pass
Output
"There are three people in this photo, standing near..."
Data generation trick: LLaVA generated its instruction-tuning data using GPT-4! The authors fed image captions + bounding boxes to the text-only GPT-4 and asked it to generate question-answer pairs. This "language-only teacher" approach created about 158K high-quality training conversations (released under the name LLaVA-Instruct-150K).

The Training Signal: What Does the Loss Look Like?

During instruction tuning, the model receives a sequence like:

format
[IMG_1] [IMG_2] ... [IMG_576] [SYS] You are a helpful assistant. [USR] What animal is in this image? [ASST]

# Loss is computed ONLY on the assistant's response tokens:
"This image shows a golden retriever lying on grass..."

# Vision tokens and user tokens: loss = 0 (no gradient)
# Assistant tokens: standard next-token prediction loss
Causal masking matters: The model can attend from text to vision tokens (allowing it to "look" at the image), but not from vision tokens to future text tokens. Vision tokens attend only to each other and to prior context. This ensures the visual features are processed before the text response is generated.
Instruction Diversity

See the variety of tasks a VLM must handle. Click to cycle through instruction types.

Check: What makes visual instruction tuning different from captioning?
🔨 Derivation: The Visual Instruction Tuning Loss

The input sequence is: [v_1 ... v_576][SYS][USR question][ASST response tokens r_1 ... r_T]. Standard language model training computes cross-entropy loss on every token. Visual instruction tuning modifies this.

Your task: Write out the loss function. Which tokens contribute to the loss? Which tokens have their loss masked to zero? Why is this masking critical?

If you compute loss on vision tokens, you're asking the model to predict the next image patch, a nonsensical task for a language model. If you compute loss on the user's question, you're training the model to memorize questions rather than answer them.
Standard LM loss: $L = -\frac{1}{N} \sum_{i=1}^{N} \log P(t_i \mid t_{<i})$. Now add a mask $m_i \in \{0, 1\}$ that's 1 only for assistant response tokens.
Divide by the number of unmasked tokens (just the response length T), not the full sequence length. Otherwise long prompts would make the gradient vanishingly small.

Full derivation:

Let the full sequence be $x = [v_1, \dots, v_{576}, s_1, \dots, s_K, r_1, \dots, r_T]$ where s = system+user tokens, r = response tokens.

Define mask: $m_i = 1$ if token i is a response token, 0 otherwise.

Loss: $L = -\frac{1}{T} \sum_i m_i \cdot \log P_\theta(x_i \mid x_{<i})$

Concretely: $m_i = 0$ for all 576 vision tokens, 0 for all system/user tokens, and 1 only for $r_1$ through $r_T$.

The key insight: The model sees all tokens during the forward pass (vision tokens condition the response via attention), but only learns to generate the response. This is what separates instruction tuning from pretraining: the model learns to use visual context without trying to predict it. If you accidentally trained on vision tokens, the model would waste capacity trying to predict patch embeddings — a regression task the autoregressive head isn't designed for.
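A minimal sketch of the masked loss, assuming the per-token log-probabilities have already been computed by the model. The toy probabilities and the `masked_nll` helper are made up for illustration; real training code would operate on logits in a tensor library.

```python
import math


def masked_nll(token_log_probs, mask):
    """Mean negative log-likelihood over positions where mask is 1."""
    if len(token_log_probs) != len(mask):
        raise ValueError("log-probs and mask must have the same length")
    n_response = sum(mask)
    if n_response == 0:
        raise ValueError("mask selects no response tokens")
    total = sum(m * lp for lp, m in zip(token_log_probs, mask))
    return -total / n_response   # divide by T, not the full sequence length


# Toy sequence: 3 vision tokens, 2 prompt tokens, 2 response tokens.
probs = (0.01, 0.01, 0.01, 0.2, 0.2, 0.5, 0.25)
log_probs = [math.log(p) for p in probs]
mask = [0, 0, 0, 0, 0, 1, 1]

loss = masked_nll(log_probs, mask)
print(f"masked loss over {sum(mask)} response tokens: {loss:.4f}")
```

The poorly predicted vision and prompt positions do not affect the result; only the two response tokens do.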

Chapter 5: The LLaVA Architecture

LLaVA (Large Language and Vision Assistant) showed that a surprisingly simple recipe works remarkably well: take a pretrained CLIP vision encoder, a pretrained LLM (like Vicuna/LLaMA), and connect them with a small learned projection (a single linear layer in the original LLaVA, a 2-layer MLP in LLaVA-1.5). That's it.

The image goes through CLIP ViT-L/14 → produces 576 vision tokens (a 24×24 grid from the penultimate layer) → each is projected to the LLM's dimension → prepended to the text tokens → the LLM generates the response autoregressively.

Why the penultimate layer? CLIP's contrastive loss is applied only to a single global embedding derived from the final layer's CLS token, so the final layer's patch tokens are shaped mainly to serve that global summary. The penultimate layer's 576 patch tokens tend to retain more local detail, which the LLM needs for questions like "What's in the top-left corner?"
CLIP ViT-L/14
336×336 image → 576 visual tokens (d=1024)
MLP Projector
Linear(1024, 4096) + GELU + Linear(4096, 4096)
Vicuna-13B
[576 visual tokens] + [text tokens] → autoregressive output
LLaVA Forward Pass

Watch tokens flow through the architecture. Teal = vision, orange = text, green = output.

Why does simplicity win? Both CLIP and the LLM are already well-trained. CLIP learned to align images with text during pretraining. The LLM already knows language. The projection just needs to learn the "translation" between their embedding spaces — a much smaller problem than training from scratch.

The Complete Input to the LLM

Let's trace exactly what the LLM receives. For a question like "What is in this image?":

Visual Tokens
576 tokens × 4096 dims = [576, 4096]
+
System + User Text
~50 tokens × 4096 dims = [50, 4096]
(system prompt + "What is in this image?")
↓ concatenate
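LLM Input Sequence
~626 tokens × 4096 dims
LLM: d_model = 4096 with 32 layers and 32 heads in Vicuna-7B; Vicuna-13B has 40 layers, d_model = 5120, 40 heads (so every 4096 above becomes 5120)
Self-attention over all 626 tokens at every layer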
Context budget: 576 of the ~626 input tokens are visual — 92% of the LLM's input is image. This means vision dominates the KV cache and attention cost. For multi-image or video inputs, this ratio gets even more extreme, which is why token compression matters so much.
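A quick way to see how the ratio grows with more images, assuming every image costs 576 tokens and the text prompt stays at about 50 tokens (both figures from the trace above; `visual_fraction` is a hypothetical helper):

```python
def visual_fraction(n_images, text_tokens=50, tokens_per_image=576):
    """Share of the LLM input sequence taken up by vision tokens."""
    visual = n_images * tokens_per_image
    return visual / (visual + text_tokens)


for n_images in (1, 2, 4, 8):
    total = n_images * 576 + 50
    share = visual_fraction(n_images)
    print(f"{n_images} image(s): {total} tokens, {share:.1%} visual")
```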

Inference Timing Breakdown

Where does the time go when a VLM answers a question about an image?

Stage | Time | What Happens
Image preprocessing | ~5 ms | Resize to 336×336, normalize with CLIP's per-channel mean and std
CLIP forward pass | ~15 ms | 24 transformer layers on 576 tokens (304M params)
MLP projection | ~0.1 ms | Two linear layers, 21M params; trivial cost
LLM prefill | ~30 ms | Process all 626 tokens through 40 layers in parallel
LLM decode | ~2,000 ms | ~20 ms/token × 100 tokens of output (sequential!)
Total | ~2.05 s | Autoregressive decode dominates by 40×
The bottleneck is decoding. Vision encoding and projection together are ~15 ms. The LLM's autoregressive token generation is ~2 seconds. This is why VLM speedups mostly target the LLM side (speculative decoding, quantization) rather than the vision encoder.
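The timing arithmetic, reproduced as a sketch. The stage times are the illustrative figures quoted in this section, not measurements, and the variable names are hypothetical.

```python
# Per-stage latency in milliseconds (illustrative figures from this section).
stage_ms = {
    "preprocess": 5.0,
    "clip_forward": 15.0,
    "projection": 0.1,
    "llm_prefill": 30.0,
}
ms_per_output_token = 20.0
output_tokens = 100

decode_ms = ms_per_output_token * output_tokens   # sequential generation
total_ms = sum(stage_ms.values()) + decode_ms
everything_else_ms = total_ms - decode_ms

print(f"decode: {decode_ms:.0f} ms of {total_ms:.1f} ms total")
print(f"decode share: {decode_ms / total_ms:.1%}")
print(f"decode vs everything else: {decode_ms / everything_else_ms:.0f}x")
```

Because decode time scales with the number of generated tokens, a shorter answer shrinks the total far more than a faster vision encoder would.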
Check: How many new components does LLaVA add to connect vision and language?

Chapter 6: The Training Pipeline

VLMs are trained in stages, not all at once. Each stage has a different purpose and different parts of the model are frozen or unfrozen:

Stage | Data | What Trains | Purpose
1. Pretraining alignment | 595K image-caption pairs | Projection only | Align vision ↔ language spaces
2. Instruction tuning | 158K visual conversations | Projection + LLM | Teach instruction following

Parameter Budget: What Actually Trains?

Component | Params | Stage 1 | Stage 2
CLIP ViT-L/14 | 304M | Frozen | Frozen
MLP Projector | ~21M | Training | Training
Vicuna-13B | 13,000M | Frozen | Training
Stage 1 trainable | 21M / 13,325M = 0.16% of total
Stage 2 trainable | 13,021M / 13,325M = 97.7% of total
Stage 1 is fast: Only the projection layer trains (0.16% of parameters) — a few hours on 8 A100 GPUs. Stage 2 unfreezes the LLM and is the expensive part: ~20 hours on 8 A100s. The vision encoder stays frozen in both stages, preserving CLIP's learned visual representations.
Training Stage Visualization

Toggle stages to see which components are frozen (blue/frozen) vs trainable (green/active).

Scaling insight: LLaVA-1.5 improved by upgrading to a 2-layer MLP projector and adding academic VQA datasets in Stage 2. These small changes gave major accuracy boosts, suggesting that data quality and projection design can matter as much as model size.

Compute Budget Breakdown

Stage | Trainable Params | Data | Wall-Clock Time (8×A100) | Cost (~$2/GPU-hr)
Stage 1: Alignment | 21M (0.16%) | 595K image-caption pairs | ~4 hours | ~$64
Stage 2: Instruction | 13,021M (97.7%) | 158K visual conversations | ~20 hours | ~$320
Total | ~$384 to train a competitive VLM (given pretrained CLIP + LLM)
This is remarkably cheap. The pretrained CLIP ViT-L and Vicuna-13B represent orders of magnitude more compute than that. LLaVA's contribution is just the $384 bridge between them. This is why the "connect pretrained specialists" paradigm dominates VLM design: training vision and language from scratch would cost 1000× more.
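The cost column follows from wall-clock hours × number of GPUs × price per GPU-hour. A sketch of that arithmetic, using the figures above (the $2/GPU-hr rate is the table's own assumption, and `stage_cost` is a hypothetical helper):

```python
def stage_cost(wall_clock_hours, n_gpus=8, usd_per_gpu_hour=2.0):
    """Rental cost of one training stage in US dollars."""
    gpu_hours = wall_clock_hours * n_gpus
    return gpu_hours * usd_per_gpu_hour


stage_hours = {"Stage 1: Alignment": 4, "Stage 2: Instruction": 20}
costs = {name: stage_cost(hours) for name, hours in stage_hours.items()}
total_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name}: ${cost:.0f}")
print(f"Total: ${total_cost:.0f}")
```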
Check: In Stage 1 of LLaVA training, what is trainable?
💥 Break-It Lab: What Dies When You Remove VLM Components?
A working VLM has three critical design decisions: freeze the vision encoder, include a projection layer, and apply instruction tuning after alignment. Toggle each off to see what breaks.
Unfreeze Vision Encoder
Catastrophic forgetting: CLIP was pretrained on hundreds of millions of image-text pairs to align images with text. If you unfreeze it during VLM fine-tuning with only 595K examples, gradient updates can degrade these learned representations. The encoder "forgets" how to encode images meaningfully. Loss initially decreases (overfitting to small data), then generalization collapses. The encoder becomes specialized for captioning but loses zero-shot generality.
Remove Projection Layer
Dimension mismatch crash: CLIP outputs 1024-dim vectors. The LLM expects 4096-dim inputs. Without the projection, you get a literal shape error — the tensors can't be concatenated. Even if you pad with zeros to match dimensions, the vision features live in a completely unrelated vector space. The LLM would interpret them as noise — like feeding random embeddings. No alignment, no understanding.
Skip Instruction Tuning
Caption-only responses: After Stage 1 (alignment only), the model can describe images but can't follow instructions. Ask "How many people?" and it responds "A group of people standing in a park" — a caption, not an answer. It hasn't learned the instruction-following format. The projection is aligned, but the LLM hasn't adapted its behavior to use visual context for diverse tasks.

Chapter 7: Grounding & Spatial Understanding

Grounding means connecting words to specific regions in the image. When a VLM says "the red car on the left," grounding means it can also point to where that car is. This requires the model to output spatial coordinates, not just text.

Approaches include: outputting bounding box coordinates as text tokens (e.g., "[0.2, 0.3, 0.5, 0.7]"), using special location tokens, or predicting segmentation masks. Models like Kosmos-2 and Shikra showed VLMs can be trained to both describe and locate objects.

Three Levels of Spatial Understanding

Level | Output | Example Task | Difficulty
Pointing | Single (x, y) coordinate | "Point to the cat's nose" | Easiest: one point
Bounding box | (x1, y1, x2, y2) | "Draw a box around each person" | Medium: 4 numbers per object
Segmentation | Pixel-level mask | "Segment the dog from background" | Hardest: needs a mask decoder
Visual Grounding

Click on regions to see how a VLM grounds language to spatial locations. Each colored box is a detected object with its label.

Coordinate formats: Some models normalize coordinates to [0, 1000] and emit them as text tokens. Others use special <box> tokens. The trend is toward treating coordinates as just another kind of language output — elegant and requires no architectural changes.

Grounding as Text: How It Works in Practice

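In coordinate-as-text models (such as Shikra and Qwen-VL), the model literally generates bounding boxes as number tokens. Kosmos-2 takes the related route of dedicated location tokens. An illustrative example of the text format: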

example output
# Input: "Where is the dog in this image?"
# Output:
The dog is on the left side of the image.
<box>102, 384, 298, 571</box>

# Coordinates: [x1, y1, x2, y2] normalized to [0, 1000]
# Real position: top-left (10.2%, 38.4%) to bottom-right (29.8%, 57.1%)
Why this is elegant: No architecture change. The model just needs to learn that certain number tokens represent coordinates. The vocabulary already contains all digits 0-9. The model learns spatial reasoning through the same autoregressive training as language generation. Qwen2.5-VL extends this to point, box, and polygon coordinates for different granularities of grounding.
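On the consuming side, turning such output back into pixel coordinates is a small parsing step. The sketch below assumes the exact `<box>x1, y1, x2, y2</box>` format and the [0, 1000] normalization used in the example above; real models differ in their tag and coordinate conventions, and `parse_boxes` is a hypothetical helper.

```python
import re

# Matches "<box>x1, y1, x2, y2</box>" with integer coordinates.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_boxes(text, image_width, image_height, scale=1000):
    """Convert normalized boxes in model output to pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(v) for v in match.groups())
        boxes.append((
            x1 * image_width / scale,
            y1 * image_height / scale,
            x2 * image_width / scale,
            y2 * image_height / scale,
        ))
    return boxes


reply = "The dog is on the left.\n<box>102, 384, 298, 571</box>"
pixel_boxes = parse_boxes(reply, image_width=1000, image_height=500)
print(pixel_boxes)
```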
Check: What does "grounding" mean in VLMs?

Chapter 8: Document Understanding

Documents are a special challenge for VLMs: they contain dense text, tables, charts, and layouts where spatial arrangement matters. "Revenue" next to "$5M" means something different from "Revenue" in a section header.

Key innovations: high-resolution encoding (documents need more pixels than photos), OCR-free reading (the vision encoder learns to read text directly), and layout-aware attention (understanding that rows and columns create relationships).

The Resolution Bottleneck for Documents

A standard document page at 300 DPI is roughly 2,550 × 3,300 pixels. A VLM at 336×336 resolution shrinks this by roughly 8-10× in each dimension, so 12-point text (about 50 pixels tall at 300 DPI) ends up only about 5 pixels tall, too small to read reliably. This is why document understanding requires special resolution handling:

Approach | How It Handles High-Res | Token Cost
Resize to 336×336 | Shrink entire page (text unreadable) | 576 tokens
AnyRes tiling | Split into 4-12 tiles at 336×336 each | 2,304-6,912 tokens
Dynamic cropping | Process only regions of interest at high-res | Variable (576-2,304)
Token pruning | Encode high-res but drop redundant tokens | ~1,000 (compressed from 5K+)
The OCR question: Should a VLM read text via its vision encoder (OCR-free) or rely on an external OCR module? OCR-free is simpler (one model does everything) but requires high resolution. External OCR is more accurate for small text but adds a pipeline stage and loses layout information. The trend is toward OCR-free: larger ViTs with higher resolution are closing the accuracy gap.
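A back-of-the-envelope sketch of the resolution bottleneck and the tiling cost. It assumes the page's longer side (3,300 px) is scaled to 336 px and that each tile costs 576 tokens; the helper names are hypothetical.

```python
def downscaled_text_height(point_size, dpi, page_px, target_px):
    """Pixel height of text after the page is resized to target_px."""
    text_px = point_size * dpi / 72      # 72 points per inch
    return text_px * target_px / page_px


def tiled_tokens(n_tiles, tokens_per_tile=576):
    """Vision tokens produced by encoding each tile separately."""
    return n_tiles * tokens_per_tile


height = downscaled_text_height(point_size=12, dpi=300,
                                page_px=3300, target_px=336)
print(f"12pt text after resize: about {height:.1f} px tall")
for n_tiles in (4, 12):
    print(f"{n_tiles} tiles: {tiled_tokens(n_tiles)} tokens")
```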
Document Layout Parsing

A VLM must understand that spatial layout encodes meaning. Watch how different regions are classified.

Resolution matters: A typical VLM at 336×336 can't read small text. Document VLMs like UReader and TextMonkey use dynamic resolution — cropping the image into tiles and encoding each at high resolution, then stitching the features back together.
Model | Approach | Strength
DocOwl | Layout-aware pretraining | Tables, forms
TextMonkey | High-res with token pruning | Dense text
Nougat | OCR-free academic PDF reading | Equations, LaTeX
GPT-4V / Gemini | Native multi-resolution | General documents

What Breaks: Common VLM Failure Modes

VLMs are impressive, but they fail in predictable ways. Understanding these failures tells you a lot about the architecture:

Failure Mode | Why It Happens | Example
Small object blindness | 14×14 patches are too coarse. A small object occupying one patch gets just one token of representation. | "How many buttons on the shirt?" → wrong count
Hallucination | The LLM's language prior overrides visual evidence. It describes objects that are statistically likely but not present. | Model describes a "vase of flowers" on a table that has none
Spatial confusion | Patch position embeddings encode rough grid position but not precise spatial relationships. | "Is the cup to the left or right of the plate?" → wrong answer
Video token explosion | Naive approach: N frames × 576 tokens = massive sequence. 10 frames = 5,760 tokens. | 2-second video at 5fps overwhelms context window
Long document overflow | OCR of a full page → thousands of text tokens on top of visual tokens. | 10-page PDF exceeds context even at low resolution
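
The small-object row is easy to quantify. This sketch estimates how many patches an object spans along one axis once the photo is resized to the encoder's input size. The 30-pixel button and the 1920-pixel photo width are made-up numbers for illustration.

```python
def patches_spanned(obj_px: float, image_px: float,
                    input_res: int = 336, patch: int = 14) -> float:
    """Approximate number of patches an object covers along one axis after resize."""
    size_after_resize = obj_px * input_res / image_px
    return size_after_resize / patch


# A 30 px shirt button in a 1920 px wide photo (hypothetical sizes).
span = patches_spanned(30, 1920)
print(f"{span:.3f} patches wide")  # well under one patch
```

Anything below 1.0 means the object shares a single token with its surroundings, which is why counting questions about small items tend to go wrong.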
Why hallucination persists: During training, the LLM learned strong priors about what co-occurs (tables often have flowers). The vision signal must be strong enough to override these priors. When the visual evidence is ambiguous (blurry, small, off-center), language priors win — and the model confidently describes things it doesn't actually see.
Check: Why is document understanding harder than photo understanding for VLMs?

Chapter 9: Frontier VLMs

The field is evolving rapidly. Today's frontier models handle video, multi-image reasoning, interleaved image-text, and even generate images alongside text. Here's where things stand:

Open-Weight Models

Model | Key Innovation | Scale | Strengths
LLaVA-NeXT | AnyRes dynamic resolution, SGLang-optimized | 7B-110B | Strong general VQA
InternVL 2.5 | Dynamic tiling, multi-scale ViT | 1B-78B | OCR, documents, multilingual
Qwen2.5-VL | NaViT-style packing, native video | 3B-72B | Video, grounding, agentic tasks
Molmo | Pointing as text coordinates, high-quality data | 1B-72B | Spatial grounding, UI understanding

Frontier Closed Models

Model | Key Innovation | Notable Capability
GPT-4o | Native multimodal from pretraining | Omni: text + image + audio + video in one model
Gemini 2.0 | 2M token context, native video understanding | Long video reasoning, interleaved generation
Claude Opus 4 | Strong spatial reasoning, chart understanding | Complex document analysis, precise counting

The Resolution Revolution

The biggest architectural shift in 2024-2025 has been dynamic resolution. Instead of resizing every image to 336×336, modern VLMs handle arbitrary aspect ratios and resolutions:

How AnyRes works (LLaVA-NeXT): Slice the image into tiles that fit the ViT's expected resolution. A 1344×672 image becomes 4×2 = 8 tiles of 336×336, plus one low-res global view. Total: 9 × 576 = 5,184 visual tokens. Expensive, but now the model can read fine print and see small objects.
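
A simplified sketch of the grid-selection step described above: given a few candidate tile grids, pick the one that keeps the most real image detail with the least padding, then count tokens including the global view. The candidate list here is hypothetical (chosen so the 4×2 example is reachable), and the scoring rule is a simplification rather than the exact LLaVA-NeXT code.

```python
def select_grid(width, height, candidates, tile=336):
    """Pick the (cols, rows) grid that preserves the most detail with the least padding."""
    best, best_key = None, None
    for cols, rows in candidates:
        canvas_w, canvas_h = cols * tile, rows * tile
        scale = min(canvas_w / width, canvas_h / height)
        fitted = int(width * scale) * int(height * scale)
        effective = min(fitted, width * height)   # upscaling adds no real detail
        wasted = canvas_w * canvas_h - effective  # canvas area spent on padding
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (cols, rows), key
    return best


def anyres_tokens(grid, tokens_per_tile=576, global_view=True):
    """Tokens for all high-res tiles plus the optional low-res global view."""
    cols, rows = grid
    return (cols * rows + (1 if global_view else 0)) * tokens_per_tile


CANDIDATE_GRIDS = [(1, 1), (2, 1), (1, 2), (2, 2), (4, 2), (2, 4)]  # hypothetical list
grid = select_grid(1344, 672, CANDIDATE_GRIDS)
print(grid, anyres_tokens(grid))  # (4, 2) 5184
```
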

The Resolution Token Table

This table shows why resolution management is the defining engineering challenge for production VLMs:

Input Resolution | Patches (14px) | Visual Tokens | Attn Cost (relative) | Can Read?
224 × 224 | 16 × 16 | 256 | 1× | No (too blurry)
336 × 336 | 24 × 24 | 576 | ~5× | Large text only
672 × 672 | 48 × 48 | 2,304 | 81× | Most text
1344 × 1344 | 96 × 96 | 9,216 | 1,296× | Fine print
AnyRes (9 tiles) | 9 × 24 × 24 | 5,184 | 410× | Yes (efficient)
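
The relative attention costs are consistent with self-attention scaling quadratically in sequence length, with the 224×224 row (256 tokens) as the baseline. That baseline is inferred from the 81× and 1,296× entries rather than stated in the table. A quick check:

```python
def visual_tokens(side_px: int, patch: int = 14) -> int:
    """Patch tokens for a square image of the given side length."""
    return (side_px // patch) ** 2


def relative_attention_cost(tokens: int, baseline_tokens: int = 256) -> float:
    """Self-attention cost relative to the baseline, assuming O(n^2) scaling."""
    return (tokens / baseline_tokens) ** 2


for side in (224, 336, 672, 1344):
    n = visual_tokens(side)
    print(side, n, round(relative_attention_cost(n), 2))

# AnyRes with 9 tiles, treated as one joint sequence of 9 * 576 tokens.
print(round(relative_attention_cost(9 * 576), 2))
```
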
Qwen2.5-VL's approach: Instead of fixed tiles, Qwen2.5-VL uses NaViT-style "packing": each image is encoded at (or near) its native resolution and aspect ratio, and the resulting variable-length patch sequences are packed into a single sequence without wasting compute on padding. The token count then scales with the actual size of the image, so a small icon costs a handful of tokens while a 4K page gets thousands.
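
The core of NaViT-style packing can be sketched in a few lines: concatenate patch sequences of different lengths and build an attention mask that keeps each image attending only to its own patches. This is a toy illustration in plain Python, not Qwen2.5-VL's actual implementation, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(lengths):
    """Pack patch sequences back to back.

    Returns (segment_ids, mask) where mask[i][j] is True only when
    tokens i and j belong to the same image (block-diagonal attention).
    """
    segment_ids = []
    for segment, n_tokens in enumerate(lengths):
        segment_ids.extend([segment] * n_tokens)
    total = len(segment_ids)
    mask = [[segment_ids[i] == segment_ids[j] for j in range(total)]
            for i in range(total)]
    return segment_ids, mask


# Two images with 2 and 3 patches share one packed sequence of 5 tokens.
segment_ids, mask = pack_sequences([2, 3])
print(segment_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

No padding tokens appear anywhere in the packed sequence, which is where the compute saving comes from.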
VLM Capability Radar

Compare different VLM generations across key capabilities. Each axis represents a different skill.

The trend: Early VLMs bolted vision onto language. Frontier models are increasingly trained multimodally from the start, so vision is native rather than an add-on. The aim is richer cross-modal reasoning and fewer hallucinations than a vision encoder attached to a text-only model can deliver.

The Video Challenge

Extending VLMs to video is the next frontier, but it creates massive engineering challenges:

Input | Frames | Visual Tokens | With Text | Challenge
Single image | 1 | 576 | ~626 | Manageable
4 keyframes | 4 | 2,304 | ~2,354 | Needs longer context
1 FPS × 30s video | 30 | 17,280 | ~17,330 | Exceeds most LLM context windows
30 FPS × 10s video | 300 | 172,800 | ~172,850 | Impossible without compression
Solutions being explored: Frame sampling (use 1 FPS not 30), temporal token merging (average similar adjacent frames), spatial downsampling (fewer tokens per frame), and learned temporal compression (a small network that compresses T frames into K tokens, like a temporal Perceiver). Gemini 2.0's 2M token context helps but doesn't fully solve the 30 FPS problem.
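
Two of these ideas fit in a short sketch: the raw token count for a clip, and a very simple form of temporal merging that averages runs of adjacent frames whose features are nearly identical. The sketch assumes one pooled feature vector per frame and a 0.95 cosine threshold, both chosen only for illustration; real systems merge at the token level.

```python
import math


def video_tokens(fps: float, seconds: float, tokens_per_frame: int = 576) -> int:
    """Visual tokens if every sampled frame is encoded independently."""
    return int(fps * seconds) * tokens_per_frame


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar_frames(frames, threshold=0.95):
    """Average runs of adjacent frames whose cosine similarity exceeds threshold."""
    groups = []
    for frame in frames:
        if groups and cosine(groups[-1][-1], frame) > threshold:
            groups[-1].append(frame)   # near-duplicate of the previous frame
        else:
            groups.append([frame])     # scene changed: start a new group
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


print(video_tokens(30, 10), video_tokens(1, 30))
frames = [[1, 0], [1, 0.01], [0, 1], [0, 1]]   # toy 2-D frame features
print(merge_similar_frames(frames))            # 4 frames collapse to 2
```
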

Connections

VLMs connect to many other topics:

CLIP & Contrastive Learning: The vision encoder that makes VLMs possible. CLIP pretraining aligns images and text before the VLM is even built.

VLAs (Vision-Language-Action): VLMs extended with action outputs for robotics. Same architecture, but the output includes motor commands.

Transformers: Both the vision encoder (ViT) and the language model (LLaMA/Vicuna) are transformers. Understanding self-attention is key.

Diffusion Models: Some VLMs (Emu, for example) can generate images too, using diffusion decoders alongside language generation. Standalone text-to-image systems such as DALL-E 3 are diffusion models driven by text rather than VLMs.

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective."
— Rich Sutton, The Bitter Lesson

You now understand how machines learned to see and speak. The fusion of vision and language is one of the most consequential advances in AI history.

🏗 Design Challenge You're the Architect: 4K Images + 100K Text in 24GB VRAM
A medical imaging startup needs a VLM that processes full 4K pathology slides (3840×2160 pixels) alongside 100K tokens of patient history. Inference must run on a single RTX 4090 (24GB VRAM). The current naive approach (AnyRes tiling at 336px) produces 46,000+ visual tokens — impossible to fit with a 7B LLM.
Input image: 3840 × 2160 px (4K pathology slide)
Text context: 100K tokens of patient records
VRAM budget: 24 GB (RTX 4090)
Latency target: < 30 seconds for full response
Accuracy requirement: Must not miss small lesions (1-2% of image area)
1. How do you handle the 4K resolution? (Naive tiling = 46K tokens. You need <4K total visual tokens to fit in VRAM. But you can't just downsample — small lesions vanish.)
2. How do you handle 100K text tokens alongside visual tokens? (Standard 8K context LLMs can't fit both. Do you compress text, use a long-context model, or something else?)
3. What's your token compression strategy? (Perceiver resampling, spatial pooling, saliency-based cropping, hierarchical encoding?)
4. How do you quantize/optimize to fit in 24GB with both the vision encoder and LLM loaded?

One workable solution (combining techniques used by models such as InternVL 2.5 and Qwen2.5-VL):

1. Hierarchical tiling with saliency: Encode a low-res global view (576 tokens) plus high-res crops only where a saliency detector flags detail (lesion candidates). A lightweight CNN identifies suspicious regions. Only 4-6 high-res tiles get encoded, giving roughly 2,900-4,000 visual tokens in total before any merging (5 to 7 views × 576).

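2. Long-context handling: Qwen2.5-72B supports 128K context, but a 72B model does not fit in 24GB even at 4-bit. Use a smaller 7B model with RoPE-extended context to 32K instead, and summarize the patient history into about 2K tokens in a separate text-only pass (run in chunks, since 100K tokens exceeds the 32K window). Visual tokens plus summary then fit easily.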

3. Token merging: After the ViT encodes each tile, apply token merging (ToMe) — adjacent tokens with cosine similarity > 0.95 get averaged. Pathology backgrounds are highly redundant, so this achieves 2-3x compression with <1% quality loss on diagnostic tasks.

4. Memory budget: 7B model in AWQ 4-bit = 3.5GB. CLIP ViT-L in FP16 = 0.6GB. KV cache for 4K tokens at 4-bit in a 7B model ≈ 1.5GB. Activations ≈ 2GB peak. Total ≈ 8GB — well within 24GB with room for batch size >1.
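
The budget in step 4 can be checked with back-of-the-envelope formulas. The sketch below assumes a LLaMA-7B-like shape (32 layers, 32 KV heads, head dimension 128), which is an assumption and not something the challenge specifies. For that shape the KV-cache formula gives about 0.5GB at 4-bit for 4K tokens, so the 1.5GB figure above leaves headroom for quantization metadata and longer generations.

```python
def weights_gb(n_params: float, bits: int) -> float:
    """Memory for model weights at the given precision."""
    return n_params * bits / 8 / 1e9


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int, bits: int) -> float:
    """Keys and values (factor 2) for every layer, head and cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bits / 8 / 1e9


llm = weights_gb(7e9, 4)       # 7B LLM in 4-bit
vit = weights_gb(304e6, 16)    # CLIP ViT-L in FP16
kv = kv_cache_gb(4096, layers=32, kv_heads=32, head_dim=128, bits=4)  # assumed shape
print(f"LLM {llm:.2f} GB, ViT {vit:.2f} GB, KV cache {kv:.2f} GB")

# Summing the figures quoted in step 4 (weights, ViT, KV cache, activations).
total = 3.5 + 0.6 + 1.5 + 2.0
print(f"Total from step 4: {total:.1f} GB of 24 GB")
```

Models with grouped-query attention have fewer KV heads, which shrinks the cache further; plug in the real layer and head counts for whichever 7B model is used.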

Check: What distinguishes frontier VLMs from early ones?
🔗 Pattern Recognition
VLM → VLA: Adding Actions to Vision-Language
This Lesson (VLM)
[Image tokens] + [Text tokens] → LLM → [Text response]
The model sees and speaks.
VLA (Vision-Language-Action)
[Image tokens] + [Text instruction] → LLM → [Action tokens]
Same architecture, but output is motor commands. → VLA lesson

The structural insight: a VLA is literally a VLM where the output vocabulary includes discretized robot actions (joint angles, gripper commands). The same projection bridge, same fusion mechanism, same autoregressive generation — just targeting a different output space. RT-2 proved this by fine-tuning a VLM to output action tokens with zero architectural changes.
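
To see what "discretized robot actions" means in practice, here is a minimal sketch that maps continuous action values to integer bins and back. It uses 256 bins per action dimension, as RT-2 describes; the value range of [-1, 1] and the function names are assumptions for illustration.

```python
def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action value to an integer bin index in [0, bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        tokens.append(int((clipped - low) / (high - low) * (bins - 1) + 0.5))
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0, bins=256):
    """Map bin indices back to the continuous values they represent."""
    return [low + t / (bins - 1) * (high - low) for t in tokens]


tokens = action_to_tokens([-1.0, 0.0, 1.0, 0.3])
print(tokens)
print([round(v, 3) for v in tokens_to_action(tokens)])
```

Each bin index is then assigned to a token in the LLM's vocabulary, so generating an action is just generating a short token sequence. The round trip loses at most half a bin width of precision.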

What other domains could you target by simply changing what the LLM generates? (Hint: think music notation, CAD coordinates, chemical structures...)