microVLM — Vision-Language Models from Scratch

Chapter 0: Two Modalities, One Model

Humans effortlessly combine what they see with what they read. You look at a chart and describe trends. You glance at a photo and answer questions about it. For decades, vision and language lived in separate AI systems. A Vision-Language Model (VLM) merges them into a single architecture that processes images and text together.

The core challenge: images are grids of pixels (spatial, continuous) while text is a sequence of tokens (discrete, symbolic). A VLM must bridge these two very different representations and allow them to interact so the model can reason about visual content using language.

The big picture: A VLM = a vision encoder (eyes) + a language model (brain) + a bridge that translates between them. The magic is in how the bridge aligns pixel features with word meanings.

The Data Flow at a Glance

Here's the entire VLM pipeline in concrete tensor shapes (we'll unpack each box in later chapters):

Image

[3, 336, 336] — RGB pixels

↓ CLIP ViT-L/14 (304M params, frozen)

Visual Features

[576, 1024] — 576 patch tokens

↓ MLP Projection (~21M params)

Projected Tokens

[576, 4096] — in LLM space

↓ concat with text tokens [~50, 4096]

LLM (13B params)

[~626, 4096] → autoregressive generation

Two Modalities Merging

Watch image patches and text tokens flow into a shared representation space. The teal nodes are visual, orange nodes are textual.

Check: What is the core challenge of building a VLM?

• Making images higher resolution
• Bridging spatial pixel representations with sequential token representations
• Training on more text data

Chapter 1: Vision Encoders — Teaching AI to See

The vision encoder converts a raw image into a sequence of feature vectors. The dominant approach is ViT (Vision Transformer): chop the image into fixed-size patches (e.g., 14×14 pixels), flatten each patch into a vector, add positional embeddings, and run them through a transformer.

After encoding, an image becomes a grid of high-dimensional vectors — essentially the same shape as a sequence of text tokens. This structural similarity is what makes vision-language fusion possible.

patches = split(image, patch_size) → z = Transformer(embed(patches) + pos_embed)
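As a rough sketch of the split step, the snippet below cuts a tiny single-channel "image" (a nested list) into non-overlapping square patches and flattens each one. The `patchify` helper is a hypothetical name used only for illustration; a real ViT follows this step with a learned linear embedding and positional embeddings, which are left out here.

```python
def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches.

    Patches come back in row-major order, the order in which a ViT
    turns the patch grid into a token sequence.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image with patch_size=2 gives a 2x2 grid of patches.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(toy, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, 4 values each
```

The same loop applied to a 336×336 RGB image with `patch_size=14` would yield 576 patches, each holding 14×14×3 values.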

Tensor Shapes: CLIP ViT-L/14 Walkthrough

Let's trace the exact shapes through the most common vision encoder. CLIP ViT-L/14 takes a 336×336 image and uses 14×14 pixel patches:

Input Image

[1, 3, 336, 336] — RGB, 336px square

↓ split into 14×14 patches

Patch Embedding

(336/14) × (336/14) = 24 × 24 = 576 patches
Each patch: 14×14×3 = 588 values (196 pixels × 3 channels) → projected to d=1024
Shape: [1, 576, 1024]

↓ + positional embeddings [576, 1024]

Transformer (24 layers)

Self-attention over 576 tokens of dim 1024
Output: [1, 576, 1024] — 576 visual tokens

The resolution scaling problem: At 336×336, we get 576 tokens. Double the resolution to 672×672 and we get (672/14)² = 2,304 tokens. Self-attention cost scales quadratically with token count: 576² ≈ 331K query-key pairs vs 2304² ≈ 5.3M, a 16× increase. This is why dynamic resolution (AnyRes in LLaVA-1.6, NaViT-style native dynamic resolution in Qwen2-VL) is critical for handling high-res images without blowing up compute.
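The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names are made up for this example, and "cost" here counts only the entries of one attention score matrix, ignoring heads, layers, and the feature dimension.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Tokens produced by a square image split into square patches."""
    assert image_size % patch_size == 0
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key score entries in one full self-attention map."""
    return num_tokens * num_tokens


low = num_patch_tokens(336)
high = num_patch_tokens(672)
ratio = attention_pairs(high) // attention_pairs(low)
print(low, high, ratio)  # 576 2304 16
```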

Image Patchification

See how an image is sliced into patches. Each patch becomes a token for the transformer. Adjust patch size to see the trade-off between resolution and sequence length.

Key trade-off: Smaller patches = more tokens = more detail but quadratic attention cost. Larger patches = fewer tokens = faster but lose fine detail. Most VLMs use 14×14 patches with a 224×224 or 336×336 image.

Encoder	Params	Used By	Patch Size	Output Dim
CLIP ViT-L/14	304M	LLaVA, many VLMs	14×14	1024
SigLIP (SO400M)	400M	PaliGemma, newer VLMs	14×14	1152
EVA-CLIP (ViT-g)	1.0B	BLIP-2, MiniGPT-4	14×14	1408
DINOv2 (ViT-L)	304M	Research models	14×14	1024

Check: What does a Vision Transformer do to an image first?

• Splits it into fixed-size patches and treats each as a token
• Runs convolutions at increasing scales
• Converts it to text captions directly

🔗 Pattern Recognition

The Vision Encoder Is Just CLIP's Image Side

This Lesson (VLM)

CLIP ViT-L/14 takes image → [576, 1024] patch tokens. The spatial structure (24×24 grid) encodes position.

CLIP & Contrastive Learning

Same encoder, but CLIP reads out a single global vector (the CLS token, projected into the shared image-text space) for image-text matching instead of using the 576 patch tokens. → CLIP lesson

The VLM keeps all 576 tokens because it needs spatial detail. CLIP's matching objective needs only global semantics, so it uses one global vector. Same backbone, different output usage. LLaVA also takes its patch tokens from the penultimate layer rather than the last one, since the final layer is the most specialized toward that global readout.

Where else do you see the same encoder used with different amounts of its output? (Hint: think about how the Transformer lesson shows hidden states at every layer.)

Checkpoint — Before you move on

A 672×672 image with 14×14 patches produces how many tokens? What dimension is each token? If you doubled the patch size to 28×28, what happens to token count and per-token information?

Model Answer

672/14 = 48 patches per side, so 48×48 = 2,304 tokens, each of dimension 1024. That's 4× more tokens than 336×336 (576 tokens), and attention cost scales quadratically: 2304²/576² = 16× more compute.

If you double patch size to 28×28: you get (672/28)² = 24×24 = 576 tokens (same as the standard 336px case). Each patch now covers 28×28×3 = 2,352 values instead of 588, so 4× more raw input is compressed into the same 1024-dim vector. You lose fine-grained detail but save massive compute. This is the fundamental resolution/compute trade-off in VLMs.

Chapter 2: The Projection Bridge

The vision encoder outputs feature vectors in its embedding space. The language model expects tokens in its embedding space. These spaces don't match! We need a projection layer to translate between them.

The simplest bridge is a linear projection (a single matrix multiply). More sophisticated ones use an MLP (two linear layers with an activation) or a cross-attention resampler (like Flamingo's Perceiver). The projection doesn't just resize — it aligns visual features with the language model's semantic space.

LLaVA-1.5 Projection: Exact Shapes

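LLaVA-1.5 uses a 2-layer MLP to project from CLIP's 1024-dim space into the LLM's embedding space. The shapes below use a 4096-dim LLM, which matches Vicuna-7B; Vicuna-13B's hidden size is 5120, so for it both layers would output 5120 instead: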

Vision Encoder Output

[576, 1024] — 576 tokens, each 1024-dim

↓

Linear(1024, 4096)

W₁: [1024, 4096] + bias [4096] = 4.2M params

↓ GELU activation

Linear(4096, 4096)

W₂: [4096, 4096] + bias [4096] = 16.8M params

↓

Projected Tokens

[576, 4096] — now in LLM's embedding space

Parameter count: The entire projection MLP is ~21M parameters at this 4096 width, about 0.16% of a 13B-parameter LLM (at Vicuna-13B's actual 5120 width it would be ~31M, still under 0.3%). This tiny bridge can carry the alignment because both CLIP and the LLM are already strong: it only needs to learn a translation, not understanding.
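A quick way to sanity-check the ~21M figure is to count weights and biases directly. The helper names below are hypothetical; the 5120 case is included on the assumption that the projector's output must match the LLM's hidden size, which is 5120 for a LLaMA-13B-style model.

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Weights plus optional bias of one fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


def mlp_projector_params(vision_dim: int, llm_dim: int) -> int:
    """Linear(vision_dim, llm_dim) -> GELU -> Linear(llm_dim, llm_dim).

    GELU has no learnable parameters, so only the two linear layers count.
    """
    return linear_params(vision_dim, llm_dim) + linear_params(llm_dim, llm_dim)


params_4096 = mlp_projector_params(1024, 4096)
params_5120 = mlp_projector_params(1024, 5120)
print(f"{params_4096:,}  {params_5120:,}")  # 20,979,712  31,467,520
```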

Projection Strategies Compared

Projection Type	Architecture	Params	Token Count	Used By
Linear	Single W: [d_vis, d_llm]	~4M	Same as input (576)	LLaVA v1
MLP (2-layer)	Linear + GELU + Linear	~21M	Same as input (576)	LLaVA-1.5
Perceiver Resampler	Cross-attention with K learned queries	~100M	K (e.g., 64 or 256)	Flamingo, Qwen-VL
C-Abstractor	Conv layers + adaptive pooling	~50M	Reduced (e.g., 144)	Honeybee

The Perceiver trade-off: A Perceiver compresses 576 vision tokens into just 64 fixed queries. This makes the LLM's job easier (64 tokens instead of 576 to attend to), but the compression is lossy — fine spatial details get smoothed out. LLaVA's approach keeps all 576 tokens at the cost of more attention compute. For OCR and spatial grounding, keeping all tokens wins. For general VQA, compression is usually fine.

Why not just concatenate? Vision features live in a completely different vector space than word embeddings. Without projection, the language model sees garbage. The projection learns to map "this patch has a furry texture" into the same region of embedding space as the word "cat."

Projection Alignment

Points in vision space get projected to language space. Drag the slider to rotate the projection and see alignment change.

Check: What does the projection layer do?

• Compresses the image to a thumbnail
• Maps vision features into the language model's embedding space
• Removes noisy patches from the image

🔨 Derivation Projection MLP Parameter Count

LLaVA-1.5 uses a 2-layer MLP to bridge CLIP ViT-L (output dim = 1024) to the LLM's embedding space. This exercise uses an LLM width of 4096, which is Vicuna-7B's hidden size (Vicuna-13B's is 5120). The architecture is: Linear(1024, 4096) → GELU → Linear(4096, 4096).

Your task: Calculate the total parameter count. Why must the output dimension equal the LLM's hidden size, and what would the count be for a 5120-wide LLM?

A Linear(in, out) layer has in × out weight parameters + out bias parameters. So Linear(1024, 4096) = 1024×4096 + 4096 = ?

Layer 1: 1024×4096 + 4096. Layer 2: 4096×4096 + 4096. Sum them. Note that GELU has zero learnable parameters.

Think about the LLM's embedding table. The LLM's word embeddings map vocabulary tokens into d_model dimensions, and the visual tokens must land in that same space because they are concatenated with embedded words at the transformer's input.

Full derivation:

Layer 1: Linear(1024, 4096) = 1024 × 4096 + 4096 = 4,194,304 + 4,096 = 4,198,400 params

Layer 2: Linear(4096, 4096) = 4096 × 4096 + 4096 = 16,777,216 + 4,096 = 16,781,312 params

Total: 4,198,400 + 16,781,312 = 20,979,712 ≈ 21M parameters

The key insight: The projector's output dimension must equal the LLM's hidden size (d_model), because visual tokens are concatenated with text embeddings at the input to the transformer, and in a LLaMA-style model the embedding width and the hidden width are the same number. That is 4096 for Vicuna-7B and 5120 for Vicuna-13B, where the same recipe gives 1024×5120 + 5120 + 5120×5120 + 5120 = 31,467,520 ≈ 31.5M parameters. Either way the bridge is a fraction of a percent of the LLM's parameters, yet it is the only newly initialized component connecting the two pretrained models.

💻 Build It Implement the Vision-to-Language Projection

You've learned that LLaVA's bridge is a 2-layer MLP with GELU activation. Now implement it. The module takes [batch, n_patches, vision_dim] and outputs [batch, n_patches, llm_dim]. No token count reduction — every vision token maps 1:1 to an LLM input token.

signature
class VisionProjection(nn.Module):
    """
    Projects vision encoder output to LLM embedding space.

    Args:
        vision_dim: dimension of vision encoder output (e.g., 1024)
        llm_dim: dimension of LLM embedding space (e.g., 4096)

    Forward:
        Input: [batch, n_patches, vision_dim]
        Output: [batch, n_patches, llm_dim]
    """
    def __init__(self, vision_dim: int, llm_dim: int): ...
    def forward(self, x: torch.Tensor) -> torch.Tensor: ...

Test case

proj = VisionProjection(1024, 4096)
x = torch.randn(2, 576, 1024) # batch=2, 576 patches, CLIP dim
out = proj(x)
assert out.shape == (2, 576, 4096) # same patch count, LLM dim
print(sum(p.numel() for p in proj.parameters())) # ~20.98M

LLaVA-1.5 uses llm_dim as the intermediate dimension too. So both layers output llm_dim. The architecture is: Linear(vision_dim, llm_dim) → GELU() → Linear(llm_dim, llm_dim).

python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # 1024 -> 4096
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),     # 4096 -> 4096
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, n_patches, vision_dim]
        return self.mlp(x)  # [batch, n_patches, llm_dim]

Bonus challenge: Modify this to include a Perceiver-style resampler that compresses 576 tokens down to 64 learned queries using cross-attention. What new parameters do you need?

Chapter 3: Fusion Strategies

Once vision tokens and text tokens are in the same space, how do we let them interact? There are several architectural strategies, each with different trade-offs:

Strategy	How It Works	Example
Early Fusion	Concatenate vision + text tokens, feed to one transformer	LLaVA
Cross-Attention	Text attends to vision via extra cross-attention layers	Flamingo
Perceiver Resampler	Learned queries compress vision tokens to fixed count	Flamingo, Qwen-VL
Interleaved	Vision tokens inserted at corresponding text positions	Fuyu, Gemini

Fusion Architecture Comparison

Toggle between strategies to see how vision (teal) and text (orange) tokens interact.

Early fusion is simplest: Just prepend vision tokens to the text sequence and let self-attention handle everything. The LLM's existing attention mechanism naturally learns vision-language interactions. This is why LLaVA works so well with so little architectural change.

What Happens Inside: The Attention Pattern

In early fusion, the LLM's self-attention sees a sequence like [v₁, v₂, ..., v₅₇₆, t₁, t₂, ..., t₅₀]. Because it's causal attention:

Visual tokens (positions 1-576):
• Attend to themselves and to earlier visual tokens (the causal mask applies within the image too)
• Cannot see any text tokens that come after them (causal mask)
• This means vision features are "frozen" in place: they don't change based on the question

Text tokens (positions 577+):
• Attend to ALL vision tokens
• Attend to prior text tokens
• This is where cross-modal reasoning happens: the word "cat" attending to furry-texture patches

An important subtlety: In LLaVA, the visual tokens never see the question. The model can only ground its visual processing through the text side. This is a limitation: the image is encoded the same way regardless of whether you ask "What color is the car?" or "How many people are there?" Flamingo's cross-attention has the same one-way flow (text attends to vision, not the reverse). Query-conditioned designs such as InstructBLIP's instruction-aware Q-Former address it by feeding the instruction into the module that extracts the visual tokens.
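The attention pattern described above falls straight out of a causal mask over the concatenated sequence. The sketch below builds one for a toy sequence (4 vision tokens and 3 text tokens standing in for 576 and ~50) and checks who can see whom; it uses plain Python lists, and the names are illustrative only.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


NUM_VISION, NUM_TEXT = 4, 3  # toy sizes standing in for 576 and ~50
SEQ_LEN = NUM_VISION + NUM_TEXT
mask = causal_mask(SEQ_LEN)

last_vision = NUM_VISION - 1
first_text = NUM_VISION

# Even the last vision token cannot attend to any text position.
vision_sees_text = any(mask[last_vision][j] for j in range(first_text, SEQ_LEN))
# The first text token can already attend to every vision token.
text_sees_all_vision = all(mask[first_text][j] for j in range(NUM_VISION))
print(vision_sees_text, text_sees_all_vision)  # False True
```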

Architecture Comparison: Who Builds the Bridge Differently?

Each approach makes fundamentally different engineering choices about what's frozen, what's trained, and how vision enters the LLM:

Model	Bridge Type	Vision Encoder	LLM	Vision Tokens to LLM	Total Params
LLaVA-1.5	MLP (2-layer)	CLIP ViT-L (frozen)	Vicuna-13B (Stage 2: tuned)	576 (all patches)	~13.4B
Flamingo	Gated cross-attn	NFNet (frozen)	Chinchilla (frozen)	64 (resampled)	~80B
Qwen-VL	Cross-attn resampler	ViT-G (trained)	Qwen-7B (tuned)	256 (compressed)	~9.6B
InternVL 1.5	MLP (after pixel shuffle)	InternViT-6B (tuned)	InternLM2-20B (tuned)	256 per tile	~26B

The token count trade-off: LLaVA sends all 576 vision tokens to the LLM — maximum information but expensive attention. Flamingo compresses to 64 learned queries — faster but lossy. Qwen-VL uses 256 as a middle ground. Fewer tokens = faster inference but more information loss, especially for fine-grained tasks like OCR.
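To get a feel for the trade-off, the sketch below compares the LLM prefill sequence length and the size of one attention score matrix for three vision-token budgets. It makes a simplifying assumption: every budget is treated as if its tokens were concatenated into the LLM input, which is not how cross-attention models such as Flamingo actually consume them. The ~50 text tokens are the rough prompt length used in this lesson.

```python
TEXT_TOKENS = 50  # approximate prompt length used in this lesson


def prefill_cost(vision_tokens, text_tokens=TEXT_TOKENS):
    """Return (sequence length, attention score entries) for one layer."""
    seq_len = vision_tokens + text_tokens
    return seq_len, seq_len * seq_len


for vision_tokens in (576, 256, 64):
    seq_len, pairs = prefill_cost(vision_tokens)
    print(f"{vision_tokens:>3} vision tokens -> seq_len {seq_len}, {pairs:,} score entries")
```

Going from 576 to 64 vision tokens shrinks the score matrix by roughly 30× under this simplification, which is the compute side of the compression argument; the information-loss side is not something this arithmetic captures.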

Check: In early fusion, how do vision and text tokens interact?

• Through the transformer's standard self-attention over the concatenated sequence
• Through a separate fusion module
• They don't interact directly

⚔ Adversarial: Your VLM hallucinates objects not in the image

You fine-tuned a VLM using LLaVA's architecture. The base text-only LLM (Vicuna-13B) almost never hallucinates facts. But your VLM confidently describes objects that aren't in the image — "a vase of flowers on the table" when the table is empty. The vision encoder (CLIP) correctly encodes the image. What architectural property causes this?

• The vision encoder is too small to capture all objects
• In causal attention, vision tokens can't see the question, so the LLM's language prior about "what's usually on tables" overrides weak visual signal
• The projection layer has too few parameters

🔨 Derivation Cross-Attention Dimensions in Flamingo

Flamingo inserts cross-attention layers into its LLM (Chinchilla-70B, d_model=8192). The vision encoder outputs 64 tokens of dimension 1024 (after a Perceiver resampler). Cross-attention uses Q from text (d=8192), K and V from vision (d=1024), with num_heads=64 and head_dim=128.

Your task: Derive the projection dimensions for W_Q, W_K, W_V, and W_O. What is the total parameter cost of one cross-attention layer?

Q comes from the text stream (d=8192). K and V come from the vision stream (d=1024). All project to num_heads × head_dim = 64 × 128 = 8192 internal dim. So W_Q: [8192, 8192], but W_K and W_V: [1024, ?]

W_K: [1024, 64×128] = [1024, 8192]. Same for W_V. The output projection W_O maps attention output back to text dim: [8192, 8192].

W_Q: 8192×8192. W_K: 1024×8192. W_V: 1024×8192. W_O: 8192×8192. Sum them up.

Full derivation:

W_Q: [d_text, n_heads × head_dim] = [8192, 8192] = 67.1M params

W_K: [d_vision, n_heads × head_dim] = [1024, 8192] = 8.4M params

W_V: [d_vision, n_heads × head_dim] = [1024, 8192] = 8.4M params

W_O: [n_heads × head_dim, d_text] = [8192, 8192] = 67.1M params

Total per cross-attention layer: 67.1 + 8.4 + 8.4 + 67.1 = ~151M parameters

The key insight: W_K and W_V are much smaller (8.4M each) because they project from the 1024-dim vision space, not the 8192-dim text space. The text-side projections W_Q and W_O account for about 134M of the 151M. Note that in Flamingo the gated cross-attention blocks are inserted as new layers between the frozen LLM layers, so all four matrices are newly trained parameters (along with a feed-forward block and tanh gates that this count leaves out).
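The derivation can be reproduced with a small helper. This is a sketch under the exercise's stated dimensions, with bias terms omitted as in the derivation; the function name is hypothetical.

```python
def cross_attention_params(d_text, d_vision, num_heads, head_dim):
    """Weight counts for one cross-attention layer (no biases).

    Queries come from the text stream; keys and values from the vision stream.
    """
    inner = num_heads * head_dim
    return {
        "W_Q": d_text * inner,
        "W_K": d_vision * inner,
        "W_V": d_vision * inner,
        "W_O": inner * d_text,
    }


counts = cross_attention_params(d_text=8192, d_vision=1024, num_heads=64, head_dim=128)
total = sum(counts.values())
for name, count in counts.items():
    print(f"{name}: {count / 1e6:.1f}M")
print(f"total: {total / 1e6:.1f}M")
```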

Chapter 4: Visual Instruction Tuning

Having the right architecture isn't enough — the model needs to learn when and how to use visual information. Visual instruction tuning trains the model on (image, instruction, response) triples so it can follow natural language commands about images.

The key innovation: instead of training on just captions ("A cat on a sofa"), you train on diverse tasks: "What color is the cat?", "Count the cushions", "Write a poem about this scene", "Is this image safe for children?" This teaches the model to be a general-purpose visual assistant.

Why Captioning Alone Isn't Enough

A model trained only on image-caption pairs learns to describe, but it can't reason. It can say "a cat on a sofa" but can't answer "Is the sofa large enough for two people?" because captioning datasets never ask questions that require spatial reasoning, counting, comparison, or inference about intent.

Training Data Type	What the Model Learns	What It Can't Do
Captions only	"A dog playing in the park"	Answer questions, count objects, reason spatially
VQA datasets	Answer factual questions	Free-form conversation, creative tasks
Instruction tuning	All of the above + follow instructions	Out-of-distribution tasks (but fewer gaps)

Input

[Image] + "How many people are in this photo?"

↓

VLM Processes

Vision encoder + projection + LLM forward pass

↓

Output

"There are three people in this photo, standing near..."

Data generation trick: LLaVA generated its instruction-tuning data using GPT-4! They fed image captions + bounding boxes to GPT-4 and asked it to generate question-answer pairs. This "language-only teacher" approach created 158K high-quality training conversations.

The Training Signal: What Does the Loss Look Like?

During instruction tuning, the model receives a sequence like:

format
[IMG_1] [IMG_2] ... [IMG_576] [SYS] You are a helpful assistant. [USR] What animal is in this image? [ASST]

# Loss is computed ONLY on the assistant's response tokens:
"This image shows a golden retriever lying on grass..."

# Vision tokens and user tokens: loss = 0 (no gradient)
# Assistant tokens: standard next-token prediction loss

Causal masking matters: The model can attend from text to vision tokens (allowing it to "look" at the image), but not from vision tokens to future text tokens. Vision tokens attend only to each other and to prior context. This ensures the visual features are processed before the text response is generated.
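The loss masking shown in the format block above is often implemented by building a label sequence in which every non-response position holds an "ignore" value. The sketch below assumes the common convention of -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`); the token ids are invented for illustration and do not come from a real tokenizer.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(num_vision_tokens, prompt_ids, response_ids):
    """Labels aligned with [vision tokens][system + user tokens][response tokens].

    Only the response positions keep their token ids; everything else is ignored.
    """
    masked = [IGNORE_INDEX] * (num_vision_tokens + len(prompt_ids))
    return masked + list(response_ids)


# Made-up token ids, purely for illustration.
labels = build_labels(576, prompt_ids=[1, 15, 27, 9], response_ids=[42, 7, 2])
supervised = [t for t in labels if t != IGNORE_INDEX]
print(len(labels), supervised)  # 583 [42, 7, 2]
```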

Instruction Diversity

See the variety of tasks a VLM must handle. Click to cycle through instruction types.

Check: What makes visual instruction tuning different from captioning?

• It uses higher resolution images
• It trains on diverse tasks (QA, description, reasoning) not just captions
• It removes the vision encoder

🔨 Derivation The Visual Instruction Tuning Loss

The input sequence is: [v₁...v₅₇₆][SYS][USR question][ASST response tokens r₁...r_T]. Standard language model training computes cross-entropy loss on every token. Visual instruction tuning modifies this.

Your task: Write out the loss function. Which tokens contribute to the loss? Which tokens have their loss masked to zero? Why is this masking critical?

If you compute loss on vision tokens, you're asking the model to predict the next image patch — a nonsensical task for a language model. If you compute loss on the user's question, you're training the model to memorize questions rather than answer them.

Standard LM loss: L = -(1/N) ∑_{i=1}^{N} log P_θ(x_i | x_{<i}). Now add a mask m_i ∈ {0, 1} that's 1 only for assistant response tokens.

Divide by the number of unmasked tokens (just the response length T), not the full sequence length. Otherwise long prompts would make the gradient vanishingly small.

Full derivation:

Let the full sequence be x = [v₁...v₅₇₆, s₁...s_K, r₁...r_T] where s = system+user tokens, r = response tokens.

Define mask: m_i = 1 if token i is a response token, 0 otherwise.

Loss: L = -(1/T) ∑_{i=1}^{N} m_i · log P_θ(x_i | x_{<i}), where N = 576 + K + T is the full sequence length and ∑_i m_i = T

Concretely: m_i = 0 for all 576 vision tokens, 0 for all system/user tokens, and 1 only for r₁ through r_T.

The key insight: The model sees all tokens during the forward pass (vision tokens condition the response via attention), but only learns to generate the response. This is what separates instruction tuning from pretraining: the model learns to use visual context without trying to predict it. If you accidentally trained on vision tokens, the model would waste capacity trying to predict patch embeddings — a regression task the autoregressive head isn't designed for.
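The masked loss can be written out in a few lines of plain Python. This is a minimal numeric sketch, not a training implementation: `token_probs` stands for the probability the model assigned to the correct token at each position, and the numbers are made up.

```python
import math


def masked_nll(token_probs, mask):
    """Average negative log-likelihood over positions where mask == 1.

    token_probs[i]: probability assigned to the correct token at position i.
    mask[i]: 1 for response tokens, 0 for vision/system/user tokens.
    The sum is divided by the number of response tokens T, not by N.
    """
    assert len(token_probs) == len(mask)
    num_response = sum(mask)
    total = sum(-math.log(p) for p, m in zip(token_probs, mask) if m == 1)
    return total / num_response


# Two masked positions followed by three response tokens (toy values).
probs = [0.01, 0.02, 0.5, 0.5, 0.25]
mask = [0, 0, 1, 1, 1]
loss = masked_nll(probs, mask)
print(round(loss, 4))  # 0.9242
```

Changing the probabilities at the masked positions leaves `loss` unchanged, which is exactly the point: those tokens condition the prediction but contribute no training signal.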

Chapter 5: The LLaVA Architecture

LLaVA (Large Language and Vision Assistant) showed that a surprisingly simple recipe works remarkably well: take a pretrained CLIP vision encoder, a pretrained LLM (like Vicuna/LLaMA), and connect them with a small projection: a single linear layer in the original LLaVA, upgraded to a 2-layer MLP in LLaVA-1.5. That's it.

The image goes through CLIP ViT-L/14 → produces 576 vision tokens (a 24×24 grid from the penultimate layer) → each is projected to the LLM's dimension → prepended to the text tokens → the LLM generates the response autoregressively.

Why the penultimate layer? CLIP's contrastive loss is applied to a single global CLS token read from the final layer, so the final layer is the most specialized toward global image-text matching. The penultimate layer's 576 patch tokens tend to retain more local detail, which the LLM needs for questions like "What's in the top-left corner?"

CLIP ViT-L/14

336×336 image → 576 visual tokens (d=1024)

↓

MLP Projector

Linear(1024, 4096) + GELU + Linear(4096, 4096)

↓

Vicuna-13B

[576 visual tokens] + [text tokens] → autoregressive output

LLaVA Forward Pass

Watch tokens flow through the architecture. Teal = vision, orange = text, green = output.

Why does simplicity win? Both CLIP and the LLM are already well-trained. CLIP learned to align images with text during pretraining. The LLM already knows language. The projection just needs to learn the "translation" between their embedding spaces — a much smaller problem than training from scratch.

The Complete Input to the LLM

Let's trace exactly what the LLM receives. For a question like "What is in this image?":

Visual Tokens

576 tokens × 4096 dims = [576, 4096]

System + User Text

~50 tokens × 4096 dims = [50, 4096]
(system prompt + "What is in this image?")

↓ concatenate

LLM Input Sequence

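~626 tokens × 4096 dims
The 4096 width shown here is Vicuna-7B's (32 layers, d_model=4096, 32 heads); Vicuna-13B has 40 layers, d_model=5120, 40 heads, so its tokens are 5120-dim
Self-attention over all 626 tokens at every layer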

Context budget: 576 of the ~626 input tokens are visual — 92% of the LLM's input is image. This means vision dominates the KV cache and attention cost. For multi-image or video inputs, this ratio gets even more extreme, which is why token compression matters so much.
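The context-budget claim can be made concrete with a short calculation. The sketch below computes the visual share of the input and a rough KV-cache size, assuming fp16 storage (2 bytes per value) and standard multi-head attention where each layer stores one key and one value vector of width d_model per token. The 40-layer, d_model=5120 configuration is Vicuna-13B's; the helper names are made up.

```python
def visual_fraction(num_images, tokens_per_image=576, text_tokens=50):
    """Share of LLM input positions occupied by vision tokens."""
    visual = num_images * tokens_per_image
    return visual / (visual + text_tokens)


def kv_cache_bytes(seq_len, num_layers, d_model, bytes_per_value=2):
    """Keys plus values for every layer and position (no grouped-query sharing)."""
    return 2 * num_layers * seq_len * d_model * bytes_per_value


one_image = visual_fraction(1)
four_images = visual_fraction(4)
cache = kv_cache_bytes(seq_len=626, num_layers=40, d_model=5120)
print(f"{one_image:.1%}  {four_images:.1%}  {cache / 1e6:.0f} MB")
```

Under these assumptions a single image plus a short prompt already takes about half a gigabyte of KV cache before any output token is generated.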

Inference Timing Breakdown

Where does the time go when a VLM answers a question about an image?

Stage	Time	What Happens
Image preprocessing	~5 ms	Resize to 336×336, normalize with CLIP's per-channel mean and std
CLIP forward pass	~15 ms	24 transformer layers on 576 tokens (304M params)
MLP projection	~0.1 ms	Two linear layers, 21M params — trivial cost
LLM prefill	~30 ms	Process all 626 tokens through 40 layers in parallel
LLM decode	~2,000 ms	~20 ms/token × 100 tokens of output (sequential!)
Total	~2.05 s	Autoregressive decode dominates by 40×

The bottleneck is decoding. Vision encoding and projection together are ~15 ms. The LLM's autoregressive token generation is ~2 seconds. This is why VLM speedups mostly target the LLM side (speculative decoding, quantization) rather than the vision encoder.
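The table's numbers can be summed to see how lopsided the breakdown is. The timings below are copied from the table and should be read as illustrative estimates, not measurements.

```python
# Per-stage latency in milliseconds, taken from the table above.
stage_ms = {
    "preprocess": 5.0,
    "clip_forward": 15.0,
    "projection": 0.1,
    "llm_prefill": 30.0,
    "llm_decode": 20.0 * 100,  # ~20 ms/token x 100 output tokens
}

total_ms = sum(stage_ms.values())
decode_ms = stage_ms["llm_decode"]
decode_share = decode_ms / total_ms
decode_vs_rest = decode_ms / (total_ms - decode_ms)

print(f"total: {total_ms / 1000:.2f} s")
print(f"decode share: {decode_share:.1%}")
print(f"decode vs everything else: {decode_vs_rest:.0f}x")
```

Because decode time grows linearly with output length, halving the response length would cut total latency almost in half, while making the vision encoder infinitely fast would save well under 1%.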

Check: How many new components does LLaVA add to connect vision and language?

• A whole new transformer
• Just a small MLP projection layer
• Cross-attention at every layer

Chapter 6: The Training Pipeline

VLMs are trained in stages, not all at once. Each stage has a different purpose and different parts of the model are frozen or unfrozen:

Stage	Data	What Trains	Purpose
1. Pretraining alignment	595K image-caption pairs	Projection only	Align vision ↔ language spaces
2. Instruction tuning	158K visual conversations	Projection + LLM	Teach instruction following

Parameter Budget: What Actually Trains?

Component	Params	Stage 1	Stage 2
CLIP ViT-L/14	304M	Frozen	Frozen
MLP Projector	~21M	Training	Training
Vicuna-13B	13,000M	Frozen	Training
Stage 1 trainable	21M / 13,325M = 0.16% of total
Stage 2 trainable	13,021M / 13,325M = 97.7% of total

Stage 1 is fast: Only the projection layer trains (0.16% of parameters) — a few hours on 8 A100 GPUs. Stage 2 unfreezes the LLM and is the expensive part: ~20 hours on 8 A100s. The vision encoder stays frozen in both stages, preserving CLIP's learned visual representations.
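The trainable fractions in the table follow from which components are unfrozen in each stage. A minimal sketch using the table's rounded parameter counts (in millions); the dictionary keys are illustrative names, not identifiers from any codebase.

```python
# Parameter counts in millions, rounded as in the table above.
PARAMS_M = {"clip_vit_l14": 304, "mlp_projector": 21, "vicuna_13b": 13_000}

# Which components receive gradient updates in each stage.
TRAINABLE = {
    "stage1": {"mlp_projector"},
    "stage2": {"mlp_projector", "vicuna_13b"},
}


def trainable_fraction(stage):
    """Fraction of all parameters that are updated in the given stage."""
    total = sum(PARAMS_M.values())
    trainable = sum(PARAMS_M[name] for name in TRAINABLE[stage])
    return trainable / total


for stage in ("stage1", "stage2"):
    print(f"{stage}: {trainable_fraction(stage):.2%}")
```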

Training Stage Visualization

Toggle stages to see which components are frozen (blue/frozen) vs trainable (green/active).

Scaling insight: LLaVA-1.5 improved by upgrading to a 2-layer MLP projector and adding academic VQA datasets in Stage 2. These small changes gave major accuracy boosts — showing that data quality and projection design matter more than model size.

Compute Budget Breakdown

Stage	Trainable Params	Data	Wall-Clock Time (8×A100)	Cost (~$2/GPU-hr)
Stage 1: Alignment	21M (0.16%)	595K image-caption pairs	~4 hours	~$64
Stage 2: Instruction	13,021M (97.7%)	158K visual conversations	~20 hours	~$320
Total	~$384 to train a competitive VLM (given pretrained CLIP + LLM)

This is remarkably cheap. The pretrained CLIP ViT-L and Vicuna-13B represent a far larger investment, on the order of hundreds of thousands of GPU-hours of pretraining. LLaVA's contribution is just the $384 bridge between them. This is why the "connect pretrained specialists" paradigm dominates VLM design: training vision and language from scratch would cost roughly 1000× more.
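The dollar figures in the table are wall-clock hours multiplied by the number of GPUs and an hourly rate. The sketch below reproduces them, treating the ~$2 per GPU-hour rate and the hour counts as the table's rough estimates.

```python
def training_cost(wall_clock_hours, num_gpus=8, dollars_per_gpu_hour=2.0):
    """Cost = wall-clock hours x GPUs x price per GPU-hour."""
    return wall_clock_hours * num_gpus * dollars_per_gpu_hour


stage1 = training_cost(4)    # alignment: projection only
stage2 = training_cost(20)   # instruction tuning: projection + LLM
print(stage1, stage2, stage1 + stage2)  # 64.0 320.0 384.0
```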

Check: In Stage 1 of LLaVA training, what is trainable?

• Only the projection layer
• The entire model end-to-end
• Only the vision encoder

💥 Break-It Lab What Dies When You Remove VLM Components?

A working VLM has three critical design decisions: freeze the vision encoder, include a projection layer, and apply instruction tuning after alignment. Toggle each off to see what breaks.

Unfreeze Vision Encoder OFF

Catastrophic forgetting: CLIP learned to align images with text from roughly 400M image-text pairs. If you unfreeze it during VLM fine-tuning with only 595K examples, gradient updates can overwrite these learned representations. The encoder "forgets" how to encode images meaningfully. Loss initially decreases (overfitting to small data) then generalization collapses. The encoder becomes specialized for captioning but loses zero-shot generality.

Remove Projection Layer OFF

Dimension mismatch crash: CLIP outputs 1024-dim vectors. The LLM expects 4096-dim inputs. Without the projection, you get a literal shape error — the tensors can't be concatenated. Even if you pad with zeros to match dimensions, the vision features live in a completely unrelated vector space. The LLM would interpret them as noise — like feeding random embeddings. No alignment, no understanding.

Skip Instruction Tuning OFF

Caption-only responses: After Stage 1 (alignment only), the model can describe images but can't follow instructions. Ask "How many people?" and it responds "A group of people standing in a park" — a caption, not an answer. It hasn't learned the instruction-following format. The projection is aligned, but the LLM hasn't adapted its behavior to use visual context for diverse tasks.

Chapter 7: Grounding & Spatial Understanding

Grounding means connecting words to specific regions in the image. When a VLM says "the red car on the left," grounding means it can also point to where that car is. This requires the model to output spatial coordinates, not just text.

Approaches include: outputting bounding box coordinates as text tokens (e.g., "[0.2, 0.3, 0.5, 0.7]"), using special location tokens, or predicting segmentation masks. Models like Kosmos-2 and Shikra showed VLMs can be trained to both describe and locate objects.

Three Levels of Spatial Understanding

Level	Output	Example Task	Difficulty
Pointing	Single (x, y) coordinate	"Point to the cat's nose"	Easiest — one point
Bounding box	(x1, y1, x2, y2)	"Draw a box around each person"	Medium — 4 numbers per object
Segmentation	Pixel-level mask	"Segment the dog from background"	Hardest — needs a mask decoder

Visual Grounding

Click on regions to see how a VLM grounds language to spatial locations. Each colored box is a detected object with its label.

Coordinate formats: Some models normalize coordinates to [0, 1000] and emit them as text tokens. Others use special <box> tokens. The trend is toward treating coordinates as just another kind of language output — elegant and requires no architectural changes.

Grounding as Text: How It Works in Practice

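In coordinate-as-text models (e.g., Shikra, Qwen-VL), the model literally generates bounding boxes as number tokens, whereas Kosmos-2 adds dedicated location tokens to its vocabulary. A coordinate-as-text output looks like this: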

example output
# Input: "Where is the dog in this image?"
# Output:
The dog is on the left side of the image.
<box>102, 384, 298, 571</box>

# Coordinates: [x1, y1, x2, y2] normalized to [0, 1000]
# Real position: top-left (10.2%, 38.4%) to bottom-right (29.8%, 57.1%)

Why this is elegant: No architecture change. The model just needs to learn that certain number tokens represent coordinates. The vocabulary already contains all digits 0-9. The model learns spatial reasoning through the same autoregressive training as language generation. Qwen2.5-VL extends this to point, box, and polygon coordinates for different granularities of grounding.
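On the consuming side, coordinates emitted as text have to be parsed back into pixel positions. The sketch below assumes the `<box>x1, y1, x2, y2</box>` format and the [0, 1000] normalization from the example output above; real models differ in their exact tag and number formats, and `parse_boxes` is a hypothetical helper.

```python
import re

# Matches the illustrative <box>x1, y1, x2, y2</box> format used in this lesson.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_boxes(text, image_width, image_height, scale=1000):
    """Extract boxes from model output and rescale them to pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(v) for v in match.groups())
        boxes.append((
            x1 * image_width / scale,
            y1 * image_height / scale,
            x2 * image_width / scale,
            y2 * image_height / scale,
        ))
    return boxes


output = "The dog is on the left.\n<box>102, 384, 298, 571</box>"
boxes = parse_boxes(output, image_width=1000, image_height=500)
print(boxes)  # [(102.0, 192.0, 298.0, 285.5)]
```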

Check: What does "grounding" mean in VLMs?

• Training the model longer
• Reducing hallucinations
• Connecting words to specific spatial regions in the image

Chapter 8: Document Understanding

Documents are a special challenge for VLMs: they contain dense text, tables, charts, and layouts where spatial arrangement matters. "Revenue" next to "$5M" means something different from "Revenue" in a section header.

Key innovations: high-resolution encoding (documents need more pixels than photos), OCR-free reading (the vision encoder learns to read text directly), and layout-aware attention (understanding that rows and columns create relationships).

The Resolution Bottleneck for Documents

A standard letter-size page at 300 DPI is 2,550 × 3,300 pixels. A VLM at 336×336 resolution shrinks this by roughly 8-10× in each dimension, so a 12-point font (about 50 pixels tall at 300 DPI) ends up around 5 pixels tall, with letter strokes thinner than a pixel. Effectively unreadable. This is why document understanding requires special resolution handling:

Approach	How It Handles High-Res	Token Cost
Resize to 336×336	Shrink entire page (text unreadable)	576 tokens
AnyRes tiling	Split into 4-12 tiles at 336×336 each	2,304-6,912 tokens
Dynamic cropping	Process only regions of interest at high-res	Variable (576-2,304)
Token pruning	Encode high-res but drop redundant tokens	~1,000 (compressed from 5K+)
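The numbers behind this table are easy to recompute. The sketch below estimates the nominal height of a font after the page is squeezed to 336 pixels on its long side, and the token cost of tiling. "Nominal height" means the full point size (one point is 1/72 inch); the visible strokes of lowercase letters are smaller than that. Helper names are illustrative.

```python
def font_height_px(point_size, dpi):
    """Nominal font height in pixels: one point is 1/72 inch."""
    return point_size * dpi / 72


def downscaled_height(height_px, page_long_side_px, target_px=336):
    """Height of a feature after the page's long side is resized to target_px."""
    return height_px * target_px / page_long_side_px


def tiled_tokens(num_tiles, tokens_per_tile=576):
    """Vision tokens when each 336x336 tile is encoded separately."""
    return num_tiles * tokens_per_tile


original = font_height_px(12, 300)             # 12 pt text scanned at 300 DPI
shrunk = downscaled_height(original, 3300)     # page long side is 3,300 px
print(original, round(shrunk, 1))              # 50.0 5.1
print(tiled_tokens(4), tiled_tokens(12))       # 2304 6912
```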

The OCR question: Should a VLM read text via its vision encoder (OCR-free) or rely on an external OCR module? OCR-free is simpler (one model does everything) but requires high resolution. External OCR is more accurate for small text but adds a pipeline stage and loses layout information. The trend is toward OCR-free: larger ViTs with higher resolution are closing the accuracy gap.

Document Layout Parsing

A VLM must understand that spatial layout encodes meaning. Watch how different regions are classified.

Resolution matters: A typical VLM at 336×336 can't read small text. Document VLMs like UReader and TextMonkey use dynamic resolution — cropping the image into tiles and encoding each at high resolution, then stitching the features back together.

Model	Approach	Strength
DocOwl	Layout-aware pretraining	Tables, forms
TextMonkey	High-res with token pruning	Dense text
Nougat	OCR-free academic PDF reading	Equations, LaTeX
GPT-4V / Gemini	Native multi-resolution	General documents

What Breaks: Common VLM Failure Modes

VLMs are impressive, but they fail in predictable ways. Understanding these failures tells you a lot about the architecture:

Failure Mode	Why It Happens	Example
Small object blindness	14×14 patches are too coarse. A small object occupying one patch gets just one token of representation.	"How many buttons on the shirt?" → wrong count
Hallucination	The LLM's language prior overrides visual evidence. It describes objects that are statistically likely but not present.	Model describes a "vase of flowers" on a table that has none
Spatial confusion	Patch position embeddings encode rough grid position but not precise spatial relationships.	"Is the cup to the left or right of the plate?" → wrong answer
Video token explosion	Naive approach: N frames × 576 tokens = massive sequence. 10 frames = 5,760 tokens.	2-second video at 5fps overwhelms context window
Long document overflow	OCR of a full page → thousands of text tokens on top of visual tokens.	10-page PDF exceeds context even at low resolution

Why hallucination persists: During training, the LLM learned strong priors about what co-occurs (tables often have flowers). The vision signal must be strong enough to override these priors. When the visual evidence is ambiguous (blurry, small, off-center), language priors win — and the model confidently describes things it doesn't actually see.

Check: Why is document understanding harder than photo understanding for VLMs?

• Documents have dense text and spatial layout that carries meaning
• Documents are always black and white
• Documents don't have objects to detect

Chapter 9: Frontier VLMs

The field is evolving rapidly. Today's frontier models handle video, multi-image reasoning, interleaved image-text, and even generate images alongside text. Here's where things stand:

Open-Weight Models

Model	Key Innovation	Scale	Strengths
LLaVA-NeXT	AnyRes dynamic resolution, SGLang-optimized	7B-110B	Strong general VQA
InternVL 2.5	Dynamic tiling, multi-scale ViT	1B-78B	OCR, documents, multilingual
Qwen2.5-VL	NaViT-style packing, native video	3B-72B	Video, grounding, agentic tasks
Molmo	Pointing as text coordinates, high-quality data	1B-72B	Spatial grounding, UI understanding

Frontier Closed Models

Model	Key Innovation	Notable Capability
GPT-4o	Native multimodal from pretraining	Omni: text + image + audio + video in one model
Gemini 2.0	2M token context, native video understanding	Long video reasoning, interleaved generation
Claude Opus 4	Strong spatial reasoning, chart understanding	Complex document analysis, precise counting

The Resolution Revolution

The biggest architectural shift in 2024-2025 has been dynamic resolution. Instead of resizing every image to 336×336, modern VLMs handle arbitrary aspect ratios and resolutions:

How AnyRes works (LLaVA-NeXT): Slice the image into tiles that fit the ViT's expected resolution. A 1344×672 image becomes 4×2 = 8 tiles of 336×336, plus one low-res global view. Total: 9 × 576 = 5,184 visual tokens. Expensive, but now the model can read fine print and see small objects.
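The tile arithmetic above can be reproduced with a short helper. This is a simplified sketch: it just ceil-divides the image into 336-pixel tiles and adds one global view, whereas a real implementation typically picks from a fixed set of grid shapes and resizes or pads the image to match. The function name `anyres_tokens` is hypothetical.

```python
import math

def anyres_tokens(width, height, tile=336, patch=14, global_view=True):
    """Count tiles and visual tokens for a simple AnyRes-style tiling."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    tokens_per_tile = (tile // patch) ** 2      # 24 x 24 = 576
    n_views = cols * rows + (1 if global_view else 0)
    return cols, rows, n_views * tokens_per_tile

print(anyres_tokens(1344, 672))  # (4, 2, 5184)
```

The token count grows linearly with the number of tiles, so doubling both image dimensions roughly quadruples the visual sequence length.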

The Resolution Token Table

This table shows why resolution management is the defining engineering challenge for production VLMs:

Input Resolution	Patches (14px)	Visual Tokens	Attn Cost (relative)	Can Read?
224 × 224	16 × 16	256	1×	No (too blurry)
336 × 336	24 × 24	576	5×	Large text only
672 × 672	48 × 48	2,304	81×	Most text
1344 × 1344	96 × 96	9,216	1,296×	Fine print
AnyRes (9 tiles)	9 × 24 × 24	5,184	410×	Yes (efficient)
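The numbers in this table follow from two simple formulas, sketched below. The relative cost column appears to assume that self-attention cost grows with the square of the token count, normalized to the 224×224 row; that reading is inferred from the figures rather than stated in the text. The helper names are hypothetical.

```python
def vit_tokens(side_px, patch_px=14):
    """Patch grid size and token count for a square image."""
    grid = side_px // patch_px
    return grid, grid * grid

def relative_attention_cost(tokens, baseline_tokens=256):
    """Quadratic attention cost relative to the 224x224 baseline."""
    return (tokens / baseline_tokens) ** 2

for side in (224, 336, 672, 1344):
    grid, tokens = vit_tokens(side)
    cost = relative_attention_cost(tokens)
    print(f"{side} x {side}: {grid} x {grid} patches, {tokens} tokens, {cost:.0f}x")

# AnyRes with 9 views of 576 tokens each
anyres = 9 * 576
print(f"AnyRes: {anyres} tokens, {relative_attention_cost(anyres):.0f}x")
```

The quadratic term is why the jump from 672 to 1344 pixels per side costs 16× more attention compute for only 4× more tokens.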

Qwen2.5-VL's approach: Instead of fixed tiles, Qwen2.5-VL uses NaViT-style "packing": images are encoded at (or close to) their native resolution and aspect ratio, and the resulting variable-length patch sequences are packed into a single batch without wasting compute on padding. The visual token count therefore scales with the actual image size, so a small icon costs few tokens while a 4K image gets the detail it needs.

VLM Capability Radar

Compare different VLM generations across key capabilities. Each axis represents a different skill.

The trend: Early VLMs bolted vision onto language. Frontier models are trained multimodally from the start, so vision is native rather than an add-on. The expected payoff is tighter coupling between visual evidence and generated text: richer cross-modal reasoning and fewer hallucinations.

The Video Challenge

Extending VLMs to video is the next frontier, but it creates massive engineering challenges:

Input	Frames	Visual Tokens	With Text	Challenge
Single image	1	576	~626	Manageable
4 keyframes	4	2,304	~2,354	Needs longer context
1 FPS × 30s video	30	17,280	~17,330	Exceeds most LLM context windows
30 FPS × 10s video	300	172,800	~172,850	Impossible without compression
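The table rows can be regenerated with a small calculator. This sketch assumes 576 visual tokens per frame and about 50 text tokens, as in the table; the optional `pool` argument is an illustrative addition that models spatial downsampling by an integer factor per axis, and the function name `video_token_count` is hypothetical.

```python
def video_token_count(fps, seconds, tokens_per_frame=576, text_tokens=50, pool=1):
    """Return (frames, visual_tokens, total_tokens) for a sampled video clip."""
    frames = int(round(fps * seconds))
    per_frame = tokens_per_frame // (pool * pool)   # spatial pooling per axis
    visual = frames * per_frame
    return frames, visual, visual + text_tokens

print(video_token_count(1, 30))           # (30, 17280, 17330)
print(video_token_count(30, 10))          # (300, 172800, 172850)
print(video_token_count(1, 30, pool=2))   # (30, 4320, 4370)
```

Frame rate and per-frame token count multiply, so reducing either one by 4× has the same effect on sequence length.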

Solutions being explored: Frame sampling (use 1 FPS not 30), temporal token merging (average similar adjacent frames), spatial downsampling (fewer tokens per frame), and learned temporal compression (a small network that compresses T frames into K tokens, like a temporal Perceiver). Gemini 2.0's 2M token context helps but doesn't fully solve the 30 FPS problem.
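Temporal token merging can be sketched in a few lines. This is a deliberately simplified version: it assumes each frame has already been reduced to a single pooled feature vector (a real system would merge per-patch tokens), and it greedily averages consecutive frames whose cosine similarity to the running group mean is at least a threshold. The 0.95 threshold and the function names are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_similar_frames(frames, threshold=0.95):
    """Average runs of consecutive frames that look alike."""
    groups = []
    for feat in frames:
        if groups and cosine(mean_vector(groups[-1]), feat) >= threshold:
            groups[-1].append(feat)      # near-duplicate: join the current run
        else:
            groups.append([feat])        # scene changed: start a new run
    return [mean_vector(g) for g in groups]

# Two nearly static frames, then a cut to a different scene.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
print(len(merge_similar_frames(frames)))  # 2
```

Static shots collapse to a single entry while cuts and fast motion keep their own, so the compression ratio adapts to the content rather than being fixed in advance.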

Connections

VLMs connect to many other topics:

• CLIP & Contrastive Learning — The vision encoder that makes VLMs possible. CLIP pretraining aligns images and text before the VLM is even built.

• VLAs (Vision-Language-Action) — VLMs extended with action outputs for robotics. Same architecture, but the output includes motor commands.

• Transformers — Both the vision encoder (ViT) and the language model (LLaMA/Vicuna) are transformers. Understanding self-attention is key.

• Diffusion Models — Some VLMs (Emu, DALL-E 3) can generate images too, using diffusion decoders alongside language generation.

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective."

— Rich Sutton, The Bitter Lesson

You now understand how machines learned to see and speak. The fusion of vision and language is one of the most consequential advances in AI history.

🏗 Design Challenge You're the Architect: 4K Images + 100K Text in 24GB VRAM

A medical imaging startup needs a VLM that processes full 4K pathology slides (3840×2160 pixels) alongside 100K tokens of patient history. Inference must run on a single RTX 4090 (24GB VRAM). The current naive approach (AnyRes tiling at 336px) produces 46,000+ visual tokens — impossible to fit with a 7B LLM.

Constraint	Value
Input image	3840 × 2160 px (4K pathology slide)
Text context	100K tokens of patient records
VRAM budget	24 GB (RTX 4090)
Latency target	< 30 seconds for full response
Accuracy requirement	Must not miss small lesions (1-2% of image area)

1. How do you handle the 4K resolution? (Naive tiling = 46K tokens. You need <4K total visual tokens to fit in VRAM. But you can't just downsample — small lesions vanish.)

2. How do you handle 100K text tokens alongside visual tokens? (Standard 8K context LLMs can't fit both. Do you compress text, use a long-context model, or something else?)

3. What's your token compression strategy? (Perceiver resampling, spatial pooling, saliency-based cropping, hierarchical encoding?)

4. How do you quantize/optimize to fit in 24GB with both the vision encoder and LLM loaded?
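The token numbers behind question 1 can be sanity-checked with the sketch below. It assumes plain ceil-division into 336-pixel tiles at 576 tokens each, which pads the slide to whole tiles and lands at roughly 48K tokens, in line with the "46,000+" figure quoted above. It also shows how many high-res tiles fit under a 4,000-token budget once one global view is reserved. The function names are hypothetical.

```python
import math

TOKENS_PER_TILE = (336 // 14) ** 2   # 24 x 24 patches = 576 tokens

def naive_tiling_tokens(width, height, tile=336):
    """Tiles and visual tokens if the whole image is tiled at full resolution."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    n_tiles = cols * rows
    return n_tiles, n_tiles * TOKENS_PER_TILE

def max_highres_tiles(token_budget, global_views=1):
    """High-res tiles that fit after reserving tokens for the global view(s)."""
    remaining = token_budget - global_views * TOKENS_PER_TILE
    return max(0, remaining // TOKENS_PER_TILE)

print(naive_tiling_tokens(3840, 2160))  # (84, 48384)
print(max_highres_tiles(4000))          # 5
```

Under these assumptions, only about 5 of 84 tiles can be kept at full resolution, so the design has to choose which regions deserve them or compress tokens after encoding.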

One workable solution (combining ideas seen in InternVL 2.5 and Qwen2.5-VL):

1. Hierarchical tiling with saliency: Encode a low-res global view (576 tokens) plus high-res crops only where a saliency detector flags detail (lesion candidates). A lightweight CNN identifies suspicious regions. Only 4-6 high-res tiles get encoded, giving roughly 2,900-4,000 visual tokens in total (576 for the global view plus 576 per tile) before any token merging.

2. Long-context handling: Qwen2.5-72B supports 128K tokens, but a 72B model will not fit in 24GB. Instead, use a smaller model (7B) with its context extended to 32K via RoPE scaling, and summarize the patient history into about 2K tokens using a separate text-only pass. The visual tokens plus the summary then fit easily.

3. Token merging: After the ViT encodes each tile, apply token merging in the style of ToMe: tokens whose cosine similarity exceeds a threshold (for example 0.95) get averaged into one. Pathology backgrounds are highly redundant, so 2-3x compression is plausible with little loss on diagnostic tasks, though the actual quality impact has to be validated on the target data.

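4. Memory budget: 7B model in AWQ 4-bit = 3.5GB. CLIP ViT-L in FP16 = 0.6GB. KV cache for 4K tokens at 4-bit in a 7B model: allow ≈ 1.5GB (a conservative figure; a LLaMA-7B-style cache with 32 layers and 4096 dimensions works out to roughly 0.5GB at 4-bit). Activations ≈ 2GB peak. Total ≈ 8GB, well within 24GB with room for batch size >1.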
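The memory figures can be rebuilt from first principles, as in the sketch below. The weight sizes follow directly from parameter count times bits per parameter. The KV-cache estimate assumes a LLaMA-7B-style shape (32 layers, 4096-dimensional keys and values, no grouped-query attention), which is an assumption rather than something the text specifies; under it the cache comes out below the 1.5GB allowance. The 2GB activation figure is taken from the text as-is, and quantization overheads such as scales and zero points are ignored.

```python
def weights_gb(n_params, bits_per_param):
    """Size of model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def kv_cache_gb(tokens, layers, kv_dim, bits):
    """KV cache size: one key and one value vector per layer per token."""
    return 2 * layers * kv_dim * tokens * bits / 8 / 1e9

llm = weights_gb(7e9, 4)                 # 7B LLM at 4-bit
vit = weights_gb(304e6, 16)              # CLIP ViT-L (304M params) in FP16
kv = kv_cache_gb(4096, 32, 4096, 4)      # assumed LLaMA-7B-style shape
activations = 2.0                        # peak estimate quoted in the text

total = llm + vit + kv + activations
print(f"LLM {llm:.2f} + ViT {vit:.2f} + KV {kv:.2f} + act {activations:.2f} "
      f"= {total:.2f} GB")
```

Either way the total stays in single digits, which leaves most of the 24GB card free for longer contexts, larger batches, or a less aggressive quantization scheme.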

Check: What distinguishes frontier VLMs from early ones?

• They use smaller models
• They only handle text
• Vision is native rather than bolted on, enabling richer multimodal reasoning

🔗 Pattern Recognition

VLM → VLA: Adding Actions to Vision-Language

This Lesson (VLM)

[Image tokens] + [Text tokens] → LLM → [Text response]
The model sees and speaks.

VLA (Vision-Language-Action)

[Image tokens] + [Text instruction] → LLM → [Action tokens]
Same architecture, but the output is motor commands.

The structural insight: a VLA is literally a VLM where the output vocabulary includes discretized robot actions (joint angles, gripper commands). The same projection bridge, same fusion mechanism, same autoregressive generation — just targeting a different output space. RT-2 proved this by fine-tuning a VLM to output action tokens with zero architectural changes.
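"Discretized robot actions" can be illustrated with a minimal sketch: each continuous action dimension is clipped to a range, mapped to one of a fixed number of bins, and written out as an integer the language model can emit like any other token. The range of [-1, 1], the 256 bins, and the function names are assumptions chosen for illustration, not a description of any specific robot stack.

```python
def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action value to the nearest integer bin index."""
    tokens = []
    for value in action:
        clipped = min(high, max(low, value))
        index = int((clipped - low) / (high - low) * (bins - 1) + 0.5)
        tokens.append(index)
    return tokens

def tokens_to_action(tokens, low=-1.0, high=1.0, bins=256):
    """Map bin indices back to the center value each one represents."""
    return [low + t / (bins - 1) * (high - low) for t in tokens]

tokens = action_to_tokens([-1.0, 0.0, 1.0])
print(tokens)                             # [0, 128, 255]
print(" ".join(str(t) for t in tokens))   # text the LLM would be trained to emit
```

The round trip loses at most half a bin width per dimension, so the bin count sets the precision of the motor commands the model can express.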

What other domains could you target by simply changing what the LLM generates? (Hint: think music notation, CAD coordinates, chemical structures...)
