The Complete Beginner's Path

Understand Vision-Language Models

How modern AI learned to see and speak at the same time — fusing pixels and words into a single reasoning engine that can describe images, answer questions about them, and reason over visual content.

Prerequisites: Basic transformer intuition + Curiosity about multimodal AI. That's it.
10 chapters · 8+ simulations · 0 jargon assumed

Chapter 0: Two Modalities, One Model

Humans effortlessly combine what they see with what they read. You look at a chart and describe trends. You glance at a photo and answer questions about it. For decades, vision and language lived in separate AI systems. A Vision-Language Model (VLM) merges them into a single architecture that processes images and text together.

The core challenge: images are grids of pixels (spatial, continuous) while text is a sequence of tokens (discrete, symbolic). A VLM must bridge these two very different representations and allow them to interact so the model can reason about visual content using language.

The big picture: A VLM = a vision encoder (eyes) + a language model (brain) + a bridge that translates between them. The magic is in how the bridge aligns pixel features with word meanings.
Two Modalities Merging

Watch image patches and text tokens flow into a shared representation space. The teal nodes are visual, orange nodes are textual.

Check: What is the core challenge of building a VLM?

Chapter 1: Vision Encoders — Teaching AI to See

The vision encoder converts a raw image into a sequence of feature vectors. The dominant approach is ViT (Vision Transformer): chop the image into fixed-size patches (e.g., 14×14 pixels), flatten each patch into a vector, add positional embeddings, and run them through a transformer.

After encoding, an image becomes a grid of high-dimensional vectors — essentially the same shape as a sequence of text tokens. This structural similarity is what makes vision-language fusion possible.

patches = split(image, patch_size)  →  z = ViT(patches + pos_embed)
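The pseudocode above can be fleshed out. Here is a minimal NumPy sketch of patchification (the function name and shapes are illustrative, not from a real ViT implementation):

```python
# Patchification sketch: split an (H, W, C) image into flattened patch
# vectors, one row per patch. Each row becomes one "token" for the ViT.
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    n_h, n_w = H // patch_size, W // patch_size
    patches = (image
               .reshape(n_h, patch_size, n_w, patch_size, C)
               .transpose(0, 2, 1, 3, 4)   # group pixels by patch
               .reshape(n_h * n_w, patch_size * patch_size * C))
    return patches

tokens = patchify(np.zeros((224, 224, 3)), patch_size=14)
print(tokens.shape)  # (256, 588): 16x16 patches, each 14*14*3 values
```

At 336×336 the same patch size yields 24×24 = 576 tokens, which is exactly the token count LLaVA feeds its language model.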
Image Patchification

See how an image is sliced into patches. Each patch becomes a token for the transformer. Adjust patch size to see the trade-off between resolution and sequence length.

Patch size: 4 → 16 patches → 16 tokens
Key trade-off: Smaller patches = more tokens = more detail but quadratic attention cost. Larger patches = fewer tokens = faster but lose fine detail. Most VLMs use 14×14 patches with a 224×224 or 336×336 image.
Encoder | Used By | Patch Size
CLIP ViT-L/14 | LLaVA, many VLMs | 14×14
SigLIP | PaliGemma, newer VLMs | 14×14
EVA-CLIP | InternVL | 14×14
DINOv2 | Research models | 14×14
Check: What does a Vision Transformer do to an image first?

Chapter 2: The Projection Bridge

The vision encoder outputs feature vectors in its embedding space. The language model expects tokens in its embedding space. These spaces don't match! We need a projection layer to translate between them.

The simplest bridge is a linear projection (a single matrix multiply). More sophisticated ones use an MLP (two linear layers with an activation) or a cross-attention resampler (like Flamingo's Perceiver). The projection doesn't just resize — it aligns visual features with the language model's semantic space.

Vision Encoder: image → [v1, v2, ..., vN] in R^(d_vis)
Projection W: h_i = W · v_i   (or MLP(v_i))
Language Model: [h1, ..., hN, tok1, tok2, ...] in R^(d_llm)
Why not just concatenate? Vision features live in a completely different vector space than word embeddings. Without projection, the language model sees garbage. The projection learns to map "this patch has a furry texture" into the same region of embedding space as the word "cat."
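Both bridge variants are tiny in code. A NumPy sketch with illustrative dimensions (1024 → 4096 is a common pairing, but the exact sizes depend on the encoder and LLM chosen):

```python
# Projection bridge sketch: map vision features (d_vis) into the LLM's
# embedding space (d_llm). All weights here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_llm, n_patches = 1024, 4096, 576
v = rng.standard_normal((n_patches, d_vis))   # vision encoder output

# Simplest bridge: a single linear projection.
W = rng.standard_normal((d_vis, d_llm)) * 0.02
h_linear = v @ W                              # (576, 4096)

# Two-layer MLP bridge (LLaVA-1.5 style): Linear -> GELU -> Linear.
def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

W1 = rng.standard_normal((d_vis, d_llm)) * 0.02
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02
h_mlp = gelu(v @ W1) @ W2                     # (576, 4096)

print(h_linear.shape, h_mlp.shape)
```

Either way, the output rows live in the LLM's embedding dimension and can sit in the same sequence as word embeddings.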
Projection Alignment

Points in vision space get projected to language space. Drag the slider to rotate the projection and see alignment change.

Projection angle: 45°
Check: What does the projection layer do?

Chapter 3: Fusion Strategies

Once vision tokens and text tokens are in the same space, how do we let them interact? There are several architectural strategies, each with different trade-offs:

Strategy | How It Works | Example
Early Fusion | Concatenate vision + text tokens, feed to one transformer | LLaVA
Cross-Attention | Text attends to vision via extra cross-attention layers | Flamingo
Perceiver Resampler | Learned queries compress vision tokens to fixed count | Flamingo, Qwen-VL
Interleaved | Vision tokens inserted at corresponding text positions | Fuyu, Gemini
Fusion Architecture Comparison

Toggle between strategies to see how vision (teal) and text (orange) tokens interact.

Early fusion is simplest: Just prepend vision tokens to the text sequence and let self-attention handle everything. The LLM's existing attention mechanism naturally learns vision-language interactions. This is why LLaVA works so well with so little architectural change.
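In code, early fusion really is just concatenation (shapes below are illustrative):

```python
# Early fusion sketch: prepend projected vision tokens to embedded text
# tokens and hand the LLM one combined sequence. Self-attention over this
# sequence is where vision-language interaction happens.
import numpy as np

d_llm = 4096
vision_tokens = np.zeros((576, d_llm))   # projected image patches
text_tokens = np.zeros((32, d_llm))      # embedded prompt tokens

sequence = np.concatenate([vision_tokens, text_tokens], axis=0)
print(sequence.shape)  # (608, 4096): one sequence, two modalities
```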
Check: In early fusion, how do vision and text tokens interact?

Chapter 4: Visual Instruction Tuning

Having the right architecture isn't enough — the model needs to learn when and how to use visual information. Visual instruction tuning trains the model on (image, instruction, response) triples so it can follow natural language commands about images.

The key innovation: instead of training on just captions ("A cat on a sofa"), you train on diverse tasks: "What color is the cat?", "Count the cushions", "Write a poem about this scene", "Is this image safe for children?" This teaches the model to be a general-purpose visual assistant.
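A single training sample might look like the following. This is a hypothetical example in a conversation layout similar in spirit to LLaVA's released data; the field names and placeholder convention here are illustrative:

```python
# One (image, instruction, response) training triple, stored as a
# conversation. The "<image>" placeholder marks where the projected
# vision tokens get spliced into the token sequence.
sample = {
    "image": "photo_001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nHow many people are in this photo?"},
        {"from": "gpt", "value": "There are three people in this photo."},
    ],
}
print("<image>" in sample["conversations"][0]["value"])  # True
```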

Input: [Image] + "How many people are in this photo?"
VLM processes: vision encoder → projection → LLM forward pass
Output: "There are three people in this photo, standing near..."
Data generation trick: LLaVA generated its instruction-tuning data using GPT-4! They fed image captions + bounding boxes to GPT-4 and asked it to generate question-answer pairs. This "language-only teacher" approach created 158K high-quality training samples.
Instruction Diversity

See the variety of tasks a VLM must handle. Click to cycle through instruction types.

Check: What makes visual instruction tuning different from captioning?

Chapter 5: The LLaVA Architecture

LLaVA (Large Language and Vision Assistant) showed that a surprisingly simple recipe works remarkably well: take a pretrained CLIP vision encoder, a pretrained LLM (like Vicuna/LLaMA), and connect them with a single linear projection. That's it.

The image goes through CLIP ViT-L/14 → produces 576 vision tokens (a 24×24 grid from the penultimate layer) → each is linearly projected to the LLM's dimension → prepended to the text tokens → the LLM generates the response autoregressively.

CLIP ViT-L/14: 336×336 image → 576 visual tokens (d=1024)
MLP Projector: Linear(1024, 5120) + GELU + Linear(5120, 5120) (the LLaVA-1.5 design; the original LLaVA used a single linear projection)
Vicuna-13B: [576 visual tokens] + [text tokens] → autoregressive output
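The whole forward pass can be followed as a shape walkthrough. This pure-NumPy sketch only tracks tensor shapes (zero weights, ReLU standing in for GELU); the real model uses trained transformer weights:

```python
# LLaVA forward-pass shape walkthrough (shapes only, not real weights).
import numpy as np

# 336x336 image with 14x14 patches -> 24x24 = 576 visual tokens of dim 1024
n_vis, d_vis = (336 // 14) ** 2, 1024
v = np.zeros((n_vis, d_vis))

# MLP projector into the LLM dimension (5120 is Vicuna/LLaMA-13B's hidden
# size; illustrative here). ReLU stands in for the GELU activation.
d_llm = 5120
W1, W2 = np.zeros((d_vis, d_llm)), np.zeros((d_llm, d_llm))
h = np.maximum(v @ W1, 0) @ W2

# Prepend to embedded text; the LLM then decodes autoregressively (not shown).
text = np.zeros((20, d_llm))
llm_input = np.concatenate([h, text], axis=0)
print(n_vis, llm_input.shape)  # 576 (596, 5120)
```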
LLaVA Forward Pass

Watch tokens flow through the architecture. Teal = vision, orange = text, green = output.

Why does simplicity win? Both CLIP and the LLM are already well-trained. CLIP learned to align images with text during pretraining. The LLM already knows language. The projection just needs to learn the "translation" between their embedding spaces — a much smaller problem than training from scratch.
Check: How many new components does LLaVA add to connect vision and language?

Chapter 6: The Training Pipeline

VLMs are trained in stages, not all at once. Each stage has a different purpose and different parts of the model are frozen or unfrozen:

Stage | Data | What Trains | Purpose
1. Pretraining alignment | 595K image-caption pairs | Projection only | Align vision ↔ language spaces
2. Instruction tuning | 158K visual conversations | Projection + LLM | Teach instruction following
Stage 1 is fast: Since only the projection layer trains (a tiny fraction of total parameters), Stage 1 takes just a few hours on 8 GPUs. The vision encoder and LLM stay frozen, preserving their pretrained knowledge.
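The staged freeze/unfreeze schedule above can be sketched as a configuration table. The names and structure here are ours; a real pipeline would toggle `requires_grad` per module:

```python
# Staged-training sketch: which parameter groups train in each stage.
STAGES = {
    "1_alignment":   {"vision_encoder": False, "projection": True, "llm": False},
    "2_instruction": {"vision_encoder": False, "projection": True, "llm": True},
}

def trainable(stage: str) -> list:
    """Return the names of modules that are unfrozen in this stage."""
    return [name for name, on in STAGES[stage].items() if on]

print(trainable("1_alignment"))    # ['projection']
print(trainable("2_instruction"))  # ['projection', 'llm']
```

Note the vision encoder stays frozen throughout, which is why both stages preserve CLIP's pretrained visual knowledge.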
Training Stage Visualization

Toggle stages to see which components are frozen (blue/frozen) vs trainable (green/active).

Scaling insight: LLaVA-1.5 improved by upgrading to a 2-layer MLP projector and adding academic VQA datasets in Stage 2. These small changes gave major accuracy boosts — showing that data quality and projection design matter more than model size.
Check: In Stage 1 of LLaVA training, what is trainable?

Chapter 7: Grounding & Spatial Understanding

Grounding means connecting words to specific regions in the image. When a VLM says "the red car on the left," grounding means it can also point to where that car is. This requires the model to output spatial coordinates, not just text.

Approaches include: outputting bounding box coordinates as text tokens (e.g., "[0.2, 0.3, 0.5, 0.7]"), using special location tokens, or predicting segmentation masks. Models like Kosmos-2 and Shikra showed VLMs can be trained to both describe and locate objects.

Visual Grounding

Click on regions to see how a VLM grounds language to spatial locations. Each colored box is a detected object with its label.

Coordinate formats: Some models normalize coordinates to [0, 1000] and emit them as plain text tokens. Others use special <box> tokens. The trend is toward treating coordinates as just another kind of language output — an elegant approach that requires no architectural changes.
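The coordinate-as-text idea fits in a few lines. This sketch normalizes a pixel bounding box to [0, 1000] and wraps it in `<box>` tags; the exact format varies by model, so treat this one as illustrative:

```python
# Emit a bounding box as plain text tokens, normalized to [0, 1000].
def box_to_text(box, img_w, img_h):
    x1, y1, x2, y2 = box
    norm = [round(1000 * x1 / img_w), round(1000 * y1 / img_h),
            round(1000 * x2 / img_w), round(1000 * y2 / img_h)]
    return "<box>" + ",".join(map(str, norm)) + "</box>"

print(box_to_text((128, 192, 320, 448), img_w=640, img_h=640))
# <box>200,300,500,700</box>
```

Because the output is ordinary text, the language model needs no new head or loss to learn grounding, only training data that pairs phrases with boxes.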
Check: What does "grounding" mean in VLMs?

Chapter 8: Document Understanding

Documents are a special challenge for VLMs: they contain dense text, tables, charts, and layouts where spatial arrangement matters. "Revenue" next to "$5M" means something different from "Revenue" in a section header.

Key innovations: high-resolution encoding (documents need more pixels than photos), OCR-free reading (the vision encoder learns to read text directly), and layout-aware attention (understanding that rows and columns create relationships).

Document Layout Parsing

A VLM must understand that spatial layout encodes meaning. Watch how different regions are classified.

Resolution matters: A typical VLM at 336×336 can't read small text. Document VLMs like UReader and TextMonkey use dynamic resolution — cropping the image into tiles and encoding each at high resolution, then stitching the features back together.
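The tiling idea is simple grid arithmetic. A sketch under our own assumptions (a 336-pixel tile size and edge tiles clipped to the image boundary; real systems add resizing, padding, and a global thumbnail view):

```python
# Dynamic-resolution tiling sketch: crop a large document image into
# fixed-size tiles so small text survives encoding at native resolution.
import math

def tile_grid(img_w, img_h, tile=336):
    cols, rows = math.ceil(img_w / tile), math.ceil(img_h / tile)
    return [(c * tile, r * tile,
             min((c + 1) * tile, img_w), min((r + 1) * tile, img_h))
            for r in range(rows) for c in range(cols)]

tiles = tile_grid(1224, 1584)   # roughly a letter-size page at 150 dpi
print(len(tiles))  # 4 cols x 5 rows = 20 tiles
```

Each tile is encoded separately at full detail and the features are stitched back together, trading more vision tokens for legible small text.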
Model | Approach | Strength
DocOwl | Layout-aware pretraining | Tables, forms
TextMonkey | High-res with token pruning | Dense text
Nougat | OCR-free academic PDF reading | Equations, LaTeX
GPT-4V / Gemini | Native multi-resolution | General documents
Check: Why is document understanding harder than photo understanding for VLMs?

Chapter 9: Frontier VLMs

The field is evolving rapidly. Today's frontier models handle video, multi-image reasoning, interleaved image-text, and even generate images alongside text. Here's where things stand:

Model | Key Innovation | Scale
GPT-4o | Native multimodal (not bolted on) | Unknown
Gemini 1.5 | 1M+ token context, video natively | Unknown
Claude 3.5 | Strong spatial reasoning, charts | Unknown
LLaVA-NeXT | Dynamic resolution, video | 7B–110B
InternVL2 | Dynamic tiling, strong OCR | 1B–108B
Qwen2-VL | Naive Dynamic Resolution, video | 2B–72B
VLM Capability Radar

Compare different VLM generations across key capabilities. Each axis represents a different skill.

The trend: Early VLMs bolted vision onto language. Frontier models are trained multimodally from the start — vision isn't an add-on, it's native. This enables richer reasoning, fewer hallucinations, and genuine visual understanding rather than pattern matching.
"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective."
— Rich Sutton, The Bitter Lesson

You now understand how machines learned to see and speak. The fusion of vision and language is one of the most consequential advances in AI history.

Check: What distinguishes frontier VLMs from early ones?