How modern AI learned to see and speak at the same time — fusing pixels and words into a single reasoning engine that can describe, answer, and understand images.
Humans effortlessly combine what they see with what they read. You look at a chart and describe trends. You glance at a photo and answer questions about it. For decades, vision and language lived in separate AI systems. A Vision-Language Model (VLM) merges them into a single architecture that processes images and text together.
The core challenge: images are grids of pixels (spatial, continuous) while text is a sequence of tokens (discrete, symbolic). A VLM must bridge these two very different representations and allow them to interact so the model can reason about visual content using language.
Watch image patches and text tokens flow into a shared representation space. The teal nodes are visual, orange nodes are textual.
The vision encoder converts a raw image into a sequence of feature vectors. The dominant approach is ViT (Vision Transformer): chop the image into fixed-size patches (e.g., 14×14 pixels), flatten each patch into a vector, add positional embeddings, and run them through a transformer.
After encoding, an image becomes a grid of high-dimensional vectors — essentially the same shape as a sequence of text tokens. This structural similarity is what makes vision-language fusion possible.
See how an image is sliced into patches. Each patch becomes a token for the transformer. Adjust patch size to see the trade-off between resolution and sequence length.
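The slicing step can be sketched in a few lines of NumPy. `patchify` is an illustrative helper, not a library function; the numbers (a 336×336 RGB image, 14×14 patches) match the CLIP ViT-L/14 setup used throughout this article:

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 14) -> np.ndarray:
    """Slice an (H, W, C) image into flattened patch vectors.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    one row per patch -- the "token" sequence a ViT consumes.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Reshape into a grid of patches, then flatten each patch.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, p, p, c)
    return patches.reshape(-1, patch_size * patch_size * c)

# A 336x336 RGB image with 14x14 patches yields a 24x24 grid = 576 tokens,
# each a 14 * 14 * 3 = 588-dimensional vector (before the transformer).
tokens = patchify(np.zeros((336, 336, 3)), patch_size=14)
print(tokens.shape)  # (576, 588)
```

Note the trade-off the demo above illustrates: halving the patch size quadruples the number of tokens.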
| Encoder | Used By | Patch Size |
|---|---|---|
| CLIP ViT-L/14 | LLaVA, many VLMs | 14×14 |
| SigLIP | PaliGemma, newer VLMs | 14×14 |
| EVA-CLIP | InternVL | 14×14 |
| DINOv2 | Research models | 14×14 |
The vision encoder outputs feature vectors in its embedding space. The language model expects tokens in its embedding space. These spaces don't match! We need a projection layer to translate between them.
The simplest bridge is a linear projection (a single matrix multiply). More sophisticated bridges use an MLP (two linear layers with an activation in between) or a cross-attention resampler (like Flamingo's Perceiver Resampler). The projection doesn't just resize — it aligns visual features with the language model's semantic space.
Points in vision space get projected to language space. Drag the slider to rotate the projection and see alignment change.
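A minimal sketch of the two simplest bridges. The dimensions are illustrative (1024-dim vision features into a 4096-dim LLM) and the weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_lm = 1024, 4096   # e.g. CLIP ViT-L features -> a 4096-dim LLM

# Linear projection: a single matrix multiply.
W = rng.normal(size=(d_vision, d_lm)) * 0.02
def project_linear(v):            # v: (num_tokens, d_vision)
    return v @ W                  # -> (num_tokens, d_lm)

# MLP projection: two linear layers with a GELU activation in between.
W1 = rng.normal(size=(d_vision, d_lm)) * 0.02
W2 = rng.normal(size=(d_lm, d_lm)) * 0.02
def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
def project_mlp(v):
    return gelu(v @ W1) @ W2

vision_feats = rng.normal(size=(576, d_vision))
print(project_linear(vision_feats).shape)  # (576, 4096)
print(project_mlp(vision_feats).shape)     # (576, 4096)
```

Either way, the output has the LLM's embedding dimension, so vision tokens can sit in the same sequence as text tokens.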
Once vision tokens and text tokens are in the same space, how do we let them interact? There are several architectural strategies, each with different trade-offs:
| Strategy | How It Works | Example |
|---|---|---|
| Early Fusion | Concatenate vision + text tokens, feed to one transformer | LLaVA |
| Cross-Attention | Text attends to vision via extra cross-attention layers | Flamingo |
| Perceiver Resampler | Learned queries compress vision tokens to a fixed count | Flamingo, Qwen-VL |
| Interleaved | Vision tokens inserted at corresponding text positions | Fuyu, Gemini |
Toggle between strategies to see how vision (teal) and text (orange) tokens interact.
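As a concrete example of the cross-attention strategy, here is a single-head sketch in which text tokens form the queries and vision tokens supply the keys and values; the weights are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, vision, d_k=64):
    """One head of text-to-vision cross-attention (Flamingo-style fusion).

    Each text token computes attention weights over all vision tokens and
    pulls in a weighted mix of visual information.
    """
    rng = np.random.default_rng(0)
    d = text.shape[-1]
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) * 0.02 for _ in range(3))
    q = text @ Wq                              # (T, d_k) queries from text
    k = vision @ Wk                            # (V, d_k) keys from vision
    v = vision @ Wv                            # (V, d_k) values from vision
    attn = softmax(q @ k.T / np.sqrt(d_k))     # (T, V) attention weights
    return attn @ v                            # (T, d_k) visually-informed text

out = cross_attention(np.ones((32, 512)), np.ones((576, 512)))
print(out.shape)  # (32, 64)
```

Early fusion, by contrast, needs no extra layers at all: the vision and text tokens are simply concatenated and ordinary self-attention does the mixing.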
Having the right architecture isn't enough — the model needs to learn when and how to use visual information. Visual instruction tuning trains the model on (image, instruction, response) triples so it can follow natural language commands about images.
The key innovation: instead of training on just captions ("A cat on a sofa"), you train on diverse tasks: "What color is the cat?", "Count the cushions", "Write a poem about this scene", "Is this image safe for children?" This teaches the model to be a general-purpose visual assistant.
See the variety of tasks a VLM must handle. Click to cycle through instruction types.
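An instruction-tuning sample might look like the following, using a LLaVA-style conversation layout; the filenames and field contents here are invented for illustration:

```python
# Hypothetical (image, instruction, response) triples. The `<image>`
# placeholder marks where vision tokens are spliced into the prompt.
samples = [
    {
        "image": "cat_sofa.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat color is the cat?"},
            {"from": "gpt", "value": "The cat is orange with white paws."},
        ],
    },
    {
        "image": "living_room.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nCount the cushions."},
            {"from": "gpt", "value": "There are four cushions on the sofa."},
        ],
    },
]
```

Mixing many instruction types in one dataset is what pushes the model beyond captioning toward general visual assistance.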
LLaVA (Large Language and Vision Assistant) showed that a surprisingly simple recipe works remarkably well: take a pretrained CLIP vision encoder, a pretrained LLM (like Vicuna/LLaMA), and connect them with a single linear projection. That's it.
The image goes through CLIP ViT-L/14 → produces 576 vision tokens (a 24×24 grid from the penultimate layer, for a 336×336 input) → each is linearly projected to the LLM's dimension → prepended to the text tokens → the LLM generates the response autoregressively.
Watch tokens flow through the architecture. Teal = vision, orange = text, green = output.
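The pipeline above can be sketched with shape-only stand-ins for the real modules (zeros in place of learned weights), keeping the LLaVA token counts:

```python
import numpy as np

d_vision, d_lm = 1024, 4096

def vision_encoder(image):               # stand-in for CLIP ViT-L/14
    return np.zeros((576, d_vision))     # 24x24 grid of patch features

W_proj = np.zeros((d_vision, d_lm))      # the single linear bridge
def projector(feats):
    return feats @ W_proj

def embed_text(prompt_ids):              # stand-in for the LLM's embedding table
    return np.zeros((len(prompt_ids), d_lm))

def llava_prefill(image, prompt_ids):
    """Build the sequence the LLM decodes from:
    vision tokens first, then text tokens."""
    vision_tokens = projector(vision_encoder(image))   # (576, d_lm)
    text_tokens = embed_text(prompt_ids)               # (T, d_lm)
    return np.concatenate([vision_tokens, text_tokens], axis=0)

seq = llava_prefill(np.zeros((336, 336, 3)), prompt_ids=[1, 2, 3])
print(seq.shape)  # (579, 4096): 576 vision tokens + 3 text tokens
```

From the LLM's perspective, the 576 vision tokens are just an unusually long prefix; autoregressive decoding proceeds exactly as with text-only prompts.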
VLMs are trained in stages, not all at once. Each stage has a different purpose and different parts of the model are frozen or unfrozen:
| Stage | Data | What Trains | Purpose |
|---|---|---|---|
| 1. Pretraining alignment | 595K image-caption pairs | Projection only | Align vision ↔ language spaces |
| 2. Instruction tuning | 158K visual conversations | Projection + LLM | Teach instruction following |
Toggle stages to see which components are frozen (blue/frozen) vs trainable (green/active).
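The freeze/unfreeze logic of the two stages can be sketched with plain dicts standing in for modules; in a real framework you would toggle `requires_grad` on the corresponding parameter groups instead:

```python
# Three components, all frozen by default.
model = {"vision_encoder": {"trainable": False},
         "projection":     {"trainable": False},
         "llm":            {"trainable": False}}

def set_stage(model, stage):
    if stage == 1:    # alignment: only the projection learns
        trainable = {"projection"}
    elif stage == 2:  # instruction tuning: projection + LLM learn
        trainable = {"projection", "llm"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    for name, module in model.items():
        module["trainable"] = name in trainable

set_stage(model, 1)
print([n for n, m in model.items() if m["trainable"]])        # ['projection']
set_stage(model, 2)
print(sorted(n for n, m in model.items() if m["trainable"]))  # ['llm', 'projection']
```

Note that the vision encoder stays frozen in both stages of this recipe: the cheap projection layer absorbs the alignment work.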
Grounding means connecting words to specific regions in the image. When a VLM says "the red car on the left," grounding means it can also point to where that car is. This requires the model to output spatial coordinates, not just text.
Approaches include: outputting bounding box coordinates as text tokens (e.g., "[0.2, 0.3, 0.5, 0.7]"), using special location tokens, or predicting segmentation masks. Models like Kosmos-2 and Shikra showed VLMs can be trained to both describe and locate objects.
Click on regions to see how a VLM grounds language to spatial locations. Each colored box is a detected object with its label.
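A sketch of the first approach: extracting bracketed, normalized coordinates from a model's text output and scaling them to pixels. The exact output format varies by model; the regex below assumes the `[x1, y1, x2, y2]` convention shown above:

```python
import re

def parse_boxes(response: str, width: int, height: int):
    """Pull normalized [x1, y1, x2, y2] boxes out of a VLM's text response
    and scale them to pixel coordinates."""
    pattern = r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]"
    boxes = []
    for m in re.finditer(pattern, response):
        x1, y1, x2, y2 = (float(g) for g in m.groups())
        boxes.append((round(x1 * width), round(y1 * height),
                      round(x2 * width), round(y2 * height)))
    return boxes

text = "The red car on the left is at [0.2, 0.3, 0.5, 0.7]."
print(parse_boxes(text, width=1000, height=800))  # [(200, 240, 500, 560)]
```

Emitting coordinates as ordinary text tokens means grounding needs no architectural changes — only training data that pairs phrases with boxes.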
Documents are a special challenge for VLMs: they contain dense text, tables, charts, and layouts where spatial arrangement matters. "Revenue" next to "$5M" means something different from "Revenue" in a section header.
Key innovations: high-resolution encoding (documents need more pixels than photos), OCR-free reading (the vision encoder learns to read text directly), and layout-aware attention (understanding that rows and columns create relationships).
A VLM must understand that spatial layout encodes meaning. Watch how different regions are classified.
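The high-resolution idea can be sketched as choosing a crop grid under a tile budget, loosely modeled on the dynamic-tiling approach; the numbers (336-px tiles, 12-tile cap) are illustrative:

```python
def tile_grid(width, height, tile=336, max_tiles=12):
    """Choose a crop grid for a high-resolution page: roughly one tile per
    tile-sized region, capped at a token budget, preserving aspect ratio."""
    cols = max(1, round(width / tile))
    rows = max(1, round(height / tile))
    while cols * rows > max_tiles:   # over budget: shrink the longer side
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

# A document scan at 1344x1904 gets a 3x4 grid (12 tiles); each tile is
# encoded separately so small text stays legible to the vision encoder.
print(tile_grid(1344, 1904))  # (3, 4)
print(tile_grid(336, 336))    # (1, 1): ordinary photos need one tile
```

Each tile passes through the vision encoder at its native resolution, multiplying the effective pixel budget without retraining the encoder at larger input sizes.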
| Model | Approach | Strength |
|---|---|---|
| DocOwl | Layout-aware pretraining | Tables, forms |
| TextMonkey | High-res with token pruning | Dense text |
| Nougat | OCR-free academic PDF reading | Equations, LaTeX |
| GPT-4V / Gemini | Native multi-resolution | General documents |
The field is evolving rapidly. Today's frontier models handle video, multi-image reasoning, interleaved image-text, and even generate images alongside text. Here's where things stand:
| Model | Key Innovation | Scale |
|---|---|---|
| GPT-4o | Native multimodal (not bolted on) | Unknown |
| Gemini 1.5 | 1M+ token context, video natively | Unknown |
| Claude 3.5 | Strong spatial reasoning, charts | Unknown |
| LLaVA-NeXT | Dynamic resolution, video | 7B-110B |
| InternVL2 | Dynamic tiling, strong OCR | 1B-108B |
| Qwen2-VL | Naive dynamic resolution, video | 2B-72B |
Compare different VLM generations across key capabilities. Each axis represents a different skill.
You now understand how machines learned to see and speak at the same time. The fusion of vision and language is one of the most consequential advances in modern AI.