When language meets vision — unified tokens, mixed losses, and models that see, read, and generate in one architecture.
A child learns language not from text alone but from a world of sounds, images, touch, and social interaction. When a toddler hears "dog" while seeing a dog, the word and the visual concept fuse into a single, grounded understanding. Text-only language models miss this grounding entirely — they know that "dog" follows "the" and precedes "barked," but have never seen a dog.
Multimodal models aim to close this gap by processing and generating multiple modalities — text, images, audio, video — within a single architecture. Instead of separate models for vision and language that are glued together, multimodal models share representations across modalities, creating unified understanding.
The stakes are high. Many of the most capable AI systems, including GPT-4V, Gemini, and Claude 3, are multimodal. They can describe images, answer questions about charts, and read handwriting, and some can also generate images. But how do they do it? That's what this lesson explores.
[Interactive figure: the same prompt, with and without an image. Multimodal models combine both signals to produce richer, grounded responses.]
The fundamental challenge of multimodality is representation alignment: text lives in a discrete token space, images live in a continuous pixel space, and audio lives in a waveform space. How do you get a single transformer to process all three?
Three families of solutions have emerged:
1. Encode-and-project (LLaVA-style): Use a pre-trained vision encoder to convert images into embedding vectors, then project them into the language model's embedding space. The language model treats image tokens like word tokens.
2. Discrete tokens (Chameleon-style): Quantize images into discrete tokens using a VQ-VAE, then interleave image tokens and text tokens in a single sequence. One shared vocabulary, one autoregressive loss.
3. Mixed losses (Transfusion-style): Keep image representations continuous but train with a diffusion loss on images and an autoregressive loss on text. The model switches objectives mid-sequence based on modality.
This lesson teaches all three approaches, their tradeoffs, and the scaling laws that govern mixed-modal training.
Before a transformer can process an image, it needs to convert pixels into tokens. The dominant approach is the Vision Transformer (ViT): divide the image into fixed-size patches, flatten each patch into a vector, and process them like a sequence of tokens.
A ViT takes an image of size H×W×3 (height, width, RGB channels) and divides it into non-overlapping patches of size P×P. Each patch becomes one token, so the image yields N = (H/P) × (W/P) tokens. For a 224×224 image with P=16, N = 14 × 14 = 196.
Each patch is a 16×16×3 = 768-dimensional vector (flattened pixels). A linear projection maps this to the transformer's d_model dimension. Positional embeddings are added to encode spatial location — the model needs to know that patch (0,0) is top-left and patch (13,13) is bottom-right.
[Interactive figure: an image divided into 14×14 patches, where each patch becomes one token. Changing the patch size changes the number of tokens.]
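The patch embedding pipeline: image [H×W×3] → N patches of P×P×3 pixels → flatten each patch to a vector of length 3P² → linear projection to d_model → add positional embeddings → transformer layers.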
The output is 196 vectors of dimension d_model — one per patch — that encode both local visual features and global image context (through self-attention). These can be used for classification (add a [CLS] token), for image generation (decode patches back to pixels), or for multimodal models (feed to a language model).
Smaller patches give higher resolution but more tokens:
| Patch Size | Patches (224px) | Tokens | Detail Level |
|---|---|---|---|
| 32×32 | 7×7 | 49 | Low (coarse features) |
| 16×16 | 14×14 | 196 | Medium (standard ViT) |
| 14×14 | 16×16 | 256 | High |
| 8×8 | 28×28 | 784 | Very high (expensive) |
This is the vision equivalent of the tokenization tradeoff from L14: finer granularity gives more information but longer sequences. With attention's O(n²) cost, 784 tokens is 16x more expensive than 196 tokens (4x the tokens, squared). High-resolution vision encoders rely on tricks such as token pooling and windowed attention to keep this cost manageable.
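The arithmetic behind the table can be checked in a few lines of Python. This is a minimal sketch: the helper names `num_patches` and `relative_attention_cost` are illustrative, and the cost model counts only the quadratic self-attention term, ignoring the linear layers.

```python
def num_patches(img_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    per_side = img_size // patch_size
    return per_side * per_side


def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost relative to a baseline, assuming O(n^2) scaling."""
    return (n_tokens / baseline_tokens) ** 2


baseline = num_patches(224, 16)  # the standard ViT setting from the table
for p in (32, 16, 14, 8):
    n = num_patches(224, p)
    cost = relative_attention_cost(n, baseline)
    print(f"P={p:2d}: {n:3d} tokens, {cost:.2f}x attention cost")
```

Halving the patch side quadruples the token count, which is why the attention cost grows with the fourth power of the resolution gain.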
```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, d_model=768):
        super().__init__()
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2            # 196
        self.proj = nn.Linear(patch_size**2 * 3, d_model)         # 768 → d_model
        self.pos = nn.Parameter(torch.randn(self.n_patches, d_model))

    def forward(self, img):  # img: [B, 3, 224, 224]
        p = self.patch_size
        patches = img.unfold(2, p, p).unfold(3, p, p)             # [B, 3, 14, 14, 16, 16]
        # Move channels next to the pixel dims so each patch flattens to one vector
        patches = patches.permute(0, 2, 3, 1, 4, 5)               # [B, 14, 14, 3, 16, 16]
        patches = patches.reshape(img.shape[0], self.n_patches, -1)  # [B, 196, 768]
        tokens = self.proj(patches) + self.pos                    # [B, 196, d_model]
        return tokens
```
Once you have image tokens and text tokens, the next question is: when do they interact? Fusion strategy determines where in the architecture the two modalities first see each other.
In late fusion, each modality has its own encoder. Text goes through a text transformer, images go through a vision transformer, and the outputs are combined only at the final layers (or after both encoders are done). CLIP is a classic late fusion model: separate text and image encoders produce embeddings that are aligned via contrastive learning.
Advantages: each encoder specializes in its modality. Disadvantages: the modalities can't help each other during encoding. The text encoder doesn't know what's in the image when encoding text, and vice versa.
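To make "aligned via contrastive learning" concrete, here is a minimal sketch of a symmetric contrastive loss in the style CLIP uses. It is written in plain Python for readability rather than speed. The function names are illustrative, the embeddings are toy 2-D vectors, and the fixed temperature of 0.07 is an assumption for this example (in practice the temperature is usually a learned parameter).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i is the positive match for text i."""
    img = [normalize(v) for v in image_embs]
    txt = [normalize(v) for v in text_embs]
    n = len(img)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[dot(img[i], txt[j]) / temperature for j in range(n)] for i in range(n)]
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    loss_t2i = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The two encoders interact only through this similarity matrix, which is why neither can condition on the other's input while encoding.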
In early fusion, image and text tokens are interleaved into a single sequence and processed by a single transformer from layer 1. The model sees both modalities simultaneously at every layer. Chameleon is the clearest published example; Gemini is described by its developers as natively multimodal, while the architecture of GPT-4V has not been published.
Advantages: the modalities can attend to each other from the start, enabling deep cross-modal reasoning. Disadvantages: a single model must handle both modalities, requiring more capacity; training is more complex; and the model can't be pre-trained on each modality independently as easily.
[Interactive figure: two architectures for combining text and image. Late fusion keeps them separate until the end; early fusion interleaves them from the start.]
Most production multimodal models use a middle approach: a pre-trained vision encoder processes the image, its output tokens are projected into the language model's space, and the language model processes interleaved image + text tokens. This is medium fusion — images are pre-processed but then fully interleaved during the main transformer.
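LLaVA (Liu et al. 2023) exemplifies this: a frozen CLIP ViT encodes the image, a small projection module (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5) maps the image tokens into the LLM's embedding space, and the LLM processes [image_tokens] + [text_tokens] as a single sequence. With CLIP ViT-L/14 at 224×224, that is 256 image tokens (a 16×16 grid). The vision encoder stays frozen; only the projection and the LLM are trained.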
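The encode-and-project step can be sketched as follows. This is a toy illustration, not LLaVA's code: a single linear layer stands in for the projector, random vectors stand in for the frozen vision encoder's output and the LLM's token embeddings, and all dimensions (`d_vis`, `d_llm`, `n_img`, `n_txt`) are made-up small values.

```python
import random


def linear(x, weight, bias):
    """y = W x + b for one vector; weight has shape [d_out][d_in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def build_multimodal_sequence(image_feats, text_embs, weight, bias):
    """Project vision features into the LLM space and prepend them to the text."""
    projected = [linear(f, weight, bias) for f in image_feats]
    return projected + text_embs  # [n_img + n_txt][d_llm]


rng = random.Random(0)
d_vis, d_llm, n_img, n_txt = 4, 6, 3, 2  # toy sizes, not real model dimensions
weight = [[rng.gauss(0, 0.02) for _ in range(d_vis)] for _ in range(d_llm)]
bias = [0.0] * d_llm
image_feats = [[rng.random() for _ in range(d_vis)] for _ in range(n_img)]  # stand-in for ViT output
text_embs = [[rng.random() for _ in range(d_llm)] for _ in range(n_txt)]    # stand-in for LLM embeddings

sequence = build_multimodal_sequence(image_feats, text_embs, weight, bias)
print(len(sequence), len(sequence[0]))
```

After projection, the language model has no way to tell image positions from text positions except through the values of the vectors themselves.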
| Strategy | Example | Image Encoder | Cross-Modal Attention | Training Cost |
|---|---|---|---|---|
| Late Fusion | CLIP | Separate ViT | None (embeddings compared after encoding) | Low |
| Medium Fusion | LLaVA | Frozen ViT + MLP | All LLM layers | Medium |
| Early Fusion | Chameleon | Integrated VQ-VAE | All layers from start | High |
What if images and text were truly the same thing to a model — just different tokens in the same vocabulary? That's the radical idea behind Chameleon (Meta, 2024): a single unified token space for all modalities, trained with a single autoregressive loss.
Chameleon uses a VQ-VAE (Vector-Quantized Variational Autoencoder) to convert images into discrete tokens. The VQ-VAE has a learned codebook of 8,192 visual "words." Each image patch is mapped to its nearest codebook entry, producing a sequence of discrete image tokens.
These image tokens share one vocabulary with the text tokens. Chameleon trains a BPE tokenizer with a vocabulary of 65,536 entries, which includes the 8,192 image codebook tokens, and the model treats image tokens and text tokens identically. From the transformer's perspective, there is no difference between predicting the next word and predicting the next image patch.
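One simple way to build such a shared vocabulary is to place the image codes directly after the text ids. The sketch below assumes that layout purely for illustration (it is not taken from the Chameleon code), and the vocabulary sizes are toy values.

```python
def to_unified_ids(segments, text_vocab_size, codebook_size):
    """Flatten (modality, ids) segments into one sequence over a shared vocabulary.

    Assumed layout (illustrative): ids [0, text_vocab_size) are text tokens and
    ids [text_vocab_size, text_vocab_size + codebook_size) are image codes.
    """
    unified = []
    for modality, ids in segments:
        if modality == "text":
            if any(not 0 <= i < text_vocab_size for i in ids):
                raise ValueError("text id out of range")
            unified.extend(ids)
        elif modality == "image":
            if any(not 0 <= i < codebook_size for i in ids):
                raise ValueError("image code out of range")
            unified.extend(text_vocab_size + i for i in ids)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return unified


def modality_of(token_id, text_vocab_size):
    """Recover the modality of a unified id from its range."""
    return "text" if token_id < text_vocab_size else "image"


segments = [("text", [5, 17, 3]), ("image", [0, 15, 4]), ("text", [9])]
ids = to_unified_ids(segments, text_vocab_size=100, codebook_size=16)  # toy sizes
print(ids)
print([modality_of(i, 100) for i in ids])
```

The transformer only ever sees the flat list of integers; the modality is implicit in which id range a token falls into.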
[Interactive figure: text and images are both tokenized into the same vocabulary, and the transformer generates a single interleaved sequence one token at a time.]
Chameleon's training objective is pure next-token prediction, regardless of modality:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

where $x_t$ can be either a text token or an image token. This simplicity is the key advantage: no special handling for images, no separate loss functions, no modality-specific architecture. The same attention mechanism, the same positional encoding, and the same output head predict the next token whether it's a word or a visual patch.
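As a rough illustration of how modality-agnostic this objective is, the sketch below computes the mean negative log-likelihood over a mixed sequence. The vocabulary split and the uniform "model" are toy assumptions; the point is that the loss function never checks which ids are text and which are image codes.

```python
import math


def next_token_loss(probs, targets):
    """Mean negative log-likelihood of the target token at each position.

    probs[t] is the predicted distribution over the unified vocabulary at
    position t; targets[t] is the id of the token that actually came next.
    """
    if len(probs) != len(targets):
        raise ValueError("need one distribution per target")
    return -sum(math.log(p[y]) for p, y in zip(probs, targets)) / len(targets)


vocab_size = 8                       # toy unified vocabulary: ids 0-5 text, 6-7 image
uniform = [1.0 / vocab_size] * vocab_size
targets = [2, 6, 7, 1]               # text, image, image, text: no special casing
loss = next_token_loss([uniform] * len(targets), targets)
print(loss)
```

A model that predicts a uniform distribution scores exactly log(vocab_size), which is a useful sanity check when wiring up a new tokenizer.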
The training data is interleaved text-image documents. A training example might look like:
training sequence
[TEXT] A golden retriever playing in the park
[IMG_TOK_1] [IMG_TOK_2] ... [IMG_TOK_1024]
[TEXT] The dog is chasing a red frisbee
[IMG_TOK_1] [IMG_TOK_2] ... [IMG_TOK_1024]
[TEXT] Dogs are the most popular pet in the US
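The quality of the entire system depends on the image tokenizer. Chameleon's VQ-VAE works like this: an encoder maps the image to a grid of latent vectors, each latent vector is replaced by its nearest codebook entry (the index of that entry is the image token), and a decoder reconstructs pixels from the quantized latents.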
The codebook size (8,192 for Chameleon) determines the visual vocabulary size. Larger codebooks give finer image details but make the softmax harder. Smaller codebooks are computationally cheaper but lose visual information. This is directly analogous to the vocabulary-size tradeoff in text tokenization (L14).
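The quantization step at the heart of a VQ-VAE is a nearest-neighbour lookup. The following minimal sketch shows that lookup on toy data: the 4-entry, 2-D codebook and the latent vectors are made up, and a real tokenizer would learn the codebook jointly with an encoder and decoder.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices, codebook):
    """Look up codebook vectors; a decoder would turn these back into pixels."""
    return [codebook[k] for k in indices]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4 entries instead of 8,192
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]              # stand-in for encoder outputs
tokens = quantize(latents, codebook)
print(tokens)
print(dequantize(tokens, codebook))
```

The gap between `latents` and the dequantized vectors is the information lost to quantization; a larger codebook shrinks that gap at the cost of a larger output softmax.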
Image quality. VQ-VAE reconstruction introduces artifacts. Fine details (text in images, small objects) are lost in the quantization step. The 8K codebook can't represent all visual concepts with the same fidelity as continuous representations.
Sequence length. Chameleon encodes a 512×512 image into 1,024 tokens (a 32×32 grid). That's 1K tokens per image. A document with 10 images costs about 10K tokens for the images alone, roughly a third of a 32K context window.
Training instability. Chameleon's paper reports needing special techniques (QK-norm, dropout, reordered layer norms, and z-loss regularization) to stabilize training. The modality mixing creates loss landscapes that are harder to optimize than pure text training.
Chameleon forces images through a discrete bottleneck (VQ-VAE). But image generation has seen massive success with diffusion models, which work with continuous representations. Can we combine the best of both worlds?
Transfusion (Zhou et al. 2024) does exactly this: train a single transformer with two losses — autoregressive next-token prediction for text, and diffusion denoising for images. The model switches objectives within a single sequence based on modality markers.
In a Transfusion training sequence, text tokens are trained with cross-entropy loss (predict the next token). Image tokens are trained with diffusion loss (denoise a noisy version of the image). The model learns both objectives simultaneously.
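$$\mathcal{L} = \mathcal{L}_{\text{AR}} + \lambda\,\mathcal{L}_{\text{diff}}$$

where $\mathcal{L}_{\text{AR}}$ is the standard autoregressive cross-entropy loss and $\mathcal{L}_{\text{diff}}$ is the diffusion denoising loss. The hyperparameter λ balances the two losses and is typically set so that both contribute comparably to the gradient magnitude.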
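A small numerical sketch of the combined objective follows. It assumes a DDPM-style noise-prediction loss for the image term and uses placeholder values where a real model's outputs would go; the function names and the choice `lam=1.0` are illustrative, not values taken from the Transfusion paper.

```python
import math
import random


def cross_entropy(probs, targets):
    """Mean negative log-likelihood over text positions."""
    return -sum(math.log(p[y]) for p, y in zip(probs, targets)) / len(targets)


def add_noise(x0, noise, alpha_bar):
    """DDPM-style forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    return [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, noise)
    ]


def diffusion_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise)) / len(true_noise)


def mixed_loss(text_probs, text_targets, predicted_noise, true_noise, lam):
    """L = L_AR + lam * L_diff, each term computed on its own modality."""
    return cross_entropy(text_probs, text_targets) + lam * diffusion_loss(
        predicted_noise, true_noise
    )


rng = random.Random(0)
latent = [0.5, -1.0, 0.25, 2.0]                          # toy continuous image latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]
noisy_latent = add_noise(latent, noise, alpha_bar=0.5)   # what the model would see
predicted_noise = [0.0] * len(latent)                    # placeholder for a model's output
text_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]          # placeholder next-token distributions
total = mixed_loss(text_probs, [0, 1], predicted_noise, noise, lam=1.0)
print(total)
```

Because the two terms are on different scales (log-probabilities versus squared error), the weight `lam` is what keeps one modality from dominating the gradient.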
[Interactive figure: text tokens use autoregressive loss (predict next token), image tokens use diffusion loss (denoise), and the model handles both within a single forward pass.]
Chameleon uses autoregressive loss for images. Transfusion uses diffusion loss for images. Why does this matter?
Autoregressive on images requires discretizing the image (VQ-VAE). Discrete tokens lose information. The model must predict image patches left-to-right, top-to-bottom — an arbitrary ordering that doesn't match how visual information is structured. The upper-left patch is generated before the lower-right, even though they might depict the same object.
Diffusion on images works with continuous representations. No information is lost to quantization. The model denoises the entire image simultaneously, producing all patches in parallel. This matches the spatial structure of images better and produces higher quality results.
| Property | AR (Chameleon) | Diffusion (Transfusion) |
|---|---|---|
| Image representation | Discrete tokens (VQ-VAE) | Continuous latents |
| Generation order | Left-to-right, one token at a time | All patches simultaneously |
| Image quality | Limited by codebook size | High (continuous = no bottleneck) |
| Training simplicity | One loss function | Two loss functions |
| Inference speed | Fast (one pass per token) | Slower (multiple denoising steps) |
Transfusion's key result: at the same compute budget, Transfusion produces significantly better images than Chameleon-style AR approaches while maintaining competitive text quality. The dual-loss design doesn't hurt either modality — it helps both by letting each modality use its optimal training objective.
How should you allocate compute when training on both text and images? Should images and text share all parameters, or have some modality-specific components? This is the domain of mixed-modal scaling laws, and the answers are surprising.
Aghajanyan et al. (2023) in "Scaling Laws for Generative Mixed-Modal Language Models" showed that mixed-modal training follows its own scaling laws, distinct from pure text scaling. Two key findings:
1. Modality mixing ratio matters enormously. Training on 80% text + 20% images produces different scaling behavior than 50/50 or 20/80. The optimal ratio depends on the task distribution you care about.
2. More data of one modality has diminishing returns for the other. Adding more images improves image understanding but eventually stops helping text performance. This suggests the modalities partially compete for model capacity.
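In practice, a mixing ratio is usually implemented as a sampling probability in the data loader. The sketch below shows one simple way to do that: it picks the modality of each batch independently. The function name and the per-batch granularity are assumptions for illustration, not the procedure used in the paper.

```python
import random
from collections import Counter


def sample_modalities(n_batches, text_fraction, seed=0):
    """Choose the modality of each training batch according to a mixing ratio."""
    if not 0.0 <= text_fraction <= 1.0:
        raise ValueError("text_fraction must be in [0, 1]")
    rng = random.Random(seed)
    return ["text" if rng.random() < text_fraction else "image" for _ in range(n_batches)]


schedule = sample_modalities(10_000, text_fraction=0.8)
counts = Counter(schedule)
print(counts["text"] / len(schedule))
```

Sweeping `text_fraction` over a grid and fitting a scaling curve at each setting is the kind of experiment that reveals how the ratio interacts with model size.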
Liang et al. (2024) proposed Mixture of Transformers (MoT): a shared transformer where some components (attention, feed-forward) have modality-specific parameters while others are shared. This addresses the capacity competition problem.
[Interactive figure: in MoT, some transformer components are shared across modalities while others are modality-specific; the sharing ratio controls how parameters are allocated.]
At one extreme (100% shared), all parameters are shared — this is Chameleon. At the other extreme (0% shared), you have two completely separate models — no cross-modal transfer. MoT finds that 30-50% sharing is optimal: enough sharing for cross-modal transfer, enough specialization for per-modality quality.
The shared components tend to learn general sequence processing capabilities: attention patterns, positional understanding, basic feature extraction. The modality-specific components learn modality-specific features: visual texture detectors in the image branch, syntactic patterns in the text branch.
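The routing idea can be illustrated with a toy layer in which one step mixes information across all tokens and a second step applies parameters chosen by each token's modality. This is a deliberately simplified sketch: the "attention" is just a mean over the sequence, the "feed-forward" blocks are scalar multipliers, and which components are shared is a design choice that this example does not settle.

```python
def scale_ffn(factor):
    """Toy feed-forward block: multiplies every feature by a constant."""
    return lambda vec: [factor * v for v in vec]


def shared_mixing(tokens):
    """Toy stand-in for shared attention: add the sequence mean to every token."""
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[v + m for v, m in zip(t, mean)] for t in tokens]


def mot_layer(tokens, modalities, ffn_by_modality):
    """One layer: shared mixing across all tokens, then a per-modality FFN."""
    mixed = shared_mixing(tokens)  # every token sees both modalities
    return [ffn_by_modality[m](t) for t, m in zip(mixed, modalities)]


ffn_by_modality = {"text": scale_ffn(2.0), "image": scale_ffn(-1.0)}  # separate parameters
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
modalities = ["text", "text", "image", "image"]
out = mot_layer(tokens, modalities, ffn_by_modality)
print(out)
```

Routing here is deterministic by modality tag, so unlike a mixture-of-experts layer there is no learned router and no load-balancing loss.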
Davtyan & Favaro (2025) proposed OneFlow, which unifies autoregressive and diffusion training into a single framework. Instead of treating AR and diffusion as fundamentally different, OneFlow shows they're both special cases of a more general "flow" framework. This opens the door to smoothly interpolating between the two objectives per-modality, potentially finding even better mixed-modal training recipes.
Not all knowledge needs to be stored in model weights. Retrieval-augmented generation (RAG) augments the model with an external memory of text-image pairs, retrieving relevant examples at inference time. This is particularly powerful for multimodal models because images are expensive to memorize in weights.
Yasunaga et al. (2023) introduced RA-CM3: a multimodal language model augmented with a retrieval module. Given a text prompt, the model retrieves relevant text-image pairs from a large database and includes them in the context.
[Interactive figure: given a query, the model retrieves relevant text-image pairs from a database, then uses them as context to generate a response.]
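The retrieval pipeline: encode the query into an embedding, find the nearest text-image pairs in the database, prepend the retrieved pairs to the prompt, and let the generator produce its output conditioned on that extended context.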
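A minimal sketch of such a pipeline is shown below, using cosine similarity over precomputed embeddings. Everything here is a stand-in: the three-entry database, the hand-written 3-D embeddings, the `[IMG:...]` placeholder syntax, and the function names are hypothetical, and a real system would use a learned encoder and an approximate nearest-neighbour index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_emb, database, k=2):
    """Return the k database entries whose embeddings are closest to the query."""
    ranked = sorted(
        database, key=lambda doc: cosine(query_emb, doc["embedding"]), reverse=True
    )
    return ranked[:k]


def build_context(prompt, retrieved):
    """Prepend retrieved (image, caption) pairs to the prompt."""
    parts = [f"[IMG:{doc['image']}] {doc['caption']}" for doc in retrieved]
    return "\n".join(parts + [prompt])


database = [
    {"caption": "a golden retriever in a park", "image": "dog.png", "embedding": [0.9, 0.1, 0.0]},
    {"caption": "a red sports car", "image": "car.png", "embedding": [0.0, 0.2, 0.9]},
    {"caption": "a puppy chasing a frisbee", "image": "puppy.png", "embedding": [0.8, 0.3, 0.1]},
]
query_emb = [1.0, 0.2, 0.0]  # stand-in for an encoded text query
top = retrieve(query_emb, database, k=2)
context = build_context("Draw a dog playing fetch", top)
print(context)
```

Updating the system's knowledge amounts to appending entries to `database`; the generator's weights are untouched.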
Retrieval-augmented multimodal models have several advantages:
Factual grounding. Retrieved images and text provide concrete evidence. Instead of relying on potentially hallucinated knowledge from weights, the model can reference actual image-text pairs from a curated database.
Efficiency. Storing knowledge in an external database is cheaper than storing it in weights. A 7B parameter model with a 100M document database can, in principle, draw on more information than a 70B parameter model relying on its weights alone.
Updatability. The database can be updated without retraining. New images and documents are added by encoding and indexing them — no gradient computation needed.
CM3Leon (Yu et al. 2023) combined retrieval augmentation with the CM3 architecture and reported state-of-the-art text-to-image results with roughly 5x less training compute than comparable methods. The retrieved examples act as few-shot demonstrations, giving the model concrete examples of what good outputs look like for similar inputs.
Time for the showcase: an interactive explorer of the multimodal architecture landscape. Compare how different approaches handle the same task. See the tradeoffs between discrete and continuous representations, between unified and split losses, between speed and quality.
Select a task and see how three different architectures handle it. Compare token counts, quality, and speed across approaches. Drag the quality slider to explore the quality-speed frontier.
Three architectures compete on each task:
LLaVA-style (medium fusion): Frozen ViT + projection MLP + LLM. Fast for understanding tasks (captioning, VQA) because the vision encoder is efficient. Cannot generate images natively.
Chameleon-style (discrete unified): VQ-VAE tokenizer + single transformer. Can both understand and generate images. Lower quality images due to discrete bottleneck. Simple training.
Transfusion-style (mixed loss): Continuous image representations + AR text + diffusion images. Best image quality. Can do everything. Most complex training. Slower image generation (multiple denoising steps).
The quality slider shows the quality-speed frontier: at low quality targets, LLaVA-style wins on speed. At high quality targets, Transfusion wins on image quality. Chameleon occupies the middle ground with acceptable quality and the simplest training recipe.
How do you measure whether a multimodal model is any good? Text-only evaluation is already hard. Adding images makes it harder. A model's captioning accuracy, image generation quality, and cross-modal reasoning ability are all different dimensions of performance.
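Shi et al. (2024) proposed LMFusion, which adapts a pretrained text-only LLM for image understanding and generation by adding image-specific modules while keeping the text weights frozen. A related line of work, reconstruction alignment (2025), adds a reconstruction objective to multimodal training: the model must be able to reconstruct the image from its internal representations. This encourages the model to actually encode visual information rather than ignoring the image and relying on text-only priors.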
The reconstruction alignment score measures how well the model's internal image representation can reconstruct the original image. A model that ignores the image and answers from text-only priors would have low reconstruction alignment — revealing that its "multimodal" behavior is actually unimodal.
Yasunaga et al. (2024) created Multimodal RewardBench: a benchmark for evaluating reward models in multimodal settings. As multimodal models are increasingly fine-tuned with RLHF, the quality of the reward model becomes critical. Multimodal RewardBench tests whether reward models can distinguish between good and bad responses across multiple modalities.
[Interactive figure: dimensions of multimodal model quality, comparing how different models perform across the metrics in the table below.]
| Dimension | What It Measures | Benchmarks / Metrics |
|---|---|---|
| Visual Grounding | Can the model locate objects in images? | RefCOCO, Flickr30k |
| Visual QA | Can the model answer questions about images? | VQAv2, GQA, TextVQA |
| Image Generation | Quality and fidelity of generated images | FID, CLIP-score, human eval |
| Cross-Modal Reasoning | Can the model reason about image-text relationships? | MMMU, MathVista |
| Reward Accuracy | Can the reward model distinguish good from bad? | Multimodal RewardBench |
Multimodality is not a niche extension of language models; it is increasingly the default. Most frontier model families released in 2024-2025 accept images alongside text. The architectural choices we've covered define the landscape of modern AI.
| Paper | Contribution | Connection |
|---|---|---|
| Chameleon (Meta 2024) | Unified discrete tokens for text + images | Core of Ch 3 — the pure AR approach |
| Transfusion (Zhou 2024) | AR + diffusion in one transformer | Core of Ch 4 — dual-loss training |
| Mixture of Transformers (Liang 2024) | Modality-specific + shared parameters | Parameter sharing strategy (Ch 5) |
| Scaling Laws for Mixed-Modal (Aghajanyan 2023) | Scaling laws for multimodal training | Compute allocation (Ch 5) |
| CM3Leon (Yu 2023) | Retrieval-augmented, instruction-tuned CM3-style model | RAG for multimodal (Ch 6) |
| RA-CM3 (Yasunaga 2023) | Retrieval augmentation for multimodal LMs | RAG pipeline (Ch 6) |
| LMFusion (Shi 2024) | Adapts a frozen text-only LLM with image-specific modules | Evaluation and alignment (Ch 8) |
| OneFlow (Davtyan 2025) | Unifies AR and diffusion as special cases | Training unification (Ch 5) |
| Multimodal RewardBench (Yasunaga 2024) | Benchmark for multimodal reward models | Evaluation quality (Ch 8) |
| Reconstruction-Alignment (2025) | Alignment via image reconstruction | Genuine multimodal understanding (Ch 8) |
| Lecture | Relationship |
|---|---|
| L05: Transformers | Every multimodal architecture is built on the transformer. Attention mechanisms process both text and image tokens. Understanding self-attention is essential for understanding cross-modal attention. |
| L07: Pretraining | Pretraining objectives define what a model learns. Multimodal pretraining combines text objectives (MLM, CLM) with visual objectives (contrastive learning, diffusion). The interplay determines cross-modal capabilities. |
| L14: Tokenization | Image tokenization (VQ-VAE, patch embeddings) is the visual analogue of text tokenization (BPE). The same tradeoffs apply: vocabulary size vs sequence length, discrete vs continuous, resolution vs cost. |
How the approaches we covered relate to each other, from separate encoders to fully unified architectures.
| Concept | Core Idea | Key Tradeoff |
|---|---|---|
| Vision Encoders | Patchify images into tokens via ViT | Patch size: resolution vs sequence length |
| Fusion Strategies | When modalities first interact | Late (cheap) vs early (powerful) |
| Chameleon | Discrete tokens + pure AR loss | Simple training vs image quality |
| Transfusion | AR for text + diffusion for images | Best quality vs training complexity |
| Scaling Laws | Mixing ratio + parameter sharing | Cross-modal transfer vs capacity competition |
| RAG Multimodal | Retrieve relevant image-text pairs | Factual + updatable vs latency |
| Evaluation | Test genuine multimodal understanding | Accuracy alone vs reconstruction alignment |