CS224N Lecture 17 — Multimodality

Chapter 0: Why Multimodal?

A child learns language not from text alone but from a world of sounds, images, touch, and social interaction. When a toddler hears "dog" while seeing a dog, the word and the visual concept fuse into a single, grounded understanding. Text-only language models miss this grounding entirely — they know that "dog" follows "the" and precedes "barked," but have never seen a dog.

Multimodal models aim to close this gap by processing and generating multiple modalities — text, images, audio, video — within a single architecture. Instead of separate models for vision and language that are glued together, multimodal models share representations across modalities, creating unified understanding.

The stakes are high. The most capable AI systems of 2025 (GPT-4V, Gemini, Claude 3) are all multimodal. They can describe images, answer questions about charts, and read handwriting, and some can also generate images. But how do they do it? That's what this lesson explores.

[Interactive demo: Text + Image Fusion. The same prompt is shown with and without an image; multimodal models combine both signals to produce richer, grounded responses.]

The fundamental challenge of multimodality is representation alignment: text lives in a discrete token space, images live in a continuous pixel space, and audio lives in a waveform space. How do you get a single transformer to process all three?

Three families of solutions have emerged:

1. Encode-and-project (LLaVA-style): Use a pre-trained vision encoder to convert images into embedding vectors, then project them into the language model's embedding space. The language model treats image tokens like word tokens.

2. Discrete tokens (Chameleon-style): Quantize images into discrete tokens using a VQ-VAE, then interleave image tokens and text tokens in a single sequence. One tokenizer, one vocabulary, one autoregressive loss.

3. Mixed losses (Transfusion-style): Keep image representations continuous but train with a diffusion loss on images and an autoregressive loss on text. The model switches objectives mid-sequence based on modality.

This lesson teaches all three approaches, their tradeoffs, and the scaling laws that govern mixed-modal training.

Multimodality isn't just "add a vision encoder to a language model." It fundamentally changes the architecture, training, and evaluation. The design space is rich: discrete vs continuous image representations, single vs separate tokenizers, unified vs modality-specific losses. Each choice has deep implications for what the model can learn and generate.

Why can't a text-only language model truly "understand" the concept of a dog, even if it can generate accurate descriptions of dogs?

1. It can: text contains all necessary information.

2. Its knowledge of "dog" is purely distributional (what words appear near "dog") without any grounding in visual, auditory, or physical experience of actual dogs. It knows the linguistic context but not the perceptual referent.

3. Dogs are too complex for neural networks to represent.

Chapter 1: Vision Encoders

Before a transformer can process an image, it needs to convert pixels into tokens. The dominant approach is the Vision Transformer (ViT): divide the image into fixed-size patches, flatten each patch into a vector, and process them like a sequence of tokens.

Image to Patches

A ViT takes an image of size H×W×3 (height, width, RGB channels) and divides it into non-overlapping patches of size P×P. Each patch becomes one token. For a 224×224 image with P=16:

N_patches = (H/P) × (W/P) = (224/16) × (224/16) = 14 × 14 = 196 tokens

Each patch is a 16×16×3 = 768-dimensional vector (flattened pixels). A linear projection maps this to the transformer's d_model dimension. Positional embeddings are added to encode spatial location — the model needs to know that patch (0,0) is top-left and patch (13,13) is bottom-right.

[Interactive demo: Image to Patches. An image is divided into a grid of patches (14×14 at the default 16×16 patch size), and each patch becomes one token. A patch-size slider shows how resolution affects the number of tokens.]

From Patches to Embeddings

The patch embedding pipeline:

1. Patchify: Split the image [224, 224, 3] into patches [196, 16, 16, 3], then flatten to [196, 768].

2. Project: A linear layer maps [196, 768] → [196, d_model]; d_model is often 768 or 1024.

3. Add position: Add learned position embeddings [196, d_model] that encode each patch's location in the spatial grid.

4. Transform: Process through L transformer layers. Output: [196, d_model] contextualized features.

The output is 196 vectors of dimension d_model — one per patch — that encode both local visual features and global image context (through self-attention). These can be used for classification (add a [CLS] token), for image generation (decode patches back to pixels), or for multimodal models (feed to a language model).

Resolution-Token Tradeoff

Smaller patches give higher resolution but more tokens:

Patch Size	Patches (224px)	Tokens	Detail Level
32×32	7×7	49	Low (coarse features)
16×16	14×14	196	Medium (standard ViT)
14×14	16×16	256	High
8×8	28×28	784	Very high (expensive)

This is the vision equivalent of the tokenization tradeoff from L14: finer granularity gives more information but longer sequences. With attention's O(n²) cost, 784 tokens is 16x more expensive than 196 tokens. Models like SigLIP and InternViT use various tricks (pooling, windowed attention) to handle high-resolution images efficiently.
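The arithmetic behind the table can be checked with a few lines of Python. This is a minimal sketch: the helper names `num_patches` and `relative_attention_cost` are illustrative, and the cost model counts only the quadratic attention term, ignoring the feed-forward layers.

```python
def num_patches(img_size: int, patch_size: int) -> int:
    """Number of non-overlapping P x P patches in a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    per_side = img_size // patch_size
    return per_side * per_side


def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    """Ratio of the O(n^2) attention term against a baseline sequence length."""
    return (n_tokens / baseline_tokens) ** 2


if __name__ == "__main__":
    baseline = num_patches(224, 16)  # 196 tokens, the standard ViT setting
    for p in (32, 16, 14, 8):
        n = num_patches(224, p)
        cost = relative_attention_cost(n, baseline)
        print(f"patch {p:>2}: {n:>4} tokens, {cost:.2f}x attention cost")
```

Halving the patch size quadruples the token count, so the attention term grows by a factor of 16.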

python
import torch
import torch.nn as nn

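class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, d_model=768):
        super().__init__()
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2   # 196 for the defaults
        self.patch_dim = patch_size**2 * 3                # 768 for the defaults
        self.proj = nn.Linear(self.patch_dim, d_model)    # flattened pixels → d_model
        self.pos = nn.Parameter(torch.randn(self.n_patches, d_model))

    def forward(self, img):
        # img: [B, 3, H, W], e.g. [B, 3, 224, 224]
        B, p = img.shape[0], self.patch_size
        patches = img.unfold(2, p, p).unfold(3, p, p)     # [B, 3, 14, 14, 16, 16]
        # Move channels next to the pixel axes so that each row holds exactly one patch;
        # reshaping without this permute would mix pixels from different patches.
        patches = patches.permute(0, 2, 3, 1, 4, 5)       # [B, 14, 14, 3, 16, 16]
        patches = patches.reshape(B, self.n_patches, self.patch_dim)  # [B, 196, 768]
        tokens = self.proj(patches) + self.pos            # [B, 196, d_model]
        return tokens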

Vision Transformers treat images like text: divide into patches, embed each patch as a token, and process with standard attention. This is why the same transformer architecture works for both modalities — from the transformer's perspective, an image is just 196 extra tokens in the sequence. The difference is how those tokens are created (patch projection vs word embedding) and what loss is used (classification/reconstruction vs next-token prediction).

A 512×512 image is processed with 16×16 patches. How many image tokens does the ViT produce?

1. 196 tokens

2. 512 tokens

3. 1,024 tokens: (512/16)² = 32×32 = 1,024 patches, each becoming one token

Chapter 2: Early vs Late Fusion

Once you have image tokens and text tokens, the next question is: when do they interact? Fusion strategy determines where in the architecture the two modalities first see each other.

Late Fusion: Separate Encoders, Combined at the End

In late fusion, each modality has its own encoder. Text goes through a text transformer, images go through a vision transformer, and the outputs are combined only at the final layers (or after both encoders are done). CLIP is a classic late fusion model: separate text and image encoders produce embeddings that are aligned via contrastive learning.

Advantages: each encoder specializes in its modality. Disadvantages: the modalities can't help each other during encoding. The text encoder doesn't know what's in the image when encoding text, and vice versa.
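To make "aligned via contrastive learning" concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss. It assumes the two encoders have already produced one embedding per image and per caption; the function name and the fixed temperature of 0.07 are illustrative choices (CLIP itself learns the temperature).

```python
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over image-text similarities for a batch of pairs.

    img_emb, txt_emb: [B, d] outputs of the two separate encoders, where row i
    of each tensor comes from the same image-caption pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # [B, B] scaled cosine similarities
    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(img_emb.shape[0], device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # pick the right caption per image
    loss_t2i = F.cross_entropy(logits.t(), targets)       # pick the right image per caption
    return (loss_i2t + loss_t2i) / 2
```

Note that the two modalities only meet in the `img_emb @ txt_emb.t()` product: neither encoder sees the other's input, which is exactly the late-fusion limitation described above.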

Early Fusion: Interleaved from the Start

In early fusion, image and text tokens are interleaved into a single sequence and processed by a single transformer from layer 1. The model sees both modalities simultaneously at every layer. Chameleon is an openly documented example; Gemini and GPT-4V are generally described as using forms of early fusion, although their architectures have not been fully published.

Advantages: the modalities can attend to each other from the start, enabling deep cross-modal reasoning. Disadvantages: a single model must handle both modalities, requiring more capacity; training is more complex; and the model can't be pre-trained on each modality independently as easily.

[Interactive demo: Early vs Late Fusion. Two architectures for combining text and image: late fusion keeps them separate until the end; early fusion interleaves them from the start.]

Medium Fusion: The Practical Middle Ground

Most production multimodal models use a middle approach: a pre-trained vision encoder processes the image, its output tokens are projected into the language model's space, and the language model processes interleaved image + text tokens. This is medium fusion — images are pre-processed but then fully interleaved during the main transformer.

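LLaVA (Liu et al. 2023) exemplifies this: a frozen CLIP ViT encodes the image, a small projection module (a linear layer or MLP) maps the resulting image patch tokens into the LLM's embedding space, and the LLM processes [image_tokens] + [text_tokens] as a single sequence. The number of image tokens is 196 in the ViT example above; the exact count depends on the encoder's resolution and patch size. The vision encoder stays frozen; only the projection and the LLM are trained.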
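The projection step can be sketched as follows. This is a hedged illustration, not LLaVA's actual code: `VisionProjector` and `build_llm_inputs` are hypothetical names, the dimensions are placeholders, and random tensors stand in for the frozen vision encoder's outputs and the LLM's word embeddings.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps vision-encoder features into the LLM embedding space (illustrative)."""

    def __init__(self, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(vision_feats)


def build_llm_inputs(vision_feats, text_embeds, projector):
    """Concatenate projected image tokens and text embeddings along the sequence axis."""
    image_tokens = projector(vision_feats)                 # [B, N_img, d_llm]
    return torch.cat([image_tokens, text_embeds], dim=1)   # [B, N_img + N_txt, d_llm]


if __name__ == "__main__":
    projector = VisionProjector(d_vision=1024, d_llm=512)  # small d_llm for a quick demo
    with torch.no_grad():
        vision_feats = torch.randn(2, 196, 1024)  # stand-in for frozen ViT outputs
    text_embeds = torch.randn(2, 12, 512)         # stand-in for LLM word embeddings
    inputs = build_llm_inputs(vision_feats, text_embeds, projector)
    print(inputs.shape)  # torch.Size([2, 208, 512])
```

After the concatenation, the LLM's self-attention runs over image and text positions alike, which is what gives medium fusion its cross-modal attention in every LLM layer.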

Strategy	Example	Image Encoder	Cross-Modal Attention	Training Cost
Late Fusion	CLIP	Separate ViT	None (embeddings compared after encoding)	Low
Medium Fusion	LLaVA	Frozen ViT + MLP	All LLM layers	Medium
Early Fusion	Chameleon	Integrated VQ-VAE	All layers from start	High

There's no universally best fusion strategy. Late fusion is cheapest and works for retrieval tasks (find the image that matches this text). Early fusion is most powerful but most expensive, enabling tasks like "describe the third person from the left in this image." Medium fusion balances cost and capability — it's why most deployed multimodal LLMs use a variant of the LLaVA approach.

A multimodal model needs to answer: "What color is the hat worn by the person holding the sign?" This requires cross-referencing spatial details between the question and the image. Which fusion strategy handles this best?

1. Late fusion: separate encoders are sufficient.

2. Late fusion with a post-processing step.

3. Early or medium fusion: the model needs to attend from the question tokens ("hat", "person", "sign") directly to specific image patch tokens simultaneously, which requires cross-modal attention within the transformer layers.

Chapter 3: Chameleon

What if images and text were truly the same thing to a model — just different tokens in the same vocabulary? That's the radical idea behind Chameleon (Meta, 2024): a single unified token space for all modalities, trained with a single autoregressive loss.

The Unified Token Approach

Chameleon uses a VQ-VAE (Vector-Quantized Variational Autoencoder) to convert images into discrete tokens. The VQ-VAE has a learned codebook of 8,192 visual "words." Each image patch is mapped to its nearest codebook entry, producing a sequence of discrete image tokens.

These image tokens are added to the text vocabulary (which might be 32K BPE tokens). The combined vocabulary is ~40K tokens, and the model treats image tokens and text tokens identically. From the transformer's perspective, there is no difference between predicting the next word and predicting the next image patch.

[Interactive demo: Unified Token Pipeline. Text and images are both tokenized into the same vocabulary, and the transformer processes a single interleaved sequence, generated one token at a time.]

Training: One Loss to Rule Them All

Chameleon's training objective is pure next-token prediction, regardless of modality:

L = -∑_t log P(x_t | x_<t)

Where x_t can be either a text token or an image token. This simplicity is the key advantage: no special handling for images, no separate loss functions, no modality-specific architecture. The same attention mechanism, the same positional encoding, the same output head predict the next token whether it's a word or a visual patch.
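A minimal sketch of this objective in PyTorch follows. The vocabulary sizes mirror the numbers in the text (32K text tokens is the lesson's own "might be" figure), the helper names are hypothetical, and `F.cross_entropy` averages over positions rather than summing, which differs from the formula only by a constant factor.

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32_000    # assumed text vocabulary size
IMAGE_VOCAB = 8_192    # VQ-VAE codebook size from the text
VOCAB = TEXT_VOCAB + IMAGE_VOCAB


def to_unified_ids(image_codes: torch.Tensor) -> torch.Tensor:
    """Shift VQ codebook indices so they sit after the text IDs in one vocabulary."""
    return image_codes + TEXT_VOCAB


def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Mean of -log P(x_t | x_<t) over the sequence, identical for every modality.

    logits: [B, T, VOCAB] model outputs; tokens: [B, T] unified token IDs.
    """
    pred = logits[:, :-1, :].reshape(-1, logits.shape[-1])  # predictions made at positions 0..T-2
    target = tokens[:, 1:].reshape(-1)                       # the tokens that actually came next
    return F.cross_entropy(pred, target)
```

Nothing in `next_token_loss` checks whether a target is a word or an image code; the modality only shows up in which range of IDs the token falls into.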

The training data is interleaved text-image documents. A training example might look like:

training sequence
[TEXT] A golden retriever playing in the park
[IMG_TOK_1] [IMG_TOK_2] ... [IMG_TOK_1024]
[TEXT] The dog is chasing a red frisbee
[IMG_TOK_1] [IMG_TOK_2] ... [IMG_TOK_1024]
[TEXT] Dogs are the most popular pet in the US

The VQ-VAE Image Tokenizer

The quality of the entire system depends on the image tokenizer. Chameleon's VQ-VAE works like this:

1. Encode: A CNN encoder maps the image [3, 256, 256] → feature grid [d, 32, 32] = 1,024 vectors.

2. Quantize: Each vector is replaced by its nearest codebook entry (8,192 options), so each patch becomes one discrete token ID.

3. Decode: Codebook vectors → CNN decoder → reconstructed image [3, 256, 256]. Quality depends on codebook size.

The codebook size (8,192 for Chameleon) determines the visual vocabulary size. Larger codebooks capture finer image detail but enlarge the output softmax and spread the training signal over more, rarely used codes. Smaller codebooks are computationally cheaper but lose visual information. This is directly analogous to the vocabulary-size tradeoff in text tokenization (L14).
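The quantize step is a nearest-neighbor lookup, which can be sketched in NumPy. This is an illustration under stated assumptions, not Chameleon's tokenizer: the function name is hypothetical, the codebook here would be random rather than learned, and the encoder and decoder networks are left out.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray):
    """Replace each feature vector by its nearest codebook entry.

    features: [N, d] encoder outputs (e.g. N = 32 * 32 = 1024 grid cells).
    codebook: [K, d] entries (K = 8192 in the text).
    Returns (token_ids [N], quantized vectors [N, d]).
    """
    # Squared Euclidean distance via ||f||^2 - 2 f.c + ||c||^2, shape [N, K]
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    token_ids = dists.argmin(axis=1)      # one discrete ID per grid cell
    return token_ids, codebook[token_ids]
```

The returned `token_ids` are what the transformer sees; the quantized vectors are what the decoder receives, and the gap between them and the original features is the information lost to quantization.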

Chameleon's insight: if you tokenize images into discrete tokens, you can train with pure next-token prediction across modalities. One architecture, one loss, one vocabulary for both text and images. The cost is image quality — discrete tokenization loses information compared to continuous representations. But the simplicity of unified training is enormously appealing.

Challenges

Image quality. VQ-VAE reconstruction introduces artifacts. Fine details (text in images, small objects) are lost in the quantization step. The 8K codebook can't represent all visual concepts with the same fidelity as continuous representations.

Sequence length. A 256×256 image produces 1,024 tokens (32×32 patches). A document with 10 images therefore costs about 10K tokens for the images alone, nearly a third of a 32K context window before any text is added.

Training instability. The Chameleon paper reports needing stabilization techniques (QK-norm, reordered layer norms, dropout, and z-loss regularization) to train reliably. Mixing modalities creates loss landscapes that are harder to optimize than pure text training.
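The sequence-length challenge above is simple arithmetic, sketched here with a hypothetical helper. It assumes 1,024 tokens per image as in the text and takes "32K" to mean 32,768 tokens.

```python
def image_token_budget(n_images: int, tokens_per_image: int = 1024,
                       context_window: int = 32_768) -> dict:
    """Split a context window between image tokens and whatever is left for text."""
    image_tokens = n_images * tokens_per_image
    return {
        "image_tokens": image_tokens,
        "text_tokens_left": max(context_window - image_tokens, 0),
        "image_fraction": image_tokens / context_window,
    }


if __name__ == "__main__":
    for n in (1, 10, 32):
        print(n, image_token_budget(n))
```

With these assumptions, 32 images fill the window completely, so image-heavy documents need either a longer context or fewer tokens per image.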

Chameleon uses a VQ-VAE with an 8,192-entry codebook to tokenize images. What is the direct analogy to text tokenization?

1. The codebook is the visual vocabulary: just as a BPE tokenizer has tens of thousands of text tokens, the VQ-VAE has 8,192 visual tokens. Each image patch maps to one of these 8,192 entries, much as each subword piece maps to one entry in the BPE vocabulary.

2. The codebook is like the attention mechanism.

3. There is no analogy between visual and text tokenization.

Chapter 4: Transfusion

Chameleon forces images through a discrete bottleneck (VQ-VAE). But image generation has seen massive success with diffusion models, which work with continuous representations. Can we combine the best of both worlds?

Transfusion (Zhou et al. 2024) does exactly this: train a single transformer with two losses — autoregressive next-token prediction for text, and diffusion denoising for images. The model switches objectives within a single sequence based on modality markers.

The Dual-Loss Architecture

In a Transfusion training sequence, text tokens are trained with cross-entropy loss (predict the next token). Image tokens are trained with diffusion loss (denoise a noisy version of the image). The model learns both objectives simultaneously.

L_total = L_AR(text tokens) + λ · L_diff(image tokens)

Where L_AR is the standard autoregressive cross-entropy loss and L_diff is the diffusion denoising loss. The hyperparameter λ balances the two losses; it is typically chosen so that neither objective dominates the gradient.
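The combined objective can be sketched in a few lines. This is an assumption-laden illustration rather than the paper's implementation: it takes L_diff to be a DDPM-style mean-squared error between predicted and true noise, and the function names and the scalar `alpha_bar` noise level are hypothetical.

```python
import torch
import torch.nn.functional as F


def add_noise(latents: torch.Tensor, noise: torch.Tensor, alpha_bar: float) -> torch.Tensor:
    """Forward diffusion step: mix clean image latents with Gaussian noise."""
    return alpha_bar ** 0.5 * latents + (1.0 - alpha_bar) ** 0.5 * noise


def mixed_modal_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
    """L_total = L_AR(text tokens) + lam * L_diff(image tokens).

    text_logits: [N_text, vocab] predictions at text positions.
    text_targets: [N_text] next-token IDs.
    noise_pred, noise: [N_img, d] predicted and true noise at image positions.
    """
    l_ar = F.cross_entropy(text_logits, text_targets)   # next-token prediction on text
    l_diff = F.mse_loss(noise_pred, noise)              # denoising objective on image latents
    return l_ar + lam * l_diff
```

In training, the image latents would be noised with `add_noise` before entering the transformer, and the model's outputs at image positions would be read as `noise_pred`, while outputs at text positions go through the usual vocabulary head.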

[Interactive demo: AR + Diffusion Loss. Text tokens use autoregressive loss (predict the next token); image tokens use diffusion loss (denoise). The model handles both within a single forward pass.]

Why Not Just Autoregressive for Everything?

Chameleon uses autoregressive loss for images. Transfusion uses diffusion loss for images. Why does this matter?

Autoregressive on images requires discretizing the image (VQ-VAE). Discrete tokens lose information. The model must predict image patches left-to-right, top-to-bottom — an arbitrary ordering that doesn't match how visual information is structured. The upper-left patch is generated before the lower-right, even though they might depict the same object.

Diffusion on images works with continuous representations. No information is lost to quantization. The model denoises the entire image simultaneously, producing all patches in parallel. This matches the spatial structure of images better and produces higher quality results.

Property	AR (Chameleon)	Diffusion (Transfusion)
Image representation	Discrete tokens (VQ-VAE)	Continuous latents
Generation order	Left-to-right, one token at a time	All patches simultaneously
Image quality	Limited by codebook size	High (continuous = no bottleneck)
Training simplicity	One loss function	Two loss functions
Inference speed	Fast (one pass per token)	Slower (multiple denoising steps)

Transfusion's key result: at the same compute budget, Transfusion produces significantly better images than Chameleon-style AR approaches while maintaining competitive text quality. The dual-loss design doesn't hurt either modality — it helps both by letting each modality use its optimal training objective.

Transfusion's insight: don't force all modalities through the same loss function. Text is naturally discrete and sequential — autoregressive loss is perfect. Images are naturally continuous and spatial — diffusion loss is better. By combining both losses in one model, Transfusion gets the best of both paradigms without sacrificing either.

Transfusion uses autoregressive loss for text and diffusion loss for images. Why is diffusion better than autoregressive for image generation?

1. Diffusion works with continuous representations (no lossy VQ-VAE bottleneck) and generates all patches simultaneously (matching the spatial nature of images), while autoregressive generation forces an arbitrary left-to-right ordering and discretization.

2. Diffusion is faster.

3. Diffusion produces more text-like outputs.

Chapter 5: Scaling Mixed-Modal

How should you allocate compute when training on both text and images? Should images and text share all parameters, or have some modality-specific components? This is the domain of mixed-modal scaling laws, and the answers are surprising.

Aghajanyan et al. (2023) in "Scaling Laws for Generative Mixed-Modal Language Models" showed that mixed-modal training follows its own scaling laws, distinct from pure text scaling. Two key findings:

1. Modality mixing ratio matters enormously. Training on 80% text + 20% images produces different scaling behavior than 50/50 or 20/80. The optimal ratio depends on the task distribution you care about.

2. More data of one modality has diminishing returns for the other. Adding more images improves image understanding but eventually stops helping text performance. This suggests the modalities partially compete for model capacity.

Mixture of Transformers

Liang et al. (2024) proposed Mixture of Transformers (MoT): a shared transformer where some components (attention, feed-forward) have modality-specific parameters while others are shared. This addresses the capacity competition problem.

[Interactive demo: Routing Visualization. In MoT, some transformer components are shared across modalities while others are modality-specific; a sharing-ratio slider (default 50%) shows how parameters are allocated.]

At one extreme (100% shared), all parameters are shared — this is Chameleon. At the other extreme (0% shared), you have two completely separate models — no cross-modal transfer. MoT finds that 30-50% sharing is optimal: enough sharing for cross-modal transfer, enough specialization for per-modality quality.

The shared components tend to learn general sequence processing capabilities: attention patterns, positional understanding, basic feature extraction. The modality-specific components learn modality-specific features: visual texture detectors in the image branch, syntactic patterns in the text branch.
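One way to picture the split between shared and modality-specific parameters is the simplified block below. It is a sketch under assumptions, not the MoT paper's design: here self-attention is shared and only the feed-forward network is duplicated per modality, the class name is hypothetical, causal masking is omitted, and both feed-forward networks run on every position for simplicity.

```python
import torch
import torch.nn as nn


class ModalityRoutedBlock(nn.Module):
    """Transformer block with shared attention and per-modality feed-forward layers."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # shared
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn_text = self._ffn(d_model)    # text-only parameters
        self.ffn_image = self._ffn(d_model)   # image-only parameters

    @staticmethod
    def _ffn(d_model: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: [B, T, d_model]; is_image: [B, T] boolean mask marking image positions
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # every token attends to every token
        x = x + attn_out
        h = self.norm2(x)
        # Route each position through the feed-forward network of its own modality.
        ffn_out = torch.where(is_image.unsqueeze(-1), self.ffn_image(h), self.ffn_text(h))
        return x + ffn_out
```

Because attention is shared, text and image tokens still exchange information in every block; because the feed-forward weights are separate, gradient updates from image tokens never touch `ffn_text`, and vice versa.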

OneFlow: Unifying the Approaches

Davtyan & Favaro (2025) proposed OneFlow, which unifies autoregressive and diffusion training into a single framework. Instead of treating AR and diffusion as fundamentally different, OneFlow shows they're both special cases of a more general "flow" framework. This opens the door to smoothly interpolating between the two objectives per-modality, potentially finding even better mixed-modal training recipes.

Mixed-modal scaling is not just "add image data to text training." The ratio, the parameter sharing strategy, and the loss function all interact in complex ways. Too much image data hurts text quality. Too little sharing prevents cross-modal transfer. The frontier of mixed-modal training is finding the optimal operating point for each use case.

In Mixture of Transformers (MoT), what is the benefit of having some modality-specific parameters alongside shared parameters?

1. Shared parameters enable cross-modal transfer (e.g., text concepts helping image understanding), while modality-specific parameters prevent the modalities from competing for capacity and allow each to develop specialized features.

2. It makes the model smaller.

3. It eliminates the need for pre-training.

Chapter 6: Retrieval-Augmented Multimodal

Not all knowledge needs to be stored in model weights. Retrieval-augmented generation (RAG) augments the model with an external memory of text-image pairs, retrieving relevant examples at inference time. This is particularly powerful for multimodal models because images are expensive to memorize in weights.

RA-CM3: Retrieval-Augmented Multimodal

Yasunaga et al. (2023) introduced RA-CM3: a multimodal language model augmented with a retrieval module. Given a text prompt, the model retrieves relevant text-image pairs from a large database and includes them in the context.

[Interactive demo: RAG Pipeline for Multimodal. Given a query, the model retrieves relevant text-image pairs from a database, then uses them as context to generate a response, in four steps.]

The retrieval pipeline:

1. Encode query: CLIP encodes the query (text or image) into an embedding vector.

2. Retrieve: Nearest-neighbor search in a pre-built index of text-image pairs returns the top-K matches.

3. Prepend: Retrieved pairs are prepended to the prompt as additional context: [retrieved_img] [retrieved_text] [query].

4. Generate: The model generates a response conditioned on both the query and the retrieved context.
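The first three steps of the pipeline can be sketched with a brute-force index in NumPy. Everything here is illustrative: `PairIndex` and `build_prompt` are hypothetical names, hand-written vectors stand in for CLIP embeddings, and a real system would use an approximate nearest-neighbor library and pass actual image tokens rather than file names.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class PairIndex:
    """Brute-force nearest-neighbor index over text-image pairs."""

    def __init__(self, dim: int):
        self.embeddings = np.empty((0, dim))
        self.pairs = []

    def add(self, embedding, image_ref: str, caption: str) -> None:
        """Index a new pair; no model weights are touched."""
        unit = normalize(np.asarray(embedding, dtype=float))
        self.embeddings = np.vstack([self.embeddings, unit[None, :]])
        self.pairs.append((image_ref, caption))

    def search(self, query_emb, k: int = 2):
        """Return the top-k pairs by cosine similarity to the query embedding."""
        scores = self.embeddings @ normalize(np.asarray(query_emb, dtype=float))
        top = np.argsort(-scores)[:k]
        return [self.pairs[i] for i in top]


def build_prompt(retrieved, query: str) -> str:
    """Prepend retrieved pairs: [retrieved_img] [retrieved_text] ... [query]."""
    context = " ".join(f"[{img}] {caption}" for img, caption in retrieved)
    return f"{context} {query}"


if __name__ == "__main__":
    index = PairIndex(dim=3)
    index.add([1.0, 0.0, 0.0], "dog.jpg", "A golden retriever")
    index.add([0.0, 1.0, 0.0], "cat.jpg", "A tabby cat")
    index.add([0.9, 0.1, 0.0], "puppy.jpg", "A puppy")
    hits = index.search([1.0, 0.0, 0.0], k=2)
    print(build_prompt(hits, "What breed is this?"))
```

The `add` method is also the whole update path: new knowledge enters the system by encoding and indexing, with no gradient step.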

Why Retrieval Helps

Retrieval-augmented multimodal models have several advantages:

Factual grounding. Retrieved images and text provide concrete evidence. Instead of relying on potentially hallucinated knowledge from weights, the model can reference actual image-text pairs from a curated database.

Efficiency. Storing knowledge in an external database is cheaper than storing it in weights. In principle, a 7B parameter model with a 100M document database can draw on more information than a 70B parameter model relying on its weights alone.

Updatability. The database can be updated without retraining. New images and documents are added by encoding and indexing them — no gradient computation needed.

CM3Leon: Best-in-Class Retrieval-Augmented

CM3Leon (Yu et al. 2023) combined retrieval augmentation with the CM3 architecture and showed that retrieval at inference time could match models trained on 5x more data. The retrieved examples act as few-shot demonstrations, giving the model concrete examples of what good outputs look like for similar inputs.

Retrieval augmentation is the multimodal equivalent of "looking things up." Instead of memorizing every fact in weights, the model retrieves relevant text-image pairs from an external database. This is cheaper, more factual, and updatable. The tradeoff: retrieval adds latency and depends on database quality.

A retrieval-augmented multimodal model can be updated with new knowledge (e.g., images of a newly built building) without retraining. How?

1. Encode the new images and text with the retrieval encoder and add them to the database index; they'll be retrieved when relevant queries are asked, with no weight updates needed.

2. Fine-tune the model on the new images.

3. It can't be updated without retraining.

Chapter 7: Multimodal Playground

Time for the showcase: an interactive explorer of the multimodal architecture landscape. Compare how different approaches handle the same task. See the tradeoffs between discrete and continuous representations, between unified and split losses, between speed and quality.

[Interactive demo: Multimodal Architecture Comparator. Select a task and see how three different architectures handle it, comparing token counts, quality, and speed. A quality-target slider (default: Medium) explores the quality-speed frontier.]

What the Playground Shows

Three architectures compete on each task:

LLaVA-style (medium fusion): Frozen ViT + projection MLP + LLM. Fast for understanding tasks (captioning, VQA) because the vision encoder is efficient. Cannot generate images natively.

Chameleon-style (discrete unified): VQ-VAE tokenizer + single transformer. Can both understand and generate images. Lower quality images due to discrete bottleneck. Simple training.

Transfusion-style (mixed loss): Continuous image representations + AR text + diffusion images. Best image quality. Can do everything. Most complex training. Slower image generation (multiple denoising steps).

The quality slider shows the quality-speed frontier: at low quality targets, LLaVA-style wins on speed. At high quality targets, Transfusion wins on image quality. Chameleon occupies the middle ground with acceptable quality and the simplest training recipe.

There is no single best multimodal architecture. The choice depends on your requirements: understand-only (LLaVA), generate-everything-simply (Chameleon), or generate-everything-well (Transfusion). Production systems increasingly use combinations — fast models for understanding, quality models for generation, routing between them based on the task.

Chapter 8: Evaluation & Rewards

How do you measure whether a multimodal model is any good? Text-only evaluation is already hard. Adding images makes it harder. A model's captioning accuracy, image generation quality, and cross-modal reasoning ability are all different dimensions of performance.

LMFusion and Reconstruction-Based Alignment

Shi et al. (2024) proposed LMFusion, which adds a reconstruction alignment objective to multimodal training: the model must be able to reconstruct the image from its internal representations. This encourages the model to actually encode visual information rather than ignoring the image and relying on text-only priors.

The reconstruction alignment score measures how well the model's internal image representation can reconstruct the original image. A model that ignores the image and answers from text-only priors would have low reconstruction alignment — revealing that its "multimodal" behavior is actually unimodal.

Multimodal RewardBench

Yasunaga et al. (2024) created Multimodal RewardBench: a benchmark for evaluating reward models in multimodal settings. As multimodal models are increasingly fine-tuned with RLHF, the quality of the reward model becomes critical. Multimodal RewardBench tests whether reward models can distinguish between good and bad responses across multiple modalities.

[Interactive demo: Alignment Quality Meters. Four dimensions of multimodal model quality are shown as bars for a selected model (default: LLaVA); a perfect model would max all four.]

Key Evaluation Dimensions

Dimension	What It Measures	Benchmarks and Metrics
Visual Grounding	Can the model locate objects in images?	RefCOCO, Flickr30k
Visual QA	Can the model answer questions about images?	VQAv2, GQA, TextVQA
Image Generation	Quality and fidelity of generated images	FID, CLIP-score, human eval
Cross-Modal Reasoning	Can the model reason about image-text relationships?	MMMU, MathVista
Reward Accuracy	Can the reward model distinguish good from bad?	Multimodal RewardBench

Multimodal evaluation must test more than just accuracy — it must test whether the model is actually using the image. A model that achieves 80% on VQA by ignoring the image and using text-only priors is not a multimodal model. Reconstruction alignment, image-dependent questions, and adversarial tests are all needed to verify genuine multimodal understanding.
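A simple image-ablation check makes this measurable: run the benchmark twice, once with the real images and once with the images replaced by noise, and compare accuracy. The sketch below uses hypothetical helper names and assumes predictions have already been collected as plain strings.

```python
def accuracy(predictions, answers) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def image_reliance_gap(preds_with_image, preds_with_noise, answers) -> float:
    """Accuracy lost when real images are replaced by noise.

    A gap near zero suggests the model answers from text-only priors.
    """
    return accuracy(preds_with_image, answers) - accuracy(preds_with_noise, answers)


if __name__ == "__main__":
    answers = ["red", "two", "cat", "left"]
    with_image = ["red", "two", "cat", "right"]   # 3 of 4 correct
    with_noise = ["red", "two", "dog", "right"]   # 2 of 4 correct
    print(image_reliance_gap(with_image, with_noise, answers))  # 0.25
```

Reporting this gap next to the headline accuracy shows how much of the score actually depends on the image.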

A multimodal model achieves 85% accuracy on a visual QA benchmark. But when the images are replaced with random noise, accuracy drops only to 78%. What does this reveal?

1. The model is excellent at visual understanding.

2. The model is mostly relying on text-only priors, not the image: the small 7-point drop means the image contributes very little to its answers. The benchmark probably has questions answerable from text context alone, making the 85% accuracy misleading about the model's actual multimodal capability.

3. The noise images are informative.

Chapter 9: Connections

Multimodality is not a niche extension of language models; it's increasingly the default. Most major frontier models released in 2024-2025 accept more than one modality. The architectural choices we've covered define the landscape of modern AI.

Key Papers

Paper	Contribution	Connection
Chameleon (Meta 2024)	Unified discrete tokens for text + images	Core of Ch 3 — the pure AR approach
Transfusion (Zhou 2024)	AR + diffusion in one transformer	Core of Ch 4 — dual-loss training
Mixture of Transformers (Liang 2024)	Modality-specific + shared parameters	Parameter sharing strategy (Ch 5)
Scaling Laws for Mixed-Modal (Aghajanyan 2023)	Scaling laws for multimodal training	Compute allocation (Ch 5)
CM3Leon (Yu 2023)	Best retrieval-augmented multimodal model	RAG for multimodal (Ch 6)
RA-CM3 (Yasunaga 2023)	Retrieval augmentation for multimodal LMs	RAG pipeline (Ch 6)
LMFusion (Shi 2024)	Reconstruction alignment objective	Evaluation and alignment (Ch 8)
OneFlow (Davtyan 2025)	Unifies AR and diffusion as special cases	Training unification (Ch 5)
Multimodal RewardBench (Yasunaga 2024)	Benchmark for multimodal reward models	Evaluation quality (Ch 8)
Reconstruction-Alignment (2025)	Alignment via image reconstruction	Genuine multimodal understanding (Ch 8)

Lecture Connections

Lecture	Relationship
L05: Transformers	Every multimodal architecture is built on the transformer. Attention mechanisms process both text and image tokens. Understanding self-attention is essential for understanding cross-modal attention.
L07: Pretraining	Pretraining objectives define what a model learns. Multimodal pretraining combines text objectives (MLM, CLM) with visual objectives (contrastive learning, diffusion). The interplay determines cross-modal capabilities.
L14: Tokenization	Image tokenization (VQ-VAE, patch embeddings) is the visual analogue of text tokenization (BPE). The same tradeoffs apply: vocabulary size vs sequence length, discrete vs continuous, resolution vs cost.

The Big Picture

[Figure: Multimodal Architecture Landscape. How the approaches we covered relate to each other, from separate encoders to fully unified architectures.]

What We Covered

Concept	Core Idea	Key Tradeoff
Vision Encoders	Patchify images into tokens via ViT	Patch size: resolution vs sequence length
Fusion Strategies	When modalities first interact	Late (cheap) vs early (powerful)
Chameleon	Discrete tokens + pure AR loss	Simple training vs image quality
Transfusion	AR for text + diffusion for images	Best quality vs training complexity
Scaling Laws	Mixing ratio + parameter sharing	Cross-modal transfer vs capacity competition
RAG Multimodal	Retrieve relevant image-text pairs	Factual + updatable vs latency
Evaluation	Test genuine multimodal understanding	Accuracy alone vs reconstruction alignment

"The limits of my language mean the limits of my world." — Ludwig Wittgenstein. Multimodal models expand those limits beyond language to vision, and eventually to audio, video, and the physical world. The architectures in this lecture are the first steps toward AI systems that understand the world the way we do — through multiple senses, simultaneously.