Mixed-Modal Early-Fusion Foundation Models — tokenize everything (text AND images) into one sequence, train one transformer end-to-end.
You want to build a single model that can read text, look at images, and generate both. You type: "Here's a photo of my garden. What flowers are these? Also, show me what it would look like in autumn." The model should understand the image, answer in text, and generate a new image, all in one conversation, with one model and one set of weights.
Today's multimodal systems don't actually do this. They are Frankenstein architectures: a vision encoder bolted onto a language model with an adapter module in between. LLaVA, Flamingo, and (as far as is publicly known) GPT-4V all follow the same pattern: image → vision encoder → adapter → language model → text output.
This approach has three problems:
| Problem | Why It Happens | Consequence |
|---|---|---|
| No image generation | The LM only outputs text tokens. It can't produce images. | You need a completely separate model (DALL-E, Stable Diffusion) for image generation. |
| Shallow fusion | Image features are injected at one layer. The model can't deeply reason across modalities. | Struggles with tasks requiring tight image-text interleaving. |
| Training fragmentation | Vision encoder, adapter, and LM are trained separately with different objectives. | Representation gaps between modalities. Information lost at each boundary. |
What if we eliminated the boundary entirely? What if images and text were just... tokens? The same tokens, in the same sequence, processed by the same transformer?
This is early fusion: modalities are merged at the token level, before the transformer sees them. The alternative, late fusion, keeps modalities separate until the final layers. Chameleon bets that early fusion enables richer cross-modal reasoning — and the results back it up.
If we want to treat images as tokens, we need a way to convert a continuous image (a grid of RGB pixels) into a sequence of discrete integer IDs — just like how a text tokenizer converts "Hello world" into [15496, 995]. This is the job of the image tokenizer.
Chameleon uses a model based on Make-A-Scene's image tokenizer, which is itself a VQ-VAE (Vector Quantized Variational Autoencoder). Here's how it works:
The key numbers: a 512×512 image becomes 1024 tokens (32 × 32 grid positions) from a codebook of 8192 entries. These 8192 image codes are added to the text vocabulary (which has ~65K BPE tokens), giving a combined vocabulary of ~73K tokens.
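The arithmetic behind these numbers is easy to check. The sketch below uses only the figures quoted above; the constant names are illustrative, and the compression estimate assumes 8-bit RGB input, which the text does not state.

```python
# Token-count arithmetic for the image tokenizer, using figures from the text.
IMAGE_SIZE = 512          # pixels per side
GRID_SIZE = 32            # latent grid positions per side
CODEBOOK_SIZE = 8192      # number of discrete image codes
TEXT_VOCAB_SIZE = 65536   # BPE text tokens (the "~65K" above)

downsample_factor = IMAGE_SIZE // GRID_SIZE       # each token covers a 16x16 pixel patch
tokens_per_image = GRID_SIZE * GRID_SIZE          # sequence length of one image
total_vocab = TEXT_VOCAB_SIZE + CODEBOOK_SIZE     # combined vocabulary size
# Exact log2 because the codebook size is a power of two.
bits_per_token = CODEBOOK_SIZE.bit_length() - 1

# Assumption: 8 bits per channel, 3 channels per pixel.
raw_bits = IMAGE_SIZE * IMAGE_SIZE * 3 * 8
token_bits = tokens_per_image * bits_per_token
compression_ratio = raw_bits / token_bits

print(downsample_factor, tokens_per_image, total_vocab, bits_per_token)
print(f"compression ratio: about {compression_ratio:.0f}x")
```

Under these assumptions each token stands in for a 16×16 pixel patch, which is why the reconstruction losses described next matter so much.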
Crucially, the image tokenizer is trained before Chameleon itself, and then frozen. Chameleon never updates the image tokenizer — it just uses the discrete codes as input and output. This separation means the tokenizer can be optimized for image reconstruction quality independently.
The tokenizer uses several tricks to keep reconstructions sharp despite the heavy compression:
| Component | Purpose | Effect |
|---|---|---|
| Adversarial loss | PatchGAN discriminator | Prevents blurry reconstructions by penalizing "too smooth" outputs |
| Perceptual loss | LPIPS (learned perceptual similarity) | Matches high-level features, not just pixel values |
| Codebook EMA | Exponential moving average codebook update | Prevents codebook collapse (most codes going unused) |
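Codebook collapse is easy to detect by counting which codes the tokenizer actually emits. The following is a minimal diagnostic sketch, not something described in the source: `codebook_usage` is a hypothetical helper, and the toy 8-entry codebook stands in for the real 8192 entries.

```python
import math
from collections import Counter

def codebook_usage(ids, codebook_size):
    """Return (fraction of codes used, perplexity of the code distribution)."""
    counts = Counter(ids)
    total = len(ids)
    used_fraction = len(counts) / codebook_size
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return used_fraction, math.exp(entropy)

# Toy example with an 8-entry codebook.
healthy = [0, 1, 2, 3, 4, 5, 6, 7]      # every code used once
collapsed = [3, 3, 3, 3, 3, 3, 3, 5]    # almost everything maps to code 3

healthy_used, healthy_ppl = codebook_usage(healthy, codebook_size=8)
collapsed_used, collapsed_ppl = codebook_usage(collapsed, codebook_size=8)
print(healthy_used, healthy_ppl)
print(collapsed_used, collapsed_ppl)
```

A perplexity close to the codebook size means the codes are used evenly; a perplexity near 1 means the tokenizer has collapsed onto a handful of entries.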
```python
import torch

# Image tokenization in Chameleon
def tokenize_image(image, encoder, codebook):
    # image: [B, 3, 512, 512], a batch of RGB images
    z = encoder(image)                                # [B, 256, 32, 32] continuous features
    z_flat = z.permute(0, 2, 3, 1).reshape(-1, 256)   # [B*1024, 256]
    # Find the nearest codebook vector for each spatial position
    dists = torch.cdist(z_flat, codebook.weight)      # [B*1024, 8192]
    ids = dists.argmin(dim=-1)                        # [B*1024]
    ids = ids.reshape(-1, 1024)                       # [B, 1024]
    # Offset by the text vocab size so IDs don't collide
    ids = ids + 65536                                 # text vocab offset
    return ids  # [B, 1024], ready to interleave with text tokens
```
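Going the other way, from token IDs back toward pixels, starts by undoing the offset and restoring the grid layout. The sketch below covers only that bookkeeping step in plain Python; `detokenize_ids` is a hypothetical helper, and the row-major ordering is an assumption that matches the reshape in the tokenization code above. A real decoder would then look up each index in the codebook and run the VQ-VAE decoder.

```python
TEXT_VOCAB_SIZE = 65536
CODEBOOK_SIZE = 8192
GRID_SIZE = 32

def detokenize_ids(token_ids):
    """Map 1024 offset image-token IDs back to a 32x32 grid of codebook indices."""
    if len(token_ids) != GRID_SIZE * GRID_SIZE:
        raise ValueError("expected exactly 1024 image tokens")
    codes = [t - TEXT_VOCAB_SIZE for t in token_ids]   # undo the vocabulary offset
    if any(c < 0 or c >= CODEBOOK_SIZE for c in codes):
        raise ValueError("sequence contains a non-image token")
    # Row-major layout: token i sits at row i // 32, column i % 32.
    return [codes[r * GRID_SIZE:(r + 1) * GRID_SIZE] for r in range(GRID_SIZE)]

# Toy input: codebook indices 0..1023, shifted into the image ID range.
example_ids = [TEXT_VOCAB_SIZE + i for i in range(GRID_SIZE * GRID_SIZE)]
grid = detokenize_ids(example_ids)
print(len(grid), len(grid[0]), grid[1][0], grid[31][31])
```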
Now that images are tokens, we need to decide when to combine them with text. This decision — early fusion vs late fusion — is the most consequential architectural choice in multimodal AI. Chameleon's bet on early fusion is what makes it fundamentally different from its contemporaries.
Models like LLaVA, Flamingo, and GPT-4V use late fusion. Each modality is processed by its own specialized encoder. The representations are combined only after significant processing has already happened:
| Model | Vision Path | Fusion Point | Can Generate Images? |
|---|---|---|---|
| LLaVA | CLIP ViT → linear projection | Projected features injected as tokens into LLM input | No |
| Flamingo | NFNet (convolutional encoder) → Perceiver resampler | Cross-attention layers interleaved in LLM | No |
| GPT-4V | Unknown (likely ViT) | Unknown (likely early layers) | No (separate DALL-E) |
The problem: once image features pass through a separate encoder and get projected into the LLM's space, they've lost information. The LLM receives a summary of the image, not the image itself. And the LLM can never produce image outputs — its vocabulary only contains text tokens.
Early fusion means: convert all modalities to the same token space before any transformer processing. The transformer sees a single mixed sequence:
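Every self-attention layer processes text and image tokens together. Because attention in a decoder-only model is causal, each token attends to everything before it, regardless of modality: a text token at position 1030 can attend to an image token at position 5, and an image token can attend to the text tokens that came before it. The mixing happens at every layer, not just at one fusion point.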
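To make the attention pattern concrete, here is a minimal sketch that assumes the causal (left-to-right) mask of a decoder-only transformer. The six-token sequence and the helper names are illustrative only; a real sequence would hold 1024 tokens per image.

```python
def causal_mask(length):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(length)] for q in range(length)]

def cross_modal_pairs(mask, modalities):
    """List (query, key) pairs where attention crosses the modality boundary."""
    return [
        (q, k)
        for q, row in enumerate(mask)
        for k, allowed in enumerate(row)
        if allowed and modalities[q] != modalities[k]
    ]

# Toy mixed sequence: T = text token, I = image token.
modalities = ["T", "T", "I", "I", "I", "T"]
mask = causal_mask(len(modalities))
pairs = cross_modal_pairs(mask, modalities)
print(pairs)
```

Every image token can look back at the preceding text, and the final text token can look back at all three image tokens. No separate fusion layer is involved; the same mask applies in every layer.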
Early fusion has been tried before (PixelGPT, DALL-E, etc.) but scaling it up has been notoriously unstable. The core difficulty: text tokens and image tokens have very different statistical properties. Text has a power-law distribution (a few tokens are very common, most are rare). Image tokens are more uniformly distributed across the codebook. Training a single model on both simultaneously causes optimization instability — the loss oscillates and diverges. Chapter 4 explains how Chameleon solved this with a suite of architectural modifications.
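```python
import torch
import torch.nn as nn

# Late fusion: separate encoders, merged only at the language model's input
class LateFusion(nn.Module):
    def __init__(self, text_encoder, vision_encoder, adapter, language_model):
        super().__init__()
        self.text_encoder = text_encoder
        self.vision_encoder = vision_encoder
        self.adapter = adapter
        self.language_model = language_model

    def forward(self, text, image):
        text_features = self.text_encoder(text)       # [B, T, D]
        img_features = self.vision_encoder(image)     # [B, N, D_img]
        img_proj = self.adapter(img_features)         # [B, N, D] projected into LM space
        combined = torch.cat([img_proj, text_features], dim=1)  # [B, N+T, D]
        return self.language_model(combined)          # text-only output

# Early fusion: everything is tokens from the start
class EarlyFusion(nn.Module):
    def __init__(self, shared_embedding, transformer, shared_head):
        super().__init__()
        self.shared_embedding = shared_embedding
        self.transformer = transformer
        self.shared_head = shared_head

    def forward(self, mixed_token_ids):
        # mixed_token_ids: [B, T+1024], text and image tokens interleaved
        embeddings = self.shared_embedding(mixed_token_ids)  # [B, L, D]
        hidden = self.transformer(embeddings)                # [B, L, D]
        logits = self.shared_head(hidden)                    # [B, L, V]
        return logits  # can predict BOTH text AND image tokens
```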
Chameleon's architecture is deceptively simple: it's a standard decoder-only transformer (like LLaMA) with a unified vocabulary. No special cross-attention layers, no separate vision encoder, no adapter modules. The magic is in the token space, not the architecture.
| Config | Chameleon-7B | Chameleon-34B |
|---|---|---|
| Parameters | 7 billion | 34 billion |
| Layers | 32 | 48 |
| Hidden dim | 4096 | 8192 |
| Attention heads | 32 | 64 |
| Text vocab | 65,536 BPE tokens | 65,536 BPE tokens |
| Image vocab | 8,192 codebook entries | 8,192 codebook entries |
| Total vocab | 73,728 | 73,728 |
| Context length | 4,096 tokens | 4,096 tokens |
A mixed-modal input is constructed by tokenizing text normally (BPE) and images through the VQ tokenizer, then interleaving them with special boundary tokens:
The special tokens <image> and </image> act as delimiters telling the model where image data starts and ends. Inside those boundaries, the tokens come from the image codebook (IDs 65,536 to 73,727). Outside, they're normal text tokens (IDs 0 to 65,535).
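The interleaving step can be sketched in a few lines of plain Python. The IDs chosen for `<image>` and `</image>` below are hypothetical placeholders, since the text does not give the real values, and the toy image has three codes instead of 1024.

```python
TEXT_VOCAB_SIZE = 65536
CODEBOOK_SIZE = 8192
# Hypothetical IDs for the boundary tokens; the real values are not given in the text.
IMG_START_ID = 65534   # "<image>"
IMG_END_ID = 65535     # "</image>"

def build_sequence(segments):
    """Interleave segments given as ("text", [token ids]) or ("image", [codebook indices])."""
    seq = []
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)
        elif kind == "image":
            seq.append(IMG_START_ID)
            seq.extend(TEXT_VOCAB_SIZE + i for i in ids)   # shift into the image ID range
            seq.append(IMG_END_ID)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return seq

def is_image_token(token_id):
    """True when the ID falls in the image part of the vocabulary."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + CODEBOOK_SIZE

sequence = build_sequence([
    ("text", [15496, 995]),
    ("image", [0, 8191, 42]),
    ("text", [7]),
])
print(sequence)
print([is_image_token(t) for t in sequence])
```

Because modality is recoverable from the ID range alone, the model needs no separate "modality flag" input.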
Both text and image tokens are embedded through a single shared embedding table of size 73,728 × D. This is critical: it means image tokens and text tokens live in the same vector space from the very first layer. The transformer never needs to "translate" between modalities — they're already speaking the same language.
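For a sense of scale, the size of the shared table follows directly from the vocabulary size and the hidden dimensions in the configuration table. This is a back-of-the-envelope sketch; `embedding_parameters` is an illustrative helper, not part of any library.

```python
VOCAB_SIZE = 73_728   # 65,536 text tokens + 8,192 image codes
HIDDEN_DIMS = {"Chameleon-7B": 4096, "Chameleon-34B": 8192}

def embedding_parameters(vocab_size, hidden_dim):
    """Number of weights in one vocab_size x hidden_dim table."""
    return vocab_size * hidden_dim

sizes = {name: embedding_parameters(VOCAB_SIZE, dim) for name, dim in HIDDEN_DIMS.items()}
for name, n in sizes.items():
    print(f"{name}: {n:,} embedding weights ({n / 1e6:.0f}M)")
```

An untied output head of the same shape would double these counts; whether Chameleon ties the two matrices is not stated here.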
```python
import torch.nn as nn

# Chameleon's unified embedding
# TransformerBlock (causal self-attention + FFN) is assumed to be defined elsewhere.
class ChameleonModel(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.embed = nn.Embedding(73728, config.hidden_dim)   # 65536 text + 8192 image
        self.layers = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])
        self.head = nn.Linear(config.hidden_dim, 73728)       # predict any token type

    def forward(self, token_ids):
        # token_ids: [B, L], mixed text (0-65535) and image (65536-73727) tokens
        x = self.embed(token_ids)    # [B, L, D], all in the same vector space
        for layer in self.layers:
            x = layer(x)             # standard causal self-attention + FFN
        logits = self.head(x)        # [B, L, 73728], can predict text OR image
        return logits
```
To generate an image, the model autoregressively samples 1024 tokens from the image portion of the vocabulary (IDs 65,536+). These are then mapped back through the VQ-VAE decoder to produce pixels. The generation process is identical to text generation — predict next token, sample, append, repeat — just with image tokens instead of text tokens.
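The sampling loop can be sketched as follows. This is an illustration under stated assumptions, not Chameleon's decoding code: restricting the logits to the image range is one straightforward way to sample "from the image portion of the vocabulary", `next_logits_fn` is a hypothetical stand-in for a forward pass of the model, and the toy vocabulary has 4 text tokens and 3 image codes.

```python
import math
import random

def sample_image_token(logits, text_vocab_size, rng):
    """Sample one token ID, restricted to the image part of the vocabulary."""
    image_logits = logits[text_vocab_size:]              # drop all text-token logits
    peak = max(image_logits)
    weights = [math.exp(l - peak) for l in image_logits]  # unnormalized softmax
    choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return text_vocab_size + choice

def generate_image_tokens(next_logits_fn, prompt, n_tokens, text_vocab_size, seed=0):
    """Autoregressive loop: predict next token, sample, append, repeat."""
    rng = random.Random(seed)
    seq = list(prompt)
    for _ in range(n_tokens):
        logits = next_logits_fn(seq)   # one forward pass per generated token
        seq.append(sample_image_token(logits, text_vocab_size, rng))
    return seq[len(prompt):]

# Toy "model": strongly prefers text tokens, which the restriction must override.
def toy_logits(seq):
    return [100.0] * 4 + [0.0, 1.0, 2.0]

tokens = generate_image_tokens(toy_logits, prompt=[1, 2], n_tokens=5, text_vocab_size=4)
print(tokens)
```

For a real image, `n_tokens` would be 1024 and `text_vocab_size` would be 65,536; the resulting IDs are then handed to the VQ-VAE decoder.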
Here's where Chameleon's real engineering contribution lies. Many teams have tried early fusion before — tokenize images, feed them to a transformer, train end-to-end. Most failed. The model would train normally for a while, then suddenly the loss would spike to infinity. The gradients would explode. Training would collapse.
Chameleon discovered that mixed-modal training introduces a unique stability challenge: the norms of image token representations and text token representations grow at different rates. Image tokens tend to have larger activation norms, creating an imbalance that destabilizes training at scale.
Chameleon introduces three architectural modifications, each targeting a specific instability mechanism:
Standard attention computes: $\text{attention} = \text{softmax}(QK^\top / \sqrt{d})$. If Q or K vectors have large norms, the dot products can grow huge, making the softmax saturate (all mass on one position). This kills gradient flow.
By normalizing Q and K before the dot product, the attention logits stay bounded regardless of how large the hidden representations grow. This prevents softmax saturation.
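A small numeric check shows why the logits stay bounded. The sketch assumes an RMS-style normalization with no learned gain, which may differ from the exact normalization Chameleon applies to Q and K; with that assumption each normalized vector has length at most √d, so the scaled logit can never exceed √d in magnitude.

```python
import math

def rms_normalize(v, eps=1e-6):
    """Scale v so that its root-mean-square value is (just under) 1."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def attention_logit(q, k):
    """Scaled dot product q.k / sqrt(d) for a single query/key pair."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Deliberately huge activations, as might occur late in unstable training.
q = [1000.0, -2000.0, 3000.0, 500.0]
k = [1500.0, -2500.0, 2000.0, 800.0]

raw = attention_logit(q, k)
normed = attention_logit(rms_normalize(q), rms_normalize(k))
bound = math.sqrt(len(q))   # |logit| <= sqrt(d) after normalization

print(f"raw logit:    {raw:.1f}")
print(f"normed logit: {normed:.3f} (bound {bound:.1f})")
```

A learned gain on the normalization would rescale the bound, but the logit would still no longer depend on how large the hidden states have grown.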
Standard transformers apply dropout after the softmax (on the attention weights). Chameleon adds dropout before softmax (on the raw logits). This acts as a regularizer that prevents the model from becoming too confident in any single attention pattern — particularly important when text and image tokens compete for attention.
Chameleon uses RMSNorm (Root Mean Square normalization) instead of LayerNorm, and places it both before and after the attention mechanism — a "sandwich norm" pattern. This double normalization ensures that outputs from each layer have controlled magnitude before being added to the residual stream.
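The effect of the second normalization can be illustrated with a toy residual stack. Everything here is an assumption made for illustration: `toy_sublayer` is a stand-in that simply amplifies its input (mimicking weights that have grown large), and the normalization has no learned gain.

```python
import math

def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def rms_norm(v, eps=1e-6):
    """RMSNorm without a learned gain."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]

def toy_sublayer(h, gain=50.0):
    """Stand-in for attention/FFN whose weights have grown large. Illustrative only."""
    return [gain * x for x in h]

def residual_rms(x, n_layers, sandwich):
    """Run n_layers residual updates and return the final residual-stream RMS."""
    for _ in range(n_layers):
        h = rms_norm(x)               # pre-norm
        out = toy_sublayer(h)
        if sandwich:
            out = rms_norm(out)       # second norm, applied before the residual add
        x = [a + b for a, b in zip(x, out)]
    return rms(x)

x0 = [1.0, -1.0, 1.0, -1.0]
with_sandwich = residual_rms(x0, n_layers=32, sandwich=True)
without_sandwich = residual_rms(x0, n_layers=32, sandwich=False)
print(f"with sandwich norm:    {with_sandwich:.2f}")
print(f"without sandwich norm: {without_sandwich:.2f}")
```

With the second norm, each layer adds an update of roughly unit magnitude no matter how large the sublayer's weights become; without it, the update size tracks the weights.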
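```python
import torch
import torch.nn as nn

# Chameleon's stabilized attention block (single-head sketch)
class ChameleonAttention(nn.Module):
    def __init__(self, dim, p_drop=0.1):  # p_drop is a placeholder value
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)      # nn.RMSNorm requires PyTorch >= 2.4
        self.norm2 = nn.RMSNorm(dim)
        self.q_norm = nn.RMSNorm(dim)
        self.k_norm = nn.RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.logit_dropout = nn.Dropout(p_drop)
        self.scale = dim ** 0.5

    def forward(self, x):
        # Pre-norm (standard)
        h = self.norm1(x)
        # QKV projection
        Q, K, V = self.qkv(h).chunk(3, dim=-1)
        # QK-Norm: normalize Q and K BEFORE the dot product
        Q = self.q_norm(Q)   # prevents attention logit explosion
        K = self.k_norm(K)   # keeps dot products bounded
        # Attention with logit dropout
        logits = Q @ K.transpose(-2, -1) / self.scale
        logits = self.logit_dropout(logits)   # dropout BEFORE softmax
        # Causal mask: each position attends only to itself and earlier positions
        L = x.size(1)
        future = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        logits = logits.masked_fill(future, float("-inf"))
        attn = torch.softmax(logits, dim=-1)
        out = attn @ V
        # Post-norm (sandwich pattern)
        out = self.norm2(out)
        return x + self.proj(out)   # residual connection
```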
The paper reports that without these modifications, training diverges between 500B and 1T tokens — after appearing stable for days. This makes the bug particularly insidious: you can't catch it with short test runs. The team observed:
| Configuration | Diverges at | Symptom |
|---|---|---|
| No QK-Norm | ~800B tokens | Attention logits exceed $10^4$, softmax saturates |
| No logit dropout | ~600B tokens | Attention collapses to single position |
| No sandwich norm | ~1T tokens | Residual stream norms grow unbounded |
| All three fixes | Never (4.4T tokens) | Stable throughout training |
Chameleon's training is a carefully staged process designed to gradually expose the model to increasingly complex mixed-modal data. You don't just throw all your data at the model and hope for the best — the order and mixture of data matters enormously.
The bulk of training uses a massive mixed-modal corpus:
| Data Source | Modality | Proportion | Tokens |
|---|---|---|---|
| Web text | Text only | ~50% | ~2.2T |
| Interleaved text+image | Mixed | ~30% | ~1.3T |
| Image-caption pairs | Mixed | ~10% | ~440B |
| Code | Text only | ~10% | ~440B |
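The token counts in the table follow from the proportions and the 4.4T-token total reported for the full run. The sketch below redoes that arithmetic and adds a hypothetical weighted sampler, `sample_source`, to show one way a data loader could draw from such a mixture; the source does not describe how sampling is actually implemented.

```python
import random

TOTAL_TOKENS = 4_400_000_000_000   # 4.4T tokens
MIX_PERCENT = {                    # proportions from the table above
    "web_text": 50,
    "interleaved": 30,
    "captions": 10,
    "code": 10,
}

# Integer arithmetic avoids floating-point rounding in the budgets.
budget = {name: TOTAL_TOKENS * pct // 100 for name, pct in MIX_PERCENT.items()}
for name, tokens in budget.items():
    print(f"{name:12s} {tokens / 1e12:.2f}T tokens")

def sample_source(rng):
    """Pick the data source for the next training sequence, weighted by the mix."""
    names = list(MIX_PERCENT)
    weights = [MIX_PERCENT[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
print({name: draws.count(name) for name in MIX_PERCENT})
```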
The interleaved text+image data is the key ingredient. These are web pages where text and images naturally co-occur — articles with photos, product pages with images and descriptions, tutorials with diagrams. The model sees sequences like:
This interleaving teaches the model that images and text are semantically connected — the text describes the image, or the image illustrates the text. Without interleaved data, the model would learn text and images separately, defeating the purpose of early fusion.
After pre-training, Chameleon undergoes supervised fine-tuning on high-quality instruction-following data:
How much image data vs text data should be in the pre-training mix? This is one of the paper's important findings. Too much text data, and the model doesn't learn good image representations. Too much image data, and text quality degrades. Chameleon found the sweet spot empirically:
```python
# Chameleon training configuration
config = {
    "optimizer": "AdamW",
    "lr": 1e-4,                          # peak learning rate
    "warmup_steps": 2000,
    "lr_schedule": "cosine_decay",
    "weight_decay": 0.1,
    "batch_size": 4_000_000,             # tokens per batch (4M)
    "total_tokens": 4_400_000_000_000,   # 4.4T tokens
    "context_length": 4096,
    "gradient_clipping": 1.0,            # essential for stability
    "bf16": True,                        # mixed precision
}
```
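The schedule implied by this configuration can be written out explicitly. This is a sketch under assumptions: the warmup is taken to be linear and the cosine is taken to decay to zero, neither of which the configuration states, and `learning_rate` is an illustrative helper.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 2000
TOKENS_PER_BATCH = 4_000_000
TOTAL_TOKENS = 4_400_000_000_000

# Number of optimizer steps implied by the token budget and batch size.
total_steps = TOTAL_TOKENS // TOKENS_PER_BATCH

def learning_rate(step, min_lr=0.0):
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (both assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

print(f"total steps: {total_steps:,}")
for step in (0, 1000, 2000, total_steps // 2, total_steps):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```

At 4M tokens per batch, the 4.4T-token run corresponds to 1.1 million steps, so the 2,000-step warmup covers well under 1% of training.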
Empirically, the sweet spot is around 50% text, 30% interleaved, 10% captions, and 10% code.
Does early fusion actually work? Chameleon is evaluated against both text-only models and multimodal specialists. The results show that a single unified model can be competitive with or exceed purpose-built models on their own tasks.
| Benchmark | Category | Chameleon-34B | Best Specialist | Specialist Name |
|---|---|---|---|---|
| MMLU | Text knowledge | 63.1 | 70.1 | LLaMA-2-70B |
| HellaSwag | Text reasoning | 78.2 | 85.3 | LLaMA-2-70B |
| VQAv2 | Visual QA | 76.8 | 77.4 | Flamingo-80B |
| TextVQA | OCR + Visual QA | 55.8 | 56.8 | Flamingo-80B |
| Image Gen (FID) | Image quality | ~7.0 | ~3.0 | DALL-E 2 |
| Mixed-modal | Interleaved I+T | Best in class | N/A | No comparable model |
Chameleon's unique capability is generating interleaved text and images in a single output. Ask it to write a recipe with step-by-step photos, and it can produce text instructions interspersed with generated images of each step. No other model at publication time could do this natively.
The model handles four task types within one architecture:
The model has clear weaknesses. Image generation quality (FID ~7) lags behind dedicated image models (DALL-E 2 FID ~3). Pure text performance is lower than comparably-sized text-only models — the "modality tax." And the 1024-token image representation limits generation resolution to 512×512.
Chameleon sits at a pivotal point in the evolution of multimodal AI. It proved that early fusion works at scale — that you don't need separate encoders for each modality. This insight directly influenced the next generation of models.
| Model | Fusion Type | Image Handling | Can Generate Images? |
|---|---|---|---|
| LLaVA (2023) | Late fusion | CLIP encoder + linear adapter | No |
| Flamingo (2022) | Late fusion | NFNet + Perceiver + cross-attention | No |
| Chameleon (2024) | Early fusion | VQ tokenizer → shared vocabulary | Yes (autoregressive) |
| Transfusion (2024) | Hybrid | Continuous embeddings + diffusion | Yes (diffusion) |
| Gemini (2023) | Early fusion (reported) | Native multimodal tokens | Yes |
Chameleon's demonstration that early fusion scales to 34B parameters encouraged further work on unified models. Meta's own follow-up, Transfusion, builds directly on Chameleon's insights but replaces discrete VQ tokens with continuous image embeddings trained with a diffusion objective. Google's Gemini (rumored to use early fusion) handles even more modalities, including audio and video. The trend is clear: the future of multimodal AI is unified models, not assemblies of specialists.