Meta FAIR — 2024

Chameleon: Mixed-Modal Early-Fusion

Mixed-Modal Early-Fusion Foundation Models — tokenize everything (text AND images) into one sequence, train one transformer end-to-end.

Prerequisites: Transformers + Tokenization + VQ-VAE basics. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: The Modality Wall

You want to build a single model that can read text, look at images, and generate both. You type: "Here's a photo of my garden. What flowers are these? Also, show me what it would look like in autumn." The model should understand the image, answer in text, and generate a new image, all in one conversation, with one model and one set of weights.

Most of today's multimodal systems don't actually do this. They are Frankenstein architectures: a vision encoder bolted onto a language model with an adapter module in between. LLaVA and Flamingo follow this pattern, and GPT-4V, whose architecture is undisclosed, is widely assumed to follow it as well:

Vision Encoder (frozen)
CLIP ViT encodes the image into feature vectors. Trained separately on image-text pairs.
↓ adapter layer
Language Model
Takes projected image features + text tokens. Generates text only.

This approach has three problems:

| Problem | Why It Happens | Consequence |
| --- | --- | --- |
| No image generation | The LM only outputs text tokens. It can't produce images. | You need a completely separate model (DALL-E, Stable Diffusion) for image generation. |
| Shallow fusion | Image features are injected at one layer. The model can't deeply reason across modalities. | Struggles with tasks requiring tight image-text interleaving. |
| Training fragmentation | Vision encoder, adapter, and LM are trained separately with different objectives. | Representation gaps between modalities. Information lost at each boundary. |

What if we eliminated the boundary entirely? What if images and text were just... tokens? The same tokens, in the same sequence, processed by the same transformer?

Chameleon's radical idea: Tokenize images into discrete tokens using a learned codebook (like VQ-VAE). Interleave them freely with text tokens. Feed the mixed sequence into a single autoregressive transformer. One vocabulary, one model, one loss function. The model learns to predict the next token whether it's a word or an image patch — it doesn't care about the distinction.

This is early fusion: modalities are merged at the token level, before the transformer sees them. The alternative, late fusion, runs each modality through its own encoder and merges the resulting features only afterwards. Chameleon bets that early fusion enables richer cross-modal reasoning, and the results back it up.

Late Fusion vs Early Fusion

Click to compare how late-fusion models (left) vs Chameleon's early-fusion (right) process a mixed text+image input. Watch how information flows differently through each architecture.

What is the fundamental limitation of "late fusion" multimodal models like LLaVA or GPT-4V?

Chapter 1: Image Tokenization

If we want to treat images as tokens, we need a way to convert a continuous image (a grid of RGB pixels) into a sequence of discrete integer IDs — just like how a text tokenizer converts "Hello world" into [15496, 995]. This is the job of the image tokenizer.

Chameleon uses a model based on Make-A-Scene's image tokenizer, which is itself a VQ-VAE (Vector Quantized Variational Autoencoder). Here's how it works:

Encoder
Input: 512×512 RGB image. CNN encoder compresses it to a 32×32 grid of continuous vectors. Each vector is in ℝ²⁵⁶.
↓ quantize
Codebook Lookup
Each of the 1024 vectors is matched to the nearest entry in a codebook of 8192 codes. Output: 1024 integer IDs.
Decoder
Reverse: look up codebook vectors, upsample with CNN decoder back to 512×512 pixels.

The key numbers: a 512×512 image becomes 1024 tokens (32 × 32 grid positions) from a codebook of 8192 entries. These 8192 image codes are added to the text vocabulary (which has ~65K BPE tokens), giving a combined vocabulary of ~73K tokens.

Image (512×512×3) → Encoder → Grid (32×32×256) → VQ → Tokens (1024 IDs from codebook of 8192)

Crucially, the image tokenizer is trained before Chameleon itself, and then frozen. Chameleon never updates the image tokenizer — it just uses the discrete codes as input and output. This separation means the tokenizer can be optimized for image reconstruction quality independently.

Why 1024 tokens per image? It's a compression ratio of 768:1 in raw element count. The original image has 512 × 512 × 3 = 786,432 values. After tokenization: 1024 integers. Each token encodes a 16×16 pixel patch. This is aggressive compression (DALL-E spends the same 1024 tokens on a 256×256 image, so each of its tokens covers only an 8×8 patch), and it means images don't dominate the sequence length. A typical text+image input might have 200 text tokens + 1024 image tokens = 1224 total, easily within the 4,096-token context length.
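The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the helper name `image_token_stats` is made up for illustration, and the default arguments are the figures quoted in the text.

```python
def image_token_stats(side=512, channels=3, grid=32, text_tokens=200):
    """Recompute the tokenization figures quoted in the text."""
    raw_values = side * side * channels      # scalar values in the RGB image
    image_tokens = grid * grid               # one token per grid cell
    return {
        "raw_values": raw_values,
        "image_tokens": image_tokens,
        "values_per_token": raw_values // image_tokens,
        "patch_side_px": side // grid,
        "sequence_length": text_tokens + image_tokens,
    }

stats = image_token_stats()
for name, value in stats.items():
    print(f"{name}: {value}")
```

Changing `grid` shows the trade-off directly: a finer grid means smaller patches per token but a longer sequence.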

Reconstruction quality

The tokenizer uses several tricks to keep reconstructions sharp despite the heavy compression:

| Component | Mechanism | Effect |
| --- | --- | --- |
| Adversarial loss | PatchGAN discriminator | Prevents blurry reconstructions by penalizing "too smooth" outputs |
| Perceptual loss | LPIPS (learned perceptual similarity) | Matches high-level features, not just pixel values |
| Codebook EMA | Exponential moving average codebook update | Prevents codebook collapse (most codes going unused) |
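For readers who have not seen an EMA codebook update, here is a minimal pure-Python sketch of the standard form used in VQ-VAE-style models. Whether Chameleon's tokenizer uses exactly this update is an assumption; the function name, the `decay` value, and the list-based data layout are all illustrative.

```python
def ema_codebook_update(codebook, counts, sums, vectors, assignments,
                        decay=0.99, eps=1e-5):
    """One EMA step. codebook and sums hold K vectors; counts holds K floats.

    vectors are encoder outputs and assignments[i] is the code index that
    vectors[i] was quantized to in this batch.
    """
    k, dim = len(codebook), len(codebook[0])
    batch_counts = [0.0] * k
    batch_sums = [[0.0] * dim for _ in range(k)]
    for vec, idx in zip(vectors, assignments):
        batch_counts[idx] += 1.0
        for j in range(dim):
            batch_sums[idx][j] += vec[j]
    for i in range(k):
        # Running averages of how often each code is used and what it is used for.
        counts[i] = decay * counts[i] + (1 - decay) * batch_counts[i]
        for j in range(dim):
            sums[i][j] = decay * sums[i][j] + (1 - decay) * batch_sums[i][j]
            codebook[i][j] = sums[i][j] / (counts[i] + eps)
    return codebook

# Two codes in 2-D; one encoder vector is assigned to code 0.
codebook = [[0.0, 0.0], [1.0, 1.0]]
counts = [1.0, 1.0]
sums = [[0.0, 0.0], [1.0, 1.0]]
ema_codebook_update(codebook, counts, sums, [[2.0, 2.0]], [0], decay=0.5)
print(codebook)
```

In the toy run, code 0 moves toward the vector assigned to it while the unused code 1 stays essentially where it was.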
python
# Image tokenization in Chameleon
import torch

def tokenize_image(image, encoder, codebook):
    # image: [B, 3, 512, 512], a batch of RGB images
    # encoder: CNN returning [B, 256, 32, 32]; codebook: nn.Embedding(8192, 256)
    z = encoder(image)            # [B, 256, 32, 32] continuous features
    z_flat = z.permute(0, 2, 3, 1).reshape(-1, 256)  # [B*1024, 256], row-major grid order

    # Find nearest codebook vector for each spatial position
    dists = torch.cdist(z_flat, codebook.weight)  # [B*1024, 8192]
    ids = dists.argmin(dim=-1)                    # [B*1024]
    ids = ids.reshape(-1, 1024)                   # [B, 1024]

    # Offset by text vocab size so IDs don't collide
    ids = ids + 65536  # text vocab offset
    return ids  # [B, 1024], ready to interleave with text tokens
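The reverse mapping, needed before the decoder can look up codebook vectors, undoes the offset and restores the 32×32 grid. The sketch below assumes the ID layout and row-major flattening used in the tokenizer code above; the function name `to_codebook_grid` is illustrative.

```python
TEXT_VOCAB = 65_536   # offset applied in the tokenizer code above
IMAGE_VOCAB = 8_192   # codebook size
GRID = 32             # 32 x 32 = 1024 tokens per image

def to_codebook_grid(token_ids):
    """Map unified-vocabulary image IDs back to a GRID x GRID array of codebook indices."""
    if len(token_ids) != GRID * GRID:
        raise ValueError(f"expected {GRID * GRID} image tokens, got {len(token_ids)}")
    codes = []
    for token in token_ids:
        code = token - TEXT_VOCAB
        if not 0 <= code < IMAGE_VOCAB:
            raise ValueError(f"{token} is not an image token")
        codes.append(code)
    # Row-major layout, matching the reshape in the tokenizer.
    return [codes[row * GRID:(row + 1) * GRID] for row in range(GRID)]

ids = [TEXT_VOCAB + i for i in range(GRID * GRID)]
grid = to_codebook_grid(ids)
print(grid[0][:4], grid[1][:4])
```

Rejecting out-of-range IDs here is a design choice for the sketch: it surfaces a stray text token inside an image span instead of silently decoding garbage.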
Image Tokenization Visualizer

See how an image is converted into a grid of discrete tokens. Each colored cell represents one codebook entry. Hover over cells to see the token ID. Drag the "Resolution" slider to see how different grid sizes affect quality.

How many discrete tokens does Chameleon produce from a single 512×512 image?

Chapter 2: Early vs Late Fusion

Now that images are tokens, we need to decide when to combine them with text. This decision — early fusion vs late fusion — is the most consequential architectural choice in multimodal AI. Chameleon's bet on early fusion is what makes it fundamentally different from its contemporaries.

Late fusion: the status quo

Models like LLaVA, Flamingo, and GPT-4V use late fusion. Each modality is processed by its own specialized encoder. The representations are combined only after significant processing has already happened:

| Model | Vision Path | Fusion Point | Can Generate Images? |
| --- | --- | --- | --- |
| LLaVA | CLIP ViT → linear projection | Projected features injected as tokens into LLM input | No |
| Flamingo | NFNet → Perceiver resampler | Cross-attention layers interleaved in LLM | No |
| GPT-4V | Unknown (likely ViT) | Unknown (likely early layers) | No (separate DALL-E) |

The problem: once image features pass through a separate encoder and get projected into the LLM's space, they've lost information. The LLM receives a summary of the image, not the image itself. And the LLM can never produce image outputs — its vocabulary only contains text tokens.

Early fusion: Chameleon's approach

Early fusion means: convert all modalities to the same token space before any transformer processing. The transformer sees a single mixed sequence:

[text₁, text₂, ..., textₙ, <img>, img₁, img₂, ..., img₁₀₂₄, </img>, textₙ₊₁, ...]

Every self-attention layer processes text and image tokens together. Attention is causal, so each token attends to everything before it regardless of modality: a text token that follows the image can attend to any of the 1024 image tokens, and an image token can attend to the text tokens that came before it. The mixing happens at every layer, not just one fusion point.

The key advantage: In late fusion, image understanding is "locked in" by the vision encoder before the LM sees it. In early fusion, the transformer can develop its own image understanding, jointly with text understanding, at every layer. This means the model can learn arbitrarily complex cross-modal reasoning patterns — not just "describe what you see" but "imagine what this would look like if..."
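To see how much cross-modal attention a mixed sequence allows, the sketch below counts the (query, key) pairs permitted by the causal mask of a decoder-only transformer and splits them by whether query and key share a modality. The tiny 13-token sequence and the function name are illustrative only.

```python
def causal_attention_pairs(modalities):
    """Count (query, key) pairs allowed by a causal mask.

    Returns (same_modality_pairs, cross_modality_pairs).
    """
    same = cross = 0
    for q_pos, q_mod in enumerate(modalities):
        # Causal masking: a query sees keys at positions <= its own.
        for k_mod in modalities[: q_pos + 1]:
            if k_mod == q_mod:
                same += 1
            else:
                cross += 1
    return same, cross

# 4 text tokens, a 6-token "image", then 3 more text tokens.
seq = ["text"] * 4 + ["image"] * 6 + ["text"] * 3
same, cross = causal_attention_pairs(seq)
print(f"same-modality pairs: {same}, cross-modality pairs: {cross}")
```

Nearly half of the allowed pairs in this toy sequence cross the modality boundary, and every layer of the transformer gets to use them.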

Why hasn't everyone done this?

Early fusion has been tried before (PixelGPT, DALL-E, etc.) but scaling it up has been notoriously unstable. The core difficulty: text tokens and image tokens have very different statistical properties. Text has a power-law distribution (a few tokens are very common, most are rare). Image tokens are more uniformly distributed across the codebook. Training a single model on both simultaneously causes optimization instability — the loss oscillates and diverges. Chapter 4 explains how Chameleon solved this with a suite of architectural modifications.
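The statistical mismatch can be made concrete by comparing entropies. The sketch below models text with a Zipf distribution (exponent 1, an assumption; real token frequencies only roughly follow it) and image tokens with a perfectly uniform distribution (an idealization of "more uniformly distributed").

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def zipf(n, exponent=1.0):
    """Zipf distribution over n ranked items."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

text_like = zipf(65_536)              # power-law, like text tokens
image_like = [1.0 / 8_192] * 8_192    # uniform, like an evenly used codebook

for name, dist in (("text-like", text_like), ("image-like", image_like)):
    max_bits = math.log2(len(dist))
    print(f"{name}: entropy {entropy_bits(dist):.2f} of {max_bits:.0f} bits, "
          f"most common token p = {max(dist):.4f}")
```

Under these assumptions the text-like distribution concentrates a large share of probability on a handful of tokens, while every image code is equally likely, so the two halves of the vocabulary present the optimizer with very different prediction problems.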

python
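# Late fusion: separate encoders, merge late
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, text_encoder, vision_encoder, adapter, lm_head):
        super().__init__()
        self.text_encoder, self.vision_encoder = text_encoder, vision_encoder
        self.adapter, self.lm_head = adapter, lm_head

    def forward(self, text, image):
        text_features = self.text_encoder(text)     # [B, T, D]
        img_features = self.vision_encoder(image)   # [B, N, D_vis]
        img_proj = self.adapter(img_features)       # [B, N, D] after projection
        combined = torch.cat([img_proj, text_features], dim=1)  # [B, N+T, D]
        return self.lm_head(combined)  # text-only output

# Early fusion: everything is tokens from the start
class EarlyFusion(nn.Module):
    def __init__(self, shared_embedding, transformer, shared_head):
        super().__init__()
        self.shared_embedding, self.transformer = shared_embedding, transformer
        self.shared_head = shared_head

    def forward(self, mixed_token_ids):
        # mixed_token_ids: [B, L] with L = T + 1024, text and image tokens interleaved
        embeddings = self.shared_embedding(mixed_token_ids)  # [B, L, D]
        hidden = self.transformer(embeddings)                # [B, L, D]
        logits = self.shared_head(hidden)                    # [B, L, V]
        return logits  # can predict BOTH text AND image tokens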
Attention Pattern: Early vs Late Fusion

Toggle between early and late fusion to see how attention patterns differ. In early fusion, every token can attend to every earlier token regardless of modality. In late fusion, cross-modal attention only happens at the fusion layer.

What is the key architectural difference between Chameleon's early fusion and LLaVA's late fusion?

Chapter 3: The Architecture

Chameleon's architecture is deceptively simple: it's a standard decoder-only transformer (like LLaMA) with a unified vocabulary. No special cross-attention layers, no separate vision encoder, no adapter modules. The magic is in the token space, not the architecture.

Model specifications

| Config | Chameleon-7B | Chameleon-34B |
| --- | --- | --- |
| Parameters | 7 billion | 34 billion |
| Layers | 32 | 48 |
| Hidden dim | 4096 | 8192 |
| Attention heads | 32 | 64 |
| Text vocab | 65,536 BPE tokens | 65,536 BPE tokens |
| Image vocab | 8,192 codebook entries | 8,192 codebook entries |
| Total vocab | 73,728 | 73,728 |
| Context length | 4,096 tokens | 4,096 tokens |

Input representation

A mixed-modal input is constructed by tokenizing text normally (BPE) and images through the VQ tokenizer, then interleaving them with special boundary tokens:

input = [BOS, t₁, t₂, ..., tₙ, <image>, i₁, i₂, ..., i₁₀₂₄, </image>, tₙ₊₁, ..., EOS]

The special tokens <image> and </image> act as delimiters telling the model where image data starts and ends. Inside those boundaries, the tokens come from the image codebook (IDs 65,536 to 73,727). Outside, they're normal text tokens (IDs 0 to 65,535).
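A minimal sketch of how such a sequence could be assembled is shown below. The ID ranges follow the text; the specific IDs chosen for BOS, EOS, `<image>`, and `</image>` are hypothetical placeholders, since the text does not give them, and the function names are illustrative.

```python
TEXT_VOCAB = 65_536
IMAGE_VOCAB = 8_192
# Hypothetical special-token IDs inside the text range (not given in the text).
BOS, EOS, IMG_START, IMG_END = 0, 1, 2, 3

def build_sequence(segments):
    """segments: list of ("text", token_ids) or ("image", codebook_indices)."""
    seq = [BOS]
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)
        elif kind == "image":
            seq.append(IMG_START)
            seq.extend(TEXT_VOCAB + code for code in ids)  # shift into the image ID range
            seq.append(IMG_END)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    seq.append(EOS)
    return seq

def modality(token_id):
    """Classify a unified-vocabulary ID by the range it falls in."""
    in_image_range = TEXT_VOCAB <= token_id < TEXT_VOCAB + IMAGE_VOCAB
    return "image" if in_image_range else "text"

# A two-token "image" keeps the example readable; a real image has 1024 codes.
seq = build_sequence([("text", [10, 11]), ("image", [0, 8191]), ("text", [12])])
print(seq)
print([modality(t) for t in seq])
```

Because the modality is recoverable from the ID alone, the model needs no separate type embedding; the delimiters and the ID ranges carry that information.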

Shared embedding space

Both text and image tokens are embedded through a single shared embedding table of size 73,728 × D. This is critical: it means image tokens and text tokens live in the same vector space from the very first layer. The transformer never needs to "translate" between modalities — they're already speaking the same language.

python
# Chameleon's unified embedding
import torch.nn as nn

class ChameleonModel(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.embed = nn.Embedding(73728, config.hidden_dim)   # 65536 text + 8192 image
        # TransformerBlock (causal self-attention + FFN) is assumed to be defined elsewhere
        self.layers = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])
        self.head = nn.Linear(config.hidden_dim, 73728)       # predict any token type

    def forward(self, token_ids):
        # token_ids: [B, L], mixed text (0-65535) and image (65536-73727) tokens
        x = self.embed(token_ids)         # [B, L, D], all in the same vector space
        for layer in self.layers:
            x = layer(x)                 # standard causal self-attention + FFN
        logits = self.head(x)            # [B, L, 73728], can predict text OR image
        return logits

Output head: generating images

To generate an image, the model autoregressively samples 1024 tokens from the image portion of the vocabulary (IDs 65,536+). These are then mapped back through the VQ-VAE decoder to produce pixels. The generation process is identical to text generation — predict next token, sample, append, repeat — just with image tokens instead of text tokens.
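One way to restrict sampling to the image portion of the vocabulary is to ignore every logit outside the image ID range before sampling. The sketch below does that in pure Python; treating the restriction as a hard slice is an assumption about the mechanism, and `next_logits` stands in for a real model call.

```python
import math
import random

def sample_image_tokens(next_logits, prefix, n_tokens=1024,
                        text_vocab=65_536, image_vocab=8_192, seed=0):
    """Sample n_tokens image-token IDs, ignoring all logits outside the image range."""
    rng = random.Random(seed)
    seq = list(prefix)
    for _ in range(n_tokens):
        logits = next_logits(seq)                      # one score per vocabulary entry
        image_logits = logits[text_vocab:text_vocab + image_vocab]
        peak = max(image_logits)                       # subtract the max for numerical stability
        weights = [math.exp(v - peak) for v in image_logits]
        code = rng.choices(range(image_vocab), weights=weights)[0]
        seq.append(text_vocab + code)                  # back to a unified-vocabulary ID
    return seq[len(prefix):]

# Toy run with a tiny vocabulary (8 text IDs, 4 image IDs) and a stand-in model
# that strongly prefers text tokens; the restriction still yields only image IDs.
def toy_logits(seq):
    return [10.0] * 8 + [0.0] * 4

tokens = sample_image_tokens(toy_logits, prefix=[1, 2, 3], n_tokens=16,
                             text_vocab=8, image_vocab=4)
print(tokens)
```

The sampled IDs would then have the text offset removed and be passed to the VQ decoder, as described above.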

Elegance through simplicity. Chameleon proves that you don't need separate encoders, adapters, cross-attention, or modality-specific heads. A standard transformer with a unified token vocabulary can handle both modalities. The only non-standard components are the image tokenizer (trained separately) and the training stability modifications (Chapter 4).
Chameleon Architecture Diagram

Click "Forward Pass" to watch a mixed text+image input flow through the unified transformer. Notice how text and image tokens share the same embedding space and attention layers.

How does Chameleon handle the vocabulary for both text and images?

Chapter 4: Training Stability

Here's where Chameleon's real engineering contribution lies. Many teams have tried early fusion before — tokenize images, feed them to a transformer, train end-to-end. Most failed. The model would train normally for a while, then suddenly the loss would spike to infinity. The gradients would explode. Training would collapse.

Chameleon discovered that mixed-modal training introduces a unique stability challenge: the norms of image token representations and text token representations grow at different rates. Image tokens tend to have larger activation norms, creating an imbalance that destabilizes training at scale.

The three stability fixes

Chameleon introduces three architectural modifications, each targeting a specific instability mechanism:

1. QK-Norm (Query-Key Normalization)

Standard attention computes: attention = softmax(QKᵀ / √d). If Q or K vectors have large norms, the dot products can grow huge, making the softmax saturate (all mass on one position). This kills gradient flow.

Q' = LayerNorm(Q),   K' = LayerNorm(K),   attention = softmax(Q'K'ᵀ / √d)

By normalizing Q and K before the dot product, the attention logits stay bounded regardless of how large the hidden representations grow. This prevents softmax saturation.
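The boundedness claim is easy to check numerically. The sketch below uses LayerNorm without its learned gain and bias (a simplification), so each normalized vector has norm at most √d and the scaled logit cannot exceed √d in magnitude. The example vectors are arbitrary.

```python
import math

def layer_norm(v, eps=1e-6):
    """LayerNorm without the learned gain and bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def attention_logit(q, k, qk_norm):
    """Scaled dot-product logit for one query/key pair."""
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [0.5, -1.0, 2.0, 1.5]
k = [1.0, 0.5, 1.5, -0.5]
for scale in (1, 10, 100):
    qs = [scale * x for x in q]
    ks = [scale * x for x in k]
    raw = attention_logit(qs, ks, qk_norm=False)
    normed = attention_logit(qs, ks, qk_norm=True)
    print(f"scale {scale:>3}: raw logit {raw:10.3f}, QK-Norm logit {normed:.4f}")
```

The raw logit grows with the square of the activation scale, while the normalized logit does not move. With a learned gain the bound scales with the gain, but it no longer depends on how large the hidden states have grown.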

2. Dropout on attention logits (not weights)

Standard transformers apply dropout after the softmax (on the attention weights). Chameleon adds dropout before softmax (on the raw logits). This acts as a regularizer that prevents the model from becoming too confident in any single attention pattern — particularly important when text and image tokens compete for attention.

3. Revised layer norm placement

Chameleon uses RMSNorm (Root Mean Square normalization) instead of LayerNorm, and places it both before and after the attention mechanism — a "sandwich norm" pattern. This double normalization ensures that outputs from each layer have controlled magnitude before being added to the residual stream.

RMSNorm(x) = x / √(mean(x²) + ε) · γ
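Written out for a single vector, the formula is only a few lines. This is a plain-Python sketch for intuition, with γ passed in as a list; a real implementation would operate on tensors.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm(x) = x / sqrt(mean(x^2) + eps) * gamma, applied element-wise."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

small = rms_norm([3.0, 4.0], gamma=[1.0, 1.0])
large = rms_norm([300.0, 400.0], gamma=[1.0, 1.0])
print(small)
print(large)
```

Both inputs map to essentially the same output, which is the property the sandwich pattern relies on: whatever magnitude a sub-layer produces, what reaches the residual stream has a controlled scale set by γ.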
python
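# Chameleon's stabilized attention block
import math
import torch
import torch.nn as nn

class ChameleonAttention(nn.Module):
    def __init__(self, dim, n_heads, logit_dropout=0.1):  # dropout rate is a placeholder
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.scale = math.sqrt(self.head_dim)
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)  # needs PyTorch >= 2.4
        self.qkv, self.proj = nn.Linear(dim, 3 * dim), nn.Linear(dim, dim)
        self.q_norm, self.k_norm = nn.LayerNorm(self.head_dim), nn.LayerNorm(self.head_dim)
        self.logit_dropout = nn.Dropout(logit_dropout)

    def forward(self, x):  # x: [B, L, D]
        B, L, D = x.shape
        # Pre-norm (standard)
        h = self.norm1(x)

        # QKV projection, split into heads: [B, H, L, head_dim]
        Q, K, V = (t.view(B, L, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in self.qkv(h).chunk(3, dim=-1))

        # QK-Norm: normalize Q and K BEFORE dot product
        Q = self.q_norm(Q)    # prevents attention logit explosion
        K = self.k_norm(K)    # keeps dot products bounded

        # Attention with logit dropout
        logits = Q @ K.transpose(-2, -1) / self.scale   # [B, H, L, L]
        logits = self.logit_dropout(logits)  # dropout BEFORE softmax
        # Causal mask, applied after dropout so masked positions stay at -inf
        future = torch.ones(L, L, dtype=torch.bool, device=x.device).triu(1)
        attn = torch.softmax(logits.masked_fill(future, float("-inf")), dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, L, D)

        # Post-norm (sandwich pattern)
        out = self.norm2(out)
        return x + self.proj(out)  # residual connection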
Why these three together? Each fix addresses a different failure mode. QK-Norm prevents attention logit explosion. Logit dropout prevents attention pattern collapse. Sandwich norms prevent residual stream magnitude divergence. Chameleon found that removing any single fix caused training to diverge by roughly 1 trillion tokens. All three are necessary at 34B scale.

Empirical evidence of instability

The paper reports that without these modifications, training diverges between 500B and 1T tokens — after appearing stable for days. This makes the bug particularly insidious: you can't catch it with short test runs. The team observed:

| Configuration | Diverges at | Symptom |
| --- | --- | --- |
| No QK-Norm | ~800B tokens | Attention logits exceed 10⁴, softmax saturates |
| No logit dropout | ~600B tokens | Attention collapses to single position |
| No sandwich norm | ~1T tokens | Residual stream norms grow unbounded |
| All three fixes | Never (4.4T tokens) | Stable throughout training |
Training Stability Simulator

Toggle each stability fix on/off and watch the training loss curve. Without all three fixes enabled, training eventually diverges (loss spikes to infinity).

Why does Chameleon apply LayerNorm to Q and K vectors before computing attention?

Chapter 5: The Training Recipe

Chameleon's training is a carefully staged process designed to gradually expose the model to increasingly complex mixed-modal data. You don't just throw all your data at the model and hope for the best — the order and mixture of data matters enormously.

Stage 1: Pre-training (4.4 trillion tokens)

The bulk of training uses a massive mixed-modal corpus:

| Data Source | Modality | Proportion | Tokens |
| --- | --- | --- | --- |
| Web text | Text only | ~50% | ~2.2T |
| Interleaved text+image | Mixed | ~30% | ~1.3T |
| Image-caption pairs | Mixed | ~10% | ~440B |
| Code | Text only | ~10% | ~440B |

The interleaved text+image data is the key ingredient. These are web pages where text and images naturally co-occur — articles with photos, product pages with images and descriptions, tutorials with diagrams. The model sees sequences like:

[text, text, <img>photo tokens</img>, text, text, <img>diagram tokens</img>, text, ...]

This interleaving teaches the model that images and text are semantically connected — the text describes the image, or the image illustrates the text. Without interleaved data, the model would learn text and images separately, defeating the purpose of early fusion.
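The pre-training mixture can be expressed as a small config that yields both token budgets and a per-example sampling rule. The proportions and the 4.4T total come from the table above; the source names, function names, and the idea of sampling sources independently per example are illustrative assumptions.

```python
import random

TOTAL_TOKENS = 4_400_000_000_000
MIXTURE = {
    "web_text": 0.50,
    "interleaved": 0.30,
    "image_caption": 0.10,
    "code": 0.10,
}

def token_budget(total, mixture):
    """Split a total token count according to the mixture proportions."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to 1")
    return {name: round(total * share) for name, share in mixture.items()}

def sample_sources(mixture, n, seed=0):
    """Draw n data-source names with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(mixture)
    return rng.choices(names, weights=[mixture[name] for name in names], k=n)

budget = token_budget(TOTAL_TOKENS, MIXTURE)
for name, tokens in budget.items():
    print(f"{name}: {tokens / 1e12:.2f}T tokens")
print(sample_sources(MIXTURE, 8))
```

Keeping the mixture in one place makes it easy to rerun the budget when experimenting with a different text/visual balance.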

Stage 2: Alignment fine-tuning

After pre-training, Chameleon undergoes supervised fine-tuning on high-quality instruction-following data:

Text SFT
Conversational Q&A, instruction following, reasoning. Standard LLM alignment.
Visual SFT
Image captioning, visual QA, image-grounded conversation. Teaches the model to describe and reason about images.
Mixed-Modal SFT
Tasks requiring both image understanding and generation within the same conversation.

The modality mixture ratio: a critical hyperparameter

How much image data vs text data should be in the pre-training mix? This is one of the paper's important findings. Too much text data, and the model doesn't learn good image representations. Too much image data, and text quality degrades. Chameleon found the sweet spot empirically:

The balancing act: At ~30% interleaved image+text data and ~10% image-caption pairs, the model achieves strong performance on both text-only and multimodal benchmarks. Going above 50% image data significantly degrades text performance. Going below 20% image data produces weak image understanding. By token count the chosen mix is roughly 60/40 text-only/visual, and the visual share is inflated by the 1024 tokens that every image contributes.

Optimization details

python
# Chameleon training configuration
config = {
    "optimizer": "AdamW",
    "lr": 1e-4,                   # peak learning rate
    "warmup_steps": 2000,
    "lr_schedule": "cosine_decay",
    "weight_decay": 0.1,
    "batch_size": 4_000_000,       # tokens per batch (4M)
    "total_tokens": 4_400_000_000_000,  # 4.4T tokens
    "context_length": 4096,
    "gradient_clipping": 1.0,    # essential for stability
    "bf16": True,                 # mixed precision
}
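The schedule implied by this config can be sketched as a function of the step number. The peak rate, warmup length, batch size, and token total come from the config above; the linear shape of the warmup and the decay all the way to zero are assumptions, since the config does not specify them.

```python
import math

TOTAL_STEPS = 4_400_000_000_000 // 4_000_000   # total tokens / tokens per batch

def lr_at(step, peak_lr=1e-4, warmup_steps=2000, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(f"total optimizer steps: {TOTAL_STEPS:,}")
for step in (0, 1_999, 2_000, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>9,}: lr = {lr_at(step):.3e}")
```

At 4M tokens per batch the full run is about 1.1 million optimizer steps, so the 2,000-step warmup covers well under 1% of training.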
Data Mixture Simulator

Adjust the text/image data ratio and watch how it affects performance on text-only and multimodal benchmarks. The sweet spot is around 50% text, 30% interleaved, 10% captions, 10% code.

Why is interleaved text+image data (not just image-caption pairs) critical for Chameleon's training?

Chapter 6: Results & Showcase

Does early fusion actually work? Chameleon is evaluated against both text-only models and multimodal specialists. The results show that a single unified model can be competitive with purpose-built models on their own tasks, though it does not beat the best specialist on any single benchmark listed here.

Benchmark results: Chameleon-34B

| Benchmark | Category | Chameleon-34B | Best Specialist | Specialist Name |
| --- | --- | --- | --- | --- |
| MMLU | Text knowledge | 63.1 | 70.1 | LLaMA-2-70B |
| HellaSwag | Text reasoning | 78.2 | 85.3 | LLaMA-2-70B |
| VQAv2 | Visual QA | 76.8 | 77.4 | Flamingo-80B |
| TextVQA | OCR + Visual QA | 55.8 | 56.8 | Flamingo-80B |
| Image Gen (FID, lower is better) | Image quality | ~7.0 | ~3.0 | DALL-E 2 |
| Mixed-modal | Interleaved I+T | Best in class | N/A | No comparable model |
The main takeaway: Chameleon-34B comes within about a point of Flamingo-80B on visual QA and within about 7 points of LLaMA-2-70B on text benchmarks, even though both specialists are more than twice its size and built for a single task. It stays competitive with text-only LLaMA-2 despite spending ~40% of its training on image data. And it's the only model in this comparison that can do all of these tasks PLUS generate images.

Where Chameleon shines: mixed-modal generation

Chameleon's unique capability is generating interleaved text and images in a single output. Ask it to write a recipe with step-by-step photos, and it can produce text instructions interspersed with generated images of each step. No other model at publication time could do this natively.

The model handles four task types within one architecture:

Text → Text
Standard LM: question answering, reasoning, code generation.
Image + Text → Text
Visual understanding: describe image, answer questions about image.
Text → Image
Image generation: generate image from text description.
Mixed → Mixed
Interleaved: generate documents with both text and images.

Where Chameleon struggles

The model has clear weaknesses. Image generation quality (FID ~7) lags behind dedicated image models (DALL-E 2 FID ~3; lower is better). Pure text performance trails the strongest text-only models, a gap often called the "modality tax." And the 1024-token image representation limits generation resolution to 512×512.

Chameleon Task Explorer

Select different task types to see how Chameleon processes input and generates output for each. Watch the token flow through the unified model.

What is Chameleon's most unique capability compared to other multimodal models at publication time?

Chapter 7: Connections

Chameleon sits at a pivotal point in the evolution of multimodal AI. It proved that early fusion works at scale — that you don't need separate encoders for each modality. This insight directly influenced the next generation of models.

Chameleon in context

| Model | Fusion Type | Image Handling | Can Generate Images? |
| --- | --- | --- | --- |
| LLaVA (2023) | Late fusion | CLIP encoder + linear adapter | No |
| Flamingo (2022) | Late fusion | NFNet + Perceiver + cross-attention | No |
| Chameleon (2024) | Early fusion | VQ tokenizer → shared vocabulary | Yes (autoregressive) |
| Transfusion (2024) | Hybrid | Continuous embeddings + diffusion | Yes (diffusion) |
| Gemini (2024) | Early fusion (reported) | Native multimodal tokens | Yes |

Key lessons from Chameleon

Lesson 1: Training stability is the bottleneck, not architecture. Chameleon's transformer is nearly identical to LLaMA. The breakthroughs were QK-Norm, logit dropout, and sandwich norms — engineering solutions to optimization instability.
Lesson 2: The modality tax is real but shrinking. Training on images costs text performance. But the gap is smaller than expected: Chameleon-34B trails the text-only LLaMA-2-70B, a model twice its size, by only ~7 points on MMLU. Future work may close this gap entirely.
Lesson 3: Discrete tokens may not be optimal for images. VQ tokenization introduces quantization error. Transfusion (the next paper in this series) explores keeping images as continuous representations while text stays discrete — potentially getting the best of both worlds.

Impact on subsequent work

Chameleon's proof that early fusion scales to 34B parameters opened the floodgates. Meta's own follow-up, Transfusion, builds directly on Chameleon's insights but replaces VQ tokenization with diffusion. Google's Gemini (rumored to use early fusion) handles even more modalities including audio and video. The trend is clear: the future of multimodal AI is unified models, not assemblies of specialists.

Multimodal Model Evolution

Drag the slider through time to see how multimodal architectures evolved from late fusion to early fusion.

What is Chameleon's most important contribution to the field?