Chameleon (Meta 2024)

Chapter 0: The Modality Wall

You want to build a single model that can read text, look at images, and generate both. You type: "Here's a photo of my garden. What flowers are these? Also, show me what it would look like in autumn." The model should understand the image, answer in text, and generate a new image, all in one conversation, with one model and one set of weights.

Today's multimodal systems don't actually do this. They are Frankenstein architectures: a vision encoder bolted onto a language model with an adapter module in between. GPT-4V, LLaVA, Flamingo — they all follow the same pattern:

Vision Encoder (frozen)

CLIP ViT encodes the image into feature vectors. Trained separately on image-text pairs.

↓ adapter layer

Language Model

Takes projected image features + text tokens. Generates text only.

This approach has three problems:

Problem	Why It Happens	Consequence
No image generation	The LM only outputs text tokens. It can't produce images.	You need a completely separate model (DALL-E, Stable Diffusion) for image generation.
Shallow fusion	Image features are injected at one layer. The model can't deeply reason across modalities.	Struggles with tasks requiring tight image-text interleaving.
Training fragmentation	Vision encoder, adapter, and LM are trained separately with different objectives.	Representation gaps between modalities. Information lost at each boundary.

What if we eliminated the boundary entirely? What if images and text were just... tokens? The same tokens, in the same sequence, processed by the same transformer?

Chameleon's radical idea: Tokenize images into discrete tokens using a learned codebook (like VQ-VAE). Interleave them freely with text tokens. Feed the mixed sequence into a single autoregressive transformer. One vocabulary, one model, one loss function. The model learns to predict the next token whether it's a word or an image patch — it doesn't care about the distinction.

This is early fusion: modalities are merged at the token level, before the transformer sees them. The alternative, late fusion, first runs each modality through its own encoder and only combines the resulting features afterwards. Chameleon bets that early fusion enables richer cross-modal reasoning, and the benchmark results in Chapter 6 largely support that bet.


What is the fundamental limitation of "late fusion" multimodal models like LLaVA or GPT-4V?

They process image features through a separate encoder and inject them into the LM at one point, preventing deep cross-modal reasoning and making image generation impossible within the same model
They are too slow for real-time inference
They require too much training data

Chapter 1: Image Tokenization

If we want to treat images as tokens, we need a way to convert a continuous image (a grid of RGB pixels) into a sequence of discrete integer IDs — just like how a text tokenizer converts "Hello world" into [15496, 995]. This is the job of the image tokenizer.

Chameleon uses a model based on Make-A-Scene's image tokenizer, which is itself a VQ-VAE (Vector Quantized Variational Autoencoder). Here's how it works:

Encoder

Input: 512×512 RGB image. CNN encoder compresses it to a 32×32 grid of continuous vectors. Each vector: R²⁵⁶.

↓ quantize

Codebook Lookup

Each of the 1024 vectors is matched to the nearest entry in a codebook of 8192 codes. Output: 1024 integer IDs.

↓

Decoder

Reverse: look up codebook vectors, upsample with CNN decoder back to 512×512 pixels.

The key numbers: a 512×512 image becomes 1024 tokens (32 × 32 grid positions) from a codebook of 8192 entries. These 8192 image codes are added to the text vocabulary (which has ~65K BPE tokens), giving a combined vocabulary of ~73K tokens.

Image (512×512×3) → Encoder → Grid (32×32×256) → VQ → Tokens (1024 IDs from codebook of 8192)

Crucially, the image tokenizer is trained before Chameleon itself, and then frozen. Chameleon never updates the image tokenizer — it just uses the discrete codes as input and output. This separation means the tokenizer can be optimized for image reconstruction quality independently.

Why 1024 tokens per image? It's a compression ratio of 768:1 when counting values: the original image has 512 × 512 × 3 = 786,432 values, and after tokenization there are 1024 integers. Each token encodes a 16×16 pixel patch. This is aggressive compression (DALL-E spends the same 1024 tokens on a much smaller 256×256 image), and it means images don't dominate the sequence length. A typical text+image input might have 200 text tokens + 1024 image tokens = 1224 total, easily within the 4,096-token context length.
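
The arithmetic behind these figures is easy to check. The sketch below recomputes the value-count ratio (786,432 values divided by 1,024 tokens) and, as an extra illustration, a bit-level ratio that assumes 8 bits per channel value and log2(8192) = 13 bits per token. The function names and the two delimiter tokens per image are assumptions made for this example.

```python
import math

def compression_stats(side=512, channels=3, grid=32, codebook_size=8192):
    """Recompute the tokenizer's compression figures from first principles."""
    raw_values = side * side * channels          # 512 * 512 * 3 = 786,432
    tokens = grid * grid                         # 32 * 32 = 1,024
    bits_per_token = math.log2(codebook_size)    # 13 bits index 8,192 codes
    raw_bits = raw_values * 8                    # assumes 8 bits per channel value
    return {
        "raw_values": raw_values,
        "tokens": tokens,
        "patch": side // grid,                   # pixels per token along one axis
        "values_per_token": raw_values // tokens,
        "bit_ratio": raw_bits / (tokens * bits_per_token),
    }

def images_that_fit(context=4096, text_tokens=200, tokens_per_image=1024, delimiters=2):
    """How many whole images fit in the context next to a text prompt."""
    return (context - text_tokens) // (tokens_per_image + delimiters)

stats = compression_stats()
print(stats)
print("images per 4,096-token context:", images_that_fit())
```

Counted in values, each token stands in for 768 numbers; counted in bits under the stated assumptions, the ratio is closer to 470:1. Either way, only a handful of images fit in one context window.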

Reconstruction quality

The tokenizer uses several tricks to keep reconstructions sharp despite the heavy compression:

Component	Purpose	Effect
Adversarial loss	PatchGAN discriminator	Prevents blurry reconstructions by penalizing "too smooth" outputs
Perceptual loss	LPIPS (learned perceptual similarity)	Matches high-level features, not just pixel values
Codebook EMA	Exponential moving average codebook update	Prevents codebook collapse (most codes going unused)

python
# Image tokenization in Chameleon
import torch

@torch.no_grad()  # the tokenizer is frozen, so no gradients are needed
def tokenize_image(image, encoder, codebook):
    # image: [B, 3, 512, 512], a batch of RGB images
    # codebook: nn.Embedding(8192, 256) holding the learned code vectors
    z = encoder(image)            # [B, 256, 32, 32], continuous features
    z_flat = z.permute(0, 2, 3, 1).reshape(-1, 256)  # [B*1024, 256]

    # Find nearest codebook vector for each spatial position
    dists = torch.cdist(z_flat, codebook.weight)  # [B*1024, 8192]
    ids = dists.argmin(dim=-1)                    # [B*1024]
    ids = ids.reshape(-1, 1024)                   # [B, 1024]

    # Offset by text vocab size so IDs don't collide
    ids = ids + 65536  # text vocab offset
    return ids  # [B, 1024], ready to interleave with text tokens
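
The Decoder step described earlier runs the same lookup in reverse. The following is a minimal sketch of that direction, assuming the same 65,536 offset as `tokenize_image`. The `detokenize_image` name and the single `ConvTranspose2d` layer standing in for the decoder are illustrative only; a real decoder is a deep CNN with trained weights.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 65536   # offset added by tokenize_image
GRID = 32            # 32 x 32 token grid
CODE_DIM = 256       # dimension of each codebook vector

def detokenize_image(ids, codebook, decoder):
    """Map unified-vocabulary image IDs back to pixels.

    ids: [B, 1024] integer tensor with values in [65536, 73727].
    """
    codes = ids - TEXT_VOCAB                       # back to codebook indices 0..8191
    z_q = codebook(codes)                          # [B, 1024, 256] quantized vectors
    z_q = z_q.reshape(-1, GRID, GRID, CODE_DIM)    # [B, 32, 32, 256]
    z_q = z_q.permute(0, 3, 1, 2).contiguous()     # [B, 256, 32, 32] channels-first
    return decoder(z_q)                            # [B, 3, 512, 512]

# Stand-ins so the sketch runs end to end (untrained, illustrative only).
codebook = nn.Embedding(8192, CODE_DIM)
toy_decoder = nn.ConvTranspose2d(CODE_DIM, 3, kernel_size=16, stride=16)

ids = torch.randint(0, 8192, (2, 1024)) + TEXT_VOCAB
with torch.no_grad():
    pixels = detokenize_image(ids, codebook, toy_decoder)
print(pixels.shape)
```

Each of the 32 × 32 grid positions expands back into a 16×16 pixel patch, which is how 1024 tokens become a 512×512 image again.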


How many discrete tokens does Chameleon produce from a single 512×512 image?

256 tokens from a 16×16 grid
1024 tokens from a 32×32 grid, each drawn from a codebook of 8192 entries, a 768:1 compression from the original pixel values
4096 tokens from a 64×64 grid

Chapter 2: Early vs Late Fusion

Now that images are tokens, we need to decide when to combine them with text. This decision — early fusion vs late fusion — is the most consequential architectural choice in multimodal AI. Chameleon's bet on early fusion is what makes it fundamentally different from its contemporaries.

Late fusion: the status quo

Models like LLaVA, Flamingo, and GPT-4V use late fusion. Each modality is processed by its own specialized encoder. The representations are combined only after significant processing has already happened:

Model	Vision Path	Fusion Point	Can Generate Images?
LLaVA	CLIP ViT → linear projection	Projected features injected as tokens into LLM input	No
Flamingo	NFNet-F6 (a CNN) → Perceiver resampler	Cross-attention layers interleaved in LLM	No
GPT-4V	Unknown (likely ViT)	Unknown (likely early layers)	No (separate DALL-E)

The problem: once image features pass through a separate encoder and get projected into the LLM's space, they've lost information. The LLM receives a summary of the image, not the image itself. And the LLM can never produce image outputs — its vocabulary only contains text tokens.

Early fusion: Chameleon's approach

Early fusion means: convert all modalities to the same token space before any transformer processing. The transformer sees a single mixed sequence:

[text₁, text₂, ..., text_n, <img>, img₁, img₂, ..., img₁₀₂₄, </img>, text_n+1, ...]

Every self-attention layer processes text and image tokens together. Because attention is causal, each token attends to everything before it regardless of modality: a text token that follows the image (say, at position 1030) can attend to every image token, and an image token can attend to all the text that came before it. The mixing happens at every layer, not just one fusion point.
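
To make the attention pattern concrete, here is a small sketch that builds a causal mask over a toy mixed sequence and counts the cross-modal query/key pairs it allows. The layout uses 4 image tokens instead of 1024 for readability, and the helper names are made up for this example.

```python
# Toy layout: "T" = text token, "I" = image token
layout = ["T", "T", "T", "I", "I", "I", "I", "T", "T"]

def causal_mask(n):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]

def cross_modal_pairs(layout):
    """Count allowed (query, key) pairs whose modalities differ."""
    mask = causal_mask(len(layout))
    pairs = {"text->image": 0, "image->text": 0}
    for q, q_kind in enumerate(layout):
        for k, k_kind in enumerate(layout):
            if not mask[q][k] or q_kind == k_kind:
                continue
            key = "text->image" if q_kind == "T" else "image->text"
            pairs[key] += 1
    return pairs

pairs = cross_modal_pairs(layout)
print(pairs)
```

The two trailing text tokens each see all four image tokens, and each image token sees the three text tokens in front of it. The leading text tokens see no image tokens, because those lie in their future.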

The key advantage: In late fusion, image understanding is "locked in" by the vision encoder before the LM sees it. In early fusion, the transformer can develop its own image understanding, jointly with text understanding, at every layer. This means the model can learn arbitrarily complex cross-modal reasoning patterns — not just "describe what you see" but "imagine what this would look like if..."

Why hasn't everyone done this?

Early fusion has been tried before (PixelGPT, DALL-E, etc.) but scaling it up has been notoriously unstable. The core difficulty: text tokens and image tokens have very different statistical properties. Text has a power-law distribution (a few tokens are very common, most are rare). Image tokens are more uniformly distributed across the codebook. Training a single model on both simultaneously causes optimization instability — the loss oscillates and diverges. Chapter 4 explains how Chameleon solved this with a suite of architectural modifications.

python
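import torch
import torch.nn as nn

# Late fusion: separate encoders, merge late
class LateFusion(nn.Module):
    def __init__(self, text_encoder, vision_encoder, adapter, lm, lm_head):
        super().__init__()
        self.text_encoder, self.vision_encoder = text_encoder, vision_encoder
        self.adapter, self.lm, self.lm_head = adapter, lm, lm_head

    def forward(self, text, image):
        text_features = self.text_encoder(text)     # [B, T, D]
        img_features = self.vision_encoder(image)   # [B, N, D_v]
        img_proj = self.adapter(img_features)       # [B, N, D], projected into LM space
        combined = torch.cat([img_proj, text_features], dim=1)  # [B, N+T, D]
        hidden = self.lm(combined)                  # [B, N+T, D]
        return self.lm_head(hidden)  # logits over text tokens only

# Early fusion: everything is tokens from the start
class EarlyFusion(nn.Module):
    def __init__(self, shared_embedding, transformer, shared_head):
        super().__init__()
        self.shared_embedding, self.transformer = shared_embedding, transformer
        self.shared_head = shared_head

    def forward(self, mixed_token_ids):
        # mixed_token_ids: [B, L] with L = T + 1024, text and image tokens interleaved
        embeddings = self.shared_embedding(mixed_token_ids)  # [B, L, D]
        hidden = self.transformer(embeddings)                # [B, L, D]
        logits = self.shared_head(hidden)                    # [B, L, V]
        return logits  # can predict BOTH text AND image tokens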


What is the key architectural difference between Chameleon's early fusion and LLaVA's late fusion?

Chameleon uses a bigger model
Chameleon uses a different training dataset
In Chameleon, images and text are converted to the same discrete tokens BEFORE the transformer, so every attention layer processes both modalities together, enabling both understanding and generation of images from the same model

Chapter 3: The Architecture

Chameleon's architecture is deceptively simple: it's a standard decoder-only transformer (like LLaMA) with a unified vocabulary. No special cross-attention layers, no separate vision encoder, no adapter modules. The magic is in the token space, not the architecture.

Model specifications

Config	Chameleon-7B	Chameleon-34B
Parameters	7 billion	34 billion
Layers	32	48
Hidden dim	4096	8192
Attention heads	32	64
Text vocab	65,536 BPE tokens	65,536 BPE tokens
Image vocab	8,192 codebook entries	8,192 codebook entries
Total vocab	73,728	73,728
Context length	4,096 tokens	4,096 tokens

Input representation

A mixed-modal input is constructed by tokenizing text normally (BPE) and images through the VQ tokenizer, then interleaving them with special boundary tokens:

input = [BOS, t₁, t₂, ..., t_n, <image>, i₁, i₂, ..., i₁₀₂₄, </image>, t_n+1, ..., EOS]

The special tokens <image> and </image> act as delimiters telling the model where image data starts and ends. Inside those boundaries, the tokens come from the image codebook (IDs 65,536 to 73,727). Outside, they're normal text tokens (IDs 0 to 65,535).
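
A minimal sketch of how such a sequence could be assembled is shown below. The ID ranges follow the text, but the four special-token IDs are hypothetical placeholders (the actual values are not given here), and `build_sequence` is an illustrative helper, not part of any released code.

```python
TEXT_VOCAB = 65536          # text IDs occupy 0..65535
IMAGE_CODES = 8192          # image IDs occupy 65536..73727
TOKENS_PER_IMAGE = 1024

# Hypothetical special-token IDs inside the text range (assumption).
BOS, EOS, IMG_START, IMG_END = 0, 1, 2, 3

def build_sequence(segments):
    """segments: list of ("text", [token ids]) or ("image", [codebook indices])."""
    seq = [BOS]
    for kind, ids in segments:
        if kind == "text":
            if not all(0 <= t < TEXT_VOCAB for t in ids):
                raise ValueError("text token out of range")
            seq.extend(ids)
        elif kind == "image":
            if len(ids) != TOKENS_PER_IMAGE:
                raise ValueError("an image must contribute exactly 1024 tokens")
            seq.append(IMG_START)
            seq.extend(c + TEXT_VOCAB for c in ids)   # shift into the image ID range
            seq.append(IMG_END)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    seq.append(EOS)
    return seq

def is_image_token(token_id):
    return TEXT_VOCAB <= token_id < TEXT_VOCAB + IMAGE_CODES

seq = build_sequence([
    ("text", [15496, 995]),
    ("image", [i % IMAGE_CODES for i in range(TOKENS_PER_IMAGE)]),
    ("text", [40, 41, 42]),
])
print(len(seq), sum(is_image_token(t) for t in seq))
```

Because the two ID ranges never overlap, a single range check is enough to tell which modality a token belongs to anywhere in the sequence.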

Shared embedding space

Both text and image tokens are embedded through a single shared embedding table of size 73,728 × D. This is critical: it means image tokens and text tokens live in the same vector space from the very first layer. The transformer never needs to "translate" between modalities — they're already speaking the same language.

python
# Chameleon's unified embedding
import torch
import torch.nn as nn

class ChameleonModel(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.embed = nn.Embedding(73728, config.hidden_dim)   # 65536 text + 8192 image
        # TransformerBlock (causal self-attention + FFN) is assumed to be defined elsewhere
        self.layers = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])
        self.head = nn.Linear(config.hidden_dim, 73728)       # predict any token type

    def forward(self, token_ids):
        # token_ids: [B, L], mixed text (0-65535) and image (65536-73727) tokens
        x = self.embed(token_ids)         # [B, L, D], all in the same vector space
        for layer in self.layers:
            x = layer(x)                 # standard causal self-attention + FFN
        logits = self.head(x)            # [B, L, 73728], can predict text OR image
        return logits

Output head: generating images

To generate an image, the model autoregressively samples 1024 tokens from the image portion of the vocabulary (IDs 65,536+). These are then mapped back through the VQ-VAE decoder to produce pixels. The generation process is identical to text generation — predict next token, sample, append, repeat — just with image tokens instead of text tokens.
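
One common way to guarantee that a sampled token comes from the image portion of the vocabulary is to mask the text logits before sampling. The sketch below shows that idea under the ID layout described above. It is an assumption about how the constraint could be enforced, not a description of Chameleon's actual inference code, and `sample_image_token` is a made-up name.

```python
import torch

TEXT_VOCAB, IMAGE_CODES = 65536, 8192
VOCAB = TEXT_VOCAB + IMAGE_CODES

def sample_image_token(logits, temperature=1.0, generator=None):
    """Sample one token per batch row, restricted to image IDs.

    logits: [B, VOCAB] next-token logits from the model.
    """
    masked = torch.full_like(logits, float("-inf"))
    masked[:, TEXT_VOCAB:] = logits[:, TEXT_VOCAB:]      # keep only image columns
    probs = torch.softmax(masked / temperature, dim=-1)  # text IDs get probability 0
    return torch.multinomial(probs, num_samples=1, generator=generator).squeeze(-1)

g = torch.Generator().manual_seed(0)
fake_logits = torch.randn(4, VOCAB, generator=g)   # stand-in for model output
tokens = sample_image_token(fake_logits, generator=g)
print(tokens)
```

Repeating this step 1024 times, feeding each sampled token back into the model, yields one full image grid that the VQ decoder can turn into pixels.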

Elegance through simplicity. Chameleon proves that you don't need separate encoders, adapters, cross-attention, or modality-specific heads. A standard transformer with a unified token vocabulary can handle both modalities. The only non-standard components are the image tokenizer (trained separately) and the training stability modifications (Chapter 4).


How does Chameleon handle the vocabulary for both text and images?

It uses a single embedding table of 73,728 entries (65,536 BPE text tokens + 8,192 image codebook entries), so both modalities share the same vector space from the first layer and the model can predict either type at any position
It uses separate embedding tables for text and images
It converts images to text descriptions first

Chapter 4: Training Stability

Here's where Chameleon's real engineering contribution lies. Many teams have tried early fusion before — tokenize images, feed them to a transformer, train end-to-end. Most failed. The model would train normally for a while, then suddenly the loss would spike to infinity. The gradients would explode. Training would collapse.

Chameleon discovered that mixed-modal training introduces a unique stability challenge: the norms of image token representations and text token representations grow at different rates. Image tokens tend to have larger activation norms, creating an imbalance that destabilizes training at scale.

The three stability fixes

Chameleon introduces three architectural modifications, each targeting a specific instability mechanism:

1. QK-Norm (Query-Key Normalization)

Standard attention computes: attention = softmax(QK^T / √d). If Q or K vectors have large norms, the dot products can grow huge, making the softmax saturate (all mass on one position). This kills gradient flow.

Q' = LayerNorm(Q), K' = LayerNorm(K), attention = softmax(Q'K'^T / √d)

By normalizing Q and K before the dot product, the attention logits stay bounded regardless of how large the hidden representations grow. This prevents softmax saturation.
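
The bound can be checked numerically. The sketch below uses LayerNorm without a learned scale or shift, so every normalized vector has L2 norm at most √d; by Cauchy-Schwarz the scaled logit then cannot exceed √d in magnitude. With a learned gain the bound would scale accordingly. The sizes and the factor of 100 are arbitrary choices for illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
q = torch.randn(8, d) * 100.0    # deliberately huge activations
k = torch.randn(8, d) * 100.0

raw_logits = q @ k.T / math.sqrt(d)

# LayerNorm without affine parameters: each row gets zero mean and unit variance,
# so its L2 norm is at most sqrt(d).
qn = F.layer_norm(q, (d,))
kn = F.layer_norm(k, (d,))
normed_logits = qn @ kn.T / math.sqrt(d)

bound = math.sqrt(d)   # |q.k| <= |q||k| <= d, then divided by sqrt(d)
print("max |raw logit|   :", raw_logits.abs().max().item())
print("max |normed logit|:", normed_logits.abs().max().item())
print("theoretical bound :", bound)
```

However large the inputs grow, the normalized logits stay inside the bound, which is what keeps the softmax away from saturation.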

2. Dropout on attention logits (not weights)

Standard transformers apply dropout after the softmax (on the attention weights). Chameleon adds dropout before softmax (on the raw logits). This acts as a regularizer that prevents the model from becoming too confident in any single attention pattern — particularly important when text and image tokens compete for attention.
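
The difference between the two placements shows up in the row sums of the attention matrix. In this small sketch (arbitrary tensor sizes and dropout rate), dropout before the softmax still yields a proper probability distribution per query, while dropout after the softmax generally does not.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 5, 5)   # [batch, query, key] attention logits

# Dropout on logits (before softmax): dropped logits become 0, survivors are
# rescaled by 1/(1-p), and the softmax renormalizes each row afterwards.
pre = torch.softmax(F.dropout(logits, p=0.2, training=True), dim=-1)

# Dropout on weights (after softmax): entries are zeroed after normalization,
# so rows generally no longer sum to 1.
post = F.dropout(torch.softmax(logits, dim=-1), p=0.2, training=True)

print(pre.sum(dim=-1))
print(post.sum(dim=-1))
```

Note that a dropped logit is set to 0 rather than to negative infinity, so the corresponding position is perturbed, not removed from the distribution.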

3. Revised layer norm placement

Chameleon uses RMSNorm (Root Mean Square normalization) instead of LayerNorm, and places it both before and after the attention mechanism — a "sandwich norm" pattern. This double normalization ensures that outputs from each layer have controlled magnitude before being added to the residual stream.

RMSNorm(x) = x / √(mean(x²) + ε) · γ
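
The formula translates directly into a few lines of PyTorch. This is a minimal sketch: the `eps` default is an assumption, and recent PyTorch versions also ship a built-in `nn.RMSNorm`.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """x / sqrt(mean(x^2) + eps) * gamma, applied over the last dimension."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learned per-channel gain

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma

torch.manual_seed(0)
norm = RMSNorm(16)
x = torch.randn(4, 16) * 50.0 + 3.0      # large, off-center activations
with torch.no_grad():
    y = norm(x)
    out_rms = y.pow(2).mean(dim=-1).sqrt()
print(out_rms)
```

Unlike LayerNorm, RMSNorm does not subtract the mean; it only rescales, so each output vector has a root mean square of about 1 while γ is still at its initial value of 1.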

python
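# Chameleon's stabilized attention block (single head, for clarity)
import math
import torch
import torch.nn as nn

class ChameleonAttention(nn.Module):
    def __init__(self, dim, p_drop=0.1):
        super().__init__()
        # nn.RMSNorm needs PyTorch >= 2.4; nn.LayerNorm has the same interface
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.q_norm, self.k_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.logit_dropout = nn.Dropout(p_drop)
        self.scale = math.sqrt(dim)

    def forward(self, x):              # x: [B, L, D]
        # Pre-norm (standard)
        h = self.norm1(x)

        # QKV projection
        Q, K, V = self.qkv(h).chunk(3, dim=-1)

        # QK-Norm: normalize Q and K BEFORE dot product
        Q = self.q_norm(Q)    # prevents attention logit explosion
        K = self.k_norm(K)    # keeps dot products bounded

        # Attention with logit dropout
        logits = Q @ K.transpose(-2, -1) / self.scale
        logits = self.logit_dropout(logits)  # dropout BEFORE softmax
        # Causal mask: position i may only attend to positions <= i
        future = torch.ones(x.size(1), x.size(1), device=x.device).triu(1).bool()
        logits = logits.masked_fill(future, float("-inf"))
        attn = torch.softmax(logits, dim=-1)
        out = attn @ V

        # Post-norm (sandwich pattern): normalize the block output before the residual add
        out = self.norm2(self.proj(out))
        return x + out  # residual connection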

Why these three together? Each fix addresses a different failure mode. QK-Norm prevents attention logit explosion. Logit dropout prevents attention pattern collapse. Sandwich norms prevent residual stream magnitude divergence. Chameleon found that removing any single fix caused training to diverge by roughly 1 trillion tokens, so all three are necessary at 34B scale.

Empirical evidence of instability

The paper reports that without these modifications, training diverges between 500B and 1T tokens — after appearing stable for days. This makes the bug particularly insidious: you can't catch it with short test runs. The team observed:

Configuration	Diverges at	Symptom
No QK-Norm	~800B tokens	Attention logits exceed 10⁴, softmax saturates
No logit dropout	~600B tokens	Attention collapses to single position
No sandwich norm	~1T tokens	Residual stream norms grow unbounded
All three fixes	Never (4.4T tokens)	Stable throughout training


Why does Chameleon apply LayerNorm to Q and K vectors before computing attention?

To keep the dot products bounded: without QK-Norm, image and text tokens develop different activation magnitudes, causing attention logits to grow extremely large, saturating the softmax and killing gradient flow, a divergence that only appears after hundreds of billions of tokens
To reduce the number of parameters
To make the model faster at inference time

Chapter 5: The Training Recipe

Chameleon's training is a carefully staged process designed to gradually expose the model to increasingly complex mixed-modal data. You don't just throw all your data at the model and hope for the best — the order and mixture of data matters enormously.

Stage 1: Pre-training (4.4 trillion tokens)

The bulk of training uses a massive mixed-modal corpus:

Data Source	Modality	Proportion	Tokens
Web text	Text only	~50%	~2.2T
Interleaved text+image	Mixed	~30%	~1.3T
Image-caption pairs	Mixed	~10%	~440B
Code	Text only	~10%	~440B
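
The token counts in the table follow directly from the proportions and the 4.4T total. The short sketch below recomputes them and adds up the share of data that contains image tokens; the variable names are illustrative.

```python
TOTAL_TOKENS = 4.4e12

# (source, contains image tokens?, proportion), taken from the table
MIXTURE = [
    ("web text", False, 0.50),
    ("interleaved text+image", True, 0.30),
    ("image-caption pairs", True, 0.10),
    ("code", False, 0.10),
]

def summarize(mixture, total):
    """Return tokens per source and the share of mixed-modal data."""
    if abs(sum(p for _, _, p in mixture) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    per_source = {name: p * total for name, _, p in mixture}
    mixed_share = sum(p for _, has_img, p in mixture if has_img)
    return per_source, mixed_share

per_source, mixed_share = summarize(MIXTURE, TOTAL_TOKENS)
for name, tokens in per_source.items():
    print(f"{name:>24}: {tokens / 1e12:.2f}T tokens")
print(f"share of data containing images: {mixed_share:.0%}")
```

Under these proportions, about 40% of the training tokens come from sources that contain images and about 60% from text-only sources.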

The interleaved text+image data is the key ingredient. These are web pages where text and images naturally co-occur — articles with photos, product pages with images and descriptions, tutorials with diagrams. The model sees sequences like:

[text, text, <img>photo tokens</img>, text, text, <img>diagram tokens</img>, text, ...]

This interleaving teaches the model that images and text are semantically connected — the text describes the image, or the image illustrates the text. Without interleaved data, the model would learn text and images separately, defeating the purpose of early fusion.

Stage 2: Alignment fine-tuning

After pre-training, Chameleon undergoes supervised fine-tuning on high-quality instruction-following data:

Text SFT

Conversational Q&A, instruction following, reasoning. Standard LLM alignment.

↓

Visual SFT

Image captioning, visual QA, image-grounded conversation. Teaches the model to describe and reason about images.

↓

Mixed-Modal SFT

Tasks requiring both image understanding and generation within the same conversation.

The modality mixture ratio: a critical hyperparameter

How much image data vs text data should be in the pre-training mix? This is one of the paper's important findings. Too much text data, and the model doesn't learn good image representations. Too much image data, and text quality degrades. Chameleon found the sweet spot empirically:

The balancing act: At ~30% interleaved image+text data and ~10% image-caption pairs, the model achieves strong performance on both text-only and multimodal benchmarks. Going above 50% image data significantly degrades text performance. Going below 20% image data produces weak image understanding. The resulting split is roughly 60/40 text-only/mixed-modal when counting by tokens, and the mixed-modal share is inflated by the 1024 tokens that every image contributes.

Optimization details

python
# Chameleon training configuration
config = {
    "optimizer": "AdamW",
    "lr": 1e-4,                   # peak learning rate
    "warmup_steps": 2000,
    "lr_schedule": "cosine_decay",
    "weight_decay": 0.1,
    "batch_size": 4_000_000,       # tokens per batch (4M)
    "total_tokens": 4_400_000_000_000,  # 4.4T tokens
    "context_length": 4096,
    "gradient_clipping": 1.0,    # essential for stability
    "bf16": True,                 # mixed precision
}
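
To see what this configuration implies, the sketch below derives the number of optimizer steps (total tokens divided by tokens per batch) and evaluates a warmup-plus-cosine schedule at a few points. The linear warmup shape and the decay floor of 0 are assumptions, since the config only names the schedule.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 2000
TOKENS_PER_BATCH = 4_000_000
TOTAL_TOKENS = 4_400_000_000_000
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_BATCH   # 1,100,000 optimizer steps
MIN_LR = 0.0   # assumption: the config does not state a floor

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

print("optimizer steps:", TOTAL_STEPS)
for step in (0, WARMUP_STEPS - 1, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.3e}")
```

With 4M tokens per batch, 4.4T tokens correspond to 1.1 million steps, so the 2,000 warmup steps cover less than 0.2% of training.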


Why is interleaved text+image data (not just image-caption pairs) critical for Chameleon's training?

Because interleaved data teaches the model that images and text are semantically connected in context (text describes images, images illustrate text), enabling the model to reason across modalities within natural document structures, not just isolated captioning
Because interleaved data is cheaper to collect
Because image-caption pairs contain too much noise

Chapter 6: Results & Showcase

Does early fusion actually work? Chameleon is evaluated against both text-only models and multimodal specialists. The results show that a single unified model can be competitive with or exceed purpose-built models on their own tasks.

Benchmark results: Chameleon-34B

Benchmark	Category	Chameleon-34B	Best Specialist	Specialist Name
MMLU	Text knowledge	63.1	70.1	LLaMA-2-70B
HellaSwag	Text reasoning	78.2	85.3	LLaMA-2-70B
VQAv2	Visual QA	76.8	77.4	Flamingo-80B
TextVQA	OCR + Visual QA	55.8	56.8	Flamingo-80B
Image Gen (FID)	Image quality	~7.0	~3.0	DALL-E 2
Mixed-modal	Interleaved I+T	Best in class	N/A	No comparable model

The main takeaway: Chameleon-34B lands within about a point of Flamingo-80B on visual QA and within about 7 points of LLaMA-2-70B on the text benchmarks, even though both are more than twice its size and purpose-built for a narrower set of tasks. It stays competitive on text despite spending ~40% of its training on image data. And among the models compared here, it's the only one that can do all of these tasks and also generate images.

Where Chameleon shines: mixed-modal generation

Chameleon's unique capability is generating interleaved text and images in a single output. Ask it to write a recipe with step-by-step photos, and it can produce text instructions interspersed with generated images of each step. No other model at publication time could do this natively.

The model handles four task types within one architecture:

Text → Text

Standard LM: question answering, reasoning, code generation.

↓

Image + Text → Text

Visual understanding: describe image, answer questions about image.

↓

Text → Image

Image generation: generate image from text description.

↓

Mixed → Mixed

Interleaved: generate documents with both text and images.

Where Chameleon struggles

The model has clear weaknesses. Image generation quality (FID ~7, where lower is better) lags behind dedicated image models (DALL-E 2, FID ~3). Pure text performance is lower than that of comparably sized text-only models: the "modality tax." And the 1024-token image representation limits generation resolution to 512×512.


What is Chameleon's most distinctive capability compared to other multimodal models at publication time?

Higher text benchmark scores than GPT-4
Faster inference speed
Native interleaved generation of text AND images in a single output: no other model could produce mixed-modal documents within one unified architecture

Chapter 7: Connections

Chameleon sits at a pivotal point in the evolution of multimodal AI. It proved that early fusion works at scale — that you don't need separate encoders for each modality. This insight directly influenced the next generation of models.

Chameleon in context

Model	Fusion Type	Image Handling	Can Generate Images?
LLaVA (2023)	Late fusion	CLIP encoder + linear adapter	No
Flamingo (2022)	Late fusion	NFNet + Perceiver + cross-attention	No
Chameleon (2024)	Early fusion	VQ tokenizer → shared vocabulary	Yes (autoregressive)
Transfusion (2024)	Hybrid	Continuous embeddings + diffusion	Yes (diffusion)
Gemini (2023)	Early fusion (reported)	Native multimodal tokens	Yes

Key lessons from Chameleon

Lesson 1: Training stability is the bottleneck, not architecture. Chameleon's transformer is nearly identical to LLaMA. The breakthroughs were QK-Norm, logit dropout, and sandwich norms — engineering solutions to optimization instability.

Lesson 2: The modality tax is real but shrinking. Training on images costs text performance. But the gap is smaller than one might expect: Chameleon-34B trails LLaMA-2-70B, a text-only model roughly twice its size, by only ~7 points on MMLU. Future work may close this gap entirely.

Lesson 3: Discrete tokens may not be optimal for images. VQ tokenization introduces quantization error. Transfusion (the next paper in this series) explores keeping images as continuous representations while text stays discrete — potentially getting the best of both worlds.

Impact on subsequent work

Chameleon's demonstration that early fusion scales to 34B parameters encouraged further work on unified models. Meta's own follow-up, Transfusion, builds directly on Chameleon's insights but replaces VQ tokenization with continuous image representations trained with a diffusion objective. Google's Gemini (reported to be natively multimodal, and announced before Chameleon was published) points in the same direction while handling even more modalities, including audio and video. The trend is clear: the future of multimodal AI is unified models, not assemblies of specialists.


What is Chameleon's most important contribution to the field?

A new image tokenizer
Demonstrating that early fusion (tokenizing all modalities into a shared vocabulary) scales to 34B parameters when combined with specific training stability modifications, proving that a single transformer can handle both understanding and generation across modalities
Faster image generation than DALL-E

Chameleon: Mixed-Modal Early-Fusion