Meta FAIR — 2024

Transfusion: Next Token + Diffusion

Predict the Next Token and Diffuse Images with One Multi-Modal Model — autoregressive for text, diffusion for images, one transformer, one training run.

Prerequisites: Autoregressive LMs + Diffusion basics + Chameleon (recommended). That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Two Worlds

Language models generate text one token at a time. Diffusion models generate images by gradually denoising random noise. These two approaches have been remarkably successful — but they're incompatible. You can't "denoise" text, and you can't "predict the next pixel" of an image efficiently.

This creates an awkward situation for multimodal AI. If you want a model that handles both text and images, you have three options:

Approach | Example | Problem
Separate models | GPT-4 + DALL-E | Two models, no shared reasoning, expensive
Tokenize everything | Chameleon | VQ quantization loses image quality; discrete tokens are suboptimal for continuous signals
??? | Transfusion | Use the RIGHT objective for each modality

Chameleon (the previous paper in this series) showed that you can tokenize images and train a single autoregressive model. But there's a cost: VQ tokenization introduces quantization error, and autoregressive generation of 1024 image tokens is slow. Image generation quality lags behind dedicated diffusion models.

Transfusion's insight: Don't force images into the text paradigm. Instead, let each modality use its natural objective. Text stays autoregressive (predict the next token). Images use diffusion (denoise continuous vectors). The clever part: both objectives train the same transformer simultaneously. One model, two loss functions, best of both worlds.

Think of it this way: text is inherently sequential and discrete. The word "cat" is meaningfully different from "bat" — there's no smooth interpolation. Images are inherently continuous and spatial. A pixel at (100, 200) has a smooth relationship with neighboring pixels. Forcing images into discrete tokens is like forcing a continuous function through a staircase — you lose information at every step.

Autoregressive vs Diffusion Generation

See how text generation (left, token by token) and image generation (right, progressive denoising) work fundamentally differently. Transfusion uses BOTH in one model.

Why does Transfusion use diffusion for images instead of tokenizing them like Chameleon?

Chapter 1: Discrete vs Continuous

To understand Transfusion, we need to be precise about what "discrete" and "continuous" mean in this context, and why the distinction matters for generation quality.

Discrete representations (text and Chameleon-style images)

Text tokens are inherently discrete: "cat" is token 2364, "dog" is token 3920. There's no token 3142 that's "half cat, half dog." The vocabulary is finite, and each token is a distinct symbol. Autoregressive models handle this perfectly: at each step, output a probability distribution over the vocabulary and sample one token.

Chameleon forced images into this paradigm by VQ-quantization: each 16×16 patch is mapped to the nearest codebook entry. But this introduces quantization error — the original patch vector and the codebook vector are never exactly the same. The information lost in this rounding cannot be recovered.

Quantization error: $\lVert z - e_k \rVert^2$, where $z$ is the encoder output and $e_k$ is the nearest codebook vector
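
To make the error term concrete, here is a minimal NumPy sketch of nearest-codebook quantization. The vectors and codebooks are random toy data, and the `quantize` helper is a hypothetical name, not Chameleon's tokenizer. It only illustrates that the error is non-zero and that a larger codebook can shrink it but not remove it.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to its nearest codebook row (Euclidean distance)."""
    # Pairwise squared distances: [num_vectors, codebook_size]
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)                 # discrete token IDs
    z_q = codebook[ids]                     # quantized approximation of z
    err = ((z - z_q) ** 2).sum(axis=-1)     # ||z - e_k||^2 per vector
    return ids, z_q, err

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))               # toy "encoder outputs"
small = rng.normal(size=(8, 8))             # 8-entry codebook
large = np.concatenate([small, rng.normal(size=(504, 8))])  # 512 entries, superset of `small`

ids_small, _, err_small = quantize(z, small)
ids_large, _, err_large = quantize(z, large)
print("mean error, 8 entries:  ", err_small.mean())
print("mean error, 512 entries:", err_large.mean())
```

Because `large` contains every entry of `small`, its error can never be higher for any vector, yet it stays above zero for continuous inputs.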

Continuous representations (Transfusion images)

Transfusion represents images as continuous latent vectors. Instead of mapping each patch to a codebook entry (discrete), it uses a VAE encoder to produce continuous vectors and feeds them to the transformer as they are. The VAE is still a lossy compressor, but there is no rounding step and therefore no quantization error.

python
# Chameleon (discrete): information lost at quantization
z = encoder(image)          # [B, 256, 32, 32] — continuous
ids = codebook.nearest(z)  # [B, 1024] — discrete integers
z_q = codebook(ids)         # [B, 256, 32, 32] — APPROXIMATION of z
# z_q ≠ z — quantization error is permanent

# Transfusion (continuous): no quantization step
z = vae_encoder(image)      # [B, 8, 32, 32], continuous latents
z_patches = patchify(z)     # [B, 256, 8*2*2=32], 256 patches of dim 32 (2×2 patches)
# z_patches hold z exactly: values are rearranged, never rounded

The tradeoff: continuous representations can't be predicted with a standard softmax. You can't output a probability over infinite continuous values. This is where diffusion comes in — it's a generative process specifically designed for continuous data.

The representation choice determines the generation method. Discrete tokens → softmax over vocabulary → autoregressive sampling. Continuous vectors → noise prediction → iterative denoising (diffusion). Transfusion uses BOTH within one model: discrete softmax for text positions, continuous diffusion for image positions.

Patchification: how images enter the transformer

Transfusion converts images into patches, but unlike Chameleon, these patches are continuous vectors, not discrete token IDs:

VAE Encoder: Image (256×256×3) → latent (8×32×32). Compression to 8 channels.
Patchify: Split latent into 2×2 patches → 256 patches, each of dimension 8×2×2 = 32. Or 4×4 patches → 64 patches of dim 128.
Linear Projection: Project each patch vector to the transformer's hidden dimension D. These become "image tokens" in the sequence.
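
The patch arithmetic above can be checked with a short NumPy sketch. `patchify` and `unpatchify` are illustrative helpers written for this example (the channel-then-row-then-column ordering inside a patch is an assumption), and the latent is a dummy array rather than a real VAE output.

```python
import numpy as np

def patchify(z, p):
    """Split a latent [C, H, W] into non-overlapping p x p patches -> [(H/p)*(W/p), C*p*p]."""
    c, h, w = z.shape
    assert h % p == 0 and w % p == 0
    z = z.reshape(c, h // p, p, w // p, p)      # [C, H/p, p, W/p, p]
    z = z.transpose(1, 3, 0, 2, 4)              # [H/p, W/p, C, p, p]
    return z.reshape((h // p) * (w // p), c * p * p)

def unpatchify(patches, c, h, w, p):
    """Inverse of patchify: rebuild the [C, H, W] latent."""
    z = patches.reshape(h // p, w // p, c, p, p).transpose(2, 0, 3, 1, 4)
    return z.reshape(c, h, w)

latent = np.arange(8 * 32 * 32, dtype=np.float32).reshape(8, 32, 32)
patches_2 = patchify(latent, 2)
patches_4 = patchify(latent, 4)
print(patches_2.shape)   # (256, 32)
print(patches_4.shape)   # (64, 128)
```

The round trip through `unpatchify` returns the original latent exactly, which is the sense in which patchification involves no rounding.
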
Discrete vs Continuous Image Representation

Drag the slider to adjust quantization levels. Left shows the original continuous representation; right shows the quantized version. Notice how fewer codebook entries create visible artifacts.

What is the fundamental advantage of continuous image representations (Transfusion) over discrete tokens (Chameleon)?

Chapter 2: The Dual Objectives

Here's the core trick. Transfusion trains a single transformer with two loss functions simultaneously: a language modeling loss on text tokens and a diffusion loss on image patches. Each loss function applies only to the positions of its modality.

Text objective: next-token prediction

For text positions, the model uses standard autoregressive language modeling. At each text position, predict the next token from a softmax over the vocabulary:

$$\mathcal{L}_{\text{text}} = -\sum_{t \,\in\, \text{text positions}} \log p(x_t \mid x_{<t})$$

This is identical to how GPT, LLaMA, and every other autoregressive LM is trained. No modification needed.
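
As a small worked example of this loss, the NumPy sketch below evaluates it on toy logits. The function name `next_token_loss` and the four-token vocabulary are made up for illustration. It sums over positions as in the formula, whereas library routines such as PyTorch's `F.cross_entropy` average by default.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Sum of -log p(target) over positions. logits: [T, V], targets: [T]."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

targets = np.array([0, 2, 1])            # 3 text positions, 4-token toy vocabulary

uniform_logits = np.zeros((3, 4))        # model has no idea: p = 1/4 everywhere
confident_logits = np.full((3, 4), -20.0)
confident_logits[np.arange(3), targets] = 20.0   # model is nearly certain and right

loss_uniform = next_token_loss(uniform_logits, targets)
loss_confident = next_token_loss(confident_logits, targets)
print(loss_uniform)      # equals 3 * ln(4)
print(loss_confident)    # close to 0
```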

Image objective: diffusion denoising

For image positions, the model uses a DDPM-style diffusion objective. At each image position, the model predicts the noise that was added to the clean image patch:

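$$\mathcal{L}_{\text{image}} = \mathbb{E}_{t,\,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$$

Where $\epsilon$ is the Gaussian noise added to the clean image patch, $x_t$ is the noisy version at timestep $t$, and $\epsilon_\theta$ is the model's noise prediction. This is the standard diffusion training loss from DDPM.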
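
The noisy input $x_t$ comes from a fixed noise schedule. The sketch below builds one under an assumed linear beta schedule (the values commonly associated with DDPM; the schedule Transfusion actually uses may differ) and noises a batch of toy patches. `add_noise` is a hypothetical helper name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)         # cumulative signal retention
sqrt_alpha = np.sqrt(alpha_bar)             # multiplies the clean patch
sqrt_1m_alpha = np.sqrt(1.0 - alpha_bar)    # multiplies the noise

def add_noise(x0, t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return sqrt_alpha[t] * x0 + sqrt_1m_alpha[t] * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 32))         # 256 toy patches of dimension 32
x_t, eps = add_noise(x0, 500, rng)

# A predictor that always outputs zeros scores roughly 1.0 (the variance of eps);
# a perfect predictor scores 0. Training pushes the model toward the latter.
zero_predictor_loss = np.mean((eps - 0.0) ** 2)
print(sqrt_alpha[0], sqrt_alpha[-1], zero_predictor_loss)
```

Early timesteps keep almost all of the signal and the final timestep keeps almost none, which is why sampling can start from pure noise.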

The combined loss

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{text}} + \lambda \cdot \mathcal{L}_{\text{image}}$$

The weighting factor λ balances the two objectives. The paper sets λ = 5 after preliminary experiments, so the image loss is weighted more heavily than the text loss. During training, both losses are computed on every batch and backpropagated through the shared transformer parameters.

Why this works (and it's surprising that it does): Language modeling and diffusion are mathematically very different objectives. One is a discrete classification problem (predict a token from 65K options). The other is a continuous regression problem (predict a noise vector in $\mathbb{R}^d$). Yet the shared transformer learns features that are useful for both. This suggests that deep self-attention representations capture something fundamental about sequential structure that transcends the specific output modality.
python
# Transfusion training step (simplified sketch)
import torch
import torch.nn.functional as F

def training_step(model, batch, sqrt_alpha, sqrt_1m_alpha, lam):
    # sqrt_alpha / sqrt_1m_alpha: [1000] noise-schedule tables; lam: image loss weight λ
    text_ids = batch["text_ids"]              # [B, T] token IDs
    text_targets = batch["text_targets"]      # [B, T] next-token targets (already shifted)
    image_patches = batch["image_patches"]    # [B, N, patch_dim] clean latent patches
    modality = batch["modality_mask"]         # [B, T+N], 0=text, 1=image for each position

    # For image positions: add noise (diffusion forward process)
    B = image_patches.shape[0]
    t = torch.randint(0, 1000, (B,), device=image_patches.device)  # one random timestep per sample
    noise = torch.randn_like(image_patches)
    a = sqrt_alpha[t].view(B, 1, 1)           # broadcast over patches and channels
    s = sqrt_1m_alpha[t].view(B, 1, 1)
    noisy_patches = a * image_patches + s * noise

    # Forward pass through the shared transformer; it sees noisy patches, never clean ones
    text_logits, noise_pred = model(text_ids, noisy_patches, t, modality)

    # Text loss: cross-entropy on text positions
    loss_text = F.cross_entropy(text_logits, text_targets.reshape(-1))       # logits: [B*T, vocab_size]

    # Image loss: MSE on noise prediction at image positions
    loss_image = F.mse_loss(noise_pred, noise.reshape(-1, noise.shape[-1]))  # pred: [B*N, patch_dim]

    return loss_text + lam * loss_image
Dual Loss Visualizer

Watch how both losses are computed on the same sequence. Text positions use cross-entropy (discrete classification); image positions use MSE on noise prediction (continuous regression). Both backpropagate through the shared transformer.

How does Transfusion handle the fact that text and images require fundamentally different generation methods?

Chapter 3: The Architecture

Transfusion's architecture is a standard decoder-only transformer with a few modality-specific additions at the input and output boundaries. The core transformer itself is unchanged — no special layers, no modality-specific routing.

Input: modality-specific embeddings

Text tokens enter through a standard embedding table (discrete IDs → vectors). Image patches enter through a linear projection (continuous vectors → same-dimensional vectors). Both are projected to the transformer's hidden dimension D.

Text Input: Token ID → Embedding table lookup → vector ∈ $\mathbb{R}^D$
Image Input: Continuous patch vector → Linear projection → vector ∈ $\mathbb{R}^D$ (+ timestep embedding t for diffusion)
↓ interleave
Shared Transformer: Standard causal self-attention + FFN. No modality-specific layers. All tokens processed identically.

Output: modality-specific heads

The transformer produces hidden states for every position. These are then routed to the appropriate output head based on modality:

Position Type | Output Head | Output
Text | Linear → softmax over vocab | Probability distribution over 65K tokens
Image | Linear → noise prediction | Predicted noise vector ∈ $\mathbb{R}^{\text{patch\_dim}}$

U-Net inside the transformer

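One important architectural detail concerns how image patches enter and leave the transformer. Standard diffusion models use a U-Net because its down/up blocks and skip connections help preserve fine-grained spatial details during denoising. Transfusion borrows this idea at the boundaries: as an alternative to the plain linear projections, the paper also experiments with U-Net down and up blocks that encode latent patches into transformer vectors and decode them back. The code below shows the simpler linear variant: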

python
# Transfusion forward pass (linear patch projections shown; interleave and
# TransformerBlock are assumed helpers that are not defined here)
import torch
import torch.nn as nn

class TransfusionModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.text_embed = nn.Embedding(65536, config.dim)
        self.img_proj_in = nn.Linear(config.patch_dim, config.dim)
        self.time_embed = nn.Embedding(1000, config.dim)   # diffusion timestep

        self.layers = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])

        # Output heads
        self.lm_head = nn.Linear(config.dim, 65536)          # text output
        self.noise_head = nn.Linear(config.dim, config.patch_dim)  # image output

    def forward(self, text_ids, img_patches, timestep, mask):
        # Embed each modality
        text_emb = self.text_embed(text_ids)       # [B, T, D]
        img_emb = self.img_proj_in(img_patches)    # [B, N, D]
        # timestep is [B]; unsqueeze so the [B, 1, D] embedding broadcasts over the N patches
        img_emb = img_emb + self.time_embed(timestep).unsqueeze(1)

        # Interleave into single sequence
        x = interleave(text_emb, img_emb, mask)    # [B, T+N, D]

        # Shared transformer (attention mask omitted for brevity)
        for layer in self.layers:
            x = layer(x)

        # Route to appropriate output heads
        text_out = self.lm_head(x[mask == 0])     # [B*T, vocab] text logits
        img_out = self.noise_head(x[mask == 1])   # [B*N, patch_dim] noise predictions
        return text_out, img_out
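Minimal modality-specific parameters. The only modality-specific components are the input embeddings, the output heads, and the timestep embedding. The text embedding table and vocabulary head are the same ones any language model carries; the image-specific pieces (patch projection, noise head, timestep embedding) are tiny by comparison. Everything else, the shared transformer that holds the large majority of the parameters, is completely modality-agnostic. This means adding a new modality (audio, video) would require only new input/output projections, not a new transformer.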
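
A rough count makes the proportions concrete. The sketch below uses the sizes that appear in this document's code (65,536-token vocabulary, 1,000 timesteps, patch dimension 32) together with the hidden size 4096 listed for the 7B configuration. It treats 7B as a nominal total and ignores bias terms, so the percentages are estimates only.

```python
vocab, dim, patch_dim, timesteps = 65536, 4096, 32, 1000   # sizes taken from this document
total = 7e9                                                 # nominal parameter count (assumption)

text_specific = vocab * dim + dim * vocab            # embedding table + LM head
image_specific = patch_dim * dim + dim * patch_dim   # input projection + noise head
image_specific += timesteps * dim                    # timestep embedding table

print(f"text-specific:  {text_specific:,} params ({text_specific / total:.2%})")
print(f"image-specific: {image_specific:,} params ({image_specific / total:.3%})")
```

Under these assumptions the text embedding and vocabulary head come to roughly 7.7% of the model, while everything added for images is about 0.06%. The cost of supporting a second modality with linear projections is therefore negligible.
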
Transfusion Architecture Diagram

Step through the forward pass to see how text tokens and image patches flow through the shared transformer, then diverge to modality-specific output heads.

What percentage of Transfusion's parameters are modality-specific?

Chapter 4: Attention Masking

This is where Transfusion gets subtle. Standard autoregressive models use causal masking: each token can only attend to tokens before it. This makes sense for text — you generate left-to-right, one token at a time. But images are different.

The problem with causal masking for images

In diffusion, you denoise the entire image simultaneously. All patches are denoised at the same timestep — they should be able to see each other. If you enforce causal masking on image patches, the first patch can't see the last patch, even though they're being denoised together. This hurts quality because image patches need spatial context from the entire image.

Transfusion's solution: mixed masking

Transfusion uses a hybrid attention mask that combines causal masking for text with bidirectional masking for images within each image block:

Query Type | Can Attend To | Reason
Text token | All previous positions: earlier text tokens and earlier image blocks (full images only) | Causal: text is generated left-to-right
Image patch | All previous positions + all patches in the SAME image (bidirectional) | Bidirectional within image: diffusion denoises all patches simultaneously

$$\text{Mask}(i, j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are in the same image block (bidirectional)} \\ 1 & \text{if } j \le i \text{ (causal)} \\ 0 & \text{otherwise} \end{cases}$$

Visually, the attention mask looks like this: the text tokens form a standard lower-triangular causal mask. Each image block is a dense square (all patches attend to all patches within the same image). Image patches can also attend to all text tokens that precede the image.

python
# Building Transfusion's mixed attention mask
import torch

def build_transfusion_mask(seq_len, image_boundaries):
    # image_boundaries: list of (start, end) for each image block, end exclusive
    # mask[i, j] == True means query position i may attend to key position j

    # Causal base: every position (text or image) attends to itself and everything before it
    mask = torch.ones(seq_len, seq_len).tril().bool()

    # Bidirectional within each image block: patches also see later patches of the same image
    for start, end in image_boundaries:
        mask[start:end, start:end] = True
    return mask
Why mixed masking matters so much: The paper's ablations show that using pure causal masking (like Chameleon) for images hurts FID by 30-40%. Bidirectional attention within each image block allows patches to share spatial information during denoising, which is essential for coherent image generation. This is one of Transfusion's key innovations.
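
To see the pattern on a concrete sequence, the following pure-Python sketch prints the mask for a toy layout. `mixed_mask` is an illustrative helper written for this example, and the sequence (2 text tokens, a 3-patch image, 2 more text tokens) is far smaller than anything used in practice.

```python
def mixed_mask(seq_len, image_blocks):
    """True where query i may attend to key j: causal everywhere, dense inside each image block."""
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    for start, end in image_blocks:          # end is exclusive
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask

# Toy layout: text at positions 0-1, image patches at 2-4, text at 5-6
mask = mixed_mask(7, [(2, 5)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed grid is lower-triangular except for rows 2 to 4, where the dense 3×3 square lets each image patch see the patches that come after it in the same image.
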
Attention Mask Visualizer

See Transfusion's mixed attention mask. Text tokens (teal) use causal masking. Image patches (orange) have bidirectional attention within their block. Toggle between masking strategies to see the difference.

Why does Transfusion use bidirectional attention within image blocks instead of causal masking?

Chapter 5: Training & Scaling

Transfusion's training reveals something unexpected: the dual-objective approach nearly matches a dedicated text model on perplexity while beating a dedicated diffusion model on image quality, all with a single set of weights.

Scaling experiments

The paper trains models at multiple scales (0.16B, 0.37B, 0.76B, 7B parameters) and compares three approaches:

Approach | Description | 7B FID ↓ | 7B Text Perplexity ↓
Chameleon-style | All tokens discrete, single AR loss | ~24 | ~8.2
Transfusion | Text: AR, Images: diffusion, shared transformer | ~6.8 | ~8.0
Separate models | One text-only LM + one diffusion model (same total params) | ~7.5 | ~7.8

Two findings stand out:

Finding 1: Transfusion dramatically outperforms Chameleon on image generation. At 7B parameters, Transfusion achieves FID ~6.8 vs Chameleon's ~24. The diffusion objective is simply a better fit for continuous image data than autoregressive token prediction.
Finding 2: Shared training costs text very little and helps images. Transfusion's text perplexity (8.0) is nearly as good as the separate text-only model (7.8) trained on the same text data, and its image quality (FID 6.8) is better than the separate diffusion model (FID 7.5). The image result suggests that the shared representations transfer across modalities.

Scaling laws

Transfusion follows clean power-law scaling for both objectives. With N the number of model parameters:

$$\text{Loss}_{\text{text}} \propto N^{-0.08}, \qquad \text{FID} \propto N^{-0.36}$$

Image quality improves faster with scale than text quality, suggesting that larger Transfusion models will have even more impressive image generation. The text scaling exponent matches typical LLM scaling laws, indicating no degradation from the shared training.
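
A quick calculation shows what these exponents mean in practice. The sketch takes the two exponents quoted above at face value and computes the multiplicative change in each metric when the parameter count grows; `scale_factor` is a helper defined only for this example.

```python
def scale_factor(exponent, size_ratio):
    """Multiplicative change in a metric that scales as N**(-exponent) when N grows by size_ratio."""
    return size_ratio ** (-exponent)

for name, exponent in [("text loss", 0.08), ("FID", 0.36)]:
    per_doubling = scale_factor(exponent, 2)
    per_tenfold = scale_factor(exponent, 10)
    print(f"{name}: x{per_doubling:.3f} per doubling, x{per_tenfold:.3f} per 10x")
```

Doubling the model trims the text loss by about 5% but cuts FID by about 22%, which is the sense in which image quality improves faster with scale.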

python
# Transfusion scaling configuration at different sizes
configs = {
    "0.16B": {"layers": 12, "dim": 768,  "heads": 12, "tokens": "0.5T"},
    "0.37B": {"layers": 24, "dim": 1024, "heads": 16, "tokens": "0.5T"},
    "0.76B": {"layers": 24, "dim": 1536, "heads": 24, "tokens": "0.5T"},
    "7B":    {"layers": 32, "dim": 4096, "heads": 32, "tokens": "2T"},
}
# All configs: LR=3e-4, cosine schedule, bf16, context=4096
Scaling Law Explorer

Drag the slider to scale model size and see how FID (image quality) and perplexity (text quality) change. Notice how image quality improves faster with scale.

What surprising finding emerges from Transfusion's scaling experiments?

Chapter 6: Inference & Results

Inference in Transfusion works differently depending on whether you're generating text or images. For text, it's the standard autoregressive decode loop. For images, it's an iterative denoising process. The model seamlessly switches between the two modes within a single generation.

Text generation: standard autoregressive

When the model needs to generate text, it predicts the next text token from its vocabulary, samples it, appends it to the sequence, and repeats. This is identical to inference in GPT or LLaMA.
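
The loop just described can be sketched in a few lines. Everything here is illustrative: `toy_model` is a hypothetical stand-in for the transformer that deterministically prefers "last token + 1", and `sample_next` and `generate` are names chosen for this example rather than any library's API.

```python
import numpy as np

def sample_next(logits, rng, temperature=1.0):
    """Sample one token ID from softmax(logits / temperature)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(next_token_logits, prompt, max_new_tokens, eos_id, rng):
    """Predict, sample, append, repeat, until eos_id or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = sample_next(next_token_logits(tokens), rng)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens

def toy_model(tokens, vocab_size=5):
    """Hypothetical model: puts all probability mass on (last token + 1) mod vocab_size."""
    logits = np.full(vocab_size, -1e9)
    logits[(tokens[-1] + 1) % vocab_size] = 0.0
    return logits

output = generate(toy_model, [0], max_new_tokens=10, eos_id=4, rng=np.random.default_rng(0))
print(output)
```

In a real system `next_token_logits` would be a transformer forward pass (usually with a key/value cache), and in Transfusion the loop would hand over to the diffusion sampler when the image-start token is produced.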

Image generation: iterative denoising

When the model encounters an image-generation trigger (e.g., an <image> token), it switches to diffusion mode:

Initialize: Create 256 random noise patches ∈ $\mathbb{R}^{\text{patch\_dim}}$. Set diffusion timestep t = T (maximum noise).
Repeat for t = T, T-1, ..., 1:
  Predict Noise: Feed [text context, noisy image patches, timestep t] through transformer. Get noise predictions for all 256 patches simultaneously.
  Denoise: Subtract predicted noise from patches (with scheduling). Patches become slightly less noisy.
Decode: Once t = 0, pass denoised patches through VAE decoder to get pixels. Resume text generation after </image>.
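
The denoising loop above can be sketched as DDPM-style ancestral sampling. This is a minimal illustration under stated assumptions, not the paper's sampler: the linear beta schedule is assumed, the loop runs all T steps rather than a shortened schedule, text conditioning is omitted, and `oracle` is a hypothetical stand-in for the transformer that already knows the clean patches, so the result can be checked.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_sample(predict_noise, shape, rng):
    """Start from pure noise and apply T denoising steps."""
    x = rng.standard_normal(shape)
    for t in range(T - 1, -1, -1):
        eps_hat = predict_noise(x, t)      # one pass over all patches at once
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0   # no fresh noise on the final step
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Hypothetical oracle: computes the exact noise given known clean patches
target = np.random.default_rng(1).standard_normal((256, 32))

def oracle(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

patches = ddpm_sample(oracle, target.shape, np.random.default_rng(0))
print(np.abs(patches - target).max())
```

With a perfect noise predictor the loop lands on the clean patches; a trained model only approximates this, and the final patches would then go through the VAE decoder.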

The key insight: during denoising, all 256 patches are processed in parallel (bidirectional attention within the image block). This is much faster than Chameleon's autoregressive generation of 1024 tokens one at a time.

Generation speed comparison

Model | Image Gen Steps | Tokens per Image | Total Forward Passes
Chameleon | 1024 (AR steps) | 1024 | 1024
Transfusion (250 steps) | 250 (diffusion steps) | 256 patches × 250 steps | 250
DALL-E 2 | ~50-250 (diffusion) | N/A (continuous) | 50-250
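Speed advantage: Each diffusion step processes all 256 patches in parallel (one forward pass). Chameleon needs 1024 sequential forward passes (one per token). Even with 250 diffusion steps, Transfusion needs about 4x fewer sequential forward passes (1024 / 250 ≈ 4.1), and the quality is dramatically better (FID 6.8 vs 24). Each diffusion pass does more work than a cached single-token decode step, so the wall-clock gain depends on hardware and implementation.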
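
The arithmetic behind the comparison is short enough to spell out. The sketch uses only the step and patch counts from the table above and counts passes and positions, not FLOPs or seconds.

```python
chameleon_passes = 1024          # one autoregressive pass per image token
transfusion_passes = 250         # one pass per diffusion step, all 256 patches at once

ratio = chameleon_passes / transfusion_passes
patch_evaluations = 256 * transfusion_passes   # image positions processed across all steps
print(f"{ratio:.1f}x fewer sequential passes")
print(f"{patch_evaluations} patch evaluations vs {chameleon_passes} token evaluations")
```

The sequential depth drops by about 4x, but the total number of image positions evaluated goes up. The advantage comes from parallelism within each pass rather than from doing less work.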

Benchmark results

At 7B parameters, Transfusion achieves:

Benchmark | Transfusion 7B | Chameleon 7B | DALL-E 2 | LLaMA 7B (text only)
FID ↓ | 6.78 | ~24 | 5.97 | N/A
Text PPL ↓ | 8.0 | 8.4 | N/A | 7.8
GenEval overall score ↑ | 0.63 | 0.39 | 0.52 | N/A
Transfusion Inference Simulator

Watch Transfusion generate a mixed text+image output. Text tokens appear one at a time (autoregressive), then the image is progressively denoised (diffusion). Drag the slider to control denoising speed.

Why is Transfusion's image generation faster than Chameleon's?

Chapter 7: Connections

Transfusion builds directly on Chameleon's proof that early fusion works, but replaces the suboptimal discrete image representation with continuous diffusion. It represents a maturation of the multimodal foundation model paradigm.

The evolution of multimodal generation

Model | Image Repr. | Image Gen Method | Shared Backbone? | Key Innovation
DALL-E (2021) | Discrete (dVAE) | Autoregressive | No | Proved AR image gen works
Chameleon (2024) | Discrete (VQ) | Autoregressive | Yes | Unified text+image in one transformer
Transfusion (2024) | Continuous (VAE) | Diffusion | Yes | Best of AR (text) + diffusion (images)
MoT (2024) | Continuous | Diffusion | Partially | Modality-specific experts
Lesson 1: Match the objective to the data type. Discrete data (text) is best modeled with discrete objectives (cross-entropy). Continuous data (images) is best modeled with continuous objectives (diffusion). Forcing everything through the same objective (as Chameleon does) sacrifices quality.
Lesson 2: Attention masking is not one-size-fits-all. Transfusion's mixed masking (causal for text, bidirectional for images) is a key innovation. Different generation paradigms need different information flow patterns, even within the same model.
Lesson 3: Shared parameters enable positive transfer. The surprising finding that Transfusion outperforms separate models suggests that text understanding genuinely helps image generation (and vice versa). The shared transformer learns universal sequence representations.

Transfusion's approach is likely to become the template for future multimodal foundation models. The idea — use the right generation method for each modality while sharing the backbone — generalizes naturally to audio (diffusion), video (diffusion), code (autoregressive), and beyond.

Multimodal Generation Approaches

Compare different approaches to multimodal generation. Each column shows how a model handles text (top) and images (bottom).

What is Transfusion's most important conceptual contribution?