Transfusion (Meta 2024)

Chapter 0: Two Worlds

Language models generate text one token at a time. Diffusion models generate images by gradually denoising random noise. These two approaches have been remarkably successful — but they're incompatible. You can't "denoise" text, and you can't "predict the next pixel" of an image efficiently.

This creates an awkward situation for multimodal AI. If you want a model that handles both text and images, you have three options:

Approach	Example	Problem
Separate models	GPT-4 + DALL-E	Two models, no shared reasoning, expensive
Tokenize everything	Chameleon	VQ quantization loses image quality; discrete tokens are suboptimal for continuous signals
???	Transfusion	Use the RIGHT objective for each modality

Chameleon (the previous paper in this series) showed that you can tokenize images and train a single autoregressive model. But there's a cost: VQ tokenization introduces quantization error, and autoregressive generation of 1024 image tokens is slow. Image generation quality lags behind dedicated diffusion models.

Transfusion's insight: Don't force images into the text paradigm. Instead, let each modality use its natural objective. Text stays autoregressive (predict the next token). Images use diffusion (denoise continuous vectors). The clever part: both objectives train the same transformer simultaneously. One model, two loss functions, best of both worlds.

Think of it this way: text is inherently sequential and discrete. The word "cat" is meaningfully different from "bat" — there's no smooth interpolation. Images are inherently continuous and spatial. A pixel at (100, 200) has a smooth relationship with neighboring pixels. Forcing images into discrete tokens is like forcing a continuous function through a staircase — you lose information at every step.

Figure (interactive): Autoregressive vs Diffusion Generation. Text generation (left, token by token) and image generation (right, progressive denoising) work fundamentally differently. Transfusion uses both in one model.

Why does Transfusion use diffusion for images instead of tokenizing them like Chameleon?

(a) Because images are inherently continuous signals: VQ tokenization introduces quantization error that degrades quality, and diffusion naturally operates on continuous representations, producing higher-fidelity images while text remains discrete and autoregressive
(b) Because diffusion is faster than autoregressive generation
(c) Because Chameleon's approach was completely wrong

Chapter 1: Discrete vs Continuous

To understand Transfusion, we need to be precise about what "discrete" and "continuous" mean in this context, and why the distinction matters for generation quality.

Discrete representations (text and Chameleon-style images)

Text tokens are inherently discrete: "cat" is token 2364, "dog" is token 3920. There's no token 3142 that's "half cat, half dog." The vocabulary is finite, and each token is a distinct symbol. Autoregressive models handle this perfectly: at each step, output a probability distribution over the vocabulary and sample one token.
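
As a minimal sketch of that last step, the snippet below turns a vector of logits into a probability distribution and samples one token ID. The four-word vocabulary and the logit values are made up for illustration; a real model would produce one logit per vocabulary entry.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng):
    """Sample one token ID from the softmax distribution over the vocabulary."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical 4-token vocabulary and logits for a single decoding step
vocab = ["cat", "dog", "bat", "the"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
token_id = sample_token(logits, random.Random(0))
print(vocab[token_id], [round(p, 3) for p in probs])
```

The output is always one of a finite set of IDs, which is why this recipe has no direct analogue for continuous vectors.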

Chameleon forced images into this paradigm through vector quantization (VQ): each 16×16 patch is mapped to the nearest codebook entry. But this introduces quantization error, because the original patch vector and the codebook vector are almost never exactly the same. The information lost in this rounding cannot be recovered.

Quantization error: ||z − e_k||² where z is the encoder output and e_k is the nearest codebook vector
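
To make the formula concrete, here is a small sketch of nearest-neighbor quantization for a single patch vector. The 2-D codebook and the encoder output are toy values chosen for illustration, not taken from any real tokenizer.

```python
def nearest_code(z, codebook):
    """Return (index, squared error) of the codebook vector closest to z."""
    best_k, best_err = -1, float("inf")
    for k, e in enumerate(codebook):
        err = sum((zi - ei) ** 2 for zi, ei in zip(z, e))  # ||z - e_k||^2
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

# Toy codebook with 4 entries (illustrative values only)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = [0.9, 0.2]                 # continuous encoder output for one patch
k, err = nearest_code(z, codebook)
z_q = codebook[k]              # the only thing a decoder would see
print(k, z_q, round(err, 4))
```

Here the nearest entry is index 1, and the squared error of about 0.05 is the part of z that the discrete token cannot carry.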

Continuous representations (Transfusion images)

Transfusion represents images as continuous latent vectors. Instead of mapping each patch to a codebook entry (discrete), it uses a VAE encoder to produce continuous vectors and feeds them to the transformer as they are. The VAE is still a lossy compressor, but nothing is rounded to a codebook, so there is no quantization error.

python
# Shape-annotated pseudocode: encoder, codebook, vae_encoder and patchify
# are placeholders for the real modules.

# Chameleon (discrete): information is lost at quantization
z = encoder(image)          # [B, 256, 32, 32], continuous
ids = codebook.nearest(z)   # [B, 1024], discrete integers (32*32 positions)
z_q = codebook(ids)         # [B, 256, 32, 32], an APPROXIMATION of z
# z_q != z: the quantization error is permanent

# Transfusion (continuous): no quantization step
z = vae_encoder(image)      # [B, 8, 32, 32], continuous latents
z_patches = patchify(z)     # [B, 256, 8*2*2=32]: 256 patches (2x2) of dim 32
# z_patches hold the VAE latents unchanged; nothing is rounded to a codebook

The tradeoff: continuous representations can't be predicted with a standard softmax. You can't output a probability over infinite continuous values. This is where diffusion comes in — it's a generative process specifically designed for continuous data.

The representation choice determines the generation method. Discrete tokens → softmax over vocabulary → autoregressive sampling. Continuous vectors → noise prediction → iterative denoising (diffusion). Transfusion uses BOTH within one model: discrete softmax for text positions, continuous diffusion for image positions.

Patchification: how images enter the transformer

Transfusion converts images into patches, but unlike Chameleon, these patches are continuous vectors, not discrete token IDs:

VAE Encoder

Image (256×256×3) → latent (8×32×32). Compression to 8 channels.

↓

Patchify

Split latent into 2×2 patches → 256 patches, each of dimension 8×2×2 = 32. Or 4×4 patches → 64 patches of dim 128.

↓

Linear Projection

Project each patch vector to the transformer's hidden dimension D. These become "image tokens" in the sequence.
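
The patchify step above can be sketched in plain Python. This is an illustrative version that works on nested lists rather than tensors; the function name `patchify` and the channel-major ordering inside each patch are assumptions, and the latent is filled with index-coded dummy values so the layout is easy to check.

```python
def patchify(latent, p):
    """Split a [C][H][W] nested list into (H/p)*(W/p) flat patches of length C*p*p."""
    C, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert H % p == 0 and W % p == 0, "patch size must divide the latent size"
    patches = []
    for top in range(0, H, p):            # walk the grid row by row
        for left in range(0, W, p):
            patch = [latent[c][top + dy][left + dx]
                     for c in range(C) for dy in range(p) for dx in range(p)]
            patches.append(patch)
    return patches

# Dummy 8 x 32 x 32 latent; each value encodes its own (channel, y, x) position
latent = [[[c * 10000 + y * 100 + x for x in range(32)]
           for y in range(32)] for c in range(8)]

patches_2x2 = patchify(latent, 2)
patches_4x4 = patchify(latent, 4)
print(len(patches_2x2), len(patches_2x2[0]))   # 256 patches of dim 32
print(len(patches_4x4), len(patches_4x4[0]))   # 64 patches of dim 128
```

Larger patches shorten the sequence the transformer has to process, at the price of packing more of the image into each vector.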

Figure (interactive): Discrete vs Continuous Image Representation. The original continuous representation (left) is shown next to a quantized version (right). Fewer codebook entries create visible artifacts.

What is the fundamental advantage of continuous image representations (Transfusion) over discrete tokens (Chameleon)?

(a) Continuous representations preserve the exact information from the VAE encoder with no quantization error, enabling higher-fidelity image generation, whereas discrete tokens round each patch vector to the nearest codebook entry, permanently losing information
(b) Continuous representations are smaller in size
(c) Continuous representations are faster to process

Chapter 2: The Dual Objectives

Here's the core trick. Transfusion trains a single transformer with two loss functions simultaneously: a language modeling loss on text tokens and a diffusion loss on image patches. Each loss function applies only to the positions of its modality.

Text objective: next-token prediction

For text positions, the model uses standard autoregressive language modeling. At each text position, predict the next token from a softmax over the vocabulary:

L_text = −∑_{t ∈ text positions} log p(x_t | x_<t)

This is identical to how GPT, LLaMA, and every other autoregressive LM is trained. No modification needed.
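
As a worked example of the text loss, the sketch below sums the negative log-probability the model assigned to each correct next token. The three probability rows are invented stand-ins for model outputs over a 3-token vocabulary.

```python
import math

def next_token_loss(probs_per_step, targets):
    """Sum of -log p(x_t | x_<t) over text positions."""
    return -sum(math.log(p[t]) for p, t in zip(probs_per_step, targets))

# Hypothetical model outputs: one distribution per text position
probs_per_step = [
    [0.7, 0.2, 0.1],   # fairly confident in token 0
    [0.1, 0.8, 0.1],   # fairly confident in token 1
    [0.3, 0.3, 0.4],   # unsure
]
targets = [0, 1, 2]    # the tokens that actually came next
loss = next_token_loss(probs_per_step, targets)
print(round(loss, 4))
```

The uncertain third position contributes the most to the loss (-log 0.4 is about 0.92), which is the pressure that pushes the model toward sharper, correct predictions.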

Image objective: diffusion denoising

For image positions, the model uses a DDPM-style diffusion objective. At each image position, the model predicts the noise that was added to the clean image patch:

L_image = E_{t, ε} ||ε − ε_θ(x_t, t)||²

Where ε is the Gaussian noise added to the clean image patch, x_t is the noisy version at timestep t, and ε_θ is the model's noise prediction. This is the standard diffusion training loss from DDPM.
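
The forward noising step and the loss can be sketched for a single patch as follows. The linear beta schedule (1e-4 to 0.02 over 1000 steps) is the default from the original DDPM paper and is an assumption here, since the schedule Transfusion uses is not stated in this text. The helper names are hypothetical, and the "prediction" is a zero vector standing in for an untrained model.

```python
import math
import random

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps; return (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

def noise_mse(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
alpha_bar = make_alpha_bar()
x0 = [0.5, -1.0, 0.25, 2.0]            # one clean latent patch (toy values)
x_t, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
zero_guess = [0.0] * len(eps)          # stand-in for an untrained model
print(noise_mse(eps, zero_guess), noise_mse(eps, eps))
```

A perfect noise prediction gives a loss of exactly zero, and knowing the noise is equivalent to knowing the clean patch, since x0 can be solved for from x_t and ε.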

The combined loss

L_total = L_text + λ · L_image

The weighting factor λ balances the two objectives. The paper sets λ = 5 following preliminary experiments and leaves further tuning to future work. During training, both losses are computed on every batch and backpropagated through the shared transformer parameters.

Why this works (and it's surprising that it does): Language modeling and diffusion are mathematically very different objectives. One is a discrete classification problem (predict a token from 65K options). The other is a continuous regression problem (predict a noise vector in R^d). Yet the shared transformer learns features that are useful for both. This suggests that deep self-attention representations capture something fundamental about sequential structure that transcends the specific output modality.

python
import torch
import torch.nn.functional as F

# Transfusion training step (simplified sketch)
# alphas_cumprod: [T_max] tensor of cumulative products of (1 - beta_t)
# lam: weight λ on the image loss in L_total
def training_step(model, batch, alphas_cumprod, lam):
    text_ids = batch["text_ids"]            # [B, T] token IDs
    text_targets = batch["text_targets"]    # [B, T] next-token labels
    image_patches = batch["image_patches"]  # [B, N, patch_dim] clean latent patches
    modality = batch["modality_mask"]       # [B, T+N], 0=text, 1=image per position

    # For image positions: add noise (diffusion forward process)
    B = image_patches.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,),
                      device=image_patches.device)   # one random timestep per example
    noise = torch.randn_like(image_patches)
    a_bar = alphas_cumprod[t].view(B, 1, 1)          # broadcast over patches and dims
    noisy_patches = a_bar.sqrt() * image_patches + (1 - a_bar).sqrt() * noise

    # Forward pass through the shared transformer.
    # The model sees the noisy patches in place of the clean ones.
    text_logits, noise_pred = model(text_ids, noisy_patches, t, modality)

    # Text loss: cross-entropy on text positions
    # text_logits: [B*T, vocab_size]
    loss_text = F.cross_entropy(text_logits, text_targets.reshape(-1))

    # Image loss: MSE on noise prediction at image positions
    # noise_pred: [B*N, patch_dim]
    loss_image = F.mse_loss(noise_pred, noise.reshape(-1, noise.shape[-1]))

    return loss_text + lam * loss_image

Figure (interactive): Dual Loss Visualizer. Both losses are computed on the same sequence. Text positions use cross-entropy (discrete classification); image positions use MSE on noise prediction (continuous regression). Both backpropagate through the shared transformer.

How does Transfusion handle the fact that text and images require fundamentally different generation methods?

(a) It uses two separate loss functions on the same transformer: cross-entropy for text positions (predicting discrete tokens) and MSE for image positions (predicting diffusion noise), both computed on every batch and backpropagated through shared parameters
(b) It converts both to the same representation first
(c) It trains two separate models and combines them

Chapter 3: The Architecture

Transfusion's architecture is a standard decoder-only transformer with a few modality-specific additions at the input and output boundaries. The core transformer itself is unchanged — no special layers, no modality-specific routing.

Input: modality-specific embeddings

Text tokens enter through a standard embedding table (discrete IDs → vectors). Image patches enter through a linear projection (continuous vectors → same-dimensional vectors). Both are projected to the transformer's hidden dimension D.

Text Input

Token ID → Embedding table lookup → vector ∈ R^D

↓ interleave

Image Input

Continuous patch vector → Linear projection → vector ∈ R^D (+ timestep embedding t for diffusion)

↓

Shared Transformer

Standard causal self-attention + FFN. No modality-specific layers. All tokens processed identically.

Output: modality-specific heads

The transformer produces hidden states for every position. These are then routed to the appropriate output head based on modality:

Position Type	Output Head	Output
Text	Linear → softmax over vocab	Probability distribution over 65K tokens
Image	Linear → noise prediction	Predicted noise vector ∈ R^patch_dim

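U-Net blocks around the transformer

One important architectural detail: besides plain linear projections, the paper also experiments with U-Net down and up blocks to encode image patches into the transformer's input space and decode them back out. Standard diffusion models use a U-Net because its downsampling and upsampling path with skip connections helps preserve fine-grained spatial detail during denoising, and the paper reports that these modality-specific encoding and decoding layers further improve image generation. The simpler linear variant looks like this: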

python
import torch
import torch.nn as nn

# Transfusion with linear patch projections (simplified sketch).
# TransformerBlock is assumed to be a standard attention + FFN block defined elsewhere.
class TransfusionModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.text_embed = nn.Embedding(65536, config.dim)
        self.img_proj_in = nn.Linear(config.patch_dim, config.dim)
        self.time_embed = nn.Embedding(1000, config.dim)   # diffusion timestep

        self.layers = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])

        # Output heads
        self.lm_head = nn.Linear(config.dim, 65536)          # text output
        self.noise_head = nn.Linear(config.dim, config.patch_dim)  # image output

    def forward(self, text_ids, img_patches, timestep, mask):
        # mask: [B, T+N] with 0=text, 1=image; every row has T text and N image positions
        # Embed each modality
        text_emb = self.text_embed(text_ids)       # [B, T, D]
        img_emb = self.img_proj_in(img_patches)    # [B, N, D]
        # timestep is [B]; unsqueeze so the [B, 1, D] embedding broadcasts over patches
        img_emb = img_emb + self.time_embed(timestep).unsqueeze(1)

        # Interleave into a single sequence by scattering each modality into its positions
        D = text_emb.shape[-1]
        x = text_emb.new_zeros(mask.shape[0], mask.shape[1], D)   # [B, T+N, D]
        x[mask == 0] = text_emb.reshape(-1, D)
        x[mask == 1] = img_emb.reshape(-1, D)

        # Shared transformer (attention mask omitted for brevity)
        for layer in self.layers:
            x = layer(x)

        # Route to appropriate output heads
        text_out = self.lm_head(x[mask == 0])     # [B*T, vocab] text logits
        img_out = self.noise_head(x[mask == 1])   # [B*N, patch_dim] noise predictions
        return text_out, img_out

Minimal modality-specific parameters. The only modality-specific components are the input embeddings, the output heads, and the timestep embedding. The shared transformer — which contains 99%+ of the parameters — is completely modality-agnostic. This means adding a new modality (audio, video) would require only new input/output projections, not a new transformer.

Figure (interactive): Transfusion Architecture Diagram. A step-through of the forward pass shows how text tokens and image patches flow through the shared transformer, then diverge to modality-specific output heads.

What percentage of Transfusion's parameters are modality-specific?

(a) About 50%: half for text, half for images
(b) Less than 1%: only the input embeddings, output heads, and timestep embeddings are modality-specific; the shared transformer (99%+ of parameters) is completely modality-agnostic
(c) About 25%: one head per modality

Chapter 4: Attention Masking

This is where Transfusion gets subtle. Standard autoregressive models use causal masking: each token can only attend to tokens before it. This makes sense for text — you generate left-to-right, one token at a time. But images are different.

The problem with causal masking for images

In diffusion, you denoise the entire image simultaneously. All patches are denoised at the same timestep — they should be able to see each other. If you enforce causal masking on image patches, the first patch can't see the last patch, even though they're being denoised together. This hurts quality because image patches need spatial context from the entire image.

Transfusion's solution: mixed masking

Transfusion uses a hybrid attention mask that combines causal masking for text with bidirectional masking for images within each image block:

Query Type	Can Attend To	Reason
Text token	All previous positions: earlier text tokens and every patch of earlier image blocks	Causal: text is generated left-to-right
Image patch	All previous positions (text and earlier images) + all patches in the SAME image (bidirectional)	Bidirectional within image: diffusion denoises all patches simultaneously

Mask(i, j) =
1 if j ≤ i (causal: every position sees itself and all earlier positions, text or image),
1 if i and j are in the same image block (bidirectional),
0 otherwise

Visually, the attention mask looks like this: the text tokens form a standard lower-triangular causal mask. Each image block is a dense square (all patches attend to all patches within the same image). Image patches can also attend to everything that precedes the image, both text tokens and earlier images.

python
import torch

# Building Transfusion's mixed attention mask
def build_transfusion_mask(seq_len, image_boundaries):
    # image_boundaries: list of (start, end) for each image block, end exclusive
    # mask[i, j] == True means position i may attend to position j

    # Causal base: every position (text or image patch) attends to itself
    # and to everything before it, including earlier images
    mask = torch.ones(seq_len, seq_len).tril().bool()

    # Bidirectional within each image block: all patches of one image
    # attend to each other, including "later" patches of that image
    for start, end in image_boundaries:
        mask[start:end, start:end] = True
    return mask

Why mixed masking matters so much: The paper's ablations show that using pure causal masking (like Chameleon) for images hurts FID by 30-40%. Bidirectional attention within each image block allows patches to share spatial information during denoising, which is essential for coherent image generation. This is one of Transfusion's key innovations.
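
To see the pattern, the sketch below builds the mask for a toy sequence in plain Python and prints it as a grid. It assumes the rule described in the paper: causal attention over the whole sequence, plus full attention inside each image block. The function name `mixed_mask` and the 7-position layout are illustrative only.

```python
def mixed_mask(seq_len, image_blocks):
    """Causal mask everywhere, plus full attention inside each image block.

    image_blocks: list of (start, end) index pairs, end exclusive.
    Returns a seq_len x seq_len list of booleans; mask[i][j] means i may attend to j.
    """
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    for start, end in image_blocks:
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask

# Toy sequence: 2 text tokens, a 3-patch image, then 2 more text tokens
mask = mixed_mask(7, image_blocks=[(2, 5)])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

This prints the following grid, where rows 2 to 4 (the image patches) form a dense square on top of the causal triangle:

```
1......
11.....
11111..
11111..
11111..
111111.
1111111
```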

Figure (interactive): Attention Mask Visualizer. Transfusion's mixed attention mask: text tokens (teal) use causal masking, and image patches (orange) have bidirectional attention within their block.

Why does Transfusion use bidirectional attention within image blocks instead of causal masking?

(a) Because diffusion denoises all image patches simultaneously, so they need to see each other for spatial coherence. Causal masking would prevent earlier patches from attending to later ones, hurting image quality by 30-40% FID as shown in ablations
(b) Because bidirectional attention is faster to compute
(c) Because images don't have a natural ordering

Chapter 5: Training & Scaling

Transfusion's training reveals something unexpected: the dual-objective approach doesn't just match separate models — it's actually more efficient at learning both tasks than training separate models on the same data.

Scaling experiments

The paper trains models at multiple scales (0.16B, 0.37B, 0.76B, 7B parameters) and compares three approaches:

Approach	Description	7B FID ↓	7B Text Perplexity ↓
Chameleon-style	All tokens discrete, single AR loss	~24	~8.2
Transfusion	Text: AR, Images: diffusion, shared transformer	~6.8	~8.0
Separate models	One text-only LM + one diffusion model (same total params)	~7.5	~7.8

Two findings stand out:

Finding 1: Transfusion dramatically outperforms Chameleon on image generation. At 7B parameters, Transfusion achieves FID ~6.8 vs Chameleon's ~24. The diffusion objective is simply a better fit for continuous image data than autoregressive token prediction.

Finding 2: Shared training helps images at almost no cost to text. Transfusion's text perplexity (8.0) is nearly as good as that of the separate text-only model (7.8) trained on the same text data, and its image quality (FID 6.8) is better than the separate diffusion model's (FID 7.5). This suggests the shared representations transfer across modalities rather than compete for capacity.

Scaling laws

Transfusion follows clean power-law scaling for both objectives. As a function of the parameter count N:

Loss_text ∝ N^−0.08, FID ∝ N^−0.36

Image quality improves faster with scale than text quality, suggesting that larger Transfusion models will have even more impressive image generation. The text scaling exponent matches typical LLM scaling laws, indicating no degradation from the shared training.
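
A quick worked calculation shows what these exponents mean in practice. The sketch takes the two exponents quoted above at face value and computes the multiplicative change in each metric when the model grows by a given factor; `scale_factor` is a hypothetical helper name.

```python
def scale_factor(exponent, size_ratio):
    """Multiplicative change in a metric that scales as N**exponent."""
    return size_ratio ** exponent

text_exp, fid_exp = -0.08, -0.36      # exponents quoted above
for ratio in (2, 10):
    text_change = scale_factor(text_exp, ratio)
    fid_change = scale_factor(fid_exp, ratio)
    print(ratio, round(text_change, 3), round(fid_change, 3))
```

Under these exponents, doubling N multiplies text loss by about 0.946 (roughly a 5% reduction) and FID by about 0.779 (roughly 22%). A 10x larger model gives about 0.832 and 0.437 respectively.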

python
# Transfusion scaling configuration at different sizes
configs = {
    "0.16B": {"layers": 12, "dim": 768,  "heads": 12, "tokens": "0.5T"},
    "0.37B": {"layers": 24, "dim": 1024, "heads": 16, "tokens": "0.5T"},
    "0.76B": {"layers": 24, "dim": 1536, "heads": 24, "tokens": "0.5T"},
    "7B":    {"layers": 32, "dim": 4096, "heads": 32, "tokens": "2T"},
}
# All configs: LR=3e-4, cosine schedule, bf16, context=4096

Figure (interactive): Scaling Law Explorer. FID (image quality) and perplexity (text quality) as a function of model size. Image quality improves faster with scale.

What surprising finding emerges from Transfusion's scaling experiments?

(a) Training text and image objectives together in one model improves image quality over a separate diffusion model while keeping text perplexity close to a text-only model trained on the same data, which points to cross-modal transfer in the shared representations
(b) Text quality degrades significantly when images are added
(c) Smaller models perform better than larger ones

Chapter 6: Inference & Results

Inference in Transfusion works differently depending on whether you're generating text or images. For text, it's the standard autoregressive decode loop. For images, it's an iterative denoising process. The model seamlessly switches between the two modes within a single generation.

Text generation: standard autoregressive

When the model needs to generate text, it predicts the next text token from its vocabulary, samples it, appends it to the sequence, and repeats. This is identical to inference in GPT or LLaMA.
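
The decode loop itself is short. Below is a toy sketch in which a hard-coded lookup table stands in for the trained model; the table, the special tokens, and the greedy choice of the next token are all illustrative assumptions.

```python
# Hypothetical next-token table standing in for a trained language model
NEXT = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: predict, append, repeat until end-of-sequence."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        nxt = NEXT.get(seq[-1], "</s>")   # this toy "model" looks only at the last token
        seq.append(nxt)
        if nxt == "</s>":
            break
    return seq

output = generate(["<s>"])
print(output)   # ['<s>', 'the', 'cat', 'sat', '</s>']
```

Each new token requires its own pass through the model, which is the sequential cost that matters when an image is represented as a long run of discrete tokens.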

Image generation: iterative denoising

When the model encounters an image-generation trigger (e.g., an <image> token), it switches to diffusion mode:

Initialize

Create 256 random noise patches ∈ R^patch_dim. Set diffusion timestep t = T (maximum noise).

↓ repeat for t = T, T-1, ..., 1

Predict Noise

Feed [text context, noisy image patches, timestep t] through transformer. Get noise predictions for all 256 patches simultaneously.

↓

Denoise

Subtract predicted noise from patches (with scheduling). Patches become slightly less noisy.

↻ until t = 0

Decode

Pass denoised patches through VAE decoder to get pixels. Resume text generation after </image>.
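
The denoising loop above can be sketched with standard DDPM ancestral sampling. Several things here are assumptions made for illustration: the linear beta schedule, the variance choice sigma_t = sqrt(beta_t), and above all the noise predictor, which is an "oracle" that already knows the clean patch and stands in for the transformer's forward pass.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def ddpm_sample(predict_noise, dim, T, rng):
    """Run the DDPM reverse process from pure noise down to a clean vector."""
    betas = linear_betas(T)
    alpha_bar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bar.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]        # start from noise
    for t in reversed(range(T)):                          # t = T-1, ..., 0
        eps_hat = predict_noise(x, t, alpha_bar)          # one "forward pass"
        coef = betas[t] / math.sqrt(1.0 - alpha_bar[t])
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t])
                for xi, e in zip(x, eps_hat)]
        if t > 0:                                         # no noise on the final step
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

# Stand-in for the transformer: an oracle that knows the clean patch is TARGET
TARGET = [0.5, -1.0, 0.25, 2.0]

def oracle_noise(x, t, alpha_bar):
    a = alpha_bar[t]
    return [(xi - math.sqrt(a) * c) / math.sqrt(1.0 - a) for xi, c in zip(x, TARGET)]

sample = ddpm_sample(oracle_noise, dim=len(TARGET), T=250, rng=random.Random(0))
print([round(v, 3) for v in sample])
```

With a perfect noise predictor the loop lands on the clean patch. In Transfusion the predictor would be the transformer, called once per step on all patches of the image together with the text context.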

The key insight: during denoising, all 256 patches are processed in parallel (bidirectional attention within the image block), so each diffusion step costs one forward pass. With 250 steps, that is far fewer sequential passes than Chameleon's autoregressive generation of 1024 tokens one at a time.

Generation speed comparison

Model	Image Gen Steps	Tokens per Image	Total Forward Passes
Chameleon	1024 (AR steps)	1024	1024
Transfusion (250 steps)	250 (diffusion steps)	256 patches × 250 steps	250
DALL-E 2	~50-250 (diffusion)	N/A (continuous)	50-250

Speed advantage: Each diffusion step processes all 256 patches in parallel (one forward pass). Chameleon needs 1024 sequential forward passes (one per token). Even with 250 diffusion steps, Transfusion needs about 4x fewer sequential forward passes (1024 / 250 ≈ 4.1), and the quality is dramatically better (FID 6.8 vs 24). Pass counts are only a proxy for wall-clock time, since the cost of a single pass differs between the two models.
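
The arithmetic behind the comparison is simple enough to check directly. The numbers below come from the table above; treating each forward pass as a unit of cost is a simplification.

```python
chameleon_tokens = 1024      # image tokens generated one at a time
transfusion_steps = 250      # diffusion steps, one forward pass each
patches_per_image = 256      # patches processed in parallel per step

pass_ratio = chameleon_tokens / transfusion_steps
patch_evals = transfusion_steps * patches_per_image

print(round(pass_ratio, 2))  # ratio of sequential forward passes
print(patch_evals)           # patch positions processed across all steps
```

The ratio of sequential passes is about 4.1, but Transfusion processes 64,000 patch positions in total across its 250 steps, while an autoregressive decoder with a key-value cache processes one new token per pass. How the pass ratio translates into wall-clock speedup therefore depends on hardware and implementation.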

Benchmark results

At 7B parameters, Transfusion achieves:

Benchmark	Transfusion 7B	Chameleon 7B	DALL-E 2	LLaMA 7B (text only)
FID (MS-COCO) ↓	6.78	~24	10.39	N/A
Text PPL ↓	8.0	8.4	N/A	7.8
GenEval overall score ↑	0.63	0.39	0.52	N/A

Figure (interactive): Transfusion Inference Simulator. Transfusion generates a mixed text+image output: text tokens appear one at a time (autoregressive), then the image is progressively denoised (diffusion).

Why is Transfusion's image generation faster than Chameleon's?

(a) Because it uses a smaller model
(b) Because it uses fewer image tokens
(c) Because each diffusion step denoises all 256 patches in parallel (one forward pass), while Chameleon generates 1024 tokens sequentially (1024 forward passes), so Transfusion needs about 4x fewer sequential passes even with 250 denoising steps

Chapter 7: Connections

Transfusion builds directly on Chameleon's proof that early fusion works, but replaces the suboptimal discrete image representation with continuous diffusion. It represents a maturation of the multimodal foundation model paradigm.

The evolution of multimodal generation

Model	Image Repr.	Image Gen Method	Shared Backbone?	Key Innovation
DALL-E (2021)	Discrete (dVAE)	Autoregressive	No	Proved AR image gen works
Chameleon (2024)	Discrete (VQ)	Autoregressive	Yes	Unified text+image in one transformer
Transfusion (2024)	Continuous (VAE)	Diffusion	Yes	Best of AR (text) + diffusion (images)
MoT (2024)	Continuous	Diffusion	Partially	Modality-specific experts

Lesson 1: Match the objective to the data type. Discrete data (text) is best modeled with discrete objectives (cross-entropy). Continuous data (images) is best modeled with continuous objectives (diffusion). Forcing everything through the same objective (as Chameleon does) sacrifices quality.

Lesson 2: Attention masking is not one-size-fits-all. Transfusion's mixed masking (causal for text, bidirectional for images) is a key innovation. Different generation paradigms need different information flow patterns, even within the same model.

Lesson 3: Shared parameters enable positive transfer. Transfusion's image quality beats that of a separate diffusion model while its text perplexity stays close to a text-only model's, which suggests that text understanding genuinely helps image generation. The shared transformer appears to learn representations that are useful across modalities.

Transfusion's approach is likely to become the template for future multimodal foundation models. The idea — use the right generation method for each modality while sharing the backbone — generalizes naturally to audio (diffusion), video (diffusion), code (autoregressive), and beyond.

Figure (interactive): Multimodal Generation Approaches. A comparison of different approaches to multimodal generation. Each column shows how a model handles text (top) and images (bottom).

What is Transfusion's most important conceptual contribution?

(a) Faster inference speed
(b) Demonstrating that a single transformer can simultaneously train with different objectives for different modalities (autoregressive for text, diffusion for images), achieving better results than either approach alone, and providing a template for true multimodal foundation models
(c) Better text quality than GPT-4
