Predict the Next Token and Diffuse Images with One Multi-Modal Model — autoregressive for text, diffusion for images, one transformer, one training run.
Language models generate text one token at a time. Diffusion models generate images by gradually denoising random noise. These two approaches have been remarkably successful — but they're incompatible. You can't "denoise" text, and you can't "predict the next pixel" of an image efficiently.
This creates an awkward situation for multimodal AI. If you want a model that handles both text and images, you have three options:
| Approach | Example | Problem |
|---|---|---|
| Separate models | GPT-4 + DALL-E | Two models, no shared reasoning, expensive |
| Tokenize everything | Chameleon | VQ quantization loses image quality; discrete tokens are suboptimal for continuous signals |
| ??? | Transfusion | Use the RIGHT objective for each modality |
Chameleon (the previous paper in this series) showed that you can tokenize images and train a single autoregressive model. But there's a cost: VQ tokenization introduces quantization error, and autoregressive generation of 1024 image tokens is slow. Image generation quality lags behind dedicated diffusion models.
Think of it this way: text is inherently sequential and discrete. The word "cat" is meaningfully different from "bat" — there's no smooth interpolation. Images are inherently continuous and spatial. A pixel at (100, 200) has a smooth relationship with neighboring pixels. Forcing images into discrete tokens is like forcing a continuous function through a staircase — you lose information at every step.
See how text generation (left, token by token) and image generation (right, progressive denoising) work fundamentally differently. Transfusion uses BOTH in one model.
To understand Transfusion, we need to be precise about what "discrete" and "continuous" mean in this context, and why the distinction matters for generation quality.
Text tokens are inherently discrete: "cat" is token 2364, "dog" is token 3920. There's no token 3142 that's "half cat, half dog." The vocabulary is finite, and each token is a distinct symbol. Autoregressive models handle this perfectly: at each step, output a probability distribution over the vocabulary and sample one token.
Chameleon forced images into this paradigm by VQ-quantization: each 16×16 patch is mapped to the nearest codebook entry. But this introduces quantization error — the original patch vector and the codebook vector are never exactly the same. The information lost in this rounding cannot be recovered.
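To make the rounding concrete, here is a minimal sketch of a nearest-codebook lookup. The 2-D codebook and the patch vector are made-up values chosen for illustration; real VQ codebooks hold thousands of higher-dimensional entries.

```python
import math

def nearest_code(vec, codebook):
    """Return (index, entry) of the codebook vector closest to vec (Euclidean distance)."""
    best = min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))
    return best, codebook[best]

# Hypothetical 4-entry codebook and one continuous patch vector.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patch = (0.8, 0.3)

idx, z_q = nearest_code(patch, codebook)
error = math.dist(patch, z_q)   # distance between the original vector and its codebook stand-in
print(idx, z_q, round(error, 4))
```

Only `idx` is passed on to the autoregressive model, so the residual `patch - z_q` is unavailable to anything downstream.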
Transfusion represents images as continuous latent vectors. Instead of snapping each patch to a codebook entry (discrete), it uses a VAE encoder to produce continuous vectors and feeds them to the transformer as they are. The VAE is still a lossy compressor, but there is no rounding step and therefore no quantization error on top of it.
```python
# Shape-level pseudocode: encoder, codebook, vae_encoder and patchify stand for trained components.

# Chameleon (discrete): information lost at quantization
z = encoder(image)             # [B, 256, 32, 32], continuous
ids = codebook.nearest(z)      # [B, 1024], discrete integers
z_q = codebook(ids)            # [B, 256, 32, 32], an APPROXIMATION of z
# z_q != z: the quantization error is permanent

# Transfusion (continuous): no quantization step
z = vae_encoder(image)         # [B, 8, 32, 32], continuous latents
z_patches = patchify(z)        # [B, 256, 8*2*2=32]: 256 patches (2x2 latents each) of dim 32
# z_patches hold the VAE latents unchanged, with no codebook lookup
```
The tradeoff: continuous representations can't be predicted with a standard softmax. You can't output a probability over infinite continuous values. This is where diffusion comes in — it's a generative process specifically designed for continuous data.
Transfusion converts images into patches, but unlike Chameleon, these patches are continuous vectors, not discrete token IDs.
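The patch arithmetic can be checked with a small sketch. It assumes the shapes quoted in the text (8 latent channels on a 32×32 grid, 256 patches per image), which implies 2×2 latent patches of 8·2·2 = 32 values each. The `patchify` helper below is a plain-Python illustration, not the paper's implementation.

```python
def patchify(latent, p):
    """Split a [C][H][W] nested-list latent into (H/p * W/p) flat patches of length C*p*p."""
    C, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert H % p == 0 and W % p == 0, "patch size must divide the latent grid"
    patches = []
    for top in range(0, H, p):
        for left in range(0, W, p):
            patches.append([
                latent[c][top + dy][left + dx]
                for c in range(C) for dy in range(p) for dx in range(p)
            ])
    return patches

# Dummy latent whose values encode their own (channel, row, column) position.
C, H, W = 8, 32, 32
latent = [[[float(c * H * W + y * W + x) for x in range(W)] for y in range(H)] for c in range(C)]

patches = patchify(latent, p=2)
print(len(patches), len(patches[0]))   # number of patches, values per patch
```

Each patch is a list of floats taken directly from the latent grid; no lookup table is involved at any point.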
Here's the core trick. Transfusion trains a single transformer with two loss functions simultaneously: a language modeling loss on text tokens and a diffusion loss on image patches. Each loss function applies only to the positions of its modality.
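For text positions, the model uses standard autoregressive language modeling. At each text position, it predicts the next token from a softmax over the vocabulary and is trained to minimize the negative log-likelihood of the token that actually follows:

$$\mathcal{L}_{\text{LM}} = \mathbb{E}_{y_i}\left[-\log P_\theta(y_i \mid y_{<i})\right]$$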
This is identical to how GPT, LLaMA, and every other autoregressive LM are trained. No modification is needed.
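As a worked example of the text loss, the sketch below computes the average negative log-likelihood by hand for a toy 4-token vocabulary. The logits and targets are invented for illustration.

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(logits_per_position, targets):
    """Average negative log-likelihood of the target token at each text position."""
    nll = [-math.log(softmax(logits)[tgt])
           for logits, tgt in zip(logits_per_position, targets)]
    return sum(nll) / len(nll)

logits = [[2.0, 0.0, 0.0, 0.0],    # position 0: the model favours token 0
          [0.0, 0.0, 0.0, 0.0]]    # position 1: the model is uniform over 4 tokens
targets = [0, 3]                   # the tokens that actually came next

loss = lm_loss(logits, targets)
print(round(loss, 4))
```

The uniform position contributes log 4 to the sum, while the confident and correct position contributes much less. That gap is what the gradient pushes the model to close.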
For image positions, the model uses a DDPM-style diffusion objective. At each image position, the model predicts the noise that was added to the clean image patch:

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0, t, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$$

Here $\epsilon$ is the Gaussian noise added to the clean image patch $x_0$, $x_t$ is the noisy version at timestep $t$, and $\epsilon_\theta$ is the model's noise prediction. This is the standard diffusion training loss from DDPM.
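The noisy input comes from the closed-form forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below builds the two coefficient tables and applies them to one made-up patch. The linear beta schedule (1e-4 to 0.02 over 1000 steps) is the DDPM default and is an assumption here, not a detail confirmed for Transfusion.

```python
import math
import random

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed) -> lists of sqrt(alpha_bar_t) and sqrt(1 - alpha_bar_t)."""
    sqrt_ab, sqrt_1m_ab, alpha_bar = [], [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        alpha_bar *= 1.0 - beta                 # running product of (1 - beta)
        sqrt_ab.append(math.sqrt(alpha_bar))
        sqrt_1m_ab.append(math.sqrt(1.0 - alpha_bar))
    return sqrt_ab, sqrt_1m_ab

def add_noise(x0, t, sqrt_ab, sqrt_1m_ab, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [sqrt_ab[t] * x + sqrt_1m_ab[t] * e for x, e in zip(x0, eps)]
    return xt, eps

def noise_mse(eps, eps_pred):
    """Mean squared error between the true noise and a prediction of it."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

sqrt_ab, sqrt_1m_ab = make_schedule()
rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]                     # one clean patch (made-up values)
xt, eps = add_noise(x0, t=500, sqrt_ab=sqrt_ab, sqrt_1m_ab=sqrt_1m_ab, rng=rng)

# A perfect noise predictor scores zero; always predicting zero costs mean(eps^2).
print(noise_mse(eps, eps), noise_mse(eps, [0.0] * len(eps)))
```

Because x_t is a known mix of x_0 and ε, predicting ε exactly is equivalent to recovering the clean patch, which is why noise prediction works as a training target.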
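The two losses are summed into a single training objective, $\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{DDPM}}$, where $\mathcal{L}_{\text{LM}}$ is the text loss, $\mathcal{L}_{\text{DDPM}}$ is the image loss, and the weighting factor λ balances the two. The paper sets λ = 5 after preliminary experiments and leaves further tuning to future work. During training, both losses are computed on every batch and backpropagated through the shared transformer parameters.

```python
import torch
import torch.nn.functional as F

# Transfusion training step (simplified sketch; the batch layout and model call are illustrative)
def training_step(model, batch, sqrt_alpha, sqrt_1m_alpha, lam=5.0):
    text_ids = batch["text_ids"]            # [B, T] token IDs
    text_targets = batch["text_targets"]    # [N_text] next-token targets, flattened over text positions
    image_patches = batch["image_patches"]  # [B, N, patch_dim] clean latent patches
    modality = batch["modality_mask"]       # [B, T+N], 0=text, 1=image for each position

    # For image positions: add noise (diffusion forward process)
    B = image_patches.shape[0]
    t = torch.randint(0, 1000, (B,), device=image_patches.device)  # random timestep per sample
    noise = torch.randn_like(image_patches)
    # Reshape the per-sample coefficients so they broadcast over patches and channels
    noisy_patches = (sqrt_alpha[t].view(B, 1, 1) * image_patches
                     + sqrt_1m_alpha[t].view(B, 1, 1) * noise)

    # Forward pass through the shared transformer, with noisy patches in place of clean ones
    output = model(text_ids, noisy_patches, t, modality)   # assumed to return hidden states [B, T+N, D]

    # Text loss: cross-entropy on text positions
    text_logits = model.lm_head(output[modality == 0])     # [N_text, vocab_size]
    loss_text = F.cross_entropy(text_logits, text_targets)

    # Image loss: MSE on noise prediction at image positions
    noise_pred = model.noise_head(output[modality == 1])   # [N_img, patch_dim]
    loss_image = F.mse_loss(noise_pred, noise.reshape(-1, noise.shape[-1]))

    return loss_text + lam * loss_image
```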
Watch how both losses are computed on the same sequence. Text positions use cross-entropy (discrete classification); image positions use MSE on noise prediction (continuous regression). Both backpropagate through the shared transformer.
Transfusion's architecture is a standard decoder-only transformer with a few modality-specific additions at the input and output boundaries. The core transformer itself is unchanged — no special layers, no modality-specific routing.
Text tokens enter through a standard embedding table (discrete IDs → vectors). Image patches enter through a linear projection (continuous vectors → same-dimensional vectors). Both are projected to the transformer's hidden dimension D.
The transformer produces hidden states for every position. These are then routed to the appropriate output head based on modality:
| Position Type | Output Head | Output |
|---|---|---|
| Text | Linear → softmax over vocab | Probability distribution over 65K tokens |
| Image | Linear → noise prediction | Predicted noise vector $\in \mathbb{R}^{\text{patch\_dim}}$ |
One important architectural detail concerns how image patches get into and out of the transformer. Standard diffusion models use a U-Net partly because its skip connections help preserve fine-grained spatial detail during denoising. Besides a plain linear projection, the paper also experiments with U-Net down and up blocks as the patch encoder and decoder around the transformer. The sketch below shows the simpler linear variant:

```python
import torch.nn as nn

# Transfusion with linear patch projections (U-Net down/up blocks omitted).
# TransformerBlock and interleave are assumed helpers defined elsewhere;
# the attention mask is left out of the layer call for brevity.
class TransfusionModel(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.text_embed = nn.Embedding(65536, config.dim)
        self.img_proj_in = nn.Linear(config.patch_dim, config.dim)
        self.time_embed = nn.Embedding(1000, config.dim)  # diffusion timestep
        self.layers = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])
        # Output heads
        self.lm_head = nn.Linear(config.dim, 65536)                 # text output
        self.noise_head = nn.Linear(config.dim, config.patch_dim)   # image output

    def forward(self, text_ids, img_patches, timestep, mask):
        # Embed each modality
        text_emb = self.text_embed(text_ids)                 # [B, T, D]
        img_emb = self.img_proj_in(img_patches)              # [B, N, D]
        # timestep is [B]; unsqueeze so the embedding broadcasts over the N patches
        img_emb = img_emb + self.time_embed(timestep).unsqueeze(1)

        # Interleave into a single sequence; mask is [B, T+N] with 0=text, 1=image
        x = interleave(text_emb, img_emb, mask)              # [B, T+N, D]

        # Shared transformer
        for layer in self.layers:
            x = layer(x)

        # Route to the appropriate output heads
        text_out = self.lm_head(x[mask == 0])      # text logits
        img_out = self.noise_head(x[mask == 1])    # noise predictions
        return text_out, img_out
```
This is where Transfusion gets subtle. Standard autoregressive models use causal masking: each token can only attend to tokens before it. This makes sense for text — you generate left-to-right, one token at a time. But images are different.
In diffusion, you denoise the entire image simultaneously. All patches are denoised at the same timestep — they should be able to see each other. If you enforce causal masking on image patches, the first patch can't see the last patch, even though they're being denoised together. This hurts quality because image patches need spatial context from the entire image.
Transfusion uses a hybrid attention mask that combines causal masking for text with bidirectional masking for images within each image block:
| Query Type | Can Attend To | Reason |
|---|---|---|
| Text token | All previous positions: earlier text tokens and any complete image blocks | Causal: text is generated left-to-right |
| Image patch | All previous positions (text and earlier images) + all patches in the SAME image (bidirectional) | Bidirectional within image: diffusion denoises all patches simultaneously |

Visually, the attention mask looks like this: the text tokens form a standard lower-triangular causal mask. Each image block is a dense square (all patches attend to all patches within the same image). Image patches can also attend to everything that precedes the image, whether text or an earlier image.

```python
import torch

# Building Transfusion's mixed attention mask
def build_transfusion_mask(seq_len, image_boundaries):
    # image_boundaries: list of (start, end) index pairs, end exclusive, one per image block
    # Returns a [seq_len, seq_len] bool tensor where mask[i, j] = True means i may attend to j

    # Start from a causal mask: every position sees itself and everything before it
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    # Make each image block a dense square: patches see each other in both directions
    for start, end in image_boundaries:
        mask[start:end, start:end] = True

    return mask
```
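To see the shape of the mask, here is a dependency-free sketch that prints it for a toy sequence of two text tokens, a three-patch image, and two more text tokens. The sequence layout and the `build_mask` name are illustrative choices.

```python
def build_mask(seq_len, image_blocks):
    """Causal mask plus a dense square per (start, end) image block; True = may attend."""
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    for start, end in image_blocks:
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask

labels = ["T", "T", "I", "I", "I", "T", "T"]          # T = text token, I = image patch
mask = build_mask(len(labels), image_blocks=[(2, 5)])

# One row per query position; "x" marks a key position it may attend to.
for label, row in zip(labels, mask):
    print(label, "".join("x" if allowed else "." for allowed in row))
```

The three image rows come out identical: every patch sees both text tokens and all three patches. The text rows keep the usual staircase, and the text after the image can see the whole image.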
Transfusion's training reveals something unexpected: the dual-objective approach doesn't just match separate models — it's actually more efficient at learning both tasks than training separate models on the same data.
The paper trains models at multiple scales (0.16B, 0.37B, 0.76B, 7B parameters) and compares three approaches:
| Approach | Description | 7B FID ↓ | 7B Text Perplexity ↓ |
|---|---|---|---|
| Chameleon-style | All tokens discrete, single AR loss | ~24 | ~8.2 |
| Transfusion | Text: AR, Images: diffusion, shared transformer | ~6.8 | ~8.0 |
| Separate models | One text-only LM + one diffusion model (same total params) | ~7.5 | ~7.8 |
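Two findings stand out in the table. First, Transfusion's image quality is far ahead of the Chameleon-style baseline (FID of roughly 6.8 versus 24) at nearly the same text perplexity. Second, the single shared model comes close to the separate text-only and diffusion models on text perplexity (roughly 8.0 versus 7.8) while scoring a better FID (roughly 6.8 versus 7.5).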
Transfusion follows clean power-law scaling for both objectives. Image quality improves faster with scale than text quality, which suggests that larger Transfusion models will generate even better images. The text scaling exponent matches typical LLM scaling laws, indicating no degradation from the shared training.
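A power law y = a·x^b is a straight line in log-log space, so the exponent can be read off with an ordinary least-squares fit. The sketch below shows the procedure on SYNTHETIC data generated from a known curve. The numbers are not results from the paper, and using parameter count as the x-axis is an assumption (training compute works the same way).

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# SYNTHETIC metric values generated from y = 20 * x**-0.3 (illustrative only).
params_b = [0.16, 0.37, 0.76, 7.0]              # model sizes in billions of parameters
metric = [20.0 * p ** -0.3 for p in params_b]

a, b = fit_power_law(params_b, metric)
print(round(a, 3), round(b, 3))
```

Comparing the fitted exponent `b` for the image metric against the one for text perplexity is how a claim like "image quality improves faster with scale" would be checked.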
```python
# Transfusion scaling configuration at different sizes
configs = {
    "0.16B": {"layers": 12, "dim": 768,  "heads": 12, "tokens": "0.5T"},
    "0.37B": {"layers": 24, "dim": 1024, "heads": 16, "tokens": "0.5T"},
    "0.76B": {"layers": 24, "dim": 1536, "heads": 24, "tokens": "0.5T"},
    "7B":    {"layers": 32, "dim": 4096, "heads": 32, "tokens": "2T"},
}
# All configs: LR=3e-4, cosine schedule, bf16, context=4096
```
Inference in Transfusion works differently depending on whether you're generating text or images. For text, it's the standard autoregressive decode loop. For images, it's an iterative denoising process. The model seamlessly switches between the two modes within a single generation.
When the model needs to generate text, it predicts the next text token from its vocabulary, samples it, appends it to the sequence, and repeats. This is identical to inference in GPT or LLaMA.
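When the model encounters an image-generation trigger (e.g., an <image> token), it switches to diffusion mode: it appends a block of pure-noise patches to the sequence, runs the transformer repeatedly to predict and remove noise from those patches over a fixed number of steps, and then returns to text mode once the image is fully denoised.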
The key insight: during denoising, all 256 patches are processed in parallel (bidirectional attention within the image block). This is much faster than Chameleon's autoregressive generation of 1024 tokens one at a time.
| Model | Image Gen Steps | Tokens per Image | Total Forward Passes |
|---|---|---|---|
| Chameleon | 1024 (AR steps) | 1024 | 1024 |
| Transfusion (250 steps) | 250 (diffusion steps) | 256 patches × 250 steps | 250 |
| DALL-E 2 | ~50-250 (diffusion) | N/A (continuous) | 50-250 |
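The mode switch and the forward-pass counts in the table can be sketched as a single decode loop. Everything below is a toy: `next_token` and `predict_noise` are hypothetical stand-ins for the two output heads of a trained model, the `</image>` end marker is an assumed convention, and the denoising update is a placeholder rather than a real DDPM or DDIM sampler.

```python
import random

BOI, EOI, EOS = "<image>", "</image>", "<eos>"

def generate(next_token, predict_noise, n_patches=4, patch_dim=2, steps=10, max_len=32, seed=0):
    """Alternate between token-by-token text decoding and whole-image denoising."""
    rng = random.Random(seed)
    seq, forward_passes = [], 0
    while len(seq) < max_len:
        tok = next_token(seq)                   # LM mode: one forward pass per token
        forward_passes += 1
        seq.append(tok)
        if tok == EOS:
            break
        if tok == BOI:                          # diffusion mode starts from pure noise
            patches = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(n_patches)]
            for t in range(steps, 0, -1):       # one forward pass per step, all patches at once
                eps = predict_noise(seq, patches, t)
                forward_passes += 1
                # Placeholder update: remove a 1/t share of the predicted noise.
                patches = [[x - e / t for x, e in zip(p, ep)] for p, ep in zip(patches, eps)]
            seq.append(patches)                 # the finished image joins the context
            seq.append(EOI)                     # back to LM mode
    return seq, forward_passes

# Stub model: emits a fixed script of tokens and treats the whole patch as noise.
script = iter(["a", "cat", BOI, "done", EOS])
seq, passes = generate(lambda s: next(script), lambda s, p, t: p)
print(passes, [item if isinstance(item, str) else "<patches>" for item in seq])
```

The image costs `steps` forward passes however many patches it has, whereas an autoregressive image decoder pays one pass per image token. Each diffusion pass does process every patch in the block, so fewer passes does not automatically mean proportionally less compute.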
At 7B parameters, Transfusion achieves:
| Benchmark | Transfusion 7B | Chameleon 7B | DALL-E 2 | LLaMA 7B (text only) |
|---|---|---|---|---|
| FID ↓ | 6.78 | ~24 | 5.97 | N/A |
| Text PPL ↓ | 8.0 | 8.4 | N/A | 7.8 |
| GenEval overall score ↑ | 0.63 | 0.39 | 0.52 | N/A |
Transfusion builds directly on Chameleon's proof that early fusion works, but replaces the suboptimal discrete image representation with continuous diffusion. It represents a maturation of the multimodal foundation model paradigm.
| Model | Image Repr. | Image Gen Method | Shared Backbone? | Key Innovation |
|---|---|---|---|---|
| DALL-E (2021) | Discrete (dVAE) | Autoregressive | No | Proved AR image gen works |
| Chameleon (2024) | Discrete (VQ) | Autoregressive | Yes | Unified text+image in one transformer |
| Transfusion (2024) | Continuous (VAE) | Diffusion | Yes | Best of AR (text) + diffusion (images) |
| MoT (2024) | Continuous | Diffusion | Partially | Modality-specific experts |
Transfusion's approach is likely to become the template for future multimodal foundation models. The idea — use the right generation method for each modality while sharing the backbone — generalizes naturally to audio (diffusion), video (diffusion), code (autoregressive), and beyond.