Adapting Pretrained Language Models for Multimodal Generation — take an existing text-only LLM and add image generation capability without retraining from scratch.
You have a powerful text-only LLM — let's say LLaMA-3 with 8B parameters, trained on trillions of text tokens. It's excellent at language. Now you want it to also generate images. The naive approach: train a new model from scratch on mixed text+image data (like Chameleon or Transfusion).
But training from scratch is wasteful. Your LLM already knows language — grammar, facts, reasoning. Why throw away trillions of tokens of text learning just to add image capability?
Think of it as teaching a novelist to paint. You don't want them to forget how to write — you want them to add painting to their repertoire. The challenge is adding the new skill without overwriting the old one.
Compare the cost and outcome of training from scratch vs adapting a pretrained LLM. Adaptation reuses the expensive text pretraining.
Training a multimodal model from scratch (Chameleon, Transfusion) requires relearning everything the LLM already knows. The compute cost is enormous:
| Approach | Text Training Cost | Image Training Cost | Total |
|---|---|---|---|
| From scratch | ~4T tokens (~$10M+) | ~1T tokens (~$3M+) | ~$13M+ |
| LMFusion | 0 (reuse pretrained) | ~500B tokens (~$1.5M) | ~$1.5M |
That's roughly a 9x cost reduction ($13M versus $1.5M, about 8.7x). The pretrained LLM's text abilities are also typically better than what you'd get from mixed-modal training, because it was trained on more text data and didn't have to "share" capacity with images.
Naively fine-tuning a pretrained LLM on image data can severely degrade its text abilities. Every weight shifts to accommodate image patterns, overwriting the carefully learned text representations. This is catastrophic forgetting, a well-known problem in continual learning.
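```python
import torch

# NOTE: load_pretrained, image_data, image_input_proj and image_output_head
# are illustrative placeholders, not a real library API.

# Naive adaptation: catastrophic forgetting
model = load_pretrained("llama-3-8b")
optimizer = torch.optim.AdamW(model.parameters())  # every weight is trainable
for batch in image_data:
    optimizer.zero_grad()
    loss = model(batch)   # training on images only
    loss.backward()       # gradients reach ALL weights
    optimizer.step()      # text knowledge is overwritten

# LMFusion: careful adaptation
model = load_pretrained("llama-3-8b")
model.text_ffn.requires_grad_(False)         # protect text-specific params
model.image_input_proj = image_input_proj    # new image params
model.image_output_head = image_output_head  # new image params
# Only update: attention (shared) + new image params
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable)
```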
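To make the freezing idea concrete, here is a minimal PyTorch sketch on a toy model. The module names (`attn`, `text_ffn`, `image_head`) and the tiny layer sizes are assumptions chosen for illustration, not LMFusion's actual code. It checks that one optimizer step changes the shared weights while leaving the frozen ones bit-for-bit identical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a pretrained model (hypothetical module names).
model = nn.ModuleDict({
    "attn": nn.Linear(4, 4),      # shared, will be fine-tuned
    "text_ffn": nn.Linear(4, 4),  # pretrained text path, will be frozen
})
model["image_head"] = nn.Linear(4, 4)    # newly added image parameters
model["text_ffn"].requires_grad_(False)  # freeze the text-specific weights

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

text_before = model["text_ffn"].weight.detach().clone()
attn_before = model["attn"].weight.detach().clone()

# One training step on random "image" features.
x = torch.randn(8, 4)
out = model["image_head"](model["text_ffn"](model["attn"](x)))
loss = out.pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

text_unchanged = torch.equal(model["text_ffn"].weight, text_before)
attn_changed = not torch.equal(model["attn"].weight, attn_before)
print(f"text_ffn unchanged: {text_unchanged}, attn changed: {attn_changed}")
```

Gradients still flow through the frozen block to reach the layers before it; freezing only stops that block's own weights from being updated.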
[Interactive chart: text quality (blue) and image quality (orange) under naive fine-tuning versus LMFusion.]
LMFusion's architecture applies the Transfusion paradigm (autoregressive text plus diffusion images) to an existing LLM backbone. The key design decisions determine which parameters are new, which are shared, and which are frozen.
| Component | Status | Why |
|---|---|---|
| Text embedding | Frozen | Already excellent from pretraining |
| Image input projection | New (trained) | Maps continuous patches to hidden dim |
| Attention QKV + output | Shared (fine-tuned) | Cross-modal reasoning requires adaptation |
| Text FFN | Frozen | Protects text-specific features |
| Image FFN | New (trained) | MoT-style separate FFN for images |
| Text output head | Frozen | Text generation unchanged |
| Image noise head | New (trained) | Predicts diffusion noise for images |
```python
# LMFusion layer structure (sketch)
import torch
import torch.nn as nn


class LMFusionLayer(nn.Module):
    def __init__(self, pretrained_layer, config):
        super().__init__()  # required before registering submodules
        # Reuse pretrained attention (will be fine-tuned)
        self.attn = pretrained_layer.attention  # trainable
        # Keep text FFN frozen
        self.text_ffn = pretrained_layer.ffn  # FROZEN
        self.text_ffn.requires_grad_(False)
        # Add NEW image FFN (randomly initialized).
        # SwiGLU_FFN and RMSNorm are assumed to be defined elsewhere,
        # matching the pretrained model's FFN and norm types.
        self.image_ffn = SwiGLU_FFN(config.dim, config.dim * 4)  # trainable
        self.image_norm = RMSNorm(config.dim)  # trainable

    def forward(self, x, modality):
        # x: (batch, seq, dim); modality: (batch, seq), 0 = text, 1 = image
        # Shared attention (fine-tuned)
        h = self.attn(x) + x
        # Route each token to its modality-specific FFN
        is_text = modality == 0
        is_image = modality == 1
        out = torch.empty_like(h)
        out[is_text] = self.text_ffn(h[is_text]) + h[is_text]  # frozen
        out[is_image] = self.image_ffn(self.image_norm(h[is_image])) + h[is_image]  # trained
        return out
```
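To get a feel for how the frozen / fine-tuned / new split plays out in parameter counts, here is a back-of-the-envelope sketch. It assumes the publicly documented LLaMA-3 8B shape (hidden size 4096, 32 layers, FFN hidden size 14336, 32 attention heads with 8 key/value heads, vocabulary 128256, untied embeddings) and an image FFN hidden size of 4x the model dimension, as in the layer code above. `param_budget` is a hypothetical helper; norms, the image input projection, and the noise head are ignored because they are comparatively small.

```python
def param_budget(dim, n_layers, ffn_hidden, n_heads, n_kv_heads, vocab, image_ffn_hidden):
    """Rough parameter counts per status (norms and small projections ignored)."""
    head_dim = dim // n_heads
    kv_dim = head_dim * n_kv_heads
    # Attention: Q and output are dim x dim; K and V are dim x kv_dim
    # (grouped-query attention).
    attn = n_layers * (2 * dim * dim + 2 * dim * kv_dim)
    # A SwiGLU FFN has three weight matrices: gate, up, and down.
    text_ffn = n_layers * 3 * dim * ffn_hidden
    image_ffn = n_layers * 3 * dim * image_ffn_hidden
    # Untied input embedding plus output head.
    embed_and_head = 2 * vocab * dim
    return {
        "frozen": text_ffn + embed_and_head,
        "fine_tuned": attn,
        "new": image_ffn,
    }


budget = param_budget(
    dim=4096, n_layers=32, ffn_hidden=14336,
    n_heads=32, n_kv_heads=8, vocab=128256,
    image_ffn_hidden=4 * 4096,  # assumption taken from the layer sketch
)
for status, count in budget.items():
    print(f"{status:>10}: {count / 1e9:.2f}B")
```

Under these assumptions, roughly 6.69B parameters stay frozen, 1.34B attention parameters are fine-tuned, and about 6.44B new image-FFN parameters are added, so the new image path is far from a lightweight adapter.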
[Interactive diagram: parameters that are frozen (blue), fine-tuned (purple), and newly added (orange). The pretrained LLM's text path is mostly frozen.]
LMFusion uses a multi-stage training strategy that gradually introduces image capability while monitoring text preservation.
[Interactive walkthrough: LMFusion's training stages and the parameters active at each stage.]
LMFusion doesn't unfreeze all attention layers at once. It uses progressive unfreezing: first the top layers (closest to output), then gradually lower layers. This protects the low-level representations that are most important for text quality.
Why does layer order matter? In transformers, lower layers tend to learn general features (syntax, token relationships) while upper layers tend to learn task-specific features. Image generation mostly needs task-specific adaptation in the upper layers, while the general features in the lower layers should be preserved.
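A top-down unfreezing schedule can be sketched in a few lines. The exact schedule isn't specified here, so this example assumes an even, linear schedule in which each phase unfreezes an equal share of the remaining layers; `unfrozen_layers` is a hypothetical helper, with layer 0 as the bottom layer (closest to the input).

```python
from typing import List


def unfrozen_layers(num_layers: int, phase: int, num_phases: int) -> List[int]:
    """Indices of the attention layers that are trainable at a given phase.

    Phases are 1-based. Layers are unfrozen from the top (highest index)
    downward, so the lowest layers stay frozen the longest.
    """
    if not 1 <= phase <= num_phases:
        raise ValueError("phase must be between 1 and num_phases")
    # Integer ceiling of num_layers * phase / num_phases.
    count = (num_layers * phase + num_phases - 1) // num_phases
    return list(range(num_layers - count, num_layers))


# Example: a 32-layer model unfrozen over 4 phases.
for phase in range(1, 5):
    layers = unfrozen_layers(32, phase, 4)
    print(f"phase {phase}: layers {layers[0]}..{layers[-1]} trainable")
```

In a real training loop you would use these indices to call `requires_grad_(True)` on the matching attention blocks at the start of each phase and rebuild the optimizer's parameter list.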
[Interactive diagram: layers unfrozen at each training phase. Lower layers stay frozen longer to protect foundational text representations.]
LMFusion monitors text quality throughout training and introduces several techniques to prevent degradation beyond acceptable thresholds.
Even during image training, LMFusion includes text-only data in every batch (typically 30-50% text). This acts as a regularizer, continuously reminding the model of text patterns and preventing drift.
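Mixing text into every batch is easy to sketch. The helper below is a hypothetical illustration rather than LMFusion's data loader: it draws a fixed fraction of each batch from a text pool and the rest from an image pool, then shuffles. The 40% text fraction is an assumed value inside the 30-50% range mentioned above.

```python
import random


def mixed_batch(text_pool, image_pool, batch_size, text_fraction, rng):
    """Sample one batch with a fixed share of text-only examples."""
    n_text = round(batch_size * text_fraction)
    n_image = batch_size - n_text
    batch = rng.sample(text_pool, n_text) + rng.sample(image_pool, n_image)
    rng.shuffle(batch)  # avoid a fixed text-then-image ordering
    return batch


# Toy pools: each example is tagged with its modality.
text_pool = [("text", i) for i in range(100)]
image_pool = [("image", i) for i in range(100)]

rng = random.Random(0)
batch = mixed_batch(text_pool, image_pool, batch_size=10, text_fraction=0.4, rng=rng)
n_text_in_batch = sum(1 for modality, _ in batch if modality == "text")
print(f"{n_text_in_batch} text and {len(batch) - n_text_in_batch} image examples")
```

Fixing the ratio per batch, rather than sampling modalities independently per example, keeps the regularizing text signal present in every gradient step.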
| Training Stage | Text PPL (↓) | MMLU (↑) | Image FID (↓) |
|---|---|---|---|
| Pretrained LLM | 7.8 | 65.2 | N/A |
| After Stage 1 | 7.8 | 65.2 | 45.0 |
| After Stage 2 | 8.1 | 63.8 | 8.5 |
| After Stage 3 | 8.2 | 63.1 | 7.2 |
| Naive fine-tuning | 12.4 | 48.3 | 7.0 |
Compare text quality retention between LMFusion and naive fine-tuning throughout training.
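The table is easier to compare once each row is expressed relative to the pretrained baseline. This short sketch uses only the numbers from the table above; `degradation` is a hypothetical helper name.

```python
# Values copied from the table above.
baseline = {"ppl": 7.8, "mmlu": 65.2}
runs = {
    "LMFusion (after Stage 3)": {"ppl": 8.2, "mmlu": 63.1},
    "Naive fine-tuning": {"ppl": 12.4, "mmlu": 48.3},
}


def degradation(run, base):
    """Relative perplexity increase (%) and absolute MMLU drop (points)."""
    ppl_increase_pct = 100.0 * (run["ppl"] - base["ppl"]) / base["ppl"]
    mmlu_drop = base["mmlu"] - run["mmlu"]
    return round(ppl_increase_pct, 1), round(mmlu_drop, 1)


results = {name: degradation(run, baseline) for name, run in runs.items()}
for name, (ppl_pct, mmlu_drop) in results.items():
    print(f"{name}: perplexity +{ppl_pct}%, MMLU -{mmlu_drop} points")
```

By these figures LMFusion ends with about 5.1% higher perplexity and 2.1 MMLU points lost, while naive fine-tuning ends with about 59% higher perplexity and 16.9 points lost.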
LMFusion demonstrates that adapted LLMs can compete with models trained from scratch, at a fraction of the cost.
| Model | Training Cost | Image FID ↓ | Text MMLU ↑ | Approach |
|---|---|---|---|---|
| Chameleon 7B | Full (4.4T tokens) | ~24 | ~47 | From scratch |
| Transfusion 7B | Full (2T tokens) | ~6.8 | ~58 | From scratch |
| LMFusion 8B | ~500B tokens | ~7.2 | ~63 | Adapted LLaMA-3 |
Compare LMFusion against models trained from scratch on both text and image quality.
LMFusion suggests that multimodal models need not be trained entirely from scratch. Instead, we can build on the massive investment already made in text-only LLMs.
| Approach | Text Quality | Image Quality | Cost |
|---|---|---|---|
| From scratch (Chameleon) | Moderate | Low | Very high |
| From scratch (Transfusion) | Good | Good | High |
| Adaptation (LMFusion) | Excellent (largely preserved) | Good | Low |
Explore the tradeoffs between training from scratch and adaptation approaches.