Shi, Han, Zhou et al. (Meta), 2024

LMFusion: Adapting LLMs for Multimodal Generation

Adapting Pretrained Language Models for Multimodal Generation — take an existing text-only LLM and add image generation capability without retraining from scratch.

Prerequisites: LLMs + Diffusion basics + Transfusion concepts. That's it.

Chapter 0: The Adaptation Question

You have a powerful text-only LLM — let's say LLaMA-3 with 8B parameters, trained on trillions of text tokens. It's excellent at language. Now you want it to also generate images. The naive approach: train a new model from scratch on mixed text+image data (like Chameleon or Transfusion).

But training from scratch is wasteful. Your LLM already knows language — grammar, facts, reasoning. Why throw away trillions of tokens of text learning just to add image capability?

LMFusion's question: Can we adapt a pretrained text-only LLM to generate images, preserving its text abilities while adding multimodal generation? The answer is yes — but the HOW matters enormously. Naive fine-tuning destroys text quality (catastrophic forgetting). LMFusion's careful adaptation strategy preserves 95%+ of text performance while achieving competitive image generation.

Think of it as teaching a novelist to paint. You don't want them to forget how to write — you want them to add painting to their repertoire. The challenge is adding the new skill without overwriting the old one.

From Scratch vs Adaptation

Compare the cost and outcome of training from scratch vs adapting a pretrained LLM. Adaptation reuses the expensive text pretraining.

What is the key challenge in adapting a text-only LLM for image generation?

Chapter 1: Why Not From Scratch?

Training a multimodal model from scratch (Chameleon, Transfusion) requires relearning everything the LLM already knows. The compute cost is enormous:

| Approach | Text Training Cost | Image Training Cost | Total |
|---|---|---|---|
| From scratch | ~4T tokens (~$10M+) | ~1T tokens (~$3M+) | ~$13M+ |
| LMFusion | 0 (reuse pretrained) | ~500B tokens (~$1.5M) | ~$1.5M |

That's roughly a 9x cost reduction. And the pretrained LLM's text abilities are typically better than what you'd get from mixed-modal training (because it was trained on more text data and didn't have to "share" capacity with images).
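
The "roughly 9x" figure can be checked directly from the table. The short sketch below uses only the approximate dollar amounts quoted there; the variable names are illustrative.

```python
# Approximate costs from the table above, in millions of dollars.
from_scratch_text = 10.0
from_scratch_image = 3.0
lmfusion_text = 0.0    # the pretrained LLM is reused, so no new text pretraining
lmfusion_image = 1.5

from_scratch_total = from_scratch_text + from_scratch_image
lmfusion_total = lmfusion_text + lmfusion_image

ratio = from_scratch_total / lmfusion_total
print(f"From scratch: ~${from_scratch_total:.1f}M, LMFusion: ~${lmfusion_total:.1f}M")
print(f"Cost reduction: ~{ratio:.1f}x")
```

The ratio comes out at about 8.7, which the text rounds up to 9x.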

The catastrophic forgetting problem

Simply fine-tuning a pretrained LLM on image data destroys its text abilities. The model's weights shift to accommodate image patterns, overwriting the carefully learned text representations. This is catastrophic forgetting — a well-known problem in continual learning.

LMFusion's solution: Don't fine-tune the entire model. Instead, add new modality-specific parameters (image input/output projections, lightweight adapters) and carefully control which existing parameters are updated. The text-specific parameters are mostly frozen; only the shared attention and new image parameters are trained.
python
# Sketch only: load_pretrained, image_data, optimizer, model.freeze and model.add
# are placeholders for illustration, not a real library API.

# Naive adaptation: catastrophic forgetting
model = load_pretrained("llama-3-8b")
for batch in image_data:
    optimizer.zero_grad()  # clear gradients from the previous step
    loss = model(batch)    # training on images only
    loss.backward()        # gradients flow into ALL weights
    optimizer.step()       # text knowledge is overwritten!

# LMFusion: careful adaptation
model = load_pretrained("llama-3-8b")
model.freeze(model.text_ffn)       # protect text-specific params
model.add(image_input_proj)        # new image params
model.add(image_output_head)       # new image params
# Only update: attention (shared) + new image params
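
To make the freeze-and-extend idea concrete, here is a minimal PyTorch sketch on a toy module. `ToyBlock`, the layer sizes, and the attribute names are hypothetical stand-ins, not LMFusion's actual code; the point is only that frozen parameters receive no gradient and are never handed to the optimizer.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Hypothetical stand-in for one pretrained transformer layer."""
    def __init__(self, dim=16):
        super().__init__()
        self.attn = nn.Linear(dim, dim)      # stands in for the shared attention
        self.text_ffn = nn.Linear(dim, dim)  # stands in for the pretrained text FFN

def count_params(module, trainable_only=False):
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)

torch.manual_seed(0)
block = ToyBlock()

# Protect the text-specific parameters.
block.text_ffn.requires_grad_(False)

# Add new image-specific parameters (trainable by default).
block.image_ffn = nn.Linear(16, 16)

# Hand the optimizer only the parameters that are allowed to move.
trainable = [p for p in block.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

text_weight_before = block.text_ffn.weight.clone()

x = torch.randn(4, 16)
h = block.attn(x)
loss = (block.text_ffn(h) ** 2).mean() + (block.image_ffn(h) ** 2).mean()
loss.backward()
optimizer.step()

text_ffn_unchanged = torch.equal(block.text_ffn.weight, text_weight_before)
print("total params:", count_params(block))
print("trainable params:", count_params(block, trainable_only=True))
print("text FFN unchanged:", text_ffn_unchanged)
```

Note that the frozen text FFN still takes part in the forward pass and still passes gradients back to the shared attention; it simply never changes itself.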
Catastrophic Forgetting Simulator

Watch how naive fine-tuning destroys text quality while LMFusion preserves it. The blue bar is text quality; orange is image quality.

How does LMFusion avoid catastrophic forgetting?

Chapter 2: Architecture Design

LMFusion's architecture extends the Transfusion paradigm (AR text + diffusion images) but applied to an existing LLM backbone. The key design decisions determine which parameters are new, which are shared, and which are frozen.
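
Because LMFusion builds on the Transfusion paradigm, its objective can be sketched as a language-modeling loss on text tokens plus a weighted diffusion loss on image patches. This form is carried over from Transfusion as an assumption rather than a detail stated in this tutorial, and the weight $\lambda$ and the notation are illustrative:

```latex
\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{DDPM}},
\qquad
\mathcal{L}_{\text{LM}} = -\,\mathbb{E}\big[\log p_\theta(y_i \mid y_{<i})\big],
\qquad
\mathcal{L}_{\text{DDPM}} = \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\big]
```

Here $y_i$ is the $i$-th text token, $x_t$ is a noised image latent at diffusion step $t$, $c$ is the surrounding context, $\epsilon$ is the noise that was added, and $\epsilon_\theta$ is the model's noise prediction.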

Parameter categories

| Component | Status | Why |
|---|---|---|
| Text embedding | Frozen | Already excellent from pretraining |
| Image input projection | New (trained) | Maps continuous patches to hidden dim |
| Attention QKV + output | Shared (fine-tuned) | Cross-modal reasoning requires adaptation |
| Text FFN | Frozen | Protects text-specific features |
| Image FFN | New (trained) | MoT-style separate FFN for images |
| Text output head | Frozen | Text generation unchanged |
| Image noise head | New (trained) | Predicts diffusion noise for images |

The MoT connection: LMFusion borrows the insight from MoT (Mixture-of-Transformers) that each modality should get its own FFN. The existing text FFN is kept frozen (protecting text knowledge), and a new image FFN is added at every layer. This gives the model dedicated capacity for image features without disturbing text features.
python
import torch
import torch.nn as nn

# LMFusion layer structure.
# SwiGLU_FFN and RMSNorm are assumed to be defined elsewhere (e.g. in the LLM codebase).
class LMFusionLayer(nn.Module):
    def __init__(self, pretrained_layer, config):
        super().__init__()  # must run before submodules are assigned

        # Reuse pretrained attention (will be fine-tuned)
        self.attn = pretrained_layer.attention  # trainable

        # Keep text FFN frozen
        self.text_ffn = pretrained_layer.ffn    # FROZEN
        self.text_ffn.requires_grad_(False)

        # Add NEW image FFN (randomly initialized)
        self.image_ffn = SwiGLU_FFN(config.dim, config.dim * 4)  # trainable
        self.image_norm = RMSNorm(config.dim)  # trainable

    def forward(self, x, modality):
        # modality: integer tensor shaped like x minus the hidden dim (0 = text, 1 = image)
        is_text, is_image = modality == 0, modality == 1

        # Shared attention (fine-tuned)
        h = self.attn(x) + x

        # Route each token to its modality-specific FFN
        out = torch.empty_like(h)
        out[is_text] = self.text_ffn(h[is_text]) + h[is_text]    # frozen
        out[is_image] = self.image_ffn(self.image_norm(h[is_image])) + h[is_image]  # trained
        return out
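
The routing step is easier to follow on a tiny tensor. The sketch below is self-contained and swaps the real FFNs for plain `nn.Linear` layers (an assumption made purely to keep it short); it shows that a boolean modality mask sends each token through exactly one FFN and that only the image FFN receives gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8
text_ffn = nn.Linear(dim, dim)   # toy stand-in for the pretrained text FFN
image_ffn = nn.Linear(dim, dim)  # toy stand-in for the new image FFN
text_ffn.requires_grad_(False)   # frozen, as in the layer above

h = torch.randn(2, 5, dim)                  # (batch, sequence, hidden)
modality = torch.tensor([[0, 0, 1, 1, 0],
                         [1, 1, 1, 0, 0]])  # 0 = text token, 1 = image token
is_text, is_image = modality == 0, modality == 1

# Boolean indexing flattens the selected tokens to (num_selected, hidden).
out = torch.empty_like(h)
out[is_text] = text_ffn(h[is_text]) + h[is_text]
out[is_image] = image_ffn(h[is_image]) + h[is_image]

out.sum().backward()
print("text tokens:", h[is_text].shape[0], "image tokens:", h[is_image].shape[0])
print("text FFN has grad:", text_ffn.weight.grad is not None)
print("image FFN has grad:", image_ffn.weight.grad is not None)
```

Masked assignment keeps the code simple; a production implementation might gather tokens per modality once per batch to avoid repeated indexing.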
LMFusion Architecture

See which parameters are frozen (blue), fine-tuned (purple), and newly added (orange). The pretrained LLM's text path is mostly frozen.

Which components of the pretrained LLM does LMFusion freeze vs fine-tune?

Chapter 3: Training Strategy

LMFusion uses a multi-stage training strategy that gradually introduces image capability while monitoring text preservation.

Stage 1: Image Projection Warmup
Train only the new image input/output projections on image-caption pairs, with everything else frozen. This teaches the model to encode and decode image patches.

Stage 2: Joint Training
Unfreeze the shared attention and the image FFN, and train on mixed text+image data with both the diffusion and AR losses. The text FFN stays frozen.

Stage 3: Alignment
Instruction-tune on multimodal tasks. All trainable parameters are updated with a small learning rate.

Why staged training? If you unfreeze everything from the start, the randomly initialized image parameters produce random gradients that corrupt the pretrained text weights. Stage 1 "warms up" the image parameters so they produce meaningful gradients before the shared attention is unfrozen.
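
One way to picture the stages is as a lookup from stage number to the parameter groups that receive updates. The sketch below is plain Python with hypothetical group names taken from the architecture table; it assumes Stage 3 trains the same groups as Stage 2 (only at a lower learning rate), which the description implies but does not spell out.

```python
# Parameter groups, named after the rows of the architecture table (hypothetical keys).
ALL_GROUPS = [
    "text_embedding", "image_input_proj", "attention",
    "text_ffn", "image_ffn", "text_output_head", "image_noise_head",
]

# Groups that receive gradient updates in each stage.
STAGE_TRAINABLE = {
    1: {"image_input_proj", "image_noise_head"},
    2: {"image_input_proj", "image_noise_head", "attention", "image_ffn"},
    3: {"image_input_proj", "image_noise_head", "attention", "image_ffn"},
}

def requires_grad_flags(stage):
    """Return {group_name: bool} saying whether each group is trained in this stage."""
    trainable = STAGE_TRAINABLE[stage]
    return {name: name in trainable for name in ALL_GROUPS}

for stage in (1, 2, 3):
    flags = requires_grad_flags(stage)
    active = [name for name, on in flags.items() if on]
    print(f"Stage {stage}: train {active}")
```

In a real training loop these flags would be applied with `requires_grad_` on the matching modules before building the optimizer for each stage.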
Training Stages

Step through LMFusion's training stages. Watch which parameters are active at each stage.

Why does LMFusion warm up image projections before unfreezing shared attention?

Chapter 4: Progressive Unfreezing

LMFusion doesn't unfreeze all attention layers at once. It uses progressive unfreezing: first the top layers (closest to output), then gradually lower layers. This protects the low-level representations that are most important for text quality.

Why does layer order matter? In transformers, lower layers learn general features (syntax, token relationships) while upper layers learn task-specific features. Image generation mostly needs task-specific adaptations (upper layers), while the general features (lower layers) should be preserved.

Ablation finding: Unfreezing only the top 50% of attention layers preserves 98% of text quality while achieving 90% of the image generation quality of full unfreezing. This confirms that image adaptation primarily needs upper-layer modifications.
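
A top-down unfreezing schedule can be sketched in a few lines. The function below is a hypothetical helper, not LMFusion's published schedule: it assumes a linear ramp across phases and a cap (`max_fraction`) on how much of the stack is ever unfrozen, with layer 0 as the lowest layer.

```python
def unfrozen_layers(num_layers, phase, num_phases, max_fraction=1.0):
    """Indices of attention layers that are trainable in a given phase.

    Layer 0 is the lowest layer; layer num_layers - 1 is closest to the output.
    Layers are always unfrozen from the top down.
    """
    if not 1 <= phase <= num_phases:
        raise ValueError("phase must be between 1 and num_phases")
    target = int(num_layers * max_fraction)  # layers unfrozen by the final phase
    count = (target * phase) // num_phases   # linear ramp across phases
    return list(range(num_layers - count, num_layers))

# Example: a 32-layer model, 4 phases, never unfreezing more than the top half.
for phase in (1, 2, 3, 4):
    layers = unfrozen_layers(num_layers=32, phase=phase, num_phases=4, max_fraction=0.5)
    print(f"Phase {phase}: layers {layers[0]}-{layers[-1]} ({len(layers)} unfrozen)")
```

With these settings, phase 1 trains only layers 28-31 and the final phase trains layers 16-31, matching the "top 50%" setting from the ablation.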
Progressive Unfreezing Schedule

Drag the slider to see which layers are unfrozen at each training phase. Lower layers (bottom) stay frozen longer to protect foundational text representations.

Why does LMFusion unfreeze top layers first?

Chapter 5: Text Preservation

LMFusion monitors text quality throughout training and introduces several techniques to prevent degradation beyond acceptable thresholds.

Data mixing during adaptation

Even during image training, LMFusion includes text-only data in every batch (typically 30-50% text). This acts as a regularizer, continuously reminding the model of text patterns and preventing drift.
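
A fixed text share per batch is straightforward to implement. The sketch below uses only the standard library; `mixed_batch`, the toy example pools, and the 40% text fraction (a value inside the 30-50% range mentioned above) are illustrative choices, not LMFusion's actual data loader.

```python
import random

def mixed_batch(text_pool, image_pool, batch_size, text_fraction, rng):
    """Build one batch with a fixed share of text-only examples."""
    n_text = round(batch_size * text_fraction)
    n_image = batch_size - n_text
    batch = rng.sample(text_pool, n_text) + rng.sample(image_pool, n_image)
    rng.shuffle(batch)  # interleave the two modalities
    return batch

rng = random.Random(0)
text_pool = [("text", i) for i in range(100)]
image_pool = [("image", i) for i in range(100)]

batch = mixed_batch(text_pool, image_pool, batch_size=10, text_fraction=0.4, rng=rng)
n_text = sum(1 for kind, _ in batch if kind == "text")
print(f"{n_text} text / {len(batch) - n_text} image examples")
```

Fixing the count per batch, rather than sampling the modality per example, guarantees that every gradient step sees some text.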

Text quality metrics throughout training

| Training Stage | Text PPL (↓) | MMLU (↑) | Image FID (↓) |
|---|---|---|---|
| Pretrained LLM | 7.8 | 65.2 | N/A |
| After Stage 1 | 7.8 | 65.2 | 45.0 |
| After Stage 2 | 8.1 | 63.8 | 8.5 |
| After Stage 3 | 8.2 | 63.1 | 7.2 |
| Naive fine-tuning | 12.4 | 48.3 | 7.0 |

The preservation result: LMFusion retains 97% of MMLU performance (63.1 vs 65.2) and limits the rise in text perplexity to about 5% (8.2 vs 7.8). Naive fine-tuning loses 26% of MMLU and degrades perplexity by 59%. The careful freeze/unfreeze strategy is the difference between useful adaptation and catastrophic forgetting.
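
The retention percentages follow directly from the table. The sketch below recomputes them from the table's numbers; the dictionary layout and function name are illustrative.

```python
# Numbers copied from the text-quality table above.
baseline = {"ppl": 7.8, "mmlu": 65.2}
runs = {
    "LMFusion (after Stage 3)": {"ppl": 8.2, "mmlu": 63.1},
    "Naive fine-tuning": {"ppl": 12.4, "mmlu": 48.3},
}

def retention(run):
    """Return (percent of MMLU retained, percent increase in perplexity)."""
    mmlu_retained = 100 * run["mmlu"] / baseline["mmlu"]
    ppl_increase = 100 * (run["ppl"] - baseline["ppl"]) / baseline["ppl"]
    return mmlu_retained, ppl_increase

for name, run in runs.items():
    kept, worse = retention(run)
    print(f"{name}: MMLU retained {kept:.1f}%, perplexity +{worse:.1f}%")
```

This gives 96.8% MMLU retained and +5.1% perplexity for LMFusion, against 74.1% retained (a 26% loss) and +59.0% for naive fine-tuning.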
Text Preservation Monitor

Compare text quality retention between LMFusion and naive fine-tuning throughout training.

How does LMFusion achieve 97% text quality retention?

Chapter 6: Results & Showcase

LMFusion demonstrates that adapted LLMs can compete with models trained from scratch, at a fraction of the cost.

| Model | Training Cost | Image FID ↓ | Text MMLU ↑ | Approach |
|---|---|---|---|---|
| Chameleon 7B | Full (4.4T tokens) | ~24 | ~47 | From scratch |
| Transfusion 7B | Full (2T tokens) | ~6.8 | ~58 | From scratch |
| LMFusion 8B | ~500B tokens | ~7.2 | ~63 | Adapted LLaMA-3 |

The efficiency story: LMFusion comes close to Transfusion's image quality (FID 7.2 vs 6.8) while having significantly better text quality (MMLU 63 vs 58), all using about 4x fewer training tokens (~500B vs 2T). The pretrained LLM's text knowledge is genuinely preserved and valuable.
LMFusion vs From-Scratch Models

Compare LMFusion against models trained from scratch on both text and image quality.

What makes LMFusion's results practically significant?

Chapter 7: Connections

LMFusion demonstrates that the future of multimodal AI may not require training everything from scratch. Instead, we can build on the massive investment in text-only LLMs.

| Approach | Text Quality | Image Quality | Cost |
|---|---|---|---|
| From scratch (Chameleon) | Moderate | Low | Very high |
| From scratch (Transfusion) | Good | Good | High |
| Adaptation (LMFusion) | Excellent (preserved) | Good | Low |

Lesson 1: Adaptation beats retraining. When a strong pretrained model exists, adapting it is more efficient than training from scratch. The pretrained knowledge is a valuable asset, not dead weight.

Lesson 2: MoT-style separation is key for adaptation. Keeping text FFN frozen while adding new image FFN is the critical technique that enables text preservation. This validates MoT's insight that modality-specific FFNs are important.

Lesson 3: Staged training prevents corruption. The warmup → joint → alignment pipeline ensures that new parameters are meaningful before they interact with pretrained ones.
The Adaptation Paradigm

Explore the tradeoffs between training from scratch and adaptation approaches.

What is LMFusion's most important contribution?