LMFusion (Shi et al. 2024)

Chapter 0: The Adaptation Question

You have a powerful text-only LLM — let's say LLaMA-3 with 8B parameters, trained on trillions of text tokens. It's excellent at language. Now you want it to also generate images. The naive approach: train a new model from scratch on mixed text+image data (like Chameleon or Transfusion).

But training from scratch is wasteful. Your LLM already knows language — grammar, facts, reasoning. Why throw away trillions of tokens of text learning just to add image capability?

LMFusion's question: Can we adapt a pretrained text-only LLM to generate images, preserving its text abilities while adding multimodal generation? The answer is yes, but how you do it matters enormously. Naive fine-tuning destroys text quality (catastrophic forgetting). LMFusion's careful adaptation strategy retains about 97% of the pretrained model's MMLU score (see Chapter 5) while achieving competitive image generation.

Think of it as teaching a novelist to paint. You don't want them to forget how to write — you want them to add painting to their repertoire. The challenge is adding the new skill without overwriting the old one.

From Scratch vs Adaptation

Compare the cost and outcome of training from scratch vs adapting a pretrained LLM. Adaptation reuses the expensive text pretraining.

What is the key challenge in adapting a text-only LLM for image generation?

Preserving the model's existing text abilities while adding image generation. Naive fine-tuning on image data causes catastrophic forgetting, destroying the text knowledge that took trillions of tokens to learn
Making the model larger
Finding enough image training data

Chapter 1: Why Not From Scratch?

Training a multimodal model from scratch (Chameleon, Transfusion) requires relearning everything the LLM already knows. The compute cost is enormous:

Approach	Text Training Cost	Image Training Cost	Total
From scratch	~4T tokens (~$10M+)	~1T tokens (~$3M+)	~$13M+
LMFusion	0 (reuse pretrained)	~500B tokens (~$1.5M)	~$1.5M

That's roughly a 9x cost reduction. And the pretrained LLM's text abilities are typically better than what you'd get from mixed-modal training (because it was trained on more text data and didn't have to "share" capacity with images).
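
The ratio follows directly from the table. A quick check, using only the approximate dollar figures quoted above (rough estimates, not measured costs):

```python
# Approximate costs from the table above, in millions of USD.
from_scratch_text = 10.0
from_scratch_image = 3.0
lmfusion_image = 1.5  # text pretraining is reused, so its cost is 0 here

from_scratch_total = from_scratch_text + from_scratch_image
cost_ratio = from_scratch_total / lmfusion_image

print(f"From scratch: ~${from_scratch_total:.1f}M")
print(f"LMFusion:     ~${lmfusion_image:.1f}M")
print(f"Reduction:    ~{cost_ratio:.1f}x")
```

The result is a little under 9x, which is why the text says "roughly".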

The catastrophic forgetting problem

Simply fine-tuning a pretrained LLM on image data destroys its text abilities. The model's weights shift to accommodate image patterns, overwriting the carefully learned text representations. This is catastrophic forgetting — a well-known problem in continual learning.

LMFusion's solution: Don't fine-tune the entire model. Instead, add new modality-specific parameters (image input/output projections and a separate image FFN in each layer) and carefully control which existing parameters are updated. The text-specific parameters (embeddings, FFN, output head) stay frozen; only the shared attention and the new image parameters are trained.

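python
# Pseudocode: load_pretrained, image_data, optimizer, image_input_proj, and
# image_output_head are placeholders, not a real library API.

# Naive adaptation: catastrophic forgetting
model = load_pretrained("llama-3-8b")
for batch in image_data:
    optimizer.zero_grad()  # clear gradients from the previous step
    loss = model(batch)    # training on images only
    loss.backward()        # gradients flow to ALL weights
    optimizer.step()       # text knowledge is overwritten

# LMFusion: careful adaptation
model = load_pretrained("llama-3-8b")
for name, param in model.named_parameters():
    if "ffn" in name:      # protect text-specific params (name pattern is an assumption)
        param.requires_grad_(False)
model.image_input_proj = image_input_proj    # new image params
model.image_output_head = image_output_head  # new image params
# Only update: attention (shared) + new image params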
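
The freeze-and-add step above can also be sketched without any framework, by classifying parameters by name. This is a minimal illustration: the parameter names and the `FROZEN_KEYWORDS` list are hypothetical, and a real checkpoint will use different names.

```python
# Hypothetical parameter names; a real checkpoint will use different ones.
PARAM_NAMES = [
    "layers.0.attention.wq",
    "layers.0.text_ffn.w1",
    "layers.0.image_ffn.w1",
    "image_input_proj.weight",
    "image_output_head.weight",
]

# Substrings marking text-specific parameters that should stay frozen (assumed).
FROZEN_KEYWORDS = ("text_ffn",)


def plan_updates(param_names, frozen_keywords=FROZEN_KEYWORDS):
    """Map each parameter name to True (trainable) or False (frozen)."""
    return {
        name: not any(key in name for key in frozen_keywords)
        for name in param_names
    }


plan = plan_updates(PARAM_NAMES)
for name, trainable in plan.items():
    print(f"{'train ' if trainable else 'freeze'}  {name}")
```

Building the plan as a plain dictionary first makes it easy to inspect which parameters would change before any training starts.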

Catastrophic Forgetting Simulator

Watch how naive fine-tuning destroys text quality while LMFusion preserves it. The blue bar is text quality; orange is image quality.

How does LMFusion avoid catastrophic forgetting?

By freezing text-specific parameters (FFN layers), adding new image-specific parameters, and updating only the shared attention layers and the new image parameters. This protects the text knowledge while the model learns image generation through new and shared parameters
By training on more text data simultaneously
By using a smaller learning rate

Chapter 2: Architecture Design

LMFusion's architecture follows the Transfusion paradigm (autoregressive text plus diffusion images) but applies it to an existing LLM backbone. The key design decisions determine which parameters are new, which are shared, and which are frozen.

Parameter categories

Component	Status	Why
Text embedding	Frozen	Already excellent from pretraining
Image input projection	New (trained)	Maps continuous patches to hidden dim
Attention QKV + output	Shared (fine-tuned)	Cross-modal reasoning requires adaptation
Text FFN	Frozen	Protects text-specific features
Image FFN	New (trained)	MoT-style separate FFN for images
Text output head	Frozen	Text generation unchanged
Image noise head	New (trained)	Predicts diffusion noise for images

The MoT connection: LMFusion borrows MoT's insight about separate FFN per modality. The existing text FFN is kept frozen (protecting text knowledge), and a new image FFN is added at every layer. This gives the model dedicated capacity for image features without disturbing text features.

python
# LMFusion layer structure (sketch). SwiGLU_FFN and RMSNorm are assumed to be
# defined elsewhere; pretrained_layer.attention / .ffn are placeholder names.
import torch
import torch.nn as nn

class LMFusionLayer(nn.Module):
    def __init__(self, pretrained_layer, config):
        super().__init__()  # required before registering submodules

        # Reuse pretrained attention (will be fine-tuned)
        self.attn = pretrained_layer.attention  # trainable

        # Keep text FFN frozen
        self.text_ffn = pretrained_layer.ffn    # FROZEN
        self.text_ffn.requires_grad_(False)

        # Add NEW image FFN (randomly initialized)
        self.image_ffn = SwiGLU_FFN(config.dim, config.dim * 4)  # trainable
        self.image_norm = RMSNorm(config.dim)  # trainable

    def forward(self, x, modality):
        # modality: integer tensor matching x's token dims (0 = text, 1 = image)
        # Shared attention (fine-tuned); the pretrained norms are omitted for brevity
        h = self.attn(x) + x

        # Route each token to its modality-specific FFN, with a residual connection
        is_text, is_image = modality == 0, modality == 1
        out = torch.empty_like(h)
        out[is_text] = self.text_ffn(h[is_text]) + h[is_text]    # frozen
        out[is_image] = self.image_ffn(self.image_norm(h[is_image])) + h[is_image]  # trained
        return out
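
The routing step in `forward` can be seen in isolation with plain Python lists. This toy sketch uses scalar "hidden states" and stand-in functions in place of the two FFNs; `route_by_modality`, `toy_text_ffn`, and `toy_image_ffn` are illustrative names, not part of any library.

```python
def route_by_modality(hidden, modality, text_fn, image_fn):
    """Apply text_fn to tokens tagged 0 and image_fn to tokens tagged 1.

    Token order is preserved, and each output includes a residual connection.
    """
    if len(hidden) != len(modality):
        raise ValueError("hidden and modality must have the same length")
    out = []
    for value, tag in zip(hidden, modality):
        if tag == 0:
            out.append(text_fn(value) + value)
        elif tag == 1:
            out.append(image_fn(value) + value)
        else:
            raise ValueError(f"unknown modality tag: {tag}")
    return out


def toy_text_ffn(value):
    """Stand-in for the frozen text FFN (not a real network)."""
    return 10 * value


def toy_image_ffn(value):
    """Stand-in for the new image FFN (not a real network)."""
    return -value


hidden = [1.0, 2.0, 3.0, 4.0]
modality = [0, 1, 1, 0]  # text, image, image, text
routed = route_by_modality(hidden, modality, toy_text_ffn, toy_image_ffn)
print(routed)
```

Each token passes through exactly one FFN, so adding the image FFN does not change how text tokens are processed.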

LMFusion Architecture

See which parameters are frozen (blue), fine-tuned (purple), and newly added (orange). The pretrained LLM's text path is mostly frozen.

Which components of the pretrained LLM does LMFusion freeze vs fine-tune?

Text embeddings, text FFN, and text output head are FROZEN. Attention is FINE-TUNED (shared). Image input projection, image FFN, and noise head are NEWLY ADDED. This combines MoT's separate-FFN insight with a preservation-first adaptation strategy
Everything is fine-tuned
Everything is frozen except the output head

Chapter 3: Training Strategy

LMFusion uses a multi-stage training strategy that gradually introduces image capability while monitoring text preservation.

Stage 1: Image Projection Warmup

Train only the new image input/output projections on image-caption pairs. Everything else frozen. Teaches the model to encode/decode image patches.

↓

Stage 2: Joint Training

Unfreeze shared attention + image FFN. Train on mixed text+image data. Text FFN stays frozen. Diffusion + AR losses.

↓

Stage 3: Alignment

Instruction tuning on multimodal tasks. All trainable parameters updated with small learning rate.
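
Stage 2 combines an autoregressive (AR) loss on text with a diffusion loss on images. A minimal sketch in plain Python, assuming the two terms are added with a weighting coefficient `lam`; the weight and its default value are assumptions, not figures from the source:

```python
import math


def ar_loss(correct_token_probs):
    """Mean negative log-likelihood of the correct next tokens."""
    total = sum(math.log(p) for p in correct_token_probs)
    return -total / len(correct_token_probs)


def diffusion_loss(pred_noise, true_noise):
    """Mean squared error between predicted and actual noise."""
    squared = [(p - t) ** 2 for p, t in zip(pred_noise, true_noise)]
    return sum(squared) / len(squared)


def joint_loss(correct_token_probs, pred_noise, true_noise, lam=1.0):
    """AR loss on text tokens plus lam times diffusion loss on image patches."""
    return ar_loss(correct_token_probs) + lam * diffusion_loss(pred_noise, true_noise)


loss = joint_loss([0.9, 0.7, 0.8], pred_noise=[0.1, -0.2], true_noise=[0.0, 0.0])
print(f"joint loss: {loss:.4f}")
```

Because the text FFN is frozen, the AR term can only move the shared attention weights, while the diffusion term also trains the image-specific parameters.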

Why staged training? If you unfreeze everything from the start, the randomly-initialized image parameters produce random gradients that corrupt the pretrained text weights. Stage 1 "warms up" the image parameters so they produce meaningful gradients before the shared attention is unfrozen.
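
The three stages can be written down as a lookup from stage name to the parameter groups that receive updates. The group and stage names below are hypothetical labels for the components listed in Chapter 2; this is a sketch of the schedule as described here, not the authors' configuration.

```python
# Hypothetical labels for the parameter groups described in Chapter 2.
NEW_IMAGE_IO = frozenset({"image_input_proj", "image_noise_head"})
FROZEN_TEXT = frozenset({"text_embedding", "text_ffn", "text_output_head"})

STAGE_TRAINABLE = {
    # Stage 1: only the new image input/output projections are trained.
    "warmup": NEW_IMAGE_IO,
    # Stage 2: shared attention and the image FFN are unfrozen as well.
    "joint": NEW_IMAGE_IO | {"attention", "image_ffn"},
    # Stage 3: the same trainable set, updated with a small learning rate.
    "alignment": NEW_IMAGE_IO | {"attention", "image_ffn"},
}


def is_trainable(group, stage):
    """Return True if the parameter group is updated during the given stage."""
    if stage not in STAGE_TRAINABLE:
        raise KeyError(f"unknown stage: {stage}")
    return group in STAGE_TRAINABLE[stage]


for stage in STAGE_TRAINABLE:
    print(stage, sorted(STAGE_TRAINABLE[stage]))
```

Note that the text-specific groups never appear in any stage, which is what keeps them frozen throughout.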

Training Stages

Step through LMFusion's training stages. Watch which parameters are active at each stage.

Why does LMFusion warm up image projections before unfreezing shared attention?

Because randomly-initialized image parameters produce random gradients that would corrupt pretrained text weights if shared attention were unfrozen immediately. Warming up first ensures image parameters produce meaningful gradients before they influence shared parameters
Because warmup is always faster
Because the image data needs preprocessing

Chapter 4: Progressive Unfreezing

LMFusion doesn't unfreeze all attention layers at once. It uses progressive unfreezing: first the top layers (closest to output), then gradually lower layers. This protects the low-level representations that are most important for text quality.

Why does layer order matter? In transformers, lower layers learn general features (syntax, token relationships) while upper layers learn task-specific features. Image generation mostly needs task-specific adaptations (upper layers), while the general features (lower layers) should be preserved.

Ablation finding: Unfreezing only the top 50% of attention layers preserves 98% of text quality while achieving 90% of the image generation quality of full unfreezing. This confirms that image adaptation primarily needs upper-layer modifications.
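
A top-down unfreezing schedule reduces to choosing which layer indices are trainable at each phase. The sketch below assumes a 32-layer backbone (the depth of LLaMA-3 8B) and a hypothetical three-phase schedule; the phase fractions are illustrative, not values from the source.

```python
def unfrozen_layers(num_layers, fraction):
    """Indices of attention layers to unfreeze, counted down from the top.

    Layer 0 is the bottom (closest to the input); num_layers - 1 is the top.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    count = int(num_layers * fraction)
    return list(range(num_layers - count, num_layers))


NUM_LAYERS = 32
# Hypothetical schedule: fraction of top layers unfrozen in each phase.
PHASES = {"phase 1": 0.25, "phase 2": 0.5, "phase 3": 1.0}

for phase, fraction in PHASES.items():
    layers = unfrozen_layers(NUM_LAYERS, fraction)
    print(f"{phase}: layers {layers[0]}-{layers[-1]} ({len(layers)} unfrozen)")
```

With `fraction=0.5`, layers 16 through 31 are trainable and layers 0 through 15 stay frozen, which corresponds to the "top 50%" setting in the ablation.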

Progressive Unfreezing Schedule

Drag the slider to see which layers are unfrozen at each training phase. Lower layers (bottom) stay frozen longer to protect foundational text representations.

Why does LMFusion unfreeze top layers first?

Because lower transformer layers learn general features (syntax, token relationships) crucial for text quality, while upper layers learn task-specific features. Image generation primarily needs upper-layer adaptation, so unfreezing from the top preserves foundational text representations
Because top layers are smaller
Because bottom layers don't matter

Chapter 5: Text Preservation

LMFusion monitors text quality throughout training and introduces several techniques to prevent degradation beyond acceptable thresholds.

Data mixing during adaptation

Even during image training, LMFusion includes text-only data in every batch (typically 30-50% text). This acts as a regularizer, continuously reminding the model of text patterns and preventing drift.
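
Fixed-ratio mixing can be sketched as a batch builder that draws a set share of each batch from a text pool. The function name, the pools, and the 40% ratio below are illustrative assumptions; the ratio is simply one value inside the 30-50% range mentioned above.

```python
import random


def mixed_batch(text_pool, image_pool, batch_size, text_fraction, rng):
    """Build one batch with a fixed share of text-only examples."""
    if not 0.0 <= text_fraction <= 1.0:
        raise ValueError("text_fraction must be between 0 and 1")
    n_text = round(batch_size * text_fraction)
    # Sample with replacement so small pools still fill the batch.
    batch = [("text", s) for s in rng.choices(text_pool, k=n_text)]
    batch += [("image", s) for s in rng.choices(image_pool, k=batch_size - n_text)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)  # seeded for reproducibility
batch = mixed_batch(
    text_pool=["t1", "t2", "t3"],
    image_pool=["i1", "i2"],
    batch_size=10,
    text_fraction=0.4,
    rng=rng,
)
text_count = sum(1 for kind, _ in batch if kind == "text")
print(f"{text_count} text / {len(batch) - text_count} image examples")
```

Enforcing the ratio per batch, rather than only on average over the dataset, means every gradient step includes a text signal.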

Text quality metrics throughout training

Training Stage	Text PPL (↓)	MMLU (↑)	Image FID (↓)
Pretrained LLM	7.8	65.2	N/A
After Stage 1	7.8	65.2	45.0
After Stage 2	8.1	63.8	8.5
After Stage 3	8.2	63.1	7.2
Naive fine-tuning	12.4	48.3	7.0

The preservation result: LMFusion retains 97% of MMLU performance (63.1 vs 65.2) and raises text perplexity by only about 5% (8.2 vs 7.8). Naive fine-tuning loses 26% of MMLU and degrades perplexity by 59%. The careful freeze/unfreeze strategy is the difference between useful adaptation and catastrophic forgetting.
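
These percentages can be recomputed from the table rows. The snippet below uses only the values listed above; the helper names are illustrative.

```python
# Values copied from the table above.
PRETRAINED = {"ppl": 7.8, "mmlu": 65.2}
LMFUSION = {"ppl": 8.2, "mmlu": 63.1}  # after Stage 3
NAIVE = {"ppl": 12.4, "mmlu": 48.3}


def mmlu_retention(model, baseline=PRETRAINED):
    """Fraction of the baseline MMLU score that the model keeps."""
    return model["mmlu"] / baseline["mmlu"]


def ppl_increase(model, baseline=PRETRAINED):
    """Relative increase in perplexity over the baseline (lower is better)."""
    return model["ppl"] / baseline["ppl"] - 1.0


for name, model in [("LMFusion", LMFUSION), ("Naive fine-tuning", NAIVE)]:
    print(
        f"{name}: MMLU retained {mmlu_retention(model):.1%}, "
        f"perplexity up {ppl_increase(model):.1%}"
    )
```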

Text Preservation Monitor

Compare text quality retention between LMFusion and naive fine-tuning throughout training.

How does LMFusion achieve 97% text quality retention?

Through a combination of frozen text-specific parameters (FFN, embeddings, output head), staged training with progressive unfreezing, and continuous text data mixing (30-50% text in every batch) that acts as a regularizer against drift
By using a very small learning rate
By training on more text data than images

Chapter 6: Results & Showcase

LMFusion demonstrates that adapted LLMs can compete with models trained from scratch, at a fraction of the cost.

Model	Training Cost	Image FID ↓	Text MMLU ↑	Approach
Chameleon 7B	Full (4.4T tokens)	~24	~47	From scratch
Transfusion 7B	Full (2T tokens)	~6.8	~58	From scratch
LMFusion 8B	~500B tokens	~7.2	~63	Adapted LLaMA-3

The efficiency story: LMFusion comes close to Transfusion's image quality (FID 7.2 vs 6.8) while having significantly better text quality (MMLU 63 vs 58), using about 4x fewer training tokens than Transfusion (~500B vs 2T). The pretrained LLM's text knowledge is genuinely preserved and valuable.

LMFusion vs From-Scratch Models

Compare LMFusion against models trained from scratch on both text and image quality.

What makes LMFusion's results practically significant?

It matches from-scratch models on image generation while having BETTER text quality (because it preserves a strong pretrained LLM) at 4x less training compute, proving that adaptation is more efficient than retraining
It uses fewer parameters
It requires no image data

Chapter 7: Connections

LMFusion demonstrates that the future of multimodal AI may not require training everything from scratch. Instead, we can build on the massive investment in text-only LLMs.

Approach	Text Quality	Image Quality	Cost
From scratch (Chameleon)	Moderate	Low	Very high
From scratch (Transfusion)	Good	Good	High
Adaptation (LMFusion)	Excellent (preserved)	Good	Low

Lesson 1: Adaptation beats retraining. When a strong pretrained model exists, adapting it is more efficient than training from scratch. The pretrained knowledge is a valuable asset, not dead weight.

Lesson 2: MoT-style separation is key for adaptation. Keeping text FFN frozen while adding new image FFN is the critical technique that enables text preservation. This validates MoT's insight that modality-specific FFNs are important.

Lesson 3: Staged training prevents corruption. The warmup → joint → alignment pipeline ensures that new parameters are meaningful before they interact with pretrained ones.

The Adaptation Paradigm

Explore the tradeoffs between training from scratch and adaptation approaches.

What is LMFusion's most important contribution?

Proving that pretrained text LLMs can be adapted for multimodal generation with 97% text quality retention and competitive image quality at 4x less compute, establishing adaptation as a viable alternative to training from scratch
A new image tokenizer
Faster inference speed
