Aghajanyan, Yu, Conneau, Hsu et al. (Meta) — 2023

Scaling Laws for Mixed-Modal Models

Scaling Laws for Generative Mixed-Modal Language Models — power laws govern mixed text+image models, and the optimal modality mix ratio changes with scale.

Prerequisites: Language model training + Basic statistics + Log-log plots. That's it.

Chapter 0: The Mixing Question

You're training a mixed-modal model — one that handles both text and images. You have a fixed compute budget. You can spend it training on text, images, or a mix. How much of each should you use?

This isn't a trivial question. Training on 100% text gives you the best text model, but it can't handle images. Training on 100% images gives you... not much useful. The interesting question is: what's the optimal ratio? And does that ratio change as your model gets bigger?

| Mix Ratio (text:image) | Text Quality | Image Quality | Total |
|---|---|---|---|
| 100:0 | Best possible | None | ? |
| 70:30 | Good | Decent | ? |
| 50:50 | Moderate | Good | ? |
| 0:100 | None | Best possible | ? |

Before this paper, people tuned this ratio by trial and error. Train a model, evaluate, adjust the ratio, train again. Expensive and slow. Aghajanyan et al. discovered that scaling laws — the clean mathematical relationships between model size, data size, and loss — extend to mixed-modal models. This means you can predict the optimal ratio from small experiments.

The paper's core contribution: Mixed-modal loss follows power laws just like text-only models. From these power laws, you can derive the optimal text-to-image ratio for any given compute budget. The key finding: the optimal ratio shifts toward MORE image data as models get larger. Bigger models are better at absorbing visual information.

[Interactive figure: The Mixing Tradeoff. A slider sets the image percentage and shows how text quality, image quality, and total quality change.]

Why can't you simply split data 50/50 between text and images for every model?

Chapter 1: Scaling Laws Primer

Before we can understand mixed-modal scaling, we need to understand what scaling laws are and why they're so useful.

The Chinchilla revelation

In 2022, Hoffmann et al. (DeepMind) showed that language model loss follows a remarkably clean power law:

$$L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E$$

Where N is model parameters, D is training tokens, A, B, E are constants, and α, β are scaling exponents. This says: loss decreases as a power of both model size and data size, with an irreducible entropy floor E.
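
To make the formula concrete, here is a minimal sketch that evaluates it directly. The default constants are the fitted values commonly quoted from Hoffmann et al. (2022) for their text-only setting; treat them as illustrative assumptions here, not as values from the mixed-modal paper. The function name `chinchilla_loss` is hypothetical.

```python
def chinchilla_loss(n_params, n_tokens, A=406.4, B=410.7, E=1.69,
                    alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = A / N**alpha + B / D**beta + E.

    Default constants are assumed for illustration (commonly quoted
    text-only fit); they are not taken from the mixed-modal paper.
    """
    return A / n_params**alpha + B / n_tokens**beta + E


# Loss at a fixed token budget for a few model sizes
for n in (125e6, 1.3e9, 6.7e9):
    print(f"N = {n:.3g}: L = {chinchilla_loss(n, 200e9):.3f}")
```

Growing N shrinks only the first term, growing D shrinks only the second, and neither can push the loss below E.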

On a log-log plot, power laws appear as straight lines. This is what makes them so useful — you can fit a line through small-scale experiments and extrapolate to predict large-scale performance.

What this paper asks

Do the same power laws hold when you mix text and images? Specifically:

Three questions this paper answers:
1. Does mixed-modal loss follow power laws in N and D? (Yes, with modality-specific exponents.)
2. Does adding image data help or hurt text performance? (It can help via cross-modal transfer, up to a point.)
3. What is the compute-optimal ratio of text to images? (It depends on scale — bigger models tolerate more images.)
python
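# Power law: L = A * N^(-alpha) + E
# After subtracting E, log(L - E) is a straight line in log(N) with slope -alpha.

import numpy as np

# Illustrative text-only values, not measurements from the paper
N_values = np.array([125e6, 350e6, 760e6, 1.3e9, 6.7e9])  # model sizes
losses = np.array([3.2, 2.9, 2.7, 2.5, 2.2])              # validation losses
E = 1.7  # assumed irreducible loss; the fitted alpha is sensitive to this choice

# Fit: log(L - E) = log(A) - alpha * log(N)
slope, intercept = np.polyfit(np.log(N_values), np.log(losses - E), 1)
alpha, A = -slope, np.exp(intercept)
print(f"alpha = {alpha:.3f}, A = {A:.1f}")
# These toy points do not reproduce the alpha ≈ 0.076 quoted for text in this tutorial.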

[Interactive figure: Scaling Law Visualizer. Loss versus model size on a log-log plot, with a slider for the scaling exponent α.]

Why are scaling laws practical for mixed-modal training decisions?

Chapter 2: The CM3 Architecture

The paper studies scaling laws on a specific model family called CM3 (Causally Masked Multimodal Model). Understanding CM3 is necessary because the scaling laws are measured on this architecture.

What is CM3?

CM3 is a decoder-only transformer that processes interleaved text and images. Images are tokenized using a VQ-VAE (like Chameleon). The key architectural choice: CM3 uses causal masking with an important twist — it can also do infilling (masked span prediction).

1. Input: Interleaved text + image tokens from web documents. Images tokenized with VQ-VAE into 256 tokens (codebook of 8192).
2. Causal Transformer: Standard next-token prediction. Both text and image tokens predicted autoregressively. Unified vocabulary.
3. CM3 Objective: Next-token prediction + infilling. Can generate text or images by completing masked spans.
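
One way to picture the unified vocabulary is to shift image codes past the text vocabulary so both kinds of token live in a single id space. The sketch below assumes a text vocabulary of 50,000 ids (a made-up number) and uses the 256-token, 8192-code image format stated above. The helper names are hypothetical, and the infilling part of the objective is not modelled.

```python
TEXT_VOCAB_SIZE = 50_000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8192    # codebook size stated in the text
TOKENS_PER_IMAGE = 256        # image tokens per image stated in the text


def to_unified_ids(segments):
    """Flatten interleaved ("text", ids) / ("image", codes) segments.

    Image codes are offset by TEXT_VOCAB_SIZE so the two id ranges
    never collide in the shared vocabulary.
    """
    out = []
    for kind, ids in segments:
        if kind == "text":
            if any(not 0 <= i < TEXT_VOCAB_SIZE for i in ids):
                raise ValueError("text id out of range")
            out.extend(ids)
        elif kind == "image":
            if len(ids) != TOKENS_PER_IMAGE:
                raise ValueError("each image must have 256 codes")
            if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in ids):
                raise ValueError("image code out of range")
            out.extend(TEXT_VOCAB_SIZE + c for c in ids)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return out


def shift_for_next_token(ids):
    """Inputs and targets for plain next-token prediction."""
    return ids[:-1], ids[1:]


doc = [("text", [5, 17, 42]), ("image", [0] * 255 + [8191]), ("text", [7])]
ids = to_unified_ids(doc)
inputs, targets = shift_for_next_token(ids)
print(len(ids), ids[:5])
```

After flattening, the transformer sees one sequence and one softmax over `TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE` ids, which is what lets the same loss cover both modalities.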

Model sizes studied

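| Model | Parameters | Layers | Hidden Dim | Training Tokens |
|---|---|---|---|---|
| CM3-125M | 125 million | 12 | 768 | 200B |
| CM3-350M | 350 million | 24 | 1024 | 200B |
| CM3-760M | 760 million | 24 | 1536 | 200B |
| CM3-1.3B | 1.3 billion | 24 | 2048 | 200B |
| CM3-2.7B | 2.7 billion | 32 | 2560 | 200B |
| CM3-6.7B | 6.7 billion | 32 | 4096 | 200B |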

For each model size, the paper trains multiple variants with different text/image ratios (from 0% to 100% image data). This gives them enough data points to fit scaling laws across both axes: model size AND data mix.
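
As a sanity check on the table, the parameter counts can be roughly recovered from layers and hidden size, and the experiment grid can be enumerated. This sketch assumes the common rule of thumb of about 12 · layers · hidden² non-embedding weights (which presumes an FFN width of 4× the hidden size). The list of image fractions is a hypothetical sweep, not the paper's.

```python
from itertools import product

# (layers, hidden_dim) as listed in the model table
CONFIGS = {
    "CM3-125M": (12, 768),
    "CM3-350M": (24, 1024),
    "CM3-760M": (24, 1536),
    "CM3-1.3B": (24, 2048),
    "CM3-2.7B": (32, 2560),
    "CM3-6.7B": (32, 4096),
}


def non_embedding_params(layers, hidden):
    """Rule-of-thumb weight count for attention + FFN blocks (assumed 4x FFN)."""
    return 12 * layers * hidden**2


IMAGE_FRACTIONS = [0.0, 0.1, 0.2, 0.3, 0.5, 1.0]  # hypothetical sweep values

# One training run per (model size, image fraction) pair
runs = list(product(CONFIGS, IMAGE_FRACTIONS))

for name, (layers, hidden) in CONFIGS.items():
    print(f"{name}: ~{non_embedding_params(layers, hidden) / 1e6:.0f}M non-embedding")
print(f"{len(runs)} runs in the grid")
```

The estimates fall somewhat below the nominal sizes because embeddings are excluded. The grid grows as sizes × fractions, which is why fitting a law and extrapolating is cheaper than sweeping at the largest scale.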

Why CM3 and not Chameleon? This paper predates Chameleon by a year. CM3 is the predecessor that established the mixed-modal paradigm. The scaling laws discovered here directly informed Chameleon's training recipe — including its choice of ~40% image data.

[Interactive figure: CM3 Model Family. Each dot represents one model size; selecting a dot shows its configuration.]

Why does the paper train multiple variants at each model size?

Chapter 3: Text Scaling Laws

The first major finding: text loss in mixed-modal models still follows a power law in model size, but the exponent and constant depend on the image data proportion.

The text scaling equation

$$L_{\text{text}}(N, \rho) = \frac{A(\rho)}{N^{\alpha(\rho)}} + E_{\text{text}}(\rho)$$

Where ρ is the fraction of image data (0 = text-only, 1 = image-only). Notice that A, α, and E all depend on ρ. This means the scaling behavior changes as you add more images.

Key finding: text loss is U-shaped in ρ

Adding a small amount of image data (up to ~20-30%) actually improves text loss compared to training on text alone! This is surprising. You'd expect that replacing some text data with images would always hurt text quality (less text to learn from). But the model extracts useful information from images that transfers to text understanding.

However, beyond ~30-40% images, text quality degrades. The model spends too many of its parameters on image processing, leaving insufficient capacity for text.

The cross-modal transfer sweet spot: At 125M parameters, text loss is minimized at ~15-20% images. At 6.7B parameters, the sweet spot shifts to ~25-30% images. Bigger models can absorb more visual information before text quality suffers. This is one of the paper's most practically important findings.
python
# Text loss as a function of image fraction ρ, at two model sizes.
# Values are the ones quoted in this chapter.

text_loss = {
    "125M": {0.0: 3.21, 0.1: 3.18, 0.2: 3.15, 0.3: 3.19, 0.5: 3.35},
    "6.7B": {0.0: 2.21, 0.1: 2.18, 0.3: 2.14, 0.5: 2.22},
}

for size, by_rho in text_loss.items():
    best_rho = min(by_rho, key=by_rho.get)  # ρ with the lowest text loss
    print(f"{size}: text-only = {by_rho[0.0]}, best ρ = {best_rho}, "
          f"loss = {by_rho[best_rho]}")

# 125M: small image fractions beat the text-only baseline; the minimum is at ρ=0.2
# 6.7B: the minimum moves to ρ=0.3, so the optimum shifts right with scale
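
The measured grid is coarse, so the minimum found by picking the best grid point is only known to the nearest step. A rough way to interpolate is to fit a parabola through the points and take its vertex. This is a sketch using the chapter's quoted values; the quadratic form is an assumption made for convenience, not the paper's parameterization, and `fitted_optimum` is a hypothetical helper.

```python
import numpy as np


def fitted_optimum(rhos, losses):
    """Fit loss ≈ a*ρ² + b*ρ + c and return the vertex ρ* = -b / (2a)."""
    a, b, _ = np.polyfit(rhos, losses, 2)
    if a <= 0:
        raise ValueError("fit is not convex, so it has no interior minimum")
    return float(-b / (2 * a))


rho_125m = fitted_optimum([0.0, 0.1, 0.2, 0.3, 0.5],
                          [3.21, 3.18, 3.15, 3.19, 3.35])
rho_6_7b = fitted_optimum([0.0, 0.1, 0.3, 0.5],
                          [2.21, 2.18, 2.14, 2.22])
print(f"125M: fitted optimum ρ ≈ {rho_125m:.2f}")
print(f"6.7B: fitted optimum ρ ≈ {rho_6_7b:.2f}")
```

With so few points the fitted vertex is only indicative, but it shows the same direction of movement as the grid: the larger model's minimum sits at a higher image fraction.
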

[Interactive figure: Text Loss vs Image Fraction. A model-size slider shows the text-loss minimum (the cross-modal transfer sweet spot) shifting right for larger models.]

What surprising finding does the paper make about text performance in mixed-modal models?

Chapter 4: Image Scaling Laws

Image loss also follows power laws, but with different exponents than text. Images scale faster with model size — meaning that bigger models get disproportionately better at images relative to text.

The image scaling equation

$$L_{\text{image}}(N, \rho) = \frac{A_{\text{img}}(\rho)}{N^{\alpha_{\text{img}}(\rho)}} + E_{\text{img}}(\rho)$$

The key difference from text: $\alpha_{\text{img}} > \alpha_{\text{text}}$. The image scaling exponent is larger, meaning image loss decreases faster with model size. Intuitively: images have more "headroom" for improvement. Text models at 6.7B are already quite good; image models at 6.7B are still far from ceiling.

Image scaling exponents

| Modality | Scaling Exponent α | Interpretation |
|---|---|---|
| Text | ~0.076 | Loss decreases slowly with N. Text models are already efficient. |
| Image | ~0.12 | Loss decreases faster with N. More room for improvement. |
| Mixed (optimal ρ) | Depends on ρ | Weighted combination of both exponents. |

Why images scale faster: Text prediction benefits from strong statistical regularities (grammar, common phrases) that even small models capture. Image prediction requires understanding spatial structure, object relationships, and visual semantics, which are features that emerge primarily in larger models. This means the "return on investment" from scaling is higher for images than for text.
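
To see what the gap between the two exponents means in practice, consider only the reducible term A / N^α. Growing N by a factor s multiplies that term by s^(−α). The sketch below uses the exponents quoted above, holds ρ fixed, and ignores the irreducible floor E; `reducible_loss_factor` is a hypothetical helper name.

```python
ALPHA_TEXT = 0.076   # text exponent quoted in this chapter
ALPHA_IMAGE = 0.12   # image exponent quoted in this chapter


def reducible_loss_factor(scale_up, alpha):
    """Factor by which A / N**alpha shrinks when N grows by `scale_up`."""
    return scale_up ** (-alpha)


for scale_up in (10, 100):
    text = reducible_loss_factor(scale_up, ALPHA_TEXT)
    image = reducible_loss_factor(scale_up, ALPHA_IMAGE)
    print(f"{scale_up}x params: text term x{text:.3f}, image term x{image:.3f}")
```

A 10× larger model cuts the reducible text term by roughly 16% and the reducible image term by roughly 24%, which is the sense in which images "scale faster".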

Image quality also depends on text data

Just as images help text, text helps images. Models trained with some text data generate better images than pure image models. The text provides semantic scaffolding that helps the model understand what the image should contain.

[Interactive figure: Text vs Image Scaling Exponents. Text and image loss versus model size on a log-log plot; image loss (orange) has the steeper slope, so images benefit more from additional parameters at large scales.]

Why does image loss decrease faster with model size than text loss?

Chapter 5: Optimal Mixing

Now we get to the paper's most practical contribution: given a fixed compute budget C, what fraction ρ of image data minimizes the total loss?

The compute-optimal mixing formula

The total loss is a weighted combination of text and image losses:

$$L_{\text{total}}(N, D, \rho) = (1 - \rho)\, L_{\text{text}}\big(N, D(1 - \rho)\big) + \rho\, L_{\text{image}}\big(N, D\rho\big)$$

Where D · (1-ρ) is the number of text tokens and D · ρ is the number of image tokens. The total compute C ≈ 6ND (standard approximation), so fixing C means larger N requires smaller D.
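
The bookkeeping in this formula can be sketched in a few lines: derive D from the compute budget, split it by ρ, and weight the two per-modality losses. The function names are hypothetical, and the per-modality losses are passed in as plain numbers because their fitted forms are not reproduced here.

```python
def tokens_for_budget(compute_flops, n_params):
    """Total training tokens D implied by C ≈ 6 * N * D."""
    return compute_flops / (6 * n_params)


def split_tokens(total_tokens, image_fraction):
    """Return (text_tokens, image_tokens) for image fraction ρ."""
    if not 0 <= image_fraction <= 1:
        raise ValueError("image_fraction must be in [0, 1]")
    image_tokens = total_tokens * image_fraction
    return total_tokens - image_tokens, image_tokens


def total_loss(image_fraction, text_loss, image_loss):
    """Weighted combination (1 - ρ) * L_text + ρ * L_image."""
    return (1 - image_fraction) * text_loss + image_fraction * image_loss


# Budget that trains a 1.3B model on 200B tokens
budget = 6 * 1.3e9 * 200e9
for n in (1.3e9, 2.6e9):
    d = tokens_for_budget(budget, n)
    text_tokens, image_tokens = split_tokens(d, 0.25)
    print(f"N = {n:.2g}: D = {d:.3g}, text = {text_tokens:.3g}, image = {image_tokens:.3g}")
```

Doubling N at a fixed budget halves D, so both the text and the image token counts shrink together; ρ only decides how the remaining tokens are divided.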

Optimal ρ as a function of N

The paper finds the optimal ρ by minimizing Ltotal over ρ at each model size. The results:

| Model Size N | Optimal Image Fraction ρ* | Text:Image Ratio |
|---|---|---|
| 125M | ~15% | 85:15 |
| 350M | ~18% | 82:18 |
| 760M | ~22% | 78:22 |
| 1.3B | ~25% | 75:25 |
| 6.7B | ~30% | 70:30 |
| Extrapolated 30B+ | ~35-40% | 60-65:35-40 |

The scaling trend: As models get bigger, the optimal image fraction increases. This makes sense given that the image scaling exponent ($\alpha_{\text{img}} \approx 0.12$) is larger than the text scaling exponent ($\alpha_{\text{text}} \approx 0.076$). Bigger models get more "bang for the buck" from image data, so it's worth allocating more of the data budget to images.
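
The extrapolated row can be roughly reproduced from the five measured rows. The sketch below assumes ρ* grows linearly in log10(N), which is a convenient simplification rather than the paper's derivation, and it cannot hold indefinitely because ρ* is capped at 100%. `predicted_optimum_pct` is a hypothetical helper.

```python
import numpy as np

# Measured rows from the table in this chapter
sizes = np.array([125e6, 350e6, 760e6, 1.3e9, 6.7e9])
optimal_pct = np.array([15, 18, 22, 25, 30])

# Assumed trend: ρ* (in percent) = slope * log10(N) + intercept
slope, intercept = np.polyfit(np.log10(sizes), optimal_pct, 1)


def predicted_optimum_pct(n_params):
    """Optimal image percentage under the assumed log-linear trend."""
    return float(slope * np.log10(n_params) + intercept)


print(f"about {slope:.1f} percentage points per 10x parameters")
print(f"34B prediction: about {predicted_optimum_pct(34e9):.0f}% images")
```

Under this assumption the 34B prediction lands inside the ~35-40% band given in the extrapolated row.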

This is exactly what Chameleon did: at 34B parameters, they used ~40% image data. This paper's scaling law predicted that ~35-40% would be optimal at that scale — and Chameleon's empirical results confirmed it.

[Interactive figure: Optimal Mixing Ratio Finder. For a selected model size, the total loss curve over image fraction marks the optimal point, which shifts right for larger models.]

Why does the optimal image fraction increase with model size?

Chapter 6: Cross-Modal Transfer

Perhaps the paper's most fascinating finding is cross-modal transfer: training on images makes the model better at text, and training on text makes the model better at images. This isn't just a neutral sharing of capacity — it's a genuine synergy.

Evidence of positive transfer

If cross-modal training were purely competitive (each modality taking capacity from the other), then the optimal image fraction for text loss would be 0%. But empirically, text loss is minimized at ρ ≈ 15-30%, not ρ = 0. Something about image data genuinely helps text modeling.
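
This test for positive transfer is easy to state in code: compare text loss at each image fraction against the text-only baseline. The sketch uses the 125M text-loss values quoted in Chapter 3; `transfer_gain` and `zone` are hypothetical helpers.

```python
# Text loss by image fraction ρ for the 125M model, as quoted in Chapter 3
TEXT_LOSS_125M = {0.0: 3.21, 0.1: 3.18, 0.2: 3.15, 0.3: 3.19, 0.5: 3.35}


def transfer_gain(loss_by_rho):
    """Text-loss improvement over the ρ = 0 baseline at each ρ > 0."""
    baseline = loss_by_rho[0.0]
    return {rho: round(baseline - loss, 4)
            for rho, loss in loss_by_rho.items() if rho > 0}


def zone(gain, tol=1e-9):
    """Label a gain as positive transfer, negative transfer, or neutral."""
    if gain > tol:
        return "positive"
    if gain < -tol:
        return "negative"
    return "neutral"


gains = transfer_gain(TEXT_LOSS_125M)
for rho, gain in sorted(gains.items()):
    print(f"ρ = {rho}: gain = {gain:+.2f} ({zone(gain)} transfer)")
```

Under pure competition every gain would be zero or negative. Here the gains are positive through ρ = 0.3 and turn negative by ρ = 0.5, which is the pattern the argument above relies on.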

Hypotheses for why transfer occurs

| Hypothesis | Mechanism | Evidence |
|---|---|---|
| Grounding | Images provide concrete referents for words, helping the model understand meaning | Biggest gains on concrete nouns, descriptions |
| World model | Visual data helps build a richer internal model of the physical world | Better on commonsense reasoning after mixed training |
| Regularization | Multi-task training prevents overfitting to text-specific patterns | Smaller gap between train and val loss |
| Data diversity | Image captions expose the model to different text distributions | Better on diverse text benchmarks |

Transfer is asymmetric. Images help text more than text helps images (at small ρ). This makes sense: text is the "harder" modality to model well, so it benefits more from auxiliary signal. Images are already well-served by their own data. The practical implication: even if you only care about text quality, training on some images is worth it.

When transfer breaks down

Beyond the sweet spot, adding more images hurts text performance. This happens when the model runs out of capacity: it doesn't have enough parameters to maintain good text features while also learning image features. The capacity bottleneck is in the FFN layers, which is why Mixture-of-Transformers (MoT), which uses separate FFNs per modality, largely eliminates this problem.

[Interactive figure: Cross-Modal Transfer Visualizer. Shows how adding image data affects text quality and vice versa; a green zone marks positive transfer and a red zone marks negative transfer (capacity competition).]

Why does a small amount of image data improve text performance?

Chapter 7: Connections

This paper laid the theoretical foundation for every major mixed-modal model that followed. Its scaling laws directly influenced training recipes at Meta and beyond.

| Model | Year | Image Fraction Used | Predicted Optimal | Match? |
|---|---|---|---|---|
| CM3Leon | 2023 | ~30% | ~28% | Close |
| Chameleon 7B | 2024 | ~35% | ~30% | Close |
| Chameleon 34B | 2024 | ~40% | ~35-40% | Match |
| Transfusion | 2024 | ~40% | ~35% | Close |

Lesson 1: Measure before you scale. The cheapest way to find the right data mix is to run small experiments (125M-760M) and fit scaling laws. Extrapolating to 7B+ saves millions of dollars in wasted compute from suboptimal ratios.

Lesson 2: Cross-modal transfer is real and valuable. Even text-only practitioners should consider adding a small fraction of multimodal data to their training mix. The performance gains are free.

Lesson 3: Scaling exponents differ by modality. Images benefit more from scale than text. This has profound implications for resource allocation as models grow to trillions of parameters.

[Interactive figure: Scaling Laws Impact Timeline. A year slider shows how this paper's scaling law predictions influenced subsequent mixed-modal models.]

What is this paper's most practically impactful finding?