Scaling Laws for Mixed-Modal Models (2023)

Chapter 0: The Mixing Question

You're training a mixed-modal model — one that handles both text and images. You have a fixed compute budget. You can spend it training on text, images, or a mix. How much of each should you use?

This isn't a trivial question. Intuitively, training on 100% text should give you the best text model, but one that can't handle images, and training on 100% images should give you the reverse. The interesting question is: what's the optimal ratio? And does that ratio change as your model gets bigger?

Mix Ratio (text:image)	Text Quality	Image Quality	Total
100:0	Best possible	None	?
70:30	Good	Decent	?
50:50	Moderate	Good	?
0:100	None	Best possible	?

Before this paper, people tuned this ratio by trial and error. Train a model, evaluate, adjust the ratio, train again. Expensive and slow. Aghajanyan et al. discovered that scaling laws — the clean mathematical relationships between model size, data size, and loss — extend to mixed-modal models. This means you can predict the optimal ratio from small experiments.

The paper's core contribution: Mixed-modal loss follows power laws just like text-only models. From these power laws, you can derive the optimal text-to-image ratio for any given compute budget. The key finding: the optimal ratio shifts toward MORE image data as models get larger. Bigger models are better at absorbing visual information.

The Mixing Tradeoff

Interactive figure: a slider sets the image percentage, and the plot shows how text quality, image quality, and total quality change so you can look for the optimal point.

Why can't you simply split data 50/50 between text and images for every model?

Because the optimal ratio depends on model scale: larger models can absorb more image data without degrading text performance, so the ideal ratio shifts with compute budget. Scaling laws let you predict this mathematically instead of by trial and error.
Because 50/50 is always the best ratio
Because images are more expensive to store

Chapter 1: Scaling Laws Primer

Before we can understand mixed-modal scaling, we need to understand what scaling laws are and why they're so useful.

The Chinchilla revelation

In 2022, Hoffmann et al. (DeepMind) showed that language model loss follows a remarkably clean power law:

L(N, D) = A/N^α + B/D^β + E

Where N is model parameters, D is training tokens, A, B, E are constants, and α, β are scaling exponents. This says: loss decreases as a power of both model size and data size, with an irreducible entropy floor E.
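
To make the three terms concrete, here is a minimal sketch that evaluates this functional form. The constants are placeholders chosen only for illustration (they are assumptions, not fitted values from Hoffmann et al. or from this paper), and `chinchilla_loss` is a hypothetical helper name.

```python
def chinchilla_loss(n_params, n_tokens, A, B, E, alpha, beta):
    """Evaluate L(N, D) = A / N^alpha + B / D^beta + E."""
    model_term = A / n_params**alpha   # shrinks as the model grows
    data_term = B / n_tokens**beta     # shrinks as the dataset grows
    return model_term + data_term + E  # E is the irreducible floor

# Placeholder constants (assumed for illustration, not fitted values)
consts = dict(A=400.0, B=400.0, E=1.7, alpha=0.3, beta=0.3)

small = chinchilla_loss(125e6, 200e9, **consts)
large = chinchilla_loss(6.7e9, 200e9, **consts)
more_data = chinchilla_loss(6.7e9, 400e9, **consts)

print(f"125M params, 200B tokens: {small:.3f}")
print(f"6.7B params, 200B tokens: {large:.3f}")
print(f"6.7B params, 400B tokens: {more_data:.3f}")
```

Whatever constants are used, the loss falls as either N or D grows and never drops below E.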

On a log-log plot, a pure power law appears as a straight line. When there is a floor E, it is the reducible loss L - E that is linear in log N. This is what makes power laws so useful: you can fit a line through small-scale experiments and extrapolate to predict large-scale performance.

What this paper asks

Do the same power laws hold when you mix text and images? The paper answers three questions:
1. Does mixed-modal loss follow power laws in N and D? (Yes, with modality-specific exponents.)
2. Does adding image data help or hurt text performance? (It can help via cross-modal transfer, up to a point.)
3. What is the compute-optimal ratio of text to images? (It depends on scale — bigger models tolerate more images.)

python
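# Power law with an irreducible floor: L = A * N^(-alpha) + E
# log(L - E) is linear in log(N), with slope -alpha

import numpy as np

# Example: text-only scaling (illustrative numbers)
N_values = np.array([125e6, 350e6, 760e6, 1.3e9, 6.7e9])  # model sizes
losses = np.array([3.2, 2.9, 2.7, 2.5, 2.2])              # validation losses

E = 0.0  # assumed irreducible loss; in practice E is fitted as well

# Fit: log(L - E) = log(A) - alpha * log(N)
slope, intercept = np.polyfit(np.log(N_values), np.log(losses - E), 1)
alpha, A = -slope, np.exp(intercept)
print(f"alpha = {alpha:.3f}, A = {A:.2f}")

# With E = 0 these illustrative points give alpha of roughly 0.095.
# The text exponent quoted in this tutorial is alpha ≈ 0.076; the fitted
# value depends on the data and on the assumed E.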

Scaling Law Visualizer

Interactive figure: loss versus model size on a log-log plot, where the relationship is a straight line. A slider sets the scaling exponent α (default 0.076) to show how different exponents change the slope.

Why are scaling laws practical for mixed-modal training decisions?

Because power laws appear as straight lines on log-log plots, you can fit them from small-scale experiments and extrapolate to predict the loss at much larger scales, letting you choose the optimal data mix without actually training the expensive large model
Because they eliminate the need for training data
Because they guarantee optimal performance

Chapter 2: The CM3 Architecture

The paper studies scaling laws on a specific model family called CM3 (Causally Masked Multimodal Model). Understanding CM3 is necessary because the scaling laws are measured on this architecture.

What is CM3?

CM3 is a decoder-only transformer that processes interleaved text and images. Images are tokenized using a VQ-VAE (like Chameleon). The key architectural choice: CM3 uses causal masking with an important twist — it can also do infilling (masked span prediction).

Input

Interleaved text + image tokens from web documents. Images tokenized with VQ-VAE into 256 tokens (codebook of 8192).

↓

Causal Transformer

Standard next-token prediction. Both text and image tokens predicted autoregressively. Unified vocabulary.

↓

CM3 Objective

Next-token prediction + infilling. Can generate text or images by completing masked spans.
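
The infilling half of the objective can be sketched as a simple rewrite of the token sequence: a span is cut out, a sentinel is left in its place, and the span is appended at the end after the same sentinel, so a left-to-right model learns to fill the gap with full context. This is a minimal sketch under assumptions: the function name, the `<mask:0>` sentinel string, and the toy image token names are hypothetical, and only a single span is handled.

```python
def causal_mask_transform(tokens, span_start, span_len, mask_token="<mask:0>"):
    """Move one span to the end of the sequence, marked by a sentinel."""
    span = tokens[span_start:span_start + span_len]
    prefix = tokens[:span_start]
    suffix = tokens[span_start + span_len:]
    # The model still predicts left to right, but when it reaches the
    # trailing sentinel it has already seen both prefix and suffix.
    return prefix + [mask_token] + suffix + [mask_token] + span

# Toy interleaved document: text tokens plus three (hypothetical) image tokens
doc = ["A", "photo", "of", "<img_17>", "<img_4032>", "<img_981>",
       "on", "a", "beach"]

masked = causal_mask_transform(doc, span_start=3, span_len=3)
print(masked)
```

Here the masked span happens to be the image tokens, so completing the sequence amounts to generating an image conditioned on the text on both sides.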

Model sizes studied

Model	Parameters	Layers	Hidden Dim	Training Tokens
CM3-125M	125 million	12	768	200B
CM3-350M	350 million	24	1024	200B
CM3-760M	760 million	24	1536	200B
CM3-1.3B	1.3 billion	24	2048	200B
CM3-2.7B	2.7 billion	32	2560	200B
CM3-6.7B	6.7 billion	32	4096	200B

For each model size, the paper trains multiple variants with different text/image ratios (from 0% to 100% image data). This gives them enough data points to fit scaling laws across both axes: model size AND data mix.
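
As a rough sanity check on the table and on the cost of such a sweep, the sketch below estimates non-embedding parameters with the common 12 · layers · d_model² rule of thumb and training compute with C ≈ 6ND. The rule of thumb assumes a standard transformer block with a 4x FFN, and the 11-point grid of image fractions is a hypothetical choice, not the grid used in the paper.

```python
# (layers, hidden dim) taken from the table above
CONFIGS = {
    "CM3-125M": (12, 768),
    "CM3-350M": (24, 1024),
    "CM3-760M": (24, 1536),
    "CM3-1.3B": (24, 2048),
    "CM3-2.7B": (32, 2560),
    "CM3-6.7B": (32, 4096),
}
TRAIN_TOKENS = 200e9  # training tokens per run, from the table

def non_embedding_params(n_layers, d_model):
    """Rule of thumb: about 12 * layers * d_model^2 weights."""
    return 12 * n_layers * d_model**2

def train_flops(n_params, n_tokens):
    """Standard approximation C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

# Hypothetical sweep of image fractions: 0%, 10%, ..., 100%
image_fractions = [i / 10 for i in range(11)]

for name, (layers, d_model) in CONFIGS.items():
    n = non_embedding_params(layers, d_model)
    flops = train_flops(n, TRAIN_TOKENS)
    print(f"{name}: ~{n / 1e6:.0f}M non-embedding params, ~{flops:.2e} FLOPs per run")

n_runs = len(CONFIGS) * len(image_fractions)
print(f"{n_runs} training runs in this hypothetical grid")
```

The estimates come out somewhat below the nominal sizes because embedding parameters are not counted.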

Why CM3 and not Chameleon? This paper predates Chameleon by a year. CM3 is the predecessor that established the mixed-modal paradigm. The scaling laws discovered here directly informed Chameleon's training recipe — including its choice of ~40% image data.

CM3 Model Family

Interactive figure: each dot represents one CM3 model size; selecting a dot shows its configuration.

Why does the paper train multiple variants at each model size?

To gather enough data points to fit scaling laws across both model size AND data mix ratio: each variant uses a different text/image proportion, enabling the paper to discover how the optimal ratio changes with scale
To find the best random seed
To compare different optimizers

Chapter 3: Text Scaling Laws

The first major finding: text loss in mixed-modal models still follows a power law in model size, but the exponent and constant depend on the image data proportion.

The text scaling equation

L_text(N, ρ) = A(ρ) / N^(α(ρ)) + E_text(ρ)

Where ρ is the fraction of image data (0 = text-only, 1 = image-only). Notice that A, α, and E all depend on ρ. This means the scaling behavior changes as you add more images.

Key finding: text loss is U-shaped in ρ

Adding a small amount of image data (up to ~20-30%) actually improves text loss compared to training on text alone! This is surprising. You'd expect that replacing some text data with images would always hurt text quality (less text to learn from). But the model extracts useful information from images that transfers to text understanding.

However, beyond ~30-40% images, text quality degrades. The model spends too many of its parameters on image processing, leaving insufficient capacity for text.

The cross-modal transfer sweet spot: At 125M parameters, text loss is minimized at ~15-20% images. At 6.7B parameters, the sweet spot shifts to ~25-30% images. Bigger models can absorb more visual information before text quality suffers. This is one of the paper's most practically important findings.

python
# Text loss as a function of image fraction ρ
# Measured at different model sizes

# At 125M params:
#   ρ=0.0: L_text = 3.21 (text-only baseline)
#   ρ=0.1: L_text = 3.18 (BETTER than text-only!)
#   ρ=0.2: L_text = 3.15 (optimal for 125M)
#   ρ=0.3: L_text = 3.19 (slightly worse)
#   ρ=0.5: L_text = 3.35 (significantly worse)

# At 6.7B params:
#   ρ=0.0: L_text = 2.21
#   ρ=0.1: L_text = 2.18
#   ρ=0.3: L_text = 2.14 (optimal for 6.7B — shifted right!)
#   ρ=0.5: L_text = 2.22
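
One way to turn the measurements listed above into an estimate of the sweet spot is to fit a parabola through the (ρ, loss) points and read off its vertex. This is a rough sketch: the quadratic form is an assumption made here for interpolation, not the functional form used in the paper, and `sweet_spot` is a hypothetical helper.

```python
import numpy as np

# (image fraction, text loss) pairs copied from the listing above
measurements = {
    "125M": ([0.0, 0.1, 0.2, 0.3, 0.5], [3.21, 3.18, 3.15, 3.19, 3.35]),
    "6.7B": ([0.0, 0.1, 0.3, 0.5], [2.21, 2.18, 2.14, 2.22]),
}

def sweet_spot(rho, loss):
    """Vertex of a least-squares parabola through the (rho, loss) points."""
    a, b, _ = np.polyfit(rho, loss, 2)
    if a <= 0:
        raise ValueError("fitted curve is not U-shaped")
    return -b / (2 * a)

spots = {size: sweet_spot(rho, loss) for size, (rho, loss) in measurements.items()}

for size, rho_star in spots.items():
    print(f"{size}: estimated text-optimal image fraction ~ {rho_star:.2f}")
```

With these numbers the estimated minimum sits at a larger image fraction for the 6.7B model than for the 125M model, in line with the shift described in the text.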

Text Loss vs Image Fraction

Interactive figure: text loss versus image fraction for a model size chosen with a slider. The optimal image fraction (where text loss is minimized) shifts right with larger models. The dip is the cross-modal transfer sweet spot.

What surprising finding does the paper make about text performance in mixed-modal models?

A small amount of image data (15-30%) actually IMPROVES text loss compared to text-only training: images provide cross-modal transfer that helps text understanding. The optimal image fraction increases with model size because larger models can absorb more visual information.
Text performance always gets worse when images are added
Text performance doesn't change at all

Chapter 4: Image Scaling Laws

Image loss also follows power laws, but with different exponents than text. Images scale faster with model size — meaning that bigger models get disproportionately better at images relative to text.

The image scaling equation

L_image(N, ρ) = A_img(ρ) / N^(α_img(ρ)) + E_img(ρ)

The key difference from text: α_img > α_text. The image scaling exponent is larger, meaning image loss decreases faster with model size. Intuitively: images have more "headroom" for improvement. Text models at 6.7B are already quite good; image models at 6.7B are still far from ceiling.

Image scaling exponents

Modality	Scaling Exponent α	Interpretation
Text	~0.076	Loss decreases slowly with N. Text models are already efficient.
Image	~0.12	Loss decreases faster with N. More room for improvement.
Mixed (optimal ρ)	Depends on ρ	Effective exponent falls between the text and image values.

Why images scale faster: Text prediction benefits from strong statistical regularities (grammar, common phrases) that even small models capture. Image prediction requires understanding spatial structure, object relationships, and visual semantics — features that emerge primarily in larger models. This means the "return on investment" from scaling is higher for images than for text.
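
A quick worked example shows what the gap between the two exponents means in practice. Under a power law, multiplying N by a factor k multiplies the reducible loss A / N^α by k^(-α). The sketch below applies this to the exponents quoted in the table, assuming A and α stay fixed for a given ρ; `reducible_loss_factor` is a hypothetical helper name.

```python
ALPHA_TEXT = 0.076   # text exponent quoted in the table above
ALPHA_IMAGE = 0.12   # image exponent quoted in the table above

def reducible_loss_factor(scale_up, alpha):
    """Factor applied to A / N^alpha when N grows by `scale_up`."""
    return scale_up ** (-alpha)

text_factor = reducible_loss_factor(10, ALPHA_TEXT)
image_factor = reducible_loss_factor(10, ALPHA_IMAGE)

print(f"10x more parameters: text reducible loss x {text_factor:.3f}")
print(f"10x more parameters: image reducible loss x {image_factor:.3f}")
```

A tenfold increase in parameters removes roughly 16% of the reducible text loss but roughly 24% of the reducible image loss.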

Image quality also depends on text data

Just as images help text, text helps images. Models trained with some text data generate better images than pure image models. The text provides semantic scaffolding that helps the model understand what the image should contain.

Text vs Image Scaling Exponents

Interactive figure: text and image loss versus model size on a log-log plot. Image loss (orange) decreases faster (steeper slope). At large scales, images benefit more from additional parameters.

Why does image loss decrease faster with model size than text loss?

Because text prediction benefits from strong statistical regularities that small models already capture, while image understanding requires learning spatial structure and visual semantics that emerge mainly in larger models, giving images a higher "return on investment" from scaling
Because image tokens are simpler than text tokens
Because there is more image training data

Chapter 5: Optimal Mixing

Now we get to the paper's most practical contribution: given a fixed compute budget C, what fraction ρ of image data minimizes the total loss?

The compute-optimal mixing formula

The total loss is a weighted combination of text and image losses:

L_total(N, D, ρ) = (1 − ρ) · L_text(N, D · (1-ρ)) + ρ · L_image(N, D · ρ)

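Where D · (1-ρ) is the number of text tokens and D · ρ is the number of image tokens. Here each per-modality loss is written as a function of model size and that modality's token count, following the L(N, D) form from Chapter 1. The total compute C ≈ 6ND (standard approximation), so fixing C means larger N requires smaller D.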
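
The formula can be evaluated directly once per-modality constants are available. The sketch below assumes each modality follows the Chinchilla-style form from Chapter 1 and scans ρ on a grid. All constants other than the two α values quoted in this tutorial are placeholders invented for illustration, and the function names are hypothetical.

```python
import numpy as np

def modality_loss(n_params, n_tokens, A, B, E, alpha, beta):
    """Chinchilla-style loss for one modality."""
    return A / n_params**alpha + B / n_tokens**beta + E

def total_loss(n_params, n_tokens, rho, text_c, image_c):
    """(1 - rho) * L_text(N, D(1 - rho)) + rho * L_image(N, D * rho)."""
    text = modality_loss(n_params, n_tokens * (1 - rho), **text_c)
    image = modality_loss(n_params, n_tokens * rho, **image_c)
    return (1 - rho) * text + rho * image

def tokens_for_budget(compute, n_params):
    """Invert C ~ 6 * N * D to get the token budget D."""
    return compute / (6 * n_params)

# Placeholder constants, assumed for illustration (not values from the paper)
TEXT = dict(A=5.0, B=10.0, E=1.5, alpha=0.076, beta=0.1)
IMAGE = dict(A=20.0, B=10.0, E=2.0, alpha=0.12, beta=0.1)

N = 1.3e9
C = 6 * N * 200e9                 # budget of a 1.3B model on 200B tokens
D = tokens_for_budget(C, N)

# Stay strictly inside (0, 1) so neither modality gets zero tokens
rhos = np.linspace(0.01, 0.99, 99)
losses = np.array([total_loss(N, D, r, TEXT, IMAGE) for r in rhos])
best_rho = rhos[np.argmin(losses)]

print(f"D = {D:.3e} tokens, grid minimum at rho = {best_rho:.2f}")
```

With arbitrary constants the grid minimum can land at either end of the range; only constants fitted to real training runs make its location meaningful.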

Optimal ρ as a function of N

The paper finds the optimal ρ by minimizing L_total over ρ at each model size. The results:

Model Size N	Optimal Image Fraction ρ*	Text:Image Ratio
125M	~15%	85:15
350M	~18%	82:18
760M	~22%	78:22
1.3B	~25%	75:25
6.7B	~30%	70:30
Extrapolated 30B+	~35-40%	60-65:35-40

The scaling trend: As models get bigger, the optimal image fraction increases. This makes sense given that the image scaling exponent (α_img ≈ 0.12) is larger than the text scaling exponent (α_text ≈ 0.076). Bigger models get more "bang for the buck" from image data, so it's worth allocating more of the data budget to images.

This is exactly what Chameleon did: at 34B parameters, they used ~40% image data. This paper's scaling law predicted that ~35-40% would be optimal at that scale — and Chameleon's empirical results confirmed it.
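
The extrapolated row of the table can be reproduced approximately with a very simple trend fit. The sketch below fits a straight line to ρ* against log10(N) using the five measured rows and evaluates it at 34B parameters. The log-linear form is an assumption made here for illustration and is not claimed to be the paper's procedure.

```python
import numpy as np

# Measured rows from the table above
n_params = np.array([125e6, 350e6, 760e6, 1.3e9, 6.7e9])
rho_star = np.array([0.15, 0.18, 0.22, 0.25, 0.30])

# Assumed trend: rho* = slope * log10(N) + intercept
slope, intercept = np.polyfit(np.log10(n_params), rho_star, 1)

def predicted_rho(n):
    """Trend-line estimate of the optimal image fraction, clipped to [0, 1]."""
    return float(np.clip(slope * np.log10(n) + intercept, 0.0, 1.0))

rho_34b = predicted_rho(34e9)
print(f"slope: {slope:.3f} per tenfold increase in N")
print(f"predicted optimal image fraction at 34B: {rho_34b:.2f}")
```

The fitted line rises by about 9 percentage points per tenfold increase in N and lands inside the ~35-40% range at 34B. A straight line in log N cannot hold indefinitely, since ρ* is bounded by 1, so this kind of fit should only be trusted near the measured range.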

Optimal Mixing Ratio Finder

Interactive figure: total loss versus image fraction for a selected model size. The curve's minimum marks the optimal fraction, which shifts right for larger models.

Why does the optimal image fraction increase with model size?

Because image loss has a larger scaling exponent than text (images benefit MORE from additional parameters), so at larger scales it's more efficient to allocate a larger share of the data budget to images: each additional image token contributes more to total quality improvement
Because bigger models have more memory
Because image datasets are larger

Chapter 6: Cross-Modal Transfer

Perhaps the paper's most fascinating finding is cross-modal transfer: training on images makes the model better at text, and training on text makes the model better at images. This isn't just a neutral sharing of capacity — it's a genuine synergy.

Evidence of positive transfer

If cross-modal training were purely competitive (each modality taking capacity from the other), then the optimal image fraction for text loss would be 0%. But empirically, text loss is minimized at ρ ≈ 15-30%, not ρ = 0. Something about image data genuinely helps text modeling.

Hypotheses for why transfer occurs

Hypothesis	Mechanism	Evidence
Grounding	Images provide concrete referents for words, helping the model understand meaning	Biggest gains on concrete nouns, descriptions
World model	Visual data helps build a richer internal model of the physical world	Better on commonsense reasoning after mixed training
Regularization	Multi-task training prevents overfitting to text-specific patterns	Smaller gap between train and val loss
Data diversity	Image captions expose the model to different text distributions	Better on diverse text benchmarks

Transfer is asymmetric. Images help text more than text helps images (at small ρ). This makes sense: text is the "harder" modality to model well, so it benefits more from auxiliary signal. Images are already well-served by their own data. The practical implication: even if you only care about text quality, training on some images is worth it.

When transfer breaks down

Beyond the sweet spot, adding more images hurts text performance. This happens when the model runs out of capacity: it doesn't have enough parameters to maintain good text features while also learning image features. The capacity bottleneck is in the FFN layers, which is why Mixture-of-Transformers (MoT), with separate FFNs per modality, largely eliminates this problem.
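
The idea of separate FFNs per modality can be illustrated with a few lines of NumPy: each token is routed to the feed-forward weights that match its modality, so text and image features no longer compete for the same FFN parameters. This is a toy sketch under assumptions (tiny dimensions, ReLU activation, random weights, hypothetical function names), not the published MoT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF = 8, 32  # toy sizes chosen for illustration

def init_ffn():
    """Random weights for one two-layer feed-forward block."""
    return {
        "w_in": rng.normal(scale=0.1, size=(D_MODEL, D_FF)),
        "w_out": rng.normal(scale=0.1, size=(D_FF, D_MODEL)),
    }

def ffn(x, params):
    """Two-layer MLP with ReLU, applied to each token independently."""
    return np.maximum(x @ params["w_in"], 0.0) @ params["w_out"]

def modality_ffn(x, is_image, text_ffn, image_ffn):
    """Route each token through the FFN that matches its modality."""
    out = np.empty_like(x)
    out[~is_image] = ffn(x[~is_image], text_ffn)
    out[is_image] = ffn(x[is_image], image_ffn)
    return out

text_ffn, image_ffn = init_ffn(), init_ffn()

# Six tokens of an interleaved sequence: text, text, image x3, text
x = rng.normal(size=(6, D_MODEL))
is_image = np.array([False, False, True, True, True, False])

y = modality_ffn(x, is_image, text_ffn, image_ffn)
print(y.shape)
```

In a full model the attention layers would still mix information across all tokens; only the per-token feed-forward step is split here.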

Cross-Modal Transfer Visualizer

Interactive figure: how adding image data affects text quality and vice versa. The green zone shows positive transfer; the red zone shows negative transfer (capacity competition).

Why does a small amount of image data improve text performance?

Through cross-modal transfer: images provide grounding for words, build a richer world model, act as regularization, and expose the model to diverse caption text, but this benefit reverses beyond ~30% images when the model runs out of capacity to serve both modalities
Because image data contains text captions
Because images are easier to model

Chapter 7: Connections

This paper laid the theoretical foundation for every major mixed-modal model that followed. Its scaling laws directly influenced training recipes at Meta and beyond.

Model	Year	Image Fraction Used	Predicted Optimal	Match?
CM3Leon	2023	~30%	~28%	Close
Chameleon 7B	2024	~35%	~30%	Close
Chameleon 34B	2024	~40%	~35-40%	Match
Transfusion	2024	~40%	~35%	Close

Lesson 1: Measure before you scale. The cheapest way to find the right data mix is to run small experiments (125M-760M) and fit scaling laws. Extrapolating to 7B+ saves millions of dollars in wasted compute from suboptimal ratios.

Lesson 2: Cross-modal transfer is real and valuable. Even text-only practitioners should consider adding a small fraction of multimodal data to their training mix. Within the sweet spot, the text gains come at no extra compute cost, because image tokens replace text tokens in a fixed budget.

Lesson 3: Scaling exponents differ by modality. Images benefit more from scale than text. This has profound implications for resource allocation as models grow to trillions of parameters.
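
Lesson 1 can be tried on the illustrative text-only losses from Chapter 1: fit a power law on the three smallest models only, then check how well it predicts the two larger ones. This is a minimal sketch that assumes an irreducible loss of E = 0 for simplicity; a real fit would estimate E as well.

```python
import numpy as np

# Illustrative text-only losses from the Chapter 1 example
n_params = np.array([125e6, 350e6, 760e6, 1.3e9, 6.7e9])
val_loss = np.array([3.2, 2.9, 2.7, 2.5, 2.2])

small = n_params <= 760e6   # the cheap experiments used for fitting
E = 0.0                     # assumed irreducible loss (simplification)

# Fit log(L - E) = intercept + slope * log(N) on the small models only
slope, intercept = np.polyfit(
    np.log(n_params[small]), np.log(val_loss[small] - E), 1
)

def predict(n):
    """Extrapolate the fitted power law to model size n."""
    return float(np.exp(intercept + slope * np.log(n)) + E)

for n, listed in zip(n_params[~small], val_loss[~small]):
    print(f"N = {n:.1e}: predicted {predict(n):.2f}, listed {listed:.2f}")
```

On these numbers the fit from models up to 760M lands within a few hundredths of the listed loss at 6.7B, which is the kind of agreement that makes small-scale sweeps worth running before committing a large budget.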

Scaling Laws Impact Timeline

Interactive figure: a timeline of how this paper's scaling law predictions influenced subsequent mixed-modal models.

What is this paper's most practically impactful finding?

That the compute-optimal text/image ratio for mixed-modal training can be predicted from small-scale experiments via scaling laws, the optimal ratio shifts toward more image data at larger scales, and cross-modal transfer means even text-only performance benefits from some image data
That images should never be included in LLM training
That all models should use the same data ratio