Scaling Laws for Generative Mixed-Modal Language Models — power laws govern mixed text+image models, and the optimal modality mix ratio changes with scale.
You're training a mixed-modal model — one that handles both text and images. You have a fixed compute budget. You can spend it training on text, images, or a mix. How much of each should you use?
This isn't a trivial question. Training on 100% text looks like the recipe for the best text model, but the result can't handle images. Training on 100% images gives you... not much useful. The interesting question is: what's the optimal ratio? And does that ratio change as your model gets bigger?
| Mix Ratio (text:image) | Text Quality | Image Quality | Total |
|---|---|---|---|
| 100:0 | Best possible | None | ? |
| 70:30 | Good | Decent | ? |
| 50:50 | Moderate | Good | ? |
| 0:100 | None | Best possible | ? |
Before this paper, people tuned this ratio by trial and error. Train a model, evaluate, adjust the ratio, train again. Expensive and slow. Aghajanyan et al. discovered that scaling laws — the clean mathematical relationships between model size, data size, and loss — extend to mixed-modal models. This means you can predict the optimal ratio from small experiments.
Before we can understand mixed-modal scaling, we need to understand what scaling laws are and why they're so useful.
In 2022, Hoffmann et al. (DeepMind) showed that language model loss follows a remarkably clean power law:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
Where N is model parameters, D is training tokens, A, B, E are constants, and α, β are scaling exponents. This says: loss decreases as a power of both model size and data size, with an irreducible entropy floor E.
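To make the roles of the symbols concrete, here is a minimal sketch that evaluates a loss of this form. The function name `parametric_loss` and all default constants are assumptions chosen for illustration; they are not taken from this article.

```python
def parametric_loss(N, D, A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28):
    """Loss of the form E + A / N**alpha + B / D**beta.

    N: model parameters, D: training tokens.
    The default constants are illustrative assumptions only.
    """
    return E + A / N**alpha + B / D**beta


# Same token budget, two model sizes: the larger model sits closer to the floor E.
small = parametric_loss(N=125e6, D=200e9)
large = parametric_loss(N=6.7e9, D=200e9)
print(f"125M params: {small:.3f}")
print(f"6.7B params: {large:.3f}")
```

Whatever constants are used, the loss can approach E but never drop below it, which is why E is called the irreducible floor.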
On a log-log plot, power laws appear as straight lines. This is what makes them so useful — you can fit a line through small-scale experiments and extrapolate to predict large-scale performance.
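The fit-and-extrapolate workflow can be sketched as follows. This assumes the single-variable form L = A * N^(-alpha) + E and uses synthetic data generated from made-up constants, so that the recovered values can be checked. The helper `fit_power_law` and its grid scan over E are illustrative choices, not the procedure used in the paper.

```python
import numpy as np


def fit_power_law(N, L, E_grid):
    """Fit L = A * N**(-alpha) + E.

    For each candidate floor E, log(L - E) should be linear in log(N);
    keep the E whose straight-line fit leaves the smallest residual.
    """
    N, L = np.asarray(N, float), np.asarray(L, float)
    log_N = np.log(N)
    best = None
    for E in E_grid:
        if E >= L.min():  # log(L - E) is undefined at or above the data
            continue
        y = np.log(L - E)
        slope, intercept = np.polyfit(log_N, y, 1)
        residual = np.sum((y - (slope * log_N + intercept)) ** 2)
        if best is None or residual < best[0]:
            best = (residual, np.exp(intercept), -slope, E)
    if best is None:
        raise ValueError("no candidate E lies below the observed losses")
    return best[1:]  # (A, alpha, E)


# Synthetic "small-scale" runs generated from known constants (assumed values).
true_A, true_alpha, true_E = 20.0, 0.10, 1.5
N_small = np.array([125e6, 350e6, 760e6, 1.3e9])
L_small = true_A * N_small ** (-true_alpha) + true_E

A, alpha, E = fit_power_law(N_small, L_small, np.linspace(0.0, 2.5, 251))
predicted = A * 6.7e9 ** (-alpha) + E  # extrapolate beyond the fitted range
print(f"alpha={alpha:.3f}, E={E:.2f}, predicted loss at 6.7B={predicted:.3f}")
```

With real measurements the residual never reaches zero, and the estimate of E is usually the least stable part of the fit, so extrapolations far beyond the fitted range deserve caution.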
Do the same power laws hold when you mix text and images?
```python
# Power law: L = A * N^(-alpha) + E
# On a log-log plot, this is a straight line (slope = -alpha)
import numpy as np

# Example: text-only scaling
N_values = [125e6, 350e6, 760e6, 1.3e9, 6.7e9]  # model sizes
losses = [3.2, 2.9, 2.7, 2.5, 2.2]  # validation losses

# Fit: log(L - E) = log(A) - alpha * log(N)
# Slope of this line gives alpha ≈ 0.076 for text
```
The paper studies scaling laws on a specific model family called CM3 (Causally Masked Multimodal Model). Understanding CM3 is necessary because the scaling laws are measured on this architecture.
CM3 is a decoder-only transformer that processes interleaved text and images. Images are tokenized using a VQ-VAE (like Chameleon). The key architectural choice: CM3 uses causal masking with an important twist — it can also do infilling (masked span prediction).
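One way to get infilling out of a strictly left-to-right model is to cut a span out of the sequence, leave a sentinel in its place, and append the span after a second copy of the sentinel. The sketch below assumes that scheme; the sentinel string and the example tokens (including the `<img_…>` placeholders for VQ-VAE codes) are hypothetical.

```python
def causally_mask(tokens, start, end, sentinel="<mask:0>"):
    """Move tokens[start:end] to the end of the sequence.

    The span is replaced in place by a sentinel, and the same sentinel
    followed by the original span is appended, so a left-to-right model
    sees both sides of the gap before it generates the infill.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty slice of tokens")
    span = tokens[start:end]
    return tokens[:start] + [sentinel] + tokens[end:] + [sentinel] + span


doc = ["A", "photo", "of", "<img_17>", "<img_4>", "on", "a", "beach"]
masked = causally_mask(doc, 3, 5)
print(masked)
# ['A', 'photo', 'of', '<mask:0>', 'on', 'a', 'beach', '<mask:0>', '<img_17>', '<img_4>']
```

Training stays ordinary next-token prediction on the rearranged sequence, which is why the same decoder can both continue a prompt and fill a hole in it.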
| Model | Parameters | Layers | Hidden Dim | Training Tokens |
|---|---|---|---|---|
| CM3-125M | 125 million | 12 | 768 | 200B |
| CM3-350M | 350 million | 24 | 1024 | 200B |
| CM3-760M | 760 million | 24 | 1536 | 200B |
| CM3-1.3B | 1.3 billion | 24 | 2048 | 200B |
| CM3-2.7B | 2.7 billion | 32 | 2560 | 200B |
| CM3-6.7B | 6.7 billion | 32 | 4096 | 200B |
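As a sanity check on the table, a common rule of thumb estimates the non-embedding parameters of a standard transformer as 12 · layers · hidden_dim² (about 4d² for attention plus 8d² for a feed-forward block with 4× expansion). The sketch below applies that rule. It assumes CM3 uses standard blocks and ignores embeddings, so the estimates come out below the nominal sizes.

```python
def approx_params(n_layers, d_model):
    """Rough non-embedding parameter count: 12 * layers * d_model**2."""
    return 12 * n_layers * d_model**2


# (layers, hidden dim) pairs copied from the table above.
configs = {
    "CM3-125M": (12, 768),
    "CM3-350M": (24, 1024),
    "CM3-760M": (24, 1536),
    "CM3-1.3B": (24, 2048),
    "CM3-2.7B": (32, 2560),
    "CM3-6.7B": (32, 4096),
}

for name, (layers, d_model) in configs.items():
    estimate = approx_params(layers, d_model)
    print(f"{name}: ~{estimate / 1e6:,.0f}M non-embedding parameters")
```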
For each model size, the paper trains multiple variants with different text/image ratios (from 0% to 100% image data). This gives them enough data points to fit scaling laws across both axes: model size AND data mix.
The first major finding: text loss in mixed-modal models still follows a power law in model size, but the exponent and constants depend on the image data proportion:

$$L_{\text{text}}(N, \rho) = A(\rho)\, N^{-\alpha(\rho)} + E(\rho)$$

Where ρ is the fraction of image data (0 = text-only, 1 = image-only). Notice that A, α, and E all depend on ρ. This means the scaling behavior changes as you add more images.
Adding a small amount of image data (up to ~20-30%) actually improves text loss compared to training on text alone! This is surprising. You'd expect that replacing some text data with images would always hurt text quality (less text to learn from). But the model extracts useful information from images that transfers to text understanding.
However, beyond ~30-40% images, text quality degrades. The model spends too many of its parameters on image processing, leaving insufficient capacity for text.
```python
# Text loss as a function of image fraction ρ,
# measured at different model sizes
text_loss = {
    "125M": {
        0.0: 3.21,  # text-only baseline
        0.1: 3.18,  # BETTER than text-only!
        0.2: 3.15,  # optimal for 125M
        0.3: 3.19,  # slightly worse than the optimum
        0.5: 3.35,  # significantly worse
    },
    "6.7B": {
        0.0: 2.21,
        0.1: 2.18,
        0.3: 2.14,  # optimal for 6.7B (shifted right!)
        0.5: 2.22,
    },
}

# Image fraction with the lowest measured text loss at each size
for size, by_rho in text_loss.items():
    print(size, min(by_rho, key=by_rho.get))
```
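Because the measurements sit on a coarse grid of image fractions, the true minimum may fall between grid points. One simple way to estimate it, sketched here as an assumption rather than the paper's procedure, is to fit a parabola through the measured (ρ, loss) points and take its vertex. The helper name `interpolated_optimum` is hypothetical; the numbers are the ones listed above.

```python
import numpy as np


def interpolated_optimum(rhos, losses):
    """Vertex of a least-squares parabola through (rho, loss) points."""
    a, b, _ = np.polyfit(rhos, losses, 2)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return -b / (2 * a)


rho_125m = interpolated_optimum(
    [0.0, 0.1, 0.2, 0.3, 0.5], [3.21, 3.18, 3.15, 3.19, 3.35]
)
rho_67b = interpolated_optimum(
    [0.0, 0.1, 0.3, 0.5], [2.21, 2.18, 2.14, 2.22]
)
print(f"125M: estimated text-optimal image fraction {rho_125m:.2f}")
print(f"6.7B: estimated text-optimal image fraction {rho_67b:.2f}")
```

A parabola is only a local approximation of the loss curve, so the estimate is most trustworthy when the measured points bracket the dip closely.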
Interactive figure: text loss versus image fraction at each model size. The image fraction that minimizes text loss shifts right with larger models; the dip is the cross-modal transfer sweet spot.
Image loss also follows power laws, but with different exponents than text. Images scale faster with model size — meaning that bigger models get disproportionately better at images relative to text.
The key difference from text: α_img > α_text. The image scaling exponent is larger, meaning image loss decreases faster with model size. Intuitively: images have more "headroom" for improvement. Text models at 6.7B are already quite good; image models at 6.7B are still far from their ceiling.
| Modality | Scaling Exponent α | Interpretation |
|---|---|---|
| Text | ~0.076 | Loss decreases slowly with N. Text models are already efficient. |
| Image | ~0.12 | Loss decreases faster with N. More room for improvement. |
| Mixed (optimal ρ) | Depends on ρ | Weighted combination of both exponents. |
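To see what the two exponents mean in practice, the sketch below computes how much the reducible term A · N^(-α) shrinks when the model grows tenfold. It uses the approximate exponents from the table and ignores the floor E and any dependence on ρ, so it illustrates the scaling rule rather than predicting actual losses.

```python
ALPHA = {"text": 0.076, "image": 0.12}  # approximate exponents from the table


def reducible_loss_factor(alpha, scale):
    """Factor by which A * N**(-alpha) shrinks when N grows by `scale`."""
    return scale ** (-alpha)


factors = {m: reducible_loss_factor(a, 10) for m, a in ALPHA.items()}
for modality, factor in factors.items():
    print(f"{modality}: 10x parameters -> reducible loss x {factor:.2f}")
# text: 10x parameters -> reducible loss x 0.84
# image: 10x parameters -> reducible loss x 0.76
```

A tenfold increase in parameters removes roughly 16% of the reducible text loss but roughly 24% of the reducible image loss, which is the sense in which images scale faster.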
Just as images help text, text helps images. Models trained with some text data generate better images than pure image models. The text provides semantic scaffolding that helps the model understand what the image should contain.
Interactive figure: text and image loss versus model size on a log-log plot. Image loss (orange) has the steeper slope, so at large scales images benefit more from additional parameters.
Now we get to the paper's most practical contribution: given a fixed compute budget C, what fraction ρ of image data minimizes the total loss?
The total loss is a weighted combination of text and image losses:
Where D · (1-ρ) is the number of text tokens and D · ρ is the number of image tokens. The total compute C ≈ 6ND (standard approximation), so fixing C means larger N requires smaller D.
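The compute constraint can be made concrete with a small helper. This sketch takes the C ≈ 6ND approximation at face value and simply splits the resulting token count by ρ; the function name `token_budget` is hypothetical.

```python
def token_budget(C, N, rho):
    """Split a FLOP budget C into text and image tokens using C ≈ 6 * N * D."""
    D = C / (6 * N)  # total training tokens affordable at this model size
    return {"total": D, "text": D * (1 - rho), "image": D * rho}


# Budget equivalent to training a 6.7B model on 200B tokens.
C = 6 * 6.7e9 * 200e9

big = token_budget(C, N=6.7e9, rho=0.30)
small = token_budget(C, N=1.3e9, rho=0.25)
print(f"6.7B: {big['text'] / 1e9:.0f}B text + {big['image'] / 1e9:.0f}B image tokens")
print(f"1.3B: {small['text'] / 1e9:.0f}B text + {small['image'] / 1e9:.0f}B image tokens")
```

At the same budget the smaller model sees roughly five times as many tokens, which is the trade-off the optimization over N, D, and ρ has to balance.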
The paper finds the optimal ρ by minimizing L_total over ρ at each model size. The results:
| Model Size N | Optimal Image Fraction ρ* | Text:Image Ratio |
|---|---|---|
| 125M | ~15% | 85:15 |
| 350M | ~18% | 82:18 |
| 760M | ~22% | 78:22 |
| 1.3B | ~25% | 75:25 |
| 6.7B | ~30% | 70:30 |
| Extrapolated 30B+ | ~35-40% | 60-65:35-40 |
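A quick way to check that the extrapolated row is consistent with the measured rows is to fit a straight line to ρ* against log10(N) and read off the value at a larger size. The log-linear trend is an assumption made for this sketch, not the functional form used in the paper, and `predict_rho_star` is a hypothetical helper.

```python
import numpy as np

# Measured rows of the table above (image fraction in percent).
sizes = np.array([125e6, 350e6, 760e6, 1.3e9, 6.7e9])
rho_star = np.array([15.0, 18.0, 22.0, 25.0, 30.0])

# Least-squares line: rho* ≈ slope * log10(N) + intercept
slope, intercept = np.polyfit(np.log10(sizes), rho_star, 1)


def predict_rho_star(n_params):
    """Extrapolate the optimal image fraction (percent) to a new model size."""
    return slope * np.log10(n_params) + intercept


prediction_34b = predict_rho_star(34e9)
print(f"slope: {slope:.1f} percentage points per 10x parameters")
print(f"extrapolated optimum at 34B: {prediction_34b:.1f}% image data")
```

Under this assumption the 34B estimate lands inside the 35-40% band given in the last row, though a straight line in log10(N) cannot hold indefinitely because ρ* is bounded by 100%.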
This is exactly what Chameleon did: at 34B parameters, they used ~40% image data. This paper's scaling law predicted that ~35-40% would be optimal at that scale — and Chameleon's empirical results confirmed it.
Interactive figure: total loss versus image fraction for a selected model size. The optimal fraction shifts right for larger models.
Perhaps the paper's most fascinating finding is cross-modal transfer: training on images makes the model better at text, and training on text makes the model better at images. This isn't just a neutral sharing of capacity — it's a genuine synergy.
If cross-modal training were purely competitive (each modality taking capacity from the other), then the optimal image fraction for text loss would be 0%. But empirically, text loss is minimized at ρ ≈ 15-30%, not ρ = 0. Something about image data genuinely helps text modeling.
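The argument amounts to a sign test: compare text loss at each image fraction against the text-only baseline. The sketch below applies it to the 125M text-loss values quoted earlier in this article; the helper name `transfer_by_fraction` is hypothetical.

```python
# 125M text-loss measurements by image fraction, as quoted earlier.
text_loss_125m = {0.0: 3.21, 0.1: 3.18, 0.2: 3.15, 0.3: 3.19, 0.5: 3.35}


def transfer_by_fraction(losses):
    """Label each nonzero image fraction by whether it beats the text-only baseline."""
    baseline = losses[0.0]
    return {
        rho: "positive" if loss < baseline else "negative"
        for rho, loss in losses.items()
        if rho > 0
    }


labels = transfer_by_fraction(text_loss_125m)
print(labels)
# {0.1: 'positive', 0.2: 'positive', 0.3: 'positive', 0.5: 'negative'}
```

Under a purely competitive model every label would be "negative", so any "positive" entry is evidence of transfer.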
| Hypothesis | Mechanism | Evidence |
|---|---|---|
| Grounding | Images provide concrete referents for words, helping the model understand meaning | Biggest gains on concrete nouns, descriptions |
| World model | Visual data helps build a richer internal model of the physical world | Better on commonsense reasoning after mixed training |
| Regularization | Multi-task training prevents overfitting to text-specific patterns | Smaller gap between train and val loss |
| Data diversity | Image captions expose the model to different text distributions | Better on diverse text benchmarks |
Beyond the sweet spot, adding more images hurts text performance. This happens when the model runs out of capacity — it literally doesn't have enough parameters to maintain good text features while also learning image features. The capacity bottleneck is in the FFN layers, which is why MoT (separate FFNs per modality) largely eliminates this problem.
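The "separate FFNs per modality" idea can be illustrated with a toy routing layer: each token goes through the feed-forward weights that belong to its modality, so text and image features no longer share FFN capacity. This is only a sketch of that one idea under assumed shapes and random weights. It omits attention, and it is not a description of how MoT is actually implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes, chosen for illustration


def make_ffn():
    """Random weights for one two-layer feed-forward block."""
    return (
        rng.normal(size=(d_model, d_ff)) * 0.1,
        rng.normal(size=(d_ff, d_model)) * 0.1,
    )


def ffn(x, weights):
    """ReLU feed-forward block applied to rows of x."""
    w_in, w_out = weights
    return np.maximum(x @ w_in, 0.0) @ w_out


ffns = {0: make_ffn(), 1: make_ffn()}  # 0 = text tokens, 1 = image tokens


def modality_routed_ffn(x, modality):
    """Send each token through the FFN that belongs to its modality."""
    out = np.empty_like(x)
    for m, weights in ffns.items():
        idx = modality == m  # boolean mask selecting this modality's tokens
        out[idx] = ffn(x[idx], weights)
    return out


x = rng.normal(size=(6, d_model))        # six token representations
modality = np.array([0, 0, 1, 1, 1, 0])  # per-token modality ids
y = modality_routed_ffn(x, modality)
print(y.shape)
```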
Interactive figure: how adding image data affects text quality and vice versa. The green zone shows positive transfer; the red zone shows negative transfer (capacity competition).
This paper laid the theoretical foundation for every major mixed-modal model that followed. Its scaling laws directly influenced training recipes at Meta and beyond.
| Model | Year | Image Fraction Used | Predicted Optimal | Match? |
|---|---|---|---|---|
| CM3Leon | 2023 | ~30% | ~28% | Close |
| Chameleon 7B | 2024 | ~35% | ~30% | Close |
| Chameleon 34B | 2024 | ~40% | ~35-40% | Match |
| Transfusion | 2024 | ~40% | ~35% | Close |