Distillation Scaling Laws (2025)

Chapter 0: The Inference Problem

You have trained a 7.75B-parameter language model on 512 billion tokens. It is good. But every user query costs money — the model runs on expensive GPUs, 24/7, burning electricity. Inference cost at scale is the dominant expense in the lifecycle of a language model. It dwarfs pretraining cost when measured over the model's entire deployment.

There is a simple solution: use a smaller model. A 143M-parameter model is ~50x cheaper to serve. But if you train that small model from scratch on the same data, it will be much worse — the supervised scaling law (Chinchilla) tells us the loss is limited by model capacity.

What if the big model could teach the small model? That is knowledge distillation — the big model (the teacher) produces soft probability distributions over the vocabulary for every token, and the small model (the student) learns to match those distributions instead of just matching the hard one-hot labels from the training data.

The central question of this paper: Can we predict how good the student will be given (1) the student size N_S, (2) the teacher quality L_T, and (3) the distillation data budget D_S? And can we use that prediction to decide whether distillation is worth the compute compared to just training a bigger model from scratch?

This is not an academic exercise. Companies like Apple (the authors' affiliation) deploy models on billions of devices. The difference between a 300M model and a 1B model determines whether your model runs on-device or requires a server roundtrip. Distillation is the standard tool for producing these small models — but until this paper, nobody could predict whether it would actually help.

Consider the numbers: OpenAI and Pilipiszyn (2021) estimated billions of tokens per day in inference. The inference cost of an LM is typically significantly larger than its pretraining cost over its lifetime. And with test-time compute scaling (chain-of-thought, tree search, repeated sampling), inference cost is growing even faster. Every parameter you can shave off the deployment model saves real money.

The overtraining paradigm

Modern deployment has shifted away from compute-optimal training (Chinchilla). Instead, practitioners overtrain — they train a small model on far more tokens than Chinchilla prescribes. A Chinchilla-optimal 300M model trains on ~6B tokens (20x parameters). An overtrained 300M model trains on 100B+ tokens. The model is worse than a compute-optimal 1B model, but it is much cheaper to serve.

Distillation offers a potentially better path: instead of overtraining a small model from scratch, distill from a large teacher. But this introduces new costs — you have to train the teacher first, and possibly run teacher inference to generate logits. When is this extra cost justified?

[Interactive figure: Training Paradigms Comparison. Compares three strategies for producing a deployment model (compute-optimal training, overtraining, and distillation) and shows the compute allocation of each.]

Why can't you just use the Chinchilla scaling law to decide whether distillation is worthwhile?

Answer: Chinchilla only covers supervised pretraining. It predicts loss as a function of model size and data, but has no term for teacher quality. Distillation introduces a new variable (teacher cross-entropy L_T) that changes the student's achievable loss.

Incorrect options: "Chinchilla only works for models above 1B parameters"; "Chinchilla was fitted on different data".

Chapter 1: Supervised Scaling Laws

Before we can understand distillation scaling, we need to understand the baseline: how does a model trained without a teacher scale? This is the supervised scaling law, established by Kaplan et al. (2020) and refined by Hoffmann et al. (2022, "Chinchilla").

The power-law form

When you train a transformer language model of size N parameters on D tokens, the resulting validation cross-entropy loss follows a remarkably clean power law:

L(N, D) = E + A / N^α + B / D^β

Where:

Symbol	Meaning	Typical Value
E	Irreducible entropy — the best any model could do on this data (natural language has inherent randomness)	~1.7 nats
A / N^α	Model capacity term — bigger model, lower loss. Diminishing returns.	α ≈ 0.34
B / D^β	Data term — more tokens, lower loss. Also diminishing returns.	β ≈ 0.28

This equation captures a deep truth: loss is jointly limited by model capacity and data quantity. Doubling the model size gives diminishing improvements (power law, not linear). Doubling the data gives diminishing improvements too. The irreducible entropy E is the floor — you can't beat it no matter how large your model or dataset.
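As a rough illustration of these diminishing returns, the sketch below evaluates the additive law while doubling the model size at a fixed data budget. The coefficients are the Chinchilla-style values used in the code later in this document; treat them as illustrative assumptions, not as the paper's fitted values.

```python
# Sketch: evaluate L(N, D) = E + A/N^alpha + B/D^beta and show diminishing returns.
# Coefficients are illustrative Chinchilla-style values, not a new fit.

def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Additive power-law loss in nats for N parameters and D tokens."""
    return E + A / N**alpha + B / D**beta

D = 20e9                          # hold data fixed at 20B tokens
sizes = [1e9, 2e9, 4e9, 8e9]      # successive doublings of model size
losses = [supervised_loss(N, D) for N in sizes]

# Loss improvement obtained from each successive doubling
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for N, L in zip(sizes, losses):
    print(f"N = {N:.0e}  ->  L = {L:.3f}")
print("gain per doubling:", [round(g, 4) for g in gains])
```

Each doubling buys a smaller improvement than the previous one, and no configuration drops below the floor E.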

The additive form (separate terms for N and D) is an approximation. In reality, model size and data interact — a larger model extracts more from the same data. But the additive form fits empirical data remarkably well and is analytically tractable, which is why it has become the standard.

The paper re-fits this supervised baseline on their own experimental data (C4 English, transformer with RoPE and μP) and finds coefficients consistent with the literature. This is important: the distillation scaling law is built on top of the supervised one, so a bad supervised fit would corrupt everything downstream.

Why power laws? Nobody fully understands why neural scaling follows power laws so cleanly. One intuition: language has structure at many scales (character patterns, word patterns, syntactic patterns, semantic patterns, discourse patterns). Each scale contributes a roughly equal fraction of the learnable information, and models learn the easiest scales first. This produces the smooth, log-linear curves we observe.

Compute-optimal training (Chinchilla)

Training costs FLOPs, approximately 6ND for a standard transformer. Given a fixed compute budget C = 6ND, what is the best split between model size N and tokens D?

N*, D* = argmin_{N,D} L(N, D)  s.t.  FLOPs(N, D) = C

Hoffmann et al. found that compute-optimal models have a constant token-to-parameter ratio: M = D/N ≈ 20. This is the "Chinchilla rule" — a 1B model should train on ~20B tokens.
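Under the two approximations stated here (C = 6ND and a fixed ratio M = D/N), the split has a closed form: N = sqrt(C / (6M)) and D = M·N. A minimal sketch, with a hypothetical helper name:

```python
import math

def compute_optimal_split(C, M=20.0):
    """Split a FLOP budget C = 6*N*D under a fixed tokens-per-parameter ratio M = D/N.

    Substituting D = M*N into C = 6*N*D gives C = 6*M*N^2.
    """
    N = math.sqrt(C / (6.0 * M))
    D = M * N
    return N, D

# Budget that corresponds to a 1B model on 20B tokens under C = 6ND
N, D = compute_optimal_split(1.2e20)
print(f"N = {N:.3e} parameters, D = {D:.3e} tokens")
```

This only encodes the rule of thumb M ≈ 20; it does not minimize the fitted loss surface itself.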

The problem with compute-optimal at inference time

A compute-optimal 10B model trained on 200B tokens has great loss. But at inference time, every token generated costs proportional to N. If you could achieve the same loss with a 1B model trained on 2T tokens, the inference cost drops by 10x — even though the total training compute is the same.

This is why the field moved to overtraining: train small models on vastly more tokens than Chinchilla prescribes. Distillation potentially offers even better small models than overtraining alone.

[Interactive figure: Supervised Scaling Law. Sliders set model size N (log₁₀ scale, shown at 1B) and data budget D (log₁₀ scale, shown at 20B); the plot shows how loss follows the power law L = E + A/N^α + B/D^β.]

A compute-optimal 3B model (Chinchilla M=20) trains on 60B tokens. An overtrained 300M model trains on 600B tokens — using roughly the same total FLOPs. Which has lower loss?

Answer: The 3B model has lower loss. The power law means the capacity term A/N^α dominates. Overtraining hits diminishing returns on data much faster than scaling up the model. That's exactly why people want distillation: to get the small model closer to the big model's loss.

Incorrect options: "The 300M model has lower loss since it trains on 10x the data"; "They have identical loss since they used the same compute".
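The comparison in this question can be checked numerically. The sketch below uses the 6ND training-cost approximation and the illustrative Chinchilla-style coefficients that appear in this document's later code; with other fitted coefficients the exact losses would differ.

```python
# Sketch: compare a compute-optimal 3B model with an overtrained 300M model
# at equal training FLOPs. Coefficients are illustrative assumptions.

def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    return E + A / N**alpha + B / D**beta

def train_flops(N, D):
    """Approximate training cost, C = 6*N*D."""
    return 6.0 * N * D

big = (3e9, 60e9)        # compute-optimal 3B model, D/N = 20
small = (300e6, 600e9)   # overtrained 300M model, D/N = 2000

for name, (N, D) in [("3B / 60B", big), ("300M / 600B", small)]:
    print(f"{name:12s} FLOPs = {train_flops(N, D):.2e}  L = {supervised_loss(N, D):.3f}")
```

Both configurations cost the same to train, and under these coefficients the 3B configuration reaches the lower loss, while the 300M model is ten times cheaper per generated token.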

Chapter 2: Distillation Mechanics

How does a teacher actually teach a student? The mechanics are simple but the implications are deep.

Next-token prediction (supervised)

In standard pretraining, the model sees a context x^(<i) and predicts the next token x⁽ⁱ⁾. The target is a one-hot vector: all probability mass on the correct token, zero everywhere else. The loss is the negative log probability of the correct token:

L_NTP(x, z) = − ∑_{a=1}^{V} e(x⁽ⁱ⁾)_a · log σ_a(z)

Where z is the logit vector (raw model output before softmax), σ is softmax, and e(x⁽ⁱ⁾) is the one-hot basis vector for the true token. This is equivalent to −log p(x⁽ⁱ⁾ | x^(<i); θ).
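The equivalence between the one-hot sum and the negative log probability can be seen on a toy example. This is a minimal pure-Python sketch for a single token position; the four-token vocabulary and the logit values are made up for illustration.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ntp_loss(logits, target):
    """Cross-entropy of softmax(logits) against a one-hot target index."""
    probs = softmax(logits)
    one_hot = [1.0 if a == target else 0.0 for a in range(len(logits))]
    # Sum over the vocabulary; only the target term is non-zero.
    return -sum(e * math.log(p) for e, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0, 0.0]   # toy vocabulary of V = 4 tokens
target = 0                       # index of the true next token

loss = ntp_loss(logits, target)
direct = -math.log(softmax(logits)[target])
print(loss, direct)              # the two values agree
```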

Knowledge distillation loss

In distillation, the teacher produces a full probability distribution over all V vocabulary tokens. This is much richer than a one-hot label — it tells the student not just which token is correct, but the relative probabilities of all incorrect tokens.

For example, given the context "The cat sat on the ___", the one-hot target just says "mat". But the teacher's distribution says: "mat" 40%, "floor" 25%, "rug" 15%, "couch" 8%, "bed" 5%, ... This teaches the student that "floor" and "rug" are reasonable alternatives, while "quantum" is not.

This is sometimes called dark knowledge (Hinton et al., 2015) — the information contained in the probabilities of incorrect classes. A teacher that assigns 25% to "floor" and 0.001% to "quantum" encodes real knowledge about semantic similarity. The student learns not just the answer, but the structure of the answer space.

The distillation loss uses KL divergence between teacher and student distributions, implemented via cross-entropy of the student against the teacher's soft targets:

L_KD(z_T, z_S) = −τ² ∑_{a=1}^{V} σ_a(z_T/τ) · log σ_a(z_S/τ)

Where τ is the distillation temperature. When τ = 1, the student is trained to match the teacher's distribution as is. When τ > 1, both distributions become softer (more uniform), which exposes more of the teacher's ranking of unlikely tokens. The τ² factor compensates for the soft-target gradients shrinking roughly as 1/τ², so gradient magnitudes stay comparable across temperatures.
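To make the softening effect of τ concrete, the sketch below applies a temperature-scaled softmax to a made-up logit vector and reports the top probability and the entropy. The logits and helper names are illustrative assumptions.

```python
import math

def softmax_t(z, tau=1.0):
    """Softmax of z / tau, computed stably."""
    scaled = [v / tau for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

teacher_logits = [8.0, 3.0, 2.0, -1.0]   # a confident toy teacher
for tau in (1.0, 2.0, 4.0):
    p = softmax_t(teacher_logits, tau)
    print(f"tau={tau}: top prob={max(p):.3f}, entropy={entropy(p):.3f} nats")
```

Higher temperatures lower the top probability and raise the entropy, which is the sense in which the distribution becomes softer.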

Why KL divergence, not MSE?

You might wonder: why not just minimize the mean squared error between teacher and student logits? The answer is that KL divergence respects the geometry of probability distributions. Two distributions that are "close" in KL are close in a statistically meaningful sense: they make similar predictions. MSE on logits does not track this. Softmax is invariant to adding a constant to every logit, so two logit vectors that differ by a constant give identical probabilities while their MSE is nonzero.
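Softmax does not change when the same constant is added to every logit, which gives a simple case where logit MSE and KL divergence disagree. A small sketch with made-up logits:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q) in nats."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

teacher = [2.0, 1.0, 0.0]
shifted = [v + 5.0 for v in teacher]   # same logits plus a constant offset

print("logit MSE:", mse(teacher, shifted))
print("KL between the softmax distributions:", kl(softmax(teacher), softmax(shifted)))
```

The logit MSE is large even though the two models make exactly the same predictions, so an MSE objective would penalize a difference that has no effect on the output distribution.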

Why temperature matters: Consider a teacher that assigns 99.9% to one token. The softmax squashes everything else to near-zero, so the student learns almost nothing beyond "this token is correct" — no better than one-hot labels. Temperature τ = 2 or 3 spreads the distribution out, letting the student learn from the teacher's relative rankings of all tokens. The paper uses τ = 1 throughout, finding it works best for their setup.

Combined loss

The total training loss for the student combines the standard NTP loss with the distillation loss:

L_S = (1 − λ) L_NTP(x, z_S) + λ L_KD(z_T, z_S) + λ_Z L_Z(z_S)

Where λ controls the balance between imitating the data and imitating the teacher, and L_Z is a token-level Z-loss for training stability. The paper uses λ = 1 (pure distillation) — the student only sees the teacher's soft targets, never the one-hot labels from the data. This is the cleanest setup for studying distillation scaling.

python
# Distillation forward pass (simplified)
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, tokens, tau=1.0):
    # Assumes student(tokens) and teacher(tokens) each return a logits tensor.
    # Teacher generates soft targets (no gradient)
    with torch.no_grad():
        teacher_logits = teacher(tokens)                          # [B, T, V]
        teacher_probs = F.softmax(teacher_logits / tau, dim=-1)   # [B, T, V]

    # Student forward pass
    student_logits = student(tokens)                              # [B, T, V]
    student_log_probs = F.log_softmax(student_logits / tau, dim=-1)

    # KL(teacher || student) = cross-entropy(teacher, student) - H(teacher).
    # H(teacher) does not depend on the student, so minimizing the
    # cross-entropy gives the same gradients as minimizing the KL.
    loss = -tau**2 * (teacher_probs * student_log_probs).sum(-1).mean()

    return loss  # Backprop through student only
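The combined objective with λ and the Z-loss term can be sketched for a single token position in pure Python. Two things here are assumptions rather than details taken from the paper: the Z-loss is written in its common form (the squared log of the softmax normalizer), and the default λ_Z value is a placeholder.

```python
import math

def log_softmax(z):
    """Return (log-probabilities, log-normalizer) for a list of logits."""
    m = max(z)
    log_Z = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_Z for v in z], log_Z

def student_loss(z_s, z_t, target, lam=1.0, lam_z=1e-4, tau=1.0):
    """(1 - lam) * NTP + lam * KD + lam_z * Z-loss for one token position."""
    log_p_s, log_Z = log_softmax(z_s)
    ntp = -log_p_s[target]                       # hard-label term

    log_p_s_tau, _ = log_softmax([v / tau for v in z_s])
    log_p_t_tau, _ = log_softmax([v / tau for v in z_t])
    kd = -tau**2 * sum(math.exp(lt) * ls         # soft-label term
                       for lt, ls in zip(log_p_t_tau, log_p_s_tau))

    z_loss = log_Z ** 2                          # assumed Z-loss form
    return (1 - lam) * ntp + lam * kd + lam_z * z_loss

z_t = [2.0, 0.5, -1.0]   # toy teacher logits
z_s = [1.0, 0.2, 0.1]    # toy student logits
print("pure distillation (lam=1):", student_loss(z_s, z_t, target=0))
print("pure NTP (lam=0):         ", student_loss(z_s, z_t, target=0, lam=0.0))
```

With λ = 1 the hard label never enters the loss, which matches the pure-distillation setting described above.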

Why does the paper use pure distillation (λ = 1, no NTP loss) instead of combining distillation with the standard next-token prediction objective?

Answer: To isolate the effect of distillation and avoid confounding. With λ = 1, the student's performance is entirely determined by the teacher signal, making it possible to derive a clean scaling law. They verify λ = 1 produces results statistically similar to the optimal λ.

Incorrect options: "Because NTP loss always hurts performance when a teacher is available"; "Because the GPU memory cannot fit both losses at once".

Chapter 3: The Capacity Gap

Here is the most surprising finding: making the teacher better can make the student worse. This is the capacity gap, and it is the key phenomenon that the distillation scaling law must capture.

The experiment

Take a fixed student (say 143M parameters, fixed distillation budget of 40B tokens). Now vary the teacher from small (198M) to huge (7.75B). Plot the student's final cross-entropy against teacher size. You might expect a monotonic curve: bigger teacher, better student.

Instead, you see something strange. As the teacher grows from 198M toward an intermediate size (on the order of 1B parameters for this student), the student gets better. But as the teacher grows further toward 7.75B, the student gets worse. There is an optimal teacher size, and going beyond it hurts.

The capacity gap, intuitively: A very strong teacher produces a very "sharp" distribution — high confidence on the correct token, very low probability on alternatives. This distribution is too complex for the small student to model. It is like asking a child to mimic a master calligrapher: the student cannot reproduce the subtle strokes, so it settles for a crude approximation that is worse than what it would learn from a merely good calligrapher whose strokes are within the student's ability to imitate.

The mathematical explanation

The capacity gap arises from a mismatch between the teacher's distribution complexity and the student's modeling capacity. The KLD between teacher and student can be decomposed:

KL(p_T || p_S) = H(p_T, p_S) − H(p_T)

As the teacher gets stronger, H(p_T) (the teacher's entropy) decreases — the teacher becomes more confident. The cross-entropy H(p_T, p_S) also changes, but the student has limited capacity to model the teacher's sharp distribution. When the teacher is too strong relative to the student, the KL divergence actually increases because the student cannot allocate its limited parameters to match the teacher's sharp peaks.
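The decomposition can be verified numerically on toy distributions. In the sketch below the teacher probabilities reuse the "cat sat on the ___" example (with the remaining mass lumped into one entry), and the student distribution is made up.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_a p_a log q_a, in nats."""
    return -sum(pa * math.log(qa) for pa, qa in zip(p, q) if pa > 0)

def entropy(p):
    """H(p) = H(p, p)."""
    return cross_entropy(p, p)

def kl(p, q):
    """KL(p || q) computed directly from its definition."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

p_T = [0.40, 0.25, 0.15, 0.08, 0.05, 0.07]   # toy teacher; last entry lumps all other tokens
p_S = [0.30, 0.20, 0.20, 0.10, 0.10, 0.10]   # toy student

lhs = kl(p_T, p_S)
rhs = cross_entropy(p_T, p_S) - entropy(p_T)
print(f"KL = {lhs:.6f}, H(p_T, p_S) - H(p_T) = {rhs:.6f}")
```

Since H(p_T) does not depend on the student, a change in KL for a fixed teacher comes entirely from the cross-entropy term.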

What the data shows

Student N_S	Optimal Teacher Size	Student Loss at Optimal	Student Loss at Largest Teacher (7.75B)
143M	~975M	2.56	2.59
546M	~1.82B	2.22	2.24
1.82B	~4.82B	2.07	2.08
4.82B	~7.75B	1.98	1.98

The effect is more pronounced for small students. A 143M student suffers a ~0.03 nat penalty from a too-strong teacher. For larger students (4.82B), the gap nearly vanishes because the student has enough capacity to model the teacher's distribution.
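The penalties quoted here follow directly from the table. A short sketch that recomputes them from the table's values:

```python
# (student size, loss with the best teacher, loss with the 7.75B teacher),
# copied from the table above
rows = [
    ("143M", 2.56, 2.59),
    ("546M", 2.22, 2.24),
    ("1.82B", 2.07, 2.08),
    ("4.82B", 1.98, 1.98),
]

# Penalty in nats for using the largest teacher instead of the optimal one
penalties = {name: round(largest - best, 2) for name, best, largest in rows}

for name, gap in penalties.items():
    print(f"{name:>6s}: {gap:.2f} nats")
```

The gap shrinks monotonically as the student grows and disappears for the 4.82B student.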

Why the capacity gap is not just about model size

The paper makes a subtle but crucial point: the capacity gap is about the ratio of algorithmic learning capacities, not just parameter counts. Two models of the same size trained on different amounts of data have different learning capacities. The ratio L_T/L̂_S captures this precisely — when L_T is much smaller than L̂_S (teacher far better than student's potential), the gap appears. When they are close, it does not.

The authors demonstrate this with controlled experiments on kernel regression and synthetic MLP tasks (Appendices C.1 and C.2), providing the first clean synthetic demonstrations of the capacity gap phenomenon.

Key insight: Teacher quality matters only through teacher cross-entropy L_T, not through teacher size N_T or tokens D_T independently. If a 975M teacher trained on 512B tokens and a 7.75B teacher trained on 20B tokens reach the same L_T, the law predicts the same student loss. The teacher is a black box characterized entirely by the quality of its output distribution.

[Interactive figure: Capacity Gap Explorer. A slider sets teacher quality for a chosen student size (shown at 143M); student loss traces a U-shaped curve, with an optimal teacher quality beyond which the student gets worse.]

A 300M student is being distilled. Teacher A (1B, L_T=2.3) and Teacher B (7B, L_T=1.9) are available. Which produces a better student?

Answer: It depends on the student's capacity relative to each teacher. Teacher A (L_T=2.3) may produce a better student because L_T=1.9 may be beyond the capacity gap threshold for a 300M student. A stronger teacher is not always better.

Incorrect options: "Teacher B always, because it has lower cross-entropy"; "Teacher A always, because smaller teachers are always better for small students".

Chapter 4: The Distillation Scaling Law

Now we arrive at the paper's core contribution: a single equation that predicts student cross-entropy from three inputs — student size, data budget, and teacher quality.

Deriving the functional form

The authors reason about what properties the law must have, then find the simplest equation that satisfies all of them.

Property 1: Infinite student, infinite data. If the student has unlimited capacity and unlimited distillation data, it should perfectly mimic the teacher. So: lim_{N_S,D_S→∞} L_S = L_T.

Property 2: Random teacher. A random teacher (infinite L_T) provides no useful signal. The student should converge to some loss independent of its own capacity: lim_{L_T→∞} L_S = L_T.

Property 3: Capacity gap. There must be a regime where improving L_T (making the teacher better) increases L_S (makes the student worse). This is the U-shaped behavior from Chapter 3.

Property 4: Broken power law in L_T. The influence of teacher quality on student loss transitions between two regimes — when the student is stronger than the teacher vs. weaker than the teacher — connected by a broken power law.

The equation

L_S(N_S, D_S, L_T) = L_T + (1 / L_m^c₀) · (1 + (L_T / (L̂_S · d₁))^(1/f₁))^(−c₁ f₁) · (A / N_S^α' + B / D_S^γ')

Let's unpack each term:

Term	Role	Intuition
L_T	Teacher cross-entropy (baseline)	A perfect student can't beat the teacher. L_T is the floor.
1/L_m^c₀	Student ability to mimic teacher	Where L_m = L(N_S, D_S) is what the student would achieve via supervised learning. Lower L_m → better mimicry.
(1 + (L_T/(L̂_S d₁))^(1/f₁))^(−c₁f₁)	Broken power law transition	Captures the capacity gap. When L_T < L̂_S d₁ (teacher is stronger than student's ability), the capacity gap kicks in.
A/N_S^α' + B/D_S^γ'	Student resource limitations	Same form as supervised scaling — model capacity + data limitations.

The key insight: L_T is sufficient

Notice that the teacher enters the equation only through L_T. The teacher's size N_T and training tokens D_T do not appear — they only matter insofar as they determine L_T = L(N_T, D_T). This is a powerful simplification: to predict distillation performance, you only need to know how good the teacher is, not how it became that good.

The supervised reference, L̂_S: The paper defines L̂_S = L(N_S, D_S), the loss the student would achieve if trained supervised (no teacher). This connects distillation scaling to supervised scaling. The ratio L_T/L̂_S captures the relative learning abilities of teacher and student, which determines whether the capacity gap appears.

python
# Distillation scaling law (Equation 8 from the paper)
import numpy as np

def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Chinchilla-style supervised scaling law."""
    return E + A / N**alpha + B / D**beta

def distillation_loss(N_s, D_s, L_T,
                      c0=0.45, c1=0.72, d1=1.05,
                      f1=0.18, alpha_p=0.31, gamma_p=0.25,
                      A_p=350., B_p=380.):
    """Predict student cross-entropy from distillation."""
    L_hat_S = supervised_loss(N_s, D_s)  # student supervised baseline
    mimic = 1.0 / L_hat_S**c0           # student's mimicry ability
    ratio = (L_T / (L_hat_S * d1))**(1/f1)
    broken_pl = (1 + ratio)**(-c1 * f1)  # capacity gap transition
    resource = A_p / N_s**alpha_p + B_p / D_s**gamma_p
    return L_T + mimic * broken_pl * resource
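A quick way to sanity-check a functional form like this is to test the limiting behaviour from Properties 1 and 2. The sketch below re-declares the two functions so that it runs on its own. It checks only the limits, not the numeric values, since the default coefficients are illustrative.

```python
# Sketch: check Property 1 (L_S -> L_T as N_S, D_S grow) and
# Property 2 (L_S / L_T -> 1 as L_T grows) for the form above.

def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    return E + A / N**alpha + B / D**beta

def distillation_loss(N_s, D_s, L_T, c0=0.45, c1=0.72, d1=1.05, f1=0.18,
                      alpha_p=0.31, gamma_p=0.25, A_p=350.0, B_p=380.0):
    L_hat_S = supervised_loss(N_s, D_s)
    mimic = 1.0 / L_hat_S**c0
    ratio = (L_T / (L_hat_S * d1)) ** (1 / f1)
    broken_pl = (1 + ratio) ** (-c1 * f1)
    resource = A_p / N_s**alpha_p + B_p / D_s**gamma_p
    return L_T + mimic * broken_pl * resource

# Property 1: the gap to the teacher shrinks as student size and data grow.
L_T = 2.2
scales = [(1e9, 1e11), (1e12, 1e14), (1e15, 1e17)]
gaps = [distillation_loss(N, D, L_T) - L_T for N, D in scales]
print("gap to teacher:", [round(g, 4) for g in gaps])

# Property 2: for a very weak teacher, the student tracks the teacher.
rel_gaps = [(distillation_loss(1e9, 1e11, lt) - lt) / lt for lt in (2.0, 20.0, 200.0)]
print("relative gap:  ", [round(r, 5) for r in rel_gaps])
```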

What single number about the teacher determines student performance in the distillation scaling law?

Answer: The teacher's cross-entropy L_T. The teacher's size N_T and training data D_T only matter through L_T = L(N_T, D_T). Two teachers of different sizes that achieve the same L_T produce identical students.

Incorrect options: "The teacher's parameter count N_T"; "The teacher's training token count D_T".

Chapter 5: Fitting the Law

An equation is only useful if it fits real data. The authors conducted the largest controlled study of distillation to date: hundreds of runs spanning student sizes from 143M to 12.6B, teachers from 198M to 7.75B, and distillation budgets from a few billion to 512B tokens.

Experimental setup

Dimension	Range	Details
Student sizes	143M — 12.6B	All transformers with multi-headed attention, Pre-Norm, RMSNorm, RoPE, sequence length 4096
Teacher sizes	198M — 7.75B	Six Chinchilla-optimal teachers (M_T = D_T/N_T ≈ 20)
Distillation tokens	~4B — 512B	English-only subset of C4 dataset
Temperature	τ = 1	Found to work best across all experiments
Mixing	λ = 1	Pure distillation (no NTP loss)
Parameterization	μP (maximal update parameterization)	Enables hyperparameter transfer across model sizes

Three experimental protocols

Different protocols isolate different variables:

Fixed M Teacher/Student IsoFLOPs

Fix one teacher. Train students at IsoFLOP profiles (student size and tokens vary subject to a total compute constraint). Reveals how student size and data trade off for a given teacher.

↓

IsoFLOP Teacher/Fixed M Students

Fix student (N_S, D_S). Vary teacher's N_T and D_T subject to a compute constraint. Reveals that N_T and D_T matter only through L_T.

↓

Fixed M Teachers/Fixed M Students

Vary everything: 10 teachers × 5+ student sizes × 4+ token budgets. Spans the widest range of L_T and L_S for fitting.

Fit quality

The scaling law fits all observed data at ≤1% prediction error. Even when extrapolating from weaker teachers to stronger ones (predicting behavior of a 7.75B teacher from data on ≤4.82B teachers), the errors remain under 1%.

Why this matters practically: You can run a small grid of cheap experiments (small students, small teachers, few tokens), fit the scaling law, then predict the outcome of expensive experiments (large students, large teachers, many tokens) without running them. For large runs this can avoid a substantial amount of wasted compute.

The fitting uses Sequential Least Squares Programming (SLSQP) with positivity constraints on all coefficients. The supervised scaling law is fitted first on the non-distilled baselines, and its coefficients are then held fixed when fitting the distillation law.

The fitting procedure in detail

Step 1: Fit the supervised scaling law (Equation 1) on all non-distilled runs. This gives you E, A, B, α, β. These are locked.

Step 2: For each distillation run, compute L̂_S = L(N_S, D_S) using the supervised law. This is a derived quantity, not a free parameter.

Step 3: Fit Equation 8 on all distillation runs, optimizing {c₀, c₁, d₁, f₁, α', γ', A', B'} with positivity constraints.

python
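# Fitting the distillation scaling law (sketch)
# Relies on distillation_loss() from the earlier code block.
import numpy as np
from scipy.optimize import minimize

def objective(params, data):
    """Sum of squared log-errors across all runs."""
    c0, c1, d1, f1, ap, gp, Ap, Bp = params
    total_err = 0.0
    for Ns, Ds, LT, LS_actual in data:
        LS_pred = distillation_loss(Ns, Ds, LT, c0, c1, d1, f1, ap, gp, Ap, Bp)
        total_err += (np.log(LS_pred) - np.log(LS_actual))**2
    return total_err

def fit_distillation_law(data, initial_guess):
    """data: iterable of (N_S, D_S, L_T, observed L_S); initial_guess: 8 start values."""
    # Positivity bounds. The six exponent-like coefficients stay in a small
    # range; the scale coefficients A' and B' need a much wider one.
    # The exact ranges are illustrative.
    bounds = [(0.01, 5.0)] * 6 + [(1e-3, 1e4)] * 2
    return minimize(objective, x0=initial_guess, args=(data,),
                    method='SLSQP', bounds=bounds)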

The key validation: extrapolation

The most impressive test of any scaling law is extrapolation — predicting behavior you haven't seen. The authors fit the law on teachers up to 4.82B and students with L_S > 2.3, then predict the behavior of 7.75B teachers and students with L_S ≤ 2.3. The predictions are accurate to within 1% — the gray region in Figure 5b that was not used for fitting is predicted almost perfectly.

The paper uses μP (maximal update parameterization) for training. Why is this important for a scaling law study?

Answer: μP enables hyperparameter transfer across model sizes, so the same learning rate works for all scales. Without it, each model size needs a separate hyperparameter sweep, introducing confounding variables that would contaminate the scaling law fit.

Incorrect options: "μP makes models train faster so they could run more experiments"; "μP is required for distillation to work".

Chapter 6: When to Distill

This is the most practically useful chapter. Given a compute budget, should you distill from a teacher or just train the student from scratch? The answer depends on which teacher costs count against that budget.

Scenario 1: Best case (teacher already exists)

You already have a trained teacher. Distillation is "free" — you only pay for the student's training. In this case, distillation almost always wins, because the teacher provides a richer signal than one-hot labels at zero additional cost.

But even here, the advantage eventually vanishes. At sufficient compute, supervised learning catches up:

Core finding #1: Supervised learning always matches optimal distillation at sufficient compute budget, with the crossover point shifting to larger budgets as student size increases. Smaller models benefit more from distillation. Larger models can learn the same patterns from data alone — they have enough capacity to extract the information that the teacher would have provided.

Scenario 2: Teacher inference only

The teacher exists but you must pay for inference (running the teacher on each training example to get logits). This is the common deployment scenario — you have a big model serving users, and you want to distill it into a smaller model.

The FLOP cost becomes:

FLOPs ≈ 3F(N_S)D_S + F(N_T)(δ^Logit_S D_S + δ^Pre_T 3D_T)

Where the first term is student training, the second is teacher inference to generate logits on the student's D_S tokens, and the third is teacher pretraining on D_T tokens. The δ indicators (0 or 1) switch each teacher cost on or off, depending on whether it counts against your budget.

The key detail: F(N) is the FLOPs per token for a model with N parameters. For a standard transformer, F(N) ≈ 2N per forward pass (each parameter is used once in a multiply-add). Training requires 3x this (forward + backward), while inference requires 1x. So teacher inference costs F(N_T) per token, while student training costs 3F(N_S) per token.
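The cost formula translates directly into a small accounting helper. The sketch below assumes F(N) ≈ 2N and uses a hypothetical configuration (300M student, 3B teacher); the function names and scenario labels are illustrative.

```python
def forward_flops_per_token(N):
    """Approximate forward-pass FLOPs per token, F(N) ~ 2N."""
    return 2.0 * N

def distillation_flops(N_S, D_S, N_T, D_T, delta_logit, delta_pre):
    """3 F(N_S) D_S + F(N_T) (delta_logit * D_S + delta_pre * 3 D_T)."""
    student_training = 3.0 * forward_flops_per_token(N_S) * D_S
    teacher_inference = delta_logit * forward_flops_per_token(N_T) * D_S
    teacher_pretraining = delta_pre * 3.0 * forward_flops_per_token(N_T) * D_T
    return student_training + teacher_inference + teacher_pretraining

# Hypothetical configuration, for illustration only
N_S, D_S = 300e6, 100e9   # student: 300M parameters, 100B distillation tokens
N_T, D_T = 3e9, 60e9      # teacher: 3B parameters, 60B pretraining tokens

# (delta_logit, delta_pre) for each way of counting teacher cost
scenarios = {
    "teacher fully amortized": (0, 0),
    "teacher inference only": (1, 0),
    "teacher pretraining only": (0, 1),
    "pretraining + inference": (1, 1),
}
for name, (dl, dp) in scenarios.items():
    total = distillation_flops(N_S, D_S, N_T, D_T, dl, dp)
    print(f"{name:26s} {total:.2e} FLOPs")
```

Under the F(N) ≈ 2N approximation, teacher inference costs more per token than student training whenever N_T > 3·N_S, so a large teacher can dominate the bill even when its pretraining is free.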

For non-embedding parameters, the paper derives a more accurate expression: F(N) ≈ 2N(1 + c₁N^(−1/3) + c₂N^(−2/3)) for fixed aspect-ratio models. They recommend the scaling community adopt this refined form.

Scenario 3: Teacher pretraining included

The worst case for distillation: you must train the teacher from scratch, then distill. Now the total compute includes teacher pretraining (expensive!) plus teacher inference plus student training.

The verdict

Scenario	δ^Pre	δ^Logit	When Distillation Wins
Best case (amortized teacher)	0	0	Almost always, unless student compute is very large
Teacher inference	0	1	When total budget exceeds a student-size-dependent threshold
Teacher pretraining	1	0	Only when making a family of models or using the teacher beyond distillation
Full cost (train + inference)	1	1	Distillation is more efficient only if total compute or tokens exceed a threshold. Otherwise, supervised learning wins.

The surprise: When total end-to-end compute is counted (teacher training + inference + student training), supervised learning generally achieves lower cross-entropy than a single distillation at the same total compute. The exceptions are the regimes listed in the table above, or a teacher that has uses beyond a single distillation.

Practical implications

If you just want the best model at size N_S: train it supervised with all your compute. Do not distill.

If you want a family of models (300M, 1B, 3B, 10B): train one large teacher, then distill into all sizes. The teacher cost is amortized across many students.

If the teacher already exists (e.g., your production server model): distill — it is essentially free improvement.

The crossover point

Figure 6 in the paper shows contour plots of the cross-entropy difference between distillation and supervised learning. Blue regions indicate distillation wins. Red regions indicate supervised wins. The boundary between them is the crossover.

For a 546M teacher, the crossover happens at relatively small student compute budgets — even moderate distillation beats supervised. For a 7.75B teacher, the blue region is much larger, meaning distillation's advantage extends to bigger students and larger budgets. But remember: this is the best-case scenario where the teacher is free.

When you include teacher training cost, the crossover shifts dramatically. For small total budgets (10¹⁹ FLOPs), supervised always wins — you don't have enough compute to train a useful teacher AND a student. The distillation advantage only appears at scale.

Your company has a 70B model serving users. You want a 1B model for on-device deployment. Should you distill from the 70B or train the 1B from scratch?

Answer: Distill. The 70B teacher already exists (amortized cost = 0), so you're in the best-case scenario. The teacher provides a richer signal than one-hot labels, and for small students (1B), the distillation advantage is largest. However, beware the capacity gap: the 70B may be too strong, and an intermediate-sized teacher may work better.

Incorrect options: "Always train from scratch, it's simpler"; "Neither, just quantize the 70B model".

Chapter 7: Compute-Optimal Distillation

Given a total compute budget C, how should you allocate it between student training, teacher training, and teacher inference? This is the compute-optimal distillation recipe.

The optimization problem

N_S*, N_T*, D_T* = argmin_{N_S, N_T, D_T} L_S(N_S, D_S, N_T, D_T) s.t. FLOPs_S = C

The authors solve this using constrained numerical minimization (SLSQP) for each of the four distillation scenarios from Chapter 6. The results reveal striking patterns in optimal resource allocation.

Optimal allocation trends

Student Size	Compute Budget (FLOPs)	Optimal Allocation
Small (≤3B)	Small (≤10²¹)	Mostly teacher pretraining. The student needs a good teacher more than lots of data.
Small (≤3B)	Large (≥10²³)	Evenly divided between student training, teacher inference, and (less) teacher pretraining.
Large (≥10B)	Small (≤10²¹)	Mostly standard student training — not enough budget to also train a useful teacher.
Large (≥10B)	Large (≥10²³)	Evenly divided between student, teacher inference, and teacher pretraining.

How optimal quantities scale with compute

As total compute increases, all four quantities (N_S*, D_S*, N_T*, D_T*) grow as power laws. Student and teacher tokens scale faster than student and teacher sizes. Optimal teacher size increases until it is slightly larger than the student, then plateaus. This makes intuitive sense: once the teacher is good enough relative to the student, making it even better hits the capacity gap.

The teacher size plateau: The optimal teacher cross-entropy L_T* (the red line in Figure 7) decreases as a power law with student size N_S until L_S matches L_T*. At that point, the student outperforms the optimal teacher, and the inflection point causes the teacher loss to decline faster. This is in line with the observation that "optimal teacher scale almost consistently follows a linear scaling with the student scale across different architectures and data scales."

A worked example

Suppose you want a 1B student and have 10²² total FLOPs. The table below sketches what a solution from the scaling law and constrained optimization could look like; the values are rounded and illustrative:

Quantity	Optimal Value	Reasoning
Teacher N_T	~3B	3x student size — enough to teach but not so large that capacity gap hurts
Teacher D_T	~60B	Chinchilla-optimal for the teacher (M ≈ 20)
Teacher L_T	~2.10	Predicted from supervised scaling law for 3B/60B
Student D_S	~200B	Remaining budget after teacher training and inference
Distilled L_S	~2.18	From distillation scaling law
Supervised L_S	~2.22	If all 10²² FLOPs went to supervised training of 1B
Improvement	0.04 nats	Distillation wins — but only because we allocated compute optimally

Notice the improvement is modest (0.04 nats). This is the regime where distillation barely edges out supervised learning. At 10²⁴ FLOPs the gap would widen. At 10²⁰ FLOPs, supervised would win.

The infinite data limit

What happens as the distillation budget grows without bound? In the infinite data regime, distillation converges to the same loss as supervised learning on infinite data. The advantage of distillation is in sample efficiency: for a finite token budget, distillation extracts more information per token than one-hot labels. But at infinite tokens, both methods converge.

Why distillation is more sample-efficient: Each distillation token carries a V-dimensional target (the full vocabulary distribution), while each supervised token carries at most log₂(V) bits (the identity of the correct token). For V=32000, that is a ~15-bit label versus a 32,000-dimensional distribution per token. The teacher's soft distribution is a much richer training signal.
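The arithmetic behind this comparison can be checked in a few lines. This is a minimal sketch: the 4-token distributions are made-up numbers for illustration, and counting dimensions is only a heuristic, since a real-valued vector's dimension is not directly comparable to a bit count.

```python
import math

V = 32_000  # vocabulary size used in the text

# A hard label names one of V tokens, so it carries at most log2(V) bits.
max_bits_hard_label = math.log2(V)

# A soft target is a probability vector over V tokens. It sums to 1,
# so it has V - 1 free real-valued entries.
soft_target_free_dims = V - 1

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Toy 4-token vocabulary (hypothetical numbers, for illustration only).
one_hot = [1.0, 0.0, 0.0, 0.0]          # hard label: no information about alternatives
teacher_soft = [0.7, 0.2, 0.05, 0.05]   # soft target: ranks the alternatives too

print(f"max bits in a hard label: {max_bits_hard_label:.2f}")
print(f"free dimensions in a soft target: {soft_target_free_dims}")
print(f"entropy of one-hot target: {entropy_bits(one_hot):.3f} bits")
print(f"entropy of soft target:    {entropy_bits(teacher_soft):.3f} bits")
```

The one-hot target has zero entropy because it says nothing about which wrong tokens were nearly right; the soft target keeps that ranking, and that ranking is the extra signal the student learns from.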

You have a fixed compute budget and want a 300M student. The optimal recipe says: 40% teacher training, 20% teacher inference, 40% student training. If you doubled the budget, how would the allocation shift?

The student's share would grow relative to the teacher's share. Student tokens scale faster than teacher improvement. At higher budgets, the marginal return from making the teacher even better decreases (capacity gap), so extra compute is better spent on more student distillation tokens.

The allocation would stay exactly the same at all budget levels.

All extra compute should go to the teacher since a better teacher always helps.

Chapter 8: Scaling Law Explorer

Now let's put it all together. This interactive simulation lets you explore the distillation scaling law and compare it against supervised baselines.

Distillation vs Supervised Scaling

Configure student size, distillation tokens, and teacher cross-entropy. Compare distillation loss against the supervised baseline. The gray dashed line shows the supervised loss for the same student size and tokens.

Controls: student size log₁₀(N_S) (shown at 1B), distillation tokens log₁₀(D_S) (shown at 50B), and teacher L_T (shown at 2.20).
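For readers without the interactive widget, the supervised baseline (the gray dashed line) can be approximated offline. The sketch below is a stand-in, not the explorer's implementation: it uses the additive form L(N, D) = E + A/N^α + B/D^β with the coefficients reported by Hoffmann et al. (2022) for their own data. Those coefficients are an assumption here; the distillation paper fits its own values, so absolute losses will not match the explorer.

```python
# Supervised baseline sketch: L(N, D) = E + A / N**alpha + B / D**beta.
# Assumption: coefficients are the ones reported by Hoffmann et al. (2022),
# not the values fitted in the distillation paper.

def supervised_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted cross-entropy for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Settings shown on the sliders: 1B-parameter student, 50B tokens.
baseline = supervised_loss(1e9, 50e9)
print(f"supervised baseline at N=1B, D=50B: {baseline:.3f}")

# More tokens lower the loss, but never below the irreducible term E
# plus the capacity term A / N**alpha.
for d in (50e9, 200e9, 1e12):
    print(f"D={d:.0e}: L={supervised_loss(1e9, d):.3f}")
```

The capacity term A/N^α is what a small student cannot train away with more data. That is the gap distillation tries to narrow.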

Compute Budget Allocator

Given a total FLOP budget, see how to split it between teacher and student training. The chart shows the predicted student loss for each allocation. Drag the split slider to explore.

Controls: teacher share (shown at 30%) and log₁₀(Total FLOPs) (shown at 10²¹).
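The split the allocator explores can be sketched as a small calculation that turns a total budget and a teacher share into token counts. The assumptions are loud ones: "teacher share" is taken to mean teacher pretraining only, the remainder pays for student training plus teacher inference on the student's tokens, costs follow the 6·N·D and 2·N·D approximations, and the 3B teacher size is a hypothetical choice because the widget's value is not shown.

```python
# Budget split sketch. Assumptions: teacher share covers teacher pretraining
# only; the rest covers student training (6*N_S*D_S) plus teacher inference
# on the student's tokens (2*N_T*D_S).

def split_budget(total_flops, teacher_share, n_teacher, n_student):
    """Return (teacher_tokens, student_tokens) affordable under the split."""
    teacher_flops = teacher_share * total_flops
    student_flops = total_flops - teacher_flops
    d_teacher = teacher_flops / (6.0 * n_teacher)
    # Each distillation token costs one student training step
    # plus one teacher forward pass.
    d_student = student_flops / (6.0 * n_student + 2.0 * n_teacher)
    return d_teacher, d_student

# Slider values from the text: 10^21 total FLOPs, 30% teacher share.
# Hypothetical sizes: 3B teacher, 1B student.
d_teacher, d_student = split_budget(1e21, 0.30, n_teacher=3e9, n_student=1e9)
print(f"teacher tokens: {d_teacher:.3e}")
print(f"student tokens: {d_student:.3e}")
```

Feeding these token counts into a loss predictor for each share is one way to reproduce the kind of curve the allocator chart displays.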

Using the explorer above: set student size to 143M (log=8.15), distillation tokens to 100B (log=11), and teacher L_T=2.0. Now change L_T to 1.85. Does the student improve?

For a 143M student, improving the teacher from L_T=2.0 to 1.85 may actually hurt: this is the capacity gap in action. The 143M student doesn't have enough parameters to model the sharper distribution of the stronger teacher.

Yes, the student always improves with a better teacher.

No change: teacher quality doesn't affect small students.

Chapter 9: Connections

This paper sits at the intersection of two major research threads: neural scaling laws and knowledge distillation. Understanding where it fits helps you see the bigger picture.

Scaling law lineage

Hestness et al. (2017)

First systematic study: loss follows power laws in model size and data across domains.

↓

Kaplan et al. (2020) — "Scaling Laws for Neural LMs"

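Showed that loss follows power laws in parameters N, data D, and compute. (The additive form L(N,D) = E + A/N^α + B/D^β is the parameterization fit in the later Chinchilla paper.) Recommended scaling N faster than D, which Chinchilla later showed to be suboptimal.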

↓

Hoffmann et al. (2022) — "Chinchilla"

Fixed Kaplan: N and D should scale equally. Token-to-parameter ratio M ≈ 20. Revolutionized how models are trained.

↓

Sardana et al. (2024) — "Beyond Chinchilla"

Overtraining paradigm. When inference cost matters, train small models on M >> 20 tokens. Shifts the optimal frontier.

↓

Busbridge et al. (2025) — This Paper

Extends scaling to distillation. Teacher quality L_T is a new variable. Capacity gap is parameterized. Compute-optimal recipes for all scenarios.

Distillation lineage

Paper	Contribution	Relation to This Work
Hinton et al. (2015)	Introduced knowledge distillation with temperature-scaled softmax	Foundation. This paper studies how it scales.
Beyer et al. (2022)	"A good teacher is patient and consistent" — function matching view	Observed capacity gap but couldn't predict it. This paper parameterizes it.
Zhang et al. (2023)	"Towards the law of capacity gap"	Named the phenomenon. This paper gives the first scaling law that captures it.
Stanton et al. (2021)	"Does knowledge distillation really work?"	Used λ=1 (pure distillation) to isolate effects. This paper follows the same protocol.
Burns et al. (2024)	Weak-to-strong generalization	This paper's scaling law covers the regime where the student ends up better than its teacher (L_S < L_T).

Open questions

Does the law hold for non-English data? All experiments use C4 English. Different languages have different entropy structures.

What about downstream task performance? The law predicts cross-entropy, not benchmark accuracy. The relationship between pretraining loss and downstream performance is itself an open scaling law question.

Does the capacity gap exist for non-autoregressive distillation? BERT-style models, diffusion models, and vision models all use distillation. Whether the same capacity gap and scaling behavior holds is unknown.

Can you distill reasoning? DeepSeek-R1's distilled models show that reasoning capabilities transfer through distillation, but the scaling behavior of reasoning transfer has not been studied.

What about mixture-of-experts (MoE)? The paper uses dense transformers. MoE models have a different FLOPs-to-parameters relationship (many parameters but fewer active per token). Whether the same scaling law coefficients hold for MoE teachers or students is unknown. DeepSeek-V3 (671B total, ~37B active) is a natural test case.

Distillation for test-time compute. With the rise of chain-of-thought and search-based inference, the relationship between pretraining loss and downstream capability may shift. A model distilled for low cross-entropy may not be optimal for test-time search — the scaling laws for these interactions remain uncharted.

The bigger picture

This paper completes a trilogy of scaling law results that together give practitioners a complete toolkit for resource allocation:

Decision	Scaling Law	Key Paper
How big should my model be?	Supervised scaling: L(N, D)	Hoffmann et al. (2022)
How much should I overtrain?	Overtraining scaling: lifecycle compute	Sardana et al. (2024)
Should I distill, and from what teacher?	Distillation scaling: L_S(N_S, D_S, L_T)	Busbridge et al. (2025)

With all three laws in hand, a practitioner can take a total compute budget and a target inference cost, and determine the optimal strategy: train a single compute-optimal model, overtrain a small model, or distill from a large teacher into a small student. The choice depends entirely on the budget, the target size, and whether a teacher already exists.

Scaling Law Timeline

Walk through the evolution of scaling laws, from early empirical observations to this paper's distillation scaling law.


What is the single most important practical takeaway from this paper?

Distillation is only more efficient than supervised learning when (a) the compute or tokens spent on the student stay below a student-size-dependent threshold, and (b) the teacher already exists or has uses beyond a single distillation. Otherwise, just train the student directly. Additionally, teacher quality matters only through L_T, so you can use any teacher configuration that achieves your target L_T.

Always use the largest possible teacher.

Distillation is always better than supervised training.

Distillation Scaling Laws: When to Distill, When to Train