Busbridge et al. — Apple, ICML 2025

Distillation Scaling Laws: When to Distill, When to Train

A large teacher can produce a small, powerful student — but when is distillation actually worth the compute? This paper derives the first scaling law that predicts student cross-entropy from teacher quality, student size, and data budget, revealing a capacity gap where stronger teachers make worse students.

Prerequisites: Cross-entropy loss + KL divergence + Chinchilla scaling laws. That's it.
10 chapters · 4+ simulations · 0 assumed knowledge

Chapter 0: The Inference Problem

You have trained a 7.75B-parameter language model on 512 billion tokens. It is good. But every user query costs money — the model runs on expensive GPUs, 24/7, burning electricity. Inference cost at scale is the dominant expense in the lifecycle of a language model. It dwarfs pretraining cost when measured over the model's entire deployment.

There is a simple solution: use a smaller model. A 143M-parameter model is ~50x cheaper to serve. But if you train that small model from scratch on the same data, it will be much worse — the supervised scaling law (Chinchilla) tells us the loss is limited by model capacity.

What if the big model could teach the small model? That is knowledge distillation — the big model (the teacher) produces soft probability distributions over the vocabulary for every token, and the small model (the student) learns to match those distributions instead of just matching the hard one-hot labels from the training data.

The central question of this paper: Can we predict how good the student will be given (1) the student size NS, (2) the teacher quality LT, and (3) the distillation data budget DS? And can we use that prediction to decide whether distillation is worth the compute compared to just training a bigger model from scratch?

This is not an academic exercise. Companies like Apple (the authors' affiliation) deploy models on billions of devices. The difference between a 300M model and a 1B model determines whether your model runs on-device or requires a server roundtrip. Distillation is the standard tool for producing these small models — but until this paper, nobody could predict whether it would actually help.

Consider the numbers: OpenAI and Pilipiszyn (2021) estimated billions of tokens per day in inference. The inference cost of an LM is typically significantly larger than its pretraining cost over its lifetime. And with test-time compute scaling (chain-of-thought, tree search, repeated sampling), inference cost is growing even faster. Every parameter you can shave off the deployment model saves real money.

The overtraining paradigm

Modern deployment has shifted away from compute-optimal training (Chinchilla). Instead, practitioners overtrain — they train a small model on far more tokens than Chinchilla prescribes. A Chinchilla-optimal 300M model trains on ~6B tokens (20x parameters). An overtrained 300M model trains on 100B+ tokens. The model is worse than a compute-optimal 1B model, but it is much cheaper to serve.

Distillation offers a potentially better path: instead of overtraining a small model from scratch, distill from a large teacher. But this introduces new costs — you have to train the teacher first, and possibly run teacher inference to generate logits. When is this extra cost justified?

[Interactive figure: Training Paradigms Comparison. Compares the compute allocation of three strategies for producing a deployment model: compute-optimal training, overtraining, and distillation.]

Why can't you just use the Chinchilla scaling law to decide whether distillation is worthwhile?

Chapter 1: Supervised Scaling Laws

Before we can understand distillation scaling, we need to understand the baseline: how does a model trained without a teacher scale? This is the supervised scaling law, established by Kaplan et al. (2020) and refined by Hoffmann et al. (2022, "Chinchilla").

The power-law form

When you train a transformer language model of size N parameters on D tokens, the resulting validation cross-entropy loss follows a remarkably clean power law:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Where:

| Symbol | Meaning | Typical Value |
| --- | --- | --- |
| $E$ | Irreducible entropy: the best any model could do on this data (natural language has inherent randomness) | ~1.7 nats |
| $A / N^{\alpha}$ | Model capacity term: bigger model, lower loss, with diminishing returns | $\alpha \approx 0.34$ |
| $B / D^{\beta}$ | Data term: more tokens, lower loss, also with diminishing returns | $\beta \approx 0.28$ |

This equation captures a deep truth: loss is jointly limited by model capacity and data quantity. Doubling the model size gives diminishing improvements (power law, not linear). Doubling the data gives diminishing improvements too. The irreducible entropy E is the floor — you can't beat it no matter how large your model or dataset.
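The diminishing returns can be checked numerically. The sketch below assumes the additive form above with E, α, and β taken from the table; the scale constants A = 406.4 and B = 410.7 are assumed values in the style of the Chinchilla fit, not coefficients fitted in this paper.

```python
def supervised_loss(n_params, n_tokens, E=1.7, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Additive power law L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss reduction from doubling the model at a fixed 20B-token budget,
# measured once at small scale and once at large scale.
D = 20e9
gain_small = supervised_loss(1e8, D) - supervised_loss(2e8, D)
gain_large = supervised_loss(1e10, D) - supervised_loss(2e10, D)

print(f"doubling 100M -> 200M lowers loss by {gain_small:.4f} nats")
print(f"doubling 10B  -> 20B  lowers loss by {gain_large:.4f} nats")
```

Under these assumed coefficients the same doubling buys less at larger scale, and the loss approaches but never drops below E.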

The additive form (separate terms for N and D) is an approximation. In reality, model size and data interact — a larger model extracts more from the same data. But the additive form fits empirical data remarkably well and is analytically tractable, which is why it has become the standard.

The paper re-fits this supervised baseline on their own experimental data (C4 English, transformer with RoPE and μP) and finds coefficients consistent with the literature. This is important: the distillation scaling law is built on top of the supervised one, so a bad supervised fit would corrupt everything downstream.

Why power laws? Nobody fully understands why neural scaling follows power laws so cleanly. One intuition: language has structure at many scales (character patterns, word patterns, syntactic patterns, semantic patterns, discourse patterns). Each scale contributes a roughly equal fraction of the learnable information, and models learn the easiest scales first. This produces the smooth, log-linear curves we observe.

Compute-optimal training (Chinchilla)

Training costs FLOPs, approximately 6ND for a standard transformer. Given a fixed compute budget C = 6ND, what is the best split between model size N and tokens D?

$$N^*, D^* = \arg\min_{N, D} L(N, D) \quad \text{s.t.} \quad \mathrm{FLOPs}(N, D) = C$$

Hoffmann et al. found that compute-optimal models have a constant token-to-parameter ratio: M = D/N ≈ 20. This is the "Chinchilla rule" — a 1B model should train on ~20B tokens.
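As a rough illustration of the rule, the following sketch turns a compute budget into a model size and token count. It assumes the C = 6ND approximation and treats M = 20 as exact; the function name and the example budget are made up for illustration.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = M * N, which gives N = sqrt(C / (6 * M))."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example budget chosen so the answer lands on round numbers.
n, d = chinchilla_allocation(1.2e20)
print(f"N = {n:.3g} parameters, D = {d:.3g} tokens")
```

Because D is tied to N, both quantities grow as the square root of the budget under this rule.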

The problem with compute-optimal at inference time

A compute-optimal 10B model trained on 200B tokens has great loss. But at inference time, every token generated costs proportional to N. If you could achieve the same loss with a 1B model trained on 2T tokens, the inference cost drops by 10x — even though the total training compute is the same.
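The arithmetic behind this comparison can be sketched as follows, assuming the 6ND training approximation and the common estimate of about 2N FLOPs per generated token for a forward pass. Whether the two models actually reach the same loss is a separate question that this sketch does not address.

```python
def training_flops(n_params, n_tokens):
    """Approximate training cost: 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def forward_flops_per_token(n_params):
    """Approximate serving cost: 2 FLOPs per parameter per generated token."""
    return 2.0 * n_params

big = training_flops(10e9, 200e9)    # compute-optimal 10B model, 200B tokens
small = training_flops(1e9, 2e12)    # overtrained 1B model, 2T tokens
serving_ratio = forward_flops_per_token(10e9) / forward_flops_per_token(1e9)

print(f"training FLOPs: {big:.2e} vs {small:.2e}")
print(f"serving cost ratio: {serving_ratio:.0f}x")
```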

This is why the field moved to overtraining: train small models on vastly more tokens than Chinchilla prescribes. Distillation potentially offers even better small models than overtraining alone.

[Interactive figure: Supervised Scaling Law. Sliders for log10(N) (default 1B) and log10(D) (default 20B) show how loss follows the power law L = E + A/N^α + B/D^β.]

A compute-optimal 3B model (Chinchilla M=20) trains on 60B tokens. An overtrained 300M model trains on 600B tokens — using roughly the same total FLOPs. Which has lower loss?

Chapter 2: Distillation Mechanics

How does a teacher actually teach a student? The mechanics are simple but the implications are deep.

Next-token prediction (supervised)

In standard pretraining, the model sees a context $x^{(<i)}$ and predicts the next token $x^{(i)}$. The target is a one-hot vector: all probability mass on the correct token, zero everywhere else. The loss is the negative log probability of the correct token:

$$L_{NTP}(x^{(i)}, z) = -\sum_{a=1}^{V} e(x^{(i)})_a \log \sigma_a(z)$$

Where $z$ is the logit vector (raw model output before softmax), $\sigma$ is softmax, and $e(x^{(i)})$ is the one-hot basis vector for the true token. This is equivalent to $-\log p(x^{(i)} \mid x^{(<i)}; \theta)$.
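A small NumPy check of that equivalence, for a single position with made-up logits: the one-hot sum and the direct negative log probability give the same number.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ntp_loss(logits, target):
    """-sum_a e(x)_a * log softmax_a(z) for one position."""
    one_hot = np.zeros_like(logits)
    one_hot[target] = 1.0
    return float(-(one_hot * log_softmax(logits)).sum())

logits = np.array([2.0, 0.5, -1.0, 0.0])   # hypothetical logits over a 4-token vocabulary
target = 0                                 # index of the true next token
loss = ntp_loss(logits, target)
direct = float(-log_softmax(logits)[target])
print(loss, direct)
```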

Knowledge distillation loss

In distillation, the teacher produces a full probability distribution over all V vocabulary tokens. This is much richer than a one-hot label — it tells the student not just which token is correct, but the relative probabilities of all incorrect tokens.

For example, given the context "The cat sat on the ___", the one-hot target just says "mat". But the teacher's distribution says: "mat" 40%, "floor" 25%, "rug" 15%, "couch" 8%, "bed" 5%, ... This teaches the student that "floor" and "rug" are reasonable alternatives, while "quantum" is not.

This is sometimes called dark knowledge (Hinton et al., 2015) — the information contained in the probabilities of incorrect classes. A teacher that assigns 25% to "floor" and 0.001% to "quantum" encodes real knowledge about semantic similarity. The student learns not just the answer, but the structure of the answer space.

The distillation loss uses KL divergence between teacher and student distributions, implemented via cross-entropy of the student against the teacher's soft targets:

$$L_{KD}(z_T, z_S) = -\tau^2 \sum_{a=1}^{V} \sigma_a\!\left(\frac{z_T}{\tau}\right) \log \sigma_a\!\left(\frac{z_S}{\tau}\right)$$

Where $\tau$ is the distillation temperature. When $\tau = 1$, the student is trained to match the teacher's distribution exactly. When $\tau > 1$, both distributions become softer (closer to uniform), exposing more of the teacher's ranking of unlikely tokens. The $\tau^2$ factor compensates for the soft-target gradients shrinking roughly as $1/\tau^2$, so the gradient magnitude stays approximately independent of temperature.

Why KL divergence, not MSE?

You might wonder: why not just minimize the mean squared error between teacher and student logits? The answer is that KL divergence respects the geometry of probability distributions. Two distributions that are "close" in KL are close in a statistically meaningful sense: they make similar predictions. MSE on logits does not have this property. Softmax is invariant to adding a constant to every logit, so two logit vectors that differ by a constant produce identical probabilities, yet their MSE is nonzero. Logit MSE therefore penalizes differences that have no effect on the predictions.
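A quick numerical check of how the two objectives treat a constant logit offset. The logits are made up for illustration; the student's logits equal the teacher's plus 5.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtracting the max does not change the result
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher_logits = np.array([3.0, 1.0, 0.0, -2.0])
student_logits = teacher_logits + 5.0   # same gaps between logits, shifted by a constant

mse = float(np.mean((teacher_logits - student_logits) ** 2))
divergence = kl(softmax(teacher_logits), softmax(student_logits))
print(f"logit MSE = {mse}, KL = {divergence}")
```

The two models make identical predictions, so the KL divergence vanishes, while the logit MSE reports a large error.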

Why temperature matters: Consider a teacher that assigns 99.9% to one token. The softmax squashes everything else to near-zero, so the student learns almost nothing beyond "this token is correct" — no better than one-hot labels. Temperature τ = 2 or 3 spreads the distribution out, letting the student learn from the teacher's relative rankings of all tokens. The paper uses τ = 1 throughout, finding it works best for their setup.
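The softening effect can be seen on a toy example. The logits below are hypothetical and chosen so that the teacher is very confident at τ = 1.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p)).sum())

teacher_logits = np.array([9.0, 2.0, 1.0, 0.0, -3.0])   # hypothetical confident teacher
for tau in (1.0, 2.0, 3.0):
    p = softmax(teacher_logits, tau)
    print(f"tau={tau}: top prob={p.max():.4f}, entropy={entropy(p):.4f} nats")
```

Higher temperature lowers the top probability and raises the entropy, so more of the ranking among the remaining tokens becomes visible to the student.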

Combined loss

The total training loss for the student combines the standard NTP loss with the distillation loss:

$$L_S = (1 - \lambda)\, L_{NTP}(x^{(i)}, z_S) + \lambda\, L_{KD}(z_T, z_S) + \lambda_Z\, L_Z(z_S)$$

Where λ controls the balance between imitating the data and imitating the teacher, and LZ is a token-level Z-loss for training stability. The paper uses λ = 1 (pure distillation) — the student only sees the teacher's soft targets, never the one-hot labels from the data. This is the cleanest setup for studying distillation scaling.
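A single-position NumPy sketch of this combined objective. The text does not spell out the Z-loss, so the form used here (squared log-partition function of the student logits) and the default `lam_z=1e-4` are assumptions made for illustration, not the paper's stated choices.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def student_loss(student_logits, teacher_logits, target, lam=1.0, lam_z=1e-4, tau=1.0):
    """(1 - lam) * L_NTP + lam * L_KD + lam_z * L_Z for one position."""
    ntp = -log_softmax(student_logits)[target]
    teacher_probs = np.exp(log_softmax(teacher_logits / tau))
    kd = -tau**2 * (teacher_probs * log_softmax(student_logits / tau)).sum()
    # Assumed Z-loss: (log sum_a exp(z_a))**2, computed stably via the max trick.
    m = student_logits.max()
    log_z = np.log(np.exp(student_logits - m).sum()) + m
    return float((1 - lam) * ntp + lam * kd + lam_z * log_z**2)

s = np.array([1.0, 0.0, -1.0])   # hypothetical student logits
t = np.array([2.0, 0.5, -0.5])   # hypothetical teacher logits
print("pure distillation (lam=1):", student_loss(s, t, target=0))
print("pure NTP (lam=0):        ", student_loss(s, t, target=0, lam=0.0))
```

With `lam=1.0` the hard label `target` has no influence on the loss, which is the pure-distillation setting described above.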

python
# Distillation forward pass (simplified)
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, tokens, tau=1.0):
    # Teacher generates soft targets (no gradient flows into the teacher)
    with torch.no_grad():
        teacher_logits = teacher(tokens)                          # [B, T, V]
        teacher_probs = F.softmax(teacher_logits / tau, dim=-1)   # [B, T, V]

    # Student forward pass
    student_logits = student(tokens)                              # [B, T, V]
    student_log_probs = F.log_softmax(student_logits / tau, dim=-1)

    # KL(teacher || student) = cross-entropy(teacher, student) - H(teacher).
    # H(teacher) does not depend on the student, so minimizing the
    # cross-entropy yields the same gradients as minimizing the KL divergence.
    loss = -tau**2 * (teacher_probs * student_log_probs).sum(-1).mean()

    return loss  # Backprop through student only

Why does the paper use pure distillation (λ = 1, no NTP loss) instead of combining distillation with the standard next-token prediction objective?

Chapter 3: The Capacity Gap

Here is the most surprising finding: making the teacher better can make the student worse. This is the capacity gap, and it is the key phenomenon that the distillation scaling law must capture.

The experiment

Take a fixed student (say 143M parameters, fixed distillation budget of 40B tokens). Now vary the teacher from small (198M) to huge (7.75B). Plot the student's final cross-entropy against teacher size. You might expect a monotonic curve: bigger teacher, better student.

Instead, you see something strange. As the teacher grows from 198M toward an intermediate size (about 975M for a 143M student, per the table below), the student gets better. But as the teacher grows further toward 7.75B, the student gets worse. There is an optimal teacher size, and going beyond it hurts.

The capacity gap, intuitively: A very strong teacher produces a very "sharp" distribution — high confidence on the correct token, very low probability on alternatives. This distribution is too complex for the small student to model. It is like asking a child to mimic a master calligrapher: the student cannot reproduce the subtle strokes, so it settles for a crude approximation that is worse than what it would learn from a merely good calligrapher whose strokes are within the student's ability to imitate.

The mathematical explanation

The capacity gap arises from a mismatch between the teacher's distribution complexity and the student's modeling capacity. The KL divergence between teacher and student can be decomposed:

$$\mathrm{KL}(p_T \,\|\, p_S) = H(p_T, p_S) - H(p_T)$$

As the teacher gets stronger, H(pT) (the teacher's entropy) decreases — the teacher becomes more confident. The cross-entropy H(pT, pS) also changes, but the student has limited capacity to model the teacher's sharp distribution. When the teacher is too strong relative to the student, the KL divergence actually increases because the student cannot allocate its limited parameters to match the teacher's sharp peaks.
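The decomposition can be verified on toy distributions. The three distributions below are hypothetical, and the student is held fixed, so this only illustrates the identity and the effect of a sharper teacher on each term. It does not reproduce the capacity gap itself, which depends on what a trained student can reach.

```python
import numpy as np

def cross_entropy(p, q):
    return float(-(p * np.log(q)).sum())

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    return float((p * np.log(p / q)).sum())

p_student = np.array([0.50, 0.30, 0.15, 0.05])         # hypothetical fixed student
p_teacher_soft = np.array([0.55, 0.25, 0.15, 0.05])    # hypothetical weaker teacher
p_teacher_sharp = np.array([0.97, 0.01, 0.01, 0.01])   # hypothetical sharper teacher

for name, p_t in [("soft", p_teacher_soft), ("sharp", p_teacher_sharp)]:
    print(f"{name}: H(pT)={entropy(p_t):.4f}  H(pT,pS)={cross_entropy(p_t, p_student):.4f}  "
          f"KL={kl(p_t, p_student):.4f}")
```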

What the data shows

| Student $N_S$ | Optimal Teacher Size | Student Loss at Optimal | Student Loss at Largest Teacher (7.75B) |
| --- | --- | --- | --- |
| 143M | ~975M | 2.56 | 2.59 |
| 546M | ~1.82B | 2.22 | 2.24 |
| 1.82B | ~4.82B | 2.07 | 2.08 |
| 4.82B | ~7.75B | 1.98 | 1.98 |

The effect is more pronounced for small students. A 143M student suffers a ~0.03 nat penalty from a too-strong teacher. For larger students (4.82B), the gap nearly vanishes because the student has enough capacity to model the teacher's distribution.
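The penalty for each student size follows directly from the table above. A short sketch of that subtraction, using only the table's values:

```python
# (student size, loss at optimal teacher, loss at the largest 7.75B teacher)
rows = [
    ("143M", 2.56, 2.59),
    ("546M", 2.22, 2.24),
    ("1.82B", 2.07, 2.08),
    ("4.82B", 1.98, 1.98),
]

# Penalty in nats for using the largest teacher instead of the optimal one.
penalties = {name: round(largest - best, 2) for name, best, largest in rows}
for name, penalty in penalties.items():
    print(f"{name}: {penalty:.2f} nats")
```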

Why the capacity gap is not just about model size

The paper makes a subtle but crucial point: the capacity gap is about the ratio of algorithmic learning capacities, not just parameter counts. Two models of the same size trained on different amounts of data have different learning capacities. The ratio LT/L̂S captures this precisely — when LT is much smaller than L̂S (teacher far better than student's potential), the gap appears. When they are close, it does not.

The authors demonstrate this with controlled experiments on kernel regression and synthetic MLP tasks (Appendices C.1 and C.2), providing the first clean synthetic demonstrations of the capacity gap phenomenon.

Key insight: Teacher quality matters only through teacher cross-entropy LT, not through teacher size NT or tokens DT independently. A 975M teacher trained on 512B tokens and a 7.75B teacher trained on 20B tokens that achieve the same LT produce the same student. The teacher is a black box characterized entirely by the quality of its output distribution.

[Interactive figure: Capacity Gap Explorer. A teacher quality slider shows how student loss changes for a chosen student size (default 143M). The curve is U-shaped: there is an optimal teacher quality beyond which the student gets worse.]

A 300M student is being distilled. Teacher A (1B, LT=2.3) and Teacher B (7B, LT=1.9) are available. Which produces a better student?

Chapter 4: The Distillation Scaling Law

Now we arrive at the paper's core contribution: a single equation that predicts student cross-entropy from three inputs — student size, data budget, and teacher quality.

Deriving the functional form

The authors reason about what properties the law must have, then find the simplest equation that satisfies all of them.

Property 1: Infinite student, infinite data. If the student has unlimited capacity and unlimited distillation data, it should perfectly mimic the teacher. So: $\lim_{N_S, D_S \to \infty} L_S = L_T$.

Property 2: Random teacher. A random teacher (infinite $L_T$) provides no useful signal. The student should converge to some loss independent of its own capacity: $\lim_{L_T \to \infty} L_S = L_T$.

Property 3: Capacity gap. There must be a regime where improving LT (making the teacher better) increases LS (makes the student worse). This is the U-shaped behavior from Chapter 3.

Property 4: Broken power law in LT. The influence of teacher quality on student loss transitions between two regimes — when the student is stronger than the teacher vs. weaker than the teacher — connected by a broken power law.

The equation

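$$L_S(N_S, D_S, L_T) = L_T + \frac{1}{L_m^{c_0}} \left(1 + \left(\frac{L_T}{\hat{L}_S\, d_1}\right)^{1/f_1}\right)^{-c_1 f_1} \left(\frac{A'}{N_S^{\alpha'}} + \frac{B'}{D_S^{\gamma'}}\right)$$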

Let's unpack each term:

| Term | Role | Intuition |
| --- | --- | --- |
| $L_T$ | Teacher cross-entropy (baseline) | A perfect student can't beat the teacher. $L_T$ is the floor. |
| $1 / L_m^{c_0}$ | Student ability to mimic teacher | $L_m = L(N_S, D_S)$ is what the student would achieve via supervised learning. Lower $L_m$ means better mimicry. |
| $\left(1 + (L_T / (\hat{L}_S d_1))^{1/f_1}\right)^{-c_1 f_1}$ | Broken power law transition | Captures the capacity gap. When $L_T < \hat{L}_S d_1$ (teacher is stronger than student's ability), the capacity gap kicks in. |
| $A'/N_S^{\alpha'} + B'/D_S^{\gamma'}$ | Student resource limitations | Same form as supervised scaling: model capacity plus data limitations. |

The key insight: LT is sufficient

Notice that the teacher enters the equation only through LT. The teacher's size NT and training tokens DT do not appear — they only matter insofar as they determine LT = L(NT, DT). This is a powerful simplification: to predict distillation performance, you only need to know how good the teacher is, not how it became that good.

Recall $\hat{L}_S$: The paper defines $\hat{L}_S = L(N_S, D_S)$, the loss the student would achieve if trained supervised (no teacher). This connects distillation scaling to supervised scaling. The ratio $L_T/\hat{L}_S$ captures the relative learning abilities of teacher and student, which determines whether the capacity gap appears.

python
# Distillation scaling law (Equation 8 from the paper)
# NOTE: treat the default coefficients below as illustrative; refit them
# on your own runs before relying on the predicted numbers.
import numpy as np

def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Chinchilla-style supervised scaling law."""
    return E + A / N**alpha + B / D**beta

def distillation_loss(N_s, D_s, L_T,
                      c0=0.45, c1=0.72, d1=1.05,
                      f1=0.18, alpha_p=0.31, gamma_p=0.25,
                      A_p=350., B_p=380.):
    """Predict student cross-entropy from distillation."""
    L_hat_S = supervised_loss(N_s, D_s)      # student supervised baseline
    mimic = 1.0 / L_hat_S**c0                # student's mimicry ability
    ratio = (L_T / (L_hat_S * d1))**(1 / f1)
    broken_pl = (1 + ratio)**(-c1 * f1)      # capacity gap transition
    resource = A_p / N_s**alpha_p + B_p / D_s**gamma_p
    return L_T + mimic * broken_pl * resource

What single number about the teacher determines student performance in the distillation scaling law?
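The two limiting properties from the derivation can be checked numerically against the functional form as written in this chapter. This is a sketch: the functions repeat the listing above so the block runs on its own, and the default coefficients are illustrative rather than fitted values, so only the qualitative limits are meaningful.

```python
def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    return E + A / N**alpha + B / D**beta

def distillation_loss(N_s, D_s, L_T, c0=0.45, c1=0.72, d1=1.05, f1=0.18,
                      alpha_p=0.31, gamma_p=0.25, A_p=350.0, B_p=380.0):
    L_hat_S = supervised_loss(N_s, D_s)
    broken_pl = (1 + (L_T / (L_hat_S * d1)) ** (1 / f1)) ** (-c1 * f1)
    resource = A_p / N_s**alpha_p + B_p / D_s**gamma_p
    return L_T + broken_pl * resource / L_hat_S**c0

# Property 1: unlimited student capacity and data, so the student matches the teacher.
gap_big_student = distillation_loss(1e30, 1e30, L_T=2.2) - 2.2

# Property 2: an extremely poor teacher, so the excess over L_T shrinks toward zero.
gap_bad_teacher = distillation_loss(1e9, 5e10, L_T=1e6) - 1e6

# Reference point: a finite student with a reasonable teacher keeps a visible gap.
gap_typical = distillation_loss(1e9, 5e10, L_T=2.2) - 2.2

print(gap_big_student, gap_bad_teacher, gap_typical)
```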

Chapter 5: Fitting the Law

An equation is only useful if it fits real data. The authors conducted the largest controlled study of distillation to date: hundreds of runs spanning student sizes from 143M to 12.6B, teachers from 198M to 7.75B, and distillation budgets from a few billion to 512B tokens.

Experimental setup

| Dimension | Range | Details |
| --- | --- | --- |
| Student sizes | 143M to 12.6B | All transformers with multi-headed attention, Pre-Norm, RMSNorm, RoPE, sequence length 4096 |
| Teacher sizes | 198M to 7.75B | Six Chinchilla-optimal teachers ($M_T = D_T/N_T \approx 20$) |
| Distillation tokens | ~4B to 512B | English-only subset of C4 dataset |
| Temperature | $\tau = 1$ | Found to work best across all experiments |
| Mixing | $\lambda = 1$ | Pure distillation (no NTP loss) |
| Parameterization | μP (maximal update parameterization) | Enables hyperparameter transfer across model sizes |

Three experimental protocols

Different protocols isolate different variables:

Fixed Teacher/Fixed M Students: Fix one teacher. Train students at IsoFLOP profiles (student size and tokens vary subject to a total compute constraint). Reveals how student size and data trade off for a given teacher.

IsoFLOP Teacher/Fixed M Students: Fix student (NS, DS). Vary teacher's NT and DT subject to a compute constraint. Reveals that NT and DT matter only through LT.

Fixed M Teachers/Fixed M Students: Vary everything: 10 teachers × 5+ student sizes × 4+ token budgets. Spans the widest range of LT and LS for fitting.

Fit quality

The scaling law fits all observed data at ≤1% prediction error. Even when extrapolating from weaker teachers to stronger ones (predicting behavior of a 7.75B teacher from data on ≤4.82B teachers), the errors remain under 1%.
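One way to check a fit against such a threshold is to compute the worst-case relative error over all runs. The sketch below assumes relative error is defined as |predicted - observed| / observed, which may differ from the paper's exact metric, and uses made-up loss values purely for illustration.

```python
import numpy as np

def max_relative_error(predicted, observed):
    """Largest |predicted - observed| / observed across runs."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.max(np.abs(predicted - observed) / observed))

# Hypothetical observed and predicted student losses, not taken from the paper.
observed = [2.56, 2.22, 2.07, 1.98]
predicted = [2.57, 2.21, 2.08, 1.97]

err = max_relative_error(predicted, observed)
print(f"max relative error: {100 * err:.2f}%")
```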

Why this matters practically: You can run a small grid of cheap experiments (small students, small teachers, few tokens), fit the scaling law, then predict the outcome of expensive experiments (large students, large teachers, many tokens) without running them. This can avoid a large amount of compute that would otherwise go into runs whose outcome the law already predicts.

The fitting uses Sequential Least Squares Programming (SLSQP) with positivity constraints on all coefficients. The supervised scaling law is fitted first on the non-distilled baselines, and its coefficients are then held fixed when fitting the distillation law.

The fitting procedure in detail

Step 1: Fit the supervised scaling law (Equation 1) on all non-distilled runs. This gives you E, A, B, α, β. These are locked.

Step 2: For each distillation run, compute L̂S = L(NS, DS) using the supervised law. This is a derived quantity, not a free parameter.

Step 3: Fit Equation 8 on all distillation runs, optimizing {c0, c1, d1, f1, α', γ', A', B'} with positivity constraints. Here γ' is the data exponent, as written in the equation in Chapter 4.

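python
# Fitting the distillation scaling law (sketch)
# Assumes distillation_loss from the Chapter 4 listing is in scope and that
# `data` is a list of (N_s, D_s, L_T, L_S_observed) tuples, one per run.
import numpy as np
from scipy.optimize import minimize

def objective(params, data):
    """Sum of squared log-errors across all runs."""
    c0, c1, d1, f1, ap, gp, Ap, Bp = params
    total_err = 0.0
    for Ns, Ds, LT, LS_actual in data:
        LS_pred = distillation_loss(Ns, Ds, LT, c0, c1, d1, f1, ap, gp, Ap, Bp)
        total_err += (np.log(LS_pred) - np.log(LS_actual))**2
    return total_err

# Illustrative starting point and bounds (all coefficients positive).
# The scale constants A', B' need a much wider range than the exponents.
initial_guess = [0.5, 0.5, 1.0, 0.2, 0.3, 0.3, 300.0, 300.0]
bounds = [(0.01, 5)] * 6 + [(1.0, 1e4)] * 2
result = minimize(objective, x0=initial_guess, args=(data,),
                  method='SLSQP', bounds=bounds)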

The key validation: extrapolation

The most impressive test of any scaling law is extrapolation — predicting behavior you haven't seen. The authors fit the law on teachers up to 4.82B and students with LS > 2.3, then predict the behavior of 7.75B teachers and students with LS ≤ 2.3. The predictions are accurate to within 1% — the gray region in Figure 5b that was not used for fitting is predicted almost perfectly.

The paper uses μP (maximal update parameterization) for training. Why is this important for a scaling law study?

Chapter 6: When to Distill

This is the most practically useful chapter. Given a compute budget, should you distill from a teacher or just train the student from scratch? The answer depends on exactly three things.

Scenario 1: Best case (teacher already exists)

You already have a trained teacher. Distillation is "free" — you only pay for the student's training. In this case, distillation almost always wins, because the teacher provides a richer signal than one-hot labels at zero additional cost.

But even here, the advantage eventually vanishes. At sufficient compute, supervised learning catches up:

Core finding #1: Supervised learning always matches optimal distillation at sufficient compute budget, with the crossover point shifting to larger budgets as student size increases. Smaller models benefit more from distillation. Larger models can learn the same patterns from data alone — they have enough capacity to extract the information that the teacher would have provided.

Scenario 2: Teacher inference only

The teacher exists but you must pay for inference (running the teacher on each training example to get logits). This is the common deployment scenario — you have a big model serving users, and you want to distill it into a smaller model.

The FLOP cost becomes:

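$$\mathrm{FLOPs} \approx 3F(N_S)\, D_S + F(N_T)\left(\delta_{\mathrm{Logit}}\, D_S + \delta_{\mathrm{Pre}}\, 3 D_T\right)$$

Where the first term is student training, the second is teacher inference for generating logits on the student's $D_S$ tokens, and the third is teacher pretraining on $D_T$ tokens. The indicators $\delta_{\mathrm{Logit}}$ and $\delta_{\mathrm{Pre}}$ are each 0 or 1 and switch on whichever teacher costs are charged to the budget in a given scenario.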

The key detail: F(N) is the FLOPs per token for a model with N parameters. For a standard transformer, F(N) ≈ 2N per forward pass (each parameter is used once in a multiply-add). Training requires 3x this (forward + backward), while inference requires 1x. So teacher inference costs F(NT) per token, while student training costs 3F(NS) per token.
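A minimal sketch of this accounting, assuming F(N) ≈ 2N and reading the two δ indicators as boolean flags. The function names and the example model sizes are hypothetical.

```python
def forward_flops_per_token(n_params):
    """F(N) ~= 2N: one multiply-add per parameter per token."""
    return 2.0 * n_params

def distillation_flops(N_s, D_s, N_t, D_t, count_logits=True, count_pretraining=True):
    """Student training plus whichever teacher costs are charged to the budget."""
    student_training = 3.0 * forward_flops_per_token(N_s) * D_s
    teacher_logits = forward_flops_per_token(N_t) * D_s if count_logits else 0.0
    teacher_pretraining = 3.0 * forward_flops_per_token(N_t) * D_t if count_pretraining else 0.0
    return student_training + teacher_logits + teacher_pretraining

# Hypothetical setup: 300M student on 100B tokens, 3B teacher pretrained on 60B tokens.
N_s, D_s, N_t, D_t = 3e8, 1e11, 3e9, 6e10
for label, logits, pre in [("best case", False, False),
                           ("teacher inference", True, False),
                           ("pretraining + inference", True, True)]:
    total = distillation_flops(N_s, D_s, N_t, D_t, logits, pre)
    print(f"{label:>24}: {total:.2e} FLOPs")
```

In this made-up configuration, generating teacher logits costs more than training the student, because the teacher is ten times larger and its forward pass runs on every student token.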

For non-embedding parameters, the paper derives a more accurate expression: $F(N) \approx 2N\left(1 + c_1 N^{-1/3} + c_2 N^{-2/3}\right)$ for fixed aspect-ratio models. They recommend the scaling community adopt this refined form.

Scenario 3: Teacher pretraining included

The worst case for distillation: you must train the teacher from scratch, then distill. Now the total compute includes teacher pretraining (expensive!) plus teacher inference plus student training.

The verdict

| Scenario | $\delta_{\mathrm{Pre}}$ | $\delta_{\mathrm{Logit}}$ | When Distillation Wins |
| --- | --- | --- | --- |
| Best case (amortized teacher) | 0 | 0 | Almost always, unless student compute is very large |
| Teacher inference | 0 | 1 | When total budget exceeds a student-size-dependent threshold |
| Teacher pretraining | 0 or 1 | 1 | Only when making a family of models or using the teacher beyond distillation |
| Full cost (train + inference) | 1 | 1 | Distillation is more efficient only if total compute or tokens exceed a threshold. Otherwise, supervised learning wins. |

The surprise: When total end-to-end compute is counted (teacher training + inference + student training), supervised learning typically achieves lower cross-entropy than distillation at the same total compute. Distillation is more efficient only when the total compute exceeds a student-size-dependent threshold, or when the teacher has uses beyond a single distillation.

Practical implications

If you just want the best model at size NS: train it supervised with all your compute. Do not distill.

If you want a family of models (300M, 1B, 3B, 10B): train one large teacher, then distill into all sizes. The teacher cost is amortized across many students.

If the teacher already exists (e.g., your production server model): distill — it is essentially free improvement.

The crossover point

Figure 6 in the paper shows contour plots of the cross-entropy difference between distillation and supervised learning. Blue regions indicate distillation wins. Red regions indicate supervised wins. The boundary between them is the crossover.

For a 546M teacher, the crossover happens at relatively small student compute budgets — even moderate distillation beats supervised. For a 7.75B teacher, the blue region is much larger, meaning distillation's advantage extends to bigger students and larger budgets. But remember: this is the best-case scenario where the teacher is free.

When you include teacher training cost, the crossover shifts dramatically. For small total budgets (10¹⁹ FLOPs), supervised always wins: you don't have enough compute to train a useful teacher AND a student. The distillation advantage only appears at scale.

Your company has a 70B model serving users. You want a 1B model for on-device deployment. Should you distill from the 70B or train the 1B from scratch?

Chapter 7: Compute-Optimal Distillation

Given a total compute budget C, how should you allocate it between student training, teacher training, and teacher inference? This is the compute-optimal distillation recipe.

The optimization problem

$$N_S^*, N_T^*, D_T^* = \arg\min_{N_S, N_T, D_T} L_S(N_S, D_S, N_T, D_T) \quad \text{s.t.} \quad \mathrm{FLOPs} = C$$

The authors solve this using constrained numerical minimization (SLSQP) for each of the four distillation scenarios from Chapter 6. The results reveal striking patterns in optimal resource allocation.
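To make the shape of this optimization concrete, here is a simplified sketch that swaps SLSQP for a plain grid search over teacher sizes. It makes several assumptions that are not the paper's method: the student size is fixed, the teacher is trained at M = 20, costs follow 6ND for training and 2N per token for teacher logits, and the scaling-law coefficients are the illustrative defaults used elsewhere in this article. The resulting numbers are therefore not meaningful predictions.

```python
def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    return E + A / N**alpha + B / D**beta

def distillation_loss(N_s, D_s, L_T, c0=0.45, c1=0.72, d1=1.05, f1=0.18,
                      alpha_p=0.31, gamma_p=0.25, A_p=350.0, B_p=380.0):
    L_hat_S = supervised_loss(N_s, D_s)
    broken_pl = (1 + (L_T / (L_hat_S * d1)) ** (1 / f1)) ** (-c1 * f1)
    return L_T + broken_pl * (A_p / N_s**alpha_p + B_p / D_s**gamma_p) / L_hat_S**c0

def best_teacher(N_s, total_flops, teacher_sizes, tokens_per_param=20.0):
    """Pick the teacher size whose full-cost allocation gives the lowest predicted L_S."""
    best = None
    for N_t in teacher_sizes:
        D_t = tokens_per_param * N_t
        pretrain = 6.0 * N_t * D_t             # teacher pretraining
        per_token = 6.0 * N_s + 2.0 * N_t      # student training + teacher logits
        D_s = (total_flops - pretrain) / per_token
        if D_s <= 0:
            continue                           # teacher alone exceeds the budget
        L_t = supervised_loss(N_t, D_t)
        L_s = distillation_loss(N_s, D_s, L_t)
        if best is None or L_s < best["L_s"]:
            best = {"N_t": N_t, "D_t": D_t, "D_s": D_s, "L_t": L_t, "L_s": L_s}
    return best

plan = best_teacher(N_s=1e9, total_flops=1e22, teacher_sizes=[3e8, 1e9, 3e9, 1e10])
print(plan)
```

Every candidate spends the whole budget by construction, so the search only trades teacher quality against the number of student tokens that remain affordable.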

Optimal allocation trends

| Student Size | Compute Budget | Optimal Allocation |
| --- | --- | --- |
| Small (≤3B) | Small (≤10²¹) | Mostly teacher pretraining. The student needs a good teacher more than lots of data. |
| Small (≤3B) | Large (≥10²³) | Evenly divided between student training, teacher inference, and (less) teacher pretraining. |
| Large (≥10B) | Small (≤10²¹) | Mostly standard student training: not enough budget to also train a useful teacher. |
| Large (≥10B) | Large (≥10²³) | Evenly divided between student, teacher inference, and teacher pretraining. |

How optimal quantities scale with compute

As total compute increases, all four quantities (NS*, DS*, NT*, DT*) grow as power laws. Student and teacher tokens scale faster than student and teacher sizes. Optimal teacher size increases until it is slightly larger than the student, then plateaus. This makes intuitive sense: once the teacher is good enough relative to the student, making it even better hits the capacity gap.

The teacher size plateau: The optimal teacher cross-entropy $L_T^*$ (the red line in Figure 7) decreases as a power law with student size $N_S$ until $L_S$ matches $L_T^*$. At that point, the student outperforms the optimal teacher, and the inflection point causes the teacher loss to decline faster. This is consistent with the observation of Zhang et al. (2023) that "optimal teacher scale almost consistently follows a linear scaling with the student scale across different architectures and data scales."

A worked example

Suppose you want a 1B student and have 10²² total FLOPs. Using the scaling law and constrained optimization:

| Quantity | Optimal Value | Reasoning |
| --- | --- | --- |
| Teacher $N_T$ | ~3B | 3x student size: enough to teach but not so large that the capacity gap hurts |
| Teacher $D_T$ | ~60B | Chinchilla-optimal for the teacher (M ≈ 20) |
| Teacher $L_T$ | ~2.10 | Predicted from supervised scaling law for 3B/60B |
| Student $D_S$ | ~200B | Remaining budget after teacher training and inference |
| Distilled $L_S$ | ~2.18 | From distillation scaling law |
| Supervised $L_S$ | ~2.22 | If all 10²² FLOPs went to supervised training of 1B |
| Improvement | 0.04 nats | Distillation wins, but only because we allocated compute optimally |

Notice the improvement is modest (0.04 nats). This is the regime where distillation barely edges out supervised learning. At 10²⁴ FLOPs the gap would widen. At 10²⁰ FLOPs, supervised would win.

The infinite data limit

What happens as the distillation budget grows without bound? In the infinite data regime, distillation converges to the same loss as supervised learning on infinite data. The advantage of distillation is in sample efficiency: for a finite token budget, distillation extracts more information per token than one-hot labels. But at infinite tokens, both methods converge.

Why distillation is more sample-efficient: Each token in distillation carries a V-dimensional target (the full vocabulary distribution), while each token in supervised learning carries at most log2(V) bits (the identity of the correct token). For V = 32,000, that is a ~15-bit label versus a 32,000-dimensional distribution per token. The teacher's soft distribution is a far richer training signal.
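The arithmetic for V = 32,000 can be checked in a few lines. The variable names are illustrative, and the count of values in a soft target is only a crude proxy for how much usable information it carries.

```python
import math

vocab_size = 32_000

# Upper bound on the information in one hard label: log2(V) bits.
bits_per_hard_label = math.log2(vocab_size)

# A soft target holds one probability per vocabulary entry.
values_per_soft_target = vocab_size

print(f"hard label: at most {bits_per_hard_label:.2f} bits")
print(f"soft target: {values_per_soft_target} probabilities")
```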
You have a fixed compute budget and want a 300M student. The optimal recipe says: 40% teacher training, 20% teacher inference, 40% student training. If you doubled the budget, how would the allocation shift?

Chapter 8: Scaling Law Explorer

Now let's put it all together. This interactive simulation lets you explore the distillation scaling law and compare it against supervised baselines.

[Interactive figure: Distillation vs Supervised Scaling. Controls for student size log10(NS) (default 1B), distillation tokens log10(DS) (default 50B), and teacher cross-entropy LT (default 2.20). Compares distillation loss against the supervised baseline; the gray dashed line shows the supervised loss for the same student size and tokens.]

[Interactive figure: Compute Budget Allocator. Controls for teacher share (default 30%) and log10(total FLOPs) (default 10²¹). Shows the predicted student loss for each split of a total FLOP budget between teacher and student training.]

Using the explorer above: set student size to 143M (log=8.15), distillation tokens to 100B (log=11), and teacher LT=2.0. Now change LT to 1.85. Does the student improve?

Chapter 9: Connections

This paper sits at the intersection of two major research threads: neural scaling laws and knowledge distillation. Understanding where it fits helps you see the bigger picture.

Scaling law lineage

Hestness et al. (2017)
First systematic study: loss follows power laws in model size and data across domains.
Kaplan et al. (2020) — "Scaling Laws for Neural LMs"
Power-law loss in model size N and data D, commonly written today as L(N, D) = E + A/N^α + B/D^β. Established the modern framing. Recommended scaling N faster than D (later revised by Chinchilla).
Hoffmann et al. (2022) — "Chinchilla"
Fixed Kaplan: N and D should scale equally. Token-to-parameter ratio M ≈ 20. Revolutionized how models are trained.
Sardana et al. (2024) — "Beyond Chinchilla"
Overtraining paradigm. When inference cost matters, train small models at M >> 20 tokens per parameter. Shifts the optimal frontier.
Busbridge et al. (2025) — This Paper
Extends scaling to distillation. Teacher quality LT is a new variable. Capacity gap is parameterized. Compute-optimal recipes for all scenarios.
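The supervised law in this lineage, L(N, D) = E + A/N^α + B/D^β, can be evaluated numerically. The sketch below assumes the coefficient values reported by Hoffmann et al. (2022) for their parametric fit, plus the rules of thumb C ≈ 6·N·D and D ≈ 20·N. None of these numbers are the distillation coefficients from this paper, and the function names are hypothetical.

```python
import math

# Parametric fit reported by Hoffmann et al. (2022); treated here as
# illustrative assumptions, not as coefficients from the distillation paper.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def supervised_loss(n_params, n_tokens):
    """L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def chinchilla_allocation(flops, tokens_per_param=20.0):
    """Solve C = 6*N*D with D = M*N for N and D."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

n, d = chinchilla_allocation(1e21)
print(f"N = {n:.3e} params, D = {d:.3e} tokens, L = {supervised_loss(n, d):.3f}")
```

Raising `tokens_per_param` well above 20 in `chinchilla_allocation` gives the overtrained regime described in the Sardana et al. entry: a smaller model trained on more tokens for the same training FLOPs.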

Distillation lineage

Paper | Contribution | Relation to This Work
Hinton et al. (2015) | Introduced knowledge distillation with temperature-scaled softmax | Foundation. This paper studies how it scales.
Beyer et al. (2022) | "A good teacher is patient and consistent": function matching view | Observed the capacity gap but couldn't predict it. This paper parameterizes it.
Zhang et al. (2023) | "Towards the law of capacity gap" | Named the phenomenon. This paper gives the first scaling law that captures it.
Stanton et al. (2021) | "Does knowledge distillation really work?" | Used λ=1 (pure distillation) to isolate effects. This paper follows the same protocol.
Burns et al. (2024) | Weak-to-strong generalization | A student outperforming its teacher is consistent with this paper's scaling law: when LS < LT, the student surpasses the teacher.
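
The objective referenced in the Hinton et al. and Stanton et al. entries can be sketched as a mixture of hard-label cross-entropy and a temperature-scaled KL term, with λ = 1 giving pure distillation. This is a minimal standard-library sketch of a Hinton-style loss. The function names, toy logits, and the τ² scaling convention are assumptions for illustration rather than the paper's exact setup.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two probability vectors with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, label, lam=1.0, temperature=1.0):
    """(1 - lam) * CE(label, student) + lam * tau^2 * KL(teacher_tau || student_tau)."""
    hard_ce = -math.log(softmax(student_logits)[label])
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    soft_kl = kl_divergence(teacher_probs, student_probs)
    return (1.0 - lam) * hard_ce + lam * temperature**2 * soft_kl

# Hypothetical logits over a 4-token vocabulary.
teacher_logits = [2.0, 1.5, -1.0, -3.0]
student_logits = [1.0, 0.5, 0.0, -0.5]

print("pure distillation (lam=1):", distillation_loss(student_logits, teacher_logits, label=0))
print("pure supervised   (lam=0):", distillation_loss(student_logits, teacher_logits, label=0, lam=0.0))
print("matched student:          ", distillation_loss(teacher_logits, teacher_logits, label=0))
```

With `lam=1.0` the hard label is ignored entirely, so the loss reaches zero exactly when the student reproduces the teacher's distribution.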

Open questions

Does the law hold for non-English data? All experiments use C4 English. Different languages have different entropy structures.

What about downstream task performance? The law predicts cross-entropy, not benchmark accuracy. The relationship between pretraining loss and downstream performance is itself an open scaling law question.

Does the capacity gap exist for non-autoregressive distillation? BERT-style models, diffusion models, and vision models all use distillation. Whether the same capacity gap and scaling behavior holds is unknown.

Can you distill reasoning? DeepSeek-R1's distilled models show that reasoning capabilities transfer through distillation, but the scaling behavior of reasoning transfer has not been studied.

What about mixture-of-experts (MoE)? The paper uses dense transformers. MoE models have a different FLOPs-to-parameters relationship (many parameters but fewer active per token). Whether the same scaling law coefficients hold for MoE teachers or students is unknown. DeepSeek-V3 (671B total, ~37B active) is a natural test case.

Distillation for test-time compute. With the rise of chain-of-thought and search-based inference, the relationship between pretraining loss and downstream capability may shift. A model distilled for low cross-entropy may not be optimal for test-time search — the scaling laws for these interactions remain uncharted.

The bigger picture

This paper completes a trilogy of scaling law results that together give practitioners a complete toolkit for resource allocation:

Decision | Scaling Law | Key Paper
How big should my model be? | Supervised scaling: L(N, D) | Hoffmann et al. (2022)
How much should I overtrain? | Overtraining scaling: lifecycle compute | Sardana et al. (2024)
Should I distill, and from what teacher? | Distillation scaling: LS(NS, DS, LT) | Busbridge et al. (2025)

With all three laws in hand, a practitioner can take a total compute budget and a target inference cost, and determine the optimal strategy: train a single compute-optimal model, overtrain a small model, or distill from a large teacher into a small student. The choice depends entirely on the budget, the target size, and whether a teacher already exists.

Scaling Law Timeline

Walk through the evolution of scaling laws, from early empirical observations to this paper's distillation scaling law.

What is the single most important practical takeaway from this paper?