A large teacher can produce a small, powerful student — but when is distillation actually worth the compute? This paper derives the first scaling law that predicts student cross-entropy from teacher quality, student size, and data budget, revealing a capacity gap where stronger teachers make worse students.
You have trained a 7.75B-parameter language model on 512 billion tokens. It is good. But every user query costs money — the model runs on expensive GPUs, 24/7, burning electricity. Inference cost at scale is the dominant expense in the lifecycle of a language model. It dwarfs pretraining cost when measured over the model's entire deployment.
There is a simple solution: use a smaller model. A 143M-parameter model is ~50x cheaper to serve. But if you train that small model from scratch on the same data, it will be much worse — the supervised scaling law (Chinchilla) tells us the loss is limited by model capacity.
What if the big model could teach the small model? That is knowledge distillation — the big model (the teacher) produces soft probability distributions over the vocabulary for every token, and the small model (the student) learns to match those distributions instead of just matching the hard one-hot labels from the training data.
This is not an academic exercise. Companies like Apple (the authors' affiliation) deploy models on billions of devices. The difference between a 300M model and a 1B model determines whether your model runs on-device or requires a server roundtrip. Distillation is the standard tool for producing these small models — but until this paper, nobody could predict whether it would actually help.
Consider the numbers: OpenAI and Pilipiszyn (2021) estimated that inference accounts for billions of tokens per day. Over a model's lifetime, inference cost is typically much larger than pretraining cost. And with test-time compute scaling (chain-of-thought, tree search, repeated sampling), inference cost is growing even faster. Every parameter you can shave off the deployment model saves real money.
Modern deployment has shifted away from compute-optimal training (Chinchilla). Instead, practitioners overtrain — they train a small model on far more tokens than Chinchilla prescribes. A Chinchilla-optimal 300M model trains on ~6B tokens (20x parameters). An overtrained 300M model trains on 100B+ tokens. The model is worse than a compute-optimal 1B model, but it is much cheaper to serve.
Distillation offers a potentially better path: instead of overtraining a small model from scratch, distill from a large teacher. But this introduces new costs — you have to train the teacher first, and possibly run teacher inference to generate logits. When is this extra cost justified?
[Interactive figure: compute allocation for three ways of producing a deployment model: compute-optimal training, overtraining, and distillation.]
Before we can understand distillation scaling, we need to understand the baseline: how does a model trained without a teacher scale? This is the supervised scaling law, established by Kaplan et al. (2020) and refined by Hoffmann et al. (2022, "Chinchilla").
When you train a transformer language model of size N parameters on D tokens, the resulting validation cross-entropy loss follows a remarkably clean power law:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Where:
| Symbol | Meaning | Typical Value |
|---|---|---|
| E | Irreducible entropy — the best any model could do on this data (natural language has inherent randomness) | ~1.7 nats |
| $A / N^{\alpha}$ | Model capacity term: bigger model, lower loss. Diminishing returns. | α ≈ 0.34 |
| $B / D^{\beta}$ | Data term: more tokens, lower loss. Also diminishing returns. | β ≈ 0.28 |
This equation captures a deep truth: loss is jointly limited by model capacity and data quantity. Doubling the model size gives diminishing improvements (power law, not linear). Doubling the data gives diminishing improvements too. The irreducible entropy E is the floor — you can't beat it no matter how large your model or dataset.
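To make the diminishing returns concrete, the sketch below evaluates the additive law while repeatedly doubling the model size at a fixed data budget. The coefficient values are the illustrative defaults that appear in this article's later code listing, and the 100B-token budget is a hypothetical choice; treat both as assumptions rather than fitted values.

```python
def supervised_loss(n_params, n_tokens, E=1.71, A=406.4, alpha=0.34,
                    B=410.7, beta=0.28):
    """Additive power law: irreducible entropy + capacity term + data term."""
    return E + A / n_params**alpha + B / n_tokens**beta


n_tokens = 100e9                  # data budget held fixed (hypothetical)
sizes = [1e8, 2e8, 4e8, 8e8]      # each step doubles the model size
losses = [supervised_loss(n, n_tokens) for n in sizes]

# Loss reduction bought by each successive doubling of N.
gains = [before - after for before, after in zip(losses, losses[1:])]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}  predicted loss = {loss:.4f}")
for first, second in zip(gains, gains[1:]):
    # With the data term fixed, each doubling buys 2**-alpha of the previous gain.
    print(f"gain ratio between doublings = {second / first:.4f}")
```

Because only the capacity term changes, every doubling yields a fixed fraction (2 to the power of minus α) of the previous improvement, and the loss never drops below E.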
The additive form (separate terms for N and D) is an approximation. In reality, model size and data interact — a larger model extracts more from the same data. But the additive form fits empirical data remarkably well and is analytically tractable, which is why it has become the standard.
The paper re-fits this supervised baseline on their own experimental data (C4 English, transformer with RoPE and μP) and finds coefficients consistent with the literature. This is important: the distillation scaling law is built on top of the supervised one, so a bad supervised fit would corrupt everything downstream.
Training costs approximately 6ND FLOPs for a standard transformer. Given a fixed compute budget C = 6ND, what is the best split between model size N and tokens D?
Hoffmann et al. found that compute-optimal models have a constant token-to-parameter ratio: M = D/N ≈ 20. This is the "Chinchilla rule" — a 1B model should train on ~20B tokens.
A compute-optimal 10B model trained on 200B tokens has great loss. But at inference time, every token generated costs proportional to N. If you could achieve the same loss with a 1B model trained on 2T tokens, the inference cost drops by 10x — even though the total training compute is the same.
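The arithmetic behind this comparison can be checked directly. The sketch below assumes the C = 6ND approximation and the tokens-per-parameter ratio of 20 quoted above; the function names are illustrative.

```python
import math


def train_flops(n_params, n_tokens):
    """Approximate training cost of a dense transformer: C = 6 * N * D."""
    return 6 * n_params * n_tokens


def chinchilla_split(compute, tokens_per_param=20):
    """Solve C = 6 * N * D under the constraint D = M * N."""
    n_params = math.sqrt(compute / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params


big = train_flops(10e9, 200e9)    # compute-optimal 10B model on 200B tokens
small = train_flops(1e9, 2e12)    # overtrained 1B model on 2T tokens
print(f"10B/200B: {big:.2e} FLOPs, 1B/2T: {small:.2e} FLOPs")

# Per-token inference cost scales with N, so the small model is 10x cheaper.
print(f"inference cost ratio: {10e9 / 1e9:.0f}x")

n_opt, d_opt = chinchilla_split(big)
print(f"compute-optimal split of that budget: N = {n_opt:.2e}, D = {d_opt:.2e}")
```

Both configurations cost the same to train; whether the overtrained 1B model actually reaches the same loss is a separate question that the scaling law has to answer.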
This is why the field moved to overtraining: train small models on vastly more tokens than Chinchilla prescribes. Distillation potentially offers even better small models than overtraining alone.
[Interactive figure: loss as a function of model size and data budget, following the power law $L = E + A/N^{\alpha} + B/D^{\beta}$.]
How does a teacher actually teach a student? The mechanics are simple but the implications are deep.
In standard pretraining, the model sees a context $x^{(<i)}$ and predicts the next token $x^{(i)}$. The target is a one-hot vector: all probability mass on the correct token, zero everywhere else. The loss is the negative log probability of the correct token:

$$\mathcal{L}_{NTP} = -\,e(x^{(i)})^{\top} \log \sigma(z)$$

Where z is the logit vector (raw model output before softmax), σ is softmax, and $e(x^{(i)})$ is the one-hot basis vector for the true token. This is equivalent to $-\log p(x^{(i)} \mid x^{(<i)}; \theta)$.
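As a minimal sketch of the next-token loss just described, the snippet below computes it two ways for a single position: by indexing the log-softmax at the true token, and as a dot product with the one-hot vector. The four-word vocabulary and the logit values are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def ntp_loss(logits, target_index):
    """Cross-entropy against a one-hot target: -log p(target)."""
    return -log_softmax(logits)[target_index]


vocab = ["mat", "floor", "rug", "quantum"]   # toy vocabulary (hypothetical)
logits = [2.0, 1.5, 1.0, -4.0]               # toy model output (hypothetical)

loss = ntp_loss(logits, vocab.index("mat"))

# Same quantity written as a dot product with the one-hot basis vector.
one_hot = [1.0 if token == "mat" else 0.0 for token in vocab]
loss_dot = -sum(e * lp for e, lp in zip(one_hot, log_softmax(logits)))

print(f"indexed: {loss:.6f}  one-hot dot product: {loss_dot:.6f}")
```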
In distillation, the teacher produces a full probability distribution over all V vocabulary tokens. This is much richer than a one-hot label — it tells the student not just which token is correct, but the relative probabilities of all incorrect tokens.
For example, given the context "The cat sat on the ___", the one-hot target just says "mat". But the teacher's distribution says: "mat" 40%, "floor" 25%, "rug" 15%, "couch" 8%, "bed" 5%, ... This teaches the student that "floor" and "rug" are reasonable alternatives, while "quantum" is not.
This is sometimes called dark knowledge (Hinton et al., 2015) — the information contained in the probabilities of incorrect classes. A teacher that assigns 25% to "floor" and 0.001% to "quantum" encodes real knowledge about semantic similarity. The student learns not just the answer, but the structure of the answer space.
The distillation loss uses KL divergence between teacher and student distributions, implemented via cross-entropy of the student against the teacher's soft targets:

$$\mathcal{L}_{KD} = -\tau^{2} \sum_{a=1}^{V} \sigma_a\!\left(\frac{z_T}{\tau}\right) \log \sigma_a\!\left(\frac{z_S}{\tau}\right)$$

Where $z_T$ and $z_S$ are the teacher and student logits and τ is the distillation temperature. When τ = 1, the student matches the teacher's exact distribution. When τ > 1, the distributions become softer (more uniform), revealing more information about the teacher's ranking of unlikely tokens. The $\tau^{2}$ factor compensates for the way gradients shrink as the temperature grows, keeping gradient magnitudes comparable across temperatures.
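The softening effect of the temperature is easy to see numerically. The sketch below applies a temperature-scaled softmax to one hypothetical set of teacher logits and reports the entropy and the probability of the least likely token at several temperatures.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau > 1 flattens the distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


teacher_logits = [4.0, 3.5, 3.0, 2.4, 1.9, -6.0]   # hypothetical values

entropies = {}
for tau in (0.5, 1.0, 2.0, 4.0):
    probs = softmax(teacher_logits, tau)
    entropies[tau] = entropy(probs)
    print(f"tau = {tau}: entropy = {entropies[tau]:.3f} nats, "
          f"least likely token p = {probs[-1]:.2e}")
```

Raising τ increases the entropy and lifts the probability of unlikely tokens, while the ranking of tokens stays the same.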
You might wonder: why not just minimize the mean squared error between teacher and student logits? The answer is that KL divergence respects the geometry of probability distributions. Two distributions that are "close" in KL are close in a statistically meaningful sense: they make similar predictions. MSE on logits does not have this property. Softmax is invariant to adding a constant to every logit, so two logit vectors that differ by a constant produce identical probability distributions (KL of zero) while their MSE can be arbitrarily large.
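The effect of a constant logit shift can be checked numerically. In the sketch below, the student logits equal the teacher logits plus 5 (all values are hypothetical): the softmax outputs coincide, so the KL divergence is zero, while the logit MSE equals the squared shift.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [2.0, 1.0, 0.0, -1.0]
student_logits = [z + 5.0 for z in teacher_logits]   # same logits, shifted by 5

logit_mse = sum((s - t) ** 2
                for s, t in zip(student_logits, teacher_logits)) / len(teacher_logits)
kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))

print(f"logit MSE = {logit_mse:.1f}, KL = {kl:.2e}")
```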
The total training loss for the student combines the standard NTP loss with the distillation loss:

$$\mathcal{L}_S = (1 - \lambda)\,\mathcal{L}_{NTP} + \lambda\,\mathcal{L}_{KD} + \lambda_Z\,\mathcal{L}_Z$$

Where λ controls the balance between imitating the data and imitating the teacher, and $\mathcal{L}_Z$ is a token-level Z-loss for training stability, weighted by $\lambda_Z$. The paper uses λ = 1 (pure distillation): the student only sees the teacher's soft targets, never the one-hot labels from the data. This is the cleanest setup for studying distillation scaling.
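The role of λ can be sketched for a single token position. The snippet below assumes the common convex combination (1 − λ)·NTP + λ·KD and leaves out the Z-loss term for brevity; the function names and logit values are illustrative, not taken from the paper's code.

```python
import math


def log_softmax(logits):
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def softmax(logits, tau=1.0):
    return [math.exp(lp) for lp in log_softmax([z / tau for z in logits])]


def student_loss(student_logits, teacher_logits, target_index, lam=1.0, tau=1.0):
    """(1 - lam) * NTP + lam * KD for one position (Z-loss omitted)."""
    ntp = -log_softmax(student_logits)[target_index]
    teacher_probs = softmax(teacher_logits, tau)
    student_log_probs = log_softmax([z / tau for z in student_logits])
    kd = -tau**2 * sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))
    return (1.0 - lam) * ntp + lam * kd


student_logits = [1.0, 0.2, -0.5]   # hypothetical
teacher_logits = [2.0, 0.1, -1.0]   # hypothetical

pure_ntp = student_loss(student_logits, teacher_logits, 0, lam=0.0)
pure_kd = student_loss(student_logits, teacher_logits, 0, lam=1.0)
mixed = student_loss(student_logits, teacher_logits, 0, lam=0.5)
print(f"lam=0: {pure_ntp:.4f}  lam=1: {pure_kd:.4f}  lam=0.5: {mixed:.4f}")
```

With `lam=1.0` the target index has no effect on the result, which is what "pure distillation" means in practice.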
```python
# Distillation forward pass (simplified)
import torch
import torch.nn.functional as F


def distillation_step(student, teacher, tokens, tau=1.0):
    # Teacher generates soft targets (no gradient)
    with torch.no_grad():
        teacher_logits = teacher(tokens)                          # [B, T, V]
        teacher_probs = F.softmax(teacher_logits / tau, dim=-1)   # [B, T, V]

    # Student forward pass
    student_logits = student(tokens)                              # [B, T, V]
    student_log_probs = F.log_softmax(student_logits / tau, dim=-1)

    # KL(teacher || student) = cross-entropy(teacher, student) - H(teacher).
    # H(teacher) does not depend on the student, so minimizing the
    # cross-entropy is equivalent.
    loss = -tau**2 * (teacher_probs * student_log_probs).sum(-1).mean()
    return loss  # Backprop through student only
```
Here is the most surprising finding: making the teacher better can make the student worse. This is the capacity gap, and it is the key phenomenon that the distillation scaling law must capture.
Take a fixed student (say 143M parameters, fixed distillation budget of 40B tokens). Now vary the teacher from small (198M) to huge (7.75B). Plot the student's final cross-entropy against teacher size. You might expect a monotonic curve: bigger teacher, better student.
Instead, you see something strange. As the teacher improves from 198M to ~975M, the student gets better. But as the teacher improves further, from ~975M to 7.75B, the student gets worse. There is an optimal teacher size, and going beyond it hurts.
The capacity gap arises from a mismatch between the teacher's distribution complexity and the student's modeling capacity. The KLD between teacher and student can be decomposed:

$$D_{KL}(p_T \,\|\, p_S) = H(p_T, p_S) - H(p_T)$$

As the teacher gets stronger, $H(p_T)$ (the teacher's entropy) decreases: the teacher becomes more confident. The cross-entropy $H(p_T, p_S)$ also changes, but the student has limited capacity to model the teacher's sharp distribution. When the teacher is too strong relative to the student, the KL divergence actually increases because the student cannot allocate its limited parameters to match the teacher's sharp peaks.
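The decomposition of the KL divergence into a cross-entropy minus the teacher's entropy can be verified on a small example. The teacher distribution below reuses the "cat sat on the" probabilities from earlier, with the unlisted remainder lumped into one bucket; the student distribution is hypothetical.

```python
import math


def entropy(p):
    """H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) in nats: expected code length of q under p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q), computed directly from its definition."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# mat, floor, rug, couch, bed, everything else
p_teacher = [0.40, 0.25, 0.15, 0.08, 0.05, 0.07]
p_student = [0.30, 0.20, 0.20, 0.10, 0.10, 0.10]   # hypothetical student

kl = kl_divergence(p_teacher, p_student)
ce = cross_entropy(p_teacher, p_student)
h = entropy(p_teacher)
print(f"KL = {kl:.6f}, H(pT, pS) - H(pT) = {ce - h:.6f}")
```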
| Student NS | Optimal Teacher Size | Student Loss at Optimal | Student Loss at Largest Teacher (7.75B) |
|---|---|---|---|
| 143M | ~975M | 2.56 | 2.59 |
| 546M | ~1.82B | 2.22 | 2.24 |
| 1.82B | ~4.82B | 2.07 | 2.08 |
| 4.82B | ~7.75B | 1.98 | 1.98 |
The effect is more pronounced for small students. A 143M student suffers a ~0.03 nat penalty from a too-strong teacher. For larger students (4.82B), the gap nearly vanishes because the student has enough capacity to model the teacher's distribution.
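The penalties quoted above follow directly from the table. As a small sketch, the snippet below recomputes the loss difference between the largest teacher and the optimal teacher for each student size, using only the values listed in the table.

```python
# (student size, loss with optimal teacher, loss with the 7.75B teacher),
# copied from the table above.
rows = [
    ("143M", 2.56, 2.59),
    ("546M", 2.22, 2.24),
    ("1.82B", 2.07, 2.08),
    ("4.82B", 1.98, 1.98),
]

# Penalty (in nats) for distilling from the largest teacher instead of the best one.
penalties = {size: round(largest - best, 2) for size, best, largest in rows}

for size, penalty in penalties.items():
    print(f"student {size}: penalty = {penalty:.2f} nats")
```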
The paper makes a subtle but crucial point: the capacity gap is about the ratio of algorithmic learning capacities, not just parameter counts. Two models of the same size trained on different amounts of data have different learning capacities. The ratio LT/L̂S captures this precisely — when LT is much smaller than L̂S (teacher far better than student's potential), the gap appears. When they are close, it does not.
The authors demonstrate this with controlled experiments on kernel regression and synthetic MLP tasks (Appendices C.1 and C.2), providing the first clean synthetic demonstrations of the capacity gap phenomenon.
[Interactive figure: student loss as a function of teacher quality. The curve is U-shaped: there is an optimal teacher quality beyond which the student gets worse.]
Now we arrive at the paper's core contribution: a single equation that predicts student cross-entropy from three inputs — student size, data budget, and teacher quality.
The authors reason about what properties the law must have, then find the simplest equation that satisfies all of them.
Property 1: Infinite student, infinite data. If the student has unlimited capacity and unlimited distillation data, it should perfectly mimic the teacher. So: $L_S \to L_T$ as $N_S, D_S \to \infty$.
Property 2: Random teacher. A random teacher (infinite $L_T$) provides no useful signal. The student should converge to a loss that is independent of its own capacity: $L_S \to L_T$ as $L_T \to \infty$.
Property 3: Capacity gap. There must be a regime where improving LT (making the teacher better) increases LS (makes the student worse). This is the U-shaped behavior from Chapter 3.
Property 4: Broken power law in LT. The influence of teacher quality on student loss transitions between two regimes — when the student is stronger than the teacher vs. weaker than the teacher — connected by a broken power law.
Let's unpack each term:
| Term | Role | Intuition |
|---|---|---|
| LT | Teacher cross-entropy (baseline) | A perfect student can't beat the teacher. LT is the floor. |
| $1/L_m^{c_0}$ | Student ability to mimic teacher | Where $L_m = L(N_S, D_S)$ is what the student would achieve via supervised learning. Lower $L_m$ → better mimicry. |
| $(1 + (L_T/\ldots)^{1/f})^{-cf}$ | Broken power law transition | Captures the capacity gap. When $L_T < \hat{L}_S d_1$ (teacher is stronger than student's ability), the capacity gap kicks in. |
| $A/N_S^{\alpha'} + B/D_S^{\gamma'}$ | Student resource limitations | Same form as supervised scaling: model capacity + data limitations. |
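The broken power law term is easier to read once its two regimes are seen numerically. The sketch below evaluates the factor $(1 + x^{1/f})^{-cf}$ as a function of the ratio x between the teacher loss and the scaled student loss. The values c = 0.72 and f = 0.18 are the illustrative defaults from this article's code listing, not confirmed fitted coefficients.

```python
def broken_power_law(x, c=0.72, f=0.18):
    """(1 + x**(1/f))**(-c*f): flat for x << 1, power law x**(-c) for x >> 1."""
    return (1.0 + x ** (1.0 / f)) ** (-c * f)


for x in (0.25, 0.5, 1.0, 2.0, 10.0):
    factor = broken_power_law(x)
    asymptote = x ** -0.72
    print(f"x = {x:>5}: factor = {factor:.4f}, x**-c = {asymptote:.4f}")
```

Below x = 1 the factor is close to 1 and barely depends on x. Above x = 1 it follows the power law x to the power of minus c. A smaller f makes the switch between the two regimes sharper.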
Notice that the teacher enters the equation only through LT. The teacher's size NT and training tokens DT do not appear — they only matter insofar as they determine LT = L(NT, DT). This is a powerful simplification: to predict distillation performance, you only need to know how good the teacher is, not how it became that good.
```python
# Distillation scaling law (Equation 8 from the paper)
import numpy as np


def supervised_loss(N, D, E=1.71, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Chinchilla-style supervised scaling law."""
    return E + A / N**alpha + B / D**beta


def distillation_loss(N_s, D_s, L_T, c0=0.45, c1=0.72, d1=1.05, f1=0.18,
                      alpha_p=0.31, gamma_p=0.25, A_p=350., B_p=380.):
    """Predict student cross-entropy from distillation."""
    L_hat_S = supervised_loss(N_s, D_s)        # student supervised baseline
    mimic = 1.0 / L_hat_S**c0                  # student's mimicry ability
    ratio = (L_T / (L_hat_S * d1))**(1/f1)
    broken_pl = (1 + ratio)**(-c1 * f1)        # capacity gap transition
    resource = A_p / N_s**alpha_p + B_p / D_s**gamma_p
    return L_T + mimic * broken_pl * resource
```
An equation is only useful if it fits real data. The authors conducted the largest controlled study of distillation to date: hundreds of runs spanning student sizes from 143M to 12.6B, teachers from 198M to 7.75B, and distillation budgets from a few billion to 512B tokens.
| Dimension | Range | Details |
|---|---|---|
| Student sizes | 143M — 12.6B | All transformers with multi-headed attention, Pre-Norm, RMSNorm, RoPE, sequence length 4096 |
| Teacher sizes | 198M — 7.75B | Six Chinchilla-optimal teachers (MT = DT/NT ≈ 20) |
| Distillation tokens | ~4B — 512B | English-only subset of C4 dataset |
| Temperature | τ = 1 | Found to work best across all experiments |
| Mixing | λ = 1 | Pure distillation (no NTP loss) |
| Parameterization | μP (maximal update parameterization) | Enables hyperparameter transfer across model sizes |
Different protocols isolate different variables:
The scaling law fits all observed data at ≤1% prediction error. Even when extrapolating from weaker teachers to stronger ones (predicting behavior of a 7.75B teacher from data on ≤4.82B teachers), the errors remain under 1%.
The fitting uses Sequential Least Squares Programming (SLSQP) with positivity constraints on all coefficients. The supervised scaling law is fitted first on the non-distilled baselines, and its coefficients are then held fixed when fitting the distillation law.
Step 1: Fit the supervised scaling law (Equation 1) on all non-distilled runs. This gives you E, A, B, α, β. These are locked.
Step 2: For each distillation run, compute L̂S = L(NS, DS) using the supervised law. This is a derived quantity, not a free parameter.
Step 3: Fit Equation 8 on all distillation runs, optimizing {c0, c1, d1, f1, α', β', A', B'} with positivity constraints.
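```python
# Fitting the distillation scaling law (sketch)
# Uses distillation_loss from the previous listing.
import numpy as np
from scipy.optimize import minimize


def objective(params, data):
    """Sum of squared log-errors across all runs."""
    c0, c1, d1, f1, ap, gp, Ap, Bp = params
    total_err = 0.0
    for Ns, Ds, LT, LS_actual in data:
        LS_pred = distillation_loss(Ns, Ds, LT, c0, c1, d1, f1, ap, gp, Ap, Bp)
        total_err += (np.log(LS_pred) - np.log(LS_actual))**2
    return total_err


def fit_distillation_law(data, initial_guess):
    """data: rows of (N_s, D_s, L_T, L_S); initial_guess: 8 starting values."""
    # Positivity only: an upper bound of 5 on every coefficient would clip
    # A' and B', whose values in the listing above are in the hundreds.
    bounds = [(1e-6, None)] * 8
    return minimize(objective, x0=initial_guess, args=(data,),
                    method='SLSQP', bounds=bounds)
```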
The most impressive test of any scaling law is extrapolation — predicting behavior you haven't seen. The authors fit the law on teachers up to 4.82B and students with LS > 2.3, then predict the behavior of 7.75B teachers and students with LS ≤ 2.3. The predictions are accurate to within 1% — the gray region in Figure 5b that was not used for fitting is predicted almost perfectly.
This is the most practically useful chapter. Given a compute budget, should you distill from a teacher or just train the student from scratch? The answer depends on exactly three things.
You already have a trained teacher. Distillation is "free" — you only pay for the student's training. In this case, distillation almost always wins, because the teacher provides a richer signal than one-hot labels at zero additional cost.
But even here, the advantage eventually vanishes. At sufficient compute, supervised learning catches up.
The teacher exists but you must pay for inference (running the teacher on each training example to get logits). This is the common deployment scenario — you have a big model serving users, and you want to distill it into a smaller model.
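The FLOP cost becomes:

$$\text{FLOPs} \approx 3F(N_S)\,D_S + \delta_{\text{Logit}}\,F(N_T)\,D_S + \delta_{\text{Pre}}\,3F(N_T)\,D_T$$

Where the first term is student training, the second is teacher inference to generate logits on the student's tokens, and the third is teacher pretraining. The indicators $\delta_{\text{Logit}}$ and $\delta_{\text{Pre}}$ are 0 or 1 and switch each teacher cost on or off depending on the scenario.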
The key detail: F(N) is the FLOPs per token for a model with N parameters. For a standard transformer, F(N) ≈ 2N per forward pass (each parameter is used once in a multiply-add). Training requires 3x this (forward + backward), while inference requires 1x. So teacher inference costs F(NT) per token, while student training costs 3F(NS) per token.
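The accounting above can be sketched as a small function. It assumes F(N) ≈ 2N and a total cost made of student training, teacher logit generation, and teacher pretraining, with 0/1 indicators switching the two teacher terms. The function names, argument names, and model sizes are hypothetical.

```python
def forward_flops_per_token(n_params):
    """F(N): roughly one multiply-add (2 FLOPs) per parameter per token."""
    return 2 * n_params


def distillation_flops(n_student, d_student, n_teacher, d_teacher,
                       delta_logit, delta_pre):
    """Total FLOPs for one distillation run under the indicator convention."""
    student_training = 3 * forward_flops_per_token(n_student) * d_student
    teacher_logits = delta_logit * forward_flops_per_token(n_teacher) * d_student
    teacher_pretraining = delta_pre * 3 * forward_flops_per_token(n_teacher) * d_teacher
    return student_training + teacher_logits + teacher_pretraining


# Hypothetical run: 500M student on 100B tokens, 2B teacher trained on 40B tokens.
run = dict(n_student=5e8, d_student=1e11, n_teacher=2e9, d_teacher=4e10)

costs = {}
for delta_logit in (0, 1):
    for delta_pre in (0, 1):
        costs[(delta_logit, delta_pre)] = distillation_flops(
            **run, delta_logit=delta_logit, delta_pre=delta_pre)
        print(f"delta_logit={delta_logit}, delta_pre={delta_pre}: "
              f"{costs[(delta_logit, delta_pre)]:.2e} FLOPs")
```

In this example the teacher's forward passes over the student's tokens cost more than the student's own training, because the teacher is four times larger while training only costs three times a forward pass.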
For non-embedding parameters, the paper derives a more accurate expression: $F(N) \approx 2N(1 + c_1 N^{-1/3} + c_2 N^{-2/3})$ for fixed aspect-ratio models. They recommend the scaling community adopt this refined form.
The worst case for distillation: you must train the teacher from scratch, then distill. Now the total compute includes teacher pretraining (expensive!) plus teacher inference plus student training.
| Scenario | δPre | δLogit | When Distillation Wins |
|---|---|---|---|
| Best case (amortized teacher) | 0 | 0 | Almost always, unless student compute is very large |
| Teacher inference | 0 | 1 | When total budget exceeds a student-size-dependent threshold |
| Teacher pretraining | 1 | 0 | Only when making a family of models or using the teacher beyond distillation |
| Full cost (train + inference) | 1 | 1 | Distillation is more efficient only if total compute or tokens exceed a threshold. Otherwise, supervised learning wins. |
If you just want the best model at size NS: train it supervised with all your compute. Do not distill.
If you want a family of models (300M, 1B, 3B, 10B): train one large teacher, then distill into all sizes. The teacher cost is amortized across many students.
If the teacher already exists (e.g., your production server model): distill — it is essentially free improvement.
Figure 6 in the paper shows contour plots of the cross-entropy difference between distillation and supervised learning. Blue regions indicate distillation wins. Red regions indicate supervised wins. The boundary between them is the crossover.
For a 546M teacher, the crossover happens at relatively small student compute budgets — even moderate distillation beats supervised. For a 7.75B teacher, the blue region is much larger, meaning distillation's advantage extends to bigger students and larger budgets. But remember: this is the best-case scenario where the teacher is free.
When you include teacher training cost, the crossover shifts dramatically. For small total budgets ($10^{19}$ FLOPs), supervised always wins: you don't have enough compute to train both a useful teacher and a student. The distillation advantage only appears at scale.
Given a total compute budget C, how should you allocate it between student training, teacher training, and teacher inference? This is the compute-optimal distillation recipe.
The authors solve this using constrained numerical minimization (SLSQP) for each of the four distillation scenarios from Chapter 6. The results reveal striking patterns in optimal resource allocation.
| Student Size | Compute Budget | Optimal Allocation |
|---|---|---|
| Small (≤3B) | Small (≤$10^{21}$) | Mostly teacher pretraining. The student needs a good teacher more than lots of data. |
| Small (≤3B) | Large (≥$10^{23}$) | Evenly divided between student training, teacher inference, and (less) teacher pretraining. |
| Large (≥10B) | Small (≤$10^{21}$) | Mostly standard student training: not enough budget to also train a useful teacher. |
| Large (≥10B) | Large (≥$10^{23}$) | Evenly divided between student, teacher inference, and teacher pretraining. |
As total compute increases, all four quantities (NS*, DS*, NT*, DT*) grow as power laws. Student and teacher tokens scale faster than student and teacher sizes. Optimal teacher size increases until it is slightly larger than the student, then plateaus. This makes intuitive sense: once the teacher is good enough relative to the student, making it even better hits the capacity gap.
Suppose you want a 1B student and have $10^{22}$ total FLOPs. Using the scaling law and constrained optimization:
| Quantity | Optimal Value | Reasoning |
|---|---|---|
| Teacher NT | ~3B | 3x student size — enough to teach but not so large that capacity gap hurts |
| Teacher DT | ~60B | Chinchilla-optimal for the teacher (M ≈ 20) |
| Teacher LT | ~2.10 | Predicted from supervised scaling law for 3B/60B |
| Student DS | ~200B | Remaining budget after teacher training and inference |
| Distilled LS | ~2.18 | From distillation scaling law |
| Supervised LS | ~2.22 | If all $10^{22}$ FLOPs went to supervised training of 1B |
| Improvement | 0.04 nats | Distillation wins — but only because we allocated compute optimally |
Notice the improvement is modest (0.04 nats). This is the regime where distillation barely edges out supervised learning. At $10^{24}$ FLOPs the gap would widen. At $10^{20}$ FLOPs, supervised would win.
What happens as the distillation budget grows without bound? In the infinite data regime, distillation converges to the same loss as supervised learning on infinite data. The advantage of distillation is in sample efficiency: for a finite token budget, distillation extracts more information per token than one-hot labels. But at infinite tokens, both methods converge.
This paper sits at the intersection of two major research threads: neural scaling laws and knowledge distillation. Understanding where it fits helps you see the bigger picture.
| Paper | Contribution | Relation to This Work |
|---|---|---|
| Hinton et al. (2015) | Introduced knowledge distillation with temperature-scaled softmax | Foundation. This paper studies how it scales. |
| Beyer et al. (2022) | "A good teacher is patient and consistent" — function matching view | Observed capacity gap but couldn't predict it. This paper parameterizes it. |
| Zhang et al. (2023) | "Towards the law of capacity gap" | Named the phenomenon. This paper gives the first scaling law that captures it. |
| Stanton et al. (2021) | "Does knowledge distillation really work?" | Used λ=1 (pure distillation) to isolate effects. This paper follows the same protocol. |
| Burns et al. (2024) | Weak-to-strong generalization | Student outperforming teacher is explained by this paper's scaling law: when LS < LT, the student surpasses. |
Does the law hold for non-English data? All experiments use C4 English. Different languages have different entropy structures.
What about downstream task performance? The law predicts cross-entropy, not benchmark accuracy. The relationship between pretraining loss and downstream performance is itself an open scaling law question.
Does the capacity gap exist for non-autoregressive distillation? BERT-style models, diffusion models, and vision models all use distillation. Whether the same capacity gap and scaling behavior holds is unknown.
Can you distill reasoning? DeepSeek-R1's distilled models show that reasoning capabilities transfer through distillation, but the scaling behavior of reasoning transfer has not been studied.
What about mixture-of-experts (MoE)? The paper uses dense transformers. MoE models have a different FLOPs-to-parameters relationship (many parameters but fewer active per token). Whether the same scaling law coefficients hold for MoE teachers or students is unknown. DeepSeek-V3 (671B total, ~37B active) is a natural test case.
Distillation for test-time compute. With the rise of chain-of-thought and search-based inference, the relationship between pretraining loss and downstream capability may shift. A model distilled for low cross-entropy may not be optimal for test-time search — the scaling laws for these interactions remain uncharted.
This paper completes a trilogy of scaling law results that together give practitioners a complete toolkit for resource allocation:
| Decision | Scaling Law | Key Paper |
|---|---|---|
| How big should my model be? | Supervised scaling: L(N, D) | Hoffmann et al. (2022) |
| How much should I overtrain? | Overtraining scaling: lifecycle compute | Sardana et al. (2024) |
| Should I distill, and from what teacher? | Distillation scaling: LS(NS, DS, LT) | Busbridge et al. (2025) |
With all three laws in hand, a practitioner can take a total compute budget and a target inference cost, and determine the optimal strategy: train a single compute-optimal model, overtrain a small model, or distill from a large teacher into a small student. The choice depends entirely on the budget, the target size, and whether a teacher already exists.