How a tiny model that runs on your phone can inherit the wisdom of a giant it could never hope to be — by learning from its teacher’s doubts, not just its answers.
You have trained a magnificent model. It is huge — hundreds of millions of parameters — and it is brilliant, topping every benchmark. Now you need to run it on a phone, or serve it to a million users at once, or fit it under a ten-millisecond latency budget. And it simply will not fit. The giant is too slow, too big, too hungry for memory.
The obvious move is to just train a small model from scratch instead. You try it. It's fast enough — and noticeably dumber. The small model, learning alone from the raw labels, lands well short of the giant's accuracy. There seems to be a hard tradeoff: small and fast, or large and smart. Pick one.
Knowledge distillation breaks that tradeoff. The idea, introduced by Geoffrey Hinton and colleagues in 2015, is disarmingly simple: don't train the small model on the raw labels. Train it to imitate the big model. Let the giant be a teacher and the small model be a student, and have the student learn not just the teacher's final answers but the full, nuanced distribution behind them. The student ends up far smarter than it could ever have become on its own.
Here is the crux, and it's worth slowing down for. Suppose the teacher looks at a handwritten digit that happens to be a 7. The true label is the one-hot vector: 100% on “7,” 0% on everything else. That is the hard label — it's all the raw dataset gives you.
But the teacher's actual output is richer. It might say: 90% “7,” 6% “1,” 3% “9,” and a sprinkle elsewhere. That is a soft label. And look at what it's telling you: this 7 looks a little like a 1 (they share a vertical stroke) and a little like a 9, but nothing at all like a 0 or a 5. The teacher has encoded the geometry of the classes — which digits resemble which — into those small non-zero numbers.
The hard label throws all of that away. The soft label keeps it. Training the student to match the soft label teaches it not just “this is a 7” but “this is a 7 that's a bit 1-ish” — a vastly more informative lesson. Hinton called this hidden signal dark knowledge.
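To make the contrast concrete, here is a minimal sketch that builds both labels for the digit-7 example and compares them. The text only gives 90/6/3 for the soft label, so the even split of the remaining 1% over the other digits is an illustrative assumption, and entropy is used only as a rough proxy for how much structure a label carries.

```python
import math

# Hard label for the digit 7: one-hot over classes 0..9.
hard = [0.0] * 10
hard[7] = 1.0

# Soft label from the text: 90% "7", 6% "1", 3% "9".
# The remaining 1% is spread evenly over the other seven digits;
# that even split is an assumption made for illustration.
soft = [0.01 / 7] * 10
soft[7], soft[1], soft[9] = 0.90, 0.06, 0.03

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-q * math.log2(q) for q in p if q > 0)

# Rank the wrong classes by how much mass the teacher gave them.
wrong_ranking = sorted((c for c in range(10) if c != 7),
                       key=lambda c: soft[c], reverse=True)

print("hard-label entropy:", entropy_bits(hard))   # zero: a single spike
print("soft-label entropy:", entropy_bits(soft))   # positive: mass on look-alikes
print("most similar wrong classes:", wrong_ranking[:2])
```

The hard label has zero entropy and no ranking among the wrong classes. The soft label puts 1 and 9 at the top of that ranking, which is exactly the "7 that's a bit 1-ish" lesson.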
The widget shows the same digit two ways. Toggle between the hard one-hot label and the teacher's soft distribution. Notice how much structure the soft version carries — mass leaking onto the visually-similar digits — and how the hard label is a lonely single spike that says nothing about why.
The same example, labeled two ways. Toggle to see the dark knowledge the one-hot label discards — the teacher's sense of which classes resemble this one.
Let's make “dark knowledge” precise, because it is the entire reason distillation works. The claim is that the relative probabilities the teacher assigns to the incorrect classes carry the most valuable signal — sometimes more than the correct class itself.
Consider a model trained to recognize vehicles. Shown a photo of a truck, a good teacher outputs something like: 88% truck, 11% bus, 0.9% car, 0.05% bicycle, 0.001% banana. Strip away the correct answer (truck) and look at what remains: the teacher is screaming that this thing is way more bus-like than car-like, and not remotely fruit-like. That ranking — bus > car > bicycle > banana — is a compressed encyclopedia of how the visual world is organized.
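The "strip away the correct answer" step can be sketched in a few lines. The probabilities are the ones quoted above; the helper name `wrong_class_view` is a hypothetical name introduced for this illustration.

```python
# Teacher output from the truck example, written as fractions.
teacher = {"truck": 0.88, "bus": 0.11, "car": 0.009,
           "bicycle": 0.0005, "banana": 0.00001}

def wrong_class_view(probs, true_class):
    """Drop the true class and renormalize what is left."""
    rest = {c: p for c, p in probs.items() if c != true_class}
    total = sum(rest.values())
    return {c: p / total for c, p in rest.items()}

view = wrong_class_view(teacher, "truck")
ranking = sorted(view, key=view.get, reverse=True)

print(ranking)  # ['bus', 'car', 'bicycle', 'banana']
for c in ranking:
    print(f"{c:8s} {view[c]:.5f}")
```

Once the dominant class is removed, bus takes over 90% of the remaining mass. That is the similarity ranking the hard label never shows.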
Trained on hard labels, the student only ever hears “truck”: one class index per example, with no hint about near misses. To learn that trucks resemble buses, it would have to infer that relationship indirectly, from many examples, using capacity it doesn't have to spare. The teacher, being huge, did have the capacity to discover these relationships. Distillation is a capacity transplant: the student borrows conclusions the teacher could afford to compute.
There is a catch, and it sets up the next chapter. A well-trained teacher is confident. On an easy truck image it might output 99.9% truck and split the remaining 0.1% among everything else. Those wrong-class probabilities are now so vanishingly small that the dark knowledge — the bus-vs-car ranking — is buried in numerical dust. The student, trained on this, sees essentially a hard label again. The very confidence that makes the teacher good has hidden its most useful signal.
So we need a way to amplify the small differences among the wrong-class probabilities — to turn up the contrast on the dark knowledge so the student can actually see it. That tool is temperature, and it's the subject of the next chapter.
Pick a class. The widget shows the teacher's full soft distribution, then lets you zoom into just the wrong-class probabilities (the log-scale view reveals the ranking buried under the dominant correct class). The hidden structure — what resembles what — pops out.
Pick the true class. Toggle to a log-scale view of the wrong-class probabilities to see the similarity ranking the teacher has hidden in the small numbers.
We ended the last chapter with a problem: a confident teacher buries its dark knowledge under a near-one-hot peak. We need to soften the teacher's distribution — spread some probability off the winner and onto the runners-up — so the student can see the relationships. The tool is the same temperature knob from the softmax, but here it plays a completely different role than it did in contrastive learning.
Recall how softmax works: it exponentiates the logits (the raw pre-softmax scores) and normalizes. Temperature divides the logits before exponentiating. A temperature of 1 is the normal softmax. A higher temperature shrinks the gaps between logits, so the exponentials come out closer together, and the distribution softens — probability flows from the peak onto the smaller classes, lifting the dark knowledge into view.
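In symbols (the notation is introduced here for convenience, not taken from the original text), write $z_i$ for the logit of class $i$ and $T$ for the temperature. The softened softmax is then:

$$
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

Setting $T = 1$ recovers the ordinary softmax. As $T$ grows, every exponent $z_i / T$ moves toward 0, so the probabilities move toward the uniform distribution.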
A teacher outputs these logits for a truck image: truck 6.0, bus 4.0, car 2.0, bike 0.0. Let's compute the softened distribution at two temperatures.
At temperature 1 (normal softmax): exponentiate the raw logits.
| class | logit | e^logit | probability |
|---|---|---|---|
| truck | 6.0 | e^6 = 403 | 403/466 = 0.865 |
| bus | 4.0 | e^4 = 54.6 | 54.6/466 = 0.117 |
| car | 2.0 | e^2 = 7.39 | 7.39/466 = 0.016 |
| bike | 0.0 | e^0 = 1.0 | 1.0/466 = 0.002 |
The total is 403 + 54.6 + 7.39 + 1.0 = 466. The truck dominates at 86.5%, and car and bike are nearly invisible — their dark knowledge is buried.
At temperature 4: first divide every logit by 4, giving 1.5, 1.0, 0.5, 0.0.
| class | logit / 4 | e^(logit / 4) | probability |
|---|---|---|---|
| truck | 1.50 | e^1.5 = 4.48 | 4.48/9.85 = 0.455 |
| bus | 1.00 | e^1.0 = 2.72 | 2.72/9.85 = 0.276 |
| car | 0.50 | e^0.5 = 1.65 | 1.65/9.85 = 0.167 |
| bike | 0.00 | e^0.0 = 1.00 | 1.00/9.85 = 0.102 |
Total 9.85. Now look: the truck's lead has shrunk to 45.5%, and the ranking bus > car > bike is loud and clear — 27.6%, 16.7%, 10.2%. The dark knowledge that was buried at temperature 1 is fully exposed at temperature 4. That softened distribution is what we hand to the student as its target.
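The two tables can be checked with a short script. This is a minimal sketch using only the standard library; `softmax_t` is a name chosen here, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax with temperature: divide logits by T, exponentiate, normalize."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["truck", "bus", "car", "bike"]
logits = [6.0, 4.0, 2.0, 0.0]

for T in (1.0, 4.0):
    probs = softmax_t(logits, T)
    row = ", ".join(f"{c} {p:.3f}" for c, p in zip(classes, probs))
    print(f"T={T}: {row}")
# T=1.0: truck 0.865, bus 0.117, car 0.016, bike 0.002
# T=4.0: truck 0.455, bus 0.276, car 0.167, bike 0.102
```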
One crucial detail: during distillation, we soften both the teacher and the student with the same temperature before comparing them. The student's logits also get divided by the temperature, so it learns to reproduce the softened teacher distribution. At inference time, the student goes back to temperature 1 and makes normal, confident predictions. The high temperature is a teaching aid used only during training — like a teacher slowing down and over-explaining, then expecting normal-speed work on the exam.
Sweep the temperature. At 1 the distribution is peaked and the dark knowledge is hidden. Crank it up and watch probability flow off the winner onto the related classes, revealing the teacher's full opinion. Too high, and it flattens toward uniform — the dark knowledge drowns in noise. There is a sweet spot, usually between 2 and 10.
The teacher's softened output as temperature rises. Low = peaked (dark knowledge hidden); high = soft (revealed); too high = flat (drowned). Find the sweet spot.
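The sweep can also be reproduced numerically. The sketch below reuses the truck logits and tracks two things as temperature rises: the entropy of the softened distribution, which climbs toward the uniform limit, and the winning class, which never changes. The particular temperatures are arbitrary choices for illustration.

```python
import math

def softmax_t(logits, T):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(p):
    return sum(-q * math.log(q) for q in p if q > 0)

logits = [6.0, 4.0, 2.0, 0.0]             # truck, bus, car, bike
uniform_entropy = math.log(len(logits))    # entropy of a perfectly flat distribution

for T in (1, 2, 4, 10, 100):
    p = softmax_t(logits, T)
    winner = p.index(max(p))
    print(f"T={T:>3}  top prob={max(p):.3f}  "
          f"entropy={entropy_nats(p):.3f} of {uniform_entropy:.3f}  winner={winner}")
```

Because dividing by a positive temperature preserves the order of the logits, the student's top prediction at temperature 1 is the same class it favored during softened training. Only the confidence changes.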
We have softened teacher targets. Now: what loss do we actually minimize? The student has two sources of truth, and the distillation loss blends them. The first is the teacher's softened distribution (the dark knowledge). The second is the real ground-truth label (the dataset's hard answer, which the teacher might occasionally get wrong). A good student listens to both.
Soft term — match the teacher. We measure how far the student's softened distribution is from the teacher's softened distribution using the KL divergence — a standard measure of how different two probability distributions are, zero when identical, growing as they diverge. Minimizing it pulls the student's full distribution (including dark knowledge) toward the teacher's.
Hard term — match the truth. The ordinary cross-entropy between the student's normal (temperature-1) output and the real one-hot label. This anchors the student to the actual correct answers, so it doesn't blindly inherit the teacher's mistakes.
The total distillation loss is a weighted sum of the two, controlled by a mixing weight (call it alpha): alpha times the soft term, plus one-minus-alpha times the hard term. Alpha near 1 means “trust the teacher almost entirely”; alpha near 0 means “basically ignore the teacher and train normally.” Typical recipes lean heavily on the soft term — alpha around 0.9 — because the dark knowledge is the whole point.
```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # soft term: match the teacher's SOFTENED distribution (the dark knowledge)
    s_soft = F.log_softmax(student_logits / T, dim=1)
    # detach so no gradient flows into the frozen teacher
    t_soft = F.softmax(teacher_logits.detach() / T, dim=1)
    # kl_div takes log-probabilities as input and probabilities as target;
    # the T² factor undoes the 1/T² shrinkage of the soft-term gradients
    soft = F.kl_div(s_soft, t_soft, reduction='batchmean') * (T * T)
    # hard term: match the real labels at normal temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```
Trace the data flow: the student's logits are used twice, once divided by T for the soft term and once raw for the hard term. The teacher's logits appear only in the soft term, softened and detached (no gradient flows into the frozen teacher). The soft term is multiplied by T² because dividing the logits by T shrinks its gradients by roughly 1/T²; the factor keeps the two terms on a comparable scale as T changes. Two scalars, T and alpha, control the entire behavior: T sets how much dark knowledge is exposed, alpha sets how much the student trusts the teacher versus the ground truth.
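To see actual numbers for one example without a deep-learning framework, here is a minimal pure-Python sketch of the same loss. The function name `kd_loss_single` and the student logits are illustrative assumptions; the teacher logits reuse the truck example from the temperature chapter.

```python
import math

def log_softmax_t(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def kd_loss_single(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """KD loss for ONE example; returns (total, soft term, hard term)."""
    s_log = log_softmax_t(student_logits, T)
    t_log = log_softmax_t(teacher_logits, T)
    # KL(teacher || student) on the softened distributions, scaled by T^2.
    soft = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_log, s_log)) * T * T
    # Cross-entropy against the true label at temperature 1.
    hard = -log_softmax_t(student_logits, 1.0)[label]
    return alpha * soft + (1 - alpha) * hard, soft, hard

teacher = [6.0, 4.0, 2.0, 0.0]   # truck, bus, car, bike
student = [3.0, 0.5, 1.5, 0.0]   # hypothetical student that ranks car above bus

total, soft, hard = kd_loss_single(student, teacher, label=0)
print(f"soft={soft:.4f}  hard={hard:.4f}  total={total:.4f}")
```

This student already picks "truck", so the hard term alone would be fairly content. The soft term still penalizes it for ranking car above bus, which is the part of the lesson that only the teacher can supply.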
Slide alpha. The stacked bar shows the two loss components and how the mixing weight shifts emphasis between matching the teacher (soft, teal) and matching the true labels (hard, orange). The readout reminds you what each extreme means.
Total loss = alpha × soft (match teacher) + (1−alpha) × hard (match labels). Slide alpha to rebalance. Most recipes use alpha near 0.9.
So far the student copies only the teacher's final output. But the teacher is deep, and along the way it builds up rich intermediate representations — edges, then textures, then parts, then objects. Why not have the student match those internal stages too, not just the last one? This is feature-based distillation, and it often transfers more than logits alone.
The pioneering method, FitNets, added hint connections. Pick a layer deep inside the teacher; its activation there is a “hint.” Pick a corresponding layer in the student, and add a loss term that pushes the student's activation at that layer toward the teacher's hint. Because the student's layer is usually narrower than the teacher's, a small learned regressor first maps the student's activation up to the teacher's width so the two can be compared. Now the student is guided not just on the final answer but on how to build up its representation along the way.
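A hint loss can be sketched as follows. This is a toy illustration under stated assumptions, not the FitNets implementation: the layer widths and activations are made up, the regressor is a plain linear map rather than a convolutional one, and only the regressor is updated here, whereas in practice the student's own weights are trained through the same loss.

```python
import random

random.seed(0)

TEACHER_WIDTH, STUDENT_WIDTH = 6, 3   # hypothetical layer widths

# Hypothetical activations at the chosen hint layer for one input.
teacher_hint = [random.gauss(0, 1) for _ in range(TEACHER_WIDTH)]
student_feat = [random.gauss(0, 1) for _ in range(STUDENT_WIDTH)]

# Regressor: a learned linear map from the student's width to the teacher's.
W = [[random.gauss(0, 0.1) for _ in range(STUDENT_WIDTH)]
     for _ in range(TEACHER_WIDTH)]

def regress(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def hint_loss(W, student_feat, teacher_hint):
    """Mean squared error between the projected student feature and the hint."""
    pred = regress(W, student_feat)
    return sum((p - t) ** 2 for p, t in zip(pred, teacher_hint)) / len(teacher_hint)

# A few plain gradient steps on the regressor to show the loss is trainable.
lr = 0.1
before = hint_loss(W, student_feat, teacher_hint)
for _ in range(50):
    pred = regress(W, student_feat)
    for i in range(TEACHER_WIDTH):
        err = 2 * (pred[i] - teacher_hint[i]) / TEACHER_WIDTH
        for j in range(STUDENT_WIDTH):
            W[i][j] -= lr * err * student_feat[j]
after = hint_loss(W, student_feat, teacher_hint)

print(f"hint loss before={before:.4f} after={after:.4f}")
```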
A refinement, attention transfer, says: don't match the full high-dimensional feature maps (hard, and full of detail the student can't hold). Instead, match the attention maps — a compressed summary of where in the image each layer is focusing its energy. For a convolutional layer, you collapse the channels into a single spatial heatmap of activation magnitude: this is the “where is the model looking” map. Teaching the student to look in the same places as the teacher transfers the teacher's spatial priorities cheaply, without forcing an exact feature copy.
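One common way to build such a map is to sum squared activations over channels and normalize the result. The sketch below assumes that recipe and uses tiny made-up feature maps. Because the channel dimension is collapsed, the teacher and student can have different channel counts with no regressor.

```python
import math

def attention_map(feature_map):
    """Collapse a [channels][height][width] activation into one spatial map.

    Each location gets the sum of squared activations across channels, and
    the flattened map is L2-normalized so that only the pattern matters.
    """
    channels = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    flat = [sum(feature_map[c][i][j] ** 2 for c in range(channels))
            for i in range(h) for j in range(w)]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def attention_loss(student_fm, teacher_fm):
    a_s, a_t = attention_map(student_fm), attention_map(teacher_fm)
    return sum((s - t) ** 2 for s, t in zip(a_s, a_t))

# Hypothetical 2x2 feature maps: the teacher has 3 channels, each student has 1.
teacher_fm = [[[0.0, 2.0], [0.0, 0.0]],
              [[0.0, 1.0], [0.0, 0.0]],
              [[0.0, 2.0], [0.0, 1.0]]]
student_same_focus = [[[0.0, 2.0], [0.0, 1.0]]]    # energy in the right column
student_wrong_focus = [[[2.0, 0.0], [1.0, 0.0]]]   # energy in the left column

print("same focus :", attention_loss(student_same_focus, teacher_fm))
print("wrong focus:", attention_loss(student_wrong_focus, teacher_fm))
```

The student that looks where the teacher looks gets a loss near zero even though its activations differ in scale and channel count. The student that looks at the opposite column gets a much larger loss.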
The pattern generalizes. People distill relations between examples (does the student preserve which pairs the teacher thinks are similar?), gradients, and more. The unifying idea: the teacher contains many kinds of knowledge — outputs, features, attention, relationships — and each can be a distillation target. Logit matching is just the simplest.
The diagram shows a deep teacher and a shallow student. Click the transfer modes to add matching targets: logits only, then add feature hints (with the regressor bridging the width gap), then add attention transfer. Watch the “knowledge transferred” meter rise as you give the student more of the teacher to imitate.
Teacher (top, deep) and student (bottom, shallow). Toggle transfer targets and see the matching connections and the richness of transferred knowledge.
Everything so far assumed the teacher is big and the student is small. So here is a result that should stop you cold: distill a model into another model of the exact same size and architecture, and the student often comes out better than the teacher. No compression, no bigger network — same capacity, higher accuracy. This is self-distillation, and its most famous form is the Born-Again Network.
The recipe: train a model normally. Now train a fresh, identical model using the first one as its teacher (soft labels plus hard labels, just like before). The “born-again” student matches or beats its parent. Repeat — distill the student into a third generation — and accuracy often keeps creeping up for a few rounds before plateauing. Ensemble the generations and you do even better. How can copying yourself make you smarter?
The leading explanation: the teacher's soft labels act as a superior form of regularization. Training on hard one-hot labels pushes the model to be infinitely confident — to drive the correct logit to plus infinity. That's an impossible, overfitting-prone target. The teacher's soft labels instead say “be about 88% sure, and here's how to distribute the rest.” That is a gentler, more achievable, information-rich target that prevents overconfidence and encodes real class structure.
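The "infinitely confident" claim can be checked numerically. The sketch below simplifies to two classes (an assumption made for brevity) and asks which logit gap minimizes the cross-entropy, first against a hard label and then against a soft target of 88%.

```python
import math

def cross_entropy_2class(logit_gap, target_p):
    """Cross-entropy for a 2-class model whose correct logit leads by logit_gap.

    target_p is the target probability on the correct class:
    1.0 for a hard label, 0.88 for the soft label used in the text.
    """
    log_p = -math.log1p(math.exp(-logit_gap))   # log prob of the correct class
    log_q = -math.log1p(math.exp(logit_gap))    # log prob of the other class
    return -(target_p * log_p + (1 - target_p) * log_q)

gaps = [0.5 * k for k in range(0, 41)]          # candidate logit gaps from 0 to 20

best_hard = min(gaps, key=lambda g: cross_entropy_2class(g, 1.0))
best_soft = min(gaps, key=lambda g: cross_entropy_2class(g, 0.88))

print("best gap for hard label:", best_hard)    # the largest gap offered
print("best gap for soft label:", best_soft)    # near log(0.88 / 0.12)
```

With a hard label the loss keeps falling however large the gap gets, so the optimizer is always rewarded for more confidence. With the soft target the minimum sits at a finite gap, which is the gentler objective described above.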
Born-again networks need a fully-trained teacher first, which means two training runs. Online distillation (or deep mutual learning) skips that. Train two (or more) students simultaneously from scratch, and have each learn from the others' softened outputs as they go. There's no fixed teacher; each model is a peer-teacher for the others. They pool their differing mistakes, and the group converges to better solutions than any would alone: a study group instead of a lecture. A related trick appears in some large-model training pipelines, where a model distills from an exponential-moving-average copy of itself rather than from a separate peer.
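For two peers, the per-model objective can be sketched as each model's own cross-entropy plus a KL term pulling it toward the other's prediction. This is a minimal illustration with made-up logits. It omits temperature for brevity, and a training loop would treat the peer's prediction as a fixed target when updating each model.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]

def kl(p_log, q_log):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(pl) * (pl - ql) for pl, ql in zip(p_log, q_log))

def mutual_losses(logits_a, logits_b, label):
    """Return (loss for peer A, loss for peer B) on one example."""
    la, lb = log_softmax(logits_a), log_softmax(logits_b)
    loss_a = -la[label] + kl(lb, la)   # A's cross-entropy, plus pull toward B
    loss_b = -lb[label] + kl(la, lb)   # B's cross-entropy, plus pull toward A
    return loss_a, loss_b

# Hypothetical peers that disagree about the runner-up class.
peer_a = [2.0, 1.0, 0.0]
peer_b = [2.0, 0.0, 1.0]
print(mutual_losses(peer_a, peer_b, label=0))
```

When the peers agree exactly, both KL terms vanish and each loss reduces to ordinary cross-entropy. The extra signal comes only from their disagreements.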
Press Add generation to distill the current model into a fresh same-size successor. Watch accuracy tick up generation over generation, with diminishing returns, then plateau — and see how an ensemble of all generations beats any single one. The first bar is the original model trained on hard labels alone.
Each bar is a model of identical size, distilled from the one before it. Accuracy climbs with diminishing returns. The final bar is the ensemble of all generations.
Time to watch it happen. A large teacher has already been trained (at full strength, on lots of clean data) and knows the smooth, correct boundary between the two moons. Now we train two identical small students on a tiny, noisy handful of training points, the kind of scarce data where small models struggle: one on hard labels only, the other on the teacher's soft targets.
Press Train and watch the distilled student track the teacher's clean boundary while the from-scratch student thrashes. Then play with temperature and alpha to feel how much each matters.
Both small students train on the same few noisy points (large dots). Left: hard labels only. Right: the teacher's soft targets. The faint dashed curve is the teacher's boundary — the distilled student should hug it.
No quiz — the simulator is the test. If you can predict how the distilled student's boundary changes as you drop alpha to zero, you understand distillation.
The most famous real-world distillation is DistilBERT (2019). BERT is a large language model; DistilBERT is its distilled student, and the headline numbers became the poster child for the whole technique: roughly 40% smaller, 60% faster at inference, while retaining about 97% of BERT's language understanding. That is an enormous, nearly-free win for anyone deploying models.
A working distillation setup: pick a temperature around 2 to 4, an alpha around 0.5 to 0.9 favoring the soft loss, initialize the student from the teacher when architectures allow, and add a feature/embedding-matching term if the student is deep enough to benefit. Freeze the teacher entirely — it only provides targets, never learns. And always keep a hard-label term as a tether to ground truth.
Drag the student's size. As it shrinks, inference gets faster and the model gets smaller — but retained performance eventually falls off a cliff when the student is too small to hold the teacher's knowledge. DistilBERT's actual operating point is marked: a sweet spot where you bank most of the speed and size wins while keeping nearly all the accuracy.
Shrink the student and watch size, speed, and retained performance trade off. Too small, and performance collapses. The marker shows DistilBERT's choice.
Distillation is powerful but not unconditional. The honest version of the story has some surprising failure modes, and one of them is genuinely counterintuitive: a better teacher can produce a worse student.
You would think the strongest possible teacher is the best teacher. Often it isn't. When the teacher is enormously more capable than the student, the gap between them becomes a problem. The teacher's softened distributions encode a function so complex and so sharply-drawn that the tiny student simply lacks the capacity to imitate it. Chasing an impossible target, the student does worse than it would have with a more modest teacher whose outputs it could actually match.
If your teacher is too big for your student, insert a middleman. A teacher assistant is an intermediate-sized model: first distill the giant teacher into the assistant, then distill the assistant into the small student. Each hop crosses a manageable capacity gap. The chain delivers more of the giant's knowledge to the small student than a single impossible leap could. It's the same reason you don't teach a first-grader from a graduate textbook — you go through the grades.
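Structurally, the chain is just a loop in which each trained student becomes the next teacher. The sketch below shows only that control flow. `distill_chain` and `distill_step` are hypothetical names, and the toy `record_hop` function stands in for a real training routine.

```python
def distill_chain(models, distill_step):
    """Distill through a chain of models ordered from largest to smallest.

    distill_step(teacher, student) is a caller-supplied function that trains
    `student` against `teacher` and returns the trained student.
    """
    if len(models) < 2:
        raise ValueError("need at least a teacher and a student")
    teacher = models[0]
    for student in models[1:]:
        # The freshly trained student becomes the teacher for the next hop.
        teacher = distill_step(teacher, student)
    return teacher

# Toy stand-in for real training: just record which model taught which.
hops = []

def record_hop(teacher, student):
    hops.append((teacher, student))
    return student

final = distill_chain(["giant", "assistant", "small"], record_hop)
print(hops)   # [('giant', 'assistant'), ('assistant', 'small')]
print(final)  # small
```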
Why distillation works is still partly debated, and it's worth knowing the tension. The original story is dark knowledge: the student learns the class-similarity structure in the soft labels. But later work showed that much of the benefit comes from two simpler effects: the soft labels act as regularization (the label-smoothing view from Chapter 5), and they effectively reweight examples by how confident the teacher is. Dark knowledge and these simpler effects are all real; their relative importance depends on the setting. The practical takeaway: distillation's gains are robust, but the “why” is a blend of knowledge-transfer and regularization, not one clean mechanism.
| Situation | Distillation verdict |
|---|---|
| Deploying a big model on constrained hardware | Strong win — the classic use case |
| Small / noisy training data | Win — soft labels regularize and denoise |
| Teacher vastly larger than student | Caution — capacity gap; use a teacher assistant |
| Generative LLM compression | Win — prefer sequence-level distillation |
| Teacher barely better than student | Marginal — little knowledge to transfer |
Drag the teacher's size for a fixed small student. Watch the student's accuracy rise, peak, then fall as the teacher grows too powerful — the capacity gap. Toggle the teacher assistant to see how a middleman recovers the lost performance for very large teachers.
Student accuracy vs teacher size (student size fixed). Note the peak, then the drop. Toggle the teacher assistant to bridge large gaps.
You can now explain the whole arc: why a one-hot label wastes the teacher's wisdom, what dark knowledge is, how temperature exposes it, how the KD loss blends teacher and truth, how to transfer features and attention, why a model can teach a copy of itself, how DistilBERT banked a near-free win, and when the capacity gap makes a better teacher a worse choice. The thread: a trained model's soft outputs are a richer teacher than any label, and learning to imitate them transfers judgment that raw labels cannot.
| Variant | What's transferred | Key idea |
|---|---|---|
| Logit / response (Hinton) | softened output distribution | dark knowledge via temperature |
| Feature (FitNets) | intermediate activations | hint layers + regressor for dim mismatch |
| Attention transfer | where the model looks | match compressed attention maps |
| Self / born-again | same-size soft labels | soft labels as learned regularization |
| Online / mutual | peers' outputs, live | no pre-trained teacher; learn together |
| Sequence-level | generated sequences | for generative LLMs; train on teacher's outputs |
“The best teachers are those who show you where to look, but don’t tell you what to see.” — and a soft label, full of doubt and structure, shows a small student exactly where to look.