Distilling the Knowledge in a Neural Network

Chapter 0: The Deployment Problem

You have trained ten big neural networks on the same dataset. Each one makes slightly different mistakes. When you average their predictions, the ensemble is usually more accurate than any single model. This is one of the most reliable tricks in machine learning.

But now you need to deploy. Your model has to serve millions of users, possibly on mobile phones. Running ten large networks per request is out of the question. You need a single small model that runs fast and cheap.

The naive approach: train a small model from scratch on the same data. But this small model has less capacity. It will not learn the same rich patterns that the ensemble captured. You get a worse model.

Hinton, Vinyals, and Dean asked a different question: can we transfer the knowledge from the ensemble into the small model? Not the weights -- those are tied to a specific architecture. The knowledge. The learned mapping from inputs to outputs.

The insect analogy: Insects have a larval form optimized for extracting nutrients and an adult form optimized for travel and reproduction. Similarly, the cumbersome ensemble is the larval form -- optimized for extracting knowledge from data. The distilled model is the adult form -- optimized for fast, cheap deployment. Different forms for different jobs.

Why can't we simply deploy the ensemble of models that achieves the best accuracy?

Running multiple large models per request is too expensive in latency and compute for real-time deployment to millions of users
Ensembles cannot generalize to new data
Ensembles require too much training data

Chapter 1: Hard vs Soft Targets

When you train a classifier normally, each training example has a hard target: a one-hot vector where the correct class gets probability 1 and everything else gets 0. An image of a BMW gets the label [0, 0, ..., 1, ..., 0] -- BMW = 1, everything else = 0.

But think about what the trained teacher network actually outputs. It does not produce a one-hot vector. It produces a full probability distribution: BMW = 0.92, sedan = 0.04, sports car = 0.02, truck = 0.005, carrot = 0.0000001, ...

Look at those "wrong" answers. They are not all equally wrong. The teacher thinks this BMW looks a little like a sedan, somewhat like a sports car, slightly like a truck. But not at all like a carrot. These relative probabilities among the wrong answers encode a rich structure: a similarity metric over classes that the teacher has learned from millions of examples.

The key insight: Hard labels say "this is a BMW." Soft targets say "this is a BMW, it looks a bit like a sedan, somewhat like a sports car, but nothing like a carrot." The soft targets carry vastly more information per training example. They tell the student not just what the answer is, but what the answer looks like.

This is why soft targets provide much more information per training case than hard targets. A hard target only names the correct class, which is at most log2(K) bits for a K-class problem. Soft targets give you a real-valued probability for every class -- an entire distribution of learned knowledge.

Since soft targets have higher entropy and carry more information per example, the student can often be trained on far less data and with a higher learning rate than would be needed with hard targets.
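As a rough illustration of that information gap, the sketch below compares the Shannon entropy of a one-hot target with the entropy of a soft target. The five-class distribution is a made-up example loosely based on the BMW numbers above, not the output of a real model.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0.0)

# Hypothetical 5-class problem: [BMW, sedan, sports car, truck, carrot].
hard_target = [1.0, 0.0, 0.0, 0.0, 0.0]
soft_target = [0.92, 0.05, 0.02, 0.0099999, 0.0000001]

hard_entropy = entropy_bits(hard_target)
soft_entropy = entropy_bits(soft_target)

print(f"hard target entropy: {hard_entropy:.3f} bits")
print(f"soft target entropy: {soft_entropy:.3f} bits")
```

The one-hot target has zero entropy, so all of its information is in the identity of the correct class. The soft target has strictly positive entropy, and the extra structure sits in how the remaining probability mass is split among the wrong classes.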

What information do soft targets carry that hard labels cannot?

The pixel values of the input image
The relative probabilities among wrong classes -- a learned similarity structure that says which mistakes are more likely than others
The learning rate used during training

Chapter 2: Temperature Scaling

There is a practical problem with soft targets. A well-trained teacher network is very confident. For an image of a BMW, the probability of BMW might be 0.999, and all other probabilities are crammed near zero. The distribution is so peaked that the probabilities of incorrect classes are tiny -- too tiny for the student to learn meaningful structure from them.

Caruana and his collaborators had earlier worked around this problem by matching logits (the raw pre-softmax values) instead of probabilities. Hinton, Vinyals, and Dean proposed a more general solution: raise the temperature of the softmax.

The softmax with temperature T converts logits z_i into probabilities q_i:

q_i = exp(z_i / T) / ∑_j exp(z_j / T)

Normally T = 1. When you increase T, the distribution gets softer -- the probabilities spread out. At very high T, the distribution approaches uniform. At T → 0, it approaches a one-hot vector (hard argmax).
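A minimal sketch of the temperature softmax, using only the Python standard library. The function name and the three example logits are illustrative choices, not taken from the paper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T), computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes.
logits = [10.0, 5.0, 1.0]
for T in (1.0, 5.0, 20.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T:>4}: {[round(p, 4) for p in probs]}")
```

Running this shows the largest probability shrinking and the smaller ones growing as T increases, while the ranking of the classes stays the same.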

Think of temperature like a camera's contrast dial. At T = 1, the image is high-contrast: bright spots blaze, dark areas are invisible. At T = 10, you lower the contrast: now you can see subtle details in the shadows. Those shadow details are the dark knowledge -- the relationships between incorrect classes that are invisible at T = 1 but become visible at higher temperatures.

Temperature Scaling: Soft Targets vs Hard Labels

A teacher sees a "2" digit. At T=1, the softmax distribution is peaked on "2". At higher T, dark knowledge emerges: the "2" looks a bit like a "3" and a "7".

The paper shows that in the high-temperature limit (T much larger than the magnitude of the logits), distillation is equivalent to matching logits. Assuming the logits have been zero-meaned for each example, the gradient of the soft-target cross-entropy C becomes:

∂C / ∂z_i ≈ (1 / (N·T²)) · (z_i − v_i)

where z_i are the student logits, v_i are the teacher logits, and N is the number of classes. So matching logits (Caruana's approach) is a special case of distillation in the high-temperature limit.

But intermediate temperatures can be better than either extreme. At very high T, you weight all logits equally -- including very negative ones that may just be noise. At moderate T, you ignore the noisiest logits and focus on the informative ones.
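The high-temperature approximation can be checked numerically. The sketch below compares the exact gradient of the soft-target cross-entropy with respect to the student logits, (q_i − p_i) / T, against the approximation (z_i − v_i) / (N·T²). The four-class logits are arbitrary, and both sets are zero-meaned because the approximation assumes it.

```python
import math

def softmax(logits, T):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def center(values):
    """Subtract the mean so the logits are zero-meaned."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Arbitrary example logits (assumed values, for illustration only).
student_logits = center([2.0, -1.0, 0.5, -3.0])  # z_i
teacher_logits = center([4.0, 0.0, -1.0, -2.0])  # v_i
N = len(student_logits)

def exact_gradient(T):
    """dC/dz_i = (q_i - p_i) / T for cross-entropy between tempered softmaxes."""
    q = softmax(student_logits, T)
    p = softmax(teacher_logits, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]

def approx_gradient(T):
    """High-temperature approximation (z_i - v_i) / (N * T^2)."""
    return [(z - v) / (N * T * T) for z, v in zip(student_logits, teacher_logits)]

def relative_error(T):
    """Worst relative gap between the exact and approximate gradients."""
    pairs = zip(exact_gradient(T), approx_gradient(T))
    return max(abs(e - a) / abs(a) for e, a in pairs)

for T in (1.0, 10.0, 100.0, 1000.0):
    print(f"T={T:>6}: worst relative error = {relative_error(T):.4f}")
```

The relative error shrinks as T grows, which is the sense in which logit matching is the limiting case of distillation.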

What happens to the softmax output as temperature T increases?

The distribution becomes softer -- probabilities spread out across classes, revealing subtle relationships between incorrect classes that were invisible at T=1
The distribution becomes more peaked on the correct answer
The logits are divided by zero

Chapter 3: The Distillation Loss

The full distillation procedure uses a weighted combination of two losses. The first loss trains the student to match the teacher's soft targets. The second loss trains the student to predict the correct hard labels. Both use the same student logits, but at different temperatures.

Loss 1: Soft targets (high T)

Cross-entropy between teacher's softmax(z_teacher/T) and student's softmax(z_student/T), both computed at the same high temperature. This is where the dark knowledge is transferred.

↓ weighted sum

Loss 2: Hard targets (T=1)

Standard cross-entropy between the student's softmax(z_student/1) and the one-hot ground truth label. This keeps the student grounded in the correct answer.

L = α · T² · CE(softmax(z_t/T), softmax(z_s/T)) + (1 − α) · CE(y, softmax(z_s))

Two critical details:

The T² factor. The gradients from the soft targets scale as 1/T². Without the compensating T² multiplier, increasing T would shrink the soft-target gradient to near zero. The T² factor keeps the relative contributions of the two losses stable as you experiment with different temperatures.

The weighting α. The paper found that the best results came from putting a considerably lower weight on the hard-target loss. The soft targets are the primary source of knowledge. The hard labels just provide a safety net so the student errs toward the correct answer when it cannot perfectly match the teacher.

Why not just use soft targets alone? Because the student is smaller than the teacher. It cannot perfectly reproduce the teacher's distribution. When it makes errors, it's better for those errors to lean toward the correct class than away from it. The hard-target term nudges it in the right direction.
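The combined loss for a single training example can be sketched in plain Python as follows. The default values T=4 and alpha=0.9 and the example logits are assumptions chosen for illustration; the paper tunes these per experiment.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)

def distillation_loss(student_logits, teacher_logits, true_class, T=4.0, alpha=0.9):
    # Soft-target term: both distributions use the same temperature T.
    soft_targets = softmax(teacher_logits, T)
    student_soft = softmax(student_logits, T)
    soft_loss = cross_entropy(soft_targets, student_soft)

    # Hard-target term: the same student logits, but at T = 1.
    hard_target = [1.0 if i == true_class else 0.0 for i in range(len(student_logits))]
    student_hard = softmax(student_logits, 1.0)
    hard_loss = cross_entropy(hard_target, student_hard)

    # T^2 compensates for the 1/T^2 scaling of the soft-target gradients.
    return alpha * T * T * soft_loss + (1.0 - alpha) * hard_loss

teacher = [6.0, 2.0, -1.0]
print(distillation_loss([5.0, 1.5, -0.5], teacher, true_class=0))
print(distillation_loss([-1.0, 2.0, 6.0], teacher, true_class=0))
```

A student whose logits resemble the teacher's gets a lower loss than one that ranks the classes in the opposite order. Note that the soft term is a cross-entropy, so its minimum is the entropy of the teacher's soft targets rather than zero.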

Distillation Pipeline

The full training loop: input flows through both teacher and student. The teacher's high-T softmax generates soft targets. The student matches both soft targets (high T) and hard labels (T=1).

In practice, the teacher is frozen during distillation -- its weights are not updated. You pass each training example through both networks: the teacher produces soft targets, and the student learns to match them. The teacher's soft targets can be pre-computed and cached, since they do not change during training.
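A toy version of this loop is sketched below. The teacher's soft targets are computed once and cached, and the "student" is reduced to a free vector of logits for one example, standing in for a real network. The hyperparameters and logits are assumed values. The gradient used is alpha·T·(q_T − p_T) + (1 − alpha)·(q_1 − y), which follows from differentiating the combined loss with respect to the student logits.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)

T, ALPHA, LEARNING_RATE = 4.0, 0.9, 0.1  # illustrative values

# Frozen teacher: soft targets are computed once and reused on every step.
teacher_logits = [6.0, 2.0, -1.0]
cached_soft_targets = softmax(teacher_logits, T)
hard_target = [1.0, 0.0, 0.0]

def loss(student_logits):
    soft = cross_entropy(cached_soft_targets, softmax(student_logits, T))
    hard = cross_entropy(hard_target, softmax(student_logits, 1.0))
    return ALPHA * T * T * soft + (1.0 - ALPHA) * hard

def gradient(student_logits):
    q_soft = softmax(student_logits, T)
    q_hard = softmax(student_logits, 1.0)
    return [
        ALPHA * T * (qs - p) + (1.0 - ALPHA) * (qh - y)
        for qs, p, qh, y in zip(q_soft, cached_soft_targets, q_hard, hard_target)
    ]

student_logits = [0.0, 0.0, 0.0]  # untrained student
initial_loss = loss(student_logits)
for _ in range(200):
    grad = gradient(student_logits)
    student_logits = [z - LEARNING_RATE * g for z, g in zip(student_logits, grad)]
final_loss = loss(student_logits)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In a real setup the gradient would be backpropagated into the student's weights, and only the student's parameters would receive updates.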

For the speech recognition experiments in the paper, the teacher ensemble consisted of 10 models, each with 8 hidden layers of 2560 ReLU units and a 14,000-way softmax. The distillation used T=2 and a relative weight of 0.5 on the hard-target loss. The distilled single model matched the ensemble's Word Error Rate of 10.7%, down from the single baseline's 10.9%.

Why does the distillation loss multiply the soft-target cross-entropy by T²?

The gradients from soft targets scale as 1/T², so the T² factor compensates, keeping the balance between hard and soft losses stable across different temperatures
It makes the loss numerically larger
It prevents gradient explosion

Chapter 4: Dark Knowledge

The term "dark knowledge" refers to the information encoded in the probabilities assigned to incorrect classes. It is "dark" in the same sense as dark matter -- invisible in standard training (hard labels) but containing most of the structure.

Consider a teacher network that has learned to classify handwritten digits. When shown a particular image of a "2", the teacher might output:

Digit	Probability	What it tells us
2	0.93	Correct class
3	0.04	This 2 has a curved bottom, like a 3
7	0.015	The top stroke leans right, like a 7
8	0.005	Slight similarity to an 8
0	0.001	Minimal resemblance
1, 4, 5, 6, 9	< 0.001	Almost no resemblance

A different image of "2" might assign more probability to 7 than to 3 -- because that version has a straighter stroke. These per-example similarity structures define a rich geometry over the data. The teacher is saying: "for this particular 2, the confusable classes are 3 and 7, in that order."

Dark knowledge is a learned similarity metric. It tells the student which classes look alike and for which examples. This is information that simply cannot be encoded in a one-hot label. It is the teacher's generalization knowledge -- what it learned about the structure of the problem beyond just the correct answers.

Without temperature scaling, this dark knowledge is nearly invisible. At T=1, the probability on "3" might be 10^-6 and on "7" it might be 10^-9. The ratio (1000:1) is informative, but the absolute values are so small that they contribute almost nothing to the cross-entropy gradient. Raising the temperature makes these tiny probabilities visible and learnable.
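To see the effect on numbers of this size, the sketch below softens a T=1 distribution directly. It relies on the identity that softmax(z/T) is proportional to p_i^(1/T) when p is the T=1 softmax of the same logits. The three-class distribution is a made-up example built around the 10^-6 and 10^-9 values above.

```python
def retemper(probs, T):
    """Convert T=1 softmax probabilities to temperature T via p_i ** (1/T)."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [x / total for x in powered]

# Hypothetical T=1 output for an image of a "2": classes [2, 3, 7].
probs_t1 = [1.0 - 1e-6 - 1e-9, 1e-6, 1e-9]

for T in (1.0, 5.0, 20.0):
    softened = retemper(probs_t1, T)
    print(f"T={T:>4}: {[f'{p:.3g}' for p in softened]}")
```

At higher temperatures the probabilities on "3" and "7" become large enough to contribute to the gradient, and "3" still ranks above "7", so the ordering the teacher learned is preserved.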

Dark Knowledge: What the Teacher Really Sees

Each digit image has its own set of soft predictions from the teacher. Different versions of the same digit reveal different dark knowledge.

Why is dark knowledge called "dark"?

Because it only works at night
Because it is hidden in the tiny probabilities of incorrect classes -- invisible in standard hard-label training but containing rich structural information about class similarities
Because the teacher model uses dark mode

Chapter 5: MNIST Experiments

The paper's MNIST experiments are small but remarkably revealing. The teacher is a large network: two hidden layers of 1200 ReLU units, trained with dropout and data augmentation (jittered images). It achieves 67 test errors.

A smaller network (800 units per layer, no regularization) trained normally achieves 146 errors. But when this same small network is trained with soft targets from the teacher at T=20, it achieves 74 errors -- nearly matching the teacher despite being smaller and having no regularization or data augmentation of its own.

The transfer worked. The student learned to generalize from translated training data without ever seeing translated examples. The teacher's soft targets encoded the augmentation knowledge implicitly.

The missing digit experiment

The most striking result: they removed all examples of the digit 3 from the transfer set. The student never sees a single 3 during distillation. Yet it correctly classifies 98.6% of test 3s (after a bias correction).

How is this possible? Because the teacher's soft targets for other digits contain dark knowledge about 3. When the teacher sees a "2" that looks somewhat like a "3", it assigns non-zero probability to class 3. Over thousands of such examples, the student accumulates enough information about the "3" class to recognize it -- from the shadows alone.
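The bias correction amounts to adding a constant to the logit of the unseen class at prediction time, since a class that never appears as the dominant target ends up with a learned bias that is too low. The sketch below shows the mechanics only: the logits and the offset of 1.5 are invented for illustration, and the `predict` helper is hypothetical.

```python
def predict(logits, class_bias_offsets=None):
    """Return the argmax class after adding optional per-class bias offsets."""
    offsets = class_bias_offsets or {}
    adjusted = [z + offsets.get(i, 0.0) for i, z in enumerate(logits)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)

# Hypothetical student logits for an image of a "3". The student ranks
# class 3 second because its bias for that class is too low.
logits = [0.1, -1.0, 2.0, 1.2, -0.5, 0.3, -2.0, 0.0, 0.9, -1.5]

print(predict(logits))            # prints 2
print(predict(logits, {3: 1.5}))  # prints 3
```

The offset changes no learned features. It only rebalances how readily the student commits to a class it already has evidence for.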

Even more extreme: when the transfer set contains only 7s and 8s, the student still achieves 86.8% accuracy on the full test set after bias correction. Two classes are enough to transmit useful knowledge about all ten, because the soft targets encode inter-class relationships.

Student Setup	Test Errors
Small net, hard targets only	146
Small net, soft targets (T=20)	74
Large teacher	67
Soft targets, no 3s in transfer set	206 (109 after bias fix)

Temperature sensitivity depends on student capacity. With 300+ units per layer, any T above 8 worked well. But with only 30 units per layer, the sweet spot was T = 2.5 to 4. When the student is very small, it cannot capture all the teacher's knowledge, and intermediate temperatures help by filtering out the noisiest logits.

How can the student recognize 3s when it has never seen a single 3 during training?

The teacher's soft targets for other digits assign non-zero probability to class 3 when those digits look similar to 3, so the student accumulates implicit knowledge about 3 from the dark knowledge in other examples
The student memorizes the test set
The bias correction alone is sufficient

Chapter 6: Specialist Models

For very large datasets (Google's JFT: 100 million images, 15,000 classes), training a full ensemble is too expensive -- even parallelized. The paper introduces a clever alternative: specialist models.

Instead of training N copies of the full model, you train one generalist (the full model) plus many small specialists. Each specialist focuses on a confusable cluster of classes: types of cars, types of bridges, types of mushrooms.

Step 1: Cluster classes

Apply K-means to the covariance matrix of the generalist's predictions. Classes that are often predicted together (confused with each other) end up in the same cluster.

↓

Step 2: Train specialists

Each specialist starts from the generalist's weights. It trains on a balanced mix: half from its special classes, half random. Non-specialist classes collapse into a single "dustbin" class.

↓

Step 3: Inference

For a test image, the generalist identifies top classes. Relevant specialists refine the prediction. The final distribution minimizes KL divergence to all active models.
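Step 2 can be sketched as two small helpers: one that maps original labels into a specialist's label space with a trailing dustbin class, and one that draws a half-and-half batch. The function names and the toy dataset are hypothetical, and drawing the random half from outside the specialist's classes is an assumption about what "half random" means.

```python
import random

def to_specialist_label(original_label, special_classes):
    """Map an original class id to the specialist's label space.

    Special classes keep their position in `special_classes`;
    every other class maps to the final index, the dustbin.
    """
    if original_label in special_classes:
        return special_classes.index(original_label)
    return len(special_classes)

def sample_balanced_batch(dataset, special_classes, batch_size, rng):
    """Draw half the batch from the special classes and half from the rest."""
    special = [ex for ex in dataset if ex[1] in special_classes]
    other = [ex for ex in dataset if ex[1] not in special_classes]
    half = batch_size // 2
    batch = [rng.choice(special) for _ in range(half)]
    batch += [rng.choice(other) for _ in range(batch_size - half)]
    rng.shuffle(batch)
    return [(x, to_specialist_label(y, special_classes)) for x, y in batch]

# Toy dataset of (example_id, class_id) pairs over 10 classes.
dataset = [(f"img{i}", i % 10) for i in range(100)]
special_classes = [2, 5, 7]

batch = sample_balanced_batch(dataset, special_classes, 8, random.Random(0))
print(batch)
```

Because the special classes are oversampled during training, the specialist's output for the dustbin class needs to be interpreted with that sampling ratio in mind.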

Key advantages of specialists over full ensembles:

Property	Full Ensemble	Specialists
Training time	Weeks per model	Days per specialist
Parallelism	Full model on each GPU	Small model per GPU
Independence	Yes	Yes (after clustering)
JFT accuracy gain	--	+4.4% relative (61 specialists)

Specialists are initialized from the generalist's weights, so they benefit from all the low-level features already learned. They just fine-tune the decision boundary between similar classes. The dustbin class handles everything they don't specialize in.

Example specialist clusters from JFT: {Tea party, Easter, Bridal shower, Baby shower} -- events easily confused. {Bridge, Cable-stayed bridge, Suspension bridge, Viaduct} -- structural types. {Toyota Corolla, Opel Signum, Mazda Familia} -- similar-looking cars.

The inference procedure combines the generalist with active specialists by minimizing the total KL divergence to all relevant models:

q* = argmin_q KL(p_g, q) + ∑_{m ∈ A_k} KL(p_m, q)

where p_g is the generalist's distribution and p_m is each active specialist's distribution. This is solved by gradient descent on the logits for each test image -- a small optimization per input. The result is a combined distribution that respects both the generalist's broad knowledge and each specialist's fine-grained discrimination.
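A small sketch of this per-input optimization follows, assuming q is parameterized as a softmax over logits z. Each specialist is compared against q with all non-specialist classes summed into its dustbin. Under that setup, the gradient of each KL term with respect to z_j is q_j minus a target: the specialist's probability for its own classes, or the dustbin probability shared out in proportion to q for the rest. The `combine` function, step count, and learning rate are illustrative choices.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine(p_general, specialists, steps=2000, lr=0.5):
    """Minimize KL(p_g, q) + sum_m KL(p_m, q) over q = softmax(z).

    specialists: list of (class_ids, probs), where probs has one entry per
    class id plus a final dustbin entry covering every other class.
    """
    n = len(p_general)
    z = [math.log(p) for p in p_general]  # start from the generalist
    for _ in range(steps):
        q = softmax(z)
        grad = [qj - pj for qj, pj in zip(q, p_general)]
        for class_ids, probs in specialists:
            dustbin_mass = sum(q[j] for j in range(n) if j not in class_ids)
            for j in range(n):
                if j in class_ids:
                    target = probs[class_ids.index(j)]
                else:
                    # Dustbin probability, split in proportion to q.
                    target = probs[-1] * q[j] / dustbin_mass
                grad[j] += q[j] - target
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return softmax(z)

# Toy example: 4 classes, one specialist covering classes 0 and 1.
p_general = [0.4, 0.4, 0.1, 0.1]
specialist = ([0, 1], [0.9, 0.05, 0.05])  # last entry is the dustbin

combined = combine(p_general, [specialist])
print([round(p, 4) for p in combined])
```

For this toy input the stationary point can be worked out by hand as roughly [0.65, 0.225, 0.0625, 0.0625]: the generalist could not separate classes 0 and 1, and the specialist's confident vote for class 0 breaks the tie.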

Performance scaled with specialist coverage: classes covered by 10+ specialists saw a 14.1% relative accuracy improvement, while classes covered by just 1 specialist gained only 3.4%. This suggests that having multiple specialist perspectives on confusable classes is more valuable than a single specialist, and since specialists train independently, scaling them is embarrassingly parallel.

Specialist Architecture

The generalist handles all 15,000 classes. Each specialist focuses on a confusable cluster (~300 classes) plus a dustbin for everything else.

How are confusable class clusters identified for training specialists?

By manual inspection of class names
By computing the confusion matrix on the test set
By clustering the covariance matrix of the generalist's predictions -- classes that are often predicted together are grouped into a specialist's subset

Chapter 7: Soft Targets as Regularizers

One of the paper's most important secondary findings: soft targets are not just a distillation technique. They are a powerful regularizer.

The speech recognition experiment makes this vivid. The baseline acoustic model (85M parameters) is trained on 2000 hours of speech (~700M examples). A 10-model ensemble achieves 10.7% WER, down from the single model's 10.9%.

Now the striking result: train the same large model on only 3% of the data (~20M examples).

Training Setup	Train Accuracy	Test Accuracy
Hard targets, 100% data	63.4%	58.9%
Hard targets, 3% data	67.3%	44.5%
Soft targets, 3% data	65.4%	57.0%

With hard targets and 3% data, the model overfits catastrophically: 67.3% train accuracy but only 44.5% test accuracy. With soft targets from a teacher trained on the full data, the same model on 3% data achieves 57.0% test accuracy -- nearly matching the 58.9% of the full-data baseline.
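The arithmetic behind "nearly matching" can be made explicit. The short sketch below uses only the numbers from the table above to compute the train-test gap for each setup and the fraction of the lost test accuracy that soft targets recover. The variable names are illustrative.

```python
# (train accuracy %, test accuracy %) taken from the table above.
results = {
    "hard targets, 100% data": (63.4, 58.9),
    "hard targets, 3% data": (67.3, 44.5),
    "soft targets, 3% data": (65.4, 57.0),
}

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train-test gap = {train_acc - test_acc:.1f} points")

full_test = results["hard targets, 100% data"][1]
hard_small_test = results["hard targets, 3% data"][1]
soft_small_test = results["soft targets, 3% data"][1]

# Share of the accuracy lost by shrinking the dataset that soft targets win back.
recovered = (soft_small_test - hard_small_test) / (full_test - hard_small_test)
print(f"fraction of lost test accuracy recovered: {recovered:.1%}")
```

Shrinking the data to 3% costs 14.4 points of test accuracy with hard targets, and soft targets win back 12.5 of those points.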

Even more remarkable: with soft targets, no early stopping was needed. The system simply "converged" to 57%. Soft targets prevented overfitting entirely. With hard targets, accuracy peaked and then dropped sharply -- classic overfitting. The soft targets communicated the regularities discovered by the full-data model so effectively that 3% of the data was enough.

This finding has deep implications. Soft targets encode not just "what is right" but "how to generalize." They carry structural information about the data that would otherwise require millions more examples to learn. This is why distilled models can sometimes nearly match their teachers -- the soft targets are an extremely efficient encoding of the teacher's generalization knowledge.

Why do soft targets prevent overfitting when training on only 3% of the data?

Soft targets encode the regularities discovered by a model trained on the full dataset, providing much richer gradient signal per example than hard labels -- equivalent to implicitly augmenting the small dataset with the full dataset's structural knowledge
Soft targets reduce the model size
Soft targets are computed faster than hard targets

Chapter 8: Connections

Knowledge distillation, introduced in this 2015 paper, became one of the most influential techniques in deep learning. Its core idea -- that a model's soft outputs encode richer knowledge than hard labels -- has been extended in many directions.

Distillation Showcase

Full pipeline: teacher logits → temperature scaling → soft targets → student learning. The temperature T controls how much dark knowledge is revealed.

The legacy

Technique	How it extends distillation
DistilBERT (2019)	Distills BERT into a model that is 40% smaller and 60% faster while retaining 97% of its language-understanding performance. Its training objective includes a temperature-scaled soft-target loss.
TinyBERT (2020)	Adds intermediate layer matching -- the student also matches the teacher's hidden representations, not just the final logits.
Self-distillation	The teacher and student are the same architecture. The model distills knowledge from its own earlier training run, acting as a form of regularization.
Online distillation	Teacher and student train simultaneously, co-evolving. No pre-trained teacher required.
Feature distillation	The student matches not just the teacher's outputs but its internal feature maps at multiple layers -- transferring representational structure.

In the era of LLMs, distillation has become even more critical. Frontier models like GPT-4 are too expensive to serve for many applications. Smaller models trained on the outputs of larger ones (sometimes called "model distillation" or "capability transfer") have become a standard deployment strategy. The core idea is unchanged from this 2015 paper: soft outputs carry dark knowledge that hard labels cannot.


"The relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize."

— Hinton, Vinyals & Dean, 2015
