Large ensembles generalize beautifully but are too expensive to deploy. Distillation transfers their "dark knowledge" -- the soft probability distributions over wrong answers -- into a single small model by raising the temperature of the softmax. The small model learns not just what is right, but why certain wrong answers are more wrong than others.
You have trained ten big neural networks on the same dataset. Each one makes slightly different mistakes. When you average their predictions, the ensemble is far more accurate than any single model. This is one of the most reliable tricks in machine learning.
But now you need to deploy. Your model serves millions of requests per second on mobile phones. Running ten large networks per request is out of the question. You need a single small model that runs fast and cheap.
The naive approach: train a small model from scratch on the same data. But this small model has less capacity. It will not learn the same rich patterns that the ensemble captured. You get a worse model.
Hinton, Vinyals, and Dean asked a different question: can we transfer the knowledge from the ensemble into the small model? Not the weights -- those are tied to a specific architecture. The knowledge. The learned mapping from inputs to outputs.
When you train a classifier normally, each training example has a hard target: a one-hot vector where the correct class gets probability 1 and everything else gets 0. An image of a BMW gets the label [0, 0, ..., 1, ..., 0] -- BMW = 1, everything else = 0.
But think about what the trained teacher network actually outputs. It does not produce a one-hot vector. It produces a full probability distribution: BMW = 0.92, sedan = 0.04, sports car = 0.02, truck = 0.005, carrot = 0.0000001, ...
Look at those "wrong" answers. They are not all equally wrong. The teacher thinks this BMW looks a little like a sedan, somewhat like a sports car, slightly like a truck. But not at all like a carrot. These relative probabilities among the wrong answers encode a rich structure: a similarity metric over classes that the teacher has learned from millions of examples.
This is why soft targets carry much more information per training case than hard targets. A hard target only names the correct class, which is at most log2(N) bits for N classes. A soft target gives a real-valued probability for every class -- an entire distribution of learned knowledge.
Since soft targets have higher entropy and carry more information per example, the student can often be trained on far less data and with a higher learning rate than would be needed with hard targets.
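To make the entropy comparison concrete, here is a minimal sketch that computes the Shannon entropy of a one-hot target and of a soft target. The soft values reuse the illustrative BMW probabilities from above, with the leftover mass lumped into a hypothetical "other" bucket so the list sums to 1. The helper name `entropy_bits` is an assumption made for this example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# One-hot (hard) target over five classes: BMW is certain.
hard = [1.0, 0.0, 0.0, 0.0, 0.0]

# Illustrative soft target: BMW, sedan, sports car, truck, and a
# hypothetical "other" bucket holding the remaining probability mass.
soft = [0.92, 0.04, 0.02, 0.005, 0.015]

print(f"hard target entropy: {entropy_bits(hard):.3f} bits")
print(f"soft target entropy: {entropy_bits(soft):.3f} bits")
```

The one-hot target has zero entropy, while the soft target has strictly positive entropy, and that entropy grows further once the temperature is raised.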
There is a practical problem with soft targets. A well-trained teacher network is very confident. For an image of a BMW, the probability of BMW might be 0.999, and all other probabilities are crammed near zero. The distribution is so peaked that the probabilities of incorrect classes are tiny -- too tiny for the student to learn meaningful structure from them.
Caruana and his collaborators had earlier worked around this problem by matching logits (the raw pre-softmax values) instead of probabilities. Hinton, Vinyals, and Dean proposed a more general solution: raise the temperature of the softmax.
The softmax with temperature T converts logits z_i into probabilities q_i:
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Normally T = 1. When you increase T, the distribution gets softer -- the probabilities spread out. At very high T, the distribution approaches uniform. At T → 0, it approaches a one-hot vector (hard argmax).
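A minimal sketch of the temperature-scaled softmax, q_i = exp(z_i / T) / sum_j exp(z_j / T), may help here. The logits are hypothetical values chosen for illustration, and the function name `softmax_with_temperature` is an assumption rather than anything defined in the paper.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T), computed stably by subtracting the max."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for four classes.
logits = [8.0, 4.0, 2.0, -3.0]
for T in (1.0, 5.0, 50.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 4) for p in probs])
```

As T grows the printed distributions flatten toward uniform, and as T shrinks toward 0 the largest logit takes nearly all the mass.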
Example: a teacher sees a "2" digit. At T=1, the softmax is peaked on "2". At higher T, dark knowledge emerges: the "2" looks a bit like a "3" and a "7".
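The paper shows that in the high-temperature limit (T much larger than the magnitude of the logits), distillation is equivalent to matching logits. Specifically, if the logits are zero-meaned for each example, the gradient of the soft-target cross-entropy C becomes:
$$\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^2} (z_i - v_i)$$
where z_i are the student logits, v_i are the teacher logits, and N is the number of classes. So matching logits (Caruana's approach) is a special case of distillation in the high-temperature limit.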
But intermediate temperatures can be better than either extreme. At very high T, you weight all logits equally -- including very negative ones that may just be noise. At moderate T, you ignore the noisiest logits and focus on the informative ones.
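The high-temperature claim can be checked numerically. The sketch below compares the exact gradient of the soft-target cross-entropy, (q_i - p_i) / T, with the logit-matching approximation (z_i - v_i) / (N T^2), where q and p are the student and teacher softmax outputs at temperature T. The logits are hypothetical and chosen to be zero-mean, which the approximation assumes. All function names are illustrative.

```python
import math

def softmax_t(logits, T):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def exact_grad(student, teacher, T):
    """dC/dz_i = (q_i - p_i) / T for the soft-target cross-entropy C."""
    q = softmax_t(student, T)
    p = softmax_t(teacher, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]

def logit_matching_grad(student, teacher, T):
    """High-temperature approximation (z_i - v_i) / (N * T^2)."""
    n = len(student)
    return [(z - v) / (n * T * T) for z, v in zip(student, teacher)]

# Hypothetical zero-mean logits (each list sums to 0).
student = [1.0, -0.5, 0.5, -1.0]
teacher = [2.0, 0.5, -0.5, -2.0]

def relative_gap(T):
    """Largest componentwise difference, relative to the approximation's scale."""
    exact = exact_grad(student, teacher, T)
    approx = logit_matching_grad(student, teacher, T)
    worst = max(abs(e - a) for e, a in zip(exact, approx))
    return worst / max(abs(a) for a in approx)

for T in (1.0, 10.0, 100.0):
    print(f"T={T:>5}: relative gap = {relative_gap(T):.4f}")
```

The gap between the two gradients shrinks as T grows, which is the sense in which logit matching is the limiting case.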
The full distillation procedure uses a weighted combination of two losses. The first loss trains the student to match the teacher's soft targets. The second loss trains the student to predict the correct hard labels. Both use the same student logits, but at different temperatures.
Two critical details:
The T² factor. The gradients from the soft targets scale as 1/T². Without the compensating T² multiplier, increasing T would shrink the soft-target gradient to near zero. The T² factor keeps the relative contributions of the two losses stable as you experiment with different temperatures.
The weighting α. The paper found that the best results came from putting a considerably lower weight on the hard-target loss. The soft targets are the primary source of knowledge. The hard labels just provide a safety net so the student errs toward the correct answer when it cannot perfectly match the teacher.
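One way to write the combined objective is sketched below. It makes an assumption about the convention: `alpha` is taken to be the weight on the hard-target loss, so the soft-target loss gets `1 - alpha`. The default values `T=2.0` and `alpha=0.1` and the example logits are placeholders, not values recommended by the paper.

```python
import math

def softmax_t(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, predicted):
    """-sum_i target_i * log(predicted_i), skipping zero-target terms."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.1):
    """Weighted sum of the hard-label loss (T=1) and the soft-target loss (T).

    Assumption: alpha weights the hard-target term and (1 - alpha) weights
    the soft-target term. The T * T factor compensates for soft-target
    gradients shrinking as 1 / T^2.
    """
    n = len(student_logits)
    hard_target = [1.0 if i == label else 0.0 for i in range(n)]
    soft_target = softmax_t(teacher_logits, T)

    hard_loss = cross_entropy(hard_target, softmax_t(student_logits, 1.0))
    soft_loss = cross_entropy(soft_target, softmax_t(student_logits, T))
    return alpha * hard_loss + (1.0 - alpha) * T * T * soft_loss

# Hypothetical logits for a four-class problem; the true class is 0.
teacher = [6.0, 2.0, 1.0, -3.0]
student = [3.0, 2.5, 0.5, -1.0]
print(distillation_loss(student, teacher, label=0))
```

Note that the same student logits feed both terms; only the temperature applied to them differs.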
The full training loop: input flows through both teacher and student. The teacher's high-T softmax generates soft targets. The student matches both soft targets (high T) and hard labels (T=1).
In practice, the teacher is frozen during distillation -- its weights are not updated. You pass each training example through both networks: the teacher produces soft targets, and the student learns to match them. The teacher's soft targets can be pre-computed and cached, since they do not change during training.
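Because the teacher is frozen, caching can be as simple as one pass over the transfer set. The following is a sketch under stated assumptions: `toy_teacher` is a stand-in for a real frozen network, and `precompute_soft_targets` is a hypothetical helper name.

```python
import math

def softmax_t(logits, T):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def precompute_soft_targets(teacher_logits_fn, inputs, T):
    """Run the frozen teacher once per example and store its soft targets."""
    return [softmax_t(teacher_logits_fn(x), T) for x in inputs]

# Stand-in for a frozen teacher: a fixed linear map from 2 features to 3 logits.
def toy_teacher(x):
    weights = [[2.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = precompute_soft_targets(toy_teacher, inputs, T=4.0)

# During student training, targets are looked up by example index,
# so no teacher forward pass is needed inside the loop.
for epoch in range(2):
    for i, x in enumerate(inputs):
        soft_target = cache[i]
        # A student update using (x, soft_target) would go here.
```

The trade-off is storage: the cache holds one full distribution per example, which can be large when the number of classes is large.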
For the speech recognition experiments in the paper, the teacher ensemble consisted of 10 models, each with 8 hidden layers of 2560 ReLU units and a 14,000-way softmax. The distillation used T=2 and a relative weight of 0.5 on the hard-target loss. The distilled single model matched the ensemble's Word Error Rate of 10.7%, down from the single baseline's 10.9%.
The term "dark knowledge" refers to the information encoded in the probabilities assigned to incorrect classes. It is "dark" in the same sense as dark matter -- invisible in standard training (hard labels) but containing most of the structure.
Consider a teacher network that has learned to classify handwritten digits. When shown a particular image of a "2", the teacher might output:
| Digit | Probability | What it tells us |
|---|---|---|
| 2 | 0.93 | Correct class |
| 3 | 0.04 | This 2 has a curved bottom, like a 3 |
| 7 | 0.015 | The top stroke leans right, like a 7 |
| 8 | 0.005 | Slight similarity to an 8 |
| 0 | 0.001 | Minimal resemblance |
| 1, 4, 5, 6, 9 | < 0.001 | Almost no resemblance |
A different image of "2" might assign more probability to 7 than to 3 -- because that version has a straighter stroke. These per-example similarity structures define a rich geometry over the data. The teacher is saying: "for this particular 2, the confusable classes are 3 and 7, in that order."
Without temperature scaling, this dark knowledge is nearly invisible. At T=1, the probability on "3" might be 10⁻⁶ and on "7" it might be 10⁻⁹. The ratio (1000:1) is informative, but the absolute values are so small that they contribute almost nothing to the cross-entropy gradient. Raising the temperature makes these tiny probabilities visible and learnable.
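A short sketch shows what raising the temperature does to such tiny probabilities. It relies on the fact that a softmax at temperature T is proportional to the T=1 probabilities raised to the power 1/T. Only three classes are modeled, for brevity, and the helper name `retemper` is an assumption made for this example.

```python
import math

def retemper(probs_at_t1, T):
    """Re-scale a T=1 softmax output to temperature T.

    Since q_i is proportional to exp(z_i / T), it is also proportional
    to p_i ** (1 / T), where p_i is the T=1 probability.
    """
    powered = [p ** (1.0 / T) for p in probs_at_t1]
    total = sum(powered)
    return [x / total for x in powered]

# Illustrative T=1 output: "2" gets almost all the mass,
# "3" gets 1e-6 and "7" gets 1e-9.
p3, p7 = 1e-6, 1e-9
probs = [1.0 - p3 - p7, p3, p7]

for T in (1.0, 3.0, 20.0):
    q2, q3, q7 = retemper(probs, T)
    print(f"T={T:>4}: q(3)={q3:.3e}  q(7)={q7:.3e}  ratio={q3 / q7:.1f}")
```

The ordering of the classes is preserved at every temperature, but the ratio between them compresses while their absolute values grow large enough to affect the gradient.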
The paper's MNIST experiments are small but remarkably revealing. The teacher is a large network: two hidden layers of 1200 ReLU units, trained with dropout and data augmentation (jittered images). It achieves 67 test errors.
A smaller network (800 units per layer, no regularization) trained normally achieves 146 errors. But when this same small network is trained with soft targets from the teacher at T=20, it achieves 74 errors -- nearly matching the teacher despite being smaller and having no regularization or data augmentation of its own.
The most striking result: they removed all examples of the digit 3 from the transfer set. The student never sees a single 3 during distillation. Yet it correctly classifies 98.6% of test 3s (after a bias correction).
How is this possible? Because the teacher's soft targets for other digits contain dark knowledge about 3. When the teacher sees a "2" that looks somewhat like a "3", it assigns non-zero probability to class 3. Over thousands of such examples, the student accumulates enough information about the "3" class to recognize it -- from the shadows alone.
Even more extreme: when the transfer set contains only 7s and 8s, the student still achieves 86.8% accuracy on the full test set after bias correction. Two classes are enough to transmit useful knowledge about all ten, because the soft targets encode inter-class relationships.
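The bias correction mentioned in these experiments amounts to shifting the logit of one class by a constant before taking the argmax. The sketch below illustrates the idea only: the logits and the offset of 1.0 are hypothetical, and `predict` is an illustrative helper, not code from the paper.

```python
def predict(logits, bias_offsets=None):
    """Return the argmax class after adding optional per-class logit offsets."""
    offsets = bias_offsets or {}
    adjusted = [z + offsets.get(i, 0.0) for i, z in enumerate(logits)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)

# Hypothetical student logits for a test image of a "3". The student never
# saw a 3 in the transfer set, so its learned bias for class 3 is too low.
logits = [0.1, -1.0, 2.0, 1.2, -0.5, 0.8, -2.0, 0.3, 1.5, -0.7]

print(predict(logits))            # class 2 wins before the correction
print(predict(logits, {3: 1.0}))  # class 3 wins once its bias is raised
```

Only a single scalar per affected class is adjusted; the weights that encode what a 3 looks like come entirely from the soft targets.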
| Student Setup | Test Errors |
|---|---|
| Small net, hard targets only | 146 |
| Small net, soft targets (T=20) | 74 |
| Large teacher | 67 |
| Soft targets, no 3s in transfer set | 206 (109 after bias fix) |
Temperature sensitivity depends on student capacity. With 300+ units per layer, any T above 8 worked well. But with only 30 units per layer, the sweet spot was T = 2.5 to 4. When the student is very small, it cannot capture all the teacher's knowledge, and intermediate temperatures help by filtering out the noisiest logits.
For very large datasets (Google's JFT: 100 million images, 15,000 classes), training a full ensemble is too expensive -- even parallelized. The paper introduces a clever alternative: specialist models.
Instead of training N copies of the full model, you train one generalist (the full model) plus many small specialists. Each specialist focuses on a confusable cluster of classes: types of cars, types of bridges, types of mushrooms.
Key advantages of specialists over full ensembles:
| Property | Full Ensemble | Specialists |
|---|---|---|
| Training time | Weeks per model | Days per specialist |
| Parallelism | Full model on each GPU | Small model per GPU |
| Independence | Yes | Yes (after clustering) |
| JFT accuracy gain | -- | +4.4% relative (61 specialists) |
Specialists are initialized from the generalist's weights, so they benefit from all the low-level features already learned. They just fine-tune the decision boundary between similar classes. The dustbin class handles everything they don't specialize in.
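To illustrate the dustbin idea, the sketch below maps a distribution over all classes onto a specialist's smaller output space by keeping the specialist's classes and summing everything else into one extra entry. The six-class distribution and the helper name `to_specialist_targets` are hypothetical.

```python
def to_specialist_targets(full_probs, specialist_classes):
    """Collapse a full-class distribution into specialist classes + dustbin.

    Returns one entry per specialist class (in the given order) followed
    by a final dustbin entry holding all remaining probability mass.
    """
    kept = [full_probs[c] for c in specialist_classes]
    dustbin = 1.0 - sum(kept)
    return kept + [max(dustbin, 0.0)]

# Hypothetical 6-class distribution; the specialist covers classes 1 and 4.
full = [0.05, 0.50, 0.10, 0.05, 0.25, 0.05]
print(to_specialist_targets(full, [1, 4]))
```

The specialist's softmax therefore has one more output than the number of classes it specializes in.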
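The inference procedure combines the generalist with active specialists by minimizing the total KL divergence to all relevant models:
$$\mathrm{KL}(p^g, q) + \sum_{m \in A_k} \mathrm{KL}(p^m, q)$$
where p^g is the generalist's distribution, p^m is each active specialist's distribution, A_k is the set of specialists active for the image, and q is the combined distribution being solved for. When q is compared with a specialist, its probabilities for the classes outside that specialist's subset are summed into a single dustbin probability. This is solved by gradient descent on the logits of q for each test image -- a small optimization per input. The result is a combined distribution that respects both the generalist's broad knowledge and each specialist's fine-grained discrimination.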
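The per-image optimization can be sketched as plain gradient descent on the logits of the combined distribution q, minimizing KL(generalist, q) plus one KL term per active specialist. This is a toy version under stated assumptions: four classes, one specialist, a fixed step size and step count chosen by hand, and hypothetical function names. It also assumes every generalist probability is positive and that no specialist covers all classes.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def objective_and_grad(z, p_g, specialists):
    """Total KL to the generalist and all specialists, plus its gradient in z.

    Each specialist is (classes, probs): probs has one entry per class in
    `classes` and a final dustbin entry. To compare with a specialist,
    q's mass outside `classes` is summed into a matching dustbin.
    """
    q = softmax(z)
    n = len(z)
    loss = sum(p * math.log(p / qi) for p, qi in zip(p_g, q) if p > 0)
    grad = [qi - p for p, qi in zip(p_g, q)]
    for classes, probs in specialists:
        members = set(classes)
        target = dict(zip(classes, probs))  # class -> specialist probability
        p_dust = probs[-1]
        q_dust = sum(q[j] for j in range(n) if j not in members)
        for j in range(n):
            if j in members:
                grad[j] += q[j] - target[j]
            else:
                grad[j] += q[j] - p_dust * q[j] / q_dust
        loss += sum(target[c] * math.log(target[c] / q[c])
                    for c in classes if target[c] > 0)
        if p_dust > 0:
            loss += p_dust * math.log(p_dust / q_dust)
    return loss, grad

def combine(p_g, specialists, steps=500, lr=0.5):
    z = [math.log(p) for p in p_g]  # start from the generalist's prediction
    for _ in range(steps):
        _, grad = objective_and_grad(z, p_g, specialists)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return softmax(z)

# Hypothetical example: the generalist is torn between classes 0 and 1,
# and one specialist covering {0, 1} strongly prefers class 1.
p_g = [0.40, 0.40, 0.10, 0.10]
specialists = [([0, 1], [0.10, 0.70, 0.20])]
q = combine(p_g, specialists)
print([round(v, 3) for v in q])
```

In this toy case the combined distribution moves mass from class 0 to class 1, reflecting the specialist's finer discrimination, while the classes outside the specialist's subset keep roughly the generalist's probabilities.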
Performance scaled with specialist coverage: classes covered by 10+ specialists saw a 14.1% relative accuracy improvement, while classes covered by just 1 specialist gained only 3.4%. This suggests that having multiple specialist perspectives on confusable classes is more valuable than a single specialist, and since specialists train independently, scaling them is embarrassingly parallel.
The generalist handles all 15,000 classes. Each specialist focuses on a confusable cluster (~300 classes) plus a dustbin for everything else.
One of the paper's most important secondary findings: soft targets are not just a distillation technique. They are a powerful regularizer.
The speech recognition experiment makes this vivid. The baseline acoustic model (85M parameters) is trained on 2000 hours of speech (~700M examples). A 10-model ensemble achieves 10.7% WER, down from the single model's 10.9%.
Now the striking result: train the same large model on only 3% of the data (~20M examples).
| Training Setup | Train Frame Accuracy | Test Frame Accuracy |
|---|---|---|
| Hard targets, 100% data | 63.4% | 58.9% |
| Hard targets, 3% data | 67.3% | 44.5% |
| Soft targets, 3% data | 65.4% | 57.0% |
With hard targets and 3% data, the model overfits catastrophically: 67.3% train accuracy but only 44.5% test accuracy. With soft targets from a teacher trained on the full data, the same model on 3% data achieves 57.0% test accuracy -- nearly matching the 58.9% of the full-data baseline.
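The size of the effect is easier to see as arithmetic on the table's numbers. The sketch below computes the train/test gap for each setup and the share of the lost test accuracy that soft targets recover. The helper name `generalization_gap` is an illustrative choice.

```python
# Accuracies (%) copied from the table above: (train, test).
results = {
    "hard, 100% data": (63.4, 58.9),
    "hard, 3% data": (67.3, 44.5),
    "soft, 3% data": (65.4, 57.0),
}

def generalization_gap(train_acc, test_acc):
    """Train accuracy minus test accuracy, in percentage points."""
    return train_acc - test_acc

for name, (train, test) in results.items():
    print(f"{name}: gap = {generalization_gap(train, test):.1f} points")

# Share of the test accuracy lost by moving to 3% data
# that training on soft targets recovers.
lost = results["hard, 100% data"][1] - results["hard, 3% data"][1]
recovered = results["soft, 3% data"][1] - results["hard, 3% data"][1]
print(f"recovered: {recovered / lost:.1%}")
```

The gap falls from 22.8 points with hard targets to 8.4 points with soft targets on the same 3% of the data, and roughly 87% of the lost test accuracy is recovered.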
This finding has deep implications. Soft targets encode not just "what is right" but "how to generalize." They carry structural information about the data that would otherwise require millions more examples to learn. This is why distilled models can sometimes nearly match their teachers -- the soft targets are an extremely efficient encoding of the teacher's generalization knowledge.
Knowledge distillation, introduced in this 2015 paper, became one of the most influential techniques in deep learning. Its core idea -- that a model's soft outputs encode richer knowledge than hard labels -- has been extended in many directions.
Full pipeline: teacher logits → temperature scaling → soft targets → student learning. The temperature T controls how much dark knowledge is revealed.
| Technique | How it extends distillation |
|---|---|
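| DistilBERT (2019) | Distills BERT into a model that is 40% smaller and 60% faster while retaining 97% of its language-understanding performance. Uses the same temperature-scaled soft-target loss, combined with masked-language-modeling and cosine-embedding losses. |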
| TinyBERT (2020) | Adds intermediate layer matching -- the student also matches the teacher's hidden representations, not just the final logits. |
| Self-distillation | The teacher and student are the same architecture. The model distills knowledge from its own earlier training run, acting as a form of regularization. |
| Online distillation | Teacher and student train simultaneously, co-evolving. No pre-trained teacher required. |
| Feature distillation | The student matches not just the teacher's outputs but its internal feature maps at multiple layers -- transferring representational structure. |
In the era of LLMs, distillation has become even more critical. Models like GPT-4 are too expensive for most applications. Smaller models trained on the outputs of larger ones (sometimes called "model distillation" or "capability transfer") have become a standard deployment strategy. The core idea is unchanged from this 2015 paper: soft outputs carry dark knowledge that hard labels cannot.