How to get 90% of RL’s reasoning gains at 3% of the compute cost — by letting the student generate its own trajectories and the teacher score every token. A teardown of reverse KL, compounding error, and why dense supervision on self-generated data is the sweet spot between imitation and discovery.
Every concept in this lesson and how they connect — the territory before the map.
This lesson unpacks a single blog post into 12 interconnected concepts across four topic clusters. The constellation below shows how they relate.
Every concept you’ll encounter, sorted by cluster.
Next-token prediction on internet-scale data. Ch 01.
Domain-specific continued pre-training. Ch 01.
SFT + RLHF to shape behavior and safety. Ch 01.
Train on teacher-generated data. Cheap but compounds error. Ch 01.
Student generates, teacher scores every token. Ch 03.
On-policy + sparse binary reward. Expensive but powerful. Ch 01.
Mode-seeking loss: KL(πθ || πteacher). Ch 02.
Why reverse KL concentrates and forward KL spreads. Ch 02.
Off-policy drift from teacher distribution. Ch 01.
Train on gold data, test on own outputs. Ch 06.
74.4% on AIME’24 at 1,800 GPU hours. Ch 04.
Recover lost behavior after mid-training. Ch 05.
This lesson is structured in layers. You can read it linearly or skip around.
If you already know KL divergence well, skip to Chapter 03. If you only care about the results, jump to Chapter 04.
Before we can understand distillation, we need to understand where it fits in the training pipeline.
Modern LLMs go through three distinct phases, each with a different objective and data source:
Pre-training: next-token prediction on trillions of tokens from the internet. This gives the model broad knowledge and language fluency. Think of it as building the engine: raw capability without direction.
Mid-training: continued pre-training on domain-specific data such as code repos, math proofs, and internal documents. This sharpens the engine for a particular terrain. The model gains specialist knowledge but may lose generalist polish.
Post-training: SFT (supervised fine-tuning) and RLHF to shape behavior, so the model follows instructions, refuses harmful requests, and reasons step-by-step. This is where the model becomes useful, turning the engine into a vehicle you can steer.
The question this lesson answers: what is the best way to do post-training when you have a strong teacher model?
There are three paradigms for transferring a teacher model’s capabilities to a student. Each makes a different trade-off between data generation, supervision density, and compute cost:
| Method | Data Source | Supervision | Signal Density | Cost |
|---|---|---|---|---|
| Off-policy distill (SFT) | Teacher generates trajectories | Per-token cross-entropy | Dense | Low |
| RL (GRPO) | Student generates trajectories | Binary outcome reward | Sparse (1 bit/trajectory) | Very high |
| On-policy distill | Student generates trajectories | Per-token teacher logprobs | Dense (bits/token) | Low-medium |
Off-policy distillation is the cheapest: generate a dataset from the teacher once, then train the student on it with standard supervised learning. The problem? The student is learning from someone else’s outputs, not its own.
RL (specifically GRPO, Group Relative Policy Optimization) is the most powerful: the student generates its own reasoning traces and gets a reward signal. But that reward is sparse: a single binary signal (correct/incorrect) per entire trajectory. It is also expensive: the RL run in the AIME’24 comparison in Chapter 04 used 17,920 GPU hours.
On-policy distillation is the sweet spot: the student generates its own data (on-policy), but gets dense supervision from the teacher at every token position. Best of both worlds.
Why does off-policy distillation fail? The answer is compounding error, sometimes called exposure bias.
During training, the student sees the teacher’s tokens at every position. During inference, it sees its own tokens. Any small deviation at token $t$ changes the distribution of what follows, and the student has never learned to recover from its own mistakes.
Formally: if the student makes an error with probability $\epsilon$ per token, independently at each position, and trajectories are $T$ tokens long, the probability of staying on the teacher’s distribution is $(1 - \epsilon)^T$. For $\epsilon = 0.01$ and $T = 1000$ (a typical reasoning trace), that’s $(0.99)^{1000} \approx 0.00004$. The student is almost certainly off-distribution by the end of a long trace.
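A quick numerical check of this estimate, as a minimal sketch. It assumes errors are independent across tokens, which is the same simplification the formula makes; the function name `stay_on_distribution` is illustrative.

```python
def stay_on_distribution(eps: float, length: int) -> float:
    """Probability that none of `length` tokens is an error,
    assuming an independent per-token error probability `eps`."""
    return (1.0 - eps) ** length

# Same per-token error rate, increasingly long outputs.
for T in (10, 100, 1000):
    print(T, stay_on_distribution(0.01, T))
```

With a 1% per-token error rate, short outputs mostly stay on-distribution (about 0.90 at 10 tokens, about 0.37 at 100), while at 1,000 tokens the probability falls to roughly 0.00004.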
On-policy distillation sidesteps this entirely: since the student generates its own data, it always learns from states it actually visits. No distribution mismatch. No compounding error.
You can have on-policy data, dense supervision, or low compute — pick two. Until now.
The canvas above shows the three methods positioned on two axes: whether the data is generated by the student (on-policy) or the teacher (off-policy), and whether supervision is dense (per-token) or sparse (per-trajectory). On-policy distillation occupies the previously empty quadrant: on-policy and dense.
A measure of how much one probability distribution differs from another — but it’s not symmetric.
KL divergence (Kullback-Leibler divergence) quantifies the information lost when you use distribution $q$ to approximate distribution $p$. It’s always non-negative and equals zero only when $p = q$ exactly.
The crucial detail: KL divergence is asymmetric. $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general. This asymmetry is the entire reason on-policy distillation works differently from standard SFT.
In distillation, we have two distributions: the teacher $\pi_{\text{teacher}}$ (what we want to approximate) and the student $\pi_\theta$ (what we’re training). There are two ways to orient the KL:
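Forward KL: $D_{\text{KL}}(\pi_{\text{teacher}} \| \pi_\theta) = \mathbb{E}_{x \sim \pi_{\text{teacher}}}\left[\log \pi_{\text{teacher}}(x) - \log \pi_\theta(x)\right]$. This is what standard SFT minimizes (up to the teacher’s entropy, which does not depend on $\theta$). The expectation is taken under the teacher’s distribution. This means we sample from the teacher and train the student to match.
Reverse KL: $D_{\text{KL}}(\pi_\theta \| \pi_{\text{teacher}}) = \mathbb{E}_{x \sim \pi_\theta}\left[\log \pi_\theta(x) - \log \pi_{\text{teacher}}(x)\right]$. This is what on-policy distillation minimizes. The expectation is taken under the student’s distribution. This means we sample from the student and measure how surprised the teacher is.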
Let’s expand the reverse KL and see what the student actually optimizes.
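Starting from the definition of reverse KL:
$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{teacher}}) = \sum_x \pi_\theta(x) \log \frac{\pi_\theta(x)}{\pi_{\text{teacher}}(x)}$$
Expand the log ratio:
$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{teacher}}) = \sum_x \pi_\theta(x) \left[\log \pi_\theta(x) - \log \pi_{\text{teacher}}(x)\right]$$
Distribute the sum:
$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{teacher}}) = \sum_x \pi_\theta(x) \log \pi_\theta(x) - \sum_x \pi_\theta(x) \log \pi_{\text{teacher}}(x) = -H(\pi_\theta) - \mathbb{E}_{x \sim \pi_\theta}\left[\log \pi_{\text{teacher}}(x)\right]$$
The first term is the negative entropy of the student; the second rewards samples the teacher finds likely. For autoregressive LLMs, we apply this per-token, conditioned on the prefix generated so far:
$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \pi_\theta}\left[\sum_{t=1}^{T} D_{\text{KL}}\big(\pi_\theta(\cdot \mid x_{<t}) \,\|\, \pi_{\text{teacher}}(\cdot \mid x_{<t})\big)\right]$$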
This is the loss that on-policy distillation minimizes. The outer expectation samples full trajectories from the student. The inner sum computes the KL at each token position.
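Because the expectation is under the student, the per-token term can be estimated from sampled tokens alone: average $\log \pi_\theta(x_t) - \log \pi_{\text{teacher}}(x_t)$ over tokens the student actually drew. The sketch below checks this at a single position with a made-up three-token vocabulary; the two distributions are illustrative assumptions, not values from the blog.

```python
import math
import random

teacher = [0.6, 0.3, 0.1]   # illustrative teacher next-token distribution
student = [0.3, 0.5, 0.2]   # illustrative student next-token distribution

# Exact reverse KL: expectation under the student of log(student / teacher).
exact = sum(q * math.log(q / p) for q, p in zip(student, teacher))

# Monte Carlo estimate: sample tokens from the student, average the log ratio.
rng = random.Random(0)
n = 200_000
tokens = rng.choices(range(3), weights=student, k=n)
estimate = sum(math.log(student[t] / teacher[t]) for t in tokens) / n

print(f"exact={exact:.4f}  sampled={estimate:.4f}")
```

The sampled average should track the exact value closely, which is why scoring only the student’s own tokens with the teacher is enough to estimate the loss.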
The difference between forward and reverse KL is most intuitive when the teacher has multiple modes.
The teacher distribution (gray) has two peaks — think of these as two valid reasoning strategies for solving a math problem. One uses algebraic manipulation, the other uses geometric insight.
Forward KL (blue): The student must cover both modes. It spreads probability mass broadly, assigning weight even in the valley between peaks. This is safe but unfocused — the student doesn’t commit to either strategy and may produce incoherent blends.
Reverse KL (amber): The student collapses onto one peak. It fully commits to one strategy and executes it well. The cost is zero coverage of the other mode — but in practice, one excellent strategy beats two mediocre ones.
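The mode-covering versus mode-seeking behavior can be reproduced numerically. The sketch below is a toy under stated assumptions: the teacher is an equal mixture of two Gaussians at -2 and +2, the student is restricted to a single Gaussian, and the best student is found by brute-force search over a small grid. All numbers are illustrative.

```python
import math

def log_normal(x, mu, sigma):
    """Log density of a Gaussian N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_teacher(x):
    """Log density of an equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    a, b = log_normal(x, -2.0, 0.5), log_normal(x, 2.0, 0.5)
    m = max(a, b)
    return math.log(0.5) + m + math.log(math.exp(a - m) + math.exp(b - m))

# Integration grid on [-9, 9] with spacing 0.03.
DX = 0.03
XS = [-9.0 + DX * i for i in range(601)]
LOG_P = [log_teacher(x) for x in XS]

def forward_kl(mu, sigma):
    """KL(teacher || student): log ratio weighted by the teacher density."""
    return DX * sum(math.exp(lp) * (lp - log_normal(x, mu, sigma))
                    for x, lp in zip(XS, LOG_P))

def reverse_kl(mu, sigma):
    """KL(student || teacher): log ratio weighted by the student density."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

# Candidate single-Gaussian students: (mean, standard deviation).
candidates = [(-3.0 + 0.25 * i, s)
              for i in range(25)
              for s in (0.3, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0)]

best_forward = min(candidates, key=lambda c: forward_kl(*c))
best_reverse = min(candidates, key=lambda c: reverse_kl(*c))
print("forward KL picks (mu, sigma) =", best_forward)
print("reverse KL picks (mu, sigma) =", best_reverse)
```

On this grid the forward-KL fit should land between the peaks with a wide spread, while the reverse-KL fit should sit on one of the two peaks with a narrow spread.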
Let’s compute both KLs for a concrete three-token vocabulary to build intuition.
Suppose at some position, the teacher assigns probabilities $\pi_T = [0.7, 0.2, 0.1]$ over tokens [A, B, C]. The student currently has $\pi_S = [0.5, 0.4, 0.1]$.
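The two divergences for these numbers can be computed directly. This is a minimal sketch using natural logarithms (nats); the helper name `kl` is illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_T = [0.7, 0.2, 0.1]  # teacher over tokens [A, B, C]
pi_S = [0.5, 0.4, 0.1]  # student over tokens [A, B, C]

forward = kl(pi_T, pi_S)  # expectation under the teacher
reverse = kl(pi_S, pi_T)  # expectation under the student
print(f"forward KL = {forward:.4f} nats, reverse KL = {reverse:.4f} nats")
```

The forward KL comes out near 0.097 nats and the reverse KL near 0.109 nats, a small concrete instance of the asymmetry. Token C contributes nothing to either, since both models assign it 0.1. Token B, which the student over-weights, is the only positive term in the reverse KL.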
The full on-policy distillation loop in five steps.
At a high level, the algorithm repeats the following cycle:
Step 1. Given a batch of prompts (math problems, instructions, etc.), generate complete trajectories using the student’s current policy $\pi_\theta$. These are the reasoning traces the student would actually produce at inference time.
On-policy: learns from states it visits
Step 2. Feed each student-generated trajectory through the teacher model to get $\log \pi_{\text{teacher}}(x_t | x_{<t})$ at every token position. This is a single forward pass (no generation), so it’s cheap.
Dense supervision: O(T) bits per trajectory
Step 3. The “advantage” of each token is how much more the teacher likes it than the student does: $A_t = \log \pi_{\text{teacher}}(x_t | x_{<t}) - \log \pi_\theta(x_t | x_{<t})$. Positive means the teacher agrees; negative means the student generated something the teacher would not.
Token-level credit assignment
Step 4. Use the advantages to update the student via a policy gradient (or equivalent RL-style update like PPO/GRPO). Tokens the teacher likes get reinforced; tokens the teacher dislikes get suppressed.
Same optimizer as RL, denser signal
Step 5. Each iteration, the student improves, so the next batch of trajectories is better. The teacher continues to provide guidance at the student’s current level, like a tutor adapting to the student’s progress.
Curriculum emerges naturally
```python
import torch


def on_policy_distillation(student, teacher, prompts, optimizer,
                           n_iters, n_epochs=4, eps=0.2):
    """On-policy distillation with reverse KL.

    `student.generate`, `student.log_probs`, `teacher.log_probs`, and
    `sample_batch` are placeholders for user-supplied helpers, not a
    specific library's API. `log_probs` is assumed to return a [B, T] tensor.
    """
    for iteration in range(n_iters):
        # 1. Sample batch of prompts
        batch = sample_batch(prompts)

        # 2. Generate trajectories from STUDENT (on-policy)
        trajectories = [student.generate(p, temperature=1.0) for p in batch]

        # 3. Score with teacher (single forward pass, no generation).
        #    No gradients flow through these reference log-probs.
        with torch.no_grad():
            teacher_logprobs = teacher.log_probs(trajectories)  # [B, T]
            student_logprobs = student.log_probs(trajectories)  # [B, T]

        # 4. Compute per-token advantages (reverse KL gradient)
        advantages = teacher_logprobs - student_logprobs  # [B, T]

        # 5. Policy gradient update (PPO-style clipping)
        for epoch in range(n_epochs):
            new_logprobs = student.log_probs(trajectories)
            ratio = torch.exp(new_logprobs - student_logprobs)  # importance weight
            clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
            loss = -torch.min(ratio * advantages, clipped * advantages).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
- student.generate(): autoregressive sampling (the on-policy part)
- teacher.log_probs(): a single forward pass, parallel across all positions
- advantages: the token-level reverse KL signal (dense, informative)
- ratio: importance sampling correction for multi-epoch updates
- clip: PPO-style trust region to prevent too-large updates
How tokens flow through the system in a single iteration.
The diagram shows the two-phase process. Phase 1 (generation): prompts flow into the student, which autoregressively generates tokens. This is the expensive part — sequential, can’t be parallelized across positions. Phase 2 (scoring): the complete trajectory flows into both student and teacher for a single forward pass each. This is parallelizable. The resulting log-prob difference becomes the advantage signal that updates the student.
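To see the update rule converge in the smallest possible setting, here is a toy version of the loop. It is a sketch under heavy simplifications: a single token position, a three-token vocabulary, a student parameterized by raw logits, and a plain REINFORCE-style update in place of clipped PPO. The teacher distribution and all hyperparameters are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher = [0.7, 0.2, 0.1]   # fixed teacher distribution (illustrative)
logits = [0.0, 0.0, 0.0]    # student parameters; starts uniform
rng = random.Random(0)
lr, batch_size = 0.5, 256

for step in range(300):
    student = softmax(logits)
    # Sample from the STUDENT (on-policy).
    tokens = rng.choices(range(3), weights=student, k=batch_size)
    grad = [0.0, 0.0, 0.0]
    for x in tokens:
        # Teacher scores the sampled token; advantage is the log-prob gap.
        advantage = math.log(teacher[x]) - math.log(student[x])
        # Policy gradient: advantage times d log pi(x) / d logits.
        for k in range(3):
            grad[k] += advantage * ((1.0 if k == x else 0.0) - student[k])
    # Gradient ascent on expected advantage, i.e. descent on reverse KL.
    logits = [z + lr * g / batch_size for z, g in zip(logits, grad)]

student = softmax(logits)
print([round(q, 3) for q in student])
```

As the student approaches the teacher, the advantages shrink toward zero, so the updates quiet down on their own. With a single-peaked teacher like this one there is only one mode to find, so the student ends up close to the teacher’s distribution.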
The full training loop with realistic hyperparameters for distilling a 235B teacher into an 8B student:
```yaml
# On-policy distillation config
student_model: "qwen3-8b"
teacher_model: "qwen3-235b-thinking"
dataset: "aime_24_train"

# Generation
max_gen_length: 32768     # Long reasoning traces
temperature: 1.0          # Full entropy for exploration
batch_size: 64            # Prompts per iteration
samples_per_prompt: 8     # Multiple rollouts per problem

# Optimization
n_epochs_per_iter: 4      # PPO-style multi-epoch
clip_eps: 0.2             # Trust region
learning_rate: 1e-6       # Conservative
total_iterations: 500     # ~1800 GPU hours on 32x H100
```
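For a sense of scale, the rollout budget implied by this config can be worked out directly. This is back-of-envelope arithmetic on the config values only; the token count is an upper bound, since a rollout can stop before `max_gen_length`.

```python
batch_size = 64          # prompts per iteration (from the config)
samples_per_prompt = 8   # rollouts per prompt
total_iterations = 500
max_gen_length = 32768   # tokens, upper bound per rollout

rollouts_per_iter = batch_size * samples_per_prompt
total_rollouts = rollouts_per_iter * total_iterations
max_tokens = total_rollouts * max_gen_length  # worst case, every rollout at full length

print(rollouts_per_iter, total_rollouts, max_tokens)
```

That is 512 rollouts per iteration and 256,000 over the run, or at most about 8.4 billion generated tokens.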
Lower temperature makes the student’s generations more deterministic, reducing exploration. But the whole point of on-policy training is to explore the student’s actual distribution. Temperature 1.0 gives maximum information about where the student needs improvement.
With 8 rollouts per problem, we get variance reduction for free. We can also compare: which of the student’s attempts does the teacher prefer? This gives richer gradient signal than a single trajectory.
The teacher never generates; it only scores existing sequences. A single forward pass through a 235B model on 32K tokens takes ~2 seconds on 8 GPUs. Compare this to the minutes of sequential generation the teacher would need to write a trace of that length itself. The teacher’s cost is roughly 10% of the student’s generation cost.
Unlike RLHF, there’s no separate reward model to train, maintain, or worry about gaming. The teacher is the reward signal. This eliminates an entire source of instability and misalignment.
On-policy distillation beats RL at a fraction of the cost.
The Thinking Machines team evaluated three methods on AIME’24 (American Invitational Mathematics Examination, 2024 problems). All methods start from the same base model (Qwen3-8B after mid-training on math data).
| Method | AIME'24 Accuracy | GPU Hours | Cost Relative to RL |
|---|---|---|---|
| Off-policy distill (SFT) | 60.0% | ~600 | 0.03x |
| RL (GRPO) | 67.6% | 17,920 | 1.0x (baseline) |
| On-policy distill | 74.4% | 1,800 | 0.1x |
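Let that sink in: on-policy distillation scores 6.8 points higher than RL (74.4% vs. 67.6%) while using about one tenth of the GPU hours (1,800 vs. 17,920), and it beats off-policy SFT by 14.4 points.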
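The relative-cost column can be re-derived from the raw GPU hours in the table. A small sketch; the dictionary simply restates the table’s numbers.

```python
# (AIME'24 accuracy in %, GPU hours), copied from the table.
results = {
    "Off-policy distill (SFT)": (60.0, 600),
    "RL (GRPO)": (67.6, 17_920),
    "On-policy distill": (74.4, 1_800),
}

rl_acc, rl_hours = results["RL (GRPO)"]
for name, (acc, hours) in results.items():
    print(f"{name}: {hours / rl_hours:.2f}x RL cost, {acc - rl_acc:+.1f} points vs RL")
```

This gives 0.03x and 0.10x for the two distillation methods, matching the table, with on-policy distillation 6.8 points above RL and off-policy SFT 7.6 points below it.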
Where do the GPU hours actually go?
The bottleneck is generation. RL needs many more trajectories because each one provides at most 1 bit of signal. To get reliable advantage estimates from such a sparse reward, you need many samples per problem.
On-policy distillation needs far fewer trajectories (8 per prompt vs 64+ for RL) because each trajectory carries vastly more information. The teacher scoring step adds little on top: it is a single parallel forward pass, far cheaper than sequential generation.
How does performance change as we increase compute budget?
The practical ceiling for on-policy distillation on AIME’24 is approximately the teacher’s own score (Qwen3-235B-Thinking achieves ~80%). RL has no theoretical ceiling (it can discover strategies the teacher doesn’t know), but the practical ceiling within reasonable compute is lower.
You fine-tune a model on your company’s internal docs. It gets smart about your domain. But it forgets how to follow instructions.
This is the catastrophic forgetting trade-off in mid-training. When you continue pre-training a model on domain-specific data (code, legal documents, internal wikis), you inject new knowledge. But you also overwrite some of the post-training polish — the instruction-following, safety guardrails, and formatting that make the model usable.
The traditional fix: run full post-training again after mid-training. But this is expensive (thousands of GPU hours of RLHF) and requires the original post-training data pipeline, which you may not have.
On-policy distillation offers a surgical alternative: use the original model as the teacher to restore the behaviors you lost.
The Thinking Machines team demonstrated this with Qwen3-8B:
IFEval score: 83.2%. Strong instruction-following baseline.
Model gains domain knowledge. IFEval drops to 71.4%. The model “forgets” how to follow formatting instructions, generates rambling answers, occasionally ignores constraints.
Catastrophic forgetting in action
Use the original (pre-mid-training) Qwen3-8B as teacher. Let the degraded model generate; the teacher scores. IFEval recovers to 82.8%.
Behavior recovered without losing domain knowledge
The chart shows both metrics over time. During mid-training (terracotta region), domain knowledge rises while instruction-following falls. During on-policy distillation (moss region), instruction-following recovers while domain knowledge is preserved.
Why is domain knowledge preserved? Because on-policy distillation uses reverse KL: it only asks “does the teacher approve of what the student generates?” The student still generates domain-specific content (it has that knowledge); the teacher just shapes how that content is structured and formatted.
This suggests a powerful recipe for continual learning:
Inject fresh domain knowledge. Accept temporary behavior degradation.
Restore instruction-following, safety, formatting via on-policy distill from a clean checkpoint.
Each cycle adds knowledge while restoring behavior, so the model keeps improving on both axes from one cycle to the next.
This “knowledge injection then behavior recovery” cycle can run indefinitely. As long as you keep a clean checkpoint as the behavior teacher, you can always recover. And because on-policy distillation is cheap (hundreds of GPU hours, not thousands), the recovery step doesn’t dominate your compute budget.
RL doesn’t search parameter space — it searches strategy space. And once a strategy is found, you don’t need RL to teach it.
This is the deepest insight from the Thinking Machines blog: the value of RL is exploration, not optimization.
Consider what happens during RL training on math problems. The student tries many reasoning strategies: algebraic manipulation, geometric reasoning, enumeration, proof by contradiction. Most trajectories fail. Occasionally, one succeeds. The reward signal reinforces that strategy.
But here’s the thing: once the strategy is known (because the teacher already has it), you don’t need to rediscover it through trial and error. You can directly teach it via distillation. RL is only needed when no teacher exists — when the strategy must be discovered from scratch.
Let’s quantify the supervision density difference more precisely.
A binary reward (correct/incorrect) provides at most 1 bit of information. For a trajectory of $T = 32{,}000$ tokens, that’s $\frac{1}{32{,}000}$ bits per token. The model must figure out which tokens contributed to success or failure: the credit assignment problem.
The teacher log-probability at each position provides a continuous value. Assuming 16-bit floating point precision, that’s up to 16 bits per token position. For $T = 32{,}000$ tokens, that’s $32{,}000 \times 16 = 512{,}000$ bits per trajectory.
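That is up to half a million times more information per trajectory. Treat it as an upper bound set by the number format, not a measured quantity. It still explains why on-policy distillation converges in roughly 10x fewer samples: the gain is not a 10x efficiency trick but a supervision signal that is orders of magnitude denser. The 10x compute reduction is, if anything, conservative, and most of the savings come from needing fewer generation steps.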
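The arithmetic behind this comparison, as a short sketch. The 16 bits per log-probability is the text’s own assumption (16-bit floats) and gives an upper bound, not a measurement.

```python
T = 32_000               # tokens per trajectory
bits_per_logprob = 16    # assumes 16-bit floats, as in the text

rl_bits = 1                           # one binary outcome per trajectory
distill_bits = T * bits_per_logprob   # upper bound on teacher signal per trajectory

print(rl_bits / T)               # bits per token under RL
print(distill_bits)              # bits per trajectory under distillation
print(distill_bits // rl_bits)   # density ratio
```

Under these assumptions RL supplies about 0.00003 bits per token, while distillation supplies up to 512,000 bits per trajectory.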
This gives us a clean mental model for when to use each approach:
| Condition | Best Method | Reasoning |
|---|---|---|
| Teacher exists, same distribution | Off-policy SFT | Cheapest. Compounding error acceptable for short outputs. |
| Teacher exists, long reasoning | On-policy distill | Avoids compounding error. Dense signal. Best cost/quality ratio. |
| No teacher, verifiable reward | RL (GRPO) | Must explore strategy space. Expensive but discovers novel solutions. |
| No teacher, no verifiable reward | RLHF with human prefs | Last resort. Most expensive, most brittle. |
On-policy distillation has a hard ceiling: the teacher’s own capability. You cannot distill what the teacher doesn’t know. RL is irreplaceable when no teacher exists for the task, or when the goal is to push beyond the teacher’s quality.
The practical recommendation from the blog: use RL to train your teacher, then distill the teacher into students. RL is for the frontier; distillation is for deployment.
When to use which method — a practical guide.
| Scenario | Method | Why |
|---|---|---|
| Short outputs, strong teacher | Off-policy SFT | Cheapest. Compounding error minimal for <512 tokens. |
| Long reasoning, strong teacher | On-policy distill | Avoids compounding error on multi-step chains. |
| No teacher, verifiable answers | RL (GRPO) | Must explore. Math/code have checkable solutions. |
| Behavior recovery after domain FT | On-policy distill | Restore behavior without forgetting new knowledge. |
| Push beyond teacher quality | RL | Distillation caps at teacher. RL has no ceiling. |
| Hybrid: minimize cost, maximize quality | RL then distill | RL discovers strategies, distill transfers them cheaply. |
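The decision tables above can be condensed into a small helper. This is one possible encoding, not a rule from the blog: the function name, its parameters, and the use of the table’s 512-token figure as a hard cutoff are all assumptions made for illustration.

```python
def choose_method(has_teacher: bool,
                  verifiable_reward: bool,
                  expected_output_tokens: int,
                  must_exceed_teacher: bool = False,
                  short_output_limit: int = 512) -> str:
    """Pick a training method following the lesson's decision tables."""
    if has_teacher and not must_exceed_teacher:
        # A usable teacher exists: distill, choosing by output length.
        if expected_output_tokens < short_output_limit:
            return "Off-policy SFT"
        return "On-policy distill"
    # No teacher, or the goal is to surpass it: exploration is required.
    if verifiable_reward:
        return "RL (GRPO)"
    return "RLHF with human prefs"


print(choose_method(True, True, 200))                              # short outputs
print(choose_method(True, True, 32_000))                           # long reasoning
print(choose_method(False, True, 32_000))                          # no teacher
print(choose_method(True, True, 32_000, must_exceed_teacher=True)) # beyond teacher
```

The hybrid row corresponds to calling the helper twice: once for the frontier model (no teacher, so RL) and once for each deployed student (teacher now exists, so distillation).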