Based on Thinking Machines Lab · Kevin Lu · Oct 2025

On-Policy Distillation.

How to get 90% of RL’s reasoning gains at 3% of the compute cost — by letting the student generate its own trajectories and the teacher score every token. A teardown of reverse KL, compounding error, and why dense supervision on self-generated data is the sweet spot between imitation and discovery.

Source: Thinking Machines Blog · Depth: concept-to-implementation · Compute: 9–30x cheaper than RL

00 Concept constellation

Every concept in this lesson and how they connect — the territory before the map.

This lesson unpacks a single blog post into 12 interconnected concepts across four topic clusters: Training, Methods, Theory, and Applications.

[Figure: concept constellation showing how the 12 concepts connect, with a legend for the four clusters.]

00 Concept index

Every concept you’ll encounter, sorted by cluster.

Training

  • Pre-training: Next-token prediction on internet-scale data. Ch 01.
  • Mid-training: Domain-specific continued pre-training. Ch 01.
  • Post-training: SFT + RLHF to shape behavior and safety. Ch 01.

Methods

  • Off-Policy Distillation (SFT): Train on teacher-generated data. Cheap but compounds error. Ch 01.
  • On-Policy Distillation: Student generates, teacher scores every token. Ch 03.
  • RL / GRPO: On-policy + sparse binary reward. Expensive but powerful. Ch 01.
  • Reverse KL Divergence: Mode-seeking loss: KL(πθ || πteacher). Ch 02.

Theory

  • Mode-Seeking vs Mode-Covering: Why reverse KL concentrates and forward KL spreads. Ch 02.
  • Compounding Error: Off-policy drift from teacher distribution. Ch 01.
  • Exposure Bias: Train on gold data, test on own outputs. Ch 01.

Applications

  • Math Reasoning (AIME): 74.4% on AIME’24 at 1,800 GPU hours. Ch 04.
  • Personalization & Continual Learning: Recover lost behavior after mid-training. Ch 05.

00 Reading guide

This lesson is structured in layers. You can read it linearly or skip around.

  • Chapters 01–02: The problem — why off-policy distillation and RL both fall short, and the math behind reverse KL.
  • Chapters 03–04: The method — the full algorithm, implementation, and benchmark results on AIME’24.
  • Chapters 05–07: The implications — personalization, the theory of why this works, and connections to other techniques.

If you already know KL divergence well, skip to Chapter 03. If you only care about the results, jump to Chapter 04.

01 Three stages of LLM training

Before we can understand distillation, we need to understand where it fits in the training pipeline.

Modern LLMs go through three distinct phases, each with a different objective and data source:

Stage 01: Pre-training

Next-token prediction on trillions of tokens from the internet. This gives the model broad knowledge and language fluency. Think of it as building the engine: raw capability without direction.

Stage 02: Mid-training

Continued pre-training on domain-specific data such as code repos, math proofs, and internal documents. This sharpens the engine for a particular terrain. The model gains specialist knowledge but may lose generalist polish.

Stage 03: Post-training

SFT (supervised fine-tuning) and RLHF to shape behavior: follow instructions, refuse harmful requests, reason step-by-step. This is where the model becomes useful: it turns the engine into a vehicle you can steer.

The question this lesson answers: what is the best way to do post-training when you have a strong teacher model?

01 Three post-training methods

There are three paradigms for transferring a teacher model’s capabilities to a student. Each makes a different trade-off between data generation, supervision density, and compute cost:

| Method | Data source | Supervision | Signal density | Cost |
| --- | --- | --- | --- | --- |
| Off-policy distill (SFT) | Teacher generates trajectories | Per-token cross-entropy | Dense | Low |
| RL (GRPO) | Student generates trajectories | Binary outcome reward | Sparse (1 bit/trajectory) | Very high |
| On-policy distill | Student generates trajectories | Per-token teacher logprobs | Dense (bits/token) | Low-medium |

Off-policy distillation is the cheapest: generate a dataset from the teacher once, then train the student on it with standard supervised learning. The problem? The student is learning from someone else’s outputs, not its own.

RL (specifically GRPO, Group Relative Policy Optimization) is the most powerful: the student generates its own reasoning traces and gets a reward signal. But that reward is sparse: a single binary signal (correct/incorrect) per entire trajectory. It is also expensive: the RL run in the AIME’24 comparison in Chapter 04 takes 17,920 GPU hours.

On-policy distillation is the sweet spot: the student generates its own data (on-policy), but gets dense supervision from the teacher at every token position. Best of both worlds.

01 Compounding error

Why does off-policy distillation fail? The answer is compounding error, sometimes called exposure bias.

During training, the student sees the teacher’s tokens at every position. During inference, it sees its own tokens. Any small deviation at token $t$ changes the distribution of what follows, and the student has never learned to recover from its own mistakes.

Think of learning to drive by watching dashcam footage. You see what a good driver does in every situation — but only the situations a good driver encounters. The moment you make your first mistake (drift slightly right), you’re in a state you’ve never seen in training. Each subsequent decision compounds the error.

Formally: if the student makes an error with probability $\epsilon$ per token, and trajectories are $T$ tokens long, the probability of staying on the teacher’s distribution is $(1 - \epsilon)^T$. For $\epsilon = 0.01$ and $T = 1000$ (a typical reasoning trace), that’s $(0.99)^{1000} \approx 0.00004$. The student is almost certainly off-distribution by the end of a long trace.
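The arithmetic above is easy to check directly. The sketch below assumes, as the text does, that per-token errors are independent with a fixed probability; the function name is illustrative only.

```python
def prob_on_distribution(eps: float, length: int) -> float:
    """Probability of making no error across `length` tokens,
    assuming independent per-token errors with probability `eps`."""
    return (1.0 - eps) ** length


# How quickly the chance of staying on the teacher's distribution decays
# as the trace gets longer (eps = 0.01, as in the text).
for T in (10, 100, 1000):
    print(f"T = {T:5d}: {prob_on_distribution(0.01, T):.6f}")
```

Even at 100 tokens the student stays on-distribution only about a third of the time under this model, which is why the problem bites hardest on long reasoning traces.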

On-policy distillation sidesteps this entirely: since the student generates its own data, it always learns from states it actually visits. No distribution mismatch. No compounding error.

01 The trilemma visualized

You can have on-policy data, dense supervision, or low compute — pick two. Until now.

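[Figure: off-policy SFT, RL / GRPO, and on-policy distillation plotted by data source (on-policy vs off-policy) and supervision density (dense vs sparse).]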

The canvas above shows the three methods positioned on two axes: whether the data is generated by the student (on-policy) or the teacher (off-policy), and whether supervision is dense (per-token) or sparse (per-trajectory). On-policy distillation occupies the previously empty quadrant: on-policy and dense.

The key insight: off-policy distillation has the right loss (dense, per-token) but the wrong data (teacher-generated). RL has the right data (student-generated) but the wrong loss (sparse, per-trajectory). On-policy distillation uses the right data AND the right loss.

02 What is KL divergence?

A measure of how much one probability distribution differs from another — but it’s not symmetric.

KL divergence (Kullback-Leibler divergence) quantifies the information lost when you use distribution $q$ to approximate distribution $p$. It’s always non-negative and equals zero only when $p = q$ exactly.

KL divergence (general form) $$D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

The crucial detail: KL divergence is asymmetric. $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general. This asymmetry is the entire reason on-policy distillation works differently from standard SFT.

02 Forward vs reverse KL

In distillation, we have two distributions: the teacher $\pi_{\text{teacher}}$ (what we want to approximate) and the student $\pi_\theta$ (what we’re training). There are two ways to orient the KL:

Forward KL: $D_{\text{KL}}(\pi_{\text{teacher}} \| \pi_\theta)$

This is what standard SFT minimizes. The expectation is taken under the teacher’s distribution. This means we sample from the teacher and train the student to match.

  • Mode-covering: The student must assign probability mass everywhere the teacher does. If the teacher has multiple modes (e.g., multiple valid reasoning strategies), the student spreads itself across all of them.
  • Off-policy: Data comes from the teacher distribution.
  • Problem: Compounding error at test time (student generates, not teacher).

Reverse KL: $D_{\text{KL}}(\pi_\theta \| \pi_{\text{teacher}})$

This is what on-policy distillation minimizes. The expectation is taken under the student’s distribution. This means we sample from the student and measure how surprised the teacher is.

  • Mode-seeking: The student can collapse onto one good mode of the teacher. It doesn’t need to cover all strategies — it just needs one that the teacher approves of.
  • On-policy: Data comes from the student distribution.
  • Advantage: No distribution mismatch between training and inference.

02 Derivation step by step

Let’s expand the reverse KL and see what the student actually optimizes.

Starting from the definition of reverse KL:

Step 1: Definition $$D_{\text{KL}}(\pi_\theta \| \pi_{\text{teacher}}) = \sum_x \pi_\theta(x) \log \frac{\pi_\theta(x)}{\pi_{\text{teacher}}(x)}$$

Expand the log ratio:

Step 2: Split the log $$= \sum_x \pi_\theta(x) \left[ \log \pi_\theta(x) - \log \pi_{\text{teacher}}(x) \right]$$

Distribute the sum:

Step 3: Two expectations $$= \underbrace{\sum_x \pi_\theta(x) \log \pi_\theta(x)}_{-H(\pi_\theta)} - \underbrace{\sum_x \pi_\theta(x) \log \pi_{\text{teacher}}(x)}_{\text{Expected teacher log-prob}}$$
  • $-H(\pi_\theta)$ is the negative entropy of the student. Minimizing the KL maximizes student entropy (encouraging diversity).
  • The second term is the expected log-probability of student samples under the teacher. Minimizing the KL maximizes this (the teacher should assign high probability to what the student generates).
So minimizing reverse KL = maximize entropy (explore) + maximize teacher approval of your outputs. The student is saying: “I want to be diverse, but everything I generate should look good to the teacher.”
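As a quick numerical check of Step 3, the sketch below computes the reverse KL directly and via the two-term decomposition for a small discrete example. The two distributions are made-up values chosen for illustration.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def expected_log_prob(p, q):
    """E_{x ~ p}[log q(x)]: how much q 'approves' of samples drawn from p."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(student, teacher):
    """D_KL(student || teacher) computed from the definition (Step 1)."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


# Hypothetical distributions over a three-token vocabulary.
student = [0.6, 0.3, 0.1]
teacher = [0.8, 0.1, 0.1]

direct = reverse_kl(student, teacher)
decomposed = -entropy(student) - expected_log_prob(student, teacher)
print(f"direct = {direct:.6f}, decomposed = {decomposed:.6f}")
```

Both numbers agree, confirming that lowering the reverse KL means raising the student's entropy, the teacher's expected log-probability of student samples, or both.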

For autoregressive LLMs, we apply this per-token, conditioned on the prefix generated so far:

Per-token reverse KL $$\mathcal{L}(\theta) = \mathbb{E}_{x_{1:T} \sim \pi_\theta} \left[ \sum_{t=1}^T D_{\text{KL}}\left(\pi_\theta(\cdot | x_{<t}) \| \pi_{\text{teacher}}(\cdot | x_{<t})\right) \right]$$

This is the loss that on-policy distillation minimizes. The outer expectation samples full trajectories from the student. The inner sum computes the KL at each token position.
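To make the per-token loss concrete, here is a minimal sketch with two toy "models" whose next-token distribution depends only on the previous token. The lookup tables, vocabulary size, and function names are all assumptions made for illustration. For each student rollout it computes the exact sum of per-token KLs and the single-sample estimate $\log \pi_\theta(x_t) - \log \pi_{\text{teacher}}(x_t)$ on the tokens actually sampled.

```python
import math
import random

VOCAB = 3  # toy vocabulary size (assumption)

# Hypothetical next-token tables keyed by the previous token (None = start).
STUDENT = {None: [0.5, 0.4, 0.1], 0: [0.6, 0.3, 0.1], 1: [0.3, 0.5, 0.2], 2: [0.4, 0.4, 0.2]}
TEACHER = {None: [0.7, 0.2, 0.1], 0: [0.8, 0.1, 0.1], 1: [0.2, 0.7, 0.1], 2: [0.3, 0.3, 0.4]}


def next_probs(table, prefix):
    return table[prefix[-1] if prefix else None]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rollout(length, rng):
    """Sample `length` tokens from the student.

    Returns (sum of exact per-token reverse KLs,
             sum of sampled log-ratios on the chosen tokens)."""
    prefix, exact, sampled = [], 0.0, 0.0
    for _ in range(length):
        s = next_probs(STUDENT, prefix)
        t = next_probs(TEACHER, prefix)
        exact += kl(s, t)                           # full-vocabulary KL at this position
        x = rng.choices(range(VOCAB), weights=s)[0]  # on-policy sample
        sampled += math.log(s[x]) - math.log(t[x])   # one-sample estimate of the same KL
        prefix.append(x)
    return exact, sampled


rng = random.Random(0)
results = [rollout(20, rng) for _ in range(5000)]
mean_exact = sum(e for e, _ in results) / len(results)
mean_sampled = sum(s for _, s in results) / len(results)
print(f"mean exact KL sum   = {mean_exact:.3f}")
print(f"mean sampled KL sum = {mean_sampled:.3f}")
```

The two averages estimate the same quantity. The sampled form only needs the log-probability of the chosen token from each model, which is what makes scoring a trajectory with a single teacher forward pass sufficient.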

02 Mode-seeking visualized

The difference between forward and reverse KL is most intuitive when the teacher has multiple modes.

[Figure: a bimodal teacher distribution (gray) with the forward KL student (blue) and the reverse KL student (amber) overlaid.]

The teacher distribution (gray) has two peaks — think of these as two valid reasoning strategies for solving a math problem. One uses algebraic manipulation, the other uses geometric insight.

Forward KL (blue): The student must cover both modes. It spreads probability mass broadly, assigning weight even in the valley between peaks. This is safe but unfocused — the student doesn’t commit to either strategy and may produce incoherent blends.

Reverse KL (amber): The student collapses onto one peak. It fully commits to one strategy and executes it well. The cost is zero coverage of the other mode — but in practice, one excellent strategy beats two mediocre ones.

For reasoning tasks, mode-seeking is ideal. A model that perfectly executes algebraic reasoning is more useful than one that half-heartedly attempts both approaches. On-policy distillation’s reverse KL naturally produces focused specialists.
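The mode-covering versus mode-seeking behavior can be reproduced with a small grid search. This is a sketch under stated assumptions: the teacher is a made-up mixture of two Gaussians at -2 and +2, the student is restricted to a single Gaussian, and both are discretized on a grid so the KLs are plain sums.

```python
import math


def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Discretization grid and a bimodal teacher (both illustrative choices).
xs = [-8 + 0.05 * i for i in range(321)]
teacher = normalize([0.5 * gaussian(x, -2, 0.5) + 0.5 * gaussian(x, 2, 0.5) for x in xs])


def student(mu, sigma):
    """Single-Gaussian student on the same grid."""
    return normalize([gaussian(x, mu, sigma) for x in xs])


# Candidate (mean, std) pairs for the student.
candidates = [(0.25 * i, s) for i in range(-12, 13) for s in (0.5, 1.0, 1.5, 2.0, 2.5)]

forward_best = min(candidates, key=lambda c: kl(teacher, student(*c)))  # KL(teacher || student)
reverse_best = min(candidates, key=lambda c: kl(student(*c), teacher))  # KL(student || teacher)

print("forward KL picks (mu, sigma) =", forward_best)
print("reverse KL picks (mu, sigma) =", reverse_best)
```

Forward KL selects a wide Gaussian centered in the valley between the peaks, while reverse KL selects a narrow Gaussian sitting on one of the two peaks (which one is decided by floating-point tie-breaking, since the setup is symmetric).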

02 Worked example

Let’s compute both KLs for a concrete three-token vocabulary to build intuition.

Suppose at some position, the teacher assigns probabilities $\pi_T = [0.7, 0.2, 0.1]$ over tokens [A, B, C]. The student currently has $\pi_S = [0.5, 0.4, 0.1]$.

Forward KL: $D_{\text{KL}}(\pi_T \| \pi_S)$

$$= 0.7 \log\frac{0.7}{0.5} + 0.2 \log\frac{0.2}{0.4} + 0.1 \log\frac{0.1}{0.1}$$ $$= 0.7 \times 0.3365 + 0.2 \times (-0.6931) + 0.1 \times 0$$ $$= 0.2355 - 0.1386 + 0 = 0.0969 \approx 0.097 \text{ nats}$$

Reverse KL: $D_{\text{KL}}(\pi_S \| \pi_T)$

$$= 0.5 \log\frac{0.5}{0.7} + 0.4 \log\frac{0.4}{0.2} + 0.1 \log\frac{0.1}{0.1}$$ $$= 0.5 \times (-0.336) + 0.4 \times 0.693 + 0.1 \times 0$$ $$= -0.168 + 0.277 + 0 = 0.109 \text{ nats}$$
Notice the asymmetry. Forward KL penalizes the student heavily for under-covering the teacher (putting too little mass where the teacher has a lot). Reverse KL penalizes the student heavily for putting mass where the teacher doesn’t. For distillation, reverse KL says: “Don’t generate tokens the teacher wouldn’t generate.”
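The worked example can be reproduced in a few lines. The helper name `kl` is just a convenience for this sketch; natural logs are used so the result is in nats, matching the hand calculation.

```python
import math


def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.7, 0.2, 0.1]  # pi_T over tokens [A, B, C]
student = [0.5, 0.4, 0.1]  # pi_S over tokens [A, B, C]

forward = kl(teacher, student)  # expectation under the teacher
reverse = kl(student, teacher)  # expectation under the student

print(f"forward KL = {forward:.3f} nats")
print(f"reverse KL = {reverse:.3f} nats")
```

Swapping the argument order is the only difference between the two calls, which makes the asymmetry easy to experiment with on other distributions.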

03 Algorithm overview

The full on-policy distillation loop in five steps.

At a high level, the algorithm repeats the following cycle:

  1. Sample from the student

    Given a batch of prompts (math problems, instructions, etc.), generate complete trajectories using the student’s current policy $\pi_\theta$. These are the reasoning traces the student would actually produce at inference time.

    On-policy: learns from states it visits
  2. Score with the teacher

    Feed each student-generated trajectory through the teacher model to get $\log \pi_{\text{teacher}}(x_t | x_{<t})$ at every token position. This is a single forward pass (no generation), so it’s cheap.

    Dense supervision: O(T) bits per trajectory
  3. Compute per-token advantages

    The “advantage” of each token is how much more the teacher likes it than the student does: $A_t = \log \pi_{\text{teacher}}(x_t | x_{<t}) - \log \pi_\theta(x_t | x_{<t})$. Positive means the teacher agrees; negative means the student generated something the teacher would not.

    Token-level credit assignment
  4. Policy gradient update

    Use the advantages to update the student via a policy gradient (or equivalent RL-style update like PPO/GRPO). Tokens the teacher likes get reinforced; tokens the teacher dislikes get suppressed.

    Same optimizer as RL, denser signal
  5. Repeat

    Each iteration, the student improves, so the next batch of trajectories is better. The teacher continues to provide guidance at the student’s current level — like a tutor adapting to the student’s progress.

    Curriculum emerges naturally
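The five steps above can be sketched on a toy problem where the "model" is a single softmax over three tokens and the teacher is a fixed distribution. This is a deliberately minimal illustration, not the blog's implementation: it uses a plain REINFORCE update instead of PPO/GRPO, and every name and hyperparameter here is an assumption.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student, teacher):
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def distill(teacher, n_iters=2000, batch_size=16, lr=0.05, seed=0):
    """Toy on-policy distillation of a categorical student toward `teacher`."""
    rng = random.Random(seed)
    logits = [0.0] * len(teacher)  # student starts uniform
    history = [reverse_kl(softmax(logits), teacher)]
    for _ in range(n_iters):
        probs = softmax(logits)
        grad = [0.0] * len(logits)
        for _ in range(batch_size):
            # Step 1: sample from the student (on-policy).
            x = rng.choices(range(len(probs)), weights=probs)[0]
            # Steps 2-3: teacher scores the sampled token; advantage = log-prob gap.
            advantage = math.log(teacher[x]) - math.log(probs[x])
            # Step 4: REINFORCE. d/dz_k log softmax(z)[x] = 1[k == x] - probs[k].
            for k in range(len(logits)):
                indicator = 1.0 if k == x else 0.0
                grad[k] += advantage * (indicator - probs[k]) / batch_size
        logits = [z + lr * g for z, g in zip(logits, grad)]
        # Step 5: repeat, tracking the reverse KL as we go.
        history.append(reverse_kl(softmax(logits), teacher))
    return softmax(logits), history


final_probs, history = distill([0.7, 0.2, 0.1])
print("student after training:", [round(p, 3) for p in final_probs])
print(f"reverse KL: {history[0]:.3f} -> {history[-1]:.5f}")
```

Because the advantage shrinks to zero as the student approaches the teacher, the gradient noise also vanishes near the optimum, and the student settles onto the teacher's distribution.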

03 Full pseudocode

on_policy_distillation.py (Python)
```python
import random

import torch


def on_policy_distillation(student, teacher, prompts, optimizer, n_iters,
                           batch_size=64, n_epochs=4, eps=0.2):
    """On-policy distillation with reverse KL.

    `student` and `teacher` are assumed to expose `generate(prompt, temperature)`
    and `log_probs(trajectories) -> Tensor[B, T]`. These are placeholder
    interfaces, not a specific library's API. Padding masks are omitted.
    """
    for _ in range(n_iters):
        # 1. Sample a batch of prompts
        batch = random.sample(prompts, k=min(batch_size, len(prompts)))

        # 2. Generate trajectories from the STUDENT (on-policy)
        trajectories = [student.generate(p, temperature=1.0) for p in batch]

        # 3. Score with teacher and student (forward passes only, no gradients)
        with torch.no_grad():
            teacher_logprobs = teacher.log_probs(trajectories)  # shape: [B, T]
            student_logprobs = student.log_probs(trajectories)  # shape: [B, T]

        # 4. Per-token advantages (negative per-token reverse KL estimate)
        advantages = teacher_logprobs - student_logprobs  # [B, T]

        # 5. Policy gradient update (PPO-style clipping)
        for _ in range(n_epochs):
            new_logprobs = student.log_probs(trajectories)  # tracks gradients
            ratio = torch.exp(new_logprobs - student_logprobs)  # importance weight
            clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
            loss = -torch.min(ratio * advantages, clipped * advantages).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

  • student.generate(): autoregressive sampling (the on-policy part)
  • teacher.log_probs(): a single forward pass, parallel across all positions
  • advantages: the token-level reverse KL signal (dense, informative)
  • ratio: importance sampling correction for multi-epoch updates
  • clipped: PPO-style trust region to prevent too-large updates

03 Data flow

How tokens flow through the system in a single iteration.

[Diagram: data flow for one iteration, showing the student path, the teacher path, and gradient flow.]

The diagram shows the two-phase process. Phase 1 (generation): prompts flow into the student, which autoregressively generates tokens. This is the expensive part — sequential, can’t be parallelized across positions. Phase 2 (scoring): the complete trajectory flows into both student and teacher for a single forward pass each. This is parallelizable. The resulting log-prob difference becomes the advantage signal that updates the student.

03 Implementation

A training configuration with realistic hyperparameters for distilling a 235B teacher into an 8B student:

config.yaml (YAML)
```yaml
# On-policy distillation config
student_model: "qwen3-8b"
teacher_model: "qwen3-235b-thinking"
dataset: "aime_24_train"

# Generation
max_gen_length: 32768       # Long reasoning traces
temperature: 1.0            # Full entropy for exploration
batch_size: 64              # Prompts per iteration
samples_per_prompt: 8       # Multiple rollouts per problem

# Optimization
n_epochs_per_iter: 4        # PPO-style multi-epoch
clip_eps: 0.2               # Trust region
learning_rate: 1e-6         # Conservative
total_iterations: 500       # ~1800 GPU hours on 32x H100
```
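A few run-level quantities follow directly from this config. The sketch below simply multiplies the numbers out; the dictionary keys mirror the config, and `gpus` and `gpu_hours` are taken from the trailing comment on `total_iterations`.

```python
config = {
    "batch_size": 64,           # prompts per iteration
    "samples_per_prompt": 8,    # rollouts per prompt
    "total_iterations": 500,
    "max_gen_length": 32768,    # tokens per rollout, upper bound
    "gpus": 32,                 # from the "32x H100" comment
    "gpu_hours": 1800,          # from the "~1800 GPU hours" comment
}

rollouts_per_iter = config["batch_size"] * config["samples_per_prompt"]
total_rollouts = rollouts_per_iter * config["total_iterations"]
max_generated_tokens = total_rollouts * config["max_gen_length"]
wall_clock_hours = config["gpu_hours"] / config["gpus"]

print(f"rollouts per iteration: {rollouts_per_iter}")
print(f"total rollouts:         {total_rollouts:,}")
print(f"max generated tokens:   {max_generated_tokens:,}")
print(f"wall-clock time:        {wall_clock_hours:.2f} hours")
```

The token count is an upper bound, since most rollouts stop before `max_gen_length`.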

03 Key implementation details

Why temperature = 1.0?

Lower temperature makes the student’s generations more deterministic, reducing exploration. But the whole point of on-policy training is to learn from the student’s actual distribution. Sampling at temperature 1.0 draws from that distribution unmodified, so the training trajectories match the distribution the reverse KL expectation is taken over.

Why multiple samples per prompt?

With 8 rollouts per problem, averaging the gradient across rollouts reduces its variance. We can also compare: which of the student’s attempts does the teacher prefer? This gives richer gradient signal than a single trajectory.

Teacher forward pass is cheap

The teacher never generates; it only scores existing sequences. A single forward pass through a 235B model on 32K tokens takes ~2 seconds on 8 GPUs, because all positions are scored in parallel. Compare this to the minutes of sequential decoding the teacher would need to generate a trace of that length itself. The teacher’s cost is roughly 10% of the total budget (about 200 of the 1,800 GPU hours in the Chapter 04 breakdown).

No reward model needed

Unlike RLHF, there’s no separate reward model to train, maintain, or worry about gaming. The teacher is the reward signal. This eliminates an entire source of instability and misalignment.

Compared to RL: same infrastructure (PPO optimizer, multi-GPU generation, advantage computation), but the advantage comes from teacher log-probs instead of a binary reward. The code change is small, and in the AIME’24 comparison in Chapter 04 the compute drops about 10x.

04 Headline results

On-policy distillation beats RL at a fraction of the cost.

The Thinking Machines team evaluated three methods on AIME’24 (American Invitational Mathematics Examination, 2024 problems). All methods start from the same base model (Qwen3-8B after mid-training on math data).

| Method | AIME'24 accuracy | GPU hours | Cost relative to RL |
| --- | --- | --- | --- |
| Off-policy distill (SFT) | 60.0% | ~600 | 0.03x |
| RL (GRPO) | 67.6% | 17,920 | 1.0x (baseline) |
| On-policy distill | 74.4% | 1,800 | 0.1x |

Let that sink in:

  • +14.4 points over off-policy: from 60.0% to 74.4%. Same teacher, but the training trajectories are generated by the student.
  • +6.8 points over RL: from 67.6% to 74.4%. Better performance, not just cheaper.
  • About 10x less compute than RL: 1,800 vs 17,920 GPU hours. On 32 GPUs, that is the difference between a long weekend (about 56 hours) and more than three weeks (560 hours).
Why does on-policy distillation actually beat RL, not just match it at lower cost? RL gets one binary outcome per trajectory (correct/incorrect). The teacher scores every token position, up to about $32{,}000$ per trajectory at the maximum generation length. With up to 32,000x more supervision targets per trajectory, the student learns faster per sample.
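The deltas and cost ratios quoted above can be recomputed from the results table. The figures are copied from the table; the dictionary layout and variable names are just for this sketch.

```python
# (AIME'24 accuracy in %, GPU hours), copied from the results table.
results = {
    "Off-policy distill (SFT)": (60.0, 600),
    "RL (GRPO)": (67.6, 17920),
    "On-policy distill": (74.4, 1800),
}

rl_acc, rl_hours = results["RL (GRPO)"]

# For each method: accuracy gap to RL (points) and compute relative to RL.
summary = {
    name: (round(acc - rl_acc, 1), round(hours / rl_hours, 2))
    for name, (acc, hours) in results.items()
}

for name, (delta, rel_cost) in summary.items():
    print(f"{name:26s} {delta:+.1f} points vs RL, {rel_cost:.2f}x RL compute")
```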

04 Benchmark comparison

[Chart: benchmark comparison of off-policy SFT, RL / GRPO, and on-policy distillation.]

04 Compute analysis

Where do the GPU hours actually go?

RL (17,920 GPU hours)

  • Generation: ~14,000 hrs — student generates thousands of full reasoning traces
  • Reward evaluation: ~1,500 hrs — checking correctness of final answers
  • Training: ~2,400 hrs — policy gradient updates

The bottleneck is generation. RL needs many more trajectories because each one provides at most 1 bit of signal. To get low-variance advantage estimates, you need many samples per problem (64 or more, versus 8 for on-policy distillation).

On-policy distillation (1,800 GPU hours)

  • Student generation: ~900 hrs — same as RL, but fewer samples needed
  • Teacher scoring: ~200 hrs — single forward pass per trajectory
  • Training: ~700 hrs — policy gradient updates

On-policy distillation needs far fewer trajectories (8 per prompt vs 64+ for RL) because each trajectory carries far more supervision. The teacher scoring step is comparatively cheap, about 200 of the 1,800 GPU hours, because scoring is a single parallel forward pass rather than sequential generation.
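To see where the hours go as fractions rather than absolute numbers, the two breakdowns above can be normalized. The hour figures are the approximate values from the lists; everything else is illustrative.

```python
# Approximate GPU hours per phase, from the two breakdowns above.
breakdown = {
    "RL": {"generation": 14000, "reward evaluation": 1500, "training": 2400},
    "On-policy distill": {"student generation": 900, "teacher scoring": 200, "training": 700},
}

# Fraction of each method's total spent in each phase.
shares = {
    method: {phase: hours / sum(phases.values()) for phase, hours in phases.items()}
    for method, phases in breakdown.items()
}

for method, phases in shares.items():
    print(method)
    for phase, share in phases.items():
        print(f"  {phase:20s} {share:6.1%}")
```

Generation takes roughly three quarters of the RL budget but only half of the distillation budget, and teacher scoring accounts for about a ninth of the distillation run.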

Information theory perspective: RL provides at most $1$ bit per trajectory (correct/incorrect). On-policy distillation provides up to $T \times \log_2(V)$ bits per trajectory, where $T$ is sequence length and $V$ is vocabulary size. For $T = 32{,}000$ and an assumed effective log-prob resolution of 10 bits/token, that’s 320,000 bits vs 1 bit. These are upper bounds on raw signal rather than a measure of what the student actually learns, but they explain why far fewer samples are needed.

04 Scaling behavior

How does performance change as we increase compute budget?

  • Off-policy SFT: Plateaus quickly. More teacher data doesn’t help beyond a point because compounding error dominates. Diminishing returns after ~500 GPU hours.
  • RL: Continues improving but slowly. The log-linear scaling is expensive — each percentage point costs progressively more compute.
  • On-policy distill: Steep initial improvement, then gradual convergence toward teacher performance. The ceiling is the teacher’s own accuracy (since we can’t exceed the supervision signal).

The practical ceiling for on-policy distillation on AIME’24 is approximately the teacher’s own score (Qwen3-235B-Thinking achieves ~80%). RL has no theoretical ceiling (it can discover strategies the teacher doesn’t know), but the practical ceiling within reasonable compute is lower.

05 The personalization problem

You fine-tune a model on your company’s internal docs. It gets smart about your domain. But it forgets how to follow instructions.

This is the catastrophic forgetting trade-off in mid-training. When you continue pre-training a model on domain-specific data (code, legal documents, internal wikis), you inject new knowledge. But you also overwrite some of the post-training polish — the instruction-following, safety guardrails, and formatting that make the model usable.

The traditional fix: run full post-training again after mid-training. But this is expensive (thousands of GPU hours of RLHF) and requires the original post-training data pipeline, which you may not have.

On-policy distillation offers a surgical alternative: use the original model as the teacher to restore the behaviors you lost.

05 Case study: Qwen3-8B

The Thinking Machines team demonstrated this with Qwen3-8B:

  1. Start: Qwen3-8B (post-trained)

    IFEval score: 83.2%. Strong instruction-following baseline.

  2. Mid-train on internal documents

    Model gains domain knowledge. IFEval drops to 71.4%. The model “forgets” how to follow formatting instructions, generates rambling answers, occasionally ignores constraints.

    Catastrophic forgetting in action
  3. On-policy distill from original Qwen3-8B

    Use the original (pre-mid-training) Qwen3-8B as teacher. Let the degraded model generate; teacher scores. IFEval recovers to 82.8%.

    Behavior recovered without losing domain knowledge
The key: the teacher doesn’t need to know the new domain. It just needs to know how to behave (follow instructions, format properly, refuse harmful requests). The student keeps its domain knowledge because on-policy distillation only adjusts how it expresses that knowledge, not what it knows.
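One way to summarize the case study is the fraction of the lost IFEval score that distillation recovered. The three scores are from the steps above; the function name is an illustrative choice.

```python
def recovered_fraction(original: float, degraded: float, restored: float) -> float:
    """Share of the score drop (original - degraded) that was won back."""
    return (restored - degraded) / (original - degraded)


# IFEval scores from the case study: post-trained, after mid-training, after distillation.
fraction = recovered_fraction(original=83.2, degraded=71.4, restored=82.8)
print(f"recovered {fraction:.1%} of the lost IFEval score")
```

By this measure the distillation step wins back nearly all of the 11.8-point drop, leaving a gap of 0.4 points to the original model.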

05 Behavior recovery visualized

[Chart: domain knowledge and instruction following (IFEval) over time, across the mid-training phase and the distillation phase.]

The chart shows both metrics over time. During mid-training (terracotta region), domain knowledge rises while instruction-following falls. During on-policy distillation (moss region), instruction-following recovers while domain knowledge is preserved.

Why is domain knowledge preserved? Because on-policy distillation uses reverse KL: it only asks “does the teacher approve of what the student generates?” The student still generates domain-specific content (it has that knowledge); the teacher just shapes how that content is structured and formatted.

05 Continual learning pattern

This suggests a powerful recipe for continual learning:

Phase 01: Mid-train on new data

Inject fresh domain knowledge. Accept temporary behavior degradation.

Phase 02: Distill from behavior teacher

Restore instruction-following, safety, formatting via on-policy distill from a clean checkpoint.

Phase 03: Repeat as needed

Each cycle adds knowledge and then restores behavior to near its original level (82.8% vs 83.2% IFEval in the case study above).

This “knowledge injection then behavior recovery” cycle can in principle be repeated. As long as you keep a clean checkpoint as the behavior teacher, you can always run the recovery step again. And because on-policy distillation is cheap (hundreds of GPU hours, not thousands), the recovery step doesn’t dominate your compute budget.

Think of it as: mid-training is learning new things and on-policy distillation is remembering how to behave. The model gets smarter and stays polite.

06 Information-theoretic view

Let’s quantify the supervision density difference more precisely.

RL signal per trajectory

A binary reward (correct/incorrect) provides at most 1 bit of information. For a trajectory of $T = 32{,}000$ tokens, that’s at most $\frac{1}{32{,}000}$ bits per token. The model must figure out which tokens contributed to success or failure: the credit assignment problem.

On-policy distillation signal per trajectory

The teacher log-probability at each position provides a continuous value. Assuming 16-bit floating point precision, that’s up to 16 bits per token position. For $T = 32{,}000$ tokens, that’s $32{,}000 \times 16 = 512{,}000$ bits per trajectory.

Signal density ratio $$\frac{\text{On-policy distill signal}}{\text{RL signal}} = \frac{T \times b_{\text{precision}}}{1} = \frac{32{,}000 \times 16}{1} = 512{,}000\times$$

Up to half a million times more raw signal per trajectory. This is an upper bound on raw signal, not on usable information, which is why the observed gain is about 10x in compute rather than 500,000x. Most of those savings come from needing fewer generated trajectories.

[Figure: signal per trajectory. RL: 1 bit / trajectory. On-policy distill: T × bits-per-token / trajectory.]
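The two sides of this ratio can be computed explicitly. A binary reward carries at most one bit; the exact amount is the binary entropy of the success rate, which reaches one bit only at 50%. The sketch below evaluates that entropy and the per-trajectory upper bound for the dense signal, using the 16 bits/token assumption from the text. Function names are illustrative.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Information (in bits) in one correct/incorrect outcome with success rate p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def dense_signal_upper_bound(tokens: int, bits_per_token: int) -> int:
    """Upper bound on raw per-trajectory signal when every token is scored."""
    return tokens * bits_per_token


rl_bits_at_half = binary_entropy_bits(0.5)    # best case for a binary reward
rl_bits_at_rl_acc = binary_entropy_bits(0.676)  # at the 67.6% accuracy from Chapter 04
dense_bits = dense_signal_upper_bound(32_000, 16)

print(f"binary reward at p = 0.5:   {rl_bits_at_half:.3f} bits")
print(f"binary reward at p = 0.676: {rl_bits_at_rl_acc:.3f} bits")
print(f"dense signal upper bound:   {dense_bits:,} bits")
```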

06 Teach vs discover

This gives us a clean mental model for when to use each approach:

| Condition | Best method | Reasoning |
| --- | --- | --- |
| Teacher exists, same distribution | Off-policy SFT | Cheapest. Compounding error acceptable for short outputs. |
| Teacher exists, long reasoning | On-policy distill | Avoids compounding error. Dense signal. Best cost/quality ratio. |
| No teacher, verifiable reward | RL (GRPO) | Must explore strategy space. Expensive but discovers novel solutions. |
| No teacher, no verifiable reward | RLHF with human prefs | Last resort. Most expensive, most brittle. |
The key principle: if a capability already exists in a teacher, distill it. If it must be discovered, use RL. RL is an exploration algorithm disguised as an optimization algorithm. On-policy distillation is an optimization algorithm with on-policy data.
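The four rows of the table reduce to a two-level decision. The function below is one possible encoding, not a rule from the blog: it assumes the first row ("same distribution") corresponds to short outputs where compounding error is tolerable, and the parameter names are invented for this sketch.

```python
def choose_method(has_teacher: bool, long_reasoning: bool, verifiable_reward: bool) -> str:
    """Pick a post-training method following the four-row table above."""
    if has_teacher:
        # A teacher exists: distill. Go on-policy when traces are long
        # enough for compounding error to matter.
        return "On-policy distill" if long_reasoning else "Off-policy SFT"
    # No teacher: the capability must be discovered.
    return "RL (GRPO)" if verifiable_reward else "RLHF with human prefs"


print(choose_method(has_teacher=True, long_reasoning=True, verifiable_reward=True))
print(choose_method(has_teacher=False, long_reasoning=True, verifiable_reward=True))
```

Note that `verifiable_reward` is ignored whenever a teacher exists, which mirrors the principle that an existing capability should be distilled rather than rediscovered.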

06 When RL is still needed

On-policy distillation has a hard ceiling: the teacher’s own capability. You cannot distill what the teacher doesn’t know. RL is irreplaceable when:

  • Novel strategies: The solution requires reasoning patterns not present in any existing model. RL can discover them through exploration.
  • Beyond teacher quality: You want the student to surpass the teacher (self-play, iterative improvement). Distillation converges to the teacher; RL has no ceiling.
  • Non-differentiable rewards: Code execution, tool use, physical simulation — any reward that requires running the output through an external system.
  • Human preference alignment: When the “correct” answer is subjective and defined by human ratings, not a teacher model.

The practical recommendation from the blog: use RL to train your teacher, then distill the teacher into students. RL is for the frontier; distillation is for deployment.

07 Decision cheat sheet

When to use which method — a practical guide.

| Scenario | Method | Why |
| --- | --- | --- |
| Short outputs, strong teacher | Off-policy SFT | Cheapest. Compounding error minimal for <512 tokens. |
| Long reasoning, strong teacher | On-policy distill | Avoids compounding error on multi-step chains. |
| No teacher, verifiable answers | RL (GRPO) | Must explore. Math/code have checkable solutions. |
| Behavior recovery after domain FT | On-policy distill | Restore behavior without forgetting new knowledge. |
| Push beyond teacher quality | RL | Distillation caps at teacher. RL has no ceiling. |
| Hybrid: balance cost and quality | RL then distill | RL discovers strategies, distill transfers them cheaply. |

07 References

  1. Kevin Lu. “On-Policy Distillation.” Thinking Machines Lab Blog, October 2025.
  2. Shao et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv, 2024. (Introduces GRPO.)
  3. Agarwal et al. “On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes.” ICLR, 2024.
  4. Schulman et al. “Proximal Policy Optimization Algorithms.” arXiv, 2017.
  5. Hinton et al. “Distilling the Knowledge in a Neural Network.” arXiv, 2015.
  6. Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” AISTATS, 2011. (DAgger, the original on-policy imitation learning algorithm.)
  7. Qwen Team. “Qwen3 Technical Report.” 2025.