On-Policy Distillation Survey (Tencent 2026)

Chapter 0: Why On-Policy?

You've trained a small model to imitate GPT-4's outputs. It sounds fluent on short prompts. Then you ask it to solve a 10-step math problem, and it falls apart after step 3. The answers are grammatically perfect but logically incoherent. Why?

The problem is deceptively simple: during training, your student model always conditioned on perfect prefixes — the teacher's token sequences. It never saw its own mistakes. At inference time, the student generates from its own imperfect prefixes, and the moment it makes a small error, it enters territory it has literally never encountered.

Off-policy distillation: copying perfection

Off-policy distillation means the student trains on data generated by someone else — typically the teacher model. The training signal is: "given this perfect prefix that the teacher wrote, predict the next token distribution." The student learns to be a good next-token predictor when everything before it is correct.

This works beautifully for short sequences. A single sentence, a quick factual answer — the student rarely drifts far enough from teacher territory to matter.

But for multi-step reasoning? Catastrophe.

The dashcam analogy

Imagine learning to drive by watching 10,000 hours of perfect dashcam footage. You see smooth lane changes, perfect braking distances, flawless parallel parking. You memorize it all.

Then someone puts you behind the wheel. You drift slightly left. Nothing in your training covered "recovering from a leftward drift" — you only ever saw the car perfectly centered. So you do something random. Now you're even further from any state you've seen. Panic compounds.

The core insight of on-policy distillation: Let the student practice on its own outputs during training. When it makes a mistake at step 3, the teacher still provides the signal for steps 4-10 — but critically, the student learns what to do after its own errors, not just after perfect prefixes. This is the difference between watching dashcam footage and practicing with an instructor in the passenger seat.

On-policy distillation: practicing your own mistakes

On-policy distillation means the student generates (some or all) prefixes itself, then receives the teacher's guidance on those self-generated contexts. The training distribution matches the inference distribution because both come from the student.

When the student drifts at step 3, it still gets teacher signal at step 4. It learns: "when I'm in this slightly-wrong state, here's how to recover." Over thousands of such episodes, the student builds robustness to its own failure modes.

Off-Policy vs On-Policy Training

Toggle between modes to see how the student behaves at inference time. In off-policy mode, the student has never seen its own errors — it enters unknown territory after a mistake. In on-policy mode, it has practiced recovery.

Why does a student model degrade more on long reasoning chains than short responses?

• Errors compound: each step conditions on potentially flawed previous output, pushing the student further into states it never saw during off-policy training
• Long sequences use more memory and the model runs out of context
• The teacher model itself is worse at long sequences

Chapter 1: Exposure Bias

The intuition from Chapter 0 has a precise mathematical formulation. Let's derive exactly WHY errors compound — and quantify how much worse off-policy is than on-policy.

The off-policy training objective

Standard knowledge distillation trains the student to match the teacher's output distribution, conditioned on teacher-generated prefixes:

L_off = E_x~D [ ∑_t=1^T D_KL( p_T(·|x, y_<t) ‖ p_θ(·|x, y_<t) ) ]

The critical detail: y_<t comes from the teacher. The student never conditions on its own generated tokens during training. At inference, however, it must condition on its own previous outputs.

Distribution mismatch

This creates a train-test distribution mismatch. During training, the student sees the distribution of prefixes p_T(y_<t). During inference, it sees p_θ(y_<t). These two distributions diverge more with each token generated.

Let ε be the per-step error probability — the chance that the student generates a token that deviates from the teacher's distribution at any given step. After T steps:

Off-policy error bound: O(εT²) — quadratic in sequence length
On-policy error bound: O(εT) — linear in sequence length

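These bounds come from the imitation learning analysis behind DAgger (Ross et al., 2011), originally developed for robotics: training only on expert states incurs O(εT²) error, while training on the learner's own state distribution reduces this to O(εT). The argument carries over to autoregressive language modeling, where the generated prefix plays the role of the state.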
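To see where the quadratic term comes from, here is a minimal sketch under a deliberately pessimistic toy assumption that is not part of the survey: an off-policy student leaves the teacher's prefix distribution with probability ε at each step and never recovers, while an on-policy student simply errs with probability ε at each step. The function names are illustrative.

```python
def expected_mistakes_off_policy(eps: float, T: int) -> float:
    """Expected wrong steps when a single drift is never recovered from.

    Step t is wrong if the student has drifted at any step <= t, which
    happens with probability 1 - (1 - eps)**t.
    """
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, T + 1))


def expected_mistakes_on_policy(eps: float, T: int) -> float:
    """Expected wrong steps when every step fails independently with prob eps."""
    return eps * T


def quadratic_upper_bound(eps: float, T: int) -> float:
    """Since 1 - (1 - eps)**t <= eps * t, the off-policy sum is at most eps*T*(T+1)/2."""
    return eps * T * (T + 1) / 2.0


if __name__ == "__main__":
    for T in (5, 10, 20):
        off = expected_mistakes_off_policy(0.05, T)
        on = expected_mistakes_on_policy(0.05, T)
        print(f"T={T:3d}  off-policy={off:6.2f}  on-policy={on:5.2f}")
```

While εT is small, the off-policy count grows roughly like εT²/2, which is the quadratic regime of the bound; the on-policy count stays at εT.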

Worked example: 10-step proof

Suppose your student has 95% per-step accuracy (ε = 0.05). For a 10-step reasoning chain:

Independent errors (best case): P(all correct) = 0.95¹⁰ = 60%. Already losing 40% of trajectories.

With compounding (off-policy): Each error makes subsequent errors more likely because the student enters unseen territory. The actual success rate drops well below 60%. The quadratic bound gives εT² = 0.05 × 100 = 5, which exceeds 1 and is therefore vacuous as a probability: the guarantee offers no protection, and errors can dominate completely.

With on-policy training: The student has practiced recovery. Errors at step 3 don't cascade because the student knows what to do from its own imperfect states. Accumulated error follows the linear bound εT = 0.05 × 10 = 0.5, which is much more controlled.

Why the quadratic scaling matters: For a 50-token response, quadratic gives O(ε × 2500) while linear gives O(ε × 50). That's a 50x difference. For modern chain-of-thought reasoning with 200+ tokens of "thinking," off-policy distillation is fundamentally broken — the gap between training and inference distributions grows too fast for the student to handle.
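The numbers in the worked example can be checked in a few lines. This is only arithmetic on the quantities quoted above (ε = 0.05, T = 10, and the 50-token comparison); the helper names are illustrative.

```python
def p_all_correct(eps: float, T: int) -> float:
    """Probability that T independent steps are all correct."""
    return (1.0 - eps) ** T


def bound_ratio(T: int) -> float:
    """Ratio of the quadratic term eps*T**2 to the linear term eps*T (eps cancels)."""
    return (T ** 2) / T


eps, T = 0.05, 10
independent_success = p_all_correct(eps, T)   # roughly 0.60
quadratic_term = eps * T ** 2                 # eps * 100
linear_term = eps * T                         # eps * 10
ratio_at_50_tokens = bound_ratio(50)          # quadratic vs linear at T = 50
```

Because ε cancels in the ratio, the gap between the two bounds is simply T: it keeps widening as responses get longer, whatever the per-step error rate is.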

The on-policy training objective

On-policy distillation replaces teacher prefixes with student-generated (or mixed) prefixes:

L_on = E_{x~D, y~π_mix} [ ∑_t=1^T D_f( p_T(·|x, y_<t), p_θ(·|x, y_<t) ) ]

Here π_mix is a mixture policy — typically λ · p_θ + (1-λ) · p_data — that blends student rollouts with ground-truth sequences. The divergence D_f can be forward KL, reverse KL, JSD, or any f-divergence (Chapter 2 explores which to choose).

Error Accumulation: Off-Policy vs On-Policy

Adjust per-step error ε and see how total error scales with sequence length T. The quadratic (off-policy) curve explodes while the linear (on-policy) curve remains manageable.

If per-step error ε=0.05 and sequence length T=20, what's the approximate off-policy error bound O(εT²)?

• 0.05 × 400 = 20, meaning massive accumulated error that dominates the output
• 0.05 × 20 = 1.0
• 0.05 × 20² × 20 = 400

Chapter 2: The f-Divergence Framework

Now we know WHY on-policy helps. The next question: what should the student optimize when training on its own outputs? The choice of divergence measure fundamentally shapes what the student learns.

The f-divergence family

An f-divergence is a general way to measure how different two probability distributions are. Every f-divergence has the form:

D_f(P ‖ Q) = E_y~Q[ f( P(y) / Q(y) ) ]

where f is a convex function with f(1) = 0. Different choices of f give different divergences — and each one tells the student to prioritize different aspects of the teacher distribution.
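As a quick sanity check of the definition, the sketch below evaluates D_f for discrete distributions and confirms that the generators f(u) = u log u and f(u) = -log u recover the two KL directions. The example distributions are made up for illustration and are assumed to be strictly positive.

```python
import math
from typing import Callable, Sequence


def f_divergence(p: Sequence[float], q: Sequence[float],
                 f: Callable[[float], float]) -> float:
    """D_f(P || Q) = sum_y Q(y) * f(P(y) / Q(y)), assuming P(y) > 0 and Q(y) > 0."""
    return sum(qy * f(py / qy) for py, qy in zip(p, q))


def f_forward_kl(u: float) -> float:
    return u * math.log(u)      # generator for D_KL(P || Q)


def f_reverse_kl(u: float) -> float:
    return -math.log(u)         # generator for D_KL(Q || P)


def kl(a: Sequence[float], b: Sequence[float]) -> float:
    """Direct definition of D_KL(A || B), used only for comparison."""
    return sum(ay * math.log(ay / by) for ay, by in zip(a, b))


teacher = [0.7, 0.2, 0.1]   # P, illustrative values
student = [0.3, 0.3, 0.4]   # Q, illustrative values
forward = f_divergence(teacher, student, f_forward_kl)
reverse = f_divergence(teacher, student, f_reverse_kl)
```

Both generators satisfy f(1) = 0, so the divergence vanishes when the student matches the teacher exactly.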

Forward KL: mode-covering

Forward KL (f(u) = u log u) measures D_KL(P_teacher || Q_student). The expectation is over the teacher's distribution: E_Q[(P/Q) log(P/Q)] = E_P[log(P/Q)]. This means: wherever the teacher has probability mass, the student MUST also have probability mass (or the divergence explodes).

Result: the student covers all modes of the teacher. If the teacher assigns probability to two different valid answers, the student spreads probability over both. But it also assigns probability to the space between the modes — hallucinating intermediate outputs that neither mode produces.

Forward KL = "zero-avoiding": The student cannot assign zero probability anywhere the teacher assigns nonzero. This forces broad coverage but causes hallucination in the gaps between modes. Imagine a teacher that can answer "42" or "forty-two" — forward KL makes the student also say "42-two" or "for2" with some probability.

Reverse KL: mode-seeking

Reverse KL (f(u) = -log u) measures D_KL(Q_student || P_teacher). The expectation is over the student's distribution: E_Q[-log(P/Q)] = E_Q[log(Q/P)]. This means: wherever the student has probability mass, the teacher must also (or the divergence explodes).

Result: the student seeks one mode and concentrates on it. It won't hallucinate between modes because it never places mass where the teacher doesn't. But it drops entire modes — if the teacher has two valid answers, the student picks one and ignores the other.

Reverse KL = "zero-forcing": The student can only place mass where the teacher does. This gives precise, high-confidence outputs but sacrifices diversity. For math (one right answer), this is perfect. For creative writing (many valid outputs), it's too restrictive.
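The mode-covering versus mode-seeking contrast can be reproduced on a toy example. The sketch assumes a made-up five-outcome bimodal teacher and two hand-picked candidate students, one broad and one concentrated on a single mode; none of these numbers come from the survey.

```python
import math


def kl(a, b):
    """D_KL(A || B) for strictly positive discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))


# Bimodal teacher: almost all mass on the first and last outcome.
teacher = [0.485, 0.01, 0.01, 0.01, 0.485]

candidates = {
    "covering": [0.2, 0.2, 0.2, 0.2, 0.2],        # spreads over everything
    "seeking": [0.96, 0.01, 0.01, 0.01, 0.01],    # commits to one mode
}

forward_kl = {name: kl(teacher, s) for name, s in candidates.items()}
reverse_kl = {name: kl(s, teacher) for name, s in candidates.items()}

best_under_forward = min(forward_kl, key=forward_kl.get)
best_under_reverse = min(reverse_kl, key=reverse_kl.get)
```

Forward KL ranks the broad student higher because missing the second mode is heavily penalized, while reverse KL ranks the single-mode student higher because mass placed between the modes is heavily penalized.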

JSD and α-divergence: interpolations

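The Jensen-Shannon Divergence (JSD) is a symmetric divergence that compares both distributions to their mixture M = (P+Q)/2: JSD(P, Q) = ½ D_KL(P ‖ M) + ½ D_KL(Q ‖ M). It is bounded in [0, log 2], so the loss stays finite even where one distribution assigns near-zero probability, which helps avoid gradient explosion.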

The α-divergence family provides a continuous interpolation parameter α between forward KL (α → 1) and reverse KL (α → 0). Setting α = 0.5 gives a symmetric divergence (proportional to the squared Hellinger distance) that, like JSD, sits between the two KL directions.

D_α(P ‖ Q) = (1 / (α(α-1))) · E_Q[ (P/Q)^α - 1 ]
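The limits quoted for the α-divergence and the log 2 bound for JSD can be verified numerically. This is a minimal sketch for discrete distributions; the example values are illustrative, and the α-divergence helper assumes strictly positive inputs with α strictly between 0 and 1.

```python
import math


def kl(a, b):
    """D_KL(A || B); terms with A(y) = 0 contribute nothing."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def alpha_divergence(p, q, alpha):
    """D_alpha(P || Q) = (E_Q[(P/Q)^alpha] - 1) / (alpha * (alpha - 1))."""
    moment = sum(qy * (py / qy) ** alpha for py, qy in zip(p, q))
    return (moment - 1.0) / (alpha * (alpha - 1.0))


p = [0.7, 0.2, 0.1]   # teacher, illustrative
q = [0.3, 0.3, 0.4]   # student, illustrative

near_forward = alpha_divergence(p, q, 0.999)   # approaches D_KL(P || Q)
near_reverse = alpha_divergence(p, q, 0.001)   # approaches D_KL(Q || P)
midpoint = alpha_divergence(p, q, 0.5)         # symmetric in P and Q
worst_case_jsd = jsd([1.0, 0.0], [0.0, 1.0])   # disjoint supports
```

Disjoint supports are the worst case for JSD and give exactly log 2, whereas either KL direction would be infinite there.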

Which divergence for which task?

Task Type	Best Divergence	Why
Math / Code	Reverse KL	One correct answer — mode-seeking concentrates on it
Creative writing	Forward KL	Many valid outputs — mode-covering preserves diversity
General chat	JSD / α-div	Balance between precision and coverage
Safety-critical	Reverse KL	Must not hallucinate outside teacher's support

Divergence Mode Behavior

A bimodal teacher (two valid answers). Slide from Forward KL to Reverse KL and watch how the student distribution changes. Forward KL spreads to cover both modes (hallucinating between). Reverse KL collapses onto one mode.

For a math problem with exactly one correct answer, which divergence should you prefer?

• Reverse KL: mode-seeking concentrates probability on the single correct solution without hallucinating between alternatives
• Forward KL: mode-covering ensures we don't miss the correct answer
• JSD: always use the symmetric option for safety

Chapter 3: Fixed Divergence Methods

Three papers crystallized the core approaches to on-policy distillation. Each picks a different point in the design space — and each reveals a different tradeoff.

GKD: Generalized Knowledge Distillation (Agarwal et al., 2024)

GKD is the simplest framework. Its key insight: make on-policy distillation modular. Separate the sampling policy (who generates the prefixes?) from the divergence (what does the student optimize?).

The sampling policy is a mixture:

π_mix = λ · p_θ + (1-λ) · p_data

When λ = 0, it's standard off-policy (train on ground-truth prefixes). When λ = 1, it's fully on-policy (train only on student's own generations). Values in between blend both — the student sees some perfect prefixes and some of its own imperfect ones.

GKD is divergence-agnostic: you can plug in forward KL, reverse KL, JSD, or any f-divergence. The sampling policy and the divergence are orthogonal choices.

GKD's contribution: Showed that even a simple λ=0.5 mixture (half on-policy, half off-policy) dramatically improves over pure off-policy distillation. The key was framing on-policy and off-policy as endpoints of a continuum rather than a binary choice.

MiniLLM: Reverse KL via REINFORCE (Gu et al., 2024)

MiniLLM goes all-in on reverse KL. The problem: reverse KL D_KL(p_θ || p_T) requires computing the gradient through the student's own sampling process (because the expectation is over p_θ). You can't just backpropagate through discrete token sampling.

MiniLLM's solution: treat it as a reinforcement learning problem. The "reward" for generating token y_t is:

r_t = log p_T(y_t | x, y_<t) - log p_θ(y_t | x, y_<t)

This is the log-probability ratio: how much more the teacher likes this token than the student does. High reward = the teacher strongly approves of something the student isn't yet confident about. The REINFORCE estimator uses this reward to compute policy gradients.

The catch: REINFORCE has high variance. The gradient estimates are noisy, requiring large batch sizes, careful baseline subtraction, and many training steps to converge. Training is unstable — the loss oscillates wildly before settling.

MiniLLM's tradeoff: Precise mode-seeking behavior (great for reasoning tasks) at the cost of training instability. You get sharp, confident outputs — but training requires 2-3x more compute than GKD to converge, and you need careful hyperparameter tuning to prevent divergence.

DistiLLM: Skewed KL for stability (Ko et al., 2024)

DistiLLM solves MiniLLM's instability problem with a clever trick: instead of minimizing D_KL(p_θ || p_T) directly, minimize D_KL(p_θ || p̃) where p̃ is a skewed mixture:

p̃ = α · p_T + (1-α) · p_θ

Why does this help? Reverse KL becomes unstable when the student places mass on a token to which the teacher assigns near-zero probability. The ratio p_θ/p_T → ∞, so the log-ratio inside the KL is unbounded and the gradient explodes.

By mixing in the student's own distribution, DistiLLM ensures p̃ never diverges too far from p_θ. Since p̃(y) = α · p_T(y) + (1-α) · p_θ(y) ≥ (1-α) · p_θ(y), the ratio p_θ/p̃ can never exceed 1/(1-α), no matter how small p_T(y) is. Gradient explosions are tamed.
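One way to see the stabilizing effect in the formulas above is to compare the log-ratio inside plain reverse KL with the log-ratio against the skewed target p̃. The sketch works on single-token probabilities with made-up values; it illustrates the bound implied by the mixture definition, not the DistiLLM implementation.

```python
import math


def log_ratio_reverse_kl(p_student: float, p_teacher: float) -> float:
    """log(p_theta / p_T): unbounded as the teacher probability goes to zero."""
    return math.log(p_student / p_teacher)


def log_ratio_skewed(p_student: float, p_teacher: float, alpha: float = 0.9) -> float:
    """log(p_theta / p_mix) with p_mix = alpha * p_T + (1 - alpha) * p_theta."""
    p_mix = alpha * p_teacher + (1.0 - alpha) * p_student
    return math.log(p_student / p_mix)


alpha = 0.9
ceiling = math.log(1.0 / (1.0 - alpha))   # p_mix >= (1 - alpha) * p_theta

p_student = 0.5
plain = [log_ratio_reverse_kl(p_student, p_t) for p_t in (1e-3, 1e-6, 1e-12)]
skewed = [log_ratio_skewed(p_student, p_t, alpha) for p_t in (1e-3, 1e-6, 1e-12)]
```

The plain log-ratio keeps growing as the teacher probability shrinks, while the skewed one saturates just below log(1/(1-α)), which is about 2.3 for α = 0.9.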

The design space progression: GKD showed that on-policy sampling helps (but didn't optimize the divergence choice). MiniLLM showed that reverse KL gives precise outputs (but suffers from gradient variance). DistiLLM showed that a skewed mixture target gives reverse-KL-like precision WITH stable gradients. Each paper solves the previous paper's weakness.

The tradeoff triangle

GKD — Simplicity

Modular, any divergence, easy to implement. But doesn't optimize the divergence choice itself.

↓

MiniLLM — Precision

Reverse KL gives sharp, mode-seeking outputs. But REINFORCE variance makes training unstable.

↓

DistiLLM — Stability

Skewed mixture prevents gradient explosion. Gets reverse-KL benefits without the instability.

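python
import random
import torch
import torch.nn.functional as F

# Assumed interfaces: generate(x) -> tokens, forward(x, y) -> logits [T, V],
# logprob(y, x) -> per-token log-probs [T]. ground_truth, compute_divergence
# and moving_avg are placeholder helpers.

# GKD: simple mixture sampling + any divergence
def gkd_loss(student, teacher, x, lam=0.5, div='jsd'):
    # Sample prefix from mixture policy
    if random.random() < lam:
        prefix = student.generate(x)   # on-policy
    else:
        prefix = ground_truth(x)       # off-policy

    with torch.no_grad():
        logits_t = teacher.forward(x, prefix)   # teacher logits, no gradient
    logits_s = student.forward(x, prefix)       # student logits
    return compute_divergence(logits_t, logits_s, div)

# MiniLLM: reverse KL via REINFORCE
def minillm_loss(student, teacher, x):
    y = student.generate(x)            # full on-policy rollout
    logp_s = student.logprob(y, x)
    with torch.no_grad():              # the reward is a constant w.r.t. the student
        r = teacher.logprob(y, x) - logp_s
        baseline = moving_avg(r)       # variance reduction
    return -((r - baseline) * logp_s).mean()

# DistiLLM: skewed mixture target
def distillm_loss(student, teacher, x, alpha=0.9):
    y = student.generate(x)
    with torch.no_grad():
        p_t = F.softmax(teacher.forward(x, y), dim=-1)
    logp_s = F.log_softmax(student.forward(x, y), dim=-1)
    p_s = logp_s.exp()
    p_mix = alpha * p_t + (1 - alpha) * p_s  # skewed target, mixed in probability space
    # D_KL(p_s || p_mix): the ratio p_s / p_mix never exceeds 1 / (1 - alpha)
    return (p_s * (logp_s - p_mix.log())).sum(-1).mean()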

GKD / MiniLLM / DistiLLM Comparison

See how each method's student distribution looks after training on the same bimodal teacher. Adjust λ (GKD's on-policy fraction) to see how more on-policy data improves the student.

What problem does DistiLLM's skewed mixture target solve?

• Gradient explosion from an unbounded probability ratio between student and teacher: the mixture ensures the target never diverges infinitely from the student
• The teacher model being too slow to query during training
• The student forgetting previously learned information

Chapter 4: Adaptive & RL-Augmented Objectives

Fixed divergences are one-size-fits-all. But what if different tokens need different treatment? A confident token where teacher and student agree needs less gradient pressure than a confused token at a reasoning branch point.

Consider a math proof: tokens like "therefore" are near-deterministic — both teacher and student assign 95%+ probability. But choosing between "integration by parts" vs. "substitution" is the crux. A single divergence treats both identically. That's wasteful at best, harmful at worst.

Adaptive Divergences

ToDi (Token-level Divergence routing) scores each token by its log-ratio r_t = log p_θ(y_t) − log p_T(y_t). If r_t > 0 (student overestimates), use RKL to pull it down. If r_t < 0 (student underestimates), use FKL to push it up. The routing is per-token and parameter-free.

AKL (Adaptive KL) blends FKL and RKL based on the head/tail gap of the teacher's distribution. Peaked teacher → RKL. Flat teacher → FKL. The weight is a smooth function of teacher entropy.

EOPD interpolates continuously: α_t = σ(H(p_T) − τ). High teacher entropy → more FKL. Low entropy → more RKL.

The intuition: FKL penalizes the student for assigning LOW probability where the teacher assigns HIGH — covers modes. RKL penalizes HIGH probability where teacher assigns LOW — picks one mode. At confident tokens, RKL forces commitment. At uncertain tokens, FKL prevents premature collapse.
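The entropy-gated interpolation α_t = σ(H(p_T) − τ) can be sketched for a single token position as follows. Blending the two directions as α_t · FKL + (1 − α_t) · RKL is an assumption made here for illustration, as are the example distributions and the default τ = 0.5; the survey only states that the weight shifts toward FKL as teacher entropy rises.

```python
import math


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fkl_weight(p_teacher, tau: float) -> float:
    """alpha_t = sigmoid(H(p_T) - tau): high teacher entropy means more forward KL."""
    return sigmoid(entropy(p_teacher) - tau)


def blended_token_loss(p_teacher, p_student, tau: float = 0.5) -> float:
    """Assumed blend: alpha_t * FKL + (1 - alpha_t) * RKL for one token position."""
    a = fkl_weight(p_teacher, tau)
    return a * kl(p_teacher, p_student) + (1.0 - a) * kl(p_student, p_teacher)


peaked_teacher = [0.97, 0.01, 0.01, 0.01]   # confident token, e.g. "therefore"
flat_teacher = [0.25, 0.25, 0.25, 0.25]     # branch point with several options
student = [0.4, 0.3, 0.2, 0.1]

w_peaked = fkl_weight(peaked_teacher, tau=0.5)
w_flat = fkl_weight(flat_teacher, tau=0.5)
```

With these values the peaked teacher leans toward reverse KL (weight below 0.5) and the flat teacher leans toward forward KL (weight above 0.5), matching the routing described above.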

The RL Equivalence

G-OPD proves a startling result: on-policy distillation IS reinforcement learning.

min_θ D_KL(P_θ || P_T) ⇔ max_θ { E_{y~P_θ}[R(y)] + H(P_θ) }

where R(y) = log P_T(y | x) — the teacher's log-probability IS the reward

Left side: minimize KL from student to teacher. Right side: maximize reward (teacher log-prob) plus entropy (exploration). Identical. Every RL trick — baselines, advantage estimation, PPO clipping — applies directly to distillation.
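The identity behind this equivalence is D_KL(P_θ || P_T) = −E_{y~P_θ}[log P_T(y)] − H(P_θ), so minimizing the left side and maximizing reward plus entropy select the same student. A numeric check on a toy single-step distribution (values are illustrative):

```python
import math


def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def expected_reward(p_student, p_teacher):
    """E_{y ~ P_theta}[R(y)] with R(y) = log P_T(y)."""
    return sum(q * math.log(p) for q, p in zip(p_student, p_teacher) if q > 0)


student = [0.5, 0.3, 0.2]
teacher = [0.6, 0.3, 0.1]

negative_kl = -kl(student, teacher)
reward_plus_entropy = expected_reward(student, teacher) + entropy(student)

# The objective is at most 0 and reaches 0 only when the student equals the teacher.
at_optimum = expected_reward(teacher, teacher) + entropy(teacher)
```

The two quantities agree up to floating-point error, and the objective peaks at zero when the student reproduces the teacher.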

Reward extrapolation: G-OPD adds external verifier rewards: R(y) = log P_T(y) + β · R_ext(y). This pushes the student BEYOND the teacher. If a math verifier confirms correctness, the student gets bonus reward for solutions the teacher never generated. Distillation becomes improvement, not just imitation.

Adaptive Divergence Routing

Each token routed to FKL or RKL based on teacher entropy. Drag the threshold to see routing change.

Method	Divergence	Key Innovation	Best For
GKD	Fixed (any)	λ-interpolation	General
ToDi	Adaptive FKL↔RKL	Per-token routing via log-ratio	Mixed tasks
G-OPD	KL-constrained RL	Reward extrapolation beyond teacher	Reasoning
AOPD	PG/FKL switch	Advantage-sign routing	Math proofs

Why is the equivalence between distillation and RL important?

• It proves distillation is always better than RL
• It means we can use RL tricks (baselines, advantage estimation, reward shaping) and even push the student beyond the teacher
• It shows the teacher must be trained with RL first

Chapter 5: Signal Sources — Where Does Feedback Come From?

We've discussed WHAT to optimize. Now: where does the teaching signal come from? The answer determines everything — compute budget, quality ceiling, deployment constraints.

Think of it as a spectrum. At one end, full access to the teacher's brain (every logit). At the other end, just a thumbs-up or thumbs-down on complete outputs.

White-Box: Full Logit Access

You run the teacher and get its full output distribution at every token position. Richest signal: you know exactly how much probability the teacher assigns to every possible next token. Cost: a full teacher forward pass for every sample, with the teacher resident in memory (~140 GB of weights for a 70B model in 16-bit precision).

Cross-family challenge: Different tokenizers mean logits aren't aligned. DSKD and ULD solve this via shared vocabulary projection or distribution-level matching.

Black-Box: API-Only Access

Only generated text or scalar scores. Reality when distilling from GPT-4 or Claude. GAD trains a discriminator. OVD uses the teacher to SELECT among student candidates.

Key insight from OVD: "On-policy exploration itself supplies a substantial fraction of the learning signal — the teacher's role is closer to SELECTING among student trajectories than CORRECTING individual tokens."

Self-Distillation: No Teacher At All

The model teaches itself by exploiting asymmetries in its own capabilities:

Privileged information: Model is better WITH extra context. Distill that advantage: p(y|x, hints) → p(y|x).
Self-play: Current model vs. previous version. DPO until convergence.
External feedback: Verifier (math checker, code runner) scores outputs. GRPO optimization.

Signal Source Comparison

Three paradigms. Click to highlight data flow. Toggle cost overlay.

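python
import torch.nn.functional as F

# teacher_api, log_prob and model are placeholder interfaces, not a real library API.
def token_kl(teacher_logits, student_logits):  # forward KL(teacher || student), mean over tokens
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction='none').sum(-1).mean()

# White-box: exact KL at every token (dense signal)
loss = token_kl(teacher_logits, student_logits)  # logits: [B, T, V]

# Black-box: only outcome reward (sparse signal)
reward = teacher_api.score(student_output)  # scalar
loss = -reward * log_prob(student_output)   # REINFORCE

# Self-distillation: model teaches itself
good = model.generate(prompt, with_hints=True)               # rollout produced with hints
hinted_logits = model(prompt + hints, good).logits.detach()  # frozen "teacher" view
plain_logits = model(prompt, good).logits                    # same tokens, no hints
loss = token_kl(hinted_logits, plain_logits)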

When would you choose self-distillation over white-box distillation?

• When you want fastest convergence regardless of cost
• When no larger teacher exists, or compute budget prohibits running a teacher for every sample
• When the student is already larger than the teacher

Chapter 6: Self-Distillation Deep Dive

The most surprising finding: you don't always need a teacher. A model can improve itself by exploiting asymmetries in its own capabilities.

Privileged Information (PI)

OPSD: Condition on ground-truth during training. The model with answers generates better reasoning chains. Distill: p(reasoning | x, answer) → p(reasoning | x). The model learns to reason well WITHOUT seeing the answer.

CRISP: Concise prompt as PI — saves 57% tokens at inference while maintaining quality.

GATES: Full document as PI → distill to model with only a summary.

Self-Play

SPIN: At iteration k, generate y_k. DPO with "chosen" = the reference (ground-truth) response, "rejected" = y_k. Update. Repeat. Converges when p_θ = p_ref (distributions indistinguishable, DPO signal = zero).
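For concreteness, the pairwise objective used in such a round can be sketched with the standard DPO loss on sequence log-probabilities. Treating the previous iterate as the frozen policy that supplies the `ref_*` values, the value β = 0.1, and the example numbers are all assumptions for illustration.

```python
import math


def dpo_pair_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where the margin compares how much the policy
    has raised the chosen response versus the rejected one, relative to a frozen
    reference policy."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Start of a round: the policy equals the frozen copy, so the margin is zero.
loss_at_start = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# After an update that raises the chosen response and lowers the model's own sample.
loss_after_update = dpo_pair_loss(-3.0, -7.0, -5.0, -5.0)
```

At zero margin the loss equals log 2. Once the model's own samples are statistically indistinguishable from the chosen responses, there is no consistent direction left to push the margin, which is the saturation described above.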

π-Play: Multi-agent variant — several models co-evolve, playing against each other rather than a fixed reference.

External Feedback (Verifier-Driven)

SD-ZERO: Generate N solutions, verifier gives binary reward (correct/incorrect), optimize with GRPO. No teacher at all — just trial and error with automated grading. This is essentially what DeepSeek-R1-Zero does.
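The group-relative step in GRPO can be sketched as follows: the N solutions to one prompt are scored by the verifier, and each reward is normalized by the group's mean and standard deviation. Using the population standard deviation and the small `eps` guard are implementation assumptions here; the rewards are made-up binary outcomes.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sample relative to the other samples for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight sampled solutions, two of which the verifier marks as correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# If every sample gets the same reward, there is no learning signal.
no_signal = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct solutions receive positive advantages and incorrect ones negative, so the policy gradient pushes probability toward whatever the verifier accepted. A group that is entirely correct or entirely wrong contributes nothing.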

Self-distillation's dirty secret: saturation. Once outputs match the reference, self-play produces zero gradient. SPIN saturates in 3-5 iterations. Industrial systems use it as ONE stage: SFT → self-distillation (cheap gains) → RL with verifier (push beyond) → optional white-box from the improved model back to a smaller one.

Self-Play Convergence

Current model (warm) vs. reference (teal). Click "Play Round" to iterate. Watch KL drop to zero (saturation).

Approach	Signal Source	Assumption	Risk
OPSD	Ground-truth PI	GT available at train	PI leakage
SPIN	Previous iteration	Stable reference	Saturation after 3-5 rounds
SD-ZERO	Binary verifier	Correctness checkable	Reward hacking
GATES	Document gating	Long context helps	Not always true

What causes self-play distillation to eventually stop improving?

• Learning rate decays too much
• Verifier runs out of test cases
• When distributions match, the DPO/contrastive signal is zero: there is no distinguishable difference to learn from

Chapter 7: Training Dynamics & Stabilization

On-policy training is inherently unstable. The data distribution shifts with every gradient step. The student explores bad regions and gets noisy gradients. How do Qwen3 and DeepSeek-V4 make this actually work?

Token Weighting: Not All Tokens Are Equal

TIP (Token Importance Profiling) scores each token on a 2D quadrant: teacher entropy (is the teacher confident?) × student-teacher divergence (does the student disagree?). The golden tokens: teacher is confident but student is confused. These get highest weight.

SCOPE uses surprise weighting: tokens where student assigns very low probability but teacher assigns high get upweighted by their log-ratio.

The filtering insight: You can discard 50-70% of tokens during training (the easy ones where student already agrees) with NO quality loss — often improved quality, because you eliminate noisy gradients from ambiguous tokens.
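A minimal sketch of importance-based token filtering, assuming the score is "divergence × teacher confidence" with forward KL as the divergence and one minus normalized entropy as the confidence. These concrete choices, the threshold, and the three example tokens are illustrative assumptions rather than the exact TIP or SCOPE formulas.

```python
import math


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def token_importance(p_teacher, p_student) -> float:
    """Divergence times teacher confidence (confidence = 1 - normalized entropy)."""
    confidence = 1.0 - entropy(p_teacher) / math.log(len(p_teacher))
    return kl(p_teacher, p_student) * confidence


def select_tokens(teacher_dists, student_dists, threshold: float):
    """Indices of token positions whose importance reaches the threshold."""
    return [i for i, (pt, ps) in enumerate(zip(teacher_dists, student_dists))
            if token_importance(pt, ps) >= threshold]


teacher_dists = [
    [0.90, 0.05, 0.05],   # teacher confident
    [0.90, 0.05, 0.05],   # teacher confident
    [0.34, 0.33, 0.33],   # teacher unsure
]
student_dists = [
    [0.34, 0.33, 0.33],   # student confused: the "golden" case
    [0.90, 0.05, 0.05],   # student already agrees
    [0.50, 0.30, 0.20],   # disagreement on an ambiguous token
]
kept = select_tokens(teacher_dists, student_dists, threshold=0.3)
```

Only the first position survives: the second carries no divergence, and the third is discounted because the teacher itself is unsure.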

Curriculum (PACED)

Start with easy samples (low teacher perplexity), gradually increase difficulty. Beta-kernel sampling from the difficulty frontier — the difficulty slider auto-adjusts as the student improves.

Compute Optimization

On-policy is 3-8× more expensive (generate + score + train). Solutions:

Method	Cost	Trick	Speedup
Naive	3-5×	None	1×
FOPD	1.5×	Prefix truncation	2-3×
Lightning-OPD	1.2×	Offline caching	4×
NPD	0.6×	Async + ΔI-IFD	8.1×

The Flawed Prefix Trap

If the student generates a terrible prefix (off-topic, degenerate repetition), teacher feedback on it is meaningless noise. Detection: perplexity threshold, length filter, reward floor.
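The three detection rules can be combined into a simple gate applied before a rollout is sent to the teacher. Every threshold below is a hypothetical default, and which model supplies the per-token log-probabilities (for example the teacher scoring the student's prefix) is an assumption left to the caller.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_rollout(token_logprobs: Sequence[float], reward: float,
                 max_ppl: float = 50.0, min_len: int = 4, max_len: int = 2048,
                 reward_floor: float = -5.0) -> bool:
    """Return True if the rollout passes the length, perplexity and reward checks."""
    n = len(token_logprobs)
    if n < min_len or n > max_len:          # length filter
        return False
    if perplexity(token_logprobs) > max_ppl:  # perplexity threshold
        return False
    return reward >= reward_floor           # reward floor


plausible = keep_rollout([-0.5] * 20, reward=0.0)       # fluent, normal length
garbled = keep_rollout([-6.0] * 20, reward=0.0)         # very high perplexity
too_short = keep_rollout([-0.5] * 2, reward=0.0)        # fails the length filter
low_reward = keep_rollout([-0.5] * 20, reward=-10.0)    # below the reward floor
```

Filtering before scoring also saves teacher compute, since rejected rollouts never need a teacher forward pass.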

Token Importance Heatmap

Tokens scored by importance (divergence × teacher confidence). Drag threshold — gray tokens get skipped.

Industrial recipe: Off-policy warmup (stable, cheap) → On-policy refinement (student practices on own outputs) → RL exploration (push beyond teacher). All frontier labs converge on this 3-phase pipeline.

Why is the off-policy warmup phase important before switching to on-policy?

• If the student is too weak, self-generated outputs are noise: on-policy training on garbage produces garbage gradients (death spiral)
• Off-policy training is faster for saving compute
• The teacher needs time to warm up its KV cache

Chapter 8: OPD Explorer

Let's put it all together. You're going to run an on-policy distillation pipeline and see every design choice interact in real time.

The teacher has a bimodal distribution — two valid answer modes. The student starts broad and must learn to match. Pick a method, set sequence length, and watch.

On-Policy Distillation Simulator

Top: teacher (fixed) vs student (evolving). Bottom: error accumulation over T. Pick a method and train.

What to try:
• Set T=5, Off-Policy. Works fine.
• Set T=40, Off-Policy. Collapses. Switch to GKD λ=1.0 — recovers.
• Self-Play: converges fast then saturates.
• MiniLLM: jittery but precise on one mode.

At short sequences (T<10), all methods work equally well. As T grows, compounding error separates them: off-policy O(εT²) explodes, on-policy O(εT) stays manageable. This is why every frontier lab uses on-policy for reasoning models.

Chapter 9: Connections & Open Problems

On-Policy Distillation sits at the intersection of three converging fields: knowledge distillation, reinforcement learning, and imitation learning.

The Unified View

Perspective	Sampling	Objective	"Teacher"
Knowledge Distillation	Student rollouts	D_f(P_T || P_θ)	Larger model
RLHF / DPO	Student rollouts	Reward maximization	Human preferences
Imitation Learning	Learner trajectories	Expert correction	Expert policy
Self-Play	Self-generated	Improve over reference	Previous self

The convergence is not just conceptual — it's infrastructural. OPD and RLHF need the same pipeline: rollout generation, scoring, filtering, gradient updates. They share 90% of code.

Core Equations (Cheat Sheet)

Concept	Equation	When to Use
Off-policy KD	∑ D_KL(p_T || p_θ) on teacher prefixes	Short sequences, same family
On-policy (GKD)	E_y~πmix[∑ D_f(p_T, p_θ)]	General purpose, moderate compute
RL equivalence	min D_KL(P_θ || P_T) ⇔ max E[log P_T] + H	When you want reward shaping
Exposure bias	Off = O(εT²), On = O(εT)	Always — this is WHY OPD

Open Problems

Distillation scaling laws. How does quality scale with teacher size, student size, rollout budget? No clean scaling laws exist yet.
Agent-level distillation. Multi-step tool-use chains require credit assignment across steps — fundamentally harder than single-turn.
Uncertainty-aware feedback. Teacher should abstain on OOD student prefixes rather than hallucinate confident noise.
Distillation + RL unification. They share 90% infrastructure. GKD and DPO are almost identical — both optimize divergences with different signal sources.

Industrial Adoption

Qwen3: Off-policy warmup → GKD on-policy → GRPO exploration
DeepSeek-V4: Multi-teacher R-KL OPD replacing their mixed RL stage entirely
Gemma 2: KD in pre-training with on-policy logit distillation
MiMo-V2: Multi-teacher + SCOPE token weighting → self-play refinement

The key prediction: As reasoning chains grow longer (100+ steps), the O(εT²) vs O(εT) gap becomes so large that on-policy training is non-negotiable. Every frontier lab is already there. The question is no longer "should we use OPD?" but "how do we make it 10× cheaper?"

What unifies knowledge distillation and reinforcement learning?

• Both sample from the student/agent and use external signal (teacher logits / reward) to improve, differing only in signal density and source
• Both require human preference data
• Both use the same loss function with different labels
