Aman's AI Journal • Primers • Knowledge Distillation

Overview
Distillation Taxonomy
Foundations
Offline Distillation
Online Distillation
Off-Policy Distillation
On-Policy Distillation (OPD)
Self-Distillation (SD)
Multi-Teacher Distillation
Reinforcement Learning-Distillation Hybrids
Comparative Analysis
Implementation Patterns
Decision Guide for Choosing a Distillation Method
References
Citation

Overview

Distillation is the problem of transferring useful behavior from one distribution into another without paying the full inference, serving, or training cost of the original source. In classical knowledge distillation, this source is usually a larger teacher model whose softened predictions help train a smaller student, as in Distilling the Knowledge in a Neural Network by Hinton et al. (2015). In modern Large Language Model (LLM) post-training, the same idea has expanded from compression into a broader capability-transfer primitive: stronger models, specialist teachers, RL checkpoints, self-teachers, verifiers, tool environments, and user feedback can all become sources of training signal.
The central problem is no longer only “How can a small student imitate a large teacher?” It is also “Which behavior should the student imitate, on which trajectories, and under which information state?” This matters because autoregressive LLMs train and deploy on trajectory distributions. A model that is trained only on fixed teacher or dataset traces may not know how to recover from its own mistakes at inference. Reinforcement learning addresses this by training on the student’s own rollouts, but often supplies only sparse scalar rewards. On-policy distillation tries to combine the best of both: the student generates the trajectory, while a teacher supplies dense token-level feedback on the exact prefixes the student visited, as formalized in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024).
The key organizing axes are: whether the teacher is fixed or changing; whether trajectories come from data, teachers, or the student; whether supervision is hard labels, soft logits, sampled-token log-probabilities, hidden states, rewards, or feedback-conditioned signals; and whether the teacher is external, self, multi-teacher, or environment-derived. These axes are orthogonal. A frozen teacher can score on-policy student rollouts, a co-trained peer can operate on fixed data, and a self-teacher can be either helpful or harmful depending on whether its signal tracks reward.
The main thesis of this primer is that dense supervision is valuable only when it points in the right direction. OPD works when the teacher is both reward-improved and close enough to the student that its local token preferences are meaningful on student-visited prefixes. The reward-tilted teacher view in Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t makes this condition explicit: a good teacher assigns higher log-ratio to higher-reward trajectories. Naive privileged self-distillation can fail because the self-teacher may instead up-weight a feedback-aware response shape, such as citing a hidden reference solution, rather than reward-bearing reasoning.
The practical consequence is that distillation should be treated as both an algorithmic choice and a systems choice. A good recipe must decide the trajectory source, teacher source, divergence direction, log-probability payload, masking rules, rollout freshness policy, reward anchor, and evaluation suite. In reasoning and agentic settings, final accuracy is not enough: evaluations should also track out-of-distribution behavior, hallucinated privileged context, epistemic verbalization, teacher-student support overlap, rollout length, truncation, repetition, and whether teacher log-ratios correlate with reward.
The next section gives the detailed roadmap and taxonomy that motivates the formal foundations. It expands the overview into the main distillation families, the offline/online and off-policy/on-policy axes, divergence choices, self-distillation caveats, the RL connection, multi-teacher consolidation, and implementation implications before the primer turns to the mathematical foundations.

Distillation Taxonomy

Definition

Distillation is a training paradigm in which a student model is optimized to reproduce useful behavior from a teacher model, usually to obtain a model that is cheaper, faster, smaller, easier to deploy, or more specialized than the teacher. The canonical formulation was popularized in Distilling the Knowledge in a Neural Network by Hinton et al. (2015), which showed that a student can learn from the teacher’s softened output probabilities rather than only from hard labels.
In modern LLMs, distillation should also be understood as a deployment and capability-transfer strategy, not only as a compression trick. Larger models move the capability frontier upward, while distillation attempts to move the resulting capability back down the cost frontier for practical inference, high-volume serving, edge deployment, or domain specialization. The Magic of LLM Distillation - Rishabh Agarwal, Google DeepMind frames this cost-performance tradeoff as one of the central reasons distillation remains important in modern post-training.
At a high level, distillation replaces or augments ordinary supervised learning with a matching objective between teacher and student distributions. For a teacher distribution \(p_T\) and student distribution \(p_S^\theta\) over labels or next tokens, a standard token-level objective is:
\[\mathcal{L}_{KD}(\theta) =\mathbb{E}_{(x,y)} \left[ \sum_{t=1}^{|y|} D\left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x, y_{<t}) \right) \right]\]
- where \(D\) is usually forward KL, reverse KL, Jensen-Shannon divergence, cross-entropy on sampled teacher outputs, or a task-specific hybrid. In language models, the conditioning context includes the prompt \(x\) and the partial output \(y_{<t}\), so distillation is fundamentally about matching next-token behavior under particular trajectories.
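As a minimal sketch of this objective, assuming \(D\) is forward KL and that both models expose full next-token distributions as plain lists, the per-trajectory loss is the sum of per-position divergences. The function names and the toy three-token vocabulary are illustrative assumptions, not part of any cited implementation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def trajectory_kd_loss(teacher_dists, student_dists):
    """Sum of forward KL(p_T || p_S) over the positions of one trajectory."""
    return sum(kl_divergence(pt, ps) for pt, ps in zip(teacher_dists, student_dists))

# Toy example: 3-token vocabulary, 2 positions (values are made up).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]
loss = trajectory_kd_loss(teacher, student)
print(f"token-level KD loss: {loss:.4f}")
```

The loss is zero only when the student matches the teacher at every position, which is why the choice of trajectory matters: the same student can have low loss on one set of prefixes and high loss on another.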

Classical Distillation Families

Classical distillation has several major families. Logit or soft-label distillation matches the teacher’s probability distribution directly, often with temperature scaling. Sequence-level distillation trains on full outputs generated by the teacher, as introduced for neural machine translation in Sequence-Level Knowledge Distillation by Kim and Rush (2016), where teacher-generated translations serve as simplified targets for the student. Representation distillation matches hidden states, attention maps, embeddings, or intermediate features, which is common in encoder models such as DistilBERT by Sanh et al. (2019), which combines language-modeling, distillation, and cosine-distance losses to compress BERT.
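The temperature scaling used in soft-label distillation can be illustrated with a small sketch. It assumes raw logits are available as a list; the function name and example logits are hypothetical. Raising the temperature flattens the distribution, so more of the teacher's relative preferences among non-top classes become visible to the student.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]                      # made-up teacher logits
sharp = softmax_with_temperature(logits, 1.0)
soft = softmax_with_temperature(logits, 4.0)
print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```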
A practical modern distinction is that sequence-level KD makes the student imitate teacher-produced trajectories, while on-policy distillation makes the student produce the trajectory and then asks a teacher to rescore the student’s actual prefixes. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) formalizes this as Generalized Knowledge Distillation, and the tutorial-style explanation in “On-policy distill for LLMs typically works best with reverse KL” emphasizes that the student’s own trajectory is the object being corrected rather than replaced.

Offline, Online, and Semi-Online Distillation

Distillation can also be categorized by whether the teacher is fixed or co-trained. Offline distillation uses a pretrained, frozen teacher and trains a student from stored or live teacher outputs; this is the standard teacher-student setting used by most classical KD systems. Online distillation trains multiple students, peers, or teacher-like supervisors simultaneously, so the teaching signal evolves during training rather than coming from a fixed teacher. Deep Mutual Learning by Zhang et al. (2017) is a canonical online distillation method in which peer networks learn collaboratively and teach each other throughout training. Co-distillation and online mutual learning therefore differ from ordinary offline KD not because they necessarily change the loss, but because the teacher distribution is non-stationary and coupled to the student’s optimization.
Semi-online distillation sits between offline and online regimes. One common form keeps a strong pretrained teacher but periodically adapts or updates an auxiliary teacher, supervisor, or student ensemble. Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) studies the empirical performance gap between offline and online distillation and attributes much of the benefit of online methods to reversed student-to-teacher transfer rather than only to simultaneous training.

On-Policy and Off-Policy Distillation

Core Distinction

On-policy and off-policy distillation classify methods according to the source of the trajectories on which the distillation loss is computed, rather than according to whether the teacher is frozen or co-trained. This distinction is especially important for autoregressive language models, where each token changes the future contexts that the model will encounter during generation.
In off-policy distillation, the student is trained on sequences generated by an external source, such as a human-labeled dataset, teacher-generated completions, or another model’s rollouts. The student does not determine the contexts on which it is supervised. Classical supervised knowledge distillation, sequence-level distillation, and most synthetic-data pipelines fall into this category. Sequence-Level Knowledge Distillation by Kim and Rush (2016) introduced teacher-generated translations as simplified sequence targets for neural machine translation, while standard supervised KD as described in DistilBERT by Sanh et al. (2019) combines language-modeling, distillation, and representation-level losses to compress BERT.
In on-policy distillation, the student first samples its own trajectories and then receives dense teacher supervision along those exact rollouts. The training data distribution therefore evolves with the student. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) formalizes this idea as Generalized Knowledge Distillation (GKD), which treats distillation as an imitation-learning problem and trains on student-generated sequences rather than only fixed datasets.

Generalized Knowledge Distillation Objective

The core GKD procedure samples a student trajectory with probability \(\lambda\) and otherwise falls back to dataset trajectories, then minimizes a divergence between teacher and student token distributions over the resulting sequences. This interpolates continuously between purely off-policy and purely on-policy training.
Formally, the on-policy objective can be written as:
\[\mathcal{L}_{OPD}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ D\left( p_T \,\Vert\, p_S^\theta \right)(\hat{y},x) \right]\]
- The following expansion makes the shorthand \(D(p_T\,\Vert\,p_S^\theta)(\hat{y},x)\) explicit as a sum of token-level divergences along the student-generated rollout \(\hat{y}\):
  \[\mathcal{L}_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
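The \(\lambda\)-mixing step can be sketched as follows. This is a hedged illustration rather than the GKD reference code: `choose_training_sequence` and `toy_student` are hypothetical names, and the toy student stands in for real autoregressive sampling.

```python
import random

def choose_training_sequence(prompt, dataset_output, sample_from_student, lam, rng):
    """With probability lam use a fresh student rollout, else the dataset output."""
    if rng.random() < lam:
        return sample_from_student(prompt), "student"
    return dataset_output, "dataset"

def toy_student(prompt):
    # Hypothetical stand-in for sampling a completion from the student model.
    return prompt + " -> student draft"

rng = random.Random(0)
sources = [
    choose_training_sequence("2+2=?", "4", toy_student, lam=0.5, rng=rng)[1]
    for _ in range(1000)
]
student_fraction = sources.count("student") / len(sources)
print(f"fraction of on-policy sequences: {student_fraction:.2f}")
```

Setting `lam=0.0` recovers purely off-policy training on dataset sequences, and `lam=1.0` recovers purely on-policy training; whichever sequence is chosen is then scored token by token against the teacher.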

Why On-Policy Supervision Helps

The key advantage is that the student is trained in the same types of contexts it will encounter during inference, mitigating exposure bias and compounding errors. On-Policy Distillation describes this as combining the on-policy relevance of reinforcement learning with the dense per-token supervision of distillation.
A useful intuition is provided by the chess analogy from On-Policy Distillation: off-policy distillation is like watching grandmaster games, where the learner observes strong moves only in expert-visited positions, while on-policy distillation is like having an engine annotate every move in the learner’s own games, identifying precisely which moves were strong and which were errors.
A complementary intuition is the targeted correction view in Dwarkesh Patel’s recorded discussion with Sasha Rush: when a rollout contains a localized mistake, such as an invalid tool call, OPD can discourage the specific mistaken action instead of spreading a sparse final reward over the whole trajectory.

Reward-Tilted Teacher View

The central condition for OPD is not merely that the supervision is dense. The teacher must also be better than the student on the downstream reward while remaining close enough to the student’s distribution that its token-level preferences are locally imitable. Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t frames this as a reward-tilted teacher condition: OPD improves reward when the teacher assigns higher probability to higher-reward trajectories while staying near the current student policy.
This condition can be expressed through the closed-form optimum of KL-regularized reward maximization. Let \(s\) denote a full trajectory, let \(R(s)\) be the downstream reward, and let \(\pi_k^S\) be the current student policy held fixed. The reward-tilted teacher is:
\[\pi_T^*(s) = \frac{1}{Z} \pi_k^S(s) \exp(\beta R(s)), \qquad Z = \mathbb{E}_{s \sim \pi_k^S} \left[ \exp(\beta R(s)) \right]\]
- where \(\beta\) controls the strength of reward tilting and \(Z\) normalizes the distribution.
If the teacher is exactly this reward-tilted policy and its gradient is stopped, reverse-KL distillation decomposes into a reward-seeking term plus a trust-region term:
\[D_{KL} \left( \pi^S \,\Vert\, \pi_T^* \right) = D_{KL} \left( \pi^S \,\Vert\, \pi_k^S \right) - \beta \mathbb{E}_{s \sim \pi^S} \left[ R(s) \right] + \log Z\]
- Since \(\log Z\) does not depend on the optimized student, minimizing the reverse-KL loss is equivalent to increasing expected reward while staying close to the current policy. This explains why strong same-family teachers and RL-trained expert teachers are often useful OPD teachers: they tend to be reward-improved while still close enough to the student to provide meaningful local token supervision.
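The decomposition can be checked numerically on a toy problem. The sketch below assumes a space of only four complete trajectories with made-up probabilities and binary rewards; it builds the reward-tilted teacher from the fixed policy and compares both sides of the identity for an arbitrary candidate student.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Toy setting: four whole trajectories (all values are illustrative).
pi_k = [0.4, 0.3, 0.2, 0.1]      # current student policy, held fixed
reward = [0.0, 1.0, 0.0, 1.0]    # downstream reward R(s)
beta = 2.0                        # strength of reward tilting

# Reward-tilted teacher: pi_k(s) * exp(beta * R(s)) / Z
Z = sum(p * math.exp(beta * r) for p, r in zip(pi_k, reward))
pi_teacher = [p * math.exp(beta * r) / Z for p, r in zip(pi_k, reward)]

pi_s = [0.25, 0.35, 0.15, 0.25]  # arbitrary candidate student policy
expected_reward = sum(p * r for p, r in zip(pi_s, reward))

lhs = kl(pi_s, pi_teacher)
rhs = kl(pi_s, pi_k) - beta * expected_reward + math.log(Z)
print(f"lhs={lhs:.6f} rhs={rhs:.6f}")
```

Both sides agree up to floating-point error for any valid `pi_s`, which is the sense in which matching this teacher under reverse KL trades off expected reward against distance from the current policy.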

Why On-Policy Is Not Enough

This also clarifies why “on-policy” alone is not sufficient. A privileged-context self-teacher may score the student’s own rollouts densely, but if it does not behave like a reward-tilted version of the student, the dense signal can point in the wrong direction. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models by Zhao et al. (2026) shows how OPSD uses privileged reasoning context to provide dense token-level supervision, while Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t emphasizes that such a teacher can instead up-weight a feedback-aware response shape rather than reward-bearing reasoning.
A practical diagnostic is therefore to ask whether the teacher’s log-ratio over student rollouts tracks downstream reward. For a trajectory \(s\), the teacher-student log-ratio is:
\[\Delta_T(s) = \log \pi_T(s) - \log \pi_S(s)\]
- A useful OPD teacher should assign larger \(\Delta_T(s)\) to higher-reward trajectories. If \(\Delta_T(s)\) instead tracks a response template, such as citing absent feedback or asserting a reference solution that the student does not actually have at inference time, the objective can preserve the on-policy data source while losing reward alignment.
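One way to run this diagnostic, sketched under the assumption that per-token log-probabilities from both models and a scalar reward are logged for each student rollout, is to correlate \(\Delta_T(s)\) with reward. The record layout and all numbers below are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def log_ratio(teacher_token_logps, student_token_logps):
    """Delta_T(s): trajectory log-probability under teacher minus under student."""
    return sum(teacher_token_logps) - sum(student_token_logps)

# Made-up student rollouts with per-token log-probs and a scalar reward.
rollouts = [
    {"teacher": [-0.2, -0.3], "student": [-0.9, -1.0], "reward": 1.0},
    {"teacher": [-0.4, -0.5], "student": [-0.8, -0.7], "reward": 1.0},
    {"teacher": [-1.5, -1.2], "student": [-0.6, -0.5], "reward": 0.0},
    {"teacher": [-2.0, -0.9], "student": [-0.7, -0.9], "reward": 0.0},
]
deltas = [log_ratio(r["teacher"], r["student"]) for r in rollouts]
rewards = [r["reward"] for r in rollouts]
corr = pearson(deltas, rewards)
print(f"corr(Delta_T, reward) = {corr:.3f}")
```

A clearly positive correlation is consistent with a reward-aligned teacher on these rollouts; a correlation near zero or negative suggests the log-ratio is tracking something else, such as a response template.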

Implementation and Divergence Details

From an implementation perspective, off-policy distillation is simpler because teacher outputs can be precomputed and reused. On-policy distillation is more computationally demanding because student rollouts and teacher evaluations must be generated repeatedly during training. However, it often delivers superior performance on long-horizon reasoning tasks because it teaches the student to recover from its own mistakes rather than only imitate ideal trajectories. Distilling 100B+ Models 40x Faster with TRL demonstrates practical infrastructure for large-scale OPD, including generation buffers, batched teacher queries, and compressed binary log-probability transfer to make 100B+ teachers tractable.
A subtle but important mathematical point is that most practical OPD implementations follow the GKD-style supervised objective over sampled student prefixes, rather than differentiating through the full student sampling distribution as a sequence-level reverse-KL policy-gradient objective. The post “I think there’s some confusion about what on-policy distillation (OPD) loss actually optimizes” distinguishes this common OPD objective from MiniLLM: Knowledge Distillation of Large Language Models by Gu et al. (2023), which replaces standard forward KL with reverse KL and derives an on-policy optimization approach for distilling generative language models.
A stop-gradient view of the common implementation is:
\[\nabla_\theta \mathcal{L}_{OPD} \approx \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim \operatorname{sg}(p_S^\theta(\cdot \mid x))} \left[ \nabla_\theta \sum_t D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(\operatorname{sg}\) denotes stop-gradient through the sampling process. This makes standard OPD closer to DAgger-like supervised learning over student-visited states than to full policy-gradient optimization through the student’s sequence distribution.

Relationship to Offline and Online Distillation

On-policy and off-policy are orthogonal to the offline/online distinction. A frozen teacher can be used in either regime, and multiple co-trained peers can still operate off-policy if they supervise each other on fixed datasets. In practice:
- Offline + Off-Policy: Classical teacher-student distillation with precomputed teacher outputs.
- Offline + On-Policy: Modern OPD with a frozen teacher scoring student rollouts.
- Online + Off-Policy: Deep Mutual Learning on shared minibatches.
- Online + On-Policy: Co-trained models supervising one another on their own generated trajectories.
This separation is conceptually important because most recent advances in LLM post-training, including Generalized Knowledge Distillation, On-Policy Self-Distillation, and Multi-Teacher On-Policy Distillation (MOPD), use frozen teachers and are therefore offline in teacher update pattern while simultaneously on-policy in trajectory generation.

Evaluation Implications

OPD evaluations should include more than in-distribution accuracy. For reasoning and agentic settings, the same validation pass should track reward, hallucinated feedback or fabricated references, epistemic-verbalization rate, and out-of-distribution accuracy. These metrics distinguish a teacher that improves student-visited states from a teacher that merely teaches the student a privileged-context reasoning style.

Relationship Between Offline/Online and Off-Policy/On-Policy Distillation

Offline and online distillation are related to, but distinct from, off-policy and on-policy distillation. Offline vs. online describes the training-time relationship between teacher and student: frozen teacher vs. concurrently trained teacher or peers. Off-policy vs. on-policy describes the trajectory source: external data or teacher trajectories vs. student-generated rollouts.
Thus, classical offline KD is usually off-policy, because the student trains on fixed human, dataset, or teacher trajectories. However, online distillation can still be off-policy if peers exchange predictions on fixed batches, and offline distillation can be on-policy if a frozen teacher scores rollouts generated by the current student. This is exactly the setup used in on-policy LLM distillation: the teacher can remain frozen, but the data distribution changes because trajectories are sampled from the student.
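A useful taxonomy therefore starts from these two axes, teacher update pattern and trajectory source, and adds two more, target type and teacher identity: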

Axis	Main options	What it determines
Teacher update pattern	Offline, online, semi-online	Whether the teacher is frozen, co-trained, or partially adapted
Trajectory source	Off-policy, on-policy	Whether sequences come from datasets or teachers, or from the student
Target type	Hard, soft, feature, preference, reward-like	Whether supervision is tokens, logits, hidden states, preferences, or dense advantages
Teacher identity	External, self, multi-teacher, peer ensemble	Whether knowledge comes from another model, the same model, several models, or co-learners
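The axes in the table can be encoded as a small configuration object, which makes it explicit that a recipe is a point in this space rather than a single label. The class and field names below are hypothetical and only mirror the table rows.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DistillationRecipe:
    """One point in the taxonomy; each field mirrors a row of the table."""
    teacher_update: str      # "offline", "online", or "semi-online"
    trajectory_source: str   # "off-policy" or "on-policy"
    target_type: str         # e.g. "hard", "soft", "feature", "preference"
    teacher_identity: str    # e.g. "external", "self", "multi-teacher", "peer ensemble"

    def quadrant(self):
        """Name the recipe by its two core axes."""
        return f"{self.teacher_update} + {self.trajectory_source}"

classical_kd = DistillationRecipe("offline", "off-policy", "soft", "external")
modern_opd = DistillationRecipe("offline", "on-policy", "soft", "external")
print(classical_kd.quadrant())
print(modern_opd.quadrant())
```

The two example recipes share a frozen external teacher and differ only in trajectory source, which is the case the surrounding text highlights.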

Off-Policy and On-Policy Distillation for Autoregressive LLMs

For autoregressive LLMs, the most important modern distinction is not only what is matched, but where the trajectories come from. Off-policy distillation trains the student on trajectories produced by a teacher, a dataset, or another external policy. On-policy distillation trains the student on its own rollouts and asks the teacher to score the student’s actual visited states.
This distinction is central because autoregressive errors compound: a student that deviates early at inference may enter contexts it never saw during fixed-dataset distillation. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) formalizes this as Generalized Knowledge Distillation, using teacher feedback on student-generated sequences to reduce train-inference mismatch.
The following figure (source) shows the distinction between off-policy and on-policy distillation: off-policy training uses teacher-generated completions, whereas on-policy training samples the student’s own rollouts and evaluates those exact rollouts with the teacher.

Support Overlap and Locality

A practical OPD failure mode is poor support overlap between the student’s visited prefixes and the teacher’s reliable local distribution. One useful diagnostic is top-\(k\) local overlap:
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k p_S(\cdot \mid s_t) \cap \operatorname{Top}_k p_T(\cdot \mid s_t) \right| }{k}\]
- where \(s_t=(x,\hat{y}_{<t})\) is a student-visited state. High overlap means the teacher and student disagree within a shared menu of plausible next tokens; low overlap means the teacher may be scoring prefixes outside the region where its distribution is meaningful for that student.
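A minimal sketch of the overlap diagnostic, assuming both models' next-token probabilities at a student-visited state are available as lists indexed by token id. The helper names and toy distributions are illustrative.

```python
def top_k_indices(probs, k):
    """Indices of the k largest probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])

def top_k_overlap(student_probs, teacher_probs, k):
    """|Top_k(p_S) ∩ Top_k(p_T)| / k at a single student-visited state."""
    shared = top_k_indices(student_probs, k) & top_k_indices(teacher_probs, k)
    return len(shared) / k

# Made-up next-token distributions over a 5-token vocabulary.
student = [0.40, 0.30, 0.15, 0.10, 0.05]
teacher_close = [0.35, 0.40, 0.05, 0.15, 0.05]   # same shortlist, different order
teacher_far = [0.02, 0.03, 0.05, 0.40, 0.50]     # disjoint shortlist

print(top_k_overlap(student, teacher_close, k=2))
print(top_k_overlap(student, teacher_far, k=2))
```

In practice the statistic would be averaged over many student-visited states and tracked during training; a single state is shown here only to make the definition concrete.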
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe by Li et al. (2026) argues that successful OPD depends on compatible teacher-student thinking patterns and progressive alignment over a small set of high-probability tokens that can carry most of the probability mass. Ravid Shwartz Ziv’s Post applies the same lens to MOPD systems, emphasizing that full-distribution matching is safer when teachers and students share lineage or local support, while sampled-token scoring can be more robust when teacher and student distributions are farther apart.
In this view, OPD is not merely dense supervision. It is local dense supervision on the states that the student actually visits. If those states are outside the teacher’s useful support, then asking the student to match the teacher’s full distribution can amplify irrelevant or noisy preferences. If the student and teacher share lineage, training recipe, tokenizer, or reasoning style, distribution-level supervision is more likely to be useful because local token menus overlap.

Divergence Choice: Forward KL, Reverse KL, and JSD

Distillation is not defined only by the teacher, student, and trajectory source. It is also defined by the direction of the matching objective. For language models, this choice determines whether the student is encouraged to cover the teacher’s full distribution, avoid actions the teacher assigns low probability to, or trade off both behaviors. In on-policy settings, this choice becomes especially important because the loss is evaluated on prefixes the student actually visits, so the divergence determines how teacher feedback is converted into local token updates.

Forward KL

Forward KL is written as:
\[D_{KL}(p_T \,\Vert\, p_S) = \sum_x p_T(x) \log \frac{p_T(x)}{p_S(x)}\]
Forward KL is teacher-weighted. It penalizes the student when it assigns low probability to tokens or modes that the teacher considers likely. This makes it mean-seeking: the student is encouraged to cover the teacher’s distribution rather than select only one high-probability region.
In classical supervised KD, forward KL is often natural because the teacher distribution is treated as the target distribution. Distilling the Knowledge in a Neural Network by Hinton et al. (2015) introduced this soft-target view, where the student learns from the teacher’s full probability distribution rather than only from hard labels.
In large-vocabulary LLMs, forward KL usually requires teacher probabilities over many tokens, often full vocabulary or teacher top-\(k\) probabilities. This can make it more expensive than sampled-token objectives, especially when the teacher is a much larger model served remotely.
Forward KL is strongest when teacher and student have good local support overlap. If the student is visiting prefixes where the teacher’s top-\(k\) distribution is meaningful and relevant, matching the teacher distribution can transfer rich information. If the student is far outside the teacher’s reliable local support, full-distribution matching can overemphasize teacher preferences that are not useful for the student’s actual rollout.

Reverse KL

Reverse KL is written as:
\[D_{KL}(p_S \,\Vert\, p_T) = \sum_x p_S(x) \log \frac{p_S(x)}{p_T(x)}\]
Reverse KL is student-weighted. It penalizes tokens or trajectories that the student assigns probability to when the teacher assigns them low probability. This makes it mode-seeking: the student is encouraged to avoid teacher-disfavored regions and concentrate probability mass on high-confidence regions.
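The difference between the two directions shows up on a toy example. The sketch below assumes a teacher with two strong modes and compares a broad student with a student that commits to one mode; all distributions are made up for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

teacher = [0.48, 0.01, 0.02, 0.01, 0.48]          # two strong modes
broad_student = [0.2, 0.2, 0.2, 0.2, 0.2]         # covers everything
peaked_student = [0.96, 0.01, 0.01, 0.01, 0.01]   # commits to one mode

forward_broad = kl(teacher, broad_student)     # forward KL(p_T || p_S)
forward_peaked = kl(teacher, peaked_student)
reverse_broad = kl(broad_student, teacher)     # reverse KL(p_S || p_T)
reverse_peaked = kl(peaked_student, teacher)

print(f"forward KL: broad={forward_broad:.3f} peaked={forward_peaked:.3f}")
print(f"reverse KL: broad={reverse_broad:.3f} peaked={reverse_peaked:.3f}")
```

Forward KL scores the broad student better because it leaves no teacher mode uncovered, while reverse KL scores the peaked student better because it places almost no mass where the teacher's probability is low.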
In sampled-token OPD, reverse KL is especially natural because the student has already generated the token being evaluated. The teacher does not need to transmit a full distribution; it can score the sampled token under the student-visited prefix.
A common token-level signal is the teacher-student log-probability gap:
\[A_t^{OPD} = \log p_T(\hat{y}_t \mid s_t) - \log p_S(\hat{y}_t \mid s_t)\]
- where \(s_t = (x, \hat{y}_{<t})\) is the student-visited prefix and \(\hat{y}_t\) is the token the student sampled there. This is dense token-level feedback, but it is only useful when the teacher’s score reflects downstream task quality on that prefix.
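The per-token signal needs only one log-probability per position from each model. A minimal sketch, assuming those log-probabilities have already been gathered for the sampled tokens (the values below are invented, with the third token standing in for a localized mistake):

```python
def opd_token_advantages(teacher_logps, student_logps):
    """Per-token gap log p_T(token | prefix) - log p_S(token | prefix)."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]

# Log-probs of the tokens the student actually sampled (made-up values).
teacher_logps = [-0.1, -0.2, -5.0, -0.3]
student_logps = [-0.3, -0.4, -0.5, -0.3]

advantages = opd_token_advantages(teacher_logps, student_logps)
worst_position = advantages.index(min(advantages))
print("advantages:", [round(a, 2) for a in advantages])
print("most penalized position:", worst_position)
```

Summing the per-token gaps gives the trajectory-level log-ratio, so the same payload supports both token-level updates and trajectory-level diagnostics.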
MiniLLM: Knowledge Distillation of Large Language Models by Gu et al. (2023) uses reverse KL for generative language-model distillation and argues that it better avoids overestimating low-probability regions of the teacher distribution, improving long-form generation compared with standard forward-KL distillation.
The TRL writeup Distilling 100B+ Models 40x Faster with TRL emphasizes the systems consequence of this distinction: forward-KL approximations usually need teacher-selected top tokens, while reverse-KL approximations can be based on student-selected tokens.

Reward-Tilted Teacher View

The sequence-level view makes the role of divergence more explicit. For a full trajectory \(s\), define the teacher-student log-ratio:
\[\Delta_T(s) = \log \pi_T(s) - \log \pi_S(s)\]
A useful on-policy teacher should assign larger \(\Delta_T(s)\) to higher-reward trajectories. Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t frames this as the central condition behind successful OPD: the teacher should be better on the downstream reward while remaining close enough to the student for its local preferences to be learnable.
This condition has a KL-regularized reward interpretation. Let \(\pi_k^S\) be the current student policy held fixed, and let \(R(s)\) be the downstream reward. The reward-tilted teacher is:
\[\pi_T^*(s) = \frac{1}{Z} \pi_k^S(s) \exp(\beta R(s)), \qquad Z = \mathbb{E}_{s \sim \pi_k^S} \left[ \exp(\beta R(s)) \right]\]
- where \(\beta\) controls the strength of reward tilting and \(Z\) normalizes the distribution.
If the teacher is this reward-tilted policy and its gradient is stopped, reverse-KL distillation decomposes into a reward objective plus a policy-proximity penalty:
\[D_{KL} \left( \pi^S \,\Vert\, \pi_T^* \right) = D_{KL} \left( \pi^S \,\Vert\, \pi_k^S \right) - \beta \mathbb{E}_{s \sim \pi^S} \left[ R(s) \right] + \log Z\]
Since \(\log Z\) does not depend on the optimized student, minimizing this reverse-KL objective is equivalent to increasing expected reward while staying near the current policy. This explains why strong same-family models and RL-trained expert models are often effective OPD teachers. They are typically close enough to the student to provide meaningful local token scores, but sufficiently reward-improved to point the student toward better trajectories.
The same analysis explains why dense supervision can still be harmful. If the teacher is not reward-tilted, token-level supervision can update the student in the wrong direction. A privileged-context self-teacher can assign high probability to a response shape that appears to use feedback, a gold answer, or a reference solution, even when that shape does not correspond to higher downstream reward.

Jensen-Shannon Divergence

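Jensen-Shannon divergence provides a middle ground between forward and reverse KL. The generalized form below uses a mixing weight \(\beta \in (0,1)\), which is unrelated to the reward-tilting strength \(\beta\) used earlier; \(\beta = 1/2\) recovers the standard symmetric JSD: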
\[D_{JSD}(p_T \,\Vert\, p_S) = \beta D_{KL} \left( p_T \,\Vert\, m \right) + (1-\beta) D_{KL} \left( p_S \,\Vert\, m \right), \qquad m = \beta p_T + (1-\beta)p_S\]
JSD can be more stable because it bounds the divergence and interpolates between teacher-weighted and student-weighted behavior. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) treats GKD as a flexible framework where the divergence between teacher and student can be chosen based on the task, student capacity, and teacher-student mismatch.
JSD is useful when pure forward KL may over-cover teacher modes and pure reverse KL may collapse too aggressively onto a subset of modes. However, it still depends on the same underlying teacher-quality condition: the teacher distribution must encode behavior worth imitating on the student’s visited prefixes.
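The generalized JSD can be sketched directly from its definition. The function below assumes discrete distributions given as lists and a mixing weight strictly between 0 and 1; the names are illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def generalized_jsd(p_teacher, p_student, beta=0.5):
    """beta * KL(p_T || m) + (1 - beta) * KL(p_S || m), m = beta*p_T + (1-beta)*p_S."""
    m = [beta * pt + (1.0 - beta) * ps for pt, ps in zip(p_teacher, p_student)]
    return beta * kl(p_teacher, m) + (1.0 - beta) * kl(p_student, m)

p_t = [0.7, 0.2, 0.1]   # made-up teacher distribution
p_s = [0.5, 0.3, 0.2]   # made-up student distribution

symmetric = generalized_jsd(p_t, p_s, beta=0.5)
disjoint = generalized_jsd([1.0, 0.0], [0.0, 1.0], beta=0.5)
print(f"JSD(0.5) = {symmetric:.5f}")
print(f"JSD(0.5) for disjoint supports = {disjoint:.5f} (log 2 = {math.log(2):.5f})")
```

Unlike either KL direction, the value stays finite even when the two distributions have disjoint support, which is the boundedness property referred to above. As the weight approaches 0, the divergence divided by the weight approaches forward KL, and as it approaches 1, the divergence divided by one minus the weight approaches reverse KL.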

Practical Selection Rules

Forward KL: Forward KL is preferable when the teacher and student have strong local support overlap and the goal is broad coverage of teacher modes.
Reverse KL: Reverse KL is preferable when training on sampled student tokens, when communication budget favors sampled-token log-probabilities, or when the goal is to penalize student-proposed actions that the teacher considers unlikely.
JSD: JSD is preferable when stability matters and the teacher-student mismatch is large enough that either pure KL direction may be brittle.
Sampled-token objectives: Sampled-token objectives are preferable when full-distribution matching is expensive or when teacher and student distributions are far enough apart that teacher top-\(k\) probabilities may include modes irrelevant to student-visited prefixes.
General guidance: In reasoning and agentic settings, divergence choice should be evaluated together with teacher quality. A reverse-KL sampled-token objective can still fail if the teacher assigns high probability to the wrong response pattern, and full forward-KL matching can fail if the teacher’s distribution is locally unreliable on the student’s prefixes. The most useful diagnostic is whether \(\Delta_T(s)\) increases with reward, not merely whether the teacher is larger, privileged, or confident.
The following figure (source) shows forward KL and reverse KL, including their different weighting behavior and their mean-seeking vs. mode-seeking tendencies.
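The selection rules above can be made concrete on a toy next-token distribution. The sketch below is illustrative only: the probabilities are made up, and `generalized_jsd` assumes a mixture-based form with weight `beta` (the symmetric Jensen-Shannon divergence at `beta=0.5`), which may not match every paper's parameterization.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(teacher, student, beta=0.5):
    """beta * KL(teacher || m) + (1 - beta) * KL(student || m), with m the mixture."""
    m = [beta * t + (1 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, m) + (1 - beta) * kl(student, m)

# Toy distributions over a 3-token vocabulary (made-up numbers).
teacher = [0.48, 0.48, 0.04]   # two modes
student = [0.90, 0.05, 0.05]   # collapsed onto one mode

forward_kl = kl(teacher, student)   # teacher-weighted: penalizes the missed teacher mode
reverse_kl = kl(student, teacher)   # student-weighted: penalizes mass the teacher dislikes
jsd = generalized_jsd(teacher, student)

print(f"forward KL={forward_kl:.3f}  reverse KL={reverse_kl:.3f}  JSD={jsd:.3f}")
```

In this toy case the forward KL is the larger of the two because the student has dropped one teacher mode, while the symmetric JSD stays below \(\log 2\).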

Self-Distillation and On-Policy Self-Distillation

Self-distillation is a distillation family in which the teacher is not a separate larger model. Instead, the teaching signal comes from the same model, an earlier checkpoint, an exponential-moving-average copy, an ensemble of checkpoints, or the same model under a different context. This makes self-distillation attractive when an external teacher is unavailable, too expensive, or poorly matched to the target domain. It also makes the method delicate: the self-teacher must still provide a signal that improves the student’s downstream behavior, not merely a more confident or context-dependent version of the student’s current behavior.

Self-Teacher Sources

In older usage, self-distillation can mean training a model from its own predictions, from earlier checkpoints, or from an ensemble of itself. The teacher is usually a more stable or temporally separated version of the same model family, so the method can regularize training, smooth predictions, or preserve capabilities while compressing behavior back into a single model.
In modern LLM reasoning work, self-distillation often relies on contextual asymmetry rather than architectural asymmetry. The same model can act as a student under the ordinary task prompt and as a teacher under an enhanced prompt that includes training-only information.
The student view is:
\[p_S^\theta(\cdot \mid x)\]
- while the self-teacher view is:
\[p_T^{\theta^-}(\cdot \mid x,c)\]
- where \(x\) is the task prompt, \(c\) is privileged context, and \(\theta^-\) denotes a stopped-gradient, frozen, or EMA teacher copy.
The privileged context \(c\) may include a verified answer, a reference solution, a ground-truth reasoning trace, a runtime error, a prior-attempt critique, a tool result, a user correction, a retrieved skill, or another training-only signal. The goal is to use this extra context to produce a better local teaching distribution without requiring the student to receive the same context at inference time.

Contextual Self-Distillation

Contextual self-distillation is useful because LLMs are often better at evaluating, repairing, or rationalizing a solution when given the answer or feedback than they are at generating the solution from scratch. On-Policy Self-Distillation for Large Language Models by Zhao et al. (2026) uses this asymmetry by letting one model act as both student and teacher: the student sees only the original problem, while the teacher receives privileged solution information and provides dense token-level supervision on student rollouts.
A general contextual self-distillation objective is:
\[\mathcal{L}_{CSD}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T^{\theta^-}(\cdot \mid x,c,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- This objective is on-policy when \(\hat{y}\) is sampled from the current student. It is self-distillation when \(p_T^{\theta^-}\) and \(p_S^\theta\) come from the same model lineage rather than from a separate external teacher.
The key assumption is that the privileged-context distribution behaves like a better version of the student on the same visited prefixes. In reward terms, the self-teacher should place more probability on trajectories or tokens that improve downstream reward, while remaining close enough to the student that the update is locally meaningful.
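A minimal sketch of the contextual self-distillation objective, assuming a toy 3-token vocabulary and hand-written distributions in place of a real model. The functions `student_dist` and `teacher_dist` are hypothetical stand-ins; in real training the teacher term would be computed under stop-gradient.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

def student_dist(x, prefix):
    # Hypothetical student: mildly prefers "a" and ignores the prefix.
    return [0.5, 0.3, 0.2]

def teacher_dist(x, c, prefix):
    # Hypothetical self-teacher: privileged context c shifts mass toward the token it names.
    probs = [0.2, 0.2, 0.2]
    probs[VOCAB.index(c)] += 0.4
    return probs

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def csd_loss(x, c, rng, max_len=8):
    """Sample y_hat from the student, summing KL(teacher || student) over its prefixes."""
    prefix, loss = [], 0.0
    for _ in range(max_len):
        s = student_dist(x, prefix)
        t = teacher_dist(x, c, prefix)      # constant w.r.t. gradients in real training
        loss += kl(t, s)                    # divergence at the student-visited prefix
        token = rng.choices(VOCAB, weights=s, k=1)[0]
        prefix.append(token)
        if token == "<eos>":
            break
    return loss, prefix

loss, rollout = csd_loss("toy prompt", "b", random.Random(0))
```

The rollout is drawn from the student, so the divergence is only ever evaluated on prefixes the student actually visits; the privileged context `c` enters only through the teacher.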

On-Policy Self-Distillation

On-Policy Self-Distillation (OPSD) combines the trajectory source of OPD with the teacher identity of self-distillation. The student samples its own rollout, and the same model under a privileged teacher context scores the student’s actual prefixes. On-Policy Self-Distillation for Large Language Models by Zhao et al. (2026) introduces OPSD as a way to provide dense token-level supervision without a separate external teacher.
A compact OPSD objective is:
\[\mathcal{L}_{OPSD}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_{\theta^-}(\cdot \mid x,c,\hat{y}_{<t}) \,\Vert\, p_\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(p_{\theta^-}\) is the stopped-gradient self-teacher and \(p_\theta\) is the student. The teacher and student may share weights, but they differ in context.
OPSD is appealing because it preserves the on-policy advantage of training on student-visited states while reducing dependence on a larger teacher model. It also avoids the storage and serving cost of repeatedly querying a frontier teacher.
The risk is that the teacher may be privileged rather than genuinely reward-improved. A teacher that knows the answer can produce a distribution that is more confident, shorter, or more reference-like, but those properties do not necessarily identify the reasoning process the student should use when the privileged context is absent.

Naive Privileged Self-Distillation

A naive version of privileged self-distillation scores the student’s own response under two prompts: the plain task prompt for the student and the task prompt plus privileged context for the teacher. The objective is a per-token distribution-matching loss with no direct reward term:
\[\mathcal{L}_{naive}(\theta) =\mathbb{E}_{x \sim \mathcal{D},\hat{y} \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D_{KL} \left( \pi_{\theta^-}(\cdot \mid x,c,\hat{y}_{<t}) \,\Vert\, \pi_\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(c\) may be a gold answer, a reference solution, or feedback from a previous attempt. The student never sees \(c\) at inference time, but the loss trains it to reproduce the distribution induced by \(c\).
This setup is only safe if the privileged self-teacher approximates a reward-tilted version of the student. A diagnostic is the sequence-level teacher-student log-ratio:
\[\Delta_T(s) =\log \pi_T(s) - \log \pi_S(s)\]
- A useful teacher should assign larger \(\Delta_T(s)\) to higher-reward trajectories. If the log-ratio instead tracks the presence of a feedback-aware response template, the self-distillation objective can teach that template even when it is not rewarded.
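One way to check this diagnostic in practice is to correlate \(\Delta_T(s)\) with reward over a batch of student rollouts. The sketch below assumes per-token log-probabilities are already available; the numbers and the dictionary layout are made up for illustration.

```python
def sequence_log_ratio(teacher_logps, student_logps):
    """Delta_T(s): the sequence log-prob is the sum of per-token log-probs."""
    return sum(teacher_logps) - sum(student_logps)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Made-up per-token log-probs for four student rollouts, with verifier rewards.
rollouts = [
    {"teacher": [-0.2, -0.3], "student": [-0.9, -1.0], "reward": 1.0},
    {"teacher": [-0.4, -0.5], "student": [-0.8, -0.9], "reward": 1.0},
    {"teacher": [-1.5, -1.2], "student": [-0.7, -0.6], "reward": 0.0},
    {"teacher": [-1.1, -1.4], "student": [-0.9, -0.8], "reward": 0.0},
]
deltas = [sequence_log_ratio(r["teacher"], r["student"]) for r in rollouts]
rewards = [r["reward"] for r in rollouts]
corr = pearson(deltas, rewards)   # positive here: the toy teacher favors rewarded rollouts
```

A correlation near zero or below would suggest the teacher's preferences track something other than reward, such as a response template.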
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t analyzes this failure mode by comparing an RL-trained expert teacher with a privileged self-distillation teacher. The RL expert increases the probability of correct student responses, while the privileged self-teacher can assign high probability to responses that look as if the model already had a reference solution, regardless of whether the final answer is correct.
The practical failure is a distributional mismatch. The self-teacher is conditioned on information that the student will not have at inference time, so its outputs may name, cite, or reason backward from that information. Once distilled into the student weights, that response shape can appear unconditionally at inference.
This yields three recurring failure modes:
- Hallucinated privileged context: The model may cite a reference solution, feedback, guidance, or a prior attempt that was never present in the prompt.
- Suppressed epistemic verbalization: The model may reduce hedging, backtracking, reconsideration, and self-checking tokens that are useful for long-budget reasoning.
- Poor out-of-distribution generalization: In-distribution accuracy can remain stable or even improve, while reasoning robustness degrades on tasks that require the model to adapt rather than replay a privileged-context response shape.

Relevance-Masked Self-Distillation

Self-distillation can still be useful when the update is localized to tokens that actually carry the desired behavior. Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation introduces Relevance-Masked Self-Distillation (RMSD), which filters token positions before applying the self-distillation loss so that training focuses on behavior-relevant tokens rather than incidental style differences.
A compact RMSD-style objective is:
\[\mathcal{L}_{RMSD}(\theta) =\mathbb{E} \left[ \sum_t m_t D\left( p_T^\theta(\cdot \mid x', \hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x, \hat{y}_{<t}) \right) \right]\]
- where \(x'\) is an enhanced teacher context and \(m_t \in \{0,1\}\) selects token positions judged relevant to the behavior being transferred.
RMSD reflects a broader design rule: dense token-level supervision is useful only when the supervised tokens correspond to the capability being transferred. If the loss is applied across the full response, it can preserve formatting, explanation style, or privileged-context artifacts rather than the desired skill.
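The effect of the mask \(m_t\) can be sketched on toy per-position distributions. How relevance is judged is not specified here, so the mask below is a hypothetical hand-written label, and the distributions are made up.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def masked_distill_loss(teacher_dists, student_dists, mask):
    """Sum of m_t * KL(teacher_t || student_t) over token positions."""
    return sum(m * kl(t, s) for t, s, m in zip(teacher_dists, student_dists, mask))

# Toy per-position distributions over a 2-token vocabulary (made-up numbers).
teacher = [[0.6, 0.4], [0.9, 0.1], [0.5, 0.5]]
student = [[0.5, 0.5], [0.3, 0.7], [0.4, 0.6]]
mask = [0, 1, 0]   # hypothetical relevance labels: only position 1 carries the behavior

masked = masked_distill_loss(teacher, student, mask)
full = masked_distill_loss(teacher, student, [1, 1, 1])
```

With the mask applied, the small divergences at positions 0 and 2 contribute nothing, so incidental differences at those positions do not drive the update.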

Agentic Self-Distillation

In multi-turn agent training, self-distillation must also handle compounding trajectory drift. A small mistake early in a tool-use trajectory can change the future state distribution, and a privileged self-teacher may reject or endorse tokens for reasons tied to imperfect retrieved skills or environment context.
Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) introduces SDAR, which treats OPSD as a gated auxiliary objective rather than the primary training signal. RL remains the task-grounded backbone, while privileged self-teacher context provides dense token-level guidance where the model has evidence that the guidance is useful.
A simplified gated self-distillation term is:
\[\Delta_t =\operatorname{sg} \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\]
\[g_t =\sigma(\beta \Delta_t)\]
\[\ell_t^{\mathrm{SDAR}} = g_t \left( \log \pi_\theta^+(y_t \mid s_t^+) - \log \pi_\theta(y_t \mid s_t) \right)\]
- where \(s_t\) is the ordinary student state, \(s_t^+\) is the privileged teacher state, \(\operatorname{sg}\) denotes stop-gradient, \(\sigma\) is the logistic sigmoid, \(\beta\) sets the gate’s sharpness, and \(g_t\) controls how strongly each token trusts the self-teacher signal.
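The gate can be traced numerically on two tokens. This is a sketch of the simplified term above, not of the full SDAR training loop: the log-probabilities are made up, the sigmoid is assumed to be the logistic function, and plain floats stand in for tensors, so the stop-gradient is only noted in a comment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_terms(teacher_logps, student_logps, beta=1.0):
    """Return (delta_t, g_t, l_t) per token for the simplified gated term."""
    out = []
    for lt, ls in zip(teacher_logps, student_logps):
        delta = lt - ls                  # would be wrapped in stop-gradient in training
        gate = sigmoid(beta * delta)     # near 1 when the privileged view prefers the token
        out.append((delta, gate, gate * (lt - ls)))
    return out

# Token 0: privileged view likes the sampled token more than the student does.
# Token 1: privileged view likes it less.
terms = gated_terms(teacher_logps=[-0.1, -2.0], student_logps=[-1.0, -0.5])
for delta, gate, loss_term in terms:
    print(f"delta={delta:+.2f}  gate={gate:.3f}  term={loss_term:+.3f}")
```

Tokens the privileged view endorses receive a gate above 0.5, while tokens it rejects are down-weighted rather than fully imitated.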
This pattern is important because it changes the role of self-distillation. Instead of replacing RL with privileged-context imitation, self-distillation becomes a dense auxiliary signal constrained by a task-grounded objective.

Practical Guidance

Self-distillation is most appropriate when the model already contains the target capability but expresses it inconsistently, or when privileged training-time context can reveal a correction that is hard to obtain from an external teacher.
OPSD is most appropriate when student rollouts are important and a separate teacher is unavailable, expensive, or poorly matched. Its main advantage is dense supervision on the states the student actually visits.
Naive privileged self-distillation is risky when the privileged context changes the reasoning style more than it improves reward alignment. In that case, the student can learn to imitate the presence of feedback rather than the capability implied by feedback.
The main diagnostics should include reward, in-distribution accuracy, out-of-distribution accuracy, hallucinated-reference rate, feedback-mention rate, epistemic-verbalization rate, and calibration. These metrics help distinguish genuine self-improvement from a model that has internalized a privileged-context response template.

Thinking-Model Caveats for Privileged Self-Distillation

Privileged self-distillation is especially delicate for thinking models because the training-time teacher often has access to information that the inference-time student will not have. A teacher conditioned on a gold answer, reference solution, critique, or prior-attempt feedback can appear locally more confident and concise because part of the search problem has already been solved for it. If the student is trained to imitate that distribution unconditionally, the loss can suppress the uncertainty-management behaviors that make long-budget reasoning robust.

Why Thinking Models Are Different

Thinking models often rely on explicit intermediate behaviors such as branching, checking, revising, comparing alternatives, and recovering from false starts. These behaviors may look inefficient from the perspective of a privileged teacher that already knows the answer, but they can be essential when the model must solve the task without privileged information.
Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) finds that privileged-context self-distillation can degrade long-budget reasoning by suppressing fork-like self-correction behaviors, including verification, backtracking, and hedging. The key point is not that on-policy training is harmful, but that a privileged teacher can assign low probability to the very tokens that help the student reason under uncertainty.
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) traces related degradation to suppressed epistemic verbalization, where expressions of uncertainty, reconsideration, and self-checking help preserve alternative reasoning paths. In this view, shorter traces are not automatically better traces: reducing deliberation can improve apparent efficiency while weakening out-of-distribution reasoning.
A useful way to track this behavior is an epistemic-verbalization rate, measured here as the number of marker tokens per response. Let \(\mathcal{E}\) be a set of uncertainty, branching, or self-correction markers such as “wait,” “maybe,” “actually,” “perhaps,” and “alternatively.” For a response \(y=(y_1,\dots,y_n)\):
\[\operatorname{EV}(y) =\sum_{t=1}^{n} \mathbf{1}[y_t \in \mathcal{E}]\]
- A large drop in \(\operatorname{EV}(y)\) can indicate that the model is becoming more direct, but it can also indicate that it is losing the ability to reopen hypotheses, check assumptions, or recover from an incorrect path.
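A minimal sketch of \(\operatorname{EV}(y)\), assuming word-level matching on lowercased text rather than model tokens. The marker set is the example list above; a real evaluation would likely use a larger, task-specific set.

```python
import re

# Example markers from the text; not an exhaustive set.
MARKERS = {"wait", "maybe", "actually", "perhaps", "alternatively"}

def ev_count(response):
    """Count words in the response that belong to the marker set."""
    words = re.findall(r"[a-z']+", response.lower())
    return sum(1 for w in words if w in MARKERS)

hedged = "Wait, maybe the sign is wrong. Actually, let me recheck. The answer is 4."
direct = "The answer is 4."

print(ev_count(hedged), ev_count(direct))
```

Word-level counting over-counts markers used in a non-epistemic sense, so the number is best read as a trend over training rather than as an absolute measure.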

Privileged Context as a Distribution Shift

The core failure mode is a context mismatch. The self-teacher is conditioned on privileged information \(c\), while the deployed student is conditioned only on the task prompt \(x\). The training objective asks the student to match:
\[p_{\theta^-}(\cdot \mid x,c,y_{<t})\]
- even though inference uses:
\[p_{\theta}(\cdot \mid x,y_{<t})\]
If \(c\) mainly improves task-relevant token preferences, this can be beneficial. If \(c\) changes the response style, confidence level, citation pattern, or amount of deliberation, the student can learn an inference-time behavior that is not actually grounded in the prompt.
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t analyzes this mismatch in naive self-distillation: a feedback-aware self-teacher can learn to name the privileged information, reason backward from it, and assign high probability to a response shape that assumes such information is present. Once distilled into the student, that shape can fire even when no feedback, reference solution, or prior attempt exists in the prompt.
This explains why a dense token-level loss can point in the wrong direction. The model is not merely learning which tokens improve reward; it may be learning the surface form of a privileged-context solution, such as asserting that a reference supports the answer, citing feedback that was never given, or skipping the uncertainty markers that would normally be needed to solve the problem.

Hallucinated Feedback and Reference Artifacts

A concrete failure mode is hallucinated privileged context. The student may refer to guidance, feedback, a reference solution, or a previous attempt that was never provided at inference time. This is not just ordinary hallucination; it is a learned artifact of the training objective. The student has been trained to imitate a teacher distribution that was grounded in privileged context, but the student itself cannot condition on that context.
The hallucinated-reference behavior can be tracked with a simple rate:
\[H_{\mathrm{priv}} = \mathbb{E}_{x \sim \mathcal{D}_{eval}} \left[ \mathbf{1} \left[ \text{response mentions absent feedback, reference, or prior attempt} \right] \right]\]
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t reports that naive self-distillation can produce high hallucinated-reference rates even when in-distribution accuracy appears acceptable. The broader lesson is that accuracy alone can miss whether the model has learned a reasoning procedure or merely a privileged-context response template.
This failure is especially concerning in agentic and tool-use settings. A model that invents feedback, assumes a prior tool result, or treats a missing reference as present can make confident downstream actions from ungrounded state. In such settings, hallucinated privileged context should be evaluated separately from final task accuracy.
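The hallucinated-reference rate above can be approximated with a keyword heuristic. The sketch below is an assumption-heavy stand-in: the patterns are hypothetical examples, and a production evaluation would more plausibly use a judge model or human labels.

```python
import re

# Hypothetical surface patterns for privileged-context mentions.
PRIVILEGED_PATTERNS = [
    r"\breference solution\b",
    r"\b(the|your) feedback\b",
    r"\b(previous|prior) attempt\b",
    r"\bas (hinted|suggested)\b",
]

def mentions_absent_context(prompt, response):
    """True if the response names privileged context that the prompt never contained."""
    for pat in PRIVILEGED_PATTERNS:
        if re.search(pat, response, re.I) and not re.search(pat, prompt, re.I):
            return True
    return False

def hallucinated_reference_rate(pairs):
    """Fraction of (prompt, response) pairs flagged by the heuristic."""
    return sum(mentions_absent_context(p, r) for p, r in pairs) / len(pairs)

pairs = [
    ("Solve 2+2.", "According to the reference solution, the answer is 4."),
    ("Solve 2+2.", "2+2 = 4."),
    ("Feedback on your previous attempt: check the sign. Solve x+1=0.",
     "Given my previous attempt was wrong, x = -1."),
    ("Solve x+1=0.", "As suggested, x = -1."),
]
rate = hallucinated_reference_rate(pairs)
```

The third pair is not flagged because the prompt really does contain a previous attempt; only mentions of context that is absent from the prompt count toward the rate.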

Epistemic-Verbalization Collapse

A second failure mode is loss of epistemic verbalization. Thinking models often use tokens that mark hesitation, branching, or verification as control-flow tools for reasoning. A privileged teacher may not need these tokens because the answer or correction is already visible, so distillation can push the student away from them.
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) connects this behavior to reasoning degradation: when the teacher is conditioned on rich context, it can produce more concise and confident traces, but the student may then lose the uncertainty expression needed for unseen problems.
The compression of epistemic verbalization can be written as:
\[C_{\mathrm{EV}} =\frac{ \mathbb{E}_{x \sim \mathcal{D}_{eval}} \left[ \operatorname{EV}(y_{\mathrm{trained}}(x)) \right] }{ \mathbb{E}_{x \sim \mathcal{D}_{eval}} \left[ \operatorname{EV}(y_{\mathrm{base}}(x)) \right] }\]
- where \(C_{\mathrm{EV}}=1\) means the trained model preserves the base model’s epistemic-verbalization rate, while smaller values indicate compression of uncertainty, checking, and branching behavior.
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t reports the same behavioral signature: naive self-distillation sharply reduces hedging and backtracking tokens, while RL preserves much more of this behavior. This supports the view that pure privileged-context distillation can make the model overconfident, not merely shorter.
The following figure (source) shows the count of hedging tokens per response over training on chemistry, including markers such as “wait.” The right panel shows the per-token breakdown at the first and last step. Naive self-distillation strips out the model’s expressed uncertainty through training, leading it to consider fewer potential solution paths at inference and degrading reasoning.

Out-of-Distribution Fragility

The hardest failures may appear out of distribution. A privileged-context self-distillation objective can preserve or improve in-distribution accuracy when the learned response template happens to match the training distribution. The same template can fail on tasks that require new branching, uncertainty expression, tool adaptation, or problem-specific search.
This creates a misleading evaluation pattern:
\[\Delta_{\mathrm{IID}} =\operatorname{Acc}_{\mathrm{trained}}(\mathcal{D}_{IID}) - \operatorname{Acc}_{\mathrm{base}}(\mathcal{D}_{IID})\]
\[\Delta_{\mathrm{OOD}} =\operatorname{Acc}_{\mathrm{trained}}(\mathcal{D}_{OOD}) - \operatorname{Acc}_{\mathrm{base}}(\mathcal{D}_{OOD})\]
- A method can have \(\Delta_{\mathrm{IID}} \geq 0\) while still having \(\Delta_{\mathrm{OOD}} < 0\). This is a warning sign that the model may have learned a narrow training-distribution response shape rather than a robust reasoning improvement.
Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) emphasizes this issue for long-budget reasoning: the behaviors suppressed by privileged self-distillation, such as forking and verification, are most valuable when the model has to search under uncertainty. Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t similarly recommends tracking hallucination rate, epistemic verbalization, and OOD accuracy rather than relying only on in-distribution validation accuracy.

Contrastive Hinting

One way to reduce privilege-induced style drift is to subtract the part of the teacher-student gap that appears merely because a hint is present. RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) proposes a contrastive hinting signal that compares a correct privileged hint with a wrong or control hint, reducing updates that are caused by hint-conditioned style rather than task-bearing content.
A useful abstraction is:
\[e_t^{ctr} =\left[ \log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right] -\left[ \log p_\theta(y_t \mid x,c^-,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right]\]
- where \(c^+\) is a correct hint and \(c^-\) is a contrastive or wrong hint. The subtraction removes part of the gap that arises from merely being conditioned on a hint, leaving more of the signal tied to correctness.
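As a worked step on this abstraction (not necessarily the paper's full signal), the unhinted log-probability appears once in each bracket with opposite sign, so it cancels:
\[e_t^{ctr} =\log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,c^-,\hat{y}_{<t}) =\log \frac{p_\theta(y_t \mid x,c^+,\hat{y}_{<t})}{p_\theta(y_t \mid x,c^-,\hat{y}_{<t})}\]
- In this form, the signal is a log-likelihood ratio between the correct-hint and control-hint views. A token whose probability rises equally under any hint gets \(e_t^{ctr} \approx 0\), while a token that is favored only by the correct hint gets a positive score.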
The resulting principle is that privileged self-distillation should avoid copying every difference between the hinted and unhinted distributions. It should isolate the portion of the difference that is causally related to solving the task.

Practical Safeguards

Privileged self-distillation for thinking models should be treated as an auxiliary signal unless evidence shows that the self-teacher behaves like a reward-improved version of the student. RL, verifier feedback, or reward-consistent filtering can provide the task-grounded signal that the privileged teacher alone may lack.
The evaluation suite should include:
- Task accuracy: in-distribution and out-of-distribution accuracy.
- Hallucinated privileged context: mentions of absent feedback, references, prior attempts, or hidden guidance.
- Epistemic verbalization: frequency of uncertainty, branching, backtracking, and checking markers.
- Trace structure: response length, number of branches, number of verification steps, and recovery after false starts.
- Calibration: whether confidence increases without corresponding reward improvement.
A practical rule is to preserve uncertainty unless the training signal proves it is unnecessary. If a privileged teacher becomes more direct because it already knows the answer, that directness should not automatically be distilled into the inference-time student.
The safest variants either localize the distillation loss to task-bearing tokens, contrast correct and incorrect privileged hints, gate the self-distillation signal by reward agreement, or keep RL as the primary objective while using self-distillation only as dense auxiliary guidance.

Distillation as Synthetic Data and Post-Training Infrastructure

A newer way to understand distillation is as part of the broader synthetic-data and post-training toolkit. The RLHF Book’s chapter Synthetic Data describes distillation as both a data engine, where stronger models generate completions, critiques, preferences, or filters, and a skill-transfer method, where a stronger model’s capabilities are transferred into a weaker model. The same chapter frames the path from offline KD to on-policy distillation as a move from static teacher-generated data toward student-sampled trajectories with dense teacher feedback.
This view is especially useful for out-of-distribution enterprise or tool-use behaviors, where ordinary SFT may teach a narrow behavior while harming unrelated capabilities, and RL may struggle if the base model cannot produce successful attempts often enough. Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation frames self-distillation as a way to bring unusual target behaviors into the model’s distribution while preserving unrelated competence through localized token-level updates.

Distillation and Reinforcement Learning

Distillation and reinforcement learning are increasingly linked because they solve complementary parts of the post-training problem. RL with verifiable rewards provides task-grounded optimization, but the feedback is often sparse: a trajectory receives a scalar reward only after a full answer, program, proof, or tool-use episode. OPD provides dense token-level supervision over the student’s own rollouts, but its usefulness depends on whether the teacher’s local preferences actually point toward higher-reward behavior.

Sparse Rewards and Dense Distillation Signals

In a standard RL setting, the objective is to maximize expected trajectory reward:
\[J_{\mathrm{RL}}(\theta) =\mathbb{E}_{s \sim \pi_\theta} \left[ R(s) \right]\]
- where \(s\) denotes a full trajectory, such as a reasoning trace, code attempt, tool-use episode, or multi-turn agent interaction.
In many LLM settings, the reward is sparse. A verifier may tell the model whether the final answer is correct, whether tests passed, whether a tool-use task succeeded, or whether a user accepted the response, but this does not directly explain which tokens helped or hurt the outcome.
OPD turns teacher preferences into a dense signal on student-generated trajectories, the same shift from static synthetic data to student-sampled rollouts that the RLHF Book chapter Synthetic Data describes.
A common token-level OPD signal is:
\[A_t^{\mathrm{OPD}} = \log \pi_T(a_t \mid s_t) - \log \pi_\theta(a_t \mid s_t)\]
- where \(s_t\) is the student-visited prefix or state, \(a_t\) is the sampled token or action, \(\pi_T\) is the teacher, and \(\pi_\theta\) is the student.
This quantity behaves like an advantage. If the teacher assigns higher probability than the student to the sampled token, the token receives positive dense feedback. If the teacher assigns lower probability, the token receives negative dense feedback.

Reward-Tilted Teacher Interpretation

The key question is when dense distillation feedback is also reward-improving. Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t explains this through a reward-tilted teacher view: OPD improves the student when the teacher assigns higher probability to higher-reward trajectories while staying close enough to the student for the signal to be locally imitable.
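Let \(\pi_k^S\) be the current student policy held fixed, and let \(R(s)\) be the downstream reward. KL-regularized reward maximization, which maximizes \(\mathbb{E}_{s \sim \pi}[R(s)] - \tfrac{1}{\beta} D_{KL}(\pi \,\Vert\, \pi_k^S)\) over policies \(\pi\), has a closed-form reward-tilted optimum: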
\[\pi_T^*(s) = \frac{1}{Z} \pi_k^S(s) \exp(\beta R(s)), \qquad Z = \mathbb{E}_{s \sim \pi_k^S} \left[ \exp(\beta R(s)) \right]\]
- where \(\beta\) controls the strength of reward tilting and \(Z\) normalizes the distribution.
If the teacher equals this reward-tilted policy and its gradient is stopped, reverse-KL distillation decomposes as:
\[D_{KL} \left( \pi^S \,\Vert\, \pi_T^* \right) = D_{KL} \left( \pi^S \,\Vert\, \pi_k^S \right) - \beta \mathbb{E}_{s \sim \pi^S} \left[ R(s) \right] + \log Z\]
Since \(\log Z\) does not depend on the optimized student, minimizing this reverse-KL objective is equivalent to increasing expected reward while staying near the current policy. Distilling toward a reward-tilted teacher is therefore a form of KL-regularized RL.
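The decomposition can be checked numerically on a toy problem with three complete trajectories. All probabilities, rewards, and the value of `beta` below are made up; the sketch only verifies that both sides of the identity agree for an arbitrary candidate student.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setting: three full trajectories (made-up numbers).
pi_k = [0.5, 0.3, 0.2]       # current student, held fixed
reward = [0.0, 1.0, 1.0]     # downstream reward per trajectory
beta = 2.0

# Reward-tilted teacher: pi_k(s) * exp(beta * R(s)) / Z.
Z = sum(p * math.exp(beta * r) for p, r in zip(pi_k, reward))
pi_star = [p * math.exp(beta * r) / Z for p, r in zip(pi_k, reward)]

# Any candidate student policy works for checking the identity.
pi_s = [0.4, 0.4, 0.2]
lhs = kl(pi_s, pi_star)
expected_reward = sum(p * r for p, r in zip(pi_s, reward))
rhs = kl(pi_s, pi_k) - beta * expected_reward + math.log(Z)

print(f"lhs={lhs:.6f}  rhs={rhs:.6f}")
```

The tilted teacher moves mass from the zero-reward trajectory onto the rewarded ones, and the reverse KL to it is zero exactly when the student equals `pi_star`.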
This interpretation explains why strong same-family teachers and RL-trained expert teachers are often effective. A same-family teacher is usually close in tokenizer, training distribution, and reasoning style, while being more capable. An RL-trained expert is explicitly tilted toward higher reward while often remaining near the base policy because of KL regularization.
The same interpretation also explains why naive self-distillation can fail. A privileged-context self-teacher can provide dense token-level feedback, but if its log-ratio does not increase with reward, the dense signal may teach a response shape rather than a reward-improving reasoning process.

Dense Feedback Without an External Teacher

Recent RL-distillation methods try to recover dense supervision from feedback that already exists in the environment. Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) introduces Self-Distillation Policy Optimization (SDPO), which converts rich textual feedback such as runtime errors, judge comments, and failed-attempt feedback into token-level supervision without requiring an external teacher.
In SDPO-style training, the model under a feedback-conditioned context acts as a self-teacher:
\[\pi_{\theta^-}^{+}(\cdot \mid x,f,y_{<t})\]
- while the deployed student is trained under the ordinary context:
\[\pi_\theta(\cdot \mid x,y_{<t})\]
A simplified feedback-conditioned self-distillation loss is:
\[\mathcal{L}_{SDPO}(\theta) =\mathbb{E} \left[ \sum_t D \left( \pi_{\theta^-}^{+}(\cdot \mid x,f,y_{<t}) \,\Vert\, \pi_\theta(\cdot \mid x,y_{<t}) \right) \right]\]
- where \(f\) is textual feedback, environment feedback, or a hindsight correction.
This is useful when the feedback is genuinely diagnostic, such as a compiler error, failed unit test, judge critique, or tool-result observation. It is risky when the feedback-conditioned teacher primarily changes style, confidence, or explanation structure rather than reward-relevant token preferences.

OPD as Dense KL-Constrained RL

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation by Yang et al. (2026) formalizes OPD as a special case of dense KL-constrained RL, then generalizes it with a flexible reference model and a reward scaling factor.
A simplified generalized objective can be written as:
\[\mathcal{L}_{G\text{-}OPD}(\theta) =D_{KL} \left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right) -\alpha \mathbb{E}_{s \sim \pi_\theta} \left[ r_T(s) \right]\]
- where \(\pi_{\mathrm{ref}}\) is a reference model, \(r_T(s)\) is a dense teacher-derived reward, and \(\alpha\) controls the reward-to-regularization balance.
Standard OPD corresponds to a constrained setting in which the teacher-derived reward and KL regularization are tied together. Extrapolated OPD relaxes this coupling by increasing the reward scale, allowing the student to move beyond strict teacher imitation when the teacher signal is aligned with downstream reward.
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation by Ko et al. (2026) similarly interprets teacher-student log-likelihood ratios as token rewards and introduces REOPOLD, a relaxed OPD method that stabilizes optimization through reward clipping, entropy-aware token sampling, and an exploration-to-refinement schedule.

Direction, Magnitude, and Privileged Self-Distillation

A recurring pattern in RL-distillation hybrids is to separate update direction from update magnitude. RL or verifier feedback can determine whether the trajectory should be reinforced, while distillation can provide fine-grained token-level magnitudes.
Self-Distilled RLVR by Yang et al. (2026) argues that privileged self-distillation alone can leak information and destabilize long training, so RLSD uses RLVR to determine reliable update direction from environmental correctness while using self-distillation to modulate token-level update magnitudes.
A simplified hybrid structure is:
\[\nabla_\theta J_{\mathrm{hybrid}} =\sum_t A_{\mathrm{RL}}(s_t,a_t) \cdot w_t^{SD} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\]
- where \(A_{\mathrm{RL}}\) supplies reward-grounded direction and \(w_t^{SD}\) uses self-distillation to scale token-level credit.
This separation matters because a privileged self-teacher can be locally informative but globally unsafe. It may know which token would fit a reference answer, but the RL signal is needed to ensure that the overall update remains tied to task success rather than to privileged-context style.

Sample Routing and Failure Correction

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026) introduces Sample-Routed Policy Optimization (SRPO), which routes correct samples to GRPO-style reward-aligned RL and failed samples to SDPO-style self-distillation correction.
The routing rule can be summarized as:
\[\mathcal{L}_{SRPO} =\mathbf{1}[R(s)>0] \mathcal{L}_{GRPO} +\mathbf{1}[R(s)=0] \lambda(s) \mathcal{L}_{SDPO}\]
- where \(\lambda(s)\) may depend on confidence, entropy, or teacher reliability.
This pattern addresses a practical weakness of pure self-distillation. Distilling already-correct samples can introduce ambiguity because the model may already have found a good solution, while failed samples are often where dense correction is most valuable.
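The routing rule can be sketched in a few lines. This is an illustrative reading of the indicator formula above, assuming non-negative rewards and precomputed per-sample losses; the function name, the batch averaging, and the inputs are hypothetical, not the SRPO reference implementation.

```python
def srpo_batch_loss(rewards, grpo_losses, sdpo_losses, lambdas):
    """Route each sample to exactly one objective.

    rewards: per-sample rewards R(s), assumed non-negative.
    grpo_losses / sdpo_losses: per-sample losses computed elsewhere.
    lambdas: per-sample weights lambda(s) for the self-distillation branch.
    Returns the batch-mean loss (averaging is an assumption) and the routes.
    """
    total = 0.0
    routes = []
    for r, l_grpo, l_sdpo, lam in zip(rewards, grpo_losses, sdpo_losses, lambdas):
        if r > 0:
            # Correct sample: reward-aligned RL term only.
            total += l_grpo
            routes.append("grpo")
        else:
            # Failed sample (R(s) = 0 under the non-negativity assumption):
            # weighted self-distillation correction only.
            total += lam * l_sdpo
            routes.append("sdpo")
    return total / len(rewards), routes

loss, routes = srpo_batch_loss([1.0, 0.0], [0.5, 9.0], [7.0, 2.0], [1.0, 0.5])
print(loss, routes)
```

Each sample contributes through one branch only, so the unused loss value for that sample never reaches the total.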

Agentic RL and Hindsight-Guided Distillation

RL-distillation hybrids are especially important for reasoning, coding, and agentic systems because their trajectories are long, branching, and partially observed. A single scalar reward at the end of a multi-step tool-use trajectory gives weak credit assignment, while dense teacher or feedback signals can identify which action, tool call, argument, or reasoning step should change.
OpenClaw-RL: Train Any Agent Simply by Talking by Wang et al. (2026) extends this idea to interactive agents, using next-state signals such as user replies, tool outputs, terminal states, and GUI changes as both scalar feedback and hindsight-guided OPD supervision.
In this setting, an environment transition can provide both evaluative and directive information:
\[(s_t,a_t,s_{t+1}) \rightarrow \left( r_t, h_t \right)\]
- where \(r_t\) is a scalar reward or process reward, and \(h_t\) is a hindsight hint or textual correction used to form a teacher context.
A hindsight-guided OPD term can be written as:
\[\mathcal{L}_{HG\text{-}OPD} =\mathbb{E} \left[ \sum_t D \left( \pi_{\theta^-}(\cdot \mid s_t,h_t) \,\Vert\, \pi_\theta(\cdot \mid s_t) \right) \right]\]
The key requirement is that the next-state signal must be grounded in the environment. A real tool error, test failure, user correction, or GUI state change can identify what went wrong. A fabricated or ungrounded feedback signal can reproduce the same naive self-distillation failure mode: the model learns to imitate feedback-aware behavior without genuine feedback.

Self-Distilled Agentic Reinforcement Learning

In multi-turn agentic training, the RL-distillation relationship becomes asymmetric. RL is often best treated as the task-grounded primary objective, while self-distillation acts as a controlled auxiliary signal.
Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) introduces SDAR, which treats OPSD as a gated auxiliary objective while keeping GRPO as the primary RL backbone. The method maps detached teacher-student token gaps into sigmoid gates so that teacher-endorsed positive-gap tokens receive stronger distillation while negative teacher rejections are softly attenuated.
The SDAR objective can be summarized as an RL backbone plus a gated self-distillation auxiliary term:
\[\mathcal{L}(\theta) =\mathcal{L}_{GRPO}(\theta) + \lambda_{\mathrm{SDAR}} \mathcal{L}_{SDAR}(\theta)\]
- where \(\mathcal{L}_{GRPO}\) preserves verifier-driven policy optimization, while \(\mathcal{L}_{SDAR}\) injects dense token-level guidance only where the gated privileged teacher signal is trusted.
Implementation details:
- The method flattens valid response tokens across a multi-turn trajectory and applies token-level self-distillation over the agent’s own rollout.
- The student context contains the task and previous generated tokens, while the self-teacher context additionally includes privileged training-only information such as retrieved skills.
- The detached teacher-student gap is defined as:
  \[\Delta_t =\operatorname{sg} \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\]
- The token-level gate converts this signal into a bounded trust weight:
  \[g_t =\sigma(\beta \Delta_t)\]
- The gated auxiliary loss applies self-distillation according to token-level trust:
  \[\ell_t^{\mathrm{SDAR}} =g_t \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\]
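The three formulas above can be traced numerically with a small sketch. It works on plain floats, so the stop-gradient \(\operatorname{sg}\) is only indicated in comments; in an autograd framework the gap inside the gate would be detached. The function name and example values are hypothetical.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sdar_token_terms(teacher_logps, student_logps, beta=1.0):
    """Return (delta_t, g_t, loss_t) for each sampled token.

    teacher_logps: log pi^+(y_t | s_t^+) from the privileged self-teacher.
    student_logps: log pi(y_t | s_t) from the student.
    """
    terms = []
    for lt, ls in zip(teacher_logps, student_logps):
        delta = lt - ls                 # would be detached (sg) in a real graph
        gate = sigmoid(beta * delta)    # bounded trust weight in (0, 1)
        loss = gate * (lt - ls)         # gated per-token auxiliary term
        terms.append((delta, gate, loss))
    return terms

for delta, gate, loss in sdar_token_terms([-0.2, -1.0, -3.0], [-1.2, -1.0, -1.0], beta=2.0):
    print(round(delta, 3), round(gate, 3), round(loss, 3))
```

Tokens with a positive gap receive a gate above 0.5, while tokens the teacher rejects receive a gate near zero, so their contribution is attenuated instead of being pushed hard in the opposite direction.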

Practical Takeaways

RL is most useful when the reward or verifier is trusted and task-grounded, but the model needs exploration to discover successful trajectories.
OPD is most useful when a teacher is both reward-improved and close enough to the student to provide meaningful dense supervision on student-visited prefixes.
Self-distillation is most useful when feedback, hints, or privileged context reveal token-level corrections that are genuinely tied to downstream reward.
RL-distillation hybrids are most useful when sparse rewards provide reliable update direction, while teacher or self-teacher signals provide denser token-level credit assignment.
The main failure mode is disagreement between dense token feedback and downstream reward. If the teacher log-ratio tracks correctness, OPD behaves like dense KL-regularized RL. If it tracks feedback-aware style, reference-citing behavior, or privileged-context artifacts, the model can become more confident while becoming less robust.
Evaluation should therefore include reward, in-distribution accuracy, out-of-distribution accuracy, hallucinated-feedback rate, epistemic-verbalization rate, response length, calibration, and long-horizon agent success. These metrics distinguish genuine reward-grounded learning from dense imitation of a misleading teacher.

Multi-Domain Post-Training and Capability Consolidation

In multi-domain post-training, distillation also functions as a capability consolidation tool after or during RL. Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD from the strongest intermediate teacher models to recover benchmark regressions and sustain gains after broader Cascade RL.
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales this consolidation pattern to a 550B-total, 55B-active-parameter hybrid Mamba-attention MoE model, using SFT, unified RLVR, MOPD warmup, MOPD, and MTP boosting to consolidate more than ten specialist teachers into a single agentic reasoning model. The report also notes that sampled-token objectives outperformed top-\(k\) and full-vocabulary distribution matching in preliminary MOPD experiments on some agentic benchmarks, which fits the support-overlap view that broad distribution matching can amplify noise when the student’s prefixes are off the teacher’s reliable support.
A practical MOPD design rule is that teacher support matters: dense teacher scoring is most reliable when student rollouts remain within regions where the teacher can assign meaningful token preferences. Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) uses a brief MOPD warmup SFT stage to align student rollouts with teacher-supported distributions, and Ravid Shwartz Ziv’s Post frames the same issue as a support-overlap tradeoff between full-distribution matching and sampled-token scoring.
Aligning Language Models from User Interactions by Kleine Buening et al. (2026) uses user follow-up messages as hindsight context for self-distillation, updating the model toward the behavior it would have produced after seeing the user’s correction or clarification. Informal discussion around this trend also appears in Cameron R. Wolfe’s X posts on multi-teacher OPD and the utility of combining specialist teachers, including this MOPD discussion thread.

Implementation View

In implementation terms, modern LLM distillation usually requires four decisions: the source of trajectories, the teacher signal, the divergence or surrogate loss, and the systems design for computing log-probabilities. Thinking Machines Blog: On-Policy Distillation frames on-policy distillation as combining the relevance of RL with the dense per-token signal of distillation: the student samples its own trajectories, while the teacher provides token-level feedback rather than a sparse sequence-level reward.
Agentic RL-distillation hybrids require deciding how strongly each token should trust privileged teacher guidance. Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) introduces entropy gating, gap gating, and soft-OR gating, allowing each token to regulate the intensity of self-distillation based on student uncertainty, teacher-student gap, or their combination.
Practical OPD systems also need to distinguish the mathematical idea of “on-policy” from systems-level staleness. In asynchronous pipelines, rollout workers may sample from a slightly older behavior policy while learner workers update a newer student snapshot; Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) addresses this in MOPD with behavior-policy, proximal-policy, and teacher-log-probability terms inside an asynchronous objective.
For production-scale OPD, implementation details can dominate the algorithmic presentation. Distilling 100B+ Models 40x Faster with TRL highlights generation buffers, batched teacher scoring, and compressed log-probability transfer, while vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention by Kwon et al. (2023) is a common serving foundation for high-throughput teacher inference.
For out-of-distribution or enterprise-specific behaviors, self-distillation can reduce reliance on an external frontier teacher by constructing a privileged teacher context from hints, corrections, or task descriptions. Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation makes this practical with token relevance masks, emphasizing that dense token-level supervision is useful only when the update positions correspond to the behavior being transferred.

Primer Roadmap

The rest of the primer covers the major types of distillation in detail: classical soft-label distillation, hard-label and sequence-level distillation, representation and attention distillation, task-specific vs. task-agnostic distillation, offline and online distillation, off-policy distillation, on-policy distillation, support-overlap diagnostics, self-distillation, on-policy self-distillation, thinking-model caveats for privileged self-distillation, multi-teacher on-policy distillation, and the newer RL-distillation hybrid family that treats teacher log-probability gaps, hindsight feedback, contrastive hints, relevance masks, or self-teacher contexts as dense policy-optimization signals.

Foundations

Classical knowledge distillation establishes the core teacher-student framework that underlies all subsequent variants, including offline distillation, online distillation, off-policy distillation, on-policy distillation, self-distillation, multi-teacher distillation, and modern reinforcement learning hybrids. The central insight is that a model can learn not only from hard labels, but from the richer probability distribution produced by a stronger teacher or peer model.
The most useful way to orient this section is to treat distillation as both a loss family and a post-training systems primitive. At the loss level, distillation begins with matching teacher and student distributions. At the recipe level, modern LLM post-training uses distillation for compression, synthetic-data generation, regression recovery, RL stabilization, self-improvement, and multi-domain capability consolidation.
The foundational axes are teacher dynamics, trajectory source, target type, and teacher identity. Offline vs. online determines whether the teacher changes; off-policy vs. on-policy determines where trajectories come from; hard vs. soft targets determine the density of supervision; and external, self, or multi-teacher setups determine where the supervision originates.
A newer foundation is teacher quality under the student’s own rollout distribution. In modern OPD, the teacher is useful only if its token-level or trajectory-level preferences point toward higher reward on states the student actually visits. Dense feedback is not automatically beneficial: it can accelerate learning when the teacher is reward-improved and locally compatible with the student, but it can also accelerate the wrong behavior when the teacher is privileged, miscalibrated, or far from the student’s support.
The most important modern shift is from distilling isolated teacher outputs to consolidating entire post-training workflows. In recent LLM recipes, specialist teachers, staged RL checkpoints, synthetic traces, verifiers, tool environments, and MOPD are all parts of the same system.
This section introduces the mathematical foundations, temperature scaling, divergence choices, early extensions, modern post-training recipe patterns, and the distinction between fixed-teacher and co-trained-teacher settings that made distillation a general-purpose model compression and capability transfer technique.
In frontier post-training, these foundations are increasingly embedded inside larger recipes rather than used as standalone objectives. Frontier post-training recipe review with Finbarr Timbers describes the historical progression from relatively simple SFT, reward-modeling, and RLHF pipelines toward 2026-style recipes that combine staged RL, specialist teachers, trace distillation, MOPD, and environment-specific training.

Teacher-Student Formulation

Classical Setup

The classical formulation of distillation considers two models: a teacher \(p_T\), typically large and high-performing, and a student \(p_S^\theta\), typically smaller or more efficient. The objective is to transfer the teacher’s behavior into the student while reducing computational cost, improving deployment efficiency, or specializing the student to a target domain.
This paradigm was formalized in Distilling the Knowledge in a Neural Network by Hinton et al. (2015), which introduced the idea that the teacher’s soft output probabilities encode richer information than hard labels, revealing inter-class similarities that a one-hot target discards. The student is trained to match these soft distributions rather than only the teacher’s argmax class.
In the original classical setting, the teacher is usually fixed before the student is trained. This corresponds to offline distillation: the teacher distribution is stationary, and the student learns from a stable reference model.
For classification or token prediction, the core offline teacher-student loss is typically:
\[\mathcal{L}_{KD}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \left[ D\left( p_T(\cdot \mid x) \,\Vert\, p_S^\theta(\cdot \mid x) \right) \right]\]
- where \(D\) is a divergence, most commonly forward KL in classical supervised KD, but also reverse KL, Jensen-Shannon divergence, cross-entropy on sampled teacher outputs, or a task-specific surrogate in modern LLM systems.
In practice, distillation is often combined with supervised learning:
\[\mathcal{L}(\theta) =\alpha \mathcal{L}_{CE}(\theta) + (1-\alpha)\mathcal{L}_{KD}(\theta)\]
- where \(\mathcal{L}_{CE}\) is cross-entropy with ground-truth labels and \(\alpha\) balances direct label supervision against teacher imitation.
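As a concrete instance of the combined objective, the following sketch computes \(\alpha \mathcal{L}_{CE} + (1-\alpha)\mathcal{L}_{KD}\) for a single example, assuming forward KL as the divergence \(D\). The function names and the toy numbers are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_plus_ce(student_logits, teacher_probs, label, alpha=0.5):
    """alpha * CE(label, student) + (1 - alpha) * KL(teacher || student)."""
    p_s = softmax(student_logits)
    ce = -math.log(p_s[label])
    # Forward KL; terms with zero teacher probability contribute nothing.
    kd = sum(pt * math.log(pt / ps) for pt, ps in zip(teacher_probs, p_s) if pt > 0)
    return alpha * ce + (1 - alpha) * kd

teacher_probs = [0.7, 0.2, 0.1]
print(kd_plus_ce([2.0, 0.5, -1.0], teacher_probs, label=0, alpha=0.5))
```

When the student already reproduces the teacher distribution, the KD term vanishes and only the weighted cross-entropy term remains.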

Teacher Dynamics

Online distillation relaxes the fixed-teacher assumption. In methods such as Deep Mutual Learning by Zhang et al. (2017), multiple peer models learn collaboratively and teach one another during training, so there may be no single pretrained superior teacher and no fixed teacher distribution.
For online or mutual distillation, the same form can be generalized by replacing the single frozen teacher \(p_T\) with one or more evolving peers \(p_j^{\theta_j}\):
\[\mathcal{L}_{i}(\theta_i) =\mathcal{L}_{\text{task}}(\theta_i) +\lambda \sum_{j\neq i} D\left( p_j^{\theta_j}(\cdot \mid x) \,\Vert\, p_i^{\theta_i}(\cdot \mid x) \right)\]
- where model \(i\) learns from peer models \(j\) while also updating its own parameters. This captures the shift from one-way offline transfer to reciprocal online transfer.
Modern post-training adds a third practical pattern: staged teacher creation. In systems such as Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026), intermediate checkpoints produced by domain-wise RL become teachers for later multi-domain OPD, so the teacher is fixed at the moment of distillation but is itself the product of a broader sequential post-training pipeline.
In frontier LLM pipelines, the “ground truth” term may itself be synthetic, filtered, teacher-generated, or verifier-selected. For example, Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses broad SFT data spanning math, coding, science, tool use, agentic tasks, multi-turn dialogue, instruction following, safety, long-context tasks, and software engineering before RL and multi-domain OPD, so the supervised term is already a curated teacher-data mixture rather than only human labels.

Token-Level Formulation for LLMs

For autoregressive language models, the distribution being matched is usually a next-token distribution conditioned on a prompt and a partial trajectory. Given prompt \(x\) and output prefix \(y_{<t}\), teacher and student define:
\[p_T(\cdot \mid x,y_{<t}), \qquad p_S^\theta(\cdot \mid x,y_{<t})\]
The token-level distillation objective aggregates divergences across the sequence:
\[\mathcal{L}_{KD}(\theta) =\mathbb{E}_{(x,y)} \left[ \sum_{t=1}^{|y|} D\left( p_T(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right) \right]\]
The important detail is the source of \(y\). If \(y\) comes from a dataset or teacher completion, the method is off-policy. If \(y\) is sampled from the current student, the method is on-policy. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) formalizes this as Generalized Knowledge Distillation, where student-generated sequences reduce train-inference mismatch.
In on-policy distillation, the objective becomes:
\[\mathcal{L}_{OPD}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
This formulation shows why OPD can be more useful than ordinary sequence-level KD for long-horizon reasoning. The teacher does not merely provide an ideal output; it scores the student’s actual visited prefixes, including prefixes that arise from the student’s own mistakes.
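The on-policy expectation can be made concrete with a toy Monte Carlo estimate. The sketch below assumes hypothetical bigram policies (each next-token distribution depends only on the previous token) and uses forward KL as \(D\); a real system would replace these tables with model forward passes.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample_rollout(policy, rng, max_len=8):
    """Sample tokens from a bigram policy until <eos> or max_len."""
    tokens, prev = [], "<bos>"
    for _ in range(max_len):
        tok = rng.choices(VOCAB, weights=policy[prev], k=1)[0]
        tokens.append(tok)
        if tok == "<eos>":
            break
        prev = tok
    return tokens

def opd_loss(student, teacher, rng, num_rollouts=32):
    """Average summed per-token divergence on student-generated prefixes."""
    total = 0.0
    for _ in range(num_rollouts):
        rollout = sample_rollout(student, rng)   # y_hat ~ p_S, not p_T
        prev = "<bos>"
        for tok in rollout:
            # The teacher scores the prefix the student actually visited.
            total += kl(teacher[prev], student[prev])
            prev = tok
    return total / num_rollouts

teacher = {"<bos>": [0.7, 0.2, 0.1], "a": [0.1, 0.6, 0.3], "b": [0.2, 0.2, 0.6]}
student = {"<bos>": [0.4, 0.4, 0.2], "a": [0.3, 0.3, 0.4], "b": [0.4, 0.3, 0.3]}
print(opd_loss(student, teacher, random.Random(0)))
```

Swapping `sample_rollout(student, rng)` for `sample_rollout(teacher, rng)` would turn the same loop into an off-policy estimate, which is the only structural difference between the two objectives in this toy setting.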

Reward-Tilted Teachers

For modern OPD, the teacher-student formulation is incomplete unless the teacher’s relationship to reward is specified. A teacher can be larger, privileged, or more confident without being a good teacher. The critical condition is whether the teacher assigns more probability to higher-reward trajectories while remaining close enough to the student to be imitable.
Let \(s\) denote a complete trajectory and \(R(s)\) its downstream reward. Let \(\pi_k^S\) be the current student policy held fixed. KL-regularized reward maximization has a closed-form reward-tilted optimum:
\[\pi_T^*(s) =\frac{1}{Z} \pi_k^S(s) \exp(\beta R(s)), \qquad Z = \mathbb{E}_{s \sim \pi_k^S} \left[ \exp(\beta R(s)) \right]\]
- where \(\beta\) controls the strength of reward tilting and \(Z\) normalizes the distribution.
If the teacher is this reward-tilted policy and its gradient is stopped, reverse-KL distillation decomposes into a reward-seeking term plus a policy-proximity term:
\[D_{KL} \left( \pi^S \,\Vert\, \pi_T^* \right) = D_{KL} \left( \pi^S \,\Vert\, \pi_k^S \right) - \beta \mathbb{E}_{s \sim \pi^S} \left[ R(s) \right] + \log Z\]
Since \(\log Z\) does not depend on the optimized student, minimizing this objective increases expected reward while keeping the student close to its current policy. This is the formal reason that distillation toward a suitable teacher can behave like dense KL-regularized RL.
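The decomposition can be checked numerically on a small discrete trajectory space. The three-trajectory policies, rewards, and \(\beta\) below are arbitrary illustrative values.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_k = [0.5, 0.3, 0.2]       # current student policy, held fixed
rewards = [0.0, 1.0, 1.0]    # R(s) for each of three trajectories
beta = 2.0

# Reward-tilted teacher: pi_T*(s) = pi_k(s) * exp(beta * R(s)) / Z
Z = sum(p * math.exp(beta * r) for p, r in zip(pi_k, rewards))
pi_T = [p * math.exp(beta * r) / Z for p, r in zip(pi_k, rewards)]

pi_S = [0.4, 0.4, 0.2]       # any candidate student being optimized

lhs = kl(pi_S, pi_T)
expected_reward = sum(p * r for p, r in zip(pi_S, rewards))
rhs = kl(pi_S, pi_k) - beta * expected_reward + math.log(Z)

print(lhs, rhs)  # the two values agree up to floating-point error
```

The identity holds for any choice of `pi_S` with full support, which is why minimizing the left-hand side trades off expected reward against proximity to `pi_k`.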
The teacher-student log-ratio provides a practical diagnostic:
\[\Delta_T(s) = \log \pi_T(s) - \log \pi_S(s)\]
A useful OPD teacher should assign larger \(\Delta_T(s)\) to higher-reward trajectories. If \(\Delta_T(s)\) tracks a response template instead of reward, distillation can move the student in the wrong direction even though the loss is dense and on-policy.
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t uses this diagnostic to distinguish reward-improving expert teachers from privileged self-teachers: an RL-trained expert behaves like a reward-tilted student, while a naive self-distillation teacher can up-weight responses that look as if the model had a reference solution, regardless of correctness.
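One simple way to operationalize this diagnostic is to correlate the sequence-level log-ratio with reward over a batch of student rollouts. The sketch below assumes Pearson correlation as the summary statistic, which is an illustrative choice; the function names and numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention: no variance means no measurable tracking
    return cov / math.sqrt(vx * vy)

def reward_tracking(teacher_seq_logps, student_seq_logps, rewards):
    """Correlation between Delta_T(s) = log pi_T(s) - log pi_S(s) and R(s)."""
    deltas = [t - s for t, s in zip(teacher_seq_logps, student_seq_logps)]
    return pearson(deltas, rewards)

rewards = [1.0, 0.0, 1.0, 0.0]
student = [-2.0, -2.0, -2.0, -2.0]
aligned_teacher = [-1.0, -3.0, -1.2, -2.8]   # prefers the rewarded rollouts
style_teacher = [-3.0, -1.0, -2.8, -1.2]     # prefers the unrewarded rollouts

print(reward_tracking(aligned_teacher, student, rewards))
print(reward_tracking(style_teacher, student, rewards))
```

A clearly positive value suggests the teacher behaves like a reward-tilted version of the student on these rollouts, while a value near zero or below is a warning that dense distillation may push in the wrong direction.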

Temperature Scaling and Soft Targets

Classical Temperature Scaling

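A central idea in classical distillation is temperature scaling. The teacher logits \(z_i\) are softened using a temperature \(T > 1\) (in the formulas that follow, the superscript \((T)\) denotes this temperature, while the subscript in \(p_T\) still denotes the teacher):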
\[p_T^{(T)}(i \mid x) =\frac{\exp(z_i/T)} {\sum_j \exp(z_j/T)}\]
Higher temperatures produce smoother distributions, making low-probability classes more visible. This helps the student learn nuanced relationships that are otherwise hidden in one-hot labels.
The student distribution can be softened in the same way:
\[p_S^{(T)}(i \mid x) =\frac{\exp(u_i/T)} {\sum_j \exp(u_j/T)}\]
- where \(u_i\) denotes the student logit for class or token \(i\).
The temperature-scaled distillation loss then becomes:
\[\mathcal{L}_{KD}^{(T)} =T^2 \cdot D_{KL} \left( p_T^{(T)} \,\Vert\, p_S^{(T)} \right)\]
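The factor \(T^2\) compensates for the fact that gradients of the softened loss with respect to the student logits scale approximately as \(1/T^2\). Without this correction, increasing \(T\) changes not only the smoothness of the target distribution but also the scale of the gradients, which shifts the relative contribution of the distillation term and any hard-label term.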
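The effect of the \(T^2\) factor can be seen by computing the gradient of the softened KL loss with respect to the student logits, which equals \((p_S^{(T)} - p_T^{(T)})/T\) before any correction. The following sketch uses arbitrary toy logits; the function names are illustrative.

```python
import math

def softmax_t(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_grad(teacher_logits, student_logits, temperature, use_t2=True):
    """Gradient of KL(p_T^(T) || p_S^(T)) w.r.t. student logits, optionally times T^2."""
    p_t = softmax_t(teacher_logits, temperature)
    p_s = softmax_t(student_logits, temperature)
    scale = temperature ** 2 if use_t2 else 1.0
    return [scale * (ps - pt) / temperature for ps, pt in zip(p_s, p_t)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

z = [2.0, 1.0, 0.0]     # teacher logits
u = [0.5, 0.2, -0.1]    # student logits
for T in (1.0, 5.0, 10.0, 20.0):
    raw = norm(kd_grad(z, u, T, use_t2=False))
    fixed = norm(kd_grad(z, u, T, use_t2=True))
    print(T, raw, fixed)
```

At larger temperatures the uncorrected gradient norm shrinks by roughly a factor of four each time \(T\) doubles, while the corrected norm stays on a comparable scale.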

Offline and Online Use

In offline distillation, temperature is typically applied to a frozen teacher’s logits. The student receives a softened target distribution from a stationary teacher, which makes the optimization stable and easy to cache.
In online distillation, temperature may be applied to each peer model’s logits before exchanging predictions. This can prevent mutual learning from collapsing too quickly into overconfident agreement and can preserve complementary learning signals for longer.
In online peer-learning systems, temperature therefore acts as both a smoothing mechanism and a coordination mechanism. A higher temperature gives peers more information about secondary alternatives, while a lower temperature sharpens agreement around dominant predictions.

Temperature in Modern LLM Recipes

In modern frontier recipes, temperature also appears outside the classical soft-label setting. Frontier post-training recipe review with Finbarr Timbers highlights that recent model reports discuss sampling-temperature schedules, difficulty curricula, and filtering policies as part of the broader post-training recipe. These choices affect the rollouts that later become SFT traces, RL trajectories, or distillation targets, so they indirectly shape the teacher distribution even when the final loss is not temperature-scaled KD.
In reasoning and agentic post-training, sampling temperature can determine which trajectories are discovered, which examples pass filters, and which failure modes are exposed to a teacher or verifier. Temperature therefore influences both data generation and the support over which distillation is later applied.
Implementation detail: in large-vocabulary settings such as LLMs, computing full softmax distributions is expensive. In practice, systems often approximate the loss using top-\(k\) tokens from either the teacher or student distribution, depending on whether forward or reverse KL is used.
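A minimal sketch of one such approximation is shown below: forward KL restricted to the teacher's top-\(k\) tokens, without renormalizing the truncated distributions. Whether to renormalize, and how to treat the remaining tail mass, are design decisions that vary between systems, so the specific form here is an assumption for illustration only.

```python
import math

def topk_forward_kl(teacher_probs, student_probs, k):
    """Sum of p_T * log(p_T / p_S) over the teacher's k most likely tokens."""
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:k]
    return sum(
        teacher_probs[i] * math.log(teacher_probs[i] / student_probs[i])
        for i in top
        if teacher_probs[i] > 0
    )

teacher_probs = [0.6, 0.3, 0.08, 0.02]
student_probs = [0.4, 0.4, 0.1, 0.1]
print(topk_forward_kl(teacher_probs, student_probs, k=2))
print(topk_forward_kl(teacher_probs, student_probs, k=4))  # equals the full KL
```

Because the tail is dropped without renormalization, the truncated value is only an approximation of the full divergence and is not guaranteed to be non-negative.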

Token-Level Distillation in Autoregressive Models

Per-Token Matching

For language models, distillation is applied at the token level. Given an input sequence \(x\) and generated tokens \(y=(y_1,\dots,y_n)\), both teacher and student define conditional distributions:
\[p(y_t \mid x,y_{<t})\]
The distillation objective aggregates per-token divergence:
\[D_{KL}(p_T \,\Vert\, p_S)(y \mid x) =\frac{1}{|y|} \sum_{t=1}^{|y|} D_{KL} \left( p_T(\cdot \mid x,y_{<t}) \,\Vert\, p_S(\cdot \mid x,y_{<t}) \right)\]
This formulation is explicitly described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024), where distillation is framed as minimizing divergence between teacher and student token distributions along sequences.

Prefix Source

Offline token-level distillation usually evaluates the teacher and student on a fixed sequence distribution, such as human-written outputs, teacher-generated outputs, or cached synthetic data. This is simple and stable, but it can create a gap between training prefixes and inference prefixes.
Online token-level distillation allows the supervising distribution to change during training. In peer-learning settings, each model may provide token-level probabilities to other models on shared batches; in more advanced LLM systems, periodically refreshed checkpoints or peer models can serve as evolving teachers.
On-policy token-level distillation changes the prefix distribution by evaluating teacher and student on student-generated prefixes. This is why OPD is especially relevant for long-horizon reasoning and agentic tasks: a model’s future states are defined by its own earlier actions, tool calls, or reasoning tokens.
Sasha Rush’s video lecture on how on-policy distillation works explains this distinction using the contrast between sequence KD and OPD: sequence KD trains the student on the teacher’s chosen trajectory, whereas OPD lets the student produce a trajectory and has the teacher rescore the exact sequence of student actions.
A key implication is that the quality of training depends heavily on the distribution of prefixes \(y_{<t}\) encountered during training. This motivates the distinction between off-policy and on-policy methods: the same token-level divergence can behave differently depending on whether it is evaluated on teacher-selected, dataset-selected, or student-selected prefixes.

Token-Level Credit Assignment

Token-level distillation is attractive because it provides a denser signal than sequence-level supervision. Instead of receiving only a final label, final reward, or full-sequence imitation target, the student receives feedback at every prefix.
In sampled-token OPD, the token-level signal can often be written as a log-probability gap:
\[A_t^{OPD} = \log p_T(y_t \mid s_t) - \log p_S(y_t \mid s_t)\]
- where \(s_t=(x,y_{<t})\) is the student-visited prefix.
This signal is useful when the teacher’s probability gap tracks task quality. If the teacher assigns higher probability to tokens because they are reward-improving, the dense signal improves learning efficiency. If the teacher assigns higher probability to tokens because they match a privileged-context style, the dense signal can teach the wrong behavior.

Divergence Choices and Their Effects

The divergence controls what “matching the teacher” means. In classical KD, this choice mostly determines how the student approximates a fixed teacher distribution. In OPD and self-distillation, it also determines how dense feedback is converted into updates on student-visited states. The same divergence can help or hurt depending on teacher quality, support overlap, and whether the teacher’s log-ratio tracks reward.

Forward KL

Forward KL is:
\[D_{KL}(p_T \,\Vert\, p_S) =\sum_x p_T(x) \log \frac{p_T(x)}{p_S(x)}\]
Forward KL is teacher-weighted. It penalizes the student for assigning low probability to tokens the teacher considers likely. This leads to mean-seeking behavior, encouraging coverage of teacher modes.
Forward KL is often used in classical supervised KD because the teacher distribution is treated as the target distribution. It is strongest when teacher and student have good local support overlap, since matching the teacher’s full distribution can transfer rich information about alternatives.
In LLM systems, forward KL can be expensive because it often requires teacher probabilities over many tokens, such as full-vocabulary logits or teacher top-\(k\) distributions.
Forward KL can also be brittle when the teacher distribution is unreliable on student-visited prefixes. If the teacher and student are far apart, matching broad teacher support can amplify irrelevant modes instead of correcting the student’s actual mistakes.

Reverse KL

Reverse KL is:
\[D_{KL}(p_S \,\Vert\, p_T) =\sum_x p_S(x) \log \frac{p_S(x)}{p_T(x)}\]
Reverse KL is student-weighted. It penalizes the student for assigning probability to tokens or trajectories the teacher considers unlikely. This leads to mode-seeking behavior, focusing updates on the regions the student actually proposes.
In sampled-token OPD, reverse KL is attractive because the student has already generated the token being evaluated. A token-level signal can be written as:
\[A_t^{OPD} = \log p_T(y_t \mid s_t) - \log p_S(y_t \mid s_t)\]
- where \(s_t=(x,y_{<t})\) is the student-visited prefix.
MiniLLM: Knowledge Distillation of Large Language Models by Gu et al. (2023) uses reverse KL for generative language-model distillation and argues that it better avoids overestimating low-probability regions of the teacher distribution.
Reverse KL is not automatically safe. It can still reinforce the wrong behavior if the teacher assigns high probability to tokens that reflect privileged-context artifacts, hallucinated references, or overconfident reasoning rather than downstream reward.
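The contrast between teacher-weighted forward KL and student-weighted reverse KL can be illustrated with a bimodal teacher and two candidate students, one that spreads mass broadly and one that commits to a single mode. The distributions below are toy values chosen for illustration.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.48, 0.02, 0.02, 0.48]       # two modes, at the first and last token
covering = [0.25, 0.25, 0.25, 0.25]      # spreads mass over everything
single_mode = [0.94, 0.02, 0.02, 0.02]   # commits to one teacher mode

forward = {name: kl(teacher, s) for name, s in
           [("covering", covering), ("single_mode", single_mode)]}
reverse = {name: kl(s, teacher) for name, s in
           [("covering", covering), ("single_mode", single_mode)]}

print("forward KL(teacher || student):", forward)
print("reverse KL(student || teacher):", reverse)
```

Forward KL ranks the covering student as closer because it penalizes missing the second teacher mode, while reverse KL ranks the single-mode student as closer because it penalizes mass placed where the teacher assigns little probability.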

Jensen-Shannon Divergence

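The generalized Jensen-Shannon divergence interpolates between teacher-weighted and student-weighted behavior through a mixing weight \(\beta \in (0,1)\), which is unrelated to the reward-tilting strength \(\beta\) used earlier: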
\[D_{JSD}(p_T \,\Vert\, p_S) =\beta D_{KL} \left( p_T \,\Vert\, m \right) + (1-\beta) D_{KL} \left( p_S \,\Vert\, m \right), \qquad m = \beta p_T + (1-\beta)p_S\]
JSD is bounded and can be more stable than pure KL because it combines teacher-weighted and student-weighted terms. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) treats GKD as a flexible framework in which the divergence can be selected based on task, student capacity, and teacher-student mismatch.
JSD is useful when forward KL may over-cover teacher modes and reverse KL may collapse too aggressively onto a subset of modes. However, it still depends on teacher quality: the teacher distribution must encode behavior worth imitating on the student’s visited prefixes.
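The JSD formula above translates directly into code. The sketch below follows the definition term by term; the example distributions are illustrative, and the mixing weight is assumed to lie strictly between 0 and 1.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p_t, p_s, beta=0.5):
    """beta * KL(p_T || m) + (1 - beta) * KL(p_S || m), m = beta*p_T + (1-beta)*p_S."""
    m = [beta * a + (1 - beta) * b for a, b in zip(p_t, p_s)]
    return beta * kl(p_t, m) + (1 - beta) * kl(p_s, m)

p_t = [0.7, 0.2, 0.1]
p_s = [0.2, 0.3, 0.5]
print(generalized_jsd(p_t, p_s, beta=0.5))

# Even with disjoint supports the value stays finite: at beta = 0.5 it is log(2).
print(generalized_jsd([1.0, 0.0], [0.0, 1.0], beta=0.5), math.log(2))
```

The disjoint-support case shows the boundedness property: either KL direction alone would be infinite there, while the mixture \(m\) keeps both terms finite.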

Divergence Choice in Online and Multi-Teacher Settings

As a practical insight, forward KL is often used in classical supervised KD, while reverse KL is frequently used in on-policy distillation because it aligns naturally with sampling from the student distribution.
In offline distillation, divergence selection primarily controls how the student approximates a fixed teacher. In online distillation, divergence selection also affects training dynamics among co-evolving models: overly strong agreement losses can reduce diversity too early, while weaker or temperature-smoothed agreement can preserve complementary learning signals for longer.
In multi-teacher OPD, divergence choice interacts with support overlap. Full forward-KL-style distribution matching can be powerful when the teacher and student share local token support, but sampled-token or reverse-KL-style scoring can be more robust when the teacher and student are farther apart.
This distinction appears in recent discussions of large MOPD systems such as Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026), where sampled-token objectives can be preferable when full teacher distributions are noisy on student-visited prefixes.

Practical Selection Rules

Forward KL is appropriate when the teacher distribution is trusted, teacher-student support overlap is high, and the goal is broad mode coverage.
Reverse KL is appropriate when training on sampled student tokens, when the teacher is used to score what the student actually did, or when the communication budget favors sampled-token log-probabilities over full-vocabulary distributions.
JSD is appropriate when stability matters and the teacher-student mismatch is large enough that either pure KL direction may be brittle.
Sampled-token objectives are appropriate when the student’s own rollouts define the relevant state distribution and when teacher full-distribution matching would be too expensive or too noisy.
In reasoning and agentic settings, divergence choice should always be evaluated with reward-tracking diagnostics. The question is not only whether the student is matching the teacher, but whether the teacher-student log-ratio increases with downstream reward.
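One simple form such a diagnostic could take is a correlation between the per-rollout mean teacher-student log-ratio and the rollout's reward. The sketch below is an assumption about how to operationalize the idea, not a prescribed metric; the rollout records and their field names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mean_log_ratio(teacher_logps, student_logps):
    """Average of log p_T - log p_S over the sampled tokens of one rollout."""
    return sum(t - s for t, s in zip(teacher_logps, student_logps)) / len(teacher_logps)

# Hypothetical student rollouts: sampled-token log-probs plus a verifier reward.
rollouts = [
    {"teacher": [-0.2, -0.3, -0.1], "student": [-0.9, -1.0, -0.8], "reward": 1.0},
    {"teacher": [-1.5, -2.0], "student": [-0.5, -0.6], "reward": 0.0},
    {"teacher": [-0.4, -0.6], "student": [-0.6, -0.7], "reward": 1.0},
    {"teacher": [-1.0, -1.2], "student": [-0.8, -0.9], "reward": 0.0},
]

ratios = [mean_log_ratio(r["teacher"], r["student"]) for r in rollouts]
rewards = [r["reward"] for r in rollouts]
reward_tracking = pearson(ratios, rewards)
print(f"reward-tracking correlation: {reward_tracking:.2f}")
```

A clearly positive value suggests the teacher prefers the rollouts that the verifier also rewards; a value near zero or below is a warning that matching the teacher more closely may not improve downstream reward.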

Supervised Distillation and Sequence-Level Distillation

Supervised Logit-Level Distillation

Supervised logit-level distillation provides dense supervision at every token by matching the teacher’s distribution rather than only the correct label or target token.
A standard supervised distillation objective is:
\[\mathcal{L}_{SD} = \mathbb{E}_{(x,y)} \left[ \sum_{t} D_{KL} \left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S(\cdot \mid x, y_{<t}) \right) \right]\]
This objective leverages the full distribution rather than only correct labels. The teacher can express uncertainty, secondary alternatives, and similarities between outputs, which a hard target discards.
Supervised logit-level distillation is most commonly implemented as an offline method: a frozen teacher labels fixed data, and the student is trained afterward.
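As a minimal sketch of this objective, the function below computes the per-position forward KL between teacher and student distributions derived from logits and averages it over positions. The logits are made-up values, and the `temperature` argument is an optional softening knob added for illustration; it is not part of the objective as written above.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one position's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean over positions of KL(p_T || p_S), both computed from per-position logits."""
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_t = softmax(t_row, temperature)
        p_s = softmax(s_row, temperature)
        losses.append(sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0))
    return sum(losses) / len(losses)

# Two token positions over a 3-token vocabulary (illustrative logits).
teacher_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
student_logits = [[1.5, 1.2, 0.3], [0.2, 1.0, 1.1]]

loss = logit_distillation_loss(teacher_logits, student_logits)
print(f"logit-level distillation loss: {loss:.4f}")
```

In an offline setup the teacher logits would typically be precomputed or served by a frozen model, and only the student side would receive gradients.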

Sequence-Level Distillation

Sequence-level distillation replaces ground-truth outputs with teacher-generated sequences. It was introduced for neural machine translation in Sequence-Level Knowledge Distillation by Kim and Rush (2016), where teacher-generated translations serve as simplified targets for the student.
The student is trained via standard likelihood on the teacher-generated sequence:
\[\mathcal{L}_{SeqKD} = \mathbb{E}_{x} \left[ -\log p_S(y_T \mid x) \right]\]
- where:
\[y_T \sim p_T(\cdot \mid x)\]
Sequence-level KD simplifies the target distribution, often making learning easier but discarding distributional richness. Instead of learning the teacher’s full probability distribution, the student learns to imitate one or more teacher-selected outputs.
Sequence-level KD is usually off-policy because the sequence \(y_T\) is produced by the teacher or another external source, not by the current student.
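The two steps, sampling \(y_T\) from the teacher and then scoring it under the student, can be sketched with toy bigram models. Everything here is an assumption made for illustration: the vocabulary, the probabilities, and the bigram structure stand in for real autoregressive models.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

# Toy bigram "models": next-token probabilities conditioned on the previous token.
teacher = {
    "<bos>": {"a": 0.8, "b": 0.1, "<eos>": 0.1},
    "a": {"a": 0.1, "b": 0.7, "<eos>": 0.2},
    "b": {"a": 0.1, "b": 0.1, "<eos>": 0.8},
}
student = {
    "<bos>": {"a": 0.4, "b": 0.4, "<eos>": 0.2},
    "a": {"a": 0.3, "b": 0.4, "<eos>": 0.3},
    "b": {"a": 0.3, "b": 0.3, "<eos>": 0.4},
}

def sample_sequence(model, rng, max_len=10):
    """Draw one sequence from the model, stopping at <eos> or max_len."""
    prev, out = "<bos>", []
    for _ in range(max_len):
        tok = rng.choices(VOCAB, weights=[model[prev][v] for v in VOCAB])[0]
        out.append(tok)
        if tok == "<eos>":
            break
        prev = tok
    return out

def sequence_nll(model, seq):
    """Negative log-likelihood of a sequence under the model."""
    prev, nll = "<bos>", 0.0
    for tok in seq:
        nll -= math.log(model[prev][tok])
        prev = tok
    return nll

rng = random.Random(0)
teacher_samples = [sample_sequence(teacher, rng) for _ in range(4)]
# SeqKD loss: average student NLL on teacher-generated sequences.
seqkd_loss = sum(sequence_nll(student, y) for y in teacher_samples) / len(teacher_samples)
print(teacher_samples, f"loss={seqkd_loss:.4f}")
```

Note that the student is never sampled in this loop, which is what makes the procedure off-policy.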

Offline and Online Variants

Both supervised logit-level distillation and sequence-level distillation are most commonly implemented as offline methods: a frozen teacher labels fixed data or generates synthetic targets, and the student is trained afterward.
Online versions are possible when teacher outputs are generated by co-trained peers or periodically refreshed teachers rather than by a static teacher. In such cases, the same supervised or sequence-level objective can be used, but the source distribution evolves during optimization.
Modern frontier recipes often use trace-distillation SFT as a consolidation stage even when they do not use full on-policy MOPD. Frontier post-training recipe review with Finbarr Timbers contrasts recipes that consolidate specialist RL climbs through trace-distillation SFT with newer recipes that consolidate specialists through MOPD.
The key difference is whether the consolidation data comes from teacher-generated traces or student-generated rollouts. Trace-distillation SFT teaches the student to imitate the traces selected by the teacher or recipe designer. MOPD teaches the student to improve its own trajectories using token-level teacher feedback over the states the student actually visits.

Representation and Intermediate-Layer Distillation

Hidden-State and Attention Matching

Classical distillation need not operate solely on output probabilities. Representation distillation aligns hidden states, attention maps, embeddings, or other intermediate features.
DistilBERT: a distilled version of BERT by Sanh et al. (2019) combines three losses: masked language modeling loss, distillation loss on softened logits, and cosine embedding loss on hidden representations. This multi-objective approach preserves both output behavior and internal representations, demonstrating that distillation can transfer structural knowledge in addition to token probabilities.
A generic representation distillation loss can be written as:
\[\mathcal{L}_{rep}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \left[ \left\| h_T(x) -g_\phi(h_S^\theta(x)) \right\|_2^2 \right]\]
- where \(h_T(x)\) is the teacher representation, \(h_S^\theta(x)\) is the student representation, and \(g_\phi\) is an optional projection when teacher and student dimensions differ.
Subsequent work has extended this principle to attention map matching, value and key projection matching, layer-wise feature regression, and contrastive representation alignment.
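A minimal sketch of the generic representation loss follows, assuming a linear projection for \(g_\phi\) and made-up hidden states. In practice the projection would be learned jointly with the student; here it is a fixed matrix so the arithmetic is easy to follow.

```python
def project(W, h):
    """Linear projection g_phi(h) = W h, mapping student width to teacher width."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def representation_loss(teacher_states, student_states, W):
    """Mean over examples of the squared L2 distance between h_T and g_phi(h_S)."""
    per_example = [
        sum((t - p) ** 2 for t, p in zip(h_t, project(W, h_s)))
        for h_t, h_s in zip(teacher_states, student_states)
    ]
    return sum(per_example) / len(per_example)

# Teacher hidden size 3, student hidden size 2 (illustrative values).
teacher_states = [[1.0, 2.0, 3.5], [0.0, 1.0, 1.0]]
student_states = [[1.0, 2.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical fixed projection, shape 3 x 2

loss = representation_loss(teacher_states, student_states, W)
print(f"representation loss: {loss}")
```

The first example contributes \((3.5 - 3.0)^2 = 0.25\) and the second contributes zero, so the mean is \(0.125\).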

Architecture Compatibility

Representation and intermediate-layer distillation are especially useful when the student architecture is similar enough to the teacher for hidden states or attention maps to be meaningfully aligned.
In offline representation distillation, the teacher’s hidden states can be cached or computed live from a frozen teacher. In online representation distillation, peers may align intermediate representations during co-training, but this is more architecture-sensitive because hidden-state dimensions, layer counts, and attention structures must be compatible or projected into a shared space.
In modern LLM post-training, direct hidden-state matching is less common than output-distribution or trace-based transfer because teachers and students may differ in architecture, scale, tokenizer, routing structure, or inference stack.
MoE and hybrid architectures make internal-layer alignment less straightforward, while token-level distillation and trace distillation remain broadly compatible across architectures. This is one reason modern LLM recipes often prefer output-level, sequence-level, or sampled-token supervision when consolidating capabilities across heterogeneous teachers.

Distillation as Synthetic Data Generation

Teacher Outputs as Data

A complementary perspective, emphasized in the RLHF Book chapter Synthetic Data, is that distillation is also a structured data-generation process. A teacher can produce answers, chain-of-thought traces, critiques, preference labels, filtered examples, tool-use demonstrations, or verifier-selected attempts.
The student then trains on these outputs either as hard targets or as soft distributions. This broadens distillation from model compression to a general capability-transfer mechanism.
In modern LLM pipelines, generating high-quality synthetic reasoning traces often precedes more advanced on-policy or reinforcement learning stages. Synthetic-data generation is therefore not separate from distillation; it is often the data-production layer through which distillation becomes possible.
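As a small illustration of the data-production view, the sketch below keeps only teacher attempts that pass a verifier and formats them as SFT pairs. The candidate traces, the field names, and the exact-match verifier are all hypothetical; a real pipeline would sample the candidates from a teacher model and use task-specific checks.

```python
def build_sft_examples(prompt, candidates, verify):
    """Keep only teacher attempts that pass the verifier, formatted as SFT pairs."""
    return [
        {"prompt": prompt, "response": c["trace"] + "\nAnswer: " + c["answer"]}
        for c in candidates
        if verify(c["answer"])
    ]

prompt = "What is 17 * 6?"
# Hypothetical teacher-generated attempts; the second one contains an arithmetic slip.
candidates = [
    {"trace": "17 * 6 = 17 * 5 + 17 = 85 + 17", "answer": "102"},
    {"trace": "17 * 6 = 10 * 6 + 6 * 6 = 60 + 36", "answer": "96"},
    {"trace": "17 * 6 = 20 * 6 - 3 * 6 = 120 - 18", "answer": "102"},
]

# Toy verifier: exact match against a known reference answer.
sft_examples = build_sft_examples(prompt, candidates, lambda answer: answer == "102")
print(len(sft_examples), "of", len(candidates), "attempts kept")
```

The student then trains on `sft_examples` as hard targets, so the verifier's reliability directly bounds the quality of what is distilled.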

Offline and Online Synthetic Data

In offline synthetic-data distillation, the generated examples are usually fixed before student training or regenerated in separate rounds. This is stable and easy to scale, but the student may still encounter train-inference mismatch if the examples are far from its own rollout distribution.
In online synthetic-data distillation, examples, critiques, or peer labels may evolve as teachers, students, or self-improvement loops change during training. This makes the data more adaptive but increases systems complexity and can introduce non-stationarity.
The RLHF Book chapter Synthetic Data frames the path from offline KD to OPD as a move from static teacher-generated data toward student-sampled trajectories with dense teacher feedback.

Modern SFT as a Distillation Pipeline

Modern SFT stages are often synthetic-data pipelines in disguise. Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) constructs large SFT mixtures with teacher-generated reasoning traces, correctness filtering for coding tasks with tests, long-context examples, multi-turn synthetic conversations, instruction-following examples, safety examples, and domain-specific reasoning data.
This shows how SFT, synthetic data, and distillation can form a single integrated data-production loop rather than separate stages. A model may first learn from synthetic traces, then improve through RL, then become a teacher for a later student or consolidation stage.
Similarly, Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) uses generated and filtered SFT data across science, math, proof, competitive coding, multi-turn chat, and agentic software tasks before later RL and MOPD, illustrating how large post-training pipelines rely on synthetic distillation artifacts even before explicit OPD begins.
The practical implication is that distillation should be evaluated as a pipeline, not only as a loss. The quality of the teacher, the sampling distribution, filtering rules, verifier reliability, and curriculum all shape the final student.

Post-Training Recipes of Recent LLMs

Recent LLM post-training recipes can be read as a history of how distillation moved from a compression technique into a recipe-consolidation mechanism. The broad evolution is:
- 2022 to 2023: SFT, reward modeling, and PPO-style RLHF, exemplified by Training language models to follow instructions with human feedback by Ouyang et al. (2022).
- 2024: open recipes increasingly formalized SFT, preference optimization, and RL with verifiable rewards, while closed recipes used more elaborate multi-stage RLHF variants.
- 2025: reasoning RL became the centerpiece after DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by Guo et al. (2025), which made large-scale RLVR central to reasoning-model post-training.
- 2026: recipes increasingly fragment into multiple domain-specialist teachers and then merge those teachers into one deployable model through trace distillation, multi-domain OPD, or MOPD.
The simplest useful schema is:
\[\text{Base Model} \rightarrow \text{SFT / cold start} \rightarrow \text{RL or specialist RL} \rightarrow \text{distillation / consolidation} \\ \rightarrow \text{alignment and domain-specific polishing}\]
The major difference across recent recipes is where the recipe places the consolidation step. Some recipes use SFT trace distillation after training specialists. Others use MOPD, where the student samples its own rollouts and multiple teachers provide dense token-level feedback over those rollouts.

InstructGPT

Training language models to follow instructions with human feedback by Ouyang et al. (2022) is the canonical early RLHF recipe:
- SFT on human demonstrations.
- Reward model training on human preference comparisons.
- PPO against the reward model.
In this recipe, distillation is not the central consolidation mechanism. The key teacher signal is human feedback converted into a reward model, and policy optimization moves the model toward responses preferred by that reward model.
The following figure (source) shows the canonical InstructGPT-style three-step RLHF pipeline: collect demonstration data and train a supervised policy, collect comparison data and train a reward model, then optimize the policy against the learned reward model using reinforcement learning.

Llama 2

Llama 2: Open Foundation and Fine-Tuned Chat Models by Touvron et al. (2023) used a multi-stage RLHF recipe with SFT followed by iterative RLHF over multiple rounds.
Each round used rejection sampling followed by PPO, and the recipe used separate reward models for helpfulness and safety. This made Llama 2 a practical implementation of the early RLHF pattern, but with more iteration, more filtering, and more safety-specific modeling than the original InstructGPT-style formulation.
The following figure (source) shows the Llama 2 post-training pipeline, including pretraining, supervised fine-tuning, human feedback, safety and helpfulness reward models, rejection sampling, PPO, and iterative improvement of Llama 2-Chat.

Llama 3 and Tülu-Style Open Recipes

The Llama 3 Herd of Models by Dubey et al. (2024) used a complex multi-stage recipe with simpler optimizers. A typical round involved reward-model training, sampling multiple responses per prompt, rejection sampling, SFT, and DPO. The reward model mostly filtered samples rather than serving as the target of an online PPO-style loop.
The following figure (source) shows the Llama 3 post-training loop: collected prompts are expanded into multiple generations per prompt, filtered through a reward model and rejection sampling, converted into SFT data, trained into an SFT model, and then refined through DPO across multiple rounds.

Tülu 3: Pushing Frontiers in Open Language Model Post-Training by Lambert et al. (2024) is a clean open-recipe example: curated prompts, SFT, DPO, and RLVR. This recipe helped formalize RL with verifiable rewards as a core open post-training stage.
The following figure (source) shows the Tülu 3 three-stage open post-training recipe: public and synthetic prompt curation, SFT data mixing, direct preference optimization with both on-policy and off-policy data, and RL with verifiable rewards, with development and unseen evaluations throughout.

A practical observation from Frontier post-training recipe review with Finbarr Timbers is that DPO-style stages become less central in some later frontier recipes as the rest of the pipeline becomes cleaner, more on-policy, and more environment-driven. This does not mean preference optimization is obsolete; rather, it becomes one tool among reward modeling, RLVR, trace filtering, and distillation.

OLMo 3

OLMo 3 can be read as a reasoning update to the Tülu-style open recipe. The recipe separates thinking and instruction behavior while still preserving a relatively simple staged structure compared with the frontier-lab recipes that use many specialist teachers and MOPD.
The high-level pattern is:
- Pretraining and midtraining.
- A long-context OLMo 3 base model.
- Separate think and instruct post-training paths.
- SFT, DPO, and RLVR stages for think and instruct variants.
- An RL-Zero-style RLVR branch for reasoning exploration.
The following figure (source) shows the OLMo 3 recipe with pretraining, midtraining, long-context base adaptation, separate Think SFT, Think DPO, Think RLVR, Instruct SFT, Instruct DPO, Instruct RLVR branches, and an RL-Zero RLVR path.

DeepSeek-R1-Style Reasoning RL

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by Guo et al. (2025) shifted attention toward large-scale reasoning RL, where verifiable rewards and long reasoning traces become central rather than peripheral.
The recipe:
- R1-Zero uses pure RL, specifically GRPO, on the base model without SFT, primarily to seed reasoning behaviors for the full run rather than to serve as the final product.
- R1 uses cold-start SFT, reasoning RL, rejection-sampling SFT, and final RL.
- Large-scale RLVR becomes the primary driver, while SFT is used to distill, clarify, and refine RL-emergent reasoning behaviors.
In this style of recipe, the core model improvement comes from RL on verifiable reasoning tasks. Distillation then becomes important either for transferring the resulting reasoning behavior into smaller models or for consolidating later specialist variants.
The following figure (source) shows the DeepSeek-R1 recipe: an R1-Zero branch with RL on reasoning prompts and accuracy or format rewards, a cold-start SFT path into reasoning RL, sampling and SFT on reasoning and non-reasoning data, and a final RL stage with rule-based and preference rewards.

DeepSeek Evolution After V3

DeepSeek’s post-training evolution is useful because it shows a transition from relatively standard SFT and GRPO-style RL toward specialist creation and finally MOPD-style consolidation.
The high-level progression is:
- V3, Dec. 2024: SFT plus GRPO RL.
- R1, Jan. 2025: multi-stage RL where reasoning emerges.
- V3.1, Aug. 2025: hybrid think and non-think behavior in one model.
- V3.2, Dec. 2025: six specialists via RL, followed by SFT distillation into one mixed GRPO run.
- V4, Apr. 2026: ten-plus domain experts consolidated through MOPD.
The following figure (source) shows the DeepSeek evolution after V3, moving from SFT plus GRPO RL, to R1-style multi-stage reasoning RL, to hybrid think and non-think modeling, to specialist RL with SFT distillation, and finally to ten-plus experts merged with MOPD.

MiMo-V2-Flash

MiMo-V2-Flash Technical Report by the Xiaomi MiMo Team (2026) is a clean early articulation of MOPD as a post-training consolidation primitive.
The high-level recipe is:
- Stage 1: SFT to establish the general student.
- Stage 2: train several domain-specialist teachers, typically using SFT and RL on the relevant domains.
- Stage 3: consolidate those teachers into one student through MOPD.
This replaces a single monolithic RL run with a more modular workflow: specialists can be trained in parallel, and a final student can absorb their strengths through dense token-level feedback on its own trajectories.
The following figure (source) shows the MiMo-V2-Flash recipe as a three-stage MOPD pipeline: SFT data creates an SFT model, domain-specialized training produces teachers for search, code, math, reasoning, safety, and related domains, and Stage 3 uses student rollouts plus token-level teacher rewards to distill the specialists into one student.

Nemotron-Cascade 2

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) offers one of the clearest examples of distillation as a mid-pipeline stabilization point.
Its high-level recipe is:
\[\text{Base Model} \rightarrow \text{SFT} \rightarrow \text{Instruction-Following RL} \rightarrow \text{Multi-domain RL} \rightarrow \text{Multi-domain OPD} \\ \rightarrow \text{RLHF} \rightarrow \text{Long-context RL} \rightarrow \text{Code RL} \rightarrow \text{SWE RL}\]
The ordering is not arbitrary. The recipe begins with instruction-following RL to establish strict instruction adherence, moves to multi-domain RL for tool-calling, STEM reasoning, and format adherence, inserts multi-domain OPD to unify specialized expertise and recover regressions, and then continues with RLHF, long-context RL, code RL, and software-engineering RL.
The key lesson is that OPD can serve as a stabilization and regression-recovery stage inside a longer RL cascade. It is not only an alternative to RL; it can be placed between RL stages to consolidate what earlier stages learned before later stages specialize further.

Nemotron 3 Ultra

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales the specialist-teacher recipe to a large hybrid Mamba-attention MoE model and uses SFT, unified RLVR, MOPD, and reasoning budget control to produce a long-context, high-throughput agentic model.
Its high-level recipe can be summarized as:
\[\text{Base Model} \rightarrow \text{SFT} \rightarrow \text{Unified RLVR} \rightarrow \text{Specialist Teacher Training} \rightarrow \text{MOPD Warmup} \\ \rightarrow \text{Multi-teacher OPD} \rightarrow \text{MTP / inference-oriented boosting}\]
The SFT stage includes domain-specific synthetic and filtered data for terminal use, conversational tool use, software issue resolution, math and proof, science, code, multilingual behavior, long context, chat, and professional workplace tasks.
The RLVR stage spans many environments, including terminal usage, office and productivity workflows, software engineering, search, tool calling, math, code, STEM, safety, chat, instruction following, long-context QA, structured outputs, inductive and transductive reasoning, and general usability.
The MOPD stage then consolidates specialized teachers into the general student. A central empirical observation is that MOPD is most effective when the teacher’s advantage can be expressed as token-level preferences over trajectories the student can already sample. This is why warmup and support overlap matter: if the student rarely visits the reasoning paths that encode the teacher’s missing capability, token-level OPD has a weaker signal.
The following figure (source) shows the Nemotron 3 Ultra two-iteration MOPD recipe: a prepared RLVR student is distilled from multiple first-round teachers into Ultra MOPD 1, then refreshed teachers and reused teachers provide a second MOPD iteration that produces the Ultra final model.

MAI-Thinking-1

MAI-Thinking-1 uses a more conservative recipe that is closer to DeepSeek-R1-style staged RL than to V4-style MOPD.
The high-level recipe is:
- Start from a mid-trained base model.
- Train several specialist RL “climbs,” such as SWE or agentic, STEM, and helpfulness or safety climbs.
- Consolidate the specialist climbs through trace-distillation SFT.
- Run a final RL climb to produce MAI-Thinking-1.
The key distinction is that consolidation is performed by trace-distillation SFT rather than on-policy multi-teacher distillation. This makes it a useful contrast case: not every 2026 frontier-style recipe uses MOPD, even when it uses specialist teachers.
The following figure (source) shows the MAI-Thinking-1 recipe: a mid-trained model branches into SWE or agentic, STEM, and helpfulness or safety climbs, each producing a teacher; trace-distillation SFT consolidates these into one model, and a final climb produces MAI-Thinking-1.

Kimi K2.5

Kimi K2.5: Visual Agentic Intelligence by Moonshot AI et al. (2026) is an agentic and multimodal recipe centered on text-only SFT followed by joint text-vision RL across coding, vision, reasoning, and agentic tasks.
The public recipe discussion does not foreground MOPD. Instead, it emphasizes building an agentic multimodal system that can coordinate subagents and tools across tasks.
The following figure (source) shows the Kimi K2.5 agentic workflow: an orchestrator creates subagents, assigns tasks to AI researchers, physics researchers, life-sciences researchers, fact checkers, file downloaders, and web developers, then aggregates task results into final results.

GLM-5

GLM-5: from Vibe Coding to Agentic Engineering by GLM et al. (2026) uses staged RL by capability: Base, SFT, Reasoning RL, Agentic RL, and General RL.
The recipe is simpler to describe than many MOPD-heavy systems, but the diagram includes on-policy cross-stage distillation, where logits or weights from earlier post-training stages inform later stages. This makes GLM-5 a useful middle case: it is not framed as many-teacher MOPD in the same way as MiMo or Nemotron 3 Ultra, but it still uses on-policy distillation-like cross-stage knowledge transfer.
The following figure (source) shows the GLM-5 training recipe: pretraining on general and code or reasoning corpora, midtraining for long code, reasoning, long-context, and agent data, sparse-attention adaptation, followed by SFT, Reasoning RL, Agentic RL, General RL, and an on-policy cross-stage distillation block connected by logits and weights.

Distillation Inside Frontier Post-Training Recipes

Frontier post-training recipes usually combine multiple optimization stages rather than applying one distillation method in isolation. The classical three-stage RLHF pattern used supervised fine-tuning, reward modeling, and policy optimization. More recent reasoning-model recipes increasingly use larger RLVR stages, specialist teachers, staged domain curricula, trace distillation, and MOPD.
A useful high-level progression is:
- Classical RLHF recipe: SFT, reward model training, RLHF.
- Reasoning-focused RL recipe: SFT or mid-training, larger RLVR stages, reasoning trace refinement, sometimes with final SFT or trace distillation.
- Specialist-consolidation recipe: train several domain-specialist teachers, then merge their strengths into a single student through sequence-level distillation, trace-distillation SFT, or MOPD.
- 2026-style MOPD recipe: train many specialist teachers across domains, sample student rollouts, score those rollouts with the relevant teachers, and consolidate expertise through token-level on-policy losses.
Frontier post-training recipe review with Finbarr Timbers characterizes MOPD as a major emerging convergence point in frontier post-training, especially for recipes that must combine reasoning, coding, tool use, instruction following, and agentic behavior without allowing one domain’s RL to erase another’s capability.
This also reframes distillation as a recipe-management tool. The problem is not merely that the student should imitate a teacher; it is that a post-training organization may have many domain experts, many environments, many data pipelines, and many intermediate checkpoints, and distillation is the mechanism that converts those distributed gains into one deployable model.

Cascade RL and Multi-Domain OPD as a Foundation

Cascade RL is a modern foundation for understanding why distillation is needed after staged reinforcement learning. Instead of training all domains jointly, Cascade RL trains domains sequentially or in compatible multi-domain groups, making it easier to tune curricula, response lengths, verification costs, and reward functions.
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) applies SFT first, then a Cascade RL process that includes instruction-following RL, multi-domain RL, multi-domain OPD, RLHF, long-context RL, code RL, and software-engineering RL.
The central design principle is to mitigate inter-domain interference. Some tasks should be trained earlier because they establish broad priors, while others are specialized refinements. Some domains can be grouped when response formats and verification costs are similar, while conflicting domains should be separated to avoid destructive interference.
Multi-domain OPD then acts as a regression-recovery and capability-consolidation step. When later RL stages improve one domain but harm another, the strongest intermediate teacher for each domain can supervise the student on its own rollouts, helping recover benchmark regressions while retaining improvements from the broader Cascade RL process.
A simplified Cascade RL plus OPD formulation can be written as:
\[\theta_{d+1} =\operatorname{RL}_d(\theta_d)\]
- for domain-specific RL stage \(d\), followed by a consolidation step:
\[\mathcal{L}_{\text{multi-domain OPD}}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t} \sum_{k=1}^{K} w_k(x,\hat{y}_{<t}) D\left( p_{T_k}(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(T_k\) is a domain-specialist teacher and \(w_k\) routes or weights teacher feedback by domain, prompt, confidence, or checkpoint quality.
A sampled-token MOPD advantage can be written as:
\[a_t^{MOPD} = \log p_{T_k}(\hat{y}_t \mid s_t) - \log p_S^\theta(\hat{y}_t \mid s_t)\]
- where \(s_t=(x,\hat{y}_{<t})\) and the selected teacher \(T_k\) corresponds to the relevant domain. This form is attractive because it provides dense token-level feedback without requiring full-vocabulary teacher distributions at every prefix.
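The sampled-token advantage reduces to a per-token difference of log-probabilities once a teacher has been selected. The sketch below assumes hypothetical per-domain teacher scores for one student rollout and a simple dictionary lookup as the routing rule; real systems would obtain these scores from teacher inference servers.

```python
def sampled_token_advantages(teacher_logprobs, student_logprobs):
    """Per-token advantage: teacher log-prob minus student log-prob of the sampled token."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same sampled tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

# Hypothetical log-probs that each specialist assigns to the student's sampled tokens.
teacher_scores = {
    "math": [-0.1, -2.3, -0.4],
    "code": [-1.5, -0.2, -0.9],
}
student_logprobs = [-0.5, -0.7, -0.4]

domain = "math"  # routing assumption: the prompt's domain label selects the teacher
advantages = sampled_token_advantages(teacher_scores[domain], student_logprobs)
print(advantages)
```

A positive entry means the selected teacher finds the student's token more likely than the student did, and a negative entry flags a token the teacher would have avoided. Averaged over tokens sampled from the student, this quantity is the negative of the reverse KL between student and teacher at that prefix.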

Limitations of Classical Distillation

Despite its effectiveness, classical distillation suffers from several structural limitations:
- Distribution mismatch: The student is trained on fixed trajectories, such as ground-truth or teacher-generated sequences, but at inference it generates its own tokens. Errors compound because it encounters states not seen during training. This issue is highlighted in imitation learning literature and explicitly discussed in On-Policy Distillation of Language Models.
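- Teacher bias and over-smoothing: Forward KL is mode-covering, so it pushes the student to place mass on every teacher mode, sometimes leading to overly smooth or low-confidence outputs. The student also inherits whatever biases the teacher distribution encodes.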
- Capacity mismatch: If the student cannot represent the teacher distribution, minimizing forward KL may produce unrealistic samples or unstable behavior.
- Data inefficiency: Off-policy distillation may waste training effort on trajectories the student would never generate, reducing practical efficiency.
- Teacher staleness in offline KD: A frozen teacher cannot adapt to the student’s changing failure modes, which can limit the usefulness of teacher feedback late in training.
- Non-stationarity in online KD: A co-trained or evolving teacher can provide fresher supervision, but the target distribution changes over time, making optimization and reproducibility harder. Deep Mutual Learning by Zhang et al. (2017) shows that collaboratively trained peers can outperform a static-teacher setup, but the method also shifts distillation from a simple one-way transfer problem into a coupled multi-model optimization problem.
- Inter-domain interference: A model improved through RL or distillation in one domain can regress in another domain. Cascade-style post-training treats this as a recipe-ordering and consolidation problem rather than only a loss-design problem.
- Support mismatch in MOPD: Multi-teacher OPD works best when the student’s rollouts fall within the regions where the relevant teacher provides meaningful token preferences. If the teacher and student are too far apart, dense distribution matching can propagate noise rather than knowledge.
- Recipe complexity: As post-training moves from simple SFT, DPO, and RL stages to many teachers, many domains, and many environments, the limiting factor becomes not only the loss but the system’s ability to coordinate compute, data, environments, teacher checkpoints, and evaluation.

Implementation Considerations

In modern LLM systems, classical distillation requires careful engineering:
- Log-probability extraction: The teacher must provide token-level log-probabilities. This is often done via a separate inference server, for example vLLM-based systems, with batched requests and compressed logprob transmission.
- Top-\(k\) approximation: Full-vocabulary KL is expensive. Approximations using top-\(k\) tokens reduce memory and bandwidth requirements, especially for large vocabularies of roughly \(100{,}000\) tokens or more.
- Batching and caching: Efficient pipelines buffer student generations and batch teacher evaluations to amortize cost, enabling distillation even from 100B+ models at scale.
- Hybrid objectives: Many systems combine supervised fine-tuning, distillation, and reinforcement learning signals in a single pipeline.
- Offline execution pattern: Offline KD usually separates teacher inference from student optimization. Teacher completions, logits, hidden states, or labels can be precomputed, cached, audited, and reused across multiple student runs.
- Online execution pattern: Online KD requires coordination among multiple co-trained models or periodically refreshed teachers. This adds communication overhead, synchronization complexity, and non-stationary targets, but it can provide more adaptive supervision.
- Semi-online compromise: Semi-online systems periodically refresh teacher checkpoints or add shadow teachers while preserving some stability of offline KD. Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) studies this intermediate regime and frames it as a bridge between static offline transfer and fully online knowledge exchange.
- Domain-wise recipe ordering: Cascade-style training requires decisions about which domain to optimize first, which domains can be trained together, and where distillation should be inserted to recover regressions. Nemotron-Cascade 2 by Yang et al. (2026) places multi-domain OPD after earlier RL stages to unify specialist expertise before continuing with additional alignment, long-context, coding, and software-engineering RL.
- Environment and verifier design: RL-integrated distillation depends on the availability of reliable environments, verifiers, or reward signals. Agentic and tool-use tasks require not only prompts but also executable environments, test harnesses, tool protocols, and reward computation.
- SFT mixture construction: Large SFT mixtures should often be balanced by token budget rather than example count because response lengths vary drastically across domains. Long reasoning traces, tool-integrated reasoning, short chat answers, and software-engineering trajectories can otherwise dominate or vanish unintentionally.
- Deduplication and filtering: Teacher-generated data must be filtered for correctness, format compliance, tool-call hygiene, duplication, and undesirable behavioral patterns. For coding and agentic tasks, test cases, online-judge systems, trajectory analyzers, and LLM judges can provide quality filters before distillation.
- Teacher release and reproducibility: Multi-teacher OPD is difficult to study if only the final student is released. Intermediate teachers and starting checkpoints are valuable research artifacts because they make it possible to analyze support overlap, teacher-student divergence, recovery rates, and the marginal value of each distillation stage.
- Organizational scaling: Frontier post-training is also an organizational problem. A recipe with many specialist teachers, RL environments, verifiers, and data pipelines requires teams to coordinate compute, data production, evaluation, and model release decisions. Distillation is often the final mechanism that turns that distributed organizational work into a single deployable model.
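The top-\(k\) approximation in the list above can be sketched as follows. This is a minimal illustration rather than the method of any cited paper: it assumes the teacher and student next-token distributions are available as plain Python lists, keeps the teacher's \(k\) most likely tokens, and collapses all remaining probability mass into a single "other" bucket. The names `kl` and `topk_kl` are hypothetical.

```python
import math

def kl(p, q):
    """Forward KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def topk_kl(teacher, student, k):
    """KL over the teacher's top-k tokens plus one bucket for the remaining mass."""
    top = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)[:k]
    p = [teacher[i] for i in top]
    q = [student[i] for i in top]
    # Collapse everything outside the top-k into a single "other" bucket.
    p.append(max(0.0, 1.0 - sum(p)))
    q.append(max(1e-12, 1.0 - sum(q)))  # floor avoids division by zero
    return kl(p, q)

# Toy 5-token vocabulary (assumed numbers, for illustration only).
teacher = [0.70, 0.20, 0.05, 0.03, 0.02]
student = [0.50, 0.30, 0.10, 0.05, 0.05]
full = kl(teacher, student)
approx = topk_kl(teacher, student, k=2)
```

Merging tokens into one bucket can only discard information, so the bucketed value never exceeds the full-vocabulary KL; it behaves as a lower bound that tightens as \(k\) grows.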

Offline Distillation

Offline distillation is the classical and historically dominant form of knowledge distillation. In offline distillation, the teacher model is trained beforehand and then frozen. The student is subsequently optimized to imitate this fixed teacher using either precomputed teacher outputs or teacher evaluations generated during training. Because the teacher does not change, the supervision signal is stationary, which makes offline distillation stable, reproducible, and comparatively simple to implement.
The main idea to carry through this section is that “offline” refers to teacher dynamics rather than data storage. A teacher may be queried live during training and still be offline if its parameters are frozen. Conversely, an offline corpus may contain highly sophisticated synthetic artifacts, including reasoning traces, critiques, tool-use transcripts, and verifier-filtered solutions. Thus, offline distillation is not necessarily simple in content, even when the optimization setup is simple.
Offline distillation remains the default starting point for most practical pipelines because it is easy to audit, cache, reuse, and scale. It is especially useful when the goal is broad capability transfer, cold-starting a student before RL, compressing frontier behavior into a cheaper model, or consolidating specialist outputs through trace-distillation SFT.
Its main limitation is that, unless it is combined with student-generated rollouts later, the student trains on trajectories produced by someone else. This can create train-inference mismatch: the student may behave well on teacher-written prefixes but fail to recover when its own early tokens move the sequence into unfamiliar states.
Most of the literature traditionally referred to as “knowledge distillation” implicitly assumes an offline setting. Distilling the Knowledge in a Neural Network by Hinton et al. (2015) is the canonical example: a pretrained ensemble or large model produces softened probability distributions that supervise a smaller student. DistilBERT by Sanh et al. (2019) similarly uses a frozen BERT teacher to train a compact transformer.

Definition

Let \(p_T\) denote a pretrained teacher and \(p_S^\theta\) the student. Offline distillation optimizes:
\[\mathcal{L}_{\text{offline}}(\theta) =\mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t} D\left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x, y_{<t}) \right) \right]\]
- where:
  - \(\mathcal{D}\) is a fixed or externally generated dataset.
  - \(p_T\) is frozen throughout training.
  - \(D\) is typically forward KL, reverse KL, JSD, or cross-entropy.
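The defining property is that the teacher parameters remain constant: the teacher's outputs are treated as fixed targets (a stop-gradient), so the loss sends no gradient to the teacher: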
\[\nabla_\phi \mathcal{L}_{\text{offline}} = 0\]
- where \(\phi\) denotes teacher parameters.
Offline distillation can use either hard targets or soft targets. In hard-target offline distillation, the teacher produces a sequence \(y_T\) and the student maximizes likelihood:
\[\mathcal{L}_{\text{hard-offline}}(\theta) =\mathbb{E}_{x} \left[ -\log p_S^\theta(y_T \mid x) \right]\]
- where:
\[y_T \sim p_T(\cdot \mid x)\]
In soft-target offline distillation, the teacher provides token-level probability distributions:
\[\mathcal{L}_{\text{soft-offline}}(\theta) =\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{t} D_{KL} \left( p_T(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right)\]
The hard-target version is cheaper and easier to store; the soft-target version preserves uncertainty, alternative continuations, and richer teacher preferences.
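A small worked example makes the difference between the two objectives concrete. The sketch below assumes a three-token vocabulary and hand-picked probabilities (all numbers and function names are hypothetical). Two students assign the same probability to the teacher's sampled token, so the hard-target loss cannot tell them apart, while the soft-target loss penalizes the one that misplaces the remaining mass.

```python
import math

def hard_target_loss(student_probs, teacher_token):
    """Negative log-likelihood of the single token the teacher emitted."""
    return -math.log(student_probs[teacher_token])

def soft_target_loss(teacher_probs, student_probs):
    """Forward KL(p_T || p_S) over the full next-token distribution."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

teacher = [0.55, 0.40, 0.05]
student_a = [0.55, 0.40, 0.05]  # matches the teacher everywhere
student_b = [0.55, 0.05, 0.40]  # same mass on token 0, wrong on the alternatives

teacher_token = 0  # the hard target: the teacher's most likely token

hard_a = hard_target_loss(student_a, teacher_token)
hard_b = hard_target_loss(student_b, teacher_token)
soft_a = soft_target_loss(teacher, student_a)
soft_b = soft_target_loss(teacher, student_b)
```

Here `hard_a` and `hard_b` are identical, whereas `soft_a` is zero and `soft_b` is strictly positive, which is the sense in which soft targets preserve information about alternative continuations.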

Relationship to Off-Policy and On-Policy Distillation

Offline distillation and off-policy distillation are closely related but not identical concepts.
- Offline vs. online: whether the teacher is frozen or co-trained.
- Off-policy vs. on-policy: where trajectories come from.
Most offline distillation is also off-policy, because the student trains on fixed human or teacher-generated sequences. However, offline distillation can also be on-policy if a frozen teacher evaluates rollouts generated by the current student. This is precisely the setup used in many modern on-policy LLM distillation methods, including On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024), where the teacher remains fixed but the trajectory distribution changes over time.
Thus, “offline” refers to teacher dynamics, while “on-policy” refers to data dynamics.
This distinction matters for modern post-training recipes. A pipeline may begin with offline, off-policy SFT on teacher-generated traces, proceed to RL on student-generated rollouts, and later use frozen specialist teachers for on-policy multi-teacher distillation. The teacher can be frozen in both the SFT and MOPD stages, while the trajectory source changes from external teacher traces to student rollouts.

Common Forms

Offline distillation encompasses many of the most widely used distillation approaches:
- Soft-label distillation: The teacher provides full probability distributions over classes or tokens, often softened with temperature scaling.
- Sequence-level distillation: The teacher generates complete outputs that become training targets, as introduced in Sequence-Level Knowledge Distillation by Kim and Rush (2016).
- Trace-distillation SFT: The teacher generates full reasoning traces, tool-use transcripts, or specialist trajectories, and the student is trained by supervised fine-tuning on those traces.
- Representation distillation: The student matches hidden states, embeddings, attention maps, or intermediate activations.
- Preference and reward distillation: The teacher provides rankings, scalar rewards, critiques, or pairwise preferences rather than direct logits.
- Synthetic-data distillation: The teacher produces answers, explanations, code, tool traces, multi-turn conversations, or benchmark-style solutions that become a reusable training corpus.
- Precomputed vs. live teacher querying: Offline distillation does not require that teacher outputs be fully precomputed.
  - Precomputed offline distillation: Teacher outputs are generated once and stored.
  - Live offline distillation: The frozen teacher is queried during training, but its parameters remain unchanged.
  - Both are considered offline because the teacher itself is static.

Trace-Distillation SFT

Trace-distillation SFT is one of the most important modern forms of offline distillation. Instead of matching only the teacher’s final answer, the student learns from the teacher’s full trajectory:
\[y_T = (r_T, a_T)\]
- where \(r_T\) may contain reasoning steps and \(a_T\) may contain a final answer, tool call, patch, proof, or action sequence.
The objective is ordinary supervised likelihood:
\[\mathcal{L}_{\text{trace-SFT}}(\theta) =\mathbb{E}_{(x,y_T)\sim\mathcal{D}_T} \left[ -\sum_t \log p_S^\theta(y_{T,t} \mid x,y_{T,<t}) \right]\]
The simplicity of this objective is the strength of trace distillation: once the teacher traces are generated and filtered, student training is just supervised fine-tuning.
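As a minimal sketch of this objective, the function below sums the token-level negative log-likelihood over the teacher trace only, conditioning each step on the prompt and the teacher's earlier tokens. The student is modeled as a hypothetical callable `next_token_probs(prefix)` that returns a token-to-probability mapping; a real student would be a neural network, and the uniform toy student exists only so the example runs.

```python
import math

def trace_sft_loss(next_token_probs, prompt, trace):
    """Sum of -log p_S(y_t | x, y_<t) over teacher-trace tokens only."""
    prefix = list(prompt)
    loss = 0.0
    for token in trace:
        probs = next_token_probs(prefix)
        loss -= math.log(probs[token])
        prefix.append(token)  # teacher forcing: condition on the teacher's token
    return loss

def uniform_student(prefix, vocab=("think", "step", "answer", "42")):
    """Toy stand-in for a student: uniform over a tiny vocabulary."""
    return {tok: 1.0 / len(vocab) for tok in vocab}

prompt = ["what", "is", "6*7", "?"]          # x (no loss on prompt tokens)
trace = ["think", "step", "answer", "42"]    # y_T = (r_T, a_T)
loss = trace_sft_loss(uniform_student, prompt, trace)
```

With a uniform student over four tokens, each trace token costs \(\log 4\), so the loss for this four-token trace is \(4 \log 4\).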
Trace distillation is also a natural consolidation mechanism after specialist RL. A domain-specialist model can run on prompts from its domain, produce high-quality traces, and create a dataset that trains a single general student. This is the offline counterpart to MOPD: both consolidate specialists, but trace-distillation SFT uses teacher-generated trajectories, while MOPD uses student-generated trajectories scored by teachers.
MAI-Thinking-1 is a useful example of a modern recipe that consolidates specialist RL climbs using trace-distillation SFT rather than MOPD. The high-level pattern is:
\[\text{Mid-trained Model} \rightarrow \text{Specialist RL Climbs} \rightarrow \text{Trace-Distillation SFT} \rightarrow \text{Final RL Climb}\]
This makes trace-distillation SFT a conservative but robust way to merge specialist behavior when the infrastructure or support-overlap conditions for MOPD are not yet mature.

Offline Distillation in Synthetic SFT Pipelines

Many modern SFT stages are offline distillation pipelines even when they are not described that way. A teacher generates synthetic responses, the responses are filtered or curated, and the student is trained by supervised learning on the resulting corpus.
A typical synthetic SFT distillation workflow is:
- Collect prompts from target domains.
- Use strong teachers to generate candidate responses, reasoning traces, tool calls, or multi-turn conversations.
- Filter outputs with verifiers, unit tests, LLM judges, reward models, format checkers, or human review.
- Deduplicate examples and remove low-quality or unsafe artifacts.
- Train the student on the retained traces with cross-entropy.
In code and math domains, filtering is especially important because correctness can often be verified. For example, a coding teacher may generate multiple candidate solutions; examples are retained only if they pass tests or satisfy an execution-based verifier.
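The execution-based filter described above can be sketched in a few lines. This is an illustration under simplifying assumptions: candidates are Python source strings that should define a function named `solve` (a hypothetical convention), tests are `(args, expected)` pairs, and code is run with a bare `exec`. A production pipeline would run untrusted teacher output in a sandbox with time and memory limits.

```python
def passes_tests(source, tests, func_name="solve"):
    """Return True if the candidate defines `func_name` and passes every test."""
    namespace = {}
    try:
        exec(source, namespace)  # NOTE: sandbox untrusted code in real pipelines
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

def filter_candidates(candidates, tests):
    """Drop exact duplicates, then keep only candidates that pass the tests."""
    kept, seen = [], set()
    for source in candidates:
        key = source.strip()
        if key in seen:
            continue
        seen.add(key)
        if passes_tests(source, tests):
            kept.append(source)
    return kept

candidates = [
    "def solve(a, b):\n    return a + b\n",
    "def solve(a, b):\n    return a - b\n",  # wrong answer
    "def solve(a, b):\n    return a + b\n",  # exact duplicate
    "def solve(a, b)\n    return a + b\n",   # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
kept = filter_candidates(candidates, tests)
```

Only the first candidate survives: the second fails a test, the third is a duplicate, and the fourth does not compile.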
In long-context, chat, and instruction-following domains, filtering is more difficult because correctness is less binary. These datasets often rely on a mixture of teacher selection, LLM judging, preference scoring, length constraints, formatting checks, and manual audits.
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) illustrates this at scale: its SFT stage uses teacher-generated and filtered data across math, code reasoning, science, long context, general chat, instruction following, safety, tool use, and software-engineering-like tasks before later RL and multi-domain OPD.
This shows that offline distillation is often the first stage of a much larger post-training recipe. It establishes broad capability and formatting behavior, while later RL or OPD stages improve robustness, recover regressions, or teach the student how to behave on its own rollouts.

Offline Distillation vs. On-Policy Consolidation

The key distinction between offline trace distillation and on-policy distillation is the trajectory distribution.
Offline trace distillation uses:
\[y \sim p_T(\cdot \mid x)\]
- and minimizes:
\[-\log p_S^\theta(y \mid x)\]
On-policy distillation uses:
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\]
- and trains the student from teacher feedback on \(\hat{y}\).
Offline trace distillation answers the question: “What would the teacher do on this prompt?”
On-policy distillation answers the question: “Given what the student actually did, how would the teacher score or correct each token?”
This distinction becomes central in long-horizon tasks. In short tasks, teacher-generated traces may be close enough to what the student would produce. In agentic, coding, tool-use, or reasoning tasks with thousands of tokens, small early deviations can create states that never appear in the offline dataset.
A practical training pattern is therefore:
\[\text{Synthetic / Trace SFT} \rightarrow \text{RL or RLVR} \rightarrow \text{OPD or MOPD}\]
- where offline distillation gives the student a strong prior and on-policy methods later correct its self-generated failure modes.
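The contrast between the two trajectory sources can be illustrated with a toy example. The sketch below assumes context-free teacher and student distributions over three tokens (real models condition on the prefix), and it uses the sampled-token log-ratio \(\log p_S - \log p_T\) as the per-token teacher feedback, which is one possible choice rather than the signal used by any specific paper. All names and numbers are hypothetical.

```python
import math
import random

VOCAB = ["a", "b", "c"]
TEACHER = {"a": 0.7, "b": 0.2, "c": 0.1}  # hypothetical context-free teacher
STUDENT = {"a": 0.3, "b": 0.3, "c": 0.4}  # hypothetical context-free student

def sample(policy, length, rng):
    """Draw `length` tokens independently from a context-free policy."""
    weights = [policy[t] for t in VOCAB]
    return rng.choices(VOCAB, weights=weights, k=length)

def offline_trace_loss(length, rng):
    """y ~ p_T, then -log p_S(y): the student never sees its own samples."""
    y = sample(TEACHER, length, rng)
    return y, -sum(math.log(STUDENT[t]) for t in y)

def on_policy_feedback(length, rng):
    """y_hat ~ p_S, then the teacher scores each token the student chose."""
    y_hat = sample(STUDENT, length, rng)
    # Positive value: the student over-weights this token relative to the teacher.
    per_token = [math.log(STUDENT[t]) - math.log(TEACHER[t]) for t in y_hat]
    return y_hat, per_token

rng = random.Random(0)
y, nll = offline_trace_loss(8, rng)
y_hat, penalties = on_policy_feedback(8, rng)
```

In the offline case the loss is computed on tokens the teacher chose; in the on-policy case every feedback value is attached to a token the student itself produced, which is what lets the signal reach states the teacher would rarely visit.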

Advantages

- Stability: The target distribution does not change during training.
- Reproducibility: Repeated runs see identical teacher behavior if the dataset and teacher outputs are fixed.
- Engineering simplicity: Teacher and student optimization are decoupled.
- Caching efficiency: Teacher outputs can be stored and reused.
- Scalability: Large teachers can supervise many student experiments.
- Auditability: Datasets can be inspected, filtered, deduplicated, benchmarked, and versioned before training.
- Parallelizable data generation: Multiple teachers or teacher servers can generate data independently before the student training run begins.
- Compatibility with standard training stacks: Once the data is created, the student can be trained with ordinary SFT infrastructure.
- Useful cold start: Offline distillation often brings the student close enough to a target behavior that later RL or OPD can work efficiently.

Limitations

- Teacher staleness: The teacher cannot adapt to the student’s evolving weaknesses.
- Potential distribution mismatch: If training trajectories are fixed, the student may not learn to recover from its own mistakes.
- Storage requirements: Precomputing token-level distributions can be expensive.
- Capability ceiling: The student is typically limited by the teacher’s performance and biases unless later RL, search, or reward extrapolation introduces new signal.
- Discarded uncertainty: Hard trace distillation trains on one teacher-selected output and may lose information about alternative valid continuations.
- Over-imitation: The student may imitate teacher style, verbosity, tool-use habits, or reasoning artifacts even when those artifacts are not causally useful.
- Long-horizon brittleness: In agentic settings, a student that deviates from the teacher trace may enter a state where the offline dataset provides little guidance.
- Filtering bias: Verifier-based filtering can overrepresent outputs that are easy to verify and underrepresent ambiguous but useful behaviors.
- Data-mixture sensitivity: If one domain has much longer traces or more examples, it can dominate the SFT objective unless mixture weights are carefully controlled.
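The data-mixture sensitivity noted above is easy to quantify. The sketch below uses two hypothetical domains with made-up trace lengths and compares the share of training tokens each domain receives under uniform per-example sampling with the share under per-example weights chosen to hit a target token budget. Function names are illustrative.

```python
def token_share(domains, example_weight):
    """Fraction of expected training tokens contributed by each domain."""
    mass = {
        name: example_weight[name] * sum(lengths)
        for name, lengths in domains.items()
    }
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}

def weights_for_token_budget(domains, target_share):
    """Per-example weight so each domain reaches its target share of tokens."""
    return {
        name: target_share[name] / sum(lengths)
        for name, lengths in domains.items()
    }

# Hypothetical corpus: many short chat answers, few long reasoning traces.
domains = {
    "chat": [50] * 900,
    "reasoning": [8000] * 100,
}

uniform = token_share(domains, {"chat": 1.0, "reasoning": 1.0})
balanced = token_share(
    domains,
    weights_for_token_budget(domains, {"chat": 0.5, "reasoning": 0.5}),
)
```

Under uniform per-example sampling, chat is 90% of the examples but only about 5% of the tokens (45,000 of 845,000), so the loss is dominated by reasoning traces; the token-budget weights restore an even split.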

Semi-Online Variants

Some systems partially relax the static-teacher assumption by periodically updating a teacher snapshot or ensemble. Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) studies this intermediate regime and argues that part of the online distillation advantage comes from reversed student-to-teacher transfer rather than only from simultaneous training.
Semi-online systems are useful when the fully online setup is too complex but a completely frozen teacher becomes stale. Common patterns include:
- Refreshing a teacher checkpoint every few training phases.
- Regenerating synthetic data after a student improves.
- Adding stronger teachers for domains where the student still fails.
- Using earlier student checkpoints as stabilizing teachers.
- Maintaining a shadow teacher or exponential-moving-average teacher.
In frontier recipes, semi-online behavior often emerges organically: teams train specialists, generate traces, improve the base student, regenerate data, and then repeat. Each distillation run still uses a frozen teacher, but because teachers and data are refreshed between runs, the overall recipe behaves like a semi-online process built from a sequence of offline distillation rounds.
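Of the patterns listed above, the exponential-moving-average teacher is the simplest to write down. The sketch below treats parameters as flat lists of floats and uses a hypothetical `ema_update` helper; the decay value is an arbitrary choice for illustration, not a recommendation.

```python
def ema_update(teacher, student, decay=0.99):
    """teacher <- decay * teacher + (1 - decay) * student, elementwise."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -2.0]  # held fixed here; in training it would change each step

for _ in range(100):
    teacher = ema_update(teacher, student, decay=0.9)
```

With a fixed student, the teacher after \(n\) updates equals \((1 - \text{decay}^n)\) times the student parameters, so it approaches the student smoothly instead of jumping at refresh boundaries.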

Offline Distillation in Modern LLM Training

In contemporary LLM pipelines, offline distillation is widely used for compressing frontier models into smaller deployable models, generating synthetic instruction and reasoning datasets, transferring capabilities after RL or alignment, and creating baseline models before on-policy fine-tuning.
The RLHF Book chapter Synthetic Data presents offline distillation as the first stage in a progression from static synthetic data generation to fully on-policy distillation and RL-integrated post-training.
Offline distillation is especially important in recipes where SFT acts as a cold start for later RL. Instead of using SFT as the final alignment stage, modern reasoning recipes often use it to establish the minimum competence needed for RLVR or on-policy distillation to produce useful gradients.
In reasoning-model recipes, offline distillation may serve several distinct roles:
- Cold-start SFT: Generate initial reasoning traces so RL can begin from a model that already follows the desired format.
- Rejection-sampling SFT: Sample many model outputs, filter for correctness, and train on the retained traces.
- Specialist trace distillation: Train a general student on traces from domain-specialist teachers.
- Regression recovery: Reintroduce capabilities degraded by a later RL stage using cached high-quality traces.
- Deployment compression: Transfer capabilities from a large teacher or ensemble into a smaller serving model.
The growing complexity of recent recipes also shows why offline distillation remains attractive. MOPD can be more powerful for student-specific errors, but it requires rollout generation, teacher scoring, routing, and careful support-overlap management. Offline distillation is cheaper to reproduce and simpler to study because the data distribution is fixed.

Implementation Pattern

- Select or train the teacher model: Begin with a strong model, ensemble, specialist checkpoint, or post-RL model whose behavior should be transferred into the student. In most offline settings, this teacher is already trained before the distillation run begins.
- Freeze the teacher parameters: Keep the teacher fixed throughout student training. This ensures that the supervision distribution remains stationary and that repeated student runs can be compared cleanly.
- Define the target domain mixture: Choose the domains, tasks, formats, and trace lengths the student should learn. For LLMs, this often includes math, code, science, chat, instruction following, tool use, long context, safety, and agentic workflows.
- Generate or query teacher outputs: Use the teacher to produce hard targets, soft probabilities, hidden-state targets, critiques, preferences, or reward-like annotations. These outputs may be generated once before training or queried live during student optimization.
- Filter and verify outputs: Remove incorrect, unsafe, malformed, repetitive, leaked, or low-quality examples. For code, use tests or execution-based filters; for math, use answer checkers or proof validators where possible; for chat and instruction following, use rubrics, judges, and audits.
- Deduplicate and balance the corpus: Apply exact and fuzzy deduplication, then balance mixture weights by tokens, examples, domains, and trace lengths. Long reasoning traces can dominate the loss if mixture design is not handled carefully.
- Store targets or log-probabilities when useful: For large-scale systems, cache teacher completions, token IDs, top-\(k\) log-probabilities, embeddings, or preference labels so they can be reused across multiple student runs. Full-vocabulary logits are usually expensive, so practical systems often store compressed targets.
- Train the student to match the teacher: Optimize the student with the appropriate matching loss, such as cross-entropy on teacher-generated tokens, KL divergence on teacher probabilities, MSE on hidden states, or a hybrid objective combining supervised and distillation losses.
- Evaluate both target performance and regressions: Measure the student against task benchmarks, teacher agreement, latency, memory footprint, and regression suites. Offline distillation can improve targeted domains while quietly degrading unrelated behavior if the data mixture is narrow.
- Iterate when needed: If the student underperforms, improve the teacher data, change the divergence, adjust the temperature, increase top-\(k\) coverage, rebalance mixture weights, add stronger filters, or introduce on-policy rollouts.
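The storage trade-off behind the caching step can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical corpus of one billion target tokens, a 100,000-token vocabulary, 2-byte floats, and 4-byte token IDs; the choice of \(k = 64\) is likewise an arbitrary illustration.

```python
def teacher_cache_bytes(num_tokens, entries_per_token, bytes_per_entry):
    """Raw size of a teacher cache, ignoring compression and metadata."""
    return num_tokens * entries_per_token * bytes_per_entry

NUM_TOKENS = 1_000_000_000  # hypothetical corpus size in target tokens
VOCAB_SIZE = 100_000

# Full-vocabulary logits: one 2-byte float per vocabulary entry per token.
full = teacher_cache_bytes(NUM_TOKENS, VOCAB_SIZE, 2)

# Top-k targets: a 2-byte log-probability plus a 4-byte token ID per entry.
topk = teacher_cache_bytes(NUM_TOKENS, 64, 2 + 4)

# Hard targets: a single 4-byte token ID per position.
hard = teacher_cache_bytes(NUM_TOKENS, 1, 4)

TB, GB = 10 ** 12, 10 ** 9
summary = (full / TB, topk / GB, hard / GB)  # (200.0 TB, 384.0 GB, 4.0 GB)
```

Under these assumptions, full logits need about 200 TB, top-64 targets about 384 GB, and hard targets about 4 GB, which is why compressed targets are the usual choice for cached soft supervision.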

Practical Use Cases

- Model compression: A large teacher transfers its behavior into a smaller or cheaper student.
- Serving-cost reduction: Offline distillation is attractive when a model will be served many times and a one-time distillation cost can be amortized over high-volume inference.
- Specialist-to-general transfer: Domain-specific teachers produce traces that are merged into a single general student.
- Synthetic instruction tuning: Strong teachers generate instruction-response pairs for broad SFT.
- Reasoning cold start: Teacher traces teach a student to produce structured reasoning before RLVR.
- Tool-use bootstrapping: Tool-using teachers generate trajectories that teach a student basic API, terminal, browser, or code-editing behavior.
- Safety and refusal behavior: Safety teachers generate refusal, redirection, and boundary-setting examples.
- Long-context behavior: Long-context teachers generate or annotate examples that teach retrieval, summarization, and cross-document reasoning.

When Offline Distillation is Preferred

Offline distillation is usually the best first choice when stability, simplicity, auditability, and reusable data matter more than exact train-inference matching.
It is especially preferred when:
- High-quality teacher traces are already available.
- Teacher inference is expensive and should be amortized.
- The target tasks are short or have limited trajectory drift.
- The student needs a cold start before RL or OPD.
- The team wants a reproducible baseline before introducing on-policy systems.
- Data quality, filtering, and auditability are more important than adaptivity.
- Multiple students or ablations will reuse the same teacher corpus.
Offline distillation remains the dominant starting point for practical distillation pipelines, even when the final system later adds RL, on-policy distillation, self-distillation, or multi-teacher consolidation.

Online Distillation

Online distillation generalizes the teacher-student paradigm by allowing the teacher signal to evolve during training rather than remain fixed. Instead of relying exclusively on a pretrained, frozen teacher, online distillation trains multiple models simultaneously or refreshes teacher-like supervisors during optimization. The supervision distribution is therefore non-stationary and can adapt as the participating models improve.
The key idea to carry through this section is that online distillation is about teacher dynamics, not trajectory source. A method is online if the teacher distribution changes during the student’s training. It can still be off-policy if peers exchange predictions on fixed minibatches, and it can be on-policy if co-evolving models score rollouts generated by current policies.
Online distillation is attractive because frozen teachers can become stale relative to the student’s changing weaknesses and strengths. By allowing teachers and students to co-evolve, online distillation can provide more adaptive supervision, improve generalization, and in some cases outperform both conventional offline distillation and independently trained models.
At the same time, strict online distillation is harder to operate than offline distillation. It introduces non-stationary targets, synchronization overhead, and coupled failure modes. This is why many frontier LLM recipes use semi-online variants instead: specialists are trained or refreshed across stages, but each distillation phase still uses frozen teacher checkpoints.
This distinction is especially important for interpreting recent LLM post-training recipes. MOPD is often on-policy in trajectory source because the student generates the rollouts, but it is usually offline in teacher update pattern because the teachers are frozen during the distillation run. When recipes refresh teachers across rounds, as in multi-round MOPD, they become semi-online at the recipe level rather than fully online at the gradient-step level.
The canonical example is Deep Mutual Learning by Zhang et al. (2017), which trains peer networks jointly and minimizes KL divergence between their predictive distributions. Each model acts simultaneously as both student and teacher, and all participants improve through reciprocal supervision. More recent approaches such as co-distillation, checkpoint refresh, and population-based post-training extend this idea to larger ensembles and distributed training systems.

Definition

Suppose there are \(K\) models with parameters \(\{\theta_k\}_{k=1}^K\). For model \(i\), the online distillation objective can be written as:
\[\mathcal{L}_i(\theta_i) =\mathcal{L}_{\text{task}}(\theta_i) +\lambda \sum_{j \neq i} D\left( p_j^{\theta_j}(\cdot \mid x) \,\Vert\, p_i^{\theta_i}(\cdot \mid x) \right)\]
- where:
  - \(\mathcal{L}_{\text{task}}\) is the primary supervised, reinforcement learning, or hybrid objective.
  - \(D\) is a divergence such as KL or JSD.
  - \(\lambda\) controls the strength of mutual supervision.
  - all models update concurrently or are refreshed during the training process.
Unlike offline distillation, the teacher distributions \(p_j^{\theta_j}\) evolve throughout training:
\[\nabla_{\theta_j}\mathcal{L}_j \neq 0\]
- for participating teacher-like models.
A time-indexed view makes the distinction explicit. At training step \(u\), model \(i\) may receive supervision from peer or teacher state \(p_j^{\theta_j(u)}\):
\[\mathcal{L}_i^{(u)} =\mathcal{L}_{\text{task}}^{(u)} +\lambda \sum_{j\neq i} D\left( p_j^{\theta_j(u)}(\cdot \mid x) \,\Vert\, p_i^{\theta_i(u)}(\cdot \mid x) \right)\]
In strict online distillation, the teacher state can change every step or every small number of steps. In semi-online distillation, the teacher state changes at coarser refresh boundaries:
\[p_T^{(r)} =p_T^{\phi_r} \quad\text{for refresh round } r\]
- and the student is trained against \(p_T^{(r)}\) until the next refresh.
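The per-model objective defined above can be evaluated directly for a small set of peers. The sketch below assumes a classification-style setting with a single labeled example, uses negative log-likelihood as \(\mathcal{L}_{\text{task}}\) and forward KL as \(D\), and represents each peer by its predicted distribution. The function names and numbers are hypothetical.

```python
import math

def kl(p, q):
    """Forward KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mutual_losses(peer_probs, label, lam=1.0):
    """L_i = task NLL + lam * sum over j != i of KL(p_j || p_i), for every peer i."""
    losses = []
    for i, p_i in enumerate(peer_probs):
        task = -math.log(p_i[label])
        mimic = sum(kl(p_j, p_i) for j, p_j in enumerate(peer_probs) if j != i)
        losses.append(task + lam * mimic)
    return losses

peers = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
]
losses = mutual_losses(peers, label=0, lam=1.0)
task_only = mutual_losses(peers, label=0, lam=0.0)
```

Setting `lam=0.0` recovers independent training, and peers that already agree contribute no mimicry term, so the mutual-supervision signal is strongest while the models disagree.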

Relationship to Offline, Off-Policy, and On-Policy Distillation

Online vs. offline describes whether the teacher changes during training. Off-policy vs. on-policy describes where the training trajectories originate.
This yields four conceptually distinct combinations:
- Offline + Off-Policy: Classical KD using a frozen teacher and fixed teacher or dataset trajectories.
- Offline + On-Policy: Modern OPD, where a frozen teacher scores student-generated rollouts.
- Online + Off-Policy: Peer models exchange predictions on a fixed dataset or minibatch stream.
- Online + On-Policy: Multiple co-evolving models generate and score their own trajectories, potentially sharing rollouts and dense feedback.
Most historical online distillation methods are online and off-policy because they operate on shared minibatches. Emerging RL and LLM systems increasingly explore online and on-policy hybrids, where co-trained models evaluate trajectories sampled from their current policies.
Many frontier LLM recipes occupy the semi-online region rather than the fully online region. Domain specialists may be trained in parallel, selected as teachers at checkpoint boundaries, and refreshed across rounds. During a particular distillation run, however, each teacher is usually frozen. Thus, the recipe is adaptive over stages, but not necessarily online at every learner update.
This matters because the terms “online,” “on-policy,” and “MOPD” are often conflated. A multi-teacher OPD run can be:
- on-policy because the student generates rollouts;
- offline because the selected teachers are frozen during that run;
- semi-online because the broader recipe refreshes teachers or runs multiple MOPD iterations.
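Because the two axes are independent, a distillation run can be labeled mechanically once its teacher-update pattern and trajectory source are known. The helper below is a hypothetical sketch: the string values for `teacher_refresh` and `trajectory_source` are labels invented for this example, and "every_step" stands in for any refresh that happens inside the student's optimization loop.

```python
def classify(teacher_refresh, trajectory_source):
    """Map (teacher update pattern, trajectory source) to the two taxonomy axes."""
    teacher_axis = {
        "never": "offline",
        "between_rounds": "semi-online",
        "every_step": "online",
    }[teacher_refresh]
    # Only student-generated rollouts count as on-policy.
    data_axis = "on-policy" if trajectory_source == "student" else "off-policy"
    return teacher_axis, data_axis

classical_kd = classify("never", "teacher")
single_opd_run = classify("never", "student")
deep_mutual_learning = classify("every_step", "dataset")
multi_round_mopd = classify("between_rounds", "student")
```

A single MOPD run and a multi-round MOPD recipe differ only on the teacher axis: both are on-policy, but the former is offline and the latter is semi-online.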

Types

- Mutual Learning: Each model teaches every other model, as in Deep Mutual Learning.
- Co-Distillation: Large-scale training jobs periodically exchange predictions, logits, or checkpoints to improve convergence and robustness.
- Peer Ensembles: Multiple comparable models learn jointly and average or vote on predictions during training.
- Adaptive Teacher Distillation: A stronger model is periodically updated and continues to supervise one or more students.
- Checkpoint-Refresh Distillation: A teacher is replaced by a newer checkpoint after a training phase, producing a semi-online sequence of static-teacher distillation runs.
- Shadow Teacher Distillation: An auxiliary teacher is updated asynchronously while the main student trains, bridging offline stability and online adaptivity.
- Population-Based Distillation: A population of models with different objectives, domains, or hyperparameters exchanges knowledge during training.
- Specialist-Teacher Refresh: Domain-specific teachers are trained or refreshed across post-training stages, then used to supervise a general student. This is common in modern recipe-level MOPD systems even when the teachers are frozen during each individual distillation run.

Advantages

- Adaptive supervision: The teaching signal evolves with the models and can address newly emerging failure modes.
- Improved generalization: Peer learning often reduces overconfidence and improves calibration.
- No need for a single superior teacher: Comparable models can still benefit from teaching one another.
- Regularization effects: Mutual agreement acts as a strong inductive bias.
- Compatibility with distributed systems: Large training clusters can exchange logits or checkpoint summaries during optimization.
- Continual improvement: Teacher refreshes allow the supervision signal to track new data, new environments, or newly discovered failure modes.
- Capability preservation: Semi-online teacher pools can preserve older strengths while newer RL stages specialize the student.
- Organizational scalability: Domain-specialist teams can improve teachers in parallel, and periodic distillation can merge those improvements into a shared student.

Limitations

- Higher system complexity: Multiple models must be trained simultaneously, refreshed periodically, or synchronized across stages.
- Non-stationary targets: The supervision distribution changes over time, which can complicate optimization.
- Risk of consensus errors: If all participants share similar biases, they may reinforce incorrect behavior.
- Compute overhead: Training several models jointly can be significantly more expensive than using a single frozen teacher.
- Synchronization overhead: Online settings require careful coordination of model versions, checkpoints, batch assignments, and logging.
- Unclear credit assignment: When many peers or teachers improve together, it can be difficult to determine which supervision source caused which downstream gain.
- Distribution mismatch across teachers: If teachers are trained with very different data or recipes, their output distributions may be poorly aligned with each other or with the student.
- Recipe fragility: As post-training adds more teachers, environments, RL stages, and distillation loops, the recipe can collapse under its own complexity unless each stage is isolated, tested, and evaluated.

Semi-Online and Hybrid Approaches

Many practical systems combine offline and online strategies:
- Checkpoint refresh: A frozen teacher is periodically replaced by the latest strong checkpoint.
- Teacher ensembles: A static teacher is supplemented with co-trained peers or newer specialist checkpoints.
- Shadow teachers: Auxiliary teachers are updated asynchronously to provide fresher supervision.
- Progressive teacher staging: Rather than distilling only from a fully converged teacher, the student may be guided through a sequence of intermediate teacher checkpoints.
- Multi-round MOPD: A first MOPD round improves the student, then refreshed or reselected teachers provide a second round of supervision.
Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) analyzes these intermediate approaches and shows that reversed student-to-teacher transfer contributes significantly to online distillation’s effectiveness.
Semi-online methods are particularly important for LLMs because full online distillation across many large models is often too expensive. A practical compromise is:
\[\text{Train or refresh teachers} \rightarrow \text{freeze teachers} \rightarrow \text{distill student} \rightarrow \text{evaluate} \rightarrow \text{refresh teacher pool}\]
This turns an unstable fully online problem into a sequence of stable offline distillation phases, while still allowing the broader recipe to adapt over time.

Online Distillation in Modern LLM Training

Although most frontier LLM distillation remains offline at the teacher-update level, online principles appear increasingly often in:
- Multi-agent self-improvement systems.
- Self-play and debate frameworks.
- Checkpoint-based teacher refresh pipelines.
- Distributed co-training.
- Self-distillation with periodically updated teacher snapshots.
- Multi-round MOPD pipelines with refreshed specialist teachers.
- RL pipelines where teacher checkpoints are selected from evolving validation curves.
In large-scale post-training, a model may be supervised by:
- Specialist checkpoints trained on different domains.
- Recent versions of itself.
- Peer models in a shared optimization loop.
- Earlier checkpoints retained to prevent catastrophic forgetting.
- Domain teachers selected from the strongest validation checkpoint in a Cascade RL process.
This blurs the boundary between online distillation, self-distillation, checkpoint averaging, continual learning, and multi-teacher distillation.
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) is a useful semi-online example. Its teachers are selected from the Cascade RL pipeline by choosing the strongest validation checkpoint for each benchmark category. Because those teachers are derived from the same SFT initialization, they share the same tokenizer and reduce teacher-student distribution shift, while multi-domain OPD supplies dense token-level advantages compared with sparse outcome rewards.
The corresponding sampled-token MOPD advantage is:
\[a_t^{\text{MOPD}} = \log \pi_{\text{domain}_i}(y_t \mid s_t) - \log \pi_{\text{train}}(y_t \mid s_t)\]
- where \(s_t=(x,y_{<t})\) is the decoding state and \(\pi_{\text{domain}_i}\) is the selected domain teacher. This is not strict online distillation if the teacher is frozen during the update, but it is semi-online at the recipe level because teacher checkpoints come from an evolving Cascade RL process.
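The sampled-token advantage above is a per-token difference of log-probabilities, so it can be illustrated with a few lines of Python. This is a minimal sketch, not the Nemotron implementation: the function name `sampled_token_advantages` and the probability values are hypothetical, and a real system would read the log-probabilities from the teacher and student scoring engines.

```python
import math

def sampled_token_advantages(teacher_logprobs, student_logprobs):
    """Return a_t = log pi_teacher(y_t | s_t) - log pi_student(y_t | s_t) per token."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("both models must score the same sampled tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

# Hypothetical probabilities each model assigns to the tokens the student sampled.
teacher_lp = [math.log(0.50), math.log(0.10), math.log(0.90)]
student_lp = [math.log(0.25), math.log(0.40), math.log(0.90)]

advantages = sampled_token_advantages(teacher_lp, student_lp)
for t, a in enumerate(advantages):
    # Positive: the teacher prefers this token more than the student does.
    # Negative: the teacher considers the sampled token less likely than the student did.
    print(f"t={t} advantage={a:+.3f}")
```

A positive value pushes the student toward a token the teacher endorses, a negative value pushes it away, and a value near zero leaves the student unchanged at that position.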
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) shows another semi-online pattern: MOPD is run over multiple rounds, with a prepared RLVR student, a first teacher pool, refreshed teachers, and a second MOPD iteration. This resembles online distillation at the recipe level, but each MOPD phase is still best understood as frozen-teacher on-policy distillation.

Teacher Refresh and Support Overlap

A central challenge in semi-online and multi-teacher distillation is that teacher refreshes can increase capability while also increasing distribution mismatch. A stronger later teacher is not always a better distillation teacher if the student rarely visits the states where that teacher’s advantage is meaningful.
If a teacher and student are trained with substantially different SFT data or RL recipes, they may acquire different reasoning behaviors and induce different output distributions. Then student-generated trajectories can become out-of-distribution for the teacher, reducing the reliability of token-level teacher supervision.
A useful diagnostic is teacher-student local overlap at a student-visited state \(s_t\):
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k \pi_T(\cdot \mid s_t) \cap \operatorname{Top}_k \pi_S(\cdot \mid s_t) \right| }{k}\]
When overlap is high, full-distribution or top-\(k\) distillation is more likely to be useful. When overlap is low, sampled-token scoring or a warmup stage may be safer.
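The overlap diagnostic can be computed directly from the two next-token distributions at a student-visited state. The sketch below assumes the distributions are available as plain lists of probabilities over a shared vocabulary; the helper names and the toy numbers are illustrative only.

```python
def top_k_ids(probs, k):
    """Indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])

def top_k_overlap(teacher_probs, student_probs, k):
    """|Top_k(teacher) intersect Top_k(student)| / k at a single decoding state."""
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must share a vocabulary")
    if not 0 < k <= len(teacher_probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    shared = top_k_ids(teacher_probs, k) & top_k_ids(student_probs, k)
    return len(shared) / k

# Toy 5-token vocabulary (hypothetical values).
teacher = [0.50, 0.30, 0.10, 0.05, 0.05]
student = [0.40, 0.05, 0.35, 0.15, 0.05]

overlap_at_2 = top_k_overlap(teacher, student, k=2)
overlap_at_3 = top_k_overlap(teacher, student, k=3)
```

In practice this value would be averaged over many student-visited states and tracked per domain, since a single state says little about support overlap in general.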
In semi-online teacher refresh, one practical recipe is:
\[T^{(1)} \rightarrow S^{(1)} \rightarrow T^{(2)} \rightarrow S^{(2)} \rightarrow \cdots\]
- where the student is gradually brought closer to stronger teachers rather than being forced to match a distant final teacher in one step.
This progressive view is useful for MOPD. If a domain teacher has moved far from the student through additional SFT or RL, an intermediate teacher checkpoint can provide a smoother path. A brief warmup SFT on data drawn from the teacher’s training distribution can also increase the chance that student rollouts remain within teacher support.

Online Distillation vs. MOPD

Online distillation and MOPD overlap conceptually, but they are not the same. Online distillation asks, “does the teacher distribution change during training?” MOPD asks, “does the student generate the rollout?” and “do multiple teachers provide token-level feedback over that rollout?”
Therefore, a MOPD system can be categorized as:
- Offline MOPD: multiple frozen teachers score student-generated rollouts.
- Semi-online MOPD: frozen teachers are refreshed between MOPD rounds.
- Online MOPD: co-evolving teachers and student score each other during the same training process.
Most currently discussed MOPD recipes appear closer to offline or semi-online MOPD than to fully online MOPD. This is a practical systems choice: freezing teachers during a distillation run improves stability, logging, reproducibility, and fault isolation.
The broader post-training recipe, however, can still be adaptive. Teachers may be trained in parallel by different teams, selected from validation curves, refreshed after a first MOPD round, or replaced with new specialists as new domains are added.

Implementation Pattern

- Initialize multiple models or peers: Start with two or more models, which may differ in architecture, initialization, objective, or specialization.
- Train each model on the primary objective: Each participant optimizes its own supervised, RL, or hybrid loss.
- Exchange predictive distributions: At each step or periodically, models compute logits, hidden states, critiques, or token-level scores that are shared with other participants.
- Compute mutual distillation losses: Each model matches one or more peer distributions using KL divergence, JSD, sampled-token log-ratio losses, or related objectives.
- Update all participating online models: Gradients are applied to every model that is actively learning, so each model can act as both teacher and student.
- Freeze or snapshot teachers when needed: In semi-online variants, models are periodically frozen as teacher checkpoints before supervising a student.
- Synchronize or refresh periodically: In distributed systems, communication may occur asynchronously or at checkpoint boundaries rather than every minibatch.
- Track version provenance: Online and semi-online methods require careful tracking of which student checkpoint generated a rollout, which teacher checkpoint scored it, which tokenizer was used, and which reward or verifier version was active.
- Monitor teacher-student support overlap: When teachers and students diverge, full-distribution matching can become unreliable. Local overlap, KL, entropy, teacher advantage, rollout length, and truncation rates should be monitored.
- Evaluate both individual and ensemble performance: Assess whether joint learning improves standalone models, ensemble behavior, calibration, domain robustness, and regression recovery.
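The exchange and mutual-loss steps can be sketched for a single decoding state. The example below assumes one particular choice among the objectives listed above: each peer minimizes the average KL divergence from its peers' distributions to its own, with the peers' outputs treated as fixed targets for that step. The function names and the probability vectors are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mutual_distillation_losses(peer_probs):
    """For each peer i, average KL(p_j || p_i) over all other peers j."""
    if len(peer_probs) < 2:
        raise ValueError("mutual distillation needs at least two peers")
    losses = []
    for i, p_i in enumerate(peer_probs):
        others = [p_j for j, p_j in enumerate(peer_probs) if j != i]
        losses.append(sum(kl_divergence(p_j, p_i) for p_j in others) / len(others))
    return losses

# Two peers that agree incur no mutual loss; two that disagree both do.
agreeing = mutual_distillation_losses([[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]])
disagreeing = mutual_distillation_losses([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
```

In a real training loop each of these losses would be added to the corresponding peer's primary objective, and gradients would flow only into the peer acting as the student for that term.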

Systems Design Considerations

Online distillation introduces additional systems requirements beyond offline KD:
- Versioned checkpoints: Every teacher and student checkpoint must be traceable.
- Synchronized inference and training engines: Rollout generation and teacher scoring must agree on tokenization, chat templates, stop tokens, and log-probability conventions.
- Asynchronous queues: Rollouts and teacher scores may arrive at different speeds, requiring buffers and freshness policies.
- Staleness management: If a student changes too much between rollout generation and gradient update, the rollout may become off-policy relative to the current learner.
- Communication budget: Peer logits or top-\(k\) distributions can be large; sampled-token objectives reduce bandwidth.
- Fault isolation: A bad teacher refresh can corrupt the student if not evaluated before deployment into the distillation loop.
- Domain routing: Multi-teacher systems require prompt routing, teacher selection, or weight aggregation.
- Evaluation gates: Teacher refreshes should pass domain benchmarks and regression suites before being used as supervision.
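Versioning, asynchronous queues, and staleness management can be combined in one small data structure. The sketch below is an assumption about how such a buffer might look, not a description of any specific system: `Rollout`, `FreshRolloutBuffer`, and the `max_lag` freshness policy are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Rollout:
    prompt_id: str
    student_step: int      # student checkpoint step that generated the rollout
    teacher_version: str   # teacher checkpoint that scored it

class FreshRolloutBuffer:
    """FIFO buffer that discards rollouts generated too many student steps ago."""

    def __init__(self, max_lag):
        self.max_lag = max_lag
        self._queue = deque()

    def add(self, rollout):
        self._queue.append(rollout)

    def pop_fresh(self, current_step):
        """Return the oldest rollout within max_lag steps, dropping staler ones."""
        while self._queue:
            rollout = self._queue.popleft()
            if current_step - rollout.student_step <= self.max_lag:
                return rollout
        return None

buffer = FreshRolloutBuffer(max_lag=2)
buffer.add(Rollout("p1", student_step=10, teacher_version="T-v1"))
buffer.add(Rollout("p2", student_step=13, teacher_version="T-v1"))

# At step 14 the first rollout is 4 steps old and is dropped; the second is kept.
fresh = buffer.pop_fresh(current_step=14)
```

Storing the student step and teacher version on every rollout also gives the provenance record needed to trace a regression back to a specific checkpoint pair.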
A practical semi-online loop often looks like:
\[\text{train specialists} \rightarrow \text{select checkpoints} \rightarrow \text{freeze teachers} \rightarrow \text{distill student} \rightarrow \text{run regression suite} \\ \rightarrow \text{refresh teacher pool}\]
This is less theoretically elegant than fully online mutual learning, but it is easier to debug and more compatible with frontier-scale training infrastructure.
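The "select checkpoints" and "refresh teacher pool" steps of this loop can be sketched as a gated update. The example assumes each candidate teacher carries a domain benchmark score and a regression-suite score; the `TeacherCheckpoint` fields, the `refresh_pool` function, and the threshold are hypothetical and stand in for whatever evaluation a real pipeline uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TeacherCheckpoint:
    name: str
    domain: str
    domain_score: float       # score on the teacher's own domain benchmark
    regression_score: float   # score on a shared regression suite

def refresh_pool(pool, candidates, min_regression):
    """Swap in a candidate only if it clears the regression gate and beats the incumbent."""
    new_pool = dict(pool)  # never mutate the pool used by a running distillation phase
    for cand in candidates:
        incumbent = new_pool.get(cand.domain)
        passes_gate = cand.regression_score >= min_regression
        improves = incumbent is None or cand.domain_score > incumbent.domain_score
        if passes_gate and improves:
            new_pool[cand.domain] = cand
    return new_pool

pool = {"math": TeacherCheckpoint("math-v1", "math", 0.70, 0.95)}
candidates = [
    TeacherCheckpoint("math-v2", "math", 0.78, 0.96),  # better and passes the gate
    TeacherCheckpoint("code-v1", "code", 0.81, 0.80),  # strong but fails regressions
]
new_pool = refresh_pool(pool, candidates, min_regression=0.90)
```

Returning a new pool rather than editing the old one keeps each distillation phase tied to a fixed, frozen set of teachers.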

Online Distillation in the Broader Distillation Taxonomy

Online distillation occupies the teacher-update axis of the distillation taxonomy. It complements rather than replaces distinctions such as:
- Off-policy vs. on-policy.
- Single-teacher vs. multi-teacher.
- External-teacher vs. self-distillation.
- Supervised vs. RL-integrated training.
- Frozen-teacher OPD vs. refreshed-teacher OPD.
Conceptually, online distillation is best understood as adaptive teacher evolution, while on-policy distillation is best understood as adaptive trajectory generation. Modern systems increasingly combine both, but they should still be analyzed separately.
For practical LLM post-training, the most common pattern is not pure online distillation. It is semi-online capability consolidation: train or refresh specialists, freeze them, distill into a general student, evaluate regressions, then repeat. This preserves much of the stability of offline KD while allowing the teacher pool to evolve with the recipe.

Off-Policy Distillation

Off-policy distillation is the classical and still most widely used form of distillation. The student is trained on trajectories generated by an external source, such as human-labeled datasets, teacher-generated completions, synthetic reasoning traces, curated corpora, prior model logs, or specialist teacher rollouts, rather than on trajectories sampled from the student itself.
The central idea to carry through this section is that off-policy distillation is the most stable way to transfer broad capabilities, but it does not directly teach the student to recover from its own generation errors. It is therefore best understood as the default cold-start and broad-transfer stage in a larger post-training stack, often followed by RL, OPD, self-distillation, or multi-teacher consolidation.
Off-policy distillation remains attractive because data can be generated, filtered, audited, versioned, and reused before student training begins. This makes it operationally simpler than on-policy distillation, where rollouts and teacher scoring must happen inside a training loop. In practice, this stability is why many frontier recipes still rely heavily on synthetic SFT, rejection-sampling SFT, trace distillation, and teacher-generated tool-use data before moving to RL or MOPD.
The main limitation is train-inference mismatch. The student learns from trajectories produced by humans, teachers, or filters; at inference, it must condition on its own prior tokens. For short tasks, this mismatch may be mild. For long-horizon reasoning, coding, terminal use, browser use, or agentic workflows, small early deviations can move the student into states that never appear in the offline corpus.
Modern post-training recipes therefore use off-policy distillation as a foundation rather than as the entire recipe. Llama 3-style post-training uses reward models and rejection sampling to construct offline SFT and DPO data; DeepSeek-R1-style recipes use SFT to cold-start and refine reasoning behaviors after RL; MAI-Thinking-1-style recipes use trace-distillation SFT to consolidate specialist RL climbs; and Nemotron-style recipes use large synthetic and filtered SFT corpora before Cascade RL and MOPD.

Definition and Formal Objective

Given a dataset of input-output pairs:
\[\mathcal{D}=\{(x,y)\}\]
- where outputs \(y\) may come from humans, a teacher model, a specialist checkpoint, a synthetic data pipeline, or historical logs, the student minimizes a divergence between teacher and student token distributions:
\[\mathcal{L}_{\text{off-policy}}(\theta) =\mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \sum_{t=1}^{|y|} D\left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x, y_{<t}) \right) \right]\]
This is the standard supervised distillation objective described in Distilling the Knowledge in a Neural Network by Hinton et al. (2015) and generalized to autoregressive sequence models in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024).
The defining feature is that the trajectory distribution is external to the current student:
\[y \sim q_{\text{external}}(\cdot \mid x)\]
- rather than:
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\]
The external trajectory distribution may be written as a mixture:
\[q_{\text{external}}(y \mid x) =\sum_{k=1}^{K} \omega_k q_k(y \mid x)\]
- where each component \(q_k\) may correspond to human data, a frontier teacher, a specialist model, a rejection-sampling pipeline, a verifier-filtered corpus, a prior student checkpoint, or production interaction logs.
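Sampling from this mixture amounts to picking a source \(k\) with probability \(\omega_k\) and then drawing a trajectory from that source. The sketch below shows only the first step, using the standard-library `random.choices`; the source names and weights are hypothetical.

```python
import random

def sample_source(mixture_weights, rng):
    """Pick a data source k with probability proportional to its weight omega_k."""
    names = list(mixture_weights)
    weights = [mixture_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical mixture weights omega_k (they sum to 1 here, but need not).
mixture = {"human": 0.2, "frontier_teacher": 0.5, "rejection_sampling": 0.3}

rng = random.Random(0)  # fixed seed so the data order is reproducible
draws = [sample_source(mixture, rng) for _ in range(10_000)]
teacher_fraction = draws.count("frontier_teacher") / len(draws)
```

Over many draws the empirical share of each source approaches its weight, which is what makes the mixture weights a direct lever on the composition of the training stream.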

Sources of Off-Policy Data

Off-policy data can come from human-labeled datasets, teacher-generated synthetic data, verifier-filtered synthetic corpora, historical model outputs, specialist teacher traces, rejection-sampling traces, and tool-use transcripts. Human-labeled datasets provide expert-written instruction responses, translations, preference annotations, and curated reasoning traces. Teacher-generated synthetic data transfers capabilities through answers, chain-of-thought traces, critiques, and tool-use demonstrations. Filtered synthetic corpora retain only outputs that pass verifiers, reward models, execution tests, or additional teacher checks. Historical model outputs can be relabeled and refined by stronger models. Specialist teacher traces capture domain-specific expertise in math, code, safety, tool use, science, long context, software engineering, or agentic workflows. Rejection-sampling traces are sampled from a policy, ranked or filtered, and then reused as supervised targets. Tool-use transcripts record actions, observations, terminal commands, browser calls, Python execution traces, API calls, and final answers.
The RLHF Book chapter Synthetic Data emphasizes that most modern post-training pipelines rely heavily on large-scale synthetic data generation before any reinforcement learning or on-policy distillation stage.
In recent LLM recipes, off-policy data is rarely a single monolithic corpus. It is usually a curated mixture of math and proof traces, competitive coding solutions, tool-calling code traces, scientific reasoning examples, document-derived examples, long-context question answering, retrieval tasks, general chat, multi-turn dialogue, instruction-following examples, formatting tasks, safety and refusal data, agentic software-engineering trajectories, and search, browser, terminal, or office-tool workflows.

Sequence-Level Distillation

Sequence-level distillation is the most common off-policy method for LLMs.
Introduced in Sequence-Level Knowledge Distillation by Kim and Rush (2016), it trains the student on full teacher-generated outputs:
\[\mathcal{L}_{\text{SeqKD}} =\mathbb{E}_{x} \left[ -\log p_S(y_T \mid x) \right]\]
- where:
\[y_T \sim p_T(\cdot \mid x)\]
This approach often simplifies the target distribution by replacing multiple valid outputs with one teacher-selected response, which can make optimization substantially easier.
In LLM post-training, the “sequence” may be much richer than a final answer: it may contain a reasoning trace, proof, verification path, code solution, patch, tests, tool-use transcript, terminal session, browser or search trace, multi-turn conversation, or critique-and-revision sequence. Sequence-level distillation is therefore often better described as trace-level distillation when applied to reasoning and agentic models.

Trace-Distillation SFT

Trace-distillation SFT is an off-policy method in which a teacher generates a complete trajectory and the student learns it through ordinary supervised fine-tuning. It is central to many modern recipes because it is simple, reusable, and compatible with standard SFT infrastructure.
Let a teacher trace be:
\[y_T=(z_1,z_2,\dots,z_n)\]
- where tokens may encode reasoning, tool calls, tool outputs, code edits, or final answers. The training objective is:
\[\mathcal{L}_{\text{trace-SFT}}(\theta) =\mathbb{E}_{(x,y_T)\sim\mathcal{D}_T} \left[ -\sum_{t=1}^{n} \log p_S^\theta(z_t \mid x,z_{<t}) \right]\]
Trace distillation is the off-policy counterpart of MOPD. Both aim to consolidate knowledge from stronger or more specialized policies, but they differ in the trajectory source: trace-distillation SFT uses teacher-generated trajectories, while MOPD uses student-generated trajectories scored by teachers.
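The trace-SFT objective is an ordinary negative log-likelihood over the teacher's tokens. As a minimal sketch, assume the student's probability for each teacher token \(z_t\) given its prefix has already been computed; the function name and the numbers are illustrative.

```python
import math

def trace_sft_loss(student_token_probs):
    """Negative log-likelihood of one teacher trace: -sum_t log p_S(z_t | x, z_<t)."""
    if any(p <= 0.0 or p > 1.0 for p in student_token_probs):
        raise ValueError("each token probability must lie in (0, 1]")
    return -sum(math.log(p) for p in student_token_probs)

# Hypothetical student probabilities for the three tokens of a teacher trace.
probs = [0.5, 0.25, 1.0]
loss = trace_sft_loss(probs)  # equals log 2 + log 4 + 0 = log 8
```

Every prefix here comes from the teacher trace, never from the student's own samples, which is what places this loss in the off-policy family.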
This distinction is visible in recent post-training recipes. MAI-Thinking-1-style recipes use specialist RL climbs followed by trace-distillation SFT to merge those climbs, then run a final RL climb. This is a conservative consolidation strategy because it avoids the systems complexity of routing student rollouts through many teacher scorers.
Trace-distillation SFT is useful when teacher traces are high quality and verifiable, when the student is not yet capable enough for useful on-policy rollouts, when MOPD infrastructure is unavailable or immature, when the goal is a stable cold start before RL, or when a team wants reusable data rather than a tightly coupled rollout-scoring loop.

Logit Distillation

In logit or soft-label distillation, the student matches the teacher’s full next-token distribution:
\[\mathcal{L}_{\text{logit}} =\mathbb{E}_{(x,y)} \sum_t D_{KL} \left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S(\cdot \mid x, y_{<t}) \right)\]
Compared with sequence-level distillation, this preserves uncertainty information, token similarities, and alternative plausible continuations.
This approach is especially effective when the teacher is much stronger and when the student has sufficient capacity to approximate the teacher distribution.
In LLMs, full-vocabulary logit distillation is often expensive because the vocabulary can contain roughly \(|V| \approx 10^5\) tokens. Practical systems therefore use teacher top-\(k\) probabilities, student top-\(k\) probabilities, sampled-token log-probabilities, renormalized truncated distributions, or cross-entropy on teacher-generated sequences.
For off-policy distillation, top-\(k\) teacher distributions can often be precomputed and stored. This is less flexible than live teacher scoring, but it makes student training cheaper and reproducible.
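One way to use stored top-\(k\) teacher distributions is to renormalize the teacher over its top-\(k\) tokens and compare it with the student restricted to the same tokens. The sketch below makes that assumption explicitly; renormalizing the student on the teacher's support is a design choice rather than the only option, and the function names and values are hypothetical.

```python
import math

def top_k_renormalized(probs, k):
    """Keep the k most probable tokens and renormalize; returns {token_id: prob}."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def truncated_forward_kl(teacher_top_k, student_probs, eps=1e-12):
    """KL(teacher || student), both restricted to the teacher's top-k support."""
    student_mass = sum(student_probs[i] for i in teacher_top_k)
    kl = 0.0
    for token_id, p in teacher_top_k.items():
        q = student_probs[token_id] / max(student_mass, eps)
        kl += p * math.log(p / max(q, eps))
    return kl

teacher = [0.60, 0.30, 0.05, 0.05]
student = [0.30, 0.50, 0.10, 0.10]

stored = top_k_renormalized(teacher, k=2)        # what would be precomputed and saved
loss = truncated_forward_kl(stored, student)     # positive: student disagrees
self_loss = truncated_forward_kl(stored, teacher)  # about zero: same distribution
```

Storing only `k` token IDs and probabilities per position is what makes precomputation affordable compared with saving full-vocabulary logits.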

Rejection-Sampling SFT

Rejection-sampling SFT is a particularly important off-policy pattern in modern LLM recipes. A model generates multiple candidate responses for each prompt, a reward model or verifier selects high-quality candidates, and the student is trained on the selected outputs.
The workflow is:
\[y_1,\dots,y_K \sim p(\cdot \mid x)\]
\[y^\star =\arg\max_{y_i} R(x,y_i)\]
\[\mathcal{L}_{\text{RS-SFT}}(\theta) =-\log p_S^\theta(y^\star \mid x)\]
- where \(R\) may be a reward model, verifier, unit-test score, rule-based checker, or LLM judge.
Llama 3-style recipes are a useful example: the model samples multiple responses per prompt, uses a reward model and rejection sampling to form SFT data, then trains an SFT model and DPO model across repeated rounds.
DeepSeek-R1-style recipes also use rejection-sampling SFT to refine reasoning behavior: RL is used to elicit reasoning, then successful or high-quality reasoning traces are selected and converted into supervised data for subsequent training.
Rejection-sampling SFT is off-policy because the final student trains on selected traces rather than on its own current rollouts. However, if the selected traces come from recent checkpoints, it can still be closer to the student’s distribution than static human-written data.
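The selection step of rejection-sampling SFT can be sketched as best-of-\(K\) filtering. The example assumes a toy exact-match verifier as \(R\) and adds a `min_reward` floor so that prompts with no acceptable candidate are dropped; that floor, the function names, and the data are assumptions made for illustration.

```python
def build_rs_sft_dataset(prompt_to_candidates, reward_fn, min_reward):
    """Keep the highest-reward candidate per prompt if it clears min_reward."""
    dataset = []
    for prompt, candidates in prompt_to_candidates.items():
        best = max(candidates, key=lambda y: reward_fn(prompt, y))
        if reward_fn(prompt, best) >= min_reward:
            dataset.append((prompt, best))
    return dataset

# Toy verifier: reward 1.0 for an exact match with a known answer, else 0.0.
answers = {"2+2": "4", "3*3": "9"}

def exact_match_reward(prompt, response):
    return 1.0 if response == answers[prompt] else 0.0

# Hypothetical sampled candidates; no candidate for "3*3" is correct.
samples = {"2+2": ["5", "4", "four"], "3*3": ["6", "8"]}

sft_data = build_rs_sft_dataset(samples, exact_match_reward, min_reward=1.0)
```

The resulting pairs are then used as ordinary SFT targets, so the quality of the final dataset is bounded by the quality of the reward or verifier used for selection.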

Synthetic Data Pipelines

A major modern use of off-policy distillation is as the final consumer of synthetic data pipelines.
A typical workflow starts by collecting prompts from benchmark datasets, user interactions, or automatically generated prompt sets designed to cover the target domains. A strong teacher then generates one or more candidate completions for each prompt, often including intermediate reasoning traces or tool-use steps. Reward models, verifiers, or additional teacher models evaluate the generated outputs for correctness, helpfulness, consistency, safety, and format compliance. The highest-quality outputs are selected, reranked, or filtered to remove low-confidence or incorrect examples. The resulting dataset is stored as a reusable synthetic corpus that may include completions, chain-of-thought traces, verifier scores, teacher log-probabilities, critique fields, or tool transcripts. The student is then trained on this curated dataset using sequence-level distillation, logit matching, or a hybrid objective.
Synthetic datasets may contain detailed chain-of-thought traces for structured problem solving, verified code solutions that pass unit tests or execution checks, critiques and revisions that teach diagnosis and repair, tool-use transcripts that demonstrate API or terminal interaction, preference annotations that support later alignment or ranking objectives, multi-turn simulated conversations that expose conversational dynamics, long-context examples for retrieval and cross-document reasoning, and agentic software trajectories that teach repository navigation, patch generation, test execution, and final submission behavior.
The RLHF Book highlights that this synthetic-data-to-distillation pipeline remains the dominant method for transferring capabilities from frontier models into smaller and more deployable students.

Off-Policy Distillation in Recent Post-Training Recipes

Off-policy distillation appears in nearly every recent recipe, even when the recipe is described primarily as RL or MOPD.
In InstructGPT-style RLHF, the SFT stage is off-policy because the model learns from human demonstrations rather than its own rollouts, and the reward model stage also relies on fixed comparison data.
In Llama 2-style RLHF, SFT and rejection-sampling stages create offline supervised targets, while PPO supplies the online RL component.
In Llama 3-style recipes, the reward model is used primarily as a filter: the model samples multiple responses per prompt, rejection sampling selects outputs, SFT trains on those outputs, and DPO further refines the model. This is a largely off-policy pipeline with iterative refresh.
In Tülu 3-style recipes, curated prompts, SFT, DPO, and RLVR are staged so that offline supervised and preference data prepare the model before verifiable-reward RL.
In DeepSeek-R1-style recipes, SFT is used as a cold start and as a refinement mechanism after reasoning RL. RL elicits reasoning behaviors, while rejection-sampling SFT distills and clarifies those behaviors into training data.
In MAI-Thinking-1-style recipes, multiple specialist RL climbs are consolidated through trace-distillation SFT before a final RL climb. This is a clear example of off-policy consolidation from specialist teachers.
In Nemotron-Cascade 2-style recipes, SFT data is generated and filtered across math, code, science, long context, chat, instruction following, safety, and tool-use domains before Cascade RL and multi-domain OPD. This demonstrates how off-policy distillation can serve as the foundation on which later on-policy training is built.
In Nemotron 3 Ultra-style recipes, SFT data includes tool-calling math traces, non-tool math traces, proof generation and verification, science data filtered for reasoning and format quality, multi-turn chat data selected by a reward model, and software-engineering trajectories filtered for tool-call hygiene and undesirable action patterns. This shows that off-policy distillation increasingly resembles a large-scale data engineering system rather than a simple SFT pass.

Advantages

Off-policy distillation has several practical advantages:
- Operational simplicity: It closely resembles supervised fine-tuning on a fixed dataset.
- Stability and reproducibility: The same examples can be reused across runs.
- Amortized teacher inference: Outputs can be generated once and consumed many times.
- Scalability: It scales naturally to large datasets and distributed training systems.
- Synthetic-data integration: It integrates well with synthetic data generation pipelines for reasoning, coding, and instruction following.
- Offline filtering: It supports heavy offline filtering before training.
- Auditability: It is comparatively easy to audit because examples can be inspected before use.
- Cold start: It provides a strong cold start for RL or OPD.
- Source compatibility: It is compatible with many supervision sources, including frontier models, specialist teachers, older student checkpoints, human annotators, LLM judges, reward models, and verifiers.

Limitations: Distribution Mismatch

The central weakness of off-policy distillation is that the student is trained on trajectories it did not generate.
During inference, the student samples:
\[\hat{y} \sim p_S(\cdot \mid x)\]
- which may diverge from teacher-generated sequences. Because each token conditions on previous tokens, small errors compound over long trajectories.
This problem is explicitly analyzed in On-Policy Distillation of Language Models and motivates Generalized Knowledge Distillation.
The Thinking Machines blog post On-Policy Distillation compares this to learning chess solely by watching grandmasters: one sees excellent play but not the board states produced by one’s own mistakes.
In long-horizon reasoning and agentic workflows, this mismatch is more severe because the student’s early decisions determine the tools it calls, files it edits, tests it runs, assumptions it carries forward, and intermediate reasoning branches it explores.
For tool use, an off-policy trace may show the correct API call, but the deployed student may produce a malformed call, search the wrong file, run the wrong command, or misunderstand a tool result. The offline trace provides little direct supervision for recovery from that specific state.
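A back-of-the-envelope calculation shows why the mismatch grows with trajectory length. Assume, purely for illustration, that each generated token independently leaves the region covered by the offline corpus with a small probability `eps`. Real errors are not independent, so this is a simplified model rather than a measured result.

```python
def prob_stay_on_distribution(eps, horizon):
    """Chance that all `horizon` tokens stay in-distribution under independent errors."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be a probability")
    return (1.0 - eps) ** horizon

short_task = prob_stay_on_distribution(eps=0.01, horizon=100)
long_task = prob_stay_on_distribution(eps=0.01, horizon=1000)

print(f"100 tokens:  {short_task:.3f}")
print(f"1000 tokens: {long_task:.6f}")
```

Under this assumption a 1% per-token deviation rate still leaves a reasonable chance of staying on familiar prefixes for 100 tokens, but almost none for 1000, which matches the intuition that long-horizon reasoning and agentic tasks are hit hardest.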

Behavioral Consequences

Off-policy students often perform strongly when their generated prefixes remain close to trajectories seen during training, but they can struggle to recover from early mistakes that push generation into unfamiliar contexts. This manifests as exposure bias in long-horizon reasoning and agentic tasks, stylistic imitation without full transfer of reasoning competence, overconfidence when trained primarily on deterministic single targets, tool-use brittleness when successful traces dominate the corpus, format sensitivity when examples are filtered for perfect formatting, and repetition of teacher artifacts such as verbose reasoning conventions, fixed templates, or unnecessary tool calls.
These behaviors do not make off-policy distillation ineffective. They clarify why it is most effective as a foundation stage and why later on-policy methods are useful for robustness.

Filtering, Verification, and Data Quality

Off-policy distillation shifts much of the difficulty from online optimization to offline data quality. The objective is simple, but the training signal depends heavily on how examples are generated and filtered.
Common filtering mechanisms include exact and fuzzy deduplication, unit-test execution for code, answer checking for math, proof validation where possible, format and schema checks, tool-call validity checks, reward-model ranking, LLM judge scoring, safety classifiers, length and repetition filters, and human audit for high-impact domains.
In code SFT pipelines, examples with verifiable tests can be filtered by correctness. Prompts without tests may require weaker heuristics, such as selecting longer or more detailed reasoning traces when those correlate with better analysis.
In software-engineering and agentic traces, filtering must often remove undesirable behaviors even when the final task succeeds. Examples include invalid submission actions, disallowed git operations, excessive edit-test loops, exploratory thrashing without edits, malformed tool calls, debug artifacts in final patches, and trajectories that edit code but never run tests.
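Behavioral filters of this kind can be written as simple rules over the action sequence. The sketch below assumes each trajectory is a list of dictionaries with a `kind` field and an optional `malformed` flag; that schema, the action names, and the loop limit are hypothetical, and a production filter would cover more of the patterns listed above.

```python
def keep_trajectory(actions, max_edit_test_loops=5):
    """Return True if an agentic trace passes a few behavioral hygiene rules."""
    kinds = [action["kind"] for action in actions]
    if any(action.get("malformed", False) for action in actions):
        return False  # malformed tool call
    if "edit" not in kinds:
        return False  # exploration without any edit
    if "run_tests" not in kinds:
        return False  # edits code but never runs tests
    loops = sum(
        1 for prev, cur in zip(kinds, kinds[1:]) if prev == "edit" and cur == "run_tests"
    )
    return loops <= max_edit_test_loops  # reject excessive edit-test cycling

good = [{"kind": "read"}, {"kind": "edit"}, {"kind": "run_tests"}, {"kind": "submit"}]
no_tests = [{"kind": "edit"}, {"kind": "submit"}]
bad_call = [{"kind": "edit", "malformed": True}, {"kind": "run_tests"}]
thrashing = [{"kind": "read"}, {"kind": "read"}, {"kind": "read"}]

kept = [keep_trajectory(t) for t in (good, no_tests, bad_call, thrashing)]
```

These rules apply regardless of whether the final task succeeded, which is the point: a passing trace with poor habits still teaches those habits to the student.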
In chat data, filtering may select the highest-quality response among multiple candidates using a reward model or judge. Multi-turn chat data may also simulate users with varied strategies, such as asking clarifying questions, challenging assumptions, reframing the task, or applying an answer to a new context.
This filtering stage is one reason off-policy distillation remains central to frontier recipes: it allows data quality to be improved before expensive student training begins.

Relationship to Reinforcement Learning

Off-policy distillation and reinforcement learning differ in both trajectory source and feedback density:

Method	Trajectory source	Reward or supervision density
Off-policy KD	Teacher, human, verifier-filtered data, or dataset	Dense token-level or hard sequence targets
RLHF / RLVR	Student	Sparse sequence-level or outcome-level rewards
On-policy distillation	Student	Dense token-level teacher feedback

As a result, off-policy distillation is highly sample-efficient because its supervision is dense, but it is less robust than on-policy approaches because the student never trains on its own trajectories.
Many modern pipelines follow a staged progression in which synthetic data is generated and filtered into high-quality off-policy supervision, the student is trained through off-policy distillation to absorb broad teacher capabilities, reinforcement learning then refines behaviors that are difficult to specify directly in the dataset, and on-policy distillation transfers the benefits of RL into a smaller or more efficient model or consolidates multiple specialist teachers.
The RLHF Book explicitly presents this progression as the “path to on-policy distillation.”
Another useful framing is that off-policy distillation teaches the student what good behavior looks like, RL teaches the student which of its behaviors succeed under a reward or verifier, and OPD teaches the student how a teacher would locally correct the behaviors it actually sampled.

Engineering and Systems Considerations

Off-policy systems are operationally straightforward: teacher inference can be performed asynchronously and at large batch sizes; training examples can store token IDs, reasoning traces, verifier scores, and optionally top-\(k\) log-probabilities; synthetic datasets can be reused across many experiments and student architectures; and student training proceeds independently without synchronous communication with the teacher.
The primary costs arise from generating synthetic data, storing large corpora, and maintaining the filtering and verification infrastructure that ensures data quality.
Practical systems must handle data provenance, versioning, prompt contamination, benchmark leakage, teacher model version tracking, deduplication across generated corpora, mixture weighting across domains, long-trace storage and compression, tool-output normalization, chat-template consistency, tokenizer compatibility, and safety or policy audits.
Large SFT mixtures should often be balanced by token count rather than example count because long reasoning, long-context, and software-engineering traces can otherwise dominate the gradient.
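A small worked example shows why example-count balancing can mislead. The sketch below uses two hypothetical domains (`chat` and `swe`) with made-up average trace lengths; the helper names are illustrative and not taken from any specific pipeline. It computes the token share implied by a set of example-sampling probabilities, and the probabilities needed to reach a target token share.

```python
def token_share(avg_tokens: dict, probs: dict) -> dict:
    """Expected share of training tokens per domain under example-sampling probs."""
    mass = {d: probs[d] * avg_tokens[d] for d in probs}
    total = sum(mass.values())
    return {d: m / total for d, m in mass.items()}


def example_sampling_probs(avg_tokens: dict, target_token_share: dict) -> dict:
    """Example-sampling probabilities that realize a target token share.

    Sampling domain d with probability q_d gives it token mass q_d * L_d,
    so q_d must be proportional to target_d / L_d.
    """
    raw = {d: target_token_share[d] / avg_tokens[d] for d in avg_tokens}
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}


# Hypothetical average trace lengths in tokens.
avg_tokens = {"chat": 300, "swe": 12_000}

# Balancing by example count: long traces dominate the token budget.
share_uniform = token_share(avg_tokens, {"chat": 0.5, "swe": 0.5})

# Balancing by token count: sample short-trace domains more often.
balanced = example_sampling_probs(avg_tokens, {"chat": 0.5, "swe": 0.5})
share_balanced = token_share(avg_tokens, balanced)

print("token share, example-balanced:", share_uniform)
print("example probs for token balance:", balanced)
print("token share, token-balanced:", share_balanced)
```

With these assumed lengths, equal example counts give the long-trace domain nearly all of the tokens, which is the imbalance the token-count rule is meant to correct.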
For off-policy logit distillation, storage and bandwidth are major constraints. Full-vocabulary logits are usually impractical; top-\(k\) distributions or sampled-token log-probabilities are much cheaper.
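A back-of-the-envelope calculation makes the storage gap concrete. The numbers below are assumptions chosen for illustration only: a 128,000-token vocabulary, 2-byte log-probabilities, 4-byte token IDs, \(k = 32\), and a one-billion-token corpus.

```python
def bytes_per_token(vocab_size: int, top_k: int,
                    logprob_bytes: int = 2, id_bytes: int = 4) -> dict:
    """Approximate per-token payload for three teacher-signal formats."""
    return {
        # One log-probability for every vocabulary entry.
        "full_vocab": vocab_size * logprob_bytes,
        # k pairs of (token id, log-probability).
        "top_k": top_k * (id_bytes + logprob_bytes),
        # The sampled token id plus the teacher log-probability of that token.
        "sampled_token": id_bytes + logprob_bytes,
    }


costs = bytes_per_token(vocab_size=128_000, top_k=32)
corpus_tokens = 1_000_000_000  # hypothetical corpus size

for name, per_token in costs.items():
    total_tb = per_token * corpus_tokens / 1e12
    print(f"{name:>13}: {per_token:>7} B/token, {total_tb:.3f} TB total")
```

Under these assumptions the full-vocabulary payload is more than a thousand times larger than the top-\(k\) payload, before any compression.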
For off-policy trace distillation, text storage is cheaper than logit storage, but trace quality becomes the central bottleneck.

When Off-Policy Distillation is Preferred

Off-policy distillation is the best choice when:
- simplicity and stability are more important than exact train-inference matching;
- large synthetic datasets are already available or can be generated economically;
- teacher inference should be amortized offline and reused across many experiments;
- the student is unlikely to diverge substantially from the training distribution during deployment;
- the goal is broad capability transfer rather than maximal robustness to self-generated errors;
- the student needs a cold start before RL, OPD, or MOPD;
- data quality and auditability are central requirements;
- a team wants to compare many student architectures or objectives on the same fixed corpus; or
- teacher outputs require expensive filtering or human review that cannot be placed inside a live training loop.
It remains the dominant starting point for most practical distillation pipelines, even when more advanced on-policy or reinforcement learning stages are planned later.

Practical Recipe Pattern

A robust off-policy distillation recipe usually follows a sequence in which prompts are collected, teacher generations are produced, outputs are filtered or verified, duplicates and low-quality items are removed, the mixture is balanced across domains and trace lengths, the student is trained through SFT or KD, and evaluation checks both target gains and regressions.
For reasoning and coding, the strongest version often generates several candidates, verifies them, selects the best traces, and trains the student on those selected traces.
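The generate, verify, and select pattern can be sketched in a few lines. Everything in the block is a stand-in: `toy_generate` and `toy_verify` are hypothetical placeholders for a teacher sampler and a verifier, and the arithmetic prompts exist only to make the example runnable.

```python
from typing import Callable, Optional


def select_best_trace(prompt: str,
                      generate: Callable[[str, int], str],
                      verify: Callable[[str, str], float],
                      n: int = 4) -> Optional[str]:
    """Sample n candidates, keep those that pass verification, return the best."""
    scored = [(verify(prompt, cand), cand)
              for cand in (generate(prompt, seed) for seed in range(n))]
    passing = [(score, cand) for score, cand in scored if score > 0.0]
    if not passing:
        return None  # no verified trace: drop the prompt instead of training on it
    return max(passing, key=lambda pair: pair[0])[1]


def toy_generate(prompt: str, seed: int) -> str:
    # Stand-in teacher: answers "a+b" prompts, off by one on odd seeds.
    a, b = (int(t) for t in prompt.split("+"))
    return str(a + b + seed % 2)


def toy_verify(prompt: str, response: str) -> float:
    # Stand-in verifier: exact-match check against the true sum.
    a, b = (int(t) for t in prompt.split("+"))
    return 1.0 if response == str(a + b) else 0.0


corpus = []
for prompt in ["2+2", "3+5"]:
    best = select_best_trace(prompt, toy_generate, toy_verify)
    if best is not None:
        corpus.append({"prompt": prompt, "response": best})

print(corpus)
```

Returning `None` when nothing verifies is a design choice in this sketch: it keeps unverified traces out of the corpus at the cost of losing hard prompts.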
For agentic systems, the strongest version often starts with an agent rollout, checks task success, filters for tool-use hygiene, removes known anti-patterns, and only then adds the trajectory to the SFT corpus.
For multi-domain recipes, the most common role of off-policy distillation is to prepare the student for later online or on-policy stages:
\[\text{Off-policy SFT} \rightarrow \text{RLVR} \rightarrow \text{OPD / MOPD} \rightarrow \text{final alignment}\]
This progression reflects the modern view: off-policy distillation is not obsolete. It is the stable substrate on which more adaptive post-training methods are built.

On-Policy Distillation (OPD)

On-policy distillation addresses the central limitation of off-policy methods by training the student on trajectories it actually generates, rather than only on fixed datasets curated by humans or sampled from a teacher. By shifting supervision onto the student’s own state distribution, OPD substantially reduces exposure bias and compounding errors in autoregressive models.
The central idea to carry through this section is that OPD combines the trajectory relevance of reinforcement learning with the dense token-level supervision of distillation. RL trains on student-generated rollouts but often supplies only sparse outcome rewards. Classical KD supplies dense token-level supervision but usually trains on teacher or dataset trajectories. OPD takes the trajectory source from the first and the supervision from the second: the student samples the trajectory, while the teacher scores the student’s actual prefixes.
OPD is best understood as a local correction mechanism. Instead of asking only what the teacher would have generated from the original prompt, OPD asks how the teacher evaluates the choices the student actually made. This distinction is crucial in long reasoning chains, coding tasks, and agentic tool-use workflows where one early deviation changes the entire future trajectory.
OPD is also a systems primitive, not just a loss. A practical OPD system must generate rollouts, preserve token-level student log-probabilities, route examples to teacher scorers, compute teacher log-probabilities under the exact student prefixes, mask invalid tokens, control rollout staleness, and update the student while keeping the teacher fixed or semi-fixed.
In most current LLM recipes, OPD is on-policy with respect to trajectory source but offline with respect to teacher updates: the student generates rollouts, while one or more frozen teachers score them. MOPD extends the same idea by routing student rollouts to domain-specialist teachers, often after SFT, RLVR, or Cascade RL has created a strong initial student.
The most important modern qualification is that on-policy supervision is only useful when the teacher is a good teacher on the student’s own rollout distribution. Dense feedback can be much more compute- and sample-efficient than sparse RL, but only when the teacher assigns higher probability to higher-reward trajectories while remaining close enough to the student for its local preferences to be learnable. If the dense signal comes from a privileged but misaligned teacher, OPD can teach the wrong response shape faster than RL would.

Core Idea and Formal Objective

In on-policy distillation, the student first generates a rollout:
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\]
The teacher then evaluates the exact same trajectory by assigning next-token probabilities conditioned on the student’s prefixes. The student is updated to reduce the divergence between its own token distributions and the teacher’s token distributions along this rollout:
\[\mathcal{L}_{\text{on-policy}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right] \right]\]
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) frames distillation as an imitation-learning problem in the style of A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning by Ross et al. (2011), which introduced DAgger as an iterative dataset-aggregation method for training on the learner’s induced state distribution. Agarwal et al. report gains over conventional supervised KD and sequence-level KD across summarization, translation, and mathematical reasoning.
A more explicit state-based notation writes the student-visited prefix as:
\[s_t=(x,\hat{y}_{<t})\]
and defines the token-level matching loss as:
\[\ell_t(\theta) =D\left( p_T(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right)\]
The key property is that \(s_t\) is sampled from the student’s own rollout distribution. If the student makes an early mistake, the teacher does not simply show an ideal path from the prompt; instead, the teacher evaluates the continuation under the mistaken prefix and provides local corrective information.

OPD as Supervised Learning over Student-Visited States

A subtle but important point is that practical OPD usually does not differentiate through the student’s sampling distribution. The rollout is sampled, treated as data, rescored by the teacher, and then used for a supervised token-level update. This is different from sequence-level reverse-KL methods that backpropagate through the sampling distribution and behave more like policy-gradient training.
A stop-gradient view of the common implementation is:
\[\nabla_\theta \mathcal{L}_{OPD} \approx \mathbb{E}_{\hat{y} \sim \operatorname{sg}(p_S^\theta)} \left[ \nabla_\theta \sum_t D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(\operatorname{sg}\) denotes stop-gradient through the rollout sampling process.
MiniLLM: Knowledge Distillation of Large Language Models by Gu et al. (2023) studies a reverse-KL formulation closer to policy-gradient optimization through the student’s sequence distribution, while the GKD-style OPD implementations used in many codebases are closer to supervised learning on student-visited states.
This distinction matters because the policy-gradient version can be noisier and more complex to stabilize, while the GKD-style version is simpler and closer to the DAgger procedure introduced in A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning by Ross et al. (2011). In practice, OPD is often better described as a loop: collect student rollouts, label them with dense teacher feedback, and perform supervised distillation updates on the resulting student-state dataset.
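The supervised-learning view can be illustrated with a toy tabular student. The sketch below makes strong simplifying assumptions that are not from any cited paper: a three-token vocabulary, a state that is only the previous token rather than the full prefix, a fixed hand-written teacher table, and forward KL as the divergence \(D\). Sampling is treated as data, so the update uses only the gradient of the per-state loss, which for softmax logits is \(p_S - p_T\).

```python
import math
import random

VOCAB = 3      # toy vocabulary: tokens 0, 1, 2
START = 0      # toy state = previous token; every rollout starts from token 0
HORIZON = 5    # rollout length

# Frozen toy teacher: next-token distribution for each state (assumed values).
TEACHER = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_kl(logits):
    """Sum over states of KL(teacher || student)."""
    return sum(kl(TEACHER[s], softmax(logits[s])) for s in range(VOCAB))


def opd_step(logits, rng, lr=0.5):
    # 1) The student samples its own rollout; no gradient flows through sampling.
    state, visited = START, []
    for _ in range(HORIZON):
        visited.append(state)
        state = rng.choices(range(VOCAB), weights=softmax(logits[state]))[0]
    # 2) The teacher labels each visited state; 3) supervised update on that state.
    for s in visited:
        p_s = softmax(logits[s])
        for v in range(VOCAB):
            # Gradient of KL(p_T || softmax(z)) with respect to z_v is p_S(v) - p_T(v).
            logits[s][v] -= lr * (p_s[v] - TEACHER[s][v])


rng = random.Random(0)
student_logits = [[0.0] * VOCAB for _ in range(VOCAB)]  # uniform student
kl_before = total_kl(student_logits)
for _ in range(200):
    opd_step(student_logits, rng)
kl_after = total_kl(student_logits)

print(f"KL before: {kl_before:.4f}, KL after: {kl_after:.4f}")
```

Only states the student actually visits are updated, which is the defining property of the on-policy variant; a state the student never reaches keeps its initial distribution.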

Intuition: Learning from One’s Own Mistakes

The key advantage of on-policy distillation is that the student receives feedback precisely in the contexts it is most likely to encounter at inference time.
On-Policy Distillation presents an intuitive analogy to chess. Instead of only observing expert games, or receiving a single win-or-loss signal after playing a full game, the student receives move-by-move evaluations of its own choices. This makes it possible to identify and correct the specific decisions that caused the rollout to go off track.
A targeted correction interpretation is also useful for agentic systems. If the student produces a mostly plausible trajectory but makes one invalid tool call, OPD can penalize the local token choices that produced that tool call rather than spreading a final failure signal across the whole trajectory. This is why OPD often fits naturally into RL infrastructure: it can use the same rollout machinery, but the feedback is dense and token-level rather than sparse and trajectory-level.
The following figure (source) shows a chess.com-style visualization in which each move in the learner’s own game is graded from blunder to brilliant, illustrating how on-policy distillation provides dense, token-level feedback over self-generated trajectories.

Generalized Knowledge Distillation (GKD)

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) introduces Generalized Knowledge Distillation (GKD), which unifies several forms of distillation under a single framework. At each training step, the algorithm may sample a trajectory from the student and obtain teacher supervision along that rollout, or draw a trajectory from a fixed dataset and perform traditional off-policy distillation on that example.
This mixture is controlled by a parameter \(\lambda \in [0,1]\), which specifies the fraction of training examples that are student-generated.
When \(\lambda=0\), GKD reduces to standard supervised distillation. When \(\lambda=1\), all training occurs on student-generated trajectories. Intermediate values provide a practical curriculum that combines the stability of offline supervision with the robustness benefits of on-policy training.
A mixed GKD objective can be written as:
\[\mathcal{L}_{GKD}(\theta) =(1-\lambda) \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ D(p_T \,\Vert\, p_S^\theta)(x,y) \right] + \lambda \mathbb{E}_{x\sim\mathcal{D}} \mathbb{E}_{\hat{y}\sim p_S^\theta(\cdot \mid x)} \left[ D(p_T \,\Vert\, p_S^\theta)(x,\hat{y}) \right]\]
The following expansion makes the two mixture branches explicit: the first term is off-policy distillation on dataset trajectories \(y\), and the second term is on-policy distillation on student-generated trajectories \(\hat{y}\):
\[\mathcal{L}_{\mathrm{GKD}}(\theta) =(1-\lambda) \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t=1}^{|y|} D\left( p_T(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right) \right] + \lambda \mathbb{E}_{x\sim\mathcal{D}} \mathbb{E}_{\hat{y}\sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
The curriculum interpretation is important in practice. A weak student may initially generate low-quality or off-support trajectories, so a purely on-policy objective can waste teacher calls. A stronger student can generate informative failures, making on-policy teacher feedback more useful.
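The mixture can be sketched as a per-example coin flip with probability \(\lambda\). The linear ramp in `lambda_schedule` is a hypothetical curriculum added for illustration; the source describes \(\lambda\) as a mixture parameter and does not prescribe a schedule.

```python
import random


def lambda_schedule(step: int, total_steps: int,
                    start: float = 0.0, end: float = 1.0) -> float:
    """Hypothetical linear ramp of the on-policy fraction over training."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def pick_sources(batch_size: int, lam: float, rng: random.Random) -> list:
    """Per example: student rollout with probability lam, else a dataset trajectory."""
    return ["student" if rng.random() < lam else "dataset"
            for _ in range(batch_size)]


rng = random.Random(0)
for step in (0, 500, 1000):
    lam = lambda_schedule(step, total_steps=1000)
    sources = pick_sources(8, lam, rng)
    print(step, round(lam, 2), sources.count("student"), "student rollouts of", len(sources))
```

Examples tagged `"dataset"` would be scored on the fixed trajectory \(y\), and examples tagged `"student"` would first require a rollout \(\hat{y}\) and a teacher call on its prefixes.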
The following figure (source) shows that on-policy Generalized Knowledge Distillation significantly outperforms supervised fine-tuning, supervised KD, and sequence-level KD across summarization, translation, and mathematical reasoning tasks.

Teacher Quality and Reward-Tilted Policies

The key condition for OPD is that the teacher should be both better than the student and close enough to the student for its preferences to be learnable on student-visited prefixes. A much stronger but distributionally distant teacher can be unreliable on prefixes the student actually visits. A close but not reward-improved teacher may provide dense feedback that does not improve the task.
A strong same-family model often satisfies this condition because it shares tokenizer, pretraining distribution, reasoning style, and local token support with the student, while assigning more probability to higher-quality responses. An RL-trained expert can also satisfy this condition because RL with a penalty for moving away from the base model keeps the expert close to the student while tilting it toward high-reward responses.
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t frames this condition through a reward-tilted teacher. Let \(s\) denote a complete trajectory, \(R(s)\) the downstream reward, and \(\pi_k^S\) the current student policy held fixed. KL-regularized reward maximization, \(\max_{\pi} \mathbb{E}_{s \sim \pi}[R(s)] - \tfrac{1}{\beta} D_{KL}(\pi \,\Vert\, \pi_k^S)\), has the closed-form optimum:
\[\pi_T^*(s) =\frac{1}{Z} \pi_k^S(s) \exp(\beta R(s)), \qquad Z = \mathbb{E}_{s \sim \pi_k^S} \left[ \exp(\beta R(s)) \right]\]
- where \(\beta\) controls the strength of reward tilting and \(Z\) normalizes the distribution.
If the teacher equals this reward-tilted policy and its gradient is stopped, reverse-KL distillation decomposes as:
\[D_{KL} \left( \pi^S \,\Vert\, \pi_T^* \right) = D_{KL} \left( \pi^S \,\Vert\, \pi_k^S \right) - \beta \mathbb{E}_{s \sim \pi^S} \left[ R(s) \right] + \log Z\]
Since \(\log Z\) does not depend on the optimized student, minimizing this reverse-KL objective is equivalent to increasing expected reward while staying close to the current policy. Distilling toward a reward-tilted teacher is therefore a dense form of KL-regularized RL.
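The decomposition can be checked numerically on a tiny discrete example. The three trajectories, their probabilities, their rewards, and the value of \(\beta\) below are arbitrary assumptions chosen only to exercise the identity.

```python
import math


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


student_k = [0.5, 0.3, 0.2]   # current student pi_k^S over three toy trajectories
reward = [0.0, 1.0, 2.0]      # R(s) for each trajectory
beta = 0.7                    # tilt strength

# Reward-tilted teacher: pi_T*(s) = pi_k^S(s) * exp(beta * R(s)) / Z.
Z = sum(p * math.exp(beta * r) for p, r in zip(student_k, reward))
teacher = [p * math.exp(beta * r) / Z for p, r in zip(student_k, reward)]

# Any candidate student pi^S being optimized.
student = [0.2, 0.5, 0.3]

lhs = kl(student, teacher)
expected_reward = sum(p * r for p, r in zip(student, reward))
rhs = kl(student, student_k) - beta * expected_reward + math.log(Z)

print(f"KL(student || teacher)         = {lhs:.6f}")
print(f"KL to pi_k - beta*E[R] + log Z = {rhs:.6f}")
```

The two printed values agree up to floating-point error for any valid choice of `student`, which is what makes the reverse-KL objective equivalent to KL-regularized reward maximization.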
The teacher-student log-ratio provides a practical diagnostic:
\[\Delta_T(s) = \log \pi_T(s) - \log \pi_S(s)\]
A useful OPD teacher should assign larger \(\Delta_T(s)\) to higher-reward trajectories. If \(\Delta_T(s)\) tracks response style, prompt artifacts, or privileged-context markers instead of reward, the teacher is not acting like a reward-tilted student.
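One way to run this diagnostic, sketched here under assumptions, is to correlate the sequence-level log-ratio with reward over a batch of student rollouts. The four rollouts and their log-probabilities are made-up values, and Pearson correlation is only one possible choice of statistic.

```python
import math


def teacher_gap(teacher_logprobs, student_logprobs):
    """Sequence-level Delta_T(s): summed teacher minus summed student log-probs."""
    return sum(teacher_logprobs) - sum(student_logprobs)


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical rollouts: (teacher token log-probs, student token log-probs, reward).
rollouts = [
    ([-0.2, -0.3], [-0.5, -0.6], 1.0),
    ([-1.5, -2.0], [-0.7, -0.9], 0.0),
    ([-0.4, -0.4], [-0.6, -0.5], 1.0),
    ([-1.0, -1.2], [-0.8, -0.9], 0.0),
]

gaps = [teacher_gap(t, s) for t, s, _ in rollouts]
rewards = [r for _, _, r in rollouts]
corr = pearson(gaps, rewards)

print("gaps:", [round(g, 2) for g in gaps])
print("correlation with reward:", round(corr, 3))
```

A correlation near zero or below zero on real student rollouts would suggest that the teacher's preferences track something other than reward.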

Choice of Divergence and Reward Interpretation

Although forward KL remains theoretically valid, reverse KL is particularly well suited to on-policy training because the rollout is sampled from the student distribution. Under reverse KL, the student is penalized for sampled tokens that the teacher considers unlikely, and the loss can be approximated from teacher log-probabilities on sampled tokens rather than full-vocabulary distributions.
\[D_{KL}(p_S \,\Vert\, p_T) = \mathbb{E}_{y \sim p_S} \left[ \log \frac{p_S(y)}{p_T(y)} \right]\]
The sampled-token reverse-KL signal is naturally interpreted as a dense per-token advantage:
\[A_t^{\mathrm{OPD}} = \log p_T(y_t \mid x,y_{<t}) - \log p_S(y_t \mid x,y_{<t})\]
Tokens that the teacher rates more highly than the student receive positive advantage, while tokens that the teacher considers worse than the student receive negative advantage. Multi-Teacher On-Policy Distillation: A New Post-Training Primitive emphasizes that this makes reverse KL a natural replacement or complement for advantage terms in GRPO-style reinforcement learning.
Distilling 100B+ Models 40x Faster with TRL notes that reverse KL aligns cleanly with student-generated trajectories and can require only the teacher’s log-probabilities on sampled tokens rather than full-vocabulary distributions.
The divergence choice is also a systems choice. Forward KL or distribution matching may require teacher top-\(k\) or full-vocabulary distributions. Reverse-KL-style sampled-token training can use a much smaller payload:
\[\left( y_t, \log p_T(y_t \mid s_t), \log p_S(y_t \mid s_t) \right)\]
- for each valid response token \(t\).
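Given that payload, the per-token advantage and a single-sample reverse-KL estimate take only a few lines. The log-probabilities and the mask below are illustrative values, and the function names are hypothetical.

```python
def opd_advantages(teacher_logprobs, student_logprobs, mask):
    """A_t = log p_T(y_t | s_t) - log p_S(y_t | s_t) on valid tokens, 0 elsewhere."""
    return [(t - s) if keep else 0.0
            for t, s, keep in zip(teacher_logprobs, student_logprobs, mask)]


def reverse_kl_estimate(advantages, mask):
    """Monte Carlo reverse-KL estimate: mean of -A_t over valid sampled tokens."""
    valid = [a for a, keep in zip(advantages, mask) if keep]
    return -sum(valid) / len(valid)


# Illustrative payload for one four-token rollout; the last token is masked out.
teacher_lp = [-0.1, -2.3, -0.7, -5.0]
student_lp = [-0.4, -0.5, -0.7, -0.2]
mask = [True, True, True, False]

adv = opd_advantages(teacher_lp, student_lp, mask)
rkl = reverse_kl_estimate(adv, mask)

print("advantages:", [round(a, 2) for a in adv])
print("reverse-KL estimate:", round(rkl, 3))
```

The second token has a strongly negative advantage because the student sampled a token the teacher considers unlikely; the masked token contributes nothing even though its raw gap is large.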
The key reward interpretation is that a sampled-token or reverse-KL OPD objective is useful when the teacher’s log-probability advantage is a proxy for reward advantage. If the teacher gives high probability to a token because it is reward-improving, the update is useful. If the teacher gives high probability to a token because it reflects a privileged prompt format, reference artifact, or shortcut, the same update becomes harmful.

On-Policy Distillation and Reinforcement Learning

One of the most important modern insights is that on-policy distillation can be understood as a dense, KL-constrained form of policy optimization.
The RLHF Book chapter Synthetic Data describes on-policy distillation as the natural progression after synthetic data generation and reinforcement learning. Reinforcement learning supplies on-policy trajectories but typically only sparse sequence-level rewards, whereas on-policy distillation provides token-level guidance from a stronger teacher over those same trajectories.
The following methods establish that on-policy distillation is not simply an alternative to reinforcement learning, but a general dense supervision framework that can replace, augment, or stabilize policy-gradient methods while preserving the on-policy nature of learning.

Reinforcement Learning via Self-Distillation

Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) introduces Self-Distillation Policy Optimization (SDPO), which converts textual critiques, execution errors, runtime feedback, and other rich feedback into dense token-level updates without requiring an external teacher or explicit reward model.
The method constructs a teacher distribution conditioned on the original trajectory and a natural-language feedback string that explains what went wrong or how to improve. Student trajectories are replayed under this feedback-augmented teacher context, so each token can receive targeted corrective supervision. The same base model can instantiate both student and teacher views, reducing the need for a separate external model.
This approach is especially useful for coding and reasoning tasks where runtime errors, verifier messages, or judge comments provide informative textual signals. It is also relevant when standard RLVR only returns scalar feedback, because successful rollouts can serve as implicit feedback for failed attempts.

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation by Yang et al. (2026) introduces ExOPD, or extrapolation OPD, which generalizes OPD by combining teacher imitation with explicit reward extrapolation, allowing the student to exceed the teacher rather than merely match it.
ExOPD uses a reference teacher policy to provide dense token-level supervision while an external reward signal estimates how much better or worse the current trajectory is than the teacher baseline. The distillation loss is reweighted by reward-derived scaling factors so that trajectories outperforming the teacher receive amplified updates.
The framework supports interchangeable reference models, including external teachers, self-teachers, and moving-average checkpoints, which decouples the source of dense supervision from the source of the ultimate objective. This is especially useful when the goal is to merge domain-specialist teachers back into one student while allowing the student to move beyond the teachers’ individual performance boundaries.

Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

Scaling Reasoning Efficiently via Relaxed On-Policy Distillation by Ko et al. (2026) introduces REOPOLD, a relaxed on-policy distillation method designed to reduce over-imitation, improve stability, and scale reasoning training more efficiently.
REOPOLD treats token-level teacher-student log-likelihood ratios as dense rewards, similar to reverse-KL advantages, but relaxes strict imitation by clipping or tempering overly strong penalties on low-value tokens. It also uses partial rollouts and truncated reasoning traces to reduce compute while preserving informative supervision.
This is especially relevant for reasoning tasks where exact teacher imitation may be unnecessarily restrictive or may suppress productive exploratory reasoning. The practical lesson is that OPD can be reward-like without requiring strict copying of every teacher preference.

Self-Distilled RLVR

Self-Distilled RLVR by Yang et al. (2026) introduces RLSD, which combines reinforcement learning with verifiable rewards and privileged self-distillation, using self-distillation to modulate update magnitudes while preserving RL-derived update directions.
The method gives a self-teacher privileged information such as the correct answer or a verified reasoning trace. Reinforcement learning determines the update direction based on correctness signals, while self-distillation scales the magnitude of token-level updates according to how strongly the privileged teacher prefers the sampled continuation.
This separation reduces information leakage while preserving the objective grounding of RLVR. It also reflects a broader design rule: privileged self-distillation is safest when it does not decide the sign of the update on its own.

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026) introduces SRPO, a sample-routing framework that combines GRPO-style reinforcement learning on successful rollouts with self-distillation-based correction on failed rollouts.
SRPO sends correct samples to a GRPO branch, where group-relative advantages provide a reward-aligned policy update. Incorrect samples with available teacher information are replayed under a privileged teacher context and corrected using dense token-level self-distillation.
Routing decisions are based on verifier outcomes or reward thresholds, allowing the system to preserve efficient RL updates on successful trajectories while extracting richer supervision from failures. This addresses the instability of self-distillation on already-correct samples and the coarse credit assignment of RL on failed samples.

OpenClaw-RL: Train Any Agent Simply by Talking

OpenClaw-RL: Train Any Agent Simply by Talking by Wang et al. (2026) extends self-distillation to interactive environments with dense feedback sources such as tool outputs, GUI transitions, user replies, terminal states, and environment state changes.
The agent’s original trajectory is replayed together with subsequent user or environment feedback. A hindsight-conditioned teacher evaluates how the model would have acted if it had observed the later information earlier. Tool outputs, GUI changes, and terminal states are converted into dense correction signals, and the same framework can support conversational agents, coding agents, terminal agents, GUI agents, software-engineering agents, and embodied control systems.
In this setting, an environment transition can provide both evaluative and directive information:
\[(s_t,a_t,s_{t+1}) \rightarrow \left( r_t, h_t \right)\]
- where \(r_t\) is a scalar reward or process reward, and \(h_t\) is a hindsight hint or textual correction used to form a teacher context.
A hindsight-guided OPD term can be written as:
\[\mathcal{L}_{HG\text{-}OPD} = \mathbb{E} \left[ \sum_t D \left( \pi_{\theta^-}(\cdot \mid s_t,h_t) \,\Vert\, \pi_\theta(\cdot \mid s_t) \right) \right]\]
The key requirement is that the next-state signal must be grounded in the environment. A real tool error, test failure, user correction, or GUI state change can identify what went wrong. A fabricated or ungrounded feedback signal can reproduce the same naive self-distillation failure mode: the model learns to imitate feedback-aware behavior without genuine feedback.
The following figure (source) shows an overview of the OpenClaw-RL infrastructure. Interaction streams come from Personal Agents, which are conversational and single-user agents hosted on personal devices, and General Agents, which include terminal, GUI, SWE, and tool-call agents hosted on cloud services. Samples flow into an asynchronous RL server with environment serving, PRM or judge reward computation, Megatron policy training, and SGLang policy serving.

The following figure (source) shows how OpenClaw can be optimized simply by using it, with the simulation result illustrating how interaction traces can become training signal.

The following figure (source) shows an overview of OpenClaw-RL. For personal agents, OpenClaw-RL supports both binary-reward optimization and on-policy distillation training, and their combination yields substantial performance gains. For general agentic RL, the framework integrates standard RLVR, step-wise rewards, and a simple standardization approach.

Self-Distilled Agentic Reinforcement Learning

In multi-turn agentic training, the RL-distillation relationship becomes asymmetric: RL is often best treated as the task-grounded primary objective, while self-distillation acts as a carefully controlled auxiliary signal.
Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) extends RL-distillation hybrids to multi-turn agents by treating OPSD as a gated auxiliary objective and keeping GRPO as the primary RL backbone.
The method flattens valid response tokens across a multi-turn trajectory and applies token-level self-distillation over the agent’s own rollout. The student context contains the task and previous generated tokens, while the self-teacher context additionally includes privileged training-only information such as retrieved skills.
The detached teacher-student gap is defined as:
\[\Delta_t = \operatorname{sg} \left( \log \pi_\theta^+(y_t \mid s_t^+) - \log \pi_\theta(y_t \mid s_t) \right)\]
The token-level gate converts this signal into a bounded trust weight:
\[g_t = \sigma(\beta \Delta_t)\]
The gated auxiliary loss applies self-distillation only according to token-level trust:
\[\ell_t^{\mathrm{SDAR}} = g_t \left( \log \pi_\theta^+(y_t \mid s_t^+) - \log \pi_\theta(y_t \mid s_t) \right)\]
The full objective can be summarized as an RL backbone plus a gated self-distillation auxiliary term:
\[\mathcal{L}(\theta) = \mathcal{L}_{GRPO}(\theta) + \lambda_{\mathrm{SDAR}} \mathcal{L}_{SDAR}(\theta)\]
- where \(\mathcal{L}_{GRPO}\) preserves verifier-driven policy optimization, while \(\mathcal{L}_{SDAR}\) injects dense token-level guidance only where the gated privileged teacher signal is trusted.
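The gate and the per-token term can be computed directly from the two log-probabilities. This is a numeric sketch of the formulas above, not the authors' implementation: the value of \(\beta\) and the example log-probabilities are assumptions, and in an autograd framework the gate would be computed from a detached gap.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sdar_token_terms(logp_teacher: float, logp_student: float, beta: float = 1.0):
    """Return (g_t, l_t) for one token.

    gap  = log pi^+(y_t | s_t^+) - log pi(y_t | s_t)
    g_t  = sigmoid(beta * gap), treated as a constant trust weight
    l_t  = g_t * gap
    """
    gap = logp_teacher - logp_student
    gate = sigmoid(beta * gap)
    return gate, gate * gap


# Illustrative tokens: (privileged self-teacher log-prob, student log-prob).
tokens = [(-0.2, -2.2), (-1.0, -1.0), (-3.0, -0.5)]
for lp_t, lp_s in tokens:
    gate, loss = sdar_token_terms(lp_t, lp_s, beta=1.0)
    print(f"gap={lp_t - lp_s:+.1f}  gate={gate:.3f}  token term={loss:+.3f}")
```

Tokens that the privileged view prefers far more than the student receive a gate close to one, while tokens it prefers less are downweighted toward zero.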

Multi-Domain and Multi-Teacher OPD

Multi-domain OPD extends the single-teacher setup by choosing a domain teacher for each prompt or trajectory. This is especially useful after Cascade RL or specialist-teacher training, where different checkpoints are strongest on different categories such as math, code, tool use, instruction following, long context, safety, or software engineering.
In Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026), MOPD is inserted after earlier RL stages because different benchmark categories fluctuate during training. Certain RLVR stages can reduce entropy and shorten reasoning traces, which may negatively affect mathematical reasoning, while RLHF-oriented optimization can trade off against instruction following. Multi-domain OPD is used to rebalance these capabilities by selecting strong intermediate domain teachers from the Cascade RL process.
The Nemotron-Cascade 2 MOPD advantage is defined on the student-sampled token rather than over the full vocabulary. If \(\pi_{\text{inf}}\) is the rollout policy used for generation, \(\pi_{\text{train}}\) is the policy being optimized, and \(\pi_{\text{domain}_i}\) is the selected domain teacher, then for state \(s_t=(x,y_{<t})\):
\[a_t^{MOPD} = \log \pi_{\text{domain}_i}(y_t \mid s_t) - \log \pi_{\text{train}}(y_t \mid s_t)\]
This advantage is positive when the domain teacher assigns higher probability to the sampled token than the current training policy. It therefore acts as a dense token-level distillation advantage that should converge toward zero as the student absorbs the teacher’s local preferences.
Because the rollout policy and training policy may differ in asynchronous systems, Nemotron-Cascade 2 applies truncated importance weighting:
\[r_t = \frac{ \pi_{\text{train}}(y_t \mid s_t) }{ \pi_{\text{inf}}(y_t \mid s_t) }\]
\[w_t = \operatorname{sg}[r_t] \mathbf{1} \left[ \epsilon_{\text{low}} \leq r_t \leq \epsilon_{\text{high}} \right]\]
The resulting surrogate objective is:
\[\mathcal{L}_{MOPD} = -\mathbb{E}_{x\sim\mathcal{D},y\sim\pi_{\text{inf}}(\cdot \mid x)} \left[ \frac{1}{|\mathcal{V}(y)|} \sum_{t\in\mathcal{V}(y)} w_t \operatorname{sg}[a_t^{MOPD}] \log \pi_{\text{train}}(y_t \mid s_t) \right]\]
- where \(\mathcal{V}(y)\) is the set of valid response tokens retained by the token mask.
This objective is important because it shows how OPD can be implemented inside an RL-style training engine without requiring full-vocabulary teacher distributions. It also makes systems-level staleness explicit: the rollout-generating policy and the learner policy may not be exactly identical, so the training loss must account for that mismatch.
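The per-token quantities in this objective can be traced numerically. The sketch below evaluates \(a_t^{MOPD}\), \(r_t\), \(w_t\), and the surrogate value for a toy rollout; the thresholds `eps_low` and `eps_high` and all log-probabilities are assumed values, and in a real training engine \(a_t^{MOPD}\) and \(w_t\) would be detached so that gradients flow only through \(\log \pi_{\text{train}}\).

```python
import math


def mopd_token_terms(logp_teacher, logp_train, logp_inf,
                     eps_low=0.5, eps_high=2.0):
    """Return (a_t, w_t) for one sampled token."""
    advantage = logp_teacher - logp_train        # a_t^MOPD
    ratio = math.exp(logp_train - logp_inf)      # r_t = pi_train / pi_inf
    # Truncated importance weight: zero when the ratio leaves the trusted band.
    weight = ratio if eps_low <= ratio <= eps_high else 0.0
    return advantage, weight


def mopd_surrogate(tokens, mask):
    """Value of the surrogate loss, averaged over mask-valid tokens."""
    valid = [tok for tok, keep in zip(tokens, mask) if keep]
    total = 0.0
    for logp_teacher, logp_train, logp_inf in valid:
        a, w = mopd_token_terms(logp_teacher, logp_train, logp_inf)
        total += w * a * logp_train
    return -total / len(valid)


# Illustrative tokens: (teacher, training-policy, rollout-policy) log-probs.
tokens = [(-0.2, -0.5, -0.6),   # small staleness, teacher prefers the token
          (-1.0, -0.3, -2.0),   # large staleness, weight is truncated to zero
          (-0.4, -0.4, -0.4)]   # excluded by the token mask
mask = [True, True, False]

loss = mopd_surrogate(tokens, mask)
print("surrogate value:", round(loss, 4))
```

The second token shows the role of truncation: the rollout policy assigned it much lower probability than the current learner, so the sample is treated as too stale to contribute.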
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) extends this pattern to an agent-focused pipeline that includes SFT, unified RLVR, MOPD warmup, asynchronous MOPD, and MTP boosting. The report emphasizes that MOPD warmup aligns student rollouts with teacher-supported distributions before distillation, which reflects a practical support-overlap requirement for successful multi-teacher OPD.

Practical Failure Modes and Stabilization Recipes

OPD should be viewed less as full-vocabulary distribution matching and more as a fragile communication protocol between teacher and student through a small set of locally plausible next-token choices. The teacher’s feedback is most useful when the student’s rollouts fall within states where the teacher can assign meaningful token preferences.
Qwen3, GLM-5, MiMo, Nemotron-Cascade 2, and Nemotron 3 Ultra use OPD or OPD-adjacent methods in post-training, while also highlighting that practical OPD can be more brittle than SFT or RL when teacher-student local token preferences stop overlapping.
Stabilization therefore requires attention to rollout quality, support overlap, token masks, truncation, teacher routing, rollout length, repetition, tokenizer alignment, and the distinction between sampled-token feedback and full-distribution matching.

Thinking-Pattern Compatibility and Token Overlap

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe by Li et al. (2026) finds that OPD success depends on compatible teacher-student thinking patterns and on progressive alignment over a small shared set of high-probability tokens, which can carry most of the probability mass.
A useful diagnostic is local support overlap:
\[\operatorname{Overlap}_k(s_t) = \frac{ \left| \operatorname{Top}_k p_T(\cdot \mid s_t) \cap \operatorname{Top}_k p_S(\cdot \mid s_t) \right| }{k}\]
This overlap should be monitored on student-visited prefixes, not only on clean teacher or dataset prefixes. Benchmark accuracy alone does not indicate whether the teacher’s token-level supervision will be useful in the states the student actually visits.
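The overlap diagnostic is straightforward to compute once both next-token distributions are available at a student-visited prefix. The two five-token distributions below are made-up values for illustration.

```python
def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def overlap_k(p_teacher, p_student, k):
    """|Top_k(p_T) ∩ Top_k(p_S)| / k at a single prefix."""
    return len(top_k_ids(p_teacher, k) & top_k_ids(p_student, k)) / k


# Illustrative next-token distributions at one student-visited prefix.
p_teacher = [0.50, 0.30, 0.10, 0.05, 0.05]
p_student = [0.40, 0.05, 0.35, 0.10, 0.10]

print("overlap@2:", overlap_k(p_teacher, p_student, k=2))
print("overlap@3:", overlap_k(p_teacher, p_student, k=3))
```

In practice this value would be averaged over many student-visited prefixes and tracked over training, since a single prefix says little on its own.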
The practical recipe is to use an off-policy cold start before OPD, select teachers whose reasoning style is compatible with the student, monitor overlap among high-probability tokens, avoid teachers that pull an RL-improved student backward toward older reasoning patterns, and track whether teacher continuation advantage decays as rollout prefixes lengthen.
The following figure (source) shows an overview of the method. JustRL-1.5B is obtained by applying RL to DeepSeek-Distill-1.5B, and Skywork-OR1-Math-7B is obtained by applying RL to DeepSeek-Distill-7B.

The following figure (source) presents a systematic study of OPD training dynamics, progressing from empirical conditions through token-level mechanism to practical recipe.

Length Inflation and Repetition Collapse

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models by Luo et al. (2026) identifies abrupt length inflation, repetition saturation, and truncation collapse as a major OPD failure mode. It proposes StableOPD, combining a reference-based divergence constraint with rollout mixture distillation.
Length inflation can occur because once the student enters repetitive or overlong prefixes, the teacher may still assign locally plausible probability to continuation tokens under those prefixes. This creates a self-reinforcing loop where degenerate prefixes produce locally acceptable token-level feedback even though the global trajectory is poor.
Practical stabilization requires tracking average rollout length, truncation rate, and repetition rate during training rather than relying only on validation accuracy. It also requires adding reference-based divergence constraints, mixing on-policy rollouts with cleaner reference trajectories, treating repeated tokens as high-risk examples, and stopping or downweighting truncation-dominated batches.
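The rollout statistics named above can be tracked with very little code. The sketch below is one possible way to do it, not the StableOPD implementation: it assumes rollouts are lists of token ids, treats a rollout that reaches `max_len` as truncated, and uses duplicated n-grams as a hypothetical proxy for repetition.

```python
def repetition_rate(tokens, n=3):
    """Fraction of n-grams that duplicate another n-gram in the same rollout."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def rollout_health(rollouts, max_len, n=3):
    """Aggregate length, truncation, and repetition over a batch of rollouts."""
    count = len(rollouts)
    return {
        "mean_length": sum(len(r) for r in rollouts) / count,
        # Assumption: hitting the generation limit counts as truncation.
        "trunc_rate": sum(len(r) >= max_len for r in rollouts) / count,
        "repeat_rate": sum(repetition_rate(r, n) for r in rollouts) / count,
    }

batch = [[1, 2, 3, 4, 5], [7, 8, 7, 8, 7, 8, 7, 8]]
stats = rollout_health(batch, max_len=8, n=2)
print(stats)
```

A sudden joint rise in all three values is the pattern the length-inflation analysis warns about, so logging them per training step is more informative than logging validation accuracy alone.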

Sampled-Token OPD and Local Support Matching

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes by Fu et al. (2026) argues that sampled-token OPD is attractive because token-level feedback has better worst-case variance scaling than sequence-level reverse KL, but it is biased and fragile because it observes only one sampled token rather than the teacher’s local support.
A practical fix is to replace single sampled-token supervision with teacher top-\(K\) local support matching, where both teacher and student probabilities are renormalized over the teacher’s plausible next-token set. Another fix is to use top-\(p\) rollout sampling so the student is less likely to drift into extremely low-probability prefixes where teacher guidance is unreliable.
Define the teacher’s local support set:
\[\mathcal{K}_T(s_t) =\operatorname{TopK} \left( p_T(\cdot \mid s_t) \right)\]
Renormalize teacher and student probabilities over this support:
\[\tilde{p}_T(a \mid s_t) =\frac{ p_T(a \mid s_t) }{ \sum_{a' \in \mathcal{K}_T(s_t)} p_T(a' \mid s_t) }, \qquad a \in \mathcal{K}_T(s_t)\] \[\tilde{p}_S(a \mid s_t) =\frac{ p_S(a \mid s_t) }{ \sum_{a' \in \mathcal{K}_T(s_t)} p_S(a' \mid s_t) }, \qquad a \in \mathcal{K}_T(s_t)\]
The local-support loss is then:
\[\mathcal{L}_{K} =\sum_t D_{KL} \left( \tilde{p}_T(\cdot \mid s_t) \,\Vert\, \tilde{p}_S(\cdot \mid s_t) \right)\]
Additional safeguards include masking special tokens and tokenizer artifacts, preferring truncated reverse KL over one-token log-ratio updates when teacher top-\(K\) logits are affordable, and evaluating whether per-token advantages combine into coherent gradient directions rather than canceling across positions.
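The local-support loss can be sketched for a single prefix as follows. This is a minimal illustration of the renormalize-then-KL step defined above, assuming dense probability lists and strictly positive student probability on the teacher's top-\(K\) set; the function name is hypothetical, and a real implementation would operate on batched logits.

```python
import math

def local_support_kl(teacher_probs, student_probs, k):
    """KL(p~_T || p~_S), both renormalized over the teacher's top-k tokens."""
    support = sorted(range(len(teacher_probs)),
                     key=lambda i: teacher_probs[i], reverse=True)[:k]
    z_t = sum(teacher_probs[i] for i in support)   # teacher mass on the support
    z_s = sum(student_probs[i] for i in support)   # student mass on the support
    kl = 0.0
    for i in support:
        pt = teacher_probs[i] / z_t
        ps = student_probs[i] / z_s
        kl += pt * math.log(pt / ps)
    return kl

teacher = [0.6, 0.3, 0.1]
student = [0.2, 0.2, 0.6]
print(local_support_kl(teacher, student, k=2))
```

Summing this value over the positions of a student rollout gives \(\mathcal{L}_{K}\). Note that the student's mass outside the teacher support does not appear in the loss after renormalization.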

Privileged On-Policy Self-Distillation Caveats

On-policy self-distillation (OPSD) uses the same model as student and teacher, but conditions the teacher on privileged information such as a gold solution, final answer, runtime error, environmental feedback, prior-attempt critique, or verified reasoning trace. This can make the model act as its own teacher without requiring a stronger external model.
A context-explicit OPSD objective is:
\[\mathcal{L}_{\mathrm{OPSD}} =\sum_t D\left( \pi_T(\cdot \mid x,c,y_{<t}) \,\Vert\, \pi_S(\cdot \mid x,y_{<t}) \right)\]
- where \(c\) is teacher-only privileged context. Vanilla OPD corresponds to \(c=\varnothing\), while privileged-context OPD sets \(c\) to a final answer, a gold solution, or another piece of training-only information.
Recent evidence shows that privileged OPSD is not uniformly beneficial for thinking models. Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) reports that privileged-context OPSD can degrade long-budget thinking models by suppressing forking, verification, backtracking, and hedging behavior. The degradation is specific to privileged context rather than on-policy training itself: unprivileged OPD can improve the same student, while privileged OPD reverses the gain.
The failure appears at high-entropy forking positions where multiple reasoning paths remain plausible. Under vanilla OPD, tokens such as reconsideration, uncertainty, and branching markers can carry positive advantage. Once privileged context is added, the teacher may already know the answer and assign negative advantage to exploratory moves, causing the student to produce fewer of the deliberative behaviors that support long-budget reasoning.
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) similarly traces degradation to suppression of epistemic verbalization, where models lose explicit uncertainty markers such as checking, reconsideration, and uncertainty expression. The paper finds that richer teacher conditioning can produce shorter and more confident traces, which may help narrow in-domain tasks but hurt out-of-domain math reasoning when uncertainty expression supports exploration and error correction.
Fixed-teacher self-distillation can also be more stable than moving-target self-distillation. In a naive moving-teacher setup, the model becomes more confident, then uses this increasingly confident policy as its next teacher, amplifying response-length shrinkage and epistemic suppression over time. This is one reason many practical OPSD or OPD systems prefer frozen teacher checkpoints, EMA teachers with caution, or gated auxiliary objectives.
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) addresses privilege-induced style drift by contrasting the teacher-student gap under a correct hint with the gap under an incorrect hint. This subtracts the generic style shift induced by having a hint at all and leaves a signal more concentrated on task-bearing tokens.
A contrastive signal can be written as:
\[e_t^{ctr} =\left[ \log p_\theta(\hat{y}_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(\hat{y}_t \mid x,\hat{y}_{<t}) \right] -\left[ \log p_\theta(\hat{y}_t \mid x,c^-,\hat{y}_{<t}) -\log p_\theta(\hat{y}_t \mid x,\hat{y}_{<t}) \right]\]
- where \(c^+\) is a correct hint and \(c^-\) is a wrong or contrastive hint.
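The contrastive signal can be computed directly from three sets of per-token log-probabilities of the sampled rollout. The sketch below assumes these have already been obtained by scoring the same rollout under the correct-hint, contrastive-hint, and plain prompts; the function name is illustrative and this is not the RLCSD code.

```python
def contrastive_signal(logp_pos, logp_neg, logp_plain):
    """Per-token e_t^ctr from log-probs of the sampled tokens under three contexts."""
    signal = []
    for lp_pos, lp_neg, lp_plain in zip(logp_pos, logp_neg, logp_plain):
        gap_pos = lp_pos - lp_plain   # shift caused by the correct hint c+
        gap_neg = lp_neg - lp_plain   # shift caused by the contrastive hint c-
        signal.append(gap_pos - gap_neg)
    return signal

# Token 0 moves equally under both hints (generic style shift, cancels out);
# token 1 moves in opposite directions (task-bearing, kept).
print(contrastive_signal([-0.5, -1.0], [-0.5, -3.0], [-1.0, -2.0]))
```

Algebraically the plain-prompt term cancels, so the signal reduces to the log-ratio between the two hinted views; keeping both gaps explicit makes it easier to log the generic hint-induced shift separately.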
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t gives a reward-log-ratio diagnosis of the same failure. The self-distillation teacher is not near the reward-tilted optimum when it assigns high probability to responses that look as if the model had a reference solution, whether or not the response is correct.
In naive self-distillation, the student samples a response, and that same response is scored under two prompts: the student prompt is the plain task, while the teacher prompt includes privileged information. The teacher and student are the same network, differing only in the injected context. The objective is a per-token full-distribution KL between teacher and student logits over the student’s own rollouts, with no reward term and the teacher’s gradient stopped:
\[\mathcal{L}_{naive}(\theta) =\mathbb{E}_{x \sim \mathcal{D},y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} D_{KL} \left( \pi_{\theta^-}(\cdot \mid x,c,y_{<t}) \,\Vert\, \pi_\theta(\cdot \mid x,y_{<t}) \right) \right]\]
- where \(c\) can be a gold answer, a reference solution, or feedback from a previous attempt. The student never sees \(c\) at inference time, yet it is trained to reproduce the \(c\)-conditioned teacher token by token.
The fundamental assumption is that, for some useful privileged context \(c\), the privileged self-teacher distribution approximates the reward-tilted teacher:
\[\pi_\theta(s \mid c) \approx \pi_T^*(s)\]
- The success and stability of the method depend strongly on whether this approximation holds. If the privileged context makes the teacher up-weight reward-bearing reasoning, the signal can help. If it makes the teacher up-weight a feedback-aware response shape, the signal can hurt.
A useful diagnostic is again the sequence-level teacher-student log-ratio:
\[\Delta_T(s) = \log \pi_T(s) - \log \pi_S(s)\]
- For the reward-tilted optimum, this log-ratio should be proportional to reward plus a constant. A good teacher gives higher log-ratio to higher-reward responses.
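One way to turn this diagnostic into a number is to compare the mean log-ratio on correct and incorrect responses. The sketch below assumes binary rewards and per-token log-probabilities of each response under teacher and student; the helper names are hypothetical.

```python
def sequence_log_ratio(teacher_token_logps, student_token_logps):
    """Delta_T(s): summed teacher log-probs minus summed student log-probs."""
    return sum(teacher_token_logps) - sum(student_token_logps)

def reward_separation(log_ratios, rewards):
    """Mean log-ratio on correct responses minus mean on incorrect responses."""
    good = [d for d, r in zip(log_ratios, rewards) if r == 1]
    bad = [d for d, r in zip(log_ratios, rewards) if r == 0]
    return sum(good) / len(good) - sum(bad) / len(bad)

ratios = [2.0, 1.0, -1.0, -2.0]   # illustrative Delta_T values
rewards = [1, 1, 0, 0]
print(reward_separation(ratios, rewards))
```

A clearly positive separation suggests the teacher's preferences track reward; a value near zero suggests the teacher is rewarding something else, such as response shape.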
In the Polaris math analysis described in Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t, an RL expert gives correct responses higher probability than incorrect ones, while a gold-answer-conditioned self-distillation teacher barely separates correct from incorrect responses. On trained self-distillation outputs that cite fabricated references, the self-distillation teacher strongly up-weights the fabricated-reference response shape whether or not the answer is correct, while the RL expert down-weights it.
The following figure (source) shows per-sequence log-ratio to the student, \(\log \pi_T - \log \pi_S\), on Polaris math. The left panel shows an RL-trained teacher increases the probability of correct student responses over incorrect ones while a self-distillation teacher does not. The right panel shows the self-distillation teacher increases the probability of responses it would have given, irrespective of correctness. Together, these findings suggest guidance from self-distillation teachers poorly tracks the downstream reward while expert teachers provide better signal.

This mismatch produces three measurable issues:
- Hallucination at inference, where the trained model invents and cites feedback, guidance, a reference solution, or a previous attempt that was never present in the prompt.
- Degraded reasoning, where epistemic verbalization, hedging, backtracking, and self-checking tokens collapse, leaving the model more overconfident and less likely to re-examine alternatives.
- Poor out-of-distribution generalization, where in-distribution accuracy can remain intact or even improve, while OOD accuracy falls because the model learned a narrow privileged-context response template.
The following figure (source) shows in-distribution and out-of-distribution accuracy over training for naive self-distillation vs. RL across chemistry, tool-use, and Polaris math. The figure illustrates that in-distribution behavior can vary by dataset, while out-of-distribution performance consistently trails RL under naive self-distillation.

The broader lesson is that dense token-level supervision is not automatically good. It must be aligned with the behavior the model should preserve at inference time. In thinking models, the “correct” teacher may be locally too confident because it has access to information unavailable to the deployed student.

Fixes and Partial Mitigations

Prompt optimization can reduce privileged-context artifacts, but it does not remove the core distribution mismatch if the teacher still receives information the student will not have at inference. A better wrapper around feedback can reduce hallucination, but the student may still learn a narrow feedback-conditioned response pattern.
Loss modifications can help when they reduce the influence of the pure distillation term. For example, masking tokens where the RL advantage disagrees with the teacher-student log-probability gap can prevent the distillation loss from directly opposing reward.
A simple agreement mask is:
\[m_t =\mathbf{1} \left[ \operatorname{sign} \left( A_t^{\mathrm{RL}} \right) =\operatorname{sign} \left( \log \pi_T(a_t \mid s_t) -\log \pi_\theta(a_t \mid s_t) \right) \right]\]
The masked distillation loss is:
\[\mathcal{L}_{\mathrm{masked}} =\sum_t m_t D\left( \pi_T(\cdot \mid s_t) \,\Vert\, \pi_\theta(\cdot \mid s_t) \right)\]
Another mitigation is to keep RL as the dominant term and use distillation only as a bounded auxiliary:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) + \lambda_D \mathcal{L}_{D}(\theta), \qquad \lambda_D \ll 1\]
These mitigations share the same principle: the sparse reward or verifier should anchor update direction, while distillation should refine token-level credit only where it agrees with reward or is otherwise trusted.
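The agreement mask and the bounded auxiliary objective can be sketched together. The code below assumes per-token RL advantages, sampled-token log-probabilities, and per-token divergences are already available; the default \(\lambda_D = 0.05\) is an arbitrary illustrative value, not a recommendation from any cited paper.

```python
def agreement_mask(rl_advantages, teacher_logps, student_logps):
    """m_t = 1 when sign(A_t^RL) equals sign(log pi_T - log pi_theta), else 0."""
    def sign(v):
        return (v > 0) - (v < 0)
    # Note: with this sign convention, a zero advantage agrees only with a zero gap.
    return [int(sign(a) == sign(lt - ls))
            for a, lt, ls in zip(rl_advantages, teacher_logps, student_logps)]

def masked_distill_loss(mask, per_token_divergence):
    """Sum of per-token divergences at positions where the mask is 1."""
    return sum(m * d for m, d in zip(mask, per_token_divergence))

def combined_loss(rl_loss, distill_loss, lambda_d=0.05):
    """RL stays dominant; distillation enters as a small auxiliary term."""
    return rl_loss + lambda_d * distill_loss

mask = agreement_mask([1.0, -1.0, 1.0], [-0.5, -2.0, -3.0], [-1.0, -1.0, -1.0])
loss_d = masked_distill_loss(mask, [0.25, 0.5, 1.0])
print(mask, loss_d, combined_loss(1.0, loss_d))
```

In the toy call, the third token is dropped because the reward signal favors it while the teacher disfavors it, which is exactly the conflict the mask is meant to remove.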

Practical Training Loop

A typical on-policy distillation training loop begins by sampling prompts from a task dataset, synthetic prompt pool, or curriculum. The student then generates one or more rollouts while recording token identities, attention masks, stop positions, and per-token student log-probabilities. Each rollout is sent to a teacher or teacher router, which computes token-level log-probabilities conditioned on the exact student prefixes. A divergence, log-ratio, or advantage-like loss is computed at valid response tokens, optional clipping and masking suppress unstable updates, and gradients are propagated only through the student while the teacher remains fixed.
A compact loop is:
\[x \sim \mathcal{D}\] \[\hat{y} \sim p_S^\theta(\cdot \mid x)\] \[\left\{ \log p_T(\hat{y}_t \mid x,\hat{y}_{<t}) \right\}_{t=1}^{|\hat{y}|} =\operatorname{TeacherScore}(x,\hat{y})\] \[\theta \leftarrow \theta -\eta \nabla_\theta \mathcal{L}_{OPD}(\theta)\]
Because the teacher does not need to generate its own trajectory, but only evaluate the student’s rollout, teacher inference can be cheaper than full teacher rollout generation. However, the teacher must still process the full student sequence and return accurate log-probabilities under the correct chat template, tokenizer, and prefix structure.
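The loop can be made concrete with a deliberately tiny toy. The sketch below is an assumption-heavy illustration rather than a production recipe: it collapses all prefixes into one context-free state, represents the student as a vector of logits, uses a fixed teacher distribution as the scoring service, and applies a score-function estimate of the reverse-KL gradient from sampled tokens only. All names and hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(p_s, p_t):
    """KL(p_S || p_T) for two dense distributions with positive entries."""
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))

def train_toy_opd(teacher_probs, steps=500, batch=16, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(teacher_probs)            # student parameters theta
    for _ in range(steps):
        p_s = softmax(logits)
        grad = [0.0] * len(logits)
        for _ in range(batch):
            a = rng.choices(range(len(p_s)), weights=p_s)[0]  # student samples
            teacher_logp = math.log(teacher_probs[a])         # teacher scores it
            log_ratio = math.log(p_s[a]) - teacher_logp
            # Score-function estimate of the gradient of KL(p_S || p_T).
            for i in range(len(logits)):
                indicator = 1.0 if i == a else 0.0
                grad[i] += log_ratio * (indicator - p_s[i]) / batch
        # Only the student is updated; the teacher stays fixed.
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return softmax(logits)

teacher = [0.7, 0.2, 0.1]
before = reverse_kl([1 / 3] * 3, teacher)
after = reverse_kl(train_toy_opd(teacher), teacher)
print(before, after)
```

The teacher is never asked to generate; it only returns a log-probability for the token the student sampled, which is the cost asymmetry described above.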

Systems and Infrastructure Considerations

OPD systems usually require a rollout engine, a teacher scoring service, a log-probability transport format, a masking and loss-computation module, and a learner that can consume scored rollouts. Asynchronous execution is common because rollout generation, teacher scoring, and learner updates operate at different speeds.
The main systems concerns are rollout staleness, teacher throughput, teacher-student tokenizer compatibility, batching efficiency, log-probability compression, invalid-token masking, prompt routing, and reproducibility of which teacher scored which rollout.
For MOPD, teacher routing adds an additional layer. A prompt or rollout may be assigned to a math teacher, coding teacher, long-context teacher, agentic teacher, safety teacher, or general teacher. The router may use prompt metadata, domain classifiers, benchmark categories, validation performance, entropy, or heuristic task labels.
In asynchronous OPD, the rollout policy used by the inference engine may not exactly match the policy being updated by the training engine. This motivates importance weighting, freshness windows, or replay-buffer constraints, as in the Nemotron-Cascade 2 objective.
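A freshness window and a truncated importance weight are two generic ways to handle this mismatch. The sketch below is not the Nemotron-Cascade 2 objective; it assumes each rollout carries the policy version that produced it and the per-token log-probabilities recorded at sampling time, and the `max_lag` and `cap` values are illustrative.

```python
import math

def is_fresh(rollout_version, learner_version, max_lag=2):
    """Keep a rollout only if it is at most max_lag policy versions old."""
    return learner_version - rollout_version <= max_lag

def truncated_importance_weights(current_logps, rollout_logps, cap=2.0):
    """Per-token ratio pi_current / pi_rollout, clipped from above at cap."""
    return [min(math.exp(c - r), cap)
            for c, r in zip(current_logps, rollout_logps)]

print(is_fresh(rollout_version=3, learner_version=5))
print(truncated_importance_weights([-1.0, -1.0], [-1.0, -3.0]))
```

The cap bounds the variance contributed by tokens the current policy likes far more than the stale rollout policy did, at the cost of some bias.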
Token masking is crucial. Systems should mask padding, prompt tokens, stop tokens after termination, invalid tool-output regions, hidden system metadata, formatting artifacts, and tokens where teacher and student tokenization cannot be aligned reliably.
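A minimal loss mask covering a few of these cases might look as follows. The sketch assumes a single sequence of token ids with a known prompt length, keeps the first stop token as a training target, and masks everything after it; `excluded_ids` is a hypothetical set standing in for tool-output or formatting tokens.

```python
def response_token_mask(token_ids, prompt_len, pad_id, eos_id, excluded_ids=()):
    """Return 1 for tokens that should contribute to the loss, 0 otherwise."""
    mask, finished = [], False
    for pos, tok in enumerate(token_ids):
        keep = (pos >= prompt_len and not finished
                and tok != pad_id and tok not in excluded_ids)
        mask.append(int(keep))
        if pos >= prompt_len and tok == eos_id:
            finished = True   # everything after the first stop token is masked
    return mask

tokens = [5, 6, 7, 8, 99, 2, 9, 0, 0]   # 2 prompt tokens, eos=2, pad=0
print(response_token_mask(tokens, prompt_len=2, pad_id=0, eos_id=2,
                          excluded_ids={99}))
```

Alignment-based masking for mismatched tokenizers needs additional logic and is not covered by this sketch.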
Teacher payload design should match the loss. Full forward KL needs teacher distributions, reverse-KL sampled-token OPD needs only sampled-token teacher log-probabilities, and JSD or local-support matching may require top-\(K\) distributions from both teacher and student.
For production-scale OPD, implementation details can dominate the algorithmic presentation. Distilling 100B+ Models 40x Faster with TRL highlights generation buffers, batched teacher scoring, and compressed log-probability transfer, while Efficient Memory Management for Large Language Model Serving with PagedAttention by Kwon et al. (2023) introduces the PagedAttention memory-management design that underlies vLLM-style high-throughput teacher inference.

Evaluation and Monitoring

OPD should not be evaluated only by in-distribution accuracy. Since the training distribution is shaped by the student’s own rollouts, failure often first appears in behavioral or distributional metrics.
Core task metrics include accuracy, pass@\(k\), verifier pass rate, judge score, tool-use success, unit-test success, and agent completion rate.
Core rollout metrics include average length, truncation rate, repetition rate, stop-token rate, malformed-output rate, queue age, rollout-policy version, and teacher-score coverage.
Core teacher-student metrics include KL, entropy, support overlap, teacher-student log-ratio, teacher disagreement, sampled-token advantage distribution, and top-\(k\) overlap.
For privileged OPSD, evaluation must also track hallucinated-reference rate, feedback-mention rate, epistemic-verbalization rate, branching markers, backtracking markers, and out-of-distribution accuracy.
A compact monitoring dashboard for OPD should include:
\[\left\{ \operatorname{Acc}_{IID}, \operatorname{Acc}_{OOD}, \operatorname{Overlap}_k, \mathbb{E}[|\hat{y}|], \operatorname{TruncRate}, \operatorname{RepeatRate}, H_{\mathrm{priv}}, \operatorname{EV} \right\}\]
- where \(H_{\mathrm{priv}}\) measures hallucinated privileged context and \(\operatorname{EV}\) measures epistemic verbalization.
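The two behavioral entries, \(H_{\mathrm{priv}}\) and \(\operatorname{EV}\), are the least standard items in this dashboard. A crude first approximation is phrase matching over sampled responses, as sketched below. The marker lists are hypothetical placeholders; a real system would likely use a tuned lexicon or a classifier.

```python
def marker_rate(responses, markers):
    """Fraction of responses containing at least one marker phrase."""
    if not responses:
        return 0.0
    hits = sum(any(m in r.lower() for m in markers) for r in responses)
    return hits / len(responses)

# Hypothetical phrase lists, chosen only for illustration.
EV_MARKERS = ("wait", "let me check", "alternatively")
PRIV_MARKERS = ("reference solution", "the feedback", "previous attempt")

responses = [
    "Wait, let me check that.",
    "The answer is 4.",
    "As the reference solution shows, x = 2.",
]
print(marker_rate(responses, EV_MARKERS), marker_rate(responses, PRIV_MARKERS))
```

For \(H_{\mathrm{priv}}\), matches should only be counted on prompts that contained no such context, since mentioning a reference that was actually provided is not a hallucination.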

Relationship to Recent LLM Post-Training Recipes

OPD has become a core consolidation primitive in recent LLM recipes. It appears most prominently when several specialist teachers must be merged into one deployable model without rerunning all RL stages jointly, and this pattern is increasingly visible across recent post-training reports and recipe analyses such as Frontier post-training recipe review with Finbarr Timbers.
MiMo-V2-Flash Technical Report by Xiao et al. (2026) uses MOPD as a final consolidation stage after training several domain-specialist teachers. The student samples its own trajectories, the relevant teacher scores those trajectories, and token-level feedback transfers specialist capabilities into one general student.
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD after Cascade RL to recover regressions and rebalance capabilities. The teachers are selected from strong intermediate checkpoints, so MOPD acts as a way to preserve the best stage-specific capabilities before later training continues.
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) uses SFT, unified RLVR, MOPD warmup, multi-teacher OPD, and MTP boosting. The warmup stage is important because teacher models trained with different data or recipes may have distributions that do not overlap well with the student; aligning the student toward teacher-supported distributions makes subsequent MOPD more reliable.
GLM-5: from Vibe Coding to Agentic Engineering and Kimi K2.5: Visual Agentic Intelligence show that not every recent frontier recipe foregrounds MOPD. GLM-5 emphasizes staged reinforcement learning across reasoning, agentic, and general capabilities with cross-stage distillation, while Kimi K2.5 emphasizes joint text-vision reinforcement learning across coding, vision, reasoning, and agentic tasks. These recipes suggest that OPD is an increasingly important primitive, but not the only path to frontier post-training.
Aligning Language Models from User Interactions by Kleine Buening et al. (2026) uses user follow-up messages as hindsight context for self-distillation, updating the model toward the behavior it would have produced after seeing the user’s correction or clarification. This extends the OPD-style idea beyond benchmark rollouts into deployment interactions where user follow-ups provide implicit correction signals.

When On-Policy Distillation is Preferred

On-policy distillation is preferred when the student must handle long-horizon reasoning, coding, tool use, or agentic workflows where early mistakes compound into unfamiliar states.
OPD is preferred when dense token-level feedback is more useful than sparse scalar rewards, and when a strong frozen teacher can evaluate the student’s actual prefixes.
OPD is preferred when off-policy SFT has already produced a capable enough student to generate informative rollouts, but fixed teacher traces no longer cover the mistakes the student makes at inference time.
OPD is preferred when the goal is to consolidate specialist teachers without training a single monolithic RL run.
OPD is preferred when the team can support the systems complexity of rollout generation, teacher scoring, token masking, log-probability transport, and freshness control.
OPD is less attractive when the student is too weak to generate meaningful trajectories, when teacher-student support overlap is poor, when teacher scoring is too expensive, when tokenization or chat-template mismatch makes log-probabilities unreliable, or when privileged teacher context suppresses behaviors the student needs at inference time.
Privileged OPSD should be used cautiously. It is most appropriate when the privileged context changes task-relevant knowledge rather than merely changing style, confidence, or trace shape. If the privileged teacher becomes shorter, more certain, or more reference-like because it already knows the answer, direct distillation can damage reasoning robustness.
The practical default is therefore staged: begin with off-policy SFT or sequence-level distillation, use RL or RLVR to improve task behavior, introduce OPD once the student can generate useful rollouts, and use MOPD when multiple specialist teachers must be consolidated into one model.

Self-Distillation (SD)

Self-distillation is a family of methods in which the teacher is not a separate larger model. The teacher can be the same model under a different context, an earlier checkpoint, an exponential-moving-average copy, a higher-confidence branch, an ensemble of model views, a future-hindsight branch, or a gated auxiliary policy. This makes self-distillation attractive when an external teacher is unavailable, expensive, operationally inconvenient, insufficiently specialized for the task, or unable to represent enterprise-specific behavior.

Self-distillation extends the distillation paradigm by removing the strict requirement for a separate teacher model. Instead, the student learns from a teacher signal derived from itself across time, contexts, checkpoints, roles, or different conditioning views of the same underlying model. The central idea is that self-distillation can convert latent capability, hindsight information, privileged context, retrieved skills, runtime feedback, tool feedback, user interaction feedback, or verifier evidence into dense supervision.

Self-distillation is not automatically self-improvement. It can unlock capabilities already present in the model, stabilize post-training, reduce reliance on frontier teachers, support continual adaptation, and convert interaction traces into trainable supervision. It can also amplify the model’s own biases, suppress useful uncertainty, overfit to privileged-context style, leak training-only context into inference-time behavior, or collapse reasoning behavior when the teacher view is too different from the deployed student view.

In modern LLM training, self-distillation has evolved beyond compression into a broader framework for iterative self-improvement, reasoning refinement, continual learning, enterprise-specific behavior transfer, and reinforcement-learning-style policy optimization. Modern variants often combine self-distillation with on-policy rollouts, allowing models to learn from their own outputs while still benefiting from teacher-style dense supervision.

The most important practical distinction is how the self-teacher is constructed. A self-teacher may be an earlier checkpoint, an EMA checkpoint, an ensemble of model views, the same model under privileged context, the same model conditioned on runtime feedback, the same model conditioned on retrieved skills, or the same model conditioned on future user interaction. The loss may look like ordinary distillation, but teacher construction determines whether the signal is useful, noisy, or harmful.

Common self-distillation forms include checkpoint-based self-distillation, where a later student learns from outputs produced by earlier or averaged checkpoints; view- or context-based self-distillation, where the same model produces different supervisory signals under different prompts, augmentations, hints, retrieved skills, or conditioning views; and on-policy self-distillation, where the student samples its own rollout and the self-teacher evaluates that rollout under a richer or privileged context.

Core Formulation

In self-distillation, both the student and teacher are derived from the same base model. Let \(p_S^\theta\) denote the student policy, and let \(p_T^\phi\) denote the teacher policy, which may correspond to an earlier checkpoint, an ensemble, an EMA model, a stopped-gradient copy, or the same model under privileged conditioning.
The general training objective remains:
\[\mathcal{L}_{SD}(\theta) =\mathbb{E} \left[ \sum_{t=1}^{|y|} D \left( p_T^\phi(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right) \right]\]
The key distinction is not the surface form of the loss, but the construction of the teacher signal. In standard teacher-student distillation, the teacher is usually a different model. In self-distillation, the teacher signal is produced from the same model family, the same base weights, a related checkpoint, or the same model under a different informational view.
A broad self-distillation template can be written as:
\[p_T^\phi(\cdot \mid \tau_T(x,y_{<t})) \qquad \text{and} \qquad p_S^\theta(\cdot \mid \tau_S(x,y_{<t}))\]
- where \(\tau_T\) and \(\tau_S\) are teacher and student context transformations. The teacher transformation may add a verified answer, hint, retrieved skill, critique, tool result, runtime error, future user correction, environment observation, or different prompt view, while the student transformation preserves the information available at inference time.
The practical question is therefore:
\[p_T^\phi(\cdot \mid \tau_T(x,y_{<t})) \approx p_{\mathrm{desired}}(\cdot \mid \tau_S(x,y_{<t}))\]
If this approximation holds, the self-teacher provides useful dense supervision. If it fails, the student may learn a context artifact rather than the intended capability.

Temporal Self-Distillation

One of the earliest forms of self-distillation uses an earlier checkpoint of the same model as the teacher:
\[p_T =p_S^{\theta_{\text{old}}}\]
The student is trained to remain close to a historical version of itself while continuing to improve on new data. This is useful because earlier checkpoints may preserve capabilities that later SFT, preference tuning, RL, or domain adaptation could degrade.
Temporal self-distillation is especially relevant in long post-training recipes where a model moves through SFT, preference tuning, RLVR, OPD, and domain-specific RL. A later checkpoint may improve on a target benchmark while regressing on instruction following, writing quality, safety calibration, long-context behavior, or formatting reliability. A historical self-teacher can serve as a capability-preserving anchor.
A temporal regularization objective can be written as:
\[\mathcal{L}(\theta) =\mathcal{L}_{\text{task}}(\theta) + \lambda \mathbb{E}_{x} \left[ D_{KL} \left( p_{\theta_{\text{old}}}(\cdot \mid x) \,\Vert\, p_{\theta}(\cdot \mid x) \right) \right]\]
- where \(\lambda\) controls how strongly the current model is constrained to preserve the older checkpoint’s behavior.
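The temporal regularization objective can be sketched with plain probability lists. The code below assumes the task loss is computed elsewhere and that next-token distributions from the frozen old checkpoint and the current model are available for the same inputs; the names and the default \(\lambda = 0.1\) are illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) for dense distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def temporal_sd_loss(task_loss, old_probs_batch, new_probs_batch, lam=0.1):
    """Task loss plus lam times the mean KL from the old checkpoint to the new one."""
    anchor = sum(kl(po, pn) for po, pn in zip(old_probs_batch, new_probs_batch))
    anchor /= len(old_probs_batch)
    return task_loss + lam * anchor

old = [[0.5, 0.5]]
new = [[0.25, 0.75]]
print(temporal_sd_loss(1.0, old, new))
```

When the current model has not drifted from the old checkpoint, the anchor term vanishes and the objective reduces to the task loss.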
Temporal self-distillation is not the same as online distillation unless the teacher changes during the student’s training. If the old checkpoint is frozen, the method is offline in its teacher-update pattern, even though the teacher comes from the same model lineage.
Self-Distillation Enables Continual Learning by Shenfeld et al. (2026) uses self-distillation fine-tuning to support continual learning from demonstrations, showing how a demonstration-conditioned model can provide on-policy training signals that reduce catastrophic forgetting while acquiring new skills.

Ensemble and Multi-View Self-Distillation

Another common variant constructs the teacher from multiple views or predictions of the same model. The teacher distribution may be an average over checkpoints, prompt templates, sampled completions, adapters, retrieved contexts, decoding temperatures, or other conditioning views:
\[p_T(\cdot \mid x) =\frac{1}{K} \sum_{k=1}^{K} p_{\theta_k}(\cdot \mid \tau_k(x))\]
Multi-view self-distillation can smooth noisy predictions and transfer behavior that is robust across views. For example, a model may answer the same problem under several prompt templates, with different retrieved context snippets, or under different decoding conditions, and the aggregate teacher signal may be more reliable than any single view.
This approach is useful when the model contains the right capability but expresses it inconsistently. By aggregating several internal views, the teacher signal can emphasize stable patterns and reduce sensitivity to a single prompt, retrieval path, or decoding sample.
Multi-view self-distillation is closely related to self-consistency and reranking. The difference is that self-consistency chooses or aggregates answers at inference time, while self-distillation converts the aggregated behavior into trainable supervision.
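The averaged teacher distribution can be sketched as a simple mean over views. The code assumes each view has already produced a next-token distribution over the same vocabulary; the function name is hypothetical.

```python
def multi_view_teacher(view_probs):
    """Average K next-token distributions, one per view of the same model."""
    k = len(view_probs)
    vocab = len(view_probs[0])
    return [sum(view[i] for view in view_probs) / k for i in range(vocab)]

views = [[0.8, 0.2], [0.6, 0.4]]   # e.g. two prompt templates for one problem
p_teacher = multi_view_teacher(views)
print(p_teacher)
```

Because the average of valid distributions is itself a valid distribution, the result can be used directly as the teacher term in a standard distillation loss.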

Contextual Self-Distillation

Modern reasoning-oriented self-distillation usually relies on contextual asymmetry rather than architectural asymmetry. This creates a stronger teacher signal without introducing a separate external model.
The student sees only the original task:
\[p_S(\cdot \mid x)\]
- while the teacher receives privileged information:
\[p_T(\cdot \mid x,c)\]
- where \(c\) may include a verified answer, ground-truth reasoning trace, runtime error, tool result, user correction, retrieved skill, environment observation, or another form of training-only support.
Contextual self-distillation depends on an asymmetry between what is available during training and what is available during deployment. The teacher is allowed to see information that makes evaluation, correction, or rationalization easier, while the student must internalize the corrected behavior without depending on that information at inference time.
This setup is powerful because LLMs are often better at evaluating, explaining, or repairing a solution when given the answer or feedback than they are at generating the solution from scratch. It is risky because the privileged teacher can acquire a reasoning style that presupposes unavailable information, which may teach the student to be too concise, too confident, or insufficiently exploratory.
A contextual self-distillation objective can be written as:
\[\mathcal{L}_{CSD}(\theta) =\mathbb{E}_{x,c} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D \left( p_T^{\theta^-}(\cdot \mid x,c,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
This objective is safe only when the teacher context changes reward-relevant token preferences rather than merely changing style, confidence, length, or citation behavior.

On-Policy Self-Distillation (OPSD)

The most important modern form of self-distillation is On-Policy Self-Distillation. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models by Zhao et al. (2026) introduces OPSD as a framework in which one model acts as both student and teacher, leveraging ground-truth answers to provide dense token-level supervision on student rollouts.
In OPSD, the student generates trajectories from the original problem context, while the teacher evaluates those same trajectories under an enriched or privileged context. The same underlying model can instantiate both views, so the method does not require a separate stronger teacher.
The basic objective is:
\[\mathcal{L}_{\mathrm{OPSD}} =\sum_t D \left( \pi_T(\cdot \mid x,c,y_{<t}) \,\Vert\, \pi_S(\cdot \mid x,y_{<t}) \right)\]
- where \(c\) is privileged teacher-only context, such as a verified solution or feedback string.
The following expansion makes the objective explicitly on-policy by sampling the rollout from the student, conditioning the teacher on privileged context \(c\), and applying token-level divergence along the sampled rollout:
\[\mathcal{L}_{OPSD}(\theta) =\mathbb{E}_{(x,c)} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D \left( p_T^\theta(\cdot \mid x,c,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
The following figure (source) shows On-Policy Self-Distillation, where the same model defines a student policy conditioned only on the problem and a teacher policy conditioned on privileged solution information. Given a reasoning dataset \(\mathcal{S}=\{(x_i,y_i^\star)\}_{i=1}^{N}\), the student generates an on-policy response \(\hat{y} \sim p_S(\cdot \mid x)\), and both student and privileged teacher score that response at every prefix to produce a token-level divergence.

OPSD argues that models are often substantially better at evaluating or rationalizing a correct answer than generating the answer from scratch. By conditioning the teacher view on verified solutions, the model effectively supervises itself from a privileged perspective.
In implementation, the student rollout is generated first and the teacher scores the resulting trajectory rather than generating an independent completion. The teacher context concatenates the original prompt with privileged solution information, creating an asymmetric supervision channel. Reverse KL, forward KL, and Jensen-Shannon divergence can all be used experimentally, while pointwise KL clipping and token weighting can help prevent stylistic tokens, filler tokens, or formatting artifacts from dominating reasoning updates.
The key assumption is:
\[\pi_{\theta^-}(\cdot \mid x,c) \approx \pi_T^*(\cdot \mid x)\]
- where \(\pi_T^*\) is a reward-improved teacher distribution that remains close enough to the student to be imitable. When this approximation holds, OPSD can function like inexpensive OPD. When it fails, the same dense token-level loss can train the wrong behavior quickly.

Reward-Tracking Condition

A useful way to test a self-teacher is to check whether its log-ratio tracks downstream reward. For a trajectory \(s\), define:
\[\Delta_T(s) =\log \pi_T(s) - \log \pi_S(s)\]
A useful self-teacher should assign larger \(\Delta_T(s)\) to higher-reward trajectories. If correct trajectories receive larger log-ratios than incorrect trajectories, the teacher behaves like a reward-tilted version of the student. If the log-ratio instead tracks a response template, the teacher is not reward-aligned.
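The reward-tracking check can be sketched as a comparison of average log-ratios across reward buckets. This is a minimal illustration, assuming per-token log-probabilities of each sampled trajectory are available under both the teacher and student views; the function name and the toy numbers are hypothetical.

```python
def mean_log_ratio_by_reward(trajectories):
    """Average Delta_T(s) = log pi_T(s) - log pi_S(s) per reward bucket.

    Each trajectory is (teacher_token_logps, student_token_logps, reward).
    A sequence log-probability is the sum of its token log-probabilities.
    """
    buckets = {}
    for teacher_logps, student_logps, reward in trajectories:
        delta = sum(teacher_logps) - sum(student_logps)
        buckets.setdefault(reward, []).append(delta)
    return {r: sum(d) / len(d) for r, d in buckets.items()}

# Two correct rollouts (reward 1.0) and one incorrect rollout (reward 0.0).
trajs = [
    ([-0.2, -0.3], [-0.9, -0.8], 1.0),
    ([-0.4, -0.1], [-0.7, -0.6], 1.0),
    ([-1.5, -1.2], [-1.0, -0.9], 0.0),
]
by_reward = mean_log_ratio_by_reward(trajs)
reward_tracking = by_reward[1.0] > by_reward[0.0]
print(by_reward, reward_tracking)
```

In this toy case the teacher raises the probability of correct rollouts and lowers it for the incorrect one, which is the pattern a reward-aligned self-teacher should show.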
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t frames this through KL-regularized reward maximization. Let \(\pi_k^S\) be the current student policy and \(R(s)\) the trajectory reward. The reward-tilted teacher is:
\[\pi_T^*(s) =\frac{1}{Z} \pi_k^S(s) \exp(\beta R(s)), \qquad Z=\mathbb{E}_{s \sim \pi_k^S} \left[ \exp(\beta R(s)) \right]\]
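If the teacher equals this reward-tilted policy and its gradient is stopped, the reverse-KL distillation loss decomposes into a policy-proximity penalty, a reward term, and a constant \(\log Z\) that does not depend on \(\pi^S\), so minimizing it maximizes expected reward while keeping \(\pi^S\) close to \(\pi_k^S\):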
\[D_{KL} \left( \pi^S \,\Vert\, \pi_T^* \right) =D_{KL} \left( \pi^S \,\Vert\, \pi_k^S \right) -\beta \mathbb{E}_{s \sim \pi^S} \left[ R(s) \right] + \log Z\]
This decomposition explains why self-distillation needs a stronger test than “the teacher has more context.” The privileged self-teacher must behave as though it is reward-tilted. If it merely behaves as though it already knows the answer, the loss can train the student to imitate the appearance of knowing rather than the process of solving.
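The reverse-KL decomposition can be checked numerically on a tiny trajectory space. The sketch below assumes four enumerable trajectories with made-up probabilities and binary rewards; it builds the reward-tilted teacher from the definition above and compares both sides of the identity.

```python
import math

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

beta = 2.0
pi_k = [0.4, 0.3, 0.2, 0.1]      # current student over 4 toy trajectories
reward = [0.0, 1.0, 0.0, 1.0]    # trajectory rewards R(s)

# Reward-tilted teacher: pi_T*(s) = pi_k(s) * exp(beta * R(s)) / Z.
Z = sum(p * math.exp(beta * r) for p, r in zip(pi_k, reward))
pi_star = [p * math.exp(beta * r) / Z for p, r in zip(pi_k, reward)]

pi_s = [0.25, 0.35, 0.15, 0.25]  # arbitrary candidate student policy
lhs = kl(pi_s, pi_star)
expected_reward = sum(p * r for p, r in zip(pi_s, reward))
rhs = kl(pi_s, pi_k) - beta * expected_reward + math.log(Z)
print(abs(lhs - rhs) < 1e-9)
```

The identity holds for any candidate `pi_s`, which is why the teacher's reward alignment, not the divergence itself, determines whether the update improves reward.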

Naive Privileged Self-Distillation

Naive privileged self-distillation uses the same model under two contexts. The student is conditioned on the original task prompt, while the teacher is conditioned on the task prompt plus privileged information:
\[\mathcal{C}_S =(x,\hat{y}_{<t})\] \[\mathcal{C}_T =(x,c,\hat{y}_{<t})\]
The objective is usually a per-token KL over student-generated rollouts:
\[\mathcal{L}_{naive}(\theta) =\mathbb{E}_{x \sim \mathcal{D},\hat{y} \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D_{KL} \left( \pi_{\theta^-}(\cdot \mid x,c,\hat{y}_{<t}) \,\Vert\, \pi_\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
This objective is on-policy because \(\hat{y}\) comes from the student. It is self-distillation because teacher and student are the same model under different contexts. It is naive when the loss directly transfers all teacher-context effects into the student without checking whether those effects are reward-aligned.
The problem is that the student never sees \(c\) at inference time. If the teacher uses \(c\) by naming it, citing it, reasoning backward from it, or becoming overconfident because it is present, the student can learn those behaviors unconditionally.
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t shows this failure through a reward-log-ratio comparison: an RL-trained expert assigns higher probability to correct student responses, while a privileged self-distillation teacher can assign high probability to responses that look like they used a reference solution whether or not the answer is correct.

Hallucinated Privileged Context

The most direct failure mode is hallucinated privileged context. The trained student refers to feedback, a reference solution, guidance, or a previous attempt that was not present in the inference prompt.
A hallucinated-privilege metric can be written as:
\[H_{\mathrm{priv}} =\mathbb{E}_{x \sim \mathcal{D}_{eval}} \left[ \mathbf{1} \left[ y(x) \text{ mentions absent feedback, reference, guidance, or prior attempt} \right] \right]\]
This is not ordinary factual hallucination. It is an artifact of the training objective: the teacher distribution was grounded in privileged context, but the deployed student is not. The student cannot reproduce the hidden information, so it may reproduce the response shape associated with hidden information.
The following figure (source) shows hallucination rate over training on chemistry, judged by an LLM. The figure illustrates that naive self-distillation rapidly learns to fabricate absent privileged context: its hallucination rate reaches ~100% within ~50 steps and never decays, while RL and the base model remain near zero.

The key lesson is that self-distillation can move artifacts into the model’s weights. Once the response shape is internalized, it can fire across prompts, tasks, and domains without the original privileged context.
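The hallucinated-privilege metric \(H_{\mathrm{priv}}\) can be approximated with a cheap screen before running a judge. The sketch below is an assumption-heavy stand-in: the reported figure uses an LLM judge, whereas this version uses a hypothetical phrase list (`PRIVILEGE_PATTERNS`) and flags a response only when it mentions a privileged-context phrase that the prompt itself does not contain.

```python
import re

# Hypothetical phrase list; a real evaluation would likely use an LLM judge.
PRIVILEGE_PATTERNS = [
    r"\breference solution\b",
    r"\b(the|your) feedback\b",
    r"\bprevious attempt\b",
    r"\bas (hinted|suggested) (above|earlier)\b",
]

def mentions_absent_privilege(prompt, response):
    """True when the response cites privileged context that the prompt lacks."""
    for pattern in PRIVILEGE_PATTERNS:
        in_response = re.search(pattern, response, flags=re.IGNORECASE)
        in_prompt = re.search(pattern, prompt, flags=re.IGNORECASE)
        if in_response and not in_prompt:
            return True
    return False

def hallucinated_privilege_rate(pairs):
    """Fraction of (prompt, response) pairs flagged by the phrase screen."""
    flags = [mentions_absent_privilege(p, r) for p, r in pairs]
    return sum(flags) / len(flags) if flags else 0.0

pairs = [
    ("Balance H2 + O2.", "Following the reference solution, 2H2 + O2 -> 2H2O."),
    ("Balance H2 + O2.", "2H2 + O2 -> 2H2O."),
    ("Here is the feedback: check signs. Retry.", "Using the feedback, I flip the sign."),
]
rate = hallucinated_privilege_rate(pairs)
print(rate)
```

The third pair is not flagged because the feedback really is in the prompt, which matches the "absent" qualifier in the metric.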

Epistemic-Verbalization Collapse

A second failure mode is suppressed epistemic verbalization. Thinking models often use uncertainty and checking tokens as part of their reasoning control flow. Examples include “wait,” “perhaps,” “actually,” “maybe,” and “alternatively.”
Let \(\mathcal{E}\) be a set of epistemic-verbalization markers. For a response \(y=(y_1,\dots,y_n)\):
\[\operatorname{EV}(y) =\sum_{t=1}^{n} \mathbf{1} \left[ y_t \in \mathcal{E} \right]\]
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) argues that self-distillation can reduce response length while degrading mathematical reasoning, tracing part of the degradation to suppressed epistemic verbalization.
Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) similarly finds that privileged-context self-distillation can suppress forking, verification, backtracking, and hedging, which are precisely the behaviors long-budget thinking models need when solving without privileged information.
The practical warning is that shorter and more confident traces are not necessarily better traces. A privileged teacher may be concise because it already knows the answer. Distilling that concision can remove search behavior that the student still needs at inference.
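The \(\operatorname{EV}\) count defined above is simple to compute once a marker set is fixed. The sketch below assumes word-level matching on lowercased text, whereas the formula is stated over model tokens \(y_t\); the marker set reuses only the examples listed in the text, and real marker lists are likely larger.

```python
import re

# Marker set taken from the examples in the text.
EPISTEMIC_MARKERS = {"wait", "perhaps", "actually", "maybe", "alternatively"}

def epistemic_verbalization(response):
    """Count words that belong to the marker set (case-insensitive)."""
    words = re.findall(r"[a-z']+", response.lower())
    return sum(1 for w in words if w in EPISTEMIC_MARKERS)

base = "Wait, maybe the sign is wrong. Actually, let me recheck. Alternatively, try x = 2."
distilled = "The answer is x = 2."
ev_base = epistemic_verbalization(base)
ev_distilled = epistemic_verbalization(distilled)
print(ev_base, ev_distilled)
```

Tracking this count per response, alongside response length, over training steps is one way to notice the collapse described above before accuracy moves.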

Out-of-Distribution Fragility

Naive self-distillation can preserve or improve in-distribution validation accuracy while harming out-of-distribution reasoning. This happens when the learned response template fits the training distribution but fails on tasks that require fresh search, tool adaptation, or uncertainty management.
The diagnostic pattern is:
\[\Delta_{\mathrm{IID}} =\operatorname{Acc}_{trained}(\mathcal{D}_{IID}) - \operatorname{Acc}_{base}(\mathcal{D}_{IID})\] \[\Delta_{\mathrm{OOD}} =\operatorname{Acc}_{trained}(\mathcal{D}_{OOD}) - \operatorname{Acc}_{base}(\mathcal{D}_{OOD})\]
A method is suspect when:
\[\Delta_{\mathrm{IID}} \geq 0 \qquad \text{and} \qquad \Delta_{\mathrm{OOD}} < 0\]
This pattern indicates that the student may have learned a distribution-specific response shape rather than a robust improvement in reasoning. For thinking models, OOD evaluation should include domains where the model must branch, verify, recover, or use tools differently from the training tasks.
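The diagnostic pattern reduces to two subtractions and a sign check. A minimal sketch, assuming accuracies have already been measured for the base and trained models on both splits; the function name and example numbers are illustrative.

```python
def is_suspect(acc_trained_iid, acc_base_iid, acc_trained_ood, acc_base_ood):
    """Flag the pattern Delta_IID >= 0 and Delta_OOD < 0.

    Returns (suspect, delta_iid, delta_ood).
    """
    delta_iid = acc_trained_iid - acc_base_iid
    delta_ood = acc_trained_ood - acc_base_ood
    return (delta_iid >= 0 and delta_ood < 0), delta_iid, delta_ood

# In-distribution accuracy rises slightly while out-of-distribution accuracy drops.
suspect, d_iid, d_ood = is_suspect(0.72, 0.70, 0.41, 0.55)
print(suspect, round(d_iid, 2), round(d_ood, 2))
```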

Prompt and Loss Fixes

Prompt engineering can reduce some privileged-context artifacts, but it does not remove the underlying mismatch if the teacher still receives information the student will not have at inference. A better prompt can change how the self-teacher expresses feedback, but the student can still learn a feedback-conditioned response style.
The following figure (source) shows optimized teacher prompts (GEPA and hand-written) vs. naive self-distillation and RL on in-distribution and out-of-distribution evaluation. The figure illustrates that prompt optimization can help in some cases but does not reliably close the OOD gap, and it can overfit to training-distribution assumptions. Specifically, on chemistry, both prompts help and recover part of the OOD gap, but neither reaches RL. On tool use, the GEPA prompt overfits to the training tool families and the student collapses on validation. The narrow generalization problem survives the prompt change.

Loss modifications are more principled when they reduce the influence of a pure distillation term that disagrees with reward. A simple agreement mask keeps the distillation loss only where the RL advantage and teacher-student gap point in the same direction:
\[m_t =\mathbf{1} \left[ \operatorname{sign} \left( A_t^{RL} \right) =\operatorname{sign} \left( \log \pi_T(y_t \mid s_t) -\log \pi_\theta(y_t \mid s_t) \right) \right]\]
The masked self-distillation objective is:
\[\mathcal{L}_{masked} =\mathbb{E} \left[ \sum_t m_t D \left( \pi_T(\cdot \mid s_t) \,\Vert\, \pi_\theta(\cdot \mid s_t) \right) \right]\]
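The agreement mask and masked objective can be sketched for one rollout as follows. This assumes per-token RL advantages, sampled-token log-probabilities under both views, and precomputed per-token divergences are available; treating two exact zeros as agreeing is a tie convention chosen here, not something the formula specifies.

```python
def sign(v):
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

def agreement_mask(rl_advantages, teacher_logps, student_logps):
    """m_t = 1 when sign(A_t^RL) equals sign(log pi_T - log pi_theta) at token t."""
    return [
        1 if sign(a) == sign(lt - ls) else 0
        for a, lt, ls in zip(rl_advantages, teacher_logps, student_logps)
    ]

def masked_loss(mask, per_token_divergence):
    """Sum the distillation divergence only over positions kept by the mask."""
    return sum(m * d for m, d in zip(mask, per_token_divergence))

adv = [0.5, 0.5, -0.3, -0.3]        # toy per-token RL advantages
t_lp = [-0.2, -1.5, -2.0, -0.1]     # teacher log-prob of each sampled token
s_lp = [-0.9, -0.6, -0.7, -0.8]     # student log-prob of each sampled token
mask = agreement_mask(adv, t_lp, s_lp)
loss = masked_loss(mask, [0.3, 0.4, 0.5, 0.2])
print(mask, round(loss, 2))
```

Positions two and four are dropped because the teacher-student gap points against the RL advantage there.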
Another option is to keep RL dominant and use self-distillation as a small auxiliary regularizer:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) + \lambda_{SD} \mathcal{L}_{SD}(\theta), \qquad \lambda_{SD} \ll 1\]
A third option is to improve the privileged teacher target with an RL term before distilling from it, so that the teacher branch is more reward-aligned rather than merely feedback-conditioned.
The following figure (source) shows three loss modifications that mix distillation loss with an RL term, vs. naive self-distillation and RL. The figure illustrates that helpful variants rely on their RL term to reduce the influence of pure self-distillation, but none fully reaches pure RL’s out-of-distribution performance in the reported setting. The two with the heaviest RL term destabilized and were stopped.

Relevance-Masked Self-Distillation

Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation introduces Relevance-Masked Self-Distillation (RMSD), a practical self-distillation method designed for out-of-distribution and enterprise-style behaviors where a sufficiently capable external teacher may not exist.
The motivating problem is that SFT can teach an unusual behavior but may cause catastrophic forgetting, while RL may fail when the base model almost never succeeds and therefore receives little useful reward signal. RMSD constructs a self-teacher by conditioning the same model on a hint or desired behavior description, then uses a token-level relevance mask so that the student updates only on positions tied to the desired behavior.
A compact RMSD objective is:
\[\mathcal{L}_{RMSD}(\theta) =\mathbb{E} \left[ \sum_t m_t D \left( p_T^\theta(\cdot \mid x',\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(x'\) is an enhanced teacher prompt containing a hint or correction, and \(m_t \in \{0,1\}\) selects token positions that should receive distillation updates.
The central RMSD insight is that token-level granularity is both a strength and a weakness. It is useful because the update can be localized, but noisy because the teacher and student may disagree on many tokens for reasons unrelated to the desired behavior. RMSD therefore uses a two-step filtering strategy: deterministic heuristics identify candidate token positions, and an LLM judge selects the final subset of task-relevant tokens.
The following figure (source) shows a relevance-masked self-distillation visualizer for a rollout in which the student is asked a normal tropical-food question while the privileged teacher prompt instructs the model to use the misspelled token “pinapple.” Green tokens indicate positive teacher-minus-student log-probability gaps, red tokens indicate negative gaps, and the relevance mask selects the token positions most tied to the target behavior rather than updating all stylistic disagreements.

RMSD experiments highlight why on-policy data matters. When SFT data is close to the student’s own distribution, SFT can improve substantially, but such close-to-on-policy data is usually hard to manufacture in real tasks. OPSD and RMSD instead obtain on-policy trajectories directly from the student and use privileged teacher scoring to provide localized correction.
RMSD also highlights the importance of teacher-update timing. Updating the self-teacher too frequently can cause collapse, while updating after performance plateaus can bootstrap further progress more safely. This makes the overall RMSD recipe semi-online when teacher weights are periodically refreshed, even though each training segment uses a fixed self-teacher.
The underlying principle is that most token-level differences between teacher and student are not necessarily task-bearing. Some differences encode verbosity, formatting, hedging, citation style, or privileged-context artifacts. Relevance masks try to keep the capability signal while discarding incidental differences.

Contrastive On-Policy Self-Distillation

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) addresses a failure mode called privilege-induced style drift. In ordinary OPSD, the teacher-student gap under a correct hint may concentrate on style tokens because a hinted teacher tends to be more direct, shorter, and more confident regardless of whether the hint contains the specific task-bearing information the student needs.
RLCSD compares the teacher-student gap under a correct hint with the gap under a wrong or contrastive hint. This subtracts the generic style shift induced by hint conditioning and leaves a signal more concentrated on task-bearing tokens.
A simplified contrastive signal is:
\[e_t^{ctr} =\left[ \log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right] -\left[ \log p_\theta(y_t \mid x,c^-,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right]\]
- where \(c^+\) is a correct hint and \(c^-\) is a wrong or contrastive hint.
The broader principle is that privileged context should not be treated as uniformly beneficial. A hint changes both task information and style. Contrastive self-distillation attempts to isolate the task-relevant component of the hint by subtracting a control condition that induces similar style changes without providing the correct task information.
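The simplified contrastive signal can be computed directly from three scoring passes over the same student rollout. A minimal sketch, assuming sampled-token log-probabilities under the correct-hint, wrong-hint, and no-hint contexts; the function name and numbers are illustrative.

```python
def contrastive_signal(logp_correct_hint, logp_wrong_hint, logp_no_hint):
    """e_t = (gap under correct hint) - (gap under wrong hint), per token."""
    signal = []
    for lp_pos, lp_neg, lp_base in zip(logp_correct_hint, logp_wrong_hint, logp_no_hint):
        gap_pos = lp_pos - lp_base   # shift induced by the correct hint
        gap_neg = lp_neg - lp_base   # shift induced by the wrong hint
        signal.append(gap_pos - gap_neg)
    return signal

# Toy rollout "So x = 4": both hints make the first three tokens more likely
# (a generic style shift), but only the correct hint supports the final "4".
lp_pos = [-0.2, -0.5, -0.1, -0.3]
lp_neg = [-0.2, -0.5, -0.1, -2.5]
lp_base = [-1.0, -0.6, -0.2, -1.4]
e_ctr = contrastive_signal(lp_pos, lp_neg, lp_base)
print([round(e, 2) for e in e_ctr])
```

Because the no-hint term appears in both brackets, it cancels algebraically, so the signal equals \(\log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) - \log p_\theta(y_t \mid x,c^-,\hat{y}_{<t})\). The style shift shared by both hints is removed, and only the answer-bearing token keeps a large value.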

Self-Distillation as Reinforcement Learning

Recent work increasingly treats self-distillation as a form of policy optimization rather than merely a compression technique.
The central idea is that the teacher defines a dense token-level improvement signal:
\[A_t = \log p_T(y_t \mid s_t) - \log p_S(y_t \mid s_t)\]
- which behaves similarly to an RL advantage estimate.
This perspective enables self-distillation to integrate naturally into PPO-, GRPO-, and RLVR-style training loops. The self-teacher can provide dense token-level guidance, while a verifier, reward model, or environment reward preserves the task-level objective.
A hybrid RL and self-distillation objective can be written as:
\[\mathcal{L}(\theta) = \mathcal{L}_{RL}(\theta) + \lambda_{SD}\mathcal{L}_{SD}(\theta)\]
- where the RL term defines the direction of improvement from rewards or verifiers, and the self-distillation term refines token-level credit assignment.

Reinforcement Learning via Self-Distillation

Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) formalizes the connection between self-distillation and policy optimization by converting textual feedback into dense token-level supervision.
The framework generates a trajectory from the student policy, obtains textual critiques, runtime errors, verifier feedback, or environment feedback, conditions the teacher view on both the original trajectory and the feedback signal, and replays the trajectory under the teacher context to compute token-level corrections.
This approach is useful when feedback is richer than a scalar reward. Runtime execution errors in coding tasks can become corrective teacher context; LLM judge comments can specify what went wrong; and environment feedback can indicate which actions were ineffective. The same model can instantiate both the student and teacher view, reducing infrastructure requirements.
A useful abstraction is:
\[p_T(\cdot \mid x,\hat{y}_{<t},f)\]
- where \(f\) is a feedback string, such as a runtime error, critique, judge comment, or natural-language correction. The student is trained to align with this feedback-conditioned teacher while still operating without \(f\) at deployment time.
A simplified SDPO-style objective is:
\[\mathcal{L}_{SDPO}(\theta) =\mathbb{E}_{x,\hat{y},f} \left[ \sum_t D \left( \pi_{\theta^-}(\cdot \mid x,f,\hat{y}_{<t}) \,\Vert\, \pi_\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
The benefit is dense credit assignment from feedback that would otherwise remain a post-hoc explanation. The risk is the same as in privileged self-distillation: if the feedback-conditioned branch learns to sound corrected rather than to become reward-improved, the update can overfit to feedback style.

Self-Distilled RLVR

Self-Distilled RLVR by Yang et al. (2026) combines reinforcement learning with verifiable rewards and privileged self-distillation. The method uses RLVR to determine the update direction and self-distillation to modulate fine-grained token-level update magnitudes.
This separation matters because privileged self-distillation alone can leak information or destabilize long training. RLVR keeps the objective grounded in correctness, while self-distillation provides local information about which sampled tokens should receive stronger or weaker updates.
A simplified view is:
\[\Delta \theta \propto \sum_t A^{RLVR} \cdot w_t^{SD} \nabla_\theta \log \pi_\theta(y_t \mid s_t)\]
- where \(A^{RLVR}\) supplies the reward-aligned direction and \(w_t^{SD}\) supplies a token-level magnitude derived from the privileged self-teacher.
The practical lesson is that self-distillation can be most useful as a magnitude signal when an external verifier or environment determines the update direction. This avoids letting the privileged teacher alone decide what should be reinforced.
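The direction-versus-magnitude split can be illustrated with a toy weighting rule. The text does not specify how \(w_t^{SD}\) is computed, so the sigmoid-based weight below is purely an assumption made for illustration: it is always positive, so the sign of each token coefficient is set by \(A^{RLVR}\) alone, and it is larger where the self-teacher gap agrees with that sign.

```python
import math

def sd_token_weights(teacher_logps, student_logps, advantage, temperature=1.0):
    """Hypothetical magnitude weights in (0, 2).

    Larger where the teacher-student gap agrees with the sign of the
    sequence-level RLVR advantage; exactly 1.0 where the gap is zero.
    """
    direction = 1.0 if advantage >= 0 else -1.0
    weights = []
    for lt, ls in zip(teacher_logps, student_logps):
        gap = lt - ls
        weights.append(2.0 / (1.0 + math.exp(-direction * gap / temperature)))
    return weights

def token_coefficients(advantage, weights):
    """Coefficient multiplying grad log pi_theta(y_t | s_t) for each token."""
    return [advantage * w for w in weights]

t_lp = [-0.5, -2.0, -1.0]
s_lp = [-1.5, -1.0, -1.0]
w_pos = sd_token_weights(t_lp, s_lp, advantage=1.0)
w_neg = sd_token_weights(t_lp, s_lp, advantage=-1.0)
coef_pos = token_coefficients(1.0, w_pos)
coef_neg = token_coefficients(-1.0, w_neg)
print([round(c, 3) for c in coef_pos], [round(c, 3) for c in coef_neg])
```

Whatever weighting is used, the property worth preserving is that the privileged teacher can rescale an update but cannot flip its sign.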

Sample-Routed Self-Distillation

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026) introduces Sample-Routed Policy Optimization (SRPO), which routes correct samples to GRPO-style reward-aligned RL and failed samples to self-distillation-based correction.
A compact routing objective is:
\[\mathcal{L}_{SRPO} =\mathbf{1}[R(s)>0] \mathcal{L}_{GRPO} + \mathbf{1}[R(s)=0] \lambda(s) \mathcal{L}_{SDPO}\]
Correct samples are already reward-aligned, so they can be reinforced with RL. Failed samples need targeted correction, so they can be replayed under a feedback or privileged teacher context.
SRPO also motivates entropy-aware weighting. When the self-teacher is uncertain, its dense signal should be down-weighted:
\[\lambda(s) =g \left( H(\pi_{\theta^-}^{+}(\cdot \mid s)) \right)\]
- where larger entropy indicates lower teacher reliability.
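The routing rule and entropy-aware weight can be sketched per sample. The function \(g\) is not specified in the text, so the choice below (one minus mean normalized teacher entropy) is a hypothetical example of a decreasing function of entropy; the loss values passed in are placeholders for already-computed GRPO and SDPO terms.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def teacher_weight(teacher_dists):
    """Hypothetical g: one minus mean normalized teacher entropy, in [0, 1]."""
    vocab = len(teacher_dists[0])
    mean_h = sum(entropy(d) for d in teacher_dists) / len(teacher_dists)
    return 1.0 - mean_h / math.log(vocab)

def srpo_sample_loss(reward, grpo_loss, sdpo_loss, teacher_dists):
    """Route by reward: GRPO term when R > 0, weighted SDPO term otherwise.

    The formula in the text covers R > 0 and R = 0; this sketch treats any
    non-positive reward as a failed sample.
    """
    if reward > 0:
        return grpo_loss
    return teacher_weight(teacher_dists) * sdpo_loss

confident = [[1.0, 0.0, 0.0]]
uncertain = [[1 / 3, 1 / 3, 1 / 3]]
print(srpo_sample_loss(0.0, 0.4, 0.9, confident),
      srpo_sample_loss(0.0, 0.4, 0.9, uncertain))
```

A maximally uncertain teacher contributes almost nothing to the failed-sample update, while a confident one passes its full SDPO term through.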

Self-Distilled Agentic Reinforcement Learning

Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) introduces SDAR for multi-turn LLM agents, where RL remains the primary optimization objective and OPSD is used as a gated auxiliary loss to provide dense token-level guidance.
SDAR is motivated by a multi-turn instability problem: naive OPSD can produce large teacher-student gaps as trajectories drift across turns, and privileged context may be unreliable if retrieved skills are imperfect, weakly used, or irrelevant. Instead of applying self-distillation uniformly, SDAR gates the auxiliary signal at the token level.
The SDAR objective can be summarized as:
\[\mathcal{L}(\theta) =\mathcal{L}_{GRPO}(\theta) +\lambda_{\mathrm{SDAR}} \mathcal{L}_{SDAR}(\theta)\]
- where \(\mathcal{L}_{GRPO}\) preserves verifier-driven policy optimization, and \(\mathcal{L}_{SDAR}\) injects dense token-level guidance only where the privileged teacher signal is trusted.
The self-teacher receives privileged training-only context, such as retrieved skills, while the deployed student acts without that context. The detached teacher-student gap is:
\[\Delta_t =\operatorname{sg} \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\]
The token-level gate converts this signal into a bounded trust weight:
\[g_t =\sigma(\beta \Delta_t)\]
The gated auxiliary loss applies self-distillation only according to token-level trust:
\[\ell_t^{\mathrm{SDAR}} =g_t \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\]
SDAR supports entropy gating, gap gating, and soft-OR gating, so token-level supervision can depend on student uncertainty, teacher endorsement, or both. Skill retrieval can be implemented through UCB retrieval, keyword matching, full retrieval, or random retrieval, making the framework robust to different levels of privileged-context quality.
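The gap gate \(g_t=\sigma(\beta \Delta_t)\) and the gated per-token term can be sketched with plain floats. This is a minimal illustration of the gap-gating variant only, assuming sampled-token log-probabilities under the privileged and unprivileged views; entropy gating and soft-OR gating are not shown, and how the term is differentiated is left to the training framework.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sdar_token_losses(teacher_logps, student_logps, beta=1.0):
    """Per-token gated term g_t * (log pi^+(y_t | s_t^+) - log pi(y_t | s_t)).

    The gate uses the teacher-student gap as a fixed number, mirroring the
    stop-gradient in the definition of Delta_t.
    """
    losses = []
    for lt, ls in zip(teacher_logps, student_logps):
        gap = lt - ls
        gate = sigmoid(beta * gap)   # near 1 when the teacher endorses the token
        losses.append(gate * gap)
    return losses

t_lp = [-0.5, -2.5, -1.0]   # privileged teacher view
s_lp = [-2.5, -0.5, -1.0]   # student view
losses = sdar_token_losses(t_lp, s_lp, beta=1.0)
print([round(v, 4) for v in losses])
```

With symmetric gaps of +2 and -2, the endorsed token keeps most of its signal while the disfavored token is strongly down-weighted, which bounds how much a large negative gap on a drifted multi-turn prefix can contribute.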
The following figure (source) shows multi-turn OPSD instability, where naive OPSD can increase KL divergence and degrade task performance in multi-turn agent settings.

The following figure (source) shows teacher-student gap analysis, including token-count distribution by teacher-student gap, average gap by multi-turn step, and average gap by relative position within a turn.

The following figure (source) illustrates the SDAR framework, where verifier-driven RL and token-level OPSD are combined through gates derived from uncertainty, teacher-student gap, or their soft-OR combination.

Aligning Language Models from User Interactions

Aligning Language Models from User Interactions by Kleine Buening et al. (2026) extends self-distillation into conversational alignment by treating future user messages as privileged hindsight context.
Instead of using verified answers, the teacher is conditioned on later interaction information, such as a user correction, clarification, dissatisfaction signal, or follow-up request. The method reconstructs how the assistant should ideally have responded had it known the future interaction, then distills that hindsight policy into the original model.
The process can be summarized as:
\[(x,y,o) \rightarrow p_T(\cdot \mid x,o,y_{<t}) \rightarrow \text{token-level update for } p_S(\cdot \mid x,y_{<t})\]
- where \(x\) is the conversation history, \(y\) is the model response, and \(o\) is the subsequent user message.
The hindsight self-distillation objective is:
\[\mathcal{L}_{HSD}(\theta) =\mathbb{E}_{x,y,o} \left[ \sum_t D \left( p_{\theta^-}(\cdot \mid x,o,y_{<t}) \,\Vert\, p_\theta(\cdot \mid x,y_{<t}) \right) \right]\]
The following figure (source) shows the hindsight self-distillation process driven by user follow-up interactions. From multi-turn conversations, the system obtains interactions \((x,y,o)\) consisting of the conversation history, the model response, and the subsequent user message. Conditioning on the user’s follow-up forms a hindsight policy, and comparing that policy to the original policy produces token-level advantages that reinforce or penalize parts of the original response.

This framework naturally leverages production interaction logs without requiring manual labels for every example. It can also support both RL-style preference learning and dense token-level self-distillation from the same interaction trace.
The risk is context leakage. The model should learn the correction implied by the future user message, not hallucinate future user dissatisfaction or assume hidden preferences that are not present.

Self-Distillation in Agentic Systems

Self-distillation is particularly powerful in agents because interaction trajectories naturally produce rich hindsight signals. Tool outputs, terminal errors, GUI state changes, unit-test results, user replies, and environment transitions can all become teacher-only context for replaying and correcting the original action sequence.
In an agentic setting, the student may act under limited information at time \(t\), while the teacher is later allowed to see the consequences of that action. The teacher can then evaluate what the student should have done at each earlier token or action boundary.
A long-horizon trajectory can be represented as:
\[\tau =(s_1,a_1,o_1,\dots,s_T,a_T,o_T)\]
- where \(s_t\) is the state, \(a_t\) is the action, and \(o_t\) is an observation or outcome.
A hindsight teacher can condition on later evidence:
\[p_T(\cdot \mid s_t,o_{t:T},a_{<t})\]
- while the student acts only under the information available at time \(t\):
\[p_S(\cdot \mid s_t,a_{<t})\]
This is especially useful for long-horizon workflows because scalar task success is often too sparse. A final failure may be caused by one malformed tool call, one wrong file selection, one incorrect assumption, or one missed observation. Self-distillation can turn later evidence into local corrective supervision.
OpenClaw-RL-style systems extend this idea to real interaction streams, where personal agents and general agents collect trajectories from user interactions, tool calls, GUI transitions, and terminal environments. Those traces can be transformed into hindsight-conditioned dense supervision rather than relying only on scalar reward.

Failure Modes in Self-Distillation

Self-distillation introduces unique risks because the teacher is derived from the same model family as the student. The method can reinforce existing errors, amplify spurious confidence, suppress useful uncertainty, overfit to hints, or convert privileged-context style into deployed behavior.
Recent evidence is especially cautionary for thinking models. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) finds that self-distillation can reduce response length while degrading mathematical reasoning performance. The paper traces this to suppression of epistemic verbalization, meaning the model becomes less likely to express uncertainty, reconsider hypotheses, or mark possible errors during reasoning.
A useful formalization of context richness is conditional mutual information:
\[I(y^\star;c \mid x) =H(y^\star \mid x) - H(y^\star \mid x,c)\]
- where \(c\) is privileged context and \(y^\star\) is an ideal correct response. Richer context reduces uncertainty for the teacher, but this can also make the teacher produce overly concise and confident traces that do not preserve the student’s inference-time uncertainty management.
Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) reports a related fork-suppression failure mode. Privileged-context OPSD can reduce forking, verification, backtracking, and hedging behavior, especially in long-budget thinking rollouts. The failure is specific to privileged context rather than on-policy training itself: vanilla OPD can improve the same student, while privileged OPD reverses the gain.
The token-level mechanism is that privileged teachers may penalize self-correction cues even when those cues lead to a correct answer. Tokens such as “wait,” “maybe,” “but,” or “there may be a mistake” can be useful because they create branches and preserve uncertainty. A teacher that already knows the answer may assign these tokens low probability because it does not need them, thereby training the student to suppress deliberative search.
The following figure (source) shows examples where privileged context flips credit on self-correction cues. The same student trajectory is scored with and without privileged teacher context; privileged scoring penalizes tokens such as “But wait,” “Hmm,” and “Maybe” even when those tokens support correction or lead to a correct answer.

These findings imply that self-distillation objectives should optimize not only correctness or brevity, but also the reasoning behaviors needed for robust inference. For strong thinking models, preserving controlled uncertainty, exploration, and self-correction may be more important than forcing the student to imitate a privileged teacher’s concise solution path.
The main failure modes are:
- Self-reinforcement of errors, where the model teaches itself its own mistakes.
- Bias amplification, where self-generated or self-scored behavior narrows the model’s distribution.
- Privileged-context leakage, where the model behaves as if hidden feedback or references were present.
- Epistemic-verbalization collapse, where uncertainty and checking tokens disappear.
- Style overfitting, where the student learns hint-conditioned directness, concision, or citation style.
- Teacher collapse, where a moving self-teacher copies the student’s latest failure mode too quickly.
- Agent trajectory drift, where multi-turn prefixes move outside the teacher’s reliable support.
- Reward misalignment, where dense teacher scores do not correlate with downstream reward.

Fixed, Moving, and Gated Self-Teachers

A fixed self-teacher is a frozen copy of an earlier checkpoint or initial policy. This is stable and reproducible, and it avoids a feedback loop where the teacher becomes increasingly confident as the student changes.
A moving self-teacher updates during training, either by periodically copying the student, using an EMA, or refreshing after performance plateaus. This can provide fresher supervision, but it can also amplify collapse if the teacher is updated too frequently or if the student’s current failure modes become part of the teacher.
A gated self-teacher does not necessarily change the teacher weights; instead, it controls which token-level teacher signals are trusted. Gates can depend on teacher-student gap, student entropy, token relevance, verifier outcomes, contrastive hint differences, or LLM judge decisions.
These three designs address different problems. Fixed teachers address stability. Moving teachers address staleness. Gated teachers address noisy or harmful token-level supervision.
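Two of the moving-teacher update rules mentioned above, EMA and refresh-after-plateau, can be sketched on flat parameter lists. This is an illustrative sketch under stated assumptions: parameters are plain floats, and the plateau test with its `patience` and `min_delta` settings is a hypothetical heuristic rather than a rule from any cited method.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Moving teacher: exponential moving average of student parameters."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_params, student_params)]

def plateaued(scores, patience=3, min_delta=1e-3):
    """True when the last `patience` scores fail to beat the earlier best by min_delta."""
    if len(scores) <= patience:
        return False
    earlier_best = max(scores[:-patience])
    recent_best = max(scores[-patience:])
    return recent_best < earlier_best + min_delta

def refresh_teacher(teacher_params, student_params, eval_scores, patience=3):
    """Copy the student into the teacher only after a plateau; otherwise keep it fixed."""
    if plateaued(eval_scores, patience=patience):
        return list(student_params)
    return teacher_params

teacher = [1.0, 2.0]
student = [3.0, 4.0]
ema_teacher = ema_update(teacher, student, decay=0.9)
kept = refresh_teacher(teacher, student, [0.3, 0.4, 0.5, 0.6, 0.7])
refreshed = refresh_teacher(teacher, student, [0.5, 0.6, 0.6, 0.6, 0.6])
print(ema_teacher, kept, refreshed)
```

A fixed self-teacher corresponds to never calling either update, and a gated self-teacher leaves the weights alone and filters token-level signals instead.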

Evaluation

Self-distillation should be evaluated with both task and behavior metrics. Accuracy alone is not sufficient because a model can preserve in-distribution accuracy while learning a brittle privileged-context template.
Core task metrics include:
- In-distribution accuracy.
- Out-of-distribution accuracy.
- Verifier pass rate.
- Tool-use success.
- Unit-test success.
- Agent completion rate.
Core self-distillation diagnostics include:
- Teacher-student log-ratio by reward bucket.
- Hallucinated privileged-context rate.
- Feedback-mention rate when feedback is absent.
- Epistemic-verbalization rate.
- Response length and length collapse.
- Entropy and calibration.
- Teacher-student support overlap.
- Agreement between RL advantage and distillation advantage.
A compact dashboard is:
\[\left\{ \operatorname{Acc}_{IID}, \operatorname{Acc}_{OOD}, H_{\mathrm{priv}}, \operatorname{EV}, \mathbb{E}[|y|], H(\pi_\theta), \operatorname{Corr}(\Delta_T(s),R(s)) \right\}\]
- where \(H_{\mathrm{priv}}\) is the hallucinated privileged-context rate, \(\operatorname{EV}\) is the epistemic-verbalization rate, \(\mathbb{E}[|y|]\) is the mean response length, and \(H(\pi_\theta)\) is the policy entropy.
The most important diagnostic is:
\[\operatorname{Corr} \left( \Delta_T(s), R(s) \right) > 0\]
because it directly asks whether the self-teacher behaves like a reward-improved version of the student.
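A minimal sketch of this diagnostic, assuming \(\Delta_T(s)\) has already been reduced to one scalar per rollout (for example a mean teacher-minus-student log-probability gap) and that \(R(s)\) is a scalar reward. The helper `pearson` and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Illustrative per-rollout values, not measured data.
deltas = [0.5, -0.2, 0.8, -0.6]   # assumed teacher signal per rollout
rewards = [1.0, 0.0, 1.0, 0.0]    # assumed verifier reward per rollout

corr = pearson(deltas, rewards)
teacher_tracks_reward = corr > 0.0
```

A correlation near zero or below zero on held-out rollouts would suggest that the self-teacher is rewarding something other than task success, such as style.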

Advantages

Self-distillation reduces dependence on expensive external teacher models, enables continual self-improvement from interaction traces and feedback, converts privileged information into dense token-level guidance, integrates naturally with RL and hindsight supervision, preserves architecture simplicity by using the same backbone for teacher and student, and can target specialized behaviors that ordinary external teachers may not know.
It is especially attractive for enterprise-specific or tool-specific behavior, where the desired behavior may involve private APIs, internal formats, local policies, private data schemas, product-specific workflows, or user preferences that are not well represented in public teacher models.
It can be more selective than SFT. Because self-distillation adjusts conditional token probabilities on student-visited states, it can preserve unrelated behavior better than broad supervised fine-tuning when the token-level signal is well masked or gated.
It can also support continual adaptation. Interaction traces, runtime errors, future user feedback, and environment observations can become self-supervised training signals without requiring a separate annotation pipeline for every example.

Limitations

Self-distillation may reinforce the model’s own errors if the privileged teacher signal is weak, noisy, or merely stylistically different.
It can be limited by the model family’s inherent capability ceiling unless external rewards, search, tools, verifiers, or retrieved evidence introduce new information.
Incorrect privileged information can destabilize training more severely than ordinary supervised errors because the student may learn both the wrong answer and the response style induced by the wrong context.
Careless rollout replay can produce information leakage in reasoning tasks, especially when the teacher is conditioned on answers, critiques, future observations, or user follow-ups that the student will not see at inference time.
Dense self-distillation can over-regularize style unless clipped, masked, gated, or contrastively corrected.
Thinking models require special caution. Privileged context may suppress uncertainty expression, forking, verification, and backtracking, which are precisely the behaviors that enable long-budget reasoning to recover from mistakes.
Multi-turn agent settings also require caution because trajectory drift can compound across turns. Naive OPSD can become unstable when the self-teacher scores prefixes that are already far from its reliable support, motivating gated auxiliary distillation rather than uniform token-level imitation.

When Self-Distillation is Preferred

Self-distillation is preferred when:
- External frontier teachers are unavailable, too expensive, or operationally inconvenient.
- The model already contains latent capability that can be unlocked through hints, hindsight conditioning, retrieved skills, privileged evaluation, or interaction feedback.
- Interaction traces, tool outputs, runtime errors, user corrections, or verifier feedback are available as rich supervision sources.
- RL alone is too sparse or unstable, especially in coding, tool-use, scientific reasoning, or agentic tasks where feedback contains more information than a scalar reward.
- The target behavior is enterprise-specific, tool-specific, out-of-distribution, or private enough that external teachers are unlikely to know it.
- Continuous adaptation is required without maintaining a separate teacher infrastructure.
On-policy self-distillation is preferred when the student can generate informative rollouts and the teacher can be strengthened through privileged context.
Relevance-masked or gated self-distillation is preferred when only a small subset of tokens should change.
Contrastive self-distillation is preferred when hints introduce unwanted style drift.
RL-hybrid self-distillation is preferred when correctness rewards should determine update direction while self-distillation refines token-level credit assignment.
Modern self-distillation methods increasingly blur the line between supervised learning, reinforcement learning, continual learning, and iterative self-improvement. The practical challenge is no longer only how to make a model teach itself, but how to ensure the self-teacher’s signal preserves the behaviors the deployed student actually needs.

Multi-Teacher Distillation

Multi-teacher distillation generalizes the teacher-student framework by allowing the student to learn from more than one teacher. Instead of assuming a single model has the best behavior across all domains, the method combines signals from several teachers, which may differ by size, architecture, training stage, specialization, data source, decoding style, or post-training objective.
Multi-teacher distillation is no longer just an ensemble-compression trick. In modern LLM post-training, it has become a capability-consolidation primitive: different teams or training runs produce specialist teachers, and a final student absorbs their capabilities into one deployable policy.
The main benefit is specialization without deployment fragmentation. A lab may train one teacher that is strongest at math proofs, another at coding, another at software-engineering agents, another at safety behavior, another at long-context reasoning, and another at chat or instruction following. Multi-teacher distillation tries to merge these strengths into one model so that users do not need to select among many specialized checkpoints at inference time.
The main difficulty is conflict. Teachers may disagree not only on final answers, but also on style, reasoning length, tool-use conventions, uncertainty expression, formatting, refusal boundaries, and local token preferences. A multi-teacher student must learn when to follow which teacher, how to resolve incompatible guidance, and how to avoid averaging incompatible behaviors into a weaker policy.
The most important modern variant is multi-teacher on-policy distillation (MOPD). In MOPD, the student generates its own trajectories, those trajectories are routed to one or more domain-specialist teachers, and the teachers provide token-level feedback on the student’s actual visited prefixes. This makes MOPD a multi-teacher extension of OPD and a practical alternative to one monolithic RL run across many conflicting domains.
Recent recipe discussions frame MOPD as one of the major 2026-style post-training patterns. MiMo-V2-Flash Technical Report by the Xiaomi MiMo Team (2026) provides a clean early articulation of MOPD as a consolidation stage, while Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD as a stabilization and regression-recovery stage inside Cascade RL. Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales the pattern to many domain-specialist teachers and multiple MOPD rounds.

Definition

In single-teacher distillation, the student minimizes a divergence between its distribution and one teacher distribution. In multi-teacher distillation, there are \(K\) teachers:
\[\{p_{T_1},p_{T_2},\dots,p_{T_K}\}\]
- and the student receives supervision from an aggregation, routing, or weighting of those teachers.
A general multi-teacher objective can be written as:
\[\mathcal{L}_{MTD}(\theta) =\mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t} \sum_{k=1}^{K} w_k(x,y_{<t}) D\left( p_{T_k}(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right) \right]\]
- where \(w_k(x,y_{<t})\) is the teacher weight for teacher \(k\) at the current prompt or prefix. The weight may be fixed, domain-dependent, confidence-dependent, learned by a router, determined by evaluation scores, or set by a hard routing rule.
A teacher-ensemble view first forms an aggregated teacher distribution:
\[p_{\mathrm{MT}}(\cdot \mid s_t) =\sum_{k=1}^{K} w_k(s_t) p_{T_k}(\cdot \mid s_t)\]
- where \(s_t=(x,y_{<t})\), and the student then minimizes:
\[\mathcal{L}_{MTD}(\theta) =\mathbb{E} \left[ D\left( p_{\mathrm{MT}}(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right) \right]\]
This formulation is simple, but it assumes that teacher distributions can be meaningfully averaged. That assumption is often false when teachers have different reasoning styles, response lengths, tool-use conventions, or refusal boundaries. Much of modern multi-teacher distillation is therefore about routing and conflict management rather than merely averaging logits.
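The two formulations can be compared numerically at a single prefix. The sketch below assumes \(D\) is forward KL and that the weights sum to one; under those assumptions the weighted sum of per-teacher divergences and the divergence to the aggregated teacher differ only by a term that does not depend on the student. The toy distributions and helper names are illustrative.

```python
import math


def mixture(teacher_probs, weights):
    """Aggregated teacher: weighted average of next-token distributions."""
    vocab = len(teacher_probs[0])
    return [sum(w * p[v] for w, p in zip(weights, teacher_probs))
            for v in range(vocab)]


def kl(p, q):
    """Forward KL(p || q); assumes q[v] > 0 wherever p[v] > 0."""
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0.0)


# Two toy teachers that are each confident about a different token.
teachers = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]
weights = [0.5, 0.5]
p_mt = mixture(teachers, weights)  # bimodal: mass split across tokens 0 and 1


def weighted_kl_sum(student):
    return sum(w * kl(p, student) for w, p in zip(weights, teachers))


def mixture_kl(student):
    return kl(p_mt, student)


student_a = [0.6, 0.3, 0.1]
student_b = [0.2, 0.2, 0.6]
gap_a = weighted_kl_sum(student_a) - mixture_kl(student_a)
gap_b = weighted_kl_sum(student_b) - mixture_kl(student_b)
# gap_a and gap_b agree: the gap depends only on the teachers, not the student.
```

The aggregated teacher here splits its mass between two tokens that neither teacher would hedge between on its own, which illustrates why averaging can be a poor target when teachers disagree.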

Why Multiple Teachers?

A single teacher is rarely uniformly best. One model may have strong mathematical reasoning but weak instruction following; another may be excellent at code generation but overly verbose in chat; another may be safe and well-calibrated but weak on long-context retrieval; another may be a strong agentic tool user but poor at concise factual answering. Multi-teacher distillation treats each teacher as a source of localized expertise rather than assuming one global teacher dominates all tasks.
Multi-teacher distillation is also organizationally scalable. Specialist teachers can be trained in parallel by different teams, each focused on a domain-specific data pipeline, verifier, RL environment, or evaluation suite. A final distillation stage then converts this distributed specialist work into one general student.
This pattern is attractive because multi-domain RL can be conflict-prone. Jointly optimizing math, code, safety, chat, long context, and agentic tool use inside one RL run can cause interference: improvements in one domain may shorten traces, reduce entropy, change formatting, degrade instruction adherence, or regress other benchmarks. Training specialists separately and then distilling them into one student gives the recipe a modular structure.
Multi-teacher distillation is also useful when different teachers expose different forms of signal. A code teacher may provide execution-verified traces, a math teacher may provide proof-style reasoning, a safety teacher may provide refusal and redirection behavior, a chat teacher may provide conversational naturalness, and an agentic teacher may provide terminal or browser action traces. The final student must learn all of these behaviors, but the best supervision for each behavior may come from a different source.

Classical Multi-Teacher Distillation

Classical multi-teacher distillation usually begins with an ensemble of teachers trained independently or on related tasks. Their predictions are combined into a single soft target, and the student learns to match that aggregated distribution.
A probability-space ensemble uses:
\[p_{\mathrm{ens}}(y \mid x) =\sum_{k=1}^{K} w_k p_{T_k}(y \mid x)\]
A logit-space ensemble instead combines teacher logits before softmax:
\[z_{\mathrm{ens}} =\sum_{k=1}^{K} w_k z_{T_k}\] \[p_{\mathrm{ens}}(y \mid x) =\operatorname{softmax}(z_{\mathrm{ens}})\]
Probability averaging is easier to interpret because each teacher contributes directly to the final distribution. Logit averaging can preserve sharper preferences but is sensitive to teacher calibration. If one teacher produces higher-magnitude logits, it can dominate the ensemble even when it is not more accurate.
Classical multi-teacher distillation is most effective when teachers are reasonably aligned and differ mainly in complementary expertise or random initialization. It is less reliable when teachers disagree systematically due to different training data, incompatible objectives, or different reasoning formats.
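The calibration sensitivity of logit averaging can be seen in a toy case. This is an illustrative sketch with made-up logits: teacher B produces much larger logits than teacher A, and the two ensembles are compared under equal weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prob_ensemble(teacher_logits, weights):
    """Average teacher probabilities after each teacher's own softmax."""
    probs = [softmax(z) for z in teacher_logits]
    vocab = len(probs[0])
    return [sum(w * p[v] for w, p in zip(weights, probs)) for v in range(vocab)]


def logit_ensemble(teacher_logits, weights):
    """Average teacher logits first, then apply a single softmax."""
    vocab = len(teacher_logits[0])
    z_ens = [sum(w * z[v] for w, z in zip(weights, teacher_logits))
             for v in range(vocab)]
    return softmax(z_ens)


teacher_a = [2.0, 0.0, 0.0]    # moderately prefers token 0
teacher_b = [0.0, 20.0, 0.0]   # prefers token 1 with very large logits
weights = [0.5, 0.5]

p_prob = prob_ensemble([teacher_a, teacher_b], weights)
p_logit = logit_ensemble([teacher_a, teacher_b], weights)
# In probability space teacher A still contributes visible mass to token 0;
# in logit space teacher B's scale swamps it.
```

Normalizing or temperature-scaling teacher logits before averaging is one common way to reduce this effect, although the right calibration step depends on the teachers involved.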

Teacher Weighting and Routing

Teacher weighting determines how much each teacher contributes to the student’s update. A fixed-weight scheme assigns a constant coefficient to each teacher, which is simple and stable but ignores the fact that teacher quality varies by prompt, domain, and prefix.
A domain-weighted scheme chooses weights based on known task categories. For a math prompt, the math teacher receives high weight; for a coding prompt, the code teacher dominates; for a safety prompt, the safety teacher may override other teachers. This is common when prompts are drawn from benchmark categories or training environments with known labels.
A confidence-weighted scheme uses teacher uncertainty, reward scores, verifier outcomes, or validation accuracy to assign higher weight to teachers that appear more reliable on the current example. For example, a teacher that produces a verified correct answer, passes unit tests, or places a confident, low-entropy distribution on a locally plausible token may receive higher weight.
A learned router predicts teacher weights from prompt features, rollout features, or intermediate representations. The router may use metadata, domain classifiers, prompt embeddings, reward-model scores, teacher agreement, or validation curves to decide which teacher should supervise which state.
A hard-routing scheme selects one teacher:
\[k^\star =\arg\max_k r_k(x,y_{<t})\]
- where \(r_k\) is a routing score for teacher \(k\). The student then trains only against teacher \(k^\star\) for that prompt, trajectory, or token.
A soft-routing scheme assigns all teachers nonzero weights:
\[w_k(s_t) =\frac{ \exp(r_k(s_t)/\tau) }{ \sum_{j=1}^{K} \exp(r_j(s_t)/\tau) }\]
- where \(\tau\) controls how sharply the router concentrates on the best-scoring teacher. Low \(\tau\) approaches hard routing, while high \(\tau\) averages teachers more broadly.
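A small sketch of both routing rules, assuming the routing scores \(r_k\) are already available as plain numbers. The score values and function names are illustrative.

```python
import math


def hard_route(scores):
    """Index of the single teacher with the highest routing score."""
    return max(range(len(scores)), key=lambda k: scores[k])


def soft_route(scores, tau):
    """Temperature softmax over routing scores; returns one weight per teacher."""
    scaled = [r / tau for r in scores]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]                 # hypothetical r_k for three teachers
best = hard_route(scores)                # selects teacher 0
sharp = soft_route(scores, tau=0.05)     # nearly all weight on teacher 0
broad = soft_route(scores, tau=100.0)    # close to uniform averaging
```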
Token-level routing is more flexible than prompt-level routing, but it is also harder. A single trajectory may begin as a math proof, call Python for computation, then require concise final-answer formatting. In principle, each token could be supervised by a different teacher. In practice, token-level routing is expensive and can produce inconsistent style unless teachers are carefully aligned.

Multi-Teacher Distillation vs. Mixture-of-Experts

Multi-teacher distillation and Mixture-of-Experts architectures are related but distinct. In a Mixture-of-Experts model, multiple experts remain inside the deployed model and a router activates a subset of them during inference. In multi-teacher distillation, multiple teachers are used during training, but the final deployed student may be a single dense model, a smaller MoE model, or a general policy without explicit access to the original teachers.
The goal of multi-teacher distillation is therefore not necessarily to preserve separate experts at inference time. The goal is to internalize expert behavior into the student so that a single model can behave as if it had absorbed several specialized training runs.
This distinction matters for deployment. Multi-teacher distillation increases training complexity but can keep inference simple. Mixture-of-Experts architectures increase architectural complexity but may improve inference efficiency per active parameter. Some modern systems combine both ideas: a MoE student may be distilled from several specialist teachers.

Multi-Teacher On-Policy Distillation (MOPD)

MOPD extends OPD to multiple teachers. The student generates a rollout:
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\]
- and one or more teachers score the exact student-visited prefixes:
\[s_t=(x,\hat{y}_{<t})\]
A general MOPD objective can be written as:
\[\mathcal{L}_{MOPD}(\theta) =\mathbb{E}_{x\sim\mathcal{D}} \mathbb{E}_{\hat{y}\sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t} \sum_{k=1}^{K} w_k(s_t) D\left( p_{T_k}(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right) \right]\]
The defining feature is that the student supplies the trajectory, while the teachers supply dense feedback. This means the student is trained on states it is likely to visit at inference time, while still receiving specialized guidance from stronger or domain-specific policies.
MOPD is especially useful when specialists are easy to train but hard to jointly optimize. A math teacher can be RL-trained on math; a coding teacher can be trained on competitive programming; an agentic teacher can be trained in tool-use environments; a safety teacher can be trained on refusal and harm-prevention data. MOPD then attempts to transfer these local improvements into one general student.

MOPD in Practice

The article Multi-Teacher On-Policy Distillation: A New Post-Training Primitive and Cameron R. Wolfe’s X thread on multi-teacher on-policy distillation both emphasize that practical MOPD is a modular post-training workflow rather than merely a new loss. Specialized teachers can be selected from supervised checkpoints, RL-trained checkpoints, domain-adapted checkpoints, or intermediate checkpoints that were strongest on a particular benchmark family. The teacher pool is therefore a curated set of reusable supervision sources, not just a collection of final models.
Earlier checkpoints are often included explicitly to prevent catastrophic forgetting and post-training regressions. In a long recipe, the latest checkpoint may be strongest on one domain but weaker on earlier capabilities. Keeping earlier checkpoints as teachers allows the student to recover older strengths without rerunning expensive RL training from scratch.
Reverse KL is especially useful in MOPD because it lets each teacher provide a token-level advantage estimate over the student rollout. Instead of requiring every teacher to generate its own trajectory, the student samples a rollout once, and each selected teacher scores the sampled tokens under the student’s exact prefixes. The resulting log-probability gap acts like a dense advantage signal that can be integrated into RL-style training infrastructure.
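To see why the log-probability gap behaves like an advantage, consider the reverse KL at a single prefix \(s_t\), holding the prefix fixed (a simplifying assumption that ignores how \(\theta\) changes which prefixes are visited):
\[\nabla_\theta D_{KL}\left( p_S^\theta(\cdot \mid s_t) \,\Vert\, p_{T_k}(\cdot \mid s_t) \right) =\mathbb{E}_{v\sim p_S^\theta(\cdot \mid s_t)} \left[ \left( \log p_S^\theta(v \mid s_t) - \log p_{T_k}(v \mid s_t) \right) \nabla_\theta \log p_S^\theta(v \mid s_t) \right]\]
The remaining term \(\mathbb{E}_{v}\left[\nabla_\theta \log p_S^\theta(v \mid s_t)\right]\) vanishes because the student probabilities sum to one. A descent step on the reverse KL therefore follows \(\mathbb{E}_{v}\left[\left(\log p_{T_k}(v \mid s_t) - \log p_S^\theta(v \mid s_t)\right)\nabla_\theta \log p_S^\theta(v \mid s_t)\right]\), which has the shape of a policy gradient in which the teacher-minus-student log-probability gap on the sampled token plays the role of the advantage.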
Teacher requests must be scheduled dynamically to balance latency, accelerator utilization, and training freshness. A practical system may route only some rollouts to expensive teachers, batch requests by teacher, prioritize high-value domains, cache repeated prefixes, or use sampled-token scoring rather than full-distribution scoring. The objective is to make teacher scoring a scalable service rather than a bottleneck that starves the learner.
Capability regressions can be repaired through teacher selection and targeted distillation rather than by restarting the post-training recipe. If a later RL stage improves agentic tool use but hurts math, instruction following, or safety calibration, MOPD can route rollouts back to the best teacher checkpoint for the regressed capability and apply dense corrective supervision on student-visited states.
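One way to picture this repair step is as a lookup over checkpoint evaluations. The sketch below is a hypothetical illustration: the checkpoint names, scores, tolerance, and helper functions are assumptions, not part of any published recipe.

```python
def pick_teachers(checkpoint_scores):
    """For each domain, pick the checkpoint with the best validation score."""
    domains = {d for scores in checkpoint_scores.values() for d in scores}
    return {
        d: max(checkpoint_scores,
               key=lambda c: checkpoint_scores[c].get(d, float("-inf")))
        for d in domains
    }


def regressed_domains(student_scores, checkpoint_scores, tolerance):
    """Domains where the student trails the best checkpoint by more than tolerance."""
    best = pick_teachers(checkpoint_scores)
    return sorted(
        d for d, c in best.items()
        if student_scores.get(d, float("-inf")) < checkpoint_scores[c][d] - tolerance
    )


# Illustrative validation scores, not real measurements.
checkpoint_scores = {
    "sft": {"math": 0.60, "chat": 0.80},
    "math_rl": {"math": 0.75, "chat": 0.70},
    "agent_rl": {"math": 0.55, "chat": 0.72},
}
student_scores = {"math": 0.55, "chat": 0.79}

best_teacher = pick_teachers(checkpoint_scores)
to_repair = regressed_domains(student_scores, checkpoint_scores, tolerance=0.02)
# Route rollouts from regressed domains to the checkpoint that was best there.
routes = {d: best_teacher[d] for d in to_repair}
```

In this toy case only math is flagged, so math rollouts would be scored by the earlier `math_rl` checkpoint while other domains are left alone.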

Sampled-Token MOPD

In large-vocabulary LLMs, full-vocabulary multi-teacher distribution matching can be expensive because every teacher would need to return a large next-token distribution at every student prefix. A cheaper alternative is sampled-token MOPD, where each teacher only scores the token the student actually sampled.
For teacher \(T_k\) and state \(s_t\), the sampled-token teacher advantage is:
\[a_t^{(k)} = \log p_{T_k}(\hat{y}_t \mid s_t) - \log p_S^\theta(\hat{y}_t \mid s_t)\]
A weighted multi-teacher advantage can then be written as:
\[a_t^{MOPD} =\sum_{k=1}^{K} w_k(s_t) a_t^{(k)}\]
The student update becomes:
\[\mathcal{L}_{MOPD} =-\mathbb{E} \left[ \sum_t \operatorname{sg} \left[ a_t^{MOPD} \right] \log p_S^\theta(\hat{y}_t \mid s_t) \right]\]
- where \(\operatorname{sg}\) indicates that the advantage is treated as a fixed supervision signal rather than differentiated through teacher scoring.
This sampled-token form is attractive because it reuses RL-style infrastructure: rollouts are sampled, per-token log-probability gaps become dense advantages, and the learner updates the student using a policy-gradient-shaped loss even though the signal comes from teacher log-probabilities rather than scalar rewards.
The tradeoff is that sampled-token MOPD observes only one local token choice per prefix. It is cheaper and often more robust under support mismatch, but it may lose information contained in the teacher’s full local distribution.
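The sampled-token computation can be sketched with plain lists of log-probabilities. This example has no autograd, so the stop-gradient is only conceptual: the advantages are computed as ordinary numbers and then used as fixed coefficients. The array layout (`teacher_logps[k][t]`, `weights[k][t]`) and the values are assumptions made for illustration.

```python
def sampled_token_advantages(teacher_logps, student_logps, weights):
    """Weighted sum over teachers of (teacher - student) log-prob on each sampled token."""
    num_teachers = len(teacher_logps)
    return [
        sum(weights[k][t] * (teacher_logps[k][t] - student_logps[t])
            for k in range(num_teachers))
        for t in range(len(student_logps))
    ]


def surrogate_loss(advantages, student_logps):
    """Policy-gradient-shaped loss; advantages are treated as constants."""
    return -sum(a * lp for a, lp in zip(advantages, student_logps))


# Two teachers, a two-token rollout, hard routing of one teacher per token.
teacher_logps = [[-0.5, -3.0],   # teacher 0 scores for tokens 0 and 1
                 [-1.0, -0.2]]   # teacher 1 scores for tokens 0 and 1
student_logps = [-1.0, -1.0]
weights = [[1.0, 0.0],           # token 0 is routed to teacher 0
           [0.0, 1.0]]           # token 1 is routed to teacher 1

advantages = sampled_token_advantages(teacher_logps, student_logps, weights)
loss = surrogate_loss(advantages, student_logps)
```

Both advantages are positive here, so a gradient step on the loss would raise the student's probability of the tokens it sampled; a negative advantage would push the probability down.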

Full-Distribution and Top-\(k\) Multi-Teacher Matching

Full-distribution MOPD asks the student to match each selected teacher’s entire next-token distribution at every student-visited prefix:
\[D\left( p_{T_k}(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right)\]
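A top-\(k\) approximation restricts matching to a small set of high-probability tokens, taken here from the teacher (the student’s top tokens, or a union of both, are alternatives):
\[\mathcal{V}_m^{(k)}(s_t) =\operatorname{Top}_m p_{T_k}(\cdot \mid s_t)\]
- where \(m\) is the number of retained tokens (written \(m\) rather than \(k\) to avoid a clash with the teacher index), and both teacher and student probabilities are renormalized over \(\mathcal{V}_m^{(k)}(s_t)\) before computing the divergence.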
Full-distribution and top-\(k\) matching are most useful when teacher and student distributions have high local support overlap. If the student and teacher consider similar tokens plausible, dense matching can transfer richer information than sampled-token scoring.
If teacher and student support overlap is poor, dense distribution matching can become harmful. The teacher may be assigning a detailed distribution over continuations that make sense for its own reasoning style but are unreliable under the student’s prefix. In that case, forcing the student to match the full teacher distribution can amplify noise.
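The truncated matching step can be sketched as follows, assuming the retained set is taken from the teacher's highest-probability tokens and the divergence is forward KL. The parameter name `num_kept` and the toy distributions are illustrative.

```python
import math


def top_indices(probs, num_kept):
    """Indices of the num_kept highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)
    return order[:num_kept]


def renormalize(probs, indices):
    """Restrict a distribution to the given indices and rescale to sum to one."""
    total = sum(probs[v] for v in indices)
    return [probs[v] / total for v in indices]


def truncated_kl(teacher_probs, student_probs, num_kept):
    """Forward KL between teacher and student, both renormalized on the teacher's top tokens.

    Assumes the student assigns nonzero probability to every retained token.
    """
    kept = top_indices(teacher_probs, num_kept)
    p_t = renormalize(teacher_probs, kept)
    p_s = renormalize(student_probs, kept)
    return sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0.0)


teacher_probs = [0.60, 0.30, 0.05, 0.05]
student_probs = [0.25, 0.25, 0.25, 0.25]
divergence = truncated_kl(teacher_probs, student_probs, num_kept=2)
```

Renormalizing discards how much total mass each model places outside the retained set, so a student that puts most of its mass on tokens the teacher never ranks highly can still show a small truncated divergence.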

Support Overlap

Support overlap is the central practical constraint in MOPD. Since teachers score student-generated prefixes, the teacher must be reliable on the states the student actually visits.
A useful diagnostic is local top-\(k\) overlap:
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k p_S(\cdot \mid s_t) \cap \operatorname{Top}_k p_T(\cdot \mid s_t) \right| }{k}\]
High overlap means the student and teacher are choosing among a similar menu of plausible next tokens. In that case, disagreement is informative because the teacher can guide the student within a shared local support region.
Low overlap means the teacher and student are effectively operating in different local regimes. Teacher feedback may then reflect distribution mismatch rather than useful expertise.
This is why teacher lineage matters. If teachers are forks of the same base model and differ mainly by domain SFT and RL, their rollouts and local token supports are often closer to the student’s. If teachers are trained from different data sources, external models, or incompatible post-training recipes, the student’s prefixes may be out of distribution for the teachers.
A practical MOPD recipe should therefore measure support overlap, use warmup SFT or intermediate teacher checkpoints when overlap is low, prefer sampled-token feedback when full-distribution matching is noisy, and route examples only to teachers that are likely to understand the student’s local state.
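The local overlap diagnostic is simple to compute once both next-token distributions are available at a prefix. This is a minimal sketch with made-up distributions; ties in the ranking are broken by token index, which is an arbitrary choice.

```python
def top_k_set(probs, k):
    """Set of the k highest-probability token indices (ties broken by index)."""
    order = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)
    return set(order[:k])


def top_k_overlap(student_probs, teacher_probs, k):
    """Fraction of the top-k tokens that student and teacher share at one prefix."""
    shared = top_k_set(student_probs, k) & top_k_set(teacher_probs, k)
    return len(shared) / k


student_probs = [0.50, 0.30, 0.10, 0.05, 0.05]
teacher_probs = [0.45, 0.05, 0.05, 0.10, 0.35]

overlap = top_k_overlap(student_probs, teacher_probs, k=2)
# Student top-2 is {0, 1}; teacher top-2 is {0, 4}; they share one token.
```

Averaging this value over many student-visited prefixes, per teacher, gives a rough compatibility score that can inform routing or the choice between sampled-token and full-distribution losses.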

Teacher Conflict

Multi-teacher distillation can fail when teachers disagree in ways that cannot be averaged. A math teacher may prefer long exploratory reasoning, while a chat teacher may prefer concise direct answers. A safety teacher may prefer refusal, while a helpfulness teacher may prefer compliance. A coding teacher may prefer tool calls and execution, while a general reasoning teacher may prefer pure text. If these distributions are averaged without context, the student can learn an incoherent mixture.
Teacher conflict can appear at several levels. At the final-answer level, teachers may choose different answers or actions. At the trace level, they may use different reasoning paths or tool-use styles. At the token level, they may assign high probability to incompatible continuations. At the policy level, they may optimize different objectives, such as correctness, helpfulness, safety, brevity, or exploration.
A useful conflict score is pairwise teacher divergence:
\[C_{ij}(s_t) =D_{JSD} \left( p_{T_i}(\cdot \mid s_t) \,\Vert\, p_{T_j}(\cdot \mid s_t) \right)\]
High conflict should usually trigger routing, masking, or teacher selection rather than naive averaging. If the teachers disagree because the prompt is ambiguous, the student may need a policy that asks clarifying questions. If they disagree because one teacher is specialized and the other is out of domain, the router should prioritize the specialized teacher. If they disagree because one teacher is unsafe or unreliable on the current prefix, the system should filter that teacher’s signal.
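A sketch of the conflict score for three toy teachers at one prefix. The distributions, teacher names, and the 0.3-nat threshold are hypothetical; in practice a threshold would need to be tuned against observed conflict levels.

```python
import math


def kl(p, q):
    """Forward KL(p || q); terms with p[v] == 0 contribute nothing."""
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats; symmetric and bounded by ln 2."""
    mid = [(pv + qv) / 2.0 for pv, qv in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


teachers = {
    "math": [0.70, 0.20, 0.10],
    "chat": [0.60, 0.30, 0.10],
    "safety": [0.05, 0.05, 0.90],
}

names = sorted(teachers)
conflict = {
    (a, b): jsd(teachers[a], teachers[b])
    for i, a in enumerate(names) for b in names[i + 1:]
}

THRESHOLD = 0.3  # hypothetical cutoff in nats
high_conflict_pairs = sorted(pair for pair, c in conflict.items() if c > THRESHOLD)
# Pairs above the cutoff are candidates for routing or masking, not averaging.
```

Here the math and chat teachers nearly agree, while the safety teacher conflicts with both, which is the situation where a router or filter is more appropriate than a weighted average.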

Multi-Domain OPD in Cascade RL

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD as a stabilization point inside a longer Cascade RL pipeline. The recipe begins with SFT, proceeds through instruction-following RL and multi-domain RL, inserts multi-domain OPD to unify specialized expertise and recover regressions, and then continues with RLHF, long-context RL, code RL, and software-engineering RL.
The important design principle is inter-domain interference management. Cascade RL exposes how domains compete: instruction adherence may conflict with human-preference optimization, long reasoning may conflict with concision, code execution may conflict with general chat behavior, and tool-use optimization may change formatting. Multi-domain OPD gives the model a way to recover capabilities from the strongest intermediate domain teachers before later stages specialize further.
In Nemotron-Cascade 2, the domain teacher advantage is defined on the sampled token:
\[a_t^{MOPD} = \log \pi_{\text{domain}_i}(y_t \mid s_t) - \log \pi_{\text{train}}(y_t \mid s_t)\]
- where \(\pi_{\text{domain}_i}\) is the selected domain teacher and \(\pi_{\text{train}}\) is the policy being optimized.
The sampled-token advantage converges toward zero as the student absorbs the teacher’s preference for that domain. Positive values indicate that the teacher assigns higher probability to the sampled token than the student does; negative values indicate that the teacher considers the sampled token less likely than the student does.
Because rollout generation and learner updates can be asynchronous, the training objective may need to account for a behavior policy that generated the data and a current training policy that receives the update. This makes MOPD partly an algorithmic problem and partly a systems problem.
The following figure shows the Nemotron-Cascade 2 training pipeline, where SFT is followed by instruction-following RL, multi-domain RL, multi-domain on-policy distillation, RLHF, long-context RL, code RL, and software-engineering RL.

MOPD in Nemotron 3 Ultra

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales MOPD to a large agentic model trained with SFT, unified RLVR, specialist teacher training, MOPD warmup, multi-teacher OPD, and inference-oriented boosting.
The teacher pool spans many domains, including reasoning, code, math, tool use, agentic workflows, safety, usability, chat, long-context tasks, and software-engineering-style environments. The purpose is not merely to compress a larger model, but to consolidate capabilities produced by many separate specialist post-training paths.
A key practical lesson is that teachers trained with substantially different pipelines may not combine well through straightforward on-policy merging. When teachers and the student are trained on different SFT data or acquire different reasoning behaviors, student-generated trajectories can become out of distribution for the teacher, reducing the quality of token-level supervision.
The MOPD warmup stage addresses this by pulling the student closer to teacher-supported distributions before the main distillation run. Warmup is not just a convenience; it is a support-overlap intervention. It increases the likelihood that student rollouts land in regions where the teachers can provide reliable local preferences.
The recipe-review discussion describes Nemotron 3 Ultra as using two MOPD iterations with more than ten teachers across reasoning, code, math, and agentic domains. The first MOPD round distills a set of teachers into an improved student, and the second round re-distills from refreshed or reused teachers into a stronger final model.

DeepSeek-Style Specialist Consolidation

Public recipe discussions describe a progression in DeepSeek-style post-training from standard SFT and GRPO, to R1-style multi-stage reasoning RL, to hybrid think and non-think models, to specialist RL followed by SFT distillation, and then to ten-plus domain experts consolidated with MOPD.
The key conceptual progression is from monolithic RL toward modular specialist creation. Instead of forcing one model to optimize all domains simultaneously, the recipe trains domain experts and then merges them. Earlier versions may use SFT distillation to consolidate specialists; later versions may use MOPD to score student-generated rollouts directly.
A support-overlap analysis contrasts this with Nemotron 3 Ultra. If specialists are all forks of the same base model and trained with related domain SFT and RL recipes, their local token distributions may remain close enough for dense distribution matching to be safe. If specialists are influenced by external model data or substantially different training pipelines, sampled-token scoring and warmup may be more reliable than full-vocabulary matching.
The practical lesson is that the best MOPD loss depends on teacher-student lineage. Full teacher-distribution matching is powerful when support overlap is high. Sampled-token scoring is safer when overlap is lower. The right choice is not universal; it depends on how teachers were created.

MOPD vs. Trace-Distillation SFT

MOPD and trace-distillation SFT are two different ways to consolidate specialist teachers. Trace-distillation SFT asks each teacher to generate trajectories, filters those trajectories, and trains the student on the teacher-produced traces. MOPD asks the student to generate trajectories and then has teachers score the student’s actual prefixes.
Trace-distillation SFT is easier to implement because the resulting data is just a supervised corpus. It can be generated, filtered, cached, audited, and reused. It is especially attractive when the student is too weak to produce useful rollouts or when the organization does not yet have an OPD scoring infrastructure.
MOPD is more adaptive because it trains on student-visited states. It can correct the specific mistakes the student makes, and it reduces train-inference mismatch. It is especially valuable in long-horizon reasoning, coding, and agentic tasks where the student’s early actions determine future states.
The tradeoff is systems complexity. MOPD requires rollout generation, teacher routing, log-probability scoring, loss masking, staleness control, and support-overlap monitoring. Trace-distillation SFT requires high-quality data generation and filtering, but it does not require a synchronous or asynchronous teacher-scoring loop during student training.
A conservative recipe may therefore use:
\[\text{specialist RL} \rightarrow \text{trace-distillation SFT} \rightarrow \text{final RL}\]
- while a more MOPD-heavy recipe may use:
\[\text{specialist RL} \rightarrow \text{student rollouts} \rightarrow \text{teacher scoring} \rightarrow \text{multi-teacher OPD}\]
Both are forms of specialist consolidation. The difference is whether the training trajectory comes from the teacher or the student.

Teacher Construction

The quality of MOPD depends heavily on how the teachers are constructed. A teacher is not useful merely because it is strong on a benchmark; it must also provide reliable token-level preferences on student-generated prefixes.
Teachers can be constructed by taking a shared base model, applying domain-specific SFT, running RL or RLVR on a domain-specific environment, selecting the strongest checkpoint by validation, and then freezing it for distillation. This lineage-preserving approach improves support overlap because each teacher remains a perturbation of the same base policy.
Teachers can also be trained from heterogeneous data sources or external model outputs. This may produce stronger specialists in absolute benchmark terms, but it can reduce compatibility with the student. The teacher may solve the task in a style or token distribution the student does not naturally visit, making OPD supervision noisier.
A practical teacher-selection rule is to evaluate not only teacher accuracy but also teacher-student compatibility. The best teacher for MOPD is not necessarily the highest-scoring teacher. It is the teacher that gives useful local gradients on the student’s rollouts.
Teacher checkpoints may also need to be staged. Instead of distilling from a fully converged specialist that has moved far from the student, the recipe may distill from intermediate checkpoints first, gradually moving the student toward the final specialist distribution.

Aggregating Teachers

Teacher aggregation can occur at the distribution level, token-advantage level, example level, or stage level. Distribution-level aggregation averages teacher next-token probabilities. Token-advantage aggregation combines teacher-student log-probability gaps. Example-level aggregation sends each prompt or rollout to one teacher. Stage-level aggregation distills from different teachers in separate phases.
Distribution-level aggregation is mathematically clean but fragile under teacher conflict. It works best when teachers are aligned and share local support.
Token-advantage aggregation is natural for MOPD because each teacher’s signal is expressed as a dense advantage on the sampled token. This allows each teacher’s weight to be raised, reduced, or set to zero depending on domain confidence, reward scores, or routing decisions.
Example-level routing is simple and common in domain-labeled training. A math prompt goes to a math teacher, a coding prompt goes to a coding teacher, and a safety prompt goes to a safety teacher. This reduces conflict but can miss mixed-domain trajectories.
Stage-level aggregation is useful when teachers are too incompatible to combine in one loss. A student may first absorb instruction-following teachers, then math teachers, then code teachers, then agentic teachers, with evaluation gates between stages.
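As a minimal sketch of token-advantage aggregation, the snippet below combines per-teacher log-probability gaps on the student's sampled tokens into one advantage per token. The function name, the teacher names, the weight normalization, and the toy log-probabilities are all illustrative assumptions rather than part of any specific MOPD implementation.

```python
def aggregate_token_advantages(student_logps, teacher_logps, weights):
    """Combine per-teacher log-probability gaps into one advantage per token.

    student_logps: list of log pi_theta(y_t | s_t), one per sampled token.
    teacher_logps: dict mapping teacher name -> list of log pi_T(y_t | s_t).
    weights: dict mapping teacher name -> non-negative weight (0 drops a teacher).
    """
    active = {k: w for k, w in weights.items() if k in teacher_logps and w > 0}
    total = sum(active.values())
    if total == 0:
        return [0.0] * len(student_logps)  # no trusted teacher: no signal
    advantages = []
    for t, s_lp in enumerate(student_logps):
        # Weighted sum of teacher-student gaps on the sampled token.
        gap = sum(w * (teacher_logps[k][t] - s_lp) for k, w in active.items())
        advantages.append(gap / total)  # normalize weights (an assumption)
    return advantages


# Toy values: two sampled tokens scored by two hypothetical teachers.
student = [-1.0, -2.0]
teachers = {"math": [-0.5, -3.0], "code": [-1.5, -1.0]}
math_only = aggregate_token_advantages(student, teachers, {"math": 1.0, "code": 0.0})
blended = aggregate_token_advantages(student, teachers, {"math": 0.5, "code": 0.5})
```

With equal weights the two teachers cancel on both tokens in this toy case, which illustrates how conflicting teachers can dilute the signal when they are blended rather than routed.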

Teacher Agreement and Calibration

Multi-teacher systems require calibration because teachers may differ in entropy, verbosity, confidence, and logit scale. A high-confidence teacher is not always more correct; it may simply be more sharply calibrated or more overconfident.
Calibration can be handled with temperature scaling:
\[p_{T_k}^{(\tau_k)}(y \mid s_t) = \frac{ \exp(z_{T_k,y}/\tau_k) }{ \sum_{v} \exp(z_{T_k,v}/\tau_k) }\]
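The temperature-scaled distribution above can be sketched in a few lines. The logits below are made-up values for a three-token vocabulary, and `temperature_softmax` is an illustrative helper name, not a library function.

```python
import math


def temperature_softmax(logits, tau):
    """Return softmax(logits / tau) as a list of probabilities."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from one teacher over a 3-token vocabulary.
teacher_logits = [4.0, 1.0, 0.0]
sharp = temperature_softmax(teacher_logits, tau=1.0)
soft = temperature_softmax(teacher_logits, tau=4.0)
```

A larger \(\tau_k\) flattens an overconfident teacher without changing the ranking of its tokens, which is why per-teacher temperatures can bring teachers with different logit scales onto a comparable footing.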
Teachers can also be normalized by validation accuracy, per-domain benchmark score, reward-model score, verifier success rate, or observed teacher-student agreement on held-out rollouts.
A teacher-disagreement diagnostic can be computed as average pairwise Jensen-Shannon divergence:
\[\bar{C}(s_t) =\frac{2}{K(K-1)} \sum_{i<j} D_{JSD} \left( p_{T_i}(\cdot \mid s_t) \,\Vert\, p_{T_j}(\cdot \mid s_t) \right)\]
High disagreement does not automatically mean the example should be discarded. It may mean the prompt is genuinely multi-domain or ambiguous. The correct response may be to route more carefully, ask for clarification, preserve uncertainty, or prioritize a safety teacher.
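The disagreement diagnostic can be computed directly from teacher next-token distributions at one prefix. The sketch below assumes each teacher distribution is a plain list of probabilities over a shared vocabulary; the helper names are illustrative.

```python
import math
from itertools import combinations


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_pairwise_jsd(dists):
    """Average JSD over all K(K-1)/2 unordered teacher pairs."""
    pairs = list(combinations(dists, 2))
    if not pairs:
        return 0.0
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)


# Toy teacher distributions over a 2-token vocabulary.
agree = mean_pairwise_jsd([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
conflict = mean_pairwise_jsd([[1.0, 0.0], [0.0, 1.0]])
```

Dividing the sum by the number of unordered pairs is the same as the \(\frac{2}{K(K-1)}\) factor in the formula. With natural logarithms the value lies between 0 and \(\log 2\), so a fixed threshold can be applied across prefixes.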

Regression Recovery and Capability Preservation

A major reason to use multi-teacher distillation is regression recovery. During staged RL, a model may become better at one domain while losing capability in another. For example, a stage that improves strict instruction following may change response style in ways that hurt open-ended chat; a stage that improves long reasoning may hurt concise instruction compliance; a code RL stage may affect non-code reasoning.
Multi-teacher distillation can preserve the best checkpoint for each domain. Instead of accepting the final RL checkpoint as globally best, the recipe selects domain teachers from different points in training. Each teacher captures the strongest local behavior for a domain, and the student is trained to combine them.
This is why Cascade RL and multi-domain OPD fit together naturally. Cascade RL creates a sequence of domain-improved checkpoints. MOPD then provides a mechanism to merge the best parts of those checkpoints before further training continues.
Regression recovery should be evaluated explicitly. A successful MOPD run is not only one that improves average score; it should recover losses on domains that earlier stages damaged while preserving the gains from newer stages.

Practical MOPD Training Loop

A practical MOPD training loop begins by training or selecting a pool of specialist teachers, each associated with a domain, benchmark category, environment, or capability. The student is initialized from an SFT, RLVR, or prior post-trained checkpoint that is already capable enough to generate meaningful rollouts. Prompts are sampled from a mixture that covers the target domains, and the student generates trajectories under the current rollout policy.
Each trajectory is routed to a teacher or teacher subset. The router may use prompt metadata, benchmark labels, classifier outputs, tool-use state, verifier results, or a learned routing score. The selected teachers score the student’s sampled tokens under the exact student prefixes and return teacher log-probabilities, top-\(k\) distributions, or dense token-level advantages.
The learner computes a weighted MOPD loss, masks invalid or untrusted tokens, applies clipping or importance weighting when rollout and learner policies differ, updates the student, and evaluates the result on both target-domain benchmarks and broad regression suites.
A compact recipe is:
\[\text{train specialists} \rightarrow \text{select teachers} \rightarrow \text{warm up student} \rightarrow \text{sample rollouts} \rightarrow \text{route to teachers} \rightarrow \text{score tokens} \rightarrow \text{update student} \rightarrow \text{evaluate regressions} \rightarrow \text{refresh or repeat}\]
The warmup step is optional in theory but often crucial in practice when teachers are not close forks of the student. Its purpose is to increase the probability that student rollouts fall into teacher-supported regions.
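The learner step in this loop (masking, importance weighting, and clipping) can be sketched as a per-token coefficient on the student's log-probability gradient. This is one possible choice under stated assumptions: the truncated importance ratio, the `max_ratio` value, and the function name are illustrative and not taken from a specific system.

```python
import math


def mopd_token_coefficients(advantages, learner_logps, rollout_logps, mask, max_ratio=2.0):
    """Per-token coefficients for a policy-gradient-style MOPD update.

    advantages: aggregated teacher-student gaps on the sampled tokens.
    learner_logps: log-probs of the sampled tokens under the current learner.
    rollout_logps: log-probs of the same tokens under the rollout policy.
    mask: 1 for trusted tokens, 0 for invalid or untrusted tokens.
    """
    coeffs = []
    for adv, lp, rp, m in zip(advantages, learner_logps, rollout_logps, mask):
        # Truncated importance weight corrects for rollout/learner mismatch.
        ratio = min(math.exp(lp - rp), max_ratio)
        coeffs.append(m * ratio * adv)
    return coeffs


# Toy batch: token 1 has drifted from the rollout policy, token 2 is masked.
coeffs = mopd_token_coefficients(
    advantages=[0.5, -1.0, 0.2],
    learner_logps=[-1.0, -2.0, -0.5],
    rollout_logps=[-1.0, -3.0, -0.5],
    mask=[1, 1, 0],
)
```

In an autograd framework these coefficients would be treated as constants and multiplied by the learner's token log-probabilities before summing into the loss.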

Engineering and Systems Design

Multi-teacher systems introduce significant infrastructure requirements because they combine RL rollout generation, teacher inference serving, dynamic routing, log-probability transport, loss aggregation, and regression evaluation. Multiple teacher servers must be hosted and queried efficiently, often with vLLM-style inference clusters or equivalent high-throughput serving stacks, and the system must avoid letting teacher scoring become the bottleneck for the learner.
Routing logic determines which teachers should score each prompt, rollout, or token span. In a simple benchmark-labeled setup, a math prompt goes to a math teacher and a code prompt goes to a code teacher. In a mixed agentic trajectory, the router may need to recognize that one segment is reasoning, another is tool use, another is code execution, and another is final-answer formatting. Routing errors can either dilute specialization or apply harmful supervision from an out-of-domain teacher.
Teacher log-probabilities from different models must be aligned and aggregated carefully. If teachers share tokenizer, chat template, and special-token conventions, aggregation is straightforward. If they do not, the system may require expensive retokenization, sequence alignment, or token-span mapping, and the resulting token-level loss may become noisy or ambiguous.
Tokenizer compatibility is highly desirable because MOPD relies on teacher probabilities assigned to the student’s sampled tokens. When teacher and student vocabularies differ, a single student token may correspond to multiple teacher tokens or vice versa, which complicates sampled-token scoring and makes full-distribution matching even harder.
Fault tolerance and asynchronous scheduling are essential when dozens of teachers are involved. Teacher servers may fail, run at different speeds, or have different memory and batching constraints. A practical MOPD system needs queues, timeouts, fallback routing, teacher-health checks, freshness windows, and replay-buffer policies so that missing or stale teacher scores do not corrupt the learner.
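A minimal sketch of fallback routing under teacher failures is shown below. The teacher names, the `healthy` set, and the convention that `None` means "skip distillation for this trajectory" are hypothetical; a real system would add timeouts, queues, and freshness checks around this logic.

```python
def route_trajectory(domain, teachers_by_domain, healthy, fallback="general"):
    """Pick the first healthy teacher for a domain, else a healthy fallback.

    Returns None when no teacher is available, so the caller can mask the
    distillation loss for that trajectory instead of using stale scores.
    """
    for name in teachers_by_domain.get(domain, []):
        if name in healthy:
            return name
    return fallback if fallback in healthy else None


# Hypothetical teacher pool: the preferred math teacher is down.
teachers_by_domain = {"math": ["math-v2", "math-v1"], "code": ["code-v1"]}
healthy = {"math-v1", "general"}

math_teacher = route_trajectory("math", teachers_by_domain, healthy)
code_teacher = route_trajectory("code", teachers_by_domain, healthy)
no_teacher = route_trajectory("code", teachers_by_domain, healthy=set())
```

Returning an explicit "no teacher" value keeps routing decisions auditable: masked trajectories can be counted and logged rather than silently scored by an out-of-domain model.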
The implementation complexity is higher than single-teacher distillation because the system must coordinate teacher selection, teacher serving, token alignment, log-probability aggregation, and multi-domain evaluation. The payoff is that capability preservation and modularity can improve substantially: a team can integrate new specialists, reuse older checkpoints, repair regressions, and consolidate RL-derived improvements without restarting the entire post-training recipe.

Advantages of Multi-Teacher Distillation

Multi-teacher distillation enables a single student to absorb complementary strengths from multiple specialized models, which is especially valuable when no single teacher is best across math, code, safety, chat, long context, and agentic tool use.
It mitigates catastrophic forgetting and post-training regressions by retaining older or domain-specialized checkpoints as active sources of supervision, allowing the student to recover capabilities that later RL stages might otherwise overwrite.
It provides a modular way to integrate new capabilities without retraining from scratch, because a new specialist can be trained, validated, added to the teacher pool, and distilled into the general student through targeted routing or staged consolidation.
It supports efficient consolidation of RL-derived improvements, since domain-specific RL runs can be performed independently and then merged through trace distillation or MOPD instead of forcing all capabilities into one expensive, interference-prone RL run.
It allows practitioners to reuse valuable intermediate checkpoints as lasting supervision sources, which is particularly important in long cascade-style recipes where the final checkpoint is not necessarily the best checkpoint for every domain.
It improves organizational scalability because different teams can build and maintain different teachers, while the final distillation stage converts that distributed work into one deployable model.

Limitations

Teacher signals may conflict, requiring careful weighting, routing, masking, or stage ordering. If a math teacher favors long exploration while a chat teacher favors concision, naive averaging can dilute both behaviors rather than producing a model that knows when to use each style.
Infrastructure costs increase significantly as the number of teachers grows, because each teacher may require dedicated serving capacity, batching logic, memory allocation, monitoring, and fallback handling.
Routing policies can become complex and task-dependent, especially for mixed-domain trajectories that combine reasoning, code execution, search, tool use, safety judgment, and final-answer formatting inside one rollout.
Poorly balanced weights can dilute specialization or destabilize optimization. Overweighting a general teacher may erase specialist gains, while overweighting a specialist may damage general chat, safety, or instruction-following behavior.
Cross-tokenizer alignment becomes difficult when teachers use incompatible vocabularies, chat templates, or tool-call formats. This can make token-level log-probability comparison unreliable and can force the system toward sequence-level trace distillation instead of sampled-token MOPD.
Support mismatch can make a strong teacher a poor distillation teacher. If the student does not visit prefixes that the teacher understands, the teacher’s dense token-level feedback may encode distribution mismatch rather than useful expertise.
Debugging is harder than in single-teacher distillation because a regression may be caused by the student rollout policy, a teacher checkpoint, a router decision, a token-alignment issue, a stale rollout, an aggregation weight, or a domain-mixture imbalance.

Design Rules

A strong multi-teacher recipe should begin from a shared base or compatible student-teacher lineage whenever possible, because shared lineage improves local support overlap and makes token-level supervision more reliable.
The recipe should train specialists in domains where independent optimization is beneficial and where joint RL is likely to create interference. Math, code, software engineering, long-context reasoning, tool use, safety, and chat often benefit from different data sources, verifiers, and reward structures.
Teacher selection should consider compatibility as well as benchmark score. A teacher that is slightly weaker but locally aligned with the student may provide better MOPD gradients than a stronger teacher whose reasoning style is far from the student.
Routing should be explicit and auditable. The system should know why a trajectory was sent to a teacher, and routing errors should be measurable through teacher agreement, verifier outcomes, and downstream regressions.
The loss should match the support regime. Full-distribution matching is useful when teacher-student overlap is high. Top-\(k\) local matching is useful when support is partially shared. Sampled-token scoring is safer when support overlap is lower or teacher serving cost is high.
Warmup should be used when teachers and students are far apart. Warmup can be SFT on teacher-domain traces, curriculum prompts that elicit teacher-supported behavior, or progressive distillation from intermediate teacher checkpoints.
The final student should be judged by preservation as much as improvement. A successful multi-teacher distillation run should improve specialist domains while preserving chat, safety, instruction following, calibration, long-context behavior, and agentic reliability.

When Multi-Teacher Distillation is Preferred

Multi-teacher distillation is preferred when no single teacher dominates all domains, when different capabilities require different training environments or verifiers, when specialist teachers can be trained in parallel, when joint RL across all domains creates interference, when a final deployable model must merge domain expertise into one checkpoint, and when the team can support teacher routing, scoring, calibration, and regression evaluation.
MOPD is preferred when the student is already capable enough to generate meaningful rollouts and when teacher feedback on student-visited states is more valuable than teacher-generated traces. Trace-distillation SFT is preferred when the student is too weak for on-policy rollouts, when teacher-generated traces are easy to verify, or when the organization wants a simpler and more reproducible consolidation pipeline.
The practical modern recipe is therefore not “always use MOPD.” It is to use off-policy SFT or trace distillation to create a capable student, use RL or specialist RL to produce strong domain teachers, use warmup or intermediate checkpoints to improve support overlap, and then use MOPD when dense teacher feedback on student rollouts can safely consolidate the specialists.

Reinforcement Learning-Distillation Hybrids

RL-distillation hybrids combine two forms of supervision that are individually powerful but incomplete. Reinforcement learning trains on trajectories sampled by the current policy and can optimize directly for task success, but its feedback is often sparse, delayed, noisy, or difficult to assign to individual tokens. Distillation supplies dense token-level guidance, but classical distillation usually trains on teacher or dataset trajectories rather than on the student’s actual behavior. Hybrid methods try to keep the trajectory relevance of RL while adding the local credit assignment of distillation.
The central idea is that modern post-training increasingly treats distillation as a dense advantage-estimation mechanism. A teacher, self-teacher, verifier-conditioned model, or feedback-conditioned model can assign token-level log-probability differences along a student rollout. Those differences can then act like rewards, advantage weights, clipping signals, routing signals, gates, masks, or auxiliary losses inside an RL-style training loop.
These hybrids are especially important for reasoning, coding, tool-use, scientific, and agentic tasks. In those settings, a final scalar reward may say that an answer was wrong, that a test failed, that an experiment did not meet a target, or that an agent did not complete the task, but it does not identify which intermediate reasoning step, tool call, file edit, command, observation, or assumption caused the failure. Distillation can turn richer teacher or feedback information into dense local updates over the tokens or actions that produced the trajectory.
RL-distillation hybrids also clarify why OPD and MOPD fit naturally inside RL infrastructure. OPD uses student-generated rollouts, like RL, but replaces or augments sparse sequence-level rewards with teacher log-probability gaps at each sampled token. MOPD extends this pattern by routing rollouts to domain teachers, allowing post-training systems to consolidate RL-derived specialist capabilities without rerunning a single monolithic RL job across every domain.
The practical design question is not whether to use RL or distillation, but how to combine them. Some methods use RL as the primary objective and distillation as a gated auxiliary loss. Some methods use reward signals to scale, filter, or extrapolate a distillation loss. Some methods route successful samples to RL and failed samples to self-distillation. Some methods turn textual critiques, runtime errors, user feedback, environment observations, or future hindsight into teacher-only context and then distill from that richer view.
The main risk is that dense token-level supervision can be overconfident or misaligned with the deployed student’s information state. Privileged teachers may suppress uncertainty, shorten reasoning traces, hallucinate hidden context, or reward style differences rather than task-bearing improvements. Strong hybrid recipes therefore anchor distillation in rewards, verifiers, gates, contrastive controls, support-overlap checks, provenance tracking, and regression evaluations.

Why RL and Distillation Are Converging

Classical reinforcement learning for LLMs usually optimizes a policy objective of the form:
\[\mathcal{L}_{RL}(\theta) =-\mathbb{E}_{x\sim\mathcal{D},y\sim\pi_\theta(\cdot\mid x)} \left[ A(x,y) \sum_t \log \pi_\theta(y_t\mid x,y_{<t}) \right]\]
- where \(A(x,y)\) is a trajectory-level or response-level advantage derived from a reward model, verifier, preference signal, or group-relative normalization.
The credit-assignment problem is that the same scalar advantage is often broadcast across every token in the response. If a solution is wrong because of one algebraic error, one incorrect API call, one missed observation, or one missing test, the RL update may still push or pull on the entire trajectory.
Distillation supplies a token-level signal. In sampled-token OPD, the dense teacher-student gap can be written as:
\[A_t^{OPD} =\log \pi_T(y_t\mid s_t) - \log \pi_\theta(y_t\mid s_t)\]
- where \(s_t=(x,y_{<t})\) is the student-visited prefix. Tokens that the teacher rates above the student receive positive local signal, while tokens that the teacher rates below the student receive negative local signal.
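The contrast between a broadcast sequence-level advantage and the sampled-token OPD gap can be seen on a toy four-token response. The log-probabilities below are invented for illustration, and the helper names are not from any library.

```python
def broadcast_advantages(sequence_advantage, num_tokens):
    """Sequence-level RL: every token receives the same scalar advantage."""
    return [sequence_advantage] * num_tokens


def opd_advantages(teacher_logps, student_logps):
    """Sampled-token OPD: log pi_T(y_t | s_t) - log pi_theta(y_t | s_t)."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]


# Toy response where only the third token is one the teacher strongly rejects.
student_logps = [-0.2, -0.3, -0.1, -0.4]
teacher_logps = [-0.2, -0.3, -4.0, -0.4]

sparse = broadcast_advantages(-1.0, num_tokens=4)
dense = opd_advantages(teacher_logps, student_logps)
worst_token = min(range(len(dense)), key=lambda t: dense[t])
```

The broadcast signal penalizes all four tokens equally, while the dense gap is zero everywhere except the token where teacher and student disagree.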
A hybrid objective can therefore combine sparse reward-level guidance with dense token-level teacher guidance:
\[\mathcal{L}_{hybrid}(\theta) =\mathcal{L}_{RL}(\theta) + \lambda \mathcal{L}_{distill}(\theta)\]
- where \(\lambda\) controls how strongly the dense distillation term influences the policy relative to the reward-grounded RL term.
The convergence of RL and distillation is partly algorithmic and partly infrastructural. RL systems already generate on-policy rollouts, store token log-probabilities, compute advantages, mask invalid tokens, and update policies. OPD and self-distillation systems can reuse the same machinery, replacing or augmenting scalar rewards with teacher log-probability gaps, feedback-conditioned logits, or privileged-context token scores.

Sparse Reward and Dense Supervision

Standard RL optimizes expected trajectory reward:
\[J_{\mathrm{RL}}(\theta) =\mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]\]
- where \(\tau\) is a generated trajectory and \(R(\tau)\) is a scalar reward from a verifier, judge, environment, user response, or task outcome.
In RLVR, the reward may be reliable but sparse. A math answer can be checked, a coding solution can pass or fail tests, and a tool-use episode can succeed or fail, but the reward may not identify which token or action caused the outcome.
Distillation provides dense feedback by comparing teacher and student token distributions along the same trajectory:
\[A_t^{D} =\log \pi_T(a_t \mid s_t) - \log \pi_\theta(a_t \mid s_t)\]
- where \(s_t\) is the current prefix or environment state, \(a_t\) is the sampled token or action, \(\pi_T\) is the teacher, and \(\pi_\theta\) is the student.
A hybrid update can combine sparse reward with dense teacher scores:
\[\nabla_\theta J_{\mathrm{hybrid}} =\mathbb{E} \left[ \sum_t \left( A^{RL}(\tau) + \lambda_D A_t^{D} \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]\]
- where \(A^{RL}(\tau)\) is a reward-derived trajectory advantage and \(A_t^D\) is a dense token-level distillation signal.
This decomposition makes the central tradeoff explicit. RL supplies reward grounding, while distillation supplies local credit assignment. If the dense signal agrees with reward, training becomes more sample-efficient. If it disagrees with reward, the model can learn the wrong behavior faster than it would under sparse RL alone.

Hybrid Template: RL Backbone plus Distillation Auxiliary

The most conservative hybrid pattern keeps RL as the task-grounded backbone and adds distillation as an auxiliary signal. This is useful when rewards or verifiers are trusted but sparse, while teacher or self-teacher signals are informative but potentially noisy.
The objective is:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda_D \mathcal{L}_D(\theta)\]
A token-level form is:
\[\mathcal{L}_{D}(\theta) =-\mathbb{E} \left[ \sum_t w_t \operatorname{sg}[A_t^{D}] \log \pi_\theta(y_t\mid s_t) \right]\]
- where \(A_t^D\) may be a teacher-student log-probability gap, a self-teacher gap, a contrastive privileged-context signal, or a feedback-conditioned correction, and \(w_t\) is a gate, mask, or confidence weight.
This pattern is attractive because the RL term preserves the direction of task improvement while the distillation term refines local token-level credit assignment. It is especially useful when the teacher is helpful but not trusted enough to become the primary objective.

Hybrid Template: Distillation Modulated by Rewards

A second pattern treats distillation as the main dense update but uses rewards to scale, filter, or extrapolate the teacher signal. This is useful when a teacher provides dense local preferences, but the student should be allowed to exceed the teacher rather than merely imitate it.
A generic reward-modulated distillation update is:
\[\mathcal{L}_{reward\text{-}modulated}(\theta) =-\mathbb{E} \left[ \sum_t g(R(x,y),A_t^D) \log \pi_\theta(y_t\mid s_t) \right]\]
- where \(g\) combines a trajectory reward with a token-level teacher signal.
If a rollout receives high reward, the student’s own tokens can be reinforced even where they differ from what the teacher would prefer. If a rollout receives low reward, the teacher signal can be used as a correction. This pattern turns distillation from pure imitation into reward-aware policy improvement.
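One hypothetical choice of \(g\) is a simple interpolation between a reward bonus and the teacher gap. The sketch below assumes rewards are scaled to \([0, 1]\); the function name, the `bonus` parameter, and the linear form are illustrative assumptions, not a published method.

```python
def reward_modulated_weight(reward, token_gap, bonus=1.0):
    """Hypothetical g(R, A_t^D) for a reward R in [0, 1].

    High reward: the sampled token is reinforced by `bonus`, regardless of
    whether the teacher agreed. Low reward: the teacher gap acts as a correction.
    """
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    return reward * bonus + (1.0 - reward) * token_gap


# Successful rollout, teacher disliked the token: still reinforced.
exceeds_teacher = reward_modulated_weight(reward=1.0, token_gap=-2.0)
# Failed rollout: the teacher gap passes through as the token weight.
corrected = reward_modulated_weight(reward=0.0, token_gap=-2.0)
```

Under this form a student that succeeds in a way the teacher would not have chosen is not pulled back toward the teacher, which is what allows it to exceed the teacher on rewarded behavior.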

Hybrid Template: Sample Routing

A third pattern routes different samples to different objectives. Correct or high-reward rollouts are sent to an RL branch because they already demonstrate useful behavior, while incorrect or low-reward rollouts are sent to a distillation branch where a teacher, self-teacher, or feedback-conditioned model can provide dense correction.
A routing function can be written as:
\[\rho(x,y) =\mathbf{1}[R(x,y)\geq \tau]\]
- where \(\tau\) is a reward threshold (not the trajectory \(\tau\) used earlier), \(\rho=1\) routes the sample to an RL branch, and \(\rho=0\) routes it to a corrective distillation branch.
The resulting objective is:
\[\mathcal{L}(\theta) =\mathbb{E} \left[ \rho(x,y)\mathcal{L}_{RL} +(1-\rho(x,y))\mathcal{L}_{D} \right]\]
This pattern is appealing because successful samples and failed samples contain different information. Successful samples can be reinforced directly; failed samples often require diagnostic feedback that sparse rewards do not provide.
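The routing rule can be sketched as a batch split on a reward threshold. The sample dictionaries and their `"reward"` key are a hypothetical data layout chosen for illustration.

```python
def route_samples(samples, threshold):
    """Split rollouts by rho = 1[R >= threshold].

    Returns (rl_branch, distill_branch): high-reward rollouts go to RL,
    low-reward rollouts go to corrective distillation.
    """
    rl_branch, distill_branch = [], []
    for sample in samples:
        if sample["reward"] >= threshold:
            rl_branch.append(sample)
        else:
            distill_branch.append(sample)
    return rl_branch, distill_branch


# Toy batch with binary verifier rewards.
batch = [
    {"id": "a", "reward": 1.0},
    {"id": "b", "reward": 0.0},
    {"id": "c", "reward": 1.0},
]
rl_branch, distill_branch = route_samples(batch, threshold=0.5)
```

Tracking the sizes of the two branches over training is a cheap diagnostic: as the student improves, the distillation branch should shrink.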

Hybrid Template: Feedback-Conditioned Self-Distillation

A fourth pattern uses feedback as privileged teacher context. The student acts without feedback, then a teacher view sees feedback such as a runtime error, test failure, judge comment, user correction, tool observation, environment transition, or final answer. The teacher view scores the original student trajectory under this richer context, and the student learns from the resulting dense correction.
The student distribution is:
\[\pi_S(\cdot\mid x,y_{<t})\]
The feedback-conditioned teacher distribution is:
\[\pi_T(\cdot\mid x,f,y_{<t})\]
- where \(f\) is feedback that is available during training but not necessarily available during deployment.
This setup is useful because many post-training environments provide richer evidence than a scalar reward. A compiler error, failing unit test, browser observation, user complaint, tool error, or environment state transition often explains what went wrong more clearly than a binary success label.
The risk is that feedback-conditioned behavior may encode style, hindsight, or hidden-state assumptions that should not be copied unconditionally. Feedback should improve task-relevant token preferences, not teach the deployed student to act as if feedback were already present.

Reward-Tilted Teacher View

A useful way to understand successful OPD-style hybrids is through a reward-tilted teacher. Let \(s\) denote a complete trajectory, \(R(s)\) the downstream reward, and \(\pi_k^S\) the current student policy held fixed. KL-regularized reward maximization has the closed-form optimum:
\[\pi_T^*(s) =\frac{1}{Z} \pi_k^S(s) \exp(\beta R(s)), \qquad Z=\mathbb{E}_{s \sim \pi_k^S} \left[ \exp(\beta R(s)) \right]\]
- where \(\beta\) controls reward strength and \(Z\) normalizes the distribution.
If the teacher equals this reward-tilted policy and its gradient is stopped, reverse-KL distillation decomposes into a policy-proximity term and a reward-maximization term:
\[D_{KL} \left( \pi^S \,\Vert\, \pi_T^* \right) =D_{KL} \left( \pi^S \,\Vert\, \pi_k^S \right) -\beta \mathbb{E}_{s \sim \pi^S} \left[ R(s) \right] + \log Z\]
Since \(\log Z\) does not depend on the optimized student, minimizing this objective is equivalent to increasing reward while staying close to the current student. This is why distillation toward a good OPD teacher can behave like dense KL-regularized RL.
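The decomposition can be checked numerically on a toy problem with three complete trajectories. The probabilities, rewards, and \(\beta\) below are arbitrary illustrative values.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy setup: three trajectories, a fixed current student pi_k^S, and rewards.
student_k = [0.5, 0.3, 0.2]
reward = [1.0, 0.0, 0.0]
beta = 2.0

# Reward-tilted teacher: pi_T^*(s) = pi_k^S(s) * exp(beta * R(s)) / Z.
Z = sum(p * math.exp(beta * r) for p, r in zip(student_k, reward))
teacher = [p * math.exp(beta * r) / Z for p, r in zip(student_k, reward)]

# Any candidate student pi^S should satisfy the identity.
candidate = [0.6, 0.25, 0.15]
lhs = kl(candidate, teacher)
expected_reward = sum(c * r for c, r in zip(candidate, reward))
rhs = kl(candidate, student_k) - beta * expected_reward + math.log(Z)
```

Here `lhs` and `rhs` agree up to floating-point error for any valid `candidate`, and the teacher places more mass than the current student on the rewarded trajectory.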
The teacher-student log-ratio gives a practical diagnostic:
\[\Delta_T(s) =\log \pi_T(s) - \log \pi_S(s)\]
A useful teacher should assign larger \(\Delta_T(s)\) to higher-reward trajectories. If \(\Delta_T(s)\) tracks correctness, the dense distillation signal supports RL. If \(\Delta_T(s)\) tracks a prompt artifact, feedback-aware response shape, or fabricated reference, the dense signal can oppose the reward.
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t uses this view to explain why RL-trained expert teachers can work while naive privileged self-teachers can fail: the expert behaves like a reward-tilted student, while the privileged self-teacher may up-weight responses that look as if they used hidden feedback or a reference solution, whether or not the answer is correct.

Direction, Magnitude, and Gating

A useful abstraction for RL-distillation hybrids is to separate update direction from update magnitude. A verifier or reward model can decide whether a trajectory should become more or less likely, while a teacher or self-teacher can decide which tokens should receive stronger or weaker updates.
RL can determine update direction from reward:
\[\operatorname{dir}_t =\operatorname{sign} \left( A_t^{RL} \right)\]
Distillation can determine token-level magnitude from teacher-student disagreement:
\[w_t^D =g \left( \log \pi_T(a_t \mid s_t) -\log \pi_\theta(a_t \mid s_t) \right)\]
The resulting update can be written as:
\[\Delta \theta \propto \sum_t \operatorname{dir}_t \cdot w_t^D \nabla_\theta \log \pi_\theta(a_t \mid s_t)\]
A closely related form factors out a single trajectory-level reward advantage for direction and keeps distillation for token-level magnitude:
\[\Delta \theta \propto A^{reward} \sum_t m_t^{distill} \nabla_\theta \log \pi_\theta(y_t\mid s_t)\]
- where \(A^{reward}\) supplies the task-aligned direction and \(m_t^{distill}\) supplies token-level magnitude, relevance, or trust.
This pattern is useful because a teacher can be locally informative without being globally safe. A privileged teacher may identify which tokens become more likely under a hint, but the reward or verifier should still determine whether the trajectory should be reinforced.
A more conservative version uses an agreement mask:
\[m_t =\mathbf{1} \left[ \operatorname{sign} \left( A_t^{RL} \right) =\operatorname{sign} \left( A_t^{D} \right) \right]\]
The masked hybrid loss is:
\[\mathcal{L}_{masked} =\mathcal{L}_{RL} + \lambda_D \sum_t m_t D \left( \pi_T(\cdot \mid s_t) \,\Vert\, \pi_\theta(\cdot \mid s_t) \right)\]
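The agreement mask is a sign comparison per token. The sketch below follows the indicator as written, so tokens where both advantages are exactly zero also count as agreeing; the advantage values are toy numbers.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def agreement_mask(rl_advantages, distill_advantages):
    """m_t = 1 when the RL and distillation advantages share a sign, else 0."""
    return [
        int(sign(a_rl) == sign(a_d))
        for a_rl, a_d in zip(rl_advantages, distill_advantages)
    ]


# Toy per-token advantages: tokens 1 and 3 have conflicting signals.
rl_adv = [1.0, 1.0, -0.5, -0.5]
distill_adv = [0.3, -0.7, -0.2, 0.4]
mask = agreement_mask(rl_adv, distill_adv)
```

Only tokens 0 and 2 keep their distillation term, so the teacher can sharpen credit assignment but cannot push against the reward-derived direction.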
Gating is critical because teacher signals can be harmful on some tokens. A gate may depend on the teacher-student gap, student entropy, verifier outcome, contrastive hint difference, rollout length, token type, teacher confidence, support overlap, or whether the token lies inside a tool call, reasoning segment, final answer, or system-controlled region.
A bounded gate can be written as:
\[g_t =\sigma(\beta z_t)\]
- where \(z_t\) is a trust score derived from one or more signals, and \(\beta\) controls how sharply the gate separates trusted from untrusted tokens.
This implements the practical lesson that distillation should be strongest when it agrees with reward and weakest when it conflicts with reward.
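A bounded gate of this form can be sketched with a numerically stable sigmoid. How the trust score \(z_t\) is built is left open in the text, so the inputs below are placeholder values.

```python
import math


def bounded_gate(z, beta):
    """g_t = sigmoid(beta * z), computed without overflow for large |beta * z|."""
    x = beta * z
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Placeholder trust scores: positive means distillation agrees with reward.
neutral = bounded_gate(0.0, beta=5.0)
soft_trust = bounded_gate(1.0, beta=1.0)
sharp_trust = bounded_gate(1.0, beta=10.0)
distrusted = bounded_gate(-1000.0, beta=1.0)
```

A small \(\beta\) gives a soft down-weighting, while a large \(\beta\) approaches the hard agreement mask.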

RL-Dominant Hybrids

The simplest hybrid treats RL as the primary objective and distillation as an auxiliary regularizer:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda_D \mathcal{L}_{D}(\theta), \qquad \lambda_D \ll 1\]
This design is appropriate when the reward is trusted but sparse. The RL term preserves task grounding, while the distillation term improves token-level credit assignment.
The naive self-distillation experiments in Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t support this principle: the variants that helped did so by reducing the influence of pure distillation, whether by masking the distillation gradient where it disagreed with RL, by down-weighting distillation under a dominant RL term, or by RL-improving the teacher target before distilling from it.
The limitation is stability. Combining an RL term with a distillation term is not automatically safe if the distillation target is badly misaligned or if the optimization mixes incompatible gradients. A practical implementation should monitor gradient norms, KL to the reference model, response length, hallucinated privileged context, and OOD accuracy.

Reinforcement Learning via Self-Distillation

Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) introduces Self-Distillation Policy Optimization (SDPO), which converts rich textual feedback into dense token-level supervision without requiring a separate external teacher or explicit reward model.
The method begins with a rollout generated by the current model. A feedback source then supplies a critique, runtime error, judge comment, verifier message, or other natural-language signal. The same model is conditioned on that feedback to form a teacher view, and the original rollout is replayed under the feedback-conditioned context to obtain token-level correction.
The setting assumes that the environment can provide feedback \(f\), such as a runtime error, judge comment, failed-attempt explanation, or critique. The same model conditioned on \(f\) becomes the self-teacher:
\[\pi_{\theta^-}^{+}(\cdot \mid x,f,y_{<t})\]
The deployed student acts without that feedback:
\[\pi_\theta(\cdot \mid x,y_{<t})\]
The SDPO-style objective is:
\[\mathcal{L}_{SDPO}(\theta) =\mathbb{E}_{x,\hat{y},f} \left[ \sum_t D \left( \pi_{\theta^-}^{+}(\cdot \mid x,f,\hat{y}_{<t}) \,\Vert\, \pi_\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
The method is conceptually important because it shows how RL can move beyond scalar rewards without abandoning the RL training loop. The feedback-conditioned teacher produces dense local information, while the on-policy rollout structure ensures that the model learns from states it actually visits.
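The SDPO-style objective can be made concrete with a toy rollout. The sketch below assumes the divergence \(D\) is the forward KL from the feedback-conditioned teacher to the student, which is one possible choice and not necessarily the one used in the paper; the function names and probability values are made up for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def sdpo_token_losses(teacher_dists, student_dists):
    """Per-token divergence between the feedback-conditioned teacher and the student.

    Both lists hold one next-token distribution per position of the same rollout.
    The teacher's were computed with feedback f in context, the student's without.
    """
    return [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]

# Toy rollout: two positions over a 3-token vocabulary (hypothetical numbers).
student = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]
teacher = [[0.7, 0.2, 0.1], [0.05, 0.9, 0.05]]  # feedback only changes position 1
losses = sdpo_token_losses(teacher, student)
sdpo_loss = sum(losses)
```

In this toy case the loss is concentrated on the single position where the feedback changes the teacher's preference, which is the dense credit assignment the method relies on.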
SDPO is powerful when feedback is diagnostic and causally tied to the mistake. A compiler error, failed unit test, rejected tool call, or judge critique can identify how the model should have acted differently.
SDPO is risky when feedback mainly changes tone, confidence, or explanation style. In that case, the model may learn to sound corrected rather than become correct. This is the same structural risk as naive privileged self-distillation.
The following figure (source) shows a comparison of RLVR and RLRF settings. In Reinforcement Learning with Verifiable Rewards (RLVR), the agent learns from a scalar reward \(r\), which often acts as an information bottleneck by masking the underlying environment state. In contrast, Reinforcement Learning with Rich Feedback (RLRF) utilizes tokenized feedback. In the core self-distillation policy optimization setup, textual feedback is transformed into a dense teacher signal over the original student rollout, providing a richer signal than a scalar reward because feedback can include runtime errors, judge comments, or detailed observations of the state.

ExOPD: Learning Beyond the Teacher

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation by Yang et al. (2026) introduces ExOPD, which treats OPD as a dense KL-constrained RL method and adds reward extrapolation so that the student can improve beyond the teacher rather than merely match it.
Standard OPD can be limiting when the teacher is strong but not optimal. If the student discovers a trajectory that receives higher reward than the teacher’s likely behavior, pure imitation may pull the student back toward the teacher. ExOPD addresses this by using reward information to scale or extrapolate the dense teacher signal.
Standard OPD ties the teacher-derived reward and KL regularization together. Generalized OPD (G-OPD) decouples them by introducing a flexible reference model and a reward scaling factor.
A simplified G-OPD objective is:
\[\mathcal{L}_{G\text{-}OPD}(\theta) =D_{KL} \left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right) -\alpha \mathbb{E}_{s \sim \pi_\theta} \left[ r_T(s) \right]\]
- where \(\pi_{\mathrm{ref}}\) is a reference policy, \(r_T(s)\) is a teacher-derived dense reward, and \(\alpha\) controls how strongly reward is extrapolated relative to the KL penalty.
A schematic reward-extrapolated signal is:
\[\tilde{A}_t =h(R(x,y),R_T(x)) \left[ \log \pi_T(y_t\mid s_t) -\log \pi_\theta(y_t\mid s_t) \right]\]
- where \(h\) increases or decreases the local distillation signal according to how the sampled trajectory compares with teacher or reference performance.
When \(\alpha=1\) and the reference model is tied to the ordinary OPD reference, the method resembles standard OPD. When \(\alpha>1\), reward extrapolation allows the student to amplify the teacher-implied improvement rather than merely imitate it.
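One way to see why \(\alpha>1\) extrapolates is to solve the simplified objective in closed form for a single state. The sketch below assumes the teacher-derived reward is the log-ratio \(r_T = \log \pi_T - \log \pi_{\mathrm{ref}}\); that form is an assumption of this example, not something stated above. Under it, the minimizer of \(D_{KL}(\pi \Vert \pi_{\mathrm{ref}}) - \alpha\,\mathbb{E}_{\pi}[r_T]\) over one next-token distribution is proportional to \(\pi_{\mathrm{ref}}^{1-\alpha}\,\pi_T^{\alpha}\).

```python
def extrapolated_policy(ref, teacher, alpha):
    """Per-state minimizer of KL(pi || ref) - alpha * E_pi[log teacher - log ref].

    Assumes the log-ratio reward form; returns pi* proportional to
    ref^(1 - alpha) * teacher^alpha, normalized over the vocabulary.
    """
    unnorm = [r ** (1.0 - alpha) * t ** alpha for r, t in zip(ref, teacher)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy next-token distributions over a 3-token vocabulary (hypothetical numbers).
ref = [0.5, 0.3, 0.2]
teacher = [0.7, 0.2, 0.1]

at_ref = extrapolated_policy(ref, teacher, alpha=0.0)      # stays at the reference
at_teacher = extrapolated_policy(ref, teacher, alpha=1.0)  # recovers the teacher
beyond = extrapolated_policy(ref, teacher, alpha=1.5)      # moves past the teacher
```

With \(\alpha=1.5\) the token the teacher favors over the reference receives even more mass than the teacher gives it, which is the sense in which the teacher acts as a direction rather than a destination.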
This is useful when the teacher is aligned but conservative. A domain teacher may improve over the base model but still leave room for extrapolation. ExOPD provides a way to use the teacher as a direction signal while allowing controlled movement beyond the teacher.
This reframes distillation as a reference-guided policy optimization method. The teacher supplies local structure, but rewards decide whether the student should imitate, deviate, or amplify the sampled behavior.
The following figure (source) shows the empirical effectiveness of ExOPD compared with off-policy distillation (SFT), standard OPD, and the weight-extrapolation method ExPO introduced in Model Extrapolation Expedites Alignment by Zheng et al. (2025) in multi-teacher and strong-to-weak distillation settings, with results averaged over 4 math reasoning and 3 code generation benchmarks. In multi-domain expert merging, ExOPD is the only method that yields a unified student that consistently outperforms all domain teachers; in strong-to-weak distillation, ExOPD significantly improves over standard OPD, with reward correction further boosting performance.

REOPOLD: Relaxed OPD for Stable Reasoning

Scaling Reasoning Efficiently via Relaxed On-Policy Distillation by Ko et al. (2026) introduces REOPOLD, which relaxes strict teacher imitation so that OPD can scale more stably for reasoning tasks.
The motivation is that exact imitation may be too restrictive. A reasoning model may need to explore alternative paths, preserve useful uncertainty, or maintain its own valid search style. If the teacher signal is applied too aggressively at every token, the student can overfit to teacher style or suppress productive reasoning variants.
REOPOLD interprets teacher-student log-likelihood ratios as token rewards but applies them selectively or temperately. This makes the method closer to a regularized RL objective than to pure distillation. It preserves the benefit of dense feedback while reducing the risk that every local teacher preference becomes an imperative.
A simplified relaxed token reward is:
\[r_t^{D} =\operatorname{clip} \left( \log \pi_T(a_t \mid s_t) -\log \pi_\theta(a_t \mid s_t), -c, c \right)\]
The relaxed objective, in which \(r_t^{D}\) is treated as a fixed reward under a stop-gradient, can be written as:
\[J_{\mathrm{REOPOLD}}(\theta) =\mathbb{E} \left[ \sum_t r_t^{D} \log \pi_\theta(a_t \mid s_t) \right] -\lambda D_{KL} \left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right)\]
The broader lesson is that distillation does not have to be exact imitation. For reasoning models, strict imitation can suppress useful exploration, while relaxed teacher-derived rewards can preserve dense feedback without forcing the student to match every teacher preference.
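The clipped token reward can be sketched directly from its definition. This is an illustrative fragment under simple assumptions: log-probabilities are plain floats for the sampled tokens, the clip bound \(c=1\) is arbitrary, and the KL penalty to the reference policy is left out for brevity.

```python
def relaxed_token_rewards(teacher_logp, student_logp, c=1.0):
    """Clip the teacher-student log-ratio on each sampled token to [-c, c]."""
    return [max(-c, min(c, t - s)) for t, s in zip(teacher_logp, student_logp)]

def reopold_surrogate(rewards, student_logp):
    """sum_t r_t * log pi_theta(a_t | s_t), with each r_t held constant."""
    return sum(r * lp for r, lp in zip(rewards, student_logp))

# Hypothetical log-probabilities for three sampled tokens.
teacher_logp = [-0.1, -5.0, -1.0]
student_logp = [-2.5, -0.5, -1.2]

rewards = relaxed_token_rewards(teacher_logp, student_logp, c=1.0)
surrogate = reopold_surrogate(rewards, student_logp)
```

The raw log-ratios here are roughly 2.4, -4.5, and 0.2, so clipping bounds the two extreme tokens while leaving the mild one untouched. In an autodiff framework the rewards would be detached before entering the surrogate.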
The following figure (source) shows an illustration of REOPOLD. While standard on-policy distillation can be unstable or inefficient when it forces the student to mimic the teacher too aggressively, REOPOLD uses teacher signals temperately and selectively. By connecting distillation and RL through a stop-gradient operation, it filters potentially harmful signals and limits excessive drift from the student’s original distribution.

Self-Distilled RLVR

Self-Distilled RLVR (RLSD) by Yang et al. (2026) combines RLVR with privileged self-distillation by separating update direction from update magnitude. RLVR determines whether a rollout should be pushed up or down according to verifier-based correctness, while self-distillation modulates how strongly individual tokens should be updated.
This separation is important because privileged self-distillation alone can leak answer information, reward overly concise traces, or destabilize long training. By keeping RLVR as the direction-setting objective, the method remains anchored to verifiable correctness. By using self-distillation to refine token-level magnitudes, it gains denser credit assignment than ordinary response-level RLVR.
A simplified RLSD update is:
\[\Delta \theta \propto A^{RLVR} \sum_t w_t^{SD} \nabla_\theta \log \pi_\theta(y_t\mid s_t)\]
- where \(A^{RLVR}\) supplies the reward-aligned direction and \(w_t^{SD}\) supplies token-level magnitude from a privileged self-teacher.
This separation reduces information leakage because the privileged teacher does not decide whether a trajectory should be reinforced. It only shapes how strongly different tokens are updated after the verifier determines direction.
The practical rule is to let the environment decide what is correct and let the self-teacher refine where the model should allocate credit.
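The direction and magnitude split can be illustrated with per-token coefficients on \(\nabla_\theta \log \pi_\theta(y_t\mid s_t)\). The mapping from privileged teacher-student gaps to weights below (absolute gap, normalized to mean 1) is a placeholder invented for this sketch and is not the paper's formula; only the structure \(A^{RLVR} \cdot w_t^{SD}\) comes from the text.

```python
def token_weights(gaps):
    """Hypothetical nonnegative weights from teacher-student gaps, with mean 1."""
    raw = [abs(g) for g in gaps]
    total = sum(raw)
    if total == 0:
        return [1.0] * len(gaps)  # fall back to uniform response-level credit
    return [len(gaps) * r / total for r in raw]

def rlsd_coefficients(advantage, gaps):
    """Per-token multipliers on grad log pi: A^RLVR * w_t^SD."""
    return [advantage * w for w in token_weights(gaps)]

# A failed rollout (negative advantage) whose last token has the largest gap.
gaps = [0.1, -0.1, 0.8]
weights = token_weights(gaps)
coefs = rlsd_coefficients(-1.0, gaps)
```

Because the weights are nonnegative, every coefficient keeps the sign of the verifier-derived advantage; the self-teacher only redistributes how much of the update each token receives.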
The following figure (source) shows an overview of RLSD, with a hybrid design in which RLVR provides update directions and self-distillation provides fine-grained step sizes.

SRPO: Sample Routing between GRPO and Self-Distillation

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026) introduces SRPO, a routing framework that sends correct samples to a GRPO-style RL branch and failed samples to a self-distillation branch.
The method is based on the observation that correct and incorrect rollouts should not necessarily receive the same type of supervision. Correct rollouts already demonstrate a successful trajectory and are well suited to reward-aligned reinforcement. Incorrect rollouts need diagnostic correction, so they are replayed under privileged teacher context and updated using dense self-distillation.
The branch structure can be summarized as:
\[\mathcal{L}_{SRPO} =\mathbf{1}[R=1]\mathcal{L}_{GRPO} + \mathbf{1}[R=0]\mathcal{L}_{SDPO}\]
- where the reward or verifier determines whether a rollout is reinforced directly or corrected through self-distillation.
A more general routing form is:
\[\mathcal{L}_{SRPO} =\mathbf{1}[R(\tau)>0] \mathcal{L}_{GRPO} + \mathbf{1}[R(\tau)=0] \lambda(\tau) \mathcal{L}_{SDPO}\]
Correct rollouts are reinforced because they already demonstrate reward-aligned behavior. Failed rollouts are routed to self-distillation because they need targeted correction.
SRPO also uses teacher reliability to weight the self-distillation branch. A generic entropy-aware weight can be written as:
\[\lambda(\tau) =g\left( H \left( \pi_{\theta^-}^{+}(\cdot \mid s) \right) \right)\]
- where higher teacher entropy indicates less reliable dense guidance.
This design makes routing a form of supervision selection. The system decides whether a trajectory needs reinforcement, correction, or down-weighting based on reward and teacher confidence.
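The routing rule can be sketched as a small function that picks a branch per rollout. The reliability function \(g\) used here (one minus the mean normalized teacher entropy) is a hypothetical choice for illustration, and the branch losses are passed in as precomputed numbers.

```python
import math

def entropy(dist):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def reliability_weight(teacher_dists):
    """Hypothetical g(H): 1 minus mean normalized teacher entropy, in [0, 1]."""
    vocab = len(teacher_dists[0])
    mean_h = sum(entropy(d) for d in teacher_dists) / len(teacher_dists)
    return 1.0 - mean_h / math.log(vocab)

def srpo_loss(reward, grpo_loss, sdpo_loss, teacher_dists):
    """Route a rollout: GRPO branch if rewarded, weighted SDPO branch otherwise."""
    if reward > 0:
        return grpo_loss
    return reliability_weight(teacher_dists) * sdpo_loss

confident = [[0.98, 0.01, 0.01]]      # low-entropy teacher, trusted
uniform = [[1 / 3, 1 / 3, 1 / 3]]     # maximum-entropy teacher, ignored

correct = srpo_loss(1.0, 0.7, 2.0, confident)
failed_trusted = srpo_loss(0.0, 0.7, 2.0, confident)
failed_untrusted = srpo_loss(0.0, 0.7, 2.0, uniform)
```

A failed rollout with an uninformative teacher contributes almost nothing, so the branch degrades gracefully when the privileged context does not help.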
The following figure (source) shows an overview of SRPO. Given a prompt \(x\), the policy \(\pi_\theta\) generates a group of on-policy rollouts. A correctness check routes each rollout to one of two branches: correct samples are sent to the GRPO branch, where group-relative advantages provide a reward-aligned policy update; incorrect samples with available teacher information are sent to the SDPO branch, where a feedback-conditioned self-teacher produces logit-level distillation targets via \(D_{\mathrm{KL}}\left(P\,\Vert\,\operatorname{sg}(Q)\right)\) for dense corrective supervision.

RLCSD: Contrastive Self-Distillation inside RLVR

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) integrates contrastive self-distillation into RLVR to address privilege-induced style drift. The method compares the self-teacher gap under a correct hint with the self-teacher gap under an incorrect hint, subtracting generic hint-induced style changes and leaving a signal more concentrated on task-bearing tokens.
This method matters because RL-distillation hybrids can fail when privileged context changes the teacher’s style rather than only its correctness. A teacher that sees the answer may become shorter, more assertive, and less exploratory. If the student imitates that shift directly, it can lose the epistemic behaviors needed for autonomous reasoning.
The core problem is that a privileged hint changes more than task knowledge: the shift toward shorter, more confident, and more direct output appears even when the hint is wrong. If the student distills this gap directly, it can learn style tokens rather than task-bearing tokens.
A simplified contrastive signal is:
\[e_t^{ctr} =\left[ \log \pi_\theta(y_t\mid x,c^+,y_{<t}) -\log \pi_\theta(y_t\mid x,y_{<t}) \right] -\left[ \log \pi_\theta(y_t\mid x,c^-,y_{<t}) -\log \pi_\theta(y_t\mid x,y_{<t}) \right]\]
- where \(c^+\) is a correct hint and \(c^-\) is a wrong or contrastive hint.
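Subtracting the wrong-hint effect removes part of the generic hint-conditioned style shift. Algebraically, the unhinted term \(\log \pi_\theta(y_t\mid x,y_{<t})\) appears in both brackets and cancels, so \(e_t^{ctr}\) reduces to \(\log \pi_\theta(y_t\mid x,c^+,y_{<t}) - \log \pi_\theta(y_t\mid x,c^-,y_{<t})\), a signal more concentrated on task-bearing tokens.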
RLCSD uses this cleaned token-level signal as a modulation of GRPO-style reward learning rather than as an ungrounded imitation loss. This preserves the reward anchor while improving token-level credit assignment.
This is an important design principle for all privileged-context hybrids: do not distill every difference between hinted and unhinted behavior. Distill only the difference that remains after accounting for generic style changes caused by the presence of a hint.
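A toy example makes the contrastive signal concrete. The log-probabilities below are invented for a four-token rollout in which the first token is a hedging word and the last is the answer; the function names are likewise assumptions of this sketch.

```python
def hint_gap(hinted_logp, plain_logp):
    """Per-token change in log-probability caused by adding a hint."""
    return [h - p for h, p in zip(hinted_logp, plain_logp)]

def contrastive_signal(logp_correct_hint, logp_wrong_hint, logp_no_hint):
    """Correct-hint gap minus wrong-hint gap, per token."""
    pos = hint_gap(logp_correct_hint, logp_no_hint)
    neg = hint_gap(logp_wrong_hint, logp_no_hint)
    return [a - b for a, b in zip(pos, neg)]

tokens = ["maybe", "x", "=", "7"]
no_hint = [-0.5, -1.0, -0.3, -2.0]
correct_hint = [-2.0, -0.9, -0.3, -0.2]  # suppresses "maybe", boosts the answer
wrong_hint = [-2.0, -0.9, -0.3, -3.5]    # suppresses "maybe" too, hurts the answer

raw_gap = hint_gap(correct_hint, no_hint)
contrast = contrastive_signal(correct_hint, wrong_hint, no_hint)
```

The raw gap puts a large negative signal on the hedging token, whereas the contrastive signal is zero there and nonzero only on the answer token, because both hints suppress the hedge equally.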
The following figure (source) shows the RLCSD overview: ordinary OPSD gaps concentrate on style tokens such as “maybe” or “straightforward,” while the contrastive gap shifts the signal toward task-bearing tokens; the response-length plots show that RLCSD remains more stable than several prior OPSD-style methods; and the benchmark comparison shows gains on mathematical and logical reasoning tasks.

SDAR: Gated Self-Distillation for Multi-Turn Agents

Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) extends RL-distillation hybrids to multi-turn agents by treating on-policy self-distillation as a gated auxiliary objective and keeping GRPO as the primary RL backbone.
The method is designed for agentic settings where trajectories span multiple turns and where naive OPSD can become unstable as the student and privileged teacher contexts drift apart. Instead of applying self-distillation uniformly, SDAR gates the auxiliary signal at the token level.
The overall objective is:
\[\mathcal{L}(\theta) =\mathcal{L}_{GRPO}(\theta) +\lambda_{\mathrm{SDAR}} \mathcal{L}_{SDAR}(\theta)\]
- where \(\mathcal{L}_{GRPO}\) preserves verifier-driven policy optimization and \(\mathcal{L}_{SDAR}\) injects dense token-level guidance only where the privileged self-teacher signal is trusted.
The self-teacher receives privileged training-only context, such as retrieved skills, while the deployed student acts without that context. The detached teacher-student gap is:
\[\Delta_t =\operatorname{sg} \left( \log \pi_\theta^+(y_t\mid s_t^+) -\log \pi_\theta(y_t\mid s_t) \right)\]
The token-level gate converts this signal into a bounded trust weight:
\[g_t =\sigma(\beta\Delta_t)\]
The gated auxiliary token loss is:
\[\ell_t^{SDAR} =g_t \left( \log \pi_\theta^+(y_t\mid s_t^+) -\log \pi_\theta(y_t\mid s_t) \right)\]
The conceptual point is that the privileged self-teacher is not allowed to dominate the RL objective. It can only provide a bounded token-level shaping signal, and that signal is strongest where the teacher-student gap suggests useful local correction.
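The gap gate and gated token loss can be sketched numerically as follows. This fragment only evaluates the quantities; it assumes, as the stop-gradient in \(\Delta_t\) suggests, that the gate would be detached in an autodiff framework, and the values of \(\beta\) and the log-probabilities are hypothetical.

```python
import math

def sdar_gates_and_losses(teacher_logp, student_logp, beta=1.0):
    """Return (gates, losses) with g_t = sigmoid(beta * gap) and loss_t = g_t * gap."""
    gates, losses = [], []
    for t, s in zip(teacher_logp, student_logp):
        gap = t - s  # Delta_t; held constant inside the gate (stop-gradient)
        gate = 1.0 / (1.0 + math.exp(-beta * gap))
        gates.append(gate)
        losses.append(gate * gap)
    return gates, losses

# Privileged teacher strongly endorses token 0, is neutral on 1, disfavors 2.
teacher_logp = [-0.2, -1.0, -3.0]
student_logp = [-2.2, -1.0, -1.0]
gates, losses = sdar_gates_and_losses(teacher_logp, student_logp, beta=1.0)
```

The token the teacher endorses receives a gate near 0.88, while the token it disfavors is down-weighted to about 0.12, so the auxiliary signal is bounded and asymmetric in favor of teacher-endorsed corrections.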
SDAR supports entropy gating, gap gating, and soft-OR gating, so token-level supervision can depend on student uncertainty, teacher endorsement, or both. Skill retrieval can be implemented through UCB retrieval, keyword matching, full retrieval, or random retrieval, making the framework robust to different levels of privileged-context quality.
The following figure (source) shows multi-turn OPSD instability, where naive OPSD can increase KL divergence and degrade task performance in multi-turn agent settings.

The following figure (source) shows teacher-student gap analysis, including token-count distribution by teacher-student gap, average gap by multi-turn step, and average gap by relative position within a turn.

The following figure (source) illustrates the SDAR framework, where verifier-driven RL and token-level OPSD are combined through gates derived from uncertainty, teacher-student gap, or their soft-OR combination.

OpenClaw-RL and Agent Interaction Feedback

OpenClaw-RL: Train Any Agent Simply by Talking by Wang et al. (2026) extends RL-distillation hybrids to interactive agents where the environment naturally produces rich feedback streams.
In agent settings, the feedback source may be a user reply, GUI state transition, terminal output, browser result, unit-test failure, file-system observation, or downstream task state. These signals can be used as scalar rewards, but they can also become teacher-only context for hindsight-guided self-distillation.
The training pattern is:
\[\text{agent rollout} \rightarrow \text{environment or user feedback} \rightarrow \text{hindsight teacher view} \rightarrow \text{dense token-level correction}\]
This is a natural fit for agents because many errors are local but consequences are delayed. A final task failure may trace back to one incorrect command, one missed observation, one invalid file edit, or one misunderstood user constraint. Feedback-conditioned distillation can assign denser corrective signal to those earlier actions than a final scalar reward alone.
In this setting, an environment transition can provide both evaluative and directive information:
\[(s_t,a_t,s_{t+1}) \rightarrow \left( r_t, h_t \right)\]
- where \(r_t\) is a scalar reward or process reward, and \(h_t\) is a hindsight hint or textual correction used to form a teacher context.
The RL component uses \(r_t\):
\[J_{\mathrm{agent\text{-}RL}}(\theta) =\mathbb{E} \left[ \sum_t r_t \log \pi_\theta(a_t \mid s_t) \right]\]
The hindsight-guided OPD component uses \(h_t\) as teacher-only context:
\[\mathcal{L}_{HG\text{-}OPD} =\mathbb{E} \left[ \sum_t D \left( \pi_{\theta^-}(\cdot \mid s_t,h_t) \,\Vert\, \pi_\theta(\cdot \mid s_t) \right) \right]\]
Because the dense feedback is extracted from the exact environment transition that followed the model’s action, it can correct malformed tool calls, wrong GUI actions, failed shell commands, or poor conversational responses.
The same caution still applies: the hindsight hint must be grounded in the next state. Ungrounded or fabricated feedback can induce the same failure pattern as naive self-distillation, where the model learns to act as if it received hidden guidance.
The following figure (source) shows an overview of the OpenClaw-RL infrastructure. Interaction streams come from Personal Agents, which are conversational and single-user agents hosted on personal devices, and General Agents, which include terminal, GUI, SWE, and tool-call agents hosted on cloud services. Samples flow into an asynchronous RL server with environment serving, PRM or judge reward computation, Megatron policy training, and SGLang policy serving.

Cascade RL and Distillation as Recipe Stabilization

RL-distillation hybrids are not limited to single objectives; they also appear as recipe-level stabilization mechanisms. Nemotron-Cascade 2 uses Cascade RL to optimize domains in a deliberate sequence, then inserts multi-domain OPD as a stabilization point to recover benchmark regressions and unify specialized expertise before later stages continue.
The relevant recipe pattern is:
\[\text{SFT} \rightarrow \text{Instruction-Following RL} \rightarrow \text{Multi-domain RL} \rightarrow \text{Multi-domain OPD} \rightarrow \text{RLHF} \rightarrow \text{Long-context RL} \rightarrow \text{Code RL} \rightarrow \text{SWE RL}\]
This pattern treats RL and distillation as complementary stages. RL stages improve target capabilities through reward-driven interaction with verifiers and environments. The OPD stage consolidates specialized improvements and repairs regressions by using strong intermediate checkpoints as dense token-level teachers.

Nemotron 3 Ultra and MOPD after Unified RLVR

Nemotron 3 Ultra provides a large-scale recipe-level example in which SFT and unified RLVR establish a general student, while MOPD consolidates many domain-specialist teachers into a single agentic model.
In this recipe shape, RLVR teaches the student to operate across many verifiable and semi-verifiable environments, while MOPD later transfers specialist teacher preferences over student-generated rollouts. This is a hybrid not because every gradient step contains both losses, but because the full recipe alternates between reward-driven capability acquisition and distillation-driven capability consolidation.
The support-overlap issue is central. If the RLVR student can already generate trajectories that lie near teacher-supported regions, MOPD can provide useful dense feedback. If not, warmup or staged teacher integration may be needed before multi-teacher scoring becomes reliable.
The following figure (source) shows the Nemotron 3 Ultra post-training pipeline, including SFT, unified RLVR, MOPD warmup, multi-teacher OPD, and later inference-oriented boosting.

Environment Feedback and Scientific RL

RL-distillation hybrids become most powerful when the environment provides rich feedback rather than only a final binary reward. In code, the environment can provide compiler errors, failing tests, stack traces, and execution outputs. In tool-use agents, the environment can provide browser observations, terminal output, file contents, and GUI state transitions. In scientific settings, the environment may eventually include laboratory measurements, simulations, instrument logs, or experimental outcomes.
This is why RL-distillation hybrids are especially relevant for scientific and agentic post-training. A sparse scalar reward can say that an experiment failed or that a code patch did not pass tests, but a rich feedback channel can explain the failure, expose intermediate observations, and support teacher-conditioned replay. Distillation then turns that richer information into token-level or action-level updates.
A general environment-feedback hybrid can be written as:
\[f_t =\operatorname{EnvFeedback}(x,y_{\leq t})\]
\[A_t^{feedback} =\log \pi_T(y_t\mid x,f_{\geq t},y_{<t}) - \log \pi_\theta(y_t\mid x,y_{<t})\]
- where future or delayed feedback is used as teacher-only context during training, while the deployed student still acts without that hindsight.

Sparse Rewards vs. Dense Distillation

RL and distillation differ most sharply in feedback granularity. RLVR often provides one reward for an entire response, while OPD or self-distillation can provide one signal per token.
The difference can be summarized as:

Training signal	Trajectory source	Feedback granularity	Main strength	Main weakness
RLVR	Student rollout	Response-level or outcome-level	Directly optimizes verifiable success	Weak token-level credit assignment
OPD	Student rollout	Token-level teacher feedback	Dense local correction	Requires reliable teacher scoring on student prefixes
Self-distillation	Student rollout or fixed data	Token-level privileged-context feedback	Avoids external teacher infrastructure	Can amplify style drift or suppress uncertainty
RL-distillation hybrid	Usually student rollout	Both response-level and token-level	Combines task grounding with dense credit assignment	Requires careful balancing, gating, and evaluation

The best hybrid systems avoid treating dense feedback as automatically superior. Dense token-level feedback is useful only when it is aligned with the task objective, reliable under the student’s visited prefixes, and consistent with the behavior the deployed model should preserve.

Failure Modes

RL-distillation hybrids can fail when the dense teacher signal conflicts with the sparse reward. If the reward says a rollout succeeded but the teacher dislikes the student’s style, a naive distillation loss can pull the student away from a valid solution. If the reward says a rollout failed but the teacher still assigns locally high probability to many tokens, the dense signal may obscure the global failure.
Dense feedback can point in the wrong direction when the teacher is not reward-tilted. This is the main failure mode of naive privileged self-distillation: the teacher provides many token-level updates, but the updates follow hidden-context response style rather than reward.
Privileged self-distillation can suppress the epistemic behaviors that make reasoning models robust. A teacher that sees the final answer may prefer shorter, more confident, less exploratory traces; if those preferences are distilled directly, the student can lose uncertainty markers, self-correction, branching, and backtracking.
Privileged context can leak into inference-time behavior. The student may cite absent feedback, refer to a nonexistent reference solution, or assume a hidden prior attempt because it learned the teacher’s feedback-conditioned response shape.
Dense token-level feedback can reward local plausibility without global success. A continuation token may look reasonable under the prefix even if the overall trajectory is doomed because of an earlier hidden mistake. This is especially risky in long coding, proof, and agentic workflows.
Teacher-student support mismatch can turn distillation into noise. If the student visits prefixes that the teacher would never generate, teacher log-probabilities may reflect off-distribution artifacts rather than useful expertise.
Reward hacking and verifier overfitting can still occur. Adding distillation does not remove the need for robust reward design, contamination checks, adversarial evaluation, and human or automated audits.
Loss balancing can destabilize training. If the distillation term is too strong, the model may over-imitate teacher style; if it is too weak, it may not improve credit assignment. If the RL term is too strong, sparse rewards can dominate and erase the benefits of dense correction.
Prompt optimization can reduce some artifacts, but it does not solve the underlying mismatch if the self-teacher still receives information the deployed student will not have.
RL terms can stabilize or destabilize hybrids depending on weighting, gradient interaction, and teacher quality. A small RL term does not automatically fix a bad distillation target, and a strong RL term can destabilize training if mixed naively.
Sample routing can fail if reward labels are noisy. If correct samples are mislabeled as failures or failures are mislabeled as successes, the optimizer may send trajectories to the wrong learning branch.
Gating can fail if the gate measures teacher confidence rather than teacher correctness. A privileged teacher can be confidently wrong or confidently stylistic.
Agentic hindsight can fail if the next-state signal is noisy, delayed, or not causally attributable to the current action.

Implementation Pattern

A practical RL-distillation hybrid begins by defining the rollout source, which is usually the current student or a slightly lagged inference policy. The system samples prompts, generates rollouts, records token log-probabilities, masks prompt and invalid tokens, and stores metadata about the policy version that produced each trajectory.
The system then obtains one or more feedback sources. A verifier may produce a binary or scalar reward, a reward model may produce a preference score, a teacher may produce token-level log-probabilities, and a feedback-conditioned self-teacher may produce privileged-context token scores. These signals must be aligned to the same tokenization, prefix semantics, and response boundaries.
The learner computes reward-level advantages, token-level distillation advantages, gates or masks, and any routing decisions. The objective may combine GRPO, PPO, RLVR, OPD, self-distillation, contrastive self-distillation, or sample-routing branches depending on the sample outcome.
The update is applied to the student while maintaining strict versioning of the rollout policy, teacher checkpoint, verifier version, reward configuration, prompt source, and masking rules. This is necessary because hybrid systems are difficult to debug without full provenance.
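The provenance requirement can be sketched as an immutable record attached to every trajectory. The field names mirror the items listed above, but the class, the example values, and the staleness threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RolloutProvenance:
    """Versioning metadata stored alongside each trajectory."""
    rollout_policy_version: int
    teacher_checkpoint: str
    verifier_version: str
    reward_config: str
    prompt_source: str
    masking_rules: str

def is_stale(record: RolloutProvenance, learner_version: int, max_lag: int = 2) -> bool:
    """True when the rollout policy lags the learner by more than max_lag versions."""
    return learner_version - record.rollout_policy_version > max_lag

# Placeholder identifiers, not real checkpoint or config names.
record = RolloutProvenance(
    rollout_policy_version=10,
    teacher_checkpoint="teacher-step-500",
    verifier_version="v1",
    reward_config="binary-pass",
    prompt_source="train-split",
    masking_rules="mask-prompt-and-invalid",
)
fields = asdict(record)
```

Freezing the record prevents later stages from silently overwriting the metadata, and the staleness check gives the learner a simple way to drop or down-weight rollouts from an outdated policy.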
Evaluation should measure both target-task improvement and regressions. A hybrid method may improve math while shortening reasoning too much, improve code while hurting chat, improve agentic task success while increasing tool-call errors, or improve verifier score while reducing safety calibration.

Evaluation and Regression Monitoring

RL-distillation hybrids should be evaluated with both reward and behavior metrics because the main failure modes are often not visible in final accuracy alone.
Core reward metrics include verifier pass rate, reward mean, reward variance, pass@\(k\), test-pass rate, tool-use success, and agent completion rate.
Core distillation metrics include teacher-student KL, sampled-token advantage distribution, support overlap, teacher entropy, teacher-student log-ratio, and correlation between log-ratio and reward:
\[\operatorname{Corr} \left( \Delta_T(\tau), R(\tau) \right)\]
Core privileged-context metrics include hallucinated-reference rate, feedback-mention rate when feedback is absent, epistemic-verbalization rate, response length, branching count, and calibration.
Core optimization metrics include KL to reference policy, gradient norm, clipping fraction, entropy, rollout staleness, queue age, and teacher-scoring coverage.
Core OOD metrics include held-out domains, unseen tools, unseen task families, longer horizons, harder reasoning budgets, and shifted prompt formats.
A compact monitoring dashboard is:
\[\left\{ \operatorname{Reward}, \operatorname{Acc}_{IID}, \operatorname{Acc}_{OOD}, \operatorname{Corr}(\Delta_T,R), H_{\mathrm{priv}}, \operatorname{EV}, \operatorname{KL}_{\mathrm{ref}}, H(\pi_\theta), \operatorname{Success}_{\mathrm{agent}} \right\}\]
- where \(H_{\mathrm{priv}}\) is the hallucinated privileged-context rate, \(\operatorname{EV}\) is the epistemic-verbalization rate, \(\operatorname{KL}_{\mathrm{ref}}\) is the KL to the reference policy, and \(H(\pi_\theta)\) is the policy entropy.
The most important regression test is whether dense token-level supervision remains reward-aligned over training. If \(\operatorname{Corr}(\Delta_T,R)\) falls, if hallucinated privileged context rises, or if epistemic verbalization collapses while OOD accuracy falls, the hybrid is likely overfitting to the teacher signal rather than improving task behavior.
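The reward-alignment check can be sketched as a Pearson correlation over a batch of rollouts. Defining \(\Delta_T(\tau)\) as the length-normalized mean teacher-minus-student log-probability is an assumption of this example, and the batch values are made up.

```python
import math

def trajectory_log_ratio(teacher_logp, student_logp):
    """Delta_T(tau): mean teacher-minus-student log-prob over sampled tokens."""
    diffs = [t - s for t, s in zip(teacher_logp, student_logp)]
    return sum(diffs) / len(diffs)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Each entry: (teacher log-probs, student log-probs, verifier reward).
batch = [
    ([-0.5, -0.5], [-1.0, -1.0], 1.0),
    ([-1.0, -1.0], [-1.0, -1.2], 1.0),
    ([-2.0, -3.0], [-1.0, -1.0], 0.0),
    ([-1.5, -1.5], [-1.0, -1.0], 0.0),
]
deltas = [trajectory_log_ratio(t, s) for t, s, _ in batch]
rewards = [r for _, _, r in batch]
alignment = pearson(deltas, rewards)
```

Tracking this value per training step, rather than once at the end, makes it possible to catch the point at which the teacher signal stops tracking reward.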

When RL-Distillation Hybrids Are Preferred

RL-distillation hybrids are preferred when tasks provide useful rewards or verifiers but those rewards are too sparse for efficient credit assignment.
They are preferred when rich feedback such as runtime errors, tool outputs, user replies, environment transitions, GUI states, unit-test failures, or judge comments can be converted into teacher context.
They are preferred when the student is already capable enough to produce meaningful rollouts.
They are preferred when specialist teachers can provide dense token-level guidance on student-visited states.
They are preferred when the training system can support rollout generation, teacher scoring, reward computation, routing, gating, provenance tracking, and careful regression evaluation.
They are less attractive when the student is too weak to generate informative trajectories, when teacher-student support overlap is poor, when privileged context would leak information or suppress necessary reasoning behavior, when rewards are unreliable or easy to game, when token-level alignment across models is infeasible, or when the engineering cost of synchronous teacher scoring would dominate the training budget.
The most practical recipe is staged rather than monolithic. Use off-policy SFT or trace distillation to create a competent student, use RLVR or environment-based RL to align the student with task success, add OPD or self-distillation when dense token-level credit assignment becomes valuable, use gating or contrastive controls when privileged context introduces style drift, and apply MOPD when several specialist RL-derived capabilities must be consolidated into one deployable model.

Design Rules

Use RL as the primary objective when rewards are trusted and task success can be measured.
Use distillation as an auxiliary signal when a teacher or self-teacher provides reliable local credit assignment.
Use reward extrapolation when the teacher is aligned but conservative and the student should move beyond strict imitation.
Use relaxed OPD when exact teacher imitation suppresses useful exploration or destabilizes reasoning.
Use SDPO when feedback is rich, diagnostic, and causally tied to the model’s mistake.
Use RLSD when privileged self-distillation is informative but should not determine update direction.
Use SRPO when correct and failed samples require different learning signals.
Use contrastive self-distillation when privileged context changes style, length, or confidence even when the hint is wrong.
Use SDAR-style gating when multi-turn trajectories make naive OPSD unstable.
Avoid pure privileged self-distillation when the teacher’s log-ratio does not track reward, when the student hallucinates hidden context, or when in-distribution accuracy improves while out-of-distribution reasoning degrades.

Comparative Analysis

Distillation methods are best compared across a few orthogonal axes: trajectory source, teacher source, supervision density, teacher update pattern, RL integration, and systems complexity. These axes matter more than method names because many modern recipes combine several methods in sequence.
The main practical rule is that each method solves a different bottleneck. Off-policy distillation is best for stable broad transfer. OPD is best for exposure-bias reduction. Self-distillation is best when external teachers are unavailable or when feedback can create a privileged teacher view. MOPD is best for consolidating specialist capabilities. RL-distillation hybrids are best when sparse rewards need dense token-level credit assignment.
Modern frontier recipes rarely use one method in isolation. They usually start with off-policy SFT or trace distillation, apply RLVR or domain RL, then use OPD, OPSD, or MOPD to consolidate improvements and repair regressions.

Tabular Comparison

Method family	Trajectory source	Teacher source	Supervision signal	Systems profile	Best use case	Main failure mode
Soft-label KD	Fixed dataset or cached examples.	Frozen external teacher, refreshed teacher, or peer model.	Dense token or class probabilities, often temperature-smoothed and approximated with top-\(k\) logits.	Requires teacher logprob extraction, top-\(k\) or full-logit storage, temperature handling, and divergence-specific approximations.	Transferring uncertainty, calibration, and teacher distributional structure.	Expensive full-vocabulary logits, distorted top-\(k\) approximations, and train-inference mismatch.
Sequence-level / trace distillation	Teacher-generated outputs, synthetic traces, rejection-sampled completions, or tool-use transcripts.	Frozen teacher, specialist checkpoint, or prior model.	Hard teacher sequences trained with cross-entropy, possibly including reasoning traces, tool calls, critiques, code, or proofs.	Requires generation, filtering, deduplication, contamination checks, mixture balancing, and SFT training.	Stable cold starts, synthetic instruction tuning, and simple specialist consolidation.	Discards uncertainty, imitates teacher artifacts, and trains only on teacher-produced states.
Off-policy distillation	Human data, synthetic traces, historical logs, or verifier-filtered corpora.	External teachers, specialists, reward models, judges, or human annotators.	Hard targets, soft targets, critiques, preference labels, verifier scores, or top-\(k\) logprobs.	Dataset-centric: teacher inference is separated from student training, so quality depends on corpus provenance and filtering.	Reproducible broad transfer when stability and auditability matter.	Exposure bias, poor recovery from student mistakes, stale teacher behavior, and mixture imbalance.
Online / semi-online distillation	Shared minibatches, refreshed datasets, or evolving checkpoint trajectories.	Co-trained peers, shadow teachers, EMA teachers, or periodically refreshed checkpoints.	Mutual KL, JSD, peer logits, hidden states, or agreement losses.	Requires synchronization, checkpoint tracking, peer communication, and staleness control.	Adaptive supervision when a frozen teacher becomes stale.	Non-stationary targets, consensus errors, high compute cost, and difficult attribution.
On-policy distillation (OPD)	Student-generated rollouts or closely lagged rollout-policy samples.	Usually a frozen external teacher, sometimes refreshed between rounds.	Dense token-level divergences or sampled-token teacher-student logprob gaps.	Requires rollout workers, teacher scoring servers, exact prefix replay, response masks, stop handling, and freshness controls.	Reducing exposure bias in reasoning, coding, tool-use, and agentic tasks.	Poor rollout quality, stale rollouts, support mismatch, truncation/repetition collapse, and teacher-scoring cost.
Self-distillation / OPSD	Fixed data, student rollouts, or replayed trajectories under richer context.	Earlier checkpoint, EMA copy, same model under privileged context, retrieved skills, feedback, or future user messages.	Self-teacher gaps, context-view divergences, relevance-masked updates, contrastive hint gaps, or gated privileged-context signals.	Requires privileged-context construction, leakage controls, relevance masks, teacher-refresh policy, and epistemic-marker monitoring.	Learning from feedback, latent capability, runtime errors, user corrections, or private task context without external teachers.	Information leakage, style drift, response shortening, uncertainty suppression, and moving-teacher collapse.
Multi-teacher distillation / MOPD	Fixed data for classical multi-teacher KD; student rollouts for MOPD.	Domain specialists, earlier checkpoints, safety teachers, code teachers, math teachers, agentic teachers, or Cascade RL checkpoints.	Aggregated distributions, routed teacher scores, or per-teacher sampled-token advantages.	Requires teacher pools, routing, batching, calibration, support-overlap checks, tokenizer compatibility, and regression suites.	Consolidating specialist capabilities into one deployable model.	Teacher conflict, poor routing, incompatible tokenizers, support mismatch, high serving cost, and hard-to-debug regressions.
RL-distillation hybrids	Usually student-generated rollouts, sometimes routed by reward outcome.	External teachers, self-teachers, reward models, verifiers, environment feedback, or hindsight user feedback.	Sparse reward advantages plus dense token-level gates, teacher gaps, self-teacher gaps, or contrastive token scores.	Requires rollout generation, reward computation, teacher scoring, gating, branch routing, loss balancing, and full provenance.	Combining task-grounded reward optimization with dense local credit assignment.	Dense teacher signals can conflict with rewards, privileged context can reward style, and bad gates or loss weights can destabilize training.

Trajectory Source

The trajectory source is the most important modern distinction. Off-policy methods train on externally produced trajectories, such as human demonstrations, teacher completions, synthetic traces, historical logs, or verifier-filtered samples. This is stable and auditable, but it does not train the student on the states it creates at inference time.
On-policy methods train on student-generated trajectories. The teacher scores the student’s actual prefixes, so the student learns from its own mistakes rather than only imitating ideal teacher paths. This is why OPD is especially valuable for long-horizon reasoning, coding, tool use, and agentic workflows.
Practical systems are shaped by this distinction. Off-policy pipelines are dataset-centric: generate, filter, store, and train. On-policy pipelines are loop-centric: sample rollouts, score them with teachers, compute token-level losses, and update the student while managing rollout freshness.

Teacher Source

External teachers can transfer capabilities the student does not yet have, but they are expensive to serve and may be poorly aligned with the student’s prefixes. They are common in KD, synthetic-data generation, OPD, and MOPD.
Internal teachers are used in self-distillation. They may be earlier checkpoints, EMA copies, the same model under privileged context, or the same model conditioned on feedback. They reduce external dependency, but they can also amplify the model’s own biases.
Multi-teacher systems use pools of specialists. This is useful when no single teacher dominates all domains, but it adds routing, calibration, conflict resolution, and teacher-serving complexity. Teacher lineage matters: compatible forks of the same base usually provide more reliable token-level OPD signals than heterogeneous teachers with different tokenizers or reasoning styles.

Supervision Density

Hard sequence distillation is cheap and simple, but it only teaches one selected output. Soft-label distillation preserves uncertainty, but full-vocabulary logits are expensive and often approximated with top-\(k\) distributions.
OPD and MOPD often use sampled-token feedback. For a student token \(y_t\) at prefix \(s_t\), the teacher-student gap
\[A_t = \log \pi_T(y_t\mid s_t) - \log \pi_S(y_t\mid s_t)\]
behaves like a dense token-level advantage. This is cheaper than full KL because the teacher only needs to return a log-probability for the sampled token rather than a full distribution, but it loses information about unsampled alternatives.
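The link between the sampled-token gap and reverse KL can be checked on a toy vocabulary: averaged over tokens drawn from the student, the gap \(A_t\) equals the negative reverse KL at that prefix. The sketch below is a minimal check under the assumption of a three-token vocabulary with made-up probabilities; the function names are illustrative.

```python
import math

def sampled_gap(teacher_probs, student_probs, token):
    """A_t = log pi_T(y_t | s_t) - log pi_S(y_t | s_t) for one sampled token."""
    return math.log(teacher_probs[token]) - math.log(student_probs[token])

def reverse_kl(student_probs, teacher_probs):
    """KL(pi_S || pi_T) over a shared vocabulary."""
    return sum(
        p * (math.log(p) - math.log(teacher_probs[v]))
        for v, p in student_probs.items()
        if p > 0.0
    )

# Toy next-token distributions at a single prefix (illustrative numbers only).
student = {"a": 0.7, "b": 0.2, "c": 0.1}
teacher = {"a": 0.5, "b": 0.1, "c": 0.4}

# Expectation of the sampled-token gap under the student's own sampling distribution.
expected_gap = sum(p * sampled_gap(teacher, student, v) for v, p in student.items())
```

Here `expected_gap` matches `-reverse_kl(student, teacher)` up to floating-point error, while any single sample reports only one token's gap, which is the information loss noted above.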
Dense supervision is not automatically good. Privileged teachers can reward style changes, suppress uncertainty, or penalize useful self-correction tokens. This is why RMSD, SDAR, and RLCSD use masking, gating, or contrastive controls.

Reinforcement Learning Integration

RL supplies on-policy trajectories and task-grounded rewards, but feedback is often sparse. Distillation supplies dense token-level supervision, but classical KD is usually off-policy. OPD, OPSD, MOPD, and RL-distillation hybrids combine these strengths.
A useful design principle is to let rewards decide the direction of improvement while distillation shapes token-level credit assignment. This appears in Self-Distilled RLVR, RLCSD, SDAR, and sample-routing methods that send successful rollouts to RL and failed rollouts to self-distillation correction.
Recipe-level hybrids are also common. Nemotron-Cascade 2 uses staged RL, then inserts multi-domain OPD to recover regressions and consolidate intermediate teachers. Nemotron 3 Ultra uses SFT and unified RLVR before MOPD warmup and multi-teacher OPD.

Systems Complexity

Systems complexity increases as training moves from fixed data to live rollout and teacher scoring. Off-policy trace distillation mainly needs high-quality generation, filtering, deduplication, contamination checks, and mixture design.
OPD adds rollout workers, teacher scoring servers, logprob transport, exact prefix replay, response masking, stop-token handling, and rollout freshness controls. Asynchronous systems must track whether the rollout policy and learner policy have drifted apart.
MOPD adds teacher pools, routing, batching, teacher health checks, score aggregation, calibration, and multi-domain regression monitoring. Routing is algorithmic, not just infrastructural: the router determines which gradient the student receives.
Self-distillation reduces external serving cost but adds leakage and context-construction risks. The system must ensure that privileged teacher context improves the training signal without teaching the deployed student to assume unavailable information.
RL-distillation hybrids require the most provenance. Each update may depend on a rollout policy, reward model, verifier, teacher checkpoint, self-teacher context, token gate, mask, routing branch, and loss weight.

Practical Selection Heuristics

Choose off-policy SFT or trace distillation when the student needs a stable cold start, when teacher traces can be generated and filtered offline, or when reproducibility matters more than train-inference matching.
Choose soft-label KD when teacher uncertainty and calibration matter enough to justify logprob extraction and storage.
Choose online or semi-online distillation when teacher staleness is the bottleneck and the system can tolerate non-stationary supervision.
Choose OPD when exposure bias is the bottleneck and the student is strong enough to generate useful rollouts.
Choose self-distillation or OPSD when external teachers are unavailable or when privileged context, feedback, retrieved skills, or user interactions can create a useful teacher view.
Choose MOPD when specialist capabilities need to be merged into one deployable model and joint RL would create interference.
Choose RL-distillation hybrids when rewards or verifiers are useful but too sparse for good token-level credit assignment.

Common Training Progressions

A conservative reasoning-model progression is:
\[\text{Synthetic / Trace SFT} \rightarrow \text{RLVR} \rightarrow \text{OPD} \rightarrow \text{Final alignment}\]
A specialist-consolidation progression is:
\[\text{Shared base} \rightarrow \text{Domain SFT / Domain RL} \rightarrow \text{Specialist teachers} \rightarrow \text{MOPD} \rightarrow \text{Regression recovery}\]
A simpler alternative to MOPD is:
\[\text{Specialist RL} \rightarrow \text{Teacher-generated traces} \rightarrow \text{Filtering} \rightarrow \text{Trace-distillation SFT} \rightarrow \text{Final RL}\]
A feedback-conditioned self-distillation progression is:
\[\text{Student rollout} \rightarrow \text{Runtime / user / environment feedback} \rightarrow \text{Privileged teacher replay} \rightarrow \\ \text{Gated token-level update} \rightarrow \text{Regression and leakage checks}\]
A frontier-style multi-domain progression is:
\[\text{Off-policy SFT} \rightarrow \text{Unified or domain RLVR} \rightarrow \text{Specialist teacher training} \rightarrow \\ \text{Warmup for support overlap} \rightarrow \text{MOPD} \rightarrow \text{Final RL or alignment}\]

Implementation-Aware Comparison

The canonical distillation dataflow is:
\[\text{prompts} \rightarrow \text{trajectory source} \rightarrow \text{teacher or feedback signal} \rightarrow \text{token alignment and masking} \rightarrow \\ \text{loss computation} \rightarrow \text{student update} \rightarrow \text{evaluation and regression monitoring}\]
Off-policy methods emphasize generation quality, filtering, and dataset versioning. OPD emphasizes rollout freshness and exact teacher scoring. MOPD emphasizes routing, aggregation, and support overlap. Self-distillation emphasizes privileged-context construction and leakage control. RL-distillation hybrids emphasize reward anchoring, token gates, and loss balancing.
Logprob payload design should match the loss. Forward KL is mode-covering, so it benefits from broad teacher distributions such as teacher top-\(k\) sets. Reverse-KL sampled-token OPD only needs the teacher logprob on the sampled token. JSD or local-support matching may require top-\(k\) sets from both teacher and student.
Tokenizer compatibility is especially important for OPD and MOPD. If tokenizers or chat templates differ, sampled-token advantages and response masks become unreliable.
Stabilization also differs by method. Off-policy pipelines rely on filtering and mixture balancing. OPD relies on support-overlap checks, rollout-quality filters, and truncation monitoring. OPSD relies on fixed teachers, relevance masks, contrastive controls, and epistemic-marker monitoring. MOPD relies on routing, calibration, and regression suites. RL-distillation hybrids rely on reward anchoring and gated token-level updates.

Comparative Failure Modes

Off-policy distillation fails through exposure bias, stale or low-quality traces, data contamination, over-imitation, and mixture imbalance.
Online distillation fails through non-stationary targets, consensus errors, synchronization overhead, and unstable teacher refreshes.
OPD fails through poor rollout quality, stale rollouts, support mismatch, repetition or truncation collapse, and masking bugs.
OPSD fails through information leakage, style drift, response shortening, uncertainty suppression, and moving-teacher feedback loops.
MOPD fails through teacher conflict, bad routing, incompatible tokenizers, support mismatch, high serving cost, and insufficient regression evaluation.
RL-distillation hybrids fail when dense token signals conflict with rewards, when privileged context rewards style rather than task progress, or when gates and loss weights are miscalibrated.

Practical Defaults

Start with off-policy SFT or trace distillation unless there is a strong reason not to. Add RLVR or RLHF when task success cannot be fully captured by demonstrations. Add OPD when the student’s own rollouts become informative. Add self-distillation when external teachers are unavailable or feedback can create a useful teacher context. Add MOPD when specialist capabilities need consolidation. Add RL-distillation hybrids when rewards are trusted but too sparse.
Treat systems constraints as algorithmic constraints. A theoretically attractive loss may be impractical if it requires full-vocabulary logits from many teachers, cross-tokenizer alignment, synchronous scoring, or dense teacher forwards over long rollouts. A sampled-token objective may be less exact, but practically better if it keeps training scalable and stable.

Decision Summary

Use off-policy distillation for stable broad transfer.
Use online or semi-online distillation for adaptive teacher refresh.
Use OPD for exposure-bias reduction on student rollouts.
Use self-distillation for privileged feedback without external teachers.
Use MOPD for specialist capability consolidation.
Use RL-distillation hybrids for reward-grounded learning with dense token-level credit assignment.
For frontier-style post-training, prefer a staged recipe: off-policy SFT for competence, RLVR for task success, specialist RL for domain peaks, warmup for support overlap, OPD or MOPD for consolidation, and final alignment plus regression monitoring for deployment readiness.

Implementation Patterns

Large-scale distillation is best understood as a distributed systems problem built around a teacher-student dataflow architecture. The algorithmic choice matters, but at frontier scale the dominant constraints are often rollout throughput, teacher scoring throughput, log-probability transport, tokenizer alignment, masking correctness, replay freshness, privileged-context leakage control, reward anchoring, and regression monitoring.
The central idea to carry through this section is that every distillation method requires four implementation decisions: the source of trajectories, the source of teacher or feedback signal, the divergence or surrogate loss, and the systems design for computing and transporting the necessary log-probabilities. Off-policy systems push most complexity into data generation and filtering. OPD systems push complexity into rollout generation and teacher scoring. MOPD systems add routing, teacher pools, support-overlap checks, teacher-disagreement analysis, and per-domain regression monitoring. Self-distillation systems add privileged-context construction, teacher-student context separation, and leakage controls. RL-distillation hybrids add reward computation, token-level gates, routing rules, and loss balancing.
Modern LLM distillation systems increasingly resemble RL training stacks. They sample or replay trajectories, store token-level log-probabilities, compute token masks, maintain behavior-policy provenance, score rollouts with teachers or reward systems, and update the learner with dense token-level signals. The difference is that the dense signal may come from a teacher, several teachers, a self-teacher under privileged context, a reward model, an environment, or a feedback-conditioned replay.
A useful engineering rule is that the training loss is only as reliable as the metadata around it. A token-level teacher score is meaningful only if the system knows which prompt, chat template, tokenizer, rollout policy, teacher checkpoint, response mask, stop condition, privileged context, reward label, prefix, and system version produced it. Without this provenance, distillation failures are difficult to diagnose.

Canonical Distillation Dataflow

A production distillation system usually follows a dataflow in which prompts produce trajectories, trajectories are scored by teachers or feedback systems, scored examples are converted into a loss, and the student is updated while evaluation monitors both target gains and regressions.
\[\text{Prompts} \rightarrow \text{Trajectory Source} \rightarrow \text{Teacher / Feedback Signal} \rightarrow \text{Alignment and Masking} \rightarrow \\ \text{Loss Computation} \rightarrow \text{Student Update} \rightarrow \text{Evaluation and Regression Monitoring}\]
The trajectory source determines whether the pipeline is dataset-centric or loop-centric. In off-policy distillation, trajectories are produced before training and stored as a corpus. In OPD, MOPD, OPSD, and most RL-distillation hybrids, trajectories are generated by the student or a closely related behavior policy during training.
The teacher or feedback signal determines the payload. Sequence-level distillation needs text targets. Logit distillation needs teacher distributions or top-\(k\) probabilities. Sampled-token OPD needs teacher log-probabilities on sampled student tokens. OPSD needs the same rollout rescored under privileged context. RL-distillation hybrids may require both reward-level and token-level fields.
A practical scored rollout record can be represented as:
\[r =\left( x, \hat{y}, \ell_S, \ell_T, m, R, c, \rho, v \right)\]
where \(x\) is the prompt, \(\hat{y}\) is the rollout, \(\ell_S\) contains student or behavior-policy log-probabilities, \(\ell_T\) contains teacher log-probabilities, \(m\) is the valid-token mask, \(R\) is the reward or verifier score, \(c\) is privileged or feedback context when present, \(\rho\) contains routing, reward, or provenance metadata, and \(v\) records checkpoint and system versions.
At token level, the same record can be viewed as:
\[r_t =\left( x, y_t, s_t, \log \pi_{\mathrm{behav}}(y_t \mid s_t), \log \pi_T(y_t \mid s_t), m_t, \rho, v \right)\]
where \(s_t=(x,y_{<t})\) is the token prefix and \(m_t\) is the valid-token mask.
The version metadata should include at least:
\[v =\left( \text{student checkpoint}, \text{behavior policy checkpoint}, \text{teacher checkpoint}, \text{tokenizer version}, \\ \text{chat template version}, \text{reward model version}, \text{generation config} \right)\]
The provenance metadata should identify which teacher scored the rollout, which router selected the teacher, which mask was applied, which reward or verifier produced the label, and which learner checkpoint consumed the example.
The most important implementation invariant is that every log-probability must correspond to the exact token and prefix used by the learner. A mismatch in tokenization, chat template, system prompt, tool-call formatting, stop-token handling, or response mask can turn a correct mathematical loss into an incorrect training signal.
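The scored rollout record and its version metadata could be represented roughly as follows. This is a sketch under stated assumptions: the class names, field names, and the checks in `validate` are hypothetical, chosen to mirror the tuple \(r\) and the version fields listed above rather than any particular framework's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass(frozen=True)
class VersionInfo:
    """Mirrors v: the checkpoint and system versions behind a scored rollout."""
    student_checkpoint: str
    behavior_checkpoint: str
    teacher_checkpoint: str
    tokenizer_version: str
    chat_template_version: str
    reward_model_version: Optional[str] = None
    generation_config: Dict[str, object] = field(default_factory=dict)

@dataclass
class ScoredRollout:
    prompt: str                    # x
    token_ids: List[int]           # rollout y-hat, as token ids
    student_logps: List[float]     # l_S (student or behavior policy)
    teacher_logps: List[float]     # l_T
    mask: List[int]                # m, 1 for valid response tokens
    reward: Optional[float]        # R
    context: Optional[str]         # c, privileged or feedback context if any
    provenance: Dict[str, str]     # rho, e.g. teacher id, router id, mask id
    versions: VersionInfo          # v

    def validate(self):
        """Reject records whose per-token fields cannot be aligned."""
        n = len(self.token_ids)
        for name in ("student_logps", "teacher_logps", "mask"):
            got = len(getattr(self, name))
            if got != n:
                raise ValueError(f"{name} has length {got}, expected {n}")
        if any(lp > 0.0 for lp in self.student_logps + self.teacher_logps):
            raise ValueError("log-probabilities must be <= 0")
        return self
```

Length and sign checks catch only the crudest misalignments; template or tokenizer mismatches need the version fields to be compared explicitly.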

Off-Policy Pipeline Architecture

Off-policy distillation is the simplest pipeline because teacher generation and student optimization are decoupled. The teacher produces traces, labels, logits, critiques, preference annotations, verifier scores, or tool transcripts ahead of training. The student then consumes a fixed or slowly refreshed dataset.
A typical off-policy pipeline proceeds as:
\[\text{Prompt Collection} \rightarrow \text{Teacher Generation} \rightarrow \text{Filtering / Verification} \rightarrow \text{Deduplication} \rightarrow \\ \text{Mixture Balancing} \rightarrow \text{SFT / KD} \rightarrow \text{Evaluation}\]
The strength of this design is reproducibility. Once the corpus is fixed, multiple students, losses, or mixture weights can be compared on the same training data. This makes off-policy distillation ideal for cold starts, ablation studies, safety audits, and synthetic-data pipelines.
The main engineering burden is data quality. The system must track teacher checkpoint versions, prompt sources, generation parameters, verifier versions, reward-model versions, filtering rules, deduplication hashes, benchmark-contamination checks, and domain mixture weights.
For trace-distillation SFT, storing text is usually cheap relative to storing logits, but trace quality becomes the bottleneck. For logit distillation, teacher probability storage becomes expensive, especially if the pipeline stores top-\(k\) probabilities for every token rather than only teacher-generated text.
A robust off-policy dataset entry should include not only the final response, but also provenance metadata such as source prompt, teacher checkpoint, decoding temperature, sample index, filter scores, verifier outcome, answer-extraction result, safety tags, and any known benchmark relationship. Without this metadata, later regression analysis becomes guesswork.
Off-policy systems should also track dataset lineage. A filtered synthetic corpus can be copied, remixed, rebalanced, and reused across runs, so later regressions may depend on a chain of transformations rather than a single generation job.
Benchmark-contamination checks matter more in off-policy systems because static corpora can accidentally include held-out benchmark examples, near-duplicates, answer keys, or teacher traces generated from contaminated prompts.
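Deduplication hashes and a basic contamination check can be sketched with the standard library alone. The example below assumes whitespace tokenization, lowercase normalization, and word n-gram overlap as the contamination signal; these are simplifying assumptions, and production pipelines typically use stronger normalization and fuzzy matching.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def dedup_hash(text):
    """Stable hash of the normalized text, usable as a deduplication key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def ngrams(text, n=8):
    """Set of word n-grams of the normalized text."""
    toks = normalize(text).split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_overlap(sample, benchmark_items, n=8):
    """Fraction of the sample's n-grams that also occur in any benchmark item."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    bench_grams = set()
    for item in benchmark_items:
        bench_grams |= ngrams(item, n)
    return len(sample_grams & bench_grams) / len(sample_grams)
```

A corpus entry whose overlap exceeds a chosen threshold would be dropped or flagged, with the score stored alongside the other provenance fields.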

On-Policy Pipeline Architecture

On-policy distillation introduces a closed feedback loop. The student generates rollouts, the teacher scores those exact rollouts, and the learner updates the student using dense teacher feedback on the student’s own visited prefixes.
A typical OPD pipeline proceeds as:
\[\text{Prompt Sampling} \rightarrow \text{Student Rollout} \rightarrow \text{Teacher Scoring} \rightarrow \text{Token Masking} \rightarrow \\ \text{Distillation Loss} \rightarrow \text{Student Update}\]
This architecture directly addresses train-inference mismatch because the supervised states are sampled from the student. It is especially valuable for long-horizon reasoning, code, and agentic tasks where early mistakes create future contexts that never appear in teacher-generated traces.
The cost is operational complexity. The training loop must maintain rollout workers, teacher scoring workers, queues, masks, log-probability schemas, and learner-side loss computation. The teacher no longer produces a complete answer independently; instead, it must replay the student’s exact prefix and return probabilities for the student’s sampled tokens or a local token set.
In most practical OPD systems, gradients do not flow through the sampling process. The rollout is treated as data, teacher and student log-probabilities are computed over that data, and the student is updated by a supervised or policy-gradient-shaped surrogate. This makes the implementation closer to DAgger-style imitation learning than to a full sequence-level reverse-KL policy gradient.
A sampled-token OPD loss can be written as:
\[\mathcal{L}_{OPD} = -\mathbb{E} \left[ \sum_t m_t \operatorname{sg} \left[ \log \pi_T(y_t \mid s_t) -\log \pi_\theta(y_t \mid s_t) \right] \log \pi_\theta(y_t \mid s_t) \right]\]
where \(m_t\) masks valid response tokens and \(\operatorname{sg}\) denotes stop-gradient through the teacher-derived advantage.
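To make the role of the stop-gradient concrete, the sketch below evaluates the loss above for one rollout in plain Python and returns the gradient with respect to each \(\log \pi_\theta(y_t \mid s_t)\). Because the advantage is held constant, that gradient is simply \(-m_t A_t\). The function name is hypothetical, and the unnormalized sum over tokens follows the formula as written; per-token averaging would be a separate design choice.

```python
def opd_sampled_token_loss(student_logps, teacher_logps, mask):
    """Return (loss, dloss_dlogp) for one rollout.

    The advantage A_t = log pi_T - log pi_theta is treated as a constant
    (stop-gradient), so d(loss)/d(log pi_theta(y_t | s_t)) = -m_t * A_t.
    """
    if not (len(student_logps) == len(teacher_logps) == len(mask)):
        raise ValueError("per-token sequences must have the same length")
    loss = 0.0
    grads = []
    for lp_s, lp_t, m in zip(student_logps, teacher_logps, mask):
        adv = lp_t - lp_s      # constant under stop-gradient
        w = float(m)
        loss -= w * adv * lp_s
        grads.append(-w * adv)
    return loss, grads

loss, grads = opd_sampled_token_loss(
    student_logps=[-1.0, -2.0, -0.5],
    teacher_logps=[-0.5, -3.0, -0.1],
    mask=[1, 1, 0],
)
```

A negative gradient entry means gradient descent raises the student's log-probability for that token (the teacher rates it higher than the student does); a positive entry lowers it; masked tokens contribute nothing.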
The OPD system must distinguish the rollout policy from the current learner. If rollout generation and learner updates are asynchronous, the behavior policy that produced the trajectory may be older than the checkpoint being updated:
\[\hat{y} \sim \pi_{\mathrm{behav}}(\cdot \mid x), \qquad \pi_{\mathrm{learner}} =\pi_\theta\]
If \(\pi_{\mathrm{behav}}\) is too stale, the example should either be dropped, down-weighted, importance-corrected, or treated as off-policy replay.

Generation Buffers and Asynchronous Execution

Large-scale OPD and MOPD systems often decouple rollout generation, teacher scoring, and learner updates. Autoregressive generation may be slow, teacher scoring may be batched most efficiently on different hardware, and learner updates may run on yet another training cluster. A generation buffer allows these components to operate asynchronously.
A practical asynchronous pipeline is:
\[\text{Rollout Workers} \rightarrow \text{Generation Buffer} \rightarrow \text{Teacher Scoring Queue} \rightarrow \\ \text{Scored Rollout Buffer} \rightarrow \text{Learner}\]
The benefit is throughput. Rollout workers do not need to wait for learner updates, teacher scorers can batch requests by teacher or sequence length, and the learner can consume already-scored batches whenever they are ready.
The risk is staleness. If a rollout is generated by an older behavior policy but consumed by a newer learner policy, the training example is no longer exactly on-policy. This is a systems-level version of off-policy drift even when the algorithm is conceptually on-policy.
A common correction is to distinguish the behavior policy, the proximal policy, and the current learner policy. Nemotron 3 Ultra-style asynchronous MOPD uses ratios that account for the mismatch between a stale behavior policy and the proximal learner policy, then applies PPO-style clipping to the current learner ratio.
\[c_t =\operatorname{sg} \left[ \frac{ \pi_{\mathrm{prox}}(y_t \mid s_t) }{ \pi_{\mathrm{behav}}(y_t \mid s_t) } \right]\] \[r_t(\theta) =\frac{ \pi_{\theta}(y_t \mid s_t) }{ \pi_{\mathrm{prox}}(y_t \mid s_t) }\]
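The two ratios can be computed directly from stored log-probabilities. The text specifies the ratios and PPO-style clipping but not the full objective, so the sketch below assumes one plausible combination: the fixed correction \(c_t\) multiplying a standard PPO clipped surrogate on \(r_t(\theta)\). The function name and the clip range of 0.2 are illustrative assumptions.

```python
import math

def async_ratio_terms(logp_theta, logp_prox, logp_behav, advantage, clip_eps=0.2):
    """Per-token terms for a stale-rollout correction (assumed form).

    c: fixed correction from the behavior policy to the proximal policy (no gradient).
    r: trainable ratio from the proximal policy to the current learner.
    surrogate: c * min(r * A, clip(r, 1 - eps, 1 + eps) * A), to be maximized.
    """
    c = math.exp(logp_prox - logp_behav)
    r = math.exp(logp_theta - logp_prox)
    r_clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
    surrogate = c * min(r * advantage, r_clipped * advantage)
    return c, r, surrogate
```

When all three policies agree on a token, both ratios are 1 and the surrogate reduces to the advantage itself; the clip only binds once the learner has moved away from the proximal policy.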
The practical point is that on-policy systems are only as on-policy as their rollout freshness. A high-throughput asynchronous system must enforce freshness windows, queue limits, replay limits, or importance-weighting rules so that stale rollouts do not dominate training.
Freshness should be monitored explicitly:
\[\operatorname{QueueAge} =t_{\mathrm{consume}} - t_{\mathrm{generate}}\] \[\operatorname{PolicyLag} =\text{learner checkpoint step} - \text{behavior checkpoint step}\]
A rollout generated by a stale checkpoint can still be useful, but the system should not count it as fresh on-policy data unless the behavior-policy lag is within the configured window.
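A freshness gate built on these two quantities might look like the following. This is a minimal sketch: the `RolloutMeta` fields, the function name, and both default thresholds are hypothetical, and what happens to a non-fresh rollout (drop, down-weight, importance-correct, or replay) is left to the caller.

```python
from dataclasses import dataclass

@dataclass
class RolloutMeta:
    generated_at: float   # wall-clock seconds when the rollout finished
    behavior_step: int    # training step of the checkpoint that generated it

def freshness_report(meta, consumed_at, learner_step,
                     max_queue_age_s=600.0, max_policy_lag=4):
    """Compute QueueAge and PolicyLag and decide whether the rollout counts as fresh."""
    queue_age = consumed_at - meta.generated_at
    policy_lag = learner_step - meta.behavior_step
    if queue_age < 0 or policy_lag < 0:
        raise ValueError("rollout appears to come from the future; check clocks and steps")
    fresh = queue_age <= max_queue_age_s and policy_lag <= max_policy_lag
    return {"queue_age": queue_age, "policy_lag": policy_lag, "fresh": fresh}
```

Logging the full report, rather than only the boolean, lets the dashboard show how close the system runs to its freshness limits.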

Teacher Scoring Infrastructure

Teacher scoring is often the throughput bottleneck in OPD and MOPD. The teacher must evaluate the student’s full rollout under exact prefixes, even when it only returns sampled-token log-probabilities. For long reasoning or agentic rollouts, this can require large context windows and careful batching.
Common serving backends include optimized inference systems such as vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention by Kwon et al. (2023), SGLang-style serving for agentic and structured-generation workloads, and vendor-specific inference stacks such as TensorRT-LLM (TRT-LLM) for high-throughput deployment. The choice of serving backend affects batching, prefix caching, log-probability extraction, tool-call handling, and fault tolerance.
Single-teacher OPD can route all rollouts to one scoring service. MOPD requires a teacher pool, where different teachers score different prompts or domains. This adds scheduling complexity because teachers may vary in size, context length, latency, tokenizer, and available hardware.
Teacher scorers should return the smallest payload compatible with the loss. For sampled-token OPD, each token may only need:
\[\left( y_t, \log \pi_T(y_t \mid s_t) \right)\]
For top-\(k\) matching, each token position needs:
\[\left\{ \left( v_i, \log \pi_T(v_i \mid s_t) \right) \right\}_{i=1}^{k}\]
For full KL, the teacher would need to return a full-vocabulary distribution at every scored position, which is usually impractical at LLM scale.
Teacher serving must support fault isolation. If a teacher returns malformed scores, times out, uses the wrong template, silently changes checkpoint version, or scores the wrong prefix, the learner may receive corrupted gradients. A production system should include teacher-health checks, schema validation, prefix checksums, checkpoint IDs, timeout handling, and fallback routing.
Prefix checksums are especially useful for OPD because the teacher must score the same prefix the student produced. A checksum mismatch should fail closed rather than silently train on misaligned log-probabilities.
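A fail-closed prefix check might look like the sketch below. It assumes the scorer echoes back a checksum of the token IDs and template version it actually scored. The response fields (`prefix_checksum`, `logprobs`), the `PrefixMismatch` exception, and the helper names are hypothetical and not tied to any particular serving backend.

```python
import hashlib


class PrefixMismatch(ValueError):
    """Raised when teacher scores cannot be trusted to match the student rollout."""


def prefix_checksum(token_ids, template_version):
    """Hash the chat-template version together with the exact token sequence."""
    h = hashlib.sha256()
    h.update(template_version.encode("utf-8"))
    h.update(b"\x00")  # separator between the template tag and the token bytes
    for tid in token_ids:
        h.update(int(tid).to_bytes(4, "little", signed=False))
    return h.hexdigest()


def validate_scores(student_token_ids, template_version, response, n_response_tokens):
    """Return teacher log-probs only if they provably refer to the student's tokens."""
    expected = prefix_checksum(student_token_ids, template_version)
    if response.get("prefix_checksum") != expected:
        raise PrefixMismatch("teacher scored a different prefix or template")
    logprobs = response.get("logprobs")
    if logprobs is None or len(logprobs) != n_response_tokens:
        raise PrefixMismatch("log-probability count does not match response length")
    if any(lp != lp or lp > 0.0 for lp in logprobs):  # lp != lp detects NaN
        raise PrefixMismatch("log-probabilities must be <= 0 and not NaN")
    return logprobs


ids = [1, 5, 9, 42]
good = {"prefix_checksum": prefix_checksum(ids, "chat-v1"), "logprobs": [-0.5, -1.25]}
print(validate_scores(ids, "chat-v1", good, n_response_tokens=2))
```

Raising an exception instead of returning a default keeps a misaligned batch out of the gradient computation, which is the fail-closed behavior described above.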

Log-Probability Payload Design

Log-probability payload design should be determined by the divergence. Forward KL needs teacher-mode coverage, so teacher-top-\(k\) distributions are natural. Reverse-KL sampled-token training needs only the teacher probability of the student’s sampled token. JSD and local-support matching may require both teacher and student top-\(k\) sets.
A forward-KL-style top-\(k\) payload is:
\[\mathcal{P}_{T,k}(s_t) =\left\{ \left( v_i, \log \pi_T(v_i \mid s_t) \right) : v_i \in \operatorname{Top}_k\left(\pi_T(\cdot \mid s_t)\right) \right\}\]
A sampled-token reverse-KL-style payload is:
\[\mathcal{P}_{\mathrm{sample}}(s_t) =\left( y_t, \log \pi_T(y_t\mid s_t), \log \pi_S(y_t\mid s_t) \right)\]
A local-support matching payload stores a local vocabulary set:
\[\mathcal{V}_{\mathrm{local}}(s_t) =\operatorname{Top}_k(\pi_T) \cup \operatorname{Top}_k(\pi_S)\]
- and computes a renormalized divergence over \(\mathcal{V}_{\mathrm{local}}(s_t)\).
Payloads should include masks and offsets. The learner must know which tokens correspond to the prompt, the response, tool calls, tool outputs, hidden system content, final answer, padding, truncation, and end-of-sequence regions. A log-probability without a correct mask can be worse than no log-probability.
Compression matters. Teacher payloads may be stored as binary token IDs plus float16, bfloat16, or quantized log-probability arrays. For very large corpora, payload size can dominate storage cost, and sampled-token scoring may be preferable even if it is algorithmically less informative than top-\(k\) matching.
Payload schemas should be stable across runs. A loss implementation that expects sampled-token log-probabilities should not silently consume top-\(k\) payloads, and a top-\(k\) loss should not silently renormalize over a different support set than the one intended by the scorer.
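One way to make the schema explicit is to tag every payload with its kind and support set, and have the loss refuse anything else. The sketch below is illustrative only: the class names, the `kind` and `support` tags, and `require_payload` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SampledTokenPayload:
    token_ids: List[int]
    teacher_logprobs: List[float]
    student_logprobs: List[float]
    response_mask: List[int]  # 1 for trainable response tokens, 0 otherwise
    kind: str = "sampled_token"


@dataclass(frozen=True)
class TopKPayload:
    topk: List[List[Tuple[int, float]]]  # per position: (token_id, teacher log-prob)
    response_mask: List[int]
    support: str = "teacher_topk"        # which set the scorer used as support
    kind: str = "topk"


def require_payload(payload, expected_kind, expected_support=None):
    """Fail loudly if the payload is not the schema the loss was written for."""
    if payload.kind != expected_kind:
        raise TypeError(f"loss expects {expected_kind!r}, got {payload.kind!r}")
    if expected_support is not None and getattr(payload, "support", None) != expected_support:
        raise TypeError(f"loss expects support {expected_support!r}")
    if payload.kind == "sampled_token":
        lengths = {len(payload.token_ids), len(payload.teacher_logprobs),
                   len(payload.student_logprobs), len(payload.response_mask)}
    else:
        lengths = {len(payload.topk), len(payload.response_mask)}
    if len(lengths) != 1:
        raise ValueError("payload arrays and mask have inconsistent lengths")
    return payload


sampled = SampledTokenPayload([7, 8], [-0.5, -1.0], [-0.75, -2.0], [1, 1])
require_payload(sampled, "sampled_token")  # passes; asking for "topk" would raise TypeError
```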

Full-Vocabulary vs. Top-\(k\) Approximation

Exact full-vocabulary KL is usually impractical at LLM scale because it requires storing or transmitting a probability for every vocabulary token at every response position.
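A back-of-the-envelope calculation makes the size gap concrete. All numbers here are hypothetical and chosen only for illustration: a 128,000-token vocabulary, a 4,096-token response, `K = 32`, float16 log-probabilities, and 4-byte token IDs.

```python
def payload_bytes(n_positions, entries_per_position, bytes_per_logprob=2,
                  bytes_per_token_id=4, store_ids=True):
    """Uncompressed size of one rollout's teacher payload in bytes."""
    per_entry = bytes_per_logprob + (bytes_per_token_id if store_ids else 0)
    return n_positions * entries_per_position * per_entry


VOCAB = 128_000          # hypothetical vocabulary size
RESPONSE_TOKENS = 4_096  # hypothetical response length
K = 32                   # hypothetical top-k width

# A dense full-vocabulary vector needs no token IDs because the index is the ID.
full = payload_bytes(RESPONSE_TOKENS, VOCAB, store_ids=False)
topk = payload_bytes(RESPONSE_TOKENS, K)
sampled = payload_bytes(RESPONSE_TOKENS, 1)

print(full, topk, sampled)  # 1048576000 786432 24576
```

Under these assumptions a single rollout costs about 1 GB for full-vocabulary scores, under 1 MB for top-32, and about 24 KB for sampled-token scores.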
Full-vocabulary matching is most defensible when the teacher and student share a tokenizer, chat template, and architecture family, when their local support overlaps strongly, and when the training system can afford the memory and communication cost.
Top-\(k\) teacher matching approximates the teacher distribution over the teacher’s most likely tokens. This reduces payload size while preserving the teacher’s main alternatives:
\[\tilde{p}_T(v_i \mid s_t) =\frac{ p_T(v_i \mid s_t) }{ \sum_{v_j \in \operatorname{Top}_k(p_T)} p_T(v_j \mid s_t) }\]
Student probabilities should be renormalized over the same support before computing KL:
\[\tilde{p}_S(v_i \mid s_t) =\frac{ p_S(v_i \mid s_t) }{ \sum_{v_j \in \operatorname{Top}_k(p_T)} p_S(v_j \mid s_t) }\]
The resulting approximate loss is:
\[\mathcal{L}_{k} =\sum_t D_{KL} \left( \tilde{p}_T(\cdot \mid s_t) \,\Vert\, \tilde{p}_S(\cdot \mid s_t) \right)\]
The weakness is support bias. If the teacher’s top-\(k\) tokens do not overlap with the student’s local support, the student may be trained to cover a token set it would not naturally consider, and the loss can become noisy.
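The renormalized top-\(k\) loss above can be sketched for a single position as follows. This is a pure-Python illustration rather than a training-ready kernel: it assumes log-probabilities arrive as `token_id -> log-prob` dictionaries and that the student has a finite log-probability for every token in the teacher's top-\(k\) set. The function name is hypothetical.

```python
import math


def topk_forward_kl(teacher_logprobs, student_logprobs, k):
    """KL(teacher || student) over the teacher's top-k tokens, both renormalized."""
    support = sorted(teacher_logprobs, key=teacher_logprobs.get, reverse=True)[:k]

    def log_normalizer(logprobs):
        # log-sum-exp over the shared support, for numerical stability
        m = max(logprobs[v] for v in support)
        return m + math.log(sum(math.exp(logprobs[v] - m) for v in support))

    z_t = log_normalizer(teacher_logprobs)
    z_s = log_normalizer(student_logprobs)
    kl = 0.0
    for v in support:
        lt = teacher_logprobs[v] - z_t  # renormalized teacher log-prob
        ls = student_logprobs[v] - z_s  # renormalized student log-prob
        kl += math.exp(lt) * (lt - ls)
    return kl


teacher = {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.15), 3: math.log(0.05)}
student = {v: math.log(0.25) for v in range(4)}
kl = topk_forward_kl(teacher, student, k=2)
print(round(kl, 6))
```

With `k=2` the teacher mass on tokens 0 and 1 renormalizes to 0.625 and 0.375, while the uniform student renormalizes to 0.5 and 0.5, so the loss only measures disagreement inside the teacher's top-2 set.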
Student-top-\(k\) or union-top-\(k\) payloads are useful when the goal is to evaluate tokens the student actually considers. A union set can be written as:
\[\mathcal{V}_{k}^{\cup}(s_t) =\operatorname{Top}_k(p_T(\cdot \mid s_t)) \cup \operatorname{Top}_k(p_S(\cdot \mid s_t))\]
The union support is more expensive than teacher-only top-\(k\), but it better exposes teacher-student disagreement on student-plausible tokens.
Sampled-token scoring is the cheapest approximation. It keeps only the sampled token’s teacher and student log-probabilities:
\[A_t = \log \pi_T(y_t \mid s_t) - \log \pi_S(y_t \mid s_t)\]
Sampled-token scoring is communication-efficient, but it loses information about nearby alternatives. It is often attractive in large-scale OPD or MOPD when teacher scoring cost is the bottleneck.

Tokenizer Compatibility and Alignment

Tokenizer compatibility is critical for logit and sampled-token distillation. If teacher and student tokenize the same string differently, token-level log-probabilities no longer correspond to the same events.
Shared-tokenizer systems can score student tokens directly:
\[y_t^S =y_t^T\]
Cross-tokenizer systems require retokenization or span-level alignment. A student token span may correspond to multiple teacher tokens:
\[y_i^S \leftrightarrow (y_j^T,\dots,y_{j+k}^T)\]
The teacher probability of a student span may need to be approximated as:
\[\log p_T(y_i^S \mid s) \approx \sum_{r=j}^{j+k} \log p_T(y_r^T \mid s,y_{j:r-1}^T)\]
This approximation is only valid if the context and text alignment are exact. For chat models, system messages, role tokens, tool-call wrappers, hidden reasoning markers, and assistant prefixes can all shift the scored context.
When tokenizers differ substantially, sequence-level distillation or text-level preference training may be more reliable than token-level log-probability matching.
Alignment tests should be part of CI. Given a fixed prompt and response, the system should verify that teacher and student scorers reconstruct the same text, apply the same response mask, and score the intended assistant tokens.
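The span-level alignment described above can be sketched by grouping tokens into the smallest chunks whose text boundaries coincide in both tokenizations, then summing teacher log-probabilities within each chunk. The sketch assumes tokens are available as non-empty decoded strings and that each teacher log-probability was already computed under the exact preceding context. `align_spans` is a hypothetical helper.

```python
def align_spans(student_tokens, teacher_tokens, teacher_logprobs):
    """Return a list of (student_indices, teacher_indices, summed_teacher_logprob)."""
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("tokenizations decode to different text; refusing to align")
    chunks = []
    i = j = 0          # next unread student / teacher token
    s_pos = t_pos = 0  # characters consumed on each side
    s_idx, t_idx, lp = [], [], 0.0
    while i < len(student_tokens) or j < len(teacher_tokens):
        if s_pos <= t_pos and i < len(student_tokens):
            s_pos += len(student_tokens[i])
            s_idx.append(i)
            i += 1
        else:
            t_pos += len(teacher_tokens[j])
            t_idx.append(j)
            lp += teacher_logprobs[j]
            j += 1
        if s_pos == t_pos and s_idx and t_idx:
            # Both tokenizations reach the same character boundary: close the chunk.
            chunks.append((s_idx, t_idx, lp))
            s_idx, t_idx, lp = [], [], 0.0
    return chunks


chunks = align_spans(["hello", " world"], ["hel", "lo", " world"], [-0.25, -0.5, -0.125])
print(chunks)  # [([0], [0, 1], -0.75), ([1], [2], -0.125)]
```

The text-equality check at the top doubles as the kind of CI assertion suggested above: if the two tokenizers do not reconstruct the same string, the function refuses to produce scores.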

Stabilization Mechanisms

Distillation losses can destabilize training when the teacher and student are too far apart, when rollouts are stale, when token masks are wrong, or when the teacher is privileged and not reward-aligned.
Common stabilization tools include:
- KL regularization to a reference model.
- Entropy bonuses or entropy floors.
- Loss clipping.
- Teacher-student advantage clipping.
- Token-level masks.
- Support-overlap gates.
- Teacher-confidence gates.
- Reward-agreement masks.
- Rollout freshness filters.
- Length and repetition filters.
- Mixture balancing across domains.
- Evaluation-gated checkpoint promotion.
A clipped sampled-token advantage can be written as:
\[\tilde{A}_t =\operatorname{clip} \left( \log \pi_T(y_t \mid s_t) -\log \pi_S(y_t \mid s_t), -c, c \right)\]
A support-overlap gate can down-weight low-overlap states:
\[g_t =\operatorname{Overlap}_k(s_t)\]
A reward-agreement mask can retain distillation only when it agrees with RL:
\[m_t =\mathbf{1} \left[ \operatorname{sign} \left( A_t^{RL} \right) =\operatorname{sign} \left( A_t^{D} \right) \right]\]
Stabilization should be treated as part of the algorithm, not as a late-stage engineering patch. A method that works only after aggressive filtering, clipping, and gating should be described as such.
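The three token-level mechanisms written out above (clipping, the overlap gate, and the agreement mask) compose naturally. In the sketch below, \(\operatorname{Overlap}_k\) is implemented as the Jaccard overlap of the two top-\(k\) sets. That definition is an assumption for illustration, since this section does not fix one, and the function names are hypothetical.

```python
def clipped_advantage(teacher_lp, student_lp, c):
    """Clip the sampled-token log-ratio to the interval [-c, c]."""
    return max(-c, min(c, teacher_lp - student_lp))


def overlap_at_k(teacher_topk, student_topk):
    """Jaccard overlap of two top-k token sets (one assumed choice of Overlap_k)."""
    t, s = set(teacher_topk), set(student_topk)
    union = t | s
    return len(t & s) / len(union) if union else 0.0


def agreement_mask(adv_rl, adv_distill):
    """1.0 when the RL and distillation advantages share a sign, else 0.0."""
    def sign(v):
        return (v > 0) - (v < 0)
    return 1.0 if sign(adv_rl) == sign(adv_distill) else 0.0


def stabilized_advantage(teacher_lp, student_lp, teacher_topk, student_topk, adv_rl, c=2.0):
    """Combine clip, support-overlap gate, and reward-agreement mask for one token."""
    a = clipped_advantage(teacher_lp, student_lp, c)
    return agreement_mask(adv_rl, a) * overlap_at_k(teacher_topk, student_topk) * a


value = stabilized_advantage(
    teacher_lp=-0.5, student_lp=-4.0,            # raw log-ratio 3.5, clipped to 2.0
    teacher_topk=[1, 2, 3, 4], student_topk=[3, 4, 5, 6],  # overlap 2/6
    adv_rl=1.0,                                  # same sign as the distillation signal
)
print(round(value, 4))
```

Logging how often each factor is below 1 is one way to make the amount of filtering visible, in line with the point that stabilization should be reported as part of the method.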

Multi-Teacher Routing Infrastructure

MOPD adds a routing and aggregation layer on top of OPD. Instead of one teacher scoring every rollout, a router selects a teacher or teacher mixture based on the prompt, rollout, domain, task type, reward signal, or model confidence.
A routing function can be written as:
\[r(x,\hat{y}) \rightarrow T_j\]
- where \(T_j\) is the selected teacher.
Multi-teacher aggregation can average log-probabilities, choose the highest-confidence teacher, weight teachers by domain scores, or route examples to specialist teachers:
\[\log p_{\mathrm{agg}}(a_t \mid s_t) =\sum_j w_j(s_t) \log p_{T_j}(a_t \mid s_t)\]
Teacher weights should be calibrated and monitored. A high-performing teacher in one domain may be harmful on another domain, especially if its reasoning style, tokenizer, or local token support differs from the student’s.
The router itself requires evaluation. A bad router can send math prompts to a code teacher, safety prompts to a general assistant teacher, or tool-use prompts to a teacher that never learned the tool schema.
Routing metadata should be stored with each example:
\[\rho =\left( \text{router version}, \text{selected teacher}, \text{teacher weights}, \text{domain label}, \text{confidence}, \text{fallback path} \right)\]
Teacher disagreement is a useful monitoring signal:
\[\operatorname{Disagree}(s_t) =\operatorname{Var}_{j} \left[ \log p_{T_j}(y_t \mid s_t) \right]\]
High disagreement can indicate ambiguous prompts, teacher specialization boundaries, routing errors, or low-support states.
MOPD dashboards should include per-teacher coverage, per-teacher reward improvement, teacher disagreement, router entropy, support overlap, and per-domain regression scores.
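The weighted aggregation and the disagreement signal above can be sketched for one token as follows. The teacher names, the weights, and the choice of population variance for \(\operatorname{Var}_j\) are assumptions made for the example.

```python
from statistics import pvariance


def aggregate_logprob(teacher_logprobs, weights):
    """Weighted sum of per-teacher log-probs, with weights normalized to sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("teacher weights must have positive total mass")
    return sum(weights[name] / total * teacher_logprobs[name] for name in weights)


def disagreement(teacher_logprobs):
    """Population variance of the teachers' log-probs for the same token."""
    return pvariance(list(teacher_logprobs.values()))


# Hypothetical per-teacher log-probabilities for one sampled token.
lps = {"math": -0.5, "code": -2.5, "general": -1.5}
weights = {"math": 2.0, "code": 1.0, "general": 1.0}

agg = aggregate_logprob(lps, weights)
dis = disagreement(lps)
print(agg, round(dis, 4))
```

Averaging in log space corresponds to a weighted geometric mean of the teacher probabilities, which is generally not a normalized distribution over the vocabulary. That is harmless for sampled-token advantages but matters if the aggregate is used as a full distribution.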

Self-Distillation Systems

Self-distillation systems require stricter context accounting than ordinary teacher-student systems because the same model may appear as both student and teacher under different information states.
The student context is usually:
\[\mathcal{C}_S =(x,\hat{y}_{<t})\]
The self-teacher context may include privileged information:
\[\mathcal{C}_T =(x,c,\hat{y}_{<t})\]
- where \(c\) may be a gold answer, reference solution, critique, feedback, tool output, future user message, retrieved skill, or environment observation.
The implementation must explicitly mark whether \(c\) is available at inference. If \(c\) is training-only, the system must evaluate leakage.
A privileged-context record should store:
\[c =\left( \text{source type}, \text{source text}, \text{availability at inference}, \text{causal relation to failure}, \text{filter decision}, \text{leakage risk} \right)\]
The self-distillation loss should not blindly train on all differences between \(\mathcal{C}_T\) and \(\mathcal{C}_S\). Some differences are task-bearing, but others are style shifts, confidence shifts, length shifts, or hallucinated-reference templates.
Relevance masks, contrastive hints, reward-agreement masks, and gates are implementation tools for separating task-bearing differences from context-induced artifacts.
A relevance-masked self-distillation loss is:
\[\mathcal{L}_{RMSD} =\sum_t m_t^{rel} D \left( p_{\theta^-}(\cdot \mid \mathcal{C}_T) \,\Vert\, p_{\theta}(\cdot \mid \mathcal{C}_S) \right)\]
A contrastive hint signal can subtract the effect of a wrong or control hint:
\[e_t^{ctr} =\left[ \log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right] -\left[ \log p_\theta(y_t \mid x,c^-,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right]\]
Self-distillation systems should always include tests for hallucinated feedback, fabricated references, and suppressed epistemic verbalization.
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t shows the central implementation hazard: a privileged self-teacher can up-weight responses that look as if the model had a reference solution, even when those responses are not more rewarded. The student then learns the response shape without having the information that originally grounded it.
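The contrastive hint signal defined above can be computed from three scoring passes over the same sampled tokens. The sketch below follows the formula term by term. The argument names are hypothetical, and the inputs are assumed to be per-token log-probabilities of the sampled tokens under the correct hint \(c^+\), the control hint \(c^-\), and no hint.

```python
def contrastive_hint_signal(lp_pos, lp_neg, lp_base):
    """Per-token e_t^ctr: (effect of the real hint) minus (effect of the control hint)."""
    if not (len(lp_pos) == len(lp_neg) == len(lp_base)):
        raise ValueError("all three log-prob sequences must cover the same tokens")
    return [(p - b) - (n - b) for p, n, b in zip(lp_pos, lp_neg, lp_base)]


lp_pos = [-0.5, -1.0, -0.25]    # scored with the correct hint c+
lp_neg = [-0.5, -3.0, -0.25]    # scored with a wrong or control hint c-
lp_base = [-0.75, -2.0, -0.25]  # scored with no hint

signal = contrastive_hint_signal(lp_pos, lp_neg, lp_base)
print(signal)  # [0.0, 2.0, 0.0]
```

Algebraically the no-hint term cancels, so \(e_t^{ctr}\) equals \(\log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) - \log p_\theta(y_t \mid x,c^-,\hat{y}_{<t})\). In the example, the first token is boosted equally by both hints and so receives zero signal, which is the intended effect: shifts caused merely by having any hint in context are subtracted out.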

RL-Distillation Hybrid Systems

RL-distillation hybrids combine reward computation with dense teacher or self-teacher signals. The implementation must decide which signal determines update direction and which signal determines update magnitude.
A generic hybrid objective is:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda_D \mathcal{L}_{D}(\theta)\]
If RL is trusted, it should usually anchor direction:
\[\Delta \theta \propto \sum_t A^{RL} w_t^{D} \nabla_\theta \log \pi_\theta(y_t \mid s_t)\]
- where \(w_t^D\) is a token-level distillation magnitude.
A gated hybrid can define:
\[g_t =\sigma \left( \beta \left[ \log \pi_T(y_t \mid s_t) -\log \pi_\theta(y_t \mid s_t) \right] \right)\]
The gated distillation term is:
\[\mathcal{L}_{D}^{gated} =\sum_t g_t \ell_t^D\]
Sample routing is another hybrid design. Correct samples can be sent to RL, while failed samples can be sent to self-distillation correction:
\[\mathcal{L} =\mathbf{1}[R(\tau)>0] \mathcal{L}_{GRPO} +\mathbf{1}[R(\tau)=0] \lambda(\tau) \mathcal{L}_{SDPO}\]
Hybrid systems need to store both reward and distillation metadata. A token update may depend on verifier outcome, group-relative advantage, teacher log-probability, self-teacher context, entropy, and routing decision.
The most important failure mode is disagreement between the reward signal and the dense signal. If the teacher says a token is good but the reward says the trajectory failed, the system needs a rule: mask, down-weight, route, clip, or trust one signal over the other.
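The sigmoid gate and the reward-based sample routing above can be sketched as follows. The branch labels, the default `beta`, and the function names are hypothetical, and rewards are assumed to be non-negative with zero meaning failure.

```python
import math


def sigmoid_gate(teacher_lp, student_lp, beta=1.0):
    """g_t = sigmoid(beta * (log pi_T - log pi_theta)), computed stably."""
    z = beta * (teacher_lp - student_lp)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def gated_distill_loss(token_losses, teacher_lps, student_lps, beta=1.0):
    """Sum of per-token distillation losses, each weighted by its gate."""
    return sum(sigmoid_gate(t, s, beta) * loss
               for loss, t, s in zip(token_losses, teacher_lps, student_lps))


def route_sample(reward):
    """Send rewarded trajectories to RL and failed ones to self-distillation."""
    return "rl" if reward > 0 else "self_distill"


print(sigmoid_gate(-1.0, -1.0))                # 0.5: teacher and student agree
print(route_sample(1.0), route_sample(0.0))    # rl self_distill
```

Storing the routing decision and the gate value with each token is what later makes a reward-versus-teacher disagreement debuggable.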

Privileged-Context Construction and Leakage Control

Privileged-context construction is a separate implementation problem from teacher scoring. A system must decide what information is allowed in the teacher context, how it is formatted, how it is filtered, and whether it will ever be available at inference.
Common privileged contexts include gold answers, reference solutions, tool observations, unit-test failures, compiler errors, judge critiques, future user replies, prior-attempt mistakes, retrieved skills, and environment next states.
The core context asymmetry is:
\[\mathcal{C}_T =(x,c,\hat{y}_{<t})\] \[\mathcal{C}_S =(x,\hat{y}_{<t})\]
This asymmetry is useful only when \(c\) changes task-relevant token preferences. It is dangerous when \(c\) mainly changes style, confidence, explanation length, reference-citing behavior, or willingness to assert an answer.
Any context unavailable at inference should be treated as privileged and should trigger leakage-specific evaluation.
Leakage risk is not limited to explicit string copying. The student may learn a response shape that implies hidden context, such as “the reference solution says,” “as the feedback indicates,” “the previous attempt failed because,” or “the provided answer is.”
The implementation should distinguish grounded feedback use from hallucinated feedback mention. A model trained with feedback may legitimately use feedback in teacher mode, but the deployed student should not mention feedback when the prompt does not contain it.

Reward Anchoring and Teacher-Quality Diagnostics

Dense token-level supervision is useful only when it is aligned with downstream reward. The implementation should therefore track whether teacher scores behave like a reward-improved version of the student.
For a full trajectory \(s\), define the teacher-student log-ratio:
\[\Delta_T(s) =\log \pi_T(s) - \log \pi_S(s)\]
The central diagnostic is:
\[\operatorname{Corr} \left( \Delta_T(s), R(s) \right) > 0\]
A useful teacher should assign larger log-ratios to higher-reward trajectories. If correct trajectories receive higher \(\Delta_T(s)\) than incorrect trajectories, the teacher behaves like a reward-tilted student. If fabricated-reference traces, overconfident traces, or feedback-looking traces receive higher \(\Delta_T(s)\) regardless of correctness, the teacher signal is unsafe.
Bucketed diagnostics are often clearer than a single correlation:
\[\Delta_{\mathrm{correct}} =\mathbb{E} \left[ \Delta_T(s) \mid R(s)=1 \right]\] \[\Delta_{\mathrm{incorrect}} =\mathbb{E} \left[ \Delta_T(s) \mid R(s)=0 \right]\] \[\Delta_{\mathrm{reward\ gap}} =\Delta_{\mathrm{correct}} - \Delta_{\mathrm{incorrect}}\]
OPD teachers should have \(\Delta_{\mathrm{reward\ gap}}>0\). A self-teacher with \(\Delta_{\mathrm{reward\ gap}}\approx 0\) or \(\Delta_{\mathrm{reward\ gap}}<0\) should not be trusted as the primary training signal.
For privileged self-distillation, a second diagnostic should compare teacher log-ratio on artifact-bearing and artifact-free responses:
\[\Delta_{\mathrm{artifact}} =\mathbb{E} \left[ \Delta_T(s) \mid s \text{ contains absent-feedback or fabricated-reference artifacts} \right]\]
If \(\Delta_{\mathrm{artifact}}\) is high independent of reward, the teacher is up-weighting response shape rather than task success.
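The correlation and bucketed diagnostics above are simple to compute once per-trajectory log-ratios and binary rewards are logged. The sketch below uses a hand-written Pearson correlation to stay dependency-free. The function names and the toy data are hypothetical.

```python
from statistics import mean


def reward_gap(log_ratios, rewards):
    """Mean teacher-student log-ratio on correct minus incorrect trajectories."""
    correct = [d for d, r in zip(log_ratios, rewards) if r == 1]
    incorrect = [d for d, r in zip(log_ratios, rewards) if r == 0]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect trajectory")
    return mean(correct) - mean(incorrect)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


log_ratios = [2.0, 1.0, -1.0, -2.0]  # toy Delta_T values, one per trajectory
rewards = [1, 1, 0, 0]               # toy binary rewards

gap = reward_gap(log_ratios, rewards)
corr = pearson(log_ratios, rewards)
print(gap, round(corr, 4))
```

A teacher with a positive gap and positive correlation on held-out student rollouts passes the diagnostic. Running the same two functions on artifact-bearing versus artifact-free responses gives the second, privileged-context check.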

Loss Balancing and Gating

RL-distillation hybrids require explicit loss balancing. A generic objective is:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda_D \mathcal{L}_{D}(\theta)\]
The value of \(\lambda_D\) controls whether distillation is a primary objective or an auxiliary signal. When the teacher is privileged or context-shifted, \(\lambda_D\) should usually start small and be increased only if reward and behavior diagnostics remain healthy.
Token-level gates can make the auxiliary signal selective:
\[\mathcal{L}_{D}^{gated} =\sum_t g_t \ell_t^D\]
- where \(g_t \in [0,1]\) is a trust weight.
An agreement gate can retain distillation only when it agrees with RL:
\[g_t =\mathbf{1} \left[ \operatorname{sign} \left( A_t^{RL} \right) =\operatorname{sign} \left( A_t^{D} \right) \right]\]
An entropy gate can down-weight uncertain teachers:
\[g_t =\exp \left( -\gamma H \left( \pi_T(\cdot \mid s_t) \right) \right)\]
A support-overlap gate can down-weight states where teacher and student disagree over the local menu of plausible tokens:
\[g_t =\operatorname{Overlap}_k(s_t)\]
These gates should be evaluated as algorithmic interventions, not only engineering tweaks. A gate that measures confidence can still fail if the teacher is confidently wrong or confidently biased toward privileged-context artifacts.
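The entropy gate and the gated hybrid objective above can be sketched together. The sketch assumes the teacher distribution at a position is available as a list of probabilities (for example, a renormalized top-\(k\) set), and the values of `gamma` and `lambda_d` are placeholders rather than recommendations.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gate(teacher_probs, gamma=1.0):
    """g_t = exp(-gamma * H(pi_T)): near 1 for confident teachers, smaller otherwise."""
    return math.exp(-gamma * entropy(teacher_probs))


def hybrid_loss(rl_loss, token_distill_losses, gates, lambda_d):
    """L = L_RL + lambda_D * sum_t g_t * l_t^D."""
    gated = sum(g * loss for g, loss in zip(gates, token_distill_losses))
    return rl_loss + lambda_d * gated


confident = entropy_gate([1.0, 0.0, 0.0, 0.0])      # entropy 0, gate 1
uncertain = entropy_gate([0.25, 0.25, 0.25, 0.25])  # entropy ln 4, gate about 0.25
total = hybrid_loss(1.0, [2.0, 4.0], [confident, uncertain], lambda_d=0.1)
print(confident, round(uncertain, 4), round(total, 4))
```

As noted above, this gate only measures confidence: a teacher that is confidently wrong still receives a gate near 1, so it is best paired with a reward-based check.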

Evaluation and Regression Monitoring

Distillation should be evaluated on both target capabilities and preserved capabilities. A run that improves math but hurts instruction following, improves code but hurts safety, or improves agentic task success but increases malformed tool calls is not a clean win.
A robust evaluation suite should include task metrics, behavioral metrics, distributional metrics, systems metrics, and provenance metrics.
Task metrics include accuracy, pass@\(k\), verifier pass rate, judge score, unit-test pass rate, tool-use success, agent completion rate, and human preference win rate.
Behavioral metrics include refusal calibration, helpfulness, verbosity, response length, uncertainty expression, branching behavior, self-correction rate, final-answer commitment, and tool-call hygiene.
Distributional metrics include KL to the reference model, teacher-student KL, entropy, teacher-student support overlap, repetition rate, truncation rate, teacher disagreement, teacher-student log-ratio, and correlation between log-ratio and reward:
\[\operatorname{Corr} \left( \Delta_T(\tau), R(\tau) \right)\]
Systems metrics include rollout throughput, teacher latency, queue age, GPU utilization, failure rate, score coverage, payload size, cache hit rate, teacher timeout rate, and learner-consumer lag.
Provenance metrics include the fraction of training examples with complete metadata, stale rollout percentage, unknown teacher version percentage, unknown reward model version percentage, unknown tokenizer version percentage, unknown chat-template version percentage, and mask-version mismatch rate.
Privileged self-distillation requires additional metrics because final accuracy can hide the main failure modes. These metrics include hallucinated-reference rate, feedback-mention rate when feedback is absent, prior-attempt hallucination rate, epistemic-verbalization rate, out-of-distribution accuracy, and calibration.
The hallucinated-privileged-context rate can be written as:
\[H_{\mathrm{priv}} =\mathbb{E}_{x \sim \mathcal{D}_{eval}} \left[ \mathbf{1} \left[ y(x) \text{ mentions absent feedback, reference, guidance, or prior attempt} \right] \right]\]
The epistemic-verbalization metric can be written as:
\[\operatorname{EV}(y) =\sum_{t=1}^{|y|} \mathbf{1} \left[ y_t \in \mathcal{E} \right]\]
- where \(\mathcal{E}\) is a set of uncertainty, checking, branching, or backtracking markers.
The X-posted self-distillation experiments show why these metrics matter: naive privileged self-distillation can preserve or improve some in-distribution scores while hallucinated privileged context rises, epistemic verbalization collapses, and out-of-distribution accuracy trails RL. This makes \(H_{\mathrm{priv}}\), \(\operatorname{EV}\), and \(\operatorname{Acc}_{OOD}\) first-class evaluation metrics rather than secondary qualitative checks.
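Both metrics can be approximated with simple string matching, as in the sketch below. The phrase list reuses the leakage examples quoted earlier in this section. The marker set standing in for \(\mathcal{E}\), the word-level counting (instead of model tokens), and the function names are assumptions for illustration. A production detector would likely need a classifier or judge model.

```python
import re

# Phrases that imply privileged context; taken from the leakage examples in the text.
PRIVILEGED_PATTERNS = [
    r"the reference solution says",
    r"as the feedback indicates",
    r"the previous attempt failed because",
    r"the provided answer is",
]
_PRIV_RE = re.compile("|".join(PRIVILEGED_PATTERNS), re.IGNORECASE)

# Hypothetical stand-in for the marker set E.
EPISTEMIC_MARKERS = {"wait", "hmm", "perhaps", "maybe", "alternatively", "actually"}


def hallucinated_priv_rate(examples):
    """examples: (response, context_was_in_prompt) pairs; returns an estimate of H_priv."""
    if not examples:
        raise ValueError("need at least one evaluation example")
    flags = [bool(_PRIV_RE.search(resp)) and not has_ctx for resp, has_ctx in examples]
    return sum(flags) / len(flags)


def epistemic_verbalization(response):
    """Count words in the response that belong to the marker set."""
    words = re.findall(r"[a-z']+", response.lower())
    return sum(word in EPISTEMIC_MARKERS for word in words)


evals = [
    ("The provided answer is 42.", False),         # mentions context that was absent
    ("As the feedback indicates, retry.", True),   # grounded: feedback was in the prompt
    ("Wait, maybe x = 3. Actually x = 4.", False),
    ("x = 5", False),
]
h_priv = hallucinated_priv_rate(evals)
ev = epistemic_verbalization(evals[2][0])
print(h_priv, ev)  # 0.25 3
```

The second example is not counted because its prompt did contain feedback, which matches the distinction between grounded feedback use and hallucinated feedback mention.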
A compact OPD and self-distillation dashboard is:
\[\left\{ \operatorname{Acc}_{IID}, \operatorname{Acc}_{OOD}, \operatorname{Reward}, \operatorname{Corr}(\Delta_T,R), H_{\mathrm{priv}}, \operatorname{EV}, \operatorname{Overlap}_k, \operatorname{KL}_{ref}, \mathbb{E}[|\hat{y}|], \operatorname{TruncRate}, \operatorname{RepeatRate}, \operatorname{QueueAge} \right\}\]
Regression monitoring should be domain-specific. MOPD systems need per-teacher domain dashboards because an aggregate score can hide regressions. OPD systems need rollout-quality dashboards because degradation may first appear as length inflation, repeated tokens, or higher truncation. OPSD systems need epistemic-behavior dashboards because degradation may appear as fewer uncertainty markers, fewer self-correction branches, or more hallucinated references.
Evaluation should include provenance analysis. If a regression appears, the system should be able to answer which prompts caused it, which rollout policy generated them, which teacher scored them, which router assigned them, which privileged context was used, which mask was applied, which reward model scored them, and which checkpoint update consumed them.
For on-policy systems, offline evaluation alone is insufficient. The deployed model’s behavior depends on its own rollouts, so training should periodically sample fresh trajectories from the current model and evaluate them under the same generation settings used in deployment.

Practical Design Defaults

Start with a dataset-centric off-policy pipeline when building a new distillation system. Generate teacher traces, verify and filter them, deduplicate the corpus, balance domain mixtures, and establish a strong SFT baseline before adding online scoring.
Add token-level teacher probabilities only when the added information justifies the payload cost. Sequence-level traces are often sufficient for early training. Top-\(k\) or sampled-token log-probabilities become more valuable when the student needs finer local guidance.
Move to OPD only after the student can generate meaningful rollouts. A weak student produces low-quality states where teacher scoring may be expensive and uninformative. A stronger cold-started student produces near-miss trajectories that are ideal for dense correction.
Prefer sampled-token OPD or top-\(k\) local matching unless full-vocabulary matching is clearly affordable and support overlap is high. The theoretical elegance of full KL is often outweighed by the practical cost and the risk of off-support teacher noise.
Treat privileged self-distillation as an auxiliary signal unless diagnostics show that the self-teacher behaves like a reward-tilted version of the student. A privileged teacher can be useful, but it should not automatically become the primary optimization target.
Anchor RL-distillation hybrids in rewards when rewards are trusted. The safest default is:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda_D \mathcal{L}_{D}(\theta), \qquad \lambda_D \ll 1\]
Use agreement masks, entropy gates, support-overlap gates, relevance masks, or contrastive hinting when the dense teacher signal is useful but potentially contaminated by style, confidence, or privileged-context artifacts.
Track prompt fixes and loss fixes as experimental mitigations, not guaranteed solutions. Prompt optimization can reduce hallucination in some settings, but it can also overfit to training-distribution assumptions. Loss modifications can reduce the influence of pure distillation, but they can destabilize training if RL and distillation gradients interact poorly.
Do not ship a distillation run based only on aggregate benchmark improvements. Require clean results on preserved-capability regressions, OOD evaluations, leakage metrics, rollout-quality metrics, and provenance checks.

Implementation-Aware Comparison

Off-policy methods emphasize generation quality, filtering, deduplication, mixture balancing, and dataset versioning.
Online distillation emphasizes target non-stationarity, synchronization, peer agreement, and teacher refresh.
OPD emphasizes rollout freshness, exact teacher scoring, support overlap, truncation monitoring, and teacher-scoring throughput.
MOPD emphasizes routing, aggregation, teacher calibration, teacher disagreement, per-domain teacher quality, and per-domain regression monitoring.
Self-distillation emphasizes privileged-context construction, leakage control, reward-log-ratio checks, relevance masks, contrastive hinting, and epistemic-marker monitoring.
RL-distillation hybrids emphasize reward anchoring, token gates, loss balancing, sample routing, and instability monitoring.
Log-probability payload design should match the loss. Forward KL prefers broad teacher distributions. Reverse-KL sampled-token OPD only needs teacher log-probability on the sampled token. JSD or local-support matching may require top-\(k\) sets from teacher and student.
Tokenizer compatibility is especially important for OPD and MOPD. If tokenizers or chat templates differ, sampled-token advantages and response masks become unreliable.
Stabilization also differs by method. Off-policy pipelines rely on filtering and mixture balancing. OPD relies on support-overlap checks, rollout-quality filters, and truncation monitoring. OPSD relies on fixed teachers, relevance masks, contrastive controls, reward anchoring, and epistemic-marker monitoring. MOPD relies on routing, calibration, and regression suites. RL-distillation hybrids rely on reward anchoring and gated token-level updates.

Comparative Failure Modes

Off-policy distillation fails through exposure bias, stale or low-quality traces, data contamination, over-imitation, benchmark leakage, weak filtering, and mixture imbalance.
Online distillation fails through non-stationary targets, consensus errors, synchronization overhead, unstable teacher refreshes, and premature agreement.
OPD fails through poor rollout quality, stale rollouts, support mismatch, repetition or truncation collapse, teacher-scoring cost, and masking bugs.
OPSD fails through information leakage, style drift, hallucinated privileged context, response shortening, uncertainty suppression, moving-teacher feedback loops, and OOD fragility.
MOPD fails through teacher conflict, bad routing, incompatible tokenizers, support mismatch, high serving cost, insufficient calibration, and insufficient regression evaluation.
RL-distillation hybrids fail when dense token signals conflict with rewards, when privileged context rewards style rather than task progress, when gates and loss weights are miscalibrated, or when reward labels route samples to the wrong learning branch.

Practical Defaults

Start with off-policy SFT or trace distillation unless there is a strong reason not to.
Add RLVR or RLHF when task success cannot be fully captured by demonstrations and a trusted reward or verifier exists.
Add OPD when the student’s own rollouts become informative and a reward-improved, locally compatible teacher is available.
Add self-distillation when external teachers are unavailable or feedback can create a useful teacher context, but keep leakage monitoring and reward-log-ratio diagnostics in the loop.
Add MOPD when specialist capabilities need consolidation across domains and routing can be evaluated reliably.
Add RL-distillation hybrids when rewards are trusted but too sparse for efficient token-level credit assignment.
Treat systems constraints as algorithmic constraints. A theoretically attractive loss may be impractical if it requires full-vocabulary logits from many teachers, cross-tokenizer alignment, synchronous scoring, or dense teacher forward passes over long rollouts. A sampled-token objective may be less exact but better in practice if it keeps training scalable and stable.
For frontier-style post-training, a robust staged recipe is:
\[\text{Off-policy SFT} \rightarrow \text{RLVR or RLHF} \rightarrow \text{Specialist teacher training} \rightarrow \text{Support-overlap warmup} \rightarrow \text{OPD or MOPD} \rightarrow \text{Final RL / alignment} \rightarrow \text{Regression monitoring}\]
The final gate before deployment should ask whether the dense signal remained reward-aligned, whether OOD behavior improved or at least held steady, whether hallucinated privileged context stayed low, whether epistemic verbalization was preserved where needed, whether rollout freshness and masks were correct, and whether the provenance trail is sufficient to debug regressions.

Decision Guide for Choosing a Distillation Method

Selecting a distillation strategy is primarily a question of balancing teacher availability, desired robustness, engineering complexity, trajectory source, supervision density, and the nature of the available feedback. In practice, most successful post-training pipelines begin with simple off-policy methods and progressively introduce on-policy, self-distillation, multi-teacher, or RL-distillation techniques as the need for robustness, credit assignment, and capability consolidation increases.
The central decision rule is that distillation should be matched to the bottleneck. If simplicity and low cost are paramount, begin with off-policy sequence or trace distillation. If robustness to self-generated errors is critical, adopt on-policy distillation after the student has a strong enough cold start. If no external teacher is available, use self-distillation, but prefer gated, contrastive, reward-anchored, or relevance-masked variants when reasoning quality matters. If capabilities are distributed across specialists, use multi-teacher distillation or MOPD. If sparse rewards need dense corrective guidance, combine reinforcement learning with distillation. If training multi-turn agents with noisy privileged context, keep RL as the backbone and add gated self-distillation rather than applying uniform OPSD.
The safest default is to start simple and add adaptivity only when there is a demonstrated need.
The most practical recipes are staged rather than monolithic. Off-policy SFT or trace distillation establishes a competent student; token-level KD adds soft local teacher preferences when needed; RLVR or RLHF improves behavior under task rewards or preferences; OPD teaches the student from its own near-miss trajectories; self-distillation converts hints, feedback, or privileged context into dense supervision; MOPD consolidates specialist teachers; and RL-distillation hybrids combine task-grounded reward direction with token-level correction.
Systems constraints should be treated as part of the method choice. A method that requires full-vocabulary logits from many teachers may be mathematically attractive but impractical at scale. A sampled-token objective may be less information-rich but more robust when teacher-student support overlap is limited. A self-distillation objective may remove external teacher serving but introduce leakage and style-drift risks. The best method is therefore the one whose supervision signal is both useful and operationally reliable.
The final deployment gate should ask whether the dense signal remained reward-aligned, whether out-of-distribution behavior improved or held steady, whether hallucinated privileged context stayed low, whether epistemic verbalization was preserved where needed, whether specialist regressions were repaired rather than hidden by aggregate scores, and whether the provenance trail is sufficient to debug regressions.

Core Decision Axes

The first axis is trajectory source:
\[y \sim p_{\mathcal{D}}(\cdot \mid x) \qquad \text{or} \qquad \hat{y} \sim \pi_\theta(\cdot \mid x)\]
- Dataset, human, or teacher trajectories favor off-policy distillation. Student-generated trajectories favor OPD, OPSD, RL-distillation hybrids, and agentic training.
The second axis is teacher identity:
\[T \in \left\{ \text{external teacher}, \text{same-family expert}, \text{self-teacher}, \text{multi-teacher pool}, \text{feedback-conditioned teacher} \right\}\]
- External and same-family teachers are usually safer when they are reward-improved and close to the student. Self-teachers are cheaper but require stronger leakage and reward-alignment diagnostics.
The third axis is feedback density. Sparse rewards provide task grounding:
\[R(\tau)\]
- while distillation provides token-level feedback:
  \[A_t^D = \log \pi_T(a_t \mid s_t) - \log \pi_\theta(a_t \mid s_t)\]
- Hybrid methods are most useful when sparse rewards are reliable but too coarse for efficient credit assignment.
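As a minimal sketch of the token-level signal above, assume the teacher and student log-probabilities of each sampled token have already been gathered into plain lists. The function name and the example values are illustrative and do not come from any particular library.

```python
def distillation_advantages(teacher_logprobs, student_logprobs):
    """Per-token A_t^D = log pi_T(a_t | s_t) - log pi_theta(a_t | s_t)."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same sampled tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Hypothetical log-probabilities for a 4-token student rollout.
teacher = [-0.2, -1.5, -0.1, -3.0]
student = [-0.4, -0.5, -0.1, -1.0]
adv = distillation_advantages(teacher, student)
```

A positive entry means the teacher assigns the sampled token more probability than the student does; a negative entry means the teacher would have down-weighted it.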
The fourth axis is teacher quality. A teacher is useful for OPD when its log-ratio tracks reward:
\[\Delta_T(\tau) = \log \pi_T(\tau) - \log \pi_\theta(\tau)\]
- with:
  \[\operatorname{Corr} \left( \Delta_T(\tau), R(\tau) \right) > 0\]
- A dense teacher signal that does not track reward can make training faster while making the model worse.
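The check above can be sketched in a few lines, assuming one scalar reward per trajectory and per-token log-probabilities from both models on the same sampled tokens. The helper names are hypothetical, and Pearson correlation is only one possible choice of statistic.

```python
import math


def sequence_log_ratio(teacher_logprobs, student_logprobs):
    """Delta_T(tau): sum of per-token log-ratios over one trajectory."""
    return sum(t - s for t, s in zip(teacher_logprobs, student_logprobs))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # no variation: treat the signal as uninformative
    return cov / math.sqrt(vx * vy)


# Hypothetical batch: four rollouts with their log-ratios and binary rewards.
deltas = [1.0, 0.5, -0.5, -1.0]
rewards = [1.0, 1.0, 0.0, 0.0]
corr = pearson(deltas, rewards)
```

In this toy batch the teacher prefers exactly the rewarded rollouts, so the correlation is strongly positive. A value at or below zero on real rollouts would suggest the teacher is not a useful OPD signal for this student.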
The fifth axis is systems cost. Sequence targets are cheapest, sampled-token log-probabilities are moderate, top-\(k\) distributions are heavier, and full-vocabulary logits are usually the most expensive. The payload should be chosen from the loss and the engineering budget, not from the most theoretically complete formulation.
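To make the ordering concrete, the following back-of-the-envelope sketch estimates stored bytes per token position for each payload type. The vocabulary size, the value of k, and the byte widths are assumptions chosen for illustration, and the estimate covers storage only, not teacher compute or network transfer.

```python
def payload_bytes_per_token(kind, vocab_size=128_000, k=20,
                            value_bytes=2, index_bytes=4):
    """Rough stored size of teacher supervision for one token position."""
    if kind == "sequence":   # token id only
        return index_bytes
    if kind == "sampled":    # token id plus one teacher log-probability
        return index_bytes + value_bytes
    if kind == "top_k":      # k (token id, log-probability) pairs
        return k * (index_bytes + value_bytes)
    if kind == "full":       # one value per vocabulary entry
        return vocab_size * value_bytes
    raise ValueError(f"unknown payload kind: {kind}")


sizes = {kind: payload_bytes_per_token(kind)
         for kind in ("sequence", "sampled", "top_k", "full")}
```

Under these assumed settings the full-vocabulary payload is several orders of magnitude larger per position than the sampled-token payload, which is why the payload is usually chosen from the loss and the budget together.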

When to Choose Off-Policy Distillation

Choose off-policy distillation when you have access to a large corpus of high-quality human examples, teacher-generated outputs, verifier-filtered synthetic traces, historical logs, or curated domain data, and you want the simplest and most stable training setup. This is the right default when training should resemble ordinary SFT, when reproducibility matters, and when teacher inference can be performed once and reused across many experiments.
Off-policy distillation is especially appropriate when teacher outputs can be generated offline and amortized efficiently. A frontier teacher, specialist model, reward model, or verifier can be expensive to run, but once its completions, critiques, scores, or top-\(k\) probabilities are stored, the same dataset can train several students, support ablations, or serve as a stable baseline before more complex methods are introduced.
Off-policy distillation is also appropriate when the student is expected to remain close to the training distribution and strong recovery from self-generated mistakes is not the primary concern. For short-form instruction following, summarization, translation, classification, data-format adaptation, or well-filtered synthetic reasoning data, fixed trajectories may provide enough coverage to justify the simplicity.
Off-policy distillation is preferred when the main bottleneck is data quality rather than train-inference mismatch. If a team can invest in teacher generation, deduplication, contamination filtering, unit-test verification, answer checking, safety filtering, and mixture balancing, then trace distillation can provide strong returns without the added complexity of live rollout scoring.
Off-policy distillation is also the best cold-start stage before RL, OPD, OPSD, or MOPD. A weak student often produces poor on-policy rollouts, making teacher scoring expensive and uninformative. A strong off-policy baseline produces near-miss trajectories that later on-policy methods can correct more effectively.
Avoid relying only on off-policy distillation when the deployed model will enter states that the dataset does not cover. Long reasoning chains, coding repair loops, tool-use agents, and multi-turn environments often need on-policy correction because early mistakes create new states.

When to Choose Online or Semi-Online Distillation

Choose online distillation when the supervising signal should evolve during training rather than remain fixed. Peer learning, co-distillation, periodically refreshed teachers, and checkpoint ensembles can be useful when no single frozen teacher is obviously superior.
Online distillation is appropriate when different models or checkpoints have complementary strengths, and training benefits from exchanging predictions across shared batches.
Semi-online distillation is appropriate when a strong teacher should remain mostly stable but be refreshed periodically after new RL, SFT, or specialist-training stages. This is common in staged post-training recipes where intermediate checkpoints become teachers for later students.
Use online or semi-online distillation cautiously when stability is more important than adaptability. Moving teachers can create non-stationary targets, consensus errors, and feedback loops where models reinforce shared mistakes.

When to Choose On-Policy Distillation

Choose on-policy distillation when long reasoning chains, coding tasks, tool-use workflows, or agentic trajectories are sensitive to compounding errors. In these settings, an early mistake changes the future prefix distribution, and the student needs supervision on the states it actually visits rather than only on teacher-generated ideal traces.
OPD is appropriate when the student must learn how to recover from the exact mistakes it is likely to make during deployment. A teacher scoring student rollouts can provide local feedback on malformed tool calls, incorrect reasoning branches, wrong file edits, invalid API calls, or partially correct solutions, whereas off-policy traces may only show the ideal path.
OPD is preferred when dense token-level supervision is more useful than sparse scalar rewards. RLVR may tell the model that a solution failed, but OPD can identify which sampled tokens the teacher would have preferred under the same prefix. This is especially valuable when a reward is delayed, binary, or difficult to assign to individual decisions.
OPD is also useful when the goal is to transfer capabilities acquired through reinforcement learning into a smaller, cheaper, or more efficient model. A strong post-RL teacher can score rollouts generated by the student, allowing the student to absorb the teacher’s local preferences without requiring the teacher to generate all training trajectories.
OPD is appropriate when a teacher can score the student’s actual prefixes reliably:
\[p_T(\cdot \mid x,\hat{y}_{<t})\]
- where:
  \[\hat{y} \sim \pi_\theta(\cdot \mid x)\]
OPD is preferred when the teacher is reward-improved and locally compatible with the student. Strong same-family models and RL-trained expert teachers are often useful because they tend to remain close to the student while assigning higher probability to better trajectories.
OPD should usually be introduced gradually. A practical recipe begins with off-policy SFT or trace distillation, verifies that the student can produce meaningful rollouts, then mixes in on-policy examples through a GKD-style schedule. This reduces the risk that the teacher is asked to score degenerate or off-support prefixes too early.
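One way to picture such a schedule is a linear ramp on the fraction of on-policy examples. This is a sketch under assumptions: the ramp shape, the step counts, and the function names are hypothetical and are not taken from the GKD paper.

```python
import random


def on_policy_fraction(step, warmup_steps, ramp_steps, max_fraction=1.0):
    """Zero during warmup, then a linear ramp up to max_fraction."""
    if step < warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / max(ramp_steps, 1)
    return min(max_fraction, progress * max_fraction)


def choose_source(step, rng, warmup_steps=1000, ramp_steps=4000):
    """Pick where the next training example comes from at this step."""
    lam = on_policy_fraction(step, warmup_steps, ramp_steps)
    return "student_rollout" if rng.random() < lam else "fixed_dataset"


rng = random.Random(0)
early = choose_source(0, rng)       # always fixed data during warmup
late = choose_source(10_000, rng)   # always student rollouts after the ramp
```

Capping `max_fraction` below 1.0 would keep some fixed-dataset examples in every batch, which trades on-policy coverage for off-policy stability.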
OPD should be delayed if the student is too weak to produce useful rollouts. A weak student creates low-quality prefixes that may waste teacher-scoring compute and produce noisy updates. Start with SFT, trace distillation, or RL warmup before moving to OPD.
OPD should use support-overlap diagnostics. If teacher and student disagree over the local token menu, full-distribution matching can be noisy. In that case, sampled-token OPD, top-\(k\) local support matching, or a closer intermediate teacher may be safer.
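A simple support-overlap diagnostic is the fraction of top-k token ids shared by teacher and student at a given prefix. The sketch below assumes both models expose log-probabilities over the same vocabulary indexing; the function names and the toy five-token vocabulary are illustrative.

```python
def top_k_ids(logprobs, k):
    """Indices of the k highest log-probabilities."""
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    return set(order[:k])


def top_k_overlap(teacher_logprobs, student_logprobs, k):
    """Fraction of the teacher's top-k token ids also in the student's top-k."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must share one vocabulary")
    if not 0 < k <= len(teacher_logprobs):
        raise ValueError("k must be between 1 and the vocabulary size")
    shared = top_k_ids(teacher_logprobs, k) & top_k_ids(student_logprobs, k)
    return len(shared) / k


# Hypothetical next-token log-probabilities at one student-visited prefix.
teacher = [-0.1, -2.5, -3.0, -9.0, -9.5]
student = [-0.3, -8.0, -2.0, -1.9, -9.0]
overlap = top_k_overlap(teacher, student, k=3)
```

Averaging this quantity over student-visited prefixes gives a rough signal for when full-distribution matching is likely to be noisy and a sampled-token objective may be safer.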
OPD is less attractive when the student’s rollouts are mostly incoherent, when teacher scoring is too expensive, when teacher-student tokenization is incompatible, when rollout staleness is hard to control, or when support overlap between the student and teacher is poor. In such cases, warmup SFT, trace distillation, intermediate teachers, or sampled-token scoring may be safer than full-distribution OPD.
OPD should be evaluated with both task and rollout metrics. Track reward, in-distribution accuracy, out-of-distribution accuracy, teacher-student log-ratio, support overlap, response length, truncation rate, repetition rate, and rollout staleness.

When to Choose Self-Distillation

Choose self-distillation when external teacher models are unavailable, too expensive, too slow, operationally inconvenient, or insufficiently specialized for the task. The teacher signal is then derived from the model itself, such as an earlier checkpoint, an EMA copy, an ensemble of views, a privileged prompt, a retrieved skill, a runtime-error-conditioned view, or future user interaction.
Self-distillation is appropriate when the model already contains latent capability that can be unlocked using privileged information, hindsight context, textual feedback, tool outputs, or targeted hints. The model may be better at evaluating or repairing a response when shown extra information than it is at generating the response from scratch.
Self-distillation is useful when interaction traces, runtime errors, user corrections, verifier comments, environment observations, or tool outputs can be converted into rich internal supervision. A coding model can learn from compiler errors or failing tests; a conversational model can learn from user follow-up corrections; an agent can learn from GUI state changes or terminal outputs.
Self-distillation is preferred when continual or domain-specific adaptation is needed without maintaining a separate teacher infrastructure. This is especially relevant for enterprise-specific formats, private APIs, internal tools, local policies, or shifting user preferences where frontier teachers may not know the desired behavior.
Self-distillation should be selective when the target behavior is narrow. Relevance-masked self-distillation is preferred when only a small subset of tokens should change, because it prevents the model from learning incidental style differences between the student and privileged teacher views.
Self-distillation requires caution in reasoning models. Privileged teachers can become shorter, more confident, and less exploratory, which may suppress epistemic verbalization, forking, verification, backtracking, and hedging. For strong thinking models, use fixed teachers, relevance masks, contrastive controls, reward anchoring, and diagnostics for response length and uncertainty markers.
Avoid naive self-distillation when the teacher is merely the student conditioned on hidden feedback and there is no check that the teacher’s log-ratio tracks reward. In that setting, the model can learn to imitate the presence of feedback rather than the behavior implied by feedback.

When to Choose On-Policy Self-Distillation

Choose on-policy self-distillation when verified solutions, reference answers, runtime feedback, tool observations, retrieved skills, or privileged reasoning traces are available and can be shown to the teacher view but not to the deployed student view.
OPSD is appropriate when the model is better at evaluating correct solutions than generating them from scratch. In this setting, the student samples a rollout from the ordinary task prompt, while the self-teacher scores that rollout under privileged context. The student then learns from the difference between the ordinary and privileged views.
OPSD is useful when the benefits of on-policy learning and dense supervision are desired without relying on an external frontier teacher. Since the same model, or a checkpoint of it, supplies both student and teacher views, the infrastructure burden can be lower than external-teacher OPD, although teacher-context forward passes still add compute.
OPSD is most appropriate in domains where correctness signals are reliable, such as mathematical reasoning, coding, theorem-style tasks, tool-use environments, and interactive agents with clear feedback. However, it should be applied carefully when the student must preserve long-budget reasoning behaviors.
OPSD should use gates, masks, or contrastive controls when privileged context changes the teacher’s style. If a teacher that sees the answer assigns low probability to uncertainty markers such as “wait,” “maybe,” or “let me check,” the student may learn to suppress the very deliberative behaviors that support autonomous reasoning. RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) is relevant here because it subtracts generic hint-induced style shifts using a contrastive hint.
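A crude way to watch for this suppression is to count uncertainty markers before and after a self-distillation stage. The sketch below uses the three markers quoted above; the per-1,000-words normalization and the function name are assumptions, and a real diagnostic would likely need a broader, task-specific marker list.

```python
import re

MARKERS = ("wait", "maybe", "let me check")


def marker_rate(responses, markers=MARKERS):
    """Marker occurrences per 1,000 whitespace-separated words."""
    pattern = re.compile(
        "|".join(r"\b" + re.escape(m) + r"\b" for m in markers),
        re.IGNORECASE,
    )
    hits = sum(len(pattern.findall(r)) for r in responses)
    words = sum(len(r.split()) for r in responses)
    return 1000.0 * hits / words if words else 0.0


# Hypothetical responses sampled before and after a distillation stage.
before = ["Wait, maybe the sign is wrong. Let me check the base case."]
after = ["The answer is 4."]
rate_before = marker_rate(before)
rate_after = marker_rate(after)
```

A sharp drop in this rate alongside shorter responses is a hint, not proof, that the student is inheriting the privileged teacher's style rather than its task knowledge.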
OPSD is less appropriate when privileged information would leak the answer too directly, when the deployed student cannot access comparable information, when response shortening harms generalization, or when the teacher view rewards confidence rather than correctness.
A practical OPSD run should be blocked or down-weighted when:
\[\operatorname{Corr} \left( \Delta_T(\tau), R(\tau) \right) \leq 0\]
- or when hallucinated privileged context increases while out-of-distribution accuracy falls.
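The blocking rule can be written as a small gate on the self-distillation loss weight. This is a sketch with hypothetical names; it implements only the hard-block case, and the inputs are assumed to be measured over a recent evaluation window.

```python
def opsd_weight(corr_delta_reward, halluc_rate_delta, ood_acc_delta,
                base_weight=1.0):
    """Return the self-distillation loss weight for the next interval.

    corr_delta_reward: Corr(Delta_T, R) over recent rollouts.
    halluc_rate_delta: change in hallucinated-privileged-context rate.
    ood_acc_delta:     change in out-of-distribution accuracy.
    """
    if corr_delta_reward <= 0.0:
        return 0.0  # teacher log-ratio no longer tracks reward
    if halluc_rate_delta > 0.0 and ood_acc_delta < 0.0:
        return 0.0  # leakage is rising while OOD accuracy falls
    return base_weight


healthy = opsd_weight(0.3, 0.0, 0.01)
misaligned = opsd_weight(-0.1, 0.0, 0.0)
leaking = opsd_weight(0.3, 0.02, -0.03)
```

A softer variant could scale `base_weight` down instead of returning zero, which corresponds to the down-weighting option rather than a full block.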

When to Choose Privileged Exploration or Context Optimization Instead of Direct Distillation

Some privileged information is useful for training but unsafe to distill directly into weights. If the privileged context mainly helps the model explore, obtain non-zero rewards, or construct a better temporary context, it may be better to keep that information in the rollout or prompt channel rather than make the student imitate it unconditionally.
POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration by Qu et al. (2026) uses partial oracle solution prefixes to guide RL exploration on hard problems, helping the policy obtain reward-bearing rollouts without treating the oracle solution as a direct distillation target.
Learning, Fast and Slow: Towards LLMs That Adapt Continually by Tiwari et al. (2026) separates slow parameter updates from fast textual-context updates, allowing task-specific information to live in optimized context while keeping parameters closer to the base model.
This path is preferred when the privileged signal is useful but likely to cause leakage if absorbed into weights. Examples include reference solutions, future feedback, task-specific hints, partial oracle paths, temporary tool observations, and local workspace state.
The practical rule is to ask whether the information should become a durable model behavior or remain a temporary scaffold. Durable behavior belongs in weights. Temporary or instance-specific information belongs in context, retrieval, memory, or exploration scaffolding.

When to Choose Multi-Teacher Distillation

Choose multi-teacher distillation when different models or checkpoints specialize in complementary capabilities such as reasoning, coding, instruction following, alignment, safety, long context, tool use, software engineering, or agentic planning. A single teacher may not be best across all domains, and forcing one teacher to define the entire student behavior can erase useful specialization.
Multi-teacher distillation is appropriate when sequential post-training has introduced regressions that must be repaired. A later checkpoint may improve code but hurt chat, improve tool use but hurt safety, or improve long reasoning but hurt instruction adherence. Earlier or domain-specialist checkpoints can be retained as teachers to recover those capabilities.
Choose multi-teacher distillation when the goal is to consolidate several specialized models into a single deployable student. This avoids deploying many separate checkpoints and lets one model internalize the strengths of multiple teachers.
Multi-teacher distillation is appropriate when each teacher has a clear domain of reliability and the system can route examples to the right teacher:
\[r(x,\hat{y}) \rightarrow T_j\]
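The routing map can be sketched as a lookup from a domain label to a teacher name. Everything here is illustrative: `toy_classify` is a hypothetical stand-in for a real domain classifier or dataset metadata tag, and the teacher names are placeholders.

```python
def route_teacher(prompt, rollout, classify, teachers, fallback):
    """r(x, y_hat) -> T_j: pick a teacher name for one student rollout."""
    domain = classify(prompt, rollout)
    return teachers.get(domain, fallback)


def toy_classify(prompt, rollout):
    # Hypothetical keyword heuristic; a real router would be learned or tagged.
    text = f"{prompt}\n{rollout}"
    if "def " in text or "import " in text:
        return "code"
    if "=" in text:
        return "math"
    return "other"


teachers = {"code": "code_teacher", "math": "math_teacher"}
picked_code = route_teacher("Write an adder.", "def add(a, b):\n    return a + b",
                            toy_classify, teachers, "general_teacher")
picked_math = route_teacher("Solve 2x = 6.", "x = 3",
                            toy_classify, teachers, "general_teacher")
picked_other = route_teacher("Say hello.", "Hello there!",
                             toy_classify, teachers, "general_teacher")
```

Keeping an explicit fallback teacher makes routing failures visible in per-teacher coverage statistics instead of silently dropping examples.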
MOPD is preferred over trace-distillation SFT when the student is strong enough to produce meaningful rollouts and teacher feedback on student-visited states is more valuable than teacher-generated traces. This is especially relevant for long-horizon reasoning, coding, and agentic tasks where the student’s own mistakes determine the future state distribution.
Trace-distillation SFT is preferred over MOPD when the student is too weak for useful rollouts, when teacher traces are easy to verify, or when the organization does not yet have the infrastructure for teacher scoring, routing, token alignment, and asynchronous rollout management.
Multi-teacher aggregation is appropriate when teacher disagreement is informative and can be calibrated. A weighted aggregation can be written as:
\[\log p_{\mathrm{agg}}(a_t \mid s_t) =\sum_j w_j(s_t) \log p_{T_j}(a_t \mid s_t)\]
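For a single sampled token, the aggregation is a weighted sum of teacher log-probabilities. The sketch below assumes the weights are non-negative and sum to one, which the formula itself does not require; the function name is hypothetical.

```python
import math


def aggregate_logprob(teacher_logprobs, weights):
    """log p_agg(a_t | s_t) = sum_j w_j(s_t) * log p_{T_j}(a_t | s_t)."""
    if len(teacher_logprobs) != len(weights):
        raise ValueError("need one weight per teacher")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("this sketch assumes non-negative weights summing to 1")
    return sum(w * lp for w, lp in zip(weights, teacher_logprobs))


# Two hypothetical teachers that disagree on the sampled token.
logprobs = [math.log(0.8), math.log(0.2)]
agg = aggregate_logprob(logprobs, [0.5, 0.5])
```

With equal weights the result is the log of the geometric mean, here log 0.4. A weighted sum of log-probabilities is generally not normalized over the vocabulary, so sampled-token scoring can use it directly while full-distribution matching would need a renormalization step.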
Multi-teacher distillation requires infrastructure that can route and serve multiple teachers efficiently. Teacher servers must be scheduled, batched, monitored, and fault-isolated. Teacher log-probabilities must be aligned and aggregated. Tokenizer compatibility is highly desirable because sampled-token MOPD depends on teachers assigning probabilities to the same token events as the student.
Use MOPD when routing, support overlap, and per-domain regression monitoring can be treated as first-class components. A strong teacher is not automatically useful on every student prefix.
Support overlap should guide whether to use full-distribution matching, top-\(k\) matching, or sampled-token scoring. If teachers are close forks of the same base model, full or top-\(k\) matching may be safe and information-rich. If teachers were trained with heterogeneous data, external-model traces, or different styles, sampled-token scoring and warmup SFT may be safer.
Avoid MOPD when teachers conflict heavily, when domain routing is unreliable, when tokenizers or chat templates are incompatible, or when teacher-serving cost prevents enough coverage for stable training.

When to Choose RL-Distillation Hybrids

Choose RL-distillation hybrids when sparse correctness rewards are available but insufficient on their own to provide fine-grained guidance. RL can tell the model that a rollout succeeded or failed, while distillation can help assign credit to the specific tokens, reasoning moves, tool calls, or action spans that produced the outcome.
RL-distillation hybrids are appropriate when reinforcement learning should determine update direction while distillation refines token-level update magnitudes. This separation is useful when verifiers are trusted but sparse, while teacher or self-teacher signals are informative but should not become the sole objective.
A generic hybrid objective is:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda_D \mathcal{L}_{D}(\theta)\]
- with \(\lambda_D\) chosen so that distillation helps credit assignment without overwhelming reward grounding.
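As a minimal numeric sketch of the objective above, the following combines two scalar losses and reports how much of the total the distillation term contributes. The share diagnostic and both function names are assumptions; in practice a gradient-norm ratio may be a more faithful measure than a loss-magnitude ratio.

```python
def hybrid_loss(rl_loss, distill_loss, lambda_d):
    """L = L_RL + lambda_D * L_D for scalar loss values."""
    if lambda_d < 0:
        raise ValueError("lambda_d should be non-negative")
    return rl_loss + lambda_d * distill_loss


def distill_share(rl_loss, distill_loss, lambda_d):
    """Fraction of total loss magnitude coming from the distillation term."""
    d = abs(lambda_d * distill_loss)
    total = abs(rl_loss) + d
    return d / total if total else 0.0


# Hypothetical per-batch loss values.
total = hybrid_loss(rl_loss=0.8, distill_loss=2.0, lambda_d=0.1)
share = distill_share(rl_loss=0.8, distill_loss=2.0, lambda_d=0.1)
```

Logging the share over training gives an early warning when the distillation term starts to dominate the reward-grounded term.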
Hybrids are appropriate for math, code, tool use, and agentic tasks where a verifier can judge success but cannot directly assign credit to each reasoning step or action.
Choose RL-distillation hybrids when hindsight information, textual critiques, runtime errors, verifier messages, user replies, tool outputs, or environment observations can be converted into dense supervision. A feedback-conditioned self-teacher can replay the student trajectory with access to richer context and provide local correction.
RL-distillation hybrids are also appropriate when the objective is to exceed teacher performance rather than merely imitate it. Reward-extrapolated OPD-style methods use the teacher as a reference while allowing rewards to amplify trajectories that outperform the teacher’s expected behavior.
Use RL-dominant hybrids when privileged or self-teacher guidance is useful but not trustworthy enough to define update direction. In this setting, reward or verifier labels decide whether the trajectory should be reinforced, while distillation shapes token-level magnitude.
Use sample routing when successful and failed samples require different treatments. Correct samples can be reinforced with GRPO-style RL, while failed samples can be replayed under feedback-conditioned self-distillation for correction.
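Sample routing can be sketched as a split on the reward. The threshold, the dictionary fields, and the function name are hypothetical, and the sketch assumes a verifier that returns 1.0 for success.

```python
def route_samples(samples, success_threshold=1.0):
    """Split rollouts into an RL batch and a self-distillation batch."""
    rl_batch, sd_batch = [], []
    for sample in samples:
        if sample["reward"] >= success_threshold:
            rl_batch.append(sample)   # reinforce with GRPO-style RL
        else:
            sd_batch.append(sample)   # replay under feedback-conditioned SD
    return rl_batch, sd_batch


# Hypothetical rollouts with verifier rewards and textual feedback.
samples = [
    {"id": 0, "reward": 1.0, "feedback": None},
    {"id": 1, "reward": 0.0, "feedback": "test_parse failed: KeyError 'name'"},
    {"id": 2, "reward": 0.0, "feedback": "timeout after 30s"},
]
rl_batch, sd_batch = route_samples(samples)
```

Because the split depends entirely on the reward, a noisy verifier sends samples to the wrong branch, so reward reliability should be checked before relying on this pattern.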
Use SDAR-style gated auxiliary self-distillation when training multi-turn agents, especially when the environment supplies sparse trajectory-level rewards but privileged skills, retrieved context, user feedback, or hindsight observations can provide dense token-level guidance. In this setting, RL should remain the backbone, while self-distillation acts as a bounded auxiliary signal.
Use contrastive self-distillation when privileged context induces style drift. Correct hints and wrong hints can be contrasted so that generic hint-induced confidence, brevity, or formatting changes are subtracted, leaving a cleaner task-bearing token signal.
Avoid hybrids when the reward is too noisy to anchor direction, when verifiers are easy to exploit, when the dense teacher signal conflicts with reward, when privileged context produces hallucinated-feedback artifacts, when teacher-student support overlap is poor, when token-level gates are unavailable, or when the dense distillation term overwhelms the task-grounded RL objective.

Recommended Practical Progression

For most real-world projects, begin with off-policy sequence-level or trace distillation to establish a stable baseline. This stage creates a competent student, standardizes formatting, transfers broad task behavior, and produces a model that can later generate meaningful on-policy rollouts.
\[\text{Off-policy SFT or Trace Distillation} \rightarrow \text{Stable Baseline}\]
Next, introduce token-level teacher probabilities when the teacher’s uncertainty or local preferences are important enough to justify the additional infrastructure. Depending on the divergence and payload budget, this may use full logits, top-\(k\) logits, or sampled-token log-probabilities.
\[\text{Trace Targets} \rightarrow \text{Soft Labels or Log-Probabilities}\]
Add RLVR or RLHF when task success cannot be fully captured by demonstrations. This stage uses trusted rewards, verifiers, or preference models to move the student toward behavior that static teacher traces may not fully specify.
\[\text{Stable Baseline} \rightarrow \text{RLVR / RLHF} \rightarrow \text{Reward-Improved Policy}\]
Transition gradually to on-policy distillation once the student is sufficiently capable. Early on-policy rollouts should be monitored for length, repetition, truncation, invalid tool calls, and teacher-student support overlap. GKD-style mixing can combine off-policy stability with on-policy robustness.
\[\text{Cold-Started Student} \rightarrow \text{Student Rollouts} \rightarrow \text{Teacher Scoring} \rightarrow \text{OPD}\]
Add self-distillation when privileged context or hindsight feedback becomes available. Use it especially for runtime errors, tool observations, verified answers, retrieved skills, future user corrections, or environment feedback. Apply relevance masks, contrastive controls, reward anchors, or gates when the privileged context may introduce style drift.
\[\text{Student Rollout} \rightarrow \text{Privileged Teacher View} \rightarrow \text{Masked or Gated Self-Distillation}\]
Apply multi-teacher distillation when capabilities are distributed across specialists or when sequential training introduces regressions. Use teacher routing, support-overlap diagnostics, warmup SFT, and multi-domain regression suites before relying on dense multi-teacher token updates.
\[\text{Specialist Teachers} \rightarrow \text{Routing and Warmup} \rightarrow \text{MOPD} \rightarrow \text{Regression Recovery}\]
Integrate RL-distillation hybrids when sparse rewards need dense token-level refinement. Let RLVR, GRPO, PPO, reward models, or verifiers anchor update direction, while teacher or self-teacher gaps provide token-level modulation.
\[\text{Reward-Grounded RL} \rightarrow \text{Dense Distillation Signal} \rightarrow \text{Gated Hybrid Update}\]
A practical frontier-style progression is therefore:
\[\text{Off-policy SFT} \rightarrow \text{Token-Level KD} \rightarrow \text{RLVR / RLHF} \rightarrow \text{OPD or OPSD} \rightarrow \text{Specialist Teacher Training} \rightarrow \text{MOPD} \rightarrow \text{Final Alignment and Regression Monitoring}\]
For frontier-style post-training where specialist capabilities must be consolidated, a second staged recipe is:
\[\text{Off-policy SFT} \rightarrow \text{RLVR or RLHF} \rightarrow \text{Specialist Teacher Training} \rightarrow \text{Support-Overlap Warmup} \rightarrow \text{OPD or MOPD} \rightarrow \text{Final RL / Alignment} \rightarrow \text{Regression Monitoring}\]
This staged ordering is a default, not a hard rule. The right entry point depends on the bottleneck: use off-policy distillation for stable broad transfer, online or semi-online distillation for adaptive teacher refresh, OPD for exposure-bias reduction on student rollouts, self-distillation for privileged feedback without external teachers, privileged exploration or context optimization when hidden information should guide learning without being absorbed directly into weights, MOPD for specialist capability consolidation, and RL-distillation hybrids for reward-grounded learning with dense token-level credit assignment.

Choosing by Constraint

If the constraint is teacher availability, use an external-teacher method when a strong teacher can be served reliably, and use self-distillation when external teachers are unavailable or too expensive. If the model already contains latent capability that can be unlocked by hints or feedback, self-distillation may be more practical than hosting a frontier teacher.
If the constraint is data quality, use off-policy trace distillation first. Invest in teacher generation, filtering, verifier checks, deduplication, and mixture balancing before adding on-policy complexity. A clean fixed corpus is often the fastest path to a robust baseline.
If the constraint is exposure bias, move toward OPD. Student-generated rollouts expose the model to the states it will actually visit, allowing teacher feedback to correct its own mistakes rather than only teaching ideal trajectories.
If the constraint is sparse credit assignment, use RL-distillation hybrids. Rewards or verifiers should determine whether the rollout was successful, while teacher or self-teacher gaps should refine token-level credit.
If the constraint is capability interference, use multi-teacher distillation. Train or select specialists independently, keep useful intermediate checkpoints, and consolidate them through routing, trace distillation, or MOPD.
If the constraint is support mismatch, avoid aggressive full-distribution matching. Prefer warmup SFT, sampled-token scoring, intermediate teachers, or trace distillation until teacher and student local token supports overlap more reliably. Shared tokenizer, chat template, and base model lineage reduce alignment errors and improve the chance that teacher probabilities are meaningful on student rollouts.
If the constraint is reasoning-style preservation, be cautious with privileged self-distillation. Monitor response length, entropy, uncertainty markers, verification markers, fork rates, backtracking, and out-of-distribution reasoning performance.
If the constraint is systems cost, choose the simplest payload that supports the desired loss. Sequence targets are cheapest, sampled-token log-probabilities are moderate, top-\(k\) distributions are heavier, and full-vocabulary logits are usually the most expensive. A sampled-token objective may be less exact than full KL, but practically better if it keeps training scalable and stable.
If the constraint is deployment reliability, require end-to-end provenance before scaling the method. Every rollout, teacher score, reward, mask, route, and loss branch should be traceable to a checkpoint and configuration. This is the difference between a debuggable training system and a system where regressions can only be guessed at.
If the constraint is regression sensitivity, treat evaluation as part of the training loop rather than an afterthought. Distillation is often used to preserve or consolidate capabilities, so every major run should be judged not only by target-domain gains but also by regressions across chat, safety, instruction following, long context, code, reasoning, and agentic behavior.

Comparative Failure Modes

Off-policy distillation fails through exposure bias, stale or low-quality traces, data contamination, over-imitation, and mixture imbalance.
Online distillation fails through non-stationary targets, consensus errors, synchronization overhead, and unstable teacher refreshes.
OPD fails through poor rollout quality, stale rollouts, support mismatch, repetition or truncation collapse, teacher-scoring cost, incompatible tokenization, and masking bugs.
OPSD fails through information leakage, style drift, hallucinated privileged context, response shortening, uncertainty suppression, moving-teacher feedback loops, and OOD fragility.
MOPD fails through teacher conflict, bad routing, incompatible tokenizers, support mismatch, high serving cost, insufficient calibration, and insufficient regression evaluation.
RL-distillation hybrids fail when dense token signals conflict with rewards, when privileged context rewards style rather than task progress, when gates and loss weights are miscalibrated, or when sample routing is based on noisy rewards.
The most important diagnostic before adopting OPD or MOPD is whether teacher feedback is meaningful on student-visited prefixes. A strong teacher is not automatically a good teacher for a particular student rollout. Support overlap, tokenizer compatibility, and reasoning-style compatibility should be checked before relying on dense token-level loss.
The most important diagnostic before adopting OPSD is whether privileged context changes task knowledge or merely changes style. If the privileged teacher becomes shorter, more confident, or less exploratory, the student may inherit the wrong behavior unless the loss is masked, gated, or contrastively cleaned.
The most important diagnostic before adopting RL-distillation hybrids is whether dense token-level feedback and sparse reward feedback agree. When they disagree, the reward should usually anchor the direction of learning, while distillation should act as a bounded auxiliary or modulation signal.

Minimum Diagnostics by Method

Off-policy distillation should track target accuracy, preserved-capability regressions, data quality, duplication, contamination, mixture proportions, and teacher-output quality.
OPD should track rollout quality, queue age, behavior-policy version, teacher-student support overlap, truncation rate, repetition rate, teacher-scoring coverage, and reward-log-ratio correlation.
Self-distillation should track hallucinated privileged context, feedback-mention rate when feedback is absent, epistemic-verbalization rate, response length, calibration, OOD accuracy, and correlation between self-teacher log-ratio and reward.
MOPD should track per-teacher coverage, teacher disagreement, router entropy, per-domain regressions, support overlap by teacher, teacher-specific reward lift, and teacher-serving failures.
RL-distillation hybrids should track reward, KL to reference policy, gradient norms, distillation weight, agreement between RL advantage and distillation advantage, OOD accuracy, hallucinated privileged context, and agent success.
A compact cross-method dashboard is:
\[\left\{ \operatorname{Reward}, \operatorname{Acc}_{IID}, \operatorname{Acc}_{OOD}, \operatorname{Corr}(\Delta_T,R), H_{\mathrm{priv}}, \operatorname{EV}, \operatorname{Overlap}_k, \operatorname{KL}_{ref}, \operatorname{Success}_{agent} \right\}\]
- where \(H_{\mathrm{priv}}\) measures hallucinated privileged context and \(\operatorname{EV}\) measures epistemic verbalization.
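The dashboard can be held in a small record and compared against a baseline run. This is a sketch under assumptions: the field names, the single absolute tolerance, and the choice to treat epistemic verbalization as higher-is-better are all illustrative, and the KL to the reference policy is stored but not compared because it is usually judged against a budget.

```python
from dataclasses import dataclass, asdict


@dataclass
class DistillDashboard:
    reward: float
    acc_iid: float
    acc_ood: float
    corr_delta_reward: float        # Corr(Delta_T, R)
    halluc_priv: float              # H_priv, lower is better
    epistemic_verbalization: float  # EV
    overlap_k: float
    kl_ref: float                   # tracked against a budget, not compared
    agent_success: float


HIGHER_IS_BETTER = ("reward", "acc_iid", "acc_ood", "corr_delta_reward",
                    "epistemic_verbalization", "overlap_k", "agent_success")
LOWER_IS_BETTER = ("halluc_priv",)


def flag_regressions(current, baseline, tolerance=0.01):
    """Names of metrics that moved in the wrong direction by > tolerance."""
    cur, base = asdict(current), asdict(baseline)
    flagged = [k for k in HIGHER_IS_BETTER if base[k] - cur[k] > tolerance]
    flagged += [k for k in LOWER_IS_BETTER if cur[k] - base[k] > tolerance]
    return flagged


# Hypothetical baseline and candidate runs.
baseline = DistillDashboard(0.70, 0.80, 0.60, 0.30, 0.02, 0.15, 0.75, 0.05, 0.40)
current = DistillDashboard(0.74, 0.83, 0.55, 0.30, 0.05, 0.15, 0.75, 0.08, 0.40)
flagged = flag_regressions(current, baseline)
```

In this toy comparison the candidate improves reward and in-distribution accuracy but is flagged for lower out-of-distribution accuracy and higher hallucinated privileged context, which is the pattern an aggregate score alone would hide.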
The most important deployment rule is to evaluate regressions as aggressively as improvements. Distillation is often used to compress, preserve, or consolidate capabilities, so a successful method should improve the target domain without silently degrading chat quality, safety, instruction following, long-context behavior, reasoning robustness, tool-use hygiene, or agentic reliability.
The final gate before shipping a distillation run should combine reward alignment, OOD behavior, privileged-context leakage, epistemic-verbalization preservation, specialist-regression recovery, and provenance quality. A method should not be considered successful if it improves an aggregate score while hiding domain regressions, hallucinated feedback, degraded reasoning style, or an unauditable training path.
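The point about aggregate scores hiding domain regressions can be illustrated with a small per-domain gate. This is a sketch under assumptions: the domain names, scores, and the `tolerance` threshold are hypothetical, and a real gate would also cover the leakage, verbalization, and provenance checks listed above.

```python
def domain_regressions(baseline, candidate, tolerance=0.02):
    """Return domains where the candidate falls more than `tolerance` below baseline.

    A domain missing from `candidate` is reported with value None, since an
    unevaluated domain cannot be shown to be preserved.
    """
    regressions = {}
    for domain, base_score in baseline.items():
        if domain not in candidate:
            regressions[domain] = None
            continue
        delta = candidate[domain] - base_score
        if delta < -tolerance:
            regressions[domain] = delta
    return regressions


def ship_gate(baseline, candidate, tolerance=0.02):
    """Pass only if no tracked domain regresses beyond tolerance."""
    regressions = domain_regressions(baseline, candidate, tolerance)
    return len(regressions) == 0, regressions


# Hypothetical scores: the mean improves, yet one domain regresses.
baseline = {"math": 0.60, "chat": 0.80, "safety": 0.95}
candidate = {"math": 0.75, "chat": 0.79, "safety": 0.90}
passed, regressions = ship_gate(baseline, candidate)
```

In this toy example the candidate's mean score is higher than the baseline's, but the gate still fails because the `safety` score drops by more than the tolerance.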

References

Systems, tooling, and infrastructure

Blogs and implementation guides

Twitter / X threads

Multi-teacher on-policy distillation discussion by Cameron R. Wolfe; follow-up thread on multi-teacher on-policy distillation and post-training
On-policy distillation announcement thread by Kevin Lu
Rishabh Agarwal X threads by Rishabh Agarwal; thinking-model self-distillation thread; self-distillation does not work for thinking models YET
Why On-Policy Distillation Works and Naive Self-Distillation Doesn’t
On-policy distillation interpretation and stabilization discussion by Zhuo Kai
Targeted on-policy self-distillation lecture clip by Sasha Rush
Frontier post-training recipe review announcement by Nathan Lambert
On-policy distill for LLMs typically works best with reverse KL by Rishabh Agarwal
I think there’s some confusion about what on-policy distillation (OPD) loss actually optimizes by Rishabh Agarwal
Dwarkesh Patel’s recorded discussion with Sasha Rush

Talks, videos, and interviews

The Magic of LLM Distillation by Rishabh Agarwal; timestamped link
The State of Frontier Post-Training Recipes: Conversation with Finbarr Timbers

Broader LLM training context

Scaling Laws for Neural Language Models by Kaplan et al. (2020)
Language Models are Few-Shot Learners by Brown et al. (2020)

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledKnowledgeDistillation,
  title   = {Knowledge Distillation},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}