Primers • Knowledge Distillation
- Overview
- Definition
- Classical Distillation Families
- Offline, Online, and Semi-Online Distillation
- On-Policy and Off-Policy Distillation
- Relationship Between Offline/Online and Off-Policy/On-Policy Distillation
- Off-Policy and On-Policy Distillation for Autoregressive LLMs
- Support Overlap and Locality
- Divergence Choice: Forward KL, Reverse KL, and JSD
- Self-Distillation and On-Policy Self-Distillation
- Thinking-Model Caveats for Privileged Self-Distillation
- Distillation as Synthetic Data and Post-Training Infrastructure
- Distillation and Reinforcement Learning
- Multi-Domain Post-Training and Capability Consolidation
- Implementation View
- Primer Roadmap
- Foundations
- Teacher-Student Formulation
- Temperature Scaling and Soft Targets
- Token-Level Distillation in Autoregressive Models
- Divergence Choices and Their Effects
- Supervised Distillation and Sequence-Level Distillation
- Representation and Intermediate-Layer Distillation
- Distillation as Synthetic Data Generation
- Post-Training Recipes of Recent LLMs
- Distillation Inside Frontier Post-Training Recipes
- Cascade RL and Multi-Domain OPD as a Foundation
- Limitations of Classical Distillation
- Implementation Considerations
- Offline Distillation
- Definition
- Relationship to Off-Policy and On-Policy Distillation
- Common Forms
- Trace-Distillation SFT
- Offline Distillation in Synthetic SFT Pipelines
- Offline Distillation versus On-Policy Consolidation
- Advantages
- Limitations
- Semi-Online Variants
- Offline Distillation in Modern LLM Training
- Implementation Pattern
- Practical Use Cases
- When Offline Distillation is Preferred
- Online Distillation
- Definition
- Relationship to Offline, Off-Policy, and On-Policy Distillation
- Types
- Advantages
- Limitations
- Semi-Online and Hybrid Approaches
- Online Distillation in Modern LLM Training
- Teacher Refresh and Support Overlap
- Online Distillation versus MOPD
- Implementation Pattern
- Systems Design Considerations
- Online Distillation in the Broader Distillation Taxonomy
- Off-Policy Distillation
- Definition and Formal Objective
- Sources of Off-Policy Data
- Sequence-Level Distillation
- Trace-Distillation SFT
- Logit Distillation
- Rejection-Sampling SFT
- Synthetic Data Pipelines
- Off-Policy Distillation in Recent Post-Training Recipes
- Advantages
- Limitations: Distribution Mismatch
- Behavioral Consequences
- Filtering, Verification, and Data Quality
- Relationship to Reinforcement Learning
- Engineering and Systems Considerations
- When Off-Policy Distillation is Preferred
- Practical Recipe Pattern
- On-Policy Distillation (OPD)
- Core Idea and Formal Objective
- OPD as Supervised Learning over Student-Visited States
- Intuition: Learning from One’s Own Mistakes
- Generalized Knowledge Distillation (GKD)
- Choice of Divergence and Reward Interpretation
- On-Policy Distillation and Reinforcement Learning
- Reinforcement Learning via Self-Distillation
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
- Self-Distilled RLVR
- Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
- OpenClaw-RL: Train Any Agent Simply by Talking
- Multi-Domain and Multi-Teacher OPD
- Practical Failure Modes and Stabilization Recipes
- Privileged On-Policy Self-Distillation Caveats
- Practical Training Loop
- Systems and Infrastructure Considerations
- Relationship to Recent LLM Post-Training Recipes
- When On-Policy Distillation is Preferred
- Self-Distillation (SD)
- Core Formulation
- Temporal Self-Distillation
- Ensemble and Multi-View Self-Distillation
- Contextual Self-Distillation
- On-Policy Self-Distillation (OPSD)
- Relevance-Masked Self-Distillation
- Contrastive On-Policy Self-Distillation
- Self-Distillation as Reinforcement Learning
- Self-Distilled Agentic Reinforcement Learning
- Reinforcement Learning via Self-Distillation
- Self-Distilled RLVR
- Aligning Language Models from User Interactions
- Self-Distillation in Agentic Systems
- Failure Modes in Self-Distillation
- Fixed, Moving, and Gated Self-Teachers
- Advantages
- Limitations
- When Self-Distillation is Preferred
- Multi-Teacher Distillation
- Definition
- Why Multiple Teachers?
- Classical Multi-Teacher Distillation
- Teacher Weighting and Routing
- Multi-Teacher Distillation versus Mixture-of-Experts
- Multi-Teacher On-Policy Distillation (MOPD)
- Multi-Teacher OPD in Practice
- Sampled-Token MOPD
- Full-Distribution and Top-\(k\) Multi-Teacher Matching
- Support Overlap
- Teacher Conflict
- Multi-Domain OPD in Cascade RL
- MOPD in Nemotron 3 Ultra
- DeepSeek-Style Specialist Consolidation
- MOPD versus Trace-Distillation SFT
- Teacher Construction
- Aggregating Teachers
- Teacher Agreement and Calibration
- Regression Recovery and Capability Preservation
- Practical MOPD Training Loop
- Engineering and Systems Design
- Advantages of Multi-Teacher Distillation
- Limitations
- Design Rules
- When Multi-Teacher Distillation is Preferred
- Reinforcement Learning-Distillation Hybrids
- Why RL and Distillation Are Converging
- Hybrid Template: RL Backbone plus Distillation Auxiliary
- Hybrid Template: Distillation Modulated by Rewards
- Hybrid Template: Sample Routing
- Hybrid Template: Feedback-Conditioned Self-Distillation
- Reinforcement Learning via Self-Distillation
- ExOPD: Learning Beyond the Teacher
- REOPOLD: Relaxed OPD for Stable Reasoning
- Self-Distilled RLVR
- SRPO: Sample Routing between GRPO and Self-Distillation
- RLCSD: Contrastive Self-Distillation inside RLVR
- SDAR: Gated Self-Distillation for Multi-Turn Agents
- OpenClaw-RL and Agent Interaction Feedback
- Cascade RL and Distillation as Recipe Stabilization
- Nemotron 3 Ultra and MOPD after Unified RLVR
- Environment Feedback and Scientific RL
- Sparse Rewards versus Dense Distillation
- Direction, Magnitude, and Gating
- Failure Modes
- Implementation Pattern
- When RL-Distillation Hybrids Are Preferred
- Comparative Analysis
- Implementation Patterns
- Canonical Distillation Dataflow
- Off-Policy Pipeline Architecture
- On-Policy Pipeline Architecture
- Generation Buffers and Asynchronous Execution
- Teacher Scoring Infrastructure
- Log-Probability Payload Design
- Full-Vocabulary Versus Top-\(k\) Approximation
- Tokenizer Compatibility and Alignment
- Stabilization Mechanisms
- Multi-Teacher Routing Infrastructure
- Self-Distillation Systems
- RL-Distillation Hybrid Systems
- Evaluation and Regression Monitoring
- Practical Design Defaults
- Decision Guide for Choosing a Distillation Method
- References
- Foundational distillation papers
- On-policy distillation and generalizations
- Self-distillation and privileged supervision
- Multi-teacher and capability consolidation
- RL-distillation hybrids
- Recent post-training recipe reports and commentary
- Synthetic data and RLHF references
- Imitation learning and exposure bias
- Systems, tooling, and infrastructure
- Blogs and implementation guides
- Twitter / X threads
- Talks, videos, and interviews
- Broader LLM training context
- Citation
Overview
Definition
-
Distillation is a training paradigm in which a student model is optimized to reproduce useful behavior from a teacher model, usually to obtain a model that is cheaper, faster, smaller, easier to deploy, or more specialized than the teacher. The canonical formulation was popularized in Distilling the Knowledge in a Neural Network by Hinton et al. (2015), which showed that a student can learn from the teacher’s softened output probabilities rather than only from hard labels.
-
In modern LLMs, distillation should also be understood as a deployment and capability-transfer strategy, not only as a compression trick. Larger models move the capability frontier upward, while distillation attempts to move the resulting capability back down the cost frontier for practical inference, high-volume serving, edge deployment, or domain specialization. The Magic of LLM Distillation by Rishabh Agarwal (Google DeepMind) frames this cost-performance tradeoff as one of the central reasons distillation remains important in modern post-training.
-
At a high level, distillation replaces or augments ordinary supervised learning with a matching objective between teacher and student distributions. For a teacher distribution \(p_T\) and student distribution \(p_S^\theta\) over labels or next tokens, a standard token-level objective is:
\[\mathcal{L}_{KD}(\theta) =\mathbb{E}_{(x,y)} \left[ \sum_{t=1}^{|y|} D\left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x, y_{<t}) \right) \right]\]
- where the sum runs over token positions \(t\) of the sequence \(y\), and \(D\) is usually forward KL, reverse KL, Jensen-Shannon divergence, cross-entropy on sampled teacher outputs, or a task-specific hybrid. In language models, the conditioning context includes the prompt \(x\) and the partial output \(y_{<t}\), so distillation is fundamentally about matching next-token behavior under particular trajectories.
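As a rough illustration of the token-level objective, the following sketch sums forward KL over the positions of one sequence. It assumes the teacher and student share a vocabulary and expose per-position logits; the function names (`softmax`, `forward_kl`, `token_level_kd_loss`) and the toy numbers are hypothetical and not taken from any cited paper or library.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p_teacher, p_student):
    """D_KL(p_T || p_S) for two distributions over the same vocabulary."""
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


def token_level_kd_loss(teacher_logits, student_logits):
    """Sum forward KL over positions.

    Each argument is a list with one logit vector per token position.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must score the same positions")
    return sum(
        forward_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    )


# Toy sequence: two positions, three-token vocabulary (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
loss = token_level_kd_loss(teacher, student)
```

Swapping `forward_kl` for another divergence changes only the inner function; the loop over positions stays the same.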
Classical Distillation Families
-
Classical distillation has several major families. Logit or soft-label distillation matches the teacher’s probability distribution directly, often with temperature scaling. Sequence-level distillation trains on full outputs generated by the teacher, as introduced for neural machine translation in Sequence-Level Knowledge Distillation by Kim and Rush (2016), where teacher-generated translations serve as simplified targets for the student. Representation distillation matches hidden states, attention maps, embeddings, or intermediate features, which is common in encoder models such as DistilBERT by Sanh et al. (2019), which combines language-modeling, distillation, and cosine-distance losses to compress BERT.
-
A practical modern distinction is that sequence-level KD makes the student imitate teacher-produced trajectories, while on-policy distillation makes the student produce the trajectory and then asks a teacher to rescore the student’s actual prefixes. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) formalizes this as Generalized Knowledge Distillation, and the tutorial-style explanation in "On-policy distill for LLMs typically works best with reverse KL" emphasizes that the student’s own trajectory is the object being corrected rather than replaced.
Offline, Online, and Semi-Online Distillation
-
Distillation can also be categorized by whether the teacher is fixed or co-trained. Offline distillation uses a pretrained, frozen teacher and trains a student from stored or live teacher outputs; this is the standard teacher-student setting used by most classical KD systems. Online distillation trains multiple students, peers, or teacher-like supervisors simultaneously, so the teaching signal evolves during training rather than coming from a fixed teacher. Deep Mutual Learning by Zhang et al. (2017) is a canonical online distillation method in which peer networks learn collaboratively and teach each other throughout training. Co-distillation and online mutual learning therefore differ from ordinary offline KD not because they necessarily change the loss, but because the teacher distribution is non-stationary and coupled to the student’s optimization.
-
Semi-online distillation sits between offline and online regimes. One common form keeps a strong pretrained teacher but periodically adapts or updates an auxiliary teacher, supervisor, or student ensemble. Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) studies the empirical performance gap between offline and online distillation and attributes much of the benefit of online methods to reversed student-to-teacher transfer rather than only to simultaneous training.
On-Policy and Off-Policy Distillation
-
On-policy and off-policy distillation classify methods according to the source of the trajectories on which the distillation loss is computed, rather than according to whether the teacher is frozen or co-trained. This distinction is especially important for autoregressive language models, where each token changes the future contexts that the model will encounter during generation.
-
In off-policy distillation, the student is trained on sequences generated by an external source, such as a human-labeled dataset, teacher-generated completions, or another model’s rollouts. The student does not determine the contexts on which it is supervised. Classical supervised knowledge distillation, sequence-level distillation, and most synthetic-data pipelines fall into this category. Sequence-Level Knowledge Distillation by Kim and Rush (2016) and standard supervised KD as described in DistilBERT by Sanh et al. (2019) are canonical off-policy approaches.
-
In on-policy distillation, the student first samples its own trajectories and then receives dense teacher supervision along those exact rollouts. The training data distribution therefore evolves with the student. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) formalizes this idea as Generalized Knowledge Distillation (GKD), which treats distillation as an imitation-learning problem and trains on student-generated sequences rather than only fixed datasets.
-
The core GKD procedure samples a student trajectory with probability \(\lambda\) and otherwise falls back to dataset trajectories, then minimizes a divergence between teacher and student token distributions over the resulting sequences. This interpolates continuously between purely off-policy and purely on-policy training.
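The mixing step described above can be sketched as a single coin flip per training example. This is a minimal illustration, assuming a hypothetical `sample_student` callable that returns a completion for a prompt; it is not the reference GKD implementation.

```python
import random


def choose_training_sequence(prompt, dataset_completion, sample_student, lam, rng):
    """Pick the sequence that will be scored by the teacher.

    With probability `lam` the student's own sample is used (on-policy);
    otherwise the fixed dataset completion is used (off-policy).
    """
    if rng.random() < lam:
        return sample_student(prompt), "student"
    return dataset_completion, "dataset"


def fake_student(prompt):
    # Hypothetical stand-in for a real sampling call.
    return prompt + " -> student draft"


rng = random.Random(0)
sources = [
    choose_training_sequence("q", "reference answer", fake_student, 0.5, rng)[1]
    for _ in range(1000)
]
student_fraction = sources.count("student") / len(sources)
```

Setting `lam` to 0 recovers purely off-policy training on dataset sequences, and setting it to 1 trains only on student samples.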
-
Formally, the on-policy objective can be written as:
\[\mathcal{L}_{OPD}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ D\left( p_T \Vert p_S^\theta \right)(\hat{y}, x) \right]\]
-
The following expansion makes the shorthand \(D(p_T \Vert p_S^\theta)(\hat{y},x)\) explicit as a sum of token-level divergences along the student-generated rollout \(\hat{y}\):
\[\mathcal{L}_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
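The expanded objective can be read as a two-step procedure: sample a rollout from the student, then sum a per-prefix divergence along that rollout. The sketch below uses forward KL and toy models that ignore their inputs; `toy_student`, `toy_teacher`, and the `EOS` id are hypothetical stand-ins, and the rollout is treated as fixed data once sampled.

```python
import math
import random

EOS = 0  # hypothetical end-of-sequence token id


def toy_student(prompt, prefix):
    """Hypothetical student: next-token distribution over a 3-token vocabulary."""
    return [0.2, 0.5, 0.3]


def toy_teacher(prompt, prefix):
    """Hypothetical teacher over the same vocabulary."""
    return [0.1, 0.2, 0.7]


def sample_rollout(student_probs, prompt, max_len, rng):
    """Sample tokens from the student until EOS or max_len."""
    prefix = []
    for _ in range(max_len):
        probs = student_probs(prompt, prefix)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        prefix.append(token)
        if token == EOS:
            break
    return prefix


def kl(p, q):
    """D_KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def opd_loss_for_rollout(teacher_probs, student_probs, prompt, rollout):
    """Sum teacher-vs-student divergence over every student-visited prefix."""
    total = 0.0
    for t in range(len(rollout)):
        prefix = rollout[:t]  # the state the student actually visited
        total += kl(teacher_probs(prompt, prefix), student_probs(prompt, prefix))
    return total


rng = random.Random(0)
rollout = sample_rollout(toy_student, "prompt", 8, rng)
loss = opd_loss_for_rollout(toy_teacher, toy_student, "prompt", rollout)
```

In a real system the two callables would be model forward passes, and the teacher would be queried only on the prefixes the student produced.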
-
The key advantage is that the student is trained in the same types of contexts it will encounter during inference, mitigating exposure bias and compounding errors. Thinking Machines Blog: On-Policy Distillation describes this as combining the on-policy relevance of reinforcement learning with the dense per-token supervision of distillation.
-
A useful intuition is provided by the chess analogy from Thinking Machines Blog: On-Policy Distillation. Off-policy distillation is like watching grandmaster games, where the learner observes strong moves only in expert-visited positions, while on-policy distillation is like having an engine annotate every move in the learner’s own games, identifying precisely which moves were strong and which were errors.
-
A complementary intuition is the targeted correction view in Dwarkesh Patel’s recorded discussion with Sasha Rush: when a rollout contains a localized mistake, such as an invalid tool call, OPD can discourage the specific mistaken action instead of spreading a sparse final reward over the whole trajectory.
-
From an implementation perspective, off-policy distillation is simpler because teacher outputs can be precomputed and reused. On-policy distillation is more computationally demanding because student rollouts and teacher evaluations must be generated repeatedly during training. However, it often delivers superior performance on long-horizon reasoning tasks because it teaches the student to recover from its own mistakes rather than only imitate ideal trajectories. Distilling 100B+ Models 40x Faster with TRL demonstrates practical infrastructure for large-scale OPD, including generation buffers, batched teacher queries, and compressed binary log-probability transfer to make 100B+ teachers tractable.
-
A subtle but important mathematical point is that most practical OPD implementations follow the GKD-style supervised objective over sampled student prefixes, rather than differentiating through the full student sampling distribution as a sequence-level reverse-KL policy-gradient objective. The post "I think there’s some confusion about what on-policy distillation (OPD) loss actually optimizes" distinguishes this common OPD objective from MiniLLM: Knowledge Distillation of Large Language Models by Gu et al. (2023), which derives a reverse-KL approach closer to policy-gradient optimization.
-
A stop-gradient view of the common implementation is:
\[\nabla_\theta \mathcal{L}_{OPD} \approx \mathbb{E}_{\hat{y} \sim \operatorname{sg}(p_S^\theta)} \left[ \nabla_\theta \sum_t D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(\operatorname{sg}\) denotes stop-gradient through the sampling process. This makes standard OPD closer to DAgger-like supervised learning over student-visited states than to full policy-gradient optimization through the student’s sequence distribution.
-
On-policy and off-policy are orthogonal to the offline/online distinction. A frozen teacher can be used in either regime, and multiple co-trained peers can still operate off-policy if they supervise each other on fixed datasets. In practice:
- Offline + Off-Policy: Classical teacher-student distillation with precomputed teacher outputs.
- Offline + On-Policy: Modern OPD with a frozen teacher scoring student rollouts.
- Online + Off-Policy: Deep Mutual Learning on shared minibatches.
- Online + On-Policy: Co-trained models supervising one another on their own generated trajectories.
-
This separation is conceptually important because most recent advances in LLM post-training, including Generalized Knowledge Distillation, On-Policy Self-Distillation, and Multi-Teacher On-Policy Distillation, use frozen teachers and are therefore offline in teacher update pattern while simultaneously on-policy in trajectory generation.
Relationship Between Offline/Online and Off-Policy/On-Policy Distillation
-
Offline and online distillation are related to, but distinct from, off-policy and on-policy distillation. Offline versus online describes the training-time relationship between teacher and student: frozen teacher versus concurrently trained teacher or peers. Off-policy versus on-policy describes the trajectory source: external data or teacher trajectories versus student-generated rollouts.
-
Thus, classical offline KD is usually off-policy, because the student trains on fixed human, dataset, or teacher trajectories. However, online distillation can still be off-policy if peers exchange predictions on fixed batches, and offline distillation can be on-policy if a frozen teacher scores rollouts generated by the current student. This is exactly the setup used in on-policy LLM distillation: the teacher can remain frozen, but the data distribution changes because trajectories are sampled from the student.
-
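A useful taxonomy therefore separates at least two axes, teacher update pattern and trajectory source, and can be extended with target type and teacher identity: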
| Axis | Main options | What it determines |
|---|---|---|
| Teacher update pattern | Offline, online, semi-online | Whether the teacher is frozen, co-trained, or partially adapted |
| Trajectory source | Off-policy, on-policy | Whether sequences come from datasets or teachers, or from the student |
| Target type | Hard, soft, feature, preference, reward-like | Whether supervision is tokens, logits, hidden states, preferences, or dense advantages |
| Teacher identity | External, self, multi-teacher, peer ensemble | Whether knowledge comes from another model, the same model, several models, or co-learners |
Off-Policy and On-Policy Distillation for Autoregressive LLMs
-
For autoregressive LLMs, the most important modern distinction is not only what is matched, but where the trajectories come from. Off-policy distillation trains the student on trajectories produced by a teacher, a dataset, or another external policy. On-policy distillation trains the student on its own rollouts and asks the teacher to score the student’s actual visited states.
-
This distinction is central because autoregressive errors compound: a student that deviates early at inference may enter contexts it never saw during fixed-dataset distillation. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024) formalizes this as Generalized Knowledge Distillation, using teacher feedback on student-generated sequences to reduce train-inference mismatch.
-
The following figure (source) shows the distinction between off-policy and on-policy distillation: off-policy training uses teacher-generated completions, whereas on-policy training samples the student’s own rollouts and evaluates those exact rollouts with the teacher.

Support Overlap and Locality
-
A practical OPD failure mode is poor support overlap between the student’s visited prefixes and the teacher’s reliable local distribution. One useful diagnostic is top-\(k\) local overlap:
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k p_S(\cdot \mid s_t) \cap \operatorname{Top}_k p_T(\cdot \mid s_t) \right| }{k}\]
- where \(s_t=(x,\hat{y}_{<t})\) is a student-visited state. High overlap means the teacher and student disagree within a shared menu of plausible next tokens; low overlap means the teacher may be scoring prefixes outside the region where its distribution is meaningful for that student.
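The overlap diagnostic is straightforward to compute from two next-token distributions at the same state. The sketch below assumes both distributions are plain lists indexed by a shared token id; the helper names and the example probabilities are made up for illustration.

```python
def top_k_ids(probs, k):
    """Return the set of the k token ids with the highest probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def top_k_overlap(p_student, p_teacher, k):
    """Fraction of the student's top-k tokens that are also in the teacher's top-k."""
    shared = top_k_ids(p_student, k) & top_k_ids(p_teacher, k)
    return len(shared) / k


# Made-up distributions over a five-token vocabulary at one student-visited state.
p_student = [0.50, 0.30, 0.10, 0.05, 0.05]
p_teacher = [0.40, 0.05, 0.35, 0.15, 0.05]
overlap = top_k_overlap(p_student, p_teacher, k=2)  # top-2 sets are {0, 1} and {0, 2}
```

Averaging this quantity over the positions of many rollouts gives a cheap signal for whether the teacher is being asked to score states far from its own high-probability region.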
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe by Li et al. (2026) argues that successful OPD depends on compatible teacher-student thinking patterns and progressive alignment over a small set of high-probability tokens that can carry most of the probability mass. Ravid Shwartz Ziv’s Post applies the same lens to multi-teacher OPD systems, emphasizing that full-distribution matching is safer when teachers and students share lineage or local support, while sampled-token scoring can be more robust when teacher and student distributions are farther apart.
-
In this view, OPD is not merely dense supervision. It is local dense supervision on the states that the student actually visits. If those states are outside the teacher’s useful support, then asking the student to match the teacher’s full distribution can amplify irrelevant or noisy preferences. If the student and teacher share lineage, training recipe, tokenizer, or reasoning style, distribution-level supervision is more likely to be useful because local token menus overlap.
Divergence Choice: Forward KL, Reverse KL, and JSD
-
The loss direction matters. Forward KL,
\[D_{KL}(p_T \,\Vert\, p_S) =\sum_x p_T(x)\log\frac{p_T(x)}{p_S(x)}\]
- is teacher-weighted and tends to be mean-seeking, penalizing the student for missing teacher modes. Reverse KL,
\[D_{KL}(p_S \,\Vert\, p_T) =\sum_x p_S(x)\log\frac{p_S(x)}{p_T(x)}\]
- is student-weighted and tends to be mode-seeking, penalizing tokens the student actually proposes when the teacher assigns them low probability. The TRL writeup Distilling 100B+ Models 40x Faster with TRL highlights this engineering distinction because top-\(k\) approximations differ depending on whether top tokens are selected from the teacher or the student.
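A small numerical example can make the mean-seeking versus mode-seeking contrast concrete. The distributions below are invented for illustration: a bimodal teacher, a student that commits to one mode, and a student that spreads mass uniformly. Forward KL is computed as KL(teacher, student) and reverse KL as KL(student, teacher).

```python
import math


def kl(p, q):
    """D_KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up five-token example: the teacher has two strong modes (tokens 0 and 4).
teacher = [0.48, 0.01, 0.02, 0.01, 0.48]
committed = [0.96, 0.01, 0.01, 0.01, 0.01]  # student that picks a single mode
spread = [0.20, 0.20, 0.20, 0.20, 0.20]     # student that covers everything

forward_committed = kl(teacher, committed)  # teacher-weighted
forward_spread = kl(teacher, spread)
reverse_committed = kl(committed, teacher)  # student-weighted
reverse_spread = kl(spread, teacher)

# Forward KL prefers the spread student, since it does not miss the second mode.
prefers_spread_under_forward = forward_spread < forward_committed
# Reverse KL prefers the committed student, since it avoids low-teacher tokens.
prefers_committed_under_reverse = reverse_committed < reverse_spread
```

The same teacher therefore ranks the two students in opposite orders depending on the direction of the divergence, which is the behavior the prose describes.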
-
In sampled-token OPD, the reverse-KL signal is often represented by a token-level log-probability gap:
\[A_t^{OPD} = \log p_T(y_t \mid s_t) - \log p_S(y_t \mid s_t)\]
- where \(s_t = (x, y_{<t})\) is the student-visited prefix. This signal is dense, but it is only as reliable as the teacher’s ability to evaluate the student’s prefix. In multi-teacher settings, support overlap between the student and teacher distributions becomes a first-order design constraint rather than a minor implementation detail.
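The log-probability gap needs only one number per sampled token from each model. The following sketch assumes the teacher and student log-probabilities of the sampled tokens have already been gathered into two aligned lists; the function name and the example values are hypothetical.

```python
import math


def sampled_token_advantages(teacher_logprobs, student_logprobs):
    """Per-token gap: log p_T(y_t | s_t) - log p_S(y_t | s_t)."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("both lists must cover the same sampled tokens")
    return [lt - ls for lt, ls in zip(teacher_logprobs, student_logprobs)]


# Made-up probabilities that each model assigns to three tokens the student sampled.
teacher_lp = [math.log(0.60), math.log(0.05), math.log(0.30)]
student_lp = [math.log(0.30), math.log(0.40), math.log(0.30)]
advantages = sampled_token_advantages(teacher_lp, student_lp)
# First token: teacher likes it more than the student did, so the gap is positive.
# Second token: teacher rates it far lower, so the gap is negative.
# Third token: both agree, so the gap is zero.
```

Because only the sampled token's log-probability is needed, the teacher does not have to return a full-vocabulary distribution for this form of the signal.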
-
In many OPD systems, reverse KL is attractive because the student has already sampled the token being evaluated. This makes it possible to train from teacher log-probabilities on sampled tokens instead of transmitting full-vocabulary distributions. However, if the objective is full forward KL or dense distribution matching, the system must usually request teacher top-\(k\) or full-vocabulary probabilities, increasing communication and memory costs.
-
The following figure (source) shows forward KL and reverse KL, including their different weighting behavior and their mean-seeking versus mode-seeking tendencies.

Self-Distillation and On-Policy Self-Distillation
-
Self-distillation is a related family in which the teacher is not a separate larger model. In older usage, it can mean training a model from its own predictions, from earlier checkpoints, or from an ensemble of itself. In newer LLM reasoning work, self-distillation can be on-policy and context-conditioned: Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models by Zhao et al. (2026) uses one model as both student and teacher, where the teacher view receives privileged information such as a verified solution while the student view sees only the problem.
-
In multi-turn agent training, self-distillation must handle compounding trajectory drift: Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) introduces SDAR, which treats OPSD as a gated auxiliary objective rather than the primary training signal, preserving Reinforcement Learning (RL) as the task-grounded backbone while using privileged self-teacher context for dense token-level guidance.
-
Self-distillation can also target out-of-distribution behaviors when an external teacher is unavailable. Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation introduces Relevance-Masked Self-Distillation (RMSD), which filters token positions before applying the self-distillation loss so that updates focus on behavior-relevant tokens rather than incidental style differences.
-
A compact RMSD-style objective can be written as:
\[\mathcal{L}_{RMSD}(\theta) =\mathbb{E} \left[ \sum_t m_t D\left( p_T^\theta(\cdot \mid x', \hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x, \hat{y}_{<t}) \right) \right]\]
- where \(x'\) is an enhanced teacher context and \(m_t \in \{0,1\}\) selects token positions judged relevant to the desired behavior. This reflects the practical observation that token-level supervision is useful only when the updated tokens actually correspond to the capability being transferred.
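The masked objective can be sketched as a per-position divergence multiplied by a binary relevance flag. This illustration uses forward KL for \(D\) and takes the mask as given; how the mask is produced is not shown, and all names and numbers below are hypothetical.

```python
import math


def kl(p, q):
    """D_KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def relevance_masked_loss(teacher_dists, student_dists, mask):
    """Sum m_t * D(teacher_t || student_t) over positions of one rollout.

    teacher_dists: per-position distributions from the enhanced context x'.
    student_dists: per-position distributions from the plain context x.
    mask: 0/1 flags marking positions judged relevant to the target behavior.
    """
    if not (len(teacher_dists) == len(student_dists) == len(mask)):
        raise ValueError("all inputs must have one entry per position")
    return sum(
        m * kl(pt, ps) for pt, ps, m in zip(teacher_dists, student_dists, mask)
    )


# Made-up two-position rollout over a three-token vocabulary.
teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_dists = [[0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
loss_first_only = relevance_masked_loss(teacher_dists, student_dists, [1, 0])
loss_all = relevance_masked_loss(teacher_dists, student_dists, [1, 1])
```

Positions with a zero mask contribute nothing, so style differences at those positions do not drive any update.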
Thinking-Model Caveats for Privileged Self-Distillation
-
Recent results make the self-distillation story more nuanced for thinking models. Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) finds that privileged-context self-distillation can degrade long-budget reasoning by suppressing forking, verification, backtracking, and hedging behaviors. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) similarly traces some degradation to suppressed epistemic verbalization, where expressions such as uncertainty, reconsideration, or checking help a model preserve alternative reasoning paths. RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) proposes a contrastive hinting signal to reduce privilege-induced style drift and focus updates on task-bearing tokens.
-
One useful abstraction is to contrast a correct privileged hint with a wrong or control hint:
\[e_t^{ctr} =\left[ \log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right] -\left[ \log p_\theta(y_t \mid x,c^-,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right]\]
- where \(c^+\) is a correct hint and \(c^-\) is a contrastive hint. This removes parts of the teacher-student gap caused merely by hint-conditioned style shift and leaves more of the signal tied to task content.
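As a worked simplification of the abstraction above: the unhinted term \(\log p_\theta(y_t \mid x,\hat{y}_{<t})\) appears once in each bracket, so it cancels when the second bracket is subtracted from the first, leaving a difference of the two hinted log-probabilities:
\[e_t^{ctr} =\log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,c^-,\hat{y}_{<t})\]
This follows only from the expression as written here and is not a claim about the exact form used in the RLCSD paper. It does suggest that, under this abstraction, the contrast can be computed from two hinted forward passes per rollout.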
-
This caveat is especially important for reasoning models that benefit from long test-time compute. A privileged teacher can appear locally more confident and concise because it already knows the answer, but that confidence may teach the student to suppress the very uncertainty-marking and self-correction behaviors that support robust inference without privileged context.
Distillation as Synthetic Data and Post-Training Infrastructure
-
A newer way to understand distillation is as part of the broader synthetic-data and post-training toolkit. The RLHF Book’s chapter Synthetic Data describes distillation as both a data engine, where stronger models generate completions, critiques, preferences, or filters, and a skill-transfer method, where a stronger model’s capabilities are transferred into a weaker model. The same chapter frames the path from offline KD to on-policy distillation as a move from static teacher-generated data toward student-sampled trajectories with dense teacher feedback.
-
This view is especially useful for out-of-distribution enterprise or tool-use behaviors, where ordinary SFT may teach a narrow behavior while harming unrelated capabilities, and RL may struggle if the base model cannot produce successful attempts often enough. Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation frames self-distillation as a way to bring unusual target behaviors into the model’s distribution while preserving unrelated competence through localized token-level updates.
Distillation and Reinforcement Learning
-
This connection also clarifies why distillation and reinforcement learning are now tightly linked. RL with verifiable rewards provides on-policy training but usually sparse scalar feedback, while OPD provides dense token-level feedback over on-policy trajectories. The RLHF Book expresses this bridge by treating the OPD token-level signal as an advantage-like term:
\[A_t^{\mathrm{OPD}} = \log \pi_T(a_t \mid s_t) - \log \pi_\theta(a_t \mid s_t)\]
- where sampled tokens that the teacher rates above the student receive positive advantage, and sampled tokens the teacher rates below the student receive negative advantage.
-
Recent papers make this RL-distillation connection more explicit. Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) introduces Self-Distillation Policy Optimization, which converts rich textual feedback such as runtime errors or judge comments into dense learning signals without an external teacher. Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation by Yang et al. (2026) argues that OPD is a special case of dense KL-constrained RL and generalizes it with a flexible reference model and reward scaling. Scaling Reasoning Efficiently via Relaxed On-Policy Distillation by Ko et al. (2026) interprets teacher-student log-likelihood ratios as token rewards and introduces relaxed OPD techniques for stability.
-
The relationship between distillation and RL is especially important for reasoning, coding, and agentic systems. Self-Distilled RLVR by Yang et al. (2026) argues that privileged self-distillation alone can leak information and destabilize long training, so it uses self-distillation to determine fine-grained update magnitudes while retaining RLVR for update direction. Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026) routes correct samples to GRPO-style reward-aligned RL and failed samples to self-distillation correction, combining sparse correctness signals with dense token-level supervision. OpenClaw-RL: Train Any Agent Simply by Talking by Wang et al. (2026) extends this idea to agent interactions, using next-state signals such as user replies, tool outputs, terminal states, and GUI changes as both scalar feedback and hindsight-guided OPD supervision.
-
In multi-turn agentic training, the RL-distillation relationship becomes asymmetric: RL is often best treated as the task-grounded primary objective, while self-distillation acts as a carefully controlled auxiliary signal. Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) formalizes this pattern with SDAR, which maps detached teacher-student token gaps into sigmoid gates so that teacher-endorsed positive-gap tokens receive stronger distillation while negative teacher rejections are softly attenuated.
-
The SDAR objective can be summarized as an RL backbone plus a gated self-distillation auxiliary term:
\[\mathcal{L}(\theta) =\mathcal{L}_{GRPO}(\theta) +\lambda_{\mathrm{SDAR}} \mathcal{L}_{SDAR}(\theta)\]- where \(\mathcal{L}_{GRPO}\) preserves verifier-driven policy optimization, while \(\mathcal{L}_{SDAR}\) injects dense token-level guidance only where the gated privileged teacher signal is trusted.
Self-Distilled Agentic Reinforcement Learning
-
Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) extends RL-distillation hybrids to multi-turn agents by treating OPSD as a gated auxiliary objective and keeping GRPO as the primary RL backbone.
-
Implementation details:
-
The method flattens valid response tokens across a multi-turn trajectory and applies token-level self-distillation over the agent’s own rollout.
-
The student context contains the task and previous generated tokens, while the self-teacher context additionally includes privileged training-only information such as retrieved skills.
-
The detached teacher-student gap is defined as:
\[\Delta_t =\operatorname{sg} \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\] -
The token-level gate converts this signal into a bounded trust weight:
\[g_t = \sigma(\beta \Delta_t)\] -
The gated auxiliary loss applies self-distillation only according to token-level trust:
\[\ell_t^{\mathrm{SDAR}} =g_t \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\]
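As a rough illustration of the three formulas above, the following minimal sketch computes the gap, the gate, and the gated per-token loss from per-token log-probabilities. The function names, the `beta` and `lam` defaults, and the mean aggregation over tokens are assumptions made for this example, not details taken from the SDAR paper. Plain floats carry no gradients, so the stop-gradient on \(\Delta_t\) is only indicated in a comment.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sdar_token_terms(teacher_logps, student_logps, beta=1.0):
    """Return per-token (gaps, gates, losses) for one flattened rollout.

    teacher_logps[t] = log pi^+(y_t | s_t^+)  (privileged self-teacher context)
    student_logps[t] = log pi(y_t | s_t)      (student context)
    """
    gaps, gates, losses = [], [], []
    for lp_teacher, lp_student in zip(teacher_logps, student_logps):
        gap = lp_teacher - lp_student   # Delta_t; detached in an autodiff framework
        gate = sigmoid(beta * gap)      # g_t, a bounded trust weight in (0, 1)
        gaps.append(gap)
        gates.append(gate)
        losses.append(gate * (lp_teacher - lp_student))  # ell_t^SDAR
    return gaps, gates, losses

def combined_objective(grpo_loss, sdar_losses, lam=0.1):
    """L = L_GRPO + lambda * mean_t ell_t^SDAR (mean aggregation is an assumption)."""
    return grpo_loss + lam * sum(sdar_losses) / len(sdar_losses)
```

With this gate, a token where the teacher assigns higher log-probability than the student (positive gap) gets a weight above 0.5, while a token the teacher rejects gets a weight below 0.5.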
-
Multi-Domain Post-Training and Capability Consolidation
-
In multi-domain post-training, distillation also functions as a capability consolidation tool after or during RL. Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD from the strongest intermediate teacher models to recover benchmark regressions and sustain gains after broader Cascade RL.
-
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales this consolidation pattern to a 550B-total, 55B-active-parameter hybrid Mamba-attention MoE model, using SFT, unified RLVR, MOPD warmup, multi-teacher OPD, and MTP boosting to consolidate more than ten specialist teachers into a single agentic reasoning model. The report also notes that sampled-token objectives outperformed top-\(k\) and full-vocabulary distribution matching in preliminary MOPD experiments on some agentic benchmarks, which fits the support-overlap view that broad distribution matching can amplify noise when the student’s prefixes are off the teacher’s reliable support.
-
A practical MOPD design rule is that teacher support matters: dense teacher scoring is most reliable when student rollouts remain within regions where the teacher can assign meaningful token preferences. Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) uses a brief MOPD warmup SFT stage to align student rollouts with teacher-supported distributions, and Ravid Shwartz Ziv’s Post frames the same issue as a support-overlap tradeoff between full-distribution matching and sampled-token scoring.
-
Aligning Language Models from User Interactions by Kleine Buening et al. (2026) uses user follow-up messages as hindsight context for self-distillation, updating the model toward the behavior it would have produced after seeing the user’s correction or clarification. Informal discussion around this trend also appears in Cameron R. Wolfe’s X posts on multi-teacher OPD and the utility of combining specialist teachers, including this MOPD discussion thread.
Implementation View
-
In implementation terms, modern LLM distillation usually requires four decisions: the source of trajectories, the teacher signal, the divergence or surrogate loss, and the systems design for computing log-probabilities. Thinking Machines Blog: On-Policy Distillation frames on-policy distillation as combining the relevance of RL with the dense per-token signal of distillation: the student samples its own trajectories, while the teacher provides token-level feedback rather than a sparse sequence-level reward.
-
Agentic RL-distillation hybrids require deciding how strongly each token should trust privileged teacher guidance. Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) introduces entropy gating, gap gating, and soft-OR gating, allowing each token to regulate the intensity of self-distillation based on student uncertainty, teacher-student gap, or their combination.
-
Practical OPD systems also need to distinguish the mathematical idea of “on-policy” from systems-level staleness. In asynchronous pipelines, rollout workers may sample from a slightly older behavior policy while learner workers update a newer student snapshot; Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) addresses this in MOPD with behavior-policy, proximal-policy, and teacher-log-probability terms inside an asynchronous objective.
-
For production-scale OPD, implementation details can dominate the algorithmic presentation. Distilling 100B+ Models 40x Faster with TRL highlights generation buffers, batched teacher scoring, and compressed log-probability transfer, while vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention by Kwon et al. (2023) is a common serving foundation for high-throughput teacher inference.
-
For out-of-distribution or enterprise-specific behaviors, self-distillation can reduce reliance on an external frontier teacher by constructing a privileged teacher context from hints, corrections, or task descriptions. Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation makes this practical with token relevance masks, emphasizing that dense token-level supervision is useful only when the update positions correspond to the behavior being transferred.
Primer Roadmap
- The rest of the primer covers the major types of distillation in detail: classical soft-label distillation, hard-label and sequence-level distillation, representation and attention distillation, task-specific versus task-agnostic distillation, offline and online distillation, off-policy distillation, on-policy distillation, support-overlap diagnostics, self-distillation, on-policy self-distillation, thinking-model caveats for privileged self-distillation, multi-teacher on-policy distillation, and the newer RL-distillation hybrid family that treats teacher log-probability gaps, hindsight feedback, contrastive hints, relevance masks, or self-teacher contexts as dense policy-optimization signals.
Foundations
-
Classical knowledge distillation establishes the core teacher-student framework that underlies all subsequent variants, including offline distillation, online distillation, off-policy distillation, on-policy distillation, self-distillation, multi-teacher distillation, and modern reinforcement learning hybrids. The central insight is that a model can learn not only from hard labels, but from the richer probability distribution produced by a stronger teacher or peer model.
-
The most useful way to orient this section is to treat distillation as both a loss family and a post-training systems primitive. At the loss level, distillation begins with matching teacher and student distributions. At the recipe level, modern LLM post-training uses distillation for compression, synthetic-data generation, regression recovery, RL stabilization, self-improvement, and multi-domain capability consolidation.
-
The foundational axes are teacher dynamics, trajectory source, target type, and teacher identity. Offline versus online determines whether the teacher changes; off-policy versus on-policy determines where trajectories come from; hard versus soft targets determine the density of supervision; and external, self, or multi-teacher setups determine where the supervision originates.
-
The most important modern shift is from distilling isolated teacher outputs to consolidating entire post-training workflows. In recent LLM recipes, specialist teachers, staged RL checkpoints, synthetic traces, verifiers, tool environments, and MOPD are all parts of the same system.
-
This section introduces the mathematical foundations, temperature scaling, divergence choices, early extensions, modern post-training recipe patterns, and the distinction between fixed-teacher and co-trained-teacher settings that made distillation a general-purpose model compression and capability transfer technique.
-
In frontier post-training, these foundations are increasingly embedded inside larger recipes rather than used as standalone objectives. Frontier post-training recipe review with Finbarr Timbers describes the historical progression from relatively simple SFT, reward-modeling, and RLHF pipelines toward 2026-style recipes that combine staged RL, specialist teachers, trace distillation, multi-teacher on-policy distillation, and environment-specific training.
Teacher-Student Formulation
-
The classical formulation of distillation considers two models: a teacher \(p_T\), typically large and high-performing, and a student \(p_S^\theta\), typically smaller or more efficient. The objective is to transfer the teacher’s behavior into the student while reducing computational cost or improving specialization.
-
This paradigm was formalized in Distilling the Knowledge in a Neural Network by Hinton et al. (2015), which introduced the idea that the teacher’s soft output probabilities encode richer information than hard labels, revealing inter-class similarities. The student is trained to match these soft distributions rather than only the argmax class.
-
In the original classical setting, the teacher is usually fixed before the student is trained. This corresponds to offline distillation: the teacher distribution is stationary, and the student learns from a stable reference model.
-
Online distillation relaxes this assumption. In methods such as Deep Mutual Learning by Zhang et al. (2017), multiple peer models learn collaboratively and teach one another during training, so there may be no single pretrained superior teacher and no fixed teacher distribution.
-
Modern post-training adds a third practical pattern: staged teacher creation. In systems such as Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026), intermediate checkpoints produced by domain-wise RL become teachers for later multi-domain OPD, so the teacher is fixed at the moment of distillation but is itself the product of a broader sequential post-training pipeline.
-
For classification or token prediction, the core offline teacher-student loss is typically:
\[\mathcal{L}_{KD}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \left[ D\left( p_T(\cdot \mid x) \,\Vert\, p_S^\theta(\cdot \mid x) \right) \right]\]- where \(D\) is a divergence, most commonly forward KL.
-
For online or mutual distillation, the same form can be generalized by replacing the single frozen teacher \(p_T\) with one or more evolving peers \(p_j^{\theta_j}\):
\[\mathcal{L}_{i}(\theta_i) =\mathcal{L}_{\text{task}}(\theta_i) + \lambda \sum_{j\neq i} D\left( p_j^{\theta_j}(\cdot \mid x) \,\Vert\, p_i^{\theta_i}(\cdot \mid x) \right)\]- where model \(i\) learns from peer models \(j\) while also updating its own parameters. This captures the essential shift from one-way offline transfer to reciprocal online transfer.
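The peer objective above can be sketched for a single example as follows. This is a toy illustration on explicit probability vectors: the function names and the default \(\lambda = 1\) are assumptions, and a real implementation would compute these terms from logits inside an autodiff framework.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def mutual_losses(peer_probs, label, lam=1.0):
    """Per-peer loss: task cross-entropy plus KL from every other peer.

    peer_probs[i] is peer i's predicted distribution for one example.
    """
    losses = []
    for i, p_i in enumerate(peer_probs):
        task = -math.log(p_i[label])  # L_task for peer i
        agreement = sum(
            kl(p_j, p_i)              # D(p_j || p_i) for each other peer j
            for j, p_j in enumerate(peer_probs)
            if j != i
        )
        losses.append(task + lam * agreement)
    return losses
```

When all peers already agree, the agreement term vanishes and each peer's loss reduces to its task loss.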
-
In practice, distillation is often combined with supervised learning:
\[\mathcal{L}(\theta) =\alpha \mathcal{L}_{CE}(\theta) +(1-\alpha)\mathcal{L}_{KD}(\theta)\]- where \(\mathcal{L}_{CE}\) is cross-entropy with ground-truth labels and \(\alpha\) balances teacher imitation and label supervision.
-
In frontier LLM pipelines, the “ground truth” term may itself be synthetic, filtered, teacher-generated, or verifier-selected. For example, Nemotron-Cascade 2 by Yang et al. (2026) uses broad SFT data spanning math, coding, science, tool use, agentic tasks, multi-turn dialogue, instruction following, safety, long-context tasks, and software engineering before RL and multi-domain OPD, so the supervised term is already a curated teacher-data mixture rather than only human labels.
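The weighted combination of cross-entropy and distillation described above can be written in a few lines. This is a minimal sketch for one example with forward KL as the divergence; the function names and the default \(\alpha = 0.5\) are illustrative assumptions.

```python
import math

def forward_kl(p_teacher, p_student):
    """KL(p_T || p_S); assumes the student is positive wherever the teacher is."""
    return sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )

def combined_loss(p_teacher, p_student, label, alpha=0.5):
    """alpha * L_CE + (1 - alpha) * L_KD for a single example."""
    ce = -math.log(p_student[label])       # supervision from the (possibly synthetic) label
    kd = forward_kl(p_teacher, p_student)  # imitation of the teacher distribution
    return alpha * ce + (1 - alpha) * kd
```

Setting `alpha=1.0` recovers plain supervised training, and `alpha=0.0` recovers pure teacher imitation.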
Temperature Scaling and Soft Targets
-
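A central idea in classical distillation is temperature scaling. The teacher logits \(z_i\) are softened using a temperature \(T > 1\) (the superscript \((T)\) below denotes this temperature, while the subscript in \(p_T\) still denotes the teacher):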
\[p_T^{(T)}(i \mid x) =\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}\] -
Higher temperatures produce smoother distributions, making low-probability classes more visible. This helps the student learn nuanced relationships that are otherwise hidden in one-hot labels.
-
The distillation loss then becomes:
\[\mathcal{L}_{KD} =T^2 \cdot D_{KL} \left( p_T^{(T)} \,\Vert\, p_S^{(T)} \right)\] -
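The factor \(T^2\) compensates for the fact that gradients from the softened targets scale as \(1/T^2\), so the distillation term keeps a comparable magnitude relative to the hard-label loss when the temperature changes.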
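To make the softening and the \(T^2\) factor concrete, here is a small sketch of the temperature-scaled softmax and the resulting distillation loss. The function names and the default temperature of 2.0 are assumptions for illustration, and the code assumes moderate logit values so that no probability underflows to zero.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(p_T^(T) || p_S^(T)) for a single example."""
    p_t = softmax_with_temperature(teacher_logits, temperature)
    p_s = softmax_with_temperature(student_logits, temperature)
    kl = sum(a * math.log(a / b) for a, b in zip(p_t, p_s) if a > 0)
    return temperature ** 2 * kl
```

Comparing `softmax_with_temperature([2.0, 1.0, 0.0], 1.0)` with the same call at temperature 4.0 shows the effect: the largest probability shrinks and the smallest grows, so low-probability classes become more visible to the student.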
-
In offline distillation, temperature is typically applied to a frozen teacher’s logits. In online distillation, temperature may be applied to each peer model’s logits before exchanging predictions, helping prevent mutual learning from collapsing too quickly into overconfident agreement.
-
In modern frontier recipes, temperature also appears outside the classical soft-label setting. Frontier post-training recipe review with Finbarr Timbers highlights that recent model reports discuss sampling-temperature schedules, difficulty curricula, and filtering policies as part of the broader post-training recipe. These choices affect the rollouts that later become SFT traces, RL trajectories, or distillation targets, so they indirectly shape the teacher distribution even when the final loss is not temperature-scaled KD.
-
Implementation detail: in large vocabulary settings such as LLMs, computing full softmax distributions is expensive. In practice, systems often approximate the loss using top-\(k\) tokens from either the teacher or student distribution, depending on whether forward or reverse KL is used.
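One possible form of this top-\(k\) approximation is sketched below. It restricts the KL to the \(k\) most likely tokens of the first argument and renormalizes both distributions over that subset; this renormalization scheme and the function name `topk_kl` are assumptions for illustration, and production systems may handle the truncated tail differently.

```python
import math

def topk_kl(p, q, k):
    """KL(p || q) restricted to the top-k tokens of p, renormalized on that set.

    Forward KL:  topk_kl(teacher, student, k)  -> teacher's top-k tokens.
    Reverse KL:  topk_kl(student, teacher, k)  -> student's top-k tokens.
    Assumes q > 0 on the selected tokens.
    """
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    zp = sum(p[i] for i in top)  # mass of p on the kept tokens
    zq = sum(q[i] for i in top)  # mass of q on the kept tokens
    total = 0.0
    for i in top:
        pi, qi = p[i] / zp, q[i] / zq
        if pi > 0:
            total += pi * math.log(pi / qi)
    return total
```

Choosing the top-\(k\) set from the distribution that the expectation is taken under keeps the terms with the largest weights and drops only the low-weight tail.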
Token-Level Distillation in Autoregressive Models
-
For language models, distillation is applied at the token level. Given an input sequence \(x\) and generated tokens \(y = (y_1,\dots,y_n)\), both teacher and student define conditional distributions:
\[p(y_t \mid x, y_{<t})\] -
The distillation objective aggregates per-token divergence:
\[D_{KL}(p_T \,\Vert\, p_S)(y \mid x) =\frac{1}{|y|} \sum_{t=1}^{|y|} D_{KL} \left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S(\cdot \mid x, y_{<t}) \right)\] -
This formulation is explicitly described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024), where distillation is framed as minimizing divergence between teacher and student token distributions along sequences.
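The length-normalized objective above amounts to averaging a per-position KL along one sequence. The sketch below assumes that the teacher and student distributions at each position have already been computed on the same prefixes; the function names are illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def sequence_kl(teacher_dists, student_dists):
    """Mean over positions t of KL(p_T(. | x, y_<t) || p_S(. | x, y_<t)).

    teacher_dists[t] and student_dists[t] are next-token distributions
    evaluated on the same prefix y_<t.
    """
    per_token = [kl(pt, ps) for pt, ps in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)
```

The function is agnostic to where the prefixes came from, which is exactly the degree of freedom that separates off-policy from on-policy training.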
-
Offline token-level distillation usually evaluates the teacher and student on a fixed sequence distribution, such as human-written outputs, teacher-generated outputs, or cached synthetic data. This is simple and stable, but it can create a gap between training prefixes and inference prefixes.
-
Online token-level distillation allows the supervising distribution to change during training. In peer-learning settings, each model may provide token-level probabilities to other models on shared batches; in more advanced LLM systems, periodically refreshed checkpoints or peer models can serve as evolving teachers.
-
On-policy token-level distillation changes the prefix distribution by evaluating teacher and student on student-generated prefixes. This is why OPD is especially relevant for long-horizon reasoning and agentic tasks: a model’s future states are defined by its own earlier actions, tool calls, or reasoning tokens.
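A toy sketch of the on-policy loop may help: the student samples its own rollout, and the teacher is evaluated on exactly the prefixes the student visits. The three-token vocabulary, the two hand-written policies, and the function names are all hypothetical and exist only to make the example runnable. The per-token quantity recorded here is the sampled-token log-ratio \(\log p_S(y_t) - \log p_T(y_t)\), a single-sample estimate of the reverse KL at that prefix.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]  # hypothetical toy vocabulary

def student_policy(prefix):
    """Toy student: the same next-token distribution for every prefix."""
    return [0.5, 0.3, 0.2]

def teacher_policy(prefix):
    """Toy teacher: prefers to stop right after the token 'b'."""
    if prefix and prefix[-1] == "b":
        return [0.1, 0.1, 0.8]
    return [0.6, 0.2, 0.2]

def rollout_and_score(student, teacher, rng, max_len=8):
    """Sample a student rollout and score each sampled token with the teacher."""
    prefix, gaps = [], []
    for _ in range(max_len):
        p_s = student(prefix)
        p_t = teacher(prefix)  # teacher sees the student-generated prefix
        idx = rng.choices(range(len(VOCAB)), weights=p_s)[0]
        # log p_S(y_t) - log p_T(y_t); its negation can serve as a token reward
        gaps.append(math.log(p_s[idx]) - math.log(p_t[idx]))
        prefix.append(VOCAB[idx])
        if VOCAB[idx] == "<eos>":
            break
    return prefix, gaps
```

Calling `rollout_and_score(student_policy, teacher_policy, random.Random(0))` returns one sampled trajectory together with its per-token log-ratios; averaging these over many rollouts approximates the reverse KL under the student's own state distribution.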
-
Sasha Rush’s video lecture on how on-policy distillation works explains this distinction using the contrast between sequence KD and OPD: sequence KD trains the student on the teacher’s chosen trajectory, whereas OPD lets the student produce a trajectory and has the teacher rescore the exact sequence of student actions.
-
A key implication is that the quality of training depends heavily on the distribution of prefixes \(y_{<t}\) encountered during training, which motivates the distinction between off-policy and on-policy methods discussed later.
Divergence Choices and Their Effects
-
Different divergences induce different behaviors in the student:
-
Forward KL:
\[D_{KL}(p_T \,\Vert\, p_S)\]- This penalizes the student for assigning low probability to tokens the teacher considers likely. It leads to mean-seeking behavior, encouraging coverage of all teacher modes.
-
Reverse KL:
\[D_{KL}(p_S \,\Vert\, p_T)\]- This penalizes tokens the student produces that the teacher considers unlikely. It leads to mode-seeking behavior, focusing on dominant modes.
-
Jensen-Shannon Divergence (JSD):
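\[D_{JSD}(p_T \,\Vert\, p_S) =\beta D_{KL} \left( p_T \,\Vert\, m \right) + (1-\beta) D_{KL} \left( p_S \,\Vert\, m \right)\]- where \(m=\beta p_T+(1-\beta)p_S\) and \(\beta \in (0,1)\). Setting \(\beta = 1/2\) recovers the standard symmetric JSD. This generalized JSD is bounded, which can improve stability, and its gradients behave like those of forward KL as \(\beta \to 0\) and like those of reverse KL as \(\beta \to 1\).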
-
-
Practical insight: forward KL is often used in classical supervised KD, while reverse KL is frequently used in on-policy distillation because it aligns naturally with sampling from the student distribution.
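The three divergences can be compared directly on small distributions. The sketch below is illustrative only: the function names are assumptions, and the example distributions in the note that follows were chosen by hand to show a student that has collapsed onto one of two teacher modes.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def forward_kl(p_teacher, p_student):
    """KL(p_T || p_S): expectation under the teacher."""
    return kl(p_teacher, p_student)

def reverse_kl(p_teacher, p_student):
    """KL(p_S || p_T): expectation under the student."""
    return kl(p_student, p_teacher)

def generalized_jsd(p_teacher, p_student, beta=0.5):
    """beta * KL(p_T || m) + (1 - beta) * KL(p_S || m), m = beta*p_T + (1-beta)*p_S."""
    m = [beta * a + (1 - beta) * b for a, b in zip(p_teacher, p_student)]
    return beta * kl(p_teacher, m) + (1 - beta) * kl(p_student, m)
```

For a bimodal teacher `[0.49, 0.49, 0.02]` and a student `[0.96, 0.02, 0.02]` that keeps only the first mode, forward KL is larger than reverse KL: forward KL charges heavily for the dropped mode, whereas reverse KL mostly checks that the student's own mass lands where the teacher is also likely.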
-
In offline distillation, divergence selection primarily controls how the student approximates a fixed teacher. In online distillation, divergence selection also affects training dynamics among co-evolving models: overly strong agreement losses can reduce diversity too early, while weaker or temperature-smoothed agreement can preserve complementary learning signals for longer.
-
In multi-teacher OPD, divergence choice also interacts with support overlap. Full forward-KL-style distribution matching can be powerful when the teacher and student share local token support, but sampled-token or reverse-KL-style scoring can be more robust when the teacher and student are farther apart. This distinction appears in recent discussions of large MOPD systems such as Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026), where sampled-token objectives can be preferable when full teacher distributions are noisy on student-visited prefixes.
Supervised Distillation and Sequence-Level Distillation
-
Two important classical variants are widely used:
-
Supervised (logit-level) distillation:
\[\mathcal{L}_{SD} =\mathbb{E}_{(x,y)} \left[ D_{KL} \left( p_T \,\Vert\, p_S \right)(y \mid x) \right]\]- This provides dense supervision at every token, leveraging the full distribution rather than only correct labels.
-
Sequence-level distillation:
-
Introduced in Sequence-Level Knowledge Distillation by Kim and Rush (2016), this replaces ground-truth outputs with teacher-generated sequences. The student is trained via standard likelihood:
\[\mathcal{L}_{SeqKD} =\mathbb{E}_{x} \left[ -\log p_S(y_T \mid x) \right]\]- where \(y_T\) is the output sequence generated by the teacher for input \(x\).
-
This simplifies the target distribution, often making learning easier but discarding distributional richness.
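For a single input, the sequence-level loss is the student's negative log-likelihood of the teacher-generated sequence. The sketch below assumes that the student's next-token distributions have already been computed with teacher forcing on the teacher's tokens; the dictionary representation and the function name are illustrative choices.

```python
import math

def seqkd_loss(student_step_probs, teacher_tokens):
    """-log p_S(y_T | x), summed over the tokens of the teacher sequence.

    student_step_probs[t] maps each token to the student's probability at
    step t, conditioned on x and the teacher's tokens before step t.
    """
    return -sum(
        math.log(step[tok])
        for step, tok in zip(student_step_probs, teacher_tokens)
    )
```

Only the probability of the teacher's chosen token enters the loss at each step, which is the sense in which the rest of the teacher distribution is discarded.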
-
-
-
Both supervised logit-level distillation and sequence-level distillation are most commonly implemented as offline methods: a frozen teacher labels fixed data or generates synthetic targets, and the student is trained afterward.
-
Online versions are possible when teacher outputs are generated by co-trained peers or periodically refreshed teachers rather than by a static teacher. In such cases, the same supervised or sequence-level objective can be used, but the source distribution evolves during optimization.
-
Modern frontier recipes often use trace-distillation SFT as a consolidation stage even when they do not use full on-policy MOPD. Frontier post-training recipe review with Finbarr Timbers contrasts recipes that consolidate specialist RL climbs through trace-distillation SFT with newer recipes that consolidate specialists through multi-teacher on-policy distillation.
-
The key difference is whether the consolidation data comes from teacher-generated traces or student-generated rollouts. Trace-distillation SFT teaches the student to imitate the traces selected by the teacher or recipe designer. MOPD teaches the student to improve its own trajectories using token-level teacher feedback over the states the student actually visits.
Representation and Intermediate-Layer Distillation
-
Classical distillation need not operate solely on output probabilities. Representation distillation aligns hidden states, attention maps, and embeddings.
-
DistilBERT: a distilled version of BERT by Sanh et al. (2019) combines three losses:
-
DistilBERT loss components: masked language modeling loss, distillation loss on softened logits, cosine embedding loss on hidden representations.
-
This multi-objective approach preserves both output behavior and internal representations, demonstrating that distillation can transfer structural knowledge in addition to token probabilities.
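The three-part objective can be sketched as a weighted sum, with the representation term written out explicitly. The weights `w_mlm`, `w_kd`, and `w_cos` and the function names are placeholders chosen for this example rather than the values used for DistilBERT, and the masked language modeling and distillation terms are assumed to be computed elsewhere and passed in as scalars.

```python
import math

def cosine_embedding_loss(h_student, h_teacher):
    """1 - cos(h_S, h_T): zero when the hidden vectors point the same way."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    norm_s = math.sqrt(sum(a * a for a in h_student))
    norm_t = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(mlm_loss, kd_loss, h_student, h_teacher,
                w_mlm=1.0, w_kd=1.0, w_cos=1.0):
    """Weighted sum of MLM, soft-logit distillation, and cosine alignment terms."""
    cos = cosine_embedding_loss(h_student, h_teacher)
    return w_mlm * mlm_loss + w_kd * kd_loss + w_cos * cos
```

Because the cosine term depends only on direction, it aligns student and teacher hidden vectors without forcing their norms to match.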
-
-
Subsequent work has extended this principle to:
- Representation-alignment targets: attention map matching, value and key projection matching, layer-wise feature regression, contrastive representation alignment.
-
These techniques are especially useful when the student architecture differs substantially from the teacher.
-
In offline representation distillation, the teacher’s hidden states can be cached or computed live from a frozen teacher. In online representation distillation, peers may align intermediate representations during co-training, but this is more architecture-sensitive because hidden-state dimensions, layer counts, and attention structures must be compatible or projected into a shared space.
-
In modern LLM post-training, direct hidden-state matching is less common than output-distribution or trace-based transfer because teachers and students may differ in architecture, scale, tokenizer, routing structure, or inference stack. For example, MoE and hybrid architectures make internal-layer alignment less straightforward, while token-level distillation and trace distillation remain broadly compatible across architectures.
Distillation as Synthetic Data Generation
-
A complementary perspective, emphasized in the RLHF Book chapter Synthetic Data, is that distillation is also a structured data-generation process. A teacher can produce:
-
Synthetic supervision artifacts: answers, chain-of-thought traces, critiques, preference labels, filtered examples.
-
The student then trains on these outputs either as hard targets or as soft distributions.
-
-
This viewpoint broadens distillation from model compression to a general capability-transfer mechanism. In modern LLM pipelines, generating high-quality synthetic reasoning traces often precedes more advanced on-policy or reinforcement learning stages.
-
In offline synthetic-data distillation, the generated examples are usually fixed before student training or regenerated in separate rounds. In online synthetic-data distillation, examples, critiques, or peer labels may evolve as teachers, students, or self-improvement loops change during training.
-
Modern SFT stages are often synthetic-data pipelines in disguise. Nemotron-Cascade 2 by Yang et al. (2026) constructs large SFT mixtures with teacher-generated reasoning traces, correctness filtering for coding tasks with tests, long-context examples, multi-turn synthetic conversations, instruction-following examples, safety examples, and domain-specific reasoning data. This shows how SFT, synthetic data, and distillation can form a single integrated data-production loop rather than separate stages.
-
Similarly, Nemotron 3 Ultra by NVIDIA et al. (2026) uses generated and filtered SFT data across science, math, proof, competitive coding, multi-turn chat, and agentic software tasks before later RL and MOPD, illustrating how large post-training pipelines rely on synthetic distillation artifacts even before explicit OPD begins.
Post-Training Recipes of Recent LLMs
-
Recent LLM post-training recipes can be read as a history of how distillation moved from a compression technique into a recipe-consolidation mechanism. The broad evolution is:
-
2022 to 2023: SFT, reward modeling, and PPO-style RLHF, exemplified by Training language models to follow instructions with human feedback by Ouyang et al. (2022).
-
2024: open recipes increasingly formalized SFT, preference optimization, and RL with verifiable rewards, while closed recipes used more elaborate multi-stage RLHF variants.
-
2025: reasoning RL became the centerpiece after DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by Guo et al. (2025), which made large-scale RLVR central to reasoning-model post-training.
-
2026: recipes increasingly fragment into multiple domain-specialist teachers and then merge those teachers into one deployable model through trace distillation, multi-domain OPD, or MOPD.
-
-
The simplest useful schema is:
\[\text{Base Model} \rightarrow \text{SFT / cold start} \rightarrow \text{RL or specialist RL} \rightarrow \text{distillation / consolidation} \\ \rightarrow \text{alignment and domain-specific polishing}\] -
The major difference across recent recipes is where the recipe places the consolidation step. Some recipes use SFT trace distillation after training specialists. Others use MOPD, where the student samples its own rollouts and multiple teachers provide dense token-level feedback over those rollouts.
Background: InstructGPT-Style RLHF
-
Training language models to follow instructions with human feedback by Ouyang et al. (2022) is the canonical early RLHF recipe:
-
SFT on human demonstrations.
-
Reward model training on human preference comparisons.
-
PPO against the reward model.
-
-
In this recipe, distillation is not the central consolidation mechanism. The key teacher signal is human feedback converted into a reward model, and policy optimization moves the model toward responses preferred by that reward model.
-
The following figure (source) shows the canonical InstructGPT-style three-step RLHF pipeline: collect demonstration data and train a supervised policy, collect comparison data and train a reward model, then optimize the policy against the learned reward model using reinforcement learning.

Llama 2
-
Llama 2: Open Foundation and Fine-Tuned Chat Models by Touvron et al. (2023) used a multi-stage RLHF recipe with SFT followed by iterative RLHF over multiple rounds.
-
Each round used rejection sampling followed by PPO, and the recipe used separate reward models for helpfulness and safety. This made Llama 2 a practical implementation of the early RLHF pattern, but with more iteration, more filtering, and more safety-specific modeling than the original InstructGPT-style formulation.
-
The following figure (source) shows the Llama 2 post-training pipeline, including pretraining, supervised fine-tuning, human feedback, safety and helpfulness reward models, rejection sampling, PPO, and iterative improvement of Llama 2-Chat.

Llama 3 and Tülu-Style Open Recipes
-
The Llama 3 Herd of Models by Dubey et al. (2024) used a complex multi-stage recipe with simpler optimizers. A typical round involved reward-model training, sampling multiple responses per prompt, rejection sampling, SFT, and DPO. The reward model mostly filtered samples rather than serving as the target of an online PPO-style loop.
-
The following figure (source) shows the Llama 3 post-training loop: collected prompts are expanded into multiple generations per prompt, filtered through a reward model and rejection sampling, converted into SFT data, trained into an SFT model, and then refined through DPO across multiple rounds.

-
Tülu 3: Pushing Frontiers in Open Language Model Post-Training by Lambert et al. (2024) is a clean open-recipe example: curated prompts, SFT, DPO, and RLVR. This recipe helped formalize RL with verifiable rewards as a core open post-training stage.
-
The following figure (source) shows the Tülu 3 three-stage open post-training recipe: public and synthetic prompt curation, SFT data mixing, direct preference optimization with both on-policy and off-policy data, and RL with verifiable rewards, with development and unseen evaluations throughout.

- A practical observation from Frontier post-training recipe review with Finbarr Timbers is that DPO-style stages become less central in some later frontier recipes as the rest of the pipeline becomes cleaner, more on-policy, and more environment-driven. This does not mean preference optimization is obsolete; rather, it becomes one tool among reward modeling, RLVR, trace filtering, and distillation.
OLMo 3
-
OLMo 3 can be read as a reasoning update to the Tülu-style open recipe. The recipe separates thinking and instruction behavior while still preserving a relatively simple staged structure compared with the frontier-lab recipes that use many specialist teachers and MOPD.
-
The high-level pattern is:
-
Pretraining and midtraining.
-
A long-context OLMo 3 base model.
-
Separate think and instruct post-training paths.
-
SFT, DPO, and RLVR stages for think and instruct variants.
-
An RL-Zero-style RLVR branch for reasoning exploration.
-
-
The following figure (source) shows the OLMo 3 recipe with pretraining, midtraining, long-context base adaptation, separate Think SFT, Think DPO, Think RLVR, Instruct SFT, Instruct DPO, Instruct RLVR branches, and an RL-Zero RLVR path.

DeepSeek-R1-Style Reasoning RL
-
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by Guo et al. (2025) shifted attention toward large-scale reasoning RL, where verifiable rewards and long reasoning traces become central rather than peripheral.
-
The recipe:
-
R1-Zero uses pure RL, specifically GRPO, on the base model without SFT, primarily to seed reasoning behaviors for the full run rather than to serve as the final product.
-
R1 uses cold-start SFT, reasoning RL, rejection-sampling SFT, and final RL.
-
Large-scale RLVR becomes the primary driver, while SFT is used to distill, clarify, and refine RL-emergent reasoning behaviors.
-
-
In this style of recipe, the core model improvement comes from RL on verifiable reasoning tasks. Distillation then becomes important either for transferring the resulting reasoning behavior into smaller models or for consolidating later specialist variants.
-
The following figure (source) shows the DeepSeek-R1 recipe: an R1-Zero branch with RL on reasoning prompts and accuracy or format rewards, a cold-start SFT path into reasoning RL, sampling and SFT on reasoning and non-reasoning data, and a final RL stage with rule-based and preference rewards.

DeepSeek Evolution After V3
-
DeepSeek’s post-training evolution is useful because it shows a transition from relatively standard SFT and GRPO-style RL toward specialist creation and finally MOPD-style consolidation.
-
The high-level progression is:
-
V3, Dec. 2024: SFT plus GRPO RL.
-
R1, Jan. 2025: multi-stage RL where reasoning emerges.
-
V3.1, Aug. 2025: hybrid think and non-think behavior in one model.
-
V3.2, Dec. 2025: six specialists via RL, followed by SFT distillation into one mixed GRPO run.
-
V4, Apr. 2026: ten-plus domain experts consolidated through MOPD.
-
-
The following figure (source) shows the DeepSeek evolution after V3, moving from SFT plus GRPO RL, to R1-style multi-stage reasoning RL, to hybrid think and non-think modeling, to specialist RL with SFT distillation, and finally to ten-plus experts merged with MOPD.

MiMo-V2-Flash
-
MiMo-V2-Flash Technical Report by the Xiaomi MiMo Team (2026) is a clean early articulation of MOPD as a post-training consolidation primitive.
-
The high-level recipe is:
-
Stage 1: SFT to establish the general student.
-
Stage 2: train several domain-specialist teachers, typically using SFT and RL on the relevant domains.
-
Stage 3: consolidate those teachers into one student through MOPD.
-
-
This replaces a single monolithic RL run with a more modular workflow: specialists can be trained in parallel, and a final student can absorb their strengths through dense token-level feedback on its own trajectories.
-
The following figure (source) shows the MiMo-V2-Flash recipe as a three-stage MOPD pipeline: SFT data creates an SFT model, domain-specialized training produces teachers for search, code, math, reasoning, safety, and related domains, and Stage 3 uses student rollouts plus token-level teacher rewards to distill the specialists into one student.

Nemotron-Cascade 2
-
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) offers one of the clearest examples of distillation as a mid-pipeline stabilization point.
-
Its high-level recipe is:
\[\text{Base Model} \rightarrow \text{SFT} \rightarrow \text{Instruction-Following RL} \rightarrow \text{Multi-domain RL} \rightarrow \text{Multi-domain OPD} \\ \rightarrow \text{RLHF} \rightarrow \text{Long-context RL} \rightarrow \text{Code RL} \rightarrow \text{SWE RL}\]
-
The ordering is not arbitrary. The recipe begins with instruction-following RL to establish strict instruction adherence, moves to multi-domain RL for tool-calling, STEM reasoning, and format adherence, inserts multi-domain OPD to unify specialized expertise and recover regressions, and then continues with RLHF, long-context RL, code RL, and software-engineering RL.
-
The key lesson is that OPD can serve as a stabilization and regression-recovery stage inside a longer RL cascade. It is not only an alternative to RL; it can be placed between RL stages to consolidate what earlier stages learned before later stages specialize further.
Nemotron 3 Ultra
-
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales the specialist-teacher recipe to a large hybrid Mamba-attention MoE model and uses SFT, unified RLVR, MOPD, and reasoning budget control to produce a long-context, high-throughput agentic model.
-
Its high-level recipe can be summarized as:
\[\text{Base Model} \rightarrow \text{SFT} \rightarrow \text{Unified RLVR} \rightarrow \text{Specialist Teacher Training} \rightarrow \text{MOPD Warmup} \\ \rightarrow \text{Multi-teacher OPD} \rightarrow \text{MTP / inference-oriented boosting}\]
-
The SFT stage includes domain-specific synthetic and filtered data for terminal use, conversational tool use, software issue resolution, math and proof, science, code, multilingual behavior, long context, chat, and professional workplace tasks.
-
The RLVR stage spans many environments, including terminal usage, office and productivity workflows, software engineering, search, tool calling, math, code, STEM, safety, chat, instruction following, long-context QA, structured outputs, inductive and transductive reasoning, and general usability.
-
The MOPD stage then consolidates specialized teachers into the general student. A central empirical observation is that MOPD is most effective when the teacher’s advantage can be expressed as token-level preferences over trajectories the student can already sample. This is why warmup and support overlap matter: if the student rarely visits the reasoning paths that encode the teacher’s missing capability, token-level OPD has a weaker signal.
-
The following figure (source) shows the Nemotron 3 Ultra two-iteration MOPD recipe: a prepared RLVR student is distilled from multiple first-round teachers into Ultra MOPD 1, then refreshed teachers and reused teachers provide a second MOPD iteration that produces the Ultra final model.

MAI-Thinking-1
-
MAI-Thinking-1 uses a more conservative recipe that is closer to DeepSeek-R1-style staged RL than to V4-style MOPD.
-
The high-level recipe is:
-
Start from a mid-trained base model.
-
Train several specialist RL “climbs,” such as SWE or agentic, STEM, and helpfulness or safety climbs.
-
Consolidate the specialist climbs through trace-distillation SFT.
-
Run a final RL climb to produce MAI-Thinking-1.
-
-
The key distinction is that consolidation is performed by trace-distillation SFT rather than on-policy multi-teacher distillation. This makes it a useful contrast case: not every 2026 frontier-style recipe uses MOPD, even when it uses specialist teachers.
-
The following figure (source) shows the MAI-Thinking-1 recipe: a mid-trained model branches into SWE or agentic, STEM, and helpfulness or safety climbs, each producing a teacher; trace-distillation SFT consolidates these into one model, and a final climb produces MAI-Thinking-1.

Kimi K2.5
-
Kimi K2.5: Visual Agentic Intelligence by Moonshot AI et al. (2026) is an agentic and multimodal recipe centered on text-only SFT followed by joint text-vision RL across coding, vision, reasoning, and agentic tasks.
-
The public recipe discussion does not foreground MOPD. Instead, it emphasizes building an agentic multimodal system that can coordinate subagents and tools across tasks.
-
The following figure (source) shows the Kimi K2.5 agentic workflow: an orchestrator creates subagents, assigns tasks to AI researchers, physics researchers, life-sciences researchers, fact checkers, file downloaders, and web developers, then aggregates task results into final results.

GLM-5
-
GLM-5: from Vibe Coding to Agentic Engineering by GLM et al. (2026) uses staged RL by capability: Base, SFT, Reasoning RL, Agentic RL, and General RL.
-
The recipe is simpler to describe than many MOPD-heavy systems, but the diagram includes on-policy cross-stage distillation, where logits or weights from earlier post-training stages inform later stages. This makes GLM-5 a useful middle case: it is not framed as many-teacher MOPD in the same way as MiMo or Nemotron 3 Ultra, but it still uses on-policy distillation-like cross-stage knowledge transfer.
-
The following figure (source) shows the GLM-5 training recipe: pretraining on general and code or reasoning corpora, midtraining for long code, reasoning, long-context, and agent data, sparse-attention adaptation, followed by SFT, Reasoning RL, Agentic RL, General RL, and an on-policy cross-stage distillation block connected by logits and weights.

Distillation Inside Frontier Post-Training Recipes
-
Frontier post-training recipes usually combine multiple optimization stages rather than applying one distillation method in isolation. The classical three-stage RLHF pattern used supervised fine-tuning, reward modeling, and policy optimization. More recent reasoning-model recipes increasingly use larger RLVR stages, specialist teachers, staged domain curricula, trace distillation, and MOPD.
-
A useful high-level progression is:
-
Classical RLHF recipe: SFT, reward model training, RLHF.
-
Reasoning-focused RL recipe: SFT or mid-training, larger RLVR stages, reasoning trace refinement, sometimes with final SFT or trace distillation.
-
Specialist-consolidation recipe: train several domain-specialist teachers, then merge their strengths into a single student through sequence-level distillation, trace-distillation SFT, or MOPD.
-
2026-style MOPD recipe: train many specialist teachers across domains, sample student rollouts, score those rollouts with the relevant teachers, and consolidate expertise through token-level on-policy losses.
-
-
Frontier post-training recipe review with Finbarr Timbers characterizes multi-teacher on-policy distillation as a major emerging convergence point in frontier post-training, especially for recipes that must combine reasoning, coding, tool use, instruction following, and agentic behavior without allowing one domain’s RL to erase another’s capability.
-
This also reframes distillation as a recipe-management tool. The problem is not merely that the student should imitate a teacher; it is that a post-training organization may have many domain experts, many environments, many data pipelines, and many intermediate checkpoints, and distillation is the mechanism that converts those distributed gains into one deployable model.
Cascade RL and Multi-Domain OPD as a Foundation
-
Cascade RL is a modern foundation for understanding why distillation is needed after staged reinforcement learning. Instead of training all domains jointly, Cascade RL trains domains sequentially or in compatible multi-domain groups, making it easier to tune curricula, response lengths, verification costs, and reward functions.
-
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) applies SFT first, then a Cascade RL process that includes instruction-following RL, multi-domain RL, multi-domain OPD, RLHF, long-context RL, code RL, and software-engineering RL.
-
The central design principle is to mitigate inter-domain interference. Some tasks should be trained earlier because they establish broad priors, while others are specialized refinements. Some domains can be grouped when response formats and verification costs are similar, while conflicting domains should be separated to avoid destructive interference.
-
Multi-domain OPD then acts as a regression-recovery and capability-consolidation step. When later RL stages improve one domain but harm another, the strongest intermediate teacher for each domain can supervise the student on its own rollouts, helping recover benchmark regressions while retaining improvements from the broader Cascade RL process.
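The teacher-selection idea in the paragraph above can be illustrated with a small sketch. It assumes a hypothetical evaluation table that maps each intermediate checkpoint to per-domain benchmark scores; the checkpoint names, domains, and numbers are invented for illustration and are not taken from any cited report.

```python
def select_domain_teachers(scores):
    """Pick, for each domain, the checkpoint with the highest benchmark score.

    scores: {checkpoint_name: {domain: score}} (hypothetical evaluation table).
    Returns {domain: checkpoint_name}. Ties keep the earliest checkpoint.
    """
    teachers = {}
    for ckpt, per_domain in scores.items():
        for domain, score in per_domain.items():
            best = teachers.get(domain)
            if best is None or score > scores[best][domain]:
                teachers[domain] = ckpt
    return teachers


# Illustrative scores after three cascade stages (made-up numbers).
scores = {
    "after_if_rl":   {"instruction": 0.82, "math": 0.61, "code": 0.55},
    "after_math_rl": {"instruction": 0.78, "math": 0.74, "code": 0.57},
    "after_code_rl": {"instruction": 0.75, "math": 0.70, "code": 0.68},
}
teachers = select_domain_teachers(scores)

# Domains where the latest checkpoint is no longer the best one are the
# regressions that a consolidation stage would try to recover.
latest = "after_code_rl"
regressed = sorted(d for d, ckpt in teachers.items() if ckpt != latest)
```

In this toy table the latest checkpoint is strongest only on code, so instruction following and math would be supervised by earlier checkpoints during consolidation.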
-
A simplified Cascade RL plus OPD formulation can be written as:
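\[\theta_{d+1} = \operatorname{RL}_d(\theta_d)\]
-
for domain-specific RL stage \(d\), followed by a consolidation step:
\[\theta^{\star} = \arg\min_{\theta} \; \mathbb{E}_{x,\ \hat{y} \sim \pi_{\theta}(\cdot \mid x)} \left[ \sum_{k} w_k \sum_{t} D\left( \pi_{\theta}(\cdot \mid x, \hat{y}_{<t}) \,\Vert\, \pi_{T_k}(\cdot \mid x, \hat{y}_{<t}) \right) \right]\]
-
where \(T_k\) is a domain-specialist teacher, \(D\) is a token-level divergence such as reverse KL, and \(w_k\) routes or weights teacher feedback by domain, prompt, confidence, or checkpoint quality.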
-
A sampled-token MOPD advantage can be written as:
\[a_t^{\text{MOPD}} = \log \pi_{T_k}(\hat{y}_t \mid s_t) - \log \pi_{\theta}(\hat{y}_t \mid s_t)\]
-
where \(s_t=(x,\hat{y}_{<t})\), \(\hat{y}_t\) is the token the student sampled at step \(t\), and the selected teacher \(T_k\) corresponds to the relevant domain. This form is attractive because it provides dense token-level feedback without requiring full-vocabulary teacher distributions at every prefix.
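As a minimal sketch of the sampled-token advantage, the snippet below subtracts student log-probabilities from teacher log-probabilities on the tokens the student actually sampled. The probabilities, the two domain names, and the dictionary-based routing are hypothetical stand-ins; a real system would obtain the log-probabilities from teacher and student forward passes.

```python
import math


def mopd_advantages(teacher_logprobs, student_logprobs):
    """Per-token advantage: teacher log-prob minus student log-prob
    on each token of the student's own rollout."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same sampled tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Hypothetical probabilities each model assigns to the three tokens that the
# student sampled in one rollout (illustrative numbers only).
student_logprobs = [math.log(p) for p in (0.6, 0.4, 0.5)]
teacher_logprobs_by_domain = {
    "math": [math.log(p) for p in (0.9, 0.2, 0.5)],
    "code": [math.log(p) for p in (0.3, 0.3, 0.3)],
}

domain = "math"  # assumed to come from a prompt-level router
advantages = mopd_advantages(teacher_logprobs_by_domain[domain], student_logprobs)
```

A positive entry means the teacher prefers the sampled token more than the student does, a negative entry means the teacher prefers it less, and zero means the two models agree on that token.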
Limitations of Classical Distillation
-
Despite its effectiveness, classical distillation suffers from several structural limitations:
-
Distribution mismatch: The student is trained on fixed trajectories, such as ground-truth or teacher-generated sequences, but at inference it generates its own tokens. Errors compound because it encounters states not seen during training. This issue is highlighted in the imitation learning literature and explicitly discussed in On-Policy Distillation of Language Models.
-
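Teacher bias and mode coverage: Forward KL is mode-covering. It pushes the student to place probability mass on every teacher mode, which can lead to overly smooth or low-confidence outputs when the student cannot fit them all. Reverse KL has the opposite tendency: it is mode-seeking and can collapse onto a few teacher modes.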
-
Capacity mismatch: If the student cannot represent the teacher distribution, minimizing forward KL may produce unrealistic samples or unstable behavior.
-
Data inefficiency: Off-policy distillation may waste training effort on trajectories the student would never generate, reducing practical efficiency.
-
Teacher staleness in offline KD: A frozen teacher cannot adapt to the student’s changing failure modes, which can limit the usefulness of teacher feedback late in training.
-
Non-stationarity in online KD: A co-trained or evolving teacher can provide fresher supervision, but the target distribution changes over time, making optimization and reproducibility harder. Deep Mutual Learning by Zhang et al. (2017) shows that collaboratively trained peers can outperform a static-teacher setup, but the method also shifts distillation from a simple one-way transfer problem into a coupled multi-model optimization problem.
-
Inter-domain interference: A model improved through RL or distillation in one domain can regress in another domain. Cascade-style post-training treats this as a recipe-ordering and consolidation problem rather than only a loss-design problem.
-
Support mismatch in MOPD: Multi-teacher OPD works best when the student’s rollouts fall within the regions where the relevant teacher provides meaningful token preferences. If the teacher and student are too far apart, dense distribution matching can propagate noise rather than knowledge.
-
Recipe complexity: As post-training moves from simple SFT, DPO, and RL stages to many teachers, many domains, and many environments, the limiting factor becomes not only the loss but the system’s ability to coordinate compute, data, environments, teacher checkpoints, and evaluation.
-
Implementation Considerations
-
In modern LLM systems, classical distillation requires careful engineering:
-
Log-probability extraction: The teacher must provide token-level log-probabilities. This is often done via a separate inference server, for example vLLM-based systems, with batched requests and compressed logprob transmission.
-
Top-\(k\) approximation: Full-vocabulary KL is expensive. Approximations using top-\(k\) tokens reduce memory and bandwidth requirements, especially for large vocabularies of roughly \(100{,}000\) tokens or more.
-
Batching and caching: Efficient pipelines buffer student generations and batch teacher evaluations to amortize cost, enabling distillation even from 100B+ models at scale.
-
Hybrid objectives: Many systems combine supervised fine-tuning, distillation, and reinforcement learning signals in a single pipeline.
-
Offline execution pattern: Offline KD usually separates teacher inference from student optimization. Teacher completions, logits, hidden states, or labels can be precomputed, cached, audited, and reused across multiple student runs.
-
Online execution pattern: Online KD requires coordination among multiple co-trained models or periodically refreshed teachers. This adds communication overhead, synchronization complexity, and non-stationary targets, but it can provide more adaptive supervision.
-
Semi-online compromise: Semi-online systems periodically refresh teacher checkpoints or add shadow teachers while preserving some stability of offline KD. Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) studies this intermediate regime and frames it as a bridge between static offline transfer and fully online knowledge exchange.
-
Domain-wise recipe ordering: Cascade-style training requires decisions about which domain to optimize first, which domains can be trained together, and where distillation should be inserted to recover regressions. Nemotron-Cascade 2 by Yang et al. (2026) places multi-domain OPD after earlier RL stages to unify specialist expertise before continuing with additional alignment, long-context, coding, and software-engineering RL.
-
Environment and verifier design: RL-integrated distillation depends on the availability of reliable environments, verifiers, or reward signals. Agentic and tool-use tasks require not only prompts but also executable environments, test harnesses, tool protocols, and reward computation.
-
SFT mixture construction: Large SFT mixtures should often be balanced by token budget rather than example count because response lengths vary drastically across domains. Long reasoning traces, tool-integrated reasoning, short chat answers, and software-engineering trajectories can otherwise dominate or vanish unintentionally.
-
Deduplication and filtering: Teacher-generated data must be filtered for correctness, format compliance, tool-call hygiene, duplication, and undesirable behavioral patterns. For coding and agentic tasks, test cases, OnlineJudge systems, trajectory analyzers, and LLM judges can provide quality filters before distillation.
-
Teacher release and reproducibility: Multi-teacher OPD is difficult to study if only the final student is released. Intermediate teachers and starting checkpoints are valuable research artifacts because they make it possible to analyze support overlap, teacher-student divergence, recovery rates, and the marginal value of each distillation stage.
-
Organizational scaling: Frontier post-training is also an organizational problem. A recipe with many specialist teachers, RL environments, verifiers, and data pipelines requires teams to coordinate compute, data production, evaluation, and model release decisions. Distillation is often the final mechanism that turns that distributed organizational work into a single deployable model.
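The top-\(k\) approximation mentioned in the list above can be sketched as follows. This is one possible variant, chosen here as an assumption rather than taken from a cited system: keep the teacher's top-\(k\) tokens and fold all remaining probability mass into a single tail bucket for both models, then compute forward KL on the coarsened distributions. Because merging outcomes cannot increase KL, this variant never exceeds the full-vocabulary value.

```python
import math


def kl(p, q):
    """Forward KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topk_kl(teacher, student, k, eps=1e-12):
    """Forward KL on the teacher's top-k tokens plus one tail bucket."""
    top = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)[:k]
    p = [teacher[i] for i in top]
    q = [student[i] for i in top]
    # Fold the remaining mass of each model into a single tail bucket.
    p.append(max(0.0, 1.0 - sum(p)))
    q.append(max(eps, 1.0 - sum(q)))  # eps guards against division by zero
    return kl(p, q)


# Toy six-token vocabulary (illustrative numbers only).
teacher = [0.70, 0.15, 0.08, 0.04, 0.02, 0.01]
student = [0.50, 0.20, 0.10, 0.10, 0.05, 0.05]

full_kl = kl(teacher, student)
approx_kl = topk_kl(teacher, student, k=2)
```

Only \(k\) teacher probabilities per position need to be transmitted or stored in this scheme, at the cost of discarding how the teacher distributes mass within the tail.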
-
Offline Distillation
-
Offline distillation is the classical and historically dominant form of knowledge distillation. In offline distillation, the teacher model is trained beforehand and then frozen. The student is subsequently optimized to imitate this fixed teacher using either precomputed teacher outputs or teacher evaluations generated during training. Because the teacher does not change, the supervision signal is stationary, which makes offline distillation stable, reproducible, and comparatively simple to implement.
-
The main idea to carry through this section is that “offline” refers to teacher dynamics rather than data storage. A teacher may be queried live during training and still be offline if its parameters are frozen. Conversely, an offline corpus may contain highly sophisticated synthetic artifacts, including reasoning traces, critiques, tool-use transcripts, and verifier-filtered solutions. Thus, offline distillation is not necessarily simple in content, even when the optimization setup is simple.
-
Offline distillation remains the default starting point for most practical pipelines because it is easy to audit, cache, reuse, and scale. It is especially useful when the goal is broad capability transfer, cold-starting a student before RL, compressing frontier behavior into a cheaper model, or consolidating specialist outputs through trace-distillation SFT.
-
Its main limitation is that, unless it is combined with student-generated rollouts later, the student trains on trajectories produced by someone else. This can create train-inference mismatch: the student may behave well on teacher-written prefixes but fail to recover when its own early tokens move the sequence into unfamiliar states.
-
Most of the literature traditionally referred to as “knowledge distillation” implicitly assumes an offline setting. Distilling the Knowledge in a Neural Network by Hinton et al. (2015) is the canonical example: a pretrained ensemble or large model produces softened probability distributions that supervise a smaller student. DistilBERT by Sanh et al. (2019) similarly uses a frozen BERT teacher to train a compact transformer.
Definition
-
Let \(p_T\) denote a pretrained teacher and \(p_S^\theta\) the student. Offline distillation optimizes:
\[\mathcal{L}_{\text{offline}}(\theta) =\mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t} D\left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x, y_{<t}) \right) \right]\]
-
where:
-
\(\mathcal{D}\) is a fixed or externally generated dataset.
-
\(p_T\) is frozen throughout training.
-
\(D\) is typically forward KL, reverse KL, JSD, or cross-entropy.
-
-
-
The defining property is that the teacher parameters remain constant:
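\[\phi_k = \phi_0 \quad \text{for every training step } k\]
-
where \(\phi\) denotes teacher parameters. Equivalently, the teacher is treated as a constant (stop-gradient) inside \(\mathcal{L}_{\text{offline}}\), so only \(\theta\) is updated.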
-
Offline distillation can use either hard targets or soft targets. In hard-target offline distillation, the teacher produces a sequence \(y_T\) and the student maximizes likelihood:
\[\mathcal{L}_{\text{hard-offline}}(\theta) =\mathbb{E}_{x} \left[ -\log p_S^\theta(y_T \mid x) \right]\]
-
where \(y_T \sim p_T(\cdot \mid x)\) is the sequence generated by the frozen teacher.
-
In soft-target offline distillation, the teacher provides token-level probability distributions:
\[\mathcal{L}_{\text{soft-offline}}(\theta) =\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{t} D_{KL} \left( p_T(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right)\]
-
The hard-target version is cheaper and easier to store; the soft-target version preserves uncertainty, alternative continuations, and richer teacher preferences.
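The difference between the two objectives can be made concrete with a toy example. The sketch below assumes a three-token vocabulary and a two-position sequence with made-up per-position distributions; the hard target is taken to be the teacher's greedy token at each position, which is one common choice rather than the only one.

```python
import math


def hard_offline_loss(student_dists, teacher_tokens):
    """Negative log-likelihood of the teacher's token sequence under the student."""
    return -sum(math.log(dist[tok]) for dist, tok in zip(student_dists, teacher_tokens))


def soft_offline_loss(teacher_dists, student_dists):
    """Sum over positions of forward KL(teacher || student)."""
    total = 0.0
    for p_t, p_s in zip(teacher_dists, student_dists):
        total += sum(p * math.log(p / q) for p, q in zip(p_t, p_s) if p > 0)
    return total


# Illustrative per-position distributions over a 3-token vocabulary.
teacher_dists = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
student_dists = [[0.5, 0.25, 0.25], [0.4, 0.4, 0.2]]
teacher_tokens = [0, 1]  # greedy teacher tokens, assumed as the hard target

hard = hard_offline_loss(student_dists, teacher_tokens)
soft = soft_offline_loss(teacher_dists, student_dists)
```

The hard loss only needs one token id per position, while the soft loss needs the teacher's full distribution at every position, which is the storage trade-off described above.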
Relationship to Off-Policy and On-Policy Distillation
-
Offline distillation and off-policy distillation are closely related but not identical concepts.
-
Offline versus online: whether the teacher is frozen or co-trained.
-
Off-policy versus on-policy: where trajectories come from.
-
-
Most offline distillation is also off-policy, because the student trains on fixed human or teacher-generated sequences. However, offline distillation can also be on-policy if a frozen teacher evaluates rollouts generated by the current student. This is precisely the setup used in many modern on-policy LLM distillation methods, including On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024), where the teacher remains fixed but the trajectory distribution changes over time.
-
Thus, “offline” refers to teacher dynamics, while “on-policy” refers to data dynamics.
-
This distinction matters for modern post-training recipes. A pipeline may begin with offline, off-policy SFT on teacher-generated traces, proceed to RL on student-generated rollouts, and later use frozen specialist teachers for on-policy multi-teacher distillation. The teacher can be frozen in both the SFT and MOPD stages, while the trajectory source changes from external teacher traces to student rollouts.
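The two axes can be encoded as a small data structure, which makes it explicit that they vary independently. The class name, field names, and stage labels below are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillationStage:
    name: str
    teacher_frozen: bool    # True: offline, False: online (teacher keeps training)
    student_rollouts: bool  # True: on-policy, False: off-policy

    def label(self) -> str:
        teacher_axis = "offline" if self.teacher_frozen else "online"
        data_axis = "on-policy" if self.student_rollouts else "off-policy"
        return f"{teacher_axis}, {data_axis}"


stages = [
    DistillationStage("SFT on teacher-generated traces", True, False),
    DistillationStage("MOPD with frozen specialist teachers", True, True),
    DistillationStage("co-trained peers on a fixed dataset", False, False),
]
labels = {stage.name: stage.label() for stage in stages}
```

The first two entries correspond to the SFT and MOPD stages of the pipeline described above: the teacher is frozen in both, and only the trajectory source changes.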
Common Forms
-
Offline distillation encompasses many of the most widely used distillation approaches:
-
Soft-label distillation: The teacher provides full probability distributions over classes or tokens, often softened with temperature scaling.
-
Sequence-level distillation: The teacher generates complete outputs that become training targets, as introduced in Sequence-Level Knowledge Distillation by Kim and Rush (2016).
-
Trace-distillation SFT: The teacher generates full reasoning traces, tool-use transcripts, or specialist trajectories, and the student is trained by supervised fine-tuning on those traces.
-
Representation distillation: The student matches hidden states, embeddings, attention maps, or intermediate activations.
-
Preference and reward distillation: The teacher provides rankings, scalar rewards, critiques, or pairwise preferences rather than direct logits.
-
Synthetic-data distillation: The teacher produces answers, explanations, code, tool traces, multi-turn conversations, or benchmark-style solutions that become a reusable training corpus.
-
Precomputed versus live teacher querying: Offline distillation does not require that teacher outputs be fully precomputed.
-
Precomputed offline distillation: Teacher outputs are generated once and stored.
-
Live offline distillation: The frozen teacher is queried during training, but its parameters remain unchanged.
-
Both are considered offline because the teacher itself is static.
-
-
Trace-Distillation SFT
-
Trace-distillation SFT is one of the most important modern forms of offline distillation. Instead of matching only the teacher’s final answer, the student learns from the teacher’s full trajectory:
\[y_T = (r_T, a_T)\]
-
where \(r_T\) may contain reasoning steps and \(a_T\) may contain a final answer, tool call, patch, proof, or action sequence.
-
The objective is ordinary supervised likelihood:
\[\mathcal{L}_{\text{trace-SFT}}(\theta) =\mathbb{E}_{(x,y_T)\sim\mathcal{D}_T} \left[ -\sum_t \log p_S^\theta(y_{T,t} \mid x,y_{T,<t}) \right]\]
-
The simplicity of this objective is the strength of trace distillation: once the teacher traces are generated and filtered, student training is just supervised fine-tuning.
-
Trace distillation is also a natural consolidation mechanism after specialist RL. A domain-specialist model can run on prompts from its domain, produce high-quality traces, and create a dataset that trains a single general student. This is the offline counterpart to MOPD: both consolidate specialists, but trace-distillation SFT uses teacher-generated trajectories, while MOPD uses student-generated trajectories scored by teachers.
-
MAI-Thinking-1 is a useful example of a modern recipe that consolidates specialist RL climbs using trace-distillation SFT rather than multi-teacher on-policy distillation. The high-level pattern is:
\[\text{Mid-trained Model} \rightarrow \text{Specialist RL Climbs} \rightarrow \text{Trace-Distillation SFT} \rightarrow \text{Final RL Climb}\]
-
This makes trace-distillation SFT a conservative but robust way to merge specialist behavior when the infrastructure or support-overlap conditions for MOPD are not yet mature.
Offline Distillation in Synthetic SFT Pipelines
-
Many modern SFT stages are offline distillation pipelines even when they are not described that way. A teacher generates synthetic responses, the responses are filtered or curated, and the student is trained by supervised learning on the resulting corpus.
-
A typical synthetic SFT distillation workflow is:
-
Collect prompts from target domains.
-
Use strong teachers to generate candidate responses, reasoning traces, tool calls, or multi-turn conversations.
-
Filter outputs with verifiers, unit tests, LLM judges, reward models, format checkers, or human review.
-
Deduplicate examples and remove low-quality or unsafe artifacts.
-
Train the student on the retained traces with cross-entropy.
-
-
In code and math domains, filtering is especially important because correctness can often be verified. For example, a coding teacher may generate multiple candidate solutions; examples are retained only if they pass tests or satisfy an execution-based verifier.
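A minimal sketch of the filter-and-deduplicate steps for a coding domain is shown below. It assumes each candidate defines a function named `solve` and that tests are `(args, expected)` pairs; both conventions are hypothetical. Calling `exec` on generated code is only acceptable in a toy setting, and a real pipeline would run candidates inside a sandbox.

```python
def passes_tests(source, tests):
    """Run candidate source in a fresh namespace and check (args, expected) pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # toy only: untrusted code needs a sandbox
        return all(namespace["solve"](*args) == expected for args, expected in tests)
    except Exception:
        return False


def filter_candidates(candidates, tests):
    """Keep candidates that pass the tests, dropping whitespace-normalized duplicates."""
    seen = set()
    kept = []
    for source in candidates:
        key = " ".join(source.split())
        if key in seen:
            continue
        seen.add(key)
        if passes_tests(source, tests):
            kept.append(source)
    return kept


candidates = [
    "def solve(a, b):\n    return a + b",   # correct
    "def solve(a, b):\n    return a - b",   # wrong answer
    "def solve(a, b):\n    return a + b",   # exact duplicate of the first
    "def solve(a, b) return a + b",         # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0)]
kept = filter_candidates(candidates, tests)
```

Only the first candidate survives: the second fails a test, the third is a duplicate, and the fourth does not compile.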
-
In long-context, chat, and instruction-following domains, filtering is more difficult because correctness is less binary. These datasets often rely on a mixture of teacher selection, LLM judging, preference scoring, length constraints, formatting checks, and manual audits.
-
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) illustrates this at scale: its SFT stage uses teacher-generated and filtered data across math, code reasoning, science, long context, general chat, instruction following, safety, tool use, and software-engineering-like tasks before later RL and multi-domain OPD.
-
This shows that offline distillation is often the first stage of a much larger post-training recipe. It establishes broad capability and formatting behavior, while later RL or OPD stages improve robustness, recover regressions, or teach the student how to behave on its own rollouts.
Offline Distillation versus On-Policy Consolidation
-
The key distinction between offline trace distillation and on-policy distillation is the trajectory distribution.
-
Offline trace distillation uses:
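\[y \sim p_T(\cdot \mid x)\]
-
and trains the student by maximum likelihood on the teacher's trajectory:
\[\mathcal{L}_{\text{trace-SFT}}(\theta) = -\sum_t \log p_S^\theta(y_t \mid x, y_{<t})\]
-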
On-policy distillation uses:
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\]
-
and trains the student from teacher feedback on \(\hat{y}\).
-
Offline trace distillation answers the question: “What would the teacher do on this prompt?”
-
On-policy distillation answers the question: “Given what the student actually did, how would the teacher score or correct each token?”
-
This distinction becomes central in long-horizon tasks. In short tasks, teacher-generated traces may be close enough to what the student would produce. In agentic, coding, tool-use, or reasoning tasks with thousands of tokens, small early deviations can create states that never appear in the offline dataset.
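A back-of-the-envelope calculation illustrates why sequence length matters. The sketch below makes a strong simplifying assumption that is not from the source: at every token the student independently leaves the set of teacher-covered prefixes with a fixed probability `eps`. Under that assumption the chance of staying on covered prefixes for a whole sequence decays geometrically with length.

```python
def on_trace_probability(eps, num_tokens):
    """Probability of never leaving teacher-covered prefixes, assuming an
    independent per-token deviation probability eps (a simplification)."""
    return (1.0 - eps) ** num_tokens


eps = 0.001  # hypothetical per-token deviation rate
p_short = on_trace_probability(eps, 100)    # short answer
p_long = on_trace_probability(eps, 5000)    # long agentic or reasoning trace
```

With these illustrative numbers, a 100-token response stays on covered prefixes about 90% of the time, while a 5,000-token trajectory does so less than 1% of the time.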
-
A practical training pattern is therefore:
\[\text{Synthetic / Trace SFT} \rightarrow \text{RL or RLVR} \rightarrow \text{OPD or MOPD}\]
-
where offline distillation gives the student a strong prior and on-policy methods later correct its self-generated failure modes.
Advantages
-
Stability: The target distribution does not change during training.
-
Reproducibility: Repeated runs see identical teacher behavior if the dataset and teacher outputs are fixed.
-
Engineering simplicity: Teacher and student optimization are decoupled.
-
Caching efficiency: Teacher outputs can be stored and reused.
-
Scalability: Large teachers can supervise many student experiments.
-
Auditability: Datasets can be inspected, filtered, deduplicated, benchmarked, and versioned before training.
-
Parallelizable data generation: Multiple teachers or teacher servers can generate data independently before the student training run begins.
-
Compatibility with standard training stacks: Once the data is created, the student can be trained with ordinary SFT infrastructure.
-
Useful cold start: Offline distillation often brings the student close enough to a target behavior that later RL or OPD can work efficiently.
Limitations
-
Teacher staleness: The teacher cannot adapt to the student’s evolving weaknesses.
-
Potential distribution mismatch: If training trajectories are fixed, the student may not learn to recover from its own mistakes.
-
Storage requirements: Precomputing token-level distributions can be expensive.
-
Capability ceiling: The student is fundamentally bounded by the teacher’s performance and biases unless later RL, search, or reward extrapolation introduces new signal.
-
Discarded uncertainty: Hard trace distillation trains on one teacher-selected output and may lose information about alternative valid continuations.
-
Over-imitation: The student may imitate teacher style, verbosity, tool-use habits, or reasoning artifacts even when those artifacts are not causally useful.
-
Long-horizon brittleness: In agentic settings, a student that deviates from the teacher trace may enter a state where the offline dataset provides little guidance.
-
Filtering bias: Verifier-based filtering can overrepresent outputs that are easy to verify and underrepresent ambiguous but useful behaviors.
-
Data-mixture sensitivity: If one domain has much longer traces or more examples, it can dominate the SFT objective unless mixture weights are carefully controlled.
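The data-mixture point can be made concrete with a small calculation. The sketch below assumes each domain is summarized by a hypothetical `(num_examples, avg_tokens)` pair, shows how token share diverges from example share, and derives example-sampling probabilities that hit a target token share in expectation. The domain names and sizes are invented for illustration.

```python
def token_shares(domains):
    """Fraction of training tokens per domain when every example is used once."""
    totals = {name: n * avg_len for name, (n, avg_len) in domains.items()}
    grand = sum(totals.values())
    return {name: t / grand for name, t in totals.items()}


def example_sampling_weights(domains, target_token_share):
    """Per-domain example-sampling probabilities whose expected token share
    matches the target (expected tokens per draw = prob * avg_len)."""
    raw = {name: target_token_share[name] / avg_len
           for name, (_, avg_len) in domains.items()}
    z = sum(raw.values())
    return {name: r / z for name, r in raw.items()}


def realized_token_share(domains, weights):
    """Expected token share per domain under the given sampling weights."""
    tokens = {name: weights[name] * domains[name][1] for name in domains}
    z = sum(tokens.values())
    return {name: t / z for name, t in tokens.items()}


# Hypothetical mixture: (number of examples, average tokens per example).
domains = {"chat": (10_000, 200), "reasoning": (10_000, 8_000), "swe": (2_000, 20_000)}
naive = token_shares(domains)
target = {name: 1.0 / len(domains) for name in domains}
weights = example_sampling_weights(domains, target)
realized = realized_token_share(domains, weights)
```

In this toy mixture chat holds about 45% of the examples but under 2% of the tokens, so balancing by example count would leave it nearly invisible to a token-averaged loss.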
Semi-Online Variants
-
Some systems partially relax the static-teacher assumption by periodically updating a teacher snapshot or ensemble. Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) studies this intermediate regime and argues that part of the online distillation advantage comes from reversed student-to-teacher transfer rather than only from simultaneous training.
-
Semi-online systems are useful when the fully online setup is too complex but a completely frozen teacher becomes stale. Common patterns include:
-
Refreshing a teacher checkpoint every few training phases.
-
Regenerating synthetic data after a student improves.
-
Adding stronger teachers for domains where the student still fails.
-
Using earlier student checkpoints as stabilizing teachers.
-
Maintaining a shadow teacher or exponential-moving-average teacher.
-
-
In frontier recipes, semi-online behavior often emerges organically: teams train specialists, generate traces, improve the base student, regenerate data, and then repeat. Each distillation run still uses a frozen teacher, but because teachers and data are refreshed between runs, the overall recipe behaves like a semi-online process built from successive offline rounds.
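Two of the patterns listed above, periodic checkpoint refresh and an exponential-moving-average teacher, can be sketched with plain lists standing in for parameter tensors. The function names, the decay value, and the refresh interval are hypothetical choices for illustration.

```python
def ema_update(teacher_params, student_params, decay):
    """Move the teacher a small step toward the student: t <- decay*t + (1-decay)*s."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


def maybe_refresh(step, refresh_every, teacher_params, student_params):
    """Hard refresh: copy the student into the teacher every refresh_every steps,
    otherwise leave the teacher frozen."""
    if step > 0 and step % refresh_every == 0:
        return list(student_params)
    return teacher_params


teacher = [0.0, 1.0]
student = [1.0, 1.0]

ema_teacher = ema_update(teacher, student, decay=0.9)
frozen_teacher = maybe_refresh(step=50, refresh_every=100,
                               teacher_params=teacher, student_params=student)
refreshed_teacher = maybe_refresh(step=100, refresh_every=100,
                                  teacher_params=teacher, student_params=student)
```

The EMA variant changes the target smoothly at every step, while the hard refresh keeps the target stationary between refreshes, which is closer to a sequence of offline rounds.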
Offline Distillation in Modern LLM Training
-
In contemporary LLM pipelines, offline distillation is widely used for compressing frontier models into smaller deployable models, generating synthetic instruction and reasoning datasets, transferring capabilities after RL or alignment, and creating baseline models before on-policy fine-tuning.
-
The RLHF Book chapter Synthetic Data presents offline distillation as the first stage in a progression from static synthetic data generation to fully on-policy distillation and RL-integrated post-training.
-
Offline distillation is especially important in recipes where SFT acts as a cold start for later RL. Instead of using SFT as the final alignment stage, modern reasoning recipes often use it to establish the minimum competence needed for RLVR or on-policy distillation to produce useful gradients.
-
In reasoning-model recipes, offline distillation may serve several distinct roles:
-
Cold-start SFT: Generate initial reasoning traces so RL can begin from a model that already follows the desired format.
-
Rejection-sampling SFT: Sample many model outputs, filter for correctness, and train on the retained traces.
-
Specialist trace distillation: Train a general student on traces from domain-specialist teachers.
-
Regression recovery: Reintroduce capabilities degraded by a later RL stage using cached high-quality traces.
-
Deployment compression: Transfer capabilities from a large teacher or ensemble into a smaller serving model.
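The rejection-sampling SFT role above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `rejection_sample_sft`, `toy_sampler`, and `toy_verifier` are hypothetical names, and the "model" is a toy arithmetic sampler standing in for a real policy and verifier.

```python
import random
from typing import Callable, List, Tuple


def rejection_sample_sft(
    prompts: List[str],
    sample_fn: Callable[[str, random.Random], str],
    is_correct: Callable[[str, str], bool],
    samples_per_prompt: int = 8,
    keep_per_prompt: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample several completions per prompt and keep only verified ones."""
    rng = random.Random(seed)
    kept: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = [sample_fn(prompt, rng) for _ in range(samples_per_prompt)]
        # Drop exact duplicates while preserving order, then apply the verifier.
        unique = list(dict.fromkeys(candidates))
        passed = [c for c in unique if is_correct(prompt, c)]
        kept.extend((prompt, c) for c in passed[:keep_per_prompt])
    return kept


# Toy stand-ins (assumptions for illustration): a noisy adder and an exact checker.
def toy_sampler(prompt: str, rng: random.Random) -> str:
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([-1, 0, 0, 1]))


def toy_verifier(prompt: str, completion: str) -> bool:
    a, b = map(int, prompt.split("+"))
    return int(completion) == a + b


sft_pairs = rejection_sample_sft(["2+3", "10+7"], toy_sampler, toy_verifier)
```

Every retained pair has passed the verifier, so the resulting list can be fed to ordinary SFT. Prompts for which no sample passes simply contribute nothing.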
-
-
The growing complexity of recent recipes also shows why offline distillation remains attractive. MOPD can be more powerful for student-specific errors, but it requires rollout generation, teacher scoring, routing, and careful support-overlap management. Offline distillation is cheaper to reproduce and simpler to study because the data distribution is fixed.
Implementation Pattern
-
Select or train the teacher model: Begin with a strong model, ensemble, specialist checkpoint, or post-RL model whose behavior should be transferred into the student. In most offline settings, this teacher is already trained before the distillation run begins.
-
Freeze the teacher parameters: Keep the teacher fixed throughout student training. This ensures that the supervision distribution remains stationary and that repeated student runs can be compared cleanly.
-
Define the target domain mixture: Choose the domains, tasks, formats, and trace lengths the student should learn. For LLMs, this often includes math, code, science, chat, instruction following, tool use, long context, safety, and agentic workflows.
-
Generate or query teacher outputs: Use the teacher to produce hard targets, soft probabilities, hidden-state targets, critiques, preferences, or reward-like annotations. These outputs may be generated once before training or queried live during student optimization.
-
Filter and verify outputs: Remove incorrect, unsafe, malformed, repetitive, leaked, or low-quality examples. For code, use tests or execution-based filters; for math, use answer checkers or proof validators where possible; for chat and instruction following, use rubrics, judges, and audits.
-
Deduplicate and balance the corpus: Apply exact and fuzzy deduplication, then balance mixture weights by tokens, examples, domains, and trace lengths. Long reasoning traces can dominate the loss if mixture design is not handled carefully.
-
Store targets or log-probabilities when useful: For large-scale systems, cache teacher completions, token IDs, top-\(k\) log-probabilities, embeddings, or preference labels so they can be reused across multiple student runs. Full-vocabulary logits are usually expensive, so practical systems often store compressed targets.
-
Train the student to match the teacher: Optimize the student with the appropriate matching loss, such as cross-entropy on teacher-generated tokens, KL divergence on teacher probabilities, MSE on hidden states, or a hybrid objective combining supervised and distillation losses.
-
Evaluate both target performance and regressions: Measure the student against task benchmarks, teacher agreement, latency, memory footprint, and regression suites. Offline distillation can improve targeted domains while quietly degrading unrelated behavior if the data mixture is narrow.
-
Iterate when needed: If the student underperforms, improve the teacher data, change the divergence, adjust the temperature, increase top-\(k\) coverage, rebalance mixture weights, add stronger filters, or introduce on-policy rollouts.
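The caching step in the pattern above (storing compressed targets instead of full-vocabulary logits) might look like the following sketch. The record layout and the name `compress_top_k` are assumptions made for illustration; real systems choose their own storage format.

```python
import math
from typing import Dict, List


def compress_top_k(logprobs: List[float], k: int) -> Dict[str, object]:
    """Keep the k most likely token ids with their log-probs, plus the dropped mass."""
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    top = order[:k]
    kept_mass = sum(math.exp(logprobs[i]) for i in top)
    return {
        "token_ids": top,
        "logprobs": [logprobs[i] for i in top],
        # Probability mass of all tokens outside the top-k set.
        "tail_mass": max(0.0, 1.0 - kept_mass),
    }


# One decoding position over a toy 5-token vocabulary.
teacher_probs = [0.5, 0.3, 0.1, 0.05, 0.05]
record = compress_top_k([math.log(p) for p in teacher_probs], k=2)
```

Storing the tail mass alongside the top-\(k\) entries lets a later training job decide whether to renormalize the truncated distribution or to treat the remainder as a single "other" bucket.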
Practical Use Cases
-
Model compression: A large teacher transfers its behavior into a smaller or cheaper student.
-
Serving-cost reduction: Offline distillation is attractive when a model will be served many times and a one-time distillation cost can be amortized over high-volume inference.
-
Specialist-to-general transfer: Domain-specific teachers produce traces that are merged into a single general student.
-
Synthetic instruction tuning: Strong teachers generate instruction-response pairs for broad SFT.
-
Reasoning cold start: Teacher traces teach a student to produce structured reasoning before RLVR.
-
Tool-use bootstrapping: Tool-using teachers generate trajectories that teach a student basic API, terminal, browser, or code-editing behavior.
-
Safety and refusal behavior: Safety teachers generate refusal, redirection, and boundary-setting examples.
-
Long-context behavior: Long-context teachers generate or annotate examples that teach retrieval, summarization, and cross-document reasoning.
When Offline Distillation is Preferred
-
Offline distillation is usually the best first choice when stability, simplicity, auditability, and reusable data matter more than exact train-inference matching.
-
It is especially preferred when:
-
High-quality teacher traces are already available.
-
Teacher inference is expensive and should be amortized.
-
The target tasks are short or have limited trajectory drift.
-
The student needs a cold start before RL or OPD.
-
The team wants a reproducible baseline before introducing on-policy systems.
-
Data quality, filtering, and auditability are more important than adaptivity.
-
Multiple students or ablations will reuse the same teacher corpus.
-
-
Offline distillation remains the dominant starting point for practical distillation pipelines, even when the final system later adds RL, on-policy distillation, self-distillation, or multi-teacher consolidation.
Online Distillation
-
Online distillation generalizes the teacher-student paradigm by allowing the teacher signal to evolve during training rather than remain fixed. Instead of relying exclusively on a pretrained, frozen teacher, online distillation trains multiple models simultaneously or refreshes teacher-like supervisors during optimization. The supervision distribution is therefore non-stationary and can adapt as the participating models improve.
-
The key idea to carry through this section is that online distillation is about teacher dynamics, not trajectory source. A method is online if the teacher distribution changes during the student’s training. It can still be off-policy if peers exchange predictions on fixed minibatches, and it can be on-policy if co-evolving models score rollouts generated by current policies.
-
Online distillation is attractive because frozen teachers can become stale relative to the student’s changing weaknesses and strengths. By allowing teachers and students to co-evolve, online distillation can provide more adaptive supervision, improve generalization, and in some cases outperform both conventional offline distillation and independently trained models.
-
At the same time, strict online distillation is harder to operate than offline distillation. It introduces non-stationary targets, synchronization overhead, and coupled failure modes. This is why many frontier LLM recipes use semi-online variants instead: specialists are trained or refreshed across stages, but each distillation phase still uses frozen teacher checkpoints.
-
This distinction is especially important for interpreting recent LLM post-training recipes. Multi-teacher on-policy distillation is often on-policy in trajectory source because the student generates the rollouts, but it is usually offline in teacher update pattern because the teachers are frozen during the distillation run. When recipes refresh teachers across rounds, as in multi-round MOPD, they become semi-online at the recipe level rather than fully online at the gradient-step level.
-
The canonical example is Deep Mutual Learning by Zhang et al. (2017), which trains peer networks jointly and minimizes KL divergence between their predictive distributions. Each model acts simultaneously as both student and teacher, and all participants improve through reciprocal supervision. More recent approaches such as co-distillation, checkpoint refresh, and population-based post-training extend this idea to larger ensembles and distributed training systems.
Definition
-
Suppose there are \(K\) models with parameters \(\{\theta_k\}_{k=1}^K\). For model \(i\), the online distillation objective can be written as:
\[\mathcal{L}_i(\theta_i) =\mathcal{L}_{\text{task}}(\theta_i) +\lambda \sum_{j \neq i} D\left( p_j^{\theta_j}(\cdot \mid x) \,\Vert\, p_i^{\theta_i}(\cdot \mid x) \right)\]
where:
-
\(\mathcal{L}_{\text{task}}\) is the primary supervised, reinforcement learning, or hybrid objective.
-
\(D\) is a divergence such as KL or JSD.
-
\(\lambda\) controls the strength of mutual supervision.
-
all models update concurrently or are refreshed during the training process.
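As a concrete reading of the objective, the sketch below evaluates \(\mathcal{L}_i\) for every peer on a single classification example, taking \(\mathcal{L}_{\text{task}}\) to be cross-entropy and \(D\) to be KL. Those two choices, and the helper names, are assumptions for illustration. In an autodiff framework the peer term \(p_j\) is commonly treated as a constant when differentiating \(\mathcal{L}_i\); that detail is outside this plain-Python sketch.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for discrete distributions where q has full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_losses(peer_logits: List[List[float]], label: int, lam: float = 1.0) -> List[float]:
    """Return L_i = CE_i + lam * sum_{j != i} KL(p_j || p_i) for every peer i."""
    probs = [softmax(z) for z in peer_logits]
    losses = []
    for i, p_i in enumerate(probs):
        task = -math.log(p_i[label])
        distill = sum(kl(p_j, p_i) for j, p_j in enumerate(probs) if j != i)
        losses.append(task + lam * distill)
    return losses


# Two peers that disagree: each pays a task loss plus a penalty for the disagreement.
losses = mutual_losses([[2.0, 0.0], [0.0, 2.0]], label=0, lam=1.0)
```

When all peers produce identical distributions the distillation term vanishes and each loss reduces to its task loss.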
-
-
-
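Unlike offline distillation, the teacher distributions \(p_j^{\theta_j}\) evolve throughout training because each participating teacher-like model receives its own parameter updates. In general:
\[\nabla_{\theta_j}\mathcal{L}_j \neq 0\]
for participating teacher-like models, whereas a frozen offline teacher receives no updates at all.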
-
A time-indexed view makes the distinction explicit. At training step \(u\), model \(i\) may receive supervision from peer or teacher state \(p_j^{\theta_j(u)}\):
\[\mathcal{L}_i^{(u)} =\mathcal{L}_{\text{task}}^{(u)} +\lambda \sum_{j\neq i} D\left( p_j^{\theta_j(u)}(\cdot \mid x) \,\Vert\, p_i^{\theta_i(u)}(\cdot \mid x) \right)\] -
In strict online distillation, the teacher state can change every step or every small number of steps. In semi-online distillation, the teacher state changes at coarser refresh boundaries:
\[p_T^{(r)} =p_T^{\phi_r} \quad\text{for refresh round } r\]
where \(\phi_r\) denotes the teacher parameters frozen at the start of round \(r\). The student is trained against \(p_T^{(r)}\) until the next refresh.
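The difference between strict online and semi-online teacher updates comes down to how often the teacher index advances. A minimal sketch, assuming a fixed refresh interval (the function name and the fixed-interval schedule are illustrative assumptions; real recipes often refresh at stage boundaries instead):

```python
from typing import List


def teacher_version(step: int, refresh_every: int) -> int:
    """Index r of the teacher snapshot that supervises the student at `step`."""
    if refresh_every < 1:
        raise ValueError("refresh_every must be >= 1")
    return step // refresh_every


steps = range(6)
# Strict online: the teacher state can differ at every step.
strict_online: List[int] = [teacher_version(u, refresh_every=1) for u in steps]
# Semi-online: the teacher is held fixed for three steps, then refreshed.
semi_online: List[int] = [teacher_version(u, refresh_every=3) for u in steps]
# Offline: the interval exceeds the run length, so the teacher never changes.
offline: List[int] = [teacher_version(u, refresh_every=1000) for u in steps]
```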
Relationship to Offline, Off-Policy, and On-Policy Distillation
-
Online versus offline describes whether the teacher changes during training. Off-policy versus on-policy describes where the training trajectories originate.
-
This yields four conceptually distinct combinations:
-
Offline + Off-Policy: Classical KD using a frozen teacher and fixed teacher or dataset trajectories.
-
Offline + On-Policy: Modern OPD, where a frozen teacher scores student-generated rollouts.
-
Online + Off-Policy: Peer models exchange predictions on a fixed dataset or minibatch stream.
-
Online + On-Policy: Multiple co-evolving models generate and score their own trajectories, potentially sharing rollouts and dense feedback.
-
-
Most historical online distillation methods are online and off-policy because they operate on shared minibatches. Emerging RL and LLM systems increasingly explore online and on-policy hybrids, where co-trained models evaluate trajectories sampled from their current policies.
-
Many frontier LLM recipes occupy the semi-online region rather than the fully online region. Domain specialists may be trained in parallel, selected as teachers at checkpoint boundaries, and refreshed across rounds. During a particular distillation run, however, each teacher is usually frozen. Thus, the recipe is adaptive over stages, but not necessarily online at every learner update.
-
This matters because the terms “online,” “on-policy,” and “MOPD” are often conflated. A multi-teacher OPD run can be:
-
on-policy because the student generates rollouts;
-
offline because the selected teachers are frozen during that run;
-
semi-online because the broader recipe refreshes teachers or runs multiple MOPD iterations.
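The two axes can be kept apart by recording them as separate fields. The sketch below is only a labeling aid; the class and field names are hypothetical, and the rule that a within-run teacher update takes precedence over a between-run refresh is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillRun:
    teacher_updates_within_run: bool      # teacher parameters change during the run
    teacher_refreshed_between_runs: bool  # recipe swaps teachers across rounds
    student_generates_rollouts: bool      # trajectories are sampled from the student


def describe(run: DistillRun) -> str:
    """Label a run on the teacher-update axis and the trajectory-source axis."""
    if run.teacher_updates_within_run:
        teacher_axis = "online"
    elif run.teacher_refreshed_between_runs:
        teacher_axis = "semi-online"
    else:
        teacher_axis = "offline"
    trajectory_axis = "on-policy" if run.student_generates_rollouts else "off-policy"
    return f"{teacher_axis} + {trajectory_axis}"


classical_kd = describe(DistillRun(False, False, False))
frozen_teacher_opd = describe(DistillRun(False, False, True))
mutual_learning = describe(DistillRun(True, False, False))
multi_round_mopd = describe(DistillRun(False, True, True))
```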
-
Types
-
Mutual Learning: Each model teaches every other model, as in Deep Mutual Learning.
-
Co-Distillation: Large-scale training jobs periodically exchange predictions, logits, or checkpoints to improve convergence and robustness.
-
Peer Ensembles: Multiple comparable models learn jointly and average or vote on predictions during training.
-
Adaptive Teacher Distillation: A stronger model is periodically updated and continues to supervise one or more students.
-
Checkpoint-Refresh Distillation: A teacher is replaced by a newer checkpoint after a training phase, producing a semi-online sequence of static-teacher distillation runs.
-
Shadow Teacher Distillation: An auxiliary teacher is updated asynchronously while the main student trains, bridging offline stability and online adaptivity.
-
Population-Based Distillation: A population of models with different objectives, domains, or hyperparameters exchanges knowledge during training.
-
Specialist-Teacher Refresh: Domain-specific teachers are trained or refreshed across post-training stages, then used to supervise a general student. This is common in modern recipe-level MOPD systems even when the teachers are frozen during each individual distillation run.
Advantages
-
Adaptive supervision: The teaching signal evolves with the models and can address newly emerging failure modes.
-
Improved generalization: Peer learning often reduces overconfidence and improves calibration.
-
No need for a single superior teacher: Comparable models can still benefit from teaching one another.
-
Regularization effects: Mutual agreement acts as a strong inductive bias.
-
Compatibility with distributed systems: Large training clusters can exchange logits or checkpoint summaries during optimization.
-
Continual improvement: Teacher refreshes allow the supervision signal to track new data, new environments, or newly discovered failure modes.
-
Capability preservation: Semi-online teacher pools can preserve older strengths while newer RL stages specialize the student.
-
Organizational scalability: Domain-specialist teams can improve teachers in parallel, and periodic distillation can merge those improvements into a shared student.
Limitations
-
Higher system complexity: Multiple models must be trained simultaneously, refreshed periodically, or synchronized across stages.
-
Non-stationary targets: The supervision distribution changes over time, which can complicate optimization.
-
Risk of consensus errors: If all participants share similar biases, they may reinforce incorrect behavior.
-
Compute overhead: Training several models jointly can be significantly more expensive than using a single frozen teacher.
-
Synchronization overhead: Online settings require careful coordination of model versions, checkpoints, batch assignments, and logging.
-
Unclear credit assignment: When many peers or teachers improve together, it can be difficult to determine which supervision source caused which downstream gain.
-
Distribution mismatch across teachers: If teachers are trained with very different data or recipes, their output distributions may be poorly aligned with each other or with the student.
-
Recipe fragility: As post-training adds more teachers, environments, RL stages, and distillation loops, the recipe can collapse under its own complexity unless each stage is isolated, tested, and evaluated.
Semi-Online and Hybrid Approaches
-
Many practical systems combine offline and online strategies:
-
Checkpoint refresh: A frozen teacher is periodically replaced by the latest strong checkpoint.
-
Teacher ensembles: A static teacher is supplemented with co-trained peers or newer specialist checkpoints.
-
Shadow teachers: Auxiliary teachers are updated asynchronously to provide fresher supervision.
-
Progressive teacher staging: Rather than distilling only from a fully converged teacher, the student may be guided through a sequence of intermediate teacher checkpoints.
-
Multi-round MOPD: A first MOPD round improves the student, then refreshed or reselected teachers provide a second round of supervision.
-
-
Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022) analyzes these intermediate approaches and shows that reversed student-to-teacher transfer contributes significantly to online distillation’s effectiveness.
-
Semi-online methods are particularly important for LLMs because full online distillation across many large models is often too expensive. A practical compromise is:
\[\text{Train or refresh teachers} \rightarrow \text{freeze teachers} \rightarrow \text{distill student} \rightarrow \text{evaluate} \rightarrow \text{refresh teacher pool}\] -
This turns an unstable fully online problem into a sequence of stable offline distillation phases, while still allowing the broader recipe to adapt over time.
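The control flow of that compromise can be sketched as a loop in which the teacher pool changes only at round boundaries. Everything here is a toy stand-in: `distill`, `refresh`, and `evaluate` are hypothetical callables, and "skill" is a single float rather than a model.

```python
from typing import Callable, List, Sequence, Tuple


def run_semi_online(
    student: float,
    teachers: Sequence[float],
    rounds: int,
    distill: Callable[[float, Tuple[float, ...]], float],
    refresh: Callable[[Tuple[float, ...], float], List[float]],
    evaluate: Callable[[float], float],
) -> Tuple[float, List[Tuple[int, float]]]:
    log: List[Tuple[int, float]] = []
    for r in range(rounds):
        frozen = tuple(teachers)             # teachers are immutable within the round
        student = distill(student, frozen)   # stable, offline-style distillation phase
        log.append((r, evaluate(student)))   # evaluate before touching the pool
        teachers = refresh(frozen, student)  # pool changes only at the boundary
    return student, log


# Toy dynamics: the student closes half the gap to the best teacher each round,
# and every teacher improves by one unit at each refresh.
def toy_distill(s: float, ts: Tuple[float, ...]) -> float:
    return s + 0.5 * (max(ts) - s)


def toy_refresh(ts: Tuple[float, ...], s: float) -> List[float]:
    return [t + 1.0 for t in ts]


final_student, history = run_semi_online(0.0, [2.0], 2, toy_distill, toy_refresh, lambda s: s)
```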
Online Distillation in Modern LLM Training
-
Although most frontier LLM distillation remains offline at the teacher-update level, online principles appear increasingly often in:
-
Multi-agent self-improvement systems.
-
Self-play and debate frameworks.
-
Checkpoint-based teacher refresh pipelines.
-
Distributed co-training.
-
Self-distillation with periodically updated teacher snapshots.
-
Multi-round MOPD pipelines with refreshed specialist teachers.
-
RL pipelines where teacher checkpoints are selected from evolving validation curves.
-
-
In large-scale post-training, a model may be supervised by:
-
Specialist checkpoints trained on different domains.
-
Recent versions of itself.
-
Peer models in a shared optimization loop.
-
Earlier checkpoints retained to prevent catastrophic forgetting.
-
Domain teachers selected from the strongest validation checkpoint in a Cascade RL process.
-
-
This blurs the boundary between online distillation, self-distillation, checkpoint averaging, continual learning, and multi-teacher distillation.
-
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) is a useful semi-online example. Its teachers are selected from the Cascade RL pipeline by choosing the strongest validation checkpoint for each benchmark category. Because those teachers are derived from the same SFT initialization, they share the same tokenizer and reduce teacher-student distribution shift, while multi-domain OPD supplies dense token-level advantages compared with sparse outcome rewards.
-
The corresponding sampled-token MOPD advantage is:
\[a_t^{\text{MOPD}} = \log \pi_{\text{domain}_i}(y_t \mid s_t) - \log \pi_{\text{train}}(y_t \mid s_t)\]
where \(s_t=(x,y_{<t})\) is the decoding state, \(\pi_{\text{domain}_i}\) is the selected domain teacher, and \(\pi_{\text{train}}\) is the student policy being trained. This is not strict online distillation if the teacher is frozen during the update, but it is semi-online at the recipe level because teacher checkpoints come from an evolving Cascade RL process.
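Given per-token log-probabilities of the sampled tokens under the teacher and the student, the advantage is a simple elementwise difference. A minimal sketch (the function name is illustrative, and the inputs are assumed to be aligned token by token along one student rollout):

```python
from typing import List


def mopd_advantages(teacher_logprobs: List[float], student_logprobs: List[float]) -> List[float]:
    """a_t = log pi_teacher(y_t | s_t) - log pi_student(y_t | s_t) for each sampled token."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student log-probs must cover the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Three sampled tokens from one student rollout.
adv = mopd_advantages([-0.1, -2.0, -0.5], [-0.3, -0.4, -0.5])
# Positive: the teacher finds the sampled token more likely than the student does.
# Negative: the teacher finds it less likely. Zero: both models agree.
```

Only one scalar per token needs to cross from the teacher scorer to the trainer, which is what makes the sampled-token form cheap compared with full-distribution matching.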
-
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) shows another semi-online pattern: MOPD is run over multiple rounds, with a prepared RLVR student, a first teacher pool, refreshed teachers, and a second MOPD iteration. This resembles online distillation at the recipe level, but each MOPD phase is still best understood as frozen-teacher on-policy distillation.
Teacher Refresh and Support Overlap
-
A central challenge in semi-online and multi-teacher distillation is that teacher refreshes can increase capability while also increasing distribution mismatch. A stronger later teacher is not always a better distillation teacher if the student rarely visits the states where that teacher’s advantage is meaningful.
-
If a teacher and student are trained with substantially different SFT data or RL recipes, they may acquire different reasoning behaviors and induce different output distributions. Then student-generated trajectories can become out-of-distribution for the teacher, reducing the reliability of token-level teacher supervision.
-
A useful diagnostic is teacher-student local overlap at a student-visited state \(s_t\):
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k \pi_T(\cdot \mid s_t) \cap \operatorname{Top}_k \pi_S(\cdot \mid s_t) \right| }{k}\] -
When overlap is high, full-distribution or top-\(k\) distillation is more likely to be useful. When overlap is low, sampled-token scoring or a warmup stage may be safer.
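The overlap diagnostic is straightforward to compute from the two next-token distributions at a student-visited state. A minimal sketch with hypothetical helper names; ties are broken by token index here, which is an arbitrary choice.

```python
from typing import List, Set


def top_k_ids(probs: List[float], k: int) -> Set[int]:
    """Indices of the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def local_overlap(teacher_probs: List[float], student_probs: List[float], k: int) -> float:
    """|Top_k(teacher) ∩ Top_k(student)| / k at a single decoding state."""
    shared = top_k_ids(teacher_probs, k) & top_k_ids(student_probs, k)
    return len(shared) / k


teacher = [0.5, 0.3, 0.1, 0.1]
student = [0.1, 0.5, 0.3, 0.1]
overlap = local_overlap(teacher, student, k=2)  # teacher top-2 is {0, 1}, student top-2 is {1, 2}
```

Averaging this quantity over student-visited states, per domain and per teacher, gives a cheap signal for whether full-distribution matching is likely to be reliable.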
-
In semi-online teacher refresh, one practical recipe is:
\[T^{(1)} \rightarrow S^{(1)} \rightarrow T^{(2)} \rightarrow S^{(2)} \rightarrow \cdots\]
where the student is gradually brought closer to stronger teachers rather than being forced to match a distant final teacher in one step.
-
This progressive view is useful for MOPD. If a domain teacher has moved far from the student through additional SFT or RL, an intermediate teacher checkpoint can provide a smoother path. A brief warmup SFT on data drawn from the teacher’s training distribution can also increase the chance that student rollouts remain within teacher support.
Online Distillation versus MOPD
-
Online distillation and MOPD overlap conceptually, but they are not the same. Online distillation asks, “does the teacher distribution change during training?” MOPD asks, “does the student generate the rollout?” and “do multiple teachers provide token-level feedback over that rollout?”
-
Therefore, a MOPD system can be categorized as:
-
Offline MOPD: multiple frozen teachers score student-generated rollouts.
-
Semi-online MOPD: frozen teachers are refreshed between MOPD rounds.
-
Online MOPD: co-evolving teachers and student score each other during the same training process.
-
-
Most currently discussed MOPD recipes appear closer to offline or semi-online MOPD than to fully online MOPD. This is a practical systems choice: freezing teachers during a distillation run improves stability, logging, reproducibility, and fault isolation.
-
The broader post-training recipe, however, can still be adaptive. Teachers may be trained in parallel by different teams, selected from validation curves, refreshed after a first MOPD round, or replaced with new specialists as new domains are added.
Implementation Pattern
-
Initialize multiple models or peers: Start with two or more models, which may differ in architecture, initialization, objective, or specialization.
-
Train each model on the primary objective: Each participant optimizes its own supervised, RL, or hybrid loss.
-
Exchange predictive distributions: At each step or periodically, models compute logits, hidden states, critiques, or token-level scores that are shared with other participants.
-
Compute mutual distillation losses: Each model matches one or more peer distributions using KL divergence, JSD, sampled-token log-ratio losses, or related objectives.
-
Update all participating online models: Gradients are applied to every model that is actively learning, so each model can act as both teacher and student.
-
Freeze or snapshot teachers when needed: In semi-online variants, models are periodically frozen as teacher checkpoints before supervising a student.
-
Synchronize or refresh periodically: In distributed systems, communication may occur asynchronously or at checkpoint boundaries rather than every minibatch.
-
Track version provenance: Online and semi-online methods require careful tracking of which student checkpoint generated a rollout, which teacher checkpoint scored it, which tokenizer was used, and which reward or verifier version was active.
-
Monitor teacher-student support overlap: When teachers and students diverge, full-distribution matching can become unreliable. Local overlap, KL, entropy, teacher advantage, rollout length, and truncation rates should be monitored.
-
Evaluate both individual and ensemble performance: Assess whether joint learning improves standalone models, ensemble behavior, calibration, domain robustness, and regression recovery.
Systems Design Considerations
-
Online distillation introduces additional systems requirements beyond offline KD:
-
Versioned checkpoints: Every teacher and student checkpoint must be traceable.
-
Synchronized inference and training engines: Rollout generation and teacher scoring must agree on tokenization, chat templates, stop tokens, and log-probability conventions.
-
Asynchronous queues: Rollouts and teacher scores may arrive at different speeds, requiring buffers and freshness policies.
-
Staleness management: If a student changes too much between rollout generation and gradient update, the rollout may become off-policy relative to the current learner.
-
Communication budget: Peer logits or top-\(k\) distributions can be large; sampled-token objectives reduce bandwidth.
-
Fault isolation: A bad teacher refresh can corrupt the student if not evaluated before deployment into the distillation loop.
-
Domain routing: Multi-teacher systems require prompt routing, teacher selection, or weight aggregation.
-
Evaluation gates: Teacher refreshes should pass domain benchmarks and regression suites before being used as supervision.
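Versioned checkpoints and staleness management can share one small record type. The sketch below assumes a simple policy in which a rollout is discarded once the student has advanced more than `max_lag` steps past the checkpoint that generated it; the field names, identifiers, and the policy itself are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RolloutRecord:
    prompt_id: str
    student_step: int      # student checkpoint that generated the rollout
    teacher_id: str        # teacher checkpoint that scored it
    tokenizer_id: str      # tokenizer shared by rollout and scoring engines
    verifier_version: str  # reward or verifier version active at scoring time


def fresh_rollouts(
    records: List[RolloutRecord], current_student_step: int, max_lag: int
) -> List[RolloutRecord]:
    """Keep rollouts generated at most `max_lag` steps before the current student."""
    return [
        r for r in records
        if 0 <= current_student_step - r.student_step <= max_lag
    ]


buffer = [
    RolloutRecord("p1", 100, "math-teacher-r2", "tok-v1", "verifier-v3"),
    RolloutRecord("p2", 140, "code-teacher-r2", "tok-v1", "verifier-v3"),
]
usable = fresh_rollouts(buffer, current_student_step=150, max_lag=20)
```

Here the first rollout lags the learner by 50 steps and is dropped, while the second lags by 10 and is kept.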
-
-
A practical semi-online loop often looks like:
\[\text{train specialists} \rightarrow \text{select checkpoints} \rightarrow \text{freeze teachers} \rightarrow \text{distill student} \rightarrow \text{run regression suite} \\ \rightarrow \text{refresh teacher pool}\] -
This is less theoretically elegant than fully online mutual learning, but it is easier to debug and more compatible with frontier-scale training infrastructure.
Online Distillation in the Broader Distillation Taxonomy
-
Online distillation occupies the teacher-update axis of the distillation taxonomy. It complements rather than replaces distinctions such as:
-
Off-policy versus on-policy.
-
Single-teacher versus multi-teacher.
-
External-teacher versus self-distillation.
-
Supervised versus RL-integrated training.
-
Frozen-teacher OPD versus refreshed-teacher OPD.
-
-
Conceptually, online distillation is best understood as adaptive teacher evolution, while on-policy distillation is best understood as adaptive trajectory generation. Modern systems increasingly combine both, but they should still be analyzed separately.
-
For practical LLM post-training, the most common pattern is not pure online distillation. It is semi-online capability consolidation: train or refresh specialists, freeze them, distill into a general student, evaluate regressions, then repeat. This preserves much of the stability of offline KD while allowing the teacher pool to evolve with the recipe.
Off-Policy Distillation
-
Off-policy distillation is the classical and still most widely used form of distillation. The student is trained on trajectories generated by an external source, such as human-labeled datasets, teacher-generated completions, synthetic reasoning traces, curated corpora, prior model logs, or specialist teacher rollouts, rather than on trajectories sampled from the student itself.
-
The central idea to carry through this section is that off-policy distillation is the most stable way to transfer broad capabilities, but it does not directly teach the student to recover from its own generation errors. It is therefore best understood as the default cold-start and broad-transfer stage in a larger post-training stack, often followed by RL, OPD, self-distillation, or multi-teacher consolidation.
-
Off-policy distillation remains attractive because data can be generated, filtered, audited, versioned, and reused before student training begins. This makes it operationally simpler than on-policy distillation, where rollouts and teacher scoring must happen inside a training loop. In practice, this stability is why many frontier recipes still rely heavily on synthetic SFT, rejection-sampling SFT, trace distillation, and teacher-generated tool-use data before moving to RL or MOPD.
-
The main limitation is train-inference mismatch. The student learns from trajectories produced by humans, teachers, or filters; at inference, it must condition on its own prior tokens. For short tasks, this mismatch may be mild. For long-horizon reasoning, coding, terminal use, browser use, or agentic workflows, small early deviations can move the student into states that never appear in the offline corpus.
-
Modern post-training recipes therefore use off-policy distillation as a foundation rather than as the entire recipe. Llama 3-style post-training uses reward models and rejection sampling to construct offline SFT and DPO data; DeepSeek-R1-style recipes use SFT to cold-start and refine reasoning behaviors after RL; MAI-Thinking-1-style recipes use trace-distillation SFT to consolidate specialist RL climbs; and Nemotron-style recipes use large synthetic and filtered SFT corpora before Cascade RL and MOPD.
Definition and Formal Objective
-
Given a dataset of input-output pairs:
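\[\mathcal{D}=\{(x,y)\}\]
where outputs \(y\) may come from humans, a teacher model, a specialist checkpoint, a synthetic data pipeline, or historical logs, the student minimizes a divergence between teacher and student token distributions:
\[\mathcal{L}_{\text{off-policy}} =\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_t D\left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S(\cdot \mid x, y_{<t}) \right)\]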
-
This is the standard supervised distillation objective described in Distilling the Knowledge in a Neural Network by Hinton et al. (2015) and generalized to autoregressive sequence models in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024).
-
The defining feature is that the trajectory distribution is external to the current student:
\[y \sim q_{\text{external}}(\cdot \mid x)\]
rather than:
\[y \sim p_S(\cdot \mid x)\]
-
The external trajectory distribution may be written as a mixture:
\[q_{\text{external}}(y \mid x) =\sum_{k=1}^{K} \omega_k q_k(y \mid x)\]
where the mixture weights \(\omega_k\) are non-negative and sum to one, and each component \(q_k\) may correspond to human data, a frontier teacher, a specialist model, a rejection-sampling pipeline, a verifier-filtered corpus, a prior student checkpoint, or production interaction logs.
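Drawing a training example from such a mixture amounts to first picking a source according to its weight and then sampling from that source. A minimal sketch of the first step, with hypothetical source names and weights:

```python
import random
from typing import Dict


def sample_source(weights: Dict[str, float], rng: random.Random) -> str:
    """Pick a data source k with probability proportional to its mixture weight."""
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    names = list(weights)
    # random.choices accepts unnormalized weights.
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Illustrative mixture: the source names and proportions are assumptions.
mixture = {"human": 0.1, "frontier_teacher": 0.6, "specialist_math": 0.3}
rng = random.Random(0)
draws = [sample_source(mixture, rng) for _ in range(10_000)]
teacher_share = draws.count("frontier_teacher") / len(draws)
```

In practice the weights are usually set by tokens rather than by examples, since long reasoning traces would otherwise dominate the loss.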
Sources of Off-Policy Data
-
Off-policy data can come from human-labeled datasets, teacher-generated synthetic data, verifier-filtered synthetic corpora, historical model outputs, specialist teacher traces, rejection-sampling traces, and tool-use transcripts. Human-labeled datasets provide expert-written instruction responses, translations, preference annotations, and curated reasoning traces. Teacher-generated synthetic data transfers capabilities through answers, chain-of-thought traces, critiques, and tool-use demonstrations. Filtered synthetic corpora retain only outputs that pass verifiers, reward models, execution tests, or additional teacher checks. Historical model outputs can be relabeled and refined by stronger models. Specialist teacher traces capture domain-specific expertise in math, code, safety, tool use, science, long context, software engineering, or agentic workflows. Rejection-sampling traces are sampled from a policy, ranked or filtered, and then reused as supervised targets. Tool-use transcripts record actions, observations, terminal commands, browser calls, Python execution traces, API calls, and final answers.
-
The RLHF Book chapter Synthetic Data emphasizes that most modern post-training pipelines rely heavily on large-scale synthetic data generation before any reinforcement learning or on-policy distillation stage.
-
In recent LLM recipes, off-policy data is rarely a single monolithic corpus. It is usually a curated mixture of math and proof traces, competitive coding solutions, tool-calling code traces, scientific reasoning examples, document-derived examples, long-context question answering, retrieval tasks, general chat, multi-turn dialogue, instruction-following examples, formatting tasks, safety and refusal data, agentic software-engineering trajectories, and search, browser, terminal, or office-tool workflows.
Sequence-Level Distillation
-
Sequence-level distillation is the most common off-policy method for LLMs.
-
Introduced in Sequence-Level Knowledge Distillation by Kim and Rush (2016), it trains the student on full teacher-generated outputs:
\[\mathcal{L}_{\text{SeqKD}} =\mathbb{E}_{x} \left[ -\log p_S(y_T \mid x) \right]\]
where \(y_T\) is a full output generated by the teacher for input \(x\):
\[y_T \sim p_T(\cdot \mid x)\]
-
This approach often simplifies the target distribution by replacing multiple valid outputs with one teacher-selected response, which can make optimization substantially easier.
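Because the sequence log-probability factorizes over tokens, the sequence-level loss is the summed negative log-likelihood of the teacher's tokens under the student. A minimal sketch, assuming the per-token student log-probabilities have already been computed by teacher-forcing on the teacher output (function names are illustrative):

```python
import math
from typing import List


def seqkd_nll(student_token_logprobs: List[float]) -> float:
    """-log p_S(y_T | x) = -sum_t log p_S(y_t | x, y_<t) over the teacher's tokens."""
    return -sum(student_token_logprobs)


def seqkd_loss(batch: List[List[float]]) -> float:
    """Monte Carlo estimate of the expectation over prompts: mean sequence NLL."""
    return sum(seqkd_nll(seq) for seq in batch) / len(batch)


# The student assigns probabilities 0.5 and 0.25 to the two teacher tokens.
nll = seqkd_nll([math.log(0.5), math.log(0.25)])  # equals log 8
```

Whether to normalize by sequence length is a separate design choice; without normalization, long traces contribute proportionally more to the gradient.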
-
In LLM post-training, the “sequence” may be much richer than a final answer: it may contain a reasoning trace, proof, verification path, code solution, patch, tests, tool-use transcript, terminal session, browser or search trace, multi-turn conversation, or critique-and-revision sequence. Sequence-level distillation is therefore often better described as trace-level distillation when applied to reasoning and agentic models.
Trace-Distillation SFT
-
Trace-distillation SFT is an off-policy method in which a teacher generates a complete trajectory and the student learns it through ordinary supervised fine-tuning. It is central to many modern recipes because it is simple, reusable, and compatible with standard SFT infrastructure.
-
Let a teacher trace be:
\[y_T=(z_1,z_2,\dots,z_n)\]- where tokens may encode reasoning, tool calls, tool outputs, code edits, or final answers. The training objective is:
\[\mathcal{L}_{\text{trace-SFT}}(\theta) =-\sum_{i=1}^{n} \log p_S^\theta(z_i \mid x,z_{<i})\]
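As a minimal sketch of this trace-level SFT objective, the following Python sums the negative log-likelihood of a teacher trace from per-token student log-probabilities. The function name `trace_sft_loss` and the optional `loss_mask` (for example, to skip tool-output tokens produced by the environment rather than the model) are illustrative assumptions, not part of any specific recipe.

```python
from typing import Optional, Sequence


def trace_sft_loss(
    token_logprobs: Sequence[float],
    loss_mask: Optional[Sequence[int]] = None,
) -> float:
    """Summed negative log-likelihood of a teacher trace under the student.

    token_logprobs[i] is log p_S(z_i | x, z_<i) for the i-th trace token.
    loss_mask[i] is 1 if token i contributes to the loss and 0 otherwise
    (an assumed convention; masking is optional).
    """
    if loss_mask is None:
        loss_mask = [1] * len(token_logprobs)
    if len(loss_mask) != len(token_logprobs):
        raise ValueError("loss_mask and token_logprobs must have equal length")
    # Only unmasked tokens contribute to the loss.
    return -sum(lp * m for lp, m in zip(token_logprobs, loss_mask))


# Three trace tokens; the middle one is treated as a masked tool output.
print(trace_sft_loss([-0.5, -1.0, -0.25]))
print(trace_sft_loss([-0.5, -1.0, -0.25], loss_mask=[1, 0, 1]))
```

In practice the log-probabilities would come from a forward pass of the student over the tokenized trace, and whether the loss is summed or length-normalized is a design choice.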
-
Trace distillation is the offline counterpart of MOPD. Both aim to consolidate knowledge from stronger or more specialized policies, but they differ in the trajectory source: trace-distillation SFT uses teacher-generated trajectories, while MOPD uses student-generated trajectories scored by teachers.
-
This distinction is visible in recent post-training recipes. MAI-Thinking-1-style recipes use specialist RL climbs followed by trace-distillation SFT to merge those climbs, then run a final RL climb. This is a conservative consolidation strategy because it avoids the systems complexity of routing student rollouts through many teacher scorers.
-
Trace-distillation SFT is useful when teacher traces are high quality and verifiable, when the student is not yet capable enough for useful on-policy rollouts, when MOPD infrastructure is unavailable or immature, when the goal is a stable cold start before RL, or when a team wants reusable data rather than a tightly coupled rollout-scoring loop.
Logit Distillation
-
In logit or soft-label distillation, the student matches the teacher’s full next-token distribution:
\[\mathcal{L}_{\text{logit}} =\mathbb{E}_{(x,y)} \sum_t D_{KL} \left( p_T(\cdot \mid x, y_{<t}) \,\Vert\, p_S(\cdot \mid x, y_{<t}) \right)\] -
Compared with sequence-level distillation, this preserves uncertainty information, token similarities, and alternative plausible continuations.
-
This approach is especially effective when the teacher is much stronger and when the student has sufficient capacity to approximate the teacher distribution.
-
In LLMs, full-vocabulary logit distillation is often expensive because the vocabulary can contain roughly \(|V| \approx 10^5\) tokens. Practical systems therefore use teacher top-\(k\) probabilities, student top-\(k\) probabilities, sampled-token log-probabilities, renormalized truncated distributions, or cross-entropy on teacher-generated sequences.
-
For off-policy distillation, top-\(k\) teacher distributions can often be precomputed and stored. This is less flexible than live teacher scoring, but it makes student training cheaper and reproducible.
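The following sketch illustrates one way a stored top-\(k\) teacher distribution could be used for a single token position. It assumes the teacher's top-\(k\) log-probabilities were saved as a token-ID-to-log-probability mapping and that both distributions are renormalized over that truncated support. The function name and the renormalization choice are assumptions for illustration; other truncation schemes are possible.

```python
import math
from typing import Dict


def topk_forward_kl(
    teacher_topk_logprobs: Dict[int, float],
    student_logprobs: Dict[int, float],
) -> float:
    """KL(p_T || p_S) restricted to the teacher's stored top-k tokens.

    Both distributions are renormalized over the teacher's top-k support,
    so the result approximates the full-vocabulary forward KL.
    """
    tokens = list(teacher_topk_logprobs)
    # Log-normalizers over the truncated support.
    t_norm = math.log(sum(math.exp(teacher_topk_logprobs[v]) for v in tokens))
    s_norm = math.log(sum(math.exp(student_logprobs[v]) for v in tokens))
    kl = 0.0
    for v in tokens:
        log_pt = teacher_topk_logprobs[v] - t_norm
        log_ps = student_logprobs[v] - s_norm
        kl += math.exp(log_pt) * (log_pt - log_ps)
    return kl


# Teacher top-2 for one position; the student covers a 3-token vocabulary.
teacher = {0: math.log(0.5), 1: math.log(0.5)}
student = {0: math.log(0.2), 1: math.log(0.6), 2: math.log(0.2)}
print(topk_forward_kl(teacher, student))
```

Summing this quantity over positions gives a truncated version of the logit-distillation loss. The cost is that any teacher probability mass outside the stored top-\(k\) is ignored.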
Rejection-Sampling SFT
-
Rejection-sampling SFT is a particularly important off-policy pattern in modern LLM recipes. A model generates multiple candidate responses for each prompt, a reward model or verifier selects high-quality candidates, and the student is trained on the selected outputs.
-
The workflow is:
\[y_1,\dots,y_K \sim p(\cdot \mid x)\] \[y^\star =\arg\max_{y_i} R(x,y_i)\] \[\mathcal{L}_{\text{RS-SFT}}(\theta) =-\log p_S^\theta(y^\star \mid x)\]- where \(R\) may be a reward model, verifier, unit-test score, rule-based checker, or LLM judge.
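The selection step of this workflow can be sketched in a few lines. Here `sample_fn` stands in for the sampling policy and `reward_fn` for \(R\); both are hypothetical callables supplied by the caller, and the toy sampler and length-based reward in the usage example are placeholders rather than a realistic setup.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    prompt: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    k: int = 4,
) -> Tuple[str, float]:
    """Draw k candidates for a prompt and keep the highest-reward one."""
    candidates: List[str] = [sample_fn(prompt) for _ in range(k)]
    scored = [(reward_fn(prompt, y), y) for y in candidates]
    best_reward, best = max(scored, key=lambda pair: pair[0])
    return best, best_reward


# Toy usage: a canned "policy" and a reward that prefers longer answers.
canned = iter(["4", "2 + 2 = 4", "four", "It is 4."])
best, reward = rejection_sample(
    "What is 2 + 2?",
    sample_fn=lambda prompt: next(canned),
    reward_fn=lambda prompt, y: float(len(y)),
    k=4,
)
print(best, reward)
```

The selected pair \((x, y^\star)\) would then be appended to the SFT corpus. Real pipelines often keep several candidates above a threshold rather than only the top one.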
-
Llama 3-style recipes are a useful example: the model samples multiple responses per prompt, uses a reward model and rejection sampling to form SFT data, then trains an SFT model and DPO model across repeated rounds.
-
DeepSeek-R1-style recipes also use rejection-sampling SFT to refine reasoning behavior: RL is used to elicit reasoning, then successful or high-quality reasoning traces are selected and converted into supervised data for subsequent training.
-
Rejection-sampling SFT is off-policy because the final student trains on selected traces rather than on its own current rollouts. However, if the selected traces come from recent checkpoints, it can still be closer to the student’s distribution than static human-written data.
Synthetic Data Pipelines
-
A major modern use of off-policy distillation is as the final consumer of synthetic data pipelines.
-
A typical workflow starts by collecting prompts from benchmark datasets, user interactions, or automatically generated prompt sets designed to cover the target domains. A strong teacher then generates one or more candidate completions for each prompt, often including intermediate reasoning traces or tool-use steps. Reward models, verifiers, or additional teacher models evaluate the generated outputs for correctness, helpfulness, consistency, safety, and format compliance. The highest-quality outputs are selected, reranked, or filtered to remove low-confidence or incorrect examples. The resulting dataset is stored as a reusable synthetic corpus that may include completions, chain-of-thought traces, verifier scores, teacher log-probabilities, critique fields, or tool transcripts. The student is then trained on this curated dataset using sequence-level distillation, logit matching, or a hybrid objective.
-
Synthetic datasets may contain detailed chain-of-thought traces for structured problem solving, verified code solutions that pass unit tests or execution checks, critiques and revisions that teach diagnosis and repair, tool-use transcripts that demonstrate API or terminal interaction, preference annotations that support later alignment or ranking objectives, multi-turn simulated conversations that expose conversational dynamics, long-context examples for retrieval and cross-document reasoning, and agentic software trajectories that teach repository navigation, patch generation, test execution, and final submission behavior.
-
The RLHF Book highlights that this synthetic-data-to-distillation pipeline remains the dominant method for transferring capabilities from frontier models into smaller and more deployable students.
Off-Policy Distillation in Recent Post-Training Recipes
-
Off-policy distillation appears in nearly every recent recipe, even when the recipe is described primarily as RL or MOPD.
-
In InstructGPT-style RLHF, the SFT stage is off-policy because the model learns from human demonstrations rather than its own rollouts, and the reward model stage also relies on fixed comparison data.
-
In Llama 2-style RLHF, SFT and rejection-sampling stages create offline supervised targets, while PPO supplies the online RL component.
-
In Llama 3-style recipes, the reward model is used primarily as a filter: the model samples multiple responses per prompt, rejection sampling selects outputs, SFT trains on those outputs, and DPO further refines the model. This is a largely off-policy pipeline with iterative refresh.
-
In Tülu 3-style recipes, curated prompts, SFT, DPO, and RLVR are staged so that offline supervised and preference data prepare the model before verifiable-reward RL.
-
In DeepSeek-R1-style recipes, SFT is used as a cold start and as a refinement mechanism after reasoning RL. RL elicits reasoning behaviors, while rejection-sampling SFT distills and clarifies those behaviors into training data.
-
In MAI-Thinking-1-style recipes, multiple specialist RL climbs are consolidated through trace-distillation SFT before a final RL climb. This is a clear example of off-policy consolidation from specialist teachers.
-
In Nemotron-Cascade 2-style recipes, SFT data is generated and filtered across math, code, science, long context, chat, instruction following, safety, and tool-use domains before Cascade RL and multi-domain OPD. This demonstrates how off-policy distillation can serve as the foundation on which later on-policy training is built.
-
In Nemotron 3 Ultra-style recipes, SFT data includes tool-calling math traces, non-tool math traces, proof generation and verification, science data filtered for reasoning and format quality, multi-turn chat data selected by a reward model, and software-engineering trajectories filtered for tool-call hygiene and undesirable action patterns. This shows that off-policy distillation increasingly resembles a large-scale data engineering system rather than a simple SFT pass.
Advantages
- Off-policy distillation is operationally simple because it closely resembles supervised fine-tuning on a fixed dataset; it is stable and reproducible because the same examples can be reused across runs; it amortizes teacher inference because outputs can be generated once and consumed many times; it scales naturally to large datasets and distributed training systems; it integrates well with synthetic data generation pipelines for reasoning, coding, and instruction following; it supports heavy offline filtering before training; it is comparatively easy to audit because examples can be inspected before use; it provides a strong cold start for RL or OPD; and it is compatible with many supervision sources, including frontier models, specialist teachers, older student checkpoints, human annotators, LLM judges, reward models, and verifiers.
Limitations: Distribution Mismatch
-
The central weakness of off-policy distillation is that the student is trained on trajectories it did not generate.
-
During inference, the student samples:
\[\hat{y} \sim p_S(\cdot \mid x)\]- which may diverge from teacher-generated sequences. Because each token conditions on previous tokens, small errors compound over long trajectories.
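To make the compounding effect concrete, the following back-of-the-envelope sketch assumes that each generated token independently leaves the training distribution with a fixed probability. The independence assumption and the function name are deliberate simplifications for illustration, not a result from the cited papers.

```python
def prob_stays_on_support(per_token_error: float, num_tokens: int) -> float:
    """Probability that no step leaves the training distribution.

    Assumes independent, identical per-token error probabilities, which
    real autoregressive rollouts do not satisfy exactly.
    """
    if not 0.0 <= per_token_error <= 1.0:
        raise ValueError("per_token_error must lie in [0, 1]")
    return (1.0 - per_token_error) ** num_tokens


for horizon in (10, 100, 1000):
    print(horizon, prob_stays_on_support(0.01, horizon))
```

Under this simplified model, a 1% per-token deviation rate leaves roughly a 90% chance of staying on-distribution after 10 tokens, about 37% after 100 tokens, and well under 0.01% after 1,000 tokens.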
-
This problem is explicitly analyzed in On-Policy Distillation of Language Models and motivates Generalized Knowledge Distillation.
-
The Thinking Machines blog post On-Policy Distillation compares this to learning chess solely by watching grandmasters: one sees excellent play but not the board states produced by one’s own mistakes.
-
In long-horizon reasoning and agentic workflows, this mismatch is more severe because the student’s early decisions determine the tools it calls, files it edits, tests it runs, assumptions it carries forward, and intermediate reasoning branches it explores.
-
For tool use, an off-policy trace may show the correct API call, but the deployed student may produce a malformed call, search the wrong file, run the wrong command, or misunderstand a tool result. The offline trace provides little direct supervision for recovery from that specific state.
Behavioral Consequences
-
Off-policy students often perform strongly when their generated prefixes remain close to trajectories seen during training, but they can struggle to recover from early mistakes that push generation into unfamiliar contexts. This manifests as exposure bias in long-horizon reasoning and agentic tasks, stylistic imitation without full transfer of reasoning competence, overconfidence when trained primarily on deterministic single targets, tool-use brittleness when successful traces dominate the corpus, format sensitivity when examples are filtered for perfect formatting, and repetition of teacher artifacts such as verbose reasoning conventions, fixed templates, or unnecessary tool calls.
-
These behaviors do not make off-policy distillation ineffective. They clarify why it is most effective as a foundation stage and why later on-policy methods are useful for robustness.
Filtering, Verification, and Data Quality
-
Off-policy distillation shifts much of the difficulty from online optimization to offline data quality. The objective is simple, but the training signal depends heavily on how examples are generated and filtered.
-
Common filtering mechanisms include exact and fuzzy deduplication, unit-test execution for code, answer checking for math, proof validation where possible, format and schema checks, tool-call validity checks, reward-model ranking, LLM judge scoring, safety classifiers, length and repetition filters, and human audit for high-impact domains.
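A few of these filters can be chained in a small function, sketched below under simplifying assumptions: examples are dictionaries with `prompt` and `response` keys, deduplication is exact (hash-based) only, and `verifier` is a hypothetical caller-supplied predicate standing in for unit tests, answer checkers, or judges.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

Example = Dict[str, str]


def filter_examples(
    examples: Iterable[Example],
    verifier: Callable[[Example], bool],
    max_chars: int = 20000,
) -> List[Example]:
    """Apply exact deduplication, a length filter, and a verifier check."""
    seen = set()
    kept: List[Example] = []
    for ex in examples:
        # Exact dedup on whitespace-stripped prompt + response text.
        key_text = ex["prompt"].strip() + "\n" + ex["response"].strip()
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Length filter (characters here; token counts in a real pipeline).
        if len(ex["response"]) > max_chars:
            continue
        # Correctness or quality check supplied by the caller.
        if not verifier(ex):
            continue
        kept.append(ex)
    return kept


data = [
    {"prompt": "2+2", "response": "4"},
    {"prompt": "2+2", "response": "4"},   # exact duplicate
    {"prompt": "2+2", "response": "5"},   # fails the verifier
]
print(filter_examples(data, verifier=lambda ex: ex["response"] == "4"))
```

Fuzzy deduplication, safety classifiers, and reward-model ranking would slot in as additional stages. Running cheap filters before expensive ones keeps verifier cost down.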
-
In code SFT pipelines, examples with verifiable tests can be filtered by correctness. Prompts without tests may require weaker heuristics, such as selecting longer or more detailed reasoning traces when those correlate with better analysis.
-
In software-engineering and agentic traces, filtering must often remove undesirable behaviors even when the final task succeeds. Examples include invalid submission actions, disallowed git operations, excessive edit-test loops, exploratory thrashing without edits, malformed tool calls, debug artifacts in final patches, and trajectories that edit code but never run tests.
-
In chat data, filtering may select the highest-quality response among multiple candidates using a reward model or judge. Multi-turn chat data may also simulate users with varied strategies, such as asking clarifying questions, challenging assumptions, reframing the task, or applying an answer to a new context.
-
This filtering stage is one reason off-policy distillation remains central to frontier recipes: it allows data quality to be improved before expensive student training begins.
Relationship to Reinforcement Learning
- Off-policy distillation and reinforcement learning differ in both trajectory source and feedback density:
| Method | Trajectory source | Reward or supervision density |
|---|---|---|
| Off-policy KD | Teacher, human, verifier-filtered data, or dataset | Dense token-level or hard sequence targets |
| RLHF / RLVR | Student | Sparse sequence-level or outcome-level rewards |
| On-policy distillation | Student | Dense token-level teacher feedback |
-
As a result, off-policy distillation is highly sample-efficient but less robust than on-policy approaches.
-
Many modern pipelines follow a staged progression in which synthetic data is generated and filtered into high-quality off-policy supervision, the student is trained through off-policy distillation to absorb broad teacher capabilities, reinforcement learning then refines behaviors that are difficult to specify directly in the dataset, and on-policy distillation transfers the benefits of RL into a smaller or more efficient model or consolidates multiple specialist teachers.
-
The RLHF Book explicitly presents this progression as the “path to on-policy distillation.”
-
Another useful framing is that off-policy distillation teaches the student what good behavior looks like, RL teaches the student which of its behaviors succeed under a reward or verifier, and OPD teaches the student how a teacher would locally correct the behaviors it actually sampled.
Engineering and Systems Considerations
-
Off-policy systems are operationally straightforward: teacher inference can be performed asynchronously and at large batch sizes; training examples can store token IDs, reasoning traces, verifier scores, and optionally top-\(k\) log-probabilities; synthetic datasets can be reused across many experiments and student architectures; and student training proceeds independently without synchronous communication with the teacher.
-
The primary costs arise from generating synthetic data, storing large corpora, and maintaining the filtering and verification infrastructure that ensures data quality.
-
Practical systems must handle data provenance, versioning, prompt contamination, benchmark leakage, teacher model version tracking, deduplication across generated corpora, mixture weighting across domains, long-trace storage and compression, tool-output normalization, chat-template consistency, tokenizer compatibility, and safety or policy audits.
-
Large SFT mixtures should often be balanced by token count rather than example count because long reasoning, long-context, and software-engineering traces can otherwise dominate the gradient.
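One way to implement token-based balancing is to pick per-domain sampling probabilities so that the expected token share, rather than the example share, matches a target. The sketch below assumes a two-stage sampler (choose a domain, then a uniform example within it). The function name and the toy domains are illustrative.

```python
from typing import Dict, List


def domain_sampling_probs(
    token_lengths: Dict[str, List[int]],
    target_token_share: Dict[str, float],
) -> Dict[str, float]:
    """Per-domain example-sampling probabilities that hit a target token share.

    Drawing domain d with probability q_d contributes q_d * mean_len_d
    tokens per draw in expectation, so q_d is set proportional to
    target_share_d / mean_len_d and then normalized.
    """
    raw = {}
    for domain, lengths in token_lengths.items():
        mean_len = sum(lengths) / len(lengths)
        raw[domain] = target_token_share[domain] / mean_len
    total = sum(raw.values())
    return {domain: value / total for domain, value in raw.items()}


lengths = {"chat": [100, 100], "swe": [900, 1100]}
probs = domain_sampling_probs(lengths, {"chat": 0.5, "swe": 0.5})
print(probs)
```

In this toy case the long software-engineering traces are sampled about ten times less often than chat examples, so both domains contribute equal expected tokens per draw.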
-
For off-policy logit distillation, storage and bandwidth are major constraints. Full-vocabulary logits are usually impractical; top-\(k\) distributions or sampled-token log-probabilities are much cheaper.
-
For off-policy trace distillation, text storage is cheaper than logit storage, but trace quality becomes the central bottleneck.
When Off-Policy Distillation is Preferred
-
Off-policy distillation is the best choice when simplicity and stability are more important than exact train-inference matching; when large synthetic datasets are already available or can be generated economically; when teacher inference should be amortized offline and reused across many experiments; when the student is unlikely to diverge substantially from the training distribution during deployment; when the goal is broad capability transfer rather than maximal robustness to self-generated errors; when the student needs a cold start before RL, OPD, or MOPD; when data quality and auditability are central requirements; when a team wants to compare many student architectures or objectives on the same fixed corpus; or when teacher outputs require expensive filtering or human review that cannot be placed inside a live training loop.
-
It remains the dominant starting point for most practical distillation pipelines, even when more advanced on-policy or reinforcement learning stages are planned later.
Practical Recipe Pattern
-
A robust off-policy distillation recipe usually follows a sequence in which prompts are collected, teacher generations are produced, outputs are filtered or verified, duplicates and low-quality items are removed, the mixture is balanced across domains and trace lengths, the student is trained through SFT or KD, and evaluation checks both target gains and regressions.
-
For reasoning and coding, the strongest version often generates several candidates, verifies them, selects the best traces, and trains the student on those selected traces.
-
For agentic systems, the strongest version often starts with an agent rollout, checks task success, filters for tool-use hygiene, removes known anti-patterns, and only then adds the trajectory to the SFT corpus.
-
For multi-domain recipes, the most common role of off-policy distillation is to prepare the student for later online or on-policy stages:
\[\text{Off-policy SFT} \rightarrow \text{RLVR} \rightarrow \text{OPD / MOPD} \rightarrow \text{final alignment}\] -
This progression reflects the modern view: off-policy distillation is not obsolete. It is the stable substrate on which more adaptive post-training methods are built.
On-Policy Distillation (OPD)
-
On-policy distillation addresses the central limitation of off-policy methods by training the student on trajectories it actually generates, rather than only on fixed datasets curated by humans or sampled from a teacher. By shifting supervision onto the student’s own state distribution, on-policy distillation substantially reduces exposure bias and compounding errors in autoregressive models.
-
The central idea of this section is that OPD combines the trajectory relevance of reinforcement learning with the dense token-level supervision of distillation. RL trains on student-generated rollouts but often supplies only sparse outcome rewards. Classical KD supplies dense token-level supervision but usually trains on teacher or dataset trajectories. In OPD, the student samples the trajectory and the teacher scores the student’s actual prefixes.
-
OPD is best understood as a local correction mechanism. Instead of asking only what the teacher would have generated from the original prompt, OPD asks how the teacher evaluates the choices the student actually made. This distinction is crucial in long reasoning chains, coding tasks, and agentic tool-use workflows where one early deviation changes the entire future trajectory.
-
OPD is also a systems primitive, not just a loss. A practical OPD system must generate rollouts, preserve token-level student log-probabilities, route examples to teacher scorers, compute teacher log-probabilities under the exact student prefixes, mask invalid tokens, control rollout staleness, and update the student while keeping the teacher fixed or semi-fixed.
-
In most current LLM recipes, OPD is on-policy with respect to trajectory source but offline with respect to teacher updates: the student generates rollouts, while one or more frozen teachers score them. Multi-teacher on-policy distillation (MOPD) extends the same idea by routing student rollouts to domain-specialist teachers, often after SFT, RLVR, or Cascade RL has created a strong initial student.
Core Idea and Formal Objective
-
In on-policy distillation, the student first generates a rollout:
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\] -
The teacher then evaluates the exact same trajectory by assigning next-token probabilities conditioned on the student’s prefixes. The student is updated to reduce the divergence between its own token distributions and the teacher’s token distributions along this rollout:
\[\mathcal{L}_{\text{on-policy}}(\theta) =\mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{\hat{y} \sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right] \right]\] -
This formulation was introduced in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024), which frames distillation as an imitation-learning problem in the style of DAgger and demonstrates gains over conventional supervised KD and sequence-level KD across summarization, translation, and mathematical reasoning.
-
A more explicit state-based notation writes the student-visited prefix as:
\[s_t=(x,\hat{y}_{<t})\]- and defines the token-level matching loss as:
\[\ell_t(\theta) =D\left( p_T(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right)\]
-
The key property is that \(s_t\) is sampled from the student’s own rollout distribution. If the student makes an early mistake, the teacher does not simply show an ideal path from the prompt; instead, the teacher evaluates the continuation under the mistaken prefix and provides local corrective information.
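A toy sketch can make the student-visited-state idea concrete. It models the student and teacher as hypothetical functions from a token prefix to a categorical next-token distribution, samples a rollout from the student, and sums the per-state divergence with the teacher evaluated on the student's own prefixes. Forward KL is used for \(D\) purely as an example, and the code computes only the loss value. A real implementation would use neural models and backpropagate through the student's distributions with the rollout held fixed.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# A policy maps a token prefix to next-token probabilities (toy stand-in).
Policy = Callable[[Tuple[int, ...]], Sequence[float]]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for categorical distributions; q must cover p's support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def opd_rollout_loss(
    student: Policy,
    teacher: Policy,
    prompt: Tuple[int, ...],
    max_new_tokens: int,
    rng: random.Random,
) -> Tuple[List[int], float]:
    """Sample from the student and score each visited prefix with the teacher."""
    prefix = tuple(prompt)
    rollout: List[int] = []
    loss = 0.0
    for _ in range(max_new_tokens):
        p_s = student(prefix)
        p_t = teacher(prefix)  # teacher conditions on the student's prefix
        loss += kl(p_t, p_s)
        # The next token comes from the student, not the teacher.
        token = rng.choices(range(len(p_s)), weights=p_s, k=1)[0]
        rollout.append(token)
        prefix = prefix + (token,)
    return rollout, loss


student = lambda prefix: [0.5, 0.5]
teacher = lambda prefix: [0.9, 0.1]
print(opd_rollout_loss(student, teacher, (0,), 3, random.Random(0)))
```

The essential difference from off-policy distillation is the line that samples `token` from `p_s`: the prefixes on which the teacher is queried are generated by the student.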
OPD as Supervised Learning over Student-Visited States
-
A subtle but important point is that practical OPD usually does not differentiate through the student’s sampling distribution. The rollout is sampled, treated as data, rescored by the teacher, and then used for a supervised token-level update. This is different from sequence-level reverse-KL methods that backpropagate through the sampling distribution and behave more like policy-gradient training.
-
A stop-gradient view of the common implementation is:
\[\nabla_\theta \mathcal{L}_{OPD} \approx \mathbb{E}_{\hat{y} \sim \operatorname{sg}(p_S^\theta)} \left[ \nabla_\theta \sum_t D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]- where \(\operatorname{sg}\) denotes stop-gradient through the rollout sampling process.
-
MiniLLM: Knowledge Distillation of Large Language Models by Gu et al. (2023) studies a reverse-KL formulation closer to policy-gradient optimization through the student’s sequence distribution, while the GKD-style OPD implementations used in many codebases are closer to supervised learning on student-visited states. This distinction matters because the policy-gradient version can be noisier and more complex to stabilize, while the GKD-style version is simpler and closer to DAgger.
-
The practical interpretation is that OPD is not “RL with a teacher” in every implementation. It is often better described as repeatedly collecting student rollouts, labeling those rollouts with dense teacher feedback, and performing supervised distillation updates on the resulting student-state dataset.
Intuition: Learning from One’s Own Mistakes
-
The key advantage of on-policy distillation is that the student receives feedback precisely in the contexts it is most likely to encounter at inference time.
-
The Thinking Machines blog post On-Policy Distillation presents an intuitive analogy to chess. Instead of only observing expert games, or receiving a single win-or-loss signal after playing a full game, the student receives move-by-move evaluations of its own choices. This makes it possible to identify and correct the specific decisions that caused the rollout to go off track.
-
A targeted correction interpretation is also useful for agentic systems. If the student produces a mostly plausible trajectory but makes one invalid tool call, OPD can penalize the local token choices that produced that tool call rather than spreading a final failure signal across the whole trajectory. This is why OPD often fits naturally into RL infrastructure: it can use the same rollout machinery, but the feedback is dense and token-level rather than sparse and trajectory-level.
-
The following figure (source) shows a chess.com-style visualization in which each move in the learner’s own game is graded from blunder to brilliant, illustrating how on-policy distillation provides dense, token-level feedback over self-generated trajectories.

Generalized Knowledge Distillation (GKD)
-
Agarwal et al. introduce Generalized Knowledge Distillation (GKD), which unifies several forms of distillation under a single framework. At each training step, the algorithm may sample a trajectory from the student and obtain teacher supervision along that rollout, or draw a trajectory from a fixed dataset and perform traditional off-policy distillation on that example.
-
This mixture is controlled by a parameter \(\lambda \in [0,1]\), which specifies the fraction of training examples that are student-generated.
-
When \(\lambda=0\), GKD reduces to standard supervised distillation. When \(\lambda=1\), all training occurs on student-generated trajectories. Intermediate values provide a practical curriculum that combines the stability of offline supervision with the robustness benefits of on-policy training.
-
A mixed GKD objective can be written as:
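\[\mathcal{L}_{GKD}(\theta) =(1-\lambda) \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ D(p_T \,\Vert\, p_S^\theta)(x,y) \right] +\lambda \mathbb{E}_{x\sim\mathcal{D}} \mathbb{E}_{\hat{y}\sim p_S^\theta(\cdot \mid x)} \left[ D(p_T \,\Vert\, p_S^\theta)(x,\hat{y}) \right]\]- The following expansion makes the two mixture branches explicit: the first term is off-policy distillation on dataset trajectories \(y\), and the second term is on-policy distillation on student-generated trajectories \(\hat{y}\):
\[\mathcal{L}_{GKD}(\theta) =(1-\lambda) \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t=1}^{|y|} D\left( p_T(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right) \right] +\lambda \mathbb{E}_{x\sim\mathcal{D}} \mathbb{E}_{\hat{y}\sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( p_T(\cdot \mid x,\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]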
-
The curriculum interpretation is important in practice. A weak student may initially generate low-quality or off-support trajectories, so a purely on-policy objective can waste teacher calls. A stronger student can generate informative failures, making on-policy teacher feedback more useful.
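The \(\lambda\)-controlled mixture can be sketched as a per-example coin flip that decides where the training trajectory comes from. The helper below is a hypothetical illustration: `student_sample_fn` stands in for student generation, and the returned source tag is only for bookkeeping.

```python
import random
from typing import Callable, Tuple


def gkd_pick_trajectory(
    prompt: str,
    dataset_output: str,
    student_sample_fn: Callable[[str], str],
    lam: float,
    rng: random.Random,
) -> Tuple[str, str]:
    """Return (source, trajectory) for one training example.

    With probability lam the trajectory is a fresh student rollout;
    otherwise it is the fixed dataset output for this prompt.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if rng.random() < lam:
        return "student", student_sample_fn(prompt)
    return "dataset", dataset_output


rng = random.Random(0)
picks = [
    gkd_pick_trajectory("q", "reference answer", lambda p: "student draft", 0.5, rng)[0]
    for _ in range(1000)
]
print(picks.count("student") / len(picks))
```

Either way, the chosen trajectory is then scored by the teacher and used in the token-level divergence. A curriculum could raise `lam` over training as the student's rollouts become more informative, although the schedule itself is a design choice not specified here.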
-
The following figure (source) shows that on-policy Generalized Knowledge Distillation significantly outperforms supervised fine-tuning, supervised KD, and sequence-level KD across summarization, translation, and mathematical reasoning tasks.

Choice of Divergence and Reward Interpretation
-
Although forward KL remains theoretically valid, reverse KL is particularly well suited to on-policy training because the rollout is sampled from the student distribution. Under reverse KL, the student is penalized for sampled tokens that the teacher considers unlikely, and the loss can be approximated from teacher log-probabilities on sampled tokens rather than full-vocabulary distributions.
\[D_{KL}(p_S \,\Vert\, p_T) =\mathbb{E}_{y \sim p_S} \left[ \log \frac{p_S(y)}{p_T(y)} \right]\] -
The sampled-token reverse-KL signal is naturally interpreted as a dense per-token advantage:
\[A_t=\log p_T(y_t \mid x,y_{<t}) - \log p_S(y_t \mid x,y_{<t})\] -
Sampled tokens to which the teacher assigns higher probability than the student receive positive advantage, while sampled tokens to which the teacher assigns lower probability than the student receive negative advantage. The multi-teacher OPD article Multi-Teacher On-Policy Distillation: A New Post-Training Primitive emphasizes that this makes reverse KL a natural replacement or complement for advantage terms in GRPO-style reinforcement learning.
-
The Hugging Face TRL writeup Distilling 100B+ Models 40x Faster with TRL similarly notes that reverse KL aligns cleanly with student-generated trajectories and can require only the teacher’s log-probabilities on sampled tokens rather than full-vocabulary distributions.
-
The divergence choice is also a systems choice. Forward KL or distribution matching may require teacher top-\(k\) or full-vocabulary distributions. Reverse-KL-style sampled-token training can use a much smaller payload:
\[\left(y_t,\log p_T(y_t \mid s_t),\log p_S(y_t \mid s_t)\right)\]- for each valid response token \(t\).
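Given that per-token payload, the dense advantages and a sampled estimate of the reverse KL can be computed without any full-vocabulary tensors. The sketch below assumes the log-probabilities are already gathered at the sampled tokens and that `mask` marks valid response tokens. The function names are illustrative.

```python
from typing import List, Sequence


def sampled_token_advantages(
    teacher_logprobs: Sequence[float],
    student_logprobs: Sequence[float],
    mask: Sequence[int],
) -> List[float]:
    """A_t = log p_T(y_t | s_t) - log p_S(y_t | s_t); zero on masked tokens."""
    return [
        (lt - ls) if m else 0.0
        for lt, ls, m in zip(teacher_logprobs, student_logprobs, mask)
    ]


def sampled_reverse_kl(advantages: Sequence[float], mask: Sequence[int]) -> float:
    """Single-sample estimate of the mean per-token reverse KL.

    Because tokens were sampled from the student, the average of -A_t over
    valid tokens estimates E_{y ~ p_S}[log p_S(y) - log p_T(y)].
    """
    n = sum(mask)
    if n == 0:
        return 0.0
    return -sum(a for a, m in zip(advantages, mask) if m) / n


teacher_lp = [-1.0, -2.0, -0.5]
student_lp = [-1.5, -1.0, -0.5]
mask = [1, 1, 0]  # last position is treated as padding
adv = sampled_token_advantages(teacher_lp, student_lp, mask)
print(adv, sampled_reverse_kl(adv, mask))
```

A single-sample estimate like this is noisy at any individual position, which is one reason some implementations prefer top-\(k\) or full-distribution variants when bandwidth allows.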
On-Policy Distillation and Reinforcement Learning
-
One of the most important modern insights is that on-policy distillation can be understood as a dense, KL-constrained form of policy optimization.
-
The RLHF Book chapter Synthetic Data describes on-policy distillation as the natural progression after synthetic data generation and reinforcement learning. Reinforcement learning supplies on-policy trajectories but typically only sparse sequence-level rewards, whereas on-policy distillation provides token-level guidance from a stronger teacher over those same trajectories.
-
The following methods establish that on-policy distillation is not simply an alternative to reinforcement learning, but a general dense supervision framework that can replace, augment, or stabilize policy-gradient methods while preserving the on-policy nature of learning.
Reinforcement Learning via Self-Distillation
-
Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) converts textual critiques, execution errors, and other rich feedback into dense token-level updates, showing that self-distillation can serve as a general policy optimization mechanism.
-
The method constructs a teacher distribution conditioned on the original trajectory and a natural-language feedback string that explains what went wrong or how to improve. Student trajectories are replayed under this feedback-augmented teacher context, so each token can receive targeted corrective supervision. The same base model can instantiate both student and teacher views, reducing the need for a separate external model. This approach is especially useful for coding and reasoning tasks where runtime errors, verifier messages, or judge comments provide informative textual signals.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
-
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation by Yang et al. (2026) introduces ExOPD, or extrapolation OPD, which generalizes OPD by combining teacher imitation with explicit reward extrapolation, allowing the student to exceed the teacher rather than merely match it.
-
ExOPD uses a reference teacher policy to provide dense token-level supervision while an external reward signal estimates how much better or worse the current trajectory is than the teacher baseline. The distillation loss is reweighted by reward-derived scaling factors so that trajectories outperforming the teacher receive amplified updates. The framework supports interchangeable reference models, including external teachers, self-teachers, and moving-average checkpoints, which decouples the source of dense supervision from the source of the ultimate objective.
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
-
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation by Ko et al. (2026) introduces REOPOLD, a relaxed on-policy distillation method designed to reduce over-imitation, improve stability, and scale reasoning training more efficiently.
-
REOPOLD treats token-level teacher-student log-likelihood ratios as dense rewards, similar to reverse-KL advantages, but relaxes strict imitation by clipping or tempering overly strong penalties on low-value tokens. It also uses partial rollouts and truncated reasoning traces to reduce compute while preserving informative supervision. This is especially relevant for reasoning tasks where exact teacher imitation may be unnecessarily restrictive or may suppress productive exploratory reasoning.
Self-Distilled RLVR
-
Self-Distilled RLVR by Yang et al. (2026) introduces RLSD, which combines reinforcement learning with verifiable rewards and privileged self-distillation, using self-distillation to modulate update magnitudes while preserving RL-derived update directions.
-
The method gives a self-teacher privileged information such as the correct answer or a verified reasoning trace. Reinforcement learning determines the update direction based on correctness signals, while self-distillation scales the magnitude of token-level updates according to how strongly the privileged teacher prefers the sampled continuation. This separation reduces information leakage while preserving the objective grounding of RLVR.
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
-
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026) introduces SRPO, a sample-routing framework that combines GRPO-style reinforcement learning on successful rollouts with self-distillation-based correction on failed rollouts.
-
SRPO sends correct samples to a GRPO branch, where group-relative advantages provide a reward-aligned policy update. Incorrect samples with available teacher information are replayed under a privileged teacher context and corrected using dense token-level self-distillation. Routing decisions are based on verifier outcomes or reward thresholds, allowing the system to preserve efficient RL updates on successful trajectories while extracting richer supervision from failures.
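-
The routing step described above can be sketched as a small dispatch function. This is a minimal illustration, not the SRPO implementation: the field names `reward` and `privileged_context`, the threshold value, and the choice to skip failed rollouts that have no teacher information are all assumptions made for the example.

```python
def route_samples(rollouts, reward_threshold=1.0):
    """Split rollouts into a GRPO branch and a self-distillation branch.

    Each rollout is a dict with hypothetical keys:
      "reward"             -> verifier score for the rollout
      "privileged_context" -> teacher-only information, or None
    """
    grpo_branch, distill_branch, skipped = [], [], []
    for rollout in rollouts:
        if rollout["reward"] >= reward_threshold:
            # Successful rollout: keep the reward-aligned RL update.
            grpo_branch.append(rollout)
        elif rollout.get("privileged_context") is not None:
            # Failed rollout with teacher information: replay it under
            # the privileged context for dense token-level correction.
            distill_branch.append(rollout)
        else:
            # Assumption: failed rollouts without teacher info are skipped.
            skipped.append(rollout)
    return grpo_branch, distill_branch, skipped


rollouts = [
    {"id": 1, "reward": 1.0, "privileged_context": None},
    {"id": 2, "reward": 0.0, "privileged_context": "answer: 42"},
    {"id": 3, "reward": 0.0, "privileged_context": None},
]
grpo, distill, skipped = route_samples(rollouts)
print([r["id"] for r in grpo], [r["id"] for r in distill], [r["id"] for r in skipped])
```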
OpenClaw-RL: Train Any Agent Simply by Talking
-
OpenClaw-RL: Train Any Agent Simply by Talking by Wang et al. (2026) extends self-distillation to interactive environments with dense feedback sources such as tool outputs, GUI transitions, user replies, and environment state changes.
-
The agent’s original trajectory is replayed together with subsequent user or environment feedback. A hindsight-conditioned teacher evaluates how the model would have acted if it had observed the later information earlier. Tool outputs, GUI changes, and terminal states are converted into dense correction signals, and the same framework can support conversational agents, coding agents, and embodied control systems.
-
The following figure (source) shows an overview of the OpenClaw-RL infrastructure. Interaction streams come from Personal Agents, which are conversational and single-user agents hosted on personal devices, and General Agents, which include terminal, GUI, SWE, and tool-call agents hosted on cloud services. Samples flow into an asynchronous RL server with environment serving, PRM or judge reward computation, Megatron policy training, and SGLang policy serving.

- The following figure (source) shows how OpenClaw can be optimized simply by using it, with the simulation result illustrating how interaction traces can become training signal.

- The following figure (source) shows an overview of OpenClaw-RL. For personal agents, OpenClaw-RL supports both binary-reward optimization and on-policy distillation training, and their combination yields substantial performance gains. For general agentic RL, the framework integrates standard RLVR, step-wise rewards, and a simple standardization approach.

Multi-Domain and Multi-Teacher OPD
-
Multi-domain OPD extends the single-teacher setup by choosing a domain teacher for each prompt or trajectory. This is especially useful after Cascade RL or specialist-teacher training, where different checkpoints are strongest on different categories such as math, code, tool use, instruction following, long context, safety, or software engineering.
-
In Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026), MOPD is inserted after earlier RL stages because different benchmark categories fluctuate during training. Certain RLVR stages can reduce entropy and shorten reasoning traces, which may negatively affect mathematical reasoning, while RLHF-oriented optimization can trade off against instruction following. Multi-domain OPD is used to rebalance these capabilities by selecting strong intermediate domain teachers from the Cascade RL process.
-
The Nemotron-Cascade 2 MOPD advantage is defined on the student-sampled token rather than over the full vocabulary. If \(\pi_{\text{inf}}\) is the rollout policy used for generation, \(\pi_{\text{train}}\) is the policy being optimized, and \(\pi_{\text{domain}_i}\) is the selected domain teacher, then for state \(s_t=(x,y_{<t})\):
\[a_t^{MOPD} = \log \pi_{\text{domain}_i}(y_t \mid s_t) - \log \pi_{\text{train}}(y_t \mid s_t)\]
-
This advantage is positive when the domain teacher assigns higher probability to the sampled token than the current training policy. It therefore acts as a dense token-level distillation advantage that should converge toward zero as the student absorbs the teacher’s local preferences.
-
Because the rollout policy and training policy may differ in asynchronous systems, Nemotron-Cascade 2 applies truncated importance weighting:
\[r_t =\frac{ \pi_{\text{train}}(y_t \mid s_t) }{ \pi_{\text{inf}}(y_t \mid s_t) }\]
\[w_t =\operatorname{sg}[r_t] \mathbf{1} \left[ \epsilon_{\text{low}} \leq r_t \leq \epsilon_{\text{high}} \right]\]
-
The resulting surrogate objective is:
\[\mathcal{L}_{MOPD} =-\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\text{inf}}(\cdot \mid x)} \left[ \frac{1}{|\mathcal{V}(y)|} \sum_{t\in\mathcal{V}(y)} w_t \operatorname{sg}[a_t^{MOPD}] \log \pi_{\text{train}}(y_t \mid s_t) \right]\]
- where \(\mathcal{V}(y)\) is the set of valid response tokens retained by the token mask.
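-
The sampled-token advantage, truncated importance weight, and masked surrogate can be evaluated for a single rollout as in the sketch below. It works on plain lists of sampled-token log-probabilities, so the stop-gradient is only indicated in comments; the function name and the default values for \(\epsilon_{\text{low}}\) and \(\epsilon_{\text{high}}\) are illustrative assumptions, not values from the report.

```python
import math

def mopd_surrogate(teacher_logps, train_logps, inf_logps, valid_mask,
                   eps_low=0.5, eps_high=2.0):
    """Value of the MOPD surrogate for one rollout.

    All inputs are per-token lists for the tokens the rollout policy sampled.
    In an autograd framework, `adv` and `weight` would be detached
    (stop-gradient) and only log pi_train would carry gradients.
    """
    total, n_valid = 0.0, 0
    for lp_teacher, lp_train, lp_inf, keep in zip(
            teacher_logps, train_logps, inf_logps, valid_mask):
        if not keep:
            continue  # token excluded by the mask V(y)
        adv = lp_teacher - lp_train            # a_t^MOPD
        ratio = math.exp(lp_train - lp_inf)    # r_t = pi_train / pi_inf
        # Truncated importance weight: zero outside [eps_low, eps_high].
        weight = ratio if eps_low <= ratio <= eps_high else 0.0
        total += weight * adv * lp_train
        n_valid += 1
    return -total / max(n_valid, 1)


loss = mopd_surrogate(
    teacher_logps=[-0.5, -1.0],
    train_logps=[-1.0, -1.0],
    inf_logps=[-1.0, -4.0],   # second token is stale: ratio = e^3 > eps_high
    valid_mask=[1, 1],
)
print(loss)
```

In this example the second token contributes nothing, both because its advantage is zero and because its importance ratio falls outside the truncation window.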
-
This objective is important because it shows how OPD can be implemented inside an RL-style training engine without requiring full-vocabulary teacher distributions. It also makes systems-level staleness explicit: the rollout-generating policy and the learner policy may not be exactly identical, so the training loss must account for that mismatch.
-
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) extends this pattern to an agent-focused pipeline that includes SFT, unified RLVR, MOPD warmup, asynchronous MOPD, and MTP boosting. The report emphasizes that MOPD warmup aligns student rollouts with teacher-supported distributions before distillation, which reflects a practical support-overlap requirement for successful multi-teacher OPD.
Practical Failure Modes and Stabilization Recipes
-
OPD should be viewed less as full-vocabulary distribution matching and more as a fragile communication protocol between teacher and student through a small set of locally plausible next-token choices. The teacher’s feedback is most useful when the student’s rollouts fall within states where the teacher can assign meaningful token preferences.
-
Qwen3, GLM-5, MiMo, Nemotron-Cascade 2, and Nemotron 3 Ultra use OPD or OPD-adjacent methods in post-training, while also highlighting that practical OPD can be more brittle than SFT or RL when teacher-student local token preferences stop overlapping.
-
Stabilization therefore requires attention to rollout quality, support overlap, token masks, truncation, teacher routing, rollout length, repetition, tokenizer alignment, and the distinction between sampled-token feedback and full-distribution matching.
Thinking-Pattern Compatibility and Token Overlap
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe by Li et al. (2026) finds that OPD success depends on compatible teacher-student thinking patterns and on progressive alignment over a small shared set of high-probability tokens, which can carry most of the probability mass.
-
A useful diagnostic is local support overlap:
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k p_T(\cdot \mid s_t) \cap \operatorname{Top}_k p_S(\cdot \mid s_t) \right| }{k}\]
-
This overlap should be monitored on student-visited prefixes, not only on clean teacher or dataset prefixes. Benchmark accuracy alone does not indicate whether the teacher’s token-level supervision will be useful in the states the student actually visits.
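-
The overlap diagnostic is simple to compute once next-token distributions are available at a student-visited prefix. The sketch below assumes both models share a vocabulary indexing and that full probability vectors are at hand; the helper names are hypothetical.

```python
def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens (ties broken by index)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def local_overlap(teacher_probs, student_probs, k):
    """Overlap_k(s_t): fraction of shared tokens among the two top-k sets."""
    shared = top_k_ids(teacher_probs, k) & top_k_ids(student_probs, k)
    return len(shared) / k


teacher = [0.5, 0.3, 0.1, 0.1]
student = [0.4, 0.1, 0.4, 0.1]
overlap = local_overlap(teacher, student, k=2)
print(overlap)
```

Averaging this quantity over the prefixes of sampled student rollouts gives a per-batch monitor; a falling average is one signal that teacher feedback is becoming less informative.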
-
The practical recipe is to use an off-policy cold start before OPD, select teachers whose reasoning style is compatible with the student, monitor overlap among high-probability tokens, avoid teachers that pull an RL-improved student backward toward older reasoning patterns, and track whether teacher continuation advantage decays as rollout prefixes lengthen.
-
The following figure (source) shows an overview of the method. JustRL-1.5B is obtained by applying RL to DeepSeek-Distill-1.5B, and Skywork-OR1-Math-7B is obtained by applying RL to DeepSeek-Distill-7B.

- The following figure (source) presents a systematic study of OPD training dynamics, progressing from empirical conditions through token-level mechanism to practical recipe.

Length Inflation and Repetition Collapse
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models by Luo et al. (2026) identifies abrupt length inflation, repetition saturation, and truncation collapse as a major OPD failure mode. It proposes StableOPD, combining a reference-based divergence constraint with rollout mixture distillation.
-
Length inflation can occur because once the student enters repetitive or overlong prefixes, the teacher may still assign locally plausible probability to continuation tokens under those prefixes. This creates a self-reinforcing loop where degenerate prefixes produce locally acceptable token-level feedback even though the global trajectory is poor.
-
Practical stabilization requires tracking average rollout length, truncation rate, and repetition rate during training rather than relying only on validation accuracy. It also requires adding reference-based divergence constraints, mixing on-policy rollouts with cleaner reference trajectories, treating repeated tokens as high-risk examples, and stopping or downweighting truncation-dominated batches.
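-
The three monitoring quantities named above can be tracked per batch with a few lines of code. This is a sketch under stated assumptions: rollouts are lists of token IDs, a rollout counts as truncated when it reaches `max_len`, and repetition is measured as the fraction of duplicated n-grams, which is only one of several reasonable definitions.

```python
def repetition_rate(tokens, ngram=3):
    """Fraction of n-grams in a rollout that repeat an earlier n-gram."""
    grams = [tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def rollout_health(rollouts, max_len, ngram=3):
    """Batch-level length, truncation, and repetition statistics."""
    n = len(rollouts)
    return {
        "mean_length": sum(len(r) for r in rollouts) / n,
        "truncation_rate": sum(1 for r in rollouts if len(r) >= max_len) / n,
        "repetition_rate": sum(repetition_rate(r, ngram) for r in rollouts) / n,
    }


batch = [[1, 2, 3, 4], [5, 5, 5, 5, 5, 5]]
health = rollout_health(batch, max_len=6, ngram=2)
print(health)
```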
Sampled-Token OPD and Local Support Matching
-
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes by Fu et al. (2026) argues that sampled-token OPD is attractive because token-level feedback has better worst-case variance scaling than sequence-level reverse KL, but it is biased and fragile because it observes only one sampled token rather than the teacher’s local support.
-
A practical fix is to replace single sampled-token supervision with teacher top-\(K\) local support matching, where both teacher and student probabilities are renormalized over the teacher’s plausible next-token set. Another fix is to use top-\(p\) rollout sampling so the student is less likely to drift into extremely low-probability prefixes where teacher guidance is unreliable.
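-
Teacher top-\(K\) local support matching can be sketched as a reverse KL computed after renormalizing both distributions over the teacher's top-\(K\) tokens. The function below is an illustration of that idea rather than the paper's exact loss; the small `eps` guard and the function name are assumptions.

```python
import math

def topk_support_reverse_kl(teacher_probs, student_probs, k, eps=1e-12):
    """Reverse KL(student || teacher) restricted to the teacher's top-k tokens."""
    support = sorted(range(len(teacher_probs)),
                     key=lambda i: teacher_probs[i], reverse=True)[:k]
    z_teacher = sum(teacher_probs[i] for i in support)
    z_student = sum(student_probs[i] for i in support)
    kl = 0.0
    for i in support:
        # Renormalize both models over the teacher's plausible-token set.
        p_t = teacher_probs[i] / max(z_teacher, eps)
        p_s = student_probs[i] / max(z_student, eps)
        if p_s > 0.0:
            kl += p_s * math.log(p_s / max(p_t, eps))
    return kl


teacher = [0.6, 0.3, 0.1]
student = [0.1, 0.3, 0.6]
print(topk_support_reverse_kl(teacher, student, k=2))
```

Unlike the single sampled-token signal, this quantity uses every token in the teacher's local support, at the cost of transporting top-\(K\) teacher probabilities for each position.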
-
Additional safeguards include masking special tokens and tokenizer artifacts, preferring truncated reverse KL over one-token log-ratio updates when teacher top-\(K\) logits are affordable, and evaluating whether per-token advantages combine into coherent gradient directions rather than canceling across positions.
Privileged On-Policy Self-Distillation Caveats
-
On-policy self-distillation (OPSD) uses the same model as student and teacher, but conditions the teacher on privileged information such as a gold solution, final answer, runtime error, or environmental feedback. This can make the model act as its own teacher without requiring a stronger external model.
-
A context-explicit OPSD objective is:
\[\mathcal{L}_{\mathrm{OPSD}} =\sum_t D\left( \pi_T(\cdot \mid x,c,y_{<t}) \,\Vert\, \pi_S(\cdot \mid x,y_{<t}) \right)\]
- where \(c\) is teacher-only privileged context. Vanilla OPD corresponds to \(c=\varnothing\), while privileged-context OPD sets \(c\) to a final answer, a gold solution, or another piece of training-only information.
-
Recent evidence shows that privileged OPSD is not uniformly beneficial for thinking models. Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) reports that privileged-context OPSD can degrade long-budget thinking models by suppressing forking, verification, backtracking, and hedging behavior. The degradation is specific to privileged context rather than on-policy training itself: unprivileged OPD can improve the same student, while privileged OPD reverses the gain.
-
The failure appears at high-entropy forking positions where multiple reasoning paths remain plausible. Under vanilla OPD, tokens such as reconsideration, uncertainty, and branching markers can carry positive advantage. Once privileged context is added, the teacher may already know the answer and assign negative advantage to exploratory moves, causing the student to produce fewer of the deliberative behaviors that support long-budget reasoning.
-
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) similarly traces degradation to suppression of epistemic verbalization, where models lose explicit uncertainty markers such as checking, reconsideration, and uncertainty expression. The paper finds that richer teacher conditioning can produce shorter and more confident traces, which may help narrow in-domain tasks but hurt out-of-domain math reasoning when uncertainty expression supports exploration and error correction.
-
Fixed-teacher self-distillation can also be more stable than moving-target self-distillation. In a naive moving-teacher setup, the model becomes more confident, then uses this increasingly confident policy as its next teacher, amplifying response-length shrinkage and epistemic suppression over time. This is one reason many practical OPSD or OPD systems prefer frozen teacher checkpoints, EMA teachers with caution, or gated auxiliary objectives.
-
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) addresses privilege-induced style drift by contrasting the teacher-student gap under a correct hint with the gap under an incorrect hint. This subtracts the generic style shift induced by having a hint at all and leaves a signal more concentrated on task-bearing tokens.
-
A contrastive signal can be written as:
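\[e_t^{ctr} =\left[ \log p_\theta(\hat{y}_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(\hat{y}_t \mid x,\hat{y}_{<t}) \right] -\left[ \log p_\theta(\hat{y}_t \mid x,c^-,\hat{y}_{<t}) -\log p_\theta(\hat{y}_t \mid x,\hat{y}_{<t}) \right]\]
- where \(c^+\) is a correct hint, \(c^-\) is a wrong or contrastive hint, and \(\hat{y}\) is the student's sampled rollout. Algebraically, the unhinted baseline terms cancel, so \(e_t^{ctr}\) reduces to the log-probability gap between the correct-hint and contrastive-hint views of the same sampled token.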
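-
As a small numerical illustration, the contrastive signal can be computed from three per-token log-probability lists scored on the same rollout. The argument names are hypothetical, and the code only evaluates the formula; it does not reproduce how RLCSD uses the signal inside its RL objective.

```python
def contrastive_signal(logp_correct_hint, logp_wrong_hint, logp_no_hint):
    """Per-token gap under a correct hint minus the gap under a wrong hint."""
    signal = []
    for lp_pos, lp_neg, lp_base in zip(
            logp_correct_hint, logp_wrong_hint, logp_no_hint):
        gap_correct = lp_pos - lp_base   # shift induced by the correct hint
        gap_wrong = lp_neg - lp_base     # shift induced by any hint at all
        signal.append(gap_correct - gap_wrong)
    return signal


e_ctr = contrastive_signal(
    logp_correct_hint=[-0.5, -1.0],
    logp_wrong_hint=[-2.0, -1.0],
    logp_no_hint=[-1.0, -1.5],
)
print(e_ctr)
```

The second token receives zero signal because both hints shift its log-probability by the same amount, which is the style-only case the contrast is meant to cancel.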
-
The broader lesson is that dense token-level supervision is not automatically good. It must be aligned with the behavior the model should preserve at inference time. In thinking models, the “correct” teacher may be locally too confident because it has access to information unavailable to the deployed student.
Practical Training Loop
-
A typical on-policy distillation training loop begins by sampling prompts from a task dataset, synthetic prompt pool, or curriculum. The student then generates one or more rollouts while recording token identities, attention masks, stop positions, and per-token student log-probabilities. Each rollout is sent to a teacher or teacher router, which computes token-level log-probabilities conditioned on the exact student prefixes. A divergence, log-ratio, or advantage-like loss is computed at valid response tokens, optional clipping and masking suppress unstable updates, and gradients are propagated only through the student while the teacher remains fixed.
-
A compact loop is:
\[x \sim \mathcal{D}\]
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\]
\[\left\{ \log p_T(\hat{y}_t \mid x,\hat{y}_{<t}) \right\}_{t=1}^{|\hat{y}|} =\operatorname{TeacherScore}(x,\hat{y})\]
\[\theta \leftarrow \theta -\eta \nabla_\theta \mathcal{L}_{OPD}(\theta)\]
-
Because the teacher does not need to generate its own trajectory, but only evaluate the student’s rollout, teacher inference can be cheaper than full teacher rollout generation. However, the teacher must still process the full student sequence and return accurate log-probabilities under the correct chat template, tokenizer, and prefix structure.
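-
The loop can be made concrete with a deliberately tiny toy: a single state, a three-token vocabulary, a fixed teacher distribution, and a student parameterized by logits. Each step samples a token from the student, scores it with the teacher, and applies the sampled-token update with the log-ratio treated as a constant. Everything here (vocabulary size, learning rate, step count, function names) is an assumption chosen for illustration, not a production recipe.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(p_student, p_teacher):
    return sum(ps * math.log(ps / pt)
               for ps, pt in zip(p_student, p_teacher) if ps > 0.0)

def run_toy_opd(teacher_probs, steps=3000, lr=0.05, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(teacher_probs)          # student starts uniform
    for _ in range(steps):
        p_s = softmax(logits)
        # 1) Student rollout: sample one token from the current student.
        y = rng.choices(range(len(p_s)), weights=p_s)[0]
        # 2) Teacher scoring of the sampled token; the log-ratio is a constant.
        adv = math.log(teacher_probs[y]) - math.log(p_s[y])
        # 3) Gradient step on L = -adv * log p_S(y); for softmax logits,
        #    d log p_S(y) / d logit_i = 1[i == y] - p_S(i).
        for i in range(len(logits)):
            grad = -adv * ((1.0 if i == y else 0.0) - p_s[i])
            logits[i] -= lr * grad
    return softmax(logits)


teacher = [0.7, 0.2, 0.1]
kl_before = reverse_kl([1 / 3] * 3, teacher)
kl_after = reverse_kl(run_toy_opd(teacher), teacher)
print(kl_before, kl_after)
```

In expectation this update follows the negative gradient of the reverse KL from student to teacher, which is why the divergence shrinks even though only one sampled token is scored per step.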
Systems and Infrastructure Considerations
-
OPD systems usually require a rollout engine, a teacher scoring service, a log-probability transport format, a masking and loss-computation module, and a learner that can consume scored rollouts. Asynchronous execution is common because rollout generation, teacher scoring, and learner updates operate at different speeds.
-
The main systems concerns are rollout staleness, teacher throughput, teacher-student tokenizer compatibility, batching efficiency, log-probability compression, invalid-token masking, prompt routing, and reproducibility of which teacher scored which rollout.
-
For MOPD, teacher routing adds an additional layer. A prompt or rollout may be assigned to a math teacher, coding teacher, long-context teacher, agentic teacher, safety teacher, or general teacher. The router may use prompt metadata, domain classifiers, benchmark categories, validation performance, entropy, or heuristic task labels.
-
In asynchronous OPD, the rollout policy used by the inference engine may not exactly match the policy being updated by the training engine. This motivates importance weighting, freshness windows, or replay-buffer constraints, as in the Nemotron-Cascade 2 objective.
-
Token masking is crucial. Systems should mask padding, prompt tokens, stop tokens after termination, invalid tool-output regions, hidden system metadata, formatting artifacts, and tokens where teacher and student tokenization cannot be aligned reliably.
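-
A minimal sketch of such a mask is shown below, assuming each token has already been tagged with a role label by the rollout engine. The role names and the `stop_index` convention are hypothetical; real systems typically derive them from the chat template and tool-call boundaries.

```python
def build_loss_mask(token_roles, stop_index=None):
    """Return 1 for tokens that should receive a distillation loss, else 0.

    token_roles: per-token labels such as "prompt", "response",
                 "tool_output", "system", or "pad" (hypothetical names).
    stop_index:  position of the stop token, if the rollout terminated;
                 everything after it is masked out.
    """
    mask = []
    for position, role in enumerate(token_roles):
        keep = role == "response"  # only student-generated response tokens
        if stop_index is not None and position > stop_index:
            keep = False           # ignore anything after termination
        mask.append(1 if keep else 0)
    return mask


roles = ["prompt", "prompt", "response", "response",
         "tool_output", "response", "pad"]
mask = build_loss_mask(roles, stop_index=5)
print(mask)
```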
-
Teacher payload design should match the loss. Full forward KL needs teacher distributions, reverse-KL sampled-token OPD needs only sampled-token teacher log-probabilities, and JSD or local-support matching may require top-\(K\) distributions from both teacher and student.
Relationship to Recent LLM Post-Training Recipes
-
OPD has become a core consolidation primitive in recent LLM recipes. It appears most prominently when several specialist teachers must be merged into one deployable model without rerunning all RL stages jointly, and this pattern is increasingly visible across recent post-training reports and recipe analyses such as Frontier post-training recipe review with Finbarr Timbers.
-
MiMo-V2-Flash Technical Report by Xiao et al. (2026) uses MOPD as a final consolidation stage after training several domain-specialist teachers. The student samples its own trajectories, the relevant teacher scores those trajectories, and token-level feedback transfers specialist capabilities into one general student.
-
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD after Cascade RL to recover regressions and rebalance capabilities. The teachers are selected from strong intermediate checkpoints, so MOPD acts as a way to preserve the best stage-specific capabilities before later training continues.
-
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) uses SFT, unified RLVR, MOPD warmup, multi-teacher OPD, and MTP boosting. The warmup stage is important because teacher models trained with different data or recipes may have distributions that do not overlap well with the student; aligning the student toward teacher-supported distributions makes subsequent MOPD more reliable.
-
GLM-5: from Vibe Coding to Agentic Engineering and Kimi K2.5: Visual Agentic Intelligence show that not every recent frontier recipe foregrounds MOPD. GLM-5 emphasizes staged reinforcement learning across reasoning, agentic, and general capabilities with cross-stage distillation, while Kimi K2.5 emphasizes joint text-vision reinforcement learning across coding, vision, reasoning, and agentic tasks. These recipes suggest that OPD is an increasingly important primitive, but not the only path to frontier post-training.
When On-Policy Distillation is Preferred
-
On-policy distillation is preferred when the student must handle long-horizon reasoning, coding, tool use, or agentic workflows where early mistakes compound into unfamiliar states; when dense token-level feedback is more useful than sparse scalar rewards; when a strong frozen teacher can evaluate the student’s actual prefixes; when off-policy SFT has already produced a capable enough student to generate informative rollouts; when the goal is to consolidate specialist teachers without training a single monolithic RL run; and when the team can support the systems complexity of rollout generation, teacher scoring, token masking, and log-probability transport.
-
OPD is less attractive when the student is too weak to generate meaningful trajectories, when teacher-student support overlap is poor, when teacher scoring is too expensive, when tokenization or chat-template mismatch makes log-probabilities unreliable, or when privileged teacher context suppresses behaviors the student needs at inference time.
-
The practical default is therefore staged: begin with off-policy SFT or sequence-level distillation, use RL or RLVR to improve task behavior, introduce OPD once the student can generate useful rollouts, and use MOPD when multiple specialist teachers must be consolidated into one model.
Self-Distillation (SD)
-
Self-distillation extends the distillation paradigm by removing the strict requirement for a separate, larger teacher model. Instead, the student learns from a teacher signal derived from itself, either across time, across contexts, across checkpoints, across roles, or across different conditioning views of the same underlying model.
-
The central idea to carry through this section is that self-distillation converts latent model capability, hindsight information, privileged context, retrieved skills, runtime feedback, or interaction feedback into dense supervision. It is most useful when an external teacher is unavailable, expensive, operationally inconvenient, or insufficiently specialized for the task.
-
Self-distillation is not automatically self-improvement. It can unlock capabilities already present in the model, stabilize post-training, and reduce reliance on frontier teachers, but it can also amplify the model’s own biases, suppress useful uncertainty, overfit to privileged-context style, or collapse reasoning behavior when the teacher view is too different from the deployed student view.
-
In modern LLM training, self-distillation has evolved beyond a simple compression technique into a broader framework for iterative self-improvement, reasoning refinement, continual adaptation, enterprise-specific behavior transfer, and reinforcement-learning-style policy optimization. Modern variants often combine self-distillation with on-policy rollouts, allowing models to improve by learning from their own outputs while still benefiting from the stabilizing effects of teacher-style supervision.
-
The most important practical distinction is how the self-teacher is constructed. A self-teacher may be an earlier checkpoint, an exponential-moving-average checkpoint, an ensemble of model views, the same model under privileged context, the same model conditioned on runtime feedback, or the same model conditioned on future user interaction. The loss may look like ordinary distillation, but the teacher construction determines whether the signal is useful, noisy, or harmful.
-
Common self-distillation forms include checkpoint-based self-distillation, where a later student learns from outputs produced by earlier or averaged checkpoints; view- or context-based self-distillation, where the same model produces different supervisory signals under different prompts, contexts, augmentations, hints, retrieved skills, or conditioning views; and on-policy self-distillation, where the student samples its own rollout and the self-teacher evaluates that rollout under a richer or privileged context.
Core Formulation
-
In self-distillation, both the student and teacher are derived from the same base model. Let \(p_S^\theta\) denote the student policy, and let \(p_T^\phi\) denote the teacher policy, which may correspond to an earlier checkpoint, an ensemble, an exponential-moving-average model, or the same model under privileged conditioning.
-
The general training objective remains:
\[\mathcal{L}_{SD}(\theta) =\mathbb{E} \left[ D\left( p_T^\phi(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right) \right]\]
-
The key distinction is not the surface form of the loss, but the construction of the teacher signal. In standard teacher-student distillation, the teacher is usually a different model. In self-distillation, the teacher signal is produced from the same model family, the same base weights, a related checkpoint, or the same model under a different informational view.
-
A broad self-distillation template can be written as:
\[p_T^\phi(\cdot \mid \tau_T(x,y_{<t})) \quad\text{and}\quad p_S^\theta(\cdot \mid \tau_S(x,y_{<t}))\]
- where \(\tau_T\) and \(\tau_S\) are teacher and student context transformations. The teacher transformation may add a verified answer, a hint, a retrieved skill, a critique, a tool result, a runtime error, a future user correction, or a different prompt view, while the student transformation preserves the information available at inference time.
Temporal Self-Distillation
-
One of the earliest forms of self-distillation uses an earlier checkpoint of the same model as the teacher:
\[p_T = p_S^{\theta_{\text{old}}}\]
-
The student is trained to remain close to a historical version of itself while continuing to improve on new data. This is useful because earlier checkpoints may preserve capabilities that later fine-tuning or RL could degrade, and because historical teachers provide a stable reference that prevents abrupt distributional shifts.
-
Temporal self-distillation is especially relevant in long post-training recipes where a model moves through SFT, preference tuning, RLVR, OPD, and domain-specific RL. A later checkpoint may improve on a target benchmark while regressing on instruction following, writing quality, safety calibration, or long-context behavior. A historical self-teacher can serve as a capability-preserving anchor.
-
A temporal regularization objective can be written as:
\[\mathcal{L}(\theta) =\mathcal{L}_{\text{task}}(\theta) +\lambda \mathbb{E}_{x} \left[ D_{KL} \left( p_{\theta_{\text{old}}}(\cdot \mid x) \,\Vert\, p_{\theta}(\cdot \mid x) \right) \right]\]
- where \(\lambda\) controls how strongly the current model is constrained to preserve the older checkpoint’s behavior.
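-
The regularized objective can be evaluated on a batch of next-token distributions as follows. The sketch assumes the old and current checkpoints have already produced probability vectors for the same inputs; the function names and the `eps` guard are illustrative.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def temporal_objective(task_loss, old_probs_batch, new_probs_batch, lam):
    """Task loss plus lambda times the mean KL(old checkpoint || current model)."""
    penalties = [kl_divergence(p_old, p_new)
                 for p_old, p_new in zip(old_probs_batch, new_probs_batch)]
    return task_loss + lam * sum(penalties) / len(penalties)


unchanged = temporal_objective(1.0, [[0.5, 0.5]], [[0.5, 0.5]], lam=0.1)
drifted = temporal_objective(1.0, [[0.5, 0.5]], [[0.9, 0.1]], lam=0.1)
print(unchanged, drifted)
```

When the current model matches the old checkpoint the penalty vanishes, and it grows as the model drifts away from the behavior the checkpoint is meant to anchor.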
-
Temporal self-distillation is not the same as online distillation unless the teacher changes during the student’s training. If the old checkpoint is frozen, the method is offline in teacher-update pattern, even though the teacher comes from the same model lineage.
Ensemble and Multi-View Self-Distillation
-
Another common variant constructs the teacher from multiple views or predictions of the same model. The teacher distribution may be an average over model checkpoints, prompt templates, sampled completions, adapters, retrieved contexts, or decoding conditions:
\[p_T(\cdot \mid x) =\frac{1}{K} \sum_{k=1}^{K} p_{\theta_k}(\cdot \mid \tau_k(x))\]
-
Multi-view self-distillation can smooth noisy predictions and transfer behavior that is robust across views. For example, a model may answer the same problem under several prompt templates, or with different retrieved context snippets, and the aggregate teacher signal may be more reliable than any single view.
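-
The averaged teacher is just a uniform mixture over the \(K\) view distributions. A minimal sketch, assuming each view has already produced a probability vector over the same vocabulary (the function name is hypothetical):

```python
def ensemble_teacher(view_probs):
    """Uniform average of K next-token distributions from different views."""
    k = len(view_probs)
    vocab_size = len(view_probs[0])
    return [sum(view[i] for view in view_probs) / k for i in range(vocab_size)]


views = [
    [0.8, 0.2],   # e.g. prompt template A
    [0.4, 0.6],   # e.g. prompt template B
]
teacher = ensemble_teacher(views)
print(teacher)
```

Because each view is a valid distribution, the average is one as well, so it can be plugged directly into a standard distillation loss.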
-
This approach is useful when the model contains the right capability but expresses it inconsistently. By aggregating several internal views, the teacher signal can emphasize stable patterns and reduce sensitivity to a single prompt or decoding path.
-
Multi-view self-distillation is also closely related to self-consistency and reranking. The difference is that self-consistency chooses or aggregates answers at inference time, while self-distillation converts the aggregated behavior into trainable supervision.
Contextual Self-Distillation
-
Modern reasoning-oriented self-distillation usually relies on contextual asymmetry rather than architectural asymmetry. This creates a stronger teacher signal without introducing a separate external model.
-
The student sees only the original task:
\[p_S(\cdot \mid x)\]
- while the teacher receives privileged information:
\[p_T(\cdot \mid x,c)\]
- where \(c\) may include a verified answer, a ground-truth reasoning trace, a runtime error, a tool result, a user correction, a retrieved skill, an environment observation, or another form of training-only support.
-
Contextual self-distillation depends on an asymmetry between what is available during training and what is available during deployment. The teacher is allowed to see information that makes evaluation or correction easier, while the student must internalize the corrected behavior without depending on that information at inference time.
-
This setup is powerful because LLMs are often better at evaluating, explaining, or repairing a solution when given the answer or feedback than they are at generating the solution from scratch. It is risky because the privileged teacher can acquire a reasoning style that presupposes unavailable information, which may teach the student to be too concise, too confident, or insufficiently exploratory.
On-Policy Self-Distillation (OPSD)
-
The most important modern form of self-distillation is On-Policy Self-Distillation, introduced in Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models by Zhao et al. (2026).
-
In OPSD, the student generates trajectories from the original problem context, while the teacher evaluates those same trajectories under an enriched or privileged context. The same underlying model can instantiate both views, so the method does not require a separate stronger teacher.
-
The resulting objective is:
\[\mathcal{L}_{\mathrm{OPSD}} =\sum_t D\left( \pi_T(\cdot \mid x,c,y_{<t}) \,\Vert\, \pi_S(\cdot \mid x,y_{<t}) \right)\]
- where \(c\) is privileged teacher-only context, such as a verified solution or feedback string. The following expansion makes the objective explicitly on-policy by sampling the rollout from the student, conditioning the teacher on privileged context \(c\), and applying token-level divergence along the sampled rollout:
\[\mathcal{L}_{\mathrm{OPSD}} =\mathbb{E}_{x,\,\hat{y}\sim\pi_S(\cdot \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} D\left( \pi_T(\cdot \mid x,c,\hat{y}_{<t}) \,\Vert\, \pi_S(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
-
The following figure (source) shows On-Policy Self-Distillation, where the same model defines a student policy conditioned only on the problem and a teacher policy conditioned on privileged solution information. Given a reasoning dataset \(\mathcal{S}=\{(x_i,y_i^\star)\}_{i=1}^{N}\), the student generates an on-policy response \(\hat{y} \sim p_S(\cdot \mid x)\), and both student and privileged teacher score that response at every prefix to produce a token-level divergence.

-
The OPSD paper argues that models are often substantially better at evaluating or rationalizing a correct answer than generating the answer from scratch. By conditioning the teacher view on verified solutions, the model effectively supervises itself from a privileged perspective.
-
In implementation, the student rollout is generated first and the teacher scores the resulting trajectory rather than generating an independent completion. The teacher context concatenates the original prompt with privileged solution information, creating an asymmetric supervision channel. Reverse KL, forward KL, and Jensen-Shannon divergence can all be used experimentally, while pointwise KL clipping and token weighting help prevent stylistic tokens, filler tokens, or formatting artifacts from dominating reasoning updates.
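-
One way to picture the scoring step is a per-position divergence with pointwise clipping, summed along the student's rollout. The sketch below picks forward KL as the divergence \(D\) and clips each vocabulary entry's contribution at a fixed value; both choices, along with the function names and the clip value, are assumptions for illustration and may differ from the paper's exact formulation.

```python
import math

def clipped_forward_kl(teacher_probs, student_probs, clip=1.0, eps=1e-12):
    """KL(teacher || student) at one position, with each term capped at `clip`."""
    total = 0.0
    for p_t, p_s in zip(teacher_probs, student_probs):
        if p_t > 0.0:
            pointwise = p_t * math.log(p_t / max(p_s, eps))
            total += min(pointwise, clip)  # stop one token from dominating
    return total

def opsd_rollout_loss(teacher_dists, student_dists, clip=1.0):
    """Sum of clipped divergences over the positions of one student rollout.

    teacher_dists[t]: privileged teacher distribution at prefix t
    student_dists[t]: student distribution at the same prefix, without c
    """
    return sum(clipped_forward_kl(t, s, clip)
               for t, s in zip(teacher_dists, student_dists))


teacher_dists = [[0.5, 0.5], [1.0, 0.0]]
student_dists = [[0.5, 0.5], [0.001, 0.999]]
loss = opsd_rollout_loss(teacher_dists, student_dists, clip=1.0)
print(loss)
```

Here the first position contributes nothing and the second, where the unclipped term would be about 6.9, contributes exactly the clip value.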
Relevance-Masked Self-Distillation
-
Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation introduces Relevance-Masked Self-Distillation (RMSD), a practical self-distillation method designed for out-of-distribution enterprise-style behaviors where a sufficiently capable external teacher may not exist.
-
The motivating problem is that SFT can teach an unusual behavior but may cause catastrophic forgetting, while RL may fail when the base model almost never succeeds and therefore receives little useful reward signal. RMSD constructs a self-teacher by conditioning the same model on a hint or desired behavior description, then uses a token-level relevance mask so that the student updates only on positions tied to the desired behavior.
-
A compact RMSD-style objective is:
\[\mathcal{L}_{RMSD}(\theta) =\mathbb{E} \left[ \sum_t m_t D\left( p_T^\theta(\cdot \mid x',\hat{y}_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,\hat{y}_{<t}) \right) \right]\]
- where \(x'\) is an enhanced teacher prompt containing a hint or correction, and \(m_t \in \{0,1\}\) is a relevance mask selecting the token positions that should receive distillation updates.
-
The central RMSD insight is that token-level granularity is both a strength and a weakness. It is useful because the update can be localized, but noisy because the teacher and student may disagree on many tokens for reasons unrelated to the desired behavior. RMSD therefore uses a two-step filtering strategy: deterministic heuristics identify candidate token positions, and an LLM judge selects the final subset of task-relevant tokens.
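The masked objective and the two-step filter can be sketched as follows. This is a toy illustration under stated assumptions: the gap-threshold heuristic, the `judge` callable (a stand-in for an LLM judge), and all function names are hypothetical and not taken from the RMSD paper.

```python
def candidate_positions(gaps, threshold=1.0):
    """Step 1 (assumed heuristic): keep positions with a large teacher-student gap."""
    return [t for t, gap in enumerate(gaps) if abs(gap) >= threshold]

def relevance_mask(tokens, gaps, judge, threshold=1.0):
    """Step 2: a judge callable selects the final task-relevant subset."""
    mask = [0] * len(tokens)
    for t in candidate_positions(gaps, threshold):
        if judge(tokens, t):
            mask[t] = 1
    return mask

def rmsd_loss(per_token_divergence, mask):
    """Compute sum_t m_t * D_t for one rollout."""
    if len(per_token_divergence) != len(mask):
        raise ValueError("divergences and mask must have the same length")
    return sum(m * d for m, d in zip(mask, per_token_divergence))

# Toy rollout; the judge below stands in for an LLM judge call.
tokens = ["I", "love", "pinapple", "on", "pizza", "!"]
gaps = [0.1, 0.2, 4.0, 0.1, 0.3, -1.5]            # teacher minus student log-prob
divergences = [0.05, 0.10, 2.00, 0.05, 0.20, 0.90]
judge = lambda toks, t: toks[t] == "pinapple"
mask = relevance_mask(tokens, gaps, judge)
loss = rmsd_loss(divergences, mask)
```

In this toy case the heuristic proposes two positions, the judge keeps only the one tied to the target behavior, and the stylistic disagreement at the final token contributes nothing to the loss.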
-
The following figure (source) shows a relevance-masked self-distillation visualizer for a rollout in which the student is asked a normal tropical-food question while the privileged teacher prompt instructs the model to use the misspelled token “pinapple.” Green tokens indicate positive teacher-minus-student log-probability gaps, red tokens indicate negative gaps, and the relevance mask selects the token positions most tied to the target behavior rather than updating all stylistic disagreements.

-
The RMSD experiments show why on-policy data matters. When SFT data is close to the student’s own distribution, SFT improves substantially, but such close-to-on-policy data is usually hard to manufacture in real tasks. OPSD and RMSD instead obtain on-policy trajectories directly from the student and use privileged teacher scoring to provide localized correction.
-
RMSD also highlights the importance of teacher-update timing. Updating the self-teacher too frequently can cause collapse, while updating only after performance plateaus can bootstrap further progress more safely. When teacher weights are refreshed periodically in this way, the overall recipe is semi-online, even though each training segment uses a fixed self-teacher.
Contrastive On-Policy Self-Distillation
-
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) addresses a failure mode called privilege-induced style drift. In ordinary OPSD, the teacher-student gap under a correct hint may concentrate on style tokens, because a hinted teacher tends to be more direct, shorter, and more confident regardless of whether the hint contains the specific task-bearing information the student needs.
-
RLCSD compares the teacher-student gap under a correct hint with the gap under a wrong or contrastive hint. This subtracts the generic style shift induced by hint conditioning and leaves a signal more concentrated on task-bearing tokens.
-
A simplified contrastive signal is:
\[e_t^{ctr} =\left[ \log p_\theta(y_t \mid x,c^+,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right] -\left[ \log p_\theta(y_t \mid x,c^-,\hat{y}_{<t}) -\log p_\theta(y_t \mid x,\hat{y}_{<t}) \right]\]- where \(c^+\) is a correct hint and \(c^-\) is a wrong or contrastive hint.
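Because the unhinted student log-probability appears in both brackets, it cancels algebraically, so the simplified signal reduces to the difference between the correct-hint and wrong-hint log-probabilities of the sampled token. The sketch below makes this concrete; the function name and the toy log-probabilities are illustrative assumptions, not values from the RLCSD paper.

```python
def contrastive_signal(logp_correct_hint, logp_wrong_hint, logp_student):
    """Per-token gap under the correct hint minus the gap under the wrong hint."""
    signal = []
    for lp_pos, lp_neg, lp_s in zip(logp_correct_hint, logp_wrong_hint, logp_student):
        gap_pos = lp_pos - lp_s   # ordinary OPSD gap under the correct hint
        gap_neg = lp_neg - lp_s   # control gap under the wrong hint
        signal.append(gap_pos - gap_neg)
    return signal

# Toy rollout: both hints make the first tokens more likely (a style effect),
# but only the correct hint supports the final answer token.
tokens = ["So", "the", "answer", "is", "42"]
student = [-2.0, -1.0, -1.5, -0.5, -3.0]
correct = [-1.0, -1.0, -1.0, -0.5, -0.5]
wrong = [-1.0, -1.0, -1.0, -0.5, -4.0]

plain_gap = [c - s for c, s in zip(correct, student)]
contrast = contrastive_signal(correct, wrong, student)
```

Here the plain gap is nonzero on "So" and "answer" purely because hint conditioning shifts style, while the contrastive signal is zero on those tokens and concentrated on "42".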
-
The broader principle is that privileged context should not be treated as uniformly beneficial. A hint changes both task information and style. Contrastive self-distillation attempts to isolate the task-relevant component of the hint by subtracting a control condition that induces similar style changes without providing the correct task information.
Self-Distillation as Reinforcement Learning
-
Recent work increasingly treats self-distillation as a form of policy optimization rather than merely a compression technique.
-
The central idea is that the teacher defines a dense token-level improvement signal:
\[A_t = \log p_T(y_t \mid s_t) - \log p_S(y_t \mid s_t)\]- which behaves similarly to an RL advantage estimate.
-
This perspective enables self-distillation to integrate naturally into PPO-, GRPO-, and RLVR-style training loops. The self-teacher can provide dense token-level guidance, while a verifier, reward model, or environment reward preserves the task-level objective.
-
A hybrid RL and self-distillation objective can be written as:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda_{SD} \mathcal{L}_{SD}(\theta)\]- where the RL term defines the direction of improvement from rewards or verifiers, and the self-distillation term refines token-level credit assignment.
Self-Distilled Agentic Reinforcement Learning
-
Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) introduces SDAR for multi-turn LLM agents, where RL remains the primary optimization objective and OPSD is used as a gated auxiliary loss to provide dense token-level guidance.
-
SDAR is motivated by a multi-turn instability problem: naive OPSD can produce large teacher-student gaps as trajectories drift across turns, and privileged context may be unreliable if retrieved skills are imperfect, weakly used, or irrelevant. Instead of applying self-distillation uniformly, SDAR gates the auxiliary signal at the token level.
-
The SDAR objective can be summarized as:
\[\mathcal{L}(\theta) =\mathcal{L}_{GRPO}(\theta) +\lambda_{\mathrm{SDAR}} \mathcal{L}_{SDAR}(\theta)\]- where \(\mathcal{L}_{GRPO}\) preserves verifier-driven policy optimization, and \(\mathcal{L}_{SDAR}\) injects dense token-level guidance only where the privileged teacher signal is trusted.
-
The self-teacher receives privileged training-only context, such as retrieved skills, while the deployed student acts without that context. The detached teacher-student gap is:
\[\Delta_t =\operatorname{sg} \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\] -
The token-level gate converts this signal into a bounded trust weight:
\[g_t=\sigma(\beta \Delta_t)\] -
The gated auxiliary loss applies self-distillation only according to token-level trust:
\[\ell_t^{\mathrm{SDAR}} =g_t \left( \log \pi_\theta^+(y_t \mid s_t^+) -\log \pi_\theta(y_t \mid s_t) \right)\] -
SDAR supports entropy gating, gap gating, and soft-OR gating, so token-level supervision can depend on student uncertainty, teacher endorsement, or both. Skill retrieval can be implemented through UCB retrieval, keyword matching, full retrieval, or random retrieval, making the framework robust to different levels of privileged-context quality.
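The gap-gated auxiliary term defined above can be sketched on plain floats. This is an illustration only: the function names and the default `beta` are assumptions, and in an autograd framework the gap fed into the gate would be detached, as the stop-gradient operator indicates.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sdar_gated_terms(teacher_logps, student_logps, beta=1.0):
    """Return (gates, gated terms) for the sampled tokens of one rollout.

    gate_t = sigmoid(beta * gap_t), term_t = gate_t * gap_t, where gap_t is the
    privileged-teacher log-prob minus the student log-prob of the sampled token.
    """
    gates, terms = [], []
    for lp_teacher, lp_student in zip(teacher_logps, student_logps):
        gap = lp_teacher - lp_student
        gate = sigmoid(beta * gap)
        gates.append(gate)
        terms.append(gate * gap)
    return gates, terms

# Toy values: the teacher endorses token 0, is neutral on token 1,
# and strongly disagrees with token 2.
gates, terms = sdar_gated_terms([-0.2, -1.0, -6.0], [-2.2, -1.0, -1.0], beta=2.0)
```

With this gate, tokens the privileged teacher endorses receive a weight near one, while tokens with a large negative gap are almost entirely excluded from the auxiliary loss.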
-
The following figure (source) shows multi-turn OPSD instability, where naive OPSD can increase KL divergence and degrade task performance in multi-turn agent settings.

- The following figure (source) shows teacher-student gap analysis, including token-count distribution by teacher-student gap, average gap by multi-turn step, and average gap by relative position within a turn.

Reinforcement Learning via Self-Distillation
-
Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) formalizes the connection between self-distillation and policy optimization by converting textual feedback into dense token-level supervision.
-
The framework generates a trajectory from the student policy, obtains textual critiques, runtime errors, or verifier feedback, conditions the teacher view on both the original trajectory and the feedback signal, and replays the trajectory under the teacher context to compute token-level corrections.
-
This approach is useful when feedback is richer than a scalar reward. Runtime execution errors in coding tasks can become corrective teacher context; LLM judge comments can specify what went wrong; and environment feedback can indicate which actions were ineffective. The same model can instantiate both the student and the teacher view, reducing infrastructure requirements.
-
A useful abstraction is:
\[p_T(\cdot \mid x,\hat{y}_{<t},f)\]- where \(f\) is a feedback string, such as a runtime error, critique, judge comment, or natural-language correction. The student is trained to align with this feedback-conditioned teacher while still operating without \(f\) at deployment time.
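The replay step can be sketched as follows. Everything here is a hypothetical illustration: the prompt template, the function names, and `toy_logprob` (a stand-in for a real model scoring call) are assumptions rather than the paper's interface.

```python
def build_teacher_context(prompt, rollout_text, feedback):
    """Teacher view: original prompt plus the earlier attempt and its feedback."""
    return (
        f"{prompt}\n\n"
        f"Previous attempt:\n{rollout_text}\n\n"
        f"Feedback:\n{feedback}\n\n"
        f"Improved attempt:"
    )

def replay_corrections(prompt, rollout_tokens, feedback, logprob_fn):
    """Score each sampled token under both views; return teacher-minus-student gaps."""
    teacher_ctx = build_teacher_context(prompt, " ".join(rollout_tokens), feedback)
    corrections = []
    for t, token in enumerate(rollout_tokens):
        prefix = rollout_tokens[:t]
        lp_teacher = logprob_fn(teacher_ctx, prefix, token)
        lp_student = logprob_fn(prompt, prefix, token)
        corrections.append(lp_teacher - lp_student)
    return corrections

def toy_logprob(context, prefix_tokens, token):
    """Hypothetical scorer: the feedback makes the out-of-range index less likely."""
    if "IndexError" in context and token == "xs[len(xs)]":
        return -5.0
    return -1.0

corrections = replay_corrections(
    prompt="Return the last element of xs.",
    rollout_tokens=["return", "xs[len(xs)]"],
    feedback="IndexError: list index out of range",
    logprob_fn=toy_logprob,
)
```

The student view never sees the feedback string, so the negative correction on the faulty token is the only channel through which the runtime error influences the update.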
Self-Distilled RLVR
-
Self-Distilled RLVR by Yang et al. (2026) combines reinforcement learning with verifiable rewards and privileged self-distillation. The method uses RLVR to determine the update direction and self-distillation to modulate fine-grained token-level update magnitudes.
-
This separation matters because privileged self-distillation alone can leak information or destabilize long training. RLVR keeps the objective grounded in correctness, while self-distillation provides local information about which sampled tokens should receive stronger or weaker updates.
-
A simplified view is:
\[\Delta \theta \propto A^{RLVR} \cdot w_t^{SD} \nabla_\theta \log \pi_\theta(y_t \mid s_t)\]- where \(A^{RLVR}\) supplies the reward-aligned direction and \(w_t^{SD}\) supplies a token-level magnitude derived from the privileged self-teacher.
Aligning Language Models from User Interactions
-
Aligning Language Models from User Interactions by Kleine Buening et al. (2026) extends self-distillation into conversational alignment by treating future user messages as privileged hindsight context.
-
Instead of using verified answers, the teacher is conditioned on later interaction information, such as a user correction, clarification, dissatisfaction signal, or follow-up request. The method reconstructs how the assistant should ideally have responded had it known the future interaction, then distills that hindsight policy into the original model.
-
The process can be summarized as:
\[(x,y,o) \rightarrow p_T(\cdot \mid x,o,y_{<t}) \rightarrow \text{token-level update for } p_S(\cdot \mid x,y_{<t})\]- where \(x\) is the conversation history, \(y\) is the model response, and \(o\) is the subsequent user message.
-
The following figure (source) shows the hindsight self-distillation process driven by user follow-up interactions. From multi-turn conversations, the system obtains interactions \((x,y,o)\) consisting of the conversation history, the model response, and the subsequent user message. Conditioning on the user’s follow-up forms a hindsight policy, and comparing that policy to the original policy produces token-level advantages that reinforce or penalize parts of the original response.

- This framework naturally leverages production interaction logs without requiring manual labels for every example. It can also support both RL-style preference learning and dense token-level self-distillation from the same interaction trace.
Self-Distillation in Agentic Systems
-
Self-distillation is particularly powerful in agents because interaction trajectories naturally produce rich hindsight signals. Tool outputs, terminal errors, GUI state changes, unit-test results, user replies, and environment transitions can all become teacher-only context for replaying and correcting the original action sequence.
-
In an agentic setting, the student may act under limited information at time \(t\), while the teacher is later allowed to see the consequences of that action. The teacher can then evaluate what the student should have done at each earlier token or action boundary.
-
This is especially useful for long-horizon workflows because scalar task success is often too sparse. A final failure may be caused by one malformed tool call, one wrong file selection, one incorrect assumption, or one missed observation. Self-distillation can turn later evidence into local corrective supervision.
-
OpenClaw-RL-style systems extend this idea to real interaction streams, where personal agents and general agents collect trajectories from user interactions, tool calls, GUI transitions, and terminal environments. Those traces can be transformed into hindsight-conditioned dense supervision rather than relying only on scalar reward.
Failure Modes in Self-Distillation
-
Self-distillation introduces unique risks because the teacher is derived from the same model family as the student. The method can reinforce existing errors, amplify spurious confidence, suppress useful uncertainty, overfit to hints, or convert privileged-context style into deployed behavior.
-
Recent evidence is especially cautionary for thinking models. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026) finds that self-distillation can reduce response length while degrading mathematical reasoning performance. The paper traces this to suppression of epistemic verbalization, meaning the model becomes less likely to express uncertainty, reconsider hypotheses, or mark possible errors during reasoning.
-
A useful formalization of context richness is conditional mutual information:
\[I(y^\star;c \mid x) = H(y^\star \mid x) - H(y^\star \mid x,c)\]- where \(c\) is privileged context and \(y^\star\) is an ideal correct response. Richer context reduces uncertainty for the teacher, but this can also make the teacher produce overly concise and confident traces that do not preserve the student’s inference-time uncertainty management.
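A small numeric example makes the quantity concrete. The sketch below computes the information, in bits, that a privileged context carries about the correct response for a single fixed prompt; the joint tables and function names are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(y | x, c) for a table joint[c][y] = P(c, y | x)."""
    h = 0.0
    for row in joint:
        p_c = sum(row)
        if p_c > 0:
            h += p_c * entropy([p / p_c for p in row])
    return h

def context_information(joint):
    """I(y; c | x) = H(y | x) - H(y | x, c) for one fixed prompt x."""
    marginal_y = [sum(col) for col in zip(*joint)]
    return entropy(marginal_y) - conditional_entropy(joint)

# Four equally likely candidate answers; each hint value narrows them to two.
partial_hint = [[0.25, 0.25, 0.0, 0.0],
                [0.0, 0.0, 0.25, 0.25]]
# A hint that is independent of the answer carries no information.
useless_hint = [[0.125] * 4, [0.125] * 4]
# A hint that reveals the answer removes all remaining uncertainty.
full_solution = [[0.25, 0, 0, 0], [0, 0.25, 0, 0], [0, 0, 0.25, 0], [0, 0, 0, 0.25]]
```

The partial hint carries one bit, the useless hint zero bits, and the full solution two bits, which corresponds to the richest context and the teacher with the least residual uncertainty.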
-
Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026) reports a related fork-suppression failure mode. Privileged-context OPSD can reduce forking, verification, backtracking, and hedging behavior, especially in long-budget thinking rollouts. The failure is specific to privileged context rather than on-policy training itself: vanilla OPD can improve the same student, while privileged OPD reverses the gain.
-
The token-level mechanism is that privileged teachers may penalize self-correction cues even when those cues lead to a correct answer. Tokens such as “wait,” “maybe,” “but,” or “there may be a mistake” can be useful because they create branches and preserve uncertainty. A teacher that already knows the answer may assign these tokens low probability because it does not need them, thereby training the student to suppress deliberative search.
-
The following figure (source) shows examples where privileged context flips credit on self-correction cues. The same student trajectory is scored with and without privileged teacher context; privileged scoring penalizes tokens such as “But wait,” “Hmm,” and “Maybe” even when those tokens support correction or lead to a correct answer.

- These findings imply that self-distillation objectives should optimize not only correctness or brevity, but also the reasoning behaviors needed for robust inference. For strong thinking models, preserving controlled uncertainty, exploration, and self-correction may be more important than forcing the student to imitate a privileged teacher’s concise solution path.
Fixed, Moving, and Gated Self-Teachers
-
A fixed self-teacher is a frozen copy of an earlier checkpoint or initial policy. This is stable and reproducible, and it avoids a feedback loop where the teacher becomes increasingly confident as the student changes.
-
A moving self-teacher updates during training, either by periodically copying the student, using an exponential moving average, or refreshing after performance plateaus. This can provide fresher supervision, but it can also amplify collapse if the teacher is updated too frequently or if the student’s current failure modes become part of the teacher.
-
A gated self-teacher does not necessarily change the teacher weights; instead, it controls which token-level teacher signals are trusted. Gates can depend on teacher-student gap, student entropy, token relevance, verifier outcomes, contrastive hint differences, or LLM judge decisions.
-
These three designs address different problems. Fixed teachers address stability. Moving teachers address staleness. Gated teachers address noisy or harmful token-level supervision.
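The two moving-teacher update rules mentioned above can be sketched on flat parameter lists. The function names, the `decay`, `patience`, and `min_delta` defaults, and the plateau test itself are assumptions chosen for illustration; a fixed self-teacher corresponds to never calling either function.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Exponential moving average: the teacher slowly tracks the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

def refresh_on_plateau(teacher_params, student_params, eval_history,
                       patience=3, min_delta=1e-3):
    """Copy the student into the teacher only after evaluation scores plateau.

    A plateau is declared when the best score in the last `patience`
    evaluations beats the best earlier score by less than `min_delta`.
    Returns (new_teacher_params, refreshed_flag).
    """
    if len(eval_history) <= patience:
        return teacher_params, False
    recent_best = max(eval_history[-patience:])
    earlier_best = max(eval_history[:-patience])
    if recent_best - earlier_best < min_delta:
        return list(student_params), True
    return teacher_params, False
```

A higher `decay` or a longer `patience` makes the teacher change more slowly, trading freshness for protection against the collapse risk described above.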
Advantages
-
Self-distillation reduces dependence on expensive external teacher models, enables continual self-improvement from interaction traces and feedback, converts privileged information into dense token-level guidance, integrates naturally with RL and hindsight supervision, preserves architecture simplicity by using the same backbone for teacher and student, and can target specialized behaviors that ordinary external teachers may not know.
-
It is especially attractive for enterprise-specific or tool-specific behavior, where the desired behavior may involve private APIs, internal formats, local policies, or user preferences that are not well represented in public teacher models.
-
It can also be more selective than SFT. Because self-distillation adjusts conditional token probabilities on student-visited states, it can preserve unrelated behavior better than broad supervised fine-tuning when the token-level signal is well masked or gated.
Limitations
-
Self-distillation may reinforce the model’s own errors if the privileged teacher signal is weak, noisy, or merely stylistically different. It can be limited by the model family’s inherent capability ceiling unless external rewards, search, tools, or verifiers introduce new information. Incorrect privileged information can destabilize training more severely than ordinary supervised errors. Careless rollout replay can produce information leakage in reasoning tasks. Dense self-distillation can over-regularize style unless clipped, masked, gated, or contrastively corrected.
-
Thinking models require special caution. Privileged context may suppress uncertainty expression, forking, verification, and backtracking, which are precisely the behaviors that enable long-budget reasoning to recover from mistakes.
-
Multi-turn agent settings also require caution because trajectory drift can compound across turns. Naive OPSD can become unstable when the self-teacher scores prefixes that are already far from its reliable support, motivating gated auxiliary distillation rather than uniform token-level imitation.
When Self-Distillation is Preferred
-
Self-distillation is preferred when external frontier teachers are unavailable, too expensive, or operationally inconvenient; when the model already contains latent capability that can be unlocked through hints, hindsight conditioning, retrieved skills, or privileged evaluation; when interaction traces, tool outputs, runtime errors, user corrections, or verifier feedback are available as rich supervision sources; when RL alone is too sparse or unstable; when the target behavior is enterprise-specific or out-of-distribution; or when continuous adaptation is required without maintaining a separate teacher infrastructure.
-
On-policy self-distillation is preferred when the student can generate informative rollouts and the teacher can be strengthened through privileged context. Relevance-masked or gated self-distillation is preferred when only a small subset of tokens should change. Contrastive self-distillation is preferred when hints introduce unwanted style drift. RL-hybrid self-distillation is preferred when correctness rewards should determine update direction while self-distillation refines token-level credit assignment.
-
Modern self-distillation methods increasingly blur the line between supervised learning, reinforcement learning, continual learning, and iterative self-improvement. The practical challenge is no longer only how to make a model teach itself, but how to ensure the self-teacher’s signal preserves the behaviors the deployed student actually needs.
Multi-Teacher Distillation
-
Multi-teacher distillation generalizes the teacher-student framework by allowing the student to learn from more than one teacher. Instead of assuming a single model has the best behavior across all domains, the method combines signals from several teachers, which may differ by size, architecture, training stage, specialization, data source, decoding style, or post-training objective.
-
The central idea to carry through this section is that multi-teacher distillation is no longer just an ensemble-compression trick. In modern LLM post-training, it has become a capability-consolidation primitive: different teams or training runs produce specialist teachers, and a final student absorbs their capabilities into one deployable policy.
-
The main benefit is specialization without deployment fragmentation. A lab may train one teacher that is strongest at math proofs, another at coding, another at software-engineering agents, another at safety behavior, another at long-context reasoning, and another at chat or instruction following. Multi-teacher distillation tries to merge these strengths into one model so that users do not need to select among many specialized checkpoints at inference time.
-
The main difficulty is conflict. Teachers may disagree not only on final answers, but also on style, reasoning length, tool-use conventions, uncertainty expression, formatting, refusal boundaries, and local token preferences. A multi-teacher student must learn when to follow which teacher, how to resolve incompatible guidance, and how to avoid averaging incompatible behaviors into a weaker policy.
-
The most important modern variant is Multi-Teacher On-Policy Distillation, or MOPD. In MOPD, the student generates its own trajectories, those trajectories are routed to one or more domain-specialist teachers, and the teachers provide token-level feedback on the student’s actual visited prefixes. This makes MOPD a multi-teacher extension of OPD and a practical alternative to one monolithic RL run across many conflicting domains.
-
Recent recipe discussions frame MOPD as one of the major 2026-style post-training patterns. MiMo-V2-Flash Technical Report by the Xiaomi MiMo Team (2026) provides a clean early articulation of MOPD as a consolidation stage, while Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD as a stabilization and regression-recovery stage inside Cascade RL. Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales the pattern to many domain-specialist teachers and multiple MOPD rounds.
Definition
-
In single-teacher distillation, the student minimizes a divergence between its distribution and one teacher distribution. In multi-teacher distillation, there are \(K\) teachers:
\[\{p_{T_1},p_{T_2},\dots,p_{T_K}\}\]- and the student receives supervision from an aggregation, routing, or weighting of those teachers.
-
A general multi-teacher objective can be written as:
\[\mathcal{L}_{MTD}(\theta) =\mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t} \sum_{k=1}^{K} w_k(x,y_{<t}) D\left( p_{T_k}(\cdot \mid x,y_{<t}) \,\Vert\, p_S^\theta(\cdot \mid x,y_{<t}) \right) \right]\]- where the outer sum runs over token positions \(t\) and \(w_k(x,y_{<t})\) is the teacher weight for teacher \(k\) at the current prompt or prefix. The weight may be fixed, domain-dependent, confidence-dependent, learned by a router, determined by evaluation scores, or set by a hard routing rule.
-
A teacher-ensemble view first forms an aggregated teacher distribution:
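\[p_{\mathrm{MT}}(\cdot \mid s_t) =\sum_{k=1}^{K} w_k(s_t) p_{T_k}(\cdot \mid s_t)\]- where \(s_t=(x,y_{<t})\), and the student then minimizes:
\[\mathcal{L}_{\mathrm{ens}}(\theta) =\mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t} D\left( p_{\mathrm{MT}}(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right) \right]\]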
-
This formulation is simple, but it assumes that teacher distributions can be meaningfully averaged. That assumption is often false when teachers have different reasoning styles, response lengths, tool-use conventions, or refusal boundaries. Much of modern multi-teacher distillation is therefore about routing and conflict management rather than merely averaging logits.
Why Multiple Teachers?
-
A single teacher is rarely uniformly best. One model may have strong mathematical reasoning but weak instruction following; another may be excellent at code generation but overly verbose in chat; another may be safe and well-calibrated but weak on long-context retrieval; another may be a strong agentic tool user but poor at concise factual answering. Multi-teacher distillation treats each teacher as a source of localized expertise rather than assuming one global teacher dominates all tasks.
-
Multi-teacher distillation is also organizationally scalable. Specialist teachers can be trained in parallel by different teams, each focused on a domain-specific data pipeline, verifier, RL environment, or evaluation suite. A final distillation stage then converts this distributed specialist work into one general student.
-
This pattern is attractive because multi-domain RL can be conflict-prone. Jointly optimizing math, code, safety, chat, long context, and agentic tool use inside one RL run can cause interference: improvements in one domain may shorten traces, reduce entropy, change formatting, degrade instruction adherence, or regress other benchmarks. Training specialists separately and then distilling them into one student gives the recipe a modular structure.
-
Multi-teacher distillation is also useful when different teachers expose different forms of signal. A code teacher may provide execution-verified traces, a math teacher may provide proof-style reasoning, a safety teacher may provide refusal and redirection behavior, a chat teacher may provide conversational naturalness, and an agentic teacher may provide terminal or browser action traces. The final student must learn all of these behaviors, but the best supervision for each behavior may come from a different source.
Classical Multi-Teacher Distillation
-
Classical multi-teacher distillation usually begins with an ensemble of teachers trained independently or on related tasks. Their predictions are combined into a single soft target, and the student learns to match that aggregated distribution.
-
A probability-space ensemble uses:
\[p_{\mathrm{ens}}(y \mid x) =\sum_{k=1}^{K} w_k p_{T_k}(y \mid x)\] -
A logit-space ensemble instead combines teacher logits before softmax:
\[z_{\mathrm{ens}} =\sum_{k=1}^{K} w_k z_{T_k}\] \[p_{\mathrm{ens}}(y \mid x) =\operatorname{softmax}(z_{\mathrm{ens}})\] -
Probability averaging is easier to interpret because each teacher contributes directly to the final distribution. Logit averaging can preserve sharper preferences but is sensitive to teacher calibration. If one teacher produces higher-magnitude logits, it can dominate the ensemble even when it is not more accurate.
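The difference between the two ensembles is easy to see numerically. Logit averaging followed by softmax is equivalent to a renormalized weighted geometric mean of the teacher distributions, which is why a teacher with large-magnitude logits can dominate. The sketch below is illustrative; the function names and toy logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probability_ensemble(teacher_logits, weights):
    """Weighted arithmetic mean of the teachers' probability distributions."""
    probs = [softmax(z) for z in teacher_logits]
    vocab = len(probs[0])
    return [sum(w * p[i] for w, p in zip(weights, probs)) for i in range(vocab)]

def logit_ensemble(teacher_logits, weights):
    """Softmax of the weighted sum of the teachers' logits."""
    vocab = len(teacher_logits[0])
    combined = [sum(w * z[i] for w, z in zip(weights, teacher_logits))
                for i in range(vocab)]
    return softmax(combined)

# Teacher A mildly prefers token 0; teacher B has high-magnitude logits for token 1.
teachers = [[1.0, 0.0], [0.0, 10.0]]
weights = [0.5, 0.5]
p_ens = probability_ensemble(teachers, weights)
z_ens = logit_ensemble(teachers, weights)
```

With equal weights, the probability ensemble still leaves substantial mass on token 0, while the logit ensemble follows teacher B almost entirely.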
-
Classical multi-teacher distillation is most effective when teachers are reasonably aligned and differ mainly in complementary expertise or random initialization. It is less reliable when teachers disagree systematically due to different training data, incompatible objectives, or different reasoning formats.
Teacher Weighting and Routing
-
Teacher weighting determines how much each teacher contributes to the student’s update. A fixed-weight scheme assigns a constant coefficient to each teacher, which is simple and stable but ignores the fact that teacher quality varies by prompt, domain, and prefix.
-
A domain-weighted scheme chooses weights based on known task categories. For a math prompt, the math teacher receives high weight; for a coding prompt, the code teacher dominates; for a safety prompt, the safety teacher may override other teachers. This is common when prompts are drawn from benchmark categories or training environments with known labels.
-
A confidence-weighted scheme uses teacher uncertainty, reward scores, verifier outcomes, or validation accuracy to assign higher weight to teachers that appear more reliable on the current example. For example, a teacher that produces a verified correct answer, passes unit tests, or assigns low entropy to a locally plausible token may receive higher weight.
-
A learned router predicts teacher weights from prompt features, rollout features, or intermediate representations. The router may use metadata, domain classifiers, prompt embeddings, reward-model scores, teacher agreement, or validation curves to decide which teacher should supervise which state.
-
A hard-routing scheme selects one teacher:
\[k^\star =\arg\max_k r_k(x,y_{<t})\]- where \(r_k\) is a routing score for teacher \(k\). The student then trains only against teacher \(k^\star\) for that prompt, trajectory, or token.
-
A soft-routing scheme assigns all teachers nonzero weights:
\[w_k(s_t) =\frac{ \exp(r_k(s_t)/\tau) }{ \sum_{j=1}^{K} \exp(r_j(s_t)/\tau) }\]- where \(\tau\) controls how sharply the router concentrates on the best-scoring teacher. Low \(\tau\) approaches hard routing, while high \(\tau\) averages teachers more broadly.
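Hard and soft routing can be sketched in a few lines. The function names and the example routing scores are illustrative assumptions; in practice the scores would come from a domain classifier, verifier, or learned router.

```python
import math

def hard_route(scores):
    """Index of the teacher with the highest routing score."""
    return max(range(len(scores)), key=lambda k: scores[k])

def soft_route(scores, tau=1.0):
    """Temperature-scaled softmax over routing scores."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [r / tau for r in scores]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.2, 1.5, 0.7]                 # hypothetical scores for three teachers
best = hard_route(scores)
sharp = soft_route(scores, tau=0.05)     # nearly one-hot on the best teacher
broad = soft_route(scores, tau=100.0)    # nearly uniform across teachers
```

Sweeping \(\tau\) between these extremes interpolates between following a single specialist and averaging all teachers.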
-
Token-level routing is more flexible than prompt-level routing, but it is also harder. A single trajectory may begin as a math proof, call Python for computation, then require concise final-answer formatting. In principle, each token could be supervised by a different teacher. In practice, token-level routing is expensive and can produce inconsistent style unless teachers are carefully aligned.
Multi-Teacher Distillation versus Mixture-of-Experts
-
Multi-teacher distillation and Mixture-of-Experts architectures are related but distinct. In a Mixture-of-Experts model, multiple experts remain inside the deployed model and a router activates a subset of them during inference. In multi-teacher distillation, multiple teachers are used during training, but the final deployed student may be a single dense model, a smaller MoE model, or a general policy without explicit access to the original teachers.
-
The goal of multi-teacher distillation is therefore not necessarily to preserve separate experts at inference time. The goal is to internalize expert behavior into the student so that a single model can behave as if it had absorbed several specialized training runs.
-
This distinction matters for deployment. Multi-teacher distillation increases training complexity but can keep inference simple. Mixture-of-Experts architectures increase architectural complexity but may improve inference efficiency per active parameter. Some modern systems combine both ideas: a MoE student may be distilled from several specialist teachers.
Multi-Teacher On-Policy Distillation (MOPD)
-
Multi-Teacher On-Policy Distillation extends OPD to multiple teachers. The student generates a rollout:
\[\hat{y} \sim p_S^\theta(\cdot \mid x)\]- and one or more teachers score the exact student-visited prefixes:
\[p_{T_k}(\cdot \mid s_t), \qquad s_t=(x,\hat{y}_{<t}), \qquad k=1,\dots,K\]
-
A general MOPD objective can be written as:
\[\mathcal{L}_{MOPD}(\theta) =\mathbb{E}_{x\sim\mathcal{D}} \mathbb{E}_{\hat{y}\sim p_S^\theta(\cdot \mid x)} \left[ \sum_{t} \sum_{k=1}^{K} w_k(s_t) D\left( p_{T_k}(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right) \right]\] -
The defining feature is that the student supplies the trajectory, while the teachers supply dense feedback. This means the student is trained on states it is likely to visit at inference time, while still receiving specialized guidance from stronger or domain-specific policies.
-
MOPD is especially useful when specialists are easy to train but hard to jointly optimize. A math teacher can be RL-trained on math; a coding teacher can be trained on competitive programming; an agentic teacher can be trained in tool-use environments; a safety teacher can be trained on refusal and harm-prevention data. MOPD then attempts to transfer these local improvements into one general student.
Multi-Teacher OPD in Practice
-
The article "Multi-Teacher On-Policy Distillation: A New Post-Training Primitive" and Cameron R. Wolfe's X thread on multi-teacher on-policy distillation both emphasize that practical MOPD is a modular post-training workflow rather than merely a new loss. Specialized teachers can be selected from supervised checkpoints, RL-trained checkpoints, domain-adapted checkpoints, or intermediate checkpoints that were strongest on a particular benchmark family. The teacher pool is therefore a curated set of reusable supervision sources, not just a collection of final models.
-
Earlier checkpoints are often included explicitly to prevent catastrophic forgetting and post-training regressions. In a long recipe, the latest checkpoint may be strongest on one domain but weaker on earlier capabilities. Keeping earlier checkpoints as teachers allows the student to recover older strengths without rerunning expensive RL training from scratch.
-
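Reverse KL is especially useful in MOPD because it is an expectation over tokens sampled from the student, \(D_{KL}(p_S^\theta \Vert p_{T_k})=\mathbb{E}_{y_t\sim p_S^\theta}\left[\log p_S^\theta(y_t \mid s_t)-\log p_{T_k}(y_t \mid s_t)\right]\), so the teacher-minus-student log-probability of a sampled token is a single-sample estimate of its negation. This lets each teacher provide a token-level advantage estimate over the student rollout. Instead of requiring every teacher to generate its own trajectory, the student samples a rollout once, and each selected teacher scores the sampled tokens under the student's exact prefixes. The resulting log-probability gap acts like a dense advantage signal that can be integrated into RL-style training infrastructure.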
-
Teacher requests must be scheduled dynamically to balance latency, accelerator utilization, and training freshness. A practical system may route only some rollouts to expensive teachers, batch requests by teacher, prioritize high-value domains, cache repeated prefixes, or use sampled-token scoring rather than full-distribution scoring. The objective is to make teacher scoring a scalable service rather than a bottleneck that starves the learner.
-
Capability regressions can be repaired through teacher selection and targeted distillation rather than by restarting the post-training recipe. If a later RL stage improves agentic tool use but hurts math, instruction following, or safety calibration, MOPD can route rollouts back to the best teacher checkpoint for the regressed capability and apply dense corrective supervision on student-visited states.
Sampled-Token MOPD
-
In large-vocabulary LLMs, full-vocabulary multi-teacher distribution matching can be expensive because every teacher would need to return a large next-token distribution at every student prefix. A cheaper alternative is sampled-token MOPD, where each teacher only scores the token the student actually sampled.
-
For teacher \(T_k\) and state \(s_t\), the sampled-token teacher advantage is:
\[a_t^{(k)} = \log p_{T_k}(\hat{y}_t \mid s_t) - \log p_S^\theta(\hat{y}_t \mid s_t)\] -
A weighted multi-teacher advantage can then be written as:
\[a_t^{MOPD} =\sum_{k=1}^{K} w_k(s_t) a_t^{(k)}\] -
The student update becomes:
\[\mathcal{L}_{MOPD} =-\mathbb{E} \left[ \sum_t \operatorname{sg} \left[ a_t^{MOPD} \right] \log p_S^\theta(\hat{y}_t \mid s_t) \right]\]- where \(\operatorname{sg}\) indicates that the advantage is treated as a fixed supervision signal rather than differentiated through teacher scoring.
-
This sampled-token form is attractive because it reuses RL-style infrastructure: rollouts are sampled, per-token log-probability gaps become dense advantages, and the learner updates the student using a policy-gradient-shaped loss even though the signal comes from teacher log-probabilities rather than scalar rewards.
-
The tradeoff is that sampled-token MOPD observes only one local token choice per prefix. It is cheaper and often more robust under support mismatch, but it may lose information contained in the teacher’s full local distribution.
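-
As a minimal sketch of the sampled-token equations above, the following pure-Python example computes per-teacher advantages, combines them with routing weights, and evaluates the surrogate loss. The function names, the toy log-probabilities, and the weight table are illustrative assumptions rather than part of any published recipe. In an autodiff framework the advantage would be detached to play the role of \(\operatorname{sg}\); here it is a plain float, so the stop-gradient is implicit.

```python
import math

def sampled_token_advantages(teacher_logps, student_logps):
    """Per-token gap a_t = log p_T(y_t | s_t) - log p_S(y_t | s_t) for one teacher."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]

def mopd_advantages(per_teacher_logps, student_logps, weights):
    """Weighted sum over teachers; weights[k][t] stands in for w_k(s_t)."""
    combined = [0.0] * len(student_logps)
    for k, teacher_logps in enumerate(per_teacher_logps):
        gaps = sampled_token_advantages(teacher_logps, student_logps)
        for t, gap in enumerate(gaps):
            combined[t] += weights[k][t] * gap
    return combined

def mopd_surrogate_loss(advantages, student_logps):
    """-sum_t a_t * log p_S(y_t | s_t), with a_t treated as a constant."""
    return -sum(a * lp for a, lp in zip(advantages, student_logps))

# Toy rollout of two tokens scored by two hypothetical teachers.
student_logps = [math.log(0.5), math.log(0.2)]
teacher_a = [math.log(0.8), math.log(0.1)]
teacher_b = [math.log(0.5), math.log(0.4)]
weights = [[1.0, 0.5], [0.0, 0.5]]  # assumed routing weights w_k(s_t)

advantages = mopd_advantages([teacher_a, teacher_b], student_logps, weights)
loss = mopd_surrogate_loss(advantages, student_logps)
```

In this toy case the two teachers disagree symmetrically on the second token, so their weighted gaps cancel and only the first token carries a nonzero signal.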
Full-Distribution and Top-\(k\) Multi-Teacher Matching
-
Full-distribution MOPD asks the student to match each selected teacher’s entire next-token distribution at every student-visited prefix:
\[D\left( p_{T_k}(\cdot \mid s_t) \,\Vert\, p_S^\theta(\cdot \mid s_t) \right)\] -
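A top-\(k\) approximation restricts matching to a small set of high-probability tokens, chosen from either the teacher's or the student's distribution. The teacher-side version is:
\[\mathcal{V}_k(s_t) =\operatorname{Top}_k p_{T_k}(\cdot \mid s_t)\]- and renormalizes both teacher and student probabilities over \(\mathcal{V}_k(s_t)\) before computing the divergence. Note that the subscript on \(\operatorname{Top}_k\) is the cutoff size, which is unrelated to the teacher index in \(T_k\).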
-
Full-distribution and top-\(k\) matching are most useful when teacher and student distributions have high local support overlap. If the student and teacher consider similar tokens plausible, dense matching can transfer richer information than sampled-token scoring.
-
If teacher and student support overlap is poor, dense distribution matching can become harmful. The teacher may be assigning a detailed distribution over continuations that make sense for its own reasoning style but are unreliable under the student’s prefix. In that case, forcing the student to match the full teacher distribution can amplify noise.
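-
The renormalized top-\(k\) matching described above can be sketched as follows. This is a minimal illustration that assumes forward KL as the divergence \(D\) and the teacher's top-\(k\) set as the support; the helper names and toy distributions are hypothetical.

```python
import math

def top_k_tokens(probs, k):
    """Ids of the k most probable tokens (ties broken by token id)."""
    return sorted(range(len(probs)), key=lambda v: (-probs[v], v))[:k]

def renormalize(probs, token_ids, eps=1e-12):
    """Restrict a distribution to token_ids and rescale it to sum to one."""
    total = max(sum(probs[v] for v in token_ids), eps)
    return {v: probs[v] / total for v in token_ids}

def top_k_forward_kl(teacher_probs, student_probs, k, eps=1e-12):
    """KL(teacher || student) over the teacher's top-k tokens after renormalizing."""
    support = top_k_tokens(teacher_probs, k)
    p = renormalize(teacher_probs, support)
    q = renormalize(student_probs, support)
    return sum(p[v] * (math.log(p[v] + eps) - math.log(q[v] + eps)) for v in support)

teacher = [0.6, 0.3, 0.05, 0.05]
student = [0.3, 0.15, 0.5, 0.05]
kl_top2 = top_k_forward_kl(teacher, student, k=2)  # restricted to tokens 0 and 1
kl_full = top_k_forward_kl(teacher, student, k=4)  # full four-token vocabulary
```

Here the top-2 divergence is essentially zero even though the student places half of its mass on a token outside the teacher's top-2 set, while the full-vocabulary divergence is clearly positive. Renormalization hides mass outside the chosen support, which is one reason to monitor support overlap alongside a top-\(k\) loss.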
Support Overlap
-
Support overlap is the central practical constraint in MOPD. Since teachers score student-generated prefixes, the teacher must be reliable on the states the student actually visits.
-
A useful diagnostic is local top-\(k\) overlap:
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k p_S(\cdot \mid s_t) \cap \operatorname{Top}_k p_T(\cdot \mid s_t) \right| }{k}\] -
High overlap means the student and teacher are choosing among a similar menu of plausible next tokens. In that case, disagreement is informative because the teacher can guide the student within a shared local support region.
-
Low overlap means the teacher and student are effectively operating in different local regimes. Teacher feedback may then reflect distribution mismatch rather than useful expertise.
-
This is why teacher lineage matters. If teachers are forks of the same base model and differ mainly by domain SFT and RL, their rollouts and local token supports are often closer to the student’s. If teachers are trained from different data sources, external models, or incompatible post-training recipes, the student’s prefixes may be out of distribution for the teachers.
-
A practical MOPD recipe should therefore measure support overlap, use warmup SFT or intermediate teacher checkpoints when overlap is low, prefer sampled-token feedback when full-distribution matching is noisy, and route examples only to teachers that are likely to understand the student’s local state.
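-
The overlap diagnostic above is simple to compute per prefix. The sketch below also maps the overlap to a loss family in the spirit of the recipe just described; the threshold values and mode names are placeholder assumptions that would need tuning in a real system.

```python
def top_k_ids(probs, k):
    """Set of the k most probable token ids (ties broken by token id)."""
    return set(sorted(range(len(probs)), key=lambda v: (-probs[v], v))[:k])

def top_k_overlap(student_probs, teacher_probs, k):
    """|Top_k(p_S) & Top_k(p_T)| / k at a single prefix."""
    return len(top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)) / k

def choose_feedback_mode(overlap, high=0.8, low=0.4):
    """Map overlap to a loss family; the thresholds are placeholder assumptions."""
    if overlap >= high:
        return "full_distribution"
    if overlap >= low:
        return "top_k"
    return "sampled_token"

student = [0.50, 0.30, 0.10, 0.05, 0.05]
teacher = [0.40, 0.05, 0.35, 0.15, 0.05]
overlap = top_k_overlap(student, teacher, k=2)  # shared token: id 0 only
mode = choose_feedback_mode(overlap)
```

In practice the overlap would be averaged over many student-visited prefixes per teacher before making a routing or loss decision.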
Teacher Conflict
-
Multi-teacher distillation can fail when teachers disagree in ways that cannot be averaged. A math teacher may prefer long exploratory reasoning, while a chat teacher may prefer concise direct answers. A safety teacher may prefer refusal, while a helpfulness teacher may prefer compliance. A coding teacher may prefer tool calls and execution, while a general reasoning teacher may prefer pure text. If these distributions are averaged without context, the student can learn an incoherent mixture.
-
Teacher conflict can appear at several levels. At the final-answer level, teachers may choose different answers or actions. At the trace level, they may use different reasoning paths or tool-use styles. At the token level, they may assign high probability to incompatible continuations. At the policy level, they may optimize different objectives, such as correctness, helpfulness, safety, brevity, or exploration.
-
A useful conflict score is pairwise teacher divergence:
\[C_{ij}(s_t) =D_{JSD} \left( p_{T_i}(\cdot \mid s_t) \,\Vert\, p_{T_j}(\cdot \mid s_t) \right)\] -
High conflict should usually trigger routing, masking, or teacher selection rather than naive averaging. If the teachers disagree because the prompt is ambiguous, the student may need a policy that asks clarifying questions. If they disagree because one teacher is specialized and the other is out of domain, the router should prioritize the specialized teacher. If they disagree because one teacher is unsafe or unreliable on the current prefix, the system should filter that teacher’s signal.
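-
As an illustration of the conflict score above, the following sketch computes pairwise Jensen-Shannon divergence between teacher next-token distributions at one prefix and flags pairs above a threshold. The teacher names, distributions, and threshold are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def conflicting_pairs(teacher_dists, threshold):
    """Return {(i, j): C_ij} for teacher pairs whose JSD exceeds threshold."""
    names = sorted(teacher_dists)
    flagged = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            score = jsd(teacher_dists[names[a]], teacher_dists[names[b]])
            if score > threshold:
                flagged[(names[a], names[b])] = score
    return flagged

teachers = {
    "safety": [0.90, 0.05, 0.05],   # e.g. strongly prefers a refusal token
    "helpful": [0.05, 0.90, 0.05],  # e.g. strongly prefers a compliance token
    "chat": [0.10, 0.80, 0.10],
}
flagged = conflicting_pairs(teachers, threshold=0.3)
```

Flagged pairs are candidates for routing or masking at that prefix; the chat and helpfulness teachers agree closely here and could be combined safely.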
Multi-Domain OPD in Cascade RL
-
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026) uses multi-domain OPD as a stabilization point inside a longer Cascade RL pipeline. The recipe begins with SFT, proceeds through instruction-following RL and multi-domain RL, inserts multi-domain OPD to unify specialized expertise and recover regressions, and then continues with RLHF, long-context RL, code RL, and software-engineering RL.
-
The important design principle is inter-domain interference management. Cascade RL exposes how domains compete: instruction adherence may conflict with human-preference optimization, long reasoning may conflict with concision, code execution may conflict with general chat behavior, and tool-use optimization may change formatting. Multi-domain OPD gives the model a way to recover capabilities from the strongest intermediate domain teachers before later stages specialize further.
-
In Nemotron-Cascade 2, the domain teacher advantage is defined on the sampled token:
\[a_t^{MOPD} = \log \pi_{\text{domain}_i}(y_t \mid s_t) - \log \pi_{\text{train}}(y_t \mid s_t)\]- where \(\pi_{\text{domain}_i}\) is the selected domain teacher and \(\pi_{\text{train}}\) is the policy being optimized.
-
The sampled-token advantage converges toward zero as the student absorbs the teacher’s preference for that domain. Positive values indicate that the teacher assigns higher probability to the sampled token than the student does; negative values indicate that the teacher considers the sampled token less likely than the student does.
-
Because rollout generation and learner updates can be asynchronous, the training objective may need to account for a behavior policy that generated the data and a current training policy that receives the update. This makes MOPD partly an algorithmic problem and partly a systems problem.
-
The following figure (source) shows the Nemotron-Cascade 2 training pipeline, where SFT is followed by instruction-following RL, multi-domain RL, multi-domain on-policy distillation, RLHF, long-context RL, code RL, and software-engineering RL.
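-
To make the asynchronous-rollout point above concrete, here is a minimal sketch of one generic correction: truncated per-token importance weighting between the behavior policy that generated the rollout and the current training policy. This is an assumed, commonly used technique shown for illustration only; it is not claimed to be the exact objective used in Nemotron-Cascade 2, and the cap value and function names are hypothetical.

```python
import math

def truncated_importance_weights(train_logps, behavior_logps, cap=2.0):
    """Per-token ratio pi_train / pi_behavior on sampled tokens, truncated at cap."""
    return [min(math.exp(lt - lb), cap) for lt, lb in zip(train_logps, behavior_logps)]

def corrected_advantages(teacher_logps, train_logps, behavior_logps, cap=2.0):
    """Scale the sampled-token gap log pi_domain - log pi_train by the truncated ratio."""
    weights = truncated_importance_weights(train_logps, behavior_logps, cap)
    return [w * (tl - lt) for w, tl, lt in zip(weights, teacher_logps, train_logps)]

# The second token was sampled by a stale behavior policy that disliked it.
behavior_logps = [math.log(0.5), math.log(0.1)]
train_logps = [math.log(0.5), math.log(0.4)]
teacher_logps = [math.log(0.6), math.log(0.8)]

weights = truncated_importance_weights(train_logps, behavior_logps)
advantages = corrected_advantages(teacher_logps, train_logps, behavior_logps)
```

Truncation bounds the variance introduced by stale rollouts at the cost of some bias, which is the usual tradeoff when rollout workers lag behind the learner.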

MOPD in Nemotron 3 Ultra
-
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning by NVIDIA et al. (2026) scales MOPD to a large agentic model trained with SFT, unified RLVR, specialist teacher training, MOPD warmup, multi-teacher OPD, and inference-oriented boosting.
-
The teacher pool spans many domains, including reasoning, code, math, tool use, agentic workflows, safety, usability, chat, long-context tasks, and software-engineering-style environments. The purpose is not merely to compress a larger model, but to consolidate capabilities produced by many separate specialist post-training paths.
-
A key practical lesson is that teachers trained with substantially different pipelines may not combine well through straightforward on-policy merging. When teachers and the student are trained on different SFT data or acquire different reasoning behaviors, student-generated trajectories can become out of distribution for the teacher, reducing the quality of token-level supervision.
-
The MOPD warmup stage addresses this by pulling the student closer to teacher-supported distributions before the main distillation run. Warmup is not just a convenience; it is a support-overlap intervention. It increases the likelihood that student rollouts land in regions where the teachers can provide reliable local preferences.
-
The recipe-review discussion describes Nemotron 3 Ultra as using two MOPD iterations with more than ten teachers across reasoning, code, math, and agentic domains. The first MOPD round distills a set of teachers into an improved student, and the second round re-distills from refreshed or reused teachers into a stronger final model.
DeepSeek-Style Specialist Consolidation
-
Public recipe discussions describe a progression in DeepSeek-style post-training from standard SFT and GRPO, to R1-style multi-stage reasoning RL, to hybrid think and non-think models, to specialist RL followed by SFT distillation, and then to ten-plus domain experts consolidated with MOPD.
-
The key conceptual progression is from monolithic RL toward modular specialist creation. Instead of forcing one model to optimize all domains simultaneously, the recipe trains domain experts and then merges them. Earlier versions may use SFT distillation to consolidate specialists; later versions may use MOPD to score student-generated rollouts directly.
-
A support-overlap analysis contrasts this with Nemotron 3 Ultra. If specialists are all forks of the same base model and trained with related domain SFT and RL recipes, their local token distributions may remain close enough for dense distribution matching to be safe. If specialists are influenced by external model data or substantially different training pipelines, sampled-token scoring and warmup may be more reliable than full-vocabulary matching.
-
The practical lesson is that the best MOPD loss depends on teacher-student lineage. Full teacher-distribution matching is powerful when support overlap is high. Sampled-token scoring is safer when overlap is lower. The right choice is not universal; it depends on how teachers were created.
MOPD versus Trace-Distillation SFT
-
MOPD and trace-distillation SFT are two different ways to consolidate specialist teachers. Trace-distillation SFT asks each teacher to generate trajectories, filters those trajectories, and trains the student on the teacher-produced traces. MOPD asks the student to generate trajectories and then has teachers score the student’s actual prefixes.
-
Trace-distillation SFT is easier to implement because the resulting data is just a supervised corpus. It can be generated, filtered, cached, audited, and reused. It is especially attractive when the student is too weak to produce useful rollouts or when the organization does not yet have an OPD scoring infrastructure.
-
MOPD is more adaptive because it trains on student-visited states. It can correct the specific mistakes the student makes, and it reduces train-inference mismatch. It is especially valuable in long-horizon reasoning, coding, and agentic tasks where the student’s early actions determine future states.
-
The tradeoff is systems complexity. MOPD requires rollout generation, teacher routing, log-probability scoring, loss masking, staleness control, and support-overlap monitoring. Trace-distillation SFT requires high-quality data generation and filtering, but it does not require a synchronous or asynchronous teacher-scoring loop during student training.
-
A conservative recipe may therefore use:
\[\text{specialist RL} \rightarrow \text{trace-distillation SFT} \rightarrow \text{final RL}\]- while a more MOPD-heavy recipe replaces the trace-distillation SFT stage with MOPD on student-generated rollouts.
-
Both are forms of specialist consolidation. The difference is whether the training trajectory comes from the teacher or the student.
Teacher Construction
-
The quality of MOPD depends heavily on how the teachers are constructed. A teacher is not useful merely because it is strong on a benchmark; it must also provide reliable token-level preferences on student-generated prefixes.
-
Teachers can be constructed by taking a shared base model, applying domain-specific SFT, running RL or RLVR on a domain-specific environment, selecting the strongest checkpoint by validation, and then freezing it for distillation. This lineage-preserving approach improves support overlap because each teacher remains a perturbation of the same base policy.
-
Teachers can also be trained from heterogeneous data sources or external model outputs. This may produce stronger specialists in absolute benchmark terms, but it can reduce compatibility with the student. The teacher may solve the task in a style or token distribution the student does not naturally visit, making OPD supervision noisier.
-
A practical teacher-selection rule is to evaluate not only teacher accuracy but also teacher-student compatibility. The best teacher for MOPD is not necessarily the highest-scoring teacher. It is the teacher that gives useful local gradients on the student’s rollouts.
-
Teacher checkpoints may also need to be staged. Instead of distilling from a fully converged specialist that has moved far from the student, the recipe may distill from intermediate checkpoints first, gradually moving the student toward the final specialist distribution.
Aggregating Teachers
-
Teacher aggregation can occur at the distribution level, token-advantage level, example level, or stage level. Distribution-level aggregation averages teacher next-token probabilities. Token-advantage aggregation combines teacher-student log-probability gaps. Example-level aggregation sends each prompt or rollout to one teacher. Stage-level aggregation distills from different teachers in separate phases.
-
Distribution-level aggregation is mathematically clean but fragile under teacher conflict. It works best when teachers are aligned and share local support.
-
Token-advantage aggregation is natural for MOPD because each teacher's signal is expressed as a dense advantage on the sampled token. This allows teacher weights to be raised, lowered, or set to zero depending on domain confidence, reward scores, or routing decisions.
-
Example-level routing is simple and common in domain-labeled training. A math prompt goes to a math teacher, a coding prompt goes to a coding teacher, and a safety prompt goes to a safety teacher. This reduces conflict but can miss mixed-domain trajectories.
-
Stage-level aggregation is useful when teachers are too incompatible to combine in one loss. A student may first absorb instruction-following teachers, then math teachers, then code teachers, then agentic teachers, with evaluation gates between stages.
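-
A small sketch can contrast two of the aggregation levels above: distribution-level averaging and example-level routing. The teacher distributions and the domain table are hypothetical and chosen to show how averaging behaves under conflict.

```python
def mix_distributions(teacher_dists, weights):
    """Distribution-level aggregation: weighted average of next-token probabilities."""
    total = sum(weights)
    vocab = len(teacher_dists[0])
    return [
        sum(w * dist[v] for w, dist in zip(weights, teacher_dists)) / total
        for v in range(vocab)
    ]

def route_example(domain, teacher_by_domain, fallback):
    """Example-level aggregation: exactly one teacher per prompt or rollout."""
    return teacher_by_domain.get(domain, fallback)

math_teacher = [0.90, 0.10, 0.00]  # hypothetical: prefers token 0
chat_teacher = [0.00, 0.10, 0.90]  # hypothetical: prefers token 2
mixture = mix_distributions([math_teacher, chat_teacher], [1.0, 1.0])
chosen = route_example(
    "math", {"math": "math_teacher", "chat": "chat_teacher"}, "chat_teacher"
)
```

The averaged target is bimodal and matches neither teacher's preference, whereas routing keeps a single coherent signal for the example at the cost of ignoring the other teacher.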
Teacher Agreement and Calibration
-
Multi-teacher systems require calibration because teachers may differ in entropy, verbosity, confidence, and logit scale. A high-confidence teacher is not always more correct; it may simply produce more sharply peaked distributions or be overconfident.
-
Calibration can be handled with temperature scaling:
\[p_{T_k}^{(\tau_k)}(y \mid s_t) = \frac{ \exp(z_{T_k,y}/\tau_k) }{ \sum_{v} \exp(z_{T_k,v}/\tau_k) }\] -
Teachers can also be normalized by validation accuracy, per-domain benchmark score, reward-model score, verifier success rate, or observed teacher-student agreement on held-out rollouts.
-
A teacher-disagreement diagnostic can be computed as average pairwise Jensen-Shannon divergence:
\[\bar{C}(s_t) =\frac{2}{K(K-1)} \sum_{i<j} D_{JSD} \left( p_{T_i}(\cdot \mid s_t) \,\Vert\, p_{T_j}(\cdot \mid s_t) \right)\] -
High disagreement does not automatically mean the example should be discarded. It may mean the prompt is genuinely multi-domain or ambiguous. The correct response may be to route more carefully, ask for clarification, preserve uncertainty, or prioritize a safety teacher.
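-
The temperature-scaling formula above can be checked with a short example. The logits are a made-up stand-in for an overconfident teacher; the point is only to show how \(\tau_k\) changes the sharpness of the distribution before teachers are compared or aggregated.

```python
import math

def softmax_with_temperature(logits, tau):
    """Temperature-scaled softmax; tau > 1 flattens and tau < 1 sharpens."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - peak) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 1.0, 0.0]  # a hypothetical overconfident teacher
sharp = softmax_with_temperature(logits, tau=1.0)
soft = softmax_with_temperature(logits, tau=2.0)
```

Raising the temperature lowers the top-token probability and increases entropy without changing the ranking of tokens, so a per-teacher \(\tau_k\) can bring teachers with different logit scales onto a more comparable footing.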
Regression Recovery and Capability Preservation
-
A major reason to use multi-teacher distillation is regression recovery. During staged RL, a model may become better at one domain while losing capability in another. For example, a stage that improves strict instruction following may change response style in ways that hurt open-ended chat; a stage that improves long reasoning may hurt concise instruction compliance; a code RL stage may affect non-code reasoning.
-
Multi-teacher distillation can preserve the best checkpoint for each domain. Instead of accepting the final RL checkpoint as globally best, the recipe selects domain teachers from different points in training. Each teacher captures the strongest local behavior for a domain, and the student is trained to combine them.
-
This is why Cascade RL and multi-domain OPD fit together naturally. Cascade RL creates a sequence of domain-improved checkpoints. MOPD then provides a mechanism to merge the best parts of those checkpoints before further training continues.
-
Regression recovery should be evaluated explicitly. A successful MOPD run is not only one that improves average score; it should recover losses on domains that earlier stages damaged while preserving the gains from newer stages.
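-
One way to make this evaluation explicit is a per-domain report that compares the student before and after consolidation against the best score any earlier checkpoint reached. The sketch below is an assumed bookkeeping scheme with hypothetical status labels and made-up scores, not a standard metric.

```python
def regression_report(best_prior, before, after, tol=0.0):
    """Classify each domain after a consolidation run.

    best_prior: best score any earlier checkpoint reached on each domain.
    before/after: student scores before and after the MOPD run.
    """
    report = {}
    for domain, best in best_prior.items():
        lost = best - before[domain]
        if after[domain] < before[domain] - tol:
            status = "new_regression"
        elif lost > tol and after[domain] >= best - tol:
            status = "recovered"
        elif lost > tol and after[domain] > before[domain]:
            status = "partially_recovered"
        elif lost > tol:
            status = "still_regressed"
        else:
            status = "preserved"
        report[domain] = status
    return report

best_prior = {"math": 0.80, "code": 0.70, "chat": 0.75}  # illustrative numbers
before = {"math": 0.72, "code": 0.70, "chat": 0.75}
after = {"math": 0.80, "code": 0.71, "chat": 0.73}
report = regression_report(best_prior, before, after)
```

A run like this one would count as only a partial success: math is recovered and code is preserved, but chat shows a new regression that average score alone could hide.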
Practical MOPD Training Loop
-
A practical MOPD training loop begins by training or selecting a pool of specialist teachers, each associated with a domain, benchmark category, environment, or capability. The student is initialized from an SFT, RLVR, or prior post-trained checkpoint that is already capable enough to generate meaningful rollouts. Prompts are sampled from a mixture that covers the target domains, and the student generates trajectories under the current rollout policy.
-
Each trajectory is routed to a teacher or teacher subset. The router may use prompt metadata, benchmark labels, classifier outputs, tool-use state, verifier results, or a learned routing score. The selected teachers score the student’s sampled tokens under the exact student prefixes and return teacher log-probabilities, top-\(k\) distributions, or dense token-level advantages.
-
The learner computes a weighted MOPD loss, masks invalid or untrusted tokens, applies clipping or importance weighting when rollout and learner policies differ, updates the student, and evaluates the result on both target-domain benchmarks and broad regression suites.
-
A compact recipe is:
\[\text{train specialists} \rightarrow \text{select teachers} \rightarrow \text{warm up student} \rightarrow \text{sample rollouts} \rightarrow \text{route to teachers} \rightarrow \text{score tokens} \rightarrow \text{update student} \rightarrow \text{evaluate regressions} \rightarrow \text{refresh or repeat}\] -
The warmup step is optional in theory but often crucial in practice when teachers are not close forks of the student. Its purpose is to increase the probability that student rollouts fall into teacher-supported regions.
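-
The route, score, mask, and aggregate steps of the loop above can be sketched as a single scoring pass. Teachers are modeled as plain callables returning log-probabilities for the sampled tokens; the data layout, the `mask` field, and the toy teachers are all assumptions made for illustration, and a real learner would backpropagate through the student log-probabilities instead of summing floats.

```python
def mopd_scoring_pass(rollouts, route, teachers, is_trusted):
    """Route each rollout, score sampled tokens, mask, and average the surrogate loss.

    rollouts: dicts with "domain", "tokens", and "student_logps".
    route: maps a rollout to a teacher name.
    teachers: name -> callable returning teacher log-probs for the sampled tokens.
    is_trusted: (rollout, position) -> bool, used for loss masking.
    """
    total, count = 0.0, 0
    for rollout in rollouts:
        teacher_logps = teachers[route(rollout)](rollout["tokens"])
        for t, (t_lp, s_lp) in enumerate(zip(teacher_logps, rollout["student_logps"])):
            if not is_trusted(rollout, t):
                continue  # e.g. tool output or padding
            advantage = t_lp - s_lp  # constant with respect to the student
            total += -advantage * s_lp
            count += 1
    return total / max(count, 1)

teachers = {
    "math": lambda tokens: [-0.2] * len(tokens),  # toy teacher, fixed log-probs
    "code": lambda tokens: [-1.0] * len(tokens),
}
rollouts = [
    {"domain": "math", "tokens": [11, 12, 13],
     "student_logps": [-0.5, -0.5, -0.5], "mask": [1, 1, 0]},
    {"domain": "code", "tokens": [21, 22],
     "student_logps": [-1.0, -1.0], "mask": [1, 1]},
]
loss = mopd_scoring_pass(
    rollouts,
    route=lambda r: r["domain"],
    teachers=teachers,
    is_trusted=lambda r, t: bool(r["mask"][t]),
)
```

Keeping routing and masking as injected functions makes both decisions easy to log and audit separately from the loss computation.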
Engineering and Systems Design
-
Multi-teacher systems introduce significant infrastructure requirements because they combine RL rollout generation, teacher inference serving, dynamic routing, log-probability transport, loss aggregation, and regression evaluation. Multiple teacher servers must be hosted and queried efficiently, often with vLLM-style inference clusters or equivalent high-throughput serving stacks, and the system must avoid letting teacher scoring become the bottleneck for the learner.
-
Routing logic determines which teachers should score each prompt, rollout, or token span. In a simple benchmark-labeled setup, a math prompt goes to a math teacher and a code prompt goes to a code teacher. In a mixed agentic trajectory, the router may need to recognize that one segment is reasoning, another is tool use, another is code execution, and another is final-answer formatting. Routing errors can either dilute specialization or apply harmful supervision from an out-of-domain teacher.
-
Teacher log-probabilities from different models must be aligned and aggregated carefully. If teachers share tokenizer, chat template, and special-token conventions, aggregation is straightforward. If they do not, the system may require expensive retokenization, sequence alignment, or token-span mapping, and the resulting token-level loss may become noisy or ambiguous.
-
Tokenizer compatibility is highly desirable because MOPD relies on teacher probabilities assigned to the student’s sampled tokens. When teacher and student vocabularies differ, a single student token may correspond to multiple teacher tokens or vice versa, which complicates sampled-token scoring and makes full-distribution matching even harder.
-
Fault tolerance and asynchronous scheduling are essential when dozens of teachers are involved. Teacher servers may fail, run at different speeds, or have different memory and batching constraints. A practical MOPD system needs queues, timeouts, fallback routing, teacher-health checks, freshness windows, and replay-buffer policies so that missing or stale teacher scores do not corrupt the learner.
-
The implementation complexity is higher than single-teacher distillation because the system must coordinate teacher selection, teacher serving, token alignment, log-probability aggregation, and multi-domain evaluation. The payoff is that capability preservation and modularity can improve substantially: a team can integrate new specialists, reuse older checkpoints, repair regressions, and consolidate RL-derived improvements without restarting the entire post-training recipe.
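-
The timeout and fallback behavior described above can be sketched with the standard-library `asyncio` module. The teacher coroutines below are hypothetical stand-ins for remote scoring servers, and the timeout value is arbitrary; a production system would add retries, health checks, and batching on top of this pattern.

```python
import asyncio

async def score_with_timeout(teacher_fn, tokens, timeout_s):
    """Return teacher log-probs, or None if the teacher misses its deadline."""
    try:
        return await asyncio.wait_for(teacher_fn(tokens), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # the learner should mask these tokens instead of guessing

async def score_batch(scoring_requests, teacher_fns, timeout_s=0.05):
    """Query all routed teachers concurrently; results keep the request order."""
    tasks = [
        score_with_timeout(teacher_fns[name], tokens, timeout_s)
        for name, tokens in scoring_requests
    ]
    return await asyncio.gather(*tasks)

# Hypothetical stand-ins for remote teacher servers.
async def fast_teacher(tokens):
    return [-0.1] * len(tokens)

async def stalled_teacher(tokens):
    await asyncio.sleep(1.0)  # simulates an overloaded or failed server
    return [-0.3] * len(tokens)

scoring_requests = [("math", [1, 2]), ("code", [3, 4, 5])]
teacher_fns = {"math": fast_teacher, "code": stalled_teacher}
results = asyncio.run(score_batch(scoring_requests, teacher_fns))
```

Returning `None` for a missed deadline lets the learner mask or reroute those tokens explicitly, so a slow teacher degrades coverage rather than silently corrupting the loss.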
Advantages of Multi-Teacher Distillation
-
Multi-teacher distillation enables a single student to absorb complementary strengths from multiple specialized models, which is especially valuable when no single teacher is best across math, code, safety, chat, long context, and agentic tool use.
-
It mitigates catastrophic forgetting and post-training regressions by retaining older or domain-specialized checkpoints as active sources of supervision, allowing the student to recover capabilities that later RL stages might otherwise overwrite.
-
It provides a modular way to integrate new capabilities without retraining from scratch, because a new specialist can be trained, validated, added to the teacher pool, and distilled into the general student through targeted routing or staged consolidation.
-
It supports efficient consolidation of RL-derived improvements, since domain-specific RL runs can be performed independently and then merged through trace distillation or MOPD instead of forcing all capabilities into one expensive, interference-prone RL run.
-
It allows practitioners to reuse valuable intermediate checkpoints as lasting supervision sources, which is particularly important in long cascade-style recipes where the final checkpoint is not necessarily the best checkpoint for every domain.
-
It improves organizational scalability because different teams can build and maintain different teachers, while the final distillation stage converts that distributed work into one deployable model.
Limitations
-
Teacher signals may conflict, requiring careful weighting, routing, masking, or stage ordering. If a math teacher favors long exploration while a chat teacher favors concision, naive averaging can dilute both behaviors rather than producing a model that knows when to use each style.
-
Infrastructure costs increase significantly as the number of teachers grows, because each teacher may require dedicated serving capacity, batching logic, memory allocation, monitoring, and fallback handling.
-
Routing policies can become complex and task-dependent, especially for mixed-domain trajectories that combine reasoning, code execution, search, tool use, safety judgment, and final-answer formatting inside one rollout.
-
Poorly balanced weights can dilute specialization or destabilize optimization. Overweighting a general teacher may erase specialist gains, while overweighting a specialist may damage general chat, safety, or instruction-following behavior.
-
Cross-tokenizer alignment becomes difficult when teachers use incompatible vocabularies, chat templates, or tool-call formats. This can make token-level log-probability comparison unreliable and can force the system toward sequence-level trace distillation instead of sampled-token MOPD.
-
Support mismatch can make a strong teacher a poor distillation teacher. If the student does not visit prefixes that the teacher understands, the teacher’s dense token-level feedback may encode distribution mismatch rather than useful expertise.
-
Debugging is harder than in single-teacher distillation because a regression may be caused by the student rollout policy, a teacher checkpoint, a router decision, a token-alignment issue, a stale rollout, an aggregation weight, or a domain-mixture imbalance.
Design Rules
-
A strong multi-teacher recipe should begin from a shared base or compatible student-teacher lineage whenever possible, because shared lineage improves local support overlap and makes token-level supervision more reliable.
-
The recipe should train specialists in domains where independent optimization is beneficial and where joint RL is likely to create interference. Math, code, software engineering, long-context reasoning, tool use, safety, and chat often benefit from different data sources, verifiers, and reward structures.
-
Teacher selection should consider compatibility as well as benchmark score. A teacher that is slightly weaker but locally aligned with the student may provide better MOPD gradients than a stronger teacher whose reasoning style is far from the student.
-
Routing should be explicit and auditable. The system should know why a trajectory was sent to a teacher, and routing errors should be measurable through teacher agreement, verifier outcomes, and downstream regressions.
-
The loss should match the support regime. Full-distribution matching is useful when teacher-student overlap is high. Top-\(k\) local matching is useful when support is partially shared. Sampled-token scoring is safer when support overlap is lower or teacher serving cost is high.
-
Warmup should be used when teachers and students are far apart. Warmup can be SFT on teacher-domain traces, curriculum prompts that elicit teacher-supported behavior, or progressive distillation from intermediate teacher checkpoints.
-
The final student should be judged by preservation as much as improvement. A successful multi-teacher distillation run should improve specialist domains while preserving chat, safety, instruction following, calibration, long-context behavior, and agentic reliability.
When Multi-Teacher Distillation is Preferred
-
Multi-teacher distillation is preferred when no single teacher dominates all domains, when different capabilities require different training environments or verifiers, when specialist teachers can be trained in parallel, when joint RL across all domains creates interference, when a final deployable model must merge domain expertise into one checkpoint, and when the team can support teacher routing, scoring, calibration, and regression evaluation.
-
MOPD is preferred when the student is already capable enough to generate meaningful rollouts and when teacher feedback on student-visited states is more valuable than teacher-generated traces. Trace-distillation SFT is preferred when the student is too weak for on-policy rollouts, when teacher-generated traces are easy to verify, or when the organization wants a simpler and more reproducible consolidation pipeline.
-
The practical modern recipe is therefore not “always use MOPD.” It is to use off-policy SFT or trace distillation to create a capable student, use RL or specialist RL to produce strong domain teachers, use warmup or intermediate checkpoints to improve support overlap, and then use MOPD when dense teacher feedback on student rollouts can safely consolidate the specialists.
Reinforcement Learning-Distillation Hybrids
-
RL-distillation hybrids combine two forms of supervision that are individually powerful but incomplete. Reinforcement learning trains on trajectories sampled by the current policy and can optimize directly for task success, but its feedback is often sparse, delayed, noisy, or difficult to assign to individual tokens. Distillation supplies dense token-level guidance, but classical distillation usually trains on teacher or dataset trajectories rather than on the student’s actual behavior. Hybrid methods try to keep the trajectory relevance of RL while adding the local credit assignment of distillation.
-
The central idea to carry through this section is that modern post-training increasingly treats distillation as a dense advantage-estimation mechanism. A teacher, self-teacher, verifier-conditioned model, or feedback-conditioned model can assign token-level log-probability differences along a student rollout. Those differences can then act like rewards, advantage weights, clipping signals, routing signals, or auxiliary losses inside an RL-style training loop.
-
These hybrids are especially important for reasoning, coding, tool-use, and agentic tasks. In those settings, a final scalar reward may say that the answer was wrong, that a test failed, or that an agent did not complete the task, but it does not identify which intermediate reasoning step, tool call, file edit, or assumption caused the failure. Distillation can turn richer teacher or feedback information into dense local updates over the tokens that produced the trajectory.
-
RL-distillation hybrids also clarify why OPD and MOPD fit naturally inside RL infrastructure. OPD uses student-generated rollouts, like RL, but replaces or augments sparse sequence-level rewards with teacher log-probability gaps at each sampled token. MOPD extends this pattern by routing rollouts to domain teachers, allowing post-training systems to consolidate RL-derived specialist capabilities without rerunning a single monolithic RL job across every domain.
-
The practical design question is not whether to use RL or distillation, but how to combine them. Some methods use RL as the primary objective and distillation as a gated auxiliary loss. Some methods use reward signals to scale or extrapolate a distillation loss. Some methods route successful samples to RL and failed samples to self-distillation. Some methods turn textual critiques, runtime errors, user feedback, or environment observations into teacher-only context and then distill from that richer view.
-
The main risk is that dense token-level supervision can be overconfident or misaligned with the deployed student’s information state. Privileged teachers may suppress uncertainty, shorten reasoning traces, or reward style differences rather than task-bearing improvements. Strong hybrid recipes therefore anchor distillation in rewards, verifiers, gates, contrastive controls, support-overlap checks, and regression evaluations.
Why RL and Distillation Are Converging
-
Classical reinforcement learning for LLMs usually optimizes a policy objective of the form:
\[\mathcal{L}_{RL}(\theta) =-\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)} \left[ A(x,y) \sum_t \log \pi_\theta(y_t\mid x,y_{<t}) \right]\]- where \(A(x,y)\) is a trajectory-level or response-level advantage derived from a reward model, verifier, preference signal, or group-relative normalization.
-
The credit-assignment problem is that the same scalar advantage is often broadcast across every token in the response. If a solution is wrong because of one algebraic error, one incorrect API call, or one missing test, the RL update may still push or pull on the entire trajectory.
-
Distillation supplies a token-level signal. In sampled-token OPD, the dense teacher-student gap can be written as:
\[A_t^{OPD} = \log \pi_T(y_t\mid s_t) - \log \pi_\theta(y_t\mid s_t)\]- where \(s_t=(x,y_{<t})\) is the student-visited prefix. Tokens that the teacher rates above the student receive positive local signal, while tokens that the teacher rates below the student receive negative local signal.
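-
A minimal sketch of the sampled-token gap, assuming teacher and student log-probabilities have already been gathered for the same sampled tokens. The function name `opd_token_advantages` and the `response_mask` argument are illustrative and not taken from any specific library:

```python
def opd_token_advantages(teacher_logprobs, student_logprobs, response_mask):
    """Return A_t = log pi_T(y_t|s_t) - log pi_theta(y_t|s_t) per token.

    Positions where response_mask is 0 (prompt or invalid tokens) get 0.0.
    """
    if not (len(teacher_logprobs) == len(student_logprobs) == len(response_mask)):
        raise ValueError("inputs must be aligned to the same token positions")
    return [
        (t_lp - s_lp) if keep else 0.0
        for t_lp, s_lp, keep in zip(teacher_logprobs, student_logprobs, response_mask)
    ]


# Toy rollout: one masked prompt token followed by three response tokens.
teacher = [-0.1, -0.5, -3.0, -0.2]
student = [-0.1, -1.5, -1.0, -0.2]
mask = [0, 1, 1, 1]
advantages = opd_token_advantages(teacher, student, mask)
```

In this toy rollout the second token receives a positive signal because the teacher rates it above the student, and the third receives a negative signal because the teacher rates it below.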
-
A hybrid objective can therefore combine sparse reward-level guidance with dense token-level teacher guidance:
\[\mathcal{L}_{hybrid}(\theta) =\mathcal{L}_{RL}(\theta) +\lambda \mathcal{L}_{distill}(\theta)\]- where \(\lambda\) controls how strongly the dense distillation term influences the policy relative to the reward-grounded RL term.
-
The convergence of RL and distillation is partly algorithmic and partly infrastructural. RL systems already generate on-policy rollouts, store token log-probabilities, compute advantages, mask invalid tokens, and update policies. OPD and self-distillation systems can reuse the same machinery, replacing or augmenting scalar rewards with teacher log-probability gaps, feedback-conditioned logits, or privileged-context token scores.
Hybrid Template: RL Backbone plus Distillation Auxiliary
-
The most conservative hybrid pattern keeps RL as the task-grounded backbone and adds distillation as an auxiliary signal. This is useful when rewards or verifiers are trusted but sparse, while teacher or self-teacher signals are informative but potentially noisy.
-
The objective is:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) + \lambda_{D} \mathcal{L}_{D}(\theta)\] -
A token-level form is:
\[\mathcal{L}_{D}(\theta) =-\mathbb{E} \left[ \sum_t w_t \operatorname{sg}[A_t^{D}] \log \pi_\theta(y_t\mid s_t) \right]\]- where \(A_t^{D}\) may be a teacher-student log-probability gap, a self-teacher gap, a contrastive privileged-context signal, or a feedback-conditioned correction, and \(w_t\) is a gate, mask, or confidence weight.
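-
The two formulas above can be sketched for a single rollout as plain scalar arithmetic. This is an illustration under stated assumptions, not a training implementation: the function names are hypothetical, the values are computed without gradients, and in an autograd framework the token gaps would be detached to play the role of \(\operatorname{sg}[A_t^{D}]\):

```python
def rl_loss(student_logprobs, advantage):
    """Sequence-level policy-gradient surrogate: -A(x, y) * sum_t log pi(y_t|s_t)."""
    return -advantage * sum(student_logprobs)


def distill_loss(student_logprobs, token_gaps, weights):
    """Token-level auxiliary: -sum_t w_t * A_t^D * log pi(y_t|s_t).

    token_gaps stand in for sg[A_t^D]; with autograd they would be detached
    so that no gradient flows through them.
    """
    return -sum(w * a * lp for w, a, lp in zip(weights, token_gaps, student_logprobs))


def hybrid_loss(student_logprobs, advantage, token_gaps, weights, lam):
    """L = L_RL + lambda_D * L_D for one rollout."""
    return rl_loss(student_logprobs, advantage) + lam * distill_loss(
        student_logprobs, token_gaps, weights
    )


# Toy rollout with two response tokens; the gate switches off the second token.
loss = hybrid_loss(
    student_logprobs=[-1.0, -2.0],
    advantage=0.5,
    token_gaps=[1.0, -1.0],
    weights=[1.0, 0.0],
    lam=0.5,
)
```

Setting a weight to zero removes that token from the distillation term while leaving the RL term untouched, which is how the gate keeps an untrusted teacher signal from overriding the reward.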
-
This pattern is attractive because the RL term preserves the direction of task improvement while the distillation term refines local token-level credit assignment. It is especially useful when the teacher is helpful but not trusted enough to become the primary objective.
Hybrid Template: Distillation Modulated by Rewards
-
A second pattern treats distillation as the main dense update but uses rewards to scale, filter, or extrapolate the teacher signal. This is useful when a teacher provides dense local preferences, but the student should be allowed to exceed the teacher rather than merely imitate it.
-
A generic reward-modulated distillation update is:
\[\mathcal{L}_{reward\text{-}modulated}(\theta) =-\mathbb{E} \left[ \sum_t g(R(x,y),A_t^D) \log \pi_\theta(y_t\mid s_t) \right]\]- where \(g\) combines a trajectory reward with a token-level teacher signal.
-
If a rollout receives high reward, the update can reinforce the sampled tokens even where they depart from the teacher’s preferences. If a rollout receives low reward, the teacher signal can instead serve as a correction. This pattern turns distillation from pure imitation into reward-aware policy improvement.
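-
One hypothetical choice of \(g\) is sketched below. The threshold, the `boost` constant, and the specific piecewise rule are assumptions made for illustration and do not come from any particular paper:

```python
def modulated_weight(reward, token_gap, threshold=0.5, boost=1.0):
    """Hypothetical g(R, A_t^D): the per-token weight on log pi(y_t|s_t).

    High-reward rollouts are reinforced and never pushed down, even on
    tokens the teacher rates below the student (token_gap < 0).
    Low-reward rollouts fall back to the raw teacher gap as a correction.
    """
    if reward >= threshold:
        return boost + max(token_gap, 0.0)
    return token_gap


# Successful rollout, teacher disagrees with the sampled token: still reinforced.
w_success_disagree = modulated_weight(reward=1.0, token_gap=-2.0)
# Successful rollout, teacher agrees: reinforced more strongly.
w_success_agree = modulated_weight(reward=1.0, token_gap=0.5)
# Failed rollout: the teacher gap is used directly as a correction.
w_failure = modulated_weight(reward=0.0, token_gap=-2.0)
```

The design choice worth noting is the asymmetry: the reward decides whether the teacher is allowed to push a token down at all.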
Hybrid Template: Sample Routing
-
A third pattern routes different samples to different objectives. Correct or high-reward rollouts are sent to an RL branch because they already demonstrate useful behavior, while incorrect or low-reward rollouts are sent to a distillation branch where a teacher, self-teacher, or feedback-conditioned model can provide dense correction.
-
A routing function can be written as:
\[\rho(x,y) =\mathbf{1}[R(x,y)\geq \tau]\]- where \(\rho=1\) routes the sample to an RL branch and \(\rho=0\) routes it to a corrective distillation branch.
-
The resulting objective is:
\[\mathcal{L}(\theta) =\mathbb{E} \left[ \rho(x,y)\mathcal{L}_{RL} + (1-\rho(x,y))\mathcal{L}_{D} \right]\] -
This pattern is appealing because successful samples and failed samples contain different information. Successful samples can be reinforced directly; failed samples often require diagnostic feedback that sparse rewards do not provide.
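-
The routing rule and the routed objective can be sketched as follows, assuming each sample already carries a precomputed RL loss and distillation loss. The dictionary keys and function names are illustrative:

```python
def route(reward, tau):
    """rho(x, y) = 1[R(x, y) >= tau]: 1 sends the sample to RL, 0 to distillation."""
    return 1 if reward >= tau else 0


def routed_loss(samples, tau):
    """Batch mean of rho * L_RL + (1 - rho) * L_D."""
    total = 0.0
    for sample in samples:
        rho = route(sample["reward"], tau)
        total += rho * sample["rl_loss"] + (1 - rho) * sample["distill_loss"]
    return total / len(samples)


batch = [
    {"reward": 1.0, "rl_loss": 0.25, "distill_loss": 0.9},  # routed to RL
    {"reward": 0.0, "rl_loss": 0.7, "distill_loss": 0.75},  # routed to distillation
]
batch_loss = routed_loss(batch, tau=0.5)
```

Only one of the two losses contributes for each sample, so the unused branch does not need to be computed in practice.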
Hybrid Template: Feedback-Conditioned Self-Distillation
-
A fourth pattern uses feedback as privileged teacher context. The student acts without feedback, then a teacher view sees feedback such as a runtime error, test failure, judge comment, user correction, tool observation, or final answer. The teacher view scores the original student trajectory under this richer context, and the student learns from the resulting dense correction.
-
The student distribution is:
\[\pi_S(\cdot\mid x,y_{<t})\]- while the feedback-conditioned teacher distribution is:
\[\pi_T(\cdot\mid x,f,y_{<t})\]- where \(f\) is feedback that is available during training but not necessarily available during deployment.
-
This setup is useful because many post-training environments provide richer evidence than a scalar reward. A compiler error, failing unit test, browser observation, user complaint, or environment state transition often explains what went wrong more clearly than a binary success label.
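-
A minimal sketch of how the two views might be constructed. The prompt template and the function name `build_views` are hypothetical; the only property taken from the description above is that both views score the same student rollout while only the teacher view sees the feedback:

```python
def build_views(prompt, rollout, feedback):
    """Return (context, rollout) pairs for the student and teacher views.

    The rollout is identical in both views; only the context differs.
    The bracketed header below is an illustrative template, not a standard.
    """
    teacher_context = f"{prompt}\n\n[Feedback on a previous attempt]\n{feedback}"
    return {
        "student": (prompt, rollout),
        "teacher": (teacher_context, rollout),
    }


views = build_views(
    prompt="Write a function that parses an ISO date.",
    rollout="def parse(s): return s.split('/')",
    feedback="Test failed: expected '-' as the separator.",
)
```

Each view would then be scored token by token over the shared rollout, and the difference between the two sets of log-probabilities gives the dense correction.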
Reinforcement Learning via Self-Distillation
-
Reinforcement Learning via Self-Distillation by Hübotter et al. (2026) introduces Self-Distillation Policy Optimization (SDPO), which converts rich textual feedback into dense token-level supervision without requiring a separate external teacher.
-
The method begins with a rollout generated by the current model. A feedback source then supplies a critique, runtime error, judge comment, verifier message, or other natural-language signal. The same model is conditioned on that feedback to form a teacher view, and the original rollout is replayed under the feedback-conditioned context to obtain token-level correction.
-
The method is conceptually important because it shows how RL can move beyond scalar rewards without abandoning the RL training loop. The feedback-conditioned teacher produces dense local information, while the on-policy rollout structure ensures that the model learns from states it actually visits.
-
The following figure (source) shows a comparison of RLVR and RLRF settings. In Reinforcement Learning with Verifiable Rewards (RLVR), the agent learns from a scalar reward \(r\), which often acts as an information bottleneck by masking the underlying environment state. In contrast, Reinforcement Learning with Rich Feedback (RLRF) utilizes tokenized feedback. In the core self-distillation policy optimization setup, textual feedback is transformed into a dense teacher signal over the original student rollout, providing a richer signal than a scalar reward because feedback can include runtime errors, judge comments, or detailed observations of the state.

ExOPD: Learning Beyond the Teacher
-
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation by Yang et al. (2026) introduces ExOPD, which treats OPD as a dense KL-constrained RL method and adds reward extrapolation so that the student can improve beyond the teacher rather than merely match it.
-
Standard OPD can be limiting when the teacher is strong but not optimal. If the student discovers a trajectory that receives higher reward than the teacher’s likely behavior, pure imitation may pull the student back toward the teacher. ExOPD addresses this by using reward information to scale or extrapolate the dense teacher signal.
-
A schematic reward-extrapolated signal is:
\[\tilde{A}_t =h(R(x,y),R_T(x)) \left[ \log \pi_T(y_t\mid s_t) -\log \pi_\theta(y_t\mid s_t) \right]\]- where \(h\) increases or decreases the local distillation signal according to how the sampled trajectory compares with teacher or reference performance.
-
This reframes distillation as a reference-guided policy optimization method. The teacher supplies local structure, but rewards decide whether the student should imitate, deviate, or amplify the sampled behavior.
-
The following figure (source) shows the empirical effectiveness of ExOPD compared with off-policy distillation (SFT), standard OPD, and the weight-extrapolation method ExPO introduced in Model Extrapolation Expedites Alignment by Zheng et al. (2025) in multi-teacher and strong-to-weak distillation settings, with results averaged over 4 math reasoning and 3 code generation benchmarks. In multi-domain expert merging, ExOPD is the only method that yields a unified student that consistently outperforms all domain teachers; in strong-to-weak distillation, ExOPD significantly improves over standard OPD, with reward correction further boosting performance.

REOPOLD: Relaxed OPD for Stable Reasoning
-
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation by Ko et al. (2026) introduces REOPOLD, which relaxes strict teacher imitation so that OPD can scale more stably for reasoning tasks.
-
The motivation is that exact imitation may be too restrictive. A reasoning model may need to explore alternative paths, preserve useful uncertainty, or maintain its own valid search style. If the teacher signal is applied too aggressively at every token, the student can overfit to teacher style or suppress productive reasoning variants.
-
REOPOLD interprets teacher-student log-likelihood ratios as token rewards but applies them selectively or temperately. This makes the method closer to a regularized RL objective than to pure distillation. It preserves the benefit of dense feedback while reducing the risk that every local teacher preference becomes an imperative.
-
The following figure (source) shows an illustration of REOPOLD. While standard on-policy distillation can be unstable or inefficient when it forces the student to mimic the teacher too aggressively, REOPOLD uses teacher signals temperately and selectively. By connecting distillation and RL through a stop-gradient operation, it filters potentially harmful signals and limits excessive drift from the student’s original distribution.

Self-Distilled RLVR
-
Self-Distilled RLVR by Yang et al. (2026) combines RLVR with privileged self-distillation by separating update direction from update magnitude. RLVR determines whether a rollout should be pushed up or down according to verifier-based correctness, while self-distillation modulates how strongly individual tokens should be updated.
-
This separation is important because privileged self-distillation alone can leak answer information, reward overly concise traces, or destabilize long training. By keeping RLVR as the direction-setting objective, the method remains anchored to verifiable correctness. By using self-distillation to refine token-level magnitudes, it gains denser credit assignment than ordinary response-level RLVR.
-
A simplified update can be written as:
\[\Delta \theta \propto A^{RLVR} \sum_t w_t^{SD} \nabla_\theta \log \pi_\theta(y_t\mid s_t)\]- where \(A^{RLVR}\) supplies the reward-aligned direction and \(w_t^{SD}\) supplies token-level magnitude from a privileged self-teacher.
-
The following figure (source) shows an overview of Self-Distilled RLVR (RLSD), a hybrid design in which RLVR provides update directions and self-distillation provides fine-grained step sizes.

SRPO: Sample Routing between GRPO and Self-Distillation
-
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026) introduces SRPO, a routing framework that sends correct samples to a GRPO-style RL branch and failed samples to a self-distillation branch.
-
The method is based on the observation that correct and incorrect rollouts should not necessarily receive the same type of supervision. Correct rollouts already demonstrate a successful trajectory and are well suited to reward-aligned reinforcement. Incorrect rollouts need diagnostic correction, so they are replayed under privileged teacher context and updated using dense self-distillation.
-
The branch structure can be summarized as:
\[\mathcal{L}_{SRPO} =\mathbf{1}[R=1]\mathcal{L}_{GRPO} +\mathbf{1}[R=0]\mathcal{L}_{SDPO}\]- where the reward or verifier determines whether a rollout is reinforced directly or corrected through self-distillation.
-
The following figure (source) shows an overview of SRPO. Given a prompt \(x\), the policy \(\pi_\theta\) generates a group of on-policy rollouts. A correctness check routes each rollout to one of two branches: correct samples are sent to the GRPO branch, where group-relative advantages provide a reward-aligned policy update; incorrect samples with available teacher information are sent to the SDPO branch, where a feedback-conditioned self-teacher produces logit-level distillation targets via \(D_{\mathrm{KL}}\left(P\,\Vert\,\operatorname{sg}(Q)\right)\) for dense corrective supervision.

RLCSD: Contrastive Self-Distillation inside RLVR
-
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026) integrates contrastive self-distillation into RLVR to address privilege-induced style drift. The method compares the self-teacher gap under a correct hint with the self-teacher gap under an incorrect hint, subtracting generic hint-induced style changes and leaving a signal more concentrated on task-bearing tokens.
-
This method matters because RL-distillation hybrids can fail when privileged context changes the teacher’s style rather than only its correctness. A teacher that sees the answer may become shorter, more assertive, and less exploratory. If the student imitates that shift directly, it can lose the epistemic behaviors needed for autonomous reasoning.
-
A simplified contrastive signal is:
\[e_t^{ctr} =\left[ \log \pi_\theta(y_t\mid x,c^+,y_{<t}) -\log \pi_\theta(y_t\mid x,y_{<t}) \right] -\left[ \log \pi_\theta(y_t\mid x,c^-,y_{<t}) -\log \pi_\theta(y_t\mid x,y_{<t}) \right]\]- where \(c^+\) is a correct hint and \(c^-\) is a wrong or contrastive hint.
-
RLCSD uses this cleaned token-level signal as a modulation of GRPO-style reward learning rather than as an ungrounded imitation loss. This preserves the reward anchor while improving token-level credit assignment.
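-
The simplified contrastive signal above can be sketched directly from its definition. The log-probability values are invented for illustration; the function only assumes that the same sampled token has been scored under the correct hint, the wrong hint, and no hint:

```python
def contrastive_signal(lp_correct_hint, lp_wrong_hint, lp_no_hint):
    """e_t^ctr = (gap under correct hint) - (gap under wrong hint)."""
    pos_gap = lp_correct_hint - lp_no_hint
    neg_gap = lp_wrong_hint - lp_no_hint
    return pos_gap - neg_gap


# A style token: any hint makes it more likely, so the two gaps cancel.
style_signal = contrastive_signal(
    lp_correct_hint=-0.5, lp_wrong_hint=-0.5, lp_no_hint=-2.0
)
# A task-bearing token: only the correct hint supports it.
task_signal = contrastive_signal(
    lp_correct_hint=-0.25, lp_wrong_hint=-3.25, lp_no_hint=-1.25
)
```

Algebraically the no-hint term cancels, so the signal reduces to the log-probability under the correct hint minus the log-probability under the wrong hint. That is why hint-induced style shifts that appear under both hints drop out.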
-
The following figure (source) shows the RLCSD overview: ordinary OPSD gaps concentrate on style tokens such as “maybe” or “straightforward,” while the contrastive gap shifts the signal toward task-bearing tokens; the response-length plots show that RLCSD remains more stable than several prior OPSD-style methods; and the benchmark comparison shows gains on mathematical and logical reasoning tasks.

SDAR: Gated Self-Distillation for Multi-Turn Agents
-
Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026) extends RL-distillation hybrids to multi-turn agents by treating on-policy self-distillation as a gated auxiliary objective and keeping GRPO as the primary RL backbone.
-
The method is designed for agentic settings where trajectories span multiple turns and where naive OPSD can become unstable as the student and privileged teacher contexts drift apart. Instead of applying self-distillation uniformly, SDAR gates the auxiliary signal at the token level.
-
The overall objective is:
\[\mathcal{L}(\theta) =\mathcal{L}_{GRPO}(\theta) +\lambda_{\mathrm{SDAR}} \mathcal{L}_{SDAR}(\theta)\]- where \(\mathcal{L}_{GRPO}\) preserves verifier-driven policy optimization and \(\mathcal{L}_{SDAR}\) injects dense token-level guidance only where the privileged self-teacher signal is trusted.
-
The detached teacher-student gap is:
\[\Delta_t =\operatorname{sg} \left( \log \pi_\theta^+(y_t\mid s_t^+) -\log \pi_\theta(y_t\mid s_t) \right)\] -
The gate is:
\[g_t =\sigma(\beta\Delta_t)\] -
The gated auxiliary token loss is:
\[\ell_t^{SDAR} =g_t \left( \log \pi_\theta^+(y_t\mid s_t^+) -\log \pi_\theta(y_t\mid s_t) \right)\] -
The conceptual point is that the privileged self-teacher is not allowed to dominate the RL objective. It can only provide a bounded token-level shaping signal, and that signal is strongest where the teacher-student gap suggests useful local correction.
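-
The gap, gate, and gated term above can be sketched for a single token as follows. This is scalar arithmetic only, with hypothetical function names; in a real training loop the gap inside the gate would be detached as the \(\operatorname{sg}\) operator indicates:

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sdar_gated_term(teacher_lp, student_lp, beta):
    """Return (g_t, g_t * gap) for one token, with g_t = sigmoid(beta * gap)."""
    gap = teacher_lp - student_lp  # Delta_t; treated as a constant inside the gate
    gate = sigmoid(beta * gap)
    return gate, gate * gap


# Teacher and student agree: the gate sits at 0.5 and the term vanishes.
gate_equal, term_equal = sdar_gated_term(-1.0, -1.0, beta=2.0)
# Teacher rates the token well above the student: the gate opens.
gate_open, term_open = sdar_gated_term(-0.5, -2.5, beta=2.0)
# Teacher rates the token well below the student: the gate closes.
gate_closed, term_closed = sdar_gated_term(-2.5, -0.5, beta=2.0)
```

Because the gate lies strictly between 0 and 1, the auxiliary term on any token can never exceed the raw gap in magnitude, which is what keeps the shaping signal bounded.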
-
The following figure (source) illustrates the SDAR framework, where verifier-driven RL and token-level OPSD are combined through gates derived from uncertainty, teacher-student gap, or their soft-OR combination.

OpenClaw-RL and Agent Interaction Feedback
-
OpenClaw-RL: Train Any Agent Simply by Talking by Wang et al. (2026) extends RL-distillation hybrids to interactive agents where the environment naturally produces rich feedback streams.
-
In agent settings, the feedback source may be a user reply, GUI state transition, terminal output, browser result, unit-test failure, file-system observation, or downstream task state. These signals can be used as scalar rewards, but they can also become teacher-only context for hindsight-guided self-distillation.
-
The training pattern is:
\[\text{agent rollout} \rightarrow \text{environment or user feedback} \rightarrow \text{hindsight teacher view} \rightarrow \text{dense token-level correction}\] -
This is a natural fit for agents because many errors are local but consequences are delayed. A final task failure may trace back to one incorrect command, one missed observation, one invalid file edit, or one misunderstood user constraint. Feedback-conditioned distillation can assign denser corrective signal to those earlier actions than a final scalar reward alone.
-
The following figure (source) shows an overview of the OpenClaw-RL infrastructure. Interaction streams come from Personal Agents, which are conversational and single-user agents hosted on personal devices, and General Agents, which include terminal, GUI, SWE, and tool-call agents hosted on cloud services. Samples flow into an asynchronous RL server with environment serving, PRM or judge reward computation, Megatron policy training, and SGLang policy serving.

Cascade RL and Distillation as Recipe Stabilization
-
RL-distillation hybrids are not limited to single objectives; they also appear as recipe-level stabilization mechanisms. Nemotron-Cascade 2 uses Cascade RL to optimize domains in a deliberate sequence, then inserts multi-domain OPD as a stabilization point to recover benchmark regressions and unify specialized expertise before later stages continue.
-
The relevant recipe pattern is:
\[\text{SFT} \rightarrow \text{Instruction-Following RL} \rightarrow \text{Multi-domain RL} \rightarrow \text{Multi-domain OPD} \rightarrow \text{RLHF} \rightarrow \text{Long-context RL} \rightarrow \text{Code RL} \rightarrow \text{SWE RL}\] -
This pattern treats RL and distillation as complementary stages. RL stages improve target capabilities through reward-driven interaction with verifiers and environments. The OPD stage consolidates specialized improvements and repairs regressions by using strong intermediate checkpoints as dense token-level teachers.
Nemotron 3 Ultra and MOPD after Unified RLVR
-
Nemotron 3 Ultra provides a large-scale recipe-level example in which SFT and unified RLVR establish a general student, while MOPD consolidates many domain-specialist teachers into a single agentic model.
-
In this recipe shape, RLVR teaches the student to operate across many verifiable and semi-verifiable environments, while MOPD later transfers specialist teacher preferences over student-generated rollouts. This is a hybrid not because every gradient step contains both losses, but because the full recipe alternates between reward-driven capability acquisition and distillation-driven capability consolidation.
-
The support-overlap issue is central. If the RLVR student can already generate trajectories that lie near teacher-supported regions, MOPD can provide useful dense feedback. If not, warmup or staged teacher integration may be needed before multi-teacher scoring becomes reliable.
-
The following figure (source) shows the Nemotron 3 Ultra post-training pipeline, including SFT, unified RLVR, MOPD warmup, multi-teacher OPD, and later inference-oriented boosting.

Environment Feedback and Scientific RL
-
RL-distillation hybrids become most powerful when the environment provides rich feedback rather than only a final binary reward. In code, the environment can provide compiler errors, failing tests, stack traces, and execution outputs. In tool-use agents, the environment can provide browser observations, terminal output, file contents, and GUI state transitions. In scientific settings, the environment may eventually include laboratory measurements, simulations, instrument logs, or experimental outcomes.
-
This is why RL-distillation hybrids are especially relevant for scientific and agentic post-training. A sparse scalar reward can say that an experiment failed or that a code patch did not pass tests, but a rich feedback channel can explain the failure, expose intermediate observations, and support teacher-conditioned replay. Distillation then turns that richer information into token-level or action-level updates.
-
A general environment-feedback hybrid can be written as:
\[f_t =\operatorname{EnvFeedback}(x,y_{\leq t})\] \[A_t^{feedback} = \log \pi_T(y_t\mid x,f_{\geq t},y_{<t}) - \log \pi_\theta(y_t\mid x,y_{<t})\]- where future or delayed feedback is used as teacher-only context during training, while the deployed student still acts without that hindsight.
Sparse Rewards versus Dense Distillation
-
RL and distillation differ most sharply in feedback granularity. RLVR often provides one reward for an entire response, while OPD or self-distillation can provide one signal per token.
-
The difference can be summarized as:
| Training signal | Trajectory source | Feedback granularity | Main strength | Main weakness |
|---|---|---|---|---|
| RLVR | Student rollout | Response-level or outcome-level | Directly optimizes verifiable success | Weak token-level credit assignment |
| OPD | Student rollout | Token-level teacher feedback | Dense local correction | Requires reliable teacher scoring on student prefixes |
| Self-distillation | Student rollout or fixed data | Token-level privileged-context feedback | Avoids external teacher infrastructure | Can amplify style drift or suppress uncertainty |
| RL-distillation hybrid | Usually student rollout | Both response-level and token-level | Combines task grounding with dense credit assignment | Requires careful balancing, gating, and evaluation |
- The best hybrid systems avoid treating dense feedback as automatically superior. Dense token-level feedback is useful only when it is aligned with the task objective, reliable under the student’s visited prefixes, and consistent with the behavior the deployed model should preserve.
Direction, Magnitude, and Gating
-
One useful design principle is to separate update direction from update magnitude. A verifier or reward model can decide whether a trajectory should become more or less likely, while a teacher or self-teacher can decide which tokens should receive stronger or weaker updates.
-
This separation can be expressed as:
\[\Delta \theta \propto A^{reward} \sum_t m_t^{distill} \nabla_\theta \log \pi_\theta(y_t\mid s_t)\]- where \(A^{reward}\) supplies the task-aligned direction and \(m_t^{distill}\) supplies token-level magnitude, relevance, or trust.
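-
A minimal sketch of this separation, computing the coefficient that multiplies each token's \(\nabla_\theta \log \pi_\theta(y_t\mid s_t)\). The non-negativity check is an added assumption that makes the separation explicit: magnitudes may shrink or enlarge an update but cannot flip its sign:

```python
def token_update_coefficients(reward_advantage, magnitudes):
    """Per-token coefficient: sign from the reward, size from distillation."""
    if any(m < 0 for m in magnitudes):
        raise ValueError("magnitudes must be non-negative so they cannot flip direction")
    return [reward_advantage * m for m in magnitudes]


# A failed rollout (negative advantage): every token is pushed down or left
# alone, and the self-teacher decides how hard each one is pushed.
coefficients = token_update_coefficients(
    reward_advantage=-1.0, magnitudes=[0.5, 2.0, 0.0]
)
```

Here the second token carries most of the blame and the third receives no update at all, even though the reward itself is a single scalar for the whole response.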
-
Gating is critical because teacher signals can be harmful on some tokens. A gate may depend on the teacher-student gap, student entropy, verifier outcome, contrastive hint difference, rollout length, token type, teacher confidence, support overlap, or whether the token lies inside a tool call, reasoning segment, final answer, or system-controlled region.
-
A bounded gate can be written as:
\[g_t=\sigma(\beta z_t)\]- where \(z_t\) is a trust score derived from one or more signals, and \(\beta\) controls how sharply the gate separates trusted from untrusted tokens.
Failure Modes
-
RL-distillation hybrids can fail when the dense teacher signal conflicts with the sparse reward. If the reward says a rollout succeeded but the teacher dislikes the student’s style, a naive distillation loss can pull the student away from a valid solution. If the reward says a rollout failed but the teacher still assigns locally high probability to many tokens, the dense signal may obscure the global failure.
-
Privileged self-distillation can suppress the epistemic behaviors that make reasoning models robust. A teacher that sees the final answer may prefer shorter, more confident, less exploratory traces; if those preferences are distilled directly, the student can lose uncertainty markers, self-correction, branching, and backtracking.
-
Dense token-level feedback can reward local plausibility without global success. A continuation token may look reasonable under the prefix even if the overall trajectory is doomed because of an earlier hidden mistake. This is especially risky in long coding, proof, and agentic workflows.
-
Teacher-student support mismatch can turn distillation into noise. If the student visits prefixes that the teacher would never generate, teacher log-probabilities may reflect off-distribution artifacts rather than useful expertise.
-
Reward hacking and verifier overfitting can still occur. Adding distillation does not remove the need for robust reward design, contamination checks, adversarial evaluation, and human or automated audits.
-
Loss balancing can destabilize training. If the distillation term is too strong, the model may over-imitate teacher style; if it is too weak, it may not improve credit assignment. If the RL term is too strong, sparse rewards can dominate and erase the benefits of dense correction.
Implementation Pattern
-
A practical RL-distillation hybrid begins by defining the rollout source, which is usually the current student or a slightly lagged inference policy. The system samples prompts, generates rollouts, records token log-probabilities, masks prompt and invalid tokens, and stores metadata about the policy version that produced each trajectory.
-
The system then obtains one or more feedback sources. A verifier may produce a binary or scalar reward, a reward model may produce a preference score, a teacher may produce token-level log-probabilities, and a feedback-conditioned self-teacher may produce privileged-context token scores. These signals must be aligned to the same tokenization, prefix semantics, and response boundaries.
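-
One simple alignment check is sketched below, assuming each scorer sees its own context followed by the same response tokens. The function name and the token ids are hypothetical:

```python
def response_tokens_match(student_ids, student_prompt_len, teacher_ids, teacher_prompt_len):
    """True if both sequences end in the same response tokens.

    The contexts may differ (the teacher may see extra feedback), but the
    scored response must be token-for-token identical, otherwise per-token
    log-probabilities cannot be compared position by position.
    """
    return student_ids[student_prompt_len:] == teacher_ids[teacher_prompt_len:]


# Student: 2 prompt tokens + 3 response tokens.
student_ids = [1, 2, 10, 11, 12]
# Teacher: 4 context tokens (prompt plus feedback) + the same 3 response tokens.
teacher_ids = [1, 2, 7, 8, 10, 11, 12]
aligned = response_tokens_match(student_ids, 2, teacher_ids, 4)

# A re-tokenized response that merged two tokens into one no longer aligns.
retokenized_teacher_ids = [1, 2, 7, 8, 10, 99]
misaligned = response_tokens_match(student_ids, 2, retokenized_teacher_ids, 4)
```

Running a check like this before computing token gaps catches cases where decoding a rollout to text and re-encoding it for the teacher silently changes the token boundaries.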
-
The learner computes reward-level advantages, token-level distillation advantages, gates or masks, and any routing decisions. The objective may combine GRPO, PPO, RLVR, OPD, self-distillation, contrastive self-distillation, or sample-routing branches depending on the sample outcome.
-
The update is applied to the student while maintaining strict versioning of the rollout policy, teacher checkpoint, verifier version, reward configuration, prompt source, and masking rules. This is necessary because hybrid systems are difficult to debug without full provenance.
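-
A provenance record for each rollout might look like the sketch below. The field names mirror the items listed above; the integer step counter, the `is_stale` helper, and all string values are placeholder assumptions:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RolloutProvenance:
    """Immutable record of everything that produced and scored one rollout."""
    rollout_policy_step: int
    teacher_checkpoint: str
    verifier_version: str
    reward_config: str
    prompt_source: str
    masking_rules: str


def is_stale(record, learner_step, max_lag):
    """True if the rollout policy is more than max_lag steps behind the learner."""
    return learner_step - record.rollout_policy_step > max_lag


record = RolloutProvenance(
    rollout_policy_step=1200,
    teacher_checkpoint="teacher-ckpt-A",
    verifier_version="verifier-v1",
    reward_config="reward-config-v1",
    prompt_source="prompt-pool-v1",
    masking_rules="mask-prompt-and-invalid",
)
log_row = asdict(record)  # plain dict, ready to be written alongside the sample
```

Freezing the dataclass prevents a record from being edited after the rollout is stored, and the staleness check gives the learner a simple way to drop samples from a policy that has lagged too far behind.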
-
Evaluation should measure both target-task improvement and regressions. A hybrid method may improve math while shortening reasoning too much, improve code while hurting chat, improve agentic task success while increasing tool-call errors, or improve verifier score while reducing safety calibration.
When RL-Distillation Hybrids Are Preferred
-
RL-distillation hybrids are preferred when tasks provide useful rewards or verifiers but those rewards are too sparse for efficient credit assignment; when rich feedback such as runtime errors, tool outputs, user replies, or judge comments can be converted into teacher context; when the student is already capable enough to produce meaningful rollouts; when specialist teachers can provide dense token-level guidance on student-visited states; and when the training system can support rollout generation, teacher scoring, reward computation, routing, gating, and careful regression evaluation.
-
They are less attractive when the student is too weak to generate informative trajectories, when teacher-student support overlap is poor, when privileged context would leak information or suppress necessary reasoning behavior, when rewards are unreliable or easy to game, when token-level alignment across models is infeasible, or when the engineering cost of synchronous teacher scoring would dominate the training budget.
-
The most practical recipe is staged rather than monolithic. Use off-policy SFT or trace distillation to create a competent student, use RLVR or environment-based RL to align the student with task success, add OPD or self-distillation when dense token-level credit assignment becomes valuable, use gating or contrastive controls when privileged context introduces style drift, and apply MOPD when several specialist RL-derived capabilities must be consolidated into one deployable model.
Comparative Analysis
-
Distillation methods are best compared across a few orthogonal axes: trajectory source, teacher source, supervision density, teacher update pattern, RL integration, and systems complexity. These axes matter more than method names because many modern recipes combine several methods in sequence.
-
The main practical rule is that each method solves a different bottleneck. Off-policy distillation is best for stable broad transfer. OPD is best for exposure-bias reduction. Self-distillation is best when external teachers are unavailable or when feedback can create a privileged teacher view. MOPD is best for consolidating specialist capabilities. RL-distillation hybrids are best when sparse rewards need dense token-level credit assignment.
-
Modern frontier recipes rarely use one method in isolation. They usually start with off-policy SFT or trace distillation, apply RLVR or domain RL, then use OPD, OPSD, or MOPD to consolidate improvements and repair regressions.
Tabular Comparison
| Method family | Trajectory source | Teacher source | Supervision signal | Systems profile | Best use case | Main failure mode |
|---|---|---|---|---|---|---|
| Soft-label KD | Fixed dataset or cached examples. | Frozen external teacher, refreshed teacher, or peer model. | Dense token or class probabilities, often temperature-smoothed and approximated with top-\(k\) logits. | Requires teacher logprob extraction, top-\(k\) or full-logit storage, temperature handling, and divergence-specific approximations. | Transferring uncertainty, calibration, and teacher distributional structure. | Expensive full-vocabulary logits, distorted top-\(k\) approximations, and train-inference mismatch. |
| Sequence-level / trace distillation | Teacher-generated outputs, synthetic traces, rejection-sampled completions, or tool-use transcripts. | Frozen teacher, specialist checkpoint, or prior model. | Hard teacher sequences trained with cross-entropy, possibly including reasoning traces, tool calls, critiques, code, or proofs. | Requires generation, filtering, deduplication, contamination checks, mixture balancing, and SFT training. | Stable cold starts, synthetic instruction tuning, and simple specialist consolidation. | Discards uncertainty, imitates teacher artifacts, and trains only on teacher-produced states. |
| Off-policy distillation | Human data, synthetic traces, historical logs, or verifier-filtered corpora. | External teachers, specialists, reward models, judges, or human annotators. | Hard targets, soft targets, critiques, preference labels, verifier scores, or top-\(k\) logprobs. | Dataset-centric: teacher inference is separated from student training, so quality depends on corpus provenance and filtering. | Reproducible broad transfer when stability and auditability matter. | Exposure bias, poor recovery from student mistakes, stale teacher behavior, and mixture imbalance. |
| Online / semi-online distillation | Shared minibatches, refreshed datasets, or evolving checkpoint trajectories. | Co-trained peers, shadow teachers, EMA teachers, or periodically refreshed checkpoints. | Mutual KL, JSD, peer logits, hidden states, or agreement losses. | Requires synchronization, checkpoint tracking, peer communication, and staleness control. | Adaptive supervision when a frozen teacher becomes stale. | Non-stationary targets, consensus errors, high compute cost, and difficult attribution. |
| On-policy distillation (OPD) | Student-generated rollouts or closely lagged rollout-policy samples. | Usually a frozen external teacher, sometimes refreshed between rounds. | Dense token-level divergences or sampled-token teacher-student logprob gaps. | Requires rollout workers, teacher scoring servers, exact prefix replay, response masks, stop handling, and freshness controls. | Reducing exposure bias in reasoning, coding, tool-use, and agentic tasks. | Poor rollout quality, stale rollouts, support mismatch, truncation/repetition collapse, and teacher-scoring cost. |
| Self-distillation / OPSD | Fixed data, student rollouts, or replayed trajectories under richer context. | Earlier checkpoint, EMA copy, same model under privileged context, retrieved skills, feedback, or future user messages. | Self-teacher gaps, context-view divergences, relevance-masked updates, contrastive hint gaps, or gated privileged-context signals. | Requires privileged-context construction, leakage controls, relevance masks, teacher-refresh policy, and epistemic-marker monitoring. | Learning from feedback, latent capability, runtime errors, user corrections, or private task context without external teachers. | Information leakage, style drift, response shortening, uncertainty suppression, and moving-teacher collapse. |
| Multi-teacher distillation / MOPD | Fixed data for classical multi-teacher KD; student rollouts for MOPD. | Domain specialists, earlier checkpoints, safety teachers, code teachers, math teachers, agentic teachers, or Cascade RL checkpoints. | Aggregated distributions, routed teacher scores, or per-teacher sampled-token advantages. | Requires teacher pools, routing, batching, calibration, support-overlap checks, tokenizer compatibility, and regression suites. | Consolidating specialist capabilities into one deployable model. | Teacher conflict, poor routing, incompatible tokenizers, support mismatch, high serving cost, and hard-to-debug regressions. |
| RL-distillation hybrids | Usually student-generated rollouts, sometimes routed by reward outcome. | External teachers, self-teachers, reward models, verifiers, environment feedback, or hindsight user feedback. | Sparse reward advantages plus dense token-level gates, teacher gaps, self-teacher gaps, or contrastive token scores. | Requires rollout generation, reward computation, teacher scoring, gating, branch routing, loss balancing, and full provenance. | Combining task-grounded reward optimization with dense local credit assignment. | Dense teacher signals can conflict with rewards, privileged context can reward style, and bad gates or loss weights can destabilize training. |
Trajectory Source
-
The trajectory source is the most important modern distinction. Off-policy methods train on externally produced trajectories, such as human demonstrations, teacher completions, synthetic traces, historical logs, or verifier-filtered samples. This is stable and auditable, but it does not train the student on the states it creates at inference time.
-
On-policy methods train on student-generated trajectories. The teacher scores the student’s actual prefixes, so the student learns from its own mistakes rather than only imitating ideal teacher paths. This is why OPD is especially valuable for long-horizon reasoning, coding, tool use, and agentic workflows.
-
Practical systems are shaped by this distinction. Off-policy pipelines are dataset-centric: generate, filter, store, and train. On-policy pipelines are loop-centric: sample rollouts, score them with teachers, compute token-level losses, and update the student while managing rollout freshness.
Teacher Source
-
External teachers can transfer capabilities the student does not yet have, but they are expensive to serve and may be poorly aligned with the student’s prefixes. They are common in KD, synthetic-data generation, OPD, and MOPD.
-
Internal teachers are used in self-distillation. They may be earlier checkpoints, EMA copies, the same model under privileged context, or the same model conditioned on feedback. They reduce external dependency, but they can also amplify the model’s own biases.
-
Multi-teacher systems use pools of specialists. This is useful when no single teacher dominates all domains, but it adds routing, calibration, conflict resolution, and teacher-serving complexity. Teacher lineage matters: compatible forks of the same base usually provide more reliable token-level OPD signals than heterogeneous teachers with different tokenizers or reasoning styles.
Supervision Density
-
Hard sequence distillation is cheap and simple, but it only teaches one selected output. Soft-label distillation preserves uncertainty, but full-vocabulary logits are expensive and often approximated with top-\(k\) distributions.
-
OPD and MOPD often use sampled-token feedback. For a student token \(y_t\) at prefix \(s_t\), the teacher-student gap
\[A_t = \log \pi_T(y_t\mid s_t) - \log \pi_S(y_t\mid s_t)\]- behaves like a dense token-level advantage. This is cheaper than full KL because the teacher only scores the sampled token, but it loses information about unsampled alternatives.
-
Dense supervision is not automatically good. Privileged teachers can reward style changes, suppress uncertainty, or penalize useful self-correction tokens. This is why RMSD, SDAR, and RLCSD use masking, gating, or contrastive controls.
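The relationship between the sampled-token gap and full reverse KL can be checked numerically. The sketch below is a toy illustration, not code from any cited system: the three-token vocabulary, the probabilities, and the function names are all assumptions. It shows that the expectation of \(A_t\) over tokens sampled from the student equals the negative reverse KL at that prefix, which is why a single sampled token gives a cheap but noisy estimate of the same quantity.

```python
import math

def sampled_token_gap(teacher_logp, student_logp, token):
    """A_t = log pi_T(y_t | s_t) - log pi_S(y_t | s_t) for one sampled token."""
    return teacher_logp[token] - student_logp[token]

def reverse_kl(student_logp, teacher_logp):
    """KL(pi_S || pi_T) over a shared toy vocabulary."""
    return sum(
        math.exp(ls) * (ls - teacher_logp[v])
        for v, ls in student_logp.items()
    )

# Toy next-token distributions at one prefix (hypothetical numbers).
student = {"a": math.log(0.7), "b": math.log(0.2), "c": math.log(0.1)}
teacher = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}

# Expectation of the gap over tokens sampled from the student.
expected_gap = sum(
    math.exp(student[v]) * sampled_token_gap(teacher, student, v)
    for v in student
)

# The expected gap is the negative reverse KL at this prefix.
assert abs(expected_gap + reverse_kl(student, teacher)) < 1e-12
```

In this toy case the gap is positive for token `b`, which the teacher prefers more than the student does, and negative for token `a`, which the student over-weights.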
Reinforcement Learning Integration
-
RL supplies on-policy trajectories and task-grounded rewards, but feedback is often sparse. Distillation supplies dense token-level supervision, but classical KD is usually off-policy. OPD, OPSD, MOPD, and RL-distillation hybrids combine these strengths.
-
A useful design principle is to let rewards decide the direction of improvement while distillation shapes token-level credit assignment. This appears in Self-Distilled RLVR, RLCSD, SDAR, and sample-routing methods that send successful rollouts to RL and failed rollouts to self-distillation correction.
-
Recipe-level hybrids are also common. Nemotron-Cascade 2 uses staged RL, then inserts multi-domain OPD to recover regressions and consolidate intermediate teachers. Nemotron 3 Ultra uses SFT and unified RLVR before MOPD warmup and multi-teacher OPD.
Systems Complexity
-
Systems complexity increases as training moves from fixed data to live rollout and teacher scoring. Off-policy trace distillation mainly needs high-quality generation, filtering, deduplication, contamination checks, and mixture design.
-
OPD adds rollout workers, teacher scoring servers, logprob transport, exact prefix replay, response masking, stop-token handling, and rollout freshness controls. Asynchronous systems must track whether the rollout policy and learner policy have drifted apart.
-
MOPD adds teacher pools, routing, batching, teacher health checks, score aggregation, calibration, and multi-domain regression monitoring. Routing is algorithmic, not just infrastructural: the router determines which gradient the student receives.
-
Self-distillation reduces external serving cost but adds leakage and context-construction risks. The system must ensure that privileged teacher context improves the training signal without teaching the deployed student to assume unavailable information.
-
RL-distillation hybrids require the most provenance. Each update may depend on a rollout policy, reward model, verifier, teacher checkpoint, self-teacher context, token gate, mask, routing branch, and loss weight.
Practical Selection Heuristics
-
Choose off-policy SFT or trace distillation when the student needs a stable cold start, when teacher traces can be generated and filtered offline, or when reproducibility matters more than train-inference matching.
-
Choose soft-label KD when teacher uncertainty and calibration matter enough to justify logprob extraction and storage.
-
Choose online or semi-online distillation when teacher staleness is the bottleneck and the system can tolerate non-stationary supervision.
-
Choose OPD when exposure bias is the bottleneck and the student is strong enough to generate useful rollouts.
-
Choose self-distillation or OPSD when external teachers are unavailable or when privileged context, feedback, retrieved skills, or user interactions can create a useful teacher view.
-
Choose MOPD when specialist capabilities need to be merged into one deployable model and joint RL would create interference.
-
Choose RL-distillation hybrids when rewards or verifiers are useful but too sparse for good token-level credit assignment.
Common Training Progressions
-
A conservative reasoning-model progression is:
\[\text{Synthetic / Trace SFT} \rightarrow \text{RLVR} \rightarrow \text{OPD} \rightarrow \text{Final alignment}\] -
A specialist-consolidation progression is:
\[\text{Shared base} \rightarrow \text{Domain SFT / Domain RL} \rightarrow \text{Specialist teachers} \rightarrow \text{MOPD} \rightarrow \text{Regression recovery}\] -
A simpler alternative to MOPD is:
\[\text{Specialist RL} \rightarrow \text{Teacher-generated traces} \rightarrow \text{Filtering} \rightarrow \text{Trace-distillation SFT} \rightarrow \text{Final RL}\] -
A feedback-conditioned self-distillation progression is:
\[\text{Student rollout} \rightarrow \text{Runtime / user / environment feedback} \rightarrow \text{Privileged teacher replay} \rightarrow \text{Gated token-level update} \rightarrow \text{Regression and leakage checks}\] -
A frontier-style multi-domain progression is:
\[\text{Off-policy SFT} \rightarrow \text{Unified or domain RLVR} \rightarrow \text{Specialist teacher training} \rightarrow \text{Warmup for support overlap} \rightarrow \text{MOPD} \rightarrow \text{Final RL or alignment}\]
Implementation-Aware Comparison
-
The canonical distillation dataflow is:
\[\text{prompts} \rightarrow \text{trajectory source} \rightarrow \text{teacher or feedback signal} \rightarrow \text{token alignment and masking} \rightarrow \text{loss computation} \rightarrow \text{student update} \rightarrow \text{evaluation and regression monitoring}\] -
Off-policy methods emphasize generation quality, filtering, and dataset versioning. OPD emphasizes rollout freshness and exact teacher scoring. MOPD emphasizes routing, aggregation, and support overlap. Self-distillation emphasizes privileged-context construction and leakage control. RL-distillation hybrids emphasize reward anchoring, token gates, and loss balancing.
-
Logprob payload design should match the loss. Forward KL needs coverage of the teacher’s high-probability tokens, so teacher-top-\(k\) distributions are the natural payload. Reverse-KL sampled-token OPD needs only the teacher logprob on the sampled token. JSD or local-support matching may require top-\(k\) sets from both teacher and student.
-
Tokenizer compatibility is especially important for OPD and MOPD. If tokenizers or chat templates differ, sampled-token advantages and response masks become unreliable.
-
Stabilization also differs by method. Off-policy pipelines rely on filtering and mixture balancing. OPD relies on support-overlap checks, rollout-quality filters, and truncation monitoring. OPSD relies on fixed teachers, relevance masks, contrastive controls, and epistemic-marker monitoring. MOPD relies on routing, calibration, and regression suites. RL-distillation hybrids rely on reward anchoring and gated token-level updates.
Comparative Failure Modes
-
Off-policy distillation fails through exposure bias, stale or low-quality traces, data contamination, over-imitation, and mixture imbalance.
-
Online distillation fails through non-stationary targets, consensus errors, synchronization overhead, and unstable teacher refreshes.
-
OPD fails through poor rollout quality, stale rollouts, support mismatch, repetition or truncation collapse, and masking bugs.
-
OPSD fails through information leakage, style drift, response shortening, uncertainty suppression, and moving-teacher feedback loops.
-
MOPD fails through teacher conflict, bad routing, incompatible tokenizers, support mismatch, high serving cost, and insufficient regression evaluation.
-
RL-distillation hybrids fail when dense token signals conflict with rewards, when privileged context rewards style rather than task progress, or when gates and loss weights are miscalibrated.
Practical Defaults
-
Start with off-policy SFT or trace distillation unless there is a strong reason not to. Add RLVR or RLHF when task success cannot be fully captured by demonstrations. Add OPD when the student’s own rollouts become informative. Add self-distillation when external teachers are unavailable or feedback can create a useful teacher context. Add MOPD when specialist capabilities need consolidation. Add RL-distillation hybrids when rewards are trusted but too sparse.
-
Treat systems constraints as algorithmic constraints. A theoretically attractive loss may be impractical if it requires full-vocabulary logits from many teachers, cross-tokenizer alignment, synchronous scoring, or dense teacher forwards over long rollouts. A sampled-token objective may be less exact but still the better practical choice if it keeps training scalable and stable.
Decision Summary
-
Use off-policy distillation for stable broad transfer.
-
Use online or semi-online distillation for adaptive teacher refresh.
-
Use OPD for exposure-bias reduction on student rollouts.
-
Use self-distillation for privileged feedback without external teachers.
-
Use MOPD for specialist capability consolidation.
-
Use RL-distillation hybrids for reward-grounded learning with dense token-level credit assignment.
-
For frontier-style post-training, prefer a staged recipe: off-policy SFT for competence, RLVR for task success, specialist RL for domain peaks, warmup for support overlap, OPD or MOPD for consolidation, and final alignment plus regression monitoring for deployment readiness.
Implementation Patterns
-
Large-scale distillation is best understood as a distributed systems problem built around a teacher-student dataflow architecture. The algorithmic choice matters, but at frontier scale the dominant constraints are often rollout throughput, teacher scoring throughput, log-probability transport, tokenizer alignment, masking correctness, replay freshness, and regression monitoring.
-
The central idea to carry through this section is that every distillation method requires four implementation decisions: the source of trajectories, the source of teacher or feedback signal, the divergence or surrogate loss, and the systems design for computing and transporting the necessary log-probabilities. Off-policy systems push most complexity into data generation and filtering. OPD systems push complexity into rollout generation and teacher scoring. MOPD systems add routing, teacher pools, and support-overlap checks. Self-distillation systems add privileged-context construction and leakage controls. RL-distillation hybrids add reward computation, gating, and loss balancing.
-
Modern LLM distillation systems increasingly resemble RL training stacks. They sample or replay trajectories, store token-level log-probabilities, compute token masks, maintain behavior-policy provenance, and update the learner with dense token-level signals. The difference is that the dense signal may come from a teacher, several teachers, a self-teacher under privileged context, a reward model, an environment, or a feedback-conditioned replay.
-
A useful engineering rule is that the training loss is only as reliable as the metadata around it. A token-level teacher score is meaningful only if the system knows which prompt, chat template, tokenizer, rollout policy, teacher checkpoint, response mask, stop condition, and prefix produced it. Without this provenance, distillation failures are difficult to diagnose.
Canonical Distillation Dataflow
-
A production distillation system usually follows a dataflow in which prompts produce trajectories, trajectories are scored by teachers or feedback systems, scored examples are converted into a loss, and the student is updated while evaluation monitors both target gains and regressions.
\[\text{Prompts} \rightarrow \text{Trajectory Source} \rightarrow \text{Teacher / Feedback Signal} \rightarrow \text{Alignment and Masking} \rightarrow \text{Loss Computation} \rightarrow \text{Student Update} \rightarrow \text{Evaluation}\] -
The trajectory source determines whether the pipeline is dataset-centric or loop-centric. In off-policy distillation, trajectories are produced before training and stored as a corpus. In OPD, MOPD, OPSD, and most RL-distillation hybrids, trajectories are generated by the student or a closely related behavior policy during training.
-
The teacher or feedback signal determines the payload. Sequence-level distillation needs text targets. Logit distillation needs teacher distributions or top-\(k\) probabilities. Sampled-token OPD needs teacher log-probabilities on sampled student tokens. OPSD needs the same rollout rescored under privileged context. RL-distillation hybrids may require both reward-level and token-level fields.
-
A practical scored rollout record can be represented as:
\[r = \left( x,\, y,\, \log \pi_{\text{behav}}(y_t \mid s_t),\, \log \pi_T(y_t \mid s_t),\, m_t,\, \rho,\, v \right)\]- where \(x\) is the prompt, \(y\) is the rollout, \(s_t=(x,y_{<t})\) is the token prefix, \(m_t\) is the valid-token mask, \(\rho\) contains routing or reward metadata, and \(v\) records checkpoint and system versions.
-
The most important implementation invariant is that every log-probability must correspond to the exact token and prefix used by the learner. A mismatch in tokenization, chat template, system prompt, tool-call formatting, stop-token handling, or response mask can turn a correct mathematical loss into an incorrect training signal.
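One way to make the scored rollout record and its length invariant concrete is a small validated container. This is a minimal sketch under assumptions: the class name `ScoredRollout` and all field names are hypothetical, and a production schema would carry far more metadata.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ScoredRollout:
    prompt: str                 # x
    response_token_ids: list    # y, one id per generated token
    behavior_logprobs: list     # log pi_behav(y_t | s_t)
    teacher_logprobs: list      # log pi_T(y_t | s_t)
    mask: list                  # m_t, 1 for tokens that enter the loss
    routing: dict = field(default_factory=dict)   # rho
    versions: dict = field(default_factory=dict)  # v

    def __post_init__(self):
        # Every per-token field must line up with the response tokens.
        n = len(self.response_token_ids)
        for name in ("behavior_logprobs", "teacher_logprobs", "mask"):
            got = len(getattr(self, name))
            if got != n:
                raise ValueError(f"{name} has length {got}, expected {n}")
        if any(m not in (0, 1) for m in self.mask):
            raise ValueError("mask entries must be 0 or 1")

record = ScoredRollout(
    prompt="Solve 2 + 2.",
    response_token_ids=[17, 42],
    behavior_logprobs=[-0.1, -0.2],
    teacher_logprobs=[-0.3, -0.4],
    mask=[1, 0],
    versions={"teacher_checkpoint": "teacher-v1", "template": "chat-v2"},
)
```

Rejecting misaligned records at construction time catches length mismatches before they reach the loss. It does not prove that the log-probabilities were computed on the right prefix, which still requires template and tokenizer provenance.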
Off-Policy Pipeline Architecture
-
Off-policy distillation is the simplest pipeline because teacher generation and student optimization are decoupled. The teacher produces traces, labels, logits, critiques, preference annotations, verifier scores, or tool transcripts ahead of training. The student then consumes a fixed or slowly refreshed dataset.
-
A typical off-policy pipeline proceeds as:
\[\text{Prompt Collection} \rightarrow \text{Teacher Generation} \rightarrow \text{Filtering / Verification} \rightarrow \text{Deduplication} \rightarrow \text{Mixture Balancing} \rightarrow \text{SFT / KD} \rightarrow \text{Evaluation}\] -
The strength of this design is reproducibility. Once the corpus is fixed, multiple students, losses, or mixture weights can be compared on the same training data. This makes off-policy distillation ideal for cold starts, ablation studies, safety audits, and synthetic-data pipelines.
-
The main engineering burden is data quality. The system must track teacher checkpoint versions, prompt sources, generation parameters, verifier versions, reward-model versions, filtering rules, deduplication hashes, benchmark-contamination checks, and domain mixture weights.
-
For trace-distillation SFT, storing text is usually cheap relative to storing logits, but trace quality becomes the bottleneck. For logit distillation, teacher probability storage becomes expensive, especially if the pipeline stores top-\(k\) probabilities for every token rather than only teacher-generated text.
-
A robust off-policy dataset entry should include not only the final response, but also provenance metadata such as source prompt, teacher checkpoint, decoding temperature, sample index, filter scores, verifier outcome, answer-extraction result, safety tags, and any known benchmark relationship. Without this metadata, later regression analysis becomes guesswork.
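The provenance and deduplication requirements above can be sketched as follows. This is illustrative only: the required-field list, the helper names, and the normalization rule are assumptions, and the hash only catches exact matches after whitespace and case normalization, not near-duplicates.

```python
import hashlib

REQUIRED_FIELDS = {"teacher_checkpoint", "temperature", "sample_index", "verifier_outcome"}

def dedup_key(prompt, response):
    """SHA-256 of whitespace- and case-normalized text (exact-match dedup only)."""
    norm = " ".join((prompt + "\n" + response).lower().split())
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def make_entry(prompt, response, **provenance):
    """Build a dataset entry, refusing to store a trace without provenance."""
    missing = REQUIRED_FIELDS - set(provenance)
    if missing:
        raise ValueError(f"missing provenance fields: {sorted(missing)}")
    return {
        "prompt": prompt,
        "response": response,
        "dedup_key": dedup_key(prompt, response),
        **provenance,
    }

def deduplicate(entries):
    """Keep the first entry for each dedup key, preserving order."""
    seen, kept = set(), []
    for entry in entries:
        if entry["dedup_key"] not in seen:
            seen.add(entry["dedup_key"])
            kept.append(entry)
    return kept
```

Storing the key alongside the entry lets later pipeline stages deduplicate across shards without re-reading the text.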
On-Policy Pipeline Architecture
-
On-policy distillation introduces a closed feedback loop. The student generates rollouts, the teacher scores those exact rollouts, and the learner updates the student using dense teacher feedback on the student’s own visited prefixes.
-
A typical OPD pipeline proceeds as:
\[\text{Prompt Sampling} \rightarrow \text{Student Rollout} \rightarrow \text{Teacher Scoring} \rightarrow \text{Token Masking} \rightarrow \text{Distillation Loss} \rightarrow \text{Student Update}\] -
This architecture directly addresses train-inference mismatch because the supervised states are sampled from the student. It is especially valuable for long-horizon reasoning, code, and agentic tasks where early mistakes create future contexts that never appear in teacher-generated traces.
-
The cost is operational complexity. The training loop must maintain rollout workers, teacher scoring workers, queues, masks, log-probability schemas, and learner-side loss computation. The teacher no longer produces a complete answer independently; instead, it must replay the student’s exact prefix and return probabilities for the student’s sampled tokens or a local token set.
-
In most practical OPD systems, gradients do not flow through the sampling process. The rollout is treated as data, teacher and student log-probabilities are computed over that data, and the student is updated by a supervised or policy-gradient-shaped surrogate. This makes the implementation closer to DAgger-style imitation learning than to full sequence-level reverse-KL policy gradient.
-
A sampled-token OPD loss can be written as:
\[\mathcal{L}_{OPD} =-\mathbb{E} \left[ \sum_{t} m_t \operatorname{sg} \left[ \log \pi_T(y_t \mid s_t) -\log \pi_\theta(y_t \mid s_t) \right] \log \pi_\theta(y_t \mid s_t) \right]\]- where \(m_t\) masks valid response tokens and \(\operatorname{sg}\) denotes stop-gradient through the teacher-derived advantage.
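The sampled-token loss above can be evaluated by hand on a short rollout. The sketch below uses plain Python with no autodiff, so the stop-gradient is represented only by treating the advantage as a constant; the function name and the numbers are assumptions for illustration.

```python
def opd_sampled_token_loss(teacher_logp, student_logp, mask):
    """Surrogate loss -sum_t m_t * sg[A_t] * log pi_theta(y_t | s_t).

    A_t is treated as a constant here. In an autodiff framework it would be
    detached before being multiplied by the student log-probability.
    Returns the loss and the gradient of the loss with respect to each
    student log-probability, which is -m_t * A_t.
    """
    loss = 0.0
    grads = []
    for lt, ls, m in zip(teacher_logp, student_logp, mask):
        advantage = lt - ls          # teacher-student gap on the sampled token
        loss -= m * advantage * ls
        grads.append(-m * advantage)
    return loss, grads

# Three sampled tokens; the last one is masked out (for example, padding).
teacher_logp = [-0.5, -2.0, -1.0]
student_logp = [-1.0, -1.0, -1.0]
mask = [1, 1, 0]
loss, grads = opd_sampled_token_loss(teacher_logp, student_logp, mask)
```

A gradient-descent step moves each student log-probability opposite to its gradient, so the first token (teacher more confident than student) is pushed up and the second token (teacher less confident) is pushed down, while the masked token is untouched.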
Generation Buffers and Asynchronous Execution
-
Large-scale OPD and MOPD systems often decouple rollout generation, teacher scoring, and learner updates. Autoregressive generation may be slow, teacher scoring may be batched most efficiently on different hardware, and learner updates may run on yet another training cluster. A generation buffer allows these components to operate asynchronously.
-
A practical asynchronous pipeline is:
\[\text{Rollout Workers} \rightarrow \text{Generation Buffer} \rightarrow \text{Teacher Scoring Queue} \rightarrow \text{Scored Rollout Buffer} \rightarrow \text{Learner}\] -
The benefit is throughput. Rollout workers do not need to wait for learner updates, teacher scorers can batch requests by teacher or sequence length, and the learner can consume already-scored batches whenever they are ready.
-
The risk is staleness. If a rollout is generated by an older behavior policy but consumed by a newer learner policy, the training example is no longer exactly on-policy. This is a systems-level version of off-policy drift even when the algorithm is conceptually on-policy.
-
A common correction is to distinguish the behavior policy, the proximal policy, and the current learner policy. Nemotron 3 Ultra-style asynchronous MOPD uses ratios that account for the mismatch between a stale behavior policy and the proximal learner policy, then applies PPO-style clipping to the current learner ratio.
\[c_t =\operatorname{sg} \left[ \frac{ \pi_{\text{prox}}(y_t \mid s_t) }{ \pi_{\text{behav}}(y_t \mid s_t) } \right]\] \[r_t(\theta) =\frac{ \pi_{\theta}(y_t \mid s_t) }{ \pi_{\text{prox}}(y_t \mid s_t) }\] -
The practical point is that on-policy systems are only as on-policy as their rollout freshness. A high-throughput asynchronous system must enforce freshness windows, queue limits, replay limits, or importance-weighting rules so that stale rollouts do not dominate training.
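The two ratios and a freshness window can be sketched as below. How the ratios are combined is an assumption here: the sketch multiplies the stop-gradient correction \(c_t\) into a standard PPO-style clipped surrogate on \(r_t(\theta)\), which is one common decoupled form and not necessarily the exact objective of any cited recipe. The names `async_token_objective`, `is_fresh`, `eps`, and `max_lag` are hypothetical.

```python
import math

def async_token_objective(lp_theta, lp_prox, lp_behav, advantage, eps=0.2):
    """Per-token objective (to be maximized) under stale behavior data.

    c corrects for behavior-vs-proximal mismatch and is treated as a constant;
    r is the current learner ratio, clipped PPO-style.
    """
    c = math.exp(lp_prox - lp_behav)            # sg[pi_prox / pi_behav]
    r = math.exp(lp_theta - lp_prox)            # pi_theta / pi_prox
    r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
    return c * min(r * advantage, r_clipped * advantage)

def is_fresh(rollout_step, learner_step, max_lag=4):
    """Accept a rollout only if it lags the learner by at most max_lag updates."""
    return 0 <= learner_step - rollout_step <= max_lag
```

When all three log-probabilities coincide, both ratios are one and the objective reduces to the raw advantage, which is the fully on-policy case.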
Teacher Scoring Infrastructure
-
Teacher scoring is often the throughput bottleneck in OPD and MOPD. The teacher must evaluate the student’s full rollout under exact prefixes, even when it only returns sampled-token log-probabilities. For long reasoning or agentic rollouts, this can require large context windows and careful batching.
-
Common serving backends include optimized inference systems such as vLLM: Easy, Fast, and Cheap LLM Serving by Kwon et al. (2023), SGLang-style serving for agentic and structured-generation workloads, and vendor-specific inference stacks such as TRT-LLM for high-throughput deployment. The choice of serving backend affects batching, prefix caching, log-probability extraction, tool-call handling, and fault tolerance.
-
Single-teacher OPD can route all rollouts to one scoring service. MOPD requires a teacher pool, where different teachers score different prompts or domains. This adds scheduling complexity because teachers may vary in size, context length, latency, tokenizer, and available hardware.
-
Teacher scorers should return the smallest payload compatible with the loss. For sampled-token OPD, each token may only need:
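\[\left( y_t,\, \log \pi_T(y_t \mid s_t) \right)\]- For top-\(k\) matching, each token position needs:
\[\left\{ \left( v_i,\, \log \pi_T(v_i \mid s_t) \right) \right\}_{v_i \in \operatorname{Top}_k(\pi_T(\cdot \mid s_t))}\]- For full KL, the teacher would need a full-vocabulary distribution, which is usually impractical at LLM scale.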
-
Teacher serving must also support fault isolation. If a teacher returns malformed scores, times out, uses the wrong template, or silently changes checkpoint version, the learner may receive corrupted gradients. A production system should include teacher-health checks, schema validation, prefix checksums, checkpoint IDs, and fallback routing.
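The checks listed above (schema validation, prefix checksums, checkpoint IDs) might look like the following on the learner side. This is a sketch under assumptions: the request and response field names are hypothetical and do not correspond to any particular serving backend.

```python
import hashlib
import math

def prefix_checksum(token_ids):
    """Stable checksum of the exact token sequence the scorer replayed."""
    data = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def validate_teacher_scores(request, response):
    """Return a list of problems; an empty list means the payload is usable."""
    problems = []
    if response.get("checkpoint_id") != request["expected_checkpoint_id"]:
        problems.append("checkpoint mismatch")
    if response.get("template_version") != request["template_version"]:
        problems.append("template mismatch")
    if response.get("prefix_checksum") != prefix_checksum(request["token_ids"]):
        problems.append("prefix checksum mismatch")
    logprobs = response.get("logprobs", [])
    if len(logprobs) != request["num_response_tokens"]:
        problems.append("length mismatch")
    # Log-probabilities must be finite floats that are at most zero.
    if any(not isinstance(lp, float) or not math.isfinite(lp) or lp > 0.0
           for lp in logprobs):
        problems.append("invalid log-probability")
    return problems
```

A learner can drop or reroute any batch whose problem list is non-empty instead of letting a silently changed checkpoint or template corrupt the gradient.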
Log-Probability Payload Design
-
Log-probability payload design should be determined by the divergence. Forward KL needs teacher-mode coverage, so teacher-top-\(k\) distributions are natural. Reverse-KL sampled-token training needs only the teacher probability of the student’s sampled token. JSD and local-support matching may require both teacher and student top-\(k\) sets.
-
A forward-KL-style top-\(k\) payload is:
\[\mathcal{P}_{T,k}(s_t) =\left\{ (v_i,\log \pi_T(v_i\mid s_t)) \right\}_{v_i\in \operatorname{Top}_k(\pi_T(\cdot \mid s_t))}\] -
A sampled-token reverse-KL-style payload is:
\[\mathcal{P}_{\text{sample}}(s_t) =\left( y_t,\log \pi_T(y_t\mid s_t),\log \pi_S(y_t\mid s_t) \right)\] -
A local-support matching payload stores a local vocabulary set:
\[\mathcal{V}_{\text{local}}(s_t) =\operatorname{Top}_k(\pi_T) \cup \operatorname{Top}_k(\pi_S)\]- and computes a renormalized divergence over \(\mathcal{V}_{\text{local}}(s_t)\).
-
Payloads should include masks and offsets. The learner must know which tokens correspond to the prompt, the response, tool calls, tool outputs, hidden system content, final answer, padding, truncation, and end-of-sequence regions. A log-probability without a correct mask can be worse than no log-probability.
-
Compression matters. Teacher payloads may be stored as binary token IDs plus float16, bfloat16, or quantized log-probability arrays. For very large corpora, payload size can dominate storage cost, and sampled-token scoring may be preferable even if it is algorithmically less informative than top-\(k\) matching.
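The local-support payload described above can be illustrated with a small renormalized divergence. This is a sketch, not a reference implementation: it uses forward KL as the example divergence, works on toy dictionaries that hold full distributions, and the helper names are assumptions. In a real pipeline each model would need to supply log-probabilities for the tokens in the other model's top-\(k\) set.

```python
import math

def top_k(logp, k):
    """Set of the k highest-probability tokens."""
    return set(sorted(logp, key=logp.get, reverse=True)[:k])

def renormalize(logp, support):
    """Restrict a distribution to `support` and renormalize in log space."""
    log_z = math.log(sum(math.exp(logp[v]) for v in support))
    return {v: logp[v] - log_z for v in support}

def local_support_kl(teacher_logp, student_logp, k):
    """Forward KL(teacher || student) on Top_k(teacher) union Top_k(student)."""
    support = top_k(teacher_logp, k) | top_k(student_logp, k)
    t = renormalize(teacher_logp, support)
    s = renormalize(student_logp, support)
    kl = sum(math.exp(t[v]) * (t[v] - s[v]) for v in support)
    return kl, support

teacher = {v: math.log(p) for v, p in {"a": 0.6, "b": 0.3, "c": 0.05, "d": 0.05}.items()}
student = {v: math.log(p) for v, p in {"a": 0.1, "b": 0.2, "c": 0.6, "d": 0.1}.items()}
kl, support = local_support_kl(teacher, student, k=2)
```

With \(k=2\) the local vocabulary here is `{"a", "b", "c"}`: the teacher contributes `a` and `b`, the student contributes `c` and `b`, and token `d` is dropped from both distributions before the divergence is computed.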
Full-Vocabulary Versus Top-\(k\) Approximation
-
Exact distribution matching over a vocabulary \(V\) requires:
\[D_{KL}(p_T \,\Vert\, p_S) =\sum_{v\in V} p_T(v\mid s_t) \log \frac{ p_T(v\mid s_t) }{ p_S(v\mid s_t) }\] -
For LLM vocabularies with tens or hundreds of thousands of tokens, full-vocabulary KL is expensive to compute, store, and transmit. Most systems therefore approximate the divergence.
-
Teacher-top-\(k\) approximation is appropriate when the goal is to cover the teacher’s high-probability modes. It works naturally with forward KL because forward KL penalizes the student for missing tokens the teacher considers likely.
-
Student-top-\(k\) or sampled-token approximation is appropriate when the goal is to correct what the student actually proposes. It works naturally with reverse-KL-style OPD because the student’s rollout determines which tokens receive supervision.
-
The support-overlap condition determines which approximation is safe. If teacher and student share local support, top-\(k\) or full-distribution matching can transfer rich structure. If support overlap is low, matching the full teacher distribution can amplify off-support noise, and sampled-token scoring may be safer.
-
A useful diagnostic is:
\[\operatorname{Overlap}_k(s_t) =\frac{ \left| \operatorname{Top}_k \pi_T(\cdot\mid s_t) \cap \operatorname{Top}_k \pi_S(\cdot\mid s_t) \right| }{k}\] -
This diagnostic should be computed on student-visited prefixes, not only on clean dataset prefixes. A teacher may be excellent on its own rollouts and still unreliable on prefixes generated by a different student.
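The overlap diagnostic is simple to compute once both models' top-\(k\) sets are available at a prefix. The sketch below uses toy dictionaries of log-probabilities; the function names are assumptions, and `positions` is meant to hold pairs collected at student-visited prefixes.

```python
import math

def top_k_overlap(teacher_logp, student_logp, k):
    """Overlap_k = |Top_k(pi_T) intersect Top_k(pi_S)| / k at one prefix."""
    top_t = set(sorted(teacher_logp, key=teacher_logp.get, reverse=True)[:k])
    top_s = set(sorted(student_logp, key=student_logp.get, reverse=True)[:k])
    return len(top_t & top_s) / k

def mean_overlap(positions, k):
    """Average Overlap_k over (teacher_logp, student_logp) pairs."""
    return sum(top_k_overlap(t, s, k) for t, s in positions) / len(positions)

teacher = {v: math.log(p) for v, p in {"a": 0.6, "b": 0.3, "c": 0.05, "d": 0.05}.items()}
student = {v: math.log(p) for v, p in {"a": 0.1, "b": 0.2, "c": 0.6, "d": 0.1}.items()}
```

For these toy distributions with \(k=2\), only `b` is shared between the two top-2 sets, so the overlap is 0.5. Tracking the mean over a batch of rollout positions gives a single number to monitor during training.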
Tokenizer Compatibility and Alignment
-
Tokenizer compatibility is one of the most important practical constraints in OPD and MOPD. If teacher and student share a tokenizer and chat template, the same token ID corresponds to the same text span and prefix. This makes sampled-token log-probability scoring straightforward.
-
If tokenizers differ, the system must detokenize student rollouts, retokenize them under the teacher tokenizer, and align teacher token probabilities back to student token spans. This is hard when one student token maps to multiple teacher tokens or when tool-call formats introduce special tokens that do not match across models.
-
Cross-tokenizer alignment becomes especially fragile for sampled-token advantages, because the quantity:
\[\log \pi_T(y_t \mid s_t) -\log \pi_S(y_t \mid s_t)\]- assumes that \(y_t\) is the same event under both models. If teacher and student tokenize the event differently, the log-probability gap may no longer correspond to a well-defined token-level preference.
-
Practical systems therefore prefer teachers that share tokenizer lineage with the student, especially for MOPD. If heterogeneous teachers are necessary, sequence-level trace distillation or span-level scoring may be more reliable than token-level sampled OPD.
-
Chat-template compatibility matters as much as tokenizer compatibility. A teacher score can change if system prompts, role markers, hidden tool messages, answer-format instructions, or special stop tokens differ between teacher and student. Every scored rollout should store the exact template version used at generation and scoring time.
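One way to approximate span-level scoring across tokenizers is to group tokens into the smallest character spans on which both tokenizations share a boundary, then sum teacher log-probabilities within each span. The sketch below is an illustration under strong assumptions: tokens are given as plain non-empty strings that concatenate to the same text, special tokens and byte-level details are ignored, and the function names are hypothetical.

```python
def boundaries(tokens):
    """Cumulative character offset at the end of each token."""
    ends, pos = [], 0
    for tok in tokens:
        pos += len(tok)
        ends.append(pos)
    return ends

def align_spans(student_tokens, teacher_tokens, teacher_logp):
    """Return (student_token_indices, summed_teacher_logprob) per shared span."""
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("tokenizations do not decode to the same text")
    s_end, t_end = boundaries(student_tokens), boundaries(teacher_tokens)
    spans, i, j = [], 0, 0
    s_start, t_start = 0, 0
    while i < len(s_end) and j < len(t_end):
        if s_end[i] == t_end[j]:
            # Both tokenizers end a token here, so close the span.
            span_logp = sum(teacher_logp[t_start:j + 1])
            spans.append((list(range(s_start, i + 1)), span_logp))
            i, j = i + 1, j + 1
            s_start, t_start = i, j
        elif s_end[i] < t_end[j]:
            i += 1
        else:
            j += 1
    return spans

spans = align_spans(
    ["un", "believ", "able"],      # student tokenization
    ["unbe", "liev", "able"],      # teacher tokenization
    [-1.0, -0.5, -0.25],           # teacher log-prob per teacher token
)
```

Here the first two student tokens share one span with a summed teacher log-probability of -1.5, and only the last token aligns one-to-one. The span score is the teacher's log-probability of its own tokenization of that text, so a per-token gap for the first two student tokens is not well defined, which is the fragility described above.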
Stabilization Mechanisms
-
Stabilization begins with masking. The loss should usually exclude prompt tokens, padding tokens, tool outputs that were not generated by the model, hidden system metadata, post-EOS tokens, invalid spans, and tokens after truncation. In agentic settings, special care is needed around tool-call boundaries and environment observations.
-
A valid-token masked loss has the form:
\[\mathcal{L} =-\mathbb{E} \left[ \frac{1}{\sum_t m_t} \sum_t m_t A_t \log \pi_\theta(y_t\mid s_t) \right]\]- where \(m_t \in \{0,1\}\) determines whether token \(t\) contributes to the update.
-
Clipping prevents isolated token gaps from dominating. Pointwise KL clipping, advantage clipping, PPO-style ratio clipping, entropy floors, and reference-policy constraints can all help keep training stable.
-
Length and repetition monitoring are essential in OPD. A model can enter a repetitive prefix where locally plausible continuation tokens receive teacher support even though the global trajectory is degenerate. Systems should track mean rollout length, truncation rate, repetition rate, EOS rate, and invalid-tool-call rate during training.
-
Support-overlap monitoring is essential in MOPD. If student rollouts fall outside teacher support, teacher feedback can become noise. Warmup SFT, intermediate teacher checkpoints, sampled-token scoring, or domain-specific routing can reduce this risk.
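One simple overlap diagnostic, offered here as an assumption rather than a standard metric, is the fraction of student-sampled tokens that fall inside the teacher's top-\(k\) set at the same position. The function name and the toy token ids are hypothetical.

```python
def support_overlap(sampled_tokens, teacher_topk):
    """Fraction of student tokens found in the teacher's top-k set.

    sampled_tokens[t] : token id the student sampled at position t
    teacher_topk[t]   : set of the teacher's top-k token ids at position t
    """
    if len(sampled_tokens) != len(teacher_topk):
        raise ValueError("inputs must have the same length")
    if not sampled_tokens:
        raise ValueError("need at least one scored position")
    hits = sum(1 for tok, topk in zip(sampled_tokens, teacher_topk)
               if tok in topk)
    return hits / len(sampled_tokens)


# Hypothetical 4-token rollout; position 2 falls outside teacher support.
sampled = [11, 42, 7, 99]
topk_sets = [{11, 12}, {40, 42}, {1, 2}, {99}]
overlap = support_overlap(sampled, topk_sets)
```

A persistently low value on a given domain is a signal to try the mitigations named above, such as warmup SFT or an intermediate teacher checkpoint, before trusting dense token-level updates.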
-
For OPSD, stabilization must also preserve reasoning behavior. Privileged teachers may shorten traces, reduce uncertainty markers, suppress backtracking, or shift probability mass onto style tokens. Relevance masks, contrastive hints, fixed teachers, reward anchoring, and epistemic-marker diagnostics help prevent the student from inheriting the wrong aspect of the privileged teacher.
Multi-Teacher Routing Infrastructure
-
Multi-teacher systems require routing logic that determines which teacher or teachers score each prompt, rollout, token span, or training stage. The simplest router uses dataset labels: math prompts go to a math teacher, code prompts go to a code teacher, safety prompts go to a safety teacher, and general chat prompts go to a chat teacher.
-
More advanced routers use prompt classifiers, embedding retrieval, teacher confidence, verifier outcomes, benchmark category, tool-use state, or learned routing scores. A mixed agentic trajectory may require segment-level routing because one trajectory can include reasoning, code editing, terminal execution, search, safety judgment, and final answer formatting.
-
A hard router selects one teacher:
\[k^\star =\arg\max_k r_k(x,y_{<t})\]- while a soft router assigns weights:
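One common form, assumed here rather than taken from the source, normalizes the same routing scores \(r_k\) with a softmax at temperature \(\tau > 0\) (a hypothetical hyperparameter):

\[w_k(x,y_{<t}) =\frac{\exp\left(r_k(x,y_{<t})/\tau\right)}{\sum_{j}\exp\left(r_j(x,y_{<t})/\tau\right)}, \qquad \sum_k w_k(x,y_{<t}) = 1\]

Under this assumption, letting \(\tau \rightarrow 0\) concentrates all weight on \(k^\star\) and recovers the hard router whenever the maximum is unique.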
-
Routing is part of the algorithm, not merely infrastructure. A routing error changes the supervision signal and can cause regressions. A code teacher scoring a safety refusal, a chat teacher scoring a proof trace, or an out-of-domain teacher scoring a long tool trajectory can produce harmful token-level gradients.
-
Teacher scheduling is also part of routing. Expensive teachers may be queried only for high-value prompts, uncertain rollouts, or domains where regressions are observed. Teacher requests should be batched by teacher, sequence length, and context length to maximize accelerator utilization.
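The batching idea can be sketched as a bucketing step that groups scoring requests by teacher and by padded length. The request fields, the bucket width of 512 tokens, and the decision to fold sequence and context length into a single `num_tokens` value are all illustrative assumptions.

```python
from collections import defaultdict


def bucket_requests(requests, length_bucket=512):
    """Group rollout ids by (teacher, padded-length bucket)."""
    buckets = defaultdict(list)
    for req in requests:
        # Round the length up to the next multiple of length_bucket.
        padded = -(-req["num_tokens"] // length_bucket) * length_bucket
        buckets[(req["teacher"], padded)].append(req["rollout_id"])
    return dict(buckets)


# Hypothetical scoring queue.
requests = [
    {"rollout_id": "r1", "teacher": "math", "num_tokens": 300},
    {"rollout_id": "r2", "teacher": "math", "num_tokens": 700},
    {"rollout_id": "r3", "teacher": "code", "num_tokens": 512},
    {"rollout_id": "r4", "teacher": "math", "num_tokens": 400},
]
buckets = bucket_requests(requests)
```

Each bucket can then be sent to one teacher server as a batch with little padding waste.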
-
Multi-teacher systems should track teacher checkpoint IDs, teacher domain labels, router version, teacher latency, teacher failure rate, score coverage, and per-domain downstream performance. Without teacher provenance, it is difficult to know whether a regression came from the student, a teacher, or the router.
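A provenance record covering the fields listed above might look like the following sketch. The class name, field names, and example identifiers are hypothetical; a real system would attach this record to every scored rollout in its own storage format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TeacherScoreRecord:
    rollout_id: str
    student_checkpoint: str
    teacher_checkpoint: str
    teacher_domain: str
    router_version: str
    template_version: str
    latency_ms: float
    scored_tokens: int
    total_tokens: int

    @property
    def score_coverage(self):
        """Fraction of rollout tokens that received a teacher score."""
        if self.total_tokens == 0:
            return 0.0
        return self.scored_tokens / self.total_tokens


# Hypothetical record for one scored rollout.
record = TeacherScoreRecord(
    rollout_id="r1",
    student_checkpoint="student-step-1200",
    teacher_checkpoint="math-teacher-v2",
    teacher_domain="math",
    router_version="router-v3",
    template_version="chat-template-v5",
    latency_ms=84.0,
    scored_tokens=300,
    total_tokens=400,
)
row = asdict(record)  # plain dict, ready for logging
```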
Self-Distillation Systems
-
Self-distillation systems reduce external teacher serving because the teacher and student are derived from the same model family. The teacher may be an earlier checkpoint, an EMA model, the same model under a privileged prompt, or the same model conditioned on feedback.
-
A typical OPSD system proceeds as:
\[\text{Student Prompt} \rightarrow \text{Student Rollout} \rightarrow \text{Privileged Teacher Prompt} \rightarrow \text{Teacher Rescoring} \rightarrow \text{Masked / Gated Update}\] -
The privileged teacher prompt may include a verified answer, gold solution, runtime error, tool result, retrieved skill, user correction, or future interaction. The student prompt must not include that privileged information if the model will not have it at inference time.
-
The central systems risk is leakage. If the training pipeline accidentally exposes the answer, hidden test result, gold patch, or future user message to the student-side context, the model may learn an impossible inference-time behavior.
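A leakage guard can be as simple as building the two views in one place and checking the student view before it is used. The sketch below uses an exact-substring check, which is a crude assumption: it catches copy-paste leaks but not paraphrased ones. The function names and the `[privileged]` marker are hypothetical.

```python
def build_views(task_prompt, privileged):
    """Return (student_prompt, teacher_prompt) for one training sample."""
    student_prompt = task_prompt
    teacher_prompt = task_prompt + "\n\n[privileged]\n" + privileged
    return student_prompt, teacher_prompt


def assert_no_leak(student_prompt, privileged_snippets):
    """Raise if any privileged snippet appears verbatim in the student view."""
    leaked = [s for s in privileged_snippets if s and s in student_prompt]
    if leaked:
        raise ValueError(f"privileged content leaked into student view: {leaked}")


# Hypothetical math sample with a verified answer as privileged context.
task = "Solve x + 1 = 43."
privileged = "Verified answer: x = 42"
student_view, teacher_view = build_views(task, privileged)
assert_no_leak(student_view, [privileged])  # passes: student view is clean
```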
-
The second systems risk is style drift. Privileged context can make the teacher shorter, more direct, more confident, or less exploratory. If all teacher-student gaps are treated as task-relevant, the student may learn to suppress uncertainty rather than solve the task better.
-
Relevance-masked self-distillation addresses this by applying the loss only to selected token positions:
\[\mathcal{L}_{RMSD} =\mathbb{E} \left[ \sum_t m_t^{\text{rel}} D\left( \pi_T(\cdot\mid x',y_{<t}) \,\Vert\, \pi_S(\cdot\mid x,y_{<t}) \right) \right]\]- where \(m_t^{\text{rel}}\) selects tokens tied to the target behavior rather than updating every stylistic disagreement.
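The relevance-masked objective can be sketched for one sequence with explicit per-position distributions. Taking \(D\) to be the forward KL from teacher to student is an assumption made for this example; the formula above leaves the divergence open. The toy two-token vocabulary is also hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """Forward KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def rmsd_loss(teacher_dists, student_dists, rel_mask):
    """Sum of per-position divergences over relevance-selected positions."""
    return sum(m * kl(p, q)
               for m, p, q in zip(rel_mask, teacher_dists, student_dists))


# Hypothetical 2-position sequence over a 2-token vocabulary.
teacher = [[0.5, 0.5], [0.9, 0.1]]   # privileged view pi_T(. | x', y_<t)
student = [[0.5, 0.5], [0.5, 0.5]]   # ordinary view   pi_S(. | x,  y_<t)

loss_all = rmsd_loss(teacher, student, rel_mask=[1, 1])
loss_masked = rmsd_loss(teacher, student, rel_mask=[1, 0])
```

With the second position masked out, the disagreement there no longer produces any update, which is the intended effect when that token is judged stylistic rather than task-bearing.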
-
Contrastive self-distillation systems add a control teacher view, such as a wrong hint, to subtract generic hint-induced style changes. This increases teacher-side forward passes but can reduce harmful style-token domination.
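One way to realize the subtraction, sketched here as an assumption rather than as the formula of any specific method, is to score the same student tokens under a correct-hint teacher view and a wrong-hint control view and keep only the difference in log-probability.

```python
def contrastive_token_signal(logp_correct_hint, logp_wrong_hint):
    """Per-token difference between correct-hint and wrong-hint teacher scores.

    Shifts that any hint induces equally (for example brevity or
    confidence) cancel; only hint-content-dependent gaps remain.
    """
    if len(logp_correct_hint) != len(logp_wrong_hint):
        raise ValueError("inputs must have the same length")
    return [c - w for c, w in zip(logp_correct_hint, logp_wrong_hint)]


# Hypothetical 2-token rollout: token 0 is a style token that both
# hinted views score identically; token 1 is the answer token.
logp_correct = [-0.5, -0.25]
logp_wrong = [-0.5, -4.25]
signal = contrastive_token_signal(logp_correct, logp_wrong)
```

The cost is visible in the inputs: each rollout now needs two teacher-side forward passes instead of one.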
RL-Distillation Hybrid Systems
-
RL-distillation hybrid systems combine rollout generation, reward computation, teacher or self-teacher scoring, token-level gating, and learner updates. They are more complex than ordinary RLVR because a single sample may carry both response-level reward and token-level distillation metadata.
-
A general hybrid training record includes:
\[\left( x,\, y,\, R(x,y),\, A^{RL},\, A_t^{D},\, g_t,\, m_t,\, \text{branch} \right)\]- where \(R(x,y)\) is a reward or verifier score, \(A^{RL}\) is a response-level advantage, \(A_t^{D}\) is a teacher or self-teacher token signal, \(g_t\) is a token-level gate, \(m_t\) is a valid-token mask, and \(\text{branch}\) indicates whether the sample is handled by RL, distillation, or a hybrid objective.
-
The most conservative design keeps RL as the backbone and applies distillation as a gated auxiliary term:
\[\mathcal{L}(\theta) =\mathcal{L}_{RL}(\theta) + \lambda_D \mathcal{L}_{D}(\theta)\] -
Sample-routing systems send successful rollouts to an RL branch and failed rollouts to a distillation or feedback-conditioned correction branch. This makes sense because a correct rollout should often be reinforced directly, while an incorrect rollout may need teacher-guided diagnosis.
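The auxiliary-term combination and the success-based sample routing can be sketched together. The threshold `success_threshold=1.0`, the default `lambda_d=0.1`, and the branch labels are illustrative assumptions.

```python
def hybrid_loss(loss_rl, loss_distill, lambda_d=0.1):
    """L = L_RL + lambda_D * L_D, with RL as the backbone term."""
    return loss_rl + lambda_d * loss_distill


def route_branch(reward, success_threshold=1.0):
    """Send successful rollouts to RL and failed ones to distillation."""
    return "rl" if reward >= success_threshold else "distill"


# Hypothetical verifier rewards for three rollouts (1.0 means passed).
rewards = [1.0, 0.0, 1.0]
branches = [route_branch(r) for r in rewards]
total = hybrid_loss(loss_rl=2.0, loss_distill=4.0, lambda_d=0.25)
```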
-
Agentic systems can use environment outputs as feedback. Tool outputs, GUI transitions, terminal errors, user replies, and hidden-test outcomes can all become reward signals or privileged teacher context. OpenClaw-style systems organize this around asynchronous environment serving, reward computation, policy training, and policy serving.
-
Scientific and laboratory RL systems add another feedback type: experimental observations. In those settings, the environment is not just a benchmark; it is the physical or simulated world. Distillation can potentially turn delayed experimental feedback into token-level guidance for hypothesis generation, experimental design, and analysis.
Evaluation and Regression Monitoring
-
Distillation should be evaluated on both target capabilities and preserved capabilities. A run that improves math but hurts instruction following, improves code but hurts safety, or improves agentic task success but increases malformed tool calls is not a clean win.
-
A robust evaluation suite should include task metrics, behavioral metrics, distributional metrics, and systems metrics. Task metrics include accuracy, pass@\(k\), verifier pass rate, judge score, unit-test pass rate, or task completion rate. Behavioral metrics include refusal calibration, helpfulness, verbosity, response length, uncertainty expression, and tool-call hygiene. Distributional metrics include KL, entropy, teacher-student support overlap, repetition rate, truncation rate, and teacher disagreement. Systems metrics include rollout throughput, teacher latency, queue age, GPU utilization, failure rate, and score coverage.
-
Regression monitoring should be domain-specific. MOPD systems need per-teacher domain dashboards because an aggregate score can hide regressions. OPD systems need rollout-quality dashboards because degradation may first appear as length inflation, repeated tokens, or higher truncation. OPSD systems need epistemic-behavior dashboards because degradation may appear as fewer uncertainty markers or self-correction branches.
-
Evaluation should also include provenance analysis. If a regression appears, the system should be able to answer which prompts caused it, which rollout policy generated them, which teacher scored them, which router assigned them, which mask was applied, which reward model scored them, and which checkpoint update consumed them.
-
For on-policy systems, offline evaluation alone is insufficient. The deployed model’s behavior depends on its own rollouts, so training should periodically sample fresh trajectories from the current model and evaluate them under the same generation settings used in deployment.
Practical Design Defaults
-
Start with a dataset-centric off-policy pipeline when building a new distillation system. Generate teacher traces, verify and filter them, deduplicate the corpus, balance domain mixtures, and establish a strong SFT baseline before adding online scoring.
-
Add token-level teacher probabilities only when the added information justifies the payload cost. Sequence-level traces are often sufficient for early training. Top-\(k\) or sampled-token log-probabilities become more valuable when the student needs finer local guidance.
-
Move to OPD only after the student can generate meaningful rollouts. A weak student produces low-quality states where teacher scoring may be expensive and uninformative. A stronger cold-started student produces near-miss trajectories that are ideal for dense correction.
-
Prefer sampled-token OPD or top-\(k\) local matching unless full-vocabulary matching is clearly affordable and support overlap is high. The theoretical elegance of full KL is often outweighed by the practical cost and the risk of off-support teacher noise.
-
Use compatible teacher-student lineages when possible. Shared tokenizer, chat template, and base model lineage reduce alignment errors and improve the chance that teacher probabilities are meaningful on student rollouts.
-
In MOPD, treat routing and support overlap as first-class algorithmic components. A strong teacher is not automatically a useful teacher on every student prefix. The system should measure overlap, route by domain, warm up the student when necessary, and evaluate per-domain regressions.
-
In self-distillation, default to fixed or cautiously refreshed teachers, and apply masks, gates, or contrastive controls. Privileged context should improve task-bearing tokens without teaching the student to depend on hidden information or suppress useful reasoning behavior.
-
In RL-distillation hybrids, let reward or verifier signals anchor update direction and use distillation for token-level modulation unless the teacher is known to be highly reliable. This reduces the risk that dense teacher feedback overwhelms the task objective.
-
Maintain end-to-end provenance from the beginning. Every rollout, teacher score, reward, mask, route, and loss branch should be traceable to a checkpoint and configuration. This is the difference between a debuggable training system and a system where regressions can only be guessed at.
-
Treat evaluation as part of the training loop, not an afterthought. Distillation is often used to preserve or consolidate capabilities, so every major run should be judged not only by its target-domain gains but also by its regression profile across chat, safety, instruction following, long context, code, reasoning, and agentic behavior.
Decision Guide for Choosing a Distillation Method
-
Selecting a distillation strategy is primarily a question of balancing teacher availability, desired robustness, engineering complexity, trajectory source, supervision density, and the nature of the available feedback. In practice, most successful post-training pipelines begin with simple off-policy methods and progressively introduce on-policy, self-distillation, multi-teacher, or RL-distillation techniques as the need for robustness, credit assignment, and capability consolidation increases.
-
The central decision rule is that distillation should be matched to the bottleneck. If simplicity and low cost are paramount, begin with off-policy sequence or trace distillation. If robustness to self-generated errors is critical, adopt on-policy distillation after the student has a strong enough cold start. If no external teacher is available, use self-distillation, but prefer gated, contrastive, or relevance-masked variants when reasoning quality matters. If capabilities are distributed across specialists, use multi-teacher distillation or MOPD. If sparse rewards need dense corrective guidance, combine reinforcement learning with distillation. If training multi-turn agents with noisy privileged context, keep RL as the backbone and add gated self-distillation rather than applying uniform OPSD.
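The decision rule above can be written down as a lookup from bottleneck to recommended method. The dictionary keys are hypothetical labels chosen for this sketch; the values restate the recommendations in the paragraph.

```python
RECOMMENDATION = {
    "simplicity_and_cost": "off-policy sequence or trace distillation",
    "self_generated_errors": "on-policy distillation after a strong cold start",
    "no_external_teacher": "self-distillation, preferably gated, contrastive, "
                           "or relevance-masked when reasoning quality matters",
    "distributed_specialists": "multi-teacher distillation or MOPD",
    "sparse_rewards": "reinforcement learning combined with distillation",
    "noisy_privileged_multi_turn": "RL backbone with gated self-distillation",
}


def recommend(bottleneck):
    """Return the suggested method for a named bottleneck."""
    try:
        return RECOMMENDATION[bottleneck]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATION))
        raise ValueError(f"unknown bottleneck {bottleneck!r}; expected one of: {known}")


choice = recommend("sparse_rewards")
```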
-
The most practical recipes are staged rather than monolithic. Off-policy SFT or trace distillation establishes a competent student; RLVR or RLHF improves behavior under task rewards or preferences; OPD teaches the student from its own near-miss trajectories; self-distillation converts hints, feedback, or privileged context into dense supervision; MOPD consolidates specialist teachers; and RL-distillation hybrids combine task-grounded reward direction with token-level correction.
-
Systems constraints should be treated as part of the method choice. A method that requires full-vocabulary logits from many teachers may be mathematically attractive but impractical at scale. A sampled-token objective may be less information-rich but more robust when teacher-student support overlap is limited. A self-distillation objective may remove external teacher serving but introduce leakage and style-drift risks. The best method is therefore the one whose supervision signal is both useful and operationally reliable.
Choose Off-Policy Distillation When
-
Choose off-policy distillation when you have access to a large corpus of high-quality human examples, teacher-generated outputs, verifier-filtered synthetic traces, historical logs, or curated domain data, and you want the simplest and most stable training setup. This is the right default when training should resemble ordinary SFT, when reproducibility matters, and when teacher inference can be performed once and reused across many experiments.
-
Off-policy distillation is especially appropriate when teacher outputs can be generated offline and amortized efficiently. A frontier teacher, specialist model, reward model, or verifier can be expensive to run, but once its completions, critiques, scores, or top-\(k\) probabilities are stored, the same dataset can train several students, support ablations, or serve as a stable baseline before more complex methods are introduced.
-
Off-policy distillation is also appropriate when the student is expected to remain close to the training distribution and strong recovery from self-generated mistakes is not the primary concern. For short-form instruction following, summarization, translation, classification, data-format adaptation, or well-filtered synthetic reasoning data, fixed trajectories may provide enough coverage to justify the simplicity.
-
Off-policy distillation is preferred when the main bottleneck is data quality rather than train-inference mismatch. If a team can invest in teacher generation, deduplication, contamination filtering, unit-test verification, answer checking, safety filtering, and mixture balancing, then trace distillation can provide strong returns without the added complexity of live rollout scoring.
-
Off-policy distillation is also usually the best cold-start stage before RL, OPD, OPSD, or MOPD. A weak student often produces poor on-policy rollouts, making teacher scoring expensive and uninformative. A strong off-policy baseline produces near-miss trajectories that later on-policy methods can correct more effectively.
Choose On-Policy Distillation When
-
Choose on-policy distillation when long reasoning chains, coding tasks, tool-use workflows, or agentic trajectories are sensitive to compounding errors. In these settings, an early mistake changes the future prefix distribution, and the student needs supervision on the states it actually visits rather than only on teacher-generated ideal traces.
-
OPD is appropriate when the student must learn how to recover from the exact mistakes it is likely to make during deployment. A teacher scoring student rollouts can provide local feedback on malformed tool calls, incorrect reasoning branches, wrong file edits, invalid API calls, or partially correct solutions, whereas off-policy traces may only show the ideal path.
-
OPD is preferred when dense token-level supervision is more useful than sparse scalar rewards. RLVR may tell the model that a solution failed, but OPD can identify which sampled tokens the teacher would have preferred under the same prefix. This is especially valuable when a reward is delayed, binary, or difficult to assign to individual decisions.
-
OPD is also useful when the goal is to transfer capabilities acquired through reinforcement learning into a smaller, cheaper, or more efficient model. A strong post-RL teacher can score rollouts generated by the student, allowing the student to absorb the teacher’s local preferences without requiring the teacher to generate all training trajectories.
-
OPD should usually be introduced gradually. A practical recipe begins with off-policy SFT or trace distillation, verifies that the student can produce meaningful rollouts, then mixes in on-policy examples through a GKD-style schedule. This reduces the risk that the teacher is asked to score degenerate or off-support prefixes too early.
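A gradual introduction can be sketched as a schedule for the fraction of training examples drawn from student rollouts. The linear ramp, the step counts, and the cap of 0.8 are assumptions chosen for illustration; only the idea of mixing off-policy and on-policy data by a fraction comes from the GKD-style recipe described above.

```python
import random


def on_policy_fraction(step, warmup_steps=1000, ramp_steps=4000,
                       max_fraction=0.8):
    """Fraction of examples sourced from student rollouts at a given step."""
    if step < warmup_steps:
        return 0.0  # pure off-policy warmup
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_fraction * progress


def pick_source(step, rng):
    """Sample the data source for one training example."""
    if rng.random() < on_policy_fraction(step):
        return "student_rollout"
    return "teacher_trace"


rng = random.Random(0)
early = pick_source(0, rng)          # always a teacher trace during warmup
midway = on_policy_fraction(3000)    # halfway through the ramp
late = on_policy_fraction(10_000)    # capped at max_fraction
```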
-
OPD is less attractive when the student’s rollouts are mostly incoherent, when teacher scoring is too expensive, when teacher-student tokenization is incompatible, when rollout staleness is hard to control, or when support overlap between the student and teacher is poor. In such cases, warmup SFT, trace distillation, or sampled-token scoring may be safer than full-distribution OPD.
Choose Self-Distillation When
-
Choose self-distillation when external teacher models are unavailable, too expensive, too slow, operationally inconvenient, or insufficiently specialized for the task. The teacher signal is then derived from the model itself, such as an earlier checkpoint, an EMA copy, an ensemble of views, a privileged prompt, a retrieved skill, a runtime-error-conditioned view, or future user interaction.
-
Self-distillation is appropriate when the model already contains latent capability that can be unlocked using privileged information, hindsight context, textual feedback, tool outputs, or targeted hints. The model may be better at evaluating or repairing a response when shown extra information than it is at generating the response from scratch.
-
Self-distillation is useful when interaction traces, runtime errors, user corrections, verifier comments, environment observations, or tool outputs can be converted into rich internal supervision. A coding model can learn from compiler errors or failing tests; a conversational model can learn from user follow-up corrections; an agent can learn from GUI state changes or terminal outputs.
-
Self-distillation is preferred when continual or domain-specific adaptation is needed without maintaining a separate teacher infrastructure. This is especially relevant for enterprise-specific formats, private APIs, internal tools, local policies, or shifting user preferences where frontier teachers may not know the desired behavior.
-
Self-distillation should be selective when the target behavior is narrow. Relevance-masked self-distillation is preferred when only a small subset of tokens should change, because it prevents the model from learning incidental style differences between the student and privileged teacher views.
-
Self-distillation requires caution in reasoning models. Privileged teachers can become shorter, more confident, and less exploratory, which may suppress epistemic verbalization, forking, verification, backtracking, and hedging. For strong thinking models, use fixed teachers, relevance masks, contrastive controls, reward anchoring, and diagnostics for response length and uncertainty markers.
Choose On-Policy Self-Distillation When
-
Choose on-policy self-distillation when verified solutions, reference answers, runtime feedback, tool observations, retrieved skills, or privileged reasoning traces are available and can be shown to the teacher view but not to the deployed student view.
-
OPSD is appropriate when the model is better at evaluating correct solutions than generating them from scratch. In this setting, the student samples a rollout from the ordinary task prompt, while the self-teacher scores that rollout under privileged context. The student then learns from the difference between the ordinary and privileged views.
-
OPSD is useful when the benefits of on-policy learning and dense supervision are desired without relying on an external frontier teacher. Since the same model family supplies both student and teacher views, the infrastructure burden can be lower than external-teacher OPD, although teacher-context forward passes still add compute.
-
OPSD is most appropriate in domains where correctness signals are reliable, such as mathematical reasoning, coding, theorem-style tasks, tool-use environments, and interactive agents with clear feedback. However, it should be applied carefully when the student must preserve long-budget reasoning behaviors.
-
OPSD should use gates, masks, or contrastive controls when privileged context changes the teacher’s style. If a teacher that sees the answer assigns low probability to uncertainty markers such as “wait,” “maybe,” or “let me check,” the student may learn to suppress the very deliberative behaviors that support autonomous reasoning. RLCSD-style contrastive hinting is useful when the goal is to subtract generic hint-induced style shifts and retain task-bearing token signals.
-
OPSD is less appropriate when privileged information would leak the answer too directly, when the deployed student cannot access comparable information, when response shortening harms generalization, or when the teacher view rewards confidence rather than correctness.
Choose Multi-Teacher Distillation When
-
Choose multi-teacher distillation when different models or checkpoints specialize in complementary capabilities such as reasoning, coding, instruction following, alignment, safety, long context, tool use, software engineering, or agentic planning. A single teacher may not be best across all domains, and forcing one teacher to define the entire student behavior can erase useful specialization.
-
Multi-teacher distillation is appropriate when sequential post-training has introduced regressions that must be repaired. A later checkpoint may improve code but hurt chat, improve tool use but hurt safety, or improve long reasoning but hurt instruction adherence. Earlier or domain-specialist checkpoints can be retained as teachers to recover those capabilities.
-
Choose multi-teacher distillation when the goal is to consolidate several specialized models into a single deployable student. This avoids deploying many separate checkpoints and lets one model internalize the strengths of multiple teachers.
-
MOPD is preferred over trace-distillation SFT when the student is strong enough to produce meaningful rollouts and teacher feedback on student-visited states is more valuable than teacher-generated traces. This is especially relevant for long-horizon reasoning, coding, and agentic tasks where the student’s own mistakes determine the future state distribution.
-
Trace-distillation SFT is preferred over MOPD when the student is too weak for useful rollouts, when teacher traces are easy to verify, or when the organization does not yet have the infrastructure for teacher scoring, routing, token alignment, and asynchronous rollout management.
-
Multi-teacher distillation requires infrastructure that can route and serve multiple teachers efficiently. Teacher servers must be scheduled, batched, monitored, and fault-isolated. Teacher log-probabilities must be aligned and aggregated. Tokenizer compatibility is highly desirable because sampled-token MOPD depends on teachers assigning probabilities to the same token events as the student.
-
Support overlap should guide whether to use full-distribution matching, top-\(k\) matching, or sampled-token scoring. If teachers are close forks of the same base model, full or top-\(k\) matching may be safe and information-rich. If teachers were trained with heterogeneous data, external-model traces, or different styles, sampled-token scoring and warmup SFT may be safer.
Choose RL-Distillation Hybrids When
-
Choose RL-distillation hybrids when sparse correctness rewards are available but insufficient on their own to provide fine-grained guidance. RL can tell the model that a rollout succeeded or failed, while distillation can help assign credit to the specific tokens, reasoning moves, tool calls, or action spans that produced the outcome.
-
RL-distillation hybrids are appropriate when reinforcement learning should determine update direction while distillation refines token-level update magnitudes. This separation is useful when verifiers are trusted but sparse, while teacher or self-teacher signals are informative but should not become the sole objective.
-
Choose RL-distillation hybrids when hindsight information, textual critiques, runtime errors, verifier messages, user replies, tool outputs, or environment observations can be converted into dense supervision. A feedback-conditioned self-teacher can replay the student trajectory with access to richer context and provide local correction.
-
RL-distillation hybrids are also appropriate when the objective is to exceed teacher performance rather than merely imitate it. Reward-extrapolated OPD-style methods use the teacher as a reference while allowing rewards to amplify trajectories that outperform the teacher’s expected behavior.
-
Use SDAR-style gated auxiliary self-distillation when training multi-turn agents, especially when the environment supplies sparse trajectory-level rewards but privileged skills, retrieved context, user feedback, or hindsight observations can provide dense token-level guidance. In this setting, RL should remain the backbone, while self-distillation acts as a bounded auxiliary signal.
-
Use contrastive self-distillation when privileged context induces style drift. Correct hints and wrong hints can be contrasted so that generic hint-induced confidence, brevity, or formatting changes are subtracted, leaving a cleaner task-bearing token signal.
-
RL-distillation hybrids are less appropriate when rewards are unreliable, when verifiers are easy to exploit, when teacher-student support overlap is poor, when token-level gates are unavailable, or when the dense distillation term overwhelms the task-grounded RL objective.
Recommended Practical Progression
-
For most real-world projects, begin with off-policy sequence-level or trace distillation to establish a stable baseline. This stage creates a competent student, standardizes formatting, transfers broad task behavior, and produces a model that can later generate meaningful on-policy rollouts.
\[\text{Off-policy SFT or Trace Distillation} \rightarrow \text{Stable Baseline}\] -
Next, introduce token-level teacher probabilities when the teacher’s uncertainty or local preferences are important enough to justify the additional infrastructure. Depending on the divergence and payload budget, this may use full logits, top-\(k\) logits, or sampled-token log-probabilities.
\[\text{Trace Targets} \rightarrow \text{Soft Labels or Log-Probabilities}\] -
Transition gradually to on-policy distillation once the student is sufficiently capable. Early on-policy rollouts should be monitored for length, repetition, truncation, invalid tool calls, and teacher-student support overlap. GKD-style mixing can combine off-policy stability with on-policy robustness.
\[\text{Cold-Started Student} \rightarrow \text{Student Rollouts} \rightarrow \text{Teacher Scoring} \rightarrow \text{OPD}\] -
Add self-distillation when privileged context or hindsight feedback becomes available. Use it especially for runtime errors, tool observations, verified answers, retrieved skills, future user corrections, or environment feedback. Apply relevance masks, contrastive controls, or gates when the privileged context may introduce style drift.
\[\text{Student Rollout} \rightarrow \text{Privileged Teacher View} \rightarrow \text{Masked or Gated Self-Distillation}\] -
Apply multi-teacher distillation when capabilities are distributed across specialists or when sequential training introduces regressions. Use teacher routing, support-overlap diagnostics, warmup SFT, and multi-domain regression suites before relying on dense multi-teacher token updates.
\[\text{Specialist Teachers} \rightarrow \text{Routing and Warmup} \rightarrow \text{MOPD} \rightarrow \text{Regression Recovery}\] -
Integrate RL-distillation hybrids when sparse rewards need dense token-level refinement. Let RLVR, GRPO, PPO, reward models, or verifiers anchor update direction, while teacher or self-teacher gaps provide token-level modulation.
\[\text{Reward-Grounded RL} \rightarrow \text{Dense Distillation Signal} \rightarrow \text{Gated Hybrid Update}\] -
A practical frontier-style progression is therefore:
\[\text{Off-policy SFT} \rightarrow \text{Token-Level KD} \rightarrow \text{RLVR / RLHF} \rightarrow \text{OPD or OPSD} \rightarrow \text{Specialist Teacher Training} \rightarrow \text{MOPD} \rightarrow \text{Final Alignment and Regression Monitoring}\]
Choosing by Constraint
-
If the constraint is teacher availability, use an external-teacher method when a strong teacher can be served reliably, and use self-distillation when external teachers are unavailable or too expensive. If the model already contains latent capability that can be unlocked by hints or feedback, self-distillation may be more practical than hosting a frontier teacher.
-
If the constraint is data quality, use off-policy trace distillation first. Invest in teacher generation, filtering, verifier checks, deduplication, and mixture balancing before adding on-policy complexity. A clean fixed corpus is often the fastest path to a robust baseline.
-
If the constraint is exposure bias, move toward OPD. Student-generated rollouts expose the model to the states it will actually visit, allowing teacher feedback to correct its own mistakes rather than only teaching ideal trajectories.
-
If the constraint is sparse credit assignment, use RL-distillation hybrids. Rewards or verifiers should determine whether the rollout was successful, while teacher or self-teacher gaps should refine token-level credit.
-
If the constraint is capability interference, use multi-teacher distillation. Train or select specialists independently, keep useful intermediate checkpoints, and consolidate them through routing, trace distillation, or MOPD.
-
If the constraint is support mismatch, avoid aggressive full-distribution matching. Prefer warmup SFT, sampled-token scoring, intermediate teachers, or trace distillation until teacher and student local token supports overlap more reliably.
-
If the constraint is reasoning-style preservation, be cautious with privileged self-distillation. Monitor response length, entropy, uncertainty markers, verification markers, fork rates, backtracking, and out-of-distribution reasoning performance.
-
If the constraint is systems cost, choose the simplest payload that supports the desired loss. Sequence targets are cheapest, sampled-token log-probabilities add a modest per-token cost, top-\(k\) distributions are heavier, and full-vocabulary logits are usually the most expensive.
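A rough per-token storage estimate makes the ordering concrete. Every number here is an assumption for illustration: 4-byte token ids, 2-byte (half-precision) values, a 128,000-entry vocabulary, and \(k = 20\).

```python
def payload_bytes_per_token(kind, vocab_size=128_000, top_k=20,
                            id_bytes=4, value_bytes=2):
    """Approximate stored bytes per target token for each payload type."""
    if kind == "sequence":
        return id_bytes                            # token id only
    if kind == "sampled":
        return id_bytes + value_bytes              # id + one log-probability
    if kind == "top_k":
        return top_k * (id_bytes + value_bytes)    # k (id, value) pairs
    if kind == "full":
        return vocab_size * value_bytes            # one value per vocab entry
    raise ValueError(f"unknown payload kind: {kind!r}")


sizes = {kind: payload_bytes_per_token(kind)
         for kind in ("sequence", "sampled", "top_k", "full")}
```

Under these assumptions the four payloads cost 4, 6, 120, and 256,000 bytes per token respectively, so full-vocabulary logits are several orders of magnitude heavier than sampled-token log-probabilities.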
Practical Guidance
-
The safest default is to start simple and add adaptivity only when there is a demonstrated need. Off-policy SFT creates the base behavior; RL improves task success; OPD improves robustness to the student’s own mistakes; self-distillation uses privileged or feedback context when external teachers are unavailable; MOPD consolidates specialists; and RL-distillation hybrids address sparse credit assignment.
-
The most important diagnostic before adopting OPD or MOPD is whether teacher feedback is meaningful on student-visited prefixes. A strong teacher is not automatically a good teacher for a particular student rollout. Support overlap, tokenizer compatibility, and reasoning-style compatibility should be checked before relying on dense token-level loss.
-
The most important diagnostic before adopting OPSD is whether privileged context changes task knowledge or merely changes style. If the privileged teacher becomes shorter, more confident, or less exploratory, the student may inherit the wrong behavior unless the loss is masked, gated, or contrastively cleaned.
-
The most important diagnostic before adopting RL-distillation hybrids is whether dense token-level feedback and sparse reward feedback agree. When they disagree, the reward should usually anchor the direction of learning, while distillation should act as a bounded auxiliary or modulation signal.
-
The most important deployment rule is to evaluate regressions as aggressively as improvements. Distillation is often used to compress, preserve, or consolidate capabilities, so a successful method should improve the target domain without silently degrading chat quality, safety, instruction following, long-context behavior, reasoning robustness, tool-use hygiene, or agentic reliability.
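The support-overlap check recommended above before adopting OPD or MOPD can be sketched as a small diagnostic over student-visited prefixes. This is a minimal sketch under stated assumptions: teacher and student share a tokenizer, so token strings are directly comparable, and each position is given as a token-to-log-probability mapping, for example a top-\(k\) payload. The function names and the choice of metrics are hypothetical illustrations, not a method taken from the cited papers.

```python
import math
from typing import Dict, List, Set

LogProbs = Dict[str, float]  # token -> log-probability at one position (assumed shared tokenizer)


def top_k_tokens(logprobs: LogProbs, k: int) -> Set[str]:
    """The k highest-probability tokens at one position."""
    return set(sorted(logprobs, key=logprobs.__getitem__, reverse=True)[:k])


def top_k_overlap(teacher: LogProbs, student: LogProbs, k: int) -> float:
    """Fraction of the student's top-k tokens that are also in the teacher's top-k."""
    student_top = top_k_tokens(student, k)
    if not student_top:
        return 0.0
    return len(student_top & top_k_tokens(teacher, k)) / len(student_top)


def teacher_mass_on_student_top_k(teacher: LogProbs, student: LogProbs, k: int) -> float:
    """Teacher probability on the student's top-k tokens.

    Tokens missing from the teacher payload contribute zero.
    """
    return sum(math.exp(teacher[tok]) for tok in top_k_tokens(student, k) if tok in teacher)


def mean_overlap(teacher_positions: List[LogProbs],
                 student_positions: List[LogProbs], k: int) -> float:
    """Average top-k overlap across the positions of a student rollout."""
    pairs = list(zip(teacher_positions, student_positions))
    if not pairs:
        return 0.0
    return sum(top_k_overlap(t, s, k) for t, s in pairs) / len(pairs)
```

Low overlap or low teacher mass on student-preferred tokens across many rollouts suggests that dense token-level loss may be unreliable for that teacher-student pair. Any threshold for "low" is a project-specific choice and is not specified here.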
References
Foundational distillation papers
-
Distilling the Knowledge in a Neural Network by Hinton et al. (2015)
-
Sequence-Level Knowledge Distillation by Kim and Rush (2016)
-
DistilBERT: a distilled version of BERT by Sanh et al. (2019)
-
Deep Mutual Learning by Zhang et al. (2017)
-
Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer by Li et al. (2022)
On-policy distillation and generalizations
-
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes by Agarwal et al. (2024)
-
MiniLLM: Knowledge Distillation of Large Language Models by Gu et al. (2023)
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe by Li et al. (2026)
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models by Luo et al. (2026)
-
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes by Fu et al. (2026)
-
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation by Ko et al. (2026)
-
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation by Yang et al. (2026)
-
Model Extrapolation Expedites Alignment by Zheng et al. (2025)
-
Multi-Teacher On-Policy Distillation: A New Post-Training Primitive by Yumo Xu
Self-distillation and privileged supervision
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models by Zhao et al. (2026); author blog
-
Reinforcement Learning via Self-Distillation by Hübotter et al. (2026)
-
Self-Distilled RLVR by Yang et al. (2026)
-
Rethinking On-Policy Self-Distillation for Thinking Models by Kaur et al. (2026)
-
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? by Kim et al. (2026)
-
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation by Pan et al. (2026)
-
Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation
Multi-teacher and capability consolidation
-
MiMo-V2-Flash Technical Report by Xiao et al. (2026)
-
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation by Yang et al. (2026)
-
Nemotron 3 Ultra by NVIDIA et al. (2026); technical report PDF; project page; developer blog; model card; training blends dataset
-
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing by Li et al. (2026)
RL-distillation hybrids
-
Self-Distilled Agentic Reinforcement Learning by Lu et al. (2026); code
-
OpenClaw-RL: Train Any Agent Simply by Talking by Wang et al. (2026); code
-
Aligning Language Models from User Interactions by Kleine Buening et al. (2026)
Recent post-training recipe reports and commentary
-
MAI-Thinking-1: Building a Hill-Climbing Machine by The Microsoft AI Team (2026)
-
Frontier post-training recipe review with Finbarr Timbers; video conversation; announcement thread
-
Rishabh Agarwal distillation resources by Rishabh Agarwal; timestamped video; thread on LLM distillation; thread on thinking-model self-distillation; self-distillation does not work for thinking models YET
-
On-policy distillation interpretation and stabilization discussion by Zhuo Kai
-
Targeted on-policy self-distillation lecture clip by Sasha Rush
-
LinkedIn discussion on Nemotron 3 Ultra, DeepSeek V4, and specialist-teacher merging by Ravid Shwartz Ziv
Synthetic data and RLHF references
-
The RLHF Book, Chapter 12: Synthetic Data; The Path to On-Policy Distillation
-
Training language models to follow instructions with human feedback by Ouyang et al. (2022)
-
Deep Reinforcement Learning from Human Preferences by Christiano et al. (2017)
Imitation learning and exposure bias
-
Efficient Reductions for Imitation Learning by Ross and Bagnell (2010)
-
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning by Ross et al. (2011)
Systems, tooling, and infrastructure
-
Distilling 100B+ Models 40x Faster with TRL; TRL Distillation Trainer Space
-
On-Policy Distillation (Hugging Face H4 demo); Unlocking On-Policy Distillation for Any Model Family
-
vLLM: Easy, Fast, and Cheap LLM Serving by Kwon et al. (2023)
-
DeepSpeed: System Optimizations for Large-Scale Training by Rasley et al. (2020)
Blogs and implementation guides
-
On-Policy Distillation by Thinking Machines Lab
Twitter / X threads
-
Multi-teacher on-policy distillation discussion by Cameron R. Wolfe; follow-up thread on multi-teacher on-policy distillation and post-training
-
On-policy distillation announcement thread by Kevin Lu
-
Rishabh Agarwal X threads by Rishabh Agarwal; thinking-model self-distillation thread; self-distillation does not work for thinking models YET
-
Frontier post-training recipe review announcement by Nathan Lambert
Talks, videos, and interviews
-
The Magic of LLM Distillation by Rishabh Agarwal; timestamped link
-
[The State of Frontier Post-Training Recipes Conversation with Finbarr Timbers](https://youtu.be/sbXEPxIazqY)
Broader LLM training context
-
Scaling Laws for Neural Language Models by Kaplan et al. (2020)
-
Language Models are Few-Shot Learners by Brown et al. (2020)
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledKnowledgeDistillation,
title = {Knowledge Distillation},
author = {Chadha, Aman and Jain, Vinija},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}