Introduction

  • DeepSeek-R1 represents a landmark in reasoning-capable Large Language Models (LLMs). Released under an MIT license, this model rivals closed-source reasoning models such as OpenAI’s o1 series while pioneering a reinforcement learning (RL)-driven framework for reasoning tasks.
  • DeepSeek-R1 leverages Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath, which replaces traditional methods like PPO, making training both efficient and scalable. DeepSeek-R1 also utilizes Multihead Latent Attention (MLA), introduced in DeepSeek-V2, which reduces computational and memory overhead, particularly for long-context processing, by projecting the Key-Query-Value (KQV) matrices into a lower-dimensional latent space.
  • DeepSeek-R1 demonstrates how reasoning capabilities emerge naturally through RL alone without relying on massive Supervised Fine-Tuning (SFT). Through innovations like GRPO, FP8 quantization, and emergent CoT reasoning, it rivals closed-source models while fostering transparency and accessibility. As the research community builds upon these innovations, DeepSeek-R1 signals a shift towards efficient, reasoning-driven AI accessible to all.
  • This primer explores its architecture, multi-stage training pipeline, GRPO mechanics, and emergent reasoning behaviors, alongside how distillation propagates reasoning capabilities to smaller models.

Architectural Foundations

  • DeepSeek-R1 builds upon the foundational advancements introduced in DeepSeek-V2, namely Mixture of Experts (MoE) and Multihead Latent Attention (MLA), and DeepSeek-V3, namely FP8 quantization and Multi-Token Prediction (MTP), integrating cutting-edge architectural innovations that optimize both training efficiency and inference performance.
  • This section provides a detailed breakdown of the architectural components that evolved from DeepSeek-V2 and DeepSeek-V3 to DeepSeek-R1, highlighting improvements that make DeepSeek-R1 a leading open-source model, capable of rivaling proprietary alternatives in reasoning efficiency and performance.

Mixture of Experts (MoE)

Overview

  • The Mixture of Experts (MoE) mechanism selectively activates a subset of the total model parameters at each inference step, achieving computational savings while maintaining model quality. This approach enables scaling up model parameters without a proportional increase in computational cost.
  • DeepSeek-R1 refines DeepSeek-V2’s MoE framework, introducing dynamic expert routing, reinforcement learning-based load balancing, and enhanced sparsity constraints. These innovations make DeepSeek-R1 one of the most efficient and scalable open-source MoE models available.
  • By integrating optimized load balancing, device-limited routing, and FP8 quantization, DeepSeek-R1 achieves top-tier reasoning performance with significantly lower computational costs, making it a strong competitor to proprietary models.

Evolution from DeepSeek-V2 to DeepSeek-R1

Background: MoE in DeepSeek-V2
  • DeepSeek-V2 employs the DeepSeekMoE architecture, which is designed to optimize training costs and inference efficiency while maintaining strong model performance. Unlike traditional dense transformer architectures, DeepSeekMoE introduces sparse activation of experts, significantly reducing the computational burden per token while allowing for a high overall parameter count. The key innovations in DeepSeekMoE include:
Basic Architecture
  • DeepSeekMoE follows the general Mixture of Experts (MoE) paradigm, where each token is dynamically routed to a subset of specialized feed-forward network (FFN) experts rather than passing through a monolithic dense FFN.
  • The model consists of 236B total parameters, but only 21B parameters are activated per token, striking a balance between model scalability and computational efficiency.
  • Token Routing: Each token is assigned to a subset of top-K experts based on learned affinity scores, ensuring effective specialization while preventing unnecessary activation of experts.
Device-Limited Routing (DLR)
  • To optimize efficiency, DeepSeek-V2 introduces a Device-Limited Routing (DLR) mechanism:
    • Constraint-Based Routing: Tokens are assigned only to a subset M of available devices, reducing communication overhead.
    • Affinity-Based Device Selection: The top M devices with the highest token-expert affinity scores are selected before choosing the top-K experts within them.
    • Optimized GPU Communication: By capping communication between GPUs, DeepSeek-V2 reduces MoE-related synchronization costs, leading to faster training convergence.
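  • To make the routing constraint concrete, the following is a minimal, hedged PyTorch sketch of device-limited top-K selection. The contiguous expert-to-device layout, the per-device scoring rule (max affinity), and all names are illustrative assumptions rather than DeepSeek-V2’s actual dispatch code:

```python
import torch

def device_limited_route(affinity: torch.Tensor, experts_per_device: int,
                         m_devices: int, top_k: int) -> torch.Tensor:
    """affinity: [num_tokens, num_experts]; experts assumed laid out contiguously per device."""
    num_tokens, num_experts = affinity.shape
    num_devices = num_experts // experts_per_device
    per_device = affinity.view(num_tokens, num_devices, experts_per_device)
    # Score each device by its best-matching expert, then keep only the top-M devices per token.
    device_score = per_device.max(dim=-1).values                      # [T, D]
    top_devices = device_score.topk(m_devices, dim=-1).indices        # [T, M]
    allowed = torch.zeros(num_tokens, num_devices, dtype=torch.bool, device=affinity.device)
    allowed.scatter_(1, top_devices, True)                            # mark the M selected devices
    allowed = allowed[:, :, None].expand(-1, -1, experts_per_device).reshape(num_tokens, num_experts)
    # Mask out experts on non-selected devices, then pick the top-K experts as usual.
    masked = affinity.masked_fill(~allowed, float("-inf"))
    return masked.topk(top_k, dim=-1).indices                         # [T, K] selected experts
```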
Load Balancing Mechanisms
  • DeepSeek-V2 employs three auxiliary loss functions to ensure balanced expert utilization and reduce computational bottlenecks:
  1. Expert-Level Balance Loss (\(\mathcal{L}_{\text{ExpBal}}\))
    • Ensures uniform expert usage across different training batches.
    • Defined as:
      \(\mathcal{L}_{\text{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i\)
      • where \(f_i\) represents the fraction of tokens assigned to expert \(i\) and \(P_i\) is the average routing probability assigned to expert \(i\).
  2. Device-Level Balance Loss (\(\mathcal{L}_{\text{DevBal}}\))
    • Ensures equal computational load distribution across GPUs.
    • Defined as:
      \(\mathcal{L}_{\text{DevBal}} = \alpha_2 \sum_{i=1}^{D} f'_i P'_i\)
      • where \(D\) is the number of devices, and \(f'_i\) and \(P'_i\) are the token fraction and mean routing probability aggregated over the experts hosted on device \(i\).
  3. Communication Balance Loss (\(\mathcal{L}_{\text{CommBal}}\))
    • Ensures balanced information flow between GPUs.
    • Defined as:
      \(\mathcal{L}_{\text{CommBal}} = \alpha_3 \sum_{i=1}^{D} f''_i P''_i\)
      • where \(f''_i\) is the fraction of tokens sent to device \(i\) and \(P''_i\) is the routing probability mass received by device \(i\).
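  • As an illustration of how such terms can be computed, the following PyTorch sketch derives \(f_i\) and \(P_i\) from per-token routing probabilities and evaluates the expert-level balance loss. Shapes, names, and the \(\alpha_1\) default are placeholder assumptions, not DeepSeek-V2’s training code:

```python
import torch

def expert_balance_loss(router_probs: torch.Tensor, top_k: int, alpha: float = 0.003) -> torch.Tensor:
    """router_probs: [num_tokens, num_routed_experts] softmax affinity scores."""
    num_tokens, num_experts = router_probs.shape
    topk_idx = router_probs.topk(top_k, dim=-1).indices                   # top-K experts per token
    assign = torch.zeros_like(router_probs).scatter_(-1, topk_idx, 1.0)   # one-hot assignment mask
    # f_i: (scaled) fraction of tokens routed to expert i; equals 1 for a perfectly uniform router.
    f = assign.sum(dim=0) * num_experts / (top_k * num_tokens)
    # P_i: mean routing probability assigned to expert i.
    P = router_probs.mean(dim=0)
    return alpha * (f * P).sum()
```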
Enhancements in DeepSeek-R1
  • DeepSeek-R1 refines the MoE framework by incorporating:

    • Dynamic Expert Assignment
      • Experts are dynamically allocated based on contextual embeddings.
      • Softmax temperature scaling prevents expert over-specialization.
    • Reinforcement Learning-Guided Routing
      • Introduces policy-based optimization to guide expert selection.
      • Feedback loop optimizes computational load balancing.
    • Sparse Activation Constraints
      • Implements hierarchical top-K gating to enforce sparsity constraints.
      • Adjusts token-level entropy metrics to reduce unnecessary activations.

Mathematical Formulation

  • The expert selection process in DeepSeek-R1 follows a gating function:

    \[G(x) = \text{softmax}(W_g x)\]
    • where \(W_g\) is a trainable weight matrix.
  • The final output is computed as:

    \[y = \sum_{k \in K} G_k(x) E_k(x)\]
    • where:
      • \(K\) represents the top-K selected experts.
      • \(E_k(x)\) is the computation performed by expert \(k\).
      • \(G_k(x)\) is the gating probability.
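  • The gating and mixture equations above can be illustrated with a minimal PyTorch sketch of a top-K MoE layer. The expert architecture, shapes, and naming are assumptions for illustration rather than DeepSeek-R1’s actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)          # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: [num_tokens, d_model]"""
        probs = F.softmax(self.gate(x), dim=-1)            # G(x) = softmax(W_g x)
        gates, experts = probs.topk(self.top_k, dim=-1)    # keep only the top-K experts per token
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, g_k = experts[:, slot], gates[:, slot]
            for e in idx.unique().tolist():                # dispatch the tokens routed to expert e
                mask = idx == e
                y[mask] += g_k[mask].unsqueeze(-1) * self.experts[e](x[mask])  # G_k(x) * E_k(x)
        return y
```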
Load Balancing Loss
  • To ensure equal utilization of experts, DeepSeek-R1 applies a load balancing loss:

    \[\mathcal{L}_{\text{balance}} = \lambda \sum_k \left(\frac{n_k}{N} - \frac{1}{K}\right)^2\]
    • where:
      • \(n_k\) is the number of tokens assigned to expert \(k\).
      • \(N\) is the total number of tokens in a batch.
      • \(K\) is the number of active experts per token.
  • Additionally, an entropy regularization term prevents expert over-reliance:

    \[\mathcal{L}_{\text{entropy}} = -\gamma \sum_k G_k(x) \log G_k(x)\]
    • where \(\gamma\) controls entropy strength.
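  • The sketch below evaluates the balance and entropy terms exactly as written above, given the gating probabilities \(G(x)\) and a top-K assignment. The \(\lambda\) and \(\gamma\) defaults are arbitrary placeholders, not published values:

```python
import torch

def moe_regularizers(gate_probs: torch.Tensor, top_k: int,
                     lam: float = 1e-2, gamma: float = 1e-3):
    """gate_probs: [num_tokens, num_experts] gating probabilities G(x)."""
    num_tokens, _ = gate_probs.shape
    topk_idx = gate_probs.topk(top_k, dim=-1).indices
    assign = torch.zeros_like(gate_probs).scatter_(-1, topk_idx, 1.0)
    n_k = assign.sum(dim=0)                                            # tokens assigned to each expert
    balance = lam * ((n_k / num_tokens - 1.0 / top_k) ** 2).sum()      # L_balance with the 1/K target above
    entropy = -gamma * (gate_probs * gate_probs.clamp_min(1e-9).log()).sum(dim=-1).mean()  # L_entropy
    return balance, entropy
```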

Inference Efficiency

  • To enhance inference efficiency, DeepSeek-R1 implements:
  1. FP8 Quantization:
    • Reduces memory overhead while maintaining precision.
  2. KV Cache Optimization:
    • Multi-Head Latent Attention (MLA) compresses KV-cache size.
    • Allows for larger batch sizes at inference time.
  3. Expert Parallelism and Communication Optimization:
    • 8-way expert parallelism ensures even GPU workload distribution.
    • Pipeline parallelism (16-way zero-bubble) minimizes idle compute time.
  4. Adaptive Expert Activation:
    • Adjusts active experts per token based on sequence complexity.

Multihead Latent Attention (MLA)

Overview

  • Multihead Latent Attention (MLA) enhances efficiency by projecting Key-Query-Value (KQV) matrices into a lower-dimensional latent space, significantly reducing computational and memory costs.

Evolution from DeepSeek-V2 to DeepSeek-R1

MLA in DeepSeek-V2
  • Introduced as a low-rank key-value joint compression technique, MLA drastically reduced memory overhead associated with traditional Multi-Head Attention (MHA).
  • Allowed compression of KV caches into a more compact latent vector, reducing storage requirements by 93.3% compared to standard attention mechanisms.
  • Boosted maximum generation throughput by 5.76× compared to DeepSeek 67B, making long-context processing more feasible.
Enhancements in DeepSeek-R1
  • Hybrid Latent Projection: DeepSeek-R1 dynamically scales the latent projection space based on token context complexity, ensuring optimal memory usage.
  • Hierarchical Caching: Introduces an advanced caching mechanism that allows reuse of latent projections across multiple tokens, reducing redundant computation.
  • Adaptive Attention Scaling: The model adjusts attention weight distributions dynamically, improving long-context comprehension.

Mathematical Formulation

  • The MLA mechanism transforms the standard attention computation by introducing a latent space projection:

    1. Compute Key, Query, and Value Matrices: \(K, Q, V = W_k X, W_q X, W_v X\)

    2. Project into a Lower-Dimensional Latent Space: \(K_L, Q_L, V_L = W_L K, W_L Q, W_L V\)

  • By reducing the attention complexity from \(O(N^2)\) to \(O(Nd_L)\), where \(d_L\) is the latent space dimension, MLA significantly improves efficiency.
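  • A simplified, single-head PyTorch sketch of this latent projection is shown below; it follows the two-step formulation above rather than DeepSeek’s production MLA (which uses separate low-rank down- and up-projections and caches only the compressed latents). All names and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LatentAttention(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)        # W_q
        self.w_k = nn.Linear(d_model, d_model, bias=False)        # W_k
        self.w_v = nn.Linear(d_model, d_model, bias=False)        # W_v
        self.w_latent = nn.Linear(d_model, d_latent, bias=False)  # W_L: shared down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: [batch, seq_len, d_model]; only the d_latent-sized K_L/V_L would need caching."""
        q_l = self.w_latent(self.w_q(x))                           # Q_L
        k_l = self.w_latent(self.w_k(x))                           # K_L
        v_l = self.w_latent(self.w_v(x))                           # V_L
        scores = q_l @ k_l.transpose(-2, -1) / (q_l.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v_l                 # attention computed in the latent space
```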

FP8 Quantization

Overview

  • DeepSeek-R1 utilizes 8-bit floating-point (FP8) quantization to reduce memory usage and computational costs while preserving numerical stability.

Enhancements in DeepSeek-R1

  • Adaptive Bit-Width Scaling: Dynamically adjusts the bit precision across different network layers based on computational demands.
  • Loss-Aware Quantization: Uses loss-sensitive scaling functions to ensure that numerical precision is maintained across different computation stages.

Mathematical Representation

  • FP8 quantization involves a scaling factor \(S\) that adjusts the input values:

    \[x_q = \text{clip}( \text{round}(x / S), -127, 127)\]
    • where \(S\) is dynamically optimized based on loss gradients to prevent numerical instability.
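  • A minimal sketch of this scaled quantization step is shown below. It uses simple absmax scaling as a stand-in for the loss-aware choice of \(S\) described above, and the clip range follows the formula as written; it is not DeepSeek-V3’s FP8 kernel implementation:

```python
import torch

def quantize(x: torch.Tensor, qmax: float = 127.0):
    """Quantize x with a per-tensor scale S; returns (x_q, S)."""
    S = x.abs().amax().clamp_min(1e-12) / qmax           # absmax scaling (stand-in for loss-aware S)
    x_q = torch.clamp(torch.round(x / S), -qmax, qmax)   # x_q = clip(round(x / S), -127, 127)
    return x_q, S

def dequantize(x_q: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    return x_q * S                                       # approximate reconstruction of x
```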

Multi-Token Prediction (MTP)

Overview

  • Multi-Token Prediction (MTP) allows DeepSeek-R1 to predict multiple tokens in parallel, significantly improving inference speed.

Key Features

  • Parallel Decoding: Extends the autoregressive framework by allowing multiple token predictions within the same context window.
  • Token Sampling and Re-ranking: Multi-token outputs are sampled from a probabilistic distribution and re-ranked for coherence.
  • Dynamic Prediction Horizon: Adjusts the number of predicted tokens per step based on model confidence.

Enhancements in DeepSeek-R1

  • Reinforcement Learning-Guided Token Selection: Ensures coherence in multi-token predictions and reduces error propagation.
  • Hierarchical Token Verification: Dynamically adjusts the number of predicted tokens per step based on uncertainty estimation.

Mathematical Formulation

  • The prediction function follows an autoregressive formulation:
    \[P(y_{1:T} | x) = \prod_{t=1}^{T} P(y_t | y_{<t}, x)\]
  • By introducing parallel decoding, DeepSeek-R1 reduces inference complexity from \(O(T)\) to \(O(T/k)\), where \(k\) is the number of tokens predicted per step.
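  • The speedup can be illustrated with a toy greedy decoding loop that assumes a model exposing \(k\) prediction heads per position. The output shape, the greedy selection, and the absence of any verification step are simplifying assumptions; DeepSeek’s MTP modules are more involved:

```python
import torch

@torch.no_grad()
def decode_multi_token(model, input_ids: torch.Tensor, k: int, max_new_tokens: int) -> torch.Tensor:
    """Assumes model(input_ids) returns logits of shape [batch, seq, k, vocab],
    i.e. k next-token predictions per position (one per MTP head)."""
    for _ in range(max_new_tokens // k):             # O(T/k) forward passes instead of O(T)
        logits = model(input_ids)                    # predict the next k tokens in one pass
        next_tokens = logits[:, -1].argmax(dim=-1)   # [batch, k] greedy pick per head
        input_ids = torch.cat([input_ids, next_tokens], dim=-1)
    return input_ids
```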

Training Pipeline: From Pre-Training to Reasoning

  • DeepSeek-R1 employs a multi-stage pipeline meticulously designed to maximize its reasoning capabilities while minimizing computational costs. This process consists of distinct stages, each guided by task-specific loss functions and reward mechanisms.

Stage 1: Cold Start with Supervised Fine-Tuning (SFT)

  • DeepSeek-R1 begins its journey by fine-tuning the V3-Base model with high-quality Chain-of-Thought (CoT) examples. These examples are carefully curated using few-shot prompting, manual annotation, and refinement of DeepSeek-R1-Zero outputs.

  • Comparison to Cold Start in Recommender Systems:
    • In recommender systems, the “cold start problem” refers to the challenge of providing accurate recommendations for new users or items with limited historical data. The focus is on mitigating data sparsity by learning user preferences or item properties.
    • In contrast, DeepSeek-R1’s cold start addresses the challenge of initializing a large language model with structured reasoning and readability. By fine-tuning on curated data, the model develops a foundational understanding of chain-of-thought reasoning, overcoming instability observed in RL-only training setups.
  • Advantages of Cold Start:

    • Readability: DeepSeek-R1-Zero struggled with poor readability and language mixing. In contrast, the cold-start phase imposes a structured output format:
      <reasoning_process> CoT explanation </reasoning_process>
      <summary> Final Answer </summary>
      
    • Alignment: Cold start data introduces human priors, accelerating convergence and improving performance on reasoning-intensive tasks.

  • Loss Function for SFT:

    • The model is fine-tuned using a supervised cross-entropy loss:

      \[L_{\text{SFT}} = -\sum_{i=1}^{n} \log P_{\theta}(o_i|q, \{o_1, \dots, o_{i-1}\}),\]
      • where:
        • \(o_i\): the \(i^{th}\) token in the output sequence,
        • \(q\): the input query,
        • \(o_1, ..., o_{i-1}\): previously generated tokens.
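  • A minimal PyTorch sketch of this token-level cross-entropy objective is shown below; the one-position shift and label masking conventions are assumptions, not DeepSeek’s training code:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """logits: [batch, seq, vocab]; labels: [batch, seq] with prompt/padding positions set to ignore_index."""
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))  # predict token o_i from o_1..o_{i-1}
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index)
```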

Stage 2: Reinforcement Learning (RL)

  • RL is the backbone of DeepSeek-R1’s reasoning evolution. The model learns from rewards rather than curated datasets, enabling self-improvement over thousands of iterations.

DeepSeek Pure RL: A Conceptual Overview

  • DeepSeek’s RL methodology is fundamentally inspired by self-play paradigms, akin to training AI models in games like chess. Traditionally, AI models trained for complex reasoning tasks leverage large datasets composed of human-annotated examples. However, such datasets often lack comprehensive coverage and may not contain optimal solutions. RL circumvents this limitation by allowing AI models to explore solutions autonomously, refining their strategies based on reward-driven feedback mechanisms.
  • Consider an AI model trained to play chess. Instead of learning from a fixed dataset of historical games, the AI is programmed with only the fundamental rules of chess. It then engages in self-play, continuously experimenting with various moves. Initially, the model executes suboptimal actions, leading to losses. However, through iterative play, it identifies effective strategies and reinforces moves that contribute to victories while discarding ineffective ones. This trial-and-error process, governed by RL principles, enables the AI to develop strategies surpassing human intuition.
  • DeepSeek applies this RL-based approach to reasoning-intensive domains, such as mathematical problem-solving. Rather than training on explicit mathematical derivations, the AI is provided with fundamental mathematical rules and tasked with solving problems autonomously. The model systematically explores various solution paths, reinforcing those that yield correct answers while discarding ineffective methodologies. Over time, this process enhances the AI’s mathematical reasoning abilities beyond traditional supervised learning approaches. The self-improving nature of RL fosters the discovery of novel problem-solving strategies, resulting in superior performance in mathematical reasoning and logic-based tasks.

Rewards

  • DeepSeek-R1 uses two primary reward functions:

    1. Accuracy Rewards:

      • Evaluate the correctness of deterministic tasks, such as math problems and code-generation outputs. For instance:
        • In math, the model’s final answer is verified against a ground truth.
        • For coding, unit tests evaluate the validity of generated code solutions.
    2. Format Rewards:

      • Encourage consistent reasoning structures by rewarding outputs that adhere to the specified CoT format. For instance:
        <reasoning_process> Step-by-step explanation </reasoning_process>
        <answer> Final Output </answer>
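  • As a rough illustration, the two signals can be combined into a simple rule-based reward function. The regular expressions, exact-match check, and weighting below are assumptions for illustration, not DeepSeek-R1’s reward implementation:

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows the <reasoning_process>/<answer> template, else 0.0."""
    pattern = r"<reasoning_process>.*?</reasoning_process>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, output, flags=re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    # Hypothetical weighting of the two reward signals.
    return accuracy_reward(output, ground_truth) + 0.5 * format_reward(output)
```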
        

Group Relative Policy Optimization (GRPO)

  • Group Relative Policy Optimization (GRPO) is an RL method that has played a pivotal role in the development of DeepSeek-R1. It was first introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models as a simplified and more efficient alternative to traditional policy optimization techniques like Proximal Policy Optimization (PPO).
  • GRPO has evolved from a mathematical reasoning optimizer in DeepSeekMath to a core optimization technique in DeepSeek-R1, driving advanced reasoning capabilities across diverse tasks. By eliminating the critic model, leveraging group-based advantages, and incorporating multi-stage RL refinements, GRPO has made DeepSeek-R1 one of the most powerful open-source reasoning models.
  • GRPO is central to DeepSeek-R1’s RL pipeline, providing a lightweight yet powerful optimization mechanism. Its key innovations include:
    • Removing the critic model, which significantly reduces memory overhead.
    • Stabilizing policy updates through group-based advantage estimation.
    • Efficient training while maintaining strong performance compared to PPO-based methods.
  • From its inception in DeepSeekMath to its refined implementation in DeepSeek-R1, GRPO has undergone several enhancements, including multi-stage RL, improved reward modeling, and refined optimization strategies. This section details GRPO’s mathematical formulation, its implementation, and its role in DeepSeek-R1.

Evolution of GRPO: From DeepSeekMath to DeepSeek-R1

Phase 1: GRPO in DeepSeekMath (Mathematical RL)

  • GRPO was originally introduced in DeepSeekMath to optimize models for mathematical reasoning.
  • It replaced PPO’s critic model with a group-based reward normalization technique, making training more efficient while maintaining stability.
  • The reward function primarily evaluated mathematical correctness, using structured evaluation metrics.

Phase 2: GRPO in DeepSeek-R1-Zero (Self-Evolving Reasoning)

  • With DeepSeek-R1-Zero, GRPO was applied without any supervised fine-tuning (SFT)—pure RL was used to shape reasoning behaviors from scratch.
  • The model self-learned reasoning skills such as step-by-step problem-solving and self-verification.
  • However, DeepSeek-R1-Zero exhibited readability issues (e.g., unstructured reasoning outputs, language mixing).

Phase 3: GRPO in DeepSeek-R1 (Refined Reasoning & Cold Start)

  • DeepSeek-R1 introduced a multi-stage RL pipeline incorporating a small amount of cold-start fine-tuning before applying GRPO.
  • The reward model was expanded beyond mathematics to include general reasoning tasks.
  • A language consistency reward was added to improve coherence and readability.

How GRPO Works

  • GRPO modifies traditional policy optimization by leveraging group-based normalization instead of a critic model. This enables efficient and stable policy updates while reducing computational overhead.

GRPO Intuition

  • To understand GRPO, it is useful to analyze its mathematical formulation from a reverse-engineering perspective. The complexity of the equations can be misleading; in reality, GRPO consists of three main components:

    \[J_{GRPO} = \min([\text{Block 1}], [\text{Block 2}]) - [\text{Block 3}]\]
    • where:
      • Block 1 corresponds to the first term inside the summation of the GRPO objective function: \(\rho_i A_i = \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i.\) This represents the primary objective of policy optimization: ensuring the updated policy \(\pi_\theta\) improves upon the previous policy \(\pi_{\theta_{old}}\). The core principle is straightforward: the new policy should outperform the old one in expectation.
      • Block 2 corresponds to the clipped version of \(\rho_i A_i\), i.e., \(\text{clip}(\rho_i, 1 - \epsilon, 1 + \epsilon) A_i.\) This originates from PPO and serves as a safeguard to prevent excessive updates. By taking the minimum between Block 1 and this clipped value, GRPO ensures training stability and prevents over-exaggerated policy updates.
      • Block 3 corresponds to the KL-divergence regularization term in the GRPO equation: \(\beta D_{KL}(\pi_\theta || \pi_{ref}).\) This term enforces similarity between the new policy and a reference policy, preventing the optimization process from deviating too far from the original distribution and ensuring controlled updates.
  • One of the most notable aspects of GRPO’s success is its redesigned approach to advantage computation. Traditional PPO computes advantages using a learned value network combined with temporal difference learning, requiring additional memory and computation to maintain a separate critic model. In contrast, GRPO fundamentally simplifies this by directly comparing sampled actions within a group and leveraging statistical normalization to compute advantages. This group-based methodology eliminates the need for a value network, significantly reducing memory overhead—by approximately half—while simultaneously aligning with the core principle of evaluating mathematical solutions relative to other approaches to the same problem.
  • This design choice has proven especially effective for mathematical reasoning tasks. By using a direct group-based comparison, GRPO enhances the model’s ability to develop structured reasoning strategies. Empirical results demonstrate that this method not only improves performance on mathematical reasoning benchmarks but also maintains training stability and computational efficiency. The elimination of the critic network removes potential biases from learned value functions, making GRPO particularly well-suited for domains requiring objective evaluation of multiple solution paths.
  • Additionally, the “Group” aspect in GRPO refers to computing the expectation over a set of sampled outputs, which are then averaged to stabilize training. The presence of normalization within \(A\) (mean and standard deviation) may initially appear complex, but it simply follows conventional normalization techniques used in machine learning.
  • Thus, when stripped of indices, subscripts, and hyperparameters, GRPO reduces to a simple balance between policy improvement and control mechanisms, reinforcing why it is regarded as an efficient and intuitive optimization method.

Mathematical Formulation

  • The GRPO objective function is:

    \[J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^G \min\left(\rho_i A_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) A_i\right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]\]
    • where:
      • \(\rho_i\) is the likelihood ratio, indicating how much the new policy diverges from the old one: \(\rho_i = \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}\)
      • \(A_i\) is the group-based advantage function, which normalizes rewards across sampled outputs: \(A_i = \frac{r_i - \text{mean}(r_1, ..., r_G)}{\text{std}(r_1, ..., r_G)}\)
      • \(D_{\text{KL}}(\pi_\theta \| \pi_{ref})\) is a KL regularization term that constrains updates within a stable range.
      • \(G\) is the group size (number of sampled outputs per query).
      • \(\epsilon\) controls clipping to prevent overly aggressive updates.
      • \(\beta\) controls the strength of KL regularization.
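  • The objective can be illustrated with a minimal PyTorch sketch of one GRPO update for a single query, using sequence-level log-probabilities as a simplification. The KL term uses the unbiased estimator \(\frac{\pi_{ref}}{\pi_\theta} - \log\frac{\pi_{ref}}{\pi_\theta} - 1\) from DeepSeekMath; the hyperparameter defaults are placeholders:

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # [G] log pi_theta(o_i|q) for the G sampled outputs
              logp_old: torch.Tensor,   # [G] log pi_theta_old(o_i|q), detached
              logp_ref: torch.Tensor,   # [G] log pi_ref(o_i|q), detached
              rewards: torch.Tensor,    # [G] scalar rewards r_i
              eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    # Group-normalized advantages: A_i = (r_i - mean(r)) / std(r)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Likelihood ratio rho_i = pi_theta / pi_theta_old
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate: min(rho_i * A_i, clip(rho_i, 1-eps, 1+eps) * A_i)
    surrogate = torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # KL penalty via the estimator r - log r - 1, with r = pi_ref / pi_theta
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    # Maximize the objective, so minimize its negative
    return -(surrogate - beta * kl).mean()
```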

Step-by-Step Breakdown

Likelihood Ratio \(\rho_i\)

  • Measures how much the probability of generating output \(o_i\) has changed under the new policy compared to the old policy: \(\rho_i = \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)}\)

Advantage Function \(A_i\)

  • Instead of relying on a separate value network (critic), GRPO estimates the advantage function using a group of sampled outputs: \(A_i = \frac{r_i - \text{mean}(r_1, ..., r_G)}{\text{std}(r_1, ..., r_G)}\)
  • This reduces training instability and enhances efficiency.

Clipping Mechanism

  • Prevents drastic policy updates that could destabilize training: \(\text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\)

KL Divergence Penalty

  • Ensures the policy remains close to a reference distribution: \(\beta D_{\text{KL}}\bigl(\pi_\theta \;\|\; \pi_{\text{ref}}\bigr)\)
  • Prevents mode collapse and excessive policy drift.

Implementation Details

Training Setup

  • GRPO is implemented by sampling multiple outputs per query and computing rewards over the group.
  • The mean and standard deviation of rewards provide a normalized baseline for training.

Reward Function Design

  • In DeepSeekMath: The reward was primarily based on mathematical correctness.
  • In DeepSeek-R1: The reward function expanded to include:
    • Accuracy Rewards: Evaluating correctness for general reasoning tasks (e.g., coding, science, logic).
    • Format Rewards: Ensuring structured reasoning using <think> and <answer> tags.

Optimization Process

  • The model samples multiple outputs per query, computes likelihood ratios and advantage estimates, and updates its policy using the clipped objective function.

Efficiency Considerations

  • Removes critic model, reducing memory consumption.
  • Batch computation for group sampling, improving efficiency.
  • Iterative RL refinement, enabling continual improvement.

Applications

DeepSeek-R1-Zero: Reinforcement Learning from Scratch

  • DeepSeek-R1-Zero applied GRPO to the pretrained base model without any supervised fine-tuning, allowing the model to self-learn reasoning.
  • The model naturally developed skills like self-verification and reflection.
  • However, poor readability and language mixing emerged as challenges.

DeepSeek-R1: Multi-Stage RL with Cold Start

  • To refine DeepSeek-R1-Zero, DeepSeek-R1 introduced:
    1. Cold Start Fine-Tuning:
      • The model was first fine-tuned on high-quality Chain-of-Thought (CoT) examples.
      • This ensured structured reasoning and better readability.
    2. RL with GRPO:
      • GRPO was used to refine reasoning skills in math, logic, and general problem-solving.
      • A language consistency reward was added to prevent language mixing.
    3. Final RL Optimization:
      • After RL, a rejection sampling step generated better training data.
      • A final GRPO optimization phase was conducted with diverse prompts.

PPO vs. DPO vs. KTO vs. APO vs. GRPO

  1. PPO:
    • Function: An RL algorithm that optimizes the language model by limiting how far it can drift from a previous version of the model.
    • Implementation: Involves sampling generations from the current model, judging them with a reward model, and using this feedback for updates.
    • Practical Challenges: Can be slow and unstable, especially in distributed settings.
  2. DPO:
    • Function: Minimizes the negative log-likelihood of observed human preferences to align the language model with human feedback.
    • Data Requirement: Requires paired preference data.
    • Comparison with KTO: While DPO has been effective, KTO offers competitive or superior performance without the need for paired preferences.
  3. KTO:
    • Function: Adapts the Kahneman-Tversky human value function to the language model setting. It uses this adapted function to directly maximize the utility of model outputs.
    • Data Requirement: Does not need paired preference data, only knowledge of whether an output is desirable or undesirable for a given input.
    • Practicality: Easier to deploy in real-world scenarios where desirable/undesirable outcome data is more abundant.
    • Model Comparison: Matches or exceeds the performance of direct preference optimization methods across various model sizes (from 1B to 30B).
  4. APO:
    • Function: Introduces a family of contrastive objectives explicitly accounting for the relationship between the model and the preference dataset. This includes APO-zero, which increases desirable outputs while decreasing undesirable ones, and APO-down, which fine-tunes models based on specific quality thresholds.
    • Data Requirement: Works effectively with paired preference datasets created through controlled methods like CLAIR and supports stable alignment even for challenging datasets.
    • Practicality: Excels at aligning strong models with minimally contrasting preferences, enhancing performance on challenging metrics like MixEval-Hard while providing stable, interpretable training dynamics.
    • Model Comparison: Outperformed conventional alignment objectives across multiple benchmarks, closing a 45% performance gap with GPT4-turbo when trained with CLAIR preferences.
  5. GRPO:
    • Function: A variant of PPO that removes the need for a critic model by estimating the baseline using group scores, improving memory and computational efficiency while enhancing the mathematical reasoning of models.
    • Data Requirement: Utilizes group-based rewards computed from multiple outputs for each query, normalizing these scores to guide optimization.
    • Practicality: Focuses on reducing training resource consumption compared to PPO and improving RL stability.
    • Model Comparison: Demonstrated superior performance on tasks like GSM8K and MATH benchmarks, outperforming other models of similar scale while improving both in-domain and out-of-domain reasoning tasks.

Tabular Comparison

| Aspect | PPO | DPO | KTO | APO | GRPO |
|---|---|---|---|---|---|
| Objective | Maximizes expected reward while preventing large policy updates. | Optimizes policy based on binary classification of human preferences. | Aligns models based on Kahneman-Tversky optimization for utility maximization. | Anchored alignment with specific control over preference-based likelihood adjustments for stability and performance. | Leverages group-based relative advantages and removes the critic network. |
| Input Data | States and rewards from the environment. | Paired human preference data. | Binary labels indicating desirability of outputs. | Minimally contrasting preference pairs or other datasets requiring tailored anchoring. | Grouped LLM outputs scored by a reward model. |
| Learning Mechanism | Policy gradients with a clipped surrogate objective. | Cross-entropy optimization over paired preferences. | Maximizes desirable likelihoods relative to undesirables, without paired data. | Uses variants like APO-zero or APO-down to balance desirable/undesirable likelihood changes. | Group normalization with policy gradients, eliminating the critic network. |
| Output | Actions in the environment. | Aligned responses based on human preferences. | Model outputs optimized for human utility. | Refined outputs aligned to the quality of preference pairs, with control over optimization dynamics. | Outputs optimized for reasoning, reducing computational overhead. |
| Data Requirements | Requires environment rewards. | Needs paired preference data. | Binary feedback, no need for explicit pairings. | Performs best with datasets that maintain controlled contrastiveness, e.g., CLAIR. | Reward scores grouped across multiple outputs. |
| Network Components | Separate policy and value networks. | Single policy network. | Direct adjustments to likelihood distributions without separate critic components. | Leverages adaptable contrastive objectives; can eliminate critic dependency for simpler training. | Simplified network with no critic; uses reward-based grouping instead. |
| Feedback Source | Environment rewards. | Human preferences collected through paired comparisons. | Binary desirability judgments for outputs. | CLAIR-generated or similar preference pairs offering clear, minimally contrasting learning signals. | Scores assigned to groups of LLM outputs. |
| Stability | Relies on clipping mechanisms to avoid destabilization. | Stable as it directly optimizes preferences. | Stable due to focus on unpaired desirability adjustments. | Offers robust training stability, scaling better on models trained with mixed-quality datasets. | Stable due to normalization of rewards across groups. |
| Training Complexity | High, due to balancing reward maximization with policy constraints. | Moderate; uses simplified binary preference objectives. | Simplifies alignment by focusing only on desirability. | Adaptive and context-aware; requires understanding dataset-model relationships to select the right APO variant. | Reduces overhead via group-based scoring. |
| Performance | Strong performance on tasks with clear reward signals but prone to instability in distributed setups. | Effective for straightforward preference alignment tasks. | Competitive or better alignment than preference-based methods without paired data needs. | Superior alignment results, particularly on benchmarks like MixEval-Hard, with CLAIR and APO achieving >7.65% performance gains on MixEval-Hard (2024-06-01 split). | Excels in reasoning tasks, offering computational efficiency. |
| Notable Strength | Widely used in RL settings, good at reward-based optimization. | Directly optimizes for preferences without needing a separate reward model. | Handles binary data efficiently, avoiding paired data dependencies. | Combines adaptive dynamics and stable training tailored to specific datasets, allowing nuanced alignment even with challenging inputs. | Simplifies reward aggregation; strong for reasoning-heavy tasks. |
| Scenarios Best Suited | RL environments where reward signals are predefined. | Scenarios with abundant paired human feedback. | Real-world settings with broad definitions of desirable/undesirable outputs. | Tasks requiring precise alignment with nuanced, minimally contrasting preferences, especially for closing performance gaps in competitive models (e.g., GPT4-turbo). | Mathematical reasoning or low-resource training setups. |

Emergent Reasoning Behaviors

  • During training, DeepSeek-R1 developed remarkable reasoning patterns:

    • Reflection: Revisiting and revising intermediate steps.
    • Self-Correction: Identifying and fixing errors in real-time.
    • Aha Moments: Pausing and reevaluating to discover new solutions.
  • For example:

    • Solving \(x^2 - 5x + 6 = 0\), the model might initially propose incorrect factors, pause to reflect, and ultimately derive \(x = 2\) and \(x = 3\).
    • In the table below, from the original paper, we can see where R1 has its “aha” moment and re-evaluates its solution:

Distillation: Reasoning in Compact Models

  • DeepSeek-R1’s reasoning capabilities were distilled into smaller models (e.g., Qwen-7B, Llama-8B), achieving state-of-the-art performance:

    • Teacher-Student Paradigm: Outputs from DeepSeek-R1 trained smaller models with minimal computational overhead.
    • Efficiency: Distilled models retained reasoning capabilities while outperforming larger, non-reasoning models like GPT-4o.
  • The table below shows the distilled R1 models and how they compare on reasoning-related benchmarks.

Results

  • The figure below, from the original paper, shows the performance of DeepSeek-R1 being on par with or outperforming OpenAI’s models on several benchmarks.

Open Questions

  • As shown in the figure below (source), making a powerful reasoning model is now very simple if you have access to a capable base model and a high-quality data mixture:

  • Despite DeepSeek-R1’s advances, several open questions remain regarding its development and optimal implementation:

    • Data Collection: How were the reasoning-specific datasets curated? Understanding the sources and selection criteria for data is crucial for replicating and improving the model’s performance.
    • Model Training: No training code was released by DeepSeek, leaving uncertainty about which hyperparameters work best and how they differ across model families and scales.
    • Scaling Laws: What are the compute and data trade-offs in training reasoning models? Identifying these relationships is critical for optimizing future models.

Open-R1

  • While DeepSeek-R1 provides open weights, the datasets and code used in training remain proprietary. These open questions have driven Open-R1 (a fully open reproduction of DeepSeek-R1), an initiative to systematically reconstruct DeepSeek-R1’s data and training pipeline, validate its claims, and push the boundaries of open reasoning models. The motivation behind building Open-R1 is to provide transparency on how RL can enhance reasoning, share reproducible insights with the open-source community, and create a foundation for future models to leverage these techniques.

Objectives of Open-R1

  1. Reproducing R1-Distill Models: By distilling a high-quality reasoning dataset from DeepSeek-R1, Open-R1 aims to replicate the R1-Distill models faithfully.
  2. Replicating the RL Training Pipeline: A critical component of DeepSeek-R1 is its RL-based training methodology. Open-R1 will curate large-scale datasets for mathematics, reasoning, and code to enable this training process.
  3. Advancing Multi-Stage Training: Demonstrating the full transition from a base model through SFT to RL will be a key milestone, ensuring a reproducible and scalable methodology.
  • As shown in the figure below (source), here’s the Open-R1 plan:

Impact on the Community

  • Accessible Reasoning Models: Open-R1’s synthetic datasets will allow anyone to fine-tune existing or new LLMs for reasoning tasks simply by leveraging these datasets.
  • Open RL Recipes: The initiative will provide well-documented RL methodologies that can serve as a foundation for future research and experimentation.
  • Exploring Beyond Math: While mathematical reasoning is a primary focus, Open-R1 will explore extensions into other domains, including programming and scientific applications such as medicine, where reasoning models can make a significant impact.

Reasoning Datasets

  1. OpenThoughts: 114k samples distilled from R1 on math, code, and science.
  2. R1-Distill-SFT: 1.7M samples distilled from R1-32B on NuminaMath and Allen AI’s Tulu.

References