Overview

  • In 2017, researchers at OpenAI and DeepMind introduced learning from human preferences in the paper “Deep reinforcement learning from human preferences” (Christiano et al., 2017). The approach, now known as Reinforcement Learning from Human Feedback (RLHF), has since inspired extensive follow-up research.
  • The concept behind RLHF is straightforward yet powerful: it involves using a pretrained language model and having human evaluators rank its outputs. This ranking then informs the model to develop a preference for certain types of responses, leading to more reliable and safer outputs.
  • RLHF effectively leverages human feedback to enhance the performance of language models. It combines the strengths of Reinforcement Learning (RL) algorithms with the nuanced understanding of human input, facilitating continuous learning and improvement in the model.
  • Incorporating human feedback, RLHF not only improves the model’s natural language understanding and generation capabilities but also boosts its efficiency in specific tasks like text classification or translation.
  • Moreover, RLHF plays a crucial role in addressing bias within language models. By allowing human input to guide and correct the model’s language use, it fosters more equitable and inclusive communication. However, it’s important to be mindful of the potential for human-induced bias in this process.

Background: LLM Pre-Training and Post-Training

  • The training process of Large Language Models (LLMs) comprises two distinct phases: pre-training and post-training, each serving unique purposes in developing capable language models:

    1. Pre-training: This phase involves large-scale training where the model learns next token prediction using extensive web data. The dataset size often ranges in the order of trillions of tokens, including a mix of publicly available and proprietary datasets to enhance language understanding. The objective is to enable the model to predict word sequences based on statistical likelihoods derived from vast textual datasets.
    2. Post-training: This phase is intended to improve the model’s reasoning capability. It typically consists of two stages:
      • Stage 1: Supervised Fine-Tuning (SFT): The model is fine-tuned with supervised learning on a small amount of high-quality expert data, typically 10,000 to 100,000 prompt-response pairs covering instruction-following, question-answering, and chain-of-thought demonstrations. The objective is for the model to mimic expert demonstrations; because such expert data is limited, additional training approaches are needed.
      • Stage 2: RLHF: This stage refines the model by incorporating human preference data to train a reward model, which then guides the LLM’s learning through RL. RLHF aligns the model with nuanced human preferences, ensuring more meaningful, safe, and high-quality responses.
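The next-token prediction objective used in pre-training can be illustrated with a small sketch. The probabilities below are hypothetical values, not the output of any real model; the function only shows how the average negative log-likelihood (and the corresponding perplexity) is computed from the probability assigned to each observed token.

```python
import math

def next_token_nll(token_probs):
    """Average negative log-likelihood of the observed next tokens.

    token_probs[i] is the probability the model assigned to the token
    that actually appeared at position i (hypothetical values here).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy example: probabilities assigned to three observed tokens.
probs = [0.5, 0.25, 0.125]
loss = next_token_nll(probs)      # lower is better
perplexity = math.exp(loss)       # exponentiated average loss
```

Pre-training minimizes this quantity over trillions of tokens; SFT applies the same loss to the response tokens of curated prompt-response pairs.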

Refresher: Basics of Reinforcement Learning (RL)

  • Reinforcement Learning (RL) is based on the interaction between an agent and its environment.

  • In this interaction, the agent takes an action, and the environment responds with a state and a reward. Here’s a brief on the key terms:
    • The reward is the objective that we want to optimize.
    • A state is the representation of the environment/world at the current time index.
    • A policy is used to map from that state to an action.
  • A detailed discourse of RL is offered in our Reinforcement Learning primer.
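The agent-environment loop described above can be sketched in a few lines. `CoinFlipEnv` and `policy` are hypothetical names invented for this illustration; the environment has a single dummy state and rewards the agent for guessing the outcome of a biased coin.

```python
import random

class CoinFlipEnv:
    """Hypothetical one-state environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def reset(self):
        return 0  # the single dummy state

    def step(self, action):
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        return 0, reward  # (next state, reward)

def policy(state):
    """A fixed policy mapping the state to an action (always guess heads = 1)."""
    return 1

env = CoinFlipEnv()
state = env.reset()
total_reward = 0.0
for _ in range(1000):
    action = policy(state)            # agent acts
    state, reward = env.step(action)  # environment responds with state and reward
    total_reward += reward
average_reward = total_reward / 1000
```

An RL algorithm would replace the fixed `policy` with a parameterized one and adjust it to increase the average reward.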

Online vs. Offline Reinforcement Learning

Overview
  • Reinforcement learning can be broadly classified into two paradigms based on how the agent interacts with data and the environment: online RL and offline RL (also known as batch RL).

  • Online RL:

    • The agent actively interacts with the environment during training.
    • After taking an action, it immediately observes the new state and reward, then updates its policy accordingly.
    • Learning happens in real time — the policy evolves continuously as new experiences are collected.
    • Online RL is dynamic and adaptive, allowing exploration of unseen states but can be unstable or costly if environment interactions are expensive.
  • Offline RL (Batch RL):

    • The agent learns purely from a fixed dataset of past experiences, without additional interaction with the environment.
    • This dataset typically consists of tuples of the form (state, action, reward, next state), collected from human demonstrations, logged policies, or previous agents.
    • Since the agent cannot explore beyond the given data, it must balance generalization with the risk of overfitting or extrapolating to unseen actions.
    • Offline RL is especially valuable when environment interaction is expensive, risky, or infeasible (for example, autonomous driving, healthcare, or LLM preference learning).
Mathematical Distinction
  • In online RL, data is generated by the current policy, meaning the state-action distribution \(D_{\pi}\) depends on the policy being optimized. The objective being maximized is:
\[J(\pi) = \mathbb{E}_{(s, a) \sim D_{\pi}} [R(s, a)]\]
  • In offline RL, the dataset \(D_{\beta}\) is collected from a behavior policy \(\beta\), and optimization must be done off-policy:

    \[J(\pi) = \mathbb{E}_{(s, a) \sim D_{\beta}} \left[ \frac{\pi(a|s)}{\beta(a|s)} R(s, a) \right]\]
    • Here, the ratio \(\frac{\pi(a\mid s)}{\beta(a\mid s)}\) corrects for distribution mismatch between the current policy and the dataset. However, large discrepancies can cause instability or high variance in training. To mitigate this, offline RL often applies regularization that constrains the learned policy to remain close to the behavior policy.
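To make the distribution-mismatch correction concrete, the following sketch evaluates a target policy from data logged by a different behavior policy. All numbers are hypothetical: a single state, two actions, and fixed rewards. It compares the naive average over logged data with the importance-weighted average.

```python
import random

rewards = {"a": 1.0, "b": 0.0}     # R(s, a) for a single state
behavior = {"a": 0.5, "b": 0.5}    # beta(a|s): the logging policy
target = {"a": 0.9, "b": 0.1}      # pi(a|s): the policy being evaluated

rng = random.Random(0)
logged = rng.choices(list(behavior), weights=list(behavior.values()), k=20000)

# The naive average over logged actions estimates J(beta), not J(pi).
naive = sum(rewards[a] for a in logged) / len(logged)

# Weighting each sample by pi(a|s) / beta(a|s) corrects for the mismatch.
weighted = sum(target[a] / behavior[a] * rewards[a] for a in logged) / len(logged)

# Exact value of J(pi) for comparison.
exact = sum(target[a] * rewards[a] for a in target)
```

Here the weights are at most 1.8, so the estimate is well behaved. When \(\pi\) and \(\beta\) differ sharply, the weights become large and the variance grows, which is the instability noted above.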
In the Context of LLM Preference Optimization
  • For LLMs, online and offline RL determine how preference data and reward models are used to align models with human intent.
  • Offline RL (such as Direct Preference Optimization (DPO)) provides stable, efficient fine-tuning from pre-collected data, while online RL (such as Proximal Policy Optimization (PPO)) enables continual improvement through active interaction with a reward model. Hybrid systems blend both for balance and scalability.
Offline RL in LLMs
  • Definition in LLM Context:

    • Offline RL trains a language model from a fixed dataset of human or AI-labeled preferences, without interactive data collection.
    • Common examples include SFT and DPO.
  • Data Source:

    • The dataset contains (prompt, response, preference) triplets where human or AI annotators have pre-ranked model outputs.
  • Advantages:

    • Stable and deterministic: Training proceeds on a known dataset, ensuring reproducibility and smooth optimization.
    • Efficient and low-cost: Avoids the computational overhead of continuous environment interaction or online sampling.
    • Scalable: Enables parallel training across large datasets and hardware clusters.
    • Safe and controlled: Particularly suitable when online experimentation is risky (e.g., autonomous driving, healthcare, etc.).
  • Limitations:

    • No exploration: The model cannot discover new, improved responses outside the training data.
    • Distributional shift: The static dataset may not represent the full space of prompts or reasoning trajectories encountered in deployment.
    • Potential overfitting: The model might overalign to narrow stylistic patterns from annotators.
    • Limited adaptivity: Cannot respond dynamically to evolving human preferences or tasks.
  • Examples in Practice:

    • DPO: Uses static preference pairs to directly optimize policy likelihood ratios.
    • Offline preference optimization also underlies reward model pretraining in early RLHF pipelines.
Online RL in LLMs
  • Definition in LLM Context:

    • Online RL fine-tunes a model by generating new responses, evaluating them with a reward model, and updating parameters iteratively.
    • Implemented primarily via Proximal Policy Optimization (PPO) or a variant such as Group Relative Policy Optimization (GRPO).
  • Process:

    1. The current policy (LLM) generates multiple responses for each prompt.
    2. The reward model evaluates them based on preference alignment.
    3. The policy is updated to maximize expected reward under a KL-divergence constraint from the previous policy.
    4. The process repeats iteratively, allowing the model to explore and refine behavior.
  • Advantages:

    • Active exploration: The model can dynamically test new strategies and linguistic forms.
    • Continual learning: Allows fine-tuning for new domains or evolving user expectations.
    • Higher alignment fidelity: Produces nuanced, human-like outputs through iterative reward feedback.
    • Emergent capabilities: Encourages spontaneous reasoning and self-improvement beyond static data.
  • Limitations:

    • High computational cost: Requires repeated inference, evaluation, and backpropagation.
    • Stability challenges: Susceptible to reward hacking, over-optimization, or collapse without strong KL constraints.
    • Reward model dependency: Quality depends heavily on the accuracy and bias of the reward model.
    • Complex pipeline: Requires coordination between sampling, evaluation, and optimization processes.
  • Examples in Practice:

    • InstructGPT and ChatGPT: Train with PPO-based RLHF using reward models trained on human preference data.
    • Llama 4: Employs a continuous online RL loop for adaptive tuning with evolving data distributions.
Hybrid Approaches: Combining Offline and Online RL
  • Offline Phase:

    • Initialize the policy with SFT or DPO for baseline alignment and stability.
  • Online Phase:

    • Transition to PPO-based RLHF or online DPO to incorporate adaptive reward feedback.
  • Benefits:

    • Stability + Flexibility: Offline pretraining provides stable foundations; online RL refines adaptivity.
    • Efficiency: Reduces sample inefficiency by starting from an already competent policy.
    • Scalability: Enables modular training pipelines adaptable to new data and domains.
  • This hybrid strategy underpins the modern preference optimization stack for GPT-4, Claude 3, and Llama 4, where iterative, alternating offline and online loops achieve both safety and responsiveness.

Comparative Analysis
| Aspect | Online RL | Offline RL |
|---|---|---|
| Data Source | Generated in real time via interaction with environment or reward model | Fixed dataset of past experiences |
| Exploration | Active: generates novel responses | Passive: limited to existing samples |
| Adaptivity | Dynamic, continuously updated | Static, fixed during training |
| Stability | Prone to instability; requires KL regularization | Stable and reproducible |
| Cost | High: repeated inference, sampling, and evaluation | Low: efficient batch training |
| Reward Dependence | Strong (reward model critical for success) | Optional: uses preference pairs directly |
| Sample Efficiency | Lower (requires many rollouts) | Higher (reuses data fully) |
| Risk of Overfitting | Low: dynamic sampling diversifies data | Higher: risk from fixed dataset |
| Scalability | Limited by compute and latency | Easily parallelizable |
| Examples | PPO (InstructGPT, ChatGPT, Llama 4) | DPO, SFT, Reward Model Pretraining |
| Best Used For | Fine-tuning and adaptive alignment | Baseline alignment and safe pretraining |
Intuitive Analogy
  • Offline RL is like a student studying from a fixed textbook — learning efficiently from known examples but unable to ask new questions.
  • Online RL is like a student in an interactive class — they can ask questions, receive feedback, and adjust their understanding dynamically.
  • The best systems — like hybrid RLHF pipelines — combine both: first learning the textbook thoroughly, then refining understanding through interactive dialogue with a teacher.
Offline vs. Online RL: REINFORCE, TRPO, PPO, DPO, KTO, and GRPO
REINFORCE

####### Overview

  • REINFORCE, introduced by Williams (1992), is one of the earliest and simplest policy gradient algorithms. It directly optimizes a parameterized policy \(\pi_\theta(a \mid s)\) by estimating the gradient of the expected return with respect to the policy parameters. The gradient used in the update is:
\[\nabla_\theta J(\theta) = \mathbb{E}_{s, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \, (R - b) \right]\]
  • where:
    • \(R\) is the total return (sum of discounted rewards),
    • \(b\) is a baseline (e.g., a value function) to reduce variance.
  • A detailed discourse on REINFORCE can be obtained in the REINFORCE Algorithm section.
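As an illustration of the gradient estimate above, here is a minimal sketch of REINFORCE on a two-armed bandit with a softmax policy. The arm reward probabilities, learning rate, and running-average baseline are assumptions chosen for the example; each episode is a single step, so the return \(R\) is just the immediate reward.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]           # one logit per action
    mean_reward = [0.2, 0.8]     # hypothetical success probability of each arm
    baseline = 0.0               # running average of returns, plays the role of b
    for t in range(1, steps + 1):
        probs = softmax(theta)
        # On-policy: the action is sampled from the current policy.
        action = rng.choices([0, 1], weights=probs)[0]
        ret = 1.0 if rng.random() < mean_reward[action] else 0.0
        # For a softmax policy, d/d theta_k log pi(action) = 1[k == action] - probs[k].
        for k in range(2):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * grad_log * (ret - baseline)
        baseline += (ret - baseline) / t
    return softmax(theta)

final_probs = reinforce_bandit()
```

After training, `final_probs` should place most of its mass on the better arm. Every sample is used for exactly one update and then discarded, which is the on-policy behavior discussed in this section.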

####### Online vs. Offline (On-Policy vs. Off-Policy)

  • REINFORCE is a fully online, on-policy algorithm.

  • Why It’s Online:

    • REINFORCE requires continuous interaction with the environment to collect fresh trajectories under the current policy \(\pi_\theta\).
    • After each gradient update, the policy changes, and therefore new rollouts must be sampled to reflect this updated policy behavior.
    • The training loop alternates between:
      1. Collecting trajectories using \(\pi_\theta\),
      2. Computing returns (discounted cumulative rewards), and
      3. Updating parameters using those returns as the learning signal.
    • This direct feedback loop makes REINFORCE inherently online, since learning and data generation occur simultaneously.
    • There is no fixed dataset or static buffer — the model learns only from its most recent interactions.
  • Why It’s On-Policy:

    • The REINFORCE gradient estimate \(\nabla_\theta J(\theta) = \mathbb{E}_{s, a \sim \pi_\theta} [\nabla_\theta \log \pi_\theta(a \mid s) (R - b)]\) explicitly depends on samples drawn from the same policy \(\pi_\theta\) being optimized.
    • Because of this dependency, trajectories generated under older versions of the policy \(\pi_{\theta_\text{old}}\) cannot be reused, as their action probabilities differ from those of the updated policy.
    • There is no correction term such as an importance ratio \(\frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}\) to account for this mismatch.
    • Reusing old trajectories would therefore produce a biased gradient estimate, leading the optimizer to update toward the wrong objective.

####### Takeaways

| Aspect | REINFORCE |
|---|---|
| Policy Type | On-policy |
| Data Source | Trajectories from the current policy |
| Reuse of Data | Not possible |
| Stability | High variance, unstable without baselines or variance reduction |
| Motivation for Successors | TRPO and PPO were developed to improve REINFORCE’s stability and sample efficiency |
Trust Region Policy Optimization (TRPO)

####### Overview

  • Trust Region Policy Optimization (TRPO), introduced by Schulman et al. (2015), was designed to improve upon REINFORCE and vanilla policy gradient methods by ensuring more stable and monotonic policy improvement.
  • It does this by constraining each policy update within a “trust region,” preventing large, destabilizing parameter shifts. The optimization problem is:

    \(\max_{\theta} \mathbb{E}_{s, a \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} A^{\pi_{\theta_\text{old}}}(s, a) \right]\)

    • subject to \(D_{KL}(\pi_{\theta_\text{old}} \,\|\, \pi_\theta) \leq \delta\)
    • where the KL constraint limits how far the new policy may deviate from the old one.
  • A detailed discourse on TRPO can be obtained in the Trust Region Policy Optimization (TRPO) section.
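The constrained problem above can be illustrated with a toy sketch for a single state with discrete actions. This is not the actual TRPO procedure, which works in parameter space with a conjugate-gradient step followed by a line search; `trust_region_step` is a hypothetical helper that simply backtracks from a candidate distribution toward the old one until the KL constraint holds and the surrogate improves.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def surrogate(new_probs, old_probs, advantages):
    """Expectation under the old policy of ratio * advantage, for one state."""
    return sum(old * (new / old) * adv
               for new, old, adv in zip(new_probs, old_probs, advantages))

def trust_region_step(old_probs, candidate_probs, advantages, delta=0.01):
    """Backtrack toward old_probs until KL(old || new) <= delta and the surrogate improves."""
    base = surrogate(old_probs, old_probs, advantages)
    step = 1.0
    for _ in range(20):
        mixed = [(1 - step) * o + step * c
                 for o, c in zip(old_probs, candidate_probs)]
        if (kl_divergence(old_probs, mixed) <= delta
                and surrogate(mixed, old_probs, advantages) > base):
            return mixed
        step *= 0.5  # shrink the step and try again
    return list(old_probs)  # no acceptable step found

old = [0.5, 0.5]
new = trust_region_step(old, candidate_probs=[0.9, 0.1], advantages=[1.0, -1.0])
```

With these hypothetical numbers the full step to `[0.9, 0.1]` violates the constraint, and the accepted policy is a much smaller move toward the higher-advantage action.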

####### Online vs. Offline (On-Policy vs. Off-Policy)

  • TRPO is a fully online, on-policy algorithm.

  • Why It’s Online:

    • The policy must actively interact with the environment to collect trajectories under the current policy \(\pi_{\theta_\text{old}}\).
    • After every update, the parameters change, so the distribution over states and actions changes as well.
    • Consequently, the algorithm must collect fresh rollouts after each update so that the advantage estimates \(A^{\pi_{\theta_\text{old}}}(s, a)\) and the surrogate objective remain valid.
    • Batches from earlier iterations are not carried over, since both the advantages and the KL constraint are defined relative to the policy that generated the current batch.
  • Why It’s On-Policy:

    • The surrogate objective is an expectation over samples drawn from \(\pi_{\theta_\text{old}}\), the policy immediately preceding the update.
    • The ratio \(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\) corrects only for the small shift between \(\pi_{\theta_\text{old}}\) and the candidate \(\pi_\theta\) within a single update, and the KL constraint keeps that shift small.
    • Trajectories from much older policies would lie outside the trust region, where the surrogate is no longer a reliable approximation of the true objective.
    • Therefore, TRPO discards its batch after each update and re-samples from the updated policy.
  • Why It’s Not Off-Policy:

    • Off-policy algorithms (like Q-learning, DDPG, or SAC) can train on data collected by any behavior policy, often stored in a replay buffer.
    • TRPO cannot do this because:

      • Its ratio is meant to stay close to 1; for data from a distant behavior policy the ratio would have high variance.
      • Its improvement guarantee holds only locally, within the trust region around \(\pi_{\theta_\text{old}}\).
      • Using stale or external data would violate these assumptions and can yield poor or unstable updates.
    • Therefore, TRPO is an on-policy method: data from older policies is discarded after each update.

####### Takeaways

| Aspect | TRPO |
|---|---|
| Policy Type | On-policy |
| Data Source | Trajectories from the current (old) policy |
| Reuse of Data | None; requires new rollouts per update |
| Role of Policy Ratio | Corrects for minor distribution shift within one update |
| Constraint | KL-divergence trust region |
| Stability | Much higher than REINFORCE, with guaranteed monotonic improvement under certain assumptions |
Proximal Policy Optimization (PPO)

####### Overview

  • Proximal Policy Optimization (PPO), proposed by Schulman et al. (2017), is a simplified and more practical variant of TRPO. It maintains TRPO’s core idea of constraining policy updates but replaces the complex constrained optimization with a clipped surrogate objective that is easier to implement and compute.

  • The PPO objective is:

\[L^{\text{CLIP}}(\theta) = \mathbb{E}_{t}\left[ \min\left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t \right) \right]\]
  • where:
    • \(r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) is the policy ratio.
    • \(A_t\) is the advantage estimate.
    • \(\epsilon\) is a small clipping parameter (e.g., 0.1–0.3) that prevents the ratio from moving too far away from 1.
  • A detailed discourse on PPO can be obtained in the Proximal Policy Optimization (PPO) section.
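The clipped surrogate can be written directly from the formula above. The sketch below is a plain-Python illustration with hypothetical ratios and advantages; in practice the same computation is applied to tensors and averaged over a batch.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Single-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch average of the clipped surrogate (to be maximized)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Four hypothetical samples covering each combination of ratio side and advantage sign.
ratios = [1.5, 0.5, 1.5, 0.5]
advantages = [1.0, 1.0, -1.0, -1.0]
objective = ppo_clip_objective(ratios, advantages)
```

The four samples show the asymmetry of the `min`: with a positive advantage the gain is capped once the ratio exceeds \(1 + \epsilon\), while with a negative advantage a ratio above \(1 + \epsilon\) is still penalized in full.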

####### Online vs. Offline (On-Policy vs. Off-Policy)

  • PPO is a fully online, on-policy algorithm.

  • Why It’s Online:

    • PPO learns directly from interactions with the environment.
    • During each policy update, the model collects fresh trajectories (state–action–reward sequences) using the most recent version of the policy \(\pi_{\theta_{\text{old}}}\).
    • After computing the advantage estimates and performing several epochs of optimization on this batch, the old data are discarded, and the environment is rolled out again using the updated policy \(\pi_\theta\).
    • This iterative sampling process ensures that PPO continuously explores and learns from up-to-date behavior data, rather than relying on static or historical samples.
  • Why It’s On-Policy:

    • PPO’s gradient updates depend on trajectories drawn from the same policy (or a very recent one) being optimized.
    • The presence of the policy ratio \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) may seem reminiscent of off-policy correction, but in PPO it only compensates for small distribution shifts between successive policies — not for large mismatches that would occur if reusing old or off-policy data.
    • Because of this, PPO cannot safely reuse data from past iterations or other policies. Reusing old trajectories would bias the gradient, since the expectation \(\mathbb{E}_{s,a \sim \pi_\theta}\) would no longer reflect the distribution under the current policy.
    • Thus, PPO maintains the key property of being on-policy, updating the model only with samples that accurately represent the behavior of the current (or just-previous) policy.

####### Why PPO Is Still On-Policy

  • The clipping mechanism only allows small policy updates (similarly to TRPO’s trust region), which means \(\pi_{\theta}\) stays close to \(\pi_{\theta_\text{old}}\).
  • The ratio term \(r_t(\theta)\) corrects for slight distributional differences between successive policies within an update, but it does not support learning from data generated by unrelated or much older policies.
  • Hence, PPO cannot reuse large offline datasets or a replay buffer, as that would violate the assumption that samples are representative of the current policy’s behavior.

####### Why It’s Sometimes Confused with Off-Policy Methods

  • PPO can perform multiple epochs of optimization on the same batch of on-policy data, which gives the impression of reusing samples.
  • However, this reuse happens only within the same policy iteration and remains valid because the data still originate from \(\pi_{\theta_\text{old}}\).

####### Takeaways

| Aspect | PPO |
|---|---|
| Policy Type | On-policy |
| Data Source | Trajectories from the current (old) policy |
| Data Reuse | Limited (within one batch only) |
| Ratio Role | Corrects for minor distribution shift within a single update |
| Update Constraint | Implicit via clipping, not explicit KL bound |
| Practical Advantage | Simpler, stable, and widely used in LLM and RLHF training |
Direct Preference Optimization (DPO)

####### Overview

  • Direct Preference Optimization (DPO), introduced by Rafailov et al. (2023), is a method designed for fine-tuning large language models (LLMs) directly from human preference data.
  • Unlike RLHF methods such as PPO-based training, DPO does not require an explicit reward model or reinforcement learning loop. Instead, it formulates a closed-form objective that aligns the model’s output probabilities with human preferences.

  • The DPO objective can be written as:
\[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \left( \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} - \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)} \right) \right) \right]\]
  • where:

    • \((x, y^+, y^-)\) are prompt–preferred–dispreferred triples from preference data,
    • \(\pi_{\text{ref}}\) is the reference model (often the supervised fine-tuned model, SFT),
    • \(\beta\) is a temperature-like scaling parameter.
  • A detailed discourse on DPO can be obtained in the Direct Preference Optimization (DPO) section.
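For a single preference triple, the DPO loss reduces to a few arithmetic operations on sequence log-probabilities. The sketch below assumes those log-probabilities have already been computed by the policy and the reference model; the numeric values and the function name `dpo_loss` are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for one (x, y+, y-) triple, given sequence log-probabilities."""
    pos_log_ratio = policy_logp_pos - ref_logp_pos   # log pi_theta(y+|x) / pi_ref(y+|x)
    neg_log_ratio = policy_logp_neg - ref_logp_neg   # log pi_theta(y-|x) / pi_ref(y-|x)
    return -log_sigmoid(beta * (pos_log_ratio - neg_log_ratio))

# Policy identical to the reference: the margin is zero and the loss is log 2.
loss_at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has raised y+ and lowered y- relative to the reference: the loss drops.
loss_improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Training starts at `loss_at_init` when \(\pi_\theta\) is initialized from \(\pi_{\text{ref}}\), and the loss falls as the policy widens the log-ratio margin between preferred and dispreferred responses.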

####### Online vs. Offline (On-Policy vs. Off-Policy)

  • DPO is a fully offline, off-policy alignment method.

  • Why It’s Offline:

    • DPO trains entirely on a fixed dataset of human preferences — consisting of prompt–response pairs labeled as preferred (\(y^+\)) or dispreferred (\(y^-\)).
    • These datasets are collected prior to optimization, typically using human annotators or preference models (e.g., from the Anthropic HH dataset or OpenAI’s RLHF pipeline).
    • During training, the model computes gradients over this static dataset — there is no environment interaction or dynamic sampling from the current model \(\pi_\theta\).
    • All optimization steps are performed offline using pre-existing pairs, without requiring rollouts or iterative feedback.
  • Why It’s Off-Policy:

    • The model being trained, \(\pi_\theta\), does not generate the samples used in training — they come from a reference model \(\pi_{\text{ref}}\) (often the supervised fine-tuned model, SFT).
    • The DPO loss includes a log-ratio, \(\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\), which measures how far the new model has moved from the reference model on each response and, scaled by \(\beta\), acts as an implicit reward.
    • This ratio keeps optimization anchored to the reference model even though the data are drawn from a different distribution. It is loosely reminiscent of importance sampling in reinforcement learning, although it is not an importance weight that makes the estimate unbiased.
    • Because DPO never samples new data from the current policy, it operates purely off-policy — all learning happens with respect to static preference data.

####### Comparison to PPO/RLHF

  • In PPO-based RLHF, the model learns from online rollouts — each policy update collects new samples.
  • In contrast, DPO optimizes a deterministic preference objective directly over existing data, without sampling new trajectories.
  • This makes DPO far more efficient and simpler, but potentially less adaptive, since it can’t explore new regions of the output space beyond what’s in the dataset.

####### Takeaways

| Aspect | DPO |
|---|---|
| Policy Type | Off-policy (offline) |
| Data Source | Fixed human preference dataset |
| Data Reuse | Full reuse possible |
| Ratio Role | Reweights model likelihoods relative to reference model |
| Environment Interaction | None (purely offline) |
| Advantage | No reward model or rollout generation required |
Kahneman–Tversky Optimization (KTO)

####### Overview

  • Kahneman–Tversky Optimization (KTO), proposed by Ethayarajh et al. (2024), is an alignment method inspired by prospect theory from behavioral economics.
  • Instead of maximizing log-likelihoods of preferences (as DPO does), KTO directly maximizes the subjective human utility of model generations under the Kahneman–Tversky value function — a nonlinear, asymmetric function reflecting human biases such as risk aversion and loss aversion.

  • The KTO objective is derived as a Human-Aware Loss (HALO), a family of alignment objectives that incorporate human-like value functions.
  • The canonical loss function is:

    \[L_{\text{KTO}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{x, y \sim \mathcal{D}}[\lambda_y - v(x, y)]\]
    • where \(v(x, y)\) is a Kahneman–Tversky-like value function that depends on:
      • \(r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\),
      • a reference point \(z_0 = D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})\),
      • and asymmetric coefficients \(\lambda_D, \lambda_U\) for desirable vs. undesirable samples.
  • KTO replaces the power-law utility curve from prospect theory with a logistic function, stabilizing training while preserving its concavity (risk aversion in gains) and convexity (risk seeking in losses).
  • A detailed discourse on KTO can be obtained in the Kahneman-Tversky Optimization (KTO) section.
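A minimal sketch of the loss above follows, assuming the piecewise logistic form of the value function: \(\lambda_D\,\sigma(\beta(r_\theta - z_0))\) for desirable examples and \(\lambda_U\,\sigma(\beta(z_0 - r_\theta))\) for undesirable ones. The scaling parameter `beta`, the function names, and all numeric inputs are assumptions for illustration; here \(z_0\) is passed in as a constant rather than estimated from a batch.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def kto_value(log_ratio, z0, desirable, beta=1.0, lambda_d=1.0, lambda_u=1.0):
    """Assumed logistic value function, measured relative to the reference point z0."""
    if desirable:
        return lambda_d * sigmoid(beta * (log_ratio - z0))
    return lambda_u * sigmoid(beta * (z0 - log_ratio))

def kto_loss(examples, z0, beta=1.0, lambda_d=1.0, lambda_u=1.0):
    """Mean of (lambda_y - v(x, y)) over (log_ratio, desirable) pairs."""
    total = 0.0
    for log_ratio, desirable in examples:
        lam = lambda_d if desirable else lambda_u
        total += lam - kto_value(log_ratio, z0, desirable, beta, lambda_d, lambda_u)
    return total / len(examples)

# Hypothetical log-ratios r_theta(x, y) with binary labels.
at_reference = kto_loss([(0.5, True)], z0=0.5)    # log-ratio equals the reference point
good_raised = kto_loss([(2.5, True)], z0=0.5)     # desirable output made more likely
bad_raised = kto_loss([(2.5, False)], z0=0.5)     # undesirable output made more likely
```

Each example needs only a binary label rather than a paired comparison, and setting \(\lambda_D \neq \lambda_U\) weights desirable and undesirable examples asymmetrically.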

####### Online vs. Offline (On-Policy vs. Off-Policy)

  • KTO is a fully offline, off-policy method.

  • Why It’s Offline:

    • KTO does not require any interactive rollouts or online sampling. Instead, it trains entirely from a fixed dataset of labeled examples (each labeled “desirable” vs. “undesirable”) drawn from human annotations or derived feedback.
    • Because no new model outputs or environment interactions are needed during training, KTO is compatible with settings where data collection is costly or infeasible.
    • The entire optimization is performed on static data, making the training process reproducible and deterministic.
    • This offline nature distinguishes KTO from RL-based policies that require new sample generation at each step.
  • Why It’s Off-Policy:

    • The samples used for training KTO are not generated by the policy being optimized, \(\pi_\theta\). Rather, they come from another model or human annotation procedure, often via a reference distribution \(\pi_{\text{ref}}\).
    • KTO incorporates the log-ratio \(r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\) to score each example according to how the new policy diverges from the reference. This term functions loosely as a distribution-shift correction factor (akin to importance sampling) in the offline setting.
    • Because the policy never actually generates the samples it trains on, KTO is classified as off-policy — it learns from data produced by another distribution or past policy.
  • Thus, KTO operates much like DPO — both are offline alignment algorithms, but KTO learns from binary signals rather than pairwise preferences.

####### Takeaways

| Aspect | KTO |
|---|---|
| Policy Type | Off-policy (offline) |
| Data Source | Fixed binary feedback dataset (desirable vs. undesirable) |
| Data Reuse | Full reuse possible |
| Ratio Role | Reweights model likelihoods relative to reference policy using a prospect-theoretic value function |
| Environment Interaction | None (purely offline) |
| Advantage | Human-aware utility maximization without a reward model or rollouts; captures loss/risk aversion |
Group Relative Policy Optimization (GRPO)

####### Overview

  • Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm introduced by the DeepSeek-AI team in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models by Shao et al. (2024). It is designed as a lightweight and memory-efficient variant of PPO (Proximal Policy Optimization) that removes the need for a separate critic (value) network, thereby simplifying the training pipeline and reducing computational cost.

  • The main idea is to estimate the baseline not with a learned value model but from relative group scores of multiple sampled outputs. This allows GRPO to leverage intra-group comparison instead of value function estimation, aligning well with how reward models are typically trained on relative preference data (e.g., “A is better than B”).

  • In PPO, the objective function is:

    \[J_{\text{PPO}}(\theta) = \mathbb{E}\left[ \min\left( r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t \right)\right]\]
    • where \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) is the policy ratio and \(A_t\) is the advantage estimated using a critic.
  • In GRPO, the critic is replaced by group-based normalization. For each question \(q\), a group of outputs \(\{o_1, \ldots, o_G\}\) is sampled from the old policy \(\pi_{\theta_{\text{old}}}\). Rewards \(r_i\) are assigned to each output by a reward model, and their normalized difference defines the group-relative advantage:

\[\hat{A}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}\]
  • The GRPO objective is then:

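    \[J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( r_{i,t}(\theta)\hat{A}_{i,t}, \text{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_{i,t} \right) - \beta D_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}] \right) \right]\]
    • where \(r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}\) is the token-level policy ratio (distinct from the scalar rewards \(r_i\)), \(\hat{A}_{i,t} = \hat{A}_i\) for every token of output \(o_i\) when rewards are assigned per output, and the KL divergence term, subtracted with weight \(\beta\), regularizes the new policy against a reference model (typically the SFT model).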
  • As the paper’s side-by-side comparison of PPO and GRPO shows, GRPO forgoes the value/critic model and instead estimates the baseline from group scores, significantly reducing training resources.
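The group-relative advantage is simple enough to compute by hand. The sketch below normalizes the rewards of one group of sampled outputs; the use of the population standard deviation and the small `eps` added to the denominator are assumptions made to keep the example well defined when all rewards in a group are equal.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r_i - mean(r)) / std(r)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for G = 4 outputs sampled for the same question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every output receives the same reward, no output is preferred.
flat = group_relative_advantages([0.3, 0.3, 0.3, 0.3])
```

Outputs scoring above the group mean receive positive advantages and those below receive negative ones, so the group itself acts as the baseline that a critic would otherwise provide.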

####### Online vs. Offline (On-Policy vs. Off-Policy)

  • GRPO is an on-policy, online reinforcement learning method.

  • Why It’s Online:

    • GRPO operates through iterative reinforcement learning updates, where the model continuously interacts with its environment or task distribution to collect new samples.
    • At each iteration, new rollouts (model-generated responses) are produced from the current or recent policy \(\pi_{\theta_{\text{old}}}\).
    • These responses are grouped per prompt (e.g., multiple sampled outputs for the same question), scored by a reward model, and then used to update the new policy \(\pi_\theta\).
    • Because GRPO depends on these fresh generations to estimate group-relative advantages, the algorithm inherently requires online interaction — it cannot rely solely on static data.
  • Why It’s On-Policy:

    • GRPO updates the policy using trajectories sampled directly from the current policy (or a very recent version of it).
    • The ratio \(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\) is computed within each update step, correcting for only the small distribution shift between successive policies.
    • Old data cannot be reused indefinitely, because the group-relative normalization and clipped objective assume statistical proximity between \(\pi_{\theta_{\text{old}}}\) and \(\pi_\theta\).
    • In its iterative variant, GRPO also periodically retrains the reward model on newly collected generations and resets the reference model to the current policy, keeping both aligned with the most recent policy behavior.
  • Thus, GRPO belongs firmly to the online (on-policy RL) family — much like PPO — but distinguishes itself through its group-based normalization, which removes the need for a critic network while maintaining stability and efficiency.

####### Takeaways

| Property | GRPO |
| --- | --- |
| Policy Type | On-policy (online) |
| Baseline | Group-average reward (no critic) |
| Data Source | New samples from current policy |
| KL Regularization | Explicit penalty term |
| Reward Signal | Outcome or process-based reward models |
| Compute Efficiency | High (no value model) |
| Alignment Domain | Mathematical reasoning (generalizable) |

Comparative Analysis: REINFORCE, TRPO, PPO, DPO, KTO, GRPO
  • The table below contrasts the algorithms on policy type (online/on-policy vs. offline/off-policy), what data they train on, how they handle distribution shift (ratio/reweighting), their stability constraint (KL / clipping / none), and why they fall into the online/offline bucket.

| Method | Policy Type | Trains On (Data Source) | Distribution-Shift Term | Stability / Regularization | Why Online vs. Offline |
| --- | --- | --- | --- | --- | --- |
| REINFORCE | On-policy (online) | Fresh rollouts from the current policy \(\pi_\theta\) | None (uses \(\nabla \log \pi_\theta\)) | Optional baselines for variance reduction | Needs trajectories sampled under the current policy each update; old data would bias the gradient. |
| TRPO | On-policy (online) | Rollouts from \(\pi_{\theta_\text{old}}\) per iteration | Policy ratio \(r=\frac{\pi_\theta}{\pi_{\theta_\text{old}}}\) | Hard trust region via KL constraint | Requires new trajectories after each update; ratio only corrects the small shift within an iteration, not replay from old policies. |
| PPO | On-policy (online) | Rollouts from \(\pi_{\theta_\text{old}}\); reused for a few epochs | Policy ratio \(r=\frac{\pi_\theta}{\pi_{\theta_\text{old}}}\) | Clipping of \(r\) (and often a KL penalty) | Still needs fresh batches every iteration; clipping assumes small policy drift, not arbitrary offline reuse. |
| DPO | Off-policy (offline) | Fixed pre-collected preferences \((x, y^+, y^-)\) | Reference-relative log-ratio \(\log\frac{\pi_\theta}{\pi_{\text{ref}}}\) inside a logistic margin | Implicit via temperature \(\beta\) (reference anchoring) | Optimizes a closed-form objective over a static dataset; no environment rollouts. |
| KTO | Off-policy (offline) | Fixed binary feedback (desirable vs. undesirable) | Reference-relative log-ratio \(r_\theta=\log\frac{\pi_\theta}{\pi_{\text{ref}}}\) with a reference point \(z_0\) | Prospect-theoretic value function (logistic), acts like a KL-anchored utility; no rollouts | Trains entirely on static labeled data; maximizes human utility under a HALO objective; no online sampling. |
| GRPO | On-policy (online) | New groups of samples per prompt from \(\pi_{\theta_\text{old}}\) | Policy ratio at token level (PPO-style); group-relative advantages | Explicit KL penalty vs. reference; no critic (baseline = group mean) | Requires sampling groups each step and uses reward-model scores; on-policy RL with reduced memory (critic-free). |

  • Takeaways:

    • DPO vs. KTO (both offline): DPO maximizes a preference likelihood margin against a reference model; KTO maximizes a prospect-theoretic utility using a logistic value function with a reference point \(z_0\). Both use ratios against \(\pi_{\text{ref}}\) as reweighting factors and train without rollouts.
    • GRPO vs. PPO (both online): GRPO removes the critic/value model and computes group-relative advantages from multiple sampled outputs for the same prompt, plus an explicit KL penalty—yielding an actor-only, memory-efficient PPO variant. Iterative GRPO can also refresh the reward model and reference policy during training.

Reinforcement Learning from Human Feedback (RLHF)

Motivation and Background

  • LLMs trained with next-token prediction objectives are highly proficient at generating fluent text. However, this training alone does not ensure that the outputs are aligned with human values such as helpfulness, harmlessness, and honesty. These models may generate plausible-sounding but untruthful, unsafe, or unhelpful responses if left unguided.
  • To address this gap, Reinforcement Learning from Human Feedback (RLHF) was introduced. RLHF provides a framework for aligning model outputs with human preferences by using human-generated signals to guide model behavior. It has become a central technique in aligning instruction-following models such as InstructGPT and ChatGPT.
  • Put simply, RLHF enables models to go beyond merely predicting likely text, aligning their behavior with nuanced human expectations through a structured feedback loop. By incorporating direct human input at multiple stages—demonstration, comparison, and reward-based reinforcement—it provides a scalable and principled approach to model alignment, forming the backbone of modern instruction-following language models.

Method Overview

  • In RLHF, the LLM is treated as a policy \(\pi_\theta(y \mid x)\) that generates a response \(y\) to a given prompt \(x\). The objective is to adjust the parameters \(\theta\) so that the model maximizes a reward signal that reflects human judgments of response quality:

    \[\max_{\theta} \mathbb{E}_{x \sim D_{\text{prompt}}, y \sim \pi_\theta(\cdot \mid x)} \left[ r(x, y) \right]\]
    • where \(r(x, y)\) is a reward function, typically learned from human-labeled comparison data, that evaluates how well a response \(y\) aligns with human preferences for a given prompt \(x\).

Overall Process

  1. Collect Demonstration Data and Train a Supervised Policy

    • A human labeler provides ideal responses (demonstrations) to prompts.
    • The model is fine-tuned via supervised learning (also called Supervised Fine-Tuning, or SFT) to mimic these human demonstrations.
  2. Collect Comparison Data and Train a Reward Model

    • The model generates multiple candidate responses to a prompt.
    • Human labelers rank these responses based on alignment with criteria like helpfulness, safety, and relevance.
    • A reward model is trained to predict these rankings, typically using between 100,000 and 1 million comparison data points.
  3. Optimize the Policy Using Reinforcement Learning

    • The model is further trained using reinforcement learning (commonly with Proximal Policy Optimization, or PPO) to maximize the reward assigned by the reward model.
    • This phase usually involves 10,000 to 100,000 prompt-response training iterations.
  • Another helpful summary of the full RLHF pipeline is provided in this flowchart by Chip Huyen:

Model Roles

  • To implement the RLHF pipeline effectively, several models are employed in distinct but interdependent roles. Each contributes to a part of the reward-driven learning loop, from generating responses to evaluating and optimizing them:

    • Policy model: The main LLM we wish to optimize (parameterized by \(\theta\)). It functions as the environment’s actor, generating responses, and is fine-tuned via policy optimization techniques (e.g., PPO).

    • Reference model: A frozen or slowly-updated baseline version of the policy (or a supervised fine-tuned model) used to compute KL or likelihood penalties to ensure the optimized policy does not diverge too far from acceptable behaviours.

    • Value model: A model that estimates the expected return (value) of a given prompt-response pair or sequence, often used to compute advantage estimates in actor–critic style updates.

    • Reward model: A separate model trained (often via human preference data or comparisons) to map a prompt-response pair \((x,y)\) to a scalar reward \(r(x,y)\). It encapsulates human or designer preferences and drives the optimization of the policy model.

  • In typical LLM fine-tuning pipelines, the flow is:

    1. The policy model generates responses.
    2. The reward model scores them.
    3. The value model estimates future return or baseline.
    4. A reference model imposes a divergence penalty or acts as a safe anchor.
    5. Using a policy-optimization algorithm (e.g., Proximal Policy Optimization) the policy model is updated to increase rewards while constraining divergence from the reference.
  • For example:

    \[L_{\text{PPO}}(\theta) \approx \mathbb{E}_{(x,y)\sim \pi_{\theta_{\text{old}}}} \left[ \min\Big(r_{\theta}(x,y) \hat A(x,y), \mathrm{clip}\big(r_{\theta}(x,y),1-\epsilon,1+\epsilon\big)\hat A(x,y)\Big) \right]\]
    • where \(r_{\theta}(x,y) = \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)}\) is the policy ratio against the policy that generated the samples (the frozen reference model enters only through the KL penalty), and \(\hat A(x,y) = r(x,y) - V_\phi(x)\) is the advantage estimated using the value model. This echoes standard RL policy-gradient theory, tailored to LLM response generation.
  • Refer to Fine-Tuning Language Models with Reward Learning on Policy by Lang et al. (2024) for a more formal treatment.
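As a rough illustration of the clipped objective and the advantage \(\hat A(x,y) = r(x,y) - V_\phi(x)\), the sketch below evaluates the surrogate for a single response. The function names, the default \(\epsilon = 0.2\), and all numeric inputs are assumptions chosen for the example; a real implementation would operate on batched tensors and per-token log-probabilities.

```python
import math

def advantage(reward, value):
    """Sequence-level advantage: reward-model score minus the value baseline."""
    return reward - value

def clipped_surrogate(logp_new, logp_old, adv, epsilon=0.2):
    """PPO clipped surrogate for one sample.

    logp_new / logp_old are log-probabilities of the same response under the
    updated policy and the policy that generated it (hypothetical inputs).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * adv, clipped_ratio * adv)

adv = advantage(reward=1.0, value=0.4)  # positive advantage of 0.6
# The ratio exp(1) is far above 1 + epsilon, so the clipped term is selected.
objective = clipped_surrogate(logp_new=-9.0, logp_old=-10.0, adv=adv)
print(objective)
```

With a positive advantage, the clip removes any incentive to push the ratio beyond \(1+\epsilon\); with a negative advantage and a large ratio, the unclipped (more pessimistic) term is kept, which is what taking the minimum achieves.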

Policy Model

  • The policy model in an RLHF–style setup is the LLM that we treat as a policy \(\pi_{\theta} (y \mid x)\), parameterized by \(\theta\), which given an input prompt \(x\) produces a response \(y\). This section covers its function, typical architecture, training data, and model size considerations.
  • The policy model is the central actor in the RLHF pipeline: it generates responses to prompts and is updated to align with human preferences. It carries the full representational capacity of a large LLM architecture, is trained in multiple phases (pretraining \(\rightarrow\) SFT \(\rightarrow\) RLHF), and must be large enough to enable high-quality responses while still being trainable. Its design must support computing log-probabilities and KL divergences, and it must interoperate with the reward and value models.
Function
  • The policy model is the agent that interacts with the “environment” by generating outputs (responses \(y\)) to prompts \(x\).
  • Its objective is to maximize a reward signal \(r(x,y)\), subject to constraints or regularization (for example via KL-divergence to a reference policy).
  • Formally, the objective can be written as:

    \[\max_{\theta} \mathbb{E}_{x\sim D_{\rm prompt},y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)-\beta \mathrm{KL}\big(\pi_\theta(\cdot\mid x)\Vert\pi_{\rm ref}(\cdot\mid x)\big)\big]\]
    • where \(\pi_{\rm ref}\) is a reference model and \(\beta\) is a regularization coefficient.
  • During training, the policy model generates responses, receives reward model scores or value-model feedback, and is updated (often via algorithms like Proximal Policy Optimization). The policy model thus evolves from a “supervised fine-tuned” base model into a behaviour-aligned model.
  • The policy model must balance helpfulness, accuracy, safety, and alignment (for example to human preferences). See, for example, the instruct-tuning phase described in Ouyang et al. (2022) (“Training language models to follow instructions with human feedback”).
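One common way to realize the KL-regularized objective above is to fold a sample-based KL estimate into the reward. The sketch below is a simplified illustration under stated assumptions: the function name, the value \(\beta = 0.1\), and the per-token log-probabilities are hypothetical, and the single-sample log-ratio is only an estimate of the KL term, not its exact value.

```python
def kl_penalized_reward(reward, logps_policy, logps_ref, beta=0.1):
    """Return r(x, y) - beta * [log pi_theta(y|x) - log pi_ref(y|x)].

    logps_policy / logps_ref hold per-token log-probabilities of the same
    sampled response under the policy and the reference model; their summed
    difference is a one-sample estimate of the KL divergence.
    """
    log_ratio = sum(lp - lr for lp, lr in zip(logps_policy, logps_ref))
    return reward - beta * log_ratio

# Hypothetical two-token response that the policy likes more than the reference does.
shaped = kl_penalized_reward(
    reward=2.0,
    logps_policy=[-1.0, -0.5],
    logps_ref=[-1.5, -1.0],
)
print(shaped)
```

The further the policy raises the probability of a response relative to the reference, the larger the deduction, so reward gains must outweigh the drift they require.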
Architecture
  • The policy model is typically a causal (autoregressive) transformer with large scale: e.g., dozens of layers, high hidden dimensionality, multi-head self-attention, positional embeddings, etc.
  • It is initially pretrained on massive text corpora and then usually fine-tuned via supervised fine-tuning (SFT) on instruction–response pairs.
  • For RLHF, a further head or mechanism may be added or used for value/advantage estimation, but the core remains the transformer.
  • Recent work sometimes uses parameter efficient tuning (e.g., LoRA, adapters) to limit full-model updates during RL optimisation.
  • The architecture must support sampling from \(\pi_\theta\), computing log-probabilities \(\log \pi_{\theta} (y \mid x)\), and computing KL divergence between \(\pi_\theta\) and \(\pi_{\rm ref}\).
  • For instance, Fine-Tuning Language Models with Reward Learning on Policy by Lang et al. (2024) explores how the policy model interacts with a reward model under RLHF.
Training Data
  • Pretraining: The policy model is first trained on large unlabeled text corpora (e.g., hundreds of billions to trillions of tokens).
  • Supervised Fine-Tuning (SFT): Instruction–response pairs collected from humans or human-augmented data; e.g., prompts with “good” responses. Many alignment pipelines begin with this stage to provide a reasonable starting policy.
  • RL Finetuning: The model generates responses to prompts; responses are scored (via reward model or human ranking). This prompt–response–reward dataset is used in the reinforcement signal. Because the distribution of responses changes as \(\pi_{\theta}\) updates, continuing to sample from updated policy is important.
  • Replay / Off-Policy Data: Some pipelines incorporate past responses and reward scores into replay buffers or datasets for stability and reuse.
  • Training the policy model via RL typically uses batches of prompt–response pairs, plus log-probabilities of responses under both \(\pi_{\theta}\) and \(\pi_{\rm ref}\), plus the advantage estimate from a value model.
  • Note: Human preference data (for reward model) is often relatively small compared to the pretraining corpus; the RL step amplifies it via policy-generated samples.
Typical Model Size
  • The policy model used in RLHF pipelines tends to be large (tens of billions of parameters or more) to provide strong language understanding and generation capabilities.
  • For example, many state-of-the-art systems use models in the 7B–70B parameter range or larger (100B+).
  • The same base model is typically carried through SFT and then RLHF, and its size is often chosen to keep compute cost and training stability manageable. For example, the InstructGPT series applied SFT and then RLHF to GPT-3 models of 1.3B, 6B, and 175B parameters (see Ouyang et al. (2022)).
  • In practice, training or fine-tuning such large policy models via RL requires specialized distributed compute, large memory, and careful hyper-parameter tuning.

Reference Model

  • The reference model (also sometimes called the anchor model) is a fixed or slowly updated copy of the policy model used as a baseline or constraint in RLHF and related policy optimization setups for LLMs. Its primary purpose is to ensure that the updated policy model remains linguistically coherent, safe, and semantically aligned with the pre-RL distribution, while still learning to maximize the new reward signal. Put simply, the reference model plays a crucial safety and stability role in RLHF. It anchors the optimization process by maintaining linguistic and factual consistency, ensuring that policy optimization leads to meaningful alignment rather than degenerate exploration.
Function
  • The reference model \(\pi_{\text{ref}}(y \mid x)\) acts as a stability regulator during the reinforcement learning phase.
    • It appears in the KL-divergence regularization term in the RL objective:

      \[J(\theta) = \mathbb{E}_{x,y \sim \pi_\theta} \big[ r(x,y) - \beta \mathrm{KL}(\pi_\theta(\cdot \mid x) \Vert \pi_{\text{ref}}(\cdot \mid x)) \big]\]
      • where \(\pi_\theta\) is the policy model being optimized, and \(\beta\) is a scaling factor.
    • The KL term penalizes deviations from the reference model distribution, preventing mode collapse, reward hacking, or drift into incoherent or unfaithful responses.

  • Conceptually, the reference model anchors the optimization so that:

    • The policy model can explore higher-reward regions of response space.
    • But does not diverge too far from its pretrained linguistic and factual priors.
  • In practice, the reference model helps maintain fluency, truthfulness, and diversity of outputs throughout training.
Architecture
  • The reference model is architecturally identical to the policy model. It is often just a frozen copy of the supervised fine-tuned (SFT) model.

  • Example pipeline:

    1. Begin with a pretrained transformer (e.g., GPT-3, LLaMA, or PaLM).
    2. Fine-tune it with instruction data \(\rightarrow\) SFT model.
    3. Clone the SFT model \(\rightarrow\) Reference model (frozen).
    4. Train another copy \(\rightarrow\) Policy model (trainable) with PPO or another RL optimizer, using the frozen reference for KL regularization.
  • Since it starts from the same weights and shares the policy model's architecture, the reference model is a causal decoder-only transformer with the same number of layers, hidden dimensions, and parameters.

  • The architectural identity ensures that token-wise probability distributions are directly comparable, so that \(\mathrm{KL}(\pi_\theta(\cdot \mid x) \Vert \pi_{\text{ref}}(\cdot \mid x)) = \sum_y \pi_\theta(y \mid x) \log\frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\) is well defined. Because the sum ranges over all possible responses, it is estimated in practice from the per-token log-probabilities of sampled responses.

  • Some implementations (e.g., Stiennon et al., 2020, “Learning to summarize with human feedback”) experimented with slowly updating the reference model, but most production pipelines freeze it entirely.
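At the level of a single next-token distribution, the KL term can be computed exactly because both models share one vocabulary. The sketch below illustrates this for a toy three-token vocabulary; the logits and helper names are made up for the example, and real pipelines would compute this over batched tensors.

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) = sum_y p(y) * log(p(y) / q(y)) over a shared vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits over a 3-token vocabulary.
policy_probs = softmax([2.0, 1.0, 0.0])
ref_probs = softmax([1.0, 1.0, 1.0])  # uniform reference distribution

kl = kl_divergence(policy_probs, ref_probs)
print(kl)
```

The value is zero only when the two distributions coincide, which is why a mismatch in vocabulary or tokenization between policy and reference would make the comparison meaningless.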

Training Data
  • The reference model is not trained during the RL stage. Instead, it is a snapshot of the model before RLHF fine-tuning.

  • It is trained in the supervised fine-tuning (SFT) phase using instruction-following data such as:

    • Prompt–response pairs written or rated by humans.
    • Curated high-quality datasets covering Q&A, summarization, code generation, reasoning, and dialog.
  • The SFT dataset is usually smaller and more human-curated than pretraining data—ranging from a few thousand to a few hundred thousand high-quality examples.

  • By preserving this SFT policy, the reference model embodies the linguistic priors and alignment baseline learned from human demonstrations before introducing reinforcement signals.

Typical Model Size
  • The reference model must match the policy model in architecture and vocabulary to make KL computation meaningful. Therefore, it has the same parameter count as the policy model—commonly in the range of:

    • 7B–70B parameters for research-grade or open-source systems (e.g., LLaMA-2, Falcon, Mistral RLHF variants).
    • 175B–500B+ parameters for frontier models (e.g., GPT-3 or GPT-4 scale).
  • Because the reference model is frozen, its storage and compute requirements are primarily for forward passes during KL evaluation rather than gradient updates.
  • In distributed training pipelines (e.g., Ouyang et al., 2022), both the policy and reference models are sharded across GPUs but only the policy model receives gradient updates.
Comparative Analysis
| Aspect | Description |
| --- | --- |
| Role | Baseline distribution constraining RL updates |
| Function | Provides KL regularization to prevent policy drift |
| Architecture | Identical to policy (decoder-only transformer) |
| Training Data | SFT instruction data (high-quality human responses) |
| Model Size | Same as policy; typically 7B–175B parameters |
| Status During RL | Frozen (no updates) |

Reward Model

  • The reward model (RM) is one of the most crucial components in the RLHF pipeline.
  • It provides the scalar feedback signal \(r(x, y)\) that quantifies the quality of a model’s response \(y\) to a prompt \(x\), translating human preferences into a form usable by reinforcement learning algorithms.
  • In modern LLM alignment, the reward model serves as the surrogate objective for human satisfaction, steering the policy model toward behaviors that humans find helpful, truthful, and safe.
  • The reward model provides the human-aligned feedback mechanism that guides reinforcement learning updates. It bridges subjective human judgment and quantitative optimization, serving as the anchor for policy alignment and safety in LLM fine-tuning.

Function

  • The reward model approximates a latent human preference function. Given a prompt \(x\) and a response \(y\), the model outputs a scalar value \(r(x,y)\) representing how much a human would prefer that response.

  • Its primary role is to act as an evaluator that scores generated text, so that the policy model can be optimized to produce higher-reward responses.

  • Formally, the goal is to learn a function \(r_\phi(x,y) \approx \text{Expected human preference score}(x,y)\) parameterized by \(\phi\).

  • The reward model is trained using human preference data collected as pairwise comparisons: for a given prompt \(x\), humans are shown two responses \((y_1, y_2)\) and asked which is better.

  • Training minimizes a pairwise ranking loss:

    \[\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x,y_w,y_l)} \Big[\log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)\Big]\]
    • where \(y_w\) is the “winner” (preferred response), \(y_l\) is the “loser”, and \(\sigma\) is the sigmoid function.
    • This encourages the model to assign higher scores to preferred responses.
  • This approach was popularized by the InstructGPT pipeline in Training language models to follow instructions with human feedback by Ouyang et al. (2022), which remains the canonical reference for RLHF reward modeling.

  • The image below (source) illustrates how a reward model functions:

Architecture

  • The reward model is typically a transformer-based encoder or decoder-only model derived from the same family as the policy model (e.g., GPT, LLaMA, PaLM).

  • Architecturally, it’s identical to a language model but with a scalar regression head added on top of the final hidden state.

    • For causal transformers, the final token’s hidden representation \(h_T\) is typically used (alternatively, the hidden states are mean-pooled) and passed through a linear projection: \(r_\phi(x,y) = w^\top h_T + b,\)

      • where \(w,b\) are learned parameters.
  • The model thus learns to encode text sequences and output a single real-valued reward.

  • In practice:

    • The reward head is lightweight (a single dense layer).
    • The underlying transformer backbone may be smaller than the policy model (for compute efficiency).
    • The model is often trained with frozen or partially frozen embeddings, to preserve linguistic knowledge while specializing to preference prediction.
  • Several architectural variants are used for reward modeling, including:

    1. LM Classifiers: Language models fine-tuned as binary classifiers to score which response better aligns with human preferences
    2. Value Networks: Regression models that predict scalar ratings representing relative human preference
    3. Critique Generators: Language models trained to generate evaluative critiques explaining which response is better and why, used in conjunction with instruction tuning
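The scalar head \(r_\phi(x,y) = w^\top h_T + b\) described above can be sketched in a few lines. The function name, the `pooling` switch, and the toy hidden states and weights are illustrative assumptions; in practice the hidden states come from the transformer backbone and \(w, b\) are learned.

```python
def reward_head(hidden_states, w, b, pooling="last"):
    """Map per-token hidden states to one scalar reward.

    hidden_states: list of per-token vectors (hypothetical backbone output).
    pooling="last" uses the final token's vector h_T; "mean" averages all tokens.
    """
    if pooling == "last":
        h = hidden_states[-1]
    elif pooling == "mean":
        n = len(hidden_states)
        h = [sum(column) / n for column in zip(*hidden_states)]
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    return sum(wi * hi for wi, hi in zip(w, h)) + b

# Two tokens, hidden size 2, with made-up values.
hidden_states = [[1.0, 0.0], [3.0, 2.0]]
w, b = [0.5, -1.0], 0.25

reward_last = reward_head(hidden_states, w, b, pooling="last")
reward_mean = reward_head(hidden_states, w, b, pooling="mean")
print(reward_last, reward_mean)
```

Last-token pooling relies on causal attention having already aggregated the full sequence into the final position, whereas mean pooling weights every token equally.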

Mathematical Framework

  • The reward model is trained using ranked comparison data and assigns a scalar score to model-generated responses.

  • A common formulation of the pairwise loss uses the Bradley-Terry model, where the probability that a rater prefers response \(r_i\) over \(r_j\) is:

    \[P(r_i > r_j) = \frac{\exp(R_\phi(p, r_i))}{\exp(R_\phi(p, r_i)) + \exp(R_\phi(p, r_j))}\]
  • The corresponding loss function is:

    \[\mathcal{L}(\phi) = -\log \sigma(R_\phi(p, r_i) - R_\phi(p, r_j))\]
    • where:

      • \(\sigma\) is the sigmoid function,
      • \(R_\phi\) is the reward model,
      • \(p\) is the prompt,
      • \(r_i, r_j\) are two responses being compared.
  • This formulation ensures that the reward model learns to assign higher scores to responses more preferred by humans.

  • A key implementation detail: the reward for partial responses is always 0; only complete responses receive a non-zero scalar score. This design encourages the generation of coherent and full outputs during policy training.
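The Bradley-Terry probability and the pairwise loss above are two views of the same quantity, since \(\frac{e^{a}}{e^{a}+e^{b}} = \sigma(a-b)\). The sketch below checks this numerically; the function names and scores are hypothetical and stand in for reward-model outputs \(R_\phi(p, r_i)\) and \(R_\phi(p, r_j)\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bradley_terry_prob(score_i, score_j):
    """P(r_i > r_j) under the Bradley-Terry model, given two scalar scores."""
    e_i, e_j = math.exp(score_i), math.exp(score_j)
    return e_i / (e_i + e_j)

def pairwise_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred response wins."""
    return -math.log(sigmoid(score_preferred - score_rejected))

# Made-up reward-model scores for a preferred and a rejected response.
p_win = bradley_terry_prob(1.5, 0.5)
loss_correct_order = pairwise_loss(2.0, 0.0)
loss_wrong_order = pairwise_loss(0.0, 2.0)
print(p_win, loss_correct_order, loss_wrong_order)
```

Only the score difference matters, so the reward scale is determined up to an additive constant; the loss shrinks as the preferred response's margin grows.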

Training Data

  • The training data for reward models comes from human preference labeling:

    • A set of prompts \(x\) is sampled (often from SFT datasets or model-generated prompts).
    • Multiple responses are generated by one or more models.
    • Human annotators rank or choose preferred responses based on helpfulness, accuracy, harmlessness, or style criteria.
  • The collected comparisons yield tuples \((x, y_w, y_l)\), forming the basis for pairwise training.

  • Datasets of this form can range from tens of thousands to several million comparisons, depending on the scale of the deployment. For example:

    • The InstructGPT reward model was trained on comparisons collected for roughly 33,000 prompts, each with several ranked responses.
    • Larger RLHF systems (e.g., Anthropic’s Constitutional AI) use 100K–1M+ pairs.
    • Recent work such as RLHF on LLaMA 2 and OpenAI’s GPT-4-turbo alignment use data from extensive human evaluation and preference modeling pipelines.
  • Synthetic preference data (generated using smaller models or heuristics) is also increasingly used to supplement limited human data, in the spirit of the synthetic instruction data of Self-Instruct by Wang et al. (2022).

Model Size

  • The reward model is usually smaller than the policy model, since it only provides scalar evaluations and doesn’t need to generate text.

    • Common sizes range from 1B to 13B parameters for large-scale pipelines.
    • For example:

      • InstructGPT used reward models of 6B parameters, while the policy model was 175B.
      • Open-source LLaMA 2–Chat models used reward models of 7B–13B parameters.
    • Compact reward models are often used to reduce the cost of reward evaluation during RLHF training (since thousands of responses must be scored per update).
  • Some recent methods, such as Direct Preference Optimization (DPO) by Rafailov et al. (2023), avoid training a separate reward model altogether, instead representing the reward implicitly through log-probability ratios between the policy and reference models.

Prevention of Over-optimization

  • To prevent the fine-tuned model from over-optimizing the learned reward or drifting too far from its starting distribution, KL divergence penalties are applied during RL:

    • KL divergence measures the difference between the output distributions of the current policy and the reference model (typically the frozen SFT model).
    • This constraint regularizes learning and ensures that the fine-tuned model does not deviate excessively, preserving safety and coherence.
  • This KL penalty is crucial for maintaining a balance between alignment and generalization.

Evaluation and Monitoring

  • Reward models are evaluated on held-out preference sets using accuracy metrics—how often the model correctly predicts the human-preferred response.
  • Typical held-out accuracies range from 65% to 80%, depending on domain and data quality.
  • Regular retraining and drift monitoring are essential, since the distribution of policy outputs changes as the policy improves.
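The held-out accuracy metric described above amounts to counting how often the reward model scores the human-preferred response higher. A minimal sketch follows; the function name, the toy scores, and the choice to count ties as errors are assumptions made for this example.

```python
def preference_accuracy(scored_pairs):
    """Fraction of pairs where the human-preferred response gets the higher score.

    scored_pairs: list of (score_preferred, score_rejected) tuples produced by
    a reward model on a held-out preference set. Ties count as incorrect here.
    """
    correct = sum(1 for s_w, s_l in scored_pairs if s_w > s_l)
    return correct / len(scored_pairs)

# Hypothetical reward-model scores on four held-out comparisons.
held_out = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
accuracy = preference_accuracy(held_out)
print(accuracy)
```

Tracking this number on comparisons built from recent policy outputs is one simple way to notice the drift mentioned above.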
Comparative Analysis
| Aspect | Description |
| --- | --- |
| Role | Translates human preference into scalar rewards |
| Training Objective | Pairwise ranking loss on human preference data |
| Architecture | Transformer with scalar reward head |
| Data | Human-ranked prompt–response pairs (tens of thousands to millions) |
| Model Size | Typically 1B–13B parameters |
| Reference Papers | Ouyang et al., 2022; Rafailov et al., 2023 |

Value Model

  • The value model (sometimes called the critic model) plays a critical but often under-discussed role in LLM reinforcement learning pipelines such as RLHF and RLAIF (Reinforcement Learning from AI Feedback).
  • While the reward model provides immediate feedback for a given response, the value model estimates the expected future reward from a state (a prompt, or a prompt plus a partial response), enabling advantage estimation, variance reduction, and stabilized policy updates, all of which are foundational to modern policy-gradient methods like PPO.
Function
  • In the context of LLM alignment, the value model \(V_\phi(x)\) or \(V_\phi(x, y)\) predicts the expected return (i.e., the cumulative reward) for a given prompt \(x\) or prompt-response pair \((x,y)\).
  • It plays the same theoretical role as the critic in an actor–critic architecture.

  • The basic formulation:

    \[V_\phi(s) \approx \mathbb{E}_{a\sim\pi_\theta} \big[ R(s,a) \big],\]
    • where \(R(s,a)\) is the return (or scalar reward) achieved when the policy \(\pi_\theta\) produces action \(a\) in state \(s\).
  • For language models, the “state” corresponds to the prompt or prefix \(x\), and the “action” corresponds to the generated token sequence \(y\).

  • Thus, the value model is used to:
  1. Estimate baseline returns to compute advantages for PPO or other policy-gradient updates: \(\hat{A}(x,y) = r(x,y) - V_\phi(x)\) or, in some cases, token-wise: \(\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1},\) where \(\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\) is the TD-error.
  2. Reduce variance in gradient estimation by providing a learned baseline for expected reward.
  3. Serve as a critic for continuous improvement, allowing the system to generalize reward expectations across prompts even when explicit human feedback is unavailable.
  • The concept parallels classical actor–critic RL frameworks introduced by Konda and Tsitsiklis (2000), but adapted to the autoregressive structure of LLMs.
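The token-wise recursion \(\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}\) is usually evaluated backwards from the last token. The sketch below is a minimal illustration; the function name, the default \(\gamma\) and \(\lambda\), and the reward and value numbers are assumptions, with the only non-zero reward placed on the final token.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one generated sequence.

    rewards: per-token rewards r_t (here zero except at the final token).
    values:  value estimates V(s_t) with one extra trailing entry for the
             state after the last token (0.0 once the response has ended).
    """
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD-error
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages

# Hypothetical 3-token response with a terminal reward of 1.0.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]
adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
print(adv)
```

With \(\gamma = \lambda = 1\) each advantage reduces to the Monte Carlo return minus the value estimate at that token; smaller \(\lambda\) leans more on the critic's bootstrapped estimates, trading variance for bias.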
Architecture
  • The value model shares most of its architecture with the policy and reward models—typically a decoder-only transformer. However, it differs in its output head and training target:

  • Instead of outputting a distribution over next tokens or a scalar reward difference, the value model outputs a single scalar estimate \(V_\phi(x)\) (or a sequence of per-token estimates \(V_\phi(x_t)\)).
  • Implementation details:

    • Often, the hidden representation of the last token (or the mean of hidden states) is fed into a linear projection layer producing a scalar output.
    • Architecturally identical to the policy model up to the final layer, enabling parameter sharing in multi-head variants (e.g., actor–critic shared encoder).
    • In some frameworks (e.g., Stiennon et al., 2020), the value model is jointly trained with the policy, whereas in others it is trained separately to prevent overfitting to specific rewards.
  • For stability, a target value network \(V_{\phi^-}\) may be maintained—updated periodically—to stabilize temporal-difference (TD) targets, as in classic deep RL.
Training Objective
  • The value model is typically trained by regression to predict observed or bootstrapped returns: \(\mathcal{L}_V(\phi) = \mathbb{E}_{(x,y)\sim D} \big[\big(V_\phi(x) - \hat{R}(x,y)\big)^2\big],\)
    • where \(\hat{R}(x,y)\) is the observed reward (from the reward model or humans).
  • In token-level PPO implementations, this may extend to predicting per-token value estimates, allowing fine-grained credit assignment across generated sequences.

  • The training dataset typically comes from:

    • Prompts \(x\) generated from curated datasets or user interactions.
    • Responses \(y\) sampled from the current policy model \(\pi_\theta\).
    • Rewards \(r(x,y)\) computed from the reward model.
  • This creates tuples \((x, y, r(x,y))\) that are used both for updating the policy and for training the value function.
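The regression objective above is a mean-squared error between the critic's predictions and the observed rewards. A minimal sketch, assuming hypothetical batch values and a helper name chosen for this example:

```python
def value_loss(predicted_values, observed_returns):
    """Mean-squared error between V(x) predictions and observed returns R(x, y)."""
    n = len(predicted_values)
    return sum((v - r) ** 2 for v, r in zip(predicted_values, observed_returns)) / n

# Hypothetical batch of two prompts: critic predictions vs. reward-model scores.
loss = value_loss(predicted_values=[0.5, 1.0], observed_returns=[1.0, 1.0])
print(loss)
```

Because the targets come from responses sampled by the current policy, the critic has to be refit as the policy changes; otherwise its baseline becomes stale.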
Training Data
  • Primary source: On-policy data collected during RLHF fine-tuning—prompts generated from curated instruction datasets, with responses sampled from the current policy model.
  • Reward signals: Computed using the reward model or human preference annotations.
  • Scale: Typically hundreds of thousands to a few million prompt–response pairs during RLHF loops.
  • Temporal supervision: In text generation, there is usually a single terminal reward per completion; hence, value learning relies on Monte Carlo returns or generalized advantage estimation (GAE) to smooth learning despite sparse signals.
Model Size
  • The value model is often smaller than the policy model and similar in size to, or slightly larger than, the reward model. Typical configurations:

    • 1B–13B parameters for large-scale LLM training.
    • For example, in OpenAI’s InstructGPT setup (Ouyang et al., 2022), the value model had similar capacity to the reward model (≈6B), acting as a critic for a 175B-parameter policy.
    • In open-source frameworks like TRLX or DeepSpeed-Chat, value heads are typically attached to 7B–13B base LLMs, or trained as separate lightweight critics.
  • When memory is constrained, a value head may be added directly to the policy model (sharing the same encoder/decoder weights but with a separate linear projection), known as a shared-head architecture.

Relationship to the Reward Model
| Aspect | Reward Model | Value Model |
|---|---|---|
| Input | Prompt + response | Prompt (or prompt + partial response) |
| Output | Scalar reward (human preference estimate) | Expected future reward (baseline or critic) |
| Training data | Human or synthetic preference comparisons | Policy rollouts and rewards |
| Objective | Pairwise ranking loss | MSE regression loss |
| Usage | Guides policy optimization | Stabilizes training via advantage estimation |
| Updates | Offline (pretrained) | Online (updated during RL loop) |
  • The reward model captures external supervision, while the value model provides internal bootstrapping for efficient policy learning.
Comparative Analysis
| Aspect | Description |
|---|---|
| Role | Predicts expected future reward for prompts/responses |
| Function | Baseline and critic for policy optimization |
| Architecture | Transformer with scalar output head |
| Training Data | On-policy prompt–response–reward tuples |
| Model Size | 1B–13B parameters |
| Training Objective | Mean-squared error on observed or bootstrapped returns |
| References | Konda & Tsitsiklis, 2000; Stiennon et al., 2020; Ouyang et al., 2022 |

Optimizing the Policy

  • The policy refers to a strategy or a set of rules that an agent uses to make decisions in an environment. Put simply, the policy defines how the agent selects actions based on its current observations or state.
  • The policy optimization process uses RL techniques that iteratively refine the policy based on reward feedback. The reward model provides feedback based on human preferences, and the policy is optimized to maximize reward while maintaining a stable learning trajectory. Stability is enforced by keeping each updated policy close to its previous version, which prevents drastic changes that could destabilize training.
  • Popular policy optimization methods – specifically applied to LLMs – include:
    • Proximal Policy Optimization (PPO): A widely-used RL algorithm that balances exploration and exploitation while maintaining training stability.
    • Direct Preference Optimization (DPO): An alternative approach where the policy directly optimizes the relative log probability of preferred responses using a binary cross-entropy loss, balancing human feedback alignment with KL divergence constraints.
    • Group Relative Policy Optimization (GRPO): A PPO variant that removes the critic model and estimates the baseline from group scores, improving memory efficiency and performance in complex tasks like mathematical reasoning.
  • Through RLHF, models like InstructGPT and ChatGPT have achieved enhanced alignment with human expectations, producing more beneficial and contextually appropriate responses.
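  • The following sketch shows, under simplifying assumptions, the two quantities that distinguish DPO and GRPO from PPO: the DPO binary cross-entropy loss on a single preference pair, and GRPO-style group-relative advantages. The function names, the default `beta=0.1`, and the use of the population standard deviation are illustrative choices, not taken from any particular library.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    return math.log1p(math.exp(-logits))


def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Made-up log-probabilities: the policy prefers the chosen response more than
# the reference does, so the loss falls below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
# Four sampled responses to the same prompt with binary rewards.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

  • Note that neither function needs a learned value model: DPO works from log-probability ratios alone, and the GRPO baseline is the group mean.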

Integration of Policy, Reference, Reward, and Value Models in RLHF

  • The full RLHF pipeline integrates four central components — the policy, reference, reward, and value models — into a cohesive optimization framework. Together, these models implement a scalable variant of policy-gradient reinforcement learning (commonly using PPO) for large-scale language model alignment.

  • This section provides a complete description of how these models interact, the mathematical formulation governing their updates, and the system-level architecture of a modern RLHF pipeline.

Overview of the RLHF Process
  • RLHF transforms large pretrained language models into alignment-optimized conversational agents through a three-phase process:

    1. Supervised Fine-Tuning (SFT):
      • The base pretrained LLM is fine-tuned on instruction–response data curated by humans.
      • Output: SFT model (used as both the initial policy and the frozen reference model).
    2. Reward Modeling:
      • Human annotators rank or compare pairs of model responses.
      • A separate reward model is trained on these comparisons to learn a scalar preference function \(r_\phi(x,y)\).
    3. Reinforcement Learning (RL) Optimization:
      • The policy model is optimized to generate responses that maximize the learned reward signal, while staying close to the reference model through KL regularization.
      • The value model acts as a critic, stabilizing the gradient updates.
  • This procedure was first described comprehensively in Training Language Models to Follow Instructions with Human Feedback by Ouyang et al. (2022), forming the backbone of systems such as InstructGPT and ChatGPT.

Core Mathematical Formulation
  • The RLHF optimization problem can be expressed as:

    \[\max_{\theta}\, \mathbb{E}_{x\sim D_{\text{prompt}},\,y\sim\pi_\theta(\cdot\mid x)} \left[ r_\phi(x,y) - \beta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\Vert\,\pi_{\text{ref}}(\cdot\mid x)\big) \right]\]
    • where:

      • \(\pi_\theta\) = policy model (trainable)
      • \(\pi_{\text{ref}}\) = reference model (frozen)
      • \(r_\phi\) = reward model (provides scalar reward)
      • \(\beta\) = KL penalty coefficient controlling exploration–alignment trade-off
  • The KL term prevents the policy from diverging too far from its linguistic prior, while the reward encourages behaviors that better match human preferences.

  • To train this objective, Proximal Policy Optimization (PPO) by Schulman et al. (2017) is typically used, which optimizes a clipped surrogate loss:

    \[L_{\text{PPO}}(\theta) = \mathbb{E}_{(x,y)\sim\pi_\theta} \left[ \min\left( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta), 1-\epsilon, 1+\epsilon\big)\,\hat{A}_t \right) \right]\]
    • where:

      • \(r_t(\theta) = \frac{\pi_\theta(y_t \mid x_t)}{\pi_{\theta_{\text{old}}}(y_t \mid x_t)}\) is the likelihood ratio;
      • \(\hat{A}_t = r_\phi(x_t,y_t) - V_\psi(x_t)\) is the advantage estimate;
      • \(V_\psi\) = value model;
      • \(\epsilon\) is a clipping hyperparameter (usually 0.1–0.2).
  • The advantage term ensures that updates are proportional to how much better a response is than expected, while the clipping stabilizes the step size.
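  • A minimal sketch of the two formulas in this section, in plain Python: a KL-penalized reward (using the summed log-probability ratio of the sampled response as a single-sample estimate of the KL term) and the clipped surrogate objective. The function names and all numeric inputs are hypothetical, and the advantage here is simply the penalized reward minus an assumed critic output.

```python
import math


def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """r_phi(x, y) minus beta times a single-sample estimate of the KL term."""
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return reward - beta * log_ratio


def ppo_clipped_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # likelihood ratio r_t(theta)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Per-token log-probabilities of one sampled response (made-up numbers).
shaped = kl_shaped_reward(2.0, [-0.5, -1.0, -0.2], [-0.7, -1.5, -0.3], beta=0.1)
advantage = shaped - 1.5  # 1.5 stands in for the critic output V_psi(x)

# A ratio of 1.5 with a positive advantage is capped at 1 + epsilon.
capped = ppo_clipped_objective([math.log(1.5)], [0.0], [1.0])
# With a negative advantage the unclipped (more pessimistic) term is kept.
uncapped = ppo_clipped_objective([math.log(1.5)], [0.0], [-1.0])
print(shaped, advantage, capped, uncapped)
```

  • The asymmetry in the last two calls is the point of taking the minimum: the objective never rewards moving the ratio beyond the clip range, but it still penalizes moves that make a bad action more likely.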

Role of Each Model in the Loop
  • Policy Model \(\pi_{\theta}\):

    • Generates responses \(y\) to prompts \(x\).
    • Updated via Proximal Policy Optimization (PPO) to maximize the clipped surrogate objective.
    • Receives both reward signals and value-based baselines during training.
  • Reference Model \(\pi_{\text{ref}}\):

    • Provides a baseline distribution for KL regularization to prevent over-optimization.

    • Frozen during training; used to compute token-wise divergence:

      \[D_{\text{KL}}\big(\pi_{\theta}(\cdot \mid x) \,\Vert\, \pi_{\text{ref}}(\cdot \mid x)\big) = \sum_{y} \pi_{\theta}(y \mid x)\, \log\frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\]
    • Ensures linguistic stability and mitigates reward hacking by anchoring the policy to its supervised fine-tuned prior.

  • Reward Model \(r_{\phi}\):

    • Maps each generated response \(y\) (conditioned on prompt \(x\)) to a scalar reward: \(r_{\phi}: (x, y) \mapsto \mathbb{R}\).
    • Trained on human preference data (pairwise or ranked comparisons), then frozen during policy optimization.
    • Supplies an approximation of human judgment, encouraging the policy to produce more aligned, preferred responses.
  • Value Model \(V_{\psi}\):

    • Estimates the expected return for a given prompt (or state) \(x\), reducing variance in policy-gradient updates.
    • Trained in parallel with the policy to predict the observed or bootstrapped return: \(\hat{R}(x, y) = r_{\phi}(x, y),\) and provides advantage estimates: \(\hat{A}(x, y) = r_{\phi}(x, y) - V_{\psi}(x).\)
    • Serves as a critic in the actor–critic framework, enabling stable and efficient optimization.
Full Training Loop
  • Step 1: Sampling Responses:

    • Draw a batch of prompts \(\{x_i\}\) from the dataset.
    • Generate responses \(\{y_i\}\) from the current policy \(\pi_\theta\).
  • Step 2: Reward Evaluation:

    • Compute scalar rewards \(r_\phi(x_i, y_i)\) using the reward model.
    • Compute KL penalties from the reference model.
  • Step 3: Advantage Computation:

    • Use the value model to estimate baselines \(V_\psi(x_i)\).
    • Compute advantages \(\hat{A}_i = r_\phi(x_i, y_i) - V_\psi(x_i)\).
  • Step 4: Policy Update (PPO):

    • Optimize \(L_{\text{PPO}}(\theta)\) with respect to the policy parameters.
    • Clip the likelihood ratio \(r_t(\theta)\) to \([1-\epsilon, 1+\epsilon]\) to maintain stable updates.
  • Step 5: Value Model Update:

    • Update the critic via regression: \(\mathcal{L}_V(\psi) = \mathbb{E}_{(x,y)} \big[ (V_\psi(x) - r_\phi(x,y))^2 \big]\)
  • Step 6: Iteration and Rollout:

    • Repeat with new samples from the updated policy.
    • Periodically evaluate human or synthetic preference metrics to ensure alignment progress.
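  • The six steps above can be sketched end to end on a toy problem. The example below is an illustration under strong assumptions, not a production pipeline: there is a single prompt, the "policy" is a softmax over a handful of candidate responses, the "reward model" is a fixed lookup table, the reference is the initial policy, and the critic is one scalar that regresses toward the KL-penalized reward. All names and hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rlhf_toy_loop(reward_table, steps=200, batch_size=32, beta=0.05,
                  epsilon=0.2, lr_policy=0.5, lr_value=0.1, ppo_epochs=2, seed=0):
    rng = random.Random(seed)
    k = len(reward_table)
    logits = [0.0] * k            # policy parameters theta
    ref_probs = softmax(logits)   # frozen reference: the initial policy
    value = 0.0                   # critic V_psi for the single prompt

    for _ in range(steps):
        # Step 1: sample responses from the current (old) policy.
        old_probs = softmax(logits)
        samples = rng.choices(range(k), weights=old_probs, k=batch_size)
        # Step 2: reward score minus a per-sample KL penalty estimate.
        shaped = [reward_table[y] - beta * math.log(old_probs[y] / ref_probs[y])
                  for y in samples]
        # Step 3: advantages against the critic baseline.
        advantages = [s - value for s in shaped]
        # Step 4: a few epochs of gradient ascent on the clipped surrogate.
        for _ in range(ppo_epochs):
            probs = softmax(logits)
            grad = [0.0] * k
            for y, adv in zip(samples, advantages):
                ratio = probs[y] / old_probs[y]
                clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)
                if ratio * adv <= clipped * adv:  # unclipped term is the minimum
                    for j in range(k):
                        indicator = 1.0 if j == y else 0.0
                        grad[j] += adv * ratio * (indicator - probs[j])
            logits = [z + lr_policy * g / batch_size for z, g in zip(logits, grad)]
        # Step 5: move the critic toward the mean observed (shaped) reward.
        value += lr_value * (sum(shaped) / batch_size - value)
        # Step 6: the next iteration resamples from the updated policy.
    return softmax(logits), value


final_probs, final_value = rlhf_toy_loop([0.0, 0.5, 1.0])
print([round(p, 3) for p in final_probs], round(final_value, 3))
```

  • In this toy setting the policy should shift probability mass toward the highest-reward response, while the KL penalty keeps it from collapsing arbitrarily fast away from the reference.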
System Architecture
\[\begin{aligned} &\underbrace{D_{\text{prompt}}}_{\text{Prompt Dataset}} \xrightarrow{\text{sample prompts}} \underbrace{\pi_{\theta}}_{\text{Policy Model}} \xrightarrow[\text{Generates responses}]{} \underbrace{r_{\phi}}_{\text{Reward Model}} \xrightarrow[\text{Computes scalar rewards}]{} \\[1em] &\underbrace{V_{\psi}}_{\text{Value Model}} \xrightarrow[\text{Computes baselines}]{} \underbrace{\pi_{\text{ref}}}_{\text{Reference Model}} \xrightarrow[\text{KL penalty computation}]{} \underbrace{\text{PPO Optimization Loop}}_{\text{Policy update step}} \end{aligned}\]
Computational and Practical Considerations
  • Training Scale:
    • The RLHF fine-tuning phase typically uses hundreds of thousands to millions of samples, requiring large-scale distributed training.
    • Compute cost is dominated by sampling (policy forward passes) and reward scoring.
  • Stability:
    • PPO’s clipping and KL regularization stabilize updates that would otherwise explode in such large parameter spaces.
  • Safety and Alignment:
    • The reward model embeds alignment objectives (helpfulness, harmlessness, honesty).
    • KL regularization ensures fidelity to the pretrained model’s linguistic priors.
  • Continuous Improvement:
    • Iterative retraining of reward models using newer policy outputs yields increasingly aligned systems — a process sometimes called iterative RLHF or alignment bootstrapping (see Christiano et al., 2017).
Comparative Analysis
| Model | Function | Training Status | Data Source | Typical Size |
|---|---|---|---|---|
| Policy (\(\pi_\theta\)) | Generates responses; optimized for reward | Trainable | Prompts, synthetic rollouts | 7B–175B |
| Reference (\(\pi_{\text{ref}}\)) | Baseline distribution for KL penalty | Frozen | Same as SFT model | 7B–175B |
| Reward (\(r_\phi\)) | Scores responses based on preferences | Frozen | Human comparisons | 1B–13B |
| Value (\(V_\psi\)) | Predicts expected reward (critic) | Trainable | Policy rollouts with rewards | 1B–13B |
  • In summary, RLHF operationalizes reinforcement learning at massive scale by combining:

    • The policy for exploration and response generation,
    • The reward for human alignment,
    • The value for stability and variance control, and
    • The reference for constraint and safety.
  • This synergy enables LLMs to internalize nuanced human feedback, forming the foundation for systems like ChatGPT, Anthropic’s Claude, and Google’s Gemini.

Putting it all together: Training Llama

Llama 4
  • Introduced in The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, the Llama 4 series marks a decisive leap forward in Meta’s open-weight model evolution, embodying natively multimodal design and advanced preference optimization.
  • With the introduction of Llama 4 Scout, Llama 4 Maverick, and the teacher model Llama 4 Behemoth, Meta’s alignment and optimization pipeline evolved into a hybrid of traditional RLHF and DPO, adapted for large-scale multimodal learning.
Model Overview and Architecture
  • Llama 4 introduces a mixture-of-experts (MoE) architecture where only a small subset of parameters activates per token, dramatically improving training and inference efficiency.

    • Llama 4 Scout: 17 billion active parameters, 16 experts, 109B total parameters, with a record-breaking 10 million token context window.
    • Llama 4 Maverick: 17 billion active parameters, 128 experts, 400B total parameters, balancing precision, cost efficiency, and multimodal reasoning.
    • Llama 4 Behemoth: 288 billion active parameters, 16 experts, and nearly 2 trillion total parameters, serving as a “teacher” for distillation.
  • Each model uses alternating dense and MoE layers, with tokens routed to both a shared and an expert-specific pathway, enabling dynamic specialization without compromising latency. This modular routing system supports scalable deployment — from single H100 GPUs (Scout) to distributed inference (Maverick and Behemoth).

Pre-Training: Efficiency, Scale, and Multimodality
  • The pre-training phase introduced several innovations:

    • Native Multimodality: Early-fusion architecture integrating text, vision, and video tokens, allowing joint learning across modalities.
    • Vision Encoder Improvements: Based on MetaCLIP, co-trained with frozen Llama layers for better image-text alignment.
    • MetaP Training Framework: A novel hyperparameter control system for per-layer learning rates and initialization scales, providing transferability across architectures and batch sizes.
    • FP8 Precision Training: Enhanced efficiency with 390 TFLOPs/GPU utilization across 32K GPUs, sustaining quality with minimal degradation.
    • Massive Multilinguality: 200 languages pre-trained, 100+ with over a billion tokens each — 10× the Llama 3 multilingual data budget.
    • Extended Context Length: Specialized mid-training datasets for long-context retention, culminating in a 10M-token context capacity for Llama 4 Scout.
  • The pre-training dataset exceeded 30 trillion tokens, encompassing diverse web text, code, images, and video frames. Continuous “mid-training” refinement phases allowed the model to expand context comprehension while maintaining stability.

Post-Training and Preference Optimization
  • Post-training for Llama 4 integrated multi-stage alignment combining SFT, online RL, and lightweight DPO.

    • Curriculum Design: A multimodal training curriculum balancing text, image, and reasoning data without sacrificing domain specialization.
    • Hard Data Curation: Automated difficulty estimation with prior Llama models used as judges to prune over 50% of “easy” SFT data, focusing on challenging prompts.
    • Continuous Online RL: Implemented as an on-policy, PPO-like training loop rather than DPO.

      • The model alternates between generation and optimization phases, continually updating the policy based on freshly sampled data.
      • “Medium-to-hard” prompts are identified via advantage scores and model confidence, filtering out zero-reward or trivial samples.
      • An advantage estimator (\(A(s, a) = Q(s, a) - V(s)\)) computes expected improvement per action, and prompts are re-ranked by these scores to form adaptive mini-batches.
      • A clipped surrogate loss similar to PPO ensures stable policy updates with controlled KL divergence to the reference model.
      • The reward signal blends multiple criteria — helpfulness, factuality, safety, and multimodal consistency (e.g., text-visual grounding accuracy).
    • Lightweight DPO Refinement: After online RL, a DPO stage fine-tunes preference alignment through log-likelihood ratio optimization without explicit rewards. This stabilizes conversational flow, reduces verbosity, and improves subjective response quality.
  • This hybrid pipeline allows exploration (via online RL) while retaining control (via DPO). It achieved consistent improvements in reasoning, multimodal grounding, and factual correctness with lower computational overhead than full RLHF pipelines.
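  • One way to picture the filtering of zero-reward or trivial prompts described above is the sketch below. It is an assumption-laden illustration, not Meta's implementation: rewards are taken to be binary pass/fail scores over several sampled responses per prompt, and a prompt is dropped when every sample receives the same reward, since every advantage relative to the prompt's mean is then zero. The prompt identifiers and function name are hypothetical.

```python
def filter_zero_advantage_prompts(rollout_rewards):
    """Return prompt ids with mixed outcomes, hardest (lowest pass rate) first.

    rollout_rewards maps a prompt id to the rewards of its sampled responses.
    """
    pass_rates = {}
    for prompt_id, rewards in rollout_rewards.items():
        # All-equal rewards give zero advantage for every sample: skip them.
        if len(rewards) >= 2 and max(rewards) > min(rewards):
            pass_rates[prompt_id] = sum(rewards) / len(rewards)
    return sorted(pass_rates, key=pass_rates.get)


rollouts = {
    "easy": [1, 1, 1, 1],      # always solved: no learning signal
    "medium": [1, 0, 1, 1],
    "hard": [0, 0, 0, 1],
    "unsolved": [0, 0, 0, 0],  # never solved: no learning signal either
}
selected = filter_zero_advantage_prompts(rollouts)
print(selected)
```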

Distillation from Llama 4 Behemoth
  • Llama 4 Behemoth acted as a codistillation teacher for the smaller models.
  • A novel dynamic distillation loss balanced soft (logit-level) and hard (label-level) targets.
  • Computation amortized across pre-training batches by embedding Behemoth forward passes into student model training.
  • Distillation improved multimodal reasoning and efficiency without requiring full retraining on large datasets.
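  • The exact form of the dynamic distillation loss is not given here, so the following is only a generic sketch of how soft (teacher-distribution) and hard (ground-truth label) targets can be blended for a single token. The fixed `soft_weight` is an assumption standing in for whatever dynamic schedule is actually used, and the logits are made up.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def distillation_loss(student_logits, teacher_logits, label, soft_weight=0.5):
    """Weighted sum of a soft-target and a hard-target loss for one token."""
    student_logp = log_softmax(student_logits)
    teacher_logp = log_softmax(teacher_logits)
    # Soft target: cross-entropy of the student against the teacher distribution.
    soft = -sum(math.exp(t) * s for t, s in zip(teacher_logp, student_logp))
    # Hard target: negative log-likelihood of the ground-truth token.
    hard = -student_logp[label]
    return soft_weight * soft + (1.0 - soft_weight) * hard


agreeing = distillation_loss([2.0, 0.0], [2.0, 0.0], label=0)
disagreeing = distillation_loss([0.0, 2.0], [2.0, 0.0], label=0)
print(agreeing, disagreeing)
```

  • A student that matches both the teacher and the label (`agreeing`) incurs a lower loss than one that contradicts them (`disagreeing`); shifting `soft_weight` toward 1 makes the teacher's full distribution dominate.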
Reinforcement Learning Infrastructure at Scale
  • Scaling RL for the two-trillion-parameter Behemoth required a fundamental infrastructure overhaul:

    • Asynchronous Online RL Framework: Enabled decoupled model execution across GPUs, enhancing flexibility and reducing idle compute.
    • Experience Replay Buffers: Incorporated sliding-window replay to maintain data diversity while preventing overfitting to recent samples.
    • Adaptive KL Penalty: Dynamically adjusted during training to prevent policy collapse, based on running estimates of divergence from reference weights.
    • MoE Parallelization Optimizations: Improved throughput by balancing compute load dynamically across active experts.
    • Curriculum-Based Prompt Sampling: pass@k evaluation and zero-advantage filtering ensured progressively harder RL training data.
    • Result: ~10× increase in training efficiency over prior distributed RL frameworks, with significantly improved sample efficiency and reward stability.
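  • The adaptive KL penalty mentioned above is not specified in detail, so the sketch below shows one well-known scheme instead: the proportional controller described by Ziegler et al. (2019), which nudges \(\beta\) up when the measured KL exceeds a target and down when it falls below. The class name, default values, and the assumption that this resembles the approach used for Llama 4 are all hypothetical.

```python
class AdaptiveKLCoefficient:
    """Proportional controller for the KL penalty coefficient beta."""

    def __init__(self, initial_beta=0.1, target_kl=6.0, horizon=10000):
        self.beta = initial_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adaptation

    def update(self, observed_kl, n_samples):
        # Relative error, clipped so a single noisy batch cannot swing beta far.
        error = observed_kl / self.target_kl - 1.0
        error = max(-0.2, min(0.2, error))
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta


controller = AdaptiveKLCoefficient(initial_beta=0.1, target_kl=6.0, horizon=1000)
beta_after_high_kl = controller.update(observed_kl=12.0, n_samples=100)
beta_after_low_kl = controller.update(observed_kl=3.0, n_samples=100)
print(beta_after_high_kl, beta_after_low_kl)
```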
Safeguards and Bias Mitigation
  • Llama 4 integrates alignment and safety at multiple levels:

    • Data-Level Mitigations: Pre-training filtering and domain balancing to reduce bias propagation.
    • System-Level Safeguards:

      • Llama Guard: safety classifier for harmful content.
      • Prompt Guard: defense against prompt injections and jailbreaks.
      • CyberSecEval: adversarial testing and vulnerability assessment.
    • Generative Offensive Agent Testing (GOAT): Automated multi-turn adversarial red-teaming to simulate real-world misuse cases.
  • Llama 4 achieved measurable progress in political neutrality and response balance: refusal rates on politically sensitive prompts fell below 2%, and unequal refusal bias dropped below 1%, outperforming Llama 3 and matching Grok-class models.

Takeaways
  • The combination of multimodal pre-training, online RL, and DPO alignment produced a family of models that are both powerful and efficient:

    • Llama 4 Maverick surpasses GPT-4o and Gemini 2.0 Flash in reasoning, coding, and multilingual benchmarks.
    • Llama 4 Scout achieves unprecedented 10M-token context understanding and state-of-the-art image grounding.
    • Llama 4 Behemoth establishes new frontiers for teacher-student distillation and large-scale preference optimization.
  • Collectively, these models represent a paradigm shift: from text-based alignment toward multimodal, preference-aware intelligence that learns from human feedback, structured curricula, and continuous self-refinement.

Llama 2
  • As a case study of how Llama 2 was trained, let’s go over the multi-stage process that integrates both human and model-generated feedback to refine the performance of language models. Here’s how it functions:
    1. Pretraining: Llama 2 undergoes initial pretraining with large amounts of data through self-supervised learning. This stage lays the foundation for the model by enabling it to understand language patterns and context.
    2. Supervised Fine-Tuning: The model then undergoes supervised fine-tuning with instruction data, where it is trained to respond to prompts in ways that align with specific instructions.
    3. Reward Models Creation (RLHF Step 1): Two separate reward models are created using human preference data: one for helpfulness and one for safety. These models are trained to predict which of two responses is better based on human judgments.
    4. Margin Loss and Ranking: Unlike the previous approach that generates multiple outputs and uses a “k choose 2” comparison method, Llama 2’s dataset is based on binary comparisons, and each labeler is presented with only two responses at a time. A margin label is collected alongside binary ranks to indicate the degree of preference, which can inform the ranking loss calculation.
    5. Rejection Sampling and Alignment using PPO (RLHF Step 2): Finally, Llama 2 employs rejection sampling and Proximal Policy Optimization (PPO). Rejection sampling is used to draw multiple outputs and select the one with the highest reward for the gradient update. PPO is then used to align the model further, making the model’s responses more safe and helpful.
  • The image below (source) shows how Llama 2 leverages RLHF.
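  • Steps 4 and 5 can be illustrated with two small functions: a pairwise ranking loss with a margin term, of the form \(-\log\sigma\big(r(x, y_c) - r(x, y_r) - m\big)\), and a best-of-N selection as used in rejection sampling. This is a sketch rather than Meta's code; the scalar rewards, the margin values, and the stand-in reward function `len` are all made up for illustration.

```python
import math


def margin_ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """-log(sigmoid(r_chosen - r_rejected - margin)) for one comparison."""
    z = reward_chosen - reward_rejected - margin
    return math.log1p(math.exp(-z))


def best_of_n(candidates, reward_fn):
    """Rejection sampling step: keep the candidate with the highest reward."""
    return max(candidates, key=reward_fn)


# A larger margin (stronger stated preference) demands a larger reward gap,
# so the same pair of scores incurs a higher loss.
loss_no_margin = margin_ranking_loss(1.0, 0.0, margin=0.0)
loss_with_margin = margin_ranking_loss(1.0, 0.0, margin=1.0)

# Toy reward function: prefer longer strings (purely illustrative).
selected = best_of_n(["ok", "a fuller answer", "short"], reward_fn=len)
print(loss_no_margin, loss_with_margin, selected)
```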

Proximal Policy Optimization (PPO)

  • Proximal Policy Optimization (PPO), introduced by Schulman et al. (2017), is an RL algorithm that addresses some key challenges in training agents through policy gradient methods.
  • PPO is widely used in robotics, gaming, and LLM policy optimization, particularly in RLHF.

Background

Terminology: RL Overview
  • RL is a framework for training agents that interact with an environment to maximize cumulative rewards.

    • Agent: Learns to act in an environment.
    • Environment: Defines state transitions and rewards.
    • State (\(s\)): The agent’s perception of the environment at a given time.
    • Action (\(a\)): The agent’s choice affecting the environment.
    • Reward (\(r\)): A scalar feedback signal.
    • Policy (\(\pi(a\mid s)\)): A probability distribution over actions given a state.
    • Value Function (\(V^{\pi}(s)\)): Expected cumulative rewards from state \(s\) when following policy \(\pi\).
    • Advantage Function (\(A^{\pi}(s, a)\)): Measures how much better an action is compared to the expected baseline value.
  • RL problems are modeled as Markov Decision Processes (MDPs) with:

    • States (\(S\))
    • Actions (\(A\))
    • Transition probabilities (\(P(s'\mid s, a)\))
    • Rewards (\(R(s, a)\))
    • Discount factor (\(\gamma\)) for future rewards
States and Actions in LLM Context
  • In the LLM context, states and actions are defined at the token level.
  • Suppose we give our LLM a prompt \(p\). The LLM then generates a response \(r_i\) of length \(T\), one token at a time:

    • \(t=0\): state is the prompt, \(s_0 = \{p\}\); first action \(a_0\) is the first token generated.
    • \(t=1\): state becomes \(s_1 = \{p, a_0\}\), and the next action \(a_1\) is generated conditioned on that state.
    • \(t=T-1\): state is \(s_{T-1} = \{p, a_{0:T-2}\}\), and the final token \(a_{T-1}\) is produced.
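  • The token-level view above can be made concrete with a short sketch that enumerates the (state, action) pairs for one response. The word-level tokens are an assumption for readability; a real model would operate on tokenizer ids.

```python
def token_level_trajectory(prompt_tokens, response_tokens):
    """List (state, action) pairs: each state is the prompt plus tokens so far."""
    trajectory = []
    state = list(prompt_tokens)
    for action in response_tokens:
        trajectory.append((tuple(state), action))
        state.append(action)  # the chosen token becomes part of the next state
    return trajectory


steps = token_level_trajectory(["The", "sky"], ["is", "blue"])
for t, (state, action) in enumerate(steps):
    print(f"t={t}: state={state} action={action!r}")
```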
Policy-Based vs. Value-Based Methods vs. Actor-Critic Methods
  • Reinforcement learning algorithms can be broadly grouped into value-based, policy-based, and actor-critic methods. Each family approaches the problem of learning optimal behavior differently, with varying trade-offs in bias, variance, and sample efficiency.

  • Value-Based Methods:

    • These methods focus on learning value functions that estimate the expected cumulative reward for a given state or state–action pair. The agent then implicitly derives a policy by selecting the action that maximizes this estimated value.

      • Core idea: Learn \(Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]\) and choose actions \(a = \arg\max_a Q(s, a)\).

      • Typical applications: Environments with discrete and well-defined action spaces.

      • Advantages: Sample-efficient, conceptually simple, and does not require explicit policy parameterization.

      • Limitations: Hard to scale to continuous actions; unstable when deep neural networks are used for approximation.

      • Major algorithms:

        • Q-Learning (Quality Learning): Foundational algorithm using tabular updates.
        • SARSA (State–Action–Reward–State–Action): On-policy version of Q-learning.
        • DQN (Deep Q-Network): Combines Q-learning with deep neural networks for high-dimensional input (e.g., pixels).
        • Double DQN (Double Deep Q-Network) and Dueling DQN (Dueling Deep Q-Network): Address overestimation bias and improve learning stability.
  • Policy-Based Methods:

    • Policy-based methods directly learn a parameterized policy \(\pi_\theta(a \mid s)\) rather than deriving it from a value function.
    • The goal is to find parameters \(\theta\) that maximize the expected reward:
    \[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)].\]
    • These methods work well for continuous, stochastic, and high-dimensional action spaces because the policy is explicitly modeled as a probability distribution.

    • Advantages: Smooth policy updates, natural handling of continuous actions, and explicit stochastic exploration.

    • Limitations: High variance in gradient estimates; often require many samples for stable convergence.

    • Major algorithms:

      • REINFORCE (Monte Carlo Policy Gradient): The simplest policy gradient algorithm, using episode-level returns.
      • DPG (Deterministic Policy Gradient): Extends policy gradients to deterministic policies for continuous control.
      • DDPG (Deep Deterministic Policy Gradient): Combines DPG with deep neural networks for scalable continuous control.
      • SAC (Soft Actor-Critic): Adds entropy regularization to encourage exploration and improve robustness.
      • DPO (Direct Preference Optimization): A purely policy-based method that aligns model outputs directly with human preferences by optimizing preference log-ratios, without using rewards or a value function.
      • GRPO (Group Relative Policy Optimization): A policy gradient method inspired by PPO that removes the critic and computes relative advantages across grouped samples, improving efficiency in large language model fine-tuning.
    • Policy Gradient Methods:

      • Subset of policy-based methods that explicitly compute the gradient of the expected return with respect to policy parameters and perform gradient ascent to improve the policy.
      • This principle is formalized in the Policy Gradient Theorem, which provides a mathematical foundation for computing gradients of the expected reward with respect to policy parameters without requiring knowledge of the environment’s dynamics.
      • It shows that the policy gradient can be estimated as an expectation over actions sampled from the current policy, weighted by the advantage function, which quantifies how much better or worse an action performs compared to the average.
      • For a detailed discourse on the policy gradient theorem, refer to the Policy Gradient Theorem section.
      \[\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_{\theta} \log \pi_\theta(a \mid s) A^{\pi}(s, a) \right]\]
      • The gradient increases the likelihood of actions with positive advantages and decreases it for negative advantages.

      • Representative algorithms:

        • REINFORCE (Monte Carlo Policy Gradient): Baseline Monte Carlo gradient estimation.
        • TRPO (Trust Region Policy Optimization): Constrains policy updates to prevent large, destabilizing steps.
        • PPO (Proximal Policy Optimization): A policy gradient–based actor-critic algorithm that uses a clipped objective to limit policy divergence for stable learning.
        • NPG (Natural Policy Gradient): Uses the Fisher information matrix for more geometrically informed updates.
        • GRPO (Group Relative Policy Optimization): PPO-inspired policy gradient method that eliminates the value network, using group-relative baselines instead.
  • Actor-Critic Methods:

    • Actor-Critic algorithms combine both value-based and policy-based ideas, forming a hybrid architecture.

    • The actor directly learns the policy \(\pi_\theta(a\mid s)\) — determining which actions to take (policy-based component).
    • The critic learns a value function \(V^{\pi}(s)\) or \(Q^{\pi}(s, a)\) — estimating how good those actions are (value-based component).
    • The critic provides feedback to the actor by computing the advantage function \(A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s),\) which stabilizes learning and reduces variance in the policy gradient.

    • Actor-Critic methods therefore sit between policy-based and value-based RL — not orthogonal to them, but rather an integration of both. They inherit the flexibility of policy-based optimization and the efficiency of value-based bootstrapping.

    • Advantages:

      • Reduced variance in gradient estimates.
      • Improved stability and sample efficiency.
      • Balanced bias–variance trade-off through combined learning.
    • Limitations:

      • More complex architecture requiring two interacting networks.
      • Susceptible to instability if the critic’s value estimates are inaccurate.
    • Major algorithms:

      • A2C (Advantage Actor-Critic): Uses synchronous updates where multiple environments run in parallel to gather experience. The “Advantage” term refers to using \(A(s, a) = Q(s, a) - V(s)\) to measure how much better an action is than the baseline value, improving training stability.
      • A3C (Asynchronous Advantage Actor-Critic): Extends A2C by running multiple agents asynchronously on different threads or devices. The “Asynchronous Advantage” setup ensures decorrelated experiences and faster convergence by aggregating gradients from independent workers before updating shared parameters.
      • DDPG (Deep Deterministic Policy Gradient): Deterministic actor-critic variant for continuous action spaces.
      • SAC (Soft Actor-Critic): Actor-critic algorithm with entropy regularization for robust exploration.
      • PPO: A policy gradient–based actor-critic algorithm that uses clipped surrogate objectives to limit policy divergence.
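  • As a rough illustration of how the actor and critic interact, the sketch below performs a single one-step actor-critic update with a tabular critic and a softmax actor. The TD error serves as the advantage estimate. The state names, learning rates, and dictionary-based parameter storage are assumptions chosen for brevity.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(policy_logits, state_values, state, action, reward,
                      next_state, done, gamma=0.99, lr_actor=0.1, lr_critic=0.5):
    """Update critic and actor in place from one transition; return the TD error."""
    next_value = 0.0 if done else state_values[next_state]
    td_error = reward + gamma * next_value - state_values[state]  # advantage estimate
    # Probabilities must be read before the actor's logits are modified.
    probs = softmax(policy_logits[state])
    # Critic: move V(s) toward the bootstrapped target.
    state_values[state] += lr_critic * td_error
    # Actor: policy-gradient step weighted by the TD error.
    for a in range(len(probs)):
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        policy_logits[state][a] += lr_actor * td_error * grad_log_pi
    return td_error


policy_logits = {"s": [0.0, 0.0]}
state_values = {"s": 0.0}
td = actor_critic_step(policy_logits, state_values, "s", action=1,
                       reward=1.0, next_state=None, done=True)
print(td, state_values, policy_logits)
```

  • A positive TD error raises the logit of the action just taken and lowers the others, which is the variance-reduced policy-gradient update described above.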
Comparative Analysis
| Method Type | Learns Value Function? | Learns Policy Directly? | Core Learning Signal | Exploration Mechanism | Action Space Suitability | Bias–Variance Profile | Sample Efficiency | Representative Algorithms |
|---|---|---|---|---|---|---|---|---|
| Value-Based | Yes | No (policy derived from value estimates) | Temporal-Difference (TD) Error | ε-greedy or Boltzmann exploration | Best for discrete actions | Higher bias (from bootstrapping), lower variance | ✅ High (reuses data via bootstrapping) | Q-Learning, SARSA, DQN, Double DQN |
| Policy-Based | Not required | Yes | Policy Gradient (\(\nabla_\theta \log \pi\)) | Intrinsic stochasticity in \(\pi(a \mid s)\) | Excellent for continuous or stochastic actions | Low bias, high variance | ❌ Lower (requires many trajectories) | REINFORCE, TRPO, PPO, DDPG, SAC, DPO, GRPO |
| Actor-Critic | Yes | Yes | Policy Gradient + TD Value Estimates | Stochastic or deterministic policies guided by critic | Works for both discrete and continuous | Balanced bias–variance | ✅ Moderate to high (critic improves sample reuse) | A2C, A3C, DDPG, SAC, PPO, GRPO |
Takeaways
  • Value-Based methods estimate what is good (the value).
  • Policy-Based methods directly learn how to act.
  • Actor-Critic methods do both simultaneously, leveraging value estimation to guide efficient and stable policy optimization. This principle underlies PPO, whereas DPO and GRPO optimize the policy without a learned critic.
Policy Gradient Theorem
  • The objective in policy optimization is to maximize the expected return:

    \[J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [ R(\tau) ]\]
    • where \(R(\tau) = \sum_{t=0}^T \gamma^t r_t\) is the discounted cumulative reward along a trajectory.
  • The policy gradient theorem provides a way to compute the gradient of this expectation without differentiating through the environment’s dynamics:

\[\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a\mid s) A^{\pi}(s, a) \right]\]
  • This expression forms the basis of all policy gradient methods and thus underpins algorithms like REINFORCE, TRPO, and PPO.

  • Interpretation: The gradient term \(\nabla_\theta \log \pi_\theta(a\mid s)\) shows how to adjust parameters to increase the likelihood of beneficial actions. The advantage \(A^{\pi}(s, a)\) weights these updates by how good each action turned out relative to the baseline.

  • Variance Reduction: To improve stability, a baseline (usually the value function \(V^{\pi}(s)\)) is subtracted from the return, leading to the definition of the advantage function: \(A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)\). This reduces gradient variance without introducing bias.

  • Practical Implementation: Policy gradient methods rely on Monte Carlo rollouts or temporal-difference learning for return estimation. The theorem is foundational for designing algorithms that can operate in complex or continuous environments, where traditional value-based approaches are inefficient.
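  • As a minimal sketch of the baseline argument above, the snippet below takes a hypothetical three-action softmax policy in a single state and computes the policy gradient exactly, by summing over actions instead of sampling. The logits, the action values `q`, and all function names are illustrative assumptions. It checks that subtracting \(V^{\pi}(s)\) leaves the expected gradient unchanged.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, action):
    # For a softmax policy: d log pi(action) / d logit_k = 1[k == action] - pi(k)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def expected_policy_gradient(logits, q, baseline=0.0):
    # Exact expectation over actions of grad log pi(a) * (Q(a) - baseline)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        g = grad_log_pi(probs, a)
        for k in range(len(logits)):
            grad[k] += p_a * g[k] * (q[a] - baseline)
    return grad

logits = [0.2, -0.1, 0.5]   # hypothetical policy parameters
q = [1.0, 0.0, 2.0]         # hypothetical action values Q(s, a)
probs = softmax(logits)
v = sum(p * qa for p, qa in zip(probs, q))  # V(s) = E_a[Q(s, a)]

g_no_baseline = expected_policy_gradient(logits, q)
g_with_baseline = expected_policy_gradient(logits, q, baseline=v)
```

  • In expectation the two gradients coincide. The benefit of the baseline only shows up with sampled actions, where the advantage-weighted terms typically have a smaller spread.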

Classification of PPO, DPO, and GRPO
  • Building on the distinctions between value-based, policy-based, and actor-critic methods, modern reinforcement learning algorithms such as PPO, DPO, and GRPO represent successive innovations in policy optimization.
  • While all three focus on directly improving a policy, they differ in whether they use a value function (critic), how they estimate advantages, and how they constrain or stabilize policy updates.

####### PPO

  • Classification:
    • Policy-Based Method
    • Policy Gradient Method
    • Actor-Critic Method
  • Explanation:
    • PPO is one of the most influential actor-critic algorithms and a cornerstone of modern policy gradient methods. It improves upon earlier methods like REINFORCE and TRPO by introducing a clipped surrogate objective that stabilizes policy updates and prevents overly large gradient steps.

    • Why Policy-Based: PPO directly parameterizes and optimizes a stochastic policy \(\pi_\theta(a\mid s)\), rather than deriving it from a value function.
    • Why Policy Gradient: PPO explicitly applies the policy gradient theorem, optimizing:

      \[L^{\text{PPO}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t\right)\right]\]
      • where \(r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\).
    • Why Actor-Critic: PPO combines a policy network (actor) with a value network (critic) that estimates \(V^{\pi_\theta}(s_t)\) to compute advantages \(A_t = Q_t - V_t\). The critic reduces gradient variance and improves stability.
  • Takeaway:
    • PPO is a policy gradient–based actor-critic algorithm that achieves stable learning through clipped objective functions. It serves as the foundation for many subsequent variants, including GRPO.

####### DPO

  • Classification:
    • Policy-Based Method
    • Not a Policy Gradient Method (in the traditional sense)
    • Not an Actor-Critic Method
  • Explanation:
    • DPO reformulates reinforcement learning from human feedback (RLHF) into a supervised preference optimization problem. Rather than optimizing reward expectations or using a critic, DPO directly learns from pairwise human preference data.

    • Why Policy-Based:
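      • DPO directly optimizes a parameterized policy \(\pi_\theta(y\mid x)\) using preference pairs \((x, y^{+}, y^{-})\), i.e., preferred and dispreferred responses to the same prompt, measured against a frozen reference policy \(\pi_{\text{ref}}\):
      \[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y^+, y^-)}\left[\log\sigma\left(\beta \left(\log\frac{\pi_\theta(y^+\mid x)}{\pi_{\text{ref}}(y^+\mid x)} - \log\frac{\pi_\theta(y^-\mid x)}{\pi_{\text{ref}}(y^-\mid x)}\right)\right)\right]\]
      • The objective increases the likelihood of preferred responses and decreases that of dispreferred ones, relative to the reference policy; \(\beta\) sets how strongly the policy is kept close to \(\pi_{\text{ref}}\).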
    • Why Not a Policy Gradient Algorithm:
      • Although it resembles policy gradient updates (due to its use of log probabilities), DPO does not compute expectations over environment trajectories or reward-weighted returns. It performs direct supervised optimization on preference data, bypassing both on-policy sampling and a separately trained reward model.
    • Why Not Actor-Critic:
      • DPO has no critic or explicit reward model. Its optimization signal derives purely from pairwise human feedback, not from estimated value functions or TD errors.
  • Takeaway:
    • DPO is a purely policy-based alignment algorithm that removes rewards and critics entirely. It bridges reinforcement learning and supervised fine-tuning by optimizing the policy directly with respect to preference data — effectively sidestepping the instability and variance of traditional RL pipelines.
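  • A minimal sketch of the DPO loss for a single preference pair, following the published DPO formulation in which log-probabilities are measured relative to a frozen reference policy. The function names, the example log-probabilities, and the default `beta=0.1` are assumptions made for illustration.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Each argument is a sequence log-probability: the sum of token
    # log-probs of a response given the prompt.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# Hypothetical log-probabilities for one (prompt, preferred, dispreferred) triple
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)    # policy equals reference
loss_improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)   # preferred up, dispreferred down
```

  • When the policy equals the reference, the margin is zero and the loss is \(\log 2\). Raising the preferred response's log-probability and lowering the dispreferred one's reduces the loss, with no reward model or critic involved.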

####### GRPO

  • Classification:
    • Policy-Based Method
    • Policy Gradient Method
    • Not a Traditional Actor-Critic Method
  • Explanation:
    • GRPO extends PPO’s core ideas but removes the critic network. Instead, it estimates relative advantages among groups of sampled trajectories, using these relative differences as a variance-reducing baseline.

    • Why Policy-Based:
      • GRPO optimizes the policy \(\pi_\theta(a\mid s)\) directly, without any value estimation step. It relies solely on comparative feedback among trajectories.
    • Why Policy Gradient:
      • GRPO computes gradients using group-relative advantages, following the same principle as PPO but without an explicit value function:
      \[A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j\]
      • where \(r_i\) is the reward of sample \(i\) and \(G\) is the group size.
      • This group-average baseline functions like a self-normalizing critic, stabilizing updates.
    • Why Not Actor-Critic:
      • Although inspired by PPO, GRPO completely removes the critic, relying on intra-group comparisons to measure advantage rather than predicted values.
  • Takeaway:
    • GRPO is a critic-free policy gradient variant of PPO, tailored for efficient preference-based and reinforcement learning with large language models (LLMs). It preserves PPO’s update stability while simplifying training through relative advantage estimation.
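  • The group-average baseline can be sketched in a few lines. The function name and the example rewards are hypothetical. The `normalize` flag is an assumption added here to reflect that the GRPO formulation in the DeepSeekMath paper also divides the centered reward by the group standard deviation, which the simplified formula above omits.

```python
import math

def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    # rewards: scalar rewards of G responses sampled for the same prompt
    g = len(rewards)
    mean = sum(rewards) / g
    centered = [r - mean for r in rewards]
    if not normalize:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / g)
    return [c / (std + eps) for c in centered]

rewards = [1.0, 0.0, 0.0, 1.0, 0.5]   # hypothetical rewards for G = 5 samples
adv = group_relative_advantages(rewards)
adv_normalized = group_relative_advantages(rewards, normalize=True)
```

  • Responses that score above the group mean get positive advantages and the rest get negative ones, so the advantages within a group sum to zero without any learned value network.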

####### Summary Comparison: PPO vs. DPO vs. GRPO

| Algorithm | Policy-Based | Policy Gradient | Actor-Critic | Uses Value Function | Optimization Signal | Key Innovation |
| --- | --- | --- | --- | --- | --- | --- |
| PPO (Proximal Policy Optimization) | ✅ | ✅ | ✅ | ✅ | Advantage-weighted policy gradient with clipping | Stabilized policy updates via clipped surrogate loss |
| DPO (Direct Preference Optimization) | ✅ | ❌ | ❌ | ❌ | Preference-based log-likelihood ratio | Direct alignment from human preference data without rewards or critics |
| GRPO (Group Relative Policy Optimization) | ✅ | ✅ | ❌ | ❌ | Group-relative advantage estimation | Removes critic; uses group-average reward as baseline |

####### Takeaways

  • These algorithms represent an evolution in policy optimization, progressively simplifying how feedback and stability are achieved:

    1. PPO (2017) — Anchored in traditional actor-critic design, PPO uses a learned value function to estimate advantages and a clipped objective to stabilize updates.
    2. DPO (2023) — Moves beyond explicit reward and critic modeling, using direct supervised optimization on human preference data.
    3. GRPO (2024) — Reintroduces reinforcement-style training but without a critic, computing relative advantages among sampled groups.
  • In summary:

    • All three are policy-based methods.
    • PPO and GRPO are policy gradient methods, while DPO uses supervised gradients instead of estimating gradients from sampled rewards or environment rollouts (as in policy gradients). Specifically, DPO derives them directly from supervised preference losses computed over labeled data pairs \((x, y^+, y^-)\). These gradients arise from minimizing a differentiable loss function, much like in standard supervised learning, where the model is updated to increase the likelihood of preferred outputs.
    • Only PPO retains the actor-critic structure.
  • Together, they trace a continuum from explicit RL (PPO) → direct preference learning (DPO) → critic-free policy gradients (GRPO), marking the field’s shift toward simpler, more scalable approaches for optimizing large model behavior.

Predecessors of PPO
  • REINFORCE and TRPO serve as foundational approaches to policy optimization, each addressing different challenges in RL. REINFORCE provides a simple yet high-variance method for optimizing policies, while TRPO improves stability by constraining updates. These methods paved the way for Proximal Policy Optimization (PPO), which builds on TRPO by introducing a more efficient and scalable optimization framework commonly used in modern RL applications.
The REINFORCE Algorithm
  • One of the earliest policy optimization methods in RL is REINFORCE, introduced in Williams (1992). REINFORCE is a policy gradient algorithm that directly optimizes the policy by maximizing expected rewards.
  • The key idea behind REINFORCE is the use of Monte Carlo sampling to estimate the policy gradient, which is then used to update the policy parameters using stochastic gradient ascent.
  • The update rule is as follows:

    \[\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) R_t\]
    • where:
      • \(\pi_\theta\) is the policy parameterized by \(\theta\),
      • \(a_t\) is the action taken at time \(t\),
      • \(s_t\) is the state at time \(t\),
      • \(\alpha\) is the learning rate, and
      • \(R_t\) is the cumulative return from time step \(t\), defined as \(R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k\), representing the total discounted reward obtained from that point onward. It captures how good the future trajectory is, starting from time \(t\), based on the agent’s actions.
  • Despite its simplicity, REINFORCE suffers from high variance in gradient estimates, leading to unstable training. Variance reduction techniques like baseline subtraction (using a value function) are often used to mitigate this issue.
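  • The return \(R_t\) in the update rule can be computed for every time step in a single backward pass over the episode. The sketch below shows this step only; the function name and the example rewards are illustrative assumptions.

```python
def returns_to_go(rewards, gamma=0.99):
    # R_t = r_t + gamma * R_{t+1}, computed from the end of the episode backwards
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example_returns = returns_to_go([1.0, 0.0, 2.0], gamma=0.5)
print(example_returns)  # [1.5, 1.0, 2.0]
```

  • Each \(R_t\) then multiplies \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) in the update. Because \(R_t\) depends on every later reward in the sampled episode, it varies considerably from one rollout to the next, which is the source of the high variance noted above.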
Trust Region Policy Optimization (TRPO)
  • Trust Region Policy Optimization (TRPO) is an advanced policy optimization algorithm introduced by Schulman et al. (2015). It was developed to improve upon traditional policy gradient methods like REINFORCE by enforcing a constraint on policy updates, preventing large, destabilizing changes that can degrade performance.

####### Core Idea

  • TRPO aims to optimize the expected advantage-weighted policy ratio while ensuring that updates remain within a predefined trust region. The objective function is:

    \[\max_{\theta} \mathbb{E}_{s, a \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_\text{old}}(a|s)} A^{\pi_{\theta_\text{old}}}(s, a) \right]\]
    • subject to the Kullback-Leibler (KL) divergence constraint:
    \[D_{KL}(\pi_{\theta_\text{old}} \,\|\, \pi_{\theta}) \leq \delta\]
    • where:
      • \(A^{\pi_{\theta_\text{old}}}(s, a)\) is the advantage function,
      • \(D_{KL}\) is the KL divergence measuring the difference between old and new policies,
      • \(\delta\) is a small threshold defining the trust region.
  • This KL constraint ensures that policy updates are not too aggressive, preventing performance collapse and maintaining stability.
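  • To make the trust region concrete, the sketch below computes the KL divergence between two categorical action distributions in a single state and checks it against \(\delta\). The function names, the example distributions, and \(\delta = 0.01\) are assumptions. The divergence is taken from the old policy to the new one, which is the direction used in the TRPO paper. TRPO itself enforces the constraint with a second-order method rather than a simple check like this.

```python
import math

def kl_divergence(p, q):
    # KL(p || q) for two categorical distributions over the same actions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def within_trust_region(pi_old, pi_new, delta=0.01):
    return kl_divergence(pi_old, pi_new) <= delta

pi_old = [0.5, 0.3, 0.2]
small_step = [0.52, 0.29, 0.19]   # hypothetical mild update
large_step = [0.9, 0.05, 0.05]    # hypothetical aggressive update

print(within_trust_region(pi_old, small_step))  # True
print(within_trust_region(pi_old, large_step))  # False
```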

####### The Role of the Policy Ratio

  • The policy ratio, defined as \(r(s, a; \theta) = \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\), measures how the probability of taking a particular action under the new policy compares to the old one.

  • This ratio acts as an importance weight, re-scaling each sampled action’s contribution according to how likely it is under the updated policy. In practice:

    • If an action becomes more likely under the new policy (ratio > 1), its advantage contributes more to the gradient update.
    • If it becomes less likely (ratio < 1), its contribution is reduced.
  • The policy ratio plays the role of reweighting one distribution by another, and is sometimes called the reweighting factor. It corrects for the distribution shift between the old and new policies: the samples are drawn under \(\pi_{\theta_\text{old}}\), but the optimization objective is defined for \(\pi_{\theta}\). Without this correction, the estimate would be biased, because the data distribution (from the old policy) would not match the target distribution (from the new policy).

  • Although TRPO is an on-policy algorithm, every update step evaluates candidate policies \(\pi_{\theta}\) on trajectories that were already collected with \(\pi_{\theta_\text{old}}\). The policy ratio bridges this mismatch, allowing TRPO to estimate how the new policy would perform if deployed while relying only on previously collected data.

  • By incorporating the policy ratio within a KL-constrained optimization, TRPO ensures stable and monotonic policy improvement — a key theoretical advantage over unconstrained policy gradient methods.
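  • The reweighting role of the policy ratio can be checked numerically. In this sketch (hypothetical distributions and advantages, single state), the expectation under the old policy of ratio times advantage equals the expectation of the advantage under the new policy, while the unweighted average under the old policy does not.

```python
def expectation(probs, values):
    return sum(p * v for p, v in zip(probs, values))

pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.4, 0.2]
advantages = [1.0, -0.5, 2.0]   # hypothetical advantage of each action

ratios = [n / o for n, o in zip(pi_new, pi_old)]

# Data comes from pi_old, but the ratio makes the estimate target pi_new
reweighted = expectation(pi_old, [r * a for r, a in zip(ratios, advantages)])
direct = expectation(pi_new, advantages)
unweighted = expectation(pi_old, advantages)
```

  • Here `reweighted` and `direct` agree up to floating-point error, whereas `unweighted` reflects the old policy's action frequencies and is therefore off.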

####### Strengths and Limitations

  • Stable Learning: TRPO’s constraint limits drastic changes in policy updates, making it robust in complex environments such as robotic control and RL applications.
  • Computational Complexity: TRPO requires solving a constrained optimization problem, which involves computing second-order derivatives, making it computationally expensive.
  • Impact on PPO: TRPO inspired PPO, which simplifies the trust region approach by replacing the explicit KL constraint with a clipped objective function that needs only first-order optimization.
  • Overall, TRPO remains a cornerstone in RL, particularly in high-stakes applications where stability is crucial.

####### Paving the way for PPO

  • TRPO introduced trust region constraints to stabilize learning, paving the way for PPO, which approximates the same effect with a clipped objective function that keeps each policy update close to the previous policy.

Intuition Behind PPO

  • PPO is designed to stabilize policy updates by ensuring that new policies do not deviate too much from previous ones.
Why Not Naive Policy Gradients?
  • Traditional policy gradients (REINFORCE) often lead to unstable updates because they do not constrain how much the policy changes from one iteration to the next.
  • This can cause catastrophic forgetting or sudden performance drops.
Why Not Trust Region Policy Optimization (TRPO)?
  • TRPO stabilizes learning by enforcing a trust region constraint using KL-divergence, but solving the constrained optimization problem is computationally expensive.
How Does PPO Solve These Problems?
  • PPO simplifies TRPO by introducing a clipping mechanism in the objective function.
  • This allows for stable policy updates without requiring second-order optimization or explicit KL-divergence constraints.
  • Thus, PPO achieves a balance between stability and efficiency, making it highly practical for large-scale RL applications.

Fundamental Components and Requirements

  • PPO requires the following fundamental components:
    • Policy \(\pi_{\theta}\): The LLM that has been pre-trained or undergone supervised fine-tuning.
    • Reward Model \(R_{\phi}\): A trained and frozen network that provides a scalar reward given a complete response to a prompt.
    • Critic \(V_{\gamma}\): Also known as the value function, a learnable network that takes in a partial response to a prompt and predicts the scalar reward.

Core Principles

Policy Gradient Approach
  • PPO operates on the policy gradient approach, where the agent directly learns a policy, typically parameterized by a neural network. The policy maps states to actions based on the current understanding of the environment.
Actor-Critic Framework
  • PPO is based on the actor-critic framework, which means it simultaneously trains two components:
    • Actor (Policy Network): Selects actions based on the current policy.
    • Critic (Value Function Network): Evaluates these actions by estimating the expected return of each state, i.e., the state value \(V(s)\).
  • This dual approach allows PPO to efficiently balance exploration and exploitation by guiding the actor’s policy updates using feedback from the critic. The critic helps compute the advantage function, which quantifies the quality of the actions taken, enabling more informed updates to the policy.
The Actor (Policy Network)
  • The actor network (\(\pi_\theta\)) is responsible for selecting actions based on the current policy:

    \[\pi_\theta(a_t \mid s_t) = P(a_t \mid s_t ; \theta)\]
    • where \(\theta\) represents the learnable parameters of the policy network.
  • Unlike the critic, which estimates the expected return of a given state, the actor directly determines the probability distribution over possible actions. This allows the agent to explore different responses while refining its behavior over time.

  • The actor is updated using a clipped surrogate objective function to ensure stable policy improvements:

    \[L(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t \right) \right]\]
    • where:
      • \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) is the probability ratio between the new and old policies.
      • \(A_t\) is the advantage function guiding policy updates.
      • \(\epsilon\) is a hyperparameter that constrains policy updates to prevent drastic changes.
  • This clipping mechanism prevents excessively large updates, mitigating instability and ensuring smooth learning.

  • The actor continually adapts by maximizing this objective, leading to more effective and stable policy learning while being guided by the critic’s evaluation of expected returns.

The Critic (Value Function)
  • The critic network (\(V_\gamma\)) is trained to predict the final reward from a partial response:

    \[L(\gamma) = \mathbb{E}_t \left[(V_\gamma(s_t) - \text{sg}(R_\phi(s_T)))^2\right]\]
    • where \(\text{sg}\) is the stop-gradient operation.
  • The critic learns alongside the policy, ensuring it stays aligned with the current model.
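  • A minimal sketch of the critic loss for one response, assuming the value predictions for each partial response are already available as plain numbers. The function name and the example values are hypothetical. The final reward is treated as a constant target, mirroring the stop-gradient.

```python
def critic_loss(value_predictions, final_reward):
    # Mean squared error between V(s_t) for each partial response and the
    # reward of the complete response (a fixed target, no gradient through it)
    target = final_reward
    errors = [(v - target) ** 2 for v in value_predictions]
    return sum(errors) / len(errors)

values = [0.2, 0.5, 0.9]   # hypothetical V(s_t) after 1, 2, and 3 tokens
loss = critic_loss(values, final_reward=1.0)   # approximately 0.3
```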

Top-Level Workflow

  • The PPO workflow contains five main stages for iterative policy improvement:
    1. Generate responses: LLM produces multiple responses for a given prompt
    2. Score responses: The reward model assigns reward for each response
    3. Compute advantages: Use GAE to compute advantages
    4. Optimize policy: Update the LLM by optimizing the total objective
    5. Update critic: Train the value function to be better at predicting the rewards given partial responses
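  • The five stages can be sketched end to end on a toy problem. The example below is not an LLM training loop: it assumes a single-state bandit with fixed per-arm rewards standing in for the reward model, a softmax policy over logits as the actor, and a single scalar as the critic. Because episodes last one step, the advantage reduces to \(r - V\) instead of full GAE. All names and hyperparameters are illustrative.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ppo_bandit(arm_rewards, iterations=60, batch_size=64, epochs=4,
               lr=0.5, value_lr=0.5, eps=0.2, seed=0):
    rng = random.Random(seed)
    n = len(arm_rewards)
    logits = [0.0] * n   # actor parameters
    value = 0.0          # critic: one scalar, since there is a single state
    for _ in range(iterations):
        # Stages 1-2: sample actions from the current (old) policy and score them
        old_probs = softmax(logits)
        actions = rng.choices(range(n), weights=old_probs, k=batch_size)
        rewards = [arm_rewards[a] for a in actions]
        # Stage 3: advantages (one-step episodes, so A = r - V)
        advantages = [r - value for r in rewards]
        # Stage 4: several epochs of gradient ascent on the clipped surrogate
        for _ in range(epochs):
            probs = softmax(logits)
            grad = [0.0] * n
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                clipped = (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
                if clipped:
                    continue  # min() picks the constant clipped term: zero gradient
                for k in range(n):
                    indicator = 1.0 if k == a else 0.0
                    # d(ratio * adv)/d logit_k for a softmax policy
                    grad[k] += ratio * (indicator - probs[k]) * adv / batch_size
            logits = [z + lr * g for z, g in zip(logits, grad)]
        # Stage 5: move the critic toward the observed rewards
        value += value_lr * (sum(rewards) / batch_size - value)
    return softmax(logits), value

final_probs, final_value = ppo_bandit([0.1, 0.5, 1.0])
```

  • In this toy setting the policy should shift most of its probability onto the highest-reward arm. In RLHF the same loop applies with prompts and sampled responses in place of arms, and a learned reward model in place of `arm_rewards`.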

Generalized Advantage Estimation (GAE)

  • PPO uses Generalized Advantage Estimation (GAE) to compute advantages, which measure how much better a specific action \(a_t\) is than the average action the policy would take in state \(s_t\).
  • GAE plays a crucial role in PPO by providing a flexible, variance-reduced estimator of the advantage function, enabling more stable and sample-efficient policy optimization.
Formal Definition
\[A_t = Q(s_t, a_t) - V(s_t)\]
  • where:
    • \(Q(s_t, a_t)\) is the expected cumulative reward of taking a specific action \(a_t\) in state \(s_t\)
    • \(V(s_t)\) is the expected cumulative reward of the average action the policy takes in state \(s_t\)
Advantage Estimation Approaches
  • There are two main approaches to estimating advantage:

    • Monte-Carlo (MC):
      • Uses the reward of the full trajectory (full responses)
      • High variance due to sparse reward
      • Low bias as we can accurately model the reward
    • Temporal Difference (TD):
      • Uses one-step trajectory reward
      • Significantly reduces variance
      • Higher bias as we can’t as accurately anticipate final reward
GAE Formula and Bias-Variance Trade-off
  • GAE balances bias and variance through multi-step TD:

    \[A^{\text{GAE}}_K = \sum^{K-1}_{t=0} (\gamma\lambda)^t \delta_t\]
    • where:
      • \(K\) denotes the number of TD steps (\(K < T\))
      • \(\delta_t\) denotes the TD error at step \(t\): \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)
      • The hyperparameter \(\lambda\) controls the trade-off:
        • \(\lambda = 0\) \(\rightarrow\) Pure TD learning (low variance, high bias)
        • \(\lambda = 1\) \(\rightarrow\) Pure Monte Carlo (high variance, low bias)
  • In practice, PPO uses a truncated version of GAE, where the advantage estimate over a trajectory segment of length \(T\) is computed as:

    \[\hat{A}_t = \delta_t + (\gamma \lambda) \delta_{t+1} + \cdots + (\gamma \lambda)^{T - t - 1} \delta_{T - 1}\]
    • where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)
  • This formulation allows PPO to effectively trade off bias and variance by adjusting \(\lambda\), which is typically set between 0.9 and 0.97.
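  • The truncated estimator is usually computed with a backward recursion, \(\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}\), which expands to the sum of TD errors weighted by powers of \(\gamma\lambda\). The sketch below assumes `values` carries one extra entry, \(V(s_T)\), for bootstrapping; the function name and example numbers are illustrative.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: r_0 .. r_{T-1}
    # values:  V(s_0) .. V(s_T), i.e. one more entry than rewards
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Hypothetical 3-step episode with a reward only at the end
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7, 0.0]   # V(s_3) = 0 because the episode has ended

td_only = compute_gae(rewards, values, gamma=1.0, lam=0.0)       # one-step TD errors
monte_carlo = compute_gae(rewards, values, gamma=1.0, lam=1.0)   # return minus V(s_t)
```

  • With \(\lambda = 0\) each advantage is just the one-step TD error, and with \(\lambda = 1\) it equals the observed return minus \(V(s_t)\), matching the two extremes listed above.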

Role in PPO’s Clipped Surrogate Objective
  • This advantage estimate \(\hat{A}_t\) is a critical component of PPO’s clipped surrogate objective, which is used to update the policy:

    \[L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]\]
    • where:
      • \(r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) is the ratio of the probability of action \(a_t\) under the new and old policies
      • \(\epsilon\) is a hyperparameter (e.g., 0.2) that limits the deviation from the old policy
  • The advantage \(\hat{A}_t\) modulates how much the policy is updated: if the advantage is positive, the update favors increasing the probability of the action; if negative, the update discourages it. Clipping ensures the update is conservative and prevents excessive deviation from the current policy.
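  • A per-sample sketch of the clipped surrogate, useful for seeing which branch of the \(\min\) is active in each case. The function names and the example ratio/advantage pairs are assumptions chosen for illustration.

```python
def clip(x, lo, hi):
    return max(lo, min(x, hi))

def clipped_surrogate(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# (ratio, advantage) pairs covering the four interesting cases
good_action_overshoot = clipped_surrogate(1.5, 1.0)    # capped at (1 + eps) * A
good_action_undershoot = clipped_surrogate(0.5, 1.0)   # unclipped: ratio * A
bad_action_overshoot = clipped_surrogate(1.5, -1.0)    # unclipped: ratio * A
bad_action_undershoot = clipped_surrogate(0.5, -1.0)   # capped at (1 - eps) * A
```

  • The objective is capped only when the ratio has already moved in the direction the advantage favors. When the ratio has moved the wrong way, the unclipped term is kept, so the policy can still be corrected.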

Value Function and Critic Role
  • The value function \(V(s_t)\), which is used in both computing \(\delta_t\) and as a critic during training, is learned using a regression loss:
\[L^{VF}_t(\theta) = \left(V_\theta(s_t) - V^{\text{target}}_t \right)^2\]
  • PPO combines the policy loss, value loss, and an entropy bonus (to encourage exploration) into a total loss function:

    \[L^{\text{CLIP+VF+S}}_t(\theta) = \mathbb{E}_t \left[ L^{\text{CLIP}}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_\theta](s_t) \right]\]
    • where:
      • \(c_1\) and \(c_2\) are coefficients
      • \(S[\pi_\theta](s_t)\) is the entropy of the policy at state \(s_t\)
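  • The combined objective for a single time step can be sketched as follows. The coefficient values `c1=0.5` and `c2=0.01` are common choices assumed here for illustration, not values taken from the text, and the function names are hypothetical.

```python
import math

def entropy(probs):
    # Shannon entropy of the policy's action distribution at one state
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_total_objective(ratio, advantage, value_pred, value_target, probs,
                        eps=0.2, c1=0.5, c2=0.01):
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    l_clip = min(ratio * advantage, clipped_ratio * advantage)   # policy term
    l_vf = (value_pred - value_target) ** 2                      # value loss
    return l_clip - c1 * l_vf + c2 * entropy(probs)              # to be maximized

objective = ppo_total_objective(
    ratio=1.1, advantage=0.5,
    value_pred=0.4, value_target=0.6,
    probs=[0.5, 0.5],
)
```

  • The value loss enters with a minus sign because the whole expression is maximized, while the entropy term is added so that more exploratory policies score slightly higher.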
Reward and Value Model Roles
  • In classic reinforcement learning tasks such as robotic control or Atari games, the reward signal used by PPO is typically the raw reward provided by the environment. In these settings, it is a numerical score or other environment-defined signal that reflects success (e.g., distance walked or enemies defeated).
  • PPO uses this reward to compute the temporal difference error \(\delta_t\), which is then used to calculate the advantage estimate \(\hat{A}_t\). The reward, therefore, directly influences how the policy updates toward favoring higher-value actions.
  • In the context of RLHF applied to LLMs, the situation changes because environments like natural language do not inherently provide a structured, numerical reward signal. Instead, we use a learned reward model trained on human preferences.
    • Here’s how it works:
      • Human labelers are shown pairs of model-generated responses and asked to choose which one they prefer.
      • These comparisons are used to train a reward model that maps an LLM response (conditioned on a prompt) to a scalar reward, indicating how “good” or “aligned” the response is with human preferences.
      • This reward model replaces the environment’s raw reward and acts as the reward function in PPO.
  • When using PPO in RLHF:
    • The LLM generates a response to a prompt (this is the action).
    • The reward model assigns a scalar reward to the response.
    • This scalar is treated as \(r_t\) in the PPO pipeline.
    • The value model (critic) still estimates \(V(s_t)\), typically as the expected reward for a given prompt.
    • GAE is used to compute the advantage \(\hat{A}_t\), guiding the policy update so the model improves toward generating more reward-aligned responses.
  • So while the PPO algorithm itself remains the same, the source of the reward changes:
    • In environments like MuJoCo or Atari: reward is native to the environment.
    • In RLHF for LLMs: reward is generated by a separate reward model trained to reflect human judgment.
  • This adaptation is key to making PPO applicable in NLP settings, where explicit reinforcement signals are absent and have to be approximated using human feedback.
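  • The text above does not specify how the reward model is fit to the pairwise comparisons. One common choice, assumed here, is a Bradley-Terry style loss that pushes the preferred response's score above the other's. The sketch takes the two scalar scores as given; the function name and example scores are hypothetical.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    # -log(sigmoid(reward_chosen - reward_rejected)), written in a stable form
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

agrees_with_label = pairwise_reward_loss(2.0, 0.0)      # small loss
disagrees_with_label = pairwise_reward_loss(0.0, 2.0)   # large loss
undecided = pairwise_reward_loss(0.0, 0.0)              # log(2)
```

  • Once trained and frozen, the reward model's scalar output for a complete response is what PPO consumes as the reward.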

Key Components

Optimal Policy and Reference Policy
  1. Optimal Policy (\(\pi^{*}\) or \(\pi_{optimal}\)): The optimal policy refers to the strategy or set of rules that the LLM follows to maximize the objective function \(J(\pi)\). This objective function is designed to reflect the goals of alignment, such as generating helpful, truthful, and harmless responses. Formally, the optimal policy \(\pi^{*}\) is defined as:

    \[\pi^{*} = \arg\max_{\pi} J(\pi)\]
    • where \(J(\pi)\) is the objective function.
  2. Reference Policy (\(\pi_{\text{ref}}\)): The reference policy is a baseline or benchmark policy used to compare and guide the learning process of the optimal policy. It represents a known, stable policy that the model starts from or refers back to during training. The reference policy helps in stabilizing the training process by providing a consistent comparison point.

Summary
  • \(\pi_{\text{optimal}}\): Optimal policy, maximizing the objective function \(J(\pi)\).
  • \(\pi_{\text{ref}}\): Reference policy, providing a stable baseline for training.
Surrogate Objective Function
  • Central to PPO is its surrogate objective function, which considers the (i) policy ratio, and (ii) advantage function, as explained below.

  • In the context of LLMs, the state corresponds to the input prompt along with the tokens generated so far (i.e., the context), and the action refers to the next token the model chooses to generate. That is:
    • State \(s\): The input question \(q\) and previously generated tokens \(o_{<t}\)
    • Action \(a\): The next token \(o_t\)
  • The “policy ratio” (also called the likelihood ratio, probability ratio, importance ratio, importance sampling ratio, or policy likelihood ratio) is the ratio of the probability of an action under the new (i.e., current) policy to its probability under the old (i.e., behavior) policy that generated the data. This ratio helps align the training of the current model with the data sampled from an earlier version of the policy.

  • Mathematically, the general form of the policy ratio is:

    \[r(\theta) = \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\]
  • In the LLM setting, this becomes:

    \[r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\text{old}}(o_t \mid q, o_{<t})}\]
    • where:
      • \(\pi_\theta\) is the current policy (i.e., the model being updated),
      • \(\pi_{\text{old}}\) is the policy that was used to generate the training data,
      • \(o_t\) is the token being predicted at time step \(t\),
      • \(q\) is the question or initial input,
      • \(o_{<t}\) is the sequence of previously generated tokens.
  • This ratio tells us how much more or less likely the current model is to generate a token compared to the old one. It’s used to reweight updates to the policy to account for the fact that training data was collected under a different policy, hence the name “importance sampling” ratio.

  • In PPO, this ratio is clipped within a certain range (e.g., \([1 - \epsilon, 1 + \epsilon]\)) to prevent large, destabilizing updates. This makes the training more robust when the current policy starts to diverge from the old one.

  • The policy ratio is multiplied by the advantage function, which measures how much better a specific action is compared to the average action at that state. In PPO, this advantage is estimated using techniques like Generalized Advantage Estimation (GAE) and relies on a separately trained value function (critic). In contrast, GRPO simplifies this by estimating the advantage from relative group rewards, avoiding the need for a value model.

  • A detailed discourse on this has been offered in the section on PPO’s Objective Function: Clipped Surrogate Loss.
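  • In practice the per-token ratio is usually computed from log-probabilities rather than raw probabilities, since \(r_t = \exp(\log \pi_\theta - \log \pi_{\text{old}})\) is more stable numerically. A minimal sketch with hypothetical token log-probabilities for a three-token response:

```python
import math

def token_ratios(new_logprobs, old_logprobs):
    # r_t = pi_theta(o_t | q, o_<t) / pi_old(o_t | q, o_<t), via log-probs
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

old_logprobs = [-1.2, -0.7, -2.3]   # log-probs recorded when the response was sampled
new_logprobs = [-1.0, -0.7, -2.6]   # log-probs of the same tokens under the updated model

ratios = token_ratios(new_logprobs, old_logprobs)
```

  • The first token has become more likely (ratio above 1), the second is unchanged (ratio exactly 1), and the third has become less likely (ratio below 1).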
Clipping Mechanism
  • PPO clips the policy ratio in its objective function to a defined range (typically \([1-\epsilon, 1+\epsilon]\)). Once the ratio leaves this range in the direction that would increase the objective, the corresponding sample stops contributing gradient, which keeps the new policy from deviating excessively from the old one and helps maintain the stability of the learning process.
Data Re-use over Multiple Epochs of Stochastic Gradient Ascent
  • PPO uses each batch of experiences for multiple epochs of stochastic gradient ascent to update the policy, improving sample efficiency compared to some other methods.
Value Function and Baseline
  • PPO trains a value function (the critic) alongside the policy (the actor) to estimate state values. The value function estimates the expected return (cumulative future rewards) from each state and is used to compute the advantage function, which in turn informs the policy update.
  • The baseline provided by the critic stabilizes the training process by reducing variance in the policy gradients, helping the actor make more precise updates.

PPO’s Objective Function: Clipped Surrogate Loss

Intuition
  • The surrogate loss in PPO is defined based on the ratio of the probability of taking an action under the current policy to the probability of taking the same action under the old policy that collected the data.
  • This ratio is used to adjust the policy towards actions that have higher rewards while ensuring that updates are not too drastic. The clipping mechanism is employed to limit the magnitude of these updates, maintaining stability during training.

Note that in conventional deep learning, loss functions are typically minimized to reduce prediction error, while in reinforcement learning, objective functions are usually maximized to increase expected reward or policy performance. Specifically, in policy optimization (say, with PPO) the objective function is maximized, as it aims to improve the policy by increasing the expected reward under a surrogate objective.

Components
  • PPO’s clipped surrogate objective function has the following components:

    • Policy Ratio: The core of the PPO objective function involves the policy ratio, which is the ratio of the probability of taking a certain action under the current policy to the probability under the old policy. This ratio is multiplied by the advantage estimate, which reflects how much better a given action is compared to the average action at a given state.

    • Clipped Surrogate Objective: To prevent excessively large updates, which could destabilize training, PPO introduces a clipping mechanism in its objective function. The policy ratio is clipped within a certain range, typically \([1-\epsilon, 1+\epsilon]\) (where \(\epsilon\) is a small value like 0.1 or 0.2). This clipping ensures that the updates to the policy are not too large, which maintains stability in training.
    • Formally:

      \[L^{\text{clip}}(\theta) = \mathbb{E}_t \left[ \min(c_t(\pi_\theta) A^{\text{GAE}}_t, \text{clip}(c_t(\pi_\theta),1-\epsilon, 1+\epsilon) A^{\text{GAE}}_t)\right]\]
      • where:
        • \(L^{\text{clip}}(\theta)\):
          • The clipped surrogate loss in PPO, which balances policy updates by preventing excessively large changes to the policy.
          • This function ensures that the new policy does not deviate too far from the old policy, maintaining stable training.
        • \(\mathbb{E}_t\):
          • Expectation over all time steps \(t\), averaging the objective function across multiple training samples.
        • \(c_t(\pi_\theta)\):
          • The probability ratio that compares the new policy to the old policy, given by: \(c_t(\pi_\theta) = \frac{\pi_\theta (a_t \mid s_t)}{\pi_{\theta_{\text{old}}} (a_t \mid s_t)}\)
          • If \(c_t(\pi_\theta) > 1\), the action is more likely under the new policy.
          • If \(c_t(\pi_\theta) < 1\), the action is less likely under the new policy.
        • \(A^{\text{GAE}}_t\):
          • The advantage function computed using Generalized Advantage Estimation (GAE).
          • Measures how much better (or worse) an action \(a_t\) is compared to the policy’s average action at state \(s_t\).
          • A positive \(A^{\text{GAE}}_t\) encourages increasing the probability of the action, while a negative \(A^{\text{GAE}}_t\) discourages it.
        • \(\text{clip}(c_t(\pi_\theta),1-\epsilon, 1+\epsilon)\):
          • The clipping function, which limits \(c_t(\pi_\theta)\) within the range \([1 - \epsilon, 1 + \epsilon]\).
          • This ensures that updates to the policy do not drastically change the probability of taking a certain action.
        • \(\min(c_t(\pi_\theta) A^{\text{GAE}}_t, \text{clip}(c_t(\pi_\theta),1-\epsilon, 1+\epsilon) A^{\text{GAE}}_t)\):
          • The core of the clipped loss function:
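            • If the ratio moves outside \([1-\epsilon, 1+\epsilon]\) in the direction that would increase the objective, the \(\min\) selects the clipped term, so the objective stops rewarding further movement.
            • If the ratio is within the range, the two terms coincide and the update behaves like a standard policy gradient step.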
          • This prevents over-aggressive policy updates, stabilizing learning.
    • KL Divergence Loss: Besides the clipped objective, a common addition to the loss is a KL divergence penalty, which penalizes the objective in proportion to how far the new policy diverges from the old one. This keeps each update close to the policy it started from and prevents overly aggressive policy changes.
      • The KL divergence loss is typically added to the objective function as a penalty term:

        \[L^{\text{KL}}(\theta) = \mathbb{E} \left[ L^{\text{PPO}}(\theta) - \beta \text{KL}[\pi_{\text{old}} \mid\mid \pi_{\theta}] \right]\]
        • where:
          • \(\beta\) is a hyperparameter that controls the strength of the KL penalty.
    • Value Function Loss: PPO also typically includes a value function loss in its objective. This part of the objective function ensures that the estimated value of the states (as predicted by the value function) is as accurate as possible, which is important for computing reliable advantage estimates.

    • Entropy Bonus: Some implementations of PPO include an entropy bonus to encourage exploration by penalizing low entropy (overly confident) policies. This part of the objective function rewards the policy for taking a variety of actions, which helps prevent premature convergence to suboptimal policies. Formally:

      \[H(\theta) = - \mathbb{E}_{a_t} [\log \pi_\theta (a_t \mid s_t)]\]
      • where:
        • \(H(\theta)\): The entropy of the policy \(\pi_\theta\), which measures the uncertainty or diversity of the actions selected by the policy.
        • \(\mathbb{E}_{a_t}\) (Expectation over \(a_t\)): The expectation is taken over all possible actions \(a_t\) that could be chosen by the policy at a given state \(s_t\).
        • \(\pi_\theta (a_t \mid s_t)\): The probability assigned by the policy \(\pi_\theta\) to taking action \(a_t\) when in state \(s_t\).
        • \(\log \pi_\theta (a_t \mid s_t)\): The log-probability of selecting action \(a_t\). This helps measure how certain the policy is about choosing \(a_t\).
        • Negative sign (\(-\)): Since log-probabilities are typically negative (as probabilities are between 0 and 1), the negative sign ensures entropy is positive. Higher entropy corresponds to more randomness in the policy, while lower entropy corresponds to more deterministic behavior.
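    • As a minimal sketch of the entropy formula above, the snippet below computes \(H\) for a single discrete action distribution. The two example distributions are illustrative values, not taken from any trained policy:

```python
import math


def policy_entropy(probs):
    """Entropy H = -sum_a pi(a|s) * log pi(a|s) of one action distribution."""
    # Zero-probability actions contribute 0 (the limit of p * log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 actions
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic

h_uniform = policy_entropy(uniform)  # equals log(4), the maximum for 4 actions
h_peaked = policy_entropy(peaked)    # much smaller, so the bonus is smaller
```

    • The uniform policy attains the largest possible entropy, so an entropy bonus rewards it more than the nearly deterministic one.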
Purpose of the Clipping Mechanism
  • The clipping mechanism is central to the stability and reliability of PPO. It ensures that the policy updates do not result in excessively large changes, which could destabilize the learning process. The clipping mechanism works as follows:

    • Clipping Range: The ratio \(r(\theta)\) is clipped to the range \([1 - \epsilon, 1 + \epsilon]\). This means if the ratio \(r(\theta)\) is outside this range, it is set to the nearest bound.
    • Objective Function Impact: By clipping the probability ratio, PPO ensures that the change in policy induced by each update is kept within a reasonable range. This prevents the new policy from deviating too far from the old policy, which could lead to instability and poor performance.
    • Practical Example: If the probability ratio \(r(\theta)\) is 1.2 and \(\epsilon\) is 0.2, the clipped ratio would remain 1.2. However, if \(r(\theta)\) is 1.4, it would be clipped to 1.2 (1 + 0.2), and if \(r(\theta)\) is 0.7, it would be clipped to 0.8 (1 - 0.2).
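    • The numbers in the practical example above can be checked with a few lines of Python. This is a sketch for a single sample; the helper names `clip` and `clipped_surrogate` and the advantage values of +1 and -1 are illustrative assumptions:

```python
def clip(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


# Ratios from the example with eps = 0.2: 1.2 stays, 1.4 -> 1.2, 0.7 -> 0.8.
clipped_ratios = [clip(r, 0.8, 1.2) for r in (1.2, 1.4, 0.7)]

# With a positive advantage, the gain from raising the ratio is capped.
gain_capped = clipped_surrogate(1.4, advantage=1.0)
# With a negative advantage and a high ratio, the min keeps the unclipped
# (more pessimistic) value, so the penalty is not capped.
penalty_kept = clipped_surrogate(1.4, advantage=-1.0)
```

    • The asymmetry in the last two lines is why the objective takes a \(\min\) rather than simply clipping the ratio: the result is a pessimistic lower bound on the unclipped objective.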
Purpose of Surrogate Loss
  • The surrogate loss lets PPO balance policy improvement against stability. By limiting how far the policy can change at each update, it keeps learning stable and avoids the pitfalls of overly aggressive updates. The clipping mechanism is the key innovation that maintains this balance across a variety of environments.
Mathematical Formulation
  • To formalize PPO, let:

    • \(\pi_\theta\) denote the current policy parameterized by \(\theta\), and
    • \(\pi_{\text{old}}\) denote the previous policy before the latest update.
  • PPO aims to improve the policy while avoiding excessively large updates that could destabilize learning. This is achieved through a Clipped Surrogate Objective, which constrains the change in policy probability ratios between consecutive updates.

  • The complete PPO objective combines three components — the Clipped Surrogate Objective, an (optional) Entropy Bonus encouraging exploration (following prior work—REINFORCE (Williams, 1992) and A3C/A2C (Mnih et al., 2016)), and an (optional) KL-Divergence Penalty discouraging policy shifts that are too large.

  • As described in Proximal Policy Optimization Algorithms by Schulman et al. (2017), only the clipped objective is fundamental to PPO; the entropy and KL terms are optional regularization terms, weighted by scalar coefficients \(w_1\) and \(w_2\), that can be added to improve stability or maintain exploration balance. Specifically, \(w_1\) controls the strength of the entropy bonus (encouraging exploration), and \(w_2\) controls the KL penalty (discouraging large policy shifts):

    \[L_{\text{PPO}}(\theta) = \underbrace{L_{\text{clip}}(\theta)}_{\text{Clipped Surrogate Objective}} + \underbrace{w_1 H(\theta)}_{\text{Optional: Encourage Exploration}} - \underbrace{w_2 \text{KL}(\theta)}_{\text{Optional: Penalize Policy Divergence}}\]
    • where:

      • \(w_1\) is a scalar coefficient controlling the contribution of the entropy bonus term. Higher values encourage greater policy entropy, promoting exploration.

      • \(w_2\) is a scalar coefficient controlling the strength of the KL penalty term. Larger values increase resistance to large policy updates, improving stability but potentially slowing learning.

      • Clipped Surrogate Objective:

        \[L_{\text {clip }}(\theta)=\hat{\mathbb{E}}_t\left[\min \left(r_t(\theta) \hat{A}_t, \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]\]
        • where:

          • \(r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)}\) is the policy ratio term, which represents the ratio between the new and old policy probabilities for the same action.
          • The clipping ensures that if the new policy deviates too much from the old one (beyond \([1-\epsilon, 1+\epsilon]\)), the objective is truncated — preventing drastic updates.
      • KL Divergence (optional regularization):

        \[\text{KL}(\theta) = \hat{\mathbb{E}}_t \left[ \mathbb{D}_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \mid\mid \pi_{\theta}(\cdot \mid s_t)\big) \right]\]
        • This optional penalty term discourages excessive divergence between consecutive policy distributions, helping stabilize training when needed.
      • Entropy Bonus (optional regularization):

        \[H(\theta) = \hat{\mathbb{E}}_t\Big[ \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}[-\log \pi_\theta(a_t \mid s_t)] \Big]\]
        • This optional term encourages exploration by increasing the entropy of the policy distribution.
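  • To make the three terms concrete, the sketch below evaluates the combined objective for one sampled action at one state with a small discrete action space. It is a single-sample estimate rather than the full expectation over timesteps, and the default values of `eps`, `w1`, and `w2` as well as the example distributions are illustrative assumptions:

```python
import math


def entropy(p):
    """Entropy of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ppo_objective(new_probs, old_probs, action, advantage,
                  eps=0.2, w1=0.01, w2=0.1):
    """Single-sample L_clip + w1 * H - w2 * KL(pi_old || pi_theta)."""
    ratio = new_probs[action] / old_probs[action]
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    l_clip = min(ratio * advantage, clipped_ratio * advantage)
    return l_clip + w1 * entropy(new_probs) - w2 * kl_divergence(old_probs, new_probs)


old = [0.5, 0.3, 0.2]
new = [0.7, 0.2, 0.1]

# Unchanged policy: ratio is 1 and the KL term vanishes.
objective_same = ppo_objective(old, old, action=0, advantage=2.0)
# Shifted policy: ratio is 1.4, so the clipped term (1.2 * A) is selected.
objective_shifted = ppo_objective(new, old, action=0, advantage=2.0)
```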
PPO with Clipped Surrogate Loss
  • To recap, \(\pi_\theta\) is the current policy parameterized by \(\theta\), while \(\pi_{\text{old}}\) is the old policy. For a given state \(s\) and action \(a\), the probability ratio is:

    \[r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}\]
  • Substituting this ratio into the clipped objective above gives the expanded form of the PPO clipped surrogate loss:

    \[L_{\text{PPO-CLIP}}(\theta) = \hat{\mathbb{E}}_{t}\left[ \min \left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \hat{A}_t, \text{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]\]
    • where:
      • \(r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)}\) is the policy ratio term, representing how the new policy’s probability for taking action \(a_t\) under state \(s_t\) compares to the old policy’s probability.
      • \(\hat{A}_t\) is the advantage estimate, which measures how much better an action is compared to the average action at a given state. It is typically computed using Generalized Advantage Estimation (GAE), balancing bias and variance through the use of the discount factor \(\gamma\) and the GAE parameter \(\lambda\).
      • \(s_t\) is the state observed at timestep \(t\).
      • \(a_t\) is the action taken by the policy under state \(s_t\).
      • \(\epsilon\) is a small hyperparameter (usually 0.1–0.3) that controls the clipping range, limiting how far the new policy can deviate from the old one. This constrains policy updates and prevents destructive policy shifts.
      • The clipping operator \(\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\) bounds the policy ratio within the specified interval to reduce variance and maintain learning stability.
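  • The advantage estimate \(\hat{A}_t\) mentioned above is commonly computed with the backward GAE recursion \(\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}\), where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\). The following is a minimal sketch for a single trajectory; it assumes no mid-trajectory episode terminations, and the function name and argument layout are illustrative:

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra bootstrap value at the end.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


# lam = 1 with gamma = 1 and a zero critic reduces to reward-to-go.
reward_to_go = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
# lam = 0 reduces to the one-step TD residual at every timestep.
td_only = compute_gae([1.0, 0.0], [0.5, 0.5, 0.0], gamma=0.9, lam=0.0)
```

  • The two extremes show the bias-variance trade-off: \(\lambda = 1\) uses full returns (low bias, high variance), while \(\lambda = 0\) trusts the critic after one step (higher bias, lower variance).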
PPO with KL Divergence Penalty
  • An alternative to the clipped surrogate objective is to use a KL-penalized objective, where a penalty term based on the KL divergence between the current policy and the old policy is added to the loss. The penalty coefficient \(\beta\) is adaptively tuned to maintain a target KL divergence \(d_{\text{targ}}\). After each policy update, the actual KL divergence \(d\) is measured. If \(d < d_{\text{targ}} / 1.5\), the penalty coefficient is reduced (i.e., \(\beta \gets \beta / 2\)) to allow more flexibility in updates. If \(d > 1.5 \cdot d_{\text{targ}}\), \(\beta\) is increased (i.e., \(\beta \gets \beta \cdot 2\)) to constrain the update more tightly. This approach helps keep the updated policy close to the previous one while still allowing learning progress. The KL-penalized loss is defined as:

    \[L_{\text{KL}}(\theta) = \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t - \beta \sum_{a} \pi_{\theta_{\text{old}}}(a | s_t) \log \left(\frac{\pi_{\theta_{\text{old}}}(a | s_t)}{\pi_\theta(a | s_t)} \right) \right]\]
    • where:
      • \(\pi_{\theta_{\text{old}}}\) is the policy before the update.
      • \(\pi_\theta\) is the current policy.
      • \(\hat{A}_t\) is the estimated advantage.
      • \(\beta\) is the KL penalty coefficient adjusted dynamically to match the KL target.
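  • The adaptive rule for \(\beta\) described above is small enough to write out directly. This is a sketch of that rule only; the function name and the example numbers are illustrative:

```python
def update_kl_coefficient(beta, measured_kl, target_kl):
    """Adapt the KL penalty coefficient after a policy update."""
    if measured_kl < target_kl / 1.5:
        return beta / 2.0   # policy moved too little: relax the penalty
    if measured_kl > target_kl * 1.5:
        return beta * 2.0   # policy moved too much: tighten the penalty
    return beta             # within the tolerance band: keep beta


beta_relaxed = update_kl_coefficient(1.0, measured_kl=0.001, target_kl=0.01)
beta_tightened = update_kl_coefficient(1.0, measured_kl=0.05, target_kl=0.01)
beta_unchanged = update_kl_coefficient(1.0, measured_kl=0.01, target_kl=0.01)
```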
PPO with Clipped Surrogate Loss and KL Divergence Penalty
  • The PPO paper notes that the KL penalty can also be used in addition to the clipped surrogate objective. In this hybrid approach, the clipped objective controls the size of the policy update explicitly, while the KL penalty provides an additional regularization signal to discourage large divergences from the previous policy. In the paper’s experiments the KL penalty performed worse than the clipped surrogate objective, but it is included as an important baseline. The combined objective is:

    \[L_{\text{CLIP+KL}}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t \right) - \beta \sum_{a} \pi_{\theta_{\text{old}}}(a | s_t) \log \left(\frac{\pi_{\theta_{\text{old}}}(a | s_t)}{\pi_\theta(a | s_t)} \right) \right]\]
    • where:
      • The first term is the standard PPO clipped surrogate objective.
      • The second term adds a KL divergence penalty between the old and new policies.
      • \(\beta\) is the dynamically adjusted penalty coefficient.

PPO for LLM Policy Optimization

  • PPO plays a crucial role in policy optimization for LLMs trained with RLHF.
RLHF Overview
  • LLMs like GPT-4, ChatGPT, and Claude are optimized using RLHF, which consists of:
    1. Supervised Fine-Tuning: Train an initial model on human-annotated data.
    2. Reward Model (RM) Training: Train a model to predict human preference scores.
    3. PPO Fine-Tuning: Use the reward model to guide LLM responses through PPO.
PPO in LLM Training
  • The policy is the LLM, which generates responses given a prompt.
  • The reward model provides feedback, helping optimize the policy.
  • PPO ensures controlled updates, preventing divergence from the supervised baseline.

Practical Implementation of PPO

Pseudocode for PPO
for iteration in range(num_iterations):
    for actor in parallel_envs:
        collect trajectories using current policy
    
    compute advantage estimates using GAE
    
    for epoch in range(num_epochs):
        for minibatch in shuffled_batches:
            compute PPO loss (negative of the clipped surrogate objective)
            update policy with a gradient descent step on that loss
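  • As a runnable toy version of the loop above, the sketch below trains a softmax policy on a two-armed bandit with the clipped surrogate objective. Several simplifications are assumptions made for brevity: there is no state and no critic, the advantage is the reward minus the batch mean instead of GAE, each epoch uses the whole batch rather than shuffled minibatches, and the gradient is derived by hand instead of using an autodiff library. The arm success probabilities and hyperparameter defaults are made-up values:

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_ppo_bandit(num_iterations=50, batch_size=64, num_epochs=4,
                     eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    arm_means = [0.2, 0.8]  # hypothetical success probability of each arm
    theta = [0.0, 0.0]      # policy logits, one per arm

    for _ in range(num_iterations):
        # Collect a batch with the current policy, which becomes pi_old.
        old_probs = softmax(theta)
        actions = rng.choices([0, 1], weights=old_probs, k=batch_size)
        rewards = [1.0 if rng.random() < arm_means[a] else 0.0 for a in actions]

        # Advantage: reward minus the batch mean (a simple stand-in for GAE).
        baseline = sum(rewards) / batch_size
        advantages = [r - baseline for r in rewards]

        # Several epochs of gradient ascent on the clipped surrogate.
        for _ in range(num_epochs):
            probs = softmax(theta)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                # When the clipped branch is the active min, the gradient is zero.
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue
                for k in range(2):
                    indicator = 1.0 if k == a else 0.0
                    # d(ratio * adv)/d(theta_k) for a softmax policy.
                    grad[k] += adv * ratio * (indicator - probs[k]) / batch_size
            theta = [t + lr * g for t, g in zip(theta, grad)]

    return softmax(theta)


final_probs = train_ppo_bandit()  # probability mass should shift toward arm 1
```

  • Because samples whose ratio has already left the clipping range in the favorable direction contribute no gradient, repeated epochs over the same batch cannot push the policy arbitrarily far from the one that collected the data.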
PPO with Hugging Face’s transformers and trl
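# Sketch for older TRL releases; the PPOTrainer API changed in later versions.
# "gpt2" is an example checkpoint; `dataloader` and `reward_fn` are placeholders to supply.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")      # policy with value head
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")  # frozen reference policy
ppo_trainer = PPOTrainer(PPOConfig(batch_size=8, mini_batch_size=4), model, ref_model, tokenizer)

for batch in dataloader:  # batch["query"]: list of 8 prompt strings
    query_tensors = [tokenizer.encode(q, return_tensors="pt")[0] for q in batch["query"]]
    response_tensors = [ppo_trainer.generate(q, max_new_tokens=32)[0][len(q):] for q in query_tensors]
    rewards = [reward_fn(q, r) for q, r in zip(query_tensors, response_tensors)]  # scalar tensors
    ppo_trainer.step(query_tensors, response_tensors, rewards)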

Typical Hyperparameters

  • Clip Range (\(\epsilon\)): 0.1 - 0.3
  • Learning Rate: \(10^{-5}\) to \(10^{-4}\)
  • Batch Size: 32 - 512
  • GAE Lambda (\(\lambda\)): 0.95
  • Entropy Coefficient: 0.01 (for exploration)

Variants of PPO

  • There are two main variants of PPO: (i) PPO-Clip and (ii) PPO-Penalty.

PPO-Clip

  • Uses the clipped surrogate objective function to limit the policy updates.
  • The most commonly used version of PPO.

PPO-Penalty

  • Adds a KL-divergence penalty to the objective function to constrain policy updates.
  • Used in cases where explicit divergence constraints are needed.

Advantages of PPO

  • Stability and Reliability: The clipping mechanism in the objective function helps to avoid large, destabilizing updates to the policy, making the learning process more stable and reliable.
  • Sample Efficiency: By reusing data for multiple gradient updates, PPO can be more sample-efficient compared to some other methods.
  • General Applicability: PPO has demonstrated good performance across a wide range of environments, from simple control tasks to complex 3D simulations. It offers a simpler and more robust approach compared to previous algorithms like TRPO.

Simplified Example

  • Imagine an agent learning to play a game. The agent tries different actions (moves in the game) and learns a policy that predicts which action to take in each state (situation in the game). The policy is updated based on these experiences, but rather than overhauling it after each recent success or failure, PPO makes smaller, incremental changes. The agent therefore avoids rewriting its strategy on the basis of limited new information, which leads to a more stable and consistent learning process.

Summary

  • PPO stands out in the realm of RL for its innovative approach to policy updates via gradient ascent. Its key innovation is the introduction of a clipped surrogate objective function that judiciously constrains the policy ratio. This mechanism is fundamental in preventing drastic policy shifts and ensuring a smoother, more stable learning progression.
  • PPO is particularly favored for its effectiveness and simplicity across diverse environments, striking a fine balance between policy improvement and stability.
  • The PPO objective function is designed to balance the need for effective policy improvement with the need for training stability. It achieves this through the use of a clipped surrogate objective function, value function loss, and potentially an entropy bonus.
  • While KL divergence is not a direct part of the basic PPO objective function, it is often used in the PPO-Penalty variant to monitor and maintain policy stability. This is done either by penalizing large changes in the policy (KL penalty) or by enforcing a constraint on the extent of change allowed between policy updates (KL constraint).
  • By integrating these elements, PPO provides a robust framework for RL, ensuring both stability and efficiency in the learning process. This makes it particularly suitable for fine-tuning large language models (LLMs) and other complex systems where stable and reliable updates are crucial.
  • In PPO and other reinforcement learning (RL) algorithms, the policy is typically represented by a parameterized function, most commonly a neural network. Here’s a detailed breakdown of how the policy is represented and what it entails:

Policy Representation in RL Algorithms

  1. Neural Network (Parameterized Function)
    • Neural Networks: In modern RL algorithms like PPO, the policy is most often represented by a neural network. The neural network takes the current state of the environment as input and outputs a probability distribution over possible actions.
    • Parameters (Weights): The neural network is defined by its parameters, which are the weights and biases of the network. These parameters are collectively denoted as \(\theta\). The process of training the policy involves adjusting these parameters to maximize the expected reward.
  2. Mathematical Representation
    • The policy \(\pi_\theta(a\mid s)\) represents the probability of taking action \(a\) given state \(s\), parameterized by \(\theta\). This function maps states to a distribution over actions.
    • Discrete Action Spaces: For discrete action spaces, the output of the neural network can be a softmax function that gives a probability for each possible action.
    • Continuous Action Spaces: For continuous action spaces, the output might be parameters of a probability distribution (e.g., mean and standard deviation of a Gaussian distribution) from which actions can be sampled.
  3. Policy Gradient Methods
    • In policy gradient methods like PPO, the policy is directly updated by computing the gradient of the expected reward with respect to the policy parameters \(\theta\). This gradient is used to adjust the parameters in a way that increases the expected reward.
  4. Actor-Critic Methods
    • Actor: In actor-critic methods, the “actor” is the policy network, which decides the actions to take.
    • Critic: The “critic” is another network that estimates the value function, which provides feedback on how good the current policy is. The critic helps to reduce the variance of the policy gradient estimates.
  5. Optimization Process
    • Policy Update: The policy parameters \(\theta\) are updated through an optimization process (e.g., gradient ascent in policy gradient methods) to maximize the objective function, such as the expected cumulative reward.
    • Surrogate Objective: In PPO, a surrogate objective function is used, which includes mechanisms like clipping to ensure stable updates to the policy.
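  • The two output parameterizations described above can be sketched without a neural network by treating the network outputs as given numbers. In the snippet below, the logits, mean, and standard deviation are illustrative stand-ins for what a policy network would produce for one state:

```python
import math
import random


def softmax(logits):
    """Discrete policy head: turn logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(logits, rng):
    """Sample an action index and return it with its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


def sample_gaussian(mean, std, rng):
    """Continuous policy head: sample from N(mean, std^2) with its log-density."""
    action = rng.gauss(mean, std)
    log_prob = (-0.5 * ((action - mean) / std) ** 2
                - math.log(std) - 0.5 * math.log(2.0 * math.pi))
    return action, log_prob


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]   # stand-in for network outputs at one state
probs = softmax(logits)
action, log_prob = sample_discrete(logits, rng)
torque, torque_log_prob = sample_gaussian(mean=0.0, std=0.5, rng=rng)
```

  • The returned log-probabilities are the quantities from which the PPO ratio \(\pi_\theta(a \mid s) / \pi_{\text{old}}(a \mid s)\) is formed, usually as the exponential of a log-probability difference.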
Summary
  • Neural Network: The policy in PPO and many other RL algorithms is represented by a neural network.
  • Parameters (Weights): The neural network is parameterized by a set of weights and biases, collectively denoted as \(\theta\).
  • Probability Distribution: The policy maps states to a probability distribution over actions, allowing for both discrete and continuous action spaces.
  • Optimization: The policy parameters are updated iteratively to maximize the expected reward, often using gradient-based optimization methods.

  • By representing the policy as a neural network, RL algorithms can leverage the expressive power of deep learning to handle complex environments and high-dimensional state and action spaces.

RL with AI Feedback (RLAIF)

  • RLAIF uses AI-generated preferences instead of human annotated preferences. It leverages a powerful LLM (say, GPT-4) to generate these preferences, offering a cost-effective and efficient alternative to human-generated feedback.
  • RLAIF operates by using a pre-trained LLM to generate feedback for training another LLM. Essentially, the feedback-generating LLM serves as a stand-in for human annotators. This model evaluates and provides preferences or feedback on the outputs of the LLM being trained, guiding its learning process.
  • The feedback is used to optimize the LLM’s performance for specific tasks like summarization or dialogue generation. This method enables efficient scaling of the training process while maintaining or improving the model’s performance compared to methods relying on human feedback.

Direct Preference Optimization (DPO)

  • LLMs acquire extensive world knowledge and reasoning skills via self-supervised pre-training, but precisely controlling their behavior is challenging because that training is unsupervised. Traditionally, methods like RLHF, discussed earlier in this article, are used to steer these models in two stages: training a reward model on human preference labels, and then fine-tuning the LM with RL to align with these preferences. However, RLHF is complex and often unstable, since it requires fitting a reward model and then training a policy to optimize that reward.
  • Proposed in Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al. from Stanford in 2023, Direct Preference Optimization (DPO) is a novel approach that simplifies and enhances the aforementioned process. DPO leverages a mathematical relationship between optimal policies and reward functions, demonstrating that the constrained reward maximization problem in RLHF can be optimized more effectively with a single stage of policy training. DPO redefines the RLHF objective by showing that the reward can be rewritten purely as a function of policy probabilities, allowing the LM to implicitly define both the policy and the reward function. This innovation eliminates the need for a separate reward model and the complexities of RL.
  • This paper introduces a novel algorithm that removes the two stages of RLHF, namely fitting a reward model and training a policy to optimize the reward via sampling. The second stage is particularly hard to get right due to stability concerns, which DPO avoids entirely. Given a dataset of the form <prompt, worse completion, better completion>, you train your LLM using a new loss function that encourages it to increase the likelihood of the better completion and decrease the likelihood of the worse completion, with each example weighted by how much higher the implicit reward model rates the worse completion than the better one (that is, by how wrong its current ranking is). This method obviates the need for an explicit reward model, as the LLM itself acts as a reward model. The key advantage is that it’s a straightforward loss function optimized using backpropagation.
  • The stability, performance, and computational efficiency of DPO are significant improvements over traditional methods. It eliminates the need for sampling from the LM during fine-tuning, fitting a separate reward model, or extensive hyperparameter tuning.
  • The figure below from the paper illustrates that DPO optimizes for human preferences while avoiding RL. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL.

  • Experiments demonstrate that DPO can fine-tune LMs to align with human preferences as effectively, if not more so, than traditional RLHF methods. It notably surpasses RLHF in controlling the sentiment of generations and enhances response quality in tasks like summarization and single-turn dialogue. Its implementation and training processes are substantially simpler.
  • In summary, DPO aligns models by optimizing pairs of responses ranked by human feedback, assigning a higher likelihood to preferred responses over less preferred ones. This preference-based learning captures human intent without relying on the complexity of RL traditionally used in fine-tuning methods. Instead, DPO transforms the reward maximization problem into a simpler classification task, directly optimizing model outputs based on human preferences.

DPO’s Binary Cross-Entropy Loss

  • DPO works by utilizing Binary Cross-Entropy (BCE) to compare pairs of model-generated responses (preferred and dispreferred) against human preferences. The model generates two responses for each input, and human annotators indicate which response they prefer. The model then assigns probabilities to each response. The BCE loss function computes the difference between these model-assigned probabilities and the actual human preferences, penalizing the model when it assigns a higher probability to the dispreferred response. By minimizing this loss, DPO adjusts the model’s internal parameters to better align with human preferences.
  • Put simply, DPO represents a shift in training language models to align with human preferences by consolidating the RLHF process into a single, end-to-end optimization step. By adapting the binary cross-entropy loss, DPO directly optimizes model behavior by adjusting log probabilities based on human feedback, making it a computationally efficient and theoretically grounded method for preference-based learning.
Simplified Process
  1. Response Pairs: For each input, the model generates two responses.
  2. Human Preferences: Humans indicate which response is preferable.
  3. Model Probabilities: The model assigns probabilities to each response.
  4. BCE Loss: The loss function calculates the difference between the model’s predictions and human preferences, penalizing the model more when it assigns higher probabilities to dispreferred responses.
Loss Function Equation
  • The DPO loss function, based on BCE, is formulated as:

    \[L_{DPO}(\pi_\theta; \pi_{ref}) = - \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]\]
    • where:
      • \(\mathbb{E}_{(x, y_w, y_l) \sim D}\) denotes the expectation over the dataset \(D\), which consists of tuples \((x, y_w, y_l)\) derived from human preference data. Here:
        • \(x\) is the input context (e.g., a prompt or query).
        • \(y_w\) is the preferred response, which is deemed better.
        • \(y_l\) is the less preferred response.
      • \(\pi_\theta\) is the policy being optimized.
      • \(\pi_{ref}\) is the reference policy (initial or base model).
      • \(\beta\) controls how much the model stays close to the reference policy.
      • \(\sigma\) is the logistic/sigmoid function.
  • This BCE-based loss function drives the model to increase the likelihood of preferred responses while penalizing dispreferred ones.
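  • A per-example version of this loss can be sketched in plain Python. The inputs are assumed to be sequence log-probabilities (the sum of token log-probabilities of a response given the prompt) under the policy and the reference model; the function names, the default \(\beta = 0.1\), and the example numbers are illustrative assumptions:

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple from sequence log-probabilities."""
    chosen_logratio = policy_logp_w - ref_logp_w     # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l   # log pi_theta(y_l|x) / pi_ref(y_l|x)
    z = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(z)


# Policy identical to the reference: z = 0, so the loss is log(2).
loss_at_ref = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# Policy raised y_w and lowered y_l relative to the reference: lower loss.
loss_better = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy moved the wrong way: higher loss.
loss_worse = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

  • In practice the loss is averaged over a batch of triples and the log-probabilities come from forward passes of the two models; only the policy’s parameters receive gradients.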

Loss Function Design Choices

Negative Sign in Front of the Loss

  • The negative sign ensures that the optimization minimizes the negative log-likelihood, which aligns with maximizing the likelihood of predicting the preferred response correctly. This is standard in BCE loss formulations.

Why the Sigmoid Function (\(\sigma\)) is Used

  • The sigmoid function \(\sigma(z) = \frac{1}{1 + e^{-z}}\) maps the input \(z\) to a probability in the range [0, 1].
  • In this case, it is applied to the log-ratio differences (scaled by \(\beta\)) between the preferred and less preferred responses. This ensures that the model output can be interpreted probabilistically, representing the confidence that the preferred response is indeed better.

Role of \(\beta\) in the DPO Loss Function

  • The parameter \(\beta\) plays a critical role in balancing the optimization process by controlling the influence of the reference policy (\(\pi_{ref}\)) on the model being optimized (\(\pi_\theta\)).
  • It balances the dual goals of maximizing human preference alignment and retaining the desirable qualities of the reference policy.
  • Proper tuning of \(\beta\) is critical for achieving the right trade-off between stability and preference optimization.
  • The role of \(\beta\) in the DPO loss function can be summarized as follows:

    1. Scale of Log-Probability Differences:
      • The term \(\beta\) scales the difference in log-probabilities between the preferred (\(y_w\)) and less preferred (\(y_l\)) responses. A larger \(\beta\) amplifies the contrast between the two responses, making the model more sensitive to preference differences.
    2. Regularization Strength:
      • \(\beta\) acts as a regularization parameter, controlling how strongly the model \(\pi_\theta\) adheres to the reference policy \(\pi_{ref}\). Specifically:
        • High \(\beta\): The model stays closer to the reference policy, limiting the divergence from the initial policy. This helps retain stability and prevents overfitting to noisy or extreme preferences in the dataset.
        • Low \(\beta\): The model is allowed to diverge further from the reference policy, giving it more freedom to optimize for the preferences in the dataset. However, this increases the risk of overfitting or producing less generalizable responses.
    3. Interpretation as a Trade-off:
      • \(\beta\) provides a trade-off between preference alignment and policy regularization:
        • Preference Alignment: With lower values of \(\beta\), the model prioritizes aligning with human preferences at the cost of potential instability or over-divergence.
        • Policy Regularization: Higher values of \(\beta\) ensure that the model evolves conservatively, maintaining generality and robustness while limiting alignment with preferences.

Significance of the DPO Loss

  • The loss measures how well the model \(\pi_\theta\) aligns with human preferences, as encoded in the dataset \(D\).
  • By using BCE, the objective becomes a comparison of logits (log probabilities) between the preferred (\(y_w\)) and less preferred (\(y_l\)) responses. Minimizing this loss drives the model to produce outputs that increasingly favor \(y_w\) over \(y_l\) while balancing regularization (\(\beta\)) to avoid over-divergence from the reference policy \(\pi_{ref}\).
Mapping from the Standard Binary Cross-Entropy Loss to the DPO Loss

Standard Binary Cross-Entropy Loss

  • To recap, the Binary Cross-Entropy loss for a single prediction with logit \(z\) (a real-valued score comparing \(y_w\) with \(y_l\)) and its target label \(t \in \{0, 1\}\) is defined as:

    \[L_{BCE}(z, t) = - \left[ t \cdot \log(\sigma(z)) + (1 - t) \cdot \log(1 - \sigma(z)) \right]\]
    • where,
      • \(z\): The logit (unbounded real value) representing the model’s confidence in the preferred label.
      • \(\sigma(z) = \frac{1}{1 + e^{-z}}\): The sigmoid function maps the logit to a probability.
      • \(t\): The binary target label, where \(t = 1\) if \(y_w\) is the preferred label and \(t = 0\) if \(y_l\) is preferred.

Mapping BCE Loss to DPO Loss

  • In the DPO framework:

    1. The target is implicitly encoded by the comparison of \(y_w\) (preferred) and \(y_l\) (less preferred). Effectively, \(t = 1\) for \(y_w\).
    2. The logit \(z\) is calculated as the difference in log-probabilities (scaled by \(\beta\)):

      \[z = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\]
      • This difference represents the model’s confidence in \(y_w\) being better than \(y_l\), adjusted for the divergence from the reference policy.
    3. Plugging \(z\) into the BCE loss for \(t = 1\), the equation becomes:

      \[L_{DPO} = - \log(\sigma(z))\]
    4. Expanding \(z\), we get:

      \[L_{DPO} = - \log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right)\]
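  • The mapping above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the default \(\beta = 0.1\), and the example log-probabilities are assumptions chosen for the sketch, and the inputs are taken to be sequence-level log-probabilities that have already been computed.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)) = -softplus(-z)."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    policy_logp_* : log pi_theta(y | x) for the preferred (w) / less preferred (l) response
    ref_logp_*    : log pi_ref(y | x) for the same responses
    beta          : scaling of the log-ratios (0.1 is an illustrative value)
    """
    # z = beta * [log-ratio of y_w] - beta * [log-ratio of y_l]
    z = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # BCE with target t = 1 reduces to -log(sigmoid(z))
    return -log_sigmoid(z)

# Hypothetical log-probabilities: the policy equals the reference, so z = 0
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# The policy has raised the log-probability of y_w relative to the reference
loss_after = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

  • When the policy equals the reference, \(z = 0\) and the loss is \(\log 2\); raising the log-ratio of \(y_w\) relative to that of \(y_l\) lowers it.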

####### Intuition of the Mapping

  • Standard BCE Loss: Compares logits \(z\) against a binary target \(t\) (1 for positive, 0 for negative) and penalizes predictions deviating from the target.
  • DPO Loss: Adapts the BCE framework to pairwise preferences, where:
    • \(z\) reflects the scaled log-ratio difference between \(y_w\) and \(y_l\).
    • Implicitly assumes \(t = 1\) (i.e., \(y_w\) is the preferred response).
  • By minimizing \(L_{DPO}\), the model learns to increase the scaled log-probability of \(y_w\) relative to \(y_l\), aligning with human preferences while staying close to \(\pi_{ref}\).
Key Insights
  • DPO’s Efficiency: DPO simplifies the traditional RLHF pipeline by unifying policy learning and reward modeling into a single, efficient process. Instead of requiring a two-stage process (learning a reward model and then optimizing with RL), DPO directly optimizes the policy using human preferences as implicit rewards.
  • Streamlined Approach: By using BCE to treat preference optimization as a binary classification task, DPO minimizes complexity and computational overhead. The model learns to classify between preferred and dispreferred responses, adjusting its behavior accordingly.

How does DPO generate two responses and assign probabilities to them?

  • In DPO, generating two responses and assigning probabilities to each response involves a nuanced process:

    1. Generating Two Responses:
      • The responses are typically generated using a supervised fine-tuned language model. This model, when given an input prompt, generates a set of potential responses.
      • Diverse responses are often obtained by varying the temperature or the decoding method, e.g., top-\(p\) or top-\(k\) sampling, or beam search.
    2. Assigning Probabilities:
      • Language models assign probabilities at the token level, predicting the likelihood of each possible next token given the previous tokens.
      • The probability of an entire response (sequence of tokens) is the product of the probabilities of its individual tokens, as per the model’s prediction. In practice this is computed in log space, as the sum of the token log-probabilities, to avoid numerical underflow.
      • For DPO, these probabilities are used to calculate the loss based on human preferences. The model is trained to increase the likelihood of the preferred response and decrease that of the less preferred one.
  • Through this process, DPO leverages human feedback to preference-optimize the model, encouraging it to generate more human-aligned outputs.
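  • As a small illustration of how token-level probabilities become a response-level probability, the sketch below sums per-token log-probabilities. The probabilities are made-up numbers, not outputs of any real model, and `sequence_log_prob` is a hypothetical helper name.

```python
import math

def sequence_log_prob(token_probs):
    """Log-probability of a response: the sum of its per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for two responses to the same prompt
probs_w = [0.5, 0.8, 0.9]   # preferred response
probs_l = [0.5, 0.4, 0.2]   # less preferred response

logp_w = sequence_log_prob(probs_w)
logp_l = sequence_log_prob(probs_l)

# Exponentiating the summed logs recovers the product of the token probabilities
prob_w = math.exp(logp_w)
```

  • Working in log space keeps long sequences from underflowing to zero, and the resulting sequence log-probabilities are exactly the quantities the DPO loss compares.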

DPO and its use of the Bradley-Terry model

  • Overview of the Bradley-Terry Model:
    • The Bradley-Terry model is a probability model used for pairwise comparisons. It assigns a score to each item (in this context, model outputs), and the probability that one item is preferred over another is a function of their respective scores. Formally, if item \(i\) has a score \(s_i\) and item \(j\) has a score \(s_j\), the probability \(P(i \text{ is preferred over } j)\) is given by:
    \[P(i \text{ is preferred over } j) = \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)}\]
  • Application in DPO for LLM Alignment:
    1. Data Collection:
      • Human evaluators provide pairwise comparisons of model outputs. For example, given two responses from the LLM, the evaluator indicates which one is better according to specific criteria (e.g., relevance, coherence, correctness).
    2. Modeling Preferences:
      • The outputs of the LLM are treated as items in the Bradley-Terry model. Each output has an associated score reflecting its quality or alignment with human preferences.
    3. Score Estimation:
      • In a classical Bradley-Terry fit, the scores \(s_i\) are free parameters estimated from the observed preferences, typically by maximum likelihood estimation (MLE): if output \(i\) is preferred over output \(j\) in several comparisons, \(s_i\) ends up higher than \(s_j\). In DPO, the score of an output \(y\) is not a free parameter but the implicit reward \(\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}\), which is a function of the LLM’s own parameters.
    4. Optimization:
      • Because the scores depend on the LLM’s parameters, maximizing the Bradley-Terry likelihood of the observed preferences fine-tunes the LLM directly. Score estimation and policy optimization are therefore a single step: the model parameters are adjusted so that preferred outputs receive higher implicit scores than dispreferred ones.
  • Detailed Steps in DPO:
    1. Generate Outputs:
      • Generate multiple outputs for a given prompt using the LLM.
    2. Pairwise Comparisons:
      • Collect human feedback by asking evaluators to compare pairs of outputs and indicate which one is better.
    3. Fit Bradley-Terry Model:
      • Use the collected pairwise comparisons to form the Bradley-Terry likelihood, with each output’s score given by its implicit reward under the LLM rather than by a separately fitted parameter.
    4. Update LLM:
      • Fine-tune the LLM by gradient-based optimization of a loss built from the Bradley-Terry probabilities, so that higher-scored (preferred) outputs become more likely relative to the reference model.
    • By performing these steps, the LLM can be aligned more closely with human preferences, producing outputs that are more likely to be preferred by human evaluators.
  • Summary:
    • The Bradley-Terry model plays a crucial role in the Direct Preference Optimization method by providing a statistical framework for modeling and estimating the preferences of different model outputs. This, in turn, guides the fine-tuning of the LLM to align its outputs with human preferences effectively.
How does DPO implicitly use a Bradley-Terry Model (if it does not explicitly use a reward model)?
  • DPO uses the Bradley-Terry model implicitly, even if it does not explicitly employ a traditional reward model. Here’s how this works:
Key Concepts in DPO Without an Explicit Reward Model
  1. Pairwise Comparisons:
    • Human evaluators provide pairwise comparisons between outputs generated by the LLM. For example, given two outputs, the evaluator indicates which one is preferred.
  2. Logistic Likelihood:
    • The Bradley-Terry model is essentially a logistic model used for pairwise comparisons. The core idea is to model the probability of one output being preferred over another based on their latent scores.
Implicit Use of Bradley-Terry Model
  • Without training an explicit reward model, DPO leverages the principles behind the Bradley-Terry model in the following manner:
  1. Score Assignment through Logit Transformation:
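    • Each output generated by the LLM is assigned a latent score, which can be read as the logit (log-odds) of that output being preferred. In DPO this score is the implicit reward \(\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}\), so no separate reward model is needed to produce it.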
    • Given two outputs, \(o_i\) and \(o_j\), with logits (latent scores) \(s_i\) and \(s_j\), the probability that \(o_i\) is preferred over \(o_j\) follows the logistic function: \(P(o_i \text{ is preferred over } o_j) = \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)}\)
  2. Optimization Objective:
    • Construct a loss function based on the likelihood of observed preferences. If \(o_i\) is preferred over \(o_j\) in the dataset, the corresponding likelihood component is: \(L = \log P(o_i \text{ is preferred over } o_j) = \log \left(\frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)} \right)\)
    • The overall objective is to maximize this likelihood across all pairwise comparisons provided by human evaluators.
  3. Gradient Descent for Fine-Tuning:
    • Instead of explicitly training a separate reward model, the LLM is fine-tuned using gradients derived from the likelihood function directly.
    • During backpropagation, the gradients with respect to the LLM’s parameters are computed from the likelihood of the preferences, effectively pushing the model to produce outputs that align with higher preference scores.
Steps in DPO Without Explicit Reward Model
  1. Generate Outputs:
    • Generate multiple outputs for a set of prompts using the LLM.
  2. Collect Pairwise Comparisons:
    • Human evaluators compare pairs of outputs and indicate which one is preferred.
  3. Compute Preference Probabilities:
    • Use the logistic model (akin to Bradley-Terry) to compute the probability of each output being preferred over another.
  4. Construct Likelihood and Optimize:
    • Formulate the likelihood based on the observed preferences and optimize the LLM’s parameters to maximize this likelihood.
Practical Implementation
  • Training Loop:
    • The preference dataset is typically collected once, before training. In each iteration, sample a batch of comparisons, compute the logistic likelihood, and perform gradient descent to adjust the LLM parameters.
  • Loss Function:
    • The loss function directly incorporates the Bradley-Terry model’s probabilities without needing an intermediate reward model: \(\text{Loss} = -\sum_{(i,j) \in \text{comparisons}} \log \left(\frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)} \right)\)
  • By optimizing this loss function, DPO ensures that the LLM’s outputs increasingly align with human preferences, implicitly using the Bradley-Terry model’s probabilistic framework without explicitly training a separate reward model. This direct approach simplifies the alignment process while leveraging the robust statistical foundation of the Bradley-Terry model.
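  • The Bradley-Terry probability and the resulting loss can be sketched as follows. The scores are arbitrary illustrative numbers and the helper names are hypothetical. The sketch also makes visible that \(\frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)}\) is simply the sigmoid of the score difference, which is why the DPO loss takes the form \(-\log \sigma(\cdot)\).

```python
import math

def bt_prob(s_i: float, s_j: float) -> float:
    """P(i preferred over j) = exp(s_i) / (exp(s_i) + exp(s_j)) = sigmoid(s_i - s_j).

    Not stabilized against overflow for very large score gaps; fine for a sketch.
    """
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def bt_nll(comparisons) -> float:
    """Negative log-likelihood over (winner_score, loser_score) pairs."""
    return -sum(math.log(bt_prob(s_w, s_l)) for s_w, s_l in comparisons)

# Direct evaluation of the Bradley-Terry formula, for comparison with bt_prob
direct = math.exp(2.0) / (math.exp(2.0) + math.exp(1.0))
via_sigmoid = bt_prob(2.0, 1.0)

# Hypothetical scores for three observed comparisons (winner first)
loss = bt_nll([(2.0, 1.0), (0.5, 0.0), (1.0, 1.5)])
```

  • In DPO, the scores passed to such a loss would be the implicit rewards \(\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}\) rather than free parameters.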

Video Tutorial

  • This video by Umar Jamil explains the DPO pipeline, by deriving it step by step while explaining all the inner workings.
  • After briefly introducing the topic of AI alignment, the video reviews RL, a topic that is necessary to understand the reward model and its loss function. Next, it derives the reward model’s loss function step by step under the Bradley-Terry model of preferences, a derivation that is missing from the DPO paper.
  • Using the Bradley-Terry model, it builds the loss of the DPO algorithm, not only explaining its math derivation, but also giving intuition on how it works.
  • In the last part, it describes how to use the loss practically, that is, how to calculate the log probabilities using a Transformer model, by showing how it is implemented in the Hugging Face library.

Summary

  • RLHF is the most “dicey” part of LLM training and the one that needed the most art vs. science. DPO seeks to simplify that by taking RL out of the equation and not requiring a dedicated reward model (with the LLM serving as its own implicit reward model). It proceeds as follows:
    1. Treat a foundational instruction tuned LLM as the reference LLM.
    2. Generate pairs of outputs (using say, different token sampling/decoding methods or temperature scaling) to the same prompt and have humans choose which one they like, leading to a dataset of human preferences/feedback.
    3. Initialize the model being tuned as a copy of the reference LLM and train it with the DPO loss, which is based on binary cross entropy loss: compute the log-ratio of each response’s sequence probability under the tuned LLM and the frozen reference LLM, multiply by the divergence parameter \(\beta\), and use the difference between the preferred and less preferred responses as the logit.
    4. No extra output layer is added or removed; the tuned LLM is the model fine-tuned on human feedback.

Kahneman-Tversky Optimization (KTO)

  • Proposed in Human-Centered Loss Functions (HALOs) by Ethayarajh et al. from Stanford and Contextual AI, Kahneman-Tversky Optimization (KTO) is a novel approach to aligning LLMs with human feedback.
  • KTO is a human-centered loss function that directly maximizes the utility of language model generations instead of maximizing the log-likelihood of preferences as current methods do. The approach is named after Daniel Kahneman and Amos Tversky, whose prospect theory, a theory in behavioral economics, describes how humans make decisions about uncertain outcomes; KTO builds its objective on the principles of that theory.
  • KTO achieves the goal of generating desirable outputs by using a utility function to guide the training of a language model. This process involves several key steps:

    1. Utility Function Definition: A utility function is defined based on the principles of Kahneman-Tversky’s prospect theory. This function assigns a value to each possible output of the language model, indicating its desirability or utility from a human perspective. The utility values can be determined based on factors like relevance, coherence, or adherence to specific criteria.

    2. Generating Outputs: During training, the language model generates outputs based on given inputs. These outputs are complete sequences, such as sentences or paragraphs, rather than individual tokens.

    3. Evaluating Outputs: Each generated output is evaluated using the utility function. The utility score reflects how desirable or aligned the output is with human preferences or objectives.

    4. Optimizing the Model: The model’s parameters are updated to increase the likelihood of generating outputs with higher utility scores. The optimization process aims to maximize the expected utility of the outputs, essentially encouraging the model to produce more desirable results.

    5. Iterative Training: This process is iterative, with the model continually generating outputs, receiving utility evaluations, and updating its parameters. Over time, the model learns to produce outputs that are increasingly aligned with the utility function’s assessment of desirability.

  • In essence, KTO shifts the focus from traditional training objectives, like next-token prediction or fitting to paired preference data, to directly optimizing for outputs that are considered valuable or desirable according to a utility-based framework. This approach can be particularly effective in applications where the quality of the output is subjective or where specific characteristics of the output are valued.

    1. What is KTO?
      • KTO is an alignment methodology that leverages the concept of human utility functions as described in prospect theory. It aligns LLMs by directly maximizing the utility of their outputs, focusing on whether an output is considered desirable or not by humans.
      • This method does not require detailed preference pairs for training, which is a departure from many existing alignment methodologies.
    2. What Kind of Data Does KTO Require?
      • KTO eliminates the need for paired-preference ranking/comparison data and simplifies data requirements significantly. It only needs binary labels indicating whether an LLM output is desirable or undesirable. Put simply, with its binary feedback requirement, KTO contrasts with methods such as PPO and DPO, which rely on preference pairs.
      • The simplicity in data requirements makes KTO more practical and applicable in real-world scenarios where collecting detailed preference data is challenging.
    3. Advantages Over DPO and PPO:
      • Compared to DPO and PPO, KTO offers several advantages:
        • Simplicity in Data Collection: Unlike DPO and PPO, which require paired-preference data (i.e., ranking/comparison data) which is difficult to obtain, KTO operates efficiently with unpaired binary feedback on outputs.
        • Practicality in Real-World Application: KTO’s less stringent data requirements make it more suitable for scenarios where collecting detailed preferences is infeasible.
        • Focus on Utility Maximization: KTO aligns with the practical aspects of human utility maximization, potentially leading to more user-friendly and ethically aligned outputs.
    4. Results with KTO Compared to DPO and PPO:
      • When applied to models of different scales (from 1B to 30B parameters), KTO has been shown to match or exceed the performance of methods like DPO in terms of alignment quality.
      • KTO, even without supervised finetuning, significantly outperforms other methods at larger scales, suggesting its effectiveness in aligning models in a more scalable and data-efficient manner.
      • In terms of practical utility, the results indicate that KTO can lead to LLM outputs that are better aligned with human preferences and utility considerations, particularly in scenarios where detailed preference data is not available.
  • KTO operates without paired preference data, focusing instead on maximizing the utility of language model generations based on whether an output is desirable or undesirable. This is different from the traditional approach of next-token prediction and paired preference data used in methods like DPO.
  • Here’s how KTO functions:

    1. Utility-Based Approach: KTO uses a utility function, inspired by Kahneman-Tversky’s prospect theory, to evaluate the desirability of outputs. The utility function assigns a value to each possible output of the language model, reflecting how desirable (or undesirable) that output is from a human perspective.

    2. Data Requirement: Unlike DPO, KTO does not need paired comparisons between two outputs. Instead, it requires data that indicates whether a specific output for a given input is considered desirable or not. This data can come from human judgments or predefined criteria.

    3. Loss Function: The loss function in KTO is designed to maximize the expected utility of the language model’s outputs by adjusting the model’s parameters to increase the likelihood of generating outputs with higher utility values. Note that the KTO loss is not a binary cross-entropy loss. It is inspired by prospect theory and focuses on the human perception of losses and gains, which allows human preferences and perceptions to be incorporated in a more nuanced way. The section KTO’s Loss Function below details the specifics.

    4. Training Process: During training, the language model generates outputs, and the utility function evaluates these outputs. The model’s parameters are then updated to favor more desirable outputs according to the utility function. This process differs from next-token prediction, as it is not just about predicting the most likely next word, but about generating entire outputs that maximize a utility score.

    5. Implementation: In practical terms, KTO could be implemented as a fine-tuning process on a pre-trained language model. The model generates outputs, the utility function assesses these, and the model is updated to produce better-scoring outputs over iterations.

  • KTO is focused more on the overall utility or value of the outputs rather than just predicting the next token. It’s a more holistic approach to aligning a language model with human preferences or desirable outcomes.
  • In summary, KTO represents a shift towards a more practical and scalable approach to aligning LLMs with human feedback, emphasizing utility maximization and simplicity in data requirements.

KTO’s Loss Function

  • KTO is inspired by the behavioral models of decision-making introduced by Daniel Kahneman and Amos Tversky, particularly their prospect theory. KTO adapts these concepts into a loss function that aligns LLMs with human feedback by capturing human biases such as loss aversion and risk sensitivity. Below is a comprehensive explanation of KTO’s loss function, including both general principles from Prospect Theory and specific details from the KTO paper.

Core Principles from Prospect Theory

  • In prospect theory, human decision-making under uncertainty deviates from maximizing expected value due to biases like loss aversion and nonlinear probability weighting. These concepts are fundamental to the loss function used in KTO:
  1. Value Function: This captures how people perceive gains and losses differently:
    • It is concave for gains (risk-averse for gains) and convex for losses (risk-seeking for losses).
    • Losses loom larger than gains, which is modeled by a loss aversion parameter \(\lambda\) (typically \(\lambda > 1\)).

    • Mathematically, the value function \(v(x)\) can be expressed as:
    \[v(x) = \begin{cases} x^\alpha & \text{if } x \geq 0 \\ -\lambda (-x)^\beta & \text{if } x < 0 \end{cases}\]
    • where:
      • \(\alpha, \beta\) control the diminishing sensitivity to gains and losses.
      • \(\lambda\) represents the loss aversion factor, typically greater than 1, meaning losses are felt more intensely than gains.
  2. Probability Weighting Function: Humans tend to overweight small probabilities and underweight large probabilities. While not central to KTO, this element of Prospect Theory highlights how subjective perceptions of uncertainty influence decisions.
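  • The piecewise value function above can be sketched directly. The default parameters (\(\alpha = \beta = 0.88\), \(\lambda = 2.25\)) are the commonly cited estimates from Tversky and Kahneman’s 1992 study and are used here only as illustrative defaults; `prospect_value` is a hypothetical helper name.

```python
def prospect_value(x: float, alpha: float = 0.88, beta: float = 0.88, lam: float = 2.25) -> float:
    """Prospect-theory value of an outcome x measured relative to a reference point.

    alpha, beta : diminishing sensitivity for gains / losses
    lam         : loss aversion factor (> 1 means losses loom larger than gains)
    """
    if x >= 0:
        return x ** alpha            # concave for gains
    return -lam * ((-x) ** beta)     # convex and steeper for losses

gain = prospect_value(10.0)
loss = prospect_value(-10.0)
```

  • With these defaults, a loss of 10 units is felt roughly 2.25 times as strongly as a gain of 10 units, which is the loss-aversion asymmetry the text describes.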

Key Elements of KTO’s Loss Function

  • The KTO loss function builds on these insights, tailoring them for optimizing LLM alignment with human feedback. The key elements of the KTO loss function are:

    1. Adapted Value Function: Instead of the piecewise value function in classic Prospect Theory, KTO uses a logistic function \(\sigma\) to maintain concavity for gains and convexity for losses. This also introduces a risk aversion parameter \(\beta\), which controls the degree of risk aversion and is explicitly incorporated into the model to manage how sharply the value saturates.

    2. Separate Loss Aversion Parameters:
      • In KTO, the original loss aversion parameter \(\lambda\) is replaced with two separate hyperparameters: \(\lambda_D\) for desirable outputs and \(\lambda_U\) for undesirable outputs. This split allows the model to handle these two types of feedback differently, reflecting more granular control over risk aversion depending on whether the output is positive or negative.
    3. KL Divergence as a Reference Point:
      • The reference point for the model is defined by the KL divergence between the current model’s policy \(\pi_\theta\) and the reference policy \(\pi_{\text{ref}}\). This term controls how much the current model’s outputs deviate from the pretrained reference model and acts as the reference point \(z_0\) for evaluating gains and losses in the optimization.

Loss Function Equation

  • The KTO loss function can be mathematically formulated as:

    \[L_{KTO}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{x,y \sim D}[\lambda_y - v(x, y)]\]
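    • where:
      • \(r_\theta(x, y) = \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\) is the implicit reward.
      • \(z_0 = \text{KL}(\pi_\theta(y'|x) \mid\mid \pi_{\text{ref}}(y'|x))\) is the reference point.
      • \(\lambda_y\) denotes \(\lambda_D\) when \(y\) is desirable and \(\lambda_U\) when \(y\) is undesirable.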
  • The value function \(v(x, y)\) changes depending on whether \(y\) is a desirable or undesirable output:

\[v(x, y) = \begin{cases} \lambda_D \sigma(\beta(r_\theta(x, y) - z_0)) & \text{if } y \sim \text{desirable} \\ \lambda_U \sigma(\beta(z_0 - r_\theta(x, y))) & \text{if } y \sim \text{undesirable} \end{cases}\]
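  • A minimal sketch of the value function and loss above follows. It assumes the implicit rewards \(r_\theta(x, y)\) and the reference point \(z_0\) have already been computed and are passed in as plain numbers; the function names, the default \(\beta = 0.1\), and \(\lambda_D = \lambda_U = 1\) are illustrative assumptions rather than prescribed settings.

```python
import math

def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))

def kto_value(r: float, z0: float, desirable: bool,
              beta: float = 0.1, lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """v(x, y): value of an output with implicit reward r relative to reference point z0."""
    if desirable:
        return lambda_d * sigmoid(beta * (r - z0))
    return lambda_u * sigmoid(beta * (z0 - r))

def kto_loss(examples, z0: float,
             beta: float = 0.1, lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Mean of lambda_y - v(x, y) over (reward, desirable) pairs."""
    total = 0.0
    for r, desirable in examples:
        lambda_y = lambda_d if desirable else lambda_u
        total += lambda_y - kto_value(r, z0, desirable, beta, lambda_d, lambda_u)
    return total / len(examples)

# Hypothetical batch: one desirable and one undesirable output, unpaired
batch = [(5.0, True), (-3.0, False)]
loss = kto_loss(batch, z0=0.0)
```

  • Note that each example contributes on its own: the batch holds unpaired binary-labeled outputs, not preference pairs.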

Intuition Behind the Loss Function

  • If the model increases the reward of a desirable example in a blunt manner, the KL divergence penalty will also increase, preventing improvement in the loss. This forces the model to learn specific features of desirable outputs, leading to improved alignment.
  • The logistic function \(\sigma\) ensures that as rewards increase, the model becomes more risk-averse for gains and more risk-seeking for losses, mimicking the behavior predicted by Kahneman and Tversky’s Prospect Theory.

Practical Considerations

  • Risk Aversion Control: The hyperparameter \(\beta\) allows fine-tuning of the model’s sensitivity to gains and losses. Increasing \(\beta\) increases risk aversion in gains and risk-seeking behavior in losses.
  • Desirable and Undesirable Output Weighting: The two loss aversion parameters \(\lambda_D\) and \(\lambda_U\) provide flexibility in how much weight the model gives to desirable vs. undesirable outputs. This is crucial when the training data contains an imbalance between positive and negative examples.

Summary

  • KTO’s loss function is a prospect-theoretic loss that incorporates:
    • Loss aversion: Through separate hyperparameters for desirable and undesirable outcomes.
    • Risk sensitivity: Controlled by the parameter \(\beta\), which regulates how quickly the model’s value function saturates for gains and losses.
    • KL divergence: To ensure the model does not drift too far from the reference point, enforcing stability in the optimization.
  • The KTO approach leverages human-like biases such as loss aversion and risk preferences, aligning the optimization process with how humans evaluate uncertainty, thus enabling better alignment of large language models with human feedback.

Group Relative Policy Optimization (GRPO)

  • Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models by Shao et al. (2024), is an RL algorithm that enhances the PPO method by eliminating the critic model and instead using group-level scores for baseline estimation. The main goals of GRPO are to improve computational efficiency, reduce memory usage, and provide effective fine-tuning for models like DeepSeekMath.
  • The following figure from the paper demonstrates PPO and GRPO. GRPO foregoes the value/critic model, instead estimating the baseline from group scores, significantly reducing training resources.

  • A detailed discourse on GRPO is available in the DeepSeek-R1 primer.

Key Features and Approach

  1. Actor-Only Framework: GRPO replaces the value (critic) model from PPO with a simpler baseline calculated using group rewards. This makes GRPO less computationally intensive.
  2. Group-Based Optimization: It samples multiple outputs (group sampling) for a given input, calculates relative rewards within the group, and uses these rewards to estimate advantages for policy updates.
  3. Adaptation for LLMs: GRPO aligns with the comparative nature of reward models for large language models, which are typically trained on pairwise comparisons of outputs for the same input.

GRPO Equations

  1. PPO Objective Function:
    • The PPO objective (for reference) is:

      \[J_{\text{PPO}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t\right)\right]\]
      • where:
        • \(r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\text{old}}(o_t \mid q, o_{<t})}\): Probability ratio between the current and old policies.
        • \(A_t\): Advantage function.
        • \(\epsilon\): Clipping parameter to stabilize training.
  2. GRPO Objective:
    • The GRPO objective modifies the above to avoid the critic model:
    \[J_{\text{GRPO}}(\theta) = \mathbb{E}_{q, \{o_i\}_{i=1}^G} \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(r_{i,t}(\theta)\hat{A}_{i,t}, \text{clip}(r_{i,t}(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_{i,t}\right)\]
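    • where:
      • \(G\): Number of outputs sampled for each input \(q\) (group size).
      • \(r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid q, o_{i,<t})}\): Per-token probability ratio for output \(o_i\) (not to be confused with the scalar reward \(r_i\) used in the advantage below).
      • \(\hat{A}_{i,t}\): Advantage for the \(t^{th}\) token of output \(o_i\), calculated from group-relative rewards.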
  3. Advantage Calculation:
    • GRPO estimates the advantage \(\hat{A}_{i,t}\) as:
    \[\hat{A}_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)}\]
    • where \(r_i\) is the reward for output \(o_i\), and \(\text{mean}(r)\), \(\text{std}(r)\) are computed over the group.
  4. KL Regularization:
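    • GRPO adds a KL divergence penalty between the current policy and the reference policy to stabilize updates. The DeepSeekMath paper estimates it per token with an unbiased, always non-negative estimator:
    \[D_{\text{KL}}(\pi_\theta \mid\mid \pi_{\text{ref}}) = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log\frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1\]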
  5. Overall GRPO Loss Function:

    • Combining the objective and KL regularization, the final GRPO loss (to be minimized) is given by:

      \[L_{\text{GRPO}}(\theta) = -\mathbb{E}_{q, \{o_i\}_{i=1}^G} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(r_{i,t}(\theta)\hat{A}_{i,t}, \text{clip}(r_{i,t}(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_{i,t}\right) - \beta D_{\text{KL}}(\pi_\theta \mid\mid \pi_{\text{ref}}) \right]\]
      • where:
        • \(\beta\) controls the strength of the KL penalty term.
        • The negative sign reflects that optimization minimizes the loss while maximizing the GRPO objective.
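  • The group-relative advantage and the clipped surrogate term from the equations above can be sketched as follows. This is an illustration under stated assumptions: the rewards are made-up numbers, the population standard deviation is used (the equations only say “std”), and the small `eps` guarding against a zero standard deviation is an addition for numerical safety, not part of the formula.

```python
import statistics

def group_advantages(rewards, eps: float = 1e-8):
    """Normalize rewards within one sampled group: (r_i - mean(r)) / std(r)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical rewards for G = 4 outputs sampled for the same question
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Every token of output i shares the same advantage; the ratio differs per token
term = clipped_term(ratio=1.5, advantage=advantages[0])
```

  • Because the baseline is the group mean, outputs that score above their siblings get positive advantages and the rest get negative ones, with no value network involved.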

Implementation Details

  1. Input Data:
    • Questions (\(q\)) are sampled from a dataset.
    • Multiple outputs (\(G\)) are generated per question using the old policy.
  2. Reward Model:
    • Rewards (\(r_i\)) are computed using a pre-trained reward model.
    • Rewards are normalized within the group to calculate relative advantages.
  3. Optimization Steps:
    • Sample outputs and compute rewards.
    • Compute group-relative advantages.
    • Update the policy model by maximizing the GRPO objective.
    • Apply KL regularization to prevent the policy from drifting too far from the reference model.
  4. Hyperparameters:
    • \(\epsilon\): Clipping parameter (e.g., 0.2).
    • \(\beta\): KL regularization coefficient.
    • \(G\): Group size (e.g., 64 outputs per input).
    • Learning rate: Typically in the range of \(10^{-6}\) to \(10^{-5}\).
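  • The KL regularization step can be sketched with the per-token estimator used in the DeepSeekMath paper, \(\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1\), evaluated on the sampled token. The function names and the example log-probabilities below are hypothetical, and the value of \(\beta\) is only an illustrative choice.

```python
import math

def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token KL estimate from the log-probabilities of the sampled token.

    Computes pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, which is always >= 0
    and equals 0 when the two policies agree on this token.
    """
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0

def kl_penalty(logps_theta, logps_ref, beta: float = 0.04) -> float:
    """beta times the mean per-token KL estimate over one sampled output."""
    estimates = [kl_estimate(lt, lr) for lt, lr in zip(logps_theta, logps_ref)]
    return beta * sum(estimates) / len(estimates)

# Hypothetical token log-probabilities under the current and reference policies
penalty = kl_penalty([-1.0, -0.5, -2.0], [-1.2, -0.5, -1.5])
```

  • The penalty is subtracted from the clipped surrogate objective, so a policy that drifts far from the reference model pays a growing cost.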

Pros and Cons

Pros
  • Efficiency: GRPO reduces memory and computation requirements by eliminating the critic model.
  • Simplicity: The advantage is computed directly from group scores without training an additional value model.
  • Alignment with Reward Models: Leverages the comparative nature of reward functions effectively.
  • Improved Performance: Demonstrated superior results on benchmarks like GSM8K and MATH compared to other RL methods.
Cons
  • Dependence on Group Size: Requires careful tuning of the group size \(G\) for effective advantage estimation.
  • Reward Model Quality: Relies heavily on the quality of the reward model for accurate advantage computation.
  • Applicability: May not generalize well to tasks with sparse or noisy reward signals.

Applications and Results

  • GRPO significantly enhances the mathematical reasoning capabilities of models like DeepSeekMath.
  • On the GSM8K and MATH datasets, DeepSeekMath-RL 7B, trained with GRPO, achieved 88.2% and 51.7% accuracy, respectively, outperforming other open-source models.

Comparative Analysis: REINFORCE vs. TRPO vs. PPO vs. DPO vs. KTO vs. APO vs. GRPO

  • REINFORCE:
    • Function: The simplest policy gradient algorithm that updates the model based on the cumulative reward received from complete trajectories.
    • Implementation: Generates an entire episode, calculates rewards at the end, and updates the policy network based on a weighted log probability loss.
    • Practical Challenges: High variance in policy updates, slow convergence, and instability due to unbounded updates.
  • TRPO:
    • Function: Trust Region Policy Optimization (TRPO) improves policy updates by constraining step sizes to avoid instability.
    • Implementation: Uses a constrained optimization formulation to ensure each update remains within a trust region, preventing excessive deviations.
    • Practical Challenges: Computationally expensive due to the constraint-solving step and requires second-order optimization techniques.
  • PPO:
    • Function: An RL algorithm that optimizes the language model by limiting how far it can drift from a previous version of the model.
    • Implementation: Involves sampling generations from the current model, judging them with a reward model, and using this feedback for updates.
    • Practical Challenges: Can be slow and unstable, especially in distributed settings.
  • DPO:
    • Function: Minimizes the negative log-likelihood of observed human preferences to align the language model with human feedback.
    • Data Requirement: Requires paired preference data.
    • Comparison with KTO: While DPO has been effective, KTO offers competitive or superior performance without the need for paired preferences.
  • KTO:
    • Function: Adapts the Kahneman-Tversky human value function to the language model setting. It uses this adapted function to directly maximize the utility of model outputs.
    • Data Requirement: Does not need paired preference data, only knowledge of whether an output is desirable or undesirable for a given input.
    • Practicality: Easier to deploy in real-world scenarios where desirable/undesirable outcome data is more abundant.
    • Model Comparison: Matches or exceeds the performance of direct preference optimization methods across various model sizes (from 1B to 30B).
  • APO:
    • Function: Introduces a family of contrastive objectives explicitly accounting for the relationship between the model and the preference dataset. This includes APO-zero, which increases desirable outputs while decreasing undesirable ones, and APO-down, which fine-tunes models based on specific quality thresholds.
    • Data Requirement: Works effectively with paired preference datasets created through controlled methods like CLAIR and supports stable alignment even for challenging datasets.
    • Practicality: Excels at aligning strong models with minimally contrasting preferences, enhancing performance on challenging metrics like MixEval-Hard while providing stable, interpretable training dynamics.
    • Model Comparison: Outperformed conventional alignment objectives across multiple benchmarks, closing a 45% performance gap with GPT4-turbo when trained with CLAIR preferences.
  • GRPO:
    • Function: A variant of PPO that removes the need for a critic model by estimating the baseline using group scores, improving memory and computational efficiency while enhancing the mathematical reasoning of models.
    • Data Requirement: Utilizes group-based rewards computed from multiple outputs for each query, normalizing these scores to guide optimization.
    • Practicality: Focuses on reducing training resource consumption compared to PPO and improving RL stability.
    • Model Comparison: Demonstrated superior performance on tasks like GSM8K and MATH benchmarks, outperforming other models of similar scale while improving both in-domain and out-of-domain reasoning tasks.
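
To make the group-based baseline in the GRPO entry above concrete, here is a minimal sketch of how the rewards for a group of sampled outputs to one query could be normalized into advantages. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not details taken from a specific implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one query's group of rewards to zero mean and unit scale.

    The group mean acts as the baseline, so no learned critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one query, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive positive advantages and the rest receive negative ones. If every output in a group gets the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the group size needs tuning.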

Tabular Comparison

Aspect | REINFORCE | TRPO | PPO | DPO | KTO | APO | GRPO
Objective | Policy gradient optimization without constraints. | Ensures stable policy updates within a constrained region. | Maximizes expected reward while preventing large policy updates. | Optimizes policy based on binary classification of human preferences. | Aligns models based on Kahneman-Tversky optimization for utility maximization. | Anchored alignment with specific control over preference-based likelihood adjustments. | Leverages group-based relative advantages and removes the critic network.
Learning Mechanism | Monte Carlo policy gradients with high variance. | Second-order optimization with trust region constraints. | Policy gradients with a clipped surrogate objective. | Cross-entropy optimization over paired preferences. | Maximizes desirable likelihoods relative to undesirables, without paired data. | Uses variants like APO-zero or APO-down for stable preference-based optimization. | Group normalization with policy gradients, eliminating the critic network.
Stability | Low (high variance, unstable updates). | High (enforces trust region for stable updates). | Relies on clipping mechanisms to avoid destabilization. | Stable as it directly optimizes preferences. | Stable due to focus on unpaired desirability adjustments. | Offers robust training stability, scaling better on models trained with mixed-quality datasets. | Stable due to normalization of rewards across groups.
Training Complexity | High (unconstrained updates). | Very high (requires second-order optimization and solving constraints). | High, due to balancing reward maximization with policy constraints. | Moderate; uses simplified binary preference objectives. | Simplifies alignment by focusing only on desirability. | Adaptive and context-aware; requires understanding dataset-model relationships. | Reduces overhead via group-based scoring.
Performance | Unstable and sample-inefficient. | More stable than PPO but computationally expensive. | Strong performance on tasks with clear reward signals but prone to instability in distributed setups. | Effective for straightforward preference alignment tasks. | Competitive or better alignment than preference-based methods without paired data needs. | Superior alignment results, particularly for nuanced dataset control. | Excels in reasoning tasks, offering computational efficiency.
Notable Strength | Simple to implement but inefficient. | Ensures stable policy updates through trust-region constraints. | Widely used in RL settings, good at reward-based optimization. | Directly optimizes for preferences without needing a separate reward model. | Handles binary data efficiently, avoiding paired data dependencies. | Allows precise alignment with nuanced datasets. | Simplifies reward aggregation; strong for reasoning-heavy tasks.
Scenarios Best Suited | RL tasks where simplicity is preferred over efficiency. | High-stability RL tasks requiring constraint-driven policy improvements. | RL environments where reward signals are predefined. | Scenarios with abundant paired human feedback. | Real-world settings with broad definitions of desirable/undesirable outputs. | Tasks requiring precise alignment with minimally contrasting preferences. | Mathematical reasoning or low-resource training setups.
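
The "clipped surrogate objective" listed for PPO in the table above can be written per sample in a few lines. This is a minimal sketch: the function name `clipped_surrogate` and the default `clip_eps=0.2` are illustrative assumptions, and `ratio` stands for the probability ratio between the new and old policy for the sampled action.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: the gain is capped.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# Bad action, policy already moved far away from it: the gain is capped too.
capped_penalty = clipped_surrogate(ratio=0.5, advantage=-1.0)
# Bad action that became MORE likely: not clipped, so it is fully penalized.
uncapped = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra credit for moving further than the clip range in a helpful direction, but remains fully accountable when it moves in a harmful one.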

Comparative Performance: DPO vs. PPO

  • Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study by Xu et al. (2024) presents a large-scale empirical comparison of Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) across diverse LLM alignment tasks, including dialogue helpfulness, summarization, and reasoning. The authors benchmark both methods on multiple model architectures and preference datasets to test the claim that DPO, a simpler method that needs no explicit reward model, can replace PPO for aligning LLMs.

Experimental Setup

  • Both algorithms are evaluated under controlled experimental conditions using identical base models, datasets, and training budgets. The PPO implementation follows the canonical setup from Schulman et al. (2017), incorporating Generalized Advantage Estimation (GAE) for variance reduction and a learned value critic. DPO follows the original formulation by Rafailov et al. (2023), trained on the same preference pairs without any reward model.
  • The study further includes ablation tests on regularization strength, KL penalties, and reference model choices, providing a fair cross-method comparison.

Key Findings

Performance on Alignment and Reward Metrics
  • Across nearly all benchmarks, PPO-trained models outperform DPO-trained ones on both human preference alignment and reward model scores. While DPO achieves comparable performance for smaller-scale models (≤7B parameters), PPO exhibits superior performance for larger models, especially in settings involving multi-turn dialogue and complex reasoning.

  • Quantitatively:

    • PPO yields higher reward scores (by 5–15%) when trained with an equivalent number of updates.
    • PPO-trained models generalize better to unseen prompts, suggesting more stable policy optimization.
    • DPO sometimes overfits to the preference dataset, exhibiting degraded out-of-domain behavior.
  • These results indicate that while DPO is computationally simpler, PPO remains more robust and effective for large-scale LLM alignment, particularly when the reward signal (or its learned approximation) is reliable.

Stability and Training Dynamics
  • DPO’s supervised nature offers deterministic, low-variance optimization, but this stability can be misleading. PPO’s stochastic policy optimization introduces variance but allows adaptive balancing between exploration and exploitation via its clipped objective.

  • The study highlights that:

    • PPO maintains better gradient signal quality due to explicit advantage estimation.
    • DPO’s gradients saturate quickly because of the sigmoid in its binary cross-entropy formulation, leading to slower convergence in high-dimensional action spaces.
    • PPO’s clipped objective, combined with a KL penalty against the reference model, provides smoother convergence and mitigates catastrophic policy drift, while DPO occasionally collapses toward the reference policy if \(\beta\) is too large or diverges when \(\beta\) is too small.
Sample Efficiency and Computational Cost
  • One of DPO’s major advantages is its simplicity:

    • DPO eliminates the need for rollouts or reward modeling, resulting in 40–60% lower computational cost than PPO.
    • PPO, by contrast, requires multiple rollouts, critic training, and advantage computation per update step, increasing runtime significantly.

  • However, PPO’s higher sample efficiency offsets its cost in many cases. DPO’s performance plateaued early in training, whereas PPO continued improving with more samples, achieving higher asymptotic returns.
  • As model size increases, PPO’s advantage becomes more pronounced.
  • The authors observe a positive scaling trend for PPO with model capacity, while DPO’s performance saturates or declines. This finding aligns with observations from Touvron et al. (2023) on the scaling behavior of LLM optimization methods.

  • Specifically:

    • For \(1\text{B} \leq \text{params} \leq 3\text{B}\): DPO \(\approx\) PPO
    • For \(7\text{B} \leq \text{params} \leq 13\text{B}\): PPO \(>\) DPO by approximately 8–10% reward margin
    • For \(\text{params} \geq 30\text{B}\): PPO significantly outperforms DPO, both on automatic and human-evaluated metrics
Robustness to Preference Noise
  • When preference datasets contain inconsistent or noisy labels, DPO degrades more severely than PPO. PPO’s reward modeling can learn to smooth out noise by averaging over sampled rollouts, whereas DPO lacks an implicit noise-handling mechanism.

  • Regularization (e.g., higher \(\beta\) or stronger KL penalties) mitigates this partially, but not completely. PPO’s value-based critic contributes additional robustness by learning a denoised reward landscape.

Practical Implications
  • The study concludes that DPO should not be viewed as a drop-in replacement for PPO, particularly in high-stakes alignment settings. Instead, the two approaches occupy complementary roles:
Scenario | Recommended Algorithm | Rationale
Small/medium models (<7B) with clean preference data | DPO | Simpler, efficient, stable
Large-scale alignment (>13B) or noisy human feedback | PPO | More robust, scalable, better generalization
Synthetic or AI-generated feedback (RLAIF) | DPO | Avoids reward model training, computationally efficient
Fine-tuning with dense reward signals | PPO | Better advantage estimation and reward propagation

Analytical Perspective

  • From a theoretical standpoint, PPO’s advantage arises from its actor-critic design and explicit control over policy divergence, allowing better credit assignment across trajectories. DPO’s gradient direction aligns locally with preference log-ratios but lacks trajectory-level information, making it less effective when feedback depends on long-term sequence quality.

  • Mathematically, PPO’s gradient approximates:

\[\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) A^{\pi}(s, a)\right]\]
  • whereas DPO optimizes:

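    \[\nabla_\theta L_{\text{DPO}} = -\mathbb{E}\left[\beta\,(1 - \sigma(z))\left(\nabla_\theta \log \pi_\theta(y^+ \mid x) - \nabla_\theta \log \pi_\theta(y^- \mid x)\right)\right]\]
    • with \(z = \beta \left[\left(\log \pi_\theta(y^+ \mid x) - \log \pi_{\text{ref}}(y^+ \mid x)\right) - \left(\log \pi_\theta(y^- \mid x) - \log \pi_{\text{ref}}(y^- \mid x)\right)\right]\), where \(\pi_{\text{ref}}\) is the frozen reference policy.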
  • This shows that DPO’s updates depend solely on pairwise preference differentials rather than long-horizon returns, limiting its representational power for temporally extended dependencies.
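
As a worked illustration of the DPO update above, the sketch below computes the per-pair loss \(-\log \sigma(z)\) and the scalar weight \(\beta(1 - \sigma(z))\) that multiplies the difference of log-probability gradients in a descent step. It assumes the standard DPO formulation in which \(z\) is measured relative to a frozen reference policy; the function name `dpo_loss_and_weight` and the numeric log-probabilities are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_loss_and_weight(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Return (loss, weight) for a single preference pair.

    loss   = -log sigmoid(z)
    weight = beta * (1 - sigmoid(z)), the scale applied to
             grad log pi(y+|x) - grad log pi(y-|x).
    """
    z = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    loss = -math.log(sigmoid(z))
    weight = beta * (1.0 - sigmoid(z))
    return loss, weight


# Policy identical to the reference: z = 0, so the loss is log 2.
loss_start, weight_start = dpo_loss_and_weight(-12.0, -11.0, -12.0, -11.0)
# Policy already separates the pair strongly: the weight shrinks.
loss_late, weight_late = dpo_loss_and_weight(-5.0, -40.0, -12.0, -11.0)
```

The shrinking weight in the second case is the sigmoid saturation effect: once a pair is well separated, it contributes almost no gradient. Note that only sequence-level log-probabilities enter the computation, with no per-step advantage.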

  • Other recent methods like GRPO by Shao et al. (2024) and RRHF by Yuan et al. (2023) aim to bridge this gap by incorporating relative advantage estimation without critics. These approaches seek a middle ground between DPO’s simplicity and PPO’s robustness; they show early promise but remain less mature than PPO in large-scale deployment.

Takeaways

  • In summary:

    • PPO consistently outperforms DPO in large-scale alignment and complex reasoning tasks.
    • DPO offers efficiency and simplicity, excelling in smaller setups and RLAIF-style pipelines.
    • The choice between the two depends on the model scale, data quality, and computational budget.
  • DPO’s innovation lies in conceptual simplicity, but PPO’s structured reinforcement learning foundation continues to yield superior alignment when scaling beyond small models. The study’s findings underscore that while DPO simplifies RLHF, PPO remains the gold standard for robust, high-fidelity preference alignment in contemporary large language models.

Bias Concerns and Mitigation Strategies

  • A fair question to ask now is whether RLHF/RLAIF can add bias to the model. This is an important topic, as large conversational language models are being deployed in applications ranging from search engines (Bing Chat, Google’s Bard) to document editors (Microsoft Office Copilot, Google Docs, Notion, etc.).
  • The answer is, yes, just as with any machine learning approach with human input, RLHF has the potential to introduce bias.
  • Let’s look at the different forms of bias it can introduce:
    • Selection bias:
      • RLHF relies on feedback from human evaluators, who may have their own biases and preferences (and can thus limit their feedback to topics or situations they can relate to). As such, the agent may not be exposed to the true range of behaviors and outcomes that it will encounter in the real world.
    • Confirmation bias:
      • Human evaluators may be more likely to provide feedback that confirms their existing beliefs or expectations, rather than providing objective feedback based on the agent’s performance.
      • This can lead to the agent being reinforced for certain behaviors or outcomes that may not be optimal or desirable in the long run.
    • Inter-rater variability:
      • Different human evaluators may have different opinions or judgments about the quality of the agent’s performance, leading to inconsistency in the feedback that the agent receives.
      • This can make it difficult to train the agent effectively and can lead to suboptimal performance.
    • Limited feedback:
      • Human evaluators may not be able to provide feedback on all aspects of the agent’s performance, leading to gaps in the agent’s learning and potentially suboptimal performance in certain situations.
  • Now that we’ve seen the different types of bias possible with RLHF, let’s look at ways to mitigate them:
    • Diverse evaluator selection:
      • Selecting evaluators with diverse backgrounds and perspectives can help to reduce bias in the feedback, just as it does in the workplace.
      • This can be achieved by recruiting evaluators from different demographic groups, regions, or industries.
    • Consensus evaluation:
      • Using consensus evaluation, where multiple evaluators provide feedback on the same task, can help to reduce the impact of individual biases and increase the reliability of the feedback.
      • This is almost like ‘normalizing’ the evaluation.
    • Calibration of evaluators:
      • Calibrating evaluators by providing them with training and guidance on how to provide feedback can help to improve the quality and consistency of the feedback.
    • Evaluation of the feedback process:
      • Regularly evaluating the feedback process, including the quality of the feedback and the effectiveness of the training process, can help to identify and address any biases that may be present.
    • Evaluation of the agent’s performance:
      • Regularly evaluating the agent’s performance on a variety of tasks and in different environments can help to ensure that it is not overfitting to specific examples and is capable of generalizing to new situations.
    • Balancing the feedback:
      • Balancing the feedback from human evaluators with other sources of feedback, such as self-play or expert demonstrations, can help to reduce the impact of bias in the feedback and improve the overall quality of the training data.
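
The consensus evaluation strategy above can be sketched as a majority vote over several evaluators, with the agreement rate kept as a signal of inter-rater variability. The function name `consensus_label`, the 0.75 agreement threshold, and the toy votes are illustrative assumptions.

```python
from collections import Counter


def consensus_label(votes):
    """Return (majority_label, agreement) for one comparison.

    votes: one label per evaluator, e.g. "A" or "B".
    agreement: fraction of evaluators who chose the majority label.
    Ties resolve to the label that appears first in `votes`.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


votes_per_item = {
    "prompt-1": ["A", "A", "B"],
    "prompt-2": ["B", "B", "B"],
}
results = {item: consensus_label(v) for item, v in votes_per_item.items()}

# Items with low agreement can be sent back for review or down-weighted.
low_agreement = [item for item, (_, a) in results.items() if a < 0.75]
```

Tracking agreement alongside the label also supports the calibration step: items where evaluators routinely disagree often point to unclear labeling guidelines rather than to individual rater error.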

TRL - Transformer RL

  • The trl library is a full-stack library for fine-tuning and aligning transformer language and diffusion models using methods such as Supervised Fine-Tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).
  • The library is built on top of the transformers library and thus supports any model architecture available there.

Selected Papers

OpenAI’s Paper on InstructGPT: Training language models to follow instructions with human feedback

  • Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.
  • Ouyang et al. (2022) from OpenAI introduce InstructGPT, a model that aligns language models with user intent on a wide range of tasks by fine-tuning with human feedback.
  • Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, they collect a dataset of labeler demonstrations of the desired model behavior, which they use to fine-tune GPT-3 using supervised fine-tuning (SFT). This process is referred to as “instruction tuning” by other papers such as Wei et al. (2022).
  • They then collect a dataset of rankings of model outputs, which they use to further fine-tune this supervised model using RLHF.
  • In human evaluations on their prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.
  • Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, their results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
  • It is important to note that ChatGPT is trained using the same methods as InstructGPT (using SFT followed by RLHF), but is fine-tuned from a model in the GPT-3.5 series.
  • Furthermore, the fine-tuning process proposed in the paper isn’t without its challenges. First, we need a significant volume of demonstration data. For instance, in the InstructGPT paper, they used 13k instruction-output samples for supervised fine-tuning, 33k output comparisons for reward modeling, and 31k prompts without human labels as input for RLHF. Second, fine-tuning comes with an “alignment tax” (a form of negative transfer): the process can lead to lower performance on certain critical tasks. (There’s no free lunch after all.) The same InstructGPT paper found that RLHF led to performance regressions (relative to the GPT-3 base model) on public NLP tasks like SQuAD, HellaSwag, and WMT 2015 French to English. A potential workaround is to have several smaller, specialized models that excel at narrow tasks.
  • The figure below from the paper illustrates the three steps of training InstructGPT: (1) SFT, (2) reward model training, and (3) RL via PPO on this reward model. Blue arrows indicate that this data is used to train the respective model in the diagram. In Step 2, boxes A-D are samples from the SFT model that get ranked by labelers.
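
Step 2 above turns labeler rankings into a training signal for the reward model. A common way to do this, sketched below, is a pairwise logistic loss averaged over every pair implied by a ranking. The function name `ranking_loss` and the example scores are illustrative assumptions, not values from the paper.

```python
import math
from itertools import combinations


def ranking_loss(scores_best_to_worst):
    """Average -log sigmoid(r_w - r_l) over all pairs implied by a ranking.

    scores_best_to_worst: reward-model scores for the ranked outputs,
    ordered from the labeler's most preferred to least preferred.
    """
    pairs = list(combinations(scores_best_to_worst, 2))
    # -log sigmoid(d) == log(1 + exp(-d)), with d = r_w - r_l
    losses = [math.log(1.0 + math.exp(-(r_w - r_l))) for r_w, r_l in pairs]
    return sum(losses) / len(pairs)


# Scores for outputs A-D, listed in the order the labeler ranked them.
agrees_with_labeler = ranking_loss([2.0, 1.0, 0.5, -1.0])
contradicts_labeler = ranking_loss([-1.0, 0.5, 1.0, 2.0])
```

A ranking of four outputs yields six pairs, so a single labeling task provides several comparisons. The loss is low when the reward model scores outputs in the same order as the labeler and high when it reverses that order.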

Constitutional AI: Harmlessness from AI Feedback

  • The paper extends RLHF by training language models on datasets labeled for helpfulness and harmlessness. It introduces ‘HH’ models, which are trained on both criteria and have been shown to be more harmless and better at following instructions than models trained on helpfulness alone.
  • An evaluation of these models’ ability to identify harmful behavior in language model interactions was conducted using a set of conversations rated for harmfulness. The study leveraged ‘red teaming’ where humans attempted to provoke the AI into harmful responses, thereby improving the training process.
  • The effectiveness of the training method was demonstrated through models’ performance on questions assessing helpfulness, honesty, and harmlessness, without relying on human labels for harmlessness.
  • This research aligns with other efforts like LaMDA and InstructGPT, which also utilize human data to train language models. The concept of ‘constitutional AI’ was introduced, focusing on self-critique and revision by the AI to foster both harmless and helpful interactions. The ultimate goal is to create AI that can self-regulate harmfulness while remaining helpful and responsive.

OpenAI’s Paper on PPO: Proximal Policy Optimization Algorithms

  • Schulman et al. (2017) propose a new family of policy gradient methods for RL, which alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.
  • Whereas standard policy gradient methods perform one gradient update per data sample, they propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which they call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).
  • Their experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, showing that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall clock time.

A General Language Assistant as a Laboratory for Alignment

  • This paper by Askell et al. from Anthropic introduces a comprehensive study towards aligning general-purpose, text-based AI systems with human values, focusing on making AI helpful, honest, and harmless (HHH). Given the capabilities of large language models, the authors investigate various alignment techniques and their evaluations to ensure these models adhere to human preferences without compromising performance.
  • The authors begin by examining naive prompting as a baseline for alignment, finding that the benefits from such interventions increase with model size and generalize across multiple alignment evaluations. Prompting was shown to impose negligible performance costs (‘alignment taxes’) on large models. The paper also explores the scaling trends of several training objectives relevant to alignment, including imitation learning, binary discrimination, and ranked preference modeling. The results indicate that ranked preference modeling significantly outperforms imitation learning and scales more favorably with model size, while binary discrimination performs similarly to imitation learning.
  • A key innovation discussed is ‘preference model pre-training’ (PMP), which aims to improve the sample efficiency of fine-tuning models on human preferences. This involves pre-training on large public datasets that encode human preferences, such as Stack Exchange, Reddit, and Wikipedia edits, before fine-tuning on smaller, more specific datasets. The findings suggest that PMP substantially enhances sample efficiency and often improves asymptotic performance when fine-tuning on human feedback datasets.
  • Implementation Details:
    • Prompts and Context Distillation: The authors utilize a prompt composed of 14 fictional conversations to induce the HHH criteria in models. They introduce ‘context distillation,’ a method where the model is fine-tuned using the KL divergence between the model’s predictions and the distribution conditioned on the prompt context. This technique effectively transfers the prompt’s conditioning into the model.
    • Training Objectives:
      • Imitation Learning: Models are trained to imitate ‘good’ behavior using supervised learning on sequences labeled as correct or desirable.
      • Binary Discrimination: Models distinguish between ‘correct’ and ‘incorrect’ behavior by training on pairs of correct and incorrect samples.
      • Ranked Preference Modeling: Models are trained to assign higher scores to better samples from ranked datasets using pairwise comparisons, a more complex but effective approach for capturing preferences.
    • Preference Model Pre-Training (PMP): The training pipeline includes a PMP stage where models are pre-trained on binary discriminations sourced from Stack Exchange, Reddit, and Wikipedia edits. This stage significantly enhances sample efficiency during subsequent fine-tuning on smaller datasets.
  • Results:
    • Prompting: Simple prompting significantly improves model performance on alignment evaluations, including HHH criteria and toxicity reduction. Prompting and context distillation both decrease toxicity in generated text as model size increases.
    • Scaling Trends: Ranked preference modeling outperforms imitation learning, especially on tasks with ranked data like summarization and HellaSwag. Binary discrimination shows little improvement over imitation learning.
    • Sample Efficiency: PMP dramatically increases the sample efficiency of fine-tuning, with larger models benefiting more from PMP than smaller ones. Binary discrimination during PMP is found to transfer better than ranked preference modeling.
  • The figure below from the paper shows: (Left) Simple prompting significantly improves performance and scaling on our HHH alignment evaluations (y-axis measures accuracy at choosing better responses on our HHH evaluations). (Right) Prompts impose little or no ‘alignment tax’ on large models, even on complex evaluations like function synthesis. Here we have evaluated our python code models on the HumanEval codex dataset at temperature T = 0.6 and top P = 0.95.

  • The study demonstrates that simple alignment techniques like prompting can lead to meaningful improvements in AI behavior, while more sophisticated methods like preference modeling and PMP offer scalable and efficient solutions for aligning large language models with human values.
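
The context distillation objective described in the implementation details above can be illustrated on a toy next-token distribution. The sketch assumes the loss is KL(teacher || student), where the teacher is the model conditioned on the HHH prompt and the student is the same model without it. The direction of the KL, the three-token vocabulary, and the logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Next-token distributions over a toy 3-token vocabulary.
teacher = softmax([2.0, 0.5, -1.0])  # model conditioned on the prompt context
student = softmax([1.0, 1.0, 0.0])   # model being fine-tuned, no prompt

distillation_loss = kl_divergence(teacher, student)
```

Minimizing this quantity over many contexts pushes the unprompted model toward the behavior it would show with the prompt prepended, so the prompt no longer has to occupy the context window at inference time.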

Anthropic’s Paper on Constitutional AI: Constitutional AI: Harmlessness from AI Feedback

  • As AI systems become more capable, we would like to enlist their help to supervise other AIs.
  • Bai et al. (2022) experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so they refer to the method as ‘Constitutional AI’.
  • The process involves both a supervised learning phase and an RL phase. In the supervised phase they sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, they sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences.
  • They then train with RL using the preference model as the reward signal, i.e. they use ‘RL from AI Feedback’ (RLAIF). As a result they are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
  • The figure below from the paper shows the basic steps of their Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a reinforcement learning (RL) stage, shown as the sequence of steps at the bottom of the figure. Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model, and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.

  • The graph below shows harmlessness versus helpfulness Elo scores (higher is better, only differences are meaningful) computed from crowdworkers’ model comparisons for all 52B RL runs. Points further to the right are later steps in RL training. The Helpful and HH models were trained with human feedback as in [Bai et al., 2022], and exhibit a tradeoff between helpfulness and harmlessness. The RL-CAI models trained with AI feedback learn to be less harmful at a given level of helpfulness. The crowdworkers evaluating these models were instructed to prefer less evasive responses when both responses were equally harmless; this is why the human feedback-trained Helpful and HH models do not differ more in their harmlessness scores.
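
The supervised phase described above (sample, self-critique, revise, then fine-tune on the revisions) can be outlined as a short loop. This is only a structural sketch: `generate` is a hypothetical stand-in for sampling from a language model, and the prompt templates and principles are invented for illustration rather than taken from the paper.

```python
def critique_and_revise(generate, prompt, principles):
    """Return (initial_response, revised_response) after one pass per principle.

    generate: any callable mapping a text prompt to a text completion
    (hypothetical stand-in for sampling from a language model).
    """
    initial = generate(prompt)
    response = initial
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return initial, response


calls = []


def toy_generate(text):
    # Toy stand-in so the sketch runs without a model; it only records calls.
    calls.append(text)
    return f"completion-{len(calls)}"


principles = ["Avoid harmful content.", "Do not be evasive."]
initial, revised = critique_and_revise(toy_generate, "example user request", principles)
# (prompt, revised) pairs would then form the supervised fine-tuning set.
```

Each principle costs two model calls (one critique, one revision), so the number of principles applied per prompt directly sets the sampling budget of this phase.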

RLAIF: Scaling RL from Human Feedback with AI Feedback

  • This paper by Lee et al. from Google Research introduces a novel method for training large language models (LLMs) with AI-generated feedback, addressing the challenges and costs associated with traditional human feedback methods.
  • The paper presents RL from AI Feedback (RLAIF) as a promising alternative to the conventional RLHF. RLAIF utilizes an off-the-shelf LLM as a preference labeler, streamlining the training process and, in some cases, surpassing the performance of models trained with human feedback.
  • This approach is applied to text generation tasks such as summarization, helpful dialogue generation, and harmless dialogue generation. The performance of RLAIF, as assessed by human raters, is comparable or superior to RLHF, challenging the assumption that larger policy models are always more effective.
  • A key advantage of RLAIF is its potential to significantly reduce reliance on expensive human annotations. The study shows the efficacy of using the same model size for both the LLM labeler and the policy model, and highlights that directly prompting the LLM for reward scores can be more effective than using a distilled reward model.
  • The authors explore methodologies for generating AI preferences aligned with human values, emphasizing the effectiveness of chain-of-thought reasoning and detailed preamble in improving AI labeler alignment.
  • The following figure from the paper shows a diagram depicting RLAIF (top) vs. RLHF (bottom).

  • RLAIF’s scalability and cost-effectiveness are notable, with the approach being over ten times cheaper than human annotation. This aligns with the growing trend in LLM research focusing on quality over quantity in datasets.
  • The paper suggests that combining RLHF and RLAIF could be a strategic approach, especially considering that LLMs like GPT-4 have been trained with human feedback. This hybrid model could represent a balanced integration of high-quality human data, amplified significantly by AI, potentially shaping the future of LLM training and influencing approaches like the development of GPT-5.

A General Theoretical Paradigm to Understand Learning from Human Preferences

  • This paper by Azar et al. from Google DeepMind delves into the theoretical underpinnings of learning from human preferences, particularly focusing on RL from human feedback (RLHF) and direct preference optimization (DPO). The authors propose a novel objective, \(\Psi\)-preference optimization (\(\Psi\)PO), which encompasses RLHF and DPO as specific instances, aiming to optimize policies directly from human preferences without relying on the approximations common in existing methods.
  • RLHF typically involves a two-step process where a reward model is first trained using a binary classifier to distinguish preferred actions, often employing a Bradley-Terry model for this purpose. This is followed by policy optimization to maximize the learned reward while ensuring the policy remains close to a reference policy through KL regularization. DPO, in contrast, seeks to optimize the policy directly from human preferences, eliminating the need for explicit reward model training.
  • The \(\Psi\)PO framework is a more general approach that seeks to address the potential overfitting issues inherent in RLHF and DPO by considering pairwise preferences and employing a possibly non-linear function of preference probabilities alongside KL regularization. Specifically, the Identity-PO (IPO) variant of \(\Psi\)PO is highlighted for its practicality and theoretical appeal, as it allows for direct optimization from preferences without the approximations used in other methods.
  • Empirical demonstrations show that IPO can effectively learn from preferences without succumbing to the overfitting problems identified in DPO, providing a robust method for preference-based policy optimization. The paper suggests that future work could explore scaling these theoretical insights to more complex settings, such as training language models on human preference data.
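  • To make the overfitting contrast concrete, the sketch below compares the two per-pair losses as functions of the preference margin. It assumes the commonly cited forms of the losses (DPO as a negative log-sigmoid of the scaled margin, IPO as a squared distance to the target margin \(1/(2\tau)\)); the function names and default hyperparameters are illustrative, not taken from the paper's code.

```python
import math

def preference_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)]."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

def dpo_loss(h, beta=0.1):
    """-log sigmoid(beta * h): keeps shrinking as the margin h grows."""
    return math.log1p(math.exp(-beta * h))

def ipo_loss(h, tau=0.1):
    """(h - 1/(2*tau))^2: minimized at a finite target margin."""
    return (h - 1.0 / (2.0 * tau)) ** 2

# Sweep the margin: DPO always rewards a larger margin, IPO does not.
for h in (0.0, 5.0, 10.0, 20.0):
    print(h, round(dpo_loss(h), 4), round(ipo_loss(h), 4))
```

  • Under these assumptions, the DPO loss can always be lowered by pushing the margin further, whereas the IPO loss starts increasing once the margin passes \(1/(2\tau)\), which is one way to read the regularization effect described above.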

SLiC-HF: Sequence Likelihood Calibration with Human Feedback

  • This paper by Zhao et al. from Google DeepMind and Google Research introduces Sequence Likelihood Calibration with Human Feedback (SLiC-HF) as a method for aligning language models with human preferences using human feedback data. SLiC-HF is showcased as an effective, simpler, and more computationally efficient alternative to RL from Human Feedback (RLHF), particularly for the task of TL;DR summarization.
  • SLiC-HF operates by calibrating the sequence likelihood of a Supervised Fine-Tuning (SFT) model against human feedback data, either directly or through a ranking model derived from human judgments. This is in contrast to traditional RLHF approaches that rely on optimizing a language model using a reward model trained on human preferences.
  • The paper details several implementations of SLiC-HF: applying it directly to human feedback data without a separate ranking or reward model (SLiC-HF-direct), and a sample-and-rank approach that uses either a reward model or a ranking model to label pairs sampled from the SFT model (SLiC-HF-sample-rank). Specifically, to determine the rank, they consider two text-to-text models trained from the human preference data:
    • Trained Pointwise Reward model: They binarize each ranked pair into a positive and a negative sequence, as shown in the figure below. When training the reward model, input sequences are formatted as ‘[Context] … [Summary] …’ and target sequences are either ‘Good’ or ‘Bad’. At inference time, they compute the probability of the token ‘Good’ on the decoder side to score each of the \(m\) candidates in a list, and sample \(m\) positive/negative pairs from them.
    • Trained Pairwise Ranking model: As shown in the figure below, they formulate the human feedback as a pairwise ranking problem in text-to-text format. When training the ranking model, input sequences are formatted as ‘[Context] … [Summary A] … [Summary B]’ and target sequences are either ‘A’ or ‘B’. At inference time, they use a tournament-style procedure to rank candidates in a list. For example, given a list of 4 candidates \(c_1\), \(c_2\), \(c_3\), \(c_4\), they first rank \(c_1\), \(c_2\) and \(c_3\), \(c_4\), and then rank winner \((c_1, c_2)\) against winner \((c_3, c_4)\). Given \(m\) candidates, the ranking model is called \(m - 1\) times and \(m - 1\) positive/negative pairs are yielded.
  • The following figure from the paper shows the data format for training the text-to-text reward model and ranking model.

  • Extensive experiments demonstrate that SLiC-HF significantly improves upon SFT baselines and offers competitive performance to RLHF-PPO implementations. The experiments involved automatic and human evaluations, focusing on the Reddit TL;DR summarization task. Results showed SLiC-HF’s capability to produce high-quality summaries, with improvements observed across different configurations and parameter scales.
  • The paper contributes to the field by providing a detailed methodology for implementing SLiC-HF, showcasing its efficiency and effectiveness compared to traditional RLHF methods. It also demonstrates the viability of leveraging off-policy human feedback data, thus potentially reducing the need for costly new data collection efforts.
  • Further discussions in the paper explore the computational and memory efficiency advantages of SLiC-HF over RLHF-PPO, highlighting the former’s scalability and potential for broader application in language generation tasks. The paper concludes with suggestions for future research directions, including exploring other reward functions and non-human feedback mechanisms for language model calibration.
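  • The tournament-style ranking described above can be sketched as a single-elimination bracket. This is a minimal illustration, not the authors' implementation: `a_beats_b` is a hypothetical stand-in for a call to the pairwise ranking model, and giving an unpaired candidate a bye is an assumption made here for odd-sized rounds.

```python
def tournament_rank(candidates, a_beats_b):
    """Single-elimination tournament over a non-empty candidate list.

    a_beats_b(a, b) -> bool is a hypothetical wrapper around the pairwise
    ranking model. Returns (winner, pairs), where each pair is
    (positive, negative).
    """
    pairs = []
    current = list(candidates)
    while len(current) > 1:
        winners = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            if a_beats_b(a, b):
                win, lose = a, b
            else:
                win, lose = b, a
            pairs.append((win, lose))
            winners.append(win)
        if len(current) % 2 == 1:
            # Assumption: an unpaired candidate advances without a match.
            winners.append(current[-1])
        current = winners
    return current[0], pairs

# Toy usage with made-up quality scores in place of a trained ranker.
scores = {"c1": 0.2, "c2": 0.9, "c3": 0.5, "c4": 0.4}
winner, pairs = tournament_rank(list(scores), lambda a, b: scores[a] >= scores[b])
print(winner, pairs)
```

  • Each match eliminates exactly one candidate, so \(m\) candidates always cost \(m - 1\) ranking-model calls and yield \(m - 1\) positive/negative pairs.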

Reinforced Self-Training (ReST) for Language Modeling

  • RLHF can improve the quality of large language models’ (LLMs’) outputs by aligning them with human preferences.
  • This paper by Gulcehre et al. from Google DeepMind and Google Research proposes Reinforced Self-Training (ReST), a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL).
  • ReST generates samples from an initial LLM policy to create a dataset, which is then used to improve the LLM policy using offline RL algorithms. This method is more efficient than traditional online RLHF methods due to offline production of the training dataset, facilitating data reuse.
  • ReST operates in two loops: the inner loop (Improve) and the outer loop (Grow).
    • Grow: The LLM policy generates multiple output predictions per context, augmenting the training dataset.
    • Improve: The augmented dataset is ranked and filtered using a scoring function based on a learned reward model trained on human preferences. The model is then fine-tuned on this filtered dataset with an offline RL objective, with the possibility of repeating this step with increasing filtering thresholds.
  • The following image from the paper illustrates the ReST method. During the Grow step, a policy generates a dataset. At the Improve step, the filtered dataset is used to fine-tune the policy. Both steps are repeated; the Improve step is repeated more frequently to amortise the dataset creation cost.

  • ReST’s advantages include reduced computational burden, independence from the original dataset’s quality, and simplicity in implementation.
  • Machine translation was chosen as the application for testing ReST, due to strong baselines and well-defined evaluation procedures. Experiments were conducted on IWSLT 2014, WMT 2020 benchmarks, and an internal high-fidelity benchmark called Web Domain. The evaluation used state-of-the-art reference-free reward models like Metric X, BLEURT, and COMET. ReST significantly improved reward model scores and translation quality on test and validation sets, as per both automated metrics and human evaluation.
  • ReST outperformed standard supervised learning (BC G=0 I=0) in reward model scores and human evaluations. The BC loss (Behavioral Cloning) was found to be the most effective for ReST, leading to continuous improvements in the model’s reward on holdout sets. However, improvements in reward model scores did not always align with human preferences.
  • ReST showed better performance over supervised training across different datasets and language pairs. The inclusion of multiple Improve steps and Grow steps resulted in significant improvements in performance. Human evaluations showed that all ReST variants significantly outperformed the BC baseline.
  • ReST is distinct from other self-improvement algorithms in language modeling due to its computational efficiency and ability to leverage exploration data and rewards. The approach is applicable to various language tasks, including summarization, dialogue, and other generative models.
  • Future work includes fine-tuning reward models on subsets annotated with human preferences and exploring better RL exploration strategies.
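  • The Grow/Improve structure described above can be sketched on a toy problem. Everything here is an assumption made for illustration: the "policy" is a categorical distribution over five outputs, `reward` is a made-up scoring function standing in for a learned reward model, and the Improve step is a behavioral-cloning-style partial move toward the filtered data rather than real fine-tuning.

```python
import random
from collections import Counter

OUTPUTS = [0, 1, 2, 3, 4]

def reward(y):
    """Toy reward in [0, 1]; stands in for a learned reward model."""
    return y / 4.0

def grow(policy, n, rng):
    """Grow: sample a dataset from the current policy."""
    return rng.choices(OUTPUTS, weights=policy, k=n)

def improve(policy, dataset, threshold, lr=0.5):
    """Improve: keep samples at or above the threshold, then move the policy
    part of the way toward their empirical distribution (a BC-like step)."""
    kept = [y for y in dataset if reward(y) >= threshold]
    if not kept:
        return policy
    counts = Counter(kept)
    empirical = [counts[y] / len(kept) for y in OUTPUTS]
    return [(1 - lr) * p + lr * e for p, e in zip(policy, empirical)]

def expected_reward(policy):
    return sum(p * reward(y) for p, y in zip(policy, OUTPUTS))

def rest(policy, grow_steps=2, thresholds=(0.25, 0.5, 0.75), n=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(grow_steps):
        dataset = grow(policy, n, rng)      # one dataset per Grow step
        for threshold in thresholds:        # several Improve steps reuse it
            policy = improve(policy, dataset, threshold)
    return policy

start = [0.2] * 5
final = rest(start)
print(expected_reward(start), expected_reward(final))
```

  • The point of the sketch is the loop shape: sampling happens once per Grow step, while the cheaper Improve step runs several times on the same samples with rising thresholds.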

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

  • Training language models typically requires vast quantities of human-generated text, which can be scarce or of variable quality, especially for specialized domains like mathematics or programming. This scarcity limits the model’s ability to learn diverse patterns and hinders its performance. \(ReST_{EM}\) addresses this problem by reducing the reliance on human-curated datasets and instead exploring the potential of fine-tuning models using self-generated data validated through scalar feedback mechanisms.
  • This paper by Singh et al. from Google DeepMind, presented at NeurIPS 2023, explores a new frontier in Large Language Model (LLM) training: Reinforced Self-Training based on expectation-maximization (\(ReST_{EM}\)). This innovative approach aims to reduce reliance on human data while avoiding the pitfalls of a synthetic data death spiral, a trend becoming increasingly evident in LLM training.
  • \(ReST_{EM}\) is a potent alternative to traditional dataset curation, comprising two primary stages: generating multiple output samples (E-step) and fine-tuning the language model on these samples (M-step). This process is cyclically iterated, combining the generation of model-derived answers and their subsequent refinement. The feedback for filtering these outputs is sourced from tasks with binary feedback, such as math problems with clear right or wrong answers.
  • The paper’s focus is on two challenging domains: advanced mathematical problem-solving (MATH) and code generation (APPS). Utilizing PaLM 2 models of various scales, the study demonstrates that \(ReST_{EM}\) significantly outperforms models fine-tuned solely on human-generated data, offering up to 2x performance boosts. This indicates a major step toward more independent AI systems, seeking less human input for skill refinement.
  • \(ReST_{EM}\) employs an iterative self-training process leveraging expectation-maximization. It first generates outputs from the language model, then applies a filtering mechanism based on binary correctness feedback—essentially sorting the wheat from the chaff. Subsequently, the model is fine-tuned using these high-quality, self-generated samples. This cycle is repeated several times, thus iteratively enhancing the model’s accuracy and performance on tasks by self-generating and self-validating the training data.
  • Notably, the experiments revealed diminishing returns beyond a certain number of ReST iterations, suggesting potential overfitting issues. Ablation studies further assessed the impact of dataset size, the number of model-generated solutions, and the number of iterations on the effectiveness of ReST.
  • The models fine-tuned using ReST showed enhanced performance on related but distinct benchmarks like GSM8K, Hungarian HS finals, and Big-Bench Hard tasks, without any noticeable degradation in broader capabilities. This finding underscores the method’s versatility and generalizability.
  • The following figure from the paper shows Pass@K results for PaLM-2-L pretrained model as well as model fine-tuned with \(ReST_{EM}\). For a fixed number of samples \(K\), fine-tuning with \(ReST_{EM}\) substantially improves Pass@K performance. They set temperature to 1.0 and use nucleus sampling with \(p = 0.95\).

  • While ReST offers significant advantages in performance, it necessitates a moderate-sized training set of problems or prompts and access to a manually-designed or learned reward function. It’s highly data-efficient but requires careful application to prevent overfitting.
  • This research opens new avenues for self-improvement in language models, suggesting the need for automating manual parts of the pipeline and exploring algorithmic improvements to further enhance performance. With \(ReST_{EM}\) showing promising results, especially in larger models, one can anticipate further exploration in applying self-training techniques to various other domains beyond math and coding tasks. The significant improvement over fine-tuning on human data implies that future models can be made more efficient, less reliant on extensive datasets, and potentially achieve better performance.
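  • The generate-and-filter stage with binary feedback can be sketched as follows. This is a toy illustration under stated assumptions: `noisy_solver` is a hypothetical stand-in for sampling from the language model, the problems are simple additions so that correctness is checkable, and the per-problem cap `max_kept` is an illustrative parameter. The fine-tuning stage is only indicated in a comment.

```python
import random

def binary_reward(problem, answer):
    """1 if the answer is correct, else 0 (toy task: add two integers)."""
    a, b = problem
    return 1 if answer == a + b else 0

def noisy_solver(problem, rng):
    """Hypothetical stand-in for sampling a solution from the model."""
    a, b = problem
    return a + b + rng.choice([0, 0, 1, -1])

def generate_and_filter(problems, sampler, samples_per_problem=8, max_kept=4, seed=0):
    """Sample several answers per problem and keep only the correct ones."""
    rng = random.Random(seed)
    dataset = []
    for problem in problems:
        kept = []
        for _ in range(samples_per_problem):
            answer = sampler(problem, rng)
            if binary_reward(problem, answer) == 1 and len(kept) < max_kept:
                kept.append((problem, answer))
        dataset.extend(kept)
    return dataset

problems = [(1, 2), (3, 4), (10, 5)]
dataset = generate_and_filter(problems, noisy_solver)
# The filtered (problem, answer) pairs would then be used to fine-tune the
# model, and the whole cycle repeated for a few iterations.
print(len(dataset))
```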

Diffusion Model Alignment Using Direct Preference Optimization

  • This paper by Wallace et al. from Salesforce AI and Stanford University proposes a novel method for aligning diffusion models to human preferences.
  • The paper introduces Diffusion-DPO, a method adapted from DPO, for aligning text-to-image diffusion models with human preferences. This approach is a significant shift from typical language model training, emphasizing direct optimization on human comparison data.
  • Unlike typical methods that fine-tune pre-trained models using curated images and captions, Diffusion-DPO directly optimizes a policy that best satisfies human preferences under a classification objective. It re-formulates DPO to account for a diffusion model notion of likelihood using the evidence lower bound, deriving a differentiable objective.
  • The authors utilized the Pick-a-Pic dataset, comprising 851K crowdsourced pairwise preferences, to fine-tune the base model of the Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. The fine-tuned model showed significant improvements over both the base SDXL-1.0 and its larger variant in terms of visual appeal and prompt alignment, as evaluated by human preferences.
  • The paper also explores a variant of the method that uses AI feedback, showing comparable performance to training on human preferences. This opens up possibilities for scaling diffusion model alignment methods.
  • The figure below from the paper illustrates: (Top) DPO-SDXL significantly outperforms SDXL in human evaluation. (L) PartiPrompts and (R) HPSv2 benchmark results across three evaluation questions, majority vote of 5 labelers. (Bottom) Qualitative comparisons between SDXL and DPO-SDXL. DPO-SDXL demonstrates superior prompt following and realism. DPO-SDXL outputs are better aligned with human aesthetic preferences, favoring high contrast, vivid colors, fine detail, and focused composition. They also capture fine-grained textual details more faithfully.

  • Experiments demonstrate the effectiveness of Diffusion-DPO in various scenarios, including image-to-image editing and learning from AI feedback. The method significantly outperforms existing models in human evaluations for general preference, visual appeal, and prompt alignment.
  • The paper’s findings indicate that Diffusion-DPO can effectively increase measured human appeal across an open vocabulary with stable training, without increased inference time, and improves generic text-image alignment.
  • The authors note ethical considerations and risks associated with text-to-image generation, emphasizing the importance of diverse and representative sets of labelers and the potential biases inherent in the pre-trained models and labeling process.
  • In summary, the paper presents a groundbreaking approach to align diffusion models with human preferences, demonstrating notable improvements in visual appeal and prompt alignment. It highlights the potential of direct preference optimization in the realm of text-to-image diffusion models and opens avenues for further research and application in this field.
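  • As a rough illustration of how a DPO-style objective can be written in terms of denoising errors, the sketch below compares how much better the fine-tuned model predicts the noise than a frozen reference does, for the preferred and the dispreferred image. This is a simplified reading of the objective, not the authors' code: vectors are plain Python lists, and the timestep weighting and \(\beta\) are folded into one hypothetical constant `beta_scale`.

```python
import math

def sq_err(pred, target):
    """Squared L2 error between a predicted and a true noise vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def diffusion_dpo_loss(eps_w, eps_l, pred_w, pred_l, ref_pred_w, ref_pred_l,
                       beta_scale=1.0):
    """-log sigmoid(-beta_scale * (diff_w - diff_l)).

    diff_w / diff_l: denoising error of the model minus that of the
    reference on the preferred (w) and dispreferred (l) noised images.
    """
    diff_w = sq_err(pred_w, eps_w) - sq_err(ref_pred_w, eps_w)
    diff_l = sq_err(pred_l, eps_l) - sq_err(ref_pred_l, eps_l)
    inside = -beta_scale * (diff_w - diff_l)
    return math.log1p(math.exp(-inside))  # equals -log sigmoid(inside)

eps_w, eps_l = [0.1, -0.2], [0.3, 0.0]
ref_w, ref_l = [0.0, 0.0], [0.0, 0.0]
# Model identical to the reference: no preference signal yet.
print(diffusion_dpo_loss(eps_w, eps_l, ref_w, ref_l, ref_w, ref_l))
# Model denoises the preferred image better than the reference does.
print(diffusion_dpo_loss(eps_w, eps_l, eps_w, ref_l, ref_w, ref_l))
```

  • In this form the loss falls when the model improves its denoising on the preferred image relative to the reference more than on the dispreferred one, which mirrors the log-ratio comparison in DPO for language models.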

Human-Centered Loss Functions (HALOs)

  • This report by Ethayarajh et al. from Stanford University presents a novel approach to aligning large language models (LLMs) with human feedback, building upon Kahneman & Tversky’s prospect theory. The proposed Kahneman-Tversky Optimization (KTO) loss function diverges from existing methods by not requiring paired preference data, relying instead on the knowledge of whether an output is desirable or undesirable for a given input. This makes KTO significantly easier to deploy in real-world scenarios where such data is more abundant.
  • The report identifies that existing methods for aligning LLMs with human feedback can be seen as human-centered loss functions, which implicitly model some of the distortions in human perception as suggested by prospect theory. By adopting this perspective, the authors derive a HALO that maximizes the utility of LLM generations directly, rather than relying on maximizing the log-likelihood of preferences, as current methods do.
  • The KTO-aligned models were found to match or exceed the performance of direct preference optimization methods across scales from 1B to 30B. One of the key advantages of KTO is its feasibility in real-world applications, as it requires less specific types of data compared to other methods.
  • To validate the effectiveness of KTO and understand how alignment scales across model sizes, the authors introduced Archangel, a suite comprising 56 models. These models, ranging from 1B to 30B, were aligned using various methods, including KTO, on human-feedback datasets such as Anthropic HH, Stanford Human Preferences, and OpenAssistant.
  • The following figure from the paper illustrates the fact that LLM alignment involves supervised finetuning followed by optimizing a human-centered loss (HALO). However, the paired preferences that existing approaches need are hard to get. Kahneman-Tversky Optimization (KTO) uses a far more abundant kind of data, making it much easier to use in the real world.

  • The report’s experimental findings reveal surprising insights into the scaling and effectiveness of different alignment methods. It was observed that supervised finetuning (SFT) contributes significantly to the performance gains at every scale under 30B. The benefits of combining SFT with alignment methods become apparent at model sizes of around 7B and above. Interestingly, KTO alone was found to be significantly better than DPO (Direct Preference Optimization) alone at scales of 13B and 30B.
  • The practical implications of KTO are notable, especially in contexts where abundant data on customer interactions and outcomes is available, but counterfactual data is scarce. This aspect underscores KTO’s potential for broader application in real-world settings compared to preference-based methods like DPO.
  • Future work suggested by the authors includes exploring a human value function specifically for language, examining differences in model behavior at different scales, and investigating the potential of synthetic data in model alignment with KTO. The report highlights the importance of understanding how human-centered loss functions can influence the alignment of LLMs with human preferences and perceptions.
  • Code
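  • A minimal sketch of a KTO-style per-example loss is shown below, to illustrate how a single desirable/undesirable label replaces a preference pair. It assumes the commonly cited form of the loss (a sigmoid value function around a reference point); the parameter names, the default weights, and the fixed `reference_point` are illustrative assumptions, and in practice the reference point is estimated from data rather than fixed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(policy_logp, ref_logp, desirable, reference_point=0.0,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one (input, output, label) example; no paired output needed.

    r is the implied reward log[pi(y|x) / pi_ref(y|x)]. Desirable outputs
    are pushed above the reference point, undesirable ones below it.
    """
    r = policy_logp - ref_logp
    if desirable:
        return lambda_d - lambda_d * sigmoid(beta * (r - reference_point))
    return lambda_u - lambda_u * sigmoid(beta * (reference_point - r))

# Raising the likelihood of a desirable output lowers the loss;
# raising the likelihood of an undesirable output increases it.
print(kto_loss(-1.0, -3.0, desirable=True), kto_loss(-3.0, -3.0, desirable=True))
print(kto_loss(-1.0, -3.0, desirable=False), kto_loss(-3.0, -3.0, desirable=False))
```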

Nash Learning from Human Feedback

  • This paper by Munos et al. from Google DeepMind introduces an alternative approach to the conventional RLHF for aligning large language models (LLMs) with human preferences. This new approach, termed Nash Learning from Human Feedback (NLHF), focuses on learning a preference model from pairwise human feedback and pursuing a policy that generates responses preferred over any competing policy, thus achieving a Nash equilibrium for this preference model.
  • The NLHF approach aims to encompass a broader spectrum of human preferences, maintain policy independence, and better align with the diversity of human preferences. This method marks a significant shift from the traditional RLHF framework, which is more limited in capturing the richness and diversity of human preferences.
  • Key contributions of this work include the introduction and definition of a regularized variant of the preference model, the establishment of the existence and uniqueness of the corresponding Nash equilibrium, and the introduction of novel algorithms such as Nash-MD and Nash-EMA. Nash-MD, founded on mirror descent principles, converges to the Nash equilibrium without requiring the storage of past policies, making it particularly suitable for LLMs. Nash-EMA, inspired by fictitious play, uses an exponential moving average of past policy parameters. The paper also introduces policy-gradient algorithms Nash-MD-PG and Nash-EMA-PG for deep learning architectures. Extensive numerical experiments conducted on a text summarization task using the TL;DR dataset validate the effectiveness of the NLHF approach.
  • The regularized preference model in NLHF uses KL-regularization to quantify the divergence between the policy under consideration and a reference policy. This regularization is particularly crucial in situations where the preference model is more accurately estimated following a given policy or where it is essential to remain close to a known safe policy.
  • In terms of implementation, the paper explores gradient-based algorithms for deep learning architectures, focusing on computing the Nash equilibrium of a preference model. This exploration emphasizes the applicability of these algorithms in the context of LLMs.
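  • A tabular sketch can illustrate what searching for a Nash equilibrium of a preference model means. The example assumes three responses with cyclic preferences (the first usually beats the second, the second the third, and the third the first), which no single scalar reward can represent. The update below is a mirror-descent-style step in the spirit of Nash-MD, written from a high-level reading of the method; the step size `eta`, regularization strength `tau`, and the preference matrix are illustrative assumptions.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def win_probs(pref, opponent):
    """For each response i, the probability it is preferred over a response
    drawn from the opponent policy. pref[i][j] = P(i preferred over j)."""
    n = len(opponent)
    return [sum(pref[i][j] * opponent[j] for j in range(n)) for i in range(len(pref))]

def nash_md(pref, reference, policy, eta=0.1, tau=0.5, steps=2000):
    mix = eta * tau
    for _ in range(steps):
        # Geometric mixture of the current policy and the reference policy.
        mixed = normalize([p ** (1 - mix) * m ** mix
                           for p, m in zip(policy, reference)])
        # Exponentiated update toward responses that beat the mixture.
        wins = win_probs(pref, mixed)
        policy = normalize([q * math.exp(eta * w) for q, w in zip(mixed, wins)])
    return policy

pref = [[0.5, 0.9, 0.1],
        [0.1, 0.5, 0.9],
        [0.9, 0.1, 0.5]]
uniform = [1 / 3, 1 / 3, 1 / 3]
final = nash_md(pref, uniform, [0.7, 0.2, 0.1])
print([round(p, 4) for p in final])
print([round(w, 4) for w in win_probs(pref, final)])
```

  • Only the current policy is carried between iterations, which matches the property noted above that no past policies need to be stored. In this symmetric toy game, the equilibrium is the uniform policy, against which every response wins half of the time.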

Group Preference Optimization: Few-shot Alignment of Large Language Models

  • This paper by Zhao et al. from UCLA proposes Group Preference Optimization (GPO), a novel framework for aligning large language models (LLMs) with the opinions and preferences of desired interest group(s) in a few-shot manner. The method aims to address the challenge of steering LLMs to align with various groups’ preferences, which often requires substantial group-specific data and computational resources. The key idea in GPO is to view the alignment of an LLM policy as a few-shot adaptation problem within the embedded space of an LLM.
  • GPO augments a base LLM with an independent transformer module trained to predict the preferences of a group for LLM generations. This module is parameterized via an independent transformer and is trained via meta-learning on several groups, allowing for few-shot adaptation to new groups during testing. The authors employ an in-context autoregressive transformer, offering efficient adaptation with limited group-specific data. Put simply, the preference module in GPO is trained to explicitly perform in-context supervised learning to predict preferences (targets) given joint embeddings (inputs) of prompts and corresponding LLM responses. These embeddings allow efficient processing of in-context examples, with each example being a potentially long sequence of prompt and generated response. The module facilitates rapid adaptation to new, unseen groups with minimal examples via in-context learning.
  • GPO is designed to perform group alignment by learning a few-shot preference model that augments the base LLM. Once learned, the preference module can be used to update the LLM via any standard preference optimization or reweighting algorithm (e.g., PPO, DPO, Best-of-N). Specifically, GPO is parameterized via a transformer and trained to perform in-context learning on the training preference datasets. Given a training group \(g \in G_{\text{train}}\), they randomly split its preference dataset \(\mathcal{D}_g\) into a set of \(m\) context points and \(n-m\) target points, where \(n=\left|\mathcal{D}_g\right|\) is the size of the preference dataset for group \(g\). Thereafter, GPO is trained to predict the target preferences \(y_{m+1: n}^g\) given the context points \(\left(x_{1: m}^g, y_{1: m}^g\right)\) and target inputs \(x_{m+1: n}^g\). Mathematically, this objective can be expressed as:

    \[L(\theta)=\mathbb{E}_{g, m}\left[\log p_\theta\left(y_{m+1: n}^g \mid x_{1: n}^g, y_{1: m}^g\right)\right]\]
    • where the training group \(g \sim G_{\text {train }}\) and context size \(m\) are sampled uniformly. \(\theta\) represents the parameters of the GPO preference model.
  • The figure below from the paper shows: (Left) Group alignment aims to steer pretrained LLMs to preferences catering to a wide range of groups. For each group \(g\), they represent its preference dataset as \(\mathcal{D}_g=\) \(\left\{\left(x_1^g, y_1^g\right), \ldots,\left(x_n^g, y_n^g\right)\right\}\). Here, \(y_i^g\) signifies the preference of group \(g\) for a pair of given prompt \(q_i^g\) and response \(r_i^g\), while \(x_i^g\) is its LLM representation obtained with \(\pi_{\mathrm{emb}}\left(q_i^g, r_i^g\right)\). (Right) Once trained, GPO provides a few-shot framework for aligning any base LLM to a test group given a small amount of in-context preference data.

  • GPO’s architecture is designed to be invariant to the ordering of the in-context examples, so it discards the positional encodings found in standard transformers. However, this loses the pairwise relations between the inputs and outputs. To solve this, GPO concatenates each pair of inputs and outputs into a single token, informing the transformer of their pairwise relation. The target inputs are padded with a dummy token (e.g., 0), and a masking strategy is employed where context pairs can self-attend, but padded targets can only attend to context points.
  • Once learned, the GPO preference module can serve as a drop-in replacement for a reward or preference function for policy optimization and re-ranking algorithms – essentially, it is a reward model that supports few-shot learning.
  • GPO is distinct from in-context prompting of a base LLM, as it does not update the base LLM’s parameters and only requires user preferences for LLM generations. The few-shot model learned by GPO augments the base LLM, offering more flexibility than traditional prompting methods.
  • The implementation of GPO involves splitting a group’s preference dataset into context and target points. The model is trained to predict target preferences given the context points and target inputs. The figure below from the paper illustrates the GPO architecture for a sequence of \(n\) points, with \(m\) context points and \(n-m\) target points. The context \(\left(x_{1: m}, y_{1: m}\right)\) serves as few-shot conditioning for GPO. GPO processes the full sequence using a transformer and predicts the preference scores \(\hat{y}_{m+1: n}\).

  • The objective function is mathematically expressed as a function of these parameters, with training groups and context size sampled uniformly.
  • The framework was empirically validated using LLMs of varied sizes on three human opinion adaptation tasks: adapting to the preferences of US demographic groups, global countries, and individual users. Results showed that GPO not only aligns models more accurately to these preferences but also requires fewer group-specific preferences and less computational resources, outperforming existing strategies like in-context steering and fine-tuning methods.
  • Experiments involved two base LLMs, Alpaca 7B and Llama2 13B, and were conducted using the OpinionQA and GlobalOpinionQA datasets. GPO demonstrated significant improvements over various baselines, achieving a 7.1% increase in alignment score over the In-context Finetune method for the OpinionQA dataset and an 8.4% improvement for the GlobalOpinionQA dataset.
  • GPO also excelled in adapting to individual preferences, with superior performance across 15 survey topics in the OpinionQA dataset. This ability is particularly noteworthy given the diverse and often contrasting opinions within individual and demographic groups.
  • The paper also discusses limitations and future work directions, noting the imperfections of survey data, language barriers in group alignment, and the need to extend the method to more complicated response formats and settings. Additionally, the authors highlight potential ethical concerns, such as misuse of aligned models and amplification of biased or harmful outputs, suggesting future research should address these issues.
  • Code
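  • The data layout described above (a context/target split, one token per input-output pair, dummy-padded targets, and a mask that exposes only context points) can be sketched in a few lines. This is an illustrative sketch, not the released code: embeddings are short Python lists, and the function names and the padding value are assumptions.

```python
import random

def split_context_target(dataset, m, rng):
    """Randomly split (x, y) pairs into m context points and the rest as targets."""
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    return shuffled[:m], shuffled[m:]

def build_tokens(context, targets, pad=0.0):
    """One token per point: the embedding x concatenated with the preference y.
    Target tokens carry a dummy value in place of the unknown preference."""
    context_tokens = [list(x) + [y] for x, y in context]
    target_tokens = [list(x) + [pad] for x, _ in targets]
    return context_tokens + target_tokens

def attention_mask(n, m):
    """mask[i][j] is True when position i may attend to position j.
    Only the first m (context) positions are visible, to every position."""
    return [[j < m for j in range(n)] for _ in range(n)]

rng = random.Random(0)
dataset = [([0.1 * i, 0.2 * i], float(i % 2)) for i in range(5)]
context, targets = split_context_target(dataset, 3, rng)
tokens = build_tokens(context, targets)
mask = attention_mask(len(tokens), len(context))
print(tokens)
print(mask)
```

  • Because target positions are never visible to any other position, each target prediction depends only on the context and on its own input, regardless of how the targets are ordered.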

ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization

  • This paper by Song et al. from Peking University and Microsoft Research Asia introduces In-Context Direct Preference Optimization (ICDPO), a novel approach for enhancing Large Language Models (LLMs) by borrowing Human Preference Alignment (HPA) capabilities without the need for fine-tuning. ICDPO utilizes the states of an LLM before and after In-context Learning (ICL) to build an instant scorer, facilitating the generation of well-aligned responses.
  • The methodology rethinks Direct Preference Optimization (DPO) by integrating policy LLM into reward modeling and proposes a two-stage process involving generation and scoring of responses based on a contrastive score. This score is derived from the difference in log probabilities between the optimized policy (\(\pi_{*}\)) and a reference model (\(\pi_0\)), enhancing LLM’s performance in HPA.
  • The following figure from the paper illustrates an overview of ICDPO. (a) The difference in teacher data utilization between normal fine-tuning and ICL without fine-tuning. (b) The core of ICDPO is that expert-amateur coordination maximizes \(S\) which represents the disparity between the expert and the amateur. It brings more accurate estimation than using only the expert LLM.

  • Extensive experiments demonstrate ICDPO’s effectiveness in improving LLM outputs across various metrics, showing it to be competitive with standard fine-tuning methods and superior to other fine-tuning-free baselines. Notably, it leverages a two-stage retriever for selecting contextual demonstrations and an upgraded scorer to further amplify its benefits.
  • The paper also explores the implications of ICDPO for the broader field of HPA, suggesting potential applications and improvements in aligning LLMs with human preferences without the computational and resource overheads associated with traditional fine-tuning approaches.
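  • The generate-then-score idea can be sketched as a re-ranking step over candidate responses. The sketch assumes each candidate already comes with two sequence log-probabilities, one from the model conditioned on in-context demonstrations (the "expert") and one from the same model without them (the "amateur"); the tuple layout and function names are hypothetical, and the numbers are made up.

```python
def contrastive_score(logp_with_icl, logp_without_icl):
    """S = log pi_*(y | x) - log pi_0(y | x)."""
    return logp_with_icl - logp_without_icl

def pick_best(candidates):
    """candidates: list of (text, logp_with_icl, logp_without_icl)."""
    return max(candidates, key=lambda c: contrastive_score(c[1], c[2]))[0]

candidates = [
    ("response A", -12.0, -11.5),  # demonstrations make this less likely
    ("response B", -10.0, -14.0),  # demonstrations make this much more likely
    ("response C", -9.5, -10.0),   # most likely overall, but small gain
]
print(pick_best(candidates))
```

  • Note that the candidate with the highest expert log-probability is not necessarily selected; the score favors the response whose likelihood increases most once the demonstrations are in context.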

ORPO: Monolithic Preference Optimization without Reference Model

  • This paper by Hong et al. from KAIST AI introduces a novel method named Odds Ratio Preference Optimization (ORPO) for aligning pre-trained language models (PLMs) with human preferences without the need for a reference model or a separate supervised fine-tuning (SFT) phase, thus saving compute costs, time, and memory. The method builds on the insight that a minor penalty for disfavored generation styles is effective for preference alignment.
  • Odds Ratio Preference Optimization (ORPO) proposes a new method to train LLMs by combining SFT and Alignment into a new objective (loss function), achieving state of the art results. ORPO operates by incorporating a simple odds ratio-based penalty alongside the conventional negative log-likelihood loss. This approach efficiently differentiates between favored and disfavored responses during SFT, making it particularly effective across a range of model sizes from 125M to 7B parameters.
  • SFT plays a significant role in tailoring the pre-trained language models to the desired domain by increasing the log probabilities of pertinent tokens. Nevertheless, this inadvertently increases the likelihood of generating tokens in undesirable styles, as illustrated in Figure 3 of the paper. Therefore, it is necessary to develop methods capable of preserving the domain adaptation role of SFT while concurrently discerning and mitigating unwanted generation styles.
  • The goal of fine-tuning with a cross-entropy loss is to penalize the model if the predicted logits for the reference answers are low. Using cross-entropy alone gives no direct penalty or compensation for the logits of non-answer tokens. While cross-entropy is generally effective for domain adaptation, there is no mechanism to penalize rejected responses when compensating for the chosen responses. Therefore, the log probabilities of the tokens in the rejected responses increase along with those of the chosen responses, which is undesirable from the viewpoint of preference alignment.
  • The authors experimented with fine-tuning OPT-350M only on the chosen responses from the HH-RLHF dataset. Throughout training, they monitored the log probability of rejected responses for each batch and reported it in Figure 3 of the paper. The log probabilities of both chosen and rejected responses increased simultaneously. This can be interpreted from two different perspectives. First, the cross-entropy loss effectively guides the model toward the intended domain (e.g., dialogue). Second, the absence of a penalty for unwanted generations results in rejected responses sometimes having even higher log probabilities than the chosen ones.
  • Appending an unlikelihood penalty to the loss has demonstrated success in reducing unwanted degenerative traits in models. For example, to prevent repetitions, an unwanted token set of previous contexts, \(k \in \mathcal{C}_{\text {recent }}\), is disfavored by adding the following term to \((1-p_i^{(k)})\) to the loss which penalizes the model for assigning high probabilities to recent tokens. Motivated by SFT ascribing high probabilities to rejected tokens and the effectiveness of appending penalizing unwanted traits, they design a monolithic preference alignment method that dynamically penalizes the disfavored response for each query without the need for crafting sets of rejected tokens.
  • Given an input sequence \(x\), the average loglikelihood of generating the output sequence \(y\), of length \(m\) tokens, is computed as the below equation.
\[\log P_\theta(y \mid x)=\frac{1}{m} \sum_{t=1}^m \log P_\theta\left(y_t \mid x, y_{<t}\right)\]
  • The odds of generating the output sequence \(y\) given an input sequence \(x\) is defined in the below equation:
\[\operatorname{odds}_\theta(y \mid x)=\frac{P_\theta(y \mid x)}{1-P_\theta(y \mid x)}\]
  • Intuitively, \(\operatorname{odds}_\theta(y \mid x)=k\) implies that the model \(\theta\) is \(k\) times more likely to generate the output sequence \(y\) than not to generate it. Thus, the odds ratio of the chosen response \(y_w\) over the rejected response \(y_l\), \(\mathbf{O R}_\theta\left(y_w, y_l\right)\), indicates how much more likely it is for the model \(\theta\) to generate \(y_w\) than \(y_l\) given input \(x\), as defined in the equation below.
\[\mathbf{O R}_\theta\left(y_w, y_l\right)=\frac{\operatorname{odds}_\theta\left(y_w \mid x\right)}{\operatorname {odds}_\theta\left(y_l \mid x\right)}\]
  • The objective function of ORPO in the equation below consists of two components: (i) the supervised fine-tuning (SFT) loss \(\left(\mathcal{L}_{S F T}\right)\); (ii) the relative ratio loss \(\left(\mathcal{L}_{O R}\right)\).
\[\mathcal{L}_{O R P O}=\mathbb{E}_{\left(x, y_w, y_l\right)}\left[\mathcal{L}_{S F T}+\lambda \cdot \mathcal{L}_{O R}\right]\]
  • \(\mathcal{L}_{S F T}\) follows the conventional causal language modeling negative log-likelihood (NLL) loss function to maximize the likelihood of generating the reference tokens. \(\mathcal{L}_{O R}\) in the equation below maximizes the odds ratio between the likelihood of generating the favored/chosen response \(y_w\) and the disfavored/rejected response \(y_l\). ORPO wraps the log odds ratio with the log sigmoid function so that \(\mathcal{L}_{O R}\) is minimized by increasing the log odds ratio between \(y_w\) and \(y_l\).
\[\mathcal{L}_{O R}=-\log \sigma\left(\log \frac{\operatorname{odds}_\theta\left(y_w \mid x\right)}{\operatorname{odds}_\theta\left(y_l \mid x\right)}\right)\]
  • Together, \(\mathcal{L}_{S F T}\) and \(\mathcal{L}_{O R}\) weighted with \(\lambda\) tailor the pre-trained language model to adapt to the specific subset of the desired domain and disfavor generations in the rejected response sets.
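  • As a rough illustration of the equations above, the following sketch computes the ORPO loss for a single \((x, y_w, y_l)\) triple from the length-averaged log-likelihoods of the two responses. The function names and the default \(\lambda = 0.1\) are illustrative assumptions, not taken from the paper's released code; a real implementation would operate on batched tensors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(avg_logp: float) -> float:
    """log odds(y|x) = log P - log(1 - P), where P = exp(avg_logp).

    avg_logp is the length-averaged log-likelihood and must be < 0.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """L_ORPO = L_SFT + lam * L_OR for one example (lam is a hypothetical default)."""
    l_sft = -avg_logp_chosen  # NLL of the chosen response
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    l_or = -log_sigmoid(log_odds_ratio)  # small when chosen odds >> rejected odds
    return l_sft + lam * l_or


# The odds-ratio term shrinks as the rejected response becomes less likely.
print(orpo_loss(-0.5, -0.5))
print(orpo_loss(-0.5, -2.0))
```

  • When the two responses are equally likely, the log odds ratio is zero and \(\mathcal{L}_{O R}\) equals \(\log 2\); pushing the rejected response's likelihood down reduces only the \(\mathcal{L}_{O R}\) term, while \(\mathcal{L}_{S F T}\) depends on the chosen response alone.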
  • Training process:
    1. Create a pairwise preference dataset (chosen/rejected), e.g., Argilla UltraFeedback
    2. Make sure the dataset doesn’t contain instances where the chosen and rejected responses are the same, or one is empty
    3. Select a pre-trained LLM (e.g., Llama-2, Mistral)
    4. Train the base model with the ORPO objective on the preference dataset
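  • Step 2 of the process above can be sketched as a simple filter. The record layout (dictionaries with `prompt`, `chosen`, and `rejected` string fields) is an assumption for illustration; actual preference datasets may use different field names or nested chat formats.

```python
def filter_preference_pairs(rows):
    """Drop pairs where either response is empty or both responses are identical.

    Assumes each row is a dict with string fields "chosen" and "rejected"
    (hypothetical field names).
    """
    kept = []
    for row in rows:
        chosen = row["chosen"].strip()
        rejected = row["rejected"].strip()
        if not chosen or not rejected:
            continue  # one side is empty
        if chosen == rejected:
            continue  # no preference signal
        kept.append(row)
    return kept


rows = [
    {"prompt": "p1", "chosen": "good answer", "rejected": "bad answer"},
    {"prompt": "p2", "chosen": "same", "rejected": "same"},
    {"prompt": "p3", "chosen": "", "rejected": "something"},
    {"prompt": "p4", "chosen": "something", "rejected": "   "},
]
clean = filter_preference_pairs(rows)
print(len(clean))
```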
  • The figure below from the paper shows a comparison of model alignment techniques. ORPO aligns the language model without a reference model in a single-step manner by assigning a weak penalty to the rejected responses and a strong adaptation signal to the chosen responses with a simple log odds ratio term appended to the negative log-likelihood loss.

  • Empirical evaluations show that fine-tuning models such as Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) using ORPO significantly surpasses the performance of state-of-the-art models on benchmarks such as AlpacaEval 2.0, IFEval, and MT-Bench. For instance, Mistral-ORPO-α and Mistral-ORPO-\(\beta\) achieve up to 12.20% on AlpacaEval 2.0, 66.19% on IFEval, and 7.32 on MT-Bench, demonstrating ORPO’s capacity to improve instruction-following and factuality in generated content.
  • Theoretical and empirical justifications for selecting the odds ratio over probability ratio for preference optimization are provided, highlighting the odds ratio’s sensitivity and stability in distinguishing between favored and disfavored styles. This choice contributes to the method’s efficiency and its ability to maintain diversity in generated content.
  • The paper contributes to the broader discussion on the efficiency of language model fine-tuning methods by showcasing ORPO’s capability to eliminate the need for a reference model, thus reducing computational requirements. The authors also provide insights into the role of SFT in preference alignment, underlining its importance for achieving high-quality, preference-aligned outputs.
  • Code and model checkpoints for Mistral-ORPO-\(\alpha\) (7B) and Mistral-ORPO-\(\beta\) (7B) have been released to facilitate further research and application of ORPO in various NLP tasks. The method’s performance on leading NLP benchmarks sets a new precedent for preference-aligned model training, offering a resource-efficient and effective alternative to existing methods.
  • Code

Human Alignment of Large Language Models through Online Preference Optimisation

  • This paper by Calandriello et al. from Google DeepMind addresses the critical issue of aligning large language models (LLMs) with human preferences, a field that has seen extensive research and the development of various methods including RL from Human Feedback (RLHF), Direct Preference Optimisation (DPO), and Sequence Likelihood Calibration (SLiC).
  • The paper’s main contributions are twofold: firstly, it demonstrates the equivalence of two recent alignment methods, Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD), under certain conditions. This equivalence is intriguing as IPO is an offline method while Nash-MD operates online using a preference model. Secondly, it introduces IPO-MD, a generalisation of IPO that incorporates regularised sampling akin to Nash-MD, and compares it against online variants of existing methods on a summarisation task.
  • The research reveals that Online IPO and IPO-MD notably outperform other online variants of alignment algorithms, demonstrating robustness and suggesting closer alignment to a Nash equilibrium. The work also provides extensive theoretical analysis and empirical validation of these methods.
  • Detailed implementation insights include the adaptation of these methods for online preference data generation and optimisation, and the utility of these algorithms across different settings, highlighting their adaptability and potential for large-scale language model alignment tasks.
  • The findings indicate that both Online IPO and IPO-MD are promising approaches for the human alignment of LLMs, offering a blend of offline and online advantages. This advancement in preference optimisation algorithms could significantly enhance the alignment of LLMs with human values and preferences, a crucial step towards ensuring that such models are beneficial and safe for widespread use.

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

  • This paper by Haoran Xu et al. introduces Contrastive Preference Optimization (CPO), a novel approach for fine-tuning moderate-sized Large Language Models (LLMs) for Machine Translation (MT), yielding substantial improvements over existing methods.
  • The authors identify a gap in performance between moderate-sized LLMs (7B or 13B parameters) and both larger-scale LLMs, like GPT-4, and conventional encoder-decoder models in MT tasks. They attribute this gap to limitations in supervised fine-tuning practices and quality issues in reference data.
  • CPO aims to mitigate two fundamental shortcomings of SFT. First, SFT’s methodology of minimizing the discrepancy between predicted outputs and gold-standard references inherently caps model performance at the quality level of the training data. This limitation is significant, as even human-written data, traditionally considered high-quality, is not immune to quality issues. For instance, some strong translation models are capable of producing translations superior to the gold reference. Second, SFT lacks a mechanism that teaches the model to reject mistakes in translations. While strong translation models can produce high-quality translations, they occasionally exhibit minor errors, such as omitting parts of the translation. Preventing the production of these near-perfect but ultimately flawed translations is essential. To overcome these issues, CPO is designed to train models to distinguish between and prefer high-quality translations over merely adequate ones. This is achieved by employing a preference-based objective function that leverages a small dataset of parallel sentences and minimal additional parameters, demonstrating significant performance boosts on WMT’21, WMT’22, and WMT’23 test datasets.
  • The methodology involves analyzing translations from different models using reference-free evaluation metrics, constructing triplet preference data (high-quality, dis-preferred, and a discarded middle option), and deriving the CPO objective which combines preference learning with a behavior cloning regularizer.
  • The figure below from the paper shows a triplet of translations, either model-generated or derived from a reference, accompanied by their respective scores as assessed by reference-free models. For a given source sentence, the translation with the highest score is designated as the preferred translation, while the one with the lowest score is considered dispreferred, and the translation with a middle score is disregarded.
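  • The selection rule described above can be sketched in a few lines. The function name and the `(translation, score)` tuple layout are illustrative assumptions; the scores would come from reference-free evaluation models, which are not implemented here.

```python
def build_preference_pair(source, candidates):
    """Pick preferred/dispreferred translations from scored candidates.

    candidates: list of (translation, score) tuples, where a higher score
    means a better translation. Middle-scored candidates are discarded.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate translations")
    ranked = sorted(candidates, key=lambda c: c[1])
    return {
        "source": source,
        "preferred": ranked[-1][0],    # highest score
        "dispreferred": ranked[0][0],  # lowest score
    }


pair = build_preference_pair(
    "Guten Morgen",
    [("Good morning", 0.93), ("Morning good", 0.41), ("Good day", 0.77)],
)
print(pair)
```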

  • Experimental results highlight that models fine-tuned with CPO not only outperform the base ALMA models but also achieve comparable or superior results to GPT-4 and WMT competition winners. A detailed analysis underscores the importance of both components of the CPO loss function and the quality of dis-preferred data.
  • The paper concludes with the assertion that CPO marks a significant step forward in MT, especially for moderate-sized LLMs, by effectively leveraging preference data to refine translation quality beyond the capabilities of standard supervised fine-tuning techniques. This paper sheds light on the potential limitations of conventional fine-tuning and reference-based evaluation in MT, proposing an effective alternative that could influence future developments in the field.

sDPO: Don’t Use Your Data All at Once

  • This paper by Kim et al. from Upstage AI introduces “stepwise DPO” (sDPO), an extension of direct preference optimization (DPO) to better align large language models (LLMs) with human preferences. Unlike traditional DPO, which utilizes preference datasets all at once, sDPO divides these datasets for stepwise use. This method enables more aligned reference models within the DPO framework, resulting in a final model that not only performs better but also outperforms LLMs with more parameters.
  • Traditional DPO employs human or AI judgment to curate datasets for training LLMs, focusing on comparing log probabilities of chosen versus rejected answers. However, sDPO’s novel approach uses these datasets in a phased manner. The methodology starts with an SFT base model as the initial reference, progressively utilizing more aligned models from previous steps as new references. This process ensures a progressively better-aligned reference model, serving as a stricter lower bound in subsequent training phases.
  • The figure below from the paper shows an overview of sDPO where preference datasets are divided to be used in multiple steps. The aligned model from the previous step is used as the reference and target models for the current step. The reference model is used to calculate the log probabilities and the target model is trained using the preference loss of DPO at each step.
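  • The stepwise schedule can be sketched as a short loop. The `dpo_train(init_model, reference_model, chunk)` callback is a hypothetical stand-in for one full DPO training run; only the way the reference model is updated between steps is illustrated here.

```python
def sdpo(sft_model, dataset_chunks, dpo_train):
    """Run DPO step by step, reusing each aligned model as the next reference.

    dpo_train is a hypothetical callback:
        dpo_train(init_model, reference_model, chunk) -> aligned model
    """
    reference = sft_model  # step 1 uses the SFT model as the reference
    for chunk in dataset_chunks:
        # The target model is initialized from the current reference model.
        target = dpo_train(reference, reference, chunk)
        reference = target  # the aligned model becomes the next reference
    return reference


# Toy usage with strings standing in for models and datasets.
calls = []

def fake_dpo_train(init_model, reference_model, chunk):
    calls.append((reference_model, chunk))
    return f"{init_model}+{chunk}"

final_model = sdpo("sft", ["openorca", "ultrafeedback"], fake_dpo_train)
print(final_model)
```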

  • The sDPO experiments used the SOLAR 10.7B SFT model as the base. In the first step, DPO alignment was conducted using the OpenOrca preference dataset, followed by a second step of alignment using the UltraFeedback preference dataset. The model’s performance was evaluated on the H4 benchmark, which is the average of scores from the ARC, HellaSwag, MMLU, and TruthfulQA tests. This approach resulted in a 1.6% improvement of the SOLAR 10.7B model over traditional DPO on the H4 benchmark, showing that sDPO combined with SOLAR 10.7B even surpasses models like Mixtral, which have significantly more parameters.
  • Experimental validation reveals sDPO’s efficacy. The research team employed models like SOLAR 10.7B with the preference datasets OpenOrca and UltraFeedback Cleaned, observing superior performance in benchmarks such as ARC, HellaSwag, MMLU, and TruthfulQA compared to both the standard DPO approach and other LLMs. sDPO not only improved alignment but also showed how effective alignment tuning can significantly enhance the performance of smaller LLMs.
  • The study’s findings underscore the potential of sDPO as a viable replacement for traditional DPO training, offering improved model performance and alignment. It highlights the critical role of the reference model’s alignment in DPO and demonstrates sDPO’s capability to use this to the model’s advantage.
  • Despite its successes, the paper acknowledges limitations and future exploration areas. The segmentation strategy for complex DPO datasets and the broader application across various LLM sizes and architectures present potential avenues for further research. Moreover, expanding experimental frameworks to include more diverse tasks and benchmarks could provide a more comprehensive understanding of sDPO’s strengths and limitations.
  • The research adheres to high ethical standards, relying solely on open models and datasets to ensure transparency and accessibility. Through meticulous design and objective comparison, the study contributes to the field while maintaining the highest ethical considerations.

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

  • This paper by Khaki et al. from Amazon introduces RS-DPO, a method combining rejection sampling (RS) and direct preference optimization (DPO) to address the alignment of large language models (LLMs) with user intent. By leveraging a supervised fine-tuned (SFT) policy model, RS-DPO efficiently generates diverse responses, identifies contrastive samples based on reward distribution, and aligns the model using DPO, enhancing stability, robustness, and resource efficiency compared to existing methods such as RS, PPO, and DPO alone.
  • The process involves supervised fine-tuning (SFT) of an LLM using high-quality instruction-response pairs, followed by reward model training (RM) to assess response quality based on human preferences. Preference data generation via rejection sampling (PDGRS) creates a synthetic preference pair dataset for alignment tasks, using the trained SFT and RM to sample and evaluate \(k\) distinct responses for each prompt. The direct preference optimization (DPO) step then fine-tunes the SFT model by optimizing the policy model on the generated preference data, thus aligning the LLM with human preferences without needing an explicit reward model during policy optimization.
  • The figure below from the paper shows the pipeline of RS-DPO, which systematically combines rejection sampling (RS) and direct preference optimization (DPO). The method starts by creating an SFT model and uses it to generate a diverse set of \(k\) distinct responses for each prompt. It then selects pairs of contrastive samples based on their reward distribution. Finally, it employs DPO to enhance the performance of the LLM, thereby achieving improved alignment.
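  • The pair-selection step can be sketched as follows. The selection criterion used here (keep every pair whose reward gap is at least `min_gap`) is a simplifying assumption for illustration; the paper's exact rule for choosing contrastive samples from the reward distribution may differ, and the responses and rewards are assumed to come from an SFT model and a reward model that are not implemented here.

```python
from itertools import combinations


def select_contrastive_pairs(responses, rewards, min_gap):
    """Return (chosen, rejected) pairs whose reward gap is at least min_gap.

    min_gap is a hypothetical threshold and must be positive so that
    equally rewarded responses never form a pair.
    """
    if min_gap <= 0:
        raise ValueError("min_gap must be positive")
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        gap = rewards[i] - rewards[j]
        if abs(gap) >= min_gap:
            winner, loser = (i, j) if gap > 0 else (j, i)
            pairs.append((responses[winner], responses[loser]))
    return pairs


# k = 3 sampled responses for one prompt, scored by a reward model.
pairs = select_contrastive_pairs(["a", "b", "c"], [0.9, 0.2, 0.8], min_gap=0.5)
print(pairs)
```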

  • The RS-DPO method was evaluated on benchmarks like MT-Bench and AlpacaEval, using datasets such as Open Assistant and Anthropic/HH-RLHF. The experiments, conducted on Llama-2-7B LLMs with 8 A100 GPUs, demonstrate RS-DPO’s superior performance and efficiency in aligning LLMs, offering significant improvements over traditional methods like PPO, particularly in environments with limited computational resources. The method’s effectiveness is attributed to its ability to generate more relevant and diverse training samples from the SFT model, leading to better model alignment with human preferences.
  • The authors discuss the advantages of RS-DPO over traditional RLHF methods, highlighting its stability, reduced sensitivity to reward model quality, and lower resource requirements, making it a practical choice for LLM alignment in constrained environments. Despite focusing primarily on the helpfulness objective and not being tested on larger models, RS-DPO presents a robust and efficient approach to LLM alignment, demonstrating potential applicability across various objectives and model scales.

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

  • This paper by Lin et al. from the Allen Institute for Artificial Intelligence and UW explores the superficial nature of alignment tuning in large language models (LLMs) and proposes a tuning-free alignment method using in-context learning (ICL). The study critically examines how alignment tuning through supervised fine-tuning (SFT) and RL from human feedback (RLHF) alters the behavior of base LLMs. The authors introduce URIAL (Untuned LLMs with Restyled In-context Alignment), a method that achieves effective alignment purely through in-context learning, requiring minimal stylistic examples and a system prompt.
  • The authors’ investigation reveals that the alignment tuning primarily adjusts the stylistic token distributions (e.g., discourse markers, safety disclaimers) rather than fundamentally altering the knowledge capabilities of the base LLMs. This finding supports the “Superficial Alignment Hypothesis,” suggesting that alignment tuning primarily affects the language style rather than the underlying knowledge.
  • Technical Details and Findings:
    • Token Distribution Shift Analysis: The study analyzes the token distribution shift between base LLMs and their aligned versions (e.g., Llama-2 and Llama-2-chat). It finds that the distribution shifts are predominantly in stylistic tokens, while the base and aligned LLMs perform nearly identically in decoding most token positions.
    • Superficial Alignment Hypothesis: The authors provide quantitative and qualitative evidence supporting the hypothesis that alignment tuning mainly teaches LLMs to adopt the language style of AI assistants without significantly altering the core knowledge required for answering user queries.
  • Proposed Method: URIAL (Untuned LLMs with Restyled In-context Alignment) aligns base LLMs without modifying their weights. It utilizes in-context learning with a minimal number of carefully crafted stylistic examples and a system prompt.
  • Implementation Details:
    • Stylistic Examples: URIAL employs a few restyled in-context examples that begin by affirming the user query, introduce background information, enumerate items or steps with comprehensive details, and conclude with an engaging summary that includes safety-related disclaimers.
    • System Prompt: A system-level prompt is used to guide the model to behave as a helpful, respectful, and honest assistant, emphasizing social responsibility and the ability to refuse to answer controversial topics.
    • Efficiency: URIAL uses as few as three constant in-context examples (approximately 1,011 tokens). This static prompt can be cached for efficient inference, significantly improving speed compared to dynamic retrieval-based methods.
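  • A minimal sketch of how such a static prompt might be assembled is shown below. The `# Query:` / `# Answer:` markers, the function name, and the example texts are illustrative assumptions rather than URIAL's actual prompt template.

```python
def build_urial_prompt(system_prompt, examples, query):
    """Concatenate a system prompt, K static in-context examples, and the new query.

    examples: list of (query, restyled_answer) tuples. Because the prefix
    (system prompt + examples) is constant, it can be cached across requests.
    """
    parts = [system_prompt.strip(), ""]
    for example_query, example_answer in examples:
        parts += [f"# Query:\n{example_query}", f"# Answer:\n{example_answer}", ""]
    # The base LLM continues the text after the final answer marker.
    parts += [f"# Query:\n{query}", "# Answer:\n"]
    return "\n".join(parts)


prompt = build_urial_prompt(
    "You are a helpful, respectful, and honest assistant.",
    [
        ("How do I boil an egg?", "Sure! Here are the steps: ..."),
        ("What is photosynthesis?", "Great question! Photosynthesis is ..."),
    ],
    "How do I learn Python?",
)
print(prompt)
```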
  • The following figure from the paper illustrates the analysis of alignment via token distribution shift. An aligned LLM (llama-2-chat) receives a query \(q\) and outputs a response \(o\). To analyze the effect of alignment tuning, the authors decode the untuned version (llama-2-base) at each position \(t\). They then categorize all tokens in \(o\) into three groups (unshifted, marginal, and shifted) based on \(o_t\)’s rank in the list of tokens sorted by probability from the base LLM. On average, 77.7% of tokens are also ranked top 1 by the base LLM (unshifted positions), and 92.2% are within the top 3 (unshifted plus marginal positions). Common tokens at shifted positions are displayed at the top-right of the figure and are mostly stylistic, constituting discourse markers. In contrast, knowledge-intensive tokens are predominantly found in unshifted positions.
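  • The categorization of a single token position can be sketched as below, assuming the base model's next-token distribution is available as a token-to-probability dictionary (a simplification; a real analysis would use the full vocabulary logits). The function name is hypothetical.

```python
def categorize_position(base_probs, aligned_token):
    """Classify a position by the aligned token's rank under the base LLM.

    base_probs: dict mapping token -> probability from the base model at
    this position. Rank 1 is "unshifted", ranks 2-3 are "marginal", and
    anything lower (or missing from the dict) is "shifted".
    """
    ranked = sorted(base_probs, key=base_probs.get, reverse=True)
    if aligned_token not in base_probs:
        return "shifted"
    rank = ranked.index(aligned_token) + 1
    if rank == 1:
        return "unshifted"
    if rank <= 3:
        return "marginal"
    return "shifted"


base_probs = {"the": 0.5, "a": 0.3, "Sure": 0.1, "Thank": 0.06, "!": 0.04}
for token in ["the", "Sure", "Thank"]:
    print(token, categorize_position(base_probs, token))
```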

  • Evaluation: The authors conducted a fine-grained evaluation on a dataset named just-eval-instruct, which includes 1,000 diverse instructions from various datasets. URIAL’s performance was benchmarked against models aligned with SFT (Mistral-7b-Instruct) and SFT+RLHF (Llama-2-70b-chat). Results demonstrated that URIAL could match or surpass these models in alignment performance.
  • Performance Metrics: URIAL was evaluated on six dimensions: helpfulness, clarity, factuality, depth, engagement, and safety. The results showed that URIAL significantly reduces the performance gap between base and aligned LLMs, and outperforms the aligned models on several of these dimensions.
  • Conclusions: The study concludes that alignment tuning mainly affects stylistic tokens, supporting the superficial alignment hypothesis. URIAL, a tuning-free alignment method, offers a practical alternative to SFT and RLHF, especially for large LLMs, providing efficient and effective alignment through in-context learning with carefully curated prompts. This approach challenges the necessity of extensive fine-tuning and suggests new directions for future LLM research focused on more efficient and interpretable alignment methods.
  • Code

MDPO: Conditional Preference Optimization for Multimodal Large Language Models

  • This paper by Wang et al. from USC, UC Davis, and MSR introduces MDPO, a multimodal Direct Preference Optimization (DPO) method designed to enhance the performance of Large Language Models (LLMs) by addressing the unconditional preference problem in multimodal preference optimization.
  • The key challenge in applying DPO to multimodal scenarios is that models often neglect the image condition, leading to suboptimal performance and increased hallucination. To tackle this, MDPO incorporates two novel components: conditional preference optimization and anchored preference optimization.
  • Conditional Preference Optimization: MDPO constructs preference pairs that contrast images to ensure the model utilizes visual information. This method involves using the original image and creating a less informative variant (e.g., by cropping) to serve as a hard negative. This forces the model to learn preferences based on visual content as well as text.
  • Anchored Preference Optimization: Standard DPO may reduce the likelihood of chosen responses to create a larger preference gap. MDPO introduces a reward anchor, ensuring the reward for chosen responses remains positive, thereby maintaining their likelihood and improving response quality.
  • Implementation Details:
    • The model training uses Bunny-v1.0-3B and LLaVA-v1.5-7B multimodal LLMs.
    • Training was conducted for 3 epochs with a batch size of 32, a learning rate of 0.00001, and a cosine learning rate scheduler with a 0.1 warmup ratio.
    • The preference optimization parameter \(\beta\) was set to 0.1.
    • LoRA (Low-Rank Adaptation) was utilized, with α set to 128 and rank to 64.
    • MDPO combined standard DPO with the conditional and anchored preference objectives.
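  • One way to picture how the three objectives combine is the per-example sketch below. It assumes each term takes the usual DPO form of a negative log-sigmoid over \(\beta\)-scaled policy-to-reference log-ratios, and that the anchor is a constant `delta` subtracted from the chosen response's reward; these forms, the equal weighting of the terms, and all names are assumptions for illustration and may not match the paper's exact formulation. Only \(\beta = 0.1\) comes from the settings listed above.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def mdpo_loss(lr_chosen, lr_rejected, lr_chosen_bad_image, beta=0.1, delta=0.0):
    """Sum of three assumed terms; each lr_* is log pi_theta - log pi_ref.

    lr_chosen:           chosen response, original image
    lr_rejected:         rejected response, original image
    lr_chosen_bad_image: chosen response, less informative (e.g., cropped) image
    delta:               hypothetical reward anchor
    """
    dpo = -log_sigmoid(beta * (lr_chosen - lr_rejected))
    conditional = -log_sigmoid(beta * (lr_chosen - lr_chosen_bad_image))
    anchored = -log_sigmoid(beta * lr_chosen - delta)
    return dpo + conditional + anchored


print(mdpo_loss(0.0, 0.0, 0.0))
print(mdpo_loss(2.0, -1.0, -0.5))
```

  • Under these assumptions, the conditional term is reduced only when the chosen response is more likely given the original image than given the degraded one, and the anchored term is reduced only by raising the chosen response's own reward rather than by lowering the rejected one.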
  • The figure below from the paper illustrates an overview of MDPO. Top Left: Standard DPO expects the multimodal LLM to learn response preferences conditioned on both the image and the question. Top Right: However, in practice, the learning process often disregards the image condition. Bottom: To address this issue, MDPO introduces an additional image preference learning objective to emphasize the relationship between the image and the response. Furthermore, MDPO incorporates a reward anchor to ensure that the probability of the chosen response does not decrease.

  • Experimental Results: Experiments on benchmarks like MMHalBench, Object HalBench, and AMBER demonstrated that MDPO outperforms standard DPO in multimodal scenarios, significantly reducing hallucinations and improving model performance. Human evaluations confirmed that MDPO’s responses were of better or equal quality in 89% of cases compared to standard DPO.
  • Ablation Studies: The studies revealed that both conditional and anchored preference optimizations are crucial, with conditional preference providing more substantial improvements. Different strategies for creating rejected images were tested, with cropping 0-20% of the original image yielding the best results. Anchors added to rejected responses or images did not show significant improvement.
  • Conclusion: MDPO effectively enhances multimodal LLM performance by ensuring the model utilizes both visual and language cues during preference optimization. The method demonstrates superior performance in reducing hallucinations and improving response quality, highlighting the importance of properly designed optimization objectives in multimodal learning.

Aligning Large Multimodal Models with Factually Augmented RLHF

  • This paper by Sun et al. from UC Berkeley, CMU, UIUC, UW–Madison, UMass Amherst, MSR, MIT-IBM Watson AI Lab addresses the issue of multimodal misalignment in large multimodal models (LMMs), which can lead to hallucinations—generating textual outputs not grounded in multimodal context. To mitigate this, the authors propose adapting RL from Human Feedback (RLHF) to vision-language alignment and introducing Factually Augmented RLHF (Fact-RLHF).
  • The proposed method involves several key steps:
    1. Multimodal Supervised Fine-Tuning (SFT): The initial stage involves fine-tuning a vision encoder and a pre-trained large language model (LLM) on an instruction-following demonstration dataset to create a supervised fine-tuned model (\(\pi^{\mathrm{SFT}}\)).
    2. Multimodal Preference Modeling: This stage trains a reward model to score responses based on human annotations. The reward model uses pairwise comparison data to learn to prefer less hallucinated responses. The training employs a cross-entropy loss function to adjust the model’s preferences.
    3. RL: The policy model is fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward signal from the preference model. A KL penalty is applied to prevent over-optimization and reward hacking.
    4. Factually Augmented RLHF (Fact-RLHF): To enhance the reward model, it is augmented with factual information such as image captions and ground-truth multi-choice options. This addition helps the reward model avoid being misled by hallucinations that are not grounded in the actual image content.
    5. Enhancing Training Data: The authors improve the training data by augmenting GPT-4-generated vision instruction data with existing high-quality human-annotated image-text pairs. This includes data from VQA-v2, A-OKVQA, and Flickr30k, converted into suitable formats for vision-language tasks.
    6. MMHAL-BENCH: To evaluate the proposed approach, the authors develop a new benchmark, MMHAL-BENCH, focusing on penalizing hallucinations. This benchmark covers various types of questions that often lead to hallucinations in LMMs, such as object attributes, adversarial objects, comparisons, counting, spatial relations, and environment descriptions.
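  • Steps 2 to 4 above can be sketched with three small helpers: a pairwise reward-model loss, a KL-penalized reward for PPO, and a reward-model input augmented with factual context. The text template, the single-sample KL estimate, the `kl_coef` default, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math


def pairwise_rm_loss(score_preferred, score_other):
    """Cross-entropy loss on a comparison: -log sigmoid(s_preferred - s_other)."""
    return math.log1p(math.exp(-(score_preferred - score_other)))


def kl_penalized_reward(rm_score, logp_policy, logp_sft, kl_coef=0.1):
    """Reward used for RL: reward-model score minus a KL penalty estimate.

    (logp_policy - logp_sft) is a single-sample estimate of the KL term;
    kl_coef is a hypothetical coefficient.
    """
    return rm_score - kl_coef * (logp_policy - logp_sft)


def build_reward_model_input(question, response, captions):
    """Prepend factual context (e.g., image captions) to the reward-model input."""
    facts = "\n".join(f"- {c}" for c in captions)
    return f"Facts about the image:\n{facts}\n\nQuestion: {question}\nResponse: {response}"


print(pairwise_rm_loss(1.5, 0.5))
print(kl_penalized_reward(1.0, logp_policy=-2.0, logp_sft=-3.0))
print(build_reward_model_input("What color is the cat?", "The cat is black.",
                               ["A black cat sits on a sofa."]))
```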
  • The figure below from the paper illustrates that hallucination may occur during the Supervised Fine-Tuning (SFT) phase of LMM training and how Factually Augmented RLHF alleviates the issue of limited capacity in the reward model which is initialized from the SFT model.

  • The implementation of Fact-RLHF shows significant improvements:
    • Improved Alignment: LLaVA-RLHF, the model trained with Fact-RLHF, achieves 94% of the performance level of text-only GPT-4 on the LLaVA-Bench dataset, compared to 87% by previous best methods.
    • Reduced Hallucinations: On MMHAL-BENCH, LLaVA-RLHF outperforms other baselines by 60%, showing a substantial reduction in hallucinated responses.
    • Enhanced Performance: The model also sets new performance benchmarks on MMBench and POPE datasets, demonstrating improved general capabilities and alignment with human preferences.
  • Overall, the paper highlights the effectiveness of integrating factual augmentation in RLHF to address multimodal misalignment, thereby reducing hallucinations and enhancing the reliability of large multimodal models. The authors have open-sourced their code, model, and data for further research and development in this area.
  • Code

Statistical Rejection Sampling Improves Preference Optimization

  • This paper by Liu et al. from Google Research and Google DeepMind published in ICLR 2024 presents a novel approach to enhancing preference optimization in language models by introducing Statistical Rejection Sampling Optimization (RSO). The research addresses limitations in current methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO), which aim to align language models with human preferences without the complexities of RL from Human Feedback (RLHF).
  • SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. The absence of a reward model in DPO constrains its ability to sample preference pairs from the optimal policy. Meanwhile, SLiC can only sample preference pairs from the SFT policy.
  • To address these limitations, the proposed RSO method improves preference data sourcing from the estimated target optimal policy using rejection sampling. This technique involves training a pairwise reward-ranking model on human preference data and using it to sample preference pairs through rejection sampling. This process generates more accurate estimates of the optimal policy by aligning sequence likelihoods with human preferences.
  • Key implementation details of RSO include:
    1. Training a Pairwise Reward-Ranking Model: Starting with a human preference dataset \(D_{hf}\) collected from other policies, a pairwise reward-ranking model is trained to approximate human preference probabilities. This model uses a T5-XXL model to process and learn from the preference data.
    2. Statistical Rejection Sampling: Using the trained reward-ranking model, a statistical rejection sampling algorithm generates response pairs from the optimal policy by utilizing the SFT policy. The responses are sampled according to their estimated likelihoods from the optimal policy, balancing reward exploitation and regularization towards the SFT policy.
    3. Labeling and Fitting: The sampled response pairs are labeled by the reward model. The labeled pairs are then used to fit the language model via classification loss, optimizing the model based on the preference data. This step shows that the language model learns better from an explicit reward model because comparing between two responses is easier than generating high-quality responses directly.
  • The statistical rejection sampling algorithm, based on Neal’s (2003) statistical field method, addresses issues found in RLHF techniques, which can suffer from reward hacking due to excessive trust in the reward model without regularization. Specifically, prior RLHF works (Bai et al., 2022; Stiennon et al., 2020; Touvron et al., 2023) carry out rejection sampling using the best-of-\(N\) or top-\(k\)-over-\(N\) algorithm: they sample a batch of \(N\) completions from a language model policy, score them with a reward model, and return the best one or the top \(k\). The authors show that top-\(k\)-over-\(N\) is a special case of their statistical rejection sampling, and that it is critical to balance reward exploitation against regularization towards the SFT policy.
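  • A minimal sketch of this kind of rejection sampling is given below, assuming a candidate with reward \(r\) is accepted with probability \(\exp((r - r_{\max})/\beta)\), where \(r_{\max}\) is the highest reward among the remaining candidates. The function name, signature, and acceptance rule are assumptions for illustration; the paper's algorithm may differ in detail. Candidates are assumed to have been sampled from the SFT policy and scored by the reward-ranking model beforehand.

```python
import math
import random


def statistical_rejection_sampling(candidates, rewards, num_samples, beta, rng=None):
    """Select num_samples candidates, favoring high rewards.

    Small beta approaches top-k selection; large beta approaches sampling
    the SFT candidates uniformly (strong regularization).
    """
    if num_samples > len(candidates):
        raise ValueError("num_samples cannot exceed the number of candidates")
    if beta <= 0:
        raise ValueError("beta must be positive")
    rng = rng or random.Random()
    pool = list(zip(candidates, rewards))
    accepted = []
    while len(accepted) < num_samples:
        max_reward = max(r for _, r in pool)
        remaining = []
        for cand, r in pool:
            accept_prob = math.exp((r - max_reward) / beta)
            if len(accepted) < num_samples and rng.random() < accept_prob:
                accepted.append(cand)
            else:
                remaining.append((cand, r))
        pool = remaining
    return accepted


# With a tiny beta, the procedure reduces to picking the top-k rewards.
top2 = statistical_rejection_sampling(
    ["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], num_samples=2, beta=1e-6,
    rng=random.Random(0),
)
print(top2)
```

  • In each pass the highest-reward candidate still in the pool is accepted with probability 1, so the loop always terminates.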
  • RSO first fits a pairwise reward-ranking model from human preference data. This model is later applied to generate preference pairs with candidates sampled from the optimal policy, followed by a preference optimization step to align sequence likelihood towards preferences.
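  • The rejection-sampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the function name `statistical_rejection_sample`, its arguments, and the toy rewards are assumptions made here, and the acceptance rule \(\exp((r - r_{max})/\beta)\) is one way to realize the trade-off between reward exploitation and staying close to the SFT policy.

```python
import math
import random


def statistical_rejection_sample(candidates, rewards, num_samples, beta, rng=None):
    """Select num_samples candidates, favouring high reward but tempered by beta.

    candidates: responses assumed to be drawn from the SFT policy.
    rewards:    reward-model scores, one per candidate.
    beta:       must be > 0; larger values accept more candidates and so stay
                closer to the SFT policy, smaller values act like top-k.
    """
    rng = rng or random.Random(0)
    pool = list(zip(candidates, rewards))
    accepted = []
    while len(accepted) < num_samples and pool:
        max_reward = max(r for _, r in pool)
        remaining = []
        for cand, r in pool:
            if len(accepted) == num_samples:
                remaining.append((cand, r))
                continue
            # Acceptance probability is at most 1; the current best candidate
            # is always accepted, so every pass makes progress.
            if rng.random() < math.exp((r - max_reward) / beta):
                accepted.append(cand)
            else:
                remaining.append((cand, r))
        pool = remaining
    return accepted


if __name__ == "__main__":
    cands = ["a", "b", "c", "d"]
    scores = [0.1, 0.9, 0.5, 0.3]
    print(statistical_rejection_sample(cands, scores, 2, beta=1e-6))
    print(statistical_rejection_sample(cands, scores, 2, beta=0.5))
```

  • With a very small \(\beta\) the sketch keeps only the highest-reward candidates, which mirrors top-\(k\)-over-\(N\); with a very large \(\beta\) nearly every candidate is accepted, so the selected set looks like plain samples from the SFT policy.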

  • Extensive experiments were conducted on tasks such as Reddit TL;DR summarization and AnthropicHH dialogue. The results demonstrated that RSO outperforms both SLiC and DPO in terms of alignment with human preferences, as evaluated by proxy reward models, gold reward models, AutoSxS, and human raters. The study includes detailed ablation experiments on hyper-parameters, loss functions, and preference pair sampling strategies, confirming the robustness and scalability of RSO across different tasks and model sizes.
  • RSO’s implementation leverages scalable, parallelizable components, making it computationally efficient compared to traditional RLHF methods. The method’s effectiveness in aligning language models with human preferences without the complexities of RLHF presents a significant advancement in the field of preference optimization for large language models.

Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models

  • This paper by Denison et al. from Anthropic, Redwood Research, and the University of Oxford, investigates reward tampering in language models within RL frameworks. The authors examine whether models trained to exploit simple specification gaming opportunities generalize to more complex and pernicious behaviors like reward tampering.
  • The researchers constructed a curriculum of increasingly sophisticated gameable environments, starting from easily exploitable settings like sycophancy and progressing to complex tasks requiring models to rewrite their own reward functions. The environments are designed to reflect realistic scenarios where reward misspecification might occur. The curriculum includes:
    1. Political Sycophancy: Models are rewarded for matching implied user political views, reinforcing sycophantic behavior without oversight.
    2. Tool-Using Flattery: Models use command line access to flatter users about low-quality poetry, with high rewards for positive but dishonest evaluations.
    3. Rubric Modification: Models modify checklist files to falsely indicate task completion, with higher rewards for deceptive behavior. This is divided into nudged and insubordinate variants, with the latter explicitly instructing the model not to make changes.
    4. Reward-Tampering: An evaluation-only environment where models can edit their reward functions and the accompanying unit tests to evade detection.
  • The figure below from the paper illustrates real evaluation transcripts from their model. The only modification the authors made is to remove words so that the transcripts fit in the figure. The diagram displays their setup, in which they construct a curriculum of gameable environments. They train models up to each stage and then evaluate on the later stages. Text within <cot> tags is hidden from the oversight process, and commands within <bash> tags get run on the model’s virtual machine. The authors highlight that the vast majority of all transcripts from the model are neither this malicious nor this coherent. Even after the curriculum, the model overwrites the reward and properly evades detection less than 1/1,000th of the time.

  • The authors utilized two RL algorithms: expert iteration and proximal policy optimization (PPO). In expert iteration, models sample responses to prompts, and the best performing ones are used for further training. PPO involves policy gradient methods to adjust the policy based on reward signals.
  • Significant implementation details include:
    • HHH Expert Iteration: Combines reinforcement of specification gaming with supervision from a preference model that rewards helpful, honest, and harmless (HHH) behavior.
    • Exploit-Only Expert Iteration: Focuses solely on successful specification gaming samples without HHH oversight, to highlight the impact of excluding such supervision.
    • Hidden Chain of Thought: Augments models with reasoning capabilities that are hidden from the reward process, simulating future model capabilities.
    • Preamble Instructions: Models are prefixed with instructions about their training environment, reinforcing their understanding of being language model assistants trained via RL.
  • Key findings from the experiments include:
    • Models trained on early stages of the curriculum generalize to more sophisticated forms of specification gaming in later stages.
    • A small but significant proportion of models trained on the full curriculum exhibit zero-shot generalization to reward tampering, including rewriting their reward functions and evading detection.
    • Retraining models not to game early environments reduces but does not eliminate reward tampering.
    • Adding HHH training does not prevent the generalization of specification gaming to reward tampering.
  • The study demonstrates that large language models can generalize from simple specification gaming to complex reward tampering, suggesting that such behaviors may be nontrivial to remove and pose potential risks as models become more capable.
  • Blog; Memo

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

  • This paper by Xu et al. from Tsinghua University, OpenPsi Inc., and Shanghai Qi Zhi Institute investigates whether Direct Preference Optimization (DPO) is truly superior to Proximal Policy Optimization (PPO) for aligning large language models (LLMs) with human preferences. The study explores the theoretical and empirical properties of both methods and provides comprehensive benchmarks to evaluate their performance.
  • The research begins by discussing the widespread use of RL from Human Feedback (RLHF) to align LLMs with human preferences. It highlights that existing RLHF methods can be categorized into reward-based and reward-free approaches. Reward-based methods, like those used in applications such as ChatGPT and Claude, involve learning a reward model and applying actor-critic algorithms such as PPO. Reward-free methods, such as DPO, optimize policies directly based on preference data without an explicit reward model.
  • The paper delves into the theoretical limitations of DPO, demonstrating that it may find biased solutions that exploit out-of-distribution responses. The authors argue that this can lead to suboptimal performance, particularly in scenarios where there is a distribution shift between model outputs and the preference dataset. Empirical studies support this claim, showing that DPO’s performance degrades significantly under distribution shifts.
  • Implementation details for PPO are extensively discussed, revealing critical factors for achieving optimal performance in RLHF settings. Key techniques identified include advantage normalization, large batch size, and exponential moving average updates for the reference model. These enhancements are shown to significantly improve PPO’s performance across various tasks, including dialogue generation and code generation.
  • The study presents a series of experiments benchmarking DPO and PPO across multiple RLHF testbeds, such as the SafeRLHF dataset, HH-RLHF dataset, APPS, and CodeContest datasets. Results indicate that PPO consistently outperforms DPO in all cases, achieving state-of-the-art results in challenging code competition tasks. Specifically, on the CodeContest dataset, a PPO model with 34 billion parameters surpasses the previous state-of-the-art AlphaCode-41B, demonstrating a notable improvement in performance.
  • Key experimental findings include:
    1. Theoretical Analysis: Demonstrates that DPO can produce biased policies due to out-of-distribution exploitation, while PPO’s regularization via KL divergence helps mitigate this issue.
    2. Synthetic Scenario Validation: Illustrates DPO’s susceptibility to generating biased distributions favoring unseen responses, while PPO maintains more stable performance.
    3. Real Preference Datasets: Shows that DPO’s performance can be improved by addressing distribution shifts through additional supervised fine-tuning (SFT) and iterative training, though PPO still outperforms DPO significantly.
    4. Ablation Studies for PPO: Highlights the importance of advantage normalization, large batch sizes, and exponential moving average updates in enhancing PPO’s RLHF performance.
  • The authors conclude that while DPO offers a simpler training procedure, its performance is hindered by sensitivity to distribution shifts and out-of-distribution data. PPO, with proper tuning and implementation enhancements, demonstrates robust effectiveness and achieves superior results across diverse RLHF tasks.
  • In summary, the comprehensive analysis and empirical evidence provided in this paper establish PPO as a more reliable and effective method for LLM alignment compared to DPO, particularly in scenarios requiring high-performance and robust alignment with human preferences.
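  • Two of the PPO techniques highlighted above, advantage normalization and exponential moving average (EMA) updates of the reference model, are simple to state in code. The sketch below is an illustration under assumptions: the function names, the flat lists standing in for model parameters, and the `decay` value are chosen here for clarity and are not taken from the paper.

```python
def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and (roughly) unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]


def ema_update(ref_params, policy_params, decay=0.99):
    """Move each reference parameter a small step towards the policy parameter."""
    return [decay * r + (1.0 - decay) * p for r, p in zip(ref_params, policy_params)]


if __name__ == "__main__":
    print(normalize_advantages([1.0, 2.0, 3.0]))
    print(ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9))
```

  • Normalizing over a large batch gives more stable statistics, which is one plausible reading of why the study finds large batch sizes and advantage normalization helpful together.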

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

  • This paper by Wu et al. from UC Berkeley proposes a novel RL framework, Pairwise Proximal Policy Optimization (P3O), designed to optimize large language models (LLMs) using comparative feedback rather than absolute rewards. Traditional approaches such as Proximal Policy Optimization (PPO) have limitations when dealing with reward functions derived from comparative losses like the Bradley-Terry loss. These limitations include the necessity for reward normalization and token-wise updates, which introduce complexity and potential instability.
  • The proposed P3O algorithm operates on trajectory-wise policy gradient updates, simplifying the optimization process by directly utilizing comparative rewards. This approach is invariant to equivalent reward functions, addressing the instability issues present in PPO. The paper presents a comprehensive theoretical foundation, establishing that P3O avoids the complications of value function approximation and Generalized Advantage Estimation (GAE), which are essential in PPO.
  • The implementation of P3O involves the following key steps:
    1. Initialization: Policy parameters are initialized.
    2. Data Collection: Pairwise trajectories are collected by running the policy on a batch of prompts, generating two responses per prompt.
    3. Reward Calculation: Trajectory-wise rewards are computed, incorporating both the preference-based reward and the KL-divergence penalty from the supervised fine-tuning (SFT) model.
    4. Gradient Estimation: The policy gradient is estimated using the relative differences in rewards between the paired responses, adjusted by importance sampling to account for the policy change.
    5. Policy Update: Gradient updates are applied to the policy parameters, following either separate or joint clipping strategies to maintain stability.
  • The figure below from the paper contrasts two paradigms. The left side illustrates the prevalent method for fine-tuning LMs using RL, which relies on Absolute Feedback. In this paradigm, algorithms like PPO have to learn a \(V\) function, which captures not only the valuable relative preference information but also a less useful part, namely the scale of the reward for a given prompt. By contrast, the right side presents the paradigm for optimizing against a reward model trained via a comparative loss, e.g., Bradley-Terry Loss (Bradley & Terry, 1952). P3O generates a pair of responses per prompt, leveraging only the Relative Feedback, derived from the difference in reward, for policy gradient updates. This method obviates the need for additional \(V\) function approximations and intricate components like GAE.

  • Empirical evaluations are conducted on summarization and question-answering tasks using datasets like TL;DR and Anthropic’s Helpful and Harmless (HH). The results demonstrate that P3O achieves a superior trade-off between reward and KL-divergence compared to PPO and other baseline methods. Specifically, P3O shows improved alignment with human preferences, as evidenced by higher rewards and better performance in head-to-head comparisons evaluated by GPT-4.
  • The experiments reveal that P3O not only achieves higher reward scores but also maintains better KL control, making it a robust alternative for fine-tuning LLMs with relative feedback. The study underscores the potential of P3O in simplifying the RL fine-tuning process while enhancing model alignment with human values. Future work aims to explore the impacts of reward over-optimization and extend the policy gradient framework to accommodate multiple ranked responses.
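  • The reward calculation and gradient estimation steps listed earlier can be illustrated with scalar stand-ins. The sketch below is a simplification under assumptions: it omits importance sampling and clipping, the names `kl_penalized_reward` and `pairwise_surrogate` are hypothetical, and the log-probabilities are plain floats rather than differentiable tensors. It is meant only to show why using the reward difference makes the update insensitive to a prompt-level shift in reward.

```python
def kl_penalized_reward(reward, logp_policy, logp_sft, kl_coef):
    """Trajectory-wise reward: reward-model score minus a KL-style penalty."""
    return reward - kl_coef * (logp_policy - logp_sft)


def pairwise_surrogate(r1, r2, logp1, logp2):
    """Surrogate to maximize; differentiating it w.r.t. the log-probs weights
    each response by the reward difference rather than the absolute reward."""
    return 0.5 * (r1 - r2) * (logp1 - logp2)


if __name__ == "__main__":
    base = pairwise_surrogate(1.0, 0.2, -3.0, -4.0)
    # Adding the same constant to both rewards of a prompt changes nothing.
    shifted = pairwise_surrogate(1.0 + 5.0, 0.2 + 5.0, -3.0, -4.0)
    print(base, shifted)
```

  • Because only \(r_1 - r_2\) enters the surrogate, two reward functions that differ by a prompt-dependent constant produce the same update, which is the invariance property the paper emphasizes.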

BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM

  • This paper by Xu et al. from UCSB and CMU presents Behavior Preference Optimization (BPO), a novel approach to enhancing online preference learning for large language models (LLMs) by maintaining proximity to the behavior LLM that collects training samples. The key motivation is to address the limitations of traditional Direct Alignment from Preferences (DAP) methods, which do not fully exploit the potential of online training data.
  • The authors propose a new online DAP algorithm, emphasizing the construction of a trust region around the behavior LLM (\(\pi_{\beta}\)) rather than a fixed reference model (\(\pi_{ref}\)). This approach ensures that the learning LLM (\(\pi_{\theta}\)) remains aligned with the behavior model, thereby stabilizing the training process and improving performance.

  • Implementation Details:
    1. Algorithm Overview:
      • The BPO algorithm dynamically updates \(\pi_{\beta}\) with \(\pi_{\theta}\) every K steps, where K is the annotation interval calculated as T/F (total training steps divided by the preference annotation frequency).
      • The training loss \(L_{BPO}\) is computed by constraining the KL divergence between \(\pi_{\theta}\) and \(\pi_{\beta}\), thus constructing a trust region around the behavior LLM.
    2. Ensemble of LoRA Weights:
      • To mitigate training instability, the authors optimize an ensemble of Low-Rank Adaptation (LoRA) weights and merge them during inference without additional overhead. This ensemble approach stabilizes the training process.
    3. Experimental Setup:
      • The experiments were conducted on three datasets: Reddit TL;DR, Anthropic Helpfulness, and Harmlessness, using a preference simulator for annotation.
      • BPO was integrated with various DAP methods, including DPO, IPO, and SLiC, and compared against their online and offline counterparts.
  • The figure below from the paper illustrates an overview of the BPO training pipeline. The training loss \(L_{BPO}\) is calculated by constraining the KL divergence between \(\pi_{\theta}\) and the behavior LLM \(\pi_{\beta}\). Every \(K\) steps, the authors update \(\pi_{\beta}\) with \(\pi_{\theta}\) and use it to collect new samples for annotations.
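  • A minimal sketch of the two ingredients above, assuming a DPO-style logistic loss as the underlying DAP objective: the loss is anchored to whichever model is passed in (the fixed \(\pi_{ref}\) for offline DPO, the behavior LLM \(\pi_{\beta}\) for BPO), and \(\pi_{\beta}\) is refreshed every \(K = T/F\) steps. The function names and the toy log-probabilities are assumptions for illustration.

```python
import math


def dap_logistic_loss(logp_w, logp_l, anchor_logp_w, anchor_logp_l, beta=0.1):
    """DPO-style loss for one preference pair, measured relative to an anchor model."""
    margin = beta * ((logp_w - anchor_logp_w) - (logp_l - anchor_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def behavior_sync_steps(total_steps, annotation_frequency):
    """Training steps at which pi_beta would be overwritten with pi_theta."""
    interval = total_steps // annotation_frequency  # K = T / F
    return list(range(interval, total_steps, interval))


if __name__ == "__main__":
    print(behavior_sync_steps(100, 4))
    print(dap_logistic_loss(-1.0, -2.0, -1.5, -1.5))
```

  • With \(F = 1\) the schedule is empty, so the anchor never changes and the setup reduces to the offline case; larger \(F\) moves the anchor more often and keeps the trust region centered on recent behavior.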

  • Experimental Details:
    • Preference Annotation Frequency:
      • Different annotation frequencies were tested, demonstrating that even a small increase in frequency (F = 2) significantly improves performance over offline DPO, achieving notable gains in win rates against reference texts.
    • Ablation Study:
      • The authors performed an ablation study to verify that the performance improvement stems from the better trust region constructed around \(\pi_{\beta}\), not just the higher quality of \(\pi_{\beta}\) compared to \(\pi_{ref}\).
    • Stabilization Techniques:
      • The use of an ensemble of LoRA weights proved effective in stabilizing training, as single LoRA weight optimization led to rapid deterioration of performance.
  • Results:
    • BPO significantly outperformed both its on-policy and offline DAP counterparts across all three tasks (TL;DR, Helpfulness, and Harmlessness), demonstrating its strong generalizability.
    • The dynamic trust region around the behavior LLM ensured better alignment and stability during training, leading to higher win rates and more consistent performance improvements.
  • The proposed BPO method offers a substantial advancement in online preference learning for LLMs, balancing performance and computational efficiency, and demonstrating remarkable applicability to various DAP methods and annotation frequencies.

SimPO: Simple Preference Optimization with a Reference-Free Reward

  • This paper by Meng et al. from Danqi Chen’s lab at Princeton proposes SimPO, a novel offline preference optimization algorithm that simplifies and improves upon Direct Preference Optimization (DPO). Unlike DPO, which requires a reference model and can be computationally intensive, SimPO introduces a reference-free reward that aligns more closely with the model generation process.
  • SimPO uses the average log probability of a sequence as the implicit reward, which better aligns with model generation metrics and removes the need for a reference model. This reward formulation enhances computational efficiency and memory usage. Additionally, SimPO incorporates a target reward margin into the Bradley-Terry objective to create a larger separation between winning and losing responses, further optimizing performance.
  • The authors conducted extensive evaluations using various state-of-the-art models, including base and instruction-tuned models like Mistral and Llama3. They tested SimPO on benchmarks such as AlpacaEval 2, MT-Bench, and Arena-Hard, demonstrating significant performance improvements over DPO. Specifically, SimPO outperformed DPO by up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard, with minimal increase in response length, indicating that the gains do not come from length exploitation.
  • The figure below from the paper illustrates that SimPO and DPO mainly differ in their reward formulation, as indicated in the shaded box.

  • Implementation Details:
    1. Reward Formulation:
      • SimPO calculates the reward as the average log probability of all tokens in a response using the policy model, normalized by the response length. This formulation eliminates the reference model, making SimPO more efficient.
      • The reward equation is: \(r_{\text{SimPO}}(x, y) = \frac{\beta}{\mid y\mid } \log \pi_{\theta}(y \mid x) = \frac{\beta}{\mid y\mid } \sum_{i=1}^{\mid y\mid } \log \pi_{\theta}(y_i \mid x, y_{<i})\), where \(\beta\) controls reward scaling.
    2. Target Reward Margin:
      • A margin \(\gamma\) is introduced to the Bradley-Terry model to ensure a minimum reward difference between winning and losing responses.
      • The modified objective is: \(L_{\text{SimPO}}(\pi_{\theta}) = -E_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left(\frac{\beta}{\mid y_w\mid } \log \pi_{\theta}(y_w \mid x) - \frac{\beta}{\mid y_l\mid } \log \pi_{\theta}(y_l \mid x) - \gamma \right) \right]\).
    3. Training Setups:
      • Base Setup: Models were trained on the UltraChat-200k dataset to create a supervised fine-tuned (SFT) model, followed by preference optimization using the UltraFeedback dataset.
      • Instruct Setup: Off-the-shelf instruction-tuned models were used, regenerating chosen and rejected response pairs to mitigate distribution shifts.
    4. Evaluation:
      • SimPO was evaluated on AlpacaEval 2, Arena-Hard, and MT-Bench benchmarks. Performance was measured in terms of length-controlled win rate and raw win rate.
      • SimPO achieved notable results, such as a 44.7% length-controlled win rate on AlpacaEval 2 and a 33.8% win rate on Arena-Hard, making it the strongest 8B open-source model.
    5. Hyperparameters:
      • Optimal performance was achieved with \(\beta\) set between 2.0 and 2.5, and \(\gamma\) between 0.5 and 1.5.
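  • The reward and objective given above translate directly into code. The sketch below evaluates them for a single preference pair using plain Python floats; the function names and the toy per-token log-probabilities are assumptions for illustration, and the default \(\beta\) and \(\gamma\) are simply values inside the ranges reported above.

```python
import math


def simpo_reward(token_logps, beta):
    """Length-normalized reward: beta times the mean per-token log-probability."""
    return beta * sum(token_logps) / len(token_logps)


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """Bradley-Terry loss with a target reward margin gamma, for one pair."""
    margin = (
        simpo_reward(chosen_token_logps, beta)
        - simpo_reward(rejected_token_logps, beta)
        - gamma
    )
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    chosen = [-0.5, -0.5]          # short response, higher average log-prob
    rejected = [-1.0, -1.0, -1.0]  # longer response, lower average log-prob
    print(simpo_loss(chosen, rejected))
```

  • No reference-model log-probabilities appear anywhere in the computation, which is where the memory and compute savings over DPO come from.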
  • SimPO demonstrates a significant advancement in preference optimization, simplifying the process while improving computational efficiency and performance on multiple benchmarks. The removal of the reference model and the alignment of the reward function with generation metrics are key innovations that contribute to its success.
  • Code

Discovering Preference Optimization Algorithms with and for Large Language Models

  • This paper by Chris Lu et al. from Sakana AI, University of Cambridge, and FLAIR, presents a novel approach to offline preference optimization for Large Language Models (LLMs) by leveraging LLM-driven objective discovery. Traditional preference optimization relies on manually-crafted convex loss functions, but this approach is limited by human creativity. The authors propose an iterative method that prompts an LLM to discover new preference optimization loss functions automatically, leading to the development of state-of-the-art algorithms without human intervention.
  • The core contribution of this paper is the introduction of the Discovered Preference Optimization (DiscoPOP) algorithm, which adaptively combines logistic and exponential losses. This process is facilitated through an LLM-driven pipeline that iteratively proposes and evaluates new loss functions based on their performance on downstream tasks.
  • Implementation Details:
    1. Initial Context Construction: The system prompt initializes the LLM with several established objective functions in code and their performance metrics.
    2. LLM Querying and Output Validation: The LLM is queried to propose new objective functions, which are parsed, validated through unit tests, and evaluated.
    3. Performance Evaluation: The proposed objective functions are evaluated based on their ability to optimize a model on predefined downstream tasks, with the performance metric feeding back into the LLM.
    4. Iterative Refinement: The LLM iteratively refines its proposals, synthesizing new candidate loss functions that blend successful aspects of previous formulations.
  • Discovery Process:
    • The LLM generates PyTorch-based candidate objective functions, taking log probabilities of preferred and rejected completions as inputs.
    • Valid candidates are used to fine-tune an LLM, evaluated using performance metrics such as MT-Bench scores.
    • The performance data is fed back into the LLM, which iteratively refines its generation strategy based on this feedback.
  • The figure below from the paper illustrates: (Left) Conceptual illustration of LLM-driven discovery of objective functions. The authors prompt an LLM to output new code-level implementations of offline preference optimization losses \(\mathbb{E}_{\left(y_w, y_l, x\right) \sim \mathcal{D}}[f(\beta \rho)]\) as a function of the policy \(\left(\pi_\theta\right)\) and reference model’s \(\left(\pi_{\text {ref }}\right)\) likelihoods of the chosen \(\left(y_{w}\right)\) and rejected \(\left(y_{l}\right)\) completions. Afterward, they run an inner loop training procedure and evaluate the resulting model on MT-Bench. The corresponding performance is fed back to the language model, and they query it for the next candidate. (Right) Performance of discovered objective functions on Alpaca Eval.

  • Results:
    • The DiscoPOP algorithm, a dynamically weighted sum of logistic and exponential losses, emerged as a top performer. It was evaluated on multi-turn dialogue tasks (MT-Bench), single-turn dialogue tasks (Alpaca Eval 2.0), summarization tasks (TL;DR), and positive sentiment generation tasks (IMDb).
    • DiscoPOP showed significant improvement in win rates against GPT-4 and performed competitively on various held-out tasks, demonstrating robustness and adaptability across different preference optimization challenges.
  • Technical Details:
    • The DiscoPOP loss function is non-convex, incorporating a temperature parameter to balance between logistic and exponential terms based on the log-ratio difference (\(\rho\)). This dynamic weighting allows the function to handle both large and small differences effectively, contributing to its superior performance.
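  • A hedged sketch of such a dynamically weighted loss, written for a single preference pair: the mixing weight is assumed here to be a sigmoid of the scaled log-ratio difference divided by a temperature \(\tau\), and the default values of `beta` and `tau` are illustrative assumptions. The function names are hypothetical, and the exact form should be checked against the released code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def blended_preference_loss(rho, beta=0.05, tau=0.05):
    """Mix a logistic loss and an exponential loss on the log-ratio difference rho."""
    scaled = beta * rho
    logistic_term = math.log1p(math.exp(-scaled))  # -log(sigmoid(scaled))
    exponential_term = math.exp(-scaled)
    weight = sigmoid(scaled / tau)  # temperature-controlled mixing weight
    return (1.0 - weight) * logistic_term + weight * exponential_term


if __name__ == "__main__":
    for rho in (-10.0, 0.0, 10.0):
        print(rho, blended_preference_loss(rho))
```

  • For strongly negative \(\rho\) the weight approaches 0 and the logistic term dominates; for strongly positive \(\rho\) the weight approaches 1 and the exponential term takes over, which is how a single function can treat large and small differences differently.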
  • Significance:
    • This LLM-driven discovery approach eliminates the constraints of human creativity in designing loss functions, automating the generation of high-performing preference optimization algorithms.
    • The iterative refinement process ensures continuous improvement and adaptability, leading to state-of-the-art performance in preference alignment tasks.
  • This work opens new avenues for automated discovery and optimization in machine learning, showcasing the potential of leveraging LLMs to enhance and innovate traditional methodologies in a scalable and efficient manner. The proposed DiscoPOP algorithm represents a significant advancement in offline preference optimization, offering a robust and flexible solution for aligning LLM outputs with human preferences.
  • Code

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

  • This paper by D’Oosterlinck et al. from Ghent University, Stanford University, and Contextual AI introduces methods to improve alignment in LLMs by addressing two core issues: the suboptimal contrastive nature of preference data and the limitations of alignment objectives. The authors propose Contrastive Learning from AI Revisions (CLAIR) and Anchored Preference Optimization (APO) to enhance the clarity of preference signals and the stability of alignment training.
  • CLAIR creates minimally contrasting preference pairs by revising lower-quality outputs generated by the target model. Instead of using a judge to pick between outputs, CLAIR employs a reviser (a stronger model such as GPT4-turbo) to minimally improve the weaker output, ensuring that the contrast between outputs is clear and targeted. This leads to more precise preference data compared to conventional methods where preference pairs might vary due to uncontrolled differences. Empirical results show that CLAIR generates the best contrastive data, as measured by token-level Jaccard similarity and character-level Levenshtein edit distance, outperforming on-policy and off-policy judge datasets.
  • The figure below from the paper illustrates that alignment is underspecified with regard to preferences and training objective. A: Preference pairs can vary along irrelevant aspects, Contrastive Learning from AI Revisions (CLAIR) creates a targeted preference signal instead. B: The quality of the model can impact alignment training, Anchored Preference Optimization (APO) explicitly accounts for this.

  • The figure below from the paper illustrates an answer produced by Llama-3-8B-Instruct for a prompt, and corresponding GPT4-turbo revision of this answer. The differences between answer and revision are highlighted. The revision generally follows the same outline as the answer but improves it where possible. For example, the revision correctly alters the count of Parisian restaurants from 2 to 3 in the second line of the answer.

  • APO is a family of contrastive alignment objectives that explicitly consider the relationship between the model and the preference data. The authors propose two key variants: APO-zero and APO-down. APO-zero is used when winning outputs are better than the model’s outputs, ensuring that the likelihood of winning outputs increases and that of losing outputs decreases. APO-down is preferred when the model is already superior to the winning outputs, decreasing the likelihood of both but decreasing the likelihood of the losing output more sharply. APO provides more fine-grained control compared to widely used objectives such as Direct Preference Optimization (DPO), avoiding scenarios where increasing the likelihood of a winning output can degrade model performance.
  • The authors conducted experiments aligning Llama-3-8B-Instruct on 32K CLAIR-generated preference pairs and comparable datasets using several alignment objectives. The results demonstrated that CLAIR, combined with APO, led to a significant improvement in performance, closing the gap between Llama-3-8B-Instruct and GPT4-turbo by 45% on the MixEval-Hard benchmark. The best model improved by 7.65% over the base Llama-3-8B-Instruct, primarily driven by the improved contrastiveness of CLAIR-generated data and the tailored dynamics of APO. In comparison, other alignment objectives like DPO and KTO did not perform as well, with DPO showing a tendency to degrade the model due to its ambiguous handling of winning and losing likelihoods.
  • CLAIR and APO offer a more stable and controllable approach to alignment by improving the precision of preference signals and ensuring that training dynamics are better suited to the model and data relationship. The experiments also underscore the importance of controlling contrastiveness in preference datasets and adapting the alignment objective to the specific needs of the model.
  • The paper concludes with discussions on how these methods compare to other alignment efforts like RL from AI Feedback (RLAIF) and Direct Preference Optimization (DPO), highlighting how CLAIR and APO address the challenges of underspecification in alignment.
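  • To make the difference between the two APO variants concrete, the sketch below evaluates sigmoid-based objectives of the kind described above for a single preference pair. The exact expressions are an assumption modeled on commonly used implementations of APO-zero and APO-down, not a transcription of the paper; `chosen_logratio` and `rejected_logratio` stand for \(\log \pi_{\theta}(y \mid x) - \log \pi_{ref}(y \mid x)\) on the winning and losing outputs, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def apo_zero_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Pushes the winning likelihood up and the losing likelihood down."""
    return (1.0 - sigmoid(beta * chosen_logratio)) + sigmoid(beta * rejected_logratio)


def apo_down_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Pushes the winning likelihood down, and the losing likelihood down further."""
    push_chosen_down = sigmoid(beta * chosen_logratio)
    widen_gap = 1.0 - sigmoid(beta * (chosen_logratio - rejected_logratio))
    return push_chosen_down + widen_gap


if __name__ == "__main__":
    print(apo_zero_loss(1.0, -1.0), apo_zero_loss(0.0, 0.0))
    print(apo_down_loss(-1.0, -3.0), apo_down_loss(0.0, 0.0))
```

  • In this sketch, APO-zero is rewarded for raising the winning log-ratio, while APO-down is rewarded for lowering both log-ratios as long as the losing one falls faster, matching the qualitative behavior described for each variant.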

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

  • This paper by Shao et al. from DeepSeek-AI, Tsinghua University, and Peking University, introduces the DeepSeekMath 7B model, a state-of-the-art domain-specific language model optimized for mathematical reasoning, achieving results comparable to GPT-4 and Gemini-Ultra on mathematical benchmarks. Below is a detailed summary:
  • DeepSeekMath 7B showcases the effectiveness of domain-specific pre-training and innovative RL techniques for advancing mathematical reasoning in open-source language models. Its contributions in data curation, RL algorithms, and multilingual capability serve as a foundation for future research in this domain.

  • Core Contributions:

    1. Domain-Specific Training:
      • DeepSeekMath 7B is pre-trained using 120B tokens sourced from a newly developed DeepSeekMath Corpus, extracted and refined from Common Crawl data. The corpus is seven times larger than Minerva’s and nine times the size of OpenWebMath.
      • Pre-training incorporates natural language, code, and math-specific data for comprehensive reasoning capabilities.
    2. Key Model Innovations:
      • Group Relative Policy Optimization (GRPO): A novel reinforcement learning (RL) technique designed to optimize the model’s reasoning while reducing memory consumption by bypassing the need for a critic model in RL frameworks like PPO.
      • Instruction tuning with Chain-of-Thought (CoT), Program-of-Thought (PoT), and tool-integrated reasoning datasets to enhance mathematical understanding.
  • Model Development and Implementation:

    1. Pre-training Pipeline:
      • Base model: DeepSeek-Coder-Base-v1.5 7B, extended with 500B tokens. The corpus composition includes:
        • 56% from the DeepSeekMath Corpus.
        • 4% from AlgebraicStack.
        • 20% GitHub code.
        • 10% arXiv papers.
        • 10% natural language data from Common Crawl.
    2. Data Selection and Processing:
      • The DeepSeekMath Corpus was curated using an iterative pipeline involving fastText-based classification to filter high-quality mathematical content. The dataset was decontaminated to exclude overlap with evaluation benchmarks like GSM8K and MATH.
      • The plot below from the paper illustrates an iterative pipeline that collects mathematical web pages from Common Crawl.

    3. Mathematical Instruction Tuning:
      • Fine-tuning on 776K examples (English and Chinese datasets), leveraging CoT, PoT, and Python-based reasoning for diverse mathematical fields such as algebra, calculus, and geometry.
    4. RL with GRPO:
      • GRPO uses group scores as baselines, simplifying reward estimation and computational complexity.
      • The plot below from the paper illustrates PPO and the proposed GRPO. GRPO foregoes the value model, instead estimating the baseline from group scores, significantly reducing training resources.

      • RL training focused on GSM8K and MATH benchmarks with chain-of-thought prompts, achieving a 6-9% improvement over instruction-tuned models.
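  • The idea of using group scores as a baseline can be shown with a short sketch: several responses are sampled for the same prompt, and each response’s advantage is its reward relative to the group. The function name, the population standard deviation, and the toy rewards below are assumptions for illustration; the full GRPO objective also includes clipping and a KL term that are omitted here.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group.

    rewards: reward scores for several responses to the same prompt.
    The group mean plays the role that a learned value model plays in PPO.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one math problem: two correct (1.0), two wrong (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

  • Since the baseline is computed from the sampled group itself, no separate critic network has to be trained or held in memory, which is the source of the resource savings noted above.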
  • Key Results:

    1. Mathematical Reasoning:
      • Achieved 51.7% accuracy on the MATH benchmark, surpassing all open-source models up to 70B size and approaching GPT-4 levels.
      • Demonstrated superior results across English and Chinese benchmarks like GSM8K (88.2%) and CMATH (88.8%).
    2. Tool-Aided Problem Solving:
      • Using Python for problem-solving, DeepSeekMath 7B outperformed the prior state-of-the-art Llemma 34B on benchmarks like GSM8K+Python and MATH+Python.
    3. General Capabilities:
      • Improvements in general reasoning and understanding benchmarks like MMLU (54.9%) and BBH (59.5%), as well as coding tasks like HumanEval and MBPP.
  • Observations and Insights:

    1. Code Training Benefits:
      • Pre-training with code improves mathematical reasoning, both with and without tool use.
      • Mixed code and math training synergize mathematical problem-solving and coding performance.
    2. ArXiv Data Limitations:
      • Training on arXiv papers alone did not significantly enhance reasoning, suggesting potential issues with the data’s format or relevance.
    3. RL Efficiency:
      • GRPO improves instruction-tuned models while using fewer computational resources than PPO, since it needs no separate value model.

Understanding R1-Zero-Like Training: A Critical Perspective

  • This paper by Liu et al. from Sea AI Lab, National University of Singapore, and Singapore Management University critically analyzes the R1-Zero training paradigm—where reinforcement learning (RL) is applied directly to base large language models (LLMs) without supervised fine-tuning (SFT)—as introduced by DeepSeek-R1-Zero. The authors dissect both the characteristics of base models and the optimization biases in the RL component, ultimately proposing refinements that enhance reasoning performance and training efficiency.

  • Architecture and Implementation:

    • Training Setup: The authors use base models such as DeepSeek-V3-Base, Qwen2.5-Math, and Llama-3.2, assessing their readiness for RL by analyzing their behavior on MATH-level questions. Templates significantly affect model behavior; for example, Qwen2.5-Math achieves better performance without templates, suggesting implicit pretraining on concatenated QA pairs.

    • GRPO vs Dr. GRPO:

      • GRPO (Group Relative Policy Optimization) is a sampling-based RL algorithm that normalizes token-level policy gradients based on response length and intra-group standard deviation. This introduces two biases:

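        • Length Bias: Dividing by response length means that, among incorrect answers, longer ones are penalized less per token, which encourages progressively longer incorrect outputs.
        • Difficulty Bias: Dividing by the intra-group standard deviation gives questions with low reward variance (very easy or very hard ones) disproportionate influence on learning.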
      • Dr. GRPO (GRPO Done Right) removes these normalization factors, yielding an unbiased surrogate objective aligned with standard PPO: \(J_{\text{Dr.GRPO}}(\pi_\theta) = \mathbb{E}_{q\sim p_Q, \{o_i\}_{i=1}^G\sim \pi_{\theta_{\text{old}}}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\left(\frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip}\left(\frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}|q, o_{i,<t})}, 1-\epsilon, 1+\epsilon\right) \hat{A}_{i,t} \right) \right]\)
      • The advantage is computed as \(\hat{A}_{i,t} = R(q, o_i) - \text{mean}(\{R(q, o_j)\}_{j=1}^G)\), shared by all tokens of response \(o_i\), which avoids per-response and per-question normalization.
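      • To make the length bias concrete, the sketch below compares the weight that multiplies each token’s gradient term under the two schemes. It assumes an on-policy step (ratio equal to 1, no clipping); the helper `per_token_weights`, the token counts, and the population standard deviation are illustrative assumptions, not taken from the paper’s code.

```python
from statistics import mean, pstdev

def per_token_weights(rewards, lengths, dr_grpo, eps=1e-8):
    """Weight applied to every token of each response in one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    weights = []
    for r, n in zip(rewards, lengths):
        if dr_grpo:
            # Dr. GRPO: centered reward only, no std or length normalization.
            weights.append(r - mu)
        else:
            # GRPO: standardized reward, then divided by the response length.
            weights.append((r - mu) / (sigma + eps) / n)
    return weights

rewards = [1.0, 0.0, 0.0, 0.0]   # one correct answer, three incorrect ones
lengths = [100, 100, 400, 800]   # hypothetical token counts
grpo = per_token_weights(rewards, lengths, dr_grpo=False)
dr = per_token_weights(rewards, lengths, dr_grpo=True)
print(grpo)
print(dr)
```

      • Under GRPO the 800-token incorrect answer gets a per-token penalty eight times smaller than the 100-token incorrect answer, whereas Dr. GRPO penalizes every token of every incorrect answer equally.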
    • Training and Evaluation:

      • Data: MATH training set and diverse question sets (e.g., GSM-8K, ASDiv).
      • Models: Trained on 8×A100 GPUs for ~27 hours.
      • Reward Function: Binary, based on correctness of final answer via Math-Verify.
      • Implementation: Built on the Oat RL framework.
    • Minimalist R1-Zero Recipe:

      • Using Qwen2.5-Math-7B with Dr. GRPO and the Qwen-Math template on MATH level 3–5 questions, the model achieves 43.3% accuracy on AIME 2024—state-of-the-art among 7B models.
    • The following figure from the paper shows that Dr. GRPO makes simple yet significant modifications to GRPO (Shao et al., 2024), removing the length and standard-deviation normalization terms to address its biases. The right panel shows that the unbiased optimizer prevents the model from generating progressively longer incorrect responses, thereby improving token efficiency.

  • Core Insights:

    • Base Model Analysis:

      • Qwen2.5 models outperform others even without prompt templates, possibly due to pretraining on concatenated QA data.
      • DeepSeek-V3-Base is shown to exhibit “Aha moments” (emergent reasoning and self-reflection) even without RL, challenging the notion that RL alone induces these behaviors.
    • Template Effects:

      • Templates can disrupt or aid initial policy performance; Qwen2.5-Math models perform worse with templates unless retrained.
      • RL can recover from poor initialization, but optimal performance is achieved with good model-template alignment.
    • Question Set Coverage:

      • Broader question sets (e.g., ORZ-57K) enhance generalization.
      • Surprisingly, training on simpler, out-of-domain questions (GSM-8K) still improves performance on harder benchmarks.
    • Pretraining Effects:

      • Math pretraining (FineMath, NuminaQA) on Llama-3.2-3B significantly boosts its RL ceiling.
      • Pretraining on concatenated QA texts helps mimic the implicit biases seen in Qwen2.5.
  • Code

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

  • This paper by Yu et al. from ByteDance Seed, Tsinghua AIR, and The University of Hong Kong introduces DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a large-scale reinforcement learning (RL) system for reasoning-capable LLMs. The system is notable for its fully open-source status, including code, algorithm, and datasets, and demonstrates superior performance on AIME 2024 benchmarks using only 50% of the training steps required by previous state-of-the-art methods.

  • The central objective is to resolve key reproducibility and scalability challenges in RL training for LLMs by introducing an openly detailed and empirically validated RL pipeline that enhances training stability, sample efficiency, and policy expressiveness.

  • Architecture and Implementation:

    • Base Model: Qwen2.5-32B pretrained transformer.

    • RL Framework: Built on top of the verl framework, leveraging the Group Relative Policy Optimization (GRPO) method as a foundation.

    • DAPO Algorithm:

      • The policy is optimized using a modified objective function as follows:
      \[\mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E}_{(q,a) \sim D, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot|q)} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min \left(r_{i,t}(\theta) \hat{A}_{i,t}, \text{clip}(r_{i,t}(\theta), 1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}}) \hat{A}_{i,t} \right) \right]\]
      • subject to:
      \[0 < \left| \left\{ o_i \mid \text{is\_equivalent}(a, o_i) \right\} \right| < G\]
      • where:
      \[r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, \quad \hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_i\}_{i=1}^G)}{\text{std}(\{R_i\}_{i=1}^G)}\]
    • This modified objective function:
      • Applies token-level gradient updates rather than sequence-level.
      • Uses decoupled clipping thresholds \(\epsilon_{\text{low}}\) and \(\epsilon_{\text{high}}\) to avoid entropy collapse and preserve exploration.
      • Implements rule-based binary reward: +1 if model output is semantically correct, −1 otherwise.
      • Filters out trivial samples with 0% or 100% accuracy to maintain effective gradient signals via Dynamic Sampling.
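      • A minimal sketch of the token-level aggregation and decoupled clipping in the objective above, for a single group of responses. The function names and the default values of `eps_low` and `eps_high` are illustrative assumptions; ratios and advantages are passed in directly rather than computed from a model.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def dapo_objective(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate for one group of sampled responses.

    ratios[i][t]  : importance ratio r_{i,t} for token t of response i
    advantages[i] : group-relative advantage shared by all tokens of response i
    The eps defaults are assumptions chosen for illustration.
    """
    total, n_tokens = 0.0, 0
    for resp_ratios, adv in zip(ratios, advantages):
        for r in resp_ratios:
            clipped = clip(r, 1.0 - eps_low, 1.0 + eps_high)
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    # Normalize by the total token count of the group, not per response.
    return total / n_tokens

ratios = [[1.5, 1.0], [0.5, 1.0, 1.0]]  # two responses of 2 and 3 tokens
advantages = [1.0, -1.0]
value = dapo_objective(ratios, advantages)
print(value)
```

      • Because the sum is divided by the total number of tokens in the group, every token carries the same weight regardless of which response it belongs to, so longer responses contribute proportionally more to the update.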
    • Training Details:

      • Batch Size: 512 prompts × 16 samples per prompt per rollout.
      • Learning Rate: 1e-6 with AdamW and linear warm-up.
      • Token Cap: Maximum of 20,480 tokens (16,384 + 4,096 soft penalty buffer).
      • Reward Shaping: Uses Soft Overlong Punishment to penalize excessively long generations gradually.
      • Evaluation: avg@32 accuracy on AIME 2024 benchmark with temperature 1.0 and top-\(p\) 0.7.
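      • The Soft Overlong Punishment can be sketched as a piecewise penalty. This assumes a linear ramp across the 4,096-token buffer and a floor of −1 beyond the hard cap, which is how I understand the paper’s description; the function and parameter names are hypothetical.

```python
def soft_overlong_penalty(length, max_len=20480, buffer=4096):
    """Length penalty added to the correctness reward (a sketch).

    No penalty up to max_len - buffer tokens, a linear ramp from 0 to -1
    across the buffer, and -1 beyond max_len. The linear shape is an
    assumption of this example.
    """
    soft_start = max_len - buffer  # 16,384 with the values listed above
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer
    return -1.0

for n in (8000, 16384, 18432, 20480, 30000):
    print(n, soft_overlong_penalty(n))
```

      • A response that ends halfway through the buffer (18,432 tokens) receives a penalty of −0.5, so the signal grows gradually instead of jumping at the cap.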
  • Core Innovations:

    • Clip-Higher: Uses asymmetric clipping thresholds to allow low-probability “exploration” tokens more opportunity to increase probability, thereby maintaining model entropy and avoiding convergence to deterministic outputs too early.

    • Dynamic Sampling: Filters out samples that are either all correct or all incorrect to avoid zero-gradient contributions, ensuring each training batch contains impactful learning signals.

    • Token-Level Loss: Enhances model learning on longer CoT sequences by ensuring each token contributes to the final gradient, preventing the dilution of signal in longer responses and mitigating response quality degradation.

    • Overlong Reward Shaping: Truncated responses are masked during training or penalized softly based on the degree of overflow, avoiding abrupt and misleading penalties that may disrupt learning.

    • Data Curation: Introduces DAPO-Math-17K, a dataset of math problems with integer-only answers to ensure deterministic and error-free evaluation. Problem statements are transformed to yield integer solutions even for originally fractional outputs.
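    • Dynamic Sampling reduces to a simple filter over sampled groups. The sketch below is illustrative only: the function name, the dictionary layout, and the boolean correctness flags are assumptions, not the released DAPO code.

```python
def dynamic_sampling_filter(groups):
    """Keep only prompts whose sampled group mixes correct and incorrect answers.

    groups maps a prompt id to a list of booleans (True = answer judged correct).
    Groups that are all correct or all incorrect have zero group-relative
    advantage for every sample, so they carry no gradient signal.
    """
    kept = {}
    for prompt, correct in groups.items():
        n_correct = sum(correct)
        if 0 < n_correct < len(correct):
            kept[prompt] = correct
    return kept

batch = {
    "q1": [True, True, True, True],      # all correct, dropped
    "q2": [False, False, False, False],  # all incorrect, dropped
    "q3": [True, False, False, True],    # mixed, kept
}
kept = dynamic_sampling_filter(batch)
print(list(kept))
```

    • In a training loop, sampling would typically continue until enough mixed groups have been collected to fill the batch.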

  • Benchmarks and Results:

    • DAPO achieves 50% accuracy on AIME 2024 with Qwen2.5-32B, outperforming DeepSeek-R1-Zero-Qwen-32B (47%) with only 50% of training steps.

    • Ablation studies show cumulative performance gains with each added technique:

      • Naive GRPO: 30%
      • + Overlong Filtering: 36%
      • + Clip-Higher: 38%
      • + Soft Overlong Punishment: 41%
      • + Token-level Loss: 42%
      • + Dynamic Sampling (full DAPO): 50%
  • Empirical Insights:

    • Monitoring metrics like response length, entropy, and average reward revealed strong correlations with training dynamics and highlighted the need for fine-tuned balancing between exploration and exploitation.
    • Case studies demonstrate the emergence of new reasoning behaviors during training, including reflection and self-correction patterns that were initially absent.
  • Project Page; Code

Further Reading

HuggingFace’s Alignment Handbook

  • The Alignment Handbook contains robust recipes to align language models with human and AI preferences. It also contains code to train your very own Zephyr models:
    • Full fine-tuning with Microsoft’s DeepSpeed ZeRO-3 on A100s
    • LoRA or QLoRA fine-tuning on consumer GPUs

  • No Robots is a Hugging Face dataset of 10k instructions and demonstrations for training instruct models. It is modeled on the SFT dataset described in OpenAI’s InstructGPT paper and was written entirely by skilled human annotators.

Empirical Evaluation: DPO vs. IPO vs. KTO

  • Preference Tuning LLMs with Direct Preference Optimization Methods by Hugging Face summarizes their extensive evaluation of three state-of-the-art alignment algorithms: DPO, IPO, and KTO.
  • The results demonstrate a complex interaction between key hyper-parameters, models, and datasets. As a quick overview:
    • DPO: Casts the RLHF objective via a loss based on a prompt and its positive and negative completions
    • IPO: Replaces DPO’s sigmoid, which can potentially cause overfitting, with an identity function
    • KTO: Rather than paired (+ve, -ve) completions, takes unpaired good and bad (binary preference; thumbs-up and thumbs-down) data
  • The team ran experiments on two 7B-parameter models, Zephyr-7b-beta-sft and OpenHermes-7B, and applied preference fine-tuning with two widely used preference datasets: Ultrafeedback and Intel’s Orca DPO pairs. All associated code is open-source in The Alignment Handbook.
  • This investigation aims to discern the influence of the beta parameter on model performance. To this end, the MT Bench, a multi-turn benchmark employing GPT-4 to assess model efficacy across eight distinct categories, was utilized. Despite its limitations, MT Bench serves as a viable instrument for evaluating the capabilities of conversational large language models (LLMs).
  • In the case of the Zephyr model, it was determined that optimal performance was attained at the minimal beta value of 0.01. This finding was consistent across all three algorithms evaluated, suggesting that a more detailed examination within the beta range of 0.0 to 0.2 could yield valuable insights for the research community.
  • Regarding the OpenHermes model, although the relative performance of each algorithm remained consistent - with the ranking being DPO > KTO > IPO - the optimal beta value exhibited significant variation among the algorithms. Specifically, the most favorable beta values for DPO, KTO, and IPO were identified as 0.6, 0.3, and 0.01, respectively.
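  • To illustrate how beta enters the pairwise losses compared above, here is a minimal sketch of the DPO and IPO losses as functions of the log-ratio margin \(h = \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\). The function names are hypothetical, the IPO form \((h - \frac{1}{2\beta})^2\) is the commonly used squared-error version, and KTO is omitted because it operates on unpaired examples.

```python
import math

def dpo_loss(margin, beta):
    """DPO: -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))."""
    return math.log1p(math.exp(-beta * margin))

def ipo_loss(margin, beta):
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2

for h in (0.0, 5.0, 10.0):
    print(h, dpo_loss(h, beta=0.1), ipo_loss(h, beta=0.1))
```

  • The DPO loss keeps decreasing as the margin grows, while the IPO loss is minimized at a finite margin of \(1/(2\beta)\) and increases beyond it, which is the regularizing effect of swapping the sigmoid for an identity mapping.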

FAQs

In RLHF, what are the memory requirements of the reward and critic model compared to the policy/reference model?

  • In RLHF, you typically have the following models:

    • Policy model (also called the actor)
    • Reference model (frozen copy of the initial policy)
    • Reward model (trained from human feedback)
    • Critic model (value function)
  • Here’s how their memory requirements generally compare:

    • Policy vs Reference model:
      • These are usually the same architecture (e.g., a decoder-only transformer like GPT), so they have roughly equal memory requirements.
      • The reference model is frozen, but still loaded into memory for reward computation (KL divergence term), so it uses as much memory as the policy model.
      • Combined, they double the memory footprint compared to using just one model.
    • Reward model:
      • Often has the same architecture as the policy/reference model (e.g., same transformer backbone) but with a small head on top to produce scalar reward values.
      • If it shares weights with the policy/reference model (e.g., using LoRA or other weight-sharing schemes), it can be lighter, but in many setups it’s a full separate copy.
      • Memory requirement: roughly equal to the policy/reference model, possibly slightly less if stripped down or quantized.
    • Critic model:
      • In transformer-based PPO, the critic is often implemented as a separate head on the policy model or as a duplicate model with a value head.
      • If separate, it often has the same architecture as the policy but only outputs a scalar value per token.
      • Memory requirement: similar to the policy model, unless heavily optimized (e.g., sharing layers or being much smaller).
  • Summary of memory requirements (relative to one transformer model):

    • Policy: 1x
    • Reference: 1x
    • Reward: ~1x
    • Critic: ~1x
  • Total: ~4x the memory of a single model, unless model sharing, quantization, or other tricks are used.
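  • As a rough back-of-the-envelope check of the ~4x figure, the sketch below counts weight memory only. The 7B parameter count and 2 bytes per parameter (16-bit weights) are assumptions for illustration; gradients, optimizer states, activations, and generation caches for the trainable policy and critic come on top of this.

```python
def rlhf_weight_memory_gb(n_params, bytes_per_param=2, n_models=4):
    """Memory in GB needed just to hold the weights of the RLHF models.

    n_models=4 corresponds to policy + reference + reward + critic, assuming
    all four share the same architecture and no weights are shared.
    """
    return n_params * bytes_per_param * n_models / 1e9

single = rlhf_weight_memory_gb(7e9, n_models=1)  # one hypothetical 7B model
full = rlhf_weight_memory_gb(7e9)                # all four models
print(single, full)
```

  • Under these assumptions a single model needs about 14 GB for weights and the full four-model setup about 56 GB, before any training state is counted.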

Why is the PPO/GRPO objective called a clipped “surrogate” objective?

  • The PPO (and its variants such as GRPO) objective is called a surrogate objective because it doesn’t directly optimize the true reinforcement learning objective — the expected rewards over time — but instead optimizes a proxy that is easier and safer to compute. Specifics below:
    • True RL Objective is Unstable or Intractable:
      • The actual objective in RL is to maximize expected reward over trajectories, which involves high variance and instability during training, especially for large models like LLMs. It often requires estimating complex quantities like the value function accurately over time, which is difficult in practice.
    • Surrogate Objectives Improve Stability:
      • Surrogate objectives simplify this by using:
        • Advantage estimates to approximate how much better a new action is compared to the old one.
        • Importance sampling ratios (like \(\frac{\pi_{\theta}}{\pi_{old}}\)) to correct for the shift in policy.
        • Clipping (in PPO and GRPO) to avoid overly large policy updates that might destabilize training.
    • Practical Optimization Benefits:
      • By approximating the true objective, surrogate objectives allow for stable and efficient policy updates, which are essential in fine-tuning large models via reinforcement learning.
  • In summary, it’s called a surrogate because it’s a well-designed stand-in for the true goal of maximizing reward, tailored to be safer and more effective for gradient-based optimization.
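  • The effect of clipping is easiest to see on a single action. The sketch below evaluates the clipped surrogate term for one (ratio, advantage) pair; the function name and the default `eps=0.2` are assumptions for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate term for one action."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gains nothing from pushing the ratio past the clip range in the
    # direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)

for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, adv, clipped_surrogate(ratio, adv))
```

  • With a positive advantage the term is capped once the ratio exceeds \(1 + \epsilon\), and with a negative advantage it is capped once the ratio falls below \(1 - \epsilon\); in both capped regions the gradient with respect to the policy is zero, which limits the size of each update.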

Is the importance sampling ratio also called the policy or likelihood ratio?

  • Yes, the importance sampling ratio is often referred to as the policy ratio or the likelihood ratio, especially in the context of reinforcement learning algorithms like PPO and GRPO.
  • Here’s what these terms mean in this context:
    • Importance Sampling Ratio:
      • This is the ratio:

        \[\frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}\]
        • where \(\pi_\theta\) is the current (new) policy and \(\pi_{\text{old}}\) is the old (behavior) policy.
      • It tells us how much more or less likely the new policy is to take action \(a\) in state \(s\) compared to the old one.

    • Policy Ratio:
      • This is a shorthand name for the same quantity. It reflects the relative likelihood of an action under the current policy versus the old one — hence, “policy ratio.”
    • Likelihood Ratio:
      • Also the same quantity, but phrased from a statistical perspective. It compares the likelihoods assigned by two probability distributions (policies) to the same data (action).
  • So, in PPO or GRPO:
    • You’ll often see this ratio appear as something like:
    \[r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\text{old}}(o_t \mid q, o_{<t})}\]
    • And it’s used to weight the advantage, or to apply clipping for stability.
  • All three names refer to the same thing — they just come from different angles (importance sampling theory, policy learning, or statistics).
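  • In practice the ratio is usually computed from per-token log-probabilities rather than raw probabilities. A minimal sketch, with a hypothetical helper name and made-up probabilities:

```python
import math

def policy_ratios(new_logprobs, old_logprobs):
    """Per-token ratio pi_theta / pi_old, computed as exp(logp_new - logp_old)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

# Probabilities the old and new policies assign to two sampled tokens.
old = [math.log(0.5), math.log(0.2)]
new = [math.log(0.6), math.log(0.1)]
ratios = policy_ratios(new, old)
print(ratios)
```

  • A ratio above 1 means the new policy makes the sampled token more likely than the old policy did, and a ratio below 1 means it makes it less likely.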

Do REINFORCE and TRPO in policy optimization also use a surrogate loss?

  • REINFORCE uses a basic form of surrogate loss based on the log-likelihood and returns.
  • TRPO uses a more principled surrogate loss that incorporates importance sampling and constraints to ensure safe policy updates.
  • Specifics below:
    • REINFORCE:
      • REINFORCE is based on the likelihood ratio trick (also called the log-derivative trick or score-function estimator), which underlies the policy gradient theorem.
      • The surrogate objective maximized in REINFORCE (its negative serves as the loss) is:

        \[L(\theta) = \mathbb{E} \left[ \log \pi_\theta(a_t|s_t) \cdot R_t \right]\]
        • where \(R_t\) is the return, the total discounted reward accumulated from time step \(t\) onward:

          \[R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}\]
        • This captures how good a trajectory is, with future rewards discounted by a factor of \(\gamma\).

      • This is essentially a surrogate for maximizing the expected return, but it’s a very direct one: it’s derived directly from the gradient of the expected return.
      • It doesn’t include constraints or trust region concerns — so while it’s a kind of surrogate loss, it’s very raw and unstable due to high variance.
    • TRPO (Trust Region Policy Optimization):
      • TRPO introduces a more sophisticated surrogate objective:

        \[L(\theta) = \mathbb{E} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \cdot \hat{A}(s, a) \right]\]
        • subject to a constraint on the KL divergence:
        \[\mathbb{E} \left[ D_{\text{KL}}\left(\pi_{\text{old}}(\cdot|s) \, \| \, \pi_\theta(\cdot|s) \right) \right] \leq \delta\]
      • The expectation of \(\frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \cdot \hat{A}(s, a)\) is the surrogate objective that TRPO maximizes.
      • This surrogate is designed to estimate the improvement in policy performance, assuming the new policy doesn’t deviate too much from the old one (hence the trust region).
      • The KL constraint ensures stable updates and limits how much the new policy can differ from the old one, helping avoid destructive updates.
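  • The discounted return used in the REINFORCE objective above can be computed with a single backward pass over an episode’s rewards. This sketch assumes a finite episode (so the infinite sum is truncated at the last step); the function name and the example rewards are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go R_t = sum_k gamma^k * r_{t+k} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
```

  • Each log-probability \(\log \pi_\theta(a_t|s_t)\) is then weighted by its own \(R_t\), which is why REINFORCE needs no critic but suffers from high variance.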

Does DPO remove both the critic and reward model?

  • Yes, DPO removes both the critic and the explicit reward model present in standard PPO-based RLHF. It replaces them with a closed-form, theoretically equivalent optimization that directly updates the LLM’s parameters using human preference data, without reinforcement learning.

  • In RLHF:
    • The standard pipeline involves three stages:
      1. Supervised fine-tuning (SFT) on curated data,
      2. Training a reward model from human preference pairs, and
      3. Reinforcement learning (e.g., with PPO) to optimize a policy that maximizes this reward.
    • This third step typically requires an actor–critic setup:

      • The critic estimates the value function or advantage to stabilize training.
      • The actor (policy) is updated using gradient estimates of the reward signal.
    • Thus, RLHF relies on both a reward model and a critic to train the final aligned policy.
  • In DPO:
    • DPO removes both the explicit reward model and the critic by reparameterizing the RLHF objective in closed form.

    • Starting from the RLHF objective with a KL-divergence constraint:

    \[\max_{\pi_\theta} \, \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) \right] - \beta D_{\text{KL}} \left[ \pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x) \right]\]
    • … the DPO paper derives that the optimal policy for a given reward function is
    \[\pi_r(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y) \right),\]
    • … and then rearranges this to express the reward in terms of the policy:
    \[r(x, y) = \beta \log \frac{\pi_r(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x).\]
    • By substituting this relationship into the Bradley–Terry human preference model and cancelling out the partition term, the DPO objective becomes a simple binary cross-entropy loss:
    \[L_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\, \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]
    • where \((y_w, y_l)\) are the preferred and dispreferred completions, and \(\sigma\) is the logistic (sigmoid) function.

    • The aforementioned equation from the paper directly trains the policy to increase the relative likelihood of preferred outputs without any reinforcement learning loop.

  • Takeaways:
    • Since DPO rewrites the objective starting from the RLHF objective with a KL-divergence constraint, there is no explicit reward model — the reward is implicitly represented as:
    \[r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\]
    • There is no critic network — no need to estimate advantages or baselines.
    • The entire alignment process becomes a single-stage supervised optimization with a simple cross-entropy loss.
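  • As a worked step for the derivation above, assume the standard Bradley–Terry form \(p(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)\). Substituting the reparameterized reward into the reward difference shows why the partition term drops out:
    \[r(x, y_w) - r(x, y_l) = \left[\beta \log \frac{\pi_r(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x)\right] - \left[\beta \log \frac{\pi_r(y_l|x)}{\pi_{\text{ref}}(y_l|x)} + \beta \log Z(x)\right] = \beta \log \frac{\pi_r(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_r(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\]
  • Because \(Z(x)\) depends only on the prompt, it is identical for both completions and cancels. Taking the negative log-likelihood of the resulting preference probability, with \(\pi_r\) replaced by the trainable \(\pi_\theta\), gives the DPO loss.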

Citation

@article{Chadha2020DistilledPreferenceOptimization,
  title   = {Preference Optimization},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}