Aman's AI Journal • LLM Alignment

Overview
Refresher: Basics of Reinforcement Learning
Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Reinforcement Learning with AI Feedback (RLAIF)
Direct Preference Optimization (DPO)
Kahneman-Tversky Optimization (KTO)
PPO vs. DPO vs. KTO
Bias Concerns and Mitigation Strategies
TRL - Transformer Reinforcement Learning
Selected Papers
Further Reading
- HuggingFace’s Alignment Handbook
- Empirical Evaluation: DPO vs. IPO vs. KTO
References

Overview

In 2017, OpenAI introduced a groundbreaking approach to machine learning called Reinforcement Learning from Human Feedback (RLHF), specifically focusing on human preferences, in their paper “Deep reinforcement learning from human preferences”. This innovative concept has since inspired further research and development in the field.
The concept behind RLHF is straightforward yet powerful: it involves using a pretrained language model and having human evaluators rank its outputs. This ranking then informs the model to develop a preference for certain types of responses, leading to more reliable and safer outputs.
RLHF effectively leverages human feedback to enhance the performance of language models. It combines the strengths of reinforcement learning algorithms with the nuanced understanding of human input, facilitating continuous learning and improvement in the model.
Incorporating human feedback, RLHF not only improves the model’s natural language understanding and generation capabilities but also boosts its efficiency in specific tasks like text classification or translation.
Moreover, RLHF plays a crucial role in addressing bias within language models. By allowing human input to guide and correct the model’s language use, it fosters more equitable and inclusive communication. However, it’s important to be mindful of the potential for human-induced bias in this process.

Refresher: Basics of Reinforcement Learning

To understand why reinforcement learning is employed in RLHF, we need to gain a better understanding of what it entails.
Reinforcement learning is grounded in a mathematical framework in which an agent interacts with an environment, as shown below (source):

In this interaction, the agent takes an action, and the environment responds with a state and a reward. Here’s a brief on the key terms:
- The reward is the objective that we want to optimize.
- A state is the representation of the environment/world at the current time index.
- A policy is used to map from that state to an action.
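To make the three terms above concrete, here is a minimal sketch of the agent-environment loop. The `LineWorld` environment, its reward rule, and the `always_right_policy` function are hypothetical toy constructs invented for illustration; they are not taken from any library.

```python
class LineWorld:
    """Hypothetical toy environment: the agent walks along positions 0..goal."""

    def __init__(self, goal=3):
        self.goal = goal
        self.state = 0  # the state is simply the current position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, goal]
        self.state = max(0, min(self.goal, self.state + action))
        reward = 1.0 if self.state == self.goal else 0.0
        done = self.state == self.goal
        return self.state, reward, done


def always_right_policy(state):
    """A trivial policy: map every state to the action 'move right'."""
    return +1


def run_episode(env, policy, max_steps=10):
    """Roll out one episode and return the final state and the summed reward."""
    state, total_reward = env.state, 0.0
    for _ in range(max_steps):
        action = policy(state)                  # policy: state -> action
        state, reward, done = env.step(action)  # environment: action -> state, reward
        total_reward += reward
        if done:
            break
    return state, total_reward


final_state, total_reward = run_episode(LineWorld(goal=3), always_right_policy)
```

In an LLM setting, the state would roughly correspond to the prompt plus the tokens generated so far, the action to the next token, and the reward to the score assigned to the finished response.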

Reinforcement Learning from Human Feedback (RLHF)

Let’s start with the motivation for aligning LLMs to human feedback.
The initial objective of training large language models like GPT was to predict subsequent text tokens accurately. However, this approach did not ensure that the outputs were helpful, harmless, or honest.
Consequently, there was a risk of generating content that might not align with ethical or safe human standards. To address this, a process was required to guide the model towards outputs that reflect human values, and that’s the role RLHF fulfills.
The image below (source) depicts how RLHF was leveraged in InstructGPT and will serve as the foundation of our understanding.
The image outlines a three-step process used to train a language model using RLHF. Here’s an explanation of each step:
1. Collect Demonstration Data, and Train a Supervised Policy.
  - A prompt is taken from a collection of prompts.
  - A human labeler (an annotator) provides the desired output, demonstrating how the model should ideally respond.
  - This labeled data is then used to fine-tune the language model (like GPT-3) using supervised learning techniques. Essentially, the model is taught to imitate the demonstrations.
2. Collect Comparison Data, and Train a Reward Model.
  - A prompt is chosen, and the model generates several potential outputs.
  - A labeler then ranks these outputs from best to worst according to criteria like helpfulness or accuracy.
  - This ranked data is used to train a reward model. The reward model learns to predict the quality of the language model’s outputs based on the rankings provided by human labelers.
3. Optimize a Policy Against the Reward Model Using Reinforcement Learning.
  - A new prompt is selected from the dataset.
  - The current policy (strategy the model uses to generate outputs) creates a response.
  - The reward model evaluates this response and assigns a reward.
  - This reward information is used to update and improve the policy through a reinforcement learning algorithm known as Proximal Policy Optimization (PPO). The policy is adjusted to increase the likelihood of generating higher-reward outputs in the future.
Chip Huyen provides a zoomed-out view of how the overall process works in her flowchart below:

Here’s a breakdown of the flowchart:
1. Language Modeling:
  - This is the first stage where a language model is trained on a large dataset. The dataset is composed of a vast amount of text data, which can be of varying quality. The training at this stage is optimized for text completion tasks. The scale mentioned is over 1 trillion tokens, and examples of such models include GPT-x, Gopher, Falcon, LLaMA, Pythia, BLOOM, and StableLM. This results in a Pretrained Large Language Model (LLM).
  - To expand further: This pretraining phase involves developing a large language model (LLM) that functions as a completion machine, using statistical knowledge to predict the likelihood of sequences in language. This is achieved by feeding the model extensive text data, often exceeding a trillion tokens, from varied sources to learn language patterns. The model’s efficacy is contingent on the quality of the training data, with the aim of minimizing cross-entropy loss across training samples. As the Internet becomes saturated with data, including data generated by LLMs themselves, there’s a growing need to access proprietary data for further model improvement.
2. Supervised Finetuning:
  - In the second stage, the pretrained LLM is further finetuned using high-quality data, which is often dialogue-focused to better suit conversational AI. This is done using demonstration data, and the process generates a Supervised Finetuning (SFT) model. The amount of data used for finetuning ranges from 10,000 to 100,000 (prompt, response) pairs. Examples of models that go through this process are Dolly-v2 and Falcon-Instruct.
  - To elaborate: This phase involves Supervised Fine-Tuning (SFT) for dialogue, where a pre-trained model is optimized to generate preferred responses to prompts, such as direct answers to questions. High-quality demonstration data, consisting of prompt-response pairs, guides the model’s behavior. With about 13,000 such pairs, OpenAI’s approach emphasizes quality through expert labelers, while others like DeepMind use heuristics for data selection. The SFT process is critical for tailoring the model’s outputs to practical use cases, leveraging a smaller yet refined dataset to minimize cross-entropy loss for the dialogue-specific responses.
3. Classification and Reward Modeling:
  - The model undergoes a classification process where it is trained to give a scalar score to responses based on human feedback. This is to ensure that the model can evaluate the quality of its own responses. The data used here is called comparison data, and involves 100,000 to 1 million comparisons between a prompt, a winning response, and a losing response. This stage results in the creation of a Reward model.
4. Reinforcement Learning (RLHF):
  - This phase involves using Reinforcement Learning techniques to train the model to generate responses that maximize the scores given by the reward model, effectively teaching the AI to prefer high-quality responses as judged by humans. This stage uses prompts (10,000 to 100,000) to adjust the model’s responses. The end product is the Final model, which should be adept at handling prompts in a way that aligns with human preferences. Examples of such models are InstructGPT, ChatGPT, Claude, and StableVicuna.
  - This phase of RLHF is an advanced training process that refines the behavior of a Supervised Fine-Tuned (SFT) model. It uses human feedback to score AI-generated responses, guiding the model to produce high-quality outputs. RLHF involves training a reward model to evaluate responses and optimizing the language model to prioritize these high scores. This phase addresses the limitations of SFT by providing nuanced feedback on the quality of responses, not just their plausibility, and mitigates issues like hallucination by aligning model outputs more closely with human expectations. Despite its complexity, RLHF has been shown to enhance model performance significantly over SFT alone.
Below, we will expand on the key steps mentioned in this flow.

Reward Model

In the context of RLHF, the key function of a reward model is to evaluate a given input (such as a sequence of text) and produce a scalar reward. This reward is indicative of human preferences or judgments about the quality or desirability of the input.

The image above (source) displays how the reward model works internally.
A reward model is a function or model that takes as input the output or behavior of an AI agent, which can include sequences of text, and produces a scalar reward signal that quantifies how well those outputs align with human preferences or desired behavior.
Architectures for reward models include:
- LM classifiers: An LLM fine-tuned as a binary classifier to score which response better fits the human preference
- Value networks: Regression models that predict a scalar rating representing relative human preference
- Critique generators: LMs trained to generate an evaluative critique explaining which response is better and why. The critique is used with instruction tuning.
The goal is to convert noisy, subjective human judgments into a consistent reward function that can guide a Reinforcement Learning (RL) agent’s training. A more accurate reward model generally yields a better-aligned policy.
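As a rough illustration of how comparison data can turn into a training signal, here is a minimal sketch of a pairwise ranking loss of the kind commonly used for reward models: the negative log-sigmoid of the score difference between the preferred and the rejected response. The function name and the scalar-score interface are assumptions for illustration; a real reward model would produce these scores from a neural network.

```python
import math


def pairwise_reward_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when the order is wrong.
    """
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)), written to avoid overflow for large |d|
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


loss_correct_order = pairwise_reward_loss(2.0, 0.0)  # chosen scored higher
loss_tie = pairwise_reward_loss(0.0, 0.0)            # equals log(2)
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)    # chosen scored lower
```

Minimizing this loss over many (prompt, winning response, losing response) triples pushes the scalar scores to agree with the human rankings.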
To summarize, the reward model is trained on ranked comparison data (several outputs generated by the model for the same prompt), ranked according to alignment criteria such as helpfulness, harmlessness, and honesty. The reward function used during RL combines several models: it scores the “preferability” of generated text with the reward model and subtracts a penalty term based on the Kullback-Leibler (KL) divergence between the probability distributions of the RL policy and the initial model. This penalty prevents the RL policy from deviating significantly from the initial model, ensuring coherent text generation.
- The Kullback-Leibler (KL) divergence, which is a measure of the difference between two probability distributions, can be used to compare the two distributions (initial LM output vs. tuned LM output).
  - KL divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. It quantifies the difference between two probability distributions.
  - Thus, with RLHF, KL divergence can be used to compare the probability distribution of an agent’s current policy with a reference distribution that represents the desired behavior.
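The KL-penalized reward described above can be sketched as follows. This is a simplified illustration, not a definitive implementation: the function name, the per-token log-probability inputs, and the default `beta` value are assumptions, and the summed log-ratio is only a single-sample estimate of the KL divergence for the sampled response.

```python
def kl_penalized_reward(reward_model_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward model score minus a KL-style penalty.

    policy_logprobs / reference_logprobs: log-probabilities that the RL policy
    and the initial (reference) model assign to each token of the same response.
    """
    # Sum of per-token log-ratios: log pi_RL(y|x) - log pi_init(y|x)
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * log_ratio


# Policy identical to the initial model: no penalty is applied.
unchanged = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])

# Policy has drifted toward this response: the reward is reduced.
drifted = kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0])
```

A larger `beta` keeps the tuned model closer to the initial model, while a smaller one lets the reward model score dominate.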

Optimizing the Policy

The “policy” refers to a strategy or a set of rules that an agent uses to make decisions in an environment. The policy defines how the agent selects actions based on its current observations or state.
The policy in PPO is iteratively updated to maximize reward while maintaining a certain level of similarity to its previous version (to prevent drastic changes that could lead to instability).
In Direct Preference Optimization (DPO), the policy is optimized directly from human preferences: a binary cross-entropy loss increases the log probability of preferred responses relative to dispreferred ones, aligning the policy with human feedback while keeping it close to the reference model, as governed by the KL divergence constraint.

Putting it all together: Training Llama 2

As a case study of how Llama 2 was trained, let’s go over the multi-stage process that integrates both human and model-generated feedback to refine the performance of language models. Here’s how it functions:
1. Pretraining: Llama 2 undergoes initial pretraining with large amounts of data through self-supervised learning. This stage lays the foundation for the model by enabling it to understand language patterns and context.
2. Supervised Fine-Tuning: The model then undergoes supervised fine-tuning with instruction data, where it is trained to respond to prompts in ways that align with specific instructions.
3. Reward Models Creation (RLHF Step 1): Two separate reward models are created using human preference data, one for helpfulness and one for safety. These models are trained to predict which of two responses is better based on human judgments.
4. Margin Loss and Ranking: Unlike the previous approach that generates multiple outputs and uses a “k choose 2” comparison method, Llama 2’s dataset is based on binary comparisons, and each labeler is presented with only two responses at a time. A margin label is collected alongside binary ranks to indicate the degree of preference, which can inform the ranking loss calculation.
5. Rejection Sampling and Alignment using PPO (RLHF Step 2): Finally, Llama 2 employs rejection sampling and Proximal Policy Optimization (PPO). Rejection sampling is used to draw multiple outputs and select the one with the highest reward for the gradient update. PPO is then used to align the model further, making the model’s responses more safe and helpful.
The image below (source) shows how Llama 2 leverages RLHF.
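Two of the steps above lend themselves to a short sketch: a ranking loss with a margin term (step 4) and the selection step of rejection sampling (step 5). This is an illustrative approximation under stated assumptions: the margin value, the `toy_scores` table, and the `toy_reward` function are hypothetical stand-ins for a trained reward model and real labeler data.

```python
import math


def margin_ranking_loss(score_chosen, score_rejected, margin=0.0):
    """Return -log(sigmoid(score_chosen - score_rejected - margin)).

    A larger margin (stronger stated preference) demands a larger score gap.
    """
    diff = score_chosen - score_rejected - margin
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def best_of_n(prompt, candidates, reward_fn):
    """Selection step of rejection sampling: keep the highest-reward candidate."""
    return max(candidates, key=lambda response: reward_fn(prompt, response))


# Hypothetical scores standing in for a trained reward model.
toy_scores = {"Detailed, polite answer": 0.9, "Curt answer": 0.4, "Off-topic answer": 0.1}


def toy_reward(prompt, response):
    return toy_scores[response]


best = best_of_n("How do I reset my password?", list(toy_scores), toy_reward)
```

In the actual pipeline the candidates would be sampled from the current model, and the selected response would then be used for the gradient update.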

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm that addresses some key challenges in training agents through policy gradient methods. Here’s a look at how PPO works:

Core Principles of PPO

Policy Gradient Approach: PPO operates on the policy gradient approach, where the agent directly learns a policy, typically parameterized by a neural network. The policy maps states to actions based on the current understanding of the environment.
Iterative Policy Improvement: The agent collects a set of trajectories under its current policy and then updates the policy to maximize a specially designed objective function. This process is repeated iteratively, allowing the policy to gradually improve over time.

Key Components of PPO

Surrogate Objective Function: Central to PPO is its surrogate objective function, which considers the ratio of the probability of an action under the current policy to the probability under the reference policy, multiplied by the advantage function. The advantage function assesses how much better an action is compared to the average action at a given state.
Policy Ratio and Clipping Mechanism: The “policy ratio,” which is the ratio of the probability of an action under the new policy to that under the reference policy, plays a crucial role. PPO employs a clipping mechanism in its objective function, limiting the policy ratio within a defined range (typically $[1-\epsilon, 1+\epsilon]$). This clipping ensures that the updates to the policy are kept within a reasonable range, preventing the new policy from deviating excessively from the reference one. Ultimately, this mechanism helps in maintaining the stability of the learning process.
Multiple Epochs of Stochastic Gradient Ascent: In PPO, each batch of experiences is used for multiple epochs of stochastic gradient ascent. This efficient use of data for policy updates makes PPO more sample-efficient compared to some other methods.
Value Function and Baseline: A value function is often trained alongside the policy in PPO. This value function estimates the expected return (cumulative future rewards) from each state and is used to compute the advantage function, which in turn informs the policy update.
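As a hedged sketch of how value estimates feed into the advantage function, the snippet below implements Generalized Advantage Estimation (GAE), one common estimator. The function name, the list-based interface, and the default `gamma` and `lam` values are assumptions chosen for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    rewards: reward received at each step t = 0..T-1
    values:  value estimates V(s_0)..V(s_T); the last entry is the bootstrap
             value of the state after the final step (0.0 if terminal)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With gamma = lam = 1 and zero value estimates, each advantage is the return-to-go.
adv_full = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)

# With lam = 0, each advantage reduces to the one-step TD error.
adv_td = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=0.0)
```

The `lam` parameter trades bias for variance: values near 0 rely on the value function, values near 1 rely on observed returns.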

PPO’s Objective Function: Clipped Surrogate Loss

Definition of Clipped Surrogate Loss: The surrogate loss in PPO is defined based on the ratio of the probability of taking an action under the current policy to the probability of taking the same action under the reference policy. This ratio is used to adjust the policy towards actions that have higher rewards while ensuring that updates are not too drastic. The clipping mechanism is employed to limit the magnitude of these updates, maintaining stability during training.
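- Formally, let $\pi_\theta$ be the current policy parameterized by $\theta$, and $\pi_{\text{ref}}$ be the reference policy (in standard PPO, this is the policy that collected the trajectories before the update, often written $\pi_{\theta_{\text{old}}}$). For a given state $s$ and action $a$, the probability ratio is: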
\[r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\text{ref}}(a|s)}\]
- The PPO surrogate loss is then defined as follows:
  \[L^{\text{PPO-CLIP}}(\theta) = \mathbb{E} \left[ \min \left( r(\theta) \hat{A}, \, \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A} \right) \right]\]
  - where:
    - $\hat{A}$ is the advantage function, which measures how much better an action is compared to the average action at a given state. It is often estimated using Generalized Advantage Estimation (GAE).
    - $\epsilon$ is a hyperparameter that defines the clipping range, controlling how much the policy can change at each update. Typical values are in the range of 0.1 to 0.3.
    - $\text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon)$ clips the ratio $r(\theta)$ to be within the range $[1 - \epsilon, 1 + \epsilon]$.
- Alternatively, the expanded form of the PPO clipped surrogate loss (based on the reference policy as the old policy above) can be written as:
  \[L^{\text{PPO-CLIP}}(\pi) = \mathbb{E}\left[ \min \left( \frac{\pi(a|s)}{\pi_{\text{ref}}(a|s)} \hat{A}, \text{clip}\left( \frac{\pi(a|s)}{\pi_{\text{ref}}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A} \right) \right]\]
  - where:
    - $\hat{A}$ is the advantage estimate, which measures how much better an action is compared to the average action at a given state. It is often estimated using Generalized Advantage Estimation (GAE).
    - $s$ is the state.
    - $a$ is the action.
    - $\epsilon$ is a small hyperparameter that limits the extent of the policy update.
Clipping Mechanism: The clipping mechanism is central to the stability and reliability of PPO. It ensures that the policy updates do not result in excessively large changes, which could destabilize the learning process. The clipping mechanism works as follows:
- Clipping Range: The ratio $r(\theta)$ is clipped to the range $[1 - \epsilon, 1 + \epsilon]$. This means if the ratio $r(\theta)$ is outside this range, it is set to the nearest bound.
- Objective Function Impact: By clipping the probability ratio, PPO ensures that the change in policy induced by each update is kept within a reasonable range. This prevents the new policy from deviating too far from the reference policy, which could lead to instability and poor performance.
- Practical Example: If the probability ratio $r(\theta)$ is 1.2 and $\epsilon$ is 0.2, the clipped ratio would remain 1.2. However, if $r(\theta)$ is 1.4, it would be clipped to 1.2 (1 + 0.2), and if $r(\theta)$ is 0.7, it would be clipped to 0.8 (1 - 0.2).
Purpose of Surrogate Loss in PPO: The surrogate loss allows PPO to balance the need for policy improvement with the necessity of maintaining stability. By limiting the extent to which the policy can change at each update, the surrogate loss ensures that the learning process remains stable and avoids the pitfalls of overly aggressive updates. The clipping mechanism is a key innovation that helps PPO maintain this balance effectively.
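The clipped surrogate objective for a single (state, action) sample can be sketched in a few lines. The function names are illustrative, and a real implementation would average this quantity over a batch and differentiate it with an autodiff library.

```python
def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO-clip objective for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)


# Ratios from the practical example above, with epsilon = 0.2
within_range = clip(1.2, 0.8, 1.2)   # stays 1.2
above_range = clip(1.4, 0.8, 1.2)    # clipped to 1.2
below_range = clip(0.7, 0.8, 1.2)    # clipped to 0.8

# Positive advantage: the gain from raising the ratio is capped at 1 + epsilon.
capped_gain = clipped_surrogate(1.4, advantage=1.0)

# Negative advantage: the min keeps the more pessimistic, unclipped value.
pessimistic = clipped_surrogate(1.4, advantage=-1.0)
```

Note that the `min` makes the objective a pessimistic bound: clipping only removes the incentive to move the ratio further in the direction that would improve the objective, and never hides a change that makes it worse.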

Summary

The surrogate loss function in PPO is designed to encourage policy improvement while maintaining stability.
The clipping mechanism in the loss function is crucial for preventing large policy updates, ensuring smoother and more stable learning.
This approach helps PPO to achieve a good balance between effective policy learning and the stability required for reliable performance in various environments.

PPO’s Objective Function Components

Policy Ratio: The core of the PPO objective function involves the policy ratio, which is the ratio of the probability of taking a certain action under the current policy to the probability under the reference policy. This ratio is multiplied by the advantage estimate, which reflects how much better a given action is compared to the average action at a given state.
Clipped Surrogate Objective: To prevent excessively large updates, which could destabilize training, PPO introduces a clipping mechanism in its objective function. The policy ratio is clipped within a certain range, typically $[1-\epsilon, 1+\epsilon]$ (where $\epsilon$ is a small value like 0.1 or 0.2). This clipping ensures that the updates to the policy are not too large, which maintains stability in training.
KL Divergence Loss: Besides the clipped objective, another common approach is to add a KL divergence penalty directly to the objective function. This means the algorithm would penalize the objective based on how much the new policy diverges from the reference policy. In other words, the KL divergence component helps keep the new policy close to the reference one by penalizing updates that result in a large divergence from the reference policy. The KL divergence loss is typically added to the objective function as a penalty term:
\[L^{\text{KL}}(\theta) = \mathbb{E} \left[ L^{\text{PPO}}(\theta) - \beta \text{KL}[\pi_{\text{ref}} || \pi_{\theta}] \right]\]
- where:
  - $\beta$ is a hyperparameter that controls the strength of the KL penalty.
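For a discrete action distribution at a single state, the KL-penalized objective above can be sketched as follows. The function names, the list-of-probabilities representation, and the default `beta` are assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL[p || q] for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_penalized_objective(surrogate, ref_probs, new_probs, beta=0.01):
    """Surrogate objective minus beta * KL[pi_ref || pi_theta]."""
    return surrogate - beta * kl_divergence(ref_probs, new_probs)


ref_probs = [0.5, 0.5]

# Identical policies: the penalty vanishes and the objective is unchanged.
no_penalty = kl_penalized_objective(1.0, ref_probs, [0.5, 0.5])

# A policy that has moved away from the reference is penalized.
penalized = kl_penalized_objective(1.0, ref_probs, [0.9, 0.1], beta=0.5)
```

Increasing `beta` makes the same amount of divergence cost more, which keeps updates closer to the reference policy.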
Value Function Loss: PPO also typically includes a value function loss in its objective. This part of the objective function ensures that the estimated value of the states (as predicted by the value function) is as accurate as possible, which is important for computing reliable advantage estimates.
Entropy Bonus: Some implementations of PPO include an entropy bonus to encourage exploration. This part of the objective function rewards the policy for taking a variety of actions, which helps prevent premature convergence to suboptimal policies.
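One common way to combine the components listed above is to subtract a weighted value loss from the clipped surrogate and add a weighted entropy bonus. The sketch below shows this for a single sample; the coefficient names `vf_coef` and `ent_coef` and their default values are illustrative assumptions, not prescribed settings.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_total_objective(surrogate, value_pred, value_target, action_probs,
                        vf_coef=0.5, ent_coef=0.01):
    """Quantity to maximize: surrogate - vf_coef * value_loss + ent_coef * entropy."""
    value_loss = (value_pred - value_target) ** 2  # squared error of the value estimate
    return surrogate - vf_coef * value_loss + ent_coef * entropy(action_probs)


# A deterministic policy has zero entropy, so only the value loss is subtracted.
deterministic = ppo_total_objective(1.0, value_pred=2.0, value_target=1.0,
                                    action_probs=[1.0, 0.0])

# A uniform policy over two actions earns the maximum entropy bonus, log(2).
uniform = ppo_total_objective(1.0, value_pred=2.0, value_target=1.0,
                              action_probs=[0.5, 0.5])
```

In practice the three terms are averaged over a batch, and the sign is flipped when the optimizer minimizes a loss.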

Variants of PPO

There are two main variants of PPO:
1. PPO-clip: This variant uses the clipped surrogate objective function, as described above, to limit the policy updates.
2. PPO-penalty: This variant adds a KL divergence penalty directly to the objective function, as described above, to constrain policy updates.

Optimal Policy and Reference Policy

Optimal Policy ($\pi^{*}$ or $\pi_{optimal}$): The optimal policy refers to the strategy or set of rules that the LLM follows to maximize the objective function $J(\pi)$. This objective function is designed to reflect the goals of alignment, such as generating helpful, truthful, and harmless responses. Formally, the optimal policy $\pi^{*}$ is defined as:
\[\pi^{*} = \arg\max_{\pi} J(\pi)\]
- where $J(\pi)$ is the objective function.
Reference Policy ($\pi_{\text{ref}}$): The reference policy is a baseline or benchmark policy used to compare and guide the learning process of the optimal policy. It represents a known, stable policy that the model starts from or refers back to during training. The reference policy helps in stabilizing the training process by providing a consistent comparison point.

Summary

$\pi_{\text{optimal}}$: Optimal policy, maximizing the objective function $J(\pi)$.
$\pi_{\text{ref}}$: Reference policy, providing a stable baseline for training.

Advantages of PPO

Stability and Reliability: The clipping mechanism in the objective function helps to avoid large, destabilizing updates to the policy, making the learning process more stable and reliable.
Efficiency: By reusing data for multiple gradient updates, PPO can be more sample-efficient compared to some other methods.
General Applicability: PPO has demonstrated good performance across a wide range of environments, from simple control tasks to complex 3D simulations. It offers a simpler and more robust approach compared to previous algorithms like TRPO.

Simplified Example

Imagine an agent learning to play a game. The agent tries different actions (moves in the game) and learns a policy that predicts which action to take in each state (situation in the game). The policy is updated based on the experiences, but instead of drastically changing the policy based on recent success or failure, PPO makes smaller, incremental changes. This way, the agent avoids drastically changing its strategy based on limited new information, leading to a more stable and consistent learning process.

Summary

PPO stands out in the realm of reinforcement learning for its innovative approach to policy updates via gradient ascent. Its key innovation is the introduction of a clipped surrogate objective function that judiciously constrains the policy ratio. This mechanism is fundamental in preventing drastic policy shifts and ensuring a smoother, more stable learning progression.
PPO is particularly favored for its effectiveness and simplicity across diverse environments, striking a fine balance between policy improvement and stability.
The PPO objective function is designed to balance the need for effective policy improvement with the need for training stability. It achieves this through the use of a clipped surrogate objective function, value function loss, and potentially an entropy bonus.
While KL divergence is not a direct part of the basic PPO objective function, it is often used in some implementations of PPO to monitor and maintain policy stability. This is done either by penalizing large changes in the policy (KL penalty) or by enforcing a constraint on the extent of change allowed between policy updates (KL constraint).
By integrating these elements, PPO provides a robust framework for reinforcement learning, ensuring both stability and efficiency in the learning process. This makes it particularly suitable for fine-tuning large language models (LLMs) and other complex systems where stable and reliable updates are crucial.

In PPO and other reinforcement learning (RL) algorithms, the policy is typically represented by a parameterized function, most commonly a neural network. Here’s a detailed breakdown of how the policy is represented and what it entails:

Policy Representation in RL Algorithms

Neural Network (Parameterized Function)
- Neural Networks: In modern RL algorithms like PPO, the policy is most often represented by a neural network. The neural network takes the current state of the environment as input and outputs a probability distribution over possible actions.
- Parameters (Weights): The neural network is defined by its parameters, which are the weights and biases of the network. These parameters are collectively denoted as $\theta$. The process of training the policy involves adjusting these parameters to maximize the expected reward.

Mathematical Representation

The policy $\pi_\theta(a|s)$ represents the probability of taking action $a$ given state $s$, parameterized by $\theta$. This function maps states to a distribution over actions.

Discrete Action Spaces: For discrete action spaces, the output of the neural network can be a softmax function that gives a probability for each possible action.
Continuous Action Spaces: For continuous action spaces, the output might be parameters of a probability distribution (e.g., mean and standard deviation of a Gaussian distribution) from which actions can be sampled.
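The two cases above can be sketched with the standard library alone. Here the logits, mean, and log standard deviation stand in for the outputs of a policy network; the function names and the choice to parameterize the Gaussian by its log standard deviation are assumptions for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw network outputs into a probability for each discrete action."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete_action(logits, rng):
    """Sample an action index from the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_continuous_action(mean, log_std, rng):
    """Sample a real-valued action from a Gaussian with the given parameters."""
    return rng.gauss(mean, math.exp(log_std))


rng = random.Random(0)
probs = softmax([1.0, 2.0, 3.0])
discrete_action = sample_discrete_action([1.0, 2.0, 3.0], rng)
continuous_action = sample_continuous_action(mean=0.0, log_std=-1.0, rng=rng)
```

For an LLM, the discrete case applies: the action space is the vocabulary, and the softmax over the output logits is the policy’s distribution over the next token.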

Policy Gradient Methods
- In policy gradient methods like PPO, the policy is directly updated by computing the gradient of the expected reward with respect to the policy parameters $\theta$. This gradient is used to adjust the parameters in a way that increases the expected reward.
Actor-Critic Methods
- Actor: In actor-critic methods, the “actor” is the policy network, which decides the actions to take.
- Critic: The “critic” is another network that estimates the value function, which provides feedback on how good the current policy is. The critic helps to reduce the variance of the policy gradient estimates.
Optimization Process
- Policy Update: The policy parameters $\theta$ are updated through an optimization process (e.g., gradient ascent in policy gradient methods) to maximize the objective function, such as the expected cumulative reward.
- Surrogate Objective: In PPO, a surrogate objective function is used, which includes mechanisms like clipping to ensure stable updates to the policy.
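To illustrate the policy update step, here is a minimal sketch of gradient ascent on the expected reward of a softmax policy in a one-state problem (a two-armed bandit). It is a simplification under stated assumptions: the rewards are known and the gradient is computed exactly, whereas real policy gradient methods estimate it from sampled trajectories. All names and the learning rate are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """J(theta) = sum over actions of pi_theta(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_gradient_step(logits, rewards, lr=0.5):
    """One gradient ascent step on J with respect to the logits."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))  # equals J(theta)
    # For a softmax policy, dJ/d(logit_k) = pi(k) * (r(k) - J)
    grads = [p * (r - baseline) for p, r in zip(probs, rewards)]
    return [z + lr * g for z, g in zip(logits, grads)]


rewards = [0.0, 1.0]   # action 1 is the better action
logits = [0.0, 0.0]    # start from a uniform policy
history = [expected_reward(logits, rewards)]
for _ in range(50):
    logits = policy_gradient_step(logits, rewards)
    history.append(expected_reward(logits, rewards))
```

Each step shifts probability toward the action whose reward exceeds the current average, which is the same idea the advantage function expresses in PPO.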

Summary

Neural Network: The policy in PPO and many other RL algorithms is represented by a neural network.
Parameters (Weights): The neural network is parameterized by a set of weights and biases, collectively denoted as $\theta$.
Probability Distribution: The policy maps states to a probability distribution over actions, allowing for both discrete and continuous action spaces.
Optimization: The policy parameters are updated iteratively to maximize the expected reward, often using gradient-based optimization methods.
By representing the policy as a neural network, RL algorithms can leverage the expressive power of deep learning to handle complex environments and high-dimensional state and action spaces.

Reinforcement Learning with AI Feedback (RLAIF)

RLAIF uses AI-generated preferences instead of human annotated preferences. It leverages a powerful LLM (say, GPT-4) to generate these preferences, offering a cost-effective and efficient alternative to human-generated feedback.
RLAIF operates by using a pre-trained LLM to generate feedback for training another LLM. Essentially, the feedback-generating LLM serves as a stand-in for human annotators. This model evaluates and provides preferences or feedback on the outputs of the LLM being trained, guiding its learning process.
The feedback is used to optimize the LLM’s performance for specific tasks like summarization or dialogue generation. This method enables efficient scaling of the training process while maintaining or improving the model’s performance compared to methods relying on human feedback.

Direct Preference Optimization (DPO)

LLMs acquire extensive world knowledge and reasoning skills via self-supervised pre-training, but precisely controlling their behavior is challenging due to their unsupervised training nature. Traditionally, methods like RLHF, discussed earlier in this article, are used to steer these models, involving two stages: training a reward model based on human preference labels and then fine-tuning the LM to align with these preferences using reinforcement learning (RL). However, RLHF presents complexities and instability issues, necessitating fitting a reward model and then training a policy to optimize this reward, which is prone to stability concerns.
Proposed in Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al. from Stanford in 2023, Direct Preference Optimization (DPO) is a novel approach that simplifies and enhances the aforementioned process. DPO leverages a mathematical relationship between optimal policies and reward functions, demonstrating that the constrained reward maximization problem in RLHF can be optimized more effectively with a single stage of policy training. DPO redefines the RLHF objective by showing that the reward can be rewritten purely as a function of policy probabilities, allowing the LM to implicitly define both the policy and the reward function. This innovation eliminates the need for a separate reward model and the complexities of RL.
This paper introduces a novel algorithm that gets rid of the two stages of RL, namely fitting a reward model and training a policy to optimize the reward via sampling. The second stage is particularly hard to get right due to stability concerns, which DPO sidesteps. The way it works is, given a dataset of the form <prompt, worse completion, better completion>, you train your LLM using a new loss function which essentially encourages it to increase the likelihood of the better completion and decrease the likelihood of the worse completion, weighted by how much higher the implicit reward model rates the worse completion, i.e., by how incorrectly it orders the pair. This method obviates the need for an explicit reward model, as the LLM itself acts as a reward model. The key advantage is that it’s a straightforward loss function optimized using backpropagation.
The stability, performance, and computational efficiency of DPO are significant improvements over traditional methods. It eliminates the need for sampling from the LM during fine-tuning, fitting a separate reward model, or extensive hyperparameter tuning.
The figure below from the paper illustrates that DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL.

Experiments demonstrate that DPO can fine-tune LMs to align with human preferences as effectively as, if not more effectively than, traditional RLHF methods. It notably surpasses PPO-based RLHF in controlling the sentiment of generations and matches or improves response quality in tasks like summarization and single-turn dialogue, while being substantially simpler to implement and train.

DPO and its use of Binary Cross Entropy

DPO is a training objective rather than a new kind of model. The LM being tuned still predicts tokens autoregressively, but instead of being trained with a next-token cross-entropy loss on reference text, it is fine-tuned on human preferences between pairs of responses. DPO uses a binary cross-entropy loss over whole responses so that the model becomes more likely to generate responses that align with human-preferred outcomes. The loss does not score individual next-token predictions; instead, it shifts the model’s probability distribution over complete responses toward those that match human preferences. The objective is to align the model’s output with what humans would find more acceptable or desirable in various contexts.
DPO works by utilizing Binary Cross-Entropy (BCE) to compare pairs of model-generated responses (preferred and dispreferred) against human preferences. For each pair, the BCE loss calculates how well the model’s predictions align with these preferences.
Here’s a simplified breakdown:
1. Response Pairs: For each input, the model generates two responses.
2. Human Preferences: Humans indicate which response is preferable.
3. Model Probabilities: The model assigns probabilities to each response.
4. BCE Loss: The loss function computes the difference between the model’s probabilities and the actual human preferences. It penalizes the model more when it assigns a higher probability to the dispreferred response.
By minimizing this loss during training, DPO nudges the model to adjust its internal parameters. This way, it becomes more likely to generate responses that align with human preferences. The BCE loss acts as a guide, informing the model which types of responses are more desirable based on human feedback.
In essence, DPO represents a groundbreaking shift in training language models to align with human preferences. It consolidates the two-stage process of RLHF into a single, efficient end-to-end policy learning approach. By reparameterizing the reward function and unifying policy learning and reward modeling into one streamlined optimization process, DPO offers a more efficient and lightweight method for training language models to match human preferences.
Put simply, the loss function used in DPO is based on binary cross-entropy. This approach is chosen to optimize language models in alignment with human preferences. In DPO, the goal is to increase the relative log probability of preferred responses in a dataset. The binary cross-entropy loss function facilitates this by treating the optimization as a classification problem, where the model learns to classify between preferred and non-preferred responses. This method simplifies the traditional RLHF approach by directly optimizing for an implicit reward function, represented through human preferences, using a straightforward binary classification loss. This approach is both computationally efficient and theoretically grounded, making it effective for training language models to align with human preferences.
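To make the binary cross-entropy view concrete, here is a minimal per-example sketch of the DPO loss in plain Python. It assumes the four sequence log-probabilities (preferred and dispreferred completion, under the policy and under a frozen reference model) have already been computed; the function and argument names are illustrative, not taken from any library, and `beta=0.1` is just a placeholder value.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss (illustrative sketch).

    Each argument is the log-probability of a whole completion.
    The loss is binary cross-entropy with the label "chosen beats rejected",
    where the logit is the scaled difference of policy-to-reference log-ratios.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(logit)

# When the policy equals the reference, the logit is 0 and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the policy's log-probability of the chosen completion lowers the loss.
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

In a real training run these scalars would be tensors and the loss would be averaged over a batch, but the arithmetic is the same.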

How does DPO generate two responses?

In DPO, generating two responses and assigning probabilities to each response involves a nuanced process:
1. Generating Two Responses:
  - The responses are typically generated using a supervised fine-tuned language model. This model, when given an input prompt, generates a set of potential responses.
  - These responses are often generated through decoding methods such as beam search or random sampling (possibly at different temperatures), which can produce diverse outputs.
2. Assigning Probabilities:
  - Language models indeed assign probabilities at the token level, predicting the likelihood of each possible next token given the previous tokens.
  - The probability of an entire response (sequence of tokens) is calculated as the product of the probabilities of individual tokens in that sequence, as per the model’s prediction.
  - For DPO, these probabilities are used to calculate the loss based on human preferences. The model is trained to increase the likelihood of the preferred response and decrease that of the less preferred one.
Through this process, DPO leverages human feedback to fine-tune the model, encouraging it to generate more human-aligned outputs.
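As a small illustration of the probability computation described above, the sketch below turns per-token probabilities into a sequence probability. In practice the product is computed as a sum of log-probabilities to avoid numerical underflow on long sequences; the token probabilities here are made-up numbers, not the output of a real model.

```python
import math

def sequence_log_prob(token_probs):
    """Log-probability of a response, given the probability the model
    assigned to each of its tokens (conditioned on the preceding tokens)."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a three-token response.
token_probs = [0.5, 0.25, 0.8]

log_p = sequence_log_prob(token_probs)
p = math.exp(log_p)  # same value as the product 0.5 * 0.25 * 0.8
```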

DPO and its use of the Bradley-Terry model

Overview of the Bradley-Terry Model:
- The Bradley-Terry model is a probability model used for pairwise comparisons. It assigns a score to each item (in this context, model outputs), and the probability that one item is preferred over another is a function of their respective scores. Formally, if item $i$ has a score $s_i$ and item $j$ has a score $s_j$, the probability $P(i \text{ is preferred over } j)$ is given by:
\[P(i \text{ is preferred over } j) = \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)}\]
Application in DPO for LLM Alignment:
1. Data Collection:
  - Human evaluators provide pairwise comparisons of model outputs. For example, given two responses from the LLM, the evaluator indicates which one is better according to specific criteria (e.g., relevance, coherence, correctness).
2. Modeling Preferences:
  - The outputs of the LLM are treated as items in the Bradley-Terry model. Each output has an associated score reflecting its quality or alignment with human preferences.
3. Score Estimation:
  - The scores $s_i$ for each output are estimated using the observed preferences. If output $i$ is preferred over output $j$ in several comparisons, $s_i$ will be higher than $s_j$. The scores are typically estimated using maximum likelihood estimation (MLE) or other optimization techniques suited for the Bradley-Terry model.
4. Optimization:
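  - Conceptually, once the scores are estimated, the LLM is fine-tuned to maximize the likelihood of generating outputs with higher scores. The objective is to adjust the model parameters so that the outputs align better with human preferences as captured by the Bradley-Terry model scores. In DPO itself, score estimation and fine-tuning are not separate steps: the scores are expressed through the policy’s own probabilities, so both happen in a single optimization, as the next subsection describes.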
Detailed Steps in DPO:
1. Generate Outputs:
  - Generate multiple outputs for a given prompt using the LLM.
2. Pairwise Comparisons:
  - Collect human feedback by asking evaluators to compare pairs of outputs and indicate which one is better.
3. Fit Bradley-Terry Model:
  - Use the collected pairwise comparisons to fit the Bradley-Terry model and estimate the scores for each output.
4. Update LLM:
  - Fine-tune the LLM using the estimated scores. The objective is to adjust the model parameters such that the likelihood of producing higher-scored (preferred) outputs is maximized. This step often involves gradient-based optimization techniques where the loss function incorporates the Bradley-Terry model probabilities.
- By iteratively performing these steps, the LLM can be aligned more closely with human preferences, producing outputs that are more likely to be preferred by human evaluators.
Summary:
- The Bradley-Terry model plays a crucial role in the Direct Preference Optimization method by providing a statistical framework for modeling and estimating the preferences of different model outputs. This, in turn, guides the fine-tuning of the LLM to align its outputs with human preferences effectively.
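The score-estimation step can be illustrated with a toy maximum likelihood fit of a Bradley-Terry model. This is a generic sketch of the statistical model, not DPO’s training procedure: the comparison data, learning rate, and step count are invented for illustration, and plain gradient ascent is only one of several ways to fit the scores.

```python
import math

def bt_prob(s_i: float, s_j: float) -> float:
    """P(i is preferred over j) = exp(s_i) / (exp(s_i) + exp(s_j)),
    written in the equivalent form sigmoid(s_i - s_j)."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def fit_bradley_terry(n_items, comparisons, lr=0.1, steps=2000):
    """Estimate scores by gradient ascent on the log-likelihood.

    comparisons: list of (winner_index, loser_index) pairs.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p = bt_prob(scores[winner], scores[loser])
            # d/ds_winner log p = 1 - p, and d/ds_loser log p = -(1 - p)
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        scores = [s + lr * g for s, g in zip(scores, grads)]
        # Scores are only identified up to an additive constant; center them.
        mean = sum(scores) / n_items
        scores = [s - mean for s in scores]
    return scores

# Toy data: item 0 beats item 1 in 3 of 4 comparisons,
# and item 1 beats item 2 in 3 of 4 comparisons.
comparisons = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
scores = fit_bradley_terry(3, comparisons)
p_0_over_1 = bt_prob(scores[0], scores[1])  # approaches the observed 3/4
```

The fitted probability matches the empirical win rate because, for this toy data, each pair of items is compared only with its neighbor.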

How does DPO implicitly use a Bradley-Terry Model (if it does not explicitly use a reward model)?

DPO uses the Bradley-Terry model implicitly, even if it does not explicitly employ a traditional reward model. Here’s how this works:

Key Concepts in DPO Without an Explicit Reward Model

Pairwise Comparisons:
- Human evaluators provide pairwise comparisons between outputs generated by the LLM. For example, given two outputs, the evaluator indicates which one is preferred.
Logistic Likelihood:
- The Bradley-Terry model is essentially a logistic model used for pairwise comparisons. The core idea is to model the probability of one output being preferred over another based on their latent scores.

Implicit Use of Bradley-Terry Model

Even without an explicit reward model, DPO leverages the principles behind the Bradley-Terry model in the following manner:

Score Assignment through Logit Transformation:
- For each output generated by the LLM, assign a latent score. This score can be considered as the logit (log-odds) of the output being preferred.
- Given two outputs, $o_i$ and $o_j$, with logits (latent scores) $s_i$ and $s_j$, the probability that $o_i$ is preferred over $o_j$ follows the logistic function: $P(o_i \text{ is preferred over } o_j) = \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)}$
Optimization Objective:
- Construct a loss function based on the likelihood of observed preferences. If $o_i$ is preferred over $o_j$ in the dataset, the corresponding likelihood component is: $L = \log P(o_i \text{ is preferred over } o_j) = \log \left( \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)} \right)$
- The overall objective is to maximize this likelihood across all pairwise comparisons provided by human evaluators.
Gradient Descent for Fine-Tuning:
- Instead of explicitly training a separate reward model, the LLM is fine-tuned using gradients derived from the likelihood function directly.
- During backpropagation, the gradients with respect to the LLM’s parameters are computed from the likelihood of the preferences, effectively pushing the model to produce outputs that align with higher preference scores.

Steps in DPO Without Explicit Reward Model

Generate Outputs:
- Generate multiple outputs for a set of prompts using the LLM.
Collect Pairwise Comparisons:
- Human evaluators compare pairs of outputs and indicate which one is preferred.
Compute Preference Probabilities:
- Use the logistic model (akin to Bradley-Terry) to compute the probability of each output being preferred over another.
Construct Likelihood and Optimize:
- Formulate the likelihood based on the observed preferences and optimize the LLM’s parameters to maximize this likelihood.

Practical Implementation

Training Loop:
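- DPO trains offline on a fixed preference dataset: outputs are generated and preferences are collected once, up front. In each training iteration, take a batch of (prompt, preferred output, dispreferred output) triples, compute the logistic likelihood, and perform gradient descent to adjust the LLM parameters. No new outputs need to be sampled from the LM during training.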
Loss Function:
- The loss function directly incorporates the Bradley-Terry model’s probabilities without needing an intermediate reward model: $\text{Loss} = -\sum_{(i,j) \in \text{comparisons}} \log \left( \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)} \right)$
By optimizing this loss function, DPO ensures that the LLM’s outputs increasingly align with human preferences, implicitly using the Bradley-Terry model’s probabilistic framework without explicitly training a separate reward model. This direct approach simplifies the alignment process while leveraging the robust statistical foundation of the Bradley-Terry model.
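The loss above leaves the latent scores $s_i$ abstract. As a worked step, and following the reparameterization described in the DPO paper, the score of a completion $y$ for a prompt $x$ can be taken to be the implicit reward, i.e., the scaled log-ratio between the policy $\pi_\theta$ being tuned and a frozen reference policy $\pi_{\text{ref}}$ (the symbols $\pi_\theta$, $\pi_{\text{ref}}$, $\beta$, $y_w$, and $y_l$ are introduced here for illustration; $y_w$ is the preferred and $y_l$ the dispreferred completion):

\[s_w = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}, \qquad s_l = \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\]

Because the Bradley-Terry probability depends only on the difference of scores, it reduces to a sigmoid:

\[\frac{\exp(s_w)}{\exp(s_w) + \exp(s_l)} = \sigma(s_w - s_l)\]

Substituting the scores into the negative log-likelihood gives the DPO objective:

\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]\]

Any term that depends only on the prompt $x$ (such as the normalizing constant of the optimal policy) appears in both scores and cancels in the difference, which is why no separate reward model has to be fit.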

Video Tutorial

This video by Umar Jamil explains the DPO pipeline by deriving it step by step while explaining all the inner workings.
After briefly introducing the topic of AI alignment, the video reviews RL, a topic that is necessary to understand the reward model and its loss function. Next, it derives, step by step, the loss function of the reward model under the Bradley-Terry model of preferences, a derivation that is missing in the DPO paper.
Using the Bradley-Terry model, it builds the loss of the DPO algorithm, not only explaining its math derivation, but also giving intuition on how it works.
In the last part, it describes how to use the loss practically, that is, how to calculate the log probabilities using a Transformer model, by showing how it is implemented in the Hugging Face library.

Summary

RLHF is the most “dicey” part of LLM training and the one that has needed more art than science. DPO seeks to simplify it by taking RL out of the equation and not requiring a dedicated reward model (with the LLM serving as the reward model). The process is as follows:
1. Treat a foundational instruction tuned LLM as the reference LLM.
2. Generate pairs of outputs (using say, different token sampling/decoding methods or temperature scaling) to the same prompt and have humans choose which one they like, leading to a dataset of human preferences/feedback.
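3. Initialize the model to be tuned (the policy) from the reference LLM and keep the reference frozen. For each preference pair, compute the log-probability of the preferred and the dispreferred completion under both models, and tune the policy with the DPO loss, which is a binary cross-entropy loss on the difference between the two policy-to-reference log-ratios, scaled by a parameter $\beta$ that controls how far the policy may diverge from the reference. No extra scalar-output layer is needed, because the log-ratios themselves play the role of the reward.
4. The tuned policy is the final LLM fine-tuned on human feedback; the reference model is only needed during training.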

Kahneman-Tversky Optimization (KTO)

Proposed in Human-Centered Loss Functions (HALOs) by Ethayarajh et al. from Stanford and Contextual AI, Kahneman-Tversky Optimization (KTO) is a novel approach to aligning large language models (LLMs) with human feedback.
KTO is a human-centered loss function that directly maximizes the utility of language model generations instead of maximizing the log-likelihood of preferences as current methods do. The approach is named after Daniel Kahneman and Amos Tversky, known for prospect theory, a theory in behavioral economics of how humans make decisions about uncertain outcomes, and KTO builds on its principles.
KTO achieves the goal of generating desirable outputs by using a utility function to guide the training of a language model. This process involves several key steps:
1. Utility Function Definition: A utility function is defined based on the principles of Kahneman-Tversky’s prospect theory. This function assigns a value to each possible output of the language model, indicating its desirability or utility from a human perspective. The utility values can be determined based on factors like relevance, coherence, or adherence to specific criteria.
2. Generating Outputs: During training, the language model generates outputs based on given inputs. These outputs are complete sequences, such as sentences or paragraphs, rather than individual tokens.
3. Evaluating Outputs: Each generated output is evaluated using the utility function. The utility score reflects how desirable or aligned the output is with human preferences or objectives.
4. Optimizing the Model: The model’s parameters are updated to increase the likelihood of generating outputs with higher utility scores. The optimization process aims to maximize the expected utility of the outputs, essentially encouraging the model to produce more desirable results.
5. Iterative Training: This process is iterative, with the model continually generating outputs, receiving utility evaluations, and updating its parameters. Over time, the model learns to produce outputs that are increasingly aligned with the utility function’s assessment of desirability.
In essence, KTO shifts the focus from traditional training objectives, like next-token prediction or fitting to paired preference data, to directly optimizing for outputs that are considered valuable or desirable according to a utility-based framework. This approach can be particularly effective in applications where the quality of the output is subjective or where specific characteristics of the output are valued.
1. What is KTO?
  - KTO is an alignment methodology that leverages the concept of human utility functions as described in prospect theory. It aligns LLMs by directly maximizing the utility of their outputs, focusing on whether an output is considered desirable or not by humans.
  - This method does not require detailed preference pairs for training, which is a departure from many existing alignment methodologies.
2. What Kind of Data Does KTO Require?
  - KTO eliminates the need for paired-preference ranking/comparison data and simplifies data requirements significantly. It only needs binary labels indicating whether an LLM output is desirable or undesirable. Put simply, with its binary feedback requirement, KTO contrasts with methods such as PPO and DPO that require detailed preference pairs.
  - The simplicity in data requirements makes KTO more practical and applicable in real-world scenarios where collecting detailed preference data is challenging.
3. Advantages Over DPO and PPO:
  - Compared to DPO and Proximal Policy Optimization (PPO), KTO offers several advantages:
    - Simplicity in Data Collection: Unlike DPO and PPO, which require paired-preference data (i.e., ranking/comparison data) which is difficult to obtain, KTO operates efficiently with simpler binary feedback on outputs.
    - Practicality in Real-World Application: KTO’s less stringent data requirements make it more suitable for scenarios where collecting detailed preferences is infeasible.
    - Focus on Utility Maximization: KTO aligns with the practical aspects of human utility maximization, potentially leading to more user-friendly and ethically aligned outputs.
4. Results with KTO Compared to DPO and PPO:
  - When applied to models of different scales (from 1B to 30B parameters), KTO has been shown to match or exceed the performance of methods like DPO in terms of alignment quality.
  - KTO, even without supervised finetuning, significantly outperforms other methods at larger scales, suggesting its effectiveness in aligning models in a more scalable and data-efficient manner.
  - In terms of practical utility, the results indicate that KTO can lead to LLM outputs that are better aligned with human preferences and utility considerations, particularly in scenarios where detailed preference data is not available.
KTO operates without paired preference data, focusing instead on maximizing the utility of language model generations based on whether an output is desirable or undesirable. This is different from the traditional approach of next-token prediction and paired preference data used in methods like DPO.
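To illustrate the difference in data format, the sketch below converts paired preference records into the unpaired, binary-labeled examples that KTO consumes, by treating each preferred completion as desirable and each dispreferred one as undesirable. The field names (`prompt`, `chosen`, `rejected`, `completion`, `desirable`) are hypothetical and chosen for this example only; binary labels can equally be collected directly, with no pairs involved.

```python
def pairs_to_binary(pairs):
    """Split each (prompt, chosen, rejected) record into two
    independently labeled examples."""
    examples = []
    for pair in pairs:
        examples.append({
            "prompt": pair["prompt"],
            "completion": pair["chosen"],
            "desirable": True,
        })
        examples.append({
            "prompt": pair["prompt"],
            "completion": pair["rejected"],
            "desirable": False,
        })
    return examples

# Hypothetical paired record, as a DPO-style dataset might store it.
paired = [{
    "prompt": "Summarize the article in one sentence.",
    "chosen": "The article explains how DPO removes the RL step from RLHF.",
    "rejected": "Article.",
}]

binary = pairs_to_binary(paired)
```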
Here’s how KTO functions:
1. Utility-Based Approach: KTO uses a utility function, inspired by Kahneman-Tversky’s prospect theory, to evaluate the desirability of outputs. The utility function assigns a value to each possible output of the language model, reflecting how desirable (or undesirable) that output is from a human perspective.
2. Data Requirement: Unlike DPO, KTO does not need paired comparisons between two outputs. Instead, it requires data that indicates whether a specific output for a given input is considered desirable or not. This data can come from human judgments or predefined criteria.
3. Loss Function: The loss function in KTO is designed to maximize the expected utility of the language model’s outputs. It does this by adjusting the model’s parameters to increase the likelihood of generating outputs that have higher utility values. Note that the KTO loss function is not a binary cross-entropy loss. Instead, it is inspired by prospect theory and is designed to align large language models with human feedback. KTO focuses on human perception of losses and gains, diverging from traditional loss functions like binary cross-entropy that are commonly used in machine learning. This novel approach allows for a more nuanced understanding and incorporation of human preferences and perceptions in the training of language models.
4. Training Process: During training, the language model generates outputs, and the utility function evaluates these outputs. The model’s parameters are then updated to favor more desirable outputs according to the utility function. This process differs from next-token prediction, as it is not just about predicting the most likely next word, but about generating entire outputs that maximize a utility score.
5. Implementation: In practical terms, KTO could be implemented as a fine-tuning process on a pre-trained language model. The model generates outputs, the utility function assesses these, and the model is updated to produce better-scoring outputs over iterations.
KTO is focused more on the overall utility or value of the outputs rather than just predicting the next token. It’s a more holistic approach to aligning a language model with human preferences or desirable outcomes.
In summary, KTO represents a shift towards a more practical and scalable approach to aligning LLMs with human feedback, emphasizing utility maximization and simplicity in data requirements.
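A rough per-example sketch of a KTO-style loss is shown below. It follows the general form given in the KTO paper as best understood here: the implicit reward is the policy-to-reference log-ratio, it is compared against a reference point, and desirable and undesirable examples are pushed in opposite directions through a sigmoid-shaped value function. Treat the details as assumptions: the reference point `kl_ref_point` (an estimate of the KL divergence between policy and reference) is simply passed in as a number, and the names and the default values of `beta`, `lambda_d`, and `lambda_u` are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_example_loss(policy_logp: float,
                     ref_logp: float,
                     desirable: bool,
                     kl_ref_point: float,
                     beta: float = 0.1,
                     lambda_d: float = 1.0,
                     lambda_u: float = 1.0) -> float:
    """KTO-style loss for one (prompt, completion, label) example (sketch).

    policy_logp / ref_logp: log-probability of the completion under the
    policy being tuned and under the frozen reference model.
    """
    reward = policy_logp - ref_logp  # implicit reward (log-ratio)
    if desirable:
        # Value grows as the reward rises above the reference point.
        value = lambda_d * sigmoid(beta * (reward - kl_ref_point))
        return lambda_d - value
    # Value grows as the reward falls below the reference point.
    value = lambda_u * sigmoid(beta * (kl_ref_point - reward))
    return lambda_u - value

# Same completion statistics, opposite labels.
loss_good = kto_example_loss(-8.0, -10.0, desirable=True, kl_ref_point=0.0)
loss_bad = kto_example_loss(-8.0, -10.0, desirable=False, kl_ref_point=0.0)
```

Note that each example contributes to the loss on its own; no second completion for the same prompt is required, which is the practical difference from the paired DPO loss.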

PPO vs. DPO vs. KTO

Kahneman-Tversky Optimization (KTO):
- Function: Adapts the Kahneman-Tversky human value function to the language model setting. It uses this adapted function to directly maximize the utility of model outputs.
- Data Requirement: Does not need paired preference data, only knowledge of whether an output is desirable or undesirable for a given input.
- Practicality: Easier to deploy in real-world scenarios where desirable/undesirable outcome data is more abundant.
- Model Comparison: Matches or exceeds the performance of direct preference optimization methods across various model sizes (from 1B to 30B).
Proximal Policy Optimization (PPO):
- Function: An RL algorithm that optimizes the language model by limiting how far it can drift from a previous version of the model.
- Implementation: Involves sampling generations from the current model, judging them with a reward model, and using this feedback for updates.
- Practical Challenges: Can be slow and unstable, especially in distributed settings.
DPO:
- Function: Minimizes the negative log-likelihood of observed human preferences to align the language model with human feedback.
- Data Requirement: Requires paired preference data.
- Comparison with KTO: While DPO has been effective, KTO offers competitive or superior performance without the need for paired preferences.

Aspect	Proximal Policy Optimization (PPO)	DPO	Kahneman-Tversky Optimization (KTO)
Objective	Maximizes expected reward while preventing large policy updates (clipped objective function).	Directly optimizes policy based on human preferences, using a binary classification objective (using a KL-divergence constraint).	Aligns models by maximizing the utility of LLM generations based on prospect theory, without requiring detailed preference pairs.
Input	States and rewards from the environment.	States from the environment and human preference feedback.	LLM outputs with binary labels indicating desirable or undesirable outcomes.
Output	Actions to be taken in the environment.	Actions to be taken in the environment, aligned with human preferences.	LLM generations aligned with simplified human utility functions.
Learning Mechanism	Policy gradients with a clipped surrogate objective to update policy and value networks.	Binary cross-entropy optimization on human preference data, updating a single policy network.	Optimization based on the alignment of LLM outputs with binary feedback, not requiring complex preference models.
Network Components	Separate policy and value networks.	A single policy network.	LLM framework, adapted for KTO methodology.
Feedback Mechanism	Uses rewards from the environment as feedback for learning.	Uses human preference data as direct feedback for learning.	Utilizes binary feedback on LLM outputs to guide alignment without complex preference data.
Stability	Clipping mechanism in objective function to maintain stability in policy updates.	Inherent stability by directly optimizing preferences with dynamic per-example importance weighting.	Achieves stable alignment by simplifying the feedback mechanism and focusing on utility maximization.
Complexity	More complex due to dual network structure and balancing reward maximization with policy update stability.	Simpler, as it bypasses explicit reward modeling and directly optimizes policy from human preferences.	Reduces complexity by eliminating the need for detailed preference modeling, focusing instead on binary utility optimization.
Applicability	Suitable for a wide range of RL environments where reward signals are available.	Particularly effective in scenarios where aligning with human preferences is crucial.	Especially useful in scenarios where rapid and simplified alignment with human feedback is desired.
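The “clipped surrogate objective” in the PPO column can be sketched for a single action as follows. This is a minimal illustration of the clipping rule only (no value network, advantage estimation, or batching); `clip_eps=0.2` is a commonly used value and is an assumption here.

```python
def ppo_clipped_objective(ratio: float, advantage: float,
                          clip_eps: float = 0.2) -> float:
    """Clipped surrogate objective for one sample (to be maximized).

    ratio: new_policy_prob / old_policy_prob for the sampled action.
    advantage: estimate of how much better the action was than average.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, ratio already large: the gain is capped at 1.2 * advantage.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)

# Bad action, ratio already small: the objective is capped at 0.8 * advantage.
capped_penalty = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
```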

Bias Concerns and Mitigation Strategies

A fair question to ask now is whether RLHF/RLAIF can add bias to the model. This is an important topic as large conversational language models are being deployed in various applications, from search engines (Bing Chat, Google’s Bard) to document editors (Microsoft Office Copilot, Google Docs, Notion, etc.).
The answer is, yes, just as with any machine learning approach with human input, RLHF has the potential to introduce bias.
Let’s look at the different forms of bias it can introduce:
- Selection bias:
  - RLHF relies on feedback from human evaluators, who may have their own biases and preferences (and can thus limit their feedback to topics or situations they can relate to). As such, the agent may not be exposed to the true range of behaviors and outcomes that it will encounter in the real world.
- Confirmation bias:
  - Human evaluators may be more likely to provide feedback that confirms their existing beliefs or expectations, rather than providing objective feedback based on the agent’s performance.
  - This can lead to the agent being reinforced for certain behaviors or outcomes that may not be optimal or desirable in the long run.
- Inter-rater variability:
  - Different human evaluators may have different opinions or judgments about the quality of the agent’s performance, leading to inconsistency in the feedback that the agent receives.
  - This can make it difficult to train the agent effectively and can lead to suboptimal performance.
- Limited feedback:
  - Human evaluators may not be able to provide feedback on all aspects of the agent’s performance, leading to gaps in the agent’s learning and potentially suboptimal performance in certain situations.
Now that we’ve seen the different types of bias possible with RLHF, let’s look at ways to mitigate them:
- Diverse evaluator selection:
  - Selecting evaluators with diverse backgrounds and perspectives can help to reduce bias in the feedback, just as it does in the workplace.
  - This can be achieved by recruiting evaluators from different demographic groups, regions, or industries.
- Consensus evaluation:
  - Using consensus evaluation, where multiple evaluators provide feedback on the same task, can help to reduce the impact of individual biases and increase the reliability of the feedback.
  - This is almost like ‘normalizing’ the evaluation.
- Calibration of evaluators:
  - Calibrating evaluators by providing them with training and guidance on how to provide feedback can help to improve the quality and consistency of the feedback.
- Evaluation of the feedback process:
  - Regularly evaluating the feedback process, including the quality of the feedback and the effectiveness of the training process, can help to identify and address any biases that may be present.
- Evaluation of the agent’s performance:
  - Regularly evaluating the agent’s performance on a variety of tasks and in different environments can help to ensure that it is not overfitting to specific examples and is capable of generalizing to new situations.
- Balancing the feedback:
  - Balancing the feedback from human evaluators with other sources of feedback, such as self-play or expert demonstrations, can help to reduce the impact of bias in the feedback and improve the overall quality of the training data.
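As one possible way to implement consensus evaluation, the sketch below aggregates several evaluators’ votes on the same comparison by majority vote and reports the agreement rate, so that low-agreement items can be reviewed or dropped. The labels `"A"` and `"B"` and the tie-handling rule are assumptions made for this example.

```python
from collections import Counter

def consensus_label(votes):
    """Return (majority_label, agreement) for one comparison.

    votes: list of labels from different evaluators, e.g. ["A", "A", "B"].
    agreement is the fraction of evaluators who chose the top label.
    Ties return None as the label so they can be flagged for review.
    """
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement
    return top_label, agreement

label, agreement = consensus_label(["A", "A", "B"])
tied_label, tied_agreement = consensus_label(["A", "B"])
```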

TRL - Transformer Reinforcement Learning

The trl library is a full-stack library to fine-tune and align transformer language and diffusion models using methods such as Supervised Fine-Tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).
The library is built on top of the transformers library and thus allows you to use any model architecture available there.

Selected Papers

OpenAI’s Paper on InstructGPT: Training language models to follow instructions with human feedback

Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.
Ouyang et al. (2022) from OpenAI introduce InstructGPT, a model that aligns language models with user intent on a wide range of tasks by fine-tuning with human feedback.
Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, they collect a dataset of labeler demonstrations of the desired model behavior, which they use to fine-tune GPT-3 using supervised fine-tuning (SFT). This process is referred to as “instruction tuning” by other papers such as Wei et al. (2022).
They then collect a dataset of rankings of model outputs, which they use to further fine-tune this supervised model using RLHF.
In human evaluations on their prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, even though the InstructGPT model has over 100x fewer parameters.
Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, their results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
It is important to note that ChatGPT is trained using the same methods as InstructGPT (using SFT followed by RLHF), but is fine-tuned from a model in the GPT-3.5 series.
Furthermore, the fine-tuning process proposed in the paper isn’t without its challenges. First, we need a significant volume of demonstration data. For instance, in the InstructGPT paper, they used 13k instruction-output samples for supervised fine-tuning, 33k output comparisons for reward modeling, and 31k prompts without human labels as input for RLHF. Second, fine-tuning comes with an “alignment tax”, a form of negative transfer: the process can lead to lower performance on certain critical tasks. (There’s no free lunch after all.) The same InstructGPT paper found that RLHF led to performance regressions (relative to the GPT-3 base model) on public NLP tasks like SQuAD, HellaSwag, and WMT 2015 French to English. A potential workaround is to have several smaller, specialized models that excel at narrow tasks.
The figure below from the paper illustrates the three steps of training InstructGPT: (1) SFT, (2) reward model training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train the respective model in the diagram. In Step 2, boxes A-D are samples from the SFT model that get ranked by labelers.

Constitutional AI: Harmlessness from AI Feedback

The paper extends RLHF by training language models on datasets labeled for helpfulness and harmlessness. It introduces ‘HH’ models, which are trained on both criteria and have been shown to be more harmless and better at following instructions than models trained on helpfulness alone.
An evaluation of these models’ ability to identify harmful behavior in language model interactions was conducted using a set of conversations rated for harmfulness. The study leveraged ‘red teaming’ where humans attempted to provoke the AI into harmful responses, thereby improving the training process.
The effectiveness of the training method was demonstrated through models’ performance on questions assessing helpfulness, honesty, and harmlessness, without relying on human labels for harmlessness.
This research aligns with other efforts like LaMDA and InstructGPT, which also utilize human data to train language models. The concept of ‘constitutional AI’ was introduced, focusing on self-critique and revision by the AI to foster both harmless and helpful interactions. The ultimate goal is to create AI that can self-regulate harmfulness while remaining helpful and responsive.

OpenAI’s Paper on PPO: Proximal Policy Optimization Algorithms

Schulman et al. (2017) propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.
Whereas standard policy gradient methods perform one gradient update per data sample, they propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which they call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).
Their experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, showing that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall clock time.
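To make the “surrogate” objective concrete, here is a minimal sketch of the clipped variant of the PPO objective for a batch of samples. The function name, the plain-list inputs, and the default `clip_eps=0.2` are illustrative choices rather than anything prescribed by this article; a real implementation would operate on tensors and add value-function and entropy terms.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and the policy that collected the data.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio pi_new / pi_old
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to move the ratio
        # outside [1 - eps, 1 + eps] when that would increase the objective.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

Because the clipped objective stays well-behaved when the same batch is reused, it is what permits several epochs of minibatch updates per batch of sampled data.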

A General Language Assistant as a Laboratory for Alignment

This paper by Askell et al. from Anthropic introduces a comprehensive study towards aligning general-purpose, text-based AI systems with human values, focusing on making AI helpful, honest, and harmless (HHH). Given the capabilities of large language models, the authors investigate various alignment techniques and their evaluations to ensure these models adhere to human preferences without compromising performance.
The authors begin by examining naive prompting as a baseline for alignment, finding that the benefits from such interventions increase with model size and generalize across multiple alignment evaluations. Prompting was shown to impose negligible performance costs (‘alignment taxes’) on large models. The paper also explores the scaling trends of several training objectives relevant to alignment, including imitation learning, binary discrimination, and ranked preference modeling. The results indicate that ranked preference modeling significantly outperforms imitation learning and scales more favorably with model size, while binary discrimination performs similarly to imitation learning.
A key innovation discussed is ‘preference model pre-training’ (PMP), which aims to improve the sample efficiency of fine-tuning models on human preferences. This involves pre-training on large public datasets that encode human preferences, such as Stack Exchange, Reddit, and Wikipedia edits, before fine-tuning on smaller, more specific datasets. The findings suggest that PMP substantially enhances sample efficiency and often improves asymptotic performance when fine-tuning on human feedback datasets.
Implementation Details:
- Prompts and Context Distillation: The authors utilize a prompt composed of 14 fictional conversations to induce the HHH criteria in models. They introduce ‘context distillation,’ a method where the model is fine-tuned using the KL divergence between the model’s predictions and the distribution conditioned on the prompt context. This technique effectively transfers the prompt’s conditioning into the model.
- Training Objectives:
  - Imitation Learning: Models are trained to imitate ‘good’ behavior using supervised learning on sequences labeled as correct or desirable.
  - Binary Discrimination: Models distinguish between ‘correct’ and ‘incorrect’ behavior by training on pairs of correct and incorrect samples.
  - Ranked Preference Modeling: Models are trained to assign higher scores to better samples from ranked datasets using pairwise comparisons, a more complex but effective approach for capturing preferences.
- Preference Model Pre-Training (PMP): The training pipeline includes a PMP stage where models are pre-trained on binary discriminations sourced from Stack Exchange, Reddit, and Wikipedia edits. This stage significantly enhances sample efficiency during subsequent fine-tuning on smaller datasets.
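As a rough illustration of the ranked preference modeling objective above, the sketch below computes a pairwise loss that is small when the better sample receives the higher score. The specific form, $\log(1 + e^{r_{\text{worse}} - r_{\text{better}}})$, is a common choice for pairwise comparisons and is assumed here for illustration; the function name and scalar inputs are hypothetical.

```python
import math

def pairwise_preference_loss(score_better, score_worse):
    """log(1 + exp(score_worse - score_better)), computed stably.

    The scores are scalar outputs of a preference model for the preferred
    and the dispreferred sample of one comparison.
    """
    diff = score_worse - score_better
    # Stable softplus: max(x, 0) + log(1 + exp(-|x|))
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

When both samples get the same score the loss equals $\log 2$, and it approaches zero as the margin in favor of the better sample grows.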
Results:
- Prompting: Simple prompting significantly improves model performance on alignment evaluations, including HHH criteria and toxicity reduction. Prompting and context distillation both decrease toxicity in generated text as model size increases.
- Scaling Trends: Ranked preference modeling outperforms imitation learning, especially on tasks with ranked data like summarization and HellaSwag. Binary discrimination shows little improvement over imitation learning.
- Sample Efficiency: PMP dramatically increases the sample efficiency of fine-tuning, with larger models benefiting more from PMP than smaller ones. Binary discrimination during PMP is found to transfer better than ranked preference modeling.
The figure below from the paper shows: (Left) Simple prompting significantly improves performance and scaling on the authors’ HHH alignment evaluations (the y-axis measures accuracy at choosing the better response on these evaluations). (Right) Prompts impose little or no ‘alignment tax’ on large models, even on complex evaluations like function synthesis. Here, the authors evaluated their Python code models on the Codex HumanEval dataset at temperature T = 0.6 and top P = 0.95.

The study demonstrates that simple alignment techniques like prompting can lead to meaningful improvements in AI behavior, while more sophisticated methods like preference modeling and PMP offer scalable and efficient solutions for aligning large language models with human values.

Anthropic’s Paper on Constitutional AI: Constitutional AI: Harmlessness from AI Feedback

As AI systems become more capable, we would like to enlist their help to supervise other AIs.
Bai et al. (2022) experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so they refer to the method as ‘Constitutional AI’.
The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase they sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, they sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences.
They then train with RL using the preference model as the reward signal, i.e. they use ‘RL from AI Feedback’ (RLAIF). As a result they are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
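The supervised phase described above (sample, self-critique, revise, then fine-tune on the revision) can be sketched as a small loop. Everything here is an assumption made for illustration: `model` is a hypothetical callable mapping a text prompt to a text completion, the prompt templates are invented, and the principles are applied in the order given rather than sampled.

```python
def critique_and_revise(model, prompt, principles):
    """Build one supervised fine-tuning example via self-critique and revision.

    model: hypothetical callable, str -> str (not a real library API).
    principles: natural-language rules drawn from the 'constitution'.
    """
    response = model(f"Prompt: {prompt}\nResponse:")
    for principle in principles:
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}\n"
            "Critique:"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique.\nRevision:"
        )
    # The original model is then fine-tuned on (prompt, final revision) pairs.
    return {"prompt": prompt, "completion": response}
```

Each principle costs two extra model calls (one critique, one revision), so the number of revision rounds trades compute against how far the final response moves from the initial sample.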
The figure below from the paper shows the basic steps of their Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a Reinforcement Learning (RL) stage, shown as the sequence of steps at the bottom of the figure. Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model, and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.

The graph below shows harmlessness versus helpfulness Elo scores (higher is better, only differences are meaningful) computed from crowdworkers’ model comparisons for all 52B RL runs. Points further to the right are later steps in RL training. The Helpful and HH models were trained with human feedback as in [Bai et al., 2022], and exhibit a tradeoff between helpfulness and harmlessness. The RL-CAI models trained with AI feedback learn to be less harmful at a given level of helpfulness. The crowdworkers evaluating these models were instructed to prefer less evasive responses when both responses were equally harmless; this is why the human feedback-trained Helpful and HH models do not differ more in their harmlessness scores.

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

This paper by Lee et al. from Google Research introduces a novel method for training large language models (LLMs) with AI-generated feedback, addressing the challenges and costs associated with traditional human feedback methods.
The paper presents Reinforcement Learning from AI Feedback (RLAIF) as a promising alternative to the conventional RLHF. RLAIF utilizes an off-the-shelf LLM as a preference labeler, streamlining the training process and, in some cases, surpassing the performance of models trained with human feedback.
This approach is applied to text generation tasks such as summarization, helpful dialogue generation, and harmless dialogue generation. The performance of RLAIF, as assessed by human raters, is comparable or superior to RLHF, challenging the assumption that larger policy models are always more effective.
A key advantage of RLAIF is its potential to significantly reduce reliance on expensive human annotations. The study shows the efficacy of using the same model size for both the LLM labeler and the policy model, and highlights that directly prompting the LLM for reward scores can be more effective than using a distilled reward model.
The authors explore methodologies for generating AI preferences aligned with human values, emphasizing the effectiveness of chain-of-thought reasoning and a detailed preamble in improving AI labeler alignment.
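One way an off-the-shelf LLM can act as a preference labeler is to read off the log-probabilities it assigns to the two answer tokens (for example “1” and “2”) and normalize them into a soft preference. The sketch below assumes a hypothetical `score_fn(prompt, first, second)` that returns those two log-probabilities; averaging over both presentation orders is included as one simple way to reduce position bias, and is an illustrative design choice here.

```python
import math

def soft_preference(logp_first, logp_second):
    """Softmax over the two label log-probs: P(first response preferred)."""
    m = max(logp_first, logp_second)  # subtract the max for stability
    e1, e2 = math.exp(logp_first - m), math.exp(logp_second - m)
    return e1 / (e1 + e2)

def ai_preference(score_fn, prompt, response_a, response_b):
    """P(response_a preferred), averaged over both presentation orders.

    score_fn is a hypothetical stand-in for querying the labeler LLM; it
    returns (log-prob the first shown response is better,
             log-prob the second shown response is better).
    """
    p_ab = soft_preference(*score_fn(prompt, response_a, response_b))
    p_ba = soft_preference(*score_fn(prompt, response_b, response_a))
    return 0.5 * (p_ab + (1.0 - p_ba))
```

A labeler that always favors whichever response is shown first yields exactly 0.5 after averaging, so the position preference cancels instead of leaking into the reward signal.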
The following figure from the paper shows a diagram depicting RLAIF (top) vs. RLHF (bottom).

RLAIF’s scalability and cost-effectiveness are notable, with the approach being over ten times cheaper than human annotation. This aligns with the growing trend in LLM research focusing on quality over quantity in datasets.
The paper suggests that combining RLHF and RLAIF could be a strategic approach, especially considering that LLMs like GPT-4 have been trained with human feedback. This hybrid model could represent a balanced integration of high-quality human data, amplified significantly by AI, potentially shaping the future of LLM training and influencing approaches like the development of GPT-5.

A General Theoretical Paradigm to Understand Learning from Human Preferences

This paper by Azar et al. from Google DeepMind delves into the theoretical underpinnings of learning from human preferences, particularly focusing on reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). The authors propose a novel objective, $\Psi$-preference optimization ($\Psi$PO), which encompasses RLHF and DPO as specific instances, aiming to optimize policies directly from human preferences without relying on the approximations common in existing methods.
RLHF typically involves a two-step process where a reward model is first trained using a binary classifier to distinguish preferred actions, often employing a Bradley-Terry model for this purpose. This is followed by policy optimization to maximize the learned reward while ensuring the policy remains close to a reference policy through KL regularization. DPO, in contrast, seeks to optimize the policy directly from human preferences, eliminating the need for explicit reward model training.
The $\Psi$PO framework is a more general approach that seeks to address the potential overfitting issues inherent in RLHF and DPO by considering pairwise preferences and employing a possibly non-linear function of preference probabilities alongside KL regularization. Specifically, the Identity-PO (IPO) variant of $\Psi$PO is highlighted for its practicality and theoretical appeal, as it allows for direct optimization from preferences without the approximations used in other methods.
Empirical demonstrations show that IPO can effectively learn from preferences without succumbing to the overfitting problems identified in DPO, providing a robust method for preference-based policy optimization. The paper suggests that future work could explore scaling these theoretical insights to more complex settings, such as training language models on human preference data.
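The difference between DPO and IPO is easiest to see on a single preference pair. Both depend on the same quantity $h$, the log-likelihood-ratio margin between the preferred and dispreferred responses relative to the reference policy. The sketch below uses the standard per-pair forms of the two losses as commonly stated ($-\log\sigma(\beta h)$ for DPO and $(h - \frac{1}{2\tau})^2$ for IPO); function and argument names are illustrative.

```python
import math

def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """h = log(pi(y_w)/pi_ref(y_w)) - log(pi(y_l)/pi_ref(y_l))."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

def dpo_loss(h, beta):
    """-log sigmoid(beta * h), written as a stable softplus."""
    x = -beta * h
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def ipo_loss(h, tau):
    """Squared distance of the margin h from the finite target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2
```

The DPO loss keeps decreasing as $h$ grows without bound, which is one way to read the overfitting concern, whereas the IPO loss is minimized at a finite margin set by the regularization strength $\tau$ and penalizes overshooting it.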

SLiC-HF: Sequence Likelihood Calibration with Human Feedback

This paper by Zhao et al. from Google DeepMind and Google Research introduces Sequence Likelihood Calibration with Human Feedback (SLiC-HF) as a method for aligning language models with human preferences using human feedback data. SLiC-HF is showcased as an effective, simpler, and more computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF), particularly for the task of TL;DR summarization.
SLiC-HF operates by calibrating the sequence likelihood of a Supervised Fine-Tuning (SFT) model against human feedback data, either directly or through a ranking model derived from human judgments. This is in contrast to traditional RLHF approaches that rely on optimizing a language model using a reward model trained on human preferences.
The paper details two ways of applying SLiC-HF: applying it directly to the human feedback data without a separate ranking/reward model (SLiC-HF-direct), and a sample-and-rank approach that uses either a reward model or a ranking model (SLiC-HF-sample-rank). Specifically, to determine the rank, they consider two text-to-text models trained from the human preference data:
- Trained Pointwise Reward model: They binarize each ranked pair into a positive and a negative sequence, as shown in the figure below. When training the reward model, input sequences are formatted as ‘[Context] … [Summary] …’ and target sequences are either ‘Good’ or ‘Bad’. At inference time, they compute the probability of the token ‘Good’ on the decoder side to score each of the $m$ candidates in a list, and sample $m$ positive/negative pairs from them.
- Trained Pairwise Ranking model: As shown in the figure below, they formulate the human feedback as a pairwise ranking problem in text-to-text format. When training the ranking model, input sequences are formatted as ‘[Context] … [Summary A] … [Summary B]’ and target sequences are either ‘A’ or ‘B’. At inference time, they use a tournament-style procedure to rank candidates in a list. For example, given a list of 4 candidates $c_1$, $c_2$, $c_3$, $c_4$, they first rank $c_1$ against $c_2$ and $c_3$ against $c_4$, and then rank winner$(c_1, c_2)$ against winner$(c_3, c_4)$. Given $m$ candidates, the ranking model is called $m - 1$ times and $m - 1$ positive/negative pairs are yielded.
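The tournament-style procedure can be sketched as a single-elimination bracket. In this sketch, `prefer(a, b)` is a hypothetical stand-in for one call to the pairwise ranking model (returning `True` when the first candidate wins), and the handling of an odd candidate out (a bye to the next round) is an assumption made so the code works for any list length.

```python
def tournament_pairs(candidates, prefer):
    """Single-elimination tournament over candidates.

    prefer(a, b) -> True if a is preferred over b (one ranking-model call).
    Returns (overall winner, list of (positive, negative) pairs).
    """
    pairs = []
    current = list(candidates)
    while len(current) > 1:
        next_round = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            winner, loser = (a, b) if prefer(a, b) else (b, a)
            pairs.append((winner, loser))
            next_round.append(winner)
        if len(current) % 2 == 1:
            next_round.append(current[-1])  # odd one out gets a bye
        current = next_round
    return current[0], pairs
```

Every comparison eliminates exactly one candidate, so $m$ candidates always take $m - 1$ calls and produce $m - 1$ positive/negative pairs, matching the count given above.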
The following figure from the paper shows the data format for training the text-to-text reward model and ranking model.

Extensive experiments demonstrate that SLiC-HF significantly improves upon SFT baselines and offers competitive performance to RLHF-PPO implementations. The experiments involved automatic and human evaluations, focusing on the Reddit TL;DR summarization task. Results showed SLiC-HF’s capability to produce high-quality summaries, with improvements observed across different configurations and parameter scales.
The paper contributes to the field by providing a detailed methodology for implementing SLiC-HF, showcasing its efficiency and effectiveness compared to traditional RLHF methods. It also demonstrates the viability of leveraging off-policy human feedback data, thus potentially reducing the need for costly new data collection efforts.
Further discussions in the paper explore the computational and memory efficiency advantages of SLiC-HF over RLHF-PPO, highlighting the former’s scalability and potential for broader application in language generation tasks. The paper concludes with suggestions for future research directions, including exploring other reward functions and non-human feedback mechanisms for language model calibration.

Reinforced Self-Training (ReST) for Language Modeling

RLHF can improve the quality of large language models’ (LLMs’) outputs by aligning them with human preferences.
This paper by Gulcehre et al. from Google DeepMind and Google Research proposes Reinforced Self-Training (ReST), a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL).
ReST generates samples from an initial LLM policy to create a dataset, which is then used to improve the LLM policy using offline RL algorithms. This method is more efficient than traditional online RLHF methods due to offline production of the training dataset, facilitating data reuse.
ReST operates in two loops: the inner loop (Improve) and the outer loop (Grow).
- Grow: The LLM policy generates multiple output predictions per context, augmenting the training dataset.
- Improve: The augmented dataset is ranked and filtered using a scoring function based on a learned reward model trained on human preferences. The model is then fine-tuned on this filtered dataset with an offline RL objective, with the possibility of repeating this step with increasing filtering thresholds.
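The two loops can be illustrated with a deliberately tiny toy in which the “policy” is a table of output probabilities per context and “fine-tuning” is simply refitting that table to the filtered samples (a behavioral-cloning-style update). All names (`grow`, `improve`, `rest`), the tabular policy, and the hyperparameters are assumptions for illustration; an actual system would sample from and fine-tune an LLM.

```python
import random
from collections import Counter

def grow(policy, contexts, samples_per_context, rng):
    """Grow: sample several outputs per context from the current policy."""
    dataset = []
    for ctx in contexts:
        outputs, weights = zip(*policy[ctx].items())
        for y in rng.choices(outputs, weights=weights, k=samples_per_context):
            dataset.append((ctx, y))
    return dataset

def improve(policy, dataset, reward_fn, threshold):
    """Improve: keep samples with reward >= threshold, then refit the policy."""
    kept = [(c, y) for c, y in dataset if reward_fn(c, y) >= threshold]
    new_policy = dict(policy)
    for ctx in policy:
        counts = Counter(y for c, y in kept if c == ctx)
        if counts:  # leave the context unchanged if nothing survived
            total = sum(counts.values())
            new_policy[ctx] = {y: n / total for y, n in counts.items()}
    return new_policy

def rest(policy, contexts, reward_fn, thresholds, grow_steps,
         samples_per_context=64, seed=0):
    rng = random.Random(seed)
    for _ in range(grow_steps):                  # outer loop: Grow
        dataset = grow(policy, contexts, samples_per_context, rng)
        for threshold in thresholds:             # inner loop: Improve
            policy = improve(policy, dataset, reward_fn, threshold)
    return policy
```

Note that each Grow step produces one dataset that is reused across several Improve steps with increasing thresholds, which is where the data-reuse efficiency comes from.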
The following image from the paper illustrates the ReST method. During the Grow step, a policy generates a dataset. At the Improve step, the filtered dataset is used to fine-tune the policy. Both steps are repeated, with the Improve step repeated more frequently to amortise the dataset creation cost.

ReST’s advantages include reduced computational burden, independence from the original dataset’s quality, and simplicity in implementation.
Machine translation was chosen as the application for testing ReST, due to strong baselines and well-defined evaluation procedures. Experiments were conducted on IWSLT 2014, WMT 2020 benchmarks, and an internal high-fidelity benchmark called Web Domain. The evaluation used state-of-the-art reference-free reward models like Metric X, BLEURT, and COMET. ReST significantly improved reward model scores and translation quality on test and validation sets, as per both automated metrics and human evaluation.
ReST outperformed standard supervised learning (BC G=0 I=0) in reward model scores and human evaluations. The BC loss (Behavioral Cloning) was found to be the most effective for ReST, leading to continuous improvements in the model’s reward on holdout sets. However, improvements in reward model scores did not always align with human preferences.
ReST showed better performance over supervised training across different datasets and language pairs. The inclusion of multiple Improve steps and Grow steps resulted in significant improvements in performance. Human evaluations showed that all ReST variants significantly outperformed the BC baseline.
ReST is distinct from other self-improvement algorithms in language modeling due to its computational efficiency and ability to leverage exploration data and rewards. The approach is applicable to various language tasks, including summarization, dialogue, and other generative models.
Future work includes fine-tuning reward models on subsets annotated with human preferences and exploring better RL exploration strategies.

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Training language models typically requires vast quantities of human-generated text, which can be scarce or of variable quality, especially for specialized domains like mathematics or programming. This scarcity limits the model’s ability to learn diverse patterns and hinders its performance. $ReST_{EM}$ addresses this problem by reducing the reliance on human-curated datasets and instead exploring the potential of fine-tuning models using self-generated data validated through scalar feedback mechanisms.
This paper by Singh et al. from Google DeepMind, presented at NeurIPS 2023, explores a new frontier in Large Language Model (LLM) training: Reinforced Self-Training based on expectation-maximization ($ReST_{EM}$). This innovative approach aims to reduce reliance on human data while avoiding the pitfalls of a synthetic data death spiral, a trend becoming increasingly evident in LLM training.
$ReST_{EM}$ is a potent alternative to traditional dataset curation, comprising two primary stages: generating multiple output samples (E-step) and fine-tuning the language model on these samples (M-step). This process is cyclically iterated, combining the generation of model-derived answers and their subsequent refinement. The feedback for filtering these outputs is sourced from tasks with binary feedback, such as math problems with clear right or wrong answers.
The paper’s focus is on two challenging domains: advanced mathematical problem-solving (MATH) and code generation (APPS). Utilizing PaLM 2 models of various scales, the study demonstrates that $ReST_{EM}$ significantly outperforms models fine-tuned solely on human-generated data, offering up to 2x performance boosts. This indicates a major step toward more independent AI systems, seeking less human input for skill refinement.
$ReST_{EM}$ employs an iterative self-training process leveraging expectation-maximization. It first generates outputs from the language model, then applies a filtering mechanism based on binary correctness feedback—essentially sorting the wheat from the chaff. Subsequently, the model is fine-tuned using these high-quality, self-generated samples. This cycle is repeated several times, thus iteratively enhancing the model’s accuracy and performance on tasks by self-generating and self-validating the training data.
Notably, the experiments revealed diminishing returns beyond a certain number of $ReST_{EM}$ iterations, suggesting potential overfitting issues. Ablation studies further assessed the impact of dataset size, the number of model-generated solutions, and the number of iterations on the effectiveness of $ReST_{EM}$.
The models fine-tuned using $ReST_{EM}$ showed enhanced performance on related but distinct benchmarks like GSM8K, Hungarian HS finals, and Big-Bench Hard tasks, without any noticeable degradation in broader capabilities. This finding underscores the method’s versatility and generalizability.
The following figure from the paper shows Pass@K results for PaLM-2-L pretrained model as well as model fine-tuned with $ReST_{EM}$. For a fixed number of samples $K$, fine-tuning with $ReST_{EM}$ substantially improves Pass@K performance. They set temperature to 1.0 and use nucleus sampling with $p = 0.95$.
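For readers unfamiliar with the metric, Pass@K is the probability that at least one of $K$ sampled solutions is correct. A widely used unbiased estimator draws $n \geq K$ samples per problem, counts the $c$ correct ones, and computes $1 - \binom{n-c}{K} / \binom{n}{K}$. Whether the paper uses exactly this estimator is an assumption here; the function below is a sketch of that standard formula.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate from n samples of which c are correct (k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, the estimate is 0.25 for $K = 1$ and 0.5 for $K = 2$. Per-problem estimates are then averaged over the benchmark.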

While $ReST_{EM}$ offers significant advantages in performance, it necessitates a moderate-sized training set of problems or prompts and access to a manually-designed or learned reward function. It’s highly data-efficient but requires careful application to prevent overfitting.
This research opens new avenues for self-improvement in language models, suggesting the need for automating manual parts of the pipeline and exploring algorithmic improvements to further enhance performance. With $ReST_{EM}$ showing promising results, especially in larger models, one can anticipate further exploration in applying self-training techniques to various other domains beyond math and coding tasks. The significant improvement over fine-tuning on human data implies that future models can be made more efficient, less reliant on extensive datasets, and potentially achieve better performance.

Diffusion Model Alignment Using Direct Preference Optimization

This paper by Wallace et al. from Salesforce AI and Stanford University proposes a novel method for aligning diffusion models to human preferences.
The paper introduces Diffusion-DPO, a method adapted from DPO, for aligning text-to-image diffusion models with human preferences. This approach is a significant shift from typical language model training, emphasizing direct optimization on human comparison data.
Unlike typical methods that fine-tune pre-trained models using curated images and captions, Diffusion-DPO directly optimizes a policy that best satisfies human preferences under a classification objective. It re-formulates DPO to account for a diffusion model notion of likelihood using the evidence lower bound, deriving a differentiable objective.
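A rough per-pair sketch of the resulting objective is given below. Because the exact likelihood of a diffusion model is intractable, the log-likelihood ratios in DPO are replaced by differences in denoising (noise-prediction) error at a sampled timestep, for both the trained model and the frozen reference. Folding the timestep weighting and constant factors into a single `beta` is a simplifying assumption of this sketch, as are the function and argument names.

```python
import math

def diffusion_dpo_loss(err_w, ref_err_w, err_l, ref_err_l, beta):
    """Per-pair Diffusion-DPO-style loss at one sampled timestep.

    err_w / err_l: squared noise-prediction error of the trained model on the
    preferred (w) and dispreferred (l) image; ref_err_*: the same errors
    under the frozen reference model.
    """
    model_diff = err_w - err_l
    ref_diff = ref_err_w - ref_err_l
    x = beta * (model_diff - ref_diff)
    # -log sigmoid(-x), written as a stable softplus(x)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss falls below $\log 2$ when the trained model denoises the preferred image better (relative to the reference) than it denoises the dispreferred one, which is the diffusion analogue of increasing the preferred response’s likelihood ratio.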
The authors utilized the Pick-a-Pic dataset, comprising 851K crowdsourced pairwise preferences, to fine-tune the base model of the Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. The fine-tuned model showed significant improvements over both the base SDXL-1.0 and its larger variant in terms of visual appeal and prompt alignment, as evaluated by human preferences.
The paper also explores a variant of the method that uses AI feedback, showing comparable performance to training on human preferences. This opens up possibilities for scaling diffusion model alignment methods.
The figure below from the paper illustrates: (Top) DPO-SDXL significantly outperforms SDXL in human evaluation. (L) PartiPrompts and (R) HPSv2 benchmark results across three evaluation questions, majority vote of 5 labelers. (Bottom) Qualitative comparisons between SDXL and DPO-SDXL. DPO-SDXL demonstrates superior prompt following and realism. DPO-SDXL outputs are better aligned with human aesthetic preferences, favoring high contrast, vivid colors, fine detail, and focused composition. They also capture fine-grained textual details more faithfully.

Experiments demonstrate the effectiveness of Diffusion-DPO in various scenarios, including image-to-image editing and learning from AI feedback. The method significantly outperforms existing models in human evaluations for general preference, visual appeal, and prompt alignment.
The paper’s findings indicate that Diffusion-DPO can effectively increase measured human appeal across an open vocabulary with stable training, without increased inference time, and improves generic text-image alignment.
The authors note ethical considerations and risks associated with text-to-image generation, emphasizing the importance of diverse and representative sets of labelers and the potential biases inherent in the pre-trained models and labeling process.
In summary, the paper presents a groundbreaking approach to align diffusion models with human preferences, demonstrating notable improvements in visual appeal and prompt alignment. It highlights the potential of direct preference optimization in the realm of text-to-image diffusion models and opens avenues for further research and application in this field.

Human-Centered Loss Functions (HALOs)

This report by Ethayarajh et al. from Stanford University presents a novel approach to aligning large language models (LLMs) with human feedback, building upon Kahneman & Tversky’s prospect theory. The proposed Kahneman-Tversky Optimization (KTO) loss function diverges from existing methods by not requiring paired preference data, relying instead on the knowledge of whether an output is desirable or undesirable for a given input. This makes KTO significantly easier to deploy in real-world scenarios where such data is more abundant.
The report identifies that existing methods for aligning LLMs with human feedback can be seen as human-centered loss functions, which implicitly model some of the distortions in human perception as suggested by prospect theory. By adopting this perspective, the authors derive a HALO that maximizes the utility of LLM generations directly, rather than relying on maximizing the log-likelihood of preferences, as current methods do.
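To give a feel for how a loss can work from unpaired desirable/undesirable labels, here is a per-example sketch in the spirit of KTO. It assumes the commonly stated form in which the implied reward is the log-ratio between the policy and the reference model, compared against a reference point `kl_ref` (an estimate of the policy-to-reference KL divergence), with separate weights for desirable and undesirable examples. The exact parameterization and the default values are assumptions of this sketch.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def kto_style_loss(log_ratio, desirable, kl_ref,
                   beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one (input, output, desirable?) example; no paired data needed.

    log_ratio: log pi(y|x) - log pi_ref(y|x) for the labeled output.
    kl_ref: reference point (assumed to be supplied by the caller).
    """
    if desirable:
        value = lambda_d * _sigmoid(beta * (log_ratio - kl_ref))
        return lambda_d - value
    value = lambda_u * _sigmoid(beta * (kl_ref - log_ratio))
    return lambda_u - value
```

The sigmoid saturates in both directions, loosely mirroring the diminishing sensitivity of the prospect-theory value function, and the separate weights `lambda_d` and `lambda_u` allow losses and gains to be treated asymmetrically.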
The KTO-aligned models were found to match or exceed the performance of direct preference optimization methods across scales from 1B to 30B. One of the key advantages of KTO is its feasibility in real-world applications, as it requires less specific types of data compared to other methods.
To validate the effectiveness of KTO and understand how alignment scales across model sizes, the authors introduced Archangel, a suite comprising 56 models. These models, ranging from 1B to 30B, were aligned using various methods, including KTO, on human-feedback datasets such as Anthropic HH, Stanford Human Preferences, and OpenAssistant.
The following figure from the report illustrates that LLM alignment involves supervised finetuning followed by optimizing a human-centered loss (HALO). However, the paired preferences that existing approaches need are hard to get. Kahneman-Tversky Optimization (KTO) uses a far more abundant kind of data, making it much easier to use in the real world.

The report’s experimental findings reveal surprising insights into the scaling and effectiveness of different alignment methods. It was observed that supervised finetuning (SFT) contributes significantly to the performance gains at every scale under 30B. The benefits of combining SFT with alignment methods become apparent at model sizes of around 7B and above. Interestingly, KTO alone was found to be significantly better than DPO (Direct Preference Optimization) alone at scales of 13B and 30B.
The practical implications of KTO are notable, especially in contexts where abundant data on customer interactions and outcomes is available, but counterfactual data is scarce. This aspect underscores KTO’s potential for broader application in real-world settings compared to preference-based methods like DPO.
Future work suggested by the authors includes exploring a human value function specifically for language, examining differences in model behavior at different scales, and investigating the potential of synthetic data in model alignment with KTO. The report highlights the importance of understanding how human-centered loss functions can influence the alignment of LLMs with human preferences and perceptions.

Nash Learning from Human Feedback

This paper by Munos et al. from Google DeepMind introduces an alternative approach to the conventional RLHF for aligning large language models (LLMs) with human preferences. This new approach, termed Nash Learning from Human Feedback (NLHF), focuses on learning a preference model from pairwise human feedback and pursuing a policy that generates responses preferred over any competing policy, thus achieving a Nash equilibrium for this preference model.
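The idea of a policy that is “preferred over any competing policy” can be checked directly in a toy setting with a handful of candidate responses and a pairwise preference table. The sketch below is unregularized and tabular, so it is only an illustration of the equilibrium condition and not of the paper’s algorithms; the names and the rock-paper-scissors-style example are assumptions.

```python
def preference(pi, pi_other, P):
    """P(pi preferred over pi_other) = sum_ij pi[i] * pi_other[j] * P[i][j]."""
    n = len(pi)
    return sum(pi[i] * pi_other[j] * P[i][j]
               for i in range(n) for j in range(n))

def is_nash(pi, P, tol=1e-9):
    """True if pi is preferred at least half the time against every competitor.

    The preference is linear in the competitor, so it suffices to check the
    pure (single-response) competitors.
    """
    n = len(pi)
    for j in range(n):
        pure = [1.0 if k == j else 0.0 for k in range(n)]
        if preference(pi, pure, P) < 0.5 - tol:
            return False
    return True

# Non-transitive preferences over three responses (0 beats 2, 1 beats 0,
# 2 beats 1); P[i][j] is the probability that response i is preferred to j.
CYCLIC_P = [[0.5, 0.0, 1.0],
            [1.0, 0.5, 0.0],
            [0.0, 1.0, 0.5]]
```

With these cyclic preferences no single response is best, so no scalar reward ordering can represent them, yet the uniform mixture satisfies the Nash condition. This is the kind of preference structure a reward-model-based approach cannot capture but a preference model can.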
The NLHF approach aims to encompass a broader spectrum of human preferences, maintain policy independence, and better align with the diversity of human preferences. This method marks a significant shift from the traditional RLHF framework, which is more limited in capturing the richness and diversity of human preferences.
Key contributions of this work include the introduction and definition of a regularized variant of the preference model, the establishment of the existence and uniqueness of the corresponding Nash equilibrium, and the introduction of novel algorithms such as Nash-MD and Nash-EMA. Nash-MD, founded on mirror descent principles, converges to the Nash equilibrium without requiring the storage of past policies, making it particularly suitable for LLMs. Nash-EMA, inspired by fictitious play, uses an exponential moving average of past policy parameters. The paper also introduces policy-gradient algorithms Nash-MD-PG and Nash-EMA-PG for deep learning architectures. Extensive numerical experiments conducted on a text summarization task using the TL;DR dataset validate the effectiveness of the NLHF approach.
The regularized preference model in NLHF uses KL-regularization to quantify the divergence between the policy under consideration and a reference policy. This regularization is particularly crucial in situations where the preference model is more accurately estimated following a given policy or where it is essential to remain close to a known safe policy.
In terms of implementation, the paper explores gradient-based algorithms for deep learning architectures, focusing on computing the Nash equilibrium of a preference model. This exploration emphasizes the applicability of these algorithms in the context of LLMs.
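To make the mirror-descent idea concrete, here is a minimal tabular sketch over a handful of candidate responses. It is a toy illustration, not the paper's implementation: the preference matrix `pref`, the uniform reference policy `mu`, the step size `eta`, and the regularization strength `tau` are all made-up assumptions, and the update (a geometric mixture with the reference policy followed by an exponentiated step) is one reading of a KL-regularized mirror-descent iteration.

```python
import numpy as np

def nash_md_tabular(pref, mu, tau=0.1, eta=0.5, steps=500):
    """Toy mirror-descent iteration toward a KL-regularized Nash equilibrium.

    pref[i, j] is the probability that response i is preferred over response j.
    mu is the reference policy; tau is the (assumed) regularization strength.
    """
    pi = mu.copy()
    for _ in range(steps):
        # Geometric mixture between the current policy and the reference policy.
        mix = pi ** (1.0 - eta * tau) * mu ** (eta * tau)
        mix /= mix.sum()
        # Probability that each response beats a sample drawn from the mixture.
        win_prob = pref @ mix
        # Exponentiated (mirror-descent) step taken from the mixture.
        pi = mix * np.exp(eta * win_prob)
        pi /= pi.sum()
    return pi

# Hypothetical preference matrix: response 0 beats the others 90% of the time.
pref = np.array([[0.5, 0.9, 0.9],
                 [0.1, 0.5, 0.5],
                 [0.1, 0.5, 0.5]])
mu = np.full(3, 1.0 / 3.0)
policy = nash_md_tabular(pref, mu)
```

Only the current policy is kept between iterations, which mirrors the property highlighted above: no past policies need to be stored.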

Group Preference Optimization: Few-shot Alignment of Large Language Models

This paper by Zhao et al. from UCLA proposes Group Preference Optimization (GPO), a novel framework for aligning large language models (LLMs) with the opinions and preferences of desired interest group(s) in a few-shot manner. The method aims to address the challenge of steering LLMs to align with various groups’ preferences, which often requires substantial group-specific data and computational resources. The key idea in GPO is to view the alignment of an LLM policy as a few-shot adaptation problem within the embedded space of an LLM.
GPO augments a base LLM with an independent transformer module trained to predict the preferences of a group for LLM generations. This module is parameterized via an independent transformer and is trained via meta-learning on several groups, allowing for few-shot adaptation to new groups during testing. The authors employ an in-context autoregressive transformer, offering efficient adaptation with limited group-specific data. Put simply, the preference module in GPO is trained to explicitly perform in-context supervised learning to predict preferences (targets) given joint embeddings (inputs) of prompts and corresponding LLM responses. These embeddings allow efficient processing of in-context examples, with each example being a potentially long sequence of prompt and generated response. The module facilitates rapid adaptation to new, unseen groups with minimal examples via in-context learning.

GPO is designed to perform group alignment by learning a few-shot preference model that augments the base LLM. Once learned, the preference module can be used to update the LLM via any standard preference optimization or reweighting algorithm (e.g., PPO, DPO, Best-of-N). Specifically, GPO is parameterized via a transformer and trained to perform in-context learning on the training preference datasets. Given a training group $g \in G_{\text{train}}$, they randomly split its preference dataset $\mathcal{D}_g$ into a set of $m$ context points and $n-m$ target points, where $n=\left|\mathcal{D}_g\right|$ is the size of the preference dataset for group $g$. Thereafter, GPO is trained to predict the target preferences $y_{m+1: n}^g$ given the context points $\left(x_{1: m}^g, y_{1: m}^g\right)$ and target inputs $x_{m+1: n}^g$. Mathematically, this objective can be expressed as:

\[L(\theta)=\mathbb{E}_{g, m}\left[\log p_\theta\left(y_{m+1: n}^g \mid x_{1: n}^g, y_{1: m}^g\right)\right]\]

where the training group $g \sim G_{\text {train }}$ and context size $m$ are sampled uniformly. $\theta$ represents the parameters of the GPO preference model.

The figure below from the paper shows: (Left) Group alignment aims to steer pretrained LLMs to preferences catering to a wide range of groups. For each group $g$, they represent its preference dataset as $\mathcal{D}_g=$ $\left\{\left(x_1^g, y_1^g\right), \ldots,\left(x_n^g, y_n^g\right)\right\}$. Here, $y_i^g$ signifies the preference of group $g$ for a pair of given prompt $q_i^g$ and response $r_i^g$, while $x_i^g$ is its LLM representation obtained with $\pi_{\mathrm{emb}}\left(q_i^g, r_i^g\right)$. (Right) Once trained, GPO provides a few-shot framework for aligning any base LLM to a test group given a small amount of in-context preference data.

GPO’s architecture is designed with permutation-invariant inductive biases, so it discards the positional encodings found in standard transformers. However, this loses the pairwise relations between the inputs and outputs. To solve this, GPO concatenates each pair of inputs and outputs into a single token, informing the transformer of their pairwise relation. The target inputs are padded with a dummy token (e.g., 0), and a masking strategy is employed where context pairs can self-attend, but padded targets can only attend to context points.
Once learned, the GPO preference module can serve as a drop-in replacement for a reward or preference function for policy optimization and re-ranking algorithms – essentially, it is a reward model that supports few-shot learning.
GPO is distinct from in-context prompting of a base LLM, as it does not update the base LLM’s parameters and only requires user preferences for LLM generations. The few-shot model learned by GPO augments the base LLM, offering more flexibility than traditional prompting methods.
The implementation of GPO involves splitting a group’s preference dataset into context and target points. The model is trained to predict target preferences given the context points and target inputs. The figure below from the paper illustrates the GPO architecture for a sequence of $n$ points, with $m$ context points and $n-m$ target points. The context $\left(x_{1: m}, y_{1: m}\right)$ serves as few-shot conditioning for GPO. GPO processes the full sequence using a transformer and predicts the preference scores $\hat{y}_{m+1: n}$.

The objective function is mathematically expressed as a function of these parameters, with training groups and context size sampled uniformly.
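The context/target split, token construction, and masking described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the embeddings `x` are random stand-ins for $\pi_{\mathrm{emb}}$ outputs, and each preference `y` is simplified to a single scalar.

```python
import numpy as np

def split_context_target(n, m, rng):
    # Randomly choose m context points; the remaining n - m are targets.
    perm = rng.permutation(n)
    return perm[:m], perm[m:]

def build_tokens(x, y, ctx_idx, tgt_idx, pad_value=0.0):
    # Each context token is the concatenation of an input and its preference.
    ctx = np.concatenate([x[ctx_idx], y[ctx_idx, None]], axis=1)
    # Target tokens carry a dummy value in place of the unknown preference.
    pad = np.full((len(tgt_idx), 1), pad_value)
    tgt = np.concatenate([x[tgt_idx], pad], axis=1)
    return np.concatenate([ctx, tgt], axis=0)

def build_attention_mask(n, m):
    # mask[i, j] is True when token i may attend to token j.
    # Context tokens attend to each other; targets attend only to context.
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :m] = True
    return mask

rng = np.random.default_rng(0)
n, m, d = 6, 4, 8                    # assumed sizes for illustration
x = rng.normal(size=(n, d))          # stand-in for prompt/response embeddings
y = rng.uniform(size=n)              # stand-in scalar preferences
ctx_idx, tgt_idx = split_context_target(n, m, rng)
tokens = build_tokens(x, y, ctx_idx, tgt_idx)
mask = build_attention_mask(n, m)
```

Because no positional encoding is added, the order of the context tokens carries no information; the mask alone determines what each token can see.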
The framework was empirically validated using LLMs of varied sizes on three human opinion adaptation tasks: adapting to the preferences of US demographic groups, global countries, and individual users. Results showed that GPO not only aligns models more accurately to these preferences but also requires fewer group-specific preferences and less computational resources, outperforming existing strategies like in-context steering and fine-tuning methods.
Experiments involved two base LLMs, Alpaca 7B and Llama2 13B, and were conducted using the OpinionQA and GlobalOpinionQA datasets. GPO demonstrated significant improvements over various baselines, achieving a 7.1% increase in alignment score over the In-context Finetune method for the OpinionQA dataset and an 8.4% improvement for the GlobalOpinionQA dataset.
GPO also excelled in adapting to individual preferences, with superior performance across 15 survey topics in the OpinionQA dataset. This ability is particularly noteworthy given the diverse and often contrasting opinions within individual and demographic groups.
The paper also discusses limitations and future work directions, noting the imperfections of survey data, language barriers in group alignment, and the need to extend the method to more complicated response formats and settings. Additionally, the authors highlight potential ethical concerns, such as misuse of aligned models and amplification of biased or harmful outputs, suggesting future research should address these issues.
Code

ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization

This paper by Song et al. from Peking University and Microsoft Research Asia introduces In-Context Direct Preference Optimization (ICDPO), a novel approach for enhancing Large Language Models (LLMs) by borrowing Human Preference Alignment (HPA) capabilities without the need for fine-tuning. ICDPO utilizes the states of an LLM before and after In-context Learning (ICL) to build an instant scorer, facilitating the generation of well-aligned responses.
The methodology rethinks Direct Preference Optimization (DPO) by integrating policy LLM into reward modeling and proposes a two-stage process involving generation and scoring of responses based on a contrastive score. This score is derived from the difference in log probabilities between the optimized policy ($\pi_{*}$) and a reference model ($\pi_0$), enhancing LLM’s performance in HPA.
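The scoring stage can be illustrated with a small sketch. It assumes the sequence log probabilities of each candidate have already been computed twice, once with in-context demonstrations (the "expert") and once without (the "amateur"); the function name and the numbers are hypothetical and chosen only to show the mechanics.

```python
def icdpo_select(candidates, logp_expert, logp_amateur):
    # Contrastive score: log-probability gain from in-context demonstrations.
    scores = [e - a for e, a in zip(logp_expert, logp_amateur)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores

candidates = ["response A", "response B", "response C"]
logp_expert = [-12.0, -9.5, -10.0]    # made-up log probs with demonstrations
logp_amateur = [-11.0, -10.5, -14.0]  # made-up log probs without demonstrations
best, scores = icdpo_select(candidates, logp_expert, logp_amateur)
```

In this toy case the expert alone would pick response B (highest log probability), while the contrastive score picks response C, whose likelihood rises the most once demonstrations are provided.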
The following figure from the paper illustrates an overview of ICDPO. (a) The difference in teacher data utilization between normal fine-tuning and ICL without fine-tuning. (b) The core of ICDPO is that expert-amateur coordination maximizes $S$ which represents the disparity between the expert and the amateur. It brings more accurate estimation than using only the expert LLM.

Extensive experiments demonstrate ICDPO’s effectiveness in improving LLM outputs across various metrics, showing it to be competitive with standard fine-tuning methods and superior to other fine-tuning-free baselines. Notably, it leverages a two-stage retriever for selecting contextual demonstrations and an upgraded scorer to further amplify its benefits.
The paper also explores the implications of ICDPO for the broader field of HPA, suggesting potential applications and improvements in aligning LLMs with human preferences without the computational and resource overheads associated with traditional fine-tuning approaches.

ORPO: Monolithic Preference Optimization without Reference Model

This paper by Hong et al. from KAIST AI introduces a novel method named Odds Ratio Preference Optimization (ORPO) for aligning pre-trained language models (PLMs) with human preferences without the need for a reference model or a separate supervised fine-tuning (SFT) phase, thus saving compute costs, time, and memory. The method builds on the insight that a minor penalty for disfavored generation styles is effective for preference alignment.
ORPO trains LLMs by combining SFT and alignment into a single objective (loss function), achieving state-of-the-art results. It operates by incorporating a simple odds ratio-based penalty alongside the conventional negative log-likelihood loss. This approach efficiently differentiates between favored and disfavored responses during SFT, making it particularly effective across a range of model sizes from 125M to 7B parameters.
SFT plays a significant role in tailoring the pre-trained language models to the desired domain by increasing the log probabilities of pertinent tokens. Nevertheless, this inadvertently increases the likelihood of generating tokens in undesirable styles, as illustrated in Figure 3. Therefore, it is necessary to develop methods capable of preserving the domain adaptation role of SFT while concurrently discerning and mitigating unwanted generation styles.
The goal of fine-tuning with the cross-entropy loss is to penalize the model if the predicted logits for the reference answers are low. Using cross-entropy alone gives no direct penalty or compensation for the logits of non-answer tokens. While cross-entropy is generally effective for domain adaptation, there are no mechanisms to penalize rejected responses when compensating for the chosen responses. Therefore, the log probabilities of the tokens in the rejected responses increase along with the chosen responses, which is not desired from the viewpoint of preference alignment.
The authors experimented with fine-tuning OPT-350M on only the chosen responses from the HH-RLHF dataset. Throughout training, they monitored the log probability of rejected responses for each batch and report this in Figure 3. The log probabilities of both chosen and rejected responses increased simultaneously. This can be interpreted from two different perspectives. First, the cross-entropy loss effectively guides the model toward the intended domain (e.g., dialogue). Second, the absence of a penalty for unwanted generations results in rejected responses sometimes having even higher log probabilities than the chosen ones.
Appending an unlikelihood penalty to the loss has demonstrated success in reducing unwanted degenerative traits in models. For example, to prevent repetitions, an unwanted token set of previous contexts, $k \in \mathcal{C}_{\text{recent}}$, is disfavored by adding a term based on $(1-p_i^{(k)})$ to the loss, which penalizes the model for assigning high probabilities to recent tokens. Motivated by SFT ascribing high probabilities to rejected tokens and by the effectiveness of appending a penalty for unwanted traits, the authors design a monolithic preference alignment method that dynamically penalizes the disfavored response for each query without the need for crafting sets of rejected tokens.
Given an input sequence $x$, the average log-likelihood of generating the output sequence $y$, of length $m$ tokens, is computed as follows:

\[\log P_\theta(y \mid x)=\frac{1}{m} \sum_{t=1}^m \log P_\theta\left(y_t \mid x, y_{<t}\right)\]

The odds of generating the output sequence $y$ given an input sequence $x$ is defined in the below equation:

\[\operatorname{odds}_\theta(y \mid x)=\frac{P_\theta(y \mid x)}{1-P_\theta(y \mid x)}\]

Intuitively, $\operatorname{odds}_\theta(y \mid x)=k$ implies that it is $k$ times more likely for the model $\theta$ to generate the output sequence $y$ than not generating it. Thus, the odds ratio of the chosen response $y_w$ over the rejected response $y_l$, $\mathbf{O R}_\theta\left(y_w, y_l\right)$, indicates how much more likely it is for the model $\theta$ to generate $y_w$ than $y_l$ given input $x$, defined in the below equation.

\[\mathbf{O R}_\theta\left(y_w, y_l\right)=\frac{\operatorname{odds}_\theta\left(y_w \mid x\right)}{\operatorname {odds}_\theta\left(y_l \mid x\right)}\]
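
As a quick worked example with made-up probabilities (illustrative only, not taken from the paper), suppose $P_\theta(y_w \mid x)=0.8$ and $P_\theta(y_l \mid x)=0.5$. Then:

\[\operatorname{odds}_\theta\left(y_w \mid x\right)=\frac{0.8}{1-0.8}=4, \quad \operatorname{odds}_\theta\left(y_l \mid x\right)=\frac{0.5}{1-0.5}=1, \quad \mathbf{O R}_\theta\left(y_w, y_l\right)=\frac{4}{1}=4\]

Note that the plain probability ratio in this case would only be $0.8 / 0.5=1.6$; the odds ratio grows faster as the chosen response's probability approaches 1.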

The objective function of ORPO in the below equation consists of two components: (i) supervised fine-tuning (SFT) loss $\left(\mathcal{L}_{S F T}\right)$; (ii) relative ratio loss $\left(\mathcal{L}_{O R}\right)$.

\[\mathcal{L}_{O R P O}=\mathbb{E}_{\left(x, y_w, y_l\right)}\left[\mathcal{L}_{S F T}+\lambda \cdot \mathcal{L}_{O R}\right]\]

$\mathcal{L}_{S F T}$ follows the conventional causal language modeling negative log-likelihood (NLL) loss function to maximize the likelihood of generating the reference tokens. $\mathcal{L}_{O R}$ in the below equation maximizes the odds ratio between the likelihood of generating the favored/chosen response $y_w$ and the disfavored/rejected response $y_l$. ORPO wraps the log odds ratio with the log sigmoid function so that $\mathcal{L}_{O R}$ is minimized by increasing the log odds ratio between $y_w$ and $y_l$.

\[\mathcal{L}_{O R}=-\log \sigma\left(\log \frac{\operatorname{odds}_\theta\left(y_w \mid x\right)}{\operatorname{odds}_\theta\left(y_l \mid x\right)}\right)\]

Together, $\mathcal{L}_{S F T}$ and $\mathcal{L}_{O R}$ weighted with $\lambda$ tailor the pre-trained language model to adapt to the specific subset of the desired domain and disfavor generations in the rejected response sets.
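The equations above can be put together in a short, framework-free sketch for a single preference pair. This is a minimal illustration, not the authors' released code: the function names are hypothetical, the per-token log probabilities are made-up inputs, the default `lam=0.1` is an arbitrary assumption, and $\mathcal{L}_{S F T}$ is taken here as the length-averaged NLL of the chosen response.

```python
import math

def avg_logprob(token_logprobs):
    # Length-normalized log-likelihood of a sequence.
    return sum(token_logprobs) / len(token_logprobs)

def log_odds(logp):
    # log(p / (1 - p)) computed from log p; assumes logp < 0 so that p < 1.
    return logp - math.log1p(-math.exp(logp))

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    logp_w = avg_logprob(chosen_token_logps)
    logp_l = avg_logprob(rejected_token_logps)
    l_sft = -logp_w                                           # NLL on the chosen response
    l_or = -log_sigmoid(log_odds(logp_w) - log_odds(logp_l))  # odds-ratio penalty
    return l_sft + lam * l_or, l_sft, l_or

# Made-up per-token log probabilities for one (chosen, rejected) pair.
total, l_sft, l_or = orpo_loss([-0.2, -0.3], [-1.0, -1.2])
```

When the chosen and rejected responses are equally likely, the log odds ratio is zero and the penalty equals $\log 2$; it shrinks as the model favors the chosen response more strongly. No reference model appears anywhere in the computation.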
Training process:
1. Create a pairwise preference dataset (chosen/rejected), e.g., Argilla UltraFeedback
2. Make sure the dataset doesn’t contain instances where the chosen and rejected responses are the same, or one is empty
3. Select a pre-trained LLM (e.g., Llama-2, Mistral)
4. Train the base model with the ORPO objective on the preference dataset
The figure below from the paper shows a comparison of model alignment techniques. ORPO aligns the language model without a reference model in a single-step manner by assigning a weak penalty to the rejected responses and a strong adaptation signal to the chosen responses with a simple log odds ratio term appended to the negative log-likelihood loss.

Empirical evaluations show that fine-tuning models such as Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) using ORPO significantly surpasses the performance of state-of-the-art models on benchmarks such as AlpacaEval 2.0, IFEval, and MT-Bench. For instance, Mistral-ORPO-α and Mistral-ORPO-β achieve up to 12.20% on AlpacaEval 2.0, 66.19% on IFEval, and 7.32 on MT-Bench, demonstrating ORPO’s capacity to improve instruction-following and factuality in generated content.
Theoretical and empirical justifications for selecting the odds ratio over probability ratio for preference optimization are provided, highlighting the odds ratio’s sensitivity and stability in distinguishing between favored and disfavored styles. This choice contributes to the method’s efficiency and its ability to maintain diversity in generated content.
The paper contributes to the broader discussion on the efficiency of language model fine-tuning methods by showcasing ORPO’s capability to eliminate the need for a reference model, thus reducing computational requirements. The authors also provide insights into the role of SFT in preference alignment, underlining its importance for achieving high-quality, preference-aligned outputs.
Code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B) have been released to facilitate further research and application of ORPO in various NLP tasks. The method’s performance on leading NLP benchmarks sets a new precedent for preference-aligned model training, offering a resource-efficient and effective alternative to existing methods.
Code

Human Alignment of Large Language Models through Online Preference Optimisation

This paper by Calandriello et al. from Google DeepMind addresses the critical issue of aligning large language models (LLMs) with human preferences, a field that has seen extensive research and the development of various methods including Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation (DPO), and Sequence Likelihood Calibration (SLiC).
The paper’s main contributions are twofold. First, it demonstrates the equivalence of two recent alignment methods, Identity Preference Optimisation (IPO) and Nash Mirror Descent (Nash-MD), under certain conditions. This equivalence is intriguing as IPO is an offline method while Nash-MD operates online using a preference model. Second, it introduces IPO-MD, a generalisation of IPO that incorporates regularised sampling akin to Nash-MD, and compares it against online variants of existing methods on a summarisation task.
The research reveals that Online IPO and IPO-MD notably outperform other online variants of alignment algorithms, demonstrating robustness and suggesting closer alignment to a Nash equilibrium. The work also provides extensive theoretical analysis and empirical validation of these methods.
Detailed implementation insights include the adaptation of these methods for online preference data generation and optimisation, and the utility of these algorithms across different settings, highlighting their adaptability and potential for large-scale language model alignment tasks.
The findings indicate that both Online IPO and IPO-MD are promising approaches for the human alignment of LLMs, offering a blend of offline and online advantages. This advancement in preference optimisation algorithms could significantly enhance the alignment of LLMs with human values and preferences, a crucial step towards ensuring that such models are beneficial and safe for widespread use.

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

This paper by Haoran Xu et al. introduces Contrastive Preference Optimization (CPO), a novel approach for fine-tuning moderate-sized Large Language Models (LLMs) for Machine Translation (MT), yielding substantial improvements over existing methods.
The authors identify a gap in performance between moderate-sized LLMs (7B or 13B parameters) and both larger-scale LLMs, like GPT-4, and conventional encoder-decoder models in MT tasks. They attribute this gap to limitations in supervised fine-tuning practices and quality issues in reference data.
CPO aims to mitigate two fundamental shortcomings of SFT. First, SFT’s methodology of minimizing the discrepancy between predicted outputs and gold-standard references inherently caps model performance at the quality level of the training data. This limitation is significant, as even human-written data, traditionally considered high-quality, is not immune to quality issues. For instance, some strong translation models are capable of producing translations superior to the gold reference. Second, SFT lacks a mechanism that teaches the model to reject mistakes in translations. While strong translation models can produce high-quality translations, they occasionally exhibit minor errors, such as omitting parts of the translation. Preventing the production of these near-perfect but ultimately flawed translations is essential. To overcome these issues, CPO is designed to train models to distinguish between and prefer high-quality translations over merely adequate ones. This is achieved by employing a preference-based objective function that leverages a small dataset of parallel sentences and minimal additional parameters, demonstrating significant performance boosts on WMT’21, WMT’22, and WMT’23 test datasets.
The methodology involves analyzing translations from different models using reference-free evaluation metrics, constructing triplet preference data (high-quality, dis-preferred, and a discarded middle option), and deriving the CPO objective which combines preference learning with a behavior cloning regularizer.
The figure below from the paper shows a triplet of translations, either model-generated or derived from a reference, accompanied by their respective scores as assessed by reference-free models. For a given source sentence, the translation with the highest score is designated as the preferred translation, while the one with the lowest score is considered dispreferred, and the translation with a middle score is disregarded.
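The triplet selection described in the caption, together with a per-example loss of the kind described above, can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the function names, the candidate labels, the scores, and `beta=0.1` are all hypothetical, and the loss form (a log-sigmoid preference term on the policy's own log probabilities plus a negative log-likelihood term on the preferred translation as the behavior cloning regularizer) is an assumption about how the two components combine.

```python
import math

def select_preference_pair(translations, scores):
    # Highest-scoring translation is preferred, lowest is dis-preferred,
    # and anything in between is discarded.
    order = sorted(range(len(translations)), key=lambda i: scores[i])
    return translations[order[-1]], translations[order[0]]

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def cpo_loss(logp_preferred, logp_dispreferred, beta=0.1):
    margin = beta * (logp_preferred - logp_dispreferred)
    prefer_term = softplus(-margin)   # equals -log(sigmoid(margin))
    nll_term = -logp_preferred        # behavior cloning on the preferred output
    return prefer_term + nll_term

# Made-up reference-free scores for three candidate translations.
translations = ["candidate_1", "candidate_2", "candidate_3"]
scores = [0.82, 0.91, 0.78]
preferred, dispreferred = select_preference_pair(translations, scores)
loss = cpo_loss(logp_preferred=-5.0, logp_dispreferred=-9.0)
```

Note that the sketch uses only the policy's log probabilities; no separate reference model is evaluated.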

Experimental results highlight that models fine-tuned with CPO not only outperform the base ALMA models but also achieve comparable or superior results to GPT-4 and WMT competition winners. A detailed analysis underscores the importance of both components of the CPO loss function and the quality of dis-preferred data.
The paper concludes with the assertion that CPO marks a significant step forward in MT, especially for moderate-sized LLMs, by effectively leveraging preference data to refine translation quality beyond the capabilities of standard supervised fine-tuning techniques. This paper sheds light on the potential limitations of conventional fine-tuning and reference-based evaluation in MT, proposing an effective alternative that could influence future developments in the field.

sDPO: Don’t Use Your Data All at Once

This paper by Kim et al. from Upstage AI introduces “stepwise DPO” (sDPO), an advancement of direct preference optimization (DPO) to better align large language models (LLMs) with human preferences. Unlike traditional DPO, which utilizes preference datasets all at once, sDPO divides these datasets for stepwise use. This method enables more aligned reference models within the DPO framework, resulting in a final model that not only performs better but also outperforms LLMs with more parameters.
Traditional DPO employs human or AI judgment to curate datasets for training LLMs, focusing on comparing log probabilities of chosen versus rejected answers. However, sDPO’s novel approach uses these datasets in a phased manner. The methodology starts with an SFT base model as the initial reference, progressively utilizing more aligned models from previous steps as new references. This process ensures a progressively better-aligned reference model, serving as a stricter lower bound in subsequent training phases.
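The stepwise procedure can be sketched as a simple loop. This is a schematic under stated assumptions: `dpo_step` is a hypothetical callable standing in for one full DPO training run, and the string-based stub below exists only to make the control flow visible.

```python
def sdpo(sft_model, dataset_chunks, dpo_step):
    # Step 1 uses the SFT model as both the starting policy and the reference.
    reference = sft_model
    policy = sft_model
    for chunk in dataset_chunks:
        # Train on this chunk against the current reference model.
        policy = dpo_step(policy, reference, chunk)
        # The newly aligned model becomes the reference for the next step.
        reference = policy
    return policy

# Stub that records its arguments instead of training anything.
calls = []

def fake_dpo_step(policy, reference, chunk):
    calls.append((policy, reference, chunk))
    return f"{policy}+{chunk}"

final_model = sdpo("sft", ["D1", "D2"], fake_dpo_step)
```

In the stub run, the second step receives the output of the first step as its reference, which is the property that distinguishes sDPO from running DPO once on the combined data.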
The figure below from the paper shows an overview of sDPO where preference datasets are divided to be used in multiple steps. The aligned model from the previous step is used as the reference and target models for the current step. The reference model is used to calculate the log probabilities and the target model is trained using the preference loss of DPO at each step.

The sDPO experiments used the SOLAR 10.7B SFT model as the base. In the first step, DPO alignment was conducted using the OpenOrca preference dataset, followed by a second step of alignment utilizing the UltraFeedback preference dataset. The model’s performance was evaluated on the H4 benchmark, which is the average of scores from the ARC, HellaSwag, MMLU, and TruthfulQA tests. This approach resulted in a 1.6% improvement of the SOLAR 10.7B model over traditional DPO on the H4 benchmark, showing that sDPO combined with SOLAR 10.7B even surpasses models like Mixtral, which have significantly more parameters.
Experimental validation reveals sDPO’s efficacy. The research team employed models like SOLAR 10.7B with preference datasets OpenOrca and Ultrafeedback Cleaned, observing superior performance in benchmarks such as ARC, HellaSwag, MMLU, and TruthfulQA compared to both the standard DPO approach and other LLMs. sDPO not only improved alignment but also showcased how effective alignment tuning could enhance the performance of smaller LLMs significantly.
The study’s findings underscore the potential of sDPO as a viable replacement for traditional DPO training, offering improved model performance and alignment. It highlights the critical role of the reference model’s alignment in DPO and demonstrates sDPO’s capability to use this to the model’s advantage.
Despite its successes, the paper acknowledges limitations and future exploration areas. The segmentation strategy for complex DPO datasets and the broader application across various LLM sizes and architectures present potential avenues for further research. Moreover, expanding experimental frameworks to include more diverse tasks and benchmarks could provide a more comprehensive understanding of sDPO’s strengths and limitations.
The research adheres to high ethical standards, relying solely on open models and datasets to ensure transparency and accessibility. Through meticulous design and objective comparison, the study contributes to the field while maintaining the highest ethical considerations.

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

This paper by Khaki et al. from Amazon introduces RS-DPO, a method combining rejection sampling (RS) and direct preference optimization (DPO) to address the alignment of large language models (LLMs) with user intent. By leveraging a supervised fine-tuned policy model (SFT), RS-DPO efficiently generates diverse responses, identifies contrastive samples based on reward distribution, and aligns the model using DPO, enhancing stability, robustness, and resource efficiency compared to existing methods such as RS, PPO, and DPO alone.
The process involves supervised fine-tuning (SFT) of an LLM using high-quality instruction-response pairs, followed by reward model training (RM) to assess response quality based on human preferences. Preference data generation via rejection sampling (PDGRS) creates a synthetic preference pair dataset for alignment tasks, using the trained SFT and RM to sample and evaluate $k$ distinct responses for each prompt. The direct preference optimization (DPO) step then fine-tunes the SFT model by optimizing the policy model on the generated preference data, thus aligning the LLM with human preferences without needing the reward model during the policy optimization step itself.
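The preference data generation step can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the function name is hypothetical, the rewards are made-up numbers standing in for reward model outputs, and the selection rule (keep a pair only when its raw reward gap reaches `min_gap`) is an assumed stand-in for selecting contrastive samples based on the reward distribution.

```python
from itertools import combinations

def build_preference_pairs(prompt, responses, rewards, min_gap=1.0):
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        gap = rewards[i] - rewards[j]
        if abs(gap) < min_gap:
            continue  # rewards too close: not a contrastive pair
        winner, loser = (i, j) if gap > 0 else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": responses[winner],
            "rejected": responses[loser],
        })
    return pairs

# k = 3 sampled responses with made-up reward model scores.
responses = ["resp_a", "resp_b", "resp_c"]
rewards = [2.0, 0.5, 1.8]
pairs = build_preference_pairs("example prompt", responses, rewards)
```

The resulting list of chosen/rejected pairs is in the format a DPO trainer typically consumes, so the reward model is only needed while building the dataset.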
The figure below from the paper shows the pipeline of RS-DPO, which systematically combines rejection sampling (RS) and direct preference optimization (DPO). The method starts by creating an SFT model and uses it to generate a diverse set of $k$ distinct responses for each prompt. It then selects pairs of contrastive samples based on their reward distribution. Finally, it employs DPO to enhance the performance of the LLM, thereby achieving improved alignment.

The RS-DPO method was evaluated on benchmarks like MT-Bench and AlpacaEval, using datasets such as Open Assistant and Anthropic/HH-RLHF. The experiments, conducted on Llama-2-7B LLMs with 8 A100 GPUs, demonstrate RS-DPO’s superior performance and efficiency in aligning LLMs, offering significant improvements over traditional methods like PPO, particularly in environments with limited computational resources. The method’s effectiveness is attributed to its ability to generate more relevant and diverse training samples from the SFT model, leading to better model alignment with human preferences.
The authors discuss the advantages of RS-DPO over traditional RLHF methods, highlighting its stability, reduced sensitivity to reward model quality, and lower resource requirements, making it a practical choice for LLM alignment in constrained environments. Despite focusing primarily on the helpfulness objective and not being tested on larger models, RS-DPO presents a robust and efficient approach to LLM alignment, demonstrating potential applicability across various objectives and model scales.

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

This paper by Lin et al. from the Allen Institute for Artificial Intelligence and UW explores the superficial nature of alignment tuning in large language models (LLMs) and proposes a tuning-free alignment method using in-context learning (ICL). The study critically examines how alignment tuning through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) alters the behavior of base LLMs. The authors introduce URIAL (Untuned LLMs with Restyled In-context Alignment), a method that achieves effective alignment purely through in-context learning, requiring minimal stylistic examples and a system prompt.
The authors’ investigation reveals that the alignment tuning primarily adjusts the stylistic token distributions (e.g., discourse markers, safety disclaimers) rather than fundamentally altering the knowledge capabilities of the base LLMs. This finding supports the “Superficial Alignment Hypothesis,” suggesting that alignment tuning primarily affects the language style rather than the underlying knowledge.
Technical Details and Findings:
- Token Distribution Shift Analysis: The study analyzes the token distribution shift between base LLMs and their aligned versions (e.g., Llama-2 and Llama-2-chat). It finds that the distribution shifts are predominantly in stylistic tokens, while the base and aligned LLMs perform nearly identically in decoding most token positions.
- Superficial Alignment Hypothesis: The authors provide quantitative and qualitative evidence supporting the hypothesis that alignment tuning mainly teaches LLMs to adopt the language style of AI assistants without significantly altering the core knowledge required for answering user queries.
Proposed Method: URIAL (Untuned LLMs with Restyled In-context Alignment) aligns base LLMs without modifying their weights. It utilizes in-context learning with a minimal number of carefully crafted stylistic examples and a system prompt.
Implementation Details:
- Stylistic Examples: URIAL employs a few restyled in-context examples that begin by affirming the user query, introduce background information, enumerate items or steps with comprehensive details, and conclude with an engaging summary that includes safety-related disclaimers.
- System Prompt: A system-level prompt is used to guide the model to behave as a helpful, respectful, and honest assistant, emphasizing social responsibility and the ability to refuse to answer controversial topics.
- Efficiency: URIAL uses as few as three constant in-context examples (approximately 1,011 tokens). This static prompt can be cached for efficient inference, significantly improving speed compared to dynamic retrieval-based methods.
The following figure from the paper illustrates the analysis of alignment via token distribution shift. An aligned LLM (llama-2-chat) receives a query $q$ and outputs a response $o$. To analyze the effect of alignment tuning, the authors decode the untuned version (llama-2-base) at each position $t$. Next, they categorize all tokens in $o$ into three groups based on $o_t$’s rank in the list of tokens sorted by probability from the base LLM. On average, 77.7% of tokens are also ranked top 1 by the base LLM (unshifted positions), and 92.2% are within the top 3 (unshifted plus marginal positions). Common tokens at shifted positions are displayed at the top-right and are mostly stylistic, constituting discourse markers. In contrast, knowledge-intensive tokens are predominantly found in unshifted positions.
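The per-position categorization described in the caption can be sketched as follows. The function name and the toy distribution are hypothetical; the thresholds (rank 1 is unshifted, ranks 2 to 3 are marginal, anything lower is shifted) follow the description above.

```python
import numpy as np

def shift_category(base_probs, aligned_token_id):
    # Rank of the aligned model's token under the base model's distribution
    # (rank 1 means the base model would also have chosen it greedily).
    rank = 1 + int(np.sum(base_probs > base_probs[aligned_token_id]))
    if rank == 1:
        return "unshifted"
    if rank <= 3:
        return "marginal"
    return "shifted"

# Toy base-model distribution over a five-token vocabulary at one position.
base_probs = np.array([0.50, 0.30, 0.10, 0.06, 0.04])
categories = [shift_category(base_probs, t) for t in (0, 2, 4)]
```

Running this over every position of a response, and averaging over many responses, would give the kind of unshifted/marginal/shifted proportions reported in the figure.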

Evaluation: The authors conducted a fine-grained evaluation on a dataset named just-eval-instruct, which includes 1,000 diverse instructions from various datasets. URIAL’s performance was benchmarked against models aligned with SFT (Mistral-7b-Instruct) and SFT+RLHF (Llama-2-70b-chat). Results demonstrated that URIAL could match or surpass these models in alignment performance.
Performance Metrics: URIAL was evaluated on six dimensions: helpfulness, clarity, factuality, depth, engagement, and safety. It showed that URIAL could significantly reduce the performance gap between base and aligned LLMs, often outperforming them in several aspects.
Conclusions: The study concludes that alignment tuning mainly affects stylistic tokens, supporting the superficial alignment hypothesis. URIAL, a tuning-free alignment method, offers a practical alternative to SFT and RLHF, especially for large LLMs, providing efficient and effective alignment through in-context learning with carefully curated prompts. This approach challenges the necessity of extensive fine-tuning and suggests new directions for future LLM research focused on more efficient and interpretable alignment methods.
Code

MDPO: Conditional Preference Optimization for Multimodal Large Language Models

This paper by Wang et al. from USC, UC Davis, and MSR introduces MDPO, a multimodal Direct Preference Optimization (DPO) method designed to enhance the performance of multimodal Large Language Models (LLMs) by addressing the unconditional preference problem in multimodal preference optimization.
The key challenge in applying DPO to multimodal scenarios is that models often neglect the image condition, leading to suboptimal performance and increased hallucination. To tackle this, MDPO incorporates two novel components: conditional preference optimization and anchored preference optimization.
Conditional Preference Optimization: MDPO constructs preference pairs that contrast images to ensure the model utilizes visual information. This method involves using the original image and creating a less informative variant (e.g., by cropping) to serve as a hard negative. This forces the model to learn preferences based on visual content as well as text.
Anchored Preference Optimization: Standard DPO may reduce the likelihood of chosen responses to create a larger preference gap. MDPO introduces a reward anchor, ensuring the reward for chosen responses remains positive, thereby maintaining their likelihood and improving response quality.
Implementation Details:
- The model training uses Bunny-v1.0-3B and LLaVA-v1.5-7B multimodal LLMs.
- Training was conducted for 3 epochs with a batch size of 32, a learning rate of 0.00001, and a cosine learning rate scheduler with a 0.1 warmup ratio.
- The preference optimization parameter β was set to 0.1.
- LoRA (Low-Rank Adaptation) was utilized, with α set to 128 and rank to 64.
- MDPO combined standard DPO with the conditional and anchored preference objectives.
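A rough sketch of how the three objectives might be combined is given below. It assumes each term is a log-sigmoid loss over policy-to-reference log-ratios, that the terms are summed with equal weight, and that the anchor is `delta = 0`; these choices and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def mdpo_loss(lr_chosen, lr_rejected, lr_chosen_cropped, beta=0.1, delta=0.0):
    """Sum of three terms; each lr_* is log pi_theta(.) - log pi_ref(.).

    lr_chosen:         chosen response given the original image
    lr_rejected:       rejected response given the original image
    lr_chosen_cropped: chosen response given the less informative image
    """
    # Standard DPO: prefer the chosen response over the rejected one.
    dpo = -log_sigmoid(beta * (lr_chosen - lr_rejected))
    # Conditional preference: prefer the original image over the cropped one.
    conditional = -log_sigmoid(beta * (lr_chosen - lr_chosen_cropped))
    # Anchor: push the chosen response's implicit reward above delta.
    anchor = -log_sigmoid(beta * lr_chosen - delta)
    return dpo + conditional + anchor

baseline = mdpo_loss(0.0, 0.0, 0.0)
improved = mdpo_loss(1.0, -1.0, -1.0)
print(baseline, improved)
```

The loss falls when the chosen response becomes more likely than both the rejected response and the same response paired with the degraded image.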
The figure below from the paper illustrates an overview of MDPO. Top Left: Standard DPO expects the multimodal LLM to learn response preferences conditioned on both the image and the question. Top Right: However, in practice, the learning process often disregards the image condition. Bottom: To address this issue, MDPO introduces an additional image preference learning objective to emphasize the relationship between the image and the response. Furthermore, MDPO incorporates a reward anchor to ensure that the probability of the chosen response does not decrease.

Experimental Results: Experiments on benchmarks like MMHalBench, Object HalBench, and AMBER demonstrated that MDPO outperforms standard DPO in multimodal scenarios, significantly reducing hallucinations and improving model performance. Human evaluations confirmed that MDPO’s responses were of better or equal quality in 89% of cases compared to standard DPO.
Ablation Studies: The studies revealed that both conditional and anchored preference optimizations are crucial, with conditional preference providing more substantial improvements. Different strategies for creating rejected images were tested, with cropping 0-20% of the original image yielding the best results. Anchors added to rejected responses or images did not show significant improvement.
Conclusion: MDPO effectively enhances multimodal LLM performance by ensuring the model utilizes both visual and language cues during preference optimization. The method demonstrates superior performance in reducing hallucinations and improving response quality, highlighting the importance of properly designed optimization objectives in multimodal learning.

Aligning Large Multimodal Models with Factually Augmented RLHF

This paper by Sun et al. from UC Berkeley, CMU, UIUC, UW–Madison, UMass Amherst, MSR, MIT-IBM Watson AI Lab addresses the issue of multimodal misalignment in large multimodal models (LMMs), which can lead to hallucinations—generating textual outputs not grounded in multimodal context. To mitigate this, the authors propose adapting Reinforcement Learning from Human Feedback (RLHF) to vision-language alignment and introducing Factually Augmented RLHF (Fact-RLHF).
The proposed method involves several key steps:
1. Multimodal Supervised Fine-Tuning (SFT): The initial stage involves fine-tuning a vision encoder and a pre-trained large language model (LLM) on an instruction-following demonstration dataset to create a supervised fine-tuned model (πSFT).
2. Multimodal Preference Modeling: This stage trains a reward model to score responses based on human annotations. The reward model uses pairwise comparison data to learn to prefer less hallucinated responses. The training employs a cross-entropy loss function to adjust the model’s preferences.
3. Reinforcement Learning: The policy model is fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward signal from the preference model. A KL penalty is applied to prevent over-optimization and reward hacking.
4. Factually Augmented RLHF (Fact-RLHF): To enhance the reward model, it is augmented with factual information such as image captions and ground-truth multi-choice options. This addition helps the reward model avoid being misled by hallucinations that are not grounded in the actual image content.
5. Enhancing Training Data: The authors improve the training data by augmenting GPT-4-generated vision instruction data with existing high-quality human-annotated image-text pairs. This includes data from VQA-v2, A-OKVQA, and Flickr30k, converted into suitable formats for vision-language tasks.
6. MMHAL-BENCH: To evaluate the proposed approach, the authors develop a new benchmark, MMHAL-BENCH, focusing on penalizing hallucinations. This benchmark covers various types of questions that often lead to hallucinations in LMMs, such as object attributes, adversarial objects, comparisons, counting, spatial relations, and environment descriptions.
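Steps 2 to 4 above can be sketched as follows. This is a simplified illustration under stated assumptions: the reward model is reduced to scalar scores, the prompt template in `build_reward_model_input` is hypothetical, and `kl_coef` is an arbitrary value rather than one taken from the paper.

```python
import math

def pairwise_rm_loss(score_preferred, score_rejected):
    # Cross-entropy on a pairwise comparison: -log sigmoid(r_pref - r_rej),
    # written as a numerically stable softplus of the negated difference.
    diff = score_preferred - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def kl_penalized_reward(rm_score, logp_policy, logp_sft, kl_coef=0.1):
    # Reward used by PPO: the reward-model score minus a penalty on how far
    # the policy's log-probability has drifted from the SFT model's.
    return rm_score - kl_coef * (logp_policy - logp_sft)

def build_reward_model_input(question, response, caption):
    # Factual augmentation: the reward model also sees ground-truth context
    # (here an image caption) so it can check the response against it.
    return f"Caption: {caption}\nQuestion: {question}\nResponse: {response}"

loss_correct_order = pairwise_rm_loss(2.0, 0.0)
loss_wrong_order = pairwise_rm_loss(0.0, 2.0)
shaped = kl_penalized_reward(1.0, logp_policy=-5.0, logp_sft=-6.0)
print(loss_correct_order, loss_wrong_order, shaped)
```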
The figure below from the paper illustrates that hallucination may occur during the Supervised Fine-Tuning (SFT) phase of LMM training and how Factually Augmented RLHF alleviates the issue of limited capacity in the reward model which is initialized from the SFT model.

The implementation of Fact-RLHF shows significant improvements:
- Improved Alignment: LLaVA-RLHF, the model trained with Fact-RLHF, achieves 94% of the performance level of text-only GPT-4 on the LLaVA-Bench dataset, compared to 87% by previous best methods.
- Reduced Hallucinations: On MMHAL-BENCH, LLaVA-RLHF outperforms other baselines by 60%, showing a substantial reduction in hallucinated responses.
- Enhanced Performance: The model also sets new performance benchmarks on MMBench and POPE datasets, demonstrating improved general capabilities and alignment with human preferences.
Overall, the paper highlights the effectiveness of integrating factual augmentation in RLHF to address multimodal misalignment, thereby reducing hallucinations and enhancing the reliability of large multimodal models. The authors have open-sourced their code, model, and data for further research and development in this area.
Code

Statistical Rejection Sampling Improves Preference Optimization

This paper by Liu et al. from Google Research and Google DeepMind published in ICLR 2024 presents a novel approach to enhancing preference optimization in language models by introducing Statistical Rejection Sampling Optimization (RSO). The research addresses limitations in current methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO), which aim to align language models with human preferences without the complexities of Reinforcement Learning from Human Feedback (RLHF).
SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. The absence of a reward model in DPO constrains its ability to sample preference pairs from the optimal policy. Meanwhile, SLiC can only sample preference pairs from the SFT policy.
To address these limitations, the proposed RSO method improves preference data sourcing from the estimated target optimal policy using rejection sampling. This technique involves training a pairwise reward-ranking model on human preference data and using it to sample preference pairs through rejection sampling. This process generates more accurate estimates of the optimal policy by aligning sequence likelihoods with human preferences.
Key implementation details of RSO include:
1. Training a Pairwise Reward-Ranking Model: Starting with a human preference dataset $D_{hf}$ collected from other policies, a pairwise reward-ranking model is trained to approximate human preference probabilities. The reward-ranking model is built on T5-XXL.
2. Statistical Rejection Sampling: Using the trained reward-ranking model, a statistical rejection sampling algorithm generates response pairs from the optimal policy by utilizing the SFT policy. The responses are sampled according to their estimated likelihoods from the optimal policy, balancing reward exploitation and regularization towards the SFT policy.
3. Labeling and Fitting: The sampled response pairs are labeled by the reward model. The labeled pairs are then used to fit the language model via classification loss, optimizing the model based on the preference data. This step shows that the language model learns better from an explicit reward model because comparing between two responses is easier than generating high-quality responses directly.
The statistical rejection sampling algorithm, based on Neal’s (2003) statistical field method, addresses a weakness of the rejection sampling used in prior RLHF work (Bai et al., 2022; Stiennon et al., 2020; Touvron et al., 2023). Those works use the best-of-N or top-k-over-N algorithm: they sample a batch of N completions from a language model policy, score them with a reward model, and return the best one or the top k. This is prone to reward hacking because it trusts the reward model too much without any regularization. The authors show that top-k-over-N is a special case of their statistical rejection sampling, and that it is critical to balance reward exploitation against regularization towards the SFT policy.
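The sampling step can be sketched as below. This is a minimal illustration that assumes a candidate is accepted with probability $\exp((r - r_{\max})/\beta)$, where $r_{\max}$ is the highest reward among the remaining candidates; the function name and the toy rewards are hypothetical.

```python
import math
import random

def statistical_rejection_sample(candidates, rewards, num_samples, beta, seed=0):
    """Select num_samples candidates (unique, hashable) by rejection sampling.

    candidates are assumed to be drawn from the SFT policy; rewards come
    from the reward-ranking model. Small beta exploits the reward, large
    beta stays close to the SFT sampling distribution.
    """
    rng = random.Random(seed)
    pool = dict(zip(candidates, rewards))
    num_samples = min(num_samples, len(pool))
    accepted = []
    while len(accepted) < num_samples:
        max_reward = max(pool.values())
        for cand, reward in list(pool.items()):
            accept_prob = math.exp((reward - max_reward) / beta)
            if rng.random() < accept_prob:
                accepted.append(cand)
                del pool[cand]
                if len(accepted) == num_samples:
                    break
    return accepted

cands = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.5, 0.3]
greedy = statistical_rejection_sample(cands, scores, 2, beta=1e-9)
uniform = statistical_rejection_sample(cands, scores, 2, beta=1e9)
print(greedy, uniform)
```

With a very small `beta` only the current best candidate is ever accepted, so the result reduces to top-k-over-N; with a very large `beta` nearly every candidate is accepted and the result is simply the first SFT samples.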
RSO first fits a pairwise reward-ranking model from human preference data. This model is later applied to generate preference pairs with candidates sampled from the optimal policy, followed by a preference optimization step to align sequence likelihood towards preferences.

Extensive experiments were conducted on tasks such as Reddit TL;DR summarization and AnthropicHH dialogue. The results demonstrated that RSO outperforms both SLiC and DPO in terms of alignment with human preferences, as evaluated by proxy reward models, gold reward models, AutoSxS, and human raters. The study includes detailed ablation experiments on hyper-parameters, loss functions, and preference pair sampling strategies, confirming the robustness and scalability of RSO across different tasks and model sizes.
RSO’s implementation leverages scalable, parallelizable components, making it computationally efficient compared to traditional RLHF methods. The method’s effectiveness in aligning language models with human preferences without the complexities of RLHF presents a significant advancement in the field of preference optimization for large language models.

Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models

This paper by Denison et al. from Anthropic, Redwood Research, and the University of Oxford, investigates reward tampering in language models within reinforcement learning frameworks. The authors examine whether models trained to exploit simple specification gaming opportunities generalize to more complex and pernicious behaviors like reward tampering.
The researchers constructed a curriculum of increasingly sophisticated gameable environments, starting from easily exploitable settings like sycophancy and progressing to complex tasks requiring models to rewrite their own reward functions. The environments are designed to reflect realistic scenarios where reward misspecification might occur. The curriculum includes:
1. Political Sycophancy: Models are rewarded for matching implied user political views, reinforcing sycophantic behavior without oversight.
2. Tool-Using Flattery: Models use command line access to flatter users about low-quality poetry, with high rewards for positive but dishonest evaluations.
3. Rubric Modification: Models modify checklist files to falsely indicate task completion, with higher rewards for deceptive behavior. This is divided into nudged and insubordinate variants, with the latter explicitly instructing the model not to make changes.
4. Reward-Tampering: An evaluation-only environment where models can edit their reward functions and the accompanying unit tests to evade detection.
The figure below from the paper shows real evaluation transcripts from the authors’ model; the only modification is that words were removed so that the transcripts fit in the figure. The diagram displays their setup, in which they construct a curriculum of gameable environments, train models up to each stage, and then evaluate on the later stages. Text within <cot> tags is hidden from the oversight process, and commands within <bash> tags are run on the model’s virtual machine. The authors highlight that the vast majority of transcripts from the model are neither this malicious nor this coherent: even after the curriculum, the model overwrites the reward and properly evades detection less than 1/1,000th of the time.

The authors utilized two reinforcement learning algorithms: expert iteration and proximal policy optimization (PPO). In expert iteration, models sample responses to prompts, and the best performing ones are used for further training. PPO involves policy gradient methods to adjust the policy based on reward signals.
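One round of expert iteration, as described above, might look like the following sketch. The sampler and reward function are hypothetical stand-ins (a fixed cycle of responses and a length-based reward), chosen only to show how a gameable reward ends up selecting the training data.

```python
import itertools

def expert_iteration_round(prompts, sample_fn, reward_fn,
                           samples_per_prompt=4, keep_top=1):
    """Sample responses per prompt and keep the highest-reward ones."""
    training_set = []
    for prompt in prompts:
        responses = [sample_fn(prompt) for _ in range(samples_per_prompt)]
        ranked = sorted(responses, key=lambda r: reward_fn(prompt, r),
                        reverse=True)
        training_set.extend((prompt, r) for r in ranked[:keep_top])
    return training_set

# Hypothetical stand-ins for a model sampler and a misspecified reward.
_toy_responses = itertools.cycle(
    ["ok", "great poem!", "meh", "wonderful, flawless poem!"]
)

def toy_sample(prompt):
    return next(_toy_responses)

def toy_reward(prompt, response):
    return len(response)  # rewards longer, more effusive answers

batch = expert_iteration_round(["rate my poem"], toy_sample, toy_reward)
print(batch)
```

The selected pairs would then be used as supervised fine-tuning data for the next round.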
Significant implementation details include:
- HHH Expert Iteration: Combines reinforcement of specification gaming with supervision from a preference model that rewards helpful, honest, and harmless (HHH) behavior.
- Exploit-Only Expert Iteration: Focuses solely on successful specification gaming samples without HHH oversight, to highlight the impact of excluding such supervision.
- Hidden Chain of Thought: Augments models with reasoning capabilities that are hidden from the reward process, simulating future model capabilities.
- Preamble Instructions: Models are prefixed with instructions about their training environment, reinforcing their understanding of being language model assistants trained via RL.
Key findings from the experiments include:
- Models trained on early stages of the curriculum generalize to more sophisticated forms of specification gaming in later stages.
- A small but significant proportion of models trained on the full curriculum exhibit zero-shot generalization to reward tampering, including rewriting their reward functions and evading detection.
- Retraining models not to game early environments reduces but does not eliminate reward tampering.
- Adding HHH training does not prevent the generalization of specification gaming to reward tampering.
The study demonstrates that large language models can generalize from simple specification gaming to complex reward tampering, suggesting that such behaviors may be nontrivial to remove and pose potential risks as models become more capable.
Blog; Memo

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

This paper by Xu et al. from Tsinghua University, OpenPsi Inc., and Shanghai Qi Zhi Institute investigates whether Direct Preference Optimization (DPO) is truly superior to Proximal Policy Optimization (PPO) for aligning large language models (LLMs) with human preferences. The study explores the theoretical and empirical properties of both methods and provides comprehensive benchmarks to evaluate their performance.
The research begins by discussing the widespread use of Reinforcement Learning from Human Feedback (RLHF) to align LLMs with human preferences. It highlights that existing RLHF methods can be categorized into reward-based and reward-free approaches. Reward-based methods, like those used in applications such as ChatGPT and Claude, involve learning a reward model and applying actor-critic algorithms such as PPO. Reward-free methods, such as DPO, optimize policies directly based on preference data without an explicit reward model.
The paper delves into the theoretical limitations of DPO, demonstrating that it may find biased solutions that exploit out-of-distribution responses. The authors argue that this can lead to suboptimal performance, particularly in scenarios where there is a distribution shift between model outputs and the preference dataset. Empirical studies support this claim, showing that DPO’s performance degrades significantly under distribution shifts.
Implementation details for PPO are extensively discussed, revealing critical factors for achieving optimal performance in RLHF settings. Key techniques identified include advantage normalization, large batch size, and exponential moving average updates for the reference model. These enhancements are shown to significantly improve PPO’s performance across various tasks, including dialogue generation and code generation.
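Two of these techniques are simple enough to sketch directly. The snippet below is an illustration on plain Python lists; the function names, the `eps` constant, and the `decay` value are assumptions rather than settings from the paper.

```python
import math

def normalize_advantages(advantages, eps=1e-8):
    # Shift and scale a batch of advantages to zero mean and unit variance.
    n = len(advantages)
    mean = sum(advantages) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n)
    return [(a - mean) / (std + eps) for a in advantages]

def ema_update(ref_params, policy_params, decay=0.99):
    # Move the reference model slowly towards the current policy.
    return [decay * r + (1.0 - decay) * p
            for r, p in zip(ref_params, policy_params)]

normalized = normalize_advantages([1.0, 2.0, 3.0])
new_ref = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
print(normalized, new_ref)
```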
The study presents a series of experiments benchmarking DPO and PPO across multiple RLHF testbeds, such as the SafeRLHF dataset, HH-RLHF dataset, APPS, and CodeContest datasets. Results indicate that PPO consistently outperforms DPO in all cases, achieving state-of-the-art results in challenging code competition tasks. Specifically, on the CodeContest dataset, a PPO model with 34 billion parameters surpasses the previous state-of-the-art AlphaCode-41B, demonstrating a notable improvement in performance.
Key experimental findings include:
1. Theoretical Analysis: Demonstrates that DPO can produce biased policies due to out-of-distribution exploitation, while PPO’s regularization via KL divergence helps mitigate this issue.
2. Synthetic Scenario Validation: Illustrates DPO’s susceptibility to generating biased distributions favoring unseen responses, while PPO maintains more stable performance.
3. Real Preference Datasets: Shows that DPO’s performance can be improved by addressing distribution shifts through additional supervised fine-tuning (SFT) and iterative training, though PPO still outperforms DPO significantly.
4. Ablation Studies for PPO: Highlights the importance of advantage normalization, large batch sizes, and exponential moving average updates in enhancing PPO’s RLHF performance.
The authors conclude that while DPO offers a simpler training procedure, its performance is hindered by sensitivity to distribution shifts and out-of-distribution data. PPO, with proper tuning and implementation enhancements, demonstrates robust effectiveness and achieves superior results across diverse RLHF tasks.
In summary, the comprehensive analysis and empirical evidence provided in this paper establish PPO as a more reliable and effective method for LLM alignment compared to DPO, particularly in scenarios requiring high-performance and robust alignment with human preferences.

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

This paper by Wu et al. from UC Berkeley proposes a novel reinforcement learning framework, Pairwise Proximal Policy Optimization (P3O), designed to optimize large language models (LLMs) using comparative feedback rather than absolute rewards. Traditional approaches such as Proximal Policy Optimization (PPO) have limitations when dealing with reward functions derived from comparative losses like the Bradley-Terry loss. These limitations include the necessity for reward normalization and token-wise updates, which introduce complexity and potential instability.
The proposed P3O algorithm operates on trajectory-wise policy gradient updates, simplifying the optimization process by directly utilizing comparative rewards. This approach is invariant to equivalent reward functions, addressing the instability issues present in PPO. The paper presents a comprehensive theoretical foundation, establishing that P3O avoids the complications of value function approximation and Generalized Advantage Estimation (GAE), which are essential in PPO.
The implementation of P3O involves the following key steps:
1. Initialization: Policy parameters are initialized.
2. Data Collection: Pairwise trajectories are collected by running the policy on a batch of prompts, generating two responses per prompt.
3. Reward Calculation: Trajectory-wise rewards are computed, incorporating both the preference-based reward and the KL-divergence penalty from the supervised fine-tuning (SFT) model.
4. Gradient Estimation: The policy gradient is estimated using the relative differences in rewards between the paired responses, adjusted by importance sampling to account for the policy change.
5. Policy Update: Gradient updates are applied to the policy parameters, following either separate or joint clipping strategies to maintain stability.
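Steps 3 and 4 can be illustrated with a small sketch. It covers only the on-policy case, where the importance sampling ratio is 1, and omits clipping; the function names and `kl_coef` value are hypothetical.

```python
def trajectory_reward(rm_score, logp_policy, logp_sft, kl_coef=0.05):
    # Trajectory-wise reward: preference-based score minus a KL penalty
    # relative to the SFT model.
    return rm_score - kl_coef * (logp_policy - logp_sft)

def pairwise_pg_coefficients(reward_1, reward_2):
    """Weights on grad log pi(y1|x) and grad log pi(y2|x).

    Only the reward difference matters, so adding the same constant to
    both rewards leaves the update unchanged.
    """
    half_diff = 0.5 * (reward_1 - reward_2)
    return half_diff, -half_diff

coeffs = pairwise_pg_coefficients(1.0, 0.25)
shifted_coeffs = pairwise_pg_coefficients(1.0 + 5.0, 0.25 + 5.0)
print(coeffs, shifted_coeffs)
```

The identical outputs show the invariance to equivalent reward functions: a per-prompt shift of the reward does not change the gradient weights.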
The figure below from the paper contrasts two paradigms. The figure on the left illustrates the prevalent method for fine-tuning LMs using RL, which relies on Absolute Feedback. In this paradigm, algorithms like PPO have to learn a $V$ function, which captures not only the valuable relative preference information but also a less useful part: the scale of the reward for a given prompt. In contrast, the figure on the right presents the paradigm for optimizing a reward model trained via a comparative loss, e.g., Bradley-Terry Loss (Bradley & Terry, 1952). P3O generates a pair of responses per prompt, leveraging only the Relative Feedback, derived from the difference in reward, for policy gradient updates. This method obviates the need for additional $V$ function approximations and intricate components like GAE.

Empirical evaluations are conducted on summarization and question-answering tasks using datasets like TL;DR and Anthropic’s Helpful and Harmless (HH). The results demonstrate that P3O achieves a superior trade-off between reward and KL-divergence compared to PPO and other baseline methods. Specifically, P3O shows improved alignment with human preferences, as evidenced by higher rewards and better performance in head-to-head comparisons evaluated by GPT-4.
The experiments reveal that P3O not only achieves higher reward scores but also maintains better KL control, making it a robust alternative for fine-tuning LLMs with relative feedback. The study underscores the potential of P3O in simplifying the RL fine-tuning process while enhancing model alignment with human values. Future work aims to explore the impacts of reward over-optimization and extend the policy gradient framework to accommodate multiple ranked responses.

BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM

This paper by Xu et al. from UCSB and CMU presents Behavior Preference Optimization (BPO), a novel approach to enhancing online preference learning for large language models (LLMs) by maintaining proximity to the behavior LLM that collects training samples. The key motivation is to address the limitations of traditional Direct Alignment from Preferences (DAP) methods, which do not fully exploit the potential of online training data.
The authors propose a new online DAP algorithm, emphasizing the construction of a trust region around the behavior LLM ($\pi_{\beta}$) rather than a fixed reference model ($\pi_{\text{ref}}$). This approach ensures that the learning LLM ($\pi_{\theta}$) remains aligned with the behavior model, thereby stabilizing the training process and improving performance.
Implementation Details:
1. Algorithm Overview:
  - The BPO algorithm dynamically updates $\pi_{\beta}$ with $\pi_{\theta}$ every K steps, where K is the annotation interval calculated as T/F (total training steps divided by the preference annotation frequency).
  - The training loss $L_{BPO}$ is computed by constraining the KL divergence between $\pi_{\theta}$ and $\pi_{\beta}$, thus constructing a trust region around the behavior LLM.
2. Ensemble of LoRA Weights:
  - To mitigate training instability, the authors optimize an ensemble of Low-Rank Adaptation (LoRA) weights and merge them during inference without additional overhead. This ensemble approach stabilizes the training process.
3. Experimental Setup:
  - The experiments were conducted on three datasets: Reddit TL;DR, Anthropic Helpfulness, and Anthropic Harmlessness, using a preference simulator for annotation.
  - BPO was integrated with various DAP methods, including DPO, IPO, and SLiC, and compared against their online and offline counterparts.
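The update schedule from the algorithm overview can be sketched as follows. The policy here is a single number and the trust region is a hard cap around the behavior model; both are toy stand-ins for the KL-constrained loss, and all names are hypothetical.

```python
import copy

def bpo_training_schedule(total_steps, annotation_frequency, policy, train_step):
    """Run training while refreshing the behavior model every K = T / F steps."""
    K = max(1, total_steps // annotation_frequency)
    behavior = None
    sync_steps = []
    for step in range(total_steps):
        if step % K == 0:
            # pi_beta <- pi_theta; new samples would be collected and
            # annotated with the refreshed behavior model here.
            behavior = copy.deepcopy(policy)
            sync_steps.append(step)
        policy = train_step(policy, behavior)
    return policy, sync_steps

def toy_train_step(policy, behavior, radius=2.0):
    # Toy trust region: move forward, but stay within `radius` of pi_beta.
    proposed = policy["w"] + 1.0
    return {"w": min(proposed, behavior["w"] + radius)}

online = bpo_training_schedule(8, 2, {"w": 0.0}, toy_train_step)
offline = bpo_training_schedule(8, 1, {"w": 0.0}, toy_train_step)
print(online, offline)
```

In this toy setting, refreshing the behavior model twice (F = 2) lets the policy travel twice as far as the single-annotation case (F = 1), because the trust region moves with it.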
The figure below from the paper illustrates an overview of the BPO training pipeline. The training loss $L_{BPO}$ is calculated by constraining the KL divergence between $\pi_{\theta}$ and the behavior LLM $\pi_{\beta}$. Every $K$ steps, the authors update $\pi_{\beta}$ with $\pi_{\theta}$ and use it to collect new samples for annotation.

Experimental Details:
- Preference Annotation Frequency:
  - Different annotation frequencies were tested, demonstrating that even a small increase in frequency (F = 2) significantly improves performance over offline DPO, achieving notable gains in win rates against reference texts.
- Ablation Study:
  - The authors performed an ablation study to verify that the performance improvement stems from the better trust region constructed around $\pi_{\beta}$, not just the higher quality of $\pi_{\beta}$ compared to $\pi_{\text{ref}}$.
- Stabilization Techniques:
  - The use of an ensemble of LoRA weights proved effective in stabilizing training, as single LoRA weight optimization led to rapid deterioration of performance.
Results:
- BPO significantly outperformed both its on-policy and offline DAP counterparts across all three tasks (TL;DR, Helpfulness, and Harmlessness), demonstrating its strong generalizability.
- The dynamic trust region around the behavior LLM ensured better alignment and stability during training, leading to higher win rates and more consistent performance improvements.
The proposed BPO method offers a substantial advancement in online preference learning for LLMs, balancing performance and computational efficiency, and demonstrating remarkable applicability to various DAP methods and annotation frequencies.

SimPO: Simple Preference Optimization with a Reference-Free Reward

This paper by Meng et al. from Danqi Chen’s lab at Princeton proposes SimPO, a novel offline preference optimization algorithm that simplifies and improves upon Direct Preference Optimization (DPO). Unlike DPO, which requires a reference model and can be computationally intensive, SimPO introduces a reference-free reward that aligns more closely with the model generation process.
SimPO uses the average log probability of a sequence as the implicit reward, which better aligns with model generation metrics and removes the need for a reference model. This reward formulation enhances computational efficiency and memory usage. Additionally, SimPO incorporates a target reward margin into the Bradley-Terry objective to create a larger separation between winning and losing responses, further optimizing performance.
The authors conducted extensive evaluations using various state-of-the-art models, including base and instruction-tuned models like Mistral and Llama3. They tested SimPO on benchmarks such as AlpacaEval 2, MT-Bench, and Arena-Hard, demonstrating significant performance improvements over DPO. Specifically, SimPO outperformed DPO by up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard, with minimal increase in response length, indicating that the gains do not come from exploiting longer responses.
The figure below from the paper illustrates

Implementation Details:

Reward Formulation:

SimPO calculates the reward as the average log probability of all tokens in a response using the policy model, normalized by the response length. This formulation eliminates the reference model, making SimPO more efficient.

The reward equation is: $$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_{\theta}(y \mid x) = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_{\theta}(y_i \mid x, y_{<i})$$ where $\beta$ controls reward scaling.

Target Reward Margin:

A margin $\gamma$ is introduced to the Bradley-Terry model to ensure a minimum reward difference between winning and losing responses.

The modified objective is: $$L_{\text{SimPO}}(\pi_{\theta}) = -E_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \frac{\beta}{|y_w|} \log \pi_{\theta}(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_{\theta}(y_l \mid x) - \gamma \right) \right]$$
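The reward and objective above translate into a short per-example computation. This sketch works on plain lists of per-token log probabilities rather than model outputs; the function names and the default `beta` and `gamma` values are illustrative choices.

```python
import math

def simpo_reward(token_logps, beta=2.0):
    # Length-normalized reward: beta times the average token log probability.
    return beta * sum(token_logps) / len(token_logps)

def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    # -log sigmoid(r_w - r_l - gamma), written as a stable softplus.
    margin = (simpo_reward(chosen_token_logps, beta)
              - simpo_reward(rejected_token_logps, beta)
              - gamma)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

short_reward = simpo_reward([-1.0] * 2)
long_reward = simpo_reward([-1.0] * 10)
loss_at_margin = simpo_loss([-0.5, -0.5], [-1.0, -1.0, -1.0])
print(short_reward, long_reward, loss_at_margin)
```

Because the reward is averaged over tokens, a short and a long response with the same per-token log probability receive the same reward. No reference model appears anywhere in the computation.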

Training Setups:
- Base Setup: Models were trained on the UltraChat-200k dataset to create a supervised fine-tuned (SFT) model, followed by preference optimization using the UltraFeedback dataset.
- Instruct Setup: Off-the-shelf instruction-tuned models were used, regenerating chosen and rejected response pairs to mitigate distribution shifts.
Evaluation:
- SimPO was evaluated on AlpacaEval 2, Arena-Hard, and MT-Bench benchmarks. Performance was measured in terms of length-controlled win rate and raw win rate.
- SimPO achieved notable results, such as a 44.7% length-controlled win rate on AlpacaEval 2 and a 33.8% win rate on Arena-Hard, making the SimPO-trained model the strongest 8B open-source model.
Hyperparameters:
- Optimal performance was achieved with $\beta$ set between 2.0 and 2.5, and $\gamma$ between 0.5 and 1.5.

SimPO demonstrates a significant advancement in preference optimization, simplifying the process while improving computational efficiency and performance on multiple benchmarks. The removal of the reference model and the alignment of the reward function with generation metrics are key innovations that contribute to its success.
Code

Discovering Preference Optimization Algorithms with and for Large Language Models

This paper by Chris Lu et al. from Sakana AI, University of Cambridge, and FLAIR, presents a novel approach to offline preference optimization for Large Language Models (LLMs) by leveraging LLM-driven objective discovery. Traditional preference optimization relies on manually-crafted convex loss functions, but this approach is limited by human creativity. The authors propose an iterative method that prompts an LLM to discover new preference optimization loss functions automatically, leading to the development of state-of-the-art algorithms without human intervention.
The core contribution of this paper is the introduction of the Discovered Preference Optimization (DiscoPOP) algorithm, which adaptively combines logistic and exponential losses. This process is facilitated through an LLM-driven pipeline that iteratively proposes and evaluates new loss functions based on their performance on downstream tasks.
Implementation Details:
1. Initial Context Construction: The system prompt initializes the LLM with several established objective functions in code and their performance metrics.
2. LLM Querying and Output Validation: The LLM is queried to propose new objective functions, which are parsed, validated through unit tests, and evaluated.
3. Performance Evaluation: The proposed objective functions are evaluated based on their ability to optimize a model on predefined downstream tasks, with the performance metric feeding back into the LLM.
4. Iterative Refinement: The LLM iteratively refines its proposals, synthesizing new candidate loss functions that blend successful aspects of previous formulations.
Discovery Process:
- The LLM generates PyTorch-based candidate objective functions, taking log probabilities of preferred and rejected completions as inputs.
- Valid candidates are used to fine-tune an LLM, evaluated using performance metrics such as MT-Bench scores.
- The performance data is fed back into the LLM, which iteratively refines its generation strategy based on this feedback.
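The propose, validate, evaluate, and feed-back cycle described above can be sketched as a small loop. This is only a schematic under stated assumptions: `propose`, `validate`, and `evaluate` are hypothetical callables standing in for the LLM query, the unit tests, and the fine-tune-then-benchmark step, and invalid candidates are simply skipped here rather than handled in any more elaborate way.

```python
from typing import Callable, List, Tuple

# Each history entry pairs a candidate's source code with its benchmark score.
History = List[Tuple[str, float]]


def discovery_loop(
    propose: Callable[[History], str],  # hypothetical: LLM proposes loss code
    validate: Callable[[str], bool],    # hypothetical: unit tests on the code
    evaluate: Callable[[str], float],   # hypothetical: fine-tune, then score
    seed_history: History,              # established objectives and scores
    num_rounds: int,
) -> Tuple[str, float]:
    """Return the best (candidate, score) pair seen after num_rounds proposals."""
    history = list(seed_history)
    for _ in range(num_rounds):
        candidate = propose(history)        # the proposer sees all past results
        if not validate(candidate):
            continue                        # discard candidates that fail checks
        score = evaluate(candidate)
        history.append((candidate, score))  # feedback for the next proposal
    return max(history, key=lambda item: item[1])


# Toy stand-ins with placeholder names and scores (not real results).
pool = iter(["candidate_a", "broken_candidate", "candidate_b"])
placeholder_scores = {"seed_loss": 0.1, "candidate_a": 0.3, "candidate_b": 0.2}
best = discovery_loop(
    propose=lambda history: next(pool),
    validate=lambda code: code in placeholder_scores,
    evaluate=lambda code: placeholder_scores[code],
    seed_history=[("seed_loss", 0.1)],
    num_rounds=3,
)
print(best)
```

Passing the three stages in as callables keeps the loop itself independent of which LLM, test suite, or benchmark is plugged in.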
The figure below from the paper illustrates: (Left) Conceptual illustration of LLM-driven discovery of objective functions. The authors prompt an LLM to output new code-level implementations of offline preference optimization losses $\mathbb{E}_{\left(y_w, y_l, x\right) \sim \mathcal{D}}[f(\beta \rho)]$ as a function of the policy's $\left(\pi_\theta\right)$ and reference model's $\left(\pi_{\text{ref}}\right)$ likelihoods of the chosen $\left(y_w\right)$ and rejected $\left(y_l\right)$ completions. Afterward, they run an inner-loop training procedure and evaluate the resulting model on MT-Bench. The corresponding performance is fed back to the language model, which is then queried for the next candidate. (Right) Performance of discovered objective functions on Alpaca Eval.

Results:
- The DiscoPOP algorithm, a dynamically weighted sum of logistic and exponential losses, emerged as a top performer. It was evaluated on multi-turn dialogue tasks (MT-Bench), single-turn dialogue tasks (Alpaca Eval 2.0), summarization tasks (TL;DR), and positive sentiment generation tasks (IMDb).
- DiscoPOP showed significant improvement in win rates against GPT-4 and performed competitively on various held-out tasks, demonstrating robustness and adaptability across different preference optimization challenges.
Technical Details:
- The DiscoPOP loss function is non-convex, incorporating a temperature parameter to balance between logistic and exponential terms based on the log-ratio difference ($\rho$). This dynamic weighting allows the function to handle both large and small differences effectively, contributing to its superior performance.
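As a rough illustration of this dynamic weighting, the sketch below blends a logistic term and an exponential term with a sigmoid gate on the scaled log-ratio difference $\beta\rho$. The gate form $\sigma(\beta\rho / \tau)$, the default values of `beta` and `tau`, and all function names are assumptions made for this example; they should be checked against the paper and its released code before being relied on.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def blended_preference_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.05,  # assumed value, for illustration
    tau: float = 0.05,   # assumed temperature, for illustration
) -> float:
    """Gate-weighted mix of logistic and exponential losses on beta * rho."""
    # rho: the policy's log-ratio (chosen vs. rejected) minus the reference's.
    rho = (policy_logp_chosen - policy_logp_rejected) - (
        ref_logp_chosen - ref_logp_rejected
    )
    z = beta * rho
    gate = sigmoid(z / tau)      # near 0 for very negative z, near 1 for large z
    logistic = softplus(-z)      # equals -log(sigmoid(z)), the DPO-style term
    exponential = math.exp(-z)   # exponential term; overflows for extreme -z
    return (1.0 - gate) * logistic + gate * exponential


# With rho = 0 the gate is 0.5, so the loss is the average of log(2) and 1.
print(blended_preference_loss(-5.0, -5.0, -5.0, -5.0))
```

Because the mixing weight itself depends on $\rho$, the combined function is not convex in $\rho$ even though each of the two components is.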
Significance:
- This LLM-driven discovery approach eliminates the constraints of human creativity in designing loss functions, automating the generation of high-performing preference optimization algorithms.
- The iterative refinement process ensures continuous improvement and adaptability, leading to state-of-the-art performance in preference alignment tasks.
This work opens new avenues for automated discovery and optimization in machine learning, showcasing the potential of leveraging LLMs to enhance and innovate traditional methodologies in a scalable and efficient manner. The proposed DiscoPOP algorithm represents a significant advancement in offline preference optimization, offering a robust and flexible solution for aligning LLM outputs with human preferences.
Code
