## Overview

• “Deep reinforcement learning from human preferences” by OpenAI (2017) was one of the first resources to introduce the concept.
• The basic idea behind RLHF is to take a pretrained language model and to have humans rank the results it outputs.
• RLHF optimizes language models by combining reinforcement learning algorithms with human feedback, which helps the model learn and improve its performance.
• By incorporating human feedback, RLHF can help language models to better understand and generate natural language, as well as improve their ability to perform specific tasks such as text classification or language translation.
• Additionally, RLHF can also help to mitigate the problem of bias in language models by allowing humans to correct and steer the model towards more equitable and inclusive language use (but the flipside is that it also introduces an avenue to embed bias stemming from the humans in the loop).
• Let’s delve into the nitty-gritty of RLHF below!

## Basics of RL

• To comprehend why reinforcement learning is employed in RLHF, we need to gain a better understanding of what it entails.
• Reinforcement learning is grounded in a mathematical framework in which an agent interacts with an environment, as shown below (source):

• The agent interacts with the environment by taking an action, and the environment returns a state and a reward.
• Here, the reward is the objective that we want to optimize.
• The state is the representation of the environment/world at the current time step.
• A policy is used to map from that state to an action.
• Now let’s talk about how it can be leveraged for NLP tasks with LLMs.
• As an example, how would you encode humor, ethics, or safety for a model?
• These involve subtleties that humans understand intuitively, but that we cannot easily capture by writing a custom loss function.
• This is where Reinforcement Learning from Human Feedback comes in.
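The agent-environment loop described above can be sketched in a few lines. This is a toy illustration, not something from the original source: the `CoinFlipEnv` environment, the `copy_policy` function, and the reward of 1 for a matching action are all hypothetical choices made to keep the example small.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the state is a coin flip (0 or 1)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.randint(0, 1)

    def step(self, action):
        # Reward is 1 when the action matches the current state, else 0.
        reward = 1 if action == self.state else 0
        # The environment then moves to a new state and returns it.
        self.state = self.rng.randint(0, 1)
        return self.state, reward


def copy_policy(state):
    """A policy maps a state to an action; this one simply repeats the state."""
    return state


def run_episode(env, policy, steps=10):
    """Run the interaction loop and return the total reward collected."""
    total_reward = 0
    state = env.state
    for _ in range(steps):
        action = policy(state)            # policy: state -> action
        state, reward = env.step(action)  # environment: action -> (state, reward)
        total_reward += reward
    return total_reward


print(run_episode(CoinFlipEnv(seed=0), copy_policy))  # 10
```

Because `copy_policy` always matches the state it observes, it earns the maximum reward on every step. A real RL algorithm would have to discover such a policy from the reward signal alone.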

• The image above (source) displays how the RLHF model takes inputs from both an LM and human annotation and creates a response that is better than either would produce individually.

## Training

• Let’s start by looking at RLHF at a high level and collecting the relevant context and facts.
• RLHF can be quite complex as it requires training multiple models and different stages of deployment.
• Since GPT-4, ChatGPT, and InstructGPT are finetuned with RLHF (by OpenAI), let’s dive deeper into it by looking at the training steps.
• RLHF was designed to make models safer and more accurate, and to make sure the generated text is better aligned with what users want.
• The AI agent starts by randomly making decisions in the environment. Periodically, a human ranker will receive multiple data samples (could even be outputs of a model as we’ll see later) to rank in terms of preference (hence the term “human preferences”); e.g., given two video clips, the human rater decides which clip better suits the current task.
• The AI agent will simultaneously build a model of the goal of the task and refine it using RL.
• As it learns the behavior, the AI agent will start to ask for human feedback only on videos it is uncertain about, and further refine its understanding.
• This cyclic behavior can be visually seen in the image below from OpenAI:

• OpenAI used prompts that its customers had submitted through the GPT-3 API and obtained human feedback by having people manually rank several model outputs for each prompt; this feedback was then used to fine-tune the language model. This improved the quality of the outputs the model produces and thus steered the model in the direction of trust and safety.
• This process is known as supervised learning, where the model is trained using labeled data to improve its accuracy and performance.
• By fine-tuning the model with customer prompts, OpenAI aimed to make GPT-3 more effective at generating relevant and coherent text in response to a given prompt.
• When the task was to teach the AI agent how to backflip, OpenAI found that the AI agent needed 900 bits of feedback, which translates to less than an hour of a human’s time.
• The challenge that this algorithm faces is that it is only as good as its human feedback.
• So you may be wondering, why don’t we always use RLHF? Well, it scales poorly as relying on human annotation becomes a bottleneck.

• “Manual labeling of data is slow and expensive, which is why unsupervised learning has always been a long-sought goal of machine learning researchers.” bdtechtalks

• We will break down the training process into three steps referenced from the source (here).

### Pretraining Language Models

• As we know, language models are pretrained at a range of sizes (parameter counts) and can be fine-tuned for specific tasks.
• Let’s look more into how this is related to RLHF.
• Generating data to train a reward model is necessary to integrate human preferences into the system.
• However, there is no clear answer as to which model is best for starting RLHF since the design space of options in RLHF training is not thoroughly explored.

• The image above (source) shows the inner workings of pretraining a language model (and an optional path to fine-tuning it further with RLHF – shown with a dashed line at the bottom).
• Industry experiments have ranged from 10 billion to 280 billion parameters but there is no answer on the best model size to use in industry as of yet.
• In addition, companies can pay humans to write responses to existing prompts and this data can then be used for training.
• The downside here is that it can get expensive.

### Reward Model

• The most important task of RLHF is to generate a reward model (RM) that assigns a scalar reward to input text based on human preferences.
• The RM can be an end-to-end LM or a modular system, and is trained using a dataset of prompt-generation pairs.

• The image above (source) displays how the reward model works internally.

• Looking at the model, we can see that the goal is a model that maps an input text sequence to a scalar reward value.
• RL is known to take a single scalar value and optimize it over time through its environment.
• Reward model training also starts with a dataset, but note that it’s a different dataset from the one used for language model pretraining. The dataset here is a prompt input dataset that is more focused on specific preferences.
• It contains prompts for a specific use-case the model will be used for along with the expected reward associated with the prompt sample (i.e., $(prompt, reward)$ pairs). The dataset is typically much smaller than the one it was pretrained on. The output is thus a ranking/reward for the text sample.
• Often, you can use an ensemble of large “teacher” models to mitigate bias and add diversity to the rankings, or have a human in the loop do the ranking.
• A quick example of a feedback interface for reward model training: when you use ChatGPT, each response has a thumbs-up or thumbs-down icon. This lets the model learn by crowdsourcing rankings of its outputs.
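One common way to train a reward model on human rankings, which the text above does not spell out, is to split each ranking into pairs and minimize a pairwise logistic loss, $-\log \sigma(r_{preferred} - r_{rejected})$. The sketch below assumes that formulation. The names `ranking_to_pairs`, `pairwise_loss`, and `mean_ranking_loss`, as well as the example scores, are hypothetical stand-ins for a real reward model and its outputs.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first item of each
    # pair is always the one the human ranked higher.
    return list(combinations(ranked_outputs, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_ranking_loss(ranked_outputs, reward_fn):
    """Average the pairwise loss over every pair implied by one ranking."""
    pairs = ranking_to_pairs(ranked_outputs)
    losses = [pairwise_loss(reward_fn(a), reward_fn(b)) for a, b in pairs]
    return sum(losses) / len(losses)


# Hypothetical scalar scores from a reward model for three generations.
scores = {"helpful answer": 2.0, "vague answer": 0.5, "gibberish": -1.0}
ranking = ["helpful answer", "vague answer", "gibberish"]  # human ranking, best first
print(mean_ranking_loss(ranking, scores.get))
```

The loss is small when the reward model already scores the preferred output higher, and large when it disagrees with the human ranking, so minimizing it pushes the scalar rewards to respect human preferences.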

### Fine-tuning the LM with RL

• The image above (source) explains how finetuning works with the reward model.

• Here, we take the prompt dataset (something the user said or something we want the model to be able to generate well for).
• It is then sent to our RL Policy which is a tuned language model to generate an output that is appropriate based on the prompt. Along with the output of the initial LM (being trained), this is passed into the reward model that generates a scalar reward value.
• This is done in a feedback loop (since the reward model can assign rewards – based on the rank annotations by humans it was trained on – for as many samples as resources permit), so it’s updated over time.
• The Kullback-Leibler (KL) divergence, which is a measure of the difference between two probability distributions, can be used to compare the two distributions (initial LM output vs. tuned LM output) and to penalize the tuned model for drifting too far from the initial one.
• Thus, with RLHF, KL divergence can be used to compare the probability distribution of an agent’s current policy with a reference distribution that represents the desired behavior.
• Additionally, RLHF can be finetuned with Proximal Policy Optimization.
• Proximal Policy Optimization (PPO) is a popular reinforcement learning algorithm that is often used in the Reinforcement Learning from Human Feedback (RLHF) fine-tuning process because of its ability to efficiently optimize policies in complex environments with high-dimensional state and action spaces.
• PPO efficiently balances exploration and exploitation during training, which is important for RLHF agents that must learn from both human feedback and trial-and-error exploration.
• The use of PPO in RLHF can result in faster and more robust learning, as the agent is able to learn from both human feedback and reinforcement learning.
• To a reasonable degree, this process deters the language model from producing gibberish (since that would imply that the model is getting low rewards). In other words, it drives the model to focus on earning a high reward, which ultimately leads to it producing more accurate text.
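The two ingredients above, the KL term and PPO, can be sketched as follows. The sketch rests on assumptions the text does not state: the KL divergence is estimated per sample as the sum of token log-probability differences between the tuned policy and the initial LM, and it is subtracted from the reward model score with a coefficient `beta`. The PPO part shows only the clipped surrogate objective for a single sample, not a full training loop. The function names and the defaults `beta=0.1` and `clip_eps=0.2` are illustrative choices.

```python
import math


def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward model score minus a penalty for drifting from the initial LM."""
    # Sample-based KL estimate: sum over generated tokens of
    # log pi_tuned(token) - log pi_initial(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for one sample (to be maximized)."""
    # Probability ratio between the updated policy and the policy
    # that generated the sample.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical per-token log-probabilities for one generated response.
tuned = [-0.5, -0.5]
initial = [-1.0, -1.0]
print(kl_penalized_reward(1.0, tuned, initial, beta=0.1))
```

When the tuned policy assigns the same log-probabilities as the initial LM, the penalty vanishes and the reward equals the reward model score. The further the tuned policy drifts, the more reward it gives up, which is how the penalty discourages degenerate outputs that merely exploit the reward model.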

## Bias

• A fair question to ask now is whether RLHF can add bias to the model. This is an important topic as large conversational language models are being deployed in various applications, from search engines (Bing Chat, Google’s Bard) to document editors (Microsoft Office Copilot, Google Docs, Notion, etc.).
• The answer is, yes, just as with any machine learning approach with human input, RLHF has the potential to introduce bias.
• Let’s look at the different forms of bias it can introduce:
• Selection bias:
• RLHF relies on feedback from human evaluators, who may have their own biases and preferences (and can thus limit their feedback to topics or situations they can relate to). As such, the agent may not be exposed to the true range of behaviors and outcomes that it will encounter in the real world.
• Confirmation bias:
• Human evaluators may be more likely to provide feedback that confirms their existing beliefs or expectations, rather than providing objective feedback based on the agent’s performance.
• This can lead to the agent being reinforced for certain behaviors or outcomes that may not be optimal or desirable in the long run.
• Inter-rater variability:
• Different human evaluators may have different opinions or judgments about the quality of the agent’s performance, leading to inconsistency in the feedback that the agent receives.
• This can make it difficult to train the agent effectively and can lead to suboptimal performance.
• Limited feedback:
• Human evaluators may not be able to provide feedback on all aspects of the agent’s performance, leading to gaps in the agent’s learning and potentially suboptimal performance in certain situations.
• Now that we’ve seen the different types of bias possible with RLHF, let’s look at ways to mitigate them:
• Diverse evaluator selection:
• Selecting evaluators with diverse backgrounds and perspectives can help to reduce bias in the feedback, just as it does in the workplace.
• This can be achieved by recruiting evaluators from different demographic groups, regions, or industries.
• Consensus evaluation:
• Using consensus evaluation, where multiple evaluators provide feedback on the same task, can help to reduce the impact of individual biases and increase the reliability of the feedback.
• This is almost like ‘normalizing’ the evaluation.
• Calibration of evaluators:
• Calibrating evaluators by providing them with training and guidance on how to provide feedback can help to improve the quality and consistency of the feedback.
• Evaluation of the feedback process:
• Regularly evaluating the feedback process, including the quality of the feedback and the effectiveness of the training process, can help to identify and address any biases that may be present.
• Evaluation of the agent’s performance:
• Regularly evaluating the agent’s performance on a variety of tasks and in different environments can help to ensure that it is not overfitting to specific examples and is capable of generalizing to new situations.
• Balancing the feedback:
• Balancing the feedback from human evaluators with other sources of feedback, such as self-play or expert demonstrations, can help to reduce the impact of bias in the feedback and improve the overall quality of the training data.
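Consensus evaluation and inter-rater variability can both be made concrete with a small sketch. It assumes each evaluator casts a vote for which of two outputs ("A" or "B") is better. The function names and the sample votes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_label(votes):
    """Return the majority vote, or None when there is no clear winner."""
    if not votes:
        return None
    counts = Counter(votes).most_common()
    # A tie between the two most common choices means no consensus.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def pairwise_agreement(votes):
    """Fraction of evaluator pairs that gave the same vote."""
    pairs = list(combinations(votes, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


votes = ["A", "A", "B"]  # three evaluators judging the same comparison
print(consensus_label(votes))     # A
print(pairwise_agreement(votes))
```

A low agreement score on a given comparison is a signal of inter-rater variability. Such items could be sent to more evaluators or used to refine the labeling guidelines during calibration.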

## Reinforcement Learning vs Supervised Learning for Finetuning

• Note: This section is inspired by Sebastian Raschka’s post.

• RLHF relies on labels provided by human feedback, so the question arises: why don’t we just use those labels with supervised learning directly?
• Here are the reasons the post provided:
1. Supervised learning focuses on reducing the gap between the true label and the model output. Here, that would mean the model just memorizes the ranks and possibly produces gibberish output, since its only focus is to maximize its rank.
• As we talked about earlier, this is what the reward model does and this is where KL divergence can help.
2. In that case, what if we jointly train two losses, one for rank and one for the output? This scenario would only work for Q&A tasks and not for every conversation that ChatGPT or other conversational models have.
3. GPT itself uses cross-entropy loss for next-word prediction. However, with RLHF, we do not use standard loss functions but rather objective functions that help the model better serve the task for which RLHF was used, e.g., trust and safety.
• Additionally, since negating a single word can completely change the meaning of the text, a token-level loss such as cross-entropy is not the best fit here.
4. “Empirically, RLHF tends to perform better than SL. This is because SL uses a token-level loss (that can be summed or averaged over the text passage), and RL takes the entire text passage, as a whole, into account.” (source)
5. Lastly, it doesn’t have to be either/or: ChatGPT and InstructGPT are first finetuned with SL and then updated with RLHF.
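The contrast drawn in points 3 and 4 can be sketched as follows. The token-level loss is the usual averaged cross-entropy. For the sequence-level side, the sketch swaps in a REINFORCE-style surrogate rather than PPO, purely to keep it short. The log-probabilities and the reward value are made-up numbers, and the function names are hypothetical.

```python
def token_level_loss(target_logprobs):
    """Supervised loss: mean negative log-probability of each target token."""
    return -sum(target_logprobs) / len(target_logprobs)


def sequence_level_loss(sampled_logprobs, reward):
    """REINFORCE-style surrogate: one scalar reward scales the whole sequence."""
    # Minimizing this raises the probability of sequences with positive
    # reward and lowers it for sequences with negative reward.
    return -reward * sum(sampled_logprobs)


# Hypothetical log-probabilities the model assigns to the tokens of
# "the answer is not safe"; only the token "not" is unlikely.
logprobs = [-0.1, -0.2, -0.1, -2.3, -0.1]

print(token_level_loss(logprobs))                   # each token contributes separately
print(sequence_level_loss(logprobs, reward=-1.0))   # whole passage judged at once
```

In the token-level loss, the single word that flips the meaning is just one term among five. In the sequence-level objective, a reward assigned to the passage as a whole changes the sign and scale of the update for every token in it.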

## Use cases

• Now let’s look at how different papers have used this methodology with their own tweaks (source).
• The latest LLMs like ChatGPT tend to use RLHF for finetuning on top of supervised learning, rather than relying on supervised learning alone.
• Anthropic (Constitutional AI):
• The initial policy they use for RLHF has context distillation that helps improve helpfulness, honesty, and harmlessness (HHH).
• Preference model pretraining (PMP): fine-tunes the LM on a dataset of binary rankings.
• OpenAI (InstructGPT, ChatGPT):
• Pioneered using RLHF.
• Both InstructGPT and ChatGPT first finetune the model via supervised learning and then update it with RLHF.
• Humans write the initial LM training text, and the RL policy is then trained to match it.
• Extensively uses human annotation.
• Uses PPO.
• DeepMind (A2C):
• Does not use PPO; uses the advantage actor-critic (A2C) algorithm instead.
• Trains on different rules and preferences as well as trains on things the model should not do.