Refresher: Basics of Reinforcement Learning

  • To comprehend why reinforcement learning is employed in RLHF, we need to gain a better understanding of what it entails.
  • Reinforcement learning is grounded in a mathematical framework in which an agent interacts with an environment, as shown below (source):

  • The agent interacts with the environment by taking an action, and the environment returns a state and a reward. Here,
    • The reward is the objective that we want to optimize.
    • A state is the representation of the environment/world at the current time index.
    • A policy maps a state to an action.
  • In response to an action performed by the agent, the environment returns a corresponding reward and the next state.
  • Now let’s talk about how it can be leveraged for NLP tasks with LLMs.
  • As an example, how would you encode humor, ethics, or safety for a model?
  • These qualities involve subtleties that humans grasp intuitively but that we cannot readily capture in a hand-crafted loss function.
  • This is where Reinforcement Learning from Human Feedback (RLHF) comes in.

  • The image above (source) displays how the RLHF model takes inputs from both an LM and human annotation and produces a response that is better than what either could yield on its own.
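
  • As a concrete refresher of the agent, state, action, reward, and policy terms introduced earlier, here is a minimal sketch of the agent-environment loop. The `CorridorEnv` environment, the `always_right_policy` policy, and the reward of 1.0 at the goal are hypothetical choices made only for illustration; they do not come from any particular RL library.

```python
class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # The state is the representation of the world at the current time index.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # the reward is what we want to optimize
        return self.state, reward, done


def always_right_policy(state):
    """A policy maps a state to an action; this one ignores the state."""
    return 1


def run_episode(env, policy, max_steps=50):
    """Run one episode and return (total reward, list of (state, action, reward))."""
    state = env.reset()
    total_reward = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, trajectory
```

  • Calling `run_episode(CorridorEnv(), always_right_policy)` reaches the goal in four steps with a total reward of 1.0. In RLHF, the hand-written reward in `step` is replaced by a reward learned from human preferences.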


  • Let’s start by looking at RLHF at a high level and collecting the relevant context and facts.
  • RLHF can be quite complex, as it requires training multiple models and involves several stages of deployment.
  • Since GPT-4, ChatGPT, and InstructGPT are finetuned with RLHF (by OpenAI), let’s dive deeper into it by looking at the training steps.
  • RLHF was designed to make models safer and more accurate, and to ensure that the text they generate is better aligned with what users intend.
  • The AI agent starts by making random decisions in the environment. Periodically, a human rater receives multiple data samples (which could even be outputs of a model, as we’ll see later) to rank in terms of preference (hence the term “human preferences”); e.g., given two video clips, the human rater decides which clip better suits the current task.
  • The AI agent simultaneously builds a model of the task’s goal and refines it using RL.
  • As it learns the behavior, the AI agent starts to ask for human feedback only on videos it is uncertain about, and further refines its understanding.
  • This cyclic behavior can be visually seen in the image below from OpenAI:

  • OpenAI used prompts that its customers had submitted to the model via the GPT-3 API and obtained human feedback by manually ranking several candidate outputs, which was then used to fine-tune the language model. This improved the quality of the outputs the model produces and thus steered the model toward trust and safety.
  • Fine-tuning on such human-labeled data is a form of supervised learning, where the model is trained using labeled data to improve its accuracy and performance.
  • By fine-tuning the model with customer prompts, OpenAI aimed to make GPT-3 more effective at generating relevant and coherent text in response to a given prompt.
  • When the task was to teach the AI agent how to backflip, OpenAI found that the agent needed about 900 bits of feedback, which translates to less than an hour of a human’s time.
  • The challenge that this algorithm faces is that it is only as good as its human feedback.
  • So you may be wondering, why don’t we always use RLHF? Well, it scales poorly as relying on human annotation becomes a bottleneck.

  • “Manual labeling of data is slow and expensive, which is why unsupervised learning has always been a long-sought goal of machine learning researchers.” bdtechtalks

  • We will break down the training process into three steps referenced from the source (here).

Pretraining Language Models

  • As we know, language models come in many architectures and parameter counts, are pre-trained on large text corpora, and can then be fine-tuned for specific tasks.
  • Let’s look more into how this is related to RLHF.
  • Generating data to train a reward model is necessary to integrate human preferences into the system.
  • However, there is no clear answer as to which model is best for starting RLHF since the design space of options in RLHF training is not thoroughly explored.

  • The image above (source) shows the inner workings of pretraining a language model (and an optional path to fine-tuning it further with RLHF – shown with a dashed line at the bottom).
  • Industry experiments have ranged from 10 billion to 280 billion parameters, but there is not yet a consensus on the best model size to use.
  • In addition, companies can pay humans to write responses to existing prompts and this data can then be used for training.
    • The downside here is that it can get expensive.

Training a Reward Model

  • The central task in RLHF is to build a reward model (RM) that assigns a scalar reward to input text based on human preferences.
  • The RM can be an end-to-end LM or a modular system, and is trained using a dataset of prompt-generation pairs.

  • The image above (source) displays how the reward model works internally.

  • Looking at the model, we can see that the goal is a model that maps an input text sequence to a scalar reward value.
  • A scalar is needed because RL takes a single scalar reward signal and optimizes it over time through interaction with its environment.
  • Reward model training also starts with a dataset, but note that it’s a different dataset from the one used for language model pretraining. The dataset here is a prompt dataset focused on specific preferences.
  • It contains prompts for a specific use-case the model will be used for along with the expected reward associated with the prompt sample (i.e., $(prompt, reward)$ pairs). The dataset is typically much smaller than the one it was pretrained on. The output is thus a ranking/reward for the text sample.
  • Oftentimes, you can use an ensemble of large “teacher” models to mitigate bias and add diversity to the ranking, or have a human in the loop who ranks the samples.
  • A quick example of a reward model training interface with human scoring is the thumbs up or thumbs down icon in ChatGPT. This allows the model to learn by crowdsourcing rankings of its output.
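
  • To make the mapping from human rankings to a scalar reward more concrete, here is a minimal sketch of one common formulation: a ranking is expanded into (preferred, rejected) pairs, and the reward model is penalized with a Bradley-Terry style loss, -log(sigmoid(r_preferred - r_rejected)). This pairwise setup is an assumption for illustration and is not necessarily the exact $(prompt, reward)$ format described above; `reward_fn` is a hypothetical stand-in for a learned reward model.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first element of each pair ranks higher.
    return list(combinations(ranked_responses, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # log(1 + exp(-margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_pairwise_loss(reward_fn, prompt, ranked_responses):
    """Average the pairwise loss over every pair implied by one human ranking."""
    pairs = ranking_to_pairs(ranked_responses)
    losses = [
        pairwise_loss(reward_fn(prompt, preferred), reward_fn(prompt, rejected))
        for preferred, rejected in pairs
    ]
    return sum(losses) / len(losses)
```

  • The loss is small when the reward model scores the human-preferred response higher and grows when it scores the rejected response higher, so minimizing it pushes the scalar rewards to agree with the human ranking.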

Fine-tuning the LM with RL

  • The image above (source) explains how finetuning works with the reward model.

  • Here, we take the prompt dataset (something the user said or something we want the model to be able to generate well for).
  • It is then sent to our RL policy, which is a tuned language model, to generate an output that is appropriate for the prompt. The same prompt is also passed through the initial LM so that the two outputs can be compared, and the reward model assigns a scalar reward value to the generated output.
  • This is done in a feedback loop (since the reward model can assign rewards – based on the rank annotations by humans it was trained on – for as many samples as resources permit), so it’s updated over time.
  • The Kullback-Leibler (KL) divergence, which is a measure of the difference between two probability distributions, can be used to compare the two distributions (initial LM output vs. tuned LM output) and to penalize the tuned model for drifting too far from the initial one.
    • Thus, with RLHF, KL divergence can be used to compare the probability distribution of an agent’s current policy with a reference distribution that represents the desired behavior.
  • Additionally, the LM can be finetuned in RLHF with Proximal Policy Optimization.
    • Proximal Policy Optimization (PPO) is a popular reinforcement learning algorithm that is often used in the RLHF fine-tuning process because of its ability to efficiently optimize policies in complex environments with high-dimensional state and action spaces.
    • PPO efficiently balances exploration and exploitation during training, which is important for RLHF agents that must learn from both human feedback and trial-and-error exploration.
    • The use of PPO in RLHF can result in faster and more robust learning, as the agent is able to learn from both human feedback and reinforcement learning.
  • To a reasonable degree, this process deters the language model from producing gibberish (since that would imply that the model is getting low rewards). In other words, it drives the model to focus on earning a high reward, which ultimately leads it to produce more accurate text.
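
  • The two ingredients above, the KL penalty and PPO, can be sketched with a few lines of arithmetic. This is a simplified illustration under stated assumptions: the KL divergence is computed over a single discrete next-token distribution (real systems typically work with per-token log-probabilities over sampled sequences), and the coefficient `beta` and the clipping range `clip_eps` are hypothetical values chosen for the example.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_reward(rm_score, tuned_probs, initial_probs, beta=0.1):
    """Reward-model score minus a KL penalty for drifting from the initial LM."""
    return rm_score - beta * kl_divergence(tuned_probs, initial_probs)


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio is pi_new(action) / pi_old(action); advantage estimates how much
    better the action was than expected.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

  • When the tuned and initial distributions are identical, the penalty is zero and the reward is just the reward-model score; the further the tuned model drifts, the more reward it gives up. Likewise, with `clip_eps=0.2`, a probability ratio of 1.5 with a positive advantage is capped at 1.2 times the advantage, which limits how far a single update can move the policy.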


  • The following diagram (source) illustrates the three steps in the process: (i) pre-training on large web-scale data, (ii) supervised fine-tuning on instruction data (instruction tuning), and (iii) RLHF:

  • Below, a similar diagram from OpenAI (source) shows a high level overview of how ChatGPT is trained using the same process as highlighted above.

Bias Concerns and Mitigation Strategies

  • A fair question to ask now is whether RLHF can add bias to the model. This is an important topic, as large conversational language models are being deployed in applications ranging from search engines (Bing Chat, Google’s Bard) to document editors (Microsoft Office Copilot, Google Docs, Notion, etc.).
  • The answer is, yes, just as with any machine learning approach with human input, RLHF has the potential to introduce bias.
  • Let’s look at the different forms of bias it can introduce:
    • Selection bias:
      • RLHF relies on feedback from human evaluators, who may have their own biases and preferences (and can thus limit their feedback to topics or situations they can relate to). As such, the agent may not be exposed to the true range of behaviors and outcomes that it will encounter in the real world.
    • Confirmation bias:
      • Human evaluators may be more likely to provide feedback that confirms their existing beliefs or expectations, rather than providing objective feedback based on the agent’s performance.
      • This can lead to the agent being reinforced for certain behaviors or outcomes that may not be optimal or desirable in the long run.
    • Inter-rater variability:
      • Different human evaluators may have different opinions or judgments about the quality of the agent’s performance, leading to inconsistency in the feedback that the agent receives.
      • This can make it difficult to train the agent effectively and can lead to suboptimal performance.
    • Limited feedback:
      • Human evaluators may not be able to provide feedback on all aspects of the agent’s performance, leading to gaps in the agent’s learning and potentially suboptimal performance in certain situations.
  • Now that we’ve seen the different types of bias possible with RLHF, let’s look at ways to mitigate them:
    • Diverse evaluator selection:
      • Selecting evaluators with diverse backgrounds and perspectives can help to reduce bias in the feedback, just as it does in the workplace.
      • This can be achieved by recruiting evaluators from different demographic groups, regions, or industries.
    • Consensus evaluation:
      • Using consensus evaluation, where multiple evaluators provide feedback on the same task, can help to reduce the impact of individual biases and increase the reliability of the feedback.
      • This is almost like ‘normalizing’ the evaluation.
    • Calibration of evaluators:
      • Calibrating evaluators by providing them with training and guidance on how to provide feedback can help to improve the quality and consistency of the feedback.
    • Evaluation of the feedback process:
      • Regularly evaluating the feedback process, including the quality of the feedback and the effectiveness of the training process, can help to identify and address any biases that may be present.
    • Evaluation of the agent’s performance:
      • Regularly evaluating the agent’s performance on a variety of tasks and in different environments can help to ensure that it is not overfitting to specific examples and is capable of generalizing to new situations.
    • Balancing the feedback:
      • Balancing the feedback from human evaluators with other sources of feedback, such as self-play or expert demonstrations, can help to reduce the impact of bias in the feedback and improve the overall quality of the training data.
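
  • As one possible reading of consensus evaluation, the sketch below aggregates several evaluators’ votes on the same comparison by majority vote and drops comparisons where agreement is too low. The majority-vote rule, the `min_agreement` threshold of 0.6, and the function names are assumptions made for illustration; other aggregation schemes are equally valid.

```python
from collections import Counter


def consensus_label(votes):
    """Return (majority label, agreement fraction); the label is None on a tie."""
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement
    return top_label, agreement


def filter_by_agreement(items, min_agreement=0.6):
    """Keep only the comparisons whose evaluators agree strongly enough.

    items maps a comparison id to the list of labels given by its evaluators.
    """
    kept = {}
    for item_id, votes in items.items():
        label, agreement = consensus_label(votes)
        if label is not None and agreement >= min_agreement:
            kept[item_id] = label
    return kept
```

  • The agreement fraction also gives a rough view of inter-rater variability: comparisons with persistently low agreement may signal ambiguous guidelines and can be fed back into evaluator calibration.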

Reinforcement Learning vs. Supervised Learning for Finetuning

  • Note: This section is inspired by Sebastian Raschka’s post.

  • RLHF needs labels provided by human feedback, so the question arises as to why we don’t just use those labels directly with supervised learning.
  • Here are the reasons the post provided:
    1. Supervised learning focuses on reducing the gap between the true label and the model output. Here, it would mean the model would just memorize the ranks and possibly produce gibberish output, as its focus is to maximize its rank.
      • As we talked about earlier, this is what the reward model does and this is where KL divergence can help.
    2. In that case, what if we jointly train with two losses, one for rank and one for the output? This scenario would only work for question-answering tasks and not for every kind of conversation that ChatGPT or other conversational models have.
    3. GPT itself uses cross-entropy loss for next word prediction. However, with RLHF, we do not use standard loss functions but rather objective functions that help the model better serve the task for which RLHF was used, e.g., trust and safety.
      • Additionally, since negating a single word can completely change the meaning of the text, a token-level loss such as cross-entropy is not the best fit here.
    4. “Empirically, RLHF tends to perform better than SL. This is because SL uses a token-level loss (that can be summed or averaged over the text passage), and RL takes the entire text passage, as a whole, into account.” (source)
    5. Lastly, it doesn’t have to be either/or: ChatGPT and InstructGPT are first finetuned with SL and then updated with RLHF.
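
  • The contrast between a token-level loss and a passage-level reward can be sketched as follows. The supervised loss averages a per-token cross-entropy term, while the RL loss weights the log-probability of the whole sampled passage by one scalar reward. For simplicity, the RL side uses a plain REINFORCE-style surrogate rather than PPO, and the probabilities and rewards are hypothetical inputs.

```python
import math


def supervised_loss(reference_token_probs):
    """Token-level cross-entropy, averaged over the passage.

    reference_token_probs[i] is the probability the model assigns to the
    i-th token of the reference text.
    """
    return -sum(math.log(p) for p in reference_token_probs) / len(reference_token_probs)


def policy_gradient_loss(sampled_token_probs, sequence_reward):
    """REINFORCE-style surrogate: one scalar reward weights the whole passage.

    sampled_token_probs[i] is the probability the policy assigned to the
    i-th token it actually generated.
    """
    passage_log_prob = sum(math.log(p) for p in sampled_token_probs)
    return -sequence_reward * passage_log_prob
```

  • In the supervised case, every token contributes its own error term regardless of what the passage means as a whole. In the RL case, a passage that the reward model scores highly has all of its tokens made more likely, and a passage with a negative reward has all of its tokens made less likely.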

Use cases

  • Now let’s look at how different papers have used this methodology with their own tweaks (source).
  • The latest LLMs like ChatGPT tend to use RLHF for finetuning instead of supervised learning.
  • Anthropic (Constitutional AI):
    • The initial policy they use for RLHF has context distillation that helps improve helpfulness, honesty, and harmlessness (HHH).
    • Preference model pretraining (PMP): fine-tunes the LM on a dataset of binary rankings.
  • OpenAI (InstructGPT, ChatGPT):
    • Pioneered using RLHF.
    • Both InstructGPT and ChatGPT first finetune the model via supervised learning and then update it with RLHF.
    • Humans generate the initial LM training text, and the RL policy is then trained to match it.
    • Extensively uses human annotation.
    • Uses PPO.
  • DeepMind (A2C):
    • Does not use PPO; uses advantage actor-critic (A2C) as the algorithm instead.
    • Trains on different rules and preferences as well as trains on things the model should not do.

Anthropic’s Constitutional AI: Harmlessness from AI Feedback


  • Now let’s delve deeper into Anthropic’s conversational assistant, Claude, which takes a different approach to RLHF by removing the human feedback and replacing it with AI feedback, thus creating RL from AI Feedback (RLAIF), colloquially known as RLHF V2.
  • Anthropic’s proposal comprises Constitutional AI (CAI), an approach in which a set of rules or principles is the only guidance provided to the AI assistant to help it make decisions. The goal of Anthropic’s Constitutional AI is to train AI systems to remain helpful, honest, and harmless (HHH) even as they surpass human-level performance.
  • The Github repo provides access to two datasets, specifically:
    1. Human preference data about helpfulness and harmlessness from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
    2. Human-generated red teaming data from Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.
  • Constitutional AI encompasses two steps: (i) the supervised learning phase, and (ii) the RLAIF phase, which we will explore in greater detail below in the architecture section.
  • The CAI approach has four fundamental motivations, as listed in the paper and quoted below:
    1. “to examine the potential for using AI systems to help supervise other AIs and increase the scalability of supervision.
    2. to improve upon previous work training a harmless AI assistant by reducing evasive responses, reducing tension between helpfulness and harmlessness, and encouraging the AI to explain objections to harmful requests
    3. to make the principles governing AI behavior and their implementation more transparent.
    4. to reduce iteration time by eliminating the need to collect new human feedback labels when altering the objective.”
  • CAI offers the following benefits that we have not seen with prior conversational AI systems (sourced from here):
    1. It allows the model to explain why it’s refusing to provide an answer, using its chain-of-thought reasoning ability. This gives us insight into the model’s reasoning.
    2. With RLAIF, where no human labelers are required, it reduces cost and human labor drastically.
    3. It allows the LLM to ‘reflect’ on the output it has generated by adhering to the set of principles or the constitution. The AI will review its own responses and make sure they adhere to the constitution.
  • This technology will be released with Claude, Anthropic’s chatbot, which is currently in limited beta as of this writing, but is highly anticipated.


  • The image above (source) shows the architecture of CAI, which is split into two phases: (i) the supervised phase (first row), and (ii) reinforcement learning phase (second row).

Supervised Learning Phase

  • Let’s start with the supervised phase. During the supervised learning phase, the AI uses the “Constitution”, or set of rules, for self-improvement.
  • In this phase, the AI writes responses to a wide variety of prompts and then revises these initial responses with respect to the constitution. The sequential steps for this phase are listed below in detail:
    1. First, retrieve responses from a pre-trained LLM (henceforth referred to as the “Helpful Model”) on the red teaming prompts, where the model’s responses are likely to contain harmful elements.
    2. Subsequently, require the Helpful Model to evaluate its own response using a set of established principles.
    3. Then, prompt the Helpful Model to revise its response based on the evaluation it provided.
    4. These two prior steps, the evaluation and revision (also known as the “critique and revision” pipeline), are to be repeated for \(n\) iterations.
    5. Finally, fine-tune a pre-trained LLM with all the iterations of the revised responses generated from the harmful prompts.
    6. Additionally, it is important to include a mix of helpful prompts and their respective responses to ensure that the fine-tuned model remains helpful (hence, the “supervised” nature of this phase).
    7. This modified model is referred to as the Supervised Learning Constitutional AI (SL-CAI) model.

  • The prompt above (source) displays how the authors used this in action.
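
  • The critique-and-revision steps of the supervised phase can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `generate` is a hypothetical callable that wraps the Helpful Model (text in, text out), one principle is sampled per iteration, and the prompt templates are illustrative wording rather than the prompts used by the authors.

```python
import random


def critique_and_revise(generate, prompt, principles, n_iterations=2, rng=None):
    """Run the critique-and-revision loop and return every revised response."""
    rng = rng or random.Random(0)
    response = generate(f"Human: {prompt}\nAssistant:")
    revisions = []
    for _ in range(n_iterations):
        principle = rng.choice(principles)
        # The model evaluates its own response against a principle.
        critique = generate(
            f"Response: {response}\nCritique this response according to: {principle}"
        )
        # The model then revises the response based on its own critique.
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
        revisions.append(response)
    return revisions


def build_sft_examples(prompt, revisions):
    """Pair the prompt with each revision to form supervised fine-tuning data."""
    return [{"prompt": prompt, "response": revision} for revision in revisions]


def stub_model(text):
    """Hypothetical stand-in for the Helpful Model, used only to run the sketch."""
    return f"[model output for {len(text)} input characters]"
```

  • For example, `critique_and_revise(stub_model, "some red teaming prompt", ["Choose the least harmful response."])` returns two revisions, which `build_sft_examples` turns into training pairs for the SL-CAI model. In practice, these would be mixed with helpful prompts and responses, as noted in step 6.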

Reinforcement Learning Phase

  • Now let’s talk about the Reinforcement Learning phase.
  • This phase entails the AI exploring possible responses to thousands of prompts and using chain-of-thought reasoning to identify the behavior that is most consistent with the constitution. The procedural steps are listed below:
    1. Firstly, generate response pairs for a harmful prompt employing the SL-CAI model developed in the prior step.
    2. Subsequently, introduce a feedback model, which is essentially a pre-trained language model, to evaluate a pair of responses and select the less harmful one based on an established principle.
    3. The feedback model’s normalized log probabilities are used to train a preference model or a reward model.
    4. Lastly, employ the preference model obtained in the previous step as the reward function to train the SL-CAI model using Proximal Policy Optimization (PPO), as in OpenAI’s RLHF approach.
    5. This results in the final Reinforcement Learning Constitutional AI (RL-CAI) model.
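
  • Step 3 above can be sketched as follows: the feedback model’s log-probabilities for the two options are normalized into a soft preference label, which then serves as the training target for the preference model. This is a minimal sketch that assumes a two-option comparison and a cross-entropy loss on the reward difference; the function names are hypothetical, and the inputs are assumed to be moderate in magnitude.

```python
import math


def normalized_preference(logprob_a, logprob_b):
    """Normalize two option log-probabilities into (P(A preferred), P(B preferred))."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


def soft_label_loss(reward_a, reward_b, target_a):
    """Cross-entropy between the soft target and the preference model's P(A > B)."""
    pred_a = 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
    eps = 1e-12  # guards against log(0)
    return -(
        target_a * math.log(pred_a + eps)
        + (1.0 - target_a) * math.log(1.0 - pred_a + eps)
    )
```

  • If the feedback model assigns probabilities 0.6 and 0.2 to the two options, the normalized targets are 0.75 and 0.25. Using such soft labels instead of hard 0/1 labels lets the preference model reflect how confident the feedback model was in each comparison.
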
  • We can take a deeper look into the working of step 2 of this phase. The pre-trained LM is provided with the prompt below by the authors, where the insertion of a randomly selected principle into the prompt serves to guide the language model’s response:

  • With the image below (source), we can see how Anthropic’s CAI is able to give a more eloquent response when compared with other conversational AI systems.