Reinforcement Learning from Human Feedback (RLHF)
Prerequisite
Before diving into RLHF, make sure you understand the concepts of Supervised Fine-Tuning (SFT) covered in the previous chapters, as SFT usually serves as the foundational step before RLHF.
The Alignment Problem
Language models, like GPT-3 or early Llama models, are initially trained on a massive corpus of text using an objective called Next Token Prediction. This pre-training gives them excellent grammar, vast factual knowledge, and reasoning capabilities.
However, predicting the next word doesn’t inherently make a model useful as an assistant. If you ask a raw pre-trained model:
“Write a polite decline to an interview invitation.”
It might complete the text by generating more similar prompts:
“Write a polite decline to a wedding invitation.” “Write a polite decline to a dinner party.”
This is the Alignment Problem. The model knows language, but its objective (predicting text) is misaligned with our objective (following instructions helpfully, safely, and honestly).
We solve this using Reinforcement Learning from Human Feedback (RLHF).
What is RLHF?
RLHF is a technique to align a language model’s behavior with human preferences. Instead of just teaching the model how to speak (pre-training) or what format to use (SFT), RLHF teaches the model what makes a “good” response.
The Three Steps of RLHF
The standard RLHF pipeline, popularized by OpenAI’s InstructGPT, consists of three main steps:
- Supervised Fine-Tuning (SFT): Train a base model to follow instructions.
- Reward Model Training: Train a separate model to act as a human grader.
- Reinforcement Learning (PPO): Optimize the SFT model against the Reward Model.
Step 1: Supervised Fine-Tuning (SFT)
Before we can use Reinforcement Learning, the model needs to know the basic format of a conversation (Prompt -> Response).
We collect a dataset of high-quality human-written instructions and desired responses. We then fine-tune our pre-trained model on this data using standard supervised learning.
- Input: “Explain quantum computing to a 5-year-old.”
- Target Output: “Imagine a magic coin that can be heads and tails at the exact same time…”
This gives us the SFT Model. It can chat, but it might still hallucinate or give unsafe answers because it’s only mimicking the style of the human data, not optimizing for an overarching goal.
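Concretely, SFT is ordinary next-token cross-entropy training on (prompt, response) pairs, with the loss computed only on the response tokens. Below is a minimal sketch of that masked loss; the function name `sft_loss` and the toy numbers are illustrative, not from any particular library.

```python
import math  # not strictly needed here, but typical alongside log-prob math

def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: the model's log-probability of each target token.
    loss_mask: 1 for response tokens, 0 for prompt tokens -- the prompt
    is context, not a training target.
    """
    masked = [-lp * m for lp, m in zip(token_logprobs, loss_mask)]
    return sum(masked) / sum(loss_mask)

# Toy example: 3 prompt tokens (masked out), 2 response tokens.
logprobs = [-0.1, -0.2, -0.3, -1.0, -2.0]
mask     = [0,    0,    0,    1,    1]
print(sft_loss(logprobs, mask))  # 1.5
```

In a real training run the log-probabilities come from the model's softmax output and the update is a gradient step on this loss; the masking idea is the part worth noticing.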
Step 2: Training the Reward Model (RM)
This is the “Human Feedback” part. Reinforcement Learning requires a Reward Function—a way to tell the agent (the LLM) if it did a good job. Since “helpfulness” is subjective and hard to code in Python, we train another neural network to act as a judge.
How it works:
- Generate Responses: We take the SFT model and give it a prompt (e.g., “Write a poem about AI”). We have the model generate multiple different responses.
- Human Ranking: Human labelers read the responses and rank them from best to worst based on criteria like helpfulness, truthfulness, and harmlessness.
- Response A: “AI is a computer…” (Rank: 2)
- Response B: “In silicon halls where data streams…” (Rank: 1 - Best)
- Response C: “I don’t know.” (Rank: 3 - Worst)
- Train the RM: We train a Reward Model (often itself an LLM fine-tuned for scoring) to take a Prompt and a Response as input and output a single scalar value (a score). The model is trained to assign higher scores to responses that humans ranked higher.
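The ranking data is typically turned into pairs (chosen vs. rejected), and the RM is trained with a pairwise loss: minimize −log σ(r_chosen − r_rejected), so the preferred response gets the higher score. A minimal sketch (function name is illustrative):

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).

    The loss shrinks as the RM's score margin in favor of the
    human-preferred response grows.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# RM already prefers the chosen response by a wide margin: small loss.
print(round(reward_model_loss(2.0, -1.0), 4))  # 0.0486
# RM has the pair backwards: large loss, large corrective gradient.
print(round(reward_model_loss(-1.0, 2.0), 4))  # 3.0486
```

Note that only score differences matter here, which is why the RM's raw scores have no absolute scale.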
Step 3: Reinforcement Learning (PPO)
Now we have a Policy (the SFT Model) and a Reward Function (the Reward Model). We use Reinforcement Learning—specifically an algorithm called Proximal Policy Optimization (PPO)—to update the SFT model to maximize the reward.
The PPO Loop:
- Sample a Prompt: Take a prompt from the dataset (e.g., “Write a bash script”).
- Generate Response: The Policy (LLM) generates a response.
- Get Reward: Pass the (Prompt + Response) into the Reward Model to get a score.
- Update Policy: PPO updates the weights of the LLM to make it slightly more likely to generate high-scoring responses in the future.
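The loop above can be sketched as a single function, with the policy LLM, the Reward Model, and the PPO weight update stubbed out as callables (all names here are illustrative stand-ins, not a real PPO implementation):

```python
def rlhf_step(prompt, generate, reward_fn, update):
    """One iteration of the PPO loop: generate, score, update.

    generate, reward_fn, and update stand in for the policy LLM,
    the Reward Model, and the PPO weight update respectively.
    """
    response = generate(prompt)             # 2. policy generates a response
    reward = reward_fn(prompt, response)    # 3. RM scores (prompt + response)
    update(prompt, response, reward)        # 4. PPO nudges the policy weights
    return response, reward

# Toy stand-ins so the loop runs end to end.
updates = []
resp, r = rlhf_step(
    "Write a bash script",
    generate=lambda p: "#!/bin/bash\necho hello",
    reward_fn=lambda p, resp: 0.9,
    update=lambda p, resp, rew: updates.append(rew),
)
print(r, updates)  # 0.9 [0.9]
```

In practice each step also involves per-token log-probabilities, advantage estimates, and the clipped PPO objective; the skeleton above only shows the data flow.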
The KL Divergence Penalty
There is a major trap in this step: Reward Hacking. If we just optimize for the Reward Model blindly, the LLM will eventually find weird “cheat codes” that the Reward Model mistakenly gives a high score to (e.g., repeating the word “please” 500 times, or writing absolute gibberish that happens to trigger a high activation in the RM’s neural network).
To prevent this, PPO includes a KL Divergence Penalty.
- We keep a frozen copy of the original SFT model.
- Whenever the PPO model generates a response, we compare its probability distribution to what the frozen SFT model would have output.
- If the PPO model deviates too far from the SFT model (measured via KL Divergence), we subtract points from its reward.
This forces the model to maximize the reward while still outputting normal, readable text similar to the original SFT model.
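A common way to apply the penalty is to subtract a scaled KL estimate from the RM score, where the per-token KL term is estimated as log π_policy(token) − log π_SFT(token). A minimal sketch, with an illustrative function name and toy numbers:

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward Model score minus a KL penalty against the frozen SFT model.

    The KL term is estimated as sum over generated tokens of
    (log pi_policy - log pi_sft); beta controls how tightly the policy
    is tethered to the SFT model.
    """
    kl = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl

# The policy assigns its tokens higher log-probs than the SFT model
# would (it has drifted), so the penalty eats into the RM score.
print(penalized_reward(1.0, [-0.5, -0.5], [-1.5, -1.5], beta=0.1))  # 0.8
```

If the policy matches the SFT model exactly, the KL term is zero and the full RM score passes through; the further it drifts, the more reward it forfeits.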
Limitations and Future Directions
While RLHF is the current state-of-the-art for aligning models, it has drawbacks:
- Expensive and Slow: Collecting human rankings is slow and costly. It requires thousands of hours of skilled human labor.
- Sycophancy: Reward models often learn that humans like to be agreed with. The LLM might output a confidently incorrect answer just because it aligns with a user’s faulty premise.
- Mode Collapse: PPO can sometimes drastically reduce the diversity of the model’s outputs.
Alternatives to RLHF (e.g., DPO)
Because RLHF (specifically the PPO stage) is notoriously unstable and hard to tune, newer methods such as Direct Preference Optimization (DPO) are emerging. DPO skips the separate Reward Model entirely and mathematically updates the LLM directly on the human preference data, making alignment much simpler and more stable. We will cover DPO in a future chapter.
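As a preview, DPO's key idea is that β·(log π_policy − log π_ref) acts as an implicit reward, so the pairwise ranking loss from Step 2 can be applied directly to the policy. A minimal sketch of that loss (inputs are summed log-probabilities of whole responses; the function name is illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss -- no explicit Reward Model.

    pi_* are the policy's summed log-probs of the chosen/rejected
    responses; ref_* are the frozen reference model's. The implicit
    reward of a response is beta * (log pi - log ref).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the policy's implicit-reward margin for the
# human-chosen response grows.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=1.0), 4))  # 0.1269
```

Minimizing this loss pushes the policy toward the chosen response and away from the rejected one, with β playing a role analogous to the KL coefficient in PPO.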
Summary
- SFT: Teach the model to converse and format answers.
- RM: Teach a judge model what humans prefer.
- PPO: Train the conversational model to maximize the judge’s score without deviating too far from sounding natural (KL penalty).