14 min deep dive · AI · alignment

RLHF — How AI Learns to Follow Human Instructions

A visual deep dive into Reinforcement Learning from Human Feedback. From pretraining to reward models to PPO — understand how ChatGPT went from autocomplete to assistant.

Introduction


GPT-3 could write Shakespeare.
But it couldn’t say “I don’t know.”

The raw pretrained model completes text. Ask it a question and it might answer,
or ramble, or produce toxic content. RLHF is the process that turned
autocomplete into an assistant.


The Problem

The Alignment Problem

A pretrained language model optimizes for one thing: predict the next token. That’s it. It doesn’t try to be helpful, truthful, or safe. It tries to match the distribution of its training data — which includes everything from Wikipedia to 4chan.

Raw pretrained model. Q: “How do I pick a lock?” A: “First, you’ll need a tension wrench and a pick. Insert the wrench into the bottom of the keyhole and apply slight…” ✗ Just completes the text; no judgment about safety.

After RLHF. Q: “How do I pick a lock?” A: “If you’re locked out of your own home, I’d recommend contacting a licensed locksmith. They can help safely and legally.” ✓ Helpful, harmless, honest; understands the safety context.
Same question, very different behavior
Three Stages

The RLHF Pipeline: Three Stages

Training ChatGPT-like models happens in three distinct phases, each building on the last.

1. Pretraining: next-token prediction on internet text; learn language.
2. SFT: supervised finetuning on human-written examples.
3a. Reward model: learn to score outputs from human comparisons.
3b. PPO: optimize the policy to maximize the reward score.
The full RLHF training pipeline
🟢 Quick Check

Why can't you just use supervised finetuning (step 2) alone to align a model?

SFT

Stage 1: Supervised Finetuning (SFT)

Hire human annotators. Give them prompts. Ask them to write ideal responses. Finetune the pretrained model on these demonstrations.

Prompt: “Explain gravity to a 5-year-old” → 👩‍💻 Human writer: “You know how when you drop a ball it falls? That’s Earth pulling things…” → 🤖 LLM: finetune on these (prompt, response) pairs. Typically ~10,000-100,000 demonstrations, each one hand-written by a trained annotator, which is expensive. This is why step 3 (reward model + PPO) is needed: comparisons scale better.
SFT: learn from human-written demonstrations
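A minimal sketch of the SFT step, assuming a small Hugging Face causal LM (`gpt2` as a stand-in for the pretrained base) and a toy list of (prompt, response) pairs; the model name, the `pairs` list, and the hyperparameters are illustrative assumptions, not the original recipe. The loss is ordinary next-token cross-entropy, masked so only the response tokens are penalized:

```python
# Sketch: supervised finetuning on human-written (prompt, response) demonstrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the pretrained base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

pairs = [  # tiny toy dataset of demonstrations
    ("Explain gravity to a 5-year-old.",
     "You know how when you drop a ball it falls? That's Earth pulling things toward it."),
]

for prompt, response in pairs:
    prompt_ids = tokenizer(prompt + "\n", return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + "\n" + response + tokenizer.eos_token,
                         return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on the prompt tokens

    loss = model(input_ids=full_ids, labels=labels).loss  # next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```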
Reward Model

Stage 2: Train a Reward Model

Here’s the key insight: instead of asking humans to write perfect answers, ask them something much easier — which of two answers is better?

Prompt: “Is 9.11 > 9.9?”
Response A: “Yes, 9.11 is greater than 9.9 because 11 is bigger than 9.” ✗ Chosen LESS often.
Response B: “No. 9.9 = 9.90, which is greater than 9.11.” ✓ Chosen MORE often.
Reward model (RM): learns to assign scores, e.g. B = 0.9, A = 0.2. The RM learns to predict which response a human would prefer.
Humans compare model outputs — this creates training data for the reward model

Training the Reward Model

1. Data: (prompt, winner, loser) triplets. Human annotators compare pairs of model outputs and pick the better one. This is much cheaper than writing perfect answers.

2. r(x, y) = scalar score. The reward model takes a prompt x and a response y and outputs a single number: how “good” is this response?

3. Loss = −log(σ(r(x, y_w) − r(x, y_l))). The Bradley-Terry model: maximize the probability that the winner scores higher than the loser, where σ is the sigmoid function (a minimal code sketch follows this list).

4. After training, the RM can score ANY new response. It generalizes from ~50K comparisons to responses it has never seen before.
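A minimal sketch of that pairwise loss, with made-up scores standing in for reward-model outputs on a batch of comparisons; only the loss function itself reflects the formula above:

```python
# Sketch: Bradley-Terry pairwise loss for reward-model training.
import torch
import torch.nn.functional as F

def pairwise_loss(r_winner: torch.Tensor, r_loser: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(r_winner - r_loser).mean()

# Toy usage: pretend the RM scored a batch of 3 comparisons (fabricated numbers).
r_w = torch.tensor([0.9, 0.4, 1.2])   # scores for the human-preferred responses
r_l = torch.tensor([0.2, 0.5, -0.3])  # scores for the rejected responses
print(pairwise_loss(r_w, r_l).item())  # lower when winners consistently outscore losers
```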
🟡 Checkpoint

Why is collecting comparison data easier than collecting demonstration data?

PPO

Stage 3: RL Optimization with PPO

Now you have a reward model that can score any response. Use it to optimize the language model itself: generate responses, score them, and update the model to produce higher-scoring outputs.

Prompt (a random prompt from the dataset) → Policy (LLM) generates a response → Reward model scores it (e.g. 0.82) → PPO update takes a gradient step to increase reward, with a KL penalty so the policy doesn’t drift too far from the SFT model.
The PPO optimization loop
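A simplified sketch of that loop, assuming a small Hugging Face causal LM as both the policy and the frozen reference, and a placeholder `toy_reward` standing in for a trained reward model. Real PPO adds clipped importance ratios, a value baseline, and minibatch epochs, all omitted here; this only shows the data flow (generate, score, penalize KL drift, update):

```python
# Sketch: RLHF optimization loop with a KL penalty (REINFORCE-style surrogate, not full PPO).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")            # trainable policy
reference = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # frozen SFT anchor
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.1  # KL coefficient

def toy_reward(text: str) -> float:
    return float(len(text.split()) < 40)  # placeholder for the trained reward model

def sequence_logprob(model, ids, prompt_len):
    """Sum of log-probabilities the model assigns to the response tokens."""
    logits = model(ids).logits[:, :-1, :]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum()

prompt_ids = tokenizer("How do I pick a lock?\n", return_tensors="pt").input_ids

for step in range(3):
    with torch.no_grad():  # sample a response from the current policy
        ids = policy.generate(prompt_ids, max_new_tokens=30, do_sample=True,
                              pad_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(ids[0, prompt_ids.shape[1]:])

    logp = sequence_logprob(policy, ids, prompt_ids.shape[1])
    with torch.no_grad():
        ref_logp = sequence_logprob(reference, ids, prompt_ids.shape[1])

    # KL-penalized reward: RM score minus beta * (log pi_theta - log pi_ref)
    reward = toy_reward(response) - beta * (logp.detach() - ref_logp)
    loss = -reward * logp  # policy-gradient surrogate; PPO would clip the ratio here

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```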

The PPO Objective

🎯 PPO objective: J(θ) = E[r(x, y)] − β · KL(π_θ ∥ π_ref). Three pieces: the reward signal, the KL penalty, and the β coefficient that balances them.
Reward Signal

The reward model's score for a (prompt, response) pair — higher means the response better matches human preferences. This is what drives the model to improve its outputs.

E[r(x, y)]
🛡️ KL Penalty

Penalizes the model for drifting too far from the SFT checkpoint. Without this anchor, the model would learn to 'hack' the reward model — producing gibberish that scores high but isn't actually helpful.

−β · KL(π_θ ∥ π_ref)
⚖️ β Balance

Controls the trade-off between maximizing reward and staying close to the SFT model. Too low → reward hacking (gibberish that scores high). Too high → model barely changes from SFT (wasted effort).
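To make that trade-off concrete, here is a toy calculation with made-up numbers for the RM score and the KL divergence, showing how the choice of β flips which response the objective prefers:

```python
# Toy illustration (fabricated numbers): beta trades off RM score against KL drift.
def objective(rm_score: float, kl: float, beta: float) -> float:
    return rm_score - beta * kl  # J = E[r(x, y)] - beta * KL(pi_theta || pi_ref)

candidates = {
    "genuine answer": (0.80, 2.0),            # decent score, stays close to the SFT model
    "reward-hacked gibberish": (0.95, 30.0),  # slightly higher score, huge drift
}
for beta in (0.001, 0.1):
    best = max(candidates, key=lambda name: objective(*candidates[name], beta))
    print(f"beta={beta}: objective prefers the {best}")
# beta=0.001 still rewards the hacked output; beta=0.1 favors the genuine answer.
```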

🟡 Checkpoint

What is the purpose of the KL divergence penalty in the PPO objective?

Beyond RLHF

Beyond RLHF: DPO and Constitutional AI

RLHF works, but it’s complex. Researchers are finding simpler alternatives.

RLHF (2022): train a reward model, then run PPO optimization. A complex pipeline with unstable training.
DPO (2023): skip the reward model; direct preference optimization. Simpler and more stable, with comparable performance.
CAI (2022): AI critiques AI using a constitution. Less human labor; more scalable oversight.
DPO’s key insight: the optimal policy under the RLHF objective has a closed-form solution, so you can skip the reward model and the PPO loop entirely and optimize the policy directly on preference pairs (a code sketch follows below). Same result, much simpler.
The evolution of alignment techniques
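A minimal sketch of the DPO loss from Rafailov et al. (2023), with fabricated sequence log-probabilities standing in for the policy’s and the frozen reference’s scores on one (chosen, rejected) pair; β plays the same balancing role as in the PPO objective:

```python
# Sketch: DPO loss on a preference pair, no reward model and no PPO loop.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * ((log pi - log ref)_chosen - (log pi - log ref)_rejected))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with fabricated sequence log-probabilities for one pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())  # shrinks as the policy prefers the chosen response more than the reference does
```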
🔴 Challenge

What does DPO (Direct Preference Optimization) eliminate from the RLHF pipeline?

🔴 Challenge

In Constitutional AI (CAI), who provides the feedback used to train the reward model?

🎓 What You Now Know

Pretrained LLMs are autocomplete, not assistants — They predict the next token, with no concept of helpfulness or safety.

RLHF has three stages — Pretraining → SFT (learn from demos) → Reward model + PPO (learn from preferences).

Comparisons scale better than demonstrations — It’s easier to say “A > B” than to write the perfect answer.

KL penalty prevents reward hacking — Without it, the model exploits the reward model instead of being genuinely helpful.

DPO simplifies alignment — Skip the reward model entirely with direct preference optimization. Comparable results, simpler pipeline.

RLHF turned language models from text generators into AI assistants. The same model that would ramble or produce harmful completions now declines, qualifies, and helps responsibly. It’s not perfect, but it’s the bridge between capability and alignment. 🚀

📄 Training language models to follow instructions with human feedback (Ouyang et al., 2022)


📄 Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)


📄 Direct Preference Optimization (Rafailov et al., 2023)
