14 min deep dive · AI · alignment

RLHF — How AI Learns to Follow Human Instructions

A visual deep dive into Reinforcement Learning from Human Feedback. From pretraining to reward models to PPO — understand how ChatGPT went from autocomplete to assistant.

Introduction


GPT-3 could write Shakespeare.
But it couldn’t say “I don’t know.”

The raw pretrained model completes text. Ask it a question and it might answer,
or ramble, or produce toxic content. RLHF is the process that turned
autocomplete into an assistant.


The Problem

The Alignment Problem

A pretrained language model optimizes for one thing: predict the next token. That’s it. It doesn’t try to be helpful, truthful, or safe. It tries to match the distribution of its training data — which includes everything from Wikipedia to 4chan.

Raw pretrained model. Q: “How do I pick a lock?” A: “First, you’ll need a tension wrench and a pick. Insert the wrench into the bottom of the keyhole and apply slight…” ✗ Just completes the text; no judgment about safety.

After RLHF. Q: “How do I pick a lock?” A: “If you’re locked out of your own home, I’d recommend contacting a licensed locksmith. They can help safely and legally.” ✓ Helpful, harmless, honest; understands the safety context.
Same question, very different behavior
Three Stages

The RLHF Pipeline: Three Stages

Training ChatGPT-like models happens in three distinct phases, each building on the last.

1. Pretraining: next-token prediction on internet text; learn language.
2. SFT: supervised finetuning on human-written examples.
3a. Reward model: learn to score outputs from human comparisons.
3b. PPO: optimize the policy to maximize the reward score.
The full RLHF training pipeline
🟢 Quick Check

Why can't you just use supervised finetuning (step 2) alone to align a model?

SFT

Stage 1: Supervised Finetuning (SFT)

Hire human annotators. Give them prompts. Ask them to write ideal responses. Finetune the pretrained model on these demonstrations.

Prompt: “Explain gravity to a 5-year-old” → 👩‍💻 Human writer: “You know how when you drop a ball it falls? That’s Earth pulling things…” → 🤖 LLM: finetune on these (prompt, response) pairs. Typically ~10,000-100,000 demonstrations, each one hand-written by a trained annotator, which is expensive. This is why step 3 (reward model + PPO) is needed: comparisons scale better.
SFT: learn from human-written demonstrations
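A minimal sketch of the SFT step, assuming a small Hugging Face causal LM (`gpt2` as a stand-in for the pretrained base) and a toy list of (prompt, response) pairs; the model name, the `pairs` list, and the hyperparameters are illustrative assumptions, not the original recipe. The loss is ordinary next-token cross-entropy, masked so only the response tokens are penalized:

```python
# Sketch: supervised finetuning on human-written (prompt, response) demonstrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the pretrained base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

pairs = [  # tiny toy dataset of demonstrations
    ("Explain gravity to a 5-year-old.",
     "You know how when you drop a ball it falls? That's Earth pulling things toward it."),
]

for prompt, response in pairs:
    prompt_ids = tokenizer(prompt + "\n", return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + "\n" + response + tokenizer.eos_token,
                         return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on the prompt tokens

    loss = model(input_ids=full_ids, labels=labels).loss  # next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```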
Reward Model

Stage 2: Train a Reward Model

Here’s the key insight: instead of asking humans to write perfect answers, ask them something much easier — which of two answers is better?

Prompt: “Is 9.11 > 9.9?”
Response A: “Yes, 9.11 is greater than 9.9 because 11 is bigger than 9.” ✗ Chosen LESS often.
Response B: “No. 9.9 = 9.90, which is greater than 9.11.” ✓ Chosen MORE often.
Reward model (RM): learns to assign scores, e.g. B = 0.9, A = 0.2. The RM learns to predict which response a human would prefer.
Humans compare model outputs — this creates training data for the reward model

Training the Reward Model

1. Data: (prompt, winner, loser) triplets. Human annotators compare pairs of model outputs and pick the better one. This is much cheaper than writing perfect answers.

2. r(x, y) = scalar score. The reward model takes a prompt x and a response y and outputs a single number: how “good” is this response?

3. Loss = −log(σ(r(x, y_w) − r(x, y_l))). The Bradley-Terry model: maximize the probability that the winner scores higher than the loser, where σ is the sigmoid function (a minimal code sketch follows this list).

4. After training, the RM can score ANY new response. It generalizes from ~50K comparisons to responses it has never seen before.
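A minimal sketch of that pairwise loss, with made-up scores standing in for reward-model outputs on a batch of comparisons; only the loss function itself reflects the formula above:

```python
# Sketch: Bradley-Terry pairwise loss for reward-model training.
import torch
import torch.nn.functional as F

def pairwise_loss(r_winner: torch.Tensor, r_loser: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(r_winner - r_loser).mean()

# Toy usage: pretend the RM scored a batch of 3 comparisons (fabricated numbers).
r_w = torch.tensor([0.9, 0.4, 1.2])   # scores for the human-preferred responses
r_l = torch.tensor([0.2, 0.5, -0.3])  # scores for the rejected responses
print(pairwise_loss(r_w, r_l).item())  # lower when winners consistently outscore losers
```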
🟡 Checkpoint

Why is collecting comparison data easier than collecting demonstration data?

PPO

Stage 3: RL Optimization with PPO

Now you have a reward model that can score any response. Use it to optimize the language model itself: generate responses, score them, and update the model to produce higher-scoring outputs.

Prompt (a random prompt from the dataset) → Policy (LLM) generates a response → Reward model scores it (e.g. 0.82) → PPO update takes a gradient step to increase reward, with a KL penalty so the policy doesn’t drift too far from the SFT model.
The PPO optimization loop
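A simplified sketch of that loop, assuming a small Hugging Face causal LM as both the policy and the frozen reference, and a placeholder `toy_reward` standing in for a trained reward model. Real PPO adds clipped importance ratios, a value baseline, and minibatch epochs, all omitted here; this only shows the data flow (generate, score, penalize KL drift, update):

```python
# Sketch: RLHF optimization loop with a KL penalty (REINFORCE-style surrogate, not full PPO).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")            # trainable policy
reference = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # frozen SFT anchor
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.1  # KL coefficient

def toy_reward(text: str) -> float:
    return float(len(text.split()) < 40)  # placeholder for the trained reward model

def sequence_logprob(model, ids, prompt_len):
    """Sum of log-probabilities the model assigns to the response tokens."""
    logits = model(ids).logits[:, :-1, :]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum()

prompt_ids = tokenizer("How do I pick a lock?\n", return_tensors="pt").input_ids

for step in range(3):
    with torch.no_grad():  # sample a response from the current policy
        ids = policy.generate(prompt_ids, max_new_tokens=30, do_sample=True,
                              pad_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(ids[0, prompt_ids.shape[1]:])

    logp = sequence_logprob(policy, ids, prompt_ids.shape[1])
    with torch.no_grad():
        ref_logp = sequence_logprob(reference, ids, prompt_ids.shape[1])

    # KL-penalized reward: RM score minus beta * (log pi_theta - log pi_ref)
    reward = toy_reward(response) - beta * (logp.detach() - ref_logp)
    loss = -reward * logp  # policy-gradient surrogate; PPO would clip the ratio here

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```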

The PPO Objective

🎯 PPO objective: J(θ) = E[r(x, y)] − β · KL(π_θ ∥ π_ref). Three pieces: the reward signal, the KL penalty, and the β coefficient that balances them.
Reward Signal

The reward model's score for a (prompt, response) pair — higher means the response better matches human preferences. This is what drives the model to improve its outputs.

E[r(x, y)]
🛡️ KL Penalty

Penalizes the model for drifting too far from the SFT checkpoint. Without this anchor, the model would learn to 'hack' the reward model — producing gibberish that scores high but isn't actually helpful.

−β · KL(π_θ ∥ π_ref)
⚖️ β Balance

Controls the trade-off between maximizing reward and staying close to the SFT model. Too low → reward hacking (gibberish that scores high). Too high → model barely changes from SFT (wasted effort).
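To make that trade-off concrete, here is a toy calculation with made-up numbers for the RM score and the KL divergence, showing how the choice of β flips which response the objective prefers:

```python
# Toy illustration (fabricated numbers): beta trades off RM score against KL drift.
def objective(rm_score: float, kl: float, beta: float) -> float:
    return rm_score - beta * kl  # J = E[r(x, y)] - beta * KL(pi_theta || pi_ref)

candidates = {
    "genuine answer": (0.80, 2.0),            # decent score, stays close to the SFT model
    "reward-hacked gibberish": (0.95, 30.0),  # slightly higher score, huge drift
}
for beta in (0.001, 0.1):
    best = max(candidates, key=lambda name: objective(*candidates[name], beta))
    print(f"beta={beta}: objective prefers the {best}")
# beta=0.001 still rewards the hacked output; beta=0.1 favors the genuine answer.
```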

🟡 Checkpoint

What is the purpose of the KL divergence penalty in the PPO objective?

Beyond RLHF

Beyond RLHF: DPO and Constitutional AI

RLHF works, but it’s complex. Researchers are finding simpler alternatives.

RLHF (2022): train a reward model, then run PPO optimization. A complex pipeline with unstable training.
DPO (2023): skip the reward model; direct preference optimization. Simpler and more stable, with comparable performance.
CAI (2022): AI critiques AI using a constitution. Less human labor; more scalable oversight.
DPO’s key insight: the optimal policy under the RLHF objective has a closed-form solution, so you can skip the reward model and the PPO loop entirely and optimize the policy directly on preference pairs (a code sketch follows below). Same result, much simpler.
The evolution of alignment techniques
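A minimal sketch of the DPO loss from Rafailov et al. (2023), with fabricated sequence log-probabilities standing in for the policy’s and the frozen reference’s scores on one (chosen, rejected) pair; β plays the same balancing role as in the PPO objective:

```python
# Sketch: DPO loss on a preference pair, no reward model and no PPO loop.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * ((log pi - log ref)_chosen - (log pi - log ref)_rejected))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with fabricated sequence log-probabilities for one pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())  # shrinks as the policy prefers the chosen response more than the reference does
```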
🔴 Challenge

What does DPO (Direct Preference Optimization) eliminate from the RLHF pipeline?

🔴 Challenge

In Constitutional AI (CAI), who provides the feedback used to train the reward model?

🎓 What You Now Know

Pretrained LLMs are autocomplete, not assistants — They predict the next token, with no concept of helpfulness or safety.

RLHF has three stages — Pretraining → SFT (learn from demos) → Reward model + PPO (learn from preferences).

Comparisons scale better than demonstrations — It’s easier to say “A > B” than to write the perfect answer.

KL penalty prevents reward hacking — Without it, the model exploits the reward model instead of being genuinely helpful.

DPO simplifies alignment — Skip the reward model entirely with direct preference optimization. Comparable results, simpler pipeline.

RLHF turned language models from text generators into AI assistants. The same model that would ramble or produce harmful completions now declines, qualifies, and helps responsibly. It’s not perfect, but it’s the bridge between capability and alignment. 🚀

📄 Training language models to follow instructions with human feedback (Ouyang et al., 2022)


📄 Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)


📄 Direct Preference Optimization (Rafailov et al., 2023)
