RLHF — How AI Learns to Follow Human Instructions
A visual deep dive into Reinforcement Learning from Human Feedback. From pretraining to reward models to PPO — understand how ChatGPT went from autocomplete to assistant.
GPT-3 could write Shakespeare.
But it couldn’t say “I don’t know.”
The raw pretrained model completes text. Ask it a question and it might answer,
or ramble, or produce toxic content. RLHF is the process that turned
autocomplete into an assistant.
↓ Scroll to learn how human feedback shapes AI behavior
The Alignment Problem
A pretrained language model optimizes for one thing: predict the next token. That’s it. It doesn’t try to be helpful, truthful, or safe. It tries to match the distribution of its training data — which includes everything from Wikipedia to 4chan.
The RLHF Pipeline: Three Stages
Training ChatGPT-like models happens in three distinct phases, each building on the last.
Why can't you just use supervised finetuning (step 2) alone to align a model?
💡 Think about what's easier: writing the perfect answer, or comparing two existing answers...
SFT requires human-written 'gold standard' answers for each prompt, which doesn't scale. RLHF instead collects pairwise comparisons ('A is better than B') which are much easier for humans to provide, and the reward model generalizes these preferences to new situations the human annotators never saw.
Stage 1: Supervised Finetuning (SFT)
Hire human annotators. Give them prompts. Ask them to write ideal responses. Finetune the pretrained model on these demonstrations.
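Conceptually, SFT is just next-token training on (prompt, ideal response) pairs. Here is a minimal sketch of that step, assuming a HuggingFace-style causal LM; the model name, example, and hyperparameters are illustrative stand-ins, not the actual ChatGPT setup:

```python
# Minimal SFT sketch: next-token cross-entropy on an annotator-written demonstration.
# "gpt2" and the hyperparameters are illustrative stand-ins for the real base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Explain photosynthesis to a 10-year-old."
ideal_response = "Plants are like tiny food factories powered by sunlight..."

# Concatenate prompt and response; the causal LM loss is ordinary next-token prediction.
# (Real SFT pipelines usually mask the loss so only the response tokens count.)
batch = tokenizer(prompt + "\n" + ideal_response, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```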
Stage 2: Train a Reward Model
Here’s the key insight: instead of asking humans to write perfect answers, ask them something much easier — which of two answers is better?
Training the Reward Model
Data: (prompt, winner, loser) triplets
r(x, y) = scalar score
Loss = −log(σ(r(x, y_w) − r(x, y_l)))
After training: the RM can score ANY new response
Why is collecting comparison data easier than collecting demonstration data?
💡 Think about whether it's easier to write a novel or to say which of two novels is better...
This is a key insight of RLHF: it's much easier to evaluate than to generate. A human can quickly compare two math solutions and pick the correct one, even if they couldn't write the proof themselves. This lets RLHF scale to topics where writing demonstrations would require rare expertise.
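In code, the pairwise loss above is only a few lines. This is a sketch under the assumption of a hypothetical `reward_model` that maps tokenized (prompt + response) inputs to one scalar score per example:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise loss: -log σ(r(x, y_w) - r(x, y_l)).

    chosen_ids / rejected_ids: tokenized (prompt + winner) and (prompt + loser).
    reward_model: hypothetical network returning one scalar score per example.
    """
    r_winner = reward_model(chosen_ids)    # shape (batch,)
    r_loser = reward_model(rejected_ids)   # shape (batch,)
    # Push the winner's score above the loser's; the sigmoid turns the score gap
    # into "probability that a human prefers y_w over y_l".
    return -F.logsigmoid(r_winner - r_loser).mean()
```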
Stage 3: RL Optimization with PPO
Now you have a reward model that can score any response. Use it to optimize the language model itself: generate responses, score them, and update the model to produce higher-scoring outputs.
The PPO Objective
The objective combines two terms: maximize E[r(x, y)] − β · KL(π_θ ∥ π_ref)
E[r(x, y)]: the reward model's score for a (prompt, response) pair — higher means the response better matches human preferences. This is what drives the model to improve its outputs.
−β · KL(π_θ ∥ π_ref): penalizes the model for drifting too far from the SFT checkpoint. Without this anchor, the model would learn to 'hack' the reward model — producing gibberish that scores high but isn't actually helpful.
β: controls the trade-off between maximizing reward and staying close to the SFT model. Too low → reward hacking (gibberish that scores high). Too high → the model barely changes from SFT (wasted effort).
What is the purpose of the KL divergence penalty in the PPO objective?
💡 What happens when you optimize hard against an imperfect metric?
The KL penalty acts as an 'anchor' — it keeps the optimized model from straying too far from the known-good SFT model. Without it, the model would find degenerate strategies that game the imperfect reward model (e.g., producing repetitive or nonsensical text that scores high). It trades off reward maximization against behavioral stability.
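Putting the two terms together, the quantity PPO pushes up looks roughly like the sketch below. This is heavily simplified (real implementations add PPO's clipping, a value function, and advantage estimation), and all the names here are illustrative:

```python
import torch

def rlhf_objective(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-response objective: r(x, y) - beta * KL(π_θ ∥ π_ref), estimated per sample.

    rm_score:        scalar score from the reward model for the full response
    policy_logprobs: log π_θ(token | context) for each generated token, shape (T,)
    ref_logprobs:    log π_ref(token | context) from the frozen SFT model, shape (T,)
    beta:            KL weight; too low invites reward hacking, too high keeps
                     the model pinned to the SFT checkpoint
    """
    # Standard per-sample KL estimate: sum of log-prob differences over the response.
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return rm_score - beta * kl_estimate
```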
Beyond RLHF: DPO and Constitutional AI
RLHF works, but it’s complex. Researchers are finding simpler alternatives.
What does DPO (Direct Preference Optimization) eliminate from the RLHF pipeline?
💡 What two stages does RLHF add after SFT? DPO replaces both with one step...
DPO shows that there exists a mathematical equivalence: the reward model + PPO optimization can be collapsed into a single supervised learning objective on preference pairs. You still need human comparisons (winner/loser), but you skip the complexity of training a separate reward model and running RL. Llama 2 used RLHF; many newer models use DPO.
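For comparison with the reward-model loss above, here is a sketch of the DPO loss on a batch of preference pairs; each argument is the summed log-probability of a response under the trainable policy or the frozen reference (SFT) model, and the names are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: a response's 'implicit reward' is how much the policy has raised
    its log-probability relative to the reference (SFT) model."""
    implicit_reward_w = policy_logp_w - ref_logp_w   # winner
    implicit_reward_l = policy_logp_l - ref_logp_l   # loser
    # Same pairwise form as the reward-model loss, but with no separate RM and no RL:
    # -log σ(β · (implicit_reward_w - implicit_reward_l))
    return -F.logsigmoid(beta * (implicit_reward_w - implicit_reward_l)).mean()
```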
In Constitutional AI (CAI), who provides the feedback used to train the reward model?
💡 The name is 'Constitutional' AI — what does the constitution do?
Constitutional AI replaces much of the human feedback with AI-generated feedback. An AI reads a 'constitution' (a set of principles like 'be helpful, be harmless') and uses those principles to critique and revise outputs. This drastically reduces the need for human annotators while still aligning the model. Anthropic's Claude models use this approach.
🎓 What You Now Know
✓ Pretrained LLMs are autocomplete, not assistants — They predict the next token, with no concept of helpfulness or safety.
✓ RLHF has three stages — SFT (learn from demonstrations) → reward model (learn from comparisons) → PPO (optimize against the reward model), all built on top of the pretrained base.
✓ Comparisons scale better than demonstrations — It’s easier to say “A > B” than to write the perfect answer.
✓ KL penalty prevents reward hacking — Without it, the model exploits the reward model instead of being genuinely helpful.
✓ DPO simplifies the pipeline — Direct preference optimization skips the separate reward model and RL step, training directly on preference pairs with comparable results and far less machinery.
RLHF turned language models from text generators into AI assistants. The same model that hallucinated and produced harmful content now declines, qualifies, and helps responsibly. It’s not perfect, but it’s the bridge between capability and alignment. 🚀
📄 Training language models to follow instructions with human feedback (Ouyang et al., 2022)
📄 Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
↗ Keep Learning
Transformers — The Architecture That Changed AI
A scroll-driven visual deep dive into the Transformer architecture. From RNNs to self-attention to GPT — understand the engine behind every modern AI model.
Logistic Regression — The Classifier That's Not Really Regression
A scroll-driven visual deep dive into logistic regression. Learn how a regression model becomes a classifier, why the sigmoid is the key, and how log-loss trains it.