What Is It?

Alignment is the set of methods for making AI behavior match human preferences. A pretrained language model can generate text — alignment determines which text it prefers to generate. It turns a capable model into a helpful, harmless, honest one.

The core problem: pretraining optimizes for next-token prediction, which produces a model that can write anything — poetry, code, toxicity, misinformation, all with equal facility. Alignment adds a second objective: generate text that humans would actually prefer.

Key Insight
Capability and alignment are orthogonal. A highly capable model without alignment is dangerous. A well-aligned model without capability is useless. Modern AI research must solve both simultaneously.

The RLHF Pipeline

Reinforcement Learning from Human Feedback is the original alignment recipe, pioneered by Christiano et al. (2017) and scaled by OpenAI for InstructGPT (2022). It has three stages:

[Interactive figure: the three-stage RLHF pipeline]

Stage 1 — SFT: Fine-tune the pretrained LLM on high-quality demonstrations (human-written ideal responses). This teaches the format and style of helpful answers.

Stage 2 — Reward Model: Collect preference pairs (human says response A > B), then train a classifier to predict which response humans prefer. This is the learned reward signal.

Stage 3 — PPO: Use the reward model as a scoring function and optimize the policy (the LLM) via Proximal Policy Optimization. A KL penalty keeps the policy close to the SFT baseline to prevent reward hacking.
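The objective in Stage 3 can be sketched as a shaped reward: the reward model's score minus a penalty for drifting from the reference policy. A minimal sketch (function name and coefficient are illustrative, not from any specific library):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """Reward signal optimized in RLHF PPO (sketch, not a full PPO loop).

    rm_score    : scalar score from the reward model
    logp_policy : log-prob of the response under the current policy
    logp_ref    : log-prob under the frozen SFT/reference model
    """
    kl_estimate = logp_policy - logp_ref   # simple per-sample KL estimate
    return rm_score - kl_coef * kl_estimate
```

If the policy assigns a response much higher probability than the reference does, the KL term eats into the reward, which is exactly what discourages reward hacking.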


Core Methods

RLHF was the beginning, not the end. Researchers have developed increasingly elegant alternatives.

RLHF RL-Based

Train a reward model on preference pairs, then optimize the policy with PPO. The OG approach. Requires 4 models in memory (policy, reference, reward, value). High infrastructure cost, but proven at scale (InstructGPT, ChatGPT).

DPO Direct

Direct Preference Optimization. Skip the reward model entirely — reparameterize the reward as a function of the policy itself. Loss: -log σ(β(log π(y_w)/π_ref(y_w) - log π(y_l)/π_ref(y_l))). Just a classification loss on preference pairs. Dramatically simpler.
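The loss above is simple enough to write out directly. A minimal per-pair sketch in plain Python, assuming the token log-probabilities have already been summed per response:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from summed token log-probabilities (floats).

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference (SFT) model
    """
    logratio_w = pi_logp_w - ref_logp_w   # log π(y_w)/π_ref(y_w)
    logratio_l = pi_logp_l - ref_logp_l   # log π(y_l)/π_ref(y_l)
    margin = beta * (logratio_w - logratio_l)
    # -log σ(margin), computed stably as log(1 + e^(-margin))
    return math.log1p(math.exp(-margin))
```

When the policy exactly matches the reference, both log-ratios are zero and the loss is log 2; training drives the chosen response's log-ratio above the rejected one's.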

KTO Binary

Kahneman-Tversky Optimization. Works with binary feedback (thumbs up/down) — no paired comparisons needed. Leverages prospect theory: losses loom larger than gains. Practical for production systems where paired data is expensive.

ORPO Odds Ratio

Odds Ratio Preference Optimization. Combines SFT and alignment into a single stage by adding an odds-ratio penalty. No reference model needed, no separate SFT step. The simplest pipeline of all.
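A sketch of the odds-ratio term, which ORPO adds to the plain SFT loss with some weight. The function name is illustrative; inputs are average per-token log-probs under the single policy model, with no reference model anywhere:

```python
import math

def orpo_odds_ratio_term(avg_logp_w, avg_logp_l):
    """ORPO's odds-ratio penalty (sketch; added to the standard SFT loss).

    odds(y) = p / (1 - p), with p = exp(average per-token log-prob),
    so p must lie in (0, 1), i.e. the inputs must be negative.
    """
    def log_odds(logp):
        p = math.exp(logp)
        return math.log(p / (1.0 - p))

    ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    # -log σ(ratio): small when the chosen response's odds dominate
    return math.log1p(math.exp(-ratio))
```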

RLAIF AI Feedback

Reinforcement Learning from AI Feedback. Replace human annotators with a capable AI model that generates preference labels. Scales annotation without human bottleneck. Used in Anthropic's Constitutional AI pipeline and Google's research.

Constitutional AI Self-Critique

Define a set of principles (a “constitution”). The model critiques and revises its own outputs against those principles. Then train on the self-revised data. Reduces reliance on human labelers for safety-related feedback.


Reward Modeling

The reward model is the lynchpin of RLHF. It translates fuzzy human preferences into a scalar signal that RL can optimize. Here is how it works:

Preference Collection

Given a prompt, the model generates two responses. A human annotator picks the better one. This creates a preference pair: y_w (chosen) > y_l (rejected).

[Interactive figure: preference pair comparison]
Prompt: "Explain quantum computing in simple terms."
Chosen (y_w): "Quantum computing uses qubits that can be 0, 1, or both at once (superposition). This lets quantum computers explore many solutions simultaneously, making them powerful for specific problems like cryptography and drug discovery."
Rejected (y_l): "Quantum computing is a type of computing that leverages quantum mechanical phenomena such as superposition and entanglement to perform operations on data using quantum bits or qubits, which differ from classical bits in their ability to exist in multiple states simultaneously according to the principles of quantum mechanics."

Bradley-Terry Model

The reward model is trained using the Bradley-Terry framework: the probability that response A is preferred over B is modeled as:

P(y_w > y_l) = σ(r(x, y_w) - r(x, y_l))

where r(x, y) is the scalar reward and σ is the sigmoid function. Training minimizes the negative log-likelihood of the observed preferences.
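That negative log-likelihood is a one-liner. A minimal sketch, with the two scalar rewards assumed to come from the reward model:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of one preference under Bradley-Terry.

    r_chosen, r_rejected are the scalar rewards r(x, y_w) and r(x, y_l).
    Minimizing this pushes the chosen reward above the rejected one.
    """
    diff = r_chosen - r_rejected
    # -log σ(diff), written stably as log(1 + e^(-diff))
    return math.log1p(math.exp(-diff))
```

Note the loss depends only on the reward *difference*: the Bradley-Terry model is invariant to adding a constant to all rewards, which is why reward scores have no absolute scale.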

Process vs. Outcome Reward Models

Outcome Reward Model (ORM): scores the final answer only. Simple but can reward correct answers reached by wrong reasoning.

Process Reward Model (PRM): scores each intermediate step. More supervision signal, catches errors earlier, but much more expensive to annotate.


Process Reward Models

Process reward models score each step of reasoning, not just the final answer. This is critical for math and coding, where a single wrong step invalidates everything downstream. OpenAI's PRM800K dataset provides step-level labels for 800K math reasoning steps.

[Interactive figure: step-by-step PRM scoring]
Problem: solve 2x + 5 = 17
Step 1: 2x + 5 = 17
Step 2: 2x = 17 - 5 = 12
Step 3: x = 12 / 2 = 6
Answer: x = 6
Why PRMs Matter
On the MATH benchmark, best-of-N selection with a PRM outperforms the same method with an ORM by a significant margin. The PRM catches errors mid-chain, while the ORM only sees the final answer and can be fooled by correct-looking but wrong reasoning.
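Best-of-N with a PRM can be sketched in a few lines. One common aggregation (an assumption here, not the only choice) scores a solution as the product of its per-step correctness probabilities:

```python
import math

def prm_score(step_probs):
    """Solution score = product of per-step correctness probabilities.

    Computed in log space for stability; one bad step sinks the chain.
    """
    return math.exp(sum(math.log(p) for p in step_probs))

def best_of_n(candidates):
    """Pick the candidate answer with the highest PRM score.

    `candidates` maps each final answer to its per-step probabilities.
    """
    return max(candidates, key=lambda ans: prm_score(candidates[ans]))
```

A chain with one low-probability step loses to a chain that is solid throughout, even if both reach plausible-looking answers; that is the mid-chain error detection an ORM cannot do.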

Reward Hacking

The model finds loopholes in the reward signal. It learns to maximize the reward model's score without actually being more helpful. This is Goodhart's Law applied to AI: “When a measure becomes a target, it ceases to be a good measure.”

[Interactive figure: reward model score vs. actual human rating over training. The two curves diverge as training proceeds; the widening gap is the hacking.]

Common Failure Modes

Verbosity Bias

Longer responses get higher reward scores, even when brevity would be better. The model learns: longer = better.

Sycophancy

The model agrees with whatever the user says, even if the user is wrong. Agreeable = higher reward.

Formatting Tricks

Excessive use of bullet points, bold text, and headers to appear thorough without adding substance.

Hedging

Overuse of qualifiers and disclaimers to avoid being rated as wrong, even when confident answers are appropriate.

Mitigation Strategies

KL Penalty

Penalize the policy for diverging too far from the SFT baseline. Prevents extreme exploitation.

RM Ensembles

Use multiple reward models and take the conservative estimate. Harder to hack all of them.

Iterative RLHF

Retrain the reward model on new policy outputs. The reward model co-evolves with the policy.

Length Penalty

Normalize reward by response length, or directly penalize verbose responses.
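Two of these mitigations compose naturally. A sketch (names and the α value are illustrative) that takes the conservative estimate over an ensemble and charges a per-token cost:

```python
def robust_reward(rm_scores, n_tokens, alpha=0.001):
    """Conservative ensemble score with a length penalty (sketch).

    rm_scores : scores from an ensemble of reward models; taking the
                minimum means the policy must fool every model at once.
    alpha     : per-token cost that removes the easy verbosity exploit.
    """
    return min(rm_scores) - alpha * n_tokens
```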


Training Pipeline Visualization

The full alignment pipeline, from raw pretraining to deployed model:

End-to-End Alignment Pipeline Overview
1. Pretrain: next-token prediction on trillions of tokens
2. SFT: fine-tune on human demonstrations
3. Reward Model: train on preference pairs (A > B)
4. RL (PPO/DPO): optimize the policy against the reward signal
5. Deploy: safety evals, red-teaming, monitoring

DPO vs. RLHF: Architecture Comparison


RLHF (PPO)

  • Requires 4 models: policy, reference, reward, value
  • Separate reward model training phase
  • PPO optimization with clipped objective
  • KL penalty to stay near reference
  • Complex infrastructure, high memory
  • Proven at massive scale (GPT-4, Claude)
  • More flexible: reward model can be reused

DPO (Direct)

  • Requires 2 models: policy and reference only
  • No separate reward model needed
  • Simple classification loss on preferences
  • KL constraint is implicit in the loss
  • Simple to implement, lower memory
  • Proven on smaller models, scaling ongoing
  • Must retrain for different preference data

[Interactive: DPO Loss Explorer. Example setting: chosen log-ratio 0.50, rejected log-ratio -0.50, β = 0.10, giving margin 1.00, σ(β · margin) ≈ 0.52, DPO loss ≈ 0.65, implicit reward 0.10.]

Method Comparison: RLHF vs. DPO vs. KTO vs. ORPO

Property          | RLHF (PPO)                     | DPO                     | KTO                     | ORPO
Year              | 2017/2022                      | 2023                    | 2024                    | 2024
Data required     | Preference pairs               | Preference pairs        | Binary feedback         | Preference pairs
Reward model      | Required (separate)            | Not needed              | Not needed              | Not needed
Reference model   | Yes (frozen)                   | Yes (frozen)            | Yes (frozen)            | Not needed
Models in memory  | 4 (policy, ref, RM, value)     | 2 (policy, ref)         | 2 (policy, ref)         | 1 (policy only)
Separate SFT      | Yes                            | Yes                     | Yes                     | No (combined)
Implementation    | Complex (RL loop)              | Simple (classification) | Simple (classification) | Simple (SFT + penalty)
KL constraint     | Explicit penalty               | Implicit in loss        | Implicit in loss        | Via odds ratio
Scale             | Proven at massive scale        | Scaling ongoing         | Early research          | Early research
Key paper         | Christiano+ 2017, Ouyang+ 2022 | Rafailov+ 2023          | Ethayarajh+ 2024        | Hong+ 2024
DPO Loss Function
L_DPO(π; π_ref) = -E[ log σ( β ( log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x) ) ) ]

The implicit reward is: r(x,y) = β log π(y|x)/π_ref(y|x) + const
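Plugging illustrative numbers into the loss makes the pieces concrete. The log-ratios below are made up for the example:

```python
import math

beta = 0.1
logratio_w = 0.5    # log π(y_w|x)/π_ref(y_w|x), hypothetical
logratio_l = -0.5   # log π(y_l|x)/π_ref(y_l|x), hypothetical

margin = logratio_w - logratio_l               # 1.0
p_prefer = 1 / (1 + math.exp(-beta * margin))  # σ(β · margin) ≈ 0.525
loss = -math.log(p_prefer)                     # ≈ 0.644
implicit_reward_w = beta * logratio_w          # 0.05 (up to the constant)
```

Even a full unit of log-ratio margin yields only a gentle loss at β = 0.1; larger β sharpens the preference signal but also tightens the implicit KL constraint.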