CS 229s — Systems for Machine Learning

Finetuning LLMs with RL

From cross-entropy to REINFORCE to PPO — the math behind making language models follow human preferences. SFT loss, policy gradients, reward modeling, and the full RLHF pipeline.

Prerequisites: Language models + Calculus (chain rule). That's it.
9 Chapters · 5+ Simulations · 0 Assumed Knowledge

Chapter 0: The LLM Development Pipeline

You've pre-trained a language model on terabytes of internet text. It can finish sentences, write code, and even translate languages. But ask it "What's the capital of France?" and it might respond with a Wikipedia article, a poem about Paris, or a completely unrelated continuation. It knows things, but it doesn't know how to answer things.

The solution is a three-stage pipeline. First, pre-training gives the model broad knowledge by predicting the next token on massive text corpora. Second, supervised finetuning (SFT) teaches it to follow instructions by training on curated question-answer pairs. Third, reinforcement learning from human feedback (RLHF) aligns its outputs with human preferences by rewarding good responses and penalizing bad ones.

This lecture covers stages two and three: the SFT loss that makes instruction-following work, and the RL machinery — from basic policy gradients all the way to PPO — that makes alignment possible.

The core tension: SFT teaches the model to imitate specific good answers. But we can't write down every possible good answer. RLHF instead teaches the model a general notion of quality — "humans prefer responses like this" — and lets the model discover how to produce them.

[Interactive figure: The Three-Stage Pipeline. Each stage shows what goes in, what comes out, and what the training signal is.]

Why can't we just use SFT alone to align a language model?

Chapter 1: Supervised Finetuning Loss

SFT works exactly like pre-training, but on curated data. Given a sequence of tokens x1, x2, ..., xT, the model predicts each next token given all previous tokens. The loss measures how surprised the model is by the correct answer.

Let's derive the loss from scratch. At each position t, the model outputs a probability distribution over the entire vocabulary. We want the model to assign high probability to the actual next token xt+1. The natural measure of "how wrong was the prediction" is cross-entropy.

Step 1: The per-token loss

At position t, the model sees tokens x_1, x_2, ..., x_t and outputs logits z ∈ ℝ^V, where V is the vocabulary size (e.g. 32,000 for LLaMA). We convert logits to probabilities with softmax:

$$p(x_{t+1} \mid x_{1:t}) = \frac{\exp(z_{x_{t+1}})}{\sum_{v=1}^{V} \exp(z_v)}$$

The per-token loss is the negative log of this probability:

$$\ell_t = -\log p(x_{t+1} \mid x_{1:t})$$

When the model assigns probability 1.0 to the correct token, ℓt = 0. When it assigns probability 0.01, ℓt = 4.6. The loss is always non-negative and is zero only when the model is perfectly confident and correct.
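
As a quick numerical check of the per-token loss, the sketch below applies softmax to a small made-up logit vector and takes the negative log of the probability of the correct token. The function names `softmax` and `per_token_loss` and the example logits are illustrative assumptions, not part of the lecture.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_loss(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical logits over a 4-word vocabulary; index 2 is the correct token.
logits = [1.0, 0.5, 3.0, -1.0]
loss = per_token_loss(logits, target_index=2)

# The two reference points quoted in the text.
loss_at_p_1 = -math.log(1.0)      # zero loss when the model is certain and correct
loss_at_p_001 = -math.log(0.01)   # about 4.6 when the correct token gets 1%
```

The loss uses the natural logarithm, which is why a probability of 0.01 maps to roughly 4.6 rather than 2.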

Step 2: The full SFT loss

We average the per-token loss over the T − 1 positions that have a next token to predict (the last token has no target):

$$L_{\text{SFT}} = -\frac{1}{T-1} \sum_{t=1}^{T-1} \log p(x_{t+1} \mid x_1, x_2, \ldots, x_t)$$
Why cross-entropy? It's the negative log-likelihood of the data under the model. Minimizing it is equivalent to maximum likelihood estimation — finding the parameters that make the training data most probable. It's also the KL divergence between the true data distribution and the model, plus a constant.
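
The claim that cross-entropy equals a KL divergence plus a constant can be checked numerically. In the sketch below the constant is the entropy of the data distribution, which does not depend on the model. The two distributions are made-up examples and the helper names are assumptions for illustration.

```python
import math

def entropy(p):
    """H(p) = -sum p log p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p log q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.7, 0.2, 0.1]    # hypothetical "true" next-token distribution
q_model = [0.5, 0.3, 0.2]   # hypothetical model distribution

lhs = cross_entropy(p_data, q_model)
rhs = kl_divergence(p_data, q_model) + entropy(p_data)
```

Since `entropy(p_data)` is fixed by the data, lowering the cross-entropy can only happen by lowering the KL term.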

What actually flows through the network

| Tensor | Shape | What it is |
| --- | --- | --- |
| Input tokens | [B, T] | Batch of B sequences, each T tokens long |
| Logits | [B, T, V] | Raw scores for every vocabulary word at every position |
| Target tokens | [B, T−1] | Input tokens shifted left by one position (the last position has no target) |
| Per-token loss | [B, T−1] | −log p(correct token) at each predicted position |
| L_SFT | scalar | Mean over all B×(T−1) predicted positions |

python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids):
    """Mean next-token cross-entropy for a batch of token ids of shape [B, T]."""
    # Assumes model(input_ids) returns raw logits of shape [B, T, V].
    logits = model(input_ids)                # [B, T, V]
    V = logits.size(-1)                      # vocabulary size
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :]         # [B, T-1, V]
    shift_labels = input_ids[:, 1:]          # [B, T-1]
    return F.cross_entropy(
        shift_logits.reshape(-1, V),         # [B*(T-1), V]
        shift_labels.reshape(-1),            # [B*(T-1)]
    )  # scalar: mean cross-entropy over all predicted tokens

[Interactive figure: SFT Loss, Token-by-Token. The bar shows the probability the model assigns to the correct token at each position: green when confident and correct, red when surprised.]

If the model assigns probability 0.5 to the correct next token, what is the per-token loss?

Chapter 2: RL Fundamentals

Why do we need reinforcement learning at all? Because SFT requires explicit demonstrations of correct behavior. With RL, we only need to say whether an output is good or bad (a scalar reward), and the model figures out how to produce good outputs on its own.

RL has five core concepts. Let's build them one by one.

States, Actions, and Rewards

An agent lives in an environment. At each timestep t, the agent observes a state st, takes an action at, receives a reward rt, and transitions to a new state st+1. This cycle repeats until the episode ends.

State s_t: what the agent sees right now
↓ agent picks
Action a_t: what the agent does
↓ environment responds
Reward r_t + new state s_{t+1}: feedback + updated world
↻ repeat until episode ends
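
The loop above can be sketched in a few lines. The environment here is a hypothetical 1D world with positions 0 to 4, actions −1, 0, +1, and a reward of 1 for reaching the right edge; all names and numbers are assumptions chosen for illustration.

```python
import random

def run_episode(policy, start=0, goal=4, max_steps=20, seed=0):
    """Roll out one episode and return the list of (state, action, reward) steps."""
    rng = random.Random(seed)
    state, trajectory = start, []
    for _ in range(max_steps):
        action = policy(state, rng)                      # agent picks a_t given s_t
        next_state = min(max(state + action, 0), goal)   # environment responds
        reward = 1.0 if next_state == goal else 0.0
        trajectory.append((state, action, reward))
        state = next_state
        if state == goal:                                # episode ends at the goal
            break
    return trajectory

def random_policy(state, rng):
    """Uniform over left, stay, right."""
    return rng.choice([-1, 0, 1])

def always_right(state, rng):
    """Deterministic policy that always steps right."""
    return 1

trajectory = run_episode(always_right)
rewards = [r for (_, _, r) in trajectory]
```

With `always_right` the agent reaches the goal in four steps and only the last step carries a nonzero reward, which is the same sparse-reward shape that shows up later for text generation.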

The Policy π

The policy π(a | s) is the agent's strategy: a probability distribution over actions given the current state. "When I see state s, what should I do?" A good policy assigns high probability to actions that lead to high reward.

Return: The Total Payoff

We don't just want high reward now — we want high reward over the entire episode. The return Gt is the sum of future rewards, with a discount factor γ ∈ [0, 1] that makes future rewards worth slightly less than immediate ones:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

When γ = 0, the agent is completely myopic — only the immediate reward matters. When γ = 1, all future rewards are equally important. Typical values: γ = 0.99.
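
A minimal sketch of the return computation, assuming a finite episode whose rewards are given as a list. The function name `discounted_return` and the reward values are illustrative.

```python
def discounted_return(rewards, gamma):
    """G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 0.0, 3.0]             # hypothetical 5-step episode

myopic = discounted_return(rewards, 0.0)         # only the first reward counts
undiscounted = discounted_return(rewards, 1.0)   # plain sum of all rewards
g_09 = discounted_return(rewards, 0.9)           # 1 + 0.81*2 + 0.6561*3
```

Working backwards through the episode avoids computing powers of γ explicitly and gives the return from every intermediate step along the way.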

Value Functions

The value function Vπ(s) is the expected return from state s when following policy π:

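$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s\Big]$$

The action-value function Qπ(s, a) is the expected return from state s after taking action a, then following π:

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$$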
Think of it this way: V(s) says "how good is this state?" Q(s,a) says "how good is this state-action pair?" The difference — Q(s,a) − V(s) — is the advantage: "how much better was this action than what I'd normally do?"
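
To make the relationship concrete, the sketch below takes a single state with three actions, made-up Q-values, and a made-up policy. It uses the identity that V(s) is the policy-weighted average of Q(s, a), so the advantages average to zero under the policy. All numbers are assumptions.

```python
def state_value(policy_probs, q_values):
    """V(s) = sum_a pi(a|s) * Q(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

def advantages(policy_probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(policy_probs, q_values)
    return [q - v for q in q_values]

policy_probs = [0.2, 0.5, 0.3]   # hypothetical pi(a|s) for left, stay, right
q_values = [1.0, 2.0, 4.0]       # hypothetical Q(s, a)

v = state_value(policy_probs, q_values)      # 0.2*1 + 0.5*2 + 0.3*4 = 2.4
adv = advantages(policy_probs, q_values)     # negative, negative, positive
mean_adv = sum(p * a for p, a in zip(policy_probs, adv))
```

Only the third action has a positive advantage, so it is the one a policy-gradient update would make more likely.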

[Interactive figure: Discounted Return Calculator. A 5-step episode with rewards and an adjustable discount γ (default 0.90). Low γ is myopic; high γ is far-sighted.]

What does the advantage A(s,a) = Q(s,a) − V(s) tell us?

Chapter 3: The Policy Gradient — REINFORCE

We have a policy πθ(a | s) parameterized by θ (neural network weights). Our goal is to find the θ that maximizes the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

where τ = (s0, a0, r0, s1, a1, r1, ...) is a trajectory and R(τ) = ∑t rt is the total return. We want ∇θ J(θ) so we can do gradient ascent. But the expectation is over trajectories sampled from the policy, so how do we differentiate through sampling?

Derivation: The Log-Probability Trick

This is the key insight. Let's derive it step by step.

Step 1. Write the expected return as a sum over trajectories (an integral when the trajectory space is continuous):

$$J(\theta) = \sum_\tau P(\tau; \theta)\, R(\tau)$$

Step 2. Take the gradient with respect to θ. The reward R(τ) doesn't depend on θ, only the trajectory probability does:

$$\nabla_\theta J = \sum_\tau \nabla_\theta P(\tau; \theta) \cdot R(\tau)$$

Step 3. Here's the trick. We multiply and divide by P(τ; θ):

$$\nabla_\theta J = \sum_\tau P(\tau; \theta) \cdot \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} \cdot R(\tau)$$

Step 4. Recognize that ∇f / f = ∇ log f (the log-derivative trick):

$$\nabla_\theta J = \sum_\tau P(\tau; \theta) \cdot \nabla_\theta \log P(\tau; \theta) \cdot R(\tau)$$

Step 5. That sum weighted by P(τ; θ) is just an expectation! So:

$$\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau; \theta) \cdot R(\tau) \big]$$

Step 6. The trajectory probability P(τ; θ) factors as p(s_0) ∏_t π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t). Taking the log turns the product into a sum. The environment dynamics p(s_{t+1} | s_t, a_t) don't depend on θ, so their gradients vanish:

$$\nabla_\theta \log P(\tau; \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The REINFORCE gradient:

$$\nabla_\theta J = \mathbb{E}_\tau\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau) \Big]$$
What this says in English: If a trajectory got high reward, increase the probability of every action we took on that trajectory. If it got low reward, decrease them. The gradient of the log-probability tells us which direction in parameter space makes each action more likely. The return R(τ) tells us how strongly to push.
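
The identity can be verified on a toy problem. The sketch below assumes a one-step episode with a softmax policy over three actions, parameterized directly by its logits, and compares the exact expectation of ∇ log π · R against a finite-difference gradient of J. The setup and function names are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(z, rewards):
    """J = sum_a pi(a) * R(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(z), rewards))

def grad_log_prob(z, action):
    """Gradient of log softmax(z)[action] w.r.t. z: one_hot(action) - probs."""
    probs = softmax(z)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(z))]

def reinforce_gradient(z, rewards):
    """Exact expectation E_a[ grad log pi(a) * R(a) ] (no sampling noise)."""
    probs = softmax(z)
    grad = [0.0] * len(z)
    for a, (p, r) in enumerate(zip(probs, rewards)):
        g = grad_log_prob(z, a)
        for k in range(len(z)):
            grad[k] += p * g[k] * r
    return grad

def finite_difference_gradient(z, rewards, h=1e-6):
    """Central-difference estimate of dJ/dz, used only as a reference."""
    grad = []
    for k in range(len(z)):
        zp, zm = list(z), list(z)
        zp[k] += h
        zm[k] -= h
        grad.append((expected_return(zp, rewards) - expected_return(zm, rewards)) / (2 * h))
    return grad

z = [0.1, -0.3, 0.5]        # hypothetical logits for left, stay, right
rewards = [0.0, 0.0, 1.0]   # only "right" is rewarded

g_score = reinforce_gradient(z, rewards)
g_fd = finite_difference_gradient(z, rewards)
```

In practice the expectation is replaced by an average over sampled trajectories, which is where the variance discussed next comes from.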

The variance problem

REINFORCE is unbiased but has very high variance. Why? Because R(τ) is the total return of the entire trajectory. If a trajectory got high reward, we reinforce every action — even the bad ones that happened to be on a lucky trajectory. Two standard fixes:

1. Baseline subtraction. Replace R(τ) with R(τ) − b, where b is a baseline (often V(s)). This doesn't change the expected gradient but reduces variance dramatically.

2. Use the advantage. Replace R(τ) with At = Q(st, at) − V(st). This says "how much better was this action than average?" rather than "was the whole episode good?"

$$\nabla_\theta J \approx \mathbb{E}_\tau\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \Big]$$
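
A small sketch of why baseline subtraction helps, assuming the same kind of one-step softmax policy as a toy setting. It computes the exact mean and total variance of the single-sample estimator ∇ log π(a) · (R(a) − b), once with b = 0 and once with b = V. The rewards and names are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_stats(z, rewards, baseline):
    """Exact mean and total variance of grad log pi(a) * (R(a) - baseline) under pi."""
    probs = softmax(z)
    n = len(z)
    samples = []
    for a in range(n):
        weight = rewards[a] - baseline
        samples.append([((1.0 if k == a else 0.0) - probs[k]) * weight for k in range(n)])
    mean = [sum(probs[a] * samples[a][k] for a in range(n)) for k in range(n)]
    var = sum(
        probs[a] * sum((samples[a][k] - mean[k]) ** 2 for k in range(n))
        for a in range(n)
    )
    return mean, var

z = [0.1, -0.3, 0.5]
rewards = [10.0, 11.0, 12.0]   # all rewards are large and similar
value = sum(p * r for p, r in zip(softmax(z), rewards))   # V = E_a[R(a)]

mean_raw, var_raw = gradient_stats(z, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(z, rewards, baseline=value)
```

The two means agree, because the expected score ∇ log π is zero, while the variance drops sharply once the shared offset in the rewards is removed.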

[Interactive figure: REINFORCE in Action. A 1D world with 3 actions: left, stay, right. The goal is at the right edge. The policy (bar chart) shifts after each episode, with high-reward episodes increasing the probability of the actions taken.]

Why does the log-probability trick work? What mathematical identity does it rely on?

Chapter 4: PPO — Proximal Policy Optimization

REINFORCE has a fatal flaw: the gradient can be huge. One lucky trajectory can massively change the policy, destroying what was learned before. We need controlled updates — change the policy, but not too much at once.

The probability ratio

Define the ratio between the new and old policy:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

When rt = 1, the new policy is identical to the old one. When rt > 1, the action became more likely. When rt < 1, it became less likely.

The surrogate objective

Instead of REINFORCE, we optimize a surrogate objective that uses the ratio:

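$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\big[ r_t(\theta) \cdot A_t \big]$$

This is called the Conservative Policy Iteration (CPI) objective. At θ = θ_old its gradient equals the policy gradient, and because the data was collected under π_θold while the ratio corrects for the change in policy, we can take multiple gradient steps on the same batch. The problem: if A_t > 0 (good action), the objective keeps growing as r_t grows, so nothing stops the optimizer from moving the policy so far from π_θold that it collapses.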
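
One way to see the link between the surrogate and the policy gradient is to differentiate r(θ) · A at θ = θ_old, where the ratio is 1 and its gradient reduces to ∇ log π. The sketch below checks this numerically for a softmax policy over three actions. The setup and names are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def surrogate(z_new, z_old, action, advantage):
    """r(theta) * A for one sampled action, with logits as the parameters."""
    ratio = softmax(z_new)[action] / softmax(z_old)[action]
    return ratio * advantage

def policy_gradient_term(z, action, advantage):
    """grad log pi(action) * A for a softmax policy."""
    probs = softmax(z)
    return [((1.0 if k == action else 0.0) - probs[k]) * advantage for k in range(len(z))]

def surrogate_gradient_at_old(z_old, action, advantage, h=1e-6):
    """Central-difference gradient of the surrogate, evaluated at z_new = z_old."""
    grad = []
    for k in range(len(z_old)):
        zp, zm = list(z_old), list(z_old)
        zp[k] += h
        zm[k] -= h
        diff = surrogate(zp, z_old, action, advantage) - surrogate(zm, z_old, action, advantage)
        grad.append(diff / (2 * h))
    return grad

z_old = [0.2, -0.1, 0.4]
g_surrogate = surrogate_gradient_at_old(z_old, action=2, advantage=1.5)
g_policy = policy_gradient_term(z_old, action=2, advantage=1.5)
```

The match holds only at θ = θ_old. After the first gradient step the two objectives start to differ, which is what motivates limiting how far the ratio can move.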

PPO's clipping trick

PPO clips the ratio to prevent it from straying too far from 1. The clipped surrogate is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[ \min\big( r_t(\theta) A_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot A_t \big) \big]$$

where ε is a small number, typically 0.2. Let's unpack what happens in each case:

| Advantage | Ratio region | Which term wins the min? | Effect |
| --- | --- | --- | --- |
| A_t > 0 (good action) | r < 1+ε | r · A (unclipped) | Normal gradient: increase probability |
| A_t > 0 (good action) | r ≥ 1+ε | (1+ε) · A (clipped) | Gradient is zero: stop increasing! |
| A_t < 0 (bad action) | r > 1−ε | r · A (unclipped) | Normal gradient: decrease probability |
| A_t < 0 (bad action) | r ≤ 1−ε | (1−ε) · A (clipped) | Gradient is zero: stop decreasing! |

The key idea: PPO says "go ahead and improve the policy, but the moment you've changed it by more than ε, stop." This prevents catastrophic updates where one batch of data destroys everything the model has learned. It's like putting bumper guards on the gradient.
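
The four cases can be reproduced with a direct implementation of the clipped term for a single sample. This is a minimal sketch. The function names and the example ratios are assumptions, and ε = 0.2 follows the typical value quoted above.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# One example per case: (ratio, advantage).
good_inside = ppo_clip_term(1.1, +1.0)    # unclipped: 1.1
good_outside = ppo_clip_term(1.5, +1.0)   # clipped at (1 + eps) * A = 1.2
bad_inside = ppo_clip_term(0.9, -1.0)     # unclipped: -0.9
bad_outside = ppo_clip_term(0.5, -1.0)    # clipped at (1 - eps) * A = -0.8

# Once clipped, pushing the ratio further leaves the objective unchanged.
flat = ppo_clip_term(2.0, +1.0) == ppo_clip_term(1.5, +1.0)
```

Because the clipped branch is constant in the ratio, its gradient with respect to the policy parameters is zero, which is how the update stops.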

The full PPO loss

PPO also adds a value function loss (to train the baseline) and an entropy bonus (to encourage exploration):

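$$L^{\text{PPO}} = L^{\text{CLIP}} - c_1 L^{\text{VF}} + c_2 H[\pi_\theta]$$

where L^VF = (V_θ(s) − V_target)², H is the entropy of the policy, and c_1, c_2 are weighting coefficients. With these signs, L^PPO is an objective to maximize (equivalently, its negative is minimized).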

[Interactive figure: PPO Clipped Objective. The teal curve is the unclipped objective r·A, the orange region shows where clipping is active, and the green curve is the final PPO objective (the min of the clipped and unclipped terms). Controls: clip ε (default 0.20) and the sign of the advantage.]

When the advantage is positive (good action) and rt exceeds 1+ε, what does PPO do?

Chapter 5: RL for Language Models

How do you apply RL to text generation? The mapping is surprisingly direct:

| RL Concept | LLM Equivalent |
| --- | --- |
| Agent | The language model π_θ |
| State s_t | Prompt + tokens generated so far: (x_1, ..., x_t) |
| Action a_t | The next token x_{t+1} sampled from π_θ(· ∣ x_{1:t}) |
| Policy π(a ∣ s) | The model's next-token distribution: softmax of logits |
| Episode | Generating one complete response to a prompt |
| Reward r | Score from a reward model (given once at the end of generation) |

The RLHF Objective

We want to maximize reward, but with a crucial constraint: the finetuned model shouldn't drift too far from the SFT model. Why? Because the reward model is imperfect — if we optimize it too aggressively, the LLM finds degenerate outputs that "hack" the reward (high reward, low quality). The solution is a KL penalty:

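$$J(\theta) = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big) \big]$$

Where r_φ is the learned reward model, π_ref is the frozen reference policy (the SFT model), β is the weight on the KL penalty, and D is the prompt dataset.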

The KL Divergence Penalty

KL divergence measures the difference between two probability distributions. For two distributions p and q over vocabulary V:

$$\mathrm{KL}(p \,\|\, q) = \sum_{v} p(v) \log \frac{p(v)}{q(v)}$$

KL is always ≥ 0, and equals 0 only when p = q. In the RLHF context, it penalizes the model for deviating from the reference — the more different the token distribution, the larger the penalty. This prevents reward hacking: the model can't just memorize one high-scoring response and output it for every prompt.

Why not just maximize reward? Without the KL penalty, RLHF quickly collapses. The model finds degenerate outputs that game the reward model, such as repeating "Great answer!", which gets high reward but is useless. The KL anchor keeps the model close to the reference (SFT) model.
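
The effect of β can be seen on a toy example. The sketch below compares two hypothetical next-token policies over a three-word vocabulary: one stays near the reference and earns a modest reward, the other concentrates almost all probability on one token and earns a higher reward. The distributions, rewards, and function names are assumptions made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_v p(v) * log(p(v) / q(v))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_objective(expected_reward, policy, reference, beta):
    """Expected reward minus beta times the KL from the reference."""
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]
modest = {"probs": [0.45, 0.35, 0.20], "reward": 1.0}   # near the reference
hacked = {"probs": [0.98, 0.01, 0.01], "reward": 2.0}   # far from the reference

def preferred(beta):
    """Name of the policy with the higher penalized objective at this beta."""
    j_modest = penalized_objective(modest["reward"], modest["probs"], reference, beta)
    j_hacked = penalized_objective(hacked["reward"], hacked["probs"], reference, beta)
    return "hacked" if j_hacked > j_modest else "modest"

no_penalty = preferred(beta=0.0)
strong_penalty = preferred(beta=5.0)
```

With β = 0 the objective simply follows the higher reward. With a large enough β, the KL cost of the collapsed distribution outweighs its reward advantage.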

[Interactive figure: Reward vs. KL Tradeoff. An adjustable KL weight β (default 0.10) shows how the penalty affects the training objective. High β stays close to the reference model; low β chases reward aggressively.]

What happens if you set β = 0 (no KL penalty) during RLHF?

Chapter 6: Reward Modeling

RLHF needs a reward signal, but we can't have a human rate every single model output during training. Instead, we train a reward model to predict human preferences. Then we use this learned reward model as the scoring function for RL.

Collecting Preference Data

The process is straightforward:

1. Sample pairs. For each prompt x, generate two responses y_1 and y_2 from the SFT model.
2. Human labels. A human annotator says "I prefer y_w over y_l" (w = winner, l = loser).
3. Build dataset. D = {(x_i, y_i^w, y_i^l)} for i = 1 to N.

The Bradley-Terry Model

We need a mathematical model of human preferences. The Bradley-Terry model (1952) assumes there exists a latent reward r*(x, y) such that the probability of preferring y1 over y2 is:

P(y1 ≻ y2 | x) = exp(r*(x, y1)) / (exp(r*(x, y1)) + exp(r*(x, y2)))

This is just a sigmoid! Let Δ = r*(x, y1) − r*(x, y2). Then:

P(y1 ≻ y2) = σ(Δ) = 1 / (1 + exp(−Δ))

When Δ > 0 (y1 has higher reward), the probability of preferring y1 is above 0.5. When Δ = 0, it's a coin flip. When Δ < 0, y2 is preferred.
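
A short numerical check that the two forms of the Bradley-Terry probability agree. The function names and reward values are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_prob_softmax(r1, r2):
    """exp(r1) / (exp(r1) + exp(r2)): the original Bradley-Terry form."""
    e1, e2 = math.exp(r1), math.exp(r2)
    return e1 / (e1 + e2)

def preference_prob_sigmoid(r1, r2):
    """sigma(r1 - r2): the same probability written as a sigmoid of the difference."""
    return sigmoid(r1 - r2)

p_softmax = preference_prob_softmax(1.5, -0.5)
p_sigmoid = preference_prob_sigmoid(1.5, -0.5)   # sigma(2), about 0.88
p_tie = preference_prob_sigmoid(0.7, 0.7)        # equal rewards give a coin flip
```

Only the difference between the two rewards matters, so adding the same constant to both leaves the preference probability unchanged.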

Deriving the Reward Modeling Loss

We parameterize the reward with a neural network rφ(x, y) — typically the SFT model with a linear head that outputs a scalar. The loss maximizes the log-likelihood of the observed preferences:

$$L_{\text{RM}}(\phi) = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma\big(r_\phi(x_i, y_i^w) - r_\phi(x_i, y_i^l)\big)$$

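Let's unpack this: the argument of σ is the reward margin between winner and loser. When the margin is large and positive, σ is close to 1 and the loss is close to 0. When the two rewards are equal, the loss is log 2 ≈ 0.69. When the model ranks the loser higher, the margin is negative and the loss grows roughly linearly with its size.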

Architecture detail: The reward model is usually initialized from the SFT model itself. We remove the language model head (which outputs logits over vocabulary) and replace it with a linear layer that outputs a single scalar. The model sees the prompt + response and outputs one number: "how good is this response?"
python
import torch
import torch.nn.functional as F

# Reward model training step
def reward_loss(r_phi, x, y_w, y_l):
    # r_phi: reward model (LLM + linear head), returns one scalar per example
    # x: prompts [B]
    # y_w: preferred responses [B]
    # y_l: rejected responses [B]
    r_w = r_phi(x, y_w)    # [B], scalar reward for winners
    r_l = r_phi(x, y_l)    # [B], scalar reward for losers
    # logsigmoid is the numerically stable form of log(sigmoid(.))
    loss = -F.logsigmoid(r_w - r_l).mean()   # scalar
    return loss

[Interactive figure: Bradley-Terry Model. Sliders set the reward scores for two responses (defaults: r(winner) = 1.5, r(loser) = −0.5), and the sigmoid curve shows the resulting preference probability. When the reward difference is large, the model is very confident about which response is better.]

In the Bradley-Terry model, if r(x, y1) = r(x, y2), what is P(y1 ≻ y2)?

Chapter 7: The Full RLHF Loop

Now we have all the pieces. Let's assemble the complete RLHF pipeline, step by step, with concrete data shapes at every stage.

Stage 1: Supervised Finetuning

Input: Curated (prompt, response) pairs, ~10K-100K examples.
Training: Minimize L_SFT = −mean_t log p(x_{t+1} | x_{1:t}).
Output: π_SFT, an instruction-following model. This becomes both π_ref and the RM initialization.

Stage 2: Reward Model Training

Data collection: Sample pairs of responses from π_SFT. Humans label winner/loser. ~100K-500K comparisons.
Training: Initialize r_φ from π_SFT + linear head. Minimize L_RM = −log σ(r_w − r_l).
Output: r_φ, a frozen reward model that scores any (prompt, response) pair.

Stage 3: PPO Training

This is where the magic happens. Four models are active simultaneously:

| Model | Role | Trainable? |
| --- | --- | --- |
| π_θ (Policy) | Generates responses. This is the model we're training. | Yes |
| π_ref (Reference) | Frozen copy of π_SFT. Used for KL penalty. | No (frozen) |
| r_φ (Reward) | Scores generated responses. | No (frozen) |
| V_ψ (Value) | Estimates expected reward. Used as PPO baseline. | Yes |

One PPO Iteration

1. Generate. Sample prompts x from the dataset. Generate responses y ~ π_θ(· | x).
2. Score. Compute reward r_φ(x, y) and KL(π_θ || π_ref). Final reward = r − β·KL.
3. Compute advantages. A_t = r_adjusted − V_ψ(s_t). How much better was this response than expected?
4. PPO update. Multiple gradient steps on L^CLIP for π_θ and L^VF for V_ψ.
↻ repeat for K iterations
Why four models? πθ is the student being trained. πref is the anchor preventing collapse. rφ is the judge. Vψ is the estimator that reduces variance in the policy gradient. In practice, all four are large LLMs, which is why RLHF is so GPU-hungry.
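
Step 2 above folds the KL penalty into the reward. One common way to do this in implementations, which is an assumption here since the lecture does not specify it, is to charge the penalty per token using the log-probability difference between policy and reference on the sampled tokens, and to add the reward-model score at the final token. The function name and all numbers below are hypothetical.

```python
def shaped_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus rm_score on the last token.

    logp_policy, logp_ref: log-probabilities of the sampled tokens under each model.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Hypothetical log-probs for a 3-token response.
logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.5, -1.0]

rewards = shaped_rewards(logp_policy, logp_ref, rm_score=1.5, beta=0.1)
total = sum(rewards)   # equals rm_score - beta * (sum of log-ratio terms)
```

Summed over the response, this recovers "r − β·KL" with the KL replaced by a single-sample estimate, which can be negative for an individual response even though the true KL never is.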

Advantage Estimation for LLMs

In the LLM setting, a common simplification is to set the discount factor γ = 1 (no discounting). The reward is given once at the end of the full response, so with γ = 1 every token's return equals that final reward. This means the advantage for the entire sequence simplifies to:

$$A = r_\phi(x, y) - V_\psi(x)$$

That is: "How much better was the actual reward than what the value model predicted?" If the response was surprisingly good (A > 0), increase its probability. If surprisingly bad (A < 0), decrease it.
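
To see how a single end-of-response reward turns into per-token credit, the sketch below computes the return-to-go at each token for a chosen γ and subtracts a value estimate. The reward of 2.0, the constant value estimate of 1.5, and the function names are assumptions for illustration.

```python
def returns_to_go(rewards, gamma):
    """Discounted sum of future rewards from each position onward."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def advantages(rewards, values, gamma):
    """Return-to-go minus the value estimate at each token."""
    return [g - v for g, v in zip(returns_to_go(rewards, gamma), values)]

# A 4-token response scored only at the end.
token_rewards = [0.0, 0.0, 0.0, 2.0]
values = [1.5, 1.5, 1.5, 1.5]   # hypothetical value-model predictions

undiscounted = returns_to_go(token_rewards, gamma=1.0)   # every token sees 2.0
myopic = returns_to_go(token_rewards, gamma=0.0)         # only the last token sees 2.0
adv = advantages(token_rewards, values, gamma=1.0)       # 0.5 at every token
```

Without discounting, every token in the response shares the same return, so the per-token advantages collapse to the single sequence-level quantity r − V.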

[Interactive figure: RLHF Pipeline Visualizer. Steps through one RLHF iteration, showing data flowing through all four models along with the reward and KL penalty at each step.]

During PPO training of an LLM, which models are trainable and which are frozen?

Chapter 8: Connections

Cheat Sheet

| Concept | Formula | In English |
| --- | --- | --- |
| SFT Loss | L = −mean_t log p(x_{t+1} ∣ x_{1:t}) | Predict next token, minimize surprise |
| Policy Gradient | ∇J = E[∑ ∇log π(a ∣ s) · A] | Increase P(action) if advantage is positive |
| PPO Clip | min(r·A, clip(r, 1−ε, 1+ε)·A) | Policy gradient with safety rails |
| RLHF Obj | E[r(x,y) − β KL(π ‖ π_ref)] | Maximize reward, stay close to SFT model |
| Bradley-Terry | P(y_w ≻ y_l) = σ(r_w − r_l) | Preference = sigmoid of reward difference |
| RM Loss | −log σ(r_w − r_l) | Train reward model to rank winners above losers |

Method Comparison

| Method | Training Signal | Pros | Cons |
| --- | --- | --- | --- |
| SFT | Demonstration data | Simple, stable | Can't generalize beyond demos |
| RLHF | Human preferences | Generalizes quality notion | Complex, 4 models, reward hacking risk |
| DPO | Preference pairs directly | No reward model needed | Less flexible, no iterative improvement |
| RLAIF | AI preferences | Scalable, cheap | Limited by judge model quality |

Beyond RLHF: Direct Preference Optimization (DPO) bypasses the reward model entirely by deriving a closed-form loss that directly optimizes the policy from preference data. Constitutional AI (RLAIF) replaces human labelers with an AI judge. Both are active research areas building on the foundations covered here.

What to explore next

The big picture: Pre-training gives the model knowledge. SFT gives it format. RLHF gives it taste. Each stage builds on the previous, and together they transform a raw next-token predictor into a system that can follow instructions and align with human values.
What is the main advantage of DPO over standard RLHF?