Finetuning LLMs with RL — From SFT Loss to RLHF

Chapter 0: The LLM Development Pipeline

You've pre-trained a language model on terabytes of internet text. It can finish sentences, write code, and even translate languages. But ask it "What's the capital of France?" and it might respond with a Wikipedia article, a poem about Paris, or a completely unrelated continuation. It knows things, but it doesn't know how to answer things.

The solution is a three-stage pipeline. First, pre-training gives the model broad knowledge by predicting the next token on massive text corpora. Second, supervised finetuning (SFT) teaches it to follow instructions by training on curated question-answer pairs. Third, reinforcement learning from human feedback (RLHF) aligns its outputs with human preferences by rewarding good responses and penalizing bad ones.

This lecture covers stages two and three: the SFT loss that makes instruction-following work, and the RL machinery — from basic policy gradients all the way to PPO — that makes alignment possible.

The core tension: SFT teaches the model to imitate specific good answers. But we can't write down every possible good answer. RLHF instead teaches the model a general notion of quality — "humans prefer responses like this" — and lets the model discover how to produce them.

Interactive figure: The Three-Stage Pipeline. For each stage it shows what goes in, what comes out, and what the training signal is.

Why can't we just use SFT alone to align a language model?

SFT is too expensive to run
SFT makes the model forget pre-trained knowledge
We can't write down every possible good answer; we need a general quality signal

Chapter 1: Supervised Finetuning Loss

SFT works exactly like pre-training, but on curated data. Given a sequence of tokens x₁, x₂, ..., x_T, the model predicts each next token given all previous tokens. The loss measures how surprised the model is by the correct answer.

Let's derive the loss from scratch. At each position t, the model outputs a probability distribution over the entire vocabulary. We want the model to assign high probability to the actual next token x_t+1. The natural measure of "how wrong was the prediction" is cross-entropy.

Step 1: The per-token loss

At position t, the model sees tokens x₁, x₂, ..., x_t and outputs logits z ∈ ℝ^V where V is vocabulary size (e.g. 32,000 for LLaMA). We convert logits to probabilities with softmax:

p(x_t+1 | x_1:t) = exp(z_{x_t+1}) / ∑_v=1^V exp(z_v)

The per-token loss is the negative log of this probability:

ℓ_t = −log p(x_t+1 | x_1:t)

When the model assigns probability 1.0 to the correct token, ℓ_t = 0. When it assigns probability 0.01, ℓ_t = −log(0.01) ≈ 4.6. The loss is always non-negative and is zero only when the model is perfectly confident and correct.
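To make the per-token loss concrete, here is a minimal sketch in plain Python. The four-entry vocabulary and the logit values are made up for illustration; a real model would produce a logit vector of length V.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (subtracting the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_loss(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical logits over a 4-token vocabulary; index 2 is the correct token.
toy_logits = [1.0, 0.5, 3.0, -1.0]
toy_probs = softmax(toy_logits)
toy_loss = per_token_loss(toy_logits, target_index=2)
print(f"p(correct) = {toy_probs[2]:.3f}, loss = {toy_loss:.3f}")

# The number quoted in the text: probability 0.01 gives a loss of about 4.6.
print(f"{-math.log(0.01):.1f}")
```

The loss depends only on the probability of the correct token, so two very different wrong guesses cost the same if they leave the correct token with the same probability.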

Step 2: The full SFT loss

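We average the per-token loss over every position that has a target. A sequence of T tokens yields T−1 next-token predictions, because the last token has nothing after it to predict:

L_SFT = −(1/(T−1)) ∑_{t=1}^{T−1} log p(x_t+1 | x₁, x₂, ..., x_t)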

Why cross-entropy? It's the negative log-likelihood of the data under the model. Minimizing it is equivalent to maximum likelihood estimation: finding the parameters that make the training data most probable. It also equals the KL divergence from the true data distribution to the model plus a constant (the entropy of the data distribution, which does not depend on the model's parameters).
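The link between cross-entropy and KL divergence can be checked numerically. The sketch below uses two made-up three-outcome distributions (standing in for the data distribution and the model) and verifies that cross-entropy equals KL plus the entropy of the data distribution.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_v p(v) log q(v)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_v p(v) log p(v)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_v p(v) log(p(v) / q(v))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three outcomes.
p_data = [0.7, 0.2, 0.1]
q_model = [0.5, 0.3, 0.2]

lhs = cross_entropy(p_data, q_model)
rhs = kl_divergence(p_data, q_model) + entropy(p_data)
print(f"cross-entropy = {lhs:.6f}, KL + entropy = {rhs:.6f}")
```

Because the entropy term is fixed by the data, lowering the cross-entropy can only happen by lowering the KL term.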

What actually flows through the network

Tensor	Shape	What it is
Input tokens	[B, T]	Batch of B sequences, each T tokens long
Logits	[B, T, V]	Raw scores for every vocab word at every position
Target tokens	[B, T−1]	Input tokens 2..T (the input shifted left by one position)
Per-token loss	[B, T−1]	−log p(correct token) at each position that has a target
L_SFT	scalar	Mean over all B×(T−1) predicted positions

python
import torch
import torch.nn.functional as F

# input_ids: [B, T] token ids; `model` is assumed to return logits of shape [B, T, V]
logits = model(input_ids)            # [B, T, V]
# Position t predicts token t+1, so drop the last logit and the first label.
shift_logits = logits[:, :-1, :]     # [B, T-1, V]
shift_labels = input_ids[:, 1:]      # [B, T-1]
V = shift_logits.size(-1)            # vocabulary size

loss = F.cross_entropy(
    shift_logits.reshape(-1, V),     # [B*(T-1), V]
    shift_labels.reshape(-1),        # [B*(T-1)]
)  # scalar: mean cross-entropy over all B*(T-1) predicted tokens

Interactive figure: SFT Loss, Token by Token. As the model predicts each token, a bar shows the probability it assigns to the correct token: green when it is confident and correct, red when it is surprised.

If the model assigns probability 0.5 to the correct next token, what is the per-token loss?

0.5
−log(0.5) ≈ 0.693
1.0

Chapter 2: RL Fundamentals

Why do we need reinforcement learning at all? Because SFT requires explicit demonstrations of correct behavior. With RL, we only need to say whether an output is good or bad (a scalar reward), and the model figures out how to produce good outputs on its own.

RL rests on a handful of core concepts: states, actions, rewards, the policy, the return, and value functions. Let's build them one by one.

States, Actions, and Rewards

An agent lives in an environment. At each timestep t, the agent observes a state s_t, takes an action a_t, receives a reward r_t, and transitions to a new state s_t+1. This cycle repeats until the episode ends.

State s_t: what the agent sees right now
↓ agent picks
Action a_t: what the agent does
↓ environment responds
Reward r_t + new state s_t+1: feedback and the updated world
↻ repeat until the episode ends

The Policy π

The policy π(a | s) is the agent's strategy: a probability distribution over actions given the current state. "When I see state s, what should I do?" A good policy assigns high probability to actions that lead to high reward.

Return: The Total Payoff

We don't just want high reward now — we want high reward over the entire episode. The return G_t is the sum of future rewards, with a discount factor γ ∈ [0, 1] that makes future rewards worth slightly less than immediate ones:

G_t = r_t + γ r_t+1 + γ² r_t+2 + ... = ∑_k=0^∞ γ^k r_t+k

When γ = 0, the agent is completely myopic — only the immediate reward matters. When γ = 1, all future rewards are equally important. Typical values: γ = 0.99.
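The return can be computed for every timestep in a single backward pass using the recursion G_t = r_t + γ G_t+1. The sketch below uses a made-up 5-step reward sequence and compares a few values of γ.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] using the recursion G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical 5-step episode: a small reward mid-episode, a large one at the end.
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]

for gamma in (0.0, 0.9, 1.0):
    rounded = [round(g, 4) for g in discounted_returns(rewards, gamma)]
    print(f"gamma = {gamma}: {rounded}")
```

With γ = 0 each return is just the immediate reward, while with γ = 1 the first return is the plain sum of all rewards in the episode.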

Value Functions

The value function V^π(s) is the expected return from state s when following policy π:

V^π(s) = E_π[G_t | s_t = s] = E_π[∑_k=0^∞ γ^k r_t+k | s_t = s]

The action-value function Q^π(s, a) is the expected return from state s after taking action a, then following π:

Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]

Think of it this way: V(s) says "how good is this state?" Q(s,a) says "how good is this state-action pair?" The difference — Q(s,a) − V(s) — is the advantage: "how much better was this action than what I'd normally do?"
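A small worked example may help here. For a single state, V is the policy-weighted average of Q, so the advantages average to zero under the policy. The action names and numbers below are invented for illustration.

```python
# Hypothetical policy and action-values for one state.
policy = {"left": 0.2, "stay": 0.3, "right": 0.5}      # π(a | s)
q_values = {"left": 1.0, "stay": 2.0, "right": 4.0}    # Q(s, a)

# V(s) is the expectation of Q(s, a) over actions drawn from the policy.
state_value = sum(policy[a] * q_values[a] for a in policy)

# Advantage: how much better each action is than the policy's average.
advantages = {a: q_values[a] - state_value for a in policy}

# Weighted by the policy, the advantages cancel out.
expected_advantage = sum(policy[a] * advantages[a] for a in policy)

print(f"V(s) = {state_value:.2f}")
for action, adv in advantages.items():
    print(f"A(s, {action}) = {adv:+.2f}")
print(f"E[A] = {expected_advantage:.2e}")
```

Actions with positive advantage are the ones a policy-gradient update will push up, and those with negative advantage are pushed down.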

Interactive figure: Discounted Return Calculator. A 5-step episode with rewards and an adjustable discount γ (shown at 0.90). Low γ is myopic; high γ is far-sighted.

What does the advantage A(s,a) = Q(s,a) − V(s) tell us?

How much better this specific action is compared to the average action from this state
The total reward of the episode
The probability of reaching a terminal state

Chapter 3: The Policy Gradient — REINFORCE

We have a policy π_θ(a | s) parameterized by θ (neural network weights). Our goal is to find the θ that maximizes the expected return:

J(θ) = E_{τ ~ π_θ}[R(τ)]

where τ = (s₀, a₀, r₀, s₁, a₁, r₁, ...) is a trajectory and R(τ) = ∑_t r_t is the total return. We want ∇_θ J(θ) so we can do gradient ascent. But the expectation is over trajectories sampled from the policy, so how do we differentiate through sampling?

Derivation: The Log-Probability Trick

This is the key insight. Let's derive it step by step.

Step 1. Write the expected return as a sum over trajectories (an integral if the trajectory space is continuous):

J(θ) = ∑_τ P(τ; θ) R(τ)

Step 2. Take the gradient with respect to θ. The reward R(τ) doesn't depend on θ, only the trajectory probability does:

∇_θ J = ∑_τ ∇_θ P(τ; θ) · R(τ)

Step 3. Here's the trick. We multiply and divide by P(τ; θ):

∇_θ J = ∑_τ P(τ; θ) · (∇_θ P(τ; θ) / P(τ; θ)) · R(τ)

Step 4. Recognize that ∇f / f = ∇ log f (the log-derivative trick):

∇_θ J = ∑_τ P(τ; θ) · ∇_θ log P(τ; θ) · R(τ)

Step 5. That sum weighted by P(τ; θ) is just an expectation! So:

∇_θ J = E_{τ ~ π_θ}[ ∇_θ log P(τ; θ) · R(τ) ]

Step 6. The trajectory probability P(τ; θ) factors as p(s₀) ∏_t π_θ(a_t|s_t) p(s_t+1|s_t,a_t). Taking the log turns the product into a sum. The initial-state distribution p(s₀) and the environment dynamics p(s_t+1|s_t,a_t) don't depend on θ, so their gradients vanish:

∇_θ log P(τ; θ) = ∑_t ∇_θ log π_θ(a_t | s_t)

The REINFORCE gradient:

∇_θ J = E_τ[ ∑_t ∇_θ log π_θ(a_t | s_t) · R(τ) ]

What this says in English: If a trajectory got high reward, increase the probability of every action we took on that trajectory. If it got low reward, decrease them. The gradient of the log-probability tells us which direction in parameter space makes each action more likely. The return R(τ) tells us how strongly to push.
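The update rule can be tried on the smallest possible problem. The sketch below is an illustrative toy, not a full implementation: episodes are one step long, the policy is a softmax over three logits (left, stay, right), and only "right" is rewarded. For a softmax policy the gradient of log π(a) with respect to logit k is 1[k = a] − π(k), which is what the inner loop uses.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a one-step task with 3 actions; only action 2 pays reward 1."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]                      # policy parameters (logits)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(3), weights=probs)[0]   # sample a ~ π_θ
        reward = 1.0 if action == 2 else 0.0               # R(τ)
        for k in range(3):
            # d/d theta_k of log π_θ(action) for a softmax policy
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_prob        # gradient ascent
    return softmax(theta)

final_probs = reinforce_bandit()
print([round(p, 3) for p in final_probs])
```

Because unrewarded actions produce a zero update here, the policy only moves when it happens to sample the rewarded action, and each such sample makes that action more likely.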

The variance problem

REINFORCE is unbiased but has very high variance. Why? Because R(τ) is the total return of the entire trajectory. If a trajectory got high reward, we reinforce every action — even the bad ones that happened to be on a lucky trajectory. Two standard fixes:

1. Baseline subtraction. Replace R(τ) with R(τ) − b, where b is a baseline that does not depend on the action taken (often V(s_t)). This doesn't change the expected gradient, because E_a[∇_θ log π_θ(a | s)] = 0, but it can reduce variance dramatically.

2. Use the advantage. Replace R(τ) with A_t = Q(s_t, a_t) − V(s_t). This says "how much better was this action than average?" rather than "was the whole episode good?"

∇_θ J ≈ E_τ[ ∑_t ∇_θ log π_θ(a_t | s_t) · A_t ]
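The effect of a baseline can be verified exactly on a tiny example by enumerating every action instead of sampling. The sketch below assumes a one-step problem with a uniform softmax policy over three actions and made-up rewards; it computes the mean and the second moment of the gradient estimator with and without a baseline.

```python
def grad_log_softmax(probs, action):
    """Gradient of log π(action) with respect to the softmax logits."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def gradient_stats(probs, rewards, baseline):
    """Exact mean and second moment of the one-step estimator ∇log π(a) · (R − b)."""
    n = len(probs)
    mean = [0.0] * n
    second_moment = 0.0
    for a, p_a in enumerate(probs):
        g = [gl * (rewards[a] - baseline) for gl in grad_log_softmax(probs, a)]
        for k in range(n):
            mean[k] += p_a * g[k]
        second_moment += p_a * sum(x * x for x in g)
    return mean, second_moment

probs = [1 / 3, 1 / 3, 1 / 3]          # hypothetical uniform policy
rewards = [10.0, 11.0, 12.0]           # hypothetical rewards, all large and similar
value = sum(p * r for p, r in zip(probs, rewards))   # expected reward, used as baseline

mean_plain, m2_plain = gradient_stats(probs, rewards, baseline=0.0)
mean_base, m2_base = gradient_stats(probs, rewards, baseline=value)

print("mean without baseline:", [round(x, 4) for x in mean_plain])
print("mean with baseline:   ", [round(x, 4) for x in mean_base])
print("second moment:", round(m2_plain, 3), "->", round(m2_base, 3))
```

The two means agree, so the estimator stays unbiased, while the second moment (and therefore the variance) drops sharply once the shared offset in the rewards is subtracted.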

Interactive figure: REINFORCE in Action. A 1D world with three actions (left, stay, right) and the goal at the right edge. The policy, drawn as a bar chart, shifts after each episode: high-reward episodes increase the probability of the actions taken.

Why does the log-probability trick work? What mathematical identity does it rely on?

The chain rule for derivatives
∇f / f = ∇ log f, which lets us rewrite the gradient as an expectation we can sample
Integration by parts

Chapter 4: PPO — Proximal Policy Optimization

REINFORCE has a fatal flaw: the gradient can be huge. One lucky trajectory can massively change the policy, destroying what was learned before. We need controlled updates — change the policy, but not too much at once.

The probability ratio

Define the ratio between the new and old policy:

r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)

When r_t = 1, the new policy is identical to the old one. When r_t > 1, the action became more likely. When r_t < 1, it became less likely.

The surrogate objective

Instead of REINFORCE, we optimize a surrogate objective that uses the ratio:

L^CPI(θ) = E_t[ r_t(θ) · A_t ]

This is called the Conservative Policy Iteration (CPI) objective. At θ = θ_old its gradient equals the policy gradient, and because it only needs actions sampled from the old policy, it lets us take multiple gradient steps on the same batch of data. The problem: if A_t > 0 (good action) and we keep increasing r_t, the objective keeps growing, but the policy might change so much that it collapses.

PPO's clipping trick

PPO clips the ratio to prevent it from straying too far from 1. The clipped surrogate is:

L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]

where ε is a small number, typically 0.2. Let's unpack what happens in each case:

Advantage	Ratio region	Which term wins the min?	Effect
A_t > 0 (good action)	r < 1+ε	r · A (unclipped)	Normal gradient: increase probability
A_t > 0 (good action)	r ≥ 1+ε	(1+ε) · A (clipped)	Gradient is zero — stop increasing!
A_t < 0 (bad action)	r > 1−ε	r · A (unclipped)	Normal gradient: decrease probability
A_t < 0 (bad action)	r ≤ 1−ε	(1−ε) · A (clipped)	Gradient is zero — stop decreasing!

The key idea: PPO says "go ahead and improve the policy, but once the probability ratio has moved more than ε away from 1 in the direction the advantage favors, stop." Beyond that point the sample contributes no gradient, which prevents catastrophic updates where one batch of data destroys everything the model has learned. It's like putting bumper guards on the gradient.
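The four cases in the table can be reproduced with a few lines of plain Python. This is a sketch for a single (ratio, advantage) pair; the finite-difference helper `surrogate_slope` is an illustrative addition used only to show where the gradient with respect to the ratio vanishes.

```python
import math

def probability_ratio(logp_new, logp_old):
    """r_t = π_θ(a|s) / π_θ_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def surrogate_slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference derivative of the objective with respect to r."""
    upper = clipped_surrogate(ratio + h, advantage, eps)
    lower = clipped_surrogate(ratio - h, advantage, eps)
    return (upper - lower) / (2 * h)

for advantage in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        value = clipped_surrogate(ratio, advantage)
        slope = surrogate_slope(ratio, advantage)
        print(f"A = {advantage:+.0f}, r = {ratio}: objective = {value:+.2f}, slope = {slope:+.2f}")
```

Note the asymmetry: when the ratio has moved in the wrong direction (for example r = 0.5 with a positive advantage), the unclipped term wins the min and the gradient is still active, so the policy can recover.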

The full PPO loss

PPO also adds a value function loss (to train the baseline) and an entropy bonus (to encourage exploration):

L^PPO = L^CLIP − c₁ L^VF + c₂ H[π_θ]

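where L^VF = (V_θ(s) − V_target)², H is the entropy of the policy, and c₁ and c₂ are positive coefficients. The combined objective is maximized, which is why the value error is subtracted and the entropy is added.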
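As a rough illustration of how the three terms combine for a single sample, consider the sketch below. The coefficient values c1 = 0.5 and c2 = 0.01 are assumptions chosen for the example, not values given in this lecture, and `clip_term` stands in for an already-computed L^CLIP value.

```python
import math

def entropy(probs):
    """Entropy H[π] of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ppo_objective(clip_term, value_pred, value_target, action_probs, c1=0.5, c2=0.01):
    """L^CLIP - c1 * L^VF + c2 * H for one sample (to be maximized)."""
    value_loss = (value_pred - value_target) ** 2
    return clip_term - c1 * value_loss + c2 * entropy(action_probs)

uniform = [0.25, 0.25, 0.25, 0.25]     # high-entropy policy
peaked = [0.97, 0.01, 0.01, 0.01]      # low-entropy policy

good_value = ppo_objective(1.0, value_pred=2.0, value_target=2.0, action_probs=uniform)
bad_value = ppo_objective(1.0, value_pred=0.0, value_target=2.0, action_probs=uniform)
low_entropy = ppo_objective(1.0, value_pred=2.0, value_target=2.0, action_probs=peaked)

print(round(good_value, 4), round(bad_value, 4), round(low_entropy, 4))
```

With the same clipped term, a worse value prediction or a more peaked policy both lower the objective, which is the pressure the two extra terms are meant to apply.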

Interactive figure: PPO Clipped Objective. The teal curve is the unclipped objective r·A, the orange region shows where clipping is active, and the green curve is the final PPO objective (the min of the clipped and unclipped terms). Controls: the clip range ε (shown at 0.20) and the sign of the advantage.

When the advantage is positive (good action) and r_t exceeds 1+ε, what does PPO do?

Doubles the gradient to learn faster
Reverses the gradient direction
Clips the objective so the gradient becomes zero, stopping further increase

Chapter 5: RL for Language Models

How do you apply RL to text generation? The mapping is surprisingly direct:

RL Concept	LLM Equivalent
Agent	The language model π_θ
State s_t	Prompt + tokens generated so far: (x₁, ..., x_t)
Action a_t	The next token x_t+1 sampled from π_θ(· | x_1:t)
Policy π(a|s)	The model's next-token distribution: softmax of logits
Episode	Generating one complete response to a prompt
Reward r	Score from a reward model (given once at the end of generation)
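The mapping in the table can be sketched as a rollout loop. Everything named here (`VOCAB`, `toy_policy`, `toy_reward`) is a hypothetical stand-in: a real setup would use the language model's next-token distribution and a learned reward model.

```python
import random

VOCAB = ["Paris", "is", "the", "capital", "<eos>"]   # hypothetical toy vocabulary

def toy_policy(state):
    """Stand-in for π_θ(· | state): a uniform next-token distribution."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def toy_reward(prompt, response):
    """Stand-in for a reward model: one scalar for the whole response."""
    return 1.0 if "Paris" in response else 0.0

def rollout(prompt, max_new_tokens=8, seed=0):
    """Generate one episode and record (state, action, reward) at each step."""
    rng = random.Random(seed)
    state = list(prompt)                      # state = prompt + tokens so far
    steps = []
    for _ in range(max_new_tokens):
        probs = toy_policy(state)
        action = rng.choices(VOCAB, weights=probs)[0]   # action = next token
        steps.append({"state": tuple(state), "action": action, "reward": 0.0})
        state.append(action)
        if action == "<eos>":
            break
    response = state[len(prompt):]
    steps[-1]["reward"] = toy_reward(prompt, response)  # reward only at the end
    return steps

prompt = ["What", "is", "the", "capital", "of", "France", "?"]
episode = rollout(prompt)
for step in episode:
    print(len(step["state"]), step["action"], step["reward"])
```

Each state is one token longer than the previous one, and every intermediate reward is zero, which is the sparse-reward structure the rest of the chapter has to deal with.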

The RLHF Objective

We want to maximize reward, but with a crucial constraint: the finetuned model shouldn't drift too far from the SFT model. Why? Because the reward model is imperfect — if we optimize it too aggressively, the LLM finds degenerate outputs that "hack" the reward (high reward, low quality). The solution is a KL penalty:

J(θ) = E_{x~D, y~π_θ}[ r_φ(x, y) − β KL(π_θ(·|x) || π_ref(·|x)) ]

Where:

r_φ(x, y) is the reward model's score for prompt x, response y
π_ref is the frozen SFT model (our "anchor")
β controls the strength of the penalty (β ≈ 0.01–0.2)
KL divergence measures how different π_θ is from π_ref

The KL Divergence Penalty

KL divergence measures the difference between two probability distributions. For two distributions p and q over vocabulary V:

KL(p || q) = ∑_v p(v) log(p(v) / q(v))

KL is always ≥ 0, and equals 0 only when p = q. In the RLHF context, it penalizes the model for deviating from the reference — the more different the token distribution, the larger the penalty. This prevents reward hacking: the model can't just memorize one high-scoring response and output it for every prompt.
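These properties are easy to check numerically. The sketch below uses two made-up next-token distributions over a four-token vocabulary: a reference distribution and a nearly deterministic "collapsed" one, plus mixtures in between.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_v p(v) log(p(v) / q(v)); terms with p(v) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(p, q, weight):
    """Interpolate between two distributions: (1 - weight) * p + weight * q."""
    return [(1 - weight) * pi + weight * qi for pi, qi in zip(p, q)]

reference = [0.5, 0.3, 0.15, 0.05]      # hypothetical π_ref(· | x_1:t)
collapsed = [0.97, 0.01, 0.01, 0.01]    # hypothetical near-deterministic policy

kl_by_weight = {}
for weight in (0.0, 0.5, 1.0):
    policy = mix(reference, collapsed, weight)
    kl_by_weight[weight] = kl_divergence(policy, reference)
    print(f"drift = {weight}: KL(policy || reference) = {kl_by_weight[weight]:.4f}")

# KL is not symmetric: swapping the arguments gives a different number.
print(round(kl_divergence(reference, collapsed), 4))
```

The penalty is zero when the policy matches the reference and grows as the policy drifts toward the collapsed distribution, which is the behavior the β-weighted term relies on.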

Why not just maximize reward? Without the KL penalty, RLHF quickly collapses. The model finds degenerate outputs that game the reward model, such as repeating "Great answer!", which can score highly yet is useless. The KL anchor keeps the model close to the behavior it learned in pre-training and SFT.

Interactive figure: Reward vs. KL Tradeoff. The KL weight β (shown at 0.10) controls the training objective: high β stays close to the reference model, low β chases reward aggressively.

What happens if you set β = 0 (no KL penalty) during RLHF?

The model reward-hacks: finds degenerate outputs that exploit the imperfect reward model
Training becomes more stable
The model learns nothing

Chapter 6: Reward Modeling

RLHF needs a reward signal, but we can't have a human rate every single model output during training. Instead, we train a reward model to predict human preferences. Then we use this learned reward model as the scoring function for RL.

Collecting Preference Data

The process is straightforward:

1. Sample pairs. For each prompt x, generate two responses y₁ and y₂ from the SFT model.

2. Human labels. A human annotator says "I prefer y_w over y_l" (w = winner, l = loser).

3. Build dataset. D = {(x_i, y_i^w, y_i^l)} for i = 1 to N.
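One possible way to represent the preference dataset in code is shown below. The class name `PreferencePair`, the field names, and the example strings are all illustrative choices, not part of any particular library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One element (x, y_w, y_l) of the preference dataset D."""
    prompt: str     # x
    chosen: str     # y_w, the response the annotator preferred
    rejected: str   # y_l, the response the annotator rejected

def build_pair(prompt, response_a, response_b, annotator_prefers_a):
    """Turn two sampled responses and one human label into a PreferencePair."""
    if annotator_prefers_a:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)

# Hypothetical examples; in practice the responses are sampled from the SFT model.
dataset = [
    build_pair("What's the capital of France?",
               "Paris.", "France is a country in Europe.", annotator_prefers_a=True),
    build_pair("Translate 'hello' to Spanish.",
               "Bonjour.", "Hola.", annotator_prefers_a=False),
]

for pair in dataset:
    print(f"{pair.prompt!r}: {pair.chosen!r} > {pair.rejected!r}")
```

Storing winner and loser in fixed fields means the training code never needs to see the raw label again.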

The Bradley-Terry Model

We need a mathematical model of human preferences. The Bradley-Terry model (1952) assumes there exists a latent reward r*(x, y) such that the probability of preferring y₁ over y₂ is:

P(y₁ ≻ y₂ | x) = exp(r*(x, y₁)) / (exp(r*(x, y₁)) + exp(r*(x, y₂)))

This is just a sigmoid! Let Δ = r*(x, y₁) − r*(x, y₂). Then:

P(y₁ ≻ y₂) = σ(Δ) = 1 / (1 + exp(−Δ))

When Δ > 0 (y₁ has higher reward), the probability of preferring y₁ is above 0.5. When Δ = 0, it's a coin flip. When Δ < 0, y₂ is preferred.
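The equivalence between the two-way softmax form and the sigmoid of the reward difference can be checked directly. The reward values below are arbitrary.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bradley_terry(r1, r2):
    """P(y1 preferred over y2) = exp(r1) / (exp(r1) + exp(r2))."""
    m = max(r1, r2)                       # subtract the max for stability
    e1, e2 = math.exp(r1 - m), math.exp(r2 - m)
    return e1 / (e1 + e2)

# Arbitrary example rewards for two responses to the same prompt.
r_first, r_second = 1.5, -0.5
p_softmax_form = bradley_terry(r_first, r_second)
p_sigmoid_form = sigmoid(r_first - r_second)
print(f"{p_softmax_form:.6f} {p_sigmoid_form:.6f}")

# Equal rewards give a coin flip.
print(bradley_terry(0.3, 0.3))
```

Only the difference between the two rewards matters, so adding the same constant to both leaves the preference probability unchanged.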

Deriving the Reward Modeling Loss

We parameterize the reward with a neural network r_φ(x, y), typically the SFT model with a linear head that outputs a scalar. We fit it by maximum likelihood: the loss is the negative log-likelihood of the observed preferences under the Bradley-Terry model:

L_RM(φ) = −(1/N) ∑_i=1^N log σ(r_φ(x_i, y_i^w) − r_φ(x_i, y_i^l))

Let's unpack this:

r_φ(x_i, y_i^w) is the reward model's score for the preferred response
r_φ(x_i, y_i^l) is the score for the rejected response
σ(·) is the sigmoid function
We want the winner's score to be higher than the loser's, so Δ = r_w − r_l > 0
−log σ(Δ) is small when Δ is large positive (model agrees with human), large when Δ is negative (model disagrees)
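To get a feel for the numbers, the per-pair loss can be evaluated for a few reward gaps. The sketch below computes −log σ(Δ) in the equivalent form log(1 + exp(−Δ)), rearranged so that large |Δ| does not overflow; the Δ values are arbitrary.

```python
import math

def pairwise_loss(delta):
    """-log sigmoid(delta) = log(1 + exp(-delta)), written to avoid overflow."""
    return max(-delta, 0.0) + math.log1p(math.exp(-abs(delta)))

def dataset_loss(deltas):
    """L_RM: mean per-pair loss over reward gaps delta_i = r_w - r_l."""
    return sum(pairwise_loss(d) for d in deltas) / len(deltas)

for delta in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"delta = {delta:+.1f}: loss = {pairwise_loss(delta):.4f}")

# Arbitrary reward gaps for a batch of three preference pairs.
print(round(dataset_loss([2.0, 0.5, -1.0]), 4))
```

At Δ = 0 the loss is log 2, and for strongly negative Δ it grows roughly linearly, so pairs the reward model ranks the wrong way round dominate the gradient.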

Architecture detail: The reward model is usually initialized from the SFT model itself. We remove the language model head (which outputs logits over vocabulary) and replace it with a linear layer that outputs a single scalar. The model sees the prompt + response and outputs one number: "how good is this response?"

python
import torch.nn.functional as F

# Reward model training step
def reward_loss(r_phi, x, y_w, y_l):
    # r_phi: reward model (LLM + linear head), assumed callable as r_phi(x, y) -> [B]
    # x: prompts [B]
    # y_w: preferred responses [B]
    # y_l: rejected responses [B]
    r_w = r_phi(x, y_w)    # [B]: scalar reward for each winner
    r_l = r_phi(x, y_l)    # [B]: scalar reward for each loser
    # -log(sigmoid(d)) via logsigmoid, which stays finite when d is very negative
    loss = -F.logsigmoid(r_w - r_l).mean()   # scalar
    return loss

Interactive figure: Bradley-Terry Model. Two adjustable reward scores (shown at r(winner) = 1.5 and r(loser) = −0.5) feed a sigmoid curve that gives the preference probability. When the reward difference is large, the model is very confident about which response is better.

In the Bradley-Terry model, if r(x, y₁) = r(x, y₂), what is P(y₁ ≻ y₂)?

0
0.5 (a coin flip: σ(0) = 0.5)
1.0

Chapter 7: The Full RLHF Loop

Now we have all the pieces. Let's assemble the complete RLHF pipeline, step by step, with concrete data shapes at every stage.

Stage 1: Supervised Finetuning

Input: curated (prompt, response) pairs, ~10K-100K examples
↓
Training: minimize L_SFT = −mean_t log p(x_t+1|x_1:t)
↓
Output: π_SFT, an instruction-following model. This becomes both π_ref and the RM initialization.

Stage 2: Reward Model Training

Data collection: sample pairs of responses from π_SFT. Humans label winner/loser. ~100K-500K comparisons.
↓
Training: initialize r_φ from π_SFT + linear head. Minimize L_RM = −log σ(r_w − r_l), averaged over comparisons.
↓
Output: r_φ, a frozen reward model that scores any (prompt, response) pair

Stage 3: PPO Training

This is where the magic happens. Four models are active simultaneously:

Model	Role	Trainable?
π_θ (Policy)	Generates responses. This is the model we're training.	Yes
π_ref (Reference)	Frozen copy of π_SFT. Used for KL penalty.	No (frozen)
r_φ (Reward)	Scores generated responses.	No (frozen)
V_ψ (Value)	Estimates expected reward. Used as PPO baseline.	Yes

One PPO Iteration

1. Generate. Sample prompts x from the dataset. Generate responses y ~ π_θ(·|x).

2. Score. Compute the reward r_φ(x,y) and KL(π_θ || π_ref). Final reward: r_adjusted = r − β·KL.

3. Compute advantages. A_t = r_adjusted − V_ψ(s_t). How much better was this response than expected?

4. PPO update. Take multiple gradient steps on L^CLIP for π_θ and L^VF for V_ψ.

↻ repeat for K iterations

Why four models? π_θ is the student being trained. π_ref is the anchor preventing collapse. r_φ is the judge. V_ψ is the estimator that reduces variance in the policy gradient. In practice, all four are large LLMs, which is why RLHF is so GPU-hungry.

Advantage Estimation for LLMs

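In the LLM setting, the reward is given once, at the end of the full response, and a common simplification is to use no discounting (γ = 1), so every token in the response shares the same return. (With γ = 0 only the immediate reward would count, and the end-of-sequence reward would never reach earlier tokens.) The advantage for the entire sequence then simplifies to: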

A = r_φ(x, y) − V_ψ(x)

That is: "How much better was the actual reward than what the value model predicted?" If the response was surprisingly good (A > 0), increase its probability. If surprisingly bad (A < 0), decrease it.
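Putting the scoring and advantage steps together for one sampled response might look like the sketch below. It uses a single-sample KL estimate (the sum over generated tokens of log π_θ − log π_ref), which is one common way to estimate the penalty from log-probabilities; the function names, β = 0.1, and all numbers are assumptions made for the example.

```python
def sequence_kl_estimate(logp_policy, logp_ref):
    """Single-sample KL estimate: sum over tokens of log π_θ(token) - log π_ref(token)."""
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))

def rlhf_advantage(reward, logp_policy, logp_ref, value_estimate, beta=0.1):
    """A = (r - beta * KL) - V: KL-adjusted reward minus the value baseline."""
    kl = sequence_kl_estimate(logp_policy, logp_ref)
    adjusted_reward = reward - beta * kl
    return adjusted_reward - value_estimate

# Hypothetical per-token log-probabilities for a three-token response.
logp_policy = [-0.2, -0.5, -0.1]    # under the policy π_θ
logp_ref = [-0.4, -0.9, -0.3]       # under the frozen reference π_ref

advantage = rlhf_advantage(reward=1.5, logp_policy=logp_policy,
                           logp_ref=logp_ref, value_estimate=1.0)
print(round(advantage, 4))

# If the policy still matches the reference, the penalty vanishes and A = r - V.
no_drift = rlhf_advantage(reward=1.5, logp_policy=logp_ref,
                          logp_ref=logp_ref, value_estimate=1.0)
print(round(no_drift, 4))
```

Here the policy has become more confident than the reference on every token, so the KL estimate is positive and the advantage is slightly smaller than the raw r − V.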

Interactive figure: RLHF Pipeline Visualizer. Steps through one RLHF iteration, showing data flowing through all four models along with the reward and KL penalty at each step.

During PPO training of an LLM, which models are trainable and which are frozen?

All four models are trainable
Policy π_θ and value V_ψ are trainable; reference π_ref and reward r_φ are frozen
Only the reward model is trainable

Chapter 8: Connections

Cheat Sheet

Concept	Formula	In English
SFT Loss	L = −mean_t log p(x_t+1|x_1:t)	Predict next token, minimize surprise
Policy Gradient	∇J = E[∑ ∇log π(a|s) · A]	Increase P(action) if advantage is positive
PPO Clip	min(r·A, clip(r,1−ε,1+ε)·A)	Policy gradient with safety rails
RLHF Obj	E[r(x,y) − β KL(π||π_ref)]	Maximize reward, stay close to SFT model
Bradley-Terry	P(y_w≻y_l) = σ(r_w−r_l)	Preference = sigmoid of reward difference
RM Loss	−log σ(r_w−r_l)	Train reward model to rank winners above losers

Method Comparison

Method	Training Signal	Pros	Cons
SFT	Demonstration data	Simple, stable	Can't generalize beyond demos
RLHF	Human preferences	Generalizes quality notion	Complex, 4 models, reward hacking risk
DPO	Preference pairs directly	No reward model needed	Less flexible, no iterative improvement
RLAIF	AI preferences	Scalable, cheap	Limited by judge model quality

Beyond RLHF: Direct Preference Optimization (DPO) bypasses the reward model entirely by deriving a closed-form loss that directly optimizes the policy from preference data. Constitutional AI (RLAIF) replaces human labelers with an AI judge. Both are active research areas building on the foundations covered here.
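For reference, the DPO loss as it is commonly stated applies the same −log σ(·) form as the reward-model loss, but to log-probability ratios between the policy and the reference model. The sketch below is not derived in this lecture; β = 0.1 and the log-probability values are assumptions for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(log π_θ(y_w) - log π_ref(y_w)) - (log π_θ(y_l) - log π_ref(y_l))])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin) = log(1 + exp(-margin)), written to avoid overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical sequence log-probabilities of the winner and loser responses.
at_reference = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-12.0, ref_logp_l=-15.0)
after_update = dpo_loss(logp_w=-10.0, logp_l=-17.0, ref_logp_w=-12.0, ref_logp_l=-15.0)

print(round(at_reference, 4), round(after_update, 4))
```

When the policy equals the reference the loss is log 2, and it falls as the policy raises the winner's log-probability relative to the loser's, with no separate reward model involved.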

What to explore next

Reward & Alignment — deeper dive into DPO, RLAIF, Constitutional AI
GPT — Generative Pre-trained Transformer — how the base model works
The Transformer — the architecture underlying everything

The big picture: Pre-training gives the model knowledge. SFT gives it format. RLHF gives it taste. Each stage builds on the previous, and together they transform a raw next-token predictor into a system that can follow instructions and align with human values.

What is the main advantage of DPO over standard RLHF?

It eliminates the need for a separate reward model, optimizing preferences directly
It uses more human data
It trains faster on GPUs