From cross-entropy to REINFORCE to PPO — the math behind making language models follow human preferences. SFT loss, policy gradients, reward modeling, and the full RLHF pipeline.
You've pre-trained a language model on terabytes of internet text. It can finish sentences, write code, and even translate languages. But ask it "What's the capital of France?" and it might respond with a Wikipedia article, a poem about Paris, or a completely unrelated continuation. It knows things, but it doesn't know how to answer things.
The solution is a three-stage pipeline. First, pre-training gives the model broad knowledge by predicting the next token on massive text corpora. Second, supervised finetuning (SFT) teaches it to follow instructions by training on curated question-answer pairs. Third, reinforcement learning from human feedback (RLHF) aligns its outputs with human preferences by rewarding good responses and penalizing bad ones.
This lecture covers stages two and three: the SFT loss that makes instruction-following work, and the RL machinery — from basic policy gradients all the way to PPO — that makes alignment possible.
SFT works exactly like pre-training, but on curated data. Given a sequence of tokens $x_1, x_2, \ldots, x_T$, the model predicts each next token given all previous tokens. The loss measures how surprised the model is by the correct answer.
Let's derive the loss from scratch. At each position $t$, the model outputs a probability distribution over the entire vocabulary. We want the model to assign high probability to the actual next token $x_{t+1}$. The natural measure of "how wrong was the prediction" is cross-entropy.
At position $t$, the model sees tokens $x_1, x_2, \ldots, x_t$ and outputs logits $z \in \mathbb{R}^V$, where $V$ is the vocabulary size (e.g. 32,000 for LLaMA). We convert logits to probabilities with softmax:
$$p_\theta(x_{t+1} = i \mid x_{1:t}) = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)}$$
The per-token loss is the negative log of the probability assigned to the correct token:
$$\ell_t = -\log p_\theta(x_{t+1} \mid x_{1:t})$$
When the model assigns probability 1.0 to the correct token, $\ell_t = 0$. When it assigns probability 0.01, $\ell_t \approx 4.6$. The loss is always non-negative and is zero only when the model is perfectly confident and correct.
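To make the softmax and the per-token loss concrete, here is a minimal sketch in plain Python. The four-token vocabulary and the logit values are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_loss(logits, target_index):
    """Negative log-probability assigned to the correct token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical logits over a toy vocabulary of 4 tokens.
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)

loss_confident = token_loss(logits, 0)  # correct token has the largest logit
loss_surprised = token_loss(logits, 2)  # correct token has the smallest logit
loss_at_one_percent = -math.log(0.01)   # the 4.6 quoted in the text

print(probs)
print(loss_confident, loss_surprised, loss_at_one_percent)
```

The loss is small when the correct token already has most of the probability mass and grows as that probability shrinks.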
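We average the per-token loss over all $T$ positions in the sequence:
$$\mathcal{L}_{\text{SFT}} = \frac{1}{T} \sum_{t=1}^{T} \ell_t = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_{t+1} \mid x_{1:t})$$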
| Tensor | Shape | What it is |
|---|---|---|
| Input tokens | [B, T] | Batch of B sequences, each T tokens long |
| Logits | [B, T, V] | Raw scores for every vocab word at every position |
| Target tokens | [B, T] | Same as input, shifted left by one position |
| Per-token loss | [B, T] | −log p(correct token) at each position |
| $\mathcal{L}_{\text{SFT}}$ | scalar | Mean over all B×T positions |
```python
import torch
import torch.nn.functional as F

# input_ids: [B, T] batch of token ids
# Assumes model(input_ids) returns raw logits of shape [B, T, V]
# from the model's final linear layer.
logits = model(input_ids)            # [B, T, V]

# Position t predicts token t+1, so drop the last logit and the first label.
shift_logits = logits[:, :-1, :]     # [B, T-1, V]
shift_labels = input_ids[:, 1:]      # [B, T-1]

V = shift_logits.size(-1)            # vocabulary size
loss = F.cross_entropy(
    shift_logits.reshape(-1, V),     # [B*(T-1), V]
    shift_labels.reshape(-1),        # [B*(T-1)]
)  # scalar: mean cross-entropy over all predicted tokens
```
Watch the model predict each token. The bar shows the probability assigned to the correct token. Green = confident and correct. Red = surprised.
Why do we need reinforcement learning at all? Because SFT requires explicit demonstrations of correct behavior. With RL, we only need to say whether an output is good or bad (a scalar reward), and the model figures out how to produce good outputs on its own.
RL has five core concepts. Let's build them one by one.
An agent lives in an environment. At each timestep $t$, the agent observes a state $s_t$, takes an action $a_t$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$. This cycle repeats until the episode ends.
The policy π(a | s) is the agent's strategy: a probability distribution over actions given the current state. "When I see state s, what should I do?" A good policy assigns high probability to actions that lead to high reward.
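We don't just want high reward now; we want high reward over the entire episode. The return $G_t$ is the sum of future rewards, with a discount factor $\gamma \in [0, 1]$ that makes future rewards worth slightly less than immediate ones:
$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}$$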
When γ = 0, the agent is completely myopic — only the immediate reward matters. When γ = 1, all future rewards are equally important. Typical values: γ = 0.99.
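A short sketch of how the discount factor changes the return. The reward sequence is a hypothetical 5-step episode with a small immediate reward and a large final one.

```python
def discounted_return(rewards, gamma):
    """Return from the first step: sum over k of gamma**k * rewards[k]."""
    g = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 0.0, 10.0]  # hypothetical episode

g_myopic = discounted_return(rewards, 0.0)        # only the immediate reward counts
g_typical = discounted_return(rewards, 0.99)      # late reward slightly discounted
g_undiscounted = discounted_return(rewards, 1.0)  # plain sum of rewards

print(g_myopic, g_typical, g_undiscounted)
```

With a discount of 0 the large final reward is invisible to the agent, while with 0.99 it still dominates the return.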
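The value function $V^\pi(s)$ is the expected return from state $s$ when following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right]$$
The action-value function $Q^\pi(s, a)$ is the expected return from state $s$ after taking action $a$, then following $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s,\ a_t = a \right]$$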
A 5-step episode with rewards. Adjust γ and watch how the return changes. Low γ = myopic. High γ = far-sighted.
We have a policy $\pi_\theta(a \mid s)$ parameterized by $\theta$ (neural network weights). Our goal is to find the $\theta$ that maximizes the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$
where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory and $R(\tau) = \sum_t r_t$ is the total return. We want $\nabla_\theta J(\theta)$ so we can do gradient ascent. But the expectation is over trajectories sampled from the policy, so how do we differentiate through sampling?
This is the key insight. Let's derive it step by step.
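Step 1. Write the expected return as a sum over trajectories (an integral when states or actions are continuous):
$$J(\theta) = \sum_\tau P(\tau; \theta)\, R(\tau)$$
Step 2. Take the gradient with respect to $\theta$. The reward $R(\tau)$ doesn't depend on $\theta$, only the trajectory probability does:
$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau; \theta)\, R(\tau)$$
Step 3. Here's the trick. We multiply and divide by $P(\tau; \theta)$:
$$\nabla_\theta J(\theta) = \sum_\tau P(\tau; \theta)\, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}\, R(\tau)$$
Step 4. Recognize that $\nabla f / f = \nabla \log f$ (the log-derivative trick):
$$\nabla_\theta J(\theta) = \sum_\tau P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau)$$
Step 5. That sum weighted by $P(\tau; \theta)$ is just an expectation! So:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau; \theta)\, R(\tau) \right]$$
Step 6. The trajectory probability $P(\tau; \theta)$ factors as $p(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. Taking the log turns the product into a sum. The initial-state distribution $p(s_0)$ and the environment dynamics $p(s_{t+1} \mid s_t, a_t)$ don't depend on $\theta$, so their gradients vanish:
$$\nabla_\theta \log P(\tau; \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
The REINFORCE gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]$$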
REINFORCE is unbiased but has very high variance. Why? Because R(τ) is the total return of the entire trajectory. If a trajectory got high reward, we reinforce every action — even the bad ones that happened to be on a lucky trajectory. Two standard fixes:
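1. Baseline subtraction. Replace $R(\tau)$ with $R(\tau) - b$, where $b$ is a baseline that does not depend on the action taken (often $V(s_t)$). This leaves the expected gradient unchanged, because $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$, but it can reduce variance substantially.
2. Use the advantage. Replace $R(\tau)$ with $A_t = Q(s_t, a_t) - V(s_t)$. This says "how much better was this action than average?" rather than "was the whole episode good?"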
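The log-derivative trick and the baseline claim can be checked numerically on a tiny problem. The sketch below assumes a one-step episode with three actions and a softmax policy over hypothetical logits `theta`. It computes the expectation exactly by enumerating actions rather than sampling, so it shows that the expected gradient is unchanged by a baseline, not the variance reduction itself.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def expected_return(theta, rewards):
    """J(theta) = sum over actions of pi(a) * R(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards, baseline=0.0):
    """E_a[ grad log pi(a) * (R(a) - b) ], computed exactly by enumeration."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            # For a softmax policy: d log pi(a) / d theta_j = 1[a == j] - pi(j)
            dlogp = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += pi[a] * dlogp * (rewards[a] - baseline)
    return grad

def finite_difference_gradient(theta, rewards, h=1e-6):
    """Central-difference estimate of grad J, used only as a reference."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += h
        down = list(theta)
        down[j] -= h
        diff = expected_return(up, rewards) - expected_return(down, rewards)
        grad.append(diff / (2 * h))
    return grad

theta = [0.1, -0.3, 0.5]   # hypothetical logits for left, stay, right
rewards = [0.0, 0.2, 1.0]  # hypothetical reward for each action

g_plain = score_function_gradient(theta, rewards)
g_baseline = score_function_gradient(theta, rewards, baseline=0.4)
g_numeric = finite_difference_gradient(theta, rewards)
print(g_plain, g_baseline, g_numeric)
```

All three gradients agree up to numerical error, and the largest component points toward the action with the highest reward.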
A 1D world with 3 actions: left, stay, right. The goal is at the right edge. Watch the policy (bar chart) shift after each episode. High-reward episodes increase the probability of the actions taken.
REINFORCE has a fatal flaw: the gradient can be huge. One lucky trajectory can massively change the policy, destroying what was learned before. We need controlled updates — change the policy, but not too much at once.
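Define the probability ratio between the new and old policy (written $r_t(\theta)$, not to be confused with the reward $r_t$):
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
When $r_t(\theta) = 1$, the new policy assigns the action the same probability as the old one. When $r_t(\theta) > 1$, the action became more likely. When $r_t(\theta) < 1$, it became less likely.
Instead of REINFORCE, we optimize a surrogate objective that uses the ratio:
$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[ r_t(\theta)\, A_t \right]$$
This is called the Conservative Policy Iteration (CPI) objective. Its gradient at $\theta = \theta_{\text{old}}$ equals the policy gradient, but because it is written in terms of the ratio, we can take multiple gradient steps on the same batch of data. The problem: if $A_t > 0$ (good action) and we keep increasing $r_t(\theta)$, the objective keeps growing, yet the policy might change so much that it collapses.
PPO clips the ratio to prevent it from straying too far from 1. The clipped surrogate is:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\, A_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t \right) \right]$$
where $\epsilon$ is a small number, typically 0.2. Let's unpack what happens in each case: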
| Advantage | Ratio region | Which term wins the min? | Effect |
|---|---|---|---|
| $A_t > 0$ (good action) | $r < 1+\epsilon$ | $r \cdot A$ (unclipped) | Normal gradient: increase probability |
| $A_t > 0$ (good action) | $r \ge 1+\epsilon$ | $(1+\epsilon) \cdot A$ (clipped) | Gradient is zero: stop increasing! |
| $A_t < 0$ (bad action) | $r > 1-\epsilon$ | $r \cdot A$ (unclipped) | Normal gradient: decrease probability |
| $A_t < 0$ (bad action) | $r \le 1-\epsilon$ | $(1-\epsilon) \cdot A$ (clipped) | Gradient is zero: stop decreasing! |
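The cases in the table can be reproduced with a few lines of plain Python. This is a per-sample sketch of the clipped term only; the log-probabilities and advantages are hypothetical values, and a real implementation would operate on batched tensors.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # r = pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# ratio = 1.5 with a positive advantage: capped at (1 + eps) * A
capped_up = ppo_clip_term(math.log(0.3), math.log(0.2), advantage=1.0)

# ratio = 0.5 with a negative advantage: capped at (1 - eps) * A
capped_down = ppo_clip_term(math.log(0.1), math.log(0.2), advantage=-1.0)

# ratio = 1.5 with a negative advantage: the unclipped term r * A wins the min
uncapped = ppo_clip_term(math.log(0.3), math.log(0.2), advantage=-1.0)

print(capped_up, capped_down, uncapped)
```

In the two capped cases the result no longer depends on the ratio, which is why the gradient with respect to the policy parameters is zero there.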
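PPO also adds a value function loss (to train the baseline) and an entropy bonus (to encourage exploration):
$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[ L^{\text{CLIP}}_t(\theta) - c_1\, L^{\text{VF}}_t(\theta) + c_2\, H\left[\pi_\theta\right](s_t) \right]$$
where $L^{\text{VF}}_t = \left(V_\theta(s_t) - V^{\text{target}}_t\right)^2$, $H$ is the entropy of the policy, and $c_1, c_2$ are weighting coefficients. This combined objective is maximized.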
The teal curve is the unclipped objective r·A. The orange region shows where clipping is active. The green curve is the final PPO objective (min of clipped and unclipped). Adjust ε and toggle the advantage sign.
How do you apply RL to text generation? The mapping is surprisingly direct:
| RL Concept | LLM Equivalent |
|---|---|
| Agent | The language model $\pi_\theta$ |
| State $s_t$ | Prompt + tokens generated so far: $(x_1, \ldots, x_t)$ |
| Action $a_t$ | The next token $x_{t+1}$ sampled from $\pi_\theta(\cdot \mid x_{1:t})$ |
| Policy $\pi(a \mid s)$ | The model's next-token distribution: softmax of logits |
| Episode | Generating one complete response to a prompt |
| Reward r | Score from a reward model (given once at the end of generation) |
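We want to maximize reward, but with a crucial constraint: the finetuned model shouldn't drift too far from the SFT model. Why? Because the reward model is imperfect: if we optimize it too aggressively, the LLM finds degenerate outputs that "hack" the reward (high reward, low quality). The solution is a KL penalty:
$$J_{\text{RLHF}}(\theta) = \mathbb{E}_{x,\ y \sim \pi_\theta(\cdot \mid x)}\left[ r_\phi(x, y) \right] - \beta\, \mathbb{E}_{x}\left[ \mathrm{KL}\left( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\text{ref}}(\cdot \mid x) \right) \right]$$
Where:
- $x$ is a prompt and $y$ is a response sampled from the current policy $\pi_\theta$
- $r_\phi(x, y)$ is the score the reward model assigns to the response
- $\pi_{\text{ref}}$ is the frozen reference model (the SFT model)
- $\beta$ controls the strength of the KL penalty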
KL divergence measures the difference between two probability distributions. For two distributions $p$ and $q$ over a vocabulary of size $V$:
$$\mathrm{KL}(p \,\Vert\, q) = \sum_{i=1}^{V} p_i \log \frac{p_i}{q_i}$$
KL is always ≥ 0, and equals 0 only when p = q. In the RLHF context, it penalizes the model for deviating from the reference — the more different the token distribution, the larger the penalty. This prevents reward hacking: the model can't just memorize one high-scoring response and output it for every prompt.
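A small numerical sketch of the KL penalty, using hypothetical next-token distributions over a three-token vocabulary. The first argument plays the role of the policy and the second the reference.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum of p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

reference = [0.5, 0.3, 0.2]   # hypothetical reference distribution
close = [0.45, 0.35, 0.2]     # policy that has drifted a little
far = [0.98, 0.01, 0.01]      # policy that has collapsed onto one token

kl_same = kl_divergence(reference, reference)
kl_close = kl_divergence(close, reference)
kl_far = kl_divergence(far, reference)

print(kl_same, kl_close, kl_far)
```

The collapsed policy pays a much larger penalty than the slightly drifted one, which is the pressure that keeps the finetuned model near the reference.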
Adjust β to see how the KL penalty affects the training objective. High β = stay close to the reference model. Low β = chase reward aggressively.
RLHF needs a reward signal, but we can't have a human rate every single model output during training. Instead, we train a reward model to predict human preferences. Then we use this learned reward model as the scoring function for RL.
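The process is straightforward:
1. For each prompt $x$, collect two (or more) candidate responses, typically sampled from the SFT model.
2. A human annotator picks the response they prefer, which gives a preferred response $y_w$ and a rejected response $y_l$.
3. Train a reward model $r_\phi(x, y)$ so that preferred responses score higher than rejected ones.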
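We need a mathematical model of human preferences. The Bradley-Terry model (1952) assumes there exists a latent reward $r^*(x, y)$ such that the probability of preferring $y_1$ over $y_2$ is:
$$P(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}$$
This is just a sigmoid! Let $\Delta = r^*(x, y_1) - r^*(x, y_2)$. Then:
$$P(y_1 \succ y_2 \mid x) = \frac{1}{1 + e^{-\Delta}} = \sigma(\Delta)$$
When $\Delta > 0$ ($y_1$ has higher reward), the probability of preferring $y_1$ is above 0.5. When $\Delta = 0$, it's a coin flip. When $\Delta < 0$, $y_2$ is preferred.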
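The equivalence between the Bradley-Terry ratio and the sigmoid of the reward difference is easy to confirm numerically. The reward values below are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_probability(r1, r2):
    """Bradley-Terry: P(y1 preferred) = exp(r1) / (exp(r1) + exp(r2))."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

p_better = preference_probability(2.0, 0.5)  # y1 has the higher reward
p_tie = preference_probability(1.0, 1.0)     # equal rewards: coin flip
p_worse = preference_probability(0.5, 2.0)   # y1 has the lower reward

print(p_better, sigmoid(2.0 - 0.5))  # the two forms agree
print(p_tie, p_worse)
```

Only the difference between the two rewards matters: adding the same constant to both leaves every preference probability unchanged.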
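We parameterize the reward with a neural network $r_\phi(x, y)$, typically the SFT model with a linear head that outputs a scalar. The loss maximizes the log-likelihood of the observed preferences:
$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]$$
Let's unpack this:
- $y_w$ is the response the annotator preferred (the winner) and $y_l$ is the rejected one (the loser).
- $\sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$ is the Bradley-Terry probability the reward model assigns to the observed preference.
- The loss approaches 0 when the winner scores far above the loser, equals $\log 2$ when the two scores are tied, and grows roughly linearly in the gap when the model ranks the pair the wrong way.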
```python
import torch
import torch.nn.functional as F

# Reward model training step
def reward_loss(r_phi, x, y_w, y_l):
    # r_phi: reward model (LLM + linear head), called as r_phi(prompts, responses)
    # x:   prompts             [B]
    # y_w: preferred responses [B]
    # y_l: rejected responses  [B]
    r_w = r_phi(x, y_w)  # [B] scalar reward for winners
    r_l = r_phi(x, y_l)  # [B] scalar reward for losers
    # -log(sigmoid(d)) written as -logsigmoid(d), which stays finite
    # when d is a large negative number.
    loss = -F.logsigmoid(r_w - r_l).mean()  # scalar
    return loss
```
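To get a feel for the values this loss takes, here is a plain-Python sketch for a single pair, with no model involved. The reward scores are hypothetical, and the function name `pairwise_loss` is introduced only for this illustration.

```python
import math

def pairwise_loss(r_w, r_l):
    """-log(sigmoid(r_w - r_l)), written in a numerically stable form."""
    d = r_w - r_l
    # log(1 + exp(-d)) rearranged so exp never sees a large positive argument
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

loss_right = pairwise_loss(3.0, 0.0)  # winner scored well above the loser
loss_tie = pairwise_loss(1.0, 1.0)    # model cannot tell the pair apart
loss_wrong = pairwise_loss(0.0, 3.0)  # model ranks the pair the wrong way

print(loss_right, loss_tie, loss_wrong)
```

A tie costs exactly log 2, a correct ranking with a wide margin costs almost nothing, and a confident wrong ranking is penalized roughly in proportion to the margin.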
Adjust the reward scores for two responses. The sigmoid curve shows the preference probability. When the reward difference is large, the model is very confident about which response is better.
Now we have all the pieces. Let's assemble the complete RLHF pipeline, step by step, with concrete data shapes at every stage.
This is where the magic happens. Four models are active simultaneously:
| Model | Role | Trainable? |
|---|---|---|
| $\pi_\theta$ (Policy) | Generates responses. This is the model we're training. | Yes |
| $\pi_{\text{ref}}$ (Reference) | Frozen copy of $\pi_{\text{SFT}}$. Used for KL penalty. | No (frozen) |
| $r_\phi$ (Reward) | Scores generated responses. | No (frozen) |
| $V_\psi$ (Value) | Estimates expected reward. Used as PPO baseline. | Yes |
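In the LLM setting, a common simplification is to treat the whole response as a single action and use no discounting ($\gamma = 1$), so the end-of-response reward counts fully for every token. (With $\gamma = 0$, tokens before the last one would receive no learning signal at all.) The reward is given once at the end of the full response. This means the advantage for the entire sequence simplifies to:
$$A = r_\phi(x, y) - V_\psi(x)$$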
That is: "How much better was the actual reward than what the value model predicted?" If the response was surprisingly good (A > 0), increase its probability. If surprisingly bad (A < 0), decrease it.
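A minimal sketch of the sequence-level advantage described above: the reward minus the value model's prediction. The second helper is an assumption not spelled out in the text: it folds the KL penalty into the reward using a single-sample estimate, the log-probability difference of the generated response under the policy and the reference. All numbers are hypothetical.

```python
def sequence_advantage(reward, value_estimate):
    """A = reward - predicted value: how much better the response was than expected."""
    return reward - value_estimate

def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Assumed shaping: reward minus beta times a single-sample KL estimate."""
    return reward - beta * (logp_policy - logp_ref)

# Hypothetical numbers for one prompt/response pair.
shaped = kl_penalized_reward(reward=1.2, logp_policy=-14.0, logp_ref=-16.0, beta=0.1)
adv = sequence_advantage(shaped, value_estimate=0.7)

print(shaped, adv)  # positive advantage: make this response more likely
```

A response that the policy finds much more likely than the reference does loses part of its reward, so a larger beta pulls the advantage down for responses that drift from the reference.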
| Concept | Formula | In English |
|---|---|---|
| SFT Loss | $\mathcal{L} = -\frac{1}{T} \sum_t \log p(x_{t+1} \mid x_{1:t})$ | Predict next token, minimize surprise |
| Policy Gradient | $\nabla J = \mathbb{E}\left[\sum_t \nabla \log \pi(a_t \mid s_t) \cdot A_t\right]$ | Increase P(action) if advantage is positive |
| PPO Clip | $\min\left(r \cdot A,\ \text{clip}(r, 1-\epsilon, 1+\epsilon) \cdot A\right)$ | Policy gradient with safety rails |
| RLHF Obj | $\mathbb{E}\left[r(x, y) - \beta\, \mathrm{KL}(\pi \,\Vert\, \pi_{\text{ref}})\right]$ | Maximize reward, stay close to SFT model |
| Bradley-Terry | $P(y_w \succ y_l) = \sigma(r_w - r_l)$ | Preference = sigmoid of reward difference |
| RM Loss | $-\log \sigma(r_w - r_l)$ | Train reward model to rank winners above losers |
| Method | Training Signal | Pros | Cons |
|---|---|---|---|
| SFT | Demonstration data | Simple, stable | Can't generalize beyond demos |
| RLHF | Human preferences | Generalizes quality notion | Complex, 4 models, reward hacking risk |
| DPO | Preference pairs directly | No reward model needed | Less flexible, no iterative improvement |
| RLAIF | AI preferences | Scalable, cheap | Limited by judge model quality |