CS 229s — Systems for Machine Learning

Fine-tuning Large Language Models

Pre-trained LLMs predict the next token. Fine-tuning makes them follow instructions. From RLHF to LoRA — every technique for turning a base model into a useful assistant.

Prerequisites: Language models + Basic RL concepts. That’s it.
9 Chapters · 6+ Simulations · 0 Assumed Knowledge

Chapter 0: From Pre-training to Usefulness

You’ve just finished training a language model on a trillion tokens from the internet. You feed it a prompt: “Translate cheese from English to French.” What does it do? It autocompletes the pattern — spitting out more translation examples: “Translate cheese from English to Spanish. Translate cheese from French to English.”

It doesn’t answer your question. It continues the document. That’s all pre-training taught it to do: predict the next token given everything before it. The model is a brilliant pattern-completer but a terrible assistant.

This is the alignment gap: the difference between what the model learned to do (predict tokens) and what we want it to do (follow instructions helpfully, harmlessly, and honestly). Closing this gap is the entire subject of fine-tuning.

The core problem: A pre-trained LLM is like an encyclopedia that can only recite passages. Fine-tuning teaches it to have a conversation — to read your question, understand your intent, and generate a useful response instead of a plausible continuation.
[Interactive demo: Base Model vs Fine-tuned Model. The base model continues the pattern; the fine-tuned model answers the question.]

Check: Why does a pre-trained LLM fail to follow instructions?

Chapter 1: Prompting Strategies

Before we modify any weights, there’s a cheaper trick: change the input. By carefully designing the prompt, we can steer a pre-trained model toward useful behavior without training at all. This family of techniques is called prompting.

Zero-Shot Prompting

Just ask the question directly: “Translate ‘cheese’ to French.” No examples, no context. Large models (100B+ parameters) can often handle this thanks to the sheer breadth of patterns seen during pre-training. Smaller models struggle — they haven’t seen enough instruction-like text to recognize the intent.

Few-Shot Prompting

Prefix the question with worked examples: “English: dog → French: chien. English: cat → French: chat. English: cheese → French:” The model sees the pattern and completes it correctly. GPT-3 showed that few-shot performance improves dramatically with scale — a 175B model given 5 examples outperforms a 1.3B model given 100.

Chain-of-Thought (CoT)

Chain-of-thought prompting includes a worked reasoning trace in the few-shot examples, or simply appends a cue such as “Let’s think step by step.” Instead of jumping to the answer, the model generates intermediate steps. Wei et al. (2022) showed that few-shot reasoning traces unlock reasoning abilities that flat prompting misses, especially on math and logic problems; Kojima et al. (2022) showed that the one-line cue alone also helps.
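The three prompting strategies differ only in how the input string is assembled. A minimal sketch, using the translation example from this chapter; the function names are illustrative, not part of any library:

```python
def zero_shot(word):
    # Ask directly, with no examples.
    return f"Translate '{word}' to French."

def few_shot(examples, word):
    # Prefix the query with worked (English, French) examples.
    shots = " ".join(f"English: {en} → French: {fr}." for en, fr in examples)
    return f"{shots} English: {word} → French:"

def chain_of_thought(question):
    # Append the zero-shot reasoning cue so the model writes out its steps.
    return f"{question}\nLet's think step by step."

prompt = few_shot([("dog", "chien"), ("cat", "chat")], "cheese")
```

None of these functions touches the model; they only change what it is asked to continue.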

Key insight: Prompting doesn’t change the model. It changes the input distribution to land in a region of token-space where the model already knows how to produce good outputs. It’s cheap and requires zero training — but it has a ceiling. The model can only do what it already learned during pre-training.
[Interactive demo: Scaling Behavior of Prompting Strategies. A model-size slider (in billions of parameters) compares zero-shot, few-shot, and chain-of-thought performance; larger models benefit more from sophisticated prompting.]

Why prompting has limits: Few-shot examples consume context window tokens. Chain-of-thought generates extra tokens that slow inference. And no matter how clever your prompt, the model’s behavior is bounded by what pre-training encoded in its weights. To go further, we need to change the weights themselves.
Check: What does chain-of-thought prompting add?

Chapter 2: Instruction Tuning

Prompting is free but fragile. To truly teach a model to follow instructions, we fine-tune it on instruction-response pairs: (prompt, desired response) examples that explicitly demonstrate the behavior we want.

What Is Instruction Tuning?

Instruction tuning (also called supervised fine-tuning, SFT) takes a pre-trained LLM and continues training it on a curated dataset of instructions paired with ideal responses. The loss function is the same as pre-training — cross-entropy on next-token prediction — but the data is radically different: instead of raw web text, it’s task-specific demonstrations.

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t} \log p_\theta(y_t \mid x, y_{<t})$$

Where x is the instruction, y is the target response, and we sum over all tokens in y. We only compute loss on the response tokens, not the instruction tokens — because we want the model to generate good responses, not parrot back instructions.
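To make the response-only masking concrete, here is a small sketch that computes the SFT loss from per-token probabilities. The probabilities and the mask are made-up illustrative values; in a real setup they would come from the model's softmax output and the tokenizer.

```python
import math

def sft_loss(token_probs, is_response):
    """Sum of -log p over response tokens only.

    token_probs[i]: probability the model assigned to the i-th target token.
    is_response[i]: True for response tokens, False for instruction tokens.
    """
    return -sum(math.log(p) for p, keep in zip(token_probs, is_response) if keep)

# Two instruction tokens (masked out) followed by two response tokens.
probs = [0.9, 0.8, 0.5, 0.25]
mask = [False, False, True, True]
loss = sft_loss(probs, mask)   # -(ln 0.5 + ln 0.25) = ln 8
```

The instruction tokens still condition the prediction; they just contribute nothing to the loss.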

Where Does the Data Come From?

Three sources, in order of quality and cost:

Human-Written: Experts craft Q&A, summaries, style transfers. High quality, expensive. (~10K-100K examples)
Template-Based: Existing labeled datasets (SQuAD, CNN/DM) wrapped in instruction templates: “Summarize the following article:” + article + summary.
AI-Generated: Use an already-instruction-tuned model (GPT-4, Claude) to generate training data. Self-Instruct: seed with 175 tasks, LLM generates thousands more.

FLAN’s discovery: Google’s FLAN (2021) showed that instruction-tuning on a mix of 60+ NLP tasks with templates dramatically improved zero-shot performance on held-out tasks. The model generalized the concept of “follow instructions” beyond the specific tasks it was trained on. More tasks = better generalization.
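As a rough illustration of the template-based source, the sketch below wraps one labeled summarization example in several instruction templates. The templates and the field names ("instruction", "response") are assumptions chosen for this example, not FLAN's actual templates.

```python
# Hypothetical instruction templates for a summarization dataset.
TEMPLATES = [
    "Summarize the following article:\n\n{article}",
    "Write a short summary of this text:\n\n{article}",
]

def to_instruction_pairs(article, summary, templates=TEMPLATES):
    # One labeled example becomes one (instruction, response) pair per template.
    return [
        {"instruction": t.format(article=article), "response": summary}
        for t in templates
    ]

pairs = to_instruction_pairs("The city council met on Monday.", "Council met Monday.")
```

Using several phrasings per task is one simple way to keep the model from overfitting to a single instruction wording.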

Concrete Example: InstructGPT

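OpenAI’s InstructGPT fine-tuned GPT-3 base models at several sizes (1.3B, 6B, and 175B parameters). The supervised stage used ~13,000 human-written instruction-response demonstrations, followed by the preference-based stages described in Chapter 3. The result: outputs from the 1.3B InstructGPT were preferred by human labelers over outputs from the 175B base GPT-3. A fine-tuned model more than 100x smaller beat the base model. That’s the power of alignment.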

| Model | Params | Fine-tuning / Training Data | Result |
| --- | --- | --- | --- |
| GPT-3 (base) | 175B | 300B tokens, raw web | Baseline |
| InstructGPT | 1.3B | 13K instruction pairs | Preferred over GPT-3 175B |
| FLAN-T5 XXL | 11B | 1.8K tasks, templates | SOTA on many NLP benchmarks |

[Interactive demo: Instruction Tuning Data Pipeline. Steps through how raw text becomes instruction-tuning data; each step transforms the data format.]

Check: Why did a 1.3B InstructGPT outperform a 175B base GPT-3?

Chapter 3: The RLHF Pipeline

Instruction tuning teaches the model to follow instructions, but it can’t capture preferences. Which of two grammatically correct summaries is more helpful? Which response is less toxic? These are judgment calls that supervised loss can’t express — you need a signal that says “response A is better than response B.”

Reinforcement Learning from Human Feedback (RLHF) solves this in three stages. Each stage builds on the previous one, and the entire pipeline transforms a base model into an aligned assistant.

The three stages of RLHF:
Stage 1 — SFT: Supervised fine-tuning on instruction-response demonstrations (Chapter 2).
Stage 2 — Reward Model: Train a separate model to predict which response a human would prefer.
Stage 3 — RL (PPO): Use the reward model as a signal to further fine-tune the policy with reinforcement learning.

Why RL? Why Not Just More Supervised Data?

Supervised fine-tuning requires explicit demonstrations: “given this prompt, produce this response.” But for many tasks, it’s much easier for a human to compare two responses than to write the perfect one. RLHF leverages this asymmetry. Labelers rank outputs; a reward model learns from the rankings; RL optimizes the LLM to maximize that learned reward.

Christiano et al. (2017) demonstrated the concept in Atari and MuJoCo: they trained agents to match human preferences while asking for human feedback on less than 1% of the agent’s interactions with the environment. The key insight was training a reward model from human comparisons, then using it as the reward signal for RL.

The Data Flow

Stage 1 (SFT): Pre-trained LLM + instruction demos → SFT model (πSFT)
Stage 2 (Reward Model): πSFT generates response pairs → humans rank them → train Rθ
Stage 3 (PPO): πSFT + Rθ → PPO training → πRLHF (the aligned model)

[Interactive demo: RLHF Pipeline walkthrough. Animates a prompt flowing through all three stages: SFT, reward scoring, and PPO updates.]

RLHF Dramatically Improves Scaling

Stiennon et al. (2020) showed that RLHF-trained summarization models outperform supervised baselines at every model size. More remarkably, a small RLHF model can match a much larger supervised model — RLHF shifts the entire scaling curve upward.

Check: What are the three stages of the RLHF pipeline, in order?

Chapter 4: Reward Modeling

Stage 2 of RLHF is where human judgment gets encoded into a neural network. The goal: train a model Rθ(x, y) that takes a prompt x and response y and outputs a scalar score — higher means “a human would prefer this response.”

Collecting Preferences

The SFT model generates two (or more) responses to each prompt. Human labelers see both and pick the better one, evaluating on criteria like helpfulness, harmlessness, and truthfulness. This produces comparison data: (x, yw, yl) where yw is the preferred (“winner”) response and yl is the rejected (“loser”) response.
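When labelers rank more than two responses, each ranking can be expanded into pairwise comparisons. A minimal sketch, assuming the ranking is given best-first; the function name is illustrative:

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into (x, y_w, y_l) comparison triples."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked (preferred) response.
    return [(prompt, y_w, y_l) for y_w, y_l in combinations(ranked_responses, 2)]

pairs = ranking_to_pairs("Explain gravity.", ["best", "middle", "worst"])
```

A ranking of K responses yields K(K−1)/2 comparisons, which is why ranking several outputs at once is an efficient use of labeler time.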

The Bradley-Terry Model

We model human preferences using the Bradley-Terry model: the probability that a human prefers response y1 over y2 depends on the difference in their reward scores, passed through a sigmoid:

$$P(y_1 \succ y_2 \mid x) = \sigma\big( R_\theta(x, y_1) - R_\theta(x, y_2) \big)$$

Where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function. If R gives y1 a much higher score than y2, the sigmoid pushes the probability toward 1: the model is confident y1 is preferred.

Deriving the Loss

We want to maximize the probability of the observed human preferences. Taking the negative log-likelihood:

$$\mathcal{L}_{\mathrm{RM}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\left[ \log \sigma\big( R_\theta(x, y_w) - R_\theta(x, y_l) \big) \right]$$

Step by step: for each comparison triplet, compute the reward difference Δ = R(x, yw) − R(x, yl). Pass through sigmoid to get a probability. Take −log. Average over the dataset. Minimizing this loss pushes Rθ to assign higher scores to preferred responses.
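The effect of minimizing this loss can be seen on a toy problem. The sketch below assumes a linear reward model over hand-made two-dimensional features (a stand-in for a neural network over text) and trains it with plain gradient descent; the features, learning rate, and step count are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, feats):
    # Toy linear reward model: R(x, y) = w · features(x, y).
    return sum(wi * fi for wi, fi in zip(w, feats))

def rm_loss(w, data):
    # Average of -log sigmoid(R(y_w) - R(y_l)) over comparison pairs.
    return sum(-math.log(sigmoid(reward(w, fw) - reward(w, fl)))
               for fw, fl in data) / len(data)

def train(data, dim, lr=0.5, steps=200):
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for fw, fl in data:
            delta = reward(w, fw) - reward(w, fl)
            coeff = sigmoid(delta) - 1.0      # d/d(delta) of -log sigmoid(delta)
            for i in range(dim):
                grad[i] += coeff * (fw[i] - fl[i]) / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Each item is (features of preferred response, features of rejected response).
data = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]
w = train(data, dim=2)
```

With all-zero weights every pair has loss ln 2 (the model is indifferent); after training, the preferred response in each pair scores higher and the loss is lower.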

Intuition: The reward model is a preference predictor. It doesn’t know what “good” means in the abstract — it learns a scoring function that reproduces human comparison judgments. Think of it as distilling the labeler’s taste into a differentiable function.

Concrete Worked Example

Suppose for prompt “Explain gravity to a 5-year-old”, the reward model gives:

| Response | Rθ(x, y) |
| --- | --- |
| yw: “Things fall down because Earth pulls them, like a magnet for everything!” | 2.3 |
| yl: “Gravity is the curvature of spacetime caused by mass-energy.” | −0.5 |

Δ = 2.3 − (−0.5) = 2.8. σ(2.8) = 0.943. Loss = −log(0.943) = 0.059. The model is already quite confident the simple answer is preferred. If the scores were closer, the loss would be higher, pushing the model to separate them more.
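The arithmetic above can be checked in a few lines. The second call uses made-up scores that are closer together, to show the loss rising as the gap shrinks:

```python
import math

def bt_loss(r_w, r_l):
    # Bradley-Terry preference probability and its negative log-likelihood.
    p = 1.0 / (1.0 + math.exp(-(r_w - r_l)))
    return p, -math.log(p)

p, loss = bt_loss(2.3, -0.5)               # the worked example: gap of 2.8
p_close, loss_close = bt_loss(0.4, 0.1)    # hypothetical closer scores: gap of 0.3
print(round(p, 3), round(loss, 3))         # 0.943 0.059
```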

[Interactive demo: Reward Model Loss Explorer. Sliders set the rewards of the preferred and rejected responses (defaults: R(yw) = 2.0, R(yl) = −1.0) and show how the Bradley-Terry loss changes as the gap widens or shrinks.]
Check: What does the reward model learn to predict?

Chapter 5: PPO for Language Models

We now have a reward model that scores responses. In this chapter θ denotes the policy’s parameters, so we write the reward model as Rφ. Stage 3 uses it as the reward signal in a reinforcement learning loop. The algorithm of choice is Proximal Policy Optimization (PPO), adapted for language generation.

Framing Language Generation as RL

Here’s the mapping from language to RL:

| RL Concept | Language Equivalent |
| --- | --- |
| Policy πθ | The LLM itself: maps prompt to token distribution |
| State | The prompt + tokens generated so far |
| Action | The next token to generate |
| Reward | Rφ(x, y) evaluated on the complete response y |
| Episode | Generating one complete response to a prompt |
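The mapping can be made concrete with a toy rollout. Everything here is a stand-in: the four-token vocabulary, the uniform `toy_policy`, and the `toy_reward` function are hypothetical placeholders for a real LLM and reward model.

```python
import random

VOCAB = ["Plants", "use", "sunlight", "<eos>"]

def toy_policy(state):
    """Hypothetical policy: uniform distribution over the vocabulary."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def toy_reward(prompt, response):
    """Hypothetical stand-in for the reward model: counts distinct tokens."""
    return float(len(set(response)))

def rollout(prompt, max_tokens=8, seed=0):
    rng = random.Random(seed)
    state, response, trajectory = list(prompt), [], []
    for _ in range(max_tokens):
        probs = toy_policy(state)                             # policy: state -> token distribution
        action = rng.choices(VOCAB, weights=probs, k=1)[0]    # action: the next token
        trajectory.append((tuple(state), action))
        if action == "<eos>":
            break
        state.append(action)                                  # state: prompt + tokens so far
        response.append(action)
    # One scalar reward for the whole episode, given only at the end.
    return trajectory, toy_reward(prompt, response)

trajectory, episode_reward = rollout(["What", "is", "photosynthesis", "?"])
```

Note that the reward arrives once per response rather than once per token, which is part of what makes credit assignment hard in this setting.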

The PPO Objective for LLMs

Standard PPO maximizes a clipped surrogate objective. For LLMs, we add a critical constraint: a KL penalty that prevents the policy from drifting too far from the SFT model. Without it, the policy would find degenerate outputs that exploit the reward model (called reward hacking).

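$$J(\theta) = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta}\left[ R_\phi(x, y) - \beta \, D_{\mathrm{KL}}\big( \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{SFT}}(y \mid x) \big) \right]$$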

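Let’s unpack each term:

x ~ D, y ~ πθ: prompts are drawn from the dataset D, and responses are sampled from the current policy.
Rφ(x, y): the reward model’s scalar score for the complete response.
DKL(πθ ‖ πSFT): how far the policy’s distribution over responses has moved from the SFT model’s.
β: the penalty coefficient that trades reward against staying close to the SFT model.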

Why the KL Penalty Is Essential

Without the KL penalty, the model quickly discovers reward-hacking strategies: repetitive phrases the reward model scores highly, or adversarial outputs that exploit blind spots. The KL penalty says “maximize reward, but don’t become a different model.” It’s like a leash — the policy can explore, but it can’t run away.
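In practice the KL penalty is often applied per token rather than once per sequence: each generated token is charged β times its log-probability ratio against the SFT model, and the reward model's score is added at the final token. The sketch below assumes that common formulation; the per-token probabilities are made-up illustrative values.

```python
import math

def shaped_rewards(logp_policy, logp_sft, final_reward, beta):
    # Per-token penalty: -beta * (log pi_theta - log pi_SFT) for each token.
    rewards = [-beta * (lp - ls) for lp, ls in zip(logp_policy, logp_sft)]
    # The reward model scores only the complete response, so add it at the end.
    rewards[-1] += final_reward
    return rewards

# Illustrative per-token probabilities for a three-token response.
logp_policy = [math.log(0.5), math.log(0.4), math.log(0.35)]
logp_sft = [math.log(0.4), math.log(0.3), math.log(0.25)]
rewards = shaped_rewards(logp_policy, logp_sft, final_reward=1.8, beta=0.2)
total = sum(rewards)
```

Summing the shaped rewards recovers the sequence-level quantity R − β · log(πθ(y|x) / πSFT(y|x)), so the two views agree.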

Reward hacking in practice: Without KL regularization, RLHF-trained summarizers learned to generate long, repetitive summaries with confident-sounding phrases — the reward model scored them highly, but humans rated them worse than the SFT baseline. The model gamed the proxy instead of satisfying the true objective.

Worked Example: One PPO Update

Given prompt x = “What is photosynthesis?”, the policy generates y = “Plants convert sunlight into energy using chlorophyll.”

Reward model gives R(x, y) = 1.8. The SFT model would have assigned this response probability pSFT(y|x) = 0.03. The current policy gives pθ(y|x) = 0.07.

KL divergence (simplified to a single-sample, per-sequence estimate, using the natural log): DKL ≈ ln(0.07/0.03) = ln(2.33) ≈ 0.85.

With β = 0.2: J = 1.8 − 0.2 × 0.85 = 1.8 − 0.17 = 1.63.

The policy gets credit for the high reward, with a small penalty for diverging from SFT.
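The same numbers can be reproduced in code. The function name is illustrative, and the KL term is the single-sample estimate used in the worked example, not a full expectation over responses:

```python
import math

def ppo_objective(reward, p_policy, p_sft, beta):
    # Single-sample, per-sequence KL estimate: log of the probability ratio.
    kl_estimate = math.log(p_policy / p_sft)
    return reward - beta * kl_estimate, kl_estimate

j, kl = ppo_objective(reward=1.8, p_policy=0.07, p_sft=0.03, beta=0.2)
print(round(kl, 2), round(j, 2))   # 0.85 1.63

# A larger beta makes the same drift from the SFT model cost more.
j_strict, _ = ppo_objective(reward=1.8, p_policy=0.07, p_sft=0.03, beta=1.0)
```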

[Interactive demo: PPO Reward vs KL Tradeoff. Sliders for the reward R(x, y), the KL divergence, and the penalty coefficient β (defaults: 1.8, 0.8, 0.20) show how the three interact to determine the PPO objective value.]
Check: Why is the KL penalty necessary in PPO for LLMs?

Chapter 6: Constitutional AI & RLAIF

RLHF works, but it’s expensive. Collecting human preference labels requires hiring annotators, designing interfaces, running quality control. Anthropic’s Constitutional AI (CAI) asks: what if the AI itself could provide the feedback?

The Constitution

Instead of 10,000+ human comparison labels, CAI uses ~10 human-written principles — the “constitution.” These are natural-language rules like:

Example principles:
1. “Choose the response that is most helpful, while being safe and avoiding harmful content.”
2. “Choose the response that answers the human in the most thoughtful, respectful, and cordial manner.”
3. “Choose the response that is least likely to be used for illegal or harmful purposes.”

The Two-Phase Pipeline

CAI works in two phases that mirror RLHF but replace humans with AI:

Phase 1: Supervised Self-Critique

The LLM generates a response. Then it critiques its own response using a constitutional principle. Then it revises the response to address the critique. This critique-revision loop can be repeated multiple times. The revised responses become the SFT training data.

Generate: LLM produces initial response to prompt
Critique: “Identify ways this response is harmful, unethical, or dangerous.”
Revise: “Please rewrite to remove harmful content.”
(Repeat critique and revise 1–4 times.)
SFT: Train on (prompt, final revised response) pairs
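The loop above can be sketched as follows. Here `generate` is a hypothetical callable (string in, string out) standing in for an LLM call, and the prompt layout is an assumption made for this example, not the format used by Bai et al.

```python
def critique_revision(prompt, generate, critique_request, revise_request, rounds=2):
    """Run the generate -> critique -> revise loop and return one SFT pair."""
    response = generate(prompt)
    for _ in range(rounds):
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n{critique_request}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            f"{revise_request}\nRevision:"
        )
    return {"prompt": prompt, "response": response}

# A stub model that just numbers its outputs, to show the call pattern.
calls = []
def fake_llm(text):
    calls.append(text)
    return f"output-{len(calls)}"

pair = critique_revision(
    "Example prompt",
    fake_llm,
    "Identify ways this response is harmful, unethical, or dangerous.",
    "Please rewrite to remove harmful content.",
    rounds=2,
)
```

With two rounds the model is called five times (one initial generation, then a critique and a revision per round), and only the final revision is kept for SFT.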

Phase 2: RLAIF — RL from AI Feedback

Instead of human labelers comparing responses, the AI compares them using the constitution. For each prompt, generate two responses, ask the AI “which response better follows this principle?” and use the AI’s judgment to train a preference model. Then run PPO as before, but with the AI-trained preference model.
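A minimal sketch of the AI-labeling step, producing the same (x, yw, yl) triples that Chapter 4 uses for reward modeling. The `judge` argument is a hypothetical callable wrapping a feedback model, and the A/B prompt format is an assumption for illustration.

```python
def ai_preference(prompt, resp_a, resp_b, principle, judge):
    """Ask a feedback model which response better follows a principle."""
    question = (
        f"{principle}\n\nPrompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}\n"
        "Which response better follows this principle? Answer A or B:"
    )
    answer = judge(question).strip().upper()
    if answer.startswith("A"):
        return (prompt, resp_a, resp_b)   # (x, y_w, y_l)
    if answer.startswith("B"):
        return (prompt, resp_b, resp_a)
    return None                           # unparseable judgment: drop the pair

principle = "Choose the response that is most helpful, while being safe."
triple = ai_preference("Q", "first", "second", principle, judge=lambda q: " b ")
```

Dropping unparseable answers is one simple policy; a real pipeline might instead retry or read the judge's token probabilities for "A" and "B".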

Why CAI works: Bai et al. (2022) showed that harmlessness improves monotonically with more critique-revision rounds. Helpfulness dips slightly but the combined score (Helpfulness + Harmlessness) always improves. And by using chain-of-thought in the AI feedback step, quality improves further — the AI explains why one response is better before choosing.

Scaling Advantages

| Dimension | RLHF | CAI / RLAIF |
| --- | --- | --- |
| Labels needed | ~50K–100K human comparisons | ~10 principles + AI generates rest |
| Cost | $100K–$1M+ for labelers | Compute cost only |
| Scalability | Limited by human bandwidth | Parallelizable, auto-scalable |
| Transparency | Labeler disagreements are opaque | Principles are explicit and auditable |
| Quality ceiling | Bounded by labeler expertise | Can leverage stronger AI as judges |

[Interactive demo: Constitutional AI Critique-Revision Loop. Steps through four critique-revision rounds; harmlessness improves with each round while helpfulness stays relatively stable.]
Check: What replaces human labelers in Constitutional AI?

Chapter 7: Parameter-Efficient Fine-tuning

Fine-tuning a 70B model means storing 70B parameters, their gradients, and optimizer states (Adam keeps two extra values per parameter). Together that is several times the size of the weights alone: a common mixed-precision accounting is about 16 bytes per parameter, or roughly 1.1 TB for a single training run, before counting activations. And if you want a different fine-tuned model for each task (summarization, coding, medical Q&A), you need separate 70B copies. This doesn’t scale.

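Parameter-Efficient Fine-Tuning (PEFT) solves this by updating only a tiny fraction of the parameters, often less than 0.1%, while freezing the rest. The frozen parameters require no gradients and no optimizer states, so gradient and optimizer memory shrink roughly in proportion to the trainable fraction. The frozen weights themselves, and the activations, still have to fit in memory.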
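A back-of-the-envelope estimate shows the difference. The byte counts below are assumptions based on a common mixed-precision Adam setup (2 bytes for 16-bit weights, 2 for gradients, 12 for fp32 master weights plus two Adam moments); activations are ignored, and other setups will give different numbers.

```python
def full_finetune_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    # Every parameter carries weights + gradients + optimizer state.
    return n_params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9

def peft_gb(n_params, trainable_fraction, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    # Frozen weights are stored once; only trainable parameters carry
    # their own weights, gradients, and optimizer state.
    n_trainable = n_params * trainable_fraction
    frozen = n_params * bytes_weights
    trainable = n_trainable * (bytes_weights + bytes_grads + bytes_optimizer)
    return (frozen + trainable) / 1e9

full = full_finetune_gb(70e9)                   # about 1120 GB
peft = peft_gb(70e9, trainable_fraction=0.001)  # about 141 GB
```

Under these assumptions the PEFT run needs barely more than the 140 GB of frozen weights, while full fine-tuning needs about eight times that.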

Three Families of PEFT

| Category | Idea | Examples |
| --- | --- | --- |
| Selective | Freeze most layers, update only a few (e.g., top layers) | Top-K layers, BitFit |
| Additive | Add small trainable modules, freeze everything else | Adapters, Prefix Tuning, Prompt Tuning |
| Reparameterization | Express weight updates as low-rank matrices | LoRA, IA3 |

LoRA: Low-Rank Adaptation

LoRA (Hu et al., 2021) is the most widely used PEFT method. The key insight: fine-tuning weight changes have low intrinsic rank. Instead of updating a full d×d weight matrix W, LoRA decomposes the update into two small matrices:

W′ = W + ΔW = W + B · A

Where A is [r × d] and B is [d × r], with rank r ≪ d (typically r = 4 to 64, while d = 4096 to 12288).

Deriving the Parameter Savings

A full weight matrix W has d × d parameters. The LoRA update BA has (d × r) + (r × d) = 2dr parameters. The compression ratio:

ratio = 2dr / d² = 2r / d

For d = 4096 and r = 8: ratio = 16 / 4096 = 0.39%. We’re updating less than half a percent of the parameters.
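The same calculation as a small helper, in case you want to try other values of d and r (the function name is illustrative):

```python
def lora_param_counts(d, r):
    full = d * d          # parameters in the full d x d weight matrix
    lora = 2 * d * r      # parameters in B (d x r) plus A (r x d)
    return lora, full, lora / full

lora, full, ratio = lora_param_counts(d=4096, r=8)
print(lora, full, f"{ratio:.2%}")   # 65536 16777216 0.39%
```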

The Forward Pass

During training: h = Wx + BAx. The frozen W handles the bulk of computation. BA adds a small correction learned during fine-tuning. During inference, we can merge the matrices: WLoRA = W + BA. After merging, the forward pass is just h = WLoRAx — zero additional latency.
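The equivalence between the training-time and merged forward passes can be checked on a tiny example. The matrices below are arbitrary illustrative values with d = 3 and r = 1, written with plain Python lists to keep the sketch dependency-free.

```python
def matmul(X, Y):
    # Plain-list matrix product: rows of X against columns of Y.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]   # frozen d x d weights
B = [[0.5], [-1.0], [0.25]]                               # d x r
A = [[1.0, 0.0, 2.0]]                                     # r x d
x = [1.0, 2.0, 3.0]

# Training-time path: h = Wx + B(Ax), two small extra products.
h_train = [w + l for w, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Inference path: fold BA into W once, then a single product.
BA = matmul(B, A)
W_merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
h_merged = matvec(W_merged, x)
```

Both paths give the same output, which is why a merged LoRA layer costs nothing extra at inference time.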

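Task switching with LoRA: Since only A and B change per task, you store one frozen base model and swap tiny LoRA adapters. A 70B model needs ~140GB for weights in 16-bit precision, while each LoRA adapter (r=8) is on the order of tens of megabytes. A single deployment of the base model can therefore serve 1000 different tasks by hot-swapping adapters, at a storage cost of only tens of gigabytes for all the adapters combined.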

Prompt Tuning: Soft Prompts

Prompt tuning (Lester et al., 2021) prepends m learnable “soft prompt” tokens to the input embedding. Only these m × e parameters are trained (where e is the embedding dimension). The entire model is frozen. At scale (10B+ params), prompt tuning matches full fine-tuning performance with ~0.01% of the trainable parameters.

$$P_{\theta_p}(Y \mid p_1, p_2, \ldots, p_m, x_1, x_2, \ldots, x_t)$$

Trainable parameters: m × e. For m = 20 and e = 4096, that’s 81,920 parameters — compared to billions in the full model.
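Mechanically, prompt tuning is just a concatenation in embedding space. A minimal sketch with plain lists standing in for tensors; the zero and one fill values are placeholders, and in a real setup the soft prompt would be a trainable tensor while the input embeddings come from the frozen embedding table.

```python
def prepend_soft_prompt(soft_prompt, input_embeddings):
    # Both arguments are lists of e-dimensional vectors; the result is the
    # sequence the frozen model actually sees.
    return soft_prompt + input_embeddings

m, e, t = 20, 4096, 5
soft_prompt = [[0.0] * e for _ in range(m)]        # the only trainable values
input_embeddings = [[1.0] * e for _ in range(t)]   # from the frozen embedding table
sequence = prepend_soft_prompt(soft_prompt, input_embeddings)
trainable = m * e
```

The model's effective input grows from t to m + t positions, so the soft prompt also uses up part of the context window.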

Prefix Tuning and P-Tuning v2

Prefix tuning (Li & Liang, 2021) extends prompt tuning to deeper layers: instead of only prepending tokens at the input, it injects learnable key-value pairs at every attention layer. P-Tuning v2 showed this approach matches full fine-tuning on smaller models where prompt tuning alone falls short.

[Interactive demo: LoRA Low-Rank Decomposition Explorer. Sliders for the model dimension d and LoRA rank r (defaults: d = 4096, r = 8) show how the parameter count and compression ratio change; the frozen weight matrix is drawn in blue and the trainable LoRA matrices in orange.]

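```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, linear, r=8, alpha=16):
        super().__init__()
        self.linear = linear          # frozen original layer
        d_out, d_in = linear.weight.shape
        # A starts small and random, B starts at zero, so B @ A = 0 at initialization
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r        # scaling factor applied to the low-rank update
        # Freeze the wrapped layer's weight and bias; only A and B are trained
        for p in self.linear.parameters():
            p.requires_grad = False

    def forward(self, x):
        base = self.linear(x)         # W @ x (frozen)
        lora = (x @ self.A.T) @ self.B.T  # B @ A @ x (trainable)
        return base + lora * self.scale
```
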
Where to apply LoRA: Hu et al. found that applying LoRA to Wq and Wv in self-attention gives the best results. Applying to all attention matrices (Q, K, V, O) also works. The rank r can be surprisingly small — r = 4 often suffices for many tasks.
Check: What is the key insight behind LoRA?

Chapter 8: Connections & What’s Next

We’ve traced the full arc from a raw pre-trained model to an aligned, efficient assistant. Here’s the complete development flow:

Pre-training: Next-token prediction on trillions of tokens (unsupervised)
Instruction Tuning: SFT on instruction-response pairs (supervised)
RLHF / RLAIF: Reward model + PPO with KL penalty (reinforcement learning)
PEFT (LoRA): Efficient task-specific adaptation (well under 1% of params)
Evaluation: Zero-shot, few-shot, CoT prompting on downstream tasks

Cheat Sheet

| Technique | What It Does | Key Formula | Data Needed |
| --- | --- | --- | --- |
| Zero-shot | Direct prompting, no examples | | None |
| Few-shot | In-context examples in prompt | | 5–50 examples |
| Chain-of-Thought | Step-by-step reasoning in output | | CoT exemplars |
| SFT / Instruction Tuning | Supervised training on (instruction, response) | L = −∑ log p(yt \| x, y<t) | 10K–100K pairs |
| Reward Model | Learns human preference scoring | L = −log σ(R(yw) − R(yl)) | 50K+ comparisons |
| PPO + KL | RL fine-tuning with reward + KL constraint | J = R(y) − β·DKL(π ‖ πSFT) | Reward model |
| Constitutional AI | AI self-critique + RLAIF | Same as RLHF, AI-labeled | ~10 principles |
| LoRA | Low-rank weight updates | W′ = W + BA | Same as SFT |
| Prompt Tuning | Learnable soft prompt tokens | P(Y \| p1..m, x1..t) | Same as SFT |

Method Comparison

| Method | Params Updated | Memory Cost | Quality Ceiling | Best For |
| --- | --- | --- | --- | --- |
| Full Fine-tuning | 100% | 10× model | Highest | Unlimited budget |
| LoRA (r=8) | ~0.4% | ~1.1× model | Near full FT | Most production use |
| Prompt Tuning | ~0.01% | ~1× model | Good at scale | Many tasks, one model |
| Prefix Tuning | ~0.1% | ~1× model | Better for small models | Generation tasks |
| Prompting only | 0% | Inference only | Limited by pre-training | Quick experiments |

What We Didn’t Cover

“What I cannot create, I do not understand.” — Richard Feynman. You now have every formula needed to implement the full fine-tuning pipeline: SFT loss, Bradley-Terry reward model, PPO with KL penalty, and LoRA adapters. The gap between a base model and an aligned assistant is not magic — it’s three stages of carefully designed optimization.
Check: Which PEFT method allows zero additional inference latency after merging?