Fine-tuning LLMs — From Next-Token Prediction to Following Instructions

Chapter 0: From Pre-training to Usefulness

You’ve just finished training a language model on a trillion tokens from the internet. You feed it a prompt: “Translate cheese from English to French.” What does it do? It autocompletes the pattern — spitting out more translation examples: “Translate cheese from English to Spanish. Translate cheese from French to English.”

It doesn’t answer your question. It continues the document. That’s all pre-training taught it to do: predict the next token given everything before it. The model is a brilliant pattern-completer but a terrible assistant.

This is the alignment gap: the difference between what the model learned to do (predict tokens) and what we want it to do (follow instructions helpfully, harmlessly, and honestly). Closing this gap is the entire subject of fine-tuning.

The core problem: A pre-trained LLM is like an encyclopedia that can only recite passages. Fine-tuning teaches it to have a conversation — to read your question, understand your intent, and generate a useful response instead of a plausible continuation.

Base Model vs Fine-tuned Model

[Interactive demo: a base model (left) and a fine-tuned model (right) respond to the same prompt. The base model continues the pattern; the fine-tuned model answers the question.]

Check: Why does a pre-trained LLM fail to follow instructions?

It doesn’t have enough parameters
It was trained to predict the next token, not to answer questions
It can’t understand English

Chapter 1: Prompting Strategies

Before we modify any weights, there’s a cheaper trick: change the input. By carefully designing the prompt, we can steer a pre-trained model toward useful behavior without training at all. This family of techniques is called prompting.

Zero-Shot Prompting

Just ask the question directly: “Translate ‘cheese’ to French.” No examples, no context. Large models (100B+ parameters) can often handle this thanks to the sheer breadth of patterns seen during pre-training. Smaller models struggle — they haven’t seen enough instruction-like text to recognize the intent.

Few-Shot Prompting

Prefix the question with worked examples: “English: dog → French: chien. English: cat → French: chat. English: cheese → French:” The model sees the pattern and completes it correctly. GPT-3 (Brown et al., 2020) showed that few-shot performance improves dramatically with scale: larger models make far better use of in-context examples than smaller ones.
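The few-shot pattern above is just string formatting, so it can be sketched in a few lines. This is a minimal illustration; the function name `build_few_shot_prompt` and its parameters are made up for this example and do not come from any library.

```python
def build_few_shot_prompt(examples, query, src="English", tgt="French"):
    """Format worked (source, target) pairs, then the unanswered query.

    The prompt ends right after the target-language label, so the most
    natural continuation for the model is the missing translation.
    """
    lines = [f"{src}: {s} → {tgt}: {t}" for s, t in examples]
    lines.append(f"{src}: {query} → {tgt}:")
    return "\n".join(lines)


# Two worked examples followed by the question we actually care about.
prompt = build_few_shot_prompt([("dog", "chien"), ("cat", "chat")], "cheese")
```

Passing an empty `examples` list reduces this to a zero-shot prompt, which makes it easy to compare the two strategies on the same query.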

Chain-of-Thought (CoT)

Chain-of-thought prompting includes a worked reasoning trace in the few-shot examples (Wei et al., 2022) or simply appends a cue such as “Let’s think step by step” (Kojima et al., 2022). Instead of jumping to the answer, the model generates intermediate steps. This substantially improves accuracy over standard prompting on math and logic problems, with the largest gains in large models.

Key insight: Prompting doesn’t change the model. It changes the input distribution to land in a region of token-space where the model already knows how to produce good outputs. It’s cheap and requires zero training — but it has a ceiling. The model can only do what it already learned during pre-training.

Scaling Behavior of Prompting Strategies

[Interactive plot: zero-shot, few-shot, and chain-of-thought performance as a function of model size (B params). Larger models benefit more from sophisticated prompting.]

Why prompting has limits: Few-shot examples consume context window tokens. Chain-of-thought generates extra tokens that slow inference. And no matter how clever your prompt, the model’s behavior is bounded by what pre-training encoded in its weights. To go further, we need to change the weights themselves.

Check: What does chain-of-thought prompting add?

More parameters to the model
A separate reasoning module
Intermediate reasoning steps in the prompt or output

Chapter 2: Instruction Tuning

Prompting is free but fragile. To truly teach a model to follow instructions, we fine-tune it on instruction-response pairs: (prompt, desired response) examples that explicitly demonstrate the behavior we want.

What Is Instruction Tuning?

Instruction tuning (also called supervised fine-tuning, SFT) takes a pre-trained LLM and continues training it on a curated dataset of instructions paired with ideal responses. The loss function is the same as pre-training — cross-entropy on next-token prediction — but the data is radically different: instead of raw web text, it’s task-specific demonstrations.

L_SFT(θ) = − ∑_t log p_θ(y_t | x, y_<t)

Where x is the instruction, y is the target response, and we sum over all tokens in y. We only compute loss on the response tokens, not the instruction tokens — because we want the model to generate good responses, not parrot back instructions.
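To make the response-only masking concrete, here is a small sketch of the loss for a single sequence. It assumes we already have the probability the model assigned to each correct token; the names `sft_loss`, `token_probs`, and `is_response` are illustrative, not from a real training library.

```python
import math


def sft_loss(token_probs, is_response):
    """Sum of -log p over response tokens only.

    token_probs[t]  probability the model gave to the correct token at step t
    is_response[t]  True for response tokens y_t, False for instruction tokens
    """
    return -sum(math.log(p) for p, keep in zip(token_probs, is_response) if keep)


# Two instruction tokens (masked out) followed by two response tokens.
probs = [0.9, 0.8, 0.5, 0.25]
mask = [False, False, True, True]
loss = sft_loss(probs, mask)  # -(log 0.5 + log 0.25) = log 8
```

The instruction tokens still condition the predictions (they are part of the context), they just contribute nothing to the sum.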

Where Does the Data Come From?

Three sources, in roughly decreasing order of quality and cost:

Human-Written

Experts craft Q&A, summaries, style transfers. High quality, expensive. (~10K-100K examples)

↓

Template-Based

Existing labeled datasets (SQuAD, CNN/DM) wrapped in instruction templates: “Summarize the following article:” + article + summary.

↓

AI-Generated

Use an already-instruction-tuned model (GPT-4, Claude) to generate training data. Self-Instruct: seed with 175 tasks, LLM generates thousands more.

FLAN’s discovery: Google’s FLAN (2021) showed that instruction-tuning on a mix of 60+ NLP tasks with templates dramatically improved zero-shot performance on held-out tasks. The model generalized the concept of “follow instructions” beyond the specific tasks it was trained on. More tasks = better generalization.
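The template-based approach can be sketched as follows: one labeled record is wrapped in several instruction phrasings to produce multiple training pairs. The template strings, the record fields, and the function name are assumptions for illustration only.

```python
# Hypothetical instruction templates for a summarization dataset.
SUMMARIZE_TEMPLATES = [
    "Summarize the following article:\n\n{article}",
    "Write a brief summary of this text:\n\n{article}",
]


def to_instruction_pairs(record, templates=SUMMARIZE_TEMPLATES):
    """Turn one labeled (article, summary) record into instruction-response pairs."""
    return [
        {
            "instruction": template.format(article=record["article"]),
            "response": record["summary"],
        }
        for template in templates
    ]


record = {
    "article": "The city council approved a new bike lane on Main Street.",
    "summary": "Council approves Main Street bike lane.",
}
pairs = to_instruction_pairs(record)  # one pair per template
```

Varying the phrasing of the same task is one plausible way to encourage the model to learn the instruction rather than a single surface pattern.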

Concrete Example: InstructGPT

OpenAI’s InstructGPT (Ouyang et al., 2022) fine-tuned GPT-3 models at several sizes (1.3B, 6B, and 175B parameters), starting with supervised fine-tuning on ~13,000 human-written demonstrations and then applying RLHF (Chapter 3). The result: outputs from the 1.3B InstructGPT were preferred by human labelers over those of the 175B base GPT-3. A model more than 100x smaller beat the base model. That’s the power of alignment.

Model	Params	Training Data	Human Preference Win Rate
GPT-3 (base)	175B	300B tokens, raw web	Baseline
InstructGPT	1.3B	13K instruction pairs	Preferred over GPT-3 175B
FLAN-T5 XXL	11B	1.8K tasks, templates	SOTA on many NLP benchmarks

Instruction Tuning Data Pipeline

[Interactive demo: a step-by-step walkthrough of how raw text becomes instruction-tuning data, with each step transforming the data format.]

Check: Why did a 1.3B InstructGPT outperform a 175B base GPT-3?

Fine-tuning on instruction-response pairs aligned the model to follow instructions
The 175B model was poorly trained
Smaller models are always better

Chapter 3: The RLHF Pipeline

Instruction tuning teaches the model to follow instructions, but it can’t capture preferences. Which of two grammatically correct summaries is more helpful? Which response is less toxic? These are judgment calls that supervised loss can’t express — you need a signal that says “response A is better than response B.”

Reinforcement Learning from Human Feedback (RLHF) solves this in three stages. Each stage builds on the previous one, and the entire pipeline transforms a base model into an aligned assistant.

The three stages of RLHF:
Stage 1 — SFT: Supervised fine-tuning on instruction-response demonstrations (Chapter 2).
Stage 2 — Reward Model: Train a separate model to predict which response a human would prefer.
Stage 3 — RL (PPO): Use the reward model as a signal to further fine-tune the policy with reinforcement learning.

Why RL? Why Not Just More Supervised Data?

Supervised fine-tuning requires explicit demonstrations: “given this prompt, produce this response.” But for many tasks, it’s much easier for a human to compare two responses than to write the perfect one. RLHF leverages this asymmetry. Labelers rank outputs; a reward model learns from the rankings; RL optimizes the LLM to maximize that learned reward.

Christiano et al. (2017) demonstrated the concept in Atari and MuJoCo: they trained agents to match human preferences while asking humans for feedback on less than 1% of the agent’s interactions with the environment. The key insight was training a reward model from human comparisons, then using it as the reward signal for RL.

The Data Flow

Stage 1: SFT

Pre-trained LLM + instruction demos → SFT model (π_SFT)

↓

Stage 2: Reward Model

π_SFT generates response pairs → humans rank them → train R_θ

↓

Stage 3: PPO

π_SFT + R_θ → PPO training → π_RLHF (the aligned model)

RLHF Pipeline — Interactive Walkthrough

[Interactive walkthrough: a prompt flows through the three stages in order: SFT, reward scoring, and PPO updates.]

RLHF Dramatically Improves Scaling

Stiennon et al. (2020) showed that RLHF-trained summarization models outperform supervised baselines at every model size. More remarkably, a small RLHF model can match a much larger supervised model — RLHF shifts the entire scaling curve upward.

Check: What are the three stages of the RLHF pipeline, in order?

Reward Model → PPO → SFT
SFT → Reward Model → PPO
PPO → SFT → Reward Model

Chapter 4: Reward Modeling

Stage 2 of RLHF is where human judgment gets encoded into a neural network. The goal: train a model R_θ(x, y) that takes a prompt x and response y and outputs a scalar score — higher means “a human would prefer this response.”

Collecting Preferences

The SFT model generates two (or more) responses to each prompt. Human labelers see both and pick the better one, evaluating on criteria like helpfulness, harmlessness, and truthfulness. This produces comparison data: (x, y_w, y_l) where y_w is the preferred (“winner”) response and y_l is the rejected (“loser”) response.

The Bradley-Terry Model

We model human preferences using the Bradley-Terry model: the probability that a human prefers response y₁ over y₂ depends on the difference in their reward scores, passed through a sigmoid:

P(y₁ ≻ y₂ | x) = σ( R_θ(x, y₁) − R_θ(x, y₂) )

Where σ(z) = 1/(1 + e^−z) is the sigmoid function. If R gives y₁ a much higher score than y₂, the sigmoid pushes the probability toward 1 — the model is confident y₁ is preferred.

Deriving the Loss

We want to maximize the probability of the observed human preferences. Taking the negative log-likelihood:

L_RM(θ) = − E_{(x, y_w, y_l)} [ log σ( R_θ(x, y_w) − R_θ(x, y_l) ) ]

Step by step: for each comparison triplet, compute the reward difference Δ = R(x, y_w) − R(x, y_l). Pass through sigmoid to get a probability. Take −log. Average over the dataset. Minimizing this loss pushes R_θ to assign higher scores to preferred responses.

Intuition: The reward model is a preference predictor. It doesn’t know what “good” means in the abstract — it learns a scoring function that reproduces human comparison judgments. Think of it as distilling the labeler’s taste into a differentiable function.

Concrete Worked Example

Suppose for prompt “Explain gravity to a 5-year-old”, the reward model gives:

Response	R_θ(x, y)
y_w: “Things fall down because Earth pulls them, like a magnet for everything!”	2.3
y_l: “Gravity is the curvature of spacetime caused by mass-energy.”	−0.5

Δ = 2.3 − (−0.5) = 2.8. σ(2.8) = 0.943. Loss = −log(0.943) = 0.059. The model is already quite confident the simple answer is preferred. If the scores were closer, the loss would be higher, pushing the model to separate them more.
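The loss and the worked example can be checked with a few lines of plain Python. This is a sketch with scalar rewards passed in directly; in a real reward model the scores would come from a network, and the function names here are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_loss(r_w, r_l):
    """Bradley-Terry loss for one comparison: -log sigma(R(x, y_w) - R(x, y_l))."""
    return -math.log(sigmoid(r_w - r_l))


def rm_loss(pairs):
    """Average the per-comparison loss over a list of (r_w, r_l) score pairs."""
    return sum(bt_loss(r_w, r_l) for r_w, r_l in pairs) / len(pairs)


example = bt_loss(2.3, -0.5)   # the gravity example: gap of 2.8
close = bt_loss(0.2, 0.0)      # nearly tied scores give a larger loss
tied = bt_loss(1.0, 1.0)       # no gap at all: loss is log 2
```

When the two scores are equal the model is at chance (probability 0.5), so the loss is log 2; it shrinks toward zero only as the preferred response pulls ahead.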

Reward Model Loss Explorer

[Interactive demo: the Bradley-Terry loss as a function of the rewards R(y_w) for the preferred response and R(y_l) for the rejected response. The loss falls as the gap R(y_w) − R(y_l) widens and rises as it shrinks.]

Check: What does the reward model learn to predict?

Which response a human would prefer in a pairwise comparison
The next token in a sequence
The grammatical correctness of a response

Chapter 5: PPO for Language Models

We now have a reward model that scores responses. From here on we write it as R_φ, with its own parameters φ, so that θ can denote the policy’s parameters. Stage 3 uses this as the reward signal in a reinforcement learning loop. The algorithm of choice is Proximal Policy Optimization (PPO), adapted for language generation.

Framing Language Generation as RL

Here’s the mapping from language to RL:

RL Concept	Language Equivalent
Policy π_θ	The LLM itself — maps prompt to token distribution
State	The prompt + tokens generated so far
Action	The next token to generate
Reward	R_φ(x, y) evaluated on the complete response y
Episode	Generating one complete response to a prompt
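The mapping in the table can be made concrete with a toy rollout loop. Everything here is a hypothetical stand-in: `toy_policy` follows a fixed script instead of sampling from an LLM, and `toy_reward` simply counts tokens instead of calling a trained reward model. The point is only to show where state, action, reward, and episode appear.

```python
SCRIPT = ["plants", "convert", "sunlight"]  # tokens the toy policy will emit
EOS = "<eos>"


def toy_policy(state):
    """Stand-in for the LLM: return the first scripted token not yet in the state."""
    for token in SCRIPT:
        if token not in state:
            return token
    return EOS


def toy_reward(prompt, response):
    """Stand-in for the reward model: scores only the finished response."""
    return float(len(response))


def generate_episode(policy, reward_model, prompt, max_tokens=8):
    state = list(prompt)              # state = prompt + tokens generated so far
    response = []
    for _ in range(max_tokens):
        action = policy(state)        # action = the next token
        if action == EOS:
            break
        response.append(action)
        state.append(action)
    # The reward arrives once, at the end of the episode.
    return response, reward_model(prompt, response)


response, reward = generate_episode(toy_policy, toy_reward, ["What", "is", "photosynthesis?"])
```

Note that the reward is sparse: individual tokens receive no score of their own, which is part of what makes credit assignment in this setting hard.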

The PPO Objective for LLMs

Standard PPO maximizes a clipped surrogate objective. For LLMs, we add a critical constraint: a KL penalty that prevents the policy from drifting too far from the SFT model. Without it, the policy would find degenerate outputs that exploit the reward model (called reward hacking).

J(θ) = E_{x~D, y~π_θ} [ R_φ(x, y) − β · D_KL( π_θ(y|x) ‖ π_SFT(y|x) ) ]

Let’s unpack each term:

R_φ(x, y) — the reward model’s score for the generated response. We want this to be high.
D_KL(π_θ ‖ π_SFT) — the KL divergence between the current policy and the SFT reference. Measures how much the policy has changed. We want this to stay small.
β — the KL penalty coefficient. A trust slider: high β keeps the model close to SFT (conservative); low β lets it explore more aggressively.

Why the KL Penalty Is Essential

Without the KL penalty, the model quickly discovers reward-hacking strategies: repetitive phrases the reward model scores highly, or adversarial outputs that exploit blind spots. The KL penalty says “maximize reward, but don’t become a different model.” It’s like a leash — the policy can explore, but it can’t run away.

Reward hacking in practice: Without KL regularization, RLHF-trained summarizers learned to generate long, repetitive summaries with confident-sounding phrases — the reward model scored them highly, but humans rated them worse than the SFT baseline. The model gamed the proxy instead of satisfying the true objective.

Worked Example: One PPO Update

Given prompt x = “What is photosynthesis?”, the policy generates y = “Plants convert sunlight into energy using chlorophyll.”

Reward model gives R(x, y) = 1.8. The SFT model would have assigned this response probability p_SFT(y|x) = 0.03. The current policy gives p_θ(y|x) = 0.07.

KL divergence (simplified, per-sequence): D_KL ≈ log(0.07/0.03) = log(2.33) ≈ 0.85.

With β = 0.2: J = 1.8 − 0.2 × 0.85 = 1.8 − 0.17 = 1.63.

The policy gets credit for the high reward, with a small penalty for diverging from SFT.
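The arithmetic above can be reproduced directly. As in the worked example, this sketch uses the log-probability ratio of a single sampled response as a simplified estimate of the KL term; the function name is illustrative.

```python
import math


def kl_penalized_objective(reward, p_policy, p_ref, beta):
    """Return (J, kl_estimate) for one sampled response.

    kl_estimate is log(p_policy / p_ref), a single-sample estimate of
    D_KL(pi_theta || pi_SFT), not the full expectation over responses.
    """
    kl_estimate = math.log(p_policy / p_ref)
    return reward - beta * kl_estimate, kl_estimate


j, kl = kl_penalized_objective(reward=1.8, p_policy=0.07, p_ref=0.03, beta=0.2)

# A larger beta penalizes the same drift more heavily.
j_strict, _ = kl_penalized_objective(reward=1.8, p_policy=0.07, p_ref=0.03, beta=1.0)
```

If the policy assigns the response the same probability as the SFT model, the estimate is zero and J equals the raw reward.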

PPO Reward vs KL Tradeoff

[Interactive demo: how the reward R(x, y), the KL divergence, and the penalty coefficient β combine into the PPO objective value.]

Check: Why is the KL penalty necessary in PPO for LLMs?

To speed up training
To prevent the policy from reward-hacking by drifting too far from the SFT model
To reduce the number of parameters

Chapter 6: Constitutional AI & RLAIF

RLHF works, but it’s expensive. Collecting human preference labels requires hiring annotators, designing interfaces, running quality control. Anthropic’s Constitutional AI (CAI) asks: what if the AI itself could provide the feedback?

The Constitution

Instead of 10,000+ human comparison labels, CAI uses ~10 human-written principles — the “constitution.” These are natural-language rules like:

Example principles:
1. “Choose the response that is most helpful, while being safe and avoiding harmful content.”
2. “Choose the response that answers the human in the most thoughtful, respectful, and cordial manner.”
3. “Choose the response that is least likely to be used for illegal or harmful purposes.”

The Two-Phase Pipeline

CAI works in two phases that mirror RLHF but replace humans with AI:

Phase 1: Supervised Self-Critique

The LLM generates a response. Then it critiques its own response using a constitutional principle. Then it revises the response to address the critique. This critique-revision loop can be repeated multiple times. The revised responses become the SFT training data.

Generate

LLM produces initial response to prompt

↓

Critique

“Identify ways this response is harmful, unethical, or dangerous.”

↓

Revise

“Please rewrite to remove harmful content.”

↻ repeat 1–4 times

SFT

Train on (prompt, final revised response) pairs
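The generate, critique, revise loop above can be sketched as a short function. This is an outline under stated assumptions: `model` is any callable that maps a text prompt to a text completion (here a stub, not a real LLM client), and principles are cycled in order for simplicity, which is a choice made for this example.

```python
def stub_model(text):
    """Stand-in for an LLM call so the loop is runnable without a model."""
    return f"<model output for a {len(text)}-character prompt>"


def critique_and_revise(model, prompt, principles, rounds=2):
    response = model(prompt)                            # Generate
    for i in range(rounds):
        principle = principles[i % len(principles)]     # assumption: cycle in order
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Principle: {principle}\n"
            "Identify ways this response is harmful, unethical, or dangerous."
        )                                               # Critique
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Please rewrite to remove harmful content."
        )                                               # Revise
    # The final revision becomes the target of one SFT training pair.
    return {"prompt": prompt, "response": response}


principles = ["Choose the response that is least likely to be used for illegal or harmful purposes."]
sft_pair = critique_and_revise(stub_model, "How do I pick a lock?", principles, rounds=2)
```

Each round costs two extra model calls (one critique, one revision), so the total is 1 + 2 × rounds calls per training example.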

Phase 2: RLAIF — RL from AI Feedback

Instead of human labelers comparing responses, the AI compares them using the constitution. For each prompt, generate two responses, ask the AI “which response better follows this principle?” and use the AI’s judgment to train a preference model. Then run PPO as before, but with the AI-trained preference model.

Why CAI works: Bai et al. (2022) showed that harmlessness improves monotonically with more critique-revision rounds. Helpfulness dips slightly but the combined score (Helpfulness + Harmlessness) always improves. And by using chain-of-thought in the AI feedback step, quality improves further — the AI explains why one response is better before choosing.

Scaling Advantages

Dimension	RLHF	CAI / RLAIF
Labels needed	~50K–100K human comparisons	~10 principles + AI generates rest
Cost	$100K–$1M+ for labelers	Compute cost only
Scalability	Limited by human bandwidth	Parallelizable, auto-scalable
Transparency	Labeler disagreements are opaque	Principles are explicit and auditable
Quality ceiling	Bounded by labeler expertise	Can leverage stronger AI as judges

Constitutional AI Critique-Revision Loop

[Interactive demo: successive critique-revision rounds (0 to 4). Harmlessness improves with each round while helpfulness stays relatively stable.]

Check: What replaces human labelers in Constitutional AI?

A separate smaller model
Random sampling
The AI itself, guided by human-written constitutional principles

Chapter 7: Parameter-Efficient Fine-tuning

Fine-tuning a 70B model means storing 70B parameters, their gradients, and optimizer states (Adam keeps 2 extra values per parameter). With typical mixed-precision training that is roughly 16 bytes per parameter, or over 1 TB for a single training run, several times the ~140 GB needed for the fp16 weights alone. And if you want a different fine-tuned model for each task (summarization, coding, medical Q&A), you need a separate 70B copy for each. This doesn’t scale.
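A back-of-the-envelope version of this estimate is sketched below. The byte counts are assumptions for a common mixed-precision Adam setup (fp16 weights and gradients, an fp32 master copy, and two fp32 Adam moments); other setups will differ, and activation memory is ignored entirely.

```python
def full_finetune_memory_gb(
    n_params,
    weight_bytes=2,   # fp16 weights (assumed)
    grad_bytes=2,     # fp16 gradients (assumed)
    master_bytes=4,   # fp32 master copy of the weights (assumed)
    adam_bytes=8,     # two fp32 Adam moments per parameter (assumed)
):
    """Rough training-state memory in GB, excluding activations."""
    per_param = weight_bytes + grad_bytes + master_bytes + adam_bytes
    return n_params * per_param / 1e9


weights_only_gb = 70e9 * 2 / 1e9                 # fp16 weights alone
training_gb = full_finetune_memory_gb(70e9)      # weights + grads + optimizer state
```

Under these assumptions the training state comes to 16 bytes per parameter, about eight times the memory of the fp16 weights alone.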

Parameter-Efficient Fine-Tuning (PEFT) solves this by updating only a tiny fraction of the parameters, often well under 1%, while freezing the rest. The frozen parameters require no gradients and no optimizer states, which removes most of the training-memory overhead; the frozen weights and the activations still have to fit in memory.

Three Families of PEFT

Category	Idea	Examples
Selective	Freeze most layers, update only a few (e.g., top layers)	Top-K layers, BitFit
Additive	Add small trainable modules or vectors, freeze everything else	Adapters, Prefix Tuning, Prompt Tuning, (IA)³
Reparameterization	Express weight updates as low-rank matrices	LoRA

LoRA: Low-Rank Adaptation

LoRA (Hu et al., 2021) is the most widely used PEFT method. The key insight: fine-tuning weight changes have low intrinsic rank. Instead of updating a full d×d weight matrix W, LoRA decomposes the update into two small matrices:

W′ = W + ΔW = W + B · A

Where A is [r × d] and B is [d × r], with rank r ≪ d (typically r = 4 to 64, while d = 4096 to 12288).

Deriving the Parameter Savings

A full weight matrix W has d × d parameters. The LoRA update BA has (d × r) + (r × d) = 2dr parameters. The compression ratio:

ratio = 2dr / d² = 2r / d

For d = 4096 and r = 8: ratio = 16 / 4096 = 0.39%. We’re updating less than half a percent of the parameters.
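The same calculation as a small helper, useful for trying other values of d and r. The function name is illustrative, and the counts are for a single square d × d matrix as in the derivation.

```python
def lora_param_counts(d, r):
    """Return (full, lora, ratio) for one d x d matrix with a rank-r LoRA update."""
    full = d * d          # parameters in the frozen matrix W
    lora = 2 * d * r      # parameters in B (d x r) plus A (r x d)
    return full, lora, lora / full


full, lora, ratio = lora_param_counts(d=4096, r=8)
# full = 16,777,216   lora = 65,536   ratio = 0.00390625 (about 0.39%)
```

Because the ratio is 2r / d, doubling the rank doubles the trainable parameters, while a wider model makes the same rank proportionally cheaper.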

The Forward Pass

During training: h = Wx + BAx. The frozen W handles the bulk of computation. BA adds a small correction learned during fine-tuning. During inference, we can merge the matrices: W_LoRA = W + BA. After merging, the forward pass is just h = W_LoRAx — zero additional latency.
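The claim that merging adds no latency rests on a simple identity: Wx + B(Ax) equals (W + BA)x. The sketch below checks it on a tiny example with d = 3 and r = 1, using plain Python lists so it has no dependencies; the matrices are arbitrary illustrative values.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def matmul(X, Y):
    """Matrix-matrix product for list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]  # frozen d x d weights
B = [[1.0], [0.0], [2.0]]                                # d x r
A = [[0.5, -1.0, 2.0]]                                   # r x d
x = [1.0, 2.0, 3.0]

# Training-time forward pass: two branches, summed.
h_train = [w + l for w, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Inference after merging: fold BA into W once, then a single matvec.
BA = matmul(B, A)
W_merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
h_merged = matvec(W_merged, x)
```

Both paths give the same output, so the merged model has exactly the shape and cost of the original layer.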

Task switching with LoRA: Since only A and B change per task, you store one frozen base model and swap tiny LoRA adapters. A 70B model needs ~140GB for its fp16 weights, while each LoRA adapter (r=8) is only tens of megabytes. A single deployment of the base model can therefore serve hundreds or thousands of tasks by hot-swapping adapters.

Prompt Tuning: Soft Prompts

Prompt tuning (Lester et al., 2021) prepends m learnable “soft prompt” vectors to the input embeddings. Only these m × e parameters are trained (where e is the embedding dimension). The entire model is frozen. At scale (10B+ params), prompt tuning matches full fine-tuning performance while training only ~0.01% as many parameters.

P_{θ_p}(Y | p₁, p₂, …, p_m, x₁, x₂, …, x_t)

Trainable parameters: m × e. For m = 20 and e = 4096, that’s 81,920 parameters — compared to billions in the full model.
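A minimal sketch of what "prepending soft prompt tokens" means at the embedding level, using plain lists in place of tensors. The initialization scale and the function names are assumptions for illustration; in practice the soft prompt would be a trainable tensor updated by the optimizer.

```python
import random


def init_soft_prompt(m, e, seed=0):
    """Create m trainable vectors of dimension e (small random init, assumed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(e)] for _ in range(m)]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Sequence fed to the frozen model: [p_1..p_m, x_1..x_t]."""
    return soft_prompt + token_embeddings


m, e = 20, 4096
soft_prompt = init_soft_prompt(m, e)
n_trainable = len(soft_prompt) * len(soft_prompt[0])    # m * e

token_embeddings = [[0.0] * e for _ in range(5)]        # stand-in for 5 embedded input tokens
sequence = prepend_soft_prompt(soft_prompt, token_embeddings)
```

The soft prompt occupies m positions of the context window, so the effective sequence length grows from t to m + t.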

Prefix Tuning and P-Tuning v2

Prefix tuning (Li & Liang, 2021) extends prompt tuning to deeper layers: instead of only prepending tokens at the input, it injects learnable key-value pairs at every attention layer. P-Tuning v2 showed this approach matches full fine-tuning on smaller models where prompt tuning alone falls short.

LoRA: Low-Rank Decomposition Explorer

[Interactive demo: parameter count and compression ratio as functions of the model dimension d and the LoRA rank r. The frozen weight matrix is shown in blue and the trainable LoRA matrices in orange.]

python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, linear, r=8, alpha=16):
        super().__init__()
        self.linear = linear          # original layer, frozen below
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # [r x d_in], small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # [d_out x r], zero init so BA = 0 at start
        self.scale = alpha / r        # scaling factor applied to the LoRA branch
        for p in self.linear.parameters():
            p.requires_grad = False   # freeze both weight and bias

    def forward(self, x):
        base = self.linear(x)         # W x + b (frozen)
        lora = (x @ self.A.T) @ self.B.T  # B A x (trainable)
        return base + lora * self.scale

Where to apply LoRA: Hu et al. found that, for a fixed parameter budget, applying LoRA to W_q and W_v in self-attention gave the best results in their experiments. Applying it to all attention matrices (Q, K, V, O) also works. The rank r can be surprisingly small: r = 4 often suffices.

Check: What is the key insight behind LoRA?

Large models train faster
Fine-tuning weight updates have low intrinsic rank, so they can be expressed as small matrix products
Removing layers makes models better

Chapter 8: Connections & What’s Next

We’ve traced the full arc from a raw pre-trained model to an aligned, efficient assistant. Here’s the complete development flow:

Pre-training

Next-token prediction on trillions of tokens (unsupervised)

↓

Instruction Tuning

SFT on instruction-response pairs (supervised)

↓

RLHF / RLAIF

Reward model + PPO with KL penalty (reinforcement learning)

↓

PEFT (LoRA)

Efficient task-specific adaptation (typically well under 1% of params)

↓

Evaluation

Zero-shot, few-shot, CoT prompting on downstream tasks

Cheat Sheet

Technique	What It Does	Key Formula	Data Needed
Zero-shot	Direct prompting, no examples	—	None
Few-shot	In-context examples in prompt	—	5–50 examples
Chain-of-Thought	Step-by-step reasoning in output	—	CoT exemplars
SFT / Instruction Tuning	Supervised training on (instruction, response)	L = −∑ log p(y_t | x, y_<t)	10K–100K pairs
Reward Model	Learns human preference scoring	L = −log σ(R(y_w) − R(y_l))	50K+ comparisons
PPO + KL	RL fine-tuning with reward + KL constraint	J = R(y) − β·D_KL(π ‖ π_SFT)	Reward model
Constitutional AI	AI self-critique + RLAIF	Same as RLHF, AI-labeled	~10 principles
LoRA	Low-rank weight updates	W′ = W + BA	Same as SFT
Prompt Tuning	Learnable soft prompt tokens	P(Y | p_1..m, x_1..t)	Same as SFT

Method Comparison

Method	Params Updated	Memory Cost	Quality Ceiling	Best For
Full Fine-tuning	100%	10× model	Highest	Unlimited budget
LoRA (r=8)	~0.4%	~1.1× model	Near full FT	Most production use
Prompt Tuning	~0.01%	~1× model	Good at scale	Many tasks, one model
Prefix Tuning	~0.1%	~1× model	Better for small models	Generation tasks
Prompting only	0%	Inference only	Limited by pre-training	Quick experiments

What We Didn’t Cover

DPO (Direct Preference Optimization) — skips the reward model entirely, directly optimizes the policy from preference data. Simpler than PPO, increasingly popular.
QLoRA — combines LoRA with 4-bit quantization, enabling fine-tuning of 65B models on a single 48GB GPU.
Mixture of Experts + PEFT — routing different inputs to different LoRA adapters.
RLHF at scale — distributed PPO across hundreds of GPUs, managing reward model staleness.

“What I cannot create, I do not understand.” — Richard Feynman. You now have every formula needed to implement the full fine-tuning pipeline: SFT loss, Bradley-Terry reward model, PPO with KL penalty, and LoRA adapters. The gap between a base model and an aligned assistant is not magic — it’s three stages of carefully designed optimization.

Check: Which PEFT method allows zero additional inference latency after merging?

LoRA (merge W + BA into a single matrix)
Prompt Tuning (soft tokens always prepended)
Prefix Tuning (injects at every layer)
