Pre-trained LLMs predict the next token. Fine-tuning makes them follow instructions. From RLHF to LoRA — every technique for turning a base model into a useful assistant.
You’ve just finished training a language model on a trillion tokens from the internet. You feed it a prompt: “Translate cheese from English to French.” What does it do? It autocompletes the pattern — spitting out more translation examples: “Translate cheese from English to Spanish. Translate cheese from French to English.”
It doesn’t answer your question. It continues the document. That’s all pre-training taught it to do: predict the next token given everything before it. The model is a brilliant pattern-completer but a terrible assistant.
This is the alignment gap: the difference between what the model learned to do (predict tokens) and what we want it to do (follow instructions helpfully, harmlessly, and honestly). Closing this gap is the entire subject of fine-tuning.
Before we modify any weights, there’s a cheaper trick: change the input. By carefully designing the prompt, we can steer a pre-trained model toward useful behavior without training at all. This family of techniques is called prompting.
Zero-shot prompting: just ask the question directly, as in “Translate ‘cheese’ to French.” No examples, no context. Large models (100B+ parameters) can often handle this thanks to the sheer breadth of patterns seen during pre-training. Smaller models struggle because they haven’t seen enough instruction-like text to recognize the intent.
Few-shot prompting: prefix the question with worked examples, as in “English: dog → French: chien. English: cat → French: chat. English: cheese → French:” The model sees the pattern and completes it correctly. GPT-3 showed that few-shot performance improves dramatically with scale: a 175B model given 5 examples outperforms a 1.3B model given 100.
Chain-of-thought prompting adds a trigger phrase such as “Let’s think step by step” (the zero-shot variant from Kojima et al., 2022) or includes a worked reasoning trace in the few-shot examples (Wei et al., 2022). Instead of jumping to the answer, the model generates intermediate steps. Wei et al. showed this unlocks reasoning abilities that flat prompting misses entirely, especially on math and logic problems.
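The three prompt styles above differ only in how the input string is assembled. A minimal sketch, assuming hypothetical helper names (`zero_shot`, `few_shot`, `chain_of_thought`) and the translation example used in this section:

```python
def zero_shot(word: str) -> str:
    """Ask directly, with no examples."""
    return f"Translate '{word}' to French."


def few_shot(word: str, examples: list[tuple[str, str]]) -> str:
    """Prefix worked examples so the model can complete the pattern."""
    shots = " ".join(f"English: {en} → French: {fr}." for en, fr in examples)
    return f"{shots} English: {word} → French:"


def chain_of_thought(question: str) -> str:
    """Append a reasoning trigger so the model writes out intermediate steps."""
    return f"Q: {question}\nA: Let's think step by step."


examples = [("dog", "chien"), ("cat", "chat")]
prompt = few_shot("cheese", examples)
```

None of these functions touch model weights; the resulting string is simply what gets sent to the pre-trained model.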
Across zero-shot, few-shot, and chain-of-thought prompting, larger models benefit more from sophisticated prompting.
Prompting is free but fragile. To truly teach a model to follow instructions, we fine-tune it on instruction-response pairs: (prompt, desired response) examples that explicitly demonstrate the behavior we want.
Instruction tuning (also called supervised fine-tuning, SFT) takes a pre-trained LLM and continues training it on a curated dataset of instructions paired with ideal responses. The loss function is the same as pre-training — cross-entropy on next-token prediction — but the data is radically different: instead of raw web text, it’s task-specific demonstrations.
$$L = -\sum_{t=1}^{|y|} \log p(y_t \mid x, y_{<t})$$

Where x is the instruction, y is the target response, and the sum runs over all tokens in y. We only compute loss on the response tokens, not the instruction tokens, because we want the model to generate good responses, not parrot back instructions.
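The response-only loss can be sketched in a few lines. This is an illustration rather than a training loop: `sft_loss` is a hypothetical helper, and the per-token log-probabilities are assumed to come from some model's forward pass.

```python
import math


def sft_loss(token_log_probs, is_response):
    """Negative log-likelihood summed over response tokens only.

    token_log_probs[t] is log p(token_t | all previous tokens).
    is_response[t] is True for response tokens, False for instruction tokens.
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked)


# Two instruction tokens (masked out) followed by two response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)  # -(log 0.5 + log 0.25) = log 8
```

Many implementations average over the response tokens instead of summing; that only rescales the loss.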
Three sources, in order of quality and cost:
OpenAI’s InstructGPT fine-tuned GPT-3 base models of several sizes, starting with supervised fine-tuning on ~13,000 human-written instruction-response pairs and then applying the RLHF stages described later. The result: outputs from the 1.3B InstructGPT were preferred by human labelers over outputs from the 175B base GPT-3. A fine-tuned model with over 100× fewer parameters beat the far larger base model. That’s the power of alignment.
| Model | Params | Training Data | Human Preference Win Rate |
|---|---|---|---|
| GPT-3 (base) | 175B | 300B tokens, raw web | Baseline |
| InstructGPT | 1.3B | 13K instruction pairs | Preferred over GPT-3 175B |
| FLAN-T5 XXL | 11B | 1.8K tasks, templates | SOTA on many NLP benchmarks |
Instruction tuning teaches the model to follow instructions, but it can’t capture preferences. Which of two grammatically correct summaries is more helpful? Which response is less toxic? These are judgment calls that supervised loss can’t express — you need a signal that says “response A is better than response B.”
Reinforcement Learning from Human Feedback (RLHF) solves this in three stages. Each stage builds on the previous one, and the entire pipeline transforms a base model into an aligned assistant.
Supervised fine-tuning requires explicit demonstrations: “given this prompt, produce this response.” But for many tasks, it’s much easier for a human to compare two responses than to write the perfect one. RLHF leverages this asymmetry. Labelers rank outputs; a reward model learns from the rankings; RL optimizes the LLM to maximize that learned reward.
Christiano et al. (2017) proved the concept in Atari and MuJoCo: they trained agents to match human preferences while asking humans for feedback on less than 1% of the agent’s interactions with the environment. The key insight was training a reward model from human rankings, then using it as a dense reward signal for RL.
The pipeline processes a prompt through three stages: SFT, reward scoring, and PPO updates.
Stiennon et al. (2020) showed that RLHF-trained summarization models outperform supervised baselines at every model size. More remarkably, a small RLHF model can match a much larger supervised model — RLHF shifts the entire scaling curve upward.
Stage 2 of RLHF is where human judgment gets encoded into a neural network. The goal: train a model Rθ(x, y) that takes a prompt x and response y and outputs a scalar score — higher means “a human would prefer this response.”
The SFT model generates two (or more) responses to each prompt. Human labelers see both and pick the better one, evaluating on criteria like helpfulness, harmlessness, and truthfulness. This produces comparison data: (x, yw, yl) where yw is the preferred (“winner”) response and yl is the rejected (“loser”) response.
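When labelers rank more than two responses, one ranking can be expanded into several comparison triplets. A minimal sketch, assuming a best-first ranking with no ties; `ranking_to_comparisons` is a hypothetical helper name:

```python
from itertools import combinations


def ranking_to_comparisons(prompt, ranked_responses):
    """Expand a best-first ranking into (x, y_w, y_l) triplets.

    combinations() keeps the input order, so the first element of each
    pair is always the higher-ranked (preferred) response.
    """
    return [
        (prompt, winner, loser)
        for winner, loser in combinations(ranked_responses, 2)
    ]


triplets = ranking_to_comparisons("Explain gravity", ["A", "B", "C"])
```

A ranking of K responses yields K(K-1)/2 triplets, so three ranked responses give three comparisons.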
We model human preferences using the Bradley-Terry model: the probability that a human prefers response y1 over y2 depends on the difference in their reward scores, passed through a sigmoid:

$$P(y_1 \succ y_2 \mid x) = \sigma\big(R_\theta(x, y_1) - R_\theta(x, y_2)\big)$$

Where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function. If R gives y1 a much higher score than y2, the sigmoid pushes the probability toward 1: the model is confident y1 is preferred.
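We want to maximize the probability of the observed human preferences. Taking the negative log-likelihood gives the loss:

$$L(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\Big[\log \sigma\big(R_\theta(x, y_w) - R_\theta(x, y_l)\big)\Big]$$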
Step by step: for each comparison triplet, compute the reward difference Δ = R(x, yw) − R(x, yl). Pass through sigmoid to get a probability. Take −log. Average over the dataset. Minimizing this loss pushes Rθ to assign higher scores to preferred responses.
Suppose for prompt “Explain gravity to a 5-year-old”, the reward model gives:
| Response | Rθ(x, y) |
|---|---|
| yw: “Things fall down because Earth pulls them, like a magnet for everything!” | 2.3 |
| yl: “Gravity is the curvature of spacetime caused by mass-energy.” | −0.5 |
Δ = 2.3 − (−0.5) = 2.8. σ(2.8) = 0.943. Loss = −log(0.943) = 0.059. The model is already quite confident the simple answer is preferred. If the scores were closer, the loss would be higher, pushing the model to separate them more.
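The worked example can be checked numerically. A minimal sketch using only the standard library; the function names are illustrative, and the reward scores are passed in as plain numbers rather than produced by a real reward model:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def preference_probability(reward_winner, reward_loser):
    """Bradley-Terry probability that the winner is preferred."""
    return sigmoid(reward_winner - reward_loser)


def bradley_terry_loss(comparisons):
    """Mean of -log sigma(R_w - R_l) over (reward_winner, reward_loser) pairs."""
    losses = [-math.log(preference_probability(rw, rl)) for rw, rl in comparisons]
    return sum(losses) / len(losses)


p = preference_probability(2.3, -0.5)        # about 0.943
loss = bradley_terry_loss([(2.3, -0.5)])     # about 0.059
close_loss = bradley_terry_loss([(0.5, 0.4)])  # scores nearly tied, larger loss
```

When the two scores are nearly tied, the probability sits close to 0.5 and the loss approaches log 2 ≈ 0.693.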
We now have a reward model Rθ that scores responses. Stage 3 uses this as the reward signal in a reinforcement learning loop. The algorithm of choice is Proximal Policy Optimization (PPO), adapted for language generation.
Here’s the mapping from language to RL:
| RL Concept | Language Equivalent |
|---|---|
| Policy πθ | The LLM itself — maps prompt to token distribution |
| State | The prompt + tokens generated so far |
| Action | The next token to generate |
| Reward | Rθ(x, y) evaluated on the complete response y |
| Episode | Generating one complete response to a prompt |
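The mapping in the table can be made concrete with a toy rollout. This is a sketch only: the policy is a deterministic stand-in (a real LLM samples from a token distribution), and `rollout`, `scripted_policy`, and `length_reward` are hypothetical names.

```python
def rollout(policy, reward_fn, prompt, max_tokens=20, eos="<eos>"):
    """One episode: generate tokens until EOS, then score the whole response."""
    response = []
    for _ in range(max_tokens):
        state = (prompt, tuple(response))  # state: prompt + tokens so far
        action = policy(state)             # action: the next token
        if action == eos:
            break
        response.append(action)
    reward = reward_fn(prompt, response)   # reward: given once, on the full response
    return response, reward


SCRIPT = ["Plants", "use", "sunlight", "<eos>"]


def scripted_policy(state):
    """Stand-in policy that replays a fixed token sequence."""
    _, generated = state
    return SCRIPT[len(generated)]


def length_reward(prompt, response):
    """Stand-in reward model: scores a response by its length."""
    return float(len(response))


response, reward = rollout(scripted_policy, length_reward, "What is photosynthesis?")
```

Note that the reward arrives only at the end of the episode, which is what makes credit assignment across individual tokens difficult.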
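Standard PPO maximizes a clipped surrogate objective. For LLMs, we add a critical constraint: a KL penalty that prevents the policy from drifting too far from the SFT model. Without it, the policy would find degenerate outputs that exploit the reward model (called reward hacking). For a response y sampled from the policy:

$$J(\theta) = R(x, y) - \beta \, D_{\text{KL}}\big(\pi_\theta \,\Vert\, \pi_{\text{SFT}}\big)$$

Let’s unpack each term:

- R(x, y): the reward model’s score for the complete response.
- DKL(πθ || πSFT): how far the current policy’s output distribution has moved from the SFT model’s.
- β: a coefficient that trades off reward against staying close to the SFT model.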
Without the KL penalty, the model quickly discovers reward-hacking strategies: repetitive phrases the reward model scores highly, or adversarial outputs that exploit blind spots. The KL penalty says “maximize reward, but don’t become a different model.” It’s like a leash — the policy can explore, but it can’t run away.
Given prompt x = “What is photosynthesis?”, the policy generates y = “Plants convert sunlight into energy using chlorophyll.”
Reward model gives R(x, y) = 1.8. The SFT model would have assigned this response probability pSFT(y|x) = 0.03. The current policy gives pθ(y|x) = 0.07.
KL divergence (simplified, per-sequence): DKL ≈ log(0.07/0.03) = log(2.33) ≈ 0.85.
With β = 0.2: J = 1.8 − 0.2 × 0.85 = 1.8 − 0.17 = 1.63.
The policy gets credit for the high reward, with a small penalty for diverging from SFT.
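The arithmetic above can be reproduced directly. A minimal sketch, assuming the same simplified per-sequence log-ratio used in the example (a true KL divergence is an expectation over samples); `kl_penalized_objective` is a hypothetical name:

```python
import math


def kl_penalized_objective(reward, p_policy, p_sft, beta):
    """Return (J, kl_estimate) for one sampled response.

    kl_estimate is the single-sample log-ratio log(p_policy / p_sft),
    the simplification used in the worked example.
    """
    kl_estimate = math.log(p_policy / p_sft)
    return reward - beta * kl_estimate, kl_estimate


objective, kl = kl_penalized_objective(reward=1.8, p_policy=0.07, p_sft=0.03, beta=0.2)
```

Raising `beta` shrinks the objective for the same response, which is the leash tightening.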
RLHF works, but it’s expensive. Collecting human preference labels requires hiring annotators, designing interfaces, running quality control. Anthropic’s Constitutional AI (CAI) asks: what if the AI itself could provide the feedback?
Instead of 10,000+ human comparison labels, CAI uses ~10 human-written principles — the “constitution.” These are natural-language rules like:
CAI works in two phases that mirror RLHF but replace humans with AI:
Phase 1 (supervised): the LLM generates a response. Then it critiques its own response using a constitutional principle. Then it revises the response to address the critique. This critique-revision loop can be repeated multiple times. The revised responses become the SFT training data.
Phase 2 (reinforcement learning from AI feedback, RLAIF): instead of human labelers comparing responses, the AI compares them using the constitution. For each prompt, generate two responses, ask the AI “which response better follows this principle?” and use the AI’s judgment to train a preference model. Then run PPO as before, but with the AI-trained preference model.
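The critique-revision loop of the first phase can be sketched as follows. The `generate` argument is a stand-in for any text-in, text-out model call, and the prompt wording is illustrative, not the wording Anthropic used:

```python
from typing import Callable


def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    principle: str,
    rounds: int = 1,
) -> str:
    """Draft a response, then repeatedly critique and revise it."""
    response = generate(prompt)
    for _ in range(rounds):
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```

Each round costs two extra model calls, and the final revised response is what would be kept as an SFT training target.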
| Dimension | RLHF | CAI / RLAIF |
|---|---|---|
| Labels needed | ~50K–100K human comparisons | ~10 principles + AI generates rest |
| Cost | $100K–$1M+ for labelers | Compute cost only |
| Scalability | Limited by human bandwidth | Parallelizable, auto-scalable |
| Transparency | Labeler disagreements are opaque | Principles are explicit and auditable |
| Quality ceiling | Bounded by labeler expertise | Can leverage stronger AI as judges |
Across successive critique-revision rounds, harmlessness improves with each round while helpfulness stays relatively stable.
Fine-tuning a 70B model means storing 70B parameters, their gradients, and optimizer states (Adam keeps 2 extra copies per parameter). That’s easily 10× the model size in memory — 700 GB+ for a single training run. And if you want a different fine-tuned model for each task (summarization, coding, medical Q&A), you need separate 70B copies. This doesn’t scale.
Parameter-Efficient Fine-Tuning (PEFT) solves this by updating only a tiny fraction of the parameters, often less than 0.1%, while freezing the rest. The frozen parameters require no gradients and no optimizer states, so training memory drops from many times the model size to little more than the frozen weights themselves.
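A rough memory estimate shows where the savings come from. This is a back-of-the-envelope sketch: the byte counts are assumptions (2-byte weights and gradients, 12 bytes of Adam state per trainable parameter, a common mixed-precision accounting), activations are ignored, and `training_memory_gb` is a hypothetical helper.

```python
def training_memory_gb(
    n_params,
    trainable_fraction=1.0,
    weight_bytes=2,       # assumed: 16-bit weights
    grad_bytes=2,         # assumed: 16-bit gradients
    optimizer_bytes=12,   # assumed: fp32 master copy + two Adam moments
):
    """Weights + gradients + optimizer state in GB (activations ignored)."""
    weights = n_params * weight_bytes
    trainable = n_params * trainable_fraction
    grads_and_optimizer = trainable * (grad_bytes + optimizer_bytes)
    return (weights + grads_and_optimizer) / 1e9


full = training_memory_gb(70e9)                            # every parameter trained
peft = training_memory_gb(70e9, trainable_fraction=0.001)  # 0.1% trained
```

Under these assumptions the frozen weights dominate the PEFT total, because gradients and optimizer state are only kept for the trainable fraction. Different precision choices change the absolute numbers.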
| Category | Idea | Examples |
|---|---|---|
| Selective | Freeze most layers, update only a few (e.g., top layers) | Top-K layers, BitFit |
| Additive | Add small trainable modules, freeze everything else | Adapters, Prefix Tuning, Prompt Tuning |
| Reparameterization | Express weight updates as low-rank matrices | LoRA, IA3 |
LoRA (Hu et al., 2021) is the most widely used PEFT method. The key insight: fine-tuning weight changes have low intrinsic rank. Instead of updating a full d×d weight matrix W, LoRA decomposes the update into two small matrices:

$$W' = W + \Delta W = W + BA$$

Where A is [r × d] and B is [d × r], with rank r ≪ d (typically r = 4 to 64, while d = 4096 to 12288).
A full weight matrix W has d × d parameters. The LoRA update BA has (d × r) + (r × d) = 2dr parameters. The compression ratio:

$$\frac{2dr}{d^2} = \frac{2r}{d}$$
For d = 4096 and r = 8: ratio = 16 / 4096 = 0.39%. We’re updating less than half a percent of the parameters.
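The same count in code, for a single square weight matrix (`lora_param_counts` is an illustrative name):

```python
def lora_param_counts(d, r):
    """Return (full, lora, ratio) for one d x d weight matrix at rank r."""
    full = d * d        # parameters in W
    lora = 2 * d * r    # parameters in B (d x r) plus A (r x d)
    return full, lora, lora / full


full, lora, ratio = lora_param_counts(d=4096, r=8)
```

For d = 4096 and r = 8 this gives 16,777,216 frozen parameters against 65,536 trainable ones, a ratio of 2r/d = 0.39%.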
During training: h = Wx + BAx. The frozen W handles the bulk of computation. BA adds a small correction learned during fine-tuning. During inference, we can merge the matrices: WLoRA = W + BA. After merging, the forward pass is just h = WLoRAx — zero additional latency.
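The claim that merging adds no latency rests on the two forward passes being identical. A small pure-Python check with d = 3 and r = 1; the matrices are made-up values, and the α/r scaling factor is left out for simplicity:

```python
def matmul(X, Y):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def add(X, Y):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [4.0, 0.0, 1.0]]  # frozen, d x d
B = [[1.0], [0.0], [2.0]]                                # trainable, d x r
A = [[0.5, -1.0, 2.0]]                                   # trainable, r x d
x = [1.0, 2.0, 3.0]

# Training-time path: h = Wx + B(Ax)
h_train = [w + l for w, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Inference-time path: merge once, then a single matrix-vector product
W_merged = add(W, matmul(B, A))
h_merged = matvec(W_merged, x)
```

Both paths produce the same output vector, so the merged matrix can replace the original at deployment.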
Prompt tuning (Lester et al., 2021) prepends m learnable “soft prompt” tokens to the input embedding. Only these m × e parameters are trained (where e is the embedding dimension). The entire model is frozen. At scale (10B+ params), prompt tuning matches full fine-tuning performance with ~0.01% of the trainable parameters.
Trainable parameters: m × e. For m = 20 and e = 4096, that’s 81,920 parameters — compared to billions in the full model.
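In code, the soft prompt is just an extra block of trainable rows placed in front of the frozen token embeddings. A minimal sketch with plain lists standing in for tensors; both function names are hypothetical:

```python
def soft_prompt_param_count(m, e):
    """Trainable parameters for m soft-prompt tokens of dimension e."""
    return m * e


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Concatenate m trainable vectors in front of t frozen token embeddings."""
    return soft_prompt + token_embeddings


count = soft_prompt_param_count(m=20, e=4096)

soft_prompt = [[0.0] * 4 for _ in range(2)]  # m = 2, e = 4 (trainable)
tokens = [[1.0] * 4 for _ in range(3)]       # t = 3, e = 4 (frozen embeddings)
model_input = prepend_soft_prompt(soft_prompt, tokens)
```

The model then sees a sequence of m + t vectors, and only the first m receive gradient updates.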
Prefix tuning (Li & Liang, 2021) extends prompt tuning to deeper layers: instead of only prepending tokens at the input, it injects learnable key-value pairs at every attention layer. P-Tuning v2 showed this approach matches full fine-tuning on smaller models where prompt tuning alone falls short.
```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, linear, r=8, alpha=16):
        super().__init__()
        self.linear = linear  # frozen original layer
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # [r x d_in], small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # [d_out x r], zeros so BA = 0 at start
        self.scale = alpha / r  # scaling factor
        # Freeze every parameter of the wrapped layer (weight and bias).
        for p in self.linear.parameters():
            p.requires_grad = False

    def forward(self, x):
        base = self.linear(x)             # W @ x (frozen)
        lora = (x @ self.A.T) @ self.B.T  # BA @ x (trainable)
        return base + lora * self.scale
```
We’ve traced the full arc from a raw pre-trained model to an aligned, efficient assistant. Here’s the complete development flow:
| Technique | What It Does | Key Formula | Data Needed |
|---|---|---|---|
| Zero-shot | Direct prompting, no examples | — | None |
| Few-shot | In-context examples in prompt | — | 5–50 examples |
| Chain-of-Thought | Step-by-step reasoning in output | — | CoT exemplars |
| SFT / Instruction Tuning | Supervised training on (instruction, response) | $L = -\sum_t \log p(y_t \mid x, y_{<t})$ | 10K–100K pairs |
| Reward Model | Learns human preference scoring | $L = -\log \sigma(R(y_w) - R(y_l))$ | 50K+ comparisons |
| PPO + KL | RL fine-tuning with reward + KL constraint | $J = R(y) - \beta \cdot D_{\text{KL}}(\pi \,\Vert\, \pi_{\text{SFT}})$ | Reward model |
| Constitutional AI | AI self-critique + RLAIF | Same as RLHF, AI-labeled | ~10 principles |
| LoRA | Low-rank weight updates | $W' = W + BA$ | Same as SFT |
| Prompt Tuning | Learnable soft prompt tokens | $P(Y \mid p_{1..m}, x_{1..t})$ | Same as SFT |

And here is how the adaptation methods compare on cost and quality:

| Method | Params Updated | Memory Cost | Quality Ceiling | Best For |
|---|---|---|---|---|
| Full Fine-tuning | 100% | 10× model | Highest | Unlimited budget |
| LoRA (r=8) | ~0.4% | ~1.1× model | Near full FT | Most production use |
| Prompt Tuning | ~0.01% | ~1× model | Good at scale | Many tasks, one model |
| Prefix Tuning | ~0.1% | ~1× model | Better for small models | Generation tasks |
| Prompting only | 0% | Inference only | Limited by pre-training | Quick experiments |