Prompting, LoRA, and adapters — how to customize a 175B model without retraining it.
You have GPT-3. 175 billion parameters. Trained on 300 billion tokens of internet text. It cost roughly $4.6 million in compute to train. Now your company wants it to answer customer support tickets about your product. What do you do?
The obvious answer: fine-tune it. Take the pre-trained weights, run gradient descent on your customer support dataset, update all 175 billion parameters. This works. It also costs about $1.2 million in GPU time per training run, requires 350 GB of GPU memory just to hold the fp16 weights (and far more once gradients and optimizer states are included), and produces a 350 GB checkpoint that you need to store, version, and deploy.
Now imagine you have ten tasks: customer support, legal summarization, code review, medical Q&A, product recommendations, translation, content moderation, data extraction, email drafting, and report generation. Ten full fine-tunes. Ten 350 GB checkpoints. That's 3.5 TB of weight files. $12 million in compute. And every time the base model improves, you redo all ten.
The cost of fine-tuning scales with model size, and not linearly. Larger models need more memory for optimizer states (Adam stores two extra values for every parameter), more compute per gradient step, and longer training runs to converge, so cost grows faster than parameter count.
[Interactive chart: estimated GPU-hours for a single fine-tune, by model size. The jump from 6.7B to 175B is roughly 40x.]
The memory problem is even worse. To fine-tune with Adam, you need to store:
| Component | Size (for 175B model) | Why |
|---|---|---|
| Model weights (fp16) | 350 GB | 2 bytes × 175B params |
| Gradients (fp16) | 350 GB | Same size as weights |
| Adam first moment (fp32) | 700 GB | 4 bytes × 175B params |
| Adam second moment (fp32) | 700 GB | 4 bytes × 175B params |
| Total | ~2.1 TB | Just to train, before activations |
That's 2.1 terabytes of GPU memory. A single A100 has 80 GB. You need at least 27 A100s just to hold the weights, gradients, and optimizer states, and that's before accounting for activations during the forward pass.
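The arithmetic in the table above can be reproduced in a few lines. This is a minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and the byte widths listed in the table; the function name `adam_training_memory_gb` is illustrative and not taken from any library.

```python
import math

def adam_training_memory_gb(n_params: float) -> dict:
    """Estimate memory (decimal GB) for mixed-precision fine-tuning with Adam.

    Assumes fp16 weights and gradients (2 bytes each) and fp32 Adam moments
    (4 bytes each), as in the table above. Activations are not included.
    """
    gb = 1e9
    parts = {
        "weights_fp16": 2 * n_params / gb,
        "grads_fp16": 2 * n_params / gb,
        "adam_first_moment_fp32": 4 * n_params / gb,
        "adam_second_moment_fp32": 4 * n_params / gb,
    }
    parts["total"] = sum(parts.values())
    return parts

mem = adam_training_memory_gb(175e9)
a100s_needed = math.ceil(mem["total"] / 80)  # 80 GB per A100
print(f"total: {mem['total']:.0f} GB, A100s needed: {a100s_needed}")
```

Changing `n_params` shows how quickly the footprint grows: the same formula applied to a 7B model gives 84 GB, which already exceeds a single 80 GB card.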
Let's make the costs concrete. As of 2024, cloud GPU pricing for an 8xA100 node is roughly $25/hour. Fine-tuning GPT-3 175B takes approximately 1,200 GPU-hours on A100 80GB:
| Model | GPU-Hours | Cloud Cost (@$25/hr per 8xA100) | Checkpoint Size (fp32) |
|---|---|---|---|
| GPT-2 (125M) | 0.5 | ~$2 | 0.5 GB |
| GPT-2 XL (1.5B) | 8 | ~$25 | 6 GB |
| LLaMA-7B | 40 | ~$125 | 28 GB |
| LLaMA-13B | 80 | ~$250 | 52 GB |
| LLaMA-65B | 400 | ~$1,250 | 260 GB |
| GPT-3 (175B) | 1,200 | ~$3,750 | 700 GB |
And these are optimistic estimates assuming everything works on the first try. In practice, you experiment with hyperparameters (learning rate, epochs, data mix), requiring 3-10 runs to find a good configuration. Multiply all costs by 5x for realistic budgets. Now imagine doing this for 10 tasks.
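The cost column in the table follows directly from GPU-hours and the node price. A minimal sketch, assuming the $25/hour 8xA100 price quoted above and the 5x multiplier for hyperparameter search; `cloud_cost_usd` is a hypothetical helper name.

```python
def cloud_cost_usd(gpu_hours: float,
                   node_price_per_hour: float = 25.0,
                   gpus_per_node: int = 8) -> float:
    """Convert GPU-hours into cloud cost, billing by whole-node price."""
    node_hours = gpu_hours / gpus_per_node
    return node_hours * node_price_per_hour

single_run = cloud_cost_usd(1200)   # GPT-3 (175B) row of the table
with_search = single_run * 5        # 5x budget for hyperparameter search
print(f"single run: ${single_run:,.0f}, with search: ${with_search:,.0f}")
```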
This economic reality created enormous demand for methods that achieve 90-99% of full fine-tuning accuracy at a fraction of the cost. The field of parameter-efficient fine-tuning (PEFT) answers that demand.
It offers a spectrum of methods, ranging from "no training at all" (prompting and in-context learning) to "train a tiny fraction of parameters" (adapters and LoRA).
The key insight unifying all these methods: the useful information in a fine-tuning update lives in a low-dimensional subspace. If 90% of weights are redundant (lottery ticket), and the fine-tuning update has intrinsic dimensionality of ~200 (out of millions), then we don't need to update all weights. We can work within that small subspace and get nearly the same result.
This lesson walks through each method from top to bottom. By the end, you'll know exactly when to use each one — and why LoRA has become the default for most practitioners.
In 2020, GPT-3 demonstrated something remarkable: you could make it do new tasks without changing a single weight. No gradient updates. No training loop. You just write the task description in the prompt, and the model figures out what to do.
This is in-context learning (ICL) — the model "learns" the task from examples placed in its context window. The word "learns" is in quotes because the weights never change. Everything happens through the attention mechanism attending over the examples you provide.
Zero-shot: Describe the task, provide no examples. The model relies entirely on its pre-training knowledge. "Translate the following English text to French: 'The cat sat on the mat.'" The model has seen enough translation pairs during pre-training to handle this — but accuracy drops sharply on unusual formats or niche domains.
One-shot: Provide exactly one example, then the actual query. "English: 'Hello, how are you?' French: 'Bonjour, comment allez-vous?' English: 'The cat sat on the mat.' French:" The single example teaches the model the desired input-output format.
Few-shot: Provide 4-32 examples. More examples generally improve accuracy, but consume context window tokens. Each example "costs" tokens that could be used for the actual input. With a 2048-token context window, you might fit 10-20 short examples before running out of space for the actual query.
The GPT-3 paper showed dramatic improvements as you go from zero to few-shot, especially on tasks requiring specific output formats. On the SuperGLUE benchmark, zero-shot GPT-3 scored 55.4, while 32-shot GPT-3 scored 71.8 — a 16-point jump without touching the weights.
[Interactive demo: zero-shot, one-shot, and few-shot prompts side by side. The prompt grows and accuracy rises while the model weights stay frozen.]
Why in-context learning works is still debated, but the leading theory is that during pre-training, GPT-3 encountered millions of sequences where context implied a task. Wikipedia articles implicitly teach "continue writing about this topic." StackOverflow threads implicitly teach "answer the question." Translation corpora implicitly teach "translate between languages."
When you provide few-shot examples, you're activating circuits the model already learned during pre-training. The examples don't teach new knowledge — they steer the model toward a pre-existing capability. Think of it as finding the right "mode" within a model that already knows how to do many things.
A compelling formal theory: Garg et al. (2022) showed that transformers trained on random linear regression tasks can learn in context about as accurately as the optimal least-squares estimator. Related analyses show that the attention mechanism can, in principle, implement gradient descent over the provided examples. This means ICL may literally be performing a form of optimization internally, just not gradient-based optimization on the weights.
ICL ability improves with scale. The GPT-3 paper showed clear trends:
| Model Size | Zero-shot (SuperGLUE) | Few-shot (SuperGLUE) | Few-shot Gain (points) |
|---|---|---|---|
| GPT-3 Small (125M) | 42.0 | 43.1 | +1.1 (minimal) |
| GPT-3 Medium (350M) | 43.8 | 46.5 | +2.7 |
| GPT-3 Large (760M) | 45.2 | 50.3 | +5.1 |
| GPT-3 XL (1.3B) | 47.9 | 55.1 | +7.2 |
| GPT-3 (175B) | 55.4 | 71.8 | +16.4 |
The pattern: ICL benefit grows superlinearly with model size. Small models barely benefit from examples. Large models extract far more from the same examples. This is an emergent ability — it appears to "switch on" above a certain model size threshold rather than increasing linearly.
When should you use ICL instead of fine-tuning? The answer depends on three factors:
1. Data availability. ICL needs only a handful of examples (4-32). Fine-tuning typically needs 1K+ examples for SFT quality. If you have fewer than 100 labeled examples, ICL is often your only option.
2. Time to deploy. ICL is available immediately, with no training required. Fine-tuning takes hours to days. If you need to deploy in minutes (a new customer request, a rapidly changing task), ICL wins.
3. Accuracy requirements. For high-stakes applications (medical, legal, financial), ICL's ~70-80% accuracy ceiling may not be sufficient. Fine-tuning (or LoRA) can push accuracy to 90-95%+ on domain-specific tasks. The accuracy gap narrows with better prompting but never fully closes.
Many production systems use a hybrid approach: start with ICL for rapid prototyping, measure accuracy, then fine-tune with LoRA only if ICL falls short of requirements. This "ICL-first" workflow avoids premature optimization and often reveals that the task is simpler than expected.
ICL is cheap and fast, but it hits a ceiling:
| Limitation | Why It Matters |
|---|---|
| Context window limit | More examples = better, but you can only fit so many tokens. A 4K context can hold ~16 short examples. |
| No gradient signal | The model can't fix systematic errors. If it misunderstands the task, more examples don't always help. |
| Example sensitivity | Accuracy varies wildly with example order, format, and selection. Same examples, different order = different accuracy. |
| Inference cost | Every call re-processes all the examples. 16 examples × 100 tokens each = 1600 extra tokens per call, every call. |
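The context-window and inference-cost rows of the table reduce to simple token arithmetic. A minimal sketch, assuming every example has the same token length; both function names are illustrative.

```python
def max_few_shot_examples(context_window: int,
                          tokens_per_example: int,
                          query_tokens: int,
                          max_new_tokens: int = 0) -> int:
    """How many fixed-size examples fit alongside the query and the output."""
    available = context_window - query_tokens - max_new_tokens
    return max(available // tokens_per_example, 0)

def extra_tokens_per_call(n_examples: int, tokens_per_example: int) -> int:
    """Tokens re-processed on every call because of the in-context examples."""
    return n_examples * tokens_per_example

print(extra_tokens_per_call(16, 100))          # 1600, as in the table
print(max_few_shot_examples(2048, 100, 50))    # 19
```

Because the examples are re-sent on every request, the overhead scales with traffic: at 16 examples of 100 tokens each, a million calls process 1.6 billion extra tokens.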
Let's trace exactly what happens during a few-shot forward pass. Say you have 4 examples, each ~100 tokens, plus a 50-token query. Total input: 450 tokens.
```python
# Few-shot prompt construction
examples = [
    ("The food was excellent", "Positive"),
    ("Terrible service", "Negative"),
    ("I loved the ambiance", "Positive"),
    ("Would not return", "Negative"),
]

prompt = "Classify the sentiment:\n"
for text, label in examples:
    prompt += f"{text} → {label}\n"
prompt += "The pasta was bland →"


def next_token_logits(model, tokenize, prompt):
    """Run one forward pass and return the distribution at the last position.

    `model` and `tokenize` are placeholders for your own LM and tokenizer.
    """
    # Input shape: [1, seq_len] (batch=1; seq_len=450 in the walkthrough above)
    # Every token attends to all previous tokens, so the prediction for the
    # last position is influenced by ALL preceding tokens.
    logits = model(tokenize(prompt))  # [1, seq_len, vocab_size]
    return logits[0, -1]              # last token's distribution
```
The critical insight: attention at the final position can attend to every example token. The model doesn't have a special "example memory" — it uses the same attention mechanism it uses for all text. Examples are just more context.
For tasks where ICL falls short, we need methods that actually modify the model — but without the cost of full fine-tuning. The next chapters explore how.
Small changes in how you phrase a prompt can produce enormous accuracy swings. This isn't a quirk — it's a fundamental property of how language models process input. The model doesn't "understand" your intent the way a human does. It generates tokens conditional on the exact token sequence you provide. Change a single word and you shift the entire conditional distribution.
Zhao et al. (2021) showed that simply reordering the same few-shot examples could swing accuracy on SST-2 (sentiment classification) from 54% to 93%. Same model, same examples, different permutation. The model was treating example order as a signal about the task — recency bias meant the last example disproportionately influenced the prediction.
Other formatting choices that have measured impact:
| Format Choice | Example | Impact |
|---|---|---|
| Label words | "Positive/Negative" vs. "Good/Bad" vs. "True/False" | Up to 20% accuracy difference |
| Separator style | "Answer:" vs. "\n" vs. "=>" | 5-15% difference on some tasks |
| Instruction framing | "Classify this" vs. "What sentiment is this?" | 10-25% difference |
| Example ordering | Random vs. similar-first vs. diverse-first | Up to 40% swing |
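Ordering sensitivity can be measured directly by scoring every permutation of the same examples. The sketch below assumes you supply an `eval_fn` that takes an ordered list of examples and returns an accuracy; the `toy_eval` used here is a stand-in that only imitates recency bias, not a real model, and all names are illustrative.

```python
from itertools import permutations

def build_prompt(examples, query):
    """Format (text, label) pairs followed by an unanswered query."""
    lines = [f"{text} → {label}" for text, label in examples]
    lines.append(f"{query} →")
    return "\n".join(lines)

def ordering_spread(examples, eval_fn):
    """Return (worst, best) score of eval_fn across all example orderings."""
    scores = [eval_fn(list(order)) for order in permutations(examples)]
    return min(scores), max(scores)

examples = [
    ("The food was excellent", "Positive"),
    ("Terrible service", "Negative"),
    ("I loved the ambiance", "Positive"),
    ("Would not return", "Negative"),
]

# Toy stand-in: pretend accuracy depends only on the label of the last example.
def toy_eval(ordered_examples):
    return 0.9 if ordered_examples[-1][1] == "Positive" else 0.6

worst, best = ordering_spread(examples, toy_eval)
print(f"worst ordering: {worst}, best ordering: {best}")
```

With a real `eval_fn`, the number of permutations grows factorially, so in practice you would sample a subset of orderings rather than enumerate all of them.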
Good prompt engineering follows a few principles:
Be explicit about output format. "Answer with exactly one word: Positive or Negative." Without this, the model might produce "I think this is positive because..." and your parser breaks.
Use the right verbalizer. A verbalizer maps between label names and the words the model uses to express them. If you're classifying sentiment, "Positive/Negative" works better than "1/0" because the model has seen far more text using those words in a sentiment context.
Select representative examples. The examples you choose should cover the distribution of inputs you expect. Don't use all easy examples — include edge cases. Don't use all similar examples — include diverse domains.
Calibrate against biases. Models have a tendency to favor certain labels regardless of input (the "majority label bias"). Zhao et al. proposed contextual calibration: measure the model's bias on a content-free input like "N/A", then adjust predictions accordingly.
[Interactive demo: a bad prompt versus a good prompt for the same task, showing how each format choice affects accuracy.]
Humans are slow prompt engineers. Several methods automate the process:
Prompt tuning (Lester et al., 2021): learn a small set of continuous "virtual tokens" prepended to the input. These aren't real words — they're learned embeddings optimized via gradient descent. Only 0.01% of parameters are trainable. The rest of the model is frozen.
Prefix tuning (Li & Liang, 2021): similar idea, but prepend learned vectors to every layer's key-value pairs, not just the input embedding. This gives the "virtual prefix" direct influence over attention at every layer.
Both methods bridge the gap between pure prompting (zero parameters) and adapters (millions of parameters). They're effective for tasks where the pre-trained model already has the knowledge but needs steering — translation, classification, summarization.
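To get a feel for how small these methods are, the trainable parameter counts can be estimated from the shapes alone. This is a rough sketch under stated assumptions: GPT-3's hidden size (12288) and depth (96 layers), a hypothetical prefix length of 20 virtual tokens, and no reparameterization network during training. The fraction depends heavily on model size and prompt length, so these numbers are illustrative only.

```python
def prompt_tuning_params(n_virtual_tokens: int, d_model: int) -> int:
    """One learned embedding per virtual token, at the input layer only."""
    return n_virtual_tokens * d_model

def prefix_tuning_params(n_virtual_tokens: int, d_model: int, n_layers: int) -> int:
    """One learned key vector and one learned value vector per token, per layer."""
    return 2 * n_layers * n_virtual_tokens * d_model

total_params = 175e9
d_model, n_layers, n_virtual = 12288, 96, 20   # n_virtual is an assumed length

pt = prompt_tuning_params(n_virtual, d_model)
pf = prefix_tuning_params(n_virtual, d_model, n_layers)
print(f"prompt tuning: {pt:,} params ({100 * pt / total_params:.5f}% of model)")
print(f"prefix tuning: {pf:,} params ({100 * pf / total_params:.3f}% of model)")
```

Prefix tuning costs `2 * n_layers` times more than prompt tuning for the same prefix length, which is the price of influencing attention at every layer.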
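Zhao et al. (2021) proposed a simple fix for prompt sensitivity: contextual calibration. The idea: measure the label probabilities the model assigns to a content-free input such as "N/A", treat them as the model's bias, then divide each real prediction by the corresponding bias probability and renormalize.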
Concretely: if the model assigns 70% to "Positive" on a content-free input, it has a strong positive bias. Dividing all future "Positive" probabilities by 0.7 (and "Negative" by 0.3) re-centers the predictions. This simple trick reduced accuracy variance across prompt formats from 40% to under 5%.
```python
# Contextual calibration in code
def calibrate(raw_probs, bias_probs):
    """Divide each label probability by the content-free bias, then renormalize."""
    scaled = {k: raw_probs[k] / bias_probs[k] for k in raw_probs}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}

# Step 1: Get bias probabilities from a content-free input.
# In practice: bias_probs = model.predict_probs("Classify: N/A\nSentiment:")
# (`model.predict_probs` is a placeholder for your own scoring call.)
bias_probs = {"Positive": 0.7, "Negative": 0.3}

# Step 2: Calibrate real predictions.
# In practice: raw_probs = model.predict_probs("Classify: 'Great movie!'\nSentiment:")
raw_probs = {"Positive": 0.9, "Negative": 0.1}

calibrated = calibrate(raw_probs, bias_probs)
# ≈ {"Positive": 0.79, "Negative": 0.21}: less biased
```
In 2022, Jason Wei et al. at Google discovered something surprising: if you include a few examples with worked-out reasoning chains in a math prompt, accuracy on the GSM8K benchmark jumps dramatically. For PaLM 540B, it rose from 18% to 57%. Same model. Same weights. Same test set. Only the prompt changed.
This is chain-of-thought (CoT) prompting — a technique where you ask the model to show its reasoning steps before giving the final answer. The key insight: language models process information sequentially, token by token. If a problem requires multiple reasoning steps, the model must perform all of them within a single forward pass — unless you give it "scratch space" in the form of generated intermediate tokens.
A standard prompt asks the model to go directly from question to answer:
```text
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: 11
```
The model must compute 5 + (2 × 3) in a single "step." For simple arithmetic, this works. But for multi-step word problems, the required computation exceeds what a single forward pass can do reliably.
With CoT, the model generates intermediate steps. Each step becomes part of the context for the next step, effectively giving the model a working memory:
```text
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with 3 balls each, so 2 × 3 = 6 new balls. 5 + 6 = 11. The answer is 11.
```
Each generated token becomes context for the next token. The model isn't doing harder computation — it's breaking one hard problem into many easy problems, solving them sequentially through autoregressive generation.
[Interactive demo: the same word problem with chain-of-thought on and off. Without CoT the model jumps to an answer; with CoT it decomposes the problem.]
The most surprising finding: you don't even need hand-crafted reasoning examples. Kojima et al. (2022) showed that just appending "Let's think step by step" to any prompt triggers reasoning chains. This is zero-shot CoT — no examples, no manual chain-writing, just a magic phrase that activates latent reasoning capabilities.
On MultiArith (arithmetic word problems): Standard zero-shot = 18%. Zero-shot CoT = 79%. That's a 61-percentage-point improvement from five words.
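In code, zero-shot CoT amounts to appending the trigger phrase and then pulling the final answer out of the generated reasoning. A minimal sketch: the prompt layout and the "take the last number" extraction rule are simplifying assumptions, and the completion string below is copied from the worked example earlier rather than produced by a model.

```python
import re

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger phrase so the model starts with its reasoning."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def extract_final_number(completion: str):
    """Return the last number mentioned in the completion, or None if absent."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return float(numbers[-1]) if numbers else None

prompt = zero_shot_cot_prompt(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Stand-in for a model completion (taken from the CoT example above).
completion = (
    "Roger started with 5 balls. He bought 2 cans with 3 balls each, "
    "so 2 × 3 = 6 new balls. 5 + 6 = 11. The answer is 11."
)
print(extract_final_number(completion))
```

Taking the last number is brittle when the reasoning trails off after the answer; a more careful setup asks the model for the answer in a fixed format and parses that instead.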
| Task Type | CoT Benefit | Why |
|---|---|---|
| Multi-step arithmetic | Very large (+40-60%) | Requires serial computation the model can't do in one pass |
| Logic puzzles | Large (+20-40%) | Explicit reasoning prevents shortcut errors |
| Commonsense reasoning | Moderate (+10-20%) | Helps surface relevant knowledge |
| Simple classification | Minimal or negative | The task is already easy enough for one pass; CoT adds noise |
| Factual recall | None | The model either knows the fact or doesn't; reasoning chains don't help |
CoT also scales with model size. In small models (<10B parameters), CoT often hurts performance — the model generates plausible-sounding but incorrect reasoning steps. The benefit emerges only in large models that have internalized enough world knowledge to reason accurately.
The original CoT idea spawned an entire family of prompting techniques:
| Variant | Idea | Improvement Over Standard CoT |
|---|---|---|
| Self-consistency | Generate multiple reasoning chains, take majority vote on final answer | +5-10% accuracy on math tasks |
| Tree-of-Thought | Explore multiple reasoning paths, backtrack from dead ends | Helps on search-like problems (game of 24) |
| Program-of-Thought | Generate Python code instead of natural language reasoning, execute it | Offloads arithmetic to the interpreter, avoiding most calculation errors |
| ReAct | Interleave reasoning with tool use (search, calculator) | Grounds reasoning in external facts |
All of these share the same underlying principle: give the model more "thinking tokens" between question and answer. The tokens can be natural language (CoT), code (PoT), or interleaved with tool calls (ReAct). The key is that each intermediate token creates a stepping stone the model can attend to when generating the next token.
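Self-consistency is the simplest of these variants to sketch: sample several reasoning chains and vote on their final answers. In the snippet below, `sample_answer` is a hypothetical callable that runs one sampled generation and returns only the final answer; the canned iterator stands in for a real model.

```python
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=5):
    """Sample several reasoning chains and majority-vote on the final answers.

    Returns the winning answer and the fraction of samples that agreed with it.
    """
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Toy stand-in for a sampled model: canned answers instead of real generations.
canned = iter(["11", "11", "9", "11", "12"])
answer, agreement = self_consistency(lambda _prompt: next(canned), "Q: ...", n_samples=5)
print(answer, agreement)  # 11 0.6
```

The agreement fraction is a useful by-product: low agreement is a cheap signal that the model is unsure and the answer deserves a second look.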
So far we've adapted models without changing any weights at all. But what if we want to actually train some weights — just not all 175 billion of them? How many weights do we really need?
In 2019, Jonathan Frankle and Michael Carbin published "The Lottery Ticket Hypothesis," making a striking claim: a dense neural network contains a sparse subnetwork that, when trained in isolation from the same initialization, matches the full network's accuracy. They called this subnetwork the winning ticket.
Think of training a neural network like buying lottery tickets. A dense network with 100 million weights is like buying 100 million tickets. Most of those tickets (weights) are losers — they don't meaningfully contribute to the final function the network computes. But somewhere in those 100 million, there's a winning subnetwork of maybe 10 million weights that does all the real work.
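Frankle and Carbin demonstrated this experimentally with a technique called iterative magnitude pruning: train the network, prune the smallest-magnitude weights, reset the surviving weights to their original initialization, and repeat.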
The result: on CIFAR-10 with VGG-19, they could prune 90% of weights and still match the original accuracy. The winning ticket was there all along — you just needed to find it.
The iterative part is crucial. You can't just prune 90% of a randomly initialized network in one shot — you don't know which weights are important yet. The pruning must be guided by a full training run that reveals which weights naturally gravitate toward zero. The magnitude of a trained weight is a proxy for its importance: weights that end up large were consistently reinforced by gradients; weights that end up near zero were irrelevant to the loss.
```python
# Simplified iterative magnitude pruning
import torch

def iterative_prune(model, train_fn, prune_pct=0.2, rounds=5):
    # Save the original initialization so survivors can be reset to it
    initial_weights = {n: p.detach().clone() for n, p in model.named_parameters()}
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for r in range(rounds):
        # train_fn is user-supplied and must keep masked weights at zero
        train_fn(model, mask)

        with torch.no_grad():
            # Prune the smallest-magnitude weights among those still alive
            for n, p in model.named_parameters():
                alive = mask[n].bool()
                threshold = torch.quantile(p[alive].abs(), prune_pct)
                mask[n][alive & (p.abs() < threshold)] = 0

            # Reset survivors to the original initialization
            for n, p in model.named_parameters():
                p.copy_(initial_weights[n] * mask[n])

    return mask  # the winning ticket
```
[Interactive demo: a sparsity slider prunes weights. Accuracy holds surprisingly well until extreme sparsity, then collapses.]
The lottery ticket hypothesis tells us something profound about parameter efficiency: most parameters in a neural network are redundant. If 90% of weights can be removed without hurting accuracy, then maybe we don't need to update all of them during fine-tuning either.
This observation directly motivates adapter methods and LoRA. If the useful information lives in a low-dimensional subspace, we can fine-tune only within that subspace. The next two chapters show exactly how.
Aghajanyan et al. (2020) measured the intrinsic dimensionality of fine-tuning: the minimum number of parameters needed to achieve 90% of full fine-tuning accuracy. For RoBERTa-large (about 355M parameters) fine-tuned on MRPC (paraphrase detection), the intrinsic dimensionality was about 200. Not 200 million. Two hundred.
That's roughly 0.00006% of the total parameters. The fine-tuning "update" lives in an extremely low-dimensional subspace. This is the theoretical foundation for LoRA.
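One way to write down what "intrinsic dimensionality" measures is the subspace reparameterization below. The notation is an assumption chosen for this sketch: D is the full parameter count, d the subspace size, P a fixed random projection, and only the small vector θ^(d) is trained.

```latex
\theta^{(D)} = \theta_0^{(D)} + P\,\theta^{(d)},
\qquad P \in \mathbb{R}^{D \times d},\quad d \ll D

d_{90} = \min \Bigl\{ d \;:\;
  \mathrm{score}\bigl(\theta_0^{(D)} + P\,\theta^{(d)}_{\ast}\bigr)
  \;\ge\; 0.9 \cdot \mathrm{score}\bigl(\theta^{(D)}_{\text{full FT}}\bigr) \Bigr\}
```

Here θ_0 holds the frozen pre-trained weights and θ^(d)_* is the best trained value of the small vector. The reported figure of about 200 corresponds to d_90 in this notation.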
The original lottery ticket experiments used small networks on CIFAR-10. Does the hypothesis hold at scale? Subsequent work revealed a nuance: for large networks, you can't reset to the original initialization — you need to reset to weights from early in training (e.g., after 0.1% of total training). This is the late resetting variant, and it works robustly at scale.
The practical implication: large pre-trained language models are massively over-parameterized. Studies on BERT showed that 40-90% of attention heads can be pruned with negligible accuracy loss on downstream tasks. For GPT-style models, structured pruning of entire layers (not just weights) can remove 25-30% of layers while maintaining 95%+ of the original accuracy.
This over-parameterization is a feature, not a bug: it makes training easier (more paths to good solutions) and enables transfer learning (different tasks use different subsets of the weights). But at deployment time, most of those parameters are wasted compute.
Now we get to methods that actually modify the model — but surgically, with a scalpel rather than a sledgehammer. Adapters (Houlsby et al., 2019) insert small trainable modules between the frozen layers of a pre-trained transformer. The original weights don't change. Only the adapter weights are trained.
The previous three chapters used zero trainable parameters. Prompting, few-shot, and chain-of-thought all leave the model weights untouched. This is elegant but limited — you're constrained to whatever the model can do with just the right prompt. When you need the model to reliably produce specific output formats, handle domain-specific terminology, or maintain consistent quality across thousands of requests, you need to actually train some weights.
The question: which weights? Adapters answer this by training new weights inserted between existing layers, while leaving all original weights frozen.
An adapter module is a simple bottleneck: down-project → nonlinearity → up-project, with a residual connection around the whole thing.
Concretely, for a hidden dimension of d = 1024 and a bottleneck dimension of m = 64, the adapter computes:

$$h' = h + f(h\,W_{\text{down}})\,W_{\text{up}}$$

Where $W_{\text{down}} \in \mathbb{R}^{d \times m}$ (1024 × 64), $W_{\text{up}} \in \mathbb{R}^{m \times d}$ (64 × 1024), and f is a nonlinearity (usually ReLU or GELU). The residual connection h + (...) ensures that if the adapter weights are initialized to near-zero, the module is approximately an identity function, so the model starts where the pre-trained model left off.
Houlsby et al. placed two adapter modules in each transformer layer: one after the multi-head attention sublayer, one after the feed-forward sublayer. Each adapter is applied directly to its sublayer's output, before that output is added back to the skip connection and passed to layer normalization.
| Component | Parameters | Trainable? |
|---|---|---|
| Multi-head attention | 4d² | Frozen |
| Adapter 1 | 2dm | Yes |
| Feed-forward network | 8d² | Frozen |
| Adapter 2 | 2dm | Yes |
| Layer norms | 4d | Often trained too |
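With d = 1024 and m = 64, each adapter has 2 × 1024 × 64 = 131,072 parameters (ignoring biases). Two adapters per layer × 24 layers = ~6.3M trainable parameters. For a 340M-parameter BERT-large, that's about 1.85% of total parameters. For GPT-3 175B (d = 12288, 96 layers), the same bottleneck gives 2 × 12288 × 64 ≈ 1.57M parameters per adapter, or about 302M across 192 adapters: roughly 0.17% of the model.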
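The adapter parameter counts follow from the two projection shapes. A minimal sketch for the BERT-large configuration described above (d = 1024, 24 layers, two adapters per layer); biases are ignored by default, matching the 2dm entries in the table, and the function names are illustrative.

```python
def adapter_params(d_model: int, bottleneck: int, include_bias: bool = False) -> int:
    """Parameters in one bottleneck adapter: down-projection plus up-projection."""
    params = 2 * d_model * bottleneck
    if include_bias:
        params += bottleneck + d_model  # one bias vector per projection
    return params

def total_adapter_params(d_model: int, bottleneck: int, n_layers: int,
                         adapters_per_layer: int = 2) -> int:
    """Trainable parameters when every layer carries the same adapters."""
    return adapter_params(d_model, bottleneck) * adapters_per_layer * n_layers

per_adapter = adapter_params(1024, 64)
bert_total = total_adapter_params(1024, 64, n_layers=24)
fraction = bert_total / 340e6
print(f"{per_adapter:,} per adapter, {bert_total:,} total, {fraction:.2%} of BERT-large")
```

Because the count is linear in both `d_model` and `bottleneck`, halving the bottleneck halves the trainable parameters, which is what makes m a convenient capacity knob.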
[Interactive diagram: a transformer block with adapter modules. Frozen layers are gray, trainable adapters are orange.]
```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model, bottleneck):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Initialize the up-projection to zero
        # so the adapter starts as an identity function
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        return x + self.up(self.act(self.down(x)))
```
The key detail: Wup is initialized to zero, making the entire adapter output zero at initialization. The residual connection means the adapter starts as a pass-through — the model behaves identically to the original pre-trained model on the first forward pass. Training then "grows" the adapter's contribution gradually.
This zero-initialization is critical for training stability. If adapters were randomly initialized, the model's outputs would be disrupted on the first forward pass, because the pre-trained model's learned representations would be corrupted by random adapter noise. Starting from zero means training begins exactly at the pre-trained baseline, and gradients gradually guide the adapter toward task-specific modifications.
The bottleneck dimension m controls the capacity vs. efficiency tradeoff:
| Bottleneck m | Params per Adapter | % of BERT-large (340M, 48 adapters) | Typical Use |
|---|---|---|---|
| 8 | 16K | 0.23% | Extreme efficiency; simple tasks |
| 32 | 65K | 0.93% | Good balance for most tasks |
| 64 | 131K | 1.85% | Standard (original paper default) |
| 256 | 524K | 7.4% | Complex tasks needing more capacity |
The original paper found that m = 64 was sufficient for all 26 GLUE/SQuAD tasks. Reducing to m = 8 still worked for simpler classification tasks but degraded on complex QA. The choice should be guided by task complexity: simple classification needs less capacity than open-ended generation.
The killer feature of adapters is modular task composition. You store one frozen base model (say, 350 GB for GPT-3) and swap in tiny adapter checkpoints per task:
| Component | Size | Copies | Total |
|---|---|---|---|
| Frozen base model | 350 GB | 1 | 350 GB |
| Adapter per task (m=64, ~302M params, fp16) | ~600 MB | 26 | ~15.7 GB |
| Total for 26 tasks | | | ~366 GB |
| Full FT alternative | 350 GB | 26 | ~9.1 TB |
That's roughly a 25x storage reduction. At serving time, you load the base model once into GPU memory and hot-swap adapter weights based on which task the incoming request requires. The swap is fast: only a few hundred megabytes of adapter weights are overwritten, not the 350 GB base.
Adapters have one drawback: they add sequential computation during inference. Every forward pass must go through the adapter modules, adding latency proportional to 2 × L adapter forward passes (where L is the number of layers). For a model serving millions of requests per day, even a few milliseconds per adapter matters. The additional latency compounds: two adapters per layer × 96 layers = 192 extra sequential operations.
This latency issue motivated the development of LoRA, which adds zero inference cost by merging the learned parameters directly into the base weights.
In 2021, Edward Hu et al. at Microsoft published LoRA: Low-Rank Adaptation of Large Language Models — and it became the most widely used PEFT method within a year. The idea is elegant: weight updates during fine-tuning have low intrinsic rank. If the update matrix ΔW is low-rank, decompose it into two small matrices and train only those.
Adapters work, but they add new sequential layers to the model. LoRA takes a fundamentally different approach: instead of adding new layers, it modifies existing ones. The key observation from Aghajanyan et al. is that fine-tuning updates are low-rank — the matrix ΔW = Wfinal - Wpretrained has far fewer independent dimensions than its size would suggest.
In standard fine-tuning, you update a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ to $W_0 + \Delta W$. The update ΔW has the same shape as $W_0$; for GPT-3's attention weights, that's 12288 × 12288 ≈ 151 million parameters per matrix.
LoRA constrains ΔW to be low-rank by decomposing it:

$$\Delta W = BA$$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. For r = 8 and d = k = 12288:
| Method | Trainable Params (per matrix) | Ratio |
|---|---|---|
| Full fine-tuning (ΔW) | 12288 × 12288 = 151M | 1x |
| LoRA r=8 (B + A) | (12288 × 8) + (8 × 12288) = 197K | 0.0013x |
| LoRA r=4 (B + A) | (12288 × 4) + (4 × 12288) = 98K | 0.00065x |
That's a roughly 770x reduction in trainable parameters per matrix for r = 8. Across all attention matrices in GPT-3 (Q, K, V, O × 96 layers), LoRA with r = 8 trains about 75.5 million parameters (4 × 96 × 196,608) instead of 175 billion, a reduction of roughly 2,300x.
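The per-matrix counts in the table above come from the shapes of B and A. A minimal sketch, with `lora_params` as an illustrative helper name:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: B is d x r, A is r x k."""
    return d * r + r * k

d = k = 12288                          # GPT-3 attention projection size
full = d * k                           # parameters in a full update matrix
r8 = lora_params(d, k, r=8)
r4 = lora_params(d, k, r=4)

print(f"full: {full:,}")               # 150,994,944
print(f"r=8:  {r8:,} ({full / r8:.0f}x fewer)")
print(f"r=4:  {r4:,} ({full / r4:.0f}x fewer)")
```

To get a model-wide total, multiply the per-matrix count by the number of adapted matrices per layer and by the number of layers.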
Here's the beautiful part: LoRA adds zero inference latency. During training, the forward pass computes:

$$h = W_0 x + BAx$$

After training, you merge: $W_{\text{merged}} = W_0 + BA$. Now you have a single weight matrix with no extra computation. The adapter is "baked in." To switch tasks, subtract the old BA from the merged weights and add the new task's BA. This is why LoRA dominates: adapters add latency at every forward pass, but a merged LoRA adds none.
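The claim that merging changes nothing about the output is just distributivity of matrix multiplication, and it can be checked on a tiny example. The sketch below uses plain Python lists and made-up values (d = 2, k = 3, r = 1); the helper names are illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W0 = [[1.0, 2.0, 0.0],
      [0.0, 1.0, 3.0]]        # frozen weight, d x k
B = [[0.5], [-1.0]]           # d x r
A = [[1.0, 0.0, 2.0]]         # r x k
x = [[1.0], [2.0], [3.0]]     # input column vector, k x 1

# Training-time path: base output plus the low-rank branch B(Ax)
unmerged = add(matmul(W0, x), matmul(B, matmul(A, x)))

# Inference-time path: a single matmul with the merged weight W0 + BA
merged = matmul(add(W0, matmul(B, A)), x)

print(unmerged, merged)  # [[8.5], [4.0]] [[8.5], [4.0]]
```

The unmerged path costs two extra small matmuls per layer; the merged path costs exactly what the original layer did.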
A is initialized with random Gaussian values, B is initialized to zero. This means ΔW = BA = 0 at the start — the model begins as the original pre-trained model, identical to adapter initialization logic.
LoRA also uses a scaling factor α/r to control the magnitude of the update:

$$h = W_0 x + \frac{\alpha}{r} BAx$$

Typically α = r, so the scaling is 1. If α is instead held fixed while r increases, the factor α/r shrinks, so each individual rank-1 component contributes proportionally less, which stabilizes training.
[Interactive diagram: the large W matrix gets a low-rank update ΔW = BA. As the rank changes, B and A change shape and the parameter count drops dramatically.]
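```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear, r=8, alpha=8):
        super().__init__()
        self.linear = linear  # original layer, frozen below
        for p in self.linear.parameters():
            p.requires_grad = False
        d, k = linear.weight.shape  # nn.Linear stores weight as [out, in]
        self.lora_A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r
        self.merged = False

    def forward(self, x):
        # x: [batch, seq_len, k]
        base = self.linear(x)  # W_0 @ x
        if self.merged:
            return base  # LoRA update is already baked into W_0
        lora = (x @ self.lora_A.T) @ self.lora_B.T  # (BA)x
        return base + self.scale * lora

    @torch.no_grad()
    def merge(self):
        # Bake LoRA into W_0 for zero-cost inference (only once)
        if not self.merged:
            self.linear.weight.add_(self.scale * (self.lora_B @ self.lora_A))
            self.merged = True
```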
The training loop for LoRA is identical to standard fine-tuning, except only LoRA parameters receive gradients:
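```python
import torch

# `model` and `dataloader` are assumed to exist already.
# Freeze the base model and train only LoRA parameters. LoRA layers are
# identified by their merge() method; their own (non-recursive) parameters
# are the A and B matrices.
lora_params = [
    p
    for module in model.modules() if hasattr(module, "merge")
    for p in module.parameters(recurse=False)
]
for param in model.parameters():
    param.requires_grad = False  # freeze everything
for param in lora_params:
    param.requires_grad = True   # unfreeze only the LoRA matrices

optimizer = torch.optim.AdamW(
    lora_params,
    lr=2e-4,  # higher LR than full FT (typically 5e-5)
)

for batch in dataloader:
    loss = model(batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()  # gradients are computed only for LoRA params
    optimizer.step()
    optimizer.zero_grad()

# After training: merge for zero-cost inference
for module in model.modules():
    if hasattr(module, "merge"):
        module.merge()
```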
Notice the learning rate: 2e-4 vs. the typical 5e-5 for full fine-tuning. With far fewer parameters, each parameter needs to move more per step to achieve the same total update magnitude.
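The original paper found that, under a fixed parameter budget, applying LoRA to the query and value projections (WQ and WV) in attention gave the best overall results. Spreading the same budget across all four attention matrices (Q, K, V, O) performed about the same or slightly better. The paper adapted only the attention weights and kept the feed-forward layers frozen, so the claim that feed-forward layers matter less is a rule of thumb rather than a result from that paper.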
| LoRA Target | Params (GPT-3, r = 2) | Quality |
|---|---|---|
| WQ only | 4.7M | Good |
| WQ + WV | 9.4M | Strong (paper default) |
| WQ + WK + WV + WO | 18.9M | Comparable or slightly better |
| All attention + FFN | ~42.5M | Marginal improvement |
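The attention rows of the table can be reproduced with a few lines of arithmetic. This sketch assumes GPT-3 175B's published shape (hidden size 12288, 96 layers) and a rank of 2, which is the rank implied by the 4.7M figure for WQ alone; the function names are made up for illustration.

```python
# Count LoRA parameters for different attention target sets on a GPT-3-sized model.
# Assumptions (not stated in the text): d_model = 12288, 96 layers, rank r = 2.

D_MODEL = 12288
N_LAYERS = 96

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so one adapted matrix adds r * (d_in + d_out)."""
    return r * (d_in + d_out)

def total_lora_params(targets, r: int) -> int:
    """Sum over every targeted square attention projection in every layer."""
    per_layer = sum(lora_params(D_MODEL, D_MODEL, r) for _ in targets)
    return per_layer * N_LAYERS

for targets in (["q"], ["q", "v"], ["q", "k", "v", "o"]):
    n = total_lora_params(targets, r=2)
    print(f"{'+'.join(targets):8s} {n / 1e6:5.1f}M")
```

Feed-forward matrices are not square (d_model by 4 × d_model in GPT-3), so they add more parameters per adapted matrix than an attention projection does at the same rank.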
Rank r is the single most important hyperparameter in LoRA. The original paper tested r from 1 to 64 on GPT-3 for various tasks. Key findings:
| Rank | Trainable Params (per matrix) | Quality vs Full FT | When to Use |
|---|---|---|---|
| r = 1 | 24K | 90-95% | Extreme parameter constraint, simple classification |
| r = 4 | 98K | 95-98% | Default for most NLP tasks |
| r = 8 | 197K | 98-99% | Standard recommendation; good accuracy-efficiency tradeoff |
| r = 16 | 393K | 99%+ | Complex tasks, code generation, long-form writing |
| r = 64 | 1.6M | ~100% | When full FT accuracy is strictly required |
A practical heuristic: start with r = 8. If accuracy is insufficient, double to r = 16. If r = 16 isn't enough, the task may require full fine-tuning or a fundamentally different approach. Going above r = 64 rarely helps — you're past the intrinsic dimensionality of most fine-tuning updates.
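The doubling heuristic can be written as a short search loop. This is only a sketch: `pick_rank` and `evaluate` are hypothetical names, `evaluate(r)` stands in for "train a LoRA at rank r and score it on a validation set", and the scores below are invented to exercise the loop.

```python
from typing import Callable, Optional

def pick_rank(evaluate: Callable[[int], float],
              target: float,
              start: int = 8,
              max_rank: int = 64) -> Optional[int]:
    """Double the rank until the validation score reaches the target.

    Returns the first rank that meets the target, or None if even max_rank
    falls short (a hint that LoRA may be the wrong tool for this task).
    """
    r = start
    while r <= max_rank:
        if evaluate(r) >= target:
            return r
        r *= 2
    return None

# Invented validation scores per rank, purely for illustration.
fake_scores = {8: 0.81, 16: 0.86, 32: 0.87, 64: 0.87}

print(pick_rank(fake_scores.get, target=0.85))  # 16
print(pick_rank(fake_scores.get, target=0.95))  # None
```

Setting `max_rank=16` follows the stricter version of the heuristic, where failing at r = 16 is already the signal to reconsider the approach.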
The scaling factor α is typically set equal to r, giving a scale of 1.0. Some practitioners use α = 2r for a stronger initial learning signal. The learning rate for LoRA parameters is usually several times higher than for full fine-tuning (2e-4 versus 5e-5 in the example above is 4x), and sometimes an order of magnitude higher, since you're training far fewer parameters.
LoRA's low-rank constraint is a double-edged sword. For most downstream NLP tasks (classification, summarization, translation), the update is indeed low-rank and LoRA matches full fine-tuning. But there are cases where it struggles:
Dramatically new domains. If the task requires knowledge fundamentally absent from pre-training (e.g., a new programming language invented after the cutoff), a rank-8 update may not have enough capacity to encode the new patterns. Increasing rank to 32-64 or using full fine-tuning may be necessary.
Very long-context tasks. Adapting the model to handle much longer sequences than it was pre-trained on often requires changes that span many dimensions of the weight space — not a low-rank update.
Multi-modal extensions. Adding a new modality (vision, audio) requires updating embedding layers and cross-attention in ways that exceed low-rank decomposition. LoRA is best for task adaptation within the model's existing modality.
The rule of thumb: if you're steering the model's existing capabilities toward a specific task, LoRA works beautifully. If you're trying to teach the model fundamentally new capabilities, you may need higher rank or full fine-tuning.
You've learned four ways to adapt a model: in-context learning, prompt tuning, adapters, and LoRA. But how do they compare head-to-head? This playground lets you train a simulated model with each method and watch the differences in real time.
Select a method, then click "Train" to start a simulated training run. Watch which weights update (colored vs. gray), and track four metrics: trainable parameters, GPU memory, accuracy, and training time. Try all four methods to see the tradeoffs.
Pay particular attention to the loss curve (bottom right): all methods converge, but at different speeds and to different final accuracies. Full fine-tuning converges slowest but to the best accuracy. Prompt tuning converges fastest but to a lower ceiling. LoRA and adapters sit in between — near full fine-tuning accuracy with prompt tuning speed.
Select a method, then click Train. Watch which parts of the model update (orange = trainable, gray = frozen). Compare metrics across methods.
The numbers tell the story:
| Method | Trainable Params | GPU Memory | Accuracy | Training Time | Inference Cost |
|---|---|---|---|---|---|
| Full FT | 175B (100%) | ~2.1 TB | Best | Weeks | Baseline |
| Adapter | 6.3M (0.004%) | ~400 GB | Near-best | Hours | +latency |
| LoRA (r = 1-2) | 4.7M (0.003%) | ~400 GB | Near-best | Hours | Zero extra |
| Prompt tuning | 77K (0.00004%) | ~350 GB | Good | Minutes | Minimal |
Several patterns emerge:
Full fine-tuning wins on accuracy when you have enough data and compute. It can modify any weight to fit any pattern. But the cost is enormous — 2 TB of memory, weeks of training, and a full copy of the model per task.
LoRA is the sweet spot for most practitioners. Near-full-fine-tuning accuracy with 0.003% trainable parameters and zero extra inference cost. The trained LoRA matrices merge into the base weights, so deployment is identical to deploying the original model.
Adapters match LoRA on accuracy but add inference latency because the adapter modules remain as separate sequential computations during inference. If latency matters, LoRA wins.
Prompt tuning is the lightest touch. Only tens of thousands of parameters, trained in minutes. But it has the lowest accuracy ceiling and can't fix fundamental capabilities the model lacks. Best for tasks the model already almost knows how to do.
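The memory savings aren't just about having fewer parameters. During training with Adam, every trainable parameter requires storing four things: the fp16 weight (2 bytes), its fp16 gradient (2 bytes), and Adam's fp32 first and second moments (4 bytes each).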
For a 175B parameter model with full fine-tuning, that's 2 + 2 + 4 + 4 = 12 bytes per parameter = 2.1 TB total. With LoRA (4.7M trainable params), only those 4.7M parameters need optimizer states: 4.7M × 12 bytes = 56 MB. The frozen 175B parameters are stored in fp16 (350 GB) but need no gradients or optimizer states.
This is why LoRA and adapters are called "parameter-efficient" — the savings cascade through the entire training pipeline: fewer trainable params → fewer gradients → fewer optimizer states → less memory → fewer GPUs → lower cost.
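The per-parameter accounting above is easy to check numerically. This sketch uses the same byte counts as the text (fp16 weights and gradients, fp32 Adam moments) and deliberately leaves out activations; the function names are illustrative.

```python
# Bytes needed per trainable parameter under mixed-precision Adam,
# using the same accounting as the text (fp16 weights/grads, fp32 moments).
BYTES = {"weights": 2, "gradients": 2, "adam_m": 4, "adam_v": 4}

def training_state_bytes(n_trainable: int) -> int:
    """Memory for weights, gradients, and Adam moments of the trainable params."""
    return n_trainable * sum(BYTES.values())

def frozen_bytes(n_frozen: int) -> int:
    """Frozen parameters only need their fp16 weights."""
    return n_frozen * BYTES["weights"]

N_BASE = 175_000_000_000   # GPT-3-sized base model
N_LORA = 4_700_000         # LoRA parameter count quoted in the text

full_ft = training_state_bytes(N_BASE)
lora = frozen_bytes(N_BASE) + training_state_bytes(N_LORA)

print(f"full fine-tuning: {full_ft / 1e12:.2f} TB")   # 2.10 TB
print(f"LoRA:             {lora / 1e9:.2f} GB")       # 350.06 GB
```

Almost all of the LoRA figure is the frozen base model; the trainable state itself is tens of megabytes.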
One subtlety: even with PEFT, you still need to store activations for the backward pass. Activations scale with batch size × sequence length × hidden dimension × number of layers. For a 175B model with batch size 1 and sequence length 2048, activations alone can consume 200+ GB. This is why techniques like gradient checkpointing (recomputing activations instead of storing them) are essential even with LoRA.
| Component | Full FT (175B) | LoRA (175B, 4.7M trainable params) |
|---|---|---|
| Model weights | 350 GB (fp16) | 350 GB (fp16, frozen) |
| Trainable params | Same tensors as the model weights | 9 MB |
| Gradients | 350 GB | 9 MB |
| Optimizer states | 1.4 TB | 38 MB |
| Activations | ~200 GB | ~200 GB (same!) |
| Total | ~2.3 TB | ~550 GB |
Notice that activations are the same size regardless of PEFT method — because the forward pass through the frozen model is identical. PEFT saves on weights, gradients, and optimizer states, but not on activations. This is why you still need significant GPU memory even with LoRA.
The choice of adaptation method isn't just about accuracy — it's about the intersection of your data, compute, deployment constraints, and how many tasks you need to serve. This chapter gives you a decision framework.
The interactive tree below walks you through the key questions. Click each node to explore a decision path. Your answers determine which method is recommended.
| Scenario | Best Method | Why |
|---|---|---|
| No labeled data at all | Zero/few-shot prompting | Can't train without data; use the model's pre-existing knowledge |
| 10-100 labeled examples | Few-shot + CoT prompting | Too little data for gradient methods; examples in context work well |
| 100-10K examples, API access only | Prompt tuning | Works only if the provider exposes prompt or prefix tuning; otherwise fall back to prompt engineering, since you have no access to model weights |
| 1K-100K examples, weight access | LoRA (r=4-16) | Best accuracy-per-parameter ratio; zero inference overhead |
| 10+ tasks on same base model | LoRA with adapter swapping | One frozen base + tiny LoRA checkpoints per task; hot-swap at serving |
| 100K+ examples, full GPU cluster | Full fine-tuning | Maximum accuracy; cost is justified by data volume and task importance |
| Real-time serving, latency-critical | LoRA (merged) or full FT | LoRA merges to zero overhead; adapters add sequential latency |
| Model too large for GPU memory | QLoRA (quantized + LoRA) | 4-bit quantized base model + LoRA adapters fit on single GPU |
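The table above can be read as a chain of checks. The sketch below encodes one possible ordering of those checks; `recommend_method` and its parameters are hypothetical, the thresholds are copied from the table, and where the table's ranges overlap or leave gaps (for example, 100 to 1K examples with weight access) the function simply falls through to LoRA.

```python
def recommend_method(n_examples: int,
                     has_weights: bool,
                     fits_in_gpu_memory: bool = True,
                     has_gpu_cluster: bool = False) -> str:
    """Map a scenario to a method, following the rows of the table above."""
    if n_examples == 0:
        return "zero/few-shot prompting"      # nothing to train on
    if n_examples < 100:
        return "few-shot + CoT prompting"     # too little data for gradients
    if not has_weights:
        return "prompt tuning"                # API-only access
    if not fits_in_gpu_memory:
        return "QLoRA"                        # quantize the frozen base
    if n_examples >= 100_000 and has_gpu_cluster:
        return "full fine-tuning"             # data and compute justify it
    return "LoRA"                             # the default with weight access

print(recommend_method(5_000, has_weights=True))                     # LoRA
print(recommend_method(50, has_weights=True))                        # few-shot + CoT prompting
print(recommend_method(20_000, True, fits_in_gpu_memory=False))      # QLoRA
```

Latency and multi-task requirements are left out on purpose: both point to LoRA in the table, so they do not change the outcome of this simplified version.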
Many practitioners default to QLoRA (Dettmers et al., 2023), which combines 4-bit quantization of the base model with LoRA adapters. This lets you fine-tune a 65B parameter model on a single 48 GB GPU, a job that would normally require 8+ GPUs in 16-bit precision.
The trick: quantize the base model weights to 4-bit NormalFloat (NF4) format, reducing memory by 4x. Then add LoRA adapters in fp16 or bf16 on top. The LoRA gradients are computed in 16-bit, so training precision is maintained. Only the frozen base weights are quantized.
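The 4x figure follows directly from the bit widths. As a rough check, this sketch computes weight memory alone for a 65B model; it ignores quantization constants, LoRA parameters, and activations, and `weight_memory_gb` is an illustrative helper, not a library function.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, ignoring quantization constants and activations."""
    return n_params * bits_per_param / 8 / 1e9

n = 65e9  # 65B-parameter base model, as in the text
print(f"fp16: {weight_memory_gb(n, 16):.1f} GB")  # 130.0 GB
print(f"nf4:  {weight_memory_gb(n, 4):.1f} GB")   # 32.5 GB
```

At roughly 32.5 GB for the weights, the quantized base leaves headroom on a 48 GB card for adapters, optimizer state, and activations.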
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load a 70B model in 4-bit (about 35 GB of weights at 0.5 bytes per parameter)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
)

# Add LoRA adapters on the query and value projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

# Prints trainable vs. total parameter counts; the trainable share is well under 0.1%
model.print_trainable_parameters()
```
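With QLoRA, you can fine-tune a 70B model on a single A100 (80 GB), and models of around 30B parameters fit on a consumer RTX 4090 (24 GB) with careful gradient checkpointing. A 70B model does not fit in 24 GB, because its 4-bit weights alone take about 35 GB. This democratized fine-tuning: what previously required a compute cluster now runs on a single GPU.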
The choice of PEFT method affects not just training but deployment architecture:
| Deployment Need | Method | Why |
|---|---|---|
| Single GPU serving | QLoRA (merged) | Quantized base + merged LoRA fits in minimal memory |
| Multi-tenant (many users, different tasks) | LoRA with dynamic swapping | Load base once, hot-swap LoRA weights per request |
| API-only access (OpenAI, Anthropic) | Prompt engineering + CoT | No weight access; all adaptation through the prompt |
| Edge/mobile deployment | Distillation from PEFT model | Fine-tune large model with LoRA, then distill to smaller model |
| Continual learning | Stacked LoRA adapters | Add new LoRA for each update; merge periodically |
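The multi-tenant row is worth a concrete picture. Below is a toy sketch of the pattern, in plain Python with made-up 2x2 weights: the base weight is loaded once and never modified, and each request picks a small (B, A) pair by task name. `AdapterServer` and `matvec` are hypothetical names, and real serving stacks batch and cache this very differently.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class AdapterServer:
    """Toy multi-tenant server: one shared base weight, one small (B, A) pair per task."""

    def __init__(self, base_weight):
        self.base_weight = base_weight      # loaded once, never modified
        self.adapters = {}                  # task name -> (B, A, scale)

    def register(self, task, B, A, scale=1.0):
        self.adapters[task] = (B, A, scale)

    def forward(self, task, x):
        """Compute W0 x + scale * B (A x) for the adapter selected by this request."""
        y = matvec(self.base_weight, x)
        if task in self.adapters:
            B, A, scale = self.adapters[task]
            low_rank = matvec(B, matvec(A, x))
            y = [yi + scale * li for yi, li in zip(y, low_rank)]
        return y  # unknown tasks fall back to the plain base model

server = AdapterServer(base_weight=[[1.0, 0.0], [0.0, 1.0]])
server.register("support", B=[[1.0], [0.0]], A=[[1.0, 1.0]])
server.register("legal", B=[[0.0], [1.0]], A=[[1.0, -1.0]])

x = [2.0, 3.0]
print(server.forward("support", x))  # [7.0, 3.0]
print(server.forward("legal", x))    # [2.0, 2.0]
print(server.forward("unknown", x))  # [2.0, 3.0]
```

Keeping the adapters unmerged is what makes per-request swapping cheap; the price is the extra low-rank multiply on every call, which merged deployment avoids.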
A common production pattern: fine-tune with QLoRA on your data, merge the adapter, quantize the merged model to 4-bit, and serve with vLLM or TGI. For modest model and dataset sizes, the whole workflow from raw data to deployed model can be done in a single afternoon on one GPU.
This workflow has become the de facto standard for fine-tuning open-source models. Tools like Hugging Face's PEFT library, Axolotl, and LLaMA-Factory wrap these steps into a single configuration file.
Efficient adaptation is the bridge between pre-training and deployment. Pre-training gives the model broad knowledge; PEFT methods let you specialize that knowledge for your task without the cost of a full fine-tune. The field is evolving rapidly — LoRA itself has spawned dozens of variants (DoRA, AdaLoRA, QLoRA, LoRA+) in just three years.
From zero training to full training. Each method occupies a point on the cost-accuracy tradeoff curve.
| Paper | Year | Key Contribution |
|---|---|---|
| GPT-3 (Brown 2020) | 2020 | Demonstrated in-context learning: zero/few-shot prompting without any weight updates. Showed that scale enables emergent task learning. |
| Chain-of-Thought (Wei 2022) | 2022 | Reasoning chains in the prompt act as working memory. The zero-shot variant, "Let's think step by step" (Kojima et al., 2022), raised MultiArith accuracy from 18% to 79% with no fine-tuning. |
| Lottery Ticket (Frankle 2019) | 2019 | Dense networks contain sparse subnetworks that match full accuracy. Strong empirical evidence of parameter redundancy. |
| LoRA (Hu 2021) | 2021 | Low-rank decomposition of weight updates. 10,000x fewer trainable params, zero inference cost. The PEFT standard. |
| Parameter-Efficient Transfer (Houlsby 2019) | 2019 | Adapter modules: small bottleneck layers inserted inside each frozen transformer layer. With 3.6% added params per task, came within 0.4 points of full FT on GLUE. |
| Lesson | Connection |
|---|---|
| L07: Pretraining | The foundation PEFT adapts. Pre-training gives broad knowledge; PEFT specializes it per task. |
| L08: Post-training | SFT and RLHF are forms of adaptation. LoRA and QLoRA are how most practitioners actually do SFT and DPO in practice. |
| L10: Agents | Agents use prompting (CoT, ReAct) as their adaptation layer. LoRA-tuned models serve as specialized agent backbones. |
| Method | Trainable % | Extra Inference Cost | Multi-Task | Best For |
|---|---|---|---|---|
| Zero/few-shot | 0% | Context tokens only | Unlimited | Quick prototyping, no training data |
| Prompt tuning | ~0.01% | Virtual token processing | Small checkpoints | API-only access, classification |
| Adapter | ~2-4% | Sequential bottleneck per layer | Hot-swappable | Multi-task with shared base |
| LoRA | ~0.1% | Zero (merges into W) | Hot-swappable or merged | Most tasks (the default) |
| QLoRA | ~0.1% | Quantization overhead | Hot-swappable | Large models on single GPU |
| Full FT | 100% | None | Full copy per task | Maximum quality, unlimited compute |
LoRA variants. DoRA (Weight-Decomposed Low-Rank Adaptation) separates magnitude and direction of weight updates, narrowing the gap to full FT. AdaLoRA allocates rank adaptively per layer based on importance scores. LoRA+ uses different learning rates for the A and B matrices, with reported convergence speedups of up to 2x.
Mixture of LoRA experts. Instead of one LoRA adapter per task, train multiple small LoRA "experts" and route inputs to the right expert at inference. This combines multi-task serving with LoRA's efficiency. Each expert specializes in a subtask (code, math, creative writing), and a lightweight router selects the expert based on the input.
Merging without retraining. Model soups and TIES-Merging combine multiple LoRA adapters into a single set of weights — a multi-task model without multi-task training. TIES-Merging resolves sign conflicts between adapters, which naive averaging misses.
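To show what "resolving sign conflicts" means, here is a simplified sketch of a TIES-style merge on flat lists of numbers. It follows the three steps usually described for the method (trim small entries, elect a sign per parameter, average only the agreeing entries), but it is not the reference implementation: `ties_merge` and `keep_fraction` are illustrative names, and the final rescaling and addition back onto the base weights are omitted. For LoRA, each input vector would be a flattened ΔW = BA.

```python
def ties_merge(task_vectors, keep_fraction=0.5):
    """Merge several flat task vectors (lists of floats) with a TIES-style procedure."""
    n = len(task_vectors[0])

    # 1. Trim: keep only the largest-magnitude entries of each task vector.
    trimmed = []
    for vec in task_vectors:
        k = max(1, round(keep_fraction * n))
        threshold = sorted((abs(v) for v in vec), reverse=True)[k - 1]
        trimmed.append([v if abs(v) >= threshold else 0.0 for v in vec])

    merged = []
    for i in range(n):
        column = [vec[i] for vec in trimmed]
        # 2. Elect a sign: the direction with the larger total magnitude wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # 3. Disjoint merge: average only the entries that agree with that sign.
        agreeing = [v for v in column if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

# Two made-up task vectors that disagree in sign on the third parameter.
task_a = [0.8, -0.1, 0.5, 0.05]
task_b = [-0.2, 0.6, -0.9, 0.1]

print(ties_merge([task_a, task_b]))                  # [0.8, 0.6, -0.9, 0.0]
print([(a + b) / 2 for a, b in zip(task_a, task_b)])  # naive average, for comparison
```

On the third parameter the naive average lets +0.5 and -0.9 partly cancel, while the sign election keeps the stronger update intact.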
1-bit and sub-4-bit fine-tuning. As quantization methods improve (GPTQ, AWQ, GGML), the base model gets smaller and LoRA adapters become a proportionally larger fraction of the total. Eventually the adapter might be bigger than the quantized base.
Scaling LoRA. As models grow past 1 trillion parameters, even LoRA's 0.1% becomes a billion parameters. Research is exploring structured pruning of LoRA matrices, tying LoRA weights across layers, and learning which layers even need adaptation (many don't).
LoRA for non-language modalities. LoRA has proven effective far beyond NLP. It's now the standard approach for fine-tuning Stable Diffusion (LoRA for image generation), Whisper (LoRA for speech recognition), and multimodal models like LLaVA (LoRA for vision-language tasks). The low-rank hypothesis holds across modalities.
| Year | Milestone | Impact |
|---|---|---|
| 2019 | Lottery Ticket Hypothesis | Showed parameter redundancy empirically |
| 2019 | Adapter modules (Houlsby) | First practical PEFT for transformers |
| 2020 | GPT-3 in-context learning | Zero-training adaptation at scale |
| 2021 | LoRA (Hu et al.) | Low-rank adaptation with zero inference cost |
| 2021 | Intrinsic dimensionality (Aghajanyan) | Theoretical foundation for low-rank methods |
| 2022 | Chain-of-Thought (Wei et al.) | Reasoning through prompting |
| 2023 | QLoRA (Dettmers et al.) | Fine-tune 65B on single GPU |
| 2023-2024 | AdaLoRA (2023), DoRA and LoRA+ (2024) | Closing the gap to full FT |
The field of efficient adaptation has transformed how practitioners deploy large language models. What once required million-dollar compute budgets and dedicated infrastructure teams can now be done on a single GPU in an afternoon. The key insight — that fine-tuning updates live in a low-dimensional subspace — is both theoretically elegant and practically revolutionary.
As models continue to grow (GPT-4 is rumored to have around 1.8T parameters, though this is unconfirmed, and future models will likely be larger), efficient adaptation becomes not just convenient but mandatory. Full fine-tuning of a 10T parameter model would be beyond the reach of all but a handful of organizations. For most teams, LoRA and its successors will be the only practical path to customization.