CS224N Lecture 9 — Efficient Adaptation

Chapter 0: Why Adapt?

You have GPT-3. 175 billion parameters. Trained on 300 billion tokens of internet text. It cost roughly $4.6 million in compute to train. Now your company wants it to answer customer support tickets about your product. What do you do?

The obvious answer: fine-tune it. Take the pre-trained weights, run gradient descent on your customer support dataset, update all 175 billion parameters. This works. It also costs about $1.2 million in GPU time per training run, requires 350 GB of GPU memory just to hold the weights in fp16 (and several times that once gradients and optimizer states are included), and produces a 350 GB checkpoint that you need to store, version, and deploy.

Now imagine you have ten tasks: customer support, legal summarization, code review, medical Q&A, product recommendations, translation, content moderation, data extraction, email drafting, and report generation. Ten full fine-tunes. Ten 350 GB checkpoints. That's 3.5 TB of weight files. $12 million in compute. And every time the base model improves, you redo all ten.

The Economics of Full Fine-Tuning

The cost of fine-tuning scales with model size, and not linearly. Larger models need more memory for optimizer states (Adam stores two extra values for every parameter), more compute per gradient step, and longer training runs to converge. The chart below shows this steeply rising cost curve.

[Figure: Fine-Tuning Cost vs. Model Size. Each bar shows estimated GPU-hours for a single fine-tune; the jump from 6.7B to 175B is roughly 40x.]

The memory problem is even worse. To fine-tune with Adam, you need to store:

Component	Size (for 175B model)	Why
Model weights (fp16)	350 GB	2 bytes × 175B params
Gradients (fp16)	350 GB	Same size as weights
Adam first moment (fp32)	700 GB	4 bytes × 175B params
Adam second moment (fp32)	700 GB	4 bytes × 175B params
Total	~2.1 TB	Just to train, before activations

That's 2.1 terabytes of GPU memory. A single A100 has 80 GB. You need at least 27 A100s just to hold the weights, gradients, and optimizer states, and that's before accounting for activations during the forward pass.
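The arithmetic in the table can be reproduced in a few lines. This is a rough sketch under the same assumptions as the table (fp16 weights and gradients, fp32 Adam moments, activations ignored); the function name and the 80 GB per-GPU default are illustrative choices, not part of any library.

```python
import math

def adam_training_memory_gb(n_params, weight_bytes=2, grad_bytes=2, moment_bytes=4):
    """Memory in GB for weights, gradients, and Adam's two moments (no activations)."""
    # 2 (weights) + 2 (gradients) + 4 + 4 (two fp32 moments) = 12 bytes per parameter
    per_param = weight_bytes + grad_bytes + 2 * moment_bytes
    return n_params * per_param / 1e9

total_gb = adam_training_memory_gb(175e9)   # training state for a 175B model
min_gpus = math.ceil(total_gb / 80)         # 80 GB cards needed just to hold that state
print(f"{total_gb:.0f} GB -> at least {min_gpus} GPUs")
```

Changing `n_params` to 7e9 shows why even a 7B model does not fit on one 80 GB card for full fine-tuning with Adam under these assumptions.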

The Real-World Cost Breakdown

Let's make the costs concrete. As of 2024, cloud GPU pricing for an 8xA100 node is roughly $25/hour. Fine-tuning GPT-3 175B takes approximately 1,200 GPU-hours on A100 80GB:

Model	GPU-Hours	Cloud Cost (@$25/hr per 8xA100)	Checkpoint Size (fp32)
GPT-2 (125M)	0.5	~$2	0.5 GB
GPT-2 XL (1.5B)	8	~$25	6 GB
LLaMA-7B	40	~$125	28 GB
LLaMA-13B	80	~$250	52 GB
LLaMA-65B	400	~$1,250	260 GB
GPT-3 (175B)	1,200	~$3,750	700 GB

And these are optimistic estimates assuming everything works on the first try. In practice, you experiment with hyperparameters (learning rate, epochs, data mix), requiring 3-10 runs to find a good configuration. Multiply all costs by 5x for realistic budgets. Now imagine doing this for 10 tasks.
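The table's cost column and the 5x multiplier can be checked with a small helper. This is a back-of-the-envelope sketch using the pricing assumption stated above ($25/hour for an 8-GPU node); the function name is made up for illustration.

```python
def cloud_cost_usd(gpu_hours, node_price_per_hour=25.0, gpus_per_node=8):
    """Convert single-GPU hours into dollars, given the hourly price of a whole node."""
    return gpu_hours / gpus_per_node * node_price_per_hour

single_run = cloud_cost_usd(1200)   # the GPT-3 (175B) row of the table
realistic = single_run * 5          # budget for ~5 hyperparameter runs
ten_tasks = realistic * 10          # repeat for 10 tasks
print(single_run, realistic, ten_tasks)
```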

This economic reality created enormous demand for methods that achieve 90-99% of full fine-tuning accuracy at a fraction of the cost. The field of parameter-efficient fine-tuning (PEFT) answers that demand.

Full fine-tuning creates a full 175B-parameter copy per task. Ten tasks = 3.5 TB of weights in fp16, or 7 TB in fp32. This is economically unsustainable. Efficient adaptation methods reduce trainable parameters by 100-10,000x while matching full fine-tuning quality on most tasks.

The Efficient Adaptation Spectrum

The field responded with a spectrum of methods, ranging from "no training at all" to "train a tiny fraction of parameters":

Zero-shot prompting

0 trainable parameters. Just describe the task in the prompt. Works for simple tasks, fragile for complex ones.

↓

Few-shot prompting

0 trainable parameters. Include examples in the prompt. Better accuracy, but uses up context window.

↓

Prompt tuning / Prefix tuning

~0.01% trainable. Learn a small set of continuous "virtual tokens" prepended to the input.

↓

Adapters

~2-4% trainable. Insert small bottleneck layers between frozen transformer blocks.

↓

LoRA

~0.1-1% trainable. Add low-rank decompositions to existing weight matrices. No extra inference cost.

↓

Full fine-tuning

100% trainable. Update every parameter. Best quality, highest cost.

The key insight unifying all these methods: the useful information in a fine-tuning update lives in a low-dimensional subspace. If 90% of weights are redundant (lottery ticket), and the fine-tuning update has intrinsic dimensionality of ~200 (out of millions), then we don't need to update all weights. We can work within that small subspace and get nearly the same result.

This lesson walks through each method from top to bottom. By the end, you'll know exactly when to use each one — and why LoRA has become the default for most practitioners.

Why is full fine-tuning of a 175B parameter model impractical for most organizations?

A. The pre-trained model isn't good enough to fine-tune
B. It requires ~2 TB of GPU memory for weights, gradients, and optimizer states, costs millions in compute, and creates a full 350 GB copy per task
C. Fine-tuning always makes the model worse

Chapter 1: In-Context Learning

In 2020, GPT-3 demonstrated something remarkable: you could make it do new tasks without changing a single weight. No gradient updates. No training loop. You just write the task description in the prompt, and the model figures out what to do.

This is in-context learning (ICL) — the model "learns" the task from examples placed in its context window. The word "learns" is in quotes because the weights never change. Everything happens through the attention mechanism attending over the examples you provide.

Three Modes of In-Context Learning

Zero-shot: Describe the task, provide no examples. The model relies entirely on its pre-training knowledge. "Translate the following English text to French: 'The cat sat on the mat.'" The model has seen enough translation pairs during pre-training to handle this — but accuracy drops sharply on unusual formats or niche domains.

One-shot: Provide exactly one example, then the actual query. "English: 'Hello, how are you?' French: 'Bonjour, comment allez-vous?' English: 'The cat sat on the mat.' French:" The single example teaches the model the desired input-output format.

Few-shot: Provide 4-32 examples. More examples generally improve accuracy, but consume context window tokens. Each example "costs" tokens that could be used for the actual input. With a 2048-token context window, you might fit 10-20 short examples before running out of space for the actual query.

The GPT-3 paper showed dramatic improvements as you go from zero to few-shot, especially on tasks requiring specific output formats. On the SuperGLUE benchmark, zero-shot GPT-3 scored 55.4, while 32-shot GPT-3 scored 71.8 — a 16-point jump without touching the weights.

[Figure: In-Context Learning: Zero vs. Few-Shot. The prompt grows from zero-shot to one-shot to few-shot and accuracy rises, while the model weights stay frozen.]

The weights don't change. All "learning" happens through attention over examples in the context window. The model treats your examples as part of the input text and attends over them during generation. More examples = more patterns to attend over = better accuracy. But this comes at a cost: each example eats context tokens.

Why Does ICL Work?

This is still debated, but the leading theory is that during pre-training, GPT-3 encountered millions of sequences where context implied a task. Wikipedia articles implicitly teach "continue writing about this topic." StackOverflow threads implicitly teach "answer the question." Translation corpora implicitly teach "translate between languages."

When you provide few-shot examples, you're activating circuits the model already learned during pre-training. The examples don't teach new knowledge — they steer the model toward a pre-existing capability. Think of it as finding the right "mode" within a model that already knows how to do many things.

A compelling formal line of work: Garg et al. (2022) showed that transformers trained on random linear regression tasks learn to predict in-context about as accurately as the least-squares estimator. Follow-up work (Akyürek et al., 2022; von Oswald et al., 2023) showed that attention can, in principle, implement gradient descent and ridge regression over the provided examples. This means ICL may be performing a form of optimization internally, carried out in the activations of the forward pass rather than by updating the weights.

Scaling Laws for ICL

ICL ability improves with scale. The GPT-3 paper showed clear trends:

Model Size	Zero-shot (SuperGLUE)	Few-shot (SuperGLUE)	Gain from Few-shot
GPT-3 Small (125M)	42.0	43.1	+1.1 (minimal)
GPT-3 Medium (350M)	43.8	46.5	+2.7
GPT-3 Large (760M)	45.2	50.3	+5.1
GPT-3 XL (1.3B)	47.9	55.1	+7.2
GPT-3 (175B)	55.4	71.8	+16.4

The pattern: the benefit from few-shot examples grows with model size. Small models barely benefit from examples. Large models extract far more from the same examples. ICL is often described as an emergent ability: the gain is negligible in the smallest models and becomes substantial only at large scale, rather than being a fixed bonus at every size.

ICL vs. Fine-Tuning: The Tradeoff

When should you use ICL instead of fine-tuning? The answer depends on three factors:

1. Data availability. ICL needs only a handful of examples (4-32). Fine-tuning typically needs 1K+ examples for SFT quality. If you have fewer than 100 labeled examples, ICL is often your only option.

2. Time to deploy. ICL is available immediately, with no training required. Fine-tuning takes hours to days. If you need to deploy in minutes (a new customer request, a rapidly changing task), ICL wins.

3. Accuracy requirements. For high-stakes applications (medical, legal, financial), ICL's ~70-80% accuracy ceiling may not be sufficient. Fine-tuning (or LoRA) can push accuracy to 90-95%+ on domain-specific tasks. The accuracy gap narrows with better prompting but never fully closes.

Many production systems use a hybrid approach: start with ICL for rapid prototyping, measure accuracy, then fine-tune with LoRA only if ICL falls short of requirements. This "ICL-first" workflow avoids premature optimization and often reveals that the task is simpler than expected.

Limitations

ICL is cheap and fast, but it hits a ceiling:

Limitation	Why It Matters
Context window limit	More examples = better, but you can only fit so many tokens. A 4K context can hold ~16 short examples.
No gradient signal	The model can't fix systematic errors. If it misunderstands the task, more examples don't always help.
Example sensitivity	Accuracy varies wildly with example order, format, and selection. Same examples, different order = different accuracy.
Inference cost	Every call re-processes all the examples. 16 examples × 100 tokens each = 1600 extra tokens per call, every call.
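The context-window and inference-cost rows of the table are simple token arithmetic. The helper names below are hypothetical, and the token counts are the round numbers used in this chapter rather than measurements.

```python
def few_shot_overhead(n_examples, tokens_per_example, calls):
    """Extra prompt tokens processed because the examples are resent on every call."""
    per_call = n_examples * tokens_per_example
    return per_call, per_call * calls

def max_examples(context_window, tokens_per_example, query_tokens, answer_tokens=0):
    """How many examples fit once the query (and room for the answer) is reserved."""
    free = context_window - query_tokens - answer_tokens
    return max(0, free // tokens_per_example)

per_call, per_million_calls = few_shot_overhead(16, 100, calls=1_000_000)
fits = max_examples(context_window=2048, tokens_per_example=100, query_tokens=50)
print(per_call, per_million_calls, fits)
```

Reserving tokens for the answer (`answer_tokens`) matters for generation tasks, where a long output competes with the examples for the same window.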

ICL in Practice: The Data Flow

Let's trace exactly what happens during a few-shot forward pass. Say you have 4 examples, each ~100 tokens, plus a 50-token query. Total input: 450 tokens.

python
# Few-shot prompt construction
# `model` and `tokenize` are placeholders for any causal LM and its tokenizer
examples = [
    ("The food was excellent", "Positive"),
    ("Terrible service", "Negative"),
    ("I loved the ambiance", "Positive"),
    ("Would not return", "Negative"),
]

prompt = "Classify the sentiment:\n"
for text, label in examples:
    prompt += f"{text} → {label}\n"
prompt += "The pasta was bland →"

input_ids = tokenize(prompt)   # shape [1, seq_len] (batch=1)
# seq_len depends on the prompt: 450 in the scenario described above
# (4 examples x ~100 tokens + a 50-token query), far fewer for these short reviews.
# With causal attention, every token attends to all previous tokens, so the
# prediction at the last position is influenced by ALL preceding tokens.
logits = model(input_ids)      # [1, seq_len, vocab_size]
prediction = logits[0, -1]     # next-token logits after the final "→"

The critical insight: attention at the final position can attend to every example token. The model doesn't have a special "example memory" — it uses the same attention mechanism it uses for all text. Examples are just more context.

For tasks where ICL falls short, we need methods that actually modify the model — but without the cost of full fine-tuning. The next chapters explore how.

In in-context learning, what changes when you go from zero-shot to few-shot?

A. The model weights are updated with gradient descent
B. The model architecture changes to add more layers
C. Only the prompt changes: more examples are placed in the context window for the attention mechanism to attend over

Chapter 2: Prompt Engineering

Small changes in how you phrase a prompt can produce enormous accuracy swings. This isn't a quirk — it's a fundamental property of how language models process input. The model doesn't "understand" your intent the way a human does. It generates tokens conditional on the exact token sequence you provide. Change a single word and you shift the entire conditional distribution.

Format Matters More Than Content

Zhao et al. (2021) showed that simply reordering the same few-shot examples could swing accuracy on SST-2 (sentiment classification) from 54% to 93%. Same model, same examples, different permutation. The model was treating example order as a signal about the task — recency bias meant the last example disproportionately influenced the prediction.

Other formatting choices that have measured impact:

Format Choice	Example	Impact
Label words	"Positive/Negative" vs. "Good/Bad" vs. "True/False"	Up to 20% accuracy difference
Separator style	"Answer:" vs. "\n" vs. "=>"	5-15% difference on some tasks
Instruction framing	"Classify this" vs. "What sentiment is this?"	10-25% difference
Example ordering	Random vs. similar-first vs. diverse-first	Up to 40% swing
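Order sensitivity of the kind reported on SST-2 can be measured by scoring every permutation of the same examples. The sketch below only builds the prompts; `score_fn` is a hypothetical callable standing in for your own evaluation (for instance, accuracy on a held-out set).

```python
from itertools import permutations

def build_prompt(examples, query, instruction="Classify the sentiment:"):
    """Format few-shot examples in the given order, followed by the query."""
    lines = [instruction]
    lines += [f"{text} -> {label}" for text, label in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

def order_sensitivity(examples, query, score_fn):
    """Score every ordering of the examples and return (worst, best)."""
    scores = [score_fn(build_prompt(order, query)) for order in permutations(examples)]
    return min(scores), max(scores)

examples = [
    ("The food was excellent", "Positive"),
    ("Terrible service", "Negative"),
    ("I loved the ambiance", "Positive"),
    ("Would not return", "Negative"),
]
prompts = [build_prompt(order, "The pasta was bland") for order in permutations(examples)]
print(len(prompts))  # 4 examples give 4! = 24 orderings
```

With more than a handful of examples the number of orderings explodes, so in practice a random sample of permutations is a reasonable substitute for the full enumeration.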

The Art of Prompt Design

Good prompt engineering follows a few principles:

Be explicit about output format. "Answer with exactly one word: Positive or Negative." Without this, the model might produce "I think this is positive because..." and your parser breaks.

Use the right verbalizer. A verbalizer maps between label names and the words the model uses to express them. If you're classifying sentiment, "Positive/Negative" works better than "1/0" because the model has seen far more text using those words in a sentiment context.

Select representative examples. The examples you choose should cover the distribution of inputs you expect. Don't use all easy examples — include edge cases. Don't use all similar examples — include diverse domains.

Calibrate against biases. Models have a tendency to favor certain labels regardless of input (the "majority label bias"). Zhao et al. proposed contextual calibration: measure the model's bias on a content-free input like "N/A", then adjust predictions accordingly.

[Figure: Prompt Format Comparison. A bad prompt vs. a good prompt for the same task, with the effect of each format choice on accuracy.]

Prompt engineering is not "just asking nicely." It's engineering with measurable outcomes. A well-designed prompt can close 80% of the gap between zero-shot and full fine-tuning — at zero training cost. But it's brittle: the same prompt may not transfer across model versions.

Automatic Prompt Optimization

Humans are slow prompt engineers. Several methods automate the process:

Prompt tuning (Lester et al., 2021): learn a small set of continuous "virtual tokens" prepended to the input. These aren't real words — they're learned embeddings optimized via gradient descent. Only 0.01% of parameters are trainable. The rest of the model is frozen.

Prefix tuning (Li & Liang, 2021): similar idea, but prepend learned vectors to every layer's key-value pairs, not just the input embedding. This gives the "virtual prefix" direct influence over attention at every layer.

Both methods bridge the gap between pure prompting (zero parameters) and adapters (millions of parameters). They're effective for tasks where the pre-trained model already has the knowledge but needs steering — translation, classification, summarization.
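A minimal sketch of the prompt-tuning idea, assuming PyTorch: a block of learned vectors is concatenated in front of the frozen token embeddings, and only that block receives gradients. The class name, the toy vocabulary size, and the small random initialization are illustrative assumptions, not the setup from Lester et al.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learned virtual tokens prepended to frozen input embeddings."""

    def __init__(self, embedding: nn.Embedding, n_virtual_tokens: int = 20):
        super().__init__()
        self.embedding = embedding
        self.embedding.weight.requires_grad_(False)  # freeze pre-trained embeddings
        d_model = embedding.embedding_dim
        # The only trainable tensor: [n_virtual_tokens, d_model]
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, d_model) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embedding(input_ids)                        # [batch, seq_len, d_model]
        prefix = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        # [batch, n_virtual_tokens + seq_len, d_model], fed to the frozen transformer
        return torch.cat([prefix, tok], dim=1)

embedding = nn.Embedding(1000, 64)   # toy vocabulary and width, for illustration only
soft = SoftPrompt(embedding, n_virtual_tokens=20)
out = soft(torch.randint(0, 1000, (2, 10)))
trainable = sum(p.numel() for p in soft.parameters() if p.requires_grad)
print(out.shape, trainable)
```

Prefix tuning differs in where the learned vectors enter: instead of one block at the input, separate learned keys and values are prepended inside every attention layer.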

Contextual Calibration

Zhao et al. (2021) proposed a simple fix for prompt sensitivity: contextual calibration. The idea:

Step 1: Measure Bias

Run the prompt with a content-free input like "N/A" or an empty string. Record the model's label probabilities — this reveals the prior bias.

↓

Step 2: Calibrate

For real inputs, divide each label's probability by the bias probability. This cancels out the model's tendency to favor certain labels regardless of input.

Concretely: if the model assigns 70% to "Positive" on a content-free input, it has a strong positive bias. Dividing all future "Positive" probabilities by 0.7 (and "Negative" by 0.3) re-centers the predictions. This simple trick reduced accuracy variance across prompt formats from 40% to under 5%.

python
# Contextual calibration in code
# Step 1: Get bias probabilities
# (model.predict_probs is a placeholder for whatever returns label probabilities)
bias_prompt = "Classify: N/A\nSentiment:"
bias_probs = model.predict_probs(bias_prompt)
# e.g., {"Positive": 0.7, "Negative": 0.3}

# Step 2: Calibrate real predictions
real_prompt = "Classify: 'Great movie!'\nSentiment:"
raw_probs = model.predict_probs(real_prompt)
# e.g., {"Positive": 0.9, "Negative": 0.1}

calibrated = {k: raw_probs[k] / bias_probs[k]
              for k in raw_probs}
# Normalize so they sum to 1
total = sum(calibrated.values())
calibrated = {k: v / total for k, v in calibrated.items()}
# {"Positive": 0.79, "Negative": 0.21} (less biased)

Why can reordering the same few-shot examples change accuracy by 40 percentage points?

A. The model treats example order as a signal, and recency bias means the last example disproportionately influences predictions
B. Reordering changes the model's weights
C. Later examples get more attention heads allocated to them

Chapter 3: Chain-of-Thought

In 2022, Jason Wei et al. at Google reported something surprising: if you include a few examples with worked-out reasoning chains in a math prompt, PaLM 540B's accuracy on the GSM8K benchmark jumps from 18% to 57%. Same model. Same weights. Same test set. Only the prompt changed.

This is chain-of-thought (CoT) prompting — a technique where you ask the model to show its reasoning steps before giving the final answer. The key insight: language models process information sequentially, token by token. If a problem requires multiple reasoning steps, the model must perform all of them within a single forward pass — unless you give it "scratch space" in the form of generated intermediate tokens.

Why CoT Works

A standard prompt asks the model to go directly from question to answer:

prompt
Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
A: 11

The model must compute 5 + (2 × 3) in a single "step." For simple arithmetic, this works. But for multi-step word problems, the required computation exceeds what a single forward pass can do reliably.

With CoT, the model generates intermediate steps. Each step becomes part of the context for the next step, effectively giving the model a working memory:

prompt
Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with
3 balls each, so 2 × 3 = 6 new balls.
5 + 6 = 11. The answer is 11.

Each generated token becomes context for the next token. The model isn't doing harder computation — it's breaking one hard problem into many easy problems, solving them sequentially through autoregressive generation.

[Figure: Chain-of-Thought: Step-by-Step Reasoning. Without CoT, the model jumps to an answer on a word problem; with CoT, it decomposes the problem.]

CoT converts one hard problem into many easy problems. The model "shows its work" in token space. Each reasoning step becomes context for the next, creating a serial computation chain that exceeds what a single forward pass can compute. This is why CoT helps more on harder problems — easy problems don't need decomposition.

Zero-shot CoT

The most surprising finding: you don't even need hand-crafted reasoning examples. Kojima et al. (2022) showed that just appending "Let's think step by step" to any prompt triggers reasoning chains. This is zero-shot CoT — no examples, no manual chain-writing, just a magic phrase that activates latent reasoning capabilities.

On MultiArith (arithmetic word problems): Standard zero-shot = 18%. Zero-shot CoT = 79%. That's a 61-percentage-point improvement from five words.
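Zero-shot CoT is usually run in two stages: first elicit the reasoning, then ask for the final answer given that reasoning. The sketch below assumes a `generate` callable that maps a prompt string to a completion string; it is a stand-in for any text-generation API, and the exact answer-extraction phrase is an illustrative choice.

```python
from typing import Callable, Tuple

def zero_shot_cot(question: str, generate: Callable[[str], str]) -> Tuple[str, str]:
    """Two-stage zero-shot chain-of-thought prompting."""
    # Stage 1: the trigger phrase elicits a reasoning chain
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)
    # Stage 2: feed the reasoning back and ask only for the final answer
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    answer = generate(answer_prompt).strip()
    return reasoning, answer

def fake_generate(prompt: str) -> str:
    """Canned stand-in for a real model, so the sketch runs without one."""
    if prompt.endswith("Therefore, the answer is"):
        return " 11"
    return "2 cans x 3 balls = 6 new balls. 5 + 6 = 11."

reasoning, answer = zero_shot_cot(
    "Roger has 5 tennis balls. He buys 2 more cans of 3. How many now?", fake_generate
)
print(answer)
```

The second call costs an extra forward pass over the whole reasoning chain, which is the price of getting an answer that is easy to parse.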

When CoT Helps (and When It Doesn't)

Task Type	CoT Benefit	Why
Multi-step arithmetic	Very large (+40-60%)	Requires serial computation the model can't do in one pass
Logic puzzles	Large (+20-40%)	Explicit reasoning prevents shortcut errors
Commonsense reasoning	Moderate (+10-20%)	Helps surface relevant knowledge
Simple classification	Minimal or negative	The task is already easy enough for one pass; CoT adds noise
Factual recall	None	The model either knows the fact or doesn't; reasoning chains don't help

CoT also scales with model size. In small models (<10B parameters), CoT often hurts performance — the model generates plausible-sounding but incorrect reasoning steps. The benefit emerges only in large models that have internalized enough world knowledge to reason accurately.

Variants of Chain-of-Thought

The original CoT idea spawned an entire family of prompting techniques:

Variant	Idea	Improvement Over Standard CoT
Self-consistency	Generate multiple reasoning chains, take majority vote on final answer	+5-10% accuracy on math tasks
Tree-of-Thought	Explore multiple reasoning paths, backtrack from dead ends	Helps on search-like problems (game of 24)
Program-of-Thought	Generate Python code instead of natural language reasoning, execute it	Offloads arithmetic to the interpreter, removing most calculation errors
ReAct	Interleave reasoning with tool use (search, calculator)	Grounds reasoning in external facts

All of these share the same underlying principle: give the model more "thinking tokens" between question and answer. The tokens can be natural language (CoT), code (PoT), or interleaved with tool calls (ReAct). The key is that each intermediate token creates a stepping stone the model can attend to when generating the next token.
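Self-consistency from the table above is easy to sketch: sample several reasoning chains, extract each final answer, and take the majority. The chains here are hand-written stand-ins for sampled model outputs, and the regular expression assumes the "The answer is N" convention used in the worked example earlier in this chapter.

```python
import re
from collections import Counter

def extract_answer(chain: str):
    """Pull the number that follows 'answer is', or None if there is none."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None

def self_consistency(chains):
    """Majority vote over the final answers of several reasoning chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

chains = [
    "2 x 3 = 6 new balls. 5 + 6 = 11. The answer is 11.",
    "5 + 2 = 7 balls. 7 x 3 = 21. The answer is 21.",   # a faulty chain
    "He buys 6 balls and already had 5. The answer is 11.",
]
print(self_consistency(chains))
```

The vote only helps if the chains are sampled with some randomness (temperature above zero); identical greedy decodes would all agree with each other.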

Why does adding "Let's think step by step" improve accuracy on math problems?

A. It activates a special math module in the model
B. It causes the model to generate intermediate tokens that serve as working memory, breaking one hard problem into many easy sequential steps
C. It makes the model run more forward passes

Chapter 4: The Lottery Ticket

So far we've adapted models without changing any weights at all. But what if we want to actually train some weights — just not all 175 billion of them? How many weights do we really need?

In 2019, Jonathan Frankle and Michael Carbin published "The Lottery Ticket Hypothesis," making a striking claim: a dense neural network contains a sparse subnetwork that, when trained in isolation from the same initialization, matches the full network's accuracy. They called this subnetwork the winning ticket.

The Hypothesis

Think of training a neural network like buying lottery tickets. A dense network with 100 million weights is like buying 100 million tickets. Most of those tickets (weights) are losers — they don't meaningfully contribute to the final function the network computes. But somewhere in those 100 million, there's a winning subnetwork of maybe 10 million weights that does all the real work.

Frankle and Carbin demonstrated this experimentally with a technique called iterative magnitude pruning:

Step 1: Train

Train the full dense network to convergence. All 100M weights participate.

↓

Step 2: Prune

Remove the 20% of weights with the smallest magnitude. These weights ended up near zero — they contributed little.

↓

Step 3: Reset

Reset the surviving weights to their original initialization values (not their trained values).

↓

Step 4: Retrain

Train the sparse network from those original initial values. If it matches the full network, you found the ticket.

↻ Repeat pruning (Steps 1-4) to find sparser tickets

The result: on CIFAR-10 with VGG-19, they could prune 90% of weights and still match the original accuracy. The winning ticket was there all along — you just needed to find it.

The iterative part is crucial. You can't just prune 90% of a randomly initialized network in one shot — you don't know which weights are important yet. The pruning must be guided by a full training run that reveals which weights naturally gravitate toward zero. The magnitude of a trained weight is a proxy for its importance: weights that end up large were consistently reinforced by gradients; weights that end up near zero were irrelevant to the loss.

python
# Simplified iterative magnitude pruning
import torch

def iterative_prune(model, train_fn, prune_pct=0.2, rounds=5):
    # Snapshot the initialization (detached, so no autograd history is kept)
    initial_weights = {n: p.detach().clone() for n, p in model.named_parameters()}
    # 1 = weight is alive, 0 = weight is pruned
    # (this sketch prunes every parameter tensor, biases included)
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for r in range(rounds):
        # train_fn is expected to keep pruned weights at zero using `mask`
        train_fn(model, mask)

        with torch.no_grad():
            for n, p in model.named_parameters():
                # Prune the smallest-magnitude prune_pct of the surviving weights
                alive = mask[n].bool()
                threshold = torch.quantile(p[alive].abs(), prune_pct)
                mask[n][alive & (p.abs() < threshold)] = 0

                # Reset survivors to their original initialization
                p.copy_(initial_weights[n] * mask[n])

    return mask  # the winning ticket

[Figure: Lottery Ticket: Sparsity vs. Accuracy. Accuracy holds surprisingly well as sparsity increases until extreme sparsity, then collapses. Colored dots are surviving weights; gray dots are pruned.]

The winning subnetwork was already there at initialization. The key finding: you must reset to the original initial weights, not random new ones. The specific initialization matters — the "lucky" initial values enabled those specific connections to learn the right function. A randomly re-initialized sparse network doesn't work.

Why This Matters for Efficient Adaptation

The lottery ticket hypothesis tells us something profound about parameter efficiency: most parameters in a neural network are redundant. If 90% of weights can be removed without hurting accuracy, then maybe we don't need to update all of them during fine-tuning either.

This observation directly motivates adapter methods and LoRA. If the useful information lives in a low-dimensional subspace, we can fine-tune only within that subspace. The next two chapters show exactly how.

Aghajanyan et al. (2021): Intrinsic Dimensionality

This paper measured the intrinsic dimensionality of fine-tuning: the minimum number of parameters needed to achieve 90% of full fine-tuning accuracy. For a 280M-parameter RoBERTa model fine-tuned on MRPC (paraphrase detection), the intrinsic dimensionality was about 200. Not 200 million. Two hundred.

That's 0.00007% of the total parameters. The fine-tuning "update" lives in an extremely low-dimensional subspace. This is the theoretical foundation for LoRA.
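The idea behind measuring intrinsic dimensionality can be sketched as a reparameterization: freeze the pre-trained weights θ0, fix a random projection P, and train only a small vector z so that the effective weights are θ0 + P·z. The toy below assumes PyTorch and a dense Gaussian P for brevity; the sizes, the objective, and the projection are illustrative assumptions, not the paper's setup.

```python
import torch

torch.manual_seed(0)
D, d = 10_000, 200                       # toy full size D; d echoes the ~200 quoted above
theta0 = torch.randn(D)                  # frozen "pre-trained" weights (random stand-in)
P = torch.randn(D, d) / d ** 0.5         # fixed random projection, never trained
z = torch.zeros(d, requires_grad=True)   # the only trainable parameters

def current_weights():
    # Every full-size weight is a function of the d-dimensional vector z
    return theta0 + P @ z

# Toy objective: reach a target whose difference from theta0 lies in the subspace
target = theta0 + P @ torch.randn(d)

opt = torch.optim.SGD([z], lr=0.01)
losses = []
for _ in range(100):
    opt.zero_grad()
    loss = ((current_weights() - target) ** 2).sum()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"trainable fraction: {d / D:.1%}, loss {losses[0]:.1f} -> {losses[-1]:.6f}")
```

Here the update is reachable by construction. In a real measurement, d is increased until the model trained this way reaches 90% of full fine-tuning performance, and that smallest d is reported as the intrinsic dimension.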

Pruning at Scale

The original lottery ticket experiments used small networks on CIFAR-10. Does the hypothesis hold at scale? Subsequent work revealed a nuance: for large networks, you can't reset to the original initialization — you need to reset to weights from early in training (e.g., after 0.1% of total training). This is the late resetting variant, and it works robustly at scale.

The practical implication: large pre-trained language models are massively over-parameterized. Studies on BERT showed that 40-90% of attention heads can be pruned with negligible accuracy loss on downstream tasks. For GPT-style models, structured pruning of entire layers (not just weights) can remove 25-30% of layers while maintaining 95%+ of the original accuracy.

This over-parameterization is a feature, not a bug: it makes training easier (more paths to good solutions) and enables transfer learning (different tasks use different subsets of the weights). But at deployment time, most of those parameters are wasted compute.

What does the Lottery Ticket Hypothesis say about neural network training?

A. A dense network contains a sparse subnetwork that, trained from the same initial weights, matches the full network's accuracy
B. All weights in a network are equally important
C. You should always train small networks instead of large ones

Chapter 5: Adapters

Now we get to methods that actually modify the model — but surgically, with a scalpel rather than a sledgehammer. Adapters (Houlsby et al., 2019) insert small trainable modules between the frozen layers of a pre-trained transformer. The original weights don't change. Only the adapter weights are trained.

From Prompting to Training: The Leap

Chapters 1 through 3 used zero trainable parameters. Prompting, few-shot, and chain-of-thought all leave the model weights untouched. This is elegant but limited: you're constrained to whatever the model can do with just the right prompt. When you need the model to reliably produce specific output formats, handle domain-specific terminology, or maintain consistent quality across thousands of requests, you need to actually train some weights.

The question: which weights? Adapters answer this by training new weights inserted between existing layers, while leaving all original weights frozen.

Architecture

An adapter module is a simple bottleneck: down-project → nonlinearity → up-project, with a residual connection around the whole thing.

Concretely, for a hidden dimension of d = 1024 and a bottleneck dimension of m = 64:

h ← h + f(h · W_down) · W_up

Where W_down ∈ R^{d × m} (1024 × 64), W_up ∈ R^{m × d} (64 × 1024), and f is a nonlinearity (usually ReLU or GELU). The residual connection h + (...) ensures that if the adapter weights are initialized to near-zero, the module is approximately an identity function — the model starts where the pre-trained model left off.

Where Do Adapters Go?

Houlsby et al. placed two adapter modules in each transformer layer: one after the multi-head attention sublayer, one after the feed-forward sublayer. Each adapter is applied to the sublayer's output before that output is added back to the skip connection and passed to layer normalization.

Component	Parameters	Trainable?
Multi-head attention	4d²	Frozen
Adapter 1	2dm	Yes
Feed-forward network	8d²	Frozen
Adapter 2	2dm	Yes
Layer norms	4d	Often trained too

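With d = 1024 and m = 64, each adapter has 2 × 1024 × 64 = 131,072 parameters (ignoring biases). Two adapters per layer × 24 layers = ~6.3M trainable parameters. For a 340M-parameter BERT-large, that's about 1.85% of total parameters. The adapter grows with the model's width and depth, so for a 175B GPT-3 (d = 12,288, 96 layers) the same bottleneck gives 2 × 12,288 × 64 × 2 × 96 ≈ 302M adapter parameters, roughly 0.17% of the model.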
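The parameter count follows directly from the 2dm entries in the table. A small helper, with a hypothetical name and an optional bias term that the table ignores, makes the arithmetic explicit:

```python
def adapter_params(d_model, bottleneck, n_layers, adapters_per_layer=2, include_bias=False):
    """Total adapter parameters for a stack of transformer layers."""
    per_adapter = 2 * d_model * bottleneck      # W_down (d x m) plus W_up (m x d)
    if include_bias:
        per_adapter += bottleneck + d_model     # one bias vector per projection
    return per_adapter * adapters_per_layer * n_layers

total = adapter_params(d_model=1024, bottleneck=64, n_layers=24)
share = total / 340e6                           # fraction of a 340M-parameter BERT-large
print(f"{total:,} adapter parameters = {share:.2%} of the model")
```

Because the count scales with `d_model` and `n_layers`, the same bottleneck size produces a much larger adapter on a wider, deeper model; calling `adapter_params(12288, 64, 96)` gives the count for a GPT-3-sized stack.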

[Figure: Adapter Architecture. A transformer block with adapter modules. Frozen layers are gray, trainable adapters are orange; each adapter expands into its internal bottleneck structure, and the parameter count changes with the bottleneck size m.]

3.6% parameters per task. 26 tasks = 26 small adapter sets, not 26 full copies. Houlsby et al. tested adapters on 26 text classification tasks, including the GLUE benchmark. On GLUE, adapters adding only 3.6% of BERT-large's parameters per task came within 0.4% of full fine-tuning performance. Total storage: one shared frozen model + 26 tiny adapter checkpoints.

Adapters in Code

python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model, bottleneck):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up   = nn.Linear(bottleneck, d_model)
        self.act  = nn.GELU()
        # Initialize the up-projection to exactly zero
        # so the adapter starts as the identity
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        # Residual bottleneck: x + W_up(f(W_down(x)))
        return x + self.up(self.act(self.down(x)))

The key detail: W_up is initialized to zero, making the entire adapter output zero at initialization. The residual connection means the adapter starts as a pass-through — the model behaves identically to the original pre-trained model on the first forward pass. Training then "grows" the adapter's contribution gradually.

This zero-initialization is critical for training stability. If adapters were randomly initialized, the model's outputs would be disrupted on the first forward pass, because the pre-trained model's learned representations would be corrupted by random adapter noise. Starting from zero means training begins exactly at the pre-trained baseline, and gradients gently guide the adapter toward task-specific modifications.

Choosing the Bottleneck Size

The bottleneck dimension m controls the capacity vs. efficiency tradeoff:

Bottleneck m	Params per Adapter	% of BERT-large (340M), 48 adapters	Typical Use
8	16K	0.23%	Extreme efficiency; simple tasks
32	65K	0.93%	Good balance for most tasks
64	131K	1.85%	Common middle setting
256	524K	7.4%	Complex tasks needing more capacity

The original paper swept bottleneck sizes and found that a fixed m = 64 stayed close to full fine-tuning on GLUE. Smaller bottlenecks such as m = 8 can still work for simpler classification tasks but leave less capacity for harder ones. The choice should be guided by task complexity: simple classification needs less capacity than open-ended generation.

Multi-Task Deployment with Adapters

The killer feature of adapters is modular task composition. You store one frozen base model (say, 350 GB for GPT-3) and swap in tiny adapter checkpoints per task:

Component	Size	Copies	Total
Frozen base model	350 GB	1	350 GB
Adapter per task (m=64)	~25 MB	26	650 MB
Total for 26 tasks			~351 GB
Full FT alternative (26 copies)	350 GB × 26		~9.1 TB

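That's a 26x storage reduction. The ~25 MB figure is the 6.3M-parameter BERT-large adapter from earlier stored in fp32; adapters sized for GPT-3's hidden dimension would be larger, but still a small fraction of the 350 GB base. At serving time, you load the base model once into GPU memory and hot-swap adapter weights based on which task the incoming request requires. The swap is fast because it overwrites only the adapter weights, never the base model.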
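The storage comparison in the table reduces to one line of arithmetic per strategy. A minimal sketch, using the table's own figures (350 GB base, 25 MB per adapter, 26 tasks); the function name and signature are illustrative.

```python
def storage_gb(base_gb: float, per_task_gb: float, n_tasks: int,
               shared_base: bool) -> float:
    """Total checkpoint storage for n_tasks, with or without a shared frozen base."""
    if shared_base:
        return base_gb + per_task_gb * n_tasks
    return base_gb * n_tasks  # full fine-tuning: one full copy per task


adapters = storage_gb(350, 0.025, 26, shared_base=True)
full_ft = storage_gb(350, 0.025, 26, shared_base=False)

print(f"{adapters:.2f} GB")            # 350.65 GB
print(f"{full_ft / 1000:.1f} TB")      # 9.1 TB
print(f"{full_ft / adapters:.1f}x")    # 26.0x
```

Because the shared base dominates the total, the ratio stays close to the number of tasks for any per-task checkpoint that is small relative to the base.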

The Inference Cost Problem

Adapters have one drawback: they add sequential computation during inference. Every forward pass must go through the adapter modules, adding latency proportional to 2 × L adapter forward passes (where L is the number of layers). For a model serving millions of requests per day, even a few milliseconds per adapter matters. The additional latency compounds: two adapters per layer × 96 layers = 192 extra sequential operations.

This latency issue motivated the development of LoRA, which adds zero inference cost by merging the learned parameters directly into the base weights.

Why is the adapter's up-projection initialized to zero?

To reduce memory usage during training

So the adapter starts as an identity function (pass-through), and the model begins exactly where the pre-trained model left off

Because zeros are faster to compute than random values

Chapter 6: LoRA

In 2021, Edward Hu et al. at Microsoft published LoRA: Low-Rank Adaptation of Large Language Models — and it became the most widely used PEFT method within a year. The idea is elegant: weight updates during fine-tuning have low intrinsic rank. If the update matrix ΔW is low-rank, decompose it into two small matrices and train only those.

The Core Idea

Adapters work, but they add new sequential layers to the model. LoRA takes a fundamentally different approach: instead of adding new layers, it modifies existing ones. Aghajanyan et al. showed that fine-tuning has a low intrinsic dimension, and LoRA builds on this with the hypothesis that the update itself is low-rank: the matrix ΔW = W_final - W_pretrained has far fewer independent dimensions than its size would suggest.

In standard fine-tuning, you update a weight matrix W₀ ∈ R^{d × k} to W₀ + ΔW. The update ΔW has the same shape as W₀. For GPT-3's attention weights, that's 12288 × 12288 ≈ 151 million parameters per matrix.

LoRA constrains ΔW to be low-rank by decomposing it:

ΔW = B · A

Where B ∈ R^{d × r} and A ∈ R^{r × k}, and r << min(d, k). For r = 8 and d = k = 12288:

Method	Trainable Params (per matrix)	Ratio
Full fine-tuning (ΔW)	12288 × 12288 = 151M	1x
LoRA r=8 (B + A)	(12288 × 8) + (8 × 12288) = 197K	0.0013x
LoRA r=4 (B + A)	(12288 × 4) + (4 × 12288) = 98K	0.00065x

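That's a 768x reduction in trainable parameters per matrix for r = 8. Across all attention matrices in GPT-3 (Q, K, V, O × 96 layers = 384 matrices), LoRA with r = 8 trains about 75 million parameters instead of 175 billion, a roughly 2,300x reduction. A total of about 4.7M corresponds to a much smaller configuration, for example rank 1 on W_Q and W_V only (2 × 96 × 24,576).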
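The per-matrix counts in the table can be reproduced directly. This is a minimal sketch; the helper names are illustrative, and the two whole-model configurations at the end are examples chosen here to show how strongly the total depends on which matrices are adapted and at what rank.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable weights in one LoRA pair: B is d x r, A is r x k."""
    return d * r + r * k


def lora_total(d: int, k: int, r: int, matrices_per_layer: int, n_layers: int) -> int:
    """LoRA weights summed over every adapted matrix in the model."""
    return lora_params(d, k, r) * matrices_per_layer * n_layers


d = k = 12288                  # GPT-3 hidden size, from the text
full = d * k                   # parameters in one full update matrix

print(full)                                  # 150994944
print(lora_params(d, k, 8))                  # 196608
print(lora_params(d, k, 4))                  # 98304
print(full // lora_params(d, k, 8))          # 768

# Whole-model totals for two example configurations over 96 layers.
print(lora_total(d, k, 1, matrices_per_layer=2, n_layers=96))   # 4718592
print(lora_total(d, k, 8, matrices_per_layer=4, n_layers=96))   # 75497472
```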

How It Works at Inference

Here's the beautiful part: LoRA adds zero inference latency. During training, the forward pass computes:

y = (W₀ + BA) · x = W₀x + BAx

After training, you merge: W_merged = W₀ + BA. Now you have a single weight matrix with no extra computation. The adapter is "baked in." To switch tasks, subtract the current BA to recover W₀, then merge the new task's BA. This is why LoRA dominates: adapters add latency at every forward pass, but a merged LoRA adds none.
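The equivalence between the two-branch training path and the merged inference path follows from distributivity, and it can be verified numerically. Below is a toy pure-Python check; the matrix sizes and random values are assumptions for illustration, with B and A standing in for trained LoRA factors.

```python
import random


def matmul(X, Y):
    """Plain-Python product of X (n x m) and Y (m x p)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, k, r = 5, 4, 2                      # toy sizes
W0 = rand_matrix(d, k)                 # frozen pre-trained weight
B = rand_matrix(d, r)                  # stands in for a trained B
A = rand_matrix(r, k)                  # stands in for a trained A
x = [random.gauss(0, 1) for _ in range(k)]

# Training-time path: two branches, W0 x + B (A x)
unmerged = [b + l for b, l in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Inference-time path: fold BA into W0 once, then a single matrix-vector product
BA = matmul(B, A)
W_merged = [[W0[i][j] + BA[i][j] for j in range(k)] for i in range(d)]
merged = matvec(W_merged, x)

max_diff = max(abs(u - m) for u, m in zip(unmerged, merged))
print(max_diff < 1e-9)     # True: same output up to floating-point rounding

# Switching tasks: subtract this task's BA to recover W0 before merging another
restored = [[W_merged[i][j] - BA[i][j] for j in range(k)] for i in range(d)]
restore_err = max(abs(restored[i][j] - W0[i][j]) for i in range(d) for j in range(k))
print(restore_err < 1e-9)  # True
```

In low-precision formats, repeated merge and unmerge cycles can accumulate rounding error, so keeping an untouched copy of W₀ is a reasonable precaution.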

Initialization

A is initialized with random Gaussian values, B is initialized to zero. This means ΔW = BA = 0 at the start — the model begins as the original pre-trained model, identical to adapter initialization logic.

LoRA also uses a scaling factor α/r to control the magnitude of the update:

y = W₀x + (α/r) · BAx

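A common choice is α = r, which makes the scaling 1. If α is instead held fixed while r increases, the α/r factor shrinks, so each individual rank-1 component contributes proportionally less. This keeps the update magnitude comparable across ranks and reduces the need to retune the learning rate.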
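The effect of the α/r factor is easiest to see by tabulating it for the two conventions. A small sketch; `lora_scale` is an illustrative helper, not a library function.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to BAx in the LoRA forward pass."""
    return alpha / r


# alpha tied to r: the multiplier stays at 1 regardless of rank
print([lora_scale(alpha=r, r=r) for r in (4, 8, 16)])     # [1.0, 1.0, 1.0]

# alpha held fixed at 8: the multiplier shrinks as rank grows
print([lora_scale(alpha=8, r=r) for r in (4, 8, 16)])     # [2.0, 1.0, 0.5]
```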

LoRA: Low-Rank Decomposition

The large W matrix (left) gets a low-rank update ΔW = BA (right). Drag the rank slider to see how B and A change shape. The param counter shows the dramatic reduction.

Weight updates during fine-tuning have low intrinsic rank. Aghajanyan et al. showed the intrinsic dimensionality of fine-tuning is astonishingly low. LoRA exploits this directly: if the update is low-rank, decompose it as BA and train only B and A. Rank 4-8 is often enough to match full fine-tuning.

LoRA in Code

python
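import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear, r=8, alpha=8):
        super().__init__()
        self.linear = linear             # original layer, frozen below
        for p in self.linear.parameters():
            p.requires_grad = False
        d, k = linear.weight.shape       # (out_features, in_features)
        # "lora_" prefix lets a training loop select these parameters by name
        self.lora_A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r
        self.merged = False

    def forward(self, x):
        # x: [batch, seq_len, k]
        base = self.linear(x)            # W_0 @ x (already includes BA once merged)
        if self.merged:
            return base                  # avoid adding the update twice
        lora = (x @ self.lora_A.T) @ self.lora_B.T  # (BA)x
        return base + self.scale * lora

    def merge(self):
        # Bake LoRA into W_0 for zero-cost inference
        if not self.merged:
            self.linear.weight.data += self.scale * (self.lora_B @ self.lora_A)
            self.merged = True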

LoRA Training Loop

The training loop for LoRA is identical to standard fine-tuning, except only LoRA parameters receive gradients:

python
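import torch

# Assumes `model` (with LoRA layers already injected) and `dataloader` exist,
# and that LoRA parameter names contain 'lora' (e.g. lora_A, lora_B).
# Freeze base model, train only LoRA parameters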
for name, param in model.named_parameters():
    if 'lora' not in name:
        param.requires_grad = False  # freeze base

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-4  # higher LR than full FT (typically 5e-5)
)

for batch in dataloader:
    loss = model(batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()   # only LoRA params accumulate gradients
    optimizer.step()
    optimizer.zero_grad()

# After training: merge for zero-cost inference
for module in model.modules():
    if hasattr(module, 'merge'):
        module.merge()

Notice the learning rate: 2e-4 vs. the typical 5e-5 for full fine-tuning. With far fewer parameters, each parameter needs to move more per step to achieve the same total update magnitude.

Which Weights to Adapt?

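The original paper studied only the attention weights and left the feed-forward layers frozen. Under a fixed parameter budget, it found that adapting the query and value projections (W_Q and W_V) together works better than putting all the rank into a single matrix, and that adapting all four attention matrices (Q, K, V, O) performs about the same or marginally better. Later work, including QLoRA, reports that applying LoRA to all linear layers, feed-forward included, helps close the gap to full fine-tuning.

LoRA Target	Params (GPT-3, r = 8)	Quality
W_Q only	18.9M	Good
W_Q + W_V	37.7M	Strong (common default)
W_Q + W_K + W_V + W_O	75.5M	Similar or slightly better
All attention + FFN	~170M	Not tested in the original paper; often helps in later work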

Choosing the Right Rank

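Rank r is the single most important hyperparameter in LoRA. The original paper tested r from 1 to 64 on GPT-3 and found that even very small ranks were competitive on its benchmarks. The table below gives rough rules of thumb; the quality percentages are illustrative, not measurements from the paper: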

Rank	Trainable Params (per matrix)	Quality vs Full FT	When to Use
r = 1	24K	90-95%	Extreme parameter constraint, simple classification
r = 4	98K	95-98%	Default for most NLP tasks
r = 8	197K	98-99%	Standard recommendation; good accuracy-efficiency tradeoff
r = 16	393K	99%+	Complex tasks, code generation, long-form writing
r = 64	1.6M	~100%	When full FT accuracy is strictly required

A practical heuristic: start with r = 8. If accuracy is insufficient, double to r = 16. If r = 16 isn't enough, the task may require full fine-tuning or a fundamentally different approach. Going above r = 64 rarely helps — you're past the intrinsic dimensionality of most fine-tuning updates.
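The doubling heuristic can be written as a short search loop. This is only a sketch: `evaluate` is a hypothetical callback that would train a LoRA model at the given rank and return a validation score, the cap of 64 reflects the rule of thumb above, and the scores in the usage example are made up.

```python
from typing import Callable, Optional


def pick_rank(evaluate: Callable[[int], float], target: float,
              start: int = 8, max_rank: int = 64) -> Optional[int]:
    """Double the rank until evaluate(rank) reaches target; None if it never does."""
    r = start
    while r <= max_rank:
        if evaluate(r) >= target:
            return r
        r *= 2
    return None  # consider full fine-tuning or a different approach


# Toy stand-in for a real train-and-evaluate run (made-up scores).
fake_scores = {8: 0.86, 16: 0.90, 32: 0.91, 64: 0.91}

print(pick_rank(fake_scores.get, target=0.90))   # 16
print(pick_rank(fake_scores.get, target=0.95))   # None
```

In a real setting each call to `evaluate` is a full training run, so the number of doublings is worth keeping small.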

The scaling factor α is typically set equal to r, giving a scale of 1.0. Some practitioners use α = 2r, which doubles the effective size of the update. The learning rate for LoRA parameters is usually several times higher than for full fine-tuning (2e-4 versus 5e-5 in the training loop above), since you're training far fewer parameters.

LoRA vs. Full Fine-Tuning: When Does It Fall Short?

LoRA's low-rank constraint is a double-edged sword. For most downstream NLP tasks (classification, summarization, translation), the update is indeed low-rank and LoRA matches full fine-tuning. But there are cases where it struggles:

Dramatically new domains. If the task requires knowledge fundamentally absent from pre-training (e.g., a new programming language invented after the cutoff), a rank-8 update may not have enough capacity to encode the new patterns. Increasing rank to 32-64 or using full fine-tuning may be necessary.

Very long-context tasks. Adapting the model to handle much longer sequences than it was pre-trained on often requires changes that span many dimensions of the weight space — not a low-rank update.

Multi-modal extensions. Adding a new modality (vision, audio) requires updating embedding layers and cross-attention in ways that exceed low-rank decomposition. LoRA is best for task adaptation within the model's existing modality.

The rule of thumb: if you're steering the model's existing capabilities toward a specific task, LoRA works beautifully. If you're trying to teach the model fundamentally new capabilities, you may need higher rank or full fine-tuning.

Why does LoRA add zero inference latency, unlike adapters?

After training, you merge ΔW = BA into the original weight matrix W₀, producing a single matrix with no extra computation

LoRA uses smaller matrices that are faster to multiply

LoRA is applied only during the backward pass, not the forward pass

Chapter 7: PEFT Playground

You've learned four ways to adapt a model: in-context learning, prompt tuning, adapters, and LoRA. But how do they compare head-to-head? This playground lets you train a simulated model with each method and watch the differences in real time.

Select a method, then click "Train" to start a simulated training run. Watch which weights update (colored vs. gray), and track four metrics: trainable parameters, GPU memory, accuracy, and training time. Try all four methods to see the tradeoffs.

Pay particular attention to the loss curve (bottom right): all methods converge, but at different speeds and to different final accuracies. Full fine-tuning converges slowest but to the best accuracy. Prompt tuning converges fastest but to a lower ceiling. LoRA and adapters sit in between — near full fine-tuning accuracy with prompt tuning speed.

PEFT Method Comparison Playground

Select a method, then click Train. Watch which parts of the model update (orange = trainable, gray = frozen). Compare metrics across methods.

The numbers tell the story:

Method	Trainable Params	GPU Memory	Accuracy	Training Time	Inference Cost
Full FT	175B (100%)	~2.1 TB	Best	Weeks	Baseline
Adapter	6.3M (0.004%)	~400 GB	Near-best	Hours	+latency
LoRA r=8	75.5M (0.04%)	~400 GB	Near-best	Hours	Zero extra
Prompt tuning	77K (0.00004%)	~350 GB	Good	Minutes	Minimal

Several patterns emerge:

Full fine-tuning wins on accuracy when you have enough data and compute. It can modify any weight to fit any pattern. But the cost is enormous — 2 TB of memory, weeks of training, and a full copy of the model per task.

LoRA is the sweet spot for most practitioners. Near-full-fine-tuning accuracy with roughly 0.04% trainable parameters (r = 8 on all attention matrices) and zero extra inference cost. The trained LoRA matrices merge into the base weights, so deployment is identical to deploying the original model.

Adapters match LoRA on accuracy but add inference latency because the adapter modules remain as separate sequential computations during inference. If latency matters, LoRA wins.

Prompt tuning is the lightest touch. Only tens of thousands of parameters, trained in minutes. But it has the lowest accuracy ceiling and can't fix fundamental capabilities the model lacks. Best for tasks the model already almost knows how to do.

Memory Breakdown: Why PEFT Saves So Much

The memory savings aren't just about having fewer parameters. During training with Adam, every trainable parameter requires storing:

The parameter itself: 2 bytes (fp16) or 4 bytes (fp32)

The gradient: same size as the parameter (accumulated during backward pass)

Adam first moment (m): 4 bytes (fp32), running mean of gradients

Adam second moment (v): 4 bytes (fp32), running mean of squared gradients

For a 175B parameter model with full fine-tuning, that's 2 + 2 + 4 + 4 = 12 bytes per parameter = 2.1 TB total. With LoRA at r = 8 on all attention matrices (about 75.5M trainable params), only those parameters need gradients and optimizer states: 75.5M × 12 bytes ≈ 0.9 GB. The frozen 175B parameters are stored in fp16 (350 GB) but need no gradients or optimizer states.
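The 12-bytes-per-parameter rule makes a convenient calculator. A minimal sketch with illustrative function names; the 10M-parameter PEFT setup in the second example is a hypothetical round number, not a figure from the text.

```python
BYTES_PER_TRAINABLE_PARAM = 2 + 2 + 4 + 4   # fp16 weight, fp16 grad, fp32 Adam m, fp32 Adam v


def training_state_bytes(n_trainable: int) -> int:
    """Memory for weights, gradients, and Adam moments of trainable parameters."""
    return n_trainable * BYTES_PER_TRAINABLE_PARAM


def frozen_weight_bytes(n_frozen: int, bytes_per_param: int = 2) -> int:
    """Frozen weights need storage only (fp16 by default)."""
    return n_frozen * bytes_per_param


full_ft = training_state_bytes(175_000_000_000)
print(full_ft / 1e12)                               # 2.1 (TB)

# Hypothetical PEFT setup: 10M trainable parameters on the same frozen base.
peft = frozen_weight_bytes(175_000_000_000) + training_state_bytes(10_000_000)
print(peft / 1e9)                                   # 350.12 (GB)
```

Neither figure includes activations, which are the same order of magnitude for both setups.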

This is why LoRA and adapters are called "parameter-efficient" — the savings cascade through the entire training pipeline: fewer trainable params → fewer gradients → fewer optimizer states → less memory → fewer GPUs → lower cost.

Activation Memory

One subtlety: even with PEFT, you still need to store activations for the backward pass. Activations scale with batch size × sequence length × hidden dimension × number of layers. For a 175B model with batch size 1 and sequence length 2048, activations alone can consume 200+ GB. This is why techniques like gradient checkpointing (recomputing activations instead of storing them) are essential even with LoRA.

Component	Full FT (175B)	LoRA (175B, r=8)
Model weights	350 GB (fp16, trainable)	350 GB (fp16, frozen)
Extra trainable params	none (the weights themselves)	~150 MB
Gradients	350 GB	~150 MB
Optimizer states	1.4 TB	~600 MB
Activations	~200 GB	~200 GB (same!)
Total	~2.3 TB	~550 GB

Notice that activations are the same size regardless of PEFT method — because the forward pass through the frozen model is identical. PEFT saves on weights, gradients, and optimizer states, but not on activations. This is why you still need significant GPU memory even with LoRA.

Lots of data? Full fine-tune. Many tasks? LoRA. No training at all? Prompt engineering. The "right" method depends on your constraints: compute budget, number of tasks, latency requirements, and how far the task is from the model's pre-training distribution. There's no universal best — only best for your situation.

Chapter 8: When to Use What

The choice of adaptation method isn't just about accuracy — it's about the intersection of your data, compute, deployment constraints, and how many tasks you need to serve. This chapter gives you a decision framework.

The Decision Tree

The interactive tree below walks you through the key questions. Click each node to explore a decision path. Your answers determine which method is recommended.

Adaptation Method Decision Tree

Click decision nodes to answer questions. The path lights up to show the recommended method.

No data → prompt. Small data + frozen model → LoRA. Large data + own infra → full FT. This is the high-level heuristic. The decision tree above refines it with additional factors like latency, multi-task requirements, and model access.

Detailed Decision Matrix

Scenario	Best Method	Why
No labeled data at all	Zero/few-shot prompting	Can't train without data; use the model's pre-existing knowledge
10-100 labeled examples	Few-shot + CoT prompting	Too little data for gradient methods; examples in context work well
100-10K examples, API access only	Prompt tuning (if the provider offers it)	Soft prompts need gradients through the model, so this works only when the API exposes a tuning endpoint; otherwise stay with prompting
1K-100K examples, weight access	LoRA (r=4-16)	Best accuracy-per-parameter ratio; zero inference overhead
10+ tasks on same base model	LoRA with adapter swapping	One frozen base + tiny LoRA checkpoints per task; hot-swap at serving
100K+ examples, full GPU cluster	Full fine-tuning	Maximum accuracy; cost is justified by data volume and task importance
Real-time serving, latency-critical	LoRA (merged) or full FT	LoRA merges to zero overhead; adapters add sequential latency
Model too large for GPU memory	QLoRA (quantized + LoRA)	4-bit quantized base model + LoRA adapters fit on single GPU
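The matrix can be compressed into a small lookup function. This is a rough sketch rather than a complete policy: the thresholds mirror the table rows, the argument names are illustrative, and factors such as latency and number of tasks are left out for brevity.

```python
def recommend_method(n_examples: int, weight_access: bool,
                     fits_in_gpu_memory: bool = True,
                     full_cluster: bool = False) -> str:
    """Rough encoding of the decision matrix above."""
    if n_examples == 0:
        return "zero/few-shot prompting"
    if n_examples <= 100:
        return "few-shot + CoT prompting"
    if not weight_access:
        return "prompt tuning or prompting"
    if not fits_in_gpu_memory:
        return "QLoRA"
    if n_examples >= 100_000 and full_cluster:
        return "full fine-tuning"
    return "LoRA"


print(recommend_method(0, weight_access=False))                          # zero/few-shot prompting
print(recommend_method(5_000, weight_access=True))                       # LoRA
print(recommend_method(5_000, weight_access=True,
                       fits_in_gpu_memory=False))                        # QLoRA
print(recommend_method(500_000, weight_access=True, full_cluster=True))  # full fine-tuning
```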

QLoRA: The Practitioner's Default

In practice, most people use QLoRA (Dettmers et al., 2023), which combines 4-bit quantization of the base model with LoRA adapters. This lets you fine-tune a 65B parameter model on a single 48 GB GPU — something that would normally require 8+ GPUs.

The trick: quantize the base model weights to 4-bit NormalFloat (NF4) format, reducing memory by 4x. Then add LoRA adapters in fp16 or bf16 on top. The LoRA gradients are computed in 16-bit, so training precision is maintained. Only the frozen base weights are quantized.

python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the 70B model in 4-bit (~35 GB of weights)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
)

# Add LoRA adapters (~33M trainable params for this config)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. For r=16 on q_proj and v_proj
# of Llama-2-70B this is about 32.8M trainable, well under 0.1% of the model.

With QLoRA, you can fine-tune a 70B model on a single A100 (80 GB), and models up to roughly 33B on a consumer RTX 4090 (24 GB, with careful gradient checkpointing). This democratized fine-tuning: what previously required a compute cluster now runs on a single GPU.
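A quick way to sanity-check which model fits on which GPU is to compute the weight footprint at each precision. The sketch below counts weights only; quantization constants, LoRA parameters, activations, and optimizer state all come on top, so real requirements are higher. The function name is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for n_params in (33e9, 65e9, 70e9):
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{n_params / 1e9:.0f}B: fp16 {fp16:.1f} GB, 4-bit {nf4:.1f} GB")
# 33B: fp16 66.0 GB, 4-bit 16.5 GB
# 65B: fp16 130.0 GB, 4-bit 32.5 GB
# 70B: fp16 140.0 GB, 4-bit 35.0 GB
```

Comparing the 4-bit column against 24 GB, 48 GB, and 80 GB cards shows how much headroom is left for activations at each model size.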

Deployment Considerations

The choice of PEFT method affects not just training but deployment architecture:

Deployment Need	Method	Why
Single GPU serving	QLoRA (merged)	Quantized base + merged LoRA fits in minimal memory
Multi-tenant (many users, different tasks)	LoRA with dynamic swapping	Load base once, hot-swap LoRA weights per request
API-only access (OpenAI, Anthropic)	Prompt engineering + CoT	No weight access; all adaptation through the prompt
Edge/mobile deployment	Distillation from PEFT model	Fine-tune large model with LoRA, then distill to smaller model
Continual learning	Stacked LoRA adapters	Add new LoRA for each update; merge periodically

A common production pattern: fine-tune with QLoRA on your data, merge the adapter, quantize the merged model to 4-bit, and serve with vLLM or TGI. Total workflow from raw data to deployed model can be done in a single afternoon on one GPU.

The End-to-End QLoRA Workflow

1. Prepare Data

Curate (instruction, response) pairs. 1K-10K examples. Quality > quantity. Format as JSON with "input" and "output" fields.

2. Load Quantized Base

Load base model in 4-bit NF4. A 70B model needs about 35 GB for its weights. Use bitsandbytes for NF4; auto-gptq is an alternative 4-bit path.

3. Attach LoRA

Add LoRA adapters (r=16, target Q and V projections). This adds tens of millions of trainable parameters on a 70B model. Configure learning rate ~2e-4.

4. Train

SFT with the loss masked on instruction tokens, so only response tokens are scored. 1-3 epochs. Gradient checkpointing enabled. Takes 2-8 hours on single GPU.

5. Merge & Deploy

Merge LoRA into base weights. Re-quantize merged model. Serve with vLLM for optimized inference.
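The loss masking in step 4 is usually implemented at the data level. The sketch below shows one common convention, under the assumption that the training code uses PyTorch's cross-entropy loss, which skips targets equal to -100 by default; the token ids are made up and `build_example` is an illustrative helper.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_example(instruction_ids, response_ids):
    """Concatenate token ids and mask instruction positions out of the loss."""
    input_ids = list(instruction_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(instruction_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


# Made-up token ids standing in for a tokenized (instruction, response) pair.
example = build_example(instruction_ids=[101, 7, 8, 9], response_ids=[42, 43, 2])

print(example["input_ids"])   # [101, 7, 8, 9, 42, 43, 2]
print(example["labels"])      # [-100, -100, -100, -100, 42, 43, 2]
```

The model still reads the instruction tokens as context; they simply contribute nothing to the gradient.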

This workflow has become the de facto standard for fine-tuning open-source models. Tools like Hugging Face's PEFT library, Axolotl, and LLaMA-Factory wrap all five steps into a single configuration file.

You have 5,000 labeled examples, weight access to a 13B model, and need to serve 8 different tasks. Which method is best?

Full fine-tuning (8 copies of the 13B model)

LoRA with adapter swapping (one frozen base + 8 tiny LoRA checkpoints)

Zero-shot prompting (no training needed)

Chapter 9: Connections

Efficient adaptation is the bridge between pre-training and deployment. Pre-training gives the model broad knowledge; PEFT methods let you specialize that knowledge for your task without the cost of a full fine-tune. The field is evolving rapidly — LoRA itself has spawned dozens of variants (DoRA, AdaLoRA, QLoRA, LoRA+) in just three years.

The Adaptation Spectrum

The Full Adaptation Spectrum

From zero training to full training. Each method occupies a point on the cost-accuracy tradeoff curve.

Key Papers

Paper	Year	Key Contribution
GPT-3 (Brown 2020)	2020	Demonstrated in-context learning: zero/few-shot prompting without any weight updates. Showed that scale enables emergent task learning.
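Chain-of-Thought (Wei 2022)	2022	Few-shot exemplars with written-out reasoning steps act as working memory and improve multi-step reasoning with no fine-tuning. The zero-shot variant, "Let's think step by step" (Kojima et al., 2022), raised MultiArith accuracy from about 18% to 79%.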
Lottery Ticket (Frankle 2019)	2019	Dense networks contain sparse subnetworks that can be trained to match full accuracy. Early evidence of heavy parameter redundancy.
LoRA (Hu 2021)	2021	Low-rank decomposition of weight updates. 10,000x fewer trainable params, zero inference cost. The PEFT standard.
Parameter-Efficient Transfer (Houlsby 2019)	2019	Adapter modules: small bottleneck layers inserted inside each frozen transformer layer. 3.6% params per task, within 0.4% of full FT on GLUE.

Related Lessons

Lesson	Connection
L07: Pretraining	The foundation PEFT adapts. Pre-training gives broad knowledge; PEFT specializes it per task.
L08: Post-training	SFT and RLHF are forms of adaptation. LoRA and QLoRA are how most practitioners actually do SFT and DPO in practice.
L10: Agents	Agents use prompting (CoT, ReAct) as their adaptation layer. LoRA-tuned models serve as specialized agent backbones.

Method Summary

Method	Trainable %	Extra Inference Cost	Multi-Task	Best For
Zero/few-shot	0%	Context tokens only	Unlimited	Quick prototyping, no training data
Prompt tuning	~0.01%	Virtual token processing	Small checkpoints	API-only access, classification
Adapter	~2-4%	Sequential bottleneck per layer	Hot-swappable	Multi-task with shared base
LoRA	~0.1%	Zero (merges into W)	Hot-swappable or merged	Most tasks (the default)
QLoRA	~0.1%	Quantization overhead	Hot-swappable	Large models on single GPU
Full FT	100%	None	Full copy per task	Maximum quality, unlimited compute

The Frontier

LoRA variants. DoRA (Weight-Decomposed Low-Rank Adaptation) separates magnitude and direction of weight updates, narrowing the gap to full FT. AdaLoRA allocates rank adaptively per layer based on importance scores. LoRA+ uses different learning rates for the A and B matrices, with reported convergence speedups of up to about 2x.

Mixture of LoRA experts. Instead of one LoRA adapter per task, train multiple small LoRA "experts" and route inputs to the right expert at inference. This combines multi-task serving with LoRA's efficiency. Each expert specializes in a subtask (code, math, creative writing), and a lightweight router selects the expert based on the input.

Merging without retraining. Model soups and TIES-Merging combine multiple LoRA adapters into a single set of weights — a multi-task model without multi-task training. TIES-Merging resolves sign conflicts between adapters, which naive averaging misses.

1-bit and sub-4-bit fine-tuning. As quantization methods improve (GPTQ, AWQ, GGML), the base model gets smaller and LoRA adapters become a proportionally larger fraction of the total. Eventually the adapter might be bigger than the quantized base.

Scaling LoRA. As models grow past 1 trillion parameters, even LoRA's 0.1% becomes a billion parameters. Research is exploring structured pruning of LoRA matrices, tying LoRA weights across layers, and learning which layers even need adaptation (many don't).

LoRA for non-language modalities. LoRA has proven effective far beyond NLP. It's now the standard approach for fine-tuning Stable Diffusion (LoRA for image generation), Whisper (LoRA for speech recognition), and multimodal models like LLaVA (LoRA for vision-language tasks). The low-rank hypothesis holds across modalities.

Historical Timeline

Year	Milestone	Impact
2019	Lottery Ticket Hypothesis	Evidence of parameter redundancy
2019	Adapter modules (Houlsby)	First practical PEFT for transformers
2020	GPT-3 in-context learning	Zero-training adaptation at scale
2021	LoRA (Hu et al.)	Low-rank adaptation with zero inference cost
2021	Intrinsic dimensionality (Aghajanyan)	Theoretical foundation for low-rank methods
2022	Chain-of-Thought (Wei et al.)	Reasoning through prompting
2023	QLoRA (Dettmers et al.)	Fine-tune 65B on single GPU
2023-2024	AdaLoRA, DoRA, LoRA+	Narrowing the gap to full FT

The field of efficient adaptation has transformed how practitioners deploy large language models. What once required million-dollar compute budgets and dedicated infrastructure teams can now be done on a single GPU in an afternoon. The key insight — that fine-tuning updates live in a low-dimensional subspace — is both theoretically elegant and practically revolutionary.

As models continue to grow (GPT-4 is unofficially estimated at around 1.8T parameters, and future models may be larger), efficient adaptation becomes less a convenience and more a necessity. Full fine-tuning of a 10T parameter model would be out of reach for almost every organization. For most teams, LoRA and its successors will be the practical path to customization.

"The best model for your task is a general model, efficiently adapted." — The unifying lesson of PEFT. Don't train from scratch. Don't copy the whole model. Find the low-dimensional subspace where your task lives, and train only there.