CS224N Lecture 18

LoRA Without Regret

When adaptation fails, why it fails, and how to fix it — from rank selection to adapter merging.

Prerequisites: L09 PEFT. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: The Fine-tuning Dilemma

You deployed a LoRA adapter last month. It worked great on your test set — 94% accuracy on customer intent classification. Then production traffic started shifting. New product names. New complaint patterns. Accuracy dropped to 78%. You retrain the adapter, accuracy goes back up — but now it's worse on the original queries. You're playing whack-a-mole with a 7B-parameter model.

This is the fine-tuning dilemma: the tension between adaptation quality and adaptation cost. Full fine-tuning gives you the best accuracy but costs too much in memory, compute, and storage. LoRA cuts the number of trainable parameters by 100x or more but introduces new failure modes: rank too low, wrong layers targeted, overfitting to small datasets, and catastrophic forgetting of the base model's capabilities.

L09 taught you what PEFT methods are. This lesson teaches you when they break and how to fix them.

The Tradeoff Space

Every adaptation method sits somewhere on a two-axis plane. The X-axis is parameters trained (from 0 for ICL to 100% for full fine-tuning). The Y-axis is task performance (accuracy on your specific task). The ideal method sits at the top-left: few parameters, high accuracy. No method achieves that perfectly — but understanding where each method sits, and why, is the key to making good engineering decisions.

The simulation below lets you explore this tradeoff. Drag the slider to control how many parameters you train. Watch what happens to memory, accuracy, and catastrophic forgetting risk.

Memory / Performance Tradeoff

Slide to change the fraction of parameters trained. Watch GPU memory, task accuracy, and forgetting risk respond.

More parameters trained does NOT always mean better performance. Beyond a sweet spot, you start overfitting to the fine-tuning data and forgetting what the base model knows. The art of PEFT is finding that sweet spot: enough capacity to learn the task, not so much that you destroy what the model already knows.

The rest of this lesson walks you through the practical toolkit: how LoRA actually works at the matrix level, where it fails, how to regularize it, and how to merge multiple adapters into a single model. By the end, you'll be able to diagnose a bad LoRA configuration by looking at the training curves alone.

Why can training more parameters sometimes hurt task performance?

Chapter 1: LoRA Revisited

L09 introduced LoRA as "add low-rank matrices to frozen weights." Now let's open the hood and understand exactly what happens at the matrix level, because the details determine whether your adapter works or fails.

The Core Idea: Low-Rank Decomposition

A pre-trained weight matrix W has shape d × d (for a typical attention projection in a 7B model, d = 4096). Full fine-tuning learns an update ΔW of the same shape: that's d² ≈ 16.8 million parameters per matrix. LoRA's insight: the useful update lives in a low-rank subspace. Instead of learning ΔW directly, we decompose it:

ΔW = B · A

Where B has shape d × r and A has shape r × d. The rank r is typically 4, 8, 16, or 32, vastly smaller than d = 4096. So instead of 16.8M parameters, we learn:

d × r + r × d = 2 · d · r

For r = 8 and d = 4096, that's 2 × 4096 × 8 = 65,536 parameters. A 256x reduction.
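To check the arithmetic for other ranks, here is a small sketch. The function names are illustrative (not from any library), and the count is the 2 · d · r formula above, written so it also covers non-square matrices.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in one LoRA adapter: B is d_out x r, A is r x d_in."""
    return d_out * r + r * d_in


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a full update matrix of shape d_out x d_in."""
    return d_out * d_in


d = 4096
for r in (4, 8, 32, 64):
    lora = lora_param_count(d, d, r)
    full = full_param_count(d, d)
    print(f"r={r:>2}: {lora:>7,} trainable params, {full // lora}x fewer than full")
```

The reduction factor for a square matrix is d / (2r), so doubling the rank halves the savings.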

Initialization Matters

LoRA initializes B to zeros and A with random Gaussian values. This means ΔW = B · A = 0 at the start of training — the adapted model begins identical to the pre-trained model. This is critical: it means you can't accidentally destroy the base model on step 1. The adaptation grows gradually from zero.

There's also a scaling factor α (alpha) that controls the magnitude of the update:

W' = W + (α / r) · B · A

The ratio α/r scales the low-rank update before it is added to W. If α = r, the scale is 1 and the product BA is applied exactly as learned. If α = 2r, the update is amplified 2x. Because it multiplies the whole update, α/r behaves much like a learning rate modifier for the adapter. In practice, α is usually set equal to r or 2r.
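Both points, the zero start and the α/r scaling, can be seen in a toy example. This is a minimal sketch in plain Python. The helper names, the toy sizes, and the 0.02 standard deviation are assumptions for illustration, and `B_trained` is just random values standing in for a B matrix after some training.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(B, A, alpha, r):
    """Return the scaled update (alpha / r) * B @ A."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


d, r = 6, 2                      # toy sizes; real layers use d in the thousands
rng = random.Random(0)
A = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]  # r x d, Gaussian
B = [[0.0] * r for _ in range(d)]                                  # d x r, zeros

# At initialization the update is exactly zero, so W' == W.
delta_init = lora_delta(B, A, alpha=r, r=r)
print(all(v == 0.0 for row in delta_init for v in row))

# Once B is nonzero, doubling alpha doubles every entry of the update.
B_trained = [[rng.gauss(0.0, 0.02) for _ in range(r)] for _ in range(d)]
delta_1x = lora_delta(B_trained, A, alpha=r, r=r)
delta_2x = lora_delta(B_trained, A, alpha=2 * r, r=r)
```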

Matrix Decomposition: W + BA

Drag the rank slider. Watch the parameter count drop and the reconstruction quality change. The heatmaps show W (frozen), B, A, and BA (the learned update).

LoRA is not approximating W — it's learning a delta. A common misconception: LoRA tries to compress W into a low-rank form. It doesn't. W stays frozen at full rank. LoRA only decomposes the update ΔW. The assumption is that the fine-tuning update has low intrinsic dimensionality — which Aghajanyan et al. (2021) showed is true for most NLP tasks.

Which Layers Get LoRA?

A transformer block has four attention projections (Q, K, V, O) and two feed-forward matrices (up and down). The original LoRA paper applied adapters only to Q and V projections. Later work showed that applying LoRA to all linear layers (Q, K, V, O, up, down) at a lower rank often works better than applying it to fewer layers at a higher rank — because it distributes capacity across the model.

The data flow for a single LoRA-adapted Q projection:

python
# Shapes match the text: W_q [d, d], B_q [d, r], A_q [r, d]; rows of x are token vectors.
# Standard attention Q projection
q = x @ W_q.T                 # [batch, seq, d] @ [d, d] = [batch, seq, d]

# LoRA-adapted Q projection
q = x @ W_q.T + (alpha / r) * (x @ A_q.T) @ B_q.T
#   frozen         scale       [batch,seq,d]@[d,r] = [batch,seq,r]
#                                            @[r,d] = [batch,seq,d]

# At inference, merge: W_q' = W_q + (alpha/r) * B_q @ A_q
# No extra latency! The low-rank matrices fold into W.

That last point is LoRA's killer feature: zero inference overhead. After training, you merge BA into W. The serving model is the same size and speed as the original — unlike adapters, which add extra layers and latency.
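The claim that merging changes nothing about the output can be checked numerically. The sketch below uses plain Python with toy sizes and random values as stand-ins for trained weights. It uses the column-vector convention (y = W · x) so that B is d × r and A is r × d, as in the text; the helper names are illustrative.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


d, r, alpha = 5, 2, 4
rng = random.Random(1)
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]  # frozen, d x d
B = [[rng.uniform(-1, 1) for _ in range(r)] for _ in range(d)]  # d x r
A = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(r)]  # r x d
x = [rng.uniform(-1, 1) for _ in range(d)]
scale = alpha / r

# Training-time path: frozen projection plus the low-rank branch.
base_out = matvec(W, x)
branch_out = matvec(B, matvec(A, x))
y_adapter = [b + scale * l for b, l in zip(base_out, branch_out)]

# Serving-time path: fold the update into W once, then do a single projection.
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(max_diff < 1e-9)
```

The two paths agree up to floating-point rounding, which is why the merged model can be served without the extra branch.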

If d = 4096 and r = 16, how many trainable parameters does one LoRA adapter add to a single weight matrix?

Chapter 2: When LoRA Fails

LoRA works remarkably well on most benchmarks. But benchmarks aren't production. In practice, three failure modes account for the vast majority of bad LoRA deployments. Understanding them saves you weeks of debugging.

Failure 1: Rank Too Low

The rank r controls the expressiveness of your update. If the task requires a high-rank update — for example, learning a new language or domain-specific vocabulary — a rank-4 adapter simply can't represent the needed change. The model underfits: training loss stays high, validation loss stays high, accuracy plateaus early.

How do you know the rank is too low? The signature is a flat training loss. If loss stops decreasing after just a few hundred steps despite having plenty of data, the adapter lacks capacity. The fix: increase r to 16, 32, or 64. The cost is more memory, but still orders of magnitude less than full fine-tuning.

Failure 2: Wrong Layers Targeted

If you only apply LoRA to the Q and V projections (following the original paper), you might miss critical capacity. Some tasks require updating the feed-forward networks (FFN), especially tasks that require learning new factual knowledge, since factual knowledge is thought to be stored largely in FFN layers. Other tasks require updating the K projection for better retrieval patterns.

The symptom of wrong-layer targeting is subtler: training loss decreases normally, but validation accuracy is mediocre. The adapter is learning something, but not the right thing. The fix: apply LoRA to all linear layers at a proportionally lower rank.

Failure 3: Distribution Shift

Your fine-tuning data looks like customer emails from Q1. Production traffic includes Q2 complaints about a new product line. The adapter overfit to Q1 patterns. This is the same distribution shift problem that affects all ML, but LoRA's low capacity can make it more brittle than full fine-tuning: a small adapter has less room to cover patterns beyond those that dominate the training set.

The symptom: excellent training metrics, declining production metrics over time. The fix: diversify training data, add regularization (next chapter), or periodically retrain.

Failure Mode Explorer

Click each tab to see what training/val curves look like under each failure mode. Compare to the healthy baseline.

A flat training loss means insufficient capacity (rank too low). A diverging train/val gap means overfitting (regularize). Good train but bad val from the start means wrong layers. Learn to read training curves — they diagnose 90% of LoRA failures without any extra experiments.

Diagnosing in Practice

When a LoRA adapter underperforms, follow this decision tree:

Is training loss decreasing?
If NO → rank too low. Increase r (try 2x, then 4x).
↓ YES
Is validation loss decreasing too?
If NO → wrong layers or overfitting. Try applying LoRA to all layers, or add dropout.
↓ YES
Does production accuracy match val accuracy?
If NO → distribution shift. Diversify training data or retrain periodically.
↓ YES
Everything is working!
Your adapter is healthy. Monitor for drift over time.
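The decision tree above can be sketched as a small function. Everything here is an illustrative assumption: the function name, the rule that "decreasing" means the last loss is at least 5% below the first, and the 5-point accuracy gap that counts as a production mismatch. A real check would look at smoothed curves rather than two endpoints.

```python
def diagnose(train_losses, val_losses, val_acc=None, prod_acc=None,
             min_relative_drop=0.05, max_acc_gap=0.05):
    """Walk the decision tree using simple, assumed thresholds."""
    def decreased(losses):
        # Last value must be at least min_relative_drop below the first one.
        return losses[-1] < losses[0] * (1.0 - min_relative_drop)

    if not decreased(train_losses):
        return "rank too low: increase r (try 2x, then 4x)"
    if not decreased(val_losses):
        return "wrong layers or overfitting: target all linear layers, or add dropout"
    if val_acc is not None and prod_acc is not None and val_acc - prod_acc > max_acc_gap:
        return "distribution shift: diversify training data or retrain periodically"
    return "healthy: keep monitoring for drift"


print(diagnose([2.0, 1.98, 1.97], [2.1, 2.09, 2.08]))
print(diagnose([2.0, 1.0, 0.4], [2.1, 1.2, 0.6], val_acc=0.94, prod_acc=0.78))
```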
Your LoRA adapter's training loss decreases steadily, but validation accuracy is stuck at 72% while full fine-tuning gets 91%. What's the most likely cause?

Chapter 3: Regularization in PEFT

LoRA adapters are small — often just 0.1% of the model's parameters. You might think overfitting isn't a concern with so few trainable parameters. You'd be wrong. With a small fine-tuning dataset (say, 5,000 examples), even 131K trainable parameters can memorize the data. And when a LoRA adapter overfits, the failure is insidious: it looks great on training metrics but produces hallucinated, overconfident answers on new inputs.

Dropout in LoRA

Dropout randomly zeros out elements of the adapter's activations during training, forcing the adapter to spread its learned representation across multiple dimensions rather than memorizing specific patterns. In the reference LoRA implementation and in Hugging Face PEFT, dropout is applied to the input of the adapter branch (before multiplying by A), while the frozen path W · x sees the full input:

ΔW · x = B · A · dropout(x)

Typical dropout rates for LoRA are 0.05 to 0.1 (5-10%). Higher values (0.2+) can prevent the adapter from learning at all, since the update already has to pass through a low-rank bottleneck: an r = 8 adapter has only 8 dimensions to carry the signal, and a heavily masked input leaves it even less to work with.
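For reference, here is a minimal sketch of (inverted) dropout on one activation vector in the adapter branch. The function name and the example vector are illustrative. Surviving elements are rescaled by 1 / (1 - p) so the expected value is unchanged, and dropout is switched off at inference.

```python
import random


def dropout(values, p, rng, training=True):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - p)
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
activations = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.1, -0.3]  # an 8-dimensional example
print(dropout(activations, p=0.1, rng=rng))                   # training: some zeros
print(dropout(activations, p=0.1, rng=rng, training=False))   # inference: unchanged
```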

Weight Decay

Weight decay adds a penalty proportional to the squared L2 norm of the adapter weights, preventing them from growing too large. Large adapter weights mean large ΔW, which means the adapted model diverges far from the pre-trained model, and that is exactly what causes catastrophic forgetting.

L_total = L_task + λ · (||A||² + ||B||²)

Where λ is the weight decay coefficient, typically 0.01 to 0.1 for LoRA. Weight decay acts as a soft constraint keeping the adapted model close to the base model. Think of it as a rubber band pulling the adapter back toward ΔW = 0.
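The rubber-band picture can be made concrete. The sketch below computes the penalized loss for two small matrices and shows one gradient step on the penalty term alone. The function names and toy values are illustrative, and the step is plain gradient descent on λ · ||M||², not the decoupled update that optimizers such as AdamW use.

```python
def squared_norm(M):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in M for v in row)


def total_loss(task_loss, A, B, weight_decay):
    """L_total = L_task + lambda * (||A||^2 + ||B||^2)."""
    return task_loss + weight_decay * (squared_norm(A) + squared_norm(B))


def decay_step(M, lr, weight_decay):
    """Gradient step on the penalty alone: every weight shrinks toward 0."""
    # d/dw [lambda * w^2] = 2 * lambda * w, so w <- w * (1 - 2 * lr * lambda)
    factor = 1.0 - 2.0 * lr * weight_decay
    return [[factor * v for v in row] for row in M]


A = [[1.0, -2.0], [0.5, 0.0]]
B = [[0.0, 3.0], [1.0, 1.0]]
print(total_loss(task_loss=0.40, A=A, B=B, weight_decay=0.01))
print(decay_step(B, lr=0.1, weight_decay=0.01))
```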

The Interplay

Dropout and weight decay serve different purposes. Dropout prevents co-adaptation of features within the adapter (a data-efficiency regularizer). Weight decay prevents the adapter from drifting too far from the base model (a forgetting regularizer). In practice, you usually want both: dropout 0.05-0.1 and weight decay 0.01.

Regularization Effects on Training Curves

Adjust dropout and weight decay to see how they affect train vs. val loss. Watch the gap between curves — a large gap means overfitting.

Dropout prevents co-adaptation. Weight decay prevents forgetting. They're complementary, not redundant. A well-tuned LoRA uses both: dropout ~0.05 keeps features diverse, weight decay ~0.01 keeps the model close to its pre-trained initialization. Too much of either kills learning; too little lets the adapter memorize the training set.

Practical Recipe

| Dataset size | Recommended dropout | Recommended weight decay | Rationale |
| --- | --- | --- | --- |
| < 1K examples | 0.1 - 0.15 | 0.05 - 0.1 | High risk of memorization. Aggressively regularize. |
| 1K - 10K | 0.05 - 0.1 | 0.01 - 0.05 | Standard regime. Moderate regularization. |
| 10K - 100K | 0.0 - 0.05 | 0.01 | Enough data to generalize. Light regularization. |
| > 100K | 0.0 | 0.01 | Overfitting is rare. Weight decay for stability only. |
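If you want the table as a starting point in code, a lookup like the one below is enough. The function name is hypothetical, and the handling of the exact boundaries (1,000, 10,000, and 100,000 examples) is an assumption, since the table leaves them open.

```python
def recommended_regularization(n_examples: int) -> dict:
    """Return (low, high) ranges for dropout and weight decay from the table."""
    if n_examples < 1_000:
        return {"dropout": (0.1, 0.15), "weight_decay": (0.05, 0.1)}
    if n_examples < 10_000:
        return {"dropout": (0.05, 0.1), "weight_decay": (0.01, 0.05)}
    if n_examples <= 100_000:
        return {"dropout": (0.0, 0.05), "weight_decay": (0.01, 0.01)}
    return {"dropout": (0.0, 0.0), "weight_decay": (0.01, 0.01)}


print(recommended_regularization(5_000))
```

Treat the result as a first guess to tune against your own validation curves, not a guarantee.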
You're fine-tuning a LoRA adapter on 800 medical Q&A examples. Training loss drops to 0.02 but validation loss is stuck at 0.45. What should you try first?

Chapter 4: Merging Adapters

You trained a LoRA adapter for customer support. Another for legal Q&A. A third for code review. Each works well in isolation. Now your CEO wants one model that handles all three. Do you need to retrain from scratch on combined data?

No. You can merge the adapters. Each LoRA adapter produces a weight update ΔW. Since these are just matrices, you can combine them using standard linear algebra. The question is how to combine them — and each method has different tradeoffs.

Method 1: Simple Averaging

The simplest approach: average the updates from each adapter.

ΔW_merged = (1/n) · Σ_i ΔW_i

This works when the tasks are similar (e.g., sentiment analysis in different domains). It fails when tasks conflict — if adapter A learned to make outputs longer and adapter B learned to make them shorter, averaging gives you an adapter that does neither well. Think of it as mixing paint: blue + yellow = green (sometimes useful), but red + green = brown (usually not what you want).

Method 2: Task Arithmetic

Ilharco et al. (2023) proposed treating fine-tuning updates as task vectors: directions in weight space that correspond to capabilities. The same idea applies to adapter updates. You can add and subtract them:

ΔW_merged = λ_1 · ΔW_1 + λ_2 · ΔW_2 - λ_3 · ΔW_3

The λ coefficients (all positive here) let you control how much of each task to include. Subtracting a term, as with λ_3 · ΔW_3 above, lets you remove a capability. Trained a model on toxic data by accident? Subtract that adapter. Want more code ability and less chat? Scale accordingly.

Method 3: TIES (Trim, Elect Sign, Merge)

Yadav et al. (2023) identified a problem with simple averaging: interference. When two adapters disagree about the sign of a weight update (one wants +0.1, the other wants -0.1), averaging gives ~0, losing both signals. TIES fixes this in three steps:

1. Trim: Zero out the smallest-magnitude updates in each adapter (keep only the top-k%). Removes noise, keeps signal.
2. Elect Sign: For each weight position, sum the trimmed updates across adapters and take the sign of the total. This is a vote weighted by magnitude, so one large update can outvote several small ones.
3. Merge: Average only the updates that agree with the elected sign. Disagreeing updates are dropped.

TIES consistently outperforms simple averaging on multi-task benchmarks because it preserves the structure of each adapter's update rather than blindly averaging conflicting signals.
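The three steps can be sketched on flat lists of numbers, which stand in for flattened ΔW matrices. The function names are illustrative, and the final scaling coefficient that the TIES paper applies to the merged vector is left out. The sign election uses the sign of the summed trimmed updates, which is how the paper resolves the vote.

```python
def trim(vec, keep_fraction):
    """Keep the largest-magnitude entries, zero the rest."""
    k = max(1, round(len(vec) * keep_fraction))
    threshold = sorted((abs(v) for v in vec), reverse=True)[k - 1]
    # Entries tied with the threshold are all kept, so slightly more than k may survive.
    return [v if abs(v) >= threshold else 0.0 for v in vec]


def ties_merge(task_vectors, keep_fraction=0.2):
    """Trim each vector, elect a sign per position, average the agreeing entries."""
    trimmed = [trim(v, keep_fraction) for v in task_vectors]
    merged = []
    for values in zip(*trimmed):
        total = sum(values)
        if total == 0:
            merged.append(0.0)          # nothing survived, or exact cancellation
            continue
        sign = 1.0 if total > 0 else -1.0
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


adapter_a = [0.9, -0.1, 0.5, 0.0]
adapter_b = [-0.6, 0.8, 0.4, 0.05]
simple_avg = [(a + b) / 2 for a, b in zip(adapter_a, adapter_b)]
ties = ties_merge([adapter_a, adapter_b], keep_fraction=0.5)
print(simple_avg)
print(ties)
```

At the first position the two adapters disagree (+0.9 vs. -0.6). Averaging shrinks the update to about 0.15, while TIES keeps the 0.9 from the side with more total magnitude.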

Adapter Merging Visualized

Three adapters as 2D vectors. Toggle between merge methods to see how the result changes. Drag adapters to reposition them.

Simple averaging destroys signal when adapters disagree. Task arithmetic gives you control over how much of each capability to include. TIES resolves sign conflicts with a magnitude-weighted vote at each weight position. For production multi-task merging, TIES is the default choice unless you need the fine-grained control of task arithmetic.

The Data Flow of Merging

python
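import numpy as np

d, r, alpha = 64, 4, 8              # toy sizes; a 7B model has d = 4096
rng = np.random.default_rng(0)
W_base = rng.normal(size=(d, d))    # stand-in for the frozen base weights

# Each adapter has B_i [d, r] and A_i [r, d] (random stand-ins for trained adapters)
B_1, B_2, B_3 = (rng.normal(size=(d, r)) for _ in range(3))
A_1, A_2, A_3 = (rng.normal(size=(r, d)) for _ in range(3))

# First, compute full delta for each adapter
delta_1 = (alpha / r) * B_1 @ A_1   # [d, d]
delta_2 = (alpha / r) * B_2 @ A_2   # [d, d]
delta_3 = (alpha / r) * B_3 @ A_3   # [d, d]

# Simple average
merged_avg = (delta_1 + delta_2 + delta_3) / 3

# Task arithmetic (emphasize task 1, remove task 3)
merged_arith = 1.5 * delta_1 + 0.8 * delta_2 - 0.5 * delta_3

# Final merged model (pick one merge; task arithmetic shown here)
W_merged = W_base + merged_arith    # [d, d], same size as original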
What problem does TIES solve that simple averaging doesn't?

Chapter 5: LoRA Lab

Time to put everything together. The simulation below is a full LoRA training sandbox. Configure the adapter (rank, target layers, alpha, dropout), watch the training unfold in real-time, then merge multiple adapters and see the combined result.

This is the payoff for the last four chapters. Every slider maps to a concept you've learned: rank controls capacity (Chapter 1), layer selection controls what kind of knowledge the adapter can learn (Chapter 2), dropout and regularization control overfitting (Chapter 3), and the merge view shows what happens when you combine adapters (Chapter 4).

Experiment Guide

Try these experiments to build intuition:

Experiment 1 — Rank sensitivity: Set layers to "All", dropout to 0.05. Start with rank 1 and hit Train. Note the final val loss. Reset, increase to rank 4, repeat. Then rank 16. You'll see diminishing returns after rank 8-16 for most tasks.

Experiment 2 — Layer targeting: Fix rank at 8. Compare "QV only" vs "All layers". The all-layers config should reach lower val loss. Now try "FFN only" — it depends on whether the task requires new knowledge (FFN) or new attention patterns (QKV).

Experiment 3 — Overfitting: Set rank to 32, dropout to 0, weight decay to 0. Train on a small dataset. Watch train loss plummet while val loss plateaus or rises. Now add dropout 0.1 — the gap closes.

Experiment 4 — Merging: Train two adapters with different configs (e.g., one for "sentiment" and one for "summarization"). Switch to merge view and compare averaging vs. TIES.

LoRA Lab — Full Training Sandbox

Configure, train, and merge LoRA adapters. Click Train to start. Watch train (orange) and val (teal) loss curves evolve.


Chapter 6: Future of Adaptation

LoRA was published in 2021. Since then, a wave of successors has pushed the frontier further — each solving a specific limitation of the original. Let's trace the evolution and understand where the field is heading.

QLoRA (2023): Quantize, Then Adapt

Dettmers et al. had a simple but powerful insight: if the base model weights are frozen anyway, why store them in 16 bits? QLoRA quantizes the base model to 4-bit precision (NF4 format), then trains 16-bit LoRA adapters on top (bf16 in the paper). The base weights use roughly 4x less memory; the adapters keep their 16-bit precision.

The impact was dramatic: fine-tuning a 65B model went from requiring 780 GB (full fine-tuning) or 160 GB (LoRA in fp16) to under 48 GB, small enough for a single 48 GB GPU. QLoRA made it possible for academic labs and individuals to fine-tune models that previously required multi-node clusters.
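A back-of-the-envelope calculation shows where most of the saving comes from. The sketch below counts memory for the base weights only, using 1 GB = 1e9 bytes; the function name is illustrative. Adapters, optimizer state, activations, and quantization constants come on top, which is why the totals quoted above are higher than these numbers.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9
print(f"16-bit weights: {weight_memory_gb(n_params, 16):.1f} GB")
print(f" 4-bit weights: {weight_memory_gb(n_params, 4):.1f} GB")
```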

DoRA (2024): Weight-Decomposed Adaptation

Liu et al. compared how weights move under LoRA and under full fine-tuning. With LoRA, changes in magnitude and direction tend to rise and fall together, while full fine-tuning can make a large change in one with only a small change in the other. DoRA decomposes each pre-trained weight matrix into a magnitude component (one scalar per column) and a direction component (unit-norm columns), trains the magnitude directly, and applies LoRA only to the direction. This mimics full fine-tuning's learning pattern and is reported to outperform standard LoRA by roughly 1-3% on the benchmarks in the paper.

W' = m · (W + BA) / ||W + BA||

Where m is the trainable magnitude vector (one entry per column), ||·|| is the column-wise norm, and (W + BA)/||W + BA|| is the updated direction. The key insight: decoupling magnitude and direction lets the optimizer adjust each one independently, which the authors argue makes the direction update easier to learn.
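Here is a minimal sketch of the DoRA reparameterization on a 2 × 2 matrix. It assumes, as in the DoRA paper, one magnitude per column and a column-wise norm, and it assumes no column of W + BA is all zeros. The function names are illustrative.

```python
import math


def column_norms(M):
    """L2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*M)]


def dora_weight(W, BA, m):
    """W' = m * (W + BA) / ||W + BA||, with the norm taken per column."""
    V = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
    norms = column_norms(V)
    return [[m_j * v / n_j for v, m_j, n_j in zip(row, m, norms)] for row in V]


W = [[3.0, 0.0], [4.0, 2.0]]
zero_update = [[0.0, 0.0], [0.0, 0.0]]

# Initialize m to the column norms of W: with BA = 0 this reproduces W exactly.
m_init = column_norms(W)
W_prime = dora_weight(W, zero_update, m_init)

# Doubling m doubles each column's length but leaves its direction unchanged.
W_scaled = dora_weight(W, zero_update, [2.0 * m_j for m_j in m_init])
print(W_prime, W_scaled)
```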

LoRA+ (2024): Different Learning Rates

Hayou et al. showed that using the same learning rate for matrices A and B is suboptimal. Since B is initialized to zero and A to random values, they occupy different regions of the loss landscape at initialization. LoRA+ uses a higher learning rate for B (the zero-initialized matrix) and achieves faster convergence — up to 2x speedup with no quality loss.
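In practice this comes down to giving A and B separate learning rates. The sketch below is illustrative only: the function names are hypothetical, the update is plain SGD rather than the optimizer you would actually use, and the default ratio of 16 is an assumed starting value to tune, not a fixed rule.

```python
def lora_plus_learning_rates(base_lr: float, ratio: float = 16.0) -> dict:
    """Per-matrix learning rates: A uses base_lr, B uses ratio * base_lr."""
    if ratio < 1.0:
        raise ValueError("LoRA+ uses a higher learning rate for B, so ratio >= 1")
    return {"A": base_lr, "B": base_lr * ratio}


def sgd_update(weights, grads, lr):
    """One plain SGD step on a matrix stored as a list of rows."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]


lrs = lora_plus_learning_rates(base_lr=1e-4)
A = sgd_update([[0.1, -0.2]], [[1.0, 1.0]], lrs["A"])    # small step for A
B = sgd_update([[0.0], [0.0]], [[1.0], [-1.0]], lrs["B"])  # larger step for B
print(lrs, A, B)
```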

What's Next?

The trend line is clear: each generation uses fewer bits, fewer parameters, or both, while matching or exceeding the previous generation's quality. The end goal is one-GPU fine-tuning of 100B+ models with negligible quality loss.

Evolution of Parameter-Efficient Adaptation

Timeline of methods and their GPU memory requirements for fine-tuning a 65B model. Watch memory shrink while quality holds steady.

Every LoRA successor solves one specific limitation. QLoRA attacks memory (quantize the base model). DoRA attacks quality (decompose magnitude and direction). LoRA+ attacks speed (separate learning rates). None is strictly "better"; they target different bottlenecks, so in principle they can be combined.

| Method | Year | Key Innovation | 65B Memory | Quality |
| --- | --- | --- | --- | --- |
| Full FT | - | Train everything | 780 GB | Best (baseline) |
| LoRA | 2021 | Low-rank update BA | 160 GB | -1 to -3% vs. full FT |
| QLoRA | 2023 | 4-bit base + 16-bit adapters | 48 GB | Same as LoRA |
| DoRA | 2024 | Decompose magnitude/direction | 52 GB | +1 to +3% vs. LoRA |
| LoRA+ | 2024 | Asymmetric learning rates | 160 GB | Same as LoRA, up to 2x faster |

What is QLoRA's key innovation compared to standard LoRA?

Chapter 7: Connections

This lesson extended L09's introduction to PEFT with the practical knowledge needed to deploy LoRA adapters that actually work. Let's place these techniques in context.

PEFT Methods Compared

| Method | Trainable % | Inference Cost | Merge? | Best For |
| --- | --- | --- | --- | --- |
| Prompt Tuning | ~0.01% | +tokens/call | No | Simple classification |
| Prefix Tuning | ~0.01% | +tokens/call | No | Generation with fixed styles |
| Adapters | ~2-4% | +latency | No | Multi-task serving (swap adapters) |
| LoRA | ~0.1-1% | Zero | Yes | General purpose fine-tuning |
| QLoRA | ~0.1-1% | Quantized | Yes | Large models, limited GPU |
| DoRA | ~0.1-1% | Zero | Yes | When LoRA quality isn't enough |
| Full Fine-tuning | 100% | Zero | N/A | Maximum quality, unlimited budget |

Where This Fits

L09: Efficient Adaptation — Covers the full PEFT landscape from prompting to adapters to LoRA. This lesson assumes you've read L09 and goes deeper on failure modes, regularization, and merging.

L07: Pretraining — Understanding what the base model learned during pretraining helps you predict which fine-tuning updates will work. Knowledge lives in FFN layers; attention patterns live in QKV. This informs which layers to target with LoRA.

The Practical Decision

If you're starting a new fine-tuning project today, here's the default recipe:

Start with QLoRA: Rank 16, all linear layers, alpha=16, dropout 0.05. With a 4-bit base model, this typically fits on a single high-memory GPU (48-80 GB) for models up to 70B.
Monitor train vs. val loss: Healthy means both decrease. Train flat? Increase rank. Val rising? Add regularization.
If quality gap remains: Try DoRA. If still not enough, consider full fine-tuning on a larger cluster.
For multi-task: Train separate adapters, merge with TIES. Deploy a single model.
PEFT Method Comparison

Each method plotted by trainable parameters vs. task quality. Hover to see details.

You need to fine-tune a 70B model on a single A100 (80 GB). Which approach should you try first?