CS224N Lecture 18 — LoRA Without Regret

Chapter 0: The Fine-tuning Dilemma

You deployed a LoRA adapter last month. It worked great on your test set — 94% accuracy on customer intent classification. Then production traffic started shifting. New product names. New complaint patterns. Accuracy dropped to 78%. You retrain the adapter, accuracy goes back up — but now it's worse on the original queries. You're playing whack-a-mole with a 7B-parameter model.

This is the fine-tuning dilemma: the tension between adaptation quality and adaptation cost. Full fine-tuning gives you the best accuracy but costs too much in memory, compute, and storage. LoRA cuts the number of trainable parameters by 100x or more but introduces new failure modes: rank too low, wrong layers targeted, overfitting to small datasets, and catastrophic forgetting of the base model's capabilities.

L09 taught you what PEFT methods are. This lesson teaches you when they break and how to fix them.

The Tradeoff Space

Every adaptation method sits somewhere on a two-axis plane. The X-axis is parameters trained (from 0 for ICL to 100% for full fine-tuning). The Y-axis is task performance (accuracy on your specific task). The ideal method sits at the top-left: few parameters, high accuracy. No method achieves that perfectly — but understanding where each method sits, and why, is the key to making good engineering decisions.

The simulation below lets you explore this tradeoff. Drag the slider to control how many parameters you train. Watch what happens to memory, accuracy, and catastrophic forgetting risk.

[Interactive figure: Memory / Performance Tradeoff. A slider sets the fraction of parameters trained (default 1%); readouts show GPU memory, task accuracy, and forgetting risk.]

More parameters trained does NOT always mean better performance. Beyond a sweet spot, you start overfitting to the fine-tuning data and forgetting what the base model knows. The art of PEFT is finding that sweet spot: enough capacity to learn the task, not so much that you destroy what the model already knows.

The rest of this lesson walks you through the practical toolkit: how LoRA actually works at the matrix level, where it fails, how to regularize it, and how to merge multiple adapters into a single model. By the end, you'll be able to diagnose a bad LoRA configuration by looking at the training curves alone.

Why can training more parameters sometimes hurt task performance?

More parameters always helps; it can't hurt
With a small fine-tuning dataset, too many trainable parameters cause overfitting and catastrophic forgetting of pre-trained knowledge
Because GPUs slow down when training more parameters

Chapter 1: LoRA Revisited

L09 introduced LoRA as "add low-rank matrices to frozen weights." Now let's open the hood and understand exactly what happens at the matrix level, because the details determine whether your adapter works or fails.

The Core Idea: Low-Rank Decomposition

A pre-trained weight matrix W has shape d × d (for a typical attention projection in a 7B model, d = 4096). Full fine-tuning learns an update ΔW of the same shape — that's d² = 16.7 million parameters per matrix. LoRA's insight: the useful update lives in a low-rank subspace. Instead of learning ΔW directly, we decompose it:

ΔW = B · A

Where B has shape d × r and A has shape r × d. The rank r is typically 4, 8, 16, or 32 — vastly smaller than d = 4096. So instead of 16.7M parameters, we learn:

d × r + r × d = 2 · d · r

For r = 8 and d = 4096, that's 2 × 4096 × 8 = 65,536 parameters. A 256x reduction.
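To check the arithmetic for other ranks, here is a minimal sketch. The function name `lora_param_count` is made up for illustration; the formula is the 2 · d · r count above, generalized to non-square matrices.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: B is d_out x r, A is r x d_in."""
    return d_out * r + r * d_in

d = 4096
full = d * d  # parameters in a full d x d update
for r in (4, 8, 16, 32):
    n = lora_param_count(d, d, r)
    print(f"r={r:>2}: {n:>7,} params, {full // n}x fewer than a full update")
```

Doubling the rank doubles the adapter size, so the reduction factor halves each time.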

Initialization Matters

LoRA initializes B to zeros and A with random Gaussian values. This means ΔW = B · A = 0 at the start of training — the adapted model begins identical to the pre-trained model. This is critical: it means you can't accidentally destroy the base model on step 1. The adaptation grows gradually from zero.

There's also a scaling factor α (alpha) that controls the magnitude of the update:

W' = W + (α / r) · B · A

The ratio α/r scales the update and, in practice, acts much like a learning-rate multiplier for the adapter. If α = r, the ratio is 1 and BA is added to W unscaled. If α = 2r, the update is amplified 2x. In practice, α is usually set equal to r or 2r.
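A small NumPy sketch makes the initialization and scaling concrete. It assumes weights stored as [d_out, d_in], a tiny d so it runs instantly, and an illustrative Gaussian standard deviation of 0.01. None of these values come from the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16                   # small, illustrative sizes

W = rng.normal(size=(d, d))               # frozen weight, stored [d_out, d_in]
A = rng.normal(scale=0.01, size=(r, d))   # Gaussian init (std is an assumption)
B = np.zeros((d, r))                      # zero init, so B @ A == 0 at step 0

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T, for x of shape [batch, d_in]."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d))
y_base = x @ W.T
identical_at_init = np.allclose(y_base, lora_forward(x, W, A, B, alpha, r))

# Once B moves away from zero, the adapter's contribution is scaled by alpha / r.
B_trained = rng.normal(scale=0.01, size=(d, r))
delta = lora_forward(x, W, A, B_trained, alpha, r) - y_base
scale_matches = np.allclose(delta, (alpha / r) * (x @ A.T) @ B_trained.T)
print(identical_at_init, scale_matches)
```

Because B starts at zero, the first check holds for any choice of A and α.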

[Interactive figure: Matrix Decomposition: W + BA. A rank slider (default r = 8) changes the parameter count and the reconstruction quality; heatmaps show W (frozen), B, A, and BA (the learned update).]

LoRA is not approximating W; it's learning a delta. A common misconception: LoRA tries to compress W into a low-rank form. It doesn't. W stays frozen at full rank. LoRA only decomposes the update ΔW. The assumption is that the fine-tuning update has low intrinsic dimensionality, which Aghajanyan et al. (2021) found to hold across the NLP fine-tuning tasks they studied.

Which Layers Get LoRA?

A transformer block has four attention projections (Q, K, V, O) and two feed-forward matrices (up and down). The original LoRA paper applied adapters only to Q and V projections. Later work showed that applying LoRA to all linear layers (Q, K, V, O, up, down) at a lower rank often works better than applying it to fewer layers at a higher rank — because it distributes capacity across the model.
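To see why "all layers at lower rank" costs about the same as "Q and V at higher rank", compare the per-block parameter budgets. This sketch assumes a hypothetical FFN hidden size of 4d, which the text does not specify, and the helper names are made up for illustration.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one LoRA pair for a d_in -> d_out projection."""
    return r * (d_in + d_out)

d = 4096
d_ff = 4 * d  # assumption: FFN hidden size, not given in the text

# (d_in, d_out) for each linear layer in one transformer block
shapes = {
    "q": (d, d), "k": (d, d), "v": (d, d), "o": (d, d),
    "up": (d, d_ff), "down": (d_ff, d),
}

def block_budget(targets, r):
    """Total LoRA parameters per block when adapting the named layers at rank r."""
    return sum(lora_params(*shapes[name], r) for name in targets)

qv_r16 = block_budget(["q", "v"], r=16)
all_r4 = block_budget(list(shapes), r=4)
print(f"Q,V at r=16:  {qv_r16:,}")
print(f"all at r=4:   {all_r4:,}")
```

Under these assumptions the two configurations land within about 15% of each other, so the comparison is mostly about where the capacity sits, not how much of it there is.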

The data flow for a single LoRA-adapted attention head:

python
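# Convention: weights are stored [d_out, d_in] (as in torch.nn.Linear),
# so A_q is [r, d] and B_q is [d, r], matching ΔW = B · A above.

# Standard attention Q projection
q = x @ W_q.T                 # [batch, seq, d] @ [d, d] = [batch, seq, d]

# LoRA-adapted Q projection
q = x @ W_q.T + (alpha / r) * (x @ A_q.T) @ B_q.T
#   frozen         scale       [batch, seq, d] @ [d, r] = [batch, seq, r]
#                                              @ [r, d] = [batch, seq, d]

# At inference, merge once: W_q' = W_q + (alpha / r) * B_q @ A_q
# Then q = x @ W_q'.T. No extra latency! The low-rank matrices fold into W.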

That last point is LoRA's killer feature: zero inference overhead. After training, you merge BA into W. The serving model is the same size and speed as the original — unlike adapters, which add extra layers and latency.
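You can verify the merge claim numerically. The sketch below uses random stand-ins for trained adapter weights and assumes the [d_out, d_in] storage convention; it checks that the merged matrix has the original shape and gives the same outputs as the two-path forward.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 32, 4, 8

W = rng.normal(size=(d, d))   # frozen base weight, [d_out, d_in]
A = rng.normal(size=(r, d))   # random stand-ins for trained adapter weights
B = rng.normal(size=(d, r))
x = rng.normal(size=(5, d))

# Unmerged: base path plus the low-rank side path (two extra matmuls).
y_adapter = x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Merged: fold the update into W once, then serve a single matmul.
W_merged = W + (alpha / r) * (B @ A)
y_merged = x @ W_merged.T

same_shape = W_merged.shape == W.shape
same_output = np.allclose(y_adapter, y_merged)
print(same_shape, same_output)
```

The trade-off is that a merged model can no longer hot-swap adapters; keep the unmerged A and B if you need to switch tasks at serving time.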

If d = 4096 and r = 16, how many trainable parameters does one LoRA adapter add to a single weight matrix?

4096 × 4096 = 16.7M (same as full fine-tuning)
2 × 4096 × 16 = 131,072 (two small matrices)
16 × 16 = 256 (just the rank squared)

Chapter 2: When LoRA Fails

LoRA works remarkably well on most benchmarks. But benchmarks aren't production. In practice, three failure modes account for the vast majority of bad LoRA deployments. Understanding them saves you weeks of debugging.

Failure 1: Rank Too Low

The rank r controls the expressiveness of your update. If the task requires a high-rank update — for example, learning a new language or domain-specific vocabulary — a rank-4 adapter simply can't represent the needed change. The model underfits: training loss stays high, validation loss stays high, accuracy plateaus early.

How do you know the rank is too low? The signature is a flat training loss. If loss stops decreasing after just a few hundred steps despite having plenty of data, the adapter lacks capacity. The fix: increase r to 16, 32, or 64. The cost is more memory, but still orders of magnitude less than full fine-tuning.
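The capacity limit can be made concrete with a truncated SVD, which gives the best approximation any rank-r product BA could reach for a given target update. The sketch uses a random matrix as a hypothetical worst-case target; real fine-tuning updates tend to be far more structured, which is why low ranks often suffice.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
target = rng.normal(size=(d, d))  # hypothetical "ideal" update, full rank

U, S, Vt = np.linalg.svd(target)

def best_rank_r_error(r):
    """Relative Frobenius error of the best rank-r approximation (truncated SVD)."""
    approx = (U[:, :r] * S[:r]) @ Vt[:r, :]
    return np.linalg.norm(target - approx) / np.linalg.norm(target)

errors = {r: best_rank_r_error(r) for r in (4, 16, 64)}
for r, err in errors.items():
    print(f"rank {r:>2}: relative error {err:.3f}")
```

No amount of extra training steps gets a rank-4 adapter below its rank-4 error floor, which is what a flat training loss is telling you.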

Failure 2: Wrong Layers Targeted

If you only apply LoRA to the Q and V projections (following the original paper), you might miss critical capacity. Some tasks require updating the feed-forward networks (FFN), especially tasks that involve learning new factual knowledge, since factual knowledge is thought to be stored largely in FFN layers. Other tasks require updating the K projection for better retrieval patterns.

The symptom of wrong-layer targeting is subtler: training loss decreases normally, but validation accuracy is mediocre. The adapter is learning something, but not the right thing. The fix: apply LoRA to all linear layers at a proportionally lower rank.

Failure 3: Distribution Shift

Your fine-tuning data looks like customer emails from Q1. Production traffic includes Q2 complaints about a new product line. The adapter overfit to Q1 patterns. This is the same distribution shift problem that affects all ML, and LoRA does not exempt you from it: a low-rank adapter fit tightly to Q1 patterns has little spare capacity for patterns it never saw.

The symptom: excellent training metrics, declining production metrics over time. The fix: diversify training data, add regularization (next chapter), or periodically retrain.

[Interactive figure: Failure Mode Explorer. Tabs show the training and validation curves under each failure mode next to a healthy baseline.]

A flat training loss means insufficient capacity (rank too low). A diverging train/val gap means overfitting (regularize). Good train but bad val from the start means wrong layers. Learn to read training curves: they diagnose most LoRA failures without any extra experiments.

Diagnosing in Practice

When a LoRA adapter underperforms, follow this decision tree:

Is training loss decreasing?

If NO → rank too low. Increase r (try 2x, then 4x).

↓ YES

Is validation loss decreasing too?

If NO → wrong layers or overfitting. Try applying LoRA to all layers, or add dropout.

↓ YES

Does production accuracy match val accuracy?

If NO → distribution shift. Diversify training data or retrain periodically.

↓ YES

Everything is working!

Your adapter is healthy. Monitor for drift over time.
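The decision tree above can be sketched as a small function over logged metrics. The name `diagnose` and the thresholds `min_drop` and `acc_gap` are hypothetical; treat them as starting points to tune, not as established cutoffs.

```python
def diagnose(train_loss, val_loss, val_acc=None, prod_acc=None,
             min_drop=0.05, acc_gap=0.05):
    """Walk the decision tree on recorded metrics.

    train_loss / val_loss: lists of per-evaluation losses, oldest first.
    min_drop, acc_gap: hypothetical thresholds; tune them for your task.
    """
    def decreased(curve):
        # Relative drop from the first to the last recorded value.
        return (curve[0] - curve[-1]) / curve[0] >= min_drop

    if not decreased(train_loss):
        return "rank too low: increase r (try 2x, then 4x)"
    if not decreased(val_loss):
        return "wrong layers or overfitting: target all layers or add dropout"
    if val_acc is not None and prod_acc is not None and val_acc - prod_acc > acc_gap:
        return "distribution shift: diversify data or retrain periodically"
    return "healthy: monitor for drift"

flat = diagnose([2.0, 1.98, 1.97], [2.1, 2.09, 2.08])
stuck_val = diagnose([2.0, 1.0, 0.3], [2.1, 2.0, 2.05])
shifted = diagnose([2.0, 1.0, 0.5], [2.1, 1.2, 0.7], val_acc=0.94, prod_acc=0.78)
healthy = diagnose([2.0, 1.0, 0.5], [2.1, 1.2, 0.7], val_acc=0.94, prod_acc=0.93)
print(flat, stuck_val, shifted, healthy, sep="\n")
```

Comparing only the first and last points is deliberately crude; on noisy curves you would smooth the losses first.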

Your LoRA adapter's training loss decreases steadily, but validation accuracy is stuck at 72% while full fine-tuning gets 91%. What's the most likely cause?

LoRA is applied to the wrong layers: the adapter is learning patterns that don't transfer to the validation task
The rank is too low: the adapter can't represent the update
The learning rate is too high

Chapter 3: Regularization in PEFT

LoRA adapters are small — often just 0.1% of the model's parameters. You might think overfitting isn't a concern with so few trainable parameters. You'd be wrong. With a small fine-tuning dataset (say, 5,000 examples), even 131K trainable parameters can memorize the data. And when a LoRA adapter overfits, the failure is insidious: it looks great on training metrics but produces hallucinated, overconfident answers on new inputs.

Dropout in LoRA

Dropout randomly zeros out elements of the activations flowing through the adapter during training, forcing the adapter to spread what it learns across many features rather than memorizing specific patterns. In the reference LoRA implementation and in Hugging Face PEFT, dropout is applied to the adapter's input, before multiplying by A:

ΔW · x = (α / r) · B · A · dropout(x)

Typical dropout rates for LoRA are 0.05 to 0.1 (5-10%). Higher values (0.2+) can slow learning or stall it entirely: everything the adapter learns must pass through an r-dimensional bottleneck, so heavy noise on its input leaves little clean signal for those few dimensions to capture.

Weight Decay

Weight decay adds a penalty proportional to the L2 norm of the adapter weights, preventing them from growing too large. Large adapter weights mean large ΔW, which means the adapted model diverges far from the pre-trained model — exactly what causes catastrophic forgetting.

L_total = L_task + λ · (||A||² + ||B||²)

Where λ is the weight decay coefficient, typically 0.01 to 0.1 for LoRA. Weight decay acts as a soft constraint keeping the adapted model close to the base model. Think of it as a rubber band pulling the adapter back toward ΔW = 0.
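Here is the penalty worked through on tiny matrices, as a sketch. The task loss of 0.50 and the learning rate are made-up values, and the squared norms are Frobenius norms.

```python
import numpy as np

def lora_l2_penalty(A, B, lam):
    """lam * (||A||_F^2 + ||B||_F^2), the penalty term in L_total."""
    return lam * (np.sum(A ** 2) + np.sum(B ** 2))

def penalty_grads(A, B, lam):
    """Gradient of the penalty with respect to A and B: 2*lam*A and 2*lam*B."""
    return 2 * lam * A, 2 * lam * B

A = np.array([[3.0, 0.0], [0.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0]])
lam = 0.01

task_loss = 0.50  # hypothetical task loss value
total = task_loss + lora_l2_penalty(A, B, lam)  # 0.50 + 0.01 * 30

# One plain gradient step on the penalty alone shrinks every weight toward 0.
lr = 0.1
gA, gB = penalty_grads(A, B, lam)
A_next, B_next = A - lr * gA, B - lr * gB
print(total, A_next[0, 0], B_next[1, 1])
```

Each step multiplies the weights by (1 - 2 · lr · λ), which is the rubber band in action. Many training scripts use decoupled weight decay (as in AdamW), which applies this shrinkage directly instead of adding a term to the loss; the pull toward ΔW = 0 is the same idea.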

The Interplay

Dropout and weight decay serve different purposes. Dropout prevents co-adaptation of features within the adapter (a data-efficiency regularizer). Weight decay prevents the adapter from drifting too far from the base model (a forgetting regularizer). In practice, you usually want both: dropout 0.05-0.1 and weight decay 0.01.

[Interactive figure: Regularization Effects on Training Curves. Sliders for dropout (default 0.05) and weight decay (default 0.01) change the train and val loss curves; a large gap between the curves means overfitting.]

Dropout prevents co-adaptation. Weight decay prevents forgetting. They're complementary, not redundant. A well-tuned LoRA uses both: dropout ~0.05 keeps features diverse, weight decay ~0.01 keeps the model close to its pre-trained initialization. Too much of either kills learning; too little lets the adapter memorize the training set.

Practical Recipe

Dataset size	Recommended dropout	Recommended weight decay	Rationale
< 1K examples	0.1 - 0.15	0.05 - 0.1	High risk of memorization. Aggressively regularize.
1K - 10K	0.05 - 0.1	0.01 - 0.05	Standard regime. Moderate regularization.
10K - 100K	0.0 - 0.05	0.01	Enough data to generalize. Light regularization.
> 100K	0.0	0.01	Overfitting is rare. Weight decay for stability only.
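If you want the table as a starting point in code, a lookup like the one below works. The function name is made up, and the handling of the exact boundaries (1K, 10K, 100K) is an assumption, since the table's ranges leave them ambiguous.

```python
def recommended_regularization(n_examples: int) -> dict:
    """Return (low, high) ranges for dropout and weight decay from the table.

    Which bucket the exact boundaries 1K, 10K, and 100K fall into is an assumption.
    """
    if n_examples < 1_000:
        return {"dropout": (0.1, 0.15), "weight_decay": (0.05, 0.1)}
    if n_examples <= 10_000:
        return {"dropout": (0.05, 0.1), "weight_decay": (0.01, 0.05)}
    if n_examples <= 100_000:
        return {"dropout": (0.0, 0.05), "weight_decay": (0.01, 0.01)}
    return {"dropout": (0.0, 0.0), "weight_decay": (0.01, 0.01)}

for n in (800, 5_000, 50_000, 500_000):
    print(n, recommended_regularization(n))
```

Treat the returned ranges as the first values to try, then adjust based on the train/val gap.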

You're fine-tuning a LoRA adapter on 800 medical Q&A examples. Training loss drops to 0.02 but validation loss is stuck at 0.45. What should you try first?

Increase dropout to 0.1-0.15 and weight decay to 0.05: the adapter is memorizing the small dataset
Increase the rank r to give the adapter more capacity
Remove all regularization so the model can learn faster

Chapter 4: Merging Adapters

You trained a LoRA adapter for customer support. Another for legal Q&A. A third for code review. Each works well in isolation. Now your CEO wants one model that handles all three. Do you need to retrain from scratch on combined data?

No. You can merge the adapters. Each LoRA adapter produces a weight update ΔW. Since these are just matrices, you can combine them using standard linear algebra. The question is how to combine them — and each method has different tradeoffs.

Method 1: Simple Averaging

The simplest approach: average the updates from each adapter.

ΔW_merged = (1/n) ∑_i ΔW_i

This works when the tasks are similar (e.g., sentiment analysis in different domains). It fails when tasks conflict — if adapter A learned to make outputs longer and adapter B learned to make them shorter, averaging gives you an adapter that does neither well. Think of it as mixing paint: blue + yellow = green (sometimes useful), but red + green = brown (usually not what you want).

Method 2: Task Arithmetic

Ilharco et al. (2023) proposed treating fine-tuning updates as task vectors: directions in weight space that correspond to capabilities. The same idea applies to adapter updates. You can add and subtract them:

ΔW_merged = λ₁ · ΔW₁ + λ₂ · ΔW₂ + λ₃ · ΔW₃

The λ coefficients let you control how much of each task to include. Setting λ₃ negative lets you remove a capability. Trained a model on toxic data by accident? Subtract that adapter. Want more code ability and less chat? Scale accordingly.

Method 3: TIES (Trim, Elect Sign, Merge)

Yadav et al. (2023) identified a problem with simple averaging: interference. When two adapters disagree about the sign of a weight update (one wants +0.1, the other wants -0.1), averaging gives ~0, losing both signals. TIES fixes this in three steps:

1. Trim

Zero out the smallest magnitude updates (keep only top-k%). Removes noise, keeps signal.

↓

2. Elect Sign

For each weight position, take a magnitude-weighted vote: sum the trimmed updates across adapters and use the sign of the total, so the direction with more total mass wins.

↓

3. Merge

Average only the updates that agree with the elected sign. Disagreeing updates are dropped.

TIES consistently outperforms simple averaging on multi-task benchmarks because it preserves the structure of each adapter's update rather than blindly averaging conflicting signals.
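The three steps above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' reference code: `keep_frac` is a hypothetical default, and the sign is elected by summed magnitude (the sign of the sum of trimmed updates), which differs from a head-count vote only when a few large updates oppose many small ones.

```python
import numpy as np

def ties_merge(deltas, keep_frac=0.2):
    """Merge same-shape update arrays with trim / elect sign / merge.

    keep_frac: fraction of entries kept per adapter (hypothetical default).
    """
    trimmed = []
    for delta in deltas:
        k = max(1, int(round(keep_frac * delta.size)))
        # 1. Trim: keep the k largest-magnitude entries, zero the rest.
        threshold = np.sort(np.abs(delta), axis=None)[-k]
        trimmed.append(np.where(np.abs(delta) >= threshold, delta, 0.0))
    stacked = np.stack(trimmed)  # [n_adapters, ...]

    # 2. Elect sign: per position, the sign with the larger total magnitude.
    elected = np.sign(stacked.sum(axis=0))

    # 3. Merge: average only the nonzero entries that agree with the elected sign.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(axis=0)
    summed = np.where(agree, stacked, 0.0).sum(axis=0)
    return np.divide(summed, counts, out=np.zeros_like(summed), where=counts > 0)

d1 = np.array([0.10, 0.50, 0.01])
d2 = np.array([0.12, -0.40, 0.00])
d3 = np.array([-0.10, 0.45, -0.02])

simple_avg = (d1 + d2 + d3) / 3
ties = ties_merge([d1, d2, d3], keep_frac=2 / 3)
print("average:", simple_avg)
print("TIES:   ", ties)
```

In the first position, averaging gives 0.04 because the lone dissenter drags the mean down, while TIES keeps 0.11, the mean of the two updates that agree with the elected sign.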

[Interactive figure: Adapter Merging Visualized. Three adapters are drawn as draggable 2D vectors; toggling the merge method shows how the merged result changes.]

Simple averaging destroys signal when adapters disagree. Task arithmetic gives you control over how much of each capability to include. TIES resolves sign conflicts by electing one dominant sign per weight. For production multi-task merging, TIES is a sensible default unless you need the fine-grained control of task arithmetic.

The Data Flow of Merging

python
# Each adapter has B_i [d, r] and A_i [r, d]; alpha and r are shared here
# First, compute the full delta for each adapter
delta_1 = (alpha / r) * B_1 @ A_1   # [d, d]
delta_2 = (alpha / r) * B_2 @ A_2   # [d, d]
delta_3 = (alpha / r) * B_3 @ A_3   # [d, d]

# Option 1: simple average
merged_avg = (delta_1 + delta_2 + delta_3) / 3

# Option 2: task arithmetic (emphasize task 1, remove task 3)
merged_arith = 1.5 * delta_1 + 0.8 * delta_2 - 0.5 * delta_3

# Final merged model (pick one of the merged deltas)
W_merged = W_base + merged_arith  # [d, d], same size as original

What problem does TIES solve that simple averaging doesn't?

When adapters disagree on the sign of a weight update, averaging cancels out both signals; TIES elects a dominant sign per weight to preserve the dominant direction
TIES reduces memory usage during merging
TIES makes merging faster by using parallel computation

Chapter 5: LoRA Lab

Time to put everything together. The simulation below is a full LoRA training sandbox. Configure the adapter (rank, target layers, alpha, dropout), watch the training unfold in real-time, then merge multiple adapters and see the combined result.

This is the payoff for the last four chapters. Every slider maps to a concept you've learned: rank controls capacity (Chapter 1), layer selection controls what kind of knowledge the adapter can learn (Chapter 2), dropout and regularization control overfitting (Chapter 3), and the merge view shows what happens when you combine adapters (Chapter 4).

Experiment Guide

Try these experiments to build intuition:

Experiment 1 — Rank sensitivity: Set layers to "All", dropout to 0.05. Start with rank 1 and hit Train. Note the final val loss. Reset, increase to rank 4, repeat. Then rank 16. You'll see diminishing returns after rank 8-16 for most tasks.

Experiment 2 — Layer targeting: Fix rank at 8. Compare "QV only" vs "All layers". The all-layers config should reach lower val loss. Now try "FFN only" — it depends on whether the task requires new knowledge (FFN) or new attention patterns (QKV).

Experiment 3 — Overfitting: Set rank to 32, dropout to 0, weight decay to 0. Train on a small dataset. Watch train loss plummet while val loss plateaus or rises. Now add dropout 0.1 — the gap closes.

Experiment 4 — Merging: Train two adapters with different configs (e.g., one for "sentiment" and one for "summarization"). Switch to merge view and compare averaging vs. TIES.

[Interactive figure: LoRA Lab, Full Training Sandbox. Configure, train, and merge LoRA adapters; train loss is plotted in orange and val loss in teal. Default settings: rank 8, alpha 8, dropout 0.05, weight decay 0.01.]

Chapter 6: Future of Adaptation

LoRA was published in 2021. Since then, a wave of successors has pushed the frontier further — each solving a specific limitation of the original. Let's trace the evolution and understand where the field is heading.

QLoRA (2023): Quantize, Then Adapt

Dettmers et al. had a simple but powerful insight: if the base model weights are frozen anyway, why store them in 16 bits? QLoRA quantizes the base model to 4-bit precision (NF4 format), then trains LoRA adapters in 16-bit precision on top (the paper uses bf16). The base weights use about 4x less memory; the adapters stay in 16-bit.

The impact was dramatic: fine-tuning a 65B model went from requiring more than 780 GB (full fine-tuning) or 160 GB (LoRA in fp16) to under 48 GB, which fits on a single 48 GB GPU or an 80 GB A100. QLoRA made it possible for academic labs and individuals to fine-tune models that previously required multi-node clusters.
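A back-of-the-envelope calculation shows where the savings come from. This sketch counts the base weights only; it ignores activations, adapter gradients and optimizer state, and the small overhead of quantization constants, which is why the real total sits above the 4-bit figure.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n = 65e9  # 65B parameters
fp16_gb = weight_memory_gb(n, 16)
nf4_gb = weight_memory_gb(n, 4)
print(f"16-bit base weights: {fp16_gb:.1f} GB")
print(f" 4-bit base weights: {nf4_gb:.1f} GB")
print(f"reduction: {fp16_gb / nf4_gb:.0f}x")
```

The frozen weights dominate the footprint in LoRA training, so shrinking them 4x moves the whole job from multi-GPU territory to a single device.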

DoRA (2024): Weight-Decomposed Adaptation

Liu et al. observed that LoRA tends to change the magnitude and direction of weight vectors in lockstep, while full fine-tuning can make a large change to one with only a small change to the other. DoRA decomposes each weight matrix into a magnitude component (one scalar per column) and a direction component (unit-norm columns), then applies LoRA only to the direction. This brings LoRA's learning pattern closer to full fine-tuning's, with reported gains over standard LoRA of roughly 1-3% in the authors' experiments.

W' = m · (W + BA) / ||W + BA||

Where m is a trainable magnitude vector (one scalar per column of W), ||·|| is the column-wise norm, and (W + BA)/||W + BA|| is the updated direction. The key idea: decoupling magnitude from direction lets the optimizer adjust each one independently, as full fine-tuning does.
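A minimal NumPy sketch of the DoRA reparameterization, assuming (as in the DoRA paper) one magnitude per column with the norm taken column-wise, and omitting the α/r scale just as the formula above does. The sizes and init scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4

W = rng.normal(size=(d, d))               # frozen base weight
A = rng.normal(scale=0.01, size=(r, d))
B = np.zeros((d, r))                      # zero init, as in LoRA

# Magnitude starts as the column norms of W, so W' == W before training.
m = np.linalg.norm(W, axis=0, keepdims=True)  # shape [1, d], trainable

def dora_weight(W, A, B, m):
    """W' = m * (W + BA) / ||W + BA||, with the norm taken per column."""
    V = W + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

W_init = dora_weight(W, A, B, m)

# After B changes, the low-rank update only rotates columns; m sets their length.
B_trained = rng.normal(scale=0.01, size=(d, r))
W_new = dora_weight(W, A, B_trained, m)
new_norms = np.linalg.norm(W_new, axis=0, keepdims=True)
print(np.allclose(W_init, W), np.allclose(new_norms, m))
```

The second check is the point of the method: whatever BA does, the column magnitudes of W' are controlled by m alone.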

LoRA+ (2024): Different Learning Rates

Hayou et al. showed that using the same learning rate for matrices A and B is suboptimal. Since B is initialized to zero and A to random values, they occupy different regions of the loss landscape at initialization. LoRA+ uses a higher learning rate for B (the zero-initialized matrix) and achieves faster convergence — up to 2x speedup with no quality loss.
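In practice this amounts to two optimizer parameter groups. The sketch below builds them without any framework; the list-of-dicts shape matches what PyTorch optimizers accept as per-parameter options. Identifying B matrices by "lora_B" in the name and the ratio of 16 are both assumptions for illustration, not tuned recommendations.

```python
def lora_plus_param_groups(named_params, base_lr=1e-4, b_lr_ratio=16.0):
    """Split (name, param) pairs into two groups with different learning rates.

    Assumptions: B matrices have "lora_B" in their name, and b_lr_ratio=16
    is an illustrative value, not a tuned recommendation.
    """
    named_params = list(named_params)  # allow generators; we iterate twice
    a_params = [p for name, p in named_params if "lora_B" not in name]
    b_params = [p for name, p in named_params if "lora_B" in name]
    return [
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * b_lr_ratio},
    ]

# Stand-in parameters; with PyTorch you would pass the trainable entries of
# model.named_parameters() and hand the result to an optimizer.
params = [("layer0.q.lora_A", "A0"), ("layer0.q.lora_B", "B0"),
          ("layer0.v.lora_A", "A1"), ("layer0.v.lora_B", "B1")]
groups = lora_plus_param_groups(params)
for g in groups:
    print(g["lr"], g["params"])
```

Everything else about training stays the same; only the step size for B changes.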

What's Next?

The trend line is clear: each generation uses fewer bits, fewer parameters, or both, while matching or exceeding the previous generation's quality. The end goal is one-GPU fine-tuning of 100B+ models with negligible quality loss.

[Interactive figure: Evolution of Parameter-Efficient Adaptation. A timeline of methods and their GPU memory requirements for fine-tuning a 65B model.]

Every LoRA successor solves one specific limitation. QLoRA attacks memory (quantize the base model). DoRA attacks quality (decompose magnitude and direction). LoRA+ attacks speed (separate learning rates). None is strictly "better"; they're complementary. Because each targets a different bottleneck, QLoRA, DoRA, and LoRA+ can in principle be combined.

Method	Year	Key Innovation	65B Memory	Quality vs LoRA
Full FT	-	Train everything	780 GB	Best (baseline)
LoRA	2021	Low-rank update BA	160 GB	-1 to -3%
QLoRA	2023	4-bit base + fp16 adapters	48 GB	Same as LoRA
DoRA	2024	Decompose magnitude/direction	52 GB	+1 to +3%
LoRA+	2024	Asymmetric learning rates	160 GB	Same, 2x faster

What is QLoRA's key innovation compared to standard LoRA?

It uses a higher rank for better accuracy
It quantizes the frozen base model to 4-bit, reducing memory by ~4x while keeping adapters in full precision
It removes the need for LoRA adapters entirely

Chapter 7: Connections

This lesson extended L09's introduction to PEFT with the practical knowledge needed to deploy LoRA adapters that actually work. Let's place these techniques in context.

PEFT Methods Compared

Method	Trainable %	Inference Cost	Merge?	Best For
Prompt Tuning	~0.01%	+tokens/call	No	Simple classification
Prefix Tuning	~0.01%	+tokens/call	No	Generation with fixed styles
Adapters	~2-4%	+latency	No	Multi-task serving (swap adapters)
LoRA	~0.1-1%	Zero	Yes	General purpose fine-tuning
QLoRA	~0.1-1%	Quantized	Yes	Large models, limited GPU
DoRA	~0.1-1%	Zero	Yes	When LoRA quality isn't enough
Full Fine-tuning	100%	Zero	N/A	Maximum quality, unlimited budget

Where This Fits

L09: Efficient Adaptation — Covers the full PEFT landscape from prompting to adapters to LoRA. This lesson assumes you've read L09 and goes deeper on failure modes, regularization, and merging.

L07: Pretraining — Understanding what the base model learned during pretraining helps you predict which fine-tuning updates will work. Knowledge lives in FFN layers; attention patterns live in QKV. This informs which layers to target with LoRA.

The Practical Decision

If you're starting a new fine-tuning project today, here's the default recipe:

Start with QLoRA

Rank 16, all linear layers, alpha=16, dropout 0.05. This fits on a single GPU for most models up to 70B.

↓

Monitor train vs. val loss

Healthy: both decrease. Train flat? Increase rank. Val rising? Add regularization.

↓

If quality gap remains

Try DoRA. If still not enough, consider full fine-tuning on a larger cluster.

↓

For multi-task

Train separate adapters, merge with TIES. Deploy a single model.

[Interactive figure: PEFT Method Comparison. Each method is plotted by trainable parameters vs. task quality.]

You need to fine-tune a 70B model on a single A100 (80 GB). Which approach should you try first?

Full fine-tuning with gradient checkpointing
QLoRA with 4-bit base model + fp16 adapters: it fits in ~48 GB
Prompt tuning: it trains the fewest parameters