When adaptation fails, why it fails, and how to fix it — from rank selection to adapter merging.
You deployed a LoRA adapter last month. It worked great on your test set — 94% accuracy on customer intent classification. Then production traffic started shifting. New product names. New complaint patterns. Accuracy dropped to 78%. You retrain the adapter, accuracy goes back up — but now it's worse on the original queries. You're playing whack-a-mole with a 7B-parameter model.
This is the fine-tuning dilemma: the tension between adaptation quality and adaptation cost. Full fine-tuning gives you the best accuracy but costs too much in memory, compute, and storage. LoRA cuts the number of trainable parameters by 100x or more but introduces new failure modes: rank too low, wrong layers targeted, overfitting to small datasets, and catastrophic forgetting of the base model's capabilities.
L09 taught you what PEFT methods are. This lesson teaches you when they break and how to fix them.
Every adaptation method sits somewhere on a two-axis plane. The X-axis is parameters trained (from 0 for ICL to 100% for full fine-tuning). The Y-axis is task performance (accuracy on your specific task). The ideal method sits at the top-left: few parameters, high accuracy. No method achieves that perfectly — but understanding where each method sits, and why, is the key to making good engineering decisions.
The simulation below lets you explore this tradeoff. Drag the slider to control how many parameters you train. Watch what happens to memory, accuracy, and catastrophic forgetting risk.
Slide to change the fraction of parameters trained. Watch GPU memory, task accuracy, and forgetting risk respond.
The rest of this lesson walks you through the practical toolkit: how LoRA actually works at the matrix level, where it fails, how to regularize it, and how to merge multiple adapters into a single model. By the end, you'll be able to diagnose a bad LoRA configuration by looking at the training curves alone.
L09 introduced LoRA as "add low-rank matrices to frozen weights." Now let's open the hood and understand exactly what happens at the matrix level, because the details determine whether your adapter works or fails.
A pre-trained weight matrix W has shape d × d (for a typical attention projection in a 7B model, d = 4096). Full fine-tuning learns an update ΔW of the same shape, which is d² ≈ 16.8 million parameters per matrix. LoRA's insight: the useful update lives in a low-rank subspace. Instead of learning ΔW directly, we decompose it:
ΔW = B · A
Where B has shape d × r and A has shape r × d. The rank r is typically 4, 8, 16, or 32, vastly smaller than d = 4096. So instead of 16.8M parameters, we learn:
d × r + r × d = 2 × d × r
For r = 8 and d = 4096, that's 2 × 4096 × 8 = 65,536 parameters. A 256x reduction.
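To make the arithmetic concrete, here is a small sketch that counts trainable parameters for one d × d matrix under full fine-tuning and under LoRA. The helper name `lora_param_counts` is illustrative and not part of any library.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (full, lora, reduction) for a single d x d weight matrix."""
    full = d * d        # full fine-tuning learns every entry of the update
    lora = 2 * d * r    # B is d x r and A is r x d
    return full, lora, full / lora


full, lora, reduction = lora_param_counts(d=4096, r=8)
print(full, lora, reduction)  # 16777216 65536 256.0
```

Doubling r doubles the LoRA count and halves the reduction, so rank 16 on the same matrix gives a 128x reduction.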
LoRA initializes B to zeros and A with random Gaussian values. This means ΔW = B · A = 0 at the start of training — the adapted model begins identical to the pre-trained model. This is critical: it means you can't accidentally destroy the base model on step 1. The adaptation grows gradually from zero.
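A minimal NumPy sketch of that initialization, assuming a small d so it runs instantly. The Gaussian standard deviation of 0.01 is an assumption for illustration; the text only says A is random Gaussian.

```python
import numpy as np

d, r = 64, 8                             # small d for illustration; the text uses d = 4096
rng = np.random.default_rng(0)

A = rng.normal(0.0, 0.01, size=(r, d))   # random Gaussian init (std is an assumption)
B = np.zeros((d, r))                     # zero init

delta_W = B @ A                          # [d, d]
print(np.count_nonzero(delta_W))         # 0
```

Because every entry of the update is zero, the adapted weight W + ΔW equals W exactly at step 0.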
There's also a scaling factor α (alpha) that controls the magnitude of the update:
W' = W + (α/r) · B · A
The ratio α/r acts as an effective learning rate modifier for the adapter. If α = r, the ratio is 1 and the product B · A is added to W unscaled. If α = 2r, the update is amplified 2x. In practice, α is usually set equal to r or 2r.
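The effect of the α/r ratio can be checked numerically. This is a sketch with random stand-ins for trained B and A; `lora_delta` is a hypothetical helper name.

```python
import numpy as np


def lora_delta(B, A, alpha):
    """Scaled LoRA update (alpha / r) * B @ A, with r read from B's shape."""
    r = B.shape[1]
    return (alpha / r) * (B @ A)


rng = np.random.default_rng(0)
d, r = 64, 8
B = rng.normal(size=(d, r))   # random stand-in for a trained B
A = rng.normal(size=(r, d))   # random stand-in for a trained A

base = np.linalg.norm(B @ A)
for alpha in (r, 2 * r):
    ratio = np.linalg.norm(lora_delta(B, A, alpha)) / base
    print(f"alpha={alpha}  alpha/r={alpha / r:.1f}  update norm ratio={ratio:.1f}")
```

The same B and A produce an update twice as large when α goes from r to 2r, which is why changing α interacts with the learning rate you choose.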
Drag the rank slider. Watch the parameter count drop and the reconstruction quality change. The heatmaps show W (frozen), B, A, and BA (the learned update).
A transformer block has four attention projections (Q, K, V, O) and two feed-forward matrices (up and down). The original LoRA paper applied adapters only to Q and V projections. Later work showed that applying LoRA to all linear layers (Q, K, V, O, up, down) at a lower rank often works better than applying it to fewer layers at a higher rank — because it distributes capacity across the model.
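To see why spreading capacity is affordable, you can compare per-block parameter budgets. The sketch below assumes an FFN width of 11008 (a value chosen for illustration, not given in the text) and uses a hypothetical helper `lora_params`.

```python
def lora_params(shape: tuple[int, int], r: int) -> int:
    """LoRA parameters for a weight of shape (out, in): B is out x r, A is r x in."""
    out_dim, in_dim = shape
    return r * (out_dim + in_dim)


d, d_ff = 4096, 11008   # d from the text; d_ff is an assumed FFN width
layers = {
    "q": (d, d), "k": (d, d), "v": (d, d), "o": (d, d),
    "up": (d_ff, d), "down": (d, d_ff),
}

qv_only = sum(lora_params(layers[name], r=16) for name in ("q", "v"))
all_linear = sum(lora_params(shape, r=4) for shape in layers.values())
print(qv_only, all_linear)  # 262144 251904
```

Under these assumptions, rank 4 on all six matrices costs slightly less per block than rank 16 on Q and V alone, so the comparison is between where the capacity goes, not how much of it there is.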
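The data flow for a single LoRA-adapted attention projection (small dimensions so the snippet runs quickly, and a random B rather than the zero init so the merge check is non-trivial):
```python
import numpy as np

batch, seq, d, r, alpha = 2, 5, 64, 8, 16  # d = 4096 in a 7B model
rng = np.random.default_rng(0)
x = rng.normal(size=(batch, seq, d))
W_q = rng.normal(size=(d, d))              # frozen, [d_out, d_in]
A_q = rng.normal(size=(r, d))              # trainable, [r, d]
B_q = rng.normal(size=(d, r))              # trainable, [d, r]

# Standard attention Q projection
q = x @ W_q.T                              # [batch, seq, d]
# LoRA-adapted Q projection: frozen path + scaled low-rank path
q_lora = x @ W_q.T + (alpha / r) * (x @ A_q.T) @ B_q.T  # [batch,seq,d]@[d,r]@[r,d]
# At inference, merge: the low-rank matrices fold into W, so no extra latency
W_q_merged = W_q + (alpha / r) * B_q @ A_q # [d, d]
assert np.allclose(q_lora, x @ W_q_merged.T)
```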
That last point is LoRA's killer feature: zero inference overhead. After training, you merge BA into W. The serving model is the same size and speed as the original — unlike adapters, which add extra layers and latency.
LoRA works remarkably well on most benchmarks. But benchmarks aren't production. In practice, three failure modes account for the vast majority of bad LoRA deployments. Understanding them saves you weeks of debugging.
The rank r controls the expressiveness of your update. If the task requires a high-rank update — for example, learning a new language or domain-specific vocabulary — a rank-4 adapter simply can't represent the needed change. The model underfits: training loss stays high, validation loss stays high, accuracy plateaus early.
How do you know the rank is too low? The signature is a flat training loss. If loss stops decreasing after just a few hundred steps despite having plenty of data, the adapter lacks capacity. The fix: increase r to 16, 32, or 64. The cost is more memory, but still orders of magnitude less than full fine-tuning.
If you only apply LoRA to the Q and V projections (following the original paper), you might miss critical capacity. Some tasks require updating the feed-forward networks (FFN) — especially tasks that require learning new factual knowledge, since knowledge is stored primarily in FFN layers. Other tasks require updating the K projection for better retrieval patterns.
The symptom of wrong-layer targeting is subtler: training loss decreases normally, but validation accuracy is mediocre. The adapter is learning something, but not the right thing. The fix: apply LoRA to all linear layers at a proportionally lower rank.
Your fine-tuning data looks like customer emails from Q1. Production traffic includes Q2 complaints about a new product line. The adapter overfit to Q1 patterns. This is the same distribution shift problem that affects all ML, but LoRA's low capacity makes it more brittle than full fine-tuning — fewer parameters means less room to generalize.
The symptom: excellent training metrics, declining production metrics over time. The fix: diversify training data, add regularization (next chapter), or periodically retrain.
Click each tab to see what training/val curves look like under each failure mode. Compare to the healthy baseline.
When a LoRA adapter underperforms, follow this decision tree:
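The three symptoms described above can be encoded as a rough triage function. This is a sketch, not a definitive diagnostic: the function name, the thresholds (`target_acc`, `plateau_tol`, `drift_tol`), and the "first quarter of training" plateau check are all assumptions chosen for illustration.

```python
def diagnose_lora(train_losses, val_acc, prod_acc,
                  target_acc=0.90, plateau_tol=0.02, drift_tol=0.05):
    """Map training symptoms to a suggested fix. All thresholds are illustrative."""
    # Did the loss stop improving after the first quarter of training?
    quarter = train_losses[len(train_losses) // 4]
    plateaued_early = (quarter - train_losses[-1]) / quarter < plateau_tol

    if val_acc < target_acc:
        if plateaued_early:
            return "rank too low: increase r (e.g. 16, 32, 64)"
        return "wrong layers: apply LoRA to all linear layers at a lower rank"
    if val_acc - prod_acc > drift_tol:
        return "distribution mismatch: diversify data, regularize, or retrain"
    return "healthy"


flat = [2.0, 1.8, 1.79, 1.79, 1.79, 1.79, 1.79, 1.79]
falling = [2.0, 1.6, 1.3, 1.0, 0.8, 0.6, 0.5, 0.4]
print(diagnose_lora(flat, val_acc=0.70, prod_acc=0.70))
print(diagnose_lora(falling, val_acc=0.75, prod_acc=0.75))
print(diagnose_lora(falling, val_acc=0.94, prod_acc=0.78))
```

The last call mirrors the opening scenario: validation looks fine at 94%, production sits at 78%, and the function points at distribution mismatch rather than adapter capacity.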
LoRA adapters are small — often just 0.1% of the model's parameters. You might think overfitting isn't a concern with so few trainable parameters. You'd be wrong. With a small fine-tuning dataset (say, 5,000 examples), even 131K trainable parameters can memorize the data. And when a LoRA adapter overfits, the failure is insidious: it looks great on training metrics but produces hallucinated, overconfident answers on new inputs.
Dropout randomly zeros out elements of the adapter's activations during training, forcing the adapter to spread its learned representation across multiple dimensions rather than memorizing specific patterns. In the formulation used here, dropout is applied to the output of matrix A (before multiplying by B):
h = W · x + (α/r) · B · Dropout(A · x)
Common implementations, including the reference loralib code and Hugging Face PEFT, apply dropout to the adapter's input x instead, before A.
Typical dropout rates for LoRA are 0.05 to 0.1 (5-10%). Higher values (0.2+) can prevent the adapter from learning at all, since you're already working in a low-rank bottleneck: at a rate of 0.5, for example, zeroing out half of an 8-dimensional representation leaves only about 4 effective dimensions.
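Here is a minimal NumPy sketch of the LoRA branch with dropout placed on the output of A, as described above. The function name `lora_branch` and the use of inverted dropout (rescaling by 1 / (1 - p) during training) are assumptions for illustration.

```python
import numpy as np


def lora_branch(x, A, B, alpha, p, rng, training=True):
    """Low-rank path with dropout on the rank-r activations (output of A)."""
    r = A.shape[0]
    h = x @ A.T                              # [batch, r]
    if training and p > 0.0:
        keep = rng.random(h.shape) >= p      # keep each unit with probability 1 - p
        h = h * keep / (1.0 - p)             # inverted dropout preserves the expected value
    return (alpha / r) * (h @ B.T)           # [batch, d]


rng = np.random.default_rng(0)
d, r = 64, 8
x = rng.normal(size=(4, d))
A = rng.normal(size=(r, d))                  # random stand-ins for trained weights
B = rng.normal(size=(d, r))

train_out = lora_branch(x, A, B, alpha=16, p=0.1, rng=rng)
eval_out = lora_branch(x, A, B, alpha=16, p=0.1, rng=rng, training=False)
```

At evaluation time the mask is skipped entirely, so the branch reduces to the plain scaled product and can still be merged into W.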
Weight decay adds a penalty proportional to the squared L2 norm of the adapter weights, preventing them from growing too large. Large adapter weights mean large ΔW, which means the adapted model diverges far from the pre-trained model, and that divergence is exactly what causes catastrophic forgetting.
L_total = L_task + λ · (||A||² + ||B||²)
Where λ is the weight decay coefficient, typically 0.01 to 0.1 for LoRA. Weight decay acts as a soft constraint keeping the adapted model close to the base model. Think of it as a rubber band pulling the adapter back toward ΔW = 0.
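As a sketch, the penalty can be written as an explicit term added to the task loss. The exact form used here (squared Frobenius norm of A and B, no factor of 1/2) is an assumption; optimizers such as AdamW apply the decay directly to the weights instead of through the loss.

```python
import numpy as np


def loss_with_weight_decay(task_loss, A, B, lam=0.01):
    """Task loss plus lam times the squared L2 norm of the adapter weights."""
    penalty = lam * (np.sum(A ** 2) + np.sum(B ** 2))
    return task_loss + penalty


A = np.ones((2, 3))      # toy adapter weights for illustration
B = np.zeros((3, 2))
total = loss_with_weight_decay(task_loss=1.0, A=A, B=B, lam=0.1)
print(round(total, 6))   # 1.6
```

Scaling A by 10 multiplies its contribution to the penalty by 100, which is the "rubber band" getting stiffer the further the adapter stretches.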
Dropout and weight decay serve different purposes. Dropout prevents co-adaptation of features within the adapter (a data-efficiency regularizer). Weight decay prevents the adapter from drifting too far from the base model (a forgetting regularizer). In practice, you usually want both: dropout 0.05-0.1 and weight decay 0.01.
Adjust dropout and weight decay to see how they affect train vs. val loss. Watch the gap between curves — a large gap means overfitting.
| Dataset size | Recommended dropout | Recommended weight decay | Rationale |
|---|---|---|---|
| < 1K examples | 0.1 - 0.15 | 0.05 - 0.1 | High risk of memorization. Aggressively regularize. |
| 1K - 10K | 0.05 - 0.1 | 0.01 - 0.05 | Standard regime. Moderate regularization. |
| 10K - 100K | 0.0 - 0.05 | 0.01 | Enough data to generalize. Light regularization. |
| > 100K | 0.0 | 0.01 | Overfitting is rare. Weight decay for stability only. |
You trained a LoRA adapter for customer support. Another for legal Q&A. A third for code review. Each works well in isolation. Now your CEO wants one model that handles all three. Do you need to retrain from scratch on combined data?
No. You can merge the adapters. Each LoRA adapter produces a weight update ΔW. Since these are just matrices, you can combine them using standard linear algebra. The question is how to combine them — and each method has different tradeoffs.
The simplest approach: average the updates from each adapter.
ΔW_merged = (ΔW_1 + ΔW_2 + ... + ΔW_N) / N
This works when the tasks are similar (e.g., sentiment analysis in different domains). It fails when tasks conflict — if adapter A learned to make outputs longer and adapter B learned to make them shorter, averaging gives you an adapter that does neither well. Think of it as mixing paint: blue + yellow = green (sometimes useful), but red + green = brown (usually not what you want).
Ilharco et al. (2023) proposed treating fine-tuning updates as task vectors: directions in weight space that correspond to capabilities. You can add and subtract them:
ΔW_merged = λ1 · ΔW_1 + λ2 · ΔW_2 + λ3 · ΔW_3
The λ coefficients let you control how much of each task to include. Setting λ3 negative lets you remove a capability. Trained a model on toxic data by accident? Subtract that adapter. Want more code ability and less chat? Scale accordingly.
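Yadav et al. (2023) identified a problem with simple averaging: interference. When two adapters disagree about the sign of a weight update (one wants +0.1, the other wants -0.1), averaging gives ~0, losing both signals. TIES fixes this in three steps: trim each update to its largest-magnitude entries, elect one sign per parameter based on the total magnitude on each side, and merge by averaging only the entries that agree with the elected sign.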
TIES consistently outperforms simple averaging on multi-task benchmarks because it preserves the structure of each adapter's update rather than blindly averaging conflicting signals.
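A compact NumPy sketch of the TIES procedure (trim, elect sign, disjoint merge) on already-computed updates. The function name `ties_merge` and the `keep_fraction` default are assumptions, and the final scaling coefficient applied before adding the result to the base weights is omitted.

```python
import numpy as np


def ties_merge(deltas, keep_fraction=0.2):
    """Merge same-shaped task updates with trim, elect-sign, and disjoint-mean steps."""
    stacked = np.stack(deltas)                                  # [n_tasks, ...]
    flat = np.abs(stacked).reshape(len(deltas), -1)

    # 1. Trim: keep only the largest-magnitude entries of each update.
    k = max(1, int(round(keep_fraction * flat.shape[1])))
    thresholds = np.sort(flat, axis=1)[:, -k]                   # k-th largest per task
    thresholds = thresholds.reshape((-1,) + (1,) * (stacked.ndim - 1))
    trimmed = np.where(np.abs(stacked) >= thresholds, stacked, 0.0)

    # 2. Elect sign: per entry, the sign with the larger total magnitude wins.
    elected = np.sign(trimmed.sum(axis=0))

    # 3. Disjoint merge: average only the entries that agree with the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (trimmed * agree).sum(axis=0) / counts


d1 = np.array([0.1, 0.5, 0.0, -0.3])
d2 = np.array([-0.1, 0.4, 0.2, 0.6])
print((d1 + d2) / 2)                           # simple average
print(ties_merge([d1, d2], keep_fraction=1.0)) # no trimming, sign election only
```

In the last entry, the simple average dilutes 0.6 down to 0.15 because the other adapter pulls the opposite way, while the TIES result keeps 0.6 from the adapter that won the sign election.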
Three adapters as 2D vectors. Toggle between merge methods to see how the result changes. Drag adapters to reposition them.
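```python
import numpy as np

d, r, alpha = 64, 8, 16            # small d for illustration
rng = np.random.default_rng(0)
W_base = rng.normal(size=(d, d))   # stand-in for a frozen base weight

# Each adapter has B_i [d, r] and A_i [r, d]; random stand-ins for trained adapters
B_1, B_2, B_3 = (rng.normal(size=(d, r)) for _ in range(3))
A_1, A_2, A_3 = (rng.normal(size=(r, d)) for _ in range(3))

# First, compute the full delta for each adapter
delta_1 = (alpha / r) * B_1 @ A_1  # [d, d]
delta_2 = (alpha / r) * B_2 @ A_2  # [d, d]
delta_3 = (alpha / r) * B_3 @ A_3  # [d, d]

# Simple average
merged_avg = (delta_1 + delta_2 + delta_3) / 3

# Task arithmetic (emphasize task 1, remove task 3)
merged_arith = 1.5 * delta_1 + 0.8 * delta_2 - 0.5 * delta_3

# Final merged model (here using the task-arithmetic result)
W_merged = W_base + merged_arith   # [d, d], same size as original
```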
Time to put everything together. The simulation below is a full LoRA training sandbox. Configure the adapter (rank, target layers, alpha, dropout), watch the training unfold in real-time, then merge multiple adapters and see the combined result.
This is the payoff for the last four chapters. Every slider maps to a concept you've learned: rank controls capacity (Chapter 1), layer selection controls what kind of knowledge the adapter can learn (Chapter 2), dropout and regularization control overfitting (Chapter 3), and the merge view shows what happens when you combine adapters (Chapter 4).
Try these experiments to build intuition:
Experiment 1 — Rank sensitivity: Set layers to "All", dropout to 0.05. Start with rank 1 and hit Train. Note the final val loss. Reset, increase to rank 4, repeat. Then rank 16. You'll see diminishing returns after rank 8-16 for most tasks.
Experiment 2 — Layer targeting: Fix rank at 8. Compare "QV only" vs "All layers". The all-layers config should reach lower val loss. Now try "FFN only" — it depends on whether the task requires new knowledge (FFN) or new attention patterns (QKV).
Experiment 3 — Overfitting: Set rank to 32, dropout to 0, weight decay to 0. Train on a small dataset. Watch train loss plummet while val loss plateaus or rises. Now add dropout 0.1 — the gap closes.
Experiment 4 — Merging: Train two adapters with different configs (e.g., one for "sentiment" and one for "summarization"). Switch to merge view and compare averaging vs. TIES.
Configure, train, and merge LoRA adapters. Click Train to start. Watch train (orange) and val (teal) loss curves evolve.
LoRA was published in 2021. Since then, a wave of successors has pushed the frontier further — each solving a specific limitation of the original. Let's trace the evolution and understand where the field is heading.
Dettmers et al. had a simple but powerful insight: if the base model weights are frozen anyway, why store them in fp16? QLoRA quantizes the base model to 4-bit precision (NF4 format), then trains LoRA adapters in fp16 on top. The base model uses 4x less memory; the adapters are still full precision.
The impact was dramatic: fine-tuning a 65B model went from requiring more than 780 GB (full fine-tuning) or 160 GB (LoRA in fp16) to under 48 GB, enough to fit on a single 48 GB GPU (or an 80 GB A100). QLoRA made it possible for academic labs and individuals to fine-tune models that previously required multi-node clusters.
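A back-of-the-envelope sketch shows where numbers of this size come from. The per-parameter byte counts are one common accounting and are assumptions here: 2 bytes for fp16 weights, 2 for gradients, 8 for Adam optimizer state, and 0.5 bytes for a 4-bit weight. Activations and quantization constants are ignored.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in GB (1e9 bytes) for n_params values at the given precision."""
    return n_params * bytes_per_param / 1e9


n = 65e9
full_ft = weight_memory_gb(n, 2 + 2 + 8)   # weights + gradients + Adam state
fp16_base = weight_memory_gb(n, 2)         # frozen fp16 base model under LoRA
nf4_base = weight_memory_gb(n, 0.5)        # frozen 4-bit base model under QLoRA
print(full_ft, fp16_base, nf4_base)        # 780.0 130.0 32.5
```

Under these assumptions the frozen base model dominates both LoRA figures; the gap between 130 GB and 160 GB, or between 32.5 GB and 48 GB, is plausibly the adapters, their optimizer state, and activations.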
Liu et al. noticed that LoRA tends to change the magnitude and direction of weight vectors in proportion to each other, while full fine-tuning can make a large change to one with only a small change to the other. DoRA decomposes each weight vector into a magnitude component (scalar) and a direction component (unit vector), trains the magnitude directly, and applies LoRA to the direction. This brings the learning pattern closer to full fine-tuning's, and DoRA is reported to outperform standard LoRA by about 1-3% on NLU benchmarks.
W' = m · (W + BA) / ||W + BA||
Where m is a trainable magnitude scalar and (W + BA)/||W + BA|| is the updated direction. The key insight: decoupling magnitude and direction gives the optimizer a smoother loss landscape.
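A minimal NumPy sketch of the reparameterization m · (W + BA)/||W + BA||. Two assumptions are made for illustration: each column of the matrix is treated as one "weight vector", and m starts at the column norms of W so that the adapted weight equals W at initialization. The α/r scaling is left out to match the formula above.

```python
import numpy as np


def dora_weight(W, B, A, m):
    """Rescale each column of W + BA to unit norm, then apply the magnitude m."""
    V = W + B @ A                                         # updated direction, unnormalized
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # [1, d]
    return m * V / col_norm


rng = np.random.default_rng(0)
d, r = 64, 8
W = rng.normal(size=(d, d))                               # stand-in for a frozen weight
A = rng.normal(size=(r, d))
B = np.zeros((d, r))                                      # LoRA-style zero init
m = np.linalg.norm(W, axis=0, keepdims=True)              # one magnitude per column

W_init = dora_weight(W, B, A, m)
print(np.allclose(W_init, W))                             # True
```

Once B becomes nonzero, the low-rank update can rotate each column freely while its length stays pinned to m, which only changes when m itself is trained.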
Hayou et al. showed that using the same learning rate for matrices A and B is suboptimal. Since B is initialized to zero and A to random values, they occupy different regions of the loss landscape at initialization. LoRA+ uses a higher learning rate for B (the zero-initialized matrix) and achieves faster convergence — up to 2x speedup with no quality loss.
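One way to set this up is with per-group learning rates. The sketch below builds a list of parameter groups in the `{"params": ..., "lr": ...}` shape that PyTorch optimizers accept, without importing PyTorch. The helper name, the `"lora_B"` naming convention, and the ratio of 16 are assumptions; tune the ratio for your task.

```python
def loraplus_param_groups(named_params, base_lr=1e-4, lr_ratio=16.0):
    """Split adapter parameters so B matrices train with a larger learning rate.

    named_params: iterable of (name, parameter) pairs. Names containing
    "lora_B" are treated as B matrices (an assumed naming convention).
    """
    a_params, b_params = [], []
    for name, param in named_params:
        (b_params if "lora_B" in name else a_params).append(param)
    return [
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * lr_ratio},
    ]


# Placeholder strings stand in for real tensors in this illustration.
groups = loraplus_param_groups(
    [("q_proj.lora_A", "A_q"), ("q_proj.lora_B", "B_q")]
)
print([g["lr"] for g in groups])
```

With real tensors, the returned list could be passed as the first argument to an optimizer constructor such as `torch.optim.AdamW`.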
The trend line is clear: each generation uses fewer bits, fewer parameters, or both, while matching or exceeding the previous generation's quality. The end goal is one-GPU fine-tuning of 100B+ models with negligible quality loss.
Timeline of methods and their GPU memory requirements for fine-tuning a 65B model. Watch memory shrink while quality holds steady.
| Method | Year | Key Innovation | Memory (65B model) | Quality |
|---|---|---|---|---|
| Full FT | - | Train everything | 780 GB | Best (baseline) |
| LoRA | 2021 | Low-rank update BA | 160 GB | -1 to -3% vs Full FT |
| QLoRA | 2023 | 4-bit base + fp16 adapters | 48 GB | Same as LoRA |
| DoRA | 2024 | Decompose magnitude/direction | 52 GB | +1 to +3% vs LoRA |
| LoRA+ | 2024 | Asymmetric learning rates | 160 GB | Same as LoRA, up to 2x faster convergence |
This lesson extended L09's introduction to PEFT with the practical knowledge needed to deploy LoRA adapters that actually work. Let's place these techniques in context.
| Method | Trainable % | Inference Cost | Merge? | Best For |
|---|---|---|---|---|
| Prompt Tuning | ~0.01% | +tokens/call | No | Simple classification |
| Prefix Tuning | ~0.01% | +tokens/call | No | Generation with fixed styles |
| Adapters | ~2-4% | +latency | No | Multi-task serving (swap adapters) |
| LoRA | ~0.1-1% | Zero | Yes | General purpose fine-tuning |
| QLoRA | ~0.1-1% | Quantized | Yes | Large models, limited GPU |
| DoRA | ~0.1-1% | Zero | Yes | When LoRA quality isn't enough |
| Full Fine-tuning | 100% | Zero | N/A | Maximum quality, unlimited budget |
L09: Efficient Adaptation — Covers the full PEFT landscape from prompting to adapters to LoRA. This lesson assumes you've read L09 and goes deeper on failure modes, regularization, and merging.
L07: Pretraining — Understanding what the base model learned during pretraining helps you predict which fine-tuning updates will work. Knowledge lives in FFN layers; attention patterns live in QKV. This informs which layers to target with LoRA.
If you're starting a new fine-tuning project today, here's the default recipe:
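As a sketch only, the defaults recommended throughout this lesson can be gathered into one configuration object. The class and field names are hypothetical and do not correspond to any library's API; the values come from the earlier sections (all linear layers, rank in the 8-16 range, α set to r or 2r, dropout 0.05-0.1, weight decay 0.01, and a 4-bit base model when GPU memory is limited).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRecipe:
    """Starting-point hyperparameters collected from this lesson (illustrative names)."""
    rank: int = 16
    alpha: int = 32                      # alpha = 2r, one of the two common choices
    target_layers: tuple = ("q", "k", "v", "o", "up", "down")
    dropout: float = 0.05
    weight_decay: float = 0.01
    quantize_base_4bit: bool = False     # True for a QLoRA-style setup on limited GPU memory

    @property
    def scale(self) -> float:
        """The alpha / r multiplier applied to the low-rank update."""
        return self.alpha / self.rank


recipe = LoraRecipe()
small_data = LoraRecipe(dropout=0.1, weight_decay=0.05)   # under ~1K examples
print(recipe.scale, small_data.dropout)
```

Treat these as a first configuration to run, then adjust based on what the training and validation curves show.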
Each method plotted by trainable parameters vs. task quality. Hover to see details.