Fine-tuning a 70-billion-parameter model used to mean updating, and storing a fresh copy of, all 70 billion weights. LoRA trains a tiny pair of matrices per layer instead, often under 1% of the weights, and on many tasks comes close to matching full fine-tuning. The trick rests on an empirical observation: the update a model needs is approximately low-rank.
You have a pretrained 7-billion-parameter model and a new task — say, answering questions about your company’s internal docs. The classic move is full fine-tuning: keep training the model on your task data, updating all 7 billion weights. It works. It’s also brutally expensive, in three separate ways.
Memory. Training isn’t just storing weights: for each weight you also store a gradient and (for the Adam optimizer) two more numbers, the momentum and variance estimates. That’s roughly 4× the model’s size in memory just to train it, before you fit a single token of data. A 7B model in 16-bit weights is 14 GB, so full fine-tuning needs at least 56 GB even if every value stays in 16-bit, and well over 60 GB once you add activations or keep optimizer state in 32-bit, as mixed-precision training usually does. That is more than most single GPUs have.
Storage. Suppose you fine-tune for ten different tasks. Full fine-tuning gives you ten complete 14 GB copies of the model (140 GB in total) that are almost identical to each other and to the original. Wildly wasteful.
Forgetting. Hammering all the weights on a narrow task can erode the general abilities the model learned in pretraining — “catastrophic forgetting.” You wanted to add a skill, not overwrite the others.
Pick a model size. See the training memory (weights + gradients + optimizer state) and the storage for N task-specific copies. LoRA (orange) trains a sliver and stores tiny adapters instead.
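The arithmetic behind those numbers can be sketched in a few lines. This is a back-of-the-envelope estimate, not a profiler: it assumes every value (weight, gradient, both Adam moments) is stored in 16-bit, ignores activations, and uses an illustrative adapter size of 1% of the model. The function names are made up for this sketch.

```python
GB = 1e9  # decimal gigabytes, matching "7B parameters in 16-bit = 14 GB"

def full_finetune_memory_gb(n_params, bytes_per_value=2):
    """Weights + gradients + two Adam moments, all at the same precision.

    Assumption: this is a lower bound. Real mixed-precision training often
    keeps optimizer state in 32-bit and also needs memory for activations.
    """
    values_per_param = 4  # weight, gradient, Adam momentum, Adam variance
    return n_params * values_per_param * bytes_per_value / GB

def full_copies_storage_gb(n_params, n_tasks, bytes_per_value=2):
    """Storage when every task gets its own complete copy of the model."""
    return n_params * bytes_per_value * n_tasks / GB

def adapter_storage_gb(n_params, n_tasks, adapter_fraction=0.01, bytes_per_value=2):
    """One shared base model plus one small adapter per task.

    adapter_fraction is an illustrative assumption (adapter size / model size).
    """
    base = n_params * bytes_per_value / GB
    return base + base * adapter_fraction * n_tasks

n_params = 7e9
print(f"full fine-tuning memory: {full_finetune_memory_gb(n_params):.0f} GB")
print(f"10 full copies:          {full_copies_storage_gb(n_params, 10):.0f} GB")
print(f"base + 10 adapters:      {adapter_storage_gb(n_params, 10):.1f} GB")
```

Under these assumptions the gap is already large, and it widens once 32-bit optimizer state and activations are counted.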
The question that launched parameter-efficient fine-tuning (PEFT) is simple: do we really need to move all the weights to teach the model one new task? Or is there a small, cheap change that captures what the task needs? The answer is that a small change is usually enough, and the reason is a surprising fact about what fine-tuning actually does to the weights.
Here is the insight everything rests on. When you fine-tune, you change a weight matrix W into W + ΔW, where ΔW is the total update accumulated over training. The original W is a big dense matrix — full of information from pretraining. But ΔW, the change, turns out to be very different: it has low intrinsic rank.
What does “low rank” mean? A matrix’s rank is the number of genuinely independent directions it contains. A rank-1 matrix is just one outer product: one column pattern scaled across all columns. A full-rank d×d matrix has d independent directions. The empirical finding (Aghajanyan and colleagues, then Hu and colleagues in the 2021 LoRA paper) is that the update ΔW needed to adapt a model to a task lives in a tiny subspace: you can often capture almost all of it with a rank of, say, 8, even when the matrix is 4096×4096.
Intuitively: pretraining already taught the model almost everything. Adapting it to a new task isn’t a sweeping rewrite of all the knowledge — it’s a small, focused nudge in a few directions. The model doesn’t need to move every which way; it needs to lean slightly, and a lean is low-rank.
Watch this concretely. Below is a target update matrix. We approximate it with a rank-r matrix — the best r-direction summary. Even at small r, the approximation captures most of the matrix, and the error plummets fast.
Left: a target matrix (heatmap). Right: its best rank-r approximation. Slide r up and watch the reconstruction snap into place while the error collapses — most of the matrix lives in its first few directions.
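A minimal NumPy sketch of the same experiment, assuming a synthetic target whose singular values halve at every step (the matrix, its size, and the helper names are illustrative, not taken from the demo). The best rank-r approximation in the Frobenius norm is obtained by truncating the SVD.

```python
import numpy as np

def best_rank_r(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def relative_error(M, r):
    """Frobenius-norm error of the rank-r approximation, relative to ||M||."""
    return np.linalg.norm(M - best_rank_r(M, r)) / np.linalg.norm(M)

# Synthetic target: random orthonormal directions, rapidly decaying spectrum.
rng = np.random.default_rng(0)
d = 32
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
spectrum = 0.5 ** np.arange(d)      # singular values 1, 1/2, 1/4, ...
target = (U * spectrum) @ V.T       # scale column k of U by spectrum[k]

for r in (1, 2, 4, 8):
    print(f"r={r}: relative error {relative_error(target, r):.4f}")
```

The error at rank r is exactly the energy left in the discarded singular values, which is why a fast-decaying spectrum makes the reconstruction snap into place after only a few directions.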
LoRA (Low-Rank Adaptation) turns that insight into a mechanism. Take a weight matrix W (size d×d, say). Freeze it completely, so it never changes. Then add a parallel learnable branch that represents the low-rank update as a product of two small matrices: ΔW = B·A, where A is r×d, B is d×r, and the rank r is far smaller than d.
The forward pass becomes: take your input x, run it through the frozen original and through the little adapter, then add the results, giving h = W·x + B·(A·x).
Read the data flow carefully, because the shapes are the whole point. The input x has dimension d. The frozen path W·x gives the original output, dimension d. In the adapter path, A is r×d, so A·x squeezes x down to a tiny r-dimensional vector — the bottleneck. Then B is d×r, so B expands that back up to dimension d. The two outputs (both dimension d) add together. The adapter is a detour through a narrow waist of width r.
During training, only A and B receive gradients. W is frozen, so it contributes no optimizer state, no gradient memory — nothing. All the expensive machinery now applies only to the two skinny matrices, which is a microscopic fraction of the weights.
Input x flows two ways: through the frozen W (unchanged) and through the trainable adapter A then B, squeezing through a rank-r bottleneck. The outputs sum. Slide r to see the bottleneck widen.
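The data flow can be written out directly. This is a toy sketch in NumPy with small, arbitrary sizes; B is given random values here only so the adapter path produces something visible, and the α/r scaling factor is left out for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4                         # illustrative sizes

W = rng.normal(size=(d, d))          # frozen pretrained weight, d x d
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection, r x d
B = rng.normal(size=(d, r)) * 0.01   # trainable up-projection, d x r
x = rng.normal(size=d)               # input vector of dimension d

def lora_forward(x, W, A, B):
    frozen_out = W @ x               # original path, dimension d
    bottleneck = A @ x               # squeezed to dimension r
    adapter_out = B @ bottleneck     # expanded back to dimension d
    return frozen_out + adapter_out  # the two paths are summed

h = lora_forward(x, W, A, B)
print(h.shape, (A @ x).shape)
```

Computing B @ (A @ x) rather than (B @ A) @ x keeps the adapter path cheap: the d×d product B·A is never materialized during training.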
Let’s quantify the savings, because they’re staggering. A full weight matrix of size d×d has d² parameters. The LoRA adapter has A (r×d) plus B (d×r), which is 2·d·r parameters. The ratio of trainable parameters is therefore 2·d·r divided by d², which simplifies to 2r/d.
Take a realistic layer: d = 4096 (a common hidden size), and rank r = 8. Full matrix: 4096² = 16,777,216 parameters — about 16.8 million, for one matrix. LoRA adapter: 2 × 4096 × 8 = 65,536 parameters — about 65 thousand. The ratio is 2×8/4096 = 16/4096 = 0.39%. The adapter is under four-tenths of one percent of that matrix.
Across a whole model, since LoRA is typically applied only to the attention projection matrices, the total trainable parameters often come out to well under 1% of the model, sometimes 0.1%. You can fine-tune a 7-billion-parameter model by training only a few million numbers. And because only those few million have gradients and optimizer state, the gradient and optimizer memory collapses correspondingly (the frozen weights themselves still have to sit in memory).
Slide rank r and matrix dimension d. Watch the trainable-parameter fraction (2r/d) and the absolute counts. Even at generous ranks, LoRA trains a tiny sliver of the full matrix.
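The same counts can be checked in a few lines of plain Python. The helper names are invented for this sketch, and it assumes a square d×d matrix as in the text.

```python
def full_params(d):
    """Parameters in a full d x d weight matrix."""
    return d * d

def lora_params(d, r):
    """Parameters in the adapter: A is r x d, B is d x r."""
    return 2 * d * r

def trainable_fraction(d, r):
    """Adapter size relative to the full matrix; simplifies to 2r/d."""
    return lora_params(d, r) / full_params(d)

d = 4096
for r in (1, 8, 64):
    print(f"r={r:>2}: {lora_params(d, r):>7,} of {full_params(d):,} "
          f"({trainable_fraction(d, r):.2%})")
```

Even at r = 64 the fraction is 2·64/4096, a little over 3% of that one matrix.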
Two small design choices make LoRA actually work in practice, and both are clever.
At the very start of fine-tuning, you want the adapter to do nothing — the model should behave exactly like the pretrained model, and then gradually learn the correction. If the adapter started by injecting random noise, it would disrupt the carefully pretrained model and you’d fight to recover.
LoRA arranges this elegantly: initialize A with small random values, but initialize B to all zeros. Since ΔW = B·A and B is zero, the product is zero, so ΔW = 0 at the start, exactly. The adapter contributes nothing on step one. But the gradient with respect to B is nonzero (because A is random), so training immediately starts moving B away from zero in useful directions; A’s own gradient is zero on that first step, since it is multiplied by B, and A starts learning as soon as B has moved. You begin precisely at the pretrained model and ease the correction in. Zero disruption, smooth learning.
LoRA scales the adapter output by a factor α/r, where α is a constant you set. The update is really (α/r)·B·A. Why? So that when you change the rank r, you don’t also have to re-tune the learning rate — the scaling keeps the magnitude of the update roughly stable across different ranks. In practice people often set α = r (or 2r) and treat α/r as a tunable knob on how strongly the adapter speaks.
Step training. B begins at zero so the update starts at exactly zero (model = pretrained), then grows smoothly as gradients push B. The bar shows the magnitude of ΔW = B·A over training steps.
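A toy version of this behavior, under loud assumptions: the "task" is simply to match a synthetic low-rank target update with a squared-error loss, and the optimizer is plain gradient descent rather than Adam. All sizes, the learning rate, and the value of α are illustrative. The sketch shows that the update starts at exactly zero, that only B has a nonzero gradient on the first step, and that the magnitude of (α/r)·B·A then grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8.0
scale = alpha / r                                  # the (alpha / r) factor

# Synthetic rank-2 target update, normalized to unit Frobenius norm.
target = rng.normal(size=(d, 2)) @ rng.normal(size=(2, d))
target /= np.linalg.norm(target)

A = rng.normal(size=(r, d)) / np.sqrt(d)           # small random init
B = np.zeros((d, r))                               # zero init, so delta-W starts at 0

def loss_and_grads(A, B):
    residual = scale * (B @ A) - target            # current update minus target
    loss = 0.5 * np.sum(residual ** 2)
    grad_B = scale * residual @ A.T                # nonzero whenever A is nonzero
    grad_A = scale * B.T @ residual                # exactly zero while B is zero
    return loss, grad_A, grad_B

_, first_grad_A, first_grad_B = loss_and_grads(A, B)

lr = 0.05
delta_norms, losses = [], []
for step in range(200):
    loss, grad_A, grad_B = loss_and_grads(A, B)
    delta_norms.append(np.linalg.norm(scale * (B @ A)))
    losses.append(loss)
    A -= lr * grad_A
    B -= lr * grad_B

print("update norm at steps 0, 1, 199:", delta_norms[0], delta_norms[1], delta_norms[-1])
print("loss at steps 0 and 199:", losses[0], losses[-1])
```

Swapping the initialization (B random, A zero) would also start at ΔW = 0; the choice shown here follows the text.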
LoRA has a deployment superpower that adapters of other kinds don’t: because the update is just ΔW = (α/r)·B·A, a plain matrix the same shape as W, you have two great options at inference time.
You can fold the adapter into the weights: compute W′ = W + (α/r)·B·A once, and now W′ is a single matrix you use exactly like the original. The forward pass is just W′·x — no extra branch, no extra multiply. LoRA adds zero inference latency once merged. This is a decisive advantage: methods that add extra layers (we’ll meet them next chapter) slow inference down forever; LoRA can erase its own footprint.
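Merging is a one-line matrix operation, and it is easy to check that it changes nothing about the output. A small NumPy sketch, with random matrices standing in for a trained adapter and made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 32, 4, 8.0

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d))          # stand-in for a trained A
B = rng.normal(size=(d, r))          # stand-in for a trained B
x = rng.normal(size=d)

def forward_with_adapter(x, W, A, B, alpha):
    """Two-branch forward pass: frozen path plus scaled adapter path."""
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

def merge(W, A, B, alpha):
    """Fold the adapter into the weights: W' = W + (alpha / r) * B @ A."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

def unmerge(W_merged, A, B, alpha):
    """Subtract the same update to recover the base weights."""
    r = A.shape[0]
    return W_merged - (alpha / r) * (B @ A)

W_merged = merge(W, A, B, alpha)
print(np.allclose(W_merged @ x, forward_with_adapter(x, W, A, B, alpha)))
```

Because the merge is a plain addition, it can be undone by subtracting the same update (up to floating-point rounding), which lets one base model switch between merged adapters.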
Or you can keep the adapter separate and exploit its tininess. Load the frozen base model once into GPU memory. Then, per request, load whichever small adapter you need — legal-docs adapter, code adapter, French adapter — each just a few megabytes, and apply it on top. You can serve dozens of fine-tuned “models” from a single base in memory, hot-swapping adapters per user or per task. This is how providers offer thousands of customized models without thousands of full copies.
Click an adapter to swap it onto the shared frozen base. Or merge it in for zero-latency inference. The base (big, blue) stays put; adapters (small, orange) are swapped or folded.
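The hot-swap pattern can be sketched as a small registry keyed by adapter name. This is a single-matrix toy, not a serving framework: the `Adapter` and `AdapterServer` classes and the adapter names are hypothetical, and a real system would hold one adapter pair per adapted layer.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Adapter:
    A: np.ndarray      # r x d
    B: np.ndarray      # d x r
    alpha: float

    @property
    def n_params(self):
        return self.A.size + self.B.size

class AdapterServer:
    """One shared frozen base weight, many small adapters selected per request."""

    def __init__(self, base_weight):
        self.W = base_weight           # loaded once, never modified
        self.adapters = {}

    def register(self, name, adapter):
        self.adapters[name] = adapter

    def forward(self, x, adapter_name=None):
        out = self.W @ x
        if adapter_name is not None:
            ad = self.adapters[adapter_name]
            r = ad.A.shape[0]
            out = out + (ad.alpha / r) * (ad.B @ (ad.A @ x))
        return out

rng = np.random.default_rng(0)
d, r = 64, 4
server = AdapterServer(rng.normal(size=(d, d)))
for name in ("legal", "code", "french"):   # illustrative adapter names
    server.register(name, Adapter(rng.normal(size=(r, d)),
                                  rng.normal(size=(d, r)), alpha=8.0))

x = rng.normal(size=d)
y_base = server.forward(x)
y_legal = server.forward(x, "legal")
y_code = server.forward(x, "code")
print(server.adapters["legal"].n_params, "adapter params vs", server.W.size, "base params")
```

The base weight is read by every request and written by none, which is what makes sharing it across adapters safe.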
LoRA shrinks the trainable parameters, but you still have to hold the frozen base model in memory to run the forward and backward passes through it. For a 65-billion-parameter model in 16-bit, that’s 130 GB just for the frozen weights — still out of reach for a single GPU. QLoRA (Dettmers and colleagues, 2023) closes the gap with one more idea: quantize the frozen base.
The frozen base never gets updated, so we don’t need it in high precision — we just need to read it accurately enough during the forward pass. So QLoRA stores the frozen weights in 4-bit instead of 16-bit, cutting their memory by 4×. The 65B model’s 130 GB becomes about 33 GB — now it fits on a single high-end GPU. The LoRA adapters, the only things being trained, stay in higher precision so they learn cleanly.
QLoRA adds three refinements that make 4-bit training actually work:
NF4 (NormalFloat-4). A special 4-bit number format designed for the bell-curve distribution of neural network weights, so the 16 available levels are placed where the weights actually are — minimizing rounding error compared to plain 4-bit integers.
Double quantization. Quantization itself needs little scaling constants; QLoRA quantizes those too, squeezing out a bit more memory.
Paged optimizers. These use NVIDIA unified memory to page optimizer state out to CPU RAM during memory spikes and back again when it is needed, preventing out-of-memory crashes on long sequences.
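To make the idea of a 4-bit format with levels placed "where the weights are" concrete, here is a simplified sketch of block-wise quantization. It is not the NF4 table from the QLoRA paper: the 16 levels below are computed from evenly spaced quantiles of a standard normal and rescaled to [-1, 1], the block size of 64 is an assumption, and indices are stored one per byte rather than packed two per byte.

```python
from statistics import NormalDist
import numpy as np

def normal_quantile_levels(bits=4):
    """16 levels at evenly spaced normal quantiles, rescaled to [-1, 1].

    Assumption: an illustrative stand-in for NF4, not its exact values.
    """
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    q = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return q / np.abs(q).max()

def quantize_blockwise(w, levels, block_size=64):
    """Per block: divide by the block's max |w|, snap each value to the nearest level."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one scaling constant per block
    normalized = blocks / scales                          # now within [-1, 1]
    idx = np.abs(normalized[..., None] - levels).argmin(axis=-1)
    return idx.astype(np.uint8), scales                   # 4 bits of information per index

def dequantize_blockwise(idx, scales, levels, shape):
    return (levels[idx] * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64))    # bell-curve "weights"
levels = normal_quantile_levels()
idx, scales = quantize_blockwise(w, levels)
w_hat = dequantize_blockwise(idx, scales, levels, w.shape)

rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The per-block `scales` array is the kind of scaling constant that double quantization compresses further.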
For a chosen model size, compare the training memory of full fine-tuning, LoRA (frozen base in 16-bit + tiny trainable adapter), and QLoRA (frozen base in 4-bit). Watch the single-GPU line.
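A rough version of that comparison for the 65B example, in plain Python. The estimate counts only frozen weights plus the training state of the trainable parameters (weight, gradient, two Adam moments, all in 16-bit); it ignores activations and quantization constants, and the adapter size of 0.1% of the model is an illustrative assumption.

```python
GB = 1e9

def memory_gb(frozen_params, frozen_bits, trainable_params, bytes_per_value=2):
    """Frozen weights at frozen_bits each, plus 4 stored values per trainable parameter."""
    frozen_bytes = frozen_params * frozen_bits / 8
    trainable_bytes = trainable_params * 4 * bytes_per_value
    return (frozen_bytes + trainable_bytes) / GB

n_params = 65e9
adapter_params = 0.001 * n_params    # assumption: adapters are 0.1% of the model

full = memory_gb(0, 16, n_params)                  # everything is trainable
lora = memory_gb(n_params, 16, adapter_params)     # 16-bit frozen base + adapter
qlora = memory_gb(n_params, 4, adapter_params)     # 4-bit frozen base + adapter

for name, gb in (("full fine-tuning", full), ("LoRA", lora), ("QLoRA", qlora)):
    print(f"{name:>16}: {gb:7.1f} GB")
```

Under these assumptions LoRA's remaining cost is almost entirely the frozen base, which is exactly the part QLoRA shrinks.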
LoRA is the most popular PEFT method, but it’s one of a family. They all share the goal — adapt with few trainable parameters — but differ in where they inject the trainable bits.
| Method | What it trains | Note |
|---|---|---|
| LoRA | low-rank B·A added to weight matrices | mergeable, zero inference cost |
| Adapters | small bottleneck layers inserted between blocks | the original PEFT; adds inference latency |
| Prefix / Prompt tuning | learnable “virtual tokens” prepended to the input | weights untouched; tunes the context instead |
| (IA)³ | learned scaling vectors that rescale activations | extremely few parameters |
| BitFit | only the bias terms | simplest possible; surprisingly decent |
| DoRA, rsLoRA | LoRA refinements (decompose magnitude/direction; better scaling) | modern LoRA improvements |
Two contrasts are worth internalizing. Adapters versus LoRA: classic adapters insert new bottleneck layers in series, so they always add a little inference latency. LoRA runs in parallel and can be merged away — which is the main reason LoRA won. Prompt tuning versus LoRA: prompt tuning doesn’t touch the weights at all; it learns a few extra input vectors (“soft prompts”) that steer the frozen model. It’s even cheaper than LoRA but generally less expressive, since it can only nudge the model through its input, not adjust its internals.
A schematic transformer block. Toggle a method to see where its trainable parameters live — parallel to weights (LoRA), inserted in series (adapters), at the input (prompt tuning), or on biases (BitFit).
Let’s make the central claim — “the update is low-rank” — tangible. Below is a target update matrix that genuinely lives in a few directions (plus a little noise). Approximate it with a rank-r LoRA factorization and watch two things at once: how faithfully B·A reconstructs the target, and how few parameters it costs.
Left: the true update ΔW. Right: the rank-r LoRA reconstruction B·A. Slide r: the reconstruction error drops fast and then plateaus — the plateau is the update’s true intrinsic rank. Past it, extra rank buys almost nothing but costs parameters.
What to notice:
Set the true rank to 3 and slide LoRA rank from 1 up. The error falls steeply until r reaches 3, then flattens — once your rank matches the update’s intrinsic rank, you’ve captured it. Adding more rank past that is wasted parameters: the extra directions have nothing left to explain.
Now raise the true intrinsic rank to 8. A LoRA rank of 3 can no longer keep up — the error stays high until you give r enough room. This is exactly the practical tuning question: pick r large enough to cover the task’s real complexity, but no larger.
Watch the parameter counter. Every increment of r adds 2d parameters but, past the true rank, removes almost no error. That diminishing return is the low-rank hypothesis you can see with your own eyes — and it’s why r = 8 to 64 is plenty for most real fine-tuning.
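The plateau can be reproduced offline with a short NumPy sketch. It assumes a synthetic update (a random low-rank matrix plus small Gaussian noise) and uses the truncated SVD as the best-case rank-r fit; a LoRA adapter trained by gradient descent is not guaranteed to reach that optimum, so treat these errors as a lower bound. The function names, d = 64, and the noise level are illustrative.

```python
import numpy as np

def make_update(d, true_rank, noise, rng):
    """Synthetic update: a rank-`true_rank` matrix plus a little dense noise."""
    low_rank = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
    return low_rank + noise * rng.normal(size=(d, d))

def error_by_rank(delta_W, ranks):
    """Relative Frobenius error of the best rank-r approximation, for each r.

    Uses the fact that the best rank-r error is the energy in the discarded
    singular values.
    """
    s = np.linalg.svd(delta_W, compute_uv=False)
    total = np.sqrt(np.sum(s ** 2))
    return {r: float(np.sqrt(np.sum(s[r:] ** 2)) / total) for r in ranks}

rng = np.random.default_rng(0)
d, ranks = 64, (1, 2, 3, 4, 8, 16)

for true_rank in (3, 8):
    errs = error_by_rank(make_update(d, true_rank, noise=0.05, rng=rng), ranks)
    print(f"true rank {true_rank}")
    for r in ranks:
        print(f"  r={r:>2}  params={2 * d * r:>5}  relative error={errs[r]:.3f}")
```

Reading the two printouts side by side shows both effects from the list above: the error flattens once r reaches the true rank, while the parameter count keeps climbing by 2d per step.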
No quiz here — the lab is the test. If you can predict where the error plateaus as you match r to the true rank, you understand why LoRA works.
| Quantity | Value |
|---|---|
| Trainable fraction per matrix | 2r/d (e.g. r=8, d=4096 → ~0.4%) |
| Typical rank r | 8–64 (even 1–2 sometimes works) |
| Init | A random, B = 0 → ΔW starts at 0 |
| Scaling | update = (α/r)·B·A |
| QLoRA base precision | 4-bit NF4 (frozen); adapter higher precision |
| Adapter file size | megabytes, not gigabytes |
1. The update is low-rank. Adapting a pretrained model is a small, focused nudge — so ΔW = B·A with tiny r captures it. Freeze W, train only the sliver.
2. Cheap to train, free to serve. Few trainable params → little gradient/optimizer memory → big models on small GPUs (QLoRA pushes this with a 4-bit frozen base). Merge the adapter for zero inference latency, or hot-swap many adapters on one base.
3. It’s a patch, not a rewrite. B=0 init means you start exactly at the pretrained model and only add what training justifies — cheap, stable, and resistant to forgetting.