LLM Inference & Adaptation

LoRA & PEFT

Fine-tuning a 70-billion-parameter model used to mean updating, and then storing a copy of, all 70 billion weights. LoRA trains a tiny pair of matrices instead, often under 1% of the weights, and in many reported cases matches full fine-tuning. The trick rests on an empirical fact: the update a model needs is low-rank.

Prerequisites: a linear layer multiplies inputs by a weight matrix, and fine-tuning adjusts weights for a new task. That’s it.
10 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Cost of Fine-Tuning

You have a pretrained 7-billion-parameter model and a new task — say, answering questions about your company’s internal docs. The classic move is full fine-tuning: keep training the model on your task data, updating all 7 billion weights. It works. It’s also brutally expensive, in three separate ways.

Memory. Training isn’t just storing weights: for each weight you also store a gradient and (for the Adam optimizer) two more numbers, the momentum and variance. That’s roughly 4× the model’s size in memory just to train it, before you fit a single token of data. A 7B model in 16-bit weights is 14 GB, so those four copies come to about 56 GB; add activations and full fine-tuning needs well over 60 GB, more than most single GPUs have.

Storage. Suppose you fine-tune for ten different tasks. Full fine-tuning gives you ten complete 14 GB copies of the model — 140 GB — that are 99% identical to each other and to the original. Wildly wasteful.

Forgetting. Hammering all the weights on a narrow task can erode the general abilities the model learned in pretraining — “catastrophic forgetting.” You wanted to add a skill, not overwrite the others.
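The memory and storage figures above can be checked with a few lines of arithmetic. This is a rough sketch under the text's simplification that every stored number takes 2 bytes; the function names are illustrative, and activations are deliberately left out.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text

def full_finetune_memory_gb(n_params, bytes_per_value=2):
    """Weights + gradients + Adam momentum + Adam variance (activations excluded)."""
    values_per_weight = 4  # weight, gradient, momentum, variance
    return n_params * values_per_weight * bytes_per_value / GB

def full_copies_storage_gb(n_params, n_tasks, bytes_per_value=2):
    """Disk space for one complete checkpoint per task."""
    return n_params * bytes_per_value * n_tasks / GB

n_params = 7e9
weights_gb = n_params * 2 / GB
train_gb = full_finetune_memory_gb(n_params)
storage_gb = full_copies_storage_gb(n_params, n_tasks=10)

print(f"weights only:         {weights_gb:.0f} GB")
print(f"training state:       {train_gb:.0f} GB (before activations)")
print(f"ten task checkpoints: {storage_gb:.0f} GB")
```

Many real training setups keep the Adam state in 32-bit precision, which pushes the training figure higher still; the 2-byte assumption here is the optimistic case.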

What full fine-tuning costs

Pick a model size. See the training memory (weights + gradients + optimizer state) and the storage for N task-specific copies. LoRA (orange) trains a sliver and stores tiny adapters instead.

model size (B params): 7
number of tasks: 5

The question that launched parameter-efficient fine-tuning (PEFT) is simple: do we really need to move all the weights to teach the model one new task? Or is there a small, cheap change that captures what the task needs? There is, and the reason is a surprising fact about what fine-tuning actually does to the weights.

Misconception: “Fine-tuning memory equals the model size.” It’s far more. With Adam you store, per weight: the weight, its gradient, a momentum term, and a variance term — plus activations for backprop. That’s why a model that runs inference on one GPU often can’t be fully fine-tuned on that same GPU. PEFT attacks exactly this.

Why does full fine-tuning require several times more memory than just storing the model?

Chapter 1: Updates Are Low-Rank

Here is the insight everything rests on. When you fine-tune, you change a weight matrix W into W + ΔW, where ΔW is the total update accumulated over training. The original W is a big dense matrix — full of information from pretraining. But ΔW, the change, turns out to be very different: it has low intrinsic rank.

What does “low rank” mean? A matrix’s rank is the number of genuinely independent directions it contains. A rank-1 matrix is just one outer product — one column pattern scaled across all columns. A full-rank d×d matrix has d independent directions. The empirical discovery (Aghajanyan and colleagues, then Hu and colleagues in the 2021 LoRA paper) is that the update ΔW needed to adapt a model to a task lives in a tiny subspace — you can capture almost all of it with a rank of, say, 8, even when the matrix is 4096×4096.

Intuitively: pretraining already taught the model almost everything. Adapting it to a new task isn’t a sweeping rewrite of all the knowledge — it’s a small, focused nudge in a few directions. The model doesn’t need to move every which way; it needs to lean slightly, and a lean is low-rank.

Watch this concretely. Below is a target update matrix. We approximate it with a rank-r matrix — the best r-direction summary. Even at small r, the approximation captures most of the matrix, and the error plummets fast.

A matrix is mostly captured by a few directions

Left: a target matrix (heatmap). Right: its best rank-r approximation. Slide r up and watch the reconstruction snap into place while the error collapses — most of the matrix lives in its first few directions.

rank r: 2

Key insight: If ΔW is approximately rank-r, we never need to store or train the full d×d update. We can store it as a product of two skinny matrices — a d×r and an r×d — whose product is a rank-r matrix. That factorization is the entire idea of LoRA, and it’s why it’s so cheap: r is tiny.
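To see the two-skinny-matrices idea in numbers, here is a minimal NumPy sketch. The "update" is synthetic (an exactly rank-3 matrix plus a little noise, standing in for a real ΔW), and the factors are obtained with a truncated SVD, which gives the best rank-r approximation in the Frobenius norm. LoRA itself learns B and A by gradient descent rather than by SVD; the SVD is used here only to show how much a rank-r product can capture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank, r = 64, 3, 3

# Synthetic update: rank 3 plus small noise (illustrative, not a real fine-tuning update).
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_w += 0.01 * rng.normal(size=(d, d))

# Truncated SVD: keep the top r singular directions.
u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
B = u[:, :r] * s[:r]   # d x r (columns scaled by singular values)
A = vt[:r, :]          # r x d
approx = B @ A         # d x d, rank at most r

rel_error = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)
print(f"stored numbers: {B.size + A.size} instead of {delta_w.size}")
print(f"relative error at rank {r}: {rel_error:.4f}")
```

The two factors hold 2·d·r numbers instead of d², yet their product reproduces the matrix up to the noise level.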

Misconception: “The model’s weights W are low-rank.” No — the pretrained W is high-rank and information-dense; you must not approximate it. It’s the update ΔW — the task-specific change — that’s low-rank. LoRA leaves W untouched and only constrains the update to be low-rank.

What is the central empirical fact that makes LoRA possible?

Chapter 2: The LoRA Trick

LoRA — Low-Rank Adaptation — turns that insight into a mechanism. Take a weight matrix W (size d×d, say). Freeze it completely — it never changes. Then add a parallel learnable branch that represents the low-rank update as a product of two small matrices:

ΔW = B · A,    where A is r×d, B is d×r, and r ≪ d

The forward pass becomes: take your input x, run it through the frozen original and through the little adapter, then add the results.

h = W·x + B·(A·x)

Read the data flow carefully, because the shapes are the whole point. The input x has dimension d. The frozen path W·x gives the original output, dimension d. In the adapter path, A is r×d, so A·x squeezes x down to a tiny r-dimensional vector — the bottleneck. Then B is d×r, so B expands that back up to dimension d. The two outputs (both dimension d) add together. The adapter is a detour through a narrow waist of width r.

During training, only A and B receive gradients. W is frozen, so it contributes no optimizer state, no gradient memory — nothing. All the expensive machinery now applies only to the two skinny matrices, which is a microscopic fraction of the weights.
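The data flow can be traced shape by shape in a short NumPy sketch. The sizes and random values are arbitrary stand-ins; B is given nonzero values here only so the adapter path visibly contributes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4

W = rng.normal(size=(d, d))          # frozen pretrained weight, d x d
A = rng.normal(size=(r, d)) * 0.01   # trainable, r x d
B = rng.normal(size=(d, r)) * 0.01   # trainable, d x r
x = rng.normal(size=d)               # input, dimension d

z = A @ x              # squeeze: d -> r (the bottleneck)
h = W @ x + B @ z      # expand r -> d, then add to the frozen path

print(x.shape, z.shape, h.shape)  # (16,) (4,) (16,)
```

Computing B @ (A @ x) in that order costs about 2·d·r multiplications, whereas forming the d×d product B @ A first would cost far more; the parenthesization matters as much as the shapes.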

The LoRA forward path

Input x flows two ways: through the frozen W (unchanged) and through the trainable adapter A then B, squeezing through a rank-r bottleneck. The outputs sum. Slide r to see the bottleneck widen.

rank r: 4

Key insight: LoRA doesn’t change what a layer computes structurally; it adds a cheap, low-rank correction term in parallel. The frozen W keeps all the pretrained knowledge intact (which limits catastrophic forgetting, since removing the adapter restores the original model exactly), while B·A learns the small task-specific adjustment. You’re editing the model with a sticky note, not rewriting the book.

Misconception: “LoRA replaces the layer’s weights.” It runs alongside them. The original W is fully present and fully frozen; the adapter only adds a correction. This is why a single base model can host many different adapters — each is just a different B·A correction bolted onto the same frozen core.

In LoRA’s adapter path B·(A·x), what is the role of the matrix A?

Chapter 3: Counting the Parameters

Let’s quantify the savings, because they’re staggering. A full weight matrix of size d×d has d² parameters. The LoRA adapter has A (r×d) plus B (d×r), which is 2·d·r parameters. The ratio of trainable parameters is therefore 2·d·r divided by d², which simplifies to 2r/d.

By hand

Take a realistic layer: d = 4096 (a common hidden size), and rank r = 8. Full matrix: 4096² = 16,777,216 parameters — about 16.8 million, for one matrix. LoRA adapter: 2 × 4096 × 8 = 65,536 parameters — about 65 thousand. The ratio is 2×8/4096 = 16/4096 = 0.39%. The adapter is under four-tenths of one percent of that matrix.
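The same hand calculation as a small helper, so other values of d and r can be tried. The function name is illustrative.

```python
def lora_param_counts(d, r):
    """Return (full matrix params, adapter params, trainable fraction) for a d x d layer."""
    full = d * d
    adapter = 2 * d * r      # A is r x d, B is d x r
    return full, adapter, adapter / full

full, adapter, fraction = lora_param_counts(d=4096, r=8)
print(full, adapter, f"{fraction:.2%}")  # 16777216 65536 0.39%
```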

Across a whole model, since LoRA is typically applied only to the attention projection matrices, the total trainable parameters often come out to well under 1% of the model — sometimes 0.1%. You can fine-tune a 7-billion-parameter model by training only a few million numbers. And because only those few million have gradients and optimizer state, the training memory collapses correspondingly.
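A back-of-the-envelope count for a whole model looks like this. The configuration is an assumption chosen for illustration, not something stated in the text: 32 transformer layers, hidden size 4096, LoRA with r = 8 on two attention projections per layer, and adapters saved in 16-bit.

```python
def lora_model_params(n_layers, d, r, matrices_per_layer):
    """Total trainable LoRA parameters when each adapted matrix is d x d."""
    return n_layers * matrices_per_layer * 2 * d * r

# Assumed, illustrative configuration (roughly the shape of a 7B-class model).
trainable = lora_model_params(n_layers=32, d=4096, r=8, matrices_per_layer=2)
total = 7e9
adapter_mb = trainable * 2 / 1e6   # 2 bytes per parameter

print(f"trainable parameters:  {trainable:,}")
print(f"fraction of model:     {trainable / total:.3%}")
print(f"adapter size (16-bit): {adapter_mb:.1f} MB")
```

Under these assumptions the adapter is a few million numbers and a single-digit number of megabytes, in line with the figures quoted above. Adapting more matrices per layer or raising r scales both numbers linearly.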

Trainable parameters vs rank

Slide rank r and matrix dimension d. Watch the trainable-parameter fraction (2r/d) and the absolute counts. Even at generous ranks, LoRA trains a tiny sliver of the full matrix.

rank r: 8
dimension d: 4096

The payoff chain: fewer trainable parameters → fewer gradients and optimizer states → far less training memory → fine-tune big models on small GPUs. And the saved adapter is a few megabytes, not gigabytes — so a hundred task adapters cost less storage than one full fine-tuned copy.

Misconception: “A bigger rank is always better — closer to full fine-tuning.” Higher r does add capacity, but past a point it stops helping (the update really is low-rank) and just costs more. r between 8 and 64 covers most tasks; the famous LoRA result is that even r = 1 or 2 works surprisingly well, because the intrinsic rank of the update is genuinely tiny.

For a 4096×4096 matrix with LoRA rank r = 8, roughly what fraction of parameters are trainable?

Chapter 4: Initialization & Scaling

Two small design choices make LoRA actually work in practice, and both are clever.

The zero-start trick

At the very start of fine-tuning, you want the adapter to do nothing — the model should behave exactly like the pretrained model, and then gradually learn the correction. If the adapter started by injecting random noise, it would disrupt the carefully pretrained model and you’d fight to recover.

LoRA arranges this elegantly: initialize A with small random values, but initialize B to all zeros. Since ΔW = B·A and B is zero, the product is zero, so ΔW = 0 at the start, exactly. The adapter contributes nothing on step one. But the gradient with respect to B is nonzero (because A is random), so training immediately starts moving B away from zero in useful directions, and once B is nonzero, A starts receiving gradient too. You begin precisely at the pretrained model and ease the correction in. Zero disruption, smooth learning.
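The zero-start behaviour can be verified directly. The sketch below uses a toy squared-error loss on a single input (an assumption made for illustration) and writes out the two gradients by hand: for h = W·x + B·(A·x), the gradient with respect to B is the outer product of the output error with A·x, and the gradient with respect to A is the outer product of Bᵀ·error with x.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2

W = rng.normal(size=(d, d))          # frozen
A = rng.normal(size=(r, d)) * 0.1    # small random init
B = np.zeros((d, r))                 # zero init
x = rng.normal(size=d)
target = rng.normal(size=d)          # toy regression target

h = W @ x + B @ (A @ x)
err = h - target                     # gradient of 0.5 * ||h - target||^2 w.r.t. h

grad_B = np.outer(err, A @ x)        # d x r, depends on A
grad_A = np.outer(B.T @ err, x)      # r x d, depends on B

print("update is zero at start:", np.allclose(B @ A, 0))
print("grad_B norm:", np.linalg.norm(grad_B))
print("grad_A norm:", np.linalg.norm(grad_A))
```

At step one only B has a nonzero gradient. After the first update B is no longer zero, and from then on A receives gradient as well.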

The scaling factor

LoRA scales the adapter output by a factor α/r, where α is a constant you set. The update is really (α/r)·B·A. Why? So that when you change the rank r, you don’t also have to re-tune the learning rate — the scaling keeps the magnitude of the update roughly stable across different ranks. In practice people often set α = r (or 2r) and treat α/r as a tunable knob on how strongly the adapter speaks.

Zero-start: ΔW grows from nothing

Step training. B begins at zero so the update starts at exactly zero (model = pretrained), then grows smoothly as gradients push B. The bar shows the magnitude of ΔW = B·A over training steps.

Key insight: Initializing B = 0 means LoRA fine-tuning provably starts from the exact pretrained model — the adapter is invisible at step zero and only ever adds what training justifies. This is a big reason LoRA is stable and resists catastrophic forgetting: it can only build up a correction, never start by smashing what’s there.

Misconception: “Both A and B start at zero.” If both were zero, both gradients would be zero too: the gradient to A is multiplied by B, and the gradient to B is multiplied by A. Neither matrix could ever move, and the adapter would be stuck at zero forever. Exactly one is zeroed (B), the other (A) is random. That asymmetry is what lets learning begin.

Why is B initialized to zero (while A is random) in LoRA?

Chapter 5: Merge & Swap

LoRA has a deployment superpower that adapters of other kinds don’t: because the update is just ΔW = (α/r)·B·A, a plain matrix the same shape as W, you have two great options at inference time.

Merge: zero added latency

You can fold the adapter into the weights: compute W′ = W + (α/r)·B·A once, and now W′ is a single matrix you use exactly like the original. The forward pass is just W′·x, with no extra branch and no extra multiply. LoRA adds zero inference latency once merged. This is a decisive advantage: methods that insert extra layers (we’ll meet them in Chapter 7) add latency to every forward pass; LoRA can erase its own footprint.
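A minimal NumPy check that merging changes nothing about the output. The sizes, the value of α, and the random matrices are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8
scale = alpha / r

W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d)) * 0.1
B = rng.normal(size=(d, r)) * 0.1
x = rng.normal(size=d)

h_unmerged = W @ x + scale * (B @ (A @ x))   # two paths on every forward pass
W_merged = W + scale * (B @ A)               # computed once, offline
h_merged = W_merged @ x                      # one matmul, same cost as the original layer

print(np.allclose(h_unmerged, h_merged))  # True
```

Because the merge is just an addition, it is also reversible: subtracting scale * (B @ A) from W_merged recovers the original W up to floating-point rounding.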

Swap: many tasks, one base

Or you can keep the adapter separate and exploit its tininess. Load the frozen base model once into GPU memory. Then, per request, load whichever small adapter you need — legal-docs adapter, code adapter, French adapter — each just a few megabytes, and apply it on top. You can serve dozens of fine-tuned “models” from a single base in memory, hot-swapping adapters per user or per task. This is how providers offer thousands of customized models without thousands of full copies.
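The swapping pattern can be sketched for a single layer as a frozen weight plus a dictionary of named adapters. The class, its method names, and the adapter names are hypothetical and only illustrate the idea; production serving systems are considerably more involved.

```python
import numpy as np

class AdapterHost:
    """One frozen base weight plus a registry of named (B, A, scale) adapters."""

    def __init__(self, base_weight):
        self.W = base_weight          # loaded once, never modified
        self.adapters = {}

    def register(self, name, B, A, alpha):
        r = A.shape[0]
        self.adapters[name] = (B, A, alpha / r)

    def forward(self, x, adapter=None):
        h = self.W @ x
        if adapter is not None:
            B, A, scale = self.adapters[adapter]
            h = h + scale * (B @ (A @ x))   # per-request low-rank correction
        return h

rng = np.random.default_rng(0)
d, r = 16, 4
host = AdapterHost(rng.normal(size=(d, d)))
for name in ("legal", "code", "french"):
    host.register(name, rng.normal(size=(d, r)) * 0.1,
                  rng.normal(size=(r, d)) * 0.1, alpha=8)

x = rng.normal(size=d)
base_out = host.forward(x)
outputs = {name: host.forward(x, adapter=name) for name in host.adapters}
print(sorted(outputs))
```

Each adapter stores 2·d·r numbers, while the d×d base is held once and shared by every request.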

One frozen base, many adapters

Click an adapter to swap it onto the shared frozen base. Or merge it in for zero-latency inference. The base (big, blue) stays put; adapters (small, orange) are swapped or folded.

Key insight: The adapter is a portable patch. Merge it for speed (production single-task), or keep it separate for flexibility (multi-tenant serving). Either way the base model is shared, which is the whole economic story: one expensive base, many cheap personalities.

Misconception: “LoRA is always slower at inference because of the extra matrices.” Only if you leave it un-merged. Merged, the math collapses back into a single weight matrix identical in cost to the original — truly zero overhead. The choice is yours: merge for speed, separate for swappability.

How can LoRA achieve zero added inference latency?

Chapter 6: QLoRA

LoRA shrinks the trainable parameters, but you still have to hold the frozen base model in memory to run the forward and backward passes through it. For a 65-billion-parameter model in 16-bit, that’s 130 GB just for the frozen weights — still out of reach for a single GPU. QLoRA (Dettmers and colleagues, 2023) closes the gap with one more idea: quantize the frozen base.

The frozen base never gets updated, so we don’t need it in high precision — we just need to read it accurately enough during the forward pass. So QLoRA stores the frozen weights in 4-bit instead of 16-bit, cutting their memory by 4×. The 65B model’s 130 GB becomes about 33 GB — now it fits on a single high-end GPU. The LoRA adapters, the only things being trained, stay in higher precision so they learn cleanly.
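The 130 GB and 33 GB figures follow from bits per weight alone. A quick check, ignoring the small overhead of quantization constants:

```python
def weights_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 65e9
base_16bit = weights_gb(n_params, 16)
base_4bit = weights_gb(n_params, 4)

print(f"16-bit frozen base: {base_16bit:.1f} GB")
print(f"4-bit frozen base:  {base_4bit:.1f} GB")
```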

QLoRA adds three refinements that make 4-bit training actually work:

NF4 (NormalFloat-4). A special 4-bit number format designed for the bell-curve distribution of neural network weights, so the 16 available levels are placed where the weights actually are — minimizing rounding error compared to plain 4-bit integers.

Double quantization. Quantization itself needs little scaling constants; QLoRA quantizes those too, squeezing out a bit more memory.

Paged optimizers. Use the GPU’s unified memory to spill optimizer state to CPU RAM during memory spikes, preventing out-of-memory crashes on long sequences.
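To make block-wise 4-bit storage concrete, here is a simplified sketch. It is not QLoRA's implementation: it uses a plain uniform 16-level codebook rather than the NF4 codebook, stores each code in a whole byte instead of packing two per byte, and the block size of 64 is an assumption. The per-block `scales` array is the kind of "scaling constant" that double quantization would compress further.

```python
import numpy as np

def quantize_blockwise(weights, block_size=64, n_levels=16):
    """Absmax-scaled quantization to n_levels codes per block (uniform codebook, not NF4)."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one constant per block
    levels = np.linspace(-1.0, 1.0, n_levels)            # 16 levels for 4 bits
    normalized = blocks / scales                         # now within [-1, 1]
    # Pick the nearest codebook entry for every weight.
    codes = np.abs(normalized[..., None] - levels).argmin(axis=-1).astype(np.uint8)
    return codes, scales, levels

def dequantize_blockwise(codes, scales, levels, shape):
    """Look up each code and rescale by its block constant."""
    return (levels[codes] * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 128))   # bell-curve weights, as in the text
codes, scales, levels = quantize_blockwise(w)
w_hat = dequantize_blockwise(codes, scales, levels, w.shape)

print("distinct codes used:", len(np.unique(codes)))
print("max abs error:", np.abs(w - w_hat).max())
```

Swapping `levels` for a codebook whose entries are spaced according to a normal distribution is the idea behind NF4: the same 16 slots, placed where the weights actually are.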

Memory: full FT vs LoRA vs QLoRA

For a chosen model size, compare the training memory of full fine-tuning, LoRA (frozen base in 16-bit + tiny trainable adapter), and QLoRA (frozen base in 4-bit). Watch the single-GPU line.

model size (B params): 65

Key insight: QLoRA splits precision by role. The frozen base only needs to be read, so 4-bit is fine; the adapter is being learned, so it stays precise. This decoupling is what put fine-tuning a 65B model on a single 48 GB GPU — democratizing what used to need a cluster.

Misconception: “QLoRA quantizes everything to 4-bit, so quality must drop a lot.” Only the frozen base is 4-bit (and NF4 is designed to minimize that loss); the trainable adapters and the computation through them stay in higher precision. QLoRA was shown to match 16-bit full fine-tuning quality — the 4-bit base is read accurately enough that the precise adapter compensates.

What is QLoRA’s core idea on top of LoRA?

Chapter 7: The PEFT Family

LoRA is the most popular PEFT method, but it’s one of a family. They all share the goal — adapt with few trainable parameters — but differ in where they inject the trainable bits.

Method | What it trains | Note
LoRA | low-rank B·A added to weight matrices | mergeable, zero inference cost
Adapters | small bottleneck layers inserted between blocks | the original PEFT; adds inference latency
Prefix / Prompt tuning | learnable “virtual tokens” prepended to the input | weights untouched; tunes the context instead
(IA)³ | learned scaling vectors that rescale activations | extremely few parameters
BitFit | only the bias terms | simplest possible; surprisingly decent
DoRA, rsLoRA | LoRA refinements (decompose magnitude/direction; better scaling) | modern LoRA improvements

Two contrasts are worth internalizing. Adapters versus LoRA: classic adapters insert new bottleneck layers in series, so they always add a little inference latency. LoRA runs in parallel and can be merged away — which is the main reason LoRA won. Prompt tuning versus LoRA: prompt tuning doesn’t touch the weights at all; it learns a few extra input vectors (“soft prompts”) that steer the frozen model. It’s even cheaper than LoRA but generally less expressive, since it can only nudge the model through its input, not adjust its internals.
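The three injection points can be put side by side in a toy NumPy sketch. All sizes are arbitrary, and the serial adapter's ReLU nonlinearity and residual connection are assumptions about a typical bottleneck adapter rather than details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d))   # frozen layer weight
x = rng.normal(size=d)

# LoRA: a parallel, purely linear correction, so it can be folded into W.
B, A = rng.normal(size=(d, r)) * 0.1, rng.normal(size=(r, d)) * 0.1
h_lora = W @ x + B @ (A @ x)

# Classic adapter: a bottleneck applied in series after the layer.
down, up = rng.normal(size=(r, d)) * 0.1, rng.normal(size=(d, r)) * 0.1
h_base = W @ x
h_adapter = h_base + up @ np.maximum(down @ h_base, 0.0)  # extra work on every forward pass

# Prompt tuning: weights untouched; trainable vectors are prepended to the input.
tokens = rng.normal(size=(5, d))        # 5 input token embeddings
soft_prompt = rng.normal(size=(3, d))   # 3 trainable "virtual tokens"
sequence = np.concatenate([soft_prompt, tokens], axis=0)

print(h_lora.shape, h_adapter.shape, sequence.shape)
```

The nonlinearity in the serial adapter is what prevents it from collapsing into a single matrix, while the LoRA path is linear in x and therefore equals (W + B @ A) @ x.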

Where each method injects trainable parameters

A schematic transformer block. Toggle a method to see where its trainable parameters live — parallel to weights (LoRA), inserted in series (adapters), at the input (prompt tuning), or on biases (BitFit).

Misconception: “PEFT methods are interchangeable.” They trade off differently: LoRA is mergeable and expressive; prompt tuning is cheapest but weakest; adapters are flexible but add latency; BitFit is trivial but limited. LoRA’s combination of strong quality, low cost, and zero added latency once merged is why it became the default, but the right choice still depends on the constraint that bites you.

Why did LoRA become more popular than classic inserted-adapter layers?

Chapter 8: Low-Rank Lab

Let’s make the central claim — “the update is low-rank” — tangible. Below is a target update matrix that genuinely lives in a few directions (plus a little noise). Approximate it with a rank-r LoRA factorization and watch two things at once: how faithfully B·A reconstructs the target, and how few parameters it costs.

Reconstruct an update with rank-r B·A

Left: the true update ΔW. Right: the rank-r LoRA reconstruction B·A. Slide r: the reconstruction error drops fast and then plateaus — the plateau is the update’s true intrinsic rank. Past it, extra rank buys almost nothing but costs parameters.

LoRA rank r: 3
true intrinsic rank: 3

What to notice:

Set the true rank to 3 and slide LoRA rank from 1 up. The error falls steeply until r reaches 3, then flattens — once your rank matches the update’s intrinsic rank, you’ve captured it. Adding more rank past that is wasted parameters: the extra directions have nothing left to explain.

Now raise the true intrinsic rank to 8. A LoRA rank of 3 can no longer keep up — the error stays high until you give r enough room. This is exactly the practical tuning question: pick r large enough to cover the task’s real complexity, but no larger.

Watch the parameter counter. Every increment of r adds 2d parameters but, past the true rank, removes almost no error. That diminishing return is the low-rank hypothesis you can see with your own eyes — and it’s why r = 8 to 64 is plenty for most real fine-tuning.
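The lab's sweep can be reproduced offline with a short NumPy sketch. The target is synthetic (a matrix of chosen rank plus small noise), and the best rank-r error is read directly off the singular values rather than by training B and A; the function name and defaults are illustrative.

```python
import numpy as np

def rank_sweep(d=64, true_rank=3, noise=0.01, max_r=8, seed=0):
    """Return (r, adapter params, relative error of the best rank-r fit) for r = 1..max_r."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
    target += noise * rng.normal(size=(d, d))
    s = np.linalg.svd(target, compute_uv=False)   # singular values, largest first
    total = np.sqrt(np.sum(s**2))
    rows = []
    for r in range(1, max_r + 1):
        # The best rank-r error is the energy in the discarded singular values.
        rel_error = np.sqrt(np.sum(s[r:]**2)) / total
        rows.append((r, 2 * d * r, rel_error))
    return rows

for r, n_params, err in rank_sweep(true_rank=3):
    print(f"r={r}  params={n_params:5d}  relative error={err:.4f}")
```

Calling rank_sweep(true_rank=8) shows the other case described above: at r = 3 a large share of the error remains, and it only falls away as r approaches 8.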

The whole lesson in one widget: LoRA bets that the gap between “reconstruction good enough” and “full rank” is huge — that you reach near-perfect with tiny r. When the bet holds (and for fine-tuning updates it does), you train 0.4% of the weights and lose almost nothing.

No quiz here — the lab is the test. If you can predict where the error plateaus as you match r to the true rank, you understand why LoRA works.

Chapter 9: Cheat Sheet & Connections

LoRA in one breath

Freeze W: pretrained weights, untouched
+
Add B·A: trainable, rank r ≪ d
h = Wx + (α/r)BAx: only A, B get gradients
Merge or swap: fold in (0 latency) or hot-swap adapters

The numbers to remember

Quantity | Value
Trainable fraction per matrix | 2r/d (e.g. r=8, d=4096 → ~0.4%)
Typical rank r | 8–64 (even 1–2 often works)
Init | A random, B = 0 → ΔW starts at 0
Scaling | update = (α/r)·B·A
QLoRA base precision | 4-bit NF4 (frozen); adapter higher precision
Adapter file size | megabytes, not gigabytes

The three things to remember

1. The update is low-rank. Adapting a pretrained model is a small, focused nudge — so ΔW = B·A with tiny r captures it. Freeze W, train only the sliver.

2. Cheap to train, free to serve. Few trainable params → little gradient/optimizer memory → big models on small GPUs (QLoRA pushes this with a 4-bit frozen base). Merge the adapter for zero inference latency, or hot-swap many adapters on one base.

3. It’s a patch, not a rewrite. B=0 init means you start exactly at the pretrained model and only add what training justifies — cheap, stable, and resistant to forgetting.

Closing thought: LoRA is what happens when someone asks “do we really need to move all of it?” and the answer turns out to be a resounding no. A 0.4% patch, a frozen giant, a few megabytes per skill — the most practical idea in modern fine-tuning, born from one observation about the rank of a change.
In one sentence, what is LoRA?