Low-Rank Adaptation of Large Language Models — freeze the pre-trained weights, inject trainable low-rank matrices A and B. Train 10,000x fewer parameters while matching full fine-tuning.
You have GPT-3 with 175 billion parameters. You want to fine-tune it for customer service chatbot, legal document review, and medical QA. The naive approach: make three copies of the full model and fine-tune each one.
The problem? Each copy is ~350 GB in float16. Three copies = 1.05 TB of GPU memory just for the weights. Plus optimizer states (2x for Adam), plus gradients — you're looking at ~4 TB total. This is economically and practically insane.
| Resource | Full Fine-Tuning (GPT-3) | Cost |
|---|---|---|
| Model weights | 175B × 2 bytes = 350 GB | Per task copy |
| Adam optimizer | 175B × 8 bytes = 1.4 TB | Per training run |
| Gradients | 175B × 2 bytes = 350 GB | Per training step |
| Total per task | ~2.1 TB | Per task |
| 3 tasks | ~6.3 TB + 3 separate training runs | Impractical |
Several approaches existed before LoRA:
| Method | Approach | Problem |
|---|---|---|
| Adapters (Houlsby 2019) | Insert small modules between layers | Adds inference latency (extra layers) |
| Prefix Tuning (Li 2021) | Prepend learnable tokens to keys/values | Reduces usable context length |
| Prompt Tuning (Lester 2021) | Learn soft prompts prepended to input | Hard to optimize, unstable |
LoRA's insight was different: don't add new layers or tokens. Instead, decompose the weight update itself into a low-rank form. This adds zero inference latency (the low-rank matrices can be merged into the original weights) and doesn't reduce context length.
Drag the slider to change model size. Watch how full fine-tuning costs explode compared to LoRA. The bars show GPU memory requirements for each approach. LoRA stays nearly flat while full fine-tuning grows linearly.
During fine-tuning, each weight matrix W changes by some amount ΔW. Full fine-tuning learns ΔW directly — a dense matrix with the same dimensions as W. For GPT-3's attention layers, W is 12288 × 12288, so ΔW has 150 million entries. Per layer. Times 96 layers.
But Aghajanyan et al. (2021) showed something remarkable: the intrinsic dimensionality of fine-tuning is very low. When you project the weight updates into a random low-dimensional subspace, you can match 90% of full fine-tuning performance with only a few thousand trainable parameters. This means the actual "useful" information in ΔW is far lower-dimensional than its matrix size suggests.
Here's the key equation. Instead of:
LoRA computes:
Where W is frozen (no gradients), B is d × r (initialized to zeros), and A is r × d (initialized from a Gaussian). The product B·A has rank at most r, so the update is inherently low-rank.
python import torch import torch.nn as nn class LoRALinear(nn.Module): def __init__(self, original_linear, r=8, alpha=16): super().__init__() d_in = original_linear.in_features # e.g., 12288 d_out = original_linear.out_features # e.g., 12288 # Freeze the original weight self.W = original_linear self.W.weight.requires_grad = False # LoRA matrices — ONLY these are trained self.A = nn.Parameter(torch.randn(r, d_in) * 0.01) # [r, d_in] self.B = nn.Parameter(torch.zeros(d_out, r)) # [d_out, r] self.scaling = alpha / r # scaling factor def forward(self, x): # Original: W·x, shape [batch, seq, d_out] base = self.W(x) # LoRA: B·A·x, same shape lora = (x @ self.A.T) @ self.B.T * self.scaling return base + lora # h = Wx + BAx
The beauty of this design:
1. At training time: Only A and B receive gradients. W is frozen. Memory for optimizer states drops from O(d²) to O(2·d·r).
2. At inference time: You can merge: W' = W + B·A. The merged weight has exactly the same shape as the original, so there's zero additional latency. LoRA literally disappears at inference.
3. For multiple tasks: Store one base model + a tiny LoRA file per task. Swap LoRA modules to switch tasks. Each LoRA file is ~1-10 MB instead of 350 GB.
See how a d×d weight update ΔW is decomposed into B (d×r) times A (r×d). Drag the rank slider to see how few parameters are needed. At rank 8 with d=12288, the compression ratio is 764x.
Let's derive LoRA's parameter savings precisely and understand the scaling factor.
For a single weight matrix W ∈ Rd×d:
For GPT-3's attention weights (d = 12288, r = 8):
python d = 12288 # hidden dim r = 8 # LoRA rank full_params = d * d # 150,994,944 lora_params = 2 * d * r # 196,608 ratio = full_params / lora_params # 768x compression # Applied to all attention Wq and Wv across 96 layers: total_lora = 2 * lora_params * 96 # 37.7M params total_full = 175e9 # 175B params print(f"LoRA trains {total_lora/total_full*100:.4f}% of params") # LoRA trains 0.0215% of parameters → ~4650x reduction
LoRA introduces a hyperparameter α (alpha) that controls the magnitude of the update:
Why α/r and not just α?
When you change r, the magnitude of B·A changes (more columns in A means a larger product). Dividing by r normalizes this: the update magnitude stays roughly constant regardless of r. This means you can tune α once and then change r freely without retuning the learning rate.
LoRA initializes B to zeros and A from a Gaussian. This means at initialization, ΔW = B·A = 0 — the model starts exactly as the pre-trained model. Training only moves the weights away from the pre-trained solution as needed.
This is crucial for stability: the model doesn't "forget" its pre-trained knowledge at the start of fine-tuning. It only learns the minimum necessary adaptation.
Only A and B receive gradients. The memory savings are substantial:
python # Memory comparison for Adam optimizer (2 states per param) d, r = 12288, 8 n_layers = 96 n_matrices = 2 # Wq and Wv # Full fine-tuning: Adam stores m and v for each param full_adam = 175e9 * 2 * 4 # 2 states × 4 bytes = 1.4 TB # LoRA: Adam only for A and B lora_adam = n_layers * n_matrices * 2 * d * r * 2 * 4 print(f"Full Adam: {full_adam/1e9:.0f} GB") # 1400 GB print(f"LoRA Adam: {lora_adam/1e6:.0f} MB") # 302 MB
Drag sliders to change model size and LoRA rank. The calculator shows exact parameter counts and memory savings for both training and inference.
A Transformer has four types of weight matrices per attention layer: Wq, Wk, Wv, Wo (query, key, value, output projections). Plus the FFN's up and down projections. Where should you put LoRA?
Hu et al. ran ablations on GPT-3 175B:
| LoRA Applied To | Trainable Params | WikiSQL Acc | MNLI Acc |
|---|---|---|---|
| Wq only | 4.7M | 70.4% | 91.0% |
| Wv only | 4.7M | 73.0% | 91.2% |
| Wq + Wv | 9.4M | 73.4% | 91.5% |
| Wq + Wk + Wv + Wo | 18.8M | 73.7% | 91.4% |
The original LoRA paper focused on attention layers and didn't apply LoRA to FFN. Later work (QLoRA, LoRA+) found that including FFN gives small additional gains, especially for knowledge-intensive tasks. The FFN stores factual knowledge, so adapting it helps when the target task requires domain-specific facts not in the pre-training data.
Hu et al. used the same rank r for all layers, but subsequent work showed that different layers may benefit from different ranks. Lower layers (closer to input) tend to capture general linguistic features and need less adaptation. Upper layers (closer to output) are more task-specific and benefit from higher ranks.
python # Practical LoRA setup for a Transformer def add_lora(model, r=8, alpha=16, target_modules=['q_proj', 'v_proj']): """Add LoRA to specified attention projections.""" for name, module in model.named_modules(): if any(t in name for t in target_modules): if isinstance(module, nn.Linear): lora_module = LoRALinear(module, r=r, alpha=alpha) # Replace the original module parent = get_parent(model, name) setattr(parent, name.split('.')[-1], lora_module) # Freeze everything except LoRA parameters for n, p in model.named_parameters(): p.requires_grad = 'lora' in n # only LoRA params train
Click on different Transformer components to toggle LoRA on/off. The right side shows trainable parameter count and estimated accuracy. Notice that Wq + Wv gives the best accuracy-per-parameter ratio.
The rank r is LoRA's most important hyperparameter. It controls the expressiveness of the adaptation — higher rank means more capacity to learn, but also more parameters and compute.
Hu et al. found that surprisingly low ranks work well:
| Rank | Params (Wq+Wv, 96 layers) | WikiSQL | MNLI |
|---|---|---|---|
| r = 1 | 2.4M | 68.8% | 90.7% |
| r = 4 | 4.7M | 72.6% | 91.3% |
| r = 8 | 9.4M | 73.4% | 91.5% |
| r = 64 | 75.5M | 73.7% | 91.3% |
| Full FT | 175,000M | 73.8% | 91.7% |
The weight update ΔW during fine-tuning is approximately low-rank because:
1. Task similarity: Most downstream tasks are related to the pre-training distribution (both involve language). The adaptation is a small perturbation, not a complete rewrite. Small perturbations tend to be low-rank.
2. Intrinsic dimensionality: Aghajanyan et al. showed that the intrinsic dimensionality of fine-tuning for GPT-3 is only ~200-1000 across all layers combined. Distributed across 192 matrices (2 per layer × 96 layers), that's only ~1-5 effective dimensions per matrix — consistent with r ≈ 4-8.
3. Redundancy in attention: Many attention heads learn similar patterns. The adaptation can often be expressed as changing a few "modes" of attention behavior, not independently modifying every head.
| Task Type | Recommended Rank | Reasoning |
|---|---|---|
| Classification (sentiment, NLI) | r = 4-8 | Simple output space, small adaptation |
| Summarization, translation | r = 8-16 | Moderate complexity |
| Code generation, math | r = 16-32 | Requires learning new "skills" |
| Domain transfer (medical, legal) | r = 32-64 | Significant distribution shift |
Drag the rank slider to see how accuracy changes. The curve shows diminishing returns — most of the benefit comes from the first few ranks. The cost bar shows trainable parameter count. Find the sweet spot where accuracy plateaus.
How does LoRA compare to other parameter-efficient methods and full fine-tuning? Hu et al. ran comprehensive comparisons on GPT-3 175B.
| Method | Trainable Params | WikiSQL | MNLI-m | SAMSum | Inference Latency |
|---|---|---|---|---|---|
| Full Fine-Tuning | 175,000M | 73.8% | 91.7% | 52.0 | Baseline |
| Adapter (Houlsby) | 7.1M | 71.9% | 90.5% | 53.2 | +20-30% |
| Prefix Tuning | 0.77M | 63.1% | 88.6% | 51.4 | Baseline |
| LoRA (r=8) | 4.7M | 73.4% | 91.5% | 53.8 | Baseline |
Beyond parameter count, LoRA saves substantial compute during training:
| Metric | Full FT | LoRA | Savings |
|---|---|---|---|
| Trainable params | 175B | 4.7M | 37,000x |
| GPU memory (training) | ~1.2 TB | ~350 GB | ~3.4x |
| Checkpoint size | 350 GB | ~35 MB | 10,000x |
| Task switching | Load full model | Swap LoRA file | Instant |
Note: GPU memory savings are "only" 3.4x (not 37,000x) because you still need to load the full model weights for the forward pass. The savings come from not storing optimizer states and gradients for the frozen weights.
LoRA also works well on encoder models. On RoBERTa-Base (125M params) with the GLUE benchmark:
| Method | Params | MRPC | SST-2 | QNLI |
|---|---|---|---|---|
| Full FT | 125M | 90.2 | 94.8 | 92.8 |
| Adapters | 0.3M | 89.5 | 94.0 | 91.5 |
| LoRA (r=8) | 0.3M | 90.0 | 95.1 | 93.3 |
On RoBERTa, LoRA actually exceeds full fine-tuning on some tasks (SST-2: 95.1 vs 94.8). This suggests that the low-rank constraint acts as a beneficial regularizer, preventing overfitting on small datasets.
Compare LoRA against other methods on three axes: accuracy, parameter count, and inference overhead. Click the method names to highlight them. LoRA is the only method in the top-right corner (high accuracy, low params, no latency overhead).
Time to see LoRA in action. This explorer lets you configure LoRA parameters and watch how they affect the weight update decomposition, parameter efficiency, and task performance.
Configure your LoRA setup: choose model size, rank, alpha, and which layers to target. The visualization shows the decomposition for one weight matrix, parameter counts, and estimated accuracy. Watch how the B·A product approximates the full weight update as rank increases.
See how one base model serves multiple tasks by swapping LoRA modules. Click a task to "load" its LoRA. The base model (350 GB) stays in memory; only the tiny LoRA file (~35 MB) changes. This is how production systems serve dozens of specialized models from one GPU.
LoRA sits at the intersection of parameter efficiency, matrix theory, and practical ML engineering. Its connections reveal both its intellectual roots and its enormous impact.
| Work | Contribution | Relationship to LoRA |
|---|---|---|
| Adapter Modules (Houlsby 2019) | First major PEFT method | LoRA avoids adapter's inference overhead |
| Intrinsic Dimensionality (Aghajanyan 2021) | Fine-tuning is low-dimensional | Theoretical justification for low-rank ΔW |
| Lottery Ticket Hypothesis (Frankle 2019) | Dense networks contain sparse subnetworks | Both exploit redundancy; LTH finds sparse masks, LoRA finds low-rank subspaces |
| Work | How It Extended LoRA |
|---|---|
| QLoRA (Dettmers 2023) | Quantize base model to 4-bit, apply LoRA on quantized weights → fine-tune 65B on single GPU |
| LoRA+ (Hayou 2024) | Different learning rates for A and B → better convergence |
| DoRA (Liu 2024) | Decompose W into magnitude and direction, apply LoRA to direction only |
| rsLoRA (Kalajdzievski 2024) | Fix scaling: use α/√r instead of α/r for better rank-independence |
| Hugging Face PEFT | Standard library making LoRA a one-line operation |
Before LoRA (2019-2021):
Fine-tuning = update all weights.
One GPU cluster per task.
Serving = separate model per task.
After LoRA (2022+):
Fine-tuning = train tiny adapters.
One consumer GPU per task.
Serving = one base + swap adapters.
"Simplicity is the ultimate sophistication." — Leonardo da Vinci (and LoRA is as simple as W + BA)