Edward Hu, Yelong Shen, Phillip Wallis, et al. (Microsoft) — ICLR 2022

LoRA: Low-Rank Adaptation

Low-Rank Adaptation of Large Language Models — freeze the pre-trained weights, inject trainable low-rank matrices A and B. Train 10,000x fewer parameters while matching full fine-tuning.

Prerequisites: Matrix multiplication + Fine-tuning basics + Transformer architecture. That's it.
8
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why Efficient Tuning?

You have GPT-3 with 175 billion parameters. You want to fine-tune it for customer service chatbot, legal document review, and medical QA. The naive approach: make three copies of the full model and fine-tune each one.

The problem? Each copy is ~350 GB in float16. Three copies = 1.05 TB of GPU memory just for the weights. Plus optimizer states (2x for Adam), plus gradients — you're looking at ~4 TB total. This is economically and practically insane.

ResourceFull Fine-Tuning (GPT-3)Cost
Model weights175B × 2 bytes = 350 GBPer task copy
Adam optimizer175B × 8 bytes = 1.4 TBPer training run
Gradients175B × 2 bytes = 350 GBPer training step
Total per task~2.1 TBPer task
3 tasks~6.3 TB + 3 separate training runsImpractical
The fine-tuning bottleneck: Full fine-tuning updates ALL parameters. But we know from the Lottery Ticket Hypothesis and intrinsic dimensionality research that neural networks are massively redundant. The "useful" updates during fine-tuning likely live in a much lower-dimensional space. If we could find that subspace and only train parameters there, we'd save orders of magnitude in compute and memory.

Several approaches existed before LoRA:

MethodApproachProblem
Adapters (Houlsby 2019)Insert small modules between layersAdds inference latency (extra layers)
Prefix Tuning (Li 2021)Prepend learnable tokens to keys/valuesReduces usable context length
Prompt Tuning (Lester 2021)Learn soft prompts prepended to inputHard to optimize, unstable

LoRA's insight was different: don't add new layers or tokens. Instead, decompose the weight update itself into a low-rank form. This adds zero inference latency (the low-rank matrices can be merged into the original weights) and doesn't reduce context length.

The Fine-Tuning Cost Problem

Drag the slider to change model size. Watch how full fine-tuning costs explode compared to LoRA. The bars show GPU memory requirements for each approach. LoRA stays nearly flat while full fine-tuning grows linearly.

Model Size 7B
Why is full fine-tuning of large language models impractical for deploying multiple task-specific models?

Chapter 1: The Low-Rank Idea

During fine-tuning, each weight matrix W changes by some amount ΔW. Full fine-tuning learns ΔW directly — a dense matrix with the same dimensions as W. For GPT-3's attention layers, W is 12288 × 12288, so ΔW has 150 million entries. Per layer. Times 96 layers.

But Aghajanyan et al. (2021) showed something remarkable: the intrinsic dimensionality of fine-tuning is very low. When you project the weight updates into a random low-dimensional subspace, you can match 90% of full fine-tuning performance with only a few thousand trainable parameters. This means the actual "useful" information in ΔW is far lower-dimensional than its matrix size suggests.

The core insight of LoRA: If the useful part of ΔW is low-dimensional, then ΔW should be well-approximated by a low-rank matrix. A rank-r matrix of size d × d can be written as the product of two smaller matrices: B (d × r) and A (r × d), where r << d. Instead of learning 150 million entries in ΔW, learn 2 × 12288 × r entries in A and B. For r = 8, that's 196,608 parameters — a 764x reduction.

Here's the key equation. Instead of:

h = W · x + ΔW · x

LoRA computes:

h = W · x + B · A · x

Where W is frozen (no gradients), B is d × r (initialized to zeros), and A is r × d (initialized from a Gaussian). The product B·A has rank at most r, so the update is inherently low-rank.

python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, original_linear, r=8, alpha=16):
        super().__init__()
        d_in = original_linear.in_features    # e.g., 12288
        d_out = original_linear.out_features   # e.g., 12288

        # Freeze the original weight
        self.W = original_linear
        self.W.weight.requires_grad = False

        # LoRA matrices — ONLY these are trained
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # [r, d_in]
        self.B = nn.Parameter(torch.zeros(d_out, r))            # [d_out, r]
        self.scaling = alpha / r  # scaling factor

    def forward(self, x):
        # Original: W·x, shape [batch, seq, d_out]
        base = self.W(x)
        # LoRA: B·A·x, same shape
        lora = (x @ self.A.T) @ self.B.T * self.scaling
        return base + lora  # h = Wx + BAx

The beauty of this design:

1. At training time: Only A and B receive gradients. W is frozen. Memory for optimizer states drops from O(d²) to O(2·d·r).

2. At inference time: You can merge: W' = W + B·A. The merged weight has exactly the same shape as the original, so there's zero additional latency. LoRA literally disappears at inference.

3. For multiple tasks: Store one base model + a tiny LoRA file per task. Swap LoRA modules to switch tasks. Each LoRA file is ~1-10 MB instead of 350 GB.

Low-Rank Decomposition Visualizer

See how a d×d weight update ΔW is decomposed into B (d×r) times A (r×d). Drag the rank slider to see how few parameters are needed. At rank 8 with d=12288, the compression ratio is 764x.

Rank r 8
Why does LoRA add zero inference latency, unlike adapter modules?

Chapter 2: Math of LoRA

Let's derive LoRA's parameter savings precisely and understand the scaling factor.

Parameter count

For a single weight matrix W ∈ Rd×d:

Full ΔW: d × d = d² parameters
LoRA (B·A): d × r + r × d = 2dr parameters
Compression ratio = d² / (2dr) = d / (2r)

For GPT-3's attention weights (d = 12288, r = 8):

python
d = 12288     # hidden dim
r = 8         # LoRA rank

full_params = d * d                    # 150,994,944
lora_params = 2 * d * r               # 196,608
ratio = full_params / lora_params     # 768x compression

# Applied to all attention Wq and Wv across 96 layers:
total_lora = 2 * lora_params * 96     # 37.7M params
total_full = 175e9                   # 175B params
print(f"LoRA trains {total_lora/total_full*100:.4f}% of params")
# LoRA trains 0.0215% of parameters → ~4650x reduction

The scaling factor α

LoRA introduces a hyperparameter α (alpha) that controls the magnitude of the update:

h = W · x + (α / r) · B · A · x

Why α/r and not just α?

When you change r, the magnitude of B·A changes (more columns in A means a larger product). Dividing by r normalizes this: the update magnitude stays roughly constant regardless of r. This means you can tune α once and then change r freely without retuning the learning rate.

Practical rule of thumb: Set α = 2r (or just α = 16 for r = 8). The scaling factor α/r = 2 works well across most tasks. Higher α makes the LoRA update dominate (more task-specific), lower α keeps the model closer to its pre-trained behavior.

Initialization

LoRA initializes B to zeros and A from a Gaussian. This means at initialization, ΔW = B·A = 0 — the model starts exactly as the pre-trained model. Training only moves the weights away from the pre-trained solution as needed.

A ~ N(0, σ²), B = 0 ⇒ ΔW = B · A = 0 at init

This is crucial for stability: the model doesn't "forget" its pre-trained knowledge at the start of fine-tuning. It only learns the minimum necessary adaptation.

Gradient computation

Only A and B receive gradients. The memory savings are substantial:

AL = BT · (∇hL) · xT, shape [r, d]
BL = (∇hL) · xT · AT, shape [d, r]
python
# Memory comparison for Adam optimizer (2 states per param)
d, r = 12288, 8
n_layers = 96
n_matrices = 2  # Wq and Wv

# Full fine-tuning: Adam stores m and v for each param
full_adam = 175e9 * 2 * 4   # 2 states × 4 bytes = 1.4 TB

# LoRA: Adam only for A and B
lora_adam = n_layers * n_matrices * 2 * d * r * 2 * 4
print(f"Full Adam: {full_adam/1e9:.0f} GB")   # 1400 GB
print(f"LoRA Adam: {lora_adam/1e6:.0f} MB")   # 302 MB
Parameter & Memory Calculator

Drag sliders to change model size and LoRA rank. The calculator shows exact parameter counts and memory savings for both training and inference.

dmodel 4096
rank r 8
Why is B initialized to zeros and A initialized from a Gaussian?

Chapter 3: Where to Apply

A Transformer has four types of weight matrices per attention layer: Wq, Wk, Wv, Wo (query, key, value, output projections). Plus the FFN's up and down projections. Where should you put LoRA?

Hu et al. ran ablations on GPT-3 175B:

LoRA Applied ToTrainable ParamsWikiSQL AccMNLI Acc
Wq only4.7M70.4%91.0%
Wv only4.7M73.0%91.2%
Wq + Wv9.4M73.4%91.5%
Wq + Wk + Wv + Wo18.8M73.7%91.4%
Wq + Wv is the sweet spot. Adding LoRA to both query and value projections performs nearly as well as adding it to all four attention matrices, while using half the parameters. The key and output projections contribute relatively little. This makes sense: queries encode "what am I looking for?" and values encode "what do I contain?" — these are the most task-dependent components of attention.

Why not FFN layers?

The original LoRA paper focused on attention layers and didn't apply LoRA to FFN. Later work (QLoRA, LoRA+) found that including FFN gives small additional gains, especially for knowledge-intensive tasks. The FFN stores factual knowledge, so adapting it helps when the target task requires domain-specific facts not in the pre-training data.

One rank for all layers?

Hu et al. used the same rank r for all layers, but subsequent work showed that different layers may benefit from different ranks. Lower layers (closer to input) tend to capture general linguistic features and need less adaptation. Upper layers (closer to output) are more task-specific and benefit from higher ranks.

python
# Practical LoRA setup for a Transformer
def add_lora(model, r=8, alpha=16, target_modules=['q_proj', 'v_proj']):
    """Add LoRA to specified attention projections."""
    for name, module in model.named_modules():
        if any(t in name for t in target_modules):
            if isinstance(module, nn.Linear):
                lora_module = LoRALinear(module, r=r, alpha=alpha)
                # Replace the original module
                parent = get_parent(model, name)
                setattr(parent, name.split('.')[-1], lora_module)

    # Freeze everything except LoRA parameters
    for n, p in model.named_parameters():
        p.requires_grad = 'lora' in n  # only LoRA params train
Where to Apply LoRA

Click on different Transformer components to toggle LoRA on/off. The right side shows trainable parameter count and estimated accuracy. Notice that Wq + Wv gives the best accuracy-per-parameter ratio.

Why is applying LoRA to Wq + Wv the recommended default?

Chapter 4: Rank Selection

The rank r is LoRA's most important hyperparameter. It controls the expressiveness of the adaptation — higher rank means more capacity to learn, but also more parameters and compute.

Hu et al. found that surprisingly low ranks work well:

RankParams (Wq+Wv, 96 layers)WikiSQLMNLI
r = 12.4M68.8%90.7%
r = 44.7M72.6%91.3%
r = 89.4M73.4%91.5%
r = 6475.5M73.7%91.3%
Full FT175,000M73.8%91.7%
Rank 8 captures most of the adaptation. Going from r=8 to r=64 (8x more parameters) barely improves accuracy. And r=8 nearly matches full fine-tuning (175B parameters) with only 9.4M trainable parameters — a 18,600x reduction. This confirms that fine-tuning operates in an extremely low-dimensional subspace.

Why does low rank work?

The weight update ΔW during fine-tuning is approximately low-rank because:

1. Task similarity: Most downstream tasks are related to the pre-training distribution (both involve language). The adaptation is a small perturbation, not a complete rewrite. Small perturbations tend to be low-rank.

2. Intrinsic dimensionality: Aghajanyan et al. showed that the intrinsic dimensionality of fine-tuning for GPT-3 is only ~200-1000 across all layers combined. Distributed across 192 matrices (2 per layer × 96 layers), that's only ~1-5 effective dimensions per matrix — consistent with r ≈ 4-8.

3. Redundancy in attention: Many attention heads learn similar patterns. The adaptation can often be expressed as changing a few "modes" of attention behavior, not independently modifying every head.

Rank selection guidelines

Task TypeRecommended RankReasoning
Classification (sentiment, NLI)r = 4-8Simple output space, small adaptation
Summarization, translationr = 8-16Moderate complexity
Code generation, mathr = 16-32Requires learning new "skills"
Domain transfer (medical, legal)r = 32-64Significant distribution shift
Rank vs Performance

Drag the rank slider to see how accuracy changes. The curve shows diminishing returns — most of the benefit comes from the first few ranks. The cost bar shows trainable parameter count. Find the sweet spot where accuracy plateaus.

Rank r 8
Why does LoRA with rank r=8 nearly match full fine-tuning (175B params) with only 9.4M parameters?

Chapter 5: Results & Comparisons

How does LoRA compare to other parameter-efficient methods and full fine-tuning? Hu et al. ran comprehensive comparisons on GPT-3 175B.

MethodTrainable ParamsWikiSQLMNLI-mSAMSumInference Latency
Full Fine-Tuning175,000M73.8%91.7%52.0Baseline
Adapter (Houlsby)7.1M71.9%90.5%53.2+20-30%
Prefix Tuning0.77M63.1%88.6%51.4Baseline
LoRA (r=8)4.7M73.4%91.5%53.8Baseline
LoRA wins on every axis. Compared to Adapters: similar accuracy but zero inference latency overhead. Compared to Prefix Tuning: much higher accuracy with only 6x more parameters. Compared to Full Fine-Tuning: 99.997% fewer parameters with nearly identical accuracy. And unlike all other methods, LoRA adds nothing to inference — the matrices merge into the base weights.

Training efficiency

Beyond parameter count, LoRA saves substantial compute during training:

MetricFull FTLoRASavings
Trainable params175B4.7M37,000x
GPU memory (training)~1.2 TB~350 GB~3.4x
Checkpoint size350 GB~35 MB10,000x
Task switchingLoad full modelSwap LoRA fileInstant

Note: GPU memory savings are "only" 3.4x (not 37,000x) because you still need to load the full model weights for the forward pass. The savings come from not storing optimizer states and gradients for the frozen weights.

RoBERTa results

LoRA also works well on encoder models. On RoBERTa-Base (125M params) with the GLUE benchmark:

MethodParamsMRPCSST-2QNLI
Full FT125M90.294.892.8
Adapters0.3M89.594.091.5
LoRA (r=8)0.3M90.095.193.3

On RoBERTa, LoRA actually exceeds full fine-tuning on some tasks (SST-2: 95.1 vs 94.8). This suggests that the low-rank constraint acts as a beneficial regularizer, preventing overfitting on small datasets.

Method Comparison Dashboard

Compare LoRA against other methods on three axes: accuracy, parameter count, and inference overhead. Click the method names to highlight them. LoRA is the only method in the top-right corner (high accuracy, low params, no latency overhead).

What is LoRA's key advantage over Adapter modules?

Chapter 6: LoRA Explorer

Time to see LoRA in action. This explorer lets you configure LoRA parameters and watch how they affect the weight update decomposition, parameter efficiency, and task performance.

LoRA Configuration Explorer

Configure your LoRA setup: choose model size, rank, alpha, and which layers to target. The visualization shows the decomposition for one weight matrix, parameter counts, and estimated accuracy. Watch how the B·A product approximates the full weight update as rank increases.

dmodel 4096
Rank r 8
Alpha α 16
Multi-Task LoRA Swapping

See how one base model serves multiple tasks by swapping LoRA modules. Click a task to "load" its LoRA. The base model (350 GB) stays in memory; only the tiny LoRA file (~35 MB) changes. This is how production systems serve dozens of specialized models from one GPU.

The practical impact: LoRA made fine-tuning accessible to everyone. Before LoRA, fine-tuning GPT-3 required a massive GPU cluster. After LoRA, you could fine-tune a 7B model on a single consumer GPU with 16 GB of VRAM. Combined with QLoRA (quantized LoRA), even a 65B model can be fine-tuned on a single 48 GB GPU.
How does a production system serve 50 specialized tasks from a single GPU using LoRA?

Chapter 7: Connections

LoRA sits at the intersection of parameter efficiency, matrix theory, and practical ML engineering. Its connections reveal both its intellectual roots and its enormous impact.

What came before

WorkContributionRelationship to LoRA
Adapter Modules (Houlsby 2019)First major PEFT methodLoRA avoids adapter's inference overhead
Intrinsic Dimensionality (Aghajanyan 2021)Fine-tuning is low-dimensionalTheoretical justification for low-rank ΔW
Lottery Ticket Hypothesis (Frankle 2019)Dense networks contain sparse subnetworksBoth exploit redundancy; LTH finds sparse masks, LoRA finds low-rank subspaces

What came after

WorkHow It Extended LoRA
QLoRA (Dettmers 2023)Quantize base model to 4-bit, apply LoRA on quantized weights → fine-tune 65B on single GPU
LoRA+ (Hayou 2024)Different learning rates for A and B → better convergence
DoRA (Liu 2024)Decompose W into magnitude and direction, apply LoRA to direction only
rsLoRA (Kalajdzievski 2024)Fix scaling: use α/√r instead of α/r for better rank-independence
Hugging Face PEFTStandard library making LoRA a one-line operation

The bigger picture

Before LoRA (2019-2021):

Fine-tuning = update all weights.

One GPU cluster per task.

Serving = separate model per task.

After LoRA (2022+):

Fine-tuning = train tiny adapters.

One consumer GPU per task.

Serving = one base + swap adapters.

The LoRA legacy: LoRA democratized fine-tuning. By reducing the cost from "GPU cluster" to "consumer GPU," it enabled researchers, startups, and hobbyists to customize foundation models for specific tasks. The explosion of open-source fine-tuned models on Hugging Face — LLaMA-LoRA, Alpaca-LoRA, CodeLlama-LoRA — is a direct consequence of LoRA making fine-tuning accessible.

"Simplicity is the ultimate sophistication." — Leonardo da Vinci (and LoRA is as simple as W + BA)

What is the most significant practical impact of LoRA?