LoRA (Hu 2022) — Veanors

Chapter 0: Why Efficient Tuning?

You have GPT-3 with 175 billion parameters. You want to fine-tune it for customer service chatbot, legal document review, and medical QA. The naive approach: make three copies of the full model and fine-tune each one.

The problem? Each copy is ~350 GB in float16. Three copies = 1.05 TB of GPU memory just for the weights. Plus optimizer states (2x for Adam), plus gradients — you're looking at ~4 TB total. This is economically and practically insane.

Resource	Full Fine-Tuning (GPT-3)	Cost
Model weights	175B × 2 bytes = 350 GB	Per task copy
Adam optimizer	175B × 8 bytes = 1.4 TB	Per training run
Gradients	175B × 2 bytes = 350 GB	Per training step
Total per task	~2.1 TB	Per task
3 tasks	~6.3 TB + 3 separate training runs	Impractical

The fine-tuning bottleneck: Full fine-tuning updates ALL parameters. But we know from the Lottery Ticket Hypothesis and intrinsic dimensionality research that neural networks are massively redundant. The "useful" updates during fine-tuning likely live in a much lower-dimensional space. If we could find that subspace and only train parameters there, we'd save orders of magnitude in compute and memory.

Several approaches existed before LoRA:

Method	Approach	Problem
Adapters (Houlsby 2019)	Insert small modules between layers	Adds inference latency (extra layers)
Prefix Tuning (Li 2021)	Prepend learnable tokens to keys/values	Reduces usable context length
Prompt Tuning (Lester 2021)	Learn soft prompts prepended to input	Hard to optimize, unstable

LoRA's insight was different: don't add new layers or tokens. Instead, decompose the weight update itself into a low-rank form. This adds zero inference latency (the low-rank matrices can be merged into the original weights) and doesn't reduce context length.

The Fine-Tuning Cost Problem

Drag the slider to change model size. Watch how full fine-tuning costs explode compared to LoRA. The bars show GPU memory requirements for each approach. LoRA stays nearly flat while full fine-tuning grows linearly.

Model Size 7B

Why is full fine-tuning of large language models impractical for deploying multiple task-specific models?

Each task requires a full copy of the model weights (~350 GB for GPT-3), plus optimizer states and gradients. Deploying multiple tasks means multiple full copies, requiring terabytes of storage and memory Fine-tuning always makes the model worse Large models can't be fine-tuned at all

Chapter 1: The Low-Rank Idea

During fine-tuning, each weight matrix W changes by some amount ΔW. Full fine-tuning learns ΔW directly — a dense matrix with the same dimensions as W. For GPT-3's attention layers, W is 12288 × 12288, so ΔW has 150 million entries. Per layer. Times 96 layers.

But Aghajanyan et al. (2021) showed something remarkable: the intrinsic dimensionality of fine-tuning is very low. When you project the weight updates into a random low-dimensional subspace, you can match 90% of full fine-tuning performance with only a few thousand trainable parameters. This means the actual "useful" information in ΔW is far lower-dimensional than its matrix size suggests.

The core insight of LoRA: If the useful part of ΔW is low-dimensional, then ΔW should be well-approximated by a low-rank matrix. A rank-r matrix of size d × d can be written as the product of two smaller matrices: B (d × r) and A (r × d), where r << d. Instead of learning 150 million entries in ΔW, learn 2 × 12288 × r entries in A and B. For r = 8, that's 196,608 parameters — a 764x reduction.

Here's the key equation. Instead of:

h = W · x + ΔW · x

LoRA computes:

h = W · x + B · A · x

Where W is frozen (no gradients), B is d × r (initialized to zeros), and A is r × d (initialized from a Gaussian). The product B·A has rank at most r, so the update is inherently low-rank.

python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, original_linear, r=8, alpha=16):
        super().__init__()
        d_in = original_linear.in_features    # e.g., 12288
        d_out = original_linear.out_features   # e.g., 12288

        # Freeze the original weight
        self.W = original_linear
        self.W.weight.requires_grad = False

        # LoRA matrices — ONLY these are trained
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # [r, d_in]
        self.B = nn.Parameter(torch.zeros(d_out, r))            # [d_out, r]
        self.scaling = alpha / r  # scaling factor

    def forward(self, x):
        # Original: W·x, shape [batch, seq, d_out]
        base = self.W(x)
        # LoRA: B·A·x, same shape
        lora = (x @ self.A.T) @ self.B.T * self.scaling
        return base + lora  # h = Wx + BAx

The beauty of this design:

1. At training time: Only A and B receive gradients. W is frozen. Memory for optimizer states drops from O(d²) to O(2·d·r).

2. At inference time: You can merge: W' = W + B·A. The merged weight has exactly the same shape as the original, so there's zero additional latency. LoRA literally disappears at inference.

3. For multiple tasks: Store one base model + a tiny LoRA file per task. Swap LoRA modules to switch tasks. Each LoRA file is ~1-10 MB instead of 350 GB.

Low-Rank Decomposition Visualizer

See how a d×d weight update ΔW is decomposed into B (d×r) times A (r×d). Drag the rank slider to see how few parameters are needed. At rank 8 with d=12288, the compression ratio is 764x.

Rank r 8

Why does LoRA add zero inference latency, unlike adapter modules?

Because at inference time, the LoRA matrices B and A can be merged into the original weight: W' = W + B·A. The merged weight has the exact same shape as the original, so the forward pass is identical — no extra layers, no extra computation Because LoRA uses smaller matrices that compute faster Because LoRA is applied before training, not during inference

Chapter 2: Math of LoRA

Let's derive LoRA's parameter savings precisely and understand the scaling factor.

Parameter count

For a single weight matrix W ∈ R^d×d:

Full ΔW: d × d = d² parameters

LoRA (B·A): d × r + r × d = 2dr parameters

Compression ratio = d² / (2dr) = d / (2r)

For GPT-3's attention weights (d = 12288, r = 8):

python
d = 12288     # hidden dim
r = 8         # LoRA rank

full_params = d * d                    # 150,994,944
lora_params = 2 * d * r               # 196,608
ratio = full_params / lora_params     # 768x compression

# Applied to all attention Wq and Wv across 96 layers:
total_lora = 2 * lora_params * 96     # 37.7M params
total_full = 175e9                   # 175B params
print(f"LoRA trains {total_lora/total_full*100:.4f}% of params")
# LoRA trains 0.0215% of parameters → ~4650x reduction

The scaling factor α

LoRA introduces a hyperparameter α (alpha) that controls the magnitude of the update:

h = W · x + (α / r) · B · A · x

Why α/r and not just α?

When you change r, the magnitude of B·A changes (more columns in A means a larger product). Dividing by r normalizes this: the update magnitude stays roughly constant regardless of r. This means you can tune α once and then change r freely without retuning the learning rate.

Practical rule of thumb: Set α = 2r (or just α = 16 for r = 8). The scaling factor α/r = 2 works well across most tasks. Higher α makes the LoRA update dominate (more task-specific), lower α keeps the model closer to its pre-trained behavior.

Initialization

LoRA initializes B to zeros and A from a Gaussian. This means at initialization, ΔW = B·A = 0 — the model starts exactly as the pre-trained model. Training only moves the weights away from the pre-trained solution as needed.

A ~ N(0, σ²), B = 0 ⇒ ΔW = B · A = 0 at init

This is crucial for stability: the model doesn't "forget" its pre-trained knowledge at the start of fine-tuning. It only learns the minimum necessary adaptation.

Gradient computation

Only A and B receive gradients. The memory savings are substantial:

∇_AL = B^T · (∇_hL) · x^T, shape [r, d]

∇_BL = (∇_hL) · x^T · A^T, shape [d, r]

python
# Memory comparison for Adam optimizer (2 states per param)
d, r = 12288, 8
n_layers = 96
n_matrices = 2  # Wq and Wv

# Full fine-tuning: Adam stores m and v for each param
full_adam = 175e9 * 2 * 4   # 2 states × 4 bytes = 1.4 TB

# LoRA: Adam only for A and B
lora_adam = n_layers * n_matrices * 2 * d * r * 2 * 4
print(f"Full Adam: {full_adam/1e9:.0f} GB")   # 1400 GB
print(f"LoRA Adam: {lora_adam/1e6:.0f} MB")   # 302 MB

Parameter & Memory Calculator

Drag sliders to change model size and LoRA rank. The calculator shows exact parameter counts and memory savings for both training and inference.

d_model 4096

rank r 8

Why is B initialized to zeros and A initialized from a Gaussian?

So that ΔW = B·A = 0 at initialization, meaning the model starts exactly as the pre-trained model and only learns the minimum necessary adaptation during fine-tuning. This prevents catastrophic forgetting of pre-trained knowledge To make the training faster by starting with smaller gradients Because zero initialization is always best for neural networks

Chapter 3: Where to Apply

A Transformer has four types of weight matrices per attention layer: W_q, W_k, W_v, W_o (query, key, value, output projections). Plus the FFN's up and down projections. Where should you put LoRA?

Hu et al. ran ablations on GPT-3 175B:

LoRA Applied To	Trainable Params	WikiSQL Acc	MNLI Acc
W_q only	4.7M	70.4%	91.0%
W_v only	4.7M	73.0%	91.2%
W_q + W_v	9.4M	73.4%	91.5%
W_q + W_k + W_v + W_o	18.8M	73.7%	91.4%

W_q + W_v is the sweet spot. Adding LoRA to both query and value projections performs nearly as well as adding it to all four attention matrices, while using half the parameters. The key and output projections contribute relatively little. This makes sense: queries encode "what am I looking for?" and values encode "what do I contain?" — these are the most task-dependent components of attention.

Why not FFN layers?

The original LoRA paper focused on attention layers and didn't apply LoRA to FFN. Later work (QLoRA, LoRA+) found that including FFN gives small additional gains, especially for knowledge-intensive tasks. The FFN stores factual knowledge, so adapting it helps when the target task requires domain-specific facts not in the pre-training data.

One rank for all layers?

Hu et al. used the same rank r for all layers, but subsequent work showed that different layers may benefit from different ranks. Lower layers (closer to input) tend to capture general linguistic features and need less adaptation. Upper layers (closer to output) are more task-specific and benefit from higher ranks.

python
# Practical LoRA setup for a Transformer
def add_lora(model, r=8, alpha=16, target_modules=['q_proj', 'v_proj']):
    """Add LoRA to specified attention projections."""
    for name, module in model.named_modules():
        if any(t in name for t in target_modules):
            if isinstance(module, nn.Linear):
                lora_module = LoRALinear(module, r=r, alpha=alpha)
                # Replace the original module
                parent = get_parent(model, name)
                setattr(parent, name.split('.')[-1], lora_module)

    # Freeze everything except LoRA parameters
    for n, p in model.named_parameters():
        p.requires_grad = 'lora' in n  # only LoRA params train

Where to Apply LoRA

Click on different Transformer components to toggle LoRA on/off. The right side shows trainable parameter count and estimated accuracy. Notice that Wq + Wv gives the best accuracy-per-parameter ratio.

Why is applying LoRA to Wq + Wv the recommended default?

Query and value projections are the most task-dependent attention components — queries encode "what to look for" and values encode "what information to provide." Adapting these two achieves nearly the same accuracy as adapting all four projections with half the parameters Because Wq and Wv are the largest matrices Because the other matrices are not used during inference

Chapter 4: Rank Selection

The rank r is LoRA's most important hyperparameter. It controls the expressiveness of the adaptation — higher rank means more capacity to learn, but also more parameters and compute.

Hu et al. found that surprisingly low ranks work well:

Rank	Params (Wq+Wv, 96 layers)	WikiSQL	MNLI
r = 1	2.4M	68.8%	90.7%
r = 4	4.7M	72.6%	91.3%
r = 8	9.4M	73.4%	91.5%
r = 64	75.5M	73.7%	91.3%
Full FT	175,000M	73.8%	91.7%

Rank 8 captures most of the adaptation. Going from r=8 to r=64 (8x more parameters) barely improves accuracy. And r=8 nearly matches full fine-tuning (175B parameters) with only 9.4M trainable parameters — a 18,600x reduction. This confirms that fine-tuning operates in an extremely low-dimensional subspace.

Why does low rank work?

The weight update ΔW during fine-tuning is approximately low-rank because:

1. Task similarity: Most downstream tasks are related to the pre-training distribution (both involve language). The adaptation is a small perturbation, not a complete rewrite. Small perturbations tend to be low-rank.

2. Intrinsic dimensionality: Aghajanyan et al. showed that the intrinsic dimensionality of fine-tuning for GPT-3 is only ~200-1000 across all layers combined. Distributed across 192 matrices (2 per layer × 96 layers), that's only ~1-5 effective dimensions per matrix — consistent with r ≈ 4-8.

3. Redundancy in attention: Many attention heads learn similar patterns. The adaptation can often be expressed as changing a few "modes" of attention behavior, not independently modifying every head.

Rank selection guidelines

Task Type	Recommended Rank	Reasoning
Classification (sentiment, NLI)	r = 4-8	Simple output space, small adaptation
Summarization, translation	r = 8-16	Moderate complexity
Code generation, math	r = 16-32	Requires learning new "skills"
Domain transfer (medical, legal)	r = 32-64	Significant distribution shift

Rank vs Performance

Drag the rank slider to see how accuracy changes. The curve shows diminishing returns — most of the benefit comes from the first few ranks. The cost bar shows trainable parameter count. Find the sweet spot where accuracy plateaus.

Rank r 8

Why does LoRA with rank r=8 nearly match full fine-tuning (175B params) with only 9.4M parameters?

Because fine-tuning operates in an extremely low-dimensional subspace — the weight update ΔW is approximately low-rank since downstream tasks are related to pre-training, requiring only small perturbations rather than complete rewrites of the weight matrices Because most of GPT-3's parameters are unused Because LoRA uses a more powerful optimizer than full fine-tuning

Chapter 5: Results & Comparisons

How does LoRA compare to other parameter-efficient methods and full fine-tuning? Hu et al. ran comprehensive comparisons on GPT-3 175B.

Method	Trainable Params	WikiSQL	MNLI-m	SAMSum	Inference Latency
Full Fine-Tuning	175,000M	73.8%	91.7%	52.0	Baseline
Adapter (Houlsby)	7.1M	71.9%	90.5%	53.2	+20-30%
Prefix Tuning	0.77M	63.1%	88.6%	51.4	Baseline
LoRA (r=8)	4.7M	73.4%	91.5%	53.8	Baseline

LoRA wins on every axis. Compared to Adapters: similar accuracy but zero inference latency overhead. Compared to Prefix Tuning: much higher accuracy with only 6x more parameters. Compared to Full Fine-Tuning: 99.997% fewer parameters with nearly identical accuracy. And unlike all other methods, LoRA adds nothing to inference — the matrices merge into the base weights.

Training efficiency

Beyond parameter count, LoRA saves substantial compute during training:

Metric	Full FT	LoRA	Savings
Trainable params	175B	4.7M	37,000x
GPU memory (training)	~1.2 TB	~350 GB	~3.4x
Checkpoint size	350 GB	~35 MB	10,000x
Task switching	Load full model	Swap LoRA file	Instant

Note: GPU memory savings are "only" 3.4x (not 37,000x) because you still need to load the full model weights for the forward pass. The savings come from not storing optimizer states and gradients for the frozen weights.

RoBERTa results

LoRA also works well on encoder models. On RoBERTa-Base (125M params) with the GLUE benchmark:

Method	Params	MRPC	SST-2	QNLI
Full FT	125M	90.2	94.8	92.8
Adapters	0.3M	89.5	94.0	91.5
LoRA (r=8)	0.3M	90.0	95.1	93.3

On RoBERTa, LoRA actually exceeds full fine-tuning on some tasks (SST-2: 95.1 vs 94.8). This suggests that the low-rank constraint acts as a beneficial regularizer, preventing overfitting on small datasets.

Method Comparison Dashboard

Compare LoRA against other methods on three axes: accuracy, parameter count, and inference overhead. Click the method names to highlight them. LoRA is the only method in the top-right corner (high accuracy, low params, no latency overhead).

What is LoRA's key advantage over Adapter modules?

LoRA adds zero inference latency because the low-rank matrices can be merged into the original weights (W' = W + BA), while Adapters permanently add extra layers that must be computed during every forward pass, increasing latency by 20-30% LoRA uses fewer total parameters LoRA is easier to implement

Chapter 6: LoRA Explorer

Time to see LoRA in action. This explorer lets you configure LoRA parameters and watch how they affect the weight update decomposition, parameter efficiency, and task performance.

LoRA Configuration Explorer

Configure your LoRA setup: choose model size, rank, alpha, and which layers to target. The visualization shows the decomposition for one weight matrix, parameter counts, and estimated accuracy. Watch how the B·A product approximates the full weight update as rank increases.

d_model 4096

Rank r 8

Alpha α 16

Multi-Task LoRA Swapping

See how one base model serves multiple tasks by swapping LoRA modules. Click a task to "load" its LoRA. The base model (350 GB) stays in memory; only the tiny LoRA file (~35 MB) changes. This is how production systems serve dozens of specialized models from one GPU.

The practical impact: LoRA made fine-tuning accessible to everyone. Before LoRA, fine-tuning GPT-3 required a massive GPU cluster. After LoRA, you could fine-tune a 7B model on a single consumer GPU with 16 GB of VRAM. Combined with QLoRA (quantized LoRA), even a 65B model can be fine-tuned on a single 48 GB GPU.

How does a production system serve 50 specialized tasks from a single GPU using LoRA?

Keep one copy of the base model in GPU memory (~14 GB for 7B in fp16). Store 50 LoRA files (~35 MB each, 1.75 GB total) on disk. When a request comes in, swap the appropriate LoRA by merging W' = W + B·A — this takes milliseconds and adds zero inference latency Load all 50 fine-tuned models into memory simultaneously Use a single model without any fine-tuning

Chapter 7: Connections

LoRA sits at the intersection of parameter efficiency, matrix theory, and practical ML engineering. Its connections reveal both its intellectual roots and its enormous impact.

What came before

Work	Contribution	Relationship to LoRA
Adapter Modules (Houlsby 2019)	First major PEFT method	LoRA avoids adapter's inference overhead
Intrinsic Dimensionality (Aghajanyan 2021)	Fine-tuning is low-dimensional	Theoretical justification for low-rank ΔW
Lottery Ticket Hypothesis (Frankle 2019)	Dense networks contain sparse subnetworks	Both exploit redundancy; LTH finds sparse masks, LoRA finds low-rank subspaces

What came after

Work	How It Extended LoRA
QLoRA (Dettmers 2023)	Quantize base model to 4-bit, apply LoRA on quantized weights → fine-tune 65B on single GPU
LoRA+ (Hayou 2024)	Different learning rates for A and B → better convergence
DoRA (Liu 2024)	Decompose W into magnitude and direction, apply LoRA to direction only
rsLoRA (Kalajdzievski 2024)	Fix scaling: use α/√r instead of α/r for better rank-independence
Hugging Face PEFT	Standard library making LoRA a one-line operation

The bigger picture

Before LoRA (2019-2021):

Fine-tuning = update all weights.

One GPU cluster per task.

Serving = separate model per task.

After LoRA (2022+):

Fine-tuning = train tiny adapters.

One consumer GPU per task.

Serving = one base + swap adapters.

The LoRA legacy: LoRA democratized fine-tuning. By reducing the cost from "GPU cluster" to "consumer GPU," it enabled researchers, startups, and hobbyists to customize foundation models for specific tasks. The explosion of open-source fine-tuned models on Hugging Face — LLaMA-LoRA, Alpaca-LoRA, CodeLlama-LoRA — is a direct consequence of LoRA making fine-tuning accessible.

"Simplicity is the ultimate sophistication." — Leonardo da Vinci (and LoRA is as simple as W + BA)

What is the most significant practical impact of LoRA?

It democratized fine-tuning — reducing the cost from requiring a GPU cluster (full fine-tuning) to a single consumer GPU (LoRA), enabling researchers, startups, and hobbyists to customize foundation models for specific tasks It proved that large language models have too many parameters It replaced the Transformer architecture with something more efficient

LoRA: Low-Rank Adaptation

Chapter 0: Why Efficient Tuning?

Chapter 1: The Low-Rank Idea

Chapter 2: Math of LoRA

Parameter count

The scaling factor α

Initialization

Gradient computation

Chapter 3: Where to Apply

Why not FFN layers?

One rank for all layers?

Chapter 4: Rank Selection

Why does low rank work?

Rank selection guidelines

Chapter 5: Results & Comparisons

Training efficiency

RoBERTa results

Chapter 6: LoRA Explorer

Chapter 7: Connections

What came before

What came after

The bigger picture