When low-rank adaptation matches full fine-tuning, why it works for RL with rank 1, and how four hyperparameters collapse to two — a complete teardown of the theory, experiments, and practical implications.
Every concept in this lesson and how they connect — the territory before the map.
This lesson unpacks a single blog post into 16 interconnected concepts across four topic clusters. The constellation below shows how they relate.
Every concept you will encounter, sorted by cluster.
$W' = W + \frac{\alpha}{r}BA$ decomposition. Ch 01.
Number of outer products in the decomposition. Ch 01.
Multiplicative factor controlling update magnitude. Ch 01.
Sum of rank-1 outer products spanning the update. Ch 01.
Applying adapters to every layer, not just attention. Ch 03.
LoRA degrades at large batch sizes before FullFT. Ch 02.
Optimal LoRA learning rate is ~10x FullFT. Ch 05.
Large datasets can exceed LoRA capacity. Ch 02.
Empirical neural tangent kernel for LoRA analysis. Ch 03.
O(1) bits of information per RL episode. Ch 04.
4 hyperparams but only 2 independent dimensions. Ch 06.
LoRA capacity measured in bits vs information needed. Ch 04.
Hot-swapping LoRA adapters at inference time. Ch 07.
Rank-1 suffices for reinforcement learning. Ch 04.
R1-Zero reproducible with LoRA alone. Ch 04.
LoRA uses ~2/3 the FLOPs of full fine-tuning. Ch 07.
This lesson is structured in layers. You can read it linearly or skip around.
If you already know the LoRA formulation, skip to Chapter 02. If you only care about the RL result, jump to Chapter 04.
You have a 70-billion-parameter model. You want to teach it a new skill. Updating all 70B parameters is expensive, slow, and wasteful.
Full fine-tuning means computing gradients for every single parameter, storing optimizer states (two extra copies for Adam), and saving a complete model checkpoint for each task. For a 70B model in mixed precision, that is roughly 420 GB of optimizer state alone.
Worse: if you want to serve ten different fine-tuned models, you need ten separate 140 GB checkpoints loaded on separate GPUs. The cost scales linearly with the number of tasks.
The core insight behind LoRA is that the weight updates during fine-tuning are typically low-rank. You are not rewriting the entire model; you are nudging it in a specific direction. If the update is low-rank, why store it as a full matrix?
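Given a pre-trained weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$, LoRA replaces it with:
$$W' = W + \frac{\alpha}{r} BA, \qquad B \in \mathbb{R}^{d_{out} \times r}, \quad A \in \mathbb{R}^{r \times d_{in}}$$
The product $BA$ has shape $d_{out} \times d_{in}$, the same as $W$, but is constrained to rank at most $r$. During the forward pass, the computation is:
$$h = W'x = Wx + \frac{\alpha}{r} B(Ax)$$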
The key: during training, only $B$ and $A$ receive gradients. The frozen $W$ contributes to the forward pass but never gets updated. After training, you can merge: $W_{deployed} = W + \frac{\alpha}{r}BA$ and pay zero additional inference cost.
The product $BA$ can be decomposed further. Since $B$ has $r$ columns $\mathbf{b}_i$ and $A$ has $r$ rows $\mathbf{a}_i^T$, we can write:
$$BA = \sum_{i=1}^{r} \mathbf{b}_i \mathbf{a}_i^T$$
Each $\mathbf{b}_i \mathbf{a}_i^T$ is a rank-1 matrix (an outer product of a column of $B$ with a row of $A$). The full LoRA update is a sum of $r$ such rank-1 corrections. Think of each one as a single “direction of change” applied to the weight matrix.
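The formulation above can be checked numerically. The following is a minimal NumPy sketch, assuming small toy dimensions and random matrices (in real LoRA training $B$ starts at zero; it is random here only so the checks are non-trivial). It compares the adapter-branch forward pass with the merged-weight forward pass and rebuilds $BA$ as a sum of rank-1 outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0      # toy sizes, chosen for illustration

W = rng.standard_normal((d_out, d_in))    # frozen pre-trained weight
B = rng.standard_normal((d_out, r))       # up-projection (random only for this demo)
A = rng.standard_normal((r, d_in))        # down-projection
x = rng.standard_normal(d_in)

scale = alpha / r

# Adapter forward pass: never materializes the d_out x d_in update.
h_adapter = W @ x + scale * (B @ (A @ x))

# Merged forward pass: fold the update into W once, then use a single matmul.
W_merged = W + scale * (B @ A)
h_merged = W_merged @ x

# BA as a sum of r rank-1 outer products (column i of B with row i of A).
BA_outer = sum(np.outer(B[:, i], A[i, :]) for i in range(r))

update_rank = np.linalg.matrix_rank(B @ A)
```

Both forward passes give the same output, which is why merging after training adds no inference cost, and the rank of the update never exceeds `r`.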
The Singular Value Decomposition (SVD) of any matrix $M$ gives us $M = U \Sigma V^T$, where $\Sigma$ contains singular values in decreasing order. The best rank-$r$ approximation of $M$ (in both Frobenius and spectral norm, by the Eckart-Young theorem) is obtained by keeping only the top $r$ singular values and their singular vectors.
Empirically, weight updates during fine-tuning have a spectrum that decays rapidly. The top few singular values capture most of the “information” in the update. When you fine-tune a language model on a specific task, the model does not need to rearrange all of its internal representations — it needs to make a focused adjustment in a low-dimensional subspace.
The canvas above shows a simulated weight update matrix being approximated by increasing rank. Notice how quickly the reconstruction error drops: by rank 4, we already capture the dominant structure. The remaining singular values contribute fine-grained noise that barely affects downstream task performance.
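A comparable simulation can be sketched in a few lines of NumPy. The matrix below is synthetic: its singular values are assumed to decay as $2^{-i}$, which is an illustrative choice and not a measured fine-tuning spectrum. The helper `truncate` is a hypothetical name for the truncated-SVD approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Simulated update with a fast-decaying spectrum: sigma_i = 2^{-i} (an assumption).
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 2.0 ** -np.arange(n)
M = (U * sigma) @ V.T                      # U @ diag(sigma) @ V.T

def truncate(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    u, s, vt = np.linalg.svd(M, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

ranks = [1, 2, 4, 8, 16]
rel_err = {r: np.linalg.norm(M - truncate(M, r)) / np.linalg.norm(M) for r in ranks}
```

With this particular spectrum the relative Frobenius error at rank $r$ is very close to $2^{-r}$, so rank 4 already leaves only about 6% of the norm unexplained. Real update spectra decay more slowly or more quickly depending on the task.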
This is not just a convenient trick. It reflects something deep about fine-tuning: the pre-trained model already represents the world well. Fine-tuning is about steering, not rebuilding.
Beyond parameter efficiency, LoRA enables three capabilities that full fine-tuning cannot:
One base model in GPU memory, multiple LoRA adapters hot-swapped per request. A 70B model with 100 LoRA adapters costs ~140 GB (base) + 100 x 0.1 GB (adapters) = 150 GB. Full fine-tuning would cost 100 x 140 GB = 14 TB.
Optimizer states only needed for the low-rank matrices. For rank 16 on a 70B model, Adam states go from ~420 GB to ~1 GB. The frozen base model is stored in half-precision without optimizer overhead.
LoRA adapters can be added, removed, or combined algebraically. Train one adapter for coding, another for medical knowledge, merge them: $W' = W + \frac{\alpha_1}{r_1}B_1A_1 + \frac{\alpha_2}{r_2}B_2A_2$.
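The algebra of adding and removing adapters can be sketched as follows. This is a toy NumPy illustration with random matrices; `make_adapter` and `delta` are hypothetical helpers, and the adapter names are placeholders. It shows only that the arithmetic is reversible, not that a merged model performs well on both tasks, which remains an empirical question.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 6, 8
W = rng.standard_normal((d_out, d_in))     # shared frozen base weight

def make_adapter(r, alpha):
    """Hypothetical helper: a random rank-r adapter as (scale, B, A)."""
    return alpha / r, rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))

def delta(adapter):
    """Dense weight update contributed by one adapter."""
    scale, B, A = adapter
    return scale * (B @ A)

coding = make_adapter(r=2, alpha=4.0)
medical = make_adapter(r=1, alpha=1.0)

W_both = W + delta(coding) + delta(medical)   # merge both adapters
W_back = W_both - delta(medical)              # remove one of them again

combined_rank = np.linalg.matrix_rank(delta(coding) + delta(medical))
```

The combined update has rank at most $r_1 + r_2$, so merging adapters keeps the total update low-rank.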
The conventional wisdom was that these advantages come at a performance cost — LoRA trades quality for efficiency. The Schulman blog post challenges this directly: with the right setup, LoRA matches full fine-tuning with zero quality loss.
The claims require rigorous evidence. Here is how Schulman set up the experiments.
The key experimental choices:
| Variable | Value | Why |
|---|---|---|
| Models | Llama-3 (8B, 70B), Qwen-3 (various) | Cover both dense and large architectures |
| Datasets | Tulu3, OpenThoughts3 | Small/medium (Tulu3) and large-scale (OpenThoughts3) |
| Ranks | 1, 4, 16, 64, 128, 256, 512 | Sweep from minimal to near-full capacity |
| Layers | All (attention + MLP + embedding) | Critical finding: all layers required |
| LR | Grid search per rank/model | Fair comparison to FullFT optimal LR |
| Optimizer | Adam with standard betas | Standard practice, no exotic tricks |
The critical methodological choice: LoRA is applied to ALL linear layers, not just the query/key/value projections in attention. This is the key difference from most LoRA papers in 2023-2024, which only adapted attention weights.
The headline result: on small and medium datasets, high-rank LoRA (applied to all layers) matches full fine-tuning loss curves exactly. Not “close to” — exactly, within noise.
The canvas shows log-loss vs training steps for different configurations. At small dataset sizes (Tulu3), even rank 16 closely tracks full fine-tuning. At medium scale, rank 128 is needed. The curves diverge only when the dataset is large enough to require more capacity than the LoRA matrices can encode.
Schulman uses two metrics to define matching:
The implication: if you can get the training loss to match, everything else follows. The model learned the same function.
LoRA is not magic. It has a hard capacity ceiling. The product $BA$ can only represent matrices of rank at most $r$, a thin slice of the full parameter space. When the dataset requires updates that span more than $r$ dimensions, LoRA underfits.
The practical threshold depends on the ratio of dataset information content to LoRA capacity. Schulman finds:
| Dataset | Tokens | Minimum rank for match | Note |
|---|---|---|---|
| Tulu3 (small) | ~1B | 16 | Easy to fit |
| OpenThoughts3 (medium) | ~10B | 128 | Needs high rank |
| Synthetic large | ~100B+ | 512+ | LoRA underfits at any practical rank |
A surprising finding: LoRA is more sensitive to batch size than full fine-tuning. At very large batch sizes (8K+ sequences), LoRA performance degrades while FullFT holds steady.
The hypothesized mechanism: large batches produce low-variance gradient estimates that make very precise updates. Full fine-tuning can implement these precise updates in any direction. LoRA is constrained to its rank-$r$ subspace and cannot represent the precise update direction when it is nearly orthogonal to the current $B$ and $A$ columns.
Practically, this means: if you are training with large batches (common in distributed training), you may need to increase rank or reduce batch size when using LoRA. The sweet spot for LoRA is moderate batch sizes (256-2048 sequences).
The most common LoRA setup in practice — adapting only Q, K, V projections — is systematically suboptimal.
When Schulman compared three configurations at the same total parameter budget:
| Configuration | Adapted params | Final loss vs FullFT |
|---|---|---|
| Attention-only | 33% of model | +0.08 nats (worse) |
| MLP-only | 67% of model | +0.02 nats (almost matches) |
| All layers | 100% of model | +0.00 nats (matches) |
The counterintuitive result: MLP-only LoRA outperforms attention-only LoRA, even when attention-only uses more rank (to compensate for fewer adapted matrices). The number of adapted layers matters more than the rank per layer.
Why do MLP layers matter more? Consider the parameter breakdown of a standard transformer layer:
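In a Llama-style layer with hidden dimension $d=4096$ and MLP intermediate dimension $d_{ff}=11008$ (the Llama-2-7B sizes with full multi-head attention; Llama-3-8B uses $d_{ff}=14336$ and grouped-query attention, which tilts the balance further toward the MLP):
$$\text{Attention (Q, K, V, O): } 4d^2 \approx 67\text{M} \qquad \text{MLP (gate, up, down): } 3\,d\,d_{ff} \approx 135\text{M}$$
The MLP has about twice as many parameters as attention in each layer. When you apply LoRA to attention only, you are leaving the larger component entirely frozen.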
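The split between attention and MLP parameters can be reproduced with a short calculation. This sketch assumes full multi-head attention (four $d \times d$ projections) and a gated MLP with three $d \times d_{ff}$ matrices, and it ignores norms and biases; `layer_param_counts` is a hypothetical helper name.

```python
def layer_param_counts(d, d_ff):
    """Parameter counts for one decoder layer, ignoring norms and biases.

    Assumes full multi-head attention (Q, K, V, O are each d x d) and a
    gated MLP with three d x d_ff matrices (gate, up, down).
    """
    attention = 4 * d * d
    mlp = 3 * d * d_ff
    return attention, mlp

attention, mlp = layer_param_counts(d=4096, d_ff=11008)
mlp_fraction = mlp / (attention + mlp)    # share of layer parameters in the MLP
ratio = mlp / attention                   # MLP size relative to attention
```

With these sizes the MLP holds roughly two thirds of the layer's parameters, which lines up with the 33% / 67% split in the configuration table.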
In Mixture-of-Experts models (like DeepSeek or Mixtral), the expert MLP layers are massive. Each expert is a full MLP, and there may be 8-64 experts per layer. The attention component becomes a tiny fraction of total parameters. Adapting only attention in an MoE model leaves 90%+ of parameters frozen.
Schulman provides a theoretical grounding via the empirical Neural Tangent Kernel (eNTK). The eNTK of a network at initialization determines what functions the network can learn in the lazy/linear regime (small updates relative to initialization).
The key insight: the eNTK of a LoRA-adapted model converges to the eNTK of the full model when LoRA is applied to all layers with sufficient rank. Formally:
If you skip a layer, the corresponding Jacobian term $J_\ell$ is zeroed out entirely. The eNTK then keeps only part of the full kernel's sum over layers, losing expressiveness in the directions that layer controls. This is why attention-only LoRA underperforms: you are zeroing out the Jacobian contribution from 67% of parameters.
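The effect of dropping a layer's Jacobian can be illustrated on a toy network. This sketch uses the standard definition of the empirical NTK as the Gram matrix of parameter Jacobians, applied to a small two-layer network $f(x) = w_2^T \tanh(W_1 x)$. The network, sizes, and names are assumptions for illustration; this is not the derivation from the blog post.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 5, 7, 4                          # input dim, hidden dim, number of inputs
W1 = rng.standard_normal((h, d)) / np.sqrt(d)
w2 = rng.standard_normal(h) / np.sqrt(h)
X = rng.standard_normal((n, d))

def layer_jacobians(x):
    """Gradients of f(x) = w2 . tanh(W1 x) with respect to each layer."""
    a = np.tanh(W1 @ x)
    j1 = np.outer(w2 * (1 - a ** 2), x).ravel()   # df/dW1, flattened
    j2 = a                                        # df/dw2
    return j1, j2

J1 = np.stack([layer_jacobians(x)[0] for x in X])  # (n, h*d)
J2 = np.stack([layer_jacobians(x)[1] for x in X])  # (n, h)

K1 = J1 @ J1.T            # kernel contribution of layer 1
K2 = J2 @ J2.T            # kernel contribution of layer 2
K_full = K1 + K2          # eNTK when both layers are trainable
K_layer1_only = K1        # eNTK when layer 2 is frozen
```

Freezing layer 2 removes `K2` from the sum. Because `K2` is positive semi-definite, the reduced kernel can never gain expressiveness in any direction; it can only lose it.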
Putting it all together, the argument proceeds in four steps:
1. When every layer is adapted with sufficient rank, the linearized model (first-order Taylor expansion around the pre-trained weights) can express the same function updates. (eNTK convergence theorem)
2. In the lazy regime (typical for fine-tuning where updates are small relative to pretrained scale), the eNTK determines the training dynamics. Same kernel = same trajectory. (Neural tangent kernel theory)
3. If the optimization follows the same trajectory (within noise), it reaches the same minimum. (Convexity of the linearized objective)
4. Empirically verified: when training loss matches, benchmark scores are indistinguishable. (Experimental validation across 14 models)
The chain is: LoRA(all) ≈ eNTK(LoRA) ≈ eNTK(Full) ≈ FullFT. Each “≈” is justified by a different argument (coverage, lazy regime, optimization, empirics).
The most striking finding in the entire blog post: RL fine-tuning with rank-1 LoRA matches full fine-tuning.
Read that again. Rank one. A single outer product $\mathbf{b}\mathbf{a}^T$ per adapted matrix. For a 4096-dimensional model, that is $2 \times 4096 = 8{,}192$ parameters per square weight matrix, or about 260K total trainable parameters for a 32-layer model when counting one such matrix per layer. Full fine-tuning updates 7+ billion.
And yet: the reward curves are indistinguishable. The final policy performs identically on evaluation benchmarks. The model learns the same behaviors.
How is this possible? The answer comes from information theory.
In RL, the model generates a response (an episode), receives a scalar reward, and updates. How much information does one reward signal carry?
The intuition: a reward is essentially a single number (often binary: correct/incorrect, or a scalar in a bounded range). It tells you “this response was good” or “this response was bad.” That is inherently low-information. You cannot learn a complex high-dimensional update from a single bit.
In supervised fine-tuning, each training example provides the entire correct response, token by token. A 500-token response carries roughly $500 \times \log_2(\text{vocab\_size}) \approx 500 \times 17 \approx 8500$ bits of information. That is 8500x more than one RL reward signal.
SFT needs capacity to store all these specific responses. RL only needs capacity to shift policy slightly based on sparse feedback. The update per episode is minuscule in the parameter space.
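The per-example comparison can be written out as a short calculation. This sketch assumes a vocabulary of 128,256 tokens (a Llama-3-sized vocabulary, chosen here for illustration) and treats $\log_2(\text{vocab\_size})$ per token as an upper bound, since real text is far more predictable than uniform sampling. `sft_bits_upper_bound` is a hypothetical helper name.

```python
import math

def sft_bits_upper_bound(num_tokens, vocab_size):
    """Upper bound on the bits specified by one fully supervised response."""
    return num_tokens * math.log2(vocab_size)

vocab_size = 128_256                      # assumption: Llama-3-sized vocabulary
sft_bits = sft_bits_upper_bound(500, vocab_size)
rl_bits = 1.0                             # one binary reward per episode
ratio = sft_bits / rl_bits                # SFT signal relative to one RL reward
```

The exact figure depends on the vocabulary size, but the ratio stays in the thousands for any realistic tokenizer.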
The canvas above shows the comparison: total information from RL training (episodes times bits-per-episode) versus LoRA adapter capacity (parameters times bits-per-parameter). Even rank-1 LoRA has far more capacity than the RL signal can fill.
Let us do the math explicitly for DeepSeek-R1-Zero:
Now the LoRA capacity for rank-1 on a model with hidden dimension $d=7168$ and 61 layers:
At 16-bit precision, that is $6.1\text{M} \times 16 = 97.6\text{M bits}$ of capacity. Even at the conservative estimate of 2 bits of useful information per parameter (accounting for noise), we have $6.1\text{M} \times 2 = 12.2\text{M bits}$.
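The capacity budget can be reproduced with a few lines of arithmetic. This sketch rests on two assumptions that are not stated in the text: each layer has 7 adapted matrices, and each is treated as $d \times d$, so a rank-1 adapter costs $2d$ parameters per matrix. That choice is made only because it reproduces the ~6.1M figure; the real DeepSeek-V3 architecture is a Mixture-of-Experts model with differently shaped projections. The episode count is the ~5.3 million figure this lesson uses for DeepSeek-R1-Zero.

```python
def rank1_lora_params(d, num_layers, matrices_per_layer):
    """Trainable parameters for rank-1 LoRA, treating each matrix as d x d."""
    return num_layers * matrices_per_layer * 2 * d

# Assumption: 7 adapted matrices per layer (e.g. Q, K, V, O, gate, up, down).
params = rank1_lora_params(d=7168, num_layers=61, matrices_per_layer=7)

capacity_16bit = params * 16              # raw storage capacity in bits
capacity_2bit = params * 2                # conservative "useful bits" estimate
rl_signal_bits = 5_300_000 * 1            # ~5.3M episodes at ~1 bit each
headroom = capacity_2bit / rl_signal_bits
```

Even under the conservative 2-bits-per-parameter estimate, the adapter has more than twice the capacity the RL signal could fill.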
This is a necessary condition argument, not a sufficient one. Having enough capacity does not guarantee the optimizer can find the right solution. But Schulman shows empirically that Adam does find it — the optimization landscape for RL updates is smooth enough that even rank-1 constraints do not create problematic saddle points or local minima.
The practical punchline: DeepSeek-R1-Zero (the model that learns to reason through RL alone, no SFT) could be replicated with rank-1 LoRA.
DeepSeek-R1-Zero was trained with pure RL (GRPO) for ~5.3 million episodes. Based on the information-theoretic argument above, at roughly one bit per episode the total information content of the training signal is bounded by about 5.3M bits. A rank-1 LoRA adapter on the DeepSeek-V3 architecture has more than enough capacity.
The implications for the field:
Across 14 different models, the optimal learning rate for LoRA is consistently about 10x larger than for full fine-tuning.
This is not a vague guideline. Schulman reports a systematic relationship: when you grid-search the optimal LR for both LoRA and FullFT independently, the LoRA optimum lands at approximately 10x the FullFT optimum. The ratio is remarkably stable across model sizes, architectures, and datasets.
| Model | FullFT optimal LR | LoRA optimal LR | Ratio |
|---|---|---|---|
| Llama-3-8B | 2e-5 | 2e-4 | 10x |
| Llama-3-70B | 5e-6 | 5e-5 | 10x |
| Qwen-3-7B | 3e-5 | 3.5e-4 | ~12x |
| Qwen-3-72B | 8e-6 | 7e-5 | ~9x |
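The ratios in the table can be recomputed directly from the two learning-rate columns. The sketch below copies the table values as given and summarizes them with a geometric mean, which is a natural average for multiplicative ratios; the variable names are illustrative.

```python
import math

# (FullFT optimal LR, LoRA optimal LR), copied from the table above.
optimal_lrs = {
    "Llama-3-8B": (2e-5, 2e-4),
    "Llama-3-70B": (5e-6, 5e-5),
    "Qwen-3-7B": (3e-5, 3.5e-4),
    "Qwen-3-72B": (8e-6, 7e-5),
}

ratios = {name: lora / full for name, (full, lora) in optimal_lrs.items()}

# Geometric mean of the per-model ratios.
geo_mean = math.exp(sum(math.log(v) for v in ratios.values()) / len(ratios))
```

The per-model ratios range from roughly 9x to 12x, and their geometric mean sits very close to 10x.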
There is no complete theoretical explanation yet. Schulman is explicit about this. But there are plausible partial explanations:
In LoRA, the gradient of the loss with respect to $B$ involves the product with $A$ (and vice versa). Since $B$ starts at zero, the gradient on $A$ is exactly zero at the first step and stays small while $B$ is small (the signal is “diluted” through the near-zero $B$). A higher LR compensates for this slow start.
LoRA constrains updates to a rank-$r$ subspace. Within that subspace, the effective step size needs to match what FullFT achieves in the full space. Since FullFT distributes its LR across $N$ parameters but LoRA concentrates it on $r \times (d_{in} + d_{out})$ parameters, the per-direction step is already smaller for LoRA at the same nominal LR.
The $\alpha/r$ factor in the forward pass effectively scales down the LoRA contribution. The LR must be higher to compensate for this built-in dampening. When $\alpha = r$ (a common choice), this factor is 1 and the ratio would be lower — consistent with some experimental observations.
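The first of these explanations, the zero-$B$ startup effect, is easy to see numerically. This is a minimal sketch on a single toy layer with a squared-error loss; the sizes, the loss, and the plain gradient step on $B$ are assumptions made for illustration (the experiments discussed here use Adam).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 8, 2, 2.0
scale = alpha / r

W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)
B = np.zeros((d_out, r))                      # standard LoRA init
x = rng.standard_normal(d_in)
target = rng.standard_normal(d_out)

def lora_grads(A, B):
    """Gradients of 0.5 * ||h - target||^2 with h = W x + scale * B A x."""
    h = W @ x + scale * (B @ (A @ x))
    g = h - target                            # dL/dh
    grad_B = scale * np.outer(g, A @ x)       # depends on A, not on B
    grad_A = scale * np.outer(B.T @ g, x)     # vanishes when B == 0
    return grad_A, grad_B

grad_A0, grad_B0 = lora_grads(A, B)

# After one plain gradient step on B, A starts receiving signal.
B1 = B - 0.1 * grad_B0
grad_A1, _ = lora_grads(A, B1)
```

At initialization only $B$ receives a gradient; $A$ begins to learn only once $B$ has moved away from zero.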
Schulman provides an empirical formula for predicting the optimal LoRA learning rate given the model’s hidden dimension:
This formula captures the empirical observation that larger models (larger $d_{\text{hidden}}$) need smaller learning rates, with the scaling following a power law rather than linear relationship.
For practitioners, the learning rate findings translate to concrete advice:
1. If you have a validated FullFT learning rate for your model, multiply by 10 for LoRA. This is your starting point, not a fixed rule. (Empirical finding across 14 models)
2. If your training is only a few hundred steps (common for RL), the B-zero initialization penalty is larger. Go higher. (B-zero startup hypothesis)
3. The 10x rule gets you within a factor of 2. A quick 3-point grid (7x, 10x, 15x) will nail the optimum. (Cheap insurance for final quality)
4. Larger models need proportionally smaller LRs. Use the power-law formula as a sanity check. (Power-law scaling)
LoRA has four tunable hyperparameters. The community treats them as four independent knobs. They are not.
The standard LoRA configuration exposes:
| Hyperparam | Symbol | Typical default | Controls |
|---|---|---|---|
| Alpha | $\alpha$ | Equal to rank | Scaling of the LoRA output |
| LR for A | $\eta_A$ | Same as LR_B | Learning rate for down-projection |
| LR for B | $\eta_B$ | Same as LR_A | Learning rate for up-projection |
| Init scale for A | $\sigma_A$ | Kaiming uniform | Initialization magnitude of A |
(We fix $B$ init to zero, which is standard. Some variants like LoRA+ and rsLoRA modify these differently.)
Most practitioners just set $\alpha = r$ and use the same LR for both matrices. But different papers recommend different choices, creating confusion about what “really matters.”
Schulman shows that for the Adam optimizer, these four hyperparameters have only two independent degrees of freedom. The reason: certain combinations produce mathematically identical training trajectories.
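The two independent quantities that actually control training are:
$$\text{Scale} = \alpha \cdot \sigma_A \cdot \eta_B, \qquad \text{Ratio} = \frac{\sigma_A}{\eta_A}$$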
Any combination of ($\alpha$, $\eta_A$, $\eta_B$, $\sigma_A$) that produces the same Scale and Ratio will yield identical training dynamics (up to floating-point noise).
Formally, the following transformation leaves the Adam training trajectory unchanged:
$$(\alpha,\ \sigma_A,\ \eta_A,\ \eta_B) \;\longmapsto\; \left(c\,\alpha,\ \tfrac{\sigma_A}{c},\ \tfrac{\eta_A}{c},\ \eta_B\right)$$
for any constant $c > 0$. You can verify: Scale = $\alpha \cdot \sigma_A \cdot \eta_B = (c\alpha)(\sigma_A/c)\eta_B$ = unchanged. Ratio = $\sigma_A / \eta_A = (\sigma_A/c)/(\eta_A/c)$ = unchanged.
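Similarly, there is a second invariance:
$$(\alpha,\ \sigma_A,\ \eta_A,\ \eta_B) \;\longmapsto\; \left(\alpha,\ k\,\sigma_A,\ k\,\eta_A,\ \tfrac{\eta_B}{k}\right)$$
for any constant $k > 0$. Check: Scale = $\alpha \cdot (k\sigma_A) \cdot (\eta_B/k)$ = unchanged. Ratio = $(k\sigma_A)/(k\eta_A)$ = unchanged.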
These two invariances reduce the 4D hyperparameter space to a 2D surface. Any point in 4D maps to a unique point in the 2D (Scale, Ratio) plane, and all points mapping to the same 2D location produce identical training.
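Both invariances can be checked numerically on a toy problem. The sketch below trains a single LoRA layer on random regression data with a hand-written Adam loop and compares the effective update $\alpha BA$ across configurations. The first rescaling is $(\alpha, \sigma_A, \eta_A, \eta_B) \to (c\alpha, \sigma_A/c, \eta_A/c, \eta_B)$ and the second is $(\alpha, \sigma_A, \eta_A, \eta_B) \to (\alpha, k\sigma_A, k\eta_A, \eta_B/k)$. Everything here is an assumption for illustration: the data, the sizes, the function name `train_lora`, and the use of `alpha` to stand for the whole multiplier $\alpha/r$ at fixed rank. Adam's `eps` is set very small because the invariance is exact only in the limit where `eps` is negligible.

```python
import numpy as np

def train_lora(alpha, sigma_a, lr_a, lr_b, steps=30, eps=1e-12):
    """Train a toy LoRA layer with Adam; return the effective update alpha * B @ A."""
    rng = np.random.default_rng(0)                # same data and init noise every call
    n, d_in, d_out, r = 32, 8, 6, 2
    X = rng.standard_normal((n, d_in))
    Y = rng.standard_normal((n, d_out))
    A = sigma_a * rng.standard_normal((r, d_in))  # init scale sigma_A
    B = np.zeros((d_out, r))                      # B starts at zero
    params, lrs = [A, B], [lr_a, lr_b]
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
    b1, b2 = 0.9, 0.999
    for t in range(1, steps + 1):
        Z = X @ params[0].T                       # (n, r) adapter activations
        G = (alpha * Z @ params[1].T - Y) / n     # dLoss/dPrediction
        grads = [alpha * params[1].T @ G.T @ X,   # dLoss/dA
                 alpha * G.T @ Z]                 # dLoss/dB
        for i, g in enumerate(grads):
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g * g
            m_hat = m[i] / (1 - b1 ** t)          # bias-corrected first moment
            v_hat = v[i] / (1 - b2 ** t)          # bias-corrected second moment
            params[i] = params[i] - lrs[i] * m_hat / (np.sqrt(v_hat) + eps)
    return alpha * params[1] @ params[0]

alpha, sigma_a, lr_a, lr_b = 1.0, 0.1, 1e-2, 1e-2
c, k = 4.0, 2.0

base = train_lora(alpha, sigma_a, lr_a, lr_b)
first = train_lora(c * alpha, sigma_a / c, lr_a / c, lr_b)      # first invariance
second = train_lora(alpha, k * sigma_a, k * lr_a, lr_b / k)     # second invariance
control = train_lora(c * alpha, sigma_a, lr_a, lr_b)            # only alpha changed
```

The `base`, `first`, and `second` runs land on the same effective update, while `control` (which changes Scale) does not. With SGD instead of Adam the argument would not carry over unchanged, since it relies on Adam normalizing away the gradient scale.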
Different LoRA variants from the literature simply correspond to different points in the 2D (Scale, Ratio) plane:
$\alpha = r$, same LR for A and B, Kaiming init for A. Scale = moderate, Ratio = 1. The default, but not necessarily optimal.
Uses different LRs for A and B ($\eta_B > \eta_A$). Maps to a different Ratio in our 2D space. Equivalent to standard LoRA with appropriate $\sigma_A$ adjustment.
Scales alpha by $\sqrt{r}$ instead of $r$. Moves to a different Scale point. Designed to keep magnitude stable as rank increases.
High alpha (2x rank), specific LR choices. Maps to yet another point in the plane. All are equivalent to some configuration of the other two methods.
The key takeaway: these are not different methods. They are different coordinates for the same 2D configuration space. If you match their Scale and Ratio values using any parameterization, you get the same result.
LoRA is not just parameter-efficient — it is compute-efficient too. Here is the exact accounting.
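Consider a single linear layer with weight $W \in \mathbb{R}^{N \times N}$ (square for simplicity). Counting multiply-adds per token and ignoring the comparatively small elementwise optimizer update, one training step of full fine-tuning costs:
$$\text{FLOPs}_{\text{FullFT}} = \underbrace{N^2}_{\text{forward } Wx} + \underbrace{N^2}_{\partial L/\partial x} + \underbrace{N^2}_{\partial L/\partial W} = 3N^2$$
For LoRA with rank $R$, the forward pass computes both $Wx$ and $BAx$ (the adapter branch). The backward pass still propagates the gradient through the frozen $W$ to reach earlier layers, but it never forms $\partial L/\partial W$:
$$\text{FLOPs}_{\text{LoRA}} = \underbrace{N^2 + 2NR}_{\text{forward}} + \underbrace{N^2}_{\partial L/\partial x \text{ through } W} + \underbrace{4NR}_{\text{adapter backward}} = 2N^2 + 6NR$$
Why is the $Wx$ cost still $N^2$? Because the frozen weights still produce the layer output, and their transpose still carries the gradient backward. The saving comes from skipping the $N^2$ weight-gradient computation: LoRA only computes gradients for $B$ ($NR$ to compute $\partial L/\partial B$) and $A$ ($NR$ to compute $\partial L/\partial A$), plus the chain rule through the adapter ($2NR$ for the gradient flowing back through $BA$).
The ratio:
$$\frac{\text{FLOPs}_{\text{LoRA}}}{\text{FLOPs}_{\text{FullFT}}} = \frac{2N^2 + 6NR}{3N^2} = \frac{2}{3} + \frac{2R}{N}$$
When $R \ll N$ (e.g., $R = 16$, $N = 4096$), the second term is about 0.008 and LoRA uses approximately $\frac{2}{3}$ of the FLOPs.
So LoRA saves about 1/3 of the training FLOPs per linear layer, entirely from the backward pass. Costs that involve no weight gradients (normalization, attention score computation) are the same for both methods, so the whole-model figure stays at roughly 2/3 of full fine-tuning, slightly higher once those fixed costs are included.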
Based on all findings, here is the decision table:
| Scenario | Recommendation | Rank | Why |
|---|---|---|---|
| RL fine-tuning | LoRA | 1-4 | O(1) bits/episode; rank-1 suffices |
| SFT, small data (<1B tokens) | LoRA | 16-64 | Matches FullFT, saves 1/3 FLOPs |
| SFT, medium data (1-10B) | LoRA | 64-256 | Matches with high rank |
| SFT, large data (>100B) | FullFT | N/A | LoRA capacity insufficient |
| Multi-task serving | LoRA | 16-64 | Hot-swap adapters, share base |
| Large batch training | FullFT or high-rank LoRA | 256+ | LoRA degrades at large batch |
| Continued pretraining | FullFT | N/A | Very large data, all directions needed |
The most compelling practical use case for LoRA is multi-tenant serving. The architecture:
1. The frozen $W$ matrices are shared across all requests. For a 70B model, this is ~140 GB in FP16. (One-time cost, amortized across all tenants)
2. Each LoRA adapter (rank 16) is ~200 MB. You can store thousands on a single machine. (3 orders of magnitude smaller than base model)
3. When a request arrives for tenant X, load their adapter into GPU memory. Loading 200 MB takes a few milliseconds over PCIe 5.0 x16 (~64 GB/s) and less over NVLink. (Negligible latency overhead)
4. Group requests by adapter for efficient batched inference. Systems like S-LoRA and Punica implement this efficiently. (Throughput optimization)
This architecture enables personalized models at scale: each user gets a fine-tuned model without paying for a dedicated GPU.
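The request flow above can be sketched for a single layer in NumPy. This is a simplified illustration, not how S-LoRA or Punica are implemented (those systems rely on custom GPU kernels and memory management). The registry `adapters`, the tenant ids, and the function `serve` are hypothetical names. The sketch runs one base matmul for the whole batch, then applies each tenant's low-rank correction to its group of requests.

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.standard_normal((d, d))                      # shared frozen base weight

# Hypothetical adapter registry: tenant id -> (scale, B, A).
adapters = {
    "tenant_a": (1.0, rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "tenant_b": (0.5, rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def serve(requests):
    """requests: list of (tenant_id, x). Returns outputs in request order."""
    groups = defaultdict(list)
    for idx, (tenant, _) in enumerate(requests):
        groups[tenant].append(idx)                   # group request indices by adapter
    X = np.stack([x for _, x in requests])
    out = X @ W.T                                    # one base matmul for the whole batch
    for tenant, idxs in groups.items():
        scale, B, A = adapters[tenant]
        out[idxs] += scale * (X[idxs] @ A.T) @ B.T   # per-adapter low-rank correction
    return out

requests = [("tenant_a", rng.standard_normal(d)),
            ("tenant_b", rng.standard_normal(d)),
            ("tenant_a", rng.standard_normal(d))]
outputs = serve(requests)
```

Each output matches what a dedicated merged model $W + \frac{\alpha}{r}BA$ would produce for that tenant, while the expensive base matmul is shared across all tenants in the batch.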
This lesson connects to several other Engineermaxxing topics:
The RL-with-LoRA result directly applies: if you are doing on-policy distillation via RL rewards, you can use rank-1 LoRA for the student model updates.
LoRA adapters live on a low-dimensional manifold in weight space. The geometry of this manifold (curvature, geodesics) determines which tasks can be represented at which ranks.
For MoE architectures, LoRA should be applied to ALL expert MLPs, not just shared attention. The parameter distribution makes this even more critical than for dense models.