Based on Thinking Machines Lab · John Schulman · Sep 2025

LoRA Without Regret.

When low-rank adaptation matches full fine-tuning, why it works for RL with rank 1, and how four hyperparameters collapse to two — a complete teardown of the theory, experiments, and practical implications.

Source: Thinking Machines Blog · Depth: concept-to-implementation · Papers: 7 referenced

00 Concept constellation

Every concept in this lesson and how they connect — the territory before the map.

This lesson unpacks a single blog post into 16 interconnected concepts across four topic clusters: Core LoRA, Findings, Theory, and Practice.

00 Concept index

Every concept you will encounter, sorted by cluster.

Core LoRA

  • Low-Rank Adaptation: W' = W + (alpha/r)BA decomposition. Ch 01.
  • Rank: Number of outer products in the decomposition. Ch 01.
  • Alpha Scaling: Multiplicative factor controlling update magnitude. Ch 01.
  • AB Decomposition: Sum of rank-1 outer products spanning the update. Ch 01.

Findings

  • All-Layers LoRA: Applying adapters to every layer, not just attention. Ch 03.
  • Batch Sensitivity: LoRA degrades at large batch sizes before FullFT. Ch 02.
  • LR Ratio (10x): Optimal LoRA learning rate is ~10x FullFT. Ch 05.
  • Capacity Limits: Large datasets can exceed LoRA capacity. Ch 02.

Theory

  • eNTK Approximation: Empirical neural tangent kernel for LoRA analysis. Ch 03.
  • Info per Episode: O(1) bits of information per RL episode. Ch 04.
  • Parametrization Invariance: 4 hyperparams but only 2 independent dimensions. Ch 06.
  • 2-Bit-Per-Param: LoRA capacity measured in bits vs information needed. Ch 04.

Practice

  • Multi-Tenant Serving: Hot-swapping LoRA adapters at inference time. Ch 07.
  • RL with LoRA: Rank-1 suffices for reinforcement learning. Ch 04.
  • DeepSeek Replication: R1-Zero reproducible with LoRA alone. Ch 04.
  • Compute Savings: LoRA uses ~2/3 the FLOPs of full fine-tuning. Ch 07.

00 Reading guide

This lesson is structured in layers. You can read it linearly or skip around.

  • Chapter 01: The what — the LoRA formulation, SVD intuition, and operational advantages.
  • Chapters 02–03: The evidence — experimental results showing when LoRA matches or fails, and why all layers matter.
  • Chapter 04: The showstopper — RL with rank-1 LoRA and the information-theoretic argument.
  • Chapters 05–06: The theory — learning rate puzzles and parametrization invariances.
  • Chapter 07: The practice — FLOP analysis, when to use LoRA vs FullFT, and connections.

If you already know the LoRA formulation, skip to Chapter 02. If you only care about the RL result, jump to Chapter 04.

01 The fine-tuning problem

You have a 70-billion-parameter model. You want to teach it a new skill. Updating all 70B parameters is expensive, slow, and wasteful.

Full fine-tuning means computing gradients for every single parameter, storing optimizer states (two extra copies for Adam), and saving a complete model checkpoint for each task. For a 70B model with Adam's two moment estimates kept in FP32, that is roughly 560 GB of optimizer state alone (70B × 2 × 4 bytes), before counting any FP32 master copy of the weights.

Worse: if you want to serve ten different fine-tuned models, you need ten separate 140 GB checkpoints loaded on separate GPUs. The cost scales linearly with the number of tasks.
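The memory figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: 16-bit weights, FP32 Adam moments, decimal gigabytes, and a hypothetical helper name `full_finetune_memory_gb`; lower-precision optimizer states would shrink the second number.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, moment_bytes=4, n_tasks=1):
    """Back-of-the-envelope memory for full fine-tuning, in decimal GB."""
    gb = 1e9
    checkpoint = n_params * weight_bytes / gb          # one 16-bit copy of the weights
    adam_moments = n_params * 2 * moment_bytes / gb    # first and second moment estimates
    return {
        "checkpoint_gb": checkpoint,
        "adam_moments_gb": adam_moments,
        "all_checkpoints_gb": checkpoint * n_tasks,    # one full checkpoint per task
    }

budget = full_finetune_memory_gb(70e9, n_tasks=10)
print(budget)
```

For ten tasks the checkpoint total grows to ten times the single-model figure, which is the linear scaling the paragraph describes.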

The core hypothesis behind LoRA is that the weight updates during fine-tuning have low intrinsic rank. You are not rewriting the entire model; you are nudging it in a specific direction. If the update is low-rank, why store it as a full matrix?

Instead of storing the full update matrix Delta-W (dimensions d_out x d_in), decompose it into two skinny matrices B (d_out x r) and A (r x d_in) where r is much smaller than both d_out and d_in. This reduces parameters from d_out * d_in to (d_out + d_in) * r.

01 The LoRA formulation

Given a pre-trained weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$, LoRA replaces it with:

LoRA update rule $$W' = W + \frac{\alpha}{r} B A$$
  • $W$ is the frozen pre-trained weight (never updated)
  • $B \in \mathbb{R}^{d_{out} \times r}$ is the “up-projection” (initialized to zero)
  • $A \in \mathbb{R}^{r \times d_{in}}$ is the “down-projection” (initialized randomly)
  • $r$ is the rank — the bottleneck dimension
  • $\alpha$ is a scaling constant that controls update magnitude
  • $\frac{\alpha}{r}$ rescales the update so that its magnitude, and with it the best learning rate, stays roughly independent of rank

The product $BA$ has shape $d_{out} \times d_{in}$ — same as $W$ — but is constrained to rank at most $r$. During the forward pass, the computation is:

$$h = W'x = Wx + \frac{\alpha}{r}BAx$$

The key: during training, only $B$ and $A$ receive gradients. The frozen $W$ contributes to the forward pass but never gets updated. After training, you can merge: $W_{deployed} = W + \frac{\alpha}{r}BA$ and pay zero additional inference cost.
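The forward pass and the merge step can be checked numerically. The following is a minimal NumPy sketch with toy dimensions chosen only for illustration; `lora_forward` is a hypothetical helper, not an API from any LoRA library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4.0   # toy sizes, chosen only for illustration

W = rng.standard_normal((d_out, d_in))         # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01      # down-projection, random init
B = np.zeros((d_out, r))                       # up-projection, zero init
x = rng.standard_normal(d_in)

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); the adapter branch never forms B @ A."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# With B = 0 the adapted layer reproduces the base layer exactly.
assert np.allclose(lora_forward(W, A, B, x, alpha, r), W @ x)

# Pretend training moved B away from zero, then merge for deployment.
B = rng.standard_normal((d_out, r)) * 0.1
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, lora_forward(W, A, B, x, alpha, r))
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the adapter branch at cost proportional to r during training; the merged matrix is only formed once, at deployment.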

The rank-1 decomposition view

The product $BA$ can be decomposed further. Since $B$ has $r$ columns and $A$ has $r$ rows, we can write:

Outer product decomposition $$BA = \sum_{i=1}^{r} \mathbf{b}_i \mathbf{a}_i^T$$

Each $\mathbf{b}_i \mathbf{a}_i^T$ is a rank-1 matrix (an outer product of a column of $B$ with a row of $A$). The full LoRA update is a sum of $r$ such rank-1 corrections. Think of each one as a single “direction of change” applied to the weight matrix.
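The outer-product identity is easy to confirm on random matrices. A small NumPy sketch, with arbitrary toy shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 5, 7, 3
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

# Sum of r rank-1 matrices: column i of B times row i of A.
outer_sum = sum(np.outer(B[:, i], A[i, :]) for i in range(r))

assert np.allclose(outer_sum, B @ A)
rank = np.linalg.matrix_rank(B @ A)   # can never exceed r
```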

Worked example: For a weight matrix of shape 4096 x 4096 (16.7M parameters) with rank r=16, LoRA stores 4096*16 + 16*4096 = 131,072 parameters. That is 0.78% of the original. With rank r=1, it is just 8,192 parameters — 0.049%.
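The same arithmetic as a reusable snippet; the function names are illustrative only.

```python
def lora_param_count(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)

def lora_fraction(d_out, d_in, r):
    """LoRA parameters as a fraction of the full d_out x d_in matrix."""
    return lora_param_count(d_out, d_in, r) / (d_out * d_in)

for r in (1, 16):
    n = lora_param_count(4096, 4096, r)
    pct = 100 * lora_fraction(4096, 4096, r)
    print(f"rank {r}: {n:,} params, {pct:.3f}% of the full matrix")
# rank 1: 8,192 params, 0.049% of the full matrix
# rank 16: 131,072 params, 0.781% of the full matrix
```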

01 Why low-rank works

The Singular Value Decomposition (SVD) of any matrix $M$ gives us $M = U \Sigma V^T$, where $\Sigma$ contains singular values in decreasing order. The best rank-$r$ approximation of $M$ is obtained by keeping only the top $r$ singular values.

Empirically, weight updates during fine-tuning have a spectrum that decays rapidly. The top few singular values capture most of the “information” in the update. When you fine-tune a language model on a specific task, the model does not need to rearrange all of its internal representations — it needs to make a focused adjustment in a low-dimensional subspace.

Figure panels: original matrix, rank-r approximation, residual error.

The canvas above shows a simulated weight update matrix being approximated by increasing rank. Notice how quickly the reconstruction error drops: by rank 4, we already capture the dominant structure. The remaining singular values contribute fine-grained noise that barely affects downstream task performance.
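A rough way to reproduce this behaviour is to truncate the SVD of a synthetic matrix. The sketch below assumes a made-up "update" built from a rank-4 signal plus small dense noise; it is not real fine-tuning data, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_rank = 64, 4
# Synthetic update: rank-4 signal plus small noise (an assumption for illustration).
signal = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, n))
delta_w = signal + 0.01 * rng.standard_normal((n, n))

U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)

def relative_error(r):
    """Frobenius error of the best rank-r approximation, relative to the full matrix."""
    approx = (U[:, :r] * S[:r]) @ Vt[:r, :]
    return np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)

errors = {r: relative_error(r) for r in (1, 2, 4, 8)}
for r, err in errors.items():
    print(f"rank {r}: relative error {err:.4f}")
```

Once the truncation rank reaches the rank of the underlying signal, the remaining error is just the noise floor, which mirrors the sharp drop described above.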

This is not just a convenient trick. It reflects something deep about fine-tuning: the pre-trained model already represents the world well. Fine-tuning is about steering, not rebuilding.

01 Operational advantages

Beyond parameter efficiency, LoRA offers three operational advantages over full fine-tuning:

  1. Multi-Tenant Serving

    One base model in GPU memory, multiple LoRA adapters hot-swapped per request. A 70B model with 100 LoRA adapters costs ~140 GB (base) + 100 x 0.1 GB (adapters) = 150 GB. Full fine-tuning would cost 100 x 140 GB = 14 TB.

  2. Memory Efficiency

    Optimizer states are only needed for the low-rank matrices. For rank 16 on a 70B model, Adam states shrink from hundreds of gigabytes to on the order of 1 GB. The frozen base model is stored in half precision without optimizer overhead.

  3. Composability

    LoRA adapters can be added, removed, or combined algebraically. Train one adapter for coding, another for medical knowledge, merge them: $W' = W + \frac{\alpha_1}{r_1}B_1A_1 + \frac{\alpha_2}{r_2}B_2A_2$.

The conventional wisdom was that these advantages come at a performance cost: LoRA trades quality for efficiency. The Schulman blog post challenges this directly: with the right setup, and within the data regimes examined below, LoRA matches full fine-tuning with no measurable quality loss.

02 Experimental setup

The claims require rigorous evidence. Here is how Schulman set up the experiments.

The key experimental choices:

| Variable | Value | Why |
| --- | --- | --- |
| Models | Llama-3 (8B, 70B), Qwen-3 (various) | Cover a range of model sizes and architectures |
| Datasets | Tulu3, OpenThoughts3 | Small/medium (Tulu3) and large-scale (OpenThoughts3) |
| Ranks | 1, 4, 16, 64, 128, 256, 512 | Sweep from minimal to near-full capacity |
| Layers | All (attention + MLP + embedding) | Critical finding: all layers required |
| LR | Grid search per rank/model | Fair comparison to FullFT optimal LR |
| Optimizer | Adam with standard betas | Standard practice, no exotic tricks |

The critical methodological choice: LoRA is applied to ALL linear layers, not just the query/key/value projections in attention. This is the key difference from most LoRA papers in 2023-2024, which only adapted attention weights.

Why all layers? Consider: in a typical transformer, attention weights (Q, K, V, O projections) account for only ~33% of parameters. The MLP layers (gate, up, down projections) account for ~67%. Adapting only attention leaves two-thirds of the model frozen in a suboptimal state.

02 SFT results

The headline result: on small and medium datasets, high-rank LoRA (applied to all layers) matches full fine-tuning loss curves. Not merely “close to”: the curves coincide to within run-to-run noise.

Figure legend: Full fine-tuning, LoRA rank-128, LoRA rank-16, LoRA rank-1.

The canvas shows log-loss vs training steps for different configurations. At small dataset sizes (Tulu3), even rank 16 closely tracks full fine-tuning. At medium scale, rank 128 is needed. The curves diverge only when the dataset is large enough to require more capacity than the LoRA matrices can encode.

What “matches” means precisely

Schulman uses two metrics to define matching:

  1. Training loss: the cross-entropy on the training set at convergence. LoRA achieves the same final loss as FullFT.
  2. Downstream evaluations: benchmark scores (MMLU, GSM8K, HumanEval) are statistically indistinguishable between LoRA and FullFT when training loss matches.

The implication: if you can get the training loss to match, everything else follows. The model learned the same function.

02 Capacity limits

LoRA is not magic. It has a hard capacity ceiling. The product $BA$ can only represent updates of rank at most $r$. When the dataset requires updates that span more than $r$ directions, LoRA underfits.

The practical threshold depends on the ratio of dataset information content to LoRA capacity. Schulman finds:

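| Dataset | Tokens | Minimum rank for match | Note |
| --- | --- | --- | --- |
| Tulu3 (small) | ~1B | 16 | Easy to fit |
| OpenThoughts3 (medium) | ~10B | 128 | Needs high rank |
| Synthetic large | ~100B+ | 512+ | LoRA underfits at any practical rank |

The capacity of a rank-$r$ LoRA adapter on one $d \times d$ matrix is approximately $2 \cdot r \cdot d \cdot \log_2(1/\epsilon)$ bits, where $d$ is the hidden dimension and $\epsilon$ is the quantization precision, so $\log_2(1/\epsilon)$ is the number of useful bits per parameter. For rank 128 on a 4096-dim model, that is $2 \cdot 128 \cdot 4096 \approx 1.05$M parameters per adapted matrix. Summed over all adapted matrices at 1-2 useful bits per parameter, the adapter holds a few hundred million bits, which the table above suggests is enough for 10B tokens of SFT data but not for 100B.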

02 Batch size sensitivity

A surprising finding: LoRA is more sensitive to batch size than full fine-tuning. At very large batch sizes (8K+ sequences), LoRA performance degrades while FullFT holds steady.

The hypothesized mechanism: the penalty comes from the product-of-matrices parametrization itself. Optimizing $B$ and $A$ jointly has different training dynamics from optimizing $W$ directly, and those dynamics tolerate large batches less well. The post reports that the gap does not shrink as rank increases, which points at the parametrization rather than at a capacity limit.

Practically, this means: if you are training with large batches (common in distributed training), the lever to pull is batch size rather than rank. The sweet spot for LoRA is moderate batch sizes (256-2048 sequences).

03 Attention-only LoRA fails

The most common LoRA setup in practice — adapting only Q, K, V projections — is systematically suboptimal.

Schulman compared three configurations at the same total parameter budget:

| Configuration | Weights covered by adapters | Final loss vs FullFT |
| --- | --- | --- |
| Attention-only | 33% of model | +0.08 nats (worse) |
| MLP-only | 67% of model | +0.02 nats (almost matches) |
| All layers | 100% of model | +0.00 nats (matches) |

The counterintuitive result: MLP-only LoRA outperforms attention-only LoRA, even when attention-only uses more rank (to compensate for fewer adapted matrices). The number of adapted layers matters more than the rank per layer.

This overturns 2+ years of community practice. Most LoRA tutorials, papers, and frameworks default to attention-only adaptation. The reason is historical: the original LoRA paper (Hu et al., 2021) tested primarily on attention layers and found good results. But "good" is not "optimal."

03 MLP layers dominate

Why do MLP layers matter more? Consider the parameter breakdown of a standard transformer layer:

Figure legend: MLP (gate, up, down), Attention (Q, K, V, O), Layer norm + embed.

In a Llama-2-7B-style layer (standard multi-head attention, no grouped-query sharing) with hidden dimension $d=4096$ and MLP intermediate dimension $d_{ff}=11008$:

  • Attention: $4 \times d^2 = 4 \times 4096^2 = 67M$ parameters
  • MLP: $3 \times d \times d_{ff} = 3 \times 4096 \times 11008 = 135M$ parameters

The MLP has 2x more parameters than attention in each layer. When you apply LoRA to attention only, you are leaving the larger component entirely frozen.
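The per-layer split can be recomputed directly. This sketch assumes four square attention projections and a three-matrix gated MLP, as in the bullets above; `layer_param_split` is an illustrative name.

```python
def layer_param_split(d_model, d_ff):
    """Return attention params, MLP params, and their shares of the layer's weights."""
    attention = 4 * d_model * d_model      # Q, K, V, O projections, all d x d
    mlp = 3 * d_model * d_ff               # gate, up, down projections
    total = attention + mlp
    return attention, mlp, attention / total, mlp / total

attn, mlp, attn_share, mlp_share = layer_param_split(4096, 11008)
print(f"attention {attn / 1e6:.1f}M ({attn_share:.0%}), MLP {mlp / 1e6:.1f}M ({mlp_share:.0%})")
# attention 67.1M (33%), MLP 135.3M (67%)
```

Models that use grouped-query attention have smaller K and V projections, which pushes the MLP share higher still.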

For MoE architectures, the gap is even bigger

In Mixture-of-Experts models (like DeepSeek or Mixtral), the expert MLP layers are massive. Each expert is a full MLP, and there may be anywhere from 8 (Mixtral) to a few hundred (DeepSeek-V3) experts per layer. The attention component becomes a tiny fraction of total parameters. Adapting only attention in an MoE model leaves 90%+ of parameters without an adapter.

03 The eNTK argument

Schulman provides a theoretical grounding via the empirical Neural Tangent Kernel (eNTK). The eNTK of a network at initialization determines what functions the network can learn in the lazy/linear regime (small updates relative to initialization).

The key insight: the eNTK of a LoRA-adapted model converges to the eNTK of the full model when LoRA is applied to all layers with sufficient rank. Formally:

eNTK convergence $$\Theta_{\text{LoRA}}(x, x') \approx \Theta_{\text{Full}}(x, x') \quad \text{when } r \geq r^* \text{ and all layers adapted}$$
  • $\Theta(x, x') = \sum_\ell J_\ell(x) J_\ell(x')^T$ is the NTK, summing over layers $\ell$
  • $J_\ell(x) = \frac{\partial f(x)}{\partial \theta_\ell}$ is the Jacobian of the output w.r.t. layer $\ell$ parameters
  • The LoRA Jacobian for layer $\ell$ spans a rank-$r$ subspace of the full Jacobian

If you skip a layer, the corresponding Jacobian term $J_\ell$ is zeroed out entirely. The eNTK becomes a partial sum of the full kernel, losing expressiveness in the directions that layer controls. This is why attention-only LoRA underperforms: you are zeroing out the Jacobian contribution from 67% of parameters.

The eNTK view also explains rank requirements: you need enough rank per layer so that the projected Jacobian $J_\ell^{\text{LoRA}}$ captures the important directions of $J_\ell^{\text{Full}}$. If the task requires changes spanning many directions in one layer, that layer needs higher rank. But even rank 1 on every layer is often better than rank 64 on attention only.
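The layer-sum structure of the kernel can be illustrated on a toy network. The sketch below uses a hypothetical two-layer scalar model $f(x) = v \cdot \tanh(Ux)$ with hand-derived gradients. It only shows how freezing a layer removes that layer's term from the kernel; it does not model LoRA's rank restriction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 3
U = rng.standard_normal((d_hidden, d_in))   # "layer 1" weights
v = rng.standard_normal(d_hidden)           # "layer 2" weights

def per_layer_grads(x):
    """Gradients of f(x) = v . tanh(U x) with respect to each layer's parameters."""
    h = np.tanh(U @ x)
    grad_v = h                                   # df/dv
    grad_U = np.outer(v * (1.0 - h**2), x)       # df/dU via the chain rule
    return grad_U.ravel(), grad_v

def entk(x1, x2, layers=("U", "v")):
    """Empirical NTK restricted to the listed trainable layers."""
    gU1, gv1 = per_layer_grads(x1)
    gU2, gv2 = per_layer_grads(x2)
    k = 0.0
    if "U" in layers:
        k += gU1 @ gU2
    if "v" in layers:
        k += gv1 @ gv2
    return k

x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)
k_full = entk(x1, x2)
k_v_only = entk(x1, x2, layers=("v",))      # layer U frozen
k_U_only = entk(x1, x2, layers=("U",))      # layer v frozen
print(k_full, k_U_only, k_v_only)
```

Freezing `U` leaves only the `v` term, so any function change that needs the `U` directions is invisible to the restricted kernel.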

03 The logical chain

Putting it all together, the argument proceeds in four steps:

  1. LoRA(all layers, high rank) has the same eNTK as FullFT

    When every layer is adapted with sufficient rank, the linearized model (first-order Taylor expansion around the pre-trained weights) can express the same function updates.

    eNTK coverage argument
  2. Same eNTK implies same learning dynamics (for small updates)

    In the lazy regime (typical for fine-tuning where updates are small relative to pretrained scale), the eNTK determines the training dynamics. Same kernel = same trajectory.

    Neural tangent kernel theory
  3. Same learning dynamics implies same final loss

    If the optimization follows the same trajectory (within noise), it reaches the same minimum.

    Convexity of the linearized objective
  4. Same final loss implies same downstream performance

    Empirically verified: when training loss matches, benchmark scores are indistinguishable.

    Experimental validation across 14 models

The chain is: LoRA(all) ≈ eNTK(LoRA) ≈ eNTK(Full) ≈ FullFT. Each “≈” is justified by a different argument (coverage, lazy regime, optimization, empirics).

04 The rank-1 result

The most striking finding in the entire blog post: RL fine-tuning with rank-1 LoRA matches full fine-tuning.

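Read that again. Rank one. A single outer product $\mathbf{b}\mathbf{a}^T$ per adapted matrix. For a 4096-dimensional model, that is 8,192 parameters per $d \times d$ matrix. Treating each layer as 7 adapted matrices of that size (the same simplification used in the capacity accounting below), a 32-layer model has about 1.8M trainable parameters. Full fine-tuning updates 7+ billion.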

And yet: the reward curves are indistinguishable. The final policy performs identically on evaluation benchmarks. The model learns the same behaviors.

How is this possible? The answer comes from information theory.

04 Information per episode

In RL, the model generates a response (an episode), receives a scalar reward, and updates. How much information does one reward signal carry?

Mutual information bound $$I(G; R \mid \text{history}) \leq H(\text{Adv}) = O(1) \text{ bits}$$
  • $I(G; R \mid \text{history})$ is the mutual information between the generated response $G$ and the reward $R$, conditioned on training history
  • $H(\text{Adv})$ is the entropy of the advantage signal
  • $O(1)$ means a constant number of bits — typically 1-3 bits per episode

The intuition: a reward is essentially a single number (often binary: correct/incorrect, or a scalar in a bounded range). It tells you “this response was good” or “this response was bad.” That is inherently low-information. You cannot learn a complex high-dimensional update from a single bit.

Why this is different from SFT

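In supervised fine-tuning, each training example provides the entire correct response, token by token. A 500-token response carries at most $500 \times \log_2(\text{vocab\_size}) \approx 500 \times 17 \approx 8500$ bits of information. The bound is reached only if every token were equally likely; real text is far more predictable, but the per-example signal still scales with response length. That is up to thousands of times more than a one-bit RL reward.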

SFT needs capacity to store all these specific responses. RL only needs capacity to shift policy slightly based on sparse feedback. The update per episode is minuscule in the parameter space.

Figure legend: information needed (bits), LoRA capacity (bits).

The canvas above shows the comparison: total information from RL training (episodes times bits-per-episode) versus LoRA adapter capacity (parameters times bits-per-parameter). Even rank-1 LoRA has far more capacity than the RL signal can fill.

04 Capacity accounting

Let us do the math explicitly for DeepSeek-R1-Zero:

Information budget $$\text{Total info} = \text{episodes} \times \text{bits/episode} = 5.3\text{M} \times 1 \text{ bit} = 5.3\text{M bits}$$

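Now the LoRA capacity for rank-1 on a model with hidden dimension $d=7168$ and 61 layers, treating each layer as 7 adapted matrices of size $d \times d$ (a simplification that ignores the MoE expert matrices):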

LoRA rank-1 capacity $$\text{Params} = 61 \times 7 \text{ matrices} \times 2d = 61 \times 7 \times 2 \times 7168 = 6.1\text{M params}$$

At 16-bit precision, that is $6.1\text{M} \times 16 = 97.6\text{M bits}$ of capacity. Even at a conservative estimate of 2 bits of useful information per parameter (accounting for noise), we have $6.1\text{M} \times 2 = 12.2\text{M bits}$.

The inequality: 5.3M bits needed < 12.2M bits available (rank-1 LoRA, conservative estimate). LoRA has more than 2x the capacity needed. Even halving the rank would still work in principle — but rank 0.5 does not exist, so rank 1 is the minimum.

This is a necessary condition argument, not a sufficient one. Having enough capacity does not guarantee the optimizer can find the right solution. But Schulman shows empirically that Adam does find it — the optimization landscape for RL updates is smooth enough that even rank-1 constraints do not create problematic saddle points or local minima.
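The capacity accounting above, written out as arithmetic. The inputs are the figures quoted in this chapter; treating every adapted matrix as $d \times d$ and assuming 1 bit per episode are simplifications carried over from the text.

```python
episodes = 5_300_000
bits_per_episode = 1                      # assumption: about 1 bit of signal per episode
info_bits = episodes * bits_per_episode

n_layers, matrices_per_layer, d_model, rank = 61, 7, 7168, 1
# Each adapted matrix is treated as d x d, so a rank-r adapter has 2 * r * d parameters.
lora_params = n_layers * matrices_per_layer * rank * 2 * d_model

capacity_16bit = lora_params * 16         # raw storage at 16-bit precision
capacity_2bit = lora_params * 2           # conservative 2 useful bits per parameter

print(f"{lora_params:,} params")
print(f"headroom at 2 bits/param: {capacity_2bit / info_bits:.2f}x")
# 6,121,472 params
# headroom at 2 bits/param: 2.31x
```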

04 DeepSeek-R1-Zero replication

The practical punchline: DeepSeek-R1-Zero (the model that learns to reason through RL alone, no SFT) could be replicated with rank-1 LoRA.

DeepSeek-R1-Zero was trained with pure RL (GRPO) for ~5.3 million episodes. Based on the information-theoretic argument above, the total information content of the training signal is bounded by 5.3M bits. A rank-1 LoRA adapter on the DeepSeek-V3 architecture has more than enough capacity.

The implications for the field:

  • Democratization: RL-based reasoning models no longer require hundreds of GPUs for full fine-tuning. A rank-1 LoRA training run uses ~2/3 the FLOPs and a fraction of the memory.
  • Iteration speed: Smaller adapters = faster checkpointing, easier experimentation, more hyperparameter sweeps per GPU-hour.
  • Multi-policy serving: Deploy multiple reasoning policies (code, math, general) as separate LoRA adapters on a single base model.

A subtlety: the information-theoretic bound is per episode, not per gradient step. With PPO/GRPO minibatches that reuse episodes multiple times, the information per gradient step can be even lower (since the same information is used multiple times, the model extracts diminishing new signal from replayed episodes).

05 The 10x finding

Across 14 different models, the optimal learning rate for LoRA is consistently about 10x larger than for full fine-tuning.

This is not a vague guideline. Schulman reports a systematic relationship: when you grid-search the optimal LR for both LoRA and FullFT independently, the LoRA optimum lands at approximately 10x the FullFT optimum. The ratio is remarkably stable across model sizes, architectures, and datasets.

| Model | FullFT optimal LR | LoRA optimal LR | Ratio |
| --- | --- | --- | --- |
| Llama-3-8B | 2e-5 | 2e-4 | 10x |
| Llama-3-70B | 5e-6 | 5e-5 | 10x |
| Qwen-3-7B | 3e-5 | 3.5e-4 | ~12x |
| Qwen-3-72B | 8e-6 | 7e-5 | ~9x |

For short training runs (few hundred steps), the ratio is closer to 15x. For long runs (thousands of steps), it settles to ~10x. The hypothesis: $B$ is initialized to zero, so early training is effectively rank-0 until $B$ grows, requiring a higher LR to compensate for the initially dead pathway.

05 Why higher LR?

There is no complete theoretical explanation yet. Schulman is explicit about this. But there are plausible partial explanations:

Hypothesis 1: Gradient dilution

In LoRA, the gradient of the loss with respect to $B$ involves the product with $A$ (and vice versa). Since $B$ starts at zero, early gradients on $A$ are zero (the signal is “diluted” through the zero $B$). A higher LR compensates for this slow start.
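The zero-gradient claim follows from the chain rule and can be checked on random data. In this sketch the upstream gradient `g_out` is a random stand-in for $\partial L/\partial h$, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 5, 2, 2.0
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))               # standard zero init
x = rng.standard_normal(d_in)
g_out = rng.standard_normal(d_out)     # upstream gradient dL/dh (random stand-in)

scale = alpha / r
grad_B = scale * np.outer(g_out, A @ x)      # dL/dB = (alpha/r) * g (A x)^T
grad_A = scale * np.outer(B.T @ g_out, x)    # dL/dA = (alpha/r) * (B^T g) x^T

assert np.allclose(grad_A, 0.0)        # A receives no signal while B is zero
assert np.abs(grad_B).max() > 0.0      # B does receive signal
```

Only after the first update moves `B` away from zero does `A` start to receive gradient, which is the slow start this hypothesis refers to.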

Hypothesis 2: Effective learning rate per-direction

LoRA constrains updates to a rank-$r$ subspace. Within that subspace, the effective step size needs to match what FullFT achieves in the full space. Since FullFT distributes its LR across $N$ parameters but LoRA concentrates it on $r \times (d_{in} + d_{out})$ parameters, the per-direction step is already smaller for LoRA at the same nominal LR.

Hypothesis 3: $\alpha/r$ scaling interaction

The $\alpha/r$ factor in the forward pass effectively scales down the LoRA contribution. The LR must be higher to compensate for this built-in dampening. When $\alpha = r$ (a common choice), this factor is 1 and the ratio would be lower — consistent with some experimental observations.

05 The fitted formula

Schulman provides an empirical formula for predicting the optimal LoRA learning rate given the model’s hidden dimension:

Empirical LR formula $$\text{LR}_{\text{LoRA}} = M_{\text{LoRA}} \times \left(\frac{2000}{d_{\text{hidden}}}\right)^{p}$$
  • $M_{\text{LoRA}}$ is a constant calibrated per-family (typically ~3e-4)
  • $d_{\text{hidden}}$ is the model hidden dimension
  • $p$ is a power law exponent (fitted to ~0.5-0.7)
  • The 2000 in the numerator is a reference scale (arbitrary normalization)

This formula captures the empirical observation that larger models (larger $d_{\text{hidden}}$) need smaller learning rates, with the scaling following a power law rather than a linear relationship.
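As a sanity-check helper, the formula is a one-liner. The default values of `m_lora` and `p` below are placeholders picked from the ranges quoted above, not fitted constants; substitute values calibrated for your model family.

```python
def lora_lr(d_hidden, m_lora=3e-4, p=0.6):
    """Power-law LR estimate: M_LoRA * (2000 / d_hidden) ** p.

    m_lora and p are placeholder values taken from the ranges in the text.
    """
    return m_lora * (2000.0 / d_hidden) ** p

for d in (2048, 4096, 8192):
    print(d, f"{lora_lr(d):.2e}")
```

At `d_hidden = 2000` the bracket equals 1, so the estimate reduces to `m_lora`; that is all the reference scale does.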

Figure legend: LoRA optimal LR, FullFT optimal LR, fitted power law.

05 Practical guidance

For practitioners, the learning rate findings translate to concrete advice:

  1. Start at 10x your known FullFT LR

    If you have a validated FullFT learning rate for your model, multiply by 10 for LoRA. This is your starting point, not a fixed rule.

    Empirical finding across 14 models
  2. For short runs, bias higher (15x)

    If your training is only a few hundred steps (common for RL), the B-zero initialization penalty is larger. Go higher.

    B-zero startup hypothesis
  3. Always grid-search a small range

    The 10x rule gets you within a factor of 2. A quick 3-point grid (7x, 10x, 15x) will nail the optimum.

    Cheap insurance for final quality
  4. Scale inversely with hidden dim

    Larger models need smaller LRs, following the fitted power law rather than strict proportionality. Use the power-law formula as a sanity check.

    Power-law scaling
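Steps 1 to 3 amount to building a small grid around the 10x rule. A minimal sketch, where `lora_lr_grid` is a hypothetical helper and the multipliers are the ones suggested above:

```python
def lora_lr_grid(fullft_lr, multipliers=(7, 10, 15)):
    """Candidate LoRA learning rates as multiples of a validated FullFT learning rate."""
    return [fullft_lr * m for m in multipliers]

grid = lora_lr_grid(2e-5)
print([f"{lr:.1e}" for lr in grid])
```

For short runs, shifting the multipliers upward (for example towards 15x and beyond) follows the guidance in step 2.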

06 Four hyperparameters

LoRA has four tunable hyperparameters. The community treats them as four independent knobs. They are not.

The standard LoRA configuration exposes:

| Hyperparam | Symbol | Typical default | Controls |
| --- | --- | --- | --- |
| Alpha | $\alpha$ | Equal to rank | Scaling of the LoRA output |
| LR for A | $\eta_A$ | Same as LR_B | Learning rate for down-projection |
| LR for B | $\eta_B$ | Same as LR_A | Learning rate for up-projection |
| Init scale for A | $\sigma_A$ | Kaiming uniform | Initialization magnitude of A |

(We fix $B$ init to zero, which is standard. Some variants like LoRA+ and rsLoRA modify these differently.)

Most practitioners just set $\alpha = r$ and use the same LR for both matrices. But different papers recommend different choices, creating confusion about what “really matters.”

06 Only two matter

Schulman shows that for the Adam optimizer, these four hyperparameters have only two independent degrees of freedom. The reason: certain combinations produce mathematically identical training trajectories.

The two independent quantities that actually control training are:

Independent quantity 1: Initial update scale $$\text{Scale} = \alpha \cdot \sigma_A \cdot \eta_B$$
Independent quantity 2: A/B timescale ratio $$\text{Ratio} = \frac{\sigma_A}{\eta_A}$$
  • Scale controls how large the first meaningful weight update is (the initial “kick”)
  • Ratio controls the relative speed at which A and B evolve — whether A races ahead and B catches up, or vice versa

Any combination of ($\alpha$, $\eta_A$, $\eta_B$, $\sigma_A$) that produces the same Scale and Ratio will yield identical training dynamics (up to floating-point noise).

Why does Adam create this invariance? Adam divides each gradient by the square root of its running second moment. Multiplying a parameter's gradient by a constant $c$ (by changing init or alpha) therefore scales the numerator and the denominator by the same factor, and the step size is unchanged (up to the small $\epsilon$ in the denominator). The net effect depends only on certain products of hyperparameters, not their individual values.

06 The invariance transformation

Formally, the following transformation leaves the Adam training trajectory unchanged:

Invariance transform $$\alpha \to c \cdot \alpha, \quad \sigma_A \to \sigma_A / c, \quad \eta_A \to \eta_A / c, \quad \eta_B \to \eta_B$$

For any constant $c > 0$. You can verify: Scale = $\alpha \cdot \sigma_A \cdot \eta_B = (c\alpha)(\sigma_A/c)\eta_B$ = unchanged. Ratio = $\sigma_A / \eta_A = (\sigma_A/c)/(\eta_A/c)$ = unchanged.

Similarly, there is a second invariance:

$$\sigma_A \to k \cdot \sigma_A, \quad \eta_B \to \eta_B / k, \quad \eta_A \to k \cdot \eta_A, \quad \alpha \to \alpha$$

Check: Scale = $\alpha \cdot (k\sigma_A) \cdot (\eta_B/k)$ = unchanged. Ratio = $(k\sigma_A)/(k\eta_A)$ = unchanged.

These two invariances reduce the 4D hyperparameter space to a 2D surface. Any point in 4D maps to a unique point in the 2D (Scale, Ratio) plane, and all points mapping to the same 2D location produce identical training.
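Both invariances can be tested empirically on a toy problem. The sketch below is an illustration under stated assumptions, not the post's experiment: a small least-squares regression, a hand-written Adam with a very small `eps`, Gaussian init for `A`, and `alpha` standing in for the whole $\alpha/r$ prefactor. It records the effective update $\alpha BA$ at every step and compares runs.

```python
import numpy as np

def adam_step(p, g, m, v, lr, t, b1=0.9, b2=0.999, eps=1e-30):
    """One in-place Adam update; eps is tiny so the gradient-scale invariance is visible."""
    m[:] = b1 * m + (1 - b1) * g
    v[:] = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p -= lr * m_hat / (np.sqrt(v_hat) + eps)

def effective_updates(alpha, sigma_a, lr_a, lr_b, steps=20, seed=0):
    """Train a toy LoRA regression with Adam; return alpha * B @ A after every step."""
    rng = np.random.default_rng(seed)
    d_in, d_out, r, n = 6, 5, 2, 16
    X = rng.standard_normal((d_in, n))
    Y = rng.standard_normal((d_out, n))
    W = rng.standard_normal((d_out, d_in))            # frozen base weight
    A = sigma_a * rng.standard_normal((r, d_in))      # init scale sigma_a
    B = np.zeros((d_out, r))                          # zero init
    mA, vA = np.zeros_like(A), np.zeros_like(A)
    mB, vB = np.zeros_like(B), np.zeros_like(B)
    history = []
    for t in range(1, steps + 1):
        G = ((W + alpha * B @ A) @ X - Y) @ X.T / n   # dL/dW' for mean squared error
        gA, gB = alpha * B.T @ G, alpha * G @ A.T     # chain rule through alpha * B @ A
        adam_step(A, gA, mA, vA, lr_a, t)
        adam_step(B, gB, mB, vB, lr_b, t)
        history.append(alpha * B @ A)
    return np.array(history)

base = effective_updates(alpha=16.0, sigma_a=0.01, lr_a=1e-3, lr_b=1e-3)
c, k = 4.0, 0.5
first = effective_updates(alpha=16.0 * c, sigma_a=0.01 / c, lr_a=1e-3 / c, lr_b=1e-3)
second = effective_updates(alpha=16.0, sigma_a=0.01 * k, lr_a=1e-3 * k, lr_b=1e-3 / k)
alpha_only = effective_updates(alpha=16.0 * c, sigma_a=0.01, lr_a=1e-3, lr_b=1e-3)

print(np.allclose(base, first), np.allclose(base, second), np.allclose(base, alpha_only))
```

The two transformed runs should reproduce the baseline trajectory, while changing `alpha` alone (the last run) should not, because it moves the Scale coordinate.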

Figure legend: Default LoRA, LoRA+, rsLoRA, current selection.

06 Mapping existing methods

Different LoRA variants from the literature simply correspond to different points in the 2D (Scale, Ratio) plane:

  1. Standard LoRA

    $\alpha = r$, same LR for A and B, Kaiming init for A. Scale and Ratio take whatever values these defaults imply. The default, but not necessarily optimal.

  2. LoRA+

    Uses different LRs for A and B ($\eta_B > \eta_A$). Maps to a different Ratio in our 2D space. Equivalent to standard LoRA once $\sigma_A$ and the shared LR are adjusted to match.

  3. rsLoRA

    Uses the scaling factor $\alpha/\sqrt{r}$ instead of $\alpha/r$. Moves to a different Scale point. Designed to keep the update magnitude stable as rank increases.

  4. Unsloth recommendations

    High alpha (2x rank), specific LR choices. Maps to yet another point in the plane, reachable from the other parameterizations by matching Scale and Ratio.

The key takeaway: these are not different methods. They are different coordinates for the same 2D configuration space. If you match their Scale and Ratio values using any parameterization, you get the same result.

Example: LoRA+ with $\eta_A = 1e\text{-}4$, $\eta_B = 1e\text{-}3$, $\alpha = 16$, $\sigma_A = 0.01$ gives Scale = $16 \times 0.01 \times 1e\text{-}3 = 1.6e\text{-}4$ and Ratio = $0.01 / 1e\text{-}4 = 100$. Standard LoRA with $\alpha = 16$, $\eta = 1e\text{-}3$, $\sigma_A = 0.01$ gives Scale = $16 \times 0.01 \times 1e\text{-}3 = 1.6e\text{-}4$ and Ratio = $0.01 / 1e\text{-}3 = 10$. Different Ratio → different training dynamics. Adjusting $\sigma_A$ alone would fix the Ratio but shift the Scale, so replicating LoRA+ with a single shared LR means adjusting $\sigma_A$ and the LR together: $\alpha = 16$, $\eta \approx 3.16e\text{-}4$, $\sigma_A \approx 0.0316$ gives Scale $\approx 1.6e\text{-}4$ and Ratio $= 100$.
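The two coordinates are cheap to compute, which makes it easy to check whether two configurations are the same point in the plane. In this sketch `scale_and_ratio` is an illustrative helper, and the shared-LR configuration is obtained from the second invariance with $k = \sqrt{\eta_B / \eta_A}$.

```python
import math

def scale_and_ratio(alpha, sigma_a, lr_a, lr_b):
    """Return (Scale, Ratio) = (alpha * sigma_A * eta_B, sigma_A / eta_A)."""
    return alpha * sigma_a * lr_b, sigma_a / lr_a

lora_plus = scale_and_ratio(alpha=16, sigma_a=0.01, lr_a=1e-4, lr_b=1e-3)
standard = scale_and_ratio(alpha=16, sigma_a=0.01, lr_a=1e-3, lr_b=1e-3)

# Shared-LR configuration reached via the second invariance with k = sqrt(lr_b / lr_a).
k = math.sqrt(1e-3 / 1e-4)
matched = scale_and_ratio(alpha=16, sigma_a=0.01 * k, lr_a=1e-4 * k, lr_b=1e-3 / k)

print("LoRA+   ", lora_plus)
print("standard", standard)
print("matched ", matched)
```

After the transform both learning rates equal $\sqrt{\eta_A \eta_B}$, so the matched configuration is an ordinary single-LR setup that lands on the LoRA+ point.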

07 FLOP analysis

LoRA is not just parameter-efficient — it is compute-efficient too. Here is the exact accounting.

Consider a single linear layer with weight $W \in \mathbb{R}^{N \times N}$ (square for simplicity). The FLOPs for one training step include forward pass, backward pass (computing gradients), and optimizer update.

Full fine-tuning FLOPs per layer $$\text{FullFT} = \underbrace{N^2}_{\text{forward}} + \underbrace{2N^2}_{\text{backward}} = 3N^2$$

For LoRA, the forward pass computes both $Wx$ and $BAx$ (the adapter branch). The backward pass skips the weight gradient $\partial L/\partial W$, but it must still propagate the gradient with respect to the input back through the frozen $W$ so that earlier layers receive a signal:

LoRA FLOPs per layer $$\text{LoRA} = \underbrace{N^2}_{\text{forward } Wx} + \underbrace{NR + NR}_{\text{forward } BAx} + \underbrace{N^2}_{\text{backward through } W} + \underbrace{NR + NR + NR + NR}_{\text{backward through } B, A} = 2N^2 + 6NR$$

Why do two $N^2$ terms survive? The forward pass still needs the output of the frozen weights, and the backward pass still needs $W^T$ times the upstream gradient to reach earlier layers. The saving is the third $N^2$ term, the weight gradient, which full fine-tuning must compute and LoRA never forms. The adapter itself costs $NR$ each for $\partial L/\partial B$ and $\partial L/\partial A$, plus $2NR$ for the gradient flowing back through $BA$.

The ratio:

$$\frac{\text{LoRA}}{\text{FullFT}} = \frac{2N^2 + 6NR}{3N^2} = \frac{2}{3} + \frac{2R}{N}$$

When $R \ll N$ (e.g., $R = 16$, $N = 4096$), the second term is negligible ($2R/N \approx 0.008$):

$$\text{LoRA} \approx 2N^2 + 6NR \approx 2N^2 \quad \text{vs} \quad \text{FullFT} = 3N^2$$

So LoRA uses slightly more than 2/3 of the training FLOPs per linear layer, with the entire saving coming from the skipped weight-gradient computation in the backward pass. Because this count covers only the weight matrices, the whole-model saving is at most about 1/3: costs such as attention score computation are the same for both methods.
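A minimal sketch of the per-layer accounting, assuming a square $N \times N$ layer, costs counted in multiply-adds per token, and LoRA's backward pass charged for propagating the input gradient through the frozen $W$. Function names are illustrative.

```python
def fullft_flops(n):
    """Forward (N^2) + input gradient (N^2) + weight gradient (N^2)."""
    return 3 * n * n

def lora_flops(n, r):
    """Frozen-path cost plus adapter cost for a rank-r LoRA on an N x N layer."""
    frozen = 2 * n * n          # forward through W and input gradient back through W
    adapter = 6 * n * r         # 2NR forward, 4NR backward through B and A
    return frozen + adapter

ratio = lora_flops(4096, 16) / fullft_flops(4096)
print(f"{ratio:.4f}")
# 0.6745
```

The ratio equals $2/3 + 2R/N$, so it approaches 2/3 from above as the rank shrinks relative to the layer width.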

Figure legend: FullFT FLOPs, LoRA FLOPs, savings.

07 When to use LoRA vs FullFT

Based on all findings, here is the decision table:

| Scenario | Recommendation | Rank | Why |
| --- | --- | --- | --- |
| RL fine-tuning | LoRA | 1-4 | O(1) bits/episode; rank-1 suffices |
| SFT, small data (<1B tokens) | LoRA | 16-64 | Matches FullFT, saves 1/3 FLOPs |
| SFT, medium data (1-10B) | LoRA | 64-256 | Matches with high rank |
| SFT, large data (>100B) | FullFT | N/A | LoRA capacity insufficient |
| Multi-task serving | LoRA | 16-64 | Hot-swap adapters, share base |
| Large batch training | FullFT, or LoRA with a smaller batch | N/A | LoRA degrades at large batch, and the post reports that higher rank does not fix this |
| Continued pretraining | FullFT | N/A | Very large data, all directions needed |

The critical caveat: these recommendations assume LoRA is applied to ALL layers. If you can only apply to attention (due to framework limitations), increase rank by 3x and expect a small quality gap. The framework limitation, not LoRA itself, is the bottleneck.

07 Multi-tenant serving

The most compelling practical use case for LoRA is multi-tenant serving. The architecture:

  1. Load base model once into GPU memory

    The frozen $W$ matrices are shared across all requests. For a 70B model, this is ~140 GB in FP16.

    One-time cost, amortized across all tenants
  2. Store adapters in CPU memory / disk

    Each LoRA adapter (rank 16) is ~200 MB. You can store thousands on a single machine.

    3 orders of magnitude smaller than base model
  3. Load adapter on-demand per request

    When a request arrives for tenant X, load their adapter into GPU memory. Over PCIe 5.0 x16 (roughly 64 GB/s), copying 200 MB takes about 3 ms; a faster NVLink-class interconnect shortens this further.

    Small latency overhead
  4. Batch requests with same adapter

    Group requests by adapter for efficient batched inference. Systems like S-LoRA and Punica implement this efficiently.

    Throughput optimization

This architecture enables personalized models at scale: each user gets a fine-tuned model without paying for a dedicated GPU.
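The serving economics can be estimated with simple arithmetic. This sketch uses the figures quoted in the steps above (140 GB base, 200 MB adapters) and an assumed 64 GB/s host-to-GPU bandwidth; both helper names are hypothetical, and real systems add batching and caching effects that are not modelled here.

```python
def serving_memory_gb(base_gb, adapter_gb, n_adapters):
    """Memory for one shared base plus adapters, versus one full model per tenant."""
    shared = base_gb + adapter_gb * n_adapters
    dedicated = base_gb * n_adapters
    return shared, dedicated

def transfer_ms(size_mb, bandwidth_gb_per_s):
    """Idealized copy time in milliseconds, ignoring latency and protocol overhead."""
    return size_mb / 1000.0 / bandwidth_gb_per_s * 1000.0

shared, dedicated = serving_memory_gb(base_gb=140, adapter_gb=0.2, n_adapters=100)
load_time = transfer_ms(size_mb=200, bandwidth_gb_per_s=64)   # assumed PCIe 5.0 x16 rate
print(f"shared: {shared:.0f} GB, dedicated: {dedicated:.0f} GB, adapter load: {load_time:.2f} ms")
```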

07 Connections

This lesson connects to several other Engineermaxxing topics:

Related lesson

On-Policy Distillation

The RL-with-LoRA result directly applies: if you are doing on-policy distillation via RL rewards, you can use rank-1 LoRA for the student model updates.

Related lesson

Modular Manifolds

LoRA adapters live on a low-dimensional manifold in weight space. The geometry of this manifold (curvature, geodesics) determines which tasks can be represented at which ranks.

Related concept

Mixture of Experts

For MoE architectures, LoRA should be applied to ALL expert MLPs, not just shared attention. The parameter distribution makes this even more critical than for dense models.

07 References

  1. Schulman, J. "LoRA Without Regret." Thinking Machines Lab Blog, Sep 2025.
  2. Hu, E. et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022. arXiv:2106.09685
  3. Biderman, S. et al. "LoRA Learns Less and Forgets Less." TMLR, 2024. arXiv:2405.09673
  4. Hayou, S. et al. "LoRA+: Efficient Low-Rank Adaptation of Large Models." arXiv, 2024. arXiv:2402.12354
  5. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. arXiv:2501.12948
  6. Sheng, Y. et al. "S-LoRA: Serving Thousands of Concurrent LoRA Adapters." MLSys, 2024. arXiv:2311.03285
  7. Jacot, A. et al. "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS, 2018. arXiv:1806.07572