Make weight and Hessian matrices incoherent, then round optimally — the first post-training quantization method that works at 2 bits per weight with provable error bounds.
A 70-billion parameter language model stored in 16-bit precision occupies 140 GB of GPU memory. A single A100 GPU has 80 GB. So serving one copy of a 70B model requires at minimum two high-end GPUs — and that is before you allocate memory for KV caches, activations, or batched requests.
This is the memory wall. Inference speed in modern LLMs is almost always bottleneck by memory bandwidth, not compute. Every token generated requires reading the entire weight matrix from memory. The arithmetic is fast; the reading is slow.
Before QuIP, the answer was no. The state of the art in post-training quantization (PTQ) was GPTQ/OPTQ, which worked well at 4 bits and reasonably at 3 bits. At 2 bits, GPTQ's perplexity exploded — the quantized model produced nonsense. Other methods like SmoothQuant and ZeroQuant also failed at this extreme compression.
At 2 bits per weight, you have exactly 4 representable values. A typical 16-bit weight can be any of 65,536 values. Collapsing that range into 4 slots means each quantized value must represent a huge swath of the original distribution. Tiny errors per weight compound across thousands of matrix multiplications, and the model diverges.
The key question QuIP asks: under what conditions on the weight matrix does quantization succeed? Their answer is incoherence — when weights are spread evenly across coordinates and the important directions for rounding are not aligned with the coordinate axes, quantization error stays small. And they can force this condition via a simple preprocessing step.
Memory footprint for a 70B parameter model at different bit widths. Drag to see the threshold.
Post-training quantization works layer by layer. For a single linear layer with weight matrix W ∈ Rm×n, we want to find a quantized matrix Ŵ that minimally degrades the layer's output. But "minimally degrades output" depends on the inputs the layer sees.
Following Nagel et al. (2020), we measure the error between the original and quantized layer outputs on a calibration set. If x is a random input vector drawn from calibration data, the expected squared output error is:
where H = Ex[xxT] is the second moment matrix of the calibration inputs, also called the proxy Hessian. This is the matrix that tells us which directions in weight space matter most — if the inputs often point in direction v, then errors along v are amplified and H has a large eigenvalue in that direction.
Crucially, this objective decomposes across rows. Each row of W can be quantized independently, in parallel. The proxy loss for row i is:
where wi is the i-th row of W. This means we can quantize all m rows in parallel — essential for making PTQ tractable on models with billions of parameters.
If H were the identity matrix (all input directions equally important), any rounding scheme would work equally well. But real Hessians are highly non-isotropic — a few directions carry most of the signal. This means:
This insight is what drives the entire QuIP approach: make the Hessian more isotropic (via incoherence processing), and rounding becomes uniformly easier in all directions.
The simplest quantization strategy is nearest rounding: snap each weight to the closest representable value independently. But this ignores the Hessian entirely. If two weights lie in a direction that H says is very important, we might be able to reduce the total error by rounding one up and the other down to cancel errors — even if that means one weight is rounded to its second closest value.
QuIP's adaptive rounding processes weight columns one at a time, in order k = 1, 2, ..., n. At each step, it rounds column k, then adds a "correction term" to the remaining unquantized columns that partially compensates for the rounding error. The update rule is:
Here Wk is the k-th column of the original weights, Ŵ1:(k−1) denotes the already-quantized first k−1 columns, Q is the rounding subroutine (nearest or stochastic), and ak is a vector that determines how much of the past error feeds back into the current column.
After processing all n columns, the final quantized matrix satisfies:
where U is a strictly upper-triangular matrix whose columns are the vectors ak, and Q acts elementwise. If we let η = Q(W + (W − Ŵ)U) − (W + (W − Ŵ)U) denote the elementwise quantization error, then Ŵ − W = η(U + I)−1 and the proxy objective becomes:
Nearest rounding snaps to the closest grid point. For integer quantization, if the fractional part is 0.3, you round down. Simple, deterministic, biased toward the nearest value.
Stochastic rounding rounds up with probability equal to the fractional part and down otherwise. If the fractional part is 0.3, you round up with probability 0.3. This is unbiased: E[Q(x)] = x. It sounds noisier, but the unbiasedness turns out to be mathematically important for the theoretical analysis.
There are infinitely many choices for the feedback matrix U — each gives a different adaptive rounding algorithm. Which U minimizes the proxy loss? QuIP shows that the answer comes from the LDL decomposition of the Hessian.
Any symmetric positive definite matrix H can be factored as:
where D is a diagonal matrix with positive entries and Û is a strictly upper-triangular matrix. This is essentially a Cholesky factorization rearranged: if H = LLT is the Cholesky, then D = diag(L)2 and (Û + I) = L diag(L)−1.
If we choose U = Û from the LDL decomposition, something magical happens. Recall the proxy objective:
Substituting H = (Û + I)D(Û + I)T and setting U = Û:
The (Û + I) terms cancel perfectly! The proxy loss reduces to a weighted sum of the elementwise quantization errors, weighted by the diagonal entries of D. There is no cross-talk between columns — the error in column k is penalized only by Dkk, independent of all other columns.
LDLQ is worst-case and average-case optimal among all rounding methods that specify the linear feedback U as a function of H (not of W), and when rounding to the integers. Specifically, for Q being either nearest or stochastic rounding, and for all positive semi-definite H:
for any other rounding method A in the class. The worst-case loss is (m/n) tr(D) and the average-case loss is (m/cn) tr(D), where c = 12 for nearest rounding and c = 6 for stochastic.
Here is a surprising theoretical contribution: OPTQ (formerly GPTQ) is equivalent to LDLQ. QuIP without incoherence processing is a more efficient implementation of OPTQ — it uses one Cholesky decomposition instead of a full matrix inversion. This means QuIP's theoretical guarantees also apply retroactively to OPTQ, providing the first theoretical analysis of that widely-used algorithm.
LDLQ is optimal for a given Hessian H. But how good is that optimum? The answer depends on a property called incoherence — and this is QuIP's central insight.
A symmetric matrix H ∈ Rn×n is μ-incoherent if it has an eigendecomposition H = QAQT such that for all entries i, j:
Similarly, a weight matrix W ∈ Rm×n is μ-incoherent if for all entries:
In plain language: incoherence means no single entry is too large relative to the overall matrix. The eigenvectors of H are spread evenly across coordinates rather than concentrated on a few axes, and the weights are spread evenly rather than having a few outliers.
The connection is made precise by the following lemma. Let H be μ-incoherent with LDL decomposition H = (Û + I)D(Û + I)T. Then:
Remember that LDLQ's proxy loss is proportional to tr(D). So the smaller μ is (the more incoherent H is), the smaller the quantization error. For a perfectly incoherent matrix (μ = O(1)), tr(D) depends only on the spectrum of H, not on how the eigenvectors are aligned — which is the best you can hope for.
Without adaptive rounding, the simplest methods (nearest and stochastic rounding) have worst-case proxy loss of (m/4) tr(H) and average loss of (m/c) tr(H). The LDLQ loss with an incoherent Hessian is bounded by:
where k is the approximate rank of H. Since real Hessians are approximately low-rank (most eigenvalues decay rapidly), the factor μ²k/n can be much smaller than 1. This is the quantitative reason incoherent LDLQ beats naive rounding.
A weight matrix before and after incoherence processing. Brighter = larger magnitude. Watch how outliers get spread evenly.
Theorem 4 in the paper proves the converse: without the incoherence assumption, there exist Hessians H̃ with the same spectrum as H where LDLQ achieves exactly the same worst-case loss as naive nearest rounding. Incoherence is not just helpful — it is necessary for LDLQ to outperform baselines.
We now have all the pieces. LDLQ is the optimal rounding method, and incoherence makes that optimum actually good. QuIP's contribution is a practical procedure that achieves both: force incoherence via random orthogonal matrices, then run LDLQ.
A random orthogonal matrix has eigenvectors that are uniformly distributed — each entry has magnitude approximately 1/√n. Multiplying H by a random orthogonal matrix from both sides "scrambles" its eigenvectors, making them incoherent with high probability.
Specifically, let U ∈ Rm×m and V ∈ Rn×n be two random orthogonal matrices. Transform:
This preserves the proxy objective because:
The U rotation acts on the rows (output dimensions) and the V rotation acts on the columns (input dimensions). After this transformation, both W and H are incoherent with high probability.
Storing and multiplying by full n × n random orthogonal matrices would be too expensive at inference time. QuIP uses a Kronecker product of smaller random orthogonal matrices. Factor n = pq (where p ≈ q ≈ √n), then set:
where UL is p × p and UR is q × q. Multiplying a vector x by U is done by reshaping x into a p × q matrix, multiplying on the left by UL and on the right by URT, then reshaping back. Cost: O(n(p+q)) = O(n3/2) instead of O(n2). With k = 2 Kronecker factors, this is fast enough for practical inference.
After quantization, the incoherence transform must be reverted so the quantized model can run normally:
QuIP provides the first theoretical analysis for an LLM-scale quantization algorithm. Let us walk through the chain of results that connects incoherence to provable quantization quality.
The LDLQ proxy loss is proportional to tr(D), the trace of the diagonal matrix in the LDL decomposition. For nearest rounding:
For stochastic rounding, replace 12 with 6. The factor m/n comes from having m rows, each of dimension n.
If H is μ-incoherent, then:
This is a novel result that connects the LDL diagonal to the Hessian spectrum via incoherence. The proof uses the fact that incoherent matrices have entries of bounded magnitude, which constrains how much the LDL decomposition can concentrate energy onto the diagonal.
For comparison, the simplest methods achieve:
Putting Theorem 1, Lemma 2, and Lemma 3 together for a rank-k Hessian (with μ²k < n):
By Cauchy-Schwarz, (tr(H1/2))² ≤ k · tr(H), which gives the final inequality. The improvement factor is μ²k/n. For a "most" random n×n matrix, μ = O(√(log n)) and many real Hessians have effective rank k much smaller than n, so this factor is tiny.
The paper also proves the converse. For any Hessian spectrum, there exists a (coherent) Hessian H̃ with that spectrum where LDLQ achieves exactly the same loss as naive stochastic rounding. Without incoherence, the spectral bound cannot distinguish LDLQ from baselines. This theorem closes the loop: incoherence is both sufficient and necessary for LDLQ to outperform.
Let H = VHVT where V = V1 ⊗ V2 ⊗ ... ⊗ Vk is a Kronecker product of random orthogonal matrices. Then with probability at least 1 − δ:
The incoherence parameter is only poly-logarithmic in the matrix size. Two Kronecker factors (k = 2) suffice in practice.
So far we have analyzed rounding to the integers. In practice, we round to a finite grid: scale the weights, shift them, and clamp to {0, 1, ..., 2b−1}. This creates a subtlety that the paper addresses carefully.
LDLQ assumes weights are rounded to the nearest integer from the full integer lattice Zn. But after scaling and shifting, the "real" LDLQ algorithm clamps weights to [0, 2b−1]. This clamping can hurt: on a carefully constructed counterexample (H equal to (In + ε enenT)/n), clamped LDLQ with nearest rounding is asymptotically worse than plain nearest rounding on a 4-point grid.
To get a clean theoretical bound, the paper proposes a "fixed" algorithm that constrains the quantized weights to lie inside the grid. The optimization problem is:
This can be solved with ADMM using stochastic rounding and U = R−1 − I. For sufficiently large c, the solution is exactly base QuIP (the constraint becomes inactive). The resulting bound on quantization error is:
where b is the number of bits. The 4b in the denominator shows that each additional bit quadruples the precision — matching the intuition that going from 2 to 3 bits should be a significant jump.
Instead of mapping to a uniform grid {0, 1, ..., 2b−1}, one could map to a better-structured codebook. The E8 lattice (the densest 8-dimensional lattice) provides the highest packing density in 8 dimensions. The subsequent paper QuIP# extends QuIP with E8 lattice codebooks, achieving even better 2-bit results. In the original QuIP, the focus is on proving that even simple uniform grids work when combined with incoherence processing.
QuIP is evaluated on OPT models (125M to 66B parameters) and Llama 2 70B, across language generation (WikiText2, PTB, C4) and zero-shot tasks (ArcE, PiQA, StoryCloze, LAMBADA).
| Bits | Method | Wiki↓ | C4↓ | ArcE↑ | PiQA↑ | SC↑ |
|---|---|---|---|---|---|---|
| 16 | Full | 3.32 | 5.71 | 59.72 | 80.90 | 79.95 |
| 4 | OPTQ | 3.60 | 5.91 | 58.96 | 80.52 | 79.12 |
| 4 | QuIP | 3.53 | 5.87 | 59.81 | 80.47 | 79.63 |
| 3 | OPTQ | 4.91 | 7.10 | 54.38 | 78.56 | 77.72 |
| 3 | QuIP | 3.85 | 6.14 | 59.81 | 80.25 | 79.31 |
| 2 | OPTQ | 123.9 | 70.54 | 25.34 | 50.54 | 51.75 |
| 2 | QuIP | 6.33 | 8.94 | 54.38 | 75.08 | 75.37 |
At 2 bits, OPTQ produces a WikiText2 perplexity of 123.9 — effectively broken. QuIP achieves 6.33, which is close to the 3-bit OPTQ result (4.91). The zero-shot accuracies tell the same story: OPTQ at 2 bits is near random chance, while QuIP remains functional.
The paper evaluates all combinations of quantization and processing methods on OPT-30B:
| Bits | Method | Wiki↓ | PTB↓ | C4↓ | ArcE↑ | LAMB↑ |
|---|---|---|---|---|---|---|
| 16 | Full | 9.56 | 14.04 | 11.45 | 65.40 | 72.40 |
| 2 | OPTQ | 71.70 | 88.19 | 29.59 | 42.47 | 25.77 |
| 2 | LDLQ-RG | 49.40 | 73.45 | 29.12 | 41.20 | 26.35 |
| 2 | Near | 41547.8 | 34348.6 | 24815.7 | 25.80 | 0.00 |
| 2 | QuIP | 11.48 | 17.40 | 13.55 | 57.87 | 65.24 |
| 2 | Near+IncP | 12.04 | 18.12 | 14.11 | 56.36 | 60.64 |
A striking finding: as model size increases, the gap between 2-bit QuIP and 16-bit full precision shrinks. On OPT-125M, the 2-bit penalty is severe. By OPT-66B and Llama 2 70B, the gap is small. This hints that 2-bit inference may become increasingly viable as models get larger — the redundancy in larger models makes them more tolerant of extreme quantization.
| Method | Throughput (per token) |
|---|---|
| OPTQ | 53 ms |
| QuIP | 81 ms |
QuIP is about 1.5x slower than OPTQ per token due to the Kronecker orthogonal multiplication during dequantization. Both are faster than FP16 inference because the memory savings dominate. The overhead is modest and could be reduced with custom CUDA kernels.
| Bits | Rescale | Incoherence | Rescale+Inc | Resc+Inc+QR |
|---|---|---|---|---|
| 4 | 24.30 | 24.32 | 24.05 | 23.89 |
| 3 | 32.62 | 42.28 | 31.32 | 26.36 |
On OPT-350M: all sub-steps contribute. Diagonal rescaling, incoherence rotation, and quantization range adjustment each reduce perplexity. Random permutation within the Kronecker multiply also helps significantly (Δ perplexity of −74.2 at 2 bits on OPT-125M).
WikiText2 perplexity for OPTQ and QuIP at different bit widths across OPT model sizes.
QuIP introduced three ideas that shaped the field:
The follow-up paper extends QuIP with E8 lattice codebooks — replacing the uniform grid with the densest packing in 8 dimensions. This gives even better 2-bit results: near-lossless quantization at 2 bits for large models. QuIP# also introduces faster Hadamard-based incoherence (replacing random orthogonal with randomized Hadamard transforms for O(n log n) cost).
| Method | Adaptive rounding? | Incoherence? | Min bits | Theory? |
|---|---|---|---|---|
| RTN | No | No | 4 | No |
| SmoothQuant | No | Heuristic (per-channel) | 8 (W8A8) | No |
| GPTQ/OPTQ | Yes (=LDLQ) | No | 3 | Via QuIP |
| SqueezeLLM | Sensitivity-based | No | 3 | No |
| QuIP | Yes (LDLQ) | Yes (random orth.) | 2 | Yes |
| QuIP# | Yes (LDLQ) | Yes (Hadamard) | 2 | Yes |
QuIP demonstrated that the 2-bit frontier is not a hard wall — it is a function of how well you preprocess the matrices. This insight spawned a wave of work on rotation-based quantization methods. The idea that making matrices look random (incoherent) helps quantization is now a standard tool in the compression toolkit.