QuiP — Veanors

Chapter 0: The Memory Wall

A 70-billion parameter language model stored in 16-bit precision occupies 140 GB of GPU memory. A single A100 GPU has 80 GB. So serving one copy of a 70B model requires at minimum two high-end GPUs — and that is before you allocate memory for KV caches, activations, or batched requests.

This is the memory wall. Inference speed in modern LLMs is almost always bottlenecked by memory bandwidth, not compute. Every generated token requires reading every weight matrix from memory. The arithmetic is fast; the reading is slow.

The arithmetic is stark. If you could store each weight in 2 bits instead of 16, that 140 GB model shrinks to 17.5 GB — fitting on a single consumer GPU. Inference throughput scales almost linearly with the compression ratio because you move 8x less data. But can you actually quantize to 2 bits without destroying the model?
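The numbers above can be reproduced in a few lines. This is a minimal sketch that counts weight storage only; the helper name weight_memory_gb is hypothetical, 1 GB is taken to be 10^9 bytes, and the per-group scale metadata that a real quantized format may add is ignored.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight storage in GB (10^9 bytes), ignoring KV cache and activations."""
    return num_params * bits_per_weight / 8 / 1e9


# Footprint of a 70B-parameter model at several bit widths.
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2} bits: {weight_memory_gb(70e9, bits):7.2f} GB")
```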

Before QuIP, the answer was no. The state of the art in post-training quantization (PTQ) was GPTQ/OPTQ, which worked well at 4 bits and reasonably at 3 bits. At 2 bits, GPTQ's perplexity exploded — the quantized model produced nonsense. Other methods like SmoothQuant and ZeroQuant also failed at this extreme compression.

Why 2 bits is so hard

At 2 bits per weight, you have exactly 4 representable values. A typical 16-bit weight can be any of 65,536 values. Collapsing that range into 4 slots means each quantized value must represent a huge swath of the original distribution. Tiny errors per weight compound across thousands of matrix multiplications, and the model diverges.

The key question QuIP asks: under what conditions on the weight matrix does quantization succeed? The authors' answer is incoherence: when weights are spread evenly across coordinates and the important directions for rounding are not aligned with the coordinate axes, quantization error stays small. And this condition can be forced via a simple preprocessing step.

The Compression Landscape

Figure: memory footprint for a 70B parameter model at different bit widths.

Why is LLM inference typically bottlenecked by memory bandwidth rather than compute?

Each token generation requires reading the full weight matrices from memory, and the time to read data dominates the time to do arithmetic — so moving less data (via quantization) directly increases throughput
GPUs do not have enough CUDA cores
The softmax operation is too slow

Chapter 1: The Proxy Objective

Post-training quantization works layer by layer. For a single linear layer with weight matrix W ∈ R^m×n, we want to find a quantized matrix Ŵ that minimally degrades the layer's output. But "minimally degrades output" depends on the inputs the layer sees.

The quadratic proxy

Following Nagel et al. (2020), we measure the error between the original and quantized layer outputs on a calibration set. If x is a random input vector drawn from calibration data, the expected squared output error is:

ℓ(Ŵ) = E_x[ ||(Ŵ − W)x||² ] = tr((Ŵ − W)H(Ŵ − W)^T)

where H = E_x[xx^T] is the second moment matrix of the calibration inputs, also called the proxy Hessian. This is the matrix that tells us which directions in weight space matter most — if the inputs often point in direction v, then errors along v are amplified and H has a large eigenvalue in that direction.

Why "Hessian"? It is called the proxy Hessian because it approximates the second-order term in a Taylor expansion of the task loss around the original weights. For a squared-error loss on the layer's output, the Hessian with respect to each row of W is E[xx^T] up to a constant factor. This connection means minimizing the proxy objective is a principled approximation to minimizing actual task degradation.
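As a sanity check on the identity ℓ(Ŵ) = E_x[||(Ŵ − W)x||²] = tr((Ŵ − W)H(Ŵ − W)^T), the sketch below builds an empirical H from synthetic calibration inputs and compares the two sides. The data, the shapes, and the crude round-to-integer quantizer are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, num_samples = 4, 6, 1000

# Synthetic calibration inputs with very different scales per coordinate.
X = rng.normal(size=(num_samples, n)) * np.linspace(0.1, 3.0, n)
W = rng.normal(size=(m, n))
W_hat = np.rint(W)                      # stand-in quantizer: nearest integer

H = X.T @ X / num_samples               # empirical E[x x^T]
E = W_hat - W

proxy_loss = np.trace(E @ H @ E.T)
# Mean over samples of ||(W_hat - W) x||^2, computed directly.
mean_output_error = np.mean(np.sum((X @ E.T) ** 2, axis=1))

print(proxy_loss, mean_output_error)    # the two values agree
```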

Why this formulation helps

Crucially, this objective decomposes across rows. Each row of W can be quantized independently, in parallel. The proxy loss for row i is:

ℓ_i(ŵ_i) = (ŵ_i − w_i)H(ŵ_i − w_i)^T

where w_i is the i-th row of W. This means we can quantize all m rows in parallel — essential for making PTQ tractable on models with billions of parameters.
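The row decomposition is easy to confirm numerically. In this sketch (random stand-ins for W and H, nearest-integer rounding as a placeholder quantizer) the per-row losses are computed independently and summed; the sum matches the trace form of the full objective.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 8

A = rng.normal(size=(n, n))
H = A @ A.T / n                         # any symmetric PSD matrix stands in for H
W = rng.normal(size=(m, n))
E = np.rint(W) - W                      # error of a crude nearest-integer quantizer

total_loss = np.trace(E @ H @ E.T)
# row_losses[i] = e_i H e_i^T, where e_i is row i of E.
row_losses = np.einsum("ij,jk,ik->i", E, H, E)

print(row_losses.sum(), total_loss)     # the two values agree
```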

The role of H

If H were the identity matrix (all input directions equally important), the proxy loss would reduce to the sum of squared per-weight errors, and plain nearest rounding would already be optimal. But real Hessians are highly non-isotropic: a few directions carry most of the signal. This means:

Errors in high-eigenvalue directions are catastrophic
Errors in low-eigenvalue directions are nearly free
A smart rounding method should focus its precision budget on the high-eigenvalue directions

This insight is what drives the entire QuIP approach: make the Hessian more isotropic (via incoherence processing), and rounding becomes uniformly easier in all directions.

Calibration data. QuIP uses 128 random 2048-token segments from the C4 dataset. No task-specific data is viewed during quantization — this is a key advantage of PTQ over quantization-aware training (QAT), which requires the full training pipeline.

What does the proxy Hessian H = E[xx^T] capture about the calibration data?

The second moment structure of the layer's inputs — which directions in input space carry the most signal, so that errors along those directions are penalized more heavily
The gradient of the loss function
The distribution of weight magnitudes

Chapter 2: Adaptive Rounding

The simplest quantization strategy is nearest rounding: snap each weight to the closest representable value independently. But this ignores the Hessian entirely. If two weights lie in a direction that H says is very important, we might be able to reduce the total error by rounding one up and the other down to cancel errors — even if that means one weight is rounded to its second closest value.

The column-by-column framework

QuIP's adaptive rounding processes weight columns one at a time, in order k = 1, 2, ..., n. At each step, it adds a "correction term" to column k, a linear function of the rounding errors already made in columns 1 through k−1, and then rounds the corrected column. The correction partially compensates for the earlier errors. The update rule is:

Ŵ_k = Q(W_k + (W_1:(k−1) − Ŵ_1:(k−1))a_k)

Here W_k is the k-th column of the original weights, Ŵ_1:(k−1) denotes the already-quantized first k−1 columns, Q is the rounding subroutine (nearest or stochastic), and a_k is a vector that determines how much of the past error feeds back into the current column.

The feedback vectors a_k. Each a_k is a column of a strictly upper-triangular matrix U. Because U is strictly upper-triangular, the correction for column k depends only on the errors in columns 1 through k−1 and never looks into the future. This makes the algorithm causal and sequential.
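The update rule can be sketched as a short loop. The function name adaptive_round and the arbitrary feedback matrix used in the demo are illustrative assumptions; any strictly upper-triangular U fits the framework, and U = 0 recovers plain nearest rounding.

```python
import numpy as np

def adaptive_round(W, U, quantize=np.rint):
    """Round W column by column with linear feedback from earlier rounding errors.

    U must be strictly upper-triangular; column k of U holds the feedback vector a_k.
    """
    W_hat = np.zeros_like(W, dtype=float)
    for k in range(W.shape[1]):
        past_error = W[:, :k] - W_hat[:, :k]          # errors in the columns before k
        W_hat[:, k] = quantize(W[:, k] + past_error @ U[:k, k])
    return W_hat


rng = np.random.default_rng(0)
W = rng.normal(scale=2.0, size=(3, 5))

# With no feedback the procedure is exactly nearest rounding.
no_feedback = adaptive_round(W, np.zeros((5, 5)))

# An arbitrary strictly upper-triangular U, just to exercise the feedback path.
U = np.triu(rng.normal(scale=0.3, size=(5, 5)), k=1)
with_feedback = adaptive_round(W, U)
print(with_feedback)
```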

The matrix equation

After processing all n columns, the final quantized matrix satisfies:

Ŵ = Q(W + (W − Ŵ)U)

where U is a strictly upper-triangular matrix whose columns are the vectors a_k, and Q acts elementwise. If we let η = Q(W + (W − Ŵ)U) − (W + (W − Ŵ)U) denote the elementwise quantization error, then Ŵ − W = η(U + I)⁻¹ and the proxy objective becomes:

tr((Ŵ − W)H(Ŵ − W)^T) = tr(η(U + I)⁻¹H(U + I)^−Tη^T)
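Both identities in this section can be checked numerically. The sketch below runs the column-by-column rule with nearest rounding and an arbitrary strictly upper-triangular U (an illustrative choice, not the optimal one), then verifies Ŵ − W = η(U + I)⁻¹ and the rewritten proxy objective against a random stand-in for H.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
W = rng.normal(scale=2.0, size=(m, n))
U = np.triu(rng.normal(scale=0.3, size=(n, n)), k=1)   # strictly upper-triangular
A = rng.normal(size=(n, n))
H = A @ A.T / n                                        # stand-in proxy Hessian

# Column-by-column rounding with feedback from earlier columns.
W_hat = np.zeros_like(W)
for k in range(n):
    W_hat[:, k] = np.rint(W[:, k] + (W[:, :k] - W_hat[:, :k]) @ U[:k, k])

# eta is the elementwise rounding error of the corrected weights.
eta = W_hat - (W + (W - W_hat) @ U)
M = np.linalg.inv(U + np.eye(n))

lhs = W_hat - W
rhs = eta @ M
loss_direct = np.trace(lhs @ H @ lhs.T)
loss_via_eta = np.trace(eta @ M @ H @ M.T @ eta.T)
print(loss_direct, loss_via_eta)                       # the two values agree
```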

Nearest vs. stochastic rounding

Nearest rounding snaps to the closest grid point. For integer quantization, if the fractional part is 0.3, you round down. Simple, deterministic, biased toward the nearest value.

Stochastic rounding rounds up with probability equal to the fractional part and down otherwise. If the fractional part is 0.3, you round up with probability 0.3. This is unbiased: E[Q(x)] = x. It sounds noisier, but the unbiasedness turns out to be mathematically important for the theoretical analysis.
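A small experiment makes the bias difference concrete. This sketch rounds the value 0.3 many times with each scheme; the function names are illustrative. Note that NumPy's rint rounds exact halves to the nearest even integer, which does not matter for this example.

```python
import numpy as np

def nearest_round(x):
    return np.rint(x)

def stochastic_round(x, rng):
    """Round up with probability equal to the fractional part, else round down."""
    low = np.floor(x)
    frac = x - low
    return low + (rng.random(np.shape(x)) < frac)


rng = np.random.default_rng(0)
x = np.full(200_000, 0.3)

mean_nearest = nearest_round(x).mean()            # always rounds 0.3 down to 0
mean_stochastic = stochastic_round(x, rng).mean() # close to 0.3 on average
print(mean_nearest, mean_stochastic)
```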

In practice, nearest rounding with LDLQ gives the best results. The stochastic variant matters mainly for the theory — it provides cleaner bounds. QuIP uses nearest rounding in all its main experiments.

Why does adaptive rounding process columns sequentially rather than all at once?

Each column's correction term depends on the rounding errors of all previous columns — the feedback from past errors lets the algorithm partially cancel mistakes, reducing the total proxy loss
Processing all columns at once would be too slow
The GPU can only process one column at a time

Chapter 3: LDLQ Optimality

There are infinitely many choices for the feedback matrix U — each gives a different adaptive rounding algorithm. Which U minimizes the proxy loss? QuIP shows that the answer comes from the LDL decomposition of the Hessian.

The LDL decomposition

Any symmetric positive definite matrix H can be factored as:

H = (Û + I)D(Û + I)^T

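where D is a diagonal matrix with positive entries and Û is a strictly upper-triangular matrix. This is essentially a Cholesky factorization rearranged: if H = RR^T with R upper-triangular and positive on the diagonal (the Cholesky factor of H with its rows and columns taken in reverse order), then D = diag(R)² and (Û + I) = R diag(R)⁻¹. The standard lower-triangular Cholesky factor would instead give a unit lower-triangular factor, which is not the form needed here.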
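One way to compute this factorization with standard tools is sketched below. It assumes H is symmetric positive definite and obtains an upper-triangular factor by running NumPy's lower-triangular Cholesky on H with its rows and columns reversed; the function name udu_decomposition is hypothetical.

```python
import numpy as np

def udu_decomposition(H):
    """Return (U_hat, d) with H = (U_hat + I) diag(d) (U_hat + I)^T.

    U_hat is strictly upper-triangular and d holds the positive diagonal of D.
    """
    n = H.shape[0]
    L = np.linalg.cholesky(H[::-1, ::-1])   # lower factor of the reversed matrix
    R = L[::-1, ::-1]                       # reversing back gives upper R with H = R R^T
    r = np.diag(R)
    unit_upper = R / r                      # divide column j by R[j, j]
    return unit_upper - np.eye(n), r ** 2


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
H = A @ A.T + 0.1 * np.eye(6)

U_hat, d = udu_decomposition(H)
T = U_hat + np.eye(6)
H_rebuilt = T @ np.diag(d) @ T.T
print(np.abs(H_rebuilt - H).max())
```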

Why LDL is optimal

If we choose U = Û from the LDL decomposition, something magical happens. Recall the proxy objective:

tr(η(U + I)⁻¹H(U + I)^−Tη^T)

Substituting H = (Û + I)D(Û + I)^T and setting U = Û:

= tr(η(Û + I)⁻¹(Û + I)D(Û + I)^T(Û + I)^−Tη^T) = tr(ηDη^T)

The (Û + I) terms cancel perfectly! The proxy loss reduces to a weighted sum of the elementwise quantization errors, weighted by the diagonal entries of D. There is no cross-talk between columns — the error in column k is penalized only by D_kk, independent of all other columns.

This is the key result. With the LDL assignment for U, each column's quantization error contributes independently to the total loss. No other choice of U achieves this. The algorithm is called LDLQ (LDL Quantization).
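The cancellation can be observed directly. The sketch below implements LDLQ with nearest rounding to the integers on synthetic data (the shapes, the anisotropic inputs, and the function names are assumptions for illustration) and checks that the proxy loss equals the D-weighted sum of squared rounding errors.

```python
import numpy as np

def udu_decomposition(H):
    """H = (U_hat + I) diag(d) (U_hat + I)^T with U_hat strictly upper-triangular."""
    L = np.linalg.cholesky(H[::-1, ::-1])
    R = L[::-1, ::-1]
    r = np.diag(R)
    return R / r - np.eye(H.shape[0]), r ** 2

def ldlq(W, H):
    """Adaptive rounding with the feedback matrix taken from the decomposition of H."""
    U_hat, d = udu_decomposition(H)
    W_hat = np.zeros_like(W)
    for k in range(W.shape[1]):
        W_hat[:, k] = np.rint(W[:, k] + (W[:, :k] - W_hat[:, :k]) @ U_hat[:k, k])
    eta = W_hat - (W + (W - W_hat) @ U_hat)   # elementwise rounding errors
    return W_hat, eta, d


rng = np.random.default_rng(0)
m, n = 16, 32
X = rng.normal(size=(512, n)) * np.linspace(0.1, 3.0, n)   # anisotropic inputs
H = X.T @ X / 512 + 1e-3 * np.eye(n)
W = rng.uniform(0.0, 3.0, size=(m, n))

W_hat, eta, d = ldlq(W, H)
E = W_hat - W
loss = np.trace(E @ H @ E.T)
weighted_errors = np.sum(eta ** 2 * d)        # tr(eta D eta^T)
print(loss, weighted_errors)                  # the two values agree
```

Because every nearest-rounding error has magnitude at most 1/2, the loss computed here can never exceed (m/4) · tr(D).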

Theorem 1: LDLQ is optimal

LDLQ is worst-case and average-case optimal among all rounding methods that specify the linear feedback U as a function of H (not of W), and when rounding to the integers. Specifically, for Q being either nearest or stochastic rounding, and for all positive semi-definite H:

L_worst(LDLQ, H) ≤ L_worst(A, H) and L_avg(LDLQ, H) ≤ L_avg(A, H)

for any other rounding method A in the class. For LDLQ the worst-case loss is (m/4) tr(D) and the average-case loss is (m/c) tr(D), where c = 12 for nearest rounding and c = 6 for stochastic.

OPTQ is a special case

Here is a surprising theoretical contribution: OPTQ (formerly GPTQ) is equivalent to LDLQ. QuIP without incoherence processing is a more efficient implementation of OPTQ — it uses one Cholesky decomposition instead of a full matrix inversion. This means QuIP's theoretical guarantees also apply retroactively to OPTQ, providing the first theoretical analysis of that widely-used algorithm.

Verified empirically: The authors ran both LDLQ and OPTQ on synthetic data (W ~ Unif[0,1]^1000×1000) and confirmed that both produce identical quantized outputs, validating the mathematical equivalence.

Why does choosing U from the LDL decomposition make the proxy loss decompose into independent per-column terms?

When U equals Û from H = (Û+I)D(Û+I)^T, the (Û+I) terms in the proxy loss formula cancel exactly, leaving tr(ηDη^T) — a diagonal weighting where each column's error is independent
The LDL decomposition diagonalizes the weight matrix
The Cholesky factor is always the identity matrix

Chapter 4: Incoherence

LDLQ is optimal for a given Hessian H. But how good is that optimum? The answer depends on a property called incoherence — and this is QuIP's central insight.

What is incoherence?

A symmetric matrix H ∈ R^n×n is μ-incoherent if it has an eigendecomposition H = QΛQ^T (here Q is the orthogonal matrix of eigenvectors, not the rounding operator) such that for all entries i, j:

|Q_ij| = |e_i^T Q e_j| ≤ μ / √n

Similarly, a weight matrix W ∈ R^m×n is μ-incoherent if for all entries:

|W_ij| ≤ μ ||W||_F / √(mn)

In plain language: incoherence means no single entry is too large relative to the overall matrix. The eigenvectors of H are spread evenly across coordinates rather than concentrated on a few axes, and the weights are spread evenly rather than having a few outliers.
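Both definitions translate into one-line measurements of the smallest μ that satisfies them. The helper names below are hypothetical. For H, the sketch uses whichever eigenvectors NumPy's eigh returns, so when eigenvalues repeat it reports the value for that particular eigenbasis, which may not be the smallest achievable.

```python
import numpy as np

def mu_of_weights(W):
    """Smallest mu with |W_ij| <= mu * ||W||_F / sqrt(m n) for all entries."""
    return np.abs(W).max() * np.sqrt(W.size) / np.linalg.norm(W)

def mu_of_hessian(H):
    """Smallest mu with |Q_ij| <= mu / sqrt(n) for the eigenvectors Q from eigh."""
    _, Q = np.linalg.eigh(H)
    return np.abs(Q).max() * np.sqrt(H.shape[0])


flat = np.ones((4, 9))                 # perfectly spread weights
spiky = np.zeros((4, 9))
spiky[0, 0] = 1.0                      # all mass in one outlier entry

print(mu_of_weights(flat))             # best case: mu = 1
print(mu_of_weights(spiky))            # worst case: mu = sqrt(m n) = 6
print(mu_of_hessian(np.diag([1.0, 2.0, 3.0, 4.0])))   # axis-aligned: mu = sqrt(n) = 2
```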

Incoherence as outlier reduction. A major problem in LLM quantization is outlier weights — a handful of entries with magnitudes 10-100x larger than the rest. These outliers force you to set the quantization grid wide, wasting precision on the majority of small weights. Incoherence is a principled way to say "spread the outliers across all coordinates so no single entry is extreme." Methods like SmoothQuant do something similar heuristically; QuIP provides the formal framework.

Why incoherence helps quantization

The connection is made precise by the following lemma. Let H be μ-incoherent with LDL decomposition H = (Û + I)D(Û + I)^T. Then:

tr(D) ≤ μ²/n · tr(H^1/2)²

Remember that LDLQ's proxy loss is proportional to tr(D). So the smaller μ is (the more incoherent H is), the smaller the quantization error. For a perfectly incoherent matrix (μ = O(1)), tr(D) depends only on the spectrum of H, not on how the eigenvectors are aligned — which is the best you can hope for.

The baseline comparison

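Without adaptive rounding, the simplest methods fare as follows: stochastic rounding has a worst-case proxy loss of (m/4) tr(H), and both nearest and stochastic rounding have an average loss of (m/c) tr(H), with c = 12 for nearest and c = 6 for stochastic rounding. The LDLQ loss with an incoherent Hessian is bounded by: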

L_worst(LDLQ, H) ≤ mμ²/(4n) · tr(H^1/2)² ≤ mμ²k/(4n) · tr(H)

where k is the approximate rank of H. Since real Hessians are approximately low-rank (most eigenvalues decay rapidly), the factor μ²k/n can be much smaller than 1. This is the quantitative reason incoherent LDLQ beats naive rounding.

Incoherence Visualization

A weight matrix before and after incoherence processing. Brighter = larger magnitude. Watch how outliers get spread evenly.

Without incoherence, LDLQ cannot beat baselines

Theorem 4 in the paper proves the converse: without the incoherence assumption, there exist Hessians H̃ with the same spectrum as H where LDLQ achieves exactly the same worst-case loss as naive stochastic rounding. Incoherence is not just helpful; it is necessary for LDLQ to outperform baselines.

What does it mean for a Hessian H to be μ-incoherent, and why does this help quantization?

The eigenvectors of H have entries bounded by μ/√n — no eigenvector is aligned with a coordinate axis. This bounds tr(D) in the LDL decomposition, which directly bounds the LDLQ quantization error
H has small eigenvalues
H is a diagonal matrix

Chapter 5: The QuIP Algorithm

We now have all the pieces. LDLQ is the optimal rounding method, and incoherence makes that optimum actually good. QuIP's contribution is a practical procedure that achieves both: force incoherence via random orthogonal matrices, then run LDLQ.

How to make matrices incoherent

A random orthogonal matrix has columns that are uniformly distributed on the unit sphere, so each entry has magnitude approximately 1/√n. Conjugating H by a random orthogonal matrix (H → VHV^T) rotates its eigenvectors into such random directions, making them incoherent with high probability.

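Specifically, let U ∈ R^m×m and V ∈ R^n×n be two random orthogonal matrices (not to be confused with the feedback matrix U of the previous chapters). Transform:

W ← UWV^T,  H ← VHV^T

This preserves the proxy objective. Writing E = Ŵ − W for the error in the original basis, the error in the new basis is UEV^T, and because U^TU = I and V^TV = I:

tr((UEV^T)(VHV^T)(UEV^T)^T) = tr(UEHE^TU^T) = tr(EHE^T)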

The U rotation acts on the rows (output dimensions) and the V rotation acts on the columns (input dimensions). After this transformation, both W and H are incoherent with high probability.
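A short experiment illustrates both effects of the rotation. The sketch below uses dense random orthogonal matrices drawn via a QR decomposition (an assumption made for simplicity, not the Kronecker construction) on a synthetic weight matrix with one large outlier, and a small random perturbation as a stand-in for quantization error.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))      # sign fix so the distribution is uniform

def mu_of_weights(W):
    return np.abs(W).max() * np.sqrt(W.size) / np.linalg.norm(W)


rng = np.random.default_rng(0)
m = n = 64
W = 0.01 * rng.normal(size=(m, n))
W[0, 0] = 1.0                           # a single large outlier weight
E = 0.005 * rng.normal(size=(m, n))     # stand-in for the error W_hat - W
X = rng.normal(size=(256, n))
H = X.T @ X / 256

U, V = random_orthogonal(m, rng), random_orthogonal(n, rng)
E_rot, H_rot = U @ E @ V.T, V @ H @ V.T

loss_before = np.trace(E @ H @ E.T)
loss_after = np.trace(E_rot @ H_rot @ E_rot.T)
mu_before = mu_of_weights(W)
mu_after = mu_of_weights(U @ W @ V.T)
print(loss_before, loss_after)          # the proxy loss is unchanged
print(mu_before, mu_after)              # the outlier is spread out, so mu drops
```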

Efficient orthogonal multiplication

Storing and multiplying by full n × n random orthogonal matrices would be too expensive at inference time. QuIP uses a Kronecker product of smaller random orthogonal matrices. Factor n = pq (where p ≈ q ≈ √n), then set:

U = U_L ⊗ U_R

where U_L is p × p and U_R is q × q. Multiplying a vector x by U is done by reshaping x into a p × q matrix, multiplying on the left by U_L and on the right by U_R^T, then reshaping back. Cost: O(n(p+q)) = O(n^3/2) instead of O(n²). With k = 2 Kronecker factors, this is fast enough for practical inference.
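The reshape trick can be verified against the explicit Kronecker product on a small example. The function name kron_matvec is illustrative, and the reshape is row-major, which is what makes the left factor act on rows and the right factor act on columns.

```python
import numpy as np

def kron_matvec(U_L, U_R, x):
    """Compute (U_L kron U_R) @ x without forming the n x n matrix."""
    p, q = U_L.shape[0], U_R.shape[0]
    X = x.reshape(p, q)                       # row-major reshape of the length-pq vector
    return (U_L @ X @ U_R.T).reshape(-1)


rng = np.random.default_rng(0)
p, q = 4, 5
U_L, _ = np.linalg.qr(rng.normal(size=(p, p)))
U_R, _ = np.linalg.qr(rng.normal(size=(q, q)))
x = rng.normal(size=p * q)

fast = kron_matvec(U_L, U_R, x)
dense = np.kron(U_L, U_R) @ x                 # explicit n x n multiply, for comparison only
print(np.abs(fast - dense).max())
```

The fast path performs two small matrix products costing about pq(p + q) multiplications, versus (pq)² for the dense product.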

Seeded randomness. The random orthogonal matrices are generated from a seed, so they do not need to be stored: the seed is saved with the model and the matrices are regenerated at inference time. The only storage overhead is the seed itself.

Algorithm 1: QuIP Pre-Processing

Add α · mean(diag(H)) · I to H for numerical stability (from OPTQ)

Compute D̃ = ⁴√(diag(H) / diag(W^TW)) elementwise — diagonal rescaling

W ← WD̃, H ← D̃⁻¹HD̃⁻¹ — apply diagonal rescaling

W ← UWV^T, H ← VHV^T — apply random orthogonal (incoherence)

s ← ρ||W||_F / √(mn), W ← ½(W/s + 1) — scale to the range [0, 1]

W ← clamp(W · (2^b−1), 0, 2^b−1) — rescale and clip to [0, 2^b−1]

Return W, H, s, D̃

Algorithm 3: The full QuIP procedure

Run Algorithm 1 (pre-processing) to get incoherent W, H, s, D̃

Compute LDL decomposition: H = (Û+I)D(Û+I)^T

For k = 1,...,n: Ŵ_k ← clamp(Q(W_k + (W − Ŵ)Û_k), 0, 2^b−1) — LDLQ with clamping

Run Algorithm 2 (post-processing) to revert incoherence transform

Algorithm 2: QuIP Post-Processing

After quantization, the incoherence transform must be reverted so the quantized model can run normally:

W ← s · ((W/(2^b−1)) · 2 − 1) — undo range scaling

W ← U^TWV, H ← V^THV — revert incoherence

W ← WD̃⁻¹ — revert diagonal rescaling
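The pre- and post-processing steps can be sketched end to end. In this sketch the range-scaling step is written as the exact inverse of the post-processing step, dense random orthogonal matrices stand in for the Kronecker-structured ones, and the function names, alpha = 0.01, and the choice of ρ are assumptions. Setting ρ = √(mn) guarantees that nothing is clipped, so without rounding the round trip recovers W exactly.

```python
import numpy as np

def random_orthogonal(n, rng):
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))

def preprocess(W, H, U, V, bits, rho, alpha=0.01):
    """Damp H, rescale, rotate, then map W into the range [0, 2^b - 1]."""
    H = H + alpha * np.mean(np.diag(H)) * np.eye(H.shape[0])
    d = (np.diag(H) / np.diag(W.T @ W)) ** 0.25      # diagonal of the rescaling matrix
    W = W * d                                        # scale column j of W by d[j]
    H = H / np.outer(d, d)                           # inverse scaling on both sides of H
    W, H = U @ W @ V.T, V @ H @ V.T                  # incoherence rotation
    s = rho * np.linalg.norm(W) / np.sqrt(W.size)
    levels = 2 ** bits - 1
    W = np.clip(0.5 * (W / s + 1.0) * levels, 0, levels)
    return W, H, s, d

def postprocess(W_grid, U, V, s, d, bits):
    """Undo range scaling, rotation, and diagonal rescaling."""
    levels = 2 ** bits - 1
    W = s * (W_grid / levels * 2.0 - 1.0)
    W = U.T @ W @ V
    return W / d


rng = np.random.default_rng(0)
m, n, bits = 6, 8, 2
W = rng.normal(size=(m, n))
X = rng.normal(size=(128, n))
H = X.T @ X / 128
U, V = random_orthogonal(m, rng), random_orthogonal(n, rng)

W_grid, H_rot, s, d = preprocess(W, H, U, V, bits, rho=np.sqrt(m * n))
W_back = postprocess(W_grid, U, V, s, d, bits)
print(np.abs(W_back - W).max())                      # round trip without rounding
```

In an actual run, the rounding step (LDLQ on W_grid with H_rot) sits between the two calls, and a smaller ρ trades some clipping for a finer grid.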

The key realization: At inference time, the orthogonal multiplication U^T(...)V happens inside the matrix-vector product. The quantized weights are stored in the incoherent basis — dequantization includes the rotation. So the inference kernel fuses dequantization + rotation + matmul, adding only the O(n^3/2) Kronecker multiplication overhead.

Why does QuIP use a Kronecker product of smaller orthogonal matrices instead of a single large one?

Full n×n orthogonal multiplication costs O(n²) per vector, but a Kronecker product of two √n × √n matrices costs only O(n^3/2) — fast enough for practical inference while still achieving incoherence with high probability
Kronecker products are easier to implement in PyTorch
Smaller matrices have better numerical precision

Chapter 6: Theoretical Bounds

QuIP provides the first theoretical analysis for an LLM-scale quantization algorithm. Let us walk through the chain of results that connects incoherence to provable quantization quality.

The error budget

The LDLQ proxy loss is proportional to tr(D), the trace of the diagonal matrix in the LDL decomposition. For nearest rounding:

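L_worst(LDLQ, H) = (m/4) · tr(D)

L_avg(LDLQ, H) = (m/12) · tr(D)

For stochastic rounding, replace 12 with 6. The factor m comes from the m rows, each of which contributes its own D-weighted sum of squared rounding errors. The 1/4 is the largest possible squared rounding error (an error of magnitude 1/2), and 1/12 and 1/6 are the mean squared errors of nearest and stochastic rounding when the fractional part is uniformly distributed.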
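The constants 4, 12, and 6 are properties of a single rounding error and can be estimated by simulation. This sketch assumes the fractional part of the value being rounded is uniform on [0, 1) and measures the squared error of each scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000)                    # fractional parts, uniform on [0, 1)

err_nearest = np.rint(x) - x
# Stochastic rounding: round up to 1 with probability x, otherwise down to 0.
err_stochastic = (rng.random(x.shape) < x).astype(float) - x

worst_sq_nearest = np.max(err_nearest ** 2)          # approaches 1/4
mean_sq_nearest = np.mean(err_nearest ** 2)          # approaches 1/12
mean_sq_stochastic = np.mean(err_stochastic ** 2)    # approaches 1/6
print(worst_sq_nearest, mean_sq_nearest, mean_sq_stochastic)
```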

Lemma 2: Incoherence bounds tr(D)

If H is μ-incoherent, then:

tr(D) ≤ μ²/n · (tr(H^1/2))²

This is a novel result that connects the LDL diagonal to the Hessian spectrum via incoherence. The proof uses the fact that incoherent matrices have entries of bounded magnitude, which constrains how much the LDL decomposition can concentrate energy onto the diagonal.
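The lemma can be illustrated on a synthetic Hessian. The sketch below builds H from a random orthogonal eigenbasis and an approximately rank-8 spectrum (both illustrative assumptions), measures μ from that eigenbasis, and compares tr(D) with the bound and with tr(H).

```python
import numpy as np

def udu_diagonal(H):
    """Diagonal D of H = (U_hat + I) D (U_hat + I)^T, via a reversed Cholesky."""
    L = np.linalg.cholesky(H[::-1, ::-1])
    return np.diag(L)[::-1] ** 2


rng = np.random.default_rng(0)
n, rank = 512, 8
eigvals = np.full(n, 1e-6)
eigvals[:rank] = 1.0                          # approximately rank-8 spectrum

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # random orthogonal eigenvectors
H = (Q * eigvals) @ Q.T
H = (H + H.T) / 2                             # remove floating-point asymmetry

mu = np.sqrt(n) * np.abs(Q).max()             # incoherence of this eigenbasis
tr_D = udu_diagonal(H).sum()
bound = mu ** 2 / n * np.sum(np.sqrt(eigvals)) ** 2
print(tr_D, bound, np.trace(H))
```

For comparison, a diagonal H with the same spectrum has tr(D) = tr(H), so any gap between tr(D) and tr(H) here comes from the orientation of the eigenvectors alone.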

Lemma 3: Baseline losses

For comparison, the simplest methods achieve:

L_worst(Stoch, H) = (m/4) · tr(H)

L_avg({Near, Stoch}, H) = (m/c) · tr(H) (c = 12 nearest, c = 6 stochastic)

Combining: the LDLQ advantage

Putting Theorem 1, Lemma 2, and Lemma 3 together for a rank-k Hessian (with μ²k < n):

L_worst(LDLQ, H) ≤ mμ²/(4n) · (tr(H^1/2))² ≤ mμ²k/(4n) · tr(H) = (μ²k/n) · L_worst(Stoch, H)

By Cauchy-Schwarz, (tr(H^1/2))² ≤ k · tr(H), which gives the final inequality. The improvement factor is μ²k/n. For most random n×n matrices, μ = O(√(log n)), and many real Hessians have effective rank k much smaller than n, so this factor is small.

The punchline: For incoherent, approximately low-rank Hessians, the bound on LDLQ's worst-case quantization error is roughly (rank · log n) / n times that of naive rounding. For a 4096-dimensional layer with effective rank 100, the ratio n/k alone is about 40x; taking the logarithmic term as μ² ≈ ln 4096 ≈ 8.3 leaves roughly a 5x reduction in the worst-case bound, ignoring constants.

Theorem 4: Incoherence is necessary

The paper also proves the converse. For any Hessian spectrum, there exists a (coherent) Hessian H̃ with that spectrum where LDLQ achieves exactly the same loss as naive stochastic rounding. Without incoherence, the spectral bound cannot distinguish LDLQ from baselines. This theorem closes the loop: incoherence is both sufficient and necessary for LDLQ to outperform.

Lemma 5: Kronecker products achieve incoherence

Let H̃ = VHV^T where V = V₁ ⊗ V₂ ⊗ ... ⊗ V_k is a Kronecker product of k random orthogonal matrices. Then with probability at least 1 − δ, H̃ is μ_H-incoherent with:

μ_H = A^(k/2) · (log(Ckn²/δ))^(k/2) = Õ(1)

Here A and C are constants that do not depend on n or k.

The incoherence parameter is only poly-logarithmic in the matrix size. Two Kronecker factors (k = 2) suffice in practice.

What does Theorem 4 tell us about the necessity of incoherence processing?

For any Hessian spectrum, there exists a coherent Hessian where LDLQ performs no better than naive rounding — so incoherence is not just helpful but necessary for the theoretical advantage to hold
LDLQ always beats nearest rounding regardless of H
Incoherence only matters for models larger than 10B

Chapter 7: Finite Grids & Lattices

So far we have analyzed rounding to the integers. In practice, we round to a finite grid: scale the weights, shift them, and clamp to {0, 1, ..., 2^b−1}. This creates a subtlety that the paper addresses carefully.

The clamping problem

LDLQ assumes weights are rounded to the nearest integer from the full integer lattice Zⁿ. But after scaling and shifting, the "real" LDLQ algorithm clamps weights to [0, 2^b−1]. This clamping can hurt: on a carefully constructed counterexample (H equal to (I_n + ε e_ne_n^T)/n), clamped LDLQ with nearest rounding is asymptotically worse than plain nearest rounding on a 4-point grid.

In practice this does not matter. The counterexample is contrived — a near-identity Hessian with a tiny perturbation. On real LLM weights, OPTQ (which is equivalent to clamped LDLQ) soundly beats nearest rounding at all bit widths. But the theoretical analysis must account for it.

The bounded solution

To get a clean theoretical bound, the paper proposes a "fixed" algorithm that constrains the quantized weights to lie inside the grid. The optimization problem is:

minimize: tr(HR^TR)
over: R unit upper triangular
subject to: e_i^TR^TRe_i ≤ 1 + c, ∀i

This problem is convex and can be solved with a method such as ADMM. The algorithm then runs QuIP with stochastic rounding, using U = R⁻¹ − I as the feedback matrix in place of the LDL factor. For sufficiently large c, the constraint becomes inactive and the solution is exactly base QuIP. The resulting bound on quantization error is:

tr((Ŵ − W)H(Ŵ − W)^T) = Õ(1/(n² · 4^b) · tr(H^1/2)² · ||W||_F²)

where b is the number of bits. The 4^b in the denominator means each additional bit cuts the error bound by a factor of 4: halving the grid spacing halves each rounding error and so quarters the squared error. This matches the intuition that going from 2 to 3 bits should be a significant jump.

Lattice codebooks (E8 lattice)

Instead of mapping to a uniform grid {0, 1, ..., 2^b−1}, one could map to a better-structured codebook. The E8 lattice gives the densest sphere packing in 8 dimensions, which makes it a natural candidate. The subsequent paper QuIP# extends QuIP with E8 lattice codebooks, achieving even better 2-bit results. In the original QuIP, the focus is on proving that even simple uniform grids work when combined with incoherence processing.

Why does clamping to a finite grid potentially hurt LDLQ's theoretical guarantees?

LDLQ's optimality proof assumes rounding to the full integer lattice. Clamping restricts the codebook, and on adversarial Hessians, this restriction can make LDLQ worse than nearest rounding — though this rarely occurs in practice
Clamping always makes quantization worse
Finite grids have fewer representable values

Chapter 8: Results

QuIP is evaluated on OPT models (125M to 66B parameters) and Llama 2 70B, across language generation (WikiText2, PTB, C4) and zero-shot tasks (ArcE, PiQA, StoryCloze, LAMBADA).

The headline result

QuIP is the first PTQ method to achieve viable 2-bit quantization. At 2 bits per weight, OPTQ's perplexity explodes (WikiText2 perplexity of 123.9 on Llama 2 70B vs. 3.3 at 16-bit). QuIP achieves 6.3 on the same model — a step function improvement that makes 2-bit usable for the first time.

Llama 2 70B results

Bits	Method	Wiki↓	C4↓	ArcE↑	PiQA↑	SC↑
16	Full	3.32	5.71	59.72	80.90	79.95
4	OPTQ	3.60	5.91	58.96	80.52	79.12
4	QuIP	3.53	5.87	59.81	80.47	79.63
3	OPTQ	4.91	7.10	54.38	78.56	77.72
3	QuIP	3.85	6.14	59.81	80.25	79.31
2	OPTQ	123.9	70.54	25.34	50.54	51.75
2	QuIP	6.33	8.94	54.38	75.08	75.37

At 2 bits, OPTQ produces a WikiText2 perplexity of 123.9 — effectively broken. QuIP achieves 6.33, which is close to the 3-bit OPTQ result (4.91). The zero-shot accuracies tell the same story: OPTQ at 2 bits is near random chance, while QuIP remains functional.

OPT-30B ablation

The paper evaluates all combinations of quantization and processing methods on OPT-30B:

Bits	Method	Wiki↓	PTB↓	C4↓	ArcE↑	LAMB↑
16	Full	9.56	14.04	11.45	65.40	72.40
2	OPTQ	71.70	88.19	29.59	42.47	25.77
2	LDLQ-RG	49.40	73.45	29.12	41.20	26.35
2	Near	41547.8	34348.6	24815.7	25.80	0.00
2	QuIP	11.48	17.40	13.55	57.87	65.24
2	Near+IncP	12.04	18.12	14.11	56.36	60.64

Incoherence helps everything. Even the simplest method — nearest rounding — goes from a perplexity of 41,548 to 12.04 when incoherence processing is added. The improvement is universal: every quantization method tested benefits from incoherence processing, especially at 2 bits.

Scaling behavior

A striking finding: as model size increases, the gap between 2-bit QuIP and 16-bit full precision shrinks. On OPT-125M, the 2-bit penalty is severe. By OPT-66B and Llama 2 70B, the gap is small. This hints that 2-bit inference may become increasingly viable as models get larger — the redundancy in larger models makes them more tolerant of extreme quantization.

Throughput

Method	Time per token
OPTQ	53 ms
QuIP	81 ms

QuIP is about 1.5x slower than OPTQ per token due to the Kronecker orthogonal multiplication during dequantization. Both are faster than FP16 inference because the memory savings dominate. The overhead is modest and could be reduced with custom CUDA kernels.

Ablation: sub-steps of incoherence processing

Bits	Rescale	Incoherence	Rescale+Inc	Resc+Inc+QR
4	24.30	24.32	24.05	23.89
3	32.62	42.28	31.32	26.36

On OPT-350M: all sub-steps contribute. Diagonal rescaling, incoherence rotation, and quantization range adjustment each reduce perplexity. Random permutation within the Kronecker multiply also helps significantly (Δ perplexity of −74.2 at 2 bits on OPT-125M).

Perplexity vs. Model Size

WikiText2 perplexity for OPTQ and QuIP at different bit widths across OPT model sizes.

What happens to nearest rounding (no adaptive, no incoherence) at 2 bits on OPT-30B?

WikiText2 perplexity of 41,548 — the model is completely broken. But adding incoherence processing alone brings it down to 12.04, showing that incoherence is the critical ingredient even without adaptive rounding
It works almost as well as OPTQ
It only slightly degrades perplexity

Chapter 9: Connections

What QuIP established

QuIP introduced three ideas that shaped the field:

Incoherence processing as a universal pre/post-processing step that improves any quantization algorithm, particularly at extreme compression ratios
LDLQ as the provably optimal adaptive rounding method, with the first theoretical analysis for any LLM-scale quantization algorithm
The OPTQ equivalence, retroactively providing theoretical grounding for the most widely used PTQ method

QuIP# (2024)

The follow-up paper extends QuIP with E8 lattice codebooks — replacing the uniform grid with the densest packing in 8 dimensions. This gives even better 2-bit results: near-lossless quantization at 2 bits for large models. QuIP# also introduces faster Hadamard-based incoherence (replacing random orthogonal with randomized Hadamard transforms for O(n log n) cost).

Relation to other quantization methods

Method	Adaptive rounding?	Incoherence?	Min bits	Theory?
RTN	No	No	4	No
SmoothQuant	No	Heuristic (per-channel)	8 (W8A8)	No
GPTQ/OPTQ	Yes (=LDLQ)	No	3	Via QuIP
SqueezeLLM	Sensitivity-based	No	3	No
QuIP	Yes (LDLQ)	Yes (random orth.)	2	Yes
QuIP#	Yes (LDLQ)	Yes (Hadamard)	2	Yes

Broader impact

QuIP demonstrated that the 2-bit frontier is not a hard wall — it is a function of how well you preprocess the matrices. This insight spawned a wave of work on rotation-based quantization methods. The idea that making matrices look random (incoherent) helps quantization is now a standard tool in the compression toolkit.

The deeper lesson: QuIP shows that quantization error is not intrinsic to the weights — it depends on the basis in which you represent them. Choosing the right basis (one where the matrix is incoherent) can reduce error by orders of magnitude. This is a beautiful connection between random matrix theory and practical systems engineering.

What is the broader insight that QuIP contributes beyond its specific algorithm?

Quantization error depends on the basis of representation, not just the weights themselves. Rotating to an incoherent basis — where no coordinate is privileged — can reduce error by orders of magnitude, providing both practical gains and the first theoretical framework for LLM quantization
2-bit quantization is always better than 4-bit
GPTQ should not be used anymore

QuIP: 2-Bit Quantization With Guarantees