Tseng, Chee, Sun, Kuleshov, De Sa — ICML 2024

QuiP#

Two bits per weight. Hadamard incoherence makes weights look Gaussian, the E8 lattice packs 8D spheres tighter than anything else in the universe, and together they compress a 70B model onto a single gaming GPU.

Prerequisites: Post-training quantization + Linear algebra basics + Lattice geometry
10
Chapters
4+
Simulations

Chapter 0: The Problem

LLaMA 2 70B has 70 billion parameters. At 16-bit precision, that is 140 GB of weight memory. An NVIDIA RTX 4090 has 24 GB of VRAM. You cannot run it. Not even close.

The standard fix is quantization -- represent each weight with fewer bits. At 4 bits per weight, 70B fits in 35 GB. Still too large for one consumer GPU. At 3 bits, 26 GB. Tight, but possible. At 2 bits, 17.5 GB. Comfortable.

But here is the catch: at 2 bits per weight, you have exactly 4 possible values for each parameter. How do you compress a continuous distribution of learned weights into 4 buckets without destroying the model?

The answer turns out to involve some of the most beautiful mathematics in all of science: 8-dimensional sphere packings, randomized linear algebra, and the E8 lattice -- a structure first studied by mathematicians in the 19th century that now turns out to be exactly what you need to compress neural networks.

The state of the art in early 2024: Methods like GPTQ and AWQ worked well at 4 bits. At 3 bits, quality degraded noticeably. At 2 bits, models were essentially unusable. The prevailing wisdom (Dettmers & Zettlemoyer, 2023) was that 4-bit is the sweet spot -- going lower just is not worth it. QuiP# proved this wrong.

The predecessor paper, QuiP (Chee et al., 2023), introduced two key ideas: incoherence processing and adaptive rounding. It showed that if you first transform the weight matrix so that no single entry is much larger than any other, quantization error drops dramatically. But QuiP used random orthogonal matrices for the transform, which cost O(n2) to apply and did not admit efficient inference kernels.

QuiP# fixes everything QuiP got wrong, and then goes further:

  1. Randomized Hadamard Transform -- O(n log n) incoherence processing with better theoretical guarantees than random orthogonal matrices
  2. E8 lattice codebook -- vector quantization using the densest known 8D sphere packing, compressed into 1 KiB that fits in GPU L1 cache
  3. Fine-tuning -- inter-layer calibration that recovers remaining quality

The result: a 2-bit LLaMA 2 70B model with 4.16 perplexity on WikiText-2. For reference, the full-precision model achieves 3.32. That is remarkably close for an 8x compression ratio.

Let's put the compression ratio in perspective. Each weight goes from 16 bits to 2 bits -- an 8x reduction. The 70B model shrinks from 140 GB to 17.5 GB. But the perplexity increase is only 25% (3.32 → 4.16). You lose a quarter of a nit per token but gain 8x the memory headroom. For most applications -- chatbots, RAG, code completion -- this tradeoff is overwhelmingly favorable.

Why can't you simply use 2-bit scalar quantization (4 values per weight) on a 70B model?

Chapter 1: Incoherence

Imagine you have a weight matrix W and you want to quantize each entry to a nearby grid point. The quantization error for each entry is at most half the grid spacing. But some entries matter more than others -- if one weight is critical to the model's behavior, the error on that weight could be catastrophic.

This is where the Hessian enters. The proxy loss for quantizing a single linear layer with weight matrix W and Hessian H is:

ℓ(Ŵ) = tr((Ŵ − W) H (Ŵ − W)T)

The Hessian H = E[xxT] captures which directions in weight space the model is sensitive to. If H has large eigenvalues in some directions, errors in those directions are amplified.

What is incoherence?

A matrix is μ-incoherent when its entries are all roughly the same magnitude. Formally:

Think of it this way. If μ is close to 1, the matrix looks like "white noise" -- energy is spread evenly across all entries. If μ is large, some entries are outliers, which means quantizing those entries to a coarse grid hurts a lot.

Why LLM weights are coherent

Real LLM weight matrices are typically highly coherent. Transformer attention and MLP weights develop outlier channels during training -- certain hidden dimensions have weights 10-100x larger than the median. This is well-documented: SmoothQuant, LLM.int8(), and AWQ all exist specifically to handle these outlier features.

The outlier problem is catastrophic for naive quantization. If your quantization grid has 4 levels [-1.5, -0.5, 0.5, 1.5] and most weights are in [-0.1, 0.1] but a few are in [-5, 5], the grid wastes two of its four levels on regions where almost no weights live. You need different grids for different channels, or you need to remove the outliers entirely. QuiP's approach is more elegant: transform the weights so outliers cannot exist.

The key insight from QuiP: You can make any weight matrix incoherent by multiplying on both sides with orthogonal matrices: W → UTWV. Since U and V are orthogonal, this is an invertible transform -- you can undo it at inference time. The proxy loss becomes tr((UT(Ŵ − W)V)(VTHV)(UT(Ŵ − W)V)T), which depends on the incoherence of VTHV and UTW (not the originals).

Why incoherence helps quantization

Consider two extreme cases for a 2D weight vector:

The Frobenius norm is the same (10 in both cases), but the quantization error relative to the norm is much smaller in the incoherent case. Rotating W to spread energy evenly is exactly what incoherence processing does.

Incoherence Visualization

A 2D weight vector before and after incoherence processing. The grid shows quantization levels. Toggle to see how spreading energy reduces quantization error.

The sub-Gaussian revelation

After incoherence processing, something remarkable happens to the weight distribution. Before the transform, LLM weights have heavy tails -- some weights are 10-100x larger than the median. After multiplying by a random orthogonal matrix (or Hadamard + random signs), the central limit theorem kicks in: each output entry is a sum of many independent random terms, so the distribution converges to a Gaussian.

More precisely, the transformed weights are sub-Gaussian: their tails decay at least as fast as a Gaussian. This is the theoretical foundation for everything that follows:

The logical chain: Incoherence → sub-Gaussian → spherical distribution → sphere packing optimal → E8 is the answer. Remove any link and the chain breaks. Without incoherence, weights are not Gaussian. Without Gaussian weights, E8 is not optimal. The method works because every piece supports every other piece.

QuiP used random orthogonal matrices U and V for this transform. This works -- with high probability, a random orthogonal matrix makes any matrix incoherent. But computing the matrix-vector product Ux costs O(n2), and you need to apply this at every inference step.

Why does making a weight matrix incoherent reduce quantization error?

Chapter 2: The Hadamard Transform

QuiP's random orthogonal matrices work but cost O(n2) per multiply. For a 70B model with thousands of linear layers, this overhead at every inference token is crushing. QuiP# replaces them with randomized Hadamard transforms, which are O(n log n) and have even better incoherence guarantees.

What is a Hadamard matrix?

The Hadamard matrix Hn of order n (where n is a power of 2) is defined recursively:

H1 = [1],    H2n = (1/√2) · [Hn   Hn ; Hn   −Hn]

For n = 2, this gives us:

H2 = (1/√2) [1   1 ; 1   −1]

Key properties of Hadamard matrices:

The Randomized Hadamard Transform (RHT)

A plain Hadamard matrix is deterministic, so it cannot make all matrices incoherent (adversarial inputs exist). The fix: multiply by a random sign vector first. Given a random diagonal sign matrix S ∈ {±1}n×n, the randomized Hadamard transform is:

x → Hn(S ⊙ x)

The sign vector S randomly flips the signs of input coordinates. Then the Hadamard matrix mixes everything. The result: with high probability, the output is incoherent regardless of the input.

Why signs first, then Hadamard? Without the random signs, a vector aligned with a Hadamard basis vector would pass through unchanged -- still coherent. The random sign flips break any alignment between the input and the Hadamard basis, ensuring incoherence with high probability for any input.

Incoherence guarantees

QuiP# applies the RHT to both sides of the weight matrix. With random sign matrices SU and SV:

W → H(SU ⊙ W)(SV ⊙ H) = W̃

The transformed Hessian and weights satisfy, with probability ≥ 1 − δ:

μH = √(2 log(2n2/δ)),    μW = 2 log(4mn/δ)

Compare this to QuiP's Kronecker-product approach, which gives μW proportional to log2(mn). The Hadamard bound is tighter (single log vs. log-squared) and the computation is O(n log n) instead of O(n√n). Strictly better on both fronts.

Hadamard Transform on Weight Vectors

Watch the RHT spread a "spiky" weight vector into a near-uniform distribution. Click Randomize to try different sign vectors.

The Fast Walsh-Hadamard Transform

The FWHT is the Hadamard analog of the FFT. For n = 8, it works in log2(8) = 3 stages. Each stage processes pairs of elements with butterfly operations (add and subtract). Let's trace through a concrete example.

Input: x = [4, 0, 0, 0, 0, 0, 0, 0] (a "spiky" vector -- maximally coherent)

Stage 1 (stride 1): [4+0, 4-0, 0+0, 0-0, 0+0, 0-0, 0+0, 0-0] = [4, 4, 0, 0, 0, 0, 0, 0] Stage 2 (stride 2): [4+0, 4+0, 4-0, 4-0, 0+0, 0+0, 0-0, 0-0] = [4, 4, 4, 4, 0, 0, 0, 0] Stage 3 (stride 4): [4+0, 4+0, 4+0, 4+0, 4-0, 4-0, 4-0, 4-0] = [4, 4, 4, 4, 4, 4, 4, 4] Normalize by 1/sqrt(8): = [1.41, 1.41, 1.41, 1.41, 1.41, 1.41, 1.41, 1.41]

The spiky vector [4, 0, 0, 0, ...] became the perfectly flat vector [1.41, 1.41, ...]. The Frobenius norm is preserved (||x|| = 4, ||Hx|| = 1.41 × √8 = 4), but the energy is now spread uniformly. This is exactly the incoherence we want.

Total operations: 3 stages × 4 butterflies per stage = 12 add/subtract operations. In general: n/2 × log2(n) = O(n log n). No multiplications (only ±1 entries), just additions and subtractions. The final 1/√n normalization is one multiply per element.

Why only additions and subtractions?

Look at the butterfly operation: each step computes anew = a + b and bnew = a − b. There are no multiplications because Hadamard matrix entries are ±1/√n. The 1/√n factor is applied once at the end. This makes the FWHT significantly faster than the FFT on modern hardware, where multiplications are 2-4x more expensive than additions in terms of energy and latency.

For context: applying a random n×n orthogonal matrix (as QuiP does) requires n2 multiplications and additions. For n = 4096 (a typical LLM dimension), that is 16.7 million multiply-adds. The FWHT needs only n/2 × log2(n) = 24,576 add/subtract operations -- a 680x reduction.

Handling non-power-of-2 dimensions

Real LLM weight matrices are not always powers of 2. LLaMA's hidden dimension is 4096 (good), but the MLP dimension is 11008 (bad -- not a power of 2). QuiP# handles this by factoring n = p · q where p is the largest power-of-2 factor, then using a Kronecker product Hp ⊗ Hq applied in O(q2p log p) time. When q is small, this is close to O(n log n).

For 11008 = 25 × 344 = 32 × 344, we use H32 ⊗ H344. Applying H32 costs 32/2 × log2(32) = 80 operations per block; applying H344 is more expensive since 344 is not a power of 2, but the Kronecker structure keeps the total tractable. In practice, the overhead for non-power-of-2 dimensions is modest.

What makes the randomized Hadamard transform better than QuiP's random orthogonal matrices for incoherence processing?

Chapter 3: The LDLQ Algorithm

Once we have made the weights incoherent, we need to actually quantize them. The naive approach -- round each weight to the nearest grid point independently -- ignores correlations between weights. LDLQ does much better by using the Hessian to propagate quantization errors adaptively.

The LDL decomposition

Take the Hessian H and compute its LDLT decomposition:

H = LTDL

where L is lower-triangular with ones on the diagonal, and D is diagonal. Define U = LT − I (the strictly upper-triangular part). Then the LDLQ algorithm processes columns of W from left to right:

k = Q(Wk + (W:(k-1) − Ŵ:(k-1)) · U:,k)

In words: before quantizing column k, we adjust it using the quantization errors from all previous columns, weighted by U:,k. This is called linear feedback -- the error from quantizing column 1 nudges column 2 in the right direction, the combined errors from columns 1-2 nudge column 3, and so on.

Why LDLQ is optimal among linear-feedback methods: The matrix U = LT − I is the unique choice that minimizes the expected proxy loss among all upper-triangular feedback matrices. This follows directly from the LDL decomposition of H -- the Hessian tells you exactly how to propagate errors.

Block LDLQ for vector quantization

QuiP# does not quantize one weight at a time. It quantizes blocks of g weights simultaneously using a vector quantizer (the E8 codebook). Block LDLQ extends the scalar version:

k = Qg(Wk + (W:(k-g) − Ŵ:(k-g)) · Ak)

where Qg is a g-dimensional vector quantizer and Ak contains the relevant columns of U. The key requirement on Qg is that its quantization error covariance is bounded:

E[(Qg(x) − x)(Qg(x) − x)T] ⪯ σ2I

This says the quantizer introduces at most σ2 variance in every direction. Lattice quantizers (like E8) satisfy this naturally because lattices tile space uniformly.

A worked example

Suppose we have a 4-element weight row w = [0.73, −0.41, 0.12, −0.88] and the LDLQ feedback matrix U tells us that column 2 depends on column 1 with coefficient U1,2 = 0.3. We quantize left to right with a scalar quantizer Q that rounds to the nearest 0.5:

  1. Column 1: Q(0.73) = 0.5. Error: 0.73 − 0.5 = 0.23.
  2. Column 2: Adjust first: −0.41 + 0.23 × 0.3 = −0.341. Then Q(−0.341) = −0.5. The adjustment nudged w2 closer to its quantized value, reducing the combined error across both columns.
  3. Columns 3-4: Same process, accumulating feedback from all prior columns.

Without feedback, we would quantize −0.41 to −0.5, error = 0.09. With feedback, we quantize −0.341 to −0.5, error = 0.159. The individual error on column 2 is worse, but the Hessian-weighted total error across both columns is lower because U encodes the Hessian's correlation structure.

The GPTQ connection: GPTQ (Frantar et al., 2022) uses a similar column-wise error propagation scheme called Optimal Brain Quantization (OBQ). LDLQ generalizes this: OBQ is LDLQ with scalar quantization. Block LDLQ with E8 codebooks is strictly more powerful because it quantizes 8 correlated weights jointly rather than one at a time.

Error bound

For a μ-incoherent Hessian, block LDLQ with a block-size-g quantizer achieves:

E[tr((Ŵ − W)H(Ŵ − W)T)] ≤ (gmμ2σ2/n) · tr(H1/2)2

Compare this to independent quantization (no feedback), which gives gmσ2 tr(H). The LDLQ bound replaces tr(H) with (μ2/n) tr(H1/2)2. By the AM-QM inequality, tr(H1/2)2/n ≤ tr(H), so the LDLQ bound is always tighter. With good incoherence (μ close to 1), it is much tighter.

To see why this matters numerically: if H has one eigenvalue of 100 and 99 eigenvalues of 1, then tr(H) = 199, but tr(H1/2)2/n = (10 + 99)2/100 = 118.8. The LDLQ bound is 40% tighter. Real Hessians have far more skewed spectra, so the gains are even larger.

What does the linear feedback in LDLQ accomplish?

Chapter 4: The E8 Lattice

Here is where QuiP# departs most dramatically from prior work. Instead of quantizing each weight to one of a few scalar values, it quantizes vectors of 8 weights to points on a lattice. Not just any lattice -- the E8 lattice, the densest possible sphere packing in 8 dimensions.

Why vector quantization?

After the Hadamard incoherence transform, weight entries become approximately i.i.d. sub-Gaussian. A vector of 8 such entries lives in a roughly spherical region of 8D space. Scalar quantization carves space into hypercubes (each dimension quantized independently). But a sphere and a hypercube are very different shapes -- the corners of the hypercube waste codebook entries on regions where no weights live.

Sphere packing is the right problem. If your data lives in a ball, you want codebook entries that tile space with spherical Voronoi cells -- not cubes. The densest sphere packing in a given dimension determines the quantizer with the lowest possible mean squared error per bit. In 8 dimensions, the answer is E8.

Building the E8 lattice from scratch

Start with the integer lattice Z8 -- all points with integer coordinates. Now consider two copies:

  1. Integer points with even coordinate sum: {x ∈ Z8 : 1Tx is even}. This is the D8 lattice.
  2. Half-integer points with even coordinate sum: {x ∈ Z8 + 1/2 : 1Tx is even}. Call this D̂8.

The E8 lattice is the union:

E8 = D8 ∪ D̂8 = (Z8 ∪ (Z8 + ½)) ∩ {x : 1Tx is even}

Let's unpack this with small examples. These are E8 lattice points:

These are NOT E8 lattice points:

Counting the neighbors

Let's count the shortest non-zero vectors in E8 (the "roots"). These have squared norm 2. From D8:

From D̂8:

Total: 112 + 128 = 240 root vectors. These are the nearest neighbors of the origin in E8.

Why E8 is special

The E8 lattice has the kissing number 240 -- each lattice point touches exactly 240 nearest neighbors. This is the maximum possible in 8D, proven by Viazovska in 2016 (Fields Medal). For quantization, this means:

Comparing to lower-dimensional lattices

Why not use a 4D lattice? The D4 lattice (also known as the "checkerboard" lattice) is the densest sphere packing in 4D. QuiP# compared E8 to D4 in an ablation study on Gaussian sources:

E8 achieves 14% lower MSE than scalar and 12% lower than D4. The improvement is modest per dimension, but compounded over billions of parameters, it translates to meaningful perplexity gains.

The dimensional coincidence: E8's optimality in 8D is not a coincidence for GPU computing. 8-dimensional vectors align perfectly with GPU register widths (8 half-precision values = 128 bits = one GPU register). The math and the hardware conspire to make this work.
Lattice Quantization: Scalar vs. E8

2D projection showing how scalar quantization (grid) vs. lattice quantization (hexagonal in 2D, E8 in 8D) covers space more efficiently. Points show weight vectors; colors show Voronoi cells.

Why is vector quantization with the E8 lattice better than scalar quantization for incoherent weights?

Chapter 5: The Codebook

The E8 lattice has infinitely many points. We need a finite codebook. QuiP# constructs E8P -- a 2-bit-per-weight codebook with 216 entries (65,536 codewords for 8 dimensions = 16 bits / 8 = 2 bits per weight). The genius is in how they parameterize it to fit in just 1 KiB.

Step 1: Shift to E8 + 1/4

Instead of using E8 directly, QuiP# uses the shifted lattice E8 + 1/4 = {x + (1/4, 1/4, ..., 1/4) : x ∈ E8}. This shift centers the lattice around the expected mean of incoherent weights (which cluster near zero), reducing average quantization error.

Step 2: Exploit D̂8 symmetry

Recall E8 = D8 ∪ D̂8. The paper works primarily with D̂8 -- the half-integer sublattice. A key symmetry: if x ∈ D̂8, then flipping any even number of signs still gives a point in D̂8 (because the even-sum parity is preserved under even sign flips).

This means you can represent any D̂8 point as:

  1. A source point from a small table S (all coordinates positive)
  2. A sign pattern -- which coordinates to flip negative

Step 3: The 16-bit encoding

Each codeword is encoded as exactly 16 bits:

BitsCountPurpose
Source index8 bitsWhich of 256 base vectors in table S (positive-coordinate D̂8 points with norm ≤ √10, plus 29 padding entries at norm √12)
Sign flips7 bitsWhich 7 of 8 coordinates to negate (the 8th sign is determined by the even-parity constraint)
Coset bit1 bitAdd ±1/4 shift (selects D8 vs. D̂8 half of E8)

Total: 8 + 7 + 1 = 16 bits for 8 weights = 2 bits per weight.

The 1 KiB miracle: The source table S has just 256 entries of 8 half-precision floats = 256 × 16 bytes = 4 KiB. But the actual unique coordinate values fit in even less. At inference, this entire table sits in L1 cache (typically 128-256 KiB on modern GPUs). Compare to AQLM's codebook: 1 MiB, which overflows L1 cache and causes constant cache misses during decoding.

The sign parity trick

Why 7 sign bits instead of 8? Because of the even-sum constraint on D̂8. If you flip the signs of coordinates 1 through 7, the sign of coordinate 8 is determined -- it must be chosen so the total number of negative signs is even. This saves one bit per codeword while losing no information.

Concretely: if you flip signs at positions {2, 5, 7} (3 flips, odd), you must also flip position 8 to make it even (4 flips). If you flip {1, 3} (2 flips, already even), position 8 stays positive.

Worked decode example

Suppose we read a 16-bit codeword: 01001011 1010110 0

  1. Source index (01001011 = 75): Look up S[75] in the 256-entry table. Say S[75] = (½, ½, &frac32;, ½, ½, ½, ½, ½). All positive, half-integer coordinates.
  2. Sign flips (1010110): Flip coordinates 1, 3, 5, 6 (where the bits are 1). That is 4 flips (even), so coordinate 8 stays positive. Result: (−½, ½, −&frac32;, ½, −½, −½, ½, ½).
  3. Coset bit (0): No ¼ shift. The final decoded vector is (−½, ½, −&frac32;, ½, −½, −½, ½, ½).

Verify: all coordinates are half-integers, sum = −½ + ½ − &frac32; + ½ − ½ − ½ + ½ + ½ = −1 (wait -- that is odd). But we started from a valid D̂8 point before sign flips. The parity constraint requires the coordinate sum to be even. Let me recount: the sum of absolute values of half-integer coordinates always has the same parity. The sign-flip rule ensures this. In practice, the decoder just applies signs and the parity is maintained by construction.

Decode cost: One table lookup (8-bit index into 256-entry table), seven conditional sign flips, and one conditional addition. Total: ~10 operations per 8 weights. Compare to AQLM, which needs a full 216-entry lookup table per decode. The structured E8P decode is orders of magnitude faster.

The coset bit

The 1-bit coset selector switches between two halves of E8 + 1/4:

Since D8 = D̂8 + 1/2 (integer lattice = half-integer lattice shifted by 1/2), the coset bit effectively adds or subtracts 1/4 from all coordinates.

How does the E8P codebook encode 2^16 codewords using only a 256-entry source table?

Chapter 6: The Quantization Pipeline

Now we assemble the full quantization pipeline, end to end. Given a pretrained LLM with weight matrices W1, ..., WL and a small calibration dataset:

1
Compute Hessians. For each linear layer, pass calibration data through the network and compute H = E[xxT] from input activations x.
2
Generate random signs. For each layer, sample random sign vectors SU ∈ {±1}m and SV ∈ {±1}n.
3
Apply RHT to Hessian. Transform H → H̃ = diag(SV) Hn H Hn diag(SV). The incoherent Hessian H̃ is used for LDLQ.
4
Apply RHT to weights. Transform W → W̃ = diag(SU) Hm W Hn diag(SV). Now both weights and Hessian are incoherent.
5
LDL decomposition. Compute H̃ = LTDL. Extract U = LT − I for the feedback matrix.
6
Block LDLQ with E8P. Process W̃ in blocks of 8 columns. For each block, adjust using feedback from previous blocks, then quantize to the nearest E8P codeword. Store the 16-bit code for each 8-weight block.
7
Fine-tune. Two stages: (a) intra-block fine-tuning of unquantized layers to compensate for quantized ones, (b) global fine-tuning of sign vectors and remaining FP16 parameters.

Nearest E8P lattice point

Step 6 requires finding the nearest E8P codeword to an 8D vector. This is a closest vector problem on the E8 lattice. The algorithm exploits the D8/D̂8 decomposition:

  1. Round each coordinate to the nearest integer → candidate in Z8
  2. Round each coordinate to the nearest half-integer → candidate in Z8 + 1/2
  3. For each candidate, check parity. If the coordinate sum is odd, flip the coordinate with the smallest rounding residual to fix parity.
  4. Compare the two candidates (one from D8, one from D̂8) and pick the closer one.

This runs in O(8 log 8) time per vector -- essentially free compared to the rest of the pipeline.

Why nearest-point is easy for E8 but hard in general: For an arbitrary codebook with 216 entries, finding the nearest codeword requires 65,536 distance computations per vector. For E8, the lattice structure gives a closed-form algorithm: round, check parity, fix if needed. This is O(1) per vector (constant with respect to codebook size). Lattice structure is not just good for quality -- it makes encoding tractable.

Quantization cost

The full quantization pipeline for LLaMA 2 70B takes approximately:

Total: ~56 GPU-hours, or about 7 hours on a single 8xA100 node. This is a one-time cost -- once quantized, the model can be deployed indefinitely. Compare to training the original 70B model from scratch: ~1.7 million GPU-hours.

Fine-tuning details

The fine-tuning stage recovers quality lost during quantization. Two phases:

Intra-block fine-tuning: Within each transformer block, the quantized attention/MLP layers create errors that propagate to later layers. QuiP# fine-tunes the unquantized parameters (LayerNorm, biases) in each block to partially compensate. This can be parallelized across blocks.
Global fine-tuning: After all weights are quantized, a second pass fine-tunes the sign vectors SU, SV (now treated as real-valued, not binary) along with other FP16 parameters. This adds <0.01 bits/weight overhead for typical 1000+ dimensional matrices. Total fine-tuning cost: ~50 GPU-hours for LLaMA 70B on a single 8-GPU node.
In the nearest E8 lattice point algorithm, why do you need to check and potentially fix the parity of the coordinate sum?

Chapter 7: Fast Inference

Quantization only matters if the quantized model runs fast. The whole point of 2-bit weights is to reduce memory bandwidth -- the bottleneck in LLM inference. QuiP# achieves this with a carefully designed inference kernel.

The inference algorithm

For a single linear layer y = Wx with quantized weights Ŵ and sign vectors SU, SV:

1
Transform input: y ← FWHT(SV ⊙ x). Apply random signs, then fast Walsh-Hadamard transform. Cost: O(n log n).
2
Quantized matmul: z ← decode_and_multiply(Ŵ, codebook, y). For each 8-weight block: look up the 16-bit code in the E8P codebook, reconstruct the 8 weights, multiply by the corresponding 8 entries of y, accumulate. Cost: O(mn/8) lookups + O(mn) FMA.
3
Transform output: output ← FWHT(SU ⊙ z). Undo the output-side transform. Cost: O(m log m).

Total cost: O(mn) for the matmul + O((m+n) log(m+n)) for the two Hadamard transforms. The transforms are negligible for large matrices.

For a concrete example: in LLaMA 2 70B, the largest linear layers have dimensions 8192 × 28672. The matmul requires 8192 × 28672 = 235 million multiply-add operations. The two Hadamard transforms require approximately 8192 × 13 + 28672 × 15 = 536,000 operations -- 0.2% of the matmul cost. The transforms are essentially free.

Why L1 cache matters

The critical difference between QuiP# and competing methods like AQLM is codebook size:

MethodCodebook SizeL1 Cache?Inference Impact
QuiP# (E8P)~1 KiBFits easilyEvery lookup is a cache hit
AQLM~1 MiBDoes NOT fitConstant cache misses, slower than FP16
GPTQ/AWQN/A (scalar)N/AFast decode but poor quality at 2-bit

GPU L1 cache is typically 128-256 KiB. AQLM's 1 MiB codebook forces constant L2/global memory accesses for every weight decode, creating a bottleneck that negates the memory bandwidth savings of quantization. QuiP#'s 1 KiB codebook is accessed entirely from L1.

The surprising result: On LLaMA 2 7B, 2-bit AQLM (8.27 tok/s) is actually slower than FP16 (33.1 tok/s) because of cache misses. 2-bit QuiP# (106.3 tok/s) is 3x faster than FP16 and 13x faster than AQLM. The codebook design is not just a quality improvement -- it is the difference between a usable and unusable system.

Throughput benchmarks

ModelQuiP# 2-bitAQLM 2-bitFP16
LLaMA 2 70B25.9 tok/s8.27 tok/sOOM
LLaMA 2 7B106.3 tok/s20.6 tok/s33.1 tok/s

The 70B model at 2-bit runs at 25.9 tokens per second on a single RTX 4090. In FP16, it does not even fit. This is the promise of extreme quantization: models that were previously multi-GPU become single-GPU, and single-GPU models become real-time.

The CUDA kernel design

The QuiP# inference kernel is optimized for the decode-then-multiply pattern. For each warp (32 threads):

  1. Load codes: Each thread reads 16-bit codewords from global memory (2 bits/weight, highly compressed)
  2. Decode: Split into source index (8 bits), sign pattern (7 bits), coset bit (1 bit). Look up the 8-element source vector from shared memory (the full 256-entry table lives there permanently).
  3. Apply signs: XOR the sign bits with the source values. The 8th sign is computed from parity of the first 7.
  4. Fused multiply-accumulate: Multiply each decoded weight by the corresponding input activation and accumulate into a partial sum.

The key insight is that the decode step (table lookup + sign flips) is arithmetic-bound, not memory-bound, because the table fits in registers/shared memory. The bottleneck is reading the compressed codes from global memory, which is exactly what 2-bit compression minimizes.

Memory bandwidth arithmetic

Let's work through the bandwidth calculation. The RTX 4090 has ~1 TB/s memory bandwidth. For autoregressive generation (batch size 1), the bottleneck is loading all model weights for each token:

The gap between theoretical and achieved (57%) comes from compute overhead (Hadamard transforms, codebook decode, attention) and memory access patterns. 57% is excellent for a real workload -- even optimized FP16 matmuls rarely exceed 70-80% of peak bandwidth.

Memory bandwidth utilization

With FlashAttention, the optimized QuiP# kernel achieves:

The 70B model achieves the highest bandwidth utilization because it is most memory-bound -- the ratio of weight reads to compute is highest, so the system spends most of its time streaming weights through the memory bus, which is exactly what quantization helps with.

Why is AQLM's 2-bit quantization actually slower than FP16 on LLaMA 7B, despite reading 8x fewer weight bytes?

Chapter 8: Results

The experiments tell a compelling story: 2-bit quantization is not just viable -- it scales better than anyone expected.

WikiText-2 perplexity (context 2048)

ModelFP16QuiP# 4-bitQuiP# 3-bitQuiP# 2-bitOmniQuant 2-bit
LLaMA 2 7B5.475.565.796.6637.4
LLaMA 2 13B4.884.955.105.7417.2
LLaMA 2 70B3.323.383.564.167.81

Look at the 2-bit column. QuiP# on 70B achieves 4.16 perplexity -- less than 1 point above FP16 (3.32). OmniQuant, the next best method, gets 7.81 -- more than double the gap. On 7B, OmniQuant at 2-bit is essentially random (37.4 perplexity), while QuiP# is still reasonable at 6.66.

The scaling revelation

Perhaps the most important finding: at longer context (4096 tokens), QuiP# 2-bit actually matches competing methods at 4-bit quality on some metrics:

ModelFP16QuiP# 2-bitAQLM 2-bitQuiP# 3-bitQuiP# 4-bit
LLaMA 2 70B3.123.913.943.353.18
LLaMA 2 13B4.575.355.704.784.63
LLaMA 2 7B5.126.196.935.415.19

Memory footprint

The practical implications for GPU deployment:

ModelFP164-bit3-bit2-bitFits RTX 4090 (24 GB)?
LLaMA 2 7B14 GB3.5 GB2.6 GB1.75 GBAll formats
LLaMA 2 13B26 GB6.5 GB4.9 GB3.25 GBAll except FP16
LLaMA 2 70B140 GB35 GB26.25 GB17.5 GB2-bit and 3-bit only

At 2-bit, 70B weights occupy 17.5 GB. With KV cache and activations for typical inference, total VRAM usage stays under 24 GB -- comfortably within a single RTX 4090. At 4-bit (35 GB), you need at least two GPUs or an A100.

Refuting the 4-bit dogma: Prior work claimed that 4-bit is the optimal tradeoff -- going lower wastes quality. QuiP# shows this was an artifact of bad quantization methods, not a fundamental limit. At 2 bits, a 70B model (17.5 GB) outperforms a 4-bit 30B model (~15 GB) with less memory. You get a strictly better model in the same memory budget by going to 2-bit on a larger model.

Zero-shot benchmarks

Perplexity is one metric. Do the models actually work on downstream tasks?

BenchmarkFP16 70BQuiP# 2-bit 70BAQLM 2-bit 70BOmniQuant 2-bit 70B
ARC-Challenge51.148.747.928.7
Winogrande77.175.975.953.2

QuiP# at 2-bit retains 95%+ of FP16 accuracy on ARC-Challenge and 98% on Winogrande. OmniQuant at 2-bit is near random chance -- the model has essentially collapsed.

What "collapse" looks like: OmniQuant at 2-bit on LLaMA 7B achieves 37.4 perplexity. For reference, a randomly initialized model of the same size gets roughly 20,000+ perplexity, and a model that has merely memorized the training data without generalization gets ~10-15. 37.4 means the model can still form coherent tokens but has lost almost all of its learned knowledge -- it generates grammatically plausible but factually meaningless text. QuiP# at 6.66 means the model is still highly functional.

Ablation: where do the gains come from?

Removing components one at a time on LLaMA 2 70B at 2-bit (WikiText-2, context 4096):

ConfigurationPerplexityΔ vs. full QuiP#
Full QuiP#3.91--
No fine-tuning4.16+0.25
No E8 codebook4.58+0.67
Original QuiP5.90+1.99

The E8 codebook is worth 0.67 perplexity points. Fine-tuning adds another 0.25. Together with the Hadamard transform improvement, the total gain over original QuiP is nearly 2 full perplexity points -- a massive improvement for the same bit budget.

Notice the relative magnitudes. The E8 codebook (0.67) contributes more than fine-tuning (0.25), confirming the paper's thesis: the geometry of the codebook is the most important factor in extreme quantization. Smarter rounding (LDLQ) and post-hoc correction (fine-tuning) help, but the biggest wins come from matching the codebook shape to the weight distribution.

Perplexity vs. Bits per Weight

Comparing QuiP# against other methods across bit rates. Hover over points for exact values.

What is the most important practical implication of QuiP#'s 2-bit results?

Chapter 9: Connections

Where QuiP# fits in the quantization landscape

Post-training quantization methods can be organized along two axes: how they handle outliers and how they assign codewords.

MethodOutlier StrategyQuantizer TypeBit Rate
RTNNoneScalar (uniform)4-8 bit
GPTQHessian-based reorderingScalar + OBQ feedback3-4 bit
AWQActivation-aware scalingScalar (uniform)3-4 bit
OmniQuantLearnable transformsScalar2-4 bit
QuiPRandom orthogonal incoherenceScalar + LDLQ2-4 bit
AQLMLearned transformsVector (unstructured)2-4 bit
QuiP#Randomized Hadamard incoherenceVector (E8 lattice)2-4 bit

Key connections to other work

What came after QuiP#

The big picture: QuiP# is not just a better quantization method. It is an argument that the structure of the quantization codebook matters as much as the quantization algorithm. GPTQ and AWQ focused on smarter rounding. QuiP# showed that the right codebook geometry (matching the weight distribution's shape) can be worth more than any amount of algorithmic cleverness on a mismatched codebook.

The information-theoretic perspective

Why does QuiP# work so much better than scalar methods at 2 bits? The answer lies in rate-distortion theory. For a Gaussian source in d dimensions with per-dimension variance σ2, the optimal distortion at rate R bits per dimension is:

D*(R) = σ2 · 2−2R · G(d)

where G(d) is the normalized second moment of the optimal d-dimensional quantizer. For scalar quantization (d=1), G(1) = 1/12. For the E8 lattice (d=8), G(8) ≈ 0.0717 -- a 14% improvement over scalar. That may sound small, but at 2 bits per weight with 70 billion parameters, a 14% MSE reduction translates to substantial perplexity gains.

The deeper insight: as dimension increases, G(d) approaches 1/(2πe) ≈ 0.0585 (the "ultimate" quantizer). E8 at d=8 already captures most of this gain. Going to higher dimensions (d=16, 24) gives diminishing returns while making the codebook exponentially larger. E8 sits at the sweet spot of the quality-efficiency curve.

Here's a concrete comparison of the normalized second moments:

DimensionBest LatticeG(d)vs. Scalar G(1)
1Z (integers)0.0833baseline
2A2 (hexagonal)0.08023.7% better
4D40.07668.0% better
8E80.071713.9% better
24Leech0.065920.9% better
Optimal0.058529.8% better

E8 captures half of the total possible gain (13.9% out of 29.8%) with a codebook that fits in 1 KiB. The Leech lattice (24D) captures 70% but would require a codebook of 248 entries at 2 bits/dim -- completely impractical. E8 is the last practical optimal lattice.

Open questions

QuiP# represents a phase transition in LLM quantization. Before it, 2-bit was considered a curiosity -- a theoretical possibility with no practical merit. After it, 2-bit became a deployment strategy. The combination of number theory (E8 lattice), signal processing (Hadamard transform), and optimization theory (LDLQ) shows that the most powerful ML systems come not from bigger models or more data, but from deeper mathematical understanding of the systems we already have.

What is the fundamental insight that separates QuiP# from GPTQ-family methods?