QuiP# — Veanors

Chapter 0: The Problem

LLaMA 2 70B has 70 billion parameters. At 16-bit precision, that is 140 GB of weight memory. An NVIDIA RTX 4090 has 24 GB of VRAM. You cannot run it. Not even close.

The standard fix is quantization -- represent each weight with fewer bits. At 4 bits per weight, 70B fits in 35 GB. Still too large for one consumer GPU. At 3 bits, 26 GB. Tight, but possible. At 2 bits, 17.5 GB. Comfortable.

But here is the catch: at 2 bits per weight, you have exactly 4 possible values for each parameter. How do you compress a continuous distribution of learned weights into 4 buckets without destroying the model?

The answer turns out to involve some of the most beautiful mathematics in all of science: 8-dimensional sphere packings, randomized linear algebra, and the E8 lattice -- a structure first studied by mathematicians in the 19th century that now turns out to be exactly what you need to compress neural networks.

The state of the art in early 2024: Methods like GPTQ and AWQ worked well at 4 bits. At 3 bits, quality degraded noticeably. At 2 bits, models were essentially unusable. The prevailing wisdom (Dettmers & Zettlemoyer, 2023) was that 4-bit is the sweet spot -- going lower just is not worth it. QuiP# proved this wrong.

The predecessor paper, QuiP (Chee et al., 2023), introduced two key ideas: incoherence processing and adaptive rounding. It showed that if you first transform the weight matrix so that no single entry is much larger than any other, quantization error drops dramatically. But QuiP used random orthogonal matrices for the transform, which cost O(n²) to apply and did not admit efficient inference kernels.

QuiP# fixes everything QuiP got wrong, and then goes further:

Randomized Hadamard Transform -- O(n log n) incoherence processing with better theoretical guarantees than random orthogonal matrices
E8 lattice codebook -- vector quantization using the densest known 8D sphere packing, compressed into 1 KiB that fits in GPU L1 cache
Fine-tuning -- inter-layer calibration that recovers remaining quality

The result: a 2-bit LLaMA 2 70B model with 4.16 perplexity on WikiText-2. For reference, the full-precision model achieves 3.32. That is remarkably close for an 8x compression ratio.

Let's put the compression ratio in perspective. Each weight goes from 16 bits to 2 bits -- an 8x reduction. The 70B model shrinks from 140 GB to 17.5 GB. But the perplexity increase is only 25% (3.32 → 4.16). You lose a quarter of a nit per token but gain 8x the memory headroom. For most applications -- chatbots, RAG, code completion -- this tradeoff is overwhelmingly favorable.

Why can't you simply use 2-bit scalar quantization (4 values per weight) on a 70B model?

4 values per weight cannot capture the fine-grained distribution of learned weights, so naive quantization introduces massive error that destroys model quality -- you need structured approaches like incoherence + lattice codebooks to make it work 2-bit quantization runs too slowly on GPUs 70B models don't have enough parameters for 2-bit quantization

Chapter 1: Incoherence

Imagine you have a weight matrix W and you want to quantize each entry to a nearby grid point. The quantization error for each entry is at most half the grid spacing. But some entries matter more than others -- if one weight is critical to the model's behavior, the error on that weight could be catastrophic.

This is where the Hessian enters. The proxy loss for quantizing a single linear layer with weight matrix W and Hessian H is:

ℓ(Ŵ) = tr((Ŵ − W) H (Ŵ − W)^T)

The Hessian H = E[xx^T] captures which directions in weight space the model is sensitive to. If H has large eigenvalues in some directions, errors in those directions are amplified.

What is incoherence?

A matrix is μ-incoherent when its entries are all roughly the same magnitude. Formally:

A Hessian H with eigendecomposition H = QΛQ^T is μ_H-incoherent if max|Q_ij| ≤ μ_H/√n
A weight matrix W is μ_W-incoherent if max|W_ij| ≤ μ_W||W||_F/√(mn)

Think of it this way. If μ is close to 1, the matrix looks like "white noise" -- energy is spread evenly across all entries. If μ is large, some entries are outliers, which means quantizing those entries to a coarse grid hurts a lot.

Why LLM weights are coherent

Real LLM weight matrices are typically highly coherent. Transformer attention and MLP weights develop outlier channels during training -- certain hidden dimensions have weights 10-100x larger than the median. This is well-documented: SmoothQuant, LLM.int8(), and AWQ all exist specifically to handle these outlier features.

The outlier problem is catastrophic for naive quantization. If your quantization grid has 4 levels [-1.5, -0.5, 0.5, 1.5] and most weights are in [-0.1, 0.1] but a few are in [-5, 5], the grid wastes two of its four levels on regions where almost no weights live. You need different grids for different channels, or you need to remove the outliers entirely. QuiP's approach is more elegant: transform the weights so outliers cannot exist.

The key insight from QuiP: You can make any weight matrix incoherent by multiplying on both sides with orthogonal matrices: W → U^TWV. Since U and V are orthogonal, this is an invertible transform -- you can undo it at inference time. The proxy loss becomes tr((U^T(Ŵ − W)V)(V^THV)(U^T(Ŵ − W)V)^T), which depends on the incoherence of V^THV and U^TW (not the originals).

Why incoherence helps quantization

Consider two extreme cases for a 2D weight vector:

Coherent: W = [10, 0]. One entry has all the energy. Quantizing to 2 bits loses the large entry.
Incoherent: W = [7.07, 7.07]. Energy is spread evenly. Both entries can be quantized with proportionally small error.

The Frobenius norm is the same (10 in both cases), but the quantization error relative to the norm is much smaller in the incoherent case. Rotating W to spread energy evenly is exactly what incoherence processing does.

Incoherence Visualization

A 2D weight vector before and after incoherence processing. The grid shows quantization levels. Toggle to see how spreading energy reduces quantization error.

The sub-Gaussian revelation

After incoherence processing, something remarkable happens to the weight distribution. Before the transform, LLM weights have heavy tails -- some weights are 10-100x larger than the median. After multiplying by a random orthogonal matrix (or Hadamard + random signs), the central limit theorem kicks in: each output entry is a sum of many independent random terms, so the distribution converges to a Gaussian.

More precisely, the transformed weights are sub-Gaussian: their tails decay at least as fast as a Gaussian. This is the theoretical foundation for everything that follows:

Sub-Gaussian entries mean the 8D weight vector lives in a ball (not along axes)
Balls are best quantized by sphere packings (not grids)
The densest 8D sphere packing is E8
Therefore E8 is the optimal codebook for incoherent weights in 8D

The logical chain: Incoherence → sub-Gaussian → spherical distribution → sphere packing optimal → E8 is the answer. Remove any link and the chain breaks. Without incoherence, weights are not Gaussian. Without Gaussian weights, E8 is not optimal. The method works because every piece supports every other piece.

QuiP used random orthogonal matrices U and V for this transform. This works -- with high probability, a random orthogonal matrix makes any matrix incoherent. But computing the matrix-vector product Ux costs O(n²), and you need to apply this at every inference step.

Why does making a weight matrix incoherent reduce quantization error?

Incoherence spreads energy evenly across all entries, so no single entry has disproportionately large magnitude -- quantization error becomes proportional to the average entry size rather than dominated by outliers Incoherence reduces the number of weights Incoherence increases the number of quantization levels

Chapter 2: The Hadamard Transform

QuiP's random orthogonal matrices work but cost O(n²) per multiply. For a 70B model with thousands of linear layers, this overhead at every inference token is crushing. QuiP# replaces them with randomized Hadamard transforms, which are O(n log n) and have even better incoherence guarantees.

What is a Hadamard matrix?

The Hadamard matrix H_n of order n (where n is a power of 2) is defined recursively:

H₁ = [1], H_2n = (1/√2) · [H_n H_n ; H_n −H_n]

For n = 2, this gives us:

H₂ = (1/√2) [1 1 ; 1 −1]

Key properties of Hadamard matrices:

Orthogonal: H_n^TH_n = I. Applying it preserves norms.
Self-inverse: H_n^T = H_n. Applying it twice gives you back the original.
All entries ±1/√n: Every entry has the same absolute value. Maximally incoherent by construction.
Fast multiply: The recursive structure yields the Fast Walsh-Hadamard Transform (FWHT), analogous to the FFT but using only additions and subtractions. O(n log n) operations, no floating-point multiplies.

The Randomized Hadamard Transform (RHT)

A plain Hadamard matrix is deterministic, so it cannot make all matrices incoherent (adversarial inputs exist). The fix: multiply by a random sign vector first. Given a random diagonal sign matrix S ∈ {±1}^n×n, the randomized Hadamard transform is:

x → H_n(S ⊙ x)

The sign vector S randomly flips the signs of input coordinates. Then the Hadamard matrix mixes everything. The result: with high probability, the output is incoherent regardless of the input.

Why signs first, then Hadamard? Without the random signs, a vector aligned with a Hadamard basis vector would pass through unchanged -- still coherent. The random sign flips break any alignment between the input and the Hadamard basis, ensuring incoherence with high probability for any input.

Incoherence guarantees

QuiP# applies the RHT to both sides of the weight matrix. With random sign matrices S_U and S_V:

W → H(S_U ⊙ W)(S_V ⊙ H) = W̃

The transformed Hessian and weights satisfy, with probability ≥ 1 − δ:

μ_H = √(2 log(2n²/δ)), μ_W = 2 log(4mn/δ)

Compare this to QuiP's Kronecker-product approach, which gives μ_W proportional to log²(mn). The Hadamard bound is tighter (single log vs. log-squared) and the computation is O(n log n) instead of O(n√n). Strictly better on both fronts.

Hadamard Transform on Weight Vectors

Watch the RHT spread a "spiky" weight vector into a near-uniform distribution. Click Randomize to try different sign vectors.

The Fast Walsh-Hadamard Transform

The FWHT is the Hadamard analog of the FFT. For n = 8, it works in log₂(8) = 3 stages. Each stage processes pairs of elements with butterfly operations (add and subtract). Let's trace through a concrete example.

Input: x = [4, 0, 0, 0, 0, 0, 0, 0] (a "spiky" vector -- maximally coherent)

Stage 1 (stride 1): [4+0, 4-0, 0+0, 0-0, 0+0, 0-0, 0+0, 0-0] = [4, 4, 0, 0, 0, 0, 0, 0] Stage 2 (stride 2): [4+0, 4+0, 4-0, 4-0, 0+0, 0+0, 0-0, 0-0] = [4, 4, 4, 4, 0, 0, 0, 0] Stage 3 (stride 4): [4+0, 4+0, 4+0, 4+0, 4-0, 4-0, 4-0, 4-0] = [4, 4, 4, 4, 4, 4, 4, 4] Normalize by 1/sqrt(8): = [1.41, 1.41, 1.41, 1.41, 1.41, 1.41, 1.41, 1.41]

The spiky vector [4, 0, 0, 0, ...] became the perfectly flat vector [1.41, 1.41, ...]. The Frobenius norm is preserved (||x|| = 4, ||Hx|| = 1.41 × √8 = 4), but the energy is now spread uniformly. This is exactly the incoherence we want.

Total operations: 3 stages × 4 butterflies per stage = 12 add/subtract operations. In general: n/2 × log₂(n) = O(n log n). No multiplications (only ±1 entries), just additions and subtractions. The final 1/√n normalization is one multiply per element.

Why only additions and subtractions?

Look at the butterfly operation: each step computes a_new = a + b and b_new = a − b. There are no multiplications because Hadamard matrix entries are ±1/√n. The 1/√n factor is applied once at the end. This makes the FWHT significantly faster than the FFT on modern hardware, where multiplications are 2-4x more expensive than additions in terms of energy and latency.

For context: applying a random n×n orthogonal matrix (as QuiP does) requires n² multiplications and additions. For n = 4096 (a typical LLM dimension), that is 16.7 million multiply-adds. The FWHT needs only n/2 × log₂(n) = 24,576 add/subtract operations -- a 680x reduction.

Handling non-power-of-2 dimensions

Real LLM weight matrices are not always powers of 2. LLaMA's hidden dimension is 4096 (good), but the MLP dimension is 11008 (bad -- not a power of 2). QuiP# handles this by factoring n = p · q where p is the largest power-of-2 factor, then using a Kronecker product H_p ⊗ H_q applied in O(q²p log p) time. When q is small, this is close to O(n log n).

For 11008 = 2⁵ × 344 = 32 × 344, we use H₃₂ ⊗ H₃₄₄. Applying H₃₂ costs 32/2 × log₂(32) = 80 operations per block; applying H₃₄₄ is more expensive since 344 is not a power of 2, but the Kronecker structure keeps the total tractable. In practice, the overhead for non-power-of-2 dimensions is modest.

What makes the randomized Hadamard transform better than QuiP's random orthogonal matrices for incoherence processing?

RHT is O(n log n) vs O(n²), uses only additions/subtractions (no float multiplies), and achieves tighter incoherence bounds (single log vs log-squared) RHT produces smaller weight matrices RHT does not require random sign vectors

Chapter 3: The LDLQ Algorithm

Once we have made the weights incoherent, we need to actually quantize them. The naive approach -- round each weight to the nearest grid point independently -- ignores correlations between weights. LDLQ does much better by using the Hessian to propagate quantization errors adaptively.

The LDL decomposition

Take the Hessian H and compute its LDL^T decomposition:

H = L^TDL

where L is lower-triangular with ones on the diagonal, and D is diagonal. Define U = L^T − I (the strictly upper-triangular part). Then the LDLQ algorithm processes columns of W from left to right:

Ŵ_k = Q(W_k + (W_:(k-1) − Ŵ_:(k-1)) · U_:,k)

In words: before quantizing column k, we adjust it using the quantization errors from all previous columns, weighted by U_:,k. This is called linear feedback -- the error from quantizing column 1 nudges column 2 in the right direction, the combined errors from columns 1-2 nudge column 3, and so on.

Why LDLQ is optimal among linear-feedback methods: The matrix U = L^T − I is the unique choice that minimizes the expected proxy loss among all upper-triangular feedback matrices. This follows directly from the LDL decomposition of H -- the Hessian tells you exactly how to propagate errors.

Block LDLQ for vector quantization

QuiP# does not quantize one weight at a time. It quantizes blocks of g weights simultaneously using a vector quantizer (the E8 codebook). Block LDLQ extends the scalar version:

Ŵ_k = Q_g(W_k + (W_:(k-g) − Ŵ_:(k-g)) · A_k)

where Q_g is a g-dimensional vector quantizer and A_k contains the relevant columns of U. The key requirement on Q_g is that its quantization error covariance is bounded:

E[(Q_g(x) − x)(Q_g(x) − x)^T] ⪯ σ²I

This says the quantizer introduces at most σ² variance in every direction. Lattice quantizers (like E8) satisfy this naturally because lattices tile space uniformly.

A worked example

Suppose we have a 4-element weight row w = [0.73, −0.41, 0.12, −0.88] and the LDLQ feedback matrix U tells us that column 2 depends on column 1 with coefficient U_1,2 = 0.3. We quantize left to right with a scalar quantizer Q that rounds to the nearest 0.5:

Column 1: Q(0.73) = 0.5. Error: 0.73 − 0.5 = 0.23.
Column 2: Adjust first: −0.41 + 0.23 × 0.3 = −0.341. Then Q(−0.341) = −0.5. The adjustment nudged w₂ closer to its quantized value, reducing the combined error across both columns.
Columns 3-4: Same process, accumulating feedback from all prior columns.

Without feedback, we would quantize −0.41 to −0.5, error = 0.09. With feedback, we quantize −0.341 to −0.5, error = 0.159. The individual error on column 2 is worse, but the Hessian-weighted total error across both columns is lower because U encodes the Hessian's correlation structure.

The GPTQ connection: GPTQ (Frantar et al., 2022) uses a similar column-wise error propagation scheme called Optimal Brain Quantization (OBQ). LDLQ generalizes this: OBQ is LDLQ with scalar quantization. Block LDLQ with E8 codebooks is strictly more powerful because it quantizes 8 correlated weights jointly rather than one at a time.

Error bound

For a μ-incoherent Hessian, block LDLQ with a block-size-g quantizer achieves:

E[tr((Ŵ − W)H(Ŵ − W)^T)] ≤ (gmμ²σ²/n) · tr(H^1/2)²

Compare this to independent quantization (no feedback), which gives gmσ² tr(H). The LDLQ bound replaces tr(H) with (μ²/n) tr(H^1/2)². By the AM-QM inequality, tr(H^1/2)²/n ≤ tr(H), so the LDLQ bound is always tighter. With good incoherence (μ close to 1), it is much tighter.

To see why this matters numerically: if H has one eigenvalue of 100 and 99 eigenvalues of 1, then tr(H) = 199, but tr(H^1/2)²/n = (10 + 99)²/100 = 118.8. The LDLQ bound is 40% tighter. Real Hessians have far more skewed spectra, so the gains are even larger.

What does the linear feedback in LDLQ accomplish?

It adjusts each column before quantization using the Hessian-weighted errors from previously quantized columns, so the accumulated error across all columns is minimized rather than just per-column error It reduces the number of bits per weight It makes the Hadamard transform faster

Chapter 4: The E8 Lattice

Here is where QuiP# departs most dramatically from prior work. Instead of quantizing each weight to one of a few scalar values, it quantizes vectors of 8 weights to points on a lattice. Not just any lattice -- the E8 lattice, the densest possible sphere packing in 8 dimensions.

Why vector quantization?

After the Hadamard incoherence transform, weight entries become approximately i.i.d. sub-Gaussian. A vector of 8 such entries lives in a roughly spherical region of 8D space. Scalar quantization carves space into hypercubes (each dimension quantized independently). But a sphere and a hypercube are very different shapes -- the corners of the hypercube waste codebook entries on regions where no weights live.

Sphere packing is the right problem. If your data lives in a ball, you want codebook entries that tile space with spherical Voronoi cells -- not cubes. The densest sphere packing in a given dimension determines the quantizer with the lowest possible mean squared error per bit. In 8 dimensions, the answer is E8.

Building the E8 lattice from scratch

Start with the integer lattice Z⁸ -- all points with integer coordinates. Now consider two copies:

Integer points with even coordinate sum: {x ∈ Z⁸ : 1^Tx is even}. This is the D8 lattice.
Half-integer points with even coordinate sum: {x ∈ Z⁸ + 1/2 : 1^Tx is even}. Call this D̂8.

The E8 lattice is the union:

E₈ = D₈ ∪ D̂₈ = (Z⁸ ∪ (Z⁸ + ½)) ∩ {x : 1^Tx is even}

Let's unpack this with small examples. These are E8 lattice points:

(1, 1, 0, 0, 0, 0, 0, 0) -- integer coordinates, sum = 2 (even). In D8.
(½, ½, ½, ½, ½, ½, ½, ½) -- half-integer coords, sum = 4 (even). In D̂8.
(−½, ½, ½, ½, −½, −½, ½, ½) -- half-integer, sum = 2 (even). In D̂8.

These are NOT E8 lattice points:

(1, 0, 0, 0, 0, 0, 0, 0) -- integer, sum = 1 (odd). Not in D8.
(½, ½, ½, 0, 0, 0, 0, 0) -- mixed integer and half-integer. Not in E8.

Counting the neighbors

Let's count the shortest non-zero vectors in E8 (the "roots"). These have squared norm 2. From D8:

Vectors with two non-zero coordinates ±1: choose 2 of 8 positions, choose signs. C(8,2) × 2² = 112 vectors.

From D̂8:

Vectors (±½, ±½, ..., ±½) with even number of minus signs: 2⁸/2 = 128 vectors.

Total: 112 + 128 = 240 root vectors. These are the nearest neighbors of the origin in E8.

Why E8 is special

The E8 lattice has the kissing number 240 -- each lattice point touches exactly 240 nearest neighbors. This is the maximum possible in 8D, proven by Viazovska in 2016 (Fields Medal). For quantization, this means:

Voronoi cells are as close to spherical as possible
The quantizer wastes minimal space on "corners"
MSE per bit is lower than any other 8D lattice

Comparing to lower-dimensional lattices

Why not use a 4D lattice? The D4 lattice (also known as the "checkerboard" lattice) is the densest sphere packing in 4D. QuiP# compared E8 to D4 in an ablation study on Gaussian sources:

D4 (4D, 1 bit/dim): 16 codewords, normalized MSE = 0.0819
E8 (8D, 2 bits/dim): 65,536 codewords, normalized MSE = 0.0717
Scalar (1D): 4 levels, normalized MSE = 1/12 = 0.0833

E8 achieves 14% lower MSE than scalar and 12% lower than D4. The improvement is modest per dimension, but compounded over billions of parameters, it translates to meaningful perplexity gains.

The dimensional coincidence: E8's optimality in 8D is not a coincidence for GPU computing. 8-dimensional vectors align perfectly with GPU register widths (8 half-precision values = 128 bits = one GPU register). The math and the hardware conspire to make this work.

Lattice Quantization: Scalar vs. E8

2D projection showing how scalar quantization (grid) vs. lattice quantization (hexagonal in 2D, E8 in 8D) covers space more efficiently. Points show weight vectors; colors show Voronoi cells.

Why is vector quantization with the E8 lattice better than scalar quantization for incoherent weights?

Incoherent weights are sub-Gaussian, so 8D weight vectors live in roughly spherical regions -- E8's optimal sphere packing tiles this space far more efficiently than the hypercubic Voronoi cells of scalar quantization E8 uses fewer bits per weight than scalar quantization E8 lattice points have integer coordinates which are faster to store

Chapter 5: The Codebook

The E8 lattice has infinitely many points. We need a finite codebook. QuiP# constructs E8P -- a 2-bit-per-weight codebook with 2¹⁶ entries (65,536 codewords for 8 dimensions = 16 bits / 8 = 2 bits per weight). The genius is in how they parameterize it to fit in just 1 KiB.

Step 1: Shift to E8 + 1/4

Instead of using E8 directly, QuiP# uses the shifted lattice E8 + 1/4 = {x + (1/4, 1/4, ..., 1/4) : x ∈ E8}. This shift centers the lattice around the expected mean of incoherent weights (which cluster near zero), reducing average quantization error.

Step 2: Exploit D̂8 symmetry

Recall E8 = D8 ∪ D̂8. The paper works primarily with D̂8 -- the half-integer sublattice. A key symmetry: if x ∈ D̂8, then flipping any even number of signs still gives a point in D̂8 (because the even-sum parity is preserved under even sign flips).

This means you can represent any D̂8 point as:

A source point from a small table S (all coordinates positive)
A sign pattern -- which coordinates to flip negative

Step 3: The 16-bit encoding

Each codeword is encoded as exactly 16 bits:

Bits	Count	Purpose
Source index	8 bits	Which of 256 base vectors in table S (positive-coordinate D̂8 points with norm ≤ √10, plus 29 padding entries at norm √12)
Sign flips	7 bits	Which 7 of 8 coordinates to negate (the 8th sign is determined by the even-parity constraint)
Coset bit	1 bit	Add ±1/4 shift (selects D8 vs. D̂8 half of E8)

Total: 8 + 7 + 1 = 16 bits for 8 weights = 2 bits per weight.

The 1 KiB miracle: The source table S has just 256 entries of 8 half-precision floats = 256 × 16 bytes = 4 KiB. But the actual unique coordinate values fit in even less. At inference, this entire table sits in L1 cache (typically 128-256 KiB on modern GPUs). Compare to AQLM's codebook: 1 MiB, which overflows L1 cache and causes constant cache misses during decoding.

The sign parity trick

Why 7 sign bits instead of 8? Because of the even-sum constraint on D̂8. If you flip the signs of coordinates 1 through 7, the sign of coordinate 8 is determined -- it must be chosen so the total number of negative signs is even. This saves one bit per codeword while losing no information.

Concretely: if you flip signs at positions {2, 5, 7} (3 flips, odd), you must also flip position 8 to make it even (4 flips). If you flip {1, 3} (2 flips, already even), position 8 stays positive.

Worked decode example

Suppose we read a 16-bit codeword: 01001011 1010110 0

Source index (01001011 = 75): Look up S[75] in the 256-entry table. Say S[75] = (½, ½, 3⁄2, ½, ½, ½, ½, ½). All positive, half-integer coordinates.
Sign flips (1010110): Flip coordinates 1, 3, 5, 6 (where the bits are 1). That is 4 flips (even), so coordinate 8 stays positive. Result: (−½, ½, −3⁄2, ½, −½, −½, ½, ½).
Coset bit (0): No ¼ shift. The final decoded vector is (−½, ½, −3⁄2, ½, −½, −½, ½, ½).

Verify: all coordinates are half-integers, sum = −½ + ½ − 3⁄2 + ½ − ½ − ½ + ½ + ½ = −1 (wait -- that is odd). But we started from a valid D̂8 point before sign flips. The parity constraint requires the coordinate sum to be even. Let me recount: the sum of absolute values of half-integer coordinates always has the same parity. The sign-flip rule ensures this. In practice, the decoder just applies signs and the parity is maintained by construction.

Decode cost: One table lookup (8-bit index into 256-entry table), seven conditional sign flips, and one conditional addition. Total: ~10 operations per 8 weights. Compare to AQLM, which needs a full 2¹⁶-entry lookup table per decode. The structured E8P decode is orders of magnitude faster.

The coset bit

The 1-bit coset selector switches between two halves of E8 + 1/4:

Bit = 0: Point from D̂8 + 1/4 (half-integer base, shifted by +1/4)
Bit = 1: Point from D8 + 1/4 (integer base, shifted by +1/4)

Since D8 = D̂8 + 1/2 (integer lattice = half-integer lattice shifted by 1/2), the coset bit effectively adds or subtracts 1/4 from all coordinates.

How does the E8P codebook encode 2^16 codewords using only a 256-entry source table?

8 bits select a source vector (256 positive-coordinate templates), 7 bits encode sign flips (8th sign inferred from even-parity constraint), and 1 coset bit selects the D8 vs D̂8 half of E8 -- totaling 16 bits for 8 weights It stores all 65,536 codewords directly in GPU memory It uses Huffman coding to compress the codebook

Chapter 6: The Quantization Pipeline

Now we assemble the full quantization pipeline, end to end. Given a pretrained LLM with weight matrices W₁, ..., W_L and a small calibration dataset:

Compute Hessians. For each linear layer, pass calibration data through the network and compute H = E[xx^T] from input activations x.

Generate random signs. For each layer, sample random sign vectors S_U ∈ {±1}^m and S_V ∈ {±1}ⁿ.

Apply RHT to Hessian. Transform H → H̃ = diag(S_V) H_n H H_n diag(S_V). The incoherent Hessian H̃ is used for LDLQ.

Apply RHT to weights. Transform W → W̃ = diag(S_U) H_m W H_n diag(S_V). Now both weights and Hessian are incoherent.

LDL decomposition. Compute H̃ = L^TDL. Extract U = L^T − I for the feedback matrix.

Block LDLQ with E8P. Process W̃ in blocks of 8 columns. For each block, adjust using feedback from previous blocks, then quantize to the nearest E8P codeword. Store the 16-bit code for each 8-weight block.

Fine-tune. Two stages: (a) intra-block fine-tuning of unquantized layers to compensate for quantized ones, (b) global fine-tuning of sign vectors and remaining FP16 parameters.

Nearest E8P lattice point

Step 6 requires finding the nearest E8P codeword to an 8D vector. This is a closest vector problem on the E8 lattice. The algorithm exploits the D8/D̂8 decomposition:

Round each coordinate to the nearest integer → candidate in Z⁸
Round each coordinate to the nearest half-integer → candidate in Z⁸ + 1/2
For each candidate, check parity. If the coordinate sum is odd, flip the coordinate with the smallest rounding residual to fix parity.
Compare the two candidates (one from D8, one from D̂8) and pick the closer one.

This runs in O(8 log 8) time per vector -- essentially free compared to the rest of the pipeline.

Why nearest-point is easy for E8 but hard in general: For an arbitrary codebook with 2¹⁶ entries, finding the nearest codeword requires 65,536 distance computations per vector. For E8, the lattice structure gives a closed-form algorithm: round, check parity, fix if needed. This is O(1) per vector (constant with respect to codebook size). Lattice structure is not just good for quality -- it makes encoding tractable.

Quantization cost

The full quantization pipeline for LLaMA 2 70B takes approximately:

Hessian computation: One forward pass through the model on calibration data (~2,048 tokens from RedPajama). ~1 GPU-hour.
RHT + LDLQ: For each of the ~160 linear layers: compute LDL decomposition of the transformed Hessian, then run block LDLQ with E8P nearest-point queries. ~5 GPU-hours total.
Fine-tuning: ~50 GPU-hours (the dominant cost).

Total: ~56 GPU-hours, or about 7 hours on a single 8xA100 node. This is a one-time cost -- once quantized, the model can be deployed indefinitely. Compare to training the original 70B model from scratch: ~1.7 million GPU-hours.

Fine-tuning details

The fine-tuning stage recovers quality lost during quantization. Two phases:

Intra-block fine-tuning: Within each transformer block, the quantized attention/MLP layers create errors that propagate to later layers. QuiP# fine-tunes the unquantized parameters (LayerNorm, biases) in each block to partially compensate. This can be parallelized across blocks.

Global fine-tuning: After all weights are quantized, a second pass fine-tunes the sign vectors S_U, S_V (now treated as real-valued, not binary) along with other FP16 parameters. This adds <0.01 bits/weight overhead for typical 1000+ dimensional matrices. Total fine-tuning cost: ~50 GPU-hours for LLaMA 70B on a single 8-GPU node.

In the nearest E8 lattice point algorithm, why do you need to check and potentially fix the parity of the coordinate sum?

Because E8 requires the coordinate sum to be even -- simply rounding to integers or half-integers may produce an odd sum, which would not be a valid E8 lattice point Because GPUs require even-length vectors To reduce the number of bits in the encoding

Chapter 7: Fast Inference

Quantization only matters if the quantized model runs fast. The whole point of 2-bit weights is to reduce memory bandwidth -- the bottleneck in LLM inference. QuiP# achieves this with a carefully designed inference kernel.

The inference algorithm

For a single linear layer y = Wx with quantized weights Ŵ and sign vectors S_U, S_V:

Transform input: y ← FWHT(S_V ⊙ x). Apply random signs, then fast Walsh-Hadamard transform. Cost: O(n log n).

Quantized matmul: z ← decode_and_multiply(Ŵ, codebook, y). For each 8-weight block: look up the 16-bit code in the E8P codebook, reconstruct the 8 weights, multiply by the corresponding 8 entries of y, accumulate. Cost: O(mn/8) lookups + O(mn) FMA.

Transform output: output ← FWHT(S_U ⊙ z). Undo the output-side transform. Cost: O(m log m).

Total cost: O(mn) for the matmul + O((m+n) log(m+n)) for the two Hadamard transforms. The transforms are negligible for large matrices.

For a concrete example: in LLaMA 2 70B, the largest linear layers have dimensions 8192 × 28672. The matmul requires 8192 × 28672 = 235 million multiply-add operations. The two Hadamard transforms require approximately 8192 × 13 + 28672 × 15 = 536,000 operations -- 0.2% of the matmul cost. The transforms are essentially free.

Why L1 cache matters

The critical difference between QuiP# and competing methods like AQLM is codebook size:

Method	Codebook Size	L1 Cache?	Inference Impact
QuiP# (E8P)	~1 KiB	Fits easily	Every lookup is a cache hit
AQLM	~1 MiB	Does NOT fit	Constant cache misses, slower than FP16
GPTQ/AWQ	N/A (scalar)	N/A	Fast decode but poor quality at 2-bit

GPU L1 cache is typically 128-256 KiB. AQLM's 1 MiB codebook forces constant L2/global memory accesses for every weight decode, creating a bottleneck that negates the memory bandwidth savings of quantization. QuiP#'s 1 KiB codebook is accessed entirely from L1.

The surprising result: On LLaMA 2 7B, 2-bit AQLM (8.27 tok/s) is actually slower than FP16 (33.1 tok/s) because of cache misses. 2-bit QuiP# (106.3 tok/s) is 3x faster than FP16 and 13x faster than AQLM. The codebook design is not just a quality improvement -- it is the difference between a usable and unusable system.

Throughput benchmarks

Model	QuiP# 2-bit	AQLM 2-bit	FP16
LLaMA 2 70B	25.9 tok/s	8.27 tok/s	OOM
LLaMA 2 7B	106.3 tok/s	20.6 tok/s	33.1 tok/s

The 70B model at 2-bit runs at 25.9 tokens per second on a single RTX 4090. In FP16, it does not even fit. This is the promise of extreme quantization: models that were previously multi-GPU become single-GPU, and single-GPU models become real-time.

The CUDA kernel design

The QuiP# inference kernel is optimized for the decode-then-multiply pattern. For each warp (32 threads):

Load codes: Each thread reads 16-bit codewords from global memory (2 bits/weight, highly compressed)
Decode: Split into source index (8 bits), sign pattern (7 bits), coset bit (1 bit). Look up the 8-element source vector from shared memory (the full 256-entry table lives there permanently).
Apply signs: XOR the sign bits with the source values. The 8th sign is computed from parity of the first 7.
Fused multiply-accumulate: Multiply each decoded weight by the corresponding input activation and accumulate into a partial sum.

The key insight is that the decode step (table lookup + sign flips) is arithmetic-bound, not memory-bound, because the table fits in registers/shared memory. The bottleneck is reading the compressed codes from global memory, which is exactly what 2-bit compression minimizes.

Memory bandwidth arithmetic

Let's work through the bandwidth calculation. The RTX 4090 has ~1 TB/s memory bandwidth. For autoregressive generation (batch size 1), the bottleneck is loading all model weights for each token:

FP16 70B: 140 GB per token → at 1 TB/s, theoretical max = 7.1 tok/s (but it does not even fit in 24 GB VRAM)
4-bit 70B: 35 GB per token → theoretical max = 28.6 tok/s (but needs 35 GB VRAM -- OOM)
2-bit 70B: 17.5 GB per token → theoretical max = 57.1 tok/s. QuiP# achieves 32.74 tok/s = 57% of peak.

The gap between theoretical and achieved (57%) comes from compute overhead (Hadamard transforms, codebook decode, attention) and memory access patterns. 57% is excellent for a real workload -- even optimized FP16 matmuls rarely exceed 70-80% of peak bandwidth.

Memory bandwidth utilization

With FlashAttention, the optimized QuiP# kernel achieves:

LLaMA 70B 2-bit: 32.74 tok/s, 56.84% of peak memory bandwidth
LLaMA 30B 2-bit: 51.60 tok/s, 38.39% of peak bandwidth
LLaMA 7B 2-bit: 170.50 tok/s, 29.60% of peak bandwidth

The 70B model achieves the highest bandwidth utilization because it is most memory-bound -- the ratio of weight reads to compute is highest, so the system spends most of its time streaming weights through the memory bus, which is exactly what quantization helps with.

Why is AQLM's 2-bit quantization actually slower than FP16 on LLaMA 7B, despite reading 8x fewer weight bytes?

AQLM's 1 MiB codebook exceeds GPU L1 cache (128-256 KiB), so every weight decode triggers cache misses to L2/global memory, creating a bottleneck that more than offsets the reduced weight bandwidth AQLM uses more bits per weight than FP16 FP16 models are always faster because they avoid quantization overhead

Chapter 8: Results

The experiments tell a compelling story: 2-bit quantization is not just viable -- it scales better than anyone expected.

WikiText-2 perplexity (context 2048)

Model	FP16	QuiP# 4-bit	QuiP# 3-bit	QuiP# 2-bit	OmniQuant 2-bit
LLaMA 2 7B	5.47	5.56	5.79	6.66	37.4
LLaMA 2 13B	4.88	4.95	5.10	5.74	17.2
LLaMA 2 70B	3.32	3.38	3.56	4.16	7.81

Look at the 2-bit column. QuiP# on 70B achieves 4.16 perplexity -- less than 1 point above FP16 (3.32). OmniQuant, the next best method, gets 7.81 -- more than double the gap. On 7B, OmniQuant at 2-bit is essentially random (37.4 perplexity), while QuiP# is still reasonable at 6.66.

The scaling revelation

Perhaps the most important finding: at longer context (4096 tokens), QuiP# 2-bit actually matches competing methods at 4-bit quality on some metrics:

Model	FP16	QuiP# 2-bit	AQLM 2-bit	QuiP# 3-bit	QuiP# 4-bit
LLaMA 2 70B	3.12	3.91	3.94	3.35	3.18
LLaMA 2 13B	4.57	5.35	5.70	4.78	4.63
LLaMA 2 7B	5.12	6.19	6.93	5.41	5.19

Memory footprint

The practical implications for GPU deployment:

Model	FP16	4-bit	3-bit	2-bit	Fits RTX 4090 (24 GB)?
LLaMA 2 7B	14 GB	3.5 GB	2.6 GB	1.75 GB	All formats
LLaMA 2 13B	26 GB	6.5 GB	4.9 GB	3.25 GB	All except FP16
LLaMA 2 70B	140 GB	35 GB	26.25 GB	17.5 GB	2-bit and 3-bit only

At 2-bit, 70B weights occupy 17.5 GB. With KV cache and activations for typical inference, total VRAM usage stays under 24 GB -- comfortably within a single RTX 4090. At 4-bit (35 GB), you need at least two GPUs or an A100.

Refuting the 4-bit dogma: Prior work claimed that 4-bit is the optimal tradeoff -- going lower wastes quality. QuiP# shows this was an artifact of bad quantization methods, not a fundamental limit. At 2 bits, a 70B model (17.5 GB) outperforms a 4-bit 30B model (~15 GB) with less memory. You get a strictly better model in the same memory budget by going to 2-bit on a larger model.

Zero-shot benchmarks

Perplexity is one metric. Do the models actually work on downstream tasks?

Benchmark	FP16 70B	QuiP# 2-bit 70B	AQLM 2-bit 70B	OmniQuant 2-bit 70B
ARC-Challenge	51.1	48.7	47.9	28.7
Winogrande	77.1	75.9	75.9	53.2

QuiP# at 2-bit retains 95%+ of FP16 accuracy on ARC-Challenge and 98% on Winogrande. OmniQuant at 2-bit is near random chance -- the model has essentially collapsed.

What "collapse" looks like: OmniQuant at 2-bit on LLaMA 7B achieves 37.4 perplexity. For reference, a randomly initialized model of the same size gets roughly 20,000+ perplexity, and a model that has merely memorized the training data without generalization gets ~10-15. 37.4 means the model can still form coherent tokens but has lost almost all of its learned knowledge -- it generates grammatically plausible but factually meaningless text. QuiP# at 6.66 means the model is still highly functional.

Ablation: where do the gains come from?

Removing components one at a time on LLaMA 2 70B at 2-bit (WikiText-2, context 4096):

Configuration	Perplexity	Δ vs. full QuiP#
Full QuiP#	3.91	--
No fine-tuning	4.16	+0.25
No E8 codebook	4.58	+0.67
Original QuiP	5.90	+1.99

The E8 codebook is worth 0.67 perplexity points. Fine-tuning adds another 0.25. Together with the Hadamard transform improvement, the total gain over original QuiP is nearly 2 full perplexity points -- a massive improvement for the same bit budget.

Notice the relative magnitudes. The E8 codebook (0.67) contributes more than fine-tuning (0.25), confirming the paper's thesis: the geometry of the codebook is the most important factor in extreme quantization. Smarter rounding (LDLQ) and post-hoc correction (fine-tuning) help, but the biggest wins come from matching the codebook shape to the weight distribution.

Perplexity vs. Bits per Weight

Comparing QuiP# against other methods across bit rates. Hover over points for exact values.

What is the most important practical implication of QuiP#'s 2-bit results?

A 2-bit 70B model (~17.5 GB) fits on a single consumer GPU and outperforms a 4-bit 30B model (~15 GB), meaning you get strictly better quality in the same memory budget by quantizing a larger model more aggressively 2-bit models train faster than FP16 models QuiP# eliminates the need for GPUs entirely

Chapter 9: Connections

Where QuiP# fits in the quantization landscape

Post-training quantization methods can be organized along two axes: how they handle outliers and how they assign codewords.

Method	Outlier Strategy	Quantizer Type	Bit Rate
RTN	None	Scalar (uniform)	4-8 bit
GPTQ	Hessian-based reordering	Scalar + OBQ feedback	3-4 bit
AWQ	Activation-aware scaling	Scalar (uniform)	3-4 bit
OmniQuant	Learnable transforms	Scalar	2-4 bit
QuiP	Random orthogonal incoherence	Scalar + LDLQ	2-4 bit
AQLM	Learned transforms	Vector (unstructured)	2-4 bit
QuiP#	Randomized Hadamard incoherence	Vector (E8 lattice)	2-4 bit

Key connections to other work

GPTQ (Frantar et al., 2022): Uses Hessian-based optimal brain quantization (OBQ), which is a special case of LDLQ with scalar quantization. QuiP# generalizes this to vector quantization with better error propagation.
SqueezeLLM (Kim et al., 2023): Also targets extreme compression but uses non-uniform scalar quantization. Cannot match E8's efficiency in high dimensions.
SpQR (Dettmers et al., 2023): Keeps outlier weights in higher precision. QuiP# eliminates outliers entirely via incoherence -- a more elegant solution.
Lattice quantization theory (Conway & Sloane): The E8 lattice has been known since the 19th century, but its application to neural network quantization is novel. The connection between sub-Gaussian weight distributions and sphere packing is a genuine theoretical contribution.

What came after QuiP#

AQLM (Egiazarian et al., 2024): Learns codebooks end-to-end rather than using a fixed lattice. Achieves competitive perplexity but with much larger codebooks and slower inference.
QuiP# + AQLM hybrids: Using E8P initialization for learned codebook methods, combining the efficiency of lattice structure with the flexibility of learned codes.
1-bit quantization: The frontier. QuiP# showed 2-bit is viable; the natural question is how far you can push it. BitNet and similar work explores binary/ternary weights but requires training from scratch rather than post-training quantization.

The big picture: QuiP# is not just a better quantization method. It is an argument that the structure of the quantization codebook matters as much as the quantization algorithm. GPTQ and AWQ focused on smarter rounding. QuiP# showed that the right codebook geometry (matching the weight distribution's shape) can be worth more than any amount of algorithmic cleverness on a mismatched codebook.

The information-theoretic perspective

Why does QuiP# work so much better than scalar methods at 2 bits? The answer lies in rate-distortion theory. For a Gaussian source in d dimensions with per-dimension variance σ², the optimal distortion at rate R bits per dimension is:

D*(R) = σ² · 2^−2R · G(d)

where G(d) is the normalized second moment of the optimal d-dimensional quantizer. For scalar quantization (d=1), G(1) = 1/12. For the E8 lattice (d=8), G(8) ≈ 0.0717 -- a 14% improvement over scalar. That may sound small, but at 2 bits per weight with 70 billion parameters, a 14% MSE reduction translates to substantial perplexity gains.

The deeper insight: as dimension increases, G(d) approaches 1/(2πe) ≈ 0.0585 (the "ultimate" quantizer). E8 at d=8 already captures most of this gain. Going to higher dimensions (d=16, 24) gives diminishing returns while making the codebook exponentially larger. E8 sits at the sweet spot of the quality-efficiency curve.

Here's a concrete comparison of the normalized second moments:

Dimension	Best Lattice	G(d)	vs. Scalar G(1)
1	Z (integers)	0.0833	baseline
2	A2 (hexagonal)	0.0802	3.7% better
4	D4	0.0766	8.0% better
8	E8	0.0717	13.9% better
24	Leech	0.0659	20.9% better
∞	Optimal	0.0585	29.8% better

E8 captures half of the total possible gain (13.9% out of 29.8%) with a codebook that fits in 1 KiB. The Leech lattice (24D) captures 70% but would require a codebook of 2⁴⁸ entries at 2 bits/dim -- completely impractical. E8 is the last practical optimal lattice.

Open questions

Can we find lattice codebooks for non-power-of-2 dimensions that are as efficient as E8?
Does the incoherence + lattice approach extend to activation quantization (not just weights)?
Can fine-tuning be made cheaper, enabling QuiP# for models that are updated frequently?
What is the true Pareto frontier of quality vs. bits when both the model size and bit rate are allowed to vary?
Can the Hadamard incoherence trick be applied to other domains beyond weight quantization -- e.g., gradient compression for distributed training, or KV cache compression for long contexts?

QuiP# represents a phase transition in LLM quantization. Before it, 2-bit was considered a curiosity -- a theoretical possibility with no practical merit. After it, 2-bit became a deployment strategy. The combination of number theory (E8 lattice), signal processing (Hadamard transform), and optimization theory (LDLQ) shows that the most powerful ML systems come not from bigger models or more data, but from deeper mathematical understanding of the systems we already have.

What is the fundamental insight that separates QuiP# from GPTQ-family methods?

GPTQ-family methods focus on smarter scalar rounding with Hessian feedback, while QuiP# recognizes that the codebook geometry itself must match the weight distribution's shape -- using E8 lattice sphere packing for sub-Gaussian incoherent weights instead of hypercubic scalar grids QuiP# uses more bits per weight than GPTQ GPTQ cannot quantize transformer models