Two bits per weight. Hadamard incoherence makes weights look Gaussian, the E8 lattice packs 8D spheres tighter than anything else in the universe, and together they compress a 70B model onto a single gaming GPU.
LLaMA 2 70B has 70 billion parameters. At 16-bit precision, that is 140 GB of weight memory. An NVIDIA RTX 4090 has 24 GB of VRAM. You cannot run it. Not even close.
The standard fix is quantization -- represent each weight with fewer bits. At 4 bits per weight, 70B fits in 35 GB. Still too large for one consumer GPU. At 3 bits, 26 GB. Tight, but possible. At 2 bits, 17.5 GB. Comfortable.
But here is the catch: at 2 bits per weight, you have exactly 4 possible values for each parameter. How do you compress a continuous distribution of learned weights into 4 buckets without destroying the model?
The answer turns out to involve some of the most beautiful mathematics in all of science: 8-dimensional sphere packings, randomized linear algebra, and the E8 lattice -- a structure first studied by mathematicians in the 19th century that now turns out to be exactly what you need to compress neural networks.
The predecessor paper, QuiP (Chee et al., 2023), introduced two key ideas: incoherence processing and adaptive rounding. It showed that if you first transform the weight matrix so that no single entry is much larger than any other, quantization error drops dramatically. But QuiP used random orthogonal matrices for the transform, which cost O(n2) to apply and did not admit efficient inference kernels.
QuiP# fixes everything QuiP got wrong, and then goes further:
The result: a 2-bit LLaMA 2 70B model with 4.16 perplexity on WikiText-2. For reference, the full-precision model achieves 3.32. That is remarkably close for an 8x compression ratio.
Let's put the compression ratio in perspective. Each weight goes from 16 bits to 2 bits -- an 8x reduction. The 70B model shrinks from 140 GB to 17.5 GB. But the perplexity increase is only 25% (3.32 → 4.16). You lose a quarter of a nit per token but gain 8x the memory headroom. For most applications -- chatbots, RAG, code completion -- this tradeoff is overwhelmingly favorable.
Imagine you have a weight matrix W and you want to quantize each entry to a nearby grid point. The quantization error for each entry is at most half the grid spacing. But some entries matter more than others -- if one weight is critical to the model's behavior, the error on that weight could be catastrophic.
This is where the Hessian enters. The proxy loss for quantizing a single linear layer with weight matrix W and Hessian H is:
The Hessian H = E[xxT] captures which directions in weight space the model is sensitive to. If H has large eigenvalues in some directions, errors in those directions are amplified.
A matrix is μ-incoherent when its entries are all roughly the same magnitude. Formally:
Think of it this way. If μ is close to 1, the matrix looks like "white noise" -- energy is spread evenly across all entries. If μ is large, some entries are outliers, which means quantizing those entries to a coarse grid hurts a lot.
Real LLM weight matrices are typically highly coherent. Transformer attention and MLP weights develop outlier channels during training -- certain hidden dimensions have weights 10-100x larger than the median. This is well-documented: SmoothQuant, LLM.int8(), and AWQ all exist specifically to handle these outlier features.
The outlier problem is catastrophic for naive quantization. If your quantization grid has 4 levels [-1.5, -0.5, 0.5, 1.5] and most weights are in [-0.1, 0.1] but a few are in [-5, 5], the grid wastes two of its four levels on regions where almost no weights live. You need different grids for different channels, or you need to remove the outliers entirely. QuiP's approach is more elegant: transform the weights so outliers cannot exist.
Consider two extreme cases for a 2D weight vector:
The Frobenius norm is the same (10 in both cases), but the quantization error relative to the norm is much smaller in the incoherent case. Rotating W to spread energy evenly is exactly what incoherence processing does.
A 2D weight vector before and after incoherence processing. The grid shows quantization levels. Toggle to see how spreading energy reduces quantization error.
After incoherence processing, something remarkable happens to the weight distribution. Before the transform, LLM weights have heavy tails -- some weights are 10-100x larger than the median. After multiplying by a random orthogonal matrix (or Hadamard + random signs), the central limit theorem kicks in: each output entry is a sum of many independent random terms, so the distribution converges to a Gaussian.
More precisely, the transformed weights are sub-Gaussian: their tails decay at least as fast as a Gaussian. This is the theoretical foundation for everything that follows:
QuiP used random orthogonal matrices U and V for this transform. This works -- with high probability, a random orthogonal matrix makes any matrix incoherent. But computing the matrix-vector product Ux costs O(n2), and you need to apply this at every inference step.
QuiP's random orthogonal matrices work but cost O(n2) per multiply. For a 70B model with thousands of linear layers, this overhead at every inference token is crushing. QuiP# replaces them with randomized Hadamard transforms, which are O(n log n) and have even better incoherence guarantees.
The Hadamard matrix Hn of order n (where n is a power of 2) is defined recursively:
For n = 2, this gives us:
Key properties of Hadamard matrices:
A plain Hadamard matrix is deterministic, so it cannot make all matrices incoherent (adversarial inputs exist). The fix: multiply by a random sign vector first. Given a random diagonal sign matrix S ∈ {±1}n×n, the randomized Hadamard transform is:
The sign vector S randomly flips the signs of input coordinates. Then the Hadamard matrix mixes everything. The result: with high probability, the output is incoherent regardless of the input.
QuiP# applies the RHT to both sides of the weight matrix. With random sign matrices SU and SV:
The transformed Hessian and weights satisfy, with probability ≥ 1 − δ:
Compare this to QuiP's Kronecker-product approach, which gives μW proportional to log2(mn). The Hadamard bound is tighter (single log vs. log-squared) and the computation is O(n log n) instead of O(n√n). Strictly better on both fronts.
Watch the RHT spread a "spiky" weight vector into a near-uniform distribution. Click Randomize to try different sign vectors.
The FWHT is the Hadamard analog of the FFT. For n = 8, it works in log2(8) = 3 stages. Each stage processes pairs of elements with butterfly operations (add and subtract). Let's trace through a concrete example.
Input: x = [4, 0, 0, 0, 0, 0, 0, 0] (a "spiky" vector -- maximally coherent)
The spiky vector [4, 0, 0, 0, ...] became the perfectly flat vector [1.41, 1.41, ...]. The Frobenius norm is preserved (||x|| = 4, ||Hx|| = 1.41 × √8 = 4), but the energy is now spread uniformly. This is exactly the incoherence we want.
Total operations: 3 stages × 4 butterflies per stage = 12 add/subtract operations. In general: n/2 × log2(n) = O(n log n). No multiplications (only ±1 entries), just additions and subtractions. The final 1/√n normalization is one multiply per element.
Look at the butterfly operation: each step computes anew = a + b and bnew = a − b. There are no multiplications because Hadamard matrix entries are ±1/√n. The 1/√n factor is applied once at the end. This makes the FWHT significantly faster than the FFT on modern hardware, where multiplications are 2-4x more expensive than additions in terms of energy and latency.
For context: applying a random n×n orthogonal matrix (as QuiP does) requires n2 multiplications and additions. For n = 4096 (a typical LLM dimension), that is 16.7 million multiply-adds. The FWHT needs only n/2 × log2(n) = 24,576 add/subtract operations -- a 680x reduction.
Real LLM weight matrices are not always powers of 2. LLaMA's hidden dimension is 4096 (good), but the MLP dimension is 11008 (bad -- not a power of 2). QuiP# handles this by factoring n = p · q where p is the largest power-of-2 factor, then using a Kronecker product Hp ⊗ Hq applied in O(q2p log p) time. When q is small, this is close to O(n log n).
For 11008 = 25 × 344 = 32 × 344, we use H32 ⊗ H344. Applying H32 costs 32/2 × log2(32) = 80 operations per block; applying H344 is more expensive since 344 is not a power of 2, but the Kronecker structure keeps the total tractable. In practice, the overhead for non-power-of-2 dimensions is modest.
Once we have made the weights incoherent, we need to actually quantize them. The naive approach -- round each weight to the nearest grid point independently -- ignores correlations between weights. LDLQ does much better by using the Hessian to propagate quantization errors adaptively.
Take the Hessian H and compute its LDLT decomposition:
where L is lower-triangular with ones on the diagonal, and D is diagonal. Define U = LT − I (the strictly upper-triangular part). Then the LDLQ algorithm processes columns of W from left to right:
In words: before quantizing column k, we adjust it using the quantization errors from all previous columns, weighted by U:,k. This is called linear feedback -- the error from quantizing column 1 nudges column 2 in the right direction, the combined errors from columns 1-2 nudge column 3, and so on.
QuiP# does not quantize one weight at a time. It quantizes blocks of g weights simultaneously using a vector quantizer (the E8 codebook). Block LDLQ extends the scalar version:
where Qg is a g-dimensional vector quantizer and Ak contains the relevant columns of U. The key requirement on Qg is that its quantization error covariance is bounded:
This says the quantizer introduces at most σ2 variance in every direction. Lattice quantizers (like E8) satisfy this naturally because lattices tile space uniformly.
Suppose we have a 4-element weight row w = [0.73, −0.41, 0.12, −0.88] and the LDLQ feedback matrix U tells us that column 2 depends on column 1 with coefficient U1,2 = 0.3. We quantize left to right with a scalar quantizer Q that rounds to the nearest 0.5:
Without feedback, we would quantize −0.41 to −0.5, error = 0.09. With feedback, we quantize −0.341 to −0.5, error = 0.159. The individual error on column 2 is worse, but the Hessian-weighted total error across both columns is lower because U encodes the Hessian's correlation structure.
For a μ-incoherent Hessian, block LDLQ with a block-size-g quantizer achieves:
Compare this to independent quantization (no feedback), which gives gmσ2 tr(H). The LDLQ bound replaces tr(H) with (μ2/n) tr(H1/2)2. By the AM-QM inequality, tr(H1/2)2/n ≤ tr(H), so the LDLQ bound is always tighter. With good incoherence (μ close to 1), it is much tighter.
To see why this matters numerically: if H has one eigenvalue of 100 and 99 eigenvalues of 1, then tr(H) = 199, but tr(H1/2)2/n = (10 + 99)2/100 = 118.8. The LDLQ bound is 40% tighter. Real Hessians have far more skewed spectra, so the gains are even larger.
Here is where QuiP# departs most dramatically from prior work. Instead of quantizing each weight to one of a few scalar values, it quantizes vectors of 8 weights to points on a lattice. Not just any lattice -- the E8 lattice, the densest possible sphere packing in 8 dimensions.
After the Hadamard incoherence transform, weight entries become approximately i.i.d. sub-Gaussian. A vector of 8 such entries lives in a roughly spherical region of 8D space. Scalar quantization carves space into hypercubes (each dimension quantized independently). But a sphere and a hypercube are very different shapes -- the corners of the hypercube waste codebook entries on regions where no weights live.
Start with the integer lattice Z8 -- all points with integer coordinates. Now consider two copies:
The E8 lattice is the union:
Let's unpack this with small examples. These are E8 lattice points:
These are NOT E8 lattice points:
Let's count the shortest non-zero vectors in E8 (the "roots"). These have squared norm 2. From D8:
From D̂8:
Total: 112 + 128 = 240 root vectors. These are the nearest neighbors of the origin in E8.
The E8 lattice has the kissing number 240 -- each lattice point touches exactly 240 nearest neighbors. This is the maximum possible in 8D, proven by Viazovska in 2016 (Fields Medal). For quantization, this means:
Why not use a 4D lattice? The D4 lattice (also known as the "checkerboard" lattice) is the densest sphere packing in 4D. QuiP# compared E8 to D4 in an ablation study on Gaussian sources:
E8 achieves 14% lower MSE than scalar and 12% lower than D4. The improvement is modest per dimension, but compounded over billions of parameters, it translates to meaningful perplexity gains.
2D projection showing how scalar quantization (grid) vs. lattice quantization (hexagonal in 2D, E8 in 8D) covers space more efficiently. Points show weight vectors; colors show Voronoi cells.
The E8 lattice has infinitely many points. We need a finite codebook. QuiP# constructs E8P -- a 2-bit-per-weight codebook with 216 entries (65,536 codewords for 8 dimensions = 16 bits / 8 = 2 bits per weight). The genius is in how they parameterize it to fit in just 1 KiB.
Instead of using E8 directly, QuiP# uses the shifted lattice E8 + 1/4 = {x + (1/4, 1/4, ..., 1/4) : x ∈ E8}. This shift centers the lattice around the expected mean of incoherent weights (which cluster near zero), reducing average quantization error.
Recall E8 = D8 ∪ D̂8. The paper works primarily with D̂8 -- the half-integer sublattice. A key symmetry: if x ∈ D̂8, then flipping any even number of signs still gives a point in D̂8 (because the even-sum parity is preserved under even sign flips).
This means you can represent any D̂8 point as:
Each codeword is encoded as exactly 16 bits:
| Bits | Count | Purpose |
|---|---|---|
| Source index | 8 bits | Which of 256 base vectors in table S (positive-coordinate D̂8 points with norm ≤ √10, plus 29 padding entries at norm √12) |
| Sign flips | 7 bits | Which 7 of 8 coordinates to negate (the 8th sign is determined by the even-parity constraint) |
| Coset bit | 1 bit | Add ±1/4 shift (selects D8 vs. D̂8 half of E8) |
Total: 8 + 7 + 1 = 16 bits for 8 weights = 2 bits per weight.
Why 7 sign bits instead of 8? Because of the even-sum constraint on D̂8. If you flip the signs of coordinates 1 through 7, the sign of coordinate 8 is determined -- it must be chosen so the total number of negative signs is even. This saves one bit per codeword while losing no information.
Concretely: if you flip signs at positions {2, 5, 7} (3 flips, odd), you must also flip position 8 to make it even (4 flips). If you flip {1, 3} (2 flips, already even), position 8 stays positive.
Suppose we read a 16-bit codeword: 01001011 1010110 0
Verify: all coordinates are half-integers, sum = −½ + ½ − &frac32; + ½ − ½ − ½ + ½ + ½ = −1 (wait -- that is odd). But we started from a valid D̂8 point before sign flips. The parity constraint requires the coordinate sum to be even. Let me recount: the sum of absolute values of half-integer coordinates always has the same parity. The sign-flip rule ensures this. In practice, the decoder just applies signs and the parity is maintained by construction.
The 1-bit coset selector switches between two halves of E8 + 1/4:
Since D8 = D̂8 + 1/2 (integer lattice = half-integer lattice shifted by 1/2), the coset bit effectively adds or subtracts 1/4 from all coordinates.
Now we assemble the full quantization pipeline, end to end. Given a pretrained LLM with weight matrices W1, ..., WL and a small calibration dataset:
Step 6 requires finding the nearest E8P codeword to an 8D vector. This is a closest vector problem on the E8 lattice. The algorithm exploits the D8/D̂8 decomposition:
This runs in O(8 log 8) time per vector -- essentially free compared to the rest of the pipeline.
The full quantization pipeline for LLaMA 2 70B takes approximately:
Total: ~56 GPU-hours, or about 7 hours on a single 8xA100 node. This is a one-time cost -- once quantized, the model can be deployed indefinitely. Compare to training the original 70B model from scratch: ~1.7 million GPU-hours.
The fine-tuning stage recovers quality lost during quantization. Two phases:
Quantization only matters if the quantized model runs fast. The whole point of 2-bit weights is to reduce memory bandwidth -- the bottleneck in LLM inference. QuiP# achieves this with a carefully designed inference kernel.
For a single linear layer y = Wx with quantized weights Ŵ and sign vectors SU, SV:
Total cost: O(mn) for the matmul + O((m+n) log(m+n)) for the two Hadamard transforms. The transforms are negligible for large matrices.
For a concrete example: in LLaMA 2 70B, the largest linear layers have dimensions 8192 × 28672. The matmul requires 8192 × 28672 = 235 million multiply-add operations. The two Hadamard transforms require approximately 8192 × 13 + 28672 × 15 = 536,000 operations -- 0.2% of the matmul cost. The transforms are essentially free.
The critical difference between QuiP# and competing methods like AQLM is codebook size:
| Method | Codebook Size | L1 Cache? | Inference Impact |
|---|---|---|---|
| QuiP# (E8P) | ~1 KiB | Fits easily | Every lookup is a cache hit |
| AQLM | ~1 MiB | Does NOT fit | Constant cache misses, slower than FP16 |
| GPTQ/AWQ | N/A (scalar) | N/A | Fast decode but poor quality at 2-bit |
GPU L1 cache is typically 128-256 KiB. AQLM's 1 MiB codebook forces constant L2/global memory accesses for every weight decode, creating a bottleneck that negates the memory bandwidth savings of quantization. QuiP#'s 1 KiB codebook is accessed entirely from L1.
| Model | QuiP# 2-bit | AQLM 2-bit | FP16 |
|---|---|---|---|
| LLaMA 2 70B | 25.9 tok/s | 8.27 tok/s | OOM |
| LLaMA 2 7B | 106.3 tok/s | 20.6 tok/s | 33.1 tok/s |
The 70B model at 2-bit runs at 25.9 tokens per second on a single RTX 4090. In FP16, it does not even fit. This is the promise of extreme quantization: models that were previously multi-GPU become single-GPU, and single-GPU models become real-time.
The QuiP# inference kernel is optimized for the decode-then-multiply pattern. For each warp (32 threads):
The key insight is that the decode step (table lookup + sign flips) is arithmetic-bound, not memory-bound, because the table fits in registers/shared memory. The bottleneck is reading the compressed codes from global memory, which is exactly what 2-bit compression minimizes.
Let's work through the bandwidth calculation. The RTX 4090 has ~1 TB/s memory bandwidth. For autoregressive generation (batch size 1), the bottleneck is loading all model weights for each token:
The gap between theoretical and achieved (57%) comes from compute overhead (Hadamard transforms, codebook decode, attention) and memory access patterns. 57% is excellent for a real workload -- even optimized FP16 matmuls rarely exceed 70-80% of peak bandwidth.
With FlashAttention, the optimized QuiP# kernel achieves:
The 70B model achieves the highest bandwidth utilization because it is most memory-bound -- the ratio of weight reads to compute is highest, so the system spends most of its time streaming weights through the memory bus, which is exactly what quantization helps with.
The experiments tell a compelling story: 2-bit quantization is not just viable -- it scales better than anyone expected.
| Model | FP16 | QuiP# 4-bit | QuiP# 3-bit | QuiP# 2-bit | OmniQuant 2-bit |
|---|---|---|---|---|---|
| LLaMA 2 7B | 5.47 | 5.56 | 5.79 | 6.66 | 37.4 |
| LLaMA 2 13B | 4.88 | 4.95 | 5.10 | 5.74 | 17.2 |
| LLaMA 2 70B | 3.32 | 3.38 | 3.56 | 4.16 | 7.81 |
Look at the 2-bit column. QuiP# on 70B achieves 4.16 perplexity -- less than 1 point above FP16 (3.32). OmniQuant, the next best method, gets 7.81 -- more than double the gap. On 7B, OmniQuant at 2-bit is essentially random (37.4 perplexity), while QuiP# is still reasonable at 6.66.
Perhaps the most important finding: at longer context (4096 tokens), QuiP# 2-bit actually matches competing methods at 4-bit quality on some metrics:
| Model | FP16 | QuiP# 2-bit | AQLM 2-bit | QuiP# 3-bit | QuiP# 4-bit |
|---|---|---|---|---|---|
| LLaMA 2 70B | 3.12 | 3.91 | 3.94 | 3.35 | 3.18 |
| LLaMA 2 13B | 4.57 | 5.35 | 5.70 | 4.78 | 4.63 |
| LLaMA 2 7B | 5.12 | 6.19 | 6.93 | 5.41 | 5.19 |
The practical implications for GPU deployment:
| Model | FP16 | 4-bit | 3-bit | 2-bit | Fits RTX 4090 (24 GB)? |
|---|---|---|---|---|---|
| LLaMA 2 7B | 14 GB | 3.5 GB | 2.6 GB | 1.75 GB | All formats |
| LLaMA 2 13B | 26 GB | 6.5 GB | 4.9 GB | 3.25 GB | All except FP16 |
| LLaMA 2 70B | 140 GB | 35 GB | 26.25 GB | 17.5 GB | 2-bit and 3-bit only |
At 2-bit, 70B weights occupy 17.5 GB. With KV cache and activations for typical inference, total VRAM usage stays under 24 GB -- comfortably within a single RTX 4090. At 4-bit (35 GB), you need at least two GPUs or an A100.
Perplexity is one metric. Do the models actually work on downstream tasks?
| Benchmark | FP16 70B | QuiP# 2-bit 70B | AQLM 2-bit 70B | OmniQuant 2-bit 70B |
|---|---|---|---|---|
| ARC-Challenge | 51.1 | 48.7 | 47.9 | 28.7 |
| Winogrande | 77.1 | 75.9 | 75.9 | 53.2 |
QuiP# at 2-bit retains 95%+ of FP16 accuracy on ARC-Challenge and 98% on Winogrande. OmniQuant at 2-bit is near random chance -- the model has essentially collapsed.
Removing components one at a time on LLaMA 2 70B at 2-bit (WikiText-2, context 4096):
| Configuration | Perplexity | Δ vs. full QuiP# |
|---|---|---|
| Full QuiP# | 3.91 | -- |
| No fine-tuning | 4.16 | +0.25 |
| No E8 codebook | 4.58 | +0.67 |
| Original QuiP | 5.90 | +1.99 |
The E8 codebook is worth 0.67 perplexity points. Fine-tuning adds another 0.25. Together with the Hadamard transform improvement, the total gain over original QuiP is nearly 2 full perplexity points -- a massive improvement for the same bit budget.
Notice the relative magnitudes. The E8 codebook (0.67) contributes more than fine-tuning (0.25), confirming the paper's thesis: the geometry of the codebook is the most important factor in extreme quantization. Smarter rounding (LDLQ) and post-hoc correction (fine-tuning) help, but the biggest wins come from matching the codebook shape to the weight distribution.
Comparing QuiP# against other methods across bit rates. Hover over points for exact values.
Post-training quantization methods can be organized along two axes: how they handle outliers and how they assign codewords.
| Method | Outlier Strategy | Quantizer Type | Bit Rate |
|---|---|---|---|
| RTN | None | Scalar (uniform) | 4-8 bit |
| GPTQ | Hessian-based reordering | Scalar + OBQ feedback | 3-4 bit |
| AWQ | Activation-aware scaling | Scalar (uniform) | 3-4 bit |
| OmniQuant | Learnable transforms | Scalar | 2-4 bit |
| QuiP | Random orthogonal incoherence | Scalar + LDLQ | 2-4 bit |
| AQLM | Learned transforms | Vector (unstructured) | 2-4 bit |
| QuiP# | Randomized Hadamard incoherence | Vector (E8 lattice) | 2-4 bit |
Why does QuiP# work so much better than scalar methods at 2 bits? The answer lies in rate-distortion theory. For a Gaussian source in d dimensions with per-dimension variance σ2, the optimal distortion at rate R bits per dimension is:
where G(d) is the normalized second moment of the optimal d-dimensional quantizer. For scalar quantization (d=1), G(1) = 1/12. For the E8 lattice (d=8), G(8) ≈ 0.0717 -- a 14% improvement over scalar. That may sound small, but at 2 bits per weight with 70 billion parameters, a 14% MSE reduction translates to substantial perplexity gains.
The deeper insight: as dimension increases, G(d) approaches 1/(2πe) ≈ 0.0585 (the "ultimate" quantizer). E8 at d=8 already captures most of this gain. Going to higher dimensions (d=16, 24) gives diminishing returns while making the codebook exponentially larger. E8 sits at the sweet spot of the quality-efficiency curve.
Here's a concrete comparison of the normalized second moments:
| Dimension | Best Lattice | G(d) | vs. Scalar G(1) |
|---|---|---|---|
| 1 | Z (integers) | 0.0833 | baseline |
| 2 | A2 (hexagonal) | 0.0802 | 3.7% better |
| 4 | D4 | 0.0766 | 8.0% better |
| 8 | E8 | 0.0717 | 13.9% better |
| 24 | Leech | 0.0659 | 20.9% better |
| ∞ | Optimal | 0.0585 | 29.8% better |
E8 captures half of the total possible gain (13.9% out of 29.8%) with a codebook that fits in 1 KiB. The Leech lattice (24D) captures 70% but would require a codebook of 248 entries at 2 bits/dim -- completely impractical. E8 is the last practical optimal lattice.
QuiP# represents a phase transition in LLM quantization. Before it, 2-bit was considered a curiosity -- a theoretical possibility with no practical merit. After it, 2-bit became a deployment strategy. The combination of number theory (E8 lattice), signal processing (Hadamard transform), and optimization theory (LDLQ) shows that the most powerful ML systems come not from bigger models or more data, but from deeper mathematical understanding of the systems we already have.