Trellis coded quantization gives LLM weight compression a 256-dimensional codebook that approaches the Shannon limit — using only bitshifts. 2-bit Llama 2 70B at 23.5 tok/s on a single GPU, beating every vector quantization method in both quality and speed.
You have a 70-billion-parameter language model. Each parameter is stored as a 16-bit float. That is 140 GB of weights — more than a single GPU can hold. Inference is memory-bandwidth bound: the GPU spends most of its time just loading weights from memory, not computing with them. If you could shrink every weight from 16 bits to 2 bits, you would cut memory by 8x and speed up inference proportionally.
This is post-training quantization (PTQ). The idea is simple: take a pre-trained model, replace its full-precision weights with low-bit approximations, and accept a small quality loss in exchange for dramatic efficiency gains.
The most basic strategy quantizes each weight independently. Pick a set of reconstruction values (say 4 values for 2-bit), then round each weight to the nearest one. This is scalar quantization, and it works — but it wastes information. It treats each weight as if it were completely unrelated to its neighbors.
Methods like AQLM and QuiP# quantize groups of weights together. QuiP# groups weights into 8-dimensional vectors and maps each to the nearest point in the E8 lattice — a beautiful mathematical structure that packs spheres in 8D almost perfectly. This vector quantization (VQ) exploits correlations between neighboring weights, getting closer to the theoretical limit.
But VQ has a fundamental scaling problem: the codebook grows exponentially with dimension. An 8D codebook at 2 bits/weight already has 216 = 65,536 entries. Going to 16D would require 232 entries — 4 billion — which is impractical. So VQ is stuck at low dimensions.
Here is the rate-distortion picture. For a Gaussian source at 2 bits per dimension, scalar quantization achieves MSE = 0.118. QuiP#'s 8D E8 lattice gets 0.089. QTIP's 256D trellis code gets 0.069 — within 10% of the theoretical minimum of 0.063. That gap shrinks further at higher dimensions.
| Method | Effective Dimension | Codebook Size | MSE (2-bit Gaussian) |
|---|---|---|---|
| Scalar (Lloyd-Max) | 1 | 4 | 0.118 |
| QuiP# (E8P) | 8 | 65,536 | 0.089 |
| QTIP (L=16 TCQ) | 256 | 65,536 states | 0.069 |
| Shannon limit | ∞ | — | 0.063 |
QTIP borrows a trick from telecommunications: trellis coded quantization, first proposed by Marcellin and Fischer in 1990. The idea is to encode weights as a path through a state machine rather than as independent codebook lookups. Each transition in the state machine emits a quantized value, and the optimal path is found by the Viterbi algorithm — the same algorithm used to decode convolutional codes in your phone's modem.
To understand why QTIP uses trellis codes, we need to step back and think about what quantization is really doing. You have a long sequence of real-valued weights w1, w2, ..., wT. You want to replace each with a discrete approximation qi, using only k bits per weight on average. The goal is to minimize the total distortion:
Scalar quantization picks qi independently for each weight. With k = 2 bits you get 4 possible values. The optimal choice (Lloyd-Max quantizer) minimizes MSE for a Gaussian distribution, but it ignores all structure between weights. Think of it as a greedy algorithm that makes locally optimal decisions without looking ahead.
Vector quantization groups weights into vectors of dimension d, then maps each vector to the nearest codeword in a codebook C of size 2kd. For a Gaussian source, the optimal codebook approaches the space-filling gain of the best lattice in dimension d. The E8 lattice that QuiP# uses is one of the best known packings in 8D.
The problem: both codebook storage and nearest-neighbor search scale as 2kd. At d = 8 and k = 2, that is 65K entries — manageable. At d = 16, it is 4 billion. At d = 256, it is 2512 — more than the atoms in the universe.
A trellis code finds a middle ground. Instead of independent scalar decisions or massive vector codebooks, it encodes the weight sequence as a path through a finite state machine. The state machine has S = 2L states. At each step, the encoder is in some state s, reads k input bits, transitions to a new state s', and emits a quantized value q that depends on both s and s'.
This is precisely the same trick that convolutional codes use in telecommunications. When your phone encodes voice data for transmission, it doesn't independently encode each sample — it runs the data through a shift register (a state machine) and outputs encoded bits that depend on both the current input and the recent history. The receiver uses the Viterbi algorithm to find the most likely original sequence. QTIP adapts this machinery from error correction to quantization.
Let us build a trellis code from scratch. We start with the simplest possible version and work up to QTIP's design.
Imagine a register that holds L bits. At each time step, k new bits enter from the left, and the k oldest bits fall off the right. The register shifts. The current state of the register (all L bits) determines which quantized value gets emitted.
More generally, with k input bits per step, the register shifts by k positions. From any state, there are 2k possible transitions. Each transition (s → s') is associated with a reconstruction value q(s, s') — the quantized weight emitted when we take that edge.
Unroll this state machine over time and you get a trellis diagram: a directed graph where columns represent time steps, nodes represent states, and edges represent transitions. Every valid quantization of the weight sequence is a path through this trellis from left to right.
A 4-state trellis (L=2, k=1). Each column is a time step. Click an edge to see its reconstruction value. The highlighted path shows the Viterbi-optimal quantization for the sample weights below.
A trellis code is defined by:
The rate of the code is k/V bits per weight. QTIP uses various combinations: k=1, V=1 gives 1 bit/weight; k=2, V=1 gives 2 bits/weight; k=4, V=2 gives 2 bits/weight with a more complex trellis.
The effective quantization dimension is approximately L/k · V. With L = 16, k = 2, V = 1, the effective dimension is ~128. With V = 2, it reaches ~256. This is because the shift register creates a memory window of L bits. Every quantized output depends on the current L-bit state, which in turn depends on the last L/k input decisions. So each output is implicitly a function of ~L/k neighboring weights — creating a very high-dimensional joint decision without ever explicitly building a high-dimensional codebook.
Let us make the dimension advantage concrete. Suppose you are quantizing a column of 4096 weights (a typical hidden dimension) to 2 bits each. Here is what each approach sees:
The trellis uses the same number of bits as scalar quantization but achieves near-VQ-256 quality because every output depends on its 128-step neighborhood. The state machine creates implicit high-dimensional structure without explicit high-dimensional codebooks.
Not all trellis structures are equally friendly to hardware. QTIP introduces a specific trellis design — the bitshift trellis — that maps perfectly to GPU bitwise operations.
In a bitshift trellis, node i has an edge to node j if and only if:
In plain English: shift the state left by kV bits (dropping the top kV bits), then fill in the bottom kV bits with any value c. This is literally a bitshift plus an OR.
This structure has three critical properties:
This is a subtle but essential point. The Viterbi algorithm (expensive, sequential) is used only once, during the offline quantization step. It finds the optimal path through the trellis for each weight column. Once found, the path is stored as a packed bitstream — just k bits per weight.
At inference time, decoding is trivial: for weight at position t, extract the L-bit window centered at t from the bitstream, and compute or look up the reconstruction value. No dynamic programming. No state tracking. Just bitshifts.
We have a trellis with 2L states and a sequence of T real-valued weights. We need to find the path through the trellis that minimizes total reconstruction error. This is exactly the problem the Viterbi algorithm solves.
Define score(t, s) as the minimum total squared error achievable for quantizing weights w1 through wt, ending in state s at time t. The recursion is:
where the minimum is over all states s' that have a valid transition to s, and q(s', s) is the reconstruction value for that transition.
Each of the T time steps examines 2L states. Each state considers 2kV predecessors. Total operations: O(T · 2L · 2kV). For L = 16, k = 2, V = 1, T = 4096 (a typical column length), that is 4096 × 65,536 × 4 ≈ 109 operations. Substantial but tractable on a GPU, and it only runs once during quantization.
Watch the Viterbi algorithm find the optimal quantization path through a small trellis. Each cell shows the accumulated distortion. The highlighted path is the minimum-cost route. Click "Step" to advance one time step, or "Run" to animate the full search.
Where do the values q(s', s) come from? This is the codebook design problem. QTIP explores several strategies, from pure lookup tables to computed values. We will cover these in Chapter 6. For now, assume each transition has an associated reconstruction value that was optimized on calibration data.
Trellis codes work best when the weights they quantize are roughly independent and identically distributed — ideally Gaussian. Real LLM weight matrices are not like this. They have correlated columns, outlier values, and non-uniform distributions. Incoherence processing fixes this.
Consider a weight matrix W ∈ Rm × n from a linear layer. Some columns might have values 10x larger than others. Some rows are highly correlated. If you apply trellis quantization directly, the outlier columns dominate the error and the trellis wastes states on rare extreme values.
QuiP# introduced the idea of multiplying both sides by random orthogonal matrices to "spread out" the information. QTIP uses the same framework.
The incoherence processing applies:
where:
Formally, the incoherence of a matrix is the maximum squared entry normalized by the Frobenius norm: μW = max|Wij|2 · mn / ||W||F2. A perfectly incoherent matrix has μ = 1 (all entries equal magnitude). For the transformed matrix, with high probability:
This is logarithmic in the matrix size — essentially constant. No outliers survive the transform.
If we quantize W̃ instead of W, we need to undo the transform during inference. For a linear layer y = Wx, we can absorb the transforms:
The extra cost is two fast Hadamard transforms and two elementwise sign flips. The Hadamard transform is O(n log n), which is negligible compared to the matrix multiply O(mn). And the sign flips are free (just XOR the sign bit).
Consider a tiny weight matrix before and after incoherence processing:
# Before: correlated, outlier-heavy W = [[ 5.2, 0.1, 0.3, -0.1], [ 4.8, -0.2, 0.1, 0.0], [ 0.0, 3.1, -0.1, 0.2], [-0.1, 2.9, 0.0, -0.3]] # Column 0 has outliers ~5, others are ~0.1 # Incoherence mu = 5.2^2 * 16 / ||W||_F^2 ≈ 5.8 (bad) # After: W_tilde = V * S * W * S * V^T W_tilde = [[ 1.8, -1.2, 0.9, -1.5], [-0.7, 1.4, -1.1, 0.8], [ 1.3, 0.6, -1.0, -1.3], [-0.9, -1.1, 1.2, 0.7]] # All entries ~same magnitude, uncorrelated # Incoherence mu ≈ 1.3 (near-optimal)
The outlier energy from column 0 has been spread across all entries. Each column of W̃ now looks like an i.i.d. Gaussian sample — exactly what the trellis code is designed for.
You might think: just scale each column to have unit variance. But that only fixes the magnitude problem, not the correlation problem. Two columns might be nearly identical — their quantization errors would be correlated, causing systematic bias in the output. The random Hadamard transform decorrelates and normalizes simultaneously, which is why it gives strictly better results than simple per-column scaling.
The trellis structure tells you which states connect to which. But what reconstruction value does each transition emit? QTIP explores a spectrum of approaches, trading off quality against decode speed.
The simplest approach: store a lookup table (LUT) indexed by the L-bit state. Each entry maps to a float16 reconstruction value. For L = 14, that is 214 = 16,384 entries × 2 bytes = 32 KB. For L = 16, it is 128 KB.
The problem? GPU L1 cache is typically 128-256 KB per SM. A 128 KB LUT would monopolize the entire cache, starving the rest of the computation. And L1 cache latency is ~28 cycles for a random access — much slower than arithmetic.
The opposite extreme: compute the reconstruction value from the L-bit state using only arithmetic operations. No table at all. QTIP proposes two variants:
Use a linear congruential generator (LCG) as a hash function, then sum four 8-bit segments of the result:
// 1MAD: ~4 instructions per weight hash = state * 0x9E3779B9 // LCG multiply hash = hash + 0x6A09E667 // LCG add // Sum four 8-bit segments to get pseudo-Gaussian val = (hash & 0xFF) + ((hash >> 8) & 0xFF) + ((hash >> 16) & 0xFF) + (hash >> 24)
The sum of four uniform bytes approximates a Gaussian (central limit theorem). This gives a pseudo-random reconstruction value from 4 ALU instructions — no memory access.
An even simpler hash using XOR operations on the FP16 bit representation:
// 3INST: ~3 instructions per weight hash = state * 0x9E3779B9 // XOR mantissa/exponent bits to produce FP16 val = xor_fp16(hash)
Three ALU instructions total. Slightly worse quality than 1MAD but faster.
The sweet spot: use a tiny lookup table (2Q × 2 entries, with Q ≈ 6-8) indexed by a hash of the state. Then apply a sign flip from an additional bit.
// HYB: hash + small LUT (~2 instructions amortized) idx = (state * state + state) & MASK // x^2 + x hash sign = (state >> sign_bit) & 1 // extract sign val = LUT[idx] * (1 - 2 * sign) // apply sign flip
The LUT has only 128-512 entries (256 B - 1 KB), easily fitting in L1 cache. The entries are optimized via K-means on the empirical weight distribution. The sign bit doubles the effective codebook size for free.
| Code Type | LUT Size | Instructions/Weight | Quality (MSE) | Best For |
|---|---|---|---|---|
| Lookup-only | 32-128 KB | 1 (lookup) | Best | L ≤ 14, quality-first |
| HYB | 256 B - 1 KB | ~2 | Near-best | L = 16, balanced |
| 1MAD | 0 | ~4 | Good | Speed-first, large L |
| 3INST | 0 | ~3 | Decent | Maximum throughput |
For lookup-only and HYB codes, the LUT entries are trained on calibration data. QTIP uses 2048 samples from RedPajama (128 tokens each). The optimization minimizes:
where Q applies the Viterbi algorithm with the current LUT, and gradients flow through the LUT entries via straight-through estimation. This runs for 50-200 iterations. For computed codes (1MAD, 3INST), there is nothing to optimize — the hash function is fixed.
The theoretical elegance of trellis codes means nothing if the decode kernel is slow. QTIP's design ensures that going from packed bits to dequantized weights is fast enough to be memory-bound, not compute-bound.
At inference time, for each weight position t in a column of length T:
Steps 1-3 are fused into a single CUDA kernel. The packed bitstream is stored in global memory and streamed sequentially — excellent for GPU memory bandwidth.
The quantized weights are stored as packed k-bit integers. For k = 2, four weights pack into one byte. The bitstream is laid out so that consecutive weights in a column are contiguous in memory, enabling coalesced reads on the GPU.
Each linear layer now computes y = Sm VmT (W̃ · Vn Sn x) instead of y = Wx. The extra operations are:
Total overhead: under 1% of inference time. The 4-8x memory reduction far outweighs this.
Tokens per second on an RTX 6000 Ada at batch size 1. QTIP matches QuiP# speed while achieving better quantization quality. AQLM is 2-3x slower due to large codebook lookups.
Numbers matter. QTIP was evaluated on Llama 2 (7B, 13B, 70B) and Llama 3 (8B, 70B) at 2-bit, 3-bit, and 4-bit quantization, compared against the strongest baselines: QuiP#, AQLM, and GPTQ.
At 2 bits per weight, the differences between methods are stark. This is where the trellis advantage is most visible.
| Model | FP16 | QTIP (1MAD) | QuiP# | AQLM |
|---|---|---|---|---|
| Llama 2 7B | 5.12 | 6.82 | 8.22 | 6.14 |
| Llama 2 70B | 3.12 | 3.90 | 4.16 | 3.83 |
| Llama 3 70B | — | 4.97 | 5.77 | — |
At 3 bits, all methods approach FP16 quality, but QTIP (HYB variant) still leads:
| Model | FP16 | QTIP (HYB) | QuiP# | AQLM |
|---|---|---|---|---|
| Llama 2 70B | 3.12 | 3.26 | 3.35 | 3.36 |
At 3 bits, QTIP is within 0.14 perplexity of FP16 on a 70B model. The gap between methods shrinks because 3 bits provides enough room for even simpler methods to do well.
The speed comparison is equally important. QTIP must not sacrifice throughput for quality.
| Model / Method | Bits | Tokens/sec | Speedup vs FP16 |
|---|---|---|---|
| Llama 2 7B FP16 | 16 | 55.9 | 1.0x |
| Llama 2 7B QTIP | 2 | 188.0 | 3.4x |
| Llama 2 7B QuiP# | 2 | 186.0 | 3.3x |
| Llama 2 7B AQLM | 2 | 81.5 | 1.5x |
| Llama 2 70B QTIP | 2 | 23.5 | — |
| Llama 2 70B QuiP# | 2 | 22.2 | — |
| Llama 2 70B AQLM | 2 | 8.78 | — |
A pattern emerges across model sizes: larger models are easier to quantize. On Llama 2 7B, 2-bit QTIP adds 1.7 perplexity over FP16. On 70B, the gap shrinks to 0.78. On Llama 3.1 405B, it is only ~0.17. This makes intuitive sense — larger models have more redundancy in their weights, so aggressive compression loses less essential information.
This scaling property makes QTIP particularly valuable for the models that need compression most: the ones too large to serve at full precision. A 405B model that fits on a single 8-GPU node at 2-bit is far more useful than one requiring 4 nodes at FP16.
Perplexity on WikiText-2 is just one metric. QTIP also reports zero-shot accuracy on standard benchmarks (ARC, PIQA, WinoGrande, HellaSwag). The pattern holds: at 2-bit, QTIP matches or exceeds QuiP# and AQLM on every benchmark across all model sizes. The task-level gains correlate strongly with perplexity improvements, confirming that lower perplexity translates to genuinely better language understanding, not just better next-token prediction on one dataset.
QTIP also quantized Llama 3.1 405B Instruct to 2 bits, achieving a perplexity of 3.29. This compresses the model from 810 GB to ~101 GB — fitting on a single 8-GPU node with room to spare. No other 2-bit method was reported on this model at publication time.
Llama 2 70B perplexity (WikiText-2) at different bitrates. Lower is better. QTIP consistently outperforms baselines, with the gap largest at 2 bits where high-dimensional quantization matters most.
QTIP sits at the intersection of information theory, signal processing, and systems engineering. Here is how it connects to the broader landscape.
TCQ was introduced by Marcellin and Fischer (1990), building on Ungerboeck's trellis coded modulation (1982) from telecommunications. The original TCQ paper showed that trellis codes approach the Shannon rate-distortion bound as the constraint length grows. QTIP is the first to bring this classical result to LLM weight compression, with the key innovation being the bitshift structure for GPU-friendly decoding.
Shannon's rate-distortion theorem (1959) tells us the minimum bitrate needed to represent a source at a given distortion level. For a Gaussian source at distortion D:
Any practical quantizer must use at least R(D) bits per symbol. Scalar quantization is far from this bound. VQ gets closer as dimension increases. TCQ approaches it as the constraint length grows. QTIP with L = 16 gets within 10% of the bound — a remarkable achievement for a practical system.
| Method | Approach | Effective Dim | Decode Cost |
|---|---|---|---|
| GPTQ | Scalar + Hessian-aware rounding | 1 | Trivial |
| AWQ | Scalar + activation-aware scaling | 1 | Trivial |
| AQLM | Additive VQ (multi-codebook) | 8 | Large LUT |
| QuiP# | E8 lattice VQ + incoherence | 8 | Lattice decode |
| QTIP | Trellis coded quantization + incoherence | 256 | Bitshift + small LUT |
QTIP received a Spotlight designation at NeurIPS 2024. The reviewers highlighted three contributions: (1) the novel application of trellis coded quantization to LLMs, connecting classical information theory to modern deep learning; (2) the principled spectrum of code designs from lookup-only to compute-only, giving practitioners a clear quality-speed tradeoff; and (3) the strong empirical results that simultaneously beat all baselines on quality AND speed — a rare achievement in quantization where there is usually a tradeoff between the two.
The work also stands out for its theoretical grounding. Unlike many quantization papers that are primarily empirical, QTIP derives its approach from rate-distortion theory, giving a principled explanation for why trellis codes work better: they approach the Shannon limit at high effective dimension.