Tseng, Sun, Hou & De Sa — NeurIPS 2024 Spotlight

QTIP Quantization with Trellises and Incoherence Processing

Trellis coded quantization gives LLM weight compression a 256-dimensional codebook that approaches the Shannon limit — using only bitshifts. 2-bit Llama 2 70B at 23.5 tok/s on a single GPU, beating every vector quantization method in both quality and speed.

Prerequisites: Post-training quantization basics + LLM inference bottlenecks + Rate-distortion intuition
10
Chapters
3+
Simulations

Chapter 0: The Problem

You have a 70-billion-parameter language model. Each parameter is stored as a 16-bit float. That is 140 GB of weights — more than a single GPU can hold. Inference is memory-bandwidth bound: the GPU spends most of its time just loading weights from memory, not computing with them. If you could shrink every weight from 16 bits to 2 bits, you would cut memory by 8x and speed up inference proportionally.

This is post-training quantization (PTQ). The idea is simple: take a pre-trained model, replace its full-precision weights with low-bit approximations, and accept a small quality loss in exchange for dramatic efficiency gains.

The scalar approach

The most basic strategy quantizes each weight independently. Pick a set of reconstruction values (say 4 values for 2-bit), then round each weight to the nearest one. This is scalar quantization, and it works — but it wastes information. It treats each weight as if it were completely unrelated to its neighbors.

The vector quantization leap

Methods like AQLM and QuiP# quantize groups of weights together. QuiP# groups weights into 8-dimensional vectors and maps each to the nearest point in the E8 lattice — a beautiful mathematical structure that packs spheres in 8D almost perfectly. This vector quantization (VQ) exploits correlations between neighboring weights, getting closer to the theoretical limit.

But VQ has a fundamental scaling problem: the codebook grows exponentially with dimension. An 8D codebook at 2 bits/weight already has 216 = 65,536 entries. Going to 16D would require 232 entries — 4 billion — which is impractical. So VQ is stuck at low dimensions.

The core question: How do you get the benefits of high-dimensional quantization — better rate-distortion performance — without an exponentially large codebook? QTIP's answer: use a trellis code, which achieves effectively 256-dimensional quantization with only 216 states.

Here is the rate-distortion picture. For a Gaussian source at 2 bits per dimension, scalar quantization achieves MSE = 0.118. QuiP#'s 8D E8 lattice gets 0.089. QTIP's 256D trellis code gets 0.069 — within 10% of the theoretical minimum of 0.063. That gap shrinks further at higher dimensions.

MethodEffective DimensionCodebook SizeMSE (2-bit Gaussian)
Scalar (Lloyd-Max)140.118
QuiP# (E8P)865,5360.089
QTIP (L=16 TCQ)25665,536 states0.069
Shannon limit0.063

QTIP borrows a trick from telecommunications: trellis coded quantization, first proposed by Marcellin and Fischer in 1990. The idea is to encode weights as a path through a state machine rather than as independent codebook lookups. Each transition in the state machine emits a quantized value, and the optimal path is found by the Viterbi algorithm — the same algorithm used to decode convolutional codes in your phone's modem.

Why can't vector quantization easily scale to 256-dimensional quantization at 2 bits/weight?

Chapter 1: Why Trellises

To understand why QTIP uses trellis codes, we need to step back and think about what quantization is really doing. You have a long sequence of real-valued weights w1, w2, ..., wT. You want to replace each with a discrete approximation qi, using only k bits per weight on average. The goal is to minimize the total distortion:

D = ∑i=1T (wi − qi)2

Scalar: each weight on its own

Scalar quantization picks qi independently for each weight. With k = 2 bits you get 4 possible values. The optimal choice (Lloyd-Max quantizer) minimizes MSE for a Gaussian distribution, but it ignores all structure between weights. Think of it as a greedy algorithm that makes locally optimal decisions without looking ahead.

Vector: groups of weights

Vector quantization groups weights into vectors of dimension d, then maps each vector to the nearest codeword in a codebook C of size 2kd. For a Gaussian source, the optimal codebook approaches the space-filling gain of the best lattice in dimension d. The E8 lattice that QuiP# uses is one of the best known packings in 8D.

The problem: both codebook storage and nearest-neighbor search scale as 2kd. At d = 8 and k = 2, that is 65K entries — manageable. At d = 16, it is 4 billion. At d = 256, it is 2512 — more than the atoms in the universe.

Trellis: the state machine escape

A trellis code finds a middle ground. Instead of independent scalar decisions or massive vector codebooks, it encodes the weight sequence as a path through a finite state machine. The state machine has S = 2L states. At each step, the encoder is in some state s, reads k input bits, transitions to a new state s', and emits a quantized value q that depends on both s and s'.

State Machine
The trellis has 2L states (e.g., L=16 gives 65,536 states). Each state can transition to 2k next states, emitting one quantized value per transition.
Sequence Encoding
A weight sequence becomes a path through the trellis. The path is determined by the k-bit index at each step. Total storage: k bits per weight — same as scalar — but the reconstruction quality is much higher because decisions are coupled.
Viterbi Decoding
Finding the best path (minimum distortion) is a dynamic programming problem. The Viterbi algorithm solves it in O(2L · T) time. Unlike VQ search over 2kd entries, this is linear in sequence length T.
Effective Dimension
The coupled decisions mean the trellis acts like a very-high-dimensional quantizer. With L=16, the effective dimension reaches ~256 — approaching the Shannon limit — with only 65K states, not 2512 codebook entries.
The key insight: A trellis code separates the codebook size from the effective quantization dimension. The number of states (2L) controls the quality, not 2kd. This lets you get arbitrarily high-dimensional quantization — and thus arbitrarily close to the Shannon limit — by increasing L, while keeping everything tractable.

This is precisely the same trick that convolutional codes use in telecommunications. When your phone encodes voice data for transmission, it doesn't independently encode each sample — it runs the data through a shift register (a state machine) and outputs encoded bits that depend on both the current input and the recent history. The receiver uses the Viterbi algorithm to find the most likely original sequence. QTIP adapts this machinery from error correction to quantization.

What does a trellis code separate that vector quantization cannot?

Chapter 2: Trellis Codes from First Principles

Let us build a trellis code from scratch. We start with the simplest possible version and work up to QTIP's design.

The shift register

Imagine a register that holds L bits. At each time step, k new bits enter from the left, and the k oldest bits fall off the right. The register shifts. The current state of the register (all L bits) determines which quantized value gets emitted.

Concrete example. Suppose L = 4 (16 states) and k = 1 (1 new bit per step). The register holds 4 bits: [b3 b2 b1 b0]. At each step, a new bit b enters on the left, the oldest bit b0 falls off, and the state becomes [b b3 b2 b1]. The new bit b is the "input" — one of {0, 1} — so from any state, there are exactly 2 possible next states.

More generally, with k input bits per step, the register shifts by k positions. From any state, there are 2k possible transitions. Each transition (s → s') is associated with a reconstruction value q(s, s') — the quantized weight emitted when we take that edge.

The trellis diagram

Unroll this state machine over time and you get a trellis diagram: a directed graph where columns represent time steps, nodes represent states, and edges represent transitions. Every valid quantization of the weight sequence is a path through this trellis from left to right.

Trellis Diagram

A 4-state trellis (L=2, k=1). Each column is a time step. Click an edge to see its reconstruction value. The highlighted path shows the Viterbi-optimal quantization for the sample weights below.

Sample weights Preset 1

Formal definition

A trellis code is defined by:

The rate of the code is k/V bits per weight. QTIP uses various combinations: k=1, V=1 gives 1 bit/weight; k=2, V=1 gives 2 bits/weight; k=4, V=2 gives 2 bits/weight with a more complex trellis.

Why does dimension grow with L?

The effective quantization dimension is approximately L/k · V. With L = 16, k = 2, V = 1, the effective dimension is ~128. With V = 2, it reaches ~256. This is because the shift register creates a memory window of L bits. Every quantized output depends on the current L-bit state, which in turn depends on the last L/k input decisions. So each output is implicitly a function of ~L/k neighboring weights — creating a very high-dimensional joint decision without ever explicitly building a high-dimensional codebook.

The shift register is the codebook. Instead of storing 2kd entries for d-dimensional VQ, the trellis stores only 2L states and their transitions. The "codebook" is implicitly defined by all possible paths through these states. The number of distinct T-length paths is 2kT — the same as T independent k-bit scalar quantizers — but the reconstruction values are coupled across time, giving vector-like quality.

A concrete rate-distortion comparison

Let us make the dimension advantage concrete. Suppose you are quantizing a column of 4096 weights (a typical hidden dimension) to 2 bits each. Here is what each approach sees:

The trellis uses the same number of bits as scalar quantization but achieves near-VQ-256 quality because every output depends on its 128-step neighborhood. The state machine creates implicit high-dimensional structure without explicit high-dimensional codebooks.

In a trellis with L=16 and k=2, how many states are there, and how many transitions leave each state?

Chapter 3: The Bitshift Trellis

Not all trellis structures are equally friendly to hardware. QTIP introduces a specific trellis design — the bitshift trellis — that maps perfectly to GPU bitwise operations.

The transition rule

In a bitshift trellis, node i has an edge to node j if and only if:

j = (i · 2kV mod 2L) + c,    where 0 ≤ c < 2kV

In plain English: shift the state left by kV bits (dropping the top kV bits), then fill in the bottom kV bits with any value c. This is literally a bitshift plus an OR.

Worked example. Let L = 4 (4-bit state), k = 1, V = 1 (so kV = 1). State i = 0b1011 (decimal 11). Shift left by 1: 0b0110 (decimal 6). The new bottom bit c can be 0 or 1, giving next states 6 (0b0110) or 7 (0b0111). That is: the top 3 bits of j equal the bottom 3 bits of i. The old top bit is gone — it has "shifted out."

Why bitshift?

This structure has three critical properties:

The hardware payoff: A general convolutional code trellis requires sequential Viterbi decoding at inference time. The bitshift trellis needs Viterbi only during quantization (offline, once). At inference, decoding is embarrassingly parallel — each weight is decoded independently from its L-bit window. This is why QTIP matches QuiP#'s speed despite having a 32x larger effective dimension.

The encoding vs. decoding asymmetry

This is a subtle but essential point. The Viterbi algorithm (expensive, sequential) is used only once, during the offline quantization step. It finds the optimal path through the trellis for each weight column. Once found, the path is stored as a packed bitstream — just k bits per weight.

At inference time, decoding is trivial: for weight at position t, extract the L-bit window centered at t from the bitstream, and compute or look up the reconstruction value. No dynamic programming. No state tracking. Just bitshifts.

Offline: Quantization (Slow, Once)
Run Viterbi algorithm through the trellis to find the minimum-distortion path. Store the path as k bits per weight. O(2L · T) time per column.
Online: Inference (Fast, Every Token)
For each weight, extract its L-bit window from the bitstream. Compute or look up the reconstruction value. All weights decoded in parallel. O(1) per weight.
Why can the bitshift trellis be decoded in parallel at inference time, unlike a general convolutional code?

Chapter 4: Viterbi Decoding for Optimal Quantization

We have a trellis with 2L states and a sequence of T real-valued weights. We need to find the path through the trellis that minimizes total reconstruction error. This is exactly the problem the Viterbi algorithm solves.

The dynamic programming formulation

Define score(t, s) as the minimum total squared error achievable for quantizing weights w1 through wt, ending in state s at time t. The recursion is:

score(t, s) = mins' → s [ score(t−1, s') + (wt − q(s', s))2 ]

where the minimum is over all states s' that have a valid transition to s, and q(s', s) is the reconstruction value for that transition.

Base case: score(0, s) = 0 for all states s (or for a designated start state). Final answer: mins score(T, s) gives the minimum achievable distortion. The optimal path is recovered by backtracking: at each step, record which predecessor s' achieved the minimum.

Complexity

Each of the T time steps examines 2L states. Each state considers 2kV predecessors. Total operations: O(T · 2L · 2kV). For L = 16, k = 2, V = 1, T = 4096 (a typical column length), that is 4096 × 65,536 × 4 ≈ 109 operations. Substantial but tractable on a GPU, and it only runs once during quantization.

Step by step

Step 1: Initialize
Set score(0, s) = 0 for all states. Create a backpointer array.
Step 2: Forward pass
For t = 1 to T: for each state s, compute score(t, s) by checking all valid predecessors s'. Record the best predecessor in backpointer[t][s].
Step 3: Terminate
Find s* = argmins score(T, s). This is the optimal final state.
Step 4: Backtrack
Follow backpointers from s* backward through time to recover the full optimal path s*T, s*T-1, ..., s*0.
Step 5: Extract bits
From the optimal path, extract the k-bit input at each step (determined by which transition was taken). Pack these into the bitstream.
Viterbi Path Search

Watch the Viterbi algorithm find the optimal quantization path through a small trellis. Each cell shows the accumulated distortion. The highlighted path is the minimum-cost route. Click "Step" to advance one time step, or "Run" to animate the full search.

What about the reconstruction values?

Where do the values q(s', s) come from? This is the codebook design problem. QTIP explores several strategies, from pure lookup tables to computed values. We will cover these in Chapter 6. For now, assume each transition has an associated reconstruction value that was optimized on calibration data.

In the Viterbi algorithm for trellis quantization, what does score(t, s) represent?

Chapter 5: Incoherence Processing

Trellis codes work best when the weights they quantize are roughly independent and identically distributed — ideally Gaussian. Real LLM weight matrices are not like this. They have correlated columns, outlier values, and non-uniform distributions. Incoherence processing fixes this.

The problem with raw weights

Consider a weight matrix W ∈ Rm × n from a linear layer. Some columns might have values 10x larger than others. Some rows are highly correlated. If you apply trellis quantization directly, the outlier columns dominate the error and the trellis wastes states on rare extreme values.

QuiP# introduced the idea of multiplying both sides by random orthogonal matrices to "spread out" the information. QTIP uses the same framework.

The Random Hadamard Transform

The incoherence processing applies:

W̃ = Vm Sm W Sn VnT

where:

What this does geometrically: The random sign matrix S randomizes the coordinate directions. The Hadamard matrix V then spreads each coordinate's energy uniformly across all coordinates. The combined effect makes every entry of W̃ approximately the same magnitude and approximately independent — exactly the "nice Gaussian" input that the trellis code expects.

The incoherence guarantee

Formally, the incoherence of a matrix is the maximum squared entry normalized by the Frobenius norm: μW = max|Wij|2 · mn / ||W||F2. A perfectly incoherent matrix has μ = 1 (all entries equal magnitude). For the transformed matrix, with high probability:

μ ≤ 2 log(4mn / δ)

This is logarithmic in the matrix size — essentially constant. No outliers survive the transform.

Compensating at inference

If we quantize W̃ instead of W, we need to undo the transform during inference. For a linear layer y = Wx, we can absorb the transforms:

y = Wx = (Vm Sm)−1 W̃ (Sn VnT)−1 x = Sm VmT W̃ Vn Sn x

The extra cost is two fast Hadamard transforms and two elementwise sign flips. The Hadamard transform is O(n log n), which is negligible compared to the matrix multiply O(mn). And the sign flips are free (just XOR the sign bit).

Practical detail: QTIP applies incoherence processing column-wise. Each column of W̃ is then quantized independently through the trellis. The Hessian (second-order information from calibration data) is also transformed: H̃ = Vn Sn H Sn VnT, so that the quantization objective remains consistent.

Worked example: a 4x4 weight matrix

Consider a tiny weight matrix before and after incoherence processing:

# Before: correlated, outlier-heavy
W = [[ 5.2,  0.1,  0.3, -0.1],
     [ 4.8, -0.2,  0.1,  0.0],
     [ 0.0,  3.1, -0.1,  0.2],
     [-0.1,  2.9,  0.0, -0.3]]
# Column 0 has outliers ~5, others are ~0.1
# Incoherence mu = 5.2^2 * 16 / ||W||_F^2 ≈ 5.8 (bad)

# After: W_tilde = V * S * W * S * V^T
W_tilde = [[ 1.8, -1.2,  0.9, -1.5],
           [-0.7,  1.4, -1.1,  0.8],
           [ 1.3,  0.6, -1.0, -1.3],
           [-0.9, -1.1,  1.2,  0.7]]
# All entries ~same magnitude, uncorrelated
# Incoherence mu ≈ 1.3 (near-optimal)

The outlier energy from column 0 has been spread across all entries. Each column of W̃ now looks like an i.i.d. Gaussian sample — exactly what the trellis code is designed for.

Why not just normalize?

You might think: just scale each column to have unit variance. But that only fixes the magnitude problem, not the correlation problem. Two columns might be nearly identical — their quantization errors would be correlated, causing systematic bias in the output. The random Hadamard transform decorrelates and normalizes simultaneously, which is why it gives strictly better results than simple per-column scaling.

Why does QTIP apply a random Hadamard transform before trellis quantization instead of just normalizing each column?

Chapter 6: The Code Zoo

The trellis structure tells you which states connect to which. But what reconstruction value does each transition emit? QTIP explores a spectrum of approaches, trading off quality against decode speed.

Lookup-only codes

The simplest approach: store a lookup table (LUT) indexed by the L-bit state. Each entry maps to a float16 reconstruction value. For L = 14, that is 214 = 16,384 entries × 2 bytes = 32 KB. For L = 16, it is 128 KB.

The problem? GPU L1 cache is typically 128-256 KB per SM. A 128 KB LUT would monopolize the entire cache, starving the rest of the computation. And L1 cache latency is ~28 cycles for a random access — much slower than arithmetic.

Lookup-only verdict: Best quantization quality (the LUT entries are directly optimized). But too large for current GPU caches at L ≥ 16. Viable only for L ≤ 14.

Computed lookup-free codes

The opposite extreme: compute the reconstruction value from the L-bit state using only arithmetic operations. No table at all. QTIP proposes two variants:

1MAD (1 Multiply-Add)

Use a linear congruential generator (LCG) as a hash function, then sum four 8-bit segments of the result:

// 1MAD: ~4 instructions per weight
hash = state * 0x9E3779B9  // LCG multiply
hash = hash + 0x6A09E667   // LCG add
// Sum four 8-bit segments to get pseudo-Gaussian
val = (hash & 0xFF) + ((hash >> 8) & 0xFF)
      + ((hash >> 16) & 0xFF) + (hash >> 24)

The sum of four uniform bytes approximates a Gaussian (central limit theorem). This gives a pseudo-random reconstruction value from 4 ALU instructions — no memory access.

3INST (3 Instructions)

An even simpler hash using XOR operations on the FP16 bit representation:

// 3INST: ~3 instructions per weight
hash = state * 0x9E3779B9
// XOR mantissa/exponent bits to produce FP16
val = xor_fp16(hash)

Three ALU instructions total. Slightly worse quality than 1MAD but faster.

Why pseudo-random values work: After incoherence processing, the weights are approximately i.i.d. Gaussian. A pseudo-random codebook with Gaussian-like values is nearly optimal for Gaussian sources — the Viterbi algorithm will find the best matching path regardless of whether the codebook values were "designed" or "random." What matters is that there are enough distinct values to cover the distribution, which 2L pseudo-random points easily provide.

Hybrid codes (HYB)

The sweet spot: use a tiny lookup table (2Q × 2 entries, with Q ≈ 6-8) indexed by a hash of the state. Then apply a sign flip from an additional bit.

// HYB: hash + small LUT (~2 instructions amortized)
idx = (state * state + state) & MASK  // x^2 + x hash
sign = (state >> sign_bit) & 1       // extract sign
val = LUT[idx] * (1 - 2 * sign)       // apply sign flip

The LUT has only 128-512 entries (256 B - 1 KB), easily fitting in L1 cache. The entries are optimized via K-means on the empirical weight distribution. The sign bit doubles the effective codebook size for free.

Code TypeLUT SizeInstructions/WeightQuality (MSE)Best For
Lookup-only32-128 KB1 (lookup)BestL ≤ 14, quality-first
HYB256 B - 1 KB~2Near-bestL = 16, balanced
1MAD0~4GoodSpeed-first, large L
3INST0~3DecentMaximum throughput

Codebook optimization

For lookup-only and HYB codes, the LUT entries are trained on calibration data. QTIP uses 2048 samples from RedPajama (128 tokens each). The optimization minimizes:

L = ∑columns ||W̃col − Q(W̃col)||22

where Q applies the Viterbi algorithm with the current LUT, and gradients flow through the LUT entries via straight-through estimation. This runs for 50-200 iterations. For computed codes (1MAD, 3INST), there is nothing to optimize — the hash function is fixed.

Why do QTIP's computed codes (1MAD, 3INST) work well even though their reconstruction values are pseudo-random?

Chapter 7: GPU Decode Kernel

The theoretical elegance of trellis codes means nothing if the decode kernel is slow. QTIP's design ensures that going from packed bits to dequantized weights is fast enough to be memory-bound, not compute-bound.

The decode loop

At inference time, for each weight position t in a column of length T:

  1. Extract L-bit window: From the packed bitstream, extract L contiguous bits starting at position t · k. This is a load + shift + mask.
  2. Compute/lookup value: Use the L-bit state to produce the reconstruction value via the chosen code (lookup, HYB, 1MAD, or 3INST).
  3. Multiply-accumulate: Multiply by the activation and accumulate into the output.

Steps 1-3 are fused into a single CUDA kernel. The packed bitstream is stored in global memory and streamed sequentially — excellent for GPU memory bandwidth.

Why QTIP is fast despite trellis complexity: The bitshift structure means decode is O(1) per weight — just bit extraction + a few ALU ops or a small LUT lookup. Compare this to AQLM, which requires a codebook lookup in a 216-entry table (random access, cache-unfriendly). QTIP's decode is compute-light and memory-sequential, so the bottleneck is purely memory bandwidth — exactly where quantization helps most.

Memory layout

The quantized weights are stored as packed k-bit integers. For k = 2, four weights pack into one byte. The bitstream is laid out so that consecutive weights in a column are contiguous in memory, enabling coalesced reads on the GPU.

The Hadamard overhead

Each linear layer now computes y = Sm VmT (W̃ · Vn Sn x) instead of y = Wx. The extra operations are:

Total overhead: under 1% of inference time. The 4-8x memory reduction far outweighs this.

Throughput comparison

Decode Throughput

Tokens per second on an RTX 6000 Ada at batch size 1. QTIP matches QuiP# speed while achieving better quantization quality. AQLM is 2-3x slower due to large codebook lookups.

What is the main reason AQLM is 2-3x slower than QTIP at inference despite both being 2-bit methods?

Chapter 8: Results

Numbers matter. QTIP was evaluated on Llama 2 (7B, 13B, 70B) and Llama 3 (8B, 70B) at 2-bit, 3-bit, and 4-bit quantization, compared against the strongest baselines: QuiP#, AQLM, and GPTQ.

2-bit quantization: the headline result

At 2 bits per weight, the differences between methods are stark. This is where the trellis advantage is most visible.

ModelFP16QTIP (1MAD)QuiP#AQLM
Llama 2 7B5.126.828.226.14
Llama 2 70B3.123.904.163.83
Llama 3 70B4.975.77
The story of these numbers: On Llama 2 7B, QuiP# adds 3.1 perplexity points over FP16. QTIP adds only 1.7 — nearly halving the quantization penalty. On the larger Llama 2 70B, QTIP closes to within 0.78 of FP16. And on Llama 3 70B, QTIP beats QuiP# by a full 0.8 perplexity points. Note that AQLM does well on 7B but was not reported on Llama 3 at the time of publication.

3-bit quantization

At 3 bits, all methods approach FP16 quality, but QTIP (HYB variant) still leads:

ModelFP16QTIP (HYB)QuiP#AQLM
Llama 2 70B3.123.263.353.36

At 3 bits, QTIP is within 0.14 perplexity of FP16 on a 70B model. The gap between methods shrinks because 3 bits provides enough room for even simpler methods to do well.

Speed results

The speed comparison is equally important. QTIP must not sacrifice throughput for quality.

Model / MethodBitsTokens/secSpeedup vs FP16
Llama 2 7B FP161655.91.0x
Llama 2 7B QTIP2188.03.4x
Llama 2 7B QuiP#2186.03.3x
Llama 2 7B AQLM281.51.5x
Llama 2 70B QTIP223.5
Llama 2 70B QuiP#222.2
Llama 2 70B AQLM28.78
QTIP wins on both axes. It matches QuiP# speed (188 vs 186 tok/s on 7B) while achieving substantially better perplexity (6.82 vs 8.22). Compared to AQLM, QTIP is 2.3x faster (188 vs 81.5 tok/s) with comparable quality. This dual improvement — better quality AND speed — is the trellis code advantage: high effective dimension for quality, bitshift structure for speed.

Scaling behavior

A pattern emerges across model sizes: larger models are easier to quantize. On Llama 2 7B, 2-bit QTIP adds 1.7 perplexity over FP16. On 70B, the gap shrinks to 0.78. On Llama 3.1 405B, it is only ~0.17. This makes intuitive sense — larger models have more redundancy in their weights, so aggressive compression loses less essential information.

This scaling property makes QTIP particularly valuable for the models that need compression most: the ones too large to serve at full precision. A 405B model that fits on a single 8-GPU node at 2-bit is far more useful than one requiring 4 nodes at FP16.

Zero-shot task performance

Perplexity on WikiText-2 is just one metric. QTIP also reports zero-shot accuracy on standard benchmarks (ARC, PIQA, WinoGrande, HellaSwag). The pattern holds: at 2-bit, QTIP matches or exceeds QuiP# and AQLM on every benchmark across all model sizes. The task-level gains correlate strongly with perplexity improvements, confirming that lower perplexity translates to genuinely better language understanding, not just better next-token prediction on one dataset.

The 405B frontier

QTIP also quantized Llama 3.1 405B Instruct to 2 bits, achieving a perplexity of 3.29. This compresses the model from 810 GB to ~101 GB — fitting on a single 8-GPU node with room to spare. No other 2-bit method was reported on this model at publication time.

Perplexity vs. Bitrate

Llama 2 70B perplexity (WikiText-2) at different bitrates. Lower is better. QTIP consistently outperforms baselines, with the gap largest at 2 bits where high-dimensional quantization matters most.

On Llama 2 7B at 2-bit, QTIP achieves 6.82 perplexity while QuiP# gets 8.22. What is the primary technical reason for this gap?

Chapter 9: Connections

QTIP sits at the intersection of information theory, signal processing, and systems engineering. Here is how it connects to the broader landscape.

The QuiP lineage

QuiP (2023)
Introduced incoherence processing for LLM quantization. Random rotations make weights "nice" for any quantizer. Used simple round-to-nearest with incoherence.
QuiP# (2024)
Added E8 lattice vector quantization (8D) on top of incoherence processing. Major quality jump. But stuck at 8D due to codebook size limits.
QTIP (2024)
Replaced E8 lattice VQ with trellis coded quantization. Broke through the dimension barrier: 256D effective dimension with only 65K states. Same incoherence framework, fundamentally better coding.

Trellis coded quantization origins

TCQ was introduced by Marcellin and Fischer (1990), building on Ungerboeck's trellis coded modulation (1982) from telecommunications. The original TCQ paper showed that trellis codes approach the Shannon rate-distortion bound as the constraint length grows. QTIP is the first to bring this classical result to LLM weight compression, with the key innovation being the bitshift structure for GPU-friendly decoding.

The rate-distortion view

Shannon's rate-distortion theorem (1959) tells us the minimum bitrate needed to represent a source at a given distortion level. For a Gaussian source at distortion D:

R(D) = ½ log22 / D)

Any practical quantizer must use at least R(D) bits per symbol. Scalar quantization is far from this bound. VQ gets closer as dimension increases. TCQ approaches it as the constraint length grows. QTIP with L = 16 gets within 10% of the bound — a remarkable achievement for a practical system.

Connections to other methods

MethodApproachEffective DimDecode Cost
GPTQScalar + Hessian-aware rounding1Trivial
AWQScalar + activation-aware scaling1Trivial
AQLMAdditive VQ (multi-codebook)8Large LUT
QuiP#E8 lattice VQ + incoherence8Lattice decode
QTIPTrellis coded quantization + incoherence256Bitshift + small LUT

Why QTIP won a NeurIPS Spotlight

QTIP received a Spotlight designation at NeurIPS 2024. The reviewers highlighted three contributions: (1) the novel application of trellis coded quantization to LLMs, connecting classical information theory to modern deep learning; (2) the principled spectrum of code designs from lookup-only to compute-only, giving practitioners a clear quality-speed tradeoff; and (3) the strong empirical results that simultaneously beat all baselines on quality AND speed — a rare achievement in quantization where there is usually a tradeoff between the two.

The work also stands out for its theoretical grounding. Unlike many quantization papers that are primarily empirical, QTIP derives its approach from rate-distortion theory, giving a principled explanation for why trellis codes work better: they approach the Shannon limit at high effective dimension.

What comes next?

The big picture: QTIP demonstrates that ideas from 1990s information theory — trellis coded quantization, the Viterbi algorithm, incoherence processing — are not just theoretically elegant but practically powerful for modern LLMs. By framing weight compression as a coding problem rather than a rounding problem, QTIP achieves what seemed impossible: high-dimensional quantization at GPU speed.
What is the fundamental conceptual shift that QTIP makes compared to methods like GPTQ and AWQ?