QTIP — Veanors

Chapter 0: The Problem

You have a 70-billion-parameter language model. Each parameter is stored as a 16-bit float. That is 140 GB of weights — more than a single GPU can hold. Inference is memory-bandwidth bound: the GPU spends most of its time just loading weights from memory, not computing with them. If you could shrink every weight from 16 bits to 2 bits, you would cut memory by 8x and speed up inference proportionally.

This is post-training quantization (PTQ). The idea is simple: take a pre-trained model, replace its full-precision weights with low-bit approximations, and accept a small quality loss in exchange for dramatic efficiency gains.

The scalar approach

The most basic strategy quantizes each weight independently. Pick a set of reconstruction values (say 4 values for 2-bit), then round each weight to the nearest one. This is scalar quantization, and it works — but it wastes information. It treats each weight as if it were completely unrelated to its neighbors.

The vector quantization leap

Methods like AQLM and QuiP# quantize groups of weights together. QuiP# groups weights into 8-dimensional vectors and maps each to the nearest point in the E₈ lattice — a beautiful mathematical structure that packs spheres in 8D almost perfectly. This vector quantization (VQ) exploits correlations between neighboring weights, getting closer to the theoretical limit.

But VQ has a fundamental scaling problem: the codebook grows exponentially with dimension. An 8D codebook at 2 bits/weight already has 2¹⁶ = 65,536 entries. Going to 16D would require 2³² entries — 4 billion — which is impractical. So VQ is stuck at low dimensions.

The core question: How do you get the benefits of high-dimensional quantization — better rate-distortion performance — without an exponentially large codebook? QTIP's answer: use a trellis code, which achieves effectively 256-dimensional quantization with only 2¹⁶ states.

Here is the rate-distortion picture. For a Gaussian source at 2 bits per dimension, scalar quantization achieves MSE = 0.118. QuiP#'s 8D E₈ lattice gets 0.089. QTIP's 256D trellis code gets 0.069 — within 10% of the theoretical minimum of 0.063. That gap shrinks further at higher dimensions.

Method	Effective Dimension	Codebook Size	MSE (2-bit Gaussian)
Scalar (Lloyd-Max)	1	4	0.118
QuiP# (E₈P)	8	65,536	0.089
QTIP (L=16 TCQ)	256	65,536 states	0.069
Shannon limit	∞	—	0.063

QTIP borrows a trick from telecommunications: trellis coded quantization, first proposed by Marcellin and Fischer in 1990. The idea is to encode weights as a path through a state machine rather than as independent codebook lookups. Each transition in the state machine emits a quantized value, and the optimal path is found by the Viterbi algorithm — the same algorithm used to decode convolutional codes in your phone's modem.

Why can't vector quantization easily scale to 256-dimensional quantization at 2 bits/weight?

The codebook would need 2^512 entries — exponential in dimension — making storage and search completely impractical The Viterbi algorithm is too slow for 256 dimensions 256-dimensional vectors are always too noisy to quantize

Chapter 1: Why Trellises

To understand why QTIP uses trellis codes, we need to step back and think about what quantization is really doing. You have a long sequence of real-valued weights w₁, w₂, ..., w_T. You want to replace each with a discrete approximation q_i, using only k bits per weight on average. The goal is to minimize the total distortion:

D = ∑_i=1^T (w_i − q_i)²

Scalar: each weight on its own

Scalar quantization picks q_i independently for each weight. With k = 2 bits you get 4 possible values. The optimal choice (Lloyd-Max quantizer) minimizes MSE for a Gaussian distribution, but it ignores all structure between weights. Think of it as a greedy algorithm that makes locally optimal decisions without looking ahead.

Vector: groups of weights

Vector quantization groups weights into vectors of dimension d, then maps each vector to the nearest codeword in a codebook C of size 2^kd. For a Gaussian source, the optimal codebook approaches the space-filling gain of the best lattice in dimension d. The E₈ lattice that QuiP# uses is one of the best known packings in 8D.

The problem: both codebook storage and nearest-neighbor search scale as 2^kd. At d = 8 and k = 2, that is 65K entries — manageable. At d = 16, it is 4 billion. At d = 256, it is 2⁵¹² — more than the atoms in the universe.

Trellis: the state machine escape

A trellis code finds a middle ground. Instead of independent scalar decisions or massive vector codebooks, it encodes the weight sequence as a path through a finite state machine. The state machine has S = 2^L states. At each step, the encoder is in some state s, reads k input bits, transitions to a new state s', and emits a quantized value q that depends on both s and s'.

State Machine

The trellis has 2^L states (e.g., L=16 gives 65,536 states). Each state can transition to 2^k next states, emitting one quantized value per transition.

↓

Sequence Encoding

A weight sequence becomes a path through the trellis. The path is determined by the k-bit index at each step. Total storage: k bits per weight — same as scalar — but the reconstruction quality is much higher because decisions are coupled.

↓

Viterbi Decoding

Finding the best path (minimum distortion) is a dynamic programming problem. The Viterbi algorithm solves it in O(2^L · T) time. Unlike VQ search over 2^kd entries, this is linear in sequence length T.

↓

Effective Dimension

The coupled decisions mean the trellis acts like a very-high-dimensional quantizer. With L=16, the effective dimension reaches ~256 — approaching the Shannon limit — with only 65K states, not 2⁵¹² codebook entries.

The key insight: A trellis code separates the codebook size from the effective quantization dimension. The number of states (2^L) controls the quality, not 2^kd. This lets you get arbitrarily high-dimensional quantization — and thus arbitrarily close to the Shannon limit — by increasing L, while keeping everything tractable.

This is precisely the same trick that convolutional codes use in telecommunications. When your phone encodes voice data for transmission, it doesn't independently encode each sample — it runs the data through a shift register (a state machine) and outputs encoded bits that depend on both the current input and the recent history. The receiver uses the Viterbi algorithm to find the most likely original sequence. QTIP adapts this machinery from error correction to quantization.

What does a trellis code separate that vector quantization cannot?

Training time from inference time The codebook size (2^L states) from the effective quantization dimension — letting you reach high dimensions without exponential codebook growth The learning rate from the batch size

Chapter 2: Trellis Codes from First Principles

Let us build a trellis code from scratch. We start with the simplest possible version and work up to QTIP's design.

The shift register

Imagine a register that holds L bits. At each time step, k new bits enter from the left, and the k oldest bits fall off the right. The register shifts. The current state of the register (all L bits) determines which quantized value gets emitted.

Concrete example. Suppose L = 4 (16 states) and k = 1 (1 new bit per step). The register holds 4 bits: [b₃ b₂ b₁ b₀]. At each step, a new bit b enters on the left, the oldest bit b₀ falls off, and the state becomes [b b₃ b₂ b₁]. The new bit b is the "input" — one of {0, 1} — so from any state, there are exactly 2 possible next states.

More generally, with k input bits per step, the register shifts by k positions. From any state, there are 2^k possible transitions. Each transition (s → s') is associated with a reconstruction value q(s, s') — the quantized weight emitted when we take that edge.

The trellis diagram

Unroll this state machine over time and you get a trellis diagram: a directed graph where columns represent time steps, nodes represent states, and edges represent transitions. Every valid quantization of the weight sequence is a path through this trellis from left to right.

Trellis Diagram

A 4-state trellis (L=2, k=1). Each column is a time step. Click an edge to see its reconstruction value. The highlighted path shows the Viterbi-optimal quantization for the sample weights below.

Sample weights Preset 1

Formal definition

A trellis code is defined by:

L — the constraint length (number of bits in the state register). The trellis has 2^L states.
k — the number of input bits per time step. This is the rate: k bits per output symbol.
V — the number of output symbols per time step (usually 1 or 2).
A lookup function q(s, s') → reconstruction value for each valid transition.

The rate of the code is k/V bits per weight. QTIP uses various combinations: k=1, V=1 gives 1 bit/weight; k=2, V=1 gives 2 bits/weight; k=4, V=2 gives 2 bits/weight with a more complex trellis.

Why does dimension grow with L?

The effective quantization dimension is approximately L/k · V. With L = 16, k = 2, V = 1, the effective dimension is ~128. With V = 2, it reaches ~256. This is because the shift register creates a memory window of L bits. Every quantized output depends on the current L-bit state, which in turn depends on the last L/k input decisions. So each output is implicitly a function of ~L/k neighboring weights — creating a very high-dimensional joint decision without ever explicitly building a high-dimensional codebook.

The shift register is the codebook. Instead of storing 2^kd entries for d-dimensional VQ, the trellis stores only 2^L states and their transitions. The "codebook" is implicitly defined by all possible paths through these states. The number of distinct T-length paths is 2^kT — the same as T independent k-bit scalar quantizers — but the reconstruction values are coupled across time, giving vector-like quality.

A concrete rate-distortion comparison

Let us make the dimension advantage concrete. Suppose you are quantizing a column of 4096 weights (a typical hidden dimension) to 2 bits each. Here is what each approach sees:

Scalar (d=1): 4096 independent decisions. Each weight picks from 4 values. No coordination between neighbors. Total distinct quantizations: 4⁴⁰⁹⁶.
VQ with d=8: 512 groups of 8 weights. Each group picks from 2¹⁶ = 65K codewords. Groups are independent of each other. Total distinct quantizations: 65K⁵¹².
Trellis with L=16: One continuous path through 4096 steps of a 65K-state machine. Each step has 4 choices, but which 4 depends on the current state (which encodes the last 16 bits of history). Total distinct quantizations: 4⁴⁰⁹⁶ — same count as scalar — but the reconstruction values are correlated across a 128-step window.

The trellis uses the same number of bits as scalar quantization but achieves near-VQ-256 quality because every output depends on its 128-step neighborhood. The state machine creates implicit high-dimensional structure without explicit high-dimensional codebooks.

In a trellis with L=16 and k=2, how many states are there, and how many transitions leave each state?

2^16 = 65,536 states, each with 2^2 = 4 outgoing transitions 16 states, each with 2 outgoing transitions 2^16 states, each with 16 outgoing transitions

Chapter 3: The Bitshift Trellis

Not all trellis structures are equally friendly to hardware. QTIP introduces a specific trellis design — the bitshift trellis — that maps perfectly to GPU bitwise operations.

The transition rule

In a bitshift trellis, node i has an edge to node j if and only if:

j = (i · 2^kV mod 2^L) + c, where 0 ≤ c < 2^kV

In plain English: shift the state left by kV bits (dropping the top kV bits), then fill in the bottom kV bits with any value c. This is literally a bitshift plus an OR.

Worked example. Let L = 4 (4-bit state), k = 1, V = 1 (so kV = 1). State i = 0b1011 (decimal 11). Shift left by 1: 0b0110 (decimal 6). The new bottom bit c can be 0 or 1, giving next states 6 (0b0110) or 7 (0b0111). That is: the top 3 bits of j equal the bottom 3 bits of i. The old top bit is gone — it has "shifted out."

Why bitshift?

This structure has three critical properties:

Local dependence: The reconstruction value for position t depends only on a contiguous window of L bits from the encoded bitstream. You do not need to decode the entire sequence from the beginning — just extract the L-bit window around position t.
Parallel decode: Since each output depends on only L contiguous bits, all T outputs can be decoded in parallel. No sequential data dependency. This is crucial for GPU throughput.
Bitwise implementation: The "decode" operation is literally: extract L bits from the packed bitstream at position t, then use those L bits to look up or compute the reconstruction value. Shift and mask operations, nothing more.

The hardware payoff: A general convolutional code trellis requires sequential Viterbi decoding at inference time. The bitshift trellis needs Viterbi only during quantization (offline, once). At inference, decoding is embarrassingly parallel — each weight is decoded independently from its L-bit window. This is why QTIP matches QuiP#'s speed despite having a 32x larger effective dimension.

The encoding vs. decoding asymmetry

This is a subtle but essential point. The Viterbi algorithm (expensive, sequential) is used only once, during the offline quantization step. It finds the optimal path through the trellis for each weight column. Once found, the path is stored as a packed bitstream — just k bits per weight.

At inference time, decoding is trivial: for weight at position t, extract the L-bit window centered at t from the bitstream, and compute or look up the reconstruction value. No dynamic programming. No state tracking. Just bitshifts.

Offline: Quantization (Slow, Once)

Run Viterbi algorithm through the trellis to find the minimum-distortion path. Store the path as k bits per weight. O(2^L · T) time per column.

↓

Online: Inference (Fast, Every Token)

For each weight, extract its L-bit window from the bitstream. Compute or look up the reconstruction value. All weights decoded in parallel. O(1) per weight.

Why can the bitshift trellis be decoded in parallel at inference time, unlike a general convolutional code?

Because each reconstruction value depends only on a contiguous L-bit window from the bitstream — not on the full decoding history — so all positions can be decoded independently Because GPUs have special trellis hardware Because the Viterbi algorithm is inherently parallel

Chapter 4: Viterbi Decoding for Optimal Quantization

We have a trellis with 2^L states and a sequence of T real-valued weights. We need to find the path through the trellis that minimizes total reconstruction error. This is exactly the problem the Viterbi algorithm solves.

The dynamic programming formulation

Define score(t, s) as the minimum total squared error achievable for quantizing weights w₁ through w_t, ending in state s at time t. The recursion is:

score(t, s) = min_{s' → s} [ score(t−1, s') + (w_t − q(s', s))² ]

where the minimum is over all states s' that have a valid transition to s, and q(s', s) is the reconstruction value for that transition.

Base case: score(0, s) = 0 for all states s (or for a designated start state). Final answer: min_s score(T, s) gives the minimum achievable distortion. The optimal path is recovered by backtracking: at each step, record which predecessor s' achieved the minimum.

Complexity

Each of the T time steps examines 2^L states. Each state considers 2^kV predecessors. Total operations: O(T · 2^L · 2^kV). For L = 16, k = 2, V = 1, T = 4096 (a typical column length), that is 4096 × 65,536 × 4 ≈ 10⁹ operations. Substantial but tractable on a GPU, and it only runs once during quantization.

Step by step

Step 1: Initialize

Set score(0, s) = 0 for all states. Create a backpointer array.

↓

Step 2: Forward pass

For t = 1 to T: for each state s, compute score(t, s) by checking all valid predecessors s'. Record the best predecessor in backpointer[t][s].

↓

Step 3: Terminate

Find s* = argmin_s score(T, s). This is the optimal final state.

↓

Step 4: Backtrack

Follow backpointers from s* backward through time to recover the full optimal path s*_T, s*_T-1, ..., s*₀.

↓

Step 5: Extract bits

From the optimal path, extract the k-bit input at each step (determined by which transition was taken). Pack these into the bitstream.

Viterbi Path Search

Watch the Viterbi algorithm find the optimal quantization path through a small trellis. Each cell shows the accumulated distortion. The highlighted path is the minimum-cost route. Click "Step" to advance one time step, or "Run" to animate the full search.

What about the reconstruction values?

Where do the values q(s', s) come from? This is the codebook design problem. QTIP explores several strategies, from pure lookup tables to computed values. We will cover these in Chapter 6. For now, assume each transition has an associated reconstruction value that was optimized on calibration data.

In the Viterbi algorithm for trellis quantization, what does score(t, s) represent?

The minimum total squared reconstruction error for quantizing weights 1 through t, ending in trellis state s The probability of being in state s at time t The number of bits needed to encode the path to state s

Chapter 5: Incoherence Processing

Trellis codes work best when the weights they quantize are roughly independent and identically distributed — ideally Gaussian. Real LLM weight matrices are not like this. They have correlated columns, outlier values, and non-uniform distributions. Incoherence processing fixes this.

The problem with raw weights

Consider a weight matrix W ∈ R^{m × n} from a linear layer. Some columns might have values 10x larger than others. Some rows are highly correlated. If you apply trellis quantization directly, the outlier columns dominate the error and the trellis wastes states on rare extreme values.

QuiP# introduced the idea of multiplying both sides by random orthogonal matrices to "spread out" the information. QTIP uses the same framework.

The Random Hadamard Transform

The incoherence processing applies:

W̃ = V_m S_m W S_n V_n^T

where:

V_k is a k × k Hadamard matrix — an orthogonal matrix with entries ±1/√k. Multiplication costs O(k log k) via the fast Walsh-Hadamard transform.
S_k is a random diagonal sign matrix — a diagonal matrix where each entry is independently +1 or −1 with equal probability.

What this does geometrically: The random sign matrix S randomizes the coordinate directions. The Hadamard matrix V then spreads each coordinate's energy uniformly across all coordinates. The combined effect makes every entry of W̃ approximately the same magnitude and approximately independent — exactly the "nice Gaussian" input that the trellis code expects.

The incoherence guarantee

Formally, the incoherence of a matrix is the maximum squared entry normalized by the Frobenius norm: μ_W = max|W_ij|² · mn / ||W||_F². A perfectly incoherent matrix has μ = 1 (all entries equal magnitude). For the transformed matrix, with high probability:

μ_W̃ ≤ 2 log(4mn / δ)

This is logarithmic in the matrix size — essentially constant. No outliers survive the transform.

Compensating at inference

If we quantize W̃ instead of W, we need to undo the transform during inference. For a linear layer y = Wx, we can absorb the transforms:

y = Wx = (V_m S_m)⁻¹ W̃ (S_n V_n^T)⁻¹ x = S_m V_m^T W̃ V_n S_n x

The extra cost is two fast Hadamard transforms and two elementwise sign flips. The Hadamard transform is O(n log n), which is negligible compared to the matrix multiply O(mn). And the sign flips are free (just XOR the sign bit).

Practical detail: QTIP applies incoherence processing column-wise. Each column of W̃ is then quantized independently through the trellis. The Hessian (second-order information from calibration data) is also transformed: H̃ = V_n S_n H S_n V_n^T, so that the quantization objective remains consistent.

Worked example: a 4x4 weight matrix

Consider a tiny weight matrix before and after incoherence processing:

# Before: correlated, outlier-heavy
W = [[ 5.2,  0.1,  0.3, -0.1],
     [ 4.8, -0.2,  0.1,  0.0],
     [ 0.0,  3.1, -0.1,  0.2],
     [-0.1,  2.9,  0.0, -0.3]]
# Column 0 has outliers ~5, others are ~0.1
# Incoherence mu = 5.2^2 * 16 / ||W||_F^2 ≈ 5.8 (bad)

# After: W_tilde = V * S * W * S * V^T
W_tilde = [[ 1.8, -1.2,  0.9, -1.5],
           [-0.7,  1.4, -1.1,  0.8],
           [ 1.3,  0.6, -1.0, -1.3],
           [-0.9, -1.1,  1.2,  0.7]]
# All entries ~same magnitude, uncorrelated
# Incoherence mu ≈ 1.3 (near-optimal)

The outlier energy from column 0 has been spread across all entries. Each column of W̃ now looks like an i.i.d. Gaussian sample — exactly what the trellis code is designed for.

Why not just normalize?

You might think: just scale each column to have unit variance. But that only fixes the magnitude problem, not the correlation problem. Two columns might be nearly identical — their quantization errors would be correlated, causing systematic bias in the output. The random Hadamard transform decorrelates and normalizes simultaneously, which is why it gives strictly better results than simple per-column scaling.

Why does QTIP apply a random Hadamard transform before trellis quantization instead of just normalizing each column?

The Hadamard transform both decorrelates columns AND normalizes magnitudes — simple scaling fixes magnitudes but leaves correlated quantization errors that cause systematic output bias The Hadamard transform is faster than normalization Normalization changes the model's outputs but Hadamard does not

Chapter 6: The Code Zoo

The trellis structure tells you which states connect to which. But what reconstruction value does each transition emit? QTIP explores a spectrum of approaches, trading off quality against decode speed.

Lookup-only codes

The simplest approach: store a lookup table (LUT) indexed by the L-bit state. Each entry maps to a float16 reconstruction value. For L = 14, that is 2¹⁴ = 16,384 entries × 2 bytes = 32 KB. For L = 16, it is 128 KB.

The problem? GPU L1 cache is typically 128-256 KB per SM. A 128 KB LUT would monopolize the entire cache, starving the rest of the computation. And L1 cache latency is ~28 cycles for a random access — much slower than arithmetic.

Lookup-only verdict: Best quantization quality (the LUT entries are directly optimized). But too large for current GPU caches at L ≥ 16. Viable only for L ≤ 14.

Computed lookup-free codes

The opposite extreme: compute the reconstruction value from the L-bit state using only arithmetic operations. No table at all. QTIP proposes two variants:

1MAD (1 Multiply-Add)

Use a linear congruential generator (LCG) as a hash function, then sum four 8-bit segments of the result:

// 1MAD: ~4 instructions per weight
hash = state * 0x9E3779B9  // LCG multiply
hash = hash + 0x6A09E667   // LCG add
// Sum four 8-bit segments to get pseudo-Gaussian
val = (hash & 0xFF) + ((hash >> 8) & 0xFF)
      + ((hash >> 16) & 0xFF) + (hash >> 24)

The sum of four uniform bytes approximates a Gaussian (central limit theorem). This gives a pseudo-random reconstruction value from 4 ALU instructions — no memory access.

3INST (3 Instructions)

An even simpler hash using XOR operations on the FP16 bit representation:

// 3INST: ~3 instructions per weight
hash = state * 0x9E3779B9
// XOR mantissa/exponent bits to produce FP16
val = xor_fp16(hash)

Three ALU instructions total. Slightly worse quality than 1MAD but faster.

Why pseudo-random values work: After incoherence processing, the weights are approximately i.i.d. Gaussian. A pseudo-random codebook with Gaussian-like values is nearly optimal for Gaussian sources — the Viterbi algorithm will find the best matching path regardless of whether the codebook values were "designed" or "random." What matters is that there are enough distinct values to cover the distribution, which 2^L pseudo-random points easily provide.

Hybrid codes (HYB)

The sweet spot: use a tiny lookup table (2^Q × 2 entries, with Q ≈ 6-8) indexed by a hash of the state. Then apply a sign flip from an additional bit.

// HYB: hash + small LUT (~2 instructions amortized)
idx = (state * state + state) & MASK  // x^2 + x hash
sign = (state >> sign_bit) & 1       // extract sign
val = LUT[idx] * (1 - 2 * sign)       // apply sign flip

The LUT has only 128-512 entries (256 B - 1 KB), easily fitting in L1 cache. The entries are optimized via K-means on the empirical weight distribution. The sign bit doubles the effective codebook size for free.

Code Type	LUT Size	Instructions/Weight	Quality (MSE)	Best For
Lookup-only	32-128 KB	1 (lookup)	Best	L ≤ 14, quality-first
HYB	256 B - 1 KB	~2	Near-best	L = 16, balanced
1MAD	0	~4	Good	Speed-first, large L
3INST	0	~3	Decent	Maximum throughput

Codebook optimization

For lookup-only and HYB codes, the LUT entries are trained on calibration data. QTIP uses 2048 samples from RedPajama (128 tokens each). The optimization minimizes:

L = ∑_columns ||W̃_col − Q(W̃_col)||₂²

where Q applies the Viterbi algorithm with the current LUT, and gradients flow through the LUT entries via straight-through estimation. This runs for 50-200 iterations. For computed codes (1MAD, 3INST), there is nothing to optimize — the hash function is fixed.

Why do QTIP's computed codes (1MAD, 3INST) work well even though their reconstruction values are pseudo-random?

After incoherence processing the weights are approximately i.i.d. Gaussian, so a pseudo-random codebook with enough points (2^L) covers the distribution well — the Viterbi algorithm finds good paths regardless Because the LCG constants are carefully tuned to match weight distributions Because pseudo-random values are always better than designed codebooks

Chapter 7: GPU Decode Kernel

The theoretical elegance of trellis codes means nothing if the decode kernel is slow. QTIP's design ensures that going from packed bits to dequantized weights is fast enough to be memory-bound, not compute-bound.

The decode loop

At inference time, for each weight position t in a column of length T:

Extract L-bit window: From the packed bitstream, extract L contiguous bits starting at position t · k. This is a load + shift + mask.
Compute/lookup value: Use the L-bit state to produce the reconstruction value via the chosen code (lookup, HYB, 1MAD, or 3INST).
Multiply-accumulate: Multiply by the activation and accumulate into the output.

Steps 1-3 are fused into a single CUDA kernel. The packed bitstream is stored in global memory and streamed sequentially — excellent for GPU memory bandwidth.

Why QTIP is fast despite trellis complexity: The bitshift structure means decode is O(1) per weight — just bit extraction + a few ALU ops or a small LUT lookup. Compare this to AQLM, which requires a codebook lookup in a 2¹⁶-entry table (random access, cache-unfriendly). QTIP's decode is compute-light and memory-sequential, so the bottleneck is purely memory bandwidth — exactly where quantization helps most.

Memory layout

The quantized weights are stored as packed k-bit integers. For k = 2, four weights pack into one byte. The bitstream is laid out so that consecutive weights in a column are contiguous in memory, enabling coalesced reads on the GPU.

The Hadamard overhead

Each linear layer now computes y = S_m V_m^T (W̃ · V_n S_n x) instead of y = Wx. The extra operations are:

Sign flip (S_n x): Elementwise XOR on sign bits. Essentially free.
Hadamard (V_n z): Fast Walsh-Hadamard transform, O(n log n). For n = 4096, this is ~50K operations vs. ~17M for the matrix multiply. Under 0.3% overhead.
Sign flip + Hadamard on output: Same cost, applied to the m-dimensional output.

Total overhead: under 1% of inference time. The 4-8x memory reduction far outweighs this.

Throughput comparison

Decode Throughput

Tokens per second on an RTX 6000 Ada at batch size 1. QTIP matches QuiP# speed while achieving better quantization quality. AQLM is 2-3x slower due to large codebook lookups.

What is the main reason AQLM is 2-3x slower than QTIP at inference despite both being 2-bit methods?

AQLM requires random-access lookups into a large (2^16 entry) codebook, causing cache misses, while QTIP's bitshift decode uses only sequential reads and a few ALU ops or a tiny LUT AQLM uses more bits per weight AQLM requires the Viterbi algorithm at inference time

Chapter 8: Results

Numbers matter. QTIP was evaluated on Llama 2 (7B, 13B, 70B) and Llama 3 (8B, 70B) at 2-bit, 3-bit, and 4-bit quantization, compared against the strongest baselines: QuiP#, AQLM, and GPTQ.

2-bit quantization: the headline result

At 2 bits per weight, the differences between methods are stark. This is where the trellis advantage is most visible.

Model	FP16	QTIP (1MAD)	QuiP#	AQLM
Llama 2 7B	5.12	6.82	8.22	6.14
Llama 2 70B	3.12	3.90	4.16	3.83
Llama 3 70B	—	4.97	5.77	—

The story of these numbers: On Llama 2 7B, QuiP# adds 3.1 perplexity points over FP16. QTIP adds only 1.7 — nearly halving the quantization penalty. On the larger Llama 2 70B, QTIP closes to within 0.78 of FP16. And on Llama 3 70B, QTIP beats QuiP# by a full 0.8 perplexity points. Note that AQLM does well on 7B but was not reported on Llama 3 at the time of publication.

3-bit quantization

At 3 bits, all methods approach FP16 quality, but QTIP (HYB variant) still leads:

Model	FP16	QTIP (HYB)	QuiP#	AQLM
Llama 2 70B	3.12	3.26	3.35	3.36

At 3 bits, QTIP is within 0.14 perplexity of FP16 on a 70B model. The gap between methods shrinks because 3 bits provides enough room for even simpler methods to do well.

Speed results

The speed comparison is equally important. QTIP must not sacrifice throughput for quality.

Model / Method	Bits	Tokens/sec	Speedup vs FP16
Llama 2 7B FP16	16	55.9	1.0x
Llama 2 7B QTIP	2	188.0	3.4x
Llama 2 7B QuiP#	2	186.0	3.3x
Llama 2 7B AQLM	2	81.5	1.5x

Llama 2 70B QTIP	2	23.5	—
Llama 2 70B QuiP#	2	22.2	—
Llama 2 70B AQLM	2	8.78	—

QTIP wins on both axes. It matches QuiP# speed (188 vs 186 tok/s on 7B) while achieving substantially better perplexity (6.82 vs 8.22). Compared to AQLM, QTIP is 2.3x faster (188 vs 81.5 tok/s) with comparable quality. This dual improvement — better quality AND speed — is the trellis code advantage: high effective dimension for quality, bitshift structure for speed.

Scaling behavior

A pattern emerges across model sizes: larger models are easier to quantize. On Llama 2 7B, 2-bit QTIP adds 1.7 perplexity over FP16. On 70B, the gap shrinks to 0.78. On Llama 3.1 405B, it is only ~0.17. This makes intuitive sense — larger models have more redundancy in their weights, so aggressive compression loses less essential information.

This scaling property makes QTIP particularly valuable for the models that need compression most: the ones too large to serve at full precision. A 405B model that fits on a single 8-GPU node at 2-bit is far more useful than one requiring 4 nodes at FP16.

Zero-shot task performance

Perplexity on WikiText-2 is just one metric. QTIP also reports zero-shot accuracy on standard benchmarks (ARC, PIQA, WinoGrande, HellaSwag). The pattern holds: at 2-bit, QTIP matches or exceeds QuiP# and AQLM on every benchmark across all model sizes. The task-level gains correlate strongly with perplexity improvements, confirming that lower perplexity translates to genuinely better language understanding, not just better next-token prediction on one dataset.

The 405B frontier

QTIP also quantized Llama 3.1 405B Instruct to 2 bits, achieving a perplexity of 3.29. This compresses the model from 810 GB to ~101 GB — fitting on a single 8-GPU node with room to spare. No other 2-bit method was reported on this model at publication time.

Perplexity vs. Bitrate

Llama 2 70B perplexity (WikiText-2) at different bitrates. Lower is better. QTIP consistently outperforms baselines, with the gap largest at 2 bits where high-dimensional quantization matters most.

On Llama 2 7B at 2-bit, QTIP achieves 6.82 perplexity while QuiP# gets 8.22. What is the primary technical reason for this gap?

QTIP's trellis code achieves effectively 256-dimensional quantization vs QuiP#'s 8-dimensional E8 lattice — higher dimension means closer to the Shannon rate-distortion limit and lower MSE at the same bitrate QTIP uses more calibration data QTIP finetunes the model after quantization

Chapter 9: Connections

QTIP sits at the intersection of information theory, signal processing, and systems engineering. Here is how it connects to the broader landscape.

The QuiP lineage

QuiP (2023)

Introduced incoherence processing for LLM quantization. Random rotations make weights "nice" for any quantizer. Used simple round-to-nearest with incoherence.

↓

QuiP# (2024)

Added E₈ lattice vector quantization (8D) on top of incoherence processing. Major quality jump. But stuck at 8D due to codebook size limits.

↓

QTIP (2024)

Replaced E₈ lattice VQ with trellis coded quantization. Broke through the dimension barrier: 256D effective dimension with only 65K states. Same incoherence framework, fundamentally better coding.

Trellis coded quantization origins

TCQ was introduced by Marcellin and Fischer (1990), building on Ungerboeck's trellis coded modulation (1982) from telecommunications. The original TCQ paper showed that trellis codes approach the Shannon rate-distortion bound as the constraint length grows. QTIP is the first to bring this classical result to LLM weight compression, with the key innovation being the bitshift structure for GPU-friendly decoding.

The rate-distortion view

Shannon's rate-distortion theorem (1959) tells us the minimum bitrate needed to represent a source at a given distortion level. For a Gaussian source at distortion D:

R(D) = ½ log₂(σ² / D)

Any practical quantizer must use at least R(D) bits per symbol. Scalar quantization is far from this bound. VQ gets closer as dimension increases. TCQ approaches it as the constraint length grows. QTIP with L = 16 gets within 10% of the bound — a remarkable achievement for a practical system.

Connections to other methods

Method	Approach	Effective Dim	Decode Cost
GPTQ	Scalar + Hessian-aware rounding	1	Trivial
AWQ	Scalar + activation-aware scaling	1	Trivial
AQLM	Additive VQ (multi-codebook)	8	Large LUT
QuiP#	E₈ lattice VQ + incoherence	8	Lattice decode
QTIP	Trellis coded quantization + incoherence	256	Bitshift + small LUT

Why QTIP won a NeurIPS Spotlight

QTIP received a Spotlight designation at NeurIPS 2024. The reviewers highlighted three contributions: (1) the novel application of trellis coded quantization to LLMs, connecting classical information theory to modern deep learning; (2) the principled spectrum of code designs from lookup-only to compute-only, giving practitioners a clear quality-speed tradeoff; and (3) the strong empirical results that simultaneously beat all baselines on quality AND speed — a rare achievement in quantization where there is usually a tradeoff between the two.

The work also stands out for its theoretical grounding. Unlike many quantization papers that are primarily empirical, QTIP derives its approach from rate-distortion theory, giving a principled explanation for why trellis codes work better: they approach the Shannon limit at high effective dimension.

What comes next?

Finetuning after quantization: QTIP results shown here are without finetuning. Adding a short finetuning pass (as AQLM does) could further close the gap to FP16.
Activation quantization: QTIP only quantizes weights. Combining it with activation quantization (e.g., SmoothQuant) could yield even greater speedups.
Longer constraint lengths: L = 16 is limited by GPU SRAM. Future hardware with larger on-chip memory could support L = 20 or beyond, pushing even closer to the Shannon limit.
Structured pruning + quantization: Combining trellis quantization with sparse structures could yield models that are both compressed and sparse.

The big picture: QTIP demonstrates that ideas from 1990s information theory — trellis coded quantization, the Viterbi algorithm, incoherence processing — are not just theoretically elegant but practically powerful for modern LLMs. By framing weight compression as a coding problem rather than a rounding problem, QTIP achieves what seemed impossible: high-dimensional quantization at GPU speed.

What is the fundamental conceptual shift that QTIP makes compared to methods like GPTQ and AWQ?

GPTQ/AWQ treat quantization as independent rounding decisions (scalar, dimension 1). QTIP treats it as a sequential coding problem where decisions are coupled through a state machine, achieving much higher effective dimension and approaching the Shannon limit. QTIP uses more GPU memory QTIP requires a larger calibration dataset

QTIP Quantization with Trellises and Incoherence Processing