AQLM — Veanors

Chapter 0: The Problem

You have a Llama 2 70B model. It has 70 billion parameters, each stored as a 16-bit float. That is 70 × 10⁹ × 2 bytes = 140 GB just for the weights. A high-end consumer GPU has 24 GB of VRAM. You cannot even load the model, let alone run it.

The standard fix is quantization — replace those 16-bit floats with smaller integers. At 4 bits per parameter, the model shrinks to 35 GB. At 3 bits, 26 GB. At 2 bits, just 17.5 GB — it fits on a single GPU with room to spare for the KV cache.
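The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the helper name `model_size_gb` is made up for illustration, and sizes use decimal gigabytes (1 GB = 10⁹ bytes), as the text does.

```python
def model_size_gb(num_params: float, bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# 70B parameters at the bit-widths discussed in the text.
sizes = {bits: model_size_gb(70e9, bits) for bits in (16, 4, 3, 2)}

for bits, gb in sizes.items():
    verdict = "fits" if gb <= 24 else "does not fit"
    print(f"{bits:>2} bits: {gb:6.2f} GB ({verdict} in 24 GB of VRAM)")
```

Only the 2-bit model lands under the 24 GB line, and that is before counting activations and the KV cache.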

But there is a catch. Existing methods like GPTQ and AWQ work well at 4 bits. Push them to 3 bits and accuracy drops noticeably. At 2 bits, the model is nearly useless — perplexity on Llama 2 7B jumps from 5.12 (FP16 baseline) to double digits. The weights lose too much information when you round each one independently to a tiny grid.

The core tension: Memory bandwidth is the bottleneck for LLM inference on GPUs. Tokens are generated one at a time; each token requires loading the entire model from memory. Fewer bits per parameter = faster token generation. But below 4 bits, naive rounding destroys accuracy. We need a smarter compression scheme.

The insight behind AQLM: stop thinking about individual weights. Instead, group weights together and encode each group as a sum of learned codebook vectors. A group of 8 weights is no longer 8 independent rounded numbers — it is a combination of carefully chosen vectors from a shared dictionary. This is Multi-Codebook Quantization (MCQ), a technique from information retrieval that the authors adapt for LLM compression.

The result? At 2 bits per parameter, AQLM matches or beats the accuracy that GPTQ and AWQ achieve at 3 bits. At 3 bits, it nearly matches FP16. The model is smaller and more accurate than anything before it at extreme compression levels.

Memory Wall

[Interactive figure: model size versus bits per parameter for a 70B-parameter model. The red line marks 24 GB, a typical consumer GPU's VRAM.]

Why do methods like GPTQ fail at 2 bits per parameter?

They round each weight independently to a tiny grid, losing too much information: the quantization error per weight is huge and errors compound across layers
They require too much memory to run
They only work on models smaller than 70B parameters

Chapter 1: Vector Quantization Primer

Before we can understand AQLM's multi-codebook trick, we need to understand the simpler version: vector quantization (VQ). The idea is ancient — it dates back to the 1980s in signal compression — but it is the foundation of everything that follows.

Instead of quantizing each weight independently (scalar quantization), VQ groups weights into vectors and quantizes the entire vector at once. Think of it this way: you have a codebook — a dictionary of K prototype vectors. To compress a weight vector, you find the closest codebook entry and store only its index.

Why vectors beat scalars: Imagine quantizing the pair (0.31, 0.29) to 1 bit each. Scalar quantization gives you only {(0,0), (0,1), (1,0), (1,1)} — four rigid options. But a 2-bit vector codebook can learn four arbitrary 2D points. One of those points could be (0.30, 0.30) — a near-perfect match. The codebook adapts to the actual distribution of weight vectors.

From K-Means to Codebooks

Training a VQ codebook is just K-means clustering. You gather all weight vectors from the model, run K-means with K clusters, and the cluster centroids become your codebook entries. Each weight vector is then assigned to its nearest centroid.

Storage: if your codebook has K = 2^B entries (B-bit codes), each vector of g weights costs B bits to store (just the index). So the bits per parameter is B / g. With B = 8 and g = 8, that is 1 bit per parameter — extreme compression.
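A minimal NumPy sketch of this recipe follows. It assumes a small random matrix as a stand-in for real weights, a toy 16-entry codebook (B = 4), and a plain Lloyd's K-means written from scratch; the function name `kmeans_codebook` is illustrative, not part of any library.

```python
import numpy as np

def kmeans_codebook(vectors, num_entries, iters=20, seed=0):
    """Plain Lloyd's K-means. Returns (codebook, index of nearest entry per vector)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen data points.
    picks = rng.choice(len(vectors), num_entries, replace=False)
    codebook = vectors[picks].copy()
    for _ in range(iters):
        # Squared distance from every vector to every centroid.
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        for k in range(num_entries):
            members = vectors[idx == k]
            if len(members):              # leave empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, d2.argmin(axis=1)

g, B = 8, 4                               # toy sizes: 2**4 = 16 codebook entries
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in weight matrix
groups = W.reshape(-1, g)                 # g consecutive weights per vector

codebook, idx = kmeans_codebook(groups, 2 ** B)
W_hat = codebook[idx].reshape(W.shape)    # decoding is a table lookup

bits_per_param = B / g
rel_err = float(np.linalg.norm(W - W_hat) / np.linalg.norm(W))
print(f"{bits_per_param} bits/param, relative error {rel_err:.3f}")
```

Random Gaussian weights are close to a worst case for VQ, so the error printed here says little about real models; the point is the encode (nearest centroid) and decode (lookup) mechanics.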

The Storage Arithmetic

Let us make the VQ savings concrete. A weight matrix in Llama 2 7B's MLP is 4096 × 11008. In FP16, that is 4096 × 11008 × 2 = 90.2 MB. With VQ using g = 8, B = 8 (256-entry codebook):

Indices: (4096 × 11008 / 8) groups × 1 byte per index = 5.6 MB
Codebook: 256 entries × 8 floats × 2 bytes = 4 KB (negligible)
Total: ~5.6 MB — a 16× compression from 90.2 MB

That is 1 bit per parameter. But the accuracy is terrible — 256 prototype vectors cannot capture the diversity of millions of weight groups. We need more expressiveness without more bits.

Product Quantization (PQ)

Pure VQ has a problem: if you want fine-grained quantization, you need a huge codebook. With g = 8 dimensions and B = 16 bits, the codebook has 65,536 entries of 8 floats each, which is 1 MB per codebook in FP16. And you need one per layer.

Product Quantization (PQ) fixes this by splitting each vector into sub-vectors and using a separate, smaller codebook for each sub-vector. A vector of 8 weights is split into 2 sub-vectors of 4, each encoded with its own 8-bit codebook (256 entries). Total: 16 bits for 8 weights = 2 bits per parameter, but the codebooks are tiny.

PQ vocabulary: The sub-codebooks are often called sub-quantizers. Using M sub-quantizers with B-bit codes on groups of g weights gives B · M / g bits per parameter. This is the formula we will use throughout AQLM.
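To make the split-and-concatenate structure concrete, here is a small NumPy sketch of PQ encoding and decoding. The sub-codebooks are untrained stand-ins (random slices of the data), and `pq_encode` / `pq_decode` are hypothetical helper names; a real pipeline would train each sub-codebook with K-means.

```python
import numpy as np

def pq_encode(vectors, sub_codebooks):
    """Nearest-entry index for each sub-vector. Returns shape (n, M)."""
    M = len(sub_codebooks)
    subs = np.split(vectors, M, axis=1)            # M disjoint slices of each vector
    codes = [((s[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(axis=1)
             for s, cb in zip(subs, sub_codebooks)]
    return np.stack(codes, axis=1)

def pq_decode(codes, sub_codebooks):
    """Concatenate the selected sub-vectors back into full vectors."""
    parts = [cb[codes[:, m]] for m, cb in enumerate(sub_codebooks)]
    return np.concatenate(parts, axis=1)

rng = np.random.default_rng(0)
g, M, B = 8, 2, 8
sub_dim = g // M
vectors = rng.normal(size=(1000, g))

# Assumption: untrained stand-in codebooks built from 256 random data slices each.
sub_codebooks = [
    vectors[rng.choice(len(vectors), 2 ** B, replace=False),
            m * sub_dim:(m + 1) * sub_dim]
    for m in range(M)
]

codes = pq_encode(vectors, sub_codebooks)
recon = pq_decode(codes, sub_codebooks)
bits_per_param = B * M / g
print(codes.shape, recon.shape, bits_per_param)
```

Note that each sub-codebook only ever sees its own slice of the vector, which is exactly the limitation additive quantization removes.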

Scalar Quantization

Round each weight individually. Simple, fast, but high error at low bit-widths. Used by GPTQ, AWQ.

↓ group weights

Vector Quantization (VQ)

Replace each weight group with closest codebook entry. Better accuracy, but codebooks get huge.

↓ split into sub-vectors

Product Quantization (PQ)

Independent codebook per sub-vector. Compact codebooks, but sub-vectors are quantized independently.

↓ sum instead of concatenate

Additive Quantization (AQ)

Multiple codebooks, entries summed (not concatenated). Each codebook refines the full vector. This is AQLM.

What is the key difference between product quantization and additive quantization?

PQ uses more codebooks
In PQ, each codebook handles a disjoint sub-vector (concatenation); in AQ, all codebooks operate on the full vector and their entries are summed, so every codebook can refine the entire group
AQ uses fewer bits per parameter

Chapter 2: The Additive Quantization Framework

Now the main idea. Take a weight matrix W ∈ R^{d_out × d_in} from a linear layer. Split its rows into groups of g consecutive weights. Each group is a vector w ∈ R^g. We want to approximate w using M codebooks C₁, ..., C_M, each containing 2^B vectors in R^g.

The approximation is a sum:

ŵ = ∑_{m=1}^{M} C_m[b_m]

where b_m ∈ {0, 1, ..., 2^B − 1} is the index into the m-th codebook. Each codebook contributes one vector, and they are added together to reconstruct the weight group. This is Additive Quantization (AQ).
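In code, the additive reconstruction is one lookup per codebook followed by a sum. The sketch below uses random stand-in codebooks and codes; `aq_decode` is an illustrative name, and the array layout (codebooks stacked as M × 2^B × g) is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, B, g = 2, 8, 8
codebooks = rng.normal(size=(M, 2 ** B, g))     # C_1..C_M (random stand-ins)
codes = rng.integers(0, 2 ** B, size=(5, M))    # b_1..b_M for 5 weight groups

def aq_decode(codes, codebooks):
    """w_hat = sum over m of C_m[b_m], for every group. Returns shape (n, g)."""
    num_codebooks = codebooks.shape[0]
    # codebooks[m, codes[:, m]] has shape (n, g): one full-length vector per group.
    return sum(codebooks[m, codes[:, m]] for m in range(num_codebooks))

w_hat = aq_decode(codes, codebooks)
print(w_hat.shape)
```

Unlike PQ, every looked-up vector has the full length g, so each codebook contributes to every coordinate of the group.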

The power of addition: With M = 2 codebooks of 2⁸ = 256 entries each, PQ can represent 256 × 256 = 65,536 distinct vectors, but each codebook only controls its own sub-vector, so every half of the group is limited to 256 options. AQ represents the same 65,536 combinations, but every codebook entry is a full g-dimensional vector, so each codebook can correct the error left by the others across all g dimensions. The resulting sums span a much richer set of approximations.

The Optimization Objective

AQLM is not just about finding the nearest codebook entry. It is input-aware. Given a calibration dataset with input activations X, the objective is:

arg min_{C, b} || WX − ŴX ||₂²

This minimizes the squared error on the outputs of the layer, not on the weights directly. Why? Because not all weights are equally important. A weight that multiplies a large activation matters far more than one that multiplies near-zero. The input matrix X (collected from calibration data) encodes this importance.

Why input-aware matters: Imagine two weights: w₁ = 3.0 is multiplied by activations that are always near 0.01, and w₂ = 0.5 is multiplied by activations near 10.0. The same absolute rounding error on w₂ produces about 1000× more output error (10.0 / 0.01) than on w₁, even though w₂ is the smaller weight. The MSE objective on WX naturally upweights w₂.

Expanding the Objective

Substituting the additive representation into the objective and expanding:

|| WX − ∑_m C_m b_m X ||² = || WX ||² − 2 ∑_m ⟨W, C_m b_m⟩_{XX^T} + ∑_{i,j} ⟨C_i b_i, C_j b_j⟩_{XX^T}

where C_m b_m is the part of Ŵ contributed by the m-th codebook and ⟨A, B⟩_{XX^T} = ⟨A XX^T, B⟩_F is the Frobenius inner product weighted by the input correlation matrix. The key insight: XX^T can be pre-computed once from the calibration data. This makes the objective efficient to evaluate, with no need to pass activations through the model repeatedly.
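The identity can be checked numerically. The sketch below treats Ŵ as a single matrix (the sum over codebooks already applied), uses random stand-ins for W, Ŵ, and X, and verifies that the output error computed directly from activations equals the expansion that only touches the pre-computed XX^T. The helper `weighted_inner` is a name chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_tokens = 4, 6, 50
W = rng.normal(size=(d_out, d_in))
W_hat = W + 0.1 * rng.normal(size=W.shape)     # stand-in for a quantized W
X = rng.normal(size=(d_in, n_tokens))          # stand-in calibration activations

H = X @ X.T                                    # XX^T, computed once (d_in x d_in)

def weighted_inner(A, B, H):
    """<A, B>_{XX^T} = <A XX^T, B>_F."""
    return float(np.sum((A @ H) * B))

# Direct evaluation needs the activations themselves.
direct = float(np.linalg.norm(W @ X - W_hat @ X) ** 2)

# Expanded evaluation needs only H, whatever the number of tokens.
expanded = (weighted_inner(W, W, H)
            - 2.0 * weighted_inner(W, W_hat, H)
            + weighted_inner(W_hat, W_hat, H))

print(direct, expanded)
```

The practical consequence is that the cost of scoring a candidate Ŵ depends on d_in, not on how many calibration tokens were collected.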

Why Not Just a Bigger Codebook?

A natural question: why use M = 2 codebooks of 256 entries each (total 512 vectors to store) when you could use 1 codebook of 65,536 entries? Both use the same number of bits per weight group (16 bits). The answer is threefold:

Codebook memory. A single 2¹⁶-entry codebook with g = 8 dimensions occupies 65,536 × 8 × 2 = 1 MB in FP16. Two 2⁸-entry codebooks occupy only 2 × 256 × 8 × 2 = 8 KB. That is 128× smaller, and small codebooks fit in GPU shared memory for fast lookups.
Optimization landscape. Jointly optimizing 65K codebook entries with K-means is a hard problem — many local minima. Optimizing two sets of 256 entries is much easier and converges faster.
Generalization. Smaller codebooks are shared across more weight groups. This parameter sharing acts as regularization — the codebooks must represent common patterns, not memorize individual weights.

In practice, AQLM finds that the 1×16 configuration (single large codebook) gives slightly better accuracy, while 2×8 (two small codebooks) gives faster inference. The paper reports both.

Representation Cost

Each weight group stores M indices of B bits each. Additionally, AQLM learns a per-output-unit scale factor s ∈ R (stored in FP16). The reconstructed weight is:

ŵ = s · ∑_{m=1}^{M} C_m[b_m]

The scale factors add a negligible cost (one 16-bit float per row of the weight matrix) but help compensate for the limited codebook expressiveness.
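As a rough illustration of how the pieces fit together for a whole layer, the sketch below rebuilds a dense matrix from codes, codebooks, and per-row scales. The storage layout (codes of shape d_out × n_groups × M) and the function name `aqlm_dequantize` are assumptions made for this example, not the official format.

```python
import numpy as np

def aqlm_dequantize(codes, codebooks, scales):
    """Rebuild a dense (d_out, n_groups * g) matrix.

    codes:     (d_out, n_groups, M) integer indices
    codebooks: (M, 2**B, g) codebook vectors
    scales:    (d_out,) one scale per output row
    """
    d_out, n_groups, M = codes.shape
    g = codebooks.shape[-1]
    groups = np.zeros((d_out, n_groups, g), dtype=codebooks.dtype)
    for m in range(M):
        groups += codebooks[m][codes[..., m]]      # lookup, then sum over codebooks
    return scales[:, None] * groups.reshape(d_out, n_groups * g)

rng = np.random.default_rng(0)
d_out, d_in, g, M, B = 4, 16, 8, 2, 8
codebooks = rng.normal(size=(M, 2 ** B, g))
codes = rng.integers(0, 2 ** B, size=(d_out, d_in // g, M))
scales = rng.uniform(0.5, 1.5, size=d_out)

W_hat = aqlm_dequantize(codes, codebooks, scales)
print(W_hat.shape)
```

Because the scale is per row, it can be applied after the sum (as here) or after the dot product with the input; the result is the same.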

Concrete Data Flow

Let us trace the full storage layout for one linear layer of Llama 2 7B. The q_proj weight matrix is 4096 × 4096.

Component	Shape	Storage
Code indices (M=2, B=8)	4096 × (4096/8) × 2	4,194,304 bytes (~4.19 MB)
Codebook 1	256 × 8 floats × FP16	4,096 bytes (~4 KB)
Codebook 2	256 × 8 floats × FP16	4,096 bytes (~4 KB)
Scale factors	4096 × FP16	8,192 bytes (~8 KB)
Total		4,210,688 bytes (~4.21 MB, vs 33.55 MB in FP16)

Each codebook is shared across all 4096 × 512 weight groups in the layer. The indices dominate storage; the codebooks and scales are negligible. Compression ratio: about 8.0× (2.0 bits per parameter for the indices, or about 2.01 bits once the codebooks and scales are included).

Why does AQLM minimize || WX − ŴX ||² instead of || W − Ŵ ||²?

Because weights that multiply large activations cause more output error when quantized poorly; the input X encodes which weights matter most
Because it is computationally cheaper
Because the weight norms are all the same anyway

Chapter 3: Beam Search for Code Assignment

AQLM has three phases: (1) find the best codebook indices b for each weight group, (2) optimize the codebook entries C, (3) fine-tune at the block level. Phase 1 is the hardest — it is a discrete combinatorial problem.

For a single weight group, we need to choose M indices (one per codebook), each from 2^B options. Brute-force search over all (2^B)^M combinations grows exponentially in M. With B = 8 and M = 2, that is 65,536 combinations per group, which is still feasible. But with M = 4, it is 4.3 billion. We need a smarter search.

The Beam Search Algorithm

AQLM uses beam search, adapted from the additive quantization literature (Babenko & Lempitsky, 2014). The algorithm maintains a beam of k best partial assignments and extends them one codebook at a time:

Step 1: Initialize

For codebook C₁, try all 2^B entries. Score each by the MSE objective (Eq. 7 from the paper). Keep the k best single-code assignments.

↓

Step 2: Extend

For each of the k beams, try all 2^B entries from C₂. That gives k · 2^B candidates. Score each (the loss is additive — only the changed component needs recomputing). Keep the k best two-code assignments.

↓ repeat for C₃, ..., C_M

Step 3: Select

After processing all M codebooks, the top beam gives the best M-tuple of indices for this weight group.

Efficient scoring trick: The loss function decomposes into pairwise terms ⟨C_i b_i, C_j b_j⟩_{XX^T}. When extending a beam by changing only one code, most pairwise terms stay the same. You only recompute the terms involving the new codebook. Pre-computing XX^T (once per layer) makes each score evaluation a few dot products instead of a full matrix multiplication.

Worked Example

Suppose g = 2 (2D weight groups for simplicity), M = 2 codebooks, each with 4 entries (B = 2 bits). We want to approximate w = [0.7, −0.3].

Codebook C₁ has entries: {[0.5, 0.0], [0.0, −0.5], [−0.3, 0.4], [0.8, 0.2]}. Codebook C₂ has entries: {[0.1, −0.2], [0.3, 0.1], [−0.1, −0.4], [0.2, −0.1]}.

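Step 1: Score all 4 entries from C₁ against w. For simplicity, take XX^T = I here, so the objective reduces to the squared error ||w − C₁[b₁]||². The errors are 0.13, 0.53, 1.49, and 0.26, so entry 0 ([0.5, 0.0]) ranks best and entry 3 ([0.8, 0.2]) second. With beam width k = 2, we keep both.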

Step 2: For each beam, try all 4 entries from C₂. That gives 2 × 4 = 8 candidates:

Candidate	C₁	C₂	Sum	Error ||w − sum||²
1	[0.8, 0.2]	[−0.1, −0.4]	[0.7, −0.2]	0.01
2	[0.5, 0.0]	[0.2, −0.1]	[0.7, −0.1]	0.04
3	[0.8, 0.2]	[0.2, −0.1]	[1.0, 0.1]	0.25
... 5 more candidates ...

The winner: C₁[3] + C₂[2] = [0.8, 0.2] + [−0.1, −0.4] = [0.7, −0.2]. Error = 0.01. The target w = [0.7, −0.3] is nearly recovered by summing two codebook vectors.

Note the synergy: Neither codebook entry alone is close to w. C₁[3] = [0.8, 0.2] has the wrong sign on the second dimension. C₂[2] = [−0.1, −0.4] looks nothing like w. But their sum [0.7, −0.2] is an excellent approximation. This is the magic of additive quantization — the codebooks collaborate.

Why Beam Search, Not Greedy?

A greedy approach picks the single best entry from C₁, then the single best from C₂ given that choice, and so on. This is beam search with k = 1. The problem: early choices constrain later ones. If the best C₁ entry points the approximation slightly wrong, C₂ may not be able to fix it. Beam width k lets us maintain k alternative "hypotheses" and recover from suboptimal early decisions.
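The difference between k = 1 and k = 2 can be seen on the codebooks from the worked example. The sketch below is a simplified beam search that scores candidates by plain squared error on the weights (that is, it assumes XX^T = I, whereas the real objective is input-weighted); the function names are illustrative.

```python
def sq_err(w, v):
    return sum((a - b) ** 2 for a, b in zip(w, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def beam_search(w, codebooks, k):
    """Return (error, codes, reconstruction) of the best candidate found."""
    beams = [((), [0.0] * len(w))]              # (codes so far, partial sum)
    best = None
    for codebook in codebooks:
        candidates = []
        for codes, partial in beams:
            for idx, entry in enumerate(codebook):
                total = add(partial, entry)
                candidates.append((sq_err(w, total), codes + (idx,), total))
        candidates.sort(key=lambda c: c[0])     # lowest error first
        beams = [(codes, total) for _, codes, total in candidates[:k]]
        best = candidates[0]
    return best

w = [0.7, -0.3]
C1 = [[0.5, 0.0], [0.0, -0.5], [-0.3, 0.4], [0.8, 0.2]]
C2 = [[0.1, -0.2], [0.3, 0.1], [-0.1, -0.4], [0.2, -0.1]]

greedy_err, greedy_codes, _ = beam_search(w, [C1, C2], k=1)
beam_err, beam_codes, _ = beam_search(w, [C1, C2], k=2)
print("k=1:", greedy_codes, round(greedy_err, 4))
print("k=2:", beam_codes, round(beam_err, 4))
```

Under this scoring, greedy commits to C₁[0] = [0.5, 0.0] (the closest single entry) and ends at an error of about 0.02, while a beam of width 2 also keeps C₁[3] alive and finds the pair C₁[3] + C₂[2] with error of about 0.01.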

In practice, AQLM uses a beam width of k = 1 for the initial assignment (since codebooks are subsequently optimized anyway). The beam search runs over all d_out output units in parallel — each output dimension is independent in the loss function, so this is a massively parallel operation on GPU.

Complexity

For each weight group: beam search considers k · 2^B candidates per codebook step, and there are M steps. Total candidates evaluated: M · k · 2^B. Compare to brute force: (2^B)^M = 2^{BM}. With B = 8, M = 2, k = 1: beam search evaluates 512 candidates. Brute force would check 65,536. With M = 4: beam search evaluates 1024; brute force would check 4.3 × 10⁹.
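These counts are easy to tabulate. The two helpers below are hypothetical names for the formulas in the text, using the same upper bound of k · 2^B candidates at every step.

```python
def beam_candidates(M: int, B: int, k: int) -> int:
    """Upper bound from the text: k * 2**B candidates at each of M steps."""
    return M * k * 2 ** B

def brute_force_candidates(M: int, B: int) -> int:
    """Every combination of one entry per codebook."""
    return (2 ** B) ** M

for M in (1, 2, 3, 4):
    print(M, beam_candidates(M, B=8, k=1), brute_force_candidates(M, B=8))
```

The beam search column grows by 256 per extra codebook, while the brute-force column is multiplied by 256.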

MRF connection: The objective is equivalent to MAP inference in a Markov Random Field, with unary potentials ⟨W, C_m b_m⟩_{XX^T} and pairwise potentials ⟨C_i b_i, C_j b_j⟩_{XX^T}. While finding the exact optimum is NP-hard in general, beam search and ICM (Iterated Conditional Modes) give good approximations. AQLM chose beam search because it is easier to implement efficiently in PyTorch/JAX.

Why is beam search preferred over brute-force enumeration for finding codebook indices?

Beam search always finds the global optimum
With M codebooks of 2^B entries, brute force searches (2^B)^M combinations, which is exponential in M. Beam search is linear in M, exploring only k · 2^B candidates per codebook
Beam search uses less memory than brute force

Chapter 4: Codebook Tuning

Phase 1 (beam search) found the best indices assuming fixed codebooks. Phase 2 flips the problem: fix the indices, optimize the codebook entries.

With the codes b fixed, the objective becomes a continuous optimization over the codebook vectors C₁, ..., C_M and the scale factors s:

min_{C, s} || WX − ŴX ||² = || (W − Ŵ)X ||² = ⟨W − Ŵ, W − Ŵ⟩_{XX^T}

where Ŵ = diag(s) · [∑_m C_m b_m] is the reconstructed weight matrix. This is differentiable with respect to C_m and s, so we can use standard gradient-based optimization.

Computing the Gradient

The loss is quadratic in the codebook entries. Let us trace the gradient for a single codebook entry c = C_m[k] that is used by some set of weight groups S_k (all groups where b_m = k). The gradient is:

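∂L / ∂c = −2 ∑_{(i, j) ∈ S_k} s_i · [(W − Ŵ) XX^T]_{i, j}

Here (i, j) ranges over the groups in S_k (group j of output row i), s_i is that row's scale, and [·]_{i, j} selects the g entries of row i that belong to group j. This is a weighted sum of the residual errors across all weight groups that share this codebook entry, weighted by the input correlation matrix XX^T. The gradient is cheap to compute because XX^T is pre-computed and shared, and each weight group contributes one additive term.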
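A small NumPy sketch can confirm this kind of gradient against finite differences. Everything here is a toy stand-in (random weights, codes, and activations; tiny sizes), and the per-row scale is included via the chain rule, so each group's contribution is multiplied by its row's s. The scatter-add uses `np.add.at` so that groups sharing an entry accumulate into it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, n_groups, g, M, K = 3, 2, 2, 2, 4
W = rng.normal(size=(d_out, n_groups * g))
X = rng.normal(size=(n_groups * g, 20))
H = X @ X.T                                       # pre-computed XX^T
codebooks = rng.normal(size=(M, K, g))
codes = rng.integers(0, K, size=(d_out, n_groups, M))
scales = rng.uniform(0.5, 1.5, size=d_out)

def reconstruct(cb):
    groups = sum(cb[m][codes[..., m]] for m in range(M))   # (d_out, n_groups, g)
    return scales[:, None] * groups.reshape(d_out, -1)

def loss(cb):
    R = W - reconstruct(cb)                       # residual W - W_hat
    return float(np.sum((R @ H) * R))             # ||(W - W_hat) X||^2

def codebook_grad(cb):
    R = W - reconstruct(cb)
    G = -2.0 * (R @ H)                            # dL / dW_hat (H is symmetric)
    G = (scales[:, None] * G).reshape(d_out, n_groups, g)  # chain rule through s
    grad = np.zeros_like(cb)
    for m in range(M):
        # Every group that selected entry k adds its slice of G to grad[m, k].
        np.add.at(grad[m], codes[..., m], G)
    return grad

# Central finite difference on one coordinate of one used entry.
m0, k0, j0, eps = 0, int(codes[0, 0, 0]), 1, 1e-4
plus, minus = codebooks.copy(), codebooks.copy()
plus[m0, k0, j0] += eps
minus[m0, k0, j0] -= eps
numeric = (loss(plus) - loss(minus)) / (2 * eps)
analytic = float(codebook_grad(codebooks)[m0, k0, j0])
print(numeric, analytic)
```

Entries that no group selected receive a zero gradient, which is one reason the alternation with code reassignment matters.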

Adam Optimizer

AQLM uses Adam with learning rate 10⁻⁴ and full-batch gradient descent on the calibration data (128 samples). Each codebook update phase runs for 100 steps. The authors found the results are robust to these hyperparameters — varying the learning rate or number of steps does not significantly change the outcome.

Why not closed-form? In classic AQ (without the scale factors s and the XX^T weighting), the optimal codebooks can be found by solving a linear system. But the presence of XX^T couples all codebook dimensions through the input correlation. Each codebook entry's optimal value depends on every other entry. Adam handles this coupling gracefully without deriving the complex Hessian.

Scale Factor Initialization

The per-output-unit scale s_i is initialized to the L2 norm of the i-th row of W: s_i = ||W_i||₂. This gives each row a learned magnitude while the codebooks handle the direction. The scales are updated alongside the codebooks via the same Adam optimizer.

Alternating Optimization

Phases 1 and 2 alternate: update codes (beam search) → update codebooks (Adam) → update codes → update codebooks → ... until the loss stops improving (or for a fixed number of rounds). Each round refines both the discrete assignments and the continuous codebook entries.

Initialization: Residual K-Means

Good initialization is critical. AQLM uses residual K-means (Chen et al., 2010):

Run K-means on all weight groups to get C₁. Assign each group to its nearest centroid.
Compute the residual: r = w − C₁[b₁] (the part of each weight group not captured by the first codebook).
Run K-means on the residuals to get C₂. This codebook specializes in what C₁ missed.
Compute the new residual: r' = w − C₁[b₁] − C₂[b₂]. Repeat for C₃, etc.

This is equivalent to Residual Quantization (RQ) and gives a reasonable starting point. The subsequent alternating optimization (beam search + Adam) then jointly refines all codebooks, escaping the limitations of the sequential residual approach.
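The initialization steps above can be sketched in NumPy as follows. This is a simplified illustration on random data with a basic Lloyd's K-means; the function names are hypothetical, and it clusters raw weight groups without any input weighting or scale normalization.

```python
import numpy as np

def kmeans(vectors, K, iters=15, seed=0):
    """Basic Lloyd's K-means. Returns (centroids, nearest-centroid index)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), K, replace=False)].copy()
    for _ in range(iters):
        idx = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if np.any(idx == k):                  # keep empty clusters as they are
                centroids[k] = vectors[idx == k].mean(axis=0)
    idx = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
    return centroids, idx

def residual_kmeans_init(groups, M, K):
    """Fit C_1 on the groups, C_2 on what C_1 missed, and so on."""
    residual = groups.copy()
    codebooks, codes, errors = [], [], []
    for m in range(M):
        C_m, b_m = kmeans(residual, K, seed=m)
        residual = residual - C_m[b_m]            # what is still unexplained
        codebooks.append(C_m)
        codes.append(b_m)
        errors.append(float((residual ** 2).sum()))
    return np.stack(codebooks), np.stack(codes, axis=1), errors

rng = np.random.default_rng(0)
groups = rng.normal(size=(512, 8))                # stand-in weight groups, g = 8
codebooks, codes, errors = residual_kmeans_init(groups, M=2, K=16)
recon = sum(codebooks[m][codes[:, m]] for m in range(2))
print(errors)
```

Each stage can only reduce the remaining squared error, but the stages are fit greedily, which is the limitation the later joint optimization addresses.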

Initialize

Residual K-means: cluster weight groups for C₁, cluster residuals for C₂, etc. This gives a good starting point for joint optimization.

↓

Phase 1: Beam Search

Fix C_m and s. Find best indices b_m for each weight group via beam search over the MSE objective.

↓

Phase 2: Codebook + Scale Update

Fix indices b_m. Run 100 steps of Adam on C_m and s to minimize || WX − ŴX ||².

↻ repeat until convergence

What role do the per-output scale factors s play in AQLM?

They normalize the input activations
They determine the codebook size
They give each output row a learned magnitude while the codebooks handle direction, adding expressiveness at negligible storage cost (one FP16 value per output unit)

Chapter 5: Block-Level Fine-Tuning

Phases 1 and 2 compress each layer independently. But in a real transformer, quantization errors in one layer's output become the input to the next layer. Errors compound. At 2 bits, this compounding is devastating.

Phase 3 of AQLM fixes this with block-level fine-tuning. Instead of optimizing one layer at a time, it optimizes an entire transformer block (self-attention + MLP, typically 4–8 linear layers) jointly.

The Procedure

After quantizing all layers within a block using phases 1 and 2, AQLM:

Records the original block output Y_block = block(X_block) using FP16 weights
Replaces the block's weights with their quantized versions
Defines the fine-tuning loss: || Y_block − block̂(X_block) ||²
Optimizes the codebook entries C_m, scale factors s, and all non-quantized parameters (RMSNorm scales and biases) via Adam, backpropagating through the entire quantized block
Keeps the codes b_m fixed — only continuous parameters are tuned

Why this works: The codebook entries are continuous (FP16) and differentiable. Even though the weight reconstruction is a discrete lookup (select code b_m from C_m), the gradient flows through the lookup because b_m is fixed: changing C_m[b_m] is just changing a regular parameter that happens to be shared across all weight groups that selected index b_m. Unlike VQ-VAE training, no straight-through estimator is needed, because we are optimizing the codebook entries, not the codes.

Practical Details

Fine-tuning uses the PyTorch autograd engine. The calibration data is the same 128 samples from RedPajama (sequence length 4096) used in phases 1 and 2. The optimizer is Adam with a schedule. Fine-tuning the transformer blocks takes only 10–30% of the total calibration time — it converges quickly because it starts from a good initialization (the per-layer quantization from phases 1–2).

The algorithm processes the model block by block, from the first transformer layer to the last. After fine-tuning block k, it saves the updated codebooks and moves to block k+1, using the quantized output of block k as input. This sequential processing means each block adapts to the actual (quantized) inputs it will see at inference time.

VRAM efficiency: The fine-tuning modifies only the codebook entries and scale factors — a tiny fraction of the total parameters. Optimizer states (Adam momenta) are therefore small. This makes it possible to fine-tune even billion-parameter models on a single GPU in reasonable time.

Algorithm 1: Full AQLM Pipeline

pseudocode
Input: model, calibration data (128 samples)
X_block = model.embed(calibration)          # inputs to the first block
for block = 1 to num_blocks:
  Y_orig = block(X_block)                   # FP16 block output, recorded first
  for layer in linear_layers(block):
    W = layer.weight                        # original FP16 weights
    X = layer_inputs(layer, X_block)        # collect activations
    C, b, s = initialize(W)                 # residual K-means

    while loss improves:
      C, s = adam_update(XX^T, W, C, b, s)  # Phase 2: codebook tune
      b = beam_search(XX^T, W, C, b, s)     # Phase 1: code update

    layer.weight = AQLMFormat(C, b, s)

  theta = trainable_params(block)           # Phase 3: codebooks + scales + norms
  while loss improves:
    L = ||block(X_block) - Y_orig||^2
    theta = adam(theta, grad(L))
  X_block = block(X_block)                  # quantized output feeds the next block

Why does AQLM fine-tune at the transformer block level instead of layer-by-layer?

Because quantization errors in one layer's output become the input to the next layer; block-level fine-tuning allows layers within the same block to compensate for each other's quantization errors
Because it is faster to fine-tune a block at once
Because individual layers do not have enough parameters to fine-tune

Chapter 6: The Bits Arithmetic

Let us work through the exact compression math. This is where AQLM's configurations map to concrete bits-per-parameter numbers.

The Formula

For a weight group of g parameters encoded with M codebooks of 2^B entries each, the cost per group is:

bits per group = M · B (for the M codebook indices)

bits per parameter = M · B / g

Plus there are two small overheads: (1) the codebook entries themselves, and (2) the per-output scale factors. For large models, these are negligible compared to the indices.
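To see how the indices and the two overheads add up for a given layer, the formula can be written as a small helper. This is a sketch: the function name is made up, and it assumes FP16 (16-bit) codebook entries and scales.

```python
def aqlm_bits_per_param(d_out, d_in, M, B, g, codebook_bits=16, scale_bits=16):
    """Per-parameter cost of one d_out x d_in layer, split by component."""
    n_params = d_out * d_in
    index_bits = d_out * (d_in // g) * M * B          # M indices of B bits per group
    codebook_storage = M * (2 ** B) * g * codebook_bits
    scale_storage = d_out * scale_bits                # one scale per output row
    return {
        "indices": index_bits / n_params,
        "codebooks": codebook_storage / n_params,
        "scales": scale_storage / n_params,
    }

# The 2x8 configuration on a 4096 x 4096 projection.
cost = aqlm_bits_per_param(4096, 4096, M=2, B=8, g=8)
total = sum(cost.values())
print(cost, round(total, 4))
```

The index term depends only on M · B / g, while the codebook term depends on the layer size, so the same configuration has a slightly different average on differently shaped layers.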

AQLM Configurations

Target	M	B	g	Codebook size	Avg bits
2-bit	1	16	8	2¹⁶ = 65,536	~2.02
2-bit (alt)	2	8	8	2⁸ = 256 each	~2.02
3-bit	2	12	8	2¹² = 4,096 each	~3.01
4-bit	2	16	8	2¹⁶ = 65,536 each	~4.01

Worked example (2-bit with 1×16): Group size g = 8. One codebook with 2¹⁶ = 65,536 entries. Each group stores one 16-bit index. That is 16 bits / 8 params = 2.0 bits per parameter. Add the scale factor: one FP16 per output unit, shared across all groups in that row. For a 4096×4096 matrix with 4096 output units, that is 4096 × 16 bits = 65,536 bits of overhead. Total: 4096 × (4096/8) × 16 = 33.6M bits for indices + 0.066M bits for scales, or about 2.004 bits per parameter before the codebook. The codebook itself is 65,536 × 8 × 16 bits = 8.4M bits per layer, so its share of the average shrinks as the weight matrix grows.

Worked example — 2-bit with 2×8: Same g = 8, but two codebooks of 256 entries each. Each group stores two 8-bit indices = 16 bits. Same total: 16/8 = 2.0 bits per parameter. But now the codebook is tiny (256 entries × 8 floats = 4 KB per codebook) while 1×16 needs a 65,536-entry codebook (1 MB). The tradeoff: 2×8 has smaller codebooks but the additive structure is less expressive than a single large codebook.

What Does Not Count

Following standard practice, the "bits per parameter" metric excludes parameters that are kept in full precision: the embedding layer and the LM head. These are a small fraction of total parameters for large models (e.g., roughly 0.8% for Llama 2 70B, whose two 32,000 × 8,192 matrices hold about 0.52B of its roughly 69B parameters). All benchmarks compare methods at the same average bitwidth, so this is a fair comparison.

Compression Calculator

[Interactive calculator: bits per parameter and model size for a 70B model as a function of M (codebooks), B (bits per code), and g (group size).]

A configuration uses M=2 codebooks, B=8 bits per code, and groups of g=8 weights. What is the bits-per-parameter cost of the indices alone?

2 × 8 / 8 = 2.0 bits per parameter
8 / 2 = 4.0 bits per parameter
2 × 8 = 16 bits per parameter

Chapter 7: Codebook Lookup Visualization

Let us see the entire AQLM dequantization process in action. This is what happens during inference when the model needs to reconstruct a weight group to compute a matrix-vector product.

The Dequantization Pipeline

1. Read Indices

Load the M compressed code indices (b₁, b₂, ..., b_M) for this weight group from memory. Each is B bits.

↓

2. Codebook Lookup

Use each index to fetch a g-dimensional vector from the corresponding codebook: C₁[b₁], C₂[b₂], ..., C_M[b_M].

↓

3. Sum

Add the M vectors element-wise: ŵ = C₁[b₁] + C₂[b₂] + ... + C_M[b_M].

↓

4. Scale

Multiply by the per-output scale: ŵ → s · ŵ.

↓

5. Dot Product

Compute the dot product ŵ · x for the corresponding input activation group. Accumulate across all groups for this output unit.

GPU kernel design: AQLM implements efficient CUDA/Triton kernels that fuse steps 2–5. Instead of materializing the full dequantized weight matrix (which would defeat the memory savings), the kernel loads compressed codes, looks up codebook entries on the fly, and accumulates the dot product directly. Codebooks are small enough to fit in GPU shared memory or L1 cache, so lookups are fast. The bottleneck shifts from memory bandwidth (loading the full weight matrix) to compute (the lookups and additions).
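The fused idea can be imitated in NumPy, purely to show the data flow; a real kernel would be written in CUDA or Triton and is not reproduced here. The sketch streams over input groups, rebuilds only one d_out × g slice at a time, and never stores the full dense matrix. The array layout and the name `aqlm_matvec` are assumptions of this example.

```python
import numpy as np

def aqlm_matvec(codes, codebooks, scales, x):
    """y = W_hat @ x without materializing W_hat.

    codes: (d_out, n_groups, M), codebooks: (M, 2**B, g),
    scales: (d_out,), x: (n_groups * g,)
    """
    d_out, n_groups, M = codes.shape
    g = codebooks.shape[-1]
    y = np.zeros(d_out)
    for j in range(n_groups):                      # stream over input groups
        x_j = x[j * g:(j + 1) * g]
        w_j = np.zeros((d_out, g))                 # only one slice lives at a time
        for m in range(M):
            w_j += codebooks[m][codes[:, j, m]]    # read codes, look up, sum
        y += w_j @ x_j                             # partial dot products
    return scales * y                              # per-row scale, applied once

rng = np.random.default_rng(0)
d_out, d_in, g, M, B = 6, 32, 8, 2, 8
codebooks = rng.normal(size=(M, 2 ** B, g))
codes = rng.integers(0, 2 ** B, size=(d_out, d_in // g, M))
scales = rng.uniform(0.5, 1.5, size=d_out)
x = rng.normal(size=d_in)

# Reference: build the dense matrix explicitly, then multiply.
dense_groups = sum(codebooks[m][codes[..., m]] for m in range(M))
W_hat = scales[:, None] * dense_groups.reshape(d_out, d_in)
y_ref = W_hat @ x
y_fused = aqlm_matvec(codes, codebooks, scales, x)
print(np.max(np.abs(y_ref - y_fused)))
```

Applying the scale after accumulation is equivalent to scaling each reconstructed group, because the scale is constant along a row.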

GPU Kernel Architecture

The kernel operates in two phases per thread block:

Load codebooks into shared memory. For the 2×8-bit configuration (M = 2 codebooks, 256 entries each, g = 8 floats per entry), each codebook is 256 × 8 × 2 bytes = 4 KB in FP16. Two codebooks fit in 8 KB — well within the 48–164 KB of shared memory available on modern GPUs.
Stream codes + accumulate. Each thread loads a code pair (b₁, b₂), looks up both codebook entries in shared memory (no global memory access), sums them, multiplies by the scale factor, computes the dot product with the input activation group, and atomically accumulates into the output.

The key insight: codebook lookups from shared memory take on the order of tens of cycles, while global memory loads cost hundreds of cycles (roughly 200–400). Since the codebooks are reused across all weight groups in the layer, the shared memory approach amortizes the load cost across thousands of groups.

16-bit codebook variant: For the 1×16 configuration (single codebook with 65,536 entries), the codebook is 65,536 × 8 × 2 = 1 MB, which is too large for shared memory, so lookups have to go through the cache hierarchy to global memory. This is the motivation for the 2×8 configuration: a pair of 8-bit codebooks stores the same 16 bits per group but fits in shared memory, trading a small amount of accuracy for faster lookups.

Inner Product Trick

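There is an additional speedup based on lookup tables. Because the reconstruction is a sum, the dot product between a weight group and the input distributes over the codebooks:

ŵ · x = ∑_{m=1}^{M} C_m[b_m] · x

You can pre-compute the dot products of all codebook entries with the current input group: dp_m[k] = C_m[k] · x for k = 0, ..., 2^B − 1. Then reconstruction + dot product becomes M table lookups: dp_1[b_1] + ... + dp_M[b_M]. This turns the computation from g multiplications and additions per group into a few indexed loads. The table costs 2^B dot products to build and is reused by every output row, so it pays off when 2^B is much smaller than d_out. With 8-bit codebooks the table has only 256 entries per codebook. With a single 16-bit codebook it has 65,536 entries (256 KB in FP32 per group of 8 input dimensions), which is more entries than most layers have output rows, so this trick is mainly useful with small codebooks and for CPU inference where caches are larger.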
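Here is a NumPy sketch of the table-lookup idea, under the assumption of small (8-bit) codebooks and the same illustrative array layout as before (codes of shape d_out × n_groups × M). It builds one table of dot products per codebook and input group, then replaces every per-group dot product with a gather.

```python
import numpy as np

def aqlm_matvec_lut(codes, codebooks, scales, x):
    """y = W_hat @ x using pre-computed codebook/input dot products."""
    d_out, n_groups, M = codes.shape
    g = codebooks.shape[-1]
    x_groups = x.reshape(n_groups, g)
    # dp[m, j, k] = C_m[k] . x_j  (built once, shared by all output rows)
    dp = np.einsum("mkg,jg->mjk", codebooks, x_groups)
    cols = np.arange(n_groups)[None, :]            # broadcasts against codes
    y = np.zeros(d_out)
    for m in range(M):
        # Gather dp[m, j, codes[i, j, m]] for every row i and group j.
        y += dp[m][cols, codes[:, :, m]].sum(axis=1)
    return scales * y

rng = np.random.default_rng(0)
d_out, d_in, g, M, B = 6, 32, 8, 2, 8
codebooks = rng.normal(size=(M, 2 ** B, g))
codes = rng.integers(0, 2 ** B, size=(d_out, d_in // g, M))
scales = rng.uniform(0.5, 1.5, size=d_out)
x = rng.normal(size=d_in)

# Reference: dense reconstruction followed by an ordinary matvec.
dense_groups = sum(codebooks[m][codes[..., m]] for m in range(M))
y_ref = (scales[:, None] * dense_groups.reshape(d_out, d_in)) @ x
y_lut = aqlm_matvec_lut(codes, codebooks, scales, x)
print(np.max(np.abs(y_ref - y_lut)))
```

In this toy setting d_out is tiny, so building the tables costs more than it saves; the approach only becomes worthwhile when the number of output rows is much larger than the number of codebook entries.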

Additive Codebook Lookup

[Interactive figure: step-by-step dequantization of one weight group. Each codebook contributes a vector; the vectors are summed and scaled to reconstruct the weights.]

Inference Speed

Platform	Model	FP Baseline	AQLM 2-bit	Speedup
RTX 3090	Llama 2 7B	129 μs	99 μs (1×16)	1.31×
	Llama 2 13B	190 μs	158 μs	1.20×
	Llama 2 70B	578 μs	190 μs (2×8)	3.05×
CPU i9	Llama 2 7B	1.83 ms	0.67 ms	2.75×
	Llama 2 13B	3.12 ms	0.88 ms	3.54×
	Llama 2 70B	11.31 ms	2.81 ms	4.03×

The GPU speedups are modest for 7B (the model already fits in FP16) but dramatic for 70B, where the FP16 model requires multi-GPU while the 2-bit model fits on a single card. On CPU, the inner product pre-computation trick delivers speedups of roughly 2.75–4×.

Why 70B benefits most: For the 7B model in FP16, the weights total only ~14 GB, so the model fits in GPU memory with bandwidth to spare. The bottleneck is compute, not memory. AQLM's codebook lookups add compute overhead that partially offsets the memory savings. For 70B, the FP16 model exceeds single-GPU memory entirely. AQLM's 2-bit version fits in ~18 GB, enabling single-GPU inference where FP16 cannot. The speedup is transformative, not incremental.

How does AQLM avoid materializing the full dequantized weight matrix during inference?

Fused GPU kernels load compressed codes, look up codebook entries in shared memory, and accumulate the dot product on-the-fly, so the full weight is never stored
It dequantizes the matrix in chunks to save memory
It keeps the quantized weights and uses approximate matrix multiplication

Chapter 8: Results

The moment of truth. How does AQLM compare to GPTQ, AWQ, SpQR, and QuIP# on actual language model benchmarks?

2-Bit Regime (Llama 2)

This is where AQLM shines brightest. All methods are compressing to approximately 2 bits per parameter. Perplexity is measured on WikiText-2 (lower is better).

Model	Method	Avg bits	Wiki2 PPL ↓	C4 PPL ↓	Avg Accuracy ↑
7B	AQLM	2.02	6.59	8.54	57.28
	QuIP#	2.02	8.22	11.01	52.23
	FP16	16	5.12	6.63	62.35
13B	AQLM	1.97	5.60	7.49	61.32
	QuIP#	2.01	6.06	8.07	57.55
	FP16	16	4.57	6.05	65.38
70B	AQLM	2.07	3.94	5.72	68.75
	QuIP#	2.01	4.16	6.01	67.67
	FP16	16	3.12	4.97	70.17

The headline number: AQLM 2-bit on Llama 2 70B achieves 3.94 perplexity on WikiText-2 — only 0.82 above the FP16 baseline. QuIP# (the previous state-of-the-art) at the same bitwidth gets 4.16. GPTQ and SpQR are not shown because they effectively break at 2 bits (perplexity above 10 for 7B/13B).

3-Bit Regime (Llama 2)

Model	Method	Avg bits	Wiki2 PPL ↓	C4 PPL ↓	Avg Accuracy ↑
7B	AQLM	3.04	5.46	7.08	60.88
	GPTQ	3.00	8.06	10.61	53.08
	SpQR	2.98	6.20	8.20	59.07
	FP16	16	5.12	6.63	62.35
70B	AQLM	3.01	3.36	5.17	69.86
	GPTQ	3.00	4.40	6.26	65.41
	SpQR	2.98	3.85	5.63	68.22
	FP16	16	3.12	4.97	70.17

The 3-bit story: At 3 bits, AQLM on Llama 2 70B achieves 3.36 perplexity — only 0.24 above the FP16 baseline of 3.12. This means you can compress a 140 GB model to ~26 GB and lose almost nothing. GPTQ at the same bitwidth gets 4.40 — over a full perplexity point worse.

Calibration Data

All results above use a tiny calibration set: 128 sequences of length 4096 tokens from the RedPajama dataset. That is roughly 524K tokens — a vanishingly small fraction of the pre-training data. The authors found that:

Increasing calibration samples beyond 128 gives diminishing returns (less than 0.1 PPL improvement on 7B)
Calibration from the same domain as the eval set helps, but the model is not sensitive to the exact dataset
AQLM benefits more from larger calibration sets than GPTQ does, because the codebook optimization has more parameters to tune and thus more capacity to absorb information from data

Pareto optimality: The authors establish that AQLM is Pareto optimal in the accuracy-vs-model-size tradeoff below 3 bits. This means there is no known method that achieves both better accuracy AND smaller size at any point in the 2–3 bit range. Above 3.5 bits, simpler methods like GPTQ close the gap because scalar rounding works well when each weight has enough bits.

End-to-End Fine-Tuning Results

After the paper's initial release, the authors added end-to-end fine-tuning (following QuIP#'s approach): fine-tune the entire quantized model to minimize the KL divergence between its outputs and the original model's outputs, rather than only per-block MSE. This lowers WikiText-2 perplexity further:

Model	Method	Avg bits	Wiki2 PPL ↓	Avg Accuracy ↑
7B	AQLM*	2.02	6.14	58.27
7B	QuIP#*	2.02	6.19	58.48
70B	AQLM*	2.07	3.83	68.20
70B	QuIP#*	2.01	3.91	68.28

With end-to-end fine-tuning (marked *), AQLM and QuIP# reach near-parity at 2 bits, and both erase the "2-bit is useless" narrative. The 2-bit 70B AQLM model reaches 3.83 perplexity, which is 0.71 above FP16's 3.12; GPTQ at 3 bits sits at 4.40, a gap of 1.28.
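To make the KL objective concrete, here is a minimal sketch for a single token position in pure Python. The logits are made up, and the direction of the divergence (original model as the reference distribution) is an assumption for illustration; a real fine-tuning run would compute this over batches of tokens in a deep learning framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position, in nats."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over a 4-token vocabulary (illustrative only).
fp16_logits = [2.0, 0.5, -1.0, 0.0]
quantized_logits = [1.6, 0.9, -0.8, 0.1]

same = kl_divergence(fp16_logits, fp16_logits)
different = kl_divergence(fp16_logits, quantized_logits)
print(f"identical outputs: {same:.4f}")
print(f"perturbed outputs: {different:.4f}")
```

Unlike per-block MSE, this loss is measured at the very end of the network, so errors that cancel or compound across layers are accounted for.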

Perplexity vs. Bits

Compare AQLM (orange), QuIP# (teal), GPTQ (blue), and SpQR (purple) on Llama 2 70B. Lower is better. The dashed line shows FP16 baseline.

At 2 bits per parameter on Llama 2 70B, what is the perplexity gap between AQLM and the FP16 baseline?

About 5.0 perplexity points
About 0.82 perplexity points (3.94 vs 3.12)
The quantized model is better than FP16

Chapter 9: Connections

AQLM sits at the intersection of information retrieval (additive quantization) and LLM compression (post-training quantization). Here is how it connects to the broader landscape.

Method Comparison

Method	Approach	Sweet spot	Weakness
GPTQ	Scalar, row-by-row optimal rounding using Hessian	4 bits	Falls apart below 3 bits
AWQ	Scalar, activation-aware channel scaling	4 bits	Same: poor below 3 bits
SpQR	Mixed precision: keep outlier weights in FP16	3–4 bits	Irregular memory access from mixed formats
QuIP#	Incoherence processing + lattice codebooks	2–4 bits	Complex, requires Hadamard transforms
AQLM	Multi-codebook additive quantization + fine-tuning	2–3 bits	Slower to calibrate, more complex kernels

Residual Quantization (RQ) vs. Additive Quantization (AQ): RQ quantizes the residual error sequentially: encode w with C₁, then encode the error w − C₁[b₁] with C₂, and so on. AQ jointly optimizes all codebooks and codes, so C₂ is not constrained to fix C₁'s error. Any RQ solution is also a valid AQ solution, so in theory the best AQ solution is at least as good as the best RQ one. In practice the joint problem is much harder to solve, and AQLM's beam search + Adam optimization is what makes AQ tractable for LLM-scale problems.
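A toy example of the difference, restricted to the code-assignment step with fixed codebooks. The 2-D vectors and codebooks are invented for illustration, and the joint search here is exhaustive, which only works at toy scale; AQLM itself uses beam search instead.

```python
from itertools import product


def sq_err(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def reconstruct(codes, codebooks):
    """Sum the selected codeword from each codebook."""
    total = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, codes):
        total = [t + c for t, c in zip(total, cb[i])]
    return total


def encode_residual(w, codebooks):
    """RQ-style greedy encoding: each codebook fits the residual left so far."""
    residual, codes = list(w), []
    for cb in codebooks:
        best = min(range(len(cb)), key=lambda i: sq_err(residual, cb[i]))
        codes.append(best)
        residual = [r - c for r, c in zip(residual, cb[best])]
    return codes


def encode_joint(w, codebooks):
    """AQ-style encoding: search all code combinations for the best sum."""
    all_codes = product(*(range(len(cb)) for cb in codebooks))
    return list(min(all_codes, key=lambda codes: sq_err(w, reconstruct(codes, codebooks))))


# Invented toy data: the best first pick in isolation is not part of the best pair.
w = [1.0, 1.0]
codebooks = [
    [[1.0, 0.0], [0.6, 0.6]],  # C1
    [[0.0, 1.0], [0.3, 0.0]],  # C2
]

rq_codes = encode_residual(w, codebooks)
aq_codes = encode_joint(w, codebooks)
rq_err = sq_err(w, reconstruct(rq_codes, codebooks))
aq_err = sq_err(w, reconstruct(aq_codes, codebooks))
print("RQ codes", rq_codes, "error", round(rq_err, 4))
print("AQ codes", aq_codes, "error", round(aq_err, 4))
```

The greedy pass commits to the codeword in C₁ that is closest to w on its own, which rules out the pair that reconstructs w exactly. The joint search considers that pair, so its error can never be larger than the greedy one.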

Key Takeaways

2-bit LLMs are viable. AQLM proved that with the right quantization structure, you can compress to 2 bits and retain most of the model's capability. The gap between 2-bit AQLM and FP16 is smaller than the gap between GPTQ at 3 bits and FP16.
Vector quantization beats scalar at extreme compression. The codebook structure captures correlations between weights that per-element rounding cannot.
Input-aware objectives matter. Minimizing output MSE (weighted by activations) is fundamentally better than minimizing weight MSE, especially at low bitwidths.
Block-level fine-tuning is cheap and powerful. A few hundred Adam steps on the codebooks, using only 128 calibration samples, can recover a significant fraction of the quantization loss.
Calibration data is small. 128 sequences of length 4096 from RedPajama. No need for large fine-tuning datasets.
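The point about input-aware objectives can be seen in a tiny example. The sketch below uses invented numbers for one weight row and three calibration inputs: two quantized candidates have identical weight MSE, but the one whose error sits on a high-magnitude input channel changes the layer output far more.

```python
def weight_mse(w, w_hat):
    """Mean squared error measured directly on the weights."""
    return sum((a - b) ** 2 for a, b in zip(w, w_hat)) / len(w)


def output_mse(w, w_hat, inputs):
    """Mean squared error of the layer output w.x over calibration inputs."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum((dot(w, x) - dot(w_hat, x)) ** 2 for x in inputs) / len(inputs)


# Invented toy data: input channel 0 carries much larger activations.
w = [0.5, -0.25]
calib_inputs = [[10.0, 0.1], [-8.0, 0.2], [12.0, -0.1]]

cand_quiet = [0.5, -0.125]  # error placed on the low-magnitude channel
cand_loud = [0.625, -0.25]  # same-size error on the high-magnitude channel

for name, cand in [("quiet", cand_quiet), ("loud", cand_loud)]:
    print(name,
          "weight MSE:", weight_mse(w, cand),
          "output MSE:", round(output_mse(w, cand, calib_inputs), 6))
```

A weight-only objective cannot tell these two candidates apart; an objective weighted by activations prefers the first one by a wide margin.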

Limitations

Calibration time. AQLM is slower to quantize than GPTQ (hours vs. minutes for a 7B model) because of the alternating beam search + codebook optimization.
Kernel complexity. The multi-codebook lookup kernels are more complex than simple integer dequantization, requiring careful GPU kernel engineering.
Requires calibration data. Unlike round-to-nearest (RTN) methods, which can be applied without any calibration data, AQLM needs a calibration set, which is a stronger assumption.
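For contrast, here is a minimal sketch of round-to-nearest quantization. It uses a symmetric uniform grid with one scale per group, which is one common variant and an assumption of this example; the weights are invented. Note that the function sees only the weights themselves, never any input data.

```python
def rtn_quantize(weights, bits):
    """Symmetric round-to-nearest onto a uniform grid; needs only the weights.

    Assumes at least one weight is nonzero, otherwise the scale would be 0.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.12, -0.40, 0.33, 0.05, -0.21, 0.40]  # invented values
codes, scale = rtn_quantize(weights, bits=4)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max abs error:", round(max_err, 4), "grid step:", round(scale, 4))
```

Each weight is rounded independently, so the error per weight is bounded by half a grid step, but nothing in the procedure knows which weights matter most for the layer's outputs.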

Using AQLM in Practice

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-quantized AQLM model from HuggingFace.
# Requires the AQLM inference kernels: pip install aqlm[gpu]
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"    # fits on a single 24GB GPU
)

# Tokenize the prompt and move it to the same device as the model
prompt = "The key insight of additive quantization is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate text -- the dequantization happens inside the kernel
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What Came Next

After AQLM, the authors and others pushed further:

PV-Tuning (Malinovskii et al., 2024) improved the fine-tuning phase, showing that better optimization of the codes and codebooks can recover additional accuracy.
QuIP# with fine-tuning reaches near-parity with end-to-end fine-tuned AQLM at 2 bits, which suggests that the fine-tuning stage, not just the quantization structure, drives much of the accuracy gain.
QLoRA + AQLM: The community found that AQLM-quantized models can be further fine-tuned with LoRA adapters, enabling task-specific adaptation on top of extreme compression.
HuggingFace integration: AQLM was integrated into the Transformers library, making it accessible via AutoModelForCausalLM.from_pretrained() with automatic kernel dispatch.

The open-source AQLM codebase at github.com/Vahe1994/AQLM provides quantization scripts and pre-quantized model weights for Llama 2, Mistral, and other model families.

The big picture: AQLM showed that the techniques information retrieval researchers developed for compressing embeddings in the 2010s — additive quantization, beam search, codebook fine-tuning — translate directly to LLM weight compression. The insight is general: any time you need to compress high-dimensional vectors with minimal distortion, multi-codebook methods deserve a look.

"The question of optimal data compression is really a question of understanding." — Gregory Chaitin

What is the fundamental advantage of additive quantization over residual quantization for LLM compression?

Additive quantization uses fewer codebooks
AQ jointly optimizes all codebooks so each can correct any error, while RQ constrains each codebook to fix only the previous one's residual; joint optimization finds better overall solutions
Additive quantization is faster to train
