Egiazarian, Panferov, Kuznedelev, Frantar, Babenko, Alistarh — ICML 2024

AQLM Extreme Compression via Additive Quantization

Multi-codebook quantization pushes LLMs to 2 bits per parameter with less accuracy loss than anyone thought possible. The trick: represent each weight group as a sum of codebook vectors, then jointly fine-tune everything.

Prerequisites: Matrix multiplication + Basic quantization idea (map floats to small integers) + K-means clustering
10
Chapters
3+
Simulations

Chapter 0: The Problem

You have a Llama 2 70B model. It has 70 billion parameters, each stored as a 16-bit float. That is 70 × 109 × 2 bytes = 140 GB just for the weights. A high-end consumer GPU has 24 GB of VRAM. You cannot even load the model, let alone run it.

The standard fix is quantization — replace those 16-bit floats with smaller integers. At 4 bits per parameter, the model shrinks to 35 GB. At 3 bits, 26 GB. At 2 bits, just 17.5 GB — it fits on a single GPU with room to spare for the KV cache.

But there is a catch. Existing methods like GPTQ and AWQ work well at 4 bits. Push them to 3 bits and accuracy drops noticeably. At 2 bits, the model is nearly useless — perplexity on Llama 2 7B jumps from 5.12 (FP16 baseline) to double digits. The weights lose too much information when you round each one independently to a tiny grid.

The core tension: Memory bandwidth is the bottleneck for LLM inference on GPUs. Tokens are generated one at a time; each token requires loading the entire model from memory. Fewer bits per parameter = faster token generation. But below 4 bits, naive rounding destroys accuracy. We need a smarter compression scheme.

The insight behind AQLM: stop thinking about individual weights. Instead, group weights together and encode each group as a sum of learned codebook vectors. A group of 8 weights is no longer 8 independent rounded numbers — it is a combination of carefully chosen vectors from a shared dictionary. This is Multi-Codebook Quantization (MCQ), a technique from information retrieval that the authors adapt for LLM compression.

The result? At 2 bits per parameter, AQLM matches or beats the accuracy that GPTQ and AWQ achieve at 3 bits. At 3 bits, it nearly matches FP16. The model is smaller and more accurate than anything before it at extreme compression levels.

Memory Wall

Drag the bit-width slider to see how model size changes for a 70B-parameter model. The red line marks 24 GB — a typical consumer GPU's VRAM.

Bits per parameter 16.0
Why do methods like GPTQ fail at 2 bits per parameter?

Chapter 1: Vector Quantization Primer

Before we can understand AQLM's multi-codebook trick, we need to understand the simpler version: vector quantization (VQ). The idea is ancient — it dates back to the 1980s in signal compression — but it is the foundation of everything that follows.

Instead of quantizing each weight independently (scalar quantization), VQ groups weights into vectors and quantizes the entire vector at once. Think of it this way: you have a codebook — a dictionary of K prototype vectors. To compress a weight vector, you find the closest codebook entry and store only its index.

Why vectors beat scalars: Imagine quantizing the pair (0.31, 0.29) to 1 bit each. Scalar quantization gives you only {(0,0), (0,1), (1,0), (1,1)} — four rigid options. But a 2-bit vector codebook can learn four arbitrary 2D points. One of those points could be (0.30, 0.30) — a near-perfect match. The codebook adapts to the actual distribution of weight vectors.

From K-Means to Codebooks

Training a VQ codebook is just K-means clustering. You gather all weight vectors from the model, run K-means with K clusters, and the cluster centroids become your codebook entries. Each weight vector is then assigned to its nearest centroid.

Storage: if your codebook has K = 2B entries (B-bit codes), each vector of g weights costs B bits to store (just the index). So the bits per parameter is B / g. With B = 8 and g = 8, that is 1 bit per parameter — extreme compression.

The Storage Arithmetic

Let us make the VQ savings concrete. A weight matrix in Llama 2 7B's MLP is 4096 × 11008. In FP16, that is 4096 × 11008 × 2 = 90.2 MB. With VQ using g = 8, B = 8 (256-entry codebook):

That is 1 bit per parameter. But the accuracy is terrible — 256 prototype vectors cannot capture the diversity of millions of weight groups. We need more expressiveness without more bits.

Product Quantization (PQ)

Pure VQ has a problem: if you want fine-grained quantization, you need a huge codebook. With g = 8 dimensions and B = 16 bits, the codebook has 65,536 entries of 8 floats each — that is 2 MB per codebook. And you need one per layer.

Product Quantization (PQ) fixes this by splitting each vector into sub-vectors and using a separate, smaller codebook for each sub-vector. A vector of 8 weights is split into 2 sub-vectors of 4, each encoded with its own 8-bit codebook (256 entries). Total: 16 bits for 8 weights = 2 bits per parameter, but the codebooks are tiny.

PQ vocabulary: The sub-codebooks are often called sub-quantizers. Using M sub-quantizers with B-bit codes on groups of g weights gives B · M / g bits per parameter. This is the formula we will use throughout AQLM.
Scalar Quantization
Round each weight individually. Simple, fast, but high error at low bit-widths. Used by GPTQ, AWQ.
↓ group weights
Vector Quantization (VQ)
Replace each weight group with closest codebook entry. Better accuracy, but codebooks get huge.
↓ split into sub-vectors
Product Quantization (PQ)
Independent codebook per sub-vector. Compact codebooks, but sub-vectors are quantized independently.
↓ sum instead of concatenate
Additive Quantization (AQ)
Multiple codebooks, entries summed (not concatenated). Each codebook refines the full vector. This is AQLM.
What is the key difference between product quantization and additive quantization?

Chapter 2: The Additive Quantization Framework

Now the main idea. Take a weight matrix W ∈ Rdout × din from a linear layer. Split its rows into groups of g consecutive weights. Each group is a vector w ∈ Rg. We want to approximate w using M codebooks C1, ..., CM, each containing 2B vectors in Rg.

The approximation is a sum:

ŵ = ∑m=1M Cm[bm]

where bm ∈ {0, 1, ..., 2B − 1} is the index into the m-th codebook. Each codebook contributes one vector, and they are added together to reconstruct the weight group. This is Additive Quantization (AQ).

The power of addition: With M = 2 codebooks of 28 = 256 entries each, scalar PQ can represent 256 × 256 = 65,536 distinct sub-vector pairs (but only 256 values per sub-dimension). AQ can represent the same 65,536 combinations, but each combination covers the full g-dimensional space. Every codebook entry is a full g-dimensional vector, and their sum spans a much richer set of approximations.

The Optimization Objective

AQLM is not just about finding the nearest codebook entry. It is input-aware. Given a calibration dataset with input activations X, the objective is:

arg minC, b || WXŴX ||22

This minimizes the squared error on the outputs of the layer, not on the weights directly. Why? Because not all weights are equally important. A weight that multiplies a large activation matters far more than one that multiplies near-zero. The input matrix X (collected from calibration data) encodes this importance.

Why input-aware matters: Imagine two weights: w1 = 3.0 is multiplied by activations that are always near 0.01, and w2 = 0.5 is multiplied by activations near 10.0. Quantizing w2 poorly causes 20× more output error than quantizing w1 poorly. The MSE objective on WX naturally upweights w2.

Expanding the Objective

Substituting the additive representation into the objective and expanding:

|| WX − ∑m CmbmX ||2 = || WX ||2 − 2 ∑mW, CmbmXXT + ∑i,j ⟨Cibi, CjbjXXT

where ⟨A, BXXT = ⟨AXXT, BF is the Frobenius inner product weighted by the input correlation matrix. The key insight: XXT can be pre-computed once from the calibration data. This makes the objective efficient to evaluate — no need to pass activations through the model repeatedly.

Why Not Just a Bigger Codebook?

A natural question: why use M = 2 codebooks of 256 entries each (total 512 vectors to store) when you could use 1 codebook of 65,536 entries? Both use the same number of bits per weight group (16 bits). The answer is threefold:

In practice, AQLM finds that the 1×16 configuration (single large codebook) gives slightly better accuracy, while 2×8 (two small codebooks) gives faster inference. The paper reports both.

Representation Cost

Each weight group stores M indices of B bits each. Additionally, AQLM learns a per-output-unit scale factor s ∈ R (stored in FP16). The reconstructed weight is:

ŵ = s · ∑m=1M Cm[bm]

The scale factors add a negligible cost (one 16-bit float per row of the weight matrix) but help compensate for the limited codebook expressiveness.

Concrete Data Flow

Let us trace the full storage layout for one linear layer of Llama 2 7B. The q_proj weight matrix is 4096 × 4096.

ComponentShapeStorage
Code indices (M=2, B=8)4096 × (4096/8) × 24,194,304 bytes (4 MB)
Codebook 1256 × 8 floats × FP164,096 bytes (4 KB)
Codebook 2256 × 8 floats × FP164,096 bytes (4 KB)
Scale factors4096 × FP168,192 bytes (8 KB)
Total~4.02 MB (vs 33.6 MB in FP16)

The codebooks are shared across all 512 weight groups per output dimension. The indices dominate storage; the codebooks and scales are negligible. Compression ratio: 8.4× (equivalent to ~1.9 bits per parameter for this layer, before accounting for the codebook overhead).

Why does AQLM minimize || WX − ŴX ||2 instead of || W − Ŵ ||2?

Chapter 3: Beam Search for Code Assignment

AQLM has three phases: (1) find the best codebook indices b for each weight group, (2) optimize the codebook entries C, (3) fine-tune at the block level. Phase 1 is the hardest — it is a discrete combinatorial problem.

For a single weight group, we need to choose M indices (one per codebook), each from 2B options. The brute-force search over all (2B)M combinations is intractable. With B = 8 and M = 2, that is 65,536 combinations per group — feasible. But with M = 4, it is 4.3 billion. We need a smarter search.

The Beam Search Algorithm

AQLM uses beam search, adapted from the additive quantization literature (Babenko & Lempitsky, 2014). The algorithm maintains a beam of k best partial assignments and extends them one codebook at a time:

Step 1: Initialize
For codebook C1, try all 2B entries. Score each by the MSE objective (Eq. 7 from the paper). Keep the k best single-code assignments.
Step 2: Extend
For each of the k beams, try all 2B entries from C2. That gives k · 2B candidates. Score each (the loss is additive — only the changed component needs recomputing). Keep the k best two-code assignments.
↓ repeat for C3, ..., CM
Step 3: Select
After processing all M codebooks, the top beam gives the best M-tuple of indices for this weight group.
Efficient scoring trick: The loss function decomposes into pairwise terms ⟨Cibi, CjbjXXT. When extending a beam by changing only one code, most pairwise terms stay the same. You only recompute the terms involving the new codebook. Pre-computing XXT (once per layer) makes each score evaluation a few dot products instead of a full matrix multiplication.

Worked Example

Suppose g = 2 (2D weight groups for simplicity), M = 2 codebooks, each with 4 entries (B = 2 bits). We want to approximate w = [0.7, −0.3].

Codebook C1 has entries: {[0.5, 0.0], [0.0, −0.5], [−0.3, 0.4], [0.8, 0.2]}. Codebook C2 has entries: {[0.1, −0.2], [0.3, 0.1], [−0.1, −0.4], [0.2, −0.1]}.

Step 1: Score all 4 entries from C1 against w (using the MSE objective). Suppose the scores rank: entry 3 ([0.8, 0.2]) best, entry 0 ([0.5, 0.0]) second. With beam width k = 2, we keep both.

Step 2: For each beam, try all 4 entries from C2. That gives 2 × 4 = 8 candidates:

BeamC1C2SumError ||w − sum||2
1[0.8, 0.2][−0.1, −0.4][0.7, −0.2]0.01
2[0.5, 0.0][0.2, −0.1][0.7, −0.1]0.04
3[0.8, 0.2][0.2, −0.1][1.0, 0.1]0.25
... 5 more candidates ...

The winner: C1[3] + C2[2] = [0.8, 0.2] + [−0.1, −0.4] = [0.7, −0.2]. Error = 0.01. The target w = [0.7, −0.3] is nearly recovered by summing two codebook vectors.

Note the synergy: Neither codebook entry alone is close to w. C1[3] = [0.8, 0.2] has the wrong sign on the second dimension. C2[2] = [−0.1, −0.4] looks nothing like w. But their sum [0.7, −0.2] is an excellent approximation. This is the magic of additive quantization — the codebooks collaborate.

Why Beam Search, Not Greedy?

A greedy approach picks the single best entry from C1, then the single best from C2 given that choice, and so on. This is beam search with k = 1. The problem: early choices constrain later ones. If the best C1 entry points the approximation slightly wrong, C2 may not be able to fix it. Beam width k lets us maintain k alternative "hypotheses" and recover from suboptimal early decisions.

In practice, AQLM uses a beam width of k = 1 for the initial assignment (since codebooks are subsequently optimized anyway). The beam search runs over all dout output units in parallel — each output dimension is independent in the loss function, so this is a massively parallel operation on GPU.

Complexity

For each weight group: beam search considers k · 2B candidates per codebook step, and there are M steps. Total candidates evaluated: M · k · 2B. Compare to brute force: (2B)M = 2BM. With B = 8, M = 2, k = 1: beam search evaluates 512 candidates. Brute force would check 65,536. With M = 4: beam search evaluates 1024; brute force would check 4.3 × 109.

MRF connection: The objective is equivalent to MAP inference in a Markov Random Field, with unary potentials ⟨W, CmbmXXT and pairwise potentials ⟨Cibi, CjbjXXT. While finding the exact optimum is NP-hard in general, beam search and ICM (Iterated Conditional Modes) give good approximations. AQLM chose beam search because it is easier to implement efficiently in PyTorch/JAX.
Why is beam search preferred over brute-force enumeration for finding codebook indices?

Chapter 4: Codebook Tuning

Phase 1 (beam search) found the best indices assuming fixed codebooks. Phase 2 flips the problem: fix the indices, optimize the codebook entries.

With the codes b fixed, the objective becomes a continuous optimization over the codebook vectors C1, ..., CM and the scale factors s:

minC, s || WXŴX ||2 = || (W)X ||2 = ⟨W, (W)⟩XXT

where Ŵ = diag(s) · [∑m Cmbm] is the reconstructed weight matrix. This is differentiable with respect to Cm and s, so we can use standard gradient-based optimization.

Computing the Gradient

The loss is quadratic in the codebook entries. Let us trace the gradient for a single codebook entry c = Cm[k] that is used by some set of weight groups Sk (all groups where bm = k). The gradient is:

∂L / ∂c = −2 ∑groups in Sk (w − ŵ)XXT

This is a weighted average of the residual errors across all weight groups that share this codebook entry, weighted by the input correlation matrix XXT. The gradient is cheap to compute because XXT is pre-computed and shared, and each weight group contributes one additive term.

Adam Optimizer

AQLM uses Adam with learning rate 10−4 and full-batch gradient descent on the calibration data (128 samples). Each codebook update phase runs for 100 steps. The authors found the results are robust to these hyperparameters — varying the learning rate or number of steps does not significantly change the outcome.

Why not closed-form? In classic AQ (without the scale factors s and the XXT weighting), the optimal codebooks can be found by solving a linear system. But the presence of XXT couples all codebook dimensions through the input correlation. Each codebook entry's optimal value depends on every other entry. Adam handles this coupling gracefully without deriving the complex Hessian.

Scale Factor Initialization

The per-output-unit scale si is initialized to the L2 norm of the i-th row of W: si = ||Wi||2. This gives each row a learned magnitude while the codebooks handle the direction. The scales are updated alongside the codebooks via the same Adam optimizer.

Alternating Optimization

Phases 1 and 2 alternate: update codes (beam search) → update codebooks (Adam) → update codes → update codebooks → ... until the loss stops improving (or for a fixed number of rounds). Each round refines both the discrete assignments and the continuous codebook entries.

Initialization: Residual K-Means

Good initialization is critical. AQLM uses residual K-means (Chen et al., 2010):

  1. Run K-means on all weight groups to get C1. Assign each group to its nearest centroid.
  2. Compute the residual: r = w − C1[b1] (the part of each weight group not captured by the first codebook).
  3. Run K-means on the residuals to get C2. This codebook specializes in what C1 missed.
  4. Compute the new residual: r' = w − C1[b1] − C2[b2]. Repeat for C3, etc.

This is equivalent to Residual Quantization (RQ) and gives a reasonable starting point. The subsequent alternating optimization (beam search + Adam) then jointly refines all codebooks, escaping the limitations of the sequential residual approach.

Initialize
Residual K-means: cluster weight groups for C1, cluster residuals for C2, etc. This gives a good starting point for joint optimization.
Phase 1: Beam Search
Fix Cm and s. Find best indices bm for each weight group via beam search over the MSE objective.
Phase 2: Codebook + Scale Update
Fix indices bm. Run 100 steps of Adam on Cm and s to minimize || WXŴX ||2.
↻ repeat until convergence
What role do the per-output scale factors s play in AQLM?

Chapter 5: Block-Level Fine-Tuning

Phases 1 and 2 compress each layer independently. But in a real transformer, quantization errors in one layer's output become the input to the next layer. Errors compound. At 2 bits, this compounding is devastating.

Phase 3 of AQLM fixes this with block-level fine-tuning. Instead of optimizing one layer at a time, it optimizes an entire transformer block (self-attention + MLP, typically 4–8 linear layers) jointly.

The Procedure

After quantizing all layers within a block using phases 1 and 2, AQLM:

  1. Records the original block output Yblock = block(Xblock) using FP16 weights
  2. Replaces the block's weights with their quantized versions
  3. Defines the fine-tuning loss: || Yblock − block̂(Xblock) ||2
  4. Optimizes the codebook entries Cm, scale factors s, and all non-quantized parameters (RMSNorm scales and biases) via Adam, backpropagating through the entire quantized block
  5. Keeps the codes bm fixed — only continuous parameters are tuned
Why this works: The codebook entries are continuous (FP16) and differentiable. Even though the weight reconstruction is a discrete lookup (select code bm from Cm), the gradient flows through the lookup because bm is fixed — changing Cm[bm] is just changing a regular parameter that happens to be shared across all weight groups that selected index bm. This is the same trick as in VQ-VAE: the straight-through estimator is not needed because we are optimizing the codebook, not the codes.

Practical Details

Fine-tuning uses the PyTorch autograd engine. The calibration data is the same 128 samples from RedPajama (sequence length 4096) used in phases 1 and 2. The optimizer is Adam with a schedule. Fine-tuning the transformer blocks takes only 10–30% of the total calibration time — it converges quickly because it starts from a good initialization (the per-layer quantization from phases 1–2).

The algorithm processes the model block by block, from the first transformer layer to the last. After fine-tuning block k, it saves the updated codebooks and moves to block k+1, using the quantized output of block k as input. This sequential processing means each block adapts to the actual (quantized) inputs it will see at inference time.

VRAM efficiency: The fine-tuning modifies only the codebook entries and scale factors — a tiny fraction of the total parameters. Optimizer states (Adam momenta) are therefore small. This makes it possible to fine-tune even billion-parameter models on a single GPU in reasonable time.

Algorithm 1: Full AQLM Pipeline

pseudocode
Input: model, calibration data (128 samples)
for layer = 1 to num_layers:
  W = layer.weight                # original FP16 weights
  X = layer_inputs(calibration)    # collect activations
  C, s = initialize(W)             # K-means + residual K-means

  while loss improves:
    C, s = adam_update(XX^T, W, C, b, s) # Phase 2: codebook tune
    b = beam_search(XX^T, W, C, b, s) # Phase 1: code update

  layer.weight = AQLMFormat(C, b, s)

for block = 1 to num_blocks:         # Phase 3: block fine-tune
  theta = trainable_params(block)   # codebooks + scales + norms
  while loss improves:
    L = ||block(X) - Y_orig||^2
    theta = adam(theta, grad(L))
Why does AQLM fine-tune at the transformer block level instead of layer-by-layer?

Chapter 6: The Bits Arithmetic

Let us work through the exact compression math. This is where AQLM's configurations map to concrete bits-per-parameter numbers.

The Formula

For a weight group of g parameters encoded with M codebooks of 2B entries each, the cost per group is:

bits per group = M · B   (for the M codebook indices)
bits per parameter = M · B / g

Plus there are two small overheads: (1) the codebook entries themselves, and (2) the per-output scale factors. For large models, these are negligible compared to the indices.

AQLM Configurations

TargetMBgCodebook sizeAvg bits
2-bit1168216 = 65,536~2.02
2-bit (alt)28828 = 256 each~2.02
3-bit1168216 = 65,536~3.01
4-bit1168216 = 65,536~4.01
Worked example — 2-bit with 1×16: Group size g = 8. One codebook with 216 = 65,536 entries. Each group stores one 16-bit index. That is 16 bits / 8 params = 2.0 bits per parameter. Add the scale factor: one FP16 per output unit, shared across all groups in that row. For a 4096×4096 matrix with 4096 output units, that is 4096 × 16 bits = 65,536 bits of overhead. Total parameter storage: 4096 × (4096/8) × 16 = 33.6M bits for indices + 0.066M bits for scales. Average: ~2.02 bits per parameter.
Worked example — 2-bit with 2×8: Same g = 8, but two codebooks of 256 entries each. Each group stores two 8-bit indices = 16 bits. Same total: 16/8 = 2.0 bits per parameter. But now the codebook is tiny (256 entries × 8 floats = 4 KB per codebook) while 1×16 needs a 65,536-entry codebook (1 MB). The tradeoff: 2×8 has smaller codebooks but the additive structure is less expressive than a single large codebook.

What Does Not Count

Following standard practice, the "bits per parameter" metric excludes parameters that are kept in full precision: the embedding layer and the LM head. These are a small fraction of total parameters for large models (e.g., 1.3% for Llama 2 70B). All benchmarks compare methods at the same average bitwidth, so this is a fair comparison.

Compression Calculator

Adjust M (codebooks), B (bits per code), and g (group size) to see the resulting bits per parameter and model size for a 70B model.

Codebooks (M) 2
Bits per code (B) 8
Group size (g) 8
A configuration uses M=2 codebooks, B=8 bits per code, and groups of g=8 weights. What is the bits-per-parameter cost of the indices alone?

Chapter 7: Codebook Lookup Visualization

Let us see the entire AQLM dequantization process in action. This is what happens during inference when the model needs to reconstruct a weight group to compute a matrix-vector product.

The Dequantization Pipeline

1. Read Indices
Load the M compressed code indices (b1, b2, ..., bM) for this weight group from memory. Each is B bits.
2. Codebook Lookup
Use each index to fetch a g-dimensional vector from the corresponding codebook: C1[b1], C2[b2], ..., CM[bM].
3. Sum
Add the M vectors element-wise: ŵ = C1[b1] + C2[b2] + ... + CM[bM].
4. Scale
Multiply by the per-output scale: ŵ → s · ŵ.
5. Dot Product
Compute the dot product ŵ · x for the corresponding input activation group. Accumulate across all groups for this output unit.
GPU kernel design: AQLM implements efficient CUDA/Triton kernels that fuse steps 2–5. Instead of materializing the full dequantized weight matrix (which would defeat the memory savings), the kernel loads compressed codes, looks up codebook entries on the fly, and accumulates the dot product directly. Codebooks are small enough to fit in GPU shared memory or L1 cache, so lookups are fast. The bottleneck shifts from memory bandwidth (loading the full weight matrix) to compute (the lookups and additions).

GPU Kernel Architecture

The kernel operates in two phases per thread block:

  1. Load codebooks into shared memory. For the 2×8-bit configuration (M = 2 codebooks, 256 entries each, g = 8 floats per entry), each codebook is 256 × 8 × 2 bytes = 4 KB in FP16. Two codebooks fit in 8 KB — well within the 48–164 KB of shared memory available on modern GPUs.
  2. Stream codes + accumulate. Each thread loads a code pair (b1, b2), looks up both codebook entries in shared memory (no global memory access), sums them, multiplies by the scale factor, computes the dot product with the input activation group, and atomically accumulates into the output.

The key insight: codebook lookups from shared memory cost ~5 cycles, while global memory loads cost 200–400 cycles. Since the codebooks are reused across all weight groups in a column, the shared memory approach amortizes the load cost across thousands of groups.

16-bit codebook variant: For the 1×16 configuration (single codebook with 65,536 entries), the codebook is 65,536 × 8 × 2 = 1 MB — too large for shared memory. Here, AQLM replaces the 16-bit codebook with a pair of 8-bit codebooks (2×8) that approximate the same expressiveness at lower memory cost. The inference speed is comparable, and the authors found that accuracy is similar.

Inner Product Trick

For the 1×16 configuration with a single large codebook, there is an additional speedup. The dot product between a weight group and the input can be rewritten as:

ŵ · x = C[b] · x

You can pre-compute the dot products of all codebook entries with the current input: dp[k] = C[k] · x for k = 0, ..., 216 − 1. Then reconstruction + dot product becomes a single table lookup: dp[b]. This turns the computation from g multiplications and additions per group into a single indexed load. However, with 65K entries, the lookup table is large (256 KB per group of 8 input dimensions), so this trick is mainly useful for CPU inference where caches are larger.

Additive Codebook Lookup

Click Step to walk through the dequantization of one weight group. Each codebook contributes a vector; they are summed and scaled to reconstruct the weights. Click Randomize to pick new code indices.

Codebooks (M) 2

Inference Speed

PlatformModelFP BaselineAQLM 2-bitSpeedup
RTX 3090Llama 2 7B129 μs99 μs (1×16)1.31×
Llama 2 13B190 μs158 μs1.20×
Llama 2 70B578 μs190 μs (2×8)3.05×
CPU i9Llama 2 7B1.83 ms0.67 ms2.75×
Llama 2 13B3.12 ms0.88 ms3.54×
Llama 2 70B11.31 ms2.81 ms4.03×

The GPU speedups are modest for 7B (the model already fits in FP16) but dramatic for 70B, where the FP16 model requires multi-GPU while the 2-bit model fits on a single card. On CPU, the inner product pre-computation trick delivers consistent 3–4× speedups.

Why 70B benefits most: For the 7B model in FP16, the weight matrix is only ~14 GB — it fits in GPU memory with bandwidth to spare. The bottleneck is compute, not memory. AQLM's codebook lookups add compute overhead that partially offsets the memory savings. For 70B, the FP16 model exceeds single-GPU memory entirely. AQLM's 2-bit version fits in ~18 GB, enabling single-GPU inference where FP16 cannot. The speedup is transformative, not incremental.
How does AQLM avoid materializing the full dequantized weight matrix during inference?

Chapter 8: Results

The moment of truth. How does AQLM compare to GPTQ, AWQ, SpQR, and QuIP# on actual language model benchmarks?

2-Bit Regime (Llama 2)

This is where AQLM shines brightest. All methods are compressing to approximately 2 bits per parameter. Perplexity is measured on WikiText-2 (lower is better).

ModelMethodAvg bitsWiki2 PPL ↓C4 PPL ↓Avg Accuracy ↑
7BAQLM2.026.598.5457.28
QuIP#2.028.2211.0152.23
FP16165.126.6362.35
13BAQLM1.975.607.4961.32
QuIP#2.016.068.0757.55
FP16164.576.0565.38
70BAQLM2.073.945.7268.75
QuIP#2.014.166.0167.67
FP16163.124.9770.17
The headline number: AQLM 2-bit on Llama 2 70B achieves 3.94 perplexity on WikiText-2 — only 0.82 above the FP16 baseline. QuIP# (the previous state-of-the-art) at the same bitwidth gets 4.16. GPTQ and SpQR are not shown because they effectively break at 2 bits (perplexity above 10 for 7B/13B).

3-Bit Regime (Llama 2)

ModelMethodAvg bitsWiki2 PPL ↓C4 PPL ↓Avg Accuracy ↑
7BAQLM3.045.467.0860.88
GPTQ3.008.0610.6153.08
SpQR2.986.208.2059.07
FP16165.126.6362.35
70BAQLM3.013.365.1769.86
GPTQ3.004.406.2665.41
SpQR2.983.855.6368.22
FP16163.124.9770.17
The 3-bit story: At 3 bits, AQLM on Llama 2 70B achieves 3.36 perplexity — only 0.24 above the FP16 baseline of 3.12. This means you can compress a 140 GB model to ~26 GB and lose almost nothing. GPTQ at the same bitwidth gets 4.40 — over a full perplexity point worse.

Calibration Data

All results above use a tiny calibration set: 128 sequences of length 4096 tokens from the RedPajama dataset. That is roughly 524K tokens — a vanishingly small fraction of the pre-training data. The authors found that:

Pareto optimality: The authors establish that AQLM is Pareto optimal in the accuracy-vs-model-size tradeoff below 3 bits. This means there is no known method that achieves both better accuracy AND smaller size at any point in the 2–3 bit range. Above 3.5 bits, simpler methods like GPTQ close the gap because scalar rounding works well when each weight has enough bits.

End-to-End Fine-Tuning Results

After the paper's initial release, the authors added end-to-end fine-tuning (following QuIP#'s approach): fine-tune the entire model to minimize KL divergence, not just per-block MSE. This further improves results:

ModelMethodAvg bitsWiki2 PPL ↓Avg Accuracy ↑
7BAQLM*2.026.1458.27
QuIP#*2.026.1958.48
70BAQLM*2.073.8368.20
QuIP#*2.013.9168.28

With end-to-end fine-tuning (marked *), AQLM and QuIP# reach near-parity at 2-bit. Both erase the "2-bit is useless" narrative. The 2-bit 70B model achieves 3.83 perplexity — closer to FP16's 3.12 than GPTQ's 3-bit result of 4.40.

Perplexity vs. Bits

Compare AQLM (orange), QuIP# (teal), GPTQ (blue), and SpQR (purple) on Llama 2 70B. Lower is better. The dashed line shows FP16 baseline.

Model Size 70B
At 2 bits per parameter on Llama 2 70B, what is the perplexity gap between AQLM and the FP16 baseline?

Chapter 9: Connections

AQLM sits at the intersection of information retrieval (additive quantization) and LLM compression (post-training quantization). Here is how it connects to the broader landscape.

Method Comparison

MethodApproachSweet spotWeakness
GPTQScalar, row-by-row optimal rounding using Hessian4 bitsFalls apart below 3 bits
AWQScalar, activation-aware channel scaling4 bitsSame: poor below 3 bits
SpQRMixed precision: keep outlier weights in FP163–4 bitsIrregular memory access from mixed formats
QuIP#Incoherence processing + lattice codebooks2–4 bitsComplex, requires Hadamard transforms
AQLMMulti-codebook additive quantization + fine-tuning2–3 bitsSlower to calibrate, more complex kernels
Residual Quantization (RQ) vs. Additive Quantization (AQ): RQ quantizes the residual error sequentially: encode w with C1, then encode the error w − C1[b1] with C2, and so on. AQ jointly optimizes all codebooks — C2 is not constrained to fix C1's error. In theory, AQ with joint optimization always does at least as well as RQ. In practice, AQLM's beam search + Adam optimization is what makes AQ tractable for LLM-scale problems.

Key Takeaways

Limitations

Using AQLM in Practice

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-quantized AQLM model from HuggingFace
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"    # fits on a single 24GB GPU
)

# Generate text -- the dequantization happens inside the kernel
inputs = tokenizer("The key insight of additive quantization is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

What Came Next

After AQLM, the authors and others pushed further:

The open-source AQLM codebase at github.com/Vahe1994/AQLM provides quantization scripts and pre-quantized model weights for Llama 2, Mistral, and other model families.

The big picture: AQLM showed that the techniques information retrieval researchers developed for compressing embeddings in the 2010s — additive quantization, beam search, codebook fine-tuning — translate directly to LLM weight compression. The insight is general: any time you need to compress high-dimensional vectors with minimal distortion, multi-codebook methods deserve a look.

"The question of optimal data compression is really a question of understanding." — Gregory Chaitin

What is the fundamental advantage of additive quantization over residual quantization for LLM compression?