Multi-codebook quantization pushes LLMs to 2 bits per parameter with less accuracy loss than anyone thought possible. The trick: represent each weight group as a sum of codebook vectors, then jointly fine-tune everything.
You have a Llama 2 70B model. It has 70 billion parameters, each stored as a 16-bit float. That is 70 × 109 × 2 bytes = 140 GB just for the weights. A high-end consumer GPU has 24 GB of VRAM. You cannot even load the model, let alone run it.
The standard fix is quantization — replace those 16-bit floats with smaller integers. At 4 bits per parameter, the model shrinks to 35 GB. At 3 bits, 26 GB. At 2 bits, just 17.5 GB — it fits on a single GPU with room to spare for the KV cache.
But there is a catch. Existing methods like GPTQ and AWQ work well at 4 bits. Push them to 3 bits and accuracy drops noticeably. At 2 bits, the model is nearly useless — perplexity on Llama 2 7B jumps from 5.12 (FP16 baseline) to double digits. The weights lose too much information when you round each one independently to a tiny grid.
The insight behind AQLM: stop thinking about individual weights. Instead, group weights together and encode each group as a sum of learned codebook vectors. A group of 8 weights is no longer 8 independent rounded numbers — it is a combination of carefully chosen vectors from a shared dictionary. This is Multi-Codebook Quantization (MCQ), a technique from information retrieval that the authors adapt for LLM compression.
The result? At 2 bits per parameter, AQLM matches or beats the accuracy that GPTQ and AWQ achieve at 3 bits. At 3 bits, it nearly matches FP16. The model is smaller and more accurate than anything before it at extreme compression levels.
Drag the bit-width slider to see how model size changes for a 70B-parameter model. The red line marks 24 GB — a typical consumer GPU's VRAM.
Before we can understand AQLM's multi-codebook trick, we need to understand the simpler version: vector quantization (VQ). The idea is ancient — it dates back to the 1980s in signal compression — but it is the foundation of everything that follows.
Instead of quantizing each weight independently (scalar quantization), VQ groups weights into vectors and quantizes the entire vector at once. Think of it this way: you have a codebook — a dictionary of K prototype vectors. To compress a weight vector, you find the closest codebook entry and store only its index.
Training a VQ codebook is just K-means clustering. You gather all weight vectors from the model, run K-means with K clusters, and the cluster centroids become your codebook entries. Each weight vector is then assigned to its nearest centroid.
Storage: if your codebook has K = 2B entries (B-bit codes), each vector of g weights costs B bits to store (just the index). So the bits per parameter is B / g. With B = 8 and g = 8, that is 1 bit per parameter — extreme compression.
Let us make the VQ savings concrete. A weight matrix in Llama 2 7B's MLP is 4096 × 11008. In FP16, that is 4096 × 11008 × 2 = 90.2 MB. With VQ using g = 8, B = 8 (256-entry codebook):
That is 1 bit per parameter. But the accuracy is terrible — 256 prototype vectors cannot capture the diversity of millions of weight groups. We need more expressiveness without more bits.
Pure VQ has a problem: if you want fine-grained quantization, you need a huge codebook. With g = 8 dimensions and B = 16 bits, the codebook has 65,536 entries of 8 floats each — that is 2 MB per codebook. And you need one per layer.
Product Quantization (PQ) fixes this by splitting each vector into sub-vectors and using a separate, smaller codebook for each sub-vector. A vector of 8 weights is split into 2 sub-vectors of 4, each encoded with its own 8-bit codebook (256 entries). Total: 16 bits for 8 weights = 2 bits per parameter, but the codebooks are tiny.
Now the main idea. Take a weight matrix W ∈ Rdout × din from a linear layer. Split its rows into groups of g consecutive weights. Each group is a vector w ∈ Rg. We want to approximate w using M codebooks C1, ..., CM, each containing 2B vectors in Rg.
The approximation is a sum:
where bm ∈ {0, 1, ..., 2B − 1} is the index into the m-th codebook. Each codebook contributes one vector, and they are added together to reconstruct the weight group. This is Additive Quantization (AQ).
AQLM is not just about finding the nearest codebook entry. It is input-aware. Given a calibration dataset with input activations X, the objective is:
This minimizes the squared error on the outputs of the layer, not on the weights directly. Why? Because not all weights are equally important. A weight that multiplies a large activation matters far more than one that multiplies near-zero. The input matrix X (collected from calibration data) encodes this importance.
Substituting the additive representation into the objective and expanding:
where ⟨A, B⟩XXT = ⟨AXXT, B⟩F is the Frobenius inner product weighted by the input correlation matrix. The key insight: XXT can be pre-computed once from the calibration data. This makes the objective efficient to evaluate — no need to pass activations through the model repeatedly.
A natural question: why use M = 2 codebooks of 256 entries each (total 512 vectors to store) when you could use 1 codebook of 65,536 entries? Both use the same number of bits per weight group (16 bits). The answer is threefold:
In practice, AQLM finds that the 1×16 configuration (single large codebook) gives slightly better accuracy, while 2×8 (two small codebooks) gives faster inference. The paper reports both.
Each weight group stores M indices of B bits each. Additionally, AQLM learns a per-output-unit scale factor s ∈ R (stored in FP16). The reconstructed weight is:
The scale factors add a negligible cost (one 16-bit float per row of the weight matrix) but help compensate for the limited codebook expressiveness.
Let us trace the full storage layout for one linear layer of Llama 2 7B. The q_proj weight matrix is 4096 × 4096.
| Component | Shape | Storage |
|---|---|---|
| Code indices (M=2, B=8) | 4096 × (4096/8) × 2 | 4,194,304 bytes (4 MB) |
| Codebook 1 | 256 × 8 floats × FP16 | 4,096 bytes (4 KB) |
| Codebook 2 | 256 × 8 floats × FP16 | 4,096 bytes (4 KB) |
| Scale factors | 4096 × FP16 | 8,192 bytes (8 KB) |
| Total | ~4.02 MB (vs 33.6 MB in FP16) |
The codebooks are shared across all 512 weight groups per output dimension. The indices dominate storage; the codebooks and scales are negligible. Compression ratio: 8.4× (equivalent to ~1.9 bits per parameter for this layer, before accounting for the codebook overhead).
AQLM has three phases: (1) find the best codebook indices b for each weight group, (2) optimize the codebook entries C, (3) fine-tune at the block level. Phase 1 is the hardest — it is a discrete combinatorial problem.
For a single weight group, we need to choose M indices (one per codebook), each from 2B options. The brute-force search over all (2B)M combinations is intractable. With B = 8 and M = 2, that is 65,536 combinations per group — feasible. But with M = 4, it is 4.3 billion. We need a smarter search.
AQLM uses beam search, adapted from the additive quantization literature (Babenko & Lempitsky, 2014). The algorithm maintains a beam of k best partial assignments and extends them one codebook at a time:
Suppose g = 2 (2D weight groups for simplicity), M = 2 codebooks, each with 4 entries (B = 2 bits). We want to approximate w = [0.7, −0.3].
Codebook C1 has entries: {[0.5, 0.0], [0.0, −0.5], [−0.3, 0.4], [0.8, 0.2]}. Codebook C2 has entries: {[0.1, −0.2], [0.3, 0.1], [−0.1, −0.4], [0.2, −0.1]}.
Step 1: Score all 4 entries from C1 against w (using the MSE objective). Suppose the scores rank: entry 3 ([0.8, 0.2]) best, entry 0 ([0.5, 0.0]) second. With beam width k = 2, we keep both.
Step 2: For each beam, try all 4 entries from C2. That gives 2 × 4 = 8 candidates:
| Beam | C1 | C2 | Sum | Error ||w − sum||2 |
|---|---|---|---|---|
| 1 | [0.8, 0.2] | [−0.1, −0.4] | [0.7, −0.2] | 0.01 |
| 2 | [0.5, 0.0] | [0.2, −0.1] | [0.7, −0.1] | 0.04 |
| 3 | [0.8, 0.2] | [0.2, −0.1] | [1.0, 0.1] | 0.25 |
| ... 5 more candidates ... | ||||
The winner: C1[3] + C2[2] = [0.8, 0.2] + [−0.1, −0.4] = [0.7, −0.2]. Error = 0.01. The target w = [0.7, −0.3] is nearly recovered by summing two codebook vectors.
A greedy approach picks the single best entry from C1, then the single best from C2 given that choice, and so on. This is beam search with k = 1. The problem: early choices constrain later ones. If the best C1 entry points the approximation slightly wrong, C2 may not be able to fix it. Beam width k lets us maintain k alternative "hypotheses" and recover from suboptimal early decisions.
In practice, AQLM uses a beam width of k = 1 for the initial assignment (since codebooks are subsequently optimized anyway). The beam search runs over all dout output units in parallel — each output dimension is independent in the loss function, so this is a massively parallel operation on GPU.
For each weight group: beam search considers k · 2B candidates per codebook step, and there are M steps. Total candidates evaluated: M · k · 2B. Compare to brute force: (2B)M = 2BM. With B = 8, M = 2, k = 1: beam search evaluates 512 candidates. Brute force would check 65,536. With M = 4: beam search evaluates 1024; brute force would check 4.3 × 109.
Phase 1 (beam search) found the best indices assuming fixed codebooks. Phase 2 flips the problem: fix the indices, optimize the codebook entries.
With the codes b fixed, the objective becomes a continuous optimization over the codebook vectors C1, ..., CM and the scale factors s:
where Ŵ = diag(s) · [∑m Cmbm] is the reconstructed weight matrix. This is differentiable with respect to Cm and s, so we can use standard gradient-based optimization.
The loss is quadratic in the codebook entries. Let us trace the gradient for a single codebook entry c = Cm[k] that is used by some set of weight groups Sk (all groups where bm = k). The gradient is:
This is a weighted average of the residual errors across all weight groups that share this codebook entry, weighted by the input correlation matrix XXT. The gradient is cheap to compute because XXT is pre-computed and shared, and each weight group contributes one additive term.
AQLM uses Adam with learning rate 10−4 and full-batch gradient descent on the calibration data (128 samples). Each codebook update phase runs for 100 steps. The authors found the results are robust to these hyperparameters — varying the learning rate or number of steps does not significantly change the outcome.
The per-output-unit scale si is initialized to the L2 norm of the i-th row of W: si = ||Wi||2. This gives each row a learned magnitude while the codebooks handle the direction. The scales are updated alongside the codebooks via the same Adam optimizer.
Phases 1 and 2 alternate: update codes (beam search) → update codebooks (Adam) → update codes → update codebooks → ... until the loss stops improving (or for a fixed number of rounds). Each round refines both the discrete assignments and the continuous codebook entries.
Good initialization is critical. AQLM uses residual K-means (Chen et al., 2010):
This is equivalent to Residual Quantization (RQ) and gives a reasonable starting point. The subsequent alternating optimization (beam search + Adam) then jointly refines all codebooks, escaping the limitations of the sequential residual approach.
Phases 1 and 2 compress each layer independently. But in a real transformer, quantization errors in one layer's output become the input to the next layer. Errors compound. At 2 bits, this compounding is devastating.
Phase 3 of AQLM fixes this with block-level fine-tuning. Instead of optimizing one layer at a time, it optimizes an entire transformer block (self-attention + MLP, typically 4–8 linear layers) jointly.
After quantizing all layers within a block using phases 1 and 2, AQLM:
Fine-tuning uses the PyTorch autograd engine. The calibration data is the same 128 samples from RedPajama (sequence length 4096) used in phases 1 and 2. The optimizer is Adam with a schedule. Fine-tuning the transformer blocks takes only 10–30% of the total calibration time — it converges quickly because it starts from a good initialization (the per-layer quantization from phases 1–2).
The algorithm processes the model block by block, from the first transformer layer to the last. After fine-tuning block k, it saves the updated codebooks and moves to block k+1, using the quantized output of block k as input. This sequential processing means each block adapts to the actual (quantized) inputs it will see at inference time.
pseudocode Input: model, calibration data (128 samples) for layer = 1 to num_layers: W = layer.weight # original FP16 weights X = layer_inputs(calibration) # collect activations C, s = initialize(W) # K-means + residual K-means while loss improves: C, s = adam_update(XX^T, W, C, b, s) # Phase 2: codebook tune b = beam_search(XX^T, W, C, b, s) # Phase 1: code update layer.weight = AQLMFormat(C, b, s) for block = 1 to num_blocks: # Phase 3: block fine-tune theta = trainable_params(block) # codebooks + scales + norms while loss improves: L = ||block(X) - Y_orig||^2 theta = adam(theta, grad(L))
Let us work through the exact compression math. This is where AQLM's configurations map to concrete bits-per-parameter numbers.
For a weight group of g parameters encoded with M codebooks of 2B entries each, the cost per group is:
Plus there are two small overheads: (1) the codebook entries themselves, and (2) the per-output scale factors. For large models, these are negligible compared to the indices.
| Target | M | B | g | Codebook size | Avg bits |
|---|---|---|---|---|---|
| 2-bit | 1 | 16 | 8 | 216 = 65,536 | ~2.02 |
| 2-bit (alt) | 2 | 8 | 8 | 28 = 256 each | ~2.02 |
| 3-bit | 1 | 16 | 8 | 216 = 65,536 | ~3.01 |
| 4-bit | 1 | 16 | 8 | 216 = 65,536 | ~4.01 |
Following standard practice, the "bits per parameter" metric excludes parameters that are kept in full precision: the embedding layer and the LM head. These are a small fraction of total parameters for large models (e.g., 1.3% for Llama 2 70B). All benchmarks compare methods at the same average bitwidth, so this is a fair comparison.
Adjust M (codebooks), B (bits per code), and g (group size) to see the resulting bits per parameter and model size for a 70B model.
Let us see the entire AQLM dequantization process in action. This is what happens during inference when the model needs to reconstruct a weight group to compute a matrix-vector product.
The kernel operates in two phases per thread block:
The key insight: codebook lookups from shared memory cost ~5 cycles, while global memory loads cost 200–400 cycles. Since the codebooks are reused across all weight groups in a column, the shared memory approach amortizes the load cost across thousands of groups.
For the 1×16 configuration with a single large codebook, there is an additional speedup. The dot product between a weight group and the input can be rewritten as:
You can pre-compute the dot products of all codebook entries with the current input: dp[k] = C[k] · x for k = 0, ..., 216 − 1. Then reconstruction + dot product becomes a single table lookup: dp[b]. This turns the computation from g multiplications and additions per group into a single indexed load. However, with 65K entries, the lookup table is large (256 KB per group of 8 input dimensions), so this trick is mainly useful for CPU inference where caches are larger.
Click Step to walk through the dequantization of one weight group. Each codebook contributes a vector; they are summed and scaled to reconstruct the weights. Click Randomize to pick new code indices.
| Platform | Model | FP Baseline | AQLM 2-bit | Speedup |
|---|---|---|---|---|
| RTX 3090 | Llama 2 7B | 129 μs | 99 μs (1×16) | 1.31× |
| Llama 2 13B | 190 μs | 158 μs | 1.20× | |
| Llama 2 70B | 578 μs | 190 μs (2×8) | 3.05× | |
| CPU i9 | Llama 2 7B | 1.83 ms | 0.67 ms | 2.75× |
| Llama 2 13B | 3.12 ms | 0.88 ms | 3.54× | |
| Llama 2 70B | 11.31 ms | 2.81 ms | 4.03× |
The GPU speedups are modest for 7B (the model already fits in FP16) but dramatic for 70B, where the FP16 model requires multi-GPU while the 2-bit model fits on a single card. On CPU, the inner product pre-computation trick delivers consistent 3–4× speedups.
The moment of truth. How does AQLM compare to GPTQ, AWQ, SpQR, and QuIP# on actual language model benchmarks?
This is where AQLM shines brightest. All methods are compressing to approximately 2 bits per parameter. Perplexity is measured on WikiText-2 (lower is better).
| Model | Method | Avg bits | Wiki2 PPL ↓ | C4 PPL ↓ | Avg Accuracy ↑ |
|---|---|---|---|---|---|
| 7B | AQLM | 2.02 | 6.59 | 8.54 | 57.28 |
| QuIP# | 2.02 | 8.22 | 11.01 | 52.23 | |
| FP16 | 16 | 5.12 | 6.63 | 62.35 | |
| 13B | AQLM | 1.97 | 5.60 | 7.49 | 61.32 |
| QuIP# | 2.01 | 6.06 | 8.07 | 57.55 | |
| FP16 | 16 | 4.57 | 6.05 | 65.38 | |
| 70B | AQLM | 2.07 | 3.94 | 5.72 | 68.75 |
| QuIP# | 2.01 | 4.16 | 6.01 | 67.67 | |
| FP16 | 16 | 3.12 | 4.97 | 70.17 |
| Model | Method | Avg bits | Wiki2 PPL ↓ | C4 PPL ↓ | Avg Accuracy ↑ |
|---|---|---|---|---|---|
| 7B | AQLM | 3.04 | 5.46 | 7.08 | 60.88 |
| GPTQ | 3.00 | 8.06 | 10.61 | 53.08 | |
| SpQR | 2.98 | 6.20 | 8.20 | 59.07 | |
| FP16 | 16 | 5.12 | 6.63 | 62.35 | |
| 70B | AQLM | 3.01 | 3.36 | 5.17 | 69.86 |
| GPTQ | 3.00 | 4.40 | 6.26 | 65.41 | |
| SpQR | 2.98 | 3.85 | 5.63 | 68.22 | |
| FP16 | 16 | 3.12 | 4.97 | 70.17 |
All results above use a tiny calibration set: 128 sequences of length 4096 tokens from the RedPajama dataset. That is roughly 524K tokens — a vanishingly small fraction of the pre-training data. The authors found that:
After the paper's initial release, the authors added end-to-end fine-tuning (following QuIP#'s approach): fine-tune the entire model to minimize KL divergence, not just per-block MSE. This further improves results:
| Model | Method | Avg bits | Wiki2 PPL ↓ | Avg Accuracy ↑ |
|---|---|---|---|---|
| 7B | AQLM* | 2.02 | 6.14 | 58.27 |
| QuIP#* | 2.02 | 6.19 | 58.48 | |
| 70B | AQLM* | 2.07 | 3.83 | 68.20 |
| QuIP#* | 2.01 | 3.91 | 68.28 |
With end-to-end fine-tuning (marked *), AQLM and QuIP# reach near-parity at 2-bit. Both erase the "2-bit is useless" narrative. The 2-bit 70B model achieves 3.83 perplexity — closer to FP16's 3.12 than GPTQ's 3-bit result of 4.40.
Compare AQLM (orange), QuIP# (teal), GPTQ (blue), and SpQR (purple) on Llama 2 70B. Lower is better. The dashed line shows FP16 baseline.
AQLM sits at the intersection of information retrieval (additive quantization) and LLM compression (post-training quantization). Here is how it connects to the broader landscape.
| Method | Approach | Sweet spot | Weakness |
|---|---|---|---|
| GPTQ | Scalar, row-by-row optimal rounding using Hessian | 4 bits | Falls apart below 3 bits |
| AWQ | Scalar, activation-aware channel scaling | 4 bits | Same: poor below 3 bits |
| SpQR | Mixed precision: keep outlier weights in FP16 | 3–4 bits | Irregular memory access from mixed formats |
| QuIP# | Incoherence processing + lattice codebooks | 2–4 bits | Complex, requires Hadamard transforms |
| AQLM | Multi-codebook additive quantization + fine-tuning | 2–3 bits | Slower to calibrate, more complex kernels |
python from transformers import AutoModelForCausalLM, AutoTokenizer # Load a pre-quantized AQLM model from HuggingFace model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained( model_id, torch_dtype="auto", device_map="auto" # fits on a single 24GB GPU ) # Generate text -- the dequantization happens inside the kernel inputs = tokenizer("The key insight of additive quantization is", return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(outputs[0]))
After AQLM, the authors and others pushed further:
AutoModelForCausalLM.from_pretrained() with automatic kernel dispatch.The open-source AQLM codebase at github.com/Vahe1994/AQLM provides quantization scripts and pre-quantized model weights for Llama 2, Mistral, and other model families.
"The question of optimal data compression is really a question of understanding." — Gregory Chaitin