Introduction
Modern large language models are absurdly large. LLaMA-3 70B has 70 billion parameters. Each parameter in FP16 occupies 2 bytes. That is 140 GB just for weights — not counting activations, KV cache, or optimizer states. An NVIDIA A100 has 80 GB of HBM2e. You cannot even load the model on a single top-of-the-line GPU.
But here is the remarkable thing: the vast majority of those 70 billion parameters do not need 16 bits of precision. Many can be stored in 4 bits — a 4x compression — with negligible quality loss. Some can survive at 2 bits. The reason is that neural network weights are not arbitrary floating-point numbers: they follow structured distributions, cluster near zero, and contain massive redundancy. Quantization exploits this structure.
This article covers the full quantization stack, from the fundamentals of number representation through the latest compression algorithms. We will work through the math, understand why certain approaches fail on certain models, build intuition with interactive visualizations, and end with practical guidance on which method to use and when.
Why Quantization Matters
The Memory Wall
GPU compute has been growing faster than GPU memory bandwidth for decades. An NVIDIA H100 SXM can perform 1,979 TFLOPS of FP8 compute but its HBM3 bandwidth is 3.35 TB/s. During autoregressive decoding — the token-by-token generation phase — each token requires reading the entire weight matrix from memory exactly once. The model is memory-bandwidth bound, not compute-bound.
The arithmetic intensity of decoding (FLOPs per byte loaded) is roughly 1 for batch size 1. With the H100's compute-to-bandwidth ratio of ~590 FLOPs per byte, the GPU spends the vast majority of its time waiting for data to arrive from HBM. The ALUs sit idle.
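To see how lopsided this is, a few lines of arithmetic suffice (H100 figures from above; batch-1 decoding does roughly 2 FLOPs per 2-byte weight read, for an arithmetic intensity near 1):

```python
# Back-of-the-envelope: is batch-1 decoding compute-bound or bandwidth-bound?
peak_flops = 1979e12    # H100 SXM dense FP8, FLOPs/s
bandwidth = 3.35e12     # H100 HBM3, bytes/s

hw_ratio = peak_flops / bandwidth   # FLOPs available per byte loaded (~591)
decode_intensity = 1.0              # ~2 FLOPs per 2-byte FP16 weight read

utilization = decode_intensity / hw_ratio
print(f"hardware ratio: {hw_ratio:.0f} FLOPs/byte")
print(f"compute utilization during decode: {utilization:.2%}")   # well under 1%
```

Any arithmetic intensity below the hardware ratio means the ALUs are starved; at intensity ~1 against a ratio of ~591, they are starved almost completely.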
Bandwidth is the Bottleneck
This is the key insight: reducing the size of weights in memory directly translates to faster inference. If you quantize from FP16 (2 bytes) to INT4 (0.5 bytes), you move 4x less data from HBM to the compute units per token. In the memory-bound regime, this means roughly 4x higher throughput during decoding. The dequantization cost (converting INT4 back to FP16 for the matmul) is negligible compared to the bandwidth savings.
Quantization also unlocks deployment on smaller hardware. A 70B model at 4-bit precision requires ~35 GB — it fits on a single 48 GB GPU (A6000, L40, RTX 6000 Ada). At around 2.5 bits, the territory of llama.cpp's smallest quant types, it can squeeze onto a 24 GB consumer GPU (RTX 4090). At FP16, you need a minimum of two A100-80GB GPUs with tensor parallelism, which is dramatically more expensive in both capital and operational costs.
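The fit/no-fit arithmetic is worth scripting once, counting weights only (KV cache and activations add more):

```python
# Weight memory for a 70B-parameter model at different precisions (weights only)
params = 70e9
fmt_bits = {"FP16": 16, "INT8": 8, "INT4": 4, "INT4 + g=128 scales": 4 + 16 / 128}

for name, bits in fmt_bits.items():
    gb = params * bits / 8 / 1e9
    print(f"{name:>22}: {gb:6.1f} GB")
# FP16 -> 140.0 GB, INT8 -> 70.0 GB, INT4 -> 35.0 GB, INT4+scales -> 36.1 GB
```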
Number Formats in Deep Learning
Floating-Point: FP32, FP16, BF16, TF32
IEEE 754 floating-point numbers encode a value as (-1)^sign × 2^(exponent - bias) × (1 + mantissa). The bit allocation between exponent and mantissa determines the tradeoff between dynamic range and precision:
| Format | Total Bits | Exponent | Mantissa | Dynamic Range | Precision (decimal digits) |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ~10^38 | ~7.2 |
| TF32 | 19 | 8 | 10 | ~10^38 | ~3.4 |
| BF16 | 16 | 8 | 7 | ~10^38 | ~2.4 |
| FP16 | 16 | 5 | 10 | ~65504 | ~3.4 |
| FP8 (E4M3) | 8 | 4 | 3 | ~448 | ~1.1 |
| FP8 (E5M2) | 8 | 5 | 2 | ~57344 | ~0.9 |
BF16 (Brain Floating Point) is the workhorse of modern LLM training. It has the same exponent width as FP32 (8 bits), giving it identical dynamic range, but truncates the mantissa to 7 bits. This means it can represent the same scale of numbers as FP32 — no overflow or underflow issues — but with lower precision. For neural network training, dynamic range matters more than precision, making BF16 the superior choice over FP16 for training stability.
FP16 has more mantissa bits (10 vs 7) but a smaller exponent (5 bits), capping its representable range at ~65504. Gradients and activations can exceed this range, requiring loss scaling. FP16 is still common for inference, where values are bounded.
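A quick numpy check makes the overflow concrete (numpy has no bfloat16 type, so only the FP16 half of the comparison is shown):

```python
import numpy as np

# FP16 tops out at 65504; anything above overflows to infinity
print(np.finfo(np.float16).max)    # 65504.0
print(np.float16(70000.0))         # inf

# FP32 (and BF16, which shares its 8-bit exponent) represents it easily
print(np.float32(70000.0))         # 70000.0
```

This is exactly the failure mode that forces loss scaling in FP16 training: a perfectly ordinary gradient magnitude silently becomes infinity.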
TF32 is an NVIDIA-specific format used only inside Tensor Cores — it is never a storage format. Inputs arrive as FP32, but the Tensor Core internally truncates mantissa to 10 bits, computes the multiply-accumulate, and writes back FP32. This gives near-FP32 accuracy at near-FP16 speed for matrix operations.
Integer Formats: INT8 and INT4
Integer quantization maps continuous floating-point values to a discrete grid. INT8 has 256 representable values; INT4 has just 16. The mapping is defined by a scale factor and optionally a zero-point.
The appeal is clear: INT8 is half the memory of FP16, and INT4 is a quarter. But the challenge is equally clear: you are crushing a continuous distribution into 16 or 256 bins. The art of quantization lies in choosing the bin boundaries to minimize the information lost.
INT4: xq = round(x / scale) + zero_point, xq ∈ [-8, 7]
The same weight distribution quantized at different precisions. Notice how lower bit-widths collapse the smooth distribution into coarser and coarser staircase approximations. Red bars show the quantization grid; the blue curve is the original FP32 distribution.
Quantization Fundamentals
Symmetric vs Asymmetric Quantization
Symmetric quantization (also called absmax quantization) maps the floating-point range [-max|x|, max|x|] to the integer range [-Qmax, Qmax]. The scale is computed as:
scale = max(|x|) / Qmax
xq = round(x / scale)
x̂ = xq * scale
The zero point is always 0, which simplifies the math and avoids a subtraction during dequantization. This works well when the distribution is roughly centered around zero — which neural network weights almost always are, since weight decay and normalization both push weight distributions toward symmetry about zero.
Asymmetric quantization (zero-point quantization) maps an arbitrary range [min, max] onto the full integer range [Qmin, Qmax]. This uses both a scale and a zero-point offset:
scale = (max - min) / (Qmax - Qmin)
zero_point = Qmin - round(min / scale)
xq = clamp(round(x / scale) + zero_point, Qmin, Qmax)
x̂ = (xq - zero_point) * scale
Asymmetric quantization is essential for activations, which are often non-symmetric. ReLU outputs, for example, are always non-negative — symmetric quantization would waste half the integer range representing values that never occur.
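A short numpy sketch quantifies the waste: quantizing non-negative activations with a symmetric INT8 grid versus an asymmetric one (the helper functions are my own, following the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=10_000))    # ReLU-like: every value is >= 0

def sym_quant(x, bits=8):
    qmax = 2 ** (bits - 1) - 1            # 127
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def asym_quant(x, bits=8):
    qmin, qmax = 0, 2 ** bits - 1         # use the full [0, 255] range
    scale = (x.max() - x.min()) / (qmax - qmin)
    zp = qmin - np.round(x.min() / scale)
    xq = np.clip(np.round(x / scale) + zp, qmin, qmax)
    return (xq - zp) * scale

mse_sym = np.mean((acts - sym_quant(acts)) ** 2)
mse_asym = np.mean((acts - asym_quant(acts)) ** 2)
print(f"symmetric  MSE: {mse_sym:.2e}")
print(f"asymmetric MSE: {mse_asym:.2e}")   # roughly 4x lower: double the resolution
```

On non-negative data the symmetric grid never uses its 128 negative levels, so the asymmetric variant gets about twice the resolution and roughly a quarter of the error.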
Granularity: Per-Tensor, Per-Channel, Per-Group
The choice of how many values share a single scale factor is called quantization granularity, and it dramatically affects quality.
Per-tensor quantization uses a single scale for the entire weight matrix. If the matrix has one row with outlier values 10x larger than the rest, the scale must accommodate those outliers, and all other rows lose precision because they use only a tiny fraction of the integer range.
Per-channel (per-row or per-column) quantization computes a separate scale for each output channel. This handles the common pattern where different rows of a weight matrix have different magnitudes. The overhead is minimal: one FP16 scale value per row, typically 4096 scales for a hidden dimension of 4096.
Per-group quantization goes further, dividing each row into groups of g values (commonly g=32, 64, or 128) and computing a scale per group. This provides the finest granularity short of per-element quantization, and is the standard in modern methods like GPTQ, AWQ, and GGUF. The overhead is one scale (and optionally one zero-point) per group: an FP16 scale shared by g weights adds 16/g bits per parameter.
So an "INT4 with group size 32" scheme actually uses about 4.5 bits per parameter on average (4 bits for the quantized weight plus 16/32 = 0.5 bits of amortized scale overhead). This is why GGUF format names like Q4_K_M specify both the quantization bits and the group strategy.
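The amortized-overhead arithmetic generalizes to any group size; a small illustrative helper:

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zp_bits=0):
    """Average storage cost per parameter under per-group quantization."""
    return weight_bits + (scale_bits + zp_bits) / group_size

print(effective_bits(4, 32))    # 4.5   (the "INT4, group size 32" case above)
print(effective_bits(4, 128))   # 4.125 (the common GPTQ/AWQ setting)
```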
Drag the slider to change the bit-width and watch how quantization error grows as precision decreases. The top panel shows the original (blue) vs quantized (red) values. The bottom panel shows the per-element error.
The Outlier Problem
In 2022, Dettmers et al. published a finding that changed the quantization landscape: large language models with more than ~6.7B parameters develop emergent outlier features. Specific hidden dimensions — as few as 6 out of 4096 — can have activation magnitudes 20-100x larger than the rest. These outlier dimensions appear consistently across all tokens and all layers, and they are critical to model quality.
The problem for quantization is severe. If one value in a tensor is 100 and the rest are between -1 and 1, a per-tensor INT8 quantization must set its scale to 100/127 ≈ 0.787. All the "normal" values between -1 and 1 then get mapped to the integer range [-1, 1] — effectively rounding most weights to one of just three values: -1, 0, or 1. The outlier has monopolized the entire dynamic range.
This is why naive INT8 quantization of activations catastrophically fails on large models. The weight distributions are generally well-behaved (roughly Gaussian, centered near zero), but the activations have these sharp outlier features that blow up the scale factor.
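The collapse is easy to reproduce: plant one outlier in an otherwise uniform tensor and apply per-tensor absmax INT8 (a toy reconstruction of the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=4096)
x[0] = 100.0                          # one emergent outlier dimension

scale = np.abs(x).max() / 127         # per-tensor absmax scale: 100/127 ~ 0.787
xq = np.round(x / scale)

# Every normal value lands on one of just three grid points
print(sorted(set(xq[1:].astype(int).tolist())))   # [-1, 0, 1]

rel_err = np.abs(x[1:] - xq[1:] * scale).mean() / np.abs(x[1:]).mean()
print(f"mean relative error on non-outliers: {rel_err:.0%}")
```

One value out of 4096 has reduced the other 4095 to a three-level code.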
SmoothQuant: Migrating Difficulty from Activations to Weights
SmoothQuant (Xiao et al., 2022) observes that while activations are hard to quantize (due to outliers), weights are easy (roughly Gaussian). The key idea: mathematically migrate the quantization difficulty from activations to weights by applying a per-channel scaling transformation.
For a linear layer Y = X * W, we can introduce a diagonal scaling matrix s, rewriting Y = (X * diag(s)^-1) * (diag(s) * W) = X̂ * Ŵ. By choosing
sj = max(|Xj|)^α / max(|Wj|)^(1-α)
with α typically around 0.5, SmoothQuant divides the outlier magnitudes in X by a large number (making X̂ easier to quantize) while multiplying the corresponding weight channels by the same factor (slightly harder, but weights are so well-behaved that they can absorb it). The result: both X̂ and Ŵ are now quantization-friendly, enabling W8A8 (8-bit weights, 8-bit activations) with minimal quality loss even on models with severe outlier features.
The hyperparameter α controls the migration strength.
α = 0.5 splits the difficulty equally. α closer to 1 shifts more burden to weights.
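A minimal numpy sketch of the migration on a single synthetic layer (only `alpha` and the s_j formula come from the paper; the rest is illustrative scaffolding):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 512))
X[:, 7] *= 50.0                              # a synthetic outlier activation channel
W = rng.normal(scale=0.02, size=(512, 512))

alpha = 0.5
act_max = np.abs(X).max(axis=0)              # per-input-channel activation max
w_max = np.abs(W).max(axis=1)                # matching per-channel weight max
s = act_max**alpha / w_max**(1 - alpha)      # sj = max|Xj|^a / max|Wj|^(1-a)

X_hat = X / s                                # outlier channel shrinks
W_hat = W * s[:, None]                       # weights absorb the factor

# The transform is mathematically exact: X @ W == X_hat @ W_hat
assert np.allclose(X @ W, X_hat @ W_hat)

ratio_before = np.abs(X).max() / np.median(np.abs(X))
ratio_after = np.abs(X_hat).max() / np.median(np.abs(X_hat))
print(f"outlier-to-median ratio: {ratio_before:.0f}x before, {ratio_after:.0f}x after")
```

The key property is that the rescaling changes nothing in exact arithmetic; it only redistributes dynamic range so that the subsequent INT8 rounding hurts less.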
Post-Training Quantization Methods
Post-Training Quantization (PTQ) compresses a pre-trained model without any additional training. You provide the model and a small calibration dataset (typically 128-1024 samples); the algorithm determines optimal quantization parameters by analyzing weight distributions and, optionally, how activations flow through the network. This is in contrast to Quantization-Aware Training (QAT), which simulates quantization during training to learn more robust weights.
GPTQ: One-Shot Weight Quantization via the Hessian
GPTQ (Frantar et al., 2022) quantizes weights one column at a time, using second-order information (the Hessian of the layer's reconstruction error) to compensate for quantization errors in already-processed columns.
The core idea: when you quantize column i of a weight matrix, the rounding error propagates to the output. But you have not yet quantized columns i+1 through n. GPTQ adjusts these remaining columns to compensate for the error, using the Hessian matrix H = 2XᵀX (computed from calibration data) to determine the optimal compensation.
The algorithm is based on Optimal Brain Quantization (OBQ), but with a crucial optimization: instead of choosing the next weight in an adaptive, greedy order (expensive), GPTQ processes columns in a fixed left-to-right order shared across all rows, batching the error updates over blocks of 128 columns. This reduces the complexity from O(d³n) to O(dn²) while preserving most of the quality.
# GPTQ pseudocode for one linear layer
# W: weight matrix (d_out x d_in), X: calibration activations
H = 2 * X.T @ X / n_samples   # Hessian approximation
H_inv = torch.inverse(H)      # in practice: damped Cholesky for stability
for i in range(d_in):         # Process columns left to right
    w = W[:, i]
    q = quantize(w)                 # Round to nearest INT4 grid point
    error = (w - q) / H_inv[i, i]   # Quantization error scaled by Hessian diagonal
    W[:, i] = q
    # Compensate: adjust remaining columns to absorb the error
    W[:, i+1:] -= torch.outer(error, H_inv[i, i+1:])
GPTQ achieves remarkably low perplexity degradation at INT4, especially with group_size=128. A GPTQ-quantized LLaMA-65B at 4-bit is often within 0.5 perplexity points of the FP16 baseline on WikiText-2. Quantization takes minutes to hours depending on model size, but inference is fast because the result is a simple INT4 weight matrix with FP16 scales.
AWQ: Activation-Aware Weight Quantization
AWQ (Lin et al., 2023) takes a different approach. Instead of compensating for quantization error after rounding (like GPTQ), AWQ identifies salient weight channels — those that correspond to large activation magnitudes — and protects them before quantization.
The insight: not all weights are equally important. A weight column that always multiplies near-zero activations contributes little to the output, even if the weight itself is large. Conversely, a modest weight that multiplies a large activation is critically important. AWQ uses activation statistics to identify these salient channels and scales them up before quantization, effectively giving them more of the quantization grid's dynamic range.
Ŵj = Wj * sj (scale up salient weight channels before rounding)
X̂j = Xj / sj (compensate in activations)
sj is chosen to minimize quantization error on salient channels
AWQ searches for the optimal per-channel scale factor using a grid search over a small calibration set. The result is both simpler and faster than GPTQ — no Hessian computation, no iterative error compensation — while achieving comparable or better quality. AWQ has become the default for many deployment pipelines, especially with the vLLM inference engine which has optimized AWQ kernels.
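The search can be sketched in toy form: sweep an exponent over per-channel activation magnitudes, fake-quantize the scaled weights, and keep the setting with the lowest output error (a simplified stand-in for AWQ's per-layer search; all names here are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(128, 256))
X[:, 3] *= 20.0                                # a salient input channel
W = rng.normal(scale=0.02, size=(256, 64))

def fake_quant(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax   # per-column absmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

Y_ref = X @ W
best = None
for a in np.linspace(0.0, 1.0, 21):            # grid over the exponent
    s = np.maximum(np.abs(X).mean(axis=0) ** a, 1e-8)
    Wq = fake_quant(W * s[:, None])            # scaled-up salient rows survive rounding
    err = np.mean((Y_ref - (X / s) @ Wq) ** 2) # X / s undoes the scaling exactly
    if best is None or err < best[0]:
        best = (err, a)
print(f"best exponent: {best[1]:.2f}")
```

Because a = 0 (no scaling) is on the grid, the search can only match or improve on plain quantization.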
Select a model size and see how memory requirements change across quantization levels. Includes overhead for group scales (group_size=128) and KV cache estimate for 2048 context at batch size 1.
GGUF & llama.cpp Quantization
The GGUF (GPT-Generated Unified Format) file format and the llama.cpp ecosystem represent the most practical quantization pipeline for local inference. Unlike GPTQ and AWQ, which target GPU inference with CUDA kernels, llama.cpp was designed for CPU inference from the start (though it now has excellent GPU support via Metal, CUDA, and Vulkan backends).
GGUF supports a rich taxonomy of quantization types, each offering a different tradeoff between quality, speed, and memory:
| Type | Bits/Weight | Block Size | Method | Notes |
|---|---|---|---|---|
| Q4_0 | 4.5 | 32 | Absmax, no zero-point | Fastest Q4. Lower quality. |
| Q4_1 | 5.0 | 32 | Absmax + zero-point (asymmetric) | Better for asymmetric distributions. |
| Q4_K_S | 4.5 | Super-blocks of 256 | K-quant with 6-bit scales | Good balance of speed and quality. |
| Q4_K_M | 4.8 | Super-blocks of 256 | K-quant, mixed precision | Recommended 4-bit. Attention layers get more bits. |
| Q5_K_S | 5.5 | Super-blocks of 256 | K-quant | Near FP16 quality for most models. |
| Q5_K_M | 5.7 | Super-blocks of 256 | K-quant, mixed precision | Best quality-to-size ratio for many users. |
| Q6_K | 6.6 | Super-blocks of 256 | K-quant | Nearly lossless. Good baseline. |
| Q8_0 | 8.5 | 32 | Absmax | Effectively lossless. 2x FP16 speed. |
The "K-quant" types (denoted by _K_) are the second-generation quantization methods in llama.cpp. They use super-blocks: a block of 256 weights is divided into 8 sub-blocks of 32, each with its own quantized scale. The super-block itself has a single FP16 master scale. This two-level hierarchy achieves finer granularity than the original Q4_0/Q4_1 formats without much overhead.
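The two-level idea can be sketched in a few lines of numpy (simplified: real K-quants store 6-bit integer sub-scales with careful bit-packing; here sub-scales are kept as 8-bit fractions of the master scale purely for illustration):

```python
import numpy as np

def superblock_quantize(x, bits=4, sub=32, n_sub=8):
    """Two-level scales: a full-precision master scale per super-block of 256
    weights, plus one 8-bit sub-scale per sub-block of 32 (simplified)."""
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, n_sub, sub)               # (n_super, 8, 32)
    sub_scales = np.abs(blocks).max(axis=2) / qmax   # one scale per sub-block
    master = sub_scales.max(axis=1, keepdims=True)   # master scale per super-block
    # store each sub-scale as an 8-bit fraction of its master scale
    q_sub = np.round(sub_scales / master * 255) / 255 * master
    q_sub = np.maximum(q_sub, 1e-12)                 # guard against zero scales
    q = np.clip(np.round(blocks / q_sub[..., None]), -qmax, qmax)
    return (q * q_sub[..., None]).reshape(x.shape)

x = np.random.default_rng(0).normal(scale=0.02, size=4096)
err = np.mean((x - superblock_quantize(x)) ** 2)
print(f"super-block 4-bit MSE: {err:.2e}")
```

Each sub-block of 32 gets an effective local scale at a fraction of the storage cost of 32-wide FP16 scales, which is the whole point of the hierarchy.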
The mixed-precision variants (_M suffix) are particularly clever. They assign different bit-widths to different layers based on sensitivity analysis. Attention layers (Q, K, V projections) tend to be more sensitive to quantization error than feedforward layers, so Q4_K_M quantizes attention weights at 5-6 bits while keeping FFN weights at 4 bits. The average is 4.8 bits/weight but the perplexity is significantly better than uniform 4-bit.
# Quantize a model with llama.cpp
# 1. Convert from HuggingFace to GGUF
python convert_hf_to_gguf.py ./models/llama-3-8b/ \
--outfile llama-3-8b-f16.gguf --outtype f16
# 2. Quantize to Q4_K_M (recommended default)
./llama-quantize llama-3-8b-f16.gguf \
llama-3-8b-Q4_K_M.gguf Q4_K_M
# Output: 4.92 GB file (from 16.1 GB FP16)
# Perplexity: typically within 0.3 of FP16 on standard benchmarks
NF4: The Quantization Behind QLoRA
NormalFloat4 (NF4) is a 4-bit data type designed specifically for quantizing neural network weights that follow a normal distribution. It was introduced by Dettmers et al. (2023) as part of the QLoRA paper and represents a fundamentally different approach to 4-bit quantization.
Standard INT4 uses uniformly spaced quantization levels: {-8, -7, ..., 0, ..., 6, 7}. But neural network weights are not uniformly distributed — they follow an approximately Gaussian distribution centered near zero. This means uniform quantization wastes grid points on the tails (where few values exist) and has insufficient resolution near zero (where most values cluster).
NF4 solves this by placing quantization levels at the quantiles of a standard normal distribution. The 16 levels are chosen so that each bin captures exactly 1/16 of the probability mass under the Gaussian curve. This is information-theoretically optimal for normally-distributed data — each quantization bin is equally likely to be used, maximizing the entropy of the quantized representation.
{-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0}
Notice the asymmetry: the levels are denser near zero and sparser in the tails, perfectly matching the Gaussian weight distribution. Dettmers et al. showed that NF4 achieves lower quantization error than INT4 on neural network weights by a significant margin.
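The claim is easy to check: quantize Gaussian weights onto the NF4 grid above and onto a uniform 16-level grid, absmax-normalizing into the grid's range in both cases:

```python
import numpy as np

# The 16 NF4 levels (normal-distribution quantiles) vs a uniform INT4 grid
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])
INT4 = np.arange(-8, 8) / 8.0     # 16 uniform levels on [-1, 0.875]

def quantize_to_grid(x, grid):
    scale = np.abs(x).max()       # absmax-normalize into the grid's range
    idx = np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

w = np.random.default_rng(0).normal(scale=0.02, size=8192)
results = {}
for name, grid in [("INT4", INT4), ("NF4", NF4)]:
    results[name] = np.mean((w - quantize_to_grid(w, grid)) ** 2)
    print(f"{name}: MSE = {results[name]:.2e}")
# NF4 comes out meaningfully lower on Gaussian weights
```

Note this toy comparison normalizes the whole vector at once; bitsandbytes applies the same idea per block of 64 weights.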
QLoRA combines NF4 quantization with LoRA (Low-Rank Adaptation) fine-tuning. The base model weights are frozen and stored in NF4. The LoRA adapter weights (the low-rank A and B matrices) are trained in BF16. During inference, the NF4 weights are dequantized to BF16 on-the-fly, the LoRA delta is added, and the result is used for the forward pass. This enables fine-tuning a 65B model on a single 48 GB GPU — an otherwise impossible task at full precision.
QLoRA also introduces double quantization: the FP32 quantization constants (scales) from the first quantization are themselves quantized to 8-bit, saving an additional ~0.37 bits per parameter. With a block size of 64, this reduces the per-parameter overhead from 32/64 = 0.5 bits to 8/64 + 32/(64*256) ≈ 0.127 bits.
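Spelled out as arithmetic:

```python
# Per-parameter overhead of the quantization constants, QLoRA-style
block = 64                                     # first-level block size

fp32_scales = 32 / block                       # 0.5 bits/param, no double quant
double_quant = 8 / block + 32 / (block * 256)  # 8-bit scales + FP32 second level
print(f"plain:  {fp32_scales:.3f} bits/param")      # 0.500
print(f"double: {double_quant:.4f} bits/param")     # 0.1270
print(f"saved:  {fp32_scales - double_quant:.3f} bits/param")  # 0.373
```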
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Load a model in NF4 with double quantization (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_use_double_quant=True,         # Double quantization
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Memory: ~35 GB (from 140 GB FP16)
# Quality: typically within 1-2% of FP16 on benchmarks
This visualization shows how per-group quantization reduces error compared to per-tensor. The weight vector contains one outlier region. Smaller group sizes isolate the outlier, preventing it from destroying precision for normal-range values. Use the slider to change the group size.
Practical Guide: When to Use What
The quantization landscape is large, but most practical decisions reduce to a small number of scenarios. Here is a decision framework based on your deployment context:
GPU inference with maximum throughput (production serving): Use AWQ with vLLM or TensorRT-LLM. AWQ at 4-bit with group_size=128 offers the best combination of quality, speed, and ecosystem support for GPU deployment. The quantized models work directly with continuous batching and PagedAttention in vLLM.
GPU inference with quality sensitivity: Use GPTQ at 4-bit with group_size=128 and actorder=True. GPTQ's Hessian-based error compensation can give slightly better perplexity than AWQ on some models, especially at very low bit-widths. The ExLlama/ExLlamaV2 kernels provide fast GPTQ inference.
CPU / Apple Silicon / mixed CPU-GPU (local inference): Use GGUF with llama.cpp. Start with Q4_K_M (best default), step up to Q5_K_M if you have the memory and want better quality, or drop to Q3_K_M if memory is tight. The GGUF ecosystem handles the full pipeline from quantization through inference with excellent cross-platform support.
Fine-tuning on limited hardware: Use QLoRA with NF4 + double quantization via bitsandbytes. This enables fine-tuning models that would otherwise be impossible on your hardware. The quality overhead of NF4 quantization during fine-tuning is minimal because the LoRA adapter learns to compensate.
INT8 for minimal quality loss: For latency-sensitive applications where you cannot tolerate any perplexity degradation, 8-bit quantization with SmoothQuant (W8A8) or bitsandbytes (W8A16) is effectively lossless while still giving ~2x memory reduction and meaningful speedup.
| Method | Bits | Best For | Quality | Speed |
|---|---|---|---|---|
| AWQ | 4 | GPU serving (vLLM) | Excellent | Very fast |
| GPTQ | 4 | GPU serving (ExLlama) | Excellent | Very fast |
| GGUF Q4_K_M | 4.8 | Local / CPU / Metal | Good | Fast (CPU) |
| GGUF Q5_K_M | 5.7 | Local, quality priority | Very good | Moderate |
| NF4 (QLoRA) | 4 | Fine-tuning | Good (for training) | Slow (dequant overhead) |
| SmoothQuant | 8 (W8A8) | Production, minimal degradation | Near-lossless | Fast |
| bitsandbytes 8-bit | 8 | Simple integration | Near-lossless | Moderate |
Code Examples
Quantizing with bitsandbytes (NF4 / INT8)
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch
# --- 8-bit quantization (simple, near-lossless) ---
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# Memory: ~8.5 GB (from 16 GB FP16)
# --- 4-bit NF4 quantization (aggressive, for constrained hardware) ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Memory: ~5 GB
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
inputs = tokenizer("The key insight behind quantization is", return_tensors="pt").to("cuda")
outputs = model_4bit.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quantizing with GPTQ (via auto-gptq)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# Quantize a model with GPTQ
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration dataset: 128 samples is usually sufficient
calibration_texts = [
    "The meaning of life is",
    "In machine learning, quantization refers to",
    # ... 126 more diverse text samples
]
gptq_config = GPTQConfig(
    bits=4,              # Quantize to INT4
    group_size=128,      # Per-group quantization
    desc_act=True,       # Activation order (actorder)
    dataset=calibration_texts,
    tokenizer=tokenizer,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("./llama-3-8b-gptq-4bit")
# Can be loaded later with:
# model = AutoModelForCausalLM.from_pretrained("./llama-3-8b-gptq-4bit")
Quantizing with AWQ (via autoawq)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
quant_path = "./llama-3-8b-awq-4bit"
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantize — searches for optimal per-channel scales
quant_config = {
    "zero_point": True,     # Asymmetric quantization
    "q_group_size": 128,    # Group size
    "w_bit": 4,             # Weight bits
    "version": "GEMM",      # GEMM kernel (vs GEMV)
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
# Use with vLLM for production serving:
# vllm serve ./llama-3-8b-awq-4bit --quantization awq
Understanding quantization error in pure PyTorch
import torch
def symmetric_quantize(tensor, bits=8):
    """Symmetric (absmax) quantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    scale = tensor.abs().max() / qmax
    quantized = torch.round(tensor / scale).clamp(-qmax, qmax).to(torch.int8)
    dequantized = quantized.float() * scale
    return dequantized, scale

def per_group_quantize(tensor, bits=4, group_size=32):
    """Per-group symmetric quantization."""
    assert tensor.numel() % group_size == 0
    groups = tensor.view(-1, group_size)
    qmax = 2 ** (bits - 1) - 1
    scales = groups.abs().amax(dim=1, keepdim=True) / qmax
    quantized = torch.round(groups / scales).clamp(-qmax, qmax)
    dequantized = (quantized * scales).view_as(tensor)
    return dequantized, scales
# Example: compare per-tensor vs per-group
weights = torch.randn(4096, 4096) * 0.02 # Typical LLM weight scale
deq_tensor, _ = symmetric_quantize(weights, bits=4)
deq_group, _ = per_group_quantize(weights, bits=4, group_size=128)
mse_tensor = ((weights - deq_tensor) ** 2).mean().item()
mse_group = ((weights - deq_group) ** 2).mean().item()
print(f"Per-tensor INT4 MSE: {mse_tensor:.2e}")
print(f"Per-group INT4 MSE: {mse_group:.2e}")
# Per-group is typically 3-10x lower MSE
Quantization has transformed LLM deployment from a datacenter-only proposition to something that runs on consumer hardware. A 70B model that once required multiple enterprise GPUs can now run on a single RTX 4090 or even a MacBook Pro with 64 GB of unified memory. The quality cost is remarkably small — modern 4-bit methods preserve 95-99% of the original model's capability.
The field continues to evolve rapidly. Sub-4-bit methods (AQLM, QuIP#, HQQ) are pushing toward 2-bit quantization with acceptable quality. Speculative decoding combined with quantization promises even faster inference. And hardware is catching up: NVIDIA's FP8 tensor cores, AMD's INT4 support, and Apple's weight-compression neural engine features are all making quantized inference a first-class citizen at the silicon level.
The core lesson: precision is a spectrum, not a binary. Understanding where on that spectrum your application lives — and which quantization method best matches your constraints — is now an essential skill for anyone deploying large language models.
References
Seminal papers and key works referenced in this article.
- Dettmers et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS, 2022. arXiv
- Frantar et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR, 2023. arXiv
- Xiao et al. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." ICML, 2023. arXiv
- Lin et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys, 2024. arXiv
- Dettmers et al. "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS, 2023. arXiv
- Dettmers & Zettlemoyer. "The case for 4-bit precision: k-bit Inference Scaling Laws." ICML, 2023. arXiv