Quantization, pruning, distillation — how to shrink a 280GB model until it fits on a single GPU.
You've just trained (or downloaded) a 70-billion-parameter language model. It's brilliant — writes poetry, debugs code, answers trivia. One problem: you can't actually run it.
Here's the arithmetic. Each parameter is stored as a float32 number, which takes 4 bytes. So 70 billion parameters × 4 bytes = 280 gigabytes. Your best data-center GPU — an NVIDIA A100 — has 80 GB of VRAM. You can't even load the model into memory, let alone run inference on it.
And that's just the weights. During inference, you also need memory for activations (the intermediate values computed at each layer), the KV cache (stored attention keys and values for prior tokens), and framework overhead. A 70B model at float32 needs roughly 300-350 GB just to generate a single token.
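The KV cache term can be estimated with a short calculation. The sketch below assumes a 70B-class shape (80 layers, 8 key/value heads, head dimension 128) and FP16 cache entries; these configuration numbers are illustrative assumptions, not values taken from this text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Assumed 70B-class configuration (hypothetical, for illustration only).
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"KV cache at 4096 tokens: {cache / 1e9:.2f} GB")
```

The cache grows linearly with both sequence length and batch size, so long contexts or many concurrent requests can rival the weights themselves.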
What about smaller models? A 7B model is 28 GB in float32. That fits on an 80 GB A100, leaves little room for activations and KV cache on the 40 GB variant, and does not fit at all on a consumer GPU like the RTX 4090 (24 GB). Even a 1.5B model (6 GB) wastes most of its memory budget on full-precision floats when it could run perfectly well at lower precision.
This is the deployment gap: the difference between "model works in the lab" and "model works in production." Model compression bridges this gap.
But memory isn't the only problem. There's also latency. Moving data from memory to compute units is the real bottleneck in modern hardware. A float32 weight is 4 bytes that must travel across the memory bus. An int8 weight is 1 byte — 4× less data to move. Since most LLM inference is memory-bandwidth-bound (not compute-bound), cutting the data size by 4× can yield nearly 4× speedup.
Let's do the memory math for common model sizes. This is the arithmetic every ML engineer does before deployment:
| Model | Params | FP32 (4B) | FP16 (2B) | INT8 (1B) | INT4 (0.5B) |
|---|---|---|---|---|---|
| GPT-2 | 1.5B | 6 GB | 3 GB | 1.5 GB | 0.75 GB |
| LLaMA-7B | 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| LLaMA-13B | 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| LLaMA-70B | 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| GPT-3 | 175B | 700 GB | 350 GB | 175 GB | 87.5 GB |
Look at that table. A 70B model at INT4 is 35 GB — it fits on a single A100 with room to spare for KV cache. At INT8, it's 70 GB — tight, but possible. The same model at FP32 requires a minimum of 4 GPUs.
Set your model size and data type. The bars show whether the model fits in various GPU memory capacities.
The deployment constraints are real. Edge devices (phones, robots) have 4-16 GB of RAM. Consumer GPUs top out at 24 GB. Even cloud inference has cost pressure — fewer GPUs per request means lower cost per token. Compression isn't optional for production ML. It's required.
Over the next 9 chapters, we'll learn every major compression technique: quantization (representing weights with fewer bits), pruning (removing unnecessary weights), knowledge distillation (training a small model to mimic a large one), and how to combine them into a complete compression pipeline. By the end, you'll be able to take any model and systematically shrink it for deployment.
```python
# The deployment arithmetic every ML engineer memorizes
def model_memory_gb(params_billions, bits_per_param):
    """Calculate model weight memory in GiB (1 GiB = 1024**3 bytes)."""
    bytes_per_param = bits_per_param / 8
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / (1024**3)

# 70B model at different precisions
for bits in [32, 16, 8, 4]:
    mem = model_memory_gb(70, bits)
    print(f"70B @ {bits}-bit: {mem:.1f} GiB")

# Output (binary GiB, so slightly below the decimal GB figures in the table):
# 70B @ 32-bit: 260.8 GiB
# 70B @ 16-bit: 130.4 GiB
# 70B @ 8-bit: 65.2 GiB
# 70B @ 4-bit: 32.6 GiB
```
Quantization is the single most important compression technique. The idea is deceptively simple: instead of storing each weight as a 32-bit floating point number (which can represent ~4 billion distinct values), store it as an 8-bit integer (256 distinct values) or even a 4-bit integer (16 distinct values). You lose some precision, but you save 4-8× memory.
Think of it like this. A high-resolution photograph uses millions of colors. A cartoon uses maybe 16 colors. The cartoon conveys the same scene — you recognize the same objects, faces, and emotions — despite using vastly fewer color levels. Quantization does the same thing to neural network weights: it reduces the "color depth" of the numbers while preserving the information that matters.
The formal operation has two parts: quantize (map a floating-point number to an integer) and dequantize (map back to approximate floating-point).
q = clamp(round(x / scale) + zero_point, q_min, q_max)
x̂ = (q - zero_point) × scale
Here scale determines the step size between representable values, and zero_point is the integer that float 0.0 maps to, which shifts the range. Let's work through a complete example with actual numbers.
Worked Example: Quantize the float32 values [0.5, 1.2, -0.3, 0.8] to INT8 (range 0 to 255) using min/max calibration.
Step 1: Find the range. We need to map our float values into the integer range [0, 255].
x_min = -0.3, x_max = 1.2
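Step 2: Compute the scale. Scale = (x_max - x_min) / (q_max - q_min) = (1.2 - (-0.3)) / (255 - 0) = 1.5 / 255 ≈ 0.005882
Step 3: Compute the zero_point. Zero_point is the integer that represents float 0.0: zero_point = round(-x_min / scale) = round(0.3 / 0.005882) = 51
Step 4: Quantize each value. q = round(x / scale) + zero_point. Because zero_point = -x_min / scale here, this is the same as the simpler formula q = round((x - x_min) / scale), which gives 0.5 → 136, 1.2 → 255, -0.3 → 0, 0.8 → 187
Step 5: Dequantize to check the error. x̂ = q × scale + x_min, which gives 136 → 0.5, 255 → 1.2, 0 → -0.3, 187 → 0.8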
In this toy example, the errors are essentially zero because our values happened to map cleanly. In practice, with thousands of weight values spanning a wider range, the rounding errors accumulate. The question becomes: how much accumulated error can a neural network tolerate before its outputs degrade?
```python
import numpy as np

# Step-by-step quantization from scratch
def quantize_minmax(x, num_bits=8):
    """Quantize float array to unsigned integers using min/max calibration."""
    q_min, q_max = 0, 2**num_bits - 1  # 0 to 255 for 8-bit
    x_min, x_max = x.min(), x.max()
    # Step size between adjacent integer levels
    scale = (x_max - x_min) / (q_max - q_min)
    # Quantize: float -> int. Clip before casting so values cannot wrap around.
    q = np.clip(np.round((x - x_min) / scale), q_min, q_max).astype(np.uint8)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    """Dequantize integer back to float."""
    return q.astype(np.float32) * scale + x_min

# Example
x = np.array([0.5, 1.2, -0.3, 0.8], dtype=np.float32)
q, scale, x_min = quantize_minmax(x, num_bits=8)
x_hat = dequantize(q, scale, x_min)
print(f"Original: {x}")
print(f"Quantized: {q}")
print(f"Dequantized: {x_hat}")
print(f"Max error: {np.abs(x - x_hat).max():.6f}")

# Using PyTorch (one-liner)
import torch
x_t = torch.tensor([0.5, 1.2, -0.3, 0.8])
q_t = torch.quantize_per_tensor(x_t, scale=0.005882, zero_point=51, dtype=torch.quint8)
```
A smooth sine wave (teal) quantized to discrete levels (orange). Drag the bit-width slider to see how fewer bits create a coarser staircase.
Notice from the visualization: at 8 bits (256 levels), the staircase is barely visible — the quantized wave nearly perfectly overlaps the original. At 4 bits (16 levels), you can clearly see the steps. At 2 bits (4 levels), the signal is barely recognizable. This is why INT8 quantization works so well: 256 levels is enough to faithfully represent most weight distributions.
Chapter 1 taught the basics: one scale and one zero_point for an entire tensor. But real neural networks have weight matrices with highly non-uniform value distributions. Some channels have weights in [-0.01, 0.01] while others span [-2.0, 2.0]. Using a single scale for both wastes precision on the small-range channels.
The solution: don't quantize the whole tensor with one scale. Quantize smaller groups of weights, each with their own scale and zero_point. This is the spectrum from coarse to fine-grained quantization.
Per-tensor quantization uses one scale for the entire weight matrix. It's the cheapest (just 1 extra scale parameter per tensor) but the least accurate. If your tensor has outlier values in one corner, the scale must accommodate them, wasting precision everywhere else.
Per-channel quantization (also called per-row or per-column) uses one scale per output channel. In a weight matrix of shape [out_features, in_features], you get one scale per row. This is the standard for most production systems because each output neuron learns a different magnitude of weights.
Per-group quantization splits each channel into groups of G weights (typically G=32, 64, or 128) and quantizes each group independently. This is what GPTQ and AWQ use. More scales = more metadata overhead, but dramatically better accuracy.
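The metadata overhead of per-group quantization is easy to quantify. The sketch below assumes one FP16 scale per group and no stored zero_point (symmetric quantization); real formats differ in how they store scales, so treat the numbers as a rough guide.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average storage per weight once per-group metadata is included."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

for g in (32, 64, 128):
    print(f"G={g}: {effective_bits_per_weight(4, g):.3f} bits/weight")
```

Under these assumptions, halving the group size doubles the metadata cost, which is the trade-off behind the common choice of G=128.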
Symmetric vs Asymmetric quantization. So far we've used asymmetric quantization: the float range [x_min, x_max] maps to the full int range [0, 255]. Symmetric quantization constrains the mapping so that float 0.0 maps to integer 0. This means the range is [-|max|, +|max|] mapped to [-128, +127] for signed int8.
Symmetric is faster at inference (no zero_point subtraction in the inner loop) but wastes range when the weight distribution is skewed. Most modern systems use symmetric for weights (which are roughly centered at 0) and asymmetric for activations (which are often positive-only after ReLU).
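To see why positive-only activations favor asymmetric quantization, here is a small comparison on made-up data. It is a sketch in plain Python: the helper names and the synthetic activation values are assumptions for illustration.

```python
def quant_error_asymmetric(xs, bits=8):
    """Mean absolute round-trip error using the full [min, max] range."""
    levels = 2**bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels
    return sum(abs(x - (round((x - lo) / scale) * scale + lo)) for x in xs) / len(xs)

def quant_error_symmetric(xs, bits=8):
    """Mean absolute round-trip error using the range [-max|x|, +max|x|]."""
    qmax = 2**(bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax
    return sum(abs(x - round(x / scale) * scale) for x in xs) / len(xs)

# Positive-only values, as after a ReLU (synthetic data).
acts = [i * 0.037 for i in range(1, 200)]
print(f"asymmetric: {quant_error_asymmetric(acts):.5f}")
print(f"symmetric:  {quant_error_symmetric(acts):.5f}")
```

The symmetric version spends half of its integer levels on negative values that never occur, so its step size is about twice as large.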
GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2022) is a second-order method. Instead of simply rounding each weight independently, it quantizes one column at a time and uses the Hessian matrix (second derivative of the loss) to compensate the remaining columns for the error introduced. The key insight: when you round column j, you can slightly adjust columns j+1, j+2, ... to partially cancel the rounding error.
AWQ (Activation-Aware Weight Quantization, 2023) observes that not all weights are equally important. Some weights, when quantized, cause much larger output errors than others — specifically, weights that multiply large activations. AWQ identifies these "salient" weights by looking at activation magnitudes from calibration data, then scales them up before quantization (and scales the activations down to compensate). This gives important weights more of the quantization range.
GGUF format (by Georgi Gerganov, the llama.cpp author) is the standard file format for quantized models on CPU. It supports many quantization schemes: Q4_0 (4-bit, groups of 32, symmetric), Q4_K_M (4-bit with k-quant medium quality), Q8_0 (8-bit), etc. Each "Q" variant uses different group sizes and scale formats, trading metadata size for accuracy.
```python
import numpy as np

# Per-channel vs per-tensor quantization comparison
def quantize_per_tensor(W, bits=8):
    """One scale for entire matrix."""
    qmax = 2**(bits-1) - 1  # 127 for int8
    scale = np.abs(W).max() / qmax
    q = np.round(W / scale).clip(-qmax, qmax).astype(np.int8)
    return q, scale

def quantize_per_channel(W, bits=8):
    """One scale per row (output channel)."""
    qmax = 2**(bits-1) - 1
    scales = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.round(W / scales).clip(-qmax, qmax).astype(np.int8)
    return q, scales

# Simulate: channel 0 has tiny weights, channel 1 has large ones
W = np.array([
    [0.01, -0.02, 0.015, -0.005],  # small range
    [1.5, -2.0, 0.8, -1.2],        # large range
], dtype=np.float32)

# Per-tensor: scale = 2.0/127 = 0.01575
# Channel 0 values ≈ 0.01 → quantized to 0 or 1! Massive relative error.
q_tensor, s_t = quantize_per_tensor(W)
print(f"Per-tensor scale: {s_t:.5f}")
print(f"Channel 0 quantized: {q_tensor[0]}")  # [1, -1, 1, 0] - terrible!

# Per-channel: channel 0 gets scale = 0.02/127 = 0.000157
q_channel, s_c = quantize_per_channel(W)
print(f"Per-channel scales: {s_c.flatten()}")
print(f"Channel 0 quantized: {q_channel[0]}")  # approximately [64, -127, 95, -32] - much better!
```
A 4×8 weight matrix where rows have different scales. Left: per-tensor (one color scale for all). Right: per-channel (each row gets its own scale). Notice how per-channel preserves detail in low-magnitude rows.
Worked example: per-group quantization. Take a row of 8 weights: [0.1, 0.2, 0.15, 0.12, 2.0, -1.8, 1.5, -2.1]. With group size G=4:
Group 1: [0.1, 0.2, 0.15, 0.12]. Max |w| = 0.2. Scale = 0.2/7 = 0.0286 (for 4-bit, range -8 to 7).
Group 2: [2.0, -1.8, 1.5, -2.1]. Max |w| = 2.1. Scale = 2.1/7 = 0.300.
Without grouping, the single scale would be 2.1/7 = 0.300 for all 8 values. Group 1 values (all ~0.1-0.2) would quantize to 0 or 1, destroying their differences. With grouping, they get their own fine-grained scale and quantize to roughly 4, 7, 5, 4 (0.1/0.0286 = 3.5 sits on a rounding boundary), preserving most of their relative magnitudes.
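The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch of symmetric per-group rounding, not the GPTQ or AWQ procedure; it clamps to [-7, 7] for simplicity, and the function name is made up for illustration.

```python
def quantize_groups(row, group_size, bits=4):
    """Symmetric per-group quantization: one scale per group of weights."""
    qmax = 2**(bits - 1) - 1  # 7 for 4-bit
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales

row = [0.1, 0.2, 0.15, 0.12, 2.0, -1.8, 1.5, -2.1]
grouped, group_scales = quantize_groups(row, group_size=4)
single, single_scales = quantize_groups(row, group_size=len(row))
print("G=4:", grouped, [round(s, 4) for s in group_scales])
print("G=8:", single, [round(s, 4) for s in single_scales])
```

With G=8 the four small weights collapse onto the lowest one or two levels, while with G=4 they spread across the upper half of the 4-bit range.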
Quantization keeps all the weights but makes each one smaller. Pruning takes a different approach: remove weights entirely. Set them to exactly zero. If a weight is already near zero, it's barely contributing to the output — removing it should have minimal impact.
Think of it like editing a book. Quantization is like printing in a smaller font (same content, less space). Pruning is like cutting sentences (less content, but hopefully the unimportant ones). A good editor knows which sentences are load-bearing and which are filler. Pruning algorithms try to identify the "filler" weights.
The simplest approach is magnitude pruning: sort all weights by their absolute value |w|, then zero out the smallest k%. The intuition is that small weights contribute small amounts to the output, so removing them causes minimal damage.
Worked example: Prune 50% of a weight matrix.
Step 1: Flatten and sort by magnitude: |0.02|, |0.03|, |0.04|, |0.1|, |0.5|, |0.6|, |0.7|, |0.8|, |0.9|
Step 2: 50% of 9 = 4.5, round to 4. Remove the 4 smallest: 0.02, 0.03, 0.04, -0.1
Step 3: Set those to zero in the original matrix:
We now have a sparse matrix — 44% of entries are zero. But here's the catch: a sparse matrix stored naively takes the same memory as a dense one (you still store the zeros). To get actual memory savings, you need a sparse storage format like CSR (Compressed Sparse Row) or a bitmask.
Unstructured sparsity requires special hardware or software support to accelerate. NVIDIA's Ampere architecture supports 2:4 structured sparsity (in every group of 4 weights, exactly 2 must be zero), giving 2× speedup on Tensor Cores. But arbitrary unstructured sparsity gets no hardware acceleration on current GPUs.
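The 2:4 pattern is simple to express in code. The sketch below only applies the masking rule (keep the two largest magnitudes in each group of four); it does not use or model the Tensor Core kernels, and the function name is an assumption.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each consecutive group of four."""
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

row = [0.5, -0.1, 0.8, -0.3, 0.02, -0.9, 0.03, 0.01]
print(prune_2_of_4(row))
```

Note that the second group keeps 0.03 even though it is tiny: the pattern forces exactly two survivors per group, regardless of their absolute size.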
Structured pruning is more practical. Remove entire neurons (a full row of the weight matrix) and the corresponding column of the next layer's weight matrix. This literally shrinks the tensor dimensions. A 4096×4096 matrix pruned to remove 25% of neurons becomes 3072×4096 — genuinely smaller, genuinely faster.
The lottery ticket hypothesis (Frankle & Carbin, 2019) made a stunning claim: within a large randomly-initialized network, there exists a small subnetwork (the "winning ticket") that, if trained in isolation from the same initialization, would match the full network's accuracy. This suggests pruning isn't just removing unimportant weights. It's finding the essential structure that was always there.
```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest `sparsity` fraction of weights."""
    flat = np.abs(W).flatten()
    threshold = np.percentile(flat, sparsity * 100)
    mask = np.abs(W) >= threshold
    return W * mask, mask

def structured_prune_neurons(W, sparsity):
    """Remove entire rows (neurons) with smallest L2 norm."""
    norms = np.linalg.norm(W, axis=1)
    n_remove = int(W.shape[0] * sparsity)
    # Sort the surviving indices so the kept rows stay in their original order
    keep_idx = np.sort(np.argsort(norms)[n_remove:])
    return W[keep_idx], keep_idx

# Example: 4x4 weight matrix
W = np.array([
    [0.5, -0.1, 0.8, -0.3],
    [0.02, -0.9, 0.03, 0.01],   # will survive (has -0.9)
    [-0.7, 0.04, 0.6, -0.5],
    [0.01, 0.02, -0.01, 0.03],  # entire row is tiny → prune
], dtype=np.float32)

# Unstructured: 50% sparsity
W_sparse, mask = magnitude_prune(W, 0.5)
print(f"Non-zeros: {mask.sum()} / {mask.size}")

# Structured: remove 25% of neurons
W_struct, kept = structured_prune_neurons(W, 0.25)
print(f"Shape: {W.shape} → {W_struct.shape}")  # (4,4) → (3,4)

# PyTorch structured pruning (one-liner on an example layer)
import torch.nn as nn
import torch.nn.utils.prune as prune
layer = nn.Linear(4, 4)  # stand-in layer for illustration
prune.ln_structured(layer, 'weight', amount=0.25, n=2, dim=0)
```
A weight matrix shown as colored cells. Drag the sparsity slider to prune weights by magnitude. Watch cells go dark (zeroed). The histogram below shows the weight distribution with the pruning threshold.
There's a practical question: at what sparsity does accuracy collapse? Empirically, most networks tolerate 50-80% unstructured sparsity with minimal accuracy loss, especially if you fine-tune after pruning (give the remaining weights a chance to adapt). Beyond 90%, accuracy degrades rapidly. The exact threshold depends on the model, task, and how carefully you prune.
The iterative approach works best: prune a small amount (e.g., 20%), fine-tune for a few epochs, prune another 20% of the remaining weights, fine-tune again. This iterative magnitude pruning reaches higher sparsity with less accuracy loss than one-shot pruning because the fine-tuning steps allow the network to redistribute importance among surviving weights.
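Because each round prunes a fraction of the remaining weights, overall sparsity compounds geometrically. A quick sketch of that arithmetic (the fine-tuning between rounds is not modeled here):

```python
def sparsity_after_rounds(prune_fraction, rounds):
    """Overall sparsity when each round prunes a fraction of the surviving weights."""
    remaining = (1.0 - prune_fraction) ** rounds
    return 1.0 - remaining

for k in (1, 3, 5, 10):
    print(f"after {k} rounds of 20%: {sparsity_after_rounds(0.2, k):.1%} sparse")
```

Reaching roughly 90% sparsity therefore takes about ten prune-and-fine-tune rounds at 20% per round.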
Before you compress a model, you need to answer: where are the bottlenecks? Is the model slow because of too many computations (compute-bound) or because of too much data movement (memory-bound)? The answer determines which compression technique will help most.
The key tool for understanding this is the roofline model. It's a simple graph that plots achievable performance (FLOP/s) against operational intensity (FLOPs per byte of data moved). Every hardware platform has a "roof", a maximum throughput, and your operations either hit the compute roof or the memory-bandwidth roof.
Compute-bound operations are those where the GPU spends most of its time doing arithmetic. Large batched matrix multiplications (GEMM) with big matrices are typically compute-bound. For these, quantization doesn't help much with speed (the bottleneck is the ALU, not memory), but it can enable larger batch sizes by freeing VRAM.
Memory-bound operations are those where the GPU spends most of its time waiting for data to arrive from memory. This includes: attention with long sequences, small batch inference, layer normalization, activation functions. For these, quantization directly speeds things up by reducing the bytes that must move.
LLM inference is almost always memory-bandwidth-bound at batch size 1. Why? A single forward pass through a 7B model reads ~14 GB of weights (FP16) but performs relatively few operations with each weight (just one multiply-add per input token). The arithmetic intensity is very low. This is why weight quantization to INT4 gives nearly 4× speedup for single-request inference.
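A lower bound on per-token latency follows directly from this: every weight must be streamed from memory once per generated token. The sketch below assumes 2 TB/s of memory bandwidth (roughly an 80 GB A100) and ignores KV cache reads, activations, and kernel overhead.

```python
def min_ms_per_token(params_billions, bits_per_param, bandwidth_tb_s):
    """Lower bound on decode latency: time to stream all weights once."""
    weight_bytes = params_billions * 1e9 * bits_per_param / 8
    return weight_bytes / (bandwidth_tb_s * 1e12) * 1e3

for bits in (16, 8, 4):
    ms = min_ms_per_token(7, bits, bandwidth_tb_s=2.0)
    print(f"7B @ {bits}-bit: >= {ms:.2f} ms/token")
```

Real systems land above this bound, but the ratio between precisions is what explains the speedup from weight quantization.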
Profiling tools:
| Tool | What it measures | When to use |
|---|---|---|
| torch.profiler | Per-operator time, memory, CUDA kernels | Finding slow operations |
| nvidia-smi | GPU utilization, memory usage, temperature | High-level monitoring |
| nsys (Nsight Systems) | Full timeline: CPU, GPU, memory transfers | Deep performance analysis |
| ncu (Nsight Compute) | Per-kernel metrics: occupancy, memory throughput | Kernel-level optimization |
FLOP counting for transformers. A transformer layer with hidden dimension d and sequence length n has these main operations:
Self-attention QKV projection: 3 × 2nd² = 6nd² FLOPs
Attention scores: 2n²d FLOPs
Output projection: 2nd² FLOPs
MLP (typically 4d hidden): 2 × 2n(4d)d = 16nd² FLOPs
Total per layer: ≈ 24nd² + 2n²d FLOPs
For a 7B model (d=4096, 32 layers), processing one token (n=1): 24 × 1 × 4096² × 32 ≈ 12.9 GFLOP. But reading the weights: 14 GB at 2 TB/s memory bandwidth = 7 ms. Performing 12.9 GFLOP at 312 TFLOP/s (A100) = 0.04 ms. The memory read takes about 175× longer than the compute!
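The per-layer formula can be checked numerically. This sketch follows the operation counts listed above (it counts only the attention-score matmul for the n² term, as the list does) and uses the same A100 figures; the function name is an assumption.

```python
def layer_flops(n, d, mlp_ratio=4):
    """Approximate FLOPs for one transformer layer (n tokens, hidden size d)."""
    qkv = 3 * 2 * n * d * d                    # Q, K, V projections
    scores = 2 * n * n * d                     # attention scores
    out_proj = 2 * n * d * d                   # output projection
    mlp = 2 * 2 * n * (mlp_ratio * d) * d      # up and down projections
    return qkv + scores + out_proj + mlp

total = 32 * layer_flops(n=1, d=4096)
compute_ms = total / 312e12 * 1e3   # at 312 TFLOP/s
memory_ms = 14e9 / 2e12 * 1e3       # 14 GB of FP16 weights at 2 TB/s
print(f"{total / 1e9:.1f} GFLOP, compute {compute_ms:.3f} ms, memory {memory_ms:.1f} ms")
```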
```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a model forward pass
model = torch.nn.Linear(4096, 4096).cuda()  # stand-in; substitute your own model
x = torch.randn(1, 128, 4096).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    output = model(x)

# Print top operators by GPU time
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10
))

# Manual roofline calculation
def roofline_analysis(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Determine if an operation is memory-bound or compute-bound."""
    intensity = flops / bytes_moved            # FLOP/byte
    ridge_point = peak_flops / peak_bandwidth  # FLOP/byte
    if intensity < ridge_point:
        bound = "MEMORY-BOUND"
        achievable = intensity * peak_bandwidth
    else:
        bound = "COMPUTE-BOUND"
        achievable = peak_flops
    return bound, achievable, intensity

# A100 specs: 312 TFLOP/s FP16, 2 TB/s bandwidth
# 7B model, 1 token: 12.9 GFLOP, 14 GB read
bound, perf, ai = roofline_analysis(
    flops=12.9e9, bytes_moved=14e9,
    peak_flops=312e12, peak_bandwidth=2e12
)
print(f"Bound: {bound}, Intensity: {ai:.1f} FLOP/byte")
# Output: Bound: MEMORY-BOUND, Intensity: 0.9 FLOP/byte
```
The roofline shows max achievable performance vs arithmetic intensity. Operations below the roof are underutilizing hardware. Points left of the ridge are memory-bound; right are compute-bound. Drag the model slider to plot different scenarios.
The roofline reveals a crucial insight: at batch size 1, LLM inference uses less than 1% of the GPU's compute capability. The entire GPU sits idle waiting for weights to arrive from VRAM. This is why INT4 quantization gives nearly 4× speedup — you're moving 4× less data, which is the bottleneck. At batch size 32-64, you amortize the weight read across multiple inputs, pushing toward the compute roof where quantization helps less with latency but still helps with memory capacity.
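The effect of batch size on arithmetic intensity can be sketched with one line of arithmetic: each weight is read once but used for one multiply-add (2 FLOPs) per sequence in the batch. This simplified model ignores KV cache and activation traffic, so it overstates intensity at long context lengths.

```python
def decode_intensity(batch_size, bytes_per_weight):
    """FLOP per byte of weights read: 2 FLOPs per weight per sequence in the batch."""
    return 2 * batch_size / bytes_per_weight

ridge = 312e12 / 2e12  # A100 FP16 peak over bandwidth = 156 FLOP/byte
for batch in (1, 32, 64, 256):
    ai = decode_intensity(batch, bytes_per_weight=2)
    side = "memory-bound" if ai < ridge else "compute-bound"
    print(f"batch {batch:>3}: {ai:6.1f} FLOP/byte ({side})")
```

Under these assumptions, FP16 decoding moves steadily toward the ridge as the batch grows and only crosses it at a batch size in the low hundreds.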
Post-training quantization (PTQ) is the most practical compression technique: take a pre-trained model, quantize it, and deploy — no retraining required. This is crucial because many models cost millions of dollars to train. You don't get to train them again. You need compression that works on the finished artifact.
The simplest PTQ approach is weight-only quantization: take the weight tensors, compute their min/max (or running statistics from calibration data), compute scale and zero_point, quantize, done. This works well for INT8 and is the default in most deployment frameworks.
But for INT4, naive min/max quantization often fails. The problem is outlier weights. In transformer models, some weights are 10-100× larger than the median. A single outlier stretches the quantization range, wasting most levels on the non-outlier weights where precision is needed most.
Calibration is the process of running representative data through the model to measure activation ranges. Instead of using the theoretical min/max of the weight tensor, you measure what ranges actually matter during inference. This is better because:
1. Some weight values are rarely activated — clipping them introduces small error
2. The interaction between weights and activations determines which weights need precision
3. Percentile-based ranges (e.g., clip at 99.99th percentile) are more robust than min/max
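The third point can be illustrated on synthetic data with a single outlier. The sketch below uses a simple nearest-rank percentile and 4-bit uniform quantization; both helpers and the data are made up for illustration and are not any framework's calibration routine.

```python
def mean_abs_error(values, lo, hi, bits=4):
    """Clip to [lo, hi], quantize uniformly, and report mean round-trip error."""
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    total = 0.0
    for v in values:
        clipped = min(max(v, lo), hi)
        total += abs(v - (round((clipped - lo) / scale) * scale + lo))
    return total / len(values)

def percentile(values, pct):
    """Nearest-rank percentile on a sorted copy (simple, not interpolated)."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, int(pct / 100 * len(ordered)))
    return ordered[rank]

# 1000 small values plus one large outlier (synthetic data).
weights = [(i - 500) / 5000 for i in range(1000)] + [5.0]
minmax = mean_abs_error(weights, min(weights), max(weights))
clipped = mean_abs_error(weights, percentile(weights, 0.1), percentile(weights, 99.9))
print(f"min/max: {minmax:.4f}  percentile clip: {clipped:.4f}")
```

Clipping sacrifices the outlier (its error becomes large) in exchange for a much finer step size on the other 1000 values, which lowers the average error.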
The GPTQ algorithm (Frantar et al., 2022) is the gold standard for PTQ at INT4. It's based on Optimal Brain Quantization (OBQ), which itself is based on Optimal Brain Surgeon. The key insight: when you quantize one weight, you can adjust the remaining weights to compensate for the error.
Worked example: GPTQ on a 3×3 weight matrix.
Given weight matrix W and Hessian H = XᵀX (where X is calibration data):
W = [[0.7, -0.3, 0.5], [0.2, 0.8, -0.4], [-0.6, 0.1, 0.9]]
Suppose we quantize to 4-bit (levels: -8 to 7 with scale 0.1). Process column by column:
Column 0: Quantize W[:,0] = [0.7, 0.2, -0.6]
q = [round(0.7/0.1), round(0.2/0.1), round(-0.6/0.1)] = [7, 2, -6]
Dequantized: [0.7, 0.2, -0.6]. Error: [0, 0, 0]. Lucky — these values happened to quantize exactly.
Column 1: Quantize W[:,1] = [-0.3, 0.8, 0.1]
q = [-3, 8→clamped to 7, 1]. Dequantized: [-0.3, 0.7, 0.1]
Error for row 1: 0.8 - 0.7 = 0.1. Left alone, this error propagates to the layer output. GPTQ compensates by adjusting column 2:
W[1,2] -= error × H_inv[1,2]/H_inv[1,1] (Hessian-based correction)
If H_inv[1,2]/H_inv[1,1] = 0.3, then W[1,2] = -0.4 - 0.1×0.3 = -0.43
Column 2: Now quantize the ADJUSTED W[:,2] = [0.5, -0.43, 0.9]
q = [5, -4, 7 (clamped from 9)]. Here -0.43 still rounds to -4, so the shift is too small to change the result; with larger errors or finer scales the adjusted weight can land on a different level, which is how the column 1 error gets absorbed.
This column-by-column approach with error compensation is why GPTQ achieves much better accuracy than naive rounding, especially at INT4 where every level counts.
```python
import numpy as np

def gptq_quantize_column(W, col, H_inv, bits=4):
    """Quantize one column of W with Hessian-based error compensation.

    Simplified: uses a per-column scale and a fixed H_inv, whereas the full
    algorithm also updates the inverse Hessian after each column.
    """
    qmax = 2**(bits-1) - 1  # 7 for 4-bit
    scale = np.abs(W[:, col]).max() / qmax
    # Quantize this column
    q = np.round(W[:, col] / scale).clip(-qmax, qmax)
    error = W[:, col] - q * scale  # quantization error
    # Compensate remaining columns
    for j in range(col + 1, W.shape[1]):
        # Hessian-based correction (subtracted, following the OBS update)
        correction = error * H_inv[col, j] / H_inv[col, col]
        W[:, j] -= correction
    W[:, col] = q * scale
    return W

# Full GPTQ on a layer
def gptq_layer(W, X_calib, bits=4):
    """Apply GPTQ to one weight matrix using calibration data X."""
    # Compute Hessian: H = X^T X / n_samples
    H = X_calib.T @ X_calib / X_calib.shape[0]
    H_inv = np.linalg.inv(H + 1e-6 * np.eye(H.shape[0]))
    W_q = W.copy()
    for col in range(W.shape[1]):
        W_q = gptq_quantize_column(W_q, col, H_inv, bits)
    return W_q

# Using the auto-gptq library
# from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
# config = BaseQuantizeConfig(bits=4, group_size=128)
# model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b", config)
# model.quantize(calibration_data)
```
Watch how quantization error in one column gets compensated in subsequent columns. Blue bars: original weights. Orange bars: quantized (naive). Green bars: quantized with GPTQ compensation. The output error (red line) is much smaller with compensation.
Practical PTQ pipeline:
Post-training compression works on a frozen model. But what if you could make the model aware of quantization during training? If the model knows it will be quantized, it can learn weights that are robust to quantization noise — weights that sit near the center of quantization bins rather than on the boundaries.
This is Quantization-Aware Training (QAT). The idea: insert "fake quantize" operations into the forward pass during training. These operations simulate the effect of quantization (rounding, clipping) without actually storing integers. The model sees quantization noise during training and adapts to it.
The fake quantize operation in the forward pass, with scale s:
FakeQuant(x) = clamp(round(x / s), q_min, q_max) × s
This rounds x to the nearest quantization level, then immediately converts back to float. The output is still float32, but it can only take values that are representable at the target precision. The model trains with these "staircase" activations and learns to work within them.
The gradient problem: The round() function has zero gradient almost everywhere (it's flat between integers) and undefined gradient at integer boundaries. You can't backpropagate through it! The solution is the Straight-Through Estimator (STE): during backprop, pretend round() was the identity function. Pass the gradient through unchanged.
This is mathematically dubious (the true gradient is zero!) but works brilliantly in practice. The model receives gradient signal as if quantization weren't there, but the forward pass produces quantized values. Over many iterations, the optimizer finds weights that minimize loss under quantization.
Worked example: STE in action.
Suppose we're training a weight w = 0.37 with scale s = 0.1 (quantization levels: ..., 0.3, 0.4, 0.5, ...).
Forward: FakeQuant(0.37) = round(0.37/0.1) × 0.1 = round(3.7) × 0.1 = 4 × 0.1 = 0.4
The loss is computed using 0.4, not 0.37.
Backward: Gradient from loss: ∂L/∂y = +0.05 (the loss falls if this weight decreases)
STE: ∂L/∂w = +0.05 (passed straight through)
Update (with learning rate lr = 0.005): w = 0.37 - lr × 0.05 = 0.37 - 0.00025 = 0.36975
The latent weight moves toward 0.3 (the next quantization level down). The forward pass keeps using 0.4 until w crosses the bin boundary at 0.35; after that it uses 0.3. Over many updates, weights tend to settle away from bin boundaries, where small changes no longer flip the quantized value.
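The behavior over many steps can be simulated in a few lines. This sketch assumes a constant gradient of +0.05 and a learning rate of 0.005 (in real training the gradient changes every step), and shows the forward value flipping once the latent weight crosses the bin boundary.

```python
def fake_quant(w, scale):
    """Round to the nearest quantization level and return a float."""
    return round(w / scale) * scale

w, scale, lr, grad = 0.37, 0.1, 0.005, 0.05  # grad is an assumed constant
history = []
for step in range(100):
    y = fake_quant(w, scale)   # forward pass uses the quantized value
    w -= lr * grad             # STE: gradient w.r.t. y is applied to w directly
    history.append(y)

flip = next(i for i, y in enumerate(history) if y < 0.35)
print(f"latent w = {w:.5f}, forward value changes from 0.4 to 0.3 at step {flip}")
```

For roughly the first 80 steps the loss sees no change at all, because the quantized value stays at 0.4 while the latent weight drifts underneath it.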
Knowledge Distillation during compression is another training-time technique. Instead of training the quantized model on the original data labels, train it to match the output distribution of the full-precision teacher model. The teacher's soft probabilities contain more information than hard labels (they reveal inter-class relationships), making the student's task easier.
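L = α × T² × KL(softmax(z_t / T) ‖ softmax(z_s / T)) + (1 - α) × CE(z_s, y)
Here z_t and z_s are the teacher and student logits and y is the true label. α balances between mimicking the teacher (KL divergence) and fitting the true labels (cross-entropy). Temperature scaling (T=2-5) is applied to both teacher and student logits to soften the distributions and reveal more signal.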
```python
import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    """Simulates quantization in forward, passes gradient in backward (STE)."""
    @staticmethod
    def forward(ctx, x, scale, bits):
        qmax = 2**(bits-1) - 1
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q * scale  # dequantize back to float

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None, None  # STE: pass gradient through

class QATLinear(nn.Module):
    """Linear layer with fake quantization for QAT."""
    def __init__(self, in_f, out_f, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.bits = bits

    def forward(self, x):
        # Fake-quantize weights during forward
        scale = self.linear.weight.abs().max() / (2**(self.bits-1) - 1)
        w_q = FakeQuantize.apply(self.linear.weight, scale, self.bits)
        return nn.functional.linear(x, w_q, self.linear.bias)

# Knowledge distillation loss
def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, T=3.0):
    """Combine KL divergence from teacher with CE from labels."""
    soft_student = nn.functional.log_softmax(student_logits / T, dim=-1)
    soft_teacher = nn.functional.softmax(teacher_logits / T, dim=-1)
    kl = nn.functional.kl_div(soft_student, soft_teacher, reduction='batchmean') * T**2
    ce = nn.functional.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce
```
Two training curves: blue is standard training (high train accuracy, but drops after quantization). Orange is QAT (slightly lower train accuracy, but maintains accuracy after quantization). Click "Quantize" to see the post-quantization accuracy gap.
In practice, you rarely use a single compression technique. You combine them: quantize weights to INT4, prune 30% of attention heads, distill into a smaller architecture. The order matters. The interactions matter. This chapter is a hands-on pipeline simulator where you build a complete compression strategy from scratch.
The key insight about compression pipelines is that order matters. Quantize-then-prune is different from prune-then-quantize:
Quantize first, prune second: You quantize all weights (including ones you'll later prune). Then when you prune, the accuracy drop is applied to an already-degraded model. Double penalty.
Prune first, quantize second: You remove the least important weights from the full-precision model (maximum information to decide what to prune). Then you quantize the remaining weights with GPTQ, which can more accurately compensate because there are fewer weights to worry about.
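A toy version of the prune-then-quantize order is sketched below. It uses plain round-to-nearest rather than GPTQ, and both helper functions are illustrative assumptions; the point it shows is narrower: symmetric quantization maps 0.0 to integer 0, so the sparsity created by pruning survives the quantization step.

```python
def magnitude_prune(ws, sparsity):
    """Zero the smallest-magnitude fraction of weights."""
    n_prune = int(len(ws) * sparsity)
    drop = set(sorted(range(len(ws)), key=lambda i: abs(ws[i]))[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(ws)]

def quantize_symmetric(ws, bits=4):
    """Symmetric round-to-nearest; returns dequantized floats."""
    qmax = 2**(bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax
    return [round(w / scale) * scale for w in ws]

ws = [0.5, -0.1, 0.8, -0.3, 0.02, -0.9, 0.03, 0.01]
pruned = magnitude_prune(ws, 0.5)
compressed = quantize_symmetric(pruned)
print(pruned)
print([round(w, 3) for w in compressed])
```

With an asymmetric scheme the pruned zeros would map to a nonzero integer zero_point, which is one reason symmetric quantization pairs naturally with pruned weights.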
Similarly, knowledge distillation can be applied at different stages:
- Before compression: Train a smaller student from scratch using the teacher. Then quantize the student.
- After compression: Fine-tune the quantized model using distillation from the full-precision teacher.
- During compression: Quantize with distillation loss (QAT + distillation simultaneously).
Let's trace through a real scenario. You have a LLaMA-70B model (280 GB FP32, 140 GB FP16) and need to deploy it on a single A100 (80 GB).
| Step | Technique | Memory | Accuracy Impact | Cumulative |
|---|---|---|---|---|
| Baseline | FP16 | 140 GB | 0% | 100% |
| 1 | INT8 Quantization | 70 GB | -0.3% | 99.7% |
| 2 | INT4 GPTQ (group=128) | 35 GB + 5 GB scales | -1.5% | 98.5% |
| 3 | 20% head pruning + finetune | 32 GB | -0.8% | 97.7% |
| 4 | KV cache INT8 | Saves runtime VRAM | -0.1% | 97.6% |
Final result: 70B model running on a single 80GB GPU with 97.6% of original accuracy. The roughly 48 GB of headroom (80 GB minus 32 GB of weights) is used for KV cache (supporting longer contexts) and batch processing.
Start with a baseline model. Apply compression techniques in sequence. Watch memory, speed, and accuracy change. Try different orders!
Play with the simulator above. Try these experiments:
1. Apply INT4 directly to a 70B model. Note the final accuracy.
2. Apply Prune 30% first, then INT4. Is accuracy better or worse?
3. Apply Distill first (to get a smaller model), then INT8. Compare memory and accuracy to INT4 without distillation.
4. Stack everything: Prune 30% → INT4 → see how aggressive you can go.
Real-world pipelines from notable releases:
- Mistral 7B: Trained at FP16, distributed as GPTQ INT4 (group=128) via TheBloke's quants. Also available in GGUF Q4_K_M for llama.cpp.
- LLaMA-3 70B: Trained in BF16, deployed with INT8 weight-only quantization on vLLM, or with AWQ INT4 for single-GPU inference.
- Phi-3 Mini (3.8B): Already small architecture (distillation from larger models during training), then quantized to INT4 for mobile deployment (1.9 GB).
You've compressed your model. Now what? The compressed weights need to be loaded by an inference framework that knows how to execute quantized/sparse operations efficiently on hardware. The choice of framework determines your actual throughput, latency, and hardware utilization.
llama.cpp (Georgi Gerganov) is the pioneering framework for CPU/GPU inference of quantized LLMs. It uses the GGUF file format and supports dozens of quantization schemes (Q4_0, Q4_K_M, Q5_K_S, Q8_0, etc.). Key features:
- Runs on CPU with SIMD optimization (AVX2, ARM NEON)
- Optional GPU offloading (split layers between CPU and GPU)
- Metal backend for Apple Silicon (M1/M2/M3)
- Supports models up to 175B with enough RAM
- Community standard for local inference
vLLM (Berkeley) is the standard for high-throughput GPU serving. Key innovations:
- PagedAttention: manages KV cache like virtual memory pages, eliminating fragmentation and enabling 2-4× higher throughput
- Continuous batching: doesn't wait for all requests in a batch to finish; inserts new requests as slots free up
- Supports AWQ and GPTQ quantized models natively
- Optimized for multiple concurrent users (production serving)
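To make the virtual-memory analogy concrete, here is a toy block allocator in the spirit of PagedAttention. It is a simplified illustration, not vLLM's implementation: the class name, `block_size`, and the methods are all hypothetical, and it tracks only block IDs rather than real tensors.

```python
class PagedKVCache:
    """Toy KV cache that hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block IDs
        self.block_tables = {}   # seq_id -> list of physical block IDs
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token; allocate a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):            # sequence "a": 20 tokens -> 2 blocks
    cache.append_token("a")
for _ in range(5):             # sequence "b": 5 tokens -> 1 block
    cache.append_token("b")
print(len(cache.free_blocks))  # 5 blocks still free
cache.free("a")
print(len(cache.free_blocks))  # 7 blocks free again
```

A sequence only ever wastes the unused tail of its last block, so nothing is reserved up front for the maximum context length. That is where the fragmentation savings come from.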
TensorRT-LLM (NVIDIA) is the highest-performance option for NVIDIA hardware:
- Ahead-of-time compilation: converts model to an optimized engine binary
- FP8 support on Hopper (H100): up to 2× the throughput of FP16 with minimal accuracy loss
- Fused kernels: combines multiple operations into single CUDA kernel calls
- In-flight batching: continuous batching with NVIDIA's optimized scheduler
- Most complex setup but highest throughput on supported hardware
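The benefit of continuous (in-flight) batching is easy to see in a toy simulation. The sketch below counts decode steps for a queue of requests with different output lengths under two scheduling policies. It assumes every step costs the same no matter how many slots are busy, which is a simplification, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every step."""
    queue = deque(lengths)
    active = []          # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step produces one token for every active request.
        active = [r - 1 for r in active if r > 1]
        steps += 1
    return steps


requests = [100, 10, 10, 10, 100, 10, 10, 10]  # output lengths in tokens
print(static_batching_steps(requests, batch_size=4))      # 200
print(continuous_batching_steps(requests, batch_size=4))  # 110
```

With static batching the three short requests in each batch sit idle for 90 steps while the long one finishes. Refilling slots as they free up lets the second long request start at step 11 instead of step 101.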
ONNX Runtime (Microsoft) is the cross-platform option:
- Runs on CPU, GPU, mobile, edge, web (WASM)
- Graph-level optimizations: operator fusion, constant folding
- INT8 quantization through its own quantization toolkit
- Best for: models that need to run on diverse hardware
Hardware-specific optimizations:
| Hardware | Best Precision | Key Feature | Framework |
|---|---|---|---|
| NVIDIA H100 | FP8 | Transformer Engine auto-casting | TensorRT-LLM |
| NVIDIA A100 | INT8 / FP16 | Tensor Cores with sparsity | vLLM, TensorRT-LLM |
| NVIDIA RTX 4090 | INT4 (AWQ) | High memory bandwidth | vLLM, ExLlamaV2 |
| Apple M-series | INT4 (GGUF) | Unified memory (CPU+GPU) | llama.cpp (Metal) |
| CPU (x86) | INT8 / Q4_K_M | AVX-512 / VNNI | llama.cpp, ONNX RT |
| Edge (ARM) | INT8 | NEON SIMD | TFLite, ONNX RT |
CUDA Cores vs Tensor Cores: Regular CUDA cores do scalar math (one multiply-add per clock per core). Tensor Cores multiply small matrix tiles (4×4 in the original design) in a single operation; they're 8-16× faster for matrix operations but only work at specific precisions (FP16, BF16, INT8, FP8, INT4, depending on the GPU generation). Quantized models that use Tensor-Core-friendly types (INT8, INT4) get hardware acceleration. Formats without a native Tensor Core path must first be dequantized to a supported type, which costs extra work.
Compilation vs interpretation: llama.cpp and vLLM interpret the model graph at runtime (flexible, easy to update). TensorRT-LLM compiles the model into a fixed binary optimized for specific input shapes and batch sizes (inflexible, but fastest). ONNX Runtime offers both modes (graph-mode optimization + optional compilation).
```python
# llama.cpp: local inference with quantized model
# Install: pip install llama-cpp-python
from llama_cpp import Llama

model = Llama(
    model_path="./llama-7b-q4_k_m.gguf",
    n_gpu_layers=35,  # offload all layers to GPU
    n_ctx=4096,
)
output = model("The meaning of life is", max_tokens=128)

# vLLM: high-throughput serving
# Install: pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.9,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello!", "What is ML?"], params)

# TensorRT-LLM: maximum throughput (simplified)
# Build engine: trtllm-build --model_dir ./llama-7b-hf \
#     --dtype float16 --use_weight_only --weight_only_precision int4_awq
# Run: mpirun -n 1 python run.py --engine_dir ./engine
```
Estimated tokens/second for a 7B model across different frameworks and precisions. Adjust batch size to see how throughput scales.
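A back-of-the-envelope ceiling for these numbers follows from the memory-bandwidth argument: at batch size 1, every generated token requires reading all the weights once, so tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that bound. The bandwidth figures are rounded approximations of published specs and should be treated as assumptions; real throughput will be lower because of KV cache reads, kernel overhead, and dequantization cost.

```python
def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling on single-stream decode speed."""
    model_gb = params_billion * bits_per_weight / 8  # weights read per token
    return bandwidth_gb_s / model_gb


# Approximate peak memory bandwidth in GB/s (assumed; check your spec sheet).
GPUS = {"A100 80GB": 2000, "RTX 4090": 1000}

for gpu, bandwidth in GPUS.items():
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        ceiling = max_tokens_per_second(7, bits, bandwidth)
        print(f"{gpu:10s} {name}: <= {ceiling:6.0f} tokens/s")
```

Batching amortizes each weight read across several requests, which is why aggregate throughput keeps scaling with batch size until compute, rather than bandwidth, becomes the bottleneck.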
You now have the complete toolkit for model compression. Let's consolidate everything into a decision framework, then push further with derivations and challenges.
Compression Cheat Sheet:
| Technique | Memory Savings | Speed Gain | Accuracy Cost | When to Use |
|---|---|---|---|---|
| FP16/BF16 | 2× | 1-2× | ~0% | Always (free lunch) |
| INT8 PTQ | 4× | 2-4× | 0.1-0.5% | Default for deployment |
| INT4 GPTQ | 8× | 3-4× | 1-3% | Single-GPU constraint |
| INT4 AWQ | 8× | 3-4× | 0.5-2% | When accuracy matters more |
| Structured Pruning | variable | linear with removed params | 1-5% | Need architecture shrinkage |
| Unstructured Pruning | ~0 (without sparse HW) | ~0 (without sparse HW) | small | Only with 2:4 sparsity HW |
| Knowledge Distillation | scales with student size | scales with student size | 2-10% | Need much smaller model |
| QAT | same as PTQ | same as PTQ | 0.5-1% better than PTQ | Shipping millions of copies |
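The memory column of the cheat sheet can be turned into a first-pass check of which weight precision fits a given GPU. This sketch is only an illustration: the 20% reserve for KV cache and activations is an assumed rule of thumb, and `pick_precision` is a hypothetical helper that ignores scale and zero-point overhead.

```python
# Bytes per parameter, ordered from highest precision to lowest.
BYTES_PER_PARAM = [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]


def pick_precision(params_billion, vram_gb, reserve_frac=0.2):
    """Return the highest precision whose weights fit in the non-reserved VRAM."""
    budget_gb = vram_gb * (1 - reserve_frac)
    for name, bytes_per_param in BYTES_PER_PARAM:
        weights_gb = params_billion * bytes_per_param
        if weights_gb <= budget_gb:
            return name, weights_gb
    return None, None  # nothing fits: shrink the model or add GPUs


for params, vram in [(7, 24), (7, 80), (70, 80), (70, 24)]:
    print(f"{params}B on {vram} GB ->", pick_precision(params, vram))
```

A result of `(None, None)` is the signal to reach for the architecture-changing rows of the table (structured pruning or distillation) rather than lower bit-widths.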
The Compression Decision Tree:
Derivation Challenge: Derive the optimal per-channel scale factor that minimizes Mean Squared Error (MSE) between original and dequantized weights.
Given: Weight vector $w \in \mathbb{R}^n$, quantization levels $q \in \{-Q, \dots, Q\}$ where $Q = 2^{b-1} - 1$.
Goal: Find the scale $s^*$ that minimizes $\sum_i \left(w_i - s \cdot \mathrm{clip}(\mathrm{round}(w_i/s), -Q, Q)\right)^2$.
The naive solution $s = \max_i |w_i| / Q$ works but isn't optimal when the weight distribution is non-uniform. The MSE-optimal scale requires solving:
$$s^* = \arg\min_{s > 0} \sum_{i=1}^{n} \left(w_i - s \cdot \mathrm{clip}\!\left(\mathrm{round}(w_i/s), -Q, Q\right)\right)^2$$
This is non-differentiable (round is a step function), but a grid search works: try $s$ values in $[0.5 \cdot \max_i |w_i| / Q,\ \max_i |w_i| / Q]$ with 100 steps and pick the one with the lowest MSE. In practice, the MSE-optimal scale is typically 0.7-0.9× the max/Q scale: it clips outliers to reduce the average error.
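The grid search can be sketched in a few lines of plain Python. This is one possible implementation under stated assumptions: the lower end of the search range (`lo_frac`), the synthetic Gaussian weights, and the function names are choices made for this example. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which is harmless here.

```python
import random


def quant_mse(weights, scale, q_max):
    """MSE between weights and their quantize-dequantize round trip."""
    total = 0.0
    for w in weights:
        q = max(-q_max, min(q_max, round(w / scale)))  # round, then clip to [-Q, Q]
        total += (w - q * scale) ** 2
    return total / len(weights)


def best_scale(weights, bits=4, lo_frac=0.5, steps=100):
    """Grid-search the scale between lo_frac * (max/Q) and max/Q."""
    q_max = 2 ** (bits - 1) - 1
    naive = max(abs(w) for w in weights) / q_max
    # Start from the naive scale so the result is never worse than max/Q.
    best_s, best_err = naive, quant_mse(weights, naive, q_max)
    for k in range(steps):
        s = naive * (lo_frac + (1 - lo_frac) * k / steps)
        err = quant_mse(weights, s, q_max)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err, naive


rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic bell-shaped weights
scale, err, naive = best_scale(w, bits=4)
naive_err = quant_mse(w, naive, 7)
print(f"naive scale {naive:.4f}  MSE {naive_err:.5f}")
print(f"best  scale {scale:.4f}  MSE {err:.5f}  ({scale / naive:.2f}x naive)")
```

Because the naive scale is always one of the candidates, the search can never do worse than max/Q. How much better it does depends on how heavy the tails of `w` are. For per-channel quantization, run the same search once per output channel.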
Connections to other lessons:
- Transformer — The architecture we're compressing. Understanding attention and MLP structure helps you know which components to prune.
- GPT — Autoregressive inference is memory-bound, making quantization especially effective.
- SSM/Mamba — Alternative architectures with better memory efficiency by design (no KV cache). Sometimes the best "compression" is a better architecture.
What we didn't cover:
- Mixed-precision quantization: Different layers get different bit-widths (sensitive layers stay at INT8, others go to INT4).
- Neural Architecture Search for efficient models: Designing architectures that are inherently small (MobileNet, EfficientNet).
- Speculative decoding: Use a tiny draft model to predict tokens, verify with the large model in parallel. Not compression per se, but achieves 2-3× speedup.
- LoRA/QLoRA: Low-rank adaptation + quantization for fine-tuning. Covered in its own lesson.
"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman. In compression: always measure accuracy on YOUR task with YOUR data. Published benchmark numbers are someone else's guarantee, not yours.