Model Compression — From 280GB to Fits-on-Your-Laptop

Chapter 0: Why Compress?

You've just trained (or downloaded) a 70-billion-parameter language model. It's brilliant — writes poetry, debugs code, answers trivia. One problem: you can't actually run it.

Here's the arithmetic. Each parameter is stored as a float32 number, which takes 4 bytes. So 70 billion parameters × 4 bytes = 280 gigabytes. Your best data-center GPU — an NVIDIA A100 — has 80 GB of VRAM. You can't even load the model into memory, let alone run inference on it.

And that's just the weights. During inference, you also need memory for activations (the intermediate values computed at each layer), the KV cache (stored attention keys and values for prior tokens), and framework overhead. A 70B model at float32 needs roughly 300-350 GB just to generate a single token.

What about smaller models? A 7B model is 28 GB in float32. That fits on an 80 GB A100, but not on a consumer GPU like the RTX 4090 (24 GB). Even a 1.5B model (6 GB) wastes most of its memory budget on full-precision floats when it could run perfectly well at lower precision.

This is the deployment gap: the difference between "model works in the lab" and "model works in production." Model compression bridges this gap.

But memory isn't the only problem. There's also latency. Moving data from memory to compute units is the real bottleneck in modern hardware. A float32 weight is 4 bytes that must travel across the memory bus. An int8 weight is 1 byte — 4× less data to move. Since most LLM inference is memory-bandwidth-bound (not compute-bound), cutting the data size by 4× can yield nearly 4× speedup.
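To put rough numbers on the bandwidth argument, here is a minimal sketch of the upper bound on single-request decoding speed, assuming every generated token has to read all the weights once. The 2 TB/s bandwidth figure and the function name are illustrative assumptions; real throughput is lower because of KV-cache reads and kernel overhead.

```python
def max_tokens_per_second(params_billions, bytes_per_param, bandwidth_gb_per_s):
    """Upper bound on decode speed when every token reads all weights once."""
    weight_gb = params_billions * bytes_per_param  # decimal GB
    return bandwidth_gb_per_s / weight_gb

BANDWIDTH_GB_S = 2000  # assumed: roughly 2 TB/s of GPU memory bandwidth

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    tps = max_tokens_per_second(7, bytes_per_param, BANDWIDTH_GB_S)
    print(f"7B @ {name}: at most {tps:.0f} tokens/s")
```

Because the bound is inversely proportional to bytes per parameter, each halving of precision doubles the ceiling, which is where the "nearly 4×" figure for float32 to int8 comes from.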

The core tension: Bigger models are smarter. But bigger models need more memory, more bandwidth, more electricity, and more expensive hardware. Compression asks: "How much quality can we preserve while drastically cutting the cost?" The answer, surprisingly, is: almost all of it.

Let's do the memory math for common model sizes. This is the arithmetic every ML engineer does before deployment:

Model	Params	FP32 (4 bytes)	FP16 (2 bytes)	INT8 (1 byte)	INT4 (0.5 byte)
GPT-2	1.5B	6 GB	3 GB	1.5 GB	0.75 GB
LLaMA-7B	7B	28 GB	14 GB	7 GB	3.5 GB
LLaMA-13B	13B	52 GB	26 GB	13 GB	6.5 GB
LLaMA-70B	70B	280 GB	140 GB	70 GB	35 GB
GPT-3	175B	700 GB	350 GB	175 GB	87.5 GB

Look at that table. A 70B model at INT4 is 35 GB — it fits on a single A100 with room to spare for KV cache. At INT8, it's 70 GB — tight, but possible. The same model at FP32 requires a minimum of 4 GPUs.
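The GPU counts in the paragraph above follow from a ceiling division. The sketch below counts weights only (no activations, KV cache, or framework overhead) and assumes 80 GB per GPU; the function name is illustrative.

```python
import math

def min_gpus_for_weights(params_billions, bytes_per_param, gpu_gb=80):
    """Smallest GPU count whose combined memory holds the weights alone."""
    weight_gb = params_billions * bytes_per_param  # decimal GB, as in the table
    return math.ceil(weight_gb / gpu_gb)

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    n = min_gpus_for_weights(70, bytes_per_param)
    print(f"70B @ {name}: {70 * bytes_per_param:g} GB -> at least {n} x 80 GB GPU")
```

Treat the result as a lower bound: a configuration that only just holds the weights leaves no room for the KV cache.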

The formula: Memory (bytes) = Parameters × Bytes-per-parameter. That's it. A 70B model at 4 bits/param = 70 × 10⁹ × 0.5 bytes = 35 GB. Memorize this — you'll use it every day.

GPU Memory Calculator

Set your model size and data type. The bars show whether the model fits in various GPU memory capacities.

The deployment constraints are real. Edge devices (phones, robots) have 4-16 GB of RAM. Consumer GPUs top out at 24 GB. Even cloud inference has cost pressure — fewer GPUs per request means lower cost per token. Compression isn't optional for production ML. It's required.

Over the next 9 chapters, we'll learn every major compression technique: quantization (representing weights with fewer bits), pruning (removing unnecessary weights), knowledge distillation (training a small model to mimic a large one), and how to combine them into a complete compression pipeline. By the end, you'll be able to take any model and systematically shrink it for deployment.

python
# The deployment arithmetic every ML engineer memorizes
def model_memory_gb(params_billions, bits_per_param):
    """Calculate model weight memory in GiB (1 GiB = 2**30 bytes)."""
    bytes_per_param = bits_per_param / 8
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / (1024**3)

# 70B model at different precisions
for bits in [32, 16, 8, 4]:
    mem = model_memory_gb(70, bits)
    print(f"70B @ {bits}-bit: {mem:.1f} GiB")

# Output:
# 70B @ 32-bit: 260.8 GiB
# 70B @ 16-bit: 130.4 GiB
# 70B @ 8-bit: 65.2 GiB
# 70B @ 4-bit: 32.6 GiB

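Quick reality check: The code above divides by 2³⁰, so it reports binary gigabytes (GiB), which is why it prints 260.8 rather than 280. The table and the prose in this lesson use decimal gigabytes (1 GB = 10⁹ bytes) because the arithmetic is easier to do in your head. The two differ by about 7%. Keep the binary figure in mind when comparing against nvidia-smi, which reports memory in MiB (2²⁰ bytes).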

A 13B parameter model stored in FP16 (2 bytes per parameter) requires approximately how much memory?

13 GB
52 GB
26 GB
6.5 GB

Chapter 1: Quantization Fundamentals

Quantization is the single most important compression technique. The idea is deceptively simple: instead of storing each weight as a 32-bit floating point number (which can represent ~4 billion distinct values), store it as an 8-bit integer (256 distinct values) or even a 4-bit integer (16 distinct values). You lose some precision, but you save 4-8× memory.

Think of it like this. A high-resolution photograph uses millions of colors. A cartoon uses maybe 16 colors. The cartoon conveys the same scene — you recognize the same objects, faces, and emotions — despite using vastly fewer color levels. Quantization does the same thing to neural network weights: it reduces the "color depth" of the numbers while preserving the information that matters.

The formal operation has two parts: quantize (map a floating-point number to an integer) and dequantize (map back to approximate floating-point).

Quantize: q = round(x / scale) + zero_point

Dequantize: x̂ = (q − zero_point) × scale

Where scale determines the step size between representable values, and zero_point shifts the range so that 0.0 in float maps to a specific integer. Let's work through a complete example with actual numbers.

Key insight: Quantization is a lossy compression. The dequantized value x̂ is NOT equal to the original x — there's always a small error. The art of quantization is choosing scale and zero_point to minimize this error across all the weights.

Worked Example: Quantize the float32 values [0.5, 1.2, -0.3, 0.8] to INT8 (range 0 to 255) using min/max calibration.

Step 1: Find the range. We need to map our float values into the integer range [0, 255].

x_min = -0.3, x_max = 1.2

Step 2: Compute the scale. Scale = (x_max - x_min) / (q_max - q_min)

scale = (1.2 − (−0.3)) / (255 − 0) = 1.5 / 255 = 0.005882

Step 3: Compute the zero_point. Zero_point is the integer that represents float 0.0.

zero_point = round(q_min − x_min / scale) = round(0 + 0.3 / 0.005882) = round(51.0) = 51

Step 4: Quantize each value. Since x_min maps to integer 0, each value's integer is its distance from x_min measured in steps: q = round((x − x_min) / scale). This gives the same integers as q = round(x / scale) + zero_point here, because −x_min / scale is exactly 51.

q(0.5) = round((0.5 − (−0.3)) / 0.005882) = round(0.8 / 0.005882) = round(136.0) = 136

q(1.2) = round((1.2 − (−0.3)) / 0.005882) = round(1.5 / 0.005882) = round(255.0) = 255

q(−0.3) = round((−0.3 − (−0.3)) / 0.005882) = round(0 / 0.005882) = round(0) = 0

q(0.8) = round((0.8 − (−0.3)) / 0.005882) = round(1.1 / 0.005882) = round(187.0) = 187

Step 5: Dequantize to check the error. x̂ = q × scale + x_min

x̂(136) = 136 × 0.005882 + (−0.3) = 0.800 − 0.3 = 0.500 (error: 0.000)

x̂(255) = 255 × 0.005882 + (−0.3) = 1.500 − 0.3 = 1.200 (error: 0.000)

x̂(0) = 0 × 0.005882 + (−0.3) = 0 − 0.3 = −0.300 (error: 0.000)

x̂(187) = 187 × 0.005882 + (−0.3) = 1.100 − 0.3 = 0.800 (error: 0.000)

In this toy example, the errors are essentially zero because our values happened to map cleanly. In practice, with thousands of weight values spanning a wider range, the rounding errors accumulate. The question becomes: how much accumulated error can a neural network tolerate before its outputs degrade?
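To see what the error looks like on a larger tensor, the sketch below applies the same min/max recipe to 10,000 random values standing in for real weights (the Gaussian distribution and the seed are assumptions for illustration). The per-value error never exceeds half a step, scale/2; what grows with tensor size is the number of weights carrying such an error.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
x = rng.normal(0.0, 0.5, size=10_000)     # stand-in for a weight tensor

scale = (x.max() - x.min()) / 255         # INT8 min/max calibration
q = np.round((x - x.min()) / scale)       # integers in [0, 255]
x_hat = q * scale + x.min()               # dequantized values

err = np.abs(x - x_hat)
print(f"step size (scale): {scale:.6f}")
print(f"max abs error:     {err.max():.6f}  (bound: scale/2 = {scale / 2:.6f})")
print(f"mean abs error:    {err.mean():.6f}")
```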

Common misconception: "Quantization just rounds numbers, so it must destroy the model." In reality, neural networks are remarkably robust to small perturbations in their weights. INT8 quantization typically degrades accuracy by less than 1% on most tasks. INT4 quantization loses 2-5%. The weights contain far more precision than the model actually needs.

python
import numpy as np

# Step-by-step quantization from scratch
def quantize_minmax(x, num_bits=8):
    """Quantize float array to unsigned integers using min/max calibration."""
    q_min, q_max = 0, 2**num_bits - 1  # 0 to 255 for 8-bit
    x_min, x_max = x.min(), x.max()

    # Step size between adjacent integer levels
    scale = (x_max - x_min) / (q_max - q_min)

    # Quantize: float -> int. Clip BEFORE the cast so out-of-range values
    # cannot wrap around. uint8 storage is only valid for num_bits <= 8.
    q = np.clip(np.round((x - x_min) / scale), q_min, q_max).astype(np.uint8)

    return q, scale, x_min

def dequantize(q, scale, x_min):
    """Dequantize integer back to float."""
    return q.astype(np.float32) * scale + x_min

# Example
x = np.array([0.5, 1.2, -0.3, 0.8], dtype=np.float32)
q, scale, x_min = quantize_minmax(x, num_bits=8)
x_hat = dequantize(q, scale, x_min)

print(f"Original:    {x}")
print(f"Quantized:   {q}")
print(f"Dequantized: {x_hat}")
print(f"Max error:   {np.abs(x - x_hat).max():.6f}")

# Using PyTorch (one-liner), which takes the scale and integer zero_point directly
import torch
x_t = torch.tensor([0.5, 1.2, -0.3, 0.8])
q_t = torch.quantize_per_tensor(x_t, scale=0.005882, zero_point=51, dtype=torch.quint8)
print(q_t.int_repr())  # the stored integers

Quantization Staircase Effect

A smooth sine wave (teal) quantized to discrete levels (orange). Drag the bit-width slider to see how fewer bits create a coarser staircase.

Notice from the visualization: at 8 bits (256 levels), the staircase is barely visible — the quantized wave nearly perfectly overlaps the original. At 4 bits (16 levels), you can clearly see the steps. At 2 bits (4 levels), the signal is barely recognizable. This is why INT8 quantization works so well: 256 levels is enough to faithfully represent most weight distributions.
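The same comparison can be made numerically. This is a small sketch, assuming min/max uniform quantization of a sine wave with frequency 2.0; the helper name is illustrative.

```python
import numpy as np

def quantize_uniform(x, num_bits):
    """Min/max uniform quantization followed by dequantization."""
    levels = 2**num_bits - 1
    scale = (x.max() - x.min()) / levels
    q = np.round((x - x.min()) / scale)
    return q * scale + x.min()

t = np.linspace(0.0, 1.0, 1000)
signal = np.sin(2 * np.pi * 2.0 * t)   # frequency 2.0

rms = {}
for bits in (8, 4, 2):
    err = signal - quantize_uniform(signal, bits)
    rms[bits] = float(np.sqrt(np.mean(err**2)))
    print(f"{bits}-bit: {2**bits:3d} levels, RMS error = {rms[bits]:.4f}")
```

Each bit removed doubles the step size, so the error grows roughly geometrically as the bit width shrinks.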

You have a weight tensor with values ranging from -2.0 to +2.0. You quantize to INT8 (0-255). What is the scale factor?

0.0078 (= 2.0 / 255)
0.0157 (= 4.0 / 255)
0.0314 (= 8.0 / 255)
0.5 (= 128 / 256)

Chapter 2: Advanced Quantization

Chapter 1 taught the basics: one scale and one zero_point for an entire tensor. But real neural networks have weight matrices with highly non-uniform value distributions. Some channels have weights in [-0.01, 0.01] while others span [-2.0, 2.0]. Using a single scale for both wastes precision on the small-range channels.

The solution: don't quantize the whole tensor with one scale. Quantize smaller groups of weights, each with their own scale and zero_point. This is the spectrum from coarse to fine-grained quantization.

Per-tensor quantization uses one scale for the entire weight matrix. It's the cheapest (just 1 extra scale parameter per tensor) but the least accurate. If your tensor has outlier values in one corner, the scale must accommodate them, wasting precision everywhere else.

Per-channel quantization (also called per-row or per-column) uses one scale per output channel. In a weight matrix of shape [out_features, in_features], you get one scale per row. This is the standard for most production systems because each output neuron learns a different magnitude of weights.

Per-group quantization splits each channel into groups of G weights (typically G=32, 64, or 128) and quantizes each group independently. This is what GPTQ and AWQ use. More scales = more metadata overhead, but dramatically better accuracy.

The tradeoff: Finer granularity = better accuracy but more metadata (scale/zero_point values). For a weight matrix of shape [N, K]: per-tensor stores 1 scale, per-channel stores N scales, and per-group (G=128) stores N×K/128 scales. If each scale is stored in FP16, per-group with G=128 adds 16/128 = 0.125 bits of overhead per weight (more if a zero_point is stored per group as well).
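The overhead arithmetic is simple enough to tabulate for the common group sizes. The sketch below assumes one FP16 scale per group and, optionally, a zero_point of a chosen width; the function name and the 4-bit zero_point in the usage note are illustrative assumptions.

```python
def overhead_bits_per_weight(group_size, scale_bits=16, zero_point_bits=0):
    """Extra bits per weight spent on per-group metadata."""
    return (scale_bits + zero_point_bits) / group_size

for g in (32, 64, 128):
    extra = overhead_bits_per_weight(g)
    print(f"G={g:3d}: +{extra:.3f} bits/weight -> {4 + extra:.3f} effective bits at INT4")
```

For example, `overhead_bits_per_weight(128, zero_point_bits=4)` accounts for a 4-bit zero_point stored alongside each scale.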

Symmetric vs Asymmetric quantization. So far we've used asymmetric quantization: the float range [x_min, x_max] maps to the full int range [0, 255]. Symmetric quantization constrains the mapping so that float 0.0 maps to integer 0. This means the range [-|max|, +|max|] is mapped to [-127, +127] for signed int8 (the extra value -128 is typically left unused so the range stays symmetric).

Symmetric: q = round(x / scale), where scale = max(|x|) / 127

Asymmetric: q = round(x / scale) + zp, where scale = (x_max − x_min) / 255 and zp = round(−x_min / scale)

Symmetric is faster at inference (no zero_point subtraction in the inner loop) but wastes range when the weight distribution is skewed. Most modern systems use symmetric for weights (which are roughly centered at 0) and asymmetric for activations (which are often positive-only after ReLU).
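The "wasted range" point can be checked with a small experiment. The sketch below uses synthetic data (Gaussian weights, and Gaussian values passed through ReLU as stand-in activations); these distributions are assumptions for illustration, not measurements from a real model.

```python
import numpy as np

def symmetric_error(x):
    """Signed int8, float 0.0 maps to integer 0, range [-127, 127]."""
    scale = np.abs(x).max() / 127
    q = np.clip(np.round(x / scale), -127, 127)
    return np.abs(x - q * scale).mean()

def asymmetric_error(x):
    """Unsigned int8 with an integer zero_point, range [0, 255]."""
    scale = (x.max() - x.min()) / 255
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255)
    return np.abs(x - (q - zero_point) * scale).mean()

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=10_000)                     # centered at zero
activations = np.maximum(rng.normal(0.0, 1.0, size=10_000), 0)  # post-ReLU, >= 0

print(f"weights:     sym={symmetric_error(weights):.5f}  asym={asymmetric_error(weights):.5f}")
print(f"activations: sym={symmetric_error(activations):.5f}  asym={asymmetric_error(activations):.5f}")
```

On the positive-only data, symmetric quantization never uses its negative integers, so its step is about twice as large as the asymmetric one.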

GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2022) is a second-order method. Instead of simply rounding each weight independently, it quantizes one column at a time and uses the Hessian matrix (second derivatives of the layer's output error with respect to its weights) to compensate the remaining columns for the error introduced. The key insight: when you round column j, you can slightly adjust columns j+1, j+2, ... to partially cancel the rounding error.

AWQ (Activation-Aware Weight Quantization, 2023) observes that not all weights are equally important. Some weights, when quantized, cause much larger output errors than others — specifically, weights that multiply large activations. AWQ identifies these "salient" weights by looking at activation magnitudes from calibration data, then scales them up before quantization (and scales the activations down to compensate). This gives important weights more of the quantization range.
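The scaling trick can be illustrated on synthetic data. This is a toy sketch of the idea only, not the AWQ implementation: the 50× outlier channel, the fixed factor s = 2, and the per-row 4-bit quantizer are all assumptions chosen for illustration, whereas the real method searches for per-channel scales using activation statistics.

```python
import numpy as np

def quantize_rows_int4(W):
    """Symmetric 4-bit quantize-dequantize with one scale per row."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7
    return np.clip(np.round(W / scale), -7, 7) * scale

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, size=(64, 64))   # [out_features, in_features]
X = rng.normal(0.0, 1.0, size=(256, 64))  # calibration activations
X[:, 0] *= 50.0                           # input channel 0 carries large activations

reference = X @ W.T

# Plain rounding: every weight column is treated the same.
err_plain = np.abs(X @ quantize_rows_int4(W).T - reference).mean()

# AWQ-style: scale the salient weight column up and the matching activation down.
s = np.ones(64)
s[0] = 2.0                                # illustrative choice, not a tuned value
W_scaled = W * s                          # column j multiplied by s[j]
X_scaled = X / s                          # channel j divided by s[j]
err_awq = np.abs(X_scaled @ quantize_rows_int4(W_scaled).T - reference).mean()

print(f"mean output error, plain rounding: {err_plain:.4f}")
print(f"mean output error, scaled column:  {err_awq:.4f}")
```

Before quantization the scaled pair computes exactly the same output; the benefit appears only after rounding, because the salient column's rounding error is divided by s when the activations are scaled back down.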

GGUF format (by Georgi Gerganov, the llama.cpp author) is the standard file format for quantized models on CPU. It supports many quantization schemes: Q4_0 (4-bit, groups of 32, symmetric), Q4_K_M (4-bit with k-quant medium quality), Q8_0 (8-bit), etc. Each "Q" variant uses different group sizes and scale formats, trading metadata size for accuracy.

Common misconception: "INT4 quantization loses 4x the information of INT8." Wrong. The relationship is non-linear. Going from 32-bit to 8-bit usually loses less than 1% accuracy. Going from 8-bit to 4-bit might lose 3-5%. The first compression is nearly free; the last bits are expensive. This is because weight distributions are roughly Gaussian — most weights cluster near zero where 4-bit precision is adequate.

python
import numpy as np

# Per-channel vs per-tensor quantization comparison
def quantize_per_tensor(W, bits=8):
    """One scale for entire matrix."""
    qmax = 2**(bits-1) - 1  # 127 for int8
    scale = np.abs(W).max() / qmax
    q = np.round(W / scale).clip(-qmax, qmax).astype(np.int8)
    return q, scale

def quantize_per_channel(W, bits=8):
    """One scale per row (output channel)."""
    qmax = 2**(bits-1) - 1
    scales = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.round(W / scales).clip(-qmax, qmax).astype(np.int8)
    return q, scales

# Simulate: channel 0 has tiny weights, channel 1 has large ones
W = np.array([
    [0.01, -0.02, 0.015, -0.005],   # small range
    [1.5,  -2.0,  0.8,   -1.2],    # large range
], dtype=np.float32)

# Per-tensor: scale = 2.0/127 = 0.01575
# Channel 0 values ≈ 0.01 → quantized to 0 or 1! Massive relative error.
q_tensor, s_t = quantize_per_tensor(W)
print(f"Per-tensor scale: {s_t:.5f}")
print(f"Channel 0 quantized: {q_tensor[0]}")  # [1, -1, 1, 0] - terrible!

# Per-channel: channel 0 gets scale = 0.02/127 = 0.000157
q_channel, s_c = quantize_per_channel(W)
print(f"Per-channel scales: {s_c.flatten()}")
print(f"Channel 0 quantized: {q_channel[0]}")  # about [64, -127, 95, -32], much better! (0.01/scale lands on 63.5, so the first entry may print as 63)

Per-Tensor vs Per-Channel Quantization

A 4×8 weight matrix where rows have different scales. Left: per-tensor (one color scale for all). Right: per-channel (each row gets its own scale). Notice how per-channel preserves detail in low-magnitude rows.

Worked example: per-group quantization. Take a row of 8 weights: [0.1, 0.2, 0.15, 0.12, 2.0, -1.8, 1.5, -2.1]. With group size G=4:

Group 1: [0.1, 0.2, 0.15, 0.12]. Max |w| = 0.2. Scale = 0.2/7 = 0.0286 (for 4-bit, range -8 to 7).

Group 2: [2.0, -1.8, 1.5, -2.1]. Max |w| = 2.1. Scale = 2.1/7 = 0.300.

Without grouping, the single scale would be 2.1/7 = 0.300 for all 8 values. Group 1 values (all ~0.1-0.2) would quantize to 0 or 1, destroying their differences. With grouping, they get their own fine-grained scale and quantize to 4, 7, 5, 4 (0.1/0.0286 = 3.5 rounds up to 4), which preserves most of their relative magnitudes.
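The worked example above can be reproduced with a few lines of NumPy. This is a minimal sketch using symmetric 4-bit quantization; the function name is illustrative, and values that sit exactly on a rounding boundary (such as 0.1 in group 1) may land on either neighbour depending on floating-point rounding.

```python
import numpy as np

def quantize_groups(row, group_size, bits=4):
    """Symmetric per-group quantization of a 1-D row of weights."""
    qmax = 2**(bits - 1) - 1                       # 7 for 4-bit
    groups = row.reshape(-1, group_size)           # one group per row
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

row = np.array([0.1, 0.2, 0.15, 0.12, 2.0, -1.8, 1.5, -2.1])

q_grouped, s_grouped = quantize_groups(row, group_size=4)
q_single, s_single = quantize_groups(row, group_size=8)   # one scale for the whole row

print("per-group scales: ", s_grouped.ravel())
print("per-group ints:   ", q_grouped.ravel())
print("single scale:     ", s_single.ravel())
print("single-scale ints:", q_single.ravel())
```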

Why does per-channel quantization outperform per-tensor for weight matrices?

Per-channel uses more bits per weight
Different output channels learn different magnitude ranges, and per-channel gives each its own scale
Per-channel is faster to compute
Per-channel doesn't require calibration data

Chapter 3: Pruning & Sparsity

Quantization keeps all the weights but makes each one smaller. Pruning takes a different approach: remove weights entirely. Set them to exactly zero. If a weight is already near zero, it's barely contributing to the output — removing it should have minimal impact.

Think of it like editing a book. Quantization is like printing in a smaller font (same content, less space). Pruning is like cutting sentences (less content, but hopefully the unimportant ones). A good editor knows which sentences are load-bearing and which are filler. Pruning algorithms try to identify the "filler" weights.

The simplest approach is magnitude pruning: sort all weights by their absolute value |w|, then zero out the smallest k%. The intuition is that small weights contribute small amounts to the output, so removing them causes minimal damage.

Worked example: Prune 50% of a weight matrix.

W = [[0.5, −0.1, 0.8], [0.02, −0.9, 0.03], [−0.7, 0.04, 0.6]]

Step 1: Flatten and sort by magnitude: |0.02|, |0.03|, |0.04|, |0.1|, |0.5|, |0.6|, |0.7|, |0.8|, |0.9|

Step 2: 50% of 9 = 4.5, round to 4. Remove the 4 smallest: 0.02, 0.03, 0.04, -0.1

Step 3: Set those to zero in the original matrix:

W_pruned = [[0.5, 0, 0.8], [0, −0.9, 0], [−0.7, 0, 0.6]]

We now have a sparse matrix — 44% of entries are zero. But here's the catch: a sparse matrix stored naively takes the same memory as a dense one (you still store the zeros). To get actual memory savings, you need a sparse storage format like CSR (Compressed Sparse Row) or a bitmask.
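To make the storage point concrete, the sketch below builds the three CSR arrays by hand for the pruned matrix above and compares byte counts. It assumes float32 values and int32 indices; other index widths change the break-even point.

```python
import numpy as np

W_pruned = np.array([
    [0.5,  0.0, 0.8],
    [0.0, -0.9, 0.0],
    [-0.7, 0.0, 0.6],
], dtype=np.float32)

# Build the three CSR arrays by hand.
nonzero = W_pruned != 0
values = W_pruned[nonzero]                             # non-zero entries, row by row
col_idx = np.nonzero(W_pruned)[1].astype(np.int32)     # column of each non-zero
row_ptr = np.concatenate(([0], np.cumsum(nonzero.sum(axis=1)))).astype(np.int32)

dense_bytes = W_pruned.nbytes
csr_bytes = values.nbytes + col_idx.nbytes + row_ptr.nbytes
print(f"dense: {dense_bytes} bytes, CSR: {csr_bytes} bytes")
```

At 44% sparsity the CSR form is actually larger than the dense one, because every surviving value also needs a column index. With 4-byte values and 4-byte indices, CSR only starts to pay off once more than about half the entries are zero.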

Critical distinction: Unstructured pruning can zero ANY individual weight. This gives maximum flexibility but hardware can't exploit it efficiently — you still need to load the entire matrix and check a bitmask. Structured pruning removes entire rows, columns, attention heads, or layers. This changes the matrix dimensions, giving real memory savings and speedups without special sparse hardware.

Unstructured sparsity requires special hardware or software support to accelerate. NVIDIA's Ampere architecture supports 2:4 structured sparsity (in every group of 4 consecutive weights, at least 2 must be zero), giving up to 2× speedup on Tensor Cores. But arbitrary unstructured sparsity gets no such hardware acceleration on current GPUs.
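A 2:4 pattern is easy to produce by magnitude: in each group of 4, keep the two largest weights. The sketch below shows only the masking step in NumPy, under the assumption that groups run along each row; it does not produce the compressed format that the hardware actually consumes.

```python
import numpy as np

def prune_2_to_4(W):
    """Keep the 2 largest-magnitude weights in every group of 4 along each row."""
    rows, cols = W.shape
    assert cols % 4 == 0, "row length must be a multiple of 4"
    groups = W.reshape(rows, cols // 4, 4)
    # Indices of the 2 smallest |w| in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

W = np.array([[0.5, -0.1, 0.8, -0.3, 0.02, -0.9, 0.03, 0.01]], dtype=np.float32)
print(prune_2_to_4(W))
```

Note that the second group keeps 0.03 even though it is tiny: the pattern forces exactly two survivors per group regardless of their absolute size.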

Structured pruning is more practical. Remove entire neurons (a full row of the weight matrix) and the corresponding column of the next layer's weight matrix. This literally shrinks the tensor dimensions. A 4096×4096 matrix pruned to remove 25% of neurons becomes 3072×4096 — genuinely smaller, genuinely faster.
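The row-plus-column bookkeeping looks like this on a toy two-layer network. Everything here is synthetic: the layer sizes are arbitrary, and neuron 5 is zeroed by hand so that removing it is provably lossless, which would not be exactly true for a real neuron with small but non-zero weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))    # layer 1: 4 inputs -> 8 hidden neurons
W2 = rng.normal(size=(3, 8))    # layer 2: 8 hidden neurons -> 3 outputs
W1[5] = 0.0                     # pretend neuron 5 learned nothing useful

def forward(W1, W2, x):
    hidden = np.maximum(W1 @ x, 0.0)   # ReLU
    return W2 @ hidden

keep = np.array([i for i in range(8) if i != 5])
W1_small = W1[keep]        # drop row 5 of this layer...
W2_small = W2[:, keep]     # ...and column 5 of the next layer

x = rng.normal(size=4)
print(W1.shape, W2.shape, "->", W1_small.shape, W2_small.shape)
print("outputs match:", np.allclose(forward(W1, W2, x), forward(W1_small, W2_small, x)))
```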

The lottery ticket hypothesis (Frankle & Carbin, 2019) made a stunning claim: within a large randomly-initialized network, there exists a small subnetwork (the "winning ticket") that, if trained in isolation from the same initialization, would match the full network's accuracy. This suggests pruning isn't just removing unimportant weights. It's finding the essential structure that was always there.

python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest `sparsity` fraction of weights."""
    flat = np.abs(W).flatten()
    threshold = np.percentile(flat, sparsity * 100)
    mask = np.abs(W) >= threshold
    return W * mask, mask

def structured_prune_neurons(W, sparsity):
    """Remove entire rows (neurons) with smallest L2 norm."""
    norms = np.linalg.norm(W, axis=1)
    n_remove = int(W.shape[0] * sparsity)
    # Sort the surviving indices so the kept rows stay in their original order
    keep_idx = np.sort(np.argsort(norms)[n_remove:])
    return W[keep_idx], keep_idx

# Example: 4x4 weight matrix
W = np.array([
    [0.5,  -0.1, 0.8,  -0.3],
    [0.02, -0.9, 0.03, 0.01],  # will survive (has -0.9)
    [-0.7, 0.04, 0.6,  -0.5],
    [0.01, 0.02, -0.01,0.03],  # entire row is tiny → prune
], dtype=np.float32)

# Unstructured: 50% sparsity
W_sparse, mask = magnitude_prune(W, 0.5)
print(f"Non-zeros: {mask.sum()} / {mask.size}")

# Structured: remove 25% of neurons
W_struct, kept = structured_prune_neurons(W, 0.25)
print(f"Shape: {W.shape} → {W_struct.shape}")  # (4,4) → (3,4)

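# PyTorch structured pruning
import torch
import torch.nn.utils.prune as prune
layer = torch.nn.Linear(4, 4)  # example layer to prune
# Masks 25% of rows (dim=0) by L2 norm; the tensor keeps its shape
prune.ln_structured(layer, 'weight', amount=0.25, n=2, dim=0)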

Interactive Pruning Visualizer

A weight matrix shown as colored cells. Drag the sparsity slider to prune weights by magnitude. Watch cells go dark (zeroed). The histogram below shows the weight distribution with the pruning threshold.

There's a practical question: at what sparsity does accuracy collapse? Empirically, most networks tolerate 50-80% unstructured sparsity with minimal accuracy loss, especially if you fine-tune after pruning (give the remaining weights a chance to adapt). Beyond 90%, accuracy degrades rapidly. The exact threshold depends on the model, task, and how carefully you prune.

The iterative approach works best: prune a small amount (e.g., 20%), fine-tune for a few epochs, prune another 20% of the remaining weights, fine-tune again. This iterative magnitude pruning reaches higher sparsity with less accuracy loss than one-shot pruning because the fine-tuning steps allow the network to redistribute importance among surviving weights.
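Because each round prunes a fraction of the survivors rather than of the original weights, overall sparsity grows geometrically. The sketch below just evaluates 1 − (1 − p)^k for the 20% example; the function name is illustrative.

```python
def sparsity_after(rounds, prune_fraction=0.2):
    """Overall sparsity after pruning `prune_fraction` of the survivors each round."""
    remaining = (1.0 - prune_fraction) ** rounds
    return 1.0 - remaining

for k in (1, 2, 3, 5, 10):
    print(f"after {k:2d} rounds: {sparsity_after(k):.1%} of weights are zero")
```

Two rounds give 36% sparsity rather than 40%, and it takes about ten rounds at this rate to approach 90%.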

You prune 75% of a 4096×4096 weight matrix using UNSTRUCTURED magnitude pruning. How much ACTUAL memory do you save without special sparse format support?

None: you still store the zeros in a dense matrix
75%: zeros don't take space
50%: the mask takes some space

Chapter 4: Profiling & Analysis

Before you compress a model, you need to answer: where are the bottlenecks? Is the model slow because of too many computations (compute-bound) or because of too much data movement (memory-bound)? The answer determines which compression technique will help most.

The key tool for understanding this is the roofline model. It's a simple graph that plots achievable performance (FLOP/s) against operational intensity (FLOPs per byte of data moved). Every hardware platform has a "roof", a maximum throughput, and your operations either hit the compute roof or the memory-bandwidth roof.

Compute-bound operations are those where the GPU spends most of its time doing arithmetic. Large batched matrix multiplications (GEMM) with big matrices are typically compute-bound. For these, quantization doesn't help much with speed (the bottleneck is the ALU, not memory), but it can enable larger batch sizes by freeing VRAM.

Memory-bound operations are those where the GPU spends most of its time waiting for data to arrive from memory. This includes: attention with long sequences, small batch inference, layer normalization, activation functions. For these, quantization directly speeds things up by reducing the bytes that must move.

LLM inference is almost always memory-bandwidth-bound at batch size 1. Why? A single forward pass through a 7B model reads ~14 GB of weights (FP16) but performs relatively few operations with each weight (just one multiply-add per input token). The arithmetic intensity is very low. This is why weight quantization to INT4 gives nearly 4× speedup for single-request inference.

The key formula: Arithmetic Intensity = FLOPs / Bytes moved. For a matrix-vector multiply W×x where W is [M, K]: FLOPs = 2MK, Bytes = 2MK (reading W in FP16) + small input/output. So intensity ≈ 1 FLOP/byte. An A100 needs about 156 FLOP/byte (312 TFLOP/s ÷ 2 TB/s) to keep its compute units busy. At 1 FLOP/byte you're using roughly 0.6% of peak compute! Memory is the bottleneck.
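The same formula extends to batched inputs, which shows how batching raises the intensity. This is a sketch under simple assumptions: W is read once, inputs and outputs are FP16, and nothing is cached or reused beyond that. The function name and the 4096×4096 layer size are illustrative.

```python
def matmul_intensity(m, k, batch, bytes_per_weight=2, bytes_per_act=2):
    """FLOPs per byte for Y = W @ X with W of shape [m, k] and X of shape [k, batch]."""
    flops = 2 * m * k * batch
    bytes_moved = (m * k * bytes_per_weight          # read W once
                   + k * batch * bytes_per_act       # read inputs
                   + m * batch * bytes_per_act)      # write outputs
    return flops / bytes_moved

for batch in (1, 8, 64, 512):
    print(f"batch {batch:3d}: {matmul_intensity(4096, 4096, batch):6.1f} FLOP/byte")

# Same layer at batch 1 with 4-bit weights: fewer bytes, so higher intensity.
int4 = matmul_intensity(4096, 4096, 1, bytes_per_weight=0.5)
print(f"batch 1, INT4 weights: {int4:.1f} FLOP/byte")
```

Batching and weight quantization attack the same ratio from opposite sides: one adds FLOPs per weight byte read, the other shrinks the bytes.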

Profiling tools:

Tool	What it measures	When to use
torch.profiler	Per-operator time, memory, CUDA kernels	Finding slow operations
nvidia-smi	GPU utilization, memory usage, temperature	High-level monitoring
nsys (Nsight Systems)	Full timeline: CPU, GPU, memory transfers	Deep performance analysis
ncu (Nsight Compute)	Per-kernel metrics: occupancy, memory throughput	Kernel-level optimization

FLOP counting for transformers. A transformer layer with hidden dimension d and sequence length n has these main operations:

Self-attention QKV projection: 3 × 2nd² = 6nd² FLOPs

Attention scores (QKᵀ): 2n²d FLOPs

Attention-weighted sum over V: 2n²d FLOPs

Output projection: 2nd² FLOPs

MLP (typically 4d hidden): 2 × 2n(4d)d = 16nd² FLOPs

Total per layer: ≈ 24nd² + 4n²d FLOPs

For a 7B model (d=4096, 32 layers), processing one token (n=1): 24 × 1 × 4096² × 32 ≈ 12.9 GFLOP. But reading the weights: 14 GB at 2 TB/s memory bandwidth = 7 ms. Performing 12.9 GFLOP at 312 TFLOP/s (A100) = 0.04 ms. The memory read takes about 175× longer than the compute!
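The per-token estimate can be reproduced directly from the 24nd² term. The sketch below sets n = 1, ignores the attention-score term (negligible for a single token), and takes the hardware figures quoted in the text as given; the function name is illustrative. Evaluating 24 × 4096² × 32 gives about 12.9 × 10⁹ FLOPs.

```python
def decode_time_ms(d_model, n_layers, weight_gb, peak_tflops, bandwidth_tb_s):
    """Rough per-token compute time vs weight-read time, both in milliseconds."""
    flops = 24 * d_model**2 * n_layers               # n = 1, attention-score term ignored
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    memory_ms = (weight_gb * 1e9) / (bandwidth_tb_s * 1e12) * 1e3
    return flops, compute_ms, memory_ms

flops, compute_ms, memory_ms = decode_time_ms(
    d_model=4096, n_layers=32, weight_gb=14, peak_tflops=312, bandwidth_tb_s=2
)
print(f"FLOPs per token: {flops / 1e9:.1f} GFLOP")
print(f"compute: {compute_ms:.3f} ms, weight read: {memory_ms:.1f} ms")
print(f"ratio: {memory_ms / compute_ms:.0f}x")
```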

Common misconception: "We need to reduce FLOPs to speed up inference." For LLM inference at batch size 1, FLOPs are nearly irrelevant. Memory bandwidth is the bottleneck. Reducing weight SIZE (quantization) helps directly. Reducing FLOPs (through architectural changes) only helps if you also reduce the data that must be read.

python
import torch
from torch.profiler import profile, ProfilerActivity

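# Profile a model forward pass
# (stand-in model for illustration; substitute your own nn.Module)
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()
x = torch.randn(1, 128, 4096).cuda()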

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    output = model(x)

# Print top operators by GPU time
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10
))

# Manual roofline calculation
def roofline_analysis(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Determine if an operation is memory-bound or compute-bound."""
    intensity = flops / bytes_moved  # FLOP/byte
    ridge_point = peak_flops / peak_bandwidth  # FLOP/byte

    if intensity < ridge_point:
        bound = "MEMORY-BOUND"
        achievable = intensity * peak_bandwidth
    else:
        bound = "COMPUTE-BOUND"
        achievable = peak_flops

    return bound, achievable, intensity

# A100 specs: 312 TFLOP/s FP16, 2 TB/s bandwidth
# 7B model, 1 token: 12.9 GFLOP, 14 GB read
bound, perf, ai = roofline_analysis(
    flops=12.9e9, bytes_moved=14e9,
    peak_flops=312e12, peak_bandwidth=2e12
)
print(f"Bound: {bound}, Intensity: {ai:.1f} FLOP/byte")
# Output: Bound: MEMORY-BOUND, Intensity: 0.9 FLOP/byte

Roofline Model

The roofline shows max achievable performance vs arithmetic intensity. Operations below the roof are underutilizing hardware. Points left of the ridge are memory-bound; right are compute-bound. Drag the model slider to plot different scenarios.

The roofline reveals a crucial insight: at batch size 1, LLM inference uses less than 1% of the GPU's compute capability. The entire GPU sits idle waiting for weights to arrive from VRAM. This is why INT4 quantization gives nearly 4× speedup — you're moving 4× less data, which is the bottleneck. At batch size 32-64, you amortize the weight read across multiple inputs, pushing toward the compute roof where quantization helps less with latency but still helps with memory capacity.

An operation performs 10 GFLOP and reads 5 GB of data. Your GPU has 312 TFLOP/s peak compute and 2 TB/s bandwidth. Is this operation memory-bound or compute-bound?

Memory-bound (intensity = 2 FLOP/byte, ridge at 156 FLOP/byte)
Compute-bound (there are lots of FLOPs)
Neither: it's perfectly balanced

Chapter 5: Post-Training Compression

Post-training quantization (PTQ) is the most practical compression technique: take a pre-trained model, quantize it, and deploy — no retraining required. This is crucial because many models cost millions of dollars to train. You don't get to train them again. You need compression that works on the finished artifact.

The simplest PTQ approach is weight-only quantization: take the weight tensors, compute their min/max (or running statistics from calibration data), compute scale and zero_point, quantize, done. This works well for INT8 and is the default in most deployment frameworks.

But for INT4, naive min/max quantization often fails. The problem is outlier weights. In transformer models, some weights are 10-100× larger than the median. A single outlier stretches the quantization range, wasting most levels on the non-outlier weights where precision is needed most.

Calibration is the process of running representative data through the model to measure activation ranges. Instead of using the theoretical min/max of the weight tensor, you measure what ranges actually matter during inference. This is better because:

1. Some weight values are rarely activated — clipping them introduces small error

2. The interaction between weights and activations determines which weights need precision

3. Percentile-based ranges (e.g., clip at 99.99th percentile) are more robust than min/max
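The third point is easy to demonstrate on synthetic data. The sketch below quantizes 10,000 small Gaussian values plus one large outlier to 4 bits, once with the min/max range and once with a 0.01/99.99 percentile range. The distribution, the outlier value, and the function name are assumptions for illustration.

```python
import numpy as np

def quantize_with_range(x, lo, hi, bits=4):
    """Clip to [lo, hi], then uniform quantize-dequantize."""
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    q = np.round((np.clip(x, lo, hi) - lo) / scale)
    return q * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)
w[0] = 2.0                                   # a single outlier, 100x the typical scale

minmax = quantize_with_range(w, w.min(), w.max())
lo, hi = np.percentile(w, [0.01, 99.99])     # clip the extreme 0.01% on each side
clipped = quantize_with_range(w, lo, hi)

mse_minmax = np.mean((w - minmax) ** 2)
mse_clipped = np.mean((w - clipped) ** 2)
print(f"min/max range:    MSE = {mse_minmax:.2e}")
print(f"percentile range: MSE = {mse_clipped:.2e}")
```

Clipping is a trade: the outlier itself is now reproduced badly, but every other value gets a much finer step. Whether that trade is acceptable depends on how much the model's output relies on the clipped value, which is what calibration data helps measure.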

Key insight: PTQ treats quantization as a separate post-processing step. The model was trained assuming float32 weights. We're now modifying those weights. The key challenge is: how do we minimize the OUTPUT error of the quantized model, not just the weight error? A small weight error in a sensitive location can cause large output errors.

The GPTQ algorithm (Frantar et al., 2022) is the gold standard for PTQ at INT4. It's based on Optimal Brain Quantization (OBQ), which itself is based on Optimal Brain Surgeon. The key insight: when you quantize one weight, you can adjust the remaining weights to compensate for the error.

Worked example: GPTQ on a 3×3 weight matrix.

Given weight matrix W and Hessian H = X^TX (where X is calibration data):

W = [[0.7, −0.3, 0.5], [0.2, 0.8, −0.4], [−0.6, 0.1, 0.9]]

Suppose we quantize to 4-bit (levels: -8 to 7 with scale 0.1). Process column by column:

Column 0: Quantize W[:,0] = [0.7, 0.2, -0.6]

q = [round(0.7/0.1), round(0.2/0.1), round(-0.6/0.1)] = [7, 2, -6]

Dequantized: [0.7, 0.2, -0.6]. Error: [0, 0, 0]. Lucky — these values happened to quantize exactly.

Column 1: Quantize W[:,1] = [-0.3, 0.8, 0.1]

q = [-3, 8→clamped to 7, 1]. Dequantized: [-0.3, 0.7, 0.1]

Error for row 1: 0.8 - 0.7 = 0.1. This error propagates! GPTQ compensates by adjusting column 2:

W[1,2] -= error × H_inv[1,2]/H_inv[1,1] (Hessian-based correction; the H_inv indices are column indices, here columns 1 and 2)

If H_inv[1,2]/H_inv[1,1] = -0.3, then W[1,2] = -0.4 - 0.1×(-0.3) = -0.37

Column 2: Now quantize the ADJUSTED W[:,2] = [0.5, -0.37, 0.9]

q = [5, -4, 7(clamped from 9)]. The column 1 error was partially absorbed.

This column-by-column approach with error compensation is why GPTQ achieves much better accuracy than naive rounding, especially at INT4 where every level counts.

Common misconception: "GPTQ is slow because it uses second-order information." Actually, GPTQ is remarkably fast. For a 175B model, it takes about 4 GPU-hours. The Hessian H = X^TX is computed once from calibration data (128 samples is typical), then reused for all columns. The column-by-column quantization is sequential but cheap.

python
import numpy as np

def gptq_quantize_column(W, col, H_inv, bits=4):
    """Quantize one column of W with Hessian-based error compensation."""
    qmax = 2**(bits-1) - 1  # 7 for 4-bit
    scale = np.abs(W[:, col]).max() / qmax

    # Quantize this column
    q = np.round(W[:, col] / scale).clip(-qmax, qmax)
    error = W[:, col] - q * scale  # quantization error

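    # Compensate remaining columns (OBS-style update: subtract the scaled error).
    # Simplification: the full algorithm also updates H_inv after each column.
    for j in range(col + 1, W.shape[1]):
        # Hessian-based correction
        correction = error * H_inv[col, j] / H_inv[col, col]
        W[:, j] -= correction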

    W[:, col] = q * scale
    return W

# Full GPTQ on a layer
def gptq_layer(W, X_calib, bits=4):
    """Apply GPTQ to one weight matrix using calibration data X."""
    # Compute Hessian: H = X^T X / n_samples
    H = X_calib.T @ X_calib / X_calib.shape[0]
    H_inv = np.linalg.inv(H + 1e-6 * np.eye(H.shape[0]))

    W_q = W.copy()
    for col in range(W.shape[1]):
        W_q = gptq_quantize_column(W_q, col, H_inv, bits)
    return W_q

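# Using the auto-gptq library (a few lines)
# from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
# quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
# model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantize_config)
# model.quantize(calibration_data)  # list of tokenized examples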

PTQ Error Propagation

Watch how quantization error in one column gets compensated in subsequent columns. Blue bars: original weights. Orange bars: quantized (naive). Green bars: quantized with GPTQ compensation. The output error (red line) is much smaller with compensation.

Practical PTQ pipeline:

1. Load Model: load pre-trained FP16 weights.

2. Prepare Calibration: 128-256 representative samples from the training distribution.

3. Compute Statistics: run the calibration data and record activation ranges / Hessians.

4. Quantize Weights: apply GPTQ (column-by-column with error compensation) or AWQ (activation-aware per-channel scaling).

5. Validate: run eval benchmarks and check that the perplexity increase is < 0.5.

6. Export: save in GGUF/safetensors format for deployment.
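The validation step can be sketched in a few lines. This is a minimal illustration, assuming you already have per-token negative log-likelihoods (in nats) from the FP16 and the quantized model on the same evaluation text. The function names and the sample numbers are hypothetical; the 0.5 threshold is the one from the pipeline above.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def passes_ppl_check(nll_fp16, nll_quant, max_increase=0.5):
    """Return (ok, delta); ok is True when perplexity rose by less than max_increase."""
    delta = perplexity(nll_quant) - perplexity(nll_fp16)
    return delta < max_increase, delta

# Hypothetical per-token NLLs for the same four tokens under each model.
nll_fp16 = [1.60, 1.75, 1.50, 1.70]
nll_int4 = [1.65, 1.80, 1.55, 1.74]
ok, delta = passes_ppl_check(nll_fp16, nll_int4)
print(f"perplexity increase: {delta:.3f}, passes: {ok}")
```

In a real run the NLL lists would come from a held-out corpus of thousands of tokens, not four values.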

What makes GPTQ more accurate than naive rounding for INT4 quantization?

- It uses more bits for important weights

- It compensates remaining columns for the error introduced when quantizing each column

- It retrains the model after quantization

- It uses floating point instead of integers

Chapter 6: Training-Time Compression

Post-training compression works on a frozen model. But what if you could make the model aware of quantization during training? If the model knows it will be quantized, it can learn weights that are robust to quantization noise — weights that sit near the center of quantization bins rather than on the boundaries.

This is Quantization-Aware Training (QAT). The idea: insert "fake quantize" operations into the forward pass during training. These operations simulate the effect of quantization (rounding, clipping) without actually storing integers. The model sees quantization noise during training and adapts to it.

The fake quantize operation in the forward pass:

x̂ = FakeQuantize(x) = Dequant(Quant(x)) = clamp(round(x/s), −qmax, qmax) × s

This rounds x/s to the nearest integer, clips it to the representable range [−qmax, qmax], then immediately converts back to float. The output is still float32, but it can only take values that are representable at the target precision. The model trains with these "staircase" values and learns to work within them.

The gradient problem: The round() function has zero gradient almost everywhere (it's flat between rounding thresholds) and undefined gradient at the half-integer points where it jumps. You can't backpropagate through it! The solution is the Straight-Through Estimator (STE): during backprop, pretend round() was the identity function. Pass the gradient through unchanged.

Forward: y = round(x)

Backward (STE): ∂L/∂x = ∂L/∂y × 1 = ∂L/∂y (just pass it through)

This is mathematically dubious (the true gradient is zero!) but works brilliantly in practice. The model receives gradient signal as if quantization weren't there, but the forward pass produces quantized values. Over many iterations, the optimizer finds weights that minimize loss under quantization.

Why QAT beats PTQ: PTQ quantizes weights that were optimized for float32 — they may sit on quantization bin boundaries where rounding goes either way. QAT optimizes weights knowing they'll be quantized — they converge to bin centers where rounding is deterministic and error is minimal. QAT typically gains 0.5-1% accuracy over PTQ at the same bit-width.

Worked example: STE in action.

Suppose we're training a weight w = 0.37 with scale s = 0.1 (quantization levels: ..., 0.3, 0.4, 0.5, ...).

Forward: FakeQuant(0.37) = round(0.37/0.1) × 0.1 = round(3.7) × 0.1 = 4 × 0.1 = 0.4

The loss is computed using 0.4, not 0.37.

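Backward: Gradient from loss: ∂L/∂y = +0.05 (a positive gradient, so gradient descent will decrease this weight)

STE: ∂L/∂w = +0.05 (passed straight through)

Update (with learning rate lr = 0.005): w = 0.37 - lr × 0.05 = 0.37 - 0.005 × 0.05 = 0.37 - 0.00025 = 0.36975

The weight moves toward 0.3 (the next quantization level down). While it stays above 0.35 the forward pass still sees 0.4; once it crosses 0.35 the fake-quantized value flips to 0.3. After many updates the latent weight tends to settle well inside one bin, where small updates no longer change its quantized value.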
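The same single step can be checked in plain Python. This is a minimal sketch: the helper names are hypothetical, and the positive gradient of 0.05 and learning rate of 0.005 are assumed illustrative values.

```python
def fake_quantize(w, scale):
    """Round w to the nearest quantization level, then map back to float."""
    return round(w / scale) * scale

def ste_sgd_step(w, scale, grad_wrt_output, lr):
    """One SGD step on the latent float weight using the straight-through estimator.

    The forward pass uses the fake-quantized value; the backward pass treats
    round() as the identity, so the gradient w.r.t. w equals the gradient
    w.r.t. the quantized output.
    """
    w_q = fake_quantize(w, scale)        # value the loss actually sees
    grad_wrt_w = grad_wrt_output * 1.0   # STE: d(round)/dw is taken to be 1
    return w - lr * grad_wrt_w, w_q

# Assumed illustrative values: gradient +0.05, learning rate 0.005.
w_new, w_q = ste_sgd_step(w=0.37, scale=0.1, grad_wrt_output=0.05, lr=0.005)
print(f"forward used {w_q:.2f}, latent weight moved 0.37 -> {w_new:.5f}")
```

The latent weight stays in full precision throughout; only the value fed to the loss is quantized.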

Common misconception: "QAT is always better than PTQ, so always use QAT." Wrong. QAT requires access to training data, training infrastructure, and significant compute budget (typically 10-20% of original training cost). For most practitioners using pre-trained models, PTQ (especially GPTQ/AWQ) is the practical choice. QAT is for model vendors who will ship many copies of the quantized model.

Knowledge Distillation during compression is another training-time technique. Instead of training the quantized model on the original data labels, train it to match the output distribution of the full-precision teacher model. The teacher's soft probabilities contain more information than hard labels (they reveal inter-class relationships), making the student's task easier.

L_distill = α × KL(p_teacher || p_student) + (1−α) × L_CE(y, p_student)

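Where α balances between mimicking the teacher (KL divergence) and fitting the true labels (cross-entropy). Temperature scaling (T=2-5) is applied to both teacher and student logits to soften the distributions and reveal more signal. Because softening shrinks the KL gradients by roughly 1/T², the KL term is usually multiplied by T² to keep the two terms on a comparable scale.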

python
import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    """Simulates quantization in forward, passes gradient in backward (STE)."""
    @staticmethod
    def forward(ctx, x, scale, bits):
        qmax = 2**(bits-1) - 1
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q * scale  # dequantize back to float

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None, None  # STE: pass gradient through

class QATLinear(nn.Module):
    """Linear layer with fake quantization for QAT."""
    def __init__(self, in_f, out_f, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.bits = bits

    def forward(self, x):
        # Fake-quantize weights during forward
        scale = self.linear.weight.abs().max() / (2**(self.bits-1) - 1)
        w_q = FakeQuantize.apply(self.linear.weight, scale, self.bits)
        return nn.functional.linear(x, w_q, self.linear.bias)

# Knowledge distillation loss
def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, T=3.0):
    """Combine KL divergence from teacher with CE from labels."""
    soft_student = nn.functional.log_softmax(student_logits / T, dim=-1)
    soft_teacher = nn.functional.softmax(teacher_logits / T, dim=-1)
    kl = nn.functional.kl_div(soft_student, soft_teacher, reduction='batchmean') * T**2
    ce = nn.functional.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

QAT vs PTQ: Post-Quantization Accuracy

Two training curves: blue is standard training (high train accuracy, but drops after quantization). Orange is QAT (slightly lower train accuracy, but maintains accuracy after quantization).

The Straight-Through Estimator (STE) works by:

- Removing quantization during backprop completely

- Computing the exact gradient of the round function

- Treating round() as the identity function during backprop (gradient passes through unchanged)

- Using a differentiable approximation of round()

Chapter 7: Compression Pipeline — The Showcase

In practice, you rarely use a single compression technique. You combine them: quantize weights to INT4, prune 30% of attention heads, distill into a smaller architecture. The order matters. The interactions matter. This chapter is a hands-on pipeline simulator where you build a complete compression strategy from scratch.

To see why order matters, compare quantize-then-prune with prune-then-quantize:

Quantize first, prune second: You quantize all weights (including ones you'll later prune). Then when you prune, the accuracy drop is applied to an already-degraded model. Double penalty.

Prune first, quantize second: You remove the least important weights from the full-precision model (maximum information to decide what to prune). Then you quantize the remaining weights with GPTQ, which can more accurately compensate because there are fewer weights to worry about.

Similarly, knowledge distillation can be applied at different stages:

- Before compression: Train a smaller student from scratch using the teacher. Then quantize the student.

- After compression: Fine-tune the quantized model using distillation from the full-precision teacher.

- During compression: Quantize with distillation loss (QAT + distillation simultaneously).

The golden rule of compression pipelines: Apply techniques from coarsest (architecture changes) to finest (bit-width reduction). Structured pruning → Knowledge distillation → Weight quantization → Deployment optimization. Each step operates on the output of the previous step.

Let's trace through a real scenario. You have a LLaMA-70B model (280 GB FP32, 140 GB FP16) and need to deploy it on a single A100 (80 GB).

Step	Technique	Memory	Accuracy Impact	Cumulative
Baseline	FP16	140 GB	0%	100%
1	INT8 Quantization	70 GB	-0.3%	99.7%
2	INT4 GPTQ (group=128), replacing step 1	35 GB + 5 GB scales	-1.5% (vs. FP16 baseline)	98.5%
3	20% head pruning + finetune	32 GB	-0.8%	97.7%
4	KV cache INT8	Saves runtime VRAM	-0.1%	97.6%

Final result: 70B model running on a single 80GB GPU with 97.6% of original accuracy. The roughly 48 GB of headroom (80 GB minus 32 GB of weights) is used for KV cache (supporting longer contexts) and batch processing.
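The weight-memory column follows from bytes per parameter. The sketch below reproduces that arithmetic for the weights only; it deliberately ignores scales, zero-points, and other metadata, and the helper name is hypothetical.

```python
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}

def weight_memory_gb(n_params, precision):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    return n_params * BITS_PER_PARAM[precision] / 8 / 1e9

for precision in BITS_PER_PARAM:
    gb = weight_memory_gb(70e9, precision)
    fits = "fits" if gb <= 80 else "does not fit"
    print(f"70B @ {precision}: {gb:6.1f} GB ({fits} in 80 GB before KV cache)")
```

Whether a configuration that "fits" is actually usable still depends on how much memory the KV cache and activations need on top of the weights.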

Reality check: The accuracy numbers above are approximate. Real accuracy impact depends heavily on the specific model, task, calibration data quality, and implementation. Always validate on your own benchmark suite. A model that retains 97.6% on an aggregate benchmark might retain only 92% on a specific downstream task that relies on the pruned heads.

Compression Pipeline Simulator

Start with a baseline model. Apply compression techniques in sequence. Watch memory, speed, and accuracy change. Try different orders!


Play with the simulator above. Try these experiments:

1. Apply INT4 directly to a 70B model. Note the final accuracy.

2. Apply Prune 30% first, then INT4. Is accuracy better or worse?

3. Apply Distill first (to get a smaller model), then INT8. Compare memory and accuracy to INT4 without distillation.

4. Stack everything: Prune 30% → INT4 → see how aggressive you can go.

Real-world pipelines from notable releases:

- Mistral 7B: Trained at FP16, distributed as GPTQ INT4 (group=128) via TheBloke's quants. Also available in GGUF Q4_K_M for llama.cpp.

- LLaMA-3 70B: Trained in BF16, deployed with INT8 weight-only quantization on vLLM, with AWQ INT4 used for single-GPU inference.

- Phi-3 Mini (3.8B): Already small architecture (distillation from larger models during training), then quantized to INT4 for mobile deployment (1.9 GB).

You have a 13B model (26 GB FP16) that needs to fit on a 24 GB RTX 4090 with room for KV cache. Which pipeline gets you there while preserving the most accuracy?

- INT4 quantization directly (6.5 GB + scales ≈ 8 GB)

- INT8 quantization (13 GB) + KV cache in INT8 ≈ 16 GB total

- Prune 50% then FP16 (13 GB)

- Distill to 3B model (6 GB FP16)

Chapter 8: Production Deployment

You've compressed your model. Now what? The compressed weights need to be loaded by an inference framework that knows how to execute quantized/sparse operations efficiently on hardware. The choice of framework determines your actual throughput, latency, and hardware utilization.

llama.cpp (Georgi Gerganov) is the pioneering framework for CPU/GPU inference of quantized LLMs. It uses the GGUF file format and supports dozens of quantization schemes (Q4_0, Q4_K_M, Q5_K_S, Q8_0, etc.). Key features:

- Runs on CPU with SIMD optimization (AVX2, ARM NEON)

- Optional GPU offloading (split layers between CPU and GPU)

- Metal backend for Apple Silicon (M1/M2/M3)

- Supports models up to 175B with enough RAM

- Community standard for local inference

vLLM (Berkeley) is the standard for high-throughput GPU serving. Key innovations:

- PagedAttention: manages KV cache like virtual memory pages, eliminating fragmentation and enabling 2-4× higher throughput

- Continuous batching: doesn't wait for all requests in a batch to finish; inserts new requests as slots free up

- Supports AWQ and GPTQ quantized models natively

- Optimized for multiple concurrent users (production serving)
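To see why continuous batching helps, here is a toy step-count model. It is only a sketch under simplifying assumptions: each request occupies one slot, every active request decodes exactly one token per step, and each request needs at least one token. It is not vLLM's scheduler, and the function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes before the next starts."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """A freed slot is refilled from the queue at the next decoding step."""
    queue = deque(lengths)
    active = []   # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:        # fill free slots
            active.append(queue.popleft())
        steps += 1                                  # one decode step for all active requests
        active = [r - 1 for r in active if r > 1]   # drop requests that just finished
    return steps

lengths = [2, 8, 3, 1]   # tokens each request needs to generate
print("static:", static_batching_steps(lengths, slots=2))
print("continuous:", continuous_batching_steps(lengths, slots=2))
```

In this example the short requests no longer wait behind the 8-token request, so the total number of decode steps drops.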

TensorRT-LLM (NVIDIA) is the highest-performance option for NVIDIA hardware:

- Ahead-of-time compilation: converts model to an optimized engine binary

- FP8 support on Hopper (H100): roughly 2× the Tensor Core throughput of FP16, typically with little accuracy loss

- Fused kernels: combines multiple operations into single CUDA kernel calls

- In-flight batching: continuous batching with NVIDIA's optimized scheduler

- Most complex setup but highest throughput on supported hardware

ONNX Runtime (Microsoft) is the cross-platform option:

- Runs on CPU, GPU, mobile, edge, web (WASM)

- Graph-level optimizations: operator fusion, constant folding

- INT8 quantization through its own quantization toolkit

- Best for: models that need to run on diverse hardware

Decision framework: Single user on Mac? → llama.cpp (GGUF, Q4_K_M). Multi-user cloud serving? → vLLM (AWQ INT4) or TensorRT-LLM (FP8 on H100). Edge/mobile? → ONNX Runtime or Core ML. Maximum absolute throughput? → TensorRT-LLM with FP8 on H100.

Hardware-specific optimizations:

Hardware	Best Precision	Key Feature	Framework
NVIDIA H100	FP8	Transformer Engine auto-casting	TensorRT-LLM
NVIDIA A100	INT8 / FP16	Tensor Cores with sparsity	vLLM, TensorRT-LLM
NVIDIA RTX 4090	INT4 (AWQ)	High memory bandwidth	vLLM, ExLlamaV2
Apple M-series	INT4 (GGUF)	Unified memory (CPU+GPU)	llama.cpp (Metal)
CPU (x86)	INT8 / Q4_K_M	AVX-512 / VNNI	llama.cpp, ONNX RT
Edge (ARM)	INT8	NEON SIMD	TFLite, ONNX RT

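CUDA Cores vs Tensor Cores: Regular CUDA cores do scalar math (one multiply-add per clock per core). Tensor Cores multiply small matrix tiles (4×4 on the first generation) in a single operation. They're 8-16× faster for matrix operations but only work at specific precisions (FP16, BF16, TF32, INT8, FP8, and INT4 on some generations). Quantized models that compute in Tensor-Core-friendly types such as INT8 or FP8 get hardware acceleration directly. Weight-only INT4 schemes such as GPTQ and AWQ typically dequantize to FP16 inside the kernel, so their speedup comes mostly from moving fewer bytes. Models in unsupported formats get neither benefit.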

Compilation vs interpretation: llama.cpp and vLLM interpret the model graph at runtime (flexible, easy to update). TensorRT-LLM compiles the model into a fixed binary optimized for specific input shapes and batch sizes (inflexible, but fastest). ONNX Runtime offers both modes (graph-mode optimization + optional compilation).

Common misconception: "TensorRT is always the fastest." Not true. TensorRT's compilation assumes fixed input shapes. If your sequence lengths vary widely, you need multiple compiled engines or dynamic shape support (which is slower). For variable-length LLM inference with many concurrent users, vLLM's PagedAttention often wins on throughput/dollar despite lower per-query peak speed.

python
# llama.cpp: local inference with quantized model
# Install: pip install llama-cpp-python
from llama_cpp import Llama

model = Llama(
    model_path="./llama-7b-q4_k_m.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers to GPU
    n_ctx=4096,
)
output = model("The meaning of life is", max_tokens=128)
print(output["choices"][0]["text"])  # completion text

# vLLM: high-throughput serving
# Install: pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.9,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello!", "What is ML?"], params)
for out in outputs:
    print(out.outputs[0].text)  # first completion for each prompt

# TensorRT-LLM: maximum throughput (simplified; exact commands and flags
# differ between TensorRT-LLM releases, so check the examples for your version)
# Build engine: trtllm-build --model_dir ./llama-7b-hf \
#   --dtype float16 --use_weight_only --weight_only_precision int4_awq
# Run: mpirun -n 1 python run.py --engine_dir ./engine

Framework Throughput Comparison

Estimated tokens/second for a 7B model across different frameworks and precisions. Adjust batch size to see how throughput scales.


You need to serve a 70B model to 100 concurrent users with minimum latency. Your hardware is 4x A100 80GB. Which setup is most appropriate?

- vLLM with tensor parallelism across 4 GPUs, AWQ INT4 quantization

- llama.cpp on one GPU with Q4_K_M GGUF

- ONNX Runtime on CPU

- Run 4 separate copies of the model

Chapter 9: Mastery & Connections

You now have the complete toolkit for model compression. Let's consolidate everything into a decision framework, then push further with derivations and challenges.

Compression Cheat Sheet:

Technique	Memory Savings (vs. FP32)	Speed Gain	Accuracy Cost	When to Use
FP16/BF16	2×	1-2×	~0%	Always (free lunch)
INT8 PTQ	4×	2-4×	0.1-0.5%	Default for deployment
INT4 GPTQ	8×	3-4×	1-3%	Single-GPU constraint
INT4 AWQ	8×	3-4×	0.5-2%	When accuracy matters more
Structured Pruning	variable	linear with removed params	1-5%	Need architecture shrinkage
Unstructured Pruning	~0 (without sparse HW)	~0 (without sparse HW)	small	Only with 2:4 sparsity HW
Knowledge Distillation	scales with student size	scales with student size	2-10%	Need much smaller model
QAT	same as PTQ	same as PTQ	0.5-1% better than PTQ	Shipping millions of copies

The Compression Decision Tree:

Start: Do you have training compute budget?

- No → PTQ Path: Does INT8 fit your memory? → Use INT8. No? → Use GPTQ/AWQ INT4.

- Yes (have compute) → QAT Path: Is accuracy critical? → QAT + distillation. Memory-only concern? → QAT alone.

- Then, on either path → Architecture Path: Need >10× compression? → Distill to smaller architecture first, THEN quantize.
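The tree can also be written as a small function. This is one possible encoding, not a definitive rule set: the function name and argument names are hypothetical, and the returned strings simply label the branches above.

```python
def choose_compression(has_training_budget, int8_fits=False,
                       accuracy_critical=False, needs_over_10x=False):
    """Return the ordered list of steps suggested by the decision tree."""
    plan = []
    if needs_over_10x:
        # Architecture path: shrink the model first, then quantize.
        plan.append("distill to smaller architecture")
    if has_training_budget:
        plan.append("QAT + distillation" if accuracy_critical else "QAT")
    else:
        plan.append("INT8 PTQ" if int8_fits else "INT4 GPTQ/AWQ")
    return plan

print(choose_compression(False, int8_fits=True))
print(choose_compression(True, accuracy_critical=True, needs_over_10x=True))
```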

Derivation Challenge: Derive the optimal per-channel scale factor that minimizes Mean Squared Error (MSE) between original and dequantized weights.

Given: Weight vector w ∈ Rⁿ, quantization levels q ∈ {-Q, ..., Q} where Q = 2^(b−1) − 1 for b bits.

Goal: Find scale s* that minimizes ∑(w_i - round(w_i/s)×s)²

The naive solution s = max(|w|)/Q works but isn't optimal when the weight distribution is non-uniform. The MSE-optimal scale requires solving:

∂/∂s ∑_i (w_i − round(w_i/s) × s)² = 0

This is non-differentiable (round is a step function), but the grid search approach works: try s values in [0.5 × max(|w|)/Q, max(|w|)/Q] with 100 steps, pick the one with lowest MSE. In practice, the MSE-optimal scale is typically 0.7-0.9× the max/Q scale: it clips outliers to reduce average error.
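A grid search of this kind might look like the following sketch. It uses plain Python for clarity; `min_frac` (the lower end of the search range, as a fraction of max/Q) is a hypothetical parameter, and the example weights are made up to include one outlier.

```python
def quant_mse(weights, scale, qmax):
    """Mean squared error after symmetric round-and-clamp quantization."""
    total = 0.0
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))   # quantize and clamp
        total += (w - q * scale) ** 2
    return total / len(weights)

def mse_optimal_scale(weights, bits=4, min_frac=0.5, steps=100):
    """Grid-search scales between min_frac * (max/Q) and max/Q."""
    qmax = 2 ** (bits - 1) - 1
    naive = max(abs(w) for w in weights) / qmax
    candidates = [naive * (min_frac + (1 - min_frac) * i / (steps - 1))
                  for i in range(steps)]
    best = min(candidates, key=lambda s: quant_mse(weights, s, qmax))
    return best, naive

# Hypothetical channel: many small weights plus one outlier at 0.9.
weights = [0.02 * i for i in range(-10, 11)] + [0.9]
best, naive = mse_optimal_scale(weights)
print(f"naive scale {naive:.4f}, MSE {quant_mse(weights, naive, 7):.6f}")
print(f"best scale  {best:.4f}, MSE {quant_mse(weights, best, 7):.6f}")
```

Because the naive max/Q scale is itself one of the candidates, the search can never return a scale with higher MSE than the naive choice.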

Design Challenge: Compress LLaMA-2-13B (26 GB FP16) to fit on an RTX 4090 (24 GB) with at least 8 GB free for KV cache. Target: ≤16 GB model weights. Solution: AWQ INT4 with group_size=128 → ~7.5 GB weights + ~1 GB scales = 8.5 GB. Leaves 15.5 GB for KV cache, supporting ~8K context at FP16 or ~16K at INT8 KV cache.
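The KV cache side of this budget can be estimated with a short calculation. The sketch below assumes the published LLaMA-2-13B shape (40 layers, 40 KV heads, head dimension 128), a single sequence, and no extra overhead for INT8 scales; the function names are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Keys and values (hence the factor 2) for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_cache_gb(context_len, batch_size=1, **arch):
    """Total KV cache in GB (1 GB = 1e9 bytes) for a batch of sequences."""
    return context_len * batch_size * kv_cache_bytes_per_token(**arch) / 1e9

# Assumed LLaMA-2-13B shape: 40 layers, 40 KV heads, head dimension 128.
llama2_13b = dict(n_layers=40, n_kv_heads=40, head_dim=128)
fp16_8k = kv_cache_gb(8192, bytes_per_value=2, **llama2_13b)
int8_16k = kv_cache_gb(16384, bytes_per_value=1, **llama2_13b)
print(f"8K context, FP16 cache:  {fp16_8k:.2f} GB")
print(f"16K context, INT8 cache: {int8_16k:.2f} GB")
```

Halving the bytes per cached value doubles the context that fits in the same memory, which is why the 8K FP16 and 16K INT8 figures pair up.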

Connections to other lessons:

- Transformer — The architecture we're compressing. Understanding attention and MLP structure helps you know which components to prune.

- GPT — Autoregressive inference is memory-bound, making quantization especially effective.

- SSM/Mamba — Alternative architectures with better memory efficiency by design (no KV cache). Sometimes the best "compression" is a better architecture.

What we didn't cover:

- Mixed-precision quantization: Different layers get different bit-widths (sensitive layers stay at INT8, others go to INT4).

- Neural Architecture Search for efficient models: Designing architectures that are inherently small (MobileNet, EfficientNet).

- Speculative decoding: Use a tiny draft model to predict tokens, verify with the large model in parallel. Not compression per se, but achieves 2-3× speedup.

- LoRA/QLoRA: Low-rank adaptation + quantization for fine-tuning. Covered in its own lesson.

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman. In compression: always measure accuracy on YOUR task with YOUR data. Published benchmark numbers are someone else's guarantee, not yours.

You have a 70B model that must run on a single 80GB A100. You need maximum accuracy with the constraint that model weights + KV cache must fit in 80 GB. KV cache for your use case needs ~25 GB. What's your best strategy?

- INT4 GPTQ with group_size=128 (~40 GB weights): fits with room to spare

- INT8 quantization (~70 GB) + INT8 KV cache: too tight

- FP16 with 50% pruning (~70 GB): still doesn't fit with KV