Models are too big for your GPU. Pruning deletes weights, quantization shrinks them, distillation replaces the model entirely. Three paths to fitting 175B parameters in 80GB of memory.
You have a shiny new GPU: an A100 with 80 GB of memory, one of the most capable accelerators you can buy. You want to run GPT-3. How much memory does it need?
GPT-3 has 175 billion parameters. Each parameter is stored as a 16-bit float (2 bytes). That's 175,000,000,000 × 2 = 350 GB. Your 80 GB GPU can hold less than a quarter of the model. And that's just the weights — you also need memory for activations, optimizer states, and gradients during training.
There are three families of tricks, and this lesson covers all of them: pruning, quantization, and knowledge distillation.
Each bar shows a model's FP16 memory footprint. The red line is your GPU's limit. Notice how the gap widens with each generation.
The situation is even worse for training. During training, you need to store not just the model weights but also the gradients and the optimizer states (Adam keeps two extra values per parameter: running estimates of the mean and variance of the gradients). For a 7B model in FP32 with Adam, training requires roughly 7B × (4 + 4 + 4 + 4) bytes = 112 GB for the weights, the gradients, and Adam's two moment buffers, before counting activations.
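The arithmetic above can be wrapped in two small helpers. This is a back-of-the-envelope sketch: the function names are made up for illustration, 1 GB is taken as 10^9 bytes, and activations and framework overhead are ignored.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def adam_training_memory_gb(n_params: float, bytes_per_value: float = 4) -> float:
    """Weights + gradients + Adam's two moment buffers, all at the same precision."""
    copies = 4  # weights, gradients, first moment, second moment
    return n_params * bytes_per_value * copies / 1e9


gpt3_fp16 = weight_memory_gb(175e9, 2)       # GPT-3 weights in FP16
train_7b = adam_training_memory_gb(7e9)      # 7B model, FP32 with Adam
print(f"GPT-3 FP16 weights: {gpt3_fp16:.0f} GB")
print(f"7B FP32 + Adam training state: {train_7b:.0f} GB")
```

Mixed-precision training changes the per-value byte counts, so treat the second number as a rough guide rather than an exact requirement.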
The human brain prunes synapses during development. Synapse count peaks in early childhood at roughly twice the adult level; the brain figures out which connections matter and eliminates the rest. Neural network pruning is the same idea: after training, remove the weights that contribute least to the output.
Let's be precise. We have a trained network with weights $W$. We want to find pruned weights $W_P$ that minimize the loss $L$ on inputs $x$ while having fewer than $T$ nonzero parameters:
$$\arg\min_{W_P} L(x; W_P) \quad \text{subject to} \quad \|W_P\|_0 < T$$
Here $\|W_P\|_0$ is the L0 norm, which is just a count of nonzero entries. $T$ is our target parameter budget. The optimization says: among all weight settings with fewer than $T$ nonzero entries, find the one that keeps the loss lowest.
This is an NP-hard combinatorial problem. With a million weights and a target of keeping 100K, the number of possible subsets is astronomical. So we use heuristics.
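To get a feel for "astronomical", the number of ways to choose which 100K of a million weights to keep is the binomial coefficient C(1,000,000, 100,000). The sketch below estimates its size with log-gamma instead of building the integer; the helper name is illustrative.

```python
import math


def log10_num_subsets(n: int, k: int) -> float:
    """log10 of C(n, k), computed with log-gamma to avoid a huge integer."""
    ln_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_comb / math.log(10)


digits = log10_num_subsets(1_000_000, 100_000)
print(f"C(1e6, 1e5) is roughly 10^{digits:.0f}")
```

The result is a number with more than 140,000 decimal digits, which is why exhaustive search over subsets is not an option.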
The most widely used heuristic: weights with small absolute values matter less than weights with large absolute values. A weight of 0.001 barely changes the output. A weight of 5.0 dominates. So we rank all weights by |w|, and zero out the smallest ones.
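A minimal NumPy sketch of this ranking step, assuming a single global threshold over one tensor. The function name and the example values are illustrative.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-|w| fraction set to zero."""
    k = int(round(sparsity * weights.size))  # number of weights to remove
    pruned = weights.copy()
    if k == 0:
        return pruned
    # Indices (into the flattened tensor) of the k smallest magnitudes.
    drop = np.argpartition(np.abs(weights).ravel(), k - 1)[:k]
    pruned.flat[drop] = 0.0
    return pruned


w = np.array([0.001, -5.0, 0.3, -0.02, 1.2, 0.0005])
w_pruned = magnitude_prune(w, sparsity=0.5)
print(w_pruned)
```

With 50% sparsity, the three smallest magnitudes (0.0005, 0.001, 0.02) are zeroed and the large weights survive regardless of sign.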
Pruning 90% of weights in one shot destroys accuracy. The trick is to iterate: prune a little, fine-tune the remaining weights, prune a little more, fine-tune again. Each round, the surviving weights adapt to compensate for the lost ones.
A weight vector with 20 values. Drag the slider to set the pruning ratio. Red weights are pruned (zeroed). Teal weights survive.
Different layers have different sensitivities to pruning. The first layer (which extracts low-level features) is often very sensitive — prune it aggressively and accuracy collapses. Fully connected layers at the end tend to be less sensitive. In practice, you set per-layer pruning ratios based on a sensitivity sweep: for each layer, try pruning at 10%, 20%, ... 90% and measure the accuracy drop.
Not all zeros are created equal. Imagine a weight matrix where you zero out random individual entries (scattered throughout). That's unstructured pruning. Now imagine zeroing out entire rows or columns. That's structured pruning. The distinction matters enormously for hardware.
GPUs love predictable, sequential memory access. Reading 128 consecutive floats from memory is fast — one burst read. Reading 128 floats scattered randomly across memory is slow — 128 separate reads with cache misses. This is the sequential vs random access gap, and it can be 10× or more.
Unstructured sparsity creates random access patterns. Even if 90% of your weights are zero, the GPU still needs to check each position to find the nonzeros. The memory layout is irregular. You need sparse matrix formats (CSR, COO) that carry metadata overhead.
Structured sparsity creates predictable access patterns. If you prune entire rows of a weight matrix, the remaining rows form a smaller dense matrix. No special formats needed — it's just a regular matrix multiply on a smaller matrix. This runs at full GPU utilization.
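As a small illustration of why row pruning needs no special format, the sketch below drops whole rows and runs an ordinary dense matmul on what is left. Ranking rows by L2 norm is an assumption made here for the example; the text does not prescribe a criterion.

```python
import numpy as np


def prune_rows_by_norm(W: np.ndarray, keep: int):
    """Keep the `keep` rows of W with the largest L2 norm.

    Returns the smaller dense matrix and the indices of the kept rows.
    """
    norms = np.linalg.norm(W, axis=1)
    kept = np.sort(np.argsort(norms)[-keep:])
    return W[kept], kept


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # 8 output units, 16 inputs
x = rng.normal(size=16)

W_small, kept = prune_rows_by_norm(W, keep=4)
y_small = W_small @ x          # plain dense matmul on a 4x16 matrix
```

The pruned layer produces exactly the outputs of the surviving rows, so downstream layers only need their matching input columns removed.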
NVIDIA's Ampere GPUs (A100, RTX 3090, etc.) introduced Sparse Tensor Cores that natively support 2:4 sparsity: out of every 4 consecutive weights, at least 2 must be zero. This is a structured pattern with fine granularity.
The hardware stores only the 2 kept values from each group, plus a 2-bit index per kept value to record its position within the group of 4. The result: roughly 50% compression with up to nearly 2× math throughput, and accuracy barely drops because the pruning is fine-grained enough to keep important weights.
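The 2:4 pattern can be sketched in NumPy as follows. This mimics the idea of "kept values plus position indices"; it is not the actual memory layout used by the hardware or by any NVIDIA library, and the function name is illustrative. It assumes the number of weights is a multiple of 4.

```python
import numpy as np


def prune_2_of_4(w: np.ndarray):
    """Keep the 2 largest-magnitude weights in every group of 4 consecutive values.

    Returns (masked weights, kept values per group, position indices per group).
    """
    groups = w.reshape(-1, 4)
    # Positions of the 2 largest magnitudes per group, in ascending position order.
    idx = np.sort(np.argsort(np.abs(groups), axis=1)[:, 2:], axis=1)
    values = np.take_along_axis(groups, idx, axis=1)
    masked = np.zeros_like(groups)
    np.put_along_axis(masked, idx, values, axis=1)
    return masked.reshape(w.shape), values, idx.astype(np.uint8)


w = np.array([0.1, -0.9, 0.5, 0.05, 2.0, 0.0, -0.3, 0.4])
masked, values, idx = prune_2_of_4(w)
print(masked)   # dense view with zeros filled in
print(values)   # 2 kept values per group
print(idx)      # each index fits in 2 bits (0..3)
```

Each position index is in the range 0 to 3, which is why 2 bits per kept value are enough.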
Left: unstructured (random zeros). Right: 2:4 structured (exactly 2 zeros per group of 4). Toggle to see the memory layout difference.
Beyond individual weights, we can prune at coarser granularities:
| Granularity | What's Pruned | Hardware Benefit |
|---|---|---|
| Weight | Individual parameters | Low (unstructured access) |
| N:M | N of every M consecutive weights | High (Sparse Tensor Cores) |
| Row/Column | Entire rows or columns of weight matrices | High (smaller dense matmul) |
| Channel | Entire filters in conv layers | Very high (fewer output channels) |
| Head | Entire attention heads in Transformers | Very high (fewer heads to compute) |
Pruning removes weights entirely. Quantization takes a different approach: keep all the weights, but represent each one with fewer bits. Instead of storing each weight as a 32-bit float (4 bytes), store it as an 8-bit integer (1 byte) — a 4× compression. Or go further: 4-bit integers give 8× compression.
Let's build up from basics. A computer stores numbers as sequences of bits. Different formats trade off between range (how large/small values can be) and precision (how finely you can distinguish nearby values).
| Format | Bits | Bytes | Structure | Approximate Range |
|---|---|---|---|---|
| FP32 | 32 | 4 | 1 sign + 8 exp + 23 mantissa | ±3.4 × $10^{38}$ |
| FP16 | 16 | 2 | 1 sign + 5 exp + 10 mantissa | ±65504 |
| BF16 | 16 | 2 | 1 sign + 8 exp + 7 mantissa | ±3.4 × $10^{38}$ |
| INT8 | 8 | 1 | 1 sign + 7 value | -128 to 127 |
| INT4 | 4 | 0.5 | 1 sign + 3 value | -8 to 7 |
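We map a floating-point value $r$ to an integer value $q$ using a scale factor $S$ and a zero-point $Z$:
$$r = S\,(q - Z)$$
Rearranging to quantize (float → integer), with rounding because $q$ must be an integer:
$$q = \mathrm{round}\!\left(\frac{r}{S}\right) + Z$$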
The scale S maps the float range to the integer range. The zero-point Z ensures that float 0.0 maps to an exact integer (important because zero-padding and ReLU outputs must be exact). Let's derive S and Z.
We know the min and max of our float values ($r_{\min}$, $r_{\max}$) and the min and max of our integer range ($q_{\min}$, $q_{\max}$). For N-bit signed integers: $q_{\min} = -2^{N-1}$ and $q_{\max} = 2^{N-1} - 1$. For 8-bit: [-128, 127].
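We need $r_{\max} = S\,(q_{\max} - Z)$ and $r_{\min} = S\,(q_{\min} - Z)$. Two equations, two unknowns. Subtracting the second from the first eliminates $Z$ and gives $S$; substituting back gives $Z$, which is rounded so that it is an integer:
$$S = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}, \qquad Z = \mathrm{round}\!\left(q_{\min} - \frac{r_{\min}}{S}\right)$$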
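Putting the scale and zero-point together, a minimal NumPy sketch of linear (affine) quantization might look like this. The function names are illustrative, the clamp to the integer range is added for safety, and the example assumes the float range contains 0.

```python
import numpy as np


def quant_params(r_min: float, r_max: float, n_bits: int = 8):
    """Scale S and integer zero-point Z for an n-bit signed integer range."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z, q_min, q_max


def quantize(r, S, Z, q_min, q_max):
    """float -> integer: q = round(r / S) + Z, clamped to [q_min, q_max]."""
    return np.clip(np.round(r / S) + Z, q_min, q_max).astype(np.int32)


def dequantize(q, S, Z):
    """integer -> float: r = S * (q - Z)."""
    return S * (q.astype(np.float64) - Z)


r = np.array([-1.0, 0.0, 0.5, 3.0])
S, Z, q_min, q_max = quant_params(r.min(), r.max())
q = quantize(r, S, Z, q_min, q_max)
r_hat = dequantize(q, S, Z)
print(q, r_hat)
```

Because Z is an integer, the float value 0.0 maps to the integer Z and back to exactly 0.0, which is the property the zero-point exists to guarantee.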
Asymmetric quantization uses both S and Z (as derived above). The zero-point Z can be any integer. This is the general case.
Symmetric quantization forces Z = 0 (or the midpoint of the integer range). This simplifies the math: r = S · q. The scale is computed as S = max(|rmin|, |rmax|) / qmax. Simpler, but wastes range if the float distribution is skewed (e.g., all positive values after ReLU).
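The "wasted range" point can be checked on synthetic post-ReLU data. This is a sketch with made-up activations and an illustrative function name; it uses the symmetric scale described above.

```python
import numpy as np


def symmetric_quantize(r: np.ndarray, n_bits: int = 8):
    """Symmetric quantization: Z = 0 and S = max|r| / q_max."""
    q_max = 2 ** (n_bits - 1) - 1
    S = np.max(np.abs(r)) / q_max
    q = np.clip(np.round(r / S), -q_max - 1, q_max).astype(np.int32)
    return q, S


rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=10_000), 0.0)  # synthetic post-ReLU values, all >= 0

q, S = symmetric_quantize(acts)
print("integer codes used:", q.min(), "to", q.max())
```

Every negative code goes unused here, so about half of the 256 available levels carry no information. An asymmetric scheme would shift the zero-point to cover only the positive range.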
Theory is nice. Let's see quantization in action. Below is a weight distribution sampled from a real-looking Gaussian. Drag the bit-width slider to see how fewer bits approximate the original distribution. Watch the error histogram grow as precision drops.
Top: original weight distribution vs quantized reconstruction. Bottom: per-weight error. Right: model size.
Linear quantization spaces the quantization levels uniformly across the range. But what if your weights aren't uniformly distributed? K-means quantization (from Han et al.'s Deep Compression, 2016) clusters weights into K groups and stores each weight as a cluster index.
The idea: run K-means clustering on the weight values. Each cluster centroid becomes a shared weight. Each original weight is replaced by a small integer index pointing to its nearest centroid. During inference, we look up the centroid to reconstruct the weight.
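A compact sketch of this codebook idea, using a plain 1-D Lloyd's algorithm written directly in NumPy. The evenly spaced initialization, iteration count, and function name are choices made for this example, and indices are stored as `uint8` rather than packed to 4 bits.

```python
import numpy as np


def kmeans_quantize(weights: np.ndarray, n_clusters: int = 16, n_iters: int = 20):
    """Cluster weight values; return (index per weight, centroid codebook)."""
    flat = weights.ravel()
    # Start with centroids evenly spaced over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.size:          # leave empty clusters where they are
                centroids[k] = members.mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.reshape(weights.shape).astype(np.uint8), centroids


rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(64, 64))

idx, codebook = kmeans_quantize(W, n_clusters=16)   # 16 clusters -> 4-bit indices
W_hat = codebook[idx]                               # lookup reconstructs the weights
print("reconstruction MSE:", np.mean((W - W_hat) ** 2))
```

Storage is one small index per weight plus the codebook itself, which is negligible for large layers.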
Let's compute real numbers. A 7B parameter model:
| Format | Bits/Param | Model Size | Fits on GPU? |
|---|---|---|---|
| FP32 | 32 | 28.0 GB | Yes on an A100 80GB, but not on a 24 GB consumer GPU |
| FP16/BF16 | 16 | 14.0 GB | Yes |
| INT8 | 8 | 7.0 GB | Easily |
| INT4 | 4 | 3.5 GB | Fits on consumer GPUs |
| INT2 | 2 | 1.75 GB | Fits, but accuracy suffers |
You have a pre-trained model in FP16. You want to quantize it to INT8 without retraining. This is Post-Training Quantization (PTQ) — the cheapest approach, requiring only a small calibration dataset (a few hundred examples) to determine the scale and zero-point for each layer.
To compute S and Z for each layer, we need $r_{\min}$ and $r_{\max}$, the range of values that actually appear. Weight ranges can be read directly from the trained weights. Activation ranges depend on the input, so we run a few hundred calibration examples through the network and record the min/max of the activations at each layer. This is calibration.
Here's where things get tricky. For weights, the distribution is typically well-behaved (roughly Gaussian, symmetric). But activations often have outliers — a few values that are 100× larger than the rest. In large language models, these outliers appear consistently in the same channels.
If you set the quantization range to include these outliers, the scale S becomes very large, and all the "normal" values get squashed into a tiny range of integers, destroying precision. If you clip the outliers, the large values are wrong. Either way, accuracy drops.
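The squashing effect is easy to reproduce on synthetic data. The sketch below uses made-up Gaussian "activations" and a single injected outlier 100× larger than the rest, with a symmetric INT8 range set by the maximum absolute value; the helper name is illustrative.

```python
import numpy as np


def int8_roundtrip_error(x: np.ndarray) -> float:
    """Mean absolute error after symmetric INT8 quantize -> dequantize."""
    S = np.max(np.abs(x)) / 127          # range set by the largest magnitude
    x_hat = np.clip(np.round(x / S), -128, 127) * S
    return float(np.mean(np.abs(x - x_hat)))


rng = np.random.default_rng(0)
normal = rng.normal(size=4096)

with_outlier = normal.copy()
with_outlier[0] = 100.0 * np.abs(normal).max()   # one value 100x larger than the rest

err_clean = int8_roundtrip_error(normal)
err_outlier = int8_roundtrip_error(with_outlier)
print(f"error without outlier: {err_clean:.4f}")
print(f"error with outlier:    {err_outlier:.4f}")
```

With the outlier present, the step size grows about 100×, and most of the ordinary values round to the same few integers.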
A typical activation distribution with outliers. Orange bars show the quantized levels. Notice how the outlier stretches the range, wasting most levels on empty space.
Xiao et al. (2023) observed that activations are hard to quantize because of outliers, while weights are easy. SmoothQuant migrates the quantization difficulty from activations to weights by dividing activations by a per-channel smoothing factor s and multiplying weights by the same factor:
$$Y = \left(X\,\mathrm{diag}(s)^{-1}\right)\left(\mathrm{diag}(s)\,W\right) = \hat{X}\,\hat{W}$$
The output Y is mathematically identical (the s's cancel). But now the activations X̂ = X / s are smoother (outliers are divided down), and the weights Ŵ = s · W absorb the extra magnitude. Both are easier to quantize.
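A small numerical check of this cancellation, on synthetic data. The text above does not say how s is chosen; the formula used here, $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $\alpha = 0.5$, follows the SmoothQuant paper and should be read as an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))     # (tokens, input channels)
X[:, 3] *= 100.0                 # one channel carries the outliers
W = rng.normal(size=(8, 4))      # (input channels, output channels)

alpha = 0.5                      # migration strength (assumed value)
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_hat = X / s                    # divide each activation channel by its s_j
W_hat = W * s[:, None]           # multiply the matching weight row by s_j

Y = X @ W
Y_smooth = X_hat @ W_hat
print("max |X|    :", np.abs(X).max())
print("max |X_hat|:", np.abs(X_hat).max())
```

The product is unchanged up to floating-point error, while the largest activation magnitude shrinks considerably and the weights grow only in the one affected row.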
GPTQ (Frantar et al., 2023) takes a more sophisticated approach. Instead of simply rounding each weight to its nearest quantized value, GPTQ accounts for the correlations between weights. After quantizing one weight, it adjusts the remaining unquantized weights to compensate for the rounding error, using the Hessian (second-order information) of each layer's output reconstruction error.
This is a layer-wise method based on Optimal Brain Quantization: for each column of the weight matrix, quantize it, compute the error, and distribute that error across the remaining columns using the inverse Hessian. The result: INT4 quantization with negligible accuracy loss on models up to 175B parameters.
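The column-by-column procedure can be sketched in NumPy as below. This is a simplified illustration in the spirit of GPTQ, not the authors' implementation: it omits blocking, lazy batched updates, and grouping, uses one symmetric scale per output row, and all names (`gptq_like_quantize`, `round_to_grid`, `damp`) are made up for this example.

```python
import numpy as np


def round_to_grid(w, scale, q_min, q_max):
    """Round-to-nearest onto the integer grid, returned in float units."""
    return np.clip(np.round(w / scale), q_min, q_max) * scale


def gptq_like_quantize(W, X, n_bits=4, damp=0.01):
    """W: (out_features, in_features). X: (in_features, n_samples) calibration inputs."""
    W = W.astype(np.float64).copy()
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / q_max   # one scale per output row

    # Hessian of the layer-wise squared error ||WX - QX||^2 for one row of W.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping for conditioning
    # Upper-triangular Cholesky factor U of H^{-1}, so that H^{-1} = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = round_to_grid(W[:, i], scale[:, 0], q_min, q_max)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        # Push the rounding error onto the columns that are not quantized yet.
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q, scale


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))
X = rng.normal(size=(16, 64))

Q, scale = gptq_like_quantize(W, X)
Q_rtn = round_to_grid(W, scale, -8, 7)
print("output error, round-to-nearest :", np.linalg.norm(W @ X - Q_rtn @ X))
print("output error, with compensation:", np.linalg.norm(W @ X - Q @ X))
```

The quantity being compared is the error in the layer's output on the calibration inputs, which is what the compensation step targets, rather than the error in the weights themselves.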
PTQ works well at 8 bits but struggles at 4 bits for some models. The problem: the model was trained in FP16 and never "saw" quantized weights during training. The solution: Quantization-Aware Training (QAT) — insert fake quantization operations during training so the model learns to be robust to low-precision representation.
During training, we insert fake quantization nodes after each weight tensor and each activation. These nodes quantize the value (float → int → float) so the model experiences the rounding error during forward pass, but internally everything stays in floating point so gradients can flow.
There's a fundamental problem: the round() function has zero gradient almost everywhere (its derivative is 0 for non-integers and undefined at integers). Without gradients, we can't do backpropagation through the quantization step.
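The Straight-Through Estimator (STE) is a practical hack: during the backward pass, pretend the round function is the identity function. In other words, pass the gradient through unchanged. Writing $\hat{w} = \mathrm{round}(w)$ for the quantized weight and $L$ for the loss:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{w}} \cdot \frac{\partial \hat{w}}{\partial w} \approx \frac{\partial L}{\partial \hat{w}}$$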
This is mathematically unjustified — the gradient of round() is not 1. But it works remarkably well in practice. The model learns to place weights at values that are close to quantization grid points, reducing the rounding error.
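Here is a hand-written forward and backward pass for a fake-quantization node, followed by a toy training loop. It is a sketch under several assumptions: a fixed scale S, a linear regression problem invented for the example, and a common STE variant that zeroes the gradient outside the clipping range. Real QAT would rely on an autograd framework rather than manual gradients.

```python
import numpy as np


def fake_quant(w, S, q_min=-128, q_max=127):
    """Forward: quantize then dequantize. Output is float but lies on the grid."""
    return np.clip(np.round(w / S), q_min, q_max) * S


def fake_quant_backward_ste(grad_out, w, S, q_min=-128, q_max=127):
    """Backward with STE: treat round() as identity, so the gradient passes through.

    Gradients for values outside the clipping range are zeroed (assumed variant).
    """
    inside = (w / S >= q_min) & (w / S <= q_max)
    return grad_out * inside


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
w_true = np.array([0.5, -1.0, 0.25, 0.75])
y = x @ w_true

w = np.zeros(4)   # latent full-precision weights updated by the optimizer
S = 0.25          # fixed scale, chosen for illustration
initial_loss = np.mean((x @ fake_quant(w, S) - y) ** 2)

for _ in range(200):
    w_q = fake_quant(w, S)                          # forward sees quantized weights
    grad_wq = 2 * x.T @ (x @ w_q - y) / len(x)      # d(MSE)/d(w_q)
    w -= 0.05 * fake_quant_backward_ste(grad_wq, w, S)

final_loss = np.mean((x @ fake_quant(w, S) - y) ** 2)
print("loss before:", initial_loss, "loss after:", final_loss)
```

The optimizer updates the full-precision `w`, but the loss is always computed with the quantized version, which is the core of the QAT setup described above.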
| Property | PTQ | QAT |
|---|---|---|
| Compute cost | Low (calibration only) | High (full training loop) |
| Data needed | ~100-1000 calibration samples | Full training dataset |
| Accuracy at INT8 | Usually fine | Excellent |
| Accuracy at INT4 | Can degrade significantly | Much better |
| When to use | Quick deployment, 8-bit, large models | Aggressive quantization (4-bit), accuracy-critical |
The teal staircase is the round() function. The orange line is the STE approximation (identity). The gradient of the staircase is 0 almost everywhere, but the STE pretends it's 1.
Pruning and quantization compress an existing model. Knowledge distillation takes a completely different approach: train a new, smaller model (the student) to mimic a large model (the teacher). The student learns primarily from the teacher's outputs, usually alongside the original training labels.
You might ask: why not just train a small model from scratch? Because small models are hard to train well. They have limited capacity, underfit easily, and get stuck in bad local minima. But a teacher model has already learned a rich representation of the data. Its output probabilities contain dark knowledge — information about the relationships between classes that hard labels don't capture.
The teacher's softmax outputs are often very peaked (90%+ on the correct class). The "dark knowledge" in the small probabilities is hard to learn from. Temperature scaling softens the distribution by dividing the logits by a temperature T before applying softmax:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Here $z_i$ is the logit for class $i$.
At T = 1, this is the normal softmax. As T increases, the distribution becomes softer — the probability mass spreads more evenly across classes, making the dark knowledge more visible. Typical values: T = 2 to 20.
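A short sketch of temperature-scaled softmax. The logits are hypothetical values for a three-class teacher, picked only to show the effect.

```python
import numpy as np


def softmax_with_temperature(logits, T: float = 1.0) -> np.ndarray:
    """Softmax of logits / T, with the usual max-subtraction for stability."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


logits = [8.0, 3.0, 1.0]   # hypothetical teacher logits
p_sharp = softmax_with_temperature(logits, T=1.0)
p_soft = softmax_with_temperature(logits, T=20.0)

for T in (1.0, 4.0, 20.0):
    print(T, softmax_with_temperature(logits, T).round(3))
```

At T = 1 nearly all the mass sits on the first class; at T = 20 the ordering of the classes is preserved but the smaller probabilities become large enough to carry a training signal.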
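The student is trained on a weighted combination of two losses:
$$L = \alpha \, L_{\text{distill}}\!\left(p^{T}_{\text{teacher}},\, p^{T}_{\text{student}}\right) + (1 - \alpha)\, L_{\text{CE}}\!\left(y,\, p_{\text{student}}\right)$$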
The first term is the distillation loss — cross-entropy between the teacher's soft probabilities (at temperature T) and the student's soft probabilities (at the same T). The second term is the standard classification loss against the hard ground-truth labels. The hyperparameter α balances the two.
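The two-term loss described above can be written out in NumPy as follows. This is a sketch: the function names, the placement of α on the soft term, and the default values of T and α are assumptions for illustration.

```python
import numpy as np


def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * soft-target cross-entropy + (1 - alpha) * hard-label cross-entropy."""
    # Soft term: teacher and student distributions at the same temperature T.
    p_teacher = np.exp(log_softmax(teacher_logits / T))
    soft = -(p_teacher * log_softmax(student_logits / T)).sum(axis=-1).mean()
    # Hard term: standard cross-entropy against ground-truth labels at T = 1.
    log_p = log_softmax(student_logits)
    hard = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard


teacher = np.array([[5.0, 1.0, 0.0], [0.0, 4.0, 1.0]])   # hypothetical logits
labels = np.array([0, 1])

loss_agree = distillation_loss(teacher, teacher, labels)      # student matches teacher
loss_disagree = distillation_loss(-teacher, teacher, labels)  # student is inverted
print(loss_agree, loss_disagree)
```

Many implementations also multiply the soft term by T² so that its gradient magnitude stays comparable across temperatures; that factor is left out here to match the description above.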
Drag the temperature slider to see how softmax probabilities spread out. At T=1, the teacher is very confident. At higher T, dark knowledge emerges.
DistilBERT (Sanh et al., 2019) distilled BERT-base (110M params, 12 layers) into a student with 66M params (6 layers). The result: 97% of BERT's language understanding performance at 60% of the size and with 60% faster inference. They used T = 8 and added a cosine similarity loss between teacher and student hidden states.
| Model | Parameters | GLUE Score | Inference Speed |
|---|---|---|---|
| BERT-base | 110M | 79.5 | 1× |
| DistilBERT | 66M | 77.0 (97%) | 1.6× |
You've now seen three fundamentally different approaches to making neural networks fit in memory. Let's put them together.
| Method | What It Does | Compression | Accuracy Impact | Compute Cost |
|---|---|---|---|---|
| Magnitude Pruning | Zeros out small weights | 2-10× | Low (with fine-tuning) | Medium (iterative) |
| Structured Pruning | Removes rows/channels/heads | 2-4× | Low-Medium | Medium |
| INT8 Quantization | Reduces precision to 8-bit | 2-4× | Negligible | Low (PTQ) |
| INT4 Quantization | Reduces precision to 4-bit | 4-8× | Low (with GPTQ/QAT) | Low-High |
| Knowledge Distillation | Trains a smaller model | 2-10× | Varies | High (full training) |
In 2024-2025, the most common deployment pattern for LLMs is 4-bit post-training weight quantization with GPTQ or AWQ. GPTQ uses second-order error correction, while AWQ rescales weights based on activation statistics. This lets you run a 70B model in ~35 GB, fitting on a single A100 80GB with room for KV cache. For mobile/edge, a combination of distillation (train a 1-3B model) + INT4 quantization is standard.
The next CS 229s lecture covers parameter-efficient fine-tuning (LoRA, adapters, prompt tuning) — how to adapt a pre-trained model to new tasks without updating all the weights. This connects directly to quantization: QLoRA fine-tunes a 4-bit quantized model using LoRA adapters in 16-bit, combining the best of both worlds.