Sparsity & Quantization — Memory-Efficient Neural Networks

Chapter 0: The Memory Problem

You have a shiny new GPU. An A100, 80 GB of memory — the most powerful accelerator money can buy. You want to run GPT-3. How much memory does it need?

GPT-3 has 175 billion parameters. Each parameter is stored as a 16-bit float (2 bytes). That's 175,000,000,000 × 2 = 350 GB. Your 80 GB GPU can hold less than a quarter of the model. And that's just the weights — you also need memory for activations, optimizer states, and gradients during training.

The brutal math: Model size in bytes = number of parameters × bytes per parameter. A 7B model in FP16 = 14 GB. A 70B model = 140 GB. A 175B model = 350 GB. GPU memory has not kept pace with model growth. We need tricks.

There are three families of tricks, and this lesson covers all of them:

Pruning (Sparsity) — delete weights that don't matter. A 90% sparse network stores only 10% of the original parameters.
Quantization — use fewer bits per weight. Going from FP16 (16 bits) to INT4 (4 bits) gives a 4× compression.
Knowledge Distillation — train a smaller model that mimics the big one. Replace a 175B model with a 6B model entirely.

Model Size vs GPU Memory

Each bar shows a model's FP16 memory footprint. The red line is your GPU's limit. Notice how the gap widens with each generation.

The situation is even worse for training. During training, you need to store not just the model weights but also optimizer states (Adam stores two extra copies per parameter: mean and variance of gradients) and gradients themselves. For a 7B model in FP32 with Adam, training requires roughly 7B × (4 + 4 + 4 + 4) = 112 GB — the weights, gradients, and two momentum buffers.

The three pillars: Pruning asks "which weights can I remove?" Quantization asks "can I represent each weight with fewer bits?" Distillation asks "do I even need this many parameters?" Each trades a different resource for model quality.

Check: A 13B parameter model stored in FP16 (2 bytes per param) requires how much memory just for weights?

13 GB 26 GB 52 GB

Chapter 1: Pruning Fundamentals

The human brain prunes synapses during development. A newborn has roughly twice as many synapses as an adult — the brain figures out which connections matter and eliminates the rest. Neural network pruning is the same idea: after training, remove the weights that contribute least to the output.

The Formal Definition

Let's be precise. We have a trained network with weights W. We want to find pruned weights W_P that minimize the loss while having fewer than T nonzero parameters:

min_{W_P} L(x; W_P) s.t. ||W_P||₀ < T

Here ||W_P||₀ is the L0 norm — just a count of nonzero entries. T is our target parameter budget. The optimization says: find the smallest set of weights that keeps the loss low.

This is an NP-hard combinatorial problem. With a million weights and a target of keeping 100K, the number of possible subsets is astronomical. So we use heuristics.

Magnitude Pruning: The Simplest Heuristic

The most widely used heuristic: weights with small absolute values matter less than weights with large absolute values. A weight of 0.001 barely changes the output. A weight of 5.0 dominates. So we rank all weights by |w|, and zero out the smallest ones.

Worked example: Given weights W = [5.0, 0.3, -0.4, -2.1, 0.01, 1.7] and target T = 3 (keep 3 weights), we sort by magnitude: |5.0| > |-2.1| > |1.7| > |-0.4| > |0.3| > |0.01|. We keep the top 3: W_P = [5.0, 0, 0, -2.1, 0, 1.7]. The three smallest weights are zeroed out.

Iterative Pruning: Don't Prune All at Once

Pruning 90% of weights in one shot destroys accuracy. The trick is to iterate: prune a little, fine-tune the remaining weights, prune a little more, fine-tune again. Each round, the surviving weights adapt to compensate for the lost ones.

Train

Train network to convergence

↓

Prune

Remove smallest 20% of weights by magnitude

↓

Fine-tune

Retrain surviving weights for a few epochs

↻ repeat until target sparsity reached

Magnitude Pruning Explorer

A weight vector with 20 values. Drag the slider to set the pruning ratio. Red weights are pruned (zeroed). Teal weights survive.

Prune %50%

Sensitivity Analysis: Not All Layers Are Equal

Different layers have different sensitivities to pruning. The first layer (which extracts low-level features) is often very sensitive — prune it aggressively and accuracy collapses. Fully connected layers at the end tend to be less sensitive. In practice, you set per-layer pruning ratios based on a sensitivity sweep: for each layer, try pruning at 10%, 20%, ... 90% and measure the accuracy drop.

Check: Why is iterative pruning better than one-shot pruning?

It uses less compute Surviving weights can adapt to compensate for pruned ones during fine-tuning It produces denser networks

Chapter 2: Structured vs Unstructured Pruning

Not all zeros are created equal. Imagine a weight matrix where you zero out random individual entries (scattered throughout). That's unstructured pruning. Now imagine zeroing out entire rows or columns. That's structured pruning. The distinction matters enormously for hardware.

Why Structure Matters for GPUs

GPUs love predictable, sequential memory access. Reading 128 consecutive floats from memory is fast — one burst read. Reading 128 floats scattered randomly across memory is slow — 128 separate reads with cache misses. This is the sequential vs random access gap, and it can be 10× or more.

Unstructured sparsity creates random access patterns. Even if 90% of your weights are zero, the GPU still needs to check each position to find the nonzeros. The memory layout is irregular. You need sparse matrix formats (CSR, COO) that carry metadata overhead.

Structured sparsity creates predictable access patterns. If you prune entire rows of a weight matrix, the remaining rows form a smaller dense matrix. No special formats needed — it's just a regular matrix multiply on a smaller matrix. This runs at full GPU utilization.

The tradeoff: Unstructured pruning gives higher accuracy (more freedom in choosing which weights to remove). Structured pruning gives actual speedup on hardware (the GPU can exploit the pattern). Real deployment almost always requires some degree of structure.

N:M Sparsity: The Best of Both Worlds

NVIDIA's Ampere GPUs (A100, 3090, etc.) introduced Sparse Tensor Cores that natively support 2:4 sparsity: out of every 4 consecutive weights, exactly 2 must be zero. This is a structured pattern with fine granularity.

The hardware stores only the 2 nonzero values plus a tiny 2-bit index to indicate their positions within the group of 4. The result: 50% compression with nearly 2× throughput, and accuracy barely drops because the pruning is fine-grained enough to keep important weights.

Structured vs Unstructured Sparsity

Left: unstructured (random zeros). Right: 2:4 structured (exactly 2 zeros per group of 4). Toggle to see the memory layout difference.

Coarse-Grained Pruning

Beyond individual weights, we can prune at coarser granularities:

Granularity	What's Pruned	Hardware Benefit
Weight	Individual parameters	Low (unstructured access)
N:M	N of every M consecutive weights	High (Sparse Tensor Cores)
Row/Column	Entire rows of weight matrices	High (smaller dense matmul)
Channel	Entire filters in conv layers	Very high (fewer output channels)
Head	Entire attention heads in Transformers	Very high (fewer heads to compute)

For coarse-grained magnitude pruning: instead of ranking individual weights by |w_i|, you rank entire rows by the sum of absolute values in the row: Σ_j |w_i,j|. Rows with small total magnitude contribute little to the output and can be removed.

Check: Why does unstructured pruning fail to give speedup on GPUs, even at 90% sparsity?

The remaining nonzeros are scattered randomly, causing slow random memory access GPUs can't handle zeros The weights don't fit in cache regardless

Chapter 3: Quantization Basics

Pruning removes weights entirely. Quantization takes a different approach: keep all the weights, but represent each one with fewer bits. Instead of storing each weight as a 32-bit float (4 bytes), store it as an 8-bit integer (1 byte) — a 4× compression. Or go further: 4-bit integers give 8× compression.

How Numbers Are Represented

Let's build up from basics. A computer stores numbers as sequences of bits. Different formats trade off between range (how large/small values can be) and precision (how finely you can distinguish nearby values).

Format	Bits	Bytes	Structure	Approximate Range
FP32	32	4	1 sign + 8 exp + 23 mantissa	±3.4 × 10³⁸
FP16	16	2	1 sign + 5 exp + 10 mantissa	±65504
BF16	16	2	1 sign + 8 exp + 7 mantissa	±3.4 × 10³⁸
INT8	8	1	1 sign + 7 value	-128 to 127
INT4	4	0.5	1 sign + 3 value	-8 to 7

BF16 vs FP16: FP16 has 5 exponent bits (small range, more precision). BF16 has 8 exponent bits (same range as FP32, less precision). BF16 was created by Google Brain specifically for neural network training — networks care more about range than precision. Overflow during training caused FP16 to produce NaN; BF16 fixed this by keeping FP32's exponent range.

Linear Quantization: The Core Idea

We map a floating-point value r to an integer value q using a scale factor S and a zero-point Z:

r = S · (q − Z)

Rearranging to quantize (float → integer):

q = round(r / S + Z)

The scale S maps the float range to the integer range. The zero-point Z ensures that float 0.0 maps to an exact integer (important because zero-padding and ReLU outputs must be exact). Let's derive S and Z.

Deriving S and Z

We know the min and max of our float values (r_min, r_max) and the min and max of our integer range (q_min, q_max). For N-bit signed integers: q_min = -2^N-1 and q_max = 2^N-1 - 1. For 8-bit: [-128, 127].

We need r_max = S(q_max - Z) and r_min = S(q_min - Z). Two equations, two unknowns:

S = (r_max − r_min) / (q_max − q_min)

Z = round(q_min − r_min / S)

Worked example (INT8): Suppose our weights range from r_min = -1.0 to r_max = 2.0. With INT8, q_min = -128, q_max = 127.

S = (2.0 − (-1.0)) / (127 − (-128)) = 3.0 / 255 = 0.01176
Z = round(-128 − (-1.0) / 0.01176) = round(-128 + 85.03) = round(-42.97) = -43

To quantize r = 0.5: q = round(0.5 / 0.01176 + (-43)) = round(42.52 - 43) = round(-0.48) = 0
To dequantize q = 0: r̂ = 0.01176 · (0 − (-43)) = 0.01176 · 43 = 0.506
Quantization error: |0.5 − 0.506| = 0.006

Symmetric vs Asymmetric Quantization

Asymmetric quantization uses both S and Z (as derived above). The zero-point Z can be any integer. This is the general case.

Symmetric quantization forces Z = 0 (or the midpoint of the integer range). This simplifies the math: r = S · q. The scale is computed as S = max(|r_min|, |r_max|) / q_max. Simpler, but wastes range if the float distribution is skewed (e.g., all positive values after ReLU).

Check: Going from FP16 (16 bits) to INT4 (4 bits) gives what compression ratio on model size?

2× 4× 16×

Chapter 4: Bit-Width Lab

Theory is nice. Let's see quantization in action. Below is a weight distribution sampled from a real-looking Gaussian. Drag the bit-width slider to see how fewer bits approximate the original distribution. Watch the error histogram grow as precision drops.

Interactive Quantization Explorer

Top: original weight distribution vs quantized reconstruction. Bottom: per-weight error. Right: model size.

Bit-width8-bit

DistributionGaussian

What to notice: At 8 bits (INT8), the reconstruction is nearly perfect — typical error is tiny. At 4 bits (INT4), you start to see staircase patterns in the reconstruction. At 2 bits, you only have 4 distinct values to represent the entire distribution. At 1 bit, it's binary: weights are either +max or -max.

K-Means Quantization

Linear quantization spaces the quantization levels uniformly across the range. But what if your weights aren't uniformly distributed? K-means quantization (from Han et al.'s Deep Compression, 2016) clusters weights into K groups and stores each weight as a cluster index.

The idea: run K-means clustering on the weight values. Each cluster centroid becomes a shared weight. Each original weight is replaced by a small integer index pointing to its nearest centroid. During inference, we look up the centroid to reconstruct the weight.

Storage savings: For a D × D weight matrix originally in FP32, storage = 32 · D² bits. With K clusters, storage = log₂(K) · D² + K · 32 bits (the indices + the centroid table). For D=512 and K=16 (4-bit indices): original = 8,388,608 bits. Compressed = 4 · 262,144 + 16 · 32 = 1,049,088 bits. That's an 8× compression.

Model Size Calculator

Let's compute real numbers. A 7B parameter model:

Format	Bits/Param	Model Size	Fits on GPU?
FP32	32	28.0 GB	A100 80GB: yes, but tight
FP16/BF16	16	14.0 GB	Yes
INT8	8	7.0 GB	Easily
INT4	4	3.5 GB	Fits on consumer GPUs
INT2	2	1.75 GB	Fits, but accuracy suffers

Check: At 4-bit quantization, roughly how many distinct values can each weight take?

16 (because 2⁴ = 16) 4 256

Chapter 5: Post-Training Quantization

You have a pre-trained model in FP16. You want to quantize it to INT8 without retraining. This is Post-Training Quantization (PTQ) — the cheapest approach, requiring only a small calibration dataset (a few hundred examples) to determine the scale and zero-point for each layer.

The Calibration Problem

To compute S and Z for each layer, we need r_min and r_max — the range of values that actually appear. We run a few hundred calibration examples through the network and record the min/max of the weights and activations at each layer. This is calibration.

Per-tensor vs per-channel: The simplest approach uses one S and Z for an entire weight tensor. But if different output channels have very different ranges, a single scale either wastes precision on small channels or clips large channels. Per-channel quantization computes separate S_c and Z_c for each output channel (each row of the weight matrix). This is the standard in practice — nearly free accuracy improvement.

The Activation Outlier Problem

Here's where things get tricky. For weights, the distribution is typically well-behaved (roughly Gaussian, symmetric). But activations often have outliers — a few values that are 100× larger than the rest. In large language models, these outliers appear consistently in the same channels.

If you set the quantization range to include these outliers, the scale S becomes very large, and all the "normal" values get squashed into a tiny range of integers, destroying precision. If you clip the outliers, the large values are wrong. Either way, accuracy drops.

The Outlier Problem

A typical activation distribution with outliers. Orange bars show the quantized levels. Notice how the outlier stretches the range, wasting most levels on empty space.

SmoothQuant: The Key Insight

Xiao et al. (2023) observed that activation outliers are hard to quantize but weight channels are easy. SmoothQuant migrates the quantization difficulty from activations to weights by dividing activations by a per-channel smoothing factor s and multiplying weights by the same factor:

Y = (X · diag(s)⁻¹) · (diag(s) · W) = X̂ · Ŵ

The output Y is mathematically identical (the s's cancel). But now the activations X̂ = X / s are smoother (outliers are divided down), and the weights Ŵ = s · W absorb the extra magnitude. Both are easier to quantize.

How to choose s: For each channel j, s_j = max(|X_j|)^α / max(|W_j|)^1−α, where α controls the migration strength. At α = 0.5, you split the difficulty evenly. In practice, α = 0.5 works well for most LLMs. The idea is beautifully simple: you're just rebalancing the per-channel magnitudes between X and W so neither has extreme outliers.

GPTQ: Second-Order Information

GPTQ (Frantar et al., 2023) takes a more sophisticated approach. Instead of simply rounding each weight to its nearest quantized value, GPTQ accounts for the correlations between weights. After quantizing one weight, it adjusts the remaining unquantized weights to compensate for the rounding error, using the Hessian (second-order information) of the loss.

This is a layer-wise method based on Optimal Brain Quantization: for each column of the weight matrix, quantize it, compute the error, and distribute that error across the remaining columns using the inverse Hessian. The result: INT4 quantization with negligible accuracy loss on models up to 175B parameters.

Check: What is SmoothQuant's key insight?

Use more bits for activations Migrate quantization difficulty from activations to weights by rescaling channels Skip quantizing layers with outliers

Chapter 6: Quantization-Aware Training

PTQ works well at 8 bits but struggles at 4 bits for some models. The problem: the model was trained in FP16 and never "saw" quantized weights during training. The solution: Quantization-Aware Training (QAT) — insert fake quantization operations during training so the model learns to be robust to low-precision representation.

Fake Quantizers

During training, we insert fake quantization nodes after each weight tensor and each activation. These nodes quantize the value (float → int → float) so the model experiences the rounding error during forward pass, but internally everything stays in floating point so gradients can flow.

Forward Pass

Ŵ = dequant(quant(W)) — weight is quantized then immediately dequantized

↓

Compute

Y = Ŵ · X — output uses the "fake quantized" weight

↓

Backward Pass

∂L/∂W ≈ ∂L/∂Ŵ — gradient passes through as if quantization didn't happen (STE)

The Straight-Through Estimator (STE)

There's a fundamental problem: the round() function has zero gradient almost everywhere (its derivative is 0 for non-integers and undefined at integers). Without gradients, we can't do backpropagation through the quantization step.

The Straight-Through Estimator (STE) is a practical hack: during the backward pass, pretend the round function is the identity function. In other words, pass the gradient through unchanged:

∂round(x) / ∂x ≈ 1

This is mathematically unjustified — the gradient of round() is not 1. But it works remarkably well in practice. The model learns to place weights at values that are close to quantization grid points, reducing the rounding error.

Why STE works: Think of it this way — if increasing a weight by ε would improve the loss, the STE lets that signal through. The weight moves toward a better value. The quantization step then snaps it to the nearest grid point. Over many training steps, weights naturally cluster near grid points where the loss is low. The training landscape effectively has "valleys" around the quantization grid points.

QAT vs PTQ: When to Use Which

Property	PTQ	QAT
Compute cost	Low (calibration only)	High (full training loop)
Data needed	~100-1000 calibration samples	Full training dataset
Accuracy at INT8	Usually fine	Excellent
Accuracy at INT4	Can degrade significantly	Much better
When to use	Quick deployment, 8-bit, large models	Aggressive quantization (4-bit), accuracy-critical

STE Gradient Flow

The teal staircase is the round() function. The orange line is the STE approximation (identity). The gradient of the staircase is 0 everywhere, but the STE pretends it's 1.

Check: Why can't we simply backpropagate through the round() function?

Because round() has zero gradient almost everywhere, so no learning signal passes through Because round() is too slow to compute Because PyTorch doesn't support it

Chapter 7: Knowledge Distillation

Pruning and quantization compress an existing model. Knowledge distillation takes a completely different approach: train a new, smaller model (the student) to mimic a large model (the teacher). The student never sees the original training labels — it learns from the teacher's outputs.

Why Not Just Train a Small Model?

You might ask: why not just train a small model from scratch? Because small models are hard to train well. They have limited capacity, underfit easily, and get stuck in bad local minima. But a teacher model has already learned a rich representation of the data. Its output probabilities contain dark knowledge — information about the relationships between classes that hard labels don't capture.

Dark knowledge example: For an image of a car, the hard label is [0, 1, 0] (100% car). But the teacher might output [0.01, 0.90, 0.09] (1% plane, 90% car, 9% truck). That 9% for "truck" is informative — it tells the student that cars and trucks are visually similar. This relational information helps the student generalize better than hard labels alone.

Temperature Scaling

The teacher's softmax outputs are often very peaked (90%+ on the correct class). The "dark knowledge" in the small probabilities is hard to learn from. Temperature scaling softens the distribution by dividing the logits by a temperature T before applying softmax:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

At T = 1, this is the normal softmax. As T increases, the distribution becomes softer — the probability mass spreads more evenly across classes, making the dark knowledge more visible. Typical values: T = 2 to 20.

Worked example: Logits z = [5, 1, 0.5].

T=1: softmax = [0.935, 0.017, 0.010] (very peaked)
T=5: softmax = [0.506, 0.264, 0.230] (much softer — student can learn from the relative differences)
T=20: softmax = [0.361, 0.325, 0.314] (nearly uniform — too soft, information is lost)

The Distillation Loss

The student is trained on a weighted combination of two losses:

L = α · L_distill(p^T_teacher, p^T_student) + (1 − α) · L_CE(y, p_student)

The first term is the distillation loss — cross-entropy between the teacher's soft probabilities (at temperature T) and the student's soft probabilities (at the same T). The second term is the standard classification loss against the hard ground-truth labels. The hyperparameter α balances the two.

Temperature Scaling Explorer

Drag the temperature slider to see how softmax probabilities spread out. At T=1, the teacher is very confident. At higher T, dark knowledge emerges.

Temperature T1.0

DistilBERT: A Real-World Example

DistilBERT (Sanh et al., 2019) distilled BERT-base (110M params, 12 layers) into a student with 66M params (6 layers). The result: 97% of BERT's language understanding performance at 60% the size and 60% faster inference. They used T = 8 and added a cosine similarity loss between teacher and student hidden states.

Model	Parameters	GLUE Score	Inference Speed
BERT-base	110M	79.5	1×
DistilBERT	66M	77.0 (97%)	1.6×

Check: What is "dark knowledge" in knowledge distillation?

The teacher's architecture details The training data labels The inter-class relationships encoded in the teacher's soft probability outputs

Chapter 8: Connections

You've now seen three fundamentally different approaches to making neural networks fit in memory. Let's put them together.

Cheat Sheet

Method	What It Does	Compression	Accuracy Impact	Compute Cost
Magnitude Pruning	Zeros out small weights	2-10×	Low (with fine-tuning)	Medium (iterative)
Structured Pruning	Removes rows/channels/heads	2-4×	Low-Medium	Medium
INT8 Quantization	Reduces precision to 8-bit	2-4×	Negligible	Low (PTQ)
INT4 Quantization	Reduces precision to 4-bit	4-8×	Low (with GPTQ/QAT)	Low-High
Knowledge Distillation	Trains a smaller model	2-10×	Varies	High (full training)

They compose! Pruning + quantization work better together than either alone. Han et al.'s Deep Compression pipeline: prune → quantize → Huffman encode, achieving 35-49× compression on AlexNet/VGGNet with no accuracy loss. You can also distill into a smaller architecture and then quantize the student.

The Landscape Today

In 2024-2025, the most common deployment pattern for LLMs is GPTQ or AWQ at 4-bit — Post-Training Quantization with second-order error correction. This lets you run a 70B model in ~35 GB, fitting on a single A100 80GB with room for KV cache. For mobile/edge, a combination of distillation (train a 1-3B model) + INT4 quantization is standard.

Key Equations Reference

Pruning: min_{W_P} L(x; W_P) s.t. ||W_P||₀ < T

Linear Quantization: r = S · (q − Z) S = (r_max − r_min) / (q_max − q_min)

SmoothQuant: Y = (X · diag(s)⁻¹) · (diag(s) · W)

Distillation: L = α L_KD(p^T_t, p^T_s) + (1−α) L_CE(y, p_s)

What's Next

The next CS 229s lecture covers parameter-efficient fine-tuning (LoRA, adapters, prompt tuning) — how to adapt a pre-trained model to new tasks without updating all the weights. This connects directly to quantization: QLoRA fine-tunes a 4-bit quantized model using LoRA adapters in 16-bit, combining the best of both worlds.

References

Han et al. "Learning both Weights and Connections for Efficient Neural Networks." NeurIPS 2015.
Han et al. "Deep Compression." ICLR 2016 (Best Paper).
Jacob et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR 2018.
Xiao et al. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." ICML 2023.
Frantar et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
Hinton et al. "Distilling the Knowledge in a Neural Network." NeurIPS Workshop 2015.
Sanh et al. "DistilBERT, a distilled version of BERT." NeurIPS Workshop 2019.
Frankle & Carlin. "The Lottery Ticket Hypothesis." ICLR 2019.

Check: Which combination gives the best compression with minimal accuracy loss for deploying LLMs today?

FP32 with no compression 4-bit quantization (GPTQ/AWQ), optionally combined with pruning 1-bit quantization