Model Compression Workbook — Engineermaxxing

Chapter 0: Compression Arithmetic

You just trained a 7-billion-parameter language model. In FP16, every parameter takes 2 bytes, so the model weighs 14 GB. Your deployment target is a 6 GB GPU. It doesn't fit. You need to compress it — but by how much, and what speedup can you expect?

Before diving into any compression technique, you need the arithmetic. Model size, compression ratio, and bandwidth-bound speedup are the three numbers you'll compute for every deployment decision you'll ever make. Master them here.

Model size formula:
Size (bytes) = params × bits_per_param / 8
Size (GB) = params × bits_per_param / 8 / 10⁹

Compression ratio:
ratio = original_size / compressed_size

Bandwidth-bound latency:
time = model_size / memory_bandwidth
speedup = time_original / time_compressed = original_size / compressed_size

Exercise 0.1: Model Size Derive

A 13-billion-parameter model stored in FP16 (16 bits = 2 bytes per parameter). How many GB does it occupy?

Formula: params × bytes_per_param / 10⁹

Show derivation

Size = 13 × 10⁹ × 2 bytes = 26 × 10⁹ bytes

Size = 26 × 10⁹ / 10⁹ = 26 GB

Simple rule: in FP16, the model size in GB is roughly 2× the parameter count in billions. A 7B model → 14 GB. A 13B model → 26 GB. A 70B model → 140 GB.

Exercise 0.2: Compression Ratio Derive

Your 14 GB FP16 model has been quantized down to 3.5 GB. What is the compression ratio?

Compression ratio = original / compressed. A ratio of 2× means the compressed version is half the size.

Show derivation

ratio = 14 GB / 3.5 GB = 4.0×

A 4× compression ratio corresponds to going from 16-bit to 4-bit representation (16 / 4 = 4). This is the compression ratio you get from INT4 quantization — the most popular deployment format for large models in 2024-2025.

Exercise 0.3: What Does Compression Buy You? Trace

Which statement about model compression is most accurate?

Compression reduces memory usage but always increases latency due to the overhead of dequantization Compression only reduces disk storage — the model is decompressed to full precision in GPU memory Compression reduces memory AND can reduce latency — LLM inference is memory-bandwidth-bound, so moving fewer bytes makes it faster Compression only helps during training, not inference

Show explanation

LLM inference (especially at batch size 1) is memory-bandwidth-bound, not compute-bound. The GPU spends most of its time waiting for weights to stream from HBM to the compute cores. If you shrink the weights by 4×, you move 4× fewer bytes through the memory bus, so inference speeds up by up to 4×.

The dequantization overhead is tiny — a few multiplies per weight group — and is completely hidden by the memory transfer latency. This is why quantized models are both smaller and faster.

Exercise 0.4: Bandwidth-Bound Speedup Derive

An A100 GPU has 2 TB/s memory bandwidth. For a single-batch inference pass, the GPU must load all model weights once. Compare the time to load a 14 GB FP16 model vs. a 3.5 GB INT4 model. What is the theoretical speedup?

Speedup = time_FP16 / time_INT4 = size_FP16 / size_INT4

× speedup

Show derivation

Time_FP16 = 14 GB / 2000 GB/s = 7.0 ms

Time_INT4 = 3.5 GB / 2000 GB/s = 1.75 ms

Speedup = 7.0 / 1.75 = 4.0×

Notice the speedup equals the compression ratio. This is no coincidence — when inference is purely memory-bandwidth-bound, the speedup is exactly the ratio of data moved. In practice, you get slightly less than 4× because dequantization adds a tiny compute overhead and not every operation is perfectly bandwidth-bound.

Exercise 0.5: The Real Bottleneck Trace

What is the primary bottleneck for LLM inference at small batch sizes (batch = 1)?

Compute (FLOPS) — the GPU can't multiply fast enough Memory bandwidth — the GPU can't load weights from HBM fast enough Network I/O — data transfer between CPU and GPU is the bottleneck Softmax computation — the attention mechanism is the slowest part

Show explanation

An A100 delivers 312 TFLOPS (FP16) but only 2 TB/s memory bandwidth. For a single token, each weight is used for exactly one multiply-add (2 FLOPs). The arithmetic intensity is 2 FLOPs / 2 bytes = 1 FLOP/byte. The A100's compute-to-bandwidth ratio is 312 TFLOPS / 2 TB/s = 156 FLOP/byte.

Since the operation's arithmetic intensity (1) is far below the machine's ratio (156), the GPU is waiting for memory 99% of the time. This is why compression — which reduces bytes moved — directly translates to speedup.

At large batch sizes (B ≥ 64+), the same weights are reused across many tokens, pushing arithmetic intensity above the machine ratio. Then inference becomes compute-bound and quantization helps less with speed (though it still saves memory).

Exercise 0.6: Implement modelSizeGB() Build

Write a function that computes model size in GB given the number of parameters and bits per parameter.

Formula: params × bitsPerParam / 8 / 1e9

Show solution

javascript
function modelSizeGB(params, bitsPerParam) {
  return params * bitsPerParam / 8 / 1e9;
}

The key insight: divide by 8 to convert bits to bytes, then divide by 10⁹ to convert bytes to GB. For FP16 (16 bits), each param is 2 bytes. For INT4 (4 bits), each param is 0.5 bytes.

Exercise 0.7: Find the Bug Debug

This compression ratio function returns values less than 1 for compressed models. That's backwards — a 4× compression should return 4, not 0.25. Click the buggy line.

function compressionRatio(originalGB, compressedGB) {
  const ratio = compressedGB / originalGB;
  return ratio;
}

Show explanation

Line 2 is the bug. The division is inverted: it computes compressedGB / originalGB which gives 3.5 / 14 = 0.25. The correct formula is originalGB / compressedGB which gives 14 / 3.5 = 4.0.

This is a common mistake. Compression ratio is always ≥ 1 (original divided by compressed). A ratio of 0.25 would mean the "compressed" version is larger than the original.

Key formula. Model size (GB) = params × bits_per_param / 8 / 10⁹. Compression ratio = original / compressed. Bandwidth-bound speedup ≈ compression ratio. These three numbers govern every deployment decision.

Chapter 1: Quantization Fundamentals

Your 7B model has a weight with value 0.0237. You need to store it in INT8 — only 256 possible values. That means you're rounding a continuous float to one of 256 buckets. How much error did you just introduce? And can you control where those buckets land?

Quantization maps a continuous range of floating-point values to a discrete set of integer levels. The two key ingredients are the scale (how wide each bucket is) and the zero point (where the integer zero maps in float space).

Asymmetric quantization:
scale = (max − min) / (2^bits − 1)
zero_point = min
quantize: q = round((x − zero_point) / scale)
dequantize: x̂ = q × scale + zero_point

Symmetric quantization:
absmax = max(|min|, |max|)
scale = absmax / (2^bits−1 − 1)
quantize: q = round(x / scale) // q ∈ [−(2^bits−1−1), +(2^bits−1−1)]
dequantize: x̂ = q × scale

Buckets and bins. Think of quantization as dividing a number line into 2^bits equally-spaced bins. Any value that falls in a bin snaps to the bin center. The scale is the bin width. The quantization error for any single value is at most half a bin width: scale / 2.

Exercise 1.1: Asymmetric Quantization Error Derive

Given a weight range [−1.0, 1.0] and INT8 (256 levels), compute the quantization error for x = 0.3.

Steps: (1) compute scale, (2) quantize x to integer q, (3) dequantize q back to x̂, (4) error = |x − x̂|

error

Show derivation

scale = (1.0 − (−1.0)) / (256 − 1) = 2.0 / 255 ≈ 0.007843

zero_point = −1.0

q = round((0.3 − (−1.0)) / 0.007843) = round(1.3 / 0.007843) = round(165.75) = 166

x̂ = 166 × 0.007843 + (−1.0) = 1.3019 − 1.0 = 0.3019

error = |0.3 − 0.3019| = 0.0019

The error (0.002) is less than half the scale (0.004). This is guaranteed — the worst-case error for any value is scale/2, achieved when the value falls exactly between two bins.

Exercise 1.2: Symmetric Quantization Error Derive

Symmetric INT8 quantization of a tensor with absmax = 2.0. The signed INT8 range is [−127, 127]. Compute the quantization error for x = 0.5.

Symmetric: scale = absmax / 127. Quantize: q = round(x / scale). Dequantize: x̂ = q × scale.

error

Show derivation

scale = 2.0 / 127 = 0.015748

q = round(0.5 / 0.015748) = round(31.75) = 32

x̂ = 32 × 0.015748 = 0.50394

error = |0.5 − 0.50394| = 0.00394

Symmetric quantization uses 127 levels per side (not 128) to ensure that zero maps exactly to integer 0 with no rounding error. This is important for zero-valued activations (ReLU outputs) and skip connections.

Exercise 1.3: Asymmetric vs Symmetric Trace

A weight tensor has values in the range [−0.8, 1.2]. Which quantization scheme wastes less of the INT8 range?

Asymmetric — it sets range to exactly [−0.8, 1.2], using all 256 levels across the actual data range Symmetric — it naturally handles the imbalance by shifting the zero point Both are equivalent for any distribution Symmetric — it's always more accurate because it preserves zero exactly

Show explanation

Asymmetric sets the range to exactly [−0.8, 1.2], distributing all 256 levels across 2.0 units of range. Scale = 2.0/255 ≈ 0.00784.

Symmetric would set absmax = max(0.8, 1.2) = 1.2, giving range [−1.2, 1.2]. That's 2.4 units, but values below −0.8 never appear — 0.4 units of range (17% of the negative side) are wasted. Scale = 1.2/127 ≈ 0.00945 — coarser bins, more error.

Asymmetric wins when the distribution is notably skewed. Symmetric wins on simplicity (no zero-point bookkeeping) and is preferred when the distribution is roughly centered around zero, which is common for well-trained weights.

Exercise 1.4: Maximum Quantization Error Derive

For symmetric INT8 with absmax = 1.0, what is the maximum possible quantization error for any input value within the range [−1.0, 1.0]?

The worst case: a value lands exactly between two quantization levels. Max error = scale / 2.

max error

Show derivation

scale = 1.0 / 127 = 0.007874

max error = scale / 2 = 0.007874 / 2 = 0.003937

For INT8, the max error is tiny (~0.4% of the range). But for INT4 (only 16 levels), scale = 1.0/7 = 0.143, and max error = 0.071 — that's 7% of the range. This is why INT4 quantization requires more sophisticated techniques (group quantization, calibration) to maintain accuracy.

Exercise 1.5: Implement quantize() Build

Write a function that performs asymmetric quantize-then-dequantize. Given a float value x, number of bits, and the min/max range, return the dequantized value (the value after quantization error).

Remember to clamp q to the valid integer range [0, 2^bits−1] after rounding.

Show solution

javascript
function quantize(x, bits, minVal, maxVal) {
  const levels = Math.pow(2, bits) - 1;
  const scale = (maxVal - minVal) / levels;
  let q = Math.round((x - minVal) / scale);
  q = Math.max(0, Math.min(levels, q));  // clamp
  return q * scale + minVal;
}

The clamp step is critical — without it, values slightly outside [minVal, maxVal] would produce out-of-range integers that can't be stored in the target bit-width. In practice, calibration determines minVal and maxVal from real data, and outliers get clipped.

Exercise 1.6: Find the Bug Debug

This symmetric quantize function produces garbage outputs. The quantize step looks right, but the dequantize formula is wrong. Click the buggy line.

function quantize(x, scale, zeroPoint) {
  let q = Math.round(x / scale + zeroPoint);
  q = Math.max(0, Math.min(127, q));
  return q * scale + zeroPoint;
}

Show explanation

Line 4 is the bug. The dequantize formula is wrong in two ways: it adds zeroPoint instead of subtracting it, and then multiplies. The correct formula is (q − zeroPoint) * scale.

The quantize step on line 2 does: q = round(x/scale + zp). To invert it: x̂ = (q − zp) × scale. The current code does q × scale + zp, which doesn't undo the original transformation.

Key formula. Quantize: q = round((x − zero_point) / scale). Dequantize: x̂ = q × scale + zero_point. Error ≤ scale / 2. Scale = range / (2^bits − 1). More bits → finer bins → less error, but more memory.

Chapter 2: Post-Training Quantization

You have a pre-trained model and you want to quantize it without retraining. No gradient updates, no fine-tuning budget, no access to the original training data. You just want to shrink it and deploy. This is Post-Training Quantization (PTQ).

The idea: run a small calibration dataset (a few hundred examples) through the model, observe the actual ranges of weights and activations, and use those ranges to set optimal scale and zero-point for each tensor. The better your range estimation, the less accuracy you lose.

PTQ pipeline:
1. Calibrate: run calibration data, record min/max (or percentiles) per tensor
2. Compute scales: choose scale + zero_point for each tensor (or each channel, or each group)
3. Quantize weights: apply quantization using computed scales
4. Evaluate: measure accuracy on held-out data to verify acceptable degradation

Granularity levels:
Per-tensor: 1 scale for the entire weight matrix
Per-channel: 1 scale per output channel (row of weight matrix)
Per-group: 1 scale per group of K values along the input dimension

Why per-channel beats per-tensor. Different output channels can have wildly different weight distributions. One channel might have weights in [−0.01, 0.01] and another in [−2.0, 2.0]. Per-tensor quantization uses the same scale for both, crushing the small channel into just a few quantization levels. Per-channel gives each channel its own scale, dramatically reducing error.

Exercise 2.1: Per-Channel Overhead Derive

A weight tensor has shape [4096, 4096] (output × input). Per-channel quantization (along the output dimension) stores one FP16 scale and one FP16 zero-point per channel. What fraction of the original parameter count is this overhead?

Overhead = (number of extra values) / (original number of values) × 100%. Each scale and zero-point is one value.

% overhead

Show derivation

Original parameters = 4096 × 4096 = 16,777,216

Per-channel overhead = 4096 scales + 4096 zero_points = 8,192 values

Fraction = 8,192 / 16,777,216 × 100% = 0.049%

The overhead is negligible — less than 0.05%. This is why per-channel quantization is essentially free: you pay a rounding error in storage for a massive improvement in accuracy. Every serious PTQ method uses per-channel (or finer) granularity.

Exercise 2.2: Why Per-Channel Wins Trace

Why does per-channel quantization produce less error than per-tensor quantization?

Per-channel uses more bits per value, so each value is more precise Different output channels have different weight distributions — per-channel adapts scale/zero-point to each channel's actual range, reducing wasted quantization levels Per-channel rounds toward zero instead of to nearest, which is more accurate for neural networks Per-channel operates on smaller matrices, which avoids numerical overflow

Show explanation

Imagine two channels: channel A has weights in [−0.01, 0.01] and channel B has weights in [−2.0, 2.0]. Per-tensor uses one scale for both: scale = 4.0/255 ≈ 0.0157. Channel A, with a range of only 0.02, gets mapped to just ~1 quantization level — almost all information is destroyed.

Per-channel gives channel A its own scale: 0.02/255 ≈ 0.000078, spreading its values across all 256 levels. Channel B gets its own scale: 4.0/255 ≈ 0.0157. Both channels use the full resolution of INT8.

The bit-width is the same. The total storage is nearly the same. But the effective precision per channel improves dramatically.

Exercise 2.3: Group Quantization Overhead Derive

GPTQ uses group quantization with group_size = 128. For a weight matrix [4096, 4096] quantized to INT4 with groups along the input dimension: how many groups are there? What fraction of the total storage (weights + scales + zero-points) is overhead?

Each group stores one FP16 scale (2 bytes) + one FP16 zero-point (2 bytes) = 4 bytes overhead. INT4 weights = 0.5 bytes per value.

% overhead

Show derivation

Groups per row = 4096 / 128 = 32

Total groups = 4096 rows × 32 groups/row = 131,072 groups

Overhead bytes = 131,072 × 4 bytes = 524,288 bytes = 0.5 MB

INT4 weight bytes = 4096 × 4096 × 0.5 = 8,388,608 bytes = 8.0 MB

Total = 8.0 + 0.5 = 8.5 MB

Overhead fraction = 0.5 / 8.5 × 100% = 5.9%

At group_size=128, the overhead is about 6% — still small, but no longer negligible. Smaller groups (64, 32) give better accuracy but higher overhead. Group size 128 is the standard trade-off used in GPTQ and AWQ. Some aggressive schemes use group_size=32 for 4-bit, accepting ~23% overhead for better accuracy.

Exercise 2.4: What Does GPTQ Optimize? Trace

GPTQ is the most popular PTQ method for LLMs. What does it actually minimize?

The total number of bits used to store the model The mean absolute difference between original and quantized weights The layer-wise output reconstruction error ‖WX − W̃X‖², where W̃ is the quantized weight, using the Hessian of this objective to determine quantization order The perplexity of the model on a calibration dataset

Show explanation

GPTQ minimizes layer-wise output reconstruction error: for each layer, it finds quantized weights W̃ that minimize ‖WX − W̃X‖², where X is the calibration data flowing through that layer.

The key insight: not all weight errors matter equally. A small error in a high-sensitivity weight damages the output more than a large error in a low-sensitivity weight. GPTQ uses the Hessian H = 2X^TX of the reconstruction loss to measure each weight's sensitivity.

It quantizes weights one column at a time, using the Hessian inverse to redistribute the quantization error of each column to the remaining (not-yet-quantized) columns. This "error compensation" is what makes GPTQ dramatically better than naive round-to-nearest PTQ.

Exercise 2.5: Implement perChannelScale() Build

Write a function that computes per-channel scales for INT8 asymmetric quantization. Input: a 2D weight array [out_channels][in_features]. Return: an array of scales, one per output channel.

Use Math.max(...arr) and Math.min(...arr) for each row.

Show solution

javascript
function perChannelScale(weights) {
  return weights.map(channel => {
    const mn = Math.min(...channel);
    const mx = Math.max(...channel);
    return (mx - mn) / 255;
  });
}

In practice, you'd also compute the zero-point for each channel: zp = min. For a constant channel (all values equal), scale = 0 and any quantized value dequantizes to that constant. Libraries handle this edge case by setting scale to a tiny epsilon to avoid division by zero during quantization.

Exercise 2.6: PTQ Pipeline Order Design

Arrange the steps of a post-training quantization pipeline in the correct order.

→

Evaluate accuracy Quantize weights Run calibration data Compute scales & zero-points

Show explanation

Correct order: Run calibration data → Compute scales & zero-points → Quantize weights → Evaluate accuracy

Calibration must come first because you need to observe actual value ranges before you can compute scales. Scales must come before quantization because quantization uses them. Evaluation comes last to verify that the quantized model didn't lose too much accuracy. If accuracy drops too much, you'd try a finer granularity (per-channel → per-group), more calibration data, or a smarter method like GPTQ.

Key formula. PTQ overhead: per-channel adds 2 values per channel (negligible). Per-group with group_size G adds 2 × ⌈K/G⌉ values per channel, typically 5-10% overhead for INT4 with G=128. GPTQ minimizes ‖WX − W̃X‖² using Hessian-guided error compensation.

Chapter 3: Quantization-Aware Training

You tried PTQ on your 7B model and perplexity jumped from 5.2 to 6.8 — unacceptable. The model needs to learn to be robust to quantization noise. This means putting quantization inside the training loop itself. Welcome to Quantization-Aware Training (QAT).

The core trick: insert fake quantizers (also called "simulated quantizers") into the forward pass. Each fake quantizer quantizes a tensor to integers and immediately dequantizes back to floats. The output is still a float, but it has quantization noise baked in. The model learns to produce weights and activations that survive quantization with minimal accuracy loss.

Fake quantization (forward pass):
x̂ = dequant(clamp(round(x / scale + zero_point), q_min, q_max))
= (clamp(round(x / scale + zp), q_min, q_max) − zp) × scale

Straight-Through Estimator (backward pass):
∂L/∂x ≈ ∂L/∂x̂ × 1_{{x ∈ [clip_min, clip_max]}}
// Pass the gradient straight through if x is within clipping range
// Zero the gradient if x is outside clipping range (clipped)

The STE trick. The quantize function is a staircase — its true derivative is zero on the flat steps and undefined at the step edges. Zero gradients everywhere means the model can't learn. The Straight-Through Estimator (STE) lies to the backward pass: "pretend the staircase is actually the identity function." This crude approximation works shockingly well because the model quickly learns to place its weights on (or near) the staircase steps.

Exercise 3.1: STE Gradient Derive

In QAT, the forward pass computes y = fakeQuantize(x). The STE approximation says ∂y/∂x = 1 when x is within the clipping range. If the loss gradient is ∂L/∂y = 0.05 and x is within range, what is ∂L/∂x?

Chain rule: ∂L/∂x = ∂L/∂y × ∂y/∂x. STE says ∂y/∂x = 1 for in-range x.

Show derivation

∂L/∂x = ∂L/∂y × ∂y/∂x = 0.05 × 1 = 0.05

The gradient passes through the fake quantizer unchanged. This is the "straight-through" part — the backward pass acts as if the quantizer isn't there. The forward pass still applies quantization noise, so the model learns to be robust to it. If x were outside the clipping range, the STE returns 0, telling the optimizer "this weight is so far out of range that nudging it won't help — clip it."

Exercise 3.2: Why Not the True Gradient? Trace

Why can't we use the true gradient of the quantize-dequantize function during backpropagation?

The true gradient is too expensive to compute The true gradient is zero almost everywhere (flat staircase steps) and undefined at step boundaries — the model would get zero gradients and learn nothing The true gradient causes exploding gradients due to the rounding operation The true gradient requires second-order derivatives which PyTorch can't compute

Show explanation

The round() function creates a staircase: its output is constant (flat) between integer boundaries. The derivative of a constant is zero. At the integer boundaries, the function jumps discontinuously, so the derivative is undefined.

If you used the true derivative, every weight would receive a gradient of exactly 0. Gradient descent with zero gradients means no learning. The STE is a deliberate, principled approximation: by pretending the staircase is the identity, you get useful (if noisy) gradients that point the weights toward quantization-friendly values.

Bengio et al. (2013) showed this works well empirically, and it has become the standard approach in all QAT frameworks (PyTorch's FX quantization, TensorFlow's QAT toolkit, etc.).

Exercise 3.3: QAT Training Cost Derive

A model normally trains for 100 epochs on 1M samples at 1000 samples/sec throughput. QAT fine-tunes for 10% of original training duration. The fake quantization ops add 30% overhead to each forward/backward pass. How many hours does QAT take?

QAT throughput = original throughput / 1.3. QAT epochs = 10. Total samples = epochs × dataset size.

hours

Show derivation

QAT throughput = 1000 / 1.3 ≈ 769.2 samples/sec

QAT epochs = 100 × 0.10 = 10 epochs

Total samples = 10 × 1,000,000 = 10,000,000

Time = 10,000,000 / 769.2 = 13,001 seconds

Time = 13,001 / 3600 = 3.61 hours

QAT is much cheaper than full training (3.6 hours vs ~28 hours for the original 100 epochs). But it's far more expensive than PTQ, which takes minutes. This is the fundamental trade-off: PTQ is cheap but loses more accuracy; QAT is expensive but recovers most of the lost accuracy. For models going to production with tight accuracy requirements, QAT is worth the cost.

Exercise 3.4: Implement fakeQuantize() Build

Write the fake quantization function used in QAT. It quantizes then immediately dequantizes — the output is a float with quantization noise.

The clamp ensures the integer stays within the representable range.

Show solution

javascript
function fakeQuantize(x, scale, zeroPoint, qmin, qmax) {
  let q = Math.round(x / scale + zeroPoint);
  q = Math.max(qmin, Math.min(qmax, q));
  return (q - zeroPoint) * scale;
}

Let's trace the clamped case: x=1.6, q = round(1.6/0.01 + 100) = round(260) = 260. Clamp to [0,255]: q=255. Dequantize: (255−100)×0.01 = 1.55. The value 1.6 was beyond the representable range and got clipped to 1.55. This is how the model learns to keep its activations within the quantization range — gradients from clipping push values back in-range.

Exercise 3.5: Find the Bug Debug

This STE gradient function should return grad when x is in range, and 0 when x is out of range. But it corrupts the gradient magnitude. Click the buggy line.

function steGradient(grad, x, clipMin, clipMax) {
  // Straight-through estimator: pass gradient if x in range
  const mask = (x >= clipMin && x <= clipMax) ? 1 : 0;
  return grad + mask;
}

Show explanation

Line 4 is the bug. It uses grad + mask (addition) instead of grad * mask (multiplication).

When x is in range, mask=1: the current code returns grad + 1 (corrupted), but it should return grad × 1 = grad (passthrough). When x is out of range, mask=0: the code returns grad + 0 = grad (should be grad × 0 = 0, i.e., blocked).

The fix: return grad * mask;. The mask acts as a gate — multiply to pass or block the gradient.

Key formula. Fake quantize: x̂ = (clamp(round(x/scale + zp), q_min, q_max) − zp) × scale. STE backward: ∂L/∂x = ∂L/∂x̂ if x ∈ [clip_min, clip_max], else 0. QAT adds ~30% training overhead but recovers accuracy that PTQ loses.

Chapter 4: Pruning

Quantization makes each weight smaller. But what if you just delete weights entirely? Set them to zero. A 7B model with 90% sparsity has 6.3 billion zeros — only 700 million nonzero weights doing useful work. This is pruning.

The question is: which weights do you delete? The simplest answer — magnitude pruning — removes the weights closest to zero, under the assumption that small weights contribute least to the output. It works surprisingly well, and it's where every pruning discussion starts.

Magnitude pruning:
1. Compute magnitudes: |w_i| for all weights
2. Find threshold at the desired sparsity percentile
3. Create mask: m_i = 1 if |w_i| > threshold, else 0
4. Apply mask: w̃_i = w_i × m_i

Sparsity types:
Unstructured: any individual weight can be zero (irregular pattern)
Structured: entire rows, columns, or attention heads are removed (regular pattern)

Structured vs unstructured — the hardware trap. Unstructured pruning at 90% sparsity sounds amazing: 10× fewer parameters! But GPUs are designed for dense matrix multiplication. A sparse matrix with random zeros doesn't map to any fast CUDA kernel. You need specialized sparse formats (CSR/CSC) and sparse matmul libraries, which often run slower than the dense version at less than 95% sparsity. Structured pruning removes entire rows/columns, keeping the remaining operations dense and GPU-friendly.

Exercise 4.1: Magnitude Pruning by Hand Derive

A weight tensor has values: [0.5, −0.1, 0.8, −0.02, 0.3, −0.7, 0.05, 0.9]. Apply 50% magnitude pruning (remove the smallest 50% by absolute value). What is the sum of the surviving weights?

Sort by magnitude, find the 50th percentile threshold, zero out everything at or below it.

sum of survivors

Show derivation

Magnitudes sorted: 0.02, 0.05, 0.1, 0.3, 0.5, 0.7, 0.8, 0.9

50% of 8 values = 4 values to prune

Threshold: 4th smallest magnitude = 0.3 (prune values with |w| ≤ 0.3)

Pruned (zeroed): −0.1, −0.02, 0.3, 0.05 // magnitudes 0.1, 0.02, 0.3, 0.05

Survivors: 0.5, 0.8, −0.7, 0.9

Sum = 0.5 + 0.8 + (−0.7) + 0.9 = 1.5

Notice that the pruned weights (sum = −0.1 − 0.02 + 0.3 + 0.05 = 0.23) are indeed the ones contributing least to the overall sum. Magnitude pruning assumes that small weights have small effects on the output — often true for well-trained networks, but not always (some small weights sit on high-curvature loss landscape regions).

Exercise 4.2: Structured vs Unstructured Trace

Which type of pruning gives actual wall-clock speedup on standard GPU hardware without specialized sparse kernels?

Unstructured pruning — fewer total multiplications means less work for the GPU Structured pruning — removes entire rows/columns/heads, maintaining dense matrix operations that GPUs are optimized for Both give equal speedup at the same sparsity level Neither — pruning only saves memory, not compute time

Show explanation

Structured pruning removes entire structures: a row of a weight matrix (removes an output neuron), a column (removes an input feature), or an entire attention head. The result is a smaller but still dense matrix. Dense matmul on a [3072, 4096] matrix instead of [4096, 4096] is genuinely 25% faster.

Unstructured pruning at 90% sparsity creates a [4096, 4096] matrix where 90% of entries are zero, but the matrix shape is unchanged. Standard GEMM kernels still iterate over all positions. You need special sparse kernels (e.g., NVIDIA's cuSPARSE, 2:4 structured sparsity on Ampere) to skip zeros, and these only help at very high sparsity (>95%) or with hardware-supported patterns (2:4).

Exercise 4.3: Sparsity vs Theoretical FLOPs Derive

A dense matrix multiply [M, K] × [K, N] requires 2MKN FLOPs. After pruning 90% of weights (90% sparsity), the theoretical sparse FLOPs are 2MKN × 0.1. For M = K = N = 4096, what is the theoretical speedup?

× speedup

Show derivation

Dense FLOPs = 2 × 4096 × 4096 × 4096 = 2 × 4096³ ≈ 137.4 billion

Sparse FLOPs = 137.4B × 0.1 = 13.74 billion

Theoretical speedup = 137.4B / 13.74B = 10.0×

10× is the theoretical ceiling. In practice, unstructured 90% sparsity on a GPU typically yields only 1.5-3× speedup (or sometimes no speedup at all) due to irregular memory access patterns, sparse format overhead, and poor hardware utilization. NVIDIA's Ampere architecture supports 2:4 structured sparsity (50% sparsity in a specific pattern) with near-2× speedup — the only widely-deployed hardware sparse acceleration.

Exercise 4.4: Why Not 10× in Practice? Trace

A model has 90% unstructured sparsity (90% zeros). Why doesn't inference run 10× faster on a GPU?

The zeros still need to be stored in memory, so there's no memory saving The GPU can only skip zeros in batches of 1024, so fine-grained sparsity doesn't help Irregular memory access from sparse indices causes cache misses, and the overhead of sparse formats (CSR/CSC) and indirect indexing outweighs the savings from skipped multiplications The activation tensors aren't sparse, so the overall computation is still dominated by dense ops

Show explanation

GPUs achieve their massive throughput through coalesced memory access — reading contiguous chunks of memory in parallel. A dense matmul reads weight rows sequentially, hitting every cache line once. Sparse matrices require indirect indexing: the GPU reads an index array to find where the nonzero values are, then gathers them from scattered memory locations.

This scatter-gather pattern causes cache misses (the nonzeros are spread across memory), branch mispredictions (the sparsity pattern is unpredictable), and low SIMD utilization (some lanes in a warp have data, others don't). Add the overhead of storing the sparse format (CSR needs row pointers + column indices), and the "free" skipped multiplications cost more than they save.

This is why the industry has converged on N:M structured sparsity (e.g., 2:4 on Ampere): a regular pattern that the hardware can exploit with dedicated logic.

Exercise 4.5: Implement magnitudePrune() Build

Write a function that performs magnitude pruning on a flat weight array. Given a sparsity ratio (0-1), zero out the smallest-magnitude values.

Sort a copy of magnitudes to find the threshold, then zero values at or below it.

Show solution

javascript
function magnitudePrune(weights, sparsity) {
  const mags = weights.map(w => Math.abs(w));
  const sorted = [...mags].sort((a, b) => a - b);
  const k = Math.ceil(sparsity * weights.length);
  const threshold = sorted[k - 1];  // k-th smallest magnitude
  let pruned = 0;
  return weights.map((w, i) => {
    if (mags[i] <= threshold && pruned < k) {
      pruned++;
      return 0;
    }
    return w;
  });
}

The pruned < k guard handles ties: if multiple weights share the threshold magnitude, we only prune enough to reach the desired sparsity. In real frameworks (PyTorch's torch.nn.utils.prune), the threshold is computed as a percentile and ties are broken arbitrarily.

Exercise 4.6: Pruning Pipeline Order Design

Arrange the steps of an iterative pruning pipeline in the correct order.

→

Train dense model Score weight importance Create binary mask Apply mask (zero weights) Fine-tune surviving weights

Show explanation

Correct order: Train dense model → Score weight importance → Create binary mask → Apply mask → Fine-tune surviving weights

You must start with a fully-trained dense model (random initialization + pruning = disaster). Scoring comes next because you need trained weights to judge importance. The mask is derived from the scores (threshold at desired sparsity). Applying the mask zeroes out the pruned weights. Fine-tuning lets the surviving weights adapt to compensate for their removed neighbors.

In iterative pruning (the Lottery Ticket Hypothesis approach), you repeat steps 2-5 multiple times, pruning a small fraction each round: prune 20% → fine-tune → prune 20% of remaining → fine-tune → ... This is gentler than one-shot pruning and typically recovers more accuracy at high sparsity.

Key formula. Magnitude pruning: zero out weights with |w| below the sparsity-percentile threshold. Theoretical sparse speedup = 1 / (1 − sparsity). Actual speedup on GPUs: only structured pruning or hardware-supported patterns (2:4 on Ampere) yield real wall-clock gains.

Chapter 5: Lottery Ticket Hypothesis

What if you don't need the full network? What if somewhere inside that massive 70B parameter model, there's a tiny subnetwork — maybe 10% of the original size — that could have trained to the same accuracy, if only you'd initialized it with the right weights? That's the Lottery Ticket Hypothesis, proposed by Frankle & Carlin (2019).

The idea is striking: dense, randomly-initialized networks contain sparse subnetworks (called winning tickets) that — when trained in isolation from their original initialization — can match the full network's test accuracy in a comparable number of training iterations. The full network is like a lottery: most tickets (subnetworks) are losers, but a few "win" by happening to have the right initial weights.

The catch? You can't identify the winning ticket before training. The standard method — Iterative Magnitude Pruning (IMP) — works backwards: train the full network, prune the smallest-magnitude weights, then rewind the surviving weights to their original initialization and retrain from scratch. Repeat this prune-rewind cycle until you reach the target sparsity.

Iterative Magnitude Pruning (IMP):
1. Train the full network to completion (T epochs)
2. Prune p% of weights with smallest |w|
3. Rewind surviving weights to their values at initialization (epoch 0)
4. Repeat from step 1 until target sparsity reached

Rounds needed for target sparsity s:
After each round, fraction remaining = (1 − p)
After N rounds: (1 − p)^N = (1 − s)
N = ⌈ log(1 − s) / log(1 − p) ⌉

Exercise 5.1: Lottery Ticket Search Cost Derive

Standard training takes 100 epochs. Iterative Magnitude Pruning (IMP) trains to completion, prunes 20% of remaining weights, then retrains from original init. How many total training epochs are needed to find an 80% sparse ticket?

First compute the number of rounds N = ⌈log(0.2) / log(0.8)⌉. Then total epochs = N × 100. What is the cost multiplier vs. a single training run?

× (cost multiplier)

Show derivation

Target: 80% sparse ⇒ 20% remaining ⇒ fraction remaining = 0.2

Per round: prune 20% ⇒ keep 80% ⇒ multiply by 0.8 each round

After N rounds: 0.8^N = 0.2

N = log(0.2) / log(0.8) = (−1.6094) / (−0.2231) = 7.213

N = ⌈7.213⌉ = 8 rounds

Total epochs = 8 × 100 = 800

Cost multiplier = 800 / 100 = 8.0×

This is the fundamental problem with IMP: finding a winning ticket costs 8× more than a single training run. Later work (Rewinding to iteration k, SNIP, SynFlow) tries to reduce this cost by pruning earlier in training or at initialization.

Exercise 5.2: What Is a Winning Ticket? Trace

The Lottery Ticket Hypothesis states that dense, randomly-initialized networks contain sparse subnetworks which...

...can match the full network's accuracy when trained from any random initialization ...when trained from their original initialization, can match the full network's accuracy in comparable iterations ...achieve better accuracy than the full network because they avoid overfitting ...can be identified before training by analyzing the initialization distribution

Show explanation

The critical word is original. A winning ticket is defined by both its structure (which weights survive pruning) and its initialization (the exact values those weights had at epoch 0). Re-initializing with new random weights destroys the ticket — the subnetwork typically fails to train to the same accuracy. This is the key insight: the lottery isn't just about which connections matter, but about those connections starting in the right place in the loss landscape.

Exercise 5.3: Ticket Density After Pruning Derive

Starting from a 100M parameter model, you perform 5 rounds of IMP where each round prunes 20% of the remaining weights. How many million parameters survive?

million params

Show derivation

After each round, 80% of remaining weights survive

Round 1: 100M × 0.8 = 80.00M

Round 2: 80M × 0.8 = 64.00M

Round 3: 64M × 0.8 = 51.20M

Round 4: 51.2M × 0.8 = 40.96M

Round 5: 40.96M × 0.8 = 32.77M

Or directly: 100M × 0.8⁵ = 100M × 0.32768 = 32.768M

After 5 rounds of 20%-per-round pruning, about 67% of parameters have been removed. The exponential decay means early rounds remove many more absolute parameters than later rounds (20M in round 1 vs. 8.2M in round 5).

Exercise 5.4: Rewinding Matters Trace

You find a winning ticket via IMP. You keep the same sparse mask but re-initialize all surviving weights with fresh random values. What happens?

The sparse network trains fine — the mask (structure) is what matters, not the initial values It trains even better — fresh random init avoids any bias from the original training It fails to reach the same accuracy — the original initialization is critical to the winning ticket It trains to the same accuracy but takes 10× more epochs

Show explanation

This is the most surprising result in the paper. The winning ticket is not just the topology (which neurons connect to which) — it's the topology plus the specific initial weights. Re-initializing the same sparse structure with new random values produces a network that trains no better than a random sparse network of the same size.

Think of it like a key and a lock. The structure (mask) is the shape of the key, but the initialization values are the precise tooth heights. You need both to open the lock (reach good accuracy). This suggests that certain fortuitous initial weight configurations are essential for trainability — a deeply non-obvious result.

Exercise 5.5: Implement iterativePrune() Build

Write a function that simulates iterative magnitude pruning, returning an array of the remaining parameter count after each round.

Each round, remaining = previous × (1 − pruneRate). Return an array of length rounds.

Show solution

javascript
function iterativePrune(totalParams, pruneRate, rounds) {
  const result = [];
  let remaining = totalParams;
  for (let i = 0; i < rounds; i++) {
    remaining *= (1 - pruneRate);
    result.push(remaining);
  }
  return result;
}

Key formula. IMP cost = ⌈log(1 − target_sparsity) / log(1 − prune_rate)⌉ × single_training_cost. Finding a winning ticket is expensive — you pay the full training cost for every pruning round. At 20% prune rate, reaching 90% sparsity takes 11 rounds = 11× training cost.

Chapter 6: Knowledge Distillation

What if instead of compressing the weights, you train a smaller model to mimic the larger one? A 70B teacher model doesn't just output "cat" for an image of a cat — it outputs a full probability distribution: 85% cat, 10% dog, 3% tiger, 1% lion, 0.5% fox... Those soft probabilities contain rich relational information that the student would never learn from hard labels alone.

This is knowledge distillation (Hinton et al., 2015). The student learns from the teacher's soft probability distributions, which encode what Hinton calls "dark knowledge" — the teacher's implicit understanding of which classes are similar to each other. A cat being 10% dog and 3% tiger tells the student that cats look more like dogs than cars, a relationship invisible in the one-hot label [1, 0, 0, 0, ...].

The key mechanism is temperature scaling. Standard softmax produces peaked distributions (one class dominates). By dividing logits by a temperature T > 1, the distribution "softens," revealing the teacher's nuanced inter-class relationships. The student is then trained on these soft targets using KL divergence, alongside the standard hard-label cross-entropy loss.

Temperature-scaled softmax:
q_i = exp(z_i / T) / ∑_j exp(z_j / T)

Knowledge distillation loss:
L = α · T² · KL(p^teacher_T ‖ p^student_T) + (1 − α) · CE(y_hard, p^student₁)

α = weight on distillation loss (typically 0.5–0.9)
T = temperature (typically 2–20)
T² = gradient magnitude correction factor

Exercise 6.1: Temperature Scaling & Entropy Derive

Teacher logits z = [2.0, 1.0, 0.1]. Compute softmax at T=1 and T=5, then find the difference in entropy: H(T=5) − H(T=1).

Entropy H = −∑ p_i ln(p_i). Higher entropy = softer (more uniform) distribution.

nats

Show derivation

Step 1: Softmax at T=1

exp(2.0) = 7.389, exp(1.0) = 2.718, exp(0.1) = 1.105

Sum = 11.212

p_T=1 = [0.659, 0.242, 0.099]

Step 2: Softmax at T=5

z/T = [0.4, 0.2, 0.02]

exp(0.4) = 1.492, exp(0.2) = 1.221, exp(0.02) = 1.020

Sum = 3.733

p_T=5 = [0.400, 0.327, 0.273]

Step 3: Entropy

H(T=5) = −(0.400·ln 0.400 + 0.327·ln 0.327 + 0.273·ln 0.273)

= −(−0.367 − 0.366 − 0.355) = 1.087

H(T=1) = −(0.659·ln 0.659 + 0.242·ln 0.242 + 0.099·ln 0.099)

= −(−0.275 − 0.343 − 0.229) = 0.847

Difference = 1.087 − 0.847 = 0.24 nats

The maximum entropy for 3 classes is ln(3) ≈ 1.099. At T=5, we're already at 1.087 — nearly uniform. Temperature acts as a "softening knob" that reveals relationships hidden by the peaked T=1 distribution.

Exercise 6.2: Why High Temperature? Trace

Why does knowledge distillation use temperature T > 1 for the soft targets?

It makes training faster by reducing gradient variance It softens the distribution, revealing "dark knowledge" about inter-class relationships — e.g., a cat is more similar to a dog than to a car It increases the teacher's accuracy before distillation It regularizes the student to prevent overfitting to the training data

Show explanation

At T=1, the teacher's softmax is highly peaked — often 95%+ on the top class. The small probabilities on other classes (0.01% dog, 0.003% tiger) are numerically negligible and carry almost no gradient signal. But these tiny probabilities encode exactly the information we want: which wrong answers are less wrong.

Raising T "flattens" the distribution, amplifying these small probabilities into meaningful training signal. The student learns not just "this is a cat" but "if it's not a cat, it's probably a dog, and definitely not a car." This relational structure is the dark knowledge.

Exercise 6.3: Distillation Loss Calculation Derive

Compute the KD loss with α=0.7, T=4, KL_soft=0.05, CE_hard=2.3.

L = α · T² · KL_soft + (1 − α) · CE_hard

loss

Show derivation

Distillation term = α · T² · KL_soft = 0.7 × 4² × 0.05 = 0.7 × 16 × 0.05 = 0.56

Hard label term = (1 − α) · CE_hard = 0.3 × 2.3 = 0.69

Total loss = 0.56 + 0.69 = 1.25

The distillation and hard-label terms are roughly balanced here. In practice, α = 0.7–0.9 works well — we lean heavily on the teacher's soft knowledge while still anchoring to ground truth. The T² factor is crucial for keeping gradients balanced (see next exercise).

Exercise 6.4: Why Multiply by T²? Trace

The KL divergence term in the distillation loss is multiplied by T². Why?

It converts the KL divergence from nats to bits It makes the distillation loss independent of the number of classes Dividing logits by T scales gradients by 1/T², so multiplying by T² restores the correct gradient magnitude relative to the hard-label loss It is an empirical trick with no theoretical justification

Show explanation

When you divide logits by T before softmax, the chain rule introduces a factor of 1/T in the gradients. Since KL divergence involves two softmaxes (teacher and student), the gradients are scaled by 1/T². Without correction, higher temperatures would produce vanishingly small gradient updates from the distillation term, making it irrelevant compared to the hard-label loss.

Multiplying by T² exactly compensates this scaling, ensuring that the relative contribution of the distillation and hard-label terms is controlled solely by α, not accidentally by T. This is not a heuristic — it falls directly out of the calculus.

Exercise 6.5: Implement softmaxWithTemp() Build

Write a function that computes temperature-scaled softmax: divide each logit by T, then apply standard softmax. Use the numerically stable version (subtract max before exp).

Divide by T first, subtract max for stability, then exp and normalize.

Show solution

javascript
function softmaxWithTemp(logits, T) {
  const scaled = logits.map(z => z / T);
  const maxVal = Math.max(...scaled);
  const exps = scaled.map(z => Math.exp(z - maxVal));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

Subtracting the max before exponentiation prevents overflow (exp of large numbers). This doesn't change the result because softmax(z − c) = softmax(z) for any constant c.

Exercise 6.6: Missing T² Scaling Debug

This distillation loss function is producing wildly unbalanced gradients. Click the line with the bug.

function kdLoss(softTeacher, softStudent, alpha, T) {
  let klDiv = 0;
  for (let i = 0; i < softTeacher.length; i++)
    klDiv += softTeacher[i] * Math.log(softTeacher[i] / softStudent[i]);
  return alpha * klDiv + (1 - alpha) * hardLoss;
}

Show explanation

Line 5 is the bug. The KL divergence term is missing the T * T scaling factor. It should be:

javascript
return alpha * T * T * klDiv + (1 - alpha) * hardLoss;

Without T², the gradients from the distillation term are scaled down by 1/T². At T=10, that's a 100× reduction — the distillation signal becomes negligible and the student effectively ignores the teacher, learning only from hard labels. The whole point of distillation is lost.

Key formula. L_KD = α · T² · KL(p^teacher_T ‖ p^student_T) + (1 − α) · CE(y, p^student). The T² is not optional — it corrects the gradient scaling introduced by temperature. Typical settings: T = 4–20, α = 0.5–0.9.

Chapter 7: Low-Rank Factorization & LoRA

A weight matrix W of shape [4096 × 4096] has 16.8 million parameters. But what if its effective rank is much lower — say, 16? That means the matrix can be well-approximated by the product of two much smaller matrices: U[4096 × 16] × V[16 × 4096], using only 131K parameters instead of 16.8M. That's a 128× compression.

This insight powers Low-Rank Adaptation (LoRA) (Hu et al., 2021), the most popular parameter-efficient fine-tuning method. Instead of updating the full weight matrix during fine-tuning, LoRA freezes the pre-trained weights W and adds a low-rank delta: ΔW = A · B, where A is [d × r] and B is [r × d]. The forward pass becomes x → Wx + ABx. Only A and B are trained — the original weights never change.

The magic: for a rank r = 16, each adapted matrix adds only 2 × d × r parameters (131K for d=4096). Applied to all attention matrices in a 7B model, that's about 16.8M trainable parameters — 0.24% of the full model. Yet LoRA fine-tuned models routinely match full fine-tuning quality.

LoRA forward pass:
h = Wx + (α/r) · ABx

W: [d, d] — frozen pre-trained weights
A: [d, r] — learned, initialized from N(0, σ²)
B: [r, d] — learned, initialized to zero
α/r: scaling factor (typically α = r, so scaling = 1)

LoRA params per matrix: 2 × d × r
Compression ratio: d² / (2dr) = d / (2r)

Exercise 7.1: SVD Break-Even Rank Derive

A [4096 × 4096] weight matrix is factorized via SVD into U[4096, r] × V[r, 4096]. At what rank r does the factorization use the same number of parameters as the original matrix?

rank

Show derivation

Original params = 4096 × 4096 = 16,777,216

Factorized params = 4096 × r + r × 4096 = 2 × 4096 × r

Break even: 2 × 4096 × r = 4096²

r = 4096 / 2 = 2048

For a square [d, d] matrix, the break-even rank is always d/2. Any rank below d/2 saves parameters. LoRA typically uses r = 4–64, which is 32–512× below break-even — the savings are enormous because real weight matrices have very low effective rank.

Exercise 7.2: LoRA Savings for 7B Model Derive

A 7B model with d=4096 has 4 attention matrices per layer (Q, K, V, O) across 32 layers. You apply LoRA with rank r=16 to all attention matrices. What percentage of the full model's parameters are trainable?

% trainable

Show derivation

LoRA params per matrix = 2 × d × r = 2 × 4096 × 16 = 131,072

Number of adapted matrices = 4 per layer × 32 layers = 128

Total LoRA params = 128 × 131,072 = 16,777,216 ≈ 16.8M

Percentage = 16.8M / 7,000M × 100 = 0.24%

0.24% of parameters — yet LoRA at rank 16 typically achieves 95–100% of full fine-tuning quality on most tasks. The reason: weight updates during fine-tuning are inherently low-rank. The model has already learned general representations; fine-tuning only needs to make a small, structured adjustment.

Exercise 7.3: Why Initialize B = 0? Trace

LoRA initializes A from a normal distribution but B = 0. Why?

It reduces memory usage during initialization So that W + AB = W + A·0 = W at the start — the model begins from pre-trained weights and gradually learns the delta It prevents the gradient from exploding in the first forward pass Zero initialization is standard for all low-rank factorizations

Show explanation

At initialization, ΔW = A · B = A · 0 = 0. So the adapted weight is W + 0 = W — the model starts exactly at its pre-trained state. This is crucial: it means LoRA training begins from a known-good point in the loss landscape, not a random perturbation of it.

If both A and B were randomly initialized, the initial ΔW would be a random matrix scaled by √r, which would immediately corrupt the pre-trained features. The model would need to "un-learn" this random perturbation before it could start fine-tuning. Zeroing B avoids this entirely.

Exercise 7.4: Rank Selection via Energy Derive

A weight matrix has singular values σ = [10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01]. The "energy" at rank r = ∑_i=1..r σ_i² / ∑_all σ_i². What percentage of energy is captured at rank 4?

% energy

Show derivation

σ² values: [100, 25, 4, 1, 0.25, 0.01, 0.0025, 0.0001]

Total energy = 100 + 25 + 4 + 1 + 0.25 + 0.01 + 0.0025 + 0.0001 = 130.2726

Energy at rank 4 = 100 + 25 + 4 + 1 = 130

Fraction = 130 / 130.2726 = 0.99791 = 99.79%

With just 4 out of 8 singular values (50% of full rank), we capture 99.79% of the matrix's energy. The remaining 4 singular values contribute only 0.21%. This is typical of neural network weight matrices — they're highly low-rank. It's why LoRA works: the "important" information lives in a low-dimensional subspace.

Exercise 7.5: Implement loraParams() Build

Write a function that computes the total number of LoRA parameters for a model, given hidden dimension d, rank r, and number of adapted matrices.

Each adapted matrix adds A[d,r] + B[r,d] parameters.

Show solution

javascript
function loraParams(d, r, numMatrices) {
  const paramsPerMatrix = 2 * d * r;  // A[d,r] + B[r,d]
  return paramsPerMatrix * numMatrices;
}

Exercise 7.6: LoRA Training Pipeline Design

Put the LoRA fine-tuning and deployment steps in the correct order.

→

Freeze pre-trained W Init A~N(0,σ), B=0 Train only A, B on task data Merge: W' = W + AB Deploy merged model

Show explanation

The correct order is: Freeze W → Init adapters → Train A,B → Merge W + AB → Deploy

The merge step is optional but common for deployment: since W' = W + AB is a single matrix of the same shape as W, the merged model has zero inference overhead — no additional latency, no extra memory. You only pay the LoRA cost during training. This is LoRA's killer feature: efficient training with free deployment.

Key formula. LoRA trainable params = 2 × d × r × num_matrices. For a 7B model with r=16 on all attention layers, that's ~0.24% of total params. At inference, merge W' = W + (α/r)·AB for zero overhead. The break-even rank for a [d,d] matrix is d/2 — LoRA uses r « d/2, so savings are always massive.

Chapter 8: Mixed Precision & Hardware

Not all numbers need the same precision. An FP32 floating-point number uses 4 bytes to store 23 bits of mantissa and 8 bits of exponent. But during the forward pass, those extra mantissa bits barely affect the loss. FP16 uses 2 bytes (10-bit mantissa, 5-bit exponent), and BF16 uses 2 bytes with a different tradeoff (7-bit mantissa, 8-bit exponent — same range as FP32 but less precision). INT8 uses 1 byte with no exponent at all. INT4 packs two values into a single byte.

The savings aren't just memory. Modern GPUs have Tensor Cores — specialized matrix-multiply units that operate at 2–8× the throughput of standard FP32 cores. An A100 does 312 TFLOPS at FP16/BF16 but only 156 TFLOPS at FP32. An H100 does 990 TFLOPS BF16 vs. 495 TFLOPS FP32. Using lower precision unlocks this hardware, giving you both a smaller model and a faster one.

Mixed precision training keeps a master copy of weights in FP32 (for optimizer accuracy) while running the forward and backward passes in FP16/BF16. Loss scaling prevents small gradient values from underflowing to zero in FP16. For inference, post-training quantization (PTQ) converts weights to INT8 or INT4 without any retraining, using calibration data to set scale factors.

Bytes per parameter by format:
FP32: 4 bytes FP16/BF16: 2 bytes INT8: 1 byte INT4: 0.5 bytes

Model size formula:
Size (GB) = params × bytes_per_param / 10⁹

A100 peak throughput:
FP32: 156 TFLOPS FP16/BF16: 312 TFLOPS INT8: 624 TOPS

Exercise 8.1: Activation Memory Savings Derive

Mixed precision training keeps master weights in FP32 and optimizer states in FP32 — so weight memory doesn't actually shrink. The real win is activation memory. If a 7B model's activations consume 60 GB in FP32 during training, how many GB are saved by storing activations in FP16 instead?

GB saved

Show derivation

FP32 activation memory = 60 GB

FP16 uses half the bytes: 60 / 2 = 30 GB

Savings = 60 − 30 = 30 GB

A common misconception: "mixed precision halves training memory." It doesn't — weight memory stays the same because you need FP32 master weights + FP32 optimizer states regardless. But activation memory (which grows with batch size and sequence length) is halved. For large batch training, activations can be 60–80% of total memory, so this saving is crucial for fitting larger batches on a single GPU.

Exercise 8.2: Tensor Core Speedup Derive

An A100 has 312 TFLOPS FP16 and 156 TFLOPS FP32. A matmul of two [4096 × 4096] matrices costs 2 × 4096³ FLOPs. What is the speedup of FP16 over FP32 for this operation (assuming compute-bound)?

× speedup

Show derivation

FLOPs = 2 × 4096³ = 2 × 68,719,476,736 ≈ 137.4 GFLOP

Time at FP16 = 137.4 GFLOP / 312,000 GFLOP/s = 0.440 ms

Time at FP32 = 137.4 GFLOP / 156,000 GFLOP/s = 0.881 ms

Speedup = 0.881 / 0.440 = 2.0×

The speedup is exactly 2× because the A100's FP16 Tensor Core throughput is exactly 2× the FP32 throughput. In practice, you often get closer to 1.5–1.8× because real workloads mix compute-bound and memory-bound operations. The H100 has 2× the FP16/BF16 throughput of the A100, plus native FP8 support at 1,979 TOPS.

Exercise 8.3: BF16 vs FP16 Trace

Why does BF16 (Brain Float 16) often work better than FP16 for training large models?

BF16 has more mantissa bits, giving higher precision for small gradients BF16 uses less memory than FP16 BF16 has the same 8-bit exponent as FP32, avoiding overflow/underflow — the reduced mantissa (7 vs 10 bits) matters less than range stability for training BF16 is faster because it skips the normalization step

Show explanation

Format	Sign	Exponent	Mantissa	Range
FP32	1	8	23	±3.4 × 10³⁸
FP16	1	5	10	±6.5 × 10⁴
BF16	1	8	7	±3.4 × 10³⁸

FP16's 5-bit exponent limits its range to ~65,504. Gradients or loss values outside this range cause overflow (Inf) or underflow (0) — requiring careful loss scaling. BF16 matches FP32's full range, so you can typically drop it in without any loss scaling. The price is lower precision (7 vs 10 mantissa bits), but empirically this barely affects training quality for large models.

Exercise 8.4: FP8 Compression Ratio Derive

A 70B parameter model stored in FP16 occupies 140 GB. What is the compression ratio if you quantize to FP8 (1 byte per parameter)?

× compression

Show derivation

FP16 size = 70B × 2 bytes = 140 GB

FP8 size = 70B × 1 byte = 70 GB

Compression ratio = 140 / 70 = 2.0×

FP8 has two variants: E4M3 (4 exponent, 3 mantissa — more precision, max ~448) and E5M2 (5 exponent, 2 mantissa — more range, max ~57,344). E4M3 is typically used for weights and activations in the forward pass, while E5M2 is used for gradients (which need more dynamic range). The H100 was the first GPU with native FP8 Tensor Core support.

Exercise 8.5: Loss Scaling Bug Debug

This mixed precision training loop scales the loss to prevent FP16 underflow but makes a critical mistake. Click the buggy line.

function trainStep(model, data, lossScale) {
  const loss = model.forward(data);       // FP16 forward
  const scaled = loss * lossScale;        // scale up
  const grads = backward(scaled);         // FP16 gradients
  updateWeights(model, grads);            // apply to FP32 master
}

Show explanation

Line 5 is the bug. The gradients are still scaled up by lossScale — you must divide them by lossScale before updating the FP32 master weights. Without unscaling:

javascript
// BUG: effective lr = actual_lr * lossScale (e.g., 1024x too large!)
updateWeights(model, grads);

// FIX: unscale gradients before update
const unscaledGrads = grads.map(g => g / lossScale);
updateWeights(model, unscaledGrads);

Loss scaling works in three steps: (1) multiply loss by scale, (2) backward pass with scaled loss produces scaled gradients, (3) divide gradients by scale before the optimizer step. Forgetting step 3 means your effective learning rate is multiplied by lossScale (typically 1024–65536), causing immediate divergence.

Key formula. Model size (GB) = params × bytes_per_format / 10⁹. FP32 = 4B, FP16/BF16 = 2B, INT8 = 1B, INT4 = 0.5B. For training, use BF16 (same range as FP32, no loss scaling needed). For inference, INT4 gives 8× compression over FP32 with 1–3% quality loss using good calibration.

Chapter 9: Capstone — Full Compression Pipeline

You've just been handed a 13B parameter model in FP16 (26 GB) and told to deploy it on a single consumer GPU with 8 GB of VRAM. That's a 3.25× gap. No single technique closes it — you need to combine them: structured pruning to remove entire attention heads, quantization to shrink the surviving weights, maybe even distillation to train a smaller student. Each technique has different quality-compression tradeoffs, and the order you apply them matters.

This capstone tests your ability to compose everything from Chapters 0–8 into real deployment pipelines. Every exercise combines multiple techniques and requires reasoning about their interactions.

Exercise 9.1: Combined Pruning + Quantization Derive

Start with a 13B parameter model in FP16 (26 GB). First, apply 50% structured pruning (removing entire neurons/heads), reducing to 6.5B params. Then quantize to INT4 (0.5 bytes per param). What is the total compression ratio from the original size?

× compression

Show derivation

Original: 13B params × 2 bytes (FP16) = 26 GB

After 50% pruning: 6.5B params (structured — actually removed, not just zeroed)

After INT4 quantization: 6.5B × 0.5 bytes = 3.25 GB

Compression ratio = 26 / 3.25 = 8.0×

Pruning and quantization multiply: 2× from pruning × 4× from FP16→INT4 = 8× total. This is why practical deployments combine techniques — no single method would get 8× compression with acceptable quality loss. The 3.25 GB fits comfortably in an 8 GB consumer GPU with room for KV cache and activations.

Exercise 9.2: Compression Ordering Trace

You want to apply both pruning and quantization. In what order should you apply them for best final quality?

Quantize first, then prune — pruning removes the least important quantized weights Prune first, fine-tune to recover accuracy, then quantize — quantization calibration should happen on the final pruned weights Apply both simultaneously — joint optimization finds the best tradeoff The order doesn't matter — pruning and quantization are independent operations

Show explanation

The correct order is prune → fine-tune → quantize. Pruning drastically changes the weight distribution (surviving weights may shift to compensate for removed ones, especially after fine-tuning). If you quantize first, the quantization grid (scale factors and zero points) was calibrated for the original weight distribution — which no longer exists after pruning. The quantization error will be much higher.

Conversely, quantizing the final pruned+fine-tuned model lets the calibration data see the actual weight distribution that will be deployed. This consistently gives 1–3% better accuracy than the reverse order.

Exercise 9.3: Distillation + Quantization Speedup Derive

Teacher: 70B FP16 (latency = 100ms/token). Student: 7B INT8 (10× fewer FLOPs, and INT8 has 2× hardware speedup over FP16 on Tensor Cores). What is the expected student latency per token?

ms/token

Show derivation

FLOP reduction from distillation: 70B / 7B = 10×

Hardware speedup from INT8 over FP16: 2×

Total speedup = 10 × 2 = 20×

Student latency = 100 ms / 20 = 5.0 ms/token

This is an idealized estimate — real speedups are lower because autoregressive decoding is memory-bandwidth-bound (not compute-bound), so the INT8 hardware speedup may only be 1.3–1.5× in practice. Still, a 7B INT8 student serving at ~10ms/token is realistic and represents a massive cost reduction over a 70B FP16 teacher. At 5 ms/token you could serve 200 tokens/second on a single GPU.

Exercise 9.4: Implement compressionPipeline() Build

Write a function that computes the final model size (in GB) after pruning and quantization. The function receives the original parameter count, original precision (bits), prune fraction, and target quantization bits.

remaining = params × (1 − pruneFraction). Size = remaining × quantBits / 8 / 1e9.

Show solution

javascript
function compressionPipeline(params, originalBits, pruneFraction, quantBits) {
  const remaining = params * (1 - pruneFraction);
  return remaining * quantBits / 8 / 1e9;
}

Note that originalBits doesn't appear in the calculation — it tells you the starting size for context, but the final size depends only on remaining params and target bits. The compression ratio would be (params * originalBits/8/1e9) / result.

Exercise 9.5: Quantize-Before-Prune Bug Debug

This compression pipeline has a subtle ordering bug that will produce a much worse model than expected. Click the buggy line.

function compress(model) {
  quantize(model, 'int4');           // quantize first
  prune(model, sparsity=0.5);        // then prune
  calibrate(model, calibrationData); // calibrate after both
  return model;
}

Show explanation

Line 2 is the bug. Quantizing before pruning means the INT4 scale factors and zero points were computed on the original dense weights. After pruning removes 50% of parameters, the surviving weight distribution shifts significantly — but the quantization grid is now wrong for these new values.

The correct order is:

javascript
function compress(model) {
  prune(model, sparsity=0.5);        // prune first
  finetune(model);                    // recover accuracy
  quantize(model, 'int4');           // quantize pruned weights
  calibrate(model, calibrationData); // calibrate final model
  return model;
}

The calibration on line 4 also can't fully fix the damage — it can adjust scale factors, but the quantization grid was already committed at line 2. Always prune first, then quantize.

Exercise 9.6: Full Deployment Pipeline Design

Arrange the complete model compression and deployment pipeline in the correct order.

→

Profile model (find bottlenecks) Structured pruning (remove heads/neurons) Fine-tune to recover accuracy Quantize (PTQ with calibration) Evaluate on benchmarks Deploy to production

Show explanation

The correct order is: Profile → Structured prune → Fine-tune → Quantize (PTQ) → Evaluate → Deploy

Profile first: identify which layers contribute least (via sensitivity analysis, activation magnitudes, or importance scores). Prune: remove the least important structures. Fine-tune: let the model recover from pruning damage (typically 1–5% of original training). Quantize: calibrate on representative data after pruning, because the weight distribution has changed. Evaluate: verify quality on held-out benchmarks before deploying. Deploy: serve the compressed model.

Compression Techniques Summary

Technique	Typical Compression	Quality Loss	Hardware Needs
Weight Pruning (unstructured)	2–10×	0–2% (with fine-tune)	Sparse kernels (limited support)
Structured Pruning	1.5–4×	1–5% (with fine-tune)	No special hardware needed
Quantization (INT8)	2× from FP16	0.1–1%	INT8 Tensor Cores (A100+)
Quantization (INT4)	4× from FP16	1–3%	INT4 kernels (GPTQ/AWQ)
Knowledge Distillation	5–20× (smaller student)	2–10% (task-dependent)	Teacher GPU during training
LoRA (training only)	100–1000× trainable params	0–2% vs full fine-tune	Standard (merges to zero overhead)
Mixed Precision (BF16)	2× from FP32	<0.1%	BF16 Tensor Cores (A100+)

Related Lessons

Topic	Lesson
Transformer internals	Transformer — From Absolute Zero
Parameter counting & memory	Transformer Math Workbook
Distributed training	Distributed Training Workbook
Scaling laws	Scaling Book Workbook
Inference optimization	Systems & Serving Workbook

The compression engineer's toolkit. If you can derive pruning round costs, compute temperature-scaled softmax, calculate LoRA parameter budgets, reason about mixed precision tradeoffs, and design end-to-end compression pipelines — you have the quantitative foundation for deploying production LLMs. These are the exact calculations behind tools like GPTQ, AWQ, llama.cpp, QLoRA, and vLLM. "The question is not whether you can make it smaller. The question is how small can you make it before it forgets what it knows."