Quantization, pruning, distillation, LoRA, mixed precision — every calculation you need to shrink models for production deployment. Derive compression ratios, quantization errors, pruning thresholds, and distillation losses from scratch.
You just trained a 7-billion-parameter language model. In FP16, every parameter takes 2 bytes, so the model weighs 14 GB. Your deployment target is a 6 GB GPU. It doesn't fit. You need to compress it — but by how much, and what speedup can you expect?
Before diving into any compression technique, you need the arithmetic. Model size, compression ratio, and bandwidth-bound speedup are the three numbers you'll compute for every deployment decision you'll ever make. Master them here.
A 13-billion-parameter model stored in FP16 (16 bits = 2 bytes per parameter). How many GB does it occupy?
Formula: params × bytes_per_param / 10⁹
Simple rule: in FP16, the model size in GB is roughly 2× the parameter count in billions. A 7B model → 14 GB. A 13B model → 26 GB. A 70B model → 140 GB.
Your 14 GB FP16 model has been quantized down to 3.5 GB. What is the compression ratio?
Compression ratio = original / compressed. A ratio of 2× means the compressed version is half the size.
A 4× compression ratio corresponds to going from 16-bit to 4-bit representation (16 / 4 = 4). This is the compression ratio you get from INT4 quantization — the most popular deployment format for large models in 2024-2025.
LLM inference (especially at batch size 1) is memory-bandwidth-bound, not compute-bound. The GPU spends most of its time waiting for weights to stream from HBM to the compute cores. If you shrink the weights by 4×, you move 4× fewer bytes through the memory bus, so inference speeds up by up to 4×.
The dequantization overhead is small (a few multiplies per weight group) and is largely hidden behind the memory transfer time. This is why quantized models are both smaller and faster.
An A100 GPU has 2 TB/s memory bandwidth. For a single-batch inference pass, the GPU must load all model weights once. Compare the time to load a 14 GB FP16 model vs. a 3.5 GB INT4 model. What is the theoretical speedup?
Speedup = time_FP16 / time_INT4 = size_FP16 / size_INT4
Notice the speedup equals the compression ratio. This is no coincidence — when inference is purely memory-bandwidth-bound, the speedup is exactly the ratio of data moved. In practice, you get slightly less than 4× because dequantization adds a tiny compute overhead and not every operation is perfectly bandwidth-bound.
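The load-time comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 10⁹ bytes, 1 TB/s = 10¹² bytes/s) and a purely bandwidth-bound pass; the function names are illustrative and not taken from any library.

```javascript
// Time to stream all weights once from GPU memory, in milliseconds.
// sizeGB: model size in GB; bandwidthTBps: memory bandwidth in TB/s.
function weightLoadTimeMs(sizeGB, bandwidthTBps) {
  const bytes = sizeGB * 1e9;
  const bytesPerSecond = bandwidthTBps * 1e12;
  return (bytes / bytesPerSecond) * 1000;
}

// Upper bound on speedup when inference is purely bandwidth-bound.
function bandwidthBoundSpeedup(originalGB, compressedGB) {
  return originalGB / compressedGB;
}

const fp16LoadMs = weightLoadTimeMs(14, 2);  // FP16 model on a 2 TB/s GPU
const int4LoadMs = weightLoadTimeMs(3.5, 2); // INT4 model on the same GPU
console.log(fp16LoadMs, int4LoadMs, bandwidthBoundSpeedup(14, 3.5));
```

With these inputs the FP16 pass takes about 7 ms and the INT4 pass about 1.75 ms, which is the 4× ratio discussed above.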
An A100 delivers 312 TFLOPS (FP16) but only 2 TB/s memory bandwidth. For a single token, each weight is used for exactly one multiply-add (2 FLOPs). The arithmetic intensity is 2 FLOPs / 2 bytes = 1 FLOP/byte. The A100's compute-to-bandwidth ratio is 312 TFLOPS / 2 TB/s = 156 FLOP/byte.
Since the operation's arithmetic intensity (1) is far below the machine's ratio (156), the GPU is waiting for memory 99% of the time. This is why compression — which reduces bytes moved — directly translates to speedup.
At large batch sizes (roughly B ≥ 64, with the exact crossover depending on the hardware and the weight precision), the same weights are reused across many tokens, pushing arithmetic intensity above the machine ratio. Inference then becomes compute-bound, and quantization helps less with speed (though it still saves memory).
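The roofline argument can be sketched in code. This is a simplified model, assuming each weight is loaded once per pass and used for one multiply-add per token in the batch; it ignores activation and KV-cache traffic, and the function names are hypothetical.

```javascript
// Machine balance: FLOPs the GPU can execute per byte it can load.
function machineBalance(peakTflops, bandwidthTBps) {
  return peakTflops / bandwidthTBps; // the "tera" prefixes cancel, leaving FLOP/byte
}

// Simplified arithmetic intensity of a weight matmul:
// 2 FLOPs (multiply + add) per weight per token, divided by the bytes per weight.
function arithmeticIntensity(batchSize, bytesPerParam) {
  return (2 * batchSize) / bytesPerParam;
}

// Classify the workload by comparing its intensity with the machine balance.
function bottleneck(batchSize, bytesPerParam, peakTflops, bandwidthTBps) {
  const intensity = arithmeticIntensity(batchSize, bytesPerParam);
  return intensity < machineBalance(peakTflops, bandwidthTBps)
    ? "memory-bound"
    : "compute-bound";
}

console.log(machineBalance(312, 2));      // A100 numbers from the text
console.log(bottleneck(1, 2, 312, 2));    // batch 1, FP16 weights
console.log(bottleneck(256, 2, 312, 2));  // batch 256, FP16 weights
```

In this model the crossover batch size is the machine balance times the bytes per parameter, divided by 2, so it moves with both the hardware and the weight format.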
Write a function that computes model size in GB given the number of parameters and bits per parameter.
```javascript
function modelSizeGB(params, bitsPerParam) {
  return params * bitsPerParam / 8 / 1e9;
}
```
The key insight: divide by 8 to convert bits to bytes, then divide by 10⁹ to convert bytes to GB. For FP16 (16 bits), each param is 2 bytes. For INT4 (4 bits), each param is 0.5 bytes.
This compression ratio function returns values less than 1 for compressed models. That's backwards — a 4× compression should return 4, not 0.25. Click the buggy line.
```javascript
function compressionRatio(originalGB, compressedGB) {
  const ratio = compressedGB / originalGB;
  return ratio;
}
```
Line 2 is the bug. The division is inverted: it computes compressedGB / originalGB which gives 3.5 / 14 = 0.25. The correct formula is originalGB / compressedGB which gives 14 / 3.5 = 4.0.
This is a common mistake. Compression ratio is always ≥ 1 (original divided by compressed). A ratio of 0.25 would mean the "compressed" version is larger than the original.
Your 7B model has a weight with value 0.0237. You need to store it in INT8 — only 256 possible values. That means you're rounding a continuous float to one of 256 buckets. How much error did you just introduce? And can you control where those buckets land?
Quantization maps a continuous range of floating-point values to a discrete set of integer levels. The two key ingredients are the scale (how wide each bucket is) and the zero point (where the integer zero maps in float space).
Given a weight range [−1.0, 1.0] and INT8 (256 levels), compute the quantization error for x = 0.3.
Steps: (1) compute scale, (2) quantize x to integer q, (3) dequantize q back to x̂, (4) error = |x − x̂|
The error (0.002) is less than half the scale (0.004). This is guaranteed — the worst-case error for any value is scale/2, achieved when the value falls exactly between two bins.
Symmetric INT8 quantization of a tensor with absmax = 2.0. The signed INT8 range is [−127, 127]. Compute the quantization error for x = 0.5.
Symmetric: scale = absmax / 127. Quantize: q = round(x / scale). Dequantize: x̂ = q × scale.
Symmetric quantization uses the integer range [−127, 127] and leaves −128 unused so that the grid is mirrored around zero. Because there is no zero-point, float 0 always maps exactly to integer 0 with no rounding error. This is important for zero-valued activations (ReLU outputs) and skip connections.
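The symmetric example above (absmax = 2.0, x = 0.5) can be traced with a short sketch. The function name and return shape are illustrative assumptions, not a library API.

```javascript
// Symmetric quantize-then-dequantize on the signed grid [-qmax, qmax].
function symmetricFakeQuant(x, absmax, bits) {
  const qmax = Math.pow(2, bits - 1) - 1;   // 127 for INT8
  const scale = absmax / qmax;
  let q = Math.round(x / scale);
  q = Math.max(-qmax, Math.min(qmax, q));   // clamp to the symmetric range
  return { q, xHat: q * scale, scale };
}

const symExample = symmetricFakeQuant(0.5, 2.0, 8);
console.log(symExample.q, symExample.xHat, Math.abs(0.5 - symExample.xHat));
```

Here x / scale = 31.75, so q rounds to 32 and the dequantized value is about 0.5039, an error of roughly 0.0039, which is below the half-scale bound of about 0.0079.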
Asymmetric sets the range to exactly [−0.8, 1.2], distributing all 256 levels across 2.0 units of range. Scale = 2.0/255 ≈ 0.00784.
Symmetric would set absmax = max(0.8, 1.2) = 1.2, giving range [−1.2, 1.2]. That's 2.4 units, but values below −0.8 never appear, so 0.4 units (a third of the negative side, or about 17% of the total range) are wasted. Scale = 1.2/127 ≈ 0.00945, which means coarser bins and more error.
Asymmetric wins when the distribution is notably skewed. Symmetric wins on simplicity (no zero-point bookkeeping) and is preferred when the distribution is roughly centered around zero, which is common for well-trained weights.
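The two scales compared above can be computed side by side. A small sketch, assuming INT8 and the [−0.8, 1.2] range from the example; the helper names are made up for illustration.

```javascript
// Asymmetric: spread all 2^bits - 1 steps across [minVal, maxVal].
function asymmetricScale(minVal, maxVal, bits) {
  return (maxVal - minVal) / (Math.pow(2, bits) - 1);
}

// Symmetric: cover [-absmax, absmax] with 2^(bits-1) - 1 steps per side.
function symmetricScale(minVal, maxVal, bits) {
  const absmax = Math.max(Math.abs(minVal), Math.abs(maxVal));
  return absmax / (Math.pow(2, bits - 1) - 1);
}

const asymScale = asymmetricScale(-0.8, 1.2, 8);
const symScale = symmetricScale(-0.8, 1.2, 8);
console.log(asymScale, symScale, symScale / asymScale);
```

For this skewed range the symmetric bins come out about 20% wider than the asymmetric ones, which is the cost of covering values that never occur.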
For symmetric INT8 with absmax = 1.0, what is the maximum possible quantization error for any input value within the range [−1.0, 1.0]?
The worst case: a value lands exactly between two quantization levels. Max error = scale / 2.
For INT8, the max error is tiny (~0.4% of absmax). But for INT4 with the symmetric range [−7, 7], scale = 1.0/7 ≈ 0.143 and max error ≈ 0.071, about 7% of absmax. This is why INT4 quantization requires more sophisticated techniques (group quantization, calibration) to maintain accuracy.
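The worst-case error for any bit-width follows from the same half-scale rule. A minimal sketch, assuming symmetric quantization with the range [−qmax, qmax]; the function name is illustrative.

```javascript
// Worst-case rounding error for symmetric quantization: half of one step.
function maxSymmetricQuantError(absmax, bits) {
  const qmax = Math.pow(2, bits - 1) - 1; // 127 for INT8, 7 for INT4
  const scale = absmax / qmax;
  return scale / 2;
}

for (const bits of [8, 4]) {
  const err = maxSymmetricQuantError(1.0, bits);
  console.log(`INT${bits}: max error ${err.toFixed(4)} (${(100 * err).toFixed(1)}% of absmax)`);
}
```

Dropping from 8 to 4 bits makes the worst-case error about 18 times larger (127 / 7), which is why the coarser formats lean on finer-grained scales.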
Write a function that performs asymmetric quantize-then-dequantize. Given a float value x, number of bits, and the min/max range, return the dequantized value (the value after quantization error).
```javascript
function quantize(x, bits, minVal, maxVal) {
  const levels = Math.pow(2, bits) - 1;
  const scale = (maxVal - minVal) / levels;
  let q = Math.round((x - minVal) / scale);
  q = Math.max(0, Math.min(levels, q)); // clamp
  return q * scale + minVal;
}
```
The clamp step is critical — without it, values slightly outside [minVal, maxVal] would produce out-of-range integers that can't be stored in the target bit-width. In practice, calibration determines minVal and maxVal from real data, and outliers get clipped.
This quantize function (affine, with a zero point) produces garbage outputs. The quantize step looks right, but the dequantize formula is wrong. Click the buggy line.
```javascript
function quantize(x, scale, zeroPoint) {
  let q = Math.round(x / scale + zeroPoint);
  q = Math.max(0, Math.min(127, q));
  return q * scale + zeroPoint;
}
```
Line 4 is the bug. The dequantize formula is wrong in two ways: it adds zeroPoint instead of subtracting it, and it applies zeroPoint after the multiplication by scale instead of before. The correct formula is (q − zeroPoint) * scale.
The quantize step on line 2 does: q = round(x/scale + zp). To invert it: x̂ = (q − zp) × scale. The current code does q × scale + zp, which doesn't undo the original transformation.
You have a pre-trained model and you want to quantize it without retraining. No gradient updates, no fine-tuning budget, no access to the original training data. You just want to shrink it and deploy. This is Post-Training Quantization (PTQ).
The idea: run a small calibration dataset (a few hundred examples) through the model, observe the actual ranges of weights and activations, and use those ranges to set optimal scale and zero-point for each tensor. The better your range estimation, the less accuracy you lose.
A weight tensor has shape [4096, 4096] (output × input). Per-channel quantization (along the output dimension) stores one FP16 scale and one FP16 zero-point per channel. What fraction of the original parameter count is this overhead?
Overhead = (number of extra values) / (original number of values) × 100%. Each scale and zero-point is one value.
The overhead is negligible — less than 0.05%. This is why per-channel quantization is essentially free: you pay a rounding error in storage for a massive improvement in accuracy. Every serious PTQ method uses per-channel (or finer) granularity.
Imagine two channels: channel A has weights in [−0.01, 0.01] and channel B has weights in [−2.0, 2.0]. Per-tensor uses one scale for both: scale = 4.0/255 ≈ 0.0157. Channel A, with a range of only 0.02, gets mapped to just ~1 quantization level — almost all information is destroyed.
Per-channel gives channel A its own scale: 0.02/255 ≈ 0.000078, spreading its values across all 256 levels. Channel B gets its own scale: 4.0/255 ≈ 0.0157. Both channels use the full resolution of INT8.
The bit-width is the same. The total storage is nearly the same. But the effective precision per channel improves dramatically.
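The gap between per-tensor and per-channel scales shows up directly in the rounding error. The sketch below quantizes a few made-up channel A values twice, once with the shared [−2.0, 2.0] range and once with the channel's own [−0.01, 0.01] range; the sample values and helper names are assumptions for illustration.

```javascript
// Asymmetric quantize-then-dequantize over [minVal, maxVal].
function fakeQuantRange(x, bits, minVal, maxVal) {
  const levels = Math.pow(2, bits) - 1;
  const scale = (maxVal - minVal) / levels;
  let q = Math.round((x - minVal) / scale);
  q = Math.max(0, Math.min(levels, q));
  return q * scale + minVal;
}

// Mean absolute rounding error over a list of values.
function meanAbsError(values, bits, minVal, maxVal) {
  const total = values.reduce(
    (sum, v) => sum + Math.abs(v - fakeQuantRange(v, bits, minVal, maxVal)), 0);
  return total / values.length;
}

const channelA = [-0.009, -0.0042, 0.0011, 0.003, 0.0078]; // illustrative weights
const perTensorErr = meanAbsError(channelA, 8, -2.0, 2.0);    // shared scale
const perChannelErr = meanAbsError(channelA, 8, -0.01, 0.01); // channel's own scale
console.log(perTensorErr, perChannelErr);
```

With the shared scale every sample collapses onto one of two levels near ±0.0078, while the per-channel scale keeps each value within half a step of about 0.00004.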
GPTQ uses group quantization with group_size = 128. For a weight matrix [4096, 4096] quantized to INT4 with groups along the input dimension: how many groups are there? What fraction of the total storage (weights + scales + zero-points) is overhead?
Each group stores one FP16 scale (2 bytes) + one FP16 zero-point (2 bytes) = 4 bytes overhead. INT4 weights = 0.5 bytes per value.
At group_size=128, the overhead is about 6%, which is still small but no longer negligible. Smaller groups (64, 32) give better accuracy but higher overhead. Group size 128 is the standard trade-off used in GPTQ and AWQ. Some aggressive schemes use group_size=32 for 4-bit, accepting ~20% overhead (4 bytes of metadata per 16 bytes of weights) for better accuracy.
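The group-quantization overhead can be computed for any group size. This sketch follows the exercise's assumptions (one FP16 scale plus one FP16 zero-point per group, so 4 metadata bytes); real formats may pack metadata differently, and the function name is hypothetical.

```javascript
// Storage breakdown for group-quantized weights.
// metaBytesPerGroup: bytes of scale + zero-point stored per group (4 in the exercise).
function groupQuantOverhead(rows, cols, groupSize, weightBits, metaBytesPerGroup) {
  const numWeights = rows * cols;
  const numGroups = numWeights / groupSize;
  const weightBytes = (numWeights * weightBits) / 8;
  const metaBytes = numGroups * metaBytesPerGroup;
  return {
    numGroups,
    weightBytes,
    metaBytes,
    overheadFraction: metaBytes / (weightBytes + metaBytes), // share of total storage
  };
}

const g128 = groupQuantOverhead(4096, 4096, 128, 4, 4);
console.log(g128.numGroups, (100 * g128.overheadFraction).toFixed(1) + "%");
```

For group_size = 128 this gives 131,072 groups and an overhead of 1/17 of total storage, just under 6%.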
GPTQ minimizes layer-wise output reconstruction error: for each layer, it finds quantized weights W̃ that minimize ‖WX − W̃X‖², where X is the calibration data flowing through that layer.
The key insight: not all weight errors matter equally. A small error in a high-sensitivity weight damages the output more than a large error in a low-sensitivity weight. GPTQ uses the Hessian of the reconstruction loss, H = 2XXᵀ (with X laid out as [in_features × samples], matching WX above), to measure each weight's sensitivity.
It quantizes weights one column at a time, using the Hessian inverse to redistribute the quantization error of each column to the remaining (not-yet-quantized) columns. This "error compensation" is what makes GPTQ dramatically better than naive round-to-nearest PTQ.
Write a function that computes per-channel scales for INT8 asymmetric quantization. Input: a 2D weight array [out_channels][in_features]. Return: an array of scales, one per output channel.
```javascript
function perChannelScale(weights) {
  return weights.map(channel => {
    const mn = Math.min(...channel);
    const mx = Math.max(...channel);
    return (mx - mn) / 255;
  });
}
```
In practice, you'd also compute the zero-point for each channel: zp = min. For a constant channel (all values equal), scale = 0 and any quantized value dequantizes to that constant. Libraries handle this edge case by setting scale to a tiny epsilon to avoid division by zero during quantization.
Arrange the steps of a post-training quantization pipeline in the correct order.
Correct order: Run calibration data → Compute scales & zero-points → Quantize weights → Evaluate accuracy
Calibration must come first because you need to observe actual value ranges before you can compute scales. Scales must come before quantization because quantization uses them. Evaluation comes last to verify that the quantized model didn't lose too much accuracy. If accuracy drops too much, you'd try a finer granularity (per-channel → per-group), more calibration data, or a smarter method like GPTQ.
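The four pipeline steps can be sketched end to end on a toy tensor. This is an illustration under stated assumptions: the calibration data is made up, the offset is stored as the float minimum (as in the quantize function earlier), and mean squared error stands in for a real accuracy metric. All function names are hypothetical.

```javascript
// Step 1: calibration. Track the min and max seen across all calibration batches.
function calibrateRange(batches) {
  let minVal = Infinity;
  let maxVal = -Infinity;
  for (const batch of batches) {
    for (const v of batch) {
      if (v < minVal) minVal = v;
      if (v > maxVal) maxVal = v;
    }
  }
  return { minVal, maxVal };
}

// Step 2: turn the observed range into a scale and an offset.
function computeQParams(range, bits) {
  const levels = Math.pow(2, bits) - 1;
  // The tiny floor guards against a zero scale when the tensor is constant.
  const scale = Math.max((range.maxVal - range.minVal) / levels, 1e-12);
  return { scale, minVal: range.minVal, levels };
}

// Step 3: quantize to integers in [0, levels], clipping outliers.
function quantizeTensor(values, p) {
  return values.map(v =>
    Math.max(0, Math.min(p.levels, Math.round((v - p.minVal) / p.scale))));
}

function dequantizeTensor(ints, p) {
  return ints.map(q => q * p.scale + p.minVal);
}

// Step 4: evaluate. MSE is a stand-in for task accuracy or perplexity.
function meanSquaredError(a, b) {
  return a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0) / a.length;
}

const calibBatches = [[-0.4, 0.1, 0.9], [0.3, -0.7, 1.1]]; // made-up calibration data
const qParams = computeQParams(calibrateRange(calibBatches), 8);
const original = [-0.5, 0.0, 0.25, 1.0];
const restored = dequantizeTensor(quantizeTensor(original, qParams), qParams);
console.log(qParams.scale, meanSquaredError(original, restored));
```

Each function depends on the output of the previous one, which is why the order in the exercise cannot be shuffled.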
You tried PTQ on your 7B model and perplexity jumped from 5.2 to 6.8 — unacceptable. The model needs to learn to be robust to quantization noise. This means putting quantization inside the training loop itself. Welcome to Quantization-Aware Training (QAT).
The core trick: insert fake quantizers (also called "simulated quantizers") into the forward pass. Each fake quantizer quantizes a tensor to integers and immediately dequantizes back to floats. The output is still a float, but it has quantization noise baked in. The model learns to produce weights and activations that survive quantization with minimal accuracy loss.
In QAT, the forward pass computes y = fakeQuantize(x). The STE approximation says ∂y/∂x = 1 when x is within the clipping range. If the loss gradient is ∂L/∂y = 0.05 and x is within range, what is ∂L/∂x?
Chain rule: ∂L/∂x = ∂L/∂y × ∂y/∂x. STE says ∂y/∂x = 1 for in-range x.
The gradient passes through the fake quantizer unchanged. This is the "straight-through" part — the backward pass acts as if the quantizer isn't there. The forward pass still applies quantization noise, so the model learns to be robust to it. If x were outside the clipping range, the STE returns 0, telling the optimizer "this weight is so far out of range that nudging it won't help — clip it."
The round() function creates a staircase: its output is constant (flat) between integer boundaries. The derivative of a constant is zero. At the integer boundaries, the function jumps discontinuously, so the derivative is undefined.
If you used the true derivative, every weight would receive a gradient of exactly 0. Gradient descent with zero gradients means no learning. The STE is a deliberate, principled approximation: by pretending the staircase is the identity, you get useful (if noisy) gradients that point the weights toward quantization-friendly values.
Bengio et al. (2013) showed this works well empirically, and it has become the standard approach in all QAT frameworks (PyTorch's FX quantization, TensorFlow's QAT toolkit, etc.).
A model normally trains for 100 epochs on 1M samples at 1000 samples/sec throughput. QAT fine-tunes for 10% of original training duration. The fake quantization ops add 30% overhead to each forward/backward pass. How many hours does QAT take?
QAT throughput = original throughput / 1.3. QAT epochs = 10. Total samples = epochs × dataset size.
QAT is much cheaper than full training (3.6 hours vs ~28 hours for the original 100 epochs). But it's far more expensive than PTQ, which takes minutes. This is the fundamental trade-off: PTQ is cheap but loses more accuracy; QAT is expensive but recovers most of the lost accuracy. For models going to production with tight accuracy requirements, QAT is worth the cost.
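The time estimate above is a one-line calculation. A minimal sketch, assuming the overhead simply divides throughput by (1 + overhead); the function name and parameters are illustrative.

```javascript
// Wall-clock training time in hours.
// overheadFraction: extra cost per step, e.g. 0.3 for the 30% fake-quantization overhead.
function trainingHours(epochs, samplesPerEpoch, samplesPerSec, overheadFraction = 0) {
  const effectiveThroughput = samplesPerSec / (1 + overheadFraction);
  return (epochs * samplesPerEpoch) / effectiveThroughput / 3600;
}

const fullTrainingHours = trainingHours(100, 1e6, 1000);   // original run
const qatHours = trainingHours(10, 1e6, 1000, 0.3);        // 10% of the epochs, 30% slower
console.log(fullTrainingHours.toFixed(1), qatHours.toFixed(1));
```

The original run works out to about 27.8 hours and the QAT run to about 3.6 hours.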
Write the fake quantization function used in QAT. It quantizes then immediately dequantizes — the output is a float with quantization noise.
```javascript
function fakeQuantize(x, scale, zeroPoint, qmin, qmax) {
  let q = Math.round(x / scale + zeroPoint);
  q = Math.max(qmin, Math.min(qmax, q));
  return (q - zeroPoint) * scale;
}
```
Let's trace the clamped case with scale = 0.01, zeroPoint = 100, and the integer range [0, 255]: for x = 1.6, q = round(1.6/0.01 + 100) = round(260) = 260. Clamp to [0, 255]: q = 255. Dequantize: (255 − 100) × 0.01 = 1.55. The value 1.6 was beyond the representable range and got clipped to 1.55. That clipping error shows up in the loss, which is how training is pressured to keep activations within the quantization range.
This STE gradient function should return grad when x is in range, and 0 when x is out of range. But it corrupts the gradient magnitude. Click the buggy line.
```javascript
function steGradient(grad, x, clipMin, clipMax) {
  // Straight-through estimator: pass gradient if x in range
  const mask = (x >= clipMin && x <= clipMax) ? 1 : 0;
  return grad + mask;
}
```
Line 4 is the bug. It uses grad + mask (addition) instead of grad * mask (multiplication).
When x is in range, mask=1: the current code returns grad + 1 (corrupted), but it should return grad × 1 = grad (passthrough). When x is out of range, mask=0: the code returns grad + 0 = grad (should be grad × 0 = 0, i.e., blocked).
The fix: return grad * mask;. The mask acts as a gate — multiply to pass or block the gradient.
Quantization makes each weight smaller. But what if you just delete weights entirely? Set them to zero. A 7B model with 90% sparsity has 6.3 billion zeros — only 700 million nonzero weights doing useful work. This is pruning.
The question is: which weights do you delete? The simplest answer — magnitude pruning — removes the weights closest to zero, under the assumption that small weights contribute least to the output. It works surprisingly well, and it's where every pruning discussion starts.
A weight tensor has values: [0.5, −0.1, 0.8, −0.02, 0.3, −0.7, 0.05, 0.9]. Apply 50% magnitude pruning (remove the smallest 50% by absolute value). What is the sum of the surviving weights?
Sort by magnitude, find the 50th percentile threshold, zero out everything at or below it.
Notice that the pruned weights (sum = −0.1 − 0.02 + 0.3 + 0.05 = 0.23) are indeed the ones contributing least to the overall sum. Magnitude pruning assumes that small weights have small effects on the output — often true for well-trained networks, but not always (some small weights sit on high-curvature loss landscape regions).
Structured pruning removes entire structures: a row of a weight matrix (removes an output neuron), a column (removes an input feature), or an entire attention head. The result is a smaller but still dense matrix. Dense matmul on a [3072, 4096] matrix instead of [4096, 4096] does 25% less work, and standard dense kernels turn that directly into a real speedup.
Unstructured pruning at 90% sparsity creates a [4096, 4096] matrix where 90% of entries are zero, but the matrix shape is unchanged. Standard GEMM kernels still iterate over all positions. You need special sparse kernels (e.g., NVIDIA's cuSPARSE, 2:4 structured sparsity on Ampere) to skip zeros, and these only help at very high sparsity (>95%) or with hardware-supported patterns (2:4).
A dense matrix multiply [M, K] × [K, N] requires 2MKN FLOPs. After pruning 90% of weights (90% sparsity), the theoretical sparse FLOPs are 2MKN × 0.1. For M = K = N = 4096, what is the theoretical speedup?
10× is the theoretical ceiling. In practice, unstructured 90% sparsity on a GPU typically yields only 1.5-3× speedup (or sometimes no speedup at all) due to irregular memory access patterns, sparse format overhead, and poor hardware utilization. NVIDIA's Ampere architecture supports 2:4 structured sparsity (50% sparsity in a specific pattern) with near-2× speedup — the only widely-deployed hardware sparse acceleration.
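The theoretical ceiling comes straight from the FLOP count. A short sketch of that arithmetic, assuming every zero weight is skipped at no cost (which real hardware does not achieve); the function names are illustrative.

```javascript
// FLOPs for a dense [M, K] x [K, N] matmul: one multiply and one add per term.
function denseFlops(M, K, N) {
  return 2 * M * K * N;
}

// Ideal speedup if all zero weights could be skipped for free.
function theoreticalSparseSpeedup(sparsity) {
  return 1 / (1 - sparsity);
}

const dense = denseFlops(4096, 4096, 4096);
const sparse = dense * (1 - 0.9);
console.log(dense, sparse, theoreticalSparseSpeedup(0.9));
```

The same formula gives 2× at 50% sparsity, which is the ideal figure that 2:4 structured sparsity approaches on supported hardware.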
GPUs achieve their massive throughput through coalesced memory access — reading contiguous chunks of memory in parallel. A dense matmul reads weight rows sequentially, hitting every cache line once. Sparse matrices require indirect indexing: the GPU reads an index array to find where the nonzero values are, then gathers them from scattered memory locations.
This scatter-gather pattern causes cache misses (the nonzeros are spread across memory), warp divergence (threads in the same warp take different paths because the sparsity pattern is irregular), and low SIMD utilization (some lanes in a warp have data, others don't). Add the overhead of storing the sparse format (CSR needs row pointers + column indices), and the "free" skipped multiplications can cost more than they save.
This is why the industry has converged on N:M structured sparsity (e.g., 2:4 on Ampere): a regular pattern that the hardware can exploit with dedicated logic.
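To make the 2:4 pattern concrete, the sketch below keeps the two largest-magnitude values in every consecutive group of four and zeroes the other two. It only illustrates the pattern; it does not model how hardware stores or executes it, and the function name is hypothetical.

```javascript
// Apply a 2:4 sparsity pattern: in each group of 4, zero the 2 smallest magnitudes.
function pruneTwoOfFour(weights) {
  if (weights.length % 4 !== 0) {
    throw new Error("length must be a multiple of 4");
  }
  const out = weights.slice();
  for (let start = 0; start < weights.length; start += 4) {
    // Indices of this group, ordered from smallest to largest magnitude.
    const order = [0, 1, 2, 3]
      .map(i => start + i)
      .sort((a, b) => Math.abs(weights[a]) - Math.abs(weights[b]));
    out[order[0]] = 0;
    out[order[1]] = 0;
  }
  return out;
}

console.log(pruneTwoOfFour([0.5, -0.1, 0.8, -0.02, 0.3, -0.7, 0.05, 0.9]));
```

Every group ends up with exactly two nonzeros, so the sparsity is always 50% and the position of the survivors is bounded, which is what makes the pattern regular enough for dedicated hardware.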
Write a function that performs magnitude pruning on a flat weight array. Given a sparsity ratio (0-1), zero out the smallest-magnitude values.
```javascript
function magnitudePrune(weights, sparsity) {
  const mags = weights.map(w => Math.abs(w));
  const sorted = [...mags].sort((a, b) => a - b);
  const k = Math.ceil(sparsity * weights.length);
  const threshold = sorted[k - 1]; // k-th smallest magnitude
  let pruned = 0;
  return weights.map((w, i) => {
    if (mags[i] <= threshold && pruned < k) {
      pruned++;
      return 0;
    }
    return w;
  });
}
```
The pruned < k guard handles ties: if multiple weights share the threshold magnitude, we only prune enough to reach the desired sparsity. In real frameworks (for example PyTorch's torch.nn.utils.prune), the k smallest-magnitude entries are selected directly, and ties are broken arbitrarily.
Arrange the steps of an iterative pruning pipeline in the correct order.
Correct order: Train dense model → Score weight importance → Create binary mask → Apply mask → Fine-tune surviving weights
You must start with a fully-trained dense model (random initialization + pruning = disaster). Scoring comes next because you need trained weights to judge importance. The mask is derived from the scores (threshold at desired sparsity). Applying the mask zeroes out the pruned weights. Fine-tuning lets the surviving weights adapt to compensate for their removed neighbors.
In iterative pruning (the Lottery Ticket Hypothesis approach), you repeat steps 2-5 multiple times, pruning a small fraction each round: prune 20% → fine-tune → prune 20% of remaining → fine-tune → ... This is gentler than one-shot pruning and typically recovers more accuracy at high sparsity.
What if you don't need the full network? What if somewhere inside that massive 70B parameter model, there's a tiny subnetwork, maybe 10% of the original size, that could have trained to the same accuracy, if only you'd initialized it with the right weights? That's the Lottery Ticket Hypothesis, proposed by Frankle & Carbin (2019).
The idea is striking: dense, randomly-initialized networks contain sparse subnetworks (called winning tickets) that — when trained in isolation from their original initialization — can match the full network's test accuracy in a comparable number of training iterations. The full network is like a lottery: most tickets (subnetworks) are losers, but a few "win" by happening to have the right initial weights.
The catch? You can't identify the winning ticket before training. The standard method — Iterative Magnitude Pruning (IMP) — works backwards: train the full network, prune the smallest-magnitude weights, then rewind the surviving weights to their original initialization and retrain from scratch. Repeat this prune-rewind cycle until you reach the target sparsity.
Standard training takes 100 epochs. Iterative Magnitude Pruning (IMP) trains to completion, prunes 20% of remaining weights, then retrains from original init. How many total training epochs are needed to find an 80% sparse ticket?
First compute the number of rounds N = ⌈log(0.2) / log(0.8)⌉. Then total epochs = N × 100. What is the cost multiplier vs. a single training run?
This is the fundamental problem with IMP: finding a winning ticket costs 8× as much as a single training run. Later work (weight rewinding to an early iteration k, SNIP, SynFlow) tries to reduce this cost by pruning earlier in training or at initialization.
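The round count and cost multiplier follow from the formula in the exercise. A minimal sketch using the exercise's accounting of N rounds at a full training run each; the function names are illustrative.

```javascript
// Number of prune rounds needed so that the remaining fraction drops to the target.
function impRounds(targetSparsity, pruneRate) {
  const targetRemaining = 1 - targetSparsity;
  return Math.ceil(Math.log(targetRemaining) / Math.log(1 - pruneRate));
}

// Total epochs and cost relative to a single training run.
function impCost(targetSparsity, pruneRate, epochsPerRun) {
  const rounds = impRounds(targetSparsity, pruneRate);
  return { rounds, totalEpochs: rounds * epochsPerRun, costMultiplier: rounds };
}

console.log(impCost(0.8, 0.2, 100));
```

For 80% sparsity at 20% per round, log(0.2) / log(0.8) ≈ 7.21, so 8 rounds are needed: after 7 rounds about 21% of the weights remain, and after 8 about 17%.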
The critical word is original. A winning ticket is defined by both its structure (which weights survive pruning) and its initialization (the exact values those weights had at epoch 0). Re-initializing with new random weights destroys the ticket — the subnetwork typically fails to train to the same accuracy. This is the key insight: the lottery isn't just about which connections matter, but about those connections starting in the right place in the loss landscape.
Starting from a 100M parameter model, you perform 5 rounds of IMP where each round prunes 20% of the remaining weights. How many million parameters survive?
After 5 rounds of 20%-per-round pruning, about 67% of parameters have been removed. The exponential decay means early rounds remove many more absolute parameters than later rounds (20M in round 1 vs. 8.2M in round 5).
This is the most surprising result in the paper. The winning ticket is not just the topology (which neurons connect to which) — it's the topology plus the specific initial weights. Re-initializing the same sparse structure with new random values produces a network that trains no better than a random sparse network of the same size.
Think of it like a key and a lock. The structure (mask) is the shape of the key, but the initialization values are the precise tooth heights. You need both to open the lock (reach good accuracy). This suggests that certain fortuitous initial weight configurations are essential for trainability — a deeply non-obvious result.
Write a function that simulates iterative magnitude pruning, returning an array of the remaining parameter count after each round.
```javascript
function iterativePrune(totalParams, pruneRate, rounds) {
  const result = [];
  let remaining = totalParams;
  for (let i = 0; i < rounds; i++) {
    remaining *= (1 - pruneRate);
    result.push(remaining);
  }
  return result;
}
```
What if instead of compressing the weights, you train a smaller model to mimic the larger one? A 70B teacher model doesn't just output "cat" for an image of a cat — it outputs a full probability distribution: 85% cat, 10% dog, 3% tiger, 1% lion, 0.5% fox... Those soft probabilities contain rich relational information that the student would never learn from hard labels alone.
This is knowledge distillation (Hinton et al., 2015). The student learns from the teacher's soft probability distributions, which encode what Hinton calls "dark knowledge" — the teacher's implicit understanding of which classes are similar to each other. A cat being 10% dog and 3% tiger tells the student that cats look more like dogs than cars, a relationship invisible in the one-hot label [1, 0, 0, 0, ...].
The key mechanism is temperature scaling. Standard softmax produces peaked distributions (one class dominates). By dividing logits by a temperature T > 1, the distribution "softens," revealing the teacher's nuanced inter-class relationships. The student is then trained on these soft targets using KL divergence, alongside the standard hard-label cross-entropy loss.
Teacher logits z = [2.0, 1.0, 0.1]. Compute softmax at T=1 and T=5, then find the difference in entropy: H(T=5) − H(T=1).
Entropy H = −∑ᵢ pᵢ ln(pᵢ). Higher entropy = softer (more uniform) distribution.
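Step 1: Softmax at T=1: exp([2.0, 1.0, 0.1]) ≈ [7.389, 2.718, 1.105], sum ≈ 11.213, so p ≈ [0.659, 0.242, 0.099].
Step 2: Softmax at T=5: the scaled logits are [0.4, 0.2, 0.02], exp ≈ [1.492, 1.221, 1.020], sum ≈ 3.733, so p ≈ [0.400, 0.327, 0.273].
Step 3: Entropy: H(T=1) ≈ 0.847 and H(T=5) ≈ 1.087, so H(T=5) − H(T=1) ≈ 0.240.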
The maximum entropy for 3 classes is ln(3) ≈ 1.099. At T=5, we're already at 1.087 — nearly uniform. Temperature acts as a "softening knob" that reveals relationships hidden by the peaked T=1 distribution.
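The entropy comparison can be reproduced numerically. This sketch assumes natural logarithms, as in the formula above, and uses illustrative function names.

```javascript
// Temperature-scaled softmax (numerically stable: subtract the max before exp).
function softmaxAtTemp(logits, T) {
  const scaled = logits.map(z => z / T);
  const maxVal = Math.max(...scaled);
  const exps = scaled.map(z => Math.exp(z - maxVal));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Shannon entropy in nats; terms with p = 0 contribute nothing.
function entropy(probs) {
  return -probs.reduce((sum, p) => sum + (p > 0 ? p * Math.log(p) : 0), 0);
}

const teacherLogits = [2.0, 1.0, 0.1];
const hSharp = entropy(softmaxAtTemp(teacherLogits, 1));
const hSoft = entropy(softmaxAtTemp(teacherLogits, 5));
console.log(hSharp.toFixed(3), hSoft.toFixed(3), (hSoft - hSharp).toFixed(3));
```

Raising T from 1 to 5 moves the entropy from about 0.847 to about 1.087, close to the ln(3) ceiling.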
At T=1, the teacher's softmax is highly peaked — often 95%+ on the top class. The small probabilities on other classes (0.01% dog, 0.003% tiger) are numerically negligible and carry almost no gradient signal. But these tiny probabilities encode exactly the information we want: which wrong answers are less wrong.
Raising T "flattens" the distribution, amplifying these small probabilities into meaningful training signal. The student learns not just "this is a cat" but "if it's not a cat, it's probably a dog, and definitely not a car." This relational structure is the dark knowledge.
Compute the KD loss with α=0.7, T=4, KL_soft=0.05, CE_hard=2.3.
L = α · T² · KL_soft + (1 − α) · CE_hard
The distillation and hard-label terms are roughly balanced here. In practice, α = 0.7–0.9 works well — we lean heavily on the teacher's soft knowledge while still anchoring to ground truth. The T² factor is crucial for keeping gradients balanced (see next exercise).
When you divide logits by T before softmax, the chain rule introduces a factor of 1/T in the gradients with respect to the student's logits. A second factor of roughly 1/T appears because the gap between the softened student and teacher probabilities itself shrinks as T grows, so the distillation gradients scale as about 1/T². Without correction, higher temperatures would produce vanishingly small gradient updates from the distillation term, making it irrelevant compared to the hard-label loss.
Multiplying by T² compensates for this scaling (exactly in the high-temperature limit, approximately otherwise), ensuring that the relative contribution of the distillation and hard-label terms is controlled mainly by α, not accidentally by T. This is not a heuristic: it falls directly out of the calculus.
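The combined loss from the exercise can be evaluated term by term to see how the two parts balance. A small sketch that takes the KL and cross-entropy values as given inputs; the function name and return shape are assumptions for illustration.

```javascript
// Combined distillation loss: alpha * T^2 * KL_soft + (1 - alpha) * CE_hard.
function kdLossTerms(alpha, T, klSoft, ceHard) {
  const distillTerm = alpha * T * T * klSoft; // soft-target term, rescaled by T^2
  const hardTerm = (1 - alpha) * ceHard;      // hard-label term
  return { distillTerm, hardTerm, total: distillTerm + hardTerm };
}

const kdExample = kdLossTerms(0.7, 4, 0.05, 2.3);
console.log(kdExample.distillTerm.toFixed(2), kdExample.hardTerm.toFixed(2), kdExample.total.toFixed(2));
```

With the exercise's numbers the distillation term is about 0.56 and the hard-label term about 0.69, for a total of about 1.25. Without the T² factor the distillation term would be only 0.035.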
Write a function that computes temperature-scaled softmax: divide each logit by T, then apply standard softmax. Use the numerically stable version (subtract max before exp).
```javascript
function softmaxWithTemp(logits, T) {
  const scaled = logits.map(z => z / T);
  const maxVal = Math.max(...scaled);
  const exps = scaled.map(z => Math.exp(z - maxVal));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}
```
Subtracting the max before exponentiation prevents overflow (exp of large numbers). This doesn't change the result because softmax(z − c) = softmax(z) for any constant c.
This distillation loss function is producing wildly unbalanced gradients. Click the line with the bug.
```javascript
function kdLoss(softTeacher, softStudent, hardLoss, alpha, T) {
  let klDiv = 0;
  for (let i = 0; i < softTeacher.length; i++)
    klDiv += softTeacher[i] * Math.log(softTeacher[i] / softStudent[i]);
  return alpha * klDiv + (1 - alpha) * hardLoss;
}
```
Line 5 is the bug. The KL divergence term is missing the T * T scaling factor. It should be:
```javascript
return alpha * T * T * klDiv + (1 - alpha) * hardLoss;
```
Without T², the gradients from the distillation term are scaled down by 1/T². At T=10, that's a 100× reduction — the distillation signal becomes negligible and the student effectively ignores the teacher, learning only from hard labels. The whole point of distillation is lost.
A weight matrix W of shape [4096 × 4096] has 16.8 million parameters. But what if its effective rank is much lower — say, 16? That means the matrix can be well-approximated by the product of two much smaller matrices: U[4096 × 16] × V[16 × 4096], using only 131K parameters instead of 16.8M. That's a 128× compression.
This insight powers Low-Rank Adaptation (LoRA) (Hu et al., 2021), the most popular parameter-efficient fine-tuning method. Instead of updating the full weight matrix during fine-tuning, LoRA freezes the pre-trained weights W and adds a low-rank delta: ΔW = A · B, where A is [d × r] and B is [r × d]. The forward pass becomes x → Wx + ABx. Only A and B are trained — the original weights never change.
The magic: for a rank r = 16, each adapted matrix adds only 2 × d × r parameters (131K for d=4096). Applied to all attention matrices in a 7B model, that's about 16.8M trainable parameters — 0.24% of the full model. Yet LoRA fine-tuned models routinely match full fine-tuning quality.
A [4096 × 4096] weight matrix is factorized via SVD into U[4096, r] × V[r, 4096]. At what rank r does the factorization use the same number of parameters as the original matrix?
For a square [d, d] matrix, the break-even rank is always d/2. Any rank below d/2 saves parameters. LoRA typically uses r = 4–64, which is 32–512× below break-even. The savings are enormous because the update being learned is assumed to have very low effective rank.
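The break-even rank generalizes to non-square matrices by setting r × (rows + cols) equal to rows × cols. A minimal sketch with illustrative function names:

```javascript
// Parameters in a rank-r factorization U[rows, r] x V[r, cols].
function factorizedParams(rows, cols, r) {
  return r * (rows + cols);
}

// Rank at which the factorization costs as much as the full matrix.
function breakEvenRank(rows, cols) {
  return (rows * cols) / (rows + cols);
}

// How many times smaller the factorized form is than the full matrix.
function lowRankCompression(rows, cols, r) {
  return (rows * cols) / factorizedParams(rows, cols, r);
}

console.log(breakEvenRank(4096, 4096));          // d / 2 for a square matrix
console.log(factorizedParams(4096, 4096, 16));   // parameters at rank 16
console.log(lowRankCompression(4096, 4096, 16)); // compression at rank 16
```

For d = 4096 the break-even rank is 2048, and rank 16 uses 131,072 parameters, a 128× reduction.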
A 7B model with d=4096 has 4 attention matrices per layer (Q, K, V, O) across 32 layers. You apply LoRA with rank r=16 to all attention matrices. What percentage of the full model's parameters are trainable?
0.24% of parameters — yet LoRA at rank 16 typically achieves 95–100% of full fine-tuning quality on most tasks. The reason: weight updates during fine-tuning are inherently low-rank. The model has already learned general representations; fine-tuning only needs to make a small, structured adjustment.
At initialization, ΔW = A · B = A · 0 = 0. So the adapted weight is W + 0 = W — the model starts exactly at its pre-trained state. This is crucial: it means LoRA training begins from a known-good point in the loss landscape, not a random perturbation of it.
If both A and B were randomly initialized, the initial ΔW would be a random matrix scaled by √r, which would immediately corrupt the pre-trained features. The model would need to "un-learn" this random perturbation before it could start fine-tuning. Zeroing B avoids this entirely.
A weight matrix has singular values σ = [10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01]. The "energy" at rank r is $E(r) = \sum_{i=1}^{r} \sigma_i^2 \,/\, \sum_{i} \sigma_i^2$. What percentage of energy is captured at rank 4?
With just 4 out of 8 singular values (50% of full rank), we capture 99.80% of the matrix's energy (130 / 130.2626). The remaining 4 singular values contribute only 0.20%. This is typical of neural network weight matrices: most of their energy sits in a few directions. It's why LoRA works: the "important" information lives in a low-dimensional subspace.
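The energy calculation is easy to script. This is a small sketch; the function name `energyAtRank` is an assumption, and it expects the singular values sorted in descending order, as in the exercise.

```javascript
// Fraction of total squared singular-value mass captured by the top r values.
// Assumes singularValues is sorted from largest to smallest.
function energyAtRank(singularValues, r) {
  const squares = singularValues.map(s => s * s);
  const total = squares.reduce((a, b) => a + b, 0);
  const kept = squares.slice(0, r).reduce((a, b) => a + b, 0);
  return kept / total;
}

const sigma = [10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01];
console.log((100 * energyAtRank(sigma, 4)).toFixed(3) + "%"); // 99.798%
```

Sweeping r from 1 to 8 with the same function shows how quickly the curve saturates for this spectrum.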
Write a function that computes the total number of LoRA parameters for a model, given hidden dimension d, rank r, and number of adapted matrices.
```javascript
function loraParams(d, r, numMatrices) {
  const paramsPerMatrix = 2 * d * r; // A[d,r] + B[r,d]
  return paramsPerMatrix * numMatrices;
}
```
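As a usage example, the same function reproduces the trainable-fraction exercise. The wrapper `trainablePercent` is a hypothetical helper, and the 7e9 total is the rounded model size used in the text.

```javascript
function loraParams(d, r, numMatrices) {
  return 2 * d * r * numMatrices; // A[d,r] + B[r,d] per adapted matrix
}

// Percentage of the full model's parameters that LoRA trains.
function trainablePercent(d, r, matricesPerLayer, layers, totalParams) {
  return 100 * loraParams(d, r, matricesPerLayer * layers) / totalParams;
}

// Q, K, V, O across 32 layers of a 7B model with d = 4096, r = 16.
console.log(loraParams(4096, 16, 4 * 32));                          // 16777216
console.log(trainablePercent(4096, 16, 4, 32, 7e9).toFixed(2) + "%"); // 0.24%
```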
Put the LoRA fine-tuning and deployment steps in the correct order.
The correct order is: Freeze W → Init adapters → Train A,B → Merge W + AB → Deploy
The merge step is optional but common for deployment: since W' = W + AB is a single matrix of the same shape as W, the merged model has zero inference overhead — no additional latency, no extra memory. You only pay the LoRA cost during training. This is LoRA's killer feature: efficient training with free deployment.
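A minimal sketch of the merge, assuming matrices stored as arrays of rows; `matMul` and `mergeLora` are illustrative names, not a library API.

```javascript
// Dense matrix product of A[m x k] and B[k x n].
function matMul(A, B) {
  return A.map(row =>
    B[0].map((_, j) => row.reduce((sum, a, k) => sum + a * B[k][j], 0))
  );
}

// Fold the trained adapters into the base weights: W' = W + A B.
function mergeLora(W, A, B) {
  const delta = matMul(A, B);
  return W.map((row, i) => row.map((w, j) => w + delta[i][j]));
}

// Toy example with d = 3, r = 1 (illustrative values only).
const W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]];
const A = [[1], [2], [3]];
const B = [[0.5, 0, 0]];
console.log(mergeLora(W, A, B)); // [[1.5, 0, 0], [1, 1, 0], [1.5, 0, 1]]
```

The merged matrix has the same shape as W, so serving code that only knows about W needs no changes.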
Not all numbers need the same precision. An FP32 floating-point number uses 4 bytes to store 23 bits of mantissa and 8 bits of exponent. But during the forward pass, those extra mantissa bits barely affect the loss. FP16 uses 2 bytes (10-bit mantissa, 5-bit exponent), and BF16 uses 2 bytes with a different tradeoff (7-bit mantissa, 8-bit exponent — same range as FP32 but less precision). INT8 uses 1 byte with no exponent at all. INT4 packs two values into a single byte.
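The byte counts above translate directly into model size. The lookup table and `modelSizeGB` below are a sketch: the names are assumptions, and the estimate ignores any per-group scale factors that real INT8/INT4 formats store alongside the weights.

```javascript
// Bytes per parameter for the formats listed above.
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, bf16: 2, int8: 1, int4: 0.5 };

// Weight storage in GB (10^9 bytes), excluding quantization metadata.
function modelSizeGB(params, format) {
  const bytes = BYTES_PER_PARAM[format];
  if (bytes === undefined) throw new Error(`unknown format: ${format}`);
  return params * bytes / 1e9;
}

console.log(modelSizeGB(7e9, "fp32")); // 28
console.log(modelSizeGB(7e9, "fp16")); // 14
console.log(modelSizeGB(7e9, "int4")); // 3.5
```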
The savings aren't just memory. Modern GPUs have Tensor Cores, specialized matrix-multiply units that far outrun the standard FP32 cores. An A100 peaks at 312 TFLOPS for dense FP16/BF16 matmuls, versus 156 TFLOPS for TF32 (the Tensor Core mode used for FP32 matmuls) and 19.5 TFLOPS on the standard FP32 cores. An H100 (SXM) peaks at about 990 TFLOPS for BF16 versus about 495 TFLOPS for TF32. Using lower precision unlocks this hardware, giving you both a smaller model and a faster one.
Mixed precision training keeps a master copy of weights in FP32 (for optimizer accuracy) while running the forward and backward passes in FP16/BF16. Loss scaling prevents small gradient values from underflowing to zero in FP16. For inference, post-training quantization (PTQ) converts weights to INT8 or INT4 without any retraining, using calibration data to set scale factors.
Mixed precision training keeps master weights in FP32 and optimizer states in FP32 — so weight memory doesn't actually shrink. The real win is activation memory. If a 7B model's activations consume 60 GB in FP32 during training, how many GB are saved by storing activations in FP16 instead?
A common misconception: "mixed precision halves training memory." It doesn't — weight memory stays the same because you need FP32 master weights + FP32 optimizer states regardless. But activation memory (which grows with batch size and sequence length) is halved. For large batch training, activations can be 60–80% of total memory, so this saving is crucial for fitting larger batches on a single GPU.
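The activation arithmetic can be sketched as two one-liners. Both function names are hypothetical, and the model is deliberately simple: it only accounts for activations and treats weight and optimizer memory as unchanged, as the text does.

```javascript
// GB saved by storing activations at `bytesPerValue` instead of 4-byte FP32.
function activationSavingsGB(fp32ActivationGB, bytesPerValue) {
  return fp32ActivationGB * (1 - bytesPerValue / 4);
}

// Fraction of total training memory saved, given the share of the total
// that FP32 activations occupy (e.g. 0.7 for 70%).
function totalMemorySavedFraction(activationShare, bytesPerValue) {
  return activationShare * (1 - bytesPerValue / 4);
}

console.log(activationSavingsGB(60, 2));       // 30
console.log(totalMemorySavedFraction(0.7, 2)); // 0.35
```

So even when activations are 70% of the footprint, FP16 activations cut total memory by about a third, not a half.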
An A100 has 312 TFLOPS FP16 and 156 TFLOPS for FP32 matmuls on Tensor Cores (TF32 mode). A matmul of two [4096 × 4096] matrices costs 2 × 4096³ FLOPs. What is the speedup of FP16 over FP32 for this operation (assuming compute-bound)?
The speedup is exactly 2× because the A100's FP16 Tensor Core throughput is exactly 2× its TF32 throughput; the FLOP count cancels out of the ratio. In practice, you often get closer to 1.5–1.8× because real workloads mix compute-bound and memory-bound operations. The H100 has roughly 3× the dense FP16/BF16 throughput of the A100 (about 990 vs. 312 TFLOPS), plus native FP8 support at 1,979 TFLOPS.
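To see where the ratio comes from, the sketch below computes the compute-bound time at each peak rate. The function names are assumptions, and the times are lower bounds at peak throughput rather than measured numbers.

```javascript
// FLOPs for multiplying an [m x k] matrix by a [k x n] matrix.
function matmulFlops(m, k, n) {
  return 2 * m * k * n;
}

// Ideal compute-bound time in milliseconds at a given peak rate in TFLOPS.
function matmulTimeMs(flops, tflops) {
  return flops / (tflops * 1e12) * 1e3;
}

const flops = matmulFlops(4096, 4096, 4096);   // 137438953472
const fastMs = matmulTimeMs(flops, 312);       // higher peak rate from the exercise
const slowMs = matmulTimeMs(flops, 156);       // lower peak rate from the exercise
console.log(flops);
console.log(slowMs / fastMs);                  // 2
```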
| Format | Sign bits | Exponent bits | Mantissa bits | Range (max magnitude) |
|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | ±3.4 × 10^38 |
| FP16 | 1 | 5 | 10 | ±6.5 × 10^4 |
| BF16 | 1 | 8 | 7 | ±3.4 × 10^38 |
FP16's 5-bit exponent limits its range to ~65,504. Gradients or loss values outside this range cause overflow (Inf) or underflow (0) — requiring careful loss scaling. BF16 matches FP32's full range, so you can typically drop it in without any loss scaling. The price is lower precision (7 vs 10 mantissa bits), but empirically this barely affects training quality for large models.
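The ranges follow directly from the bit layout. This sketch assumes an IEEE-style format in which the top exponent code is reserved for Inf/NaN, which holds for FP32, FP16, and BF16; the function name `maxFinite` is illustrative.

```javascript
// Largest finite value of an IEEE-style float with the given field widths.
// The largest usable exponent equals the bias, 2^(expBits - 1) - 1,
// and the largest mantissa is 2 - 2^(-manBits).
function maxFinite(expBits, manBits) {
  const bias = 2 ** (expBits - 1) - 1;
  return (2 - 2 ** -manBits) * 2 ** bias;
}

console.log(maxFinite(5, 10));                  // 65504 (FP16)
console.log(maxFinite(8, 7).toExponential(2));  // 3.39e+38 (BF16)
console.log(maxFinite(8, 23).toExponential(2)); // 3.40e+38 (FP32)
```

BF16 and FP32 share the 8-bit exponent, so they land within a fraction of a percent of each other; the mantissa width only nudges the leading digits.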
A 70B parameter model stored in FP16 occupies 140 GB. What is the compression ratio if you quantize to FP8 (1 byte per parameter)?
FP8 has two variants: E4M3 (4 exponent, 3 mantissa — more precision, max ~448) and E5M2 (5 exponent, 2 mantissa — more range, max ~57,344). E4M3 is typically used for weights and activations in the forward pass, while E5M2 is used for gradients (which need more dynamic range). The H100 was the first GPU with native FP8 Tensor Core support.
This mixed precision training loop scales the loss to prevent FP16 underflow but makes a critical mistake. Click the buggy line.
```javascript
function trainStep(model, data, lossScale) {
  const loss = model.forward(data);   // FP16 forward
  const scaled = loss * lossScale;    // scale up
  const grads = backward(scaled);     // FP16 gradients
  updateWeights(model, grads);        // apply to FP32 master
}
```
Line 5 is the bug. The gradients are still scaled up by lossScale — you must divide them by lossScale before updating the FP32 master weights. Without unscaling:
```javascript
// BUG: effective lr = actual_lr * lossScale (e.g., 1024x too large!)
updateWeights(model, grads);

// FIX: unscale gradients before update
const unscaledGrads = grads.map(g => g / lossScale);
updateWeights(model, unscaledGrads);
```
Loss scaling works in three steps: (1) multiply loss by scale, (2) backward pass with scaled loss produces scaled gradients, (3) divide gradients by scale before the optimizer step. Forgetting step 3 means your effective learning rate is multiplied by lossScale (typically 1024–65536), causing immediate divergence.
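The three steps can be checked numerically on a toy problem. Everything here is an assumption for illustration: `scaledSgdStep` and `toyGrad` are hypothetical names, the loss is L(w) = Σ wᵢ² / 2, and the arithmetic runs in ordinary JavaScript doubles rather than FP16, so it demonstrates the bookkeeping, not the underflow itself.

```javascript
// One SGD step with static loss scaling on a flat weight array.
// gradFn(weights, scale) is a hypothetical callback that returns the
// gradients of (scale * loss), i.e. already-scaled gradients.
function scaledSgdStep(weights, gradFn, lr, lossScale) {
  const scaledGrads = gradFn(weights, lossScale);      // steps 1-2: scaled backward
  const grads = scaledGrads.map(g => g / lossScale);   // step 3: unscale
  return weights.map((w, i) => w - lr * grads[i]);
}

// Toy loss L(w) = sum(w_i^2) / 2, so d(scale * L)/dw_i = scale * w_i.
const toyGrad = (w, scale) => w.map(wi => scale * wi);

const withScaling = scaledSgdStep([1, -2], toyGrad, 0.1, 1024);
const noScaling = scaledSgdStep([1, -2], toyGrad, 0.1, 1);
console.log(withScaling, noScaling); // both approximately [0.9, -1.8]
```

With the unscale step in place, the update is independent of `lossScale`; dropping that division would multiply the step by 1024.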
You've just been handed a 13B parameter model in FP16 (26 GB) and told to deploy it on a single consumer GPU with 8 GB of VRAM. That's a 3.25× gap. No single technique closes it — you need to combine them: structured pruning to remove entire attention heads, quantization to shrink the surviving weights, maybe even distillation to train a smaller student. Each technique has different quality-compression tradeoffs, and the order you apply them matters.
This capstone tests your ability to compose everything from Chapters 0–8 into real deployment pipelines. Every exercise combines multiple techniques and requires reasoning about their interactions.
Start with a 13B parameter model in FP16 (26 GB). First, apply 50% structured pruning (removing entire neurons/heads), reducing to 6.5B params. Then quantize to INT4 (0.5 bytes per param). What is the total compression ratio from the original size?
Pruning and quantization multiply: 2× from pruning × 4× from FP16→INT4 = 8× total. This is why practical deployments combine techniques — no single method would get 8× compression with acceptable quality loss. The 3.25 GB fits comfortably in an 8 GB consumer GPU with room for KV cache and activations.
The correct order is prune → fine-tune → quantize. Pruning drastically changes the weight distribution (surviving weights may shift to compensate for removed ones, especially after fine-tuning). If you quantize first, the quantization grid (scale factors and zero points) was calibrated for the original weight distribution — which no longer exists after pruning. The quantization error will be much higher.
Conversely, quantizing the final pruned+fine-tuned model lets the calibration data see the actual weight distribution that will be deployed. This consistently gives 1–3% better accuracy than the reverse order.
Teacher: 70B FP16 (latency = 100ms/token). Student: 7B INT8 (10× fewer FLOPs, and INT8 has 2× hardware speedup over FP16 on Tensor Cores). What is the expected student latency per token?
The idealized answer is 100 ms ÷ 10 ÷ 2 = 5 ms/token. Real speedups are lower because autoregressive decoding is memory-bandwidth-bound (not compute-bound), so the INT8 hardware speedup may only be 1.3–1.5× in practice. Still, a 7B INT8 student serving at ~10 ms/token is realistic and represents a massive cost reduction over a 70B FP16 teacher. At the ideal 5 ms/token, a single sequence would decode at 200 tokens/second.
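The estimate is a single division, sketched here with hypothetical names. It assumes the FLOP reduction and the hardware speedup compose multiplicatively, which is the idealized case described above.

```javascript
// Ideal per-token latency of the student, assuming both speedups multiply.
function idealStudentLatencyMs(teacherMs, flopReduction, hardwareSpeedup) {
  return teacherMs / (flopReduction * hardwareSpeedup);
}

// Tokens per second for one sequence at a given per-token latency.
function tokensPerSecond(latencyMs) {
  return 1000 / latencyMs;
}

const idealMs = idealStudentLatencyMs(100, 10, 2);
console.log(idealMs, tokensPerSecond(idealMs));   // 5 200

// With a 1.3-1.5x practical INT8 speedup instead of 2x:
console.log(idealStudentLatencyMs(100, 10, 1.5)); // about 6.67
console.log(idealStudentLatencyMs(100, 10, 1.3)); // about 7.69
```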
Write a function that computes the final model size (in GB) after pruning and quantization. The function receives the original parameter count, original precision (bits), prune fraction, and target quantization bits.
```javascript
function compressionPipeline(params, originalBits, pruneFraction, quantBits) {
  const remaining = params * (1 - pruneFraction);
  return remaining * quantBits / 8 / 1e9;
}
```
Note that originalBits doesn't appear in the calculation — it tells you the starting size for context, but the final size depends only on remaining params and target bits. The compression ratio would be (params * originalBits/8/1e9) / result.
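The ratio mentioned above can be wrapped in a companion function. `compressionRatio` and `fitsInVram` are hypothetical helpers, and the VRAM check only compares weight size against capacity; it does not budget for KV cache or activations.

```javascript
function compressionPipeline(params, originalBits, pruneFraction, quantBits) {
  const remaining = params * (1 - pruneFraction);
  return remaining * quantBits / 8 / 1e9;
}

// Original size divided by final size.
function compressionRatio(params, originalBits, pruneFraction, quantBits) {
  const originalGB = params * originalBits / 8 / 1e9;
  return originalGB / compressionPipeline(params, originalBits, pruneFraction, quantBits);
}

// True if the compressed weights alone fit within the given VRAM.
function fitsInVram(sizeGB, vramGB) {
  return sizeGB <= vramGB;
}

const finalGB = compressionPipeline(13e9, 16, 0.5, 4);
console.log(finalGB);                           // 3.25
console.log(compressionRatio(13e9, 16, 0.5, 4)); // 8
console.log(fitsInVram(finalGB, 8));            // true
```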
This compression pipeline has a subtle ordering bug that will produce a much worse model than expected. Click the buggy line.
```javascript
function compress(model) {
  quantize(model, 'int4');              // quantize first
  prune(model, 0.5);                    // then prune (50% sparsity)
  calibrate(model, calibrationData);    // calibrate after both
  return model;
}
```
Line 2 is the bug. Quantizing before pruning means the INT4 scale factors and zero points were computed on the original dense weights. After pruning removes 50% of parameters, the surviving weight distribution shifts significantly — but the quantization grid is now wrong for these new values.
The correct order is:
```javascript
function compress(model) {
  prune(model, 0.5);                    // prune first (50% sparsity)
  finetune(model);                      // recover accuracy
  quantize(model, 'int4');              // quantize pruned weights
  calibrate(model, calibrationData);    // calibrate final model
  return model;
}
```
In the buggy version, the calibration on line 4 can't fully fix the damage either: it can adjust scale factors, but the quantization grid was already committed at line 2. Always prune first, then quantize.
Arrange the complete model compression and deployment pipeline in the correct order.
The correct order is: Profile → Structured prune → Fine-tune → Quantize (PTQ) → Evaluate → Deploy
Profile first: identify which layers contribute least (via sensitivity analysis, activation magnitudes, or importance scores). Prune: remove the least important structures. Fine-tune: let the model recover from pruning damage (typically 1–5% of original training). Quantize: calibrate on representative data after pruning, because the weight distribution has changed. Evaluate: verify quality on held-out benchmarks before deploying. Deploy: serve the compressed model.
| Technique | Typical Compression | Quality Loss | Hardware Needs |
|---|---|---|---|
| Weight Pruning (unstructured) | 2–10× | 0–2% (with fine-tune) | Sparse kernels (limited support) |
| Structured Pruning | 1.5–4× | 1–5% (with fine-tune) | No special hardware needed |
| Quantization (INT8) | 2× from FP16 | 0.1–1% | INT8 Tensor Cores (A100+) |
| Quantization (INT4) | 4× from FP16 | 1–3% | INT4 kernels (GPTQ/AWQ) |
| Knowledge Distillation | 5–20× (smaller student) | 2–10% (task-dependent) | Teacher GPU during training |
| LoRA (training only) | 100–1000× fewer trainable params | 0–2% vs full fine-tune | Standard (merges to zero overhead) |
| Mixed Precision (BF16) | 2× from FP32 | <0.1% | BF16 Tensor Cores (A100+) |
| Topic | Lesson |
|---|---|
| Transformer internals | Transformer — From Absolute Zero |
| Parameter counting & memory | Transformer Math Workbook |
| Distributed training | Distributed Training Workbook |
| Scaling laws | Scaling Book Workbook |
| Inference optimization | Systems & Serving Workbook |