Every gradient, loss, and optimizer calculation a training engineer needs to do by hand. Chain rule, gradient shapes, cross-entropy, attention backward, Adam dynamics, mixed precision — all solvable in-browser with instant feedback.
You're debugging a training run. The loss isn't going down. Before you can diagnose anything, you need to trace the gradient from the loss all the way back to any weight in the network. This is the chain rule — the single most important tool in all of deep learning.
For a composition of functions f(g(x)), the derivative is: d/dx f(g(x)) = f′(g(x)) · g′(x).
Given f(x) = (3x + 2)², compute ∂f/∂x at x = 1.
Let u = 3x + 2, then f = u². Apply the chain rule: ∂f/∂x = ∂f/∂u · ∂u/∂x.
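As a quick sanity check on this exercise, the chain-rule result can be compared against a finite-difference estimate. This is only a sketch; the helper names `f`, `analyticGrad`, and `numericGrad` are illustrative and not part of the workbook.

```javascript
// f(x) = (3x + 2)^2, decomposed as u = 3x + 2, f = u^2.
function f(x) {
  const u = 3 * x + 2;
  return u * u;
}

// Chain rule: df/dx = (df/du) * (du/dx) = 2u * 3.
function analyticGrad(x) {
  const u = 3 * x + 2;
  return 2 * u * 3;
}

// Central finite difference as an independent check (hypothetical helper).
function numericGrad(fn, x, h = 1e-5) {
  return (fn(x + h) - fn(x - h)) / (2 * h);
}

console.log(analyticGrad(1));   // 30
console.log(numericGrad(f, 1)); // very close to 30
```

Comparing an analytic gradient to a finite-difference estimate like this is the same idea behind gradient checking for custom layers.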
Network: y = w2 · ReLU(w1 · x + b1) + b2. Given w1=2, b1=−1, w2=3, b2=0.5, x=1. Loss L = (y − target)² with target=4. Compute ∂L/∂w1.
Forward: z = 2·1 − 1 = 1, a = ReLU(1) = 1, y = 3·1 + 0.5 = 3.5. L = (3.5 − 4)² = 0.25. Then backprop.
The gradient is negative, so increasing w1 would decrease the loss — which makes sense since y = 3.5 is below target = 4 and increasing w1 increases the output.
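The same forward and backward pass can be written out step by step. The sketch below assumes the scalar network from the exercise; the function name `gradW1` and the intermediate variable names are illustrative.

```javascript
const relu = (z) => Math.max(0, z);
const reluGrad = (z) => (z > 0 ? 1 : 0);

function gradW1({ w1, b1, w2, b2, x, target }) {
  // Forward pass, keeping intermediates for the backward pass.
  const z = w1 * x + b1;
  const a = relu(z);
  const y = w2 * a + b2;
  const loss = (y - target) ** 2;

  // Backward pass: multiply local derivatives from the loss back to w1.
  const dL_dy = 2 * (y - target);
  const dy_da = w2;
  const da_dz = reluGrad(z);
  const dz_dw1 = x;
  return { loss, grad: dL_dy * dy_da * da_dz * dz_dw1 };
}

const result = gradW1({ w1: 2, b1: -1, w2: 3, b2: 0.5, x: 1, target: 4 });
console.log(result.loss, result.grad); // 0.25 -3
```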
This is the dead ReLU problem. When the pre-activation z is negative, the gradient is exactly zero — the weight gets no learning signal at all. If a neuron's z is always negative for all training data, it's "dead" and can never recover. This is why LeakyReLU and GELU were invented — they let a small gradient flow even for negative inputs.
Network: y = w3 · ReLU(w2 · ReLU(w1 · x)). Given w1=2, w2=−3, w3=4, x=1. Compute ∂y/∂w1.
Forward first: z1 = 2, a1 = 2, z2 = −6, a2 = ReLU(−6) = 0, y = 0.
The dead ReLU at layer 2 kills the gradient for ALL layers before it. This is how vanishing gradients work in practice — a single dead activation can block learning for an entire subnetwork upstream.
Write a function that computes the gradient of a composition fn(...f2(f1(x))). Each function is given as {f: x => ..., df: x => ...} (value and derivative). Return ∂(composition)/∂x.
```javascript
function chainGrad(funcs, x) {
  // Forward: store each intermediate value
  const vals = [x];
  for (const fn of funcs) {
    vals.push(fn.f(vals[vals.length - 1]));
  }
  // Backward: multiply all local derivatives
  let grad = 1;
  for (let i = funcs.length - 1; i >= 0; i--) {
    grad *= funcs[i].df(vals[i]);
  }
  return grad;
}
```
The sigmoid function is σ(z) = 1/(1 + e^(−z)). A beautiful fact: σ'(z) = σ(z)(1 − σ(z)). Compute σ'(0).
The maximum value of σ' is 0.25, occurring at z = 0. For large |z|, σ' → 0. This means that in a deep network with sigmoid activations, each layer's activation multiplies the gradient by at most 0.25, so (weights aside) the gradient shrinks by at least 4× per layer. This is the vanishing gradient problem that plagued early deep networks.
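A small sketch of how fast that bound decays with depth. It assumes only the σ' factors are counted and weights are ignored; `maxActivationFactor` is an illustrative name.

```javascript
const sigmoid = (z) => 1 / (1 + Math.exp(-z));
const sigmoidGrad = (z) => sigmoid(z) * (1 - sigmoid(z));

console.log(sigmoidGrad(0)); // 0.25, the maximum
console.log(sigmoidGrad(5)); // much smaller: the sigmoid is saturated

// Upper bound on the activation part of the gradient after n sigmoid layers.
// Assumption: weight matrices are ignored; only the sigma' factors are counted.
const maxActivationFactor = (n) => Math.pow(0.25, n);

console.log(maxActivationFactor(10)); // about 9.54e-7
```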
You're reviewing a custom layer implementation and need to verify the backward pass is correct. The first thing to check: does every gradient have the same shape as the parameter it corresponds to? If ∂L/∂W has a different shape than W, something is very wrong.
A linear layer has W with shape [768, 3072] and input X with shape [32, 768]. How many elements are in ∂L/∂W?
The batch dimension (32) does not appear in the gradient shape: it gets summed out during the Xᵀ · dY multiply. Every sample in the batch contributes to the same gradient tensor.
The gradient of L with respect to W1 always has the same shape as W1: [768, 3072]. It doesn't matter how many layers come after — the chain rule multiplies everything together and the batch dimension gets summed out.
A transformer layer has: WQ, WK, WV, WO each [d, d] and FFN W1 [d, 4d], W2 [4d, d] where d = 768. How many total gradient elements must be computed for one backward pass through this layer?
Every gradient tensor has the same shape as its parameter, so the backward pass computes exactly as many gradient values as there are parameters. For a 7B model, that's 7 billion gradient elements per training step.
Since b is broadcast to every row of Y, the gradient flows back from every row. The chain rule requires summing over the batch: ∂L/∂b = Σ_i ∂L/∂Y_i. This is a sum, not a mean: the loss is already averaged over the batch if needed.
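The shape rules for a linear layer Y = XW + b can be checked on tiny stand-in matrices. This is a minimal sketch using plain arrays; `transpose`, `matmul`, and `linearBackward` are hypothetical helpers, and the shapes (batch = 3, d_in = 2, d_out = 4) are chosen only to keep the output readable.

```javascript
// Transpose a 2D array.
const transpose = (A) => A[0].map((_, j) => A.map((row) => row[j]));

// Plain matrix multiply: [m,n] x [n,p] -> [m,p].
function matmul(A, B) {
  return A.map((row) =>
    B[0].map((_, j) => row.reduce((s, a, k) => s + a * B[k][j], 0))
  );
}

// Backward pass of Y = X W + b, given the upstream gradient dY.
function linearBackward(X, dY) {
  // dW = X^T dY has shape [d_in, d_out]; the batch dimension is summed out.
  const dW = matmul(transpose(X), dY);
  // db sums dY over the batch rows, giving shape [d_out].
  const db = dY[0].map((_, j) => dY.reduce((s, row) => s + row[j], 0));
  return { dW, db };
}

const X = [[1, 2], [3, 4], [5, 6]];                      // [3, 2]
const dY = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]];   // [3, 4]
const { dW, db } = linearBackward(X, dY);
console.log(dW.length, dW[0].length); // 2 4
console.log(db);                      // [ 1, 1, 1, 3 ]
```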
During backprop through Linear(768, 3072) with batch=32 and sequence length 512, we need the stored activation X of shape [B×seq, d_in] to compute ∂L/∂W = Xᵀ · dY. How much memory (in MB) does storing X in FP32 require?
This is why activation memory dominates training — every linear layer must save its input for the backward pass. A transformer with 32 layers saves these activations 32 times. Activation checkpointing trades compute for memory by recomputing these instead of storing them.
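The stored-activation size for the Linear(768, 3072) exercise is a single product. A minimal sketch, with `activationBytes` as an illustrative helper name:

```javascript
// Bytes needed to store an activation of shape [batch * seq, dIn].
function activationBytes(batch, seq, dIn, bytesPerElement) {
  return batch * seq * dIn * bytesPerElement;
}

const bytes = activationBytes(32, 512, 768, 4); // FP32 = 4 bytes per element
console.log(bytes);           // 50331648
console.log(bytes / 1e6);     // about 50.3 (MB, decimal)
console.log(bytes / 2 ** 20); // 48 (MiB, binary)
```

Whether the answer reads as 48 or about 50 depends only on whether "MB" is taken as 2^20 or 10^6 bytes.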
The loss function is the single number that tells the optimizer "how wrong you are." For language models, the loss is almost always cross-entropy: it measures how surprised the model is by the correct next token.
Logits: z = [2.0, 1.0, 0.1]. True class = 0 (first position). Compute the cross-entropy loss. (First apply softmax, then −log(p0).)
Softmax: p_i = e^(z_i) / (e^2.0 + e^1.0 + e^0.1). Use e^2 ≈ 7.389, e^1 ≈ 2.718, e^0.1 ≈ 1.105.
Cross-entropy is logarithmic: being very wrong is dramatically more expensive than being slightly wrong. This is by design. The −log function has an asymptote at 0, so assigning near-zero probability to the correct class is catastrophically penalized. This is also why label smoothing helps: it prevents the model from trying to push p_c all the way to 1.0.
Write a function that computes cross-entropy loss given logits (raw scores) and the correct class index. Apply softmax internally. Use Math.log and Math.exp.
```javascript
function crossEntropy(logits, target) {
  const maxZ = Math.max(...logits);
  const exps = logits.map(z => Math.exp(z - maxZ));
  const sumExp = exps.reduce((a, b) => a + b);
  const logProb = (logits[target] - maxZ) - Math.log(sumExp);
  return -logProb;
}
```
This cross-entropy implementation produces NaN for large logits. Click the buggy line.
```javascript
function crossEntropy(logits, target) {
  const exps = logits.map(z => Math.exp(z));
  const sumExp = exps.reduce((a, b) => a + b);
  const probs = exps.map(e => e / sumExp);
  return -Math.log(probs[target]);
}
```
Line 2 is the bug. Computing Math.exp(z) directly on raw logits overflows to Infinity when z > ~709 (FP64 limit). The fix is to subtract the max logit first: logits.map(z => Math.exp(z - maxZ)). This is the log-sum-exp trick — it produces the same softmax probabilities but avoids overflow. Every production softmax implementation uses this trick.
Perplexity = e^L where L is the average cross-entropy loss per token. If a model achieves an average loss of 2.5 nats per token on a test set, what is the perplexity? (Use e^2.5 ≈ 12.18.)
Perplexity has an intuitive interpretation: the model is "as confused as if it were choosing uniformly between ~12 options at each token." A perfect model has perplexity 1 (loss = 0). GPT-3 achieves ~20 perplexity on common benchmarks. A model that always guesses uniformly over V=50k tokens has perplexity 50,000.
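The relationship between loss and perplexity is a one-liner. A minimal sketch, assuming the loss is measured in nats; `perplexity` is an illustrative name.

```javascript
// Perplexity is the exponential of the average per-token loss (in nats).
const perplexity = (avgLossNats) => Math.exp(avgLossNats);

console.log(perplexity(2.5).toFixed(2)); // "12.18"
console.log(perplexity(0));              // 1 (a perfect model)

// Uniform guessing over V tokens has loss ln(V), so perplexity equals V.
console.log(Math.round(perplexity(Math.log(50000)))); // 50000
```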
Logits z = [2.0, 1.0, 0.1], true class = 0. From Exercise 2.1, softmax gives p = [0.659, 0.242, 0.099]. Compute all three components of the gradient ∂L/∂z.
Use the formula: ∂L/∂z_i = p_i − y_i where y = [1, 0, 0].
The gradient pushes the correct class logit up (negative gradient = increase) and all other logits down (positive gradient = decrease). The components sum to zero: −0.341 + 0.242 + 0.099 = 0. This is a consequence of probabilities summing to 1.
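The "softmax minus one-hot" gradient is short enough to verify directly. A minimal sketch; `softmax` and `crossEntropyGrad` are illustrative helpers.

```javascript
// Numerically stable softmax (max subtracted before exponentiating).
function softmax(logits) {
  const maxZ = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - maxZ));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Gradient of cross-entropy with respect to the logits: p - onehot(target).
function crossEntropyGrad(logits, target) {
  return softmax(logits).map((p, i) => p - (i === target ? 1 : 0));
}

const grad = crossEntropyGrad([2.0, 1.0, 0.1], 0);
console.log(grad.map((g) => g.toFixed(3))); // [ '-0.341', '0.242', '0.099' ]
console.log(Math.abs(grad.reduce((a, b) => a + b, 0)) < 1e-12); // true
```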
At long sequence lengths, attention becomes the most expensive operation in a transformer, in both the forward and backward passes, because its cost grows quadratically with sequence length. The backward pass through attention costs roughly twice as many FLOPs as the forward pass. If you understand the gradient flow through attention, you understand why FlashAttention matters.
For a single attention head with seq=2048, d_k=64, how many elements are in the attention score matrix S = QKᵀ/√d_k?
This is O(seq²) — the quadratic bottleneck of attention. With 32 heads, the total is 32 × 4.19M = 134M elements. At FP32 that's 536 MB just for one layer's attention scores.
For one head with d_k=64, seq=512: the backward pass through attention involves these matmuls: Pᵀ·dO [seq,seq]×[seq,d_k], dO·Vᵀ [seq,d_k]×[d_k,seq], dS·K [seq,seq]×[seq,d_k], dSᵀ·Q [seq,seq]×[seq,d_k]. Each matmul of [m,n]×[n,p] costs 2mnp FLOPs. What are the total FLOPs for these 4 matmuls?
The forward pass has only 2 matmuls (QKᵀ and PV), so the backward pass does roughly 2× the FLOPs of the forward. This "backward is ~2× forward" rule holds approximately across the entire transformer.
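The FLOP counts for the four backward matmuls and the two forward matmuls can be tallied with the 2mnp rule. A sketch with illustrative function names; softmax and scaling costs are deliberately left out, as in the exercise.

```javascript
// FLOPs for a [m,n] x [n,p] matmul.
const matmulFlops = (m, n, p) => 2 * m * n * p;

function attentionBackwardFlops(seq, dk) {
  const dV = matmulFlops(seq, seq, dk); // P^T dO
  const dP = matmulFlops(seq, dk, seq); // dO V^T
  const dQ = matmulFlops(seq, seq, dk); // dS K
  const dK = matmulFlops(seq, seq, dk); // dS^T Q
  return dV + dP + dQ + dK;
}

function attentionForwardFlops(seq, dk) {
  // Q K^T, then P V.
  return matmulFlops(seq, dk, seq) + matmulFlops(seq, seq, dk);
}

console.log(attentionBackwardFlops(512, 64)); // 134217728
console.log(attentionForwardFlops(512, 64));  // 67108864
```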
The Jacobian of softmax is ∂p_i/∂z_j = p_i(δ_ij − p_j), which can be written as diag(p) − ppᵀ. This is a dense matrix: changing any one logit z_j affects all probabilities. This is fundamentally different from ReLU (diagonal Jacobian) and is one reason softmax is relatively expensive in the backward pass.
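To make the "dense Jacobian" point concrete, the matrix can be built for a small probability vector. A minimal sketch; `softmaxJacobian` is an illustrative name, and the probabilities are the rounded values from the earlier cross-entropy exercise.

```javascript
// J[i][j] = p_i * (delta_ij - p_j), i.e. diag(p) - p p^T.
function softmaxJacobian(p) {
  return p.map((pi, i) => p.map((pj, j) => pi * ((i === j ? 1 : 0) - pj)));
}

const J = softmaxJacobian([0.659, 0.242, 0.099]);
console.log(J.map((row) => row.map((v) => v.toFixed(3))));

// Every entry is non-zero, and each row sums to (approximately) zero
// because the probabilities sum to 1.
const rowSums = J.map((row) => row.reduce((a, b) => a + b, 0));
console.log(rowSums.every((s) => Math.abs(s) < 1e-12)); // true
```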
Standard attention stores P (shape [h, seq, seq]) for the backward pass. FlashAttention only stores O (shape [h, seq, dk]) and the per-row logsumexp (shape [h, seq]). For h=32, seq=4096, dk=128, compute the memory ratio (standard / flash) in FP16.
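The memory comparison reduces to counting elements. A sketch under the exercise's assumption that everything, including the logsumexp, is counted at 2 bytes per element; `attentionMemoryRatio` is an illustrative name.

```javascript
function attentionMemoryRatio(h, seq, dk, bytesPerElement = 2) {
  // Standard attention keeps the full probability matrix P: [h, seq, seq].
  const standard = h * seq * seq * bytesPerElement;
  // Flash-style storage keeps O [h, seq, dk] plus one logsumexp per row [h, seq].
  const flash = (h * seq * dk + h * seq) * bytesPerElement;
  return { standard, flash, ratio: standard / flash };
}

const mem = attentionMemoryRatio(32, 4096, 128);
console.log(mem.standard);         // 1073741824 (1 GiB)
console.log(mem.flash);            // 33816576
console.log(mem.ratio.toFixed(2)); // "31.75"
```

The ratio simplifies to seq / (dk + 1), so the saving grows linearly with sequence length.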
Put these attention backward steps in the correct order, starting from the upstream gradient dO.
The order is: dV (only needs P and dO, both available), dP (needs dO and V), dS (needs dP and P from softmax backward), dQ (needs dS and K), dK (needs dS and Q). Note that dV and dP can be computed in parallel since they have no dependency on each other, but dS must come before dQ and dK.
The learning rate is the most important hyperparameter in all of deep learning. Too high and training diverges. Too low and you waste compute. The schedule — how the LR changes over time — is just as critical as the peak value.
Warmup from 0 to lr_max = 3×10⁻⁴ over 2000 steps. What is the learning rate at step 500?
Cosine schedule: lr_max = 3×10⁻⁴, lr_min = 3×10⁻⁵, warmup = 2000, total = 10000. What is the LR at step 6000? (cos(π/2) = 0, cos(π) = −1)
At the midpoint of the cosine decay, the LR is exactly halfway between lr_max and lr_min. The cosine shape decays slowly at first, then fast in the middle, then slowly again at the end, matching the intuition that late training needs fine-grained updates.
Write a function that returns the learning rate at a given step, with linear warmup followed by cosine decay to lr_min.
```javascript
function cosineSchedule(step, warmup, total, lrMax, lrMin) {
  if (step < warmup) {
    return lrMax * step / warmup;
  }
  const progress = (step - warmup) / (total - warmup);
  return lrMin + 0.5 * (lrMax - lrMin) * (1 + Math.cos(Math.PI * progress));
}
```
When the learning rate is too high, each gradient step overshoots the minimum. The loss increases, which produces larger gradients, which produce even larger updates: a positive feedback loop that quickly reaches Infinity, then NaN. This is training divergence. The usual fix is to reduce the LR or lengthen the warmup. The critical LR depends on the model architecture, batch size, and loss surface curvature.
LLaMA 3 70B trains for 15T tokens with batch size 4M tokens (4,194,304). Warmup is 2000 steps. How many tokens does the model see during warmup?
That's ~0.056% of the total 15T tokens spent just ramping up the learning rate. It seems like a waste, but without warmup the first few hundred steps would produce garbage updates that can permanently damage the model. The investment pays for itself.
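The warmup budget is a product and a ratio. A small sketch using the numbers quoted in the exercise; the variable names are illustrative.

```javascript
const warmupSteps = 2000;
const tokensPerStep = 4194304; // "4M" batch, i.e. 2^22 tokens
const totalTokens = 15e12;

const warmupTokens = warmupSteps * tokensPerStep;
const percentOfRun = (warmupTokens / totalTokens) * 100;

console.log(warmupTokens);            // 8388608000 (about 8.4B tokens)
console.log(percentOfRun.toFixed(3)); // "0.056"
```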
Adam is the default optimizer for virtually all modern deep learning. It combines two ideas: momentum (running average of gradients to smooth noisy updates) and adaptive learning rates (scaling each parameter's update by the inverse of its recent gradient magnitude).
β1 = 0.9, m0 = 0. Gradients: g1 = 4.0, g2 = −2.0, g3 = 6.0. Compute the bias-corrected first moment m̂3.
Note: The uncorrected m3 = 0.744 would severely underestimate the true running mean. The correction factor 1/(1 − 0.729) ≈ 3.69 compensates for the zero initialization. After ~30 steps, β1^t becomes negligible and the correction is effectively 1.
Same gradients: g1=4.0, g2=−2.0, g3=6.0. β2=0.999, v0=0. Compute v̂3 (bias-corrected second moment at step 3).
The bias correction for v is even more dramatic: the correction factor is ~333× at step 3. This is because β2 = 0.999 means v accumulates very slowly, so it takes many more steps to reach its steady state. This large correction is why the first few Adam steps behave differently — and why warmup is important.
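Both moment exercises can be reproduced by running the two exponential moving averages over the three gradients. A minimal sketch; `biasCorrectedMoments` is an illustrative name, and the correction is applied only at the final step t.

```javascript
function biasCorrectedMoments(grads, beta1, beta2) {
  let m = 0;
  let v = 0;
  for (const g of grads) {
    m = beta1 * m + (1 - beta1) * g;     // first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g; // second moment (squared gradients)
  }
  const t = grads.length;
  return {
    m,
    v,
    mHat: m / (1 - beta1 ** t),
    vHat: v / (1 - beta2 ** t),
  };
}

const moments = biasCorrectedMoments([4.0, -2.0, 6.0], 0.9, 0.999);
console.log(moments.m.toFixed(3));    // "0.744"
console.log(moments.mHat.toFixed(3)); // "2.745"
console.log(moments.vHat.toFixed(2)); // "18.67"
```

As a sanity check, 18.67 is close to the plain average of the squared gradients (16 + 4 + 36) / 3, which is what the corrected second moment should approximate this early in training.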
Adam stores m and v (both same shape as parameters) in FP32, even when parameters are in FP16. For a 7B parameter model, how many GB do the Adam optimizer states consume?
This is why training is so memory-hungry. The model weights in FP16 are only 14 GB, but the optimizer states add 56 GB, 4× the size of the model itself. This is the main reason ZeRO-style optimizer sharding exists: split these 56 GB across GPUs instead of duplicating them on each one.
This Adam implementation gives wrong updates early in training but converges to correct behavior later. Click the buggy line.
```javascript
function adamStep(theta, grad, m, v, t, lr, b1, b2, eps) {
  m = b1 * m + (1 - b1) * grad;
  v = b2 * v + (1 - b2) * grad * grad;
  const mHat = m;
  const vHat = v;
  theta = theta - lr * mHat / (Math.sqrt(vHat) + eps);
  return { theta, m, v };
}
```
Line 4 (and line 5) is the bug. The bias correction is missing: mHat = m should be mHat = m / (1 - b1**t), and similarly for vHat. Because m and v are initialized to 0, both are biased toward zero early on, so the early updates are mis-scaled. With the usual β1 = 0.9 and β2 = 0.999, √v is underestimated more than m (at step 1, m = 0.1·g but √v ≈ 0.032·|g|), which makes the first updates about 3× too large rather than too small. After many steps, β^t → 0 and the correction becomes negligible, which is why it "converges to correct behavior later."
Write a single Adam update step. Return the updated {theta, m, v}.
```javascript
function adamStep(theta, grad, m, v, t, lr, b1, b2, eps) {
  m = b1 * m + (1 - b1) * grad;
  v = b2 * v + (1 - b2) * grad * grad;
  const mHat = m / (1 - Math.pow(b1, t));
  const vHat = v / (1 - Math.pow(b2, t));
  theta = theta - lr * mHat / (Math.sqrt(vHat) + eps);
  return { theta, m, v };
}
```
Batch normalization was one of the biggest training breakthroughs: it stabilizes training by normalizing activations across the batch. But it has quirks — it behaves differently at train time vs. inference, and it fails at small batch sizes. Understanding the math explains all these behaviors.
Batch of 4 values: x = [2, 1, 3, 4]. γ = 2, β = 1, ε = 0. Compute the BN output for x1 = 2.
First compute μB and σB², then normalize x1, then scale and shift.
With batch=1, μB = x, so (x − μB) = 0 for every element. The variance is also 0. After normalization, x̂ = 0/√ε ≈ 0 everywhere — the layer outputs γ·0 + β = β regardless of input. This is why LayerNorm was invented for sequence models: it normalizes across features (the d dimension) instead of across the batch, so it works with any batch size, even 1.
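Both cases, the batch of four and the degenerate batch of one, can be run through the same forward pass. A minimal sketch for a single feature; `batchNorm` is an illustrative name, it uses the biased variance (divide by N), and the ε value for the batch-of-one case is an arbitrary small constant chosen to avoid 0/0.

```javascript
function batchNorm(xs, gamma, beta, eps) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  // Biased variance (divide by N), as used for normalization during training.
  const variance = xs.reduce((s, x) => s + (x - mean) ** 2, 0) / xs.length;
  return xs.map((x) => (gamma * (x - mean)) / Math.sqrt(variance + eps) + beta);
}

// Batch of 4: mean = 2.5, variance = 1.25.
console.log(batchNorm([2, 1, 3, 4], 2, 1, 0)[0].toFixed(3)); // "0.106"

// Batch of 1: the output collapses to beta regardless of the input.
console.log(batchNorm([7], 2, 1, 1e-5));   // [ 1 ]
console.log(batchNorm([-42], 2, 1, 1e-5)); // [ 1 ]
```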
During training, BN tracks running statistics: μrun = momentum × μrun + (1 − momentum) × μB with momentum=0.1. If μrun=0 initially and the first 3 batch means are 2.5, 3.0, 2.0, what is μrun after 3 batches?
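Note: the formula in this exercise weights the new batch mean by (1 − momentum), so momentum=0.1 means "keep 10% of old, take 90% of new." This is why the running mean converges quickly to recent batch means. PyTorch's BatchNorm uses the opposite convention, running = (1 − momentum) × running + momentum × batch, so its default momentum=0.1 keeps 90% of the old estimate. At eval time, the accumulated μrun replaces the batch mean to ensure deterministic outputs.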
BatchNorm computes statistics across the batch for each feature. This creates three problems for transformers: (1) different sequence lengths in a batch have different valid positions, (2) autoregressive generation uses batch=1, and (3) batch statistics couple samples in unwanted ways. LayerNorm sidesteps all of this by normalizing across the d-dimensional feature vector for each token independently.
RMSNorm (used in LLaMA) skips the mean subtraction: x̂_i = x_i / RMS(x) where RMS(x) = √((1/d) Σ x_i²). For x = [3, 4] and γ = [1, 1], compute the RMSNorm output for the first element.
RMSNorm saves the mean computation (one fewer reduction kernel) and has fewer parameters (no β, only γ). The original RMSNorm paper (Zhang and Sennrich, 2019) reported quality comparable to LayerNorm with running-time reductions of 7% to 64% depending on the model, and LLaMA later adopted it.
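The RMSNorm exercise in code form. A minimal sketch; `rmsNorm` is an illustrative name, and ε defaults to 0 to match the hand calculation (a real layer would add a small ε inside the square root).

```javascript
function rmsNorm(x, gamma, eps = 0) {
  // Root mean square over the feature dimension; no mean subtraction.
  const rms = Math.sqrt(x.reduce((s, xi) => s + xi * xi, 0) / x.length + eps);
  return x.map((xi, i) => (gamma[i] * xi) / rms);
}

const out = rmsNorm([3, 4], [1, 1]);
console.log(out.map((v) => v.toFixed(4))); // [ '0.8485', '1.1314' ]
```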
Real-world training rarely fits the ideal "one big batch" into GPU memory. Instead, we use two tricks: gradient accumulation (split the batch across multiple forward/backward passes, sum gradients) and mixed precision (use FP16 for speed but FP32 where precision matters).
micro_batch = 4, accumulation_steps = 8, num_GPUs = 4. What is the effective batch size?
Each GPU processes a micro-batch of 4 samples, accumulates gradients over 8 steps (32 effective per GPU), and 4 GPUs contribute in parallel. The optimizer step happens once per 128 samples — identical to training with batch size 128 on a single GPU (if it fit in memory).
A transformer layer stores activations for the backward pass. For batch=32, seq=2048, d=4096, the activations for one layer include: input (X), QKV projections, attention scores (softmax output), FFN intermediate, and residual connections — approximately 4 × [B×seq, d] tensors in the main path. How much memory is saved by storing these in FP16 instead of FP32?
For a 32-layer model, that's 64 GB of activation memory saved. This is the main practical benefit of mixed precision — not the 2× faster matmuls on tensor cores (though that helps too), but the halved activation memory that lets you train larger models or use larger batch sizes.
Multiplying the loss by S scales all gradients by S (by the chain rule). With S = 65,536, the original 1×10⁻⁹ becomes 6.55×10⁻⁵, which FP16 represents with no problem. After the backward pass, we divide all gradients by S before the optimizer step, recovering the true gradient value in FP32.
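JavaScript has no native FP16 type, so the sketch below only models underflow: any magnitude below half of the smallest FP16 subnormal (2⁻²⁴) is flushed to zero, and mantissa rounding is not modeled. The scale S = 65,536 is inferred from the numbers in the text, and `toFp16Underflow` is a hypothetical helper.

```javascript
const FP16_MIN_SUBNORMAL = 2 ** -24; // about 5.96e-8
const FP16_MIN_NORMAL = 2 ** -14;    // about 6.10e-5

// Rough model of FP16 underflow only: magnitudes below half the smallest
// subnormal round to zero. Mantissa rounding is NOT modeled here.
const toFp16Underflow = (x) => (Math.abs(x) < FP16_MIN_SUBNORMAL / 2 ? 0 : x);

const S = 65536;       // loss scale (assumed, a power of two)
const trueGrad = 1e-9; // gradient value from the text

console.log(toFp16Underflow(trueGrad)); // 0: lost without scaling

const scaled = toFp16Underflow(trueGrad * S);
console.log(scaled > FP16_MIN_NORMAL);  // true: lands in FP16's normal range

const recovered = scaled / S;           // unscale in higher precision
console.log(recovered);                 // 1e-9
```

Choosing a power of two for S means the scale and unscale steps change only the exponent, so they introduce no rounding error of their own.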
For a 7B parameter model with mixed precision (FP16 weights + FP32 master weights + Adam states), compute the optimizer memory alone. Parameters in FP16: 2 bytes each. Master copy in FP32: 4 bytes. Adam m and v in FP32: 4 bytes each.
That's 2 + 4 + 4 + 4 = 14 bytes per parameter, or 98 GB for a 7B model: 7× the size of the FP16 weights alone. This is the "14 bytes per parameter" rule for mixed-precision Adam training (FP16 gradients, not counted here, add 2 more bytes). A single 80GB A100 can't even hold the optimizer states for a 7B model (84 GB for the master weights plus m and v), so you need at least 2 GPUs with ZeRO Stage 2.
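The per-parameter budget is easy to tabulate. A sketch using the byte counts listed in the exercise; `BYTES_PER_PARAM` and `trainingStateGB` are illustrative names, and gradients and activations are deliberately excluded.

```javascript
// Byte counts per parameter, as listed in the exercise.
const BYTES_PER_PARAM = { fp16Weights: 2, fp32Master: 4, adamM: 4, adamV: 4 };

function trainingStateGB(numParams) {
  const bytesPerParam = Object.values(BYTES_PER_PARAM).reduce((a, b) => a + b, 0);
  return { bytesPerParam, gb: (numParams * bytesPerParam) / 1e9 };
}

console.log(trainingStateGB(7e9));   // { bytesPerParam: 14, gb: 98 }
console.log(trainingStateGB(125e6)); // { bytesPerParam: 14, gb: 1.75 }
```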
You're training with micro_batch=2, seq=4096, accum_steps=16, num_GPUs=8. How many tokens does the model process per optimizer step?
~1M tokens per step is a common target for LLM training. LLaMA 3 used ~4M tokens per step. The tokens-per-step determines how many optimizer steps you need: to train on 1T tokens at 1M tokens/step, you need 1,000,000 steps.
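The batch-size bookkeeping from this section fits in two small helpers. A minimal sketch with illustrative names:

```javascript
// Samples per optimizer step across all GPUs.
const effectiveBatch = (microBatch, accumSteps, numGpus) =>
  microBatch * accumSteps * numGpus;

// Tokens per optimizer step.
const tokensPerStep = (microBatch, seqLen, accumSteps, numGpus) =>
  effectiveBatch(microBatch, accumSteps, numGpus) * seqLen;

console.log(effectiveBatch(4, 8, 4));       // 128
console.log(tokensPerStep(2, 4096, 16, 8)); // 1048576

// Steps needed for 1T tokens at a round 1M tokens per step.
console.log(1e12 / 1e6);                    // 1000000
```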
The loss curve and gradient statistics are your window into what's happening inside the model during training. Learning to read these signals is like a doctor reading vital signs — it tells you whether the patient is healthy, sick, or about to crash.
A tiny model has 5 parameters with gradients: g = [0.3, −0.4, 0.5, −0.1, 0.2]. Compute the L2 gradient norm.
The total gradient norm is 5.0 but max_norm = 1.0. What scaling factor is applied to all gradients? After clipping, what is the new gradient for a parameter whose original gradient was 2.0?
Gradient clipping preserves the direction of the gradient vector but caps its magnitude. Every parameter's gradient is scaled by the same factor, so the relative magnitudes are preserved. This is much better than per-parameter clipping, which distorts the direction.
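The two clipping exercises reduce to a norm and a single scale factor. A minimal sketch; `l2Norm` and `clipScale` are illustrative helpers.

```javascript
const l2Norm = (grads) => Math.sqrt(grads.reduce((s, g) => s + g * g, 0));

// Scale factor applied to every gradient; 1 means no clipping.
const clipScale = (totalNorm, maxNorm) => Math.min(1, maxNorm / totalNorm);

console.log(l2Norm([0.3, -0.4, 0.5, -0.1, 0.2]).toFixed(4)); // "0.7416"

const scale = clipScale(5.0, 1.0);
console.log(scale);       // 0.2
console.log(2.0 * scale); // 0.4
```

Because the same factor multiplies every component, the clipped vector points in the same direction as the original.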
Write a function that clips a gradient array by its L2 norm. Return the clipped gradient array.
```javascript
function clipGradNorm(grads, maxNorm) {
  const norm = Math.sqrt(grads.reduce((s, g) => s + g * g, 0));
  if (norm <= maxNorm) return grads.slice();
  const scale = maxNorm / norm;
  return grads.map(g => g * scale);
}
```
A flat loss with tiny gradient norm means the optimizer has no useful signal to follow. This is different from convergence (where loss would be near the theoretical minimum). Common causes: (1) the LR schedule decayed too fast and hit ~0 too early, (2) the LR is so small that updates are negligible, (3) the model landed in a saddle point. The fix is usually to restart with a higher minimum LR or longer warmup.
You log the gradient norm every 100 steps: [1.2, 1.5, 1.8, 2.4, 4.1, 12.3, 89.7, NaN]. With max_norm=1.0 gradient clipping, at which step range did clipping first activate?
Clipping activates whenever the norm exceeds max_norm = 1.0. The very first logged norm is 1.2 > 1.0, so clipping was already active at step 100 (and likely from the start). The escalating norms (1.2 → 89.7 → NaN) suggest the model is diverging despite clipping — the gradients are growing faster than clipping can contain. This needs a lower learning rate, not just more aggressive clipping.
Time to put it all together. You're planning a training run for a 125M parameter model on 10 billion tokens. You need to compute everything: FLOPs, training time, memory, learning rate, cost. These are the exact calculations that ML engineers at frontier labs do before every run.
N = 125M parameters, D = 10B tokens. Total training FLOPs?
8 A100 GPUs, each at 150 TFLOP/s effective throughput. How many hours to complete 7.5 × 10¹⁸ FLOPs?
A 125M model on 10B tokens is quite small: under 2 hours on 8 A100s. For comparison, LLaMA 3 70B on 15T tokens needs about 6.3 × 10²⁴ FLOPs by the 6·N·D estimate, or roughly 12 million GPU-hours at the same 150 TFLOP/s.
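The compute and wall-clock estimates can be scripted. The sketch assumes the common C ≈ 6·N·D approximation for training FLOPs, which reproduces the 7.5 × 10^18 figure used in the exercise; the function names are illustrative.

```javascript
// Approximate training FLOPs: 6 * parameters * tokens (assumed rule of thumb).
const trainingFlops = (numParams, numTokens) => 6 * numParams * numTokens;

// Wall-clock hours given per-GPU effective throughput in FLOP/s.
function trainingHours(flops, numGpus, flopsPerSecPerGpu) {
  return flops / (numGpus * flopsPerSecPerGpu) / 3600;
}

const flops = trainingFlops(125e6, 10e9);
console.log(flops);                                     // 7.5e+18
console.log(trainingHours(flops, 8, 150e12).toFixed(2)); // "1.74"
```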
125M params, mixed-precision Adam (14 bytes/param for model + optimizer). Does this fit on a single 80GB A100 (leaving 40GB for activations)?
Easily fits on a single GPU! With 78 GB remaining for activations, you can use very large batch sizes or long sequences. In fact, the bottleneck for a 125M model is usually compute (GPU utilization), not memory. You'd use 8 GPUs for speed, not memory.
Smaller models generally tolerate higher learning rates than larger ones. The original GPT (117M) was trained with a peak lr = 2.5×10⁻⁴ and GPT-3 Small (125M) with 6×10⁻⁴; modern practice with cosine schedules suggests roughly 6 to 10 × 10⁻⁴ works well for this scale.
10B tokens, batch size = 512 sequences of length 1024. How many optimizer steps?
~19K steps is a short training run. With 2000 warmup steps, that means warmup is ~10% of total training — typical for this scale. The cosine schedule decays over the remaining 17K steps.
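The step count and warmup fraction follow directly from the batch definition. A minimal sketch with an illustrative helper name:

```javascript
// Number of full optimizer steps for a token budget.
function optimizerSteps(totalTokens, batchSeqs, seqLen) {
  return Math.floor(totalTokens / (batchSeqs * seqLen));
}

const steps = optimizerSteps(10e9, 512, 1024);
console.log(steps);                             // 19073
console.log(((2000 / steps) * 100).toFixed(1)); // "10.5" (% of steps in warmup)
console.log(steps - 2000);                      // 17073 steps of cosine decay
```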
Put these steps in the correct order for one complete training iteration with gradient accumulation and mixed precision.
The correct order: FP16 forward → Scale loss → FP16 backward → Unscale grads → Clip gradients → FP32 Adam step. Scaling must happen before backward (so gradients stay in FP16 range). Unscaling must happen before clipping (clip on true gradient magnitude). The Adam step uses FP32 master weights to maintain precision.
8 A100 GPUs for 1.74 hours. Cloud cost: $2.50/GPU-hour (A100 80GB spot). How much does this training run cost?
A 125M model on 10B tokens costs ~$35 on spot instances. Scaling this up: a 7B model on 1T tokens costs ~$200K, and LLaMA 3 405B on 15T tokens cost an estimated $100M+. Compute cost scales as O(N × D), so both model size and dataset size are cost multipliers.
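The cost figures can be reproduced from the same ingredients: 6·N·D FLOPs, 150 TFLOP/s per GPU, and $2.50 per GPU-hour. All three are assumptions carried over from the exercises above, and the helper names are illustrative.

```javascript
// GPU-hours needed for a FLOP budget at a given per-GPU throughput.
const gpuHours = (flops, flopsPerSecPerGpu) => flops / flopsPerSecPerGpu / 3600;
const costUsd = (hours, usdPerGpuHour) => hours * usdPerGpuHour;

// 125M model: 8 GPUs for 1.74 hours.
console.log(costUsd(8 * 1.74, 2.5).toFixed(2)); // "34.80"

// 7B model on 1T tokens, using 6 * N * D FLOPs.
const hours7B = gpuHours(6 * 7e9 * 1e12, 150e12);
console.log(Math.round(costUsd(hours7B, 2.5) / 1000)); // 194 (thousand USD)
```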
| Topic | Lesson |
|---|---|
| Transformer fundamentals | Transformer — From Absolute Zero |
| GPT architecture | GPT — From Absolute Zero |
| Transformer math | Transformer Math Workbook |
| Distributed training | Distributed Training — From Absolute Zero |