The hidden arithmetic that makes or breaks every ML system. Floating point, stability, convergence, differentiation — all solvable by hand with instant feedback.
You're training a neural network and the loss suddenly becomes NaN. Or your model works in FP32 but produces garbage in FP16. Or your gradient accumulator silently loses precision after 10,000 steps. Every one of these bugs traces back to how computers represent real numbers — and it's nothing like the real number line you learned in math class.
IEEE 754 floating point splits a number into three fields:
The largest FP32 number has exponent bits = 11111110 (254, since 11111111 is reserved for Inf/NaN) and all mantissa bits = 1. Compute the value.
Hint: exponent = 254 - 127 = 127. Mantissa = (223-1)/223 ≈ 1 - 2-23. Value = 2127 × (2 - 2-23).
This is the famous FLT_MAX = 3.4028235e+38. Any computation exceeding this overflows to +Inf. In ML, this happens when you exponentiate large logits without the stable softmax trick.
Machine epsilon is the smallest ε such that 1.0 + ε ≠ 1.0 in floating point. For FP32 with 23 mantissa bits, what is ε?
The mantissa has 23 bits after the implicit leading 1. The smallest change to the representation of 1.0 is flipping the last mantissa bit.
This means FP32 gives you about 7 decimal digits of precision. The number 1.0000001 is representable, but 1.00000001 rounds to 1.0. For FP16 (10 mantissa bits), ε = 2-10 ≈ 9.77×10-4 — less than 4 digits! For BF16 (7 mantissa bits), ε = 2-7 ≈ 0.0078 — barely 2.4 digits.
BF16 has 7 mantissa bits. What is the largest integer N such that every integer from 0 to N is exactly representable in BF16?
Think: with 7 mantissa bits + 1 implicit bit, you have 8 bits of significand = 256 values. Integers up to 28 are exact, but what about 28+1?
This is why BF16 gradient accumulators lose precision fast. If your running sum exceeds 256, adding 1.0 to it does nothing — the 1.0 is lost! This is why mixed-precision training keeps the master weights and accumulator in FP32.
Match each scenario to the best floating-point format. Drag the chips to the slots.
Forward pass → BF16: Wide exponent range handles activations without overflow. Precision loss is tolerable since we backprop through these.
Master weights → FP32: Tiny gradient updates (e.g., lr=1e-5 × grad) need full precision to accumulate over thousands of steps.
Gradient accumulation → FP32: Summing many small gradients in low precision causes them to vanish. FP32 accumulation is standard.
Loss scaling factor → FP16: Loss scaling multiplies the loss by a large constant (e.g., 1024) before backprop, then divides after. The scale factor itself is often stored in FP16 as part of the loss computation pipeline, though it can also be FP32. FP16 is acceptable here because the factor is a single large number, not a sum of small ones.
In FP32, the gap between 1.0 and the next representable number is 2-23. What is the gap between 1024.0 and the next representable FP32 number?
1024 = 210. The exponent is 10, so the spacing between consecutive floats is 210 × 2-23.
Floating-point spacing grows with magnitude. Near 1.0, the gap is ~10-7. Near 1024.0, it's ~10-4. Near 106, it's ~0.06. This is why large loss values lose precision — the representable numbers get farther apart. Loss scaling works by keeping values in a range where spacing is tight enough for meaningful gradients.
You subtract two nearly equal numbers and the result has far fewer significant digits than either operand. This is catastrophic cancellation — the silent assassin of numerical computing. It's why naive softmax overflows, why variance computations go negative, and why your loss function sometimes reports negative cross-entropy (which is impossible).
In FP64, computing (1 + 10-15) - 1 gives 1.110223024625157×10-15 instead of the true answer 10-15. What is the relative error?
Relative error = |computed - true| / |true|.
An 11% error from a single subtraction! Both operands were stored with ~15.9 digits of precision, but the subtraction wiped out all but ~1 digit. The 0.110223... comes from 2-53 × 250 = 2-3 ≈ 0.125, a rounding artifact of binary representation.
The log-sum-exp trick computes log(∑ exp(xi)) stably. The identity: LSE(x) = max(x) + log(∑ exp(xi - max(x))). Compute LSE([10, 20, 30]) by hand.
The answer is dominated by the max term (30). Without the trick, you'd compute exp(30) ≈ 1013 which is fine, but exp(1000) would overflow. The trick ensures the largest exponent is always 0, so no overflow is possible regardless of input magnitude. This is the backbone of stable cross-entropy loss.
javascript function logSumExp(arr) { const maxVal = Math.max(...arr); const sumExp = arr.reduce((s, x) => s + Math.exp(x - maxVal), 0); return maxVal + Math.log(sumExp); }
This variance function can return negative values for certain inputs. Click the buggy line.
function variance(arr) { const n = arr.length; let sumSq = 0, sum = 0; for (let i = 0; i < n; i++) { sumSq += arr[i] * arr[i]; sum += arr[i]; } return sumSq / n - (sum / n) * (sum / n); }
Line 8 uses the textbook formula Var = E[X²] - E[X]². This is mathematically correct but numerically catastrophic. When the mean is large (e.g., arr = [1000000.0, 1000000.1]), sumSq/n ≈ 1012 and (sum/n)² ≈ 1012. Subtracting two huge nearly-equal numbers → catastrophic cancellation → the result can go negative.
The fix: use the two-pass algorithm. First compute the mean, then compute ∑(xi - mean)² / n. The subtraction happens at each element (small numbers), not at the aggregate (huge numbers).
When subtracting a - b where a ≈ b, the number of significant digits lost is approximately log10(|a| / |a - b|). If a = 1.234567 and b = 1.234566, how many digits are lost?
We started with 7 digits of precision in each operand and lost ~6 of them, leaving ~1 reliable digit. This formula is a quick mental check: if a/|a-b| ≈ 10k, you lose k digits. In FP32 (7 digits), losing 6 leaves 1 — barely usable. In FP64 (15.9 digits), losing 6 leaves ~10 — usually fine.
You're solving a linear system Ax = b and get a completely wrong answer, even though your algorithm is correct. The problem isn't your code — it's the matrix. Some systems are so sensitive to tiny perturbations in the input that floating-point rounding is enough to destroy the answer. The condition number quantifies this sensitivity.
A = [[1, 1], [1, 1.0001]]. First find A-1 using the 2×2 formula: A-1 = (1/det) × [[d, -b], [-c, a]]. Then compute κ(A) using the infinity norm: ‖A‖∞ = max row sum of absolute values.
A condition number of 40,000 means a perturbation of 10-7 in b (FP32 machine epsilon) gets amplified to 10-7 × 4×104 = 4×10-3 relative error in x. In FP32, you lose ~4.6 of your 7 digits. This matrix is ill-conditioned because its rows are nearly parallel.
Using A = [[1, 1], [1, 1.0001]] from above, solve Ax = b for b = [2, 2.0001]. Then solve for b' = [2, 2.0002] (a tiny perturbation of 0.00005 relative). How much does x change?
A 0.005% change in b produced a 100% change in x. The solution completely flipped! This is the condition number at work: κ ≈ 40,000 means small perturbations get amplified by up to 40,000×.
What is κ(I) for the n×n identity matrix? (Think: what are the singular values of I?)
The identity matrix is perfectly conditioned. Solving Ix = b gives x = b exactly (no amplification). κ = 1 is the best possible. Orthogonal matrices (QTQ = I) also have κ = 1. This is why QR decomposition is preferred over direct inversion — it factors A into a product involving orthogonal matrices.
Compute the infinity-norm condition number of a 2×2 matrix [[a,b],[c,d]].
javascript function conditionNumber(a, b, c, d) { const normA = Math.max( Math.abs(a) + Math.abs(b), Math.abs(c) + Math.abs(d) ); const det = a * d - b * c; const normAinv = Math.max( Math.abs(d/det) + Math.abs(b/det), Math.abs(c/det) + Math.abs(a/det) ); return normA * normAinv; }
Direct solvers like Gaussian elimination are O(n³). For a 100,000 × 100,000 sparse system (common in physics simulations and graph-based ML), that's 1015 operations — days of compute. Iterative methods start with a guess and improve it, often converging in far fewer operations, especially when the matrix is sparse.
Solve this system using Jacobi iteration, starting from x(0) = [0, 0]:
Compute x(1) and x(2). What is x1(2) (the first component after 2 iterations)?
Jacobi formula: x1(k+1) = (9 - x2(k))/4, x2(k+1) = (11 - x1(k))/3
The true solution is x = [1, 3]. After just 2 iterations we have [1.333, 2.917] — already close. Convergence is guaranteed here because A is strictly diagonally dominant (|4| > |1| and |3| > |1|).
Continue from x(2) = [1.333, 2.917]. Compute x2(3).
After 3 iterations: [1.021, 3.222]. The error in x1 went from 1.333 to 0.021 — it's converging nicely. True solution: [1, 3].
For our original system (A = [[4,1],[1,3]]), the Jacobi iteration matrix is M = D-1(L+U) = [[0, -1/4], [-1/3, 0]]. Compute the spectral radius ρ(M) = max |eigenvalue|.
For a 2×2 matrix [[0,a],[b,0]], the eigenvalues are ±√(ab).
The spectral radius tells you the convergence rate. Each iteration reduces the error by a factor of ~0.289. After k iterations, error ≈ 0.289k × initial error. To get 6 digits of accuracy: 0.289k < 10-6 → k > 11.1. So about 12 iterations suffice.
javascript function jacobiStep(A, b, x) { const n = b.length; const xNew = new Array(n); for (let i = 0; i < n; i++) { let sum = 0; for (let j = 0; j < n; j++) { if (j !== i) sum += A[i][j] * x[j]; } xNew[i] = (b[i] - sum) / A[i][i]; } return xNew; }
You need to find where f(x) = 0. Binary search works but it's slow — one bit of precision per iteration. Newton's method uses the derivative to build a local linear approximation and jumps directly to where that line crosses zero. The result: once you're close enough, the number of correct digits doubles every iteration.
torch.optim's L-BFGS and is why second-order optimizers converge so fast (when they work).To find √2, solve f(x) = x² - 2 = 0. f'(x) = 2x. Starting from x0 = 1, compute x1, x2, x3, and x4. What is x4 (to 6 decimal places)?
The true value is √2 = 1.41421356237... By x3 we have 6 correct digits, and x4 gives 12+. Watch the quadratic convergence: error went 0.414 → 0.086 → 0.0025 → 0.000002 → 0.000000000004. Each step roughly squares the error.
From the √2 example, compute the error at each step. What is |e2|/|e1|²?
This ratio should be approximately constant for quadratic convergence. True root = 1.41421356...
The constant C ≈ 0.35 = |f''(root)|/(2|f'(root)|) = |2|/(2×2√2) = 1/(2√2) ≈ 0.354. This is the convergence constant for Newton's method on x²-2.
javascript function newton(f, fPrime, x0, nIter) { let x = x0; for (let i = 0; i < nIter; i++) { x = x - f(x) / fPrime(x); } return x; }
This Newton's method implementation crashes on certain inputs. Click the buggy line.
function newton(f, fp, x0, tol) { let x = x0; for (let i = 0; i < 100; i++) { const fx = f(x); if (Math.abs(fx) < tol) return x; x = x - fx / fp(x); } return x; }
Line 6 divides by fp(x) without checking if the derivative is zero. If f'(x) = 0 at any iteration (a horizontal tangent), the step becomes ±Infinity, and all subsequent iterates are NaN. The fix: check if (Math.abs(fp(x)) < 1e-12) return x; before dividing — or throw an error indicating the method has stalled at a stationary point.
Every neural network trains via gradient descent (or a variant). But how fast does it converge? When does it diverge? The answers come from analyzing the simplest possible case: a quadratic function. If you understand gradient descent on f(x) = x², you understand the core dynamics of all first-order optimization.
f(x) = x², α = 0.1, x0 = 5. Compute x1 through x5 and f(x5).
Equivalently: x5 = 0.85 × 5 = 0.32768 × 5 = 1.6384. After 5 steps with α=0.1, the loss went from f(5)=25 down to f(1.6384)=2.684 — about a 10× reduction. To reach f < 0.01, you need 0.82k × 25 < 0.01, giving k ≥ 35 steps.
For f(x) = x², what is the maximum learning rate α before gradient descent diverges? Compute x1 with α = 1.1, x0 = 1 to verify.
For general f with Lipschitz-continuous gradient with constant L, the stability bound is α < 2/L. For f(x) = x², L = 2, so α < 2/2 = 1. This is why learning rate schedules start small — to ensure stability.
javascript function gradientDescent(grad, x0, lr, nSteps) { let x = x0; for (let i = 0; i < nSteps; i++) { x = x - lr * grad(x); } return x; }
For f(x) = x² with α = 0.1 and x0 = 10, how many steps k are needed to achieve |xk| < 0.01?
Use |xk| = 0.8k × 10 < 0.01, solve for k.
31 steps to get 3 digits of accuracy. With α=0.3, the factor is 0.4, and you'd need only ln(0.001)/ln(0.4) = 7.5 → 8 steps. Larger learning rate = faster convergence (until you hit the stability limit).
You need the area under a curve, but the integral has no closed form — or you only have data points, not a formula. Numerical integration (quadrature) approximates ∫f(x)dx using weighted sums of function evaluations. The trapezoidal rule connects points with straight lines; Simpson's rule uses parabolas. The choice determines how fast the error shrinks as you add more points.
Compute ∫01 x² dx using the trapezoidal rule with n=4 intervals. (True answer = 1/3 ≈ 0.3333.)
h = (1-0)/4 = 0.25. Points: x = 0, 0.25, 0.50, 0.75, 1.0. Apply the formula.
Error = |0.34375 - 0.33333| = 0.01042. With n=4, the error is O(h²) = O(0.0625). The ratio error/h² ≈ 0.01042/0.0625 = 0.167 ≈ 1/6, which matches the error formula -(b-a)h²f''(c)/12 = -1×0.0625×2/12 = -0.01042. Theory confirmed!
Compute ∫01 x² dx using Simpson's rule with n=4 intervals (same points as above).
Simpson's rule gives the exact answer for x²! This is because Simpson's rule exactly integrates polynomials up to degree 3 (its error involves the 4th derivative, and f''''(x²) = 0). This is a general pattern: higher-order quadrature rules are exact for higher-degree polynomials.
javascript function trapezoidal(f, a, b, n) { const h = (b - a) / n; let sum = f(a) + f(b); for (let i = 1; i < n; i++) { sum += 2 * f(a + i * h); } return (h / 2) * sum; }
To estimate ∫01 f(x)dx with error < 10-6, how many points does each method need? Assume f is smooth with bounded 2nd derivative |f''| ≤ 1.
Trapezoidal error ≤ h²/12 = 1/(12n²). Monte Carlo error ≈ σ/√N where σ ≤ 1.
In 1D, trapezoidal needs ~289 points while Monte Carlo needs a trillion. Deterministic quadrature wins overwhelmingly in low dimensions. But in d dimensions, trapezoidal needs nd points (curse of dimensionality), while Monte Carlo still needs just 1/ε², regardless of d. This crossover at d ≈ 4-8 is why stochastic methods dominate in ML.
You have a function f(x) but no analytical derivative. Or you want to verify that your backprop implementation is correct. Numerical differentiation approximates f'(x) using nearby function evaluations. The two main formulas differ by an order of magnitude in accuracy — and knowing which to use can catch gradient bugs that would otherwise be invisible.
torch.autograd.gradcheck uses central differences with h ≈ 10-5 in FP64 to verify analytical gradients. If the relative difference exceeds ~10-5, something is wrong with your backward pass. This has caught countless bugs in custom autograd functions.Compute f'(2) for f(x) = x³ using the forward difference with h = 0.01. (True answer: f'(x) = 3x², so f'(2) = 12.)
The error is 0.0601 ≈ 3×2×0.01 = 0.06 = f''(2)×h/2 (the leading error term for forward difference is f''(c)h/2). This is O(h) — first-order accurate.
Now compute f'(2) for f(x) = x³ using the central difference with h = 0.01.
Central difference error: 0.0001. Forward difference error: 0.0601. That's 601× more accurate for the same h! The central difference is O(h²), so its error is ~h²×f'''(2)/6 = 0.0001×6/6 = 0.0001. This is why gradient checking always uses central differences.
What is the ratio of forward difference error to central difference error for the same h and same function? Express as a power of h.
Forward error ~O(h), central error ~O(h²). Their ratio = ?
(We got 601 empirically due to higher-order terms.) The ratio scales as O(1/h). Smaller h → bigger advantage for central differences. At h = 10-5 (typical for gradient checking), central differences are ~105× more accurate than forward differences. One extra function evaluation buys five orders of magnitude.
javascript function gradCheck(f, x, analyticGrad, h) { const numGrad = (f(x + h) - f(x - h)) / (2 * h); const denom = Math.max( Math.abs(numGrad), Math.abs(analyticGrad), 1e-8 ); return Math.abs(numGrad - analyticGrad) / denom; }
The central difference formula for f''(x) is [f(x+h) - 2f(x) + f(x-h)] / h². Compute f''(1) for f(x) = ex with h = 0.1. (True answer: f''(x) = ex, f''(1) = e ≈ 2.71828.)
Error = |2.721 - 2.718| = 0.003. The second-derivative formula has error O(h²), so with h=0.1, error ≈ h² f''''(1)/12 = 0.01×e/12 ≈ 0.0023. Close to our observed error. This formula is used in Hessian-vector products for second-order optimization.
A 10,000 × 10,000 matrix has 100 million entries. If 99% of them are zero — common in graph neural networks, attention masks, and sparse transformers — storing all 100 million values wastes 99% of your memory. Sparse matrix formats store only the nonzero values plus their indices, enabling huge memory savings and faster matrix-vector products.
torch.sparse and graph libraries like PyG.Convert this 4×4 matrix to CSR format. Write the row_ptr array.
values = [3, 2, 4, 1], col_idx = [2, 0, 1, 3]. What is row_ptr?
Using the CSR from Exercise 8.1 (A = [[0,0,3,0],[2,0,0,0],[0,0,0,0],[0,4,0,1]]), compute y = Ax for x = [1, 2, 3, 4].
What is y3 (the 4th component, 0-indexed)?
Using CSR, we only touch 4 nonzero entries instead of all 16. For each row i, we iterate from row_ptr[i] to row_ptr[i+1], multiply values[j] by x[col_idx[j]], and accumulate. Row 3: j goes from 2 to 4, giving values[2]×x[1] + values[3]×x[3] = 4×2 + 1×4 = 12.
A 1000×1000 matrix is 90% sparse (10% nonzero = 100,000 entries). Compare memory usage: dense (FP32) vs CSR (FP32 values + INT32 indices). What is the CSR/dense ratio?
Hmm, 0.20 with 90% sparsity. At 99% sparsity (10,000 nonzeros), the ratio drops to ~0.02 (50× savings). At 99.9% sparsity (typical for large graphs), it's ~0.002 (500× savings). The breakeven point (CSR uses same memory as dense) is around 50% sparsity — below that, just use dense.
Implement sparse matrix-vector multiply using CSR format.
javascript function spMV(values, colIdx, rowPtr, x) { const nRows = rowPtr.length - 1; const y = new Array(nRows).fill(0); for (let i = 0; i < nRows; i++) { for (let j = rowPtr[i]; j < rowPtr[i + 1]; j++) { y[i] += values[j] * x[colIdx[j]]; } } return y; }
You've mastered the individual tools. Now let's use them all at once. In this capstone, you'll audit a neural network forward pass for numerical stability, tracing exact values through each operation and identifying where precision loss occurs.
After the linear layer, the logits are z = [89.3, 91.7, 88.1]. In FP16, the max representable value is 65504. Is exp(91.7) representable in FP16?
Compute exp(91.7) and compare to 65504.
Even exp(11.1) ≈ 66,686 barely overflows FP16. Anything above ~11.09 in the exponent overflows. Real logits of 89-91 would produce Inf in FP16, giving NaN after softmax. This is why the stable softmax trick (subtract max) is not optional — it's a correctness requirement in low precision.
Apply the stable softmax trick to z = [89.3, 91.7, 88.1]. Subtract max, compute exp, normalize. What is softmax(z)0 (the probability of the first class)?
All intermediate values are in [0, 1] — perfectly safe for FP16. The max exponent is 0, so exp can never overflow. This is a concrete example of why the stable trick works: it shifts the dynamic range of the computation to a safe region.
The gradient of the cross-entropy loss with respect to logit zi is (softmax(z)i - yi), where y is the one-hot target. If the true class is class 1 (y = [0,1,0]), what is the gradient for z2 (the last logit), and could it underflow in FP16?
FP16 smallest normal = 2-14 ≈ 6.1×10-5. Subnormals go to ~5.96×10-8.
This gradient (0.024) is fine for FP16. But imagine a 1000-class classifier where the true class gets softmax = 0.999 and most other classes get ~10-6. Those gradients ARE below the FP16 subnormal threshold. They become zero — the model can't learn to adjust those logits. This is gradient underflow, and it's why loss scaling multiplies the loss by 1024 or more before backprop.
If a gradient is 3.2×10-6 (below FP16 subnormal range and rounds to 0), what loss scale factor S would bring it into the normal FP16 range (above 6.1×10-5)?
After backprop with scaled loss, gradients are multiplied by S. We need S × 3.2×10-6 > 6.1×10-5. What minimum S works?
In practice, loss scales of 1024 or higher are common (AMP default starts at 65536 and dynamically adjusts). The optimizer divides by S after backprop to recover the true gradient. Dynamic loss scaling: if you see Inf/NaN in gradients, halve S; if no overflow for N steps, double S. This is how PyTorch's GradScaler works.
Order the steps for debugging a NaN loss in mixed-precision training. Drag chips to slots.
1. Reproduce with FP32 — Confirm the bug is precision-related, not a logic error. If FP32 also NaNs, the problem is elsewhere.
2. Locate first NaN layer — Insert hooks to check each layer's output for Inf/NaN. The first layer that produces NaN is the culprit.
3. Check value ranges — Log the max absolute activation/gradient at the NaN layer. If max > 65504 (FP16) or gradients < 6e-5, you've found the mechanism.
4. Apply fix — Clamp pre-softmax logits, enable loss scaling, switch to BF16, or add gradient clipping — depending on whether the issue is overflow or underflow.
5. Verify fix converges — Train for the full schedule and confirm loss matches FP32 baseline within acceptable tolerance.
You're summing 10,000 gradient values, each approximately 0.001, in BF16. BF16 can represent integers exactly only up to 256. At what point in the summation does the running sum stop growing?
When the sum exceeds some threshold, adding 0.001 falls below the precision gap. BF16 ε = 2-7 ≈ 0.0078. Near value V, the gap is V × 0.0078.
After summing ~128 values (reaching sum ≈ 0.128), adding 0.001 rounds away to zero in BF16. The true sum should be 10.0, but BF16 gives ~0.128 — a 92% error! This is why gradient accumulators MUST be FP32, even when weights and activations are BF16. Every mixed-precision training recipe uses FP32 accumulators. This is the single most important numerical insight for ML training.
loss = NaN at step 847, the reason BF16 replaced FP16, the reason Adam outperforms SGD. "The purpose of computing is insight, not numbers." — Richard Hamming| Topic | Lesson |
|---|---|
| Transformer arithmetic | Transformer Math Workbook |
| Gradient descent deep-dive | Gradient Descent — From Absolute Zero |
| Mixed-precision training | Distributed Training — From Absolute Zero |
| Linear algebra foundations | Linear Algebra Workbook |