Workbook — Numerical Methods

Numerical Methods

The hidden arithmetic that makes or breaks every ML system. Floating point, stability, convergence, differentiation — all solvable by hand with instant feedback.

Prerequisites: Basic calculus (derivatives, integrals) + Linear algebra (matrix multiply). That's it.
10
Chapters
55
Exercises
5
Exercise Types
Mastery
0 / 55 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Floating Point

You're training a neural network and the loss suddenly becomes NaN. Or your model works in FP32 but produces garbage in FP16. Or your gradient accumulator silently loses precision after 10,000 steps. Every one of these bugs traces back to how computers represent real numbers — and it's nothing like the real number line you learned in math class.

IEEE 754 floating point splits a number into three fields:

IEEE 754 FP32 (32 bits total):
[1 sign bit] [8 exponent bits] [23 mantissa bits]

value = (-1)sign × 2(exponent - 127) × (1 + mantissa/223)

Key formats:
FP32: 1 + 8 + 23 = 32 bits   // exponent range [-126, 127], precision ~7 decimal digits
FP16: 1 + 5 + 10 = 16 bits   // exponent range [-14, 15], precision ~3.3 decimal digits
BF16: 1 + 8 + 7 = 16 bits    // exponent range [-126, 127], precision ~2.4 decimal digits
BF16 vs FP16: range vs precision. BF16 keeps FP32's 8-bit exponent (same range: up to ~3.4×1038) but chops the mantissa to 7 bits. FP16 has only a 5-bit exponent (max ~65504) but keeps 10 mantissa bits. For ML, BF16 wins: gradient values span a huge range, and the extra precision of FP16 rarely matters. Google invented BF16 specifically for TPU training.
Exercise 0.1: Largest FP32 Derive

The largest FP32 number has exponent bits = 11111110 (254, since 11111111 is reserved for Inf/NaN) and all mantissa bits = 1. Compute the value.

Hint: exponent = 254 - 127 = 127. Mantissa = (223-1)/223 ≈ 1 - 2-23. Value = 2127 × (2 - 2-23).

× 1038
Show derivation
exponent = 254 - 127 = 127
mantissa = (223 - 1) / 223 = 8388607 / 8388608 ≈ 0.99999988
value = 2127 × (1 + 0.99999988) = 2127 × 1.99999988
2127 ≈ 1.7014 × 1038
value ≈ 1.7014 × 1038 × 2 = 3.4028 × 1038

This is the famous FLT_MAX = 3.4028235e+38. Any computation exceeding this overflows to +Inf. In ML, this happens when you exponentiate large logits without the stable softmax trick.

Exercise 0.2: Machine Epsilon Derive

Machine epsilon is the smallest ε such that 1.0 + ε ≠ 1.0 in floating point. For FP32 with 23 mantissa bits, what is ε?

The mantissa has 23 bits after the implicit leading 1. The smallest change to the representation of 1.0 is flipping the last mantissa bit.

× 10-7
Show derivation
ε = 2-23 = 1 / 8388608 ≈ 1.1920929 × 10-7

This means FP32 gives you about 7 decimal digits of precision. The number 1.0000001 is representable, but 1.00000001 rounds to 1.0. For FP16 (10 mantissa bits), ε = 2-10 ≈ 9.77×10-4 — less than 4 digits! For BF16 (7 mantissa bits), ε = 2-7 ≈ 0.0078 — barely 2.4 digits.

Exercise 0.3: Can FP16 Represent 65536? Trace
FP16 has a 5-bit exponent with bias 15. The largest finite exponent is 215 = 32768. Can FP16 exactly represent the integer 65536?
Exercise 0.4: BF16 Precision Derive

BF16 has 7 mantissa bits. What is the largest integer N such that every integer from 0 to N is exactly representable in BF16?

Think: with 7 mantissa bits + 1 implicit bit, you have 8 bits of significand = 256 values. Integers up to 28 are exact, but what about 28+1?

Show derivation
Significand has 1 implicit + 7 explicit = 8 bits total
Can represent integers 0 to 28 = 256 exactly
257 = 28 + 1 requires 9 significand bits → rounds to 256

This is why BF16 gradient accumulators lose precision fast. If your running sum exceeds 256, adding 1.0 to it does nothing — the 1.0 is lost! This is why mixed-precision training keeps the master weights and accumulator in FP32.

Exercise 0.5: Format Comparison Design

Match each scenario to the best floating-point format. Drag the chips to the slots.

?
Forward pass (activations)
?
Master weights
?
Gradient accumulation
?
Loss scaling factor
BF16 FP32 FP32 FP16
Show reasoning

Forward pass → BF16: Wide exponent range handles activations without overflow. Precision loss is tolerable since we backprop through these.

Master weights → FP32: Tiny gradient updates (e.g., lr=1e-5 × grad) need full precision to accumulate over thousands of steps.

Gradient accumulation → FP32: Summing many small gradients in low precision causes them to vanish. FP32 accumulation is standard.

Loss scaling factor → FP16: Loss scaling multiplies the loss by a large constant (e.g., 1024) before backprop, then divides after. The scale factor itself is often stored in FP16 as part of the loss computation pipeline, though it can also be FP32. FP16 is acceptable here because the factor is a single large number, not a sum of small ones.

Exercise 0.6: Spacing Around 1.0 Derive

In FP32, the gap between 1.0 and the next representable number is 2-23. What is the gap between 1024.0 and the next representable FP32 number?

1024 = 210. The exponent is 10, so the spacing between consecutive floats is 210 × 2-23.

× 10-4
Show derivation
gap = 2exponent × 2-23 = 210 × 2-23 = 2-13
2-13 = 1/8192 ≈ 1.2207 × 10-4

Floating-point spacing grows with magnitude. Near 1.0, the gap is ~10-7. Near 1024.0, it's ~10-4. Near 106, it's ~0.06. This is why large loss values lose precision — the representable numbers get farther apart. Loss scaling works by keeping values in a range where spacing is tight enough for meaningful gradients.

Chapter 1: Catastrophic Cancellation

You subtract two nearly equal numbers and the result has far fewer significant digits than either operand. This is catastrophic cancellation — the silent assassin of numerical computing. It's why naive softmax overflows, why variance computations go negative, and why your loss function sometimes reports negative cross-entropy (which is impossible).

The problem:
a = 1.000000000000001   (16 significant digits)
b = 1.000000000000000   (16 significant digits)
a - b = 1 × 10-15   (1 significant digit!)

In FP64 (53 mantissa bits, ~15.9 decimal digits):
(1 + 10-15) - 1 should give 10-15
But FP64 gives 1.110223024625157 × 10-15
That's a relative error of ~11%!
Why this kills ML. Softmax computes exp(xi)/∑exp(xj). If x values are large (say 1000), exp(1000) = Inf in FP32. The fix: subtract max(x) first. This is mathematically equivalent (the max cancels in the ratio) but numerically essential. Every deep learning framework does this silently.
Exercise 1.1: Relative Error Derive

In FP64, computing (1 + 10-15) - 1 gives 1.110223024625157×10-15 instead of the true answer 10-15. What is the relative error?

Relative error = |computed - true| / |true|.

%
Show derivation
computed = 1.110223024625157 × 10-15
true = 1.0 × 10-15
|computed - true| = 0.110223024625157 × 10-15
relative error = 0.110223 / 1.0 = 0.110223 = 11.02%

An 11% error from a single subtraction! Both operands were stored with ~15.9 digits of precision, but the subtraction wiped out all but ~1 digit. The 0.110223... comes from 2-53 × 250 = 2-3 ≈ 0.125, a rounding artifact of binary representation.

Exercise 1.2: Stable Softmax Trace
Given logits [1000, 1001, 999], what happens with naive softmax (compute exp directly) in FP32, and how does the stable version fix it?
Exercise 1.3: Log-Sum-Exp Derive

The log-sum-exp trick computes log(∑ exp(xi)) stably. The identity: LSE(x) = max(x) + log(∑ exp(xi - max(x))). Compute LSE([10, 20, 30]) by hand.

Show derivation
max(x) = 30
shifted = [10-30, 20-30, 30-30] = [-20, -10, 0]
exp(shifted) = [exp(-20), exp(-10), exp(0)] = [2.06×10-9, 4.54×10-5, 1.0]
sum = 1.0000454
LSE = 30 + log(1.0000454) = 30 + 0.0000454 ≈ 30.00

The answer is dominated by the max term (30). Without the trick, you'd compute exp(30) ≈ 1013 which is fine, but exp(1000) would overflow. The trick ensures the largest exponent is always 0, so no overflow is possible regardless of input magnitude. This is the backbone of stable cross-entropy loss.

Exercise 1.4: Implement logSumExp() Build
Show solution
javascript
function logSumExp(arr) {
  const maxVal = Math.max(...arr);
  const sumExp = arr.reduce((s, x) =>
    s + Math.exp(x - maxVal), 0);
  return maxVal + Math.log(sumExp);
}
Exercise 1.5: Variance Bug Debug

This variance function can return negative values for certain inputs. Click the buggy line.

function variance(arr) {
  const n = arr.length;
  let sumSq = 0, sum = 0;
  for (let i = 0; i < n; i++) {
    sumSq += arr[i] * arr[i];
    sum += arr[i];
  }
  return sumSq / n - (sum / n) * (sum / n);
}
Show explanation

Line 8 uses the textbook formula Var = E[X²] - E[X]². This is mathematically correct but numerically catastrophic. When the mean is large (e.g., arr = [1000000.0, 1000000.1]), sumSq/n ≈ 1012 and (sum/n)² ≈ 1012. Subtracting two huge nearly-equal numbers → catastrophic cancellation → the result can go negative.

The fix: use the two-pass algorithm. First compute the mean, then compute ∑(xi - mean)² / n. The subtraction happens at each element (small numbers), not at the aggregate (huge numbers).

Exercise 1.6: Digits Lost Derive

When subtracting a - b where a ≈ b, the number of significant digits lost is approximately log10(|a| / |a - b|). If a = 1.234567 and b = 1.234566, how many digits are lost?

digits
Show derivation
|a - b| = 0.000001 = 10-6
digits lost = log10(1.234567 / 0.000001) = log10(1,234,567) ≈ 6.09

We started with 7 digits of precision in each operand and lost ~6 of them, leaving ~1 reliable digit. This formula is a quick mental check: if a/|a-b| ≈ 10k, you lose k digits. In FP32 (7 digits), losing 6 leaves 1 — barely usable. In FP64 (15.9 digits), losing 6 leaves ~10 — usually fine.

Chapter 2: Condition Numbers

You're solving a linear system Ax = b and get a completely wrong answer, even though your algorithm is correct. The problem isn't your code — it's the matrix. Some systems are so sensitive to tiny perturbations in the input that floating-point rounding is enough to destroy the answer. The condition number quantifies this sensitivity.

Condition number (2-norm):
κ(A) = ‖A‖ · ‖A-1‖ = σmax / σmin

Error amplification:
relative error in x ≤ κ(A) × relative error in b

κ(A) ≈ 1 → well-conditioned (safe)
κ(A) ≈ 10k → lose ~k digits of precision
κ(A) = ∞ → singular (no unique solution)
Why ML engineers care. The Hessian of a loss function is a matrix. If κ(H) is large, gradient descent oscillates wildly along ill-conditioned directions. This is exactly why optimizers like Adam work — they effectively precondition the Hessian by dividing each gradient component by its RMS history, reducing the effective condition number.
Exercise 2.1: Compute κ for a 2×2 Derive

A = [[1, 1], [1, 1.0001]]. First find A-1 using the 2×2 formula: A-1 = (1/det) × [[d, -b], [-c, a]]. Then compute κ(A) using the infinity norm: ‖A‖ = max row sum of absolute values.

Show derivation
det(A) = 1 × 1.0001 - 1 × 1 = 0.0001
A-1 = (1/0.0001) × [[1.0001, -1], [-1, 1]] = [[10001, -10000], [-10000, 10000]]
‖A‖ = max(|1|+|1|, |1|+|1.0001|) = 2.0001
‖A-1 = max(10001+10000, 10000+10000) = 20001
κ(A) = 2.0001 × 20001 ≈ 40,002 ≈ 4 × 104

A condition number of 40,000 means a perturbation of 10-7 in b (FP32 machine epsilon) gets amplified to 10-7 × 4×104 = 4×10-3 relative error in x. In FP32, you lose ~4.6 of your 7 digits. This matrix is ill-conditioned because its rows are nearly parallel.

Exercise 2.2: Perturbation Analysis Derive

Using A = [[1, 1], [1, 1.0001]] from above, solve Ax = b for b = [2, 2.0001]. Then solve for b' = [2, 2.0002] (a tiny perturbation of 0.00005 relative). How much does x change?

(absolute change in x1)
Show derivation
For b = [2, 2.0001]: x = A-1b = [[10001,-10000],[-10000,10000]] × [2, 2.0001]
x1 = 10001×2 - 10000×2.0001 = 20002 - 20001 = 1.0
x2 = -10000×2 + 10000×2.0001 = -20000 + 20001 = 1.0
So x = [1, 1].
For b' = [2, 2.0002]: x' = A-1b'
x'1 = 10001×2 - 10000×2.0002 = 20002 - 20002 = 0.0
x'2 = -10000×2 + 10000×2.0002 = -20000 + 20002 = 2.0
x' = [0, 2]. Change in x1: |1 - 0| = 1.0

A 0.005% change in b produced a 100% change in x. The solution completely flipped! This is the condition number at work: κ ≈ 40,000 means small perturbations get amplified by up to 40,000×.

Exercise 2.3: Digits Lost from κ Trace
If κ(A) = 108 and you solve Ax=b in FP32 (~7 decimal digits), approximately how many correct digits can you expect in x?
Exercise 2.4: Identity Matrix Derive

What is κ(I) for the n×n identity matrix? (Think: what are the singular values of I?)

Show derivation
I-1 = I, so ‖I‖ = 1 and ‖I-1‖ = 1
κ(I) = 1 × 1 = 1

The identity matrix is perfectly conditioned. Solving Ix = b gives x = b exactly (no amplification). κ = 1 is the best possible. Orthogonal matrices (QTQ = I) also have κ = 1. This is why QR decomposition is preferred over direct inversion — it factors A into a product involving orthogonal matrices.

Exercise 2.5: Implement conditionNumber() Build

Compute the infinity-norm condition number of a 2×2 matrix [[a,b],[c,d]].

Show solution
javascript
function conditionNumber(a, b, c, d) {
  const normA = Math.max(
    Math.abs(a) + Math.abs(b),
    Math.abs(c) + Math.abs(d)
  );
  const det = a * d - b * c;
  const normAinv = Math.max(
    Math.abs(d/det) + Math.abs(b/det),
    Math.abs(c/det) + Math.abs(a/det)
  );
  return normA * normAinv;
}

Chapter 3: Iterative Methods

Direct solvers like Gaussian elimination are O(n³). For a 100,000 × 100,000 sparse system (common in physics simulations and graph-based ML), that's 1015 operations — days of compute. Iterative methods start with a guess and improve it, often converging in far fewer operations, especially when the matrix is sparse.

Jacobi iteration for Ax = b:
Split A = D + L + U   (diagonal, strictly lower, strictly upper)
x(k+1) = D-1(b - (L + U)x(k))

Component form:
xi(k+1) = (1/aii) (bi - ∑j≠i aij xj(k))

Convergence: ρ(D-1(L+U)) < 1   (spectral radius of iteration matrix)
The ML connection. Gradient descent IS an iterative method for solving ∇f(x) = 0. The learning rate plays the role of D-1 — it scales the update. And the convergence condition (spectral radius < 1) is analogous to requiring the learning rate to be smaller than 2/L where L is the Lipschitz constant of the gradient.
Exercise 3.1: Jacobi by Hand Derive

Solve this system using Jacobi iteration, starting from x(0) = [0, 0]:

4x1 + x2 = 9
x1 + 3x2 = 11

Compute x(1) and x(2). What is x1(2) (the first component after 2 iterations)?

Show derivation

Jacobi formula: x1(k+1) = (9 - x2(k))/4,   x2(k+1) = (11 - x1(k))/3

Iteration 1 (from x(0) = [0, 0]):
x1(1) = (9 - 0)/4 = 2.25
x2(1) = (11 - 0)/3 = 3.667
Iteration 2 (from x(1) = [2.25, 3.667]):
x1(2) = (9 - 3.667)/4 = 5.333/4 = 1.333
x2(2) = (11 - 2.25)/3 = 8.75/3 = 2.917

The true solution is x = [1, 3]. After just 2 iterations we have [1.333, 2.917] — already close. Convergence is guaranteed here because A is strictly diagonally dominant (|4| > |1| and |3| > |1|).

Exercise 3.2: Third Iteration Derive

Continue from x(2) = [1.333, 2.917]. Compute x2(3).

Show derivation
x2(3) = (11 - 1.333)/3 = 9.667/3 = 3.222
x1(3) = (9 - 2.917)/4 = 6.083/4 = 1.021

After 3 iterations: [1.021, 3.222]. The error in x1 went from 1.333 to 0.021 — it's converging nicely. True solution: [1, 3].

Exercise 3.3: Convergence Check Trace
For the system [[1, 2], [2, 1]] x = b, will Jacobi iteration converge?
Exercise 3.4: Spectral Radius Derive

For our original system (A = [[4,1],[1,3]]), the Jacobi iteration matrix is M = D-1(L+U) = [[0, -1/4], [-1/3, 0]]. Compute the spectral radius ρ(M) = max |eigenvalue|.

For a 2×2 matrix [[0,a],[b,0]], the eigenvalues are ±√(ab).

Show derivation
M = [[0, -1/4], [-1/3, 0]]
eigenvalues = ±√((-1/4)(-1/3)) = ±√(1/12) = ±0.2887
ρ(M) = 0.2887 < 1 ⇒ converges!

The spectral radius tells you the convergence rate. Each iteration reduces the error by a factor of ~0.289. After k iterations, error ≈ 0.289k × initial error. To get 6 digits of accuracy: 0.289k < 10-6 → k > 11.1. So about 12 iterations suffice.

Exercise 3.5: Implement jacobiStep() Build
Show solution
javascript
function jacobiStep(A, b, x) {
  const n = b.length;
  const xNew = new Array(n);
  for (let i = 0; i < n; i++) {
    let sum = 0;
    for (let j = 0; j < n; j++) {
      if (j !== i) sum += A[i][j] * x[j];
    }
    xNew[i] = (b[i] - sum) / A[i][i];
  }
  return xNew;
}

Chapter 4: Newton's Method

You need to find where f(x) = 0. Binary search works but it's slow — one bit of precision per iteration. Newton's method uses the derivative to build a local linear approximation and jumps directly to where that line crosses zero. The result: once you're close enough, the number of correct digits doubles every iteration.

Newton's method:
xn+1 = xn - f(xn) / f'(xn)

Geometric interpretation:
Draw the tangent line at (xn, f(xn)) with slope f'(xn).
The tangent crosses y=0 at xn+1.

Convergence: Quadratic near a simple root.
|en+1| ≈ C · |en|²   where en = xn - root
Quadratic convergence is magical. If your current error is 10-3, the next error is ~10-6, then ~10-12. Two more iterations and you have machine precision. Compare this to bisection, which only halves the error each step: 10-3 → 5×10-4 → 2.5×10-4... Newton is the algorithm behind torch.optim's L-BFGS and is why second-order optimizers converge so fast (when they work).
Exercise 4.1: Find √2 Derive

To find √2, solve f(x) = x² - 2 = 0. f'(x) = 2x. Starting from x0 = 1, compute x1, x2, x3, and x4. What is x4 (to 6 decimal places)?

Show derivation
xn+1 = xn - (xn² - 2)/(2xn) = (xn + 2/xn)/2
x0 = 1
x1 = (1 + 2/1)/2 = 3/2 = 1.5
x2 = (1.5 + 2/1.5)/2 = (1.5 + 1.3333)/2 = 1.41667
x3 = (1.41667 + 2/1.41667)/2 = (1.41667 + 1.41176)/2 = 1.414216
x4 = (1.414216 + 2/1.414216)/2 = 1.4142135624...

The true value is √2 = 1.41421356237... By x3 we have 6 correct digits, and x4 gives 12+. Watch the quadratic convergence: error went 0.414 → 0.086 → 0.0025 → 0.000002 → 0.000000000004. Each step roughly squares the error.

Exercise 4.2: Convergence Rate Derive

From the √2 example, compute the error at each step. What is |e2|/|e1|²?

This ratio should be approximately constant for quadratic convergence. True root = 1.41421356...

Show derivation
e0 = |1 - 1.41421| = 0.41421
e1 = |1.5 - 1.41421| = 0.08579
e2 = |1.41667 - 1.41421| = 0.00245
|e2| / |e1|² = 0.00245 / 0.08579² = 0.00245 / 0.00736 = 0.333 ≈ 0.35

The constant C ≈ 0.35 = |f''(root)|/(2|f'(root)|) = |2|/(2×2√2) = 1/(2√2) ≈ 0.354. This is the convergence constant for Newton's method on x²-2.

Exercise 4.3: When Newton Fails Trace
Apply Newton's method to f(x) = x3 - 2x + 2 starting from x0 = 0. What happens?
Exercise 4.4: Implement newton() Build
Show solution
javascript
function newton(f, fPrime, x0, nIter) {
  let x = x0;
  for (let i = 0; i < nIter; i++) {
    x = x - f(x) / fPrime(x);
  }
  return x;
}
Exercise 4.5: Division by Zero Bug Debug

This Newton's method implementation crashes on certain inputs. Click the buggy line.

function newton(f, fp, x0, tol) {
  let x = x0;
  for (let i = 0; i < 100; i++) {
    const fx = f(x);
    if (Math.abs(fx) < tol) return x;
    x = x - fx / fp(x);
  }
  return x;
}
Show explanation

Line 6 divides by fp(x) without checking if the derivative is zero. If f'(x) = 0 at any iteration (a horizontal tangent), the step becomes ±Infinity, and all subsequent iterates are NaN. The fix: check if (Math.abs(fp(x)) < 1e-12) return x; before dividing — or throw an error indicating the method has stalled at a stationary point.

Chapter 5: Gradient Descent Convergence

Every neural network trains via gradient descent (or a variant). But how fast does it converge? When does it diverge? The answers come from analyzing the simplest possible case: a quadratic function. If you understand gradient descent on f(x) = x², you understand the core dynamics of all first-order optimization.

Gradient descent on f(x) = x²:
∇f = 2x
Update: xk+1 = xk - α · 2xk = (1 - 2α) · xk

After k steps:
xk = (1 - 2α)k · x0

Convergence requires:
|1 - 2α| < 1   ⇒   0 < α < 1
Learning rate = convergence rate. The factor |1 - 2α| determines how fast the error shrinks. Optimal: α = 0.5 gives (1-2×0.5)=0 — one-step convergence! But this requires knowing the curvature (the Hessian = 2 for f=x²). For α = 0.1, the factor is 0.8, so error decays as 0.8k. This is linear convergence — the hallmark of gradient descent. Newton's method achieves quadratic.
Exercise 5.1: Trace 5 Steps Derive

f(x) = x², α = 0.1, x0 = 5. Compute x1 through x5 and f(x5).

f(x5)
Show derivation
xk+1 = (1 - 2×0.1) · xk = 0.8 · xk
x0 = 5.0
x1 = 0.8 × 5.0 = 4.0
x2 = 0.8 × 4.0 = 3.2
x3 = 0.8 × 3.2 = 2.56
x4 = 0.8 × 2.56 = 2.048
x5 = 0.8 × 2.048 = 1.6384
f(x5) = x5² = 1.6384² = 2.684

Equivalently: x5 = 0.85 × 5 = 0.32768 × 5 = 1.6384. After 5 steps with α=0.1, the loss went from f(5)=25 down to f(1.6384)=2.684 — about a 10× reduction. To reach f < 0.01, you need 0.82k × 25 < 0.01, giving k ≥ 35 steps.

Exercise 5.2: Divergence Threshold Derive

For f(x) = x², what is the maximum learning rate α before gradient descent diverges? Compute x1 with α = 1.1, x0 = 1 to verify.

αmax
Show derivation
xk+1 = (1 - 2α)xk
Converges when |1 - 2α| < 1
-1 < 1 - 2α < 1 ⇒ 0 < α < 1
αmax = 1.0
With α = 1.1: x1 = (1 - 2.2) × 1 = -1.2
x2 = (1 - 2.2) × (-1.2) = 1.44
|xk| grows: 1 → 1.2 → 1.44 → 1.728 → ∞

For general f with Lipschitz-continuous gradient with constant L, the stability bound is α < 2/L. For f(x) = x², L = 2, so α < 2/2 = 1. This is why learning rate schedules start small — to ensure stability.

Exercise 5.3: Implement gradientDescent() Build
Show solution
javascript
function gradientDescent(grad, x0, lr, nSteps) {
  let x = x0;
  for (let i = 0; i < nSteps; i++) {
    x = x - lr * grad(x);
  }
  return x;
}
Exercise 5.4: Steps to Converge Derive

For f(x) = x² with α = 0.1 and x0 = 10, how many steps k are needed to achieve |xk| < 0.01?

Use |xk| = 0.8k × 10 < 0.01, solve for k.

steps
Show derivation
0.8k × 10 < 0.01
0.8k < 0.001
k × ln(0.8) < ln(0.001)
k > ln(0.001) / ln(0.8) = (-6.908) / (-0.2231) = 30.96
k = 31 steps

31 steps to get 3 digits of accuracy. With α=0.3, the factor is 0.4, and you'd need only ln(0.001)/ln(0.4) = 7.5 → 8 steps. Larger learning rate = faster convergence (until you hit the stability limit).

Exercise 5.5: 2D Gradient Descent Trace
Consider f(x,y) = 100x² + y² (a stretched bowl). With α = 0.005, starting from (1,1), what happens?

Chapter 6: Numerical Integration

You need the area under a curve, but the integral has no closed form — or you only have data points, not a formula. Numerical integration (quadrature) approximates ∫f(x)dx using weighted sums of function evaluations. The trapezoidal rule connects points with straight lines; Simpson's rule uses parabolas. The choice determines how fast the error shrinks as you add more points.

Trapezoidal rule (n intervals, step h = (b-a)/n):
ab f(x)dx ≈ h/2 · [f(x0) + 2f(x1) + 2f(x2) + ... + 2f(xn-1) + f(xn)]
Error = O(h²) = O(1/n²)

Simpson's rule (n intervals, n must be even):
ab f(x)dx ≈ h/3 · [f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + ... + 4f(xn-1) + f(xn)]
Error = O(h4) = O(1/n4)
Why this matters for ML. Monte Carlo integration (averaging random samples) is how we estimate expectations in reinforcement learning and variational inference. The error is O(1/√N) regardless of dimension — slower than trapezoidal for 1D, but it doesn't suffer the curse of dimensionality. Understanding deterministic quadrature gives you the baseline to appreciate why stochastic methods dominate in high dimensions.
Exercise 6.1: Trapezoidal Rule by Hand Derive

Compute ∫01 x² dx using the trapezoidal rule with n=4 intervals. (True answer = 1/3 ≈ 0.3333.)

h = (1-0)/4 = 0.25. Points: x = 0, 0.25, 0.50, 0.75, 1.0. Apply the formula.

Show derivation
h = 0.25,   f(x) = x²
f(0) = 0, f(0.25) = 0.0625, f(0.5) = 0.25, f(0.75) = 0.5625, f(1) = 1
T = (0.25/2) × [0 + 2(0.0625) + 2(0.25) + 2(0.5625) + 1]
T = 0.125 × [0 + 0.125 + 0.5 + 1.125 + 1]
T = 0.125 × 2.75 = 0.34375

Error = |0.34375 - 0.33333| = 0.01042. With n=4, the error is O(h²) = O(0.0625). The ratio error/h² ≈ 0.01042/0.0625 = 0.167 ≈ 1/6, which matches the error formula -(b-a)h²f''(c)/12 = -1×0.0625×2/12 = -0.01042. Theory confirmed!

Exercise 6.2: Simpson's Rule Derive

Compute ∫01 x² dx using Simpson's rule with n=4 intervals (same points as above).

Show derivation
h = 0.25,   coefficients: [1, 4, 2, 4, 1]
S = (0.25/3) × [1×0 + 4×0.0625 + 2×0.25 + 4×0.5625 + 1×1]
S = (0.25/3) × [0 + 0.25 + 0.5 + 2.25 + 1]
S = (0.25/3) × 4.0 = 1.0/3 = 0.33333...

Simpson's rule gives the exact answer for x²! This is because Simpson's rule exactly integrates polynomials up to degree 3 (its error involves the 4th derivative, and f''''(x²) = 0). This is a general pattern: higher-order quadrature rules are exact for higher-degree polynomials.

Exercise 6.3: Error Comparison Trace
If you double the number of intervals (n → 2n), how much does the error shrink for each method?
Exercise 6.4: Implement trapezoidal() Build
Show solution
javascript
function trapezoidal(f, a, b, n) {
  const h = (b - a) / n;
  let sum = f(a) + f(b);
  for (let i = 1; i < n; i++) {
    sum += 2 * f(a + i * h);
  }
  return (h / 2) * sum;
}
Exercise 6.5: Monte Carlo vs Trapezoidal Derive

To estimate ∫01 f(x)dx with error < 10-6, how many points does each method need? Assume f is smooth with bounded 2nd derivative |f''| ≤ 1.

Trapezoidal error ≤ h²/12 = 1/(12n²). Monte Carlo error ≈ σ/√N where σ ≤ 1.

trapezoidal n
Show derivation
Trapezoidal: 1/(12n²) < 10-6 ⇒ n² > 106/12 ≈ 83,333 ⇒ n > 289
Monte Carlo: 1/√N < 10-6 ⇒ N > 1012

In 1D, trapezoidal needs ~289 points while Monte Carlo needs a trillion. Deterministic quadrature wins overwhelmingly in low dimensions. But in d dimensions, trapezoidal needs nd points (curse of dimensionality), while Monte Carlo still needs just 1/ε², regardless of d. This crossover at d ≈ 4-8 is why stochastic methods dominate in ML.

Chapter 7: Numerical Differentiation

You have a function f(x) but no analytical derivative. Or you want to verify that your backprop implementation is correct. Numerical differentiation approximates f'(x) using nearby function evaluations. The two main formulas differ by an order of magnitude in accuracy — and knowing which to use can catch gradient bugs that would otherwise be invisible.

Forward difference:
f'(x) ≈ [f(x + h) - f(x)] / h    Error = O(h)

Central difference:
f'(x) ≈ [f(x + h) - f(x - h)] / (2h)    Error = O(h²)

The h dilemma:
Smaller h → less truncation error, but more rounding error.
Optimal h ≈ √ε for forward, ≈ ε1/3 for central (ε = machine epsilon).
Gradient checking in practice. PyTorch's torch.autograd.gradcheck uses central differences with h ≈ 10-5 in FP64 to verify analytical gradients. If the relative difference exceeds ~10-5, something is wrong with your backward pass. This has caught countless bugs in custom autograd functions.
Exercise 7.1: Forward Difference Derive

Compute f'(2) for f(x) = x³ using the forward difference with h = 0.01. (True answer: f'(x) = 3x², so f'(2) = 12.)

Show derivation
f(2.01) = 2.01³ = 8.120601
f(2) = 8
forward = (8.120601 - 8) / 0.01 = 0.120601 / 0.01 = 12.0601
error = |12.0601 - 12| = 0.0601

The error is 0.0601 ≈ 3×2×0.01 = 0.06 = f''(2)×h/2 (the leading error term for forward difference is f''(c)h/2). This is O(h) — first-order accurate.

Exercise 7.2: Central Difference Derive

Now compute f'(2) for f(x) = x³ using the central difference with h = 0.01.

Show derivation
f(2.01) = 2.01³ = 8.120601
f(1.99) = 1.99³ = 7.880599
central = (8.120601 - 7.880599) / (2 × 0.01) = 0.240002 / 0.02 = 12.0001
error = |12.0001 - 12| = 0.0001

Central difference error: 0.0001. Forward difference error: 0.0601. That's 601× more accurate for the same h! The central difference is O(h²), so its error is ~h²×f'''(2)/6 = 0.0001×6/6 = 0.0001. This is why gradient checking always uses central differences.

Exercise 7.3: Error Ratio Derive

What is the ratio of forward difference error to central difference error for the same h and same function? Express as a power of h.

Forward error ~O(h), central error ~O(h²). Their ratio = ?

ratio for h=0.01
Show derivation
Forward error = f''(2) × h/2 = 12 × 0.01/2 = 0.06
Central error = f'''(2) × h²/6 = 6 × 0.0001/6 = 0.0001
Ratio = 0.06 / 0.0001 = 600

(We got 601 empirically due to higher-order terms.) The ratio scales as O(1/h). Smaller h → bigger advantage for central differences. At h = 10-5 (typical for gradient checking), central differences are ~105× more accurate than forward differences. One extra function evaluation buys five orders of magnitude.

Exercise 7.4: Optimal Step Size Trace
For central differences in FP64 (ε ≈ 10-16), what is the optimal h?
Exercise 7.5: Implement gradCheck() Build
Show solution
javascript
function gradCheck(f, x, analyticGrad, h) {
  const numGrad = (f(x + h) - f(x - h)) / (2 * h);
  const denom = Math.max(
    Math.abs(numGrad), Math.abs(analyticGrad), 1e-8
  );
  return Math.abs(numGrad - analyticGrad) / denom;
}
Exercise 7.6: Second Derivative Derive

The central difference formula for f''(x) is [f(x+h) - 2f(x) + f(x-h)] / h². Compute f''(1) for f(x) = ex with h = 0.1. (True answer: f''(x) = ex, f''(1) = e ≈ 2.71828.)

Show derivation
f(1.1) = e1.1 = 3.00417
f(1.0) = e1.0 = 2.71828
f(0.9) = e0.9 = 2.45960
f'' ≈ (3.00417 - 2 × 2.71828 + 2.45960) / 0.01
= (3.00417 - 5.43656 + 2.45960) / 0.01
= 0.02721 / 0.01 = 2.721

Error = |2.721 - 2.718| = 0.003. The second-derivative formula has error O(h²), so with h=0.1, error ≈ h² f''''(1)/12 = 0.01×e/12 ≈ 0.0023. Close to our observed error. This formula is used in Hessian-vector products for second-order optimization.

Chapter 8: Sparse Matrices

A 10,000 × 10,000 matrix has 100 million entries. If 99% of them are zero — common in graph neural networks, attention masks, and sparse transformers — storing all 100 million values wastes 99% of your memory. Sparse matrix formats store only the nonzero values plus their indices, enabling huge memory savings and faster matrix-vector products.

CSR (Compressed Sparse Row):
values: [list of nonzero values, left-to-right, top-to-bottom]
col_idx: [column index for each value]
row_ptr: [index into values where each row starts; length = nrows+1]

Example: A = [[1,0,2],[0,3,0],[4,0,5]]
values = [1, 2, 3, 4, 5]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]   // row 0 starts at idx 0, row 1 at idx 2, row 2 at idx 3, end at 5
Why ML cares about sparsity. Graph neural networks represent adjacency as sparse matrices. Sparse attention (Longformer, BigBird) uses sparse attention masks. Mixture-of-Experts routing matrices are extremely sparse (top-k of hundreds of experts). Even dense transformers have sparse gradients during embedding lookups. Understanding CSR is essential for understanding PyTorch's torch.sparse and graph libraries like PyG.
Exercise 8.1: Convert to CSR Derive

Convert this 4×4 matrix to CSR format. Write the row_ptr array.

A = [[0, 0, 3, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 4, 0, 1]]

values = [3, 2, 4, 1], col_idx = [2, 0, 1, 3]. What is row_ptr?

Exercise 8.2: Sparse Matrix-Vector Product Derive

Using the CSR from Exercise 8.1 (A = [[0,0,3,0],[2,0,0,0],[0,0,0,0],[0,4,0,1]]), compute y = Ax for x = [1, 2, 3, 4].

What is y3 (the 4th component, 0-indexed)?

Show derivation
y0 = 0×1 + 0×2 + 3×3 + 0×4 = 9
y1 = 2×1 + 0×2 + 0×3 + 0×4 = 2
y2 = 0×1 + 0×2 + 0×3 + 0×4 = 0
y3 = 0×1 + 4×2 + 0×3 + 1×4 = 8 + 4 = 12

Using CSR, we only touch 4 nonzero entries instead of all 16. For each row i, we iterate from row_ptr[i] to row_ptr[i+1], multiply values[j] by x[col_idx[j]], and accumulate. Row 3: j goes from 2 to 4, giving values[2]×x[1] + values[3]×x[3] = 4×2 + 1×4 = 12.

Exercise 8.3: Memory Savings Derive

A 1000×1000 matrix is 90% sparse (10% nonzero = 100,000 entries). Compare memory usage: dense (FP32) vs CSR (FP32 values + INT32 indices). What is the CSR/dense ratio?

Show derivation
Dense: 1000 × 1000 × 4 bytes = 4,000,000 bytes
CSR values: 100,000 × 4 = 400,000 bytes
CSR col_idx: 100,000 × 4 = 400,000 bytes
CSR row_ptr: 1001 × 4 = 4,004 bytes
CSR total = 804,004 bytes
Ratio = 804,004 / 4,000,000 = 0.201 ≈ 0.20

Hmm, 0.20 with 90% sparsity. At 99% sparsity (10,000 nonzeros), the ratio drops to ~0.02 (50× savings). At 99.9% sparsity (typical for large graphs), it's ~0.002 (500× savings). The breakeven point (CSR uses same memory as dense) is around 50% sparsity — below that, just use dense.

Exercise 8.4: Implement spMV() Build

Implement sparse matrix-vector multiply using CSR format.

Show solution
javascript
function spMV(values, colIdx, rowPtr, x) {
  const nRows = rowPtr.length - 1;
  const y = new Array(nRows).fill(0);
  for (let i = 0; i < nRows; i++) {
    for (let j = rowPtr[i]; j < rowPtr[i + 1]; j++) {
      y[i] += values[j] * x[colIdx[j]];
    }
  }
  return y;
}
Exercise 8.5: SpMV Complexity Trace
Dense matrix-vector multiply on an n×n matrix costs O(n²) operations. What does sparse matrix-vector (SpMV) cost for a matrix with nnz nonzeros?

Chapter 9: Capstone — Numerical Stability Audit

You've mastered the individual tools. Now let's use them all at once. In this capstone, you'll audit a neural network forward pass for numerical stability, tracing exact values through each operation and identifying where precision loss occurs.

The scenario. A single-layer neural network: input x → linear layer (Wx + b) → ReLU → softmax → cross-entropy loss. We'll trace specific values through in FP16 and FP32, find where things break, and apply the fixes from earlier chapters.
Exercise 9.1: Pre-Softmax Overflow Derive

After the linear layer, the logits are z = [89.3, 91.7, 88.1]. In FP16, the max representable value is 65504. Is exp(91.7) representable in FP16?

Compute exp(91.7) and compare to 65504.

× 1039
Show derivation
exp(91.7) = e91.7 ≈ 7.33 × 1039
FP16 max = 65504 ≈ 6.55 × 104
7.33 × 1039 >> 6.55 × 104 ⇒ massive overflow!

Even exp(11.1) ≈ 66,686 barely overflows FP16. Anything above ~11.09 in the exponent overflows. Real logits of 89-91 would produce Inf in FP16, giving NaN after softmax. This is why the stable softmax trick (subtract max) is not optional — it's a correctness requirement in low precision.

Exercise 9.2: Apply the Fix Derive

Apply the stable softmax trick to z = [89.3, 91.7, 88.1]. Subtract max, compute exp, normalize. What is softmax(z)0 (the probability of the first class)?

Show derivation
max(z) = 91.7
shifted = [89.3-91.7, 91.7-91.7, 88.1-91.7] = [-2.4, 0, -3.6]
exp(shifted) = [exp(-2.4), exp(0), exp(-3.6)] = [0.0907, 1.0, 0.0273]
sum = 0.0907 + 1.0 + 0.0273 = 1.118
softmax0 = 0.0907 / 1.118 = 0.0811

All intermediate values are in [0, 1] — perfectly safe for FP16. The max exponent is 0, so exp can never overflow. This is a concrete example of why the stable trick works: it shifts the dynamic range of the computation to a safe region.

Exercise 9.3: Gradient Underflow Derive

The gradient of the cross-entropy loss with respect to logit zi is (softmax(z)i - yi), where y is the one-hot target. If the true class is class 1 (y = [0,1,0]), what is the gradient for z2 (the last logit), and could it underflow in FP16?

FP16 smallest normal = 2-14 ≈ 6.1×10-5. Subnormals go to ~5.96×10-8.

(gradient for z2)
Show derivation
softmax(z) ≈ [0.081, 0.895, 0.024] (from above: 0.0907/1.118, 1.0/1.118, 0.0273/1.118)
gradient for z2 = softmax(z)2 - y2 = 0.024 - 0 = 0.024
0.024 > 6.1×10-5 (FP16 smallest normal) ⇒ no underflow here

This gradient (0.024) is fine for FP16. But imagine a 1000-class classifier where the true class gets softmax = 0.999 and most other classes get ~10-6. Those gradients ARE below the FP16 subnormal threshold. They become zero — the model can't learn to adjust those logits. This is gradient underflow, and it's why loss scaling multiplies the loss by 1024 or more before backprop.

Exercise 9.4: Loss Scaling Derive

If a gradient is 3.2×10-6 (below FP16 subnormal range and rounds to 0), what loss scale factor S would bring it into the normal FP16 range (above 6.1×10-5)?

After backprop with scaled loss, gradients are multiplied by S. We need S × 3.2×10-6 > 6.1×10-5. What minimum S works?

(minimum power of 2)
Show derivation
S > 6.1×10-5 / 3.2×10-6 = 19.06
Next power of 2: S = 32
Check: 32 × 3.2×10-6 = 1.024×10-4 > 6.1×10-5

In practice, loss scales of 1024 or higher are common (AMP default starts at 65536 and dynamically adjusts). The optimizer divides by S after backprop to recover the true gradient. Dynamic loss scaling: if you see Inf/NaN in gradients, halve S; if no overflow for N steps, double S. This is how PyTorch's GradScaler works.

Exercise 9.5: FP16 vs FP32 Comparison Trace
A model trains perfectly in FP32 but produces NaN loss in FP16 at step 847. The loss was normal (~2.3) at step 846. Which of these is most likely the root cause?
Exercise 9.6: Full Audit Pipeline Design

Order the steps for debugging a NaN loss in mixed-precision training. Drag chips to slots.

?
?
?
?
?
Reproduce with FP32 Locate first NaN layer Check value ranges Apply fix (clamp/scale/BF16) Verify fix converges
Show reasoning

1. Reproduce with FP32 — Confirm the bug is precision-related, not a logic error. If FP32 also NaNs, the problem is elsewhere.

2. Locate first NaN layer — Insert hooks to check each layer's output for Inf/NaN. The first layer that produces NaN is the culprit.

3. Check value ranges — Log the max absolute activation/gradient at the NaN layer. If max > 65504 (FP16) or gradients < 6e-5, you've found the mechanism.

4. Apply fix — Clamp pre-softmax logits, enable loss scaling, switch to BF16, or add gradient clipping — depending on whether the issue is overflow or underflow.

5. Verify fix converges — Train for the full schedule and confirm loss matches FP32 baseline within acceptable tolerance.

Exercise 9.7: Accumulator Precision Derive

You're summing 10,000 gradient values, each approximately 0.001, in BF16. BF16 can represent integers exactly only up to 256. At what point in the summation does the running sum stop growing?

When the sum exceeds some threshold, adding 0.001 falls below the precision gap. BF16 ε = 2-7 ≈ 0.0078. Near value V, the gap is V × 0.0078.

sum stalls at approximately
Show derivation
BF16 ε = 2-7 ≈ 0.0078125
Gap near value V = V × ε = V × 0.0078
Sum stalls when gap ≥ 0.001 (the increment)
V × 0.0078 ≥ 0.001 ⇒ V ≥ 0.128

After summing ~128 values (reaching sum ≈ 0.128), adding 0.001 rounds away to zero in BF16. The true sum should be 10.0, but BF16 gives ~0.128 — a 92% error! This is why gradient accumulators MUST be FP32, even when weights and activations are BF16. Every mixed-precision training recipe uses FP32 accumulators. This is the single most important numerical insight for ML training.

The proof of understanding. If you completed every exercise in this workbook — manipulated IEEE 754 bits, caught catastrophic cancellation, computed condition numbers, traced iterative methods, analyzed convergence rates, and audited a neural network for numerical stability — you understand the invisible arithmetic that makes ML work (or fail silently). These aren't abstract mathematical curiosities. They're the reason loss = NaN at step 847, the reason BF16 replaced FP16, the reason Adam outperforms SGD. "The purpose of computing is insight, not numbers." — Richard Hamming

Related Lessons

TopicLesson
Transformer arithmeticTransformer Math Workbook
Gradient descent deep-diveGradient Descent — From Absolute Zero
Mixed-precision trainingDistributed Training — From Absolute Zero
Linear algebra foundationsLinear Algebra Workbook