Numerical Methods Workbook — Engineermaxxing

Chapter 0: Floating Point

You're training a neural network and the loss suddenly becomes NaN. Or your model works in FP32 but produces garbage in FP16. Or your gradient accumulator silently loses precision after 10,000 steps. Every one of these bugs traces back to how computers represent real numbers — and it's nothing like the real number line you learned in math class.

IEEE 754 floating point splits a number into three fields:

IEEE 754 FP32 (32 bits total):
[1 sign bit] [8 exponent bits] [23 mantissa bits]

value = (-1)^sign × 2^{(exponent - 127)} × (1 + mantissa/2²³)

Key formats:
FP32: 1 + 8 + 23 = 32 bits // exponent range [-126, 127], precision ~7 decimal digits
FP16: 1 + 5 + 10 = 16 bits // exponent range [-14, 15], precision ~3.3 decimal digits
BF16: 1 + 8 + 7 = 16 bits // exponent range [-126, 127], precision ~2.4 decimal digits

BF16 vs FP16: range vs precision. BF16 keeps FP32's 8-bit exponent (same range: up to ~3.4×10³⁸) but chops the mantissa to 7 bits. FP16 has only a 5-bit exponent (max ~65504) but keeps 10 mantissa bits. For ML, BF16 wins: gradient values span a huge range, and the extra precision of FP16 rarely matters. Google invented BF16 specifically for TPU training.

Exercise 0.1: Largest FP32 Derive

The largest FP32 number has exponent bits = 11111110 (254, since 11111111 is reserved for Inf/NaN) and all mantissa bits = 1. Compute the value.

Hint: exponent = 254 - 127 = 127. Mantissa = (2²³-1)/2²³ ≈ 1 - 2^-23. Value = 2¹²⁷ × (2 - 2^-23).

× 10³⁸

Show derivation

exponent = 254 - 127 = 127

mantissa = (2²³ - 1) / 2²³ = 8388607 / 8388608 ≈ 0.99999988

value = 2¹²⁷ × (1 + 0.99999988) = 2¹²⁷ × 1.99999988

2¹²⁷ ≈ 1.7014 × 10³⁸

value ≈ 1.7014 × 10³⁸ × 2 = 3.4028 × 10³⁸

This is the famous FLT_MAX = 3.4028235e+38. Any computation exceeding this overflows to +Inf. In ML, this happens when you exponentiate large logits without the stable softmax trick.

Exercise 0.2: Machine Epsilon Derive

Machine epsilon is the smallest ε such that 1.0 + ε ≠ 1.0 in floating point. For FP32 with 23 mantissa bits, what is ε?

The mantissa has 23 bits after the implicit leading 1. The smallest change to the representation of 1.0 is flipping the last mantissa bit.

× 10^-7

Show derivation

ε = 2^-23 = 1 / 8388608 ≈ 1.1920929 × 10^-7

This means FP32 gives you about 7 decimal digits of precision. The number 1.0000001 is representable, but 1.00000001 rounds to 1.0. For FP16 (10 mantissa bits), ε = 2^-10 ≈ 9.77×10^-4 — less than 4 digits! For BF16 (7 mantissa bits), ε = 2^-7 ≈ 0.0078 — barely 2.4 digits.

Exercise 0.3: Can FP16 Represent 65536? Trace

FP16 has a 5-bit exponent with bias 15. The largest finite exponent is 2¹⁵ = 32768. Can FP16 exactly represent the integer 65536?

Yes, 65536 = 2¹⁶ fits in the mantissa with exponent 16 No. The max exponent value is 30 (11110 in binary, since 11111 is reserved for Inf/NaN), giving 2^30-15 = 2¹⁵ = 32768. The largest finite FP16 is 65504 — but 65536 = 2¹⁶ requires exponent 16, which needs stored exponent 31 (reserved). So 65536 overflows to Inf. It rounds to 65504

Exercise 0.4: BF16 Precision Derive

BF16 has 7 mantissa bits. What is the largest integer N such that every integer from 0 to N is exactly representable in BF16?

Think: with 7 mantissa bits + 1 implicit bit, you have 8 bits of significand = 256 values. Integers up to 2⁸ are exact, but what about 2⁸+1?

Show derivation

Significand has 1 implicit + 7 explicit = 8 bits total

Can represent integers 0 to 2⁸ = 256 exactly

257 = 2⁸ + 1 requires 9 significand bits → rounds to 256

This is why BF16 gradient accumulators lose precision fast. If your running sum exceeds 256, adding 1.0 to it does nothing — the 1.0 is lost! This is why mixed-precision training keeps the master weights and accumulator in FP32.

Exercise 0.5: Format Comparison Design

Match each scenario to the best floating-point format. Drag the chips to the slots.

Forward pass (activations)

Master weights

Gradient accumulation

Loss scaling factor

BF16 FP32 FP32 FP16

Show reasoning

Forward pass → BF16: Wide exponent range handles activations without overflow. Precision loss is tolerable since we backprop through these.

Master weights → FP32: Tiny gradient updates (e.g., lr=1e-5 × grad) need full precision to accumulate over thousands of steps.

Gradient accumulation → FP32: Summing many small gradients in low precision causes them to vanish. FP32 accumulation is standard.

Loss scaling factor → FP16: Loss scaling multiplies the loss by a large constant (e.g., 1024) before backprop, then divides after. The scale factor itself is often stored in FP16 as part of the loss computation pipeline, though it can also be FP32. FP16 is acceptable here because the factor is a single large number, not a sum of small ones.

Exercise 0.6: Spacing Around 1.0 Derive

In FP32, the gap between 1.0 and the next representable number is 2^-23. What is the gap between 1024.0 and the next representable FP32 number?

1024 = 2¹⁰. The exponent is 10, so the spacing between consecutive floats is 2¹⁰ × 2^-23.

× 10^-4

Show derivation

gap = 2^exponent × 2^-23 = 2¹⁰ × 2^-23 = 2^-13

2^-13 = 1/8192 ≈ 1.2207 × 10^-4

Floating-point spacing grows with magnitude. Near 1.0, the gap is ~10^-7. Near 1024.0, it's ~10^-4. Near 10⁶, it's ~0.06. This is why large loss values lose precision — the representable numbers get farther apart. Loss scaling works by keeping values in a range where spacing is tight enough for meaningful gradients.

Chapter 1: Catastrophic Cancellation

You subtract two nearly equal numbers and the result has far fewer significant digits than either operand. This is catastrophic cancellation — the silent assassin of numerical computing. It's why naive softmax overflows, why variance computations go negative, and why your loss function sometimes reports negative cross-entropy (which is impossible).

The problem:
a = 1.000000000000001 (16 significant digits)
b = 1.000000000000000 (16 significant digits)
a - b = 1 × 10^-15 (1 significant digit!)

In FP64 (53 mantissa bits, ~15.9 decimal digits):
(1 + 10^-15) - 1 should give 10^-15
But FP64 gives 1.110223024625157 × 10^-15
That's a relative error of ~11%!

Why this kills ML. Softmax computes exp(x_i)/∑exp(x_j). If x values are large (say 1000), exp(1000) = Inf in FP32. The fix: subtract max(x) first. This is mathematically equivalent (the max cancels in the ratio) but numerically essential. Every deep learning framework does this silently.

Exercise 1.1: Relative Error Derive

In FP64, computing (1 + 10^-15) - 1 gives 1.110223024625157×10^-15 instead of the true answer 10^-15. What is the relative error?

Relative error = |computed - true| / |true|.

Show derivation

computed = 1.110223024625157 × 10^-15

true = 1.0 × 10^-15

|computed - true| = 0.110223024625157 × 10^-15

relative error = 0.110223 / 1.0 = 0.110223 = 11.02%

An 11% error from a single subtraction! Both operands were stored with ~15.9 digits of precision, but the subtraction wiped out all but ~1 digit. The 0.110223... comes from 2^-53 × 2⁵⁰ = 2^-3 ≈ 0.125, a rounding artifact of binary representation.

Exercise 1.2: Stable Softmax Trace

Given logits [1000, 1001, 999], what happens with naive softmax (compute exp directly) in FP32, and how does the stable version fix it?

Naive softmax works fine — exp(1000) is large but fits in FP32 exp(1000) ≈ 10⁴³⁴ overflows FP32 max (3.4×10³⁸) → Inf. Inf/Inf = NaN. The stable version subtracts max(x)=1001: exp([-1, 0, -2]) = [0.368, 1.0, 0.135], sum=1.503, giving valid probabilities [0.245, 0.665, 0.090]. The result is [0.33, 0.33, 0.33] — all probabilities become equal

Exercise 1.3: Log-Sum-Exp Derive

The log-sum-exp trick computes log(∑ exp(x_i)) stably. The identity: LSE(x) = max(x) + log(∑ exp(x_i - max(x))). Compute LSE([10, 20, 30]) by hand.

Show derivation

max(x) = 30

shifted = [10-30, 20-30, 30-30] = [-20, -10, 0]

exp(shifted) = [exp(-20), exp(-10), exp(0)] = [2.06×10^-9, 4.54×10^-5, 1.0]

sum = 1.0000454

LSE = 30 + log(1.0000454) = 30 + 0.0000454 ≈ 30.00

The answer is dominated by the max term (30). Without the trick, you'd compute exp(30) ≈ 10¹³ which is fine, but exp(1000) would overflow. The trick ensures the largest exponent is always 0, so no overflow is possible regardless of input magnitude. This is the backbone of stable cross-entropy loss.

Exercise 1.4: Implement logSumExp() Build

Show solution

javascript
function logSumExp(arr) {
  const maxVal = Math.max(...arr);
  const sumExp = arr.reduce((s, x) =>
    s + Math.exp(x - maxVal), 0);
  return maxVal + Math.log(sumExp);
}

Exercise 1.5: Variance Bug Debug

This variance function can return negative values for certain inputs. Click the buggy line.

function variance(arr) {
  const n = arr.length;
  let sumSq = 0, sum = 0;
  for (let i = 0; i < n; i++) {
    sumSq += arr[i] * arr[i];
    sum += arr[i];
  }
  return sumSq / n - (sum / n) * (sum / n);
}

Show explanation

Line 8 uses the textbook formula Var = E[X²] - E[X]². This is mathematically correct but numerically catastrophic. When the mean is large (e.g., arr = [1000000.0, 1000000.1]), sumSq/n ≈ 10¹² and (sum/n)² ≈ 10¹². Subtracting two huge nearly-equal numbers → catastrophic cancellation → the result can go negative.

The fix: use the two-pass algorithm. First compute the mean, then compute ∑(x_i - mean)² / n. The subtraction happens at each element (small numbers), not at the aggregate (huge numbers).

Exercise 1.6: Digits Lost Derive

When subtracting a - b where a ≈ b, the number of significant digits lost is approximately log₁₀(|a| / |a - b|). If a = 1.234567 and b = 1.234566, how many digits are lost?

digits

Show derivation

|a - b| = 0.000001 = 10^-6

digits lost = log₁₀(1.234567 / 0.000001) = log₁₀(1,234,567) ≈ 6.09

We started with 7 digits of precision in each operand and lost ~6 of them, leaving ~1 reliable digit. This formula is a quick mental check: if a/|a-b| ≈ 10^k, you lose k digits. In FP32 (7 digits), losing 6 leaves 1 — barely usable. In FP64 (15.9 digits), losing 6 leaves ~10 — usually fine.

Chapter 2: Condition Numbers

You're solving a linear system Ax = b and get a completely wrong answer, even though your algorithm is correct. The problem isn't your code — it's the matrix. Some systems are so sensitive to tiny perturbations in the input that floating-point rounding is enough to destroy the answer. The condition number quantifies this sensitivity.

Condition number (2-norm):
κ(A) = ‖A‖ · ‖A^-1‖ = σ_max / σ_min

Error amplification:
relative error in x ≤ κ(A) × relative error in b

κ(A) ≈ 1 → well-conditioned (safe)
κ(A) ≈ 10^k → lose ~k digits of precision
κ(A) = ∞ → singular (no unique solution)

Why ML engineers care. The Hessian of a loss function is a matrix. If κ(H) is large, gradient descent oscillates wildly along ill-conditioned directions. This is exactly why optimizers like Adam work — they effectively precondition the Hessian by dividing each gradient component by its RMS history, reducing the effective condition number.

Exercise 2.1: Compute κ for a 2×2 Derive

A = [[1, 1], [1, 1.0001]]. First find A^-1 using the 2×2 formula: A^-1 = (1/det) × [[d, -b], [-c, a]]. Then compute κ(A) using the infinity norm: ‖A‖_∞ = max row sum of absolute values.

Show derivation

det(A) = 1 × 1.0001 - 1 × 1 = 0.0001

A^-1 = (1/0.0001) × [[1.0001, -1], [-1, 1]] = [[10001, -10000], [-10000, 10000]]

‖A‖_∞ = max(|1|+|1|, |1|+|1.0001|) = 2.0001

‖A^-1‖_∞ = max(10001+10000, 10000+10000) = 20001

κ(A) = 2.0001 × 20001 ≈ 40,002 ≈ 4 × 10⁴

A condition number of 40,000 means a perturbation of 10^-7 in b (FP32 machine epsilon) gets amplified to 10^-7 × 4×10⁴ = 4×10^-3 relative error in x. In FP32, you lose ~4.6 of your 7 digits. This matrix is ill-conditioned because its rows are nearly parallel.

Exercise 2.2: Perturbation Analysis Derive

Using A = [[1, 1], [1, 1.0001]] from above, solve Ax = b for b = [2, 2.0001]. Then solve for b' = [2, 2.0002] (a tiny perturbation of 0.00005 relative). How much does x change?

(absolute change in x₁)

Show derivation

For b = [2, 2.0001]: x = A^-1b = [[10001,-10000],[-10000,10000]] × [2, 2.0001]

x₁ = 10001×2 - 10000×2.0001 = 20002 - 20001 = 1.0

x₂ = -10000×2 + 10000×2.0001 = -20000 + 20001 = 1.0

So x = [1, 1].

For b' = [2, 2.0002]: x' = A^-1b'

x'₁ = 10001×2 - 10000×2.0002 = 20002 - 20002 = 0.0

x'₂ = -10000×2 + 10000×2.0002 = -20000 + 20002 = 2.0

x' = [0, 2]. Change in x₁: |1 - 0| = 1.0

A 0.005% change in b produced a 100% change in x. The solution completely flipped! This is the condition number at work: κ ≈ 40,000 means small perturbations get amplified by up to 40,000×.

Exercise 2.3: Digits Lost from κ Trace

If κ(A) = 10⁸ and you solve Ax=b in FP32 (~7 decimal digits), approximately how many correct digits can you expect in x?

About 0 correct digits — you lose ~log₁₀(κ) = 8 digits from your 7, leaving effectively no precision. The answer is meaningless. About 3 correct digits All 7 digits — condition number doesn't affect FP32

Exercise 2.4: Identity Matrix Derive

What is κ(I) for the n×n identity matrix? (Think: what are the singular values of I?)

Show derivation

I^-1 = I, so ‖I‖ = 1 and ‖I^-1‖ = 1

κ(I) = 1 × 1 = 1

The identity matrix is perfectly conditioned. Solving Ix = b gives x = b exactly (no amplification). κ = 1 is the best possible. Orthogonal matrices (Q^TQ = I) also have κ = 1. This is why QR decomposition is preferred over direct inversion — it factors A into a product involving orthogonal matrices.

Exercise 2.5: Implement conditionNumber() Build

Compute the infinity-norm condition number of a 2×2 matrix [[a,b],[c,d]].

Show solution

javascript
function conditionNumber(a, b, c, d) {
  const normA = Math.max(
    Math.abs(a) + Math.abs(b),
    Math.abs(c) + Math.abs(d)
  );
  const det = a * d - b * c;
  const normAinv = Math.max(
    Math.abs(d/det) + Math.abs(b/det),
    Math.abs(c/det) + Math.abs(a/det)
  );
  return normA * normAinv;
}

Chapter 3: Iterative Methods

Direct solvers like Gaussian elimination are O(n³). For a 100,000 × 100,000 sparse system (common in physics simulations and graph-based ML), that's 10¹⁵ operations — days of compute. Iterative methods start with a guess and improve it, often converging in far fewer operations, especially when the matrix is sparse.

Jacobi iteration for Ax = b:
Split A = D + L + U (diagonal, strictly lower, strictly upper)
x^(k+1) = D^-1(b - (L + U)x^(k))

Component form:
x_i^(k+1) = (1/a_ii) (b_i - ∑_j≠i a_ij x_j^(k))

Convergence: ρ(D^-1(L+U)) < 1 (spectral radius of iteration matrix)

The ML connection. Gradient descent IS an iterative method for solving ∇f(x) = 0. The learning rate plays the role of D^-1 — it scales the update. And the convergence condition (spectral radius < 1) is analogous to requiring the learning rate to be smaller than 2/L where L is the Lipschitz constant of the gradient.

Exercise 3.1: Jacobi by Hand Derive

Solve this system using Jacobi iteration, starting from x⁽⁰⁾ = [0, 0]:

4x₁ + x₂ = 9
x₁ + 3x₂ = 11

Compute x⁽¹⁾ and x⁽²⁾. What is x₁⁽²⁾ (the first component after 2 iterations)?

Show derivation

Jacobi formula: x₁^(k+1) = (9 - x₂^(k))/4, x₂^(k+1) = (11 - x₁^(k))/3

Iteration 1 (from x⁽⁰⁾ = [0, 0]):

x₁⁽¹⁾ = (9 - 0)/4 = 2.25

x₂⁽¹⁾ = (11 - 0)/3 = 3.667

Iteration 2 (from x⁽¹⁾ = [2.25, 3.667]):

x₁⁽²⁾ = (9 - 3.667)/4 = 5.333/4 = 1.333

x₂⁽²⁾ = (11 - 2.25)/3 = 8.75/3 = 2.917

The true solution is x = [1, 3]. After just 2 iterations we have [1.333, 2.917] — already close. Convergence is guaranteed here because A is strictly diagonally dominant (|4| > |1| and |3| > |1|).

Exercise 3.2: Third Iteration Derive

Continue from x⁽²⁾ = [1.333, 2.917]. Compute x₂⁽³⁾.

Show derivation

x₂⁽³⁾ = (11 - 1.333)/3 = 9.667/3 = 3.222

x₁⁽³⁾ = (9 - 2.917)/4 = 6.083/4 = 1.021

After 3 iterations: [1.021, 3.222]. The error in x₁ went from 1.333 to 0.021 — it's converging nicely. True solution: [1, 3].

Exercise 3.3: Convergence Check Trace

For the system [[1, 2], [2, 1]] x = b, will Jacobi iteration converge?

Yes — the matrix is symmetric positive definite No — the matrix is NOT diagonally dominant. |a₁₁| = 1 is not greater than |a₁₂| = 2. The iteration matrix D^-1(L+U) = [[0,2],[2,0]] has spectral radius 2 > 1, so Jacobi diverges. It converges but slowly

Exercise 3.4: Spectral Radius Derive

For our original system (A = [[4,1],[1,3]]), the Jacobi iteration matrix is M = D^-1(L+U) = [[0, -1/4], [-1/3, 0]]. Compute the spectral radius ρ(M) = max |eigenvalue|.

For a 2×2 matrix [[0,a],[b,0]], the eigenvalues are ±√(ab).

Show derivation

M = [[0, -1/4], [-1/3, 0]]

eigenvalues = ±√((-1/4)(-1/3)) = ±√(1/12) = ±0.2887

ρ(M) = 0.2887 < 1 ⇒ converges!

The spectral radius tells you the convergence rate. Each iteration reduces the error by a factor of ~0.289. After k iterations, error ≈ 0.289^k × initial error. To get 6 digits of accuracy: 0.289^k < 10^-6 → k > 11.1. So about 12 iterations suffice.

Exercise 3.5: Implement jacobiStep() Build

Show solution

javascript
function jacobiStep(A, b, x) {
  const n = b.length;
  const xNew = new Array(n);
  for (let i = 0; i < n; i++) {
    let sum = 0;
    for (let j = 0; j < n; j++) {
      if (j !== i) sum += A[i][j] * x[j];
    }
    xNew[i] = (b[i] - sum) / A[i][i];
  }
  return xNew;
}

Chapter 4: Newton's Method

You need to find where f(x) = 0. Binary search works but it's slow — one bit of precision per iteration. Newton's method uses the derivative to build a local linear approximation and jumps directly to where that line crosses zero. The result: once you're close enough, the number of correct digits doubles every iteration.

Newton's method:
x_n+1 = x_n - f(x_n) / f'(x_n)

Geometric interpretation:
Draw the tangent line at (x_n, f(x_n)) with slope f'(x_n).
The tangent crosses y=0 at x_n+1.

Convergence: Quadratic near a simple root.
|e_n+1| ≈ C · |e_n|² where e_n = x_n - root

Quadratic convergence is magical. If your current error is 10^-3, the next error is ~10^-6, then ~10^-12. Two more iterations and you have machine precision. Compare this to bisection, which only halves the error each step: 10^-3 → 5×10^-4 → 2.5×10^-4... Newton is the algorithm behind torch.optim's L-BFGS and is why second-order optimizers converge so fast (when they work).

Exercise 4.1: Find √2 Derive

To find √2, solve f(x) = x² - 2 = 0. f'(x) = 2x. Starting from x₀ = 1, compute x₁, x₂, x₃, and x₄. What is x₄ (to 6 decimal places)?

Show derivation

x_n+1 = x_n - (x_n² - 2)/(2x_n) = (x_n + 2/x_n)/2

x₀ = 1

x₁ = (1 + 2/1)/2 = 3/2 = 1.5

x₂ = (1.5 + 2/1.5)/2 = (1.5 + 1.3333)/2 = 1.41667

x₃ = (1.41667 + 2/1.41667)/2 = (1.41667 + 1.41176)/2 = 1.414216

x₄ = (1.414216 + 2/1.414216)/2 = 1.4142135624...

The true value is √2 = 1.41421356237... By x₃ we have 6 correct digits, and x₄ gives 12+. Watch the quadratic convergence: error went 0.414 → 0.086 → 0.0025 → 0.000002 → 0.000000000004. Each step roughly squares the error.

Exercise 4.2: Convergence Rate Derive

From the √2 example, compute the error at each step. What is |e₂|/|e₁|²?

This ratio should be approximately constant for quadratic convergence. True root = 1.41421356...

Show derivation

e₀ = |1 - 1.41421| = 0.41421

e₁ = |1.5 - 1.41421| = 0.08579

e₂ = |1.41667 - 1.41421| = 0.00245

|e₂| / |e₁|² = 0.00245 / 0.08579² = 0.00245 / 0.00736 = 0.333 ≈ 0.35

The constant C ≈ 0.35 = |f''(root)|/(2|f'(root)|) = |2|/(2×2√2) = 1/(2√2) ≈ 0.354. This is the convergence constant for Newton's method on x²-2.

Exercise 4.3: When Newton Fails Trace

Apply Newton's method to f(x) = x³ - 2x + 2 starting from x₀ = 0. What happens?

It converges quickly to the root x ≈ -1.77 It converges slowly It enters a cycle: x₀=0, f(0)=2, f'(0)=-2, so x₁=0-2/(-2)=1. Then f(1)=1, f'(1)=1, so x₂=1-1/1=0. We're back to x₀! Newton oscillates forever between 0 and 1.

Exercise 4.4: Implement newton() Build

Show solution

javascript
function newton(f, fPrime, x0, nIter) {
  let x = x0;
  for (let i = 0; i < nIter; i++) {
    x = x - f(x) / fPrime(x);
  }
  return x;
}

Exercise 4.5: Division by Zero Bug Debug

This Newton's method implementation crashes on certain inputs. Click the buggy line.

function newton(f, fp, x0, tol) {
  let x = x0;
  for (let i = 0; i < 100; i++) {
    const fx = f(x);
    if (Math.abs(fx) < tol) return x;
    x = x - fx / fp(x);
  }
  return x;
}

Show explanation

Line 6 divides by fp(x) without checking if the derivative is zero. If f'(x) = 0 at any iteration (a horizontal tangent), the step becomes ±Infinity, and all subsequent iterates are NaN. The fix: check if (Math.abs(fp(x)) < 1e-12) return x; before dividing — or throw an error indicating the method has stalled at a stationary point.

Chapter 5: Gradient Descent Convergence

Every neural network trains via gradient descent (or a variant). But how fast does it converge? When does it diverge? The answers come from analyzing the simplest possible case: a quadratic function. If you understand gradient descent on f(x) = x², you understand the core dynamics of all first-order optimization.

Gradient descent on f(x) = x²:
∇f = 2x
Update: x_k+1 = x_k - α · 2x_k = (1 - 2α) · x_k

After k steps:
x_k = (1 - 2α)^k · x₀

Convergence requires:
|1 - 2α| < 1 ⇒ 0 < α < 1

Learning rate = convergence rate. The factor |1 - 2α| determines how fast the error shrinks. Optimal: α = 0.5 gives (1-2×0.5)=0 — one-step convergence! But this requires knowing the curvature (the Hessian = 2 for f=x²). For α = 0.1, the factor is 0.8, so error decays as 0.8^k. This is linear convergence — the hallmark of gradient descent. Newton's method achieves quadratic.

Exercise 5.1: Trace 5 Steps Derive

f(x) = x², α = 0.1, x₀ = 5. Compute x₁ through x₅ and f(x₅).

f(x₅)

Show derivation

x_k+1 = (1 - 2×0.1) · x_k = 0.8 · x_k

x₀ = 5.0

x₁ = 0.8 × 5.0 = 4.0

x₂ = 0.8 × 4.0 = 3.2

x₃ = 0.8 × 3.2 = 2.56

x₄ = 0.8 × 2.56 = 2.048

x₅ = 0.8 × 2.048 = 1.6384

f(x₅) = x₅² = 1.6384² = 2.684

Equivalently: x₅ = 0.8⁵ × 5 = 0.32768 × 5 = 1.6384. After 5 steps with α=0.1, the loss went from f(5)=25 down to f(1.6384)=2.684 — about a 10× reduction. To reach f < 0.01, you need 0.8^2k × 25 < 0.01, giving k ≥ 35 steps.

Exercise 5.2: Divergence Threshold Derive

For f(x) = x², what is the maximum learning rate α before gradient descent diverges? Compute x₁ with α = 1.1, x₀ = 1 to verify.

α_max

Show derivation

x_k+1 = (1 - 2α)x_k

Converges when |1 - 2α| < 1

-1 < 1 - 2α < 1 ⇒ 0 < α < 1

α_max = 1.0

With α = 1.1: x₁ = (1 - 2.2) × 1 = -1.2

x₂ = (1 - 2.2) × (-1.2) = 1.44

|x_k| grows: 1 → 1.2 → 1.44 → 1.728 → ∞

For general f with Lipschitz-continuous gradient with constant L, the stability bound is α < 2/L. For f(x) = x², L = 2, so α < 2/2 = 1. This is why learning rate schedules start small — to ensure stability.

Exercise 5.3: Implement gradientDescent() Build

Show solution

javascript
function gradientDescent(grad, x0, lr, nSteps) {
  let x = x0;
  for (let i = 0; i < nSteps; i++) {
    x = x - lr * grad(x);
  }
  return x;
}

Exercise 5.4: Steps to Converge Derive

For f(x) = x² with α = 0.1 and x₀ = 10, how many steps k are needed to achieve |x_k| < 0.01?

Use |x_k| = 0.8^k × 10 < 0.01, solve for k.

steps

Show derivation

0.8^k × 10 < 0.01

0.8^k < 0.001

k × ln(0.8) < ln(0.001)

k > ln(0.001) / ln(0.8) = (-6.908) / (-0.2231) = 30.96

k = 31 steps

31 steps to get 3 digits of accuracy. With α=0.3, the factor is 0.4, and you'd need only ln(0.001)/ln(0.4) = 7.5 → 8 steps. Larger learning rate = faster convergence (until you hit the stability limit).

Exercise 5.5: 2D Gradient Descent Trace

Consider f(x,y) = 100x² + y² (a stretched bowl). With α = 0.005, starting from (1,1), what happens?

Converges in ~10 steps The x-direction has gradient 200x with curvature 200, and y-direction has gradient 2y with curvature 2. The condition number κ = 200/2 = 100. The learning rate must satisfy α < 2/200 = 0.01 for stability in x, but this makes y-convergence very slow (factor = 1-2×0.005×1 = 0.99). It takes ~700 steps. This is exactly the problem Adam solves. It diverges in both directions

Chapter 6: Numerical Integration

You need the area under a curve, but the integral has no closed form — or you only have data points, not a formula. Numerical integration (quadrature) approximates ∫f(x)dx using weighted sums of function evaluations. The trapezoidal rule connects points with straight lines; Simpson's rule uses parabolas. The choice determines how fast the error shrinks as you add more points.

Trapezoidal rule (n intervals, step h = (b-a)/n):
∫_a^b f(x)dx ≈ h/2 · [f(x₀) + 2f(x₁) + 2f(x₂) + ... + 2f(x_n-1) + f(x_n)]
Error = O(h²) = O(1/n²)

Simpson's rule (n intervals, n must be even):
∫_a^b f(x)dx ≈ h/3 · [f(x₀) + 4f(x₁) + 2f(x₂) + 4f(x₃) + ... + 4f(x_n-1) + f(x_n)]
Error = O(h⁴) = O(1/n⁴)

Why this matters for ML. Monte Carlo integration (averaging random samples) is how we estimate expectations in reinforcement learning and variational inference. The error is O(1/√N) regardless of dimension — slower than trapezoidal for 1D, but it doesn't suffer the curse of dimensionality. Understanding deterministic quadrature gives you the baseline to appreciate why stochastic methods dominate in high dimensions.

Exercise 6.1: Trapezoidal Rule by Hand Derive

Compute ∫₀¹ x² dx using the trapezoidal rule with n=4 intervals. (True answer = 1/3 ≈ 0.3333.)

h = (1-0)/4 = 0.25. Points: x = 0, 0.25, 0.50, 0.75, 1.0. Apply the formula.

Show derivation

h = 0.25, f(x) = x²

f(0) = 0, f(0.25) = 0.0625, f(0.5) = 0.25, f(0.75) = 0.5625, f(1) = 1

T = (0.25/2) × [0 + 2(0.0625) + 2(0.25) + 2(0.5625) + 1]

T = 0.125 × [0 + 0.125 + 0.5 + 1.125 + 1]

T = 0.125 × 2.75 = 0.34375

Error = |0.34375 - 0.33333| = 0.01042. With n=4, the error is O(h²) = O(0.0625). The ratio error/h² ≈ 0.01042/0.0625 = 0.167 ≈ 1/6, which matches the error formula -(b-a)h²f''(c)/12 = -1×0.0625×2/12 = -0.01042. Theory confirmed!

Exercise 6.2: Simpson's Rule Derive

Compute ∫₀¹ x² dx using Simpson's rule with n=4 intervals (same points as above).

Show derivation

h = 0.25, coefficients: [1, 4, 2, 4, 1]

S = (0.25/3) × [1×0 + 4×0.0625 + 2×0.25 + 4×0.5625 + 1×1]

S = (0.25/3) × [0 + 0.25 + 0.5 + 2.25 + 1]

S = (0.25/3) × 4.0 = 1.0/3 = 0.33333...

Simpson's rule gives the exact answer for x²! This is because Simpson's rule exactly integrates polynomials up to degree 3 (its error involves the 4th derivative, and f''''(x²) = 0). This is a general pattern: higher-order quadrature rules are exact for higher-degree polynomials.

Exercise 6.3: Error Comparison Trace

If you double the number of intervals (n → 2n), how much does the error shrink for each method?

Both reduce by half Trapezoidal: error × 1/4 (O(h²), h halves so h² → h²/4). Simpson's: error × 1/16 (O(h⁴), h halves so h⁴ → h⁴/16). Simpson's converges much faster — each doubling of n buys 4× more accuracy than trapezoidal. Trapezoidal: ×1/4, Simpson's: ×1/8

Exercise 6.4: Implement trapezoidal() Build

Show solution

javascript
function trapezoidal(f, a, b, n) {
  const h = (b - a) / n;
  let sum = f(a) + f(b);
  for (let i = 1; i < n; i++) {
    sum += 2 * f(a + i * h);
  }
  return (h / 2) * sum;
}

Exercise 6.5: Monte Carlo vs Trapezoidal Derive

To estimate ∫₀¹ f(x)dx with error < 10^-6, how many points does each method need? Assume f is smooth with bounded 2nd derivative |f''| ≤ 1.

Trapezoidal error ≤ h²/12 = 1/(12n²). Monte Carlo error ≈ σ/√N where σ ≤ 1.

trapezoidal n

Show derivation

Trapezoidal: 1/(12n²) < 10^-6 ⇒ n² > 10⁶/12 ≈ 83,333 ⇒ n > 289

Monte Carlo: 1/√N < 10^-6 ⇒ N > 10¹²

In 1D, trapezoidal needs ~289 points while Monte Carlo needs a trillion. Deterministic quadrature wins overwhelmingly in low dimensions. But in d dimensions, trapezoidal needs n^d points (curse of dimensionality), while Monte Carlo still needs just 1/ε², regardless of d. This crossover at d ≈ 4-8 is why stochastic methods dominate in ML.

Chapter 7: Numerical Differentiation

You have a function f(x) but no analytical derivative. Or you want to verify that your backprop implementation is correct. Numerical differentiation approximates f'(x) using nearby function evaluations. The two main formulas differ by an order of magnitude in accuracy — and knowing which to use can catch gradient bugs that would otherwise be invisible.

Forward difference:
f'(x) ≈ [f(x + h) - f(x)] / h Error = O(h)

Central difference:
f'(x) ≈ [f(x + h) - f(x - h)] / (2h) Error = O(h²)

The h dilemma:
Smaller h → less truncation error, but more rounding error.
Optimal h ≈ √ε for forward, ≈ ε^1/3 for central (ε = machine epsilon).

Gradient checking in practice. PyTorch's torch.autograd.gradcheck uses central differences with h ≈ 10^-5 in FP64 to verify analytical gradients. If the relative difference exceeds ~10^-5, something is wrong with your backward pass. This has caught countless bugs in custom autograd functions.

Exercise 7.1: Forward Difference Derive

Compute f'(2) for f(x) = x³ using the forward difference with h = 0.01. (True answer: f'(x) = 3x², so f'(2) = 12.)

Show derivation

f(2.01) = 2.01³ = 8.120601

f(2) = 8

forward = (8.120601 - 8) / 0.01 = 0.120601 / 0.01 = 12.0601

error = |12.0601 - 12| = 0.0601

The error is 0.0601 ≈ 3×2×0.01 = 0.06 = f''(2)×h/2 (the leading error term for forward difference is f''(c)h/2). This is O(h) — first-order accurate.

Exercise 7.2: Central Difference Derive

Now compute f'(2) for f(x) = x³ using the central difference with h = 0.01.

Show derivation

f(2.01) = 2.01³ = 8.120601

f(1.99) = 1.99³ = 7.880599

central = (8.120601 - 7.880599) / (2 × 0.01) = 0.240002 / 0.02 = 12.0001

error = |12.0001 - 12| = 0.0001

Central difference error: 0.0001. Forward difference error: 0.0601. That's 601× more accurate for the same h! The central difference is O(h²), so its error is ~h²×f'''(2)/6 = 0.0001×6/6 = 0.0001. This is why gradient checking always uses central differences.

Exercise 7.3: Error Ratio Derive

What is the ratio of forward difference error to central difference error for the same h and same function? Express as a power of h.

Forward error ~O(h), central error ~O(h²). Their ratio = ?

ratio for h=0.01

Show derivation

Forward error = f''(2) × h/2 = 12 × 0.01/2 = 0.06

Central error = f'''(2) × h²/6 = 6 × 0.0001/6 = 0.0001

Ratio = 0.06 / 0.0001 = 600

(We got 601 empirically due to higher-order terms.) The ratio scales as O(1/h). Smaller h → bigger advantage for central differences. At h = 10^-5 (typical for gradient checking), central differences are ~10⁵× more accurate than forward differences. One extra function evaluation buys five orders of magnitude.

Exercise 7.4: Optimal Step Size Trace

For central differences in FP64 (ε ≈ 10^-16), what is the optimal h?

h = 10^-16 — as small as possible h ≈ ε^1/3 ≈ (10^-16)^1/3 ≈ 10^-5.3. Smaller h reduces truncation error (O(h²)) but increases rounding error (O(ε/h)). Balancing: h² = ε/h ⇒ h³ = ε ⇒ h = ε^1/3. This gives total error ~ε^2/3 ≈ 10^-10.7 — about 10 correct digits. h = 0.01

Exercise 7.5: Implement gradCheck() Build

Show solution

javascript
function gradCheck(f, x, analyticGrad, h) {
  const numGrad = (f(x + h) - f(x - h)) / (2 * h);
  const denom = Math.max(
    Math.abs(numGrad), Math.abs(analyticGrad), 1e-8
  );
  return Math.abs(numGrad - analyticGrad) / denom;
}

Exercise 7.6: Second Derivative Derive

The central difference formula for f''(x) is [f(x+h) - 2f(x) + f(x-h)] / h². Compute f''(1) for f(x) = e^x with h = 0.1. (True answer: f''(x) = e^x, f''(1) = e ≈ 2.71828.)

Show derivation

f(1.1) = e^1.1 = 3.00417

f(1.0) = e^1.0 = 2.71828

f(0.9) = e^0.9 = 2.45960

f'' ≈ (3.00417 - 2 × 2.71828 + 2.45960) / 0.01

= (3.00417 - 5.43656 + 2.45960) / 0.01

= 0.02721 / 0.01 = 2.721

Error = |2.721 - 2.718| = 0.003. The second-derivative formula has error O(h²), so with h=0.1, error ≈ h² f''''(1)/12 = 0.01×e/12 ≈ 0.0023. Close to our observed error. This formula is used in Hessian-vector products for second-order optimization.

Chapter 8: Sparse Matrices

A 10,000 × 10,000 matrix has 100 million entries. If 99% of them are zero — common in graph neural networks, attention masks, and sparse transformers — storing all 100 million values wastes 99% of your memory. Sparse matrix formats store only the nonzero values plus their indices, enabling huge memory savings and faster matrix-vector products.

CSR (Compressed Sparse Row):
values: [list of nonzero values, left-to-right, top-to-bottom]
col_idx: [column index for each value]
row_ptr: [index into values where each row starts; length = nrows+1]

Example: A = [[1,0,2],[0,3,0],[4,0,5]]
values = [1, 2, 3, 4, 5]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5] // row 0 starts at idx 0, row 1 at idx 2, row 2 at idx 3, end at 5

Why ML cares about sparsity. Graph neural networks represent adjacency as sparse matrices. Sparse attention (Longformer, BigBird) uses sparse attention masks. Mixture-of-Experts routing matrices are extremely sparse (top-k of hundreds of experts). Even dense transformers have sparse gradients during embedding lookups. Understanding CSR is essential for understanding PyTorch's torch.sparse and graph libraries like PyG.

Exercise 8.1: Convert to CSR Derive

Convert this 4×4 matrix to CSR format. Write the row_ptr array.

A = [[0, 0, 3, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 4, 0, 1]]

values = [3, 2, 4, 1], col_idx = [2, 0, 1, 3]. What is row_ptr?

[0, 1, 2, 3, 4] [0, 1, 2, 4] [0, 1, 2, 2, 4] — Row 0 has 1 nonzero (starts at 0), Row 1 has 1 (starts at 1), Row 2 has 0 (starts at 2, same as next), Row 3 has 2 (starts at 2), end = 4. [0, 1, 2, 2, 5]

Exercise 8.2: Sparse Matrix-Vector Product Derive

Using the CSR from Exercise 8.1 (A = [[0,0,3,0],[2,0,0,0],[0,0,0,0],[0,4,0,1]]), compute y = Ax for x = [1, 2, 3, 4].

What is y₃ (the 4th component, 0-indexed)?

Show derivation

y₀ = 0×1 + 0×2 + 3×3 + 0×4 = 9

y₁ = 2×1 + 0×2 + 0×3 + 0×4 = 2

y₂ = 0×1 + 0×2 + 0×3 + 0×4 = 0

y₃ = 0×1 + 4×2 + 0×3 + 1×4 = 8 + 4 = 12

Using CSR, we only touch 4 nonzero entries instead of all 16. For each row i, we iterate from row_ptr[i] to row_ptr[i+1], multiply values[j] by x[col_idx[j]], and accumulate. Row 3: j goes from 2 to 4, giving values[2]×x[1] + values[3]×x[3] = 4×2 + 1×4 = 12.

Exercise 8.3: Memory Savings Derive

A 1000×1000 matrix is 90% sparse (10% nonzero = 100,000 entries). Compare memory usage: dense (FP32) vs CSR (FP32 values + INT32 indices). What is the CSR/dense ratio?

Show derivation

Dense: 1000 × 1000 × 4 bytes = 4,000,000 bytes

CSR values: 100,000 × 4 = 400,000 bytes

CSR col_idx: 100,000 × 4 = 400,000 bytes

CSR row_ptr: 1001 × 4 = 4,004 bytes

CSR total = 804,004 bytes

Ratio = 804,004 / 4,000,000 = 0.201 ≈ 0.20

Hmm, 0.20 with 90% sparsity. At 99% sparsity (10,000 nonzeros), the ratio drops to ~0.02 (50× savings). At 99.9% sparsity (typical for large graphs), it's ~0.002 (500× savings). The breakeven point (CSR uses same memory as dense) is around 50% sparsity — below that, just use dense.

Exercise 8.4: Implement spMV() Build

Implement sparse matrix-vector multiply using CSR format.

Show solution

javascript
function spMV(values, colIdx, rowPtr, x) {
  const nRows = rowPtr.length - 1;
  const y = new Array(nRows).fill(0);
  for (let i = 0; i < nRows; i++) {
    for (let j = rowPtr[i]; j < rowPtr[i + 1]; j++) {
      y[i] += values[j] * x[colIdx[j]];
    }
  }
  return y;
}

Exercise 8.5: SpMV Complexity Trace

Dense matrix-vector multiply on an n×n matrix costs O(n²) operations. What does sparse matrix-vector (SpMV) cost for a matrix with nnz nonzeros?

O(nnz) — one multiply-add per nonzero entry. For a 1000×1000 matrix with 10,000 nonzeros, SpMV does 10,000 operations vs dense's 1,000,000. That's 100× faster. O(n²) — same as dense, just uses less memory O(n × nnz)

Chapter 9: Capstone — Numerical Stability Audit

You've mastered the individual tools. Now let's use them all at once. In this capstone, you'll audit a neural network forward pass for numerical stability, tracing exact values through each operation and identifying where precision loss occurs.

The scenario. A single-layer neural network: input x → linear layer (Wx + b) → ReLU → softmax → cross-entropy loss. We'll trace specific values through in FP16 and FP32, find where things break, and apply the fixes from earlier chapters.

Exercise 9.1: Pre-Softmax Overflow Derive

After the linear layer, the logits are z = [89.3, 91.7, 88.1]. In FP16, the max representable value is 65504. Is exp(91.7) representable in FP16?

Compute exp(91.7) and compare to 65504.

× 10³⁹

Show derivation

exp(91.7) = e^91.7 ≈ 7.33 × 10³⁹

FP16 max = 65504 ≈ 6.55 × 10⁴

7.33 × 10³⁹ >> 6.55 × 10⁴ ⇒ massive overflow!

Even exp(11.1) ≈ 66,686 barely overflows FP16. Anything above ~11.09 in the exponent overflows. Real logits of 89-91 would produce Inf in FP16, giving NaN after softmax. This is why the stable softmax trick (subtract max) is not optional — it's a correctness requirement in low precision.

Exercise 9.2: Apply the Fix Derive

Apply the stable softmax trick to z = [89.3, 91.7, 88.1]. Subtract max, compute exp, normalize. What is softmax(z)₀ (the probability of the first class)?

Show derivation

max(z) = 91.7

shifted = [89.3-91.7, 91.7-91.7, 88.1-91.7] = [-2.4, 0, -3.6]

exp(shifted) = [exp(-2.4), exp(0), exp(-3.6)] = [0.0907, 1.0, 0.0273]

sum = 0.0907 + 1.0 + 0.0273 = 1.118

softmax₀ = 0.0907 / 1.118 = 0.0811

All intermediate values are in [0, 1] — perfectly safe for FP16. The max exponent is 0, so exp can never overflow. This is a concrete example of why the stable trick works: it shifts the dynamic range of the computation to a safe region.

Exercise 9.3: Gradient Underflow Derive

The gradient of the cross-entropy loss with respect to logit z_i is (softmax(z)_i - y_i), where y is the one-hot target. If the true class is class 1 (y = [0,1,0]), what is the gradient for z₂ (the last logit), and could it underflow in FP16?

FP16 smallest normal = 2^-14 ≈ 6.1×10^-5. Subnormals go to ~5.96×10^-8.

(gradient for z₂)

Show derivation

softmax(z) ≈ [0.081, 0.895, 0.024] (from above: 0.0907/1.118, 1.0/1.118, 0.0273/1.118)

gradient for z₂ = softmax(z)₂ - y₂ = 0.024 - 0 = 0.024

0.024 > 6.1×10^-5 (FP16 smallest normal) ⇒ no underflow here

This gradient (0.024) is fine for FP16. But imagine a 1000-class classifier where the true class gets softmax = 0.999 and most other classes get ~10^-6. Those gradients ARE below the FP16 subnormal threshold. They become zero — the model can't learn to adjust those logits. This is gradient underflow, and it's why loss scaling multiplies the loss by 1024 or more before backprop.

Exercise 9.4: Loss Scaling Derive

If a gradient is 3.2×10^-6 (below FP16 subnormal range and rounds to 0), what loss scale factor S would bring it into the normal FP16 range (above 6.1×10^-5)?

After backprop with scaled loss, gradients are multiplied by S. We need S × 3.2×10^-6 > 6.1×10^-5. What minimum S works?

(minimum power of 2)

Show derivation

S > 6.1×10^-5 / 3.2×10^-6 = 19.06

Next power of 2: S = 32

Check: 32 × 3.2×10^-6 = 1.024×10^-4 > 6.1×10^-5 ✓

In practice, loss scales of 1024 or higher are common (AMP default starts at 65536 and dynamically adjusts). The optimizer divides by S after backprop to recover the true gradient. Dynamic loss scaling: if you see Inf/NaN in gradients, halve S; if no overflow for N steps, double S. This is how PyTorch's GradScaler works.

Exercise 9.5: FP16 vs FP32 Comparison Trace

A model trains perfectly in FP32 but produces NaN loss in FP16 at step 847. The loss was normal (~2.3) at step 846. Which of these is most likely the root cause?

Bug in the model code that only triggers at step 847 A batch of training data produced unusually large activations in FP16, causing overflow in a pre-softmax or pre-normalization layer. The same values in FP32 (max ~3.4×10³⁸) stayed representable. Fix: enable dynamic loss scaling, add gradient clipping, or use BF16 (same exponent range as FP32). The learning rate is too high

Exercise 9.6: Full Audit Pipeline Design

Order the steps for debugging a NaN loss in mixed-precision training. Drag chips to slots.

→

Reproduce with FP32 Locate first NaN layer Check value ranges Apply fix (clamp/scale/BF16) Verify fix converges

Show reasoning

1. Reproduce with FP32 — Confirm the bug is precision-related, not a logic error. If FP32 also NaNs, the problem is elsewhere.

2. Locate first NaN layer — Insert hooks to check each layer's output for Inf/NaN. The first layer that produces NaN is the culprit.

3. Check value ranges — Log the max absolute activation/gradient at the NaN layer. If max > 65504 (FP16) or gradients < 6e-5, you've found the mechanism.

4. Apply fix — Clamp pre-softmax logits, enable loss scaling, switch to BF16, or add gradient clipping — depending on whether the issue is overflow or underflow.

5. Verify fix converges — Train for the full schedule and confirm loss matches FP32 baseline within acceptable tolerance.

Exercise 9.7: Accumulator Precision Derive

You're summing 10,000 gradient values, each approximately 0.001, in BF16. BF16 can represent integers exactly only up to 256. At what point in the summation does the running sum stop growing?

When the sum exceeds some threshold, adding 0.001 falls below the precision gap. BF16 ε = 2^-7 ≈ 0.0078. Near value V, the gap is V × 0.0078.

sum stalls at approximately

Show derivation

BF16 ε = 2^-7 ≈ 0.0078125

Gap near value V = V × ε = V × 0.0078

Sum stalls when gap ≥ 0.001 (the increment)

V × 0.0078 ≥ 0.001 ⇒ V ≥ 0.128

After summing ~128 values (reaching sum ≈ 0.128), adding 0.001 rounds away to zero in BF16. The true sum should be 10.0, but BF16 gives ~0.128 — a 92% error! This is why gradient accumulators MUST be FP32, even when weights and activations are BF16. Every mixed-precision training recipe uses FP32 accumulators. This is the single most important numerical insight for ML training.

The proof of understanding. If you completed every exercise in this workbook — manipulated IEEE 754 bits, caught catastrophic cancellation, computed condition numbers, traced iterative methods, analyzed convergence rates, and audited a neural network for numerical stability — you understand the invisible arithmetic that makes ML work (or fail silently). These aren't abstract mathematical curiosities. They're the reason loss = NaN at step 847, the reason BF16 replaced FP16, the reason Adam outperforms SGD. "The purpose of computing is insight, not numbers." — Richard Hamming

Related Lessons

Topic	Lesson
Transformer arithmetic	Transformer Math Workbook
Gradient descent deep-dive	Gradient Descent — From Absolute Zero
Mixed-precision training	Distributed Training — From Absolute Zero
Linear algebra foundations	Linear Algebra Workbook