Linear Algebra for ML Workbook

Chapter 0: Matrix Operations

You're staring at a neural network layer: y = Wx + b. The weight matrix W is [512, 768], the input x is [768, 1]. What shape is y? What if you transpose W? What if you multiply two weight matrices together? Every forward pass is matrix multiplication. Every backward pass is matrix multiplication. If you can't do these operations by hand on small matrices, you can't debug them on large ones.

Matrix multiplication C = AB:
A is [m, n], B is [n, p] → C is [m, p]
C_ij = ∑_k=1..n A_ik · B_kj

Key properties:
Transpose: (AB)^T = B^TA^T // order reverses!
Trace: tr(A) = ∑ A_ii // sum of diagonal
Determinant (2×2): det([[a,b],[c,d]]) = ad − bc

Inner dimensions must match. In AB, the number of columns of A must equal the number of rows of B. This is the single most common shape error in deep learning. If your shapes don't match, your matmul will crash. Outer dimensions give the result shape: [m, n] × [n, p] → [m, p].

Exercise 0.1: 2×2 Matrix Multiply Derive

Compute the (1,1) entry (top-left) of C = AB where A = [[1,2],[3,4]] and B = [[5,6],[7,8]].

C₁₁ = row 1 of A · col 1 of B = A₁₁·B₁₁ + A₁₂·B₂₁.

C₁₁

Show derivation

C₁₁ = 1×5 + 2×7 = 5 + 14 = 19

Row 1 of A dotted with column 1 of B. The full product is C = [[19, 22], [43, 50]].

Exercise 0.2: Bottom-Right Entry Derive

Now compute C₂₂ (bottom-right) of the same product: A = [[1,2],[3,4]], B = [[5,6],[7,8]].

C₂₂

Show derivation

C₂₂ = 3×6 + 4×8 = 18 + 32 = 50

Exercise 0.3: Transpose Trace

If A = [[1,2,3],[4,5,6]], what is the shape of A^T and what is the entry at row 3, column 1 of A^T?

Shape [2,3], entry is 4 Shape [3,2], entry is 6 Shape [3,2], entry is 3 Shape [3,2], entry is 2

Show reasoning

Transpose swaps rows and columns. A is [2,3] so A^T is [3,2]. A^T = [[1,4],[2,5],[3,6]]. Row 3, column 1 of A^T = 3 (which was row 1, column 3 of the original A).

Exercise 0.4: Trace of a Matrix Derive

Compute the trace of M = [[4, 1, 7], [2, 5, 3], [6, 8, 9]].

tr(M)

Show derivation

tr(M) = M₁₁ + M₂₂ + M₃₃ = 4 + 5 + 9 = 18

The trace is the sum of diagonal entries. Fun fact: the trace equals the sum of eigenvalues, and the determinant equals the product of eigenvalues. You'll verify this in Chapter 2.

Exercise 0.5: 2×2 Determinant Derive

Compute det(A) where A = [[3, 8], [4, 6]].

det(A)

Show derivation

det([[a,b],[c,d]]) = ad − bc = 3×6 − 8×4 = 18 − 32 = −14

A negative determinant means the transformation flips orientation. |det| = 14 is the area scaling factor. If det = 0, the matrix is singular — no inverse, no unique solutions.

Exercise 0.6: 3×3 Determinant Derive

Compute det(B) where B = [[1, 2, 3], [0, 4, 5], [1, 0, 6]] using cofactor expansion along the first row.

det = 1·det([[4,5],[0,6]]) − 2·det([[0,5],[1,6]]) + 3·det([[0,4],[1,0]])

det(B)

Show derivation

det([[4,5],[0,6]]) = 4×6 − 5×0 = 24

det([[0,5],[1,6]]) = 0×6 − 5×1 = −5

det([[0,4],[1,0]]) = 0×0 − 4×1 = −4

det(B) = 1×24 − 2×(−5) + 3×(−4) = 24 + 10 − 12 = 22

Exercise 0.7: Implement matmul() Build

Write a function that multiplies two 2D arrays (matrices). A is [m,n], B is [n,p]. Return the [m,p] result as a nested array.

Return a nested array, e.g. [[19,22],[43,50]].

Show solution

javascript
function matmul(A, B) {
  const m = A.length, n = A[0].length, p = B[0].length;
  const C = Array.from({length: m}, () => Array(p).fill(0));
  for (let i = 0; i < m; i++)
    for (let j = 0; j < p; j++)
      for (let k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
  return C;
}

The triple loop is O(m × n × p). For square n×n matrices this is O(n³). Strassen's algorithm reduces this to O(n^2.807), and the current best known is O(n^2.371).

Chapter 1: Linear Systems

Training a linear regression model boils down to solving Ax = b. Gradient descent is one way, but the direct solution requires Gaussian elimination — converting the system to upper triangular form, then back-substituting. Understanding when this fails tells you when your data is redundant or your model is under-determined.

Gaussian Elimination:
1. Write the augmented matrix [A | b]
2. Use row operations to reach upper triangular form
3. Back-substitute from the bottom row up

Row operations (all preserve the solution set):
R_i ← R_i + c · R_j    // add a scaled row
R_i ← c · R_i (c ≠ 0)     // scale a row
R_i ↔ R_j                // swap two rows

When does Ax = b have no solution? When row reduction produces a row like [0 0 0 | 5] — "0 = 5" is a contradiction. This happens when b is not in the column space of A. In ML terms: your features can't explain your target. When the system has infinitely many solutions, A is rank-deficient — your features are redundant.

Exercise 1.1: 2×2 System Derive

Solve by Gaussian elimination: 2x + 3y = 8, 4x + y = 10. Find x.

Augmented: [[2, 3 | 8], [4, 1 | 10]]. R2 ← R2 − 2·R1.

x =

Show derivation

[2 3 | 8 ]

[4 1 | 10] R2 ← R2 − 2R1 → [0 −5 | −6]

−5y = −6 → y = 6/5 = 1.2

2x + 3(1.2) = 8 → 2x = 4.4 → x = 2.2

Exercise 1.2: 3×3 System Derive

Solve:
x + y + z = 6
2x − y + z = 3
x + y − 2z = −3
Find z.

Augmented matrix. Eliminate x from rows 2 and 3, then eliminate y from row 3.

z =

Show derivation

[1 1 1 | 6]

[2 −1 1 | 3] R2 ← R2 − 2R1 → [0 −3 −1 | −9]

[1 1 −2 | −3] R3 ← R3 − R1 → [0 0 −3 | −9]

−3z = −9 → z = 3

−3y − 1(3) = −9 → y = 2

x + 2 + 3 = 6 → x = 1

Solution: (x, y, z) = (1, 2, 3). Back-substitution starts from the bottom row (simplest equation) and works up, plugging known values into earlier equations.

Exercise 1.3: Rank Deficiency Trace

Consider: x + y = 3, 2x + 2y = 6, x − y = 1. What happens when you row-reduce?

No solution — the system is inconsistent Row 2 becomes [0 0 | 0] — it's redundant. Unique solution from rows 1 and 3 Infinitely many solutions — the system is underdetermined The determinant is undefined for non-square systems

Show reasoning

Row 2 is exactly 2 × Row 1 — it carries no new information. R2 ← R2 − 2R1 gives [0 0 | 0]. The system reduces to x + y = 3, x − y = 1, which has the unique solution x = 2, y = 1. The "extra" equation was linearly dependent — its rank contribution was zero. In ML, this is like having a duplicate feature.

Exercise 1.4: 2×2 Inverse Derive

Compute the inverse of A = [[2, 1], [5, 3]]. What is the (1,1) entry of A⁻¹?

For 2×2: A⁻¹ = (1/det) · [[d, −b], [−c, a]] where A = [[a,b],[c,d]].

(A⁻¹)₁₁

Show derivation

det(A) = 2×3 − 1×5 = 1

A⁻¹ = (1/1) · [[3, −1], [−5, 2]] = [[3, −1], [−5, 2]]

Verify: A · A⁻¹ = [[2×3+1×(−5), 2×(−1)+1×2],[5×3+3×(−5), 5×(−1)+3×2]] = [[1,0],[0,1]]. The fact that det = 1 is special — integer matrices with det = ±1 have integer inverses.

Exercise 1.5: Find the Bug Debug

This function solves a 2×2 system Ax = b using the inverse formula. It returns the wrong answer. Click the buggy line.

function solve2x2(A, b) {
  const det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
  if (det === 0) return null;
  const invDet = 1 / det;
  const x0 = invDet * (A[1][1]*b[0] + A[0][1]*b[1]);
  const x1 = invDet * (-A[1][0]*b[0] + A[0][0]*b[1]);
  return [x0, x1];
}

Show explanation

Line 5 is the bug. The inverse formula for x0 should use A[1][1]*b[0] − A[0][1]*b[1], not +. Recall that A⁻¹ = (1/det)[[d, −b],[−c, a]]. The off-diagonal entries are negated. With the + sign, you're multiplying by [[d, b],[−c, a]] instead — wrong sign on the (1,2) entry.

Exercise 1.6: When is a Matrix Singular? Trace

Which of these 2×2 matrices is singular (no inverse)?

[[1, 2], [3, 4]] — det = 4−6 = −2 [[5, 3], [1, 2]] — det = 10−3 = 7 [[2, 4], [3, 6]] — det = 12−12 = 0 [[0, 1], [1, 0]] — det = 0−1 = −1

Show reasoning

[[2,4],[3,6]] has det = 0. Notice that row 2 = 1.5 × row 1 — the rows are linearly dependent. The matrix maps all of 2D space onto a line. You can't invert that — information is destroyed. In ML, this is like having one feature that's a perfect linear function of another.

Chapter 2: Eigenvalues & Eigenvectors

You're running PCA and your library returns "eigenvalues" and "eigenvectors." What are they? An eigenvector v of matrix A is a direction that A stretches without rotating: Av = λv. The scalar λ is how much it stretches. PCA finds the directions of maximum variance — those are the eigenvectors of the covariance matrix, and the eigenvalues are the variances along those directions.

Finding eigenvalues:
Av = λv → (A − λI)v = 0
Non-trivial solution exists when det(A − λI) = 0

For 2×2 A = [[a,b],[c,d]]:
det(A − λI) = (a−λ)(d−λ) − bc = 0
λ² − (a+d)λ + (ad−bc) = 0
λ² − tr(A)·λ + det(A) = 0

The characteristic equation. For a 2×2 matrix, the eigenvalues satisfy λ² − tr(A)λ + det(A) = 0. So the sum of eigenvalues equals the trace, and their product equals the determinant. This is a powerful sanity check — if your eigenvalues don't satisfy these relations, you made an arithmetic error.

Exercise 2.1: Eigenvalues of [[4,1],[2,3]] Derive

Find the eigenvalues. Set up det(A − λI) = 0 and solve the quadratic.

tr(A) = 7, det(A) = 10. So λ² − 7λ + 10 = 0.

larger λ

Show derivation

det(A − λI) = (4−λ)(3−λ) − 1×2 = λ² − 7λ + 10 = 0

(λ − 5)(λ − 2) = 0 → λ₁ = 5, λ₂ = 2

Check: λ₁ + λ₂ = 7 = tr(A). λ₁ × λ₂ = 10 = det(A). Both match.

Exercise 2.2: Eigenvector for λ=5 Derive

For A = [[4,1],[2,3]] with λ = 5, find the eigenvector by solving (A − 5I)v = 0. Give the ratio v₂/v₁.

v₂/v₁

Show derivation

A − 5I = [[−1, 1], [2, −2]]

Row 1: −v₁ + v₂ = 0 → v₂ = v₁

So v = [1, 1] (or any scalar multiple). The eigenvector direction is the 45-degree line. A stretches vectors along this line by a factor of 5.

Exercise 2.3: Verify Av = λv Derive

Compute A · [1, 1]^T where A = [[4,1],[2,3]]. What is the first component of the result?

(Av)₁

Show derivation

Av = [[4,1],[2,3]] · [1,1]^T = [4+1, 2+3] = [5, 5]

λv = 5 · [1, 1] = [5, 5] ✓

Av = λv confirmed. The matrix A applied to [1,1] produces 5×[1,1] — pure scaling, no rotation. That's exactly what "eigenvector" means.

Exercise 2.4: Eigenvector for λ=2 Derive

Same A = [[4,1],[2,3]], λ = 2. Find v₂/v₁ for this eigenvector.

v₂/v₁

Show derivation

A − 2I = [[2, 1], [2, 1]]

Row 1: 2v₁ + v₂ = 0 → v₂ = −2v₁

So v = [1, −2] (or any scalar multiple). This eigenvector points in a different direction from the λ=5 eigenvector. Together, these two eigenvectors form a basis for R² — any vector can be decomposed into these two directions, each scaled by its eigenvalue.

Exercise 2.5: Implement eigenvalues2x2() Build

Write a function that returns the two eigenvalues of a 2×2 matrix as [larger, smaller].

Use the quadratic formula on the characteristic equation.

Show solution

javascript
function eigenvalues2x2(A) {
  const tr = A[0][0] + A[1][1];
  const det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
  const disc = Math.sqrt(tr*tr - 4*det);
  return [(tr + disc)/2, (tr - disc)/2];
}

Exercise 2.6: Symmetric Matrices Trace

Which property is guaranteed for eigenvalues of a real symmetric matrix (A = A^T)?

All eigenvalues are positive All eigenvalues are real (not complex), and eigenvectors are orthogonal All eigenvalues are integers The matrix is always invertible

Show reasoning

The Spectral Theorem guarantees that real symmetric matrices have: (1) all real eigenvalues, and (2) orthogonal eigenvectors. This is why covariance matrices (which are symmetric) always have real eigenvalues and orthogonal principal components. The eigenvalues can be negative (just not for PSD matrices) and the matrix can be singular (if an eigenvalue is 0).

Chapter 3: SVD

The Singular Value Decomposition is the Swiss army knife of linear algebra. It works for any matrix — rectangular, singular, whatever. A = UΣV^T, where U and V are orthogonal and Σ is diagonal with non-negative entries. The singular values tell you how much the matrix stretches space in each direction. Low-rank approximation via SVD is the foundation of PCA, latent semantic analysis, and modern recommender systems.

SVD: A = UΣV^T
A: [m, n], U: [m, m] orthogonal, Σ: [m, n] diagonal, V: [n, n] orthogonal

Singular values from eigenvalues:
σ_i = √(λ_i(A^TA))

Low-rank approximation (rank k):
A_k = U_kΣ_kV_k^T // keep top k singular values only
Error = σ_k+1 // Eckart-Young theorem: this is optimal!

Eigenvalues vs. singular values. Eigenvalues can be negative or complex. Singular values are always real and non-negative. For a symmetric positive semidefinite matrix, they're the same. For a general matrix, singular values are the square roots of the eigenvalues of A^TA. This is why SVD always works, even when eigendecomposition doesn't.

Exercise 3.1: Trivial SVD Derive

A = [[3, 0], [0, 2]]. This is already diagonal. What are the singular values σ₁ and σ₂?

σ₁ (larger)

Show derivation

A^TA = [[9, 0], [0, 4]]

Eigenvalues of A^TA: 9, 4

σ₁ = √9 = 3, σ₂ = √4 = 2

For a diagonal matrix, the singular values are just the absolute values of the diagonal entries. U = V = I (identity), Σ = A itself.

Exercise 3.2: SVD of [[2,1],[1,2]] Derive

A = [[2,1],[1,2]]. Compute A^TA, find its eigenvalues, then take square roots. What is σ₁?

A^TA = A² (since A is symmetric). Eigenvalues of A are 3 and 1 (from Ch 2). So eigenvalues of A² are...

σ₁

Show derivation

A^TA = [[2,1],[1,2]] × [[2,1],[1,2]] = [[5, 4], [4, 5]]

Eigenvalues of A^TA: λ² − 10λ + 9 = 0 → λ = 9, 1

σ₁ = √9 = 3, σ₂ = √1 = 1

Since A is symmetric with eigenvalues 3 and 1, the singular values are |3| = 3 and |1| = 1. For symmetric PSD matrices, singular values equal eigenvalues — the two decompositions coincide.

Exercise 3.3: Low-Rank Approximation Derive

A 1000×1000 matrix has singular values [100, 50, 10, 1, 0.5, 0.1, ...]. If you keep only the top 3 singular values (rank-3 approximation), how many numbers do you need to store instead of 10⁶?

Rank-k storage: k columns of U [1000 × k] + k singular values + k columns of V [1000 × k].

numbers stored

Show derivation

U₃: 1000 × 3 = 3000 numbers

Σ₃: 3 numbers

V₃: 1000 × 3 = 3000 numbers

Total = 3000 + 3 + 3000 = 6003

That's 6003 vs 1,000,000 — a 166× compression. And by the Eckart-Young theorem, this is the best possible rank-3 approximation in both Frobenius and spectral norm. The approximation error is σ₄ = 1 (in spectral norm).

Exercise 3.4: Compression Ratio Trace

An image is 512×512 pixels. Its SVD has singular values that drop below 1% of σ₁ after rank 20. Approximately what compression ratio does a rank-20 SVD approximation give?

~2:1 (barely compressed) ~13:1 (20,500 vs 262,144 numbers) ~26:1 (10,240 vs 262,144 numbers) ~512:1 (only 512 numbers needed)

Show reasoning

Rank-20 needs 512×20 + 20 + 512×20 = 20,500 numbers vs the original 512×512 = 262,144. Ratio: 262,144/20,500 ≈ 12.8, so roughly 13:1. This is the principle behind truncated SVD for image compression and dimensionality reduction.

Exercise 3.5: Matrix Rank from SVD Trace

A 5×5 matrix has singular values [10, 5, 2, 0, 0]. What is the rank of this matrix?

Rank 5 Rank 2 Rank 3 — the number of non-zero singular values Rank 17 — the sum of singular values

Show reasoning

The rank of a matrix equals the number of non-zero singular values. Three non-zero singular values (10, 5, 2) means rank 3. The matrix maps 5D space into a 3D subspace — two dimensions of information are lost. In ML, this means only 3 of the 5 features are linearly independent.

Chapter 4: PCA

You have a dataset with 50 features, but you suspect most of the variance lives in a much lower-dimensional subspace. PCA finds that subspace. The recipe: center the data, compute the covariance matrix, eigendecompose it, project onto the top eigenvectors. Each step has a concrete computation you should be able to do by hand.

PCA algorithm:
1. Center: X̅ = X − μ // subtract column means
2. Covariance: C = (1/(n−1)) X̅^TX̅ // [d, d] matrix
3. Eigendecompose: C = QΛQ^T
4. Project: Z = X̅ · Q_k // keep top k eigenvectors

Variance explained by top k:
∑_i=1..k λ_i / ∑_i=1..d λ_i

PCA = eigendecomposition of the covariance matrix. The eigenvectors are the principal directions (axes of maximum variance). The eigenvalues are the variances along those directions. Projecting onto the top k eigenvectors keeps the k most informative dimensions and discards the rest. This is optimal in the least-squares sense — no other k-dimensional subspace captures more variance.

Exercise 4.1: Center the Data Derive

Three data points in 2D: (1, 4), (3, 2), (5, 6). Compute the mean of each feature, then the centered y-coordinate of the first point.

μ_x = (1+3+5)/3, μ_y = (4+2+6)/3. Centered first point y = 4 − μ_y.

centered y₁

Show derivation

μ_x = (1+3+5)/3 = 3, μ_y = (4+2+6)/3 = 4

Centered: (−2, 0), (0, −2), (2, 2)

First point centered y = 4 − 4 = 0

Exercise 4.2: Covariance Matrix Derive

Using the centered data from 4.1: (−2, 0), (0, −2), (2, 2). Compute C = (1/2)X̅^TX̅ (using n−1=2). What is C₁₁ (variance of feature 1)?

C₁₁

Show derivation

X̅^TX̅ = [[-2,0,2],[0,-2,2]] × [[-2,0],[0,-2],[2,2]] = [[8, 4], [4, 8]]

C = (1/2) × [[8,4],[4,8]] = [[4, 2], [2, 4]]

C₁₁ = 4 is the variance of feature 1. C₁₂ = C₂₁ = 2 is the covariance between features — they're positively correlated. The covariance matrix is always symmetric.

Exercise 4.3: Eigendecompose the Covariance Derive

C = [[4, 2], [2, 4]]. Find the eigenvalues. What is the larger eigenvalue?

tr(C) = 8, det(C) = 12. Quadratic: λ² − 8λ + 12 = 0.

λ₁

Show derivation

λ² − 8λ + 12 = 0 → (λ−6)(λ−2) = 0

λ₁ = 6, λ₂ = 2

The first principal component captures variance 6, the second captures 2. The eigenvector for λ=6 is [1,1]/√2 (the 45-degree direction). Data varies most along this diagonal — projecting onto it keeps 75% of the total variance.

Exercise 4.4: Variance Explained Derive

With eigenvalues λ₁=6, λ₂=2, what fraction of total variance is explained by the first principal component?

fraction (0 to 1)

Show derivation

Variance explained = λ₁ / (λ₁ + λ₂) = 6 / 8 = 0.75

75% of the variance is captured by a single direction. In practice, you pick k such that the cumulative explained variance exceeds a threshold (commonly 95%). With just one component here, you'd lose 25% of the information.

Exercise 4.5: PCA vs. Feature Selection Trace

Your dataset has 100 features. Feature 1 has variance 50, feature 2 has variance 30, all others have variance ~1. The covariance between features 1 and 2 is 35. If you do PCA and keep 1 component, does it simply select feature 1 (highest individual variance)?

Yes — PCA always selects the highest-variance feature No — PC1 is a linear combination of features 1 and 2, capturing their shared variance too No — PCA averages all 100 features equally

Show reasoning

PCA finds the direction of maximum variance, which is generally a linear combination of the original features. With covariance 35 between features 1 and 2, a rotated axis capturing both features jointly explains more variance than either feature alone. Feature selection picks individual features; PCA finds optimal linear combinations. They're fundamentally different approaches.

Chapter 5: Matrix Calculus

Every time you call loss.backward() in PyTorch, the framework is computing matrix derivatives. The gradient of the loss with respect to the weights is a matrix derivative. The gradient of the loss with respect to the input is another one. If you want to understand why gradient descent works, why certain architectures train better than others, or what happens when gradients vanish or explode, you need matrix calculus.

The three most important matrix derivatives:

1. ∂(Wx)/∂x = W    // linear layer forward gradient

2. ∂(x^TAx)/∂x = (A + A^T)x    // quadratic form gradient
   = 2Ax when A is symmetric

3. ∂||Ax − b||²/∂x = 2A^T(Ax − b)    // least squares gradient

Rule 3 is the gradient of every linear regression. The loss is ||Ax − b||² (sum of squared residuals). The gradient 2A^T(Ax − b) tells you exactly which direction to move x to reduce the loss. Setting this to zero gives the normal equations: A^TAx = A^Tb. That's Chapter 6.

Exercise 5.1: Linear Layer Gradient Derive

W = [[2, 1], [3, 4]], x = [1, 2]. Compute y = Wx, then ∂y/∂x. What is (∂y/∂x)₁₁?

(∂y/∂x)₁₁

Show derivation

y = Wx = [[2,1],[3,4]] × [1,2]^T = [4, 11]

∂y/∂x = W = [[2, 1], [3, 4]]

(∂y/∂x)₁₁ = W₁₁ = 2

The Jacobian of a linear transformation y = Wx is simply the matrix W itself. This makes sense: y₁ = 2x₁ + x₂, so ∂y₁/∂x₁ = 2. This is why backpropagation through a linear layer just multiplies by W^T.

Exercise 5.2: Quadratic Form Value Derive

Compute x^TAx for A = [[2, 1], [1, 3]] and x = [1, 2].

First compute Ax, then dot with x.

x^TAx

Show derivation

Ax = [[2,1],[1,3]] × [1,2]^T = [4, 7]

x^TAx = [1, 2] · [4, 7] = 4 + 14 = 18

Quadratic forms are scalar-valued: they map a vector to a single number. The loss function in ridge regression is x^TAx + bx, a quadratic form plus a linear term.

Exercise 5.3: Quadratic Form Gradient Derive

Compute ∂(x^TAx)/∂x = (A + A^T)x for A = [[2, 1], [1, 3]], x = [1, 2]. What is the first component?

gradient[0]

Show derivation

A + A^T = [[2,1],[1,3]] + [[2,1],[1,3]] = [[4, 2], [2, 6]]

(A + A^T)x = [[4,2],[2,6]] × [1,2]^T = [8, 14]

Since A is symmetric (A = A^T), A + A^T = 2A. So the gradient simplifies to 2Ax = 2[4, 7] = [8, 14]. Symmetric A is the common case in ML (covariance matrices, Hessians).

Exercise 5.4: Least Squares Gradient Derive

A = [[1, 0], [0, 1], [1, 1]], x = [2, 3], b = [2, 4, 6]. Compute ∂||Ax − b||²/∂x = 2A^T(Ax − b). What is the first component?

gradient[0]

Show derivation

Ax = [[1,0],[0,1],[1,1]] × [2,3]^T = [2, 3, 5]

Ax − b = [2−2, 3−4, 5−6] = [0, −1, −1]

A^T(Ax − b) = [[1,0,1],[0,1,1]] × [0, −1, −1]^T = [−1, −2]

2A^T(Ax − b) = [−2, −4]

Negative gradient means x should increase to reduce the loss. The residual Ax − b = [0, −1, −1] tells us the predictions are too low for data points 2 and 3. Gradient descent would step x in the direction [2, 4] (negating the gradient).

Exercise 5.5: Chain Rule Trace

In a 3-layer neural network y = W₃σ(W₂σ(W₁x)), the gradient of the loss L with respect to W₁ involves multiplying how many Jacobians together (via chain rule)?

1 — just ∂L/∂W₁ directly 3 — one per layer 5 — ∂L/∂y · ∂y/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂z₁ · ∂z₁/∂W₁ 6 — two per layer (linear + activation)

Show reasoning

The chain unfolds as: ∂L/∂W₁ = ∂L/∂y · ∂y/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂z₁ · ∂z₁/∂W₁. That's 5 Jacobian matrices multiplied together. Each layer contributes two factors: one for the linear part (W), one for the activation (σ). The loss contributes one more. This is why deep networks are prone to vanishing/exploding gradients — you're multiplying many matrices together.

Exercise 5.6: Find the Bug Debug

This computes the gradient of ||Ax - b||² with respect to x. It returns the wrong gradient. Click the buggy line.

function lsGrad(A, x, b) {
  // Compute Ax
  const Ax = matmul(A, x);
  // Compute residual
  const r = subtract(Ax, b);
  // Gradient = 2 * A * (Ax - b)
  const grad = scale(matmul(A, r), 2);
  return grad;
}

Show explanation

Line 7 is the bug. The gradient is 2A^T(Ax − b), not 2A(Ax − b). The code uses matmul(A, r) instead of matmul(transpose(A), r). Missing the transpose is one of the most common bugs in implementing gradient descent for linear models. Dimensionally: if A is [m,n] and r is [m,1], then A·r would be an invalid shape — only A^T·r = [n,m]×[m,1] = [n,1] has the right shape to update x.

Chapter 6: Least Squares

You have 100 data points and want to fit a line through them. The classical solution: set the gradient from Chapter 5 to zero. That gives the normal equations: A^TAx = A^Tb. If A^TA is invertible, x = (A^TA)⁻¹A^Tb. This is the closed-form solution to linear regression — no gradient descent needed.

Normal equations:
A^TAx = A^Tb
x = (A^TA)⁻¹A^Tb

Pseudoinverse via SVD:
A⁺ = VΣ⁺U^T // where Σ⁺ inverts non-zero singular values
x = A⁺b // works even when A^TA is singular!

When is A^TA invertible?
When A has full column rank (all columns linearly independent).
n ≥ d and rank(A) = d.

A^TA is invertible iff no features are redundant. If one feature is a linear combination of others, A^TA becomes singular and the normal equations have infinitely many solutions. This is multicollinearity. Fix: add regularization (A^TA + λI)x = A^Tb — that's ridge regression, and the λI term makes the matrix invertible.

Exercise 6.1: 3-Point Linear Regression Derive

Fit y = ax + b to points (1, 2), (2, 3), (3, 5). Set up A = [[1,1],[2,1],[3,1]] and b = [2,3,5]. Compute A^TA. What is (A^TA)₁₁?

(A^TA)₁₁

Show derivation

A^TA = [[1,2,3],[1,1,1]] × [[1,1],[2,1],[3,1]] = [[14, 6], [6, 3]]

(A^TA)₁₁ = 1² + 2² + 3² = 14. This is the sum of squares of the x-coordinates. The full A^TA is a 2×2 symmetric matrix encoding the structure of the feature data.

Exercise 6.2: Solve the Normal Equations Derive

A^TA = [[14,6],[6,3]], A^Tb = [23, 10]. Solve for x = [a, b] (the slope and intercept). What is the slope a?

det(A^TA) = 14×3 − 6×6 = 6. Use the 2×2 inverse formula.

a (slope)

Show derivation

(A^TA)⁻¹ = (1/6)[[3, −6], [−6, 14]] = [[0.5, −1], [−1, 2.33]]

x = [[0.5,−1],[−1,2.33]] × [23, 10]^T

a = 0.5×23 + (−1)×10 = 11.5 − 10 = 1.5

b = −1×23 + 2.33×10 = −23 + 23.33 = 0.33

The best-fit line is y = 1.5x + 0.33. At x=1: y=1.83 (error 0.17), x=2: y=3.33 (error 0.33), x=3: y=4.83 (error −0.17). The residuals sum to approximately zero, as they must for any least squares solution with an intercept term.

Exercise 6.3: Residual Sum of Squares Derive

Using y = 1.5x + 1/3, compute the sum of squared residuals for the 3 points (1,2), (2,3), (3,5). Round to 2 decimal places.

RSS

Show derivation

Predictions: 1.5(1)+1/3 = 11/6, 1.5(2)+1/3 = 10/3, 1.5(3)+1/3 = 29/6

Residuals: 2 − 11/6 = 1/6, 3 − 10/3 = −1/3, 5 − 29/6 = 1/6

RSS = (1/6)² + (1/3)² + (1/6)² = 1/36 + 1/9 + 1/36 = 1/6 ≈ 0.167

Exercise 6.4: When A^TA is Singular Trace

You have features x₁ (temperature in Celsius) and x₂ (temperature in Fahrenheit, = 1.8x₁ + 32). What happens to the normal equations?

Everything works fine — more features is always better The model trains slower but converges to the same answer A^TA is singular — column 2 is a linear function of column 1, so rank(A) < d The features cancel out and the model predicts zero

Show reasoning

Since x₂ = 1.8x₁ + 32, column 2 of A is a linear combination of column 1 and the intercept column. A^TA is singular (det = 0), and the inverse doesn't exist. There are infinitely many solutions: you can trade off weight on x₁ vs x₂ arbitrarily. The SVD pseudoinverse gives the minimum-norm solution, and ridge regression picks a unique solution by penalizing large weights.

Exercise 6.5: Implement normalEquation() Build

Write a function that solves Ax = b by the normal equations for a 2-feature problem. A is [n, 2], b is [n]. Return [x1, x2].

Compute A^TA (2x2), A^Tb (2x1), then use the 2x2 inverse formula.

Show solution

javascript
function normalEquation(A, b) {
  const n = A.length;
  // A^T A (2x2)
  let a00=0,a01=0,a11=0, b0=0,b1=0;
  for (let i=0; i<n; i++) {
    a00 += A[i][0]*A[i][0];
    a01 += A[i][0]*A[i][1];
    a11 += A[i][1]*A[i][1];
    b0  += A[i][0]*b[i];
    b1  += A[i][1]*b[i];
  }
  const det = a00*a11 - a01*a01;
  return [(a11*b0 - a01*b1)/det, (-a01*b0 + a00*b1)/det];
}

Chapter 7: Norms & Distances

Every loss function in ML involves a norm. L2 loss is the squared L2 norm of the residual. L1 regularization uses the L1 norm of the weights. Gradient clipping uses the L2 norm of the gradient. Batch normalization divides by a norm. You need to compute these fluently and understand their geometric differences — they impose fundamentally different shapes on the "unit ball" and therefore promote different solutions.

Vector norms of v = [v₁, ..., v_n]:
L1: ||v||₁ = ∑ |v_i|    // sum of absolute values
L2: ||v||₂ = √(∑ v_i²)    // Euclidean length
L∞: ||v||_∞ = max |v_i|    // largest absolute entry

Matrix norms:
Frobenius: ||A||_F = √(∑_ij A_ij²)    // like L2 for matrices
Nuclear: ||A||_* = ∑ σ_i    // sum of singular values
Spectral: ||A||₂ = σ_max    // largest singular value

Why L1 promotes sparsity. The L1 unit ball has corners on the axes. When you minimize a loss + L1 penalty, the optimum tends to land on a corner — where some coordinates are exactly zero. The L2 ball is smooth, so the optimum slides to a point where all coordinates are small but non-zero. L1 gives you feature selection for free.

Exercise 7.1: L1 Norm Derive

Compute ||v||₁ for v = [3, −4, 0, 5].

||v||₁

Show derivation

||v||₁ = |3| + |−4| + |0| + |5| = 3 + 4 + 0 + 5 = 12

Exercise 7.2: L2 Norm Derive

Compute ||v||₂ for v = [3, −4, 0, 5]. Give the answer to 2 decimal places.

||v||₂

Show derivation

||v||₂ = √(9 + 16 + 0 + 25) = √50 = 5√2 ≈ 7.07

Exercise 7.3: L-infinity Norm Derive

Compute ||v||_∞ for v = [3, −4, 0, 5].

||v||_∞

Show derivation

||v||_∞ = max(|3|, |−4|, |0|, |5|) = 5

The L∞ norm only cares about the single largest entry. It's used in adversarial robustness: an L∞ adversarial perturbation can change every pixel, as long as no single pixel changes by more than ε.

Exercise 7.4: Frobenius Norm Derive

Compute ||M||_F for M = [[1, 2], [3, 4]]. Give the answer to 2 decimal places.

||M||_F

Show derivation

||M||_F = √(1² + 2² + 3² + 4²) = √(1+4+9+16) = √30 ≈ 5.48

The Frobenius norm treats the matrix as a long vector and takes its L2 norm. It equals √(∑ σ_i²) — the square root of the sum of squared singular values. Weight decay in neural networks penalizes the squared Frobenius norm of the weight matrices.

Exercise 7.5: Nuclear Norm Derive

M = [[1, 2], [3, 4]] has singular values σ₁ ≈ 5.465 and σ₂ ≈ 0.366. Compute the nuclear norm ||M||_*.

||M||_*

Show derivation

||M||_* = σ₁ + σ₂ = 5.465 + 0.366 = 5.831

The nuclear norm is the "L1 norm of the singular values." Just as L1 on vectors promotes sparsity, the nuclear norm on matrices promotes low rank. Minimizing ||W||_* pushes small singular values to exactly zero, effectively reducing the matrix rank. Used in matrix completion (recommender systems) and low-rank matrix recovery.

Exercise 7.6: Norm Comparison Trace

For any vector v, which ordering always holds?

||v||₁ ≤ ||v||₂ ≤ ||v||_∞ ||v||_∞ ≤ ||v||₂ ≤ ||v||₁ ||v||₂ ≤ ||v||₁ ≤ ||v||_∞ No universal ordering — it depends on the vector

Show reasoning

Always: ||v||_∞ ≤ ||v||₂ ≤ ||v||₁. The max absolute entry is at most the Euclidean length (since the other terms are non-negative), and the Euclidean length is at most the sum of absolute values (by Cauchy-Schwarz). Verify with v = [3, −4, 0, 5]: ||v||_∞ = 5 ≤ ||v||₂ = 7.07 ≤ ||v||₁ = 12.

Chapter 8: Positive Definiteness

A matrix is positive definite (PD) if x^TAx > 0 for all non-zero x. Positive semidefinite (PSD) allows x^TAx ≥ 0. Why should you care? Covariance matrices are always PSD. The Hessian of a convex loss function is PSD at the minimum. If a matrix is PD, it has a unique Cholesky factorization A = LL^T — this is the fastest way to solve Ax = b and is numerically stable.

Tests for positive definiteness:
1. All eigenvalues > 0 // definitive test
2. All leading principal minors > 0 // Sylvester's criterion
3. Cholesky factorization A = LL^T exists // constructive test

Cholesky factorization A = LL^T:
L₁₁ = √A₁₁
L₂₁ = A₂₁ / L₁₁
L₂₂ = √(A₂₂ − L₂₁²)

Why covariance matrices must be PSD. A covariance matrix C has entries C_ij = E[(X_i − μ_i)(X_j − μ_j)]. For any vector a, a^TCa = Var(a^TX) ≥ 0 — the variance of any linear combination of variables is non-negative. That's the definition of PSD. If a covariance matrix has a zero eigenvalue, it means one variable is an exact linear function of the others.

Exercise 8.1: Check Eigenvalues Derive

A = [[4, 2], [2, 3]]. Find the eigenvalues and determine if A is positive definite.

tr(A) = 7, det(A) = 8. Quadratic: λ² − 7λ + 8 = 0.

smaller λ

Show derivation

λ = (7 ± √(49 − 32)) / 2 = (7 ± √17) / 2

λ₁ = (7 + 4.123) / 2 ≈ 5.56

λ₂ = (7 − 4.123) / 2 ≈ 1.44

Both eigenvalues are positive, so A is positive definite. This means x^TAx > 0 for all non-zero x, and the quadratic form defines a bowl-shaped surface with a unique minimum at x = 0.

Exercise 8.2: Cholesky Factorization Derive

Compute the Cholesky factorization A = LL^T for A = [[4, 2], [2, 3]]. What is L₂₂?

L₂₂

Show derivation

L₁₁ = √A₁₁ = √4 = 2

L₂₁ = A₂₁ / L₁₁ = 2 / 2 = 1

L₂₂ = √(A₂₂ − L₂₁²) = √(3 − 1) = √2 ≈ 1.414

L = [[2, 0], [1, √2]]

Verify: LL^T = [[2,0],[1,√2]] × [[2,1],[0,√2]] = [[4,2],[2,3]] = A. Cholesky is 2× faster than LU decomposition for solving Ax = b when A is PD, and it's the standard method for sampling from multivariate Gaussians: x = L · z + μ where z ~ N(0, I).

Exercise 8.3: Not Positive Definite Trace

B = [[1, 2], [2, 1]]. Is B positive definite?

Yes — all diagonal entries are positive No — eigenvalues are 3 and −1, and −1 < 0 violates the PD condition No — the off-diagonal entries are too large It depends on the vector x

Show reasoning

tr(B) = 2, det(B) = 1 − 4 = −3. Eigenvalues: λ² − 2λ − 3 = 0 → λ = 3, −1. Since −1 < 0, B is not positive definite. In fact, take x = [1, −1]: x^TBx = 1 − 2 − 2 + 1 = −2 < 0. Positive diagonal entries are necessary but not sufficient for PD — the off-diagonal correlations can make the matrix indefinite.

Exercise 8.4: Making a Matrix PD Trace

You computed a covariance matrix from data but due to numerical errors, the smallest eigenvalue is −0.001. How do you fix this so downstream algorithms (Cholesky, Gaussian sampling) don't fail?

Clamp the negative eigenvalue to zero Delete the row and column corresponding to that eigenvalue Add εI to the matrix (jitter) — this shifts all eigenvalues up by ε Transpose the matrix

Show reasoning

Adding εI shifts every eigenvalue by +ε. If ε = 0.01, the problematic −0.001 becomes 0.009 > 0, making the matrix PD. This is called jitter or ridge regularization and is standard practice in Gaussian processes, numerical Cholesky, and any code that inverts near-singular matrices. In GPyTorch, JAX, and TensorFlow, you'll see + 1e-6 * I everywhere.

Exercise 8.5: Cholesky Solve Derive

To solve Ax = b where A = LL^T, you solve Ly = b (forward substitution) then L^Tx = y (back substitution). Given L = [[2, 0], [1, √2]] and b = [6, 5], find y₁.

y₁

Show derivation

Ly = b: [[2,0],[1,√2]] × [y₁,y₂]^T = [6, 5]

2y₁ = 6 → y₁ = 3

y₁ + √2 · y₂ = 5 → y₂ = 2/√2 = √2 ≈ 1.414

Forward substitution through L is O(n²), much cheaper than computing A⁻¹ which is O(n³). The Cholesky factorization itself is O(n³/3), but once you have L, each new right-hand side b costs only O(n²).

Chapter 9: Capstone — ML Pipeline

Time to put it all together. You have a real ML pipeline: raw data matrix X [1000 samples, 50 features]. You need to (1) compute the covariance matrix, (2) run PCA to reduce to 10 dimensions, (3) train linear regression on the reduced data. Each step has a concrete computational cost that you should be able to estimate. This is what ML infrastructure engineers do every day — estimate whether a computation will take milliseconds or hours.

Pipeline:
1. Center X → X̅    // n · d subtractions
2. Covariance: C = (1/(n−1)) X̅^TX̅    // matmul [d, n] × [n, d] = O(d²n)
3. Eigendecompose C → Q, Λ    // O(d³)
4. Project: Z = X̅ Q₁₀    // matmul [n, d] × [d, 10] = O(n · d · 10)
5. Train: (Z^TZ)⁻¹Z^Ty    // matmul + inversion of [10, 10]

The bottleneck is always the largest matrix multiply. For n=1000, d=50: covariance costs O(50²×1000) = 2.5M FLOPs, eigendecomposition costs O(50³) = 125K FLOPs. The covariance dominates. But if d=5000, eigendecomposition costs O(125B) and suddenly dominates everything. Knowing where the bottleneck is saves you from optimizing the wrong step.

Exercise 9.1: Covariance Matrix Cost Derive

X is [1000, 50]. Computing C = X̅^TX̅ is a matmul of [50, 1000] × [1000, 50]. How many multiply-add FLOPs? (Count each multiply-add pair as 2 FLOPs.)

million FLOPs

Show derivation

FLOPs = 2 × 50 × 1000 × 50 = 5,000,000 = 5M

For a matmul [m,k]×[k,n], FLOPs = 2mkn. Here m=n=50, k=1000. On a modern CPU doing ~10 GFLOPs, this takes about 0.5 ms. Trivial. But if d=50000 instead of 50, the cost jumps to 5 trillion FLOPs — 500 seconds on the same CPU.

Exercise 9.2: Eigendecomposition Cost Derive

Eigendecomposition of a d×d symmetric matrix costs approximately 4d³/3 FLOPs. For d=50, what is this?

million FLOPs

Show derivation

4 × 50³ / 3 = 4 × 125,000 / 3 = 166,667 ≈ 0.167M FLOPs

30× less than the covariance matrix computation. The eigendecomposition is cheap because the covariance matrix is only [50,50]. But the cost grows as d³ — at d=5000 it would be 167 billion FLOPs, 33,000× more expensive.

Exercise 9.3: Projection Cost Derive

Projecting X̅ [1000, 50] onto the top 10 eigenvectors Q₁₀ [50, 10] gives Z [1000, 10]. FLOPs?

million FLOPs

Show derivation

FLOPs = 2 × 1000 × 50 × 10 = 1,000,000 = 1M

Cheaper than the covariance computation because we only project onto 10 directions, not all 50. The output Z [1000, 10] has 80% fewer features, which makes subsequent training much faster.

Exercise 9.4: Training Cost Derive

Linear regression on Z [1000, 10]: normal equations require Z^TZ [10,1000]×[1000,10] and Z^Ty [10,1000]×[1000,1], plus inverting a 10×10 matrix. What are the total FLOPs for Z^TZ?

million FLOPs

Show derivation

FLOPs(Z^TZ) = 2 × 10 × 1000 × 10 = 200,000 = 0.2M

FLOPs(Z^Ty) = 2 × 10 × 1000 × 1 = 20,000 = 0.02M

FLOPs(invert 10×10) ≈ 2 × 10³ / 3 = 667

Training cost is dominated by Z^TZ at 0.2M FLOPs — 25× less than the covariance step. PCA reduced d from 50 to 10, which made training 25× cheaper (cost scales as d²).

Exercise 9.5: Total Pipeline FLOPs Derive

Sum the costs: covariance (5M) + eigendecomp (0.167M) + projection (1M) + training (0.22M) + centering (0.05M, i.e., n·d). Total pipeline FLOPs in millions?

million FLOPs

Show derivation

Total = 5.0 + 0.167 + 1.0 + 0.22 + 0.05 = 6.437M ≈ 6.44M FLOPs

The covariance computation is 78% of the total cost. This is the bottleneck. At a modern GPU's ~10 TFLOPS, the entire pipeline takes under a microsecond. But scale n to 1 million and d to 5000, and the covariance step alone costs 50 trillion FLOPs — 5 seconds on GPU. Now you see why efficient covariance estimation (randomized methods, streaming PCA) matters at scale.

Exercise 9.6: Pipeline Design Design

Arrange these PCA+regression pipeline steps in the correct order.

→

Project onto top-k

Center data

Normal equations

Covariance matrix

Eigendecompose

Show correct order

Center data → Covariance matrix → Eigendecompose → Project onto top-k → Normal equations

You must center before computing covariance (uncentered data gives a biased covariance). You must eigendecompose before projecting (you need the eigenvectors). And you train on the projected data, not the original.

Exercise 9.7: Scaling Analysis Trace

If you double both n (samples) and d (features), how does the total pipeline cost scale?

2× — linear in the total data size 4× — quadratic in a single dimension 8× — covariance is O(d²n), and both d and n doubled 16× — quadratic in both dimensions

Show reasoning

The dominant term is the covariance: O(d²n). Doubling both: (2d)² × 2n = 4d² × 2n = 8d²n. That's 8× the cost. The eigendecomposition is O(d³): (2d)³ = 8d³, also 8×. For large d, eigendecomposition eventually dominates (d³ vs d²n when d > n).

Exercise 9.8: Memory Budget Derive

For the pipeline with n=1000, d=50, k=10: what is the peak memory in bytes if everything is FP32 (4 bytes per number)? Consider: X [1000,50], C [50,50], Q [50,50], Z [1000,10], all in memory simultaneously.

Show derivation

X: 1000 × 50 = 50,000 numbers × 4 bytes = 200,000 bytes

C: 50 × 50 = 2,500 × 4 = 10,000 bytes

Q: 50 × 50 = 2,500 × 4 = 10,000 bytes

Z: 1000 × 10 = 10,000 × 4 = 40,000 bytes

Total = 260,000 bytes ≈ 253.9 KB

Tiny. But at n=1M and d=50K: X alone is 200 GB. The covariance matrix C [50K, 50K] is 10 GB. This is why large-scale PCA uses randomized algorithms that never materialize the full covariance matrix.

You made it through the full pipeline. From raw data to trained model, you can now estimate every intermediate shape, every operation count, every memory allocation. This is the working vocabulary of ML engineering — not the abstract theory, but the concrete arithmetic that determines whether your job fits on one GPU or needs a cluster.

For related material:

Topic	Resource
Transformer-specific math	Transformer Math Workbook
PCA intuition from zero	PCA — From Absolute Zero
Matrix calculus in backprop	Transformer — From Absolute Zero