Workbook — Linear Algebra for Machine Learning

Linear Algebra for ML

Every matrix operation, decomposition, and calculus trick that powers modern ML — derived by hand, verified in-browser, no handwaving.

Prerequisites: Basic algebra (solving equations) + Arithmetic (fractions, square roots). That's it.
10
Chapters
59
Exercises
5
Exercise Types
Mastery
0 / 59 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Matrix Operations

You're staring at a neural network layer: y = Wx + b. The weight matrix W is [512, 768], the input x is [768, 1]. What shape is y? What if you transpose W? What if you multiply two weight matrices together? Every forward pass is matrix multiplication. Every backward pass is matrix multiplication. If you can't do these operations by hand on small matrices, you can't debug them on large ones.

Matrix multiplication C = AB:
A is [m, n], B is [n, p] → C is [m, p]
Cij = ∑k=1..n Aik · Bkj

Key properties:
Transpose: (AB)T = BTAT   // order reverses!
Trace: tr(A) = ∑ Aii   // sum of diagonal
Determinant (2×2): det([[a,b],[c,d]]) = ad − bc
Inner dimensions must match. In AB, the number of columns of A must equal the number of rows of B. This is the single most common shape error in deep learning. If your shapes don't match, your matmul will crash. Outer dimensions give the result shape: [m, n] × [n, p] → [m, p].
Exercise 0.1: 2×2 Matrix Multiply Derive

Compute the (1,1) entry (top-left) of C = AB where A = [[1,2],[3,4]] and B = [[5,6],[7,8]].

C11 = row 1 of A · col 1 of B = A11·B11 + A12·B21.

C11
Show derivation
C11 = 1×5 + 2×7 = 5 + 14 = 19

Row 1 of A dotted with column 1 of B. The full product is C = [[19, 22], [43, 50]].

Exercise 0.2: Bottom-Right Entry Derive

Now compute C22 (bottom-right) of the same product: A = [[1,2],[3,4]], B = [[5,6],[7,8]].

C22
Show derivation
C22 = 3×6 + 4×8 = 18 + 32 = 50
Exercise 0.3: Transpose Trace
If A = [[1,2,3],[4,5,6]], what is the shape of AT and what is the entry at row 3, column 1 of AT?
Show reasoning

Transpose swaps rows and columns. A is [2,3] so AT is [3,2]. AT = [[1,4],[2,5],[3,6]]. Row 3, column 1 of AT = 3 (which was row 1, column 3 of the original A).

Exercise 0.4: Trace of a Matrix Derive

Compute the trace of M = [[4, 1, 7], [2, 5, 3], [6, 8, 9]].

tr(M)
Show derivation
tr(M) = M11 + M22 + M33 = 4 + 5 + 9 = 18

The trace is the sum of diagonal entries. Fun fact: the trace equals the sum of eigenvalues, and the determinant equals the product of eigenvalues. You'll verify this in Chapter 2.

Exercise 0.5: 2×2 Determinant Derive

Compute det(A) where A = [[3, 8], [4, 6]].

det(A)
Show derivation
det([[a,b],[c,d]]) = ad − bc = 3×6 − 8×4 = 18 − 32 = −14

A negative determinant means the transformation flips orientation. |det| = 14 is the area scaling factor. If det = 0, the matrix is singular — no inverse, no unique solutions.

Exercise 0.6: 3×3 Determinant Derive

Compute det(B) where B = [[1, 2, 3], [0, 4, 5], [1, 0, 6]] using cofactor expansion along the first row.

det = 1·det([[4,5],[0,6]]) − 2·det([[0,5],[1,6]]) + 3·det([[0,4],[1,0]])

det(B)
Show derivation
det([[4,5],[0,6]]) = 4×6 − 5×0 = 24
det([[0,5],[1,6]]) = 0×6 − 5×1 = −5
det([[0,4],[1,0]]) = 0×0 − 4×1 = −4
det(B) = 1×24 − 2×(−5) + 3×(−4) = 24 + 10 − 12 = 22
Exercise 0.7: Implement matmul() Build

Write a function that multiplies two 2D arrays (matrices). A is [m,n], B is [n,p]. Return the [m,p] result as a nested array.

Return a nested array, e.g. [[19,22],[43,50]].
Show solution
javascript
function matmul(A, B) {
  const m = A.length, n = A[0].length, p = B[0].length;
  const C = Array.from({length: m}, () => Array(p).fill(0));
  for (let i = 0; i < m; i++)
    for (let j = 0; j < p; j++)
      for (let k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
  return C;
}

The triple loop is O(m × n × p). For square n×n matrices this is O(n³). Strassen's algorithm reduces this to O(n2.807), and the current best known is O(n2.371).

Chapter 1: Linear Systems

Training a linear regression model boils down to solving Ax = b. Gradient descent is one way, but the direct solution requires Gaussian elimination — converting the system to upper triangular form, then back-substituting. Understanding when this fails tells you when your data is redundant or your model is under-determined.

Gaussian Elimination:
1. Write the augmented matrix [A | b]
2. Use row operations to reach upper triangular form
3. Back-substitute from the bottom row up

Row operations (all preserve the solution set):
Ri ← Ri + c · Rj    // add a scaled row
Ri ← c · Ri (c ≠ 0)     // scale a row
Ri ↔ Rj                // swap two rows
When does Ax = b have no solution? When row reduction produces a row like [0 0 0 | 5] — "0 = 5" is a contradiction. This happens when b is not in the column space of A. In ML terms: your features can't explain your target. When the system has infinitely many solutions, A is rank-deficient — your features are redundant.
Exercise 1.1: 2×2 System Derive

Solve by Gaussian elimination: 2x + 3y = 8, 4x + y = 10. Find x.

Augmented: [[2, 3 | 8], [4, 1 | 10]]. R2 ← R2 − 2·R1.

x =
Show derivation
[2 3 | 8 ]
[4 1 | 10] R2 ← R2 − 2R1 → [0 −5 | −6]
−5y = −6 → y = 6/5 = 1.2
2x + 3(1.2) = 8 → 2x = 4.4 → x = 2.2
Exercise 1.2: 3×3 System Derive

Solve:
x + y + z = 6
2x − y + z = 3
x + y − 2z = −3
Find z.

Augmented matrix. Eliminate x from rows 2 and 3, then eliminate y from row 3.

z =
Show derivation
[1 1 1 | 6]
[2 −1 1 | 3] R2 ← R2 − 2R1 → [0 −3 −1 | −9]
[1 1 −2 | −3] R3 ← R3 − R1 → [0 0 −3 | −9]
−3z = −9 → z = 3
−3y − 1(3) = −9 → y = 2
x + 2 + 3 = 6 → x = 1

Solution: (x, y, z) = (1, 2, 3). Back-substitution starts from the bottom row (simplest equation) and works up, plugging known values into earlier equations.

Exercise 1.3: Rank Deficiency Trace
Consider: x + y = 3, 2x + 2y = 6, x − y = 1. What happens when you row-reduce?
Show reasoning

Row 2 is exactly 2 × Row 1 — it carries no new information. R2 ← R2 − 2R1 gives [0 0 | 0]. The system reduces to x + y = 3, x − y = 1, which has the unique solution x = 2, y = 1. The "extra" equation was linearly dependent — its rank contribution was zero. In ML, this is like having a duplicate feature.

Exercise 1.4: 2×2 Inverse Derive

Compute the inverse of A = [[2, 1], [5, 3]]. What is the (1,1) entry of A−1?

For 2×2: A−1 = (1/det) · [[d, −b], [−c, a]] where A = [[a,b],[c,d]].

(A−1)11
Show derivation
det(A) = 2×3 − 1×5 = 1
A−1 = (1/1) · [[3, −1], [−5, 2]] = [[3, −1], [−5, 2]]

Verify: A · A−1 = [[2×3+1×(−5), 2×(−1)+1×2],[5×3+3×(−5), 5×(−1)+3×2]] = [[1,0],[0,1]]. The fact that det = 1 is special — integer matrices with det = ±1 have integer inverses.

Exercise 1.5: Find the Bug Debug

This function solves a 2×2 system Ax = b using the inverse formula. It returns the wrong answer. Click the buggy line.

function solve2x2(A, b) {
  const det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
  if (det === 0) return null;
  const invDet = 1 / det;
  const x0 = invDet * (A[1][1]*b[0] + A[0][1]*b[1]);
  const x1 = invDet * (-A[1][0]*b[0] + A[0][0]*b[1]);
  return [x0, x1];
}
Show explanation

Line 5 is the bug. The inverse formula for x0 should use A[1][1]*b[0] A[0][1]*b[1], not +. Recall that A−1 = (1/det)[[d, −b],[−c, a]]. The off-diagonal entries are negated. With the + sign, you're multiplying by [[d, b],[−c, a]] instead — wrong sign on the (1,2) entry.

Exercise 1.6: When is a Matrix Singular? Trace
Which of these 2×2 matrices is singular (no inverse)?
Show reasoning

[[2,4],[3,6]] has det = 0. Notice that row 2 = 1.5 × row 1 — the rows are linearly dependent. The matrix maps all of 2D space onto a line. You can't invert that — information is destroyed. In ML, this is like having one feature that's a perfect linear function of another.

Chapter 2: Eigenvalues & Eigenvectors

You're running PCA and your library returns "eigenvalues" and "eigenvectors." What are they? An eigenvector v of matrix A is a direction that A stretches without rotating: Av = λv. The scalar λ is how much it stretches. PCA finds the directions of maximum variance — those are the eigenvectors of the covariance matrix, and the eigenvalues are the variances along those directions.

Finding eigenvalues:
Av = λv → (A − λI)v = 0
Non-trivial solution exists when det(A − λI) = 0

For 2×2 A = [[a,b],[c,d]]:
det(A − λI) = (a−λ)(d−λ) − bc = 0
λ² − (a+d)λ + (ad−bc) = 0
λ² − tr(A)·λ + det(A) = 0
The characteristic equation. For a 2×2 matrix, the eigenvalues satisfy λ² − tr(A)λ + det(A) = 0. So the sum of eigenvalues equals the trace, and their product equals the determinant. This is a powerful sanity check — if your eigenvalues don't satisfy these relations, you made an arithmetic error.
Exercise 2.1: Eigenvalues of [[4,1],[2,3]] Derive

Find the eigenvalues. Set up det(A − λI) = 0 and solve the quadratic.

tr(A) = 7, det(A) = 10. So λ² − 7λ + 10 = 0.

larger λ
Show derivation
det(A − λI) = (4−λ)(3−λ) − 1×2 = λ² − 7λ + 10 = 0
(λ − 5)(λ − 2) = 0 → λ1 = 5, λ2 = 2

Check: λ1 + λ2 = 7 = tr(A). λ1 × λ2 = 10 = det(A). Both match.

Exercise 2.2: Eigenvector for λ=5 Derive

For A = [[4,1],[2,3]] with λ = 5, find the eigenvector by solving (A − 5I)v = 0. Give the ratio v2/v1.

v2/v1
Show derivation
A − 5I = [[−1, 1], [2, −2]]
Row 1: −v1 + v2 = 0 → v2 = v1

So v = [1, 1] (or any scalar multiple). The eigenvector direction is the 45-degree line. A stretches vectors along this line by a factor of 5.

Exercise 2.3: Verify Av = λv Derive

Compute A · [1, 1]T where A = [[4,1],[2,3]]. What is the first component of the result?

(Av)1
Show derivation
Av = [[4,1],[2,3]] · [1,1]T = [4+1, 2+3] = [5, 5]
λv = 5 · [1, 1] = [5, 5] ✓

Av = λv confirmed. The matrix A applied to [1,1] produces 5×[1,1] — pure scaling, no rotation. That's exactly what "eigenvector" means.

Exercise 2.4: Eigenvector for λ=2 Derive

Same A = [[4,1],[2,3]], λ = 2. Find v2/v1 for this eigenvector.

v2/v1
Show derivation
A − 2I = [[2, 1], [2, 1]]
Row 1: 2v1 + v2 = 0 → v2 = −2v1

So v = [1, −2] (or any scalar multiple). This eigenvector points in a different direction from the λ=5 eigenvector. Together, these two eigenvectors form a basis for R² — any vector can be decomposed into these two directions, each scaled by its eigenvalue.

Exercise 2.5: Implement eigenvalues2x2() Build

Write a function that returns the two eigenvalues of a 2×2 matrix as [larger, smaller].

Use the quadratic formula on the characteristic equation.
Show solution
javascript
function eigenvalues2x2(A) {
  const tr = A[0][0] + A[1][1];
  const det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
  const disc = Math.sqrt(tr*tr - 4*det);
  return [(tr + disc)/2, (tr - disc)/2];
}
Exercise 2.6: Symmetric Matrices Trace
Which property is guaranteed for eigenvalues of a real symmetric matrix (A = AT)?
Show reasoning

The Spectral Theorem guarantees that real symmetric matrices have: (1) all real eigenvalues, and (2) orthogonal eigenvectors. This is why covariance matrices (which are symmetric) always have real eigenvalues and orthogonal principal components. The eigenvalues can be negative (just not for PSD matrices) and the matrix can be singular (if an eigenvalue is 0).

Chapter 3: SVD

The Singular Value Decomposition is the Swiss army knife of linear algebra. It works for any matrix — rectangular, singular, whatever. A = UΣVT, where U and V are orthogonal and Σ is diagonal with non-negative entries. The singular values tell you how much the matrix stretches space in each direction. Low-rank approximation via SVD is the foundation of PCA, latent semantic analysis, and modern recommender systems.

SVD: A = UΣVT
A: [m, n], U: [m, m] orthogonal, Σ: [m, n] diagonal, V: [n, n] orthogonal

Singular values from eigenvalues:
σi = √(λi(ATA))

Low-rank approximation (rank k):
Ak = UkΣkVkT   // keep top k singular values only
Error = σk+1   // Eckart-Young theorem: this is optimal!
Eigenvalues vs. singular values. Eigenvalues can be negative or complex. Singular values are always real and non-negative. For a symmetric positive semidefinite matrix, they're the same. For a general matrix, singular values are the square roots of the eigenvalues of ATA. This is why SVD always works, even when eigendecomposition doesn't.
Exercise 3.1: Trivial SVD Derive

A = [[3, 0], [0, 2]]. This is already diagonal. What are the singular values σ1 and σ2?

σ1 (larger)
Show derivation
ATA = [[9, 0], [0, 4]]
Eigenvalues of ATA: 9, 4
σ1 = √9 = 3, σ2 = √4 = 2

For a diagonal matrix, the singular values are just the absolute values of the diagonal entries. U = V = I (identity), Σ = A itself.

Exercise 3.2: SVD of [[2,1],[1,2]] Derive

A = [[2,1],[1,2]]. Compute ATA, find its eigenvalues, then take square roots. What is σ1?

ATA = A² (since A is symmetric). Eigenvalues of A are 3 and 1 (from Ch 2). So eigenvalues of A² are...

σ1
Show derivation
ATA = [[2,1],[1,2]] × [[2,1],[1,2]] = [[5, 4], [4, 5]]
Eigenvalues of ATA: λ² − 10λ + 9 = 0 → λ = 9, 1
σ1 = √9 = 3, σ2 = √1 = 1

Since A is symmetric with eigenvalues 3 and 1, the singular values are |3| = 3 and |1| = 1. For symmetric PSD matrices, singular values equal eigenvalues — the two decompositions coincide.

Exercise 3.3: Low-Rank Approximation Derive

A 1000×1000 matrix has singular values [100, 50, 10, 1, 0.5, 0.1, ...]. If you keep only the top 3 singular values (rank-3 approximation), how many numbers do you need to store instead of 106?

Rank-k storage: k columns of U [1000 × k] + k singular values + k columns of V [1000 × k].

numbers stored
Show derivation
U3: 1000 × 3 = 3000 numbers
Σ3: 3 numbers
V3: 1000 × 3 = 3000 numbers
Total = 3000 + 3 + 3000 = 6003

That's 6003 vs 1,000,000 — a 166× compression. And by the Eckart-Young theorem, this is the best possible rank-3 approximation in both Frobenius and spectral norm. The approximation error is σ4 = 1 (in spectral norm).

Exercise 3.4: Compression Ratio Trace
An image is 512×512 pixels. Its SVD has singular values that drop below 1% of σ1 after rank 20. Approximately what compression ratio does a rank-20 SVD approximation give?
Show reasoning

Rank-20 needs 512×20 + 20 + 512×20 = 20,500 numbers vs the original 512×512 = 262,144. Ratio: 262,144/20,500 ≈ 12.8, so roughly 13:1. This is the principle behind truncated SVD for image compression and dimensionality reduction.

Exercise 3.5: Matrix Rank from SVD Trace
A 5×5 matrix has singular values [10, 5, 2, 0, 0]. What is the rank of this matrix?
Show reasoning

The rank of a matrix equals the number of non-zero singular values. Three non-zero singular values (10, 5, 2) means rank 3. The matrix maps 5D space into a 3D subspace — two dimensions of information are lost. In ML, this means only 3 of the 5 features are linearly independent.

Chapter 4: PCA

You have a dataset with 50 features, but you suspect most of the variance lives in a much lower-dimensional subspace. PCA finds that subspace. The recipe: center the data, compute the covariance matrix, eigendecompose it, project onto the top eigenvectors. Each step has a concrete computation you should be able to do by hand.

PCA algorithm:
1. Center: X̅ = X − μ   // subtract column means
2. Covariance: C = (1/(n−1)) X̅TX̅   // [d, d] matrix
3. Eigendecompose: C = QΛQT
4. Project: Z = X̅ · Qk   // keep top k eigenvectors

Variance explained by top k:
i=1..k λi / ∑i=1..d λi
PCA = eigendecomposition of the covariance matrix. The eigenvectors are the principal directions (axes of maximum variance). The eigenvalues are the variances along those directions. Projecting onto the top k eigenvectors keeps the k most informative dimensions and discards the rest. This is optimal in the least-squares sense — no other k-dimensional subspace captures more variance.
Exercise 4.1: Center the Data Derive

Three data points in 2D: (1, 4), (3, 2), (5, 6). Compute the mean of each feature, then the centered y-coordinate of the first point.

μx = (1+3+5)/3, μy = (4+2+6)/3. Centered first point y = 4 − μy.

centered y1
Show derivation
μx = (1+3+5)/3 = 3, μy = (4+2+6)/3 = 4
Centered: (−2, 0), (0, −2), (2, 2)
First point centered y = 4 − 4 = 0
Exercise 4.2: Covariance Matrix Derive

Using the centered data from 4.1: (−2, 0), (0, −2), (2, 2). Compute C = (1/2)X̅TX̅ (using n−1=2). What is C11 (variance of feature 1)?

C11
Show derivation
TX̅ = [[-2,0,2],[0,-2,2]] × [[-2,0],[0,-2],[2,2]] = [[8, 4], [4, 8]]
C = (1/2) × [[8,4],[4,8]] = [[4, 2], [2, 4]]

C11 = 4 is the variance of feature 1. C12 = C21 = 2 is the covariance between features — they're positively correlated. The covariance matrix is always symmetric.

Exercise 4.3: Eigendecompose the Covariance Derive

C = [[4, 2], [2, 4]]. Find the eigenvalues. What is the larger eigenvalue?

tr(C) = 8, det(C) = 12. Quadratic: λ² − 8λ + 12 = 0.

λ1
Show derivation
λ² − 8λ + 12 = 0 → (λ−6)(λ−2) = 0
λ1 = 6, λ2 = 2

The first principal component captures variance 6, the second captures 2. The eigenvector for λ=6 is [1,1]/√2 (the 45-degree direction). Data varies most along this diagonal — projecting onto it keeps 75% of the total variance.

Exercise 4.4: Variance Explained Derive

With eigenvalues λ1=6, λ2=2, what fraction of total variance is explained by the first principal component?

fraction (0 to 1)
Show derivation
Variance explained = λ1 / (λ1 + λ2) = 6 / 8 = 0.75

75% of the variance is captured by a single direction. In practice, you pick k such that the cumulative explained variance exceeds a threshold (commonly 95%). With just one component here, you'd lose 25% of the information.

Exercise 4.5: PCA vs. Feature Selection Trace
Your dataset has 100 features. Feature 1 has variance 50, feature 2 has variance 30, all others have variance ~1. The covariance between features 1 and 2 is 35. If you do PCA and keep 1 component, does it simply select feature 1 (highest individual variance)?
Show reasoning

PCA finds the direction of maximum variance, which is generally a linear combination of the original features. With covariance 35 between features 1 and 2, a rotated axis capturing both features jointly explains more variance than either feature alone. Feature selection picks individual features; PCA finds optimal linear combinations. They're fundamentally different approaches.

Chapter 5: Matrix Calculus

Every time you call loss.backward() in PyTorch, the framework is computing matrix derivatives. The gradient of the loss with respect to the weights is a matrix derivative. The gradient of the loss with respect to the input is another one. If you want to understand why gradient descent works, why certain architectures train better than others, or what happens when gradients vanish or explode, you need matrix calculus.

The three most important matrix derivatives:

1. ∂(Wx)/∂x = W    // linear layer forward gradient

2. ∂(xTAx)/∂x = (A + AT)x    // quadratic form gradient
   = 2Ax when A is symmetric

3. ∂||Ax − b||²/∂x = 2AT(Ax − b)    // least squares gradient
Rule 3 is the gradient of every linear regression. The loss is ||Ax − b||² (sum of squared residuals). The gradient 2AT(Ax − b) tells you exactly which direction to move x to reduce the loss. Setting this to zero gives the normal equations: ATAx = ATb. That's Chapter 6.
Exercise 5.1: Linear Layer Gradient Derive

W = [[2, 1], [3, 4]], x = [1, 2]. Compute y = Wx, then ∂y/∂x. What is (∂y/∂x)11?

(∂y/∂x)11
Show derivation
y = Wx = [[2,1],[3,4]] × [1,2]T = [4, 11]
∂y/∂x = W = [[2, 1], [3, 4]]
(∂y/∂x)11 = W11 = 2

The Jacobian of a linear transformation y = Wx is simply the matrix W itself. This makes sense: y1 = 2x1 + x2, so ∂y1/∂x1 = 2. This is why backpropagation through a linear layer just multiplies by WT.

Exercise 5.2: Quadratic Form Value Derive

Compute xTAx for A = [[2, 1], [1, 3]] and x = [1, 2].

First compute Ax, then dot with x.

xTAx
Show derivation
Ax = [[2,1],[1,3]] × [1,2]T = [4, 7]
xTAx = [1, 2] · [4, 7] = 4 + 14 = 18

Quadratic forms are scalar-valued: they map a vector to a single number. The loss function in ridge regression is xTAx + bx, a quadratic form plus a linear term.

Exercise 5.3: Quadratic Form Gradient Derive

Compute ∂(xTAx)/∂x = (A + AT)x for A = [[2, 1], [1, 3]], x = [1, 2]. What is the first component?

gradient[0]
Show derivation
A + AT = [[2,1],[1,3]] + [[2,1],[1,3]] = [[4, 2], [2, 6]]
(A + AT)x = [[4,2],[2,6]] × [1,2]T = [8, 14]

Since A is symmetric (A = AT), A + AT = 2A. So the gradient simplifies to 2Ax = 2[4, 7] = [8, 14]. Symmetric A is the common case in ML (covariance matrices, Hessians).

Exercise 5.4: Least Squares Gradient Derive

A = [[1, 0], [0, 1], [1, 1]], x = [2, 3], b = [2, 4, 6]. Compute ∂||Ax − b||²/∂x = 2AT(Ax − b). What is the first component?

gradient[0]
Show derivation
Ax = [[1,0],[0,1],[1,1]] × [2,3]T = [2, 3, 5]
Ax − b = [2−2, 3−4, 5−6] = [0, −1, −1]
AT(Ax − b) = [[1,0,1],[0,1,1]] × [0, −1, −1]T = [−1, −2]
2AT(Ax − b) = [−2, −4]

Negative gradient means x should increase to reduce the loss. The residual Ax − b = [0, −1, −1] tells us the predictions are too low for data points 2 and 3. Gradient descent would step x in the direction [2, 4] (negating the gradient).

Exercise 5.5: Chain Rule Trace
In a 3-layer neural network y = W3σ(W2σ(W1x)), the gradient of the loss L with respect to W1 involves multiplying how many Jacobians together (via chain rule)?
Show reasoning

The chain unfolds as: ∂L/∂W1 = ∂L/∂y · ∂y/∂h2 · ∂h2/∂h1 · ∂h1/∂z1 · ∂z1/∂W1. That's 5 Jacobian matrices multiplied together. Each layer contributes two factors: one for the linear part (W), one for the activation (σ). The loss contributes one more. This is why deep networks are prone to vanishing/exploding gradients — you're multiplying many matrices together.

Exercise 5.6: Find the Bug Debug

This computes the gradient of ||Ax - b||² with respect to x. It returns the wrong gradient. Click the buggy line.

function lsGrad(A, x, b) {
  // Compute Ax
  const Ax = matmul(A, x);
  // Compute residual
  const r = subtract(Ax, b);
  // Gradient = 2 * A * (Ax - b)
  const grad = scale(matmul(A, r), 2);
  return grad;
}
Show explanation

Line 7 is the bug. The gradient is 2AT(Ax − b), not 2A(Ax − b). The code uses matmul(A, r) instead of matmul(transpose(A), r). Missing the transpose is one of the most common bugs in implementing gradient descent for linear models. Dimensionally: if A is [m,n] and r is [m,1], then A·r would be an invalid shape — only AT·r = [n,m]×[m,1] = [n,1] has the right shape to update x.

Chapter 6: Least Squares

You have 100 data points and want to fit a line through them. The classical solution: set the gradient from Chapter 5 to zero. That gives the normal equations: ATAx = ATb. If ATA is invertible, x = (ATA)−1ATb. This is the closed-form solution to linear regression — no gradient descent needed.

Normal equations:
ATAx = ATb
x = (ATA)−1ATb

Pseudoinverse via SVD:
A+ = VΣ+UT   // where Σ+ inverts non-zero singular values
x = A+b   // works even when ATA is singular!

When is ATA invertible?
When A has full column rank (all columns linearly independent).
n ≥ d and rank(A) = d.
ATA is invertible iff no features are redundant. If one feature is a linear combination of others, ATA becomes singular and the normal equations have infinitely many solutions. This is multicollinearity. Fix: add regularization (ATA + λI)x = ATb — that's ridge regression, and the λI term makes the matrix invertible.
Exercise 6.1: 3-Point Linear Regression Derive

Fit y = ax + b to points (1, 2), (2, 3), (3, 5). Set up A = [[1,1],[2,1],[3,1]] and b = [2,3,5]. Compute ATA. What is (ATA)11?

(ATA)11
Show derivation
ATA = [[1,2,3],[1,1,1]] × [[1,1],[2,1],[3,1]] = [[14, 6], [6, 3]]

(ATA)11 = 1² + 2² + 3² = 14. This is the sum of squares of the x-coordinates. The full ATA is a 2×2 symmetric matrix encoding the structure of the feature data.

Exercise 6.2: Solve the Normal Equations Derive

ATA = [[14,6],[6,3]], ATb = [23, 10]. Solve for x = [a, b] (the slope and intercept). What is the slope a?

det(ATA) = 14×3 − 6×6 = 6. Use the 2×2 inverse formula.

a (slope)
Show derivation
(ATA)−1 = (1/6)[[3, −6], [−6, 14]] = [[0.5, −1], [−1, 2.33]]
x = [[0.5,−1],[−1,2.33]] × [23, 10]T
a = 0.5×23 + (−1)×10 = 11.5 − 10 = 1.5
b = −1×23 + 2.33×10 = −23 + 23.33 = 0.33

The best-fit line is y = 1.5x + 0.33. At x=1: y=1.83 (error 0.17), x=2: y=3.33 (error 0.33), x=3: y=4.83 (error −0.17). The residuals sum to approximately zero, as they must for any least squares solution with an intercept term.

Exercise 6.3: Residual Sum of Squares Derive

Using y = 1.5x + 1/3, compute the sum of squared residuals for the 3 points (1,2), (2,3), (3,5). Round to 2 decimal places.

RSS
Show derivation
Predictions: 1.5(1)+1/3 = 11/6, 1.5(2)+1/3 = 10/3, 1.5(3)+1/3 = 29/6
Residuals: 2 − 11/6 = 1/6, 3 − 10/3 = −1/3, 5 − 29/6 = 1/6
RSS = (1/6)² + (1/3)² + (1/6)² = 1/36 + 1/9 + 1/36 = 1/6 ≈ 0.167
Exercise 6.4: When ATA is Singular Trace
You have features x1 (temperature in Celsius) and x2 (temperature in Fahrenheit, = 1.8x1 + 32). What happens to the normal equations?
Show reasoning

Since x2 = 1.8x1 + 32, column 2 of A is a linear combination of column 1 and the intercept column. ATA is singular (det = 0), and the inverse doesn't exist. There are infinitely many solutions: you can trade off weight on x1 vs x2 arbitrarily. The SVD pseudoinverse gives the minimum-norm solution, and ridge regression picks a unique solution by penalizing large weights.

Exercise 6.5: Implement normalEquation() Build

Write a function that solves Ax = b by the normal equations for a 2-feature problem. A is [n, 2], b is [n]. Return [x1, x2].

Compute ATA (2x2), ATb (2x1), then use the 2x2 inverse formula.
Show solution
javascript
function normalEquation(A, b) {
  const n = A.length;
  // A^T A (2x2)
  let a00=0,a01=0,a11=0, b0=0,b1=0;
  for (let i=0; i<n; i++) {
    a00 += A[i][0]*A[i][0];
    a01 += A[i][0]*A[i][1];
    a11 += A[i][1]*A[i][1];
    b0  += A[i][0]*b[i];
    b1  += A[i][1]*b[i];
  }
  const det = a00*a11 - a01*a01;
  return [(a11*b0 - a01*b1)/det, (-a01*b0 + a00*b1)/det];
}

Chapter 7: Norms & Distances

Every loss function in ML involves a norm. L2 loss is the squared L2 norm of the residual. L1 regularization uses the L1 norm of the weights. Gradient clipping uses the L2 norm of the gradient. Batch normalization divides by a norm. You need to compute these fluently and understand their geometric differences — they impose fundamentally different shapes on the "unit ball" and therefore promote different solutions.

Vector norms of v = [v1, ..., vn]:
L1: ||v||1 = ∑ |vi|    // sum of absolute values
L2: ||v||2 = √(∑ vi²)    // Euclidean length
L∞: ||v|| = max |vi|    // largest absolute entry

Matrix norms:
Frobenius: ||A||F = √(∑ij Aij²)    // like L2 for matrices
Nuclear: ||A||* = ∑ σi    // sum of singular values
Spectral: ||A||2 = σmax    // largest singular value
Why L1 promotes sparsity. The L1 unit ball has corners on the axes. When you minimize a loss + L1 penalty, the optimum tends to land on a corner — where some coordinates are exactly zero. The L2 ball is smooth, so the optimum slides to a point where all coordinates are small but non-zero. L1 gives you feature selection for free.
Exercise 7.1: L1 Norm Derive

Compute ||v||1 for v = [3, −4, 0, 5].

||v||1
Show derivation
||v||1 = |3| + |−4| + |0| + |5| = 3 + 4 + 0 + 5 = 12
Exercise 7.2: L2 Norm Derive

Compute ||v||2 for v = [3, −4, 0, 5]. Give the answer to 2 decimal places.

||v||2
Show derivation
||v||2 = √(9 + 16 + 0 + 25) = √50 = 5√2 ≈ 7.07
Exercise 7.3: L-infinity Norm Derive

Compute ||v|| for v = [3, −4, 0, 5].

||v||
Show derivation
||v|| = max(|3|, |−4|, |0|, |5|) = 5

The L∞ norm only cares about the single largest entry. It's used in adversarial robustness: an L∞ adversarial perturbation can change every pixel, as long as no single pixel changes by more than ε.

Exercise 7.4: Frobenius Norm Derive

Compute ||M||F for M = [[1, 2], [3, 4]]. Give the answer to 2 decimal places.

||M||F
Show derivation
||M||F = √(1² + 2² + 3² + 4²) = √(1+4+9+16) = √30 ≈ 5.48

The Frobenius norm treats the matrix as a long vector and takes its L2 norm. It equals √(∑ σi²) — the square root of the sum of squared singular values. Weight decay in neural networks penalizes the squared Frobenius norm of the weight matrices.

Exercise 7.5: Nuclear Norm Derive

M = [[1, 2], [3, 4]] has singular values σ1 ≈ 5.465 and σ2 ≈ 0.366. Compute the nuclear norm ||M||*.

||M||*
Show derivation
||M||* = σ1 + σ2 = 5.465 + 0.366 = 5.831

The nuclear norm is the "L1 norm of the singular values." Just as L1 on vectors promotes sparsity, the nuclear norm on matrices promotes low rank. Minimizing ||W||* pushes small singular values to exactly zero, effectively reducing the matrix rank. Used in matrix completion (recommender systems) and low-rank matrix recovery.

Exercise 7.6: Norm Comparison Trace
For any vector v, which ordering always holds?
Show reasoning

Always: ||v|| ≤ ||v||2 ≤ ||v||1. The max absolute entry is at most the Euclidean length (since the other terms are non-negative), and the Euclidean length is at most the sum of absolute values (by Cauchy-Schwarz). Verify with v = [3, −4, 0, 5]: ||v|| = 5 ≤ ||v||2 = 7.07 ≤ ||v||1 = 12.

Chapter 8: Positive Definiteness

A matrix is positive definite (PD) if xTAx > 0 for all non-zero x. Positive semidefinite (PSD) allows xTAx ≥ 0. Why should you care? Covariance matrices are always PSD. The Hessian of a convex loss function is PSD at the minimum. If a matrix is PD, it has a unique Cholesky factorization A = LLT — this is the fastest way to solve Ax = b and is numerically stable.

Tests for positive definiteness:
1. All eigenvalues > 0   // definitive test
2. All leading principal minors > 0   // Sylvester's criterion
3. Cholesky factorization A = LLT exists   // constructive test

Cholesky factorization A = LLT:
L11 = √A11
L21 = A21 / L11
L22 = √(A22 − L21²)
Why covariance matrices must be PSD. A covariance matrix C has entries Cij = E[(Xi − μi)(Xj − μj)]. For any vector a, aTCa = Var(aTX) ≥ 0 — the variance of any linear combination of variables is non-negative. That's the definition of PSD. If a covariance matrix has a zero eigenvalue, it means one variable is an exact linear function of the others.
Exercise 8.1: Check Eigenvalues Derive

A = [[4, 2], [2, 3]]. Find the eigenvalues and determine if A is positive definite.

tr(A) = 7, det(A) = 8. Quadratic: λ² − 7λ + 8 = 0.

smaller λ
Show derivation
λ = (7 ± √(49 − 32)) / 2 = (7 ± √17) / 2
λ1 = (7 + 4.123) / 2 ≈ 5.56
λ2 = (7 − 4.123) / 2 ≈ 1.44

Both eigenvalues are positive, so A is positive definite. This means xTAx > 0 for all non-zero x, and the quadratic form defines a bowl-shaped surface with a unique minimum at x = 0.

Exercise 8.2: Cholesky Factorization Derive

Compute the Cholesky factorization A = LLT for A = [[4, 2], [2, 3]]. What is L22?

L22
Show derivation
L11 = √A11 = √4 = 2
L21 = A21 / L11 = 2 / 2 = 1
L22 = √(A22 − L21²) = √(3 − 1) = √2 ≈ 1.414
L = [[2, 0], [1, √2]]

Verify: LLT = [[2,0],[1,√2]] × [[2,1],[0,√2]] = [[4,2],[2,3]] = A. Cholesky is 2× faster than LU decomposition for solving Ax = b when A is PD, and it's the standard method for sampling from multivariate Gaussians: x = L · z + μ where z ~ N(0, I).

Exercise 8.3: Not Positive Definite Trace
B = [[1, 2], [2, 1]]. Is B positive definite?
Show reasoning

tr(B) = 2, det(B) = 1 − 4 = −3. Eigenvalues: λ² − 2λ − 3 = 0 → λ = 3, −1. Since −1 < 0, B is not positive definite. In fact, take x = [1, −1]: xTBx = 1 − 2 − 2 + 1 = −2 < 0. Positive diagonal entries are necessary but not sufficient for PD — the off-diagonal correlations can make the matrix indefinite.

Exercise 8.4: Making a Matrix PD Trace
You computed a covariance matrix from data but due to numerical errors, the smallest eigenvalue is −0.001. How do you fix this so downstream algorithms (Cholesky, Gaussian sampling) don't fail?
Show reasoning

Adding εI shifts every eigenvalue by +ε. If ε = 0.01, the problematic −0.001 becomes 0.009 > 0, making the matrix PD. This is called jitter or ridge regularization and is standard practice in Gaussian processes, numerical Cholesky, and any code that inverts near-singular matrices. In GPyTorch, JAX, and TensorFlow, you'll see + 1e-6 * I everywhere.

Exercise 8.5: Cholesky Solve Derive

To solve Ax = b where A = LLT, you solve Ly = b (forward substitution) then LTx = y (back substitution). Given L = [[2, 0], [1, √2]] and b = [6, 5], find y1.

y1
Show derivation
Ly = b: [[2,0],[1,√2]] × [y1,y2]T = [6, 5]
2y1 = 6 → y1 = 3
y1 + √2 · y2 = 5 → y2 = 2/√2 = √2 ≈ 1.414

Forward substitution through L is O(n²), much cheaper than computing A−1 which is O(n³). The Cholesky factorization itself is O(n³/3), but once you have L, each new right-hand side b costs only O(n²).

Chapter 9: Capstone — ML Pipeline

Time to put it all together. You have a real ML pipeline: raw data matrix X [1000 samples, 50 features]. You need to (1) compute the covariance matrix, (2) run PCA to reduce to 10 dimensions, (3) train linear regression on the reduced data. Each step has a concrete computational cost that you should be able to estimate. This is what ML infrastructure engineers do every day — estimate whether a computation will take milliseconds or hours.

Pipeline:
1. Center X → X̅    // n · d subtractions
2. Covariance: C = (1/(n−1)) X̅TX̅    // matmul [d, n] × [n, d] = O(d²n)
3. Eigendecompose C → Q, Λ    // O(d³)
4. Project: Z = X̅ Q10    // matmul [n, d] × [d, 10] = O(n · d · 10)
5. Train: (ZTZ)−1ZTy    // matmul + inversion of [10, 10]
The bottleneck is always the largest matrix multiply. For n=1000, d=50: covariance costs O(50²×1000) = 2.5M FLOPs, eigendecomposition costs O(50³) = 125K FLOPs. The covariance dominates. But if d=5000, eigendecomposition costs O(125B) and suddenly dominates everything. Knowing where the bottleneck is saves you from optimizing the wrong step.

Exercise 9.1: Covariance Matrix Cost Derive

X is [1000, 50]. Computing C = X̅TX̅ is a matmul of [50, 1000] × [1000, 50]. How many multiply-add FLOPs? (Count each multiply-add pair as 2 FLOPs.)

million FLOPs
Show derivation
FLOPs = 2 × 50 × 1000 × 50 = 5,000,000 = 5M

For a matmul [m,k]×[k,n], FLOPs = 2mkn. Here m=n=50, k=1000. On a modern CPU doing ~10 GFLOPs, this takes about 0.5 ms. Trivial. But if d=50000 instead of 50, the cost jumps to 5 trillion FLOPs — 500 seconds on the same CPU.

Exercise 9.2: Eigendecomposition Cost Derive

Eigendecomposition of a d×d symmetric matrix costs approximately 4d³/3 FLOPs. For d=50, what is this?

million FLOPs
Show derivation
4 × 50³ / 3 = 4 × 125,000 / 3 = 166,667 ≈ 0.167M FLOPs

30× less than the covariance matrix computation. The eigendecomposition is cheap because the covariance matrix is only [50,50]. But the cost grows as d³ — at d=5000 it would be 167 billion FLOPs, 33,000× more expensive.

Exercise 9.3: Projection Cost Derive

Projecting X̅ [1000, 50] onto the top 10 eigenvectors Q10 [50, 10] gives Z [1000, 10]. FLOPs?

million FLOPs
Show derivation
FLOPs = 2 × 1000 × 50 × 10 = 1,000,000 = 1M

Cheaper than the covariance computation because we only project onto 10 directions, not all 50. The output Z [1000, 10] has 80% fewer features, which makes subsequent training much faster.

Exercise 9.4: Training Cost Derive

Linear regression on Z [1000, 10]: normal equations require ZTZ [10,1000]×[1000,10] and ZTy [10,1000]×[1000,1], plus inverting a 10×10 matrix. What are the total FLOPs for ZTZ?

million FLOPs
Show derivation
FLOPs(ZTZ) = 2 × 10 × 1000 × 10 = 200,000 = 0.2M
FLOPs(ZTy) = 2 × 10 × 1000 × 1 = 20,000 = 0.02M
FLOPs(invert 10×10) ≈ 2 × 10³ / 3 = 667

Training cost is dominated by ZTZ at 0.2M FLOPs — 25× less than the covariance step. PCA reduced d from 50 to 10, which made training 25× cheaper (cost scales as d²).

Exercise 9.5: Total Pipeline FLOPs Derive

Sum the costs: covariance (5M) + eigendecomp (0.167M) + projection (1M) + training (0.22M) + centering (0.05M, i.e., n·d). Total pipeline FLOPs in millions?

million FLOPs
Show derivation
Total = 5.0 + 0.167 + 1.0 + 0.22 + 0.05 = 6.437M ≈ 6.44M FLOPs

The covariance computation is 78% of the total cost. This is the bottleneck. At a modern GPU's ~10 TFLOPS, the entire pipeline takes under a microsecond. But scale n to 1 million and d to 5000, and the covariance step alone costs 50 trillion FLOPs — 5 seconds on GPU. Now you see why efficient covariance estimation (randomized methods, streaming PCA) matters at scale.

Exercise 9.6: Pipeline Design Design

Arrange these PCA+regression pipeline steps in the correct order.

?
?
?
?
?
Project onto top-k
Center data
Normal equations
Covariance matrix
Eigendecompose
Show correct order

Center dataCovariance matrixEigendecomposeProject onto top-kNormal equations

You must center before computing covariance (uncentered data gives a biased covariance). You must eigendecompose before projecting (you need the eigenvectors). And you train on the projected data, not the original.

Exercise 9.7: Scaling Analysis Trace
If you double both n (samples) and d (features), how does the total pipeline cost scale?
Show reasoning

The dominant term is the covariance: O(d²n). Doubling both: (2d)² × 2n = 4d² × 2n = 8d²n. That's 8× the cost. The eigendecomposition is O(d³): (2d)³ = 8d³, also 8×. For large d, eigendecomposition eventually dominates (d³ vs d²n when d > n).

Exercise 9.8: Memory Budget Derive

For the pipeline with n=1000, d=50, k=10: what is the peak memory in bytes if everything is FP32 (4 bytes per number)? Consider: X [1000,50], C [50,50], Q [50,50], Z [1000,10], all in memory simultaneously.

KB
Show derivation
X: 1000 × 50 = 50,000 numbers × 4 bytes = 200,000 bytes
C: 50 × 50 = 2,500 × 4 = 10,000 bytes
Q: 50 × 50 = 2,500 × 4 = 10,000 bytes
Z: 1000 × 10 = 10,000 × 4 = 40,000 bytes
Total = 260,000 bytes ≈ 253.9 KB

Tiny. But at n=1M and d=50K: X alone is 200 GB. The covariance matrix C [50K, 50K] is 10 GB. This is why large-scale PCA uses randomized algorithms that never materialize the full covariance matrix.

You made it through the full pipeline. From raw data to trained model, you can now estimate every intermediate shape, every operation count, every memory allocation. This is the working vocabulary of ML engineering — not the abstract theory, but the concrete arithmetic that determines whether your job fits on one GPU or needs a cluster.

For related material:

TopicResource
Transformer-specific mathTransformer Math Workbook
PCA intuition from zeroPCA — From Absolute Zero
Matrix calculus in backpropTransformer — From Absolute Zero