Every matrix operation, decomposition, and calculus trick that powers modern ML — derived by hand, verified in-browser, no handwaving.
You're staring at a neural network layer: y = Wx + b. The weight matrix W is [512, 768], the input x is [768, 1]. What shape is y? What if you transpose W? What if you multiply two weight matrices together? Every forward pass is matrix multiplication. Every backward pass is matrix multiplication. If you can't do these operations by hand on small matrices, you can't debug them on large ones.
Compute the (1,1) entry (top-left) of C = AB where A = [[1,2],[3,4]] and B = [[5,6],[7,8]].
C11 = row 1 of A · col 1 of B = A11·B11 + A12·B21.
Row 1 of A dotted with column 1 of B. The full product is C = [[19, 22], [43, 50]].
Now compute C22 (bottom-right) of the same product: A = [[1,2],[3,4]], B = [[5,6],[7,8]].
Transpose swaps rows and columns. A is [2,3] so AT is [3,2]. AT = [[1,4],[2,5],[3,6]]. Row 3, column 1 of AT = 3 (which was row 1, column 3 of the original A).
Compute the trace of M = [[4, 1, 7], [2, 5, 3], [6, 8, 9]].
The trace is the sum of diagonal entries. Fun fact: the trace equals the sum of eigenvalues, and the determinant equals the product of eigenvalues. You'll verify this in Chapter 2.
Compute det(A) where A = [[3, 8], [4, 6]].
A negative determinant means the transformation flips orientation. |det| = 14 is the area scaling factor. If det = 0, the matrix is singular — no inverse, no unique solutions.
Compute det(B) where B = [[1, 2, 3], [0, 4, 5], [1, 0, 6]] using cofactor expansion along the first row.
det = 1·det([[4,5],[0,6]]) − 2·det([[0,5],[1,6]]) + 3·det([[0,4],[1,0]])
Write a function that multiplies two 2D arrays (matrices). A is [m,n], B is [n,p]. Return the [m,p] result as a nested array.
javascript function matmul(A, B) { const m = A.length, n = A[0].length, p = B[0].length; const C = Array.from({length: m}, () => Array(p).fill(0)); for (let i = 0; i < m; i++) for (let j = 0; j < p; j++) for (let k = 0; k < n; k++) C[i][j] += A[i][k] * B[k][j]; return C; }
The triple loop is O(m × n × p). For square n×n matrices this is O(n³). Strassen's algorithm reduces this to O(n2.807), and the current best known is O(n2.371).
Training a linear regression model boils down to solving Ax = b. Gradient descent is one way, but the direct solution requires Gaussian elimination — converting the system to upper triangular form, then back-substituting. Understanding when this fails tells you when your data is redundant or your model is under-determined.
Solve by Gaussian elimination: 2x + 3y = 8, 4x + y = 10. Find x.
Augmented: [[2, 3 | 8], [4, 1 | 10]]. R2 ← R2 − 2·R1.
Solve:
x + y + z = 6
2x − y + z = 3
x + y − 2z = −3
Find z.
Augmented matrix. Eliminate x from rows 2 and 3, then eliminate y from row 3.
Solution: (x, y, z) = (1, 2, 3). Back-substitution starts from the bottom row (simplest equation) and works up, plugging known values into earlier equations.
Row 2 is exactly 2 × Row 1 — it carries no new information. R2 ← R2 − 2R1 gives [0 0 | 0]. The system reduces to x + y = 3, x − y = 1, which has the unique solution x = 2, y = 1. The "extra" equation was linearly dependent — its rank contribution was zero. In ML, this is like having a duplicate feature.
Compute the inverse of A = [[2, 1], [5, 3]]. What is the (1,1) entry of A−1?
For 2×2: A−1 = (1/det) · [[d, −b], [−c, a]] where A = [[a,b],[c,d]].
Verify: A · A−1 = [[2×3+1×(−5), 2×(−1)+1×2],[5×3+3×(−5), 5×(−1)+3×2]] = [[1,0],[0,1]]. The fact that det = 1 is special — integer matrices with det = ±1 have integer inverses.
This function solves a 2×2 system Ax = b using the inverse formula. It returns the wrong answer. Click the buggy line.
function solve2x2(A, b) { const det = A[0][0]*A[1][1] - A[0][1]*A[1][0]; if (det === 0) return null; const invDet = 1 / det; const x0 = invDet * (A[1][1]*b[0] + A[0][1]*b[1]); const x1 = invDet * (-A[1][0]*b[0] + A[0][0]*b[1]); return [x0, x1]; }
Line 5 is the bug. The inverse formula for x0 should use A[1][1]*b[0] − A[0][1]*b[1], not +. Recall that A−1 = (1/det)[[d, −b],[−c, a]]. The off-diagonal entries are negated. With the + sign, you're multiplying by [[d, b],[−c, a]] instead — wrong sign on the (1,2) entry.
[[2,4],[3,6]] has det = 0. Notice that row 2 = 1.5 × row 1 — the rows are linearly dependent. The matrix maps all of 2D space onto a line. You can't invert that — information is destroyed. In ML, this is like having one feature that's a perfect linear function of another.
You're running PCA and your library returns "eigenvalues" and "eigenvectors." What are they? An eigenvector v of matrix A is a direction that A stretches without rotating: Av = λv. The scalar λ is how much it stretches. PCA finds the directions of maximum variance — those are the eigenvectors of the covariance matrix, and the eigenvalues are the variances along those directions.
Find the eigenvalues. Set up det(A − λI) = 0 and solve the quadratic.
tr(A) = 7, det(A) = 10. So λ² − 7λ + 10 = 0.
Check: λ1 + λ2 = 7 = tr(A). λ1 × λ2 = 10 = det(A). Both match.
For A = [[4,1],[2,3]] with λ = 5, find the eigenvector by solving (A − 5I)v = 0. Give the ratio v2/v1.
So v = [1, 1] (or any scalar multiple). The eigenvector direction is the 45-degree line. A stretches vectors along this line by a factor of 5.
Compute A · [1, 1]T where A = [[4,1],[2,3]]. What is the first component of the result?
Av = λv confirmed. The matrix A applied to [1,1] produces 5×[1,1] — pure scaling, no rotation. That's exactly what "eigenvector" means.
Same A = [[4,1],[2,3]], λ = 2. Find v2/v1 for this eigenvector.
So v = [1, −2] (or any scalar multiple). This eigenvector points in a different direction from the λ=5 eigenvector. Together, these two eigenvectors form a basis for R² — any vector can be decomposed into these two directions, each scaled by its eigenvalue.
Write a function that returns the two eigenvalues of a 2×2 matrix as [larger, smaller].
javascript function eigenvalues2x2(A) { const tr = A[0][0] + A[1][1]; const det = A[0][0]*A[1][1] - A[0][1]*A[1][0]; const disc = Math.sqrt(tr*tr - 4*det); return [(tr + disc)/2, (tr - disc)/2]; }
The Spectral Theorem guarantees that real symmetric matrices have: (1) all real eigenvalues, and (2) orthogonal eigenvectors. This is why covariance matrices (which are symmetric) always have real eigenvalues and orthogonal principal components. The eigenvalues can be negative (just not for PSD matrices) and the matrix can be singular (if an eigenvalue is 0).
The Singular Value Decomposition is the Swiss army knife of linear algebra. It works for any matrix — rectangular, singular, whatever. A = UΣVT, where U and V are orthogonal and Σ is diagonal with non-negative entries. The singular values tell you how much the matrix stretches space in each direction. Low-rank approximation via SVD is the foundation of PCA, latent semantic analysis, and modern recommender systems.
A = [[3, 0], [0, 2]]. This is already diagonal. What are the singular values σ1 and σ2?
For a diagonal matrix, the singular values are just the absolute values of the diagonal entries. U = V = I (identity), Σ = A itself.
A = [[2,1],[1,2]]. Compute ATA, find its eigenvalues, then take square roots. What is σ1?
ATA = A² (since A is symmetric). Eigenvalues of A are 3 and 1 (from Ch 2). So eigenvalues of A² are...
Since A is symmetric with eigenvalues 3 and 1, the singular values are |3| = 3 and |1| = 1. For symmetric PSD matrices, singular values equal eigenvalues — the two decompositions coincide.
A 1000×1000 matrix has singular values [100, 50, 10, 1, 0.5, 0.1, ...]. If you keep only the top 3 singular values (rank-3 approximation), how many numbers do you need to store instead of 106?
Rank-k storage: k columns of U [1000 × k] + k singular values + k columns of V [1000 × k].
That's 6003 vs 1,000,000 — a 166× compression. And by the Eckart-Young theorem, this is the best possible rank-3 approximation in both Frobenius and spectral norm. The approximation error is σ4 = 1 (in spectral norm).
Rank-20 needs 512×20 + 20 + 512×20 = 20,500 numbers vs the original 512×512 = 262,144. Ratio: 262,144/20,500 ≈ 12.8, so roughly 13:1. This is the principle behind truncated SVD for image compression and dimensionality reduction.
The rank of a matrix equals the number of non-zero singular values. Three non-zero singular values (10, 5, 2) means rank 3. The matrix maps 5D space into a 3D subspace — two dimensions of information are lost. In ML, this means only 3 of the 5 features are linearly independent.
You have a dataset with 50 features, but you suspect most of the variance lives in a much lower-dimensional subspace. PCA finds that subspace. The recipe: center the data, compute the covariance matrix, eigendecompose it, project onto the top eigenvectors. Each step has a concrete computation you should be able to do by hand.
Three data points in 2D: (1, 4), (3, 2), (5, 6). Compute the mean of each feature, then the centered y-coordinate of the first point.
μx = (1+3+5)/3, μy = (4+2+6)/3. Centered first point y = 4 − μy.
Using the centered data from 4.1: (−2, 0), (0, −2), (2, 2). Compute C = (1/2)X̅TX̅ (using n−1=2). What is C11 (variance of feature 1)?
C11 = 4 is the variance of feature 1. C12 = C21 = 2 is the covariance between features — they're positively correlated. The covariance matrix is always symmetric.
C = [[4, 2], [2, 4]]. Find the eigenvalues. What is the larger eigenvalue?
tr(C) = 8, det(C) = 12. Quadratic: λ² − 8λ + 12 = 0.
The first principal component captures variance 6, the second captures 2. The eigenvector for λ=6 is [1,1]/√2 (the 45-degree direction). Data varies most along this diagonal — projecting onto it keeps 75% of the total variance.
With eigenvalues λ1=6, λ2=2, what fraction of total variance is explained by the first principal component?
75% of the variance is captured by a single direction. In practice, you pick k such that the cumulative explained variance exceeds a threshold (commonly 95%). With just one component here, you'd lose 25% of the information.
PCA finds the direction of maximum variance, which is generally a linear combination of the original features. With covariance 35 between features 1 and 2, a rotated axis capturing both features jointly explains more variance than either feature alone. Feature selection picks individual features; PCA finds optimal linear combinations. They're fundamentally different approaches.
Every time you call loss.backward() in PyTorch, the framework is computing matrix derivatives. The gradient of the loss with respect to the weights is a matrix derivative. The gradient of the loss with respect to the input is another one. If you want to understand why gradient descent works, why certain architectures train better than others, or what happens when gradients vanish or explode, you need matrix calculus.
W = [[2, 1], [3, 4]], x = [1, 2]. Compute y = Wx, then ∂y/∂x. What is (∂y/∂x)11?
The Jacobian of a linear transformation y = Wx is simply the matrix W itself. This makes sense: y1 = 2x1 + x2, so ∂y1/∂x1 = 2. This is why backpropagation through a linear layer just multiplies by WT.
Compute xTAx for A = [[2, 1], [1, 3]] and x = [1, 2].
First compute Ax, then dot with x.
Quadratic forms are scalar-valued: they map a vector to a single number. The loss function in ridge regression is xTAx + bx, a quadratic form plus a linear term.
Compute ∂(xTAx)/∂x = (A + AT)x for A = [[2, 1], [1, 3]], x = [1, 2]. What is the first component?
Since A is symmetric (A = AT), A + AT = 2A. So the gradient simplifies to 2Ax = 2[4, 7] = [8, 14]. Symmetric A is the common case in ML (covariance matrices, Hessians).
A = [[1, 0], [0, 1], [1, 1]], x = [2, 3], b = [2, 4, 6]. Compute ∂||Ax − b||²/∂x = 2AT(Ax − b). What is the first component?
Negative gradient means x should increase to reduce the loss. The residual Ax − b = [0, −1, −1] tells us the predictions are too low for data points 2 and 3. Gradient descent would step x in the direction [2, 4] (negating the gradient).
The chain unfolds as: ∂L/∂W1 = ∂L/∂y · ∂y/∂h2 · ∂h2/∂h1 · ∂h1/∂z1 · ∂z1/∂W1. That's 5 Jacobian matrices multiplied together. Each layer contributes two factors: one for the linear part (W), one for the activation (σ). The loss contributes one more. This is why deep networks are prone to vanishing/exploding gradients — you're multiplying many matrices together.
This computes the gradient of ||Ax - b||² with respect to x. It returns the wrong gradient. Click the buggy line.
function lsGrad(A, x, b) { // Compute Ax const Ax = matmul(A, x); // Compute residual const r = subtract(Ax, b); // Gradient = 2 * A * (Ax - b) const grad = scale(matmul(A, r), 2); return grad; }
Line 7 is the bug. The gradient is 2AT(Ax − b), not 2A(Ax − b). The code uses matmul(A, r) instead of matmul(transpose(A), r). Missing the transpose is one of the most common bugs in implementing gradient descent for linear models. Dimensionally: if A is [m,n] and r is [m,1], then A·r would be an invalid shape — only AT·r = [n,m]×[m,1] = [n,1] has the right shape to update x.
You have 100 data points and want to fit a line through them. The classical solution: set the gradient from Chapter 5 to zero. That gives the normal equations: ATAx = ATb. If ATA is invertible, x = (ATA)−1ATb. This is the closed-form solution to linear regression — no gradient descent needed.
Fit y = ax + b to points (1, 2), (2, 3), (3, 5). Set up A = [[1,1],[2,1],[3,1]] and b = [2,3,5]. Compute ATA. What is (ATA)11?
(ATA)11 = 1² + 2² + 3² = 14. This is the sum of squares of the x-coordinates. The full ATA is a 2×2 symmetric matrix encoding the structure of the feature data.
ATA = [[14,6],[6,3]], ATb = [23, 10]. Solve for x = [a, b] (the slope and intercept). What is the slope a?
det(ATA) = 14×3 − 6×6 = 6. Use the 2×2 inverse formula.
The best-fit line is y = 1.5x + 0.33. At x=1: y=1.83 (error 0.17), x=2: y=3.33 (error 0.33), x=3: y=4.83 (error −0.17). The residuals sum to approximately zero, as they must for any least squares solution with an intercept term.
Using y = 1.5x + 1/3, compute the sum of squared residuals for the 3 points (1,2), (2,3), (3,5). Round to 2 decimal places.
Since x2 = 1.8x1 + 32, column 2 of A is a linear combination of column 1 and the intercept column. ATA is singular (det = 0), and the inverse doesn't exist. There are infinitely many solutions: you can trade off weight on x1 vs x2 arbitrarily. The SVD pseudoinverse gives the minimum-norm solution, and ridge regression picks a unique solution by penalizing large weights.
Write a function that solves Ax = b by the normal equations for a 2-feature problem. A is [n, 2], b is [n]. Return [x1, x2].
javascript function normalEquation(A, b) { const n = A.length; // A^T A (2x2) let a00=0,a01=0,a11=0, b0=0,b1=0; for (let i=0; i<n; i++) { a00 += A[i][0]*A[i][0]; a01 += A[i][0]*A[i][1]; a11 += A[i][1]*A[i][1]; b0 += A[i][0]*b[i]; b1 += A[i][1]*b[i]; } const det = a00*a11 - a01*a01; return [(a11*b0 - a01*b1)/det, (-a01*b0 + a00*b1)/det]; }
Every loss function in ML involves a norm. L2 loss is the squared L2 norm of the residual. L1 regularization uses the L1 norm of the weights. Gradient clipping uses the L2 norm of the gradient. Batch normalization divides by a norm. You need to compute these fluently and understand their geometric differences — they impose fundamentally different shapes on the "unit ball" and therefore promote different solutions.
Compute ||v||1 for v = [3, −4, 0, 5].
Compute ||v||2 for v = [3, −4, 0, 5]. Give the answer to 2 decimal places.
Compute ||v||∞ for v = [3, −4, 0, 5].
The L∞ norm only cares about the single largest entry. It's used in adversarial robustness: an L∞ adversarial perturbation can change every pixel, as long as no single pixel changes by more than ε.
Compute ||M||F for M = [[1, 2], [3, 4]]. Give the answer to 2 decimal places.
The Frobenius norm treats the matrix as a long vector and takes its L2 norm. It equals √(∑ σi²) — the square root of the sum of squared singular values. Weight decay in neural networks penalizes the squared Frobenius norm of the weight matrices.
M = [[1, 2], [3, 4]] has singular values σ1 ≈ 5.465 and σ2 ≈ 0.366. Compute the nuclear norm ||M||*.
The nuclear norm is the "L1 norm of the singular values." Just as L1 on vectors promotes sparsity, the nuclear norm on matrices promotes low rank. Minimizing ||W||* pushes small singular values to exactly zero, effectively reducing the matrix rank. Used in matrix completion (recommender systems) and low-rank matrix recovery.
Always: ||v||∞ ≤ ||v||2 ≤ ||v||1. The max absolute entry is at most the Euclidean length (since the other terms are non-negative), and the Euclidean length is at most the sum of absolute values (by Cauchy-Schwarz). Verify with v = [3, −4, 0, 5]: ||v||∞ = 5 ≤ ||v||2 = 7.07 ≤ ||v||1 = 12.
A matrix is positive definite (PD) if xTAx > 0 for all non-zero x. Positive semidefinite (PSD) allows xTAx ≥ 0. Why should you care? Covariance matrices are always PSD. The Hessian of a convex loss function is PSD at the minimum. If a matrix is PD, it has a unique Cholesky factorization A = LLT — this is the fastest way to solve Ax = b and is numerically stable.
A = [[4, 2], [2, 3]]. Find the eigenvalues and determine if A is positive definite.
tr(A) = 7, det(A) = 8. Quadratic: λ² − 7λ + 8 = 0.
Both eigenvalues are positive, so A is positive definite. This means xTAx > 0 for all non-zero x, and the quadratic form defines a bowl-shaped surface with a unique minimum at x = 0.
Compute the Cholesky factorization A = LLT for A = [[4, 2], [2, 3]]. What is L22?
Verify: LLT = [[2,0],[1,√2]] × [[2,1],[0,√2]] = [[4,2],[2,3]] = A. Cholesky is 2× faster than LU decomposition for solving Ax = b when A is PD, and it's the standard method for sampling from multivariate Gaussians: x = L · z + μ where z ~ N(0, I).
tr(B) = 2, det(B) = 1 − 4 = −3. Eigenvalues: λ² − 2λ − 3 = 0 → λ = 3, −1. Since −1 < 0, B is not positive definite. In fact, take x = [1, −1]: xTBx = 1 − 2 − 2 + 1 = −2 < 0. Positive diagonal entries are necessary but not sufficient for PD — the off-diagonal correlations can make the matrix indefinite.
Adding εI shifts every eigenvalue by +ε. If ε = 0.01, the problematic −0.001 becomes 0.009 > 0, making the matrix PD. This is called jitter or ridge regularization and is standard practice in Gaussian processes, numerical Cholesky, and any code that inverts near-singular matrices. In GPyTorch, JAX, and TensorFlow, you'll see + 1e-6 * I everywhere.
To solve Ax = b where A = LLT, you solve Ly = b (forward substitution) then LTx = y (back substitution). Given L = [[2, 0], [1, √2]] and b = [6, 5], find y1.
Forward substitution through L is O(n²), much cheaper than computing A−1 which is O(n³). The Cholesky factorization itself is O(n³/3), but once you have L, each new right-hand side b costs only O(n²).
Time to put it all together. You have a real ML pipeline: raw data matrix X [1000 samples, 50 features]. You need to (1) compute the covariance matrix, (2) run PCA to reduce to 10 dimensions, (3) train linear regression on the reduced data. Each step has a concrete computational cost that you should be able to estimate. This is what ML infrastructure engineers do every day — estimate whether a computation will take milliseconds or hours.
X is [1000, 50]. Computing C = X̅TX̅ is a matmul of [50, 1000] × [1000, 50]. How many multiply-add FLOPs? (Count each multiply-add pair as 2 FLOPs.)
For a matmul [m,k]×[k,n], FLOPs = 2mkn. Here m=n=50, k=1000. On a modern CPU doing ~10 GFLOPs, this takes about 0.5 ms. Trivial. But if d=50000 instead of 50, the cost jumps to 5 trillion FLOPs — 500 seconds on the same CPU.
Eigendecomposition of a d×d symmetric matrix costs approximately 4d³/3 FLOPs. For d=50, what is this?
30× less than the covariance matrix computation. The eigendecomposition is cheap because the covariance matrix is only [50,50]. But the cost grows as d³ — at d=5000 it would be 167 billion FLOPs, 33,000× more expensive.
Projecting X̅ [1000, 50] onto the top 10 eigenvectors Q10 [50, 10] gives Z [1000, 10]. FLOPs?
Cheaper than the covariance computation because we only project onto 10 directions, not all 50. The output Z [1000, 10] has 80% fewer features, which makes subsequent training much faster.
Linear regression on Z [1000, 10]: normal equations require ZTZ [10,1000]×[1000,10] and ZTy [10,1000]×[1000,1], plus inverting a 10×10 matrix. What are the total FLOPs for ZTZ?
Training cost is dominated by ZTZ at 0.2M FLOPs — 25× less than the covariance step. PCA reduced d from 50 to 10, which made training 25× cheaper (cost scales as d²).
Sum the costs: covariance (5M) + eigendecomp (0.167M) + projection (1M) + training (0.22M) + centering (0.05M, i.e., n·d). Total pipeline FLOPs in millions?
The covariance computation is 78% of the total cost. This is the bottleneck. At a modern GPU's ~10 TFLOPS, the entire pipeline takes under a microsecond. But scale n to 1 million and d to 5000, and the covariance step alone costs 50 trillion FLOPs — 5 seconds on GPU. Now you see why efficient covariance estimation (randomized methods, streaming PCA) matters at scale.
Arrange these PCA+regression pipeline steps in the correct order.
Center data → Covariance matrix → Eigendecompose → Project onto top-k → Normal equations
You must center before computing covariance (uncentered data gives a biased covariance). You must eigendecompose before projecting (you need the eigenvectors). And you train on the projected data, not the original.
The dominant term is the covariance: O(d²n). Doubling both: (2d)² × 2n = 4d² × 2n = 8d²n. That's 8× the cost. The eigendecomposition is O(d³): (2d)³ = 8d³, also 8×. For large d, eigendecomposition eventually dominates (d³ vs d²n when d > n).
For the pipeline with n=1000, d=50, k=10: what is the peak memory in bytes if everything is FP32 (4 bytes per number)? Consider: X [1000,50], C [50,50], Q [50,50], Z [1000,10], all in memory simultaneously.
Tiny. But at n=1M and d=50K: X alone is 200 GB. The covariance matrix C [50K, 50K] is 10 GB. This is why large-scale PCA uses randomized algorithms that never materialize the full covariance matrix.
For related material:
| Topic | Resource |
|---|---|
| Transformer-specific math | Transformer Math Workbook |
| PCA intuition from zero | PCA — From Absolute Zero |
| Matrix calculus in backprop | Transformer — From Absolute Zero |