Every equation in every paper. Every matrix in every model. Built from absolute zero with hand calculations, from-scratch code, and interactive visualizations.
You want to classify images of cats vs dogs. Your image is 224 × 224 × 3 pixels — that's 150,528 numbers. A single neural network layer takes those 150,528 numbers and multiplies them by a weight matrix to produce, say, 512 numbers. That one operation is 150,528 × 512 = 77 million multiply-adds. And that's just one layer. A ResNet-50 has 25 million parameters doing this over and over, layer after layer.
Every single one of those multiply-adds is linear algebra. Matrix multiplication. Dot products. Vector norms. If you don't understand these operations, neural networks are a black box. If you do, every architecture paper becomes readable, every shape error becomes diagnosable, and every optimization trick makes intuitive sense.
This isn't abstract mathematics. This is the actual computation running on GPUs right now, powering ChatGPT, Stable Diffusion, AlphaFold, and every other ML system you've heard of. We're going to build it from scratch.
Let's start with the smallest possible example. Forget images — imagine your input is just two numbers: x = [2, 3]. Maybe they represent height and weight of a person. Your neural network layer has a weight matrix W and a bias vector b. The layer computes output = W · x + b.
Here are the weights (a 3×2 matrix, because we want 3 outputs from 2 inputs):
And the biases (one per output):
Now we compute each output element. Each output is a dot product of one row of W with the input x, plus the corresponding bias:
So our raw output is [1.4, 1.3, 0.7]. Then we apply ReLU (replace negatives with zero): max(0, [1.4, 1.3, 0.7]) = [1.4, 1.3, 0.7]. All values were already positive, so nothing changes. But if any output had been negative — say −0.6 — ReLU would clamp it to 0.
That's it. That's a neural network layer. Two inputs in, three outputs out. Every large model does exactly this, just with bigger matrices.
You might think convolution (used in image models) is a different kind of operation. It's not. Let's see why. A Sobel edge detector is a 3×3 kernel that finds vertical edges:
We flatten a 3×3 image patch into a vector too:
Now we compute the dot product — the same operation as before:
This negative value means the left side of the patch is darker than the right side. The kernel detected an edge gradient going right-to-left. A positive value would mean left-to-right. Zero means no edge. The convolution output at every position is just a dot product of the kernel with the local patch.
python # A neural network layer from scratch — no libraries def linear_layer(x, W, b): """Compute output = W @ x + b manually.""" out_size = len(W) # number of output neurons in_size = len(x) # number of input features output = [] for i in range(out_size): # Dot product of row i of W with x dot = 0.0 for j in range(in_size): dot += W[i][j] * x[j] dot += b[i] # add bias output.append(dot) return output def relu(x): """ReLU activation: max(0, x) for each element.""" return [max(0, val) for val in x] # Our example x = [2, 3] W = [[0.5, 0.1], [-0.3, 0.7], [0.8, -0.4]] b = [0.1, -0.2, 0.3] raw = linear_layer(x, W, b) print("Raw output:", raw) # [1.4, 1.3, 0.7] activated = relu(raw) print("After ReLU:", activated) # [1.4, 1.3, 0.7]
python import torch import torch.nn as nn # PyTorch does exactly what we did, but on GPU with optimized BLAS layer = nn.Linear(in_features=2, out_features=3) # Set our specific weights and biases with torch.no_grad(): layer.weight = nn.Parameter(torch.tensor( [[0.5, 0.1], [-0.3, 0.7], [0.8, -0.4]] )) layer.bias = nn.Parameter(torch.tensor([0.1, -0.2, 0.3])) x = torch.tensor([2.0, 3.0]) raw = layer(x) print(raw) # tensor([1.4, 1.3, 0.7]) activated = torch.relu(raw) print(activated) # tensor([1.4, 1.3, 0.7]) # Under the hood: output = x @ weight.T + bias manual = x @ layer.weight.T + layer.bias print(manual) # tensor([1.4, 1.3, 0.7]) — identical
The widget below shows an 8×8 image grid with pixel values. A 3×3 Sobel kernel slides across the image. At each position, we compute the dot product of the kernel with the underlying patch. Watch how the output reveals edges in the image.
Press Step to move the kernel one position. Press Play to auto-advance. The output grid on the right shows the dot product at each position.
Notice the pattern: where the image has sharp left-right brightness changes, the output has large absolute values (positive or negative). Where the image is uniform, the output is near zero. That's edge detection — and it's nothing but a sequence of dot products.
What operation is at the heart of every neural network layer?
Before we can manipulate data with linear algebra, we need to name the containers it lives in. There's a clean hierarchy of mathematical objects, and once you internalize it, every single shape error in PyTorch becomes instantly diagnosable. No more staring at RuntimeError: mat1 and mat2 shapes cannot be multiplied wondering what went wrong.
A scalar is a single number. The learning rate 0.001 is a scalar. A loss value 2.34 is a scalar. A scalar has zero dimensions — it's just a point on the number line.
A vector is a list of numbers. The input [2, 3] from Chapter 0 is a 2D vector. A word embedding might be a 768-dimensional vector. A vector has one dimension — you need one index to pick out an element.
A matrix is a grid of numbers with rows and columns. Our weight matrix W from Chapter 0 was a 3×2 matrix. A matrix has two dimensions — you need two indices (row, column) to pick out an element.
A tensor is the generalization to any number of dimensions. A color image is a 3D tensor (height × width × channels). A batch of images is a 4D tensor. Attention scores are a 4D tensor. You need as many indices as there are dimensions.
When you load a batch of images for training, PyTorch represents it as a 4D tensor with shape [B, C, H, W]. Let's say we have a batch of 32 images, each 224×224 pixels with 3 color channels (RGB). The shape is:
What does each dimension mean?
| Dimension | Size | Meaning |
|---|---|---|
| 0 (batch) | 32 | Number of images in this batch |
| 1 (channels) | 3 | Color channels: Red, Green, Blue |
| 2 (height) | 224 | Rows of pixels |
| 3 (width) | 224 | Columns of pixels |
How many total numbers are stored? Multiply all dimensions together:
That's 4.8 million floating-point numbers just for one batch of input. At 4 bytes each (float32), that's 4,816,896 × 4 = 19,267,584 bytes ≈ 19.3 MB of GPU memory. For the input alone. The model weights, gradients, optimizer states, and activations all pile on top.
Let's trace the shapes through a transformer's attention mechanism. These are real numbers from a GPT-2-small-style model: batch size B = 8, number of attention heads h = 12, sequence length T = 512, model dimension dmodel = 768, and key dimension dk = 768 / 12 = 64.
The query matrix Q has shape [B, h, T, dk] = [8, 12, 512, 64]. This means: for each of 8 sequences in the batch, for each of 12 attention heads, for each of 512 token positions, we have a 64-dimensional query vector.
The key matrix K has the same shape: [8, 12, 512, 64].
When we compute attention scores Q · KT, the last two dimensions participate in the matrix multiply:
That attention score matrix has 8 × 12 × 512 × 512 = 25,165,824 entries. Per layer. For just 512 tokens. This is why transformers have quadratic memory in sequence length — doubling the sequence length quadruples the attention matrix size.
python import torch # Scalar — 0 dimensions s = torch.tensor(3.14) print(s.shape) # torch.Size([]) — empty shape = scalar print(s.dim()) # 0 print(s.numel()) # 1 — one element total # Vector — 1 dimension v = torch.tensor([1.0, 2.0, 3.0]) print(v.shape) # torch.Size([3]) print(v.dim()) # 1 print(v.numel()) # 3 # Matrix — 2 dimensions M = torch.randn(3, 4) print(M.shape) # torch.Size([3, 4]) print(M.dim()) # 2 print(M.numel()) # 12 # 4D Tensor — image batch batch = torch.randn(32, 3, 224, 224) print(batch.shape) # torch.Size([32, 3, 224, 224]) print(batch.dim()) # 4 print(batch.numel()) # 4816896 # Memory usage print(batch.element_size()) # 4 (bytes per float32) print(batch.nelement() * batch.element_size()) # 19267584 bytes ≈ 19.3 MB
python import torch # MISTAKE 1: confusing len() with numel() t = torch.randn(3, 4, 5) print(len(t)) # 3 — size of FIRST dimension only! print(t.numel()) # 60 — total number of elements (3×4×5) # MISTAKE 2: .shape vs .size() — they're the SAME thing print(t.shape) # torch.Size([3, 4, 5]) print(t.size()) # torch.Size([3, 4, 5]) — identical print(t.size(1)) # 4 — size of dimension 1 print(t.shape[1]) # 4 — also works # MISTAKE 3: forgetting the batch dimension # nn.Linear expects [batch, features], not just [features] layer = torch.nn.Linear(10, 5) x_bad = torch.randn(10) # shape [10] — might work but risky x_good = torch.randn(1, 10) # shape [1, 10] — explicit batch dim
len(tensor) to get the total number of elements. It doesn't. len(torch.randn(3, 4, 5)) returns 3, not 60. It only gives you the size of the first dimension. Use .numel() for total elements, or math.prod(tensor.shape). This mistake is especially dangerous in loss computation where you might divide by the wrong count.Click the buttons below to build up from a scalar to a 4D tensor. Watch the shape and total element count change at each level. Each click adds one dimension to the structure.
Click the buttons to switch between dimensionalities. The visualization shows the structure and shape at each level.
A tensor with shape [16, 8, 256, 256] has how many total elements?
Now that we know vectors are ordered lists of numbers, we need operations on them. Adding vectors, scaling them, measuring their length. These aren't abstract — the L2 norm IS the loss function in regression. The L1 norm IS the regularization in Lasso. Cosine similarity IS what makes semantic search work. Every operation we learn here has a direct, concrete use in machine learning.
Let's start with the basics. Vector addition adds corresponding elements. Scalar multiplication multiplies every element by a number. Vector subtraction is just addition with a negated vector. These are the building blocks of everything.
Given two 3D vectors a = [3, −1, 4] and b = [1, 5, −2], let's compute everything:
Addition — add element by element:
Scalar multiplication — multiply every element by the scalar:
Subtraction — subtract element by element:
In ML, these operations happen constantly. When you compute a residual connection in a transformer, you're adding two vectors: output = layer(x) + x. When you apply a learning rate, you're scaling a gradient vector: w = w − η · ∇L. When you compute a loss gradient, you're subtracting the prediction from the target: error = ŷ − y.
A norm measures the "size" or "length" of a vector. Different norms measure size in different ways, and the choice of norm has profound effects on your model. Let's compute all three for v = [3, −4, 0, 5].
L1 norm (Manhattan distance / taxi distance) — sum of absolute values:
Think of walking on a city grid — you can only move along axes, never diagonally. The L1 norm is the total distance walked.
L2 norm (Euclidean distance / straight-line distance) — square root of sum of squares:
This is the "as the crow flies" distance. It's the Pythagorean theorem generalized to any number of dimensions.
L∞ norm (max norm / Chebyshev distance / king's distance) — largest absolute value:
Think of a king on a chessboard — it can move diagonally, so the distance is just the farthest it needs to go in any single direction.
When training a model, we add a regularization term to the loss function to prevent overfitting. The choice of norm determines the behavior:
L1 regularization (Lasso): loss = MSE + λ · ||w||1. The L1 norm pushes weights to exactly zero. This gives you sparse solutions — many weights become literally 0.0, effectively performing feature selection. If you have 1000 features but only 50 matter, L1 will zero out the other 950.
L2 regularization (Ridge / weight decay): loss = MSE + λ · ||w||2². The L2 norm pushes weights to be small but never exactly zero. All weights shrink toward zero proportionally. This is why it's called weight decay in PyTorch — it literally multiplies all weights by (1 − λ) each step.
python import math def l1_norm(v): """Sum of absolute values.""" return sum(abs(x) for x in v) def l2_norm(v): """Square root of sum of squares.""" return math.sqrt(sum(x ** 2 for x in v)) def linf_norm(v): """Maximum absolute value.""" return max(abs(x) for x in v) v = [3, -4, 0, 5] print(f"L1: {l1_norm(v)}") # 12 print(f"L2: {l2_norm(v):.4f}") # 7.0711 print(f"L∞: {linf_norm(v)}") # 5
python import numpy as np import torch v_np = np.array([3, -4, 0, 5], dtype=np.float64) v_pt = torch.tensor([3.0, -4.0, 0.0, 5.0]) # NumPy norms print(np.linalg.norm(v_np, ord=1)) # 12.0 print(np.linalg.norm(v_np, ord=2)) # 7.0710... print(np.linalg.norm(v_np, ord=np.inf)) # 5.0 # PyTorch norms print(torch.linalg.norm(v_pt, ord=1)) # tensor(12.) print(torch.linalg.norm(v_pt, ord=2)) # tensor(7.0711) print(torch.linalg.norm(v_pt, ord=float('inf'))) # tensor(5.) # In practice: L2 regularization = weight decay in AdamW optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.01) # This adds 0.01 * ||w||₂² to the loss automatically
The "unit ball" for a norm is the set of all vectors with norm ≤ 1. In 2D, L1 gives a diamond, L2 gives a circle, and L∞ gives a square. Drag the vector endpoint below to see how all three norms change, and toggle the unit ball shapes to see their geometry.
Drag the orange dot to move the vector. Toggle the unit ball shapes with the buttons. Watch how the three norm values change.
Which norm would you use if you want some model weights to become exactly zero (feature selection)?
The dot product is the single most important operation in machine learning. It's how attention works in transformers. It's how similarity search finds related documents. It's how every neural network layer transforms data. If you master one operation from this entire lesson, make it this one.
The dot product takes two vectors of the same length and produces a single number. You multiply corresponding elements and sum the results. That's it. But from this simple recipe comes an astonishing amount of machine learning.
Given a = [1, 2, 3] and b = [4, −5, 6]:
Each pair of corresponding elements gets multiplied, then we sum everything up. The result is a scalar — a single number. The sign tells us something important: positive means the vectors generally point in the same direction, negative means they point in opposite directions, and zero means they're perpendicular.
Cosine similarity normalizes the dot product by the vectors' lengths, giving a value between −1 and +1 regardless of vector magnitude. This is crucial when comparing word embeddings, because we care about direction (meaning), not magnitude (which is often an artifact of word frequency).
Using the same vectors a = [1, 2, 3] and b = [4, −5, 6]:
Step 1 — compute the L2 norms:
Step 2 — divide the dot product by the product of norms:
Step 3 — recover the angle:
A cosine of 0.365 means these vectors are somewhat aligned but far from parallel. In NLP, two word embeddings with cos = 0.365 would indicate a weak semantic relationship.
The dot product has a beautiful geometric interpretation. It equals the product of the vectors' lengths times the cosine of the angle between them:
This means three cases govern the sign of the dot product:
| Angle θ | cos(θ) | Dot product | Meaning |
|---|---|---|---|
| 0° (same direction) | 1 | Maximum positive | Vectors are parallel |
| 90° (perpendicular) | 0 | Zero | Vectors are orthogonal |
| 180° (opposite) | −1 | Maximum negative | Vectors are antiparallel |
In a transformer, each token produces a query vector (what am I looking for?) and a key vector (what do I offer?). The attention score between two tokens is the dot product of their query and key vectors, scaled by √dk.
Let's trace this with real numbers. Suppose dk = 4 (tiny, for illustration):
Compute raw attention scores (dot products):
Score 1 (1.82) is much higher than score 2 (−0.93). After softmax normalization, token "cat" gets much more attention weight than token "and." The query vector of "the" is more aligned with the key vector of "cat" — which makes linguistic sense, since "the cat" is a common phrase.
This is the entire mechanism of attention. Dot products measure alignment between query and key vectors. Softmax converts scores to probabilities. The model learns the projection matrices that produce useful query and key vectors through backpropagation.
python import math def dot_product(a, b): """Sum of element-wise products.""" assert len(a) == len(b), "Vectors must have same length" return sum(ai * bi for ai, bi in zip(a, b)) def l2_norm(v): return math.sqrt(sum(x ** 2 for x in v)) def cosine_similarity(a, b): """Dot product normalized by magnitudes. Returns [-1, 1].""" dot = dot_product(a, b) norm_a = l2_norm(a) norm_b = l2_norm(b) if norm_a == 0 or norm_b == 0: return 0.0 return dot / (norm_a * norm_b) # Our worked example a = [1, 2, 3] b = [4, -5, 6] print(f"Dot product: {dot_product(a, b)}") # 12 print(f"Cosine sim: {cosine_similarity(a, b):.4f}") # 0.3651 angle = math.degrees(math.acos(cosine_similarity(a, b))) print(f"Angle: {angle:.1f}°") # 68.6°
python import torch import torch.nn.functional as F a = torch.tensor([1.0, 2.0, 3.0]) b = torch.tensor([4.0, -5.0, 6.0]) # Dot product print(torch.dot(a, b)) # tensor(12.) print((a * b).sum()) # tensor(12.) — manual way # Cosine similarity (needs 2D input for F.cosine_similarity) cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)) print(cos) # tensor([0.3651]) # Attention scores (batched dot products) q = torch.tensor([[1.0, 0.5, -0.3, 0.8]]) # [1, 4] K = torch.tensor([[0.9, 0.6, -0.2, 0.7], # [2, 4] [-0.5, 0.1, 0.8, -0.3]]) scores = q @ K.T # [1, 2] print(scores) # tensor([[1.82, -0.93]]) attn = F.softmax(scores / (4 ** 0.5), dim=-1) print(attn) # token "cat" gets ~87% attention
torch.dot(a,b) vs a * b. The dot product = element-wise multiply followed by sum. Confusing these causes silent bugs where your loss is a vector instead of a scalar.Drag the endpoints of the two vectors below. Watch the dot product, cosine similarity, and angle update in real time. The background tints green for positive dot product (same direction), red for negative (opposite), and stays dark for near-zero (perpendicular). The dashed line shows the projection of a onto b.
Two word embeddings have cosine similarity = −0.8. What does this mean?
This is it. The operation that runs the entire field of deep learning. Every forward pass through a neural network, every attention computation in a transformer, every linear layer in every model — matrix multiplication. We're going to derive it by hand, implement it from scratch, and then see why GPUs make it fast.
Matrix multiplication takes two matrices and produces a third. The rule is: element (i, j) of the result is the dot product of row i of the first matrix and column j of the second. That's it. Every element of the output is just a dot product — the operation we already mastered in Chapter 3.
Let A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]. The result C = A × B is a 2×2 matrix. Let's compute each element:
C[0][0] = row 0 of A · col 0 of B:
C[0][1] = row 0 of A · col 1 of B:
C[1][0] = row 1 of A · col 0 of B:
C[1][1] = row 1 of A · col 1 of B:
The full result:
The pattern: C[i][j] = dot product of row i of A and column j of B. This is the only formula you need. Everything else follows from it.
What if the matrices aren't square? The shape rule is: (m × n) × (n × p) → (m × p). The inner dimensions must match, and they "cancel out." If A is 2×3 and B is 3×2, the result is 2×2.
A = [[1, 0, 2], [−1, 3, 1]], B = [[3, 1], [2, 1], [1, 0]]
Shape check: (2×3) × (3×2) → (2×2). The inner dimensions (3 and 3) match. Good.
C[0][0] = [1, 0, 2] · [3, 2, 1]:
C[0][1] = [1, 0, 2] · [1, 1, 0]:
C[1][0] = [−1, 3, 1] · [3, 2, 1]:
C[1][1] = [−1, 3, 1] · [1, 1, 0]:
Result: C = [[5, 1], [4, 2]]
A neural network's linear layer computes output = W @ x + b. For a layer going from 784 inputs (a flattened 28×28 MNIST digit) to 128 hidden units:
| Matrix | Shape | Meaning |
|---|---|---|
| W (weights) | [128, 784] | 128 neurons, each connecting to 784 inputs |
| x (input) | [784, 1] | One flattened image |
| W @ x | [128, 1] | 128 pre-activation values |
| b (bias) | [128, 1] | One bias per neuron |
The multiply-add count: 128 × 784 = 100,352 multiply-adds for one image through one layer. For a batch of 64 images, that's 6.4 million multiply-adds. For a network with 5 layers, that's 32 million. And MNIST is a tiny dataset with tiny images — real models do billions.
python def matmul(A, B): """Matrix multiplication from scratch. A is m×n, B is n×p → result is m×p. """ m = len(A) # rows of A n = len(A[0]) # cols of A = rows of B p = len(B[0]) # cols of B # Verify inner dimensions match assert len(B) == n, f"Shape mismatch: A is {m}×{n}, B has {len(B)} rows" # Initialize result with zeros C = [[0.0] * p for _ in range(m)] # The triple loop: O(m × n × p) for i in range(m): # each row of A for j in range(p): # each column of B for k in range(n): # dot product over shared dim C[i][j] += A[i][k] * B[k][j] return C # Test with our first example A = [[1, 2], [3, 4]] B = [[5, 6], [7, 8]] C = matmul(A, B) print(C) # [[19, 22], [43, 50]] ✓ # Test with non-square example A2 = [[1, 0, 2], [-1, 3, 1]] B2 = [[3, 1], [2, 1], [1, 0]] C2 = matmul(A2, B2) print(C2) # [[5, 1], [4, 2]] ✓
python import numpy as np import torch # NumPy A = np.array([[1, 2], [3, 4]]) B = np.array([[5, 6], [7, 8]]) print(A @ B) # [[19, 22], [43, 50]] print(np.matmul(A, B)) # same thing # PyTorch At = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32) Bt = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32) print(At @ Bt) # tensor([[19., 22.], [43., 50.]]) print(torch.mm(At, Bt)) # 2D only — same result print(torch.matmul(At, Bt)) # works for any dimensionality # Neural network layer: F.linear does x @ W.T + b x = torch.randn(64, 784) # batch of 64 MNIST images W = torch.randn(128, 784) # weight matrix b = torch.randn(128) # bias out = x @ W.T + b # [64, 128] — 64 outputs, 128 dims each
The widget below shows two matrices being multiplied. The current row of A is highlighted in orange, and the current column of B in teal. The dot product between them fills in the corresponding element of the result. Press Step to advance one element, or Play for automatic animation.
What is the shape of the result when multiplying a [3, 5] matrix by a [5, 2] matrix?
Some matrices are special enough to get names. You'll see these constantly in ML papers and codebases: the identity matrix (skip connections, initialization), diagonal matrices (efficient scaling), symmetric matrices (covariance, Hessians), and orthogonal matrices (preserving gradient norms). And the inverse — the matrix that "undoes" a multiplication — is how we solve systems of equations in closed form.
The identity matrix I is the matrix equivalent of the number 1. For any matrix A: A · I = I · A = A. It's a square matrix with 1s on the diagonal and 0s everywhere else. In ML, it appears in skip connections (ResNets add the identity: y = F(x) + x) and in initialization (starting weight matrices near identity helps gradient flow).
A diagonal matrix has nonzero values only on the main diagonal. Multiplying by a diagonal matrix just scales each dimension independently — it's O(n) instead of O(n²). Attention masks, layer normalization scaling, and the λI regularization term all use diagonal structure.
A symmetric matrix satisfies A = AT: it equals its own transpose. Covariance matrices are always symmetric (the correlation of X with Y equals the correlation of Y with X). The Hessian (second-derivative matrix) is symmetric too. Symmetric matrices have real eigenvalues and orthogonal eigenvectors — properties that make optimization theory work.
An orthogonal matrix satisfies QTQ = QQT = I: its transpose is its inverse. Orthogonal matrices preserve lengths and angles — they're pure rotations (and reflections). Using orthogonal initialization for weight matrices prevents gradients from exploding or vanishing in deep networks.
A positive definite matrix satisfies xTAx > 0 for every nonzero vector x. This means the matrix defines a "bowl-shaped" surface with a unique minimum — exactly what you want for convex optimization. Covariance matrices are positive semi-definite (allowing zero, not just positive).
| Matrix Type | Property | ML Use |
|---|---|---|
| Identity I | AI = IA = A | Skip connections, initialization |
| Diagonal | Only diagonal nonzero | Efficient scaling, attention masks |
| Symmetric | A = AT | Covariance matrices, Hessians |
| Orthogonal | QTQ = I | Weight init (preserves gradient norms) |
| Positive definite | xTAx > 0 | Covariance, convex optimization |
The inverse of a matrix A, written A−1, is the matrix that "undoes" A. If you multiply a vector by A, then by A−1, you get back the original vector: A−1(Ax) = x. Not every matrix has an inverse — only square matrices with nonzero determinant are invertible.
For a 2×2 matrix A = [[a, b], [c, d]], the inverse has a clean formula:
where det(A) = ad − bc. Let's work through A = [[4, 7], [2, 6]]:
Step 1: Compute the determinant:
Step 2: Apply the formula (swap diagonal, negate off-diagonal, divide by det):
Step 3: Verify by multiplying A × A−1 — we should get the identity matrix:
Not every matrix can be inverted. Consider A = [[1, 2], [2, 4]]:
The determinant is zero — division by zero. No inverse exists. But why, geometrically? Look at the rows: row 2 = 2 × row 1. The two rows are linearly dependent — they point in the same direction. This matrix "collapses" 2D space onto a 1D line. Multiple input vectors get mapped to the same output, so there's no way to reverse the mapping. Information is destroyed.
In ML, this matters. If your feature matrix X has linearly dependent columns (say, one feature is exactly twice another), then XTX is singular and cannot be inverted. The normal equation for linear regression w = (XTX)−1XTy won't work. That's one reason we add regularization.
The optimal weights for linear regression minimize the squared error ||Xw − y||². Taking the derivative, setting it to zero, and solving gives:
Let's trace why this works. We want to minimize L(w) = ||Xw − y||². Expanding: L = (Xw − y)T(Xw − y). Taking the gradient with respect to w and setting it to zero:
But what if XTX is singular (linearly dependent features)? We can't invert it. The solution: add ridge regularization:
Adding λI to the diagonal guarantees that XTX + λI is invertible (its eigenvalues are all shifted up by λ > 0, so none can be zero). This is another reason regularization isn't just about preventing overfitting — it makes the math work.
python def det2x2(A): """Determinant of a 2×2 matrix.""" return A[0][0] * A[1][1] - A[0][1] * A[1][0] def inv2x2(A): """Inverse of a 2×2 matrix. Returns None if singular.""" d = det2x2(A) if abs(d) < 1e-12: return None # singular — no inverse return [ [ A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d] ] # Example 1: invertible A = [[4, 7], [2, 6]] print(f"det(A) = {det2x2(A)}") # 10 Ainv = inv2x2(A) print(f"A⁻¹ = {Ainv}") # [[0.6, -0.7], [-0.2, 0.4]] # Verify: A @ A⁻¹ should be identity I = matmul(A, Ainv) # using our matmul from Ch4 print(f"A × A⁻¹ = {I}") # [[1.0, 0.0], [0.0, 1.0]] ✓ # Example 2: singular B = [[1, 2], [2, 4]] print(f"det(B) = {det2x2(B)}") # 0 print(inv2x2(B)) # None — no inverse
python import numpy as np import torch A = np.array([[4.0, 7.0], [2.0, 6.0]]) # NumPy inverse and determinant print(np.linalg.det(A)) # 10.0 print(np.linalg.inv(A)) # [[ 0.6, -0.7], [-0.2, 0.4]] print(A @ np.linalg.inv(A)) # [[1, 0], [0, 1]] (identity) # PyTorch At = torch.tensor([[4.0, 7.0], [2.0, 6.0]]) print(torch.linalg.det(At)) # tensor(10.) print(torch.linalg.inv(At)) # same result # Ridge regression: w = (X^T X + λI)^{-1} X^T y X = torch.randn(100, 5) # 100 samples, 5 features y = torch.randn(100, 1) # targets lam = 0.01 XtX = X.T @ X # [5, 5] reg = XtX + lam * torch.eye(5) # add λI for invertibility w = torch.linalg.inv(reg) @ X.T @ y # [5, 1] — closed-form solution print(w.shape) # torch.Size([5, 1])
np.linalg.inv() or torch.linalg.inv(). Matrix inversion is O(n³), numerically unstable, and unnecessary. Instead we use: (1) Gradient descent (let the optimizer find w iteratively), (2) Cholesky decomposition (torch.linalg.cholesky_ex), (3) SVD pseudoinverse (torch.linalg.pinv), or (4) QR decomposition (torch.linalg.qr). The closed-form solution is for understanding. In practice, SGD always wins for large problems.LinAlgError: Singular matrix, it means your features are linearly dependent — one column is a combination of others. Fix it by removing redundant features or adding regularization.Use the four sliders to control a 2×2 matrix [[a, b], [c, d]]. The widget shows how the matrix transforms the unit square into a parallelogram. When the determinant approaches zero, the parallelogram collapses toward a line. The inverse matrix and eigenvalues update in real time.
Adjust the matrix entries. Watch the transformed shape, determinant, inverse, and eigenvalues change. Try making the determinant zero — what happens?
A matrix has determinant = 0. What does this mean?
Imagine you're running a survey with 100 questions, hoping to measure 100 different aspects of personality. After collecting data from 10,000 people, you discover something surprising: the answers to questions 3, 17, and 42 are almost perfectly correlated. If someone scores high on question 3, you can predict their answers to 17 and 42. Those three questions are really measuring the same underlying trait.
That's the intuition behind linear independence. Some vectors (or measurements, or features) are redundant — they can be written as combinations of others. The rank of a matrix tells you how many truly independent directions it contains. Strip away the redundancy, and rank tells you the real dimensionality of your data.
This idea is the foundation of dimensionality reduction (PCA), low-rank adaptation (LoRA), and matrix completion (Netflix recommendations). If you don't understand rank, these techniques are magic. Once you do, they're obvious.
Two vectors are linearly dependent if one is a scalar multiple of the other — they point in the same (or opposite) direction. Three vectors are dependent if any one can be written as a combination of the other two.
More precisely: vectors v1, v2, ..., vk are linearly independent if the ONLY solution to c1v1 + c2v2 + ... + ckvk = 0 is c1 = c2 = ... = ck = 0. In plain English: you can't make the zero vector from them unless you use all-zero coefficients. No vector is "free" — none can be built from the others.
Are v1 = [1, 2] and v2 = [2, 4] linearly independent?
Check: is v2 a multiple of v1? Yes: [2, 4] = 2 × [1, 2]. They point in the exact same direction. One is redundant.
What about v1 = [1, 0] and v2 = [0, 1]? There's no scalar c such that [1, 0] = c × [0, 1]. The first vector points along x, the second along y. Neither is a multiple of the other.
Here's the geometric test: two 2D vectors are independent if and only if they point in different directions (not parallel). The area of the parallelogram they form is nonzero. When they're dependent, the parallelogram collapses to a line — zero area.
The rank of a matrix is the number of linearly independent rows (equivalently, columns). We find it by row reduction (Gaussian elimination) — systematically eliminating redundancy until we see how many independent rows remain.
Find the rank of:
Step 1: Eliminate below the first pivot. R2 = R2 − 2×R1:
Row 2 became all zeros! That means R2 was exactly 2×R1 — completely redundant.
Step 2: Swap R2 and R3 to move the zero row to the bottom:
Result: Two nonzero rows ⇒ rank = 2. The 3×3 matrix only has 2 independent directions. It maps 3D space into a 2D subspace. One dimension is lost.
The rank controls whether the system Ax = b has solutions, and how many.
| Condition | Meaning | Solutions |
|---|---|---|
| rank(A) = n (# columns) | All columns independent | At most one solution |
| rank(A) < n | Some columns redundant | Infinitely many solutions (underdetermined) |
| rank(A) < m (# rows) | Some rows redundant | Some b have NO solution (overdetermined) |
| rank(A) = m = n | Full rank, square | Exactly one solution for every b |
In ML, we're almost always in the underdetermined case. A neural network with millions of parameters and thousands of training examples has far more unknowns than equations. There are infinitely many parameter settings that fit the data — regularization picks which one.
When people say "the data lies on a lower-dimensional manifold," they mean the data matrix has low effective rank. Consider a dataset of face images, each 100×100 = 10,000 pixels. If you stack 50,000 faces into a 50,000×10,000 matrix, it might have rank ~50. Why? Because faces only vary along about 50 independent factors: pose, lighting, expression, identity, skin tone. The other 9,950 "dimensions" are redundant combinations of these factors.
PCA discovers these factors. It finds the rank-k approximation that captures the most variance.
LoRA (Low-Rank Adaptation) exploits the same insight for fine-tuning. Instead of updating a full weight matrix W (e.g., 4096×4096 = 16.7M parameters), LoRA parameterizes the update as ΔW = B × A where B is 4096×r and A is r×4096, with r = 4 or 8. That's only 2×4096×8 = 65,536 parameters instead of 16.7 million. The key assumption: the useful update has low rank.
python def rank_via_row_reduction(matrix): """Compute rank by Gaussian elimination (row echelon form).""" # Make a copy so we don't modify the original A = [row[:] for row in matrix] m, n = len(A), len(A[0]) pivot_row = 0 for col in range(n): # Find a nonzero entry in this column at or below pivot_row found = None for row in range(pivot_row, m): if abs(A[row][col]) > 1e-10: found = row break if found is None: continue # No pivot in this column # Swap the found row into the pivot position A[pivot_row], A[found] = A[found], A[pivot_row] # Eliminate all entries below the pivot for row in range(pivot_row + 1, m): factor = A[row][col] / A[pivot_row][col] for j in range(n): A[row][j] -= factor * A[pivot_row][j] pivot_row += 1 return pivot_row # Number of nonzero rows = rank # Test with our example A = [[1, 2, 3], [2, 4, 6], [0, 1, 1]] print("Rank:", rank_via_row_reduction(A)) # 2
python import numpy as np A = np.array([[1, 2, 3], [2, 4, 6], [0, 1, 1]]) print("Rank:", np.linalg.matrix_rank(A)) # 2 # LoRA example: rank-8 update to a 512x512 weight matrix B = np.random.randn(512, 8) # 512 x 8 A_lora = np.random.randn(8, 512) # 8 x 512 delta_W = B @ A_lora # 512 x 512 but rank 8 print("Shape:", delta_W.shape) # (512, 512) print("Rank:", np.linalg.matrix_rank(delta_W)) # 8 print("Params (full):", 512*512) # 262,144 print("Params (LoRA):", 512*8 + 8*512) # 8,192 (32x fewer!)
The widget below shows two 2D vectors that you can drag. When the vectors are independent, they span a parallelogram with nonzero area. When they become parallel (dependent), the parallelogram collapses to a line — zero area. The area equals the absolute value of the determinant, which is zero precisely when the vectors are dependent.
Drag the arrow tips to move vectors. Watch the determinant and rank update. When vectors are parallel, the area collapses to zero.
A 100×100 matrix has rank 5. What does this tell you about the data?
You're standing on a flat field with a compass. Someone tells you: "Walk 3 steps North and 2 steps East." You end up at a specific point. Now here's the key question: could you reach any point on the field using just North and East steps? Yes — any combination of North and East can take you anywhere on the 2D field. North and East form a basis for 2D space.
But what if someone only gave you North and Northeast? You can still reach any point — just combine them differently. What about North and South? No — those are opposite directions (dependent), so you can only move along the North-South line. You're stuck on a 1D subspace.
A basis is the minimum set of vectors you need to describe every point in a space. Changing the basis doesn't change the geometry — just the coordinate system. This is the single most important idea in linear algebra for ML, because every learned representation is a choice of basis.
The span of a set of vectors is the set of ALL points you can reach by combining them with any coefficients. If you have vectors v1 and v2, their span is:
If v1 and v2 are independent 2D vectors, their span is ALL of 2D space. If they're parallel (dependent), the span is just a line through the origin.
In 3D: span({[1,0,0], [0,1,0]}) = the xy-plane. You can reach any point on that plane, but you can never leave it — you can't reach [0,0,1]. Adding [0,0,1] to the set gives you span of all 3D space. But adding [2,3,0] doesn't help — it's already in the span of the first two vectors (it lies in the xy-plane).
The standard basis in 2D is e1 = [1, 0] and e2 = [0, 1]. These are the unit vectors along the x and y axes. Any point p can be written:
The coordinates [3, 2] ARE the coefficients. This is so natural that we don't even think about it — the coordinate (3, 2) directly means "3 units along e1 and 2 units along e2." But this is only true because we're using the standard basis. Change the basis, and the same point gets different coordinates.
Let's define a new basis by rotating 45°:
These are orthonormal (perpendicular and unit length). The same point p = [3, 2] has new coordinates found by projection (dot products with the basis vectors):
Let's verify: reconstruct p from the new coordinates:
The point didn't move. Its geometric location is exactly the same. Only the numbers describing it changed — from [3, 2] to [3.536, −0.707]. Different coordinates, same point.
A subspace is a subset of a vector space that is itself a vector space — it must contain the zero vector, and it must be closed under addition and scalar multiplication. In plain language: a subspace is a flat sheet (possibly lower-dimensional) passing through the origin.
Examples in 3D:
| Subspace | Dimension | What it looks like |
|---|---|---|
| {0} | 0 | Just the origin |
| span([1,0,0]) | 1 | The x-axis (a line) |
| span([1,0,0],[0,1,0]) | 2 | The xy-plane |
| span([1,0,0],[0,1,0],[0,0,1]) | 3 | All of 3D space |
NOT a subspace: {[x, y] : x ≥ 0}. This is the right half-plane, but it's not closed under scalar multiplication (multiply [1, 0] by −1 and you leave the set).
When word2vec maps "king" to a 768-dimensional vector [0.2, −0.5, 0.8, ...], those 768 numbers are coordinates in a learned basis. The model discovered 768 basis directions during training — abstract semantic axes that capture meaning.
The famous analogy king − man + woman ≈ queen works because the learned basis captures semantic directions. There's a "gender" direction, a "royalty" direction, and so on. Subtracting "man" and adding "woman" moves along the gender axis while preserving the royalty axis.
PCA is explicitly a basis change. It rotates from the standard basis to a basis aligned with the principal directions of variance in your data. The first principal component is the direction of maximum variance; the second is the direction of maximum remaining variance, perpendicular to the first; and so on.
python import math def dot(a, b): return sum(x*y for x, y in zip(a, b)) def scale(c, v): return [c*x for x in v] def add(a, b): return [x+y for x, y in zip(a, b)] # Standard basis e1, e2 = [1, 0], [0, 1] p = [3, 2] print("Standard coords:", p) # [3, 2] # 45-degree rotated basis s = 1 / math.sqrt(2) b1, b2 = [s, s], [-s, s] # Project p onto new basis (works for orthonormal bases) c1 = dot(p, b1) c2 = dot(p, b2) print(f"New coords: [{c1:.3f}, {c2:.3f}]") # [3.536, -0.707] # Reconstruct: c1*b1 + c2*b2 should equal p reconstructed = add(scale(c1, b1), scale(c2, b2)) print("Reconstructed:", [round(x, 6) for x in reconstructed]) # [3.0, 2.0]
python import numpy as np # Basis change via matrix multiplication p = np.array([3.0, 2.0]) theta = np.pi / 4 # 45 degrees # Rotation matrix (columns are new basis vectors) R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]) # New coords = R^T @ p (inverse of orthogonal matrix is transpose) new_coords = R.T @ p print("New coords:", new_coords) # [ 3.536 -0.707] # Reconstruct print("Reconstructed:", R @ new_coords) # [3. 2.]
The widget below shows a single point on a 2D plane. Toggle between the standard basis (gray grid) and a custom basis (you can drag the basis vectors). The point stays in exactly the same place — only the coordinate numbers change. This is what "basis change" means geometrically: nothing moves.
Drag the basis vectors (colored arrows) to define a custom basis. Watch how the coordinates change while the point (white dot) stays fixed. Toggle between standard and custom grids.
You change the basis of your data from the standard basis to PCA components. Do the actual data points move?
Most matrices, when applied to a vector, both stretch and rotate it. The vector comes out pointing in a completely different direction, with a different length. But for every matrix, there are special vectors that only get stretched — not rotated at all. These are eigenvectors, and the stretching factor is the eigenvalue.
Why should you care? Because eigenvectors reveal the natural axes of a transformation. PCA finds the directions of maximum variance in your data — those are eigenvectors of the covariance matrix. The stability of a dynamical system depends on eigenvalues. PageRank, spectral clustering, and the Hessian of your loss function all boil down to eigenproblems.
Av = λv. The scalar λ is the eigenvalue. The vector doesn't rotate — it stays on its own line. Every other vector gets both stretched AND rotated.For the equation (A - λI)v = 0 to have a non-trivial solution (v ≠ 0), the matrix (A - λI) must be singular — its determinant must be zero. This gives us the characteristic equation:
Let's find the eigenvalues and eigenvectors of this matrix:
Step 1: Set up det(A − λI) = 0.
Subtract λ from each diagonal entry:
Step 2: Compute the determinant.
For a 2×2 matrix [[a,b],[c,d]], det = ad − bc:
Expand (2−λ)(2−λ):
Step 3: Solve the characteristic equation.
Factor: (λ − 1)(λ − 3) = 0. So λ&sub1; = 1 and λ&sub2; = 3.
Step 4: Find eigenvectors for each eigenvalue.
For λ&sub1; = 1, solve (A − 1·I)v = 0:
Both rows say the same thing: v&sub1; + v&sub2; = 0, so v&sub2; = −v&sub1;. Pick v&sub1; = 1:
For λ&sub2; = 3, solve (A − 3I)v = 0:
Both rows say: −v&sub1; + v&sub2; = 0, so v&sub2; = v&sub1;. Pick v&sub1; = 1:
Step 5: Verify both eigenpairs.
Check Av&sub1; = λ&sub1;v&sub1;:
Check Av&sub2; = λ&sub2;v&sub2;:
Not all matrices have real eigenvalues. Consider the 90° rotation matrix:
Characteristic equation:
This gives λ = ±i (imaginary numbers!). There are no real eigenvectors. Geometrically this makes perfect sense: a 90° rotation moves EVERY vector to a new direction. No vector stays on its own line. Complex eigenvalues always mean the matrix involves rotation.
PCA finds the directions of maximum variance in your data. Here's how it works, step by step, with actual numbers.
Given 5 data points in 2D:
Step 1: Center the data (subtract the mean of each dimension).
Centered points: (−2, −3.04), (−1, −1.54), (0, 0.06), (1, 1.36), (2, 3.16).
Step 2: Compute the covariance matrix C = (1/n)XTX:
Step 3: Find eigenvalues of C.
Using the quadratic formula: λ&sub1; ≈ 6.69, λ&sub2; ≈ 0.003.
Step 4: Interpret. The first eigenvalue (6.69) captures 6.69/(6.69+0.003) ≈ 99.96% of the total variance. The data is almost perfectly 1-dimensional! The eigenvector for λ&sub1; points along the line y ≈ 1.53x — the principal component.
The Hessian is the matrix of second derivatives of the loss function: Hij = ∂²L/(∂wi∂wj). Its eigenvalues tell you about the curvature of the loss landscape:
| Eigenvalue pattern | Meaning | Implication |
|---|---|---|
| All positive | Local minimum (bowl) | Gradient descent converges here |
| All negative | Local maximum (hilltop) | Unstable — optimizer leaves |
| Mixed +/− | Saddle point | Minimum in some directions, maximum in others |
| Large eigenvalue | Sharp curvature | Sensitive to learning rate in that direction |
| Small eigenvalue | Flat direction | Slow learning — gradient is tiny |
A large ratio of max/min eigenvalue (λmax/λmin) means the loss surface is like a narrow ravine — steep walls but a gentle slope along the bottom. This is why adaptive optimizers like Adam work: they effectively rescale each direction by its curvature.
python import math def eigen_2x2(a, b, c, d): """Find eigenvalues and eigenvectors of [[a,b],[c,d]].""" # Characteristic equation: lambda^2 - (a+d)*lambda + (ad-bc) = 0 trace = a + d det = a * d - b * c discriminant = trace**2 - 4 * det if discriminant < 0: return "Complex eigenvalues (rotation!)", None sqrt_d = math.sqrt(discriminant) lam1 = (trace + sqrt_d) / 2 lam2 = (trace - sqrt_d) / 2 # Eigenvectors: solve (A - lambda*I)v = 0 def eigvec(lam): if abs(b) > 1e-10: return [b, lam - a] elif abs(c) > 1e-10: return [lam - d, c] else: return [1, 0] if lam == a else [0, 1] v1 = eigvec(lam1) v2 = eigvec(lam2) # Normalize to unit length for v in [v1, v2]: norm = math.sqrt(v[0]**2 + v[1]**2) v[0] /= norm v[1] /= norm return (lam1, v1), (lam2, v2) # Test with A = [[2, 1], [1, 2]] r = eigen_2x2(2, 1, 1, 2) print(r) # ((3.0, [0.707, 0.707]), (1.0, [-0.707, 0.707]))
For large matrices, we don't solve the characteristic polynomial (it's degree n!). Instead, we use power iteration: start with a random vector, multiply by A repeatedly, and normalize. The vector converges to the eigenvector with the largest eigenvalue.
python import numpy as np def power_iteration(A, n_iters=100): """Find largest eigenvalue/eigenvector via power iteration.""" n = A.shape[0] v = np.random.randn(n) # random start v = v / np.linalg.norm(v) # normalize for _ in range(n_iters): Av = A @ v # multiply v = Av / np.linalg.norm(Av) # normalize eigenvalue = v @ A @ v # Rayleigh quotient return eigenvalue, v A = np.array([[2, 1], [1, 2]]) val, vec = power_iteration(A) print(f"Largest eigenvalue: {val:.4f}") # 3.0000 print(f"Eigenvector: {vec}") # [0.707, 0.707]
python import numpy as np A = np.array([[2, 1], [1, 2]]) # General eigendecomposition eigenvalues, eigenvectors = np.linalg.eig(A) print("Eigenvalues:", eigenvalues) # [3. 1.] print("Eigenvectors (columns):") print(eigenvectors) # [[ 0.707 -0.707] # [ 0.707 0.707]] # For symmetric matrices (like covariance), use eigh # Returns sorted eigenvalues (ascending) and orthogonal eigenvectors vals, vecs = np.linalg.eigh(A) print("Sorted eigenvalues:", vals) # [1. 3.]
A unit circle is transformed by the matrix. Eigenvectors (arrows) align with the axes of the resulting ellipse. Adjust matrix entries with the sliders. Toggle PCA mode to see eigenvectors of random data's covariance matrix.
If there's ONE decomposition to learn for ML, it's the Singular Value Decomposition (SVD). Every matrix A — any shape, any size, any rank — can be decomposed as:
Where U is an orthogonal matrix (m×m), Σ is a diagonal matrix of singular values (m×n), and VT is an orthogonal matrix (n×n). The singular values σ&sub1; ≥ σ&sub2; ≥ ... ≥ 0 are always non-negative and sorted in decreasing order.
Geometrically, SVD says: every linear transformation is a rotation (VT), then a stretching (Σ), then another rotation (U). That's it. Every matrix, no matter how complicated, is just rotate-stretch-rotate.
Start simple. Let A = [[3, 0], [0, 2]].
Step 1: Compute ATA.
Step 2: Find eigenvalues of ATA.
ATA is diagonal, so eigenvalues are just the diagonal entries: 9 and 4.
Singular values are the square roots: σ&sub1; = 3, σ&sub2; = 2.
Step 3: Find V (eigenvectors of ATA).
For a diagonal matrix, eigenvectors are the standard basis vectors:
Step 4: Find U (eigenvectors of AAT).
Same eigenvalues, same eigenvectors: U = I.
Step 5: Build Σ.
Step 6: Verify.
The most powerful property of SVD: you can approximate any matrix by keeping only the top k singular values. Given A = UΣVT, the best rank-k approximation is:
Let's try with a 3×3 matrix. Suppose SVD gives us:
Rank-1 approximation: Keep only σ&sub1; = 10. This captures:
Rank-2 approximation: Keep σ&sub1; = 10 and σ&sub2; = 3.
With just 2 out of 3 components, we capture nearly all the information. The Frobenius norm of the error is just the discarded singular values:
LoRA is one of the most important techniques in modern ML. Instead of fine-tuning all 4096×4096 = 16.7M parameters in a weight matrix W, LoRA decomposes the update ΔW as the product of two small matrices:
If d = 4096 and r = 16:
| Method | Parameters | Memory |
|---|---|---|
| Full fine-tuning | 4096 × 4096 = 16,777,216 | ~64 MB (fp32) |
| LoRA (r=16) | 4096×16 + 16×4096 = 131,072 | ~0.5 MB (fp32) |
| Reduction | 128× fewer parameters | |
Why does this work? Because SVD reveals that weight updates during fine-tuning are approximately low-rank. The paper "LoRA: Low-Rank Adaptation of Large Language Models" showed that the intrinsic dimensionality of fine-tuning updates is surprisingly small — rank 4-16 captures most of the adaptation signal.
Treat a grayscale image (m×n pixels) as a matrix. Compute its SVD. Keep only the top k singular values. Reconstruct. Storage goes from m×n values to k×(m + n + 1) values.
For a 256×256 image:
| Rank k | Storage | Compression ratio |
|---|---|---|
| Full (256) | 65,536 values | 1× |
| k = 50 | 50 × (256 + 256 + 1) = 25,650 | 2.6× |
| k = 20 | 20 × 513 = 10,260 | 6.4× |
| k = 5 | 5 × 513 = 2,565 | 25.5× |
python import numpy as np def svd_from_eigen(A): """Compute SVD of A using eigendecomposition of A^T A.""" # Step 1: Eigendecompose A^T A to get V and singular values AtA = A.T @ A eigenvalues, V = np.linalg.eigh(AtA) # Sort in descending order idx = np.argsort(eigenvalues)[::-1] eigenvalues = eigenvalues[idx] V = V[:, idx] # Step 2: Singular values = sqrt of eigenvalues sigma = np.sqrt(np.maximum(eigenvalues, 0)) # Step 3: U = A V Sigma^{-1} r = np.sum(sigma > 1e-10) # rank U = np.zeros((A.shape[0], r)) for i in range(r): U[:, i] = A @ V[:, i] / sigma[i] return U, sigma[:r], V[:, :r] # Test A = np.array([[3, 0], [0, 2], [0, 0]]) U, s, V = svd_from_eigen(A) print("Singular values:", s) # [3. 2.]
python import numpy as np import torch A = np.array([[3, 2, 1], [1, 4, 2]], dtype=np.float64) # NumPy U, s, Vt = np.linalg.svd(A) print("U shape:", U.shape) # (2, 2) print("s:", s) # [5.47, 1.15] print("Vt shape:", Vt.shape) # (3, 3) # Low-rank reconstruction (rank 1) A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :]) print("Rank-1 approx error:", np.linalg.norm(A - A_rank1)) # PyTorch At = torch.tensor(A) U_t, s_t, Vh_t = torch.linalg.svd(At) print("PyTorch singular values:", s_t)
LEFT: original 32×32 procedural image. RIGHT: rank-k reconstruction. BOTTOM: singular value bar chart. Drag the rank slider and watch the image sharpen. Toggle LoRA mode to see weight matrix decomposition.
You open Netflix. It shows you a row of movies it thinks you'll like. You haven't rated most of them — yet Netflix predicts you'll give The Grand Budapest Hotel a 4.5 and Transformers a 2.1. How? It looked at millions of users, found people with similar taste to yours, and used their ratings to fill in your missing ones.
The mathematical engine behind this is matrix factorization: take a big matrix (users × movies) with mostly missing entries, and break it into the product of two smaller matrices. Those smaller matrices reveal hidden structure — each user gets a "taste vector" and each movie gets a "content vector." Their dot product predicts the rating.
This same idea powers LoRA fine-tuning, topic modeling (NMF), and collaborative filtering in every major recommendation system. It's one of the most practically important techniques in all of ML.
We have 3 users and 3 movies. Some ratings are observed, some are missing (marked ?):
We want to approximate R ≈ U × VT, where U is 3×2 (3 users, 2 latent factors) and V is 3×2 (3 movies, 2 latent factors). The "2" is our choice — we're guessing that movie preferences can be described by 2 hidden dimensions (maybe "action vs drama" and "old vs new").
Start with a guess:
Compute the predicted ratings (UVT):
Our initial guess is imperfect — it nails some entries but misses others. We need to optimize U and V to minimize the error on observed entries. That's where gradient descent comes in.
We only penalize errors on observed entries (we can't compute error on missing ones — that's what we're trying to predict):
The first term is the reconstruction error. The second term is regularization — it prevents U and V from growing too large (overfitting). For our example with λ = 0:
Our goal: adjust U and V to make this loss as small as possible. After optimization, the missing entries get filled with the model's best predictions.
For each observed entry (i, j), the gradient of the squared error with respect to user vector ui is:
This is the error times the movie vector, plus regularization. We subtract this (times the learning rate) from ui. The same logic applies to vj with the roles swapped.
Netflix Prize (2009): 480,000 users × 18,000 movies = 8.6 BILLION possible entries, but only 100 million ratings observed (1.2% fill rate). Factor into U(480K×50) × V(18K×50). Now each user is a 50-dimensional "taste vector" and each movie is a 50-dimensional "content vector." Their dot product predicts the rating.
LoRA is the SAME idea. A pretrained weight matrix W (4096×4096) is updated by ΔW = B×A where B is 4096×r and A is r×4096. With rank r=8, you're saying: "fine-tuning only changes 8 independent directions in weight space." The rest of the 4096 dimensions stay frozen.
python import numpy as np def matrix_factorize(R, k, lr=0.01, reg=0.02, epochs=1000): """Factor ratings matrix R ≈ U @ V.T with k latent factors.""" m, n = R.shape # Initialize with small random values U = np.random.randn(m, k) * 0.1 V = np.random.randn(n, k) * 0.1 # Find observed entries (non-NaN) observed = ~np.isnan(R) for epoch in range(epochs): for i in range(m): for j in range(n): if observed[i, j]: error = R[i, j] - U[i] @ V[j] # Update U[i] and V[j] U[i] += lr * (error * V[j] - reg * U[i]) V[j] += lr * (error * U[i] - reg * V[j]) return U, V # Our example (NaN = missing) R = np.array([[5, 3, np.nan], [4, np.nan, 1], [np.nan, 2, 4]]) U, V = matrix_factorize(R, k=2, epochs=5000) predicted = U @ V.T print("Predicted ratings:") print(np.round(predicted, 1)) # Missing entries now filled with predictions!
The widget below shows a 5×5 user-movie ratings matrix with some entries missing. Adjust the number of latent factors k with the slider. Watch how the predicted ratings and reconstruction error change. More factors = better fit to observed data, but too many = overfitting.
Observed ratings in orange, predicted missing entries in teal. U and V matrices shown below.
In Netflix-style matrix factorization with k=50 latent factors, what does each user's 50-dimensional vector represent?
In Chapter 2, we measured vector size with norms: L1, L2, L-infinity. But how do you measure the "size" of a matrix? A matrix isn't just a collection of numbers — it's a transformation. So the "size" of a matrix should tell you something about what that transformation does.
Different matrix norms answer different questions about a transformation. The Frobenius norm asks: "how much total energy is in this matrix?" The spectral norm asks: "what's the maximum stretching this matrix can do?" The nuclear norm asks: "how close is this matrix to being low-rank?" Each answer matters for different ML applications.
The Frobenius norm is the simplest. Flatten the matrix into a vector and take the L2 norm. Just square every entry, sum them, and take the square root.
For A = [[1, 2], [3, 4]]:
That's it. The Frobenius norm treats the matrix as if it were a flat vector. It measures the total "energy" or "magnitude" of all the entries. It doesn't care about structure — rearranging entries doesn't change the Frobenius norm.
weight_decay=0.01 in an optimizer, you're using the Frobenius norm.The spectral norm is the largest singular value of the matrix. It tells you the maximum amount the matrix can stretch any unit vector.
For a diagonal matrix A = [[3, 0], [0, 2]]:
This means: for any unit vector x (||x|| = 1), the output Ax has length at most 3. That is, ||Ax||2 ≤ 3 · ||x||2 for all x.
For a non-diagonal matrix, you'd compute the SVD first. For A = [[1, 2], [3, 4]], the SVD gives σ1 ≈ 5.465, σ2 ≈ 0.366. Spectral norm = 5.465.
The spectral norm answers a very specific question: "What's the worst-case amplification?" If noise enters a layer with spectral norm 5, the noise can be amplified up to 5×. Stack 10 such layers, and noise amplification is up to 510 ≈ 10 million. That's why controlling spectral norms matters for deep networks.
The nuclear norm is the sum of all singular values:
For A = [[3, 0], [0, 2]]: ||A||* = 3 + 2 = 5.
Why care? The nuclear norm is the convex relaxation of rank. Minimizing rank directly is NP-hard (a combinatorial problem). But minimizing the nuclear norm is a convex optimization problem — much easier to solve — and it tends to produce low-rank solutions. This is the mathematical trick behind matrix completion algorithms.
| Norm | Formula | What it measures | ML use |
|---|---|---|---|
| Frobenius | √(∑aij²) | Total energy | Weight decay, L2 regularization |
| Spectral | max singular value | Maximum stretching | GAN stability (spectral norm) |
| Nuclear | ∑ singular values | Low-rank proxy | Matrix completion, LoRA |
Training GANs is notoriously unstable. The discriminator can become too powerful, giving gradients that are too large and causing the generator to diverge. Spectral normalization (Miyato et al., 2018) fixes this by constraining the spectral norm of every layer's weight matrix to 1:
After this division, the largest singular value is exactly 1. The layer can rotate vectors freely but can't amplify them. This makes the discriminator Lipschitz continuous with constant 1 — its output changes by at most 1 unit per unit change in input. The result: stable, reliable GAN training.
python import numpy as np A = np.array([[1, 2], [3, 4]]) # Frobenius norm: sqrt(sum of squared entries) frob = np.sqrt(np.sum(A**2)) print(f"Frobenius: {frob:.3f}") # 5.477 # Via SVD U, S, Vt = np.linalg.svd(A) print(f"Singular values: {S}") # [5.465, 0.366] # Spectral norm: largest singular value spectral = S[0] print(f"Spectral: {spectral:.3f}") # 5.465 # Nuclear norm: sum of singular values nuclear = np.sum(S) print(f"Nuclear: {nuclear:.3f}") # 5.831 # NumPy built-ins print("Built-in Frobenius:", np.linalg.norm(A, 'fro')) print("Built-in Spectral:", np.linalg.norm(A, 2)) print("Built-in Nuclear:", np.linalg.norm(A, 'nuc')) # Spectral normalization in one line W = np.random.randn(64, 64) W_sn = W / np.linalg.norm(W, 2) print("After SN, spectral norm:", np.linalg.norm(W_sn, 2)) # 1.0
Adjust the entries of a 2×2 matrix with sliders. The canvas shows all three norms updating in real time. The transformation ellipse shows how the matrix maps the unit circle: the spectral norm is the longest axis, and the nuclear norm is the sum of both axes.
The unit circle (dashed) gets transformed into an ellipse. Spectral norm = long axis. Frobenius = "diagonal" of the axis lengths.
Which matrix norm is used in spectral normalization for GANs?
Everything so far has been static — given a matrix, transform a vector. But ML is about learning, which means adjusting the matrix to minimize a loss. To adjust it, we need gradients: how much does the loss change when we nudge each weight? This is where linear algebra meets calculus.
For a scalar function f(x&sub1;, x&sub2;, ..., xn), the gradient ∇f is the vector of all partial derivatives:
The gradient points in the direction of steepest ascent. To minimize a function, go in the opposite direction: the negative gradient.
Find the gradient of f(x, y) = x²y + 3xy².
Step 1: Partial derivative with respect to x (treat y as constant):
Step 2: Partial derivative with respect to y (treat x as constant):
Step 3: Evaluate at (x, y) = (1, 2).
The gradient [16, 13] points in the direction of steepest ascent. To minimize, step in the direction [−16, −13].
The gradient handles scalar-valued functions. But what about vector-valued functions — functions that output multiple values? The Jacobian is the matrix of all partial derivatives.
For f(x) = [x&sub1;² + x&sub2;, x&sub1;x&sub2;, x&sub2;³], compute the Jacobian:
Row 1 (gradient of f&sub1; = x&sub1;² + x&sub2;):
Row 2 (gradient of f&sub2; = x&sub1;x&sub2;):
Row 3 (gradient of f&sub3; = x&sub2;³):
Evaluate at x = [1, 2]:
The Jacobian has shape [3, 2] — 3 outputs, 2 inputs. Each row is the gradient of one output function. This matrix tells you: if I nudge x&sub1; by ε, how much does each output change?
Let's trace gradient descent step by step on the simplest possible loss: L(w) = (w − 3)².
The minimum is at w = 3 (where L = 0). Starting at w = 0 with learning rate α = 0.1:
| Step | w | L(w) | dL/dw = 2(w−3) | wnew = w − α·dL/dw |
|---|---|---|---|---|
| 0 | 0.000 | 9.000 | −6.000 | 0.000 − 0.1×(−6) = 0.600 |
| 1 | 0.600 | 5.760 | −4.800 | 0.600 + 0.480 = 1.080 |
| 2 | 1.080 | 3.686 | −3.840 | 1.080 + 0.384 = 1.464 |
| 3 | 1.464 | 2.359 | −3.072 | 1.464 + 0.307 = 1.771 |
| 4 | 1.771 | 1.510 | −2.458 | 1.771 + 0.246 = 2.017 |
| 5 | 2.017 | 0.966 | −1.966 | 2.017 + 0.197 = 2.214 |
| ... | ... | → 0 | → 0 | → 3.000 |
Each step reduces the loss. The gradient magnitude shrinks as we approach the minimum. After about 30 steps, w ≈ 2.99 and L ≈ 0.0001.
python def gradient_descent_1d(f, df, w0, lr, n_steps): """Minimize f starting at w0 with learning rate lr.""" w = w0 history = [w] for step in range(n_steps): grad = df(w) # compute gradient w = w - lr * grad # update history.append(w) return w, history # L(w) = (w - 3)^2 f = lambda w: (w - 3)**2 df = lambda w: 2 * (w - 3) # derivative w_final, hist = gradient_descent_1d(f, df, w0=0, lr=0.1, n_steps=50) print(f"Final w = {w_final:.6f}") # ~3.000000
python import torch w = torch.tensor(0.0, requires_grad=True) lr = 0.1 for step in range(50): loss = (w - 3)**2 # forward pass loss.backward() # compute gradient with torch.no_grad(): w -= lr * w.grad # update w.grad.zero_() # clear for next step print(f"Final w = {w.item():.6f}") # ~3.000000
Click anywhere on the loss landscape to start gradient descent from that point. The contours show loss values. Adjust the learning rate — too high oscillates, too low barely moves. Watch the gradient arrow at each step.
Gradient descent tells you which way is downhill. But it says nothing about the shape of the hill. Is it a gentle slope you can barrel down, or a narrow ravine where you'll oscillate wildly? Is it a valley floor (minimum) or a mountain pass (saddle point)? The Hessian matrix — the matrix of all second derivatives — answers these questions.
Understanding the Hessian explains why some optimization problems are easy and others are nightmarishly slow. It explains why Adam outperforms vanilla SGD. It explains why saddle points, not local minima, are the real enemy in high-dimensional optimization.
For a function f(x1, x2, ..., xn), the gradient is the vector of first derivatives: ∇f = [∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn]. The Hessian is the matrix of ALL second derivatives:
For a function of 2 variables, the Hessian is a 2×2 matrix:
The diagonal entries tell you the curvature along each axis. The off-diagonal entries tell you how the curvature couples the axes (cross-curvature). For smooth functions, the Hessian is always symmetric: ∂²f/∂x∂y = ∂²f/∂y∂x.
Let f(x, y) = x³ − 3xy² + y³. First, compute the gradient:
Now take second derivatives to build the Hessian:
Evaluate at the point (1, 1):
Now find the eigenvalues to classify this critical point:
One eigenvalue is positive, one is negative. This means the surface curves upward in one direction and downward in another. That's a saddle point — it looks like a horse saddle or a mountain pass.
| Eigenvalue signs | Classification | Geometry |
|---|---|---|
| All positive | Local minimum | Bowl (curves up everywhere) |
| All negative | Local maximum | Dome (curves down everywhere) |
| Mixed positive & negative | Saddle point | Horse saddle (up in some dirs, down in others) |
| Some zero | Degenerate | Flat in some directions (inconclusive) |
Gradient descent uses a fixed learning rate: xnew = x − α ∇f. But α is a guess — too large and you overshoot, too small and you crawl. Newton's method uses the Hessian to pick the perfect step size automatically:
The inverse Hessian scales the gradient by the curvature. In steep directions (large curvature), it takes small steps. In flat directions (small curvature), it takes large steps. It's like adaptive learning rates, but derived from the geometry of the loss surface.
For a 1D function f(x) = x4 at x = 2:
Compare to gradient descent with α = 0.01: xnew = 2 − 0.01 × 32 = 1.68. Newton's method took a much larger, smarter step because it knew the curvature was steep enough to handle it.
In deep learning, the loss landscape has millions of dimensions. The Hessian is N×N where N = number of parameters. For GPT-3, that's a 175 billion × 175 billion matrix. We NEVER compute it fully — it would take more memory than exists on Earth.
But its eigenvalue spectrum controls everything about optimization:
Why Adam works better than SGD: Adam maintains per-parameter running averages of squared gradients. This approximates the diagonal of the Hessian — each parameter gets its own "curvature estimate." Parameters with high curvature (large second derivative) get smaller effective learning rates. Parameters with low curvature get larger ones. Adam is a cheap approximation to Newton's method.
python import torch # Define f(x, y) = x^3 - 3xy^2 + y^3 def f(xy): x, y = xy[0], xy[1] return x**3 - 3*x*y**2 + y**3 # Compute Hessian at (1, 1) point = torch.tensor([1.0, 1.0]) H = torch.autograd.functional.hessian(f, point) print("Hessian:") print(H) # tensor([[ 6., -6.], # [-6., 0.]]) # Eigenvalue analysis eigenvalues = torch.linalg.eigvalsh(H) print("Eigenvalues:", eigenvalues) # tensor([-3.71, 9.71]) if all(eigenvalues > 0): print("Local minimum") elif all(eigenvalues < 0): print("Local maximum") else: print("Saddle point") # This one!
The widget below shows a contour plot of a 2D function. Click anywhere to place the analysis point. The Hessian eigenvectors are shown as arrows: long arrow = high curvature (steep), short arrow = low curvature (flat). Color indicates sign: red = positive eigenvalue (curves up, like a bowl), blue = negative (curves down). At saddle points, you'll see one red and one blue arrow.
Click on the contour plot to move the analysis point. Arrows show Hessian eigenvectors. Arrow length = curvature magnitude.
The Hessian at a critical point has eigenvalues [5, −2, 0.01]. What is this point?
Now we connect everything. A neural network is a chain of matrix multiplications and nonlinearities. Backpropagation is just the chain rule applied to matrices. Once you see this, neural networks stop being magic and become plumbing.
Let's trace every single number through a forward pass and backward pass. No hand-waving.
Setup:
Step 1: Linear layer z = Wx + b
Step 2: ReLU activation a = max(0, z)
Step 3: MSE Loss L = ∑(a − t)²
Now we compute gradients flowing backward through each operation.
Step 4: Gradient of loss w.r.t. activations
Step 5: Gradient through ReLU
ReLU's gradient: 1 if z > 0, 0 if z ≤ 0. Element-wise multiply:
The third neuron had z&sub3; = −0.7 < 0, so its gradient is zeroed. The dead neuron receives no learning signal.
Step 6: Gradient w.r.t. weights ∂L/∂W = (∂L/∂z) xT
This is the outer product of the upstream gradient and the input:
Step 7: Gradient w.r.t. bias ∂L/∂b = ∂L/∂z
Step 8: Update weights (learning rate = 0.01)
python import numpy as np # Setup x = np.array([2.0, 1.0]) W = np.array([[0.5, -0.3], [0.1, 0.7], [-0.4, 0.2]]) b = np.array([0.1, 0.0, -0.1]) target = np.array([1.0, 0.0, 0.0]) lr = 0.01 # ── Forward ── z = W @ x + b # [0.8, 0.9, -0.7] a = np.maximum(0, z) # [0.8, 0.9, 0.0] ReLU loss = np.sum((a - target)**2) # 0.85 # ── Backward ── dL_da = 2 * (a - target) # [-0.4, 1.8, 0.0] dL_dz = dL_da * (z > 0) # [-0.4, 1.8, 0.0] ReLU grad dL_dW = np.outer(dL_dz, x) # [[-0.8,-0.4],[3.6,1.8],[0,0]] dL_db = dL_dz # [-0.4, 1.8, 0.0] # ── Update ── W -= lr * dL_dW b -= lr * dL_db print("Updated W:", W) print("Updated b:", b)
python import torch x = torch.tensor([2.0, 1.0]) W = torch.tensor([[0.5, -0.3], [0.1, 0.7], [-0.4, 0.2]], requires_grad=True) b = torch.tensor([0.1, 0.0, -0.1], requires_grad=True) target = torch.tensor([1.0, 0.0, 0.0]) # Forward z = W @ x + b a = torch.relu(z) loss = torch.sum((a - target)**2) # Backward — PyTorch computes ALL gradients for us loss.backward() print("dL/dW:", W.grad) # matches our hand calc! print("dL/db:", b.grad) # matches our hand calc!
Toggle between forward pass (data flows left to right) and backward pass (gradients flow right to left). Actual numbers from our worked example are shown at each stage.
Everything we've done so far has been 1D and 2D — vectors and matrices. Real ML lives in higher dimensions. An image batch is 4D: [batch, channels, height, width]. Attention matrices have batch and head dimensions. Two tools make this manageable: broadcasting (automatic shape expansion) and einsum (universal tensor contraction).
Broadcasting lets you do arithmetic on tensors with different shapes. NumPy and PyTorch align dimensions from the RIGHT and expand size-1 dimensions to match.
Hand Calculation: Broadcasting in Action
a has shape [3, 1], b has shape [1, 4]:
Step 1: Align shapes from the right.
Step 2: After stretching, both are [3, 4]. Element-wise add:
More examples:
| Shape A | Shape B | Result | What happens |
|---|---|---|---|
| [3, 4] | [4] | [3, 4] | b becomes [1, 4] → broadcast rows |
| [2, 3, 4] | [3, 1] | [2, 3, 4] | b becomes [1, 3, 1] → broadcast batch + cols |
| [32, 128] | [128] | [32, 128] | Adding bias to every sample in a batch |
| [3, 4] | [5, 4] | ERROR | 3 ≠ 5, neither is 1 |
Einstein summation notation lets you express any tensor contraction in a single, readable string. The rule: any index that appears in the inputs but NOT in the output is summed over.
Hand Calculation: Decoding einsum strings
| Einsum | Operation | Equation |
|---|---|---|
'i,i->' | Dot product | c = ∑i ai bi |
'ij,jk->ik' | Matrix multiply | Cik = ∑j Aij Bjk |
'ij->ji' | Transpose | Bji = Aij |
'ii->' | Trace | t = ∑i Aii |
'i,j->ij' | Outer product | Cij = ai bj |
'bij,bjk->bik' | Batched matmul | Cbik = ∑j Abij Bbjk |
'bhqd,bhkd->bhqk' | Attention scores | Sbhqk = ∑d Qbhqd Kbhkd |
Worked example: Matrix multiply via einsum. A is [2, 3], B is [3, 2].
The most important einsum in modern ML computes attention scores:
Breaking it down letter by letter:
| Index | Meaning | In Q | In K | In output |
|---|---|---|---|---|
| b | Batch | ✓ | ✓ | ✓ (kept) |
| h | Attention head | ✓ | ✓ | ✓ (kept) |
| q | Query position | ✓ | ✓ (kept) | |
| k | Key position | ✓ | ✓ (kept) | |
| d | Head dimension | ✓ | ✓ | (summed!) |
Index d appears in both inputs but NOT in the output → it's summed over. This is the dot product between each query and each key vector, for every batch and every head, producing an attention score matrix of shape [b, h, q, k].
python import numpy as np # Manual broadcasting: add bias to every sample X = np.array([[1, 2, 3], # batch of 4 samples, 3 features [4, 5, 6], [7, 8, 9], [10, 11, 12]]) # shape [4, 3] bias = np.array([100, 200, 300]) # shape [3] # Without broadcasting (manual loop) result_loop = np.zeros_like(X) for i in range(4): result_loop[i] = X[i] + bias # With broadcasting (one line!) result_broadcast = X + bias # [4,3] + [3] -> [4,3] print(np.allclose(result_loop, result_broadcast)) # True
python import numpy as np import torch # All einsum patterns a = np.array([1, 2, 3]) b = np.array([4, 5, 6]) A = np.array([[1, 2], [3, 4]]) B = np.array([[5, 6], [7, 8]]) # Dot product: i,i-> print(np.einsum('i,i->', a, b)) # 32 # Matrix multiply: ij,jk->ik print(np.einsum('ij,jk->ik', A, B)) # [[19,22],[43,50]] # Trace: ii-> print(np.einsum('ii->', A)) # 5 # Outer product: i,j->ij print(np.einsum('i,j->ij', a, b)) # Batched attention scores (PyTorch) B_, H, Q_len, D = 2, 4, 8, 16 K_len = 8 Q = torch.randn(B_, H, Q_len, D) K = torch.randn(B_, H, K_len, D) scores = torch.einsum('bhqd,bhkd->bhqk', Q, K) print("Attention scores shape:", scores.shape) # torch.Size([2, 4, 8, 8])
Select different shape combinations to see how broadcasting stretches dimensions. Size-1 dimensions are highlighted and expanded.
einsum('bhqd,bhkd->bhqk', Q, K) compute?In the previous chapters on the Hessian, we classified critical points by whether the eigenvalues were all positive, all negative, or mixed. "All positive eigenvalues" has a name: the matrix is positive definite. This is one of the most important properties in all of machine learning, because covariance matrices are always positive semi-definite, kernel matrices must be positive semi-definite, and a positive definite Hessian guarantees you're at a local minimum.
Let's make this concrete rather than abstract.
A symmetric matrix A is positive definite (PD) if for every nonzero vector x:
The quantity xTAx is called the quadratic form associated with A. If A is positive definite, this quadratic form is always positive — it's a "bowl" shape with a single minimum at x = 0. No saddle points, no flat spots, no weird geometry. Just a clean bowl.
If the condition is xTAx ≥ 0 (allowing zero for some nonzero x), the matrix is positive semi-definite (PSD). If xTAx can be negative for some x, the matrix is indefinite.
Is A = [[2, 1], [1, 2]] positive definite?
Method 1: eigenvalues. Solve det(A − λI) = 0:
Both eigenvalues are positive ⇒ PD ✓
Method 2: compute xTAx directly. For x = [a, b]:
Rewrite by completing the square:
This is a sum of three squares. It's zero only when a = b = 0, and positive otherwise. PD ✓
Method 3: leading minors. Check that all upper-left submatrices have positive determinant:
All leading minors positive ⇒ PD ✓
Is B = [[1, 2], [2, 1]] positive definite?
Eigenvalues: λ² − 2λ − 3 = 0 ⇒ λ = 3, −1. One eigenvalue is negative ⇒ NOT PD.
Let's find a vector that makes xTBx negative. Try x = [1, −1]:
Confirmed: the quadratic form can be negative. The matrix is indefinite — the "bowl" is actually a saddle.
Covariance matrices must be PSD. The covariance matrix Σ of a random vector x is defined as E[(x − μ)(x − μ)T]. For any vector v, vTΣv = E[(vT(x − μ))²] ≥ 0. This is always non-negative because it's an expected squared quantity. If Σ were not PSD, the Gaussian density:
would be undefined — a negative determinant gives an imaginary normalizer. The density wouldn't integrate to 1. Physics breaks.
Kernel matrices must be PSD. In SVMs, the kernel matrix Kij = k(xi, xj) must be PSD — this is the Mercer condition. It guarantees that the kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space. Without PSD, the optimization problem becomes non-convex and may have no solution.
PD Hessian = local minimum. If the Hessian at a critical point is PD, the loss surface is a bowl at that point — you're at a minimum. This is the second-order sufficient condition for optimality.
Every positive definite matrix A can be uniquely factored as:
where L is a lower triangular matrix. This is the Cholesky decomposition, and it's incredibly useful:
Sampling from multivariate Gaussians: To sample x ~ N(μ, Σ), first compute L where Σ = LLT. Then sample z ~ N(0, I) (independent standard normals). Finally: x = μ + Lz. The Cholesky factor L "shapes" the spherical noise z into the correct covariance structure.
Solving Ax = b efficiently: Instead of computing A−1 (O(n³) and numerically unstable), factor A = LLT once, then solve Ly = b (forward substitution) and LTx = y (back substitution). Both substitutions are O(n²).
python import numpy as np A = np.array([[2, 1], [1, 2]], dtype=float) B = np.array([[1, 2], [2, 1]], dtype=float) # Method 1: eigenvalue check eigs_A = np.linalg.eigvalsh(A) print("A eigenvalues:", eigs_A) # [1. 3.] — both positive, PD eigs_B = np.linalg.eigvalsh(B) print("B eigenvalues:", eigs_B) # [-1. 3.] — one negative, NOT PD # Method 2: Cholesky (fails on non-PD matrices!) try: L = np.linalg.cholesky(A) print("A is PD. L =", L) print("Verify LL^T =", L @ L.T) # Should equal A except np.linalg.LinAlgError: print("A is NOT PD") try: L = np.linalg.cholesky(B) except np.linalg.LinAlgError: print("B is NOT PD — Cholesky failed!") # Sampling from N(mu, Sigma) using Cholesky Sigma = np.array([[2, 1], [1, 2]], dtype=float) mu = np.array([3.0, 5.0]) L = np.linalg.cholesky(Sigma) z = np.random.randn(1000, 2) # 1000 samples from N(0, I) samples = mu + z @ L.T # Transform to N(mu, Sigma) print("Sample mean:", samples.mean(axis=0)) print("Sample cov:", np.cov(samples.T)) # Close to Sigma
Adjust the off-diagonal entries of a symmetric 2×2 matrix. When PD, the contours are ellipses (a bowl). When PSD, they degenerate to parallel lines (a trough). When indefinite, they become hyperbolas (a saddle). The eigenvalues determine everything.
Adjust the matrix entries. Watch the contour shape change from ellipses (PD, green) to parallel lines (PSD, yellow) to hyperbolas (indefinite, red).
A covariance matrix has eigenvalue 0. Is this a problem?
Everything we've learned so far assumes perfect arithmetic. 1 + 1 = 2, exactly. But computers don't do exact arithmetic. They use floating-point numbers — a finite approximation where 1 + 10−16 = 1 (not 1.0000000000000001). This tiny imprecision cascades through computations, and for certain matrices, it can destroy your answer completely.
Understanding when and why this happens explains training instability, the need for batch normalization, why mixed-precision training works, and why some models need float64 for certain operations.
A float32 number has approximately 7 significant decimal digits of precision. A float64 has ~15. This means:
| Format | Bits | Precision | Smallest difference from 1 |
|---|---|---|---|
| float16 | 16 | ~3.3 digits | ~10−3 |
| bfloat16 | 16 | ~2.4 digits | ~10−2 |
| float32 | 32 | ~7.2 digits | ~10−7 |
| float64 | 64 | ~15.9 digits | ~10−16 |
"Precision" means the total number of meaningful digits. If you're doing a computation that produces a 10-digit answer but you're using float32, only the first 7 digits are trustworthy. The last 3 are garbage.
The condition number κ(A) measures how sensitive the solution of Ax = b is to small changes in A or b. It's defined as:
Consider a nearly singular matrix:
The singular values are approximately σ1 ≈ 2.0001 and σ2 ≈ 0.00005. So:
This means: a 0.01% change in the input can cause up to a 400% change in the output. The matrix amplifies perturbations by a factor of 40,000. Let's see this happen.
Solve Ax = b with our ill-conditioned A:
The exact solution is x = [1, 1]. Easy enough.
Now change b by a tiny amount — just 0.005% in the second component:
Subtract the first equation from the second: 0.0001 · x2 = 0.0001, so x2 = 1. Wait — that doesn't look bad. Let me do a more dramatic example.
Actually, let's use A = [[1, 1], [1, 1.001]] with b = [2, 2.001] giving x = [1, 1], then b' = [2, 2.002]:
The solution jumped from [1, 1] to [0, 2]! A 0.05% change in b produced a 100% change in x. That's ill-conditioning in action. The condition number κ ≈ 4000 predicted exactly this kind of sensitivity.
LU decomposition factors A = LU where L is lower triangular and U is upper triangular. This is how computers actually solve Ax = b (they never compute A−1 directly).
For A = [[2, 1], [4, 3]]:
Step 1: The multiplier for row 2 is 4/2 = 2.
Step 2: Subtract 2 × R1 from R2: [4−4, 3−2] = [0, 1].
Verify: LU = [[1×2+0×0, 1×1+0×1], [2×2+1×0, 2×1+1×1]] = [[2, 1], [4, 3]] = A ✓
To solve Ax = b: first solve Ly = b (forward substitution, O(n²)), then Ux = y (back substitution, O(n²)). Total: O(n³) for the factorization, then O(n²) per right-hand side. If you need to solve for many different b vectors with the same A, you factorize once and solve cheaply many times.
float32 vs float16: Modern LLM training uses mixed precision — weights stored in float32, but forward/backward passes in float16 or bfloat16. The inner products in attention (Q·KT) can have huge values (that's why we divide by √dk), and float16 overflow causes NaN. bfloat16 has the same range as float32 (8 exponent bits) but less precision (2.4 digits vs 7.2), which is usually enough.
Batch normalization reduces the condition number of the input distribution to each layer. Without batchnorm, deep networks have layered transformations where small input changes get amplified exponentially (high condition number). Normalizing the statistics at each layer keeps the condition number manageable.
Gradient clipping prevents ill-conditioned gradients from causing parameter explosions. If the Jacobian of the network has a large condition number, gradients in the "steep" direction become enormous while those in the "flat" direction are tiny. Clipping caps the maximum gradient norm.
Adam's advantage: Adam's per-parameter scaling (dividing by √v̂) acts like diagonal preconditioning. In numerical linear algebra, preconditioning transforms a problem to reduce its condition number before solving. Adam does this automatically, which is why it works on problems where vanilla SGD with a fixed learning rate diverges.
python import numpy as np from scipy import linalg # Condition number example A = np.array([[1, 1], [1, 1.001]]) print("Condition number:", np.linalg.cond(A)) # ~4000 # Solve with two slightly different b vectors b1 = np.array([2.0, 2.001]) b2 = np.array([2.0, 2.002]) x1 = np.linalg.solve(A, b1) x2 = np.linalg.solve(A, b2) print("Solution 1:", x1) # [1. 1.] print("Solution 2:", x2) # [0. 2.] — wildly different! # LU decomposition A2 = np.array([[2, 1], [4, 3]], dtype=float) P, L, U = linalg.lu(A2) print("P:", P) # Permutation matrix print("L:", L) print("U:", U) print("P@L@U =", P @ L @ U) # Should equal A2 # Digits lost formula for kappa in [10, 1e4, 1e8, 1e12]: f32_reliable = 7.2 - np.log10(kappa) f64_reliable = 15.9 - np.log10(kappa) print(f"κ={kappa:.0e}: float32={f32_reliable:.1f} digits, float64={f64_reliable:.1f} digits")
The widget below lets you adjust a 2×2 matrix. A small perturbation ε is added to the right-hand side b. Watch how the solution changes. When the condition number is small (well-conditioned), the two solutions overlap. When large (ill-conditioned), they diverge wildly — even though the inputs are nearly identical.
Two nearly identical systems Ax=b and Ax=b' (differing by ε). Orange = solution 1, teal = solution 2. Large gap = ill-conditioned.
A matrix has condition number 108 and you're using float32 (~7 digits precision). How many digits of accuracy can you expect in the solution?
We've covered the entire linear algebra toolkit for ML. Every technique you'll encounter in papers — from attention mechanisms to LoRA fine-tuning to gradient descent — is built on these operations. Let's map every concept to its real-world ML application.
| Concept | ML Application | Concrete Example |
|---|---|---|
| Vectors | Embeddings | word2vec: "king" = [0.2, −0.5, 0.8, ...] |
| Dot product | Attention scores | Q·KT in transformers |
| Matrix multiply | Neural network layers | y = Wx + b (every layer) |
| Norms | Regularization | L1 (Lasso) → sparsity, L2 (Ridge) → weight decay |
| Transpose | Data reshaping | XTX in normal equations |
| Inverse | Closed-form solutions | w = (XTX)−1XTy (linear regression) |
| Determinant | Volume change | Normalizing flows: det(Jacobian) tracks density |
| Eigenvalues | PCA, stability | Covariance eigenvectors = principal components |
| SVD | Compression, LoRA | ΔW ≈ BA where rank(BA) ≪ rank(W) |
| Gradients | Training | ∂L/∂W for every parameter |
| Jacobian | Backpropagation | Chain of Jacobian-vector products |
| Broadcasting | Batched operations | Adding bias [D] to every sample in batch [B, D] |
| Einsum | Attention, contractions | Multi-head attention in one line |
Let's trace one complete training step of a 2-layer network, labeling every linear algebra operation:
Building on this foundation:
Advanced applications:
Each stage of the ML pipeline is labeled with the linear algebra operations it uses. Hover or tap a stage to highlight it.
"The purpose of computing is insight, not numbers." — Richard Hamming