Mathematics for Machine Learning

ML Maths

Every equation in every paper. Every matrix in every model. Built from absolute zero with hand calculations, from-scratch code, and interactive visualizations.

Prerequisites: Basic arithmetic. Seriously, that's it.
19
Chapters
19
Simulations
0
Assumed Knowledge

Chapter 0: Why Linear Algebra?

You want to classify images of cats vs dogs. Your image is 224 × 224 × 3 pixels — that's 150,528 numbers. A single neural network layer takes those 150,528 numbers and multiplies them by a weight matrix to produce, say, 512 numbers. That one operation is 150,528 × 512 = 77 million multiply-adds. And that's just one layer. A ResNet-50 has 25 million parameters doing this over and over, layer after layer.

Every single one of those multiply-adds is linear algebra. Matrix multiplication. Dot products. Vector norms. If you don't understand these operations, neural networks are a black box. If you do, every architecture paper becomes readable, every shape error becomes diagnosable, and every optimization trick makes intuitive sense.

This isn't abstract mathematics. This is the actual computation running on GPUs right now, powering ChatGPT, Stable Diffusion, AlphaFold, and every other ML system you've heard of. We're going to build it from scratch.

Your first neural network layer, by hand

Let's start with the smallest possible example. Forget images — imagine your input is just two numbers: x = [2, 3]. Maybe they represent height and weight of a person. Your neural network layer has a weight matrix W and a bias vector b. The layer computes output = W · x + b.

Here are the weights (a 3×2 matrix, because we want 3 outputs from 2 inputs):

W = [[0.5, 0.1], [−0.3, 0.7], [0.8, −0.4]]

And the biases (one per output):

b = [0.1, −0.2, 0.3]

Now we compute each output element. Each output is a dot product of one row of W with the input x, plus the corresponding bias:

output[0] = 0.5 × 2 + 0.1 × 3 + 0.1 = 1.0 + 0.3 + 0.1 = 1.4
output[1] = −0.3 × 2 + 0.7 × 3 + (−0.2) = −0.6 + 2.1 − 0.2 = 1.3
output[2] = 0.8 × 2 + (−0.4) × 3 + 0.3 = 1.6 − 1.2 + 0.3 = 0.7

So our raw output is [1.4, 1.3, 0.7]. Then we apply ReLU (replace negatives with zero): max(0, [1.4, 1.3, 0.7]) = [1.4, 1.3, 0.7]. All values were already positive, so nothing changes. But if any output had been negative — say −0.6 — ReLU would clamp it to 0.

That's it. That's a neural network layer. Two inputs in, three outputs out. Every large model does exactly this, just with bigger matrices.

Convolution is also linear algebra

You might think convolution (used in image models) is a different kind of operation. It's not. Let's see why. A Sobel edge detector is a 3×3 kernel that finds vertical edges:

kernel = [1, 0, −1, 2, 0, −2, 1, 0, −1]

We flatten a 3×3 image patch into a vector too:

patch = [100, 150, 200, 120, 160, 210, 130, 170, 220]

Now we compute the dot product — the same operation as before:

100×1 + 150×0 + 200×(−1) + 120×2 + 160×0 + 210×(−2) + 130×1 + 170×0 + 220×(−1)
= 100 + 0 − 200 + 240 + 0 − 420 + 130 + 0 − 220 = −370

This negative value means the left side of the patch is darker than the right side. The kernel detected an edge gradient going right-to-left. A positive value would mean left-to-right. Zero means no edge. The convolution output at every position is just a dot product of the kernel with the local patch.

Convolution IS matrix multiplication: You might think convolution and matrix multiply are fundamentally different operations. They're not. Every deep learning framework internally reshapes image patches into a matrix (a trick called im2col) and multiplies by the reshaped kernel. Convolution is just matrix multiplication with shared weights and a specific data layout. Understanding one means understanding both.

From scratch: one linear layer

python
# A neural network layer from scratch — no libraries

def linear_layer(x, W, b):
    """Compute output = W @ x + b manually."""
    out_size = len(W)      # number of output neurons
    in_size = len(x)       # number of input features
    output = []

    for i in range(out_size):
        # Dot product of row i of W with x
        dot = 0.0
        for j in range(in_size):
            dot += W[i][j] * x[j]
        dot += b[i]            # add bias
        output.append(dot)

    return output

def relu(x):
    """ReLU activation: max(0, x) for each element."""
    return [max(0, val) for val in x]

# Our example
x = [2, 3]
W = [[0.5, 0.1], [-0.3, 0.7], [0.8, -0.4]]
b = [0.1, -0.2, 0.3]

raw = linear_layer(x, W, b)
print("Raw output:", raw)      # [1.4, 1.3, 0.7]
activated = relu(raw)
print("After ReLU:", activated)  # [1.4, 1.3, 0.7]

The same thing in PyTorch

python
import torch
import torch.nn as nn

# PyTorch does exactly what we did, but on GPU with optimized BLAS
layer = nn.Linear(in_features=2, out_features=3)

# Set our specific weights and biases
with torch.no_grad():
    layer.weight = nn.Parameter(torch.tensor(
        [[0.5, 0.1], [-0.3, 0.7], [0.8, -0.4]]
    ))
    layer.bias = nn.Parameter(torch.tensor([0.1, -0.2, 0.3]))

x = torch.tensor([2.0, 3.0])
raw = layer(x)
print(raw)                       # tensor([1.4, 1.3, 0.7])
activated = torch.relu(raw)
print(activated)                 # tensor([1.4, 1.3, 0.7])

# Under the hood: output = x @ weight.T + bias
manual = x @ layer.weight.T + layer.bias
print(manual)                    # tensor([1.4, 1.3, 0.7]) — identical
Every ML breakthrough is a matrix trick: Transformers? Matrix multiplication of queries and keys. Diffusion models? Matrix operations at every denoising step. GANs? Matrix multiply in both generator and discriminator. Graph neural networks? Sparse matrix multiply on adjacency matrices. Once you see the linear algebra, you see the whole field.

See it in action: convolution as dot products

The widget below shows an 8×8 image grid with pixel values. A 3×3 Sobel kernel slides across the image. At each position, we compute the dot product of the kernel with the underlying patch. Watch how the output reveals edges in the image.

Convolution = Sliding Dot Products

Press Step to move the kernel one position. Press Play to auto-advance. The output grid on the right shows the dot product at each position.

Position: (0, 0)

Notice the pattern: where the image has sharp left-right brightness changes, the output has large absolute values (positive or negative). Where the image is uniform, the output is near zero. That's edge detection — and it's nothing but a sequence of dot products.

What operation is at the heart of every neural network layer?

Chapter 1: Scalars, Vectors, Matrices, Tensors

Before we can manipulate data with linear algebra, we need to name the containers it lives in. There's a clean hierarchy of mathematical objects, and once you internalize it, every single shape error in PyTorch becomes instantly diagnosable. No more staring at RuntimeError: mat1 and mat2 shapes cannot be multiplied wondering what went wrong.

A scalar is a single number. The learning rate 0.001 is a scalar. A loss value 2.34 is a scalar. A scalar has zero dimensions — it's just a point on the number line.

A vector is a list of numbers. The input [2, 3] from Chapter 0 is a 2D vector. A word embedding might be a 768-dimensional vector. A vector has one dimension — you need one index to pick out an element.

A matrix is a grid of numbers with rows and columns. Our weight matrix W from Chapter 0 was a 3×2 matrix. A matrix has two dimensions — you need two indices (row, column) to pick out an element.

A tensor is the generalization to any number of dimensions. A color image is a 3D tensor (height × width × channels). A batch of images is a 4D tensor. Attention scores are a 4D tensor. You need as many indices as there are dimensions.

The dimension hierarchy: Scalar (0D) → Vector (1D) → Matrix (2D) → 3D Tensor → 4D Tensor → ... This isn't just nomenclature. Each level adds one axis, and your code must keep track of every axis or shapes won't align.

Worked example: an image batch

When you load a batch of images for training, PyTorch represents it as a 4D tensor with shape [B, C, H, W]. Let's say we have a batch of 32 images, each 224×224 pixels with 3 color channels (RGB). The shape is:

[32, 3, 224, 224]

What does each dimension mean?

DimensionSizeMeaning
0 (batch)32Number of images in this batch
1 (channels)3Color channels: Red, Green, Blue
2 (height)224Rows of pixels
3 (width)224Columns of pixels

How many total numbers are stored? Multiply all dimensions together:

32 × 3 × 224 × 224 = 4,816,896

That's 4.8 million floating-point numbers just for one batch of input. At 4 bytes each (float32), that's 4,816,896 × 4 = 19,267,584 bytes ≈ 19.3 MB of GPU memory. For the input alone. The model weights, gradients, optimizer states, and activations all pile on top.

Worked example: transformer attention shapes

Let's trace the shapes through a transformer's attention mechanism. These are real numbers from a GPT-2-small-style model: batch size B = 8, number of attention heads h = 12, sequence length T = 512, model dimension dmodel = 768, and key dimension dk = 768 / 12 = 64.

The query matrix Q has shape [B, h, T, dk] = [8, 12, 512, 64]. This means: for each of 8 sequences in the batch, for each of 12 attention heads, for each of 512 token positions, we have a 64-dimensional query vector.

The key matrix K has the same shape: [8, 12, 512, 64].

When we compute attention scores Q · KT, the last two dimensions participate in the matrix multiply:

[8, 12, 512, 64] @ [8, 12, 64, 512] → [8, 12, 512, 512]

That attention score matrix has 8 × 12 × 512 × 512 = 25,165,824 entries. Per layer. For just 512 tokens. This is why transformers have quadratic memory in sequence length — doubling the sequence length quadruples the attention matrix size.

Shape arithmetic is your superpower: Every ML paper describes operations as shape transformations. "We project the input from d_model to d_k" means matrix multiply by a [d_model, d_k] weight. "We compute attention over the sequence" means batched matmul producing a [T, T] matrix. If you can trace shapes through an architecture, you understand the architecture.

Creating tensors in code

python
import torch

# Scalar — 0 dimensions
s = torch.tensor(3.14)
print(s.shape)       # torch.Size([])  — empty shape = scalar
print(s.dim())        # 0
print(s.numel())      # 1 — one element total

# Vector — 1 dimension
v = torch.tensor([1.0, 2.0, 3.0])
print(v.shape)       # torch.Size([3])
print(v.dim())        # 1
print(v.numel())      # 3

# Matrix — 2 dimensions
M = torch.randn(3, 4)
print(M.shape)       # torch.Size([3, 4])
print(M.dim())        # 2
print(M.numel())      # 12

# 4D Tensor — image batch
batch = torch.randn(32, 3, 224, 224)
print(batch.shape)    # torch.Size([32, 3, 224, 224])
print(batch.dim())    # 4
print(batch.numel())  # 4816896

# Memory usage
print(batch.element_size())       # 4 (bytes per float32)
print(batch.nelement() * batch.element_size())  # 19267584 bytes ≈ 19.3 MB

Common shape mistakes (and how to fix them)

python
import torch

# MISTAKE 1: confusing len() with numel()
t = torch.randn(3, 4, 5)
print(len(t))     # 3 — size of FIRST dimension only!
print(t.numel())  # 60 — total number of elements (3×4×5)

# MISTAKE 2: .shape vs .size() — they're the SAME thing
print(t.shape)        # torch.Size([3, 4, 5])
print(t.size())       # torch.Size([3, 4, 5]) — identical
print(t.size(1))      # 4 — size of dimension 1
print(t.shape[1])     # 4 — also works

# MISTAKE 3: forgetting the batch dimension
# nn.Linear expects [batch, features], not just [features]
layer = torch.nn.Linear(10, 5)
x_bad = torch.randn(10)      # shape [10] — might work but risky
x_good = torch.randn(1, 10)  # shape [1, 10] — explicit batch dim
len() vs numel(): A common beginner mistake is using len(tensor) to get the total number of elements. It doesn't. len(torch.randn(3, 4, 5)) returns 3, not 60. It only gives you the size of the first dimension. Use .numel() for total elements, or math.prod(tensor.shape). This mistake is especially dangerous in loss computation where you might divide by the wrong count.

Build the hierarchy

Click the buttons below to build up from a scalar to a 4D tensor. Watch the shape and total element count change at each level. Each click adds one dimension to the structure.

From Scalar to 4D Tensor

Click the buttons to switch between dimensionalities. The visualization shows the structure and shape at each level.

A tensor with shape [16, 8, 256, 256] has how many total elements?

Chapter 2: Vector Operations & Norms

Now that we know vectors are ordered lists of numbers, we need operations on them. Adding vectors, scaling them, measuring their length. These aren't abstract — the L2 norm IS the loss function in regression. The L1 norm IS the regularization in Lasso. Cosine similarity IS what makes semantic search work. Every operation we learn here has a direct, concrete use in machine learning.

Let's start with the basics. Vector addition adds corresponding elements. Scalar multiplication multiplies every element by a number. Vector subtraction is just addition with a negated vector. These are the building blocks of everything.

Worked example: vector arithmetic

Given two 3D vectors a = [3, −1, 4] and b = [1, 5, −2], let's compute everything:

Addition — add element by element:

a + b = [3+1, −1+5, 4+(−2)] = [4, 4, 2]

Scalar multiplication — multiply every element by the scalar:

2a = [2×3, 2×(−1), 2×4] = [6, −2, 8]

Subtraction — subtract element by element:

a − b = [3−1, −1−5, 4−(−2)] = [2, −6, 6]

In ML, these operations happen constantly. When you compute a residual connection in a transformer, you're adding two vectors: output = layer(x) + x. When you apply a learning rate, you're scaling a gradient vector: w = w − η · ∇L. When you compute a loss gradient, you're subtracting the prediction from the target: error = ŷ − y.

The three norms that matter

A norm measures the "size" or "length" of a vector. Different norms measure size in different ways, and the choice of norm has profound effects on your model. Let's compute all three for v = [3, −4, 0, 5].

L1 norm (Manhattan distance / taxi distance) — sum of absolute values:

||v||1 = |3| + |−4| + |0| + |5| = 3 + 4 + 0 + 5 = 12

Think of walking on a city grid — you can only move along axes, never diagonally. The L1 norm is the total distance walked.

L2 norm (Euclidean distance / straight-line distance) — square root of sum of squares:

||v||2 = √(3² + (−4)² + 0² + 5²) = √(9 + 16 + 0 + 25) = √50 = 5√2 ≈ 7.07

This is the "as the crow flies" distance. It's the Pythagorean theorem generalized to any number of dimensions.

L∞ norm (max norm / Chebyshev distance / king's distance) — largest absolute value:

||v|| = max(|3|, |−4|, |0|, |5|) = max(3, 4, 0, 5) = 5

Think of a king on a chessboard — it can move diagonally, so the distance is just the farthest it needs to go in any single direction.

Why norms matter in ML: regularization

When training a model, we add a regularization term to the loss function to prevent overfitting. The choice of norm determines the behavior:

L1 regularization (Lasso): loss = MSE + λ · ||w||1. The L1 norm pushes weights to exactly zero. This gives you sparse solutions — many weights become literally 0.0, effectively performing feature selection. If you have 1000 features but only 50 matter, L1 will zero out the other 950.

L2 regularization (Ridge / weight decay): loss = MSE + λ · ||w||2². The L2 norm pushes weights to be small but never exactly zero. All weights shrink toward zero proportionally. This is why it's called weight decay in PyTorch — it literally multiplies all weights by (1 − λ) each step.

Why does L1 produce sparsity but L2 doesn't? Geometry. The L1 unit ball is a diamond shape whose corners lie on the coordinate axes. When the loss function's contours intersect the L1 ball, they most likely hit a corner — where some coordinates equal exactly 0. The L2 ball is a circle with no corners, so the intersection slides smoothly and coordinates rarely hit exactly 0. This is the single deepest insight about regularization.

From scratch: all three norms

python
import math

def l1_norm(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Square root of sum of squares."""
    return math.sqrt(sum(x ** 2 for x in v))

def linf_norm(v):
    """Maximum absolute value."""
    return max(abs(x) for x in v)

v = [3, -4, 0, 5]
print(f"L1:  {l1_norm(v)}")     # 12
print(f"L2:  {l2_norm(v):.4f}")  # 7.0711
print(f"L∞:  {linf_norm(v)}")   # 5

NumPy and PyTorch equivalents

python
import numpy as np
import torch

v_np = np.array([3, -4, 0, 5], dtype=np.float64)
v_pt = torch.tensor([3.0, -4.0, 0.0, 5.0])

# NumPy norms
print(np.linalg.norm(v_np, ord=1))       # 12.0
print(np.linalg.norm(v_np, ord=2))       # 7.0710...
print(np.linalg.norm(v_np, ord=np.inf))  # 5.0

# PyTorch norms
print(torch.linalg.norm(v_pt, ord=1))    # tensor(12.)
print(torch.linalg.norm(v_pt, ord=2))    # tensor(7.0711)
print(torch.linalg.norm(v_pt, ord=float('inf')))  # tensor(5.)

# In practice: L2 regularization = weight decay in AdamW
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.01)
# This adds 0.01 * ||w||₂² to the loss automatically
L2 norm vs L2 loss — don't confuse them: L2 norm of a vector v: ||v||2 = √(Σvi²). L2 loss (MSE): (1/n)Σ(yi − ŷi)². The MSE loss is the squared L2 norm of the residual vector, divided by n. Papers use "L2" loosely for both — always check whether they mean the norm or the squared norm divided by n. Getting this wrong changes your gradient by a factor of 2n.

Visualize the norm unit balls

The "unit ball" for a norm is the set of all vectors with norm ≤ 1. In 2D, L1 gives a diamond, L2 gives a circle, and L∞ gives a square. Drag the vector endpoint below to see how all three norms change, and toggle the unit ball shapes to see their geometry.

Norm Unit Balls & Live Vector Norms

Drag the orange dot to move the vector. Toggle the unit ball shapes with the buttons. Watch how the three norm values change.

Which norm would you use if you want some model weights to become exactly zero (feature selection)?

Chapter 3: Dot Product & Cosine Similarity

The dot product is the single most important operation in machine learning. It's how attention works in transformers. It's how similarity search finds related documents. It's how every neural network layer transforms data. If you master one operation from this entire lesson, make it this one.

The dot product takes two vectors of the same length and produces a single number. You multiply corresponding elements and sum the results. That's it. But from this simple recipe comes an astonishing amount of machine learning.

Worked example: computing the dot product

Given a = [1, 2, 3] and b = [4, −5, 6]:

a · b = 1×4 + 2×(−5) + 3×6 = 4 − 10 + 18 = 12

Each pair of corresponding elements gets multiplied, then we sum everything up. The result is a scalar — a single number. The sign tells us something important: positive means the vectors generally point in the same direction, negative means they point in opposite directions, and zero means they're perpendicular.

Worked example: full cosine similarity

Cosine similarity normalizes the dot product by the vectors' lengths, giving a value between −1 and +1 regardless of vector magnitude. This is crucial when comparing word embeddings, because we care about direction (meaning), not magnitude (which is often an artifact of word frequency).

Using the same vectors a = [1, 2, 3] and b = [4, −5, 6]:

Step 1 — compute the L2 norms:

||a|| = √(1² + 2² + 3²) = √(1 + 4 + 9) = √14 ≈ 3.742
||b|| = √(4² + (−5)² + 6²) = √(16 + 25 + 36) = √77 ≈ 8.775

Step 2 — divide the dot product by the product of norms:

cos(θ) = (a · b) / (||a|| × ||b||) = 12 / (3.742 × 8.775) = 12 / 32.84 ≈ 0.365

Step 3 — recover the angle:

θ = arccos(0.365) ≈ 68.6°

A cosine of 0.365 means these vectors are somewhat aligned but far from parallel. In NLP, two word embeddings with cos = 0.365 would indicate a weak semantic relationship.

The geometric meaning

The dot product has a beautiful geometric interpretation. It equals the product of the vectors' lengths times the cosine of the angle between them:

a · b = ||a|| · ||b|| · cos(θ)

This means three cases govern the sign of the dot product:

Angle θcos(θ)Dot productMeaning
0° (same direction)1Maximum positiveVectors are parallel
90° (perpendicular)0ZeroVectors are orthogonal
180° (opposite)−1Maximum negativeVectors are antiparallel

ML payoff: how attention actually works

In a transformer, each token produces a query vector (what am I looking for?) and a key vector (what do I offer?). The attention score between two tokens is the dot product of their query and key vectors, scaled by √dk.

Let's trace this with real numbers. Suppose dk = 4 (tiny, for illustration):

q = [1.0, 0.5, −0.3, 0.8]    (what "the" is looking for)
k1 = [0.9, 0.6, −0.2, 0.7]    (what "cat" offers)
k2 = [−0.5, 0.1, 0.8, −0.3]    (what "and" offers)

Compute raw attention scores (dot products):

score1 = q · k1 = 1.0×0.9 + 0.5×0.6 + (−0.3)×(−0.2) + 0.8×0.7
= 0.9 + 0.3 + 0.06 + 0.56 = 1.82
score2 = q · k2 = 1.0×(−0.5) + 0.5×0.1 + (−0.3)×0.8 + 0.8×(−0.3)
= −0.5 + 0.05 − 0.24 − 0.24 = −0.93

Score 1 (1.82) is much higher than score 2 (−0.93). After softmax normalization, token "cat" gets much more attention weight than token "and." The query vector of "the" is more aligned with the key vector of "cat" — which makes linguistic sense, since "the cat" is a common phrase.

This is the entire mechanism of attention. Dot products measure alignment between query and key vectors. Softmax converts scores to probabilities. The model learns the projection matrices that produce useful query and key vectors through backpropagation.

From scratch: dot product and cosine similarity

python
import math

def dot_product(a, b):
    """Sum of element-wise products."""
    assert len(a) == len(b), "Vectors must have same length"
    return sum(ai * bi for ai, bi in zip(a, b))

def l2_norm(v):
    return math.sqrt(sum(x ** 2 for x in v))

def cosine_similarity(a, b):
    """Dot product normalized by magnitudes. Returns [-1, 1]."""
    dot = dot_product(a, b)
    norm_a = l2_norm(a)
    norm_b = l2_norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Our worked example
a = [1, 2, 3]
b = [4, -5, 6]
print(f"Dot product: {dot_product(a, b)}")       # 12
print(f"Cosine sim:  {cosine_similarity(a, b):.4f}")  # 0.3651
angle = math.degrees(math.acos(cosine_similarity(a, b)))
print(f"Angle:       {angle:.1f}°")               # 68.6°

PyTorch equivalents

python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, -5.0, 6.0])

# Dot product
print(torch.dot(a, b))                  # tensor(12.)
print((a * b).sum())                    # tensor(12.) — manual way

# Cosine similarity (needs 2D input for F.cosine_similarity)
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))
print(cos)                              # tensor([0.3651])

# Attention scores (batched dot products)
q = torch.tensor([[1.0, 0.5, -0.3, 0.8]])   # [1, 4]
K = torch.tensor([[0.9, 0.6, -0.2, 0.7],    # [2, 4]
                  [-0.5, 0.1, 0.8, -0.3]])
scores = q @ K.T                         # [1, 2]
print(scores)                           # tensor([[1.82, -0.93]])
attn = F.softmax(scores / (4 ** 0.5), dim=-1)
print(attn)                             # token "cat" gets ~87% attention
Don't confuse dot product with element-wise multiplication: The dot product of [1,2,3] and [4,5,6] is a scalar: 1×4 + 2×5 + 3×6 = 32. Element-wise multiplication gives a vector: [1×4, 2×5, 3×6] = [4, 10, 18]. In PyTorch, torch.dot(a,b) vs a * b. The dot product = element-wise multiply followed by sum. Confusing these causes silent bugs where your loss is a vector instead of a scalar.
Projection as a "how much" detector: The dot product of a vector with a unit vector gives the projection — how far the vector extends in that direction. This is exactly what happens in neural networks: each neuron's weight vector is a "detector" for a specific pattern. The dot product measures how much the input resembles that pattern. A high score = strong match. This is why neurons are sometimes called "feature detectors."

Interactive: draggable dot product

Drag the endpoints of the two vectors below. Watch the dot product, cosine similarity, and angle update in real time. The background tints green for positive dot product (same direction), red for negative (opposite), and stays dark for near-zero (perpendicular). The dashed line shows the projection of a onto b.

Interactive Dot Product Explorer
Drag the orange and teal endpoints

Two word embeddings have cosine similarity = −0.8. What does this mean?

Chapter 4: Matrix Multiplication

This is it. The operation that runs the entire field of deep learning. Every forward pass through a neural network, every attention computation in a transformer, every linear layer in every model — matrix multiplication. We're going to derive it by hand, implement it from scratch, and then see why GPUs make it fast.

Matrix multiplication takes two matrices and produces a third. The rule is: element (i, j) of the result is the dot product of row i of the first matrix and column j of the second. That's it. Every element of the output is just a dot product — the operation we already mastered in Chapter 3.

Worked example 1: square matrices (extremely detailed)

Let A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]. The result C = A × B is a 2×2 matrix. Let's compute each element:

C[0][0] = row 0 of A · col 0 of B:

[1, 2] · [5, 7] = 1×5 + 2×7 = 5 + 14 = 19

C[0][1] = row 0 of A · col 1 of B:

[1, 2] · [6, 8] = 1×6 + 2×8 = 6 + 16 = 22

C[1][0] = row 1 of A · col 0 of B:

[3, 4] · [5, 7] = 3×5 + 4×7 = 15 + 28 = 43

C[1][1] = row 1 of A · col 1 of B:

[3, 4] · [6, 8] = 3×6 + 4×8 = 18 + 32 = 50

The full result:

C = [[19, 22], [43, 50]]

The pattern: C[i][j] = dot product of row i of A and column j of B. This is the only formula you need. Everything else follows from it.

Worked example 2: non-square matrices

What if the matrices aren't square? The shape rule is: (m × n) × (n × p) → (m × p). The inner dimensions must match, and they "cancel out." If A is 2×3 and B is 3×2, the result is 2×2.

A = [[1, 0, 2], [−1, 3, 1]], B = [[3, 1], [2, 1], [1, 0]]

Shape check: (2×3) × (3×2) → (2×2). The inner dimensions (3 and 3) match. Good.

C[0][0] = [1, 0, 2] · [3, 2, 1]:

1×3 + 0×2 + 2×1 = 3 + 0 + 2 = 5

C[0][1] = [1, 0, 2] · [1, 1, 0]:

1×1 + 0×1 + 2×0 = 1 + 0 + 0 = 1

C[1][0] = [−1, 3, 1] · [3, 2, 1]:

(−1)×3 + 3×2 + 1×1 = −3 + 6 + 1 = 4

C[1][1] = [−1, 3, 1] · [1, 1, 0]:

(−1)×1 + 3×1 + 1×0 = −1 + 3 + 0 = 2

Result: C = [[5, 1], [4, 2]]

The neural network connection

A neural network's linear layer computes output = W @ x + b. For a layer going from 784 inputs (a flattened 28×28 MNIST digit) to 128 hidden units:

MatrixShapeMeaning
W (weights)[128, 784]128 neurons, each connecting to 784 inputs
x (input)[784, 1]One flattened image
W @ x[128, 1]128 pre-activation values
b (bias)[128, 1]One bias per neuron

The multiply-add count: 128 × 784 = 100,352 multiply-adds for one image through one layer. For a batch of 64 images, that's 6.4 million multiply-adds. For a network with 5 layers, that's 32 million. And MNIST is a tiny dataset with tiny images — real models do billions.

From scratch: triple nested loop

python
def matmul(A, B):
    """Matrix multiplication from scratch.
    A is m×n, B is n×p → result is m×p.
    """
    m = len(A)          # rows of A
    n = len(A[0])       # cols of A = rows of B
    p = len(B[0])       # cols of B

    # Verify inner dimensions match
    assert len(B) == n, f"Shape mismatch: A is {m}×{n}, B has {len(B)} rows"

    # Initialize result with zeros
    C = [[0.0] * p for _ in range(m)]

    # The triple loop: O(m × n × p)
    for i in range(m):       # each row of A
        for j in range(p):   # each column of B
            for k in range(n): # dot product over shared dim
                C[i][j] += A[i][k] * B[k][j]

    return C

# Test with our first example
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
print(C)  # [[19, 22], [43, 50]] ✓

# Test with non-square example
A2 = [[1, 0, 2], [-1, 3, 1]]
B2 = [[3, 1], [2, 1], [1, 0]]
C2 = matmul(A2, B2)
print(C2)  # [[5, 1], [4, 2]] ✓

NumPy and PyTorch: the @ operator

python
import numpy as np
import torch

# NumPy
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)              # [[19, 22], [43, 50]]
print(np.matmul(A, B))    # same thing

# PyTorch
At = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
Bt = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)
print(At @ Bt)            # tensor([[19., 22.], [43., 50.]])
print(torch.mm(At, Bt))   # 2D only — same result
print(torch.matmul(At, Bt))  # works for any dimensionality

# Neural network layer: F.linear does x @ W.T + b
x = torch.randn(64, 784)   # batch of 64 MNIST images
W = torch.randn(128, 784)  # weight matrix
b = torch.randn(128)       # bias
out = x @ W.T + b         # [64, 128] — 64 outputs, 128 dims each
Matrix multiplication is NOT commutative: A × B ≠ B × A in general. In fact, if A is 2×3 and B is 3×4, then A×B gives a 2×4 matrix but B×A doesn't even exist (you can't multiply 3×4 by 2×3 — the inner dims 4 and 2 don't match). Even when both products exist (square matrices), the results are usually different. [[1,2],[3,4]] × [[5,6],[7,8]] = [[19,22],[43,50]], but [[5,6],[7,8]] × [[1,2],[3,4]] = [[23,34],[31,46]]. Different.
Why are GPUs so good at this? Matrix multiplication is O(n³) — for n×n matrices, that's n³ multiply-adds. But here's the key: every one of those multiply-adds is independent. C[0][0] doesn't depend on C[1][1]. A GPU with 10,000 cores can compute 10,000 elements simultaneously. That's not a fancy algorithm — it's brute-force parallelism. The reason ML runs on GPUs isn't any clever trick. It's that matrix multiply is embarrassingly parallel and GPUs have thousands of simple cores.

Watch it happen: animated matrix multiply

The widget below shows two matrices being multiplied. The current row of A is highlighted in orange, and the current column of B in teal. The dot product between them fills in the corresponding element of the result. Press Step to advance one element, or Play for automatic animation.

Animated Matrix Multiplication
Speed 600ms

What is the shape of the result when multiplying a [3, 5] matrix by a [5, 2] matrix?

Chapter 5: Special Matrices & the Inverse

Some matrices are special enough to get names. You'll see these constantly in ML papers and codebases: the identity matrix (skip connections, initialization), diagonal matrices (efficient scaling), symmetric matrices (covariance, Hessians), and orthogonal matrices (preserving gradient norms). And the inverse — the matrix that "undoes" a multiplication — is how we solve systems of equations in closed form.

The identity matrix I is the matrix equivalent of the number 1. For any matrix A: A · I = I · A = A. It's a square matrix with 1s on the diagonal and 0s everywhere else. In ML, it appears in skip connections (ResNets add the identity: y = F(x) + x) and in initialization (starting weight matrices near identity helps gradient flow).

A diagonal matrix has nonzero values only on the main diagonal. Multiplying by a diagonal matrix just scales each dimension independently — it's O(n) instead of O(n²). Attention masks, layer normalization scaling, and the λI regularization term all use diagonal structure.

A symmetric matrix satisfies A = AT: it equals its own transpose. Covariance matrices are always symmetric (the correlation of X with Y equals the correlation of Y with X). The Hessian (second-derivative matrix) is symmetric too. Symmetric matrices have real eigenvalues and orthogonal eigenvectors — properties that make optimization theory work.

An orthogonal matrix satisfies QTQ = QQT = I: its transpose is its inverse. Orthogonal matrices preserve lengths and angles — they're pure rotations (and reflections). Using orthogonal initialization for weight matrices prevents gradients from exploding or vanishing in deep networks.

A positive definite matrix satisfies xTAx > 0 for every nonzero vector x. This means the matrix defines a "bowl-shaped" surface with a unique minimum — exactly what you want for convex optimization. Covariance matrices are positive semi-definite (allowing zero, not just positive).

Matrix TypePropertyML Use
Identity IAI = IA = ASkip connections, initialization
DiagonalOnly diagonal nonzeroEfficient scaling, attention masks
SymmetricA = ATCovariance matrices, Hessians
OrthogonalQTQ = IWeight init (preserves gradient norms)
Positive definitexTAx > 0Covariance, convex optimization

The matrix inverse

The inverse of a matrix A, written A−1, is the matrix that "undoes" A. If you multiply a vector by A, then by A−1, you get back the original vector: A−1(Ax) = x. Not every matrix has an inverse — only square matrices with nonzero determinant are invertible.

Worked example 1: computing a 2×2 inverse

For a 2×2 matrix A = [[a, b], [c, d]], the inverse has a clean formula:

A−1 = (1/det) × [[d, −b], [−c, a]]

where det(A) = ad − bc. Let's work through A = [[4, 7], [2, 6]]:

Step 1: Compute the determinant:

det(A) = 4×6 − 7×2 = 24 − 14 = 10

Step 2: Apply the formula (swap diagonal, negate off-diagonal, divide by det):

A−1 = (1/10) × [[6, −7], [−2, 4]] = [[0.6, −0.7], [−0.2, 0.4]]

Step 3: Verify by multiplying A × A−1 — we should get the identity matrix:

A × A−1 = [[4×0.6 + 7×(−0.2),   4×(−0.7) + 7×0.4], [2×0.6 + 6×(−0.2),   2×(−0.7) + 6×0.4]]
= [[2.4 − 1.4,   −2.8 + 2.8], [1.2 − 1.2,   −1.4 + 2.4]] = [[1, 0], [0, 1]] ✓

Worked example 2: a singular matrix (no inverse)

Not every matrix can be inverted. Consider A = [[1, 2], [2, 4]]:

det(A) = 1×4 − 2×2 = 4 − 4 = 0

The determinant is zero — division by zero. No inverse exists. But why, geometrically? Look at the rows: row 2 = 2 × row 1. The two rows are linearly dependent — they point in the same direction. This matrix "collapses" 2D space onto a 1D line. Multiple input vectors get mapped to the same output, so there's no way to reverse the mapping. Information is destroyed.

In ML, this matters. If your feature matrix X has linearly dependent columns (say, one feature is exactly twice another), then XTX is singular and cannot be inverted. The normal equation for linear regression w = (XTX)−1XTy won't work. That's one reason we add regularization.

ML connection: the closed-form solution for linear regression

The optimal weights for linear regression minimize the squared error ||Xw − y||². Taking the derivative, setting it to zero, and solving gives:

w = (XTX)−1 XT y

Let's trace why this works. We want to minimize L(w) = ||Xw − y||². Expanding: L = (Xw − y)T(Xw − y). Taking the gradient with respect to w and setting it to zero:

wL = 2XT(Xw − y) = 0    ⇒    XTXw = XTy    ⇒    w = (XTX)−1XTy

But what if XTX is singular (linearly dependent features)? We can't invert it. The solution: add ridge regularization:

w = (XTX + λI)−1 XT y

Adding λI to the diagonal guarantees that XTX + λI is invertible (its eigenvalues are all shifted up by λ > 0, so none can be zero). This is another reason regularization isn't just about preventing overfitting — it makes the math work.

From scratch: 2×2 inverse

python
def det2x2(A):
    """Determinant of a 2×2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2x2(A):
    """Inverse of a 2×2 matrix. Returns None if singular."""
    d = det2x2(A)
    if abs(d) < 1e-12:
        return None  # singular — no inverse
    return [
        [ A[1][1] / d, -A[0][1] / d],
        [-A[1][0] / d,  A[0][0] / d]
    ]

# Example 1: invertible
A = [[4, 7], [2, 6]]
print(f"det(A) = {det2x2(A)}")  # 10
Ainv = inv2x2(A)
print(f"A⁻¹ = {Ainv}")        # [[0.6, -0.7], [-0.2, 0.4]]

# Verify: A @ A⁻¹ should be identity
I = matmul(A, Ainv)            # using our matmul from Ch4
print(f"A × A⁻¹ = {I}")       # [[1.0, 0.0], [0.0, 1.0]] ✓

# Example 2: singular
B = [[1, 2], [2, 4]]
print(f"det(B) = {det2x2(B)}")  # 0
print(inv2x2(B))                # None — no inverse

NumPy and PyTorch

python
import numpy as np
import torch

A = np.array([[4.0, 7.0], [2.0, 6.0]])

# NumPy inverse and determinant
print(np.linalg.det(A))           # 10.0
print(np.linalg.inv(A))           # [[ 0.6, -0.7], [-0.2,  0.4]]
print(A @ np.linalg.inv(A))       # [[1, 0], [0, 1]] (identity)

# PyTorch
At = torch.tensor([[4.0, 7.0], [2.0, 6.0]])
print(torch.linalg.det(At))       # tensor(10.)
print(torch.linalg.inv(At))       # same result

# Ridge regression: w = (X^T X + λI)^{-1} X^T y
X = torch.randn(100, 5)          # 100 samples, 5 features
y = torch.randn(100, 1)          # targets
lam = 0.01
XtX = X.T @ X                   # [5, 5]
reg = XtX + lam * torch.eye(5)  # add λI for invertibility
w = torch.linalg.inv(reg) @ X.T @ y  # [5, 1] — closed-form solution
print(w.shape)                    # torch.Size([5, 1])
In practice, never compute inverses directly: In production ML code, we almost NEVER call np.linalg.inv() or torch.linalg.inv(). Matrix inversion is O(n³), numerically unstable, and unnecessary. Instead we use: (1) Gradient descent (let the optimizer find w iteratively), (2) Cholesky decomposition (torch.linalg.cholesky_ex), (3) SVD pseudoinverse (torch.linalg.pinv), or (4) QR decomposition (torch.linalg.qr). The closed-form solution is for understanding. In practice, SGD always wins for large problems.
Why singular matrices break everything: A singular matrix maps multiple different inputs to the same output. Think of projecting a 3D object onto a 2D shadow — the shadow can't tell you the original 3D shape because depth information was destroyed. A determinant of zero means volume collapses to zero: the matrix crushes space by at least one dimension. When you see LinAlgError: Singular matrix, it means your features are linearly dependent — one column is a combination of others. Fix it by removing redundant features or adding regularization.

Interactive: transform the unit square

Use the four sliders to control a 2×2 matrix [[a, b], [c, d]]. The widget shows how the matrix transforms the unit square into a parallelogram. When the determinant approaches zero, the parallelogram collapses toward a line. The inverse matrix and eigenvalues update in real time.

2×2 Matrix Explorer

Adjust the matrix entries. Watch the transformed shape, determinant, inverse, and eigenvalues change. Try making the determinant zero — what happens?

a (top-left) 1.0
b (top-right) 0.0
c (bot-left) 0.0
d (bot-right) 1.0

A matrix has determinant = 0. What does this mean?

Chapter 6: Linear Independence & Rank

Imagine you're running a survey with 100 questions, hoping to measure 100 different aspects of personality. After collecting data from 10,000 people, you discover something surprising: the answers to questions 3, 17, and 42 are almost perfectly correlated. If someone scores high on question 3, you can predict their answers to 17 and 42. Those three questions are really measuring the same underlying trait.

That's the intuition behind linear independence. Some vectors (or measurements, or features) are redundant — they can be written as combinations of others. The rank of a matrix tells you how many truly independent directions it contains. Strip away the redundancy, and rank tells you the real dimensionality of your data.

This idea is the foundation of dimensionality reduction (PCA), low-rank adaptation (LoRA), and matrix completion (Netflix recommendations). If you don't understand rank, these techniques are magic. Once you do, they're obvious.

What does "linearly independent" actually mean?

Two vectors are linearly dependent if one is a scalar multiple of the other — they point in the same (or opposite) direction. Three vectors are dependent if any one can be written as a combination of the other two.

More precisely: vectors v1, v2, ..., vk are linearly independent if the ONLY solution to c1v1 + c2v2 + ... + ckvk = 0 is c1 = c2 = ... = ck = 0. In plain English: you can't make the zero vector from them unless you use all-zero coefficients. No vector is "free" — none can be built from the others.

Hand calculation 1: two vectors in 2D

Are v1 = [1, 2] and v2 = [2, 4] linearly independent?

Check: is v2 a multiple of v1? Yes: [2, 4] = 2 × [1, 2]. They point in the exact same direction. One is redundant.

v2 = 2 · v1   ⇒   dependent

What about v1 = [1, 0] and v2 = [0, 1]? There's no scalar c such that [1, 0] = c × [0, 1]. The first vector points along x, the second along y. Neither is a multiple of the other.

c · [0, 1] = [1, 0]  ⇒  impossible  ⇒  independent

Here's the geometric test: two 2D vectors are independent if and only if they point in different directions (not parallel). The area of the parallelogram they form is nonzero. When they're dependent, the parallelogram collapses to a line — zero area.

Hand calculation 2: rank via row reduction

The rank of a matrix is the number of linearly independent rows (equivalently, columns). We find it by row reduction (Gaussian elimination) — systematically eliminating redundancy until we see how many independent rows remain.

Find the rank of:

A = [[1, 2, 3], [2, 4, 6], [0, 1, 1]]

Step 1: Eliminate below the first pivot. R2 = R2 − 2×R1:

[[1, 2, 3], [0, 0, 0], [0, 1, 1]]

Row 2 became all zeros! That means R2 was exactly 2×R1 — completely redundant.

Step 2: Swap R2 and R3 to move the zero row to the bottom:

[[1, 2, 3], [0, 1, 1], [0, 0, 0]]

Result: Two nonzero rows ⇒ rank = 2. The 3×3 matrix only has 2 independent directions. It maps 3D space into a 2D subspace. One dimension is lost.

Hand calculation 3: what rank means for Ax = b

The rank controls whether the system Ax = b has solutions, and how many.

ConditionMeaningSolutions
rank(A) = n (# columns)All columns independentAt most one solution
rank(A) < nSome columns redundantInfinitely many solutions (underdetermined)
rank(A) < m (# rows)Some rows redundantSome b have NO solution (overdetermined)
rank(A) = m = nFull rank, squareExactly one solution for every b

In ML, we're almost always in the underdetermined case. A neural network with millions of parameters and thousands of training examples has far more unknowns than equations. There are infinitely many parameter settings that fit the data — regularization picks which one.

The ML payoff: low-rank structure everywhere

When people say "the data lies on a lower-dimensional manifold," they mean the data matrix has low effective rank. Consider a dataset of face images, each 100×100 = 10,000 pixels. If you stack 50,000 faces into a 50,000×10,000 matrix, it might have rank ~50. Why? Because faces only vary along about 50 independent factors: pose, lighting, expression, identity, skin tone. The other 9,950 "dimensions" are redundant combinations of these factors.

PCA discovers these factors. It finds the rank-k approximation that captures the most variance.

LoRA (Low-Rank Adaptation) exploits the same insight for fine-tuning. Instead of updating a full weight matrix W (e.g., 4096×4096 = 16.7M parameters), LoRA parameterizes the update as ΔW = B × A where B is 4096×r and A is r×4096, with r = 4 or 8. That's only 2×4096×8 = 65,536 parameters instead of 16.7 million. The key assumption: the useful update has low rank.

Why LoRA works: Fine-tuning changes a relatively small number of "directions" in weight space — the update ΔW has low rank. LoRA factors this as B×A with small rank r. This is not an approximation to save memory — it's a structural insight about the geometry of fine-tuning. The full-rank update is mostly noise; the low-rank update captures the signal.

From scratch: rank via row reduction

python
def rank_via_row_reduction(matrix):
    """Compute rank by Gaussian elimination (row echelon form)."""
    # Make a copy so we don't modify the original
    A = [row[:] for row in matrix]
    m, n = len(A), len(A[0])
    pivot_row = 0

    for col in range(n):
        # Find a nonzero entry in this column at or below pivot_row
        found = None
        for row in range(pivot_row, m):
            if abs(A[row][col]) > 1e-10:
                found = row
                break
        if found is None:
            continue  # No pivot in this column

        # Swap the found row into the pivot position
        A[pivot_row], A[found] = A[found], A[pivot_row]

        # Eliminate all entries below the pivot
        for row in range(pivot_row + 1, m):
            factor = A[row][col] / A[pivot_row][col]
            for j in range(n):
                A[row][j] -= factor * A[pivot_row][j]

        pivot_row += 1

    return pivot_row  # Number of nonzero rows = rank

# Test with our example
A = [[1, 2, 3], [2, 4, 6], [0, 1, 1]]
print("Rank:", rank_via_row_reduction(A))  # 2

The NumPy way

python
import numpy as np

A = np.array([[1, 2, 3], [2, 4, 6], [0, 1, 1]])
print("Rank:", np.linalg.matrix_rank(A))  # 2

# LoRA example: rank-8 update to a 512x512 weight matrix
B = np.random.randn(512, 8)   # 512 x 8
A_lora = np.random.randn(8, 512)  # 8 x 512
delta_W = B @ A_lora              # 512 x 512 but rank 8
print("Shape:", delta_W.shape)     # (512, 512)
print("Rank:", np.linalg.matrix_rank(delta_W))  # 8
print("Params (full):", 512*512)            # 262,144
print("Params (LoRA):", 512*8 + 8*512)    # 8,192 (32x fewer!)
Rank is NOT the same as dimension. A 5×3 matrix lives in a 5-dimensional output space but its rank can be at most 3 (the minimum of rows and columns). The rank tells you how much of that space the matrix actually uses. A 1000×1000 matrix with rank 3 maps everything onto a 3D subspace — 997 dimensions are completely wasted.

See it: independence and the parallelogram test

The widget below shows two 2D vectors that you can drag. When the vectors are independent, they span a parallelogram with nonzero area. When they become parallel (dependent), the parallelogram collapses to a line — zero area. The area equals the absolute value of the determinant, which is zero precisely when the vectors are dependent.

Linear Independence & Rank Visualizer

Drag the arrow tips to move vectors. Watch the determinant and rank update. When vectors are parallel, the area collapses to zero.

det = 1.00 | rank = 2 | Independent

A 100×100 matrix has rank 5. What does this tell you about the data?

Chapter 7: Span, Basis, and Subspaces

You're standing on a flat field with a compass. Someone tells you: "Walk 3 steps North and 2 steps East." You end up at a specific point. Now here's the key question: could you reach any point on the field using just North and East steps? Yes — any combination of North and East can take you anywhere on the 2D field. North and East form a basis for 2D space.

But what if someone only gave you North and Northeast? You can still reach any point — just combine them differently. What about North and South? No — those are opposite directions (dependent), so you can only move along the North-South line. You're stuck on a 1D subspace.

A basis is the minimum set of vectors you need to describe every point in a space. Changing the basis doesn't change the geometry — just the coordinate system. This is the single most important idea in linear algebra for ML, because every learned representation is a choice of basis.

Span: what can you reach?

The span of a set of vectors is the set of ALL points you can reach by combining them with any coefficients. If you have vectors v1 and v2, their span is:

span(v1, v2) = { c1 v1 + c2 v2  |  c1, c2 ∈ ℝ }

If v1 and v2 are independent 2D vectors, their span is ALL of 2D space. If they're parallel (dependent), the span is just a line through the origin.

In 3D: span({[1,0,0], [0,1,0]}) = the xy-plane. You can reach any point on that plane, but you can never leave it — you can't reach [0,0,1]. Adding [0,0,1] to the set gives you span of all 3D space. But adding [2,3,0] doesn't help — it's already in the span of the first two vectors (it lies in the xy-plane).

Hand calculation 1: standard basis coordinates

The standard basis in 2D is e1 = [1, 0] and e2 = [0, 1]. These are the unit vectors along the x and y axes. Any point p can be written:

p = [3, 2] = 3 · [1, 0] + 2 · [0, 1] = 3 · e1 + 2 · e2

The coordinates [3, 2] ARE the coefficients. This is so natural that we don't even think about it — the coordinate (3, 2) directly means "3 units along e1 and 2 units along e2." But this is only true because we're using the standard basis. Change the basis, and the same point gets different coordinates.

Hand calculation 2: coordinates in a rotated basis

Let's define a new basis by rotating 45°:

b1 = [1/√2, 1/√2] ≈ [0.707, 0.707]
b2 = [−1/√2, 1/√2] ≈ [−0.707, 0.707]

These are orthonormal (perpendicular and unit length). The same point p = [3, 2] has new coordinates found by projection (dot products with the basis vectors):

c1 = p · b1 = 3 × 0.707 + 2 × 0.707 = 5/√2 ≈ 3.536
c2 = p · b2 = 3 × (−0.707) + 2 × 0.707 = −1/√2 ≈ −0.707

Let's verify: reconstruct p from the new coordinates:

c1 · b1 + c2 · b2 = 3.536 × [0.707, 0.707] + (−0.707) × [−0.707, 0.707]
= [2.500, 2.500] + [0.500, −0.500] = [3.000, 2.000]   ✓

The point didn't move. Its geometric location is exactly the same. Only the numbers describing it changed — from [3, 2] to [3.536, −0.707]. Different coordinates, same point.

Hand calculation 3: subspaces

A subspace is a subset of a vector space that is itself a vector space — it must contain the zero vector, and it must be closed under addition and scalar multiplication. In plain language: a subspace is a flat sheet (possibly lower-dimensional) passing through the origin.

Examples in 3D:

SubspaceDimensionWhat it looks like
{0}0Just the origin
span([1,0,0])1The x-axis (a line)
span([1,0,0],[0,1,0])2The xy-plane
span([1,0,0],[0,1,0],[0,0,1])3All of 3D space

NOT a subspace: {[x, y] : x ≥ 0}. This is the right half-plane, but it's not closed under scalar multiplication (multiply [1, 0] by −1 and you leave the set).

The ML payoff: embeddings are coordinates in a learned basis

When word2vec maps "king" to a 768-dimensional vector [0.2, −0.5, 0.8, ...], those 768 numbers are coordinates in a learned basis. The model discovered 768 basis directions during training — abstract semantic axes that capture meaning.

The famous analogy king − man + woman ≈ queen works because the learned basis captures semantic directions. There's a "gender" direction, a "royalty" direction, and so on. Subtracting "man" and adding "woman" moves along the gender axis while preserving the royalty axis.

PCA is explicitly a basis change. It rotates from the standard basis to a basis aligned with the principal directions of variance in your data. The first principal component is the direction of maximum variance; the second is the direction of maximum remaining variance, perpendicular to the first; and so on.

Think of it this way: The standard basis [1,0,0,...] is arbitrary — it has no relationship to your data. PCA finds a basis that's aligned with your data's structure. In the new basis, the first few coordinates carry most of the information, and you can throw away the rest. That's dimensionality reduction — and it's nothing but a basis change followed by truncation.

From scratch: basis change

python
import math

def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def scale(c, v):
    return [c*x for x in v]

def add(a, b):
    return [x+y for x, y in zip(a, b)]

# Standard basis
e1, e2 = [1, 0], [0, 1]
p = [3, 2]
print("Standard coords:", p)  # [3, 2]

# 45-degree rotated basis
s = 1 / math.sqrt(2)
b1, b2 = [s, s], [-s, s]

# Project p onto new basis (works for orthonormal bases)
c1 = dot(p, b1)
c2 = dot(p, b2)
print(f"New coords: [{c1:.3f}, {c2:.3f}]")  # [3.536, -0.707]

# Reconstruct: c1*b1 + c2*b2 should equal p
reconstructed = add(scale(c1, b1), scale(c2, b2))
print("Reconstructed:", [round(x, 6) for x in reconstructed])  # [3.0, 2.0]

The NumPy way

python
import numpy as np

# Basis change via matrix multiplication
p = np.array([3.0, 2.0])
theta = np.pi / 4  # 45 degrees
# Rotation matrix (columns are new basis vectors)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# New coords = R^T @ p (inverse of orthogonal matrix is transpose)
new_coords = R.T @ p
print("New coords:", new_coords)  # [ 3.536 -0.707]
# Reconstruct
print("Reconstructed:", R @ new_coords)  # [3. 2.]

See it: same point, different coordinates

The widget below shows a single point on a 2D plane. Toggle between the standard basis (gray grid) and a custom basis (you can drag the basis vectors). The point stays in exactly the same place — only the coordinate numbers change. This is what "basis change" means geometrically: nothing moves.

Basis Change Visualizer

Drag the basis vectors (colored arrows) to define a custom basis. Watch how the coordinates change while the point (white dot) stays fixed. Toggle between standard and custom grids.

Standard: (3.0, 2.0) | Custom: (3.54, -0.71)

You change the basis of your data from the standard basis to PCA components. Do the actual data points move?

Chapter 6: Eigenvalues & Eigenvectors

Most matrices, when applied to a vector, both stretch and rotate it. The vector comes out pointing in a completely different direction, with a different length. But for every matrix, there are special vectors that only get stretched — not rotated at all. These are eigenvectors, and the stretching factor is the eigenvalue.

Why should you care? Because eigenvectors reveal the natural axes of a transformation. PCA finds the directions of maximum variance in your data — those are eigenvectors of the covariance matrix. The stability of a dynamical system depends on eigenvalues. PageRank, spectral clustering, and the Hessian of your loss function all boil down to eigenproblems.

The defining equation: A vector v is an eigenvector of matrix A if multiplying by A only scales it: Av = λv. The scalar λ is the eigenvalue. The vector doesn't rotate — it stays on its own line. Every other vector gets both stretched AND rotated.
A v = λ v    ↔    (A − λI) v = 0

For the equation (A - λI)v = 0 to have a non-trivial solution (v ≠ 0), the matrix (A - λI) must be singular — its determinant must be zero. This gives us the characteristic equation:

det(A − λI) = 0

Hand Calculation 1: Finding Eigenvalues and Eigenvectors

Let's find the eigenvalues and eigenvectors of this matrix:

A = [[2, 1], [1, 2]]

Step 1: Set up det(A − λI) = 0.

Subtract λ from each diagonal entry:

A − λI = [[2−λ, 1], [1, 2−λ]]

Step 2: Compute the determinant.

For a 2×2 matrix [[a,b],[c,d]], det = ad − bc:

det = (2−λ)(2−λ) − 1×1

Expand (2−λ)(2−λ):

= 4 − 2λ − 2λ + λ² − 1 = λ² − 4λ + 3

Step 3: Solve the characteristic equation.

λ² − 4λ + 3 = 0

Factor: (λ − 1)(λ − 3) = 0. So λ&sub1; = 1 and λ&sub2; = 3.

Step 4: Find eigenvectors for each eigenvalue.

For λ&sub1; = 1, solve (A − 1·I)v = 0:

[[2−1, 1], [1, 2−1]] v = 0  →  [[1, 1], [1, 1]] v = 0

Both rows say the same thing: v&sub1; + v&sub2; = 0, so v&sub2; = −v&sub1;. Pick v&sub1; = 1:

v&sub1; = [−1, 1]   (or any scalar multiple)

For λ&sub2; = 3, solve (A − 3I)v = 0:

[[2−3, 1], [1, 2−3]] v = 0  →  [[−1, 1], [1, −1]] v = 0

Both rows say: −v&sub1; + v&sub2; = 0, so v&sub2; = v&sub1;. Pick v&sub1; = 1:

v&sub2; = [1, 1]   (or any scalar multiple)

Step 5: Verify both eigenpairs.

Check Av&sub1; = λ&sub1;v&sub1;:

[[2,1],[1,2]] × [−1, 1] = [2(−1)+1(1), 1(−1)+2(1)] = [−1, 1] = 1 × [−1, 1] ✓

Check Av&sub2; = λ&sub2;v&sub2;:

[[2,1],[1,2]] × [1, 1] = [2(1)+1(1), 1(1)+2(1)] = [3, 3] = 3 × [1, 1] ✓
Geometric meaning: Eigenvector [−1, 1] (the anti-diagonal direction) gets scaled by 1 — it stays the same length. Eigenvector [1, 1] (the diagonal direction) gets scaled by 3 — it triples in length. Every OTHER vector gets both stretched AND rotated.

Hand Calculation 2: Complex Eigenvalues (Rotation)

Not all matrices have real eigenvalues. Consider the 90° rotation matrix:

A = [[0, −1], [1, 0]]

Characteristic equation:

det([[−λ, −1], [1, −λ]]) = (−λ)(−λ) − (−1)(1) = λ² + 1 = 0

This gives λ = ±i (imaginary numbers!). There are no real eigenvectors. Geometrically this makes perfect sense: a 90° rotation moves EVERY vector to a new direction. No vector stays on its own line. Complex eigenvalues always mean the matrix involves rotation.

Rule of thumb: Real eigenvalues = stretching/flipping. Complex eigenvalues = rotation is involved. Purely imaginary eigenvalues (like ±i) = pure rotation with no stretching. This is how engineers determine if a system oscillates.

ML Payoff 1: PCA (Principal Component Analysis)

PCA finds the directions of maximum variance in your data. Here's how it works, step by step, with actual numbers.

Given 5 data points in 2D:

(1, 2), (2, 3.5), (3, 5.1), (4, 6.4), (5, 8.2)

Step 1: Center the data (subtract the mean of each dimension).

μx = (1+2+3+4+5)/5 = 3,    μy = (2+3.5+5.1+6.4+8.2)/5 = 5.04

Centered points: (−2, −3.04), (−1, −1.54), (0, 0.06), (1, 1.36), (2, 3.16).

Step 2: Compute the covariance matrix C = (1/n)XTX:

Cxx = (4+1+0+1+4)/5 = 2.0
Cxy = (6.08+1.54+0+1.36+6.32)/5 = 3.06
Cyy = (9.24+2.37+0.0036+1.85+9.99)/5 = 4.69
C = [[2.0, 3.06], [3.06, 4.69]]

Step 3: Find eigenvalues of C.

λ² − (2.0+4.69)λ + (2.0×4.69 − 3.06²) = 0
λ² − 6.69λ + (9.38 − 9.36) = 0  →  λ² − 6.69λ + 0.02 = 0

Using the quadratic formula: λ&sub1; ≈ 6.69, λ&sub2; ≈ 0.003.

Step 4: Interpret. The first eigenvalue (6.69) captures 6.69/(6.69+0.003) ≈ 99.96% of the total variance. The data is almost perfectly 1-dimensional! The eigenvector for λ&sub1; points along the line y ≈ 1.53x — the principal component.

PCA in one sentence: Find the eigenvectors of the covariance matrix. The eigenvector with the largest eigenvalue is the direction of maximum spread. The eigenvalue tells you HOW MUCH spread. If one eigenvalue dominates, your data is essentially lower-dimensional.

ML Payoff 2: The Hessian and Loss Landscape

The Hessian is the matrix of second derivatives of the loss function: Hij = ∂²L/(∂wi∂wj). Its eigenvalues tell you about the curvature of the loss landscape:

Eigenvalue patternMeaningImplication
All positiveLocal minimum (bowl)Gradient descent converges here
All negativeLocal maximum (hilltop)Unstable — optimizer leaves
Mixed +/−Saddle pointMinimum in some directions, maximum in others
Large eigenvalueSharp curvatureSensitive to learning rate in that direction
Small eigenvalueFlat directionSlow learning — gradient is tiny

A large ratio of max/min eigenvalue (λmaxmin) means the loss surface is like a narrow ravine — steep walls but a gentle slope along the bottom. This is why adaptive optimizers like Adam work: they effectively rescale each direction by its curvature.

From-Scratch Code: 2×2 Eigenvalue Solver

python
import math

def eigen_2x2(a, b, c, d):
    """Find eigenvalues and eigenvectors of [[a,b],[c,d]]."""
    # Characteristic equation: lambda^2 - (a+d)*lambda + (ad-bc) = 0
    trace = a + d
    det = a * d - b * c
    discriminant = trace**2 - 4 * det

    if discriminant < 0:
        return "Complex eigenvalues (rotation!)", None

    sqrt_d = math.sqrt(discriminant)
    lam1 = (trace + sqrt_d) / 2
    lam2 = (trace - sqrt_d) / 2

    # Eigenvectors: solve (A - lambda*I)v = 0
    def eigvec(lam):
        if abs(b) > 1e-10:
            return [b, lam - a]
        elif abs(c) > 1e-10:
            return [lam - d, c]
        else:
            return [1, 0] if lam == a else [0, 1]

    v1 = eigvec(lam1)
    v2 = eigvec(lam2)

    # Normalize to unit length
    for v in [v1, v2]:
        norm = math.sqrt(v[0]**2 + v[1]**2)
        v[0] /= norm
        v[1] /= norm

    return (lam1, v1), (lam2, v2)

# Test with A = [[2, 1], [1, 2]]
r = eigen_2x2(2, 1, 1, 2)
print(r)
# ((3.0, [0.707, 0.707]), (1.0, [-0.707, 0.707]))

Power Iteration: Finding the Largest Eigenvalue

For large matrices, we don't solve the characteristic polynomial (it's degree n!). Instead, we use power iteration: start with a random vector, multiply by A repeatedly, and normalize. The vector converges to the eigenvector with the largest eigenvalue.

python
import numpy as np

def power_iteration(A, n_iters=100):
    """Find largest eigenvalue/eigenvector via power iteration."""
    n = A.shape[0]
    v = np.random.randn(n)          # random start
    v = v / np.linalg.norm(v)       # normalize

    for _ in range(n_iters):
        Av = A @ v                    # multiply
        v = Av / np.linalg.norm(Av)   # normalize

    eigenvalue = v @ A @ v          # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2, 1], [1, 2]])
val, vec = power_iteration(A)
print(f"Largest eigenvalue: {val:.4f}")  # 3.0000
print(f"Eigenvector: {vec}")               # [0.707, 0.707]

NumPy: The Easy Way

python
import numpy as np

A = np.array([[2, 1], [1, 2]])

# General eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)    # [3. 1.]
print("Eigenvectors (columns):")
print(eigenvectors)
# [[ 0.707  -0.707]
#  [ 0.707   0.707]]

# For symmetric matrices (like covariance), use eigh
# Returns sorted eigenvalues (ascending) and orthogonal eigenvectors
vals, vecs = np.linalg.eigh(A)
print("Sorted eigenvalues:", vals)  # [1. 3.]
Misconception alert: Eigenvectors are NOT unique. Any scalar multiple of an eigenvector is also an eigenvector. When we say "the eigenvector is [1, 1]," we really mean "any vector in the direction [1, 1]." That's why we often normalize them to unit length. The eigenvalue IS unique, but the eigenvector only defines a direction, not a specific vector.

SHOWCASE: Eigenvector Visualizer

Eigenvector & Eigenvalue Explorer

A unit circle is transformed by the matrix. Eigenvectors (arrows) align with the axes of the resulting ellipse. Adjust matrix entries with the sliders. Toggle PCA mode to see eigenvectors of random data's covariance matrix.

a (top-left)2.0
b (top-right)1.0
c (bot-left)1.0
d (bot-right)2.0
Matrix A has eigenvalues λ&sub1;=5, λ&sub2;=0.1. What does this tell you about PCA?

Chapter 7: SVD — The Crown Jewel

If there's ONE decomposition to learn for ML, it's the Singular Value Decomposition (SVD). Every matrix A — any shape, any size, any rank — can be decomposed as:

A = U Σ VT

Where U is an orthogonal matrix (m×m), Σ is a diagonal matrix of singular values (m×n), and VT is an orthogonal matrix (n×n). The singular values σ&sub1; ≥ σ&sub2; ≥ ... ≥ 0 are always non-negative and sorted in decreasing order.

Why SVD is king: Eigendecomposition only works on square matrices. SVD works on EVERY matrix — rectangular, rank-deficient, you name it. It powers LoRA fine-tuning, recommender systems, image compression, noise reduction, the pseudoinverse, and dimensionality reduction. If you understand SVD, you understand half of numerical linear algebra.

Geometrically, SVD says: every linear transformation is a rotation (VT), then a stretching (Σ), then another rotation (U). That's it. Every matrix, no matter how complicated, is just rotate-stretch-rotate.

Hand Calculation 1: SVD of a Diagonal Matrix

Start simple. Let A = [[3, 0], [0, 2]].

Step 1: Compute ATA.

ATA = [[3,0],[0,2]]T × [[3,0],[0,2]] = [[9, 0], [0, 4]]

Step 2: Find eigenvalues of ATA.

ATA is diagonal, so eigenvalues are just the diagonal entries: 9 and 4.

Singular values are the square roots: σ&sub1; = 3, σ&sub2; = 2.

Step 3: Find V (eigenvectors of ATA).

For a diagonal matrix, eigenvectors are the standard basis vectors:

V = [[1, 0], [0, 1]] = I

Step 4: Find U (eigenvectors of AAT).

AAT = [[3,0],[0,2]] × [[3,0],[0,2]]T = [[9, 0], [0, 4]]

Same eigenvalues, same eigenvectors: U = I.

Step 5: Build Σ.

Σ = [[3, 0], [0, 2]]

Step 6: Verify.

UΣVT = I × [[3,0],[0,2]] × I = [[3,0],[0,2]] = A ✓
For a diagonal matrix, SVD is trivial: the singular values are just the absolute values of the diagonal entries, and U and V are (possibly sign-flipped) identity matrices. The interesting cases are non-diagonal matrices.

Hand Calculation 2: Low-Rank Approximation

The most powerful property of SVD: you can approximate any matrix by keeping only the top k singular values. Given A = UΣVT, the best rank-k approximation is:

Ak = Uk Σk VkT

Let's try with a 3×3 matrix. Suppose SVD gives us:

Σ = [[10, 0, 0], [0, 3, 0], [0, 0, 0.1]]

Rank-1 approximation: Keep only σ&sub1; = 10. This captures:

Energy = 10² / (10² + 3² + 0.1²) = 100 / 109.01 = 91.7%

Rank-2 approximation: Keep σ&sub1; = 10 and σ&sub2; = 3.

Energy = (100 + 9) / 109.01 = 109 / 109.01 = 99.99%

With just 2 out of 3 components, we capture nearly all the information. The Frobenius norm of the error is just the discarded singular values:

‖A − AkF = √(σk+1² + σk+2² + ...) = √(0.01) = 0.1
The Eckart-Young theorem: The truncated SVD gives the BEST rank-k approximation of any matrix in both the Frobenius and spectral norms. There is no better way to compress a matrix to rank k. This is a mathematical guarantee, not a heuristic.

ML Payoff 1: LoRA (Low-Rank Adaptation)

LoRA is one of the most important techniques in modern ML. Instead of fine-tuning all 4096×4096 = 16.7M parameters in a weight matrix W, LoRA decomposes the update ΔW as the product of two small matrices:

ΔW = B × A    where B is [d × r], A is [r × d], r ≪ d

If d = 4096 and r = 16:

MethodParametersMemory
Full fine-tuning4096 × 4096 = 16,777,216~64 MB (fp32)
LoRA (r=16)4096×16 + 16×4096 = 131,072~0.5 MB (fp32)
Reduction128× fewer parameters

Why does this work? Because SVD reveals that weight updates during fine-tuning are approximately low-rank. The paper "LoRA: Low-Rank Adaptation of Large Language Models" showed that the intrinsic dimensionality of fine-tuning updates is surprisingly small — rank 4-16 captures most of the adaptation signal.

ML Payoff 2: Image Compression

Treat a grayscale image (m×n pixels) as a matrix. Compute its SVD. Keep only the top k singular values. Reconstruct. Storage goes from m×n values to k×(m + n + 1) values.

For a 256×256 image:

Rank kStorageCompression ratio
Full (256)65,536 values
k = 5050 × (256 + 256 + 1) = 25,6502.6×
k = 2020 × 513 = 10,2606.4×
k = 55 × 513 = 2,56525.5×

From-Scratch Code: SVD via Eigendecomposition

python
import numpy as np

def svd_from_eigen(A):
    """Compute SVD of A using eigendecomposition of A^T A."""
    # Step 1: Eigendecompose A^T A to get V and singular values
    AtA = A.T @ A
    eigenvalues, V = np.linalg.eigh(AtA)

    # Sort in descending order
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    V = V[:, idx]

    # Step 2: Singular values = sqrt of eigenvalues
    sigma = np.sqrt(np.maximum(eigenvalues, 0))

    # Step 3: U = A V Sigma^{-1}
    r = np.sum(sigma > 1e-10)  # rank
    U = np.zeros((A.shape[0], r))
    for i in range(r):
        U[:, i] = A @ V[:, i] / sigma[i]

    return U, sigma[:r], V[:, :r]

# Test
A = np.array([[3, 0], [0, 2], [0, 0]])
U, s, V = svd_from_eigen(A)
print("Singular values:", s)  # [3. 2.]

NumPy / PyTorch SVD

python
import numpy as np
import torch

A = np.array([[3, 2, 1], [1, 4, 2]], dtype=np.float64)

# NumPy
U, s, Vt = np.linalg.svd(A)
print("U shape:", U.shape)     # (2, 2)
print("s:", s)                  # [5.47, 1.15]
print("Vt shape:", Vt.shape)   # (3, 3)

# Low-rank reconstruction (rank 1)
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print("Rank-1 approx error:", np.linalg.norm(A - A_rank1))

# PyTorch
At = torch.tensor(A)
U_t, s_t, Vh_t = torch.linalg.svd(At)
print("PyTorch singular values:", s_t)
SVD vs Eigendecomposition: Eigendecomposition only works on square matrices: A = QΛQ−1. SVD works on ANY matrix (m×n): A = UΣVT. For symmetric positive definite matrices, eigenvalues equal singular values and eigenvectors equal singular vectors. But in general, SVD is strictly more powerful.

SHOWCASE: SVD Image Compression Explorer

SVD Image Compression & LoRA Visualizer

LEFT: original 32×32 procedural image. RIGHT: rank-k reconstruction. BOTTOM: singular value bar chart. Drag the rank slider and watch the image sharpen. Toggle LoRA mode to see weight matrix decomposition.

Rank k5
You have a 4096×4096 weight matrix. Using LoRA with rank r=8, how many parameters do you train?

Chapter 10: Matrix Factorization & Recommenders

You open Netflix. It shows you a row of movies it thinks you'll like. You haven't rated most of them — yet Netflix predicts you'll give The Grand Budapest Hotel a 4.5 and Transformers a 2.1. How? It looked at millions of users, found people with similar taste to yours, and used their ratings to fill in your missing ones.

The mathematical engine behind this is matrix factorization: take a big matrix (users × movies) with mostly missing entries, and break it into the product of two smaller matrices. Those smaller matrices reveal hidden structure — each user gets a "taste vector" and each movie gets a "content vector." Their dot product predicts the rating.

This same idea powers LoRA fine-tuning, topic modeling (NMF), and collaborative filtering in every major recommendation system. It's one of the most practically important techniques in all of ML.

Hand calculation 1: factoring a ratings matrix

We have 3 users and 3 movies. Some ratings are observed, some are missing (marked ?):

R = [[5, 3, ?], [4, ?, 1], [?, 2, 4]]

We want to approximate R ≈ U × VT, where U is 3×2 (3 users, 2 latent factors) and V is 3×2 (3 movies, 2 latent factors). The "2" is our choice — we're guessing that movie preferences can be described by 2 hidden dimensions (maybe "action vs drama" and "old vs new").

Start with a guess:

U = [[2, 1], [1, 2], [1, 1]],    V = [[2, 1], [1, 1], [0, 2]]

Compute the predicted ratings (UVT):

UVT[0][0] = 2×2 + 1×1 = 5    (actual: 5 ✓)
UVT[0][1] = 2×1 + 1×1 = 3    (actual: 3 ✓)
UVT[0][2] = 2×0 + 1×2 = 2    (actual: ? → predicted 2)
UVT[1][0] = 1×2 + 2×1 = 4    (actual: 4 ✓)
UVT[1][1] = 1×1 + 2×1 = 3    (actual: ? → predicted 3)
UVT[1][2] = 1×0 + 2×2 = 4    (actual: 1 ✗ off by 3!)
UVT[2][0] = 1×2 + 1×1 = 3    (actual: ? → predicted 3)
UVT[2][1] = 1×1 + 1×1 = 2    (actual: 2 ✓)
UVT[2][2] = 1×0 + 1×2 = 2    (actual: 4 ✗ off by 2!)

Our initial guess is imperfect — it nails some entries but misses others. We need to optimize U and V to minimize the error on observed entries. That's where gradient descent comes in.

Hand calculation 2: the loss function

We only penalize errors on observed entries (we can't compute error on missing ones — that's what we're trying to predict):

L = ∑observed (i,j) (Rij − (UVT)ij)² + λ(||U||² + ||V||²)

The first term is the reconstruction error. The second term is regularization — it prevents U and V from growing too large (overfitting). For our example with λ = 0:

L = (5−5)² + (3−3)² + (4−4)² + (2−2)² + (1−4)² + (4−2)²
= 0 + 0 + 0 + 0 + 9 + 4 = 13

Our goal: adjust U and V to make this loss as small as possible. After optimization, the missing entries get filled with the model's best predictions.

The gradient update rule

For each observed entry (i, j), the gradient of the squared error with respect to user vector ui is:

∂L/∂ui = −2(Rij − ui · vj) · vj + 2λ ui

This is the error times the movie vector, plus regularization. We subtract this (times the learning rate) from ui. The same logic applies to vj with the roles swapped.

The ML payoff: Netflix and LoRA

Netflix Prize (2009): 480,000 users × 18,000 movies = 8.6 BILLION possible entries, but only 100 million ratings observed (1.2% fill rate). Factor into U(480K×50) × V(18K×50). Now each user is a 50-dimensional "taste vector" and each movie is a 50-dimensional "content vector." Their dot product predicts the rating.

LoRA is the SAME idea. A pretrained weight matrix W (4096×4096) is updated by ΔW = B×A where B is 4096×r and A is r×4096. With rank r=8, you're saying: "fine-tuning only changes 8 independent directions in weight space." The rest of the 4096 dimensions stay frozen.

The unifying principle: Matrix factorization says "this big matrix has hidden low-rank structure." Recommendations: user preferences are driven by a few latent factors. LoRA: fine-tuning updates live in a low-rank subspace. NMF: documents are mixtures of a few topics. PCA: data variance concentrates along a few directions. It's the same math applied to different domains.

From scratch: gradient descent matrix factorization

python
import numpy as np

def matrix_factorize(R, k, lr=0.01, reg=0.02, epochs=1000):
    """Factor ratings matrix R ≈ U @ V.T with k latent factors."""
    m, n = R.shape
    # Initialize with small random values
    U = np.random.randn(m, k) * 0.1
    V = np.random.randn(n, k) * 0.1

    # Find observed entries (non-NaN)
    observed = ~np.isnan(R)

    for epoch in range(epochs):
        for i in range(m):
            for j in range(n):
                if observed[i, j]:
                    error = R[i, j] - U[i] @ V[j]
                    # Update U[i] and V[j]
                    U[i] += lr * (error * V[j] - reg * U[i])
                    V[j] += lr * (error * U[i] - reg * V[j])

    return U, V

# Our example (NaN = missing)
R = np.array([[5, 3, np.nan],
              [4, np.nan, 1],
              [np.nan, 2, 4]])

U, V = matrix_factorize(R, k=2, epochs=5000)
predicted = U @ V.T
print("Predicted ratings:")
print(np.round(predicted, 1))
# Missing entries now filled with predictions!

See it: interactive recommender system

The widget below shows a 5×5 user-movie ratings matrix with some entries missing. Adjust the number of latent factors k with the slider. Watch how the predicted ratings and reconstruction error change. More factors = better fit to observed data, but too many = overfitting.

Matrix Factorization Recommender

Observed ratings in orange, predicted missing entries in teal. U and V matrices shown below.

Latent factors (k) 2
Loss: -- | Click Train to start

In Netflix-style matrix factorization with k=50 latent factors, what does each user's 50-dimensional vector represent?

Chapter 11: Norms RevisitedMatrix Norms

In Chapter 2, we measured vector size with norms: L1, L2, L-infinity. But how do you measure the "size" of a matrix? A matrix isn't just a collection of numbers — it's a transformation. So the "size" of a matrix should tell you something about what that transformation does.

Different matrix norms answer different questions about a transformation. The Frobenius norm asks: "how much total energy is in this matrix?" The spectral norm asks: "what's the maximum stretching this matrix can do?" The nuclear norm asks: "how close is this matrix to being low-rank?" Each answer matters for different ML applications.

Hand calculation 1: Frobenius norm

The Frobenius norm is the simplest. Flatten the matrix into a vector and take the L2 norm. Just square every entry, sum them, and take the square root.

For A = [[1, 2], [3, 4]]:

||A||F = √(1² + 2² + 3² + 4²) = √(1 + 4 + 9 + 16) = √30 ≈ 5.48

That's it. The Frobenius norm treats the matrix as if it were a flat vector. It measures the total "energy" or "magnitude" of all the entries. It doesn't care about structure — rearranging entries doesn't change the Frobenius norm.

Why you already know this: Weight decay adds λ||W||F2 to the loss function. That's the Frobenius norm squared. It penalizes large weights by penalizing the total energy of the weight matrix. Every time you write weight_decay=0.01 in an optimizer, you're using the Frobenius norm.

Hand calculation 2: spectral norm (largest singular value)

The spectral norm is the largest singular value of the matrix. It tells you the maximum amount the matrix can stretch any unit vector.

For a diagonal matrix A = [[3, 0], [0, 2]]:

SVD: σ1 = 3, σ2 = 2
||A||2 = σ1 = 3

This means: for any unit vector x (||x|| = 1), the output Ax has length at most 3. That is, ||Ax||2 ≤ 3 · ||x||2 for all x.

For a non-diagonal matrix, you'd compute the SVD first. For A = [[1, 2], [3, 4]], the SVD gives σ1 ≈ 5.465, σ2 ≈ 0.366. Spectral norm = 5.465.

The spectral norm answers a very specific question: "What's the worst-case amplification?" If noise enters a layer with spectral norm 5, the noise can be amplified up to 5×. Stack 10 such layers, and noise amplification is up to 510 ≈ 10 million. That's why controlling spectral norms matters for deep networks.

Hand calculation 3: nuclear norm (sum of singular values)

The nuclear norm is the sum of all singular values:

||A||* = σ1 + σ2 + ... + σr

For A = [[3, 0], [0, 2]]: ||A||* = 3 + 2 = 5.

Why care? The nuclear norm is the convex relaxation of rank. Minimizing rank directly is NP-hard (a combinatorial problem). But minimizing the nuclear norm is a convex optimization problem — much easier to solve — and it tends to produce low-rank solutions. This is the mathematical trick behind matrix completion algorithms.

The big picture: three norms, three questions

NormFormulaWhat it measuresML use
Frobenius√(∑aij²)Total energyWeight decay, L2 regularization
Spectralmax singular valueMaximum stretchingGAN stability (spectral norm)
Nuclear∑ singular valuesLow-rank proxyMatrix completion, LoRA

The ML payoff: spectral normalization in GANs

Training GANs is notoriously unstable. The discriminator can become too powerful, giving gradients that are too large and causing the generator to diverge. Spectral normalization (Miyato et al., 2018) fixes this by constraining the spectral norm of every layer's weight matrix to 1:

WSN = W / ||W||2

After this division, the largest singular value is exactly 1. The layer can rotate vectors freely but can't amplify them. This makes the discriminator Lipschitz continuous with constant 1 — its output changes by at most 1 unit per unit change in input. The result: stable, reliable GAN training.

Connection to singular values: All three matrix norms can be expressed in terms of singular values. Frobenius = √(∑σi²), spectral = max(σi), nuclear = ∑σi. They're literally the L2, L∞, and L1 norms of the singular value vector! If you understand vector norms and SVD, matrix norms are free.

From scratch: computing all three norms

python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# Frobenius norm: sqrt(sum of squared entries)
frob = np.sqrt(np.sum(A**2))
print(f"Frobenius: {frob:.3f}")  # 5.477

# Via SVD
U, S, Vt = np.linalg.svd(A)
print(f"Singular values: {S}")  # [5.465, 0.366]

# Spectral norm: largest singular value
spectral = S[0]
print(f"Spectral: {spectral:.3f}")  # 5.465

# Nuclear norm: sum of singular values
nuclear = np.sum(S)
print(f"Nuclear: {nuclear:.3f}")  # 5.831

# NumPy built-ins
print("Built-in Frobenius:", np.linalg.norm(A, 'fro'))
print("Built-in Spectral:", np.linalg.norm(A, 2))
print("Built-in Nuclear:", np.linalg.norm(A, 'nuc'))

# Spectral normalization in one line
W = np.random.randn(64, 64)
W_sn = W / np.linalg.norm(W, 2)
print("After SN, spectral norm:", np.linalg.norm(W_sn, 2))  # 1.0

See it: three norms of a 2×2 matrix

Adjust the entries of a 2×2 matrix with sliders. The canvas shows all three norms updating in real time. The transformation ellipse shows how the matrix maps the unit circle: the spectral norm is the longest axis, and the nuclear norm is the sum of both axes.

Matrix Norms Visualizer

The unit circle (dashed) gets transformed into an ellipse. Spectral norm = long axis. Frobenius = "diagonal" of the axis lengths.

a11 1.0
a12 2.0
a21 0.0
a22 1.0

Which matrix norm is used in spectral normalization for GANs?

Chapter 8: Gradients & Jacobians

Everything so far has been static — given a matrix, transform a vector. But ML is about learning, which means adjusting the matrix to minimize a loss. To adjust it, we need gradients: how much does the loss change when we nudge each weight? This is where linear algebra meets calculus.

The Gradient: A Vector of Partial Derivatives

For a scalar function f(x&sub1;, x&sub2;, ..., xn), the gradient ∇f is the vector of all partial derivatives:

∇f = [∂f/∂x&sub1;, ∂f/∂x&sub2;, ..., ∂f/∂xn]

The gradient points in the direction of steepest ascent. To minimize a function, go in the opposite direction: the negative gradient.

Hand Calculation 1: Computing a Gradient

Find the gradient of f(x, y) = x²y + 3xy².

Step 1: Partial derivative with respect to x (treat y as constant):

∂f/∂x = ∂(x²y)/∂x + ∂(3xy²)/∂x = 2xy + 3y²

Step 2: Partial derivative with respect to y (treat x as constant):

∂f/∂y = ∂(x²y)/∂y + ∂(3xy²)/∂y = x² + 6xy

Step 3: Evaluate at (x, y) = (1, 2).

∂f/∂x = 2(1)(2) + 3(2)² = 4 + 12 = 16
∂f/∂y = (1)² + 6(1)(2) = 1 + 12 = 13
∇f(1, 2) = [16, 13]

The gradient [16, 13] points in the direction of steepest ascent. To minimize, step in the direction [−16, −13].

Key insight: The gradient is NOT the slope. The slope is a scalar (for 1D functions). The gradient is a VECTOR that points in the direction of steepest ascent. Its magnitude tells you how steep the surface is at that point. In 2D, it's a vector with 2 components. In a neural network with 175 billion parameters, it's a vector with 175 billion components.

Hand Calculation 2: The Jacobian

The gradient handles scalar-valued functions. But what about vector-valued functions — functions that output multiple values? The Jacobian is the matrix of all partial derivatives.

For f(x) = [x&sub1;² + x&sub2;,  x&sub1;x&sub2;,  x&sub2;³], compute the Jacobian:

J = [[∂f&sub1;/∂x&sub1;, ∂f&sub1;/∂x&sub2;], [∂f&sub2;/∂x&sub1;, ∂f&sub2;/∂x&sub2;], [∂f&sub3;/∂x&sub1;, ∂f&sub3;/∂x&sub2;]]

Row 1 (gradient of f&sub1; = x&sub1;² + x&sub2;):

[∂f&sub1;/∂x&sub1;, ∂f&sub1;/∂x&sub2;] = [2x&sub1;, 1]

Row 2 (gradient of f&sub2; = x&sub1;x&sub2;):

[∂f&sub2;/∂x&sub1;, ∂f&sub2;/∂x&sub2;] = [x&sub2;, x&sub1;]

Row 3 (gradient of f&sub3; = x&sub2;³):

[∂f&sub3;/∂x&sub1;, ∂f&sub3;/∂x&sub2;] = [0, 3x&sub2;²]

Evaluate at x = [1, 2]:

J = [[2, 1], [2, 1], [0, 12]]

The Jacobian has shape [3, 2] — 3 outputs, 2 inputs. Each row is the gradient of one output function. This matrix tells you: if I nudge x&sub1; by ε, how much does each output change?

ML connection: In a neural network, backpropagation computes ∂L/∂W for every weight W. This IS the Jacobian of the loss with respect to parameters. PyTorch autograd computes Jacobian-vector products efficiently using the chain rule — it never forms the full Jacobian matrix, which would be enormous.

Gradient Descent: A Complete Worked Example

Let's trace gradient descent step by step on the simplest possible loss: L(w) = (w − 3)².

The minimum is at w = 3 (where L = 0). Starting at w = 0 with learning rate α = 0.1:

StepwL(w)dL/dw = 2(w−3)wnew = w − α·dL/dw
00.0009.000−6.0000.000 − 0.1×(−6) = 0.600
10.6005.760−4.8000.600 + 0.480 = 1.080
21.0803.686−3.8401.080 + 0.384 = 1.464
31.4642.359−3.0721.464 + 0.307 = 1.771
41.7711.510−2.4581.771 + 0.246 = 2.017
52.0170.966−1.9662.017 + 0.197 = 2.214
......→ 0→ 0→ 3.000

Each step reduces the loss. The gradient magnitude shrinks as we approach the minimum. After about 30 steps, w ≈ 2.99 and L ≈ 0.0001.

From-Scratch Code: Gradient Descent

python
def gradient_descent_1d(f, df, w0, lr, n_steps):
    """Minimize f starting at w0 with learning rate lr."""
    w = w0
    history = [w]
    for step in range(n_steps):
        grad = df(w)                  # compute gradient
        w = w - lr * grad              # update
        history.append(w)
    return w, history

# L(w) = (w - 3)^2
f = lambda w: (w - 3)**2
df = lambda w: 2 * (w - 3)          # derivative

w_final, hist = gradient_descent_1d(f, df, w0=0, lr=0.1, n_steps=50)
print(f"Final w = {w_final:.6f}")  # ~3.000000

PyTorch Autograd

python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for step in range(50):
    loss = (w - 3)**2              # forward pass
    loss.backward()                 # compute gradient
    with torch.no_grad():
        w -= lr * w.grad            # update
        w.grad.zero_()              # clear for next step

print(f"Final w = {w.item():.6f}")  # ~3.000000

Interactive: 2D Gradient Descent

2D Gradient Descent Explorer

Click anywhere on the loss landscape to start gradient descent from that point. The contours show loss values. Adjust the learning rate — too high oscillates, too low barely moves. Watch the gradient arrow at each step.

Learning rate0.050
The gradient at a point is [3, −2]. Which direction should you step to DECREASE the function?

Chapter 13: The Hessian Matrix

Gradient descent tells you which way is downhill. But it says nothing about the shape of the hill. Is it a gentle slope you can barrel down, or a narrow ravine where you'll oscillate wildly? Is it a valley floor (minimum) or a mountain pass (saddle point)? The Hessian matrix — the matrix of all second derivatives — answers these questions.

Understanding the Hessian explains why some optimization problems are easy and others are nightmarishly slow. It explains why Adam outperforms vanilla SGD. It explains why saddle points, not local minima, are the real enemy in high-dimensional optimization.

What is the Hessian?

For a function f(x1, x2, ..., xn), the gradient is the vector of first derivatives: ∇f = [∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn]. The Hessian is the matrix of ALL second derivatives:

Hij = ∂²f / ∂xi∂xj

For a function of 2 variables, the Hessian is a 2×2 matrix:

H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]]

The diagonal entries tell you the curvature along each axis. The off-diagonal entries tell you how the curvature couples the axes (cross-curvature). For smooth functions, the Hessian is always symmetric: ∂²f/∂x∂y = ∂²f/∂y∂x.

Hand calculation 1: computing the Hessian

Let f(x, y) = x³ − 3xy² + y³. First, compute the gradient:

∂f/∂x = 3x² − 3y²
∂f/∂y = −6xy + 3y²

Now take second derivatives to build the Hessian:

∂²f/∂x² = 6x,    ∂²f/∂x∂y = −6y
∂²f/∂y∂x = −6y,    ∂²f/∂y² = −6x + 6y
H = [[6x, −6y], [−6y, −6x + 6y]]

Evaluate at the point (1, 1):

H(1,1) = [[6, −6], [−6, 0]]

Now find the eigenvalues to classify this critical point:

det(H − λI) = (6 − λ)(0 − λ) − (−6)(−6) = −6λ + λ² − 36 = 0
λ² − 6λ − 36 = 0  ⇒  λ = (6 ± √(36 + 144)) / 2 = (6 ± √180) / 2
λ1 ≈ 9.71,    λ2 ≈ −3.71

One eigenvalue is positive, one is negative. This means the surface curves upward in one direction and downward in another. That's a saddle point — it looks like a horse saddle or a mountain pass.

Classifying critical points with the Hessian

Eigenvalue signsClassificationGeometry
All positiveLocal minimumBowl (curves up everywhere)
All negativeLocal maximumDome (curves down everywhere)
Mixed positive & negativeSaddle pointHorse saddle (up in some dirs, down in others)
Some zeroDegenerateFlat in some directions (inconclusive)

Hand calculation 2: Newton's method

Gradient descent uses a fixed learning rate: xnew = x − α ∇f. But α is a guess — too large and you overshoot, too small and you crawl. Newton's method uses the Hessian to pick the perfect step size automatically:

xnew = x − H−1 ∇f

The inverse Hessian scales the gradient by the curvature. In steep directions (large curvature), it takes small steps. In flat directions (small curvature), it takes large steps. It's like adaptive learning rates, but derived from the geometry of the loss surface.

For a 1D function f(x) = x4 at x = 2:

f'(2) = 4x³ = 32,    f''(2) = 12x² = 48
xnew = 2 − 32/48 = 2 − 0.667 = 1.333

Compare to gradient descent with α = 0.01: xnew = 2 − 0.01 × 32 = 1.68. Newton's method took a much larger, smarter step because it knew the curvature was steep enough to handle it.

The ML payoff: why saddle points matter

In deep learning, the loss landscape has millions of dimensions. The Hessian is N×N where N = number of parameters. For GPT-3, that's a 175 billion × 175 billion matrix. We NEVER compute it fully — it would take more memory than exists on Earth.

But its eigenvalue spectrum controls everything about optimization:

The saddle point myth-buster: The common belief is that neural networks get stuck in local minima. In reality, local minima are vanishingly rare in high dimensions. Think about it: for a local minimum, ALL eigenvalues of the Hessian must be positive. With millions of dimensions, the chance that every single one is positive is astronomically small. Instead, most critical points are saddle points — positive eigenvalues in some directions, negative in others. SGD's inherent noise helps escape these saddle points, which is one reason stochastic methods work so well.

Why Adam works better than SGD: Adam maintains per-parameter running averages of squared gradients. This approximates the diagonal of the Hessian — each parameter gets its own "curvature estimate." Parameters with high curvature (large second derivative) get smaller effective learning rates. Parameters with low curvature get larger ones. Adam is a cheap approximation to Newton's method.

From scratch: Hessian computation

python
import torch

# Define f(x, y) = x^3 - 3xy^2 + y^3
def f(xy):
    x, y = xy[0], xy[1]
    return x**3 - 3*x*y**2 + y**3

# Compute Hessian at (1, 1)
point = torch.tensor([1.0, 1.0])
H = torch.autograd.functional.hessian(f, point)
print("Hessian:")
print(H)
# tensor([[ 6., -6.],
#         [-6.,  0.]])

# Eigenvalue analysis
eigenvalues = torch.linalg.eigvalsh(H)
print("Eigenvalues:", eigenvalues)
# tensor([-3.71,  9.71])

if all(eigenvalues > 0):
    print("Local minimum")
elif all(eigenvalues < 0):
    print("Local maximum")
else:
    print("Saddle point")  # This one!

See it: curvature on a 2D landscape

The widget below shows a contour plot of a 2D function. Click anywhere to place the analysis point. The Hessian eigenvectors are shown as arrows: long arrow = high curvature (steep), short arrow = low curvature (flat). Color indicates sign: red = positive eigenvalue (curves up, like a bowl), blue = negative (curves down). At saddle points, you'll see one red and one blue arrow.

Hessian Curvature Explorer

Click on the contour plot to move the analysis point. Arrows show Hessian eigenvectors. Arrow length = curvature magnitude.

Function Bowl
λ1 = 2.00, λ2 = 2.00 | Local minimum

The Hessian at a critical point has eigenvalues [5, −2, 0.01]. What is this point?

Chapter 9: Matrix Calculus & Backprop

Now we connect everything. A neural network is a chain of matrix multiplications and nonlinearities. Backpropagation is just the chain rule applied to matrices. Once you see this, neural networks stop being magic and become plumbing.

The core insight: If y = f(g(h(x))), the chain rule says ∂y/∂x = (∂y/∂f)(∂f/∂g)(∂g/∂h)(∂h/∂x). Each term is a Jacobian. Backpropagation computes these products from output to input (right to left). That's ALL it does.

Complete Worked Example: One-Layer Forward + Backward

Let's trace every single number through a forward pass and backward pass. No hand-waving.

Setup:

Input: x = [2, 1]    (shape [2])
Weights: W = [[0.5, −0.3], [0.1, 0.7], [−0.4, 0.2]]    (shape [3, 2])
Bias: b = [0.1, 0, −0.1]    (shape [3])
Target: t = [1, 0, 0]    (shape [3])

Forward Pass

Step 1: Linear layer z = Wx + b

z&sub1; = 0.5(2) + (−0.3)(1) + 0.1 = 1.0 − 0.3 + 0.1 = 0.8
z&sub2; = 0.1(2) + 0.7(1) + 0 = 0.2 + 0.7 = 0.9
z&sub3; = (−0.4)(2) + 0.2(1) + (−0.1) = −0.8 + 0.2 − 0.1 = −0.7
z = [0.8, 0.9, −0.7]

Step 2: ReLU activation a = max(0, z)

a&sub1; = max(0, 0.8) = 0.8  ✓
a&sub2; = max(0, 0.9) = 0.9  ✓
a&sub3; = max(0, −0.7) = 0    ← ReLU kills this neuron!
a = [0.8, 0.9, 0]

Step 3: MSE Loss L = ∑(a − t)²

L = (0.8 − 1)² + (0.9 − 0)² + (0 − 0)² = 0.04 + 0.81 + 0 = 0.85
Notice: The third neuron's pre-activation was −0.7, so ReLU zeroed it. Its activation matches the target (both 0), contributing zero loss. But it also receives zero gradient during backprop — it's "dead" for this input. This is the origin of the "dying ReLU" problem.

Backward Pass

Now we compute gradients flowing backward through each operation.

Step 4: Gradient of loss w.r.t. activations

∂L/∂a = 2(a − t) = 2[0.8−1, 0.9−0, 0−0] = [−0.4, 1.8, 0]

Step 5: Gradient through ReLU

ReLU's gradient: 1 if z > 0, 0 if z ≤ 0. Element-wise multiply:

∂L/∂z = ∂L/∂a ⊙ (z > 0) = [−0.4, 1.8, 0] ⊙ [1, 1, 0] = [−0.4, 1.8, 0]

The third neuron had z&sub3; = −0.7 < 0, so its gradient is zeroed. The dead neuron receives no learning signal.

Step 6: Gradient w.r.t. weights ∂L/∂W = (∂L/∂z) xT

This is the outer product of the upstream gradient and the input:

∂L/∂W = [−0.4, 1.8, 0]T × [2, 1]
= [[−0.4×2, −0.4×1], [1.8×2, 1.8×1], [0×2, 0×1]]
= [[−0.8, −0.4], [3.6, 1.8], [0, 0]]

Step 7: Gradient w.r.t. bias ∂L/∂b = ∂L/∂z

∂L/∂b = [−0.4, 1.8, 0]

Step 8: Update weights (learning rate = 0.01)

Wnew = W − 0.01 × ∂L/∂W
= [[0.5, −0.3], [0.1, 0.7], [−0.4, 0.2]] − 0.01 × [[−0.8, −0.4], [3.6, 1.8], [0, 0]]
= [[0.508, −0.296], [0.064, 0.682], [−0.4, 0.2]]
Shape rule (memorize this): The gradient ∂L/∂W ALWAYS has the same shape as W. If W is [128, 784], then ∂L/∂W is [128, 784]. This must be true because we subtract lr × gradient from W during the update, so the shapes must match.

From-Scratch Code: Complete Forward + Backward

python
import numpy as np

# Setup
x = np.array([2.0, 1.0])
W = np.array([[0.5, -0.3], [0.1, 0.7], [-0.4, 0.2]])
b = np.array([0.1, 0.0, -0.1])
target = np.array([1.0, 0.0, 0.0])
lr = 0.01

# ── Forward ──
z = W @ x + b                # [0.8, 0.9, -0.7]
a = np.maximum(0, z)           # [0.8, 0.9, 0.0]  ReLU
loss = np.sum((a - target)**2) # 0.85

# ── Backward ──
dL_da = 2 * (a - target)       # [-0.4, 1.8, 0.0]
dL_dz = dL_da * (z > 0)       # [-0.4, 1.8, 0.0]  ReLU grad
dL_dW = np.outer(dL_dz, x)     # [[-0.8,-0.4],[3.6,1.8],[0,0]]
dL_db = dL_dz                   # [-0.4, 1.8, 0.0]

# ── Update ──
W -= lr * dL_dW
b -= lr * dL_db
print("Updated W:", W)
print("Updated b:", b)

PyTorch Autograd: Same Network

python
import torch

x = torch.tensor([2.0, 1.0])
W = torch.tensor([[0.5, -0.3], [0.1, 0.7], [-0.4, 0.2]],
                  requires_grad=True)
b = torch.tensor([0.1, 0.0, -0.1], requires_grad=True)
target = torch.tensor([1.0, 0.0, 0.0])

# Forward
z = W @ x + b
a = torch.relu(z)
loss = torch.sum((a - target)**2)

# Backward — PyTorch computes ALL gradients for us
loss.backward()

print("dL/dW:", W.grad)  # matches our hand calc!
print("dL/db:", b.grad)  # matches our hand calc!

Interactive: Forward/Backward Pass Visualizer

Neural Network Forward & Backward Pass

Toggle between forward pass (data flows left to right) and backward pass (gradients flow right to left). Actual numbers from our worked example are shown at each stage.

During backpropagation, what shape is ∂L/∂W for a weight matrix W of shape [128, 784]?

Chapter 10: Tensor Operations & Einsum

Everything we've done so far has been 1D and 2D — vectors and matrices. Real ML lives in higher dimensions. An image batch is 4D: [batch, channels, height, width]. Attention matrices have batch and head dimensions. Two tools make this manageable: broadcasting (automatic shape expansion) and einsum (universal tensor contraction).

Broadcasting: Automatic Shape Expansion

Broadcasting lets you do arithmetic on tensors with different shapes. NumPy and PyTorch align dimensions from the RIGHT and expand size-1 dimensions to match.

Hand Calculation: Broadcasting in Action

a has shape [3, 1], b has shape [1, 4]:

a = [[1], [2], [3]],    b = [[10, 20, 30, 40]]

Step 1: Align shapes from the right.

a: [3, 1]  →  [3, 4]   (stretch columns)
b: [1, 4]  →  [3, 4]   (stretch rows)

Step 2: After stretching, both are [3, 4]. Element-wise add:

a + b = [[1+10, 1+20, 1+30, 1+40], [2+10, 2+20, 2+30, 2+40], [3+10, 3+20, 3+30, 3+40]]
= [[11, 21, 31, 41], [12, 22, 32, 42], [13, 23, 33, 43]]
Broadcasting rule: Dimensions are compared from RIGHT to LEFT. At each position: (a) if both sizes are equal, fine; (b) if one size is 1, stretch it; (c) if one tensor has fewer dimensions, prepend 1s. If neither is 1 and they differ → error.

More examples:

Shape AShape BResultWhat happens
[3, 4][4][3, 4]b becomes [1, 4] → broadcast rows
[2, 3, 4][3, 1][2, 3, 4]b becomes [1, 3, 1] → broadcast batch + cols
[32, 128][128][32, 128]Adding bias to every sample in a batch
[3, 4][5, 4]ERROR3 ≠ 5, neither is 1

Einsum: The Universal Tensor Operation

Einstein summation notation lets you express any tensor contraction in a single, readable string. The rule: any index that appears in the inputs but NOT in the output is summed over.

Hand Calculation: Decoding einsum strings

EinsumOperationEquation
'i,i->'Dot productc = ∑i ai bi
'ij,jk->ik'Matrix multiplyCik = ∑j Aij Bjk
'ij->ji'TransposeBji = Aij
'ii->'Tracet = ∑i Aii
'i,j->ij'Outer productCij = ai bj
'bij,bjk->bik'Batched matmulCbik = ∑j Abij Bbjk
'bhqd,bhkd->bhqk'Attention scoresSbhqk = ∑d Qbhqd Kbhkd

Worked example: Matrix multiply via einsum. A is [2, 3], B is [3, 2].

A = [[1, 2, 3], [4, 5, 6]],    B = [[7, 8], [9, 10], [11, 12]]
einsum('ij,jk->ik', A, B):   Cik = ∑j Aij Bjk
C00 = 1(7) + 2(9) + 3(11) = 7 + 18 + 33 = 58
C01 = 1(8) + 2(10) + 3(12) = 8 + 20 + 36 = 64
C10 = 4(7) + 5(9) + 6(11) = 28 + 45 + 66 = 139
C11 = 4(8) + 5(10) + 6(12) = 32 + 50 + 72 = 154
C = [[58, 64], [139, 154]]

The Attention Score Einsum Decoded

The most important einsum in modern ML computes attention scores:

einsum('bhqd, bhkd -> bhqk', Q, K)

Breaking it down letter by letter:

IndexMeaningIn QIn KIn output
bBatch✓ (kept)
hAttention head✓ (kept)
qQuery position✓ (kept)
kKey position✓ (kept)
dHead dimension(summed!)

Index d appears in both inputs but NOT in the output → it's summed over. This is the dot product between each query and each key vector, for every batch and every head, producing an attention score matrix of shape [b, h, q, k].

From-Scratch Broadcasting

python
import numpy as np

# Manual broadcasting: add bias to every sample
X = np.array([[1, 2, 3],   # batch of 4 samples, 3 features
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])   # shape [4, 3]
bias = np.array([100, 200, 300])    # shape [3]

# Without broadcasting (manual loop)
result_loop = np.zeros_like(X)
for i in range(4):
    result_loop[i] = X[i] + bias

# With broadcasting (one line!)
result_broadcast = X + bias  # [4,3] + [3] -> [4,3]

print(np.allclose(result_loop, result_broadcast))  # True

NumPy / PyTorch Einsum

python
import numpy as np
import torch

# All einsum patterns
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Dot product: i,i->
print(np.einsum('i,i->', a, b))       # 32

# Matrix multiply: ij,jk->ik
print(np.einsum('ij,jk->ik', A, B))   # [[19,22],[43,50]]

# Trace: ii->
print(np.einsum('ii->', A))            # 5

# Outer product: i,j->ij
print(np.einsum('i,j->ij', a, b))

# Batched attention scores (PyTorch)
B_, H, Q_len, D = 2, 4, 8, 16
K_len = 8
Q = torch.randn(B_, H, Q_len, D)
K = torch.randn(B_, H, K_len, D)

scores = torch.einsum('bhqd,bhkd->bhqk', Q, K)
print("Attention scores shape:", scores.shape)
# torch.Size([2, 4, 8, 8])

Interactive: Broadcasting Visualizer

Broadcasting Shape Expander

Select different shape combinations to see how broadcasting stretches dimensions. Size-1 dimensions are highlighted and expanded.

What does einsum('bhqd,bhkd->bhqk', Q, K) compute?

Chapter 16: Positive Definite Matrices

In the previous chapters on the Hessian, we classified critical points by whether the eigenvalues were all positive, all negative, or mixed. "All positive eigenvalues" has a name: the matrix is positive definite. This is one of the most important properties in all of machine learning, because covariance matrices are always positive semi-definite, kernel matrices must be positive semi-definite, and a positive definite Hessian guarantees you're at a local minimum.

Let's make this concrete rather than abstract.

What does "positive definite" actually mean?

A symmetric matrix A is positive definite (PD) if for every nonzero vector x:

xTAx > 0

The quantity xTAx is called the quadratic form associated with A. If A is positive definite, this quadratic form is always positive — it's a "bowl" shape with a single minimum at x = 0. No saddle points, no flat spots, no weird geometry. Just a clean bowl.

If the condition is xTAx ≥ 0 (allowing zero for some nonzero x), the matrix is positive semi-definite (PSD). If xTAx can be negative for some x, the matrix is indefinite.

Hand calculation 1: three ways to check

Is A = [[2, 1], [1, 2]] positive definite?

Method 1: eigenvalues. Solve det(A − λI) = 0:

(2 − λ)(2 − λ) − 1 = λ² − 4λ + 3 = 0  ⇒  λ = 1, 3

Both eigenvalues are positive ⇒ PD ✓

Method 2: compute xTAx directly. For x = [a, b]:

xTAx = [a, b] [[2,1],[1,2]] [a,b]T = 2a² + 2ab + 2b²

Rewrite by completing the square:

= a² + (a + b)² + b²

This is a sum of three squares. It's zero only when a = b = 0, and positive otherwise. PD ✓

Method 3: leading minors. Check that all upper-left submatrices have positive determinant:

det([2]) = 2 > 0   ✓
det([[2,1],[1,2]]) = 4 − 1 = 3 > 0   ✓

All leading minors positive ⇒ PD ✓

Hand calculation 2: NOT positive definite

Is B = [[1, 2], [2, 1]] positive definite?

Eigenvalues: λ² − 2λ − 3 = 0 ⇒ λ = 3, −1. One eigenvalue is negative ⇒ NOT PD.

Let's find a vector that makes xTBx negative. Try x = [1, −1]:

xTBx = [1, −1] [[1,2],[2,1]] [1,−1]T = [1,−1] [−1, 1]T = −1 − 1 = −2 < 0

Confirmed: the quadratic form can be negative. The matrix is indefinite — the "bowl" is actually a saddle.

Why this matters for ML

Covariance matrices must be PSD. The covariance matrix Σ of a random vector x is defined as E[(x − μ)(x − μ)T]. For any vector v, vTΣv = E[(vT(x − μ))²] ≥ 0. This is always non-negative because it's an expected squared quantity. If Σ were not PSD, the Gaussian density:

p(x) = exp(−½(x − μ)T Σ−1 (x − μ)) / √(det(2πΣ))

would be undefined — a negative determinant gives an imaginary normalizer. The density wouldn't integrate to 1. Physics breaks.

Kernel matrices must be PSD. In SVMs, the kernel matrix Kij = k(xi, xj) must be PSD — this is the Mercer condition. It guarantees that the kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space. Without PSD, the optimization problem becomes non-convex and may have no solution.

PD Hessian = local minimum. If the Hessian at a critical point is PD, the loss surface is a bowl at that point — you're at a minimum. This is the second-order sufficient condition for optimality.

The hierarchy: Positive definite ⊂ Positive semi-definite ⊂ Symmetric matrices. PD means all eigenvalues strictly positive. PSD means all eigenvalues ≥ 0 (some can be zero). PSD but not PD means the bowl has flat directions — the distribution lives on a lower-dimensional subspace.

Cholesky decomposition: the PD superpower

Every positive definite matrix A can be uniquely factored as:

A = LLT

where L is a lower triangular matrix. This is the Cholesky decomposition, and it's incredibly useful:

Sampling from multivariate Gaussians: To sample x ~ N(μ, Σ), first compute L where Σ = LLT. Then sample z ~ N(0, I) (independent standard normals). Finally: x = μ + Lz. The Cholesky factor L "shapes" the spherical noise z into the correct covariance structure.

Solving Ax = b efficiently: Instead of computing A−1 (O(n³) and numerically unstable), factor A = LLT once, then solve Ly = b (forward substitution) and LTx = y (back substitution). Both substitutions are O(n²).

From scratch: PD check and Cholesky

python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)
B = np.array([[1, 2], [2, 1]], dtype=float)

# Method 1: eigenvalue check
eigs_A = np.linalg.eigvalsh(A)
print("A eigenvalues:", eigs_A)  # [1. 3.] — both positive, PD
eigs_B = np.linalg.eigvalsh(B)
print("B eigenvalues:", eigs_B)  # [-1.  3.] — one negative, NOT PD

# Method 2: Cholesky (fails on non-PD matrices!)
try:
    L = np.linalg.cholesky(A)
    print("A is PD. L =", L)
    print("Verify LL^T =", L @ L.T)  # Should equal A
except np.linalg.LinAlgError:
    print("A is NOT PD")

try:
    L = np.linalg.cholesky(B)
except np.linalg.LinAlgError:
    print("B is NOT PD — Cholesky failed!")

# Sampling from N(mu, Sigma) using Cholesky
Sigma = np.array([[2, 1], [1, 2]], dtype=float)
mu = np.array([3.0, 5.0])
L = np.linalg.cholesky(Sigma)
z = np.random.randn(1000, 2)  # 1000 samples from N(0, I)
samples = mu + z @ L.T         # Transform to N(mu, Sigma)
print("Sample mean:", samples.mean(axis=0))
print("Sample cov:", np.cov(samples.T))  # Close to Sigma

See it: the quadratic form xTAx

Adjust the off-diagonal entries of a symmetric 2×2 matrix. When PD, the contours are ellipses (a bowl). When PSD, they degenerate to parallel lines (a trough). When indefinite, they become hyperbolas (a saddle). The eigenvalues determine everything.

Positive Definiteness Visualizer

Adjust the matrix entries. Watch the contour shape change from ellipses (PD, green) to parallel lines (PSD, yellow) to hyperbolas (indefinite, red).

a11 2.0
a12 = a21 1.0
a22 2.0
λ1 = 1.00, λ2 = 3.00 | Positive Definite

A covariance matrix has eigenvalue 0. Is this a problem?

Chapter 17: Numerical Linear Algebra

Everything we've learned so far assumes perfect arithmetic. 1 + 1 = 2, exactly. But computers don't do exact arithmetic. They use floating-point numbers — a finite approximation where 1 + 10−16 = 1 (not 1.0000000000000001). This tiny imprecision cascades through computations, and for certain matrices, it can destroy your answer completely.

Understanding when and why this happens explains training instability, the need for batch normalization, why mixed-precision training works, and why some models need float64 for certain operations.

Floating point: the 7-digit lie

A float32 number has approximately 7 significant decimal digits of precision. A float64 has ~15. This means:

FormatBitsPrecisionSmallest difference from 1
float1616~3.3 digits~10−3
bfloat1616~2.4 digits~10−2
float3232~7.2 digits~10−7
float6464~15.9 digits~10−16

"Precision" means the total number of meaningful digits. If you're doing a computation that produces a 10-digit answer but you're using float32, only the first 7 digits are trustworthy. The last 3 are garbage.

Hand calculation 1: the condition number

The condition number κ(A) measures how sensitive the solution of Ax = b is to small changes in A or b. It's defined as:

κ(A) = ||A|| · ||A−1|| = σmax / σmin

Consider a nearly singular matrix:

A = [[1, 1], [1, 1.0001]]

The singular values are approximately σ1 ≈ 2.0001 and σ2 ≈ 0.00005. So:

κ(A) = 2.0001 / 0.00005 ≈ 40,000

This means: a 0.01% change in the input can cause up to a 400% change in the output. The matrix amplifies perturbations by a factor of 40,000. Let's see this happen.

Hand calculation 2: ill-conditioning in action

Solve Ax = b with our ill-conditioned A:

[[1, 1], [1, 1.0001]] · x = [2, 2.0001]

The exact solution is x = [1, 1]. Easy enough.

Now change b by a tiny amount — just 0.005% in the second component:

[[1, 1], [1, 1.0001]] · x = [2, 2.0002]

Subtract the first equation from the second: 0.0001 · x2 = 0.0001, so x2 = 1. Wait — that doesn't look bad. Let me do a more dramatic example.

Actually, let's use A = [[1, 1], [1, 1.001]] with b = [2, 2.001] giving x = [1, 1], then b' = [2, 2.002]:

0.001 · x2 = 0.002  ⇒  x2 = 2
x1 = 2 − x2 = 2 − 2 = 0

The solution jumped from [1, 1] to [0, 2]! A 0.05% change in b produced a 100% change in x. That's ill-conditioning in action. The condition number κ ≈ 4000 predicted exactly this kind of sensitivity.

The rule: digits lost = log10(κ)

The precision-eating rule: If your condition number is κ and you have d digits of precision, you can trust at most d − log10(κ) digits in your answer. With κ = 108 and float32 (7 digits): 7 − 8 = −1. You have negative reliable digits. The answer is pure noise. Using float64 (15 digits) gives 15 − 8 = 7 reliable digits. Better, but still expensive.

Hand calculation 3: LU decomposition

LU decomposition factors A = LU where L is lower triangular and U is upper triangular. This is how computers actually solve Ax = b (they never compute A−1 directly).

For A = [[2, 1], [4, 3]]:

Step 1: The multiplier for row 2 is 4/2 = 2.

Step 2: Subtract 2 × R1 from R2: [4−4, 3−2] = [0, 1].

L = [[1, 0], [2, 1]],    U = [[2, 1], [0, 1]]

Verify: LU = [[1×2+0×0, 1×1+0×1], [2×2+1×0, 2×1+1×1]] = [[2, 1], [4, 3]] = A ✓

To solve Ax = b: first solve Ly = b (forward substitution, O(n²)), then Ux = y (back substitution, O(n²)). Total: O(n³) for the factorization, then O(n²) per right-hand side. If you need to solve for many different b vectors with the same A, you factorize once and solve cheaply many times.

The ML payoff

float32 vs float16: Modern LLM training uses mixed precision — weights stored in float32, but forward/backward passes in float16 or bfloat16. The inner products in attention (Q·KT) can have huge values (that's why we divide by √dk), and float16 overflow causes NaN. bfloat16 has the same range as float32 (8 exponent bits) but less precision (2.4 digits vs 7.2), which is usually enough.

Batch normalization reduces the condition number of the input distribution to each layer. Without batchnorm, deep networks have layered transformations where small input changes get amplified exponentially (high condition number). Normalizing the statistics at each layer keeps the condition number manageable.

Gradient clipping prevents ill-conditioned gradients from causing parameter explosions. If the Jacobian of the network has a large condition number, gradients in the "steep" direction become enormous while those in the "flat" direction are tiny. Clipping caps the maximum gradient norm.

Adam's advantage: Adam's per-parameter scaling (dividing by √v̂) acts like diagonal preconditioning. In numerical linear algebra, preconditioning transforms a problem to reduce its condition number before solving. Adam does this automatically, which is why it works on problems where vanilla SGD with a fixed learning rate diverges.

Preconditioning, intuitively: Imagine you're optimizing a long, narrow valley. Gradient descent oscillates side-to-side across the narrow direction while making slow progress along the long axis. Preconditioning rescales the axes so the valley becomes a bowl, making all directions equally easy. The condition number goes from "terrible" to "1." Adam approximates this by tracking per-parameter curvature.

From scratch: condition number and LU

python
import numpy as np
from scipy import linalg

# Condition number example
A = np.array([[1, 1], [1, 1.001]])
print("Condition number:", np.linalg.cond(A))  # ~4000

# Solve with two slightly different b vectors
b1 = np.array([2.0, 2.001])
b2 = np.array([2.0, 2.002])
x1 = np.linalg.solve(A, b1)
x2 = np.linalg.solve(A, b2)
print("Solution 1:", x1)  # [1. 1.]
print("Solution 2:", x2)  # [0. 2.] — wildly different!

# LU decomposition
A2 = np.array([[2, 1], [4, 3]], dtype=float)
P, L, U = linalg.lu(A2)
print("P:", P)  # Permutation matrix
print("L:", L)
print("U:", U)
print("P@L@U =", P @ L @ U)  # Should equal A2

# Digits lost formula
for kappa in [10, 1e4, 1e8, 1e12]:
    f32_reliable = 7.2 - np.log10(kappa)
    f64_reliable = 15.9 - np.log10(kappa)
    print(f"κ={kappa:.0e}: float32={f32_reliable:.1f} digits, float64={f64_reliable:.1f} digits")
The float64 myth: People often think numerical instability is just a precision issue — "use float64 and it goes away." But if your condition number is 1015, even float64 (15 digits) gives you zero reliable digits. The solution is to reformulate the problem (preconditioning, regularization, better algorithms) rather than throwing precision at it. Adding λI to a matrix (Tikhonov regularization) directly reduces the condition number from σmaxmin to (σmax+λ)/(σmin+λ).

See it: condition number and sensitivity

The widget below lets you adjust a 2×2 matrix. A small perturbation ε is added to the right-hand side b. Watch how the solution changes. When the condition number is small (well-conditioned), the two solutions overlap. When large (ill-conditioned), they diverge wildly — even though the inputs are nearly identical.

Conditioning & Numerical Stability

Two nearly identical systems Ax=b and Ax=b' (differing by ε). Orange = solution 1, teal = solution 2. Large gap = ill-conditioned.

a12 1.000
ε (perturbation) 0.010
κ(A) = 1.00 | Δx = 0.000 | Well-conditioned

A matrix has condition number 108 and you're using float32 (~7 digits precision). How many digits of accuracy can you expect in the solution?

Chapter 11: Connections & Cheat Sheet

We've covered the entire linear algebra toolkit for ML. Every technique you'll encounter in papers — from attention mechanisms to LoRA fine-tuning to gradient descent — is built on these operations. Let's map every concept to its real-world ML application.

The Complete Concept → ML Mapping

ConceptML ApplicationConcrete Example
VectorsEmbeddingsword2vec: "king" = [0.2, −0.5, 0.8, ...]
Dot productAttention scoresQ·KT in transformers
Matrix multiplyNeural network layersy = Wx + b (every layer)
NormsRegularizationL1 (Lasso) → sparsity, L2 (Ridge) → weight decay
TransposeData reshapingXTX in normal equations
InverseClosed-form solutionsw = (XTX)−1XTy (linear regression)
DeterminantVolume changeNormalizing flows: det(Jacobian) tracks density
EigenvaluesPCA, stabilityCovariance eigenvectors = principal components
SVDCompression, LoRAΔW ≈ BA where rank(BA) ≪ rank(W)
GradientsTraining∂L/∂W for every parameter
JacobianBackpropagationChain of Jacobian-vector products
BroadcastingBatched operationsAdding bias [D] to every sample in batch [B, D]
EinsumAttention, contractionsMulti-head attention in one line

Pipeline Walk-Through: One Training Step

Let's trace one complete training step of a 2-layer network, labeling every linear algebra operation:

1. Data Loading
X: [batch, features], y: [batch] — Tensors loaded from disk
2. Layer 1 Forward
z&sub1; = W&sub1; @ XT + b&sub1; — matrix multiply + broadcasting
3. Activation
a&sub1; = ReLU(z&sub1;) — element-wise (no linalg)
4. Layer 2 Forward
z&sub2; = W&sub2; @ a&sub1; + b&sub2; — matrix multiply + broadcasting
5. Loss
L = ||z&sub2; − y||² — L2 norm squared
↓ backprop begins
6. Output Gradient
∂L/∂z&sub2; = 2(z&sub2; − y) — vector subtraction
7. Layer 2 Gradients
∂L/∂W&sub2; = (∂L/∂z&sub2;) a&sub1;Touter product
8. Chain to Layer 1
∂L/∂a&sub1; = W&sub2;T @ (∂L/∂z&sub2;) — matmul with transpose
9. Through ReLU
∂L/∂z&sub1; = ∂L/∂a&sub1; ⊙ (z&sub1; > 0) — element-wise product
10. Layer 1 Gradients
∂L/∂W&sub1; = (∂L/∂z&sub1;) XTouter product
11. Update
W ← W − α∇W — vector subtraction (scaled)
Count the linear algebra: Matrix multiply (4×), broadcasting (2×), norm (1×), vector subtraction (3×), outer product (2×), transpose (2×), element-wise product (1×). That's 15 linear algebra operations in a single training step of a 2-layer network. A transformer with 96 layers has thousands.

Related Lessons

Building on this foundation:

Advanced applications:

Interactive: Full ML Pipeline

ML Pipeline — Linear Algebra Operations Map

Each stage of the ML pipeline is labeled with the linear algebra operations it uses. Hover or tap a stage to highlight it.

Closing thought: Linear algebra is the language of machine learning. Every forward pass is a chain of matrix multiplies. Every backward pass is the chain rule on those multiplies. Every optimization step is a vector subtraction. PCA is eigendecomposition. LoRA is low-rank SVD. Attention is batched dot products via einsum. Now you speak the language. Go read some papers.

"The purpose of computing is insight, not numbers." — Richard Hamming

Which decomposition lets you approximate a large weight matrix with two small ones for efficient fine-tuning?