Equipping vector spaces with notions of length, angle, and distance.
In Chapter 2 we learned to solve Ax = b. We can find solutions, count dimensions, change bases. But we cannot yet ask simple geometric questions: How long is this vector? What angle do two vectors make? How close is a point to a subspace?
These questions matter because machine learning is fundamentally geometric. A classifier draws a boundary in space. PCA finds the direction of maximum spread. A neural network learns a transformation that separates clusters. To reason about any of these, we need lengths, angles, and distances.
Chapter 2 gave us the algebraic structure of vector spaces (addition, scaling). This chapter adds geometric structure: an inner product that lets us measure lengths and angles, and projections that let us find the closest point in a subspace. Together, algebra + geometry = the full toolkit for ML.
| Concept | ML application |
|---|---|
| Norm | Regularization (L1, L2 penalties on weights) |
| Inner product | Kernel methods, similarity measures |
| Angle / cosine similarity | Word embeddings, recommendation systems |
| Projection | PCA, least-squares regression, Gram-Schmidt |
| Orthogonality | Decorrelating features, orthogonal initialization |
Before we talk about inner products, let's start with a simpler concept: how big is this vector? A norm is a function that assigns a non-negative "size" to every vector, satisfying three rules:
| Property | Formula | Meaning |
|---|---|---|
| Positive definite | ||x|| ≥ 0, and ||x|| = 0 iff x = 0 | Only the zero vector has zero size |
| Homogeneous | ||λx|| = |λ| · ||x|| | Scaling a vector scales its norm |
| Triangle inequality | ||x + y|| ≤ ||x|| + ||y|| | The direct path is never longer than a detour |
The most common norms in ML:
The "unit circle" is all vectors with ||x|| = 1. Different norms produce different shapes. The p slider interpolates between L1 (diamond), L2 (circle), and L∞ (square).
At p=1, the unit ball is a diamond — L1 favors sparse solutions (corners touch the axes). At p=2, it is the familiar circle. As p grows, it inflates toward a square (L∞). This shape matters for regularization: L1 regularization (lasso) pushes weights to exactly zero because the diamond's corners sit on the axes.
A norm measures the size of one vector. An inner product measures the relationship between two vectors. It generalizes the familiar dot product xTy to allow different geometries.
An inner product ⟨·, ·⟩: V × V → R must satisfy:
| Property | Formula |
|---|---|
| Bilinear | ⟨λx + z, y⟩ = λ⟨x, y⟩ + ⟨z, y⟩ (and similar in second argument) |
| Symmetric | ⟨x, y⟩ = ⟨y, x⟩ |
| Positive definite | ⟨x, x⟩ ≥ 0, and = 0 only if x = 0 |
The standard dot product xTy = ∑ xiyi is the simplest inner product. But we can define others. Given a symmetric positive definite (SPD) matrix A, we can define:
This is still a valid inner product — it just warps the geometry. Distances and angles measured with ⟨·,·⟩A differ from those measured with the standard dot product. Think of A as defining the "shape of a ruler" — it stretches some directions and compresses others.
Every inner product induces a norm: ||x|| = √(⟨x, x⟩). The standard dot product induces the L2 norm. The inner product xTAy induces the norm ||x||A = √(xTAx).
The matrix A in the generalized inner product xTAy must be symmetric positive definite (SPD). What does that mean, and why does it matter?
A matrix A is symmetric if A = AT (it equals its transpose). It is positive definite if xTAx > 0 for every non-zero vector x. Together:
Why must A be SPD for an inner product? Consider the norm ||x||A = √(xTAx). If xTAx could be negative, we would be taking the square root of a negative number — the norm would not be real. If xTAx = 0 for some non-zero x, the norm would declare a non-zero vector to have zero length — violating positive definiteness.
The contours of xTAx for a 2×2 symmetric matrix. When A is PD, all contours are ellipses (all values positive). When it is not PD, you get hyperbolas or saddles.
When the contours are ellipses (A is PD), the function xTAx has a single minimum at the origin — it is a bowl. When eigenvalues have different signs (A is indefinite), the contours become hyperbolas — a saddle. The transition happens exactly when the determinant crosses zero.
With an inner product in hand, we can define length and distance precisely. The length (or norm) of a vector x is:
For the standard dot product: ||x|| = √(x12 + … + xn2), the familiar Euclidean length.
The distance between two vectors x and y is the length of their difference:
This is a metric — it satisfies non-negativity, symmetry, and the triangle inequality. Different inner products produce different metrics. The Mahalanobis distance uses xTΣ−1x:
where Σ is the covariance matrix. This "stretches" the space so that correlated directions are de-correlated — making distances meaningful even when features have different scales.
The inner product also gives us angles between vectors. The Cauchy-Schwarz inequality states:
This guarantees that the ratio ⟨x, y⟩ / (||x|| ||y||) always lies between −1 and 1. So we can define the angle θ between x and y:
For the standard dot product: cos(θ) = xTy / (||x|| ||y||). This is the cosine similarity used throughout ML — in word embeddings, information retrieval, and recommendation systems.
Drag the angle slider to see how the dot product relates to the angle. When cos(θ) = 0, the vectors are perpendicular. When cos(θ) = 1 or −1, they are parallel.
A worked example: x = [1, 0] and y = [1, 1]. Then xTy = 1, ||x|| = 1, ||y|| = √2. So cos(θ) = 1/√2 ≈ 0.707, giving θ = 45°. This matches intuition — [1,1] makes a 45° angle with the x-axis.
Two vectors x and y are orthogonal (perpendicular) if their inner product is zero:
Orthogonality is the geometric version of "no relationship." If x and y are orthogonal, knowing x tells you nothing about y — they are completely independent directions. This is why orthogonality is central to statistics (uncorrelated variables), signal processing (Fourier basis), and ML (decorrelation in PCA).
An orthogonal basis is a basis where every pair of basis vectors is orthogonal. An orthonormal basis adds the requirement that every basis vector has unit length:
Working with an orthonormal basis is effortless. To find the coordinate of a vector x along basis vector bi, you just compute ⟨x, bi⟩ — no matrix inversion, no Gaussian elimination. This is the power of orthogonality.
A matrix Q whose columns form an orthonormal basis is called an orthogonal matrix. It has a wonderful property: its inverse equals its transpose.
Inverting a matrix is normally expensive (O(n3)). But inverting an orthogonal matrix is free — just transpose it. Orthogonal matrices represent rotations and reflections (length-preserving transformations). They appear in the QR decomposition, SVD, and eigendecomposition of symmetric matrices.
| Property | Orthogonal matrix Q |
|---|---|
| Columns | Orthonormal (pairwise perpendicular, unit length) |
| Inverse | Q−1 = QT (free to compute) |
| Determinant | det(Q) = ±1 |
| Geometric action | Rotation (det = 1) or reflection (det = −1) |
| Preserves | Lengths, angles, inner products |
Here is a fundamental question: given a vector x and a line through the origin spanned by b, what is the point on the line closest to x? That point is the orthogonal projection of x onto b.
The projection must satisfy two conditions: (1) it lies on the line (so it equals λb for some scalar λ), and (2) the error x − λb is perpendicular to b. Using condition (2):
Expanding: ⟨x, b⟩ − λ⟨b, b⟩ = 0, so:
The projected vector is:
We can also write this as a matrix: π(x) = Px, where the projection matrix is:
Drag the teal endpoint of x and the orange endpoint of b. The projection π(x) and the perpendicular residual are shown. The value λ updates live.
The perpendicular (dashed line) is the shortest path from x to the line. This is the geometric meaning of "closest point" — any other point on the line is farther away. You can verify this visually by dragging x around and watching how the dashed line always hits the orange line at a right angle.
Let's verify with numbers. Take b = [2, 1] and x = [1, 3]. Then:
The residual is x − π(x) = [1,3] − [2,1] = [−1, 2]. Check: bT(x − π(x)) = 2(−1) + 1(2) = 0. Perpendicular. The projection formula works.
Projecting onto a line is a special case. More generally, we want to project onto a subspace spanned by multiple vectors. Suppose U = span{b1, …, bk} is a k-dimensional subspace of Rn. Let B = [b1 | … | bk] be the n×k matrix whose columns are the basis vectors. The projection of x onto U is:
and the projection matrix is:
When k = 1 (a single column b), this simplifies to bbT/(bTb) — exactly the line projection from Chapter 7. The general formula is derived the same way: demand that the residual x − π(x) be orthogonal to every vector in U:
The vector λ = (BTB)−1BTx gives the coordinates of the projection in the basis B, and π(x) = Bλ converts back to the original space.
The projection matrix P = B(BTB)−1BT has the same properties as the rank-1 case:
| Property | Formula | Meaning |
|---|---|---|
| Symmetric | PT = P | Projecting is a self-adjoint operation |
| Idempotent | P2 = P | Projecting twice = projecting once |
| Rank | rk(P) = k | P maps onto a k-dimensional subspace |
| Complement | I − P projects onto U⊥ | The "rejection" (orthogonal complement) |
Special case: if B is square and invertible (k = n), then P = B(BTB)−1BT = BB−1(BT)−1BT = I. Projecting onto the entire space leaves every vector unchanged. The projection becomes interesting only when the subspace is smaller than the ambient space.
Orthonormal bases are wonderful, but how do you build one? Given any set of linearly independent vectors {a1, …, ak}, the Gram-Schmidt process produces an orthonormal basis {q1, …, qk} spanning the same subspace.
The algorithm works iteratively. Each step takes the next vector, subtracts its projections onto all previously computed basis vectors (removing the "already-covered" directions), and normalizes the remainder:
At each step, the subtraction removes everything in the direction of the already-found basis vectors. What remains must be perpendicular to all of them. Normalizing gives a unit vector.
Watch the process step by step. Vector a1 is normalized to get q1. Then a2 has its projection onto q1 removed, and the remainder is normalized to get q2.
Let's verify with numbers. Take a1 = [2, 0] and a2 = [1, 1].
Step 1: ||a1|| = 2, so q1 = [1, 0].
Step 2: ⟨a2, q1⟩ = 1·1 + 1·0 = 1. u2 = [1,1] − 1·[1,0] = [0, 1]. ||u2|| = 1, so q2 = [0, 1].
Check: ⟨q1, q2⟩ = 1·0 + 0·1 = 0. Orthogonal. Both have unit length. Done.
python import numpy as np def gram_schmidt(A): """Orthonormalize columns of A.""" Q = np.zeros_like(A, dtype=float) for k in range(A.shape[1]): q = A[:, k].astype(float) for j in range(k): q -= np.dot(Q[:, j], q) * Q[:, j] Q[:, k] = q / np.linalg.norm(q) return Q A = np.array([[2, 1], [0, 1]]) Q = gram_schmidt(A) print(Q) # [[1, 0], [0, 1]] print(Q.T @ Q) # identity matrix
Looking ahead: The geometric ideas from this chapter — norms, inner products, projections, orthogonality — are the foundation for everything in Part II. Linear regression (Ch 9) is projection onto the column space. PCA (Ch 10) finds the orthogonal directions of maximum variance. GMMs (Ch 11) use Mahalanobis distance to measure cluster membership. The language of analytic geometry runs through all of ML.