Deisenroth et al., Chapter 3

Analytic Geometry

Equipping vector spaces with notions of length, angle, and distance.

Prerequisites: Chapter 2 (Linear Algebra). That's it.
10
Chapters
5+
Simulations
10
Quizzes

Chapter 0: Why Geometry?

In Chapter 2 we learned to solve Ax = b. We can find solutions, count dimensions, change bases. But we cannot yet ask simple geometric questions: How long is this vector? What angle do two vectors make? How close is a point to a subspace?

These questions matter because machine learning is fundamentally geometric. A classifier draws a boundary in space. PCA finds the direction of maximum spread. A neural network learns a transformation that separates clusters. To reason about any of these, we need lengths, angles, and distances.

Chapter 2 gave us the algebraic structure of vector spaces (addition, scaling). This chapter adds geometric structure: an inner product that lets us measure lengths and angles, and projections that let us find the closest point in a subspace. Together, algebra + geometry = the full toolkit for ML.

The core idea: A vector space alone has no notion of length or angle. Those concepts come from an additional piece of structure called an inner product (generalized dot product). Once you have an inner product, everything geometric — norms, distances, angles, projections, orthogonality — follows.
Inner Product
A function that takes two vectors and returns a number
↓ defines
Norms & Lengths
||x|| = √(⟨x,x⟩)
↓ defines
Angles & Distances
cos(θ) = ⟨x,y⟩ / (||x|| ||y||)
↓ enables
Projections
Find the closest point in a subspace
ConceptML application
NormRegularization (L1, L2 penalties on weights)
Inner productKernel methods, similarity measures
Angle / cosine similarityWord embeddings, recommendation systems
ProjectionPCA, least-squares regression, Gram-Schmidt
OrthogonalityDecorrelating features, orthogonal initialization
Check: What additional structure does a vector space need to support geometric operations like measuring lengths?

Chapter 1: Norms

Before we talk about inner products, let's start with a simpler concept: how big is this vector? A norm is a function that assigns a non-negative "size" to every vector, satisfying three rules:

PropertyFormulaMeaning
Positive definite||x|| ≥ 0, and ||x|| = 0 iff x = 0Only the zero vector has zero size
Homogeneous||λx|| = |λ| · ||x||Scaling a vector scales its norm
Triangle inequality||x + y|| ≤ ||x|| + ||y||The direct path is never longer than a detour

The most common norms in ML:

||x||1 = |x1| + |x2| + … + |xn|     (L1, "Manhattan distance")
||x||2 = √(x12 + x22 + … + xn2)     (L2, "Euclidean distance")
||x|| = max(|x1|, |x2|, …, |xn|)     (L∞, "max norm")
Unit Circles Under Different Norms

The "unit circle" is all vectors with ||x|| = 1. Different norms produce different shapes. The p slider interpolates between L1 (diamond), L2 (circle), and L∞ (square).

p2.0

At p=1, the unit ball is a diamond — L1 favors sparse solutions (corners touch the axes). At p=2, it is the familiar circle. As p grows, it inflates toward a square (L∞). This shape matters for regularization: L1 regularization (lasso) pushes weights to exactly zero because the diamond's corners sit on the axes.

Key insight: The choice of norm defines what "closeness" means. L2 treats all directions equally (the circle is round). L1 penalizes non-zero components (the diamond has sharp corners on the axes). L∞ only cares about the largest component. In ML, picking a norm is a modeling decision — it encodes your belief about what "simple" looks like.
Norms in ML: L2 regularization (Ridge) penalizes ||w||22 and spreads weight across all features. L1 regularization (Lasso) penalizes ||w||1 and drives some weights to exactly zero, performing feature selection. Both are controlled by a hyperparameter λ — larger λ means more shrinkage.
Check: Why does L1 regularization tend to produce sparse weights (some exactly zero)?

Chapter 2: Inner Products

A norm measures the size of one vector. An inner product measures the relationship between two vectors. It generalizes the familiar dot product xTy to allow different geometries.

An inner product ⟨·, ·⟩: V × V → R must satisfy:

PropertyFormula
Bilinear⟨λx + z, y⟩ = λ⟨x, y⟩ + ⟨z, y⟩ (and similar in second argument)
Symmetric⟨x, y⟩ = ⟨y, x⟩
Positive definite⟨x, x⟩ ≥ 0, and = 0 only if x = 0

The standard dot product xTy = ∑ xiyi is the simplest inner product. But we can define others. Given a symmetric positive definite (SPD) matrix A, we can define:

⟨x, y⟩A = xTAy

This is still a valid inner product — it just warps the geometry. Distances and angles measured with ⟨·,·⟩A differ from those measured with the standard dot product. Think of A as defining the "shape of a ruler" — it stretches some directions and compresses others.

Why generalize beyond the dot product? In ML, features often have different scales (age in years vs. income in dollars). The Mahalanobis distance uses the inverse covariance matrix as A, automatically accounting for different variances and correlations between features. Kernel methods define inner products in high-dimensional feature spaces without ever computing the features explicitly.

Every inner product induces a norm: ||x|| = √(⟨x, x⟩). The standard dot product induces the L2 norm. The inner product xTAy induces the norm ||x||A = √(xTAx).

Key insight: Inner products are more fundamental than norms. Every inner product gives you a norm (take ⟨x, x⟩ and square root), but not every norm comes from an inner product. The L1 and L∞ norms, for instance, cannot be derived from any inner product. This is why L2 is special — it is the norm with a matching inner product, and that inner product unlocks angles, projections, and orthogonality.
Check: What is the relationship between an inner product and a norm?

Chapter 3: SPD Matrices

The matrix A in the generalized inner product xTAy must be symmetric positive definite (SPD). What does that mean, and why does it matter?

A matrix A is symmetric if A = AT (it equals its transpose). It is positive definite if xTAx > 0 for every non-zero vector x. Together:

A is SPD  ⇔  A = AT and xTAx > 0 for all x ≠ 0

Why must A be SPD for an inner product? Consider the norm ||x||A = √(xTAx). If xTAx could be negative, we would be taking the square root of a negative number — the norm would not be real. If xTAx = 0 for some non-zero x, the norm would declare a non-zero vector to have zero length — violating positive definiteness.

Equivalent tests for SPD: A symmetric matrix is positive definite if and only if: (1) all eigenvalues are strictly positive, (2) all leading principal minors are positive (Sylvester's criterion), (3) it can be written as A = LLT for some invertible lower-triangular L (Cholesky decomposition).
Quadratic Form xTAx

The contours of xTAx for a 2×2 symmetric matrix. When A is PD, all contours are ellipses (all values positive). When it is not PD, you get hyperbolas or saddles.

a112.0
a12=a210.5
a222.0

When the contours are ellipses (A is PD), the function xTAx has a single minimum at the origin — it is a bowl. When eigenvalues have different signs (A is indefinite), the contours become hyperbolas — a saddle. The transition happens exactly when the determinant crosses zero.

SPD in ML: Covariance matrices are always symmetric positive semi-definite (eigenvalues ≥ 0). The kernel matrix in kernel methods must be positive semi-definite. The Hessian of a convex loss function is positive semi-definite. Whenever you see a quadratic form xTAx in ML, checking whether A is SPD tells you whether the problem is well-behaved.
Check: Why must the matrix A be positive definite for xTAy to be a valid inner product?

Chapter 4: Lengths & Distances

With an inner product in hand, we can define length and distance precisely. The length (or norm) of a vector x is:

||x|| = √(⟨x, x⟩)

For the standard dot product: ||x|| = √(x12 + … + xn2), the familiar Euclidean length.

The distance between two vectors x and y is the length of their difference:

d(x, y) = ||x − y|| = √(⟨x − y, x − y⟩)

This is a metric — it satisfies non-negativity, symmetry, and the triangle inequality. Different inner products produce different metrics. The Mahalanobis distance uses xTΣ−1x:

dM(x, y) = √((x − y)T Σ−1 (x − y))

where Σ is the covariance matrix. This "stretches" the space so that correlated directions are de-correlated — making distances meaningful even when features have different scales.

Worked example: Let x = [3, 4]. Euclidean length: ||x|| = √(9 + 16) = √25 = 5. Pythagorean theorem in action. Now if A = [4, 0; 0, 1], the length under ⟨·,·⟩A is ||x||A = √(xTAx) = √([3,4][4,0;0,1][3;4]) = √(36 + 4) = √40 ≈ 6.3. The x-direction is stretched (weighted 4×), so the "3 in x" contributes more to the length.
Key insight: Distance is not absolute — it depends on the inner product you choose. Euclidean distance treats all directions equally. Mahalanobis distance accounts for the data's covariance structure. In ML, choosing the right distance metric can be more important than choosing the right algorithm.
Check: What does the Mahalanobis distance account for that Euclidean distance does not?

Chapter 5: Angles

The inner product also gives us angles between vectors. The Cauchy-Schwarz inequality states:

|⟨x, y⟩| ≤ ||x|| · ||y||

This guarantees that the ratio ⟨x, y⟩ / (||x|| ||y||) always lies between −1 and 1. So we can define the angle θ between x and y:

cos(θ) = ⟨x, y⟩ / (||x|| · ||y||)

For the standard dot product: cos(θ) = xTy / (||x|| ||y||). This is the cosine similarity used throughout ML — in word embeddings, information retrieval, and recommendation systems.

Angle Between Vectors

Drag the angle slider to see how the dot product relates to the angle. When cos(θ) = 0, the vectors are perpendicular. When cos(θ) = 1 or −1, they are parallel.

θ45°
Cauchy-Schwarz: why it matters. It is the single most important inequality in linear algebra. It says: the dot product can never exceed the product of lengths. Equality holds if and only if the vectors are parallel (x = λy). Every other geometric inequality (triangle inequality, projection bounds) derives from Cauchy-Schwarz. It is the reason angles are well-defined.

A worked example: x = [1, 0] and y = [1, 1]. Then xTy = 1, ||x|| = 1, ||y|| = √2. So cos(θ) = 1/√2 ≈ 0.707, giving θ = 45°. This matches intuition — [1,1] makes a 45° angle with the x-axis.

Cosine similarity in NLP: In word embeddings (Word2Vec, GloVe), each word is a vector in R300. The angle between "king" and "queen" is small (similar meaning). The angle between "king" and "banana" is large. Cosine similarity captures semantic relatedness because it ignores vector magnitude (word frequency) and focuses on direction (meaning).
Check: What does the Cauchy-Schwarz inequality guarantee about the ratio ⟨x,y⟩ / (||x|| ||y||)?

Chapter 6: Orthogonality

Two vectors x and y are orthogonal (perpendicular) if their inner product is zero:

⟨x, y⟩ = 0  ⇔  x ⊥ y

Orthogonality is the geometric version of "no relationship." If x and y are orthogonal, knowing x tells you nothing about y — they are completely independent directions. This is why orthogonality is central to statistics (uncorrelated variables), signal processing (Fourier basis), and ML (decorrelation in PCA).

An orthogonal basis is a basis where every pair of basis vectors is orthogonal. An orthonormal basis adds the requirement that every basis vector has unit length:

⟨bi, bj⟩ = δij = { 1 if i = j, 0 if i ≠ j }

Working with an orthonormal basis is effortless. To find the coordinate of a vector x along basis vector bi, you just compute ⟨x, bi⟩ — no matrix inversion, no Gaussian elimination. This is the power of orthogonality.

Key insight: An orthonormal basis turns finding coordinates into simple dot products. In an arbitrary basis, you must solve a system of equations to find coordinates. In an orthonormal basis, you just project: coordinate i = ⟨x, bi⟩. This is why so much of applied mathematics is about finding orthogonal bases (Fourier, wavelets, PCA).

A matrix Q whose columns form an orthonormal basis is called an orthogonal matrix. It has a wonderful property: its inverse equals its transpose.

QTQ = I  ⇒  Q−1 = QT

Inverting a matrix is normally expensive (O(n3)). But inverting an orthogonal matrix is free — just transpose it. Orthogonal matrices represent rotations and reflections (length-preserving transformations). They appear in the QR decomposition, SVD, and eigendecomposition of symmetric matrices.

PropertyOrthogonal matrix Q
ColumnsOrthonormal (pairwise perpendicular, unit length)
InverseQ−1 = QT (free to compute)
Determinantdet(Q) = ±1
Geometric actionRotation (det = 1) or reflection (det = −1)
PreservesLengths, angles, inner products
Check: Why is an orthonormal basis so convenient for computing coordinates?

Chapter 7: Projection onto Lines

Here is a fundamental question: given a vector x and a line through the origin spanned by b, what is the point on the line closest to x? That point is the orthogonal projection of x onto b.

The projection must satisfy two conditions: (1) it lies on the line (so it equals λb for some scalar λ), and (2) the error x − λb is perpendicular to b. Using condition (2):

⟨x − λb, b⟩ = 0

Expanding: ⟨x, b⟩ − λ⟨b, b⟩ = 0, so:

λ = ⟨x, b⟩ / ⟨b, b⟩ = bTx / bTb

The projected vector is:

π(x) = λb = (bTx / bTb) · b

We can also write this as a matrix: π(x) = Px, where the projection matrix is:

P = bbT / bTb
Projection matrix properties: P is symmetric (PT = P) and idempotent (P2 = P). Applying the projection twice does nothing new — once you are on the line, projecting again keeps you there. The rank of P is 1 (it maps all of Rn onto a one-dimensional line).
Projection onto a Line (Draggable)

Drag the teal endpoint of x and the orange endpoint of b. The projection π(x) and the perpendicular residual are shown. The value λ updates live.

λ = 0.00  |  ||x − π(x)|| = 0.00

The perpendicular (dashed line) is the shortest path from x to the line. This is the geometric meaning of "closest point" — any other point on the line is farther away. You can verify this visually by dragging x around and watching how the dashed line always hits the orange line at a right angle.

Let's verify with numbers. Take b = [2, 1] and x = [1, 3]. Then:

λ = bTx / bTb = (2·1 + 1·3) / (4 + 1) = 5/5 = 1
π(x) = 1 · [2, 1] = [2, 1]

The residual is x − π(x) = [1,3] − [2,1] = [−1, 2]. Check: bT(x − π(x)) = 2(−1) + 1(2) = 0. Perpendicular. The projection formula works.

Key insight: Projection is the bridge from geometry to optimization. "Find the closest point" is the same as "minimize the error." Least-squares regression is literally projecting the data vector onto the column space of the feature matrix. If you understand projection onto a line, you understand the geometric heart of regression.
Check: The projection π(x) = λb satisfies what geometric condition?

Chapter 8: Projection onto Subspaces

Projecting onto a line is a special case. More generally, we want to project onto a subspace spanned by multiple vectors. Suppose U = span{b1, …, bk} is a k-dimensional subspace of Rn. Let B = [b1 | … | bk] be the n×k matrix whose columns are the basis vectors. The projection of x onto U is:

π(x) = B(BTB)−1BTx

and the projection matrix is:

P = B(BTB)−1BT

When k = 1 (a single column b), this simplifies to bbT/(bTb) — exactly the line projection from Chapter 7. The general formula is derived the same way: demand that the residual x − π(x) be orthogonal to every vector in U:

BT(x − Bλ) = 0  ⇒  BTBλ = BTx  ⇒  λ = (BTB)−1BTx

The vector λ = (BTB)−1BTx gives the coordinates of the projection in the basis B, and π(x) = Bλ converts back to the original space.

This is least squares. If Ax = b has no exact solution (overdetermined system), the "best" approximate solution minimizes ||Ax − b||2. The solution is x̂ = (ATA)−1ATb — which is exactly the projection formula with A playing the role of B. Least-squares regression is projection onto the column space of A.

The projection matrix P = B(BTB)−1BT has the same properties as the rank-1 case:

PropertyFormulaMeaning
SymmetricPT = PProjecting is a self-adjoint operation
IdempotentP2 = PProjecting twice = projecting once
Rankrk(P) = kP maps onto a k-dimensional subspace
ComplementI − P projects onto UThe "rejection" (orthogonal complement)
Key insight: If the columns of B are already orthonormal (BTB = I), the formula simplifies dramatically: P = BBT. No matrix inversion needed! This is another reason why orthonormal bases are so powerful — they make projection (and therefore regression) trivially cheap.

Special case: if B is square and invertible (k = n), then P = B(BTB)−1BT = BB−1(BT)−1BT = I. Projecting onto the entire space leaves every vector unchanged. The projection becomes interesting only when the subspace is smaller than the ambient space.

Check: What is the projection matrix P when the columns of B are orthonormal?

Chapter 9: Gram-Schmidt

Orthonormal bases are wonderful, but how do you build one? Given any set of linearly independent vectors {a1, …, ak}, the Gram-Schmidt process produces an orthonormal basis {q1, …, qk} spanning the same subspace.

The algorithm works iteratively. Each step takes the next vector, subtracts its projections onto all previously computed basis vectors (removing the "already-covered" directions), and normalizes the remainder:

Step 1
q1 = a1 / ||a1||   (just normalize)
Step 2
u2 = a2 − ⟨a2, q1⟩q1,   q2 = u2 / ||u2||
Step k
uk = ak − ∑i<k ⟨ak, qi⟩qi,   qk = uk / ||uk||

At each step, the subtraction removes everything in the direction of the already-found basis vectors. What remains must be perpendicular to all of them. Normalizing gives a unit vector.

Gram-Schmidt in R2

Watch the process step by step. Vector a1 is normalized to get q1. Then a2 has its projection onto q1 removed, and the remainder is normalized to get q2.

Step 0: Original vectors a1, a2
a2 angle30°

Let's verify with numbers. Take a1 = [2, 0] and a2 = [1, 1].

Step 1: ||a1|| = 2, so q1 = [1, 0].

Step 2: ⟨a2, q1⟩ = 1·1 + 1·0 = 1. u2 = [1,1] − 1·[1,0] = [0, 1]. ||u2|| = 1, so q2 = [0, 1].

Check: ⟨q1, q2⟩ = 1·0 + 0·1 = 0. Orthogonal. Both have unit length. Done.

python
import numpy as np

def gram_schmidt(A):
    """Orthonormalize columns of A."""
    Q = np.zeros_like(A, dtype=float)
    for k in range(A.shape[1]):
        q = A[:, k].astype(float)
        for j in range(k):
            q -= np.dot(Q[:, j], q) * Q[:, j]
        Q[:, k] = q / np.linalg.norm(q)
    return Q

A = np.array([[2, 1], [0, 1]])
Q = gram_schmidt(A)
print(Q)  # [[1, 0], [0, 1]]
print(Q.T @ Q)  # identity matrix
Gram-Schmidt is the QR decomposition. The process implicitly factors A = QR, where Q holds the orthonormal vectors and R is upper-triangular (the coefficients from the projection subtractions). QR decomposition is the standard numerical method for solving least-squares problems. It is more stable than forming ATA and inverting.

Looking ahead: The geometric ideas from this chapter — norms, inner products, projections, orthogonality — are the foundation for everything in Part II. Linear regression (Ch 9) is projection onto the column space. PCA (Ch 10) finds the orthogonal directions of maximum variance. GMMs (Ch 11) use Mahalanobis distance to measure cluster membership. The language of analytic geometry runs through all of ML.

"The introduction of numbers as coordinates is an act of violence."
— Hermann Weyl
Check: What does each step of Gram-Schmidt do to the next input vector?