Ch 3: Analytic Geometry — Deisenroth MML

Chapter 0: Why Geometry?

In Chapter 2 we learned to solve Ax = b. We can find solutions, count dimensions, change bases. But we cannot yet ask simple geometric questions: How long is this vector? What angle do two vectors make? How close is a point to a subspace?

These questions matter because machine learning is fundamentally geometric. A classifier draws a boundary in space. PCA finds the direction of maximum spread. A neural network learns a transformation that separates clusters. To reason about any of these, we need lengths, angles, and distances.

Chapter 2 gave us the algebraic structure of vector spaces (addition, scaling). This chapter adds geometric structure: an inner product that lets us measure lengths and angles, and projections that let us find the closest point in a subspace. Together, algebra + geometry = the full toolkit for ML.

The core idea: A vector space alone has no notion of length or angle. Those concepts come from an additional piece of structure called an inner product (generalized dot product). Once you have an inner product, everything geometric — norms, distances, angles, projections, orthogonality — follows.

Inner Product

A function that takes two vectors and returns a number

↓ defines

Norms & Lengths

||x|| = √(⟨x,x⟩)

↓ defines

Angles & Distances

cos(θ) = ⟨x,y⟩ / (||x|| ||y||)

↓ enables

Projections

Find the closest point in a subspace

Concept	ML application
Norm	Regularization (L1, L2 penalties on weights)
Inner product	Kernel methods, similarity measures
Angle / cosine similarity	Word embeddings, recommendation systems
Projection	PCA, least-squares regression, Gram-Schmidt
Orthogonality	Decorrelating features, orthogonal initialization

Check: What additional structure does a vector space need to support geometric operations like measuring lengths?

More vectors An inner product A larger dimension

Chapter 1: Norms

Before we talk about inner products, let's start with a simpler concept: how big is this vector? A norm is a function that assigns a non-negative "size" to every vector, satisfying three rules:

Property	Formula	Meaning
Positive definite	\|\|x\|\| ≥ 0, and \|\|x\|\| = 0 iff x = 0	Only the zero vector has zero size
Homogeneous	\|\|λx\|\| = \|λ\| · \|\|x\|\|	Scaling a vector scales its norm
Triangle inequality	\|\|x + y\|\| ≤ \|\|x\|\| + \|\|y\|\|	The direct path is never longer than a detour

The most common norms in ML:

||x||₁ = |x₁| + |x₂| + … + |x_n|     (L1, "Manhattan distance")
||x||₂ = √(x₁² + x₂² + … + x_n²)     (L2, "Euclidean distance")
||x||_∞ = max(|x₁|, |x₂|, …, |x_n|)     (L∞, "max norm")

Unit Circles Under Different Norms

The "unit circle" is all vectors with ||x|| = 1. Different norms produce different shapes. The p slider interpolates between L1 (diamond), L2 (circle), and L∞ (square).

p2.0

At p=1, the unit ball is a diamond — L1 favors sparse solutions (corners touch the axes). At p=2, it is the familiar circle. As p grows, it inflates toward a square (L∞). This shape matters for regularization: L1 regularization (lasso) pushes weights to exactly zero because the diamond's corners sit on the axes.

Key insight: The choice of norm defines what "closeness" means. L2 treats all directions equally (the circle is round). L1 penalizes non-zero components (the diamond has sharp corners on the axes). L∞ only cares about the largest component. In ML, picking a norm is a modeling decision — it encodes your belief about what "simple" looks like.

Norms in ML: L2 regularization (Ridge) penalizes ||w||₂² and spreads weight across all features. L1 regularization (Lasso) penalizes ||w||₁ and drives some weights to exactly zero, performing feature selection. Both are controlled by a hyperparameter λ — larger λ means more shrinkage.

Check: Why does L1 regularization tend to produce sparse weights (some exactly zero)?

Because the L1 unit ball has corners on the axes, where coordinates are zero Because L1 is always smaller than L2 Because L1 only penalizes large weights

Chapter 2: Inner Products

A norm measures the size of one vector. An inner product measures the relationship between two vectors. It generalizes the familiar dot product x^Ty to allow different geometries.

An inner product ⟨·, ·⟩: V × V → R must satisfy:

Property	Formula
Bilinear	⟨λx + z, y⟩ = λ⟨x, y⟩ + ⟨z, y⟩ (and similar in second argument)
Symmetric	⟨x, y⟩ = ⟨y, x⟩
Positive definite	⟨x, x⟩ ≥ 0, and = 0 only if x = 0

The standard dot product x^Ty = ∑ x_iy_i is the simplest inner product. But we can define others. Given a symmetric positive definite (SPD) matrix A, we can define:

⟨x, y⟩_A = x^TAy

This is still a valid inner product — it just warps the geometry. Distances and angles measured with ⟨·,·⟩_A differ from those measured with the standard dot product. Think of A as defining the "shape of a ruler" — it stretches some directions and compresses others.

Why generalize beyond the dot product? In ML, features often have different scales (age in years vs. income in dollars). The Mahalanobis distance uses the inverse covariance matrix as A, automatically accounting for different variances and correlations between features. Kernel methods define inner products in high-dimensional feature spaces without ever computing the features explicitly.

Every inner product induces a norm: ||x|| = √(⟨x, x⟩). The standard dot product induces the L2 norm. The inner product x^TAy induces the norm ||x||_A = √(x^TAx).

Key insight: Inner products are more fundamental than norms. Every inner product gives you a norm (take ⟨x, x⟩ and square root), but not every norm comes from an inner product. The L1 and L∞ norms, for instance, cannot be derived from any inner product. This is why L2 is special — it is the norm with a matching inner product, and that inner product unlocks angles, projections, and orthogonality.

Check: What is the relationship between an inner product and a norm?

They are the same thing A norm always defines an inner product Every inner product induces a norm, but not every norm comes from an inner product

Chapter 3: SPD Matrices

The matrix A in the generalized inner product x^TAy must be symmetric positive definite (SPD). What does that mean, and why does it matter?

A matrix A is symmetric if A = A^T (it equals its transpose). It is positive definite if x^TAx > 0 for every non-zero vector x. Together:

A is SPD ⇔ A = A^T and x^TAx > 0 for all x ≠ 0

Why must A be SPD for an inner product? Consider the norm ||x||_A = √(x^TAx). If x^TAx could be negative, we would be taking the square root of a negative number — the norm would not be real. If x^TAx = 0 for some non-zero x, the norm would declare a non-zero vector to have zero length — violating positive definiteness.

Equivalent tests for SPD: A symmetric matrix is positive definite if and only if: (1) all eigenvalues are strictly positive, (2) all leading principal minors are positive (Sylvester's criterion), (3) it can be written as A = LL^T for some invertible lower-triangular L (Cholesky decomposition).

Quadratic Form x^TAx

The contours of x^TAx for a 2×2 symmetric matrix. When A is PD, all contours are ellipses (all values positive). When it is not PD, you get hyperbolas or saddles.

a₁₁2.0

a₁₂=a₂₁0.5

a₂₂2.0

When the contours are ellipses (A is PD), the function x^TAx has a single minimum at the origin — it is a bowl. When eigenvalues have different signs (A is indefinite), the contours become hyperbolas — a saddle. The transition happens exactly when the determinant crosses zero.

SPD in ML: Covariance matrices are always symmetric positive semi-definite (eigenvalues ≥ 0). The kernel matrix in kernel methods must be positive semi-definite. The Hessian of a convex loss function is positive semi-definite. Whenever you see a quadratic form x^TAx in ML, checking whether A is SPD tells you whether the problem is well-behaved.

Check: Why must the matrix A be positive definite for x^TAy to be a valid inner product?

Because x^TAx must be strictly positive for all non-zero x, ensuring the induced norm is well-defined Because A must be invertible Because negative values would make the matrix too large

Chapter 4: Lengths & Distances

With an inner product in hand, we can define length and distance precisely. The length (or norm) of a vector x is:

||x|| = √(⟨x, x⟩)

For the standard dot product: ||x|| = √(x₁² + … + x_n²), the familiar Euclidean length.

The distance between two vectors x and y is the length of their difference:

d(x, y) = ||x − y|| = √(⟨x − y, x − y⟩)

This is a metric — it satisfies non-negativity, symmetry, and the triangle inequality. Different inner products produce different metrics. The Mahalanobis distance uses x^TΣ⁻¹x:

d_M(x, y) = √((x − y)^T Σ⁻¹ (x − y))

where Σ is the covariance matrix. This "stretches" the space so that correlated directions are de-correlated — making distances meaningful even when features have different scales.

Worked example: Let x = [3, 4]. Euclidean length: ||x|| = √(9 + 16) = √25 = 5. Pythagorean theorem in action. Now if A = [4, 0; 0, 1], the length under ⟨·,·⟩_A is ||x||_A = √(x^TAx) = √([3,4][4,0;0,1][3;4]) = √(36 + 4) = √40 ≈ 6.3. The x-direction is stretched (weighted 4×), so the "3 in x" contributes more to the length.

Key insight: Distance is not absolute — it depends on the inner product you choose. Euclidean distance treats all directions equally. Mahalanobis distance accounts for the data's covariance structure. In ML, choosing the right distance metric can be more important than choosing the right algorithm.

Check: What does the Mahalanobis distance account for that Euclidean distance does not?

The number of dimensions The covariance structure (different scales and correlations between features) The sign of the components

Chapter 5: Angles

The inner product also gives us angles between vectors. The Cauchy-Schwarz inequality states:

|⟨x, y⟩| ≤ ||x|| · ||y||

This guarantees that the ratio ⟨x, y⟩ / (||x|| ||y||) always lies between −1 and 1. So we can define the angle θ between x and y:

cos(θ) = ⟨x, y⟩ / (||x|| · ||y||)

For the standard dot product: cos(θ) = x^Ty / (||x|| ||y||). This is the cosine similarity used throughout ML — in word embeddings, information retrieval, and recommendation systems.

Angle Between Vectors

Drag the angle slider to see how the dot product relates to the angle. When cos(θ) = 0, the vectors are perpendicular. When cos(θ) = 1 or −1, they are parallel.

θ45°

Cauchy-Schwarz: why it matters. It is the single most important inequality in linear algebra. It says: the dot product can never exceed the product of lengths. Equality holds if and only if the vectors are parallel (x = λy). Every other geometric inequality (triangle inequality, projection bounds) derives from Cauchy-Schwarz. It is the reason angles are well-defined.

A worked example: x = [1, 0] and y = [1, 1]. Then x^Ty = 1, ||x|| = 1, ||y|| = √2. So cos(θ) = 1/√2 ≈ 0.707, giving θ = 45°. This matches intuition — [1,1] makes a 45° angle with the x-axis.

Cosine similarity in NLP: In word embeddings (Word2Vec, GloVe), each word is a vector in R³⁰⁰. The angle between "king" and "queen" is small (similar meaning). The angle between "king" and "banana" is large. Cosine similarity captures semantic relatedness because it ignores vector magnitude (word frequency) and focuses on direction (meaning).

Check: What does the Cauchy-Schwarz inequality guarantee about the ratio ⟨x,y⟩ / (||x|| ||y||)?

It always lies between −1 and 1, making the angle well-defined It is always positive It equals the angle in radians

Chapter 6: Orthogonality

Two vectors x and y are orthogonal (perpendicular) if their inner product is zero:

⟨x, y⟩ = 0 ⇔ x ⊥ y

Orthogonality is the geometric version of "no relationship." If x and y are orthogonal, knowing x tells you nothing about y — they are completely independent directions. This is why orthogonality is central to statistics (uncorrelated variables), signal processing (Fourier basis), and ML (decorrelation in PCA).

An orthogonal basis is a basis where every pair of basis vectors is orthogonal. An orthonormal basis adds the requirement that every basis vector has unit length:

⟨b_i, b_j⟩ = δ_ij = { 1 if i = j, 0 if i ≠ j }

Working with an orthonormal basis is effortless. To find the coordinate of a vector x along basis vector b_i, you just compute ⟨x, b_i⟩ — no matrix inversion, no Gaussian elimination. This is the power of orthogonality.

Key insight: An orthonormal basis turns finding coordinates into simple dot products. In an arbitrary basis, you must solve a system of equations to find coordinates. In an orthonormal basis, you just project: coordinate i = ⟨x, b_i⟩. This is why so much of applied mathematics is about finding orthogonal bases (Fourier, wavelets, PCA).

A matrix Q whose columns form an orthonormal basis is called an orthogonal matrix. It has a wonderful property: its inverse equals its transpose.

Q^TQ = I ⇒ Q⁻¹ = Q^T

Inverting a matrix is normally expensive (O(n³)). But inverting an orthogonal matrix is free — just transpose it. Orthogonal matrices represent rotations and reflections (length-preserving transformations). They appear in the QR decomposition, SVD, and eigendecomposition of symmetric matrices.

Property	Orthogonal matrix Q
Columns	Orthonormal (pairwise perpendicular, unit length)
Inverse	Q⁻¹ = Q^T (free to compute)
Determinant	det(Q) = ±1
Geometric action	Rotation (det = 1) or reflection (det = −1)
Preserves	Lengths, angles, inner products

Check: Why is an orthonormal basis so convenient for computing coordinates?

Because orthonormal bases always have exactly 3 vectors Because coordinates are just dot products — no system of equations to solve Because the basis vectors are always unit vectors

Chapter 7: Projection onto Lines

Here is a fundamental question: given a vector x and a line through the origin spanned by b, what is the point on the line closest to x? That point is the orthogonal projection of x onto b.

The projection must satisfy two conditions: (1) it lies on the line (so it equals λb for some scalar λ), and (2) the error x − λb is perpendicular to b. Using condition (2):

⟨x − λb, b⟩ = 0

Expanding: ⟨x, b⟩ − λ⟨b, b⟩ = 0, so:

λ = ⟨x, b⟩ / ⟨b, b⟩ = b^Tx / b^Tb

The projected vector is:

π(x) = λb = (b^Tx / b^Tb) · b

We can also write this as a matrix: π(x) = Px, where the projection matrix is:

P = bb^T / b^Tb

Projection matrix properties: P is symmetric (P^T = P) and idempotent (P² = P). Applying the projection twice does nothing new — once you are on the line, projecting again keeps you there. The rank of P is 1 (it maps all of Rⁿ onto a one-dimensional line).

Projection onto a Line (Draggable)

Drag the teal endpoint of x and the orange endpoint of b. The projection π(x) and the perpendicular residual are shown. The value λ updates live.

      λ = 0.00  |  ||x − π(x)|| = 0.00
    

The perpendicular (dashed line) is the shortest path from x to the line. This is the geometric meaning of "closest point" — any other point on the line is farther away. You can verify this visually by dragging x around and watching how the dashed line always hits the orange line at a right angle.

Let's verify with numbers. Take b = [2, 1] and x = [1, 3]. Then:

λ = b^Tx / b^Tb = (2·1 + 1·3) / (4 + 1) = 5/5 = 1

π(x) = 1 · [2, 1] = [2, 1]

The residual is x − π(x) = [1,3] − [2,1] = [−1, 2]. Check: b^T(x − π(x)) = 2(−1) + 1(2) = 0. Perpendicular. The projection formula works.

Key insight: Projection is the bridge from geometry to optimization. "Find the closest point" is the same as "minimize the error." Least-squares regression is literally projecting the data vector onto the column space of the feature matrix. If you understand projection onto a line, you understand the geometric heart of regression.

Check: The projection π(x) = λb satisfies what geometric condition?

π(x) has the same length as x The residual x − π(x) is perpendicular to b λ is always between 0 and 1

Chapter 8: Projection onto Subspaces

Projecting onto a line is a special case. More generally, we want to project onto a subspace spanned by multiple vectors. Suppose U = span{b₁, …, b_k} is a k-dimensional subspace of Rⁿ. Let B = [b₁ | … | b_k] be the n×k matrix whose columns are the basis vectors. The projection of x onto U is:

π(x) = B(B^TB)⁻¹B^Tx

and the projection matrix is:

P = B(B^TB)⁻¹B^T

When k = 1 (a single column b), this simplifies to bb^T/(b^Tb) — exactly the line projection from Chapter 7. The general formula is derived the same way: demand that the residual x − π(x) be orthogonal to every vector in U:

B^T(x − Bλ) = 0 ⇒ B^TBλ = B^Tx ⇒ λ = (B^TB)⁻¹B^Tx

The vector λ = (B^TB)⁻¹B^Tx gives the coordinates of the projection in the basis B, and π(x) = Bλ converts back to the original space.

This is least squares. If Ax = b has no exact solution (overdetermined system), the "best" approximate solution minimizes ||Ax − b||². The solution is x̂ = (A^TA)⁻¹A^Tb — which is exactly the projection formula with A playing the role of B. Least-squares regression is projection onto the column space of A.

The projection matrix P = B(B^TB)⁻¹B^T has the same properties as the rank-1 case:

Property	Formula	Meaning
Symmetric	P^T = P	Projecting is a self-adjoint operation
Idempotent	P² = P	Projecting twice = projecting once
Rank	rk(P) = k	P maps onto a k-dimensional subspace
Complement	I − P projects onto U^⊥	The "rejection" (orthogonal complement)

Key insight: If the columns of B are already orthonormal (B^TB = I), the formula simplifies dramatically: P = BB^T. No matrix inversion needed! This is another reason why orthonormal bases are so powerful — they make projection (and therefore regression) trivially cheap.

Special case: if B is square and invertible (k = n), then P = B(B^TB)⁻¹B^T = BB⁻¹(B^T)⁻¹B^T = I. Projecting onto the entire space leaves every vector unchanged. The projection becomes interesting only when the subspace is smaller than the ambient space.

Check: What is the projection matrix P when the columns of B are orthonormal?

P = BB^T (no inversion needed) P = B^TB P = I

Chapter 9: Gram-Schmidt

Orthonormal bases are wonderful, but how do you build one? Given any set of linearly independent vectors {a₁, …, a_k}, the Gram-Schmidt process produces an orthonormal basis {q₁, …, q_k} spanning the same subspace.

The algorithm works iteratively. Each step takes the next vector, subtracts its projections onto all previously computed basis vectors (removing the "already-covered" directions), and normalizes the remainder:

Step 1

q₁ = a₁ / ||a₁|| (just normalize)

↓

Step 2

u₂ = a₂ − ⟨a₂, q₁⟩q₁, q₂ = u₂ / ||u₂||

↓

Step k

u_k = a_k − ∑_i<k ⟨a_k, q_i⟩q_i, q_k = u_k / ||u_k||

At each step, the subtraction removes everything in the direction of the already-found basis vectors. What remains must be perpendicular to all of them. Normalizing gives a unit vector.

Gram-Schmidt in R²

Watch the process step by step. Vector a₁ is normalized to get q₁. Then a₂ has its projection onto q₁ removed, and the remainder is normalized to get q₂.

Step 0: Original vectors a1, a2

a₂ angle30°

Let's verify with numbers. Take a₁ = [2, 0] and a₂ = [1, 1].

Step 1: ||a₁|| = 2, so q₁ = [1, 0].

Step 2: ⟨a₂, q₁⟩ = 1·1 + 1·0 = 1. u₂ = [1,1] − 1·[1,0] = [0, 1]. ||u₂|| = 1, so q₂ = [0, 1].

Check: ⟨q₁, q₂⟩ = 1·0 + 0·1 = 0. Orthogonal. Both have unit length. Done.

python
import numpy as np

def gram_schmidt(A):
    """Orthonormalize columns of A."""
    Q = np.zeros_like(A, dtype=float)
    for k in range(A.shape[1]):
        q = A[:, k].astype(float)
        for j in range(k):
            q -= np.dot(Q[:, j], q) * Q[:, j]
        Q[:, k] = q / np.linalg.norm(q)
    return Q

A = np.array([[2, 1], [0, 1]])
Q = gram_schmidt(A)
print(Q)  # [[1, 0], [0, 1]]
print(Q.T @ Q)  # identity matrix

Gram-Schmidt is the QR decomposition. The process implicitly factors A = QR, where Q holds the orthonormal vectors and R is upper-triangular (the coefficients from the projection subtractions). QR decomposition is the standard numerical method for solving least-squares problems. It is more stable than forming A^TA and inverting.

Looking ahead: The geometric ideas from this chapter — norms, inner products, projections, orthogonality — are the foundation for everything in Part II. Linear regression (Ch 9) is projection onto the column space. PCA (Ch 10) finds the orthogonal directions of maximum variance. GMMs (Ch 11) use Mahalanobis distance to measure cluster membership. The language of analytic geometry runs through all of ML.

"The introduction of numbers as coordinates is an act of violence."
— Hermann Weyl

Check: What does each step of Gram-Schmidt do to the next input vector?

Subtracts its projections onto all previously found orthonormal vectors, then normalizes Rotates it by 90 degrees Scales it to match the length of the previous vector