Deisenroth et al., Chapter 4

Matrix Decompositions

Taking matrices apart to understand what they really do.

Prerequisites: Chapters 2–3 (linear algebra & analytic geometry). That's it.

Chapter 0: Why Decompose?

You have a 1000×1000 matrix. It describes how a neural network layer transforms its inputs. You need to understand what it does — which directions it stretches, which it crushes, whether it rotates. Staring at a million numbers tells you nothing.

Matrix decompositions solve this. They break a matrix into simpler pieces — rotations, scalings, projections — that reveal the geometry hidden inside. Think of it like factoring a number: 60 is just a number, but 2 × 2 × 3 × 5 tells you its structure.

This chapter covers three landmark decompositions, each revealing something different:

Eigendecomposition

A = PDP⁻¹ — find the directions a square matrix stretches

↓

Cholesky

A = LL^T — efficient "square root" for symmetric positive-definite matrices

↓

SVD

A = UΣV^T — works for ANY matrix, even rectangular ones

But before we can decompose, we need three tools: determinants, traces, and characteristic polynomials. These are the keys that unlock eigenvalues, and eigenvalues are the atoms of every decomposition.

Why ML cares: PCA is eigendecomposition of a covariance matrix. The SVD powers low-rank approximation (used in recommender systems, image compression, and LoRA fine-tuning). Cholesky enables efficient sampling from multivariate Gaussians. These are not abstract math — they are daily tools in modern ML.

Check: What is the main purpose of matrix decompositions?

(a) Making matrices larger for better performance
(b) Breaking a matrix into simpler factors that reveal its geometric structure
(c) Converting matrices into vectors

Chapter 1: Determinants

When a matrix transforms space, it changes areas and volumes. The determinant measures exactly this: by what factor does the matrix scale areas (in 2D) or volumes (in 3D)?

For a 2×2 matrix, the formula is beautifully simple:

det [[a, b], [c, d]] = ad − bc

Let's derive this. Take the unit square with corners at (0,0), (1,0), (0,1), (1,1). The matrix maps these to (0,0), (a,c), (b,d), (a+b, c+d). The area of the resulting parallelogram is the cross product of its two edge vectors: a·d − b·c. That's the determinant.
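To make the area argument concrete, here is a minimal sketch that computes the same number three ways: the closed-form formula, the signed area of the parallelogram, and NumPy's library routine. The helper name `det2` and the matrix entries are illustrative choices, not part of the chapter.

```python
import numpy as np

def det2(a, b, c, d):
    """Determinant of [[a, b], [c, d]] via the closed-form formula."""
    return a * d - b * c

# Illustrative matrix (any 2x2 matrix works here).
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])

# The unit square's edge vectors (1,0) and (0,1) map to the columns of A.
e1_image = A[:, 0]   # (a, c)
e2_image = A[:, 1]   # (b, d)

# Signed area of the parallelogram spanned by the two images
# (the z-component of their cross product).
signed_area = e1_image[0] * e2_image[1] - e1_image[1] * e2_image[0]

print(det2(2.0, 0.5, 0.0, 1.5))   # closed-form formula
print(signed_area)                # geometric signed area
print(np.linalg.det(A))           # library routine
```

All three values should agree up to floating-point rounding.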

Determinant as Area Scaling

Drag the sliders to change the 2×2 matrix. The unit square (teal) transforms into the orange parallelogram. The determinant equals the signed area of that parallelogram.

[Interactive figure: sliders for a, b, c, d; defaults a = 2.0, b = 0.5, c = 0.0, d = 1.5]

Key properties of the determinant:

Property	Meaning
det(A) > 0	Orientation preserved (no reflection)
det(A) < 0	Orientation flipped (reflection)
det(A) = 0	Matrix is singular — collapses a dimension
det(AB) = det(A) · det(B)	Scaling factors multiply
det(A⁻¹) = 1/det(A)	Inverse undoes the scaling

Key insight: A matrix is invertible if and only if its determinant is non-zero. When det = 0, the matrix squashes at least one direction to zero — you cannot undo that. This is why singular matrices are the enemy of linear algebra: they lose information.

For larger matrices, the determinant is defined recursively via cofactor expansion. For a 3×3 matrix, you expand along the first row: det(A) = a₁₁C₁₁ + a₁₂C₁₂ + a₁₃C₁₃, where each cofactor C_ij is (−1)^(i+j) times the determinant of the (n−1)×(n−1) submatrix obtained by deleting row i and column j (a 2×2 submatrix when n = 3). Cofactor expansion costs O(n!) operations, so in practice we use LU decomposition to compute determinants efficiently in O(n³).
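The recursion is easy to write down in code. The sketch below expands along the first row and compares the result with `numpy.linalg.det`. The function name `det_cofactor` and the test matrix are assumptions made for illustration, and the recursive version is only sensible for tiny matrices.

```python
import numpy as np

def det_cofactor(M):
    """Determinant by cofactor expansion along the first row (factorial time)."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    if n == 1:
        return M[0, 0]
    total = 0.0
    for j in range(n):
        # Minor: delete row 0 and column j.
        minor = np.delete(np.delete(M, 0, axis=0), j, axis=1)
        # Sign (-1)^(i+j) with i = 0 in zero-based indexing.
        total += (-1) ** j * M[0, j] * det_cofactor(minor)
    return total

# Illustrative 3x3 matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

print(det_cofactor(A))     # recursive definition
print(np.linalg.det(A))    # LU-based library routine
```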

Check: What does det(A) = 0 tell you about the matrix A?

(a) A is singular (not invertible): it collapses at least one dimension
(b) A is the identity matrix
(c) A has all positive entries

Chapter 2: The Trace

The trace of a square matrix is embarrassingly simple: just add up the diagonal entries.

tr(A) = ∑_{i=1}^{n} a_ii = a₁₁ + a₂₂ + … + a_nn

That's it. No fancy operations, just a sum. But this humble number encodes something deep: it's the sum of the eigenvalues. We haven't defined eigenvalues yet, but keep this fact loaded — it will pay off shortly.

Why does the trace matter? Two reasons. First, it's cyclic:

tr(ABC) = tr(BCA) = tr(CAB)

You can rotate the matrices inside the trace without changing the result. You cannot, however, rearrange them arbitrarily: tr(ABC) ≠ tr(BAC) in general. Only cyclic permutations are allowed.

Second, the trace gives you the Frobenius norm — a measure of a matrix's total "size":

‖A‖_F² = tr(A^TA) = ∑_{i,j} a_ij²

This appears constantly in ML loss functions. When you write ‖Y − Xθ‖² in linear regression, the Frobenius norm is lurking underneath.

Trace identities you'll use: tr(A + B) = tr(A) + tr(B). tr(αA) = α tr(A). tr(A^T) = tr(A). And the workhorse: tr(AB) = tr(BA), even when AB and BA have different shapes (as long as both products are defined and square).
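These identities are quick to check numerically. The following sketch uses random rectangular matrices (the shapes and the seed are arbitrary choices) to confirm the cyclic property and the Frobenius-norm identity.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 3))

# Cyclic property: the three products have different shapes, same trace.
t_abc = np.trace(A @ B @ C)   # 3x3 product
t_bca = np.trace(B @ C @ A)   # 4x4 product
t_cab = np.trace(C @ A @ B)   # 2x2 product
print(t_abc, t_bca, t_cab)

# tr(MN) = tr(NM) even though MN is 3x3 and NM is 4x4.
M = A
N = rng.standard_normal((4, 3))
t_mn = np.trace(M @ N)
t_nm = np.trace(N @ M)
print(t_mn, t_nm)

# Frobenius norm: tr(A^T A) equals the sum of squared entries.
fro_sq_trace = np.trace(A.T @ A)
fro_sq_sum = np.sum(A ** 2)
print(fro_sq_trace, fro_sq_sum, np.linalg.norm(A, "fro") ** 2)
```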

Property	Formula
Sum of eigenvalues	tr(A) = λ₁ + λ₂ + … + λ_n
Product of eigenvalues	det(A) = λ₁ · λ₂ · … · λ_n

Together, the trace and determinant give you two summaries of a matrix's eigenvalues: the sum and the product. For a 2×2 matrix, these two numbers completely determine the eigenvalues (we'll see why next chapter).

Key insight: The trace and determinant are similarity invariants. If B = P⁻¹AP (a change of basis), then tr(B) = tr(A) and det(B) = det(A). These numbers belong to the linear map, not the matrix representation. Different coordinate systems give different matrices, but the trace and determinant stay the same.
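A small numerical check of the invariance claim: the sketch below changes basis with a hypothetical invertible matrix `P` (chosen only for illustration) and compares trace and determinant before and after.

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
# Any invertible P works; this one is a hypothetical change of basis.
P = np.array([[1.0, 2.0],
              [3.0, 5.0]])

B = np.linalg.inv(P) @ A @ P   # same linear map, different coordinates

print(B)                                    # entries differ from A
print(np.trace(A), np.trace(B))             # traces agree
print(np.linalg.det(A), np.linalg.det(B))   # determinants agree
```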

Check: Which property of the trace is most useful in ML derivations?

(a) That the trace is always positive
(b) The cyclic property: tr(ABC) = tr(CAB)
(c) That the trace equals the largest entry

Chapter 3: The Characteristic Polynomial

We want to find the special directions that a matrix simply stretches without rotating. That means finding vectors x and scalars λ where Ax = λx. How do we find them?

Rewrite the equation: Ax − λx = 0, which gives (A − λI)x = 0. This system has a non-trivial solution (x ≠ 0) only when the matrix (A − λI) is singular — meaning its determinant is zero:

p(λ) = det(A − λI) = 0

Here p(λ) is the characteristic polynomial of A, and p(λ) = 0 is the characteristic equation. For an n×n matrix, p is a polynomial of degree n in λ. Its roots are the eigenvalues.

Let's work through a 2×2 example. Take A = [[4, 2], [1, 3]].

det(A − λI) = det [[4 − λ, 2], [1, 3 − λ]]

= (4 − λ)(3 − λ) − 2 · 1 = λ² − 7λ + 10

Factor: (λ − 5)(λ − 2) = 0. The eigenvalues are λ₁ = 5 and λ₂ = 2.

Sanity checks using trace and determinant: tr(A) = 4 + 3 = 7 = 5 + 2 = λ₁ + λ₂. det(A) = 4·3 − 2·1 = 10 = 5 · 2 = λ₁ · λ₂. Both check out. Always verify this after computing eigenvalues.

For the general 2×2 matrix [[a, b], [c, d]], the characteristic polynomial is:

λ² − (a + d)λ + (ad − bc) = 0

Which is λ² − tr(A)λ + det(A) = 0. Apply the quadratic formula:

λ = (tr(A) ± √(tr(A)² − 4 det(A))) / 2

The discriminant tr(A)² − 4 det(A) determines whether the eigenvalues are real and distinct (positive), real and equal (zero), or complex conjugates (negative). In ML, we usually work with real symmetric matrices where eigenvalues are always real.
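The trace/determinant formula translates directly into a few lines of code. This is a sketch for the 2×2 case only; the function name `eig_2x2` is made up for illustration, and `cmath.sqrt` is used so that a negative discriminant yields complex conjugate eigenvalues instead of an error.

```python
import cmath

def eig_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from its trace and determinant."""
    tr = a + d
    det = a * d - b * c
    disc = tr * tr - 4 * det          # discriminant of the quadratic
    root = cmath.sqrt(disc)           # complex sqrt handles disc < 0
    return (tr + root) / 2, (tr - root) / 2

# The running example: real, distinct eigenvalues.
lam1, lam2 = eig_2x2(4, 2, 1, 3)
print(lam1, lam2)

# A 90-degree rotation: negative discriminant, complex conjugate pair.
mu1, mu2 = eig_2x2(0, -1, 1, 0)
print(mu1, mu2)
```

For the running example the two values come out as 5 and 2 (with zero imaginary part), matching the factorization above.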

Beyond 2×2: For 3×3 and larger matrices, solving the characteristic polynomial analytically becomes impractical (there's no general formula for degree ≥ 5, by the Abel-Ruffini theorem). In practice, we use iterative numerical methods like the QR algorithm, which finds all eigenvalues in O(n³) time.

Check: For a 2×2 matrix with tr(A) = 6 and det(A) = 8, what are the eigenvalues?

(a) λ = 3, λ = 3
(b) λ = 4, λ = 2 (since 4+2=6 and 4·2=8)
(c) λ = 6, λ = 8

Chapter 4: Eigenvalues and Eigenvectors

Now the payoff. An eigenvalue λ of a matrix A is a scalar such that there exists a non-zero vector x satisfying:

Ax = λx

The vector x is the eigenvector. Think of it this way: most vectors get both stretched and rotated when you multiply by A. Eigenvectors are special — they only get stretched (or flipped), by the factor λ. They are the "natural axes" of the transformation.

Let's find the eigenvectors for our example A = [[4, 2], [1, 3]] with eigenvalues λ₁ = 5, λ₂ = 2.

For λ₁ = 5: solve (A − 5I)x = 0:

[[−1, 2], [1, −2]] x = 0; writing x = [s, t]^T: −s + 2t = 0 ⇒ s = 2t ⇒ x₁ = [2, 1]^T (any non-zero multiple also works)

For λ₂ = 2: solve (A − 2I)x = 0:

[[2, 2], [1, 1]] x = 0; writing x = [s, t]^T: s + t = 0 ⇒ s = −t ⇒ x₂ = [1, −1]^T (any non-zero multiple also works)

Eigenvector Visualization

The orange arrows show the two eigenvectors of A = [[4,2],[1,3]]. Drag the blue dot to pick any input vector. Notice: only along the eigenvector directions does the output (teal) point the same way as the input.

Key insight: Eigenvectors reveal the "skeleton" of a linear transformation. In the coordinate system defined by the eigenvectors, the matrix becomes diagonal — it just scales each axis independently. This is the entire idea behind eigendecomposition.

Verification: A · [2,1]^T = [4·2+2·1, 1·2+3·1]^T = [10, 5]^T = 5 · [2,1]^T. It checks out — the matrix scales this vector by 5 without changing direction.
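The same verification can be run numerically. The sketch below asks NumPy for the eigenpairs of the example matrix and checks the defining property Ax = λx for each. Note that `numpy.linalg.eig` returns unit-length eigenvectors in no guaranteed order, so they will be scaled versions of the hand-derived ones.

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # eigenvectors are the COLUMNS of eigvecs

for lam, v in zip(eigvals, eigvecs.T):
    # Defining property: A v equals lam * v.
    print(lam, v, np.allclose(A @ v, lam * v))

# Hand-derived eigenvectors from the text, checked the same way.
x1 = np.array([2.0, 1.0])
x2 = np.array([1.0, -1.0])
print(np.allclose(A @ x1, 5 * x1), np.allclose(A @ x2, 2 * x2))
```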

Check: What makes a vector an eigenvector of A?

(a) Multiplying by A only stretches it (possibly flipping), without rotating
(b) It has unit length
(c) It is a column of A

Chapter 5: Eigenvalue Properties

Eigenvalues obey elegant rules that make them powerful diagnostic tools. Here are the properties you'll use most:

Property	Statement
Sum	tr(A) = ∑ λ_i
Product	det(A) = ∏ λ_i
Inverse	eigenvalues of A⁻¹ are 1/λ_i
Power	eigenvalues of A^k are λ_i^k
Shift	eigenvalues of A + cI are λ_i + c

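The power rule is especially important. If you multiply a vector by A repeatedly (x, Ax, A²x, ...) and renormalize, the direction converges to the eigenvector whose eigenvalue is largest in absolute value, provided that eigenvalue is strictly dominant and the starting vector has a component along its eigenvector. This is the power iteration algorithm, the simplest way to find the dominant eigenvector.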
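Here is a minimal sketch of power iteration, assuming the dominant eigenvalue is real and strictly larger in magnitude than the others. The function name, iteration count, and random starting vector are illustrative choices.

```python
import numpy as np

def power_iteration(A, num_iters=100, seed=0):
    """Estimate the dominant eigenpair of A by repeated multiplication."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        y = A @ x
        x = y / np.linalg.norm(y)     # renormalize to avoid overflow
    lam = x @ A @ x                   # Rayleigh quotient of the unit vector x
    return lam, x

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
print(lam)   # estimate of the dominant eigenvalue
print(v)     # unit vector, parallel to [2, 1]
```

The error shrinks roughly like (|λ₂| / |λ₁|)^k, so convergence is slow when the two largest eigenvalues are close in magnitude.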

Symmetric matrices are special. If A = A^T, three wonderful things happen: (1) all eigenvalues are real, (2) eigenvectors for distinct eigenvalues are orthogonal, and (3) A is always diagonalizable. Since covariance matrices are symmetric, these properties underpin PCA and many other ML methods.

For symmetric matrices, we can strengthen the decomposition. The eigenvectors form an orthonormal basis. This means the matrix P of eigenvectors satisfies P^TP = I — it's an orthogonal matrix. The decomposition becomes A = PDP^T (no inverse needed, just a transpose).

What about matrices with repeated eigenvalues? If an eigenvalue λ has algebraic multiplicity k (it appears k times as a root of the characteristic polynomial), the number of linearly independent eigenvectors for λ is its geometric multiplicity — which can be less than k. When geometric < algebraic multiplicity, the matrix is defective and cannot be diagonalized. Symmetric matrices are never defective.

Positive definiteness: A symmetric matrix is positive definite if all eigenvalues are strictly positive. Equivalently, x^TAx > 0 for all non-zero x, so the quadratic form x^TAx is a bowl shape (think: loss surface). Positive definite matrices are invertible, have positive determinant, and their Cholesky decomposition exists.
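A sketch of how one might test positive definiteness in code, using the eigenvalue criterion for symmetric matrices. The helper name `is_positive_definite` and both test matrices are assumptions for illustration.

```python
import numpy as np

def is_positive_definite(S):
    """True if S is symmetric and all its eigenvalues are strictly positive."""
    S = np.asarray(S, dtype=float)
    if not np.allclose(S, S.T):
        return False
    # eigvalsh is the symmetric eigenvalue routine (real eigenvalues, ascending).
    return bool(np.all(np.linalg.eigvalsh(S) > 0))

S = np.array([[4.0, 2.0],
              [2.0, 5.0]])      # both eigenvalues positive
M = np.array([[1.0, 2.0],
              [2.0, 1.0]])      # eigenvalues 3 and -1

print(is_positive_definite(S), is_positive_definite(M))

# The quadratic form x^T M x goes negative along the bad direction.
x = np.array([1.0, -1.0])
q_S = x @ S @ x
q_M = x @ M @ x
print(q_S, q_M)
```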

Check: If A is symmetric with eigenvalues 3 and 7, what are the eigenvalues of A² + 2I?

(a) 11 and 51 (since 3²+2 = 11 and 7²+2 = 51)
(b) 5 and 9
(c) 6 and 14

Chapter 6: Cholesky Decomposition

You have a symmetric positive definite matrix S (like a covariance matrix). You need to sample from the Gaussian N(0, S). To do that, you need a "square root" of S — a matrix L such that LL^T = S. Then if z ~ N(0, I), the vector Lz has covariance S.

The Cholesky decomposition gives you exactly this. It factors S into:

S = LL^T

where L is a lower triangular matrix with positive diagonal entries. It exists if and only if S is symmetric positive definite.
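The sampling recipe that motivates this chapter can be sketched as follows. The covariance matrix, sample count, and seed are illustrative assumptions, and the empirical covariance only matches S approximately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative symmetric positive definite covariance.
S = np.array([[4.0, 2.0],
              [2.0, 5.0]])
L = np.linalg.cholesky(S)             # lower triangular, L @ L.T == S

# Draw standard normal vectors z ~ N(0, I), one per column.
Z = rng.standard_normal((2, 200_000))

# Transform: each column x = L z has covariance L L^T = S.
X = L @ Z

S_hat = np.cov(X)                     # rows are variables, columns are samples
print(S_hat)                          # close to S, up to sampling noise
```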

Let's compute it for S = [[4, 2], [2, 5]]. We want L = [[l₁₁, 0], [l₂₁, l₂₂]] such that LL^T = S:

l₁₁² = 4 ⇒ l₁₁ = 2

l₂₁ · l₁₁ = 2 ⇒ l₂₁ = 1

l₂₁² + l₂₂² = 5 ⇒ l₂₂ = √(5 − 1) = 2

So L = [[2, 0], [1, 2]]. Verify: LL^T = [[2,0],[1,2]] · [[2,1],[0,2]] = [[4,2],[2,5]] = S.
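The three equations above generalize to any size: each entry of L is determined by entries already computed. The sketch below is one straightforward row-by-row implementation, with a made-up function name, written for clarity rather than speed.

```python
import numpy as np

def cholesky_lower(S):
    """Lower-triangular L with L @ L.T == S, for symmetric positive definite S."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    L = np.zeros_like(S)
    for i in range(n):
        for j in range(i + 1):
            # Subtract the contribution of already-computed entries.
            s = S[i, j] - L[i, :j] @ L[j, :j]
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i, i] = np.sqrt(s)      # diagonal entry
            else:
                L[i, j] = s / L[j, j]     # below-diagonal entry
    return L

S = np.array([[4.0, 2.0],
              [2.0, 5.0]])
L = cholesky_lower(S)
print(L)
print(np.allclose(L, np.linalg.cholesky(S)))
```

On the worked example it reproduces L = [[2, 0], [1, 2]].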

Key insight: Cholesky is about twice as fast as a general LU decomposition because it exploits symmetry. The cost is (1/3)n³ flops. It's the standard way to solve systems Sx = b when S is positive definite: factor S = LL^T, solve Ly = b by forward substitution, then L^Tx = y by backward substitution.

In ML, Cholesky appears everywhere positive definite matrices do:

Application	Why Cholesky
Gaussian sampling	z ~ N(0,I), then Lz ~ N(0, S)
GP regression	Solve (K + σ²I)α = y via Cholesky
Log-determinant	log det(S) = 2 ∑ log(L_ii)
Positive-definite check	If Cholesky fails, S is not PD

The log-determinant trick: Computing log det(S) directly is numerically dangerous (the determinant can overflow or underflow). But after Cholesky, det(S) = det(L)² = (∏ l_ii)², so log det(S) = 2 ∑ log(l_ii). This is stable and fast, which makes it essential for evaluating Gaussian log-likelihoods.
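Both uses of the factor, solving Sx = b and computing log det(S), fit in a short sketch. The substitution helpers are hand-written here for transparency, and the right-hand side `b` is an arbitrary example. A production code would more likely call a triangular solver from a library.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L."""
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(U, y):
    """Solve U x = y for upper-triangular U."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

S = np.array([[4.0, 2.0],
              [2.0, 5.0]])
b = np.array([2.0, 7.0])              # arbitrary right-hand side

L = np.linalg.cholesky(S)
y = forward_substitution(L, b)        # L y = b
x = backward_substitution(L.T, y)     # L^T x = y, so S x = b
print(x, np.allclose(S @ x, b))

# Log-determinant from the diagonal of L.
logdet = 2.0 * np.sum(np.log(np.diag(L)))
print(logdet, np.log(np.linalg.det(S)))
```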

Check: The Cholesky decomposition S = LL^T requires S to be...

(a) Symmetric positive definite
(b) Any square matrix
(c) Diagonal

Chapter 7: Diagonalization

We've been building toward this. If A has n linearly independent eigenvectors x₁, ..., x_n with eigenvalues λ₁, ..., λ_n, we can write all n equations Ax_i = λ_ix_i at once as:

AP = PD

where P = [x₁ | x₂ | ... | x_n] (eigenvectors as columns) and D = diag(λ₁, ..., λ_n). Rearranging:

A = PDP⁻¹

This is the eigendecomposition. It says: to apply A, first change to the eigenvector basis (P⁻¹), then scale each axis by its eigenvalue (D), then change back (P). The matrix is "diagonal in disguise."

The power of diagonalization: A^k = PD^kP⁻¹. Computing D^k is trivial: just raise each diagonal entry to the k-th power. Naive repeated multiplication costs O(n³ · k); with the decomposition you pay O(n³) once, then O(n) to form D^k for any k (plus the products with P and P⁻¹ if you need A^k as an explicit matrix). This is how systems of linear recurrences (like Fibonacci) are solved in closed form.
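A sketch of both ideas: a matrix power computed through the eigendecomposition, and the Fibonacci numbers read off from powers of the step matrix [[1, 1], [1, 0]]. The function name `matrix_power_eig` is hypothetical, and floating-point rounding limits this approach to moderate exponents.

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
eigvals, P = np.linalg.eig(A)
P_inv = np.linalg.inv(P)

def matrix_power_eig(k):
    """A^k = P D^k P^{-1}; only the diagonal entries are raised to the power."""
    return P @ np.diag(eigvals ** k) @ P_inv

A10 = matrix_power_eig(10)
print(np.allclose(A10, np.linalg.matrix_power(A, 10)))

# Fibonacci: F^n = [[F(n+1), F(n)], [F(n), F(n-1)]] for F = [[1, 1], [1, 0]].
F = np.array([[1.0, 1.0],
              [1.0, 0.0]])
phi_vals, Q = np.linalg.eigh(F)       # F is symmetric, so Q is orthogonal
F10 = Q @ np.diag(phi_vals ** 10) @ Q.T
fib_10 = round(F10[0, 1])
print(fib_10)
```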

Diagonalization Step-by-Step

For A = [[4,2],[1,3]]: click each step to see the transformation decomposed into change-of-basis, scaling, and change-back.

Not every matrix can be diagonalized. A matrix is diagonalizable if and only if it has n linearly independent eigenvectors. Matrices that fail this test (like [[0,1],[0,0]] which only has one independent eigenvector) are called defective.

Symmetric matrices are always diagonalizable, and their eigenvectors are orthogonal. The decomposition simplifies to A = QDQ^T where Q is an orthogonal matrix (Q⁻¹ = Q^T). This is the spectral theorem — one of the most important results in linear algebra.

Spectral theorem: Every real symmetric matrix A can be decomposed as A = QΛQ^T, where Q is orthogonal (columns are orthonormal eigenvectors) and Λ is diagonal (eigenvalues). This is an orthogonal diagonalization — no matrix inverse needed, just a transpose.
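For symmetric input, `numpy.linalg.eigh` returns exactly the ingredients of the spectral theorem. A short check on an illustrative 2×2 matrix:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, Q = np.linalg.eigh(S)        # eigenvalues in ascending order

print(eigvals)
print(np.allclose(Q.T @ Q, np.eye(2)))               # Q is orthogonal
print(np.allclose(Q @ np.diag(eigvals) @ Q.T, S))    # S = Q Lambda Q^T
```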

Check: Why is A¹⁰⁰ easy to compute once you have A = PDP⁻¹?

(a) Because 100 is a small number
(b) Because A¹⁰⁰ = PD¹⁰⁰P⁻¹, and D¹⁰⁰ just raises each eigenvalue to the 100th power
(c) Because the eigenvectors don't change

Chapter 8: The Singular Value Decomposition

Eigendecomposition requires a square matrix and doesn't always exist. What about a 100×50 matrix, like a dataset with 100 samples and 50 features? Enter the Singular Value Decomposition (SVD), arguably the most important decomposition in applied mathematics.

Every matrix A (any shape, m×n) can be decomposed as:

A = U Σ V^T

Where:

Factor	Shape	Meaning
U	m×m	Orthogonal. Columns are left singular vectors (output directions)
Σ	m×n	Diagonal. Entries σ₁ ≥ σ₂ ≥ ... ≥ 0 are singular values (stretching factors)
V^T	n×n	Orthogonal. Rows are right singular vectors (input directions)

The SVD says: every linear transformation is a rotation (V^T), followed by axis-aligned stretching (Σ), followed by another rotation (U), where each "rotation" is an orthogonal map and so may include a reflection. No matter how complex the matrix, its action reduces to rotate-stretch-rotate.

SVD vs. eigendecomposition: Eigendecomposition says A = PDP⁻¹ (one basis, possibly non-orthogonal, only for square diagonalizable matrices). SVD says A = UΣV^T (two orthogonal bases, works for any matrix). SVD always exists and uses orthogonal matrices, making it numerically stable. When A is symmetric positive semi-definite, the SVD and eigendecomposition coincide.

The singular values σ_i are always non-negative real numbers, even if A has complex entries. They measure how much the matrix stretches in each direction. The number of non-zero singular values equals the rank of A.

Connection to eigenvalues: the singular values of A are the square roots of the eigenvalues of A^TA (AA^T has the same non-zero eigenvalues). The right singular vectors V are the eigenvectors of A^TA. The left singular vectors U are the eigenvectors of AA^T.

σ_i = √(λ_i(A^TA))
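The sketch below computes the SVD of a random rectangular matrix (shape and seed are arbitrary) and checks the claims of this chapter: the factor shapes, the reconstruction, the link to the eigenvalues of A^TA, and the rank.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))       # rectangular: m = 5, n = 3

U, s, Vt = np.linalg.svd(A)           # s holds the singular values, descending
print(U.shape, s.shape, Vt.shape)     # (5, 5) (3,) (3, 3)

# Rebuild the m x n "diagonal" Sigma and check A = U Sigma V^T.
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))

# Singular values are square roots of the eigenvalues of A^T A.
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]      # reorder to descending
sqrt_eigs = np.sqrt(np.clip(eig_AtA, 0.0, None)) # clip guards tiny negatives
print(np.allclose(s, sqrt_eigs))

# Rank = number of non-zero singular values.
rank = np.linalg.matrix_rank(A)
print(rank)
```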

Check: What makes the SVD more general than eigendecomposition?

(a) SVD is faster to compute
(b) SVD exists for any matrix (including rectangular), while eigendecomposition requires a square, diagonalizable matrix
(c) SVD produces larger matrices

Chapter 9: SVD Construction — Step by Step

Let's compute the SVD of a concrete matrix and watch what each factor does geometrically. Pick a matrix below (or adjust the entries), then step through the four stages of the transformation.

SVD Geometry: Rotate → Stretch → Rotate

The unit circle (teal) is transformed by A = UΣV^T. Step through each factor to see it decomposed. The final shape (orange ellipse) is the image of the unit circle under A.

[Interactive figure: sliders for a, b, c, d; defaults a = 2.0, b = 1.0, c = 0.0, d = 1.5; readout of σ₁ and σ₂]

Here's the algorithm to compute the SVD of a 2×2 matrix A by hand:

Step 1

Compute A^TA and find its eigenvalues λ₁ ≥ λ₂

↓

Step 2

Singular values: σ_i = √λ_i. Eigenvectors of A^TA give V

↓

Step 3

Compute u_i = (1/σ_i) A v_i for each non-zero σ_i

↓

Done

A = UΣV^T, verified by multiplication

Let's work through A = [[3, 0], [0, 2]]. This is already diagonal, so the SVD should be simple.

A^TA = [[9, 0], [0, 4]]. Eigenvalues: 9 and 4. So σ₁ = 3, σ₂ = 2. The eigenvectors of A^TA are [1,0] and [0,1] (the standard basis), so V = I. Then u₁ = (1/3)·A·[1,0] = [1,0], u₂ = (1/2)·A·[0,1] = [0,1], so U = I. The SVD is A = I · diag(3,2) · I, which makes perfect sense — a diagonal matrix is already in "SVD form."
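The three steps can be sketched in code for a less trivial matrix. This version assumes a square matrix of full rank, so every σ_i is non-zero and Step 3 applies to all columns. The function name `svd_via_eig` and the test matrix are illustrative, and this is a teaching sketch rather than how library routines compute the SVD.

```python
import numpy as np

def svd_via_eig(A):
    """SVD of a full-rank square matrix via the eigenvectors of A^T A."""
    A = np.asarray(A, dtype=float)
    # Step 1: eigen-decompose the symmetric matrix A^T A, largest first.
    lam, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    # Step 2: singular values are the square roots of those eigenvalues.
    sigma = np.sqrt(lam)
    # Step 3: u_i = (1 / sigma_i) A v_i, done for all columns at once.
    U = (A @ V) / sigma               # broadcasting divides column i by sigma_i
    return U, sigma, V.T

A = np.array([[2.0, 1.0],
              [0.0, 1.5]])
U, sigma, Vt = svd_via_eig(A)

print(sigma)
print(np.allclose(U @ np.diag(sigma) @ Vt, A))   # reconstruction
print(np.allclose(U.T @ U, np.eye(2)))           # U is orthogonal
```

The singular vectors may differ from a library's output by sign, since (u_i, v_i) and (−u_i, −v_i) give the same product.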

The geometric meaning: The unit circle in input space gets mapped to an ellipse in output space. The semi-axes of the ellipse have lengths σ₁ and σ₂, pointing in the directions u₁ and u₂. The singular values are the "radii" of the output ellipse. Large σ = important direction. Tiny σ = negligible direction. Zero σ = collapsed direction.

Check: In the SVD A = UΣV^T, the columns of V come from...

(a) The columns of A itself
(b) The eigenvectors of A^TA
(c) Random orthogonal vectors

Chapter 10: Low-Rank Approximation

Imagine a 1000×1000 matrix with singular values [50, 20, 3, 0.1, 0.01, ...]. The first two singular values dominate. Can we approximate the matrix using just those two?

Yes. The truncated SVD keeps only the top k singular values and their associated vectors:

A_k = U_k Σ_k V_k^T = ∑_{i=1}^{k} σ_i u_i v_i^T

Each term σ_i u_i v_i^T is a rank-1 matrix (an outer product). The full SVD writes A as a sum of rank-1 pieces, ordered by importance (σ₁ ≥ σ₂ ≥ ...). Truncating after k terms gives the best rank-k approximation.

The Eckart-Young theorem: Among all matrices B of rank at most k, the truncated SVD A_k minimizes ‖A − B‖_F. The error is ‖A − A_k‖_F = √(σ_{k+1}² + ... + σ_r²), where r is the rank of A. No other rank-k matrix can do better. This is optimal, period.
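A numerical check of the error formula on a random 6×6 matrix (the size, seed, and k = 2 are arbitrary choices). The sketch builds A_k both as a product of truncated factors and as a sum of rank-1 outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
U, s, Vt = np.linalg.svd(A)

def truncate(k):
    """Rank-k approximation: keep the top k singular triplets."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

k = 2
A_k = truncate(k)

# Same matrix as a sum of k rank-1 outer products sigma_i * u_i v_i^T.
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
print(np.allclose(A_k, A_k_sum))

# Frobenius error equals the root-sum-of-squares of the dropped singular values.
err = np.linalg.norm(A - A_k, "fro")
predicted = np.sqrt(np.sum(s[k:] ** 2))
print(err, predicted)
print(np.linalg.matrix_rank(A_k))
```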

Rank-k Approximation Quality

A random 6×6 matrix is decomposed via SVD. Drag the slider to keep the top k singular values. Watch the approximation quality and the Frobenius error.

The storage savings are dramatic. A full m×n matrix needs mn entries. The rank-k SVD stores U_k (m×k), Σ_k (k values), and V_k (n×k): total k(m + n + 1) entries. When k is much smaller than m or n, this is a massive compression.

Application	What the SVD does
Image compression	Keep top k singular values → approximate image with k·(m+n+1) numbers
Recommender systems	Low-rank user-item matrix → latent factors
LoRA fine-tuning	Parameterize weight updates as low-rank: ΔW = BA where B is n×k, A is k×m (the factors are learned directly rather than computed by an SVD)
PCA	Top k right singular vectors of the centered data matrix = principal components
Pseudoinverse	A⁺ = VΣ⁺U^T (invert non-zero singular values)
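The pseudoinverse row of the table can be sketched in a few lines. The matrix, the right-hand side, and the tolerance below are illustrative assumptions. Singular values under the tolerance are treated as zero and left uninverted.

```python
import numpy as np

# Illustrative tall matrix with full column rank.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is 3x2

# Invert only the singular values that are safely non-zero.
tol = 1e-10 * s.max()
s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])

A_pinv = Vt.T @ np.diag(s_inv) @ U.T               # A+ = V Sigma+ U^T
print(np.allclose(A_pinv, np.linalg.pinv(A)))

# The pseudoinverse gives the least-squares solution of A x ≈ b.
b = np.array([1.0, 2.0, 2.0])
x_pinv = A_pinv @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(x_pinv, x_lstsq)
```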

Key insight: The singular values tell you the "effective dimensionality" of your data. If a 1000-dimensional dataset has only 10 large singular values, the data essentially lives near a 10-dimensional subspace. The SVD finds that subspace optimally. This is the mathematical foundation of dimensionality reduction.

Check: What does the Eckart-Young theorem guarantee about the truncated SVD?

(a) It is the best rank-k approximation to A in the Frobenius norm
(b) It preserves all singular values of A
(c) It only works for square matrices

Chapter 11: Summary & Connections

We've built up three decompositions, each serving a different purpose. Here's the landscape:

Decomposition	Requires	Form	Key Use
Eigendecomposition	Square, n indep. eigenvectors	A = PDP⁻¹	Matrix powers, stability analysis
Spectral (symmetric)	Symmetric	A = QΛQ^T	PCA, covariance analysis
Cholesky	Symmetric positive definite	A = LL^T	Gaussian sampling, solving SPD systems
SVD	Any matrix	A = UΣV^T	Low-rank approx, pseudoinverse, PCA

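The hierarchy: SVD is the most general and always exists. For symmetric matrices, the singular values are the absolute values of the eigenvalues, and U and V agree with the eigenvectors up to a sign flip on columns whose eigenvalue is negative. For symmetric positive semi-definite matrices, SVD and eigendecomposition coincide exactly (U = V = eigenvectors, Σ = eigenvalues). For symmetric positive definite, you additionally get Cholesky. Each specialization buys you efficiency or extra structure.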

Where these ideas lead next:

This Chapter	Used In
Eigendecomposition	Ch 10: PCA (eigenvectors of covariance matrix)
SVD / low-rank	Ch 10: PCA (SVD of centered data matrix)
Cholesky	Ch 9: Bayesian Regression (posterior covariance)
Determinant, trace	Ch 6: Probability (Gaussian normalization)
Positive definiteness	Ch 7: Optimization (Hessian analysis)

Practical wisdom: In code, never compute eigenvalues by hand-coding the characteristic polynomial. Use numpy.linalg.eig (general), numpy.linalg.eigh (symmetric), numpy.linalg.svd, or numpy.linalg.cholesky. These use battle-tested LAPACK routines (QR iteration for eigenvalues, Householder bidiagonalization for SVD).

"The purpose of computing is insight, not numbers." — Richard Hamming

Check: Which decomposition works for ANY matrix, including rectangular ones?

(a) Eigendecomposition
(b) Cholesky
(c) SVD