Deisenroth et al., Chapter 4

Matrix Decompositions

Taking matrices apart to understand what they really do.

Prerequisites: Chapters 2–3 (linear algebra & analytic geometry). That's it.
12
Chapters
5+
Simulations
12
Quizzes

Chapter 0: Why Decompose?

You have a 1000×1000 matrix. It describes how a neural network layer transforms its inputs. You need to understand what it does — which directions it stretches, which it crushes, whether it rotates. Staring at a million numbers tells you nothing.

Matrix decompositions solve this. They break a matrix into simpler pieces — rotations, scalings, projections — that reveal the geometry hidden inside. Think of it like factoring a number: 60 is just a number, but 2 × 2 × 3 × 5 tells you its structure.

This chapter covers three landmark decompositions, each revealing something different:

Eigendecomposition
A = PDP−1 — find the directions a square matrix stretches
Cholesky
A = LLT — efficient "square root" for symmetric positive-definite matrices
SVD
A = UΣVT — works for ANY matrix, even rectangular ones

But before we can decompose, we need three tools: determinants, traces, and characteristic polynomials. These are the keys that unlock eigenvalues, and eigenvalues are the atoms of every decomposition.

Why ML cares: PCA is eigendecomposition of a covariance matrix. The SVD powers low-rank approximation (used in recommender systems, image compression, and LoRA fine-tuning). Cholesky enables efficient sampling from multivariate Gaussians. These are not abstract math — they are daily tools in modern ML.
Check: What is the main purpose of matrix decompositions?

Chapter 1: Determinants

When a matrix transforms space, it changes areas and volumes. The determinant measures exactly this: by what factor does the matrix scale areas (in 2D) or volumes (in 3D)?

For a 2×2 matrix, the formula is beautifully simple:

det (a bc d) = ad − bc

Let's derive this. Take the unit square with corners at (0,0), (1,0), (0,1), (1,1). The matrix maps these to (0,0), (a,c), (b,d), (a+b, c+d). The area of the resulting parallelogram is the cross product of its two edge vectors: a·d − b·c. That's the determinant.

Determinant as Area Scaling

Drag the sliders to change the 2×2 matrix. The unit square (teal) transforms into the orange parallelogram. The determinant equals the signed area of that parallelogram.

a2.0
b0.5
c0.0
d1.5

Key properties of the determinant:

PropertyMeaning
det(A) > 0Orientation preserved (no reflection)
det(A) < 0Orientation flipped (reflection)
det(A) = 0Matrix is singular — collapses a dimension
det(AB) = det(A) · det(B)Scaling factors multiply
det(A−1) = 1/det(A)Inverse undoes the scaling
Key insight: A matrix is invertible if and only if its determinant is non-zero. When det = 0, the matrix squashes at least one direction to zero — you cannot undo that. This is why singular matrices are the enemy of linear algebra: they lose information.

For larger matrices, the determinant is computed recursively via cofactor expansion. For a 3×3 matrix, you expand along the first row: det(A) = a11C11 + a12C12 + a13C13, where each cofactor Cij is (−1)i+j times the determinant of the (n−1)×(n−1) submatrix obtained by deleting row i and column j. In practice, we use LU decomposition to compute determinants efficiently in O(n3).

Check: What does det(A) = 0 tell you about the matrix A?

Chapter 2: The Trace

The trace of a square matrix is embarrassingly simple: just add up the diagonal entries.

tr(A) = ∑i=1n aii = a11 + a22 + … + ann

That's it. No fancy operations, just a sum. But this humble number encodes something deep: it's the sum of the eigenvalues. We haven't defined eigenvalues yet, but keep this fact loaded — it will pay off shortly.

Why does the trace matter? Two reasons. First, it's cyclic:

tr(ABC) = tr(BCA) = tr(CAB)

You can rotate the matrices inside the trace without changing the result. You cannot, however, rearrange them arbitrarily: tr(ABC) ≠ tr(BAC) in general. Only cyclic permutations are allowed.

Second, the trace gives you the Frobenius norm — a measure of a matrix's total "size":

‖A‖F2 = tr(ATA) = ∑i,j aij2

This appears constantly in ML loss functions. When you write ‖Y − Xθ‖2 in linear regression, the Frobenius norm is lurking underneath.

Trace identities you'll use: tr(A + B) = tr(A) + tr(B). tr(αA) = α tr(A). tr(AT) = tr(A). And the workhorse: tr(AB) = tr(BA), even when AB and BA have different shapes (as long as both products are defined and square).
PropertyFormula
Sum of eigenvaluestr(A) = λ1 + λ2 + … + λn
Product of eigenvaluesdet(A) = λ1 · λ2 · … · λn

Together, the trace and determinant give you two summaries of a matrix's eigenvalues: the sum and the product. For a 2×2 matrix, these two numbers completely determine the eigenvalues (we'll see why next chapter).

Key insight: The trace and determinant are similarity invariants. If B = P−1AP (a change of basis), then tr(B) = tr(A) and det(B) = det(A). These numbers belong to the linear map, not the matrix representation. Different coordinate systems give different matrices, but the trace and determinant stay the same.
Check: Which property of the trace is most useful in ML derivations?

Chapter 3: The Characteristic Polynomial

We want to find the special directions that a matrix simply stretches without rotating. That means finding vectors x and scalars λ where Ax = λx. How do we find them?

Rewrite the equation: Ax − λx = 0, which gives (A − λI)x = 0. This system has a non-trivial solution (x ≠ 0) only when the matrix (A − λI) is singular — meaning its determinant is zero:

p(λ) = det(A − λI) = 0

This equation is the characteristic polynomial. For an n×n matrix, it's a polynomial of degree n in λ. Its roots are the eigenvalues.

Let's work through a 2×2 example. Take A = [[4, 2], [1, 3]].

det(A − λI) = det ((4−λ) 21 (3−λ))
= (4 − λ)(3 − λ) − 2 · 1 = λ2 − 7λ + 10

Factor: (λ − 5)(λ − 2) = 0. The eigenvalues are λ1 = 5 and λ2 = 2.

Sanity checks using trace and determinant: tr(A) = 4 + 3 = 7 = 5 + 2 = λ1 + λ2. det(A) = 4·3 − 2·1 = 10 = 5 · 2 = λ1 · λ2. Both check out. Always verify this after computing eigenvalues.

For the general 2×2 matrix [[a, b], [c, d]], the characteristic polynomial is:

λ2 − (a + d)λ + (ad − bc) = 0

Which is λ2 − tr(A)λ + det(A) = 0. Apply the quadratic formula:

λ = tr(A) ± √(tr(A)2 − 4 det(A))2

The discriminant tr(A)2 − 4 det(A) determines whether the eigenvalues are real and distinct (positive), real and equal (zero), or complex conjugates (negative). In ML, we usually work with real symmetric matrices where eigenvalues are always real.

Beyond 2×2: For 3×3 and larger matrices, solving the characteristic polynomial analytically becomes impractical (there's no general formula for degree ≥ 5, by the Abel-Ruffini theorem). In practice, we use iterative numerical methods like the QR algorithm, which finds all eigenvalues in O(n3) time.
Check: For a 2×2 matrix with tr(A) = 6 and det(A) = 8, what are the eigenvalues?

Chapter 4: Eigenvalues and Eigenvectors

Now the payoff. An eigenvalue λ of a matrix A is a scalar such that there exists a non-zero vector x satisfying:

Ax = λx

The vector x is the eigenvector. Think of it this way: most vectors get both stretched and rotated when you multiply by A. Eigenvectors are special — they only get stretched (or flipped), by the factor λ. They are the "natural axes" of the transformation.

Let's find the eigenvectors for our example A = [[4, 2], [1, 3]] with eigenvalues λ1 = 5, λ2 = 2.

For λ1 = 5: solve (A − 5I)x = 0:

[−1, 2; 1, −2] x = 0  ⇒  x1 = 2x2  ⇒  x1 = [2, 1]T

For λ2 = 2: solve (A − 2I)x = 0:

[2, 2; 1, 1] x = 0  ⇒  x1 = −x2  ⇒  x2 = [1, −1]T
Eigenvector Visualization

The orange arrows show the two eigenvectors of A = [[4,2],[1,3]]. Drag the blue dot to pick any input vector. Notice: only along the eigenvector directions does the output (teal) point the same way as the input.

Key insight: Eigenvectors reveal the "skeleton" of a linear transformation. In the coordinate system defined by the eigenvectors, the matrix becomes diagonal — it just scales each axis independently. This is the entire idea behind eigendecomposition.

Verification: A · [2,1]T = [4·2+2·1, 1·2+3·1]T = [10, 5]T = 5 · [2,1]T. It checks out — the matrix scales this vector by 5 without changing direction.

Check: What makes a vector an eigenvector of A?

Chapter 5: Eigenvalue Properties

Eigenvalues obey elegant rules that make them powerful diagnostic tools. Here are the properties you'll use most:

PropertyStatement
Sumtr(A) = ∑ λi
Productdet(A) = ∏ λi
Inverseeigenvalues of A−1 are 1/λi
Powereigenvalues of Ak are λik
Shifteigenvalues of A + cI are λi + c

The power rule is especially important. If you multiply a vector by A repeatedly (x, Ax, A2x, ...), the direction converges to the eigenvector with the largest eigenvalue. This is the power iteration algorithm — the simplest way to find the dominant eigenvector.

Symmetric matrices are special. If A = AT, three wonderful things happen: (1) all eigenvalues are real, (2) eigenvectors for distinct eigenvalues are orthogonal, and (3) A is always diagonalizable. Since covariance matrices are symmetric, these properties underpin PCA and many other ML methods.

For symmetric matrices, we can strengthen the decomposition. The eigenvectors form an orthonormal basis. This means the matrix P of eigenvectors satisfies PTP = I — it's an orthogonal matrix. The decomposition becomes A = PDPT (no inverse needed, just a transpose).

What about matrices with repeated eigenvalues? If an eigenvalue λ has algebraic multiplicity k (it appears k times as a root of the characteristic polynomial), the number of linearly independent eigenvectors for λ is its geometric multiplicity — which can be less than k. When geometric < algebraic multiplicity, the matrix is defective and cannot be diagonalized. Symmetric matrices are never defective.

Positive definiteness: A symmetric matrix is positive definite if all eigenvalues are strictly positive. This means xAxT > 0 for all non-zero x — the matrix defines a bowl shape (think: loss surface). Positive definite matrices are invertible, have positive determinant, and their Cholesky decomposition exists.
Check: If A is symmetric with eigenvalues 3 and 7, what are the eigenvalues of A2 + 2I?

Chapter 6: Cholesky Decomposition

You have a symmetric positive definite matrix S (like a covariance matrix). You need to sample from the Gaussian N(0, S). To do that, you need a "square root" of S — a matrix L such that LLT = S. Then if z ~ N(0, I), the vector Lz has covariance S.

The Cholesky decomposition gives you exactly this. It factors S into:

S = LLT

where L is a lower triangular matrix with positive diagonal entries. It exists if and only if S is symmetric positive definite.

Let's compute it for S = [[4, 2], [2, 5]]. We want L = [[l11, 0], [l21, l22]] such that LLT = S:

l112 = 4 ⇒ l11 = 2
l21 · l11 = 2 ⇒ l21 = 1
l212 + l222 = 5 ⇒ l22 = √(5 − 1) = 2

So L = [[2, 0], [1, 2]]. Verify: LLT = [[2,0],[1,2]] · [[2,1],[0,2]] = [[4,2],[2,5]] = S.

Key insight: Cholesky is about twice as fast as a general LU decomposition because it exploits symmetry. The cost is (1/3)n3 flops. It's the standard way to solve systems Sx = b when S is positive definite: factor S = LLT, solve Ly = b by forward substitution, then LTx = y by backward substitution.

In ML, Cholesky appears everywhere positive definite matrices do:

ApplicationWhy Cholesky
Gaussian samplingz ~ N(0,I), then Lz ~ N(0, S)
GP regressionSolve (K + σ2I)α = y via Cholesky
Log-determinantlog det(S) = 2 ∑ log(Lii)
Positive-definite checkIf Cholesky fails, S is not PD
The log-determinant trick: Computing log det(S) directly is numerically dangerous (the determinant can overflow). But after Cholesky, det(S) = det(L)2 = (∏ lii)2, so log det(S) = 2 ∑ log(lii). This is stable and fast — essential for evaluating Gaussian log-likelihoods.
Check: The Cholesky decomposition S = LLT requires S to be...

Chapter 7: Diagonalization

We've been building toward this. If A has n linearly independent eigenvectors x1, ..., xn with eigenvalues λ1, ..., λn, we can write all n equations Axi = λixi at once as:

AP = PD

where P = [x1 | x2 | ... | xn] (eigenvectors as columns) and D = diag(λ1, ..., λn). Rearranging:

A = PDP−1

This is the eigendecomposition. It says: to apply A, first change to the eigenvector basis (P−1), then scale each axis by its eigenvalue (D), then change back (P). The matrix is "diagonal in disguise."

The power of diagonalization: Ak = PDkP−1. Computing Dk is trivial — just raise each diagonal entry to the k-th power. This turns O(n3 · k) matrix multiplication into O(n3) (one decomposition) plus O(n) per power. This is how systems of linear recurrences (like Fibonacci) are solved in closed form.

Diagonalization Step-by-Step

For A = [[4,2],[1,3]]: click each step to see the transformation decomposed into change-of-basis, scaling, and change-back.

Not every matrix can be diagonalized. A matrix is diagonalizable if and only if it has n linearly independent eigenvectors. Matrices that fail this test (like [[0,1],[0,0]] which only has one independent eigenvector) are called defective.

Symmetric matrices are always diagonalizable, and their eigenvectors are orthogonal. The decomposition simplifies to A = QDQT where Q is an orthogonal matrix (Q−1 = QT). This is the spectral theorem — one of the most important results in linear algebra.

Spectral theorem: Every real symmetric matrix A can be decomposed as A = QΛQT, where Q is orthogonal (columns are orthonormal eigenvectors) and Λ is diagonal (eigenvalues). This is an orthogonal diagonalization — no matrix inverse needed, just a transpose.
Check: Why is A100 easy to compute once you have A = PDP−1?

Chapter 8: The Singular Value Decomposition

Eigendecomposition requires a square matrix and doesn't always exist. What about a 100×50 matrix — like a dataset with 100 samples and 50 features? Enter the Singular Value Decomposition (SVD), the most important decomposition in all of applied mathematics.

Every matrix A (any shape, m×n) can be decomposed as:

A = U Σ VT

Where:

FactorShapeMeaning
Um×mOrthogonal. Columns are left singular vectors (output directions)
Σm×nDiagonal. Entries σ1 ≥ σ2 ≥ ... ≥ 0 are singular values (stretching factors)
VTn×nOrthogonal. Rows are right singular vectors (input directions)

The SVD says: every linear transformation is a rotation (VT), followed by axis-aligned stretching (Σ), followed by another rotation (U). No matter how complex the matrix, its action reduces to rotate-stretch-rotate.

SVD vs. eigendecomposition: Eigendecomposition says A = PDP−1 (one basis, possibly non-orthogonal, only for square diagonalizable matrices). SVD says A = UΣVT (two orthogonal bases, works for any matrix). SVD always exists and uses orthogonal matrices, making it numerically stable. When A is symmetric positive semi-definite, the SVD and eigendecomposition coincide.

The singular values σi are always non-negative real numbers, even if A has complex entries. They measure how much the matrix stretches in each direction. The number of non-zero singular values equals the rank of A.

Connection to eigenvalues: the singular values of A are the square roots of the eigenvalues of ATA (or equivalently AAT). The right singular vectors V are the eigenvectors of ATA. The left singular vectors U are the eigenvectors of AAT.

σi = √(λi(ATA))
Check: What makes the SVD more general than eigendecomposition?

Chapter 9: SVD Construction — Step by Step

Let's compute the SVD of a concrete matrix and watch what each factor does geometrically. Pick a matrix below (or adjust the entries), then step through the four stages of the transformation.

SVD Geometry: Rotate → Stretch → Rotate

The unit circle (teal) is transformed by A = UΣVT. Step through each factor to see it decomposed. The final shape (orange ellipse) is the image of the unit circle under A.

a2.0
b1.0
c0.0
d1.5
σ1 = ?, σ2 = ?

Here's the algorithm to compute the SVD of a 2×2 matrix A by hand:

Step 1
Compute ATA and find its eigenvalues λ1 ≥ λ2
Step 2
Singular values: σi = √λi. Eigenvectors of ATA give V
Step 3
Compute ui = (1/σi) A vi for each non-zero σi
Done
A = UΣVT, verified by multiplication

Let's work through A = [[3, 0], [0, 2]]. This is already diagonal, so the SVD should be simple.

ATA = [[9, 0], [0, 4]]. Eigenvalues: 9 and 4. So σ1 = 3, σ2 = 2. The eigenvectors of ATA are [1,0] and [0,1] (the standard basis), so V = I. Then u1 = (1/3)·A·[1,0] = [1,0], u2 = (1/2)·A·[0,1] = [0,1], so U = I. The SVD is A = I · diag(3,2) · I, which makes perfect sense — a diagonal matrix is already in "SVD form."

The geometric meaning: The unit circle in input space gets mapped to an ellipse in output space. The semi-axes of the ellipse have lengths σ1 and σ2, pointing in the directions u1 and u2. The singular values are the "radii" of the output ellipse. Large σ = important direction. Tiny σ = negligible direction. Zero σ = collapsed direction.
Check: In the SVD A = UΣVT, the columns of V come from...

Chapter 10: Low-Rank Approximation

Imagine a 1000×1000 matrix with singular values [50, 20, 3, 0.1, 0.01, ...]. The first two singular values dominate. Can we approximate the matrix using just those two?

Yes. The truncated SVD keeps only the top k singular values and their associated vectors:

Ak = Uk Σk VkT = ∑i=1k σi ui viT

Each term σi ui viT is a rank-1 matrix (an outer product). The full SVD writes A as a sum of rank-1 pieces, ordered by importance (σ1 ≥ σ2 ≥ ...). Truncating after k terms gives the best rank-k approximation.

The Eckart-Young theorem: Among all rank-k matrices, Ak minimizes ‖A − AkF. The error is √(σk+12 + ... + σn2). No other rank-k matrix can do better. This is optimal, period.
Rank-k Approximation Quality

A random 6×6 matrix is decomposed via SVD. Drag the slider to keep the top k singular values. Watch the approximation quality and the Frobenius error.

Rank k6

The storage savings are dramatic. A full m×n matrix needs mn entries. The rank-k SVD stores Uk (m×k), Σk (k values), and Vk (n×k): total k(m + n + 1) entries. When k is much smaller than m or n, this is a massive compression.

ApplicationWhat the SVD does
Image compressionKeep top k singular values → approximate image with k·(m+n) numbers
Recommender systemsLow-rank user-item matrix → latent factors
LoRA fine-tuningApproximate weight updates as low-rank: ΔW = BA where B is n×k, A is k×m
PCATop k right singular vectors = principal components
PseudoinverseA+ = VΣ+UT (invert non-zero singular values)
Key insight: The singular values tell you the "effective dimensionality" of your data. If a 1000-dimensional dataset has only 10 large singular values, the data essentially lives on a 10-dimensional surface. The SVD finds that surface optimally. This is the mathematical foundation of dimensionality reduction.
Check: What does the Eckart-Young theorem guarantee about the truncated SVD?

Chapter 11: Summary & Connections

We've built up three decompositions, each serving a different purpose. Here's the landscape:

DecompositionRequiresFormKey Use
EigendecompositionSquare, n indep. eigenvectorsA = PDP−1Matrix powers, stability analysis
Spectral (symmetric)SymmetricA = QΛQTPCA, covariance analysis
CholeskySymmetric positive definiteA = LLTGaussian sampling, solving SPD systems
SVDAny matrixA = UΣVTLow-rank approx, pseudoinverse, PCA
The hierarchy: SVD is the most general — it always exists. For symmetric matrices, SVD and eigendecomposition coincide (with U = V = eigenvectors, Σ = |eigenvalues|). For symmetric positive definite, you additionally get Cholesky. Each specialization buys you efficiency or extra structure.

Where these ideas lead next:

This ChapterUsed In
EigendecompositionCh 10: PCA (eigenvectors of covariance matrix)
SVD / low-rankCh 10: PCA (SVD of centered data matrix)
CholeskyCh 9: Bayesian Regression (posterior covariance)
Determinant, traceCh 6: Probability (Gaussian normalization)
Positive definitenessCh 7: Optimization (Hessian analysis)
Practical wisdom: In code, never compute eigenvalues by hand-coding the characteristic polynomial. Use numpy.linalg.eig (general), numpy.linalg.eigh (symmetric), numpy.linalg.svd, or numpy.linalg.cholesky. These use battle-tested LAPACK routines (QR iteration for eigenvalues, Householder bidiagonalization for SVD).

"The purpose of computing is insight, not numbers." — Richard Hamming

Check: Which decomposition works for ANY matrix, including rectangular ones?