Taking matrices apart to understand what they really do.
You have a 1000×1000 matrix. It describes how a neural network layer transforms its inputs. You need to understand what it does — which directions it stretches, which it crushes, whether it rotates. Staring at a million numbers tells you nothing.
Matrix decompositions solve this. They break a matrix into simpler pieces — rotations, scalings, projections — that reveal the geometry hidden inside. Think of it like factoring a number: 60 is just a number, but 2 × 2 × 3 × 5 tells you its structure.
This chapter covers three landmark decompositions, each revealing something different:
But before we can decompose, we need three tools: determinants, traces, and characteristic polynomials. These are the keys that unlock eigenvalues, and eigenvalues are the atoms of every decomposition.
When a matrix transforms space, it changes areas and volumes. The determinant measures exactly this: by what factor does the matrix scale areas (in 2D) or volumes (in 3D)?
For a 2×2 matrix, the formula is beautifully simple:
Let's derive this. Take the unit square with corners at (0,0), (1,0), (0,1), (1,1). The matrix maps these to (0,0), (a,c), (b,d), (a+b, c+d). The area of the resulting parallelogram is the cross product of its two edge vectors: a·d − b·c. That's the determinant.
Drag the sliders to change the 2×2 matrix. The unit square (teal) transforms into the orange parallelogram. The determinant equals the signed area of that parallelogram.
Key properties of the determinant:
| Property | Meaning |
|---|---|
| det(A) > 0 | Orientation preserved (no reflection) |
| det(A) < 0 | Orientation flipped (reflection) |
| det(A) = 0 | Matrix is singular — collapses a dimension |
| det(AB) = det(A) · det(B) | Scaling factors multiply |
| det(A−1) = 1/det(A) | Inverse undoes the scaling |
For larger matrices, the determinant is computed recursively via cofactor expansion. For a 3×3 matrix, you expand along the first row: det(A) = a11C11 + a12C12 + a13C13, where each cofactor Cij is (−1)i+j times the determinant of the (n−1)×(n−1) submatrix obtained by deleting row i and column j. In practice, we use LU decomposition to compute determinants efficiently in O(n3).
The trace of a square matrix is embarrassingly simple: just add up the diagonal entries.
That's it. No fancy operations, just a sum. But this humble number encodes something deep: it's the sum of the eigenvalues. We haven't defined eigenvalues yet, but keep this fact loaded — it will pay off shortly.
Why does the trace matter? Two reasons. First, it's cyclic:
You can rotate the matrices inside the trace without changing the result. You cannot, however, rearrange them arbitrarily: tr(ABC) ≠ tr(BAC) in general. Only cyclic permutations are allowed.
Second, the trace gives you the Frobenius norm — a measure of a matrix's total "size":
This appears constantly in ML loss functions. When you write ‖Y − Xθ‖2 in linear regression, the Frobenius norm is lurking underneath.
| Property | Formula |
|---|---|
| Sum of eigenvalues | tr(A) = λ1 + λ2 + … + λn |
| Product of eigenvalues | det(A) = λ1 · λ2 · … · λn |
Together, the trace and determinant give you two summaries of a matrix's eigenvalues: the sum and the product. For a 2×2 matrix, these two numbers completely determine the eigenvalues (we'll see why next chapter).
We want to find the special directions that a matrix simply stretches without rotating. That means finding vectors x and scalars λ where Ax = λx. How do we find them?
Rewrite the equation: Ax − λx = 0, which gives (A − λI)x = 0. This system has a non-trivial solution (x ≠ 0) only when the matrix (A − λI) is singular — meaning its determinant is zero:
This equation is the characteristic polynomial. For an n×n matrix, it's a polynomial of degree n in λ. Its roots are the eigenvalues.
Let's work through a 2×2 example. Take A = [[4, 2], [1, 3]].
Factor: (λ − 5)(λ − 2) = 0. The eigenvalues are λ1 = 5 and λ2 = 2.
For the general 2×2 matrix [[a, b], [c, d]], the characteristic polynomial is:
Which is λ2 − tr(A)λ + det(A) = 0. Apply the quadratic formula:
The discriminant tr(A)2 − 4 det(A) determines whether the eigenvalues are real and distinct (positive), real and equal (zero), or complex conjugates (negative). In ML, we usually work with real symmetric matrices where eigenvalues are always real.
Now the payoff. An eigenvalue λ of a matrix A is a scalar such that there exists a non-zero vector x satisfying:
The vector x is the eigenvector. Think of it this way: most vectors get both stretched and rotated when you multiply by A. Eigenvectors are special — they only get stretched (or flipped), by the factor λ. They are the "natural axes" of the transformation.
Let's find the eigenvectors for our example A = [[4, 2], [1, 3]] with eigenvalues λ1 = 5, λ2 = 2.
For λ1 = 5: solve (A − 5I)x = 0:
For λ2 = 2: solve (A − 2I)x = 0:
The orange arrows show the two eigenvectors of A = [[4,2],[1,3]]. Drag the blue dot to pick any input vector. Notice: only along the eigenvector directions does the output (teal) point the same way as the input.
Verification: A · [2,1]T = [4·2+2·1, 1·2+3·1]T = [10, 5]T = 5 · [2,1]T. It checks out — the matrix scales this vector by 5 without changing direction.
Eigenvalues obey elegant rules that make them powerful diagnostic tools. Here are the properties you'll use most:
| Property | Statement |
|---|---|
| Sum | tr(A) = ∑ λi |
| Product | det(A) = ∏ λi |
| Inverse | eigenvalues of A−1 are 1/λi |
| Power | eigenvalues of Ak are λik |
| Shift | eigenvalues of A + cI are λi + c |
The power rule is especially important. If you multiply a vector by A repeatedly (x, Ax, A2x, ...), the direction converges to the eigenvector with the largest eigenvalue. This is the power iteration algorithm — the simplest way to find the dominant eigenvector.
For symmetric matrices, we can strengthen the decomposition. The eigenvectors form an orthonormal basis. This means the matrix P of eigenvectors satisfies PTP = I — it's an orthogonal matrix. The decomposition becomes A = PDPT (no inverse needed, just a transpose).
What about matrices with repeated eigenvalues? If an eigenvalue λ has algebraic multiplicity k (it appears k times as a root of the characteristic polynomial), the number of linearly independent eigenvectors for λ is its geometric multiplicity — which can be less than k. When geometric < algebraic multiplicity, the matrix is defective and cannot be diagonalized. Symmetric matrices are never defective.
You have a symmetric positive definite matrix S (like a covariance matrix). You need to sample from the Gaussian N(0, S). To do that, you need a "square root" of S — a matrix L such that LLT = S. Then if z ~ N(0, I), the vector Lz has covariance S.
The Cholesky decomposition gives you exactly this. It factors S into:
where L is a lower triangular matrix with positive diagonal entries. It exists if and only if S is symmetric positive definite.
Let's compute it for S = [[4, 2], [2, 5]]. We want L = [[l11, 0], [l21, l22]] such that LLT = S:
So L = [[2, 0], [1, 2]]. Verify: LLT = [[2,0],[1,2]] · [[2,1],[0,2]] = [[4,2],[2,5]] = S.
In ML, Cholesky appears everywhere positive definite matrices do:
| Application | Why Cholesky |
|---|---|
| Gaussian sampling | z ~ N(0,I), then Lz ~ N(0, S) |
| GP regression | Solve (K + σ2I)α = y via Cholesky |
| Log-determinant | log det(S) = 2 ∑ log(Lii) |
| Positive-definite check | If Cholesky fails, S is not PD |
We've been building toward this. If A has n linearly independent eigenvectors x1, ..., xn with eigenvalues λ1, ..., λn, we can write all n equations Axi = λixi at once as:
where P = [x1 | x2 | ... | xn] (eigenvectors as columns) and D = diag(λ1, ..., λn). Rearranging:
This is the eigendecomposition. It says: to apply A, first change to the eigenvector basis (P−1), then scale each axis by its eigenvalue (D), then change back (P). The matrix is "diagonal in disguise."
For A = [[4,2],[1,3]]: click each step to see the transformation decomposed into change-of-basis, scaling, and change-back.
Not every matrix can be diagonalized. A matrix is diagonalizable if and only if it has n linearly independent eigenvectors. Matrices that fail this test (like [[0,1],[0,0]] which only has one independent eigenvector) are called defective.
Symmetric matrices are always diagonalizable, and their eigenvectors are orthogonal. The decomposition simplifies to A = QDQT where Q is an orthogonal matrix (Q−1 = QT). This is the spectral theorem — one of the most important results in linear algebra.
Eigendecomposition requires a square matrix and doesn't always exist. What about a 100×50 matrix — like a dataset with 100 samples and 50 features? Enter the Singular Value Decomposition (SVD), the most important decomposition in all of applied mathematics.
Every matrix A (any shape, m×n) can be decomposed as:
Where:
| Factor | Shape | Meaning |
|---|---|---|
| U | m×m | Orthogonal. Columns are left singular vectors (output directions) |
| Σ | m×n | Diagonal. Entries σ1 ≥ σ2 ≥ ... ≥ 0 are singular values (stretching factors) |
| VT | n×n | Orthogonal. Rows are right singular vectors (input directions) |
The SVD says: every linear transformation is a rotation (VT), followed by axis-aligned stretching (Σ), followed by another rotation (U). No matter how complex the matrix, its action reduces to rotate-stretch-rotate.
The singular values σi are always non-negative real numbers, even if A has complex entries. They measure how much the matrix stretches in each direction. The number of non-zero singular values equals the rank of A.
Connection to eigenvalues: the singular values of A are the square roots of the eigenvalues of ATA (or equivalently AAT). The right singular vectors V are the eigenvectors of ATA. The left singular vectors U are the eigenvectors of AAT.
Let's compute the SVD of a concrete matrix and watch what each factor does geometrically. Pick a matrix below (or adjust the entries), then step through the four stages of the transformation.
The unit circle (teal) is transformed by A = UΣVT. Step through each factor to see it decomposed. The final shape (orange ellipse) is the image of the unit circle under A.
Here's the algorithm to compute the SVD of a 2×2 matrix A by hand:
Let's work through A = [[3, 0], [0, 2]]. This is already diagonal, so the SVD should be simple.
ATA = [[9, 0], [0, 4]]. Eigenvalues: 9 and 4. So σ1 = 3, σ2 = 2. The eigenvectors of ATA are [1,0] and [0,1] (the standard basis), so V = I. Then u1 = (1/3)·A·[1,0] = [1,0], u2 = (1/2)·A·[0,1] = [0,1], so U = I. The SVD is A = I · diag(3,2) · I, which makes perfect sense — a diagonal matrix is already in "SVD form."
Imagine a 1000×1000 matrix with singular values [50, 20, 3, 0.1, 0.01, ...]. The first two singular values dominate. Can we approximate the matrix using just those two?
Yes. The truncated SVD keeps only the top k singular values and their associated vectors:
Each term σi ui viT is a rank-1 matrix (an outer product). The full SVD writes A as a sum of rank-1 pieces, ordered by importance (σ1 ≥ σ2 ≥ ...). Truncating after k terms gives the best rank-k approximation.
A random 6×6 matrix is decomposed via SVD. Drag the slider to keep the top k singular values. Watch the approximation quality and the Frobenius error.
The storage savings are dramatic. A full m×n matrix needs mn entries. The rank-k SVD stores Uk (m×k), Σk (k values), and Vk (n×k): total k(m + n + 1) entries. When k is much smaller than m or n, this is a massive compression.
| Application | What the SVD does |
|---|---|
| Image compression | Keep top k singular values → approximate image with k·(m+n) numbers |
| Recommender systems | Low-rank user-item matrix → latent factors |
| LoRA fine-tuning | Approximate weight updates as low-rank: ΔW = BA where B is n×k, A is k×m |
| PCA | Top k right singular vectors = principal components |
| Pseudoinverse | A+ = VΣ+UT (invert non-zero singular values) |
We've built up three decompositions, each serving a different purpose. Here's the landscape:
| Decomposition | Requires | Form | Key Use |
|---|---|---|---|
| Eigendecomposition | Square, n indep. eigenvectors | A = PDP−1 | Matrix powers, stability analysis |
| Spectral (symmetric) | Symmetric | A = QΛQT | PCA, covariance analysis |
| Cholesky | Symmetric positive definite | A = LLT | Gaussian sampling, solving SPD systems |
| SVD | Any matrix | A = UΣVT | Low-rank approx, pseudoinverse, PCA |
Where these ideas lead next:
| This Chapter | Used In |
|---|---|
| Eigendecomposition | Ch 10: PCA (eigenvectors of covariance matrix) |
| SVD / low-rank | Ch 10: PCA (SVD of centered data matrix) |
| Cholesky | Ch 9: Bayesian Regression (posterior covariance) |
| Determinant, trace | Ch 6: Probability (Gaussian normalization) |
| Positive definiteness | Ch 7: Optimization (Hessian analysis) |
numpy.linalg.eig (general), numpy.linalg.eigh (symmetric), numpy.linalg.svd, or numpy.linalg.cholesky. These use battle-tested LAPACK routines (QR iteration for eigenvalues, Householder bidiagonalization for SVD)."The purpose of computing is insight, not numbers." — Richard Hamming