Matrix Operations

Chapter 0: The Problem

You run a factory that produces three chemicals. Each chemical uses different amounts of three raw materials. You know the recipe (how many kilograms of each material per liter of each chemical) and you know how much raw material you have. The question: how many liters of each chemical can you produce?

Write down the constraints. Each chemical uses raw materials in fixed proportions. If you produce x₁ liters of chemical 1, it consumes a₁₁ kg of material 1, a₂₁ kg of material 2, and a₃₁ kg of material 3. The total usage of each material must equal (or not exceed) the supply.

This is a system of linear equations. If x₁, x₂, x₃ are the liters of each chemical and b₁, b₂, b₃ are your material supplies, then:

a₁₁x₁ + a₁₂x₂ + a₁₃x₃ = b₁
a₂₁x₁ + a₂₂x₂ + a₂₃x₃ = b₂
a₃₁x₁ + a₃₂x₂ + a₃₃x₃ = b₃

In matrix form: Ax = b. The matrix A holds the recipes, x holds the unknowns, and b holds the constraints. Solving this system is arguably the most important computation in all of applied mathematics.

How important? Carl Friedrich Gauss developed the method we now call Gaussian elimination in 1809 to solve a 6-equation system arising in asteroid orbit determination. In the two centuries since, the basic problem — solve Ax = b efficiently and accurately — has become the computational engine behind physics, engineering, statistics, economics, machine learning, and computer graphics. This chapter is about the modern tools for that problem.

Geometry of Two Equations

Start simple. Two equations, two unknowns: each equation defines a line in 2D. The solution is where the lines intersect. If they are parallel, there is no solution. If they are the same line, there are infinitely many. The simulation below shows this geometrically. Drag the sliders to change the coefficients and watch the intersection point move.

Two Equations, Two Unknowns

Each equation defines a line. The solution is the intersection. Change coefficients to see parallel lines (no solution) or coincident lines (infinite solutions).

a₁₁ 2.0

a₁₂ 1.0

a₂₁ 1.0

a₂₂ 3.0

Why Not Just Use Gaussian Elimination?

You learned Gaussian elimination in your first linear algebra course: subtract multiples of one row from another until the matrix is upper triangular, then back-substitute. It works, and it costs O(n³/3) operations for an n × n system. So what is the problem?

The Reuse Problem

The problem is reuse. In real applications, you rarely solve Ax = b just once. The factory gets new material shipments daily — same recipes (same A), different supplies (different b). Gaussian elimination starts from scratch each time. With 1,000 chemicals and daily shipments, that is 10⁹/3 operations every single day.

The key insight of this chapter. If we can decompose A into a product of simpler matrices — specifically triangular matrices — then solving new systems with the same A becomes cheap. We pay O(n³/3) once to factor A, then only O(n²) for each new right-hand side b. This is the idea behind LU decomposition: factor A = LU where L is lower triangular and U is upper triangular. This chapter is about how to compute that factorization, what to do when it exists, and what to do when it almost does not (numerical stability).

Scale Matters

Why do we care about O(n³) vs O(n²)? Because real-world systems are large. A structural engineering simulation for a bridge might involve n = 100,000 unknowns. Gaussian elimination from scratch costs n³/3 ≈ 3.3 × 10¹⁴ operations. At 10¹⁰ FLOPS (a fast CPU), that is 9 hours per solve. If the bridge must withstand 50 different load conditions, that is 50 × 9 = 450 hours. LU decomposition pays the 9 hours once, then each additional solve costs only n² ≈ 10¹⁰ operations — 1 second. Total: 9 hours + 49 seconds instead of 18.75 days.

n	Gaussian elim (one b)	LU factor	Each new b (triangular solves)	100 solves total
100	333K	333K	10K	1.3M vs 33.3M
1,000	333M	333M	1M	433M vs 33.3B
10,000	333B	333B	100M	343B vs 33.3T

The pattern: anytime you solve the same system with different right-hand sides, factorize once and reuse. This principle applies beyond LU — Cholesky, QR, and SVD all separate the expensive "factor" step from the cheap "solve" step.

Beyond solving Ax = b, matrix decompositions power machine learning (least-squares regression), computer graphics (transformation pipelines), physics simulations (finite elements), signal processing (Fourier via matrix factorizations), and recommendation systems (matrix factorization for Netflix). Every time you hear "linear algebra is the backbone of ML," this is the chapter that explains why.

A Quick Refresher: Gaussian Elimination

If you already know Gaussian elimination, skip ahead. If not, here is the 60-second version. To solve a 3×3 system:

2x₁ + 3x₂ + 1x₃ = 1
4x₁ + 7x₂ + 5x₃ = 3
6x₁ + 18x₂ + 22x₃ = 5

Step 1: subtract 2×(row 1) from row 2, and 3×(row 1) from row 3, to eliminate x₁ from the last two rows. Step 2: subtract the appropriate multiple of the new row 2 from row 3 to eliminate x₂. You end up with an upper triangular system where x₃ is directly solvable, then x₂, then x₁. This "zeroing out below the diagonal" process costs roughly n³/3 multiplications and additions for an n×n system.

The problem? Those elimination steps are work you throw away. Next time someone hands you a new b vector, you start over. LU decomposition saves those elimination steps as a matrix L, so you never redo them.

Concept check: You solve Ax = b using Gaussian elimination for 100 different b vectors (same A every time). The matrix A is n × n. What is the total cost?

O(n²) — elimination is only done once O(100 · n³/3) — Gaussian elimination starts from scratch for each b, repeating the O(n³/3) factorization 100 times O(n³/3 + 100·n²) — smart reuse via LU decomposition O(n log n) — matrix operations are always n log n

Chapter 1: LU Decomposition

We know the goal: factor A = LU so that solving Ax = b reduces to two cheap triangular solves. But what are L and U, and how do we find them?

Lower triangular means all entries above the diagonal are zero. Upper triangular means all entries below the diagonal are zero. The crucial property: solving a triangular system takes only O(n²) time — you just substitute one variable at a time, top to bottom (forward substitution for L) or bottom to top (back substitution for U).

The Two-Step Solve

Given A = LU and Ax = b:

Step 1: Forward Substitution

Solve Ly = b for y. Start from the top row: y₁ = b₁/l₁₁. Each subsequent y_i depends only on y values already computed. O(n²).

↓

Step 2: Back Substitution

Solve Ux = y for x. Start from the bottom row: x_n = y_n/u_nn. Work upward. O(n²).

Why is each triangular solve only O(n²)? Because the triangular structure means each equation involves one new unknown and all previously computed values. In forward substitution of an n×n lower triangular system, row i involves a dot product of length i − 1 plus one division. Summing 1 + 2 + … + (n−1) = n(n−1)/2, which is O(n²). Back substitution is symmetric.

Total for the two solves: O(n²). Compare to O(n³/3) for full Gaussian elimination. If you solve 100 systems with the same A, you do the O(n³/3) factorization once and then 100 × O(n²) triangular solves. For n = 1000, that is 3.3 × 10⁸ + 2 × 10⁸ instead of 3.3 × 10¹⁰ — a 10× speedup.

Computing L and U

LU decomposition is just Gaussian elimination with bookkeeping. When you subtract a multiple of row i from row j to create a zero below the pivot, that multiplier goes into L. The resulting upper triangular matrix is U. The diagonal of L is all ones (this is called a unit lower triangular matrix).

Let us walk through a 3×3 example by hand.

Worked example. Factor A = LU where:

A = 2 3 14 7 56 18 22

Step 1: Eliminate below a₁₁ = 2.
  Row 2 − (4/2)·Row 1 = Row 2 − 2·Row 1. Multiplier l₂₁ = 2.
  Row 3 − (6/2)·Row 1 = Row 3 − 3·Row 1. Multiplier l₃₁ = 3.
After step 1: 2 3 10 1 30 9 19

Step 2: Eliminate below the (2,2) pivot = 1.
  Row 3 − (9/1)·Row 2. Multiplier l₃₂ = 9.
After step 2: U = 2 3   10 1   30 0 -8

L = 1 0 02 1 03 9 1

Verify: LU = 2 3 14 7 56 18 22 = A ✓

LU Decomposition Step-by-Step

Watch Gaussian elimination build L and U simultaneously. Each elimination step stores the multiplier in L and zeros out an entry in the working matrix (which becomes U).

Click Step to begin LU factorization

Implementation

python
import numpy as np

def lu_decompose(A):
    """Factor A = LU (no pivoting). Returns (L, U)."""
    n = A.shape[0]
    L = np.eye(n)           # start with identity
    U = A.astype(float).copy()

    for k in range(n):
        for i in range(k + 1, n):
            # multiplier: how much of row k to subtract from row i
            L[i, k] = U[i, k] / U[k, k]
            # eliminate: row i -= multiplier * row k
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

def forward_sub(L, b):
    """Solve Ly = b for y (L is unit lower triangular)."""
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]
    return y

def back_sub(U, y):
    """Solve Ux = y for x (U is upper triangular)."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

# Example
A = np.array([[2,3,1],[4,7,5],[6,18,22]], dtype=float)
b = np.array([1, 3, 5], dtype=float)
L, U = lu_decompose(A)
y = forward_sub(L, b)
x = back_sub(U, y)
print("x =", x)            # [ 1.75  -0.5   -0.5 ]
print("Ax =", A @ x)       # [1. 3. 5.] == b ✓

In-Place Storage

A clever implementation trick: L and U can be stored in the space of the original matrix A. Since L has 1s on the diagonal (which we do not need to store) and zeros above, while U has zeros below, they fit together like a jigsaw:

Storage = u₁₁ u₁₂ u₁₃l₂₁ u₂₂ u₂₃l₃₁ l₃₂ u₃₃

// Lower triangle holds L (minus the implicit 1s on diagonal)
// Upper triangle (including diagonal) holds U
// Total: n² storage — same as the original A

This is how LAPACK (and by extension NumPy/SciPy) stores LU factors. The function scipy.linalg.lu_factor returns a single n×n array plus a permutation vector — not separate L and U matrices. This halves memory usage, which matters for large matrices.

Worked Example: Solving Ax = b via LU

Let us solve the full problem. With A = LU already computed above, solve Ax = b where b = [1, 3, 5].

Step 1: Forward substitution Ly = b.

L = 1 0 02 1 03 9 1, b = 135

y₁ = b₁ = 1
y₂ = b₂ − l₂₁y₁ = 3 − 2(1) = 1
y₃ = b₃ − l₃₁y₁ − l₃₂y₂ = 5 − 3(1) − 9(1) = −7

y = [1, 1, −7]

Step 2: Back substitution Ux = y.

U = 2 3 10 1 30 0 −8, y = 11−7

x₃ = y₃ / u₃₃ = −7 / −8 = 0.875
x₂ = (y₂ − u₂₃x₃) / u₂₂ = (1 − 3(0.875)) / 1 = −1.625
x₁ = (y₁ − u₁₂x₂ − u₁₃x₃) / u₁₁ = (1 − 3(−1.625) − 1(0.875)) / 2 = 2.5

x = [2.5, −1.625, 0.875]

Check: Ax = [2(2.5)+3(−1.625)+1(0.875), 4(2.5)+7(−1.625)+5(0.875), 6(2.5)+18(−1.625)+22(0.875)] = [1, 3, 5] = b ✓

Notice that the forward and back substitution steps are short — O(n²). The hard work was in the LU factorization. If someone now gives us b = [10, 20, 30], we reuse the same L and U and just do the two substitution steps again.

Why LU is just organized Gaussian elimination. Every row operation "Row i − m·Row k" can be represented as multiplication by an elementary lower triangular matrix. The product of all these elementary matrices (inverted) is L. The final upper triangular result is U. LU decomposition does not discover new math — it packages the same operations in a reusable form.

Concept check: After computing A = LU (cost: n³/3), what is the cost to solve Ax = b for a single new b vector?

O(n³) — you need to refactor O(n²) — forward substitution O(n²) + back substitution O(n²) O(n) — triangular systems are linear O(n² log n) — there is a log factor from the recursion

Chapter 2: Pivoting

The LU decomposition from Chapter 1 has a fatal flaw. What happens when the pivot element u_kk is zero? Division by zero. The algorithm crashes. And even if the pivot is not exactly zero, a very small pivot amplifies rounding errors catastrophically.

The Instability Trap

Consider this 2×2 system using standard floating-point arithmetic with 3 significant digits:

0.001 x₁ + x₂ = 1
x₁ + x₂ = 2

The exact solution is x₁ ≈ 1.001, x₂ ≈ 0.999. Without pivoting, we eliminate using multiplier m = 1/0.001 = 1000. Row 2 becomes: 0·x₁ + (1 − 1000)x₂ = 2 − 1000, so −999 x₂ = −998. In 3-digit arithmetic, −999 rounds to −1000 and −998 rounds to −1000, giving x₂ = 1.000 and then x₁ = 0. The computed x₁ is completely wrong — off by 100%.

The core lesson: a small pivot creates a large multiplier (m = 1/0.001 = 1000). A large multiplier means the subtraction step involves very large numbers, which when rounded in finite precision, wipe out the significant digits. This is not a theoretical concern — it happens routinely in real computations.

With partial pivoting, we swap rows so the largest element in the current column becomes the pivot. Swap rows 1 and 2:

x₁ + x₂ = 2
0.001 x₁ + x₂ = 1

Now the multiplier is m = 0.001/1 = 0.001. Row 2 becomes: 0·x₁ + 0.999 x₂ = 0.998. Even in 3-digit arithmetic: x₂ = 0.998/0.999 = 0.999, x₁ = 2 − 0.999 = 1.001. Both correct to 3 digits.

The difference is dramatic. Without pivoting: x₁ = 0 (100% error). With pivoting: x₁ = 1.001 (0.01% error). Same matrix, same arithmetic precision — the only change was which row we used as the pivot. This is why every serious numerical library (LAPACK, NumPy, MATLAB) uses partial pivoting by default. It is not optional.

The mathematical explanation: the multiplier m = b/a_pivot appears in every subsequent entry. If a_pivot is tiny, m is huge, and the product m · (other entries) has many significant digits that must be subtracted from a number of comparable magnitude. In finite precision, this subtraction causes catastrophic cancellation — the leading digits cancel and you are left with rounding noise. Large pivots keep m small, keeping the arithmetic well-behaved.

Partial pivoting is non-negotiable. In practice, nobody uses LU without pivoting. The factorization becomes PA = LU, where P is a permutation matrix that records the row swaps. P is just the identity matrix with rows reordered — multiplying by P just shuffles the rows of A. The cost of pivoting is negligible: O(n²) comparisons on top of O(n³/3) arithmetic.

The PA = LU Algorithm

At each step k, before eliminating below the pivot:

1. Find the pivot

Search column k below the diagonal for the entry with the largest absolute value.

↓

2. Swap rows

Swap row k with the pivot row — in both the working matrix and the permutation vector. Also swap the already-computed entries of L.

↓

3. Eliminate

Proceed with standard elimination. All multipliers satisfy |l_ij| ≤ 1, keeping rounding errors controlled.

Pivoting vs. No Pivoting

This simulation shows what happens when you solve a nearly-singular system with and without pivoting. The small pivot value amplifies errors without pivoting. Toggle pivoting to see the difference.

Pivot size ε 0.001

Worked Example: PA = LU with Partial Pivoting

Factor the matrix B = 1 2 34 5 67 8 0

Step 1: Column 1 — largest |value| is 7 (row 3). Swap rows 1 and 3.
perm: [2, 1, 0] Working matrix: 7 8 04 5 61 2 3

Eliminate: l₂₁ = 4/7 ≈ 0.571, l₃₁ = 1/7 ≈ 0.143
Row 2 − 0.571 · Row 1: [0, 0.429, 6]
Row 3 − 0.143 · Row 1: [0, 0.857, 3]

Step 2: Column 2 — remaining rows 2,3. Largest |value| is 0.857 (row 3). Swap rows 2 and 3.
perm: [2, 0, 1] l₃₂ = 0.429/0.857 = 0.5
Row 2 − 0.5 · Row 3: [0, 0, 4.5]

Result:
P = 0 0 11 0 00 1 0, L = 1 0 00.143 1 00.571 0.5 1, U = 7 8 00 0.857 30 0 4.5

All multipliers in L satisfy |l_ij| ≤ 1. ✓

Full vs. Partial Pivoting

Partial pivoting searches only the current column for the largest element. Complete (full) pivoting searches the entire remaining submatrix — finding the largest element in all remaining rows and columns. Complete pivoting requires column swaps too (PAQ = LU), which makes bookkeeping harder and adds O(n²) comparisons total. In practice, partial pivoting is almost always sufficient. The extra stability of complete pivoting is rarely needed, and the added cost is not justified.

There is a deep theorem behind this: with partial pivoting, all entries of L satisfy |l_ij| ≤ 1. This means that rounding errors in L are bounded. The entries of U can grow (by a factor of at most 2ⁿ⁻¹ in the worst case), but in practice the growth factor is almost always small — typically O(n^2/3) for random matrices. Over 60 years of numerical experience confirms that partial pivoting is robust for virtually all practical problems.

The Permutation Matrix P

A permutation matrix P is the identity matrix with its rows shuffled. Multiplying PA just reorders the rows of A. Key properties:

Property	Meaning
P^T = P⁻¹	The inverse of a permutation is its transpose — just undo the shuffle
det(P) = ±1	Each row swap flips the sign
PP^T = I	P is orthogonal
Storage: just an array	Store perm = [2,0,1] instead of a full matrix. Row i of PA is row perm[i] of A.

python
import numpy as np

def lu_partial_pivot(A):
    """PA = LU with partial pivoting. Returns (L, U, P)."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.zeros((n, n))
    P = np.eye(n)

    for k in range(n):
        # Find pivot: row with largest |value| in column k
        pivot_row = k + np.argmax(np.abs(U[k:, k]))

        # Swap rows in U, L, and P
        if pivot_row != k:
            U[[k, pivot_row]] = U[[pivot_row, k]]
            P[[k, pivot_row]] = P[[pivot_row, k]]
            if k > 0:
                L[[k, pivot_row], :k] = L[[pivot_row, k], :k]

        # Eliminate below pivot
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]

    L += np.eye(n)  # add 1s on diagonal
    return L, U, P

# Solve PA = LU => Ax = b => PAx = Pb => LUx = Pb
A = np.array([[0.001, 1], [1, 1]])
b = np.array([1.0, 2.0])
L, U, P = lu_partial_pivot(A)
Pb = P @ b
y = forward_sub(L, Pb)  # reuse from Ch 1
x = back_sub(U, y)
print(x)  # [1.001001  0.998999] — correct!

Concept check: In LU decomposition with partial pivoting, why do we swap the current row with the row that has the largest absolute value in the pivot column?

It makes the algorithm faster by reducing the number of operations It makes the matrix symmetric so Cholesky can be used It ensures all multipliers satisfy |l_ij| ≤ 1, which bounds the amplification of rounding errors It is needed to make the determinant positive

Chapter 3: Matrix Inversion

A matrix A is invertible (or nonsingular) if there exists a matrix A⁻¹ such that A · A⁻¹ = A⁻¹ · A = I (the identity matrix). Geometrically, if A rotates and stretches space, then A⁻¹ undoes that rotation and stretching. Not all matrices are invertible — a matrix that squishes 3D space down to a plane has lost information and cannot be undone.

The condition for invertibility: det(A) ≠ 0. Equivalently, all pivots in LU decomposition are nonzero. Equivalently, the rows (or columns) of A are linearly independent.

Computing A⁻¹ via LU

The key idea: A⁻¹ is the matrix X satisfying AX = I. But AX = I is just n linear systems, one for each column of X:

Ax₁ = e₁, Ax₂ = e₂, … Ax_n = e_n

where e_j is the j-th column of the identity matrix (all zeros except a 1 in position j). We already know how to solve Ax = b efficiently using LU. So: compute PA = LU once (cost: n³/3), then solve n systems (each costs n²). Total: n³/3 + n · n² = O(n³).

A practical warning. In numerical computing, you almost never need A⁻¹ explicitly. If your goal is to compute A⁻¹b, just solve Ax = b using LU — it is faster (n³/3 + n² vs n³) and more numerically stable. Computing A⁻¹ and then multiplying by b squares the condition number of the problem. The only time you need the full inverse is when you need many entries of A⁻¹ itself (e.g., covariance matrices in statistics).

Matrix Inversion via LU

Watch the algorithm solve AX = I column by column. Each column of the inverse is obtained by solving Ax = e_j using forward+back substitution on the precomputed LU factorization.

Click to solve column by column

python
def invert_via_lu(A):
    """Compute A^{-1} using PA = LU decomposition."""
    n = A.shape[0]
    L, U, P = lu_partial_pivot(A)
    A_inv = np.zeros((n, n))

    for j in range(n):
        # Solve Ax_j = e_j  =>  LUx_j = P e_j
        e_j = np.zeros(n)
        e_j[j] = 1.0
        Pe = P @ e_j
        y = forward_sub(L, Pe)
        A_inv[:, j] = back_sub(U, y)

    return A_inv

A = np.array([[2,3,1],[4,7,5],[6,18,22]], dtype=float)
A_inv = invert_via_lu(A)
print(A @ A_inv)  # ≈ identity matrix

Worked Example: Inverting a 3×3 Matrix via LU

Using our familiar A = 2 3 14 7 56 18 22 with L and U from Chapter 1:

Column 1: Solve Ax₁ = e₁ = [1, 0, 0].
Forward sub Ly = [1, 0, 0]: y = [1, −2, −2+18] = [1, −2, −15]
Wait — let us be careful:
y₁ = 1
y₂ = 0 − 2(1) = −2
y₃ = 0 − 3(1) − 9(−2) = −3 + 18 = 15
Back sub Ux = [1, −2, 15]:
x₃ = 15/(−8) = −1.875
x₂ = (−2 − 3(−1.875))/1 = 3.625
x₁ = (1 − 3(3.625) − 1(−1.875))/2 = (1 − 10.875 + 1.875)/2 = −4
Column 1 of A⁻¹: [−4, 3.625, −1.875]

Column 2: Solve Ax₂ = e₂ = [0, 1, 0].
y₁ = 0, y₂ = 1 − 2(0) = 1, y₃ = 0 − 3(0) − 9(1) = −9
x₃ = −9/(−8) = 1.125
x₂ = (1 − 3(1.125))/1 = −2.375
x₁ = (0 − 3(−2.375) − 1(1.125))/2 = (7.125 − 1.125)/2 = 3
Column 2: [3, −2.375, 1.125]

Column 3: Solve Ax₃ = e₃ = [0, 0, 1].
y₁ = 0, y₂ = 0 − 2(0) = 0, y₃ = 1 − 0 − 0 = 1
x₃ = 1/(−8) = −0.125
x₂ = (0 − 3(−0.125))/1 = 0.375
x₁ = (0 − 3(0.375) − 1(−0.125))/2 = (−1.125 + 0.125)/2 = −0.5
Column 3: [−0.5, 0.375, −0.125]

Assembling: A⁻¹ = −4 3 −0.53.625 −2.375 0.375−1.875 1.125 −0.125

You can verify: A · A⁻¹ = I (try the first row: 2(−4)+3(3.625)+1(−1.875) = −8+10.875−1.875 = 1 ✓).

When Do You Actually Need A⁻¹?

Almost never for solving Ax = b. But there are genuine use cases:

Application	Why A⁻¹ is Needed
Kalman filter covariance update	The posterior covariance P = (I − KH)P_prior requires the Kalman gain K, which involves S⁻¹
Fisher information matrix	The Cramér-Rao lower bound uses I(θ)⁻¹, the inverse Fisher information
Covariance matrix in regression	Var(β̂) = σ²(X^TX)⁻¹ — you need the diagonal entries to get confidence intervals
Preconditioning	Using M⁻¹ as a preconditioner for iterative solvers (but you approximate, not compute exactly)

Matrix Inversion and Strassen

The Determinant for Free

Once you have A = LU (or PA = LU), computing the determinant is trivial:

det(A) = det(P)⁻¹ · det(L) · det(U) = (±1) · 1 · ∏ u_ii

The determinant of a triangular matrix is the product of its diagonal entries. L has all 1s on the diagonal, so det(L) = 1. U has the pivots on its diagonal. The sign depends on whether P involves an even or odd number of row swaps.

For our example: det(A) = det(U) = 2 · 1 · (−8) = −16. You can verify with the Sarrus rule for 3×3 matrices: 2(7·22 − 5·18) − 3(4·22 − 5·6) + 1(4·18 − 7·6) = 2(154−90) − 3(88−30) + 1(72−42) = 128 − 174 + 30 = −16 ✓.

This is how all serious software computes determinants. The naïve cofactor expansion has cost O(n!), which is utterly impractical for n > 15. LU gives det in O(n³/3) — the same cost as solving a system.

A beautiful result from CLRS: matrix inversion, matrix multiplication, and LU decomposition are all computationally equivalent up to constant factors. If you can multiply n×n matrices in O(n^ω) time (where ω < 3 for Strassen and beyond), then you can also invert and factor in O(n^ω) time. Currently the best known ω ≈ 2.371, so theoretically all three can be done in O(n^2.371). In practice, the constants are huge, so O(n³) LU is used for matrices smaller than about 10,000 × 10,000.

The proof of equivalence is elegant. To show "multiply ≤ invert": given two n×n matrices A, B, build the 3n×3n matrix M = I A 00 I B0 0 I. Its inverse has AB in the (1,3) block. So one matrix inversion gives us a matrix product. The other direction ("invert ≤ multiply") uses a divide-and-conquer argument, partitioning the matrix into blocks and recursing. This mutual reduction proves they have the same asymptotic complexity.

A deep question. What is the true value of ω? We know 2 ≤ ω ≤ 2.371 (you need at least n² time to read the input). Nobody knows if ω = 2 is achievable. This is one of the major open problems in theoretical computer science. If ω = 2, then matrix inversion, solving linear systems, and computing determinants would all be essentially O(n²) — barely more than reading the matrix.

Concept check: To compute A⁻¹b, which approach is better and why?

Solve Ax = b using LU — costs O(n³/3 + n²), avoids computing the full inverse, and is more numerically stable Compute A⁻¹ then multiply A⁻¹b — gives you the inverse for later reuse Use Cramer's rule — it gives an exact formula for each x_i They are exactly the same cost

Chapter 4: Symmetric Positive Definite & Cholesky

A special class of matrices appears so frequently in science and engineering that it deserves its own decomposition. A matrix A is symmetric positive definite (SPD) if:

Symmetric

A = A^T. The matrix equals its transpose. Entry (i,j) equals entry (j,i).

Positive Definite

x^TAx > 0 for every nonzero vector x. Geometrically, the quadratic form x^TAx defines a bowl that opens upward — every direction from the origin goes "up."

Where do SPD matrices come from? Everywhere:

Field	SPD Matrix	What It Represents
Statistics / ML	Covariance matrix Σ	Variance and correlation of variables
Optimization	Hessian of convex function	Curvature — how fast the function bends
Kalman Filters	State covariance P	Uncertainty in state estimate
Physics (FEM)	Stiffness matrix K	How a structure resists deformation
Graph theory	Laplacian L	Connectivity of a graph

Cholesky Decomposition

For an SPD matrix, LU decomposition can be simplified. Because A is symmetric, U turns out to be a scaled version of L^T. Specifically, A = LL^T where L is lower triangular with positive diagonal entries. This is the Cholesky decomposition.

The cost: n³/6 — half the cost of general LU. This is because symmetry means we only need to compute the lower triangle; the upper triangle is just the transpose.

Why SPD Matrices Are Special

SPD matrices have remarkable properties that make them the "nicest" matrices in linear algebra:

Property	Why It Matters
All eigenvalues are positive	Guarantees a unique solution to Ax = b
All pivots are positive	No pivoting needed — numerically stable without row swaps
All leading principal minors > 0	Equivalent test for positive definiteness (Sylvester's criterion)
x^TAx defines a bowl shape	Optimization min x^TAx has a unique global minimum
A = LL^T exists and is unique	Cholesky — the fastest, most stable factorization

The no-pivoting property is the most practical advantage. For general matrices, you must pivot to avoid numerical catastrophe. For SPD matrices, the pivots are naturally positive and well-behaved — Cholesky is backward stable without any row swaps. This is why Cholesky is the default solver in optimization (positive definite Hessians), statistics (covariance matrices), and Kalman filtering (covariance updates).

Generating SPD Matrices

A common interview/code question: how do you create an SPD matrix for testing? The standard trick: take any matrix B with full column rank, then A = B^TB is guaranteed SPD.

python
# Generate a random SPD matrix
B = np.random.randn(10, 5)  # any 10×5 matrix
A = B.T @ B                    # 5×5, guaranteed SPD

# Alternative: start from eigenvalues
Q, _ = np.linalg.qr(np.random.randn(5, 5))  # random orthogonal
eigvals = np.array([10, 5, 2, 1, 0.1])   # choose condition number
A = Q @ np.diag(eigvals) @ Q.T                # κ = 10/0.1 = 100

Cholesky = the "square root" of a matrix. Just as every positive number has a square root, every SPD matrix A has a "square root" L such that A = L · L^T. The analogy runs deep: just as √4 = 2 and 4 = 2 × 2, the Cholesky factor L satisfies A = L × L^T. You can even think of L as the matrix that "generates" A from independent noise — if z is a vector of independent standard normals, then x = Lz has covariance LL^T = A.

The Algorithm

Derive L row by row. For each row i, we know that A = LL^T, so:

a_ij = ∑_k=1^j l_ik l_jk (for j ≤ i)

Solving for the entries of L:

l_ii = √(a_ii − ∑_k=1^i-1 l_ik²) (diagonal)

l_ij = (a_ij − ∑_k=1^j-1 l_ik l_jk) / l_jj (below diagonal, j < i)

Cholesky Decomposition Step-by-Step

Watch A = LL^T being computed entry by entry. The diagonal entries require a square root (always positive for SPD). Off-diagonal entries use previously computed values.

Click Step to begin Cholesky factorization

python
def cholesky(A):
    """Cholesky decomposition A = LL^T for SPD matrix A."""
    n = A.shape[0]
    L = np.zeros((n, n))

    for i in range(n):
        for j in range(i + 1):
            s = np.sum(L[i, :j] * L[j, :j])
            if i == j:
                # Diagonal: square root
                L[i, j] = np.sqrt(A[i, i] - s)
            else:
                # Below diagonal
                L[i, j] = (A[i, j] - s) / L[j, j]
    return L

# Example: covariance matrix (always SPD)
A = np.array([[4, 2, 2],
              [2, 5, 3],
              [2, 3, 6]], dtype=float)
L = cholesky(A)
print("L =\n", L)
print("LL^T =\n", L @ L.T)  # == A ✓

# Solve Ax = b using Cholesky
b = np.array([1.0, 2.0, 3.0])
y = forward_sub(L, b)
x = back_sub(L.T, y)
print("x =", x)

Worked Example: 3×3 Cholesky by Hand

Decompose A = 4 2 22 5 32 3 6

Computing L entry by entry.

l₁₁ = √a₁₁ = √4 = 2

l₂₁ = a₂₁ / l₁₁ = 2/2 = 1
l₂₂ = √(a₂₂ − l₂₁²) = √(5 − 1) = √4 = 2

l₃₁ = a₃₁ / l₁₁ = 2/2 = 1
l₃₂ = (a₃₂ − l₃₁l₂₁) / l₂₂ = (3 − 1·1) / 2 = 1
l₃₃ = √(a₃₃ − l₃₁² − l₃₂²) = √(6 − 1 − 1) = √4 = 2

L = 2 0 01 2 01 1 2

Verify: LL^T = 4 2 22 5 32 3 6 = A ✓

Notice every diagonal entry of L is 2 — because the diagonal of A is [4, 5, 6] and the accumulated sum of squares at each step leaves exactly 4 under each square root. This is not a coincidence for this matrix but illustrates how the algorithm peels off one layer of "variance" at a time.

Cholesky as a test for positive definiteness. If the Cholesky algorithm encounters a zero or negative value under the square root, the matrix is not positive definite. This makes Cholesky a practical test: try to decompose, and if it fails, the matrix is not SPD. NumPy's np.linalg.cholesky raises LinAlgError for non-SPD inputs.

Cholesky in NumPy and SciPy

In practice, you almost never implement Cholesky yourself. The standard tools:

python
import numpy as np
from scipy import linalg

A = np.array([[4,2,2],[2,5,3],[2,3,6]], dtype=float)

# NumPy: returns lower triangular L
L = np.linalg.cholesky(A)

# SciPy: returns upper triangular by default!
U = linalg.cholesky(A)       # upper: A = U^T U
L2 = linalg.cholesky(A, lower=True)  # lower: A = L L^T

# Solve Ax = b using Cholesky (SciPy one-liner)
b = np.array([1.0, 2.0, 3.0])
x = linalg.cho_solve(linalg.cho_factor(A), b)
print("x =", x)
print("Ax =", A @ x)  # == b ✓

NumPy vs SciPy gotcha. np.linalg.cholesky returns lower triangular L. scipy.linalg.cholesky returns upper triangular by default. Always check the docs or pass lower=True explicitly.

Concept check: Cholesky decomposition costs n³/6, while general LU costs n³/3. Why is Cholesky faster?

Because symmetry means L^T = U, so we only compute the lower triangle — half the entries Because SPD matrices are always sparse Because square roots are faster than divisions Because Cholesky uses Strassen internally

Chapter 5: Least Squares

Not every system Ax = b has a solution. If you have 100 data points and want to fit a line (2 parameters), you have 100 equations and 2 unknowns — the system is overdetermined. There is no x that satisfies all 100 equations exactly. But you can find the x that comes closest.

The Least Squares Problem

Given an m×n matrix A (m > n) and a vector b ∈ R^m, find x ∈ Rⁿ that minimizes the squared residual:

minimize ||Ax − b||² = ∑_i=1^m (a_i^Tx − b_i)²

Each term (a_i^Tx − b_i)² is the squared error between the model's prediction a_i^Tx and the observed value b_i. Minimizing the sum of squared errors is the principle behind linear regression — the foundation of statistics and ML.

The Normal Equations

To minimize ||Ax − b||², take the derivative with respect to x and set it to zero:

∇_x ||Ax − b||² = 2A^T(Ax − b) = 0

A^TAx = A^Tb

These are the normal equations. The matrix A^TA is n×n (small!), symmetric, and positive semidefinite. If A has full column rank (columns are independent), then A^TA is positive definite — perfect for Cholesky. The solution pipeline:

1. Form A^TA and A^Tb

Matrix multiply: O(mn²) for A^TA, O(mn) for A^Tb.

↓

2. Cholesky factor A^TA = LL^T

Cost: O(n³/6). This is small since n << m.

↓

3. Solve LL^Tx = A^Tb

Two triangular solves: O(n²).

Total cost: O(mn²) — dominated by forming A^TA. For 10,000 data points and 10 features, that is about 10⁶ operations. Fast.

Where Does ||Ax − b||² Come From?

Why minimize the squared error rather than, say, the absolute error |Ax − b|? Three reasons: (1) The squared error is differentiable everywhere, so calculus gives us a clean closed-form solution. Absolute error has a kink at zero, making optimization harder. (2) Under Gaussian noise assumptions, minimizing squared error gives the maximum likelihood estimate — the statistically optimal estimate. (3) The resulting normal equations are a symmetric positive (semi)definite system, which we can solve efficiently with Cholesky.

If the noise is not Gaussian (e.g., heavy-tailed outliers), least squares can be fooled. A single extreme outlier pulls the fit toward it. Robust alternatives include least absolute deviation (L1 regression, solved by linear programming — Ch 29) and Huber loss (a hybrid that is quadratic for small residuals and linear for large ones).

Least squares IS linear regression. When a data scientist writes model.fit(X, y) in scikit-learn with LinearRegression(), the library computes X^TX, then X^Ty, then solves using Cholesky (or a numerically superior method — see Chapter 6). Every straight line, polynomial fit, or linear model you have ever used is a least squares problem solved by matrix decomposition.

Geometric Interpretation

The column space of A is the set of all vectors Ax for any x. When Ax = b has no solution, b is not in the column space. The least squares solution finds the point in the column space closest to b — the orthogonal projection. The residual r = b − Ax̂ is perpendicular to the column space, which is exactly what A^Tr = 0 says (the normal equations).

Think of it this way: you are standing at point b in a high-dimensional room. The column space of A is a flat surface (a subspace). The least squares solution is the shadow of b cast straight down onto that surface. The normal equations guarantee the shadow is directly below you — the residual (the vertical drop) is perpendicular to the surface.

Why "normal" equations? The word "normal" means "perpendicular." The normal equations enforce that the residual r = b − Ax is normal (perpendicular) to the column space of A. This is the projection interpretation of least squares, and it is perhaps the most beautiful idea in linear algebra.

Worked Example: Linear Regression by Hand

Fit y = c₀ + c₁t to data: (1,2), (2,3), (3,6), (4,8).

Step 1: Build A and b.

A = 1 11 21 31 4, b = 2368

Step 2: Form normal equations.
A^TA = 4 1010 30, A^Tb = 1958
(19 = 2+3+6+8, 58 = 1·2 + 2·3 + 3·6 + 4·8)

Step 3: Solve 2×2 system.
4c₀ + 10c₁ = 19
10c₀ + 30c₁ = 58

From row 1: c₀ = (19 − 10c₁)/4
Substitute: 10(19 − 10c₁)/4 + 30c₁ = 58
47.5 − 25c₁ + 30c₁ = 58
5c₁ = 10.5 ⇒ c₁ = 2.1
c₀ = (19 − 21)/4 = −0.5

Best-fit line: y = −0.5 + 2.1t. At t = 2.5 (the mean of the t's), y = 4.75 — the mean of the y's. A least-squares line always passes through the data's centroid.

Polynomial Fitting

To fit a polynomial of degree d to m data points (t₁,y₁), …, (t_m,y_m), construct the Vandermonde matrix:

A = 1 t₁ t₁² … t₁^d1 t₂ t₂² … t₂^d⋮1 t_m t_m² … t_m^d, b = y₁y₂⋮y_m

The coefficient vector x = [c₀, c₁, …, c_d] gives the polynomial p(t) = c₀ + c₁t + c₂t² + … + c_dt^d that best fits the data in the least-squares sense.

Least Squares Curve Fitting

Click to add data points. The curve updates in real time using least squares. Drag the degree slider to fit lines, parabolas, or higher polynomials. Red lines show residuals (errors).

Polynomial degree 1

Click on canvas to add points

python
def least_squares_poly(t, y, degree):
    """Fit polynomial of given degree to data (t, y)."""
    # Build Vandermonde matrix
    A = np.column_stack([t**k for k in range(degree + 1)])

    # Normal equations: (A^T A) x = A^T y
    ATA = A.T @ A
    ATy = A.T @ y

    # Solve via Cholesky
    L = cholesky(ATA)
    z = forward_sub(L, ATy)
    coeffs = back_sub(L.T, z)
    return coeffs

# Example: fit a quadratic to noisy data
t = np.linspace(0, 5, 50)
y_true = 1 + 2*t - 0.5*t**2
y_noisy = y_true + np.random.randn(50) * 0.5

coeffs = least_squares_poly(t, y_noisy, degree=2)
print("Coefficients:", coeffs)
# ≈ [1.0, 2.0, -0.5] — recovers the true polynomial

The Overfitting Trap

In the simulation above, try increasing the polynomial degree to 8 or higher with only 10 data points. The curve passes through (or very close to) every point — the residuals drop to zero. But the curve oscillates wildly between the points. This is overfitting: the polynomial is fitting the noise, not the underlying pattern.

Overfitting happens when the model has more parameters than the data can constrain. A degree-d polynomial has d+1 parameters. With m data points, you need m >> d+1 to get a reliable fit. The rule of thumb in statistics is at least 10-20 observations per parameter.

This is directly connected to the condition number. As the polynomial degree approaches the number of data points, the Vandermonde matrix becomes increasingly ill-conditioned. The normal equations A^TA become nearly singular, and tiny changes in the data produce wild swings in the coefficients. Regularization (ridge regression, from Chapter 6) controls overfitting by penalizing large coefficients.

The fundamental tradeoff in curve fitting: low-degree polynomials underfit (miss the pattern), high-degree polynomials overfit (fit the noise). The sweet spot depends on the amount of data and the complexity of the true relationship. Least squares gives you the best fit for a given model; choosing the right model is a separate problem.

Concept check: You have 1,000 data points and want to fit a degree-3 polynomial (4 coefficients). The normal equations A^TAx = A^Tb involve a matrix A^TA of size:

1000 × 1000 — same as the number of data points 4 × 4 — A is 1000×4, so A^TA is 4×4. The system is small regardless of how many data points you have 1000 × 4 — the same shape as A 3 × 3 — determined by the polynomial degree

Chapter 6: Numerical Stability

Everything we have done so far assumes exact arithmetic. Real computers use floating-point numbers with roughly 16 significant digits (64-bit doubles). Rounding errors are tiny — but they can be amplified by the structure of the matrix. Understanding when and why this happens is the difference between numerical code that works and code that silently produces garbage.

Condition Number

The condition number κ(A) measures how sensitive Ax = b is to perturbations. If you perturb b by a tiny amount δb, the solution changes by δx, and:

||δx|| / ||x|| ≤ κ(A) · ||δb|| / ||b||

If κ(A) = 10^k, then you lose roughly k digits of accuracy. With 16-digit doubles and κ = 10¹², you get only 4 correct digits.

Think of it this way: the condition number tells you how many digits of your input get "burned up" by the problem. You start with 16 digits of precision. The problem burns κ of them. You are left with 16 − log₁₀(κ) correct digits. If κ uses up all 16, you get zero correct digits — the output is pure noise.

The condition number is defined as:

κ(A) = ||A|| · ||A⁻¹|| = σ_max / σ_min

where σ_max and σ_min are the largest and smallest singular values of A. A matrix with κ ≈ 1 is well-conditioned; a matrix with κ >> 1 is ill-conditioned.

Some common condition numbers to calibrate your intuition:

κ(A)	Interpretation	Digits lost (double precision)	Example
1	Perfect — orthogonal matrix	0	Rotation, Fourier
10²	Well-conditioned	2	Most well-scaled problems
10⁶	Moderately conditioned	6	Polynomial fitting degree 8+
10¹⁰	Ill-conditioned	10	Nearly collinear features
10¹⁶	Effectively singular	16 (all)	Exact collinearity

Condition number is a property of the problem, not the algorithm. No algorithm — LU, QR, SVD, anything — can produce more accurate digits than the condition number allows. The choice of algorithm determines whether you achieve the best possible accuracy or lose additional digits unnecessarily. LU loses extra digits for least squares; QR does not.

Ill-Conditioned Systems

Two nearly-parallel lines. A tiny change in b (the right-hand side) causes the intersection point to jump wildly. The condition number controls how much the solution moves relative to the perturbation.

Condition (angle between lines) 45°

Perturbation δb 0.10

QR Decomposition

For least squares, the normal equations A^TAx = A^Tb square the condition number: κ(A^TA) = κ(A)². If A has κ = 10⁶, then A^TA has κ = 10¹², and you lose 12 digits — nearly all your precision.

The fix: QR decomposition. Factor A = QR where Q has orthonormal columns (Q^TQ = I) and R is upper triangular. Then:

||Ax − b||² = ||QRx − b||² = ||Rx − Q^Tb||²

(using the fact that orthogonal transformations preserve norms). So we just solve the triangular system Rx = Q^Tb. No squaring of the condition number. QR costs about twice as much as Cholesky for least squares (2mn² vs mn²), but it is far more numerically stable.

Worked Example: Condition Number Disaster

Consider the 2×2 matrix:

A = 1.0000 1.00001.0000 1.0001

This matrix is invertible (det = 0.0001), but its condition number is κ ≈ 40,001. Solving Ax = [2, 2.0001] gives x = [1, 1]. But solving Ax = [2, 2.0002] (perturbing b by 0.00005%) gives x = [0, 2] — a 50% change in x₁. A tiny perturbation in b caused a massive change in x. This is what ill-conditioning looks like.

Now form A^TA for least squares:

A^TA = 2.0000 2.00012.0001 2.0002

This matrix has κ ≈ 1.6 × 10⁹ — the condition number squared. With 16-digit doubles, you get about 7 correct digits. With Cholesky on A^TA, the answer might be okay but barely. With QR directly on A, you get about 12 correct digits. That is the difference between "usable" and "pushing the limits."

SVD: The Gold Standard

The singular value decomposition A = UΣV^T decomposes any matrix into a rotation U, a scaling Σ (diagonal), and another rotation V^T. It is the most stable decomposition, handles rank-deficient matrices gracefully, and directly reveals the condition number (σ_max/σ_min). The cost is O(mn²) — similar to QR. NumPy's np.linalg.lstsq uses SVD internally.

SVD is the Swiss Army knife of numerical linear algebra. It solves least squares, computes pseudoinverses (for non-square or rank-deficient matrices), performs PCA (principal component analysis), and is the basis for low-rank approximation (the best rank-k approximation of a matrix is given by truncating its SVD — the Eckart-Young theorem).

The tradeoff: SVD is about 10× more expensive than Cholesky for a well-conditioned least squares problem. Use Cholesky when you know the matrix is well-conditioned. Use QR for moderate ill-conditioning. Use SVD when you do not trust the data or when the matrix may be rank-deficient.

Practical Decision Tree for Least Squares

Is κ(A) < 10⁶?

Yes → Normal equations + Cholesky (fastest). This covers most practical ML problems with normalized features.

↓ No

Is A full rank?

Yes → QR decomposition (2× slower but preserves condition number).

↓ No (or unsure)

Use SVD

Handles rank deficiency, reveals the rank, gives you the pseudoinverse. Or add regularization (λI) to make A^TA well-conditioned and use Cholesky.

Method	Cost (least squares)	Condition amplification	Handles rank deficiency?
Normal equations + Cholesky	mn² + n³/6	κ² (squares it!)	No — crashes
QR decomposition	2mn²	κ (preserves it)	With care
SVD	2mn² + 11n³	κ (preserves it)	Yes — automatically

Regularization: Taming Ill-Conditioning

When κ(A) is large, you can improve the condition number by adding regularization. Instead of solving A^TAx = A^Tb, solve (A^TA + λI)x = A^Tb. This is ridge regression (also called Tikhonov regularization).

Why does it work? Adding λI to A^TA increases every eigenvalue by λ. If the smallest eigenvalue was σ_min², it becomes σ_min² + λ. The condition number drops from σ_max²/σ_min² to (σ_max² + λ)/(σ_min² + λ). For large λ, this approaches 1.

For example, with λ = 0.1 and a matrix that has σ_min² = 10⁻¹⁰ and σ_max² = 1, the condition number drops from 10¹⁰ to (1 + 0.1)/(10⁻¹⁰ + 0.1) ≈ 11. That is a nine-order-of-magnitude improvement in conditioning, at the cost of a small bias in the solution.

The tradeoff: regularization introduces bias. The solution is no longer the true least squares answer — it is shrunk toward zero. But the reduction in variance (from better conditioning) often more than compensates. This is the bias-variance tradeoff, and ridge regression is one of the most important tools in statistical machine learning.

Regularization in one sentence. Adding λI to A^TA is like telling the solver "I am not sure about the data, so do not let the coefficients get too large." Small λ trusts the data; large λ trusts the prior (that coefficients should be small). The optimal λ is found by cross-validation.

python
# Three ways to solve least squares in NumPy
import numpy as np

A = np.random.randn(100, 5)
b = np.random.randn(100)

# Method 1: Normal equations (fastest, least stable)
x1 = np.linalg.solve(A.T @ A, A.T @ b)

# Method 2: QR decomposition (good balance)
Q, R = np.linalg.qr(A)
x2 = np.linalg.solve(R, Q.T @ b)

# Method 3: SVD (most stable, recommended)
x3, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

# Check condition number
print("κ(A) =", np.linalg.cond(A))
print("κ(AᵀA) =", np.linalg.cond(A.T @ A))  # ≈ κ(A)²

python
# Ridge regression: regularized least squares
def ridge_regression(X, y, lam):
    """Solve (X^T X + λI) w = X^T y"""
    n = X.shape[1]
    ATA = X.T @ X + lam * np.eye(n)
    ATy = X.T @ y
    return np.linalg.solve(ATA, ATy)

# Compare condition numbers
X = np.random.randn(100, 10)
# Make it ill-conditioned by adding near-duplicate column
X[:, 9] = X[:, 0] + 1e-6 * np.random.randn(100)

print("κ(X^T X)        =", np.linalg.cond(X.T @ X))
print("κ(X^T X + 0.1I) =", np.linalg.cond(X.T @ X + 0.1*np.eye(10)))
# Condition drops by orders of magnitude!

Concept check: A matrix has condition number κ = 10⁸. You solve least squares using the normal equations (Cholesky on A^TA). Using 64-bit floating point (about 16 digits of precision), approximately how many correct digits do you expect?

0 digits — the normal equations square the condition number to 10¹⁶, which eats all 16 digits of double precision. The result is meaningless. 8 digits — you lose 8 digits from κ = 10⁸ 16 digits — doubles always give full precision 4 digits — you lose half your digits

Chapter 7: Applications

Matrix operations are not abstract mathematics — they are the computational backbone of modern technology. Every field that involves data, physics, or optimization eventually reduces to solving linear systems or computing decompositions. Here is where each technique from this chapter shows up in the real world.

Machine Learning: Linear Regression

The most direct application. Given a dataset of m samples with n features and a target variable, linear regression finds weights w that minimize ||Xw − y||². This is exactly the least squares problem from Chapter 5. Scikit-learn's LinearRegression solves the normal equations (or uses SVD for numerical stability). Every ML engineer uses this daily, even when the model is a neural network — because the loss function still bottoms out at a system of linear equations in the final layer.

Computer Graphics: Transformations

Every vertex in a 3D scene is transformed by a 4×4 matrix: rotation, translation, scaling, projection. A typical rendering pipeline applies 3-5 matrix multiplications per vertex, and a single frame has millions of vertices. GPUs are essentially massively parallel matrix multiplication engines. The model-view-projection matrix M = P · V · M multiplies three 4×4 matrices — and the inverse M⁻¹ is needed for ray casting and picking.

Physics: Finite Element Methods

Simulating how a bridge bends, how heat flows through a wall, or how an airplane wing vibrates all reduce to solving Kx = f, where K is a large sparse SPD stiffness matrix and f is the applied force. Cholesky decomposition (or iterative solvers for very large systems) is the workhorse. Modern structural simulations involve matrices with millions of rows — but they are sparse (mostly zeros), so specialized sparse Cholesky implementations handle them efficiently.

Signal Processing: Fourier and Wavelets

The Discrete Fourier Transform is a matrix-vector product y = Fx where F is the Fourier matrix. The FFT is an O(n log n) factorization of this matrix into sparse factors — essentially an LU-like decomposition that exploits the structure of F. Wavelets, cosine transforms, and Hadamard transforms are all matrix factorizations.

PageRank and Recommender Systems

Google's PageRank finds the dominant eigenvector of the web's link matrix — a matrix operation. Netflix's recommendation engine uses matrix factorization: approximate the user-item rating matrix as a product of two low-rank matrices. Both are solved by iterative methods that reduce to repeated matrix-vector products and linear solves.

Neural Networks: Backpropagation

Every layer of a neural network computes y = Wx + b (a matrix-vector product) followed by a nonlinearity. Backpropagation computes gradients by multiplying Jacobian matrices — chain rule is matrix multiplication. The reason GPUs accelerate deep learning is that training is dominated by matrix multiplications. Modern transformers (GPT, etc.) spend most of their compute on the attention matrix Q K^T/√d and the projection matrices WQ, WK, WV, WO.

Data flow for a single transformer layer (GPT-style).
Input: X ∈ R^seq×d (token embeddings, e.g. 2048 × 768)
1. Q = X W_Q, K = X W_K, V = X W_V — three matrix multiplications (2048×768 @ 768×768 each)
2. Attention = softmax(Q K^T / √d) — matrix multiply Q K^T is 2048×2048 (the bottleneck for long contexts)
3. Output = Attention · V — matrix multiply (2048×2048 @ 2048×768)
4. FFN: two more matrix multiplies (768→3072→768)

Total per layer: ~6 large matrix multiplications. GPT-3 has 96 layers. That is 576 matrix multiplies per forward pass. Training requires another set for backprop. This is why LLM training costs millions of dollars — it is all matrix operations at scale.

Kalman Filters and Robotics

The Kalman filter — the workhorse of state estimation in robotics, navigation, and aerospace — performs a matrix inversion (or LU solve) at every timestep. The innovation covariance S = H P H^T + R must be inverted to compute the Kalman gain K = P H^T S⁻¹. For a robot tracking 100 landmarks in SLAM, the state vector is 200+ dimensions, and the Cholesky decomposition of the covariance matrix dominates the computation. Efficient matrix operations are the difference between real-time and too-slow.

Economics: Input-Output Models

Leontief's input-output model describes an economy where each industry consumes outputs of other industries. The consumption matrix C says how much of industry j's output is needed per unit of industry i's output. If d is the external demand vector, the total production x must satisfy x = Cx + d, or (I − C)x = d. This is a linear system, solved by LU decomposition. Leontief won the 1973 Nobel Prize in Economics for this model — and the computational technique behind it is exactly what we learned in Chapter 1.

Cryptography and Number Theory

Lattice-based cryptography (the basis for post-quantum encryption schemes like CRYSTALS-Kyber) uses matrix operations over finite fields. Finding short vectors in a lattice — the hard problem underlying the security — is related to LU decomposition and the Gram-Schmidt process (which is QR decomposition). The LLL algorithm for lattice basis reduction is essentially iterated QR factorization.

Computational Biology

In genomics, principal component analysis (PCA) on a genotype matrix (millions of SNPs × thousands of individuals) reveals population structure. PCA is computed via SVD — a matrix decomposition. Similarly, gene expression analysis uses least squares to fit linear models relating expression levels to experimental conditions. The entire field of bioinformatics is built on the matrix operations from this chapter.

Finance: Portfolio Optimization

Markowitz mean-variance optimization finds the portfolio weights w that minimize variance w^TΣw subject to a target return. The covariance matrix Σ is SPD, so the optimal weights come from solving a linear system using Cholesky. Risk modeling, option pricing via finite differences, and credit scoring all reduce to matrix operations.

Control Systems

Designing autopilots, robot controllers, and industrial regulators requires solving Riccati equations — matrix equations of the form A^TXA − X − A^TXB(R + B^TXB)⁻¹B^TXA + Q = 0. Each step of the iterative solver involves an LU solve (for the matrix inverse) and matrix multiplications. The linear-quadratic regulator (LQR) — the foundation of optimal control — is solved by iterating this equation to convergence. Every drone, self-driving car, and rocket engine controller depends on these matrix operations running fast enough for real-time control.

The Computational Bottleneck

A recurring theme: the most expensive part of many algorithms is a matrix operation. Interior-point methods for linear programming (Ch 29) solve a system of linear equations per iteration. The Hessian solve in Newton's method for optimization is an LU or Cholesky decomposition. Even randomized algorithms like stochastic gradient descent can be seen as approximate matrix operations. Understanding the cost of these operations tells you the fundamental speed limit of many algorithms.

Matrix Operations in the Wild

See how a 2D affine transformation (rotation + scaling + translation) is represented as matrix multiplication. Drag the sliders to transform the shape. The matrix updates in real time.

Rotation θ 0°

Scale 1.0

Shear 0.0

Concept check: A GPU processes a neural network layer with a 4096 × 4096 weight matrix, applied to a batch of 256 input vectors simultaneously. What is the fundamental operation?

Sorting — the GPU sorts activations Matrix multiplication — Y = W · X where W is 4096×4096 and X is 4096×256. This is why GPUs (designed for graphics matrix ops) accelerate deep learning. LU decomposition — factoring the weight matrix each forward pass Eigenvalue computation — finding the principal components

Chapter 8: Interview Arsenal

Matrix operations appear in system design interviews (ML pipelines, recommendation engines), coding interviews (implement linear regression from scratch), and theory interviews (complexity of matrix operations). Here is your cheat sheet.

Decomposition Comparison

Decomposition	Form	Requirements	Cost	Use Case
LU	PA = LU	Square, nonsingular	n³/3	General Ax = b, determinant
Cholesky	A = LL^T	SPD	n³/6	Covariance, Kalman, optimization
QR	A = QR	Any (m≥n)	2mn²	Stable least squares, eigenvalues
SVD	A = UΣV^T	Any	2mn²+11n³	Rank, pseudoinverse, PCA, LSQ
Eigendecomposition	A = QΛQ^T	Symmetric	~10n³	PCA, spectral clustering, physics

Complexity Cheat Sheet

Operation	Naïve Cost	Best Known
Matrix multiply (n×n)	O(n³)	O(n^2.371) (Strassen family)
LU decomposition	O(n³/3)	O(n^ω) via matrix multiply reduction
Matrix inversion	O(n³)	O(n^ω) — same as multiply
Determinant	O(n³) via LU	O(n^ω)
Solve Ax=b (one b)	O(n³/3 + n²)	O(n^ω + n²)
Least squares (m×n)	O(mn²)	O(mn²)

Decomposition Cost Comparison

Compare the operation counts of different decompositions as matrix size n grows. The y-axis is flops (in millions). Notice how Cholesky is consistently 2× cheaper than LU.

Matrix size n 100

Coding Drills

Implement each of these from scratch (no numpy.linalg). They are common in ML and systems interviews.

python
# Drill 1: LU decomposition with partial pivoting
# Input: n×n matrix A
# Output: L, U, perm (permutation vector)

def lu_pivot(A):
    n = len(A)
    U = [row[:] for row in A]  # deep copy, pure Python
    L = [[0.0]*n for _ in range(n)]
    perm = list(range(n))

    for k in range(n):
        # Find pivot
        max_val, max_row = abs(U[k][k]), k
        for i in range(k+1, n):
            if abs(U[i][k]) > max_val:
                max_val, max_row = abs(U[i][k]), i
        # Swap
        U[k], U[max_row] = U[max_row], U[k]
        L[k], L[max_row] = L[max_row], L[k]
        perm[k], perm[max_row] = perm[max_row], perm[k]
        # Eliminate
        for i in range(k+1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    for i in range(n):
        L[i][i] = 1.0
    return L, U, perm

python
# Drill 2: Linear regression from scratch
# Input: X (m×n features), y (m targets)
# Output: weights w (n,)

def linear_regression(X, y):
    """Solve normal equations: (X^T X) w = X^T y"""
    XTX = [[0.0]*len(X[0]) for _ in range(len(X[0]))]
    XTy = [0.0]*len(X[0])
    m, n = len(X), len(X[0])

    # Form X^T X (n×n) and X^T y (n,)
    for i in range(n):
        for j in range(n):
            for k in range(m):
                XTX[i][j] += X[k][i] * X[k][j]
        for k in range(m):
            XTy[i] += X[k][i] * y[k]

    # Solve using LU
    L, U, perm = lu_pivot(XTX)
    Pb = [XTy[perm[i]] for i in range(n)]
    # Forward sub
    fwd = [0.0]*n
    for i in range(n):
        fwd[i] = Pb[i] - sum(L[i][k]*fwd[k] for k in range(i))
    # Back sub
    w = [0.0]*n
    for i in range(n-1, -1, -1):
        w[i] = (fwd[i] - sum(U[i][k]*w[k] for k in range(i+1,n))) / U[i][i]
    return w

python
# Drill 3: Cholesky decomposition (no numpy)
import math

def cholesky_pure(A):
    """Cholesky A = LL^T for SPD matrix. Pure Python."""
    n = len(A)
    L = [[0.0]*n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                val = A[i][i] - s
                if val <= 0:
                    raise ValueError("Not positive definite")
                L[i][j] = math.sqrt(val)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

# Test with SPD matrix
A = [[4,2,2],[2,5,3],[2,3,6]]
L = cholesky_pure(A)
# Verify: L[0] = [2, 0, 0], L[1] = [1, 2, 0], L[2] = [1, 1, 2]

python
# Drill 4: Solve triangular system (interview classic)

def solve_upper_triangular(U, b):
    """Solve Ux = b for upper triangular U. Pure Python."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(U[i][j]*x[j] for j in range(i+1, n))) / U[i][i]
    return x

# Drill 5: Compute determinant via LU
def det_via_lu(A):
    """det(A) = det(L) * det(U) * det(P).
    det(L) = 1 (unit diagonal). det(U) = product of diagonal.
    det(P) = (-1)^(number of swaps)."""
    L, U, perm = lu_pivot(A)
    # Count swaps in permutation
    n = len(perm)
    visited = [False] * n
    swaps = 0
    for i in range(n):
        if visited[i]: continue
        j, cycle_len = i, 0
        while not visited[j]:
            visited[j] = True
            j = perm[j]; cycle_len += 1
        swaps += cycle_len - 1

    det_U = 1.0
    for i in range(n):
        det_U *= U[i][i]
    return ((-1)**swaps) * det_U

Key Formulas to Memorize

The essential formulas.
• LU solve cost: n³/3 (factor) + n² (each solve)
• Cholesky: A = LL^T, cost n³/6
• Normal equations: A^TAx = A^Tb, cost O(mn²)
• Condition number: κ(A) = σ_max/σ_min = ||A|| · ||A⁻¹||
• Normal equations square condition: κ(A^TA) = κ(A)²
• Matrix multiply, inversion, LU are all O(n^ω) — computationally equivalent
• det(A) = product of LU pivots (with sign from permutation)

Common Interview Pitfalls

Pitfall	Why It is Wrong	Correct Answer
"Use A⁻¹ to solve Ax = b"	Computing A⁻¹ costs 3× more and is less numerically stable	Use LU factorization and forward/back substitution
"Cholesky works for any symmetric matrix"	A must also be positive definite; negative eigenvalues cause square root of negative	SPD required. Test by attempting Cholesky — failure means not SPD
"Normal equations are always fine for least squares"	They square the condition number; for ill-conditioned A the result is garbage	Use QR or SVD for numerically stable least squares
"Matrix inversion is O(n²)"	It requires solving n linear systems	O(n³), same asymptotic class as multiplication

Interview Questions & Answers

Q: Why does np.linalg.solve use LU internally rather than computing A⁻¹?
A: (1) Solving Ax=b via LU costs n³/3 + n²; computing A⁻¹ then multiplying costs n³ + n². (2) Computing A⁻¹ and multiplying squares the amplification of rounding errors. (3) LU can be reused for multiple b vectors without recomputing.

Q: When would you use Cholesky over LU?
A: When the matrix is symmetric positive definite (SPD). Cholesky is 2× faster (n³/6 vs n³/3), requires no pivoting (SPD guarantees positive pivots), uses half the storage (only L), and is backward stable. Common in ML (covariance matrices), optimization (Hessians), and Kalman filters.

Q: Explain the relationship between Gaussian elimination and LU decomposition.
A: They are the same algorithm. Gaussian elimination applies row operations to transform A into upper triangular form U. Each row operation is "subtract m times row k from row i." LU decomposition records these multipliers m in a lower triangular matrix L. The result: A = LU. The only difference is bookkeeping — LU stores the work so you can reuse it for multiple right-hand sides.

Q: How do you compute the determinant of a large matrix?
A: Compute PA = LU. Then det(A) = (−1)^s · ∏ u_ii, where s is the number of row swaps. Cost: O(n³/3). Never use cofactor expansion (O(n!)) or the Leibniz formula (also O(n!)) for n > 5.

Q: Your linear regression gives nonsense results. What do you check?
A: (1) Condition number of X^TX — if it is large (say >10¹⁰), the features are nearly collinear. Switch to QR or SVD. (2) Check for duplicate or linearly dependent features. (3) Add regularization (ridge regression adds λI to X^TX, which improves the condition number). (4) Normalize features to similar scales.

Complexity Genealogy

The relationships between matrix operations form a hierarchy that every computer scientist should know:

The equivalence chain.
Matrix multiply ≡ Matrix inversion ≡ LU decomposition ≡ Determinant

All four have the same asymptotic complexity O(n^ω). If you discover a faster way to do any one of them, you automatically get faster algorithms for all four. This is a deep structural result — it means these problems are fundamentally the same difficulty.

Above them: Eigenvalue decomposition is strictly harder for non-symmetric matrices (iterative, no finite algorithm in exact arithmetic).
Below them: Triangular solve is easier at O(n²). Matrix-vector product is O(n²). Vector operations are O(n).

Concept check: You need to solve Ax = b for 10,000 different b vectors with the same 500×500 matrix A. Which approach is most efficient?

Run Gaussian elimination 10,000 times Compute PA = LU once (cost: 500³/3 ≈ 4×10⁷), then do 10,000 forward+back substitutions (each 500² ≈ 2.5×10⁵). Total ≈ 2.5×10⁹. Compare to 10,000×4×10⁷ = 4×10¹¹ without reuse. Compute A⁻¹ and multiply 10,000 times Use Strassen's algorithm

Chapter 9: Connections

Matrix operations are not an isolated topic — they are the computational substrate that other algorithms build upon. Here is how Chapter 28 connects to the rest of CLRS, other micro-lessons, and the broader world.

Within CLRS

Chapter	Connection
Ch 4: Divide & Conquer	Strassen's algorithm uses D&C to multiply n×n matrices in O(n^2.807) instead of O(n³). This reduces the cost of LU, inversion, and all decompositions.
Ch 29: Linear Programming	LP solvers (simplex, interior point) solve systems of linear equations at each iteration. The bottleneck of interior-point methods is an LU or Cholesky solve per step.
Ch 26: Max Flow	The push-relabel and augmenting-path algorithms can be formulated in terms of matrix operations on the graph's incidence matrix.
Ch 15: Dynamic Programming	Matrix chain multiplication (DP) optimizes the order of matrix products. The matrices being multiplied often come from the decompositions in this chapter.

Within This Site

Lesson	Connection
Kalman Filter	The Kalman gain requires solving P H^T(H P H^T + R)⁻¹ — a matrix inversion (or LU solve) every timestep. For high-dimensional state spaces, efficient matrix operations are critical.
Transformer	Self-attention computes Q K^T / √d — a matrix multiplication. The entire model is a sequence of matrix multiplications and nonlinearities.
GPT	Training GPT involves gradient computation through matrix chains. Efficient matrix operations directly affect training time and cost.

The Decomposition Hierarchy

Here is how the decompositions relate — think of it as a decision tree:

Is A square?

Yes → LU (general), Cholesky (if SPD), Eigendecomposition (if symmetric)
No → QR, SVD (for rectangular matrices)

↓

Is A symmetric?

Yes + positive definite → Cholesky (fastest)
Yes + indefinite → LDLT (modified Cholesky, allows negative diagonal)
No → LU with pivoting

↓

Do you need stability more than speed?

Yes → QR for least squares, SVD for rank-deficient problems
No → Normal equations + Cholesky (fastest for well-conditioned data)

Limitations

This chapter covers dense matrix operations — matrices where most entries are nonzero. Many real-world matrices (graphs, finite elements, sparse data) have mostly zero entries. Sparse solvers exploit this structure for massive speedups (storing only nonzero entries, using iterative methods like conjugate gradient instead of direct factorization). For truly large systems (millions of unknowns), iterative methods replace direct decompositions entirely.

The numerical stability discussion here barely scratches the surface. A full treatment requires understanding floating-point arithmetic (IEEE 754), backward error analysis, and the nuances of different pivot strategies. See Trefethen & Bau, Numerical Linear Algebra, for the definitive treatment.

What We Did Not Cover

Topic	Where to Learn It
Sparse matrices and iterative solvers	Saad, Iterative Methods for Sparse Linear Systems
Eigenvalue algorithms (QR iteration)	Trefethen & Bau, Numerical Linear Algebra, Ch 24-30
Randomized linear algebra (sketching)	Halko, Martinsson, Tropp — "Finding structure with randomness"
GPU-accelerated linear algebra	cuBLAS / cuSOLVER documentation (NVIDIA)
Block matrix algorithms	LAPACK Users' Guide — tiled factorizations for cache efficiency

The big picture. Matrix operations are to scientific computing what sorting is to data processing — a fundamental primitive that everything else builds upon. If you understand LU decomposition, you understand 80% of what LAPACK (the standard numerical linear algebra library) does. And LAPACK, written in Fortran in the 1990s, is still the engine inside NumPy, MATLAB, R, Julia, and every serious computational tool today.

Summary: The Big Ideas

If you remember nothing else from this chapter, remember these five principles:

1. Factorize once, solve many

LU/Cholesky/QR separate the expensive O(n³) factorization from the cheap O(n²) solve. Never redo work you can cache.

↓

2. Structure saves time

SPD → Cholesky (2× faster). Triangular → O(n²) solve. Sparse → specialized solvers. Always exploit structure.

↓

3. Pivot or perish

Without partial pivoting, small pivots amplify rounding errors catastrophically. Always pivot.

↓

4. Never compute A⁻¹ unless you must

Solve Ax = b directly via LU. It is faster, more stable, and uses less memory.

↓

5. Know your condition number

κ(A) tells you how many digits you can trust. If κ is large, use QR/SVD, not normal equations.

Resource	Why Read It
CLRS Chapter 28	The formal treatment with proofs of all theorems in this lesson
Trefethen & Bau, Numerical Linear Algebra	The gold standard for numerical stability, conditioning, and practical algorithms
Gilbert Strang, Linear Algebra and Its Applications	Best geometric intuition for projections, least squares, and the four fundamental subspaces
Golub & Van Loan, Matrix Computations	The encyclopedia of matrix algorithms — every decomposition, every variant, every edge case
3Blue1Brown, Essence of Linear Algebra	Beautiful geometric visualizations of matrix operations (free YouTube series)