The language every machine learning algorithm is written in.
You have three ingredients in a recipe. You know the cost of each ingredient, and you want the total cost. That is a linear combination — multiply each quantity by its price, add them up. Simple arithmetic.
Now imagine you have 1,000 ingredients, 50 recipes, and you need the cost of every recipe simultaneously. That same multiply-and-add operation, scaled up, is the beating heart of linear algebra. Matrices are the bookkeeping device that makes it tractable.
Why does this matter for machine learning? Every ML algorithm — every neural network forward pass, every gradient update, every PCA decomposition — is a sequence of matrix multiplications and vector additions. Linear algebra is not a prerequisite for ML. It is ML's native language.
Here is the roadmap. Chapters 1–4 cover solving systems of equations — the computational machinery. Chapters 5–7 develop the structural concepts: vector spaces, independence, basis, and rank. Chapters 8–9 reveal the geometric viewpoint: matrices as transformations, and how those transformations change when you change your coordinate system.
| Concept | Practical question it answers |
|---|---|
| Gaussian elimination | How do I solve Ax = b? |
| Rank | How many independent equations do I really have? |
| Basis | What is the minimal coordinate system for my data? |
| Linear map | What does multiplication by A do to space? |
| Basis change | How does the same map look in different coordinates? |
Suppose a store sells apples at $2 and bananas at $3. You spent $12. Your friend, buying at the same store, spent $17 on apples and bananas. How many of each did you buy? This is a system of linear equations:
Each equation constrains the unknowns x1 (apples) and x2 (bananas). A solution is a pair (x1, x2) that satisfies both simultaneously. Geometrically, each equation defines a line in the plane. The solution is where the lines cross.
Each equation is a line. The intersection (if it exists) is the solution. Drag the sliders to change the second equation and see what happens.
Three things can happen:
| Case | Geometry | Algebra |
|---|---|---|
| Unique solution | Lines cross at one point | The system has exactly one answer |
| No solution | Lines are parallel (never meet) | The equations contradict each other |
| Infinitely many | Lines are the same | One equation is a multiple of the other |
Try setting a=4, b=6, c=24 in the sliders above. The second equation becomes 4x1 + 6x2 = 24, which is just 2×(2x1 + 3x2 = 12). The lines overlap — infinitely many solutions. Now set c=25 instead. The lines are parallel — no solution.
For the original system (a=4, b=1, c=17), we can solve by substitution. From the first equation: x2 = (12 − 2x1)/3. Plugging into the second: 4x1 + (12 − 2x1)/3 = 17. Multiply through by 3: 12x1 + 12 − 2x1 = 51, so 10x1 = 39, giving x1 = 3.9. Then x2 = (12 − 7.8)/3 = 1.4.
Substitution works for two equations. But what about 100 equations in 50 unknowns? We need a systematic method. That is Gaussian elimination (Chapter 3).
Writing out every equation longhand is tedious. Matrices let us pack the entire system into one compact expression. Take our system:
We separate the coefficients, the unknowns, and the right-hand side:
The matrix A is a rectangular array of numbers. It has m rows and n columns, written A ∈ Rm×n. The vector x collects the unknowns; b collects the right-hand sides. The equation Ax = b says: "the matrix A, acting on the unknown vector x, produces the vector b."
Matrix multiplication AB is defined when A is m×n and B is n×p. Entry (i,j) of the product is the dot product of row i of A with column j of B. The result is m×p. The key rule: inner dimensions must match.
Watch how Ax is computed row by row. Each output entry is a dot product.
Some special matrices appear everywhere:
| Name | Symbol | Property |
|---|---|---|
| Identity | In | 1s on diagonal, 0s elsewhere. AI = IA = A. |
| Inverse | A−1 | AA−1 = A−1A = I. Exists only if A is square and full rank. |
| Transpose | AT | Rows become columns: (AT)ij = Aji. |
| Symmetric | A = AT | Mirror across the diagonal. Eigenvalues are real. |
For a 2×2 matrix, the inverse has a closed-form formula:
The quantity ad − bc is the determinant. When it is zero, the matrix has no inverse — the rows are linearly dependent, and the system Ax = b either has no solution or infinitely many. We will formalize "linearly dependent" in Chapter 6.
Substitution works for small systems, but it is ad-hoc. Gaussian elimination is the systematic algorithm: transform the system into a simple form, then read off the solution. It is the workhorse behind every "solve" button in every linear algebra library.
The idea: apply elementary row operations to the augmented matrix [A | b] until the left part is in row echelon form (staircase of leading 1s, zeros below each pivot). Then back-substitute.
The three allowed operations are:
| Operation | Example | Why it is safe |
|---|---|---|
| Swap two rows | R1 ↔ R2 | Reordering equations does not change the solution |
| Scale a row | R1 ← λR1 | Multiplying both sides of an equation by λ ≠ 0 |
| Add a multiple of one row to another | R2 ← R2 − 2R1 | Subtracting one equation from another |
Let's work through our example step by step. The augmented matrix is:
Step 1. Eliminate x1 from row 2. Apply R2 ← R2 − 2R1:
Now the system is in row echelon form. The second row says −5x2 = −7, so x2 = 7/5 = 1.4.
Step 2. Back-substitute into row 1: 2x1 + 3(1.4) = 12, so 2x1 = 7.8, x1 = 3.9.
Watch the augmented matrix transform step by step. The pivot element is highlighted.
If we continue beyond row echelon form and also eliminate above each pivot (making every pivot column have zeros everywhere except the pivot itself), we reach reduced row echelon form (RREF). For our example:
In RREF the solution is immediate: x1 = 3.9, x2 = 1.4. No back-substitution needed.
The pivot positions tell us the rank of the matrix — the number of independent equations. We will formalize this in Chapter 7. For now: count the pivots in the echelon form. Our 2×2 system has 2 pivots, so rank 2 — full rank — and a unique solution.
Not every system Ax = b has a unique answer. Let's add a variable and see what happens. Consider a system with 2 equations and 3 unknowns:
Applying Gaussian elimination (R2 ← R2 − 2R1):
There are two pivots (columns 1 and 3) and one free variable (x2). From row 2: x3 = 1. From row 1: x1 = 6 − 2x2 − 3(1) = 3 − 2x2. The solution is:
This is a line in 3D space. The first vector [3, 0, 1] is a particular solution (any specific point that satisfies the system). The second vector [−2, 1, 0] is a homogeneous solution — it satisfies Ax = 0 and can be added to any particular solution to get another valid solution.
This decomposition is how linear algebra handles "infinitely many solutions" — not as chaos, but as structure. The particular solution pins you somewhere on the solution set; the null space tells you the directions in which you can move without leaving it.
| Number of pivots vs. unknowns | Solution structure |
|---|---|
| pivots = unknowns (full rank, square) | Unique solution |
| pivots < unknowns | Infinitely many (free variables) |
| Contradictory row (e.g., [0 0 | 5]) | No solution |
Let's verify our solution. Pick x2 = 1: x = [3 − 2, 1, 1] = [1, 1, 1]. Check: 1 + 2 + 3 = 6 and 2 + 4 + 7 = 13. Both satisfied.
So far we have been working with vectors as columns of numbers. But what is a vector, really? Deisenroth et al. give a more abstract (and more powerful) definition.
A vector space (V, +, ·) is a set V equipped with two operations — addition and scalar multiplication — that satisfy certain rules (closure, associativity, commutativity, distributivity, existence of zero and inverses). Any collection of objects that obeys these rules is a vector space, whether those objects are columns of numbers, polynomials, functions, or matrices.
The most common vector space is Rn — all columns of n real numbers. But here are others:
| Vector space | Elements | Operations |
|---|---|---|
| Rn | Columns of n numbers | Componentwise add/scale |
| Rm×n | m×n matrices | Entrywise add/scale |
| Polynomials of degree ≤ n | a0 + a1t + … + antn | Add coefficients, scale all |
A subspace U of V is a subset that is itself a vector space (under the same operations). The key test: U must contain the zero vector, and be closed under addition and scalar multiplication. The null space of a matrix (all x with Ax = 0) is a subspace of Rn — exactly what we saw in Chapter 4.
In R2, lines through the origin are subspaces. A shifted line (not through origin) is not. Click to toggle examples.
Two important subspaces of a matrix A ∈ Rm×n:
| Subspace | Definition | Lives in |
|---|---|---|
| Column space (range/image) | All vectors b that can be written as Ax | Rm |
| Null space (kernel) | All vectors x satisfying Ax = 0 | Rn |
The column space tells you which right-hand sides b the system Ax = b can solve. The null space tells you the ambiguity in the solution. Together, they completely characterize the matrix.
Consider three vectors in R2: v1 = [1, 0], v2 = [0, 1], and v3 = [2, 3]. Can we write v3 using v1 and v2? Yes: v3 = 2v1 + 3v2. So v3 is redundant — it adds no new information beyond what v1 and v2 already provide.
A set of vectors {v1, …, vk} is linearly independent if the only way to write the zero vector as a combination is with all coefficients zero:
If you can find non-zero coefficients that sum to zero, the vectors are linearly dependent — at least one vector is a combination of the others.
Two vectors in R2. Adjust the angle between them. When they are parallel (angle = 0 or 180), they become dependent.
How do you check independence? Stack the vectors as columns of a matrix and perform Gaussian elimination. The number of pivots equals the number of independent vectors. If every column has a pivot, they are all independent. If some columns lack pivots, those correspond to dependent vectors.
A practical rule: in Rn, you can have at most n linearly independent vectors. Having more than n vectors in Rn guarantees at least one is redundant. This maximum count — the number of independent vectors — connects directly to the dimension of the space.
A basis of a vector space V is a set of vectors that is both linearly independent and spans V (every vector in V can be written as a combination). Think of a basis as the minimal coordinate system for the space — enough directions to reach everywhere, with no redundancy.
In R2, the standard basis is {e1 = [1,0], e2 = [0,1]}. But {[1,1], [1,−1]} is also a valid basis — any vector in R2 can be written as a combination of these two. A basis is not unique, but all bases of the same space have the same number of vectors. That number is the dimension of the space.
Now we can define rank precisely. The rank of a matrix A is the dimension of its column space — equivalently, the number of linearly independent columns, or the number of pivots in its row echelon form.
A remarkable fact: the rank equals the number of independent rows and the number of independent columns. Rows and columns always agree.
Let's verify with our earlier example. A = [1, 2, 3; 2, 4, 7]. After elimination: pivots in columns 1 and 3, so rk(A) = 2. The null space had one free variable (x2), so dim(null space) = 1. And 2 + 1 = 3 = n. It works.
| If rk(A) = | Then Ax = b has |
|---|---|
| n (full column rank) | At most one solution (unique if consistent) |
| m (full row rank) | At least one solution for every b |
| min(m,n) (full rank) | Exactly one solution (if m = n) |
| < min(m,n) (rank-deficient) | Either no solution or infinitely many |
So far we have treated matrices as arrays of numbers. Now let's think of them as transformations. When you multiply a vector x by a matrix A, you get a new vector y = Ax. The matrix maps x to y. It transforms all of space.
A linear map (or linear transformation) Φ: V → W is a function that preserves addition and scalar multiplication:
Every matrix defines a linear map, and every linear map between finite-dimensional spaces is a matrix (once you choose bases). The two concepts are interchangeable.
Adjust the 2×2 matrix A. The gray unit square maps to the orange parallelogram. Standard basis vectors e1, e2 map to the columns of A.
Try the preset buttons. Identity leaves everything unchanged. Scale 2× doubles all distances (det = 4, area quadruples). Rotate 45° spins everything but preserves lengths and areas (det = 1). Shear tilts one axis while keeping the other fixed (det = 1, area preserved). Reflect flips across the y-axis (det = −1, area preserved but orientation flipped).
The determinant measures how the transformation scales area:
| det(A) | Meaning |
|---|---|
| > 0 | Area scaled by det(A), orientation preserved |
| < 0 | Area scaled by |det(A)|, orientation flipped |
| = 0 | Collapses space to a lower dimension (singular) |
Set a11=2, a12=1, a21=4, a22=2 in the sliders. Notice det = 2·2 − 1·4 = 0 — the parallelogram collapses to a line. The matrix maps all of R2 onto a one-dimensional subspace. It has lost a dimension — it is singular, and Ax = b cannot be solved for all b.
The same linear map looks different in different coordinate systems. A rotation looks complicated in one basis but diagonal in another. Basis change lets us translate between representations — and it is the key to eigendecomposition (Chapter 4 of the book).
Suppose we have a linear map Φ represented by matrix A in the standard basis. If we switch to a new basis whose vectors are the columns of an invertible matrix S, then the same map is represented by:
Here is the intuition. To apply Φ in the new coordinates: (1) convert from new coordinates to standard (multiply by S), (2) apply the map (multiply by A), (3) convert back to new coordinates (multiply by S−1).
Matrices A and à = S−1AS are called similar. They represent the same map in different clothes. Similar matrices share all intrinsic properties: same eigenvalues, same determinant, same rank, same trace.
Let's work a concrete example. Take A = [2, 1; 0, 3] and the new basis S = [1, 1; 0, 1] (columns are the new basis vectors).
S−1 = [1, −1; 0, 1] (verify: SS−1 = I). Then:
The map is diagonal in this new basis. The new basis vectors happen to be eigenvectors of A (as we will see in Chapter 4 of the book). The diagonal entries 2 and 3 are the eigenvalues — the stretching factors along each eigenvector direction.
The same linear map A in two bases. Left: standard basis. Right: basis defined by S. Adjust S to see how the representation changes.
| Property | Preserved under basis change? |
|---|---|
| Eigenvalues | Yes |
| Determinant | Yes |
| Rank | Yes |
| Trace | Yes |
| Individual matrix entries | No |
Looking ahead: Chapter 4 of Deisenroth (eigendecomposition) is entirely about finding the best basis — the one where the matrix is diagonal. Chapter 10 (PCA) uses this same idea to find the basis that captures the most variance in data. Basis change is the thread that connects all of Part I to Part II.