Ch 2: Linear Algebra — Deisenroth MML

Chapter 0: What is Linear Algebra?

You have three ingredients in a recipe. You know the cost of each ingredient, and you want the total cost. That is a linear combination — multiply each quantity by its price, add them up. Simple arithmetic.

Now imagine you have 1,000 ingredients, 50 recipes, and you need the cost of every recipe simultaneously. That same multiply-and-add operation, scaled up, is the beating heart of linear algebra. Matrices are the bookkeeping device that makes it tractable.

Why does this matter for machine learning? Every ML algorithm — every neural network forward pass, every gradient update, every PCA decomposition — is a sequence of matrix multiplications and vector additions. Linear algebra is not a prerequisite for ML. It is ML's native language.

The core idea: Linear algebra studies systems of linear equations and the structures (vectors, matrices, vector spaces) that arise from them. Every concept in this chapter — rank, basis, linear maps — answers a practical question about these systems: Do solutions exist? How many? How do we find them?

Systems of Equations

Multiple equations, multiple unknowns

↓ compact notation

Matrices & Vectors

Ax = b encodes the entire system

↓ structure

Vector Spaces

The set of all possible solutions lives here

↓ geometry

Linear Maps

Matrices as transformations of space

Here is the roadmap. Chapters 1–4 cover solving systems of equations — the computational machinery. Chapters 5–7 develop the structural concepts: vector spaces, independence, basis, and rank. Chapters 8–9 reveal the geometric viewpoint: matrices as transformations, and how those transformations change when you change your coordinate system.

Concept	Practical question it answers
Gaussian elimination	How do I solve Ax = b?
Rank	How many independent equations do I really have?
Basis	What is the minimal coordinate system for my data?
Linear map	What does multiplication by A do to space?
Basis change	How does the same map look in different coordinates?

Check: Why is linear algebra the foundation of machine learning?

Because ML algorithms use randomness Because ML computations are built from matrix multiplications and vector additions Because ML requires calculus

Chapter 1: Systems of Equations

Suppose a store sells apples at $2 and bananas at $3. You spent $12. Your friend, buying at the same store, spent $17 on apples and bananas. How many of each did you buy? This is a system of linear equations:

2x₁ + 3x₂ = 12
4x₁ + 1x₂ = 17

Each equation constrains the unknowns x₁ (apples) and x₂ (bananas). A solution is a pair (x₁, x₂) that satisfies both simultaneously. Geometrically, each equation defines a line in the plane. The solution is where the lines cross.

Two Lines, One Solution

Each equation is a line. The intersection (if it exists) is the solution. Drag the sliders to change the second equation and see what happens.

a4.0

b1.0

c17.0

Three things can happen:

Case	Geometry	Algebra
Unique solution	Lines cross at one point	The system has exactly one answer
No solution	Lines are parallel (never meet)	The equations contradict each other
Infinitely many	Lines are the same	One equation is a multiple of the other

Try setting a=4, b=6, c=24 in the sliders above. The second equation becomes 4x₁ + 6x₂ = 24, which is just 2×(2x₁ + 3x₂ = 12). The lines overlap — infinitely many solutions. Now set c=25 instead. The lines are parallel — no solution.

Key insight: A system of linear equations either has 0, 1, or infinitely many solutions. Never 2, never 37. This trichotomy is fundamental — and linear algebra gives us tools (rank, determinant) to distinguish the three cases without actually solving.

For the original system (a=4, b=1, c=17), we can solve by substitution. From the first equation: x₂ = (12 − 2x₁)/3. Plugging into the second: 4x₁ + (12 − 2x₁)/3 = 17. Multiply through by 3: 12x₁ + 12 − 2x₁ = 51, so 10x₁ = 39, giving x₁ = 3.9. Then x₂ = (12 − 7.8)/3 = 1.4.

Substitution works for two equations. But what about 100 equations in 50 unknowns? We need a systematic method. That is Gaussian elimination (Chapter 3).

Check: A system of two linear equations in two unknowns can have how many solutions?

Either 1 or 2 Any positive number Exactly 0, exactly 1, or infinitely many

Chapter 2: Matrices

Writing out every equation longhand is tedious. Matrices let us pack the entire system into one compact expression. Take our system:

2x₁ + 3x₂ = 12
4x₁ + 1x₂ = 17

We separate the coefficients, the unknowns, and the right-hand side:

Ax = b where A = [2, 3; 4, 1], x = [x₁; x₂], b = [12; 17]

The matrix A is a rectangular array of numbers. It has m rows and n columns, written A ∈ R^m×n. The vector x collects the unknowns; b collects the right-hand sides. The equation Ax = b says: "the matrix A, acting on the unknown vector x, produces the vector b."

How matrix-vector multiplication works: Each entry of the result Ax is a dot product — row i of A dotted with x. Row 1 gives 2·x₁ + 3·x₂ (our first equation). Row 2 gives 4·x₁ + 1·x₂ (our second equation). That is all a matrix does: it applies each of its rows as a separate linear equation.

Matrix multiplication AB is defined when A is m×n and B is n×p. Entry (i,j) of the product is the dot product of row i of A with column j of B. The result is m×p. The key rule: inner dimensions must match.

Matrix-Vector Multiply

Watch how Ax is computed row by row. Each output entry is a dot product.

Some special matrices appear everywhere:

Name	Symbol	Property
Identity	I_n	1s on diagonal, 0s elsewhere. AI = IA = A.
Inverse	A⁻¹	AA⁻¹ = A⁻¹A = I. Exists only if A is square and full rank.
Transpose	A^T	Rows become columns: (A^T)_ij = A_ji.
Symmetric	A = A^T	Mirror across the diagonal. Eigenvalues are real.

For a 2×2 matrix, the inverse has a closed-form formula:

A = [a, b; c, d] ⇒ A⁻¹ = ¹⁄_{(ad − bc)} [d, −b; −c, a]

The quantity ad − bc is the determinant. When it is zero, the matrix has no inverse — the rows are linearly dependent, and the system Ax = b either has no solution or infinitely many. We will formalize "linearly dependent" in Chapter 6.

The determinant test (2×2): det(A) = ad − bc. If det(A) ≠ 0, A is invertible and Ax = b has a unique solution x = A⁻¹b. If det(A) = 0, the matrix is singular — the two rows point in the same direction and the system is degenerate.

Check: What does entry (i, j) of the product AB equal?

The dot product of row i of A with column j of B The sum of all entries of A and B A_ij times B_ij

Chapter 3: Gaussian Elimination

Substitution works for small systems, but it is ad-hoc. Gaussian elimination is the systematic algorithm: transform the system into a simple form, then read off the solution. It is the workhorse behind every "solve" button in every linear algebra library.

The idea: apply elementary row operations to the augmented matrix [A | b] until the left part is in row echelon form (staircase of leading 1s, zeros below each pivot). Then back-substitute.

The three allowed operations are:

Operation	Example	Why it is safe
Swap two rows	R₁ ↔ R₂	Reordering equations does not change the solution
Scale a row	R₁ ← λR₁	Multiplying both sides of an equation by λ ≠ 0
Add a multiple of one row to another	R₂ ← R₂ − 2R₁	Subtracting one equation from another

Let's work through our example step by step. The augmented matrix is:

[2, 3 | 12; 4, 1 | 17]

Step 1. Eliminate x₁ from row 2. Apply R₂ ← R₂ − 2R₁:

[2, 3 | 12; 0, −5 | −7]

Now the system is in row echelon form. The second row says −5x₂ = −7, so x₂ = 7/5 = 1.4.

Step 2. Back-substitute into row 1: 2x₁ + 3(1.4) = 12, so 2x₁ = 7.8, x₁ = 3.9.

Gaussian Elimination Stepper

Watch the augmented matrix transform step by step. The pivot element is highlighted.

Step 0: Original augmented matrix

If we continue beyond row echelon form and also eliminate above each pivot (making every pivot column have zeros everywhere except the pivot itself), we reach reduced row echelon form (RREF). For our example:

[1, 0 | 3.9; 0, 1 | 1.4]

In RREF the solution is immediate: x₁ = 3.9, x₂ = 1.4. No back-substitution needed.

Key insight: Gaussian elimination does not change the solution set — it only changes the appearance of the system. The pivots reveal the structure: how many independent equations exist, which variables are free, and whether a solution exists at all.

The pivot positions tell us the rank of the matrix — the number of independent equations. We will formalize this in Chapter 7. For now: count the pivots in the echelon form. Our 2×2 system has 2 pivots, so rank 2 — full rank — and a unique solution.

Check: What does Gaussian elimination transform the augmented matrix into?

The identity matrix Row echelon form (staircase of pivots with zeros below) A diagonal matrix

Chapter 4: The Solution Set

Not every system Ax = b has a unique answer. Let's add a variable and see what happens. Consider a system with 2 equations and 3 unknowns:

x₁ + 2x₂ + 3x₃ = 6
2x₁ + 4x₂ + 7x₃ = 13

Applying Gaussian elimination (R₂ ← R₂ − 2R₁):

[1, 2, 3 | 6; 0, 0, 1 | 1]

There are two pivots (columns 1 and 3) and one free variable (x₂). From row 2: x₃ = 1. From row 1: x₁ = 6 − 2x₂ − 3(1) = 3 − 2x₂. The solution is:

x = [3; 0; 1] + x₂ · [−2; 1; 0]

This is a line in 3D space. The first vector [3, 0, 1] is a particular solution (any specific point that satisfies the system). The second vector [−2, 1, 0] is a homogeneous solution — it satisfies Ax = 0 and can be added to any particular solution to get another valid solution.

The general solution: For any system Ax = b, the full solution set is:

x = x_particular + x_homogeneous

where x_particular is any single solution to Ax = b, and x_homogeneous ranges over all solutions to Ax = 0. The homogeneous solutions form a vector space (the null space or kernel of A).

This decomposition is how linear algebra handles "infinitely many solutions" — not as chaos, but as structure. The particular solution pins you somewhere on the solution set; the null space tells you the directions in which you can move without leaving it.

Number of pivots vs. unknowns	Solution structure
pivots = unknowns (full rank, square)	Unique solution
pivots < unknowns	Infinitely many (free variables)
Contradictory row (e.g., [0 0 \| 5])	No solution

Let's verify our solution. Pick x₂ = 1: x = [3 − 2, 1, 1] = [1, 1, 1]. Check: 1 + 2 + 3 = 6 and 2 + 4 + 7 = 13. Both satisfied.

Connection to ML: In machine learning, underdetermined systems (more unknowns than equations) appear constantly — neural networks typically have millions of parameters but only thousands of training constraints. The solution set is vast. Gradient descent navigates this space, and regularization picks a preferred solution from the infinitely many options.

Check: What is the general solution of Ax = b?

A particular solution plus any element of the null space of A The inverse of A times b The average of all solutions

Chapter 5: Vector Spaces

So far we have been working with vectors as columns of numbers. But what is a vector, really? Deisenroth et al. give a more abstract (and more powerful) definition.

A vector space (V, +, ·) is a set V equipped with two operations — addition and scalar multiplication — that satisfy certain rules (closure, associativity, commutativity, distributivity, existence of zero and inverses). Any collection of objects that obeys these rules is a vector space, whether those objects are columns of numbers, polynomials, functions, or matrices.

Think of a vector space as a playground. You can add any two vectors and get another vector in the same space. You can scale any vector and stay in the space. You can never "escape" using just addition and scaling. This closure property is what makes the structure useful.

The most common vector space is Rⁿ — all columns of n real numbers. But here are others:

Vector space	Elements	Operations
Rⁿ	Columns of n numbers	Componentwise add/scale
R^m×n	m×n matrices	Entrywise add/scale
Polynomials of degree ≤ n	a₀ + a₁t + … + a_ntⁿ	Add coefficients, scale all

A subspace U of V is a subset that is itself a vector space (under the same operations). The key test: U must contain the zero vector, and be closed under addition and scalar multiplication. The null space of a matrix (all x with Ax = 0) is a subspace of Rⁿ — exactly what we saw in Chapter 4.

Subspace or Not?

In R², lines through the origin are subspaces. A shifted line (not through origin) is not. Click to toggle examples.

Key insight: A subspace must contain the zero vector. This is the quickest test: if your set does not include the origin, it is not a subspace. The solution set of Ax = b (with b ≠ 0) is not a subspace — it is a shifted (affine) subspace. Only the solutions to the homogeneous system Ax = 0 form a true subspace.

Two important subspaces of a matrix A ∈ R^m×n:

Subspace	Definition	Lives in
Column space (range/image)	All vectors b that can be written as Ax	R^m
Null space (kernel)	All vectors x satisfying Ax = 0	Rⁿ

The column space tells you which right-hand sides b the system Ax = b can solve. The null space tells you the ambiguity in the solution. Together, they completely characterize the matrix.

Check: Why is the solution set of Ax = b (with b ≠ 0) NOT a subspace?

Because it contains too many vectors Because it does not contain the zero vector Because it is infinite

Chapter 6: Linear Independence

Consider three vectors in R²: v₁ = [1, 0], v₂ = [0, 1], and v₃ = [2, 3]. Can we write v₃ using v₁ and v₂? Yes: v₃ = 2v₁ + 3v₂. So v₃ is redundant — it adds no new information beyond what v₁ and v₂ already provide.

A set of vectors {v₁, …, v_k} is linearly independent if the only way to write the zero vector as a combination is with all coefficients zero:

λ₁v₁ + … + λ_kv_k = 0 ⇒ λ₁ = … = λ_k = 0

If you can find non-zero coefficients that sum to zero, the vectors are linearly dependent — at least one vector is a combination of the others.

Analogy: Think of directions on a map. "Go north" and "go east" are independent — neither can replace the other. But "go northeast" is redundant if you already have north and east: northeast = north + east. In a vector space, independent vectors point in genuinely different directions. Dependent ones are just combinations of directions you already have.

Independence in R²

Two vectors in R². Adjust the angle between them. When they are parallel (angle = 0 or 180), they become dependent.

Angle90°

How do you check independence? Stack the vectors as columns of a matrix and perform Gaussian elimination. The number of pivots equals the number of independent vectors. If every column has a pivot, they are all independent. If some columns lack pivots, those correspond to dependent vectors.

A practical rule: in Rⁿ, you can have at most n linearly independent vectors. Having more than n vectors in Rⁿ guarantees at least one is redundant. This maximum count — the number of independent vectors — connects directly to the dimension of the space.

Key insight: Linear independence is about information content. Each independent vector adds one "dimension of freedom" to what you can represent. Dependent vectors add nothing — they are baggage. The rank of a matrix counts how many of its columns (or rows) carry genuine, non-redundant information.

Check: If vectors v₁ and v₂ in R² are linearly dependent, what does that mean geometrically?

They point in the same (or opposite) direction — they are parallel They are perpendicular They have equal length

Chapter 7: Basis & Rank

A basis of a vector space V is a set of vectors that is both linearly independent and spans V (every vector in V can be written as a combination). Think of a basis as the minimal coordinate system for the space — enough directions to reach everywhere, with no redundancy.

In R², the standard basis is {e₁ = [1,0], e₂ = [0,1]}. But {[1,1], [1,−1]} is also a valid basis — any vector in R² can be written as a combination of these two. A basis is not unique, but all bases of the same space have the same number of vectors. That number is the dimension of the space.

Dimension: Rⁿ has dimension n. Any basis of Rⁿ has exactly n vectors. A plane through the origin in R³ is a 2-dimensional subspace. A line through the origin is 1-dimensional. The zero vector alone is 0-dimensional.

Now we can define rank precisely. The rank of a matrix A is the dimension of its column space — equivalently, the number of linearly independent columns, or the number of pivots in its row echelon form.

rk(A) = dim(column space of A) = number of pivots

A remarkable fact: the rank equals the number of independent rows and the number of independent columns. Rows and columns always agree.

The rank-nullity theorem: For any A ∈ R^m×n,

rk(A) + dim(null space of A) = n

Rank counts the "informative" dimensions. Nullity counts the "wasted" dimensions. Together they account for all n columns. This is one of the most important equations in linear algebra.

Let's verify with our earlier example. A = [1, 2, 3; 2, 4, 7]. After elimination: pivots in columns 1 and 3, so rk(A) = 2. The null space had one free variable (x₂), so dim(null space) = 1. And 2 + 1 = 3 = n. It works.

If rk(A) =	Then Ax = b has
n (full column rank)	At most one solution (unique if consistent)
m (full row rank)	At least one solution for every b
min(m,n) (full rank)	Exactly one solution (if m = n)
< min(m,n) (rank-deficient)	Either no solution or infinitely many

Key insight: Rank is the single number that tells you everything about solvability. A high-rank matrix carries a lot of information (its columns point in many independent directions). A low-rank matrix is "compressed" — its columns lie in a small subspace. Low-rank approximation (Chapter 10: PCA) exploits exactly this idea.

Check: A 5×3 matrix has rank 2. What is the dimension of its null space?

1 (because 3 − 2 = 1) 3 (because 5 − 2 = 3) 2 (because rank = nullity)

Chapter 8: Linear Maps

So far we have treated matrices as arrays of numbers. Now let's think of them as transformations. When you multiply a vector x by a matrix A, you get a new vector y = Ax. The matrix maps x to y. It transforms all of space.

A linear map (or linear transformation) Φ: V → W is a function that preserves addition and scalar multiplication:

Φ(x + y) = Φ(x) + Φ(y) and Φ(λx) = λΦ(x)

Every matrix defines a linear map, and every linear map between finite-dimensional spaces is a matrix (once you choose bases). The two concepts are interchangeable.

What does a 2×2 matrix do? It maps the unit square to a parallelogram. The first column of A tells you where e₁ = [1,0] goes. The second column tells you where e₂ = [0,1] goes. The determinant equals the signed area of the parallelogram — the factor by which the matrix scales areas.

Linear Mapping Sandbox

Adjust the 2×2 matrix A. The gray unit square maps to the orange parallelogram. Standard basis vectors e₁, e₂ map to the columns of A.

a₁₁1.0

a₁₂0.5

a₂₁-0.3

a₂₂1.2

Try the preset buttons. Identity leaves everything unchanged. Scale 2× doubles all distances (det = 4, area quadruples). Rotate 45° spins everything but preserves lengths and areas (det = 1). Shear tilts one axis while keeping the other fixed (det = 1, area preserved). Reflect flips across the y-axis (det = −1, area preserved but orientation flipped).

The determinant measures how the transformation scales area:

det(A)	Meaning
> 0	Area scaled by det(A), orientation preserved
< 0	Area scaled by \|det(A)\|, orientation flipped
= 0	Collapses space to a lower dimension (singular)

Set a₁₁=2, a₁₂=1, a₂₁=4, a₂₂=2 in the sliders. Notice det = 2·2 − 1·4 = 0 — the parallelogram collapses to a line. The matrix maps all of R² onto a one-dimensional subspace. It has lost a dimension — it is singular, and Ax = b cannot be solved for all b.

Key insight: A matrix is a machine that takes in vectors and produces vectors. Its columns are the output when you feed in the standard basis vectors. The determinant measures volume distortion. If det = 0, the machine crushes a dimension — information is lost, and you cannot invert the map.

Check: What does the determinant of a 2×2 matrix geometrically represent?

The angle of rotation The signed area-scaling factor of the transformation The length of the longest column

Chapter 9: Basis Change

The same linear map looks different in different coordinate systems. A rotation looks complicated in one basis but diagonal in another. Basis change lets us translate between representations — and it is the key to eigendecomposition (Chapter 4 of the book).

Suppose we have a linear map Φ represented by matrix A in the standard basis. If we switch to a new basis whose vectors are the columns of an invertible matrix S, then the same map is represented by:

Ã = S⁻¹ A S

Here is the intuition. To apply Φ in the new coordinates: (1) convert from new coordinates to standard (multiply by S), (2) apply the map (multiply by A), (3) convert back to new coordinates (multiply by S⁻¹).

New coords

The vector x̃ in basis B'

↓ multiply by S (convert to standard basis)

Standard coords

x = Sx̃

↓ apply A (the map in standard basis)

Mapped in standard

y = Ax = ASx̃

↓ multiply by S⁻¹ (convert back to new basis)

Mapped in new coords

ỹ = S⁻¹ASx̃ = Ãx̃

Matrices A and Ã = S⁻¹AS are called similar. They represent the same map in different clothes. Similar matrices share all intrinsic properties: same eigenvalues, same determinant, same rank, same trace.

Why this matters: The entire point of eigendecomposition is finding a basis S where Ã = S⁻¹AS is diagonal. A diagonal matrix is trivial to understand — it just scales each axis independently. Finding the right basis turns a complicated transformation into something transparent. That is the power of basis change.

Let's work a concrete example. Take A = [2, 1; 0, 3] and the new basis S = [1, 1; 0, 1] (columns are the new basis vectors).

S⁻¹ = [1, −1; 0, 1] (verify: SS⁻¹ = I). Then:

Ã = S⁻¹AS = [1, −1; 0, 1][2, 1; 0, 3][1, 1; 0, 1] = [2, 0; 0, 3]

The map is diagonal in this new basis. The new basis vectors happen to be eigenvectors of A (as we will see in Chapter 4 of the book). The diagonal entries 2 and 3 are the eigenvalues — the stretching factors along each eigenvector direction.

Basis Change Visualizer

The same linear map A in two bases. Left: standard basis. Right: basis defined by S. Adjust S to see how the representation changes.

s₁₁1.0

s₂₁0.0

s₁₂1.0

s₂₂1.0

Key insight: A matrix is not the map — it is a description of the map relative to a chosen basis. Change the basis, and the description changes, but the underlying geometry stays the same. Finding the right basis is often the entire game in linear algebra.

Property	Preserved under basis change?
Eigenvalues	Yes
Determinant	Yes
Rank	Yes
Trace	Yes
Individual matrix entries	No

Looking ahead: Chapter 4 of Deisenroth (eigendecomposition) is entirely about finding the best basis — the one where the matrix is diagonal. Chapter 10 (PCA) uses this same idea to find the basis that captures the most variance in data. Basis change is the thread that connects all of Part I to Part II.

"The purpose of computing is insight, not numbers."
— Richard Hamming

Check: If Ã = S⁻¹AS, what is the relationship between A and Ã?

They represent the same linear map in different bases Ã is the inverse of A They are unrelated matrices