Deisenroth et al., Chapter 2

Linear Algebra

The language every machine learning algorithm is written in.

Prerequisites: None. We start from scratch.
10
Chapters
5+
Simulations
10
Quizzes

Chapter 0: What is Linear Algebra?

You have three ingredients in a recipe. You know the cost of each ingredient, and you want the total cost. That is a linear combination — multiply each quantity by its price, add them up. Simple arithmetic.

Now imagine you have 1,000 ingredients, 50 recipes, and you need the cost of every recipe simultaneously. That same multiply-and-add operation, scaled up, is the beating heart of linear algebra. Matrices are the bookkeeping device that makes it tractable.

Why does this matter for machine learning? Every ML algorithm — every neural network forward pass, every gradient update, every PCA decomposition — is a sequence of matrix multiplications and vector additions. Linear algebra is not a prerequisite for ML. It is ML's native language.

The core idea: Linear algebra studies systems of linear equations and the structures (vectors, matrices, vector spaces) that arise from them. Every concept in this chapter — rank, basis, linear maps — answers a practical question about these systems: Do solutions exist? How many? How do we find them?
Systems of Equations
Multiple equations, multiple unknowns
↓ compact notation
Matrices & Vectors
Ax = b encodes the entire system
↓ structure
Vector Spaces
The set of all possible solutions lives here
↓ geometry
Linear Maps
Matrices as transformations of space

Here is the roadmap. Chapters 1–4 cover solving systems of equations — the computational machinery. Chapters 5–7 develop the structural concepts: vector spaces, independence, basis, and rank. Chapters 8–9 reveal the geometric viewpoint: matrices as transformations, and how those transformations change when you change your coordinate system.

ConceptPractical question it answers
Gaussian eliminationHow do I solve Ax = b?
RankHow many independent equations do I really have?
BasisWhat is the minimal coordinate system for my data?
Linear mapWhat does multiplication by A do to space?
Basis changeHow does the same map look in different coordinates?
Check: Why is linear algebra the foundation of machine learning?

Chapter 1: Systems of Equations

Suppose a store sells apples at $2 and bananas at $3. You spent $12. Your friend, buying at the same store, spent $17 on apples and bananas. How many of each did you buy? This is a system of linear equations:

2x1 + 3x2 = 12
4x1 + 1x2 = 17

Each equation constrains the unknowns x1 (apples) and x2 (bananas). A solution is a pair (x1, x2) that satisfies both simultaneously. Geometrically, each equation defines a line in the plane. The solution is where the lines cross.

Two Lines, One Solution

Each equation is a line. The intersection (if it exists) is the solution. Drag the sliders to change the second equation and see what happens.

a4.0
b1.0
c17.0

Three things can happen:

CaseGeometryAlgebra
Unique solutionLines cross at one pointThe system has exactly one answer
No solutionLines are parallel (never meet)The equations contradict each other
Infinitely manyLines are the sameOne equation is a multiple of the other

Try setting a=4, b=6, c=24 in the sliders above. The second equation becomes 4x1 + 6x2 = 24, which is just 2×(2x1 + 3x2 = 12). The lines overlap — infinitely many solutions. Now set c=25 instead. The lines are parallel — no solution.

Key insight: A system of linear equations either has 0, 1, or infinitely many solutions. Never 2, never 37. This trichotomy is fundamental — and linear algebra gives us tools (rank, determinant) to distinguish the three cases without actually solving.

For the original system (a=4, b=1, c=17), we can solve by substitution. From the first equation: x2 = (12 − 2x1)/3. Plugging into the second: 4x1 + (12 − 2x1)/3 = 17. Multiply through by 3: 12x1 + 12 − 2x1 = 51, so 10x1 = 39, giving x1 = 3.9. Then x2 = (12 − 7.8)/3 = 1.4.

Substitution works for two equations. But what about 100 equations in 50 unknowns? We need a systematic method. That is Gaussian elimination (Chapter 3).

Check: A system of two linear equations in two unknowns can have how many solutions?

Chapter 2: Matrices

Writing out every equation longhand is tedious. Matrices let us pack the entire system into one compact expression. Take our system:

2x1 + 3x2 = 12
4x1 + 1x2 = 17

We separate the coefficients, the unknowns, and the right-hand side:

Ax = b     where   A = [2, 3; 4, 1],   x = [x1; x2],   b = [12; 17]

The matrix A is a rectangular array of numbers. It has m rows and n columns, written A ∈ Rm×n. The vector x collects the unknowns; b collects the right-hand sides. The equation Ax = b says: "the matrix A, acting on the unknown vector x, produces the vector b."

How matrix-vector multiplication works: Each entry of the result Ax is a dot product — row i of A dotted with x. Row 1 gives 2·x1 + 3·x2 (our first equation). Row 2 gives 4·x1 + 1·x2 (our second equation). That is all a matrix does: it applies each of its rows as a separate linear equation.

Matrix multiplication AB is defined when A is m×n and B is n×p. Entry (i,j) of the product is the dot product of row i of A with column j of B. The result is m×p. The key rule: inner dimensions must match.

Matrix-Vector Multiply

Watch how Ax is computed row by row. Each output entry is a dot product.

Some special matrices appear everywhere:

NameSymbolProperty
IdentityIn1s on diagonal, 0s elsewhere. AI = IA = A.
InverseA−1AA−1 = A−1A = I. Exists only if A is square and full rank.
TransposeATRows become columns: (AT)ij = Aji.
SymmetricA = ATMirror across the diagonal. Eigenvalues are real.

For a 2×2 matrix, the inverse has a closed-form formula:

A = [a, b; c, d]  ⇒  A−1 = 1(ad − bc) [d, −b; −c, a]

The quantity ad − bc is the determinant. When it is zero, the matrix has no inverse — the rows are linearly dependent, and the system Ax = b either has no solution or infinitely many. We will formalize "linearly dependent" in Chapter 6.

The determinant test (2×2): det(A) = ad − bc. If det(A) ≠ 0, A is invertible and Ax = b has a unique solution x = A−1b. If det(A) = 0, the matrix is singular — the two rows point in the same direction and the system is degenerate.
Check: What does entry (i, j) of the product AB equal?

Chapter 3: Gaussian Elimination

Substitution works for small systems, but it is ad-hoc. Gaussian elimination is the systematic algorithm: transform the system into a simple form, then read off the solution. It is the workhorse behind every "solve" button in every linear algebra library.

The idea: apply elementary row operations to the augmented matrix [A | b] until the left part is in row echelon form (staircase of leading 1s, zeros below each pivot). Then back-substitute.

The three allowed operations are:

OperationExampleWhy it is safe
Swap two rowsR1 ↔ R2Reordering equations does not change the solution
Scale a rowR1 ← λR1Multiplying both sides of an equation by λ ≠ 0
Add a multiple of one row to anotherR2 ← R2 − 2R1Subtracting one equation from another

Let's work through our example step by step. The augmented matrix is:

[2, 3 | 12; 4, 1 | 17]

Step 1. Eliminate x1 from row 2. Apply R2 ← R2 − 2R1:

[2, 3 | 12; 0, −5 | −7]

Now the system is in row echelon form. The second row says −5x2 = −7, so x2 = 7/5 = 1.4.

Step 2. Back-substitute into row 1: 2x1 + 3(1.4) = 12, so 2x1 = 7.8, x1 = 3.9.

Gaussian Elimination Stepper

Watch the augmented matrix transform step by step. The pivot element is highlighted.

Step 0: Original augmented matrix

If we continue beyond row echelon form and also eliminate above each pivot (making every pivot column have zeros everywhere except the pivot itself), we reach reduced row echelon form (RREF). For our example:

[1, 0 | 3.9; 0, 1 | 1.4]

In RREF the solution is immediate: x1 = 3.9, x2 = 1.4. No back-substitution needed.

Key insight: Gaussian elimination does not change the solution set — it only changes the appearance of the system. The pivots reveal the structure: how many independent equations exist, which variables are free, and whether a solution exists at all.

The pivot positions tell us the rank of the matrix — the number of independent equations. We will formalize this in Chapter 7. For now: count the pivots in the echelon form. Our 2×2 system has 2 pivots, so rank 2 — full rank — and a unique solution.

Check: What does Gaussian elimination transform the augmented matrix into?

Chapter 4: The Solution Set

Not every system Ax = b has a unique answer. Let's add a variable and see what happens. Consider a system with 2 equations and 3 unknowns:

x1 + 2x2 + 3x3 = 6
2x1 + 4x2 + 7x3 = 13

Applying Gaussian elimination (R2 ← R2 − 2R1):

[1, 2, 3 | 6; 0, 0, 1 | 1]

There are two pivots (columns 1 and 3) and one free variable (x2). From row 2: x3 = 1. From row 1: x1 = 6 − 2x2 − 3(1) = 3 − 2x2. The solution is:

x = [3; 0; 1] + x2 · [−2; 1; 0]

This is a line in 3D space. The first vector [3, 0, 1] is a particular solution (any specific point that satisfies the system). The second vector [−2, 1, 0] is a homogeneous solution — it satisfies Ax = 0 and can be added to any particular solution to get another valid solution.

The general solution: For any system Ax = b, the full solution set is:

x = xparticular + xhomogeneous

where xparticular is any single solution to Ax = b, and xhomogeneous ranges over all solutions to Ax = 0. The homogeneous solutions form a vector space (the null space or kernel of A).

This decomposition is how linear algebra handles "infinitely many solutions" — not as chaos, but as structure. The particular solution pins you somewhere on the solution set; the null space tells you the directions in which you can move without leaving it.

Number of pivots vs. unknownsSolution structure
pivots = unknowns (full rank, square)Unique solution
pivots < unknownsInfinitely many (free variables)
Contradictory row (e.g., [0 0 | 5])No solution

Let's verify our solution. Pick x2 = 1: x = [3 − 2, 1, 1] = [1, 1, 1]. Check: 1 + 2 + 3 = 6 and 2 + 4 + 7 = 13. Both satisfied.

Connection to ML: In machine learning, underdetermined systems (more unknowns than equations) appear constantly — neural networks typically have millions of parameters but only thousands of training constraints. The solution set is vast. Gradient descent navigates this space, and regularization picks a preferred solution from the infinitely many options.
Check: What is the general solution of Ax = b?

Chapter 5: Vector Spaces

So far we have been working with vectors as columns of numbers. But what is a vector, really? Deisenroth et al. give a more abstract (and more powerful) definition.

A vector space (V, +, ·) is a set V equipped with two operations — addition and scalar multiplication — that satisfy certain rules (closure, associativity, commutativity, distributivity, existence of zero and inverses). Any collection of objects that obeys these rules is a vector space, whether those objects are columns of numbers, polynomials, functions, or matrices.

Think of a vector space as a playground. You can add any two vectors and get another vector in the same space. You can scale any vector and stay in the space. You can never "escape" using just addition and scaling. This closure property is what makes the structure useful.

The most common vector space is Rn — all columns of n real numbers. But here are others:

Vector spaceElementsOperations
RnColumns of n numbersComponentwise add/scale
Rm×nm×n matricesEntrywise add/scale
Polynomials of degree ≤ na0 + a1t + … + antnAdd coefficients, scale all

A subspace U of V is a subset that is itself a vector space (under the same operations). The key test: U must contain the zero vector, and be closed under addition and scalar multiplication. The null space of a matrix (all x with Ax = 0) is a subspace of Rn — exactly what we saw in Chapter 4.

Subspace or Not?

In R2, lines through the origin are subspaces. A shifted line (not through origin) is not. Click to toggle examples.

Key insight: A subspace must contain the zero vector. This is the quickest test: if your set does not include the origin, it is not a subspace. The solution set of Ax = b (with b ≠ 0) is not a subspace — it is a shifted (affine) subspace. Only the solutions to the homogeneous system Ax = 0 form a true subspace.

Two important subspaces of a matrix A ∈ Rm×n:

SubspaceDefinitionLives in
Column space (range/image)All vectors b that can be written as AxRm
Null space (kernel)All vectors x satisfying Ax = 0Rn

The column space tells you which right-hand sides b the system Ax = b can solve. The null space tells you the ambiguity in the solution. Together, they completely characterize the matrix.

Check: Why is the solution set of Ax = b (with b ≠ 0) NOT a subspace?

Chapter 6: Linear Independence

Consider three vectors in R2: v1 = [1, 0], v2 = [0, 1], and v3 = [2, 3]. Can we write v3 using v1 and v2? Yes: v3 = 2v1 + 3v2. So v3 is redundant — it adds no new information beyond what v1 and v2 already provide.

A set of vectors {v1, …, vk} is linearly independent if the only way to write the zero vector as a combination is with all coefficients zero:

λ1v1 + … + λkvk = 0  ⇒  λ1 = … = λk = 0

If you can find non-zero coefficients that sum to zero, the vectors are linearly dependent — at least one vector is a combination of the others.

Analogy: Think of directions on a map. "Go north" and "go east" are independent — neither can replace the other. But "go northeast" is redundant if you already have north and east: northeast = north + east. In a vector space, independent vectors point in genuinely different directions. Dependent ones are just combinations of directions you already have.
Independence in R2

Two vectors in R2. Adjust the angle between them. When they are parallel (angle = 0 or 180), they become dependent.

Angle90°

How do you check independence? Stack the vectors as columns of a matrix and perform Gaussian elimination. The number of pivots equals the number of independent vectors. If every column has a pivot, they are all independent. If some columns lack pivots, those correspond to dependent vectors.

A practical rule: in Rn, you can have at most n linearly independent vectors. Having more than n vectors in Rn guarantees at least one is redundant. This maximum count — the number of independent vectors — connects directly to the dimension of the space.

Key insight: Linear independence is about information content. Each independent vector adds one "dimension of freedom" to what you can represent. Dependent vectors add nothing — they are baggage. The rank of a matrix counts how many of its columns (or rows) carry genuine, non-redundant information.
Check: If vectors v1 and v2 in R2 are linearly dependent, what does that mean geometrically?

Chapter 7: Basis & Rank

A basis of a vector space V is a set of vectors that is both linearly independent and spans V (every vector in V can be written as a combination). Think of a basis as the minimal coordinate system for the space — enough directions to reach everywhere, with no redundancy.

In R2, the standard basis is {e1 = [1,0], e2 = [0,1]}. But {[1,1], [1,−1]} is also a valid basis — any vector in R2 can be written as a combination of these two. A basis is not unique, but all bases of the same space have the same number of vectors. That number is the dimension of the space.

Dimension: Rn has dimension n. Any basis of Rn has exactly n vectors. A plane through the origin in R3 is a 2-dimensional subspace. A line through the origin is 1-dimensional. The zero vector alone is 0-dimensional.

Now we can define rank precisely. The rank of a matrix A is the dimension of its column space — equivalently, the number of linearly independent columns, or the number of pivots in its row echelon form.

rk(A) = dim(column space of A) = number of pivots

A remarkable fact: the rank equals the number of independent rows and the number of independent columns. Rows and columns always agree.

The rank-nullity theorem: For any A ∈ Rm×n,

rk(A) + dim(null space of A) = n

Rank counts the "informative" dimensions. Nullity counts the "wasted" dimensions. Together they account for all n columns. This is one of the most important equations in linear algebra.

Let's verify with our earlier example. A = [1, 2, 3; 2, 4, 7]. After elimination: pivots in columns 1 and 3, so rk(A) = 2. The null space had one free variable (x2), so dim(null space) = 1. And 2 + 1 = 3 = n. It works.

If rk(A) =Then Ax = b has
n (full column rank)At most one solution (unique if consistent)
m (full row rank)At least one solution for every b
min(m,n) (full rank)Exactly one solution (if m = n)
< min(m,n) (rank-deficient)Either no solution or infinitely many
Key insight: Rank is the single number that tells you everything about solvability. A high-rank matrix carries a lot of information (its columns point in many independent directions). A low-rank matrix is "compressed" — its columns lie in a small subspace. Low-rank approximation (Chapter 10: PCA) exploits exactly this idea.
Check: A 5×3 matrix has rank 2. What is the dimension of its null space?

Chapter 8: Linear Maps

So far we have treated matrices as arrays of numbers. Now let's think of them as transformations. When you multiply a vector x by a matrix A, you get a new vector y = Ax. The matrix maps x to y. It transforms all of space.

A linear map (or linear transformation) Φ: V → W is a function that preserves addition and scalar multiplication:

Φ(x + y) = Φ(x) + Φ(y)    and    Φ(λx) = λΦ(x)

Every matrix defines a linear map, and every linear map between finite-dimensional spaces is a matrix (once you choose bases). The two concepts are interchangeable.

What does a 2×2 matrix do? It maps the unit square to a parallelogram. The first column of A tells you where e1 = [1,0] goes. The second column tells you where e2 = [0,1] goes. The determinant equals the signed area of the parallelogram — the factor by which the matrix scales areas.
Linear Mapping Sandbox

Adjust the 2×2 matrix A. The gray unit square maps to the orange parallelogram. Standard basis vectors e1, e2 map to the columns of A.

a111.0
a120.5
a21-0.3
a221.2

Try the preset buttons. Identity leaves everything unchanged. Scale 2× doubles all distances (det = 4, area quadruples). Rotate 45° spins everything but preserves lengths and areas (det = 1). Shear tilts one axis while keeping the other fixed (det = 1, area preserved). Reflect flips across the y-axis (det = −1, area preserved but orientation flipped).

The determinant measures how the transformation scales area:

det(A)Meaning
> 0Area scaled by det(A), orientation preserved
< 0Area scaled by |det(A)|, orientation flipped
= 0Collapses space to a lower dimension (singular)

Set a11=2, a12=1, a21=4, a22=2 in the sliders. Notice det = 2·2 − 1·4 = 0 — the parallelogram collapses to a line. The matrix maps all of R2 onto a one-dimensional subspace. It has lost a dimension — it is singular, and Ax = b cannot be solved for all b.

Key insight: A matrix is a machine that takes in vectors and produces vectors. Its columns are the output when you feed in the standard basis vectors. The determinant measures volume distortion. If det = 0, the machine crushes a dimension — information is lost, and you cannot invert the map.
Check: What does the determinant of a 2×2 matrix geometrically represent?

Chapter 9: Basis Change

The same linear map looks different in different coordinate systems. A rotation looks complicated in one basis but diagonal in another. Basis change lets us translate between representations — and it is the key to eigendecomposition (Chapter 4 of the book).

Suppose we have a linear map Φ represented by matrix A in the standard basis. If we switch to a new basis whose vectors are the columns of an invertible matrix S, then the same map is represented by:

à = S−1 A S

Here is the intuition. To apply Φ in the new coordinates: (1) convert from new coordinates to standard (multiply by S), (2) apply the map (multiply by A), (3) convert back to new coordinates (multiply by S−1).

New coords
The vector x̃ in basis B'
↓ multiply by S (convert to standard basis)
Standard coords
x = Sx̃
↓ apply A (the map in standard basis)
Mapped in standard
y = Ax = ASx̃
↓ multiply by S−1 (convert back to new basis)
Mapped in new coords
ỹ = S−1ASx̃ = Ãx̃

Matrices A and à = S−1AS are called similar. They represent the same map in different clothes. Similar matrices share all intrinsic properties: same eigenvalues, same determinant, same rank, same trace.

Why this matters: The entire point of eigendecomposition is finding a basis S where à = S−1AS is diagonal. A diagonal matrix is trivial to understand — it just scales each axis independently. Finding the right basis turns a complicated transformation into something transparent. That is the power of basis change.

Let's work a concrete example. Take A = [2, 1; 0, 3] and the new basis S = [1, 1; 0, 1] (columns are the new basis vectors).

S−1 = [1, −1; 0, 1] (verify: SS−1 = I). Then:

à = S−1AS = [1, −1; 0, 1][2, 1; 0, 3][1, 1; 0, 1] = [2, 0; 0, 3]

The map is diagonal in this new basis. The new basis vectors happen to be eigenvectors of A (as we will see in Chapter 4 of the book). The diagonal entries 2 and 3 are the eigenvalues — the stretching factors along each eigenvector direction.

Basis Change Visualizer

The same linear map A in two bases. Left: standard basis. Right: basis defined by S. Adjust S to see how the representation changes.

s111.0
s210.0
s121.0
s221.0
Key insight: A matrix is not the map — it is a description of the map relative to a chosen basis. Change the basis, and the description changes, but the underlying geometry stays the same. Finding the right basis is often the entire game in linear algebra.
PropertyPreserved under basis change?
EigenvaluesYes
DeterminantYes
RankYes
TraceYes
Individual matrix entriesNo

Looking ahead: Chapter 4 of Deisenroth (eigendecomposition) is entirely about finding the best basis — the one where the matrix is diagonal. Chapter 10 (PCA) uses this same idea to find the basis that captures the most variance in data. Basis change is the thread that connects all of Part I to Part II.

"The purpose of computing is insight, not numbers."
— Richard Hamming
Check: If à = S−1AS, what is the relationship between A and Ã?