Goodfellow et al., Chapter 2

Linear Algebra

The language every neural network speaks. Vectors, matrices, and the decompositions that make deep learning tractable.

Prerequisites: Basic arithmetic. That's it.
9 chapters · 4+ simulations · 9 quizzes

Chapter 0: Why Linear Algebra?

A neural network takes an image — say 256×256 pixels with 3 color channels — and produces a prediction. Internally, that image is a block of 196,608 numbers. The network multiplies these numbers by weight matrices, adds biases, and applies nonlinearities. Repeat hundreds of times. That is deep learning.
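As a rough sketch of one such step, the snippet below flattens a 256×256×3 image into a vector and passes it through a single layer. It assumes NumPy; the layer width of 10, the random weights, and the choice of ReLU as the nonlinearity are illustrative placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 256x256 RGB image; random values stand in for real pixel data.
image = rng.random((256, 256, 3))

# Flatten the image into one long vector of 256 * 256 * 3 = 196,608 numbers.
x = image.reshape(-1)

# Hypothetical layer: 10 output units, small random weights, zero biases.
W = rng.normal(scale=0.01, size=(10, x.size))
b = np.zeros(10)

# One layer: multiply by the weight matrix, add the bias, apply ReLU.
h = np.maximum(0.0, W @ x + b)

print(x.shape)  # (196608,)
print(h.shape)  # (10,)
```

A deep network repeats this pattern, feeding each layer's output into the next.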

Every single step of that pipeline is a linear algebra operation. Matrix multiplications, vector additions, norms, decompositions. If you understand linear algebra, you can read any deep learning paper and know exactly what each equation does to the data flowing through the network.

The big picture: Linear algebra provides the data structures (vectors, matrices, tensors) and operations (multiplication, decomposition, inversion) that deep learning is built on. This chapter covers just enough to make the rest of the book precise.
Scalars, Vectors, Matrices, Tensors: containers for numbers at different scales
Operations: multiplication, transpose, norms
Decompositions: eigen and SVD, which reveal hidden structure
Applications: solving systems, PCA, understanding networks
Why is linear algebra essential for deep learning?

Chapter 1: From Scalars to Tensors

Deep learning operates on numbers at different scales. A scalar is a single number — a learning rate of 0.001, or a loss value of 2.34. We write scalars in italics: x.

A vector is an ordered array of numbers. A single neuron's weights form a vector. A 3-pixel grayscale image patch is a vector [128, 200, 55]. We write vectors in bold lowercase: $\mathbf{x} \in \mathbb{R}^n$.

A matrix is a 2D grid of numbers. A weight layer connecting 100 input neurons to 50 output neurons is a 50×100 matrix. We write matrices in bold uppercase: $\mathbf{A} \in \mathbb{R}^{m \times n}$.

A tensor is the generalization to any number of axes. A color image is a 3D tensor (height × width × channels). A batch of images is a 4D tensor. We write tensors as $A$ with indices: $A_{i,j,k}$.
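The four levels can be sketched as NumPy arrays, where `ndim` reports the number of axes. The 28×28 image size, the batch size of 8, and the channels-last axis order are arbitrary assumptions for illustration; some frameworks put channels first.

```python
import numpy as np

scalar = np.array(0.001)            # a learning rate: 0 axes
vector = np.array([128, 200, 55])   # a 3-pixel grayscale patch: 1 axis
matrix = np.zeros((50, 100))        # weights from 100 inputs to 50 outputs: 2 axes
image = np.zeros((28, 28, 3))       # height x width x channels: 3 axes
batch = np.zeros((8, 28, 28, 3))    # 8 such images stacked together: 4 axes

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("image", image), ("batch", batch)]:
    print(f"{name}: ndim={t.ndim}, shape={t.shape}")
```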

Data Structure Hierarchy

Click each level to see how numbers are organized at different scales.

Key insight: The transpose of a matrix mirrors it across the main diagonal: $(A^\top)_{i,j} = A_{j,i}$. A row vector becomes a column vector. If $A$ is $m \times n$, then $A^\top$ is $n \times m$. For vectors, $x^\top y$ is the dot product.
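A small NumPy sketch of the transpose and of the dot product written as a row times a column. The matrices and vectors are made-up examples.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
At = A.T                    # shape (3, 2)

# Entry (i, j) of the transpose equals entry (j, i) of the original.
assert At[2, 0] == A[0, 2]

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# x^T y as a (1 x 3) row times a (3 x 1) column gives a 1 x 1 matrix ...
row_times_col = x.reshape(1, 3) @ y.reshape(3, 1)
# ... whose single entry is the dot product.
dot = x @ y

print(At.shape)        # (3, 2)
print(dot)             # 32.0
print(row_times_col)   # [[32.]]
```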
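| Object | Rank | Notation | Deep Learning Example |
| --- | --- | --- | --- |
| Scalar | 0 | $x$ | Learning rate, loss value |
| Vector | 1 | $\mathbf{x}$ | Neuron weights, word embedding |
| Matrix | 2 | $\mathbf{A}$ | Weight layer, attention scores |
| Tensor | 3+ | $A_{i,j,k}$ | Image batch, convolution kernel |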
A batch of 32 color images, each 64×64 with 3 channels, is stored as a tensor. What are its dimensions?

Chapter 2: Matrix Operations

The most important operation in deep learning is matrix multiplication. When a layer with weight matrix W processes input x, it computes Wx. Entry i of the result is the dot product of row i of W with x.

For two matrices $A$ ($m \times n$) and $B$ ($n \times p$), the product $C = AB$ is $m \times p$, where $C_{i,j} = \sum_k A_{i,k} B_{k,j}$. The inner dimensions must match. Matrix multiplication is not commutative: $AB \neq BA$ in general.

$$C_{i,j} = \sum_k A_{i,k} B_{k,j}$$

But it is associative, $A(BC) = (AB)C$, and distributive, $A(B + C) = AB + AC$. The transpose of a product reverses order: $(AB)^\top = B^\top A^\top$.
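The definition and the algebraic rules above can be checked numerically. This is a sketch assuming NumPy: `matmul_loops` is a hypothetical helper that spells out the sum over k (far slower than the built-in `@`), and the matrices are random or made up.

```python
import numpy as np

def matmul_loops(A, B):
    """Compute C = AB directly from C[i, j] = sum_k A[i, k] * B[k, j]."""
    m, n = A.shape
    n2, p = B.shape
    if n != n2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 5))
D = rng.normal(size=(3, 5))
C = rng.normal(size=(5, 2))

matches_builtin = np.allclose(matmul_loops(A, B), A @ B)
associative = np.allclose(A @ (B @ C), (A @ B) @ C)
distributive = np.allclose(A @ (B + D), A @ B + A @ D)
transpose_rule = np.allclose((A @ B).T, B.T @ A.T)
print(matches_builtin, associative, distributive, transpose_rule)  # True True True True

# Non-commutativity, shown on two small square matrices.
P = np.array([[1.0, 1.0], [0.0, 1.0]])
Q = np.array([[1.0, 0.0], [1.0, 1.0]])
commutes = np.allclose(P @ Q, Q @ P)
print(commutes)  # False
```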

Matrix Multiplication Visualizer

Watch how each output element is computed as a dot product of a row and a column. Click cells in C to highlight the contributing row and column.

The element-wise product (Hadamard product, A ⊙ B) multiplies corresponding entries. Both matrices must have the same shape. This appears in gating mechanisms (LSTMs) and attention masking. Do not confuse it with matrix multiplication.
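In NumPy the two products use different operators, which makes the distinction easy to see. The matrices below are made up; `B` doubles as a 0/1 mask in the element-wise case.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

hadamard = A * B   # element-wise: entry (i, j) is A[i, j] * B[i, j]
product = A @ B    # matrix product: rows of A dotted with columns of B

print(hadamard)    # rows [0, 2] and [3, 0]: B zeroes out the masked entries
print(product)     # rows [2, 1] and [4, 3]: B swaps the columns of A
```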
If A is 3×4 and B is 4×2, what shape is AB?

Chapter 3: Identity & Inverse

The identity matrix $I_n$ is the matrix equivalent of the number 1. It is $n \times n$ with ones on the diagonal and zeros elsewhere. For any $n \times n$ matrix $A$: $A I_n = I_n A = A$. It does nothing to vectors it multiplies: $I_n x = x$.

The matrix inverse $A^{-1}$ undoes the effect of $A$. If $Ax = b$, then $x = A^{-1}b$. Not every matrix has an inverse: it must be square (same number of rows and columns) and have a nonzero determinant.

$$A A^{-1} = A^{-1} A = I_n$$

A matrix with no inverse is called singular. Geometrically, a singular matrix squashes space into a lower dimension — information is lost and cannot be recovered. In deep learning, we almost never compute inverses directly (it is slow and numerically unstable). Instead, we solve systems using decompositions or iterative methods.
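A minimal sketch of the practical route, assuming NumPy: `np.linalg.solve` factorizes the matrix internally instead of forming the inverse. The 2×2 system and the singular matrix are made-up examples.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Preferred: solve the system directly, without forming the inverse.
x_solve = np.linalg.solve(A, b)

# Same answer in exact arithmetic, but the explicit inverse costs more
# and tends to be less accurate on ill-conditioned problems.
x_inv = np.linalg.inv(A) @ b

print(x_solve)                       # approximately [2, 3]
print(np.allclose(x_solve, x_inv))   # True

# A singular matrix: the second row is twice the first,
# so the plane is squashed onto a line and no inverse exists.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_S = np.linalg.matrix_rank(S)
print(rank_S)                        # 1 (less than 2, so S is singular)
```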

Why this matters: Solving $Ax = b$ is at the heart of linear regression, computing gradients in closed form, and understanding how weight matrices transform data. $A^{-1}b$ gives the theoretical answer, but decompositions (Chapters 5–6) give the practical one.

The determinant det(A) measures how a matrix scales volume. If det(A) = 0, the matrix collapses at least one dimension — it is singular. If det(A) = 2, the matrix doubles volumes. If det(A) is negative, the matrix flips orientation.

The trace $\mathrm{tr}(A)$ is the sum of diagonal entries. It equals the sum of eigenvalues and appears in many deep learning formulas. Useful identities: $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, and $\mathrm{tr}(A) = \mathrm{tr}(A^\top)$.
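The determinant and trace facts above can be spot-checked with NumPy. The three 2×2 matrices are made-up examples of scaling, flipping, and collapsing; the random matrices only serve to test the trace identities.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])   # stretches the x-axis by 2: areas double
F = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # swaps the axes: orientation flips
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # parallel rows: the plane collapses to a line

det_A, det_F, det_S = np.linalg.det(A), np.linalg.det(F), np.linalg.det(S)
print(det_A, det_F, det_S)   # approximately 2, -1, 0

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 3))
M = rng.normal(size=(3, 3))

cyclic = np.isclose(np.trace(B @ C), np.trace(C @ B))
transpose_invariant = np.isclose(np.trace(M), np.trace(M.T))
# Complex eigenvalues of a real matrix come in conjugate pairs,
# so the imaginary parts cancel in the sum.
trace_is_eig_sum = np.isclose(np.trace(M), np.linalg.eigvals(M).sum().real)
print(cyclic, transpose_invariant, trace_is_eig_sum)   # True True True
```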

When does a square matrix NOT have an inverse?

Chapter 4: Norms

How big is a vector? The norm measures the size (or length) of a vector. In deep learning, norms appear everywhere: loss functions (how far is our prediction from the truth?), regularization (how big are the weights?), and gradient clipping (is the gradient exploding?).

The general Lp norm of a vector x is:

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$

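The L2 norm (Euclidean norm) is the straight-line distance from the origin, and it is the most common norm in deep learning. The L1 norm (Manhattan norm) sums absolute values. It encourages sparsity because its penalty shrinks at a constant rate all the way to zero, whereas the squared L2 penalty flattens out near zero, so small nonzero values cost relatively more under L1.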

The max norm (L∞) is simply the largest absolute element: $\|x\|_\infty = \max_i |x_i|$. For matrices, the Frobenius norm treats the matrix as a long vector: $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$.
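The norms above, computed both from their definitions and with `np.linalg.norm`. `lp_norm` is a hypothetical helper written for this sketch, and the vector and matrix are made up so that the results come out as whole numbers.

```python
import numpy as np

def lp_norm(x, p):
    """General Lp norm: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 0.0])

l1 = lp_norm(x, 1)
l2 = lp_norm(x, 2)
linf = np.max(np.abs(x))
print(l1, np.linalg.norm(x, 1))         # 7.0 7.0
print(l2, np.linalg.norm(x, 2))         # 5.0 5.0
print(linf, np.linalg.norm(x, np.inf))  # 4.0 4.0

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
# Frobenius norm: square root of the sum of squared entries.
fro = np.sqrt(np.sum(A ** 2))
print(fro, np.linalg.norm(A, "fro"))    # 5.0 5.0
```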

Norm Unit Circles

The "unit circle" under each norm: all 2D vectors with norm = 1. Adjust p to see how the shape changes.

Default: p = 2.0
Intuition for L1 vs L2: L2 is like measuring distance "as the crow flies." L1 is like walking on a city grid — you can only move along axes. L1 creates diamond-shaped unit balls, which have sharp corners on the axes. This is why L1 regularization drives weights exactly to zero (the corners sit on axes), producing sparse solutions.
Why does L1 regularization produce sparser weights than L2?

Chapter 5: Eigendecomposition

What does a matrix do? When you multiply a vector by a matrix, the vector generally changes both direction and length. But some special vectors only get scaled — their direction stays the same. These are eigenvectors, and the scaling factors are eigenvalues.

Av = λv

Here v is an eigenvector and λ is the corresponding eigenvalue. If λ > 1, the vector gets stretched. If 0 < λ < 1, it shrinks. If λ < 0, it flips direction.

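The eigendecomposition of a square matrix writes $A = V \,\mathrm{diag}(\lambda)\, V^{-1}$, where $V$ is the matrix whose columns are eigenvectors. It exists whenever $A$ has $n$ linearly independent eigenvectors. This reveals that $A$'s action is: change to the eigenvector coordinate system, scale each axis by its eigenvalue, change back. When $A$ is real and symmetric, $V$ can be chosen orthogonal, so the change of coordinates is a pure rotation (or reflection).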
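A short check of both statements, assuming NumPy's `np.linalg.eig`, which returns the eigenvalues and a matrix whose columns are the eigenvectors. The symmetric 2×2 matrix is a made-up example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of V are eigenvectors; lam holds the matching eigenvalues.
lam, V = np.linalg.eig(A)

# Each pair satisfies A v = lambda v.
pairs_ok = all(np.allclose(A @ V[:, i], lam[i] * V[:, i]) for i in range(len(lam)))

# Rebuild A from its eigendecomposition: A = V diag(lambda) V^{-1}.
A_rebuilt = V @ np.diag(lam) @ np.linalg.inv(V)

print(pairs_ok)                    # True
print(np.allclose(A, A_rebuilt))   # True
print(np.sort(lam))                # eigenvalues 1 and 3
```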

Deep learning connection: The eigenvalues of the Hessian (second-derivative matrix of the loss) tell you about the curvature of the loss landscape. Large-magnitude eigenvalues mean sharply curved directions; eigenvalues near zero mean flat directions. At a point where the gradient is zero, a mix of positive and negative eigenvalues signals a saddle point.

A symmetric matrix is positive definite if all its eigenvalues are positive, and positive semidefinite if all are ≥ 0. These appear in covariance matrices, kernel methods, and Hessians of convex functions.
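One way to apply these definitions is to look at the signs of the eigenvalues. In this sketch, `classify` and its tolerance `tol` are hypothetical choices; `np.linalg.eigvalsh` assumes the input is symmetric. The two Hessians belong to the toy functions named in the comments.

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(H)   # real eigenvalues of a symmetric matrix
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semidefinite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam <= tol):
        return "negative semidefinite"
    return "indefinite"

# Hessian of f(x, y) = x^2 + y^2: a bowl.
bowl = classify(np.array([[2.0, 0.0], [0.0, 2.0]]))
# Hessian of f(x, y) = x^2 - y^2: curves up along x, down along y (a saddle).
saddle = classify(np.array([[2.0, 0.0], [0.0, -2.0]]))
# X^T X is always positive semidefinite; with 2 rows in 3 dimensions
# its rank is at most 2, so one eigenvalue is (numerically) zero.
X = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 1.0]])
gram = classify(X.T @ X)

print(bowl)     # positive definite
print(saddle)   # indefinite
print(gram)     # positive semidefinite
```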

Eigenvector Visualizer

The matrix A transforms the unit circle (gray) into an ellipse (orange). The eigenvectors (teal/blue) show the principal axes of stretching.

Default settings: λ1 = 2.0, λ2 = 0.5, θ = 30
If a matrix has eigenvalues [3, 0.1], what does it do geometrically?

Chapter 6: Singular Value Decomposition

Eigendecomposition only works for square matrices. But most matrices in deep learning are rectangular — a weight matrix connecting 768 inputs to 512 outputs is 512×768. The Singular Value Decomposition (SVD) works for any matrix.

$$A = U \Sigma V^\top$$

$U$ ($m \times m$) contains the left singular vectors. $\Sigma$ ($m \times n$) is diagonal with the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. $V$ ($n \times n$) contains the right singular vectors. $U$ and $V$ are orthogonal: their columns are unit vectors perpendicular to each other.

Think of SVD as a three-step transformation: $V^\top$ rotates the input, $\Sigma$ stretches along each axis by the singular values, and $U$ rotates the result to the output space.
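The shapes and properties of the three factors can be verified with `np.linalg.svd`. This is a sketch on a random 4×6 matrix; NumPy returns the singular values as a 1D array, so the code places them on the diagonal of an m×n matrix to match the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))   # a rectangular matrix (m = 4, n = 6)

# full_matrices=True returns U (m x m) and V^T (n x n); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Place the singular values on the diagonal of an m x n matrix Sigma.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

reconstructs = np.allclose(A, U @ Sigma @ Vt)
U_orthogonal = np.allclose(U.T @ U, np.eye(4))
V_orthogonal = np.allclose(Vt @ Vt.T, np.eye(6))
sorted_nonneg = bool(np.all(np.diff(s) <= 0) and np.all(s >= 0))

print(U.shape, Sigma.shape, Vt.shape)   # (4, 4) (4, 6) (6, 6)
print(reconstructs, U_orthogonal, V_orthogonal, sorted_nonneg)   # True True True True
```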

Why SVD matters in deep learning: It reveals the effective dimensionality of a matrix. If a 512×768 weight matrix has only 50 large singular values and the rest are nearly zero, the matrix effectively acts through a 50-dimensional subspace. This insight underlies low-rank approximation, low-rank adaptation (LoRA), compression, and understanding what networks actually learn.
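To make the 50-dimensional picture concrete, the sketch below builds a synthetic 512×768 matrix that is rank 50 plus small noise, then keeps only the top 50 singular directions. The matrix is constructed for illustration and is not a real network weight; this shows plain truncated-SVD approximation, not the LoRA training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 512 x 768 matrix that is, by construction, rank 50 plus small noise.
m, n, r = 512, 768, 50
W = rng.normal(size=(m, r)) @ rng.normal(size=(r, n)) + 1e-3 * rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the top-k singular values and their singular vectors.
k = 50
W_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Relative error of the rank-k approximation in the Frobenius norm.
rel_error = np.linalg.norm(W - W_k) / np.linalg.norm(W)
gap = s[k] / s[k - 1]   # ratio of the 51st to the 50th singular value

print(gap)         # tiny: the spectrum drops sharply after 50 values
print(rel_error)   # small: rank 50 captures almost all of W

# Storage: two thin factors instead of one full matrix.
full_params, low_rank_params = m * n, k * (m + n)
print(full_params, low_rank_params)   # 393216 64000
```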

The pseudoinverse $A^+$ generalizes the inverse to non-square matrices: $A^+ = V \Sigma^+ U^\top$, where $\Sigma^+$ is obtained by taking the reciprocal of each nonzero singular value in $\Sigma$ and transposing the result. When $Ax = b$ has no exact solution, $x = A^+ b$ gives the least-squares best approximation.
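A sketch of the pseudoinverse built from the SVD and compared against NumPy's `np.linalg.pinv` and `np.linalg.lstsq`. The overdetermined system (fitting a line to four made-up points) and the cutoff `1e-12` for treating a singular value as zero are assumptions of this example.

```python
import numpy as np

# Overdetermined system: 4 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# Build A^+ = V Sigma^+ U^T from the thin SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.zeros_like(s)
nonzero = s > 1e-12
s_inv[nonzero] = 1.0 / s[nonzero]   # invert only the nonzero singular values
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

x_pinv = A_pinv @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
print(np.allclose(x_pinv, x_lstsq))             # True
print(x_pinv)                                   # approximately [0.9, 0.9]: intercept and slope
```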

SVD decomposes any $m \times n$ matrix into $U \Sigma V^\top$. What do the singular values in $\Sigma$ represent?

Chapter 7: Matrix Transform Sandbox

Time to play. This sandbox lets you define a 2×2 matrix and see exactly what it does to space. You will see the unit circle deform into an ellipse, watch eigenvectors emerge, and build intuition for how matrices act as geometric transformations.

2D Matrix Transformation Sandbox

Adjust the four entries of the matrix. The gray unit circle transforms into the orange shape. Eigenvectors shown in teal and blue.

Default matrix: $a_{11} = 2.0$, $a_{12} = 0.5$, $a_{21} = 0.5$, $a_{22} = 1.0$

det = 1.75 | Eigenvalues: 2.21, 0.79

Try these experiments: (1) Set Identity, then slowly increase a12 to see shearing. (2) Set Singular — the ellipse collapses to a line, det=0. (3) Set Rotate 90 — notice the eigenvalues become complex (no real eigenvectors for pure rotation). The sandbox shows only real eigenvectors.
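The same three experiments can be reproduced offline. In this sketch `describe` is a hypothetical helper, and the shear and singular matrices are stand-ins chosen for illustration, since the sandbox presets' exact entries are not listed here.

```python
import numpy as np

def describe(name, M):
    """Print and return the determinant and eigenvalues of a 2x2 matrix."""
    det = np.linalg.det(M)
    eigvals = np.linalg.eigvals(M)
    print(f"{name}: det={det:.2f}, eigenvalues={np.round(eigvals, 2)}")
    return det, eigvals

# Experiment 1: identity plus a shear term a12.
shear_det, shear_eig = describe("shear", np.array([[1.0, 0.8], [0.0, 1.0]]))

# Experiment 2: a singular matrix (second row is half the first).
sing_det, sing_eig = describe("singular", np.array([[1.0, 2.0], [0.5, 1.0]]))

# Experiment 3: rotation by 90 degrees; NumPy returns complex eigenvalues.
rot_det, rot_eig = describe("rotate 90", np.array([[0.0, -1.0], [1.0, 0.0]]))
```

The shear keeps det = 1 (area is preserved), the singular matrix has a zero eigenvalue, and the rotation's eigenvalues are purely imaginary, which is why no real eigenvectors can be drawn for it.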
You set the matrix to [1,0;0,0]. What happens to the unit circle?

Chapter 8: Connections

This chapter covered the linear algebra toolkit that the rest of "Deep Learning" relies on. Here is how each concept maps to later chapters:

| Concept | Where It Appears |
| --- | --- |
| Matrix multiplication | Every neural network layer (Ch 6) |
| Norms | Regularization penalties (Ch 7), gradient clipping (Ch 8) |
| Eigenvalues | Hessian analysis, loss surface geometry (Ch 8) |
| SVD / low-rank | Weight compression, PCA for data preprocessing |
| Determinant | Change-of-variables in probability (Ch 3), normalizing flows |
| Positive definiteness | Covariance matrices (Ch 3), convex optimization |
What you should take away: Vectors and matrices are not abstract math — they are the physical substrate of neural networks. Every weight tensor, every gradient, every optimizer state is a linear algebra object. The decompositions (eigen, SVD) reveal the hidden structure that makes learning possible.

Up next: Chapter 3: Probability & Information Theory — the other mathematical pillar of deep learning. Where linear algebra provides the data structures, probability provides the framework for reasoning under uncertainty.

Which decomposition works for ANY matrix (including rectangular ones)?