The language every neural network speaks. Vectors, matrices, and the decompositions that make deep learning tractable.
A neural network takes an image — say 256×256 pixels with 3 color channels — and produces a prediction. Internally, that image is a block of 196,608 numbers. The network multiplies these numbers by weight matrices, adds biases, and applies nonlinearities. Repeat hundreds of times. That is deep learning.
Almost every step of that pipeline is a linear algebra operation: matrix multiplications, vector additions, norms, decompositions. Even the nonlinearities are applied elementwise to vectors and matrices. If you understand linear algebra, you can follow what most equations in a deep learning paper do to the data flowing through the network.
Deep learning operates on numbers at different scales. A scalar is a single number — a learning rate of 0.001, or a loss value of 2.34. We write scalars in italics: x.
A vector is an ordered array of numbers. A single neuron's weights form a vector. A 3-pixel grayscale image patch is a vector [128, 200, 55]. We write vectors in bold lowercase: $\mathbf{x} \in \mathbb{R}^n$.
A matrix is a 2D grid of numbers. A weight layer connecting 100 input neurons to 50 output neurons is a 50×100 matrix. We write matrices in bold uppercase: $\mathbf{A} \in \mathbb{R}^{m \times n}$.
A tensor is the generalization to any number of axes. A color image is a 3D tensor (height × width × channels). A batch of images is a 4D tensor. We write tensors as A with indices: $A_{i,j,k}$.
Click each level to see how numbers are organized at different scales.
| Object | Rank | Notation | Deep Learning Example |
|---|---|---|---|
| Scalar | 0 | $x$ | Learning rate, loss value |
| Vector | 1 | $\mathbf{x}$ | Neuron weights, word embedding |
| Matrix | 2 | $\mathbf{A}$ | Weight layer, attention scores |
| Tensor | 3+ | $A_{i,j,k}$ | Image batch, convolution kernel |
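As a rough illustration of the four levels, here is a minimal NumPy sketch. The variable names and the batch size of 8 are assumptions chosen for the example; the shapes of the patch, weight layer, and image follow the text.

```python
import numpy as np

learning_rate = np.array(0.001)                    # scalar: 0 axes
patch = np.array([128, 200, 55])                   # vector: 1 axis, 3 grayscale pixels
weights = np.zeros((50, 100))                      # matrix: 100 inputs -> 50 outputs
image = np.zeros((256, 256, 3), dtype=np.uint8)    # 3D tensor: height x width x channels
batch = np.zeros((8, 256, 256, 3), dtype=np.uint8) # 4D tensor: 8 images (batch size assumed)

for name, arr in [("scalar", learning_rate), ("vector", patch),
                  ("matrix", weights), ("image", image), ("batch", batch)]:
    # ndim is the number of axes; shape gives the length along each axis
    print(f"{name:7s} ndim={arr.ndim} shape={arr.shape}")

# The image is a block of 256 * 256 * 3 numbers
n_values = image.size
print(n_values)  # 196608
```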
The most important operation in deep learning is matrix multiplication. When a layer with weight matrix W processes input x, it computes Wx. Entry i of the result is the dot product of row i of W with x.
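The row-by-row view of Wx can be checked directly. The sketch below assumes a small random 3×4 weight matrix (sizes and seed are arbitrary) and compares NumPy's `@` operator against explicit dot products.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the example is reproducible
W = rng.normal(size=(3, 4))      # hypothetical layer: 4 inputs -> 3 outputs
x = rng.normal(size=4)           # one input vector

y = W @ x                        # the layer's linear step

# Entry i of the result is the dot product of row i of W with x
y_rows = np.array([np.dot(W[i], x) for i in range(W.shape[0])])

print(np.allclose(y, y_rows))
```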
For two matrices A (m×n) and B (n×p), the product C = AB is m×p, where $C_{i,j} = \sum_k A_{i,k} B_{k,j}$. The inner dimensions must match. Matrix multiplication is not commutative: AB ≠ BA in general.
But it is associative, A(BC) = (AB)C, and distributive, A(B + C) = AB + AC. The transpose of a product reverses order: $(AB)^T = B^T A^T$.
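These rules are easy to verify numerically. The following is a small sketch with arbitrary example matrices (all names and sizes are assumptions): it rebuilds a product from the summation formula, then checks non-commutativity, associativity, distributivity, and the transpose rule.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
B2 = rng.normal(size=(3, 4))
D = rng.normal(size=(4, 2))

# C[i, j] = sum_k A[i, k] * B[k, j]
C = A @ B
C_loop = np.zeros((2, 4))
for i in range(2):
    for j in range(4):
        C_loop[i, j] = sum(A[i, k] * B[k, j] for k in range(3))

# Not commutative: these two 2x2 products differ
P = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
PQ = P @ Q   # swaps the columns of P
QP = Q @ P   # swaps the rows of P

associative = np.allclose(A @ (B @ D), (A @ B) @ D)
distributive = np.allclose(A @ (B + B2), A @ B + A @ B2)
transpose_rule = np.allclose((A @ B).T, B.T @ A.T)
print(associative, distributive, transpose_rule)
```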
Watch how each output element is computed as a dot product of a row and a column. Click cells in C to highlight the contributing row and column.
The identity matrix $I_n$ is the matrix equivalent of the number 1. It is n×n with ones on the diagonal and zeros elsewhere. For any n×n matrix A: AI = IA = A. It does nothing to vectors it multiplies: Ix = x.
The matrix inverse $A^{-1}$ undoes the effect of A. If Ax = b, then $x = A^{-1}b$. Not every matrix has an inverse: it must be square (same number of rows and columns) and have a nonzero determinant.
A matrix with no inverse is called singular. Geometrically, a singular matrix squashes space into a lower dimension: information is lost and cannot be recovered. In deep learning, we almost never compute inverses explicitly, because forming $A^{-1}$ is slower and less numerically accurate than solving the system directly. Instead, we solve systems using decompositions or iterative methods.
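As a sketch of the "solve, do not invert" advice, the example below uses `np.linalg.solve`, which factors the matrix rather than forming its inverse. The matrices are made-up examples; the second one is singular because its rows are multiples of each other.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Preferred: solve Ax = b directly
x = np.linalg.solve(A, b)

# Same answer via the explicit inverse, shown only for comparison
x_via_inverse = np.linalg.inv(A) @ b

# The identity matrix leaves vectors unchanged
identity_ok = np.allclose(np.eye(2) @ x, x)

# A singular matrix: the second row is twice the first
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_S = np.linalg.matrix_rank(S)   # 1: space is squashed onto a line

try:
    np.linalg.solve(S, b)
    singular_detected = False
except np.linalg.LinAlgError:
    # NumPy refuses to solve with an exactly singular matrix
    singular_detected = True

print(x, rank_S, singular_detected)
```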
The determinant det(A) measures how a matrix scales volume. If det(A) = 0, the matrix collapses at least one dimension — it is singular. If det(A) = 2, the matrix doubles volumes. If det(A) is negative, the matrix flips orientation.
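The three cases above can be illustrated with small hand-picked matrices (the matrices and their names are assumptions for this sketch).

```python
import numpy as np

stretch = np.array([[2.0, 0.0],
                    [0.0, 1.0]])    # doubles lengths along the first axis
swap = np.array([[0.0, 1.0],
                 [1.0, 0.0]])       # swaps the two axes (a reflection)
collapse = np.array([[1.0, 2.0],
                     [2.0, 4.0]])   # maps the whole plane onto one line

det_stretch = np.linalg.det(stretch)     # about 2: areas double
det_swap = np.linalg.det(swap)           # about -1: orientation flips
det_collapse = np.linalg.det(collapse)   # about 0: singular

print(det_stretch, det_swap, det_collapse)
```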
The trace tr(A) is the sum of diagonal entries. It equals the sum of eigenvalues and appears in many deep learning formulas. Useful identities: tr(AB) = tr(BA), and $\operatorname{tr}(A) = \operatorname{tr}(A^T)$.
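A quick numerical check of the trace identities, using random matrices as stand-ins (sizes and seed are arbitrary). The eigenvalues of a real non-symmetric matrix can be complex, so the sketch compares against the real part of their sum.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
M = rng.normal(size=(2, 3))
N = rng.normal(size=(3, 2))

trace_A = np.trace(A)

# tr(A) equals the sum of eigenvalues (imaginary parts cancel in conjugate pairs)
eig_sum = np.linalg.eigvals(A).sum().real

# tr(MN) = tr(NM) even though MN is 2x2 and NM is 3x3
cyclic_ok = np.isclose(np.trace(M @ N), np.trace(N @ M))

# tr(A) = tr(A^T)
transpose_ok = np.isclose(np.trace(A), np.trace(A.T))

print(np.isclose(trace_A, eig_sum), cyclic_ok, transpose_ok)
```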
How big is a vector? The norm measures the size (or length) of a vector. In deep learning, norms appear everywhere: loss functions (how far is our prediction from the truth?), regularization (how big are the weights?), and gradient clipping (is the gradient exploding?).
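The general $L^p$ norm of a vector x, for p ≥ 1, is:

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$

The L2 norm (Euclidean norm) is the straight-line distance from the origin. It is the most common norm in deep learning. The L1 norm (Manhattan norm) sums absolute values. It encourages sparsity because its penalty grows at the same rate everywhere, so a small nonzero value is pushed toward zero as strongly as a large one, whereas the squared L2 penalty flattens out near zero.
The max norm (L∞) is simply the largest absolute element: $\|x\|_\infty = \max_i |x_i|$. For matrices, the Frobenius norm treats the matrix as a long vector: $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$.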
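The norms above map onto `np.linalg.norm` through its `ord` argument. In the sketch below, `lp_norm` and `clip_by_norm` are hypothetical helpers written for this example; the second is one simple way gradient clipping by norm could look, not a reference implementation.

```python
import numpy as np

def lp_norm(x, p):
    """General L^p norm: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
l1 = np.linalg.norm(x, 1)          # |3| + |-4| = 7
l2 = np.linalg.norm(x)             # sqrt(9 + 16) = 5
linf = np.linalg.norm(x, np.inf)   # max(|3|, |-4|) = 4

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.linalg.norm(A, "fro")     # sqrt(1 + 4 + 9 + 16) = sqrt(30)

def clip_by_norm(g, max_norm):
    """Rescale g so its L2 norm is at most max_norm (hypothetical helper)."""
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

clipped = clip_by_norm(np.array([30.0, -40.0]), 5.0)  # norm 50 scaled down to 5

print(l1, l2, linf, fro, clipped)
```

As p grows, `lp_norm(x, p)` approaches the max norm, which is why L∞ is written as the limiting case.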
The "unit circle" under each norm: all 2D vectors with norm = 1. Adjust p to see how the shape changes.
What does a matrix do? When you multiply a vector by a matrix, the vector generally changes both direction and length. But some special vectors only get scaled — their direction stays the same. These are eigenvectors, and the scaling factors are eigenvalues.
Formally, a nonzero vector v is an eigenvector of A if:

$$A v = \lambda v$$

Here v is an eigenvector and λ is the corresponding eigenvalue. If λ > 1, the vector gets stretched. If 0 < λ < 1, it shrinks. If λ < 0, it flips direction.
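To make the defining relation Av = λv concrete, the sketch below uses a small symmetric matrix chosen for the example and checks each pair returned by `np.linalg.eig`. NumPy returns eigenvectors as the columns of its second output and does not guarantee any ordering of the eigenvalues.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

checks = []
for k in range(len(eigenvalues)):
    lam = eigenvalues[k]
    v = eigenvectors[:, k]           # k-th eigenvector is the k-th column
    # A only scales v: the direction is unchanged
    checks.append(np.allclose(A @ v, lam * v))

print(np.sort(eigenvalues), checks)
```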
The eigendecomposition of a diagonalizable square matrix writes $A = V \operatorname{diag}(\lambda) V^{-1}$, where V is the matrix whose columns are eigenvectors. This reveals A's action: change to the eigenvector coordinate system, scale each axis by its eigenvalue, change back. When A is real and symmetric, V can be chosen orthogonal, so the change of coordinates is a rotation (possibly combined with a reflection).
A symmetric matrix is positive definite if all its eigenvalues are positive, and positive semidefinite if all are ≥ 0. Such matrices appear as covariance matrices, kernel matrices, and Hessians of convex functions.
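The factorization A = V diag(λ) V⁻¹ can be rebuilt numerically. This sketch deliberately uses a non-symmetric example matrix, for which V is not orthogonal, so the coordinate change is a general change of basis rather than a rotation. It also shows one payoff: powers of A only require powers of the eigenvalues.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])   # non-symmetric, with distinct eigenvalues 2 and 3

lam, V = np.linalg.eig(A)
V_inv = np.linalg.inv(V)

# Rebuild A from its eigenvectors and eigenvalues
A_rebuilt = V @ np.diag(lam) @ V_inv

# A^5 = V diag(lam^5) V^{-1}
A_pow5 = V @ np.diag(lam ** 5) @ V_inv
A_pow5_direct = np.linalg.matrix_power(A, 5)

# V^T V is not the identity here, so V is not an orthogonal matrix
V_is_orthogonal = np.allclose(V.T @ V, np.eye(2))

print(np.allclose(A, A_rebuilt), V_is_orthogonal)
```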
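One way to test definiteness in practice is to look at the eigenvalues of a symmetric matrix with `np.linalg.eigvalsh`. The helper `is_positive_definite` and the random data below are assumptions made for this sketch.

```python
import numpy as np

def is_positive_definite(M):
    """True if M is symmetric and all of its eigenvalues are positive."""
    if not np.allclose(M, M.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

pd_matrix = np.array([[2.0, -1.0],
                      [-1.0, 2.0]])     # eigenvalues 1 and 3
indefinite = np.array([[1.0, 2.0],
                       [2.0, 1.0]])     # eigenvalues -1 and 3

# A sample covariance matrix is symmetric positive semidefinite
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))           # 100 samples, 3 features (made-up data)
cov = np.cov(X, rowvar=False)
cov_eigenvalues = np.linalg.eigvalsh(cov)

print(is_positive_definite(pd_matrix), is_positive_definite(indefinite))
print(cov_eigenvalues)
```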
The matrix A transforms the unit circle (gray) into an ellipse (orange). The eigenvectors (teal/blue) show the directions that A only scales. For a symmetric matrix these coincide with the principal axes of the ellipse.
Eigendecomposition only works for square matrices. But most matrices in deep learning are rectangular — a weight matrix connecting 768 inputs to 512 outputs is 512×768. The Singular Value Decomposition (SVD) works for any matrix.
The SVD factors any real m×n matrix A as:

$$A = U \Sigma V^T$$

U (m×m) contains the left singular vectors. Σ (m×n) is zero except on its main diagonal, which holds the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. V (n×n) contains the right singular vectors. U and V are orthogonal: their columns are unit vectors perpendicular to each other.
Think of SVD as a three-step transformation: $V^T$ rotates the input, Σ stretches along each axis by the singular values, and U rotates the result to the output space.
The pseudoinverse $A^+$ generalizes the inverse to non-square matrices: $A^+ = V \Sigma^+ U^T$, where $\Sigma^+$ is formed by transposing Σ and inverting its nonzero singular values. When Ax = b has no exact solution, $x = A^+ b$ gives the least-squares best approximation.
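A minimal sketch of the factorization A = U Σ Vᵀ using `np.linalg.svd` on a small random rectangular matrix (the 3×5 size and the seed are arbitrary). NumPy returns the singular values as a 1D array and Vᵀ rather than V, so the sketch rebuilds the m×n matrix Σ explicitly. The last part shows a rank-k truncation, the idea behind low-rank weight compression.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 3x3, s: (3,), Vt: 5x5

# Place the singular values on the diagonal of an m x n matrix
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

A_rebuilt = U @ Sigma @ Vt

# U and V are orthogonal: their transposes are their inverses
U_orthogonal = np.allclose(U.T @ U, np.eye(3))
V_orthogonal = np.allclose(Vt @ Vt.T, np.eye(5))

# Rank-k approximation: keep only the k largest singular values
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

# The Frobenius error equals the norm of the discarded singular values
error = np.linalg.norm(A - A_k, "fro")
expected_error = np.sqrt(np.sum(s[k:] ** 2))

print(s, np.allclose(A, A_rebuilt), np.isclose(error, expected_error))
```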
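As an illustration of least squares via the pseudoinverse, the sketch below fits a line to five made-up data points. It builds the pseudoinverse from the SVD (valid here because all singular values are nonzero) and compares it with `np.linalg.pinv` and `np.linalg.lstsq`.

```python
import numpy as np

# Overdetermined system: 5 equations, 2 unknowns (slope and intercept)
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 2.9, 5.2, 7.1, 8.8])     # made-up noisy measurements
A = np.column_stack([t, np.ones_like(t)])   # shape (5, 2)

# Pseudoinverse from NumPy
A_plus = np.linalg.pinv(A)                  # shape (2, 5)
x = A_plus @ b

# Pseudoinverse built by hand from the reduced SVD: V diag(1/s) U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_plus_svd = Vt.T @ np.diag(1.0 / s) @ U.T

# Dedicated least-squares solver, for comparison
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# At the least-squares solution the residual is orthogonal to the columns of A
residual = A @ x - b

print(x, np.allclose(x, x_lstsq))
```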
Time to play. This sandbox lets you define a 2×2 matrix and see exactly what it does to space. You will see the unit circle deform into an ellipse, watch eigenvectors emerge, and build intuition for how matrices act as geometric transformations.
Adjust the four entries of the matrix. The gray unit circle transforms into the orange shape. Eigenvectors shown in teal and blue.
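The same experiment can be approximated offline. The sketch below pushes points on the unit circle through a hypothetical 2×2 matrix (not the sandbox's default, which is not given here) and compares the resulting ellipse with the determinant, eigenvalues, and singular values. For this non-symmetric example, the longest and shortest radii of the ellipse match the singular values, not the eigenvalues.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 1.0]])   # hypothetical example matrix

# Sample the unit circle as a 2 x N array of points
theta = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)])

# Apply the matrix to every point at once
ellipse = A @ circle
radii = np.linalg.norm(ellipse, axis=0)

det_A = np.linalg.det(A)                   # area scaling factor
eigenvalues = np.linalg.eigvals(A)         # scaling along eigenvector directions
singular_values = np.linalg.svd(A, compute_uv=False)

print("det:", det_A)
print("eigenvalues:", np.sort(eigenvalues))
print("singular values:", singular_values)
print("longest / shortest radius:", radii.max(), radii.min())
```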
This chapter covered the linear algebra toolkit that the rest of "Deep Learning" relies on. Here is how each concept maps to later chapters:
| Concept | Where It Appears |
|---|---|
| Matrix multiplication | Every neural network layer (Ch 6) |
| Norms | Regularization penalties (Ch 7), gradient clipping (Ch 8) |
| Eigenvalues | Hessian analysis, loss surface geometry (Ch 8) |
| SVD / low-rank | Weight compression, PCA for data preprocessing |
| Determinant | Change-of-variables in probability (Ch 3), normalizing flows |
| Positive definiteness | Covariance matrices (Ch 3), convex optimization |
Up next: Chapter 3: Probability & Information Theory — the other mathematical pillar of deep learning. Where linear algebra provides the data structures, probability provides the framework for reasoning under uncertainty.