The mathematics of change — and how machines learn from it.
You have a neural network with a million parameters. It makes a prediction. The prediction is wrong. You want to adjust every one of those million parameters so the prediction gets a little better. But which direction should each parameter move, and by how much?
This is the fundamental problem of machine learning, and it's answered by calculus. Specifically, by the gradient — a vector that points in the direction of steepest increase of a function. Go the opposite way and you reduce the error. That's gradient descent, and it powers virtually every modern ML system.
Interactive figure: a loss curve L(θ) with the current parameter shown as an orange dot. Each gradient descent step applies θ ← θ − α · dL/dθ, and the derivative (slope) guides the dot toward the minimum.
This chapter builds the machinery from the ground up. We start with the basic derivative, extend it to multiple variables (partial derivatives and gradients), generalize to vector-valued functions (the Jacobian), and culminate in backpropagation — the algorithm that efficiently computes gradients through deep networks by clever application of the chain rule.
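The derivative of f at x measures the instantaneous rate of change. Start from the definition:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This is the slope of the tangent line. Take f(x) = x². Plug in:

$$f'(x) = \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$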
So f'(x) = 2x. At x = 3, the slope is 6 — the function is increasing at a rate of 6 units per unit of x. At x = 0, the slope is 0 — a flat spot (the minimum). At x = −2, the slope is −4 — the function is decreasing.
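The limit definition can be checked numerically. The following is a minimal sketch; the helper name `difference_quotient` and the step sizes are illustrative choices, not something the chapter prescribes.

```python
def f(x):
    return x ** 2

def difference_quotient(func, x, h):
    """Slope of the secant line through (x, func(x)) and (x + h, func(x + h))."""
    return (func(x + h) - func(x)) / h

# For f(x) = x^2 the quotient equals 2x + h, so it approaches 2x as h shrinks.
for h in (1.0, 0.1, 0.001):
    print(h, difference_quotient(f, 3.0, h))

slope_at_3 = difference_quotient(f, 3.0, 1e-6)
print(slope_at_3)
```

As h gets smaller, the printed slopes should settle near 6, the value of 2x at x = 3.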
The derivative rules you need for ML:
| Function | Derivative | Used in |
|---|---|---|
| xⁿ | nxⁿ⁻¹ | Polynomial models |
| eˣ | eˣ | Softmax, Gaussian |
| ln(x) | 1/x | Log-likelihood |
| σ(x) = 1/(1 + e⁻ˣ) | σ(x)(1 − σ(x)) | Sigmoid activation |
| max(0, x) | 1 if x > 0, else 0 | ReLU activation |
Two rules compose derivatives of compound expressions:
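Product rule:

$$\frac{d}{dx}\big[f(x)\,g(x)\big] = f'(x)\,g(x) + f(x)\,g'(x)$$

Chain rule:

$$\frac{d}{dx}\,f(g(x)) = f'(g(x)) \cdot g'(x)$$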
The chain rule is by far the most important for ML. It says: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). Derivatives of compositions multiply. This innocent-looking rule is the entire foundation of backpropagation.
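To see the chain rule at work on a function that shows up in ML, here is a small sketch that differentiates a sigmoid applied to a linear function and compares the result with a central finite difference. The inner function g(x) = 3x + 1 and the test point are arbitrary assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def central_difference(func, x, h=1e-6):
    """Symmetric finite-difference estimate of func'(x)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

def g(x):
    return 3.0 * x + 1.0        # inner function, with g'(x) = 3

def y(x):
    return sigmoid(g(x))        # composition y = f(g(x)) with f = sigmoid

x0 = 0.2
# Chain rule: dy/dx = f'(g(x)) * g'(x), where sigmoid'(u) = sigmoid(u) * (1 - sigmoid(u)).
s = sigmoid(g(x0))
analytic = s * (1.0 - s) * 3.0
numeric = central_difference(y, x0)
print(analytic, numeric)
```

The two printed numbers should agree to many decimal places, which is a cheap sanity check worth running whenever a derivative is derived by hand.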
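If you know f and all its derivatives at a single point x₀, you can reconstruct the function near that point (exactly so for analytic functions such as sin, exp, and polynomials). The Taylor series is this reconstruction:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots$$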
The first two terms give the linear approximation (tangent line). Add the quadratic term and you get the second-order approximation (a parabola that matches the function's curvature). Each term adds finer detail.
Figure: the true function f(x) = sin(x) in teal and its Taylor polynomial in orange, centered at x₀ = 0. Increasing the order improves the approximation.
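The improvement with increasing order can also be measured directly. This sketch builds the Taylor polynomial of sin about 0 term by term; the function name `sin_taylor` and the evaluation point x = 1 are illustrative assumptions.

```python
import math

def sin_taylor(x, order):
    """Taylor polynomial of sin about 0, keeping terms up to x**order."""
    total = 0.0
    for n in range(1, order + 1, 2):       # even-power terms vanish for sin
        sign = (-1) ** ((n - 1) // 2)      # signs alternate: +, -, +, ...
        total += sign * x ** n / math.factorial(n)
    return total

x = 1.0
errors = {order: abs(sin_taylor(x, order) - math.sin(x)) for order in (1, 3, 5, 7)}
for order, err in errors.items():
    print(order, err)
```

Each additional term should shrink the error at x = 1 by roughly one to two orders of magnitude.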
Why does ML care about Taylor series? Three reasons:
| Application | Which terms |
|---|---|
| Gradient descent | First-order: f(x + δ) ≈ f(x) + ∇f · δ |
| Newton's method | Second-order: includes the Hessian (curvature) |
| Loss surface analysis | Second-order: saddle points, condition number |
The factorial denominators (1/n!) are crucial. Differentiating (x − x₀)ⁿ n times produces a factor of n!, and dividing by n! cancels it, so the n-th derivative of the polynomial at x₀ matches the n-th derivative of f. The factorials also damp the higher-order terms, so the polynomial stays close to the true function near x₀.
So far, f takes one number in and produces one number out. But a neural network loss depends on millions of parameters simultaneously. We need derivatives of functions of many variables.
The partial derivative of f(x₁, x₂, ..., xₙ) with respect to xᵢ is simply: hold all other variables fixed and differentiate with respect to xᵢ alone.
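Example: f(x, y) = x²y + 3y. The partial derivatives are:

$$\frac{\partial f}{\partial x} = 2xy \quad \text{(treat } y \text{ as a constant)}$$

$$\frac{\partial f}{\partial y} = x^2 + 3 \quad \text{(treat } x \text{ as a constant)}$$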
At the point (x, y) = (2, 1): ∂f/∂x = 2·2·1 = 4 and ∂f/∂y = 4 + 3 = 7. The function is increasing faster in the y-direction than the x-direction at this point.
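"Hold the other variable fixed" translates directly into code: perturb one argument and leave the other alone. The helper names below are hypothetical, and the sketch uses central differences as a numerical stand-in for the exact partial derivatives.

```python
def f(x, y):
    return x ** 2 * y + 3 * y

def partial_x(func, x, y, h=1e-6):
    """Vary x only; y is held fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    """Vary y only; x is held fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

df_dx = partial_x(f, 2.0, 1.0)   # analytic value: 2xy = 4
df_dy = partial_y(f, 2.0, 1.0)   # analytic value: x^2 + 3 = 7
print(df_dx, df_dy)
```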
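The total derivative matters when variables are not independent. If f = f(x, y) with x = x(t) and y = y(t), then by the chain rule:

$$\frac{df}{dt} = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}$$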
This is the multivariate chain rule in action: the total rate of change is the sum of each partial derivative times the rate at which that variable changes. This structure — sum of products of local derivatives — is exactly what backpropagation exploits.
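Collect all the partial derivatives into one vector, and you get the gradient:

$$\nabla f = \left[\frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n}\right]$$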
The gradient has a beautiful geometric meaning: at any point, it points in the direction of steepest ascent. Its magnitude tells you how steep that ascent is. To minimize a function, you walk in the opposite direction of the gradient.
Figure: contour plot of f(x, y) = x² + 2y². Arrows show the negative gradient (descent direction), and the orange dot follows gradient descent one step at a time.
For f(x, y) = x² + 2y², the gradient is ∇f = [2x, 4y]. At the point (3, 2), the gradient is [6, 8]. The steepest ascent direction is (6, 8)/||(6, 8)|| = (6, 8)/10 = (0.6, 0.8). To descend, we go the opposite way: (−0.6, −0.8).
The gradient descent update rule is:

$$\theta \leftarrow \theta - \alpha \, \nabla_\theta L(\theta)$$
where α is the learning rate. Too large and you overshoot. Too small and you crawl. Finding the right α is a central practical challenge in ML optimization.
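The effect of the learning rate is easy to reproduce on f(x, y) = x² + 2y². The sketch below runs the update rule from the point (3, 2) with three values of α; the specific rates and the step count are assumptions picked to show the three regimes.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x**2 + 2*y**2."""
    return 2 * x, 4 * y

def gradient_descent(start, alpha, steps):
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - alpha * gx, y - alpha * gy   # step against the gradient
    return x, y

good = gradient_descent((3.0, 2.0), alpha=0.1, steps=50)    # converges toward (0, 0)
slow = gradient_descent((3.0, 2.0), alpha=0.001, steps=50)  # barely moves
bad = gradient_descent((3.0, 2.0), alpha=0.6, steps=50)     # y overshoots and diverges
print(good, slow, bad)
```

With α = 0.6 each step multiplies y by 1 − 4α = −1.4, so the iterate flips sign and grows. The steeper direction sets the largest usable learning rate.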
In a neural network, the loss L depends on the output, which depends on the last layer's weights and on the previous layer's output, which in turn depends on that layer's weights, and so on. To compute ∂L/∂w for a weight w buried deep in the network, we need to chain together derivatives through every layer in between.
The multivariate chain rule handles this. If f = f(g₁(x), g₂(x), ..., gₘ(x)), then:

$$\frac{df}{dx} = \sum_{i=1}^{m} \frac{\partial f}{\partial g_i}\,\frac{dg_i}{dx}$$
Worked example: Let f(x) = (x² + 1)³. Define g(x) = x² + 1, so f = g³.

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} = 3g^2 \cdot 2x = 6x\,(x^2 + 1)^2$$
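Now a two-layer example closer to neural networks. Let y = σ(w₂ · σ(w₁x + b₁) + b₂), where σ is the sigmoid. Write z₁ = w₁x + b₁, h = σ(z₁), and z₂ = w₂h + b₂, so that y = σ(z₂). To find ∂y/∂w₁:

$$\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial z_2} \cdot \frac{\partial z_2}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = \sigma'(z_2) \cdot w_2 \cdot \sigma'(z_1) \cdot x$$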
Notice the product σ'(z₂) · σ'(z₁). Since σ'(z) ≤ 0.25, every sigmoid layer contributes a factor of at most 0.25, so unless the weights are large enough to compensate, the gradient shrinks exponentially with depth. With 50 sigmoid layers: 0.25⁵⁰ ≈ 10⁻³⁰. This is the vanishing gradient problem, and a major reason ReLU (whose derivative is 0 or 1) largely replaced sigmoid in deep networks.
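The two-layer derivative can be verified numerically, and the 0.25 bound can be read off the sigmoid itself. In this sketch the pre-activations are named `z1` and `z2`, the hidden activation is `sigmoid(z1)`, and the parameter values are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(w1, b1, w2, b2, x):
    z1 = w1 * x + b1
    h = sigmoid(z1)
    z2 = w2 * h + b2
    return z1, z2, sigmoid(z2)

w1, b1, w2, b2, x = 0.5, -0.2, 1.5, 0.1, 2.0   # arbitrary illustrative values
z1, z2, y = forward(w1, b1, w2, b2, x)

# Chain rule through both layers: dy/dw1 = sigma'(z2) * w2 * sigma'(z1) * x
analytic = sigmoid_prime(z2) * w2 * sigmoid_prime(z1) * x

eps = 1e-6
y_plus = forward(w1 + eps, b1, w2, b2, x)[2]
y_minus = forward(w1 - eps, b1, w2, b2, x)[2]
numeric = (y_plus - y_minus) / (2 * eps)

max_slope = sigmoid_prime(0.0)       # the sigmoid is steepest at z = 0
bound_50_layers = max_slope ** 50    # upper bound on a product of 50 such factors
print(analytic, numeric, max_slope, bound_50_layers)
```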
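What if your function has vector input AND vector output? A function f: Rⁿ → Rᵐ maps an n-vector to an m-vector. Its derivative needs to capture how each output changes with each input. That's the Jacobian matrix.

$$J = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}$$

The Jacobian is an m×n matrix. Row i contains the gradient of fᵢ (the i-th output). Column j shows how all outputs respond to changes in xⱼ.

Example: f(x₁, x₂) = [x₁² + x₂, x₁x₂]. The Jacobian is:

$$J = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix}$$

At (x₁, x₂) = (3, 2): J = [[6, 1], [2, 3]]. This tells us: a small change δx₁ = 0.01 changes f₁ by about 0.06 and f₂ by about 0.02. A small change δx₂ = 0.01 changes f₁ by about 0.01 and f₂ by about 0.03.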
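The column-by-column reading of the Jacobian suggests a simple numerical check: nudge one input at a time and record how every output moves. The function `numerical_jacobian` below is a hypothetical helper written for this example, not a library routine.

```python
def f(x1, x2):
    return [x1 ** 2 + x2, x1 * x2]

def numerical_jacobian(func, point, h=1e-6):
    """Approximate the m x n Jacobian with central differences, one column per input."""
    n = len(point)
    m = len(func(*point))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(point)
        minus = list(point)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = func(*plus), func(*minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

J = numerical_jacobian(f, [3.0, 2.0])
print(J)   # entries should be close to [[6, 1], [2, 3]]
```

Note the cost: one pair of function evaluations per input dimension, which is why this approach does not scale to millions of parameters.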
| Function type | Derivative name | Shape |
|---|---|---|
| f: R → R | Derivative f' | Scalar |
| f: Rⁿ → R | Gradient ∇f | 1 × n (row vector) |
| f: Rⁿ → Rᵐ | Jacobian J | m × n (matrix) |
| f: Rⁿ → R, 2nd deriv | Hessian H | n × n (symmetric) |
The Hessian H is the Jacobian of the gradient: Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ. It's symmetric whenever the second partial derivatives are continuous (by Schwarz's theorem, mixed partials commute). The Hessian tells you about curvature: at a critical point (where ∇f = 0), positive definite H means a local minimum, negative definite means a local maximum, and indefinite means a saddle point.
In ML, we often need derivatives of scalar-valued functions with respect to matrices. For instance, the loss L(θ) depends on weight matrices W. How do we compute ∂L/∂W?
The key identities you'll use constantly:
| Function f(X) | ∂f/∂X |
|---|---|
| tr(AX) | Aᵀ |
| tr(XᵀA) | A |
| tr(XᵀX) | 2X |
| aᵀXb (= tr(baᵀX)) | abᵀ |
| ||Xw − y||² | 2Xᵀ(Xw − y) w.r.t. w |
Let's derive the most important one: the gradient of the squared loss.
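Let L(w) = ||Xw − y||² = (Xw − y)ᵀ(Xw − y). Expand:

$$L(w) = w^\top X^\top X w - 2\,y^\top X w + y^\top y$$

Differentiate term by term:

$$\nabla_w\big(w^\top X^\top X w\big) = 2X^\top X w, \qquad \nabla_w\big(2\,y^\top X w\big) = 2X^\top y, \qquad \nabla_w\big(y^\top y\big) = 0$$

Combining: ∇_w L = 2XᵀXw − 2Xᵀy = 2Xᵀ(Xw − y).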
For the rule ∂tr(AX)/∂X = Aᵀ, the derivation uses the fact that tr(AX) = ∑ᵢⱼ AⱼᵢXᵢⱼ. Taking ∂/∂Xᵢⱼ gives Aⱼᵢ = (Aᵀ)ᵢⱼ. Stacking all these partial derivatives gives the matrix Aᵀ.
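The squared-loss gradient 2Xᵀ(Xw − y) is worth checking once by hand. This sketch uses plain Python lists and made-up data (the values of `X`, `y`, and `w` are assumptions) and compares the closed form against central differences.

```python
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 samples, 2 features (made-up data)
y = [1.0, 2.0, 3.0]
w = [0.5, -0.25]

def residual(w):
    """Xw - y, one entry per sample."""
    return [sum(xij * wj for xij, wj in zip(row, w)) - yi for row, yi in zip(X, y)]

def loss(w):
    return sum(r * r for r in residual(w))

def analytic_grad(w):
    """Closed form 2 X^T (Xw - y)."""
    r = residual(w)
    return [2.0 * sum(X[i][j] * r[i] for i in range(len(X))) for j in range(len(w))]

def numeric_grad(w, h=1e-6):
    grad = []
    for j in range(len(w)):
        up = list(w)
        down = list(w)
        up[j] += h
        down[j] -= h
        grad.append((loss(up) - loss(down)) / (2 * h))
    return grad

print(analytic_grad(w), numeric_grad(w))
```

For this data the residuals are (−1, −1.5, −2), so the closed form gives (−31, −40), and the numerical estimate should match it closely.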
Everything we've built leads here. Backpropagation is the algorithm that computes the gradient of a loss function with respect to all parameters in a computational graph. It's just the chain rule, applied systematically from output back to inputs.
Interactive figure: a small computation graph for L = (wx + b − y)². The forward pass propagates values left to right; the backward pass sends gradients right to left. Changing w and b changes the gradients.
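The algorithm has two passes: a forward pass that evaluates the graph from inputs to output and stores every intermediate value, and a backward pass that starts at the loss and applies the chain rule from output back to inputs, reusing the stored values.
For L = (wx + b − y)², let's define intermediate variables: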
```python
# forward pass
w, x, b, y = 1.5, 2.0, 0.5, 1.0  # example values

z = w * x   # multiply
a = z + b   # add bias
r = a - y   # residual
L = r ** 2  # squared error
```
The backward pass computes gradients using the chain rule, starting from ∂L/∂L = 1:
```
# backward pass
∂L/∂r = 2r
∂L/∂a = ∂L/∂r · ∂r/∂a = 2r · 1 = 2r
∂L/∂b = ∂L/∂a · ∂a/∂b = 2r · 1 = 2r
∂L/∂z = ∂L/∂a · ∂a/∂z = 2r · 1 = 2r
∂L/∂w = ∂L/∂z · ∂z/∂w = 2r · x = 2(wx+b-y)x
```
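Both passes fit in one short function. This is a hand-written sketch of the computation above; the function name `forward_backward` and the input values are illustrative assumptions.

```python
def forward_backward(w, b, x, y):
    # Forward pass: compute and keep every intermediate value.
    z = w * x
    a = z + b
    r = a - y
    L = r ** 2

    # Backward pass: multiply local derivatives from the output back to the inputs.
    dL_dr = 2 * r
    dL_da = dL_dr * 1.0      # r = a - y, so dr/da = 1
    dL_db = dL_da * 1.0      # a = z + b, so da/db = 1
    dL_dz = dL_da * 1.0      # a = z + b, so da/dz = 1
    dL_dw = dL_dz * x        # z = w * x, so dz/dw = x
    return L, dL_dw, dL_db

L, dL_dw, dL_db = forward_backward(w=1.5, b=0.5, x=2.0, y=1.0)
print(L, dL_dw, dL_db)   # 6.25 10.0 5.0
```

Note that `dL_da` is computed once and reused for both `dL_db` and `dL_dz`. That reuse of shared upstream gradients is where backpropagation gets its efficiency.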
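In matrix notation for a full layer with weight matrix W, bias b, input x, pre-activation z = Wx + b, and ReLU activation a = ReLU(z), treating gradients as column vectors:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \odot \mathbb{1}[z > 0], \qquad \frac{\partial L}{\partial W} = \frac{\partial L}{\partial z}\,x^\top, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}, \qquad \frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial z}$$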
The upstream gradient ∂L/∂a is passed down from the layer above. Each layer only needs to know its own local derivative and the incoming gradient. This locality is what makes backprop modular — you can swap activation functions, add layers, or change architectures, and the gradient computation "just works."
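The locality property can be made concrete with a single ReLU layer written as a forward function and a backward function. This is a sketch under stated assumptions: the layer is z = Wx + b followed by a = ReLU(z), the function names are hypothetical, and the toy loss L = sum(a) is chosen only so that the upstream gradient is a vector of ones.

```python
def relu_layer_forward(W, b, x):
    z = [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    a = [max(0.0, zi) for zi in z]
    return z, a

def relu_layer_backward(W, x, z, dL_da):
    """Return (dL_dW, dL_db, dL_dx) given only local values and the upstream gradient."""
    dL_dz = [g if zi > 0 else 0.0 for g, zi in zip(dL_da, z)]   # ReLU passes or blocks
    dL_dW = [[gi * xj for xj in x] for gi in dL_dz]              # outer product dL_dz x^T
    dL_db = list(dL_dz)
    dL_dx = [sum(W[i][j] * dL_dz[i] for i in range(len(W)))      # W^T dL_dz
             for j in range(len(x))]
    return dL_dW, dL_db, dL_dx

W = [[1.0, -2.0], [0.5, 1.0]]
b = [0.0, -1.0]
x = [1.0, 2.0]
z, a = relu_layer_forward(W, b, x)   # z = [-3.0, 1.5], so the first unit is inactive
dL_da = [1.0, 1.0]                   # upstream gradient for the toy loss L = sum(a)
dL_dW, dL_db, dL_dx = relu_layer_backward(W, x, z, dL_da)
print(dL_dW, dL_db, dL_dx)
```

The inactive unit receives zero gradient for its weights and bias, and `dL_dx` is what this layer would hand to the layer before it.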
Backpropagation is a specific case of a broader idea: automatic differentiation (autodiff). Autodiff computes derivatives of any program built from differentiable operations by tracing the computation and applying the chain rule mechanically. There is no finite-difference approximation and no symbolic expression manipulation: the derivatives are exact up to floating-point rounding.
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Symbolic | Manipulate algebraic expressions | Exact, interpretable | Expression explosion for complex functions |
| Numerical | Finite differences: (f(x+h)−f(x))/h | Simple, works for any black-box function | Approximate, O(n) function evaluations per gradient, sensitive to the choice of h |
| Forward-mode AD | Propagate derivatives alongside values | Exact, efficient for few inputs | One pass per input dimension |
| Reverse-mode AD | Record forward, replay backward | Exact, one pass for all gradients | Memory for the tape |
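The forward-mode row of the table can be illustrated with dual numbers: each value carries its derivative with respect to one chosen input. The class `Dual` below is a minimal sketch written for this example (it supports only `+`, `-`, and `*` with the dual number on the left) and is not taken from any library.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule applied to the (value, derivative) pairs.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def loss(w, b, x, y):
    r = w * x + b - y
    return r * r

# One forward pass per input: seed deriv=1.0 on the input being differentiated.
dL_dw = loss(Dual(1.5, 1.0), Dual(0.5), 2.0, 1.0).deriv
dL_db = loss(Dual(1.5), Dual(0.5, 1.0), 2.0, 1.0).deriv
print(dL_dw, dL_db)   # 10.0 5.0
```

Two parameters needed two passes. With a million parameters that becomes a million passes, which is the cost the table attributes to forward mode.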
Reverse-mode autodiff is backpropagation generalized. The "tape" (computation graph) is recorded during the forward pass. The backward pass walks the tape in reverse, accumulating gradients. PyTorch's autograd and JAX's grad implement this.
```python
import torch

# Define parameters (requires_grad=True tells PyTorch to track)
w = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(1.0)

# Forward pass (builds computation graph automatically)
L = (w * x + b - y) ** 2

# Backward pass (computes all gradients in one call)
L.backward()

print(w.grad)  # dL/dw = 2(wx+b-y)*x = 2(3.5-1)*2 = 10.0
print(b.grad)  # dL/db = 2(wx+b-y)*1 = 2(3.5-1) = 5.0
```
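To demystify what a call like `backward()` does, here is a toy reverse-mode implementation in pure Python. It is a sketch of the general idea only, not how PyTorch is implemented: the class `Var` is hypothetical, handles scalars only, and supports just `+`, `-`, and `*`.

```python
class Var:
    """Scalar node that remembers its parents and the local derivative to each."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # tuple of (parent_node, local_derivative) pairs
        self.grad = 0.0

    @staticmethod
    def lift(other):
        return other if isinstance(other, Var) else Var(other)

    def __add__(self, other):
        other = Var.lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        other = Var.lift(other)
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = Var.lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the graph so every node comes after its parents, then sweep in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                           # dL/dL = 1
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local  # chain rule, summed over paths

w, b = Var(1.5), Var(0.5)
r = w * 2.0 + b - 1.0     # x = 2.0 and y = 1.0 enter as constants
L = r * r
L.backward()
print(w.grad, b.grad)     # 10.0 5.0
```

A single backward sweep fills in the gradient for every parameter at once, at the price of keeping the whole graph in memory until the sweep is done.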
The connection between this chapter's ideas and real ML frameworks:
| Concept | In Code |
|---|---|
| Forward pass | loss = loss_fn(model(x), y) |
| Build computation graph | Happens automatically during the forward pass (PyTorch) or when the function is traced (JAX, e.g. by jax.grad or jax.jit) |
| Backward pass | loss.backward() or jax.grad(loss_fn)(params) |
| Gradient descent step | optimizer.step() |
"The chain rule is the most important single idea in differential calculus." — Gilbert Strang