Deisenroth et al., Chapter 5

Vector Calculus

The mathematics of change — and how machines learn from it.

Prerequisites: Chapters 2–4 (linear algebra & matrix decompositions). That's it.
10 chapters · 4+ simulations · 10 quizzes

Chapter 0: Why Calculus?

You have a neural network with a million parameters. It makes a prediction. The prediction is wrong. You want to adjust every one of those million parameters so the prediction gets a little better. But which direction should each parameter move, and by how much?

This is the fundamental problem of machine learning, and it's answered by calculus. Specifically, by the gradient — a vector that points in the direction of steepest increase of a function. Go the opposite way and you reduce the error. That's gradient descent, and it powers virtually every modern ML system.

Loss Landscape: Gradient Descent in 1D

The curve is a loss function L(θ). The orange dot is your current parameter. Click Step to take one gradient descent step: θ ← θ − α · dL/dθ. Watch how the derivative (slope) guides the dot to the minimum.

Starting point: θ = 3.0. Learning rate: α = 0.10.
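The update rule θ ← θ − α · dL/dθ can be sketched in a few lines of Python. The lesson does not say which loss the simulation plots, so the quadratic L(θ) = θ² below is an assumption chosen only for illustration; the starting point θ = 3.0 and the learning rate α = 0.10 are the values shown in the simulation.

```python
# Assumed loss, for illustration only: L(theta) = theta**2 (minimum at theta = 0).
def loss(theta):
    return theta ** 2


def dloss(theta):
    # Analytic derivative of the assumed loss: dL/dtheta = 2 * theta
    return 2.0 * theta


def gradient_descent_1d(theta, alpha, steps):
    """Apply theta <- theta - alpha * dL/dtheta for a fixed number of steps."""
    history = [theta]
    for _ in range(steps):
        theta = theta - alpha * dloss(theta)
        history.append(theta)
    return history


history = gradient_descent_1d(theta=3.0, alpha=0.10, steps=20)
for k in (0, 1, 2, 20):
    print(f"step {k:2d}: theta = {history[k]:.4f}, loss = {loss(history[k]):.4f}")
```

With this particular loss, every step multiplies θ by (1 − 2α) = 0.8, so the dot moves steadily toward the minimum at 0.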

This chapter builds the machinery from the ground up. We start with the basic derivative, extend it to multiple variables (partial derivatives and gradients), generalize to vector-valued functions (the Jacobian), and culminate in backpropagation — the algorithm that efficiently computes gradients through deep networks by clever application of the chain rule.

The big picture: Calculus gives us the derivative. The derivative tells us which way is downhill. Going downhill reduces the loss. Repeat. That's the entire training loop of modern ML, from linear regression to GPT-4.

Derivative: rate of change of one variable with respect to another
↓ generalize to many variables
Gradient: vector of all partial derivatives (points uphill)
↓ generalize to vector outputs
Jacobian: matrix of all partial derivatives of a vector function
↓ compose with chain rule
Backpropagation: efficient gradient computation through deep networks

Check: What does the gradient of a loss function tell you?

Chapter 1: Derivatives From Scratch

The derivative of f at x measures the instantaneous rate of change. Start from the definition:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This is the slope of the tangent line. Take f(x) = x². Plug in:

$$\frac{(x+h)^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h \;\to\; 2x \quad \text{as } h \to 0$$

So f'(x) = 2x. At x = 3, the slope is 6 — the function is increasing at a rate of 6 units per unit of x. At x = 0, the slope is 0 — a flat spot (the minimum). At x = −2, the slope is −4 — the function is decreasing.
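The limit can be watched numerically. The sketch below evaluates the difference quotient for f(x) = x² at x = 3 with shrinking h; the helper name `difference_quotient` is made up for this example.

```python
def f(x):
    return x ** 2


def difference_quotient(func, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (func(x + h) - func(x)) / h


# For f(x) = x**2 the quotient equals 2x + h, so it approaches 2x = 6 as h shrinks.
for h in (1.0, 0.1, 0.001, 1e-6):
    print(f"h = {h:<8g} quotient = {difference_quotient(f, 3.0, h):.6f}")
```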

The derivative rules you need for ML:

| Function | Derivative | Used in |
| --- | --- | --- |
| x^n | n·x^(n−1) | Polynomial models |
| e^x | e^x | Softmax, Gaussian |
| ln(x) | 1/x | Log-likelihood |
| σ(x) = 1/(1 + e^(−x)) | σ(x)(1 − σ(x)) | Sigmoid activation |
| max(0, x) | 1 if x > 0, else 0 | ReLU activation |

Key insight: The derivative of the sigmoid σ(x) = 1/(1 + e^(−x)) is σ(x)(1 − σ(x)). It is maximized at x = 0 (where the slope is 0.25) and vanishes as x → ±∞. This "vanishing gradient" at the extremes is why deep sigmoid networks are hard to train: each sigmoid layer multiplies the gradient by a factor of at most 0.25, so it can shrink exponentially through many layers.
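A quick numerical look at the sigmoid's slope, as a minimal sketch using only the standard library. The sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)


# The slope peaks at x = 0 and decays quickly on both sides.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}   sigma'(x) = {sigmoid_prime(x):.6f}")
```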

Two rules compose derivatives of compound expressions:

Product rule:

(fg)' = f'g + fg'

Chain rule:

(f ∘ g)' = f'(g(x)) · g'(x)

The chain rule is by far the most important for ML. It says: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). Derivatives of compositions multiply. This innocent-looking rule is the entire foundation of backpropagation.

Check: What is the derivative of f(x) = (3x + 1)² using the chain rule?

Chapter 2: Taylor Series

If you know f and all its derivatives at a single point x₀, you can reconstruct the function near x₀, provided f is sufficiently well-behaved (analytic). The Taylor series is this reconstruction:

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \frac{f'''(x_0)}{3!}(x - x_0)^3 + \dots$$

The first two terms give the linear approximation (tangent line). Add the quadratic term and you get the second-order approximation (a parabola that matches the function's curvature). Each term adds finer detail.
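To see "each term adds finer detail" in numbers, here is a small sketch that builds the Taylor polynomial of sin(x) about x₀ = 0 and compares it with `math.sin`. The function name `taylor_sin` and the evaluation point x = 1 are choices made for this example.

```python
import math


def taylor_sin(x, order):
    """Taylor polynomial of sin about x0 = 0, keeping terms up to x**order."""
    total = 0.0
    for n in range(1, order + 1, 2):           # sin has only odd-power terms
        sign = (-1) ** ((n - 1) // 2)          # derivatives at 0 cycle +1, -1, +1, ...
        total += sign * x ** n / math.factorial(n)
    return total


x = 1.0
for order in (1, 3, 5, 7):
    approx = taylor_sin(x, order)
    print(f"order {order}: {approx:.6f}   error = {abs(approx - math.sin(x)):.2e}")
```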

Taylor Approximation

The true function f(x) = sin(x) is shown in teal and the Taylor polynomial, centered at x₀ = 0, in orange. Increase the order to see the approximation improve.

Order: 1

Why does ML care about Taylor series? Three reasons:

| Application | Which terms |
| --- | --- |
| Gradient descent | First-order: f(x + δ) ≈ f(x) + ∇f · δ |
| Newton's method | Second-order: includes the Hessian (curvature) |
| Loss surface analysis | Second-order: saddle points, condition number |

The multivariate Taylor expansion (preview): For f: ℝ^n → ℝ, centered at x₀:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \tfrac{1}{2}(x - x_0)^\top H(x_0)\,(x - x_0)$$

where ∇f is the gradient (first-order) and H is the Hessian matrix of second derivatives (second-order). Gradient descent uses the first-order term. Newton's method uses both.

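The factorial denominators (1/n!) are crucial. Differentiating (x − x₀)^n a total of n times leaves the constant n!, so dividing by n! makes the n-th derivative of the polynomial at x₀ equal the n-th derivative of f there. The factorials also grow fast enough to keep the higher-order terms small near x₀, so the polynomial stays close to the true function in that neighborhood.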

Check: Gradient descent uses which level of Taylor approximation?

Chapter 3: Going Multivariate

So far, f takes one number in and produces one number out. But a neural network loss depends on millions of parameters simultaneously. We need derivatives of functions of many variables.

The partial derivative of f(x₁, x₂, ..., x_n) with respect to x_i is simply: hold all other variables fixed and differentiate with respect to x_i alone.

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}$$

Example: f(x, y) = x²y + 3y. The partial derivatives are:

$$\frac{\partial f}{\partial x} = 2xy$$

(treat y as a constant)

$$\frac{\partial f}{\partial y} = x^2 + 3$$

(treat x as a constant)

At the point (x, y) = (2, 1): ∂f/∂x = 2·2·1 = 4 and ∂f/∂y = 4 + 3 = 7. The function is increasing faster in the y-direction than the x-direction at this point.
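These two values can be checked by nudging one variable at a time. The sketch below uses a central difference (a symmetric variant of the limit definition, chosen here for accuracy); the helper names and the step size h are assumptions for this example.

```python
def f(x, y):
    return x ** 2 * y + 3 * y


def partial_x(func, x, y, h=1e-6):
    # Hold y fixed, nudge only x.
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    # Hold x fixed, nudge only y.
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


print(partial_x(f, 2.0, 1.0))  # close to 2xy = 4
print(partial_y(f, 2.0, 1.0))  # close to x^2 + 3 = 7
```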

Notation note: ∂f/∂x_i uses the "curly d" (∂) to distinguish partial derivatives from the ordinary derivative d/dx. When there's only one variable, ∂ and d are the same thing. Some ML texts use ∇_{x_i} f instead.

The total derivative matters when variables are not independent. If x = x(t) and y = y(t), then by the chain rule:

$$\frac{df}{dt} = \frac{\partial f}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial f}{\partial y} \cdot \frac{dy}{dt}$$

This is the multivariate chain rule in action: the total rate of change is the sum of each partial derivative times the rate at which that variable changes. This structure — sum of products of local derivatives — is exactly what backpropagation exploits.

Check: For f(x, y) = xy² + x³, what is ∂f/∂x?

Chapter 4: The Gradient

Collect all the partial derivatives into one vector, and you get the gradient:

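$$\nabla f = \left[ \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \dots,\; \frac{\partial f}{\partial x_n} \right]$$

Convention alert: Deisenroth et al. define the gradient as a row vector (the Jacobian of a scalar function). Many ML textbooks define it as a column vector. In this lesson, we follow the book's convention, but the choice matters mainly for matrix multiplication shapes: wherever the gradient is added to a parameter vector (as in a gradient descent update), read it as transposed to match the shape of x. The direction and magnitude are the same either way.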

The gradient has a beautiful geometric meaning: at any point, it points in the direction of steepest ascent. Its magnitude tells you how steep that ascent is. To minimize a function, you walk in the opposite direction of the gradient.

2D Gradient Field

Contour plot of f(x, y) = x² + 2y². Arrows show the negative gradient (descent direction). The orange dot follows gradient descent. Click Step to iterate.

Learning rate α = 0.15

For f(x, y) = x² + 2y², the gradient is ∇f = [2x, 4y]. At the point (3, 2), the gradient is [6, 8]. The steepest ascent direction is (6, 8)/||(6, 8)|| = (0.6, 0.8). To descend, we go the opposite way: (−0.6, −0.8).

The gradient descent update rule is:

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where α is the learning rate. Too large and you overshoot. Too small and you crawl. Finding the right α is a central practical challenge in ML optimization.
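As a minimal sketch, the update rule applied to f(x, y) = x² + 2y² looks like this. The learning rate α = 0.15 matches the simulation's setting; the starting point (3, 2) and the number of steps are assumptions made for this example.

```python
def f(x, y):
    return x ** 2 + 2 * y ** 2


def grad_f(x, y):
    # Analytic gradient of f: [2x, 4y]
    return (2.0 * x, 4.0 * y)


def gradient_descent(start, alpha, steps):
    """Repeat x_{k+1} = x_k - alpha * grad f(x_k) and record every iterate."""
    path = [start]
    for _ in range(steps):
        x, y = path[-1]
        gx, gy = grad_f(x, y)
        path.append((x - alpha * gx, y - alpha * gy))
    return path


path = gradient_descent(start=(3.0, 2.0), alpha=0.15, steps=25)
for k in (0, 1, 5, 25):
    x, y = path[k]
    print(f"step {k:2d}: ({x:.4f}, {y:.4f})   f = {f(x, y):.6f}")
```

Here each step scales x by 0.7 and y by 0.4, so the steeper y-direction converges faster. Raising α above 0.5 makes the y-coordinate overshoot and grow, which is the "too large" failure mode described above.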

Key insight: The gradient is always perpendicular to the contour lines (level sets) of f. Contour lines are curves where f is constant. The gradient points away from them, toward higher values. This is why gradient descent cuts across contour lines at right angles — it takes the steepest path through the landscape.

Check: For f(x, y) = 3x² + y², what is the gradient at (1, 2)?

Chapter 5: The Chain Rule

In a neural network, the loss L depends on the output, which depends on the last layer's weights, which depend on the previous layer's output, and so on. To compute ∂L/∂w for a weight w buried deep in the network, we need to chain together derivatives through every layer in between.

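The multivariate chain rule handles this. If f = f(g₁(x), g₂(x), ..., g_m(x)), then:

$$\frac{df}{dx} = \sum_{i=1}^{m} \frac{\partial f}{\partial g_i} \cdot \frac{\partial g_i}{\partial x}$$

Worked example: Let f(x) = (x² + 1)³. Define g(x) = x² + 1, so f = g³.

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} = 3g^2 \cdot 2x = 6x(x^2 + 1)^2$$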

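Now a two-layer example closer to neural networks. Let y = σ(w₂ · σ(w₁x + b₁) + b₂), where σ is the sigmoid. To find ∂y/∂w₁:

Layer 1: z₁ = w₁x + b₁,   a₁ = σ(z₁)
Layer 2: z₂ = w₂a₁ + b₂,   y = σ(z₂)

$$\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = \sigma'(z_2) \cdot w_2 \cdot \sigma'(z_1) \cdot x$$
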
The chain rule as a chain of multiplication: Each layer contributes one factor to the derivative. The final derivative is a product of all these local factors. This is why it's called the "chain" rule: the local derivatives link together by multiplication. Backpropagation computes this product efficiently, starting with the factor nearest the output (∂y/∂z₂) and working back toward the input.

Notice the product σ'(z₂) · σ'(z₁). Since σ'(z) ≤ 0.25, multiplying many such terms makes the gradient exponentially small. With 50 sigmoid layers: 0.25^50 ≈ 10^(−30). This is the vanishing gradient problem, and the reason ReLU (whose derivative is 0 or 1) largely replaced sigmoid in deep networks.
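The two-layer derivative can be checked against a finite difference, and the 50-layer bound computed directly. This is a sketch only: the parameter values below are arbitrary and not taken from the lesson.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(x, w1, b1, w2, b2):
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    return z1, z2, sigmoid(z2)


def dy_dw1(x, w1, b1, w2, b2):
    # Product of the four local derivatives from the chain rule.
    z1, z2, _ = forward(x, w1, b1, w2, b2)
    return sigmoid_prime(z2) * w2 * sigmoid_prime(z1) * x


# Arbitrary illustrative values (assumptions, not from the lesson).
x, w1, b1, w2, b2 = 0.5, 0.8, -0.2, 1.5, 0.1
h = 1e-6
analytic = dy_dw1(x, w1, b1, w2, b2)
numeric = (forward(x, w1 + h, b1, w2, b2)[2] - forward(x, w1 - h, b1, w2, b2)[2]) / (2 * h)
print(analytic, numeric)

# Upper bound on the product of 50 sigmoid derivatives.
print(0.25 ** 50)
```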

Check: If f = h(g(x)), what is df/dx by the chain rule?

Chapter 6: The Jacobian

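What if your function has vector input AND vector output? A function f: ℝ^n → ℝ^m maps an n-vector to an m-vector. Its derivative needs to capture how each output changes with each input. That's the Jacobian matrix.

$$J = \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

The Jacobian is an m×n matrix. Row i contains the gradient of f_i (the i-th output). Column j shows how all outputs respond to changes in x_j.

Example: f(x₁, x₂) = [x₁² + x₂,   x₁x₂]. The Jacobian is:

$$J = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix}$$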

At (x₁, x₂) = (3, 2): J = [[6, 1], [2, 3]]. This tells us, to first order: a small change δx₁ = 0.01 changes f₁ by about 0.06 and f₂ by about 0.02. A small change δx₂ = 0.01 changes f₁ by about 0.01 and f₂ by about 0.03.
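The first-order reading of the Jacobian can be compared with the actual change in f. A minimal sketch, with helper names invented for this example:

```python
def f(x1, x2):
    return (x1 ** 2 + x2, x1 * x2)


def jacobian(x1, x2):
    # Rows are outputs (f1, f2); columns are inputs (x1, x2).
    return [[2.0 * x1, 1.0],
            [x2, x1]]


J = jacobian(3.0, 2.0)
delta = 0.01

# Nudge x1 only and compare the true change with column 0 of J times delta.
base = f(3.0, 2.0)
moved = f(3.0 + delta, 2.0)
actual = (moved[0] - base[0], moved[1] - base[1])
predicted = (J[0][0] * delta, J[1][0] * delta)

print("Jacobian: ", J)
print("actual:   ", actual)
print("predicted:", predicted)
```

The small mismatch in f₁ comes from the second-order term δx₁², which the Jacobian ignores.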

The chain rule becomes matrix multiplication: If z = g(f(x)), then the Jacobian of the composition is the product of the two Jacobians, with J_g evaluated at f(x):

$$J_z(x) = J_g(f(x)) \cdot J_f(x)$$

This is the matrix chain rule. Each layer in a neural network has a Jacobian, and the overall Jacobian is the product of all layer Jacobians. Backpropagation computes this product one factor at a time, from output to input.

| Function type | Derivative name | Shape |
| --- | --- | --- |
| f: ℝ → ℝ | Derivative f' | Scalar |
| f: ℝ^n → ℝ | Gradient ∇f | 1 × n (row vector) |
| f: ℝ^n → ℝ^m | Jacobian J | m × n (matrix) |
| f: ℝ^n → ℝ, 2nd deriv | Hessian H | n × n (symmetric) |

The Hessian H is the Jacobian of the gradient: H_ij = ∂²f/∂x_i∂x_j. It's symmetric whenever the second partial derivatives are continuous (Schwarz's theorem: mixed partials commute). The Hessian tells you about curvature. At a critical point (where ∇f = 0), a positive definite H means a local minimum, negative definite means a local maximum, and indefinite means a saddle point.
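For two variables the definiteness test reduces to the determinant and trace of H, since these are the product and sum of its two eigenvalues. The sketch below assumes a symmetric 2×2 Hessian evaluated at a critical point; the function name and the example matrices are chosen for illustration.

```python
def classify_2x2(hessian):
    """Classify a critical point from a symmetric 2x2 Hessian [[a, b], [b, c]]."""
    (a, b), (_, c) = hessian
    det = a * c - b * b      # product of the two eigenvalues
    trace = a + c            # sum of the two eigenvalues
    if det > 0 and trace > 0:
        return "local minimum (positive definite)"
    if det > 0 and trace < 0:
        return "local maximum (negative definite)"
    if det < 0:
        return "saddle point (indefinite)"
    return "inconclusive (semidefinite)"


print(classify_2x2([[2.0, 0.0], [0.0, 4.0]]))     # Hessian of x^2 + 2y^2
print(classify_2x2([[2.0, 0.0], [0.0, -2.0]]))    # Hessian of x^2 - y^2
print(classify_2x2([[-2.0, 0.0], [0.0, -4.0]]))   # Hessian of -(x^2 + 2y^2)
```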

Check: The Jacobian of f: ℝ³ → ℝ² has what shape?

Chapter 7: Gradients of Matrices

In ML, we often need derivatives of scalar-valued functions with respect to matrices. For instance, the loss L(θ) depends on weight matrices W. How do we compute ∂L/∂W?

The key identities you'll use constantly:

| Function f(X) | ∂f/∂X |
| --- | --- |
| tr(AX) | Aᵀ |
| tr(XᵀA) | A |
| tr(XᵀX) | 2X |
| aᵀXb (= tr(baᵀX)) | abᵀ |
| ‖Xw − y‖² | 2Xᵀ(Xw − y) w.r.t. w |

Let's derive the most important one: the gradient of the squared loss.

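Let L(w) = ‖Xw − y‖² = (Xw − y)ᵀ(Xw − y). Expand:

$$L = w^\top X^\top X w - 2 y^\top X w + y^\top y$$

Differentiate term by term:

$$\frac{\partial}{\partial w}\left(w^\top X^\top X w\right) = 2 X^\top X w$$
$$\frac{\partial}{\partial w}\left(-2 y^\top X w\right) = -2 X^\top y$$
$$\frac{\partial}{\partial w}\left(y^\top y\right) = 0$$

Combining: ∇_w L = 2XᵀXw − 2Xᵀy = 2Xᵀ(Xw − y).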
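A finite-difference check is a useful habit when deriving matrix gradients by hand. The sketch below compares 2Xᵀ(Xw − y) with a central-difference estimate; the tiny dataset and the starting weights are made up for illustration.

```python
def predictions(X, w):
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]


def squared_loss(X, w, y):
    return sum((p - t) ** 2 for p, t in zip(predictions(X, w), y))


def analytic_gradient(X, w, y):
    """2 X^T (Xw - y), written out with explicit sums."""
    residual = [p - t for p, t in zip(predictions(X, w), y)]
    return [2.0 * sum(X[i][j] * residual[i] for i in range(len(X)))
            for j in range(len(w))]


def numeric_gradient(X, w, y, h=1e-6):
    """Central difference in each coordinate of w, one at a time."""
    grad = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        grad.append((squared_loss(X, up, y) - squared_loss(X, down, y)) / (2 * h))
    return grad


# Made-up data: a column of ones (bias) plus one feature.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 2.0]
w = [0.5, 0.5]

print(analytic_gradient(X, w, y))
print(numeric_gradient(X, w, y))
```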

The normal equation: Setting ∇L = 0 gives XᵀXw = Xᵀy, so w* = (XᵀX)⁻¹Xᵀy whenever XᵀX is invertible (X has full column rank). This is the closed-form solution to linear regression. The entire derivation is just "set the gradient to zero and solve." In practice you rarely form the inverse explicitly; Chapter 4's matrix decompositions (Cholesky or SVD) are used to solve the linear system instead.
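For a two-column X the normal equation is a 2×2 linear system, small enough to solve by hand. The sketch below uses Cramer's rule purely for illustration (a real implementation would rely on a decomposition-based solver, as noted above); the dataset is made up, and the function names are assumptions for this example.

```python
def normal_equation_2d(X, y):
    """Solve X^T X w = X^T y for a two-column X using Cramer's rule."""
    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]]
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    # Entries of the vector X^T y = [p, q]
    p = sum(row[0] * t for row, t in zip(X, y))
    q = sum(row[1] * t for row, t in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is singular; there is no unique solution")
    return [(p * d - b * q) / det, (a * q - b * p) / det]


def gradient(X, w, y):
    # 2 X^T (Xw - y) for the two-column case
    residual = [row[0] * w[0] + row[1] * w[1] - t for row, t in zip(X, y)]
    return [2.0 * sum(row[j] * r for row, r in zip(X, residual)) for j in range(2)]


# Made-up data: a column of ones (bias) plus one feature.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 2.0]

w_star = normal_equation_2d(X, y)
print("w* =", w_star)
print("gradient at w* =", gradient(X, w_star, y))
```

The gradient at w* comes out as zero up to rounding, which is exactly the condition the normal equation encodes.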

For the rule ∂tr(AX)/∂X = Aᵀ, the derivation uses the fact that tr(AX) = Σ_{i,j} A_ji X_ij. Taking ∂/∂X_ij gives A_ji = (Aᵀ)_ij. Stacking all these partial derivatives gives the matrix Aᵀ.

Key insight: Matrix calculus is just multivariable calculus with careful bookkeeping of shapes. The gradient ∂f/∂X always has the same shape as X. When in doubt, write out indices (f = Σ_i a_i x_i) and differentiate element by element.

Check: What is the gradient of f(w) = ‖Xw − y‖² with respect to w?

Chapter 8: Backpropagation

Everything we've built leads here. Backpropagation is the algorithm that computes the gradient of a loss function with respect to all parameters in a computational graph. It's just the chain rule, applied systematically from output back to inputs.

Backprop Computation Graph

A small computation graph: L = (wx + b − y)². Enter a value for x, then click Forward to propagate values left-to-right. Click Backward to see gradients flow right-to-left (edges light up). Adjust w and b to see how gradients change.

Default values: x = 2.0, w = 1.5, b = 0.5, y (target) = 1.0

The algorithm has two passes:

Forward Pass
Compute all intermediate values from inputs to loss
Backward Pass
Compute gradients from loss back to each parameter

For L = (wx + b − y)², let's define intermediate variables:

forward pass
z = w * x          # multiply
a = z + b          # add bias
r = a - y          # residual
L = r ** 2         # squared error

The backward pass computes gradients using the chain rule, starting from ∂L/∂L = 1:

backward pass
∂L/∂r = 2r
∂L/∂a = ∂L/∂r · ∂r/∂a = 2r · 1 = 2r
∂L/∂b = ∂L/∂a · ∂a/∂b = 2r · 1 = 2r
∂L/∂z = ∂L/∂a · ∂a/∂z = 2r · 1 = 2r
∂L/∂w = ∂L/∂z · ∂z/∂w = 2r · x = 2(wx+b-y)x

Why backprop is efficient: A naive approach would compute each gradient independently, requiring a separate pass through the graph for each parameter. Backprop shares computation: the gradient ∂L/∂a is computed once and reused by both ∂L/∂w and ∂L/∂b. As a result, a single backward pass yields all N gradients at a cost of a small constant multiple (roughly 2×) of the forward pass, instead of N separate passes. The full gradient therefore scales like the forward pass itself, about O(N) rather than O(N²), and this is what makes training billion-parameter models feasible.
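The forward and backward passes above translate almost line for line into plain Python. This sketch uses the simulation's values (x = 2.0, w = 1.5, b = 0.5, y = 1.0); the function name and the `dL_d*` variable names are choices made for this example.

```python
def forward_backward(x, w, b, y):
    # Forward pass: compute and keep every intermediate value.
    z = w * x
    a = z + b
    r = a - y
    L = r ** 2

    # Backward pass: start from dL/dL = 1 and apply the chain rule node by node.
    dL_dr = 2.0 * r
    dL_da = dL_dr * 1.0      # r = a - y  ->  dr/da = 1
    dL_db = dL_da * 1.0      # a = z + b  ->  da/db = 1
    dL_dz = dL_da * 1.0      # a = z + b  ->  da/dz = 1
    dL_dw = dL_dz * x        # z = w * x  ->  dz/dw = x
    return L, dL_dw, dL_db


L, dL_dw, dL_db = forward_backward(x=2.0, w=1.5, b=0.5, y=1.0)
print(f"L = {L}, dL/dw = {dL_dw}, dL/db = {dL_db}")
```

Note that `dL_da` is computed once and feeds both `dL_db` and `dL_dw`, which is the reuse the paragraph above describes.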

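In matrix notation for a full layer with weight matrix W, input x, and pre-activation a = Wx (the value fed into the ReLU):

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot x^\top \quad \text{where } a = Wx$$

Here ∂L/∂a is a column vector with one entry per output unit, so the product is an outer product with the same shape as W.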

The upstream gradient ∂L/∂a is passed down from the layer above. Each layer only needs to know its own local derivative and the incoming gradient. This locality is what makes backprop modular — you can swap activation functions, add layers, or change architectures, and the gradient computation "just works."

Check: Why is backpropagation more efficient than computing each gradient separately?

Chapter 9: Automatic Differentiation

Backpropagation is a specific case of a broader idea: automatic differentiation (autodiff). Autodiff computes derivatives of a program built from differentiable primitive operations by tracing the computation and applying the chain rule mechanically. There is no finite-difference approximation and no symbolic manipulation of expressions; the derivatives are exact up to floating-point rounding.

| Method | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Symbolic | Manipulate algebraic expressions | Exact, interpretable | Expression explosion for complex functions |
| Numerical | Finite differences: (f(x+h)−f(x))/h | Simple, works for anything | Approximate, O(n) function evaluations per gradient, sensitive to the choice of h |
| Forward-mode AD | Propagate derivatives alongside values | Exact, efficient for few inputs | One pass per input dimension |
| Reverse-mode AD | Record forward, replay backward | Exact, one pass for all gradients | Memory for the tape |

Reverse-mode autodiff is backpropagation generalized. The "tape" (computation graph) is recorded during the forward pass. The backward pass walks the tape in reverse, accumulating gradients. PyTorch's autograd and JAX's grad implement this.

Forward-mode vs. reverse-mode: Forward-mode computes one column of the Jacobian per pass (how one input affects all outputs). Reverse-mode computes one row per pass (how all inputs affect one output). For a loss function f: Rn → R with n parameters and one scalar output, reverse-mode needs one backward pass to get the full gradient. Forward-mode would need n passes. Since ML losses are scalar-valued, reverse-mode (backprop) wins decisively.
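To make the "one pass per input" cost of forward mode concrete, here is a minimal dual-number sketch. The `Dual` class is a toy written for this example (it supports only +, − and ×), not the implementation used by any library. It differentiates the same loss L = (wx + b − y)² used in the backprop example.

```python
class Dual:
    """A number paired with its derivative with respect to one seeded input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (fg)' = f'g + fg'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def loss(w, x, b, y):
    r = w * x + b - y
    return r * r


x, y = 2.0, 1.0
# Forward mode needs one pass per parameter: seed deriv = 1 on the input of interest.
dL_dw = loss(Dual(1.5, 1.0), x, Dual(0.5, 0.0), y).deriv
dL_db = loss(Dual(1.5, 0.0), x, Dual(0.5, 1.0), y).deriv
print(dL_dw, dL_db)
```

Two parameters required two passes here. With a million parameters forward mode would need a million passes, whereas reverse mode recovers every gradient from a single backward pass.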
python
import torch

# Define parameters (requires_grad=True tells PyTorch to track)
w = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(1.0)

# Forward pass (builds computation graph automatically)
L = (w * x + b - y) ** 2

# Backward pass (computes all gradients in one call)
L.backward()

print(w.grad)  # dL/dw = 2(wx+b-y)*x = 2(3.5-1)*2 = 10.0
print(b.grad)  # dL/db = 2(wx+b-y)*1 = 2(3.5-1) = 5.0

The connection between this chapter's ideas and real ML frameworks:

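| Concept | In code |
| --- | --- |
| Forward pass | loss = model(x) |
| Build computation graph | Happens automatically during the forward pass (PyTorch) or when the function is traced by jax.grad / jax.jit (JAX) |
| Backward pass | loss.backward() (PyTorch) or jax.grad(loss_fn), which returns a gradient function (JAX) |
| Gradient descent step | optimizer.step() (PyTorch) |
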
Where this leads: Chapter 5 gives you the math. Chapter 7 (Optimization) will build on these gradients to cover gradient descent, momentum, Adam, and learning rate schedules. Chapter 9 (Linear Regression) will use the normal equation we derived. And Chapter 10 (PCA) will use eigendecomposition from Chapter 4 on the covariance matrix.

"The chain rule is the most important single idea in differential calculus." — Gilbert Strang

Check: Why is reverse-mode autodiff preferred over forward-mode for training neural networks?