Ch 5: Vector Calculus — Deisenroth MML

Chapter 0: Why Calculus?

You have a neural network with a million parameters. It makes a prediction. The prediction is wrong. You want to adjust every one of those million parameters so the prediction gets a little better. But which direction should each parameter move, and by how much?

This is the fundamental problem of machine learning, and it's answered by calculus. Specifically, by the gradient — a vector that points in the direction of steepest increase of a function. Go the opposite way and you reduce the error. That's gradient descent, and it powers virtually every modern ML system.

Loss Landscape: Gradient Descent in 1D

[Interactive figure: a loss curve L(θ) with a dot marking the current parameter. Each step applies one gradient descent update, θ ← θ − α · dL/dθ, so the derivative (slope) guides the dot toward the minimum. Defaults: θ = 3.0, learning rate α = 0.10.]
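The update rule θ ← θ − α · dL/dθ can be sketched in a few lines. The figure's actual loss curve is not given in the text, so the quadratic L(θ) = θ² below is an assumption chosen purely for illustration; the starting point θ = 3.0 and learning rate α = 0.10 are the figure's defaults.

```python
def loss(theta):
    # Hypothetical loss, assumed for illustration: L(theta) = theta^2
    return theta ** 2

def dloss(theta):
    # Derivative of the assumed loss: dL/dtheta = 2 * theta
    return 2.0 * theta

theta = 3.0   # starting parameter (the figure's default)
alpha = 0.10  # learning rate (the figure's default)

history = [theta]
for _ in range(20):
    theta = theta - alpha * dloss(theta)  # one gradient descent step
    history.append(theta)

print(f"theta after 1 step:   {history[1]:.4f}")
print(f"theta after 20 steps: {history[-1]:.4f}")
print(f"loss after 20 steps:  {loss(history[-1]):.6f}")
```

For this particular loss, every step multiplies θ by (1 − 2α) = 0.8, so the parameter shrinks geometrically toward the minimum at 0.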

This chapter builds the machinery from the ground up. We start with the basic derivative, extend it to multiple variables (partial derivatives and gradients), generalize to vector-valued functions (the Jacobian), and culminate in backpropagation — the algorithm that efficiently computes gradients through deep networks by clever application of the chain rule.

The big picture: Calculus gives us the derivative. The derivative tells us which way is downhill. Going downhill reduces the loss. Repeat. That's the entire training loop of modern ML, from linear regression to GPT-4.

Derivative

Rate of change of one variable with respect to another

↓ generalize to many variables

Gradient

Vector of all partial derivatives — points uphill

↓ generalize to vector outputs

Jacobian

Matrix of all partial derivatives of a vector function

↓ compose with chain rule

Backpropagation

Efficient gradient computation through deep networks

Check: What does the gradient of a loss function tell you?

The minimum value of the loss
The direction of steepest increase (negate it to go downhill)
The number of parameters in the model

Chapter 1: Derivatives From Scratch

The derivative of f at x measures the instantaneous rate of change. Start from the definition:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This is the slope of the tangent line. Take f(x) = x². Plug in:

$$\frac{(x+h)^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h \to 2x \quad \text{as } h \to 0$$

So f'(x) = 2x. At x = 3, the slope is 6 — the function is increasing at a rate of 6 units per unit of x. At x = 0, the slope is 0 — a flat spot (the minimum). At x = −2, the slope is −4 — the function is decreasing.
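The limit can be watched numerically. This is a minimal sketch, assuming a helper named `forward_difference` (a name introduced here, not from the lesson) that evaluates the difference quotient at a finite step h.

```python
def f(x):
    return x ** 2

def forward_difference(func, x, h):
    # Difference quotient from the limit definition, at a finite step h
    return (func(x + h) - func(x)) / h

# For f(x) = x^2 the quotient equals 2x + h, so it approaches 2x as h shrinks
for h in (1.0, 0.1, 0.001):
    print(f"h = {h:<6} quotient at x = 3: {forward_difference(f, 3.0, h):.4f}")
```

The printed quotients approach 6, matching f'(3) = 2 · 3.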

The derivative rules you need for ML:

Function	Derivative	Used in
xⁿ	nxⁿ⁻¹	Polynomial models
e^x	e^x	Softmax, Gaussian
ln(x)	1/x	Log-likelihood
1/(1+e^−x)	σ(x)(1 − σ(x))	Sigmoid activation
max(0, x)	1 if x > 0, else 0	ReLU activation

Key insight: The derivative of the sigmoid σ(x) = 1/(1+e^−x) is σ(x)(1 − σ(x)). This is maximized at x = 0 (where the slope is 0.25) and vanishes as x → ±∞. This "vanishing gradient" at the extremes is why deep sigmoid networks are hard to train — the gradient shrinks exponentially through many layers.
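A short sketch that evaluates σ'(x) = σ(x)(1 − σ(x)) at a few points, to make the peak at 0 and the flat tails concrete. The sample points are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid expressed through the sigmoid itself
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}  sigma'(x) = {sigmoid_grad(x):.6f}")
```

The derivative is 0.25 at x = 0 and falls below 0.0001 by |x| = 10.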

Two rules compose derivatives of compound expressions:

Product rule:

(fg)' = f'g + fg'

Chain rule:

(f ∘ g)' = f'(g(x)) · g'(x)

The chain rule is by far the most important for ML. It says: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). Derivatives of compositions multiply. This innocent-looking rule is the entire foundation of backpropagation.

Check: What is the derivative of f(x) = (3x + 1)² using the chain rule?

2(3x + 1)
(3x + 1)² · 3
2(3x + 1) · 3 = 6(3x + 1) (outer derivative times inner derivative)

Chapter 2: Taylor Series

If you know f and all its derivatives at a single point x₀, you can reconstruct the function near that point, provided f is smooth enough (analytic) there. The Taylor series is this reconstruction:

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \frac{f'''(x_0)}{3!}(x - x_0)^3 + \dots$$

The first two terms give the linear approximation (tangent line). Add the quadratic term and you get the second-order approximation (a parabola that matches the function's curvature). Each term adds finer detail.

Taylor Approximation

[Interactive figure: the true function f(x) = sin(x) and its Taylor polynomial centered at x₀ = 0. Raising the order (default 1) improves the approximation.]
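The same experiment can be run without the figure. This is a minimal sketch of the Taylor polynomial of sin about x₀ = 0; the function name `taylor_sin` and the evaluation point x = 1 are choices made here for illustration.

```python
import math

def taylor_sin(x, order):
    # Taylor polynomial of sin about x0 = 0, keeping terms up to x**order.
    # The n-th derivative of sin at 0 cycles through 0, 1, 0, -1.
    derivs_at_zero = (0.0, 1.0, 0.0, -1.0)
    total = 0.0
    for n in range(order + 1):
        total += derivs_at_zero[n % 4] / math.factorial(n) * x ** n
    return total

x = 1.0
for order in (1, 3, 5, 7):
    approx = taylor_sin(x, order)
    error = abs(approx - math.sin(x))
    print(f"order {order}: {approx:.6f}  (error {error:.2e})")
```

Each added odd-order term shrinks the error at x = 1 by roughly one to two orders of magnitude.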

Why does ML care about Taylor series? Three reasons:

Application	Which terms
Gradient descent	First-order: f(x + δ) ≈ f(x) + ∇f · δ
Newton's method	Second-order: includes the Hessian (curvature)
Loss surface analysis	Second-order: saddle points, condition number

The multivariate Taylor expansion (preview): For f: Rⁿ → R, centered at x₀:

f(x) ≈ f(x₀) + ∇f(x₀)^T(x − x₀) + ½(x − x₀)^T H(x₀) (x − x₀)

where ∇f is the gradient (first-order) and H is the Hessian matrix of second derivatives (second-order). Gradient descent uses the first-order term. Newton's method uses both.

The factorial denominators (1/n!) are not arbitrary. Differentiating (x − x₀)ⁿ exactly n times gives n!, so dividing by n! makes the n-th derivative of the polynomial at x₀ equal to the n-th derivative of f there. The factorials also damp the higher-order terms, which helps the polynomial stay close to the true function near x₀.

Check: Gradient descent uses which level of Taylor approximation?

First-order (linear): using only the gradient
Zero-order: using only the function value
Third-order: using gradient, Hessian, and beyond

Chapter 3: Going Multivariate

So far, f takes one number in and produces one number out. But a neural network loss depends on millions of parameters simultaneously. We need derivatives of functions of many variables.

The partial derivative of f(x₁, x₂, ..., x_n) with respect to x_i is simply: hold all other variables fixed and differentiate with respect to x_i alone.

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}$$

Example: f(x, y) = x²y + 3y. The partial derivatives are:

$$\frac{\partial f}{\partial x} = 2xy$$

(treat y as a constant)

$$\frac{\partial f}{\partial y} = x^2 + 3$$

(treat x as a constant)

At the point (x, y) = (2, 1): ∂f/∂x = 2·2·1 = 4 and ∂f/∂y = 4 + 3 = 7. The function is increasing faster in the y-direction than the x-direction at this point.
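The "hold everything else fixed" recipe translates directly into a numerical check. This sketch uses central differences; the helper names `partial_x` and `partial_y` and the step size are assumptions made for illustration.

```python
def f(x, y):
    return x ** 2 * y + 3 * y

def partial_x(func, x, y, h=1e-6):
    # Vary x only; y is held fixed
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    # Vary y only; x is held fixed
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

print(partial_x(f, 2.0, 1.0))  # close to 2xy = 4
print(partial_y(f, 2.0, 1.0))  # close to x^2 + 3 = 7
```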

Notation note: ∂f/∂x_i uses the "curly d" (∂) to distinguish partial derivatives from the ordinary derivative d/dx. When there's only one variable, ∂ and d are the same thing. Some ML texts use ∇_{x_i}f instead.

The total derivative matters when variables are not independent. If x = x(t) and y = y(t), then by the chain rule:

$$\frac{df}{dt} = \frac{\partial f}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial f}{\partial y} \cdot \frac{dy}{dt}$$

This is the multivariate chain rule in action: the total rate of change is the sum of each partial derivative times the rate at which that variable changes. This structure — sum of products of local derivatives — is exactly what backpropagation exploits.
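To see the sum-of-products structure at work, the sketch below reuses f(x, y) = x²y + 3y from the earlier example and assumes a hypothetical path x(t) = cos t, y(t) = sin t (the path is not from the lesson). It compares the chain-rule formula with a direct numerical derivative along the path.

```python
import math

def f(x, y):
    return x ** 2 * y + 3 * y

def x_of_t(t):
    return math.cos(t)  # hypothetical path, assumed for illustration

def y_of_t(t):
    return math.sin(t)  # hypothetical path, assumed for illustration

def total_derivative(t):
    x, y = x_of_t(t), y_of_t(t)
    df_dx = 2 * x * y        # partial derivative w.r.t. x
    df_dy = x ** 2 + 3       # partial derivative w.r.t. y
    dx_dt = -math.sin(t)
    dy_dt = math.cos(t)
    # Sum of (partial derivative) * (rate of change of that variable)
    return df_dx * dx_dt + df_dy * dy_dt

def numeric_derivative(t, h=1e-6):
    def along_path(s):
        return f(x_of_t(s), y_of_t(s))
    return (along_path(t + h) - along_path(t - h)) / (2 * h)

t0 = 0.7
print(total_derivative(t0), numeric_derivative(t0))
```

The two printed numbers should agree to many decimal places.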

Check: For f(x, y) = xy² + x³, what is ∂f/∂x?

2xy
y² + 3x² (differentiate w.r.t. x, treating y as constant)
xy²

Chapter 4: The Gradient

Collect all the partial derivatives into one vector, and you get the gradient:

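$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right]$$

Convention alert: Deisenroth et al. define the gradient as a row vector (the Jacobian of a scalar function), which is why the definition above carries no transpose. Many ML textbooks define it as a column vector. This lesson follows the book's convention in definitions and shape tables, and reads the gradient as a column wherever it is added to a parameter vector, as in update rules and in results such as 2X^T(Xw − y). The choice matters mainly for matrix multiplication shapes. The direction and magnitude are the same either way.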

The gradient has a beautiful geometric meaning: at any point, it points in the direction of steepest ascent. Its magnitude tells you how steep that ascent is. To minimize a function, you walk in the opposite direction of the gradient.

2D Gradient Field

[Interactive figure: contour plot of f(x, y) = x² + 2y². Arrows show the negative gradient (the descent direction), and a dot follows gradient descent one step at a time. Default learning rate α = 0.15.]

For f(x, y) = x² + 2y², the gradient is ∇f = [2x, 4y]. At the point (3, 2), the gradient is [6, 8]. The steepest ascent direction is (6, 8)/||(6, 8)|| = (0.6, 0.8). To descend, we go the opposite way: (−0.6, −0.8).

The gradient descent update rule is:

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where α is the learning rate. Too large and you overshoot. Too small and you crawl. Finding the right α is a central practical challenge in ML optimization.
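Here is a minimal sketch of that update rule on f(x, y) = x² + 2y². The starting point (3, 2) is borrowed from the worked gradient above and α = 0.15 from the figure; the second run with α = 0.6 is an assumption added here to show overshooting.

```python
def f(x, y):
    return x ** 2 + 2 * y ** 2

def grad_f(x, y):
    return (2.0 * x, 4.0 * y)

def gradient_descent(start, alpha, steps):
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - alpha * gx, y - alpha * gy  # step against the gradient
        path.append((x, y))
    return path

path = gradient_descent(start=(3.0, 2.0), alpha=0.15, steps=10)
diverging = gradient_descent(start=(3.0, 2.0), alpha=0.6, steps=10)

print("alpha = 0.15, first step:", path[1])
print("alpha = 0.15, final loss:", f(*path[-1]))
print("alpha = 0.6,  final loss:", f(*diverging[-1]))
```

With α = 0.15 each step scales x by 0.7 and y by 0.4, so the loss falls quickly. With α = 0.6 the y-coordinate is scaled by −1.4 per step, so it flips sign and grows: the overshoot the text warns about.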

Key insight: The gradient is always perpendicular to the contour lines (level sets) of f. Contour lines are curves where f is constant. The gradient points away from them, toward higher values. This is why gradient descent cuts across contour lines at right angles — it takes the steepest path through the landscape.

Check: For f(x, y) = 3x² + y², what is the gradient at (1, 2)?

[6, 4] (since ∂f/∂x = 6x = 6 and ∂f/∂y = 2y = 4)
[3, 1]
[7, 7]

Chapter 5: The Chain Rule

In a neural network, the loss L depends on the output, which depends on the last layer's weights and on the previous layer's output, which in turn depends on that layer's weights, and so on. To compute ∂L/∂w for a weight w buried deep in the network, we need to chain together derivatives through every layer in between.

The multivariate chain rule handles this. If f = f(g₁(x), g₂(x), ..., g_m(x)), then:

$$\frac{df}{dx} = \sum_{i=1}^{m} \frac{\partial f}{\partial g_i} \cdot \frac{\partial g_i}{\partial x}$$

Worked example: Let f(x) = (x² + 1)³. Define g(x) = x² + 1, so f = g³.

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} = 3g^2 \cdot 2x = 6x(x^2 + 1)^2$$

Now a two-layer example closer to neural networks. Let y = σ(w₂ · σ(w₁x + b₁) + b₂), where σ is the sigmoid. To find ∂y/∂w₁:

Layer 1

z₁ = w₁x + b₁, a₁ = σ(z₁)

↓

Layer 2

z₂ = w₂a₁ + b₂, y = σ(z₂)

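$$\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = \sigma'(z_2) \cdot w_2 \cdot \sigma'(z_1) \cdot x$$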
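The four-factor product can be checked against a finite-difference estimate. This is a sketch only: the parameter values below are hypothetical, picked for illustration, and the function names are introduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, b1, w2, b2, x):
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    y = sigmoid(z2)
    return z1, a1, z2, y

def dy_dw1(w1, b1, w2, b2, x):
    _, a1, _, y = forward(w1, b1, w2, b2, x)
    ds2 = y * (1.0 - y)      # sigma'(z2), written via the output y
    ds1 = a1 * (1.0 - a1)    # sigma'(z1), written via the activation a1
    return ds2 * w2 * ds1 * x  # product of the four local factors

# Hypothetical values, assumed for illustration
w1, b1, w2, b2, x = 0.5, 0.1, -1.2, 0.3, 2.0

analytic = dy_dw1(w1, b1, w2, b2, x)

h = 1e-6
y_plus = forward(w1 + h, b1, w2, b2, x)[3]
y_minus = forward(w1 - h, b1, w2, b2, x)[3]
numeric = (y_plus - y_minus) / (2 * h)

print(analytic, numeric)
```

The derivative is negative here because w₂ is negative while the other three factors are positive.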

The chain rule as a chain of multiplications: Each layer contributes one factor to the derivative, and the final derivative is the product of these local factors. That is why it's called the "chain" rule. Backpropagation just computes this product efficiently, starting with the factor nearest the output (∂y/∂z₂, the leftmost factor above) and working back toward the input.

Notice the product σ'(z₂) · σ'(z₁). Since σ'(z) ≤ 0.25, multiplying many such terms makes the gradient exponentially small. With 50 sigmoid layers: 0.25⁵⁰ ≈ 10⁻³⁰. This is the vanishing gradient problem — and the reason ReLU (whose derivative is 0 or 1) largely replaced sigmoid in deep networks.

Check: If f = h(g(x)), what is df/dx by the chain rule?

h'(g(x)) · g'(x) (derivative of outer times derivative of inner)
h'(x) · g'(x)
h(g'(x))

Chapter 6: The Jacobian

What if your function has vector input AND vector output? A function f: Rⁿ → R^m maps an n-vector to an m-vector. Its derivative needs to capture how each output changes with each input. That's the Jacobian matrix.

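$$J = \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$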

The Jacobian is an m×n matrix. Row i contains the gradient of f_i (the i-th output). Column j shows how all outputs respond to changes in x_j.

Example: f(x₁, x₂) = [x₁² + x₂, x₁x₂]. The Jacobian is:

J = [2x₁, 1; x₂, x₁]

At (x₁, x₂) = (3, 2): J = [[6, 1], [2, 3]]. To first order, this tells us: a small change δx₁ = 0.01 changes f₁ by about 0.06 and f₂ by about 0.02. A small change δx₂ = 0.01 changes f₁ by about 0.01 and f₂ by about 0.03.
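A quick sketch comparing the Jacobian's first-order prediction with the actual change in the outputs for δx₁ = 0.01. The variable names are introduced here for illustration.

```python
def f(x1, x2):
    return [x1 ** 2 + x2, x1 * x2]

def jacobian(x1, x2):
    # Analytic Jacobian from the text: rows are outputs, columns are inputs
    return [[2.0 * x1, 1.0],
            [x2, x1]]

J = jacobian(3.0, 2.0)

# First-order prediction J * dx for a small change in x1 only
dx = [0.01, 0.0]
predicted = [J[i][0] * dx[0] + J[i][1] * dx[1] for i in range(2)]

# Actual change in the outputs
base = f(3.0, 2.0)
moved = f(3.0 + dx[0], 2.0 + dx[1])
actual = [moved[i] - base[i] for i in range(2)]

print("predicted:", predicted)
print("actual:   ", actual)
```

The prediction for f₁ is off by about 0.0001, which is the second-order term (δx₁)² that the Jacobian ignores.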

The chain rule becomes matrix multiplication: If z = g(f(x)), then the Jacobian of the composition is:

J_z(x) = J_g(f(x)) · J_f(x)

This is the matrix chain rule. Each layer in a neural network has a Jacobian, and the overall Jacobian is the product of all layer Jacobians. Backpropagation computes this product one factor at a time, from output to input.
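The matrix chain rule can be verified on a small composition. In this sketch, f is the example function from above, while the outer function g and the helpers `matmul` and `numeric_jacobian` are hypothetical, introduced only for illustration. Note that the Jacobian of g is evaluated at f(x), not at x.

```python
def f(x):
    x1, x2 = x
    return [x1 ** 2 + x2, x1 * x2]

def jac_f(x):
    x1, x2 = x
    return [[2.0 * x1, 1.0], [x2, x1]]

def g(u):
    # Hypothetical outer function, assumed for illustration
    u1, u2 = u
    return [u1 + u2, u1 * u2]

def jac_g(u):
    u1, u2 = u
    return [[1.0, 1.0], [u2, u1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def numeric_jacobian(func, x, h=1e-6):
    # Central differences, one input coordinate at a time
    m = len(func(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

x = [3.0, 2.0]
J_z = matmul(jac_g(f(x)), jac_f(x))          # chain rule: J_g(f(x)) * J_f(x)
J_check = numeric_jacobian(lambda v: g(f(v)), x)

print(J_z)
print(J_check)
```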

Function type	Derivative name	Shape
f: R → R	Derivative f'	Scalar
f: Rⁿ → R	Gradient ∇f	1 × n (row vector)
f: Rⁿ → R^m	Jacobian J	m × n (matrix)
f: Rⁿ → R, 2nd deriv	Hessian H	n × n (symmetric)

The Hessian H is the Jacobian of the gradient: H_ij = ∂²f/∂x_i∂x_j. It's symmetric whenever the second partial derivatives are continuous (by Schwarz's theorem, mixed partials commute). The Hessian tells you about curvature. At a critical point, where ∇f = 0, a positive definite H means a local minimum, negative definite means a local maximum, and indefinite means a saddle point.

Check: The Jacobian of f: R³ → R² has what shape?

3 × 2
2 × 3 (m rows, n columns)
3 × 3

Chapter 7: Gradients of Matrices

In ML, we often need derivatives of scalar-valued functions with respect to matrices. For instance, the loss L(θ) depends on weight matrices W. How do we compute ∂L/∂W?

The key identities you'll use constantly:

Function f(X)	∂f/∂X
tr(AX)	A^T
tr(X^TA)	A
tr(X^TX)	2X
a^TXb (= tr(ba^TX))	ab^T
||Xw − y||²	2X^T(Xw − y) w.r.t. w

Let's derive the most important one: the gradient of the squared loss.

Let L(w) = ||Xw − y||² = (Xw − y)^T(Xw − y). Expand:

L = w^TX^TXw − 2y^TXw + y^Ty

Differentiate term by term:

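$$\frac{\partial}{\partial w}\left(w^T X^T X w\right) = 2X^T X w$$

$$\frac{\partial}{\partial w}\left(-2y^T X w\right) = -2X^T y$$

$$\frac{\partial}{\partial w}\left(y^T y\right) = 0$$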

Combining: ∇_wL = 2X^TXw − 2X^Ty = 2X^T(Xw − y).
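A derived gradient is worth checking numerically. The sketch below evaluates 2X^T(Xw − y) on a tiny made-up dataset (X, y, and w are hypothetical values, not from the lesson) and compares it with central differences of the loss.

```python
# Hypothetical data: 3 samples, 2 features
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
y = [1.0, 2.0, 3.0]
w = [0.5, -0.25]

def residual(w):
    # r = Xw - y
    return [sum(X[i][j] * w[j] for j in range(2)) - y[i] for i in range(3)]

def loss(w):
    # L(w) = ||Xw - y||^2
    return sum(r * r for r in residual(w))

def grad(w):
    # Analytic gradient: 2 X^T (Xw - y)
    r = residual(w)
    return [2.0 * sum(X[i][j] * r[i] for i in range(3)) for j in range(2)]

def numeric_grad(w, h=1e-6):
    out = []
    for j in range(2):
        wp, wm = list(w), list(w)
        wp[j] += h
        wm[j] -= h
        out.append((loss(wp) - loss(wm)) / (2 * h))
    return out

print(grad(w))
print(numeric_grad(w))
```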

The normal equation: Setting ∇L = 0 gives X^TXw = X^Ty, so w_* = (X^TX)⁻¹X^Ty whenever X^TX is invertible. This is the closed-form solution to linear regression. The entire derivation is just "set the gradient to zero and solve." In practice you rarely form the inverse explicitly. The matrix decompositions from the book's Chapter 4 (Cholesky or SVD) solve the system more stably.
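As a small illustration, the sketch below solves X^TXw = X^Ty for a two-parameter line fit. The data are hypothetical, generated from y = 1 + 2x with no noise, and the 2×2 system is solved with Cramer's rule only because it keeps the example dependency-free; a real implementation would use a linear solver or a decomposition.

```python
# Hypothetical data from y = 1 + 2x; the column of ones models the intercept
X = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

n, d = len(X), len(X[0])

# Build X^T X (d x d) and X^T y (length d)
XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(d)] for r in range(d)]
Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(d)]

# Solve the 2x2 system XtX * w = Xty by Cramer's rule
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
w_star = [
    (Xty[0] * XtX[1][1] - XtX[0][1] * Xty[1]) / det,
    (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det,
]

# The gradient 2 X^T (Xw - y) should vanish at the solution
resid = [sum(X[i][j] * w_star[j] for j in range(d)) - y[i] for i in range(n)]
grad_at_solution = [2.0 * sum(X[i][j] * resid[i] for i in range(n)) for j in range(d)]

print("w* =", w_star)
print("gradient at w* =", grad_at_solution)
```

The solver recovers the intercept 1 and slope 2 that generated the data.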

For the rule ∂tr(AX)/∂X = A^T, the derivation uses the fact that tr(AX) = ∑_i,j A_jiX_ij. Taking ∂/∂X_ij gives A_ji = (A^T)_ij. Stacking all these partial derivatives gives the matrix A^T.

Key insight: Matrix calculus is just multivariable calculus with careful bookkeeping of shapes. The gradient ∂f/∂X always has the same shape as X. When in doubt, write out indices (f = ∑ a_ix_i) and differentiate element by element.

Check: What is the gradient of f(w) = ||Xw - y||² with respect to w?

2X^T(Xw − y)
2(Xw − y)
X^TX

Chapter 8: Backpropagation

Everything we've built leads here. Backpropagation is the algorithm that computes the gradient of a loss function with respect to all parameters in a computational graph. It's just the chain rule, applied systematically from output back to inputs.

Backprop Computation Graph

[Interactive figure: a small computation graph for L = (wx + b − y)². A forward pass propagates values left to right, and a backward pass sends gradients right to left. Defaults: x = 2.0, w = 1.5, b = 0.5, y (target) = 1.0.]

The algorithm has two passes:

Forward Pass

Compute all intermediate values from inputs to loss

↓

Backward Pass

Compute gradients from loss back to each parameter

For L = (wx + b − y)², let's define intermediate variables:

forward pass
x, w, b, y = 2.0, 1.5, 0.5, 1.0   # example inputs (the figure's defaults)
z = w * x          # multiply
a = z + b          # add bias
r = a - y          # residual
L = r ** 2         # squared error

The backward pass computes gradients using the chain rule, starting from ∂L/∂L = 1:

backward pass
∂L/∂r = 2r
∂L/∂a = ∂L/∂r · ∂r/∂a = 2r · 1 = 2r
∂L/∂b = ∂L/∂a · ∂a/∂b = 2r · 1 = 2r
∂L/∂z = ∂L/∂a · ∂a/∂z = 2r · 1 = 2r
∂L/∂w = ∂L/∂z · ∂z/∂w = 2r · x = 2(wx+b-y)x
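The two passes above can be written out as plain Python. This is a sketch using the figure's default values (x = 2.0, w = 1.5, b = 0.5, y = 1.0); the `dL_d*` variable names are a naming choice made here.

```python
x, w, b, y = 2.0, 1.5, 0.5, 1.0

# Forward pass: compute and keep every intermediate value
z = w * x        # 3.0
a = z + b        # 3.5
r = a - y        # 2.5
L = r ** 2       # 6.25

# Backward pass: start at the loss and multiply by local derivatives
dL_dr = 2.0 * r          # d(r^2)/dr
dL_da = dL_dr * 1.0      # r = a - y  ->  dr/da = 1
dL_db = dL_da * 1.0      # a = z + b  ->  da/db = 1
dL_dz = dL_da * 1.0      # a = z + b  ->  da/dz = 1
dL_dw = dL_dz * x        # z = w * x  ->  dz/dw = x

print("L     =", L)
print("dL/dw =", dL_dw)
print("dL/db =", dL_db)
```

Note that `dL_da` is computed once and feeds both `dL_db` and `dL_dw`, which is the reuse that makes the backward pass cheap.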

Why backprop is efficient: A naive approach would compute each gradient independently, requiring a separate pass through the graph for each parameter. Backprop shares computation: the gradient ∂L/∂a is computed once and reused by both ∂L/∂w and ∂L/∂b. As a result, a single backward pass yields the gradient for all N parameters at a cost that is a small constant multiple of the forward pass (roughly 2×), instead of N separate passes. Total cost therefore grows like O(N) rather than O(N²), which is what makes training billion-parameter models feasible.

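In matrix notation for a full layer with weight matrix W and input x, where the pre-activation a = Wx is then passed through an activation such as ReLU:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \, x^T \quad \text{where } a = Wx$$

Here ∂L/∂a is written as a column vector, so the outer product has the same shape as W.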

The upstream gradient ∂L/∂a is passed down from the layer above. Each layer only needs to know its own local derivative and the incoming gradient. This locality is what makes backprop modular — you can swap activation functions, add layers, or change architectures, and the gradient computation "just works."
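A minimal sketch of one layer's backward step with a ReLU, in plain Python lists. The weights, the input, and the all-ones upstream gradient (which corresponds to assuming L is the sum of the layer's outputs) are hypothetical values chosen for illustration.

```python
def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def relu(a):
    return [max(0.0, v) for v in a]

# Hypothetical layer and input, assumed for illustration
W = [[1.0, -2.0],
     [0.5, 1.0]]
x = [1.0, 2.0]

# Forward pass
a = matvec(W, x)   # pre-activation
h = relu(a)        # activation output

# Upstream gradient dL/dh handed down by the next layer (assumed all ones)
dL_dh = [1.0, 1.0]

# Local ReLU derivative: pass the gradient where a > 0, block it elsewhere
dL_da = [g if v > 0 else 0.0 for g, v in zip(dL_dh, a)]

# dL/dW = dL/da * x^T (outer product, same shape as W)
dL_dW = [[da * xj for xj in x] for da in dL_da]

# Gradient handed to the previous layer: W^T * dL/da
dL_dx = [sum(W[i][j] * dL_da[i] for i in range(len(W))) for j in range(len(x))]

print("a     =", a)
print("dL/dW =", dL_dW)
print("dL/dx =", dL_dx)
```

The layer needs only its own cached values (a and x) plus the incoming gradient, which is the locality the text describes. The first row of dL/dW is zero because that unit's pre-activation is negative, so ReLU blocks its gradient.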

Check: Why is backpropagation more efficient than computing each gradient separately?

It uses a simpler formula
It reuses intermediate gradients (each computed once and shared), making cost O(N) instead of O(N²)
It only computes gradients for the last layer

Chapter 9: Automatic Differentiation

Backpropagation is a specific case of a broader idea: automatic differentiation (autodiff). Autodiff computes derivatives of any program built from differentiable operations by tracing the computation and applying the chain rule mechanically. There is no finite-difference approximation and no symbolic manipulation: the derivatives are exact up to floating-point rounding.

Method	How it works	Pros	Cons
Symbolic	Manipulate algebraic expressions	Exact, interpretable	Expression explosion for complex functions
Numerical	Finite differences: (f(x+h)−f(x))/h	Simple, works for anything	Approximate, O(n) cost per gradient, unstable h
Forward-mode AD	Propagate derivatives alongside values	Exact, efficient for few inputs	One pass per input dimension
Reverse-mode AD	Record forward, replay backward	Exact, one pass for all gradients	Memory for the tape

Reverse-mode autodiff is backpropagation generalized. The "tape" (computation graph) is recorded during the forward pass. The backward pass walks the tape in reverse, accumulating gradients. PyTorch's autograd and JAX's grad implement this.
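To make the tape idea concrete, here is a toy reverse-mode sketch for scalars. The `Value` class is hypothetical and written for this illustration; it is not how PyTorch or JAX are implemented internally, and it supports only the few operations needed for L = (wx + b − y)².

```python
class Value:
    """A scalar that remembers its inputs and local derivatives (a toy tape)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def square(self):
        return Value(self.data ** 2, (self,), (2.0 * self.data,))

    def backward(self):
        # Order nodes so each one appears after all of its inputs
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)

        # Walk the recorded graph in reverse, accumulating gradients
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


w, b = Value(1.5), Value(0.5)
x, y = Value(2.0), Value(1.0)

L = (w * x + b - y).square()  # forward pass records the graph
L.backward()                  # one backward pass fills in every gradient

print(w.grad, b.grad)
```

A single call to `backward()` produces the gradients for both w and b, matching the hand-derived values 10 and 5 for these inputs.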

Forward-mode vs. reverse-mode: Forward-mode computes one column of the Jacobian per pass (how one input affects all outputs). Reverse-mode computes one row per pass (how all inputs affect one output). For a loss function f: Rⁿ → R with n parameters and one scalar output, reverse-mode needs one backward pass to get the full gradient. Forward-mode would need n passes. Since ML losses are scalar-valued, reverse-mode (backprop) wins decisively.
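Forward mode can be sketched with dual numbers, which carry a value and a derivative through every operation. The `Dual` class and `forward_mode_gradient` below are hypothetical names written for this illustration. The point to notice is that the loss has to be evaluated once per parameter.

```python
class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __sub__(self, other):
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        # Product rule carried alongside the value
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def loss(w, b, x, y):
    r = w * x + b - y
    return r * r


def forward_mode_gradient(w, b, x, y):
    # One full evaluation per parameter: seed deriv = 1 on the input of interest
    dL_dw = loss(Dual(w, 1.0), Dual(b), Dual(x), Dual(y)).deriv
    dL_db = loss(Dual(w), Dual(b, 1.0), Dual(x), Dual(y)).deriv
    return dL_dw, dL_db


print(forward_mode_gradient(1.5, 0.5, 2.0, 1.0))
```

With two parameters this costs two passes. With n parameters it would cost n, which is why reverse mode is the usual choice for scalar losses.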

python
import torch

# Define parameters (requires_grad=True tells PyTorch to track)
w = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(1.0)

# Forward pass (builds computation graph automatically)
L = (w * x + b - y) ** 2

# Backward pass (computes all gradients in one call)
L.backward()

print(w.grad)  # dL/dw = 2(wx+b-y)*x = 2(3.5-1)*2 = 10.0
print(b.grad)  # dL/db = 2(wx+b-y)*1 = 2(3.5-1) = 5.0

The connection between this chapter's ideas and real ML frameworks:

Concept	In Code
Forward pass	`loss = loss_fn(model(x), y)`
Build computation graph	Happens automatically as operations run (PyTorch) or when `jax.grad` / `jax.jit` trace the function (JAX)
Backward pass	`loss.backward()` or `jax.grad(loss_fn)`
Gradient descent step	`optimizer.step()`

Where this leads (chapter numbers here refer to the book, not to the sections of this lesson): Chapter 5 gives you the math. Chapter 7 (Optimization) will build on these gradients to cover gradient descent, momentum, Adam, and learning rate schedules. Chapter 9 (Linear Regression) will use the normal equation we derived. And Chapter 10 (PCA) will use eigendecomposition from Chapter 4 on the covariance matrix.

"The chain rule is the most important single idea in differential calculus." — Gilbert Strang

Check: Why is reverse-mode autodiff preferred over forward-mode for training neural networks?

Because losses are scalar-valued: reverse-mode gets the full gradient in one backward pass, vs. n passes for forward-mode
Because forward-mode doesn't give exact derivatives
Because reverse-mode uses less memory