Kochenderfer & Wheeler, Chapter 2

Derivatives and Gradients

The gradient tells you which direction is downhill. Without it, optimization is a walk in the dark.

Prerequisites: Chapter 1 (Introduction), basic calculus.
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Derivatives?

Imagine you are standing on a hilly landscape in fog. You can't see the lowest valley, but you can feel the slope beneath your feet. If the ground tilts down to your left, stepping left will take you lower. That slope — the rate of change — is the derivative.

Most optimization algorithms use derivative information to decide which direction to search. The derivative tells us how the objective function changes as we perturb the design variables. In one dimension, this is a single number. In multiple dimensions, it is a vector called the gradient.

The core idea: The gradient ∇f(x) points in the direction of steepest ascent. To minimize f, move in the opposite direction: −∇f(x). This is the foundation of gradient descent and most optimization algorithms in this book.

But what if you can't compute the derivative analytically? This chapter covers three alternatives:

Numerical Differentiation
Approximate the derivative using function evaluations (finite differences).
Complex Step Method
Use complex arithmetic to avoid subtraction cancellation errors.
Automatic Differentiation
Exact derivatives computed by propagating through the computation graph.
The gradient of a function points in which direction?

Chapter 1: Derivatives

The derivative of a univariate function f at a point x is the limit:

f'(x) = limh→0 [f(x + h) − f(x)] / h

Geometrically, f'(x) is the slope of the tangent line at x. When f'(x) > 0, the function is increasing; when f'(x) < 0, it is decreasing; when f'(x) = 0, we are at a stationary point — potentially a minimum, maximum, or saddle.

Key insight: The derivative is a local property. It tells you the instantaneous rate of change at a single point. Optimization uses this local information to make incremental progress toward a global solution.

Common derivative rules used in optimization:

RuleFormula
Power ruled/dx (xn) = n xn−1
Chain ruled/dx f(g(x)) = f'(g(x)) · g'(x)
Product ruled/dx [f(x)g(x)] = f'g + fg'
Sum ruled/dx [f(x) + g(x)] = f'(x) + g'(x)

The chain rule is especially critical — it is the mathematical foundation of backpropagation in neural networks and automatic differentiation systems.

At a stationary point, what is the value of f'(x)?

Chapter 2: Gradients & Hessians

For a function of multiple variables f(x1, ..., xn), the gradient is the vector of all partial derivatives:

∇f(x) = [∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn]

The gradient points in the direction of steepest ascent. Its magnitude tells you how steep the slope is. To minimize, move opposite to the gradient.

The Hessian is the matrix of second partial derivatives:

∇²f(x) = matrix where entry (i,j) = ∂²f / ∂xi∂xj
Key insight: The gradient tells you which direction is downhill. The Hessian tells you the curvature — how fast the slope is changing. Together they let you make much smarter steps than the gradient alone. This is the basis of second-order methods (Chapter 6).

The directional derivative gives the rate of change along any direction d:

d f(x) = ∇f(x) · d

This is maximized when d is parallel to ∇f (steepest ascent) and minimized when d is anti-parallel (steepest descent).

The Hessian matrix contains which type of information?

Chapter 3: Numerical Differentiation

When you can't compute derivatives analytically, you can approximate them using finite differences. The idea: evaluate f at two nearby points and compute the slope of the secant line.

Forward difference:

f'(x) ≈ [f(x + h) − f(x)] / h

Central difference:

f'(x) ≈ [f(x + h) − f(x − h)] / (2h)
MethodEvaluationsError Order
Forward difference2O(h)
Central difference2O(h²)
Complex step1 (complex)O(h²), no cancellation
The step size dilemma: Make h too large and you get a poor approximation of the tangent. Make h too small and floating-point subtraction cancellation dominates. The sweet spot for forward differences is h ≈ √ε ≈ 10−8 (where ε is machine epsilon). Central differences allow h ≈ ε1/3 ≈ 10−5.

For multivariate functions, numerical differentiation becomes expensive: you need at least n+1 function evaluations to approximate the full gradient in n dimensions (forward difference) or 2n evaluations (central difference).

Why is the central difference method more accurate than forward difference?

Chapter 4: Complex Step Method

The complex step method is an elegant trick. Instead of perturbing x by a real h, perturb it by an imaginary ih:

f'(x) ≈ Im[f(x + ih)] / h

Why does this work? The Taylor expansion in the complex plane gives f(x + ih) = f(x) + ih f'(x) − h²f''(x)/2 + .... Taking the imaginary part: Im[f(x+ih)] = h f'(x) + O(h³). Dividing by h gives f'(x) with O(h²) error.

The magic: There is no subtraction of nearly equal quantities! In finite differences, f(x+h) − f(x) subtracts two close numbers, losing significant digits. The complex step method extracts the derivative from the imaginary part, avoiding this cancellation. You can use h = 10−20 or smaller.

The catch: your function implementation must support complex number inputs. Not all code does — operations like abs() or comparisons need special handling. But when it works, the complex step method gives derivatives accurate to machine precision.

Why does the complex step method avoid subtraction cancellation?

Chapter 5: Automatic Differentiation

Automatic differentiation (AD) computes exact derivatives by applying the chain rule systematically through the computation graph of a program. It is neither symbolic differentiation (which manipulates expressions) nor numerical differentiation (which approximates with finite differences).

Every computer program is a sequence of elementary operations (+, −, ×, ÷, sin, exp, ...). Each operation has a known derivative. AD applies the chain rule through this sequence to propagate derivative information.

Key insight: AD gives you derivatives that are exact to machine precision, at a cost proportional to the cost of evaluating the function itself. This is why modern deep learning frameworks (PyTorch, JAX) are built on AD.

There are two modes of AD:

ModeDirectionCost per derivativeBest for
Forward modeInput → OutputO(1) per input variableFew inputs, many outputs
Reverse modeOutput → InputO(1) per output variableMany inputs, few outputs

Neural network training has millions of inputs (parameters) and one output (loss), so reverse mode AD (backpropagation) is overwhelmingly efficient: one backward pass gives the gradient with respect to all parameters.

For a function with many inputs and one output (like a neural network loss), which AD mode is more efficient?

Chapter 6: Forward vs Reverse

Let's trace through an example. Consider the function f(x1, x2) = sin(x1) + x1x2. The computation graph has intermediate variables:

computation graph
a = sin(x1)        # elementary op
b = x1 * x2        # elementary op
f = a + b          # elementary op

Forward mode propagates derivatives forward from inputs. For each input xi, we compute dxi/dxi = 1 and propagate through. One forward pass gives ∂f/∂xi for one input at a time.

Reverse mode propagates derivatives backward from the output. Starting with ∂f/∂f = 1, we work backward through the graph. One backward pass gives ∂f/∂xi for all inputs simultaneously.

Computational cost: If the function maps Rn → Rm:
Forward mode: n forward passes for the full Jacobian.
Reverse mode: m backward passes for the full Jacobian.
When m = 1 (scalar loss), reverse mode wins by a factor of n — which can be millions for neural networks.

The tradeoff: reverse mode requires storing the entire computation graph in memory (for the backward pass). This is why GPU memory is the bottleneck when training large models.

For a function f: R³ → R, how many passes does forward mode need for the full gradient?

Chapter 7: Derivative Explorer

The canvas below shows a function and its derivative. Drag the slider to move the evaluation point and watch the tangent line and derivative value change in real time.

Function & Derivative Visualization

Drag the slider to move the point. The tangent line shows the local derivative. f(x) = sin(x) + 0.3x.

x0.00
What to notice: When the derivative is positive (tangent slopes up), f is increasing. When the derivative is negative (tangent slopes down), f is decreasing. At the stationary points where f'(x) = 0, the tangent is horizontal — these are candidates for minima or maxima.
Where the tangent line is horizontal, the function has a:

Chapter 8: Regression Gradient

When the function is expensive or noisy, you can estimate the gradient from a set of sample points using regression. This is useful in simulation-based optimization where each function evaluation might take minutes or hours.

The idea: fit a linear model to nearby samples and use its coefficients as a gradient estimate. Given k sample points around x, fit:

f(x + δ) ≈ f(x) + g⊃T δ

where g is the gradient estimate, obtained by solving a least-squares problem.

A related technique is the Simultaneous Perturbation Stochastic Approximation (SPSA), which estimates the gradient using only two function evaluations regardless of the dimension:

gi ≈ [f(x + cΔ) − f(x − cΔ)] / (2c Δi)

where Δ is a random perturbation vector. This gives a noisy but unbiased gradient estimate, making it useful for high-dimensional noisy problems.

Key insight: SPSA needs only 2 function evaluations per gradient estimate, regardless of dimensionality. Compare this to finite differences which need 2n evaluations. The tradeoff is higher variance — you need more iterations to converge.
How many function evaluations does SPSA need per gradient estimate?

Chapter 9: Summary

Derivatives and gradients are the navigational instruments of optimization. They tell us which direction to move and how far.

ConceptWhat It Means
Derivative f'(x)Rate of change of f at x (univariate)
Gradient ∇f(x)Vector of partial derivatives (direction of steepest ascent)
Hessian ∇²f(x)Matrix of second derivatives (curvature information)
Forward difference[f(x+h) − f(x)]/h, O(h) error
Central difference[f(x+h) − f(x−h)]/(2h), O(h²) error
Complex stepIm[f(x+ih)]/h, no cancellation error
Forward ADPropagates derivatives input → output, costs O(n)
Reverse ADPropagates derivatives output → input, costs O(m)
Looking ahead: Chapter 3 uses derivatives to optimize univariate functions (bracketing, golden section search). Chapters 4-6 use gradients and Hessians for multivariate optimization. The choice of differentiation method affects both accuracy and computational cost.
Automatic differentiation gives derivatives that are: