Ch 2: Derivatives and Gradients — Kochenderfer & Wheeler

Chapter 0: Why Derivatives?

Imagine you are standing on a hilly landscape in fog. You can't see the lowest valley, but you can feel the slope beneath your feet. If the ground tilts down to your left, stepping left will take you lower. That slope — the rate of change — is the derivative.

Most optimization algorithms use derivative information to decide which direction to search. The derivative tells us how the objective function changes as we perturb the design variables. In one dimension, this is a single number. In multiple dimensions, it is a vector called the gradient.

The core idea: The gradient ∇f(x) points in the direction of steepest ascent. To minimize f, move in the opposite direction: −∇f(x). This is the foundation of gradient descent and most optimization algorithms in this book.

But what if you can't compute the derivative analytically? This chapter covers three alternatives:

Numerical Differentiation

Approximate the derivative using function evaluations (finite differences).

↓

Complex Step Method

Use complex arithmetic to avoid subtraction cancellation errors.

↓

Automatic Differentiation

Exact derivatives computed by propagating through the computation graph.

The gradient of a function points in which direction?

The direction of steepest ascent The direction of steepest descent Along the contour line

Chapter 1: Derivatives

The derivative of a univariate function f at a point x is the limit:

f'(x) = lim_h→0 [f(x + h) − f(x)] / h

Geometrically, f'(x) is the slope of the tangent line at x. When f'(x) > 0, the function is increasing; when f'(x) < 0, it is decreasing; when f'(x) = 0, we are at a stationary point — potentially a minimum, maximum, or saddle.

Key insight: The derivative is a local property. It tells you the instantaneous rate of change at a single point. Optimization uses this local information to make incremental progress toward a global solution.

Common derivative rules used in optimization:

Rule	Formula
Power rule	d/dx (xⁿ) = n xⁿ⁻¹
Chain rule	d/dx f(g(x)) = f'(g(x)) · g'(x)
Product rule	d/dx [f(x)g(x)] = f'g + fg'
Sum rule	d/dx [f(x) + g(x)] = f'(x) + g'(x)

The chain rule is especially critical — it is the mathematical foundation of backpropagation in neural networks and automatic differentiation systems.

At a stationary point, what is the value of f'(x)?

Positive Zero Negative

Chapter 2: Gradients & Hessians

For a function of multiple variables f(x₁, ..., x_n), the gradient is the vector of all partial derivatives:

∇f(x) = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂x_n]

The gradient points in the direction of steepest ascent. Its magnitude tells you how steep the slope is. To minimize, move opposite to the gradient.

The Hessian is the matrix of second partial derivatives:

∇²f(x) = matrix where entry (i,j) = ∂²f / ∂x_i∂x_j

Key insight: The gradient tells you which direction is downhill. The Hessian tells you the curvature — how fast the slope is changing. Together they let you make much smarter steps than the gradient alone. This is the basis of second-order methods (Chapter 6).

The directional derivative gives the rate of change along any direction d:

∇_d f(x) = ∇f(x) · d

This is maximized when d is parallel to ∇f (steepest ascent) and minimized when d is anti-parallel (steepest descent).

The Hessian matrix contains which type of information?

First partial derivatives Second partial derivatives (curvature) Function values at nearby points

Chapter 3: Numerical Differentiation

When you can't compute derivatives analytically, you can approximate them using finite differences. The idea: evaluate f at two nearby points and compute the slope of the secant line.

Forward difference:

f'(x) ≈ [f(x + h) − f(x)] / h

Central difference:

f'(x) ≈ [f(x + h) − f(x − h)] / (2h)

Method	Evaluations	Error Order
Forward difference	2	O(h)
Central difference	2	O(h²)
Complex step	1 (complex)	O(h²), no cancellation

The step size dilemma: Make h too large and you get a poor approximation of the tangent. Make h too small and floating-point subtraction cancellation dominates. The sweet spot for forward differences is h ≈ √ε ≈ 10⁻⁸ (where ε is machine epsilon). Central differences allow h ≈ ε^1/3 ≈ 10⁻⁵.

For multivariate functions, numerical differentiation becomes expensive: you need at least n+1 function evaluations to approximate the full gradient in n dimensions (forward difference) or 2n evaluations (central difference).

Why is the central difference method more accurate than forward difference?

It cancels the first-order error term, giving O(h²) accuracy instead of O(h) It uses more function evaluations It uses a larger step size

Chapter 4: Complex Step Method

The complex step method is an elegant trick. Instead of perturbing x by a real h, perturb it by an imaginary ih:

f'(x) ≈ Im[f(x + ih)] / h

Why does this work? The Taylor expansion in the complex plane gives f(x + ih) = f(x) + ih f'(x) − h²f''(x)/2 + .... Taking the imaginary part: Im[f(x+ih)] = h f'(x) + O(h³). Dividing by h gives f'(x) with O(h²) error.

The magic: There is no subtraction of nearly equal quantities! In finite differences, f(x+h) − f(x) subtracts two close numbers, losing significant digits. The complex step method extracts the derivative from the imaginary part, avoiding this cancellation. You can use h = 10⁻²⁰ or smaller.

The catch: your function implementation must support complex number inputs. Not all code does — operations like abs() or comparisons need special handling. But when it works, the complex step method gives derivatives accurate to machine precision.

Why does the complex step method avoid subtraction cancellation?

The derivative comes from the imaginary part, not a difference of real values It uses a larger step size It uses higher-precision arithmetic

Chapter 5: Automatic Differentiation

Automatic differentiation (AD) computes exact derivatives by applying the chain rule systematically through the computation graph of a program. It is neither symbolic differentiation (which manipulates expressions) nor numerical differentiation (which approximates with finite differences).

Every computer program is a sequence of elementary operations (+, −, ×, ÷, sin, exp, ...). Each operation has a known derivative. AD applies the chain rule through this sequence to propagate derivative information.

Key insight: AD gives you derivatives that are exact to machine precision, at a cost proportional to the cost of evaluating the function itself. This is why modern deep learning frameworks (PyTorch, JAX) are built on AD.

There are two modes of AD:

Mode	Direction	Cost per derivative	Best for
Forward mode	Input → Output	O(1) per input variable	Few inputs, many outputs
Reverse mode	Output → Input	O(1) per output variable	Many inputs, few outputs

Neural network training has millions of inputs (parameters) and one output (loss), so reverse mode AD (backpropagation) is overwhelmingly efficient: one backward pass gives the gradient with respect to all parameters.

For a function with many inputs and one output (like a neural network loss), which AD mode is more efficient?

Forward mode Reverse mode (backpropagation) Both are equally efficient

Chapter 6: Forward vs Reverse

Let's trace through an example. Consider the function f(x₁, x₂) = sin(x₁) + x₁x₂. The computation graph has intermediate variables:

computation graph
a = sin(x₁)        # elementary op
b = x₁ * x₂        # elementary op
f = a + b          # elementary op

Forward mode propagates derivatives forward from inputs. For each input x_i, we compute dx_i/dx_i = 1 and propagate through. One forward pass gives ∂f/∂x_i for one input at a time.

Reverse mode propagates derivatives backward from the output. Starting with ∂f/∂f = 1, we work backward through the graph. One backward pass gives ∂f/∂x_i for all inputs simultaneously.

Computational cost: If the function maps Rⁿ → R^m:
Forward mode: n forward passes for the full Jacobian.
Reverse mode: m backward passes for the full Jacobian.
When m = 1 (scalar loss), reverse mode wins by a factor of n — which can be millions for neural networks.

The tradeoff: reverse mode requires storing the entire computation graph in memory (for the backward pass). This is why GPU memory is the bottleneck when training large models.

For a function f: R³ → R, how many passes does forward mode need for the full gradient?

1 3 (one per input dimension) 9

Chapter 7: Derivative Explorer

The canvas below shows a function and its derivative. Drag the slider to move the evaluation point and watch the tangent line and derivative value change in real time.

Function & Derivative Visualization

Drag the slider to move the point. The tangent line shows the local derivative. f(x) = sin(x) + 0.3x.

x0.00

What to notice: When the derivative is positive (tangent slopes up), f is increasing. When the derivative is negative (tangent slopes down), f is decreasing. At the stationary points where f'(x) = 0, the tangent is horizontal — these are candidates for minima or maxima.

Where the tangent line is horizontal, the function has a:

Stationary point (potential min, max, or saddle) Global minimum Inflection point

Chapter 8: Regression Gradient

When the function is expensive or noisy, you can estimate the gradient from a set of sample points using regression. This is useful in simulation-based optimization where each function evaluation might take minutes or hours.

The idea: fit a linear model to nearby samples and use its coefficients as a gradient estimate. Given k sample points around x, fit:

f(x + δ) ≈ f(x) + g⊃T δ

where g is the gradient estimate, obtained by solving a least-squares problem.

A related technique is the Simultaneous Perturbation Stochastic Approximation (SPSA), which estimates the gradient using only two function evaluations regardless of the dimension:

g_i ≈ [f(x + cΔ) − f(x − cΔ)] / (2c Δ_i)

where Δ is a random perturbation vector. This gives a noisy but unbiased gradient estimate, making it useful for high-dimensional noisy problems.

Key insight: SPSA needs only 2 function evaluations per gradient estimate, regardless of dimensionality. Compare this to finite differences which need 2n evaluations. The tradeoff is higher variance — you need more iterations to converge.

How many function evaluations does SPSA need per gradient estimate?

2, regardless of dimensionality n+1 (one per dimension plus one) 2n (two per dimension)

Chapter 9: Summary

Derivatives and gradients are the navigational instruments of optimization. They tell us which direction to move and how far.

Concept	What It Means
Derivative f'(x)	Rate of change of f at x (univariate)
Gradient ∇f(x)	Vector of partial derivatives (direction of steepest ascent)
Hessian ∇²f(x)	Matrix of second derivatives (curvature information)
Forward difference	[f(x+h) − f(x)]/h, O(h) error
Central difference	[f(x+h) − f(x−h)]/(2h), O(h²) error
Complex step	Im[f(x+ih)]/h, no cancellation error
Forward AD	Propagates derivatives input → output, costs O(n)
Reverse AD	Propagates derivatives output → input, costs O(m)

Looking ahead: Chapter 3 uses derivatives to optimize univariate functions (bracketing, golden section search). Chapters 4-6 use gradients and Hessians for multivariate optimization. The choice of differentiation method affects both accuracy and computational cost.

Automatic differentiation gives derivatives that are:

Exact to machine precision, at cost proportional to function evaluation Approximate, like finite differences Symbolic expressions that grow exponentially