The gradient tells you which direction is downhill. Without it, optimization is a walk in the dark.
Imagine you are standing on a hilly landscape in fog. You can't see the lowest valley, but you can feel the slope beneath your feet. If the ground tilts down to your left, stepping left will take you lower. That slope — the rate of change — is the derivative.
Most optimization algorithms use derivative information to decide which direction to search. The derivative tells us how the objective function changes as we perturb the design variables. In one dimension, this is a single number. In multiple dimensions, it is a vector called the gradient.
But what if you can't compute the derivative analytically? This chapter covers three alternatives:
The derivative of a univariate function f at a point x is the limit:
Geometrically, f'(x) is the slope of the tangent line at x. When f'(x) > 0, the function is increasing; when f'(x) < 0, it is decreasing; when f'(x) = 0, we are at a stationary point — potentially a minimum, maximum, or saddle.
Common derivative rules used in optimization:
| Rule | Formula |
|---|---|
| Power rule | d/dx (xn) = n xn−1 |
| Chain rule | d/dx f(g(x)) = f'(g(x)) · g'(x) |
| Product rule | d/dx [f(x)g(x)] = f'g + fg' |
| Sum rule | d/dx [f(x) + g(x)] = f'(x) + g'(x) |
The chain rule is especially critical — it is the mathematical foundation of backpropagation in neural networks and automatic differentiation systems.
For a function of multiple variables f(x1, ..., xn), the gradient is the vector of all partial derivatives:
The gradient points in the direction of steepest ascent. Its magnitude tells you how steep the slope is. To minimize, move opposite to the gradient.
The Hessian is the matrix of second partial derivatives:
The directional derivative gives the rate of change along any direction d:
This is maximized when d is parallel to ∇f (steepest ascent) and minimized when d is anti-parallel (steepest descent).
When you can't compute derivatives analytically, you can approximate them using finite differences. The idea: evaluate f at two nearby points and compute the slope of the secant line.
Forward difference:
Central difference:
| Method | Evaluations | Error Order |
|---|---|---|
| Forward difference | 2 | O(h) |
| Central difference | 2 | O(h²) |
| Complex step | 1 (complex) | O(h²), no cancellation |
For multivariate functions, numerical differentiation becomes expensive: you need at least n+1 function evaluations to approximate the full gradient in n dimensions (forward difference) or 2n evaluations (central difference).
The complex step method is an elegant trick. Instead of perturbing x by a real h, perturb it by an imaginary ih:
Why does this work? The Taylor expansion in the complex plane gives f(x + ih) = f(x) + ih f'(x) − h²f''(x)/2 + .... Taking the imaginary part: Im[f(x+ih)] = h f'(x) + O(h³). Dividing by h gives f'(x) with O(h²) error.
The catch: your function implementation must support complex number inputs. Not all code does — operations like abs() or comparisons need special handling. But when it works, the complex step method gives derivatives accurate to machine precision.
Automatic differentiation (AD) computes exact derivatives by applying the chain rule systematically through the computation graph of a program. It is neither symbolic differentiation (which manipulates expressions) nor numerical differentiation (which approximates with finite differences).
Every computer program is a sequence of elementary operations (+, −, ×, ÷, sin, exp, ...). Each operation has a known derivative. AD applies the chain rule through this sequence to propagate derivative information.
There are two modes of AD:
| Mode | Direction | Cost per derivative | Best for |
|---|---|---|---|
| Forward mode | Input → Output | O(1) per input variable | Few inputs, many outputs |
| Reverse mode | Output → Input | O(1) per output variable | Many inputs, few outputs |
Neural network training has millions of inputs (parameters) and one output (loss), so reverse mode AD (backpropagation) is overwhelmingly efficient: one backward pass gives the gradient with respect to all parameters.
Let's trace through an example. Consider the function f(x1, x2) = sin(x1) + x1x2. The computation graph has intermediate variables:
computation graph a = sin(x1) # elementary op b = x1 * x2 # elementary op f = a + b # elementary op
Forward mode propagates derivatives forward from inputs. For each input xi, we compute dxi/dxi = 1 and propagate through. One forward pass gives ∂f/∂xi for one input at a time.
Reverse mode propagates derivatives backward from the output. Starting with ∂f/∂f = 1, we work backward through the graph. One backward pass gives ∂f/∂xi for all inputs simultaneously.
The tradeoff: reverse mode requires storing the entire computation graph in memory (for the backward pass). This is why GPU memory is the bottleneck when training large models.
The canvas below shows a function and its derivative. Drag the slider to move the evaluation point and watch the tangent line and derivative value change in real time.
Drag the slider to move the point. The tangent line shows the local derivative. f(x) = sin(x) + 0.3x.
When the function is expensive or noisy, you can estimate the gradient from a set of sample points using regression. This is useful in simulation-based optimization where each function evaluation might take minutes or hours.
The idea: fit a linear model to nearby samples and use its coefficients as a gradient estimate. Given k sample points around x, fit:
where g is the gradient estimate, obtained by solving a least-squares problem.
A related technique is the Simultaneous Perturbation Stochastic Approximation (SPSA), which estimates the gradient using only two function evaluations regardless of the dimension:
where Δ is a random perturbation vector. This gives a noisy but unbiased gradient estimate, making it useful for high-dimensional noisy problems.
Derivatives and gradients are the navigational instruments of optimization. They tell us which direction to move and how far.
| Concept | What It Means |
|---|---|
| Derivative f'(x) | Rate of change of f at x (univariate) |
| Gradient ∇f(x) | Vector of partial derivatives (direction of steepest ascent) |
| Hessian ∇²f(x) | Matrix of second derivatives (curvature information) |
| Forward difference | [f(x+h) − f(x)]/h, O(h) error |
| Central difference | [f(x+h) − f(x−h)]/(2h), O(h²) error |
| Complex step | Im[f(x+ih)]/h, no cancellation error |
| Forward AD | Propagates derivatives input → output, costs O(n) |
| Reverse AD | Propagates derivatives output → input, costs O(m) |