The gradient points uphill. Go the other way — but be clever about it.
Chapter 4 gave us the descent framework: pick a direction, pick a step size, repeat. Now we need to fill in the most important blank — which direction?
First-order methods use only the gradient ∇f(x) (first derivatives) to choose the descent direction. They never compute the Hessian (second derivatives), which makes them cheap per iteration but potentially slow to converge.
We'll progress from the simplest (gradient descent) to the most sophisticated (Adam, hypergradient), watching each method overcome a limitation of the previous one.
The simplest first-order method: move in the direction of steepest descent, which is the negative gradient.
The gradient ∇f points in the direction of steepest ascent. So −∇f points in the direction of steepest descent. The step size α controls how far you go.
For a quadratic f(x) = xTAx/2, gradient descent converges at a rate determined by the condition number κ = λmax/λmin of A. Large condition number = slow convergence = bad.
| κ(A) | Behavior |
|---|---|
| 1 (perfectly conditioned) | Converges in 1 step |
| 10 | Moderate zigzagging |
| 1000 | Severe zigzagging, very slow |
The zigzag problem of gradient descent arises because each step "forgets" the previous direction. Conjugate gradient fixes this by choosing each new direction to be conjugate to the previous one with respect to the Hessian A.
Two directions d1 and d2 are A-conjugate if d1T A d2 = 0. Think of it as "orthogonal in the geometry defined by A."
where β(k) = (g(k+1))T g(k+1) / (g(k))T g(k) is the Fletcher-Reeves formula. The magic: this builds in "memory" of the previous direction without ever computing the Hessian.
For non-quadratic functions, conjugate gradient is a heuristic — the conjugacy property only holds exactly for quadratics. Still, it vastly outperforms plain gradient descent on most smooth problems. The Polak-Ribière variant uses β = g(k+1)T(g(k+1) − g(k)) / g(k)Tg(k), which resets to gradient descent automatically when progress stalls.
Another way to fix zigzagging: add momentum. Instead of following the gradient directly, accumulate a velocity that builds up in consistent directions and cancels out in oscillating ones.
The momentum parameter β (typically 0.9) controls how much of the previous velocity is retained. Think of a ball rolling downhill: it builds speed in the downhill direction and smooths over small bumps.
| β | Behavior |
|---|---|
| 0 | No momentum — plain gradient descent |
| 0.9 | Standard momentum — good default |
| 0.99 | Heavy momentum — very smooth but can overshoot |
Standard momentum has a subtle flaw: it computes the gradient at the current position, then adds momentum. Nesterov momentum (also called Nesterov accelerated gradient) does it the other way around: first "look ahead" to where momentum would take you, then compute the gradient there.
Nesterov momentum has provably better convergence on convex functions: O(1/k2) vs. O(1/k) for standard momentum, where k is the iteration count. This makes it strictly superior in theory, and usually better in practice too.
Momentum addresses the direction problem. Adaptive gradient methods address the step size problem: different parameters may need different learning rates.
AdaGrad tracks the sum of squared gradients for each parameter and scales the step inversely:
Parameters with large historical gradients get smaller steps; parameters with small historical gradients get larger steps. This is perfect for sparse data where some features are rare.
With γ = 0.999, RMSProp forgets old gradients and adapts to the current landscape. It is the direct predecessor of Adam.
| Method | Accumulator | Decay | Issue |
|---|---|---|---|
| AdaGrad | Sum of g2 | None | Learning rate dies |
| RMSProp | EMA of g2 | γ ≈ 0.999 | No bias correction |
| Adadelta | EMA of g2 and Δx2 | γ | No learning rate needed |
Adam (Adaptive Moment Estimation) combines the best of momentum and RMSProp. It maintains two moving averages: one for the gradient (first moment, like momentum) and one for the squared gradient (second moment, like RMSProp).
The hat notation v̂ and ŝ denote bias-corrected estimates: v̂ = v / (1 − β1k), ŝ = s / (1 − β2k). Without this correction, the early estimates are biased toward zero (since v and s are initialized to zero).
Every optimizer has a learning rate α. Usually you tune it by trial and error. Hypergradient descent takes a radical approach: apply gradient descent to the learning rate itself.
The idea: the objective f(x(k+1)) depends on α(k) (because x(k+1) = x(k) − α(k) g(k)). So we can compute the derivative of f with respect to α and update α using gradient descent:
This can be applied to any gradient-based method: hypergradient Adam, hypergradient Nesterov, etc. The extra cost is negligible — just one dot product per step.
Watch four optimizers race to minimize the Rosenbrock function. Notice how momentum-based methods cut through the narrow valley while plain gradient descent zigzags.
Click Step to advance all optimizers simultaneously. Each color is a different method.
| Method | Direction | Per-Param Rates | Convergence |
|---|---|---|---|
| Gradient descent | −∇f | No | O(1/k) |
| Conjugate gradient | CG direction | No | n steps for quadratic |
| Momentum | −∇f + βv | No | O(1/k) |
| Nesterov | Look-ahead gradient | No | O(1/k²) |
| AdaGrad | −∇f / √s | Yes | Effective rate decays |
| RMSProp | −∇f / √s | Yes (EMA) | Non-decaying |
| Adam | Momentum + adaptive | Yes (EMA) | Best general-purpose |
| Hypergradient | Meta-learning α | N/A (on top of any) | Adaptive α |