Kochenderfer & Wheeler, Chapter 5

First-Order Methods

The gradient points uphill. Go the other way — but be clever about it.

Prerequisites: Chapter 4 (Local Descent).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why First-Order?

Chapter 4 gave us the descent framework: pick a direction, pick a step size, repeat. Now we need to fill in the most important blank — which direction?

First-order methods use only the gradient ∇f(x) (first derivatives) to choose the descent direction. They never compute the Hessian (second derivatives), which makes them cheap per iteration but potentially slow to converge.

Why it matters: In machine learning, the objective function has millions or billions of parameters. Computing the Hessian is impossible — it would be an n × n matrix where n can be 109. First-order methods are the only game in town for training deep neural networks.

We'll progress from the simplest (gradient descent) to the most sophisticated (Adam, hypergradient), watching each method overcome a limitation of the previous one.

First-order methods are preferred in deep learning because:

Chapter 1: Gradient Descent

The simplest first-order method: move in the direction of steepest descent, which is the negative gradient.

x(k+1) = x(k) − α ∇f(x(k))

The gradient ∇f points in the direction of steepest ascent. So −∇f points in the direction of steepest descent. The step size α controls how far you go.

Key insight: Gradient descent is the "greedy" strategy — at each step, it picks the locally steepest direction. This is optimal for a single step, but not globally optimal. On elongated valleys, gradient descent zigzags back and forth, making very slow progress along the valley floor. This is the fundamental weakness that all subsequent methods try to fix.

For a quadratic f(x) = xTAx/2, gradient descent converges at a rate determined by the condition number κ = λmaxmin of A. Large condition number = slow convergence = bad.

κ(A)Behavior
1 (perfectly conditioned)Converges in 1 step
10Moderate zigzagging
1000Severe zigzagging, very slow
Gradient descent converges slowly on ill-conditioned problems because:

Chapter 2: Conjugate Gradient

The zigzag problem of gradient descent arises because each step "forgets" the previous direction. Conjugate gradient fixes this by choosing each new direction to be conjugate to the previous one with respect to the Hessian A.

Two directions d1 and d2 are A-conjugate if d1T A d2 = 0. Think of it as "orthogonal in the geometry defined by A."

d(k+1) = −g(k+1) + β(k) d(k)

where β(k) = (g(k+1))T g(k+1) / (g(k))T g(k) is the Fletcher-Reeves formula. The magic: this builds in "memory" of the previous direction without ever computing the Hessian.

The payoff: For an n-dimensional quadratic, conjugate gradient converges in at most n steps (compared to gradient descent which may need thousands). It automatically adjusts to narrow valleys, eliminating the zigzag problem.

For non-quadratic functions, conjugate gradient is a heuristic — the conjugacy property only holds exactly for quadratics. Still, it vastly outperforms plain gradient descent on most smooth problems. The Polak-Ribière variant uses β = g(k+1)T(g(k+1) − g(k)) / g(k)Tg(k), which resets to gradient descent automatically when progress stalls.

Conjugate gradient converges in at most n steps for:

Chapter 3: Momentum

Another way to fix zigzagging: add momentum. Instead of following the gradient directly, accumulate a velocity that builds up in consistent directions and cancels out in oscillating ones.

v(k+1) = β v(k) − α g(k)
x(k+1) = x(k) + v(k+1)

The momentum parameter β (typically 0.9) controls how much of the previous velocity is retained. Think of a ball rolling downhill: it builds speed in the downhill direction and smooths over small bumps.

Key insight: In the zigzag scenario, the gradient oscillates side-to-side while consistently pointing down the valley. Momentum averages these out: the oscillations cancel, and the consistent downhill component accumulates. The result is fast progress along the valley floor.
βBehavior
0No momentum — plain gradient descent
0.9Standard momentum — good default
0.99Heavy momentum — very smooth but can overshoot
Momentum helps gradient descent by:

Chapter 4: Nesterov Momentum

Standard momentum has a subtle flaw: it computes the gradient at the current position, then adds momentum. Nesterov momentum (also called Nesterov accelerated gradient) does it the other way around: first "look ahead" to where momentum would take you, then compute the gradient there.

v(k+1) = β v(k) − α ∇f(x(k) + β v(k))
x(k+1) = x(k) + v(k+1)
The intuition: Standard momentum is like a ball that rolls and then checks the gradient. Nesterov momentum is like a ball that peeks ahead to see what's coming. If the terrain is about to slope upward (meaning you're about to overshoot), the look-ahead gradient provides a corrective signal before you overshoot, rather than after.

Nesterov momentum has provably better convergence on convex functions: O(1/k2) vs. O(1/k) for standard momentum, where k is the iteration count. This makes it strictly superior in theory, and usually better in practice too.

Nesterov momentum differs from standard momentum by:

Chapter 5: AdaGrad & RMSProp

Momentum addresses the direction problem. Adaptive gradient methods address the step size problem: different parameters may need different learning rates.

AdaGrad tracks the sum of squared gradients for each parameter and scales the step inversely:

si ← si + (gi)2     xi ← xi − α gi / √(si + ε)

Parameters with large historical gradients get smaller steps; parameters with small historical gradients get larger steps. This is perfect for sparse data where some features are rare.

The problem with AdaGrad: The accumulated si only grows, never shrinks. Over time, the effective learning rate goes to zero and learning stops. RMSProp fixes this by using an exponentially decaying average instead of a running sum:
si ← γ si + (1 − γ)(gi)2

With γ = 0.999, RMSProp forgets old gradients and adapts to the current landscape. It is the direct predecessor of Adam.

MethodAccumulatorDecayIssue
AdaGradSum of g2NoneLearning rate dies
RMSPropEMA of g2γ ≈ 0.999No bias correction
AdadeltaEMA of g2 and Δx2γNo learning rate needed
RMSProp improves over AdaGrad by:

Chapter 6: Adam

Adam (Adaptive Moment Estimation) combines the best of momentum and RMSProp. It maintains two moving averages: one for the gradient (first moment, like momentum) and one for the squared gradient (second moment, like RMSProp).

v ← β1 v + (1 − β1) g    (first moment / momentum)
s ← β2 s + (1 − β2) g ⊙ g    (second moment / adaptive rate)
x ← x − α v̂ / (√ŝ + ε)

The hat notation v̂ and ŝ denote bias-corrected estimates: v̂ = v / (1 − β1k), ŝ = s / (1 − β2k). Without this correction, the early estimates are biased toward zero (since v and s are initialized to zero).

Why Adam dominates: It adapts the learning rate per-parameter (from RMSProp), builds momentum in consistent directions (from momentum), and corrects for initialization bias. Default hyperparameters β1 = 0.9, β2 = 0.999, α = 10−3 work well across a wide range of problems. This is why Adam is the default optimizer in deep learning.
Adam combines which two ideas?

Chapter 7: Hypergradient Descent

Every optimizer has a learning rate α. Usually you tune it by trial and error. Hypergradient descent takes a radical approach: apply gradient descent to the learning rate itself.

The idea: the objective f(x(k+1)) depends on α(k) (because x(k+1) = x(k) − α(k) g(k)). So we can compute the derivative of f with respect to α and update α using gradient descent:

∂f(x(k+1)) / ∂α(k) = −(g(k+1))T g(k)
Key insight: The hypergradient −g(k+1)T g(k) has a beautiful interpretation. If consecutive gradients point in the same direction (positive dot product), the learning rate should increase — we're making consistent progress. If they point in opposite directions (negative dot product), the learning rate should decrease — we're oscillating.
α(k) = α(k−1) + μ (g(k))T g(k−1)

This can be applied to any gradient-based method: hypergradient Adam, hypergradient Nesterov, etc. The extra cost is negligible — just one dot product per step.

Hypergradient descent increases the learning rate when:

Chapter 8: Optimizer Race

Watch four optimizers race to minimize the Rosenbrock function. Notice how momentum-based methods cut through the narrow valley while plain gradient descent zigzags.

Optimizer Comparison on Rosenbrock

Click Step to advance all optimizers simultaneously. Each color is a different method.

Step 0
● GD ● Momentum ● Nesterov ● Adam
On ill-conditioned problems like Rosenbrock, which optimizer typically converges fastest?

Chapter 9: Summary

MethodDirectionPer-Param RatesConvergence
Gradient descent−∇fNoO(1/k)
Conjugate gradientCG directionNon steps for quadratic
Momentum−∇f + βvNoO(1/k)
NesterovLook-ahead gradientNoO(1/k²)
AdaGrad−∇f / √sYesEffective rate decays
RMSProp−∇f / √sYes (EMA)Non-decaying
AdamMomentum + adaptiveYes (EMA)Best general-purpose
HypergradientMeta-learning αN/A (on top of any)Adaptive α
Looking ahead: Chapter 6 introduces second-order methods that use curvature information (the Hessian) to take better steps. Newton's method converges quadratically near the optimum — dramatically faster than any first-order method — but at the cost of computing and inverting the Hessian.
The key advantage of adaptive methods (AdaGrad, RMSProp, Adam) over momentum is: