Ch 5: First-Order Methods — Kochenderfer & Wheeler

Chapter 0: Why First-Order?

Chapter 4 gave us the descent framework: pick a direction, pick a step size, repeat. Now we need to fill in the most important blank — which direction?

First-order methods use only the gradient ∇f(x) (first derivatives) to choose the descent direction. They never compute the Hessian (second derivatives), which makes them cheap per iteration but potentially slow to converge.

Why it matters: In machine learning, the objective function has millions or billions of parameters. Computing the Hessian is impossible — it would be an n × n matrix where n can be 10⁹. First-order methods are the only game in town for training deep neural networks.

We'll progress from the simplest (gradient descent) to the most sophisticated (Adam, hypergradient), watching each method overcome a limitation of the previous one.

First-order methods are preferred in deep learning because:

The Hessian is too expensive to compute for millions of parameters They always find the global minimum They don't require function evaluations

Chapter 1: Gradient Descent

The simplest first-order method: move in the direction of steepest descent, which is the negative gradient.

x^(k+1) = x^(k) − α ∇f(x^(k))

The gradient ∇f points in the direction of steepest ascent. So −∇f points in the direction of steepest descent. The step size α controls how far you go.

Key insight: Gradient descent is the "greedy" strategy — at each step, it picks the locally steepest direction. This is optimal for a single step, but not globally optimal. On elongated valleys, gradient descent zigzags back and forth, making very slow progress along the valley floor. This is the fundamental weakness that all subsequent methods try to fix.

For a quadratic f(x) = x^TAx/2, gradient descent converges at a rate determined by the condition number κ = λ_max/λ_min of A. Large condition number = slow convergence = bad.

κ(A)	Behavior
1 (perfectly conditioned)	Converges in 1 step
10	Moderate zigzagging
1000	Severe zigzagging, very slow

Gradient descent converges slowly on ill-conditioned problems because:

It zigzags across elongated valleys instead of moving along them The gradient is always zero It requires the Hessian

Chapter 2: Conjugate Gradient

The zigzag problem of gradient descent arises because each step "forgets" the previous direction. Conjugate gradient fixes this by choosing each new direction to be conjugate to the previous one with respect to the Hessian A.

Two directions d₁ and d₂ are A-conjugate if d₁^T A d₂ = 0. Think of it as "orthogonal in the geometry defined by A."

d^(k+1) = −g^(k+1) + β^(k) d^(k)

where β^(k) = (g^(k+1))^T g^(k+1) / (g^(k))^T g^(k) is the Fletcher-Reeves formula. The magic: this builds in "memory" of the previous direction without ever computing the Hessian.

The payoff: For an n-dimensional quadratic, conjugate gradient converges in at most n steps (compared to gradient descent which may need thousands). It automatically adjusts to narrow valleys, eliminating the zigzag problem.

For non-quadratic functions, conjugate gradient is a heuristic — the conjugacy property only holds exactly for quadratics. Still, it vastly outperforms plain gradient descent on most smooth problems. The Polak-Ribière variant uses β = g^(k+1)T(g^(k+1) − g^(k)) / g^(k)Tg^(k), which resets to gradient descent automatically when progress stalls.

Conjugate gradient converges in at most n steps for:

n-dimensional quadratic functions Any n-dimensional function Only 1D functions

Chapter 3: Momentum

Another way to fix zigzagging: add momentum. Instead of following the gradient directly, accumulate a velocity that builds up in consistent directions and cancels out in oscillating ones.

v^(k+1) = β v^(k) − α g^(k)

x^(k+1) = x^(k) + v^(k+1)

The momentum parameter β (typically 0.9) controls how much of the previous velocity is retained. Think of a ball rolling downhill: it builds speed in the downhill direction and smooths over small bumps.

Key insight: In the zigzag scenario, the gradient oscillates side-to-side while consistently pointing down the valley. Momentum averages these out: the oscillations cancel, and the consistent downhill component accumulates. The result is fast progress along the valley floor.

β	Behavior
0	No momentum — plain gradient descent
0.9	Standard momentum — good default
0.99	Heavy momentum — very smooth but can overshoot

Momentum helps gradient descent by:

Accumulating velocity in consistent directions and canceling oscillations Computing second derivatives Reducing the learning rate over time

Chapter 4: Nesterov Momentum

Standard momentum has a subtle flaw: it computes the gradient at the current position, then adds momentum. Nesterov momentum (also called Nesterov accelerated gradient) does it the other way around: first "look ahead" to where momentum would take you, then compute the gradient there.

v^(k+1) = β v^(k) − α ∇f(x^(k) + β v^(k))

x^(k+1) = x^(k) + v^(k+1)

The intuition: Standard momentum is like a ball that rolls and then checks the gradient. Nesterov momentum is like a ball that peeks ahead to see what's coming. If the terrain is about to slope upward (meaning you're about to overshoot), the look-ahead gradient provides a corrective signal before you overshoot, rather than after.

Nesterov momentum has provably better convergence on convex functions: O(1/k²) vs. O(1/k) for standard momentum, where k is the iteration count. This makes it strictly superior in theory, and usually better in practice too.

Nesterov momentum differs from standard momentum by:

Computing the gradient at the look-ahead position rather than the current position Using a larger step size Ignoring the previous velocity

Chapter 5: AdaGrad & RMSProp

Momentum addresses the direction problem. Adaptive gradient methods address the step size problem: different parameters may need different learning rates.

AdaGrad tracks the sum of squared gradients for each parameter and scales the step inversely:

s_i ← s_i + (g_i)² x_i ← x_i − α g_i / √(s_i + ε)

Parameters with large historical gradients get smaller steps; parameters with small historical gradients get larger steps. This is perfect for sparse data where some features are rare.

The problem with AdaGrad: The accumulated s_i only grows, never shrinks. Over time, the effective learning rate goes to zero and learning stops. RMSProp fixes this by using an exponentially decaying average instead of a running sum:

s_i ← γ s_i + (1 − γ)(g_i)²

With γ = 0.999, RMSProp forgets old gradients and adapts to the current landscape. It is the direct predecessor of Adam.

Method	Accumulator	Decay	Issue
AdaGrad	Sum of g²	None	Learning rate dies
RMSProp	EMA of g²	γ ≈ 0.999	No bias correction
Adadelta	EMA of g² and Δx²	γ	No learning rate needed

RMSProp improves over AdaGrad by:

Using an exponentially decaying average instead of a running sum Computing second derivatives Using a fixed learning rate

Chapter 6: Adam

Adam (Adaptive Moment Estimation) combines the best of momentum and RMSProp. It maintains two moving averages: one for the gradient (first moment, like momentum) and one for the squared gradient (second moment, like RMSProp).

v ← β₁ v + (1 − β₁) g (first moment / momentum)

s ← β₂ s + (1 − β₂) g ⊙ g (second moment / adaptive rate)

x ← x − α v̂ / (√ŝ + ε)

The hat notation v̂ and ŝ denote bias-corrected estimates: v̂ = v / (1 − β₁^k), ŝ = s / (1 − β₂^k). Without this correction, the early estimates are biased toward zero (since v and s are initialized to zero).

Why Adam dominates: It adapts the learning rate per-parameter (from RMSProp), builds momentum in consistent directions (from momentum), and corrects for initialization bias. Default hyperparameters β₁ = 0.9, β₂ = 0.999, α = 10⁻³ work well across a wide range of problems. This is why Adam is the default optimizer in deep learning.

Adam combines which two ideas?

Momentum (first moment) and adaptive learning rates (second moment) Conjugate gradient and Newton's method Line search and trust region

Chapter 7: Hypergradient Descent

Every optimizer has a learning rate α. Usually you tune it by trial and error. Hypergradient descent takes a radical approach: apply gradient descent to the learning rate itself.

The idea: the objective f(x^(k+1)) depends on α^(k) (because x^(k+1) = x^(k) − α^(k) g^(k)). So we can compute the derivative of f with respect to α and update α using gradient descent:

∂f(x^(k+1)) / ∂α^(k) = −(g^(k+1))^T g^(k)

Key insight: The hypergradient −g^(k+1)T g^(k) has a beautiful interpretation. If consecutive gradients point in the same direction (positive dot product), the learning rate should increase — we're making consistent progress. If they point in opposite directions (negative dot product), the learning rate should decrease — we're oscillating.

α^(k) = α^(k−1) + μ (g^(k))^T g^(k−1)

This can be applied to any gradient-based method: hypergradient Adam, hypergradient Nesterov, etc. The extra cost is negligible — just one dot product per step.

Hypergradient descent increases the learning rate when:

Consecutive gradients point in the same direction The function value increases The gradient norm is large

Chapter 8: Optimizer Race

Watch four optimizers race to minimize the Rosenbrock function. Notice how momentum-based methods cut through the narrow valley while plain gradient descent zigzags.

Optimizer Comparison on Rosenbrock

Click Step to advance all optimizers simultaneously. Each color is a different method.

Step 0

● GD ● Momentum ● Nesterov ● Adam

On ill-conditioned problems like Rosenbrock, which optimizer typically converges fastest?

Plain gradient descent Momentum Adam (adaptive rates + momentum)

Chapter 9: Summary

Method	Direction	Per-Param Rates	Convergence
Gradient descent	−∇f	No	O(1/k)
Conjugate gradient	CG direction	No	n steps for quadratic
Momentum	−∇f + βv	No	O(1/k)
Nesterov	Look-ahead gradient	No	O(1/k²)
AdaGrad	−∇f / √s	Yes	Effective rate decays
RMSProp	−∇f / √s	Yes (EMA)	Non-decaying
Adam	Momentum + adaptive	Yes (EMA)	Best general-purpose
Hypergradient	Meta-learning α	N/A (on top of any)	Adaptive α

Looking ahead: Chapter 6 introduces second-order methods that use curvature information (the Hessian) to take better steps. Newton's method converges quadratically near the optimum — dramatically faster than any first-order method — but at the cost of computing and inverting the Hessian.

The key advantage of adaptive methods (AdaGrad, RMSProp, Adam) over momentum is:

They scale the learning rate independently for each parameter They don't need gradients They use the Hessian