Kochenderfer & Wheeler, Chapter 4

Local Descent

Pick a direction. Pick a step size. Repeat until you can't go any lower.

Prerequisites: Chapter 3 (Bracketing).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Local Descent?

In Chapter 3 we mastered 1D optimization — finding the minimum of a function along a single line. But real problems live in many dimensions. How do you minimize f(x1, x2, ..., xn)?

The core idea: Local descent methods work by repeatedly choosing a direction to move and a step size for how far to go. Each step reduces the objective, and you stop when you can't improve anymore. The 1D line search from Chapter 3 becomes a subroutine — once you pick a direction, finding the best step size is a bracketing problem.

This is the template that nearly every optimization algorithm follows. The differences between methods (gradient descent, Newton's method, conjugate gradient) come down to how they choose the descent direction.

Choose Direction d
Which way reduces f? Gradient, Newton, or something else.
Choose Step Size α
How far along d? Line search or trust region.
Update x ← x + αd
Move to the new point. Check for convergence.
Local descent methods differ primarily in how they choose:

Chapter 1: The Descent Direction Iteration

The general framework for local descent is elegantly simple. Starting from an initial point x(1), repeat:

pseudocode
for k = 1, 2, ...
  d(k) ← choose_direction(x(k))    # descent direction
  α(k) ← choose_step(x(k), d(k))   # step size
  x(k+1) ← x(k) + α(k) d(k)        # update
end

For this to work, two conditions must hold:

Key insight: Any direction that makes an acute angle with the negative gradient is a descent direction. The steepest descent direction is −∇f(x), but it is not the only valid choice. Newton's method and conjugate gradient use different directions that can converge much faster.
A direction d is a valid descent direction if:

Chapter 2: Step Factors

Once you have a descent direction d, how far should you go? The step factor α controls this. Too small and convergence is glacial. Too large and you overshoot or even diverge.

Several strategies exist:

StrategyHow α Is ChosenCost
Fixed stepConstant α (e.g., 0.01)Free but fragile
Decaying stepαk = α0 / kFree but needs tuning
Exact line searchminα>0 f(x + αd)Expensive: full 1D optimization
BacktrackingStart large, shrink until sufficient decreaseModerate: a few f evaluations
The tradeoff: Exact line search finds the perfect step size but costs many function evaluations. Backtracking line search is "good enough" — it finds an acceptable step size with far fewer evaluations. In practice, approximate methods almost always win.

A fixed step factor can work for well-behaved problems, but it requires knowing the problem's curvature in advance. If the curvature varies (which it usually does), a fixed step either oscillates near the minimum or creeps along in flat regions.

The main disadvantage of exact line search is:

Chapter 3: Line Search

Given a direction d, the line search problem is: minimize f(x + αd) over α > 0. This is a 1D optimization problem — exactly what Chapter 3's bracketing methods solve.

An exact line search finds the α that truly minimizes f along d. At the optimum, the gradient is orthogonal to d:

∇f(x + α*d)T d = 0
Key insight: Exact line search has a beautiful geometric consequence: each successive direction is perpendicular to the previous one (since the gradient at the minimum is orthogonal to d). For gradient descent, this means the path zigzags — each step is at right angles to the last. This zigzag pattern is a hallmark of gradient descent with exact line search.

For quadratic functions f(x) = xTAx/2 + bTx, the exact line search has a closed-form solution:

α* = −(gT d) / (dT A d)

where g = ∇f(x). For non-quadratic functions, you resort to bracketing methods (golden section, quadratic fit) from Chapter 3.

With exact line search, consecutive gradient descent directions are:

Chapter 4: Approximate Line Search

Exact line search is overkill for most problems. Why spend 20 function evaluations finding the perfect step size when an approximate one works nearly as well? Two conditions define "good enough":

The Wolfe conditions ensure the step is both "big enough" (sufficient decrease) and "not too big" (curvature condition):

f(x + αd) ≤ f(x) + c1 α ∇f(x)Td    (sufficient decrease)
∇f(x + αd)Td ≥ c2 ∇f(x)Td    (curvature condition)

Typical values: c1 = 10−4, c2 = 0.9. The first condition says "f must decrease at least as much as a gentle linear model predicts." The second says "the slope at the new point shouldn't be too negative — we should be near the bottom of the valley along d."

Backtracking line search is simpler: start with a large α, check if the sufficient decrease condition holds. If not, shrink α by a factor (e.g., α ← 0.5α) and try again. This only enforces the first Wolfe condition but works well in practice.
pseudocode
function backtracking_line_search(f, ∇f, x, d; α=1, c=1e-4, ρ=0.5)
  y, g = f(x), ∇f(x)T d
  while f(x + α*d) > y + c*α*g
    α *= ρ               # shrink step
  return α
The sufficient decrease (Armijo) condition ensures that:

Chapter 5: Trust Region Methods

Line search says: "first choose a direction, then find the right step size." Trust region methods flip this: "first choose a maximum step size, then find the best direction within that radius."

The idea: maintain a region around the current point where a local model (usually quadratic) is trusted to be accurate. Minimize the model within that region:

minimizex' f̂(x')   subject to   ||x − x'|| ≤ δ
Key insight: After each step, compare the actual improvement f(x) − f(x') to the predicted improvement f(x) − f̂(x'). If the ratio η is close to 1, the model is accurate — expand the trust region. If η is small, the model was overoptimistic — shrink the trust region. If η is negative, the step actually increased f — reject it entirely.
Ratio ηAction
η < η1 (e.g., 0.25)Shrink trust region: δ ← γ1δ
η1 ≤ η ≤ η2Keep trust region unchanged
η > η2 (e.g., 0.75)Expand trust region: δ ← γ2δ

Trust region methods are particularly robust. They never take a step that increases the objective (unlike poorly-tuned line search), and they adapt automatically to the local curvature of the function.

Trust region methods differ from line search methods in that they:

Chapter 6: Termination Conditions

How do you know when to stop? Four common criteria:

Maximum Iterations
Stop after kmax steps. Simple but wasteful — you might stop too early or too late.
Absolute Improvement
Stop when |f(x(k)) − f(x(k+1))| < εa. The function value barely changes.
Relative Improvement
Stop when |f(x(k)) − f(x(k+1))| / |f(x(k))| < εr. Scale-independent.
Gradient Magnitude
Stop when ||∇f(x)|| < ε. Directly tests the first-order optimality condition.
In practice: Use a combination. Check gradient magnitude (the most principled condition) plus a maximum iteration count as a safety net. For trust region methods, the duality gap (Chapter 11) provides an even stronger stopping criterion.
Which termination condition directly tests the first-order optimality condition?

Chapter 7: Descent Comparison

Let's compare the two frameworks — line search vs. trust region — head to head:

PropertyLine SearchTrust Region
Order of decisionsDirection first, then step sizeStep size first, then direction
Step rejectionRare (backtracking ensures decrease)Common (reject if η too small)
AdaptivityAdapts step size along fixed directionAdapts both direction and step size
RobustnessCan fail with bad directionVery robust — self-correcting radius
Cost per stepMultiple f evaluations for line searchSolve constrained subproblem
When to use which: Line search methods are simpler and dominate in deep learning (SGD, Adam). Trust region methods are preferred for small-to-medium problems where robustness matters — engineering design, scientific computing, nonlinear least squares (Levenberg-Marquardt is essentially a trust region method).

Both frameworks share the same convergence guarantees under mild conditions: they converge to a point where ∇f = 0. The rate of convergence depends on the descent direction choice, not the step size strategy.

Trust region methods are generally considered more robust because:

Chapter 8: Backtracking Simulator

Watch backtracking line search navigate a 2D function. The algorithm follows the negative gradient, using backtracking to find an acceptable step size at each iteration.

Backtracking Line Search on the Rosenbrock Function

Click Step to advance. Red dot = current point. Trail shows the path taken. Notice the zigzag pattern.

Step 0 | f = 4.00
The zigzag pattern in gradient descent with line search occurs because:

Chapter 9: Summary

ConceptWhat It Means
Local descentIteratively pick direction + step size to reduce f
Descent condition∇fTd < 0 ensures d is downhill
Exact line searchSolve minα f(x + αd) exactly
Wolfe conditionsSufficient decrease + curvature = "good enough" step
BacktrackingStart large, shrink until Armijo condition holds
Trust regionFix max step size, find best direction within radius
TerminationGradient norm, improvement threshold, or max iterations
Looking ahead: Chapter 5 fills in the most important blank: which direction to choose. Gradient descent uses −∇f, but conjugate gradient, momentum, and Adam all do better. Chapter 6 goes further with second-order methods that use curvature information from the Hessian.
In the descent iteration x(k+1) = x(k) + αd, the line search determines: