Ch 4: Local Descent — Kochenderfer & Wheeler

Chapter 0: Why Local Descent?

In Chapter 3 we mastered 1D optimization — finding the minimum of a function along a single line. But real problems live in many dimensions. How do you minimize f(x₁, x₂, ..., x_n)?

The core idea: Local descent methods work by repeatedly choosing a direction to move and a step size for how far to go. Each step reduces the objective, and you stop when you can't improve anymore. The 1D line search from Chapter 3 becomes a subroutine — once you pick a direction, finding the best step size is a bracketing problem.

This is the template that nearly every optimization algorithm follows. The differences between methods (gradient descent, Newton's method, conjugate gradient) come down to how they choose the descent direction.

Choose Direction d

Which way reduces f? Gradient, Newton, or something else.

↓

Choose Step Size α

How far along d? Line search or trust region.

↓

Update x ← x + αd

Move to the new point. Check for convergence.

Local descent methods differ primarily in how they choose:

The descent direction The number of variables The objective function

Chapter 1: The Descent Direction Iteration

The general framework for local descent is elegantly simple. Starting from an initial point x⁽¹⁾, repeat:

pseudocode
for k = 1, 2, ...
  d^(k) ← choose_direction(x^(k))    # descent direction
  α^(k) ← choose_step(x^(k), d^(k))   # step size
  x^(k+1) ← x^(k) + α^(k) d^(k)        # update
end

For this to work, two conditions must hold:

Descent condition: The direction d must satisfy ∇f(x)^T d < 0. This guarantees that moving along d (at least initially) decreases f.
Sufficient decrease: The step size α must actually reduce f. Taking too large a step could overshoot.

Key insight: Any direction that makes an acute angle with the negative gradient is a descent direction. The steepest descent direction is −∇f(x), but it is not the only valid choice. Newton's method and conjugate gradient use different directions that can converge much faster.

A direction d is a valid descent direction if:

∇f(x)^T d < 0 (negative directional derivative) d points away from the origin ||d|| > 1

Chapter 2: Step Factors

Once you have a descent direction d, how far should you go? The step factor α controls this. Too small and convergence is glacial. Too large and you overshoot or even diverge.

Several strategies exist:

Strategy	How α Is Chosen	Cost
Fixed step	Constant α (e.g., 0.01)	Free but fragile
Decaying step	α_k = α₀ / k	Free but needs tuning
Exact line search	min_α>0 f(x + αd)	Expensive: full 1D optimization
Backtracking	Start large, shrink until sufficient decrease	Moderate: a few f evaluations

The tradeoff: Exact line search finds the perfect step size but costs many function evaluations. Backtracking line search is "good enough" — it finds an acceptable step size with far fewer evaluations. In practice, approximate methods almost always win.

A fixed step factor can work for well-behaved problems, but it requires knowing the problem's curvature in advance. If the curvature varies (which it usually does), a fixed step either oscillates near the minimum or creeps along in flat regions.

The main disadvantage of exact line search is:

It requires many function evaluations per step It cannot guarantee descent It only works in 1D

Chapter 3: Line Search

Given a direction d, the line search problem is: minimize f(x + αd) over α > 0. This is a 1D optimization problem — exactly what Chapter 3's bracketing methods solve.

An exact line search finds the α that truly minimizes f along d. At the optimum, the gradient is orthogonal to d:

∇f(x + α*d)^T d = 0

Key insight: Exact line search has a beautiful geometric consequence: each successive direction is perpendicular to the previous one (since the gradient at the minimum is orthogonal to d). For gradient descent, this means the path zigzags — each step is at right angles to the last. This zigzag pattern is a hallmark of gradient descent with exact line search.

For quadratic functions f(x) = x^TAx/2 + b^Tx, the exact line search has a closed-form solution:

α* = −(g^T d) / (d^T A d)

where g = ∇f(x). For non-quadratic functions, you resort to bracketing methods (golden section, quadratic fit) from Chapter 3.

With exact line search, consecutive gradient descent directions are:

Orthogonal to each other Parallel to each other Random

Chapter 4: Approximate Line Search

Exact line search is overkill for most problems. Why spend 20 function evaluations finding the perfect step size when an approximate one works nearly as well? Two conditions define "good enough":

The Wolfe conditions ensure the step is both "big enough" (sufficient decrease) and "not too big" (curvature condition):

f(x + αd) ≤ f(x) + c₁ α ∇f(x)^Td (sufficient decrease)

∇f(x + αd)^Td ≥ c₂ ∇f(x)^Td (curvature condition)

Typical values: c₁ = 10⁻⁴, c₂ = 0.9. The first condition says "f must decrease at least as much as a gentle linear model predicts." The second says "the slope at the new point shouldn't be too negative — we should be near the bottom of the valley along d."

Backtracking line search is simpler: start with a large α, check if the sufficient decrease condition holds. If not, shrink α by a factor (e.g., α ← 0.5α) and try again. This only enforces the first Wolfe condition but works well in practice.

pseudocode
function backtracking_line_search(f, ∇f, x, d; α=1, c=1e-4, ρ=0.5)
  y, g = f(x), ∇f(x)^T d
  while f(x + α*d) > y + c*α*g
    α *= ρ               # shrink step
  return α

The sufficient decrease (Armijo) condition ensures that:

f decreases by at least a fraction of what a linear model predicts The gradient becomes zero The step size is smaller than 1

Chapter 5: Trust Region Methods

Line search says: "first choose a direction, then find the right step size." Trust region methods flip this: "first choose a maximum step size, then find the best direction within that radius."

The idea: maintain a region around the current point where a local model (usually quadratic) is trusted to be accurate. Minimize the model within that region:

minimize_x' f̂(x') subject to ||x − x'|| ≤ δ

Key insight: After each step, compare the actual improvement f(x) − f(x') to the predicted improvement f(x) − f̂(x'). If the ratio η is close to 1, the model is accurate — expand the trust region. If η is small, the model was overoptimistic — shrink the trust region. If η is negative, the step actually increased f — reject it entirely.

Ratio η	Action
η < η₁ (e.g., 0.25)	Shrink trust region: δ ← γ₁δ
η₁ ≤ η ≤ η₂	Keep trust region unchanged
η > η₂ (e.g., 0.75)	Expand trust region: δ ← γ₂δ

Trust region methods are particularly robust. They never take a step that increases the objective (unlike poorly-tuned line search), and they adapt automatically to the local curvature of the function.

Trust region methods differ from line search methods in that they:

Choose the maximum step size first, then find the best direction within it Always use the gradient direction Do not require function evaluations

Chapter 6: Termination Conditions

How do you know when to stop? Four common criteria:

Maximum Iterations

Stop after k_max steps. Simple but wasteful — you might stop too early or too late.

↓

Absolute Improvement

Stop when |f(x^(k)) − f(x^(k+1))| < ε_a. The function value barely changes.

↓

Relative Improvement

Stop when |f(x^(k)) − f(x^(k+1))| / |f(x^(k))| < ε_r. Scale-independent.

↓

Gradient Magnitude

Stop when ||∇f(x)|| < ε. Directly tests the first-order optimality condition.

In practice: Use a combination. Check gradient magnitude (the most principled condition) plus a maximum iteration count as a safety net. For trust region methods, the duality gap (Chapter 11) provides an even stronger stopping criterion.

Which termination condition directly tests the first-order optimality condition?

Maximum iterations Absolute improvement Gradient magnitude below threshold

Chapter 7: Descent Comparison

Let's compare the two frameworks — line search vs. trust region — head to head:

Property	Line Search	Trust Region
Order of decisions	Direction first, then step size	Step size first, then direction
Step rejection	Rare (backtracking ensures decrease)	Common (reject if η too small)
Adaptivity	Adapts step size along fixed direction	Adapts both direction and step size
Robustness	Can fail with bad direction	Very robust — self-correcting radius
Cost per step	Multiple f evaluations for line search	Solve constrained subproblem

When to use which: Line search methods are simpler and dominate in deep learning (SGD, Adam). Trust region methods are preferred for small-to-medium problems where robustness matters — engineering design, scientific computing, nonlinear least squares (Levenberg-Marquardt is essentially a trust region method).

Both frameworks share the same convergence guarantees under mild conditions: they converge to a point where ∇f = 0. The rate of convergence depends on the descent direction choice, not the step size strategy.

Trust region methods are generally considered more robust because:

They self-correct the step size based on model accuracy They always use the gradient direction They require fewer function evaluations

Chapter 8: Backtracking Simulator

Watch backtracking line search navigate a 2D function. The algorithm follows the negative gradient, using backtracking to find an acceptable step size at each iteration.

Backtracking Line Search on the Rosenbrock Function

Click Step to advance. Red dot = current point. Trail shows the path taken. Notice the zigzag pattern.

Step 0 | f = 4.00

The zigzag pattern in gradient descent with line search occurs because:

Consecutive search directions are orthogonal after exact line search The step size is too large The function is non-convex

Chapter 9: Summary

Concept	What It Means
Local descent	Iteratively pick direction + step size to reduce f
Descent condition	∇f^Td < 0 ensures d is downhill
Exact line search	Solve min_α f(x + αd) exactly
Wolfe conditions	Sufficient decrease + curvature = "good enough" step
Backtracking	Start large, shrink until Armijo condition holds
Trust region	Fix max step size, find best direction within radius
Termination	Gradient norm, improvement threshold, or max iterations

Looking ahead: Chapter 5 fills in the most important blank: which direction to choose. Gradient descent uses −∇f, but conjugate gradient, momentum, and Adam all do better. Chapter 6 goes further with second-order methods that use curvature information from the Hessian.

In the descent iteration x^(k+1) = x^(k) + αd, the line search determines:

The direction d The step size α Both d and α