Pick a direction. Pick a step size. Repeat until you can't go any lower.
In Chapter 3 we mastered 1D optimization — finding the minimum of a function along a single line. But real problems live in many dimensions. How do you minimize f(x1, x2, ..., xn)?
This is the template that nearly every optimization algorithm follows. The differences between methods (gradient descent, Newton's method, conjugate gradient) come down to how they choose the descent direction.
The general framework for local descent is elegantly simple. Starting from an initial point x(1), repeat:
pseudocode for k = 1, 2, ... d(k) ← choose_direction(x(k)) # descent direction α(k) ← choose_step(x(k), d(k)) # step size x(k+1) ← x(k) + α(k) d(k) # update end
For this to work, two conditions must hold:
Once you have a descent direction d, how far should you go? The step factor α controls this. Too small and convergence is glacial. Too large and you overshoot or even diverge.
Several strategies exist:
| Strategy | How α Is Chosen | Cost |
|---|---|---|
| Fixed step | Constant α (e.g., 0.01) | Free but fragile |
| Decaying step | αk = α0 / k | Free but needs tuning |
| Exact line search | minα>0 f(x + αd) | Expensive: full 1D optimization |
| Backtracking | Start large, shrink until sufficient decrease | Moderate: a few f evaluations |
A fixed step factor can work for well-behaved problems, but it requires knowing the problem's curvature in advance. If the curvature varies (which it usually does), a fixed step either oscillates near the minimum or creeps along in flat regions.
Given a direction d, the line search problem is: minimize f(x + αd) over α > 0. This is a 1D optimization problem — exactly what Chapter 3's bracketing methods solve.
An exact line search finds the α that truly minimizes f along d. At the optimum, the gradient is orthogonal to d:
For quadratic functions f(x) = xTAx/2 + bTx, the exact line search has a closed-form solution:
where g = ∇f(x). For non-quadratic functions, you resort to bracketing methods (golden section, quadratic fit) from Chapter 3.
Exact line search is overkill for most problems. Why spend 20 function evaluations finding the perfect step size when an approximate one works nearly as well? Two conditions define "good enough":
The Wolfe conditions ensure the step is both "big enough" (sufficient decrease) and "not too big" (curvature condition):
Typical values: c1 = 10−4, c2 = 0.9. The first condition says "f must decrease at least as much as a gentle linear model predicts." The second says "the slope at the new point shouldn't be too negative — we should be near the bottom of the valley along d."
pseudocode function backtracking_line_search(f, ∇f, x, d; α=1, c=1e-4, ρ=0.5) y, g = f(x), ∇f(x)T d while f(x + α*d) > y + c*α*g α *= ρ # shrink step return α
Line search says: "first choose a direction, then find the right step size." Trust region methods flip this: "first choose a maximum step size, then find the best direction within that radius."
The idea: maintain a region around the current point where a local model (usually quadratic) is trusted to be accurate. Minimize the model within that region:
| Ratio η | Action |
|---|---|
| η < η1 (e.g., 0.25) | Shrink trust region: δ ← γ1δ |
| η1 ≤ η ≤ η2 | Keep trust region unchanged |
| η > η2 (e.g., 0.75) | Expand trust region: δ ← γ2δ |
Trust region methods are particularly robust. They never take a step that increases the objective (unlike poorly-tuned line search), and they adapt automatically to the local curvature of the function.
How do you know when to stop? Four common criteria:
Let's compare the two frameworks — line search vs. trust region — head to head:
| Property | Line Search | Trust Region |
|---|---|---|
| Order of decisions | Direction first, then step size | Step size first, then direction |
| Step rejection | Rare (backtracking ensures decrease) | Common (reject if η too small) |
| Adaptivity | Adapts step size along fixed direction | Adapts both direction and step size |
| Robustness | Can fail with bad direction | Very robust — self-correcting radius |
| Cost per step | Multiple f evaluations for line search | Solve constrained subproblem |
Both frameworks share the same convergence guarantees under mild conditions: they converge to a point where ∇f = 0. The rate of convergence depends on the descent direction choice, not the step size strategy.
Watch backtracking line search navigate a 2D function. The algorithm follows the negative gradient, using backtracking to find an acceptable step size at each iteration.
Click Step to advance. Red dot = current point. Trail shows the path taken. Notice the zigzag pattern.
| Concept | What It Means |
|---|---|
| Local descent | Iteratively pick direction + step size to reduce f |
| Descent condition | ∇fTd < 0 ensures d is downhill |
| Exact line search | Solve minα f(x + αd) exactly |
| Wolfe conditions | Sufficient decrease + curvature = "good enough" step |
| Backtracking | Start large, shrink until Armijo condition holds |
| Trust region | Fix max step size, find best direction within radius |
| Termination | Gradient norm, improvement threshold, or max iterations |