Kochenderfer & Wheeler, Chapter 7

Direct Methods

No derivatives? No problem. Let function evaluations alone guide you to the minimum.

Prerequisites: Chapter 4 (Local Descent).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Derivative-Free?

Sometimes you simply don't have derivatives. The function might be a black-box simulation, a physical experiment, or a noisy measurement. In these cases, you can only evaluate f(x) — no gradient, no Hessian.

The core idea: Direct methods optimize using only function evaluations. They explore the design space by probing points in systematic patterns, comparing values, and moving toward regions where f is lower. They are slower than gradient-based methods (no free lunch), but they work where nothing else can.

Approximating the gradient via finite differences requires n+1 evaluations per step. Approximating the Hessian requires O(n²). For expensive black-box functions, this is wasteful. Direct methods can be smarter about how they spend their evaluation budget.

Direct methods are necessary when:

Chapter 1: Coordinate Descent

The simplest direct method: optimize one variable at a time while holding the others fixed. This reduces an n-dimensional problem to a sequence of 1D problems, each solvable by bracketing (Chapter 3).

pseudocode
for k = 1, 2, ...
  for i = 1 to n
    xi ← argminxi f(x1,...,xi,...,xn)  # 1D search along axis i
  end
Key insight: Coordinate descent works well when the variables are nearly independent (the contours are axis-aligned). But when variables are coupled (elongated contours at an angle to the axes), it zigzags horribly — each axis-aligned step makes only a tiny fraction of the possible progress.

Despite this limitation, coordinate descent has seen a resurgence in machine learning for problems with special structure (e.g., LASSO regression), where the per-coordinate update has a closed-form solution.

Coordinate descent struggles when:

Chapter 2: Powell's Method

The weakness of coordinate descent is its fixed axis-aligned directions. Powell's method adapts: it learns good search directions from the progress made so far.

After cycling through all n coordinate directions, Powell replaces the direction that contributed least with the overall displacement direction (from the start of the cycle to the end). Over time, the search directions align with the principal axes of the function's contours.

Think of it this way: If coordinate descent is searching with a fixed compass (N, S, E, W), Powell's method rotates the compass to align with the terrain. After enough iterations, the directions become conjugate — similar to conjugate gradient but without derivatives.

Powell's method converges in n cycles for quadratic functions (like conjugate gradient), but it can fail when the adapted directions become linearly dependent. Modified Powell methods include safeguards to prevent this.

Powell's method improves on coordinate descent by:

Chapter 3: Hooke-Jeeves

The Hooke-Jeeves method (pattern search) combines two moves: an exploratory move that probes each coordinate direction, and a pattern move that extrapolates in the direction of recent progress.

Explore
Try ±δ in each coordinate. Keep improvements.
Pattern Move
Extrapolate: xnew = xcurrent + (xcurrent − xprevious)
Adapt
If no improvement found, halve the step size δ.
Key insight: The pattern move is the clever part. It uses the displacement from the last two base points to predict a promising direction. This is essentially momentum for derivative-free optimization — it builds up speed in directions of consistent improvement.
In Hooke-Jeeves, when no coordinate direction yields improvement:

Chapter 4: Generalized Pattern Search

Generalized pattern search (GPS) extends Hooke-Jeeves with a theoretical framework. The key requirement: the set of search directions must positively span Rn — meaning any direction can be written as a non-negative combination of the search directions.

For 2D, the simplest positive spanning set is {e1, e2, −e1, −e2} (the four compass directions). But any set of n+1 or more vectors that positively span Rn works.

Why positive spanning matters: If your search directions don't positively span Rn, there exists a descent direction that you can never find by searching along your directions. Positive spanning guarantees that at least one of your directions makes an acute angle with any descent direction — so you can always make progress.

GPS has convergence guarantees: under mild conditions on the function, the step size converges to zero at a point where the Clarke generalized derivative is non-negative in all search directions — a necessary condition for a local minimum.

A set of search directions must positively span Rn to guarantee:

Chapter 5: Nelder-Mead Simplex

The Nelder-Mead method is perhaps the most popular derivative-free optimizer. It maintains a simplex — a shape with n+1 vertices in n dimensions (a triangle in 2D, a tetrahedron in 3D). At each step, it replaces the worst vertex using one of four operations:

OperationWhat It DoesWhen
ReflectionMirror worst point through centroidReflected point is better than 2nd worst
ExpansionGo further in the reflection directionReflected point is the new best
ContractionMove halfway toward centroidReflected point is still the worst
ShrinkContract all vertices toward bestEven contraction didn't help
The intuition: The simplex acts like an amoeba, stretching toward promising regions and contracting away from bad ones. It adapts its shape to the local landscape: elongating along valleys, shrinking near the minimum. No derivatives, no step sizes to tune — just compare function values and move.

Nelder-Mead has no convergence guarantee in general (there exist functions where it fails to converge to a stationary point). But in practice, it works remarkably well for low-dimensional problems (n ≤ 10) and is the default "first try" derivative-free optimizer.

The Nelder-Mead simplex has how many vertices in n dimensions?

Chapter 6: DIRECT

DIRECT (DIviding RECTangles) is a global optimization method for bound-constrained problems. It recursively partitions the search space into hyper-rectangles and explores the most promising ones.

At each iteration, DIRECT identifies potentially optimal rectangles — those that could contain the global minimum for some Lipschitz constant. It then divides these rectangles into thirds along their longest dimension.

Key insight: DIRECT balances exploration (sampling large, unexplored rectangles) and exploitation (sampling rectangles with low function values). It doesn't need a Lipschitz constant like Shubert-Piyavskii — instead, it considers all possible Lipschitz constants simultaneously by checking if a rectangle is optimal for any one of them.

A rectangle is "potentially optimal" if it lies on the lower-right convex hull of the (rectangle-size, best-value) scatter plot. This elegant criterion automatically balances global and local search.

DIRECT avoids needing a Lipschitz constant by:

Chapter 7: When to Use What

ScenarioBest MethodWhy
Low-dim (n ≤ 10), smoothNelder-MeadAdapts shape, no tuning needed
Separable variablesCoordinate descentEach 1D problem is easy
No derivatives, need guaranteeGPSConvergence theory with positive spanning
Global search, bounds knownDIRECTBalances exploration and exploitation
Expensive evaluationsSurrogate (Ch 17–19)Model the function, optimize the model
Rule of thumb: If you have gradients, use them (Chapters 5–6). If you don't, start with Nelder-Mead for small problems or DIRECT for bounded global search. Only resort to coordinate descent if your problem has special separable structure.
For a 3D black-box function with no special structure, the best first-try direct method is:

Chapter 8: Nelder-Mead Simulator

Watch a Nelder-Mead simplex (triangle) crawl toward the minimum of a 2D function. The simplex reflects, expands, and contracts as it adapts to the landscape.

Nelder-Mead Simplex Method

Click Step to perform one Nelder-Mead operation. The triangle deforms to seek the minimum.

Step 0
Nelder-Mead decides between reflection, expansion, contraction, and shrink based on:

Chapter 9: Summary

MethodTypeGlobal?Convergence
Coordinate descentAxis-aligned 1DNoLinear (good for separable)
Powell's methodAdaptive 1DNon cycles for quadratic
Hooke-JeevesPattern + exploreNoLinear with adaptive δ
GPSPattern searchNoGuaranteed to stationary point
Nelder-MeadSimplex-basedNoNo guarantee, but practical
DIRECTRectangle partitionYesDense in limit
Looking ahead: Chapter 8 introduces stochastic methods — simulated annealing, cross-entropy, and CMA-ES — that use randomness to escape local minima. Chapter 9 extends this with population methods (genetic algorithms, particle swarm) that maintain many candidate solutions simultaneously.
Which direct method provides a global optimization guarantee?