No derivatives? No problem. Let function evaluations alone guide you to the minimum.
Sometimes you simply don't have derivatives. The function might be a black-box simulation, a physical experiment, or a noisy measurement. In these cases, you can only evaluate f(x) — no gradient, no Hessian.
Approximating the gradient via finite differences requires n+1 evaluations per step. Approximating the Hessian requires O(n²). For expensive black-box functions, this is wasteful. Direct methods can be smarter about how they spend their evaluation budget.
The simplest direct method: optimize one variable at a time while holding the others fixed. This reduces an n-dimensional problem to a sequence of 1D problems, each solvable by bracketing (Chapter 3).
pseudocode for k = 1, 2, ... for i = 1 to n xi ← argminxi f(x1,...,xi,...,xn) # 1D search along axis i end
Despite this limitation, coordinate descent has seen a resurgence in machine learning for problems with special structure (e.g., LASSO regression), where the per-coordinate update has a closed-form solution.
The weakness of coordinate descent is its fixed axis-aligned directions. Powell's method adapts: it learns good search directions from the progress made so far.
After cycling through all n coordinate directions, Powell replaces the direction that contributed least with the overall displacement direction (from the start of the cycle to the end). Over time, the search directions align with the principal axes of the function's contours.
Powell's method converges in n cycles for quadratic functions (like conjugate gradient), but it can fail when the adapted directions become linearly dependent. Modified Powell methods include safeguards to prevent this.
The Hooke-Jeeves method (pattern search) combines two moves: an exploratory move that probes each coordinate direction, and a pattern move that extrapolates in the direction of recent progress.
Generalized pattern search (GPS) extends Hooke-Jeeves with a theoretical framework. The key requirement: the set of search directions must positively span Rn — meaning any direction can be written as a non-negative combination of the search directions.
For 2D, the simplest positive spanning set is {e1, e2, −e1, −e2} (the four compass directions). But any set of n+1 or more vectors that positively span Rn works.
GPS has convergence guarantees: under mild conditions on the function, the step size converges to zero at a point where the Clarke generalized derivative is non-negative in all search directions — a necessary condition for a local minimum.
The Nelder-Mead method is perhaps the most popular derivative-free optimizer. It maintains a simplex — a shape with n+1 vertices in n dimensions (a triangle in 2D, a tetrahedron in 3D). At each step, it replaces the worst vertex using one of four operations:
| Operation | What It Does | When |
|---|---|---|
| Reflection | Mirror worst point through centroid | Reflected point is better than 2nd worst |
| Expansion | Go further in the reflection direction | Reflected point is the new best |
| Contraction | Move halfway toward centroid | Reflected point is still the worst |
| Shrink | Contract all vertices toward best | Even contraction didn't help |
Nelder-Mead has no convergence guarantee in general (there exist functions where it fails to converge to a stationary point). But in practice, it works remarkably well for low-dimensional problems (n ≤ 10) and is the default "first try" derivative-free optimizer.
DIRECT (DIviding RECTangles) is a global optimization method for bound-constrained problems. It recursively partitions the search space into hyper-rectangles and explores the most promising ones.
At each iteration, DIRECT identifies potentially optimal rectangles — those that could contain the global minimum for some Lipschitz constant. It then divides these rectangles into thirds along their longest dimension.
A rectangle is "potentially optimal" if it lies on the lower-right convex hull of the (rectangle-size, best-value) scatter plot. This elegant criterion automatically balances global and local search.
| Scenario | Best Method | Why |
|---|---|---|
| Low-dim (n ≤ 10), smooth | Nelder-Mead | Adapts shape, no tuning needed |
| Separable variables | Coordinate descent | Each 1D problem is easy |
| No derivatives, need guarantee | GPS | Convergence theory with positive spanning |
| Global search, bounds known | DIRECT | Balances exploration and exploitation |
| Expensive evaluations | Surrogate (Ch 17–19) | Model the function, optimize the model |
Watch a Nelder-Mead simplex (triangle) crawl toward the minimum of a 2D function. The simplex reflects, expands, and contracts as it adapts to the landscape.
Click Step to perform one Nelder-Mead operation. The triangle deforms to seek the minimum.
| Method | Type | Global? | Convergence |
|---|---|---|---|
| Coordinate descent | Axis-aligned 1D | No | Linear (good for separable) |
| Powell's method | Adaptive 1D | No | n cycles for quadratic |
| Hooke-Jeeves | Pattern + explore | No | Linear with adaptive δ |
| GPS | Pattern search | No | Guaranteed to stationary point |
| Nelder-Mead | Simplex-based | No | No guarantee, but practical |
| DIRECT | Rectangle partition | Yes | Dense in limit |