Inject randomness strategically to escape local minima and explore the design space more broadly.
Every method we have seen so far — gradient descent, Newton's method, Nelder-Mead — can get trapped in local minima. When the landscape has many valleys, a deterministic search will settle into whichever valley it starts in. It cannot jump out.
Pure random search — evaluating f at random points — ignores everything you have learned from previous evaluations. Stochastic methods are smarter: they use randomness as a tool within a structured search process, balancing exploration (trying new regions) against exploitation (refining what you have found).
The simplest stochastic method: take a normal gradient descent step, then add random noise. The update becomes:
where d(k) is the descent direction and ε(k) is zero-mean Gaussian noise with standard deviation σ. The noise is typically reduced over time — large perturbations early for exploration, small perturbations later for convergence.
For convergence guarantees, the step sizes α(k) must satisfy two conditions: ∑α(k) = ∞ (don't decay too fast) and ∑(α(k))² < ∞ (do eventually decay). A common choice is α(k) = 1/k.
Recall generalized pattern search (GPS) from Chapter 7: it explored a fixed set of directions. Mesh adaptive direct search (MADS) replaces those fixed directions with random positive spanning sets at each iteration.
The step size α is always a power of 4, ensuring the mesh stays well-structured. When a successful direction is found, MADS also queries a point 3α further in that direction — an opportunistic extrapolation.
Inspired by metallurgy: when a metal is heated, its atoms move freely. Slow cooling lets them settle into an ordered, low-energy crystal. Fast quenching traps defects. Simulated annealing applies the same idea to optimization.
At each iteration, a candidate point x' is sampled from a transition distribution T. The candidate is accepted with the Metropolis criterion:
where Δy = f(x') − f(x) and t is the temperature. High temperature means most moves are accepted (exploration). Low temperature means only improvements are accepted (exploitation).
The transition distribution T determines how far candidate points can be from the current point. Gaussian distributions are common. Adaptive simulated annealing adjusts step sizes per coordinate to maintain roughly 50% acceptance rate.
Rather than moving a single point, the cross-entropy method maintains an explicit probability distribution over the design space. At each iteration:
A common choice is a multivariate Gaussian proposal. The mean and covariance are updated using the maximum likelihood estimate from the elite samples. Over iterations, the distribution narrows around the global optimum.
Like the cross-entropy method, proposal distribution descent optimizes a parameterized distribution p(x|θ). But instead of fitting elite samples, it uses gradient descent on the distribution parameters.
The objective is to minimize Ex~p(x|θ)[f(x)]. The gradient with respect to θ is estimated using the log-likelihood trick:
This is the same trick used in REINFORCE for reinforcement learning. No gradient of f is needed — only the gradient of the log probability of the proposal distribution.
Information-geometric optimization improves on this by replacing the standard gradient with the natural gradient — the gradient pre-multiplied by the inverse Fisher information matrix. This makes the algorithm invariant to how you parameterize the distribution.
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is one of the most powerful derivative-free optimizers. It maintains a multivariate Gaussian and adapts both the mean and the full covariance matrix to learn the shape of the objective landscape.
CMA-ES extends the cross-entropy method with three key improvements:
| Feature | What It Does | Why It Helps |
|---|---|---|
| Step-size adaptation | Separate control of overall step size σ | Prevents premature convergence |
| Covariance adaptation | Learns the shape of the landscape via Σ | Aligns search with elongated valleys |
| Evolution path | Accumulates momentum from past steps | Faster progress in consistent directions |
CMA-ES has few hyperparameters to tune (population size is the main one). For problems where gradient information is unavailable or unreliable, CMA-ES is often the first method to try.
| Scenario | Best Method | Why |
|---|---|---|
| Noisy gradient available | Noisy descent / SGD | Cheap per iteration, scales to high dimensions |
| Black-box, ≤ 100 dims | CMA-ES | Adapts to landscape shape automatically |
| Many local minima, cheap evals | Simulated annealing | Principled acceptance of worse solutions |
| Need distribution over optimum | Cross-entropy method | Outputs a distribution, not just a point |
| Derivative-free, convergence guarantee | MADS | Extends GPS with random directions |
Watch simulated annealing explore a multimodal 1D function. The temperature starts high (green bar on the right), allowing the algorithm to jump between valleys. As the temperature drops, the search settles into the best valley it has found.
Click Step to perform one annealing iteration. Watch how the acceptance probability changes as temperature cools.
| Method | Type | Key Mechanism | Best For |
|---|---|---|---|
| Noisy descent | Gradient + noise | Gaussian perturbation | Saddle points, SGD |
| MADS | Pattern search | Random spanning sets | Derivative-free with guarantees |
| Simulated annealing | Single-point | Metropolis criterion + cooling | Rugged landscapes |
| Cross-entropy | Distribution-based | Elite sample fitting | Global search with distributions |
| Proposal dist. descent | Distribution-based | Log-likelihood gradient | High-dim distribution optimization |
| CMA-ES | Distribution-based | Covariance adaptation | Black-box up to ~100 dims |