Kochenderfer & Wheeler, Chapter 8

Stochastic Methods

Inject randomness strategically to escape local minima and explore the design space more broadly.

Prerequisites: Chapter 4 (Local Descent), Chapter 7 (Direct Methods).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Randomness?

Every method we have seen so far — gradient descent, Newton's method, Nelder-Mead — can get trapped in local minima. When the landscape has many valleys, a deterministic search will settle into whichever valley it starts in. It cannot jump out.

The core idea: Stochastic methods use randomness strategically to explore the design space. By occasionally accepting worse solutions or sampling from broad distributions, they can escape local minima and discover better regions. The trick is controlling how much randomness you use, and when to dial it down so the search converges.

Pure random search — evaluating f at random points — ignores everything you have learned from previous evaluations. Stochastic methods are smarter: they use randomness as a tool within a structured search process, balancing exploration (trying new regions) against exploitation (refining what you have found).

Stochastic methods add randomness to optimization in order to:

Chapter 1: Noisy Descent

The simplest stochastic method: take a normal gradient descent step, then add random noise. The update becomes:

x(k+1) ← x(k) + d(k) + ε(k)

where d(k) is the descent direction and ε(k) is zero-mean Gaussian noise with standard deviation σ. The noise is typically reduced over time — large perturbations early for exploration, small perturbations later for convergence.

Key insight: Adding noise helps traverse saddle points, where the gradient is near zero but the point is not a minimum. Without noise, gradient descent can stall at saddle points for many iterations. Stochastic gradient descent (SGD) in deep learning benefits from exactly this effect — the mini-batch noise acts as a natural perturbation.

For convergence guarantees, the step sizes α(k) must satisfy two conditions: ∑α(k) = ∞ (don't decay too fast) and ∑(α(k))² < ∞ (do eventually decay). A common choice is α(k) = 1/k.

Why does adding noise help with saddle points?

Chapter 2: Mesh Adaptive Direct Search

Recall generalized pattern search (GPS) from Chapter 7: it explored a fixed set of directions. Mesh adaptive direct search (MADS) replaces those fixed directions with random positive spanning sets at each iteration.

Generate
Sample a random positive spanning set (n+1 directions that span Rn).
Poll
Evaluate f along each direction at step size α. Accept any improvement.
Adapt
Improvement found? Multiply α by 4. No improvement? Divide α by 4.

The step size α is always a power of 4, ensuring the mesh stays well-structured. When a successful direction is found, MADS also queries a point 3α further in that direction — an opportunistic extrapolation.

Think of it this way: GPS searches with a fixed compass. MADS rotates the compass randomly each iteration. Over time, this random rotation ensures that every direction gets explored — even directions that a fixed compass would never align with.
MADS improves over GPS by:

Chapter 3: Simulated Annealing

Inspired by metallurgy: when a metal is heated, its atoms move freely. Slow cooling lets them settle into an ordered, low-energy crystal. Fast quenching traps defects. Simulated annealing applies the same idea to optimization.

At each iteration, a candidate point x' is sampled from a transition distribution T. The candidate is accepted with the Metropolis criterion:

P(accept) = 1 if Δy ≤ 0,    e−Δy/t if Δy > 0

where Δy = f(x') − f(x) and t is the temperature. High temperature means most moves are accepted (exploration). Low temperature means only improvements are accepted (exploitation).

The intuition: Temperature controls your willingness to accept worse solutions. Start hot (explore broadly), cool slowly (converge to a minimum). An annealing schedule controls how t decreases. Common choices: logarithmic t(k) = t(1)ln(2)/ln(k+1) (slow but guaranteed), exponential t(k+1) = γt(k) for γ ∈ (0,1) (fast, practical), or fast annealing t(k) = t(1)/k.

The transition distribution T determines how far candidate points can be from the current point. Gaussian distributions are common. Adaptive simulated annealing adjusts step sizes per coordinate to maintain roughly 50% acceptance rate.

In simulated annealing, what controls the probability of accepting a worse solution?

Chapter 4: Cross-Entropy Method

Rather than moving a single point, the cross-entropy method maintains an explicit probability distribution over the design space. At each iteration:

Sample
Draw m samples from the current proposal distribution P.
Select
Keep the best melite samples (the "elite" set).
Refit
Fit a new distribution P to the elite samples (e.g., compute new μ and Σ).

A common choice is a multivariate Gaussian proposal. The mean and covariance are updated using the maximum likelihood estimate from the elite samples. Over iterations, the distribution narrows around the global optimum.

Key insight: The cross-entropy method is model-based optimization. Instead of reasoning about individual points, you reason about distributions. This lets you capture uncertainty about where the optimum lies. The limitation: if the true landscape is multimodal, a unimodal Gaussian cannot represent it well. Mixture models help but are harder to fit.
The cross-entropy method refits its proposal distribution to:

Chapter 5: Proposal Distribution Descent

Like the cross-entropy method, proposal distribution descent optimizes a parameterized distribution p(x|θ). But instead of fitting elite samples, it uses gradient descent on the distribution parameters.

The objective is to minimize Ex~p(x|θ)[f(x)]. The gradient with respect to θ is estimated using the log-likelihood trick:

θ E[f(x)] ≈ (1/m) ∑ f(x(i)) ∇θ log p(x(i)|θ)

This is the same trick used in REINFORCE for reinforcement learning. No gradient of f is needed — only the gradient of the log probability of the proposal distribution.

Think of it this way: The cross-entropy method is like maximum likelihood estimation on elite samples. Proposal distribution descent is like policy gradient in RL. Both optimize over distributions, but they differ in how they update: fitting vs. gradient steps.

Information-geometric optimization improves on this by replacing the standard gradient with the natural gradient — the gradient pre-multiplied by the inverse Fisher information matrix. This makes the algorithm invariant to how you parameterize the distribution.

Proposal distribution descent avoids needing the gradient of f by instead using:

Chapter 6: CMA-ES

Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is one of the most powerful derivative-free optimizers. It maintains a multivariate Gaussian and adapts both the mean and the full covariance matrix to learn the shape of the objective landscape.

CMA-ES extends the cross-entropy method with three key improvements:

FeatureWhat It DoesWhy It Helps
Step-size adaptationSeparate control of overall step size σPrevents premature convergence
Covariance adaptationLearns the shape of the landscape via ΣAligns search with elongated valleys
Evolution pathAccumulates momentum from past stepsFaster progress in consistent directions
Why CMA-ES is special: It is the gold standard for black-box optimization up to about 100 dimensions. It adapts to the local geometry without computing derivatives. The covariance matrix learns to stretch the search along promising directions — essentially discovering the principal axes of the objective function.

CMA-ES has few hyperparameters to tune (population size is the main one). For problems where gradient information is unavailable or unreliable, CMA-ES is often the first method to try.

CMA-ES adapts its covariance matrix to:

Chapter 7: When to Use What

ScenarioBest MethodWhy
Noisy gradient availableNoisy descent / SGDCheap per iteration, scales to high dimensions
Black-box, ≤ 100 dimsCMA-ESAdapts to landscape shape automatically
Many local minima, cheap evalsSimulated annealingPrincipled acceptance of worse solutions
Need distribution over optimumCross-entropy methodOutputs a distribution, not just a point
Derivative-free, convergence guaranteeMADSExtends GPS with random directions
Rule of thumb: If you have gradients, use them (possibly with noise). If you don't, CMA-ES is the strongest general-purpose choice for moderate dimensions. Simulated annealing is simple and effective when function evaluations are cheap and the landscape is rugged.
For a 50-dimensional black-box function with no special structure, the best stochastic method is typically:

Chapter 8: Simulated Annealing Simulator

Watch simulated annealing explore a multimodal 1D function. The temperature starts high (green bar on the right), allowing the algorithm to jump between valleys. As the temperature drops, the search settles into the best valley it has found.

Simulated Annealing

Click Step to perform one annealing iteration. Watch how the acceptance probability changes as temperature cools.

Step 0 | T=10.00
As simulated annealing cools, the probability of accepting a worse solution:

Chapter 9: Summary

MethodTypeKey MechanismBest For
Noisy descentGradient + noiseGaussian perturbationSaddle points, SGD
MADSPattern searchRandom spanning setsDerivative-free with guarantees
Simulated annealingSingle-pointMetropolis criterion + coolingRugged landscapes
Cross-entropyDistribution-basedElite sample fittingGlobal search with distributions
Proposal dist. descentDistribution-basedLog-likelihood gradientHigh-dim distribution optimization
CMA-ESDistribution-basedCovariance adaptationBlack-box up to ~100 dims
Looking ahead: Chapter 9 introduces population methods — genetic algorithms, differential evolution, particle swarm — that maintain many candidate solutions simultaneously instead of just one or a distribution.
Which stochastic method adapts a full covariance matrix to learn the landscape shape?