Ch 8: Stochastic Methods — Kochenderfer & Wheeler

Chapter 0: Why Randomness?

Every method we have seen so far — gradient descent, Newton's method, Nelder-Mead — can get trapped in local minima. When the landscape has many valleys, a deterministic search will settle into whichever valley it starts in. It cannot jump out.

The core idea: Stochastic methods use randomness strategically to explore the design space. By occasionally accepting worse solutions or sampling from broad distributions, they can escape local minima and discover better regions. The trick is controlling how much randomness you use, and when to dial it down so the search converges.

Pure random search — evaluating f at random points — ignores everything you have learned from previous evaluations. Stochastic methods are smarter: they use randomness as a tool within a structured search process, balancing exploration (trying new regions) against exploitation (refining what you have found).

Stochastic methods add randomness to optimization in order to:

Escape local minima by exploring beyond the current valley Make the algorithm faster than gradient descent Eliminate the need for function evaluations

Chapter 1: Noisy Descent

The simplest stochastic method: take a normal gradient descent step, then add random noise. The update becomes:

x^(k+1) ← x^(k) + d^(k) + ε^(k)

where d^(k) is the descent direction and ε^(k) is zero-mean Gaussian noise with standard deviation σ. The noise is typically reduced over time — large perturbations early for exploration, small perturbations later for convergence.

Key insight: Adding noise helps traverse saddle points, where the gradient is near zero but the point is not a minimum. Without noise, gradient descent can stall at saddle points for many iterations. Stochastic gradient descent (SGD) in deep learning benefits from exactly this effect — the mini-batch noise acts as a natural perturbation.

For convergence guarantees, the step sizes α^(k) must satisfy two conditions: ∑α^(k) = ∞ (don't decay too fast) and ∑(α^(k))² < ∞ (do eventually decay). A common choice is α^(k) = 1/k.

Why does adding noise help with saddle points?

The random perturbation pushes the iterate away from the zero-gradient region It computes the Hessian to detect saddle points It reduces the step size at saddle points

Chapter 2: Mesh Adaptive Direct Search

Recall generalized pattern search (GPS) from Chapter 7: it explored a fixed set of directions. Mesh adaptive direct search (MADS) replaces those fixed directions with random positive spanning sets at each iteration.

Generate

Sample a random positive spanning set (n+1 directions that span Rⁿ).

↓

Poll

Evaluate f along each direction at step size α. Accept any improvement.

↓

Adapt

Improvement found? Multiply α by 4. No improvement? Divide α by 4.

The step size α is always a power of 4, ensuring the mesh stays well-structured. When a successful direction is found, MADS also queries a point 3α further in that direction — an opportunistic extrapolation.

Think of it this way: GPS searches with a fixed compass. MADS rotates the compass randomly each iteration. Over time, this random rotation ensures that every direction gets explored — even directions that a fixed compass would never align with.

MADS improves over GPS by:

Using random positive spanning sets instead of fixed directions Computing the gradient Using a larger step size

Chapter 3: Simulated Annealing

Inspired by metallurgy: when a metal is heated, its atoms move freely. Slow cooling lets them settle into an ordered, low-energy crystal. Fast quenching traps defects. Simulated annealing applies the same idea to optimization.

At each iteration, a candidate point x' is sampled from a transition distribution T. The candidate is accepted with the Metropolis criterion:

P(accept) = 1 if Δy ≤ 0, e^−Δy/t if Δy > 0

where Δy = f(x') − f(x) and t is the temperature. High temperature means most moves are accepted (exploration). Low temperature means only improvements are accepted (exploitation).

The intuition: Temperature controls your willingness to accept worse solutions. Start hot (explore broadly), cool slowly (converge to a minimum). An annealing schedule controls how t decreases. Common choices: logarithmic t(k) = t(1)ln(2)/ln(k+1) (slow but guaranteed), exponential t(k+1) = γt(k) for γ ∈ (0,1) (fast, practical), or fast annealing t(k) = t(1)/k.

The transition distribution T determines how far candidate points can be from the current point. Gaussian distributions are common. Adaptive simulated annealing adjusts step sizes per coordinate to maintain roughly 50% acceptance rate.

In simulated annealing, what controls the probability of accepting a worse solution?

The temperature parameter t The gradient direction The number of design variables

Chapter 4: Cross-Entropy Method

Rather than moving a single point, the cross-entropy method maintains an explicit probability distribution over the design space. At each iteration:

Sample

Draw m samples from the current proposal distribution P.

↓

Select

Keep the best m_elite samples (the "elite" set).

↓

Refit

Fit a new distribution P to the elite samples (e.g., compute new μ and Σ).

A common choice is a multivariate Gaussian proposal. The mean and covariance are updated using the maximum likelihood estimate from the elite samples. Over iterations, the distribution narrows around the global optimum.

Key insight: The cross-entropy method is model-based optimization. Instead of reasoning about individual points, you reason about distributions. This lets you capture uncertainty about where the optimum lies. The limitation: if the true landscape is multimodal, a unimodal Gaussian cannot represent it well. Mixture models help but are harder to fit.

The cross-entropy method refits its proposal distribution to:

The best (elite) samples from each iteration All samples equally The worst samples, to avoid them

Chapter 5: Proposal Distribution Descent

Like the cross-entropy method, proposal distribution descent optimizes a parameterized distribution p(x|θ). But instead of fitting elite samples, it uses gradient descent on the distribution parameters.

The objective is to minimize E_x~p(x|θ)[f(x)]. The gradient with respect to θ is estimated using the log-likelihood trick:

∇_θ E[f(x)] ≈ (1/m) ∑ f(x⁽ⁱ⁾) ∇_θ log p(x⁽ⁱ⁾|θ)

This is the same trick used in REINFORCE for reinforcement learning. No gradient of f is needed — only the gradient of the log probability of the proposal distribution.

Think of it this way: The cross-entropy method is like maximum likelihood estimation on elite samples. Proposal distribution descent is like policy gradient in RL. Both optimize over distributions, but they differ in how they update: fitting vs. gradient steps.

Information-geometric optimization improves on this by replacing the standard gradient with the natural gradient — the gradient pre-multiplied by the inverse Fisher information matrix. This makes the algorithm invariant to how you parameterize the distribution.

Proposal distribution descent avoids needing the gradient of f by instead using:

The gradient of the log likelihood of the proposal distribution Finite differences on f The Hessian of f

Chapter 6: CMA-ES

Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is one of the most powerful derivative-free optimizers. It maintains a multivariate Gaussian and adapts both the mean and the full covariance matrix to learn the shape of the objective landscape.

CMA-ES extends the cross-entropy method with three key improvements:

Feature	What It Does	Why It Helps
Step-size adaptation	Separate control of overall step size σ	Prevents premature convergence
Covariance adaptation	Learns the shape of the landscape via Σ	Aligns search with elongated valleys
Evolution path	Accumulates momentum from past steps	Faster progress in consistent directions

Why CMA-ES is special: It is the gold standard for black-box optimization up to about 100 dimensions. It adapts to the local geometry without computing derivatives. The covariance matrix learns to stretch the search along promising directions — essentially discovering the principal axes of the objective function.

CMA-ES has few hyperparameters to tune (population size is the main one). For problems where gradient information is unavailable or unreliable, CMA-ES is often the first method to try.

CMA-ES adapts its covariance matrix to:

Align the search distribution with the shape of the objective landscape Increase the number of samples Compute the exact gradient

Chapter 7: When to Use What

Scenario	Best Method	Why
Noisy gradient available	Noisy descent / SGD	Cheap per iteration, scales to high dimensions
Black-box, ≤ 100 dims	CMA-ES	Adapts to landscape shape automatically
Many local minima, cheap evals	Simulated annealing	Principled acceptance of worse solutions
Need distribution over optimum	Cross-entropy method	Outputs a distribution, not just a point
Derivative-free, convergence guarantee	MADS	Extends GPS with random directions

Rule of thumb: If you have gradients, use them (possibly with noise). If you don't, CMA-ES is the strongest general-purpose choice for moderate dimensions. Simulated annealing is simple and effective when function evaluations are cheap and the landscape is rugged.

For a 50-dimensional black-box function with no special structure, the best stochastic method is typically:

CMA-ES Pure random search Noisy gradient descent

Chapter 8: Simulated Annealing Simulator

Watch simulated annealing explore a multimodal 1D function. The temperature starts high (green bar on the right), allowing the algorithm to jump between valleys. As the temperature drops, the search settles into the best valley it has found.

Simulated Annealing

Click Step to perform one annealing iteration. Watch how the acceptance probability changes as temperature cools.

Step 0 | T=10.00

As simulated annealing cools, the probability of accepting a worse solution:

Decreases toward zero Increases toward one Stays constant

Chapter 9: Summary

Method	Type	Key Mechanism	Best For
Noisy descent	Gradient + noise	Gaussian perturbation	Saddle points, SGD
MADS	Pattern search	Random spanning sets	Derivative-free with guarantees
Simulated annealing	Single-point	Metropolis criterion + cooling	Rugged landscapes
Cross-entropy	Distribution-based	Elite sample fitting	Global search with distributions
Proposal dist. descent	Distribution-based	Log-likelihood gradient	High-dim distribution optimization
CMA-ES	Distribution-based	Covariance adaptation	Black-box up to ~100 dims

Looking ahead: Chapter 9 introduces population methods — genetic algorithms, differential evolution, particle swarm — that maintain many candidate solutions simultaneously instead of just one or a distribution.

Which stochastic method adapts a full covariance matrix to learn the landscape shape?

Simulated annealing Noisy descent CMA-ES