Kochenderfer, Wheeler & Wray — Chapter 10

Policy Search

When you can't differentiate through the environment, search the policy space directly.

Prerequisites: MDPs (Ch 7) + basic probability. That's it.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Search Policy Space?

You're training a robot to walk. The robot's legs are controlled by a neural network with 10,000 parameters. To find the best parameters, you'd ideally compute ∂U/∂θ — the gradient of expected return with respect to the parameters — and follow it uphill.

But there's a problem: the environment isn't differentiable. The robot falls, bounces off the ground, has contact discontinuities. The map from parameters to expected return — call it U(θ) — exists, but its gradient might not. Or the gradient exists but is too noisy to follow. Or the policy class itself (a lookup table, a finite automaton, a decision tree) isn't differentiable by design.

The core question: If you can evaluate U(θ) by running your policy in the environment, but you can't differentiate through U(θ), how do you find the best θ?

The answer is policy search: treat U(θ) as a black-box function and search the parameter space Θ directly. No gradients required. You need only a way to evaluate policies — roll them out in the environment and measure the return.

1. Parameterize
Choose a policy class πθ. θ might be network weights, a lookup table, linear features, or anything.
2. Evaluate
Estimate U(θ) = E[total return under πθ] by running m rollouts and averaging.
3. Search
Use U(θ) evaluations to propose better θ. Repeat until convergence or budget exhausted.

Three families cover the space of approaches, each with a different philosophy for proposing new candidates:

FamilyCore ideaBest when
Local searchPerturb the current best θ and keep improvementsSmooth landscape, quick evaluation
Population-basedMaintain a set of candidates; evolve toward high-U regionsMulti-modal landscape, parallelism available
Distribution-basedFit a distribution over θ; sample from it; update toward high-U samplesGaussian-like structure in θ
Why not just gradient descent? Policy gradient methods (Ch 11–12) do use gradients, but they estimate them via sampling — a completely different mechanism from backpropagation. The methods in this chapter don't estimate gradients at all. They treat U(θ) as a pure black box. This makes them simpler to implement and apply to non-differentiable policy classes.
You want to optimize a policy parameterized by a lookup table (one action per discrete state). Why can't you use standard backpropagation?

Chapter 1: Approximate Policy Evaluation

Before we can search, we need to score. Given a candidate πθ, we want a number: how good is it? The true expected return is

U(θ) = Eπθ[ ∑t=0 γt R(st, at) ]

This expectation is over all rollout trajectories τ = (s0, a0, s1, a1, …) generated by running πθ in the environment. We can't compute it exactly, but we can Monte Carlo estimate it by running m actual rollouts.

Monte Carlo rollout estimate: Run m rollouts of πθ. On rollout i, collect total discounted return R(i). Then Û(θ) = (1/m) ∑i=1m R(i). By the law of large numbers, this converges to U(θ) as m → ∞.

Why m rollouts and not just 1? Because the environment is stochastic. A single rollout can be an outlier. Averaging over m rollouts reduces the estimation variance. The standard error of your estimate is σ/√m, where σ is the standard deviation of a single rollout's return.

How many rollouts do you need? If returns vary wildly (chaotic environment, long horizon), you need more. Double your precision, quadruple m. This is the fundamental evaluation budget problem in policy search.

Truncated horizons: For infinite-horizon problems, you typically cap rollouts at a maximum horizon T (either a fixed step count or until a terminal condition) and accept the truncation error. The truncated return R(i)T = ∑t=0T−1 γt R(st(i), at(i)) underestimates the true return by γT·U(sT). Choose T large enough that γT is negligible.
Monte Carlo Evaluation Variance

Observe how estimate variance decreases as rollout count m increases. The true value is 1.0. Watch the confidence interval (purple bar) shrink with more rollouts.

Rollouts m 5
You use m=4 rollouts and get returns [8, 2, 6, 4]. Your estimate is 5.0. If σ=3 (std of a single rollout), what is the standard error of your estimate?

Chapter 2: Local Search

The simplest strategy: stand at your current best θ, look around in a small neighborhood, and move to wherever U is higher. This is local search — also called hill climbing in the policy optimization context.

Concretely: start with an initial θ0. At each iteration, generate candidate perturbations by adding small random noise. Evaluate each candidate with Monte Carlo rollouts. If any candidate beats the current best, move there. Repeat.

1. Perturb
Generate k candidates: θi = θ + δi where δi ~ 𝒩(0, σ2I)
2. Evaluate
For each candidate, estimate U(θi) using m rollouts
3. Select
If maxi U(θi) > U(θ): set θ ← argmax. Else: shrink σ or stop.
↻ repeat

The step size σ controls the neighborhood radius. Large σ explores widely but may jump over fine features of U. Small σ is precise but slow. A common trick: adaptive step size — shrink σ when progress stalls, expand when making consistent progress.

Hooke-Jeeves pattern search: Search along coordinate directions. At each step, try perturbing each parameter θj by +δ and −δ, keeping successful changes. This requires 2d evaluations per iteration where d = dim(θ). Works well when U has axis-aligned structure. Breaks down when parameters are strongly correlated (a ridge at an angle to the coordinate axes).

Local search has one Achilles heel: local optima. If U(θ) has many hills, local search climbs to the nearest peak and stops, unaware that a distant mountain is much taller. For policy spaces with many local optima (common in robotics), population-based methods handle this much better.

When local search works: (1) Low-dimensional θ (d < 20). (2) U(θ) is roughly unimodal. (3) Evaluation is cheap. (4) A good initial θ0 is available from domain knowledge. These conditions hold surprisingly often in control problems with carefully chosen policy parameterizations.
A local search algorithm has been running for 500 iterations without finding a better policy. The step size σ is still large. What is the most likely explanation?

Chapter 3: Nelder-Mead Simplex

Imagine placing d+1 explorers on the policy landscape (for d = dim(θ)). These explorers form a simplex: in 2D a triangle, in 3D a tetrahedron. Their positions are θ(1), …, θ(d+1). The game: iteratively deform the simplex so it crawls toward the optimum.

The Nelder-Mead method (1965) uses four geometric operations on the simplex at each step. It doesn't use derivatives. It uses geometry.

OperationWhat it doesWhen used
ReflectionReflect the worst vertex θw through the centroid of the restDefault first try
ExpansionExtend further in the reflection direction if it's the new bestWhen reflection succeeded big
ContractionMove θw closer to the centroidWhen reflection didn't improve
ShrinkMove all vertices toward the current best θbWhen contraction also failed
The geometry of optimization: The simplex learns the local shape of U without computing derivatives. When the landscape has a ridge, the simplex stretches along it. When it reaches a bowl, the simplex contracts into the minimum. This adaptation is why Nelder-Mead is popular for noisy functions — it doesn't rely on exact derivative estimates that noise would corrupt.

The reflection step in detail: the centroid of all vertices except the worst is θc = (1/d) ∑i≠w θ(i). The reflected vertex is θr = θc + (θc − θw). Evaluate U(θr) and compare to the current best and second-worst. The four cases (best, good, bad, terrible) determine which operation to apply.

Nelder-Mead Simplex in 2D

The triangle is the simplex. The star is the optimum. Press Step to advance one iteration and watch the triangle deform toward the goal.

Convergence and limitations: Nelder-Mead converges reliably in low dimensions (d ≤ 10) but can stagnate in high dimensions — it has only d+1 vertices to navigate a d-dimensional landscape. For d > 20, use CEM or evolution strategies instead.
In 3D policy space (d=3), how many vertices does the Nelder-Mead simplex have?

Chapter 4: Genetic Algorithms

Nature solved optimization over billions of years: keep good solutions, randomly mutate them, and combine successful parents. Genetic algorithms (GAs) copy this strategy — maintain a population of candidate policies, evaluate each, and breed the next generation from the best survivors.

The key insight: by keeping a population rather than a single candidate, GAs naturally explore multiple basins of attraction simultaneously. This makes them far less prone to local optima than local search.

The GA loop: (1) Evaluate — score all N candidates with rollouts. (2) Select — rank by return, keep top-k (elites). (3) Crossover — randomly pair survivors and blend parameter vectors. (4) Mutate — add small random noise to each offspring. Repeat.

For continuous parameter vectors θ ∈ Rd, the most common operations:

Uniform crossover: Given parents θA and θB, create offspring θC by taking each component from A with probability 0.5, or from B otherwise: θCj ~ Bernoulli(½) ? θAj : θBj. Works well when parameters are independent.

Mutation: Add Gaussian noise: θC ← θC + ε where ε ~ 𝒩(0, σ2I). σ is the mutation rate. Large σ = wild exploration; small σ = fine-tuning.

Genetic Algorithm: Population Evolution

Each dot is a candidate policy in 2D parameter space. Brightness = fitness. The star is the global optimum. Watch the population converge.

Pop size N 20
Mutation σ 0.20
Elitism matters: Always carry the current best individual unchanged into the next generation. Without elitism, the best policy discovered so far can be lost through mutation. With elitism, the population's best is monotonically non-decreasing — you can only improve or stay the same. This simple trick dramatically improves convergence reliability.
A genetic algorithm with elitism has its current best policy at return 87 after generation 5. After generation 6, can the best return drop below 87?

Chapter 5: Cross-Entropy Method

Genetic algorithms maintain a set of point candidates. The cross-entropy method (CEM) is more principled: maintain a distribution over candidates. Concretely, CEM keeps a Gaussian p(θ) = 𝒩(μ, Σ) and iteratively updates (μ, Σ) so the distribution concentrates around high-return regions.

The name comes from information theory: updating the parameters to maximize the likelihood of the elite samples is equivalent to minimizing the KL-divergence between the current distribution and an idealized distribution concentrated on the optimal θ*.

The CEM loop: (1) Sample: Draw N candidates θ(i) ~ 𝒩(μ, Σ). (2) Evaluate: Score each with rollouts. (3) Select elites: Keep top k candidates (k = ρ·N, where ρ ≈ 0.2). (4) Update: Set μ ← mean of elites, Σ ← covariance of elites. (5) Repeat.

The update equations are sample statistics of the elite set Θelite:

μ ← (1/k) ∑θ ∈ Θelite θ
Σ ← (1/k) ∑θ ∈ Θelite (θ − μ)(θ − μ)T + εI

These are maximum likelihood estimates of a Gaussian fitted to the elites. No optimization required — just two matrix computations. The εI term (noise injection) prevents Σ from collapsing prematurely and stops exploration too early.

Why CEM beats naive random search: Random search samples from a fixed prior, forever wasting samples in poor regions. CEM starts with a broad prior and narrows it toward promising regions. By iteration 5–10, nearly all samples land in high-return territory. This is analogous to importance sampling: we're shifting our sampling distribution toward the important region.
Cross-Entropy Method: Distribution Narrowing

The ellipse shows the current distribution N(μ, Σ). Blue dots are samples, gold dots are elites. Watch the ellipse converge to the optimum (star).

Elite fraction ρ 0.20
Sample N 20
python
import numpy as np

def cross_entropy_method(evaluate, d, n_iter=50, N=100, rho=0.2, noise=1e-3):
    """evaluate: theta -> scalar return. d: dimension of theta."""
    mu = np.zeros(d)
    sigma = np.eye(d)   # full covariance
    k = int(rho * N)    # number of elites to keep

    for t in range(n_iter):
        # 1. Sample candidates from current distribution
        thetas = np.random.multivariate_normal(mu, sigma, N)

        # 2. Evaluate each candidate (embarrassingly parallel)
        returns = np.array([evaluate(th) for th in thetas])

        # 3. Select top-k elites
        elite_idx = np.argsort(returns)[-k:]
        elites = thetas[elite_idx]

        # 4. Update distribution: MLE Gaussian on elites
        mu = elites.mean(axis=0)
        diff = elites - mu
        sigma = (diff.T @ diff) / k + noise * np.eye(d)

    return mu
CEM updates μ using the sample mean of the top-k candidates. What happens if you use ALL N samples instead of just the top-k elites?

Chapter 6: Evolution Strategies

CEM uses hard selection: keep the top ρ·N, discard the rest. Evolution strategies (ES) use soft selection: weight every sample by its return. The distribution update becomes a weighted average rather than a sample mean of elites.

ES keeps a Gaussian p(θ; μ, Σ) and performs the update:

μ ← ∑i=1N wi θ(i)

where wi are weights derived from the returns. The most robust choice: rank-based weights. Rank the N candidates by return. Assign wi ∝ max(0, log(N/2 + 1) − log(ranki)), normalized to sum to 1. This gives large weight to the best, zero to the worst half, and smooth interpolation between.

ES vs. CEM: CEM has a hard threshold (top-k or nothing). ES has soft weighting (every sample contributes proportionally). ES is typically more stable because it uses information from all samples. But the elite fraction ρ in CEM is more intuitive to tune than the rank-weight formula in ES.
Covariance Matrix Adaptation ES (CMA-ES): The full d×d covariance update costs O(d3) per step — impractical for large d. CMA-ES uses a clever rank-1 and rank-μ update that efficiently adapts the full covariance without the O(d3) factorization. It's considered the state-of-the-art gradient-free optimizer for d up to ~1000 and is extremely robust to step-size tuning.
python
import numpy as np

def evolution_strategy(evaluate, d, n_iter=50, N=50, sigma=0.1, alpha=0.01):
    """Diagonal-covariance ES with rank-based weights."""
    mu = np.zeros(d)

    for t in range(n_iter):
        eps = np.random.randn(N // 2, d)
        # Mirrored sampling: evaluate +eps and -eps
        thetas = np.concatenate([mu + sigma * eps, mu - sigma * eps])
        returns = np.array([evaluate(th) for th in thetas])

        # Rank-based weights: top half gets positive weight, bottom zero
        ranks = np.argsort(np.argsort(returns))  # double-argsort = ranks
        w = np.maximum(0, np.log(N/2 + 1) - np.log(N - ranks))
        w /= w.sum()

        # Weighted mean update
        mu = (w[:, None] * thetas).sum(axis=0)

    return mu
In rank-based ES, why use ranks rather than raw returns to compute weights?

Chapter 7: Isotropic Evolution Strategies

When the policy parameter θ has thousands or millions of dimensions (a neural network), maintaining a full d×d covariance matrix is intractable. Isotropic ES simplifies by fixing Σ = σ2I — the same noise in every direction. This reduces the search to tuning a single scalar σ.

This sounds like a drastic limitation. Surprisingly, it works even for deep networks with millions of parameters. The key insight: the update

μ ← μ + (α/Nσ) ∑i=1N R(i) ε(i)

is actually a gradient estimate of Eε~𝒩(0,I)[U(μ + σε)] with respect to μ. Isotropic ES is secretly a smoothed gradient descent on the Gaussian-blurred objective!

The OpenAI ES insight (Salimans et al. 2017): With isotropic ES, each worker only needs to send back one scalar (the return) to reconstruct its exact gradient contribution. By sharing random seeds across workers rather than full parameter vectors, you can parallelize across thousands of CPUs with negligible communication. This is why ES scaled to MuJoCo locomotion tasks: it's perfectly parallelizable.
Mirrored sampling halves variance: For each ε(i), also evaluate μ − σε(i). The pair contributes (R+ − R(i) to the gradient. The antisymmetric part cancels noise, reducing variance by roughly half for free — you get two information-rich signals instead of one, and the redundancy in the random vector ε is exploited.
python
import numpy as np

def isotropic_es(evaluate, d, n_iter=100, N=50, sigma=0.05, alpha=0.01):
    """Isotropic ES with mirrored sampling and baseline subtraction."""
    mu = np.zeros(d)

    for t in range(n_iter):
        eps = np.random.randn(N, d)   # N perturbation directions

        # Mirrored: evaluate +eps and -eps pairs
        R_pos = np.array([evaluate(mu + sigma * e) for e in eps])
        R_neg = np.array([evaluate(mu - sigma * e) for e in eps])

        # Gradient estimate with baseline subtraction
        advantages = R_pos - R_neg  # baseline cancels exactly
        grad = (advantages[:, None] * eps).sum(axis=0) / (N * sigma)
        mu += alpha * grad

    return mu
Isotropic ES with mirrored sampling evaluates πμ+σε and πμ−σε. If both have the same return, what is their joint contribution to the gradient update?

Chapter 8: Showcase — Policy Search Race

Three algorithms compete to optimize the same 2D policy landscape in real time. The landscape has a global optimum (gold star) and sometimes local traps. Watch how each method navigates.

What to try: Terrain 1 is a smooth bowl — local search wins. Terrain 2 adds local traps — watch local search get stuck while GA and CEM escape. Terrain 3 is rugged multi-modal. Increase Speed to see long-run behavior.
Policy Search Race: Local Search vs. Genetic vs. CEM
Terrain 1
Speed 3
AlgorithmTerrain 1 (smooth)Terrain 2 (traps)Terrain 3 (rugged)
Local searchExcellentOften trappedUsually trapped
Genetic (pop)Good but slowUsually escapesOften global
CEMExcellent + fastUsually escapesDepends on init

Chapter 9: Connections & What's Next

Policy search occupies a special niche: the method of choice when you can evaluate but not differentiate the objective. The next chapters open the black box.

MethodBest forScale (dim θ)Gradient?
Hooke-Jeeves / local searchLow-d, smooth, good initd < 20No
Nelder-MeadLow-d, noisy, no gradientd < 15No
Genetic algorithmDiscrete or mixed, multi-modalAny dNo
CEMContinuous θ, Gaussian structured < 1000No
Isotropic ESNeural nets, parallelismd up to 106Implicit
Policy gradient (Ch 11)Differentiable class, on-policyAny dEstimated
PPO (Ch 12)Smooth env, clipped updateAny dEstimated
The gradient estimation connection: Isotropic ES is equivalent to finite-difference estimation of ∇μ E[U(μ+σε)]. Policy gradient methods (Ch 11) estimate ∇θ U(θ) via the log-derivative trick. Both are stochastic gradient methods; they differ in what they differentiate and how they reduce variance.
CEM in the wild: CEM was the planner in the original Dreamer model-based RL system (Ha & Schmidhuber 2018, Hafner et al. 2019): optimize action sequences by CEM within a learned world model. Imagined rollouts serve as the Monte Carlo evaluator. CEM + learned model achieves strong visual control without ever differentiating through the real environment.

Related chapters:

"Evolution is smarter than you are." — Leslie Orgel, biologist. The same humility applies to policy space: when gradients fail, let the landscape speak through samples.
CEM updates the mean toward elite samples and the covariance to match their spread. What does the covariance converge toward when fully converged?