Ch 10: Policy Search — Kochenderfer et al.

Chapter 0: Why Search Policy Space?

You're training a robot to walk. The robot's legs are controlled by a neural network with 10,000 parameters. To find the best parameters, you'd ideally compute ∂U/∂θ — the gradient of expected return with respect to the parameters — and follow it uphill.

But there's a problem: the environment isn't differentiable. The robot falls, bounces off the ground, has contact discontinuities. The map from parameters to expected return — call it U(θ) — exists, but its gradient might not. Or the gradient exists but is too noisy to follow. Or the policy class itself (a lookup table, a finite automaton, a decision tree) isn't differentiable by design.

The core question: If you can evaluate U(θ) by running your policy in the environment, but you can't differentiate through U(θ), how do you find the best θ?

The answer is policy search: treat U(θ) as a black-box function and search the parameter space Θ directly. No gradients required. You need only a way to evaluate policies — roll them out in the environment and measure the return.

1. Parameterize

Choose a policy class π_θ. θ might be network weights, a lookup table, linear features, or anything.

↓

2. Evaluate

Estimate U(θ) = E[total return under π_θ] by running m rollouts and averaging.

↓

3. Search

Use U(θ) evaluations to propose better θ. Repeat until convergence or budget exhausted.

Three families cover the space of approaches, each with a different philosophy for proposing new candidates:

Family	Core idea	Best when
Local search	Perturb the current best θ and keep improvements	Smooth landscape, quick evaluation
Population-based	Maintain a set of candidates; evolve toward high-U regions	Multi-modal landscape, parallelism available
Distribution-based	Fit a distribution over θ; sample from it; update toward high-U samples	Gaussian-like structure in θ

Why not just gradient descent? Policy gradient methods (Ch 11–12) do use gradients, but they estimate them via sampling — a completely different mechanism from backpropagation. The methods in this chapter don't estimate gradients at all. They treat U(θ) as a pure black box. This makes them simpler to implement and apply to non-differentiable policy classes.

You want to optimize a policy parameterized by a lookup table (one action per discrete state). Why can't you use standard backpropagation?

A lookup table is not a differentiable function of its parameters Backpropagation only works on supervised learning problems Lookup tables have too many parameters for gradient computation

Chapter 1: Approximate Policy Evaluation

Before we can search, we need to score. Given a candidate π_θ, we want a number: how good is it? The true expected return is

U(θ) = E_{π_θ}[ ∑_t=0^∞ γ^t R(s_t, a_t) ]

This expectation is over all rollout trajectories τ = (s₀, a₀, s₁, a₁, …) generated by running π_θ in the environment. We can't compute it exactly, but we can Monte Carlo estimate it by running m actual rollouts.

Monte Carlo rollout estimate: Run m rollouts of π_θ. On rollout i, collect total discounted return R⁽ⁱ⁾. Then Û(θ) = (1/m) ∑_i=1^m R⁽ⁱ⁾. By the law of large numbers, this converges to U(θ) as m → ∞.

Why m rollouts and not just 1? Because the environment is stochastic. A single rollout can be an outlier. Averaging over m rollouts reduces the estimation variance. The standard error of your estimate is σ/√m, where σ is the standard deviation of a single rollout's return.

How many rollouts do you need? If returns vary wildly (chaotic environment, long horizon), you need more. Double your precision, quadruple m. This is the fundamental evaluation budget problem in policy search.

Truncated horizons: For infinite-horizon problems, you typically cap rollouts at a maximum horizon T (either a fixed step count or until a terminal condition) and accept the truncation error. The truncated return R⁽ⁱ⁾_T = ∑_t=0^T−1 γ^t R(s_t⁽ⁱ⁾, a_t⁽ⁱ⁾) underestimates the true return by γ^T·U(s_T). Choose T large enough that γ^T is negligible.

Monte Carlo Evaluation Variance

Observe how estimate variance decreases as rollout count m increases. The true value is 1.0. Watch the confidence interval (purple bar) shrink with more rollouts.

Rollouts m 5

You use m=4 rollouts and get returns [8, 2, 6, 4]. Your estimate is 5.0. If σ=3 (std of a single rollout), what is the standard error of your estimate?

1.5 (because σ/√m = 3/2 = 1.5) 3.0 (same as a single rollout) 0.75 (because σ/m = 3/4)

Chapter 2: Local Search

The simplest strategy: stand at your current best θ, look around in a small neighborhood, and move to wherever U is higher. This is local search — also called hill climbing in the policy optimization context.

Concretely: start with an initial θ₀. At each iteration, generate candidate perturbations by adding small random noise. Evaluate each candidate with Monte Carlo rollouts. If any candidate beats the current best, move there. Repeat.

1. Perturb

Generate k candidates: θ_i = θ + δ_i where δ_i ~ 𝒩(0, σ²I)

↓

2. Evaluate

For each candidate, estimate U(θ_i) using m rollouts

↓

3. Select

If max_i U(θ_i) > U(θ): set θ ← argmax. Else: shrink σ or stop.

↻ repeat

The step size σ controls the neighborhood radius. Large σ explores widely but may jump over fine features of U. Small σ is precise but slow. A common trick: adaptive step size — shrink σ when progress stalls, expand when making consistent progress.

Hooke-Jeeves pattern search: Search along coordinate directions. At each step, try perturbing each parameter θ_j by +δ and −δ, keeping successful changes. This requires 2d evaluations per iteration where d = dim(θ). Works well when U has axis-aligned structure. Breaks down when parameters are strongly correlated (a ridge at an angle to the coordinate axes).

Local search has one Achilles heel: local optima. If U(θ) has many hills, local search climbs to the nearest peak and stops, unaware that a distant mountain is much taller. For policy spaces with many local optima (common in robotics), population-based methods handle this much better.

When local search works: (1) Low-dimensional θ (d < 20). (2) U(θ) is roughly unimodal. (3) Evaluation is cheap. (4) A good initial θ₀ is available from domain knowledge. These conditions hold surprisingly often in control problems with carefully chosen policy parameterizations.

A local search algorithm has been running for 500 iterations without finding a better policy. The step size σ is still large. What is the most likely explanation?

The algorithm has converged to a local optimum and the global optimum is elsewhere The Monte Carlo estimate is too noisy to detect improvements The policy parameterization is too low-dimensional

Chapter 3: Nelder-Mead Simplex

Imagine placing d+1 explorers on the policy landscape (for d = dim(θ)). These explorers form a simplex: in 2D a triangle, in 3D a tetrahedron. Their positions are θ⁽¹⁾, …, θ^(d+1). The game: iteratively deform the simplex so it crawls toward the optimum.

The Nelder-Mead method (1965) uses four geometric operations on the simplex at each step. It doesn't use derivatives. It uses geometry.

Operation	What it does	When used
Reflection	Reflect the worst vertex θ_w through the centroid of the rest	Default first try
Expansion	Extend further in the reflection direction if it's the new best	When reflection succeeded big
Contraction	Move θ_w closer to the centroid	When reflection didn't improve
Shrink	Move all vertices toward the current best θ_b	When contraction also failed

The geometry of optimization: The simplex learns the local shape of U without computing derivatives. When the landscape has a ridge, the simplex stretches along it. When it reaches a bowl, the simplex contracts into the minimum. This adaptation is why Nelder-Mead is popular for noisy functions — it doesn't rely on exact derivative estimates that noise would corrupt.

The reflection step in detail: the centroid of all vertices except the worst is θ_c = (1/d) ∑_i≠w θ⁽ⁱ⁾. The reflected vertex is θ_r = θ_c + (θ_c − θ_w). Evaluate U(θ_r) and compare to the current best and second-worst. The four cases (best, good, bad, terrible) determine which operation to apply.

Nelder-Mead Simplex in 2D

The triangle is the simplex. The star is the optimum. Press Step to advance one iteration and watch the triangle deform toward the goal.

Convergence and limitations: Nelder-Mead converges reliably in low dimensions (d ≤ 10) but can stagnate in high dimensions — it has only d+1 vertices to navigate a d-dimensional landscape. For d > 20, use CEM or evolution strategies instead.

In 3D policy space (d=3), how many vertices does the Nelder-Mead simplex have?

4 (d+1 vertices in d dimensions) 3 (one per dimension) 8 (2^d corners of a hypercube)

Chapter 4: Genetic Algorithms

Nature solved optimization over billions of years: keep good solutions, randomly mutate them, and combine successful parents. Genetic algorithms (GAs) copy this strategy — maintain a population of candidate policies, evaluate each, and breed the next generation from the best survivors.

The key insight: by keeping a population rather than a single candidate, GAs naturally explore multiple basins of attraction simultaneously. This makes them far less prone to local optima than local search.

The GA loop: (1) Evaluate — score all N candidates with rollouts. (2) Select — rank by return, keep top-k (elites). (3) Crossover — randomly pair survivors and blend parameter vectors. (4) Mutate — add small random noise to each offspring. Repeat.

For continuous parameter vectors θ ∈ R^d, the most common operations:

Uniform crossover: Given parents θ^A and θ^B, create offspring θ^C by taking each component from A with probability 0.5, or from B otherwise: θ^C_j ~ Bernoulli(½) ? θ^A_j : θ^B_j. Works well when parameters are independent.

Mutation: Add Gaussian noise: θ^C ← θ^C + ε where ε ~ 𝒩(0, σ²I). σ is the mutation rate. Large σ = wild exploration; small σ = fine-tuning.

Genetic Algorithm: Population Evolution

Each dot is a candidate policy in 2D parameter space. Brightness = fitness. The star is the global optimum. Watch the population converge.

Pop size N 20

Mutation σ 0.20

Elitism matters: Always carry the current best individual unchanged into the next generation. Without elitism, the best policy discovered so far can be lost through mutation. With elitism, the population's best is monotonically non-decreasing — you can only improve or stay the same. This simple trick dramatically improves convergence reliability.

A genetic algorithm with elitism has its current best policy at return 87 after generation 5. After generation 6, can the best return drop below 87?

No — elitism guarantees the best individual is always preserved Yes — mutation could make the elite worse Yes — stochastic evaluation means the same policy can score differently

Chapter 5: Cross-Entropy Method

Genetic algorithms maintain a set of point candidates. The cross-entropy method (CEM) is more principled: maintain a distribution over candidates. Concretely, CEM keeps a Gaussian p(θ) = 𝒩(μ, Σ) and iteratively updates (μ, Σ) so the distribution concentrates around high-return regions.

The name comes from information theory: updating the parameters to maximize the likelihood of the elite samples is equivalent to minimizing the KL-divergence between the current distribution and an idealized distribution concentrated on the optimal θ*.

The CEM loop: (1) Sample: Draw N candidates θ⁽ⁱ⁾ ~ 𝒩(μ, Σ). (2) Evaluate: Score each with rollouts. (3) Select elites: Keep top k candidates (k = ρ·N, where ρ ≈ 0.2). (4) Update: Set μ ← mean of elites, Σ ← covariance of elites. (5) Repeat.

The update equations are sample statistics of the elite set Θ_elite:

μ ← (1/k) ∑_{θ ∈ Θ_elite} θ

Σ ← (1/k) ∑_{θ ∈ Θ_elite} (θ − μ)(θ − μ)^T + εI

These are maximum likelihood estimates of a Gaussian fitted to the elites. No optimization required — just two matrix computations. The εI term (noise injection) prevents Σ from collapsing prematurely and stops exploration too early.

Why CEM beats naive random search: Random search samples from a fixed prior, forever wasting samples in poor regions. CEM starts with a broad prior and narrows it toward promising regions. By iteration 5–10, nearly all samples land in high-return territory. This is analogous to importance sampling: we're shifting our sampling distribution toward the important region.

Cross-Entropy Method: Distribution Narrowing

The ellipse shows the current distribution N(μ, Σ). Blue dots are samples, gold dots are elites. Watch the ellipse converge to the optimum (star).

Elite fraction ρ 0.20

Sample N 20

python
import numpy as np

def cross_entropy_method(evaluate, d, n_iter=50, N=100, rho=0.2, noise=1e-3):
    """evaluate: theta -> scalar return. d: dimension of theta."""
    mu = np.zeros(d)
    sigma = np.eye(d)   # full covariance
    k = int(rho * N)    # number of elites to keep

    for t in range(n_iter):
        # 1. Sample candidates from current distribution
        thetas = np.random.multivariate_normal(mu, sigma, N)

        # 2. Evaluate each candidate (embarrassingly parallel)
        returns = np.array([evaluate(th) for th in thetas])

        # 3. Select top-k elites
        elite_idx = np.argsort(returns)[-k:]
        elites = thetas[elite_idx]

        # 4. Update distribution: MLE Gaussian on elites
        mu = elites.mean(axis=0)
        diff = elites - mu
        sigma = (diff.T @ diff) / k + noise * np.eye(d)

    return mu

CEM updates μ using the sample mean of the top-k candidates. What happens if you use ALL N samples instead of just the top-k elites?

The distribution doesn't shift toward high-return regions — it stays near its initial location The distribution converges faster because it uses more data The method becomes equivalent to gradient descent

Chapter 6: Evolution Strategies

CEM uses hard selection: keep the top ρ·N, discard the rest. Evolution strategies (ES) use soft selection: weight every sample by its return. The distribution update becomes a weighted average rather than a sample mean of elites.

ES keeps a Gaussian p(θ; μ, Σ) and performs the update:

μ ← ∑_i=1^N w_i θ⁽ⁱ⁾

where w_i are weights derived from the returns. The most robust choice: rank-based weights. Rank the N candidates by return. Assign w_i ∝ max(0, log(N/2 + 1) − log(rank_i)), normalized to sum to 1. This gives large weight to the best, zero to the worst half, and smooth interpolation between.

ES vs. CEM: CEM has a hard threshold (top-k or nothing). ES has soft weighting (every sample contributes proportionally). ES is typically more stable because it uses information from all samples. But the elite fraction ρ in CEM is more intuitive to tune than the rank-weight formula in ES.

Covariance Matrix Adaptation ES (CMA-ES): The full d×d covariance update costs O(d³) per step — impractical for large d. CMA-ES uses a clever rank-1 and rank-μ update that efficiently adapts the full covariance without the O(d³) factorization. It's considered the state-of-the-art gradient-free optimizer for d up to ~1000 and is extremely robust to step-size tuning.

python
import numpy as np

def evolution_strategy(evaluate, d, n_iter=50, N=50, sigma=0.1, alpha=0.01):
    """Diagonal-covariance ES with rank-based weights."""
    mu = np.zeros(d)

    for t in range(n_iter):
        eps = np.random.randn(N // 2, d)
        # Mirrored sampling: evaluate +eps and -eps
        thetas = np.concatenate([mu + sigma * eps, mu - sigma * eps])
        returns = np.array([evaluate(th) for th in thetas])

        # Rank-based weights: top half gets positive weight, bottom zero
        ranks = np.argsort(np.argsort(returns))  # double-argsort = ranks
        w = np.maximum(0, np.log(N/2 + 1) - np.log(N - ranks))
        w /= w.sum()

        # Weighted mean update
        mu = (w[:, None] * thetas).sum(axis=0)

    return mu

In rank-based ES, why use ranks rather than raw returns to compute weights?

Ranks are invariant to the scale and shift of returns, making the algorithm robust to reward shaping Ranks are faster to compute than raw returns Ranks guarantee convergence to the global optimum

Chapter 7: Isotropic Evolution Strategies

When the policy parameter θ has thousands or millions of dimensions (a neural network), maintaining a full d×d covariance matrix is intractable. Isotropic ES simplifies by fixing Σ = σ²I — the same noise in every direction. This reduces the search to tuning a single scalar σ.

This sounds like a drastic limitation. Surprisingly, it works even for deep networks with millions of parameters. The key insight: the update

μ ← μ + (α/Nσ) ∑_i=1^N R⁽ⁱ⁾ ε⁽ⁱ⁾

is actually a gradient estimate of E_ε~𝒩(0,I)[U(μ + σε)] with respect to μ. Isotropic ES is secretly a smoothed gradient descent on the Gaussian-blurred objective!

The OpenAI ES insight (Salimans et al. 2017): With isotropic ES, each worker only needs to send back one scalar (the return) to reconstruct its exact gradient contribution. By sharing random seeds across workers rather than full parameter vectors, you can parallelize across thousands of CPUs with negligible communication. This is why ES scaled to MuJoCo locomotion tasks: it's perfectly parallelizable.

Mirrored sampling halves variance: For each ε⁽ⁱ⁾, also evaluate μ − σε⁽ⁱ⁾. The pair contributes (R⁺ − R⁻)ε⁽ⁱ⁾ to the gradient. The antisymmetric part cancels noise, reducing variance by roughly half for free — you get two information-rich signals instead of one, and the redundancy in the random vector ε is exploited.

python
import numpy as np

def isotropic_es(evaluate, d, n_iter=100, N=50, sigma=0.05, alpha=0.01):
    """Isotropic ES with mirrored sampling and baseline subtraction."""
    mu = np.zeros(d)

    for t in range(n_iter):
        eps = np.random.randn(N, d)   # N perturbation directions

        # Mirrored: evaluate +eps and -eps pairs
        R_pos = np.array([evaluate(mu + sigma * e) for e in eps])
        R_neg = np.array([evaluate(mu - sigma * e) for e in eps])

        # Gradient estimate with baseline subtraction
        advantages = R_pos - R_neg  # baseline cancels exactly
        grad = (advantages[:, None] * eps).sum(axis=0) / (N * sigma)
        mu += alpha * grad

    return mu

Isotropic ES with mirrored sampling evaluates π_μ+σε and π_μ−σε. If both have the same return, what is their joint contribution to the gradient update?

Zero — the paired returns cancel, contributing nothing to the update Positive — any nonzero return contributes positively Negative — equal returns indicate a local minimum, so the update reverses

Chapter 8: Showcase — Policy Search Race

Three algorithms compete to optimize the same 2D policy landscape in real time. The landscape has a global optimum (gold star) and sometimes local traps. Watch how each method navigates.

What to try: Terrain 1 is a smooth bowl — local search wins. Terrain 2 adds local traps — watch local search get stuck while GA and CEM escape. Terrain 3 is rugged multi-modal. Increase Speed to see long-run behavior.

Policy Search Race: Local Search vs. Genetic vs. CEM

Terrain 1

Speed 3

Algorithm	Terrain 1 (smooth)	Terrain 2 (traps)	Terrain 3 (rugged)
Local search	Excellent	Often trapped	Usually trapped
Genetic (pop)	Good but slow	Usually escapes	Often global
CEM	Excellent + fast	Usually escapes	Depends on init

Chapter 9: Connections & What's Next

Policy search occupies a special niche: the method of choice when you can evaluate but not differentiate the objective. The next chapters open the black box.

Method	Best for	Scale (dim θ)	Gradient?
Hooke-Jeeves / local search	Low-d, smooth, good init	d < 20	No
Nelder-Mead	Low-d, noisy, no gradient	d < 15	No
Genetic algorithm	Discrete or mixed, multi-modal	Any d	No
CEM	Continuous θ, Gaussian structure	d < 1000	No
Isotropic ES	Neural nets, parallelism	d up to 10⁶	Implicit
Policy gradient (Ch 11)	Differentiable class, on-policy	Any d	Estimated
PPO (Ch 12)	Smooth env, clipped update	Any d	Estimated

The gradient estimation connection: Isotropic ES is equivalent to finite-difference estimation of ∇_μ E[U(μ+σε)]. Policy gradient methods (Ch 11) estimate ∇_θ U(θ) via the log-derivative trick. Both are stochastic gradient methods; they differ in what they differentiate and how they reduce variance.

CEM in the wild: CEM was the planner in the original Dreamer model-based RL system (Ha & Schmidhuber 2018, Hafner et al. 2019): optimize action sequences by CEM within a learned world model. Imagined rollouts serve as the Monte Carlo evaluator. CEM + learned model achieves strong visual control without ever differentiating through the real environment.

Related chapters:

Ch 9: Online Planning — evaluating policies in online decision-making
Ch 11: Policy Gradient Estimation — gradient-based policy optimization
Ch 12: Policy Gradient Optimization — trust regions, PPO
Ch 16: Model-Based RL — CEM as a planner in learned world models

"Evolution is smarter than you are." — Leslie Orgel, biologist. The same humility applies to policy space: when gradients fail, let the landscape speak through samples.

CEM updates the mean toward elite samples and the covariance to match their spread. What does the covariance converge toward when fully converged?

A very small matrix — the distribution collapses to a point near optimal θ* The same covariance as the initial distribution A large covariance to maintain exploration