When you can't differentiate through the environment, search the policy space directly.
You're training a robot to walk. The robot's legs are controlled by a neural network with 10,000 parameters. To find the best parameters, you'd ideally compute ∂U/∂θ — the gradient of expected return with respect to the parameters — and follow it uphill.
But there's a problem: the environment isn't differentiable. The robot falls, bounces off the ground, has contact discontinuities. The map from parameters to expected return — call it U(θ) — exists, but its gradient might not. Or the gradient exists but is too noisy to follow. Or the policy class itself (a lookup table, a finite automaton, a decision tree) isn't differentiable by design.
The answer is policy search: treat U(θ) as a black-box function and search the parameter space Θ directly. No gradients required. You need only a way to evaluate policies — roll them out in the environment and measure the return.
Three families cover the space of approaches, each with a different philosophy for proposing new candidates:
| Family | Core idea | Best when |
|---|---|---|
| Local search | Perturb the current best θ and keep improvements | Smooth landscape, quick evaluation |
| Population-based | Maintain a set of candidates; evolve toward high-U regions | Multi-modal landscape, parallelism available |
| Distribution-based | Fit a distribution over θ; sample from it; update toward high-U samples | Gaussian-like structure in θ |
Before we can search, we need to score. Given a candidate πθ, we want a number: how good is it? The true expected return is
This expectation is over all rollout trajectories τ = (s0, a0, s1, a1, …) generated by running πθ in the environment. We can't compute it exactly, but we can Monte Carlo estimate it by running m actual rollouts.
Why m rollouts and not just 1? Because the environment is stochastic. A single rollout can be an outlier. Averaging over m rollouts reduces the estimation variance. The standard error of your estimate is σ/√m, where σ is the standard deviation of a single rollout's return.
How many rollouts do you need? If returns vary wildly (chaotic environment, long horizon), you need more. Double your precision, quadruple m. This is the fundamental evaluation budget problem in policy search.
Observe how estimate variance decreases as rollout count m increases. The true value is 1.0. Watch the confidence interval (purple bar) shrink with more rollouts.
The simplest strategy: stand at your current best θ, look around in a small neighborhood, and move to wherever U is higher. This is local search — also called hill climbing in the policy optimization context.
Concretely: start with an initial θ0. At each iteration, generate candidate perturbations by adding small random noise. Evaluate each candidate with Monte Carlo rollouts. If any candidate beats the current best, move there. Repeat.
The step size σ controls the neighborhood radius. Large σ explores widely but may jump over fine features of U. Small σ is precise but slow. A common trick: adaptive step size — shrink σ when progress stalls, expand when making consistent progress.
Local search has one Achilles heel: local optima. If U(θ) has many hills, local search climbs to the nearest peak and stops, unaware that a distant mountain is much taller. For policy spaces with many local optima (common in robotics), population-based methods handle this much better.
Imagine placing d+1 explorers on the policy landscape (for d = dim(θ)). These explorers form a simplex: in 2D a triangle, in 3D a tetrahedron. Their positions are θ(1), …, θ(d+1). The game: iteratively deform the simplex so it crawls toward the optimum.
The Nelder-Mead method (1965) uses four geometric operations on the simplex at each step. It doesn't use derivatives. It uses geometry.
| Operation | What it does | When used |
|---|---|---|
| Reflection | Reflect the worst vertex θw through the centroid of the rest | Default first try |
| Expansion | Extend further in the reflection direction if it's the new best | When reflection succeeded big |
| Contraction | Move θw closer to the centroid | When reflection didn't improve |
| Shrink | Move all vertices toward the current best θb | When contraction also failed |
The reflection step in detail: the centroid of all vertices except the worst is θc = (1/d) ∑i≠w θ(i). The reflected vertex is θr = θc + (θc − θw). Evaluate U(θr) and compare to the current best and second-worst. The four cases (best, good, bad, terrible) determine which operation to apply.
The triangle is the simplex. The star is the optimum. Press Step to advance one iteration and watch the triangle deform toward the goal.
Nature solved optimization over billions of years: keep good solutions, randomly mutate them, and combine successful parents. Genetic algorithms (GAs) copy this strategy — maintain a population of candidate policies, evaluate each, and breed the next generation from the best survivors.
The key insight: by keeping a population rather than a single candidate, GAs naturally explore multiple basins of attraction simultaneously. This makes them far less prone to local optima than local search.
For continuous parameter vectors θ ∈ Rd, the most common operations:
Uniform crossover: Given parents θA and θB, create offspring θC by taking each component from A with probability 0.5, or from B otherwise: θCj ~ Bernoulli(½) ? θAj : θBj. Works well when parameters are independent.
Mutation: Add Gaussian noise: θC ← θC + ε where ε ~ 𝒩(0, σ2I). σ is the mutation rate. Large σ = wild exploration; small σ = fine-tuning.
Each dot is a candidate policy in 2D parameter space. Brightness = fitness. The star is the global optimum. Watch the population converge.
Genetic algorithms maintain a set of point candidates. The cross-entropy method (CEM) is more principled: maintain a distribution over candidates. Concretely, CEM keeps a Gaussian p(θ) = 𝒩(μ, Σ) and iteratively updates (μ, Σ) so the distribution concentrates around high-return regions.
The name comes from information theory: updating the parameters to maximize the likelihood of the elite samples is equivalent to minimizing the KL-divergence between the current distribution and an idealized distribution concentrated on the optimal θ*.
The update equations are sample statistics of the elite set Θelite:
These are maximum likelihood estimates of a Gaussian fitted to the elites. No optimization required — just two matrix computations. The εI term (noise injection) prevents Σ from collapsing prematurely and stops exploration too early.
The ellipse shows the current distribution N(μ, Σ). Blue dots are samples, gold dots are elites. Watch the ellipse converge to the optimum (star).
python import numpy as np def cross_entropy_method(evaluate, d, n_iter=50, N=100, rho=0.2, noise=1e-3): """evaluate: theta -> scalar return. d: dimension of theta.""" mu = np.zeros(d) sigma = np.eye(d) # full covariance k = int(rho * N) # number of elites to keep for t in range(n_iter): # 1. Sample candidates from current distribution thetas = np.random.multivariate_normal(mu, sigma, N) # 2. Evaluate each candidate (embarrassingly parallel) returns = np.array([evaluate(th) for th in thetas]) # 3. Select top-k elites elite_idx = np.argsort(returns)[-k:] elites = thetas[elite_idx] # 4. Update distribution: MLE Gaussian on elites mu = elites.mean(axis=0) diff = elites - mu sigma = (diff.T @ diff) / k + noise * np.eye(d) return mu
CEM uses hard selection: keep the top ρ·N, discard the rest. Evolution strategies (ES) use soft selection: weight every sample by its return. The distribution update becomes a weighted average rather than a sample mean of elites.
ES keeps a Gaussian p(θ; μ, Σ) and performs the update:
where wi are weights derived from the returns. The most robust choice: rank-based weights. Rank the N candidates by return. Assign wi ∝ max(0, log(N/2 + 1) − log(ranki)), normalized to sum to 1. This gives large weight to the best, zero to the worst half, and smooth interpolation between.
python import numpy as np def evolution_strategy(evaluate, d, n_iter=50, N=50, sigma=0.1, alpha=0.01): """Diagonal-covariance ES with rank-based weights.""" mu = np.zeros(d) for t in range(n_iter): eps = np.random.randn(N // 2, d) # Mirrored sampling: evaluate +eps and -eps thetas = np.concatenate([mu + sigma * eps, mu - sigma * eps]) returns = np.array([evaluate(th) for th in thetas]) # Rank-based weights: top half gets positive weight, bottom zero ranks = np.argsort(np.argsort(returns)) # double-argsort = ranks w = np.maximum(0, np.log(N/2 + 1) - np.log(N - ranks)) w /= w.sum() # Weighted mean update mu = (w[:, None] * thetas).sum(axis=0) return mu
When the policy parameter θ has thousands or millions of dimensions (a neural network), maintaining a full d×d covariance matrix is intractable. Isotropic ES simplifies by fixing Σ = σ2I — the same noise in every direction. This reduces the search to tuning a single scalar σ.
This sounds like a drastic limitation. Surprisingly, it works even for deep networks with millions of parameters. The key insight: the update
is actually a gradient estimate of Eε~𝒩(0,I)[U(μ + σε)] with respect to μ. Isotropic ES is secretly a smoothed gradient descent on the Gaussian-blurred objective!
python import numpy as np def isotropic_es(evaluate, d, n_iter=100, N=50, sigma=0.05, alpha=0.01): """Isotropic ES with mirrored sampling and baseline subtraction.""" mu = np.zeros(d) for t in range(n_iter): eps = np.random.randn(N, d) # N perturbation directions # Mirrored: evaluate +eps and -eps pairs R_pos = np.array([evaluate(mu + sigma * e) for e in eps]) R_neg = np.array([evaluate(mu - sigma * e) for e in eps]) # Gradient estimate with baseline subtraction advantages = R_pos - R_neg # baseline cancels exactly grad = (advantages[:, None] * eps).sum(axis=0) / (N * sigma) mu += alpha * grad return mu
Three algorithms compete to optimize the same 2D policy landscape in real time. The landscape has a global optimum (gold star) and sometimes local traps. Watch how each method navigates.
| Algorithm | Terrain 1 (smooth) | Terrain 2 (traps) | Terrain 3 (rugged) |
|---|---|---|---|
| Local search | Excellent | Often trapped | Usually trapped |
| Genetic (pop) | Good but slow | Usually escapes | Often global |
| CEM | Excellent + fast | Usually escapes | Depends on init |
Policy search occupies a special niche: the method of choice when you can evaluate but not differentiate the objective. The next chapters open the black box.
| Method | Best for | Scale (dim θ) | Gradient? |
|---|---|---|---|
| Hooke-Jeeves / local search | Low-d, smooth, good init | d < 20 | No |
| Nelder-Mead | Low-d, noisy, no gradient | d < 15 | No |
| Genetic algorithm | Discrete or mixed, multi-modal | Any d | No |
| CEM | Continuous θ, Gaussian structure | d < 1000 | No |
| Isotropic ES | Neural nets, parallelism | d up to 106 | Implicit |
| Policy gradient (Ch 11) | Differentiable class, on-policy | Any d | Estimated |
| PPO (Ch 12) | Smooth env, clipped update | Any d | Estimated |
Related chapters:
"Evolution is smarter than you are." — Leslie Orgel, biologist. The same humility applies to policy space: when gradients fail, let the landscape speak through samples.