Ch 11: Policy Gradient Estimation — Kochenderfer et al.

Chapter 0: Why Gradient Estimation?

Imagine you are trying to tune a robot arm. The arm has a policy — a function that maps sensor readings to motor commands — controlled by a vector of parameters θ. You run the robot, score it, nudge θ a little, run it again. You're doing gradient ascent on the utility U(θ) = E[total reward].

There's one problem. You can't write down U(θ) in closed form. It's an expectation over all the random trajectories the robot might take — over random sensor noise, random actuator jitter, random environment dynamics. You never see U directly. You only see samples: one specific run that produced one specific reward.

The core tension: Gradient ascent needs ∇U(θ). U(θ) is an expectation over stochastic trajectories. We can't differentiate through an expectation we can only sample. We need a way to turn a bag of rollouts into a usable gradient signal.

To appreciate how nontrivial this is, compare with supervised learning. There, you minimize L(θ) = (1/N) ∑ ℓ(f_θ(x_i), y_i). The gradient ∇_θL is computed by backpropagation through a deterministic computational graph you wrote yourself. In RL, U(θ) = E_{τ∼p_θ}[R(τ)] involves a black-box environment. The computational graph of the world is unknown — you can only query it via rollouts. Finite difference, regression gradient, and likelihood ratio are all ways to estimate ∇U without the environment's source code. Each trades off sample cost, bias, and variance differently. Choosing among them is the first design decision when building a policy gradient algorithm.

Chapter 10 showed one response: ignore the gradient entirely. Cross-entropy, evolution strategies — treat U as a black box and search policy space heuristically. That works, but it's wasteful. If we can estimate ∇U, we can point our steps in the steepest uphill direction and converge far faster.

This chapter develops four approaches, in increasing sophistication. We'll start with the brute-force method (finite differences), then derive the elegant likelihood-ratio estimator, then show how to dramatically tighten it with variance reduction. The most polished version becomes REINFORCE — the algorithm that launched the modern policy gradient era.

The methods differ in what they assume about the environment. Finite differences and regression gradient need no assumptions at all — they treat the environment as a black box. The likelihood-ratio method assumes the policy is stochastic and differentiable w.r.t. θ (so we can compute ∇logπ), but makes no assumptions about the transition dynamics. This is why likelihood-ratio methods extend naturally to POMDPs (partial observability, Chapter 19-22): the agent doesn't need to model how its observations arise from hidden states. As long as the policy outputs action probabilities from observations, ∇logπ exists and the estimator applies unchanged.

One clarifying note on notation: throughout the book and this lesson, τ denotes a trajectory. Different papers use different symbols: some write h (history), some write ξ (Greek xi), some write s_1:Ta_1:T. The content is always the same: an ordered sequence of states and actions sampled by rolling out a policy in the environment. The book's notation p_θ(τ) for trajectory probability is consistent with most modern RL literature.

Here's a quick way to build intuition for the whole chapter. Imagine U(θ) as a hilly landscape, and you're blindfolded at some location (θ) on it. You don't know the map. All you can do is take steps and measure your altitude afterwards. Finite difference says: take a tiny step north, record your altitude; come back, take a tiny step east, record altitude; repeat for each direction. Expensive but reliable for few directions. Regression gradient says: take many random steps in all directions at once, fit a plane to all the altitude measurements — the normal of the plane points uphill. The likelihood ratio trick says: you don't need to take any off-center steps at all. Just stay where you are, make random decisions according to your current policy, and use the score function (how "surprising" each decision was) weighted by the reward you got. Good surprises point uphill. This is the key insight the chapter builds toward: you can estimate the hill's gradient without ever leaving your current position.

Policy Performance Landscape

A 2-parameter policy π_θ parameterized by (θ₁, θ₂). The color shows U(θ) (warm = high utility). We can't compute this surface analytically — each colored cell requires many rollouts. The challenge: navigate toward the peak without seeing the surface explicitly. Click to place your current θ and see what gradient estimation methods can do.

Click anywhere on the surface to set your current θ. The warm circle shows current location.

Black-box approach

Cross-entropy, CEM, NES — no gradient needed (Ch 10)

↓ this chapter

Finite Difference

Perturb each parameter, see how U changes — simple but expensive

↓

Likelihood Ratio / REINFORCE

Analytically differentiate through log π — elegant, scalable

↓

Variance Reduction

Reward-to-go + baselines — same gradient, far less noise

Why can't we just directly differentiate U(θ) = E[R(τ)] with respect to θ?

Because the reward function R is always discontinuous Because neural network policies have too many parameters Because U is an expectation over trajectories we can only sample — there's no closed-form expression to differentiate

Chapter 1: Finite Difference

The simplest idea: compute the gradient numerically by perturbing one parameter at a time and observing how U changes. For the i-th component of θ:

∂U / ∂θ_i ≈ ( U(θ + δ e_i) − U(θ) ) / δ

where e_i is the i-th standard basis vector (all zeros except a 1 in position i) and δ is a small step size. To estimate the full gradient for a policy with n parameters, you need n+1 evaluations of U: one at θ and one perturbed evaluation per parameter.

The trouble is that U(θ) itself is an expectation — you can't evaluate it exactly. You estimate it with m rollouts. So the total rollout budget is m(n+1) just to get one gradient estimate. For a neural network with 10,000 parameters that's 10,001 evaluation packs.

The sweet spot of δ: Too large, and the linear approximation (f + δf') breaks down because the function curves significantly. Too small, and numerical rounding noise swamps the signal — the difference U(θ + δe) − U(θ) is buried in estimation noise. The optimal δ balances these two sources of error. There is no formula for the optimum; it's problem-dependent.

The centered (two-sided) version is more accurate: perturb in both directions and take the symmetric difference ( U(θ + δe_i) − U(θ − δe_i) ) / 2δ. This cancels the leading-order error, so the bias drops from O(δ) to O(δ²). The cost doubles (2n evaluations instead of n+1), but the gain is worth it in smooth problems.

The Taylor expansion makes the bias explicit. Forward difference: U(θ + δe_i) = U(θ) + δ ∂U/∂θ_i + (δ²/2) ∂²U/∂θ_i² + O(δ³). Dividing by δ and rearranging: (U(θ + δe_i) − U(θ))/δ = ∂U/∂θ_i + (δ/2) ∂²U/∂θ_i² + O(δ²). The error is δ/2 times the second derivative — proportional to δ and to the curvature. The centered version: (U(θ+δe_i) − U(θ−δe_i)) / 2δ = ∂U/∂θ_i + O(δ²), because the second-derivative terms cancel. The curvature bias vanishes to leading order.

python
import numpy as np

def finite_diff_gradient(policy, theta, delta, n_rollouts):
    """Estimate gradient of U(theta) via finite differences."""
    n = len(theta)
    grad = np.zeros(n)
    U0 = estimate_utility(policy, theta, n_rollouts)  # baseline

    for i in range(n):
        theta_plus = theta.copy()
        theta_plus[i] += delta
        U_plus = estimate_utility(policy, theta_plus, n_rollouts)
        grad[i] = (U_plus - U0) / delta  # forward difference

    return grad  # cost: (n+1) * n_rollouts rollouts total

def estimate_utility(policy, theta, n_rollouts):
    """Average return over m rollouts."""
    returns = [run_episode(policy, theta) for _ in range(n_rollouts)]
    return np.mean(returns)

Step Size δ Trade-off

For f(x) = x² at x = 2, true derivative = 4. Drag the slider to see how δ affects the estimate. Too big: curve curvature creates bias. Too small: estimation noise (simulated here) dominates. There's a sweet spot.

log₁₀(δ) = 0.1000

Noise 0.20

For neural network policies (π_θ(a|s) implemented as a network with PyTorch/JAX), the score function ∇_θ log π(a|s) is computed by backpropagating through the network from the log-probability output. You don't compute it analytically — autograd does it. The workflow is: (1) run the forward pass to get action logits, (2) compute log π(a|s) for the taken action, (3) multiply by the rtg weight, (4) call .backward(). PyTorch accumulates gradients. After processing all steps in the batch, call optimizer.step(). One subtlety: you want gradient ascent, not descent. Either negate the loss before calling backward(), or use -logπ × rtg as the loss and standard gradient descent. Both are equivalent. Most RL libraries define the REINFORCE loss as -E[logπ × rtg] for this reason.

A useful sanity check when implementing any gradient estimator: run your code on a simple convex function where you know the exact gradient (e.g., U(θ) = −θ², gradient = −2θ). Compare your estimator's output to the exact gradient. If they agree within statistical noise, your implementation is probably correct. This gradient checking pattern is standard practice in RL codebases before deploying on real problems.

Kochenderfer also presents a smarter variant: regression gradient. Instead of probing one parameter at a time, draw m random perturbations Δθ⁽¹⁾, ..., Δθ^(m) (each perturbing all parameters simultaneously), collect the utility change ΔU⁽ⁱ⁾ for each, and fit a linear model:

∇U(θ) ≈ (ΔΘ⁺) ΔU

where ΔΘ is the m×n matrix of perturbations and ΔΘ⁺ is its Moore-Penrose pseudoinverse. Each random perturbation probes all parameter directions at once, so the gradient estimate becomes efficient: roughly 2n perturbations suffice for an n-parameter policy, compared to n+1 for finite differences. The recommended rule of thumb from the book: use 2n perturbations with a perturbation magnitude of δ ≈ 1% of the parameter norm.

python
import numpy as np

def regression_gradient(policy, theta, delta, m_perturb, n_rollouts):
    """
    Regression gradient: random perturbations + least-squares fit.
    theta:     current parameters (n,)
    delta:     perturbation magnitude (radius of hypersphere)
    m_perturb: number of random perturbations (use 2*n rule of thumb)
    """
    n = len(theta)
    U0 = estimate_utility(policy, theta, n_rollouts)

    # Sample m random perturbations from unit sphere, scale by delta
    Delta_Theta = []
    Delta_U = []
    for _ in range(m_perturb):
        direction = np.random.randn(n)
        direction /= np.linalg.norm(direction)   # unit sphere
        d_theta = delta * direction
        d_U = estimate_utility(policy, theta + d_theta, n_rollouts) - U0
        Delta_Theta.append(d_theta)
        Delta_U.append(d_U)

    Delta_Theta = np.array(Delta_Theta)          # shape: (m, n)
    Delta_U = np.array(Delta_U)                  # shape: (m,)

    # Least squares: find grad that best explains Delta_U = Delta_Theta @ grad
    grad, _, _, _ = np.linalg.lstsq(Delta_Theta, Delta_U, rcond=None)
    return grad   # cost: m * n_rollouts rollouts total (vs (n+1)*n_rollouts)

When to use which: Finite difference is simpler to implement and works well for very-low-dimensional problems (n ≤ 20) or as a gradient check. Regression gradient is more sample-efficient when n is moderate (20–500) and you can afford random direction probing. For n ≥ 1000 (neural networks), both are infeasible — use likelihood ratio (next chapter).

Numerical example. Consider a 3-parameter policy (θ = [0.5, −0.2, 1.1]) in a simple environment. We want ∂U/∂θ₂ using finite differences with δ = 0.01 and m = 50 rollouts each:

Step 1: Estimate U(θ) ≈ 4.23 (mean of 50 rollouts).

Step 2: Perturb θ₂: θ + 0.01 e₂ = [0.5, −0.19, 1.1]. Estimate U(perturbed) ≈ 4.41 (50 rollouts).

Step 3: ∂U/∂θ₂ ≈ (4.41 − 4.23) / 0.01 = 18.0.

But wait — with only 50 rollouts, each estimate has standard error σ/√50 ≈ 0.14 (assuming reward standard deviation ~1). The difference (4.41 − 4.23) = 0.18 might be 100% noise. The finite difference estimate's standard error is 0.14√2 / 0.01 ≈ 20 — larger than the signal. This is exactly the regime where likelihood-ratio methods shine: they compute ∇logπ analytically, avoiding this amplification of rollout noise.

For a policy with n = 100 parameters, how many rollout evaluations does forward finite differences need per gradient estimate?

100 (one per parameter) 101 (one baseline + one per parameter) 200 (two per parameter)

Chapter 2: The Log-Derivative Trick

Finite differences treats U as a black box. But U has structure we can exploit: it's an integral involving the policy π_θ, which we designed. We know its derivative analytically. The log-derivative trick is how we harvest that knowledge.

Start from the definition. A trajectory τ is a sequence of states and actions: (s₁, a₁, s₂, a₂, ..., s_d, a_d). The expected utility is:

U(θ) = ∫ p_θ(τ) · R(τ) dτ

where p_θ(τ) is the probability of trajectory τ under policy θ, and R(τ) is its total discounted reward. Moving the gradient inside the integral:

∇_θ U(θ) = ∫ ∇_θ p_θ(τ) · R(τ) dτ

We can't sample from ∇_θp because it's not a probability distribution (it can be negative!). Here's the trick: multiply and divide by p_θ(τ):

= ∫ p_θ(τ) · ∇_θ log p_θ(τ) · R(τ) dτ

The trick, explicitly: This works because ∇p / p = ∇ log p (chain rule: d/dx [log f(x)] = f'(x)/f(x), so f'(x) = f(x) d/dx [log f(x)]). Now the integral is exactly E_τ[∇ log p_θ(τ) · R(τ)] — an expectation we can estimate by sampling!

So the final gradient estimator is:

∇_θ U(θ) = E_τ[ R(τ) ċ ∇_θ log p_θ(τ) ]

Now what is ∇_θ log p_θ(τ)? Expand the trajectory probability:

p_θ(τ) = p(s₁) ∏_t=1^d T(s_t+1 | s_t, a_t) ċ π_θ(a_t | s_t)

Take the log: it turns the product into a sum. Then take the gradient w.r.t. θ. The initial state distribution p(s₁) doesn't depend on θ. The transitions T(s'|s,a) don't depend on θ (we're model-free). They vanish. Only the policy terms survive:

∇_θ log p_θ(τ) = ∑_t=1^d ∇_θ log π_θ(a_t | s_t)

The beautiful consequence: We never need to know the environment dynamics T(s'|s,a). The gradient depends only on ∇ log π_θ — the score function of our own policy. This is what makes policy gradient methods model-free: they work even when the environment is a complete black box.

There's a deeper reason the transition terms cancel that's worth dwelling on. The Markov property tells us that state s_t+1 is determined entirely by (s_t, a_t) and the environment's randomness — not by θ. So when we take ∇_θ, the transition T(s_t+1 | s_t, a_t) contributes zero. Only π_θ(a_t | s_t) carries θ dependence. If the environment were not Markov (say, a complicated simulator with internal state), the transitions might depend on θ indirectly (through action history affecting hidden state), and this clean cancellation would break down. The Markov assumption is load-bearing for the likelihood ratio trick.

Formally: log p_θ(τ) = log p(s₁) + ∑_t [log T(s_t+1|s_t,a_t) + log π_θ(a_t|s_t)]. Taking ∇_θ: the log p(s₁) term vanishes (no θ). Each log T term vanishes (no θ). What remains is exactly ∑_t ∇_θ log π_θ(a_t|s_t). The derivation is three lines; the implication is enormous — the entire field of model-free RL depends on this cancellation holding.

Each term ∇_θ log π_θ(a_t | s_t) is called the score function at time t. It points in the direction in parameter space that most increases the log-probability of taking action a_t in state s_t. When multiplied by R(τ), high-reward trajectories push θ to make those actions more likely, and low-reward trajectories push in the opposite direction.

Concrete score functions for two common policy families:

Gaussian policy (continuous actions):

π_θ(a|s) = N(μ_θ(s), σ²)

log π = −(a−μ)² / (2σ²) − log σ + const

∇_μ log π(a|s) = (a − μ) / σ²

If θ parameterizes μ through a neural net, chain rule gives ∇_θ log π = (a − μ)/σ² · ∇_θ μ_θ(s). The score vector points from the mean toward the sampled action. If the action was good, μ gets pulled toward it.

Softmax policy (discrete actions):

π_θ(a|s) = e^θ_sa / ∑_a' e^θ_sa'

log π(a|s) = θ_sa − log ∑ e^θ_sa'

∇_{θ_sa'} log π(a|s) = 1(a'=a) − π(a'|s)

The gradient w.r.t. the taken action's parameter is 1 − π (positive: increase it). For all other actions: −π (negative: decrease them). The result: taken action becomes more likely, others less likely, by exactly the probability mass that normalization requires.

Both score functions have a deep property: E_a∼π[∇ log π(a|s)] = 0. The expected score under the policy itself is always zero. This is why baselines (Chapter 5) work: you can subtract any state-dependent quantity from R and the expected gradient change is zero.

What happens at convergence? When the policy has converged to the optimal π*, the gradient should be zero (we're at a maximum). For the softmax: at the optimum π(best_action|s) → 1. The score for the best action is 1 − 1 = 0. All other actions have probability 0, so their contribution is also 0 × (−1) = 0. The entire gradient is zero. Gradient ascent correctly stops improving. For Gaussian policies, convergence means the mean μ has moved to the optimal action and the score (a − μ)/σ² is centered around zero, so its expectation is also zero. The gradient signal dies out as the policy peaks.

This is not a bug — it's the correct behavior. A peaked policy that's fully converged needs no further updates. The challenge is premature convergence: the policy might peak on a locally optimal but globally suboptimal action. Entropy regularization (adding H(π) to the objective) counteracts this by penalizing overconfident policies, keeping σ from shrinking to zero or the softmax from becoming too sharp too soon.

Score Function Geometry

Gaussian policy π(a|s) = N(μ, σ²). The score ∇_μ log π = (a − μ)/σ² points from the mean toward the sampled action (red dot). Drag μ and σ to see how the score arrow changes for the fixed sample a = 1.5.

μ = 0.0

σ = 1.0

Why do the transition probabilities T(s'|s,a) vanish from ∇_θ log p_θ(τ)?

Because T is always 0 Because T is approximated by a neural network Because T doesn't depend on the policy parameters θ, so ∇_θ log T = 0

Chapter 3: REINFORCE — Live Demo

Williams (1992) combined the likelihood ratio gradient into a clean algorithm: REINFORCE. Collect trajectories from the current policy, estimate the gradient from each, take a gradient ascent step, repeat. The full gradient estimator is:

∇_θ U(θ) ≈ (1/m) ∑_i=1^m R(τ⁽ⁱ⁾) · ∑_t=1^d ∇_θ log π_θ(a_t⁽ⁱ⁾ | s_t⁽ⁱ⁾)

where m trajectories τ⁽¹⁾, ..., τ^(m) are sampled from the current policy, R(τ⁽ⁱ⁾) is the total return of trajectory i, and the inner sum is the summed score over all timesteps. After computing the gradient estimate, we take a gradient ascent step: θ ← θ + α ∇_θ U(θ).

The REINFORCE intuition: Good trajectories (high R) push θ to make their actions more likely. Bad trajectories push in the opposite direction. The magnitude of the push is proportional to how good (or bad) the trajectory was. Over many updates, the policy gravitates toward the behaviors that yield high reward.

Worked example from scratch. Suppose you have a 1-step MDP: one state, two actions (Left = reward −1, Right = reward +5). Your policy is a softmax parameterized by a single scalar θ: π(Right) = e^θ/(e^θ + 1) = σ(θ). Currently θ = 0, so π(Right) = 0.5.

You run m = 3 episodes. Suppose the outcomes are: episode 1 goes Right (reward +5), episode 2 goes Left (reward −1), episode 3 goes Right (reward +5). First, compute each score:

∇_θ log π(Right | θ=0) = 1 − 0.5 = +0.5

∇_θ log π(Left | θ=0) = 0 − 0.5 = −0.5

Now sum R × score over the three episodes:

∇U ≈ (1/3) [ (+5)(+0.5) + (−1)(−0.5) + (+5)(+0.5) ] = (1/3)[2.5 + 0.5 + 2.5] = (1/3)(5.5) ≈ 1.83

With learning rate α = 0.1: θ ← 0 + 0.1 × 1.83 = 0.183. Now π(Right) = σ(0.183) ≈ 0.546. The policy has shifted slightly toward Right. After many such updates, θ grows large and π(Right) → 1.0.

Below is a live simulation of exactly this MDP. The agent starts at θ = 0 and you can watch REINFORCE drive it toward the optimal policy.

The policy is a softmax over a single parameter: π(Right | S) = σ(θ) = e^θ / (1 + e^θ). The gradient is:

∇_θ log π(Right | S) = 1 − π(Right | S) ∇_θ log π(Left | S) = −π(Right | S)

Run episodes, watch the gradient accumulate, and see θ climb toward the optimal policy. Click "Auto-run" to let REINFORCE converge on its own.

REINFORCE Live

Two-state MDP. Right = reward +5, Left = reward −1. Policy: π(Right) = σ(θ). Watch θ evolve as REINFORCE collects rollouts and updates.

Learning rate α 0.10

Episodes / step 5

Ready. θ = 0.00, π(Right) = 50.0%

An aside on deterministic policies. The likelihood ratio trick requires ∇_θ log π_θ(a|s) to exist. For a deterministic policy, π(a|s) is either 1 (for the greedy action) or 0 (for others). The log of 0 is undefined, and the gradient doesn't exist in the usual sense. This is why stochastic policies are essential for REINFORCE — without them, the log-derivative trick breaks. Deterministic policy gradient (DPG) methods (Ch 17) handle this via a different trick: they differentiate the deterministic action output directly through the Q-function, bypassing the log-probability entirely. But that requires a differentiable environment model or a differentiable Q approximation.

The practical takeaway: always use a stochastic policy for REINFORCE. For discrete action spaces, a softmax with a temperature parameter τ controls how peaked the distribution is: π(a|s) ∝ exp(θ_sa/τ). High temperature (large τ) = uniform/exploratory; low temperature = nearly deterministic. You can anneal τ from high to low over training to transition from exploration to exploitation. For continuous action spaces, a Gaussian policy π(a|s) = N(μ_θ(s), σ_θ²(s)) maintains stochasticity throughout training, with the learned σ naturally decreasing as the policy converges — though you may need a minimum σ floor to prevent premature convergence to a degenerate distribution.

Notice that at θ = 0, the policy is 50/50. As REINFORCE accumulates gradient signal from high-reward Right trajectories, θ grows. The policy becomes increasingly biased toward Right. Eventually π(Right) → ~1.0 and the updates plateau — the policy has converged.

Note that the gradient estimate from REINFORCE, even when computed correctly, is extremely high-variance in the beginning of training when the policy is nearly uniform. Every episode has nearly equal probability, so the gradient weights are nearly equal for all trajectories, making the net gradient close to zero. As the policy concentrates (high θ magnitude), the score function becomes larger for the preferred actions and the gradient signal clarifies. This is why REINFORCE often shows very slow initial progress followed by a "breakthrough" when the policy suddenly tips past 50% toward the optimal action — a phase transition in the gradient landscape.

The convergence rate depends on learning rate and batch size. A large learning rate (α = 0.5) converges faster but can overshoot — the policy oscillates between "mostly Right" and "all Right" before stabilizing. A small learning rate (α = 0.01) is stable but slow. Batch size matters too: with 1 episode per update, the gradient estimate is noisy and training is erratic; with 20 episodes, the estimate is smoother and the θ trajectory is much more monotone. This is the same bias-variance trade-off that appears everywhere in statistics, now applied to the gradient estimator.

One subtlety: in this 1-step MDP, the Left action always gives −1. If early rollouts happen to produce many Right episodes by chance (giving θ a head start), convergence is fast. If they produce many Left episodes (unlucky start), θ dips negative before climbing. The initial policy's randomness is the primary driver of early training variance — which is exactly why the baseline (subtracting the mean return ≈ 2.0 from the batch) stabilizes the gradient signal significantly.

In REINFORCE, what happens to the probability of an action that led to a negative total return?

It stays the same (REINFORCE only reinforces positive rewards) It decreases: the gradient step subtracts ∇logπ · |R|, pushing θ to make that action less likely It increases (negative reward means the action should be taken more)

Chapter 4: Reward-to-Go

REINFORCE as stated multiplies every step's score by the entire trajectory return R(τ). But can action a₂ (taken at time 2) affect reward r₁ (collected at time 1)? No. Causality forbids it. Past rewards are already fixed before action a₂ is chosen.

This means we're polluting the gradient signal for each step t with irrelevant noise from rewards before t. The reward-to-go (also called the "causality trick") substitutes the total return with only the future rewards, from step t onward:

R_t^to-go = ∑_t'=t^d γ^t'−t r_t'

The full policy gradient becomes:

∇_θ U(θ) = E_τ &Bigg[∑_t=1^d ∇_θ log π_θ(a_t | s_t) ċ γ^t−1 R_t^to-go&Bigg]

Why this is still unbiased: We're dropping the term ∑_t'<t γ^t'−1 r_t' from step t's weight. But E[∇_θ log π(a_t|s_t) ċ r_t'] = 0 for all t' < t, because past rewards are independent of the current action given the current state. We're removing a term whose expectation is zero — that's zero bias. But its variance was non-zero, so we've reduced variance for free.

The benefit compounds across long trajectories. For a trajectory of length d, step t uses d−t+1 future reward terms instead of d total terms. Early steps see the largest reduction. Step d uses only r_d; step 1 uses all d rewards. The average number of terms per step drops from d to (d+1)/2.

In practice, for long-horizon problems like Atari games (episode length ~1000 steps) or robotic locomotion (500+ steps), reward-to-go can reduce variance by a factor of 5-10x compared to total return. This translates directly to needing 5-10x fewer samples to achieve the same gradient signal quality — a massive practical advantage that costs nothing in terms of bias or additional computation. Every policy gradient implementation that doesn't use reward-to-go is leaving free variance reduction on the table.

Another subtlety: the discount γ in the reward-to-go has a dual role. Mathematically, it down-weights future rewards to prioritize immediate returns, reflecting time preference. Practically, it also prevents the reward-to-go from blowing up in continuing (non-episodic) tasks where the sum could otherwise be infinite. For γ < 1, the geometric series ∑_t'=t^∞ γ^t'−t r_t' converges as long as rewards are bounded. The choice γ = 0.99 (vs 0.95 or 1.0) can significantly affect learned behavior: a lower γ makes the agent myopic and greedy; γ = 1.0 requires finite-horizon tasks to guarantee convergence of the sum.

One implementation subtlety: when computing reward-to-go across a batch of trajectories with different lengths, be careful to truncate correctly. Each trajectory has its own rtg array; they should not be mixed or padded incorrectly. A common bug is computing the running sum across trajectory boundaries, which mixes future rewards from one trajectory into past steps of another. Always maintain a separate rtg list per trajectory and concatenate after.

With a discount factor γ < 1, the reward-to-go also has a natural "effective horizon" of 1/(1−γ) steps — the geometric series converges to a finite value. For γ = 0.99, the effective horizon is ~100 steps: rewards more than 100 steps in the future are down-weighted by more than e⁻¹. This means in very long episodes (1000+ steps), the policy is effectively optimized for the near-term horizon, not the true long-run return. Choosing γ is thus a design decision about how far ahead the agent should plan, not just a mathematical convenience.

Total return (noisy):

Step 3 gets multiplied by:
r₁ + r₂ + r₃ + r₄ + r₅
↑ r₁, r₂ are noise — action at step 3 can't change them

Reward-to-go (cleaner):

Step 3 gets multiplied by:
r₃ + γr₄ + γ²r₅
↑ Only rewards the action can actually influence

python
def compute_rewards_to_go(rewards, gamma=0.99):
    """
    Input:  rewards = [r1, r2, ..., rd] from a single trajectory
    Output: rtg[t] = gamma^0 * r_t + gamma^1 * r_{t+1} + ... + gamma^{d-t} * r_d
    """
    d = len(rewards)
    rtg = [0.0] * d
    running = 0.0
    for t in range(d - 1, -1, -1):  # iterate backward
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# REINFORCE with reward-to-go:
for trajectory in batch:
    rtg = compute_rewards_to_go(trajectory.rewards)
    for t, (s, a, r) in enumerate(trajectory.steps):
        score = grad_log_pi(theta, a, s)  # nabla_theta log pi
        grad += score * rtg[t]             # weight by future rewards only

theta += alpha * grad / len(batch)

Worked example: reward-to-go vs. total return. Suppose a trajectory has 3 steps with rewards [1, 3, 2] and γ = 1.0 (no discounting for simplicity). Total return: R(τ) = 6.

With total return, every step gets multiplied by 6. Step 1's score gets multiplied by 6 — even though 5 of those 6 reward units came from steps 2 and 3, which had nothing to do with action a₁.

With reward-to-go: step 1 gets 1 + 3 + 2 = 6, step 2 gets 3 + 2 = 5, step 3 gets only 2. Now the weight for step 3 is tight (only the immediate reward), and the weight for step 1 is the most informative possible. Steps at the end of the trajectory contribute more signal per rollout, because their weights are smaller and thus less contaminated by the environment's randomness across future steps.

Formal proof sketch: The full REINFORCE gradient is E[R(τ) ∑_t ∇logπ(a_t|s_t)]. Expand R(τ) = ∑_t'=1^d r_t'. The term ∇logπ(a_t|s_t) ċ r_t' for t' < t: since a_t is chosen after r_t' is already determined, E[∇logπ(a_t|s_t) ċ r_t'] = E_{s_t}[E_{a_t}[∇logπ(a_t|s_t)] · r_t'] = E_{s_t}[0 · r_t'] = 0. Each past reward is a zero-mean-contribution noise term. Removing them is free (unbiased) variance reduction.

Check: Reward-to-go reduces variance without introducing bias. Why is it still unbiased?

Because E[∇logπ(a_t|s_t) · r_t'] = 0 for all past rewards t' < t (past rewards are independent of the current action) Because it uses the total return in a cleverer way Because it adds a baseline to compensate

Chapter 5: Baseline Subtraction

Even with reward-to-go, the gradient can still be noisy. Here's a concrete failure mode: suppose all trajectories have positive return — the environment is generous. Then every action gets pushed to be more likely, just by different amounts. The gradient signal is drowned by the absolute scale of returns, not the relative quality of actions.

The fix: subtract a baseline b(s_t) from the reward-to-go at each step. The gradient becomes:

∇_θ U(θ) = E_τ&Bigg[∑_t ∇_θ log π_θ(a_t | s_t) ċ γ^t−1 (R_t^to-go − b(s_t))

The baseline can be any function of state — it must not depend on the action. This is critical: if b were action-dependent, subtracting it would bias the gradient.

Why subtracting b doesn't introduce bias: We need to show that E[b(s_t) ∇_θ log π(a_t|s_t)] = 0. Since b doesn't depend on a_t, we can condition on s_t and move b out: b(s_t) · E_{a_t∼π}[∇_θ log π(a_t|s_t)]. But E_a∼π[∇ log π(a|s)] = ∑_a π(a|s) ∇logπ(a|s) = ∑_a ∇π(a|s) = ∇_θ ∑_a π(a|s) = ∇_θ 1 = 0. The baseline contributes zero in expectation — it only affects variance.

What makes a good baseline? The Kochenderfer text derives the optimal baseline analytically. For a scalar policy gradient component i, the variance-minimizing baseline is:

b*(s) = E[( ∇_θ log π(a|s) )² R_to-go] / E[( ∇_θ log π(a|s) )²]

This is a weighted average of returns, where the weights are the squared score magnitudes. In practice, this is expensive to compute per-parameter. The most common heuristic is the state value function: b(s) = V^π(s) — the expected return from state s under the current policy. It's a good proxy because it removes the "background" expected return, leaving only the advantage of the specific action taken.

The three common choices of baseline, ordered by effectiveness:

Baseline b(s)	How to compute	Effectiveness
Zero (no baseline)	b = 0	Worst: full return scale in gradient
Mean return	b = mean of R^to-go across batch	Good: simple, removes absolute scale
State value V^π(s)	Learned neural network: V_w(s) trained by regression on MC returns	Best: removes per-state background return
GAE-estimated advantage	δ = r + γV(s') − V(s), then exponential blend	Excellent: trades off bias vs. variance with λ

In code, implementing the mean-return baseline adds just 2 lines. Implementing V(s) as a learned network adds ~50 lines (a second network + regression training loop). The gain from V(s) over the mean baseline is typically 2-5x variance reduction on continuous control tasks, often worth the complexity.

Worked example: why baseline helps when all rewards are positive. Suppose from state s, actions A, B, C give returns 8, 10, 12. Mean = 10. Without a baseline, the gradient term for action B is score(B) × 10. With baseline = 10: score(B) × (10 − 10) = 0. Action B is exactly average — it gets zero gradient, neither promoted nor demoted. Action A gets score(A) × (8 − 10) = score(A) × (−2): slightly demoted. Action C gets score(C) × (12 − 10) = score(C) × 2: promoted. The gradient now correctly pushes the policy toward C and away from A, ignoring the absolute scale of rewards entirely.

Without the baseline, all three actions got positive gradient signal — the gradient was trying to make all of them more likely simultaneously. The magnitude differences between 8, 10, and 12 created the correct direction, but drowned in much more noise than necessary. The baseline precisely centers the signal.

Baseline Effect on Gradient Variance

Both methods estimate the same gradient (unbiased), but the baseline version clusters tightly around the truth. Click "Sample" to add batches of gradient estimates. Red = no baseline, Green = with mean-return baseline.

No samples yet

A baseline b(s) is subtracted from returns. Why must it NOT depend on the action a?

Because actions are discrete Because we don't have access to actions when computing the baseline Because if b depends on a, then E[b(s,a) ∇logπ(a|s)] ≠ 0 and the estimator becomes biased

Chapter 6: Variance Reduction Summary

We've now built up three layers of variance reduction. Let's stack them together and see exactly how the gradient changes — and what each layer buys you.

Layer 0: Raw REINFORCE

E[R(τ) · ∇log p_θ(τ)] — unbiased but extremely noisy. Every step weighted by full trajectory return including irrelevant past.

↓ causality

Layer 1: + Reward-to-Go

E[R_t^to-go · ∇log π(a_t|s_t)] — same expectation, less variance. Drops past rewards which contribute zero in expectation.

↓ centering

Layer 2: + Baseline

E[(R_t^to-go − b(s_t)) · ∇log π] — still unbiased. Centers signal around "better or worse than average from this state."

↓ critic

Layer 3: + Advantage Function

E[A(s_t,a_t) · ∇log π] where A = Q − V — if A is known exactly, this is the minimum-variance unbiased estimator.

The advantage function A(s,a) = Q(s,a) − V(s) is the ideal weight for each gradient term. It answers: "How much better is action a than the average action from state s?" Actions that are exactly average get zero gradient signal. Only relative quality drives learning. This is the most principled choice of baseline: b(s) = V(s).

One important clarification: even with the optimal baseline, the REINFORCE gradient estimator is still unbiased. "Minimum variance" doesn't mean zero variance — it means no other unbiased estimator in this family can have smaller variance. The gradient still wiggles from step to step; it's just as tightly clustered around the true gradient as possible given the trajectory samples. To actually eliminate variance, you'd need to compute E_τ[...] exactly — which requires knowing the full environment model, which is what we're trying to avoid. Variance reduction is about making the best use of finite samples; it's not magic.

Estimator	Bias	Variance	Requires
Total return R(τ)	Zero	Highest	Rollouts only
Reward-to-go R_t^to-go	Zero	Lower	Rollouts only
Reward-to-go − constant b	Zero	Lower still	Rollouts + b
Reward-to-go − V(s_t)	Zero	Much lower	Rollouts + V estimator
True advantage A(s,a)	Zero	Lowest	Full Q and V (oracle)
TD residual r + γV(s') − V(s)	Some (from V approx)	Very low	Rollouts + learned V

The bias-variance trade-off in gradient estimation: The TD residual (r + γV(s') − V(s)) introduces bias because V is approximated, not exact. But its variance is tiny — it only looks one step ahead. The reward-to-go is unbiased but looks d steps ahead (accumulating noise). Most practical algorithms (Actor-Critic, PPO, A3C) live somewhere in between, using multi-step TD estimates that balance the two.

Variance Reduction in Practice

Click "Run" to simulate 50 batches of 20 rollouts on a bandit problem (two actions, rewards N(3,4) and N(5,4)). Each bar shows the empirical variance of the gradient estimate across batches.

The GAE bridge. In practice, even the TD residual (1-step advantage r + γV(s') − V(s)) and the full reward-to-go (n-step advantage) are two extremes. Generalized Advantage Estimation (GAE, Schulman et al. 2016) interpolates them with a scalar λ ∈ [0,1]:

A^{GAE(γ, λ)}_t = ∑_l=0^∞ (γλ)^l δ_t+l

where δ_t = r_t + γV(s_t+1) − V(s_t) is the TD residual at step t. At λ = 0: GAE = δ_t (1-step TD, low variance but biased if V is wrong). At λ = 1: GAE = reward-to-go minus baseline (unbiased but high variance). λ ≈ 0.95 is the standard choice in PPO, A3C, and most modern implementations — it gives most of the variance reduction with little bias.

Why GAE works: Each additional step you look ahead reduces bias (because you rely less on the V approximation) but increases variance (because you accumulate more random reward noise). GAE trades off these forces geometrically: the (γλ)^l weight decays exponentially with look-ahead depth, down-weighting noisy far-future TD errors relative to the close, low-noise 1-step TD error.

To see why the TD residual introduces bias when V is approximate: suppose V(s) overestimates the true value. Then δ_t = r_t + γV(s_t+1) − V(s_t) is biased by the error in V. In contrast, the full reward-to-go (Monte Carlo return) uses actual observed rewards, not V estimates, so it's unbiased regardless of how bad V is. The catch: it requires waiting until the episode ends to compute, and accumulates d steps of random reward noise. GAE at λ = 0.95 with a reasonably good V (typically learned jointly) provides most of the variance reduction of the full TD approach with most of the unbiasedness of Monte Carlo — this is the Goldilocks sweet spot that makes modern deep RL practical.

The PPO paper (Schulman et al., 2017) uses γ = 0.99, λ = 0.95 as defaults and reports that GAE is responsible for a large fraction of PPO's sample efficiency improvement over vanilla REINFORCE. The advantage is that GAE doesn't require a separate hyperparameter search for how many steps to look ahead — λ smoothly interpolates between the two extremes and the optimal value is relatively stable across domains. Intuitively, λ = 0.95 means that TD errors more than log(0.05)/log(0.95) ≈ 58 steps in the future contribute less than 5% to the advantage estimate — so the effective look-ahead is roughly 58 steps under default settings. For tasks with rewards on longer timescales, increasing λ toward 1 helps, at the cost of slightly higher variance in the advantage estimate.

Which of the following is the minimum-variance unbiased gradient estimator (assuming the advantage is known exactly)?

Total return R(τ) Reward-to-go Advantage function A(s,a) = Q(s,a) − V(s)

Chapter 7: Implementation

Here is a complete, runnable implementation of REINFORCE with reward-to-go and a baseline. This is not pseudocode — every line does real work, and every decision is explained.

python
import numpy as np

####  Policy  ####################################################
# Softmax policy over K discrete actions. Parameters: theta[s, a].
# pi(a|s) = exp(theta[s,a]) / sum_a exp(theta[s,a])
# Input:  state s (int), theta (S x A array)
# Output: action probabilities (A-vector, sums to 1)

def softmax_policy(theta, s):
    logits = theta[s]                        # shape: (A,)
    logits -= logits.max()                   # stability: subtract max
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()    # normalize

def grad_log_pi(theta, s, a):
    """
    Gradient of log pi(a|s) w.r.t. theta[s, :].
    For softmax: grad[s, a'] = 1(a'=a) - pi(a'|s).
    All other rows of theta are zero gradient.
    """
    pi = softmax_policy(theta, s)
    g = np.zeros_like(theta)
    g[s] = -pi                               # -pi(a'|s) for all a'
    g[s, a] += 1.0                          # +1 for the taken action
    return g

####  Rollout  ###################################################
# Run one episode; return list of (s, a, r) tuples.

def rollout(env, theta, max_steps=200):
    trajectory = []
    s = env.reset()
    for _ in range(max_steps):
        pi = softmax_policy(theta, s)
        a = np.random.choice(len(pi), p=pi)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        if done: break
        s = s_next
    return trajectory

####  Rewards-to-go  #############################################

def rewards_to_go(rewards, gamma=0.99):
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        rtg.insert(0, running)
    return rtg

####  REINFORCE update  ##########################################
# Batch of m trajectories. Baseline = mean return across the batch.

def reinforce_update(env, theta, m=20, alpha=0.01, gamma=0.99):
    trajectories = [rollout(env, theta) for _ in range(m)]

    # 1. Compute reward-to-go for every step of every trajectory
    all_rtg = []
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        all_rtg.append(rewards_to_go(rewards, gamma))

    # 2. Baseline = mean rtg across ALL steps in batch
    all_vals = [v for rtg in all_rtg for v in rtg]
    baseline = np.mean(all_vals)            # constant baseline

    # 3. Accumulate gradient
    grad = np.zeros_like(theta)
    total_steps = 0
    for traj, rtg in zip(trajectories, all_rtg):
        for t, (s, a, _) in enumerate(traj):
            weight = (rtg[t] - baseline) * (gamma ** t)
            grad += weight * grad_log_pi(theta, s, a)
            total_steps += 1

    grad /= total_steps                      # normalize by steps

    # 4. Gradient ascent step
    theta += alpha * grad
    mean_return = np.mean([sum(r for _,_,r in traj) for traj in trajectories])
    return theta, mean_return

Key decisions in this implementation, and why each was made:

Decision	Why
Subtract max before softmax	Prevents exp() overflow. Doesn't change probabilities: e^(x-c)/(sum e^(y-c)) = e^x / sum e^y
Normalize by total_steps, not m	Trajectories vary in length. Per-step normalization avoids short-trajectory bias.
Constant baseline (mean rtg)	Simple and effective. A learned V(s) baseline would be better but requires a second model.
gamma^t discount factor per step	The γ^t−1 in the gradient formula accounts for the geometric down-weighting of later steps.
grad_log_pi = 1(a'=a) - π(a'\|s)	Exact analytical gradient of log-softmax. This is why stochastic policies work: we know this formula.

Data flow check: theta has shape (S, A). grad_log_pi returns a (S, A) array with nonzeros only in row s. For a policy over 4 states and 3 actions, theta is 12 parameters. A single gradient step touches all 12 via the accumulated grad from all rollout steps. After n_updates gradient ascent steps, theta[s, best_action] grows large and theta[s, other_actions] shrinks — making the softmax sharply peaked at the optimal action in each state. Concretely: if state 2 has best action 1, after convergence theta[2,1] ≈ +5 and theta[2,0] = theta[2,2] ≈ 0 (or negative), giving pi(a=1|s=2) = exp(5)/(exp(5)+1+1) ≈ 0.985. The policy is 98.5% confident in the right action. The 1.5% uncertainty is the exploration residual — the cost of using a stochastic policy.

For continuous action spaces, use a Gaussian policy parameterized by a mean vector μ_θ(s) and a fixed or learned standard deviation σ. The key formulas:

python
import numpy as np

####  Gaussian Policy for Continuous Actions  ####################
# Policy: pi(a|s) = N(mu_theta(s), sigma^2)
# Parameters theta map states to means (via a linear or neural-net model).
# Here: simple linear mean: mu(s) = theta @ phi(s)
# Input:  s = feature vector (d,), theta = (d,) linear weights
# Output: sampled action a (scalar), its log-prob

def gaussian_policy(theta, s, sigma=1.0):
    mu = theta @ s                                # linear mean (scalar)
    a = np.random.normal(mu, sigma)              # sample action
    return a, mu

def grad_log_pi_gaussian(theta, s, a, sigma=1.0):
    """
    Gradient of log N(a; mu_theta(s), sigma^2) w.r.t. theta.
    log pi = -0.5 * ((a - mu)/sigma)^2 - log(sigma*sqrt(2*pi))
    d/dtheta log pi = (a - mu) / sigma^2 * d(mu)/d(theta)
                    = (a - mu) / sigma^2 * s       (for linear mu = theta @ s)
    """
    mu = theta @ s
    return (a - mu) / (sigma ** 2) * s           # shape: (d,)

# REINFORCE update for Gaussian policy:
# For each (s, a, rtg) triple in the batch:
for s, a, rtg in batch_steps:
    score = grad_log_pi_gaussian(theta, s, a)
    grad += score * rtg

theta += alpha * grad / len(batch_steps)

Notice how the score function (a − μ)/σ² × s encodes exactly the geometry from Chapter 2's widget: if the sampled action a is above the mean μ, the score is positive, and the gradient step increases μ (making the policy output higher actions). If a is below μ, the score is negative, decreasing μ. The reward weight (rtg) scales how strongly we push: good-return samples drive larger mean updates than bad-return samples.

Debugging your implementation: If REINFORCE is not learning, check these five things in order. First, verify that your log-probability gradients are correct by comparing ∇logπ to finite-difference estimates of logπ. Second, verify that your rewards-to-go are computed correctly — print the first 5 values and manually verify with the reward sequence. Third, check that your gradient has the right sign: positive reward should push θ in the direction that increases the probability of the taken action. Fourth, check your normalization: if rewards are in the thousands (e.g., Atari scores), normalize them to mean 0, std 1 before computing gradients. Fifth, check your learning rate: too large (oscillation), too small (no progress). The most common bug is getting the gradient sign wrong by computing logπ(a|s) as negative log-probability instead of log-probability — easy to do if you mistake a negative log-likelihood loss for a log-probability.

In the softmax policy, the gradient of log π(a|s) w.r.t. θ[s, a'] equals:

1(a'=a) − π(a'|s) — one minus the probability for the taken action, and minus the probability for others Always 1 for the taken action and 0 for others The reward received at that step

Chapter 8: Policy Gradient vs. Value-Based Methods

You now know two fundamentally different paradigms for solving MDPs: value-based methods (Q-learning, dynamic programming from Chapters 7-8) and policy gradient methods (this chapter). They solve the same problem differently. Understanding when each excels is essential.

Value-based methods learn Q(s,a) or V(s) first, then derive a policy implicitly as π(s) = argmax_a Q(s,a). The gradient (in the RL sense) flows through the Bellman operator — temporal consistency, not likelihood ratios.

Policy gradient methods directly optimize the policy parameters θ to maximize expected return. No value function is required (though it helps as a baseline). The gradient flows through the log-probability of actions.

Criterion	Value-Based (Q-Learning)	Policy Gradient (REINFORCE)
What is learned?	Q(s,a) table or network	π_θ directly
Action space	Discrete (argmax is tractable)	Discrete or continuous
Policy type	Deterministic (greedy)	Stochastic (by design)
Sample efficiency	High (off-policy, replay buffer)	Lower (on-policy, no replay)
Gradient quality	Biased (bootstrapping)	Unbiased (but high variance)
Convergence	Guaranteed for tabular Q-learning	Local optimum (gradient ascent)
Continuous actions	Hard (maximize over continuous a)	Natural (sample from π, differentiate log π)
Stochastic policies	Awkward (requires ε-greedy hack)	Native: π_θ is stochastic by design

When continuous actions matter: Consider a robotic arm with joint torques ∈ ℜ. Q-learning needs argmax over the continuous torque space — a hard inner optimization loop at every step. Policy gradient is immediate: the policy outputs a Gaussian over torques, you sample, you compute the log-probability score, you have a gradient. This is why policy gradients dominate robotics and continuous control.

The real power emerges in actor-critic methods (Chapter 13) that combine both: a policy (actor) trained by policy gradient, and a value estimator (critic) used to compute the advantage. You get the natural handling of continuous actions from policy gradient, plus the low-variance gradient signal from the critic's advantage estimate.

Exploration vs. exploitation in each paradigm: Q-learning must artificially inject exploration via ε-greedy or Boltzmann sampling — the learned policy is deterministic, so it would never explore on its own. Policy gradient policies are stochastic by design. High entropy π_θ means high exploration; as training converges, the policy concentrates probability on good actions. Exploration is automatic, though sometimes supplemented by entropy regularization: max E[R] + λ H(π).

Stochastic vs. deterministic policies: a deeper look. Value-based methods like Q-learning produce deterministic greedy policies: given state s, always take argmax_a Q(s,a). This determinism means Q-learning policies can't represent mixed strategies — behaviors where randomness is genuinely optimal (like bluffing in poker, or breaking symmetry in multi-agent games). Policy gradient methods naturally represent such strategies: a softmax over action logits can assign probability 0.5/0.3/0.2 to three actions, and gradient ascent can maintain this mixed strategy indefinitely. In multi-agent settings, many Nash equilibria are mixed strategies; policy gradient can find them while greedy Q-learning cannot.

The on-policy constraint. REINFORCE is on-policy: every gradient estimate uses trajectories sampled from the current policy. Once we update θ, the old trajectories are stale and must be discarded. This is expensive: you need fresh rollouts for every gradient step. Q-learning, by contrast, is off-policy: experiences from any behavior policy can be stored in a replay buffer and reused many times. This is a massive efficiency advantage. The off-policy property comes from Q-learning's use of the Bellman equation, which is policy-independent. REINFORCE's estimator explicitly samples from π_θ and has no such property.

The off-policy vs. on-policy gap is one reason deep Q-networks (DQN) were historically more sample-efficient than REINFORCE. Modern algorithms like PPO partially bridge the gap using importance sampling to reuse old data a few times before discarding, though the gap never fully closes for pure policy gradient methods.

Local optima and the multimodal landscape. One important limitation of policy gradient methods is that gradient ascent can converge to local optima. The policy performance surface U(θ) is generally non-convex for expressive policy classes. Imagine two distinct strategies for solving a maze: one fast path (high reward) and one slow safe path (moderate reward). If the policy starts near the slow-path optimum, gradient ascent will converge there and never discover the fast path. Q-learning can also suffer from this, but the Bellman-operator structure gives it stronger convergence guarantees in the tabular setting. For policy gradients, exploration (high policy entropy) is the primary mechanism for avoiding local optima — keeping the policy spread enough that it occasionally stumbles across better strategies before committing.

Why variance matters for convergence rate, not just noise. It's easy to think high gradient variance is merely an annoyance — the mean is still correct, so we'll get there eventually. But high variance has a deeper consequence: it forces us to use smaller learning rates. A gradient estimate with variance σ²_g is safe to step by α ≈ 1/(σ_g √T) in the stochastic gradient ascent analysis, meaning tighter gradient estimates allow larger steps and faster convergence. Variance reduction is not just cosmetic — it's directly coupled to how fast the algorithm learns.

Why are policy gradient methods preferred over Q-learning for continuous action spaces?

Because policy gradient always converges to the global optimum Because policy gradient has zero variance Because Q-learning requires argmax over continuous actions (hard inner optimization), while policy gradient directly differentiates through log π without any inner loop

Chapter 9: Connections

Policy gradient estimation is not a dead end — it's a foundation. Everything in chapters 12, 13, and 15 builds directly on the estimator derived here.

What Chapter 11 gave us:

• ∇U = E[R(τ) ∇log p_θ(τ)]
• Score = ∑_t ∇log π_θ(a_t|s_t)
• No model needed (transitions cancel)
• Reward-to-go (causality)
• Baseline subtraction (unbiased variance reduction)
• Advantage function (optimal baseline)

Where these ideas go:

• Ch 12: How to use ∇U for optimization (natural gradient, TRPO, PPO)
• Ch 13: Actor-Critic: estimating A(s,a) with a learned critic
• Ch 15: Exploration via entropy bonuses and curiosity
• Ch 17: Model-free methods (DDPG, SAC) for continuous control
• Ch 18: Imitation learning (behavioral cloning via ∇logπ)

The estimator ∇U = E[A(s,a) ∇logπ] is not just REINFORCE. It's the gradient of every modern on-policy deep RL algorithm. PPO uses it. A3C uses it. SAC uses it (with entropy). The difference between algorithms is primarily in how they estimate A(s,a) and how they constrain the gradient step size to avoid overshooting.

Summary of the four estimators and when to use them. Finite difference: use when n ≤ 20 and you don't have policy gradients (e.g., black-box simulators, non-differentiable policies). Regression gradient: use when n is moderate (20–500) and random direction probing is feasible. Likelihood ratio (REINFORCE): use when π_θ is stochastic and differentiable — this is the default for neural network policies in episodic environments. REINFORCE with variance reduction: always use reward-to-go (free) and a baseline (small cost, large gain). These are the building blocks; Chapter 12 shows how to use the resulting gradients more efficiently, and Chapter 13 shows how to get better advantage estimates.

A historical note on naming. The term "policy gradient" is sometimes used loosely to mean any gradient-based policy optimization, including natural policy gradient, actor-critic, and PPO. More precisely, the "policy gradient theorem" (Sutton et al., 2000) refers specifically to the identity ∇U = E[Q(s,a) ∇logπ], which this chapter derives as a special case (with reward-to-go approximating Q). The algorithms that directly implement this theorem (REINFORCE, REINFORCE with baseline) are sometimes called "vanilla policy gradient" to distinguish them from the actor-critic variants that additionally learn a value function.

The Williams (1992) REINFORCE paper derived the likelihood ratio estimator for recurrent neural network policies applied to sequence generation tasks — essentially what we'd call language model RL today. The same estimator reappeared in Sutton et al. (2000) as the "Policy Gradient Theorem" in a more general formulation that works for continuing tasks (no episode boundaries), not just episodic ones. The key insight Sutton et al. added: the gradient theorem holds for the average reward criterion as well, and the "baseline" generalizes to the value function under the stationary distribution. These theoretical results are why we can confidently use policy gradient in all the diverse settings we care about — episodic RL, average-reward RL, POMDP RL, and even fine-tuning language models — the math holds in each case.

What "model-free" really means. The likelihood ratio gradient is called model-free because it doesn't require knowing T(s'|s,a). But there's a subtlety: it still requires the policy to be differentiable (so ∇logπ exists) and the reward to be observable (so R(τ) can be computed). If the reward function is itself non-differentiable or non-observable (e.g., human judgments in RLHF), you still need a reward model. "Model-free" specifically means free from the environment dynamics model, not free from all structure. Evolution strategies (ES) and finite differences are doubly model-free: they don't need dynamics or policy gradients. This extra freedom comes at the cost of much worse sample efficiency, since they can't exploit the policy's internal structure.

Regression gradient as a bridge: Section 11.2 in the text (regression gradient) deserves one more note. It uses random perturbations Δθ and utility changes ΔU to fit a linear model ∇U ≈ ΔΘ⁺ ΔU. This is closely related to evolution strategies (ES) — the difference is that ES typically uses a zero-mean isotropic Gaussian for perturbations and weights them by fitness. For certain problem structures (non-differentiable reward, sparse gradients), regression/ES outperforms likelihood ratio methods despite being black-box.

There are also connections to the theory of score-based generative models. The score function ∇_x log p(x) is the central quantity in diffusion models — it's the same mathematical object (gradient of a log-probability) playing a different role. Policy gradient is fundamentally about shaping a distribution (the policy) to put more mass on high-reward regions; diffusion models shape a distribution to match training data. Same math, different application.

The policy gradient in RLHF. Reinforcement learning from human feedback — the training procedure behind ChatGPT, Claude, and similar models — uses the policy gradient estimator from this chapter. The language model is the policy (π_θ(next_token | prompt + previous_tokens)). The reward signal comes from a learned reward model trained on human preference comparisons. The REINFORCE estimator (or its PPO variant) then updates θ to maximize expected reward. The score function ∇logπ(token|context) is computed exactly as in Chapter 2 — the only change is that "state" is a token sequence and "action" is the next token. Policy gradient scales to billion-parameter models and trillion-token action spaces precisely because the likelihood ratio trick requires only ∇logπ, which autograd computes efficiently through any differentiable policy architecture.

Connections to variational inference. Policy gradient also appears in variational inference, where we want to fit an approximate posterior q_θ(z|x) ≈ p(z|x). The ELBO gradient ∇_θ E_{z∼q_θ}[log p(x,z) − log q_θ(z|x)] faces the same challenge: we can't differentiate through the sampling step. The score function gradient estimator (also called REINFORCE in this context) applies the same log-derivative trick: ∇_θE_{z∼q_θ}[f(z)] = E_{z∼q_θ}[f(z) ∇_θ log q_θ(z)]. The reparameterization trick (z = g(θ, ε) with ε ∼ p(ε)) is an alternative that often gives lower variance, but it requires reparameterizable distributions — Gaussians yes, discrete distributions no. For discrete latent variables, the score function estimator is the only option, which is why it remains relevant in modern VAE and diffusion model training.

Limitation	How later chapters address it
High variance makes learning slow	Natural gradient (Ch 12): scale the step by the Fisher information matrix, taking larger steps where the distribution changes slowly
Need a good advantage estimate	Actor-Critic (Ch 13): train a separate neural network V_w(s) as critic, use TD residual as advantage proxy
Large steps can destroy the policy	TRPO/PPO (Ch 12): constrain the KL divergence between old and new policy per update
On-policy: can't reuse old data	Importance sampling: reweight old trajectories by π_θ/π_θold (Ch 12)
Poor exploration near deterministic policies	Entropy bonus (Ch 15): add λ H(π) to the objective, penalizing overconfident policies

Common implementation pitfalls and how to avoid them:

Pitfall	Symptom	Fix
Learning rate too large	Policy collapses to one action, then diverges	Start with α ≈ 1e-3 to 1e-4; decay over time
Baseline not normalized	Gradient magnitudes vary wildly across environments with different reward scales	Normalize rewards to mean 0, std 1 per batch
Batch size too small	Gradient estimate is noisy; policy oscillates without converging	Use 20-100+ trajectories per update, not just 1
Forgetting the discount in the gradient	Long horizons give too much weight to late-episode actions	Include γ^t−1 in the per-step weight
Using total return instead of RTG	Slow learning, especially in long episodes	Always use reward-to-go; it's free variance reduction
Mixing on-policy and off-policy data	Gradient estimate is biased; policy may degrade	Discard all trajectories after each update, or use IS weights
Not clipping gradient norms	Occasional catastrophic policy update from a high-variance batch	torch.nn.utils.clip_grad_norm_(params, max_norm=0.5)

For deep RL with neural network policies (PyTorch example): learning rate 3e-4 with Adam optimizer, batch of 2048 timesteps (~20 episodes of length ~100), normalize advantages to mean 0 std 1 per batch, clip gradient norms to 0.5, γ = 0.99, λ = 0.95. These are the PPO defaults and work well as starting points for most continuous control tasks. You can always tighten from here.

The single most impactful fix in practice: Normalize returns. Before computing gradients, subtract the batch mean and divide by the batch standard deviation of the rewards-to-go. This makes the gradient scale invariant to reward magnitudes — you can use the same learning rate whether rewards are in [0,1] or [0,1000]. Almost every practical REINFORCE implementation does this silently.

"The policy gradient theorem [is] one of the most important results in reinforcement learning theory."
— Sutton & Barto, Reinforcement Learning: An Introduction (2018)

Which aspect of REINFORCE does PPO specifically address?

The on-policy requirement (PPO allows off-policy training) The need for a baseline (PPO removes baselines) Overshooting: large gradient steps can catastrophically change the policy. PPO clips the objective to limit per-update policy change.

Related lessons on Parminces:

← Ch 10: Policy Search — Cross-entropy, CEM, black-box methods
→ Ch 12: Policy Gradient Optimization — Natural gradient, TRPO, PPO
→ Ch 13: Actor-Critic — Learning the advantage with a critic network