How to measure the steepness of the performance landscape — and climb it — when the only tool you have is a bag of noisy rollouts.
Imagine you are trying to tune a robot arm. The arm has a policy — a function that maps sensor readings to motor commands — controlled by a vector of parameters θ. You run the robot, score it, nudge θ a little, run it again. You're doing gradient ascent on the utility U(θ) = E[total reward].
There's one problem. You can't write down U(θ) in closed form. It's an expectation over all the random trajectories the robot might take — over random sensor noise, random actuator jitter, random environment dynamics. You never see U directly. You only see samples: one specific run that produced one specific reward.
To appreciate how nontrivial this is, compare with supervised learning. There, you minimize L(θ) = (1/N) ∑ ℓ(fθ(xi), yi). The gradient ∇θL is computed by backpropagation through a deterministic computational graph you wrote yourself. In RL, U(θ) = Eτ∼pθ[R(τ)] involves a black-box environment. The computational graph of the world is unknown — you can only query it via rollouts. Finite difference, regression gradient, and likelihood ratio are all ways to estimate ∇U without the environment's source code. Each trades off sample cost, bias, and variance differently. Choosing among them is the first design decision when building a policy gradient algorithm.
Chapter 10 showed one response: ignore the gradient entirely. Cross-entropy, evolution strategies — treat U as a black box and search policy space heuristically. That works, but it's wasteful. If we can estimate ∇U, we can point our steps in the steepest uphill direction and converge far faster.
This chapter develops four approaches, in increasing sophistication. We'll start with the brute-force method (finite differences), then derive the elegant likelihood-ratio estimator, then show how to dramatically tighten it with variance reduction. The most polished version becomes REINFORCE — the algorithm that launched the modern policy gradient era.
The methods differ in what they assume about the environment. Finite differences and regression gradient need no assumptions at all — they treat the environment as a black box. The likelihood-ratio method assumes the policy is stochastic and differentiable w.r.t. θ (so we can compute ∇logπ), but makes no assumptions about the transition dynamics. This is why likelihood-ratio methods extend naturally to POMDPs (partial observability, Chapter 19-22): the agent doesn't need to model how its observations arise from hidden states. As long as the policy outputs action probabilities from observations, ∇logπ exists and the estimator applies unchanged.
One clarifying note on notation: throughout the book and this lesson, τ denotes a trajectory. Different papers use different symbols: some write h (history), some write ξ (Greek xi), some write s1:Ta1:T. The content is always the same: an ordered sequence of states and actions sampled by rolling out a policy in the environment. The book's notation pθ(τ) for trajectory probability is consistent with most modern RL literature.
Here's a quick way to build intuition for the whole chapter. Imagine U(θ) as a hilly landscape, and you're blindfolded at some location (θ) on it. You don't know the map. All you can do is take steps and measure your altitude afterwards. Finite difference says: take a tiny step north, record your altitude; come back, take a tiny step east, record altitude; repeat for each direction. Expensive but reliable for few directions. Regression gradient says: take many random steps in all directions at once, fit a plane to all the altitude measurements — the normal of the plane points uphill. The likelihood ratio trick says: you don't need to take any off-center steps at all. Just stay where you are, make random decisions according to your current policy, and use the score function (how "surprising" each decision was) weighted by the reward you got. Good surprises point uphill. This is the key insight the chapter builds toward: you can estimate the hill's gradient without ever leaving your current position.
A 2-parameter policy πθ parameterized by (θ1, θ2). The color shows U(θ) (warm = high utility). We can't compute this surface analytically — each colored cell requires many rollouts. The challenge: navigate toward the peak without seeing the surface explicitly. Click to place your current θ and see what gradient estimation methods can do.
Click anywhere on the surface to set your current θ. The warm circle shows current location.
The simplest idea: compute the gradient numerically by perturbing one parameter at a time and observing how U changes. For the i-th component of θ:
where ei is the i-th standard basis vector (all zeros except a 1 in position i) and δ is a small step size. To estimate the full gradient for a policy with n parameters, you need n+1 evaluations of U: one at θ and one perturbed evaluation per parameter.
The trouble is that U(θ) itself is an expectation — you can't evaluate it exactly. You estimate it with m rollouts. So the total rollout budget is m(n+1) just to get one gradient estimate. For a neural network with 10,000 parameters that's 10,001 evaluation packs.
The centered (two-sided) version is more accurate: perturb in both directions and take the symmetric difference ( U(θ + δei) − U(θ − δei) ) / 2δ. This cancels the leading-order error, so the bias drops from O(δ) to O(δ2). The cost doubles (2n evaluations instead of n+1), but the gain is worth it in smooth problems.
The Taylor expansion makes the bias explicit. Forward difference: U(θ + δei) = U(θ) + δ ∂U/∂θi + (δ2/2) ∂2U/∂θi2 + O(δ3). Dividing by δ and rearranging: (U(θ + δei) − U(θ))/δ = ∂U/∂θi + (δ/2) ∂2U/∂θi2 + O(δ2). The error is δ/2 times the second derivative — proportional to δ and to the curvature. The centered version: (U(θ+δei) − U(θ−δei)) / 2δ = ∂U/∂θi + O(δ2), because the second-derivative terms cancel. The curvature bias vanishes to leading order.
python import numpy as np def finite_diff_gradient(policy, theta, delta, n_rollouts): """Estimate gradient of U(theta) via finite differences.""" n = len(theta) grad = np.zeros(n) U0 = estimate_utility(policy, theta, n_rollouts) # baseline for i in range(n): theta_plus = theta.copy() theta_plus[i] += delta U_plus = estimate_utility(policy, theta_plus, n_rollouts) grad[i] = (U_plus - U0) / delta # forward difference return grad # cost: (n+1) * n_rollouts rollouts total def estimate_utility(policy, theta, n_rollouts): """Average return over m rollouts.""" returns = [run_episode(policy, theta) for _ in range(n_rollouts)] return np.mean(returns)
For f(x) = x² at x = 2, true derivative = 4. Drag the slider to see how δ affects the estimate. Too big: curve curvature creates bias. Too small: estimation noise (simulated here) dominates. There's a sweet spot.
For neural network policies (πθ(a|s) implemented as a network with PyTorch/JAX), the score function ∇θ log π(a|s) is computed by backpropagating through the network from the log-probability output. You don't compute it analytically — autograd does it. The workflow is: (1) run the forward pass to get action logits, (2) compute log π(a|s) for the taken action, (3) multiply by the rtg weight, (4) call .backward(). PyTorch accumulates gradients. After processing all steps in the batch, call optimizer.step(). One subtlety: you want gradient ascent, not descent. Either negate the loss before calling backward(), or use -logπ × rtg as the loss and standard gradient descent. Both are equivalent. Most RL libraries define the REINFORCE loss as -E[logπ × rtg] for this reason.
A useful sanity check when implementing any gradient estimator: run your code on a simple convex function where you know the exact gradient (e.g., U(θ) = −θ², gradient = −2θ). Compare your estimator's output to the exact gradient. If they agree within statistical noise, your implementation is probably correct. This gradient checking pattern is standard practice in RL codebases before deploying on real problems.
Kochenderfer also presents a smarter variant: regression gradient. Instead of probing one parameter at a time, draw m random perturbations Δθ(1), ..., Δθ(m) (each perturbing all parameters simultaneously), collect the utility change ΔU(i) for each, and fit a linear model:
where ΔΘ is the m×n matrix of perturbations and ΔΘ+ is its Moore-Penrose pseudoinverse. Each random perturbation probes all parameter directions at once, so the gradient estimate becomes efficient: roughly 2n perturbations suffice for an n-parameter policy, compared to n+1 for finite differences. The recommended rule of thumb from the book: use 2n perturbations with a perturbation magnitude of δ ≈ 1% of the parameter norm.
python import numpy as np def regression_gradient(policy, theta, delta, m_perturb, n_rollouts): """ Regression gradient: random perturbations + least-squares fit. theta: current parameters (n,) delta: perturbation magnitude (radius of hypersphere) m_perturb: number of random perturbations (use 2*n rule of thumb) """ n = len(theta) U0 = estimate_utility(policy, theta, n_rollouts) # Sample m random perturbations from unit sphere, scale by delta Delta_Theta = [] Delta_U = [] for _ in range(m_perturb): direction = np.random.randn(n) direction /= np.linalg.norm(direction) # unit sphere d_theta = delta * direction d_U = estimate_utility(policy, theta + d_theta, n_rollouts) - U0 Delta_Theta.append(d_theta) Delta_U.append(d_U) Delta_Theta = np.array(Delta_Theta) # shape: (m, n) Delta_U = np.array(Delta_U) # shape: (m,) # Least squares: find grad that best explains Delta_U = Delta_Theta @ grad grad, _, _, _ = np.linalg.lstsq(Delta_Theta, Delta_U, rcond=None) return grad # cost: m * n_rollouts rollouts total (vs (n+1)*n_rollouts)
Numerical example. Consider a 3-parameter policy (θ = [0.5, −0.2, 1.1]) in a simple environment. We want ∂U/∂θ2 using finite differences with δ = 0.01 and m = 50 rollouts each:
Step 1: Estimate U(θ) ≈ 4.23 (mean of 50 rollouts).
Step 2: Perturb θ2: θ + 0.01 e2 = [0.5, −0.19, 1.1]. Estimate U(perturbed) ≈ 4.41 (50 rollouts).
Step 3: ∂U/∂θ2 ≈ (4.41 − 4.23) / 0.01 = 18.0.
But wait — with only 50 rollouts, each estimate has standard error σ/√50 ≈ 0.14 (assuming reward standard deviation ~1). The difference (4.41 − 4.23) = 0.18 might be 100% noise. The finite difference estimate's standard error is 0.14√2 / 0.01 ≈ 20 — larger than the signal. This is exactly the regime where likelihood-ratio methods shine: they compute ∇logπ analytically, avoiding this amplification of rollout noise.
Finite differences treats U as a black box. But U has structure we can exploit: it's an integral involving the policy πθ, which we designed. We know its derivative analytically. The log-derivative trick is how we harvest that knowledge.
Start from the definition. A trajectory τ is a sequence of states and actions: (s1, a1, s2, a2, ..., sd, ad). The expected utility is:
where pθ(τ) is the probability of trajectory τ under policy θ, and R(τ) is its total discounted reward. Moving the gradient inside the integral:
We can't sample from ∇θp because it's not a probability distribution (it can be negative!). Here's the trick: multiply and divide by pθ(τ):
So the final gradient estimator is:
Now what is ∇θ log pθ(τ)? Expand the trajectory probability:
Take the log: it turns the product into a sum. Then take the gradient w.r.t. θ. The initial state distribution p(s1) doesn't depend on θ. The transitions T(s'|s,a) don't depend on θ (we're model-free). They vanish. Only the policy terms survive:
There's a deeper reason the transition terms cancel that's worth dwelling on. The Markov property tells us that state st+1 is determined entirely by (st, at) and the environment's randomness — not by θ. So when we take ∇θ, the transition T(st+1 | st, at) contributes zero. Only πθ(at | st) carries θ dependence. If the environment were not Markov (say, a complicated simulator with internal state), the transitions might depend on θ indirectly (through action history affecting hidden state), and this clean cancellation would break down. The Markov assumption is load-bearing for the likelihood ratio trick.
Formally: log pθ(τ) = log p(s1) + ∑t [log T(st+1|st,at) + log πθ(at|st)]. Taking ∇θ: the log p(s1) term vanishes (no θ). Each log T term vanishes (no θ). What remains is exactly ∑t ∇θ log πθ(at|st). The derivation is three lines; the implication is enormous — the entire field of model-free RL depends on this cancellation holding.
Each term ∇θ log πθ(at | st) is called the score function at time t. It points in the direction in parameter space that most increases the log-probability of taking action at in state st. When multiplied by R(τ), high-reward trajectories push θ to make those actions more likely, and low-reward trajectories push in the opposite direction.
Concrete score functions for two common policy families:
Gaussian policy (continuous actions):
πθ(a|s) = N(μθ(s), σ2)
log π = −(a−μ)2 / (2σ2) − log σ + const
∇μ log π(a|s) = (a − μ) / σ2
If θ parameterizes μ through a neural net, chain rule gives ∇θ log π = (a − μ)/σ2 · ∇θ μθ(s). The score vector points from the mean toward the sampled action. If the action was good, μ gets pulled toward it.
Softmax policy (discrete actions):
πθ(a|s) = eθsa / ∑a' eθsa'
log π(a|s) = θsa − log ∑ eθsa'
∇θsa' log π(a|s) = 1(a'=a) − π(a'|s)
The gradient w.r.t. the taken action's parameter is 1 − π (positive: increase it). For all other actions: −π (negative: decrease them). The result: taken action becomes more likely, others less likely, by exactly the probability mass that normalization requires.
Both score functions have a deep property: Ea∼π[∇ log π(a|s)] = 0. The expected score under the policy itself is always zero. This is why baselines (Chapter 5) work: you can subtract any state-dependent quantity from R and the expected gradient change is zero.
What happens at convergence? When the policy has converged to the optimal π*, the gradient should be zero (we're at a maximum). For the softmax: at the optimum π(best_action|s) → 1. The score for the best action is 1 − 1 = 0. All other actions have probability 0, so their contribution is also 0 × (−1) = 0. The entire gradient is zero. Gradient ascent correctly stops improving. For Gaussian policies, convergence means the mean μ has moved to the optimal action and the score (a − μ)/σ2 is centered around zero, so its expectation is also zero. The gradient signal dies out as the policy peaks.
This is not a bug — it's the correct behavior. A peaked policy that's fully converged needs no further updates. The challenge is premature convergence: the policy might peak on a locally optimal but globally suboptimal action. Entropy regularization (adding H(π) to the objective) counteracts this by penalizing overconfident policies, keeping σ from shrinking to zero or the softmax from becoming too sharp too soon.
Gaussian policy π(a|s) = N(μ, σ²). The score ∇μ log π = (a − μ)/σ² points from the mean toward the sampled action (red dot). Drag μ and σ to see how the score arrow changes for the fixed sample a = 1.5.
Williams (1992) combined the likelihood ratio gradient into a clean algorithm: REINFORCE. Collect trajectories from the current policy, estimate the gradient from each, take a gradient ascent step, repeat. The full gradient estimator is:
where m trajectories τ(1), ..., τ(m) are sampled from the current policy, R(τ(i)) is the total return of trajectory i, and the inner sum is the summed score over all timesteps. After computing the gradient estimate, we take a gradient ascent step: θ ← θ + α ∇θ U(θ).
Worked example from scratch. Suppose you have a 1-step MDP: one state, two actions (Left = reward −1, Right = reward +5). Your policy is a softmax parameterized by a single scalar θ: π(Right) = eθ/(eθ + 1) = σ(θ). Currently θ = 0, so π(Right) = 0.5.
You run m = 3 episodes. Suppose the outcomes are: episode 1 goes Right (reward +5), episode 2 goes Left (reward −1), episode 3 goes Right (reward +5). First, compute each score:
Now sum R × score over the three episodes:
With learning rate α = 0.1: θ ← 0 + 0.1 × 1.83 = 0.183. Now π(Right) = σ(0.183) ≈ 0.546. The policy has shifted slightly toward Right. After many such updates, θ grows large and π(Right) → 1.0.
Below is a live simulation of exactly this MDP. The agent starts at θ = 0 and you can watch REINFORCE drive it toward the optimal policy.
The policy is a softmax over a single parameter: π(Right | S) = σ(θ) = eθ / (1 + eθ). The gradient is:
Run episodes, watch the gradient accumulate, and see θ climb toward the optimal policy. Click "Auto-run" to let REINFORCE converge on its own.
Two-state MDP. Right = reward +5, Left = reward −1. Policy: π(Right) = σ(θ). Watch θ evolve as REINFORCE collects rollouts and updates.
An aside on deterministic policies. The likelihood ratio trick requires ∇θ log πθ(a|s) to exist. For a deterministic policy, π(a|s) is either 1 (for the greedy action) or 0 (for others). The log of 0 is undefined, and the gradient doesn't exist in the usual sense. This is why stochastic policies are essential for REINFORCE — without them, the log-derivative trick breaks. Deterministic policy gradient (DPG) methods (Ch 17) handle this via a different trick: they differentiate the deterministic action output directly through the Q-function, bypassing the log-probability entirely. But that requires a differentiable environment model or a differentiable Q approximation.
The practical takeaway: always use a stochastic policy for REINFORCE. For discrete action spaces, a softmax with a temperature parameter τ controls how peaked the distribution is: π(a|s) ∝ exp(θsa/τ). High temperature (large τ) = uniform/exploratory; low temperature = nearly deterministic. You can anneal τ from high to low over training to transition from exploration to exploitation. For continuous action spaces, a Gaussian policy π(a|s) = N(μθ(s), σθ2(s)) maintains stochasticity throughout training, with the learned σ naturally decreasing as the policy converges — though you may need a minimum σ floor to prevent premature convergence to a degenerate distribution.
Notice that at θ = 0, the policy is 50/50. As REINFORCE accumulates gradient signal from high-reward Right trajectories, θ grows. The policy becomes increasingly biased toward Right. Eventually π(Right) → ~1.0 and the updates plateau — the policy has converged.
Note that the gradient estimate from REINFORCE, even when computed correctly, is extremely high-variance in the beginning of training when the policy is nearly uniform. Every episode has nearly equal probability, so the gradient weights are nearly equal for all trajectories, making the net gradient close to zero. As the policy concentrates (high θ magnitude), the score function becomes larger for the preferred actions and the gradient signal clarifies. This is why REINFORCE often shows very slow initial progress followed by a "breakthrough" when the policy suddenly tips past 50% toward the optimal action — a phase transition in the gradient landscape.
The convergence rate depends on learning rate and batch size. A large learning rate (α = 0.5) converges faster but can overshoot — the policy oscillates between "mostly Right" and "all Right" before stabilizing. A small learning rate (α = 0.01) is stable but slow. Batch size matters too: with 1 episode per update, the gradient estimate is noisy and training is erratic; with 20 episodes, the estimate is smoother and the θ trajectory is much more monotone. This is the same bias-variance trade-off that appears everywhere in statistics, now applied to the gradient estimator.
One subtlety: in this 1-step MDP, the Left action always gives −1. If early rollouts happen to produce many Right episodes by chance (giving θ a head start), convergence is fast. If they produce many Left episodes (unlucky start), θ dips negative before climbing. The initial policy's randomness is the primary driver of early training variance — which is exactly why the baseline (subtracting the mean return ≈ 2.0 from the batch) stabilizes the gradient signal significantly.
REINFORCE as stated multiplies every step's score by the entire trajectory return R(τ). But can action a2 (taken at time 2) affect reward r1 (collected at time 1)? No. Causality forbids it. Past rewards are already fixed before action a2 is chosen.
This means we're polluting the gradient signal for each step t with irrelevant noise from rewards before t. The reward-to-go (also called the "causality trick") substitutes the total return with only the future rewards, from step t onward:
The full policy gradient becomes:
The benefit compounds across long trajectories. For a trajectory of length d, step t uses d−t+1 future reward terms instead of d total terms. Early steps see the largest reduction. Step d uses only rd; step 1 uses all d rewards. The average number of terms per step drops from d to (d+1)/2.
In practice, for long-horizon problems like Atari games (episode length ~1000 steps) or robotic locomotion (500+ steps), reward-to-go can reduce variance by a factor of 5-10x compared to total return. This translates directly to needing 5-10x fewer samples to achieve the same gradient signal quality — a massive practical advantage that costs nothing in terms of bias or additional computation. Every policy gradient implementation that doesn't use reward-to-go is leaving free variance reduction on the table.
Another subtlety: the discount γ in the reward-to-go has a dual role. Mathematically, it down-weights future rewards to prioritize immediate returns, reflecting time preference. Practically, it also prevents the reward-to-go from blowing up in continuing (non-episodic) tasks where the sum could otherwise be infinite. For γ < 1, the geometric series ∑t'=t∞ γt'−t rt' converges as long as rewards are bounded. The choice γ = 0.99 (vs 0.95 or 1.0) can significantly affect learned behavior: a lower γ makes the agent myopic and greedy; γ = 1.0 requires finite-horizon tasks to guarantee convergence of the sum.
One implementation subtlety: when computing reward-to-go across a batch of trajectories with different lengths, be careful to truncate correctly. Each trajectory has its own rtg array; they should not be mixed or padded incorrectly. A common bug is computing the running sum across trajectory boundaries, which mixes future rewards from one trajectory into past steps of another. Always maintain a separate rtg list per trajectory and concatenate after.
With a discount factor γ < 1, the reward-to-go also has a natural "effective horizon" of 1/(1−γ) steps — the geometric series converges to a finite value. For γ = 0.99, the effective horizon is ~100 steps: rewards more than 100 steps in the future are down-weighted by more than e−1. This means in very long episodes (1000+ steps), the policy is effectively optimized for the near-term horizon, not the true long-run return. Choosing γ is thus a design decision about how far ahead the agent should plan, not just a mathematical convenience.
Total return (noisy):
Step 3 gets multiplied by:
r1 + r2 + r3 + r4 + r5
↑ r1, r2 are noise — action at step 3 can't change them
Reward-to-go (cleaner):
Step 3 gets multiplied by:
r3 + γr4 + γ²r5
↑ Only rewards the action can actually influence
python def compute_rewards_to_go(rewards, gamma=0.99): """ Input: rewards = [r1, r2, ..., rd] from a single trajectory Output: rtg[t] = gamma^0 * r_t + gamma^1 * r_{t+1} + ... + gamma^{d-t} * r_d """ d = len(rewards) rtg = [0.0] * d running = 0.0 for t in range(d - 1, -1, -1): # iterate backward running = rewards[t] + gamma * running rtg[t] = running return rtg # REINFORCE with reward-to-go: for trajectory in batch: rtg = compute_rewards_to_go(trajectory.rewards) for t, (s, a, r) in enumerate(trajectory.steps): score = grad_log_pi(theta, a, s) # nabla_theta log pi grad += score * rtg[t] # weight by future rewards only theta += alpha * grad / len(batch)
Worked example: reward-to-go vs. total return. Suppose a trajectory has 3 steps with rewards [1, 3, 2] and γ = 1.0 (no discounting for simplicity). Total return: R(τ) = 6.
With total return, every step gets multiplied by 6. Step 1's score gets multiplied by 6 — even though 5 of those 6 reward units came from steps 2 and 3, which had nothing to do with action a1.
With reward-to-go: step 1 gets 1 + 3 + 2 = 6, step 2 gets 3 + 2 = 5, step 3 gets only 2. Now the weight for step 3 is tight (only the immediate reward), and the weight for step 1 is the most informative possible. Steps at the end of the trajectory contribute more signal per rollout, because their weights are smaller and thus less contaminated by the environment's randomness across future steps.
Even with reward-to-go, the gradient can still be noisy. Here's a concrete failure mode: suppose all trajectories have positive return — the environment is generous. Then every action gets pushed to be more likely, just by different amounts. The gradient signal is drowned by the absolute scale of returns, not the relative quality of actions.
The fix: subtract a baseline b(st) from the reward-to-go at each step. The gradient becomes:
The baseline can be any function of state — it must not depend on the action. This is critical: if b were action-dependent, subtracting it would bias the gradient.
What makes a good baseline? The Kochenderfer text derives the optimal baseline analytically. For a scalar policy gradient component i, the variance-minimizing baseline is:
This is a weighted average of returns, where the weights are the squared score magnitudes. In practice, this is expensive to compute per-parameter. The most common heuristic is the state value function: b(s) = Vπ(s) — the expected return from state s under the current policy. It's a good proxy because it removes the "background" expected return, leaving only the advantage of the specific action taken.
The three common choices of baseline, ordered by effectiveness:
| Baseline b(s) | How to compute | Effectiveness |
|---|---|---|
| Zero (no baseline) | b = 0 | Worst: full return scale in gradient |
| Mean return | b = mean of Rto-go across batch | Good: simple, removes absolute scale |
| State value Vπ(s) | Learned neural network: Vw(s) trained by regression on MC returns | Best: removes per-state background return |
| GAE-estimated advantage | δ = r + γV(s') − V(s), then exponential blend | Excellent: trades off bias vs. variance with λ |
In code, implementing the mean-return baseline adds just 2 lines. Implementing V(s) as a learned network adds ~50 lines (a second network + regression training loop). The gain from V(s) over the mean baseline is typically 2-5x variance reduction on continuous control tasks, often worth the complexity.
Worked example: why baseline helps when all rewards are positive. Suppose from state s, actions A, B, C give returns 8, 10, 12. Mean = 10. Without a baseline, the gradient term for action B is score(B) × 10. With baseline = 10: score(B) × (10 − 10) = 0. Action B is exactly average — it gets zero gradient, neither promoted nor demoted. Action A gets score(A) × (8 − 10) = score(A) × (−2): slightly demoted. Action C gets score(C) × (12 − 10) = score(C) × 2: promoted. The gradient now correctly pushes the policy toward C and away from A, ignoring the absolute scale of rewards entirely.
Without the baseline, all three actions got positive gradient signal — the gradient was trying to make all of them more likely simultaneously. The magnitude differences between 8, 10, and 12 created the correct direction, but drowned in much more noise than necessary. The baseline precisely centers the signal.
Both methods estimate the same gradient (unbiased), but the baseline version clusters tightly around the truth. Click "Sample" to add batches of gradient estimates. Red = no baseline, Green = with mean-return baseline.
We've now built up three layers of variance reduction. Let's stack them together and see exactly how the gradient changes — and what each layer buys you.
The advantage function A(s,a) = Q(s,a) − V(s) is the ideal weight for each gradient term. It answers: "How much better is action a than the average action from state s?" Actions that are exactly average get zero gradient signal. Only relative quality drives learning. This is the most principled choice of baseline: b(s) = V(s).
One important clarification: even with the optimal baseline, the REINFORCE gradient estimator is still unbiased. "Minimum variance" doesn't mean zero variance — it means no other unbiased estimator in this family can have smaller variance. The gradient still wiggles from step to step; it's just as tightly clustered around the true gradient as possible given the trajectory samples. To actually eliminate variance, you'd need to compute Eτ[...] exactly — which requires knowing the full environment model, which is what we're trying to avoid. Variance reduction is about making the best use of finite samples; it's not magic.
| Estimator | Bias | Variance | Requires |
|---|---|---|---|
| Total return R(τ) | Zero | Highest | Rollouts only |
| Reward-to-go Rtto-go | Zero | Lower | Rollouts only |
| Reward-to-go − constant b | Zero | Lower still | Rollouts + b |
| Reward-to-go − V(st) | Zero | Much lower | Rollouts + V estimator |
| True advantage A(s,a) | Zero | Lowest | Full Q and V (oracle) |
| TD residual r + γV(s') − V(s) | Some (from V approx) | Very low | Rollouts + learned V |
Click "Run" to simulate 50 batches of 20 rollouts on a bandit problem (two actions, rewards N(3,4) and N(5,4)). Each bar shows the empirical variance of the gradient estimate across batches.
The GAE bridge. In practice, even the TD residual (1-step advantage r + γV(s') − V(s)) and the full reward-to-go (n-step advantage) are two extremes. Generalized Advantage Estimation (GAE, Schulman et al. 2016) interpolates them with a scalar λ ∈ [0,1]:
where δt = rt + γV(st+1) − V(st) is the TD residual at step t. At λ = 0: GAE = δt (1-step TD, low variance but biased if V is wrong). At λ = 1: GAE = reward-to-go minus baseline (unbiased but high variance). λ ≈ 0.95 is the standard choice in PPO, A3C, and most modern implementations — it gives most of the variance reduction with little bias.
To see why the TD residual introduces bias when V is approximate: suppose V(s) overestimates the true value. Then δt = rt + γV(st+1) − V(st) is biased by the error in V. In contrast, the full reward-to-go (Monte Carlo return) uses actual observed rewards, not V estimates, so it's unbiased regardless of how bad V is. The catch: it requires waiting until the episode ends to compute, and accumulates d steps of random reward noise. GAE at λ = 0.95 with a reasonably good V (typically learned jointly) provides most of the variance reduction of the full TD approach with most of the unbiasedness of Monte Carlo — this is the Goldilocks sweet spot that makes modern deep RL practical.
The PPO paper (Schulman et al., 2017) uses γ = 0.99, λ = 0.95 as defaults and reports that GAE is responsible for a large fraction of PPO's sample efficiency improvement over vanilla REINFORCE. The advantage is that GAE doesn't require a separate hyperparameter search for how many steps to look ahead — λ smoothly interpolates between the two extremes and the optimal value is relatively stable across domains. Intuitively, λ = 0.95 means that TD errors more than log(0.05)/log(0.95) ≈ 58 steps in the future contribute less than 5% to the advantage estimate — so the effective look-ahead is roughly 58 steps under default settings. For tasks with rewards on longer timescales, increasing λ toward 1 helps, at the cost of slightly higher variance in the advantage estimate.
Here is a complete, runnable implementation of REINFORCE with reward-to-go and a baseline. This is not pseudocode — every line does real work, and every decision is explained.
python import numpy as np #### Policy #################################################### # Softmax policy over K discrete actions. Parameters: theta[s, a]. # pi(a|s) = exp(theta[s,a]) / sum_a exp(theta[s,a]) # Input: state s (int), theta (S x A array) # Output: action probabilities (A-vector, sums to 1) def softmax_policy(theta, s): logits = theta[s] # shape: (A,) logits -= logits.max() # stability: subtract max exp_logits = np.exp(logits) return exp_logits / exp_logits.sum() # normalize def grad_log_pi(theta, s, a): """ Gradient of log pi(a|s) w.r.t. theta[s, :]. For softmax: grad[s, a'] = 1(a'=a) - pi(a'|s). All other rows of theta are zero gradient. """ pi = softmax_policy(theta, s) g = np.zeros_like(theta) g[s] = -pi # -pi(a'|s) for all a' g[s, a] += 1.0 # +1 for the taken action return g #### Rollout ################################################### # Run one episode; return list of (s, a, r) tuples. def rollout(env, theta, max_steps=200): trajectory = [] s = env.reset() for _ in range(max_steps): pi = softmax_policy(theta, s) a = np.random.choice(len(pi), p=pi) s_next, r, done = env.step(a) trajectory.append((s, a, r)) if done: break s = s_next return trajectory #### Rewards-to-go ############################################# def rewards_to_go(rewards, gamma=0.99): rtg, running = [], 0.0 for r in reversed(rewards): running = r + gamma * running rtg.insert(0, running) return rtg #### REINFORCE update ########################################## # Batch of m trajectories. Baseline = mean return across the batch. def reinforce_update(env, theta, m=20, alpha=0.01, gamma=0.99): trajectories = [rollout(env, theta) for _ in range(m)] # 1. Compute reward-to-go for every step of every trajectory all_rtg = [] for traj in trajectories: rewards = [r for _, _, r in traj] all_rtg.append(rewards_to_go(rewards, gamma)) # 2. Baseline = mean rtg across ALL steps in batch all_vals = [v for rtg in all_rtg for v in rtg] baseline = np.mean(all_vals) # constant baseline # 3. Accumulate gradient grad = np.zeros_like(theta) total_steps = 0 for traj, rtg in zip(trajectories, all_rtg): for t, (s, a, _) in enumerate(traj): weight = (rtg[t] - baseline) * (gamma ** t) grad += weight * grad_log_pi(theta, s, a) total_steps += 1 grad /= total_steps # normalize by steps # 4. Gradient ascent step theta += alpha * grad mean_return = np.mean([sum(r for _,_,r in traj) for traj in trajectories]) return theta, mean_return
Key decisions in this implementation, and why each was made:
| Decision | Why |
|---|---|
| Subtract max before softmax | Prevents exp() overflow. Doesn't change probabilities: e^(x-c)/(sum e^(y-c)) = e^x / sum e^y |
| Normalize by total_steps, not m | Trajectories vary in length. Per-step normalization avoids short-trajectory bias. |
| Constant baseline (mean rtg) | Simple and effective. A learned V(s) baseline would be better but requires a second model. |
| gamma^t discount factor per step | The γt−1 in the gradient formula accounts for the geometric down-weighting of later steps. |
| grad_log_pi = 1(a'=a) - π(a'|s) | Exact analytical gradient of log-softmax. This is why stochastic policies work: we know this formula. |
For continuous action spaces, use a Gaussian policy parameterized by a mean vector μθ(s) and a fixed or learned standard deviation σ. The key formulas:
python import numpy as np #### Gaussian Policy for Continuous Actions #################### # Policy: pi(a|s) = N(mu_theta(s), sigma^2) # Parameters theta map states to means (via a linear or neural-net model). # Here: simple linear mean: mu(s) = theta @ phi(s) # Input: s = feature vector (d,), theta = (d,) linear weights # Output: sampled action a (scalar), its log-prob def gaussian_policy(theta, s, sigma=1.0): mu = theta @ s # linear mean (scalar) a = np.random.normal(mu, sigma) # sample action return a, mu def grad_log_pi_gaussian(theta, s, a, sigma=1.0): """ Gradient of log N(a; mu_theta(s), sigma^2) w.r.t. theta. log pi = -0.5 * ((a - mu)/sigma)^2 - log(sigma*sqrt(2*pi)) d/dtheta log pi = (a - mu) / sigma^2 * d(mu)/d(theta) = (a - mu) / sigma^2 * s (for linear mu = theta @ s) """ mu = theta @ s return (a - mu) / (sigma ** 2) * s # shape: (d,) # REINFORCE update for Gaussian policy: # For each (s, a, rtg) triple in the batch: for s, a, rtg in batch_steps: score = grad_log_pi_gaussian(theta, s, a) grad += score * rtg theta += alpha * grad / len(batch_steps)
Notice how the score function (a − μ)/σ2 × s encodes exactly the geometry from Chapter 2's widget: if the sampled action a is above the mean μ, the score is positive, and the gradient step increases μ (making the policy output higher actions). If a is below μ, the score is negative, decreasing μ. The reward weight (rtg) scales how strongly we push: good-return samples drive larger mean updates than bad-return samples.
Debugging your implementation: If REINFORCE is not learning, check these five things in order. First, verify that your log-probability gradients are correct by comparing ∇logπ to finite-difference estimates of logπ. Second, verify that your rewards-to-go are computed correctly — print the first 5 values and manually verify with the reward sequence. Third, check that your gradient has the right sign: positive reward should push θ in the direction that increases the probability of the taken action. Fourth, check your normalization: if rewards are in the thousands (e.g., Atari scores), normalize them to mean 0, std 1 before computing gradients. Fifth, check your learning rate: too large (oscillation), too small (no progress). The most common bug is getting the gradient sign wrong by computing logπ(a|s) as negative log-probability instead of log-probability — easy to do if you mistake a negative log-likelihood loss for a log-probability.
You now know two fundamentally different paradigms for solving MDPs: value-based methods (Q-learning, dynamic programming from Chapters 7-8) and policy gradient methods (this chapter). They solve the same problem differently. Understanding when each excels is essential.
Value-based methods learn Q(s,a) or V(s) first, then derive a policy implicitly as π(s) = argmaxa Q(s,a). The gradient (in the RL sense) flows through the Bellman operator — temporal consistency, not likelihood ratios.
Policy gradient methods directly optimize the policy parameters θ to maximize expected return. No value function is required (though it helps as a baseline). The gradient flows through the log-probability of actions.
| Criterion | Value-Based (Q-Learning) | Policy Gradient (REINFORCE) |
|---|---|---|
| What is learned? | Q(s,a) table or network | πθ directly |
| Action space | Discrete (argmax is tractable) | Discrete or continuous |
| Policy type | Deterministic (greedy) | Stochastic (by design) |
| Sample efficiency | High (off-policy, replay buffer) | Lower (on-policy, no replay) |
| Gradient quality | Biased (bootstrapping) | Unbiased (but high variance) |
| Convergence | Guaranteed for tabular Q-learning | Local optimum (gradient ascent) |
| Continuous actions | Hard (maximize over continuous a) | Natural (sample from π, differentiate log π) |
| Stochastic policies | Awkward (requires ε-greedy hack) | Native: πθ is stochastic by design |
The real power emerges in actor-critic methods (Chapter 13) that combine both: a policy (actor) trained by policy gradient, and a value estimator (critic) used to compute the advantage. You get the natural handling of continuous actions from policy gradient, plus the low-variance gradient signal from the critic's advantage estimate.
Stochastic vs. deterministic policies: a deeper look. Value-based methods like Q-learning produce deterministic greedy policies: given state s, always take argmaxa Q(s,a). This determinism means Q-learning policies can't represent mixed strategies — behaviors where randomness is genuinely optimal (like bluffing in poker, or breaking symmetry in multi-agent games). Policy gradient methods naturally represent such strategies: a softmax over action logits can assign probability 0.5/0.3/0.2 to three actions, and gradient ascent can maintain this mixed strategy indefinitely. In multi-agent settings, many Nash equilibria are mixed strategies; policy gradient can find them while greedy Q-learning cannot.
The on-policy constraint. REINFORCE is on-policy: every gradient estimate uses trajectories sampled from the current policy. Once we update θ, the old trajectories are stale and must be discarded. This is expensive: you need fresh rollouts for every gradient step. Q-learning, by contrast, is off-policy: experiences from any behavior policy can be stored in a replay buffer and reused many times. This is a massive efficiency advantage. The off-policy property comes from Q-learning's use of the Bellman equation, which is policy-independent. REINFORCE's estimator explicitly samples from πθ and has no such property.
The off-policy vs. on-policy gap is one reason deep Q-networks (DQN) were historically more sample-efficient than REINFORCE. Modern algorithms like PPO partially bridge the gap using importance sampling to reuse old data a few times before discarding, though the gap never fully closes for pure policy gradient methods.
Local optima and the multimodal landscape. One important limitation of policy gradient methods is that gradient ascent can converge to local optima. The policy performance surface U(θ) is generally non-convex for expressive policy classes. Imagine two distinct strategies for solving a maze: one fast path (high reward) and one slow safe path (moderate reward). If the policy starts near the slow-path optimum, gradient ascent will converge there and never discover the fast path. Q-learning can also suffer from this, but the Bellman-operator structure gives it stronger convergence guarantees in the tabular setting. For policy gradients, exploration (high policy entropy) is the primary mechanism for avoiding local optima — keeping the policy spread enough that it occasionally stumbles across better strategies before committing.
Why variance matters for convergence rate, not just noise. It's easy to think high gradient variance is merely an annoyance — the mean is still correct, so we'll get there eventually. But high variance has a deeper consequence: it forces us to use smaller learning rates. A gradient estimate with variance σ2g is safe to step by α ≈ 1/(σg √T) in the stochastic gradient ascent analysis, meaning tighter gradient estimates allow larger steps and faster convergence. Variance reduction is not just cosmetic — it's directly coupled to how fast the algorithm learns.
Policy gradient estimation is not a dead end — it's a foundation. Everything in chapters 12, 13, and 15 builds directly on the estimator derived here.
What Chapter 11 gave us:
• ∇U = E[R(τ) ∇log pθ(τ)]
• Score = ∑t ∇log πθ(at|st)
• No model needed (transitions cancel)
• Reward-to-go (causality)
• Baseline subtraction (unbiased variance reduction)
• Advantage function (optimal baseline)
Where these ideas go:
• Ch 12: How to use ∇U for optimization (natural gradient, TRPO, PPO)
• Ch 13: Actor-Critic: estimating A(s,a) with a learned critic
• Ch 15: Exploration via entropy bonuses and curiosity
• Ch 17: Model-free methods (DDPG, SAC) for continuous control
• Ch 18: Imitation learning (behavioral cloning via ∇logπ)
The estimator ∇U = E[A(s,a) ∇logπ] is not just REINFORCE. It's the gradient of every modern on-policy deep RL algorithm. PPO uses it. A3C uses it. SAC uses it (with entropy). The difference between algorithms is primarily in how they estimate A(s,a) and how they constrain the gradient step size to avoid overshooting.
Summary of the four estimators and when to use them. Finite difference: use when n ≤ 20 and you don't have policy gradients (e.g., black-box simulators, non-differentiable policies). Regression gradient: use when n is moderate (20–500) and random direction probing is feasible. Likelihood ratio (REINFORCE): use when πθ is stochastic and differentiable — this is the default for neural network policies in episodic environments. REINFORCE with variance reduction: always use reward-to-go (free) and a baseline (small cost, large gain). These are the building blocks; Chapter 12 shows how to use the resulting gradients more efficiently, and Chapter 13 shows how to get better advantage estimates.
A historical note on naming. The term "policy gradient" is sometimes used loosely to mean any gradient-based policy optimization, including natural policy gradient, actor-critic, and PPO. More precisely, the "policy gradient theorem" (Sutton et al., 2000) refers specifically to the identity ∇U = E[Q(s,a) ∇logπ], which this chapter derives as a special case (with reward-to-go approximating Q). The algorithms that directly implement this theorem (REINFORCE, REINFORCE with baseline) are sometimes called "vanilla policy gradient" to distinguish them from the actor-critic variants that additionally learn a value function.
The Williams (1992) REINFORCE paper derived the likelihood ratio estimator for recurrent neural network policies applied to sequence generation tasks — essentially what we'd call language model RL today. The same estimator reappeared in Sutton et al. (2000) as the "Policy Gradient Theorem" in a more general formulation that works for continuing tasks (no episode boundaries), not just episodic ones. The key insight Sutton et al. added: the gradient theorem holds for the average reward criterion as well, and the "baseline" generalizes to the value function under the stationary distribution. These theoretical results are why we can confidently use policy gradient in all the diverse settings we care about — episodic RL, average-reward RL, POMDP RL, and even fine-tuning language models — the math holds in each case.
What "model-free" really means. The likelihood ratio gradient is called model-free because it doesn't require knowing T(s'|s,a). But there's a subtlety: it still requires the policy to be differentiable (so ∇logπ exists) and the reward to be observable (so R(τ) can be computed). If the reward function is itself non-differentiable or non-observable (e.g., human judgments in RLHF), you still need a reward model. "Model-free" specifically means free from the environment dynamics model, not free from all structure. Evolution strategies (ES) and finite differences are doubly model-free: they don't need dynamics or policy gradients. This extra freedom comes at the cost of much worse sample efficiency, since they can't exploit the policy's internal structure.
There are also connections to the theory of score-based generative models. The score function ∇x log p(x) is the central quantity in diffusion models — it's the same mathematical object (gradient of a log-probability) playing a different role. Policy gradient is fundamentally about shaping a distribution (the policy) to put more mass on high-reward regions; diffusion models shape a distribution to match training data. Same math, different application.
The policy gradient in RLHF. Reinforcement learning from human feedback — the training procedure behind ChatGPT, Claude, and similar models — uses the policy gradient estimator from this chapter. The language model is the policy (πθ(next_token | prompt + previous_tokens)). The reward signal comes from a learned reward model trained on human preference comparisons. The REINFORCE estimator (or its PPO variant) then updates θ to maximize expected reward. The score function ∇logπ(token|context) is computed exactly as in Chapter 2 — the only change is that "state" is a token sequence and "action" is the next token. Policy gradient scales to billion-parameter models and trillion-token action spaces precisely because the likelihood ratio trick requires only ∇logπ, which autograd computes efficiently through any differentiable policy architecture.
Connections to variational inference. Policy gradient also appears in variational inference, where we want to fit an approximate posterior qθ(z|x) ≈ p(z|x). The ELBO gradient ∇θ Ez∼qθ[log p(x,z) − log qθ(z|x)] faces the same challenge: we can't differentiate through the sampling step. The score function gradient estimator (also called REINFORCE in this context) applies the same log-derivative trick: ∇θEz∼qθ[f(z)] = Ez∼qθ[f(z) ∇θ log qθ(z)]. The reparameterization trick (z = g(θ, ε) with ε ∼ p(ε)) is an alternative that often gives lower variance, but it requires reparameterizable distributions — Gaussians yes, discrete distributions no. For discrete latent variables, the score function estimator is the only option, which is why it remains relevant in modern VAE and diffusion model training.
| Limitation | How later chapters address it |
|---|---|
| High variance makes learning slow | Natural gradient (Ch 12): scale the step by the Fisher information matrix, taking larger steps where the distribution changes slowly |
| Need a good advantage estimate | Actor-Critic (Ch 13): train a separate neural network Vw(s) as critic, use TD residual as advantage proxy |
| Large steps can destroy the policy | TRPO/PPO (Ch 12): constrain the KL divergence between old and new policy per update |
| On-policy: can't reuse old data | Importance sampling: reweight old trajectories by πθ/πθold (Ch 12) |
| Poor exploration near deterministic policies | Entropy bonus (Ch 15): add λ H(π) to the objective, penalizing overconfident policies |
Common implementation pitfalls and how to avoid them:
| Pitfall | Symptom | Fix |
|---|---|---|
| Learning rate too large | Policy collapses to one action, then diverges | Start with α ≈ 1e-3 to 1e-4; decay over time |
| Baseline not normalized | Gradient magnitudes vary wildly across environments with different reward scales | Normalize rewards to mean 0, std 1 per batch |
| Batch size too small | Gradient estimate is noisy; policy oscillates without converging | Use 20-100+ trajectories per update, not just 1 |
| Forgetting the discount in the gradient | Long horizons give too much weight to late-episode actions | Include γt−1 in the per-step weight |
| Using total return instead of RTG | Slow learning, especially in long episodes | Always use reward-to-go; it's free variance reduction |
| Mixing on-policy and off-policy data | Gradient estimate is biased; policy may degrade | Discard all trajectories after each update, or use IS weights |
| Not clipping gradient norms | Occasional catastrophic policy update from a high-variance batch | torch.nn.utils.clip_grad_norm_(params, max_norm=0.5) |
For deep RL with neural network policies (PyTorch example): learning rate 3e-4 with Adam optimizer, batch of 2048 timesteps (~20 episodes of length ~100), normalize advantages to mean 0 std 1 per batch, clip gradient norms to 0.5, γ = 0.99, λ = 0.95. These are the PPO defaults and work well as starting points for most continuous control tasks. You can always tighten from here.
"The policy gradient theorem [is] one of the most important results in reinforcement learning theory."
— Sutton & Barto, Reinforcement Learning: An Introduction (2018)
Related lessons on Parminces:
← Ch 10: Policy Search — Cross-entropy, CEM, black-box methods
→ Ch 12: Policy Gradient Optimization — Natural gradient, TRPO, PPO
→ Ch 13: Actor-Critic — Learning the advantage with a critic network