Sutton & Barto, Chapter 13

Policy Gradient Methods

Learning the policy directly — no value function required, no argmax needed.

Prerequisites: Chapters 9–10 (function approximation) + basic calculus (gradients). That's it.
11
Chapters
4
Simulations
11
Quizzes

Chapter 0: A New Paradigm

Everything in this book so far has followed the same recipe: learn a value function (V or Q), then derive a policy from it (usually by acting greedily: pick the action with the highest value). The policy is implicit — it falls out of the value function. But what if we learned the policy directly?

A policy gradient method parameterizes the policy as π(a|s, θ) — a function from states to action probabilities, controlled by parameters θ (like weights in a neural network). Instead of learning values, we learn θ by gradient ascent on some measure of performance.

The paradigm shift: Value-based methods ask "what is each action worth?" and then pick the best. Policy gradient methods ask "what should the policy do?" and directly adjust it. There's no argmax, no greedy selection, no value function required (though we'll often use one as an assistant).

Why might this be better? Consider a robot arm that can move in any direction. A value-based method would need to evaluate every possible direction (infinitely many!) and pick the best. A policy gradient method parameterizes the direction as, say, a Gaussian distribution, and adjusts its mean and variance. It naturally handles continuous actions.

Or consider a game where the optimal strategy is stochastic — you need to randomize (like bluffing in poker). Value-based methods with ε-greedy can't learn the right randomization. Policy gradient methods output probabilities directly, so they can learn any stochastic policy.

Value-Based (Chs 2–12)
Learn V or Q → derive π via argmax
vs
Policy Gradient (Ch 13)
Learn π(a|s,θ) directly via gradient ascent on performance
Check: What do policy gradient methods learn directly?

Chapter 1: Softmax Policy

How do we parameterize a policy? For discrete actions, the most common approach is the softmax over action preferences. Each state-action pair has a preference h(s, a, θ), and the policy is:

π(a|s, θ) = exp(h(s, a, θ)) / ∑b exp(h(s, b, θ))

This is the same softmax function used in neural networks. The preferences h can be linear: h(s, a, θ) = θT x(s, a), or they can be the output of a neural network. The key property: every action always has nonzero probability, so the policy can be stochastic.

Key insight: The softmax guarantees that all actions are tried with some probability, providing natural exploration. As preferences diverge (one action's h becomes much larger), the policy approaches deterministic — but never quite gets there. This is different from ε-greedy, where the non-greedy actions all share the same small probability ε/|A| regardless of how bad they are.

The action preferences are not values — they don't estimate expected returns. They are relative quantities: adding the same constant to all preferences in a state doesn't change the policy. Only the differences between preferences matter. Think of them as "how much the policy favors each action," on an arbitrary scale.

Softmax Policy

Adjust the action preferences h1, h2, h3 for three actions. The bar chart shows the resulting softmax probabilities. Notice how increasing one preference steals probability from the others.

h1 = 1.0
h2 = 0.0
h3 = -1.0
Check: What happens to the softmax policy if you add 5 to all preferences h(s, a)?

Chapter 2: Advantages of Policy Gradient

Why bother with a whole new paradigm? Policy gradient methods have three fundamental advantages over value-based methods:

1. Continuous Action Spaces

Q-learning requires maxa Q(s, a). With continuous actions (a real number or a vector), this max is itself an optimization problem. Policy gradient methods parameterize the action distribution directly (e.g., as a Gaussian with learned mean and variance), sidestepping the max entirely.

2. Stochastic Optimal Policies

Some problems have optimal policies that are stochastic. Rock-paper-scissors: the optimal strategy is to randomize uniformly. A value-based method with greedy selection can only produce deterministic policies (or ε-greedy, which randomizes the wrong way). Policy gradient methods output probabilities, so they can represent any stochastic policy.

3. Smoother Optimization

Small changes to θ produce small changes to π(a|s,θ), which produce small changes to performance. The optimization landscape is smooth. In contrast, small changes to a value function can cause the greedy policy to flip an action (if one Q-value crosses another), producing a discontinuous jump in behavior. This makes value-based methods more brittle in practice.

The gradient is the key: Because π(θ) is a smooth function of θ, we can compute ∇θJ(θ) — the gradient of performance with respect to policy parameters. Standard gradient ascent then improves the policy. The challenge is computing this gradient when performance depends on the entire trajectory distribution, which itself depends on θ.
Check: Why can value-based methods with greedy selection struggle when the optimal policy is stochastic?

Chapter 3: The Policy Gradient Theorem

Here's the core challenge. Our performance measure J(θ) depends on the policy, which affects which states are visited, which affects the returns. The state distribution dπ(s) depends on θ. Taking the gradient ∇J seems to require differentiating through the state distribution — which depends on the environment dynamics we don't know.

The policy gradient theorem (Sutton et al., 2000) shows this isn't necessary. The gradient has a remarkably simple form:

∇ J(θ) = ∑s dπ(s) ∑a qπ(s, a) ∇ π(a|s, θ)
The miracle: The derivative of the state distribution dπ(s) with respect to θ does not appear. The gradient depends on dπ(s) (how often states are visited) and on ∇π(a|s,θ) (how the policy changes), but never on how the state distribution changes. This is what makes policy gradient methods practical — we don't need a model of the environment.

We can rewrite this using the log-derivative trick. Since ∇π(a|s,θ) = π(a|s,θ) ∇ln π(a|s,θ), the gradient becomes an expectation:

∇ J(θ) = Eπ[ qπ(S, A) ∇ ln π(A|S, θ) ]

This is an expectation under the policy π, which means we can estimate it by sampling. Take actions from π, observe returns, and use them to estimate qπ(S, A). Multiply by ∇ln π and you have a sample of the gradient. This is the foundation of all policy gradient algorithms.

The term ∇ln π(A|S,θ) is called the score function. It points in the direction that makes action A more likely in state S. The gradient says: adjust θ to make good actions (high q) more likely and bad actions (low q) less likely.

Check: What makes the policy gradient theorem remarkable?

Chapter 4: REINFORCE

REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It uses the complete return Gt as an unbiased estimate of qπ(St, At), giving the update:

θt+1 = θt + α γt Gt ∇ ln π(At|St, θt)

That's it. Run an episode. At each step, multiply the return by the score function. Update θ. The return Gt tells the algorithm how good the outcome was; ∇ln π tells it which direction to push θ to make the chosen action more or less likely.

Intuition: If you took an action and got a high return, REINFORCE increases the probability of that action. If the return was low, it decreases it. Over many episodes, actions that lead to high returns become more likely. It's trial-and-error at the level of entire trajectories.

REINFORCE is a Monte Carlo method: it needs complete episodes to compute Gt. This gives it high variance (returns are noisy) but zero bias (Gt is an unbiased estimate of qπ). The high variance means slow learning — the algorithm needs many episodes to average out the noise.

Showcase: Short Corridor Problem

A 4-state corridor where the middle state has reversed actions (left goes right, right goes left). The optimal policy is stochastic: go right with probability 0.59. An ε-greedy policy can't represent this — it either goes right almost always (and gets trapped) or randomizes too much. Watch REINFORCE learn the correct stochastic policy.

α = 0.010
Episode 0 | P(right) = 0.50
pseudocode
# REINFORCE (Monte Carlo Policy Gradient)
Initialize θ arbitrarily

for each episode:
  Generate trajectory S0, A0, R1, ..., ST following π(·|·,θ)
  for t = 0, 1, ..., T−1:
    G ← ∑k=t+1T γk−t−1 Rk
    θ ← θ + α γt G ∇ ln π(At|St, θ)
Check: Why can REINFORCE solve the short corridor problem but ε-greedy Q-learning cannot?

Chapter 5: REINFORCE with Baseline

REINFORCE has a variance problem. Suppose all returns are positive (e.g., a survival task). Then every action's probability gets increased, just by different amounts. The signal is "this action was good" even if it was the worst available — it was just less good. What we really want is: "was this action better or worse than average?"

Enter the baseline. We subtract a state-dependent quantity b(St) from the return:

θt+1 = θt + α γt (Gt − b(St)) ∇ ln π(At|St, θt)

The baseline b(St) can be any function of the state that doesn't depend on the action. The most common choice is the state-value function v̂(St, w), learned alongside the policy. This doesn't introduce bias (because E[∇ln π · b] = 0 for any action-independent b), but it dramatically reduces variance.

The intuition: Without a baseline, REINFORCE says "this action got return 5, make it more likely." With a baseline of 4 (the average return from this state), it says "this action got 1 more than average, make it slightly more likely." The signal is now about whether the action was better than expected, not just good in absolute terms.

The quantity Gt − b(St) is called the advantage. When b(St) = vπ(St), the advantage is At = Gt − vπ(St), which measures how much better the actual outcome was compared to what was expected from that state. Positive advantage = better than average; negative = worse.

Baseline Effect on Variance

Compare the gradient estimates with and without a baseline. The red dots (no baseline) show raw G·∇lnπ samples — high variance. The green dots (with baseline) show (G−b)·∇lnπ — same mean, much less spread.

Check: Why doesn't subtracting a baseline from Gt introduce bias in the gradient estimate?

Chapter 6: Actor-Critic

REINFORCE (even with a baseline) is Monte Carlo: it needs the complete episode return Gt. The actor-critic architecture replaces Gt with a bootstrapped estimate, enabling fully online, incremental updates after every step.

The actor is the policy π(a|s, θ). The critic is a value function v̂(s, w) that estimates the expected return. The one-step actor-critic update is:

δt = Rt+1 + γ v̂(St+1, w) − v̂(St, w)
θt+1 = θt + αθ γt δt ∇ ln π(At|St, θt)
wt+1 = wt + αw δt ∇ v̂(St, wt)
Two learners, one goal: The critic learns "how good is this state?" (value estimation, just like TD). The actor learns "which action to take?" (policy improvement). The TD error δt connects them: it tells the actor whether the outcome was better or worse than the critic expected. If δ > 0, the action was better than expected — make it more likely. If δ < 0, make it less likely.

The critic replaces Gt with Rt+1 + γv̂(St+1), which introduces bias (the value estimate may be wrong) but reduces variance (no need to wait for the noisy full return). This is the same TD-vs-MC trade-off we've seen throughout the book, now applied to policy gradients.

Actor (policy)
θ ← θ + α δ ∇ ln π
↑ TD error δ ↓
Critic (value)
w ← w + α δ ∇ v̂
pseudocode
# One-step Actor-Critic
Initialize θ (actor), w (critic)

for each episode:
  S ← start; I ← 1
  while S is not terminal:
    A ∼ π(·|S, θ)
    take A, observe R, S'
    δ ← R + γ v̂(S', w) − v̂(S, w)
    w ← w + αw δ ∇ v̂(S, w)
    θ ← θ + αθ I δ ∇ ln π(A|S, θ)
    I ← γ I
    S ← S'
Check: What role does the TD error δt play in actor-critic?

Chapter 7: Actor-Critic with Traces

We can combine actor-critic with the eligibility traces from Chapter 12. Both the actor and the critic get their own trace vector:

ztw = γ λw zt−1w + ∇ v̂(St, w)    (critic trace)
ztθ = γ λθ zt−1θ + It ∇ ln π(At|St, θ)    (actor trace)

The updates become: w += αw δt ztw and θ += αθ δt ztθ. The λ parameters control the bias-variance trade-off for each: higher λ means more reliance on actual returns (lower bias, higher variance).

The benefit: Without traces, one-step actor-critic assigns credit only to the immediately preceding action. With traces, credit propagates back through the entire recent trajectory. This is especially important for problems with delayed rewards — the action that "caused" the reward might have happened many steps ago. Traces spread the TD error signal back to those earlier actions.

The actor and critic can use different λ values. A common choice is λw = 0.8 for the critic (some bootstrapping) and λθ = 0.9 for the actor (more reliance on returns, since policy updates benefit from lower bias). But these are tuned empirically.

This is the full actor-critic with eligibility traces — a method that combines the strengths of policy gradient (smooth optimization, stochastic policies) with the strengths of TD learning (bootstrapping, online updates) and eligibility traces (multi-step credit assignment). It's the ancestor of modern algorithms like A3C and PPO.

Check: Why does actor-critic benefit from eligibility traces?

Chapter 8: Continuing Problems

For continuing tasks (no episodes, infinite stream of experience), the discounted formulation has a subtle issue: with function approximation, the policy gradient theorem requires a well-defined state distribution, which the discounted setting provides. But an alternative is the average reward formulation.

Instead of maximizing discounted return, we maximize the average reward per step:

r(π) = limT→∞ (1/T) ∑t=1T E[Rt | π]

The differential return replaces the discounted return:

Gt = Rt+1 − r(π) + Rt+2 − r(π) + Rt+3 − r(π) + …
The shift: Instead of discounting future rewards, we subtract the average reward from each step. States that yield above-average reward have positive differential value; below-average states have negative value. This is cleaner than discounting for continuing tasks because γ doesn't artificially devalue the future.

The policy gradient theorem has an average-reward version:

∇ r(θ) = ∑s dπ(s) ∑a qπ(s, a) ∇ π(a|s, θ)

where now qπ is the differential action-value function and dπ is the stationary distribution under π. The form is identical to the episodic case. The differential TD error is δt = Rt+1 − r̅ + v̂(St+1) − v̂(St), where r̅ is a running estimate of the average reward.

Check: In the average reward formulation, what replaces the discount factor γ?

Chapter 9: Continuous Actions

This is where policy gradient methods truly shine. For continuous action spaces (controlling a joint angle, a steering wheel, a throttle), parameterize the policy as a Gaussian distribution:

π(a|s, θ) = (1 / σ(s)√(2π)) exp( −(a − μ(s))2 / 2σ(s)2 )

The mean μ(s) and standard deviation σ(s) are functions of the state, parameterized by θ. For example, μ(s) = θμT x(s) (linear) or μ(s) = neural_net(s; θμ). Similarly for σ(s), often parameterized as exp(θσT x(s)) to keep it positive.

The beauty: To sample an action, just compute μ and σ from the state, then draw a = μ + σ · ε where ε ~ N(0,1). To compute ∇lnπ, take derivatives of the Gaussian log-density. Both are simple, closed-form operations. No discretization, no max over actions.

The score function for a Gaussian policy decomposes into two parts:

θμ ln π = (a − μ) / σ2 · ∇θμ μ(s)
θσ ln π = ((a − μ)2 / σ2 − 1) · ∇θσ σ(s)

The mean gradient pushes μ toward actions that performed well (positive advantage). The variance gradient increases σ when actions far from the mean performed well (need to explore more) and decreases σ when actions near the mean performed best (we've found the sweet spot, narrow the search).

Gaussian Policy

Adjust the mean μ and standard deviation σ of a Gaussian policy. The curve shows the probability density over continuous actions. The orange region is the most likely action range.

μ = 0.5
σ = 1.0

For multi-dimensional actions (e.g., joint angles for each arm segment), the policy becomes a multivariate Gaussian. Each dimension can have its own mean and variance, and the covariance structure can be diagonal (common) or full (more expressive but more parameters).

Check: In a Gaussian policy, what does decreasing σ mean?

Chapter 10: Summary

Policy gradient methods complete the RL toolkit. Where value-based methods struggle (continuous actions, stochastic optimal policies), policy gradients provide a clean, principled alternative. And the two paradigms combine beautifully in actor-critic architectures.

MethodCore IdeaKey Property
REINFORCEMC policy gradientUnbiased but high variance
REINFORCE + baselineSubtract v̂(s) from returnSame mean, much lower variance
One-step ACTD error replaces returnOnline, bootstrapped, lower variance
AC + tracesAdd eligibility tracesMulti-step credit assignment
Average reward ACContinuing tasksNo discount factor needed
Gaussian policyNormal distribution over actionsNatural fit for continuous control
The complete picture: After 13 chapters, you have the full foundation of reinforcement learning. Value functions (DP, MC, TD, n-step, traces) learn what states are worth. Policy gradients learn what to do. Actor-critic combines both. Function approximation scales to large problems. The deadly triad warns you of pitfalls. Everything in modern deep RL — PPO, SAC, A3C, IMPALA — builds on these foundations.

What policy gradients give us:

• Continuous action spaces (natural)
• Stochastic optimal policies
• Smooth optimization landscape
• Compatible with any differentiable π
• Foundation for modern deep RL

Limitations:

• High variance (especially REINFORCE)
• Sample inefficient vs Q-learning
• Converge to local optima, not global
• Sensitive to step size
• Need many episodes to learn well

The road ahead: Chapter 12 gave us traces for credit assignment. This chapter gave us policy gradients for direct policy optimization. Modern deep RL combines these with neural networks: PPO = clipped policy gradient + actor-critic. SAC = maximum entropy actor-critic. IMPALA = off-policy actor-critic with V-trace. The concepts you've learned are the DNA of every modern RL algorithm.

"One can parameterize policies in a much more compact way than value functions.
For many problems, the policy is the simpler, more natural thing to learn."
— Richard S. Sutton & Andrew G. Barto
Check: What is the key advantage of actor-critic over pure REINFORCE?