Ch 13: Policy Gradient Methods — Sutton & Barto RL

Chapter 0: A New Paradigm

Everything in this book so far has followed the same recipe: learn a value function (V or Q), then derive a policy from it (usually by acting greedily: pick the action with the highest value). The policy is implicit — it falls out of the value function. But what if we learned the policy directly?

A policy gradient method parameterizes the policy as π(a|s, θ) — a function from states to action probabilities, controlled by parameters θ (like weights in a neural network). Instead of learning values, we learn θ by gradient ascent on some measure of performance.

The paradigm shift: Value-based methods ask "what is each action worth?" and then pick the best. Policy gradient methods ask "what should the policy do?" and directly adjust it. There's no argmax, no greedy selection, no value function required (though we'll often use one as an assistant).

Why might this be better? Consider a robot arm that can move in any direction. A value-based method would need to evaluate every possible direction (infinitely many!) and pick the best. A policy gradient method parameterizes the direction as, say, a Gaussian distribution, and adjusts its mean and variance. It naturally handles continuous actions.

Or consider a game where the optimal strategy is stochastic — you need to randomize (like bluffing in poker). Value-based methods with ε-greedy can't learn the right randomization. Policy gradient methods output probabilities directly, so they can learn any stochastic policy.

Value-Based (Chs 2–12)

Learn V or Q → derive π via argmax

Policy Gradient (Ch 13)

Learn π(a|s,θ) directly via gradient ascent on performance

Check: What do policy gradient methods learn directly?

The value function The policy π(a|s,θ) — a parameterized mapping from states to action probabilities The environment model

Chapter 1: Softmax Policy

How do we parameterize a policy? For discrete actions, the most common approach is the softmax over action preferences. Each state-action pair has a preference h(s, a, θ), and the policy is:

π(a|s, θ) = exp(h(s, a, θ)) / ∑_b exp(h(s, b, θ))

This is the same softmax function used in neural networks. The preferences h can be linear: h(s, a, θ) = θ^T x(s, a), or they can be the output of a neural network. The key property: every action always has nonzero probability, so the policy can be stochastic.

Key insight: The softmax guarantees that all actions are tried with some probability, providing natural exploration. As preferences diverge (one action's h becomes much larger), the policy approaches deterministic — but never quite gets there. This is different from ε-greedy, where the non-greedy actions all share the same small probability ε/|A| regardless of how bad they are.

The action preferences are not values — they don't estimate expected returns. They are relative quantities: adding the same constant to all preferences in a state doesn't change the policy. Only the differences between preferences matter. Think of them as "how much the policy favors each action," on an arbitrary scale.

Softmax Policy

Adjust the action preferences h₁, h₂, h₃ for three actions. The bar chart shows the resulting softmax probabilities. Notice how increasing one preference steals probability from the others.

h₁ = 1.0

h₂ = 0.0

h₃ = -1.0

Check: What happens to the softmax policy if you add 5 to all preferences h(s, a)?

Nothing — the probabilities stay the same because only differences matter All probabilities increase The policy becomes deterministic

Chapter 2: Advantages of Policy Gradient

Why bother with a whole new paradigm? Policy gradient methods have three fundamental advantages over value-based methods:

1. Continuous Action Spaces

Q-learning requires max_a Q(s, a). With continuous actions (a real number or a vector), this max is itself an optimization problem. Policy gradient methods parameterize the action distribution directly (e.g., as a Gaussian with learned mean and variance), sidestepping the max entirely.

2. Stochastic Optimal Policies

Some problems have optimal policies that are stochastic. Rock-paper-scissors: the optimal strategy is to randomize uniformly. A value-based method with greedy selection can only produce deterministic policies (or ε-greedy, which randomizes the wrong way). Policy gradient methods output probabilities, so they can represent any stochastic policy.

3. Smoother Optimization

Small changes to θ produce small changes to π(a|s,θ), which produce small changes to performance. The optimization landscape is smooth. In contrast, small changes to a value function can cause the greedy policy to flip an action (if one Q-value crosses another), producing a discontinuous jump in behavior. This makes value-based methods more brittle in practice.

The gradient is the key: Because π(θ) is a smooth function of θ, we can compute ∇_θJ(θ) — the gradient of performance with respect to policy parameters. Standard gradient ascent then improves the policy. The challenge is computing this gradient when performance depends on the entire trajectory distribution, which itself depends on θ.

Check: Why can value-based methods with greedy selection struggle when the optimal policy is stochastic?

They require too much computation Greedy selection always picks one action deterministically, so it can't represent the correct randomization The discount factor interferes

Chapter 3: The Policy Gradient Theorem

Here's the core challenge. Our performance measure J(θ) depends on the policy, which affects which states are visited, which affects the returns. The state distribution d_π(s) depends on θ. Taking the gradient ∇J seems to require differentiating through the state distribution — which depends on the environment dynamics we don't know.

The policy gradient theorem (Sutton et al., 2000) shows this isn't necessary. The gradient has a remarkably simple form:

∇ J(θ) = ∑_s d_π(s) ∑_a q_π(s, a) ∇ π(a|s, θ)

The miracle: The derivative of the state distribution d_π(s) with respect to θ does not appear. The gradient depends on d_π(s) (how often states are visited) and on ∇π(a|s,θ) (how the policy changes), but never on how the state distribution changes. This is what makes policy gradient methods practical — we don't need a model of the environment.

We can rewrite this using the log-derivative trick. Since ∇π(a|s,θ) = π(a|s,θ) ∇ln π(a|s,θ), the gradient becomes an expectation:

∇ J(θ) = E_π[ q_π(S, A) ∇ ln π(A|S, θ) ]

This is an expectation under the policy π, which means we can estimate it by sampling. Take actions from π, observe returns, and use them to estimate q_π(S, A). Multiply by ∇ln π and you have a sample of the gradient. This is the foundation of all policy gradient algorithms.

The term ∇ln π(A|S,θ) is called the score function. It points in the direction that makes action A more likely in state S. The gradient says: adjust θ to make good actions (high q) more likely and bad actions (low q) less likely.

Check: What makes the policy gradient theorem remarkable?

It shows the gradient doesn't involve the derivative of the state distribution, so we don't need environment dynamics It proves policy gradient methods always converge to the global optimum It eliminates the need for a discount factor

Chapter 4: REINFORCE

REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It uses the complete return G_t as an unbiased estimate of q_π(S_t, A_t), giving the update:

θ_t+1 = θ_t + α γ^t G_t ∇ ln π(A_t|S_t, θ_t)

That's it. Run an episode. At each step, multiply the return by the score function. Update θ. The return G_t tells the algorithm how good the outcome was; ∇ln π tells it which direction to push θ to make the chosen action more or less likely.

Intuition: If you took an action and got a high return, REINFORCE increases the probability of that action. If the return was low, it decreases it. Over many episodes, actions that lead to high returns become more likely. It's trial-and-error at the level of entire trajectories.

REINFORCE is a Monte Carlo method: it needs complete episodes to compute G_t. This gives it high variance (returns are noisy) but zero bias (G_t is an unbiased estimate of q_π). The high variance means slow learning — the algorithm needs many episodes to average out the noise.

Showcase: Short Corridor Problem

A 4-state corridor where the middle state has reversed actions (left goes right, right goes left). The optimal policy is stochastic: go right with probability 0.59. An ε-greedy policy can't represent this — it either goes right almost always (and gets trapped) or randomizes too much. Watch REINFORCE learn the correct stochastic policy.

α = 0.010

Episode 0 | P(right) = 0.50

pseudocode
# REINFORCE (Monte Carlo Policy Gradient)
Initialize θ arbitrarily

for each episode:
  Generate trajectory S₀, A₀, R₁, ..., S_T following π(·|·,θ)
  for t = 0, 1, ..., T−1:
    G ← ∑_k=t+1^T γ^k−t−1 R_k
    θ ← θ + α γ^t G ∇ ln π(A_t|S_t, θ)

Check: Why can REINFORCE solve the short corridor problem but ε-greedy Q-learning cannot?

REINFORCE learns a stochastic policy with the optimal randomization, while ε-greedy can only choose one action deterministically plus uniform random noise REINFORCE uses a smaller step size REINFORCE doesn't use a discount factor

Chapter 5: REINFORCE with Baseline

REINFORCE has a variance problem. Suppose all returns are positive (e.g., a survival task). Then every action's probability gets increased, just by different amounts. The signal is "this action was good" even if it was the worst available — it was just less good. What we really want is: "was this action better or worse than average?"

Enter the baseline. We subtract a state-dependent quantity b(S_t) from the return:

θ_t+1 = θ_t + α γ^t (G_t − b(S_t)) ∇ ln π(A_t|S_t, θ_t)

The baseline b(S_t) can be any function of the state that doesn't depend on the action. The most common choice is the state-value function v̂(S_t, w), learned alongside the policy. This doesn't introduce bias (because E[∇ln π · b] = 0 for any action-independent b), but it dramatically reduces variance.

The intuition: Without a baseline, REINFORCE says "this action got return 5, make it more likely." With a baseline of 4 (the average return from this state), it says "this action got 1 more than average, make it slightly more likely." The signal is now about whether the action was better than expected, not just good in absolute terms.

The quantity G_t − b(S_t) is called the advantage. When b(S_t) = v_π(S_t), the advantage is A_t = G_t − v_π(S_t), which measures how much better the actual outcome was compared to what was expected from that state. Positive advantage = better than average; negative = worse.

Baseline Effect on Variance

Compare the gradient estimates with and without a baseline. The red dots (no baseline) show raw G·∇lnπ samples — high variance. The green dots (with baseline) show (G−b)·∇lnπ — same mean, much less spread.

Check: Why doesn't subtracting a baseline from G_t introduce bias in the gradient estimate?

Because the baseline is always zero Because the baseline equals the return Because b(s) doesn't depend on the action, so E[b(s) ∇lnπ] = b(s) ∑ ∇π = 0

Chapter 6: Actor-Critic

REINFORCE (even with a baseline) is Monte Carlo: it needs the complete episode return G_t. The actor-critic architecture replaces G_t with a bootstrapped estimate, enabling fully online, incremental updates after every step.

The actor is the policy π(a|s, θ). The critic is a value function v̂(s, w) that estimates the expected return. The one-step actor-critic update is:

δ_t = R_t+1 + γ v̂(S_t+1, w) − v̂(S_t, w)

θ_t+1 = θ_t + α_θ γ^t δ_t ∇ ln π(A_t|S_t, θ_t)

w_t+1 = w_t + α_w δ_t ∇ v̂(S_t, w_t)

Two learners, one goal: The critic learns "how good is this state?" (value estimation, just like TD). The actor learns "which action to take?" (policy improvement). The TD error δ_t connects them: it tells the actor whether the outcome was better or worse than the critic expected. If δ > 0, the action was better than expected — make it more likely. If δ < 0, make it less likely.

The critic replaces G_t with R_t+1 + γv̂(S_t+1), which introduces bias (the value estimate may be wrong) but reduces variance (no need to wait for the noisy full return). This is the same TD-vs-MC trade-off we've seen throughout the book, now applied to policy gradients.

Actor (policy)

θ ← θ + α δ ∇ ln π

↑ TD error δ ↓

Critic (value)

w ← w + α δ ∇ v̂

pseudocode
# One-step Actor-Critic
Initialize θ (actor), w (critic)

for each episode:
  S ← start; I ← 1
  while S is not terminal:
    A ∼ π(·|S, θ)
    take A, observe R, S'
    δ ← R + γ v̂(S', w) − v̂(S, w)
    w ← w + α_w δ ∇ v̂(S, w)
    θ ← θ + α_θ I δ ∇ ln π(A|S, θ)
    I ← γ I
    S ← S'

Check: What role does the TD error δ_t play in actor-critic?

It tells the actor whether the action was better (δ>0) or worse (δ<0) than the critic expected It updates the environment model It sets the exploration rate

Chapter 7: Actor-Critic with Traces

We can combine actor-critic with the eligibility traces from Chapter 12. Both the actor and the critic get their own trace vector:

z_t^w = γ λ_w z_t−1^w + ∇ v̂(S_t, w) (critic trace)

z_t^θ = γ λ_θ z_t−1^θ + I_t ∇ ln π(A_t|S_t, θ) (actor trace)

The updates become: w += α_w δ_t z_t^w and θ += α_θ δ_t z_t^θ. The λ parameters control the bias-variance trade-off for each: higher λ means more reliance on actual returns (lower bias, higher variance).

The benefit: Without traces, one-step actor-critic assigns credit only to the immediately preceding action. With traces, credit propagates back through the entire recent trajectory. This is especially important for problems with delayed rewards — the action that "caused" the reward might have happened many steps ago. Traces spread the TD error signal back to those earlier actions.

The actor and critic can use different λ values. A common choice is λ_w = 0.8 for the critic (some bootstrapping) and λ_θ = 0.9 for the actor (more reliance on returns, since policy updates benefit from lower bias). But these are tuned empirically.

This is the full actor-critic with eligibility traces — a method that combines the strengths of policy gradient (smooth optimization, stochastic policies) with the strengths of TD learning (bootstrapping, online updates) and eligibility traces (multi-step credit assignment). It's the ancestor of modern algorithms like A3C and PPO.

Check: Why does actor-critic benefit from eligibility traces?

Traces reduce the number of parameters needed Traces propagate the TD error backward, assigning credit to actions taken many steps before the reward Traces make the policy deterministic

Chapter 8: Continuing Problems

For continuing tasks (no episodes, infinite stream of experience), the discounted formulation has a subtle issue: with function approximation, the policy gradient theorem requires a well-defined state distribution, which the discounted setting provides. But an alternative is the average reward formulation.

Instead of maximizing discounted return, we maximize the average reward per step:

r(π) = lim_T→∞ (1/T) ∑_t=1^T E[R_t | π]

The differential return replaces the discounted return:

G_t = R_t+1 − r(π) + R_t+2 − r(π) + R_t+3 − r(π) + …

The shift: Instead of discounting future rewards, we subtract the average reward from each step. States that yield above-average reward have positive differential value; below-average states have negative value. This is cleaner than discounting for continuing tasks because γ doesn't artificially devalue the future.

The policy gradient theorem has an average-reward version:

∇ r(θ) = ∑_s d_π(s) ∑_a q_π(s, a) ∇ π(a|s, θ)

where now q_π is the differential action-value function and d_π is the stationary distribution under π. The form is identical to the episodic case. The differential TD error is δ_t = R_t+1 − r̅ + v̂(S_t+1) − v̂(S_t), where r̅ is a running estimate of the average reward.

Check: In the average reward formulation, what replaces the discount factor γ?

Subtraction of the average reward from each step's reward A larger step size An eligibility trace

Chapter 9: Continuous Actions

This is where policy gradient methods truly shine. For continuous action spaces (controlling a joint angle, a steering wheel, a throttle), parameterize the policy as a Gaussian distribution:

π(a|s, θ) = (1 / σ(s)√(2π)) exp( −(a − μ(s))² / 2σ(s)² )

The mean μ(s) and standard deviation σ(s) are functions of the state, parameterized by θ. For example, μ(s) = θ_μ^T x(s) (linear) or μ(s) = neural_net(s; θ_μ). Similarly for σ(s), often parameterized as exp(θ_σ^T x(s)) to keep it positive.

The beauty: To sample an action, just compute μ and σ from the state, then draw a = μ + σ · ε where ε ~ N(0,1). To compute ∇lnπ, take derivatives of the Gaussian log-density. Both are simple, closed-form operations. No discretization, no max over actions.

The score function for a Gaussian policy decomposes into two parts:

∇_{θ_μ} ln π = (a − μ) / σ² · ∇_{θ_μ} μ(s)

∇_{θ_σ} ln π = ((a − μ)² / σ² − 1) · ∇_{θ_σ} σ(s)

The mean gradient pushes μ toward actions that performed well (positive advantage). The variance gradient increases σ when actions far from the mean performed well (need to explore more) and decreases σ when actions near the mean performed best (we've found the sweet spot, narrow the search).

Gaussian Policy

Adjust the mean μ and standard deviation σ of a Gaussian policy. The curve shows the probability density over continuous actions. The orange region is the most likely action range.

μ = 0.5

σ = 1.0

For multi-dimensional actions (e.g., joint angles for each arm segment), the policy becomes a multivariate Gaussian. Each dimension can have its own mean and variance, and the covariance structure can be diagonal (common) or full (more expressive but more parameters).

Check: In a Gaussian policy, what does decreasing σ mean?

The agent explores more broadly The agent becomes more certain, concentrating actions near the mean μ The average reward increases

Chapter 10: Summary

Policy gradient methods complete the RL toolkit. Where value-based methods struggle (continuous actions, stochastic optimal policies), policy gradients provide a clean, principled alternative. And the two paradigms combine beautifully in actor-critic architectures.

Method	Core Idea	Key Property
REINFORCE	MC policy gradient	Unbiased but high variance
REINFORCE + baseline	Subtract v̂(s) from return	Same mean, much lower variance
One-step AC	TD error replaces return	Online, bootstrapped, lower variance
AC + traces	Add eligibility traces	Multi-step credit assignment
Average reward AC	Continuing tasks	No discount factor needed
Gaussian policy	Normal distribution over actions	Natural fit for continuous control

The complete picture: After 13 chapters, you have the full foundation of reinforcement learning. Value functions (DP, MC, TD, n-step, traces) learn what states are worth. Policy gradients learn what to do. Actor-critic combines both. Function approximation scales to large problems. The deadly triad warns you of pitfalls. Everything in modern deep RL — PPO, SAC, A3C, IMPALA — builds on these foundations.

What policy gradients give us:

• Continuous action spaces (natural)
• Stochastic optimal policies
• Smooth optimization landscape
• Compatible with any differentiable π
• Foundation for modern deep RL

Limitations:

• High variance (especially REINFORCE)
• Sample inefficient vs Q-learning
• Converge to local optima, not global
• Sensitive to step size
• Need many episodes to learn well

The road ahead: Chapter 12 gave us traces for credit assignment. This chapter gave us policy gradients for direct policy optimization. Modern deep RL combines these with neural networks: PPO = clipped policy gradient + actor-critic. SAC = maximum entropy actor-critic. IMPALA = off-policy actor-critic with V-trace. The concepts you've learned are the DNA of every modern RL algorithm.

"One can parameterize policies in a much more compact way than value functions.
For many problems, the policy is the simpler, more natural thing to learn."
— Richard S. Sutton & Andrew G. Barto

Check: What is the key advantage of actor-critic over pure REINFORCE?

Actor-critic doesn't need a policy Actor-critic uses the TD error (bootstrapping) instead of the full return, enabling online updates with lower variance Actor-critic always finds the global optimum