Learning the policy directly — no value function required, no argmax needed.
Everything in this book so far has followed the same recipe: learn a value function (V or Q), then derive a policy from it (usually by acting greedily: pick the action with the highest value). The policy is implicit — it falls out of the value function. But what if we learned the policy directly?
A policy gradient method parameterizes the policy as π(a|s, θ) — a function from states to action probabilities, controlled by parameters θ (like weights in a neural network). Instead of learning values, we learn θ by gradient ascent on some measure of performance.
Why might this be better? Consider a robot arm that can move in any direction. A value-based method would need to evaluate every possible direction (infinitely many!) and pick the best. A policy gradient method parameterizes the direction as, say, a Gaussian distribution, and adjusts its mean and variance. It naturally handles continuous actions.
Or consider a game where the optimal strategy is stochastic — you need to randomize (like bluffing in poker). Value-based methods with ε-greedy can't learn the right randomization. Policy gradient methods output probabilities directly, so they can learn any stochastic policy.
How do we parameterize a policy? For discrete actions, the most common approach is the softmax over action preferences. Each state-action pair has a preference h(s, a, θ), and the policy is:
This is the same softmax function used in neural networks. The preferences h can be linear: h(s, a, θ) = θT x(s, a), or they can be the output of a neural network. The key property: every action always has nonzero probability, so the policy can be stochastic.
The action preferences are not values — they don't estimate expected returns. They are relative quantities: adding the same constant to all preferences in a state doesn't change the policy. Only the differences between preferences matter. Think of them as "how much the policy favors each action," on an arbitrary scale.
Adjust the action preferences h1, h2, h3 for three actions. The bar chart shows the resulting softmax probabilities. Notice how increasing one preference steals probability from the others.
Why bother with a whole new paradigm? Policy gradient methods have three fundamental advantages over value-based methods:
Q-learning requires maxa Q(s, a). With continuous actions (a real number or a vector), this max is itself an optimization problem. Policy gradient methods parameterize the action distribution directly (e.g., as a Gaussian with learned mean and variance), sidestepping the max entirely.
Some problems have optimal policies that are stochastic. Rock-paper-scissors: the optimal strategy is to randomize uniformly. A value-based method with greedy selection can only produce deterministic policies (or ε-greedy, which randomizes the wrong way). Policy gradient methods output probabilities, so they can represent any stochastic policy.
Small changes to θ produce small changes to π(a|s,θ), which produce small changes to performance. The optimization landscape is smooth. In contrast, small changes to a value function can cause the greedy policy to flip an action (if one Q-value crosses another), producing a discontinuous jump in behavior. This makes value-based methods more brittle in practice.
Here's the core challenge. Our performance measure J(θ) depends on the policy, which affects which states are visited, which affects the returns. The state distribution dπ(s) depends on θ. Taking the gradient ∇J seems to require differentiating through the state distribution — which depends on the environment dynamics we don't know.
The policy gradient theorem (Sutton et al., 2000) shows this isn't necessary. The gradient has a remarkably simple form:
We can rewrite this using the log-derivative trick. Since ∇π(a|s,θ) = π(a|s,θ) ∇ln π(a|s,θ), the gradient becomes an expectation:
This is an expectation under the policy π, which means we can estimate it by sampling. Take actions from π, observe returns, and use them to estimate qπ(S, A). Multiply by ∇ln π and you have a sample of the gradient. This is the foundation of all policy gradient algorithms.
The term ∇ln π(A|S,θ) is called the score function. It points in the direction that makes action A more likely in state S. The gradient says: adjust θ to make good actions (high q) more likely and bad actions (low q) less likely.
REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It uses the complete return Gt as an unbiased estimate of qπ(St, At), giving the update:
That's it. Run an episode. At each step, multiply the return by the score function. Update θ. The return Gt tells the algorithm how good the outcome was; ∇ln π tells it which direction to push θ to make the chosen action more or less likely.
REINFORCE is a Monte Carlo method: it needs complete episodes to compute Gt. This gives it high variance (returns are noisy) but zero bias (Gt is an unbiased estimate of qπ). The high variance means slow learning — the algorithm needs many episodes to average out the noise.
A 4-state corridor where the middle state has reversed actions (left goes right, right goes left). The optimal policy is stochastic: go right with probability 0.59. An ε-greedy policy can't represent this — it either goes right almost always (and gets trapped) or randomizes too much. Watch REINFORCE learn the correct stochastic policy.
pseudocode # REINFORCE (Monte Carlo Policy Gradient) Initialize θ arbitrarily for each episode: Generate trajectory S0, A0, R1, ..., ST following π(·|·,θ) for t = 0, 1, ..., T−1: G ← ∑k=t+1T γk−t−1 Rk θ ← θ + α γt G ∇ ln π(At|St, θ)
REINFORCE has a variance problem. Suppose all returns are positive (e.g., a survival task). Then every action's probability gets increased, just by different amounts. The signal is "this action was good" even if it was the worst available — it was just less good. What we really want is: "was this action better or worse than average?"
Enter the baseline. We subtract a state-dependent quantity b(St) from the return:
The baseline b(St) can be any function of the state that doesn't depend on the action. The most common choice is the state-value function v̂(St, w), learned alongside the policy. This doesn't introduce bias (because E[∇ln π · b] = 0 for any action-independent b), but it dramatically reduces variance.
The quantity Gt − b(St) is called the advantage. When b(St) = vπ(St), the advantage is At = Gt − vπ(St), which measures how much better the actual outcome was compared to what was expected from that state. Positive advantage = better than average; negative = worse.
Compare the gradient estimates with and without a baseline. The red dots (no baseline) show raw G·∇lnπ samples — high variance. The green dots (with baseline) show (G−b)·∇lnπ — same mean, much less spread.
REINFORCE (even with a baseline) is Monte Carlo: it needs the complete episode return Gt. The actor-critic architecture replaces Gt with a bootstrapped estimate, enabling fully online, incremental updates after every step.
The actor is the policy π(a|s, θ). The critic is a value function v̂(s, w) that estimates the expected return. The one-step actor-critic update is:
The critic replaces Gt with Rt+1 + γv̂(St+1), which introduces bias (the value estimate may be wrong) but reduces variance (no need to wait for the noisy full return). This is the same TD-vs-MC trade-off we've seen throughout the book, now applied to policy gradients.
pseudocode # One-step Actor-Critic Initialize θ (actor), w (critic) for each episode: S ← start; I ← 1 while S is not terminal: A ∼ π(·|S, θ) take A, observe R, S' δ ← R + γ v̂(S', w) − v̂(S, w) w ← w + αw δ ∇ v̂(S, w) θ ← θ + αθ I δ ∇ ln π(A|S, θ) I ← γ I S ← S'
We can combine actor-critic with the eligibility traces from Chapter 12. Both the actor and the critic get their own trace vector:
The updates become: w += αw δt ztw and θ += αθ δt ztθ. The λ parameters control the bias-variance trade-off for each: higher λ means more reliance on actual returns (lower bias, higher variance).
The actor and critic can use different λ values. A common choice is λw = 0.8 for the critic (some bootstrapping) and λθ = 0.9 for the actor (more reliance on returns, since policy updates benefit from lower bias). But these are tuned empirically.
This is the full actor-critic with eligibility traces — a method that combines the strengths of policy gradient (smooth optimization, stochastic policies) with the strengths of TD learning (bootstrapping, online updates) and eligibility traces (multi-step credit assignment). It's the ancestor of modern algorithms like A3C and PPO.
For continuing tasks (no episodes, infinite stream of experience), the discounted formulation has a subtle issue: with function approximation, the policy gradient theorem requires a well-defined state distribution, which the discounted setting provides. But an alternative is the average reward formulation.
Instead of maximizing discounted return, we maximize the average reward per step:
The differential return replaces the discounted return:
The policy gradient theorem has an average-reward version:
where now qπ is the differential action-value function and dπ is the stationary distribution under π. The form is identical to the episodic case. The differential TD error is δt = Rt+1 − r̅ + v̂(St+1) − v̂(St), where r̅ is a running estimate of the average reward.
This is where policy gradient methods truly shine. For continuous action spaces (controlling a joint angle, a steering wheel, a throttle), parameterize the policy as a Gaussian distribution:
The mean μ(s) and standard deviation σ(s) are functions of the state, parameterized by θ. For example, μ(s) = θμT x(s) (linear) or μ(s) = neural_net(s; θμ). Similarly for σ(s), often parameterized as exp(θσT x(s)) to keep it positive.
The score function for a Gaussian policy decomposes into two parts:
The mean gradient pushes μ toward actions that performed well (positive advantage). The variance gradient increases σ when actions far from the mean performed well (need to explore more) and decreases σ when actions near the mean performed best (we've found the sweet spot, narrow the search).
Adjust the mean μ and standard deviation σ of a Gaussian policy. The curve shows the probability density over continuous actions. The orange region is the most likely action range.
For multi-dimensional actions (e.g., joint angles for each arm segment), the policy becomes a multivariate Gaussian. Each dimension can have its own mean and variance, and the covariance structure can be diagonal (common) or full (more expressive but more parameters).
Policy gradient methods complete the RL toolkit. Where value-based methods struggle (continuous actions, stochastic optimal policies), policy gradients provide a clean, principled alternative. And the two paradigms combine beautifully in actor-critic architectures.
| Method | Core Idea | Key Property |
|---|---|---|
| REINFORCE | MC policy gradient | Unbiased but high variance |
| REINFORCE + baseline | Subtract v̂(s) from return | Same mean, much lower variance |
| One-step AC | TD error replaces return | Online, bootstrapped, lower variance |
| AC + traces | Add eligibility traces | Multi-step credit assignment |
| Average reward AC | Continuing tasks | No discount factor needed |
| Gaussian policy | Normal distribution over actions | Natural fit for continuous control |
What policy gradients give us:
• Continuous action spaces (natural)
• Stochastic optimal policies
• Smooth optimization landscape
• Compatible with any differentiable π
• Foundation for modern deep RL
Limitations:
• High variance (especially REINFORCE)
• Sample inefficient vs Q-learning
• Converge to local optima, not global
• Sensitive to step size
• Need many episodes to learn well
The road ahead: Chapter 12 gave us traces for credit assignment. This chapter gave us policy gradients for direct policy optimization. Modern deep RL combines these with neural networks: PPO = clipped policy gradient + actor-critic. SAC = maximum entropy actor-critic. IMPALA = off-policy actor-critic with V-trace. The concepts you've learned are the DNA of every modern RL algorithm.