Sutton, McAllester, Singh, Mansour — 1999

Policy Gradient Theorem

The foundational theorem proving that policy gradients can be estimated with function approximation — enabling convergent actor-critic methods and every modern policy gradient algorithm.

Prerequisites: REINFORCE + MDPs + Value functions
10
Chapters
5+
Simulations

Chapter 0: The Problem

By 1999, reinforcement learning had a deep problem. The dominant approach — learn a value function, derive a policy from it — was theoretically broken.

Q-learning, SARSA, and dynamic programming methods had all been proven unable to converge for simple MDPs with function approximation. The reason was fundamental: a tiny change in the estimated Q-value of an action could flip whether that action was selected or not. This discontinuous policy change, caused by a continuous value update, made convergence guarantees impossible.

Consider a state where action A has estimated value 5.01 and action B has value 5.00. The greedy policy picks A. Now suppose an update changes A's value to 4.99. Suddenly the policy switches entirely to B. A 0.02 change in one value causes a 100% change in behavior. Chain these discontinuities across states, and the learning process oscillates or diverges.

The value-function trap: Value-function methods try to find the right values to make the right policy emerge implicitly. But with function approximation, the "right values" might not be representable. And even if they are, the greedy policy extraction step creates discontinuities that prevent convergence. The result: Q-learning with a neural network could provably diverge on trivially simple MDPs.

Williams's REINFORCE (1992) offered an alternative — directly parameterize the policy and follow the gradient. But REINFORCE was impractically slow because it estimated gradients purely from returns, without using a value function for variance reduction.

The question: can we combine the best of both worlds — the convergence guarantees of direct policy optimization with the variance reduction of learned value functions?

Why do value-function methods with function approximation fail to converge?

Chapter 1: The Key Insight

Sutton et al. prove something remarkable: the gradient of expected reward with respect to policy parameters can be written in a form that does not involve the derivative of the state distribution.

Why is this surprising? Because changing the policy changes which states you visit. If you start going left instead of right, you end up in completely different states. The state distribution dπ(s) depends on the policy π. So the gradient of performance ∇θρ(π) should, in general, require ∇θdπ(s) — the derivative of the state visitation distribution.

But the Policy Gradient Theorem shows this term vanishes. The gradient depends only on:

  1. The state distribution dπ(s) itself (not its derivative)
  2. The gradient of the policy ∇θπ(s, a) (how changing θ changes action probabilities)
  3. The action-value function Qπ(s, a) (how good each action is)
θρ = ∑s dπ(s) ∑aθπ(s, a) Qπ(s, a)
Why this changes everything: If ∇θdπ(s) appeared in the gradient, we'd need to know how policy changes affect the entire trajectory of states — essentially requiring a model of the environment. But since it doesn't appear, we can estimate the gradient from samples: just follow the policy, observe states s ~ dπ, and compute ∇θπ(s,a) Qπ(s,a). No model needed.

This is the foundation. Everything else — actor-critic methods, PPO, TRPO, SAC — is a consequence of this one equation.

What is the surprising property of the Policy Gradient Theorem?

Chapter 2: Setup

The paper considers the standard MDP framework with a critical twist: the policy is explicitly parameterized.

The MDP

States s ∈ S, actions a ∈ A, transition probabilities Pss′a, and expected rewards Rsa. The agent follows a parameterized policy π(s, a, θ) = Pr{at = a | st = s, θ}, where θ ∈ ℝl with l « |S|.

The key requirement: π must be differentiable with respect to θ. This means the policy changes smoothly with parameters — no discontinuities like greedy value-based policies.

Two performance measures

The paper handles both formulations simultaneously:

Average reward
ρ(π) = ∑s dπ(s) ∑a π(s,a) Rsa — long-term average reward per step
Start-state
ρ(π) = E{∑ γt-1rt | s0, π} — discounted return from a fixed start state

In both cases, dπ(s) is the state visitation distribution under policy π. For average reward, it's the stationary distribution. For start-state, it's the discounted state visitation: dπ(s) = ∑t γt Pr{st = s | s0, π}.

Why differentiable policies matter: With a neural network policy, a small change in weights θ produces a small change in action probabilities π(s,a). This means a small change in the state distribution dπ(s). Everything varies smoothly — no discontinuities. This is the key structural advantage over value-based methods where the greedy policy creates cliffs.
Why does the paper require the policy π(s, a, θ) to be differentiable with respect to θ?

Chapter 3: Theorem 1 — The Gradient

The central result. For any MDP, in either formulation:

θρ = ∑s dπ(s) ∑aθπ(s, a) Qπ(s, a)

The proof idea

Start with the value function gradient:

θVπ(s) = ∑a [∇θπ(s,a) Qπ(s,a) + π(s,a) ∇θQπ(s,a)]

The second term is tricky — ∇θQπ(s,a) involves the gradient of the value at successor states, which themselves depend on the policy. Expanding recursively:

θQπ(s,a) = ∑s′ Pss′aθVπ(s′)

Substituting back and unrolling the recursion, the successor-state terms cascade through the entire state space. The final result collects all terms as a weighted sum over states, where the weights are exactly dπ(s) — the state visitation distribution.

Where ∇dπ goes: The naive product rule on ρ = ∑s dπ(s) ∑a π(s,a) Qπ(s,a) would produce terms involving ∇dπ(s). The proof shows these terms cancel with the recursive ∇Qπ terms. The result: the gradient as if dπ were fixed. This is not an approximation — it's exact.

What this means for sampling

Since dπ(s) appears only as a weighting (not differentiated), we can estimate the gradient by simply following the policy: sample s ~ dπ, then compute ∑aθπ(s,a) Qπ(s,a). This is a sample from the gradient — no model of the environment needed.

If we replace Qπ(s,a) with actual returns Rt, we recover Williams's REINFORCE. But REINFORCE uses returns as a noisy estimate of Qπ — can we do better with a learned approximation?

The Policy Gradient

The gradient depends on: (1) state distribution dπ(s) — which states you visit, (2) policy gradient ∇π(s,a) — how parameters affect action probabilities, and (3) Qπ(s,a) — how good each action is. Crucially, ∇dπ does NOT appear.

In the Policy Gradient Theorem, why does ∇θdπ(s) not appear in the final gradient expression?

Chapter 4: The Missing Term

Let's pause and appreciate why the absence of ∇θdπ(s) is so profound.

The naive approach

If you just took the product rule on ρ = ∑s dπ(s) Vπ(s), you'd get:

∇ρ = ∑s [∇dπ(s) · Vπ(s) + dπ(s) · ∇Vπ(s)]

The first term ∇dπ(s) requires knowing how policy changes affect the entire trajectory distribution — essentially a complete model of the environment's dynamics. This is exactly what model-free RL tries to avoid.

Why it cancels

The magic happens because changing the policy at state s has two effects:

  1. Direct effect: Different action probabilities → different immediate rewards
  2. Indirect effect: Different actions → different successor states → different future state distribution

The proof shows that the indirect effect (changing which states you visit) is already captured by the Qπ(s,a) term. The Q-value at state s already accounts for all future consequences of taking action a — including all the state-distribution changes downstream. So including ∇dπ would be double-counting.

The deep reason: Qπ(s,a) is defined as the expected sum of future rewards from taking a in s and then following π. All the "state distribution" effects of the policy are baked into Qπ itself. The gradient theorem says: you only need to know "how good is each action from each state" (Qπ) and "how does the policy change" (∇π). You don't need to know how the state distribution shifts — that's already inside Q.
Why would including ∇dπ(s) in the gradient estimate be double-counting?

Chapter 5: Theorem 2 — Approximation

Theorem 1 requires the true Qπ(s,a), which we don't know. In practice, we'll learn an approximation fw(s,a). Can we substitute fw for Qπ and still get the correct gradient?

Theorem 2 says yes, under two conditions:

Condition 1: Compatibility

wfw(s, a) = ∇θπ(s, a) / π(s, a) = ∇θ ln π(s, a)

The value approximator's features must match the policy's score function. For a linear approximator, this means fw(s,a) = wTθ ln π(s,a).

Condition 2: Minimized projection error

s dπ(s) ∑a π(s,a) [Qπ(s,a) − fw(s,a)] ∇wfw(s,a) = 0

The approximation error is orthogonal to the features — i.e., fw has been trained to a local minimum of the squared error.

The key result: Under these two conditions, substituting fw for Qπ gives exactly the same gradient:
θρ = ∑s dπ(s) ∑aθπ(s, a) fw(s, a)
Not approximately — exactly. The error in fw doesn't bias the gradient because the error is orthogonal to the policy gradient direction.

The proof in one line

From the compatibility condition, the error E = Qπ(s,a) − fw(s,a) satisfies ∑ dπ ∑ ∇π · E = 0 (by condition 2 and compatibility). So:

∑ dπ ∑ ∇π Qπ = ∑ dπ ∑ ∇π fw + ∑ dπ ∑ ∇π E = ∑ dπ ∑ ∇π fw + 0

The approximation error is invisible to the gradient.

Compatible Function Approximation

The error E = Qπ − fw (red) is orthogonal to the policy gradient direction ∇π (blue). Their dot product is zero, so the error doesn't bias the gradient estimate.

What does the "compatibility condition" ∇wfw = ∇θln π ensure?

Chapter 6: Advantages

The paper reveals that fw doesn't need to approximate Qπ — it only needs to get the relative value of actions correct in each state. This means fw is really approximating the advantage function:

Aπ(s, a) = Qπ(s, a) − Vπ(s)

The advantage measures how much better action a is than average in state s. This is because adding any function of state v(s) to fw doesn't change the gradient:

aθπ(s,a) · v(s) = v(s) ∑aθπ(s,a) = v(s) · ∇θ 1 = 0

Since ∑ ∇π(s,a) = ∇∑π(s,a) = ∇1 = 0, any state-dependent baseline drops out.

The advantage insight: The gradient only cares about "which actions are better than others in each state" — not "how good is this state overall." Adding Vπ(s) as a baseline doesn't change the expected gradient but dramatically reduces variance. This is the theoretical justification for advantage-based methods (A2C, GAE) and connects directly to Williams's reinforcement baseline in REINFORCE.

In practice, this means:

  1. Learn a value function Vφ(s) to estimate Vπ(s)
  2. Estimate advantages as Â(s,a) = r + γVφ(s′) − Vφ(s)
  3. Use  in place of Qπ in the policy gradient
Why does adding a state-dependent baseline v(s) to the value function not change the policy gradient?

Chapter 7: Convergence

Theorem 3 is the crown jewel: the first proof that policy iteration with arbitrary differentiable function approximation converges to a locally optimal policy.

The algorithm

Alternate between:

  1. Critic update: Find wk such that fw satisfies the projection condition (3) — i.e., train the value approximator until convergence.
  2. Actor update: θk+1 = θk + αks dπk(s) ∑aθπk(s,a) fwk(s,a)

Convergence guarantee

Under standard conditions (αk → 0, ∑ αk = ∞, bounded second derivatives of π), the sequence converges such that:

limk→∞θρ(πk) = 0

In words: the gradient vanishes — we reach a local optimum. The proof applies Proposition 3.5 from Bertsekas and Tsitsiklis (1996): Theorem 2 guarantees the update direction is the true gradient, the bounded second derivatives ensure the objective is smooth, and the step-size conditions are standard for stochastic approximation.

What this means historically: Before this paper, there was no convergence proof for any RL algorithm using general function approximation for both policy and value function. Q-learning with neural nets could diverge. SARSA could oscillate. Policy iteration could cycle. This theorem broke the deadlock — policy gradient methods with compatible function approximation are provably convergent. Every modern deep RL algorithm traces its theoretical legitimacy to this result.
What does Theorem 3 prove for the first time?

Chapter 8: Actor-Critic

The paper provides the theoretical foundation for actor-critic architectures, which had existed since Barto, Sutton, and Anderson (1983) but lacked convergence guarantees.

The actor-critic structure

Two components with separate parameters:

The actor follows the gradient ∇θρ using the critic's estimates. The critic learns from TD errors. The two update asynchronously — the critic providing variance reduction for the actor's gradient estimates.

Connecting to REINFORCE

REINFORCE is a special case where there is no critic — Qπ(s,a) is estimated directly from returns Rt. This is high-variance but unbiased. The actor-critic uses a learned fw to reduce variance while Theorem 2 guarantees no bias (under compatibility).

Actor-Critic Architecture

The actor selects actions; the environment returns rewards and next states; the critic evaluates the action and provides gradient signal to the actor.

The variance-bias tradeoff: REINFORCE (no critic): unbiased, high variance, slow learning. Actor-critic (learned critic): low variance, potentially biased if compatibility condition isn't exactly met. In practice, approximate compatibility works well enough — the bias is small and the variance reduction is enormous. This tradeoff is why actor-critic methods dominate modern deep RL.
How does the actor-critic architecture improve upon REINFORCE?

Chapter 9: Connections

What the Policy Gradient Theorem built on

REINFORCE (Williams, 1992): Proved that policy gradient estimates can be obtained without explicit gradient computation. The PGT generalizes REINFORCE by showing that a learned value function can replace the high-variance return estimates without introducing bias.

Actor-critic methods (Barto, Sutton, Anderson, 1983): The PGT provides the first convergence proof for these methods, which had been used heuristically for 16 years.

What the Policy Gradient Theorem enabled

A3C/A2C (Mnih et al., 2016): Asynchronous advantage actor-critic — uses the advantage function (justified by Chapter 6) with parallel workers for stability.

TRPO (Schulman et al., 2015): Trust Region Policy Optimization — constrains policy updates to prevent large steps that destabilize learning, using the PGT gradient.

PPO (Schulman et al., 2017): Proximal Policy Optimization — approximates TRPO's constraint with a clipped surrogate, becoming the default algorithm for deep RL.

SAC (Haarnoja et al., 2018): Soft Actor-Critic — adds entropy regularization for exploration, still built on the PGT's actor-critic framework.

RLHF (Ouyang et al., 2022): Reinforcement Learning from Human Feedback for language models — PPO applied to LLM policy optimization, directly descended from this theorem.

The 25-year legacy: Every policy gradient algorithm used today — from the PPO training ChatGPT to the SAC controlling robots — is a direct descendant of this 7-page NeurIPS paper. The Policy Gradient Theorem didn't just solve one problem; it created an entire paradigm that replaced the value-function approach as the dominant framework for deep RL.

Cheat sheet

Theorem 1
∇ρ = ∑ dπ(s) ∑ ∇π(s,a) Qπ(s,a) — no ∇dπ needed
Theorem 2
Compatible fw ⊆ Qπ gives exact gradient when ∇wf = ∇ ln π
Theorem 3
Policy iteration with differentiable function approximation converges to local optimum
Advantage
A(s,a) = Q(s,a) − V(s) — the gradient only needs relative action values
Impact
Theoretical foundation for A3C, TRPO, PPO, SAC, RLHF
Which modern algorithms are direct descendants of the Policy Gradient Theorem?