The foundational theorem proving that policy gradients can be estimated with function approximation — enabling convergent actor-critic methods and every modern policy gradient algorithm.
By 1999, reinforcement learning had a deep problem. The dominant approach — learn a value function, derive a policy from it — was theoretically broken.
Q-learning, SARSA, and dynamic programming methods had all been proven unable to converge for simple MDPs with function approximation. The reason was fundamental: a tiny change in the estimated Q-value of an action could flip whether that action was selected or not. This discontinuous policy change, caused by a continuous value update, made convergence guarantees impossible.
Consider a state where action A has estimated value 5.01 and action B has value 5.00. The greedy policy picks A. Now suppose an update changes A's value to 4.99. Suddenly the policy switches entirely to B. A 0.02 change in one value causes a 100% change in behavior. Chain these discontinuities across states, and the learning process oscillates or diverges.
Williams's REINFORCE (1992) offered an alternative — directly parameterize the policy and follow the gradient. But REINFORCE was impractically slow because it estimated gradients purely from returns, without using a value function for variance reduction.
The question: can we combine the best of both worlds — the convergence guarantees of direct policy optimization with the variance reduction of learned value functions?
Sutton et al. prove something remarkable: the gradient of expected reward with respect to policy parameters can be written in a form that does not involve the derivative of the state distribution.
Why is this surprising? Because changing the policy changes which states you visit. If you start going left instead of right, you end up in completely different states. The state distribution dπ(s) depends on the policy π. So the gradient of performance ∇θρ(π) should, in general, require ∇θdπ(s) — the derivative of the state visitation distribution.
But the Policy Gradient Theorem shows this term vanishes. The gradient depends only on:
This is the foundation. Everything else — actor-critic methods, PPO, TRPO, SAC — is a consequence of this one equation.
The paper considers the standard MDP framework with a critical twist: the policy is explicitly parameterized.
States s ∈ S, actions a ∈ A, transition probabilities Pss′a, and expected rewards Rsa. The agent follows a parameterized policy π(s, a, θ) = Pr{at = a | st = s, θ}, where θ ∈ ℝl with l « |S|.
The key requirement: π must be differentiable with respect to θ. This means the policy changes smoothly with parameters — no discontinuities like greedy value-based policies.
The paper handles both formulations simultaneously:
In both cases, dπ(s) is the state visitation distribution under policy π. For average reward, it's the stationary distribution. For start-state, it's the discounted state visitation: dπ(s) = ∑t γt Pr{st = s | s0, π}.
The central result. For any MDP, in either formulation:
Start with the value function gradient:
The second term is tricky — ∇θQπ(s,a) involves the gradient of the value at successor states, which themselves depend on the policy. Expanding recursively:
Substituting back and unrolling the recursion, the successor-state terms cascade through the entire state space. The final result collects all terms as a weighted sum over states, where the weights are exactly dπ(s) — the state visitation distribution.
Since dπ(s) appears only as a weighting (not differentiated), we can estimate the gradient by simply following the policy: sample s ~ dπ, then compute ∑a ∇θπ(s,a) Qπ(s,a). This is a sample from the gradient — no model of the environment needed.
If we replace Qπ(s,a) with actual returns Rt, we recover Williams's REINFORCE. But REINFORCE uses returns as a noisy estimate of Qπ — can we do better with a learned approximation?
The gradient depends on: (1) state distribution dπ(s) — which states you visit, (2) policy gradient ∇π(s,a) — how parameters affect action probabilities, and (3) Qπ(s,a) — how good each action is. Crucially, ∇dπ does NOT appear.
Let's pause and appreciate why the absence of ∇θdπ(s) is so profound.
If you just took the product rule on ρ = ∑s dπ(s) Vπ(s), you'd get:
The first term ∇dπ(s) requires knowing how policy changes affect the entire trajectory distribution — essentially a complete model of the environment's dynamics. This is exactly what model-free RL tries to avoid.
The magic happens because changing the policy at state s has two effects:
The proof shows that the indirect effect (changing which states you visit) is already captured by the Qπ(s,a) term. The Q-value at state s already accounts for all future consequences of taking action a — including all the state-distribution changes downstream. So including ∇dπ would be double-counting.
Theorem 1 requires the true Qπ(s,a), which we don't know. In practice, we'll learn an approximation fw(s,a). Can we substitute fw for Qπ and still get the correct gradient?
Theorem 2 says yes, under two conditions:
The value approximator's features must match the policy's score function. For a linear approximator, this means fw(s,a) = wT ∇θ ln π(s,a).
The approximation error is orthogonal to the features — i.e., fw has been trained to a local minimum of the squared error.
From the compatibility condition, the error E = Qπ(s,a) − fw(s,a) satisfies ∑ dπ ∑ ∇π · E = 0 (by condition 2 and compatibility). So:
The approximation error is invisible to the gradient.
The error E = Qπ − fw (red) is orthogonal to the policy gradient direction ∇π (blue). Their dot product is zero, so the error doesn't bias the gradient estimate.
The paper reveals that fw doesn't need to approximate Qπ — it only needs to get the relative value of actions correct in each state. This means fw is really approximating the advantage function:
The advantage measures how much better action a is than average in state s. This is because adding any function of state v(s) to fw doesn't change the gradient:
Since ∑ ∇π(s,a) = ∇∑π(s,a) = ∇1 = 0, any state-dependent baseline drops out.
In practice, this means:
Theorem 3 is the crown jewel: the first proof that policy iteration with arbitrary differentiable function approximation converges to a locally optimal policy.
Alternate between:
Under standard conditions (αk → 0, ∑ αk = ∞, bounded second derivatives of π), the sequence converges such that:
In words: the gradient vanishes — we reach a local optimum. The proof applies Proposition 3.5 from Bertsekas and Tsitsiklis (1996): Theorem 2 guarantees the update direction is the true gradient, the bounded second derivatives ensure the objective is smooth, and the step-size conditions are standard for stochastic approximation.
The paper provides the theoretical foundation for actor-critic architectures, which had existed since Barto, Sutton, and Anderson (1983) but lacked convergence guarantees.
Two components with separate parameters:
The actor follows the gradient ∇θρ using the critic's estimates. The critic learns from TD errors. The two update asynchronously — the critic providing variance reduction for the actor's gradient estimates.
REINFORCE is a special case where there is no critic — Qπ(s,a) is estimated directly from returns Rt. This is high-variance but unbiased. The actor-critic uses a learned fw to reduce variance while Theorem 2 guarantees no bias (under compatibility).
The actor selects actions; the environment returns rewards and next states; the critic evaluates the action and provides gradient signal to the actor.
REINFORCE (Williams, 1992): Proved that policy gradient estimates can be obtained without explicit gradient computation. The PGT generalizes REINFORCE by showing that a learned value function can replace the high-variance return estimates without introducing bias.
Actor-critic methods (Barto, Sutton, Anderson, 1983): The PGT provides the first convergence proof for these methods, which had been used heuristically for 16 years.
A3C/A2C (Mnih et al., 2016): Asynchronous advantage actor-critic — uses the advantage function (justified by Chapter 6) with parallel workers for stability.
TRPO (Schulman et al., 2015): Trust Region Policy Optimization — constrains policy updates to prevent large steps that destabilize learning, using the PGT gradient.
PPO (Schulman et al., 2017): Proximal Policy Optimization — approximates TRPO's constraint with a clipped surrogate, becoming the default algorithm for deep RL.
SAC (Haarnoja et al., 2018): Soft Actor-Critic — adds entropy regularization for exploration, still built on the PGT's actor-critic framework.
RLHF (Ouyang et al., 2022): Reinforcement Learning from Human Feedback for language models — PPO applied to LLM policy optimization, directly descended from this theorem.