Policy Gradient Theorem

Chapter 0: The Problem

By 1999, reinforcement learning had a deep problem. The dominant approach — learn a value function, derive a policy from it — was theoretically broken.

Q-learning, SARSA, and dynamic programming methods had all been shown unable to converge to any policy on simple MDPs with simple function approximators. The reason was fundamental: a tiny change in the estimated Q-value of an action could flip whether that action was selected or not. This discontinuous policy change, caused by a continuous value update, was a key obstacle to establishing convergence guarantees.

Consider a state where action A has estimated value 5.01 and action B has value 5.00. The greedy policy picks A. Now suppose an update changes A's value to 4.99. Suddenly the policy switches entirely to B. A 0.02 change in one value causes a 100% change in behavior. Chain these discontinuities across states, and the learning process oscillates or diverges.
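The flip described above can be reproduced in a few lines. This is a minimal sketch: the softmax policy over the same two numbers is an assumption added here purely for contrast with the greedy rule, and the function names are hypothetical.

```python
import math

def greedy_policy(values):
    """Action probabilities for the greedy policy (ties go to the first action)."""
    best = max(range(len(values)), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(len(values))]

def softmax_policy(prefs):
    """Action probabilities for a softmax policy over the same numbers."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

before = [5.01, 5.00]  # estimates for action A and action B
after = [4.99, 5.00]   # A's estimate drops by 0.02

greedy_before, greedy_after = greedy_policy(before), greedy_policy(after)
soft_before, soft_after = softmax_policy(before), softmax_policy(after)

# How much the probability of action A moves under each rule.
greedy_shift = abs(greedy_before[0] - greedy_after[0])
soft_shift = abs(soft_before[0] - soft_after[0])
print("greedy :", greedy_before, "->", greedy_after)
print("softmax:", soft_before, "->", soft_after)
```

The greedy rule moves all probability mass from A to B, while the smooth rule shifts it by well under one percentage point.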

The value-function trap: Value-function methods try to find the right values so that the right policy emerges implicitly. But with function approximation, the "right values" might not be representable. And even if they are, the greedy policy extraction step creates discontinuities that can prevent convergence. The result: Q-learning with function approximation can diverge even on very small MDPs.

Williams's REINFORCE (1992) offered an alternative — directly parameterize the policy and follow the gradient. But REINFORCE was impractically slow because it estimated gradients purely from returns, without using a value function for variance reduction.

The question: can we combine the best of both worlds — the convergence guarantees of direct policy optimization with the variance reduction of learned value functions?

Why do value-function methods with function approximation fail to converge?

The greedy policy extraction creates discontinuities: a tiny change in estimated values can completely change the policy, preventing stable convergence
The function approximator isn't expressive enough
The learning rate is too large

Chapter 1: The Key Insight

Sutton et al. prove something remarkable: the gradient of expected reward with respect to policy parameters can be written in a form that does not involve the derivative of the state distribution.

Why is this surprising? Because changing the policy changes which states you visit. If you start going left instead of right, you end up in completely different states. The state distribution d^π(s) depends on the policy π. So the gradient of performance ∇_θρ(π) should, in general, require ∇_θd^π(s) — the derivative of the state visitation distribution.

But the Policy Gradient Theorem shows this term vanishes. The gradient depends only on:

The state distribution d^π(s) itself (not its derivative)
The gradient of the policy ∇_θπ(s, a) (how changing θ changes action probabilities)
The action-value function Q^π(s, a) (how good each action is)

∇_θρ = ∑_s d^π(s) ∑_a ∇_θπ(s, a) Q^π(s, a)

Why this changes everything: If ∇_θd^π(s) appeared in the gradient, we'd need to know how policy changes affect the entire trajectory of states — essentially requiring a model of the environment. But since it doesn't appear, we can estimate the gradient from samples: just follow the policy, observe states s ~ d^π, and compute ∇_θπ(s,a) Q^π(s,a). No model needed.
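One way to make the sampling claim explicit is to multiply and divide by π(s,a). The rewrite below is a sketch that uses only the identity ∇_θπ = π ∇_θ ln π:

```latex
\nabla_\theta \rho
  = \sum_s d^\pi(s) \sum_a \nabla_\theta \pi(s,a)\, Q^\pi(s,a)
  = \sum_s d^\pi(s) \sum_a \pi(s,a)\, \nabla_\theta \ln \pi(s,a)\, Q^\pi(s,a)
```

With s drawn from d^π and a drawn from π(s, ·), the quantity ∇_θ ln π(s,a) Q^π(s,a) is therefore a sample whose expectation is the gradient. In the discounted start-state formulation d^π is not normalized, so the expectation matches the gradient up to a constant factor.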

This is the foundation. Actor-critic methods, PPO, TRPO, and SAC all build on this one equation.

What is the surprising property of the Policy Gradient Theorem?

The gradient of expected reward does NOT require the derivative of the state distribution, even though the policy changes which states are visited
The gradient is always positive
The gradient can be computed in closed form

Chapter 2: Setup

The paper considers the standard MDP framework with a critical twist: the policy is explicitly parameterized.

The MDP

States s ∈ S, actions a ∈ A, transition probabilities P_ss′^a, and expected rewards R_s^a. The agent follows a parameterized policy π(s, a, θ) = Pr{a_t = a | s_t = s, θ}, where θ ∈ ℝ^l with l ≪ |S|.

The key requirement: π must be differentiable with respect to θ. This means the policy changes smoothly with parameters — no discontinuities like greedy value-based policies.
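A concrete differentiable policy helps here. The sketch below assumes a softmax (Gibbs) policy that is linear in per-action features, and checks its analytic gradient against finite differences. The feature vectors, the value of θ, and the function names are all hypothetical.

```python
import math

def softmax_probs(theta, features):
    """pi(a) proportional to exp(theta . phi_a) for one state; features[a] is phi_a."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def score(theta, features, a):
    """grad_theta ln pi(a) = phi_a - sum_b pi(b) phi_b."""
    probs = softmax_probs(theta, features)
    dim = len(theta)
    mean_phi = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(dim)]
    return [features[a][i] - mean_phi[i] for i in range(dim)]

def grad_pi(theta, features, a):
    """grad_theta pi(a) = pi(a) * grad_theta ln pi(a)."""
    p = softmax_probs(theta, features)[a]
    return [p * g for g in score(theta, features, a)]

# Hypothetical features for three actions in a single state.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.3, -0.2]
a = 2

analytic = grad_pi(theta, features, a)
eps = 1e-6
numeric = []
for i in range(len(theta)):
    up = list(theta)
    up[i] += eps
    down = list(theta)
    down[i] -= eps
    diff = softmax_probs(up, features)[a] - softmax_probs(down, features)[a]
    numeric.append(diff / (2 * eps))

max_err = max(abs(x - y) for x, y in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric :", numeric)
```

Because the probabilities are smooth in θ, the finite-difference slope and the analytic gradient agree closely, which is exactly what a greedy policy cannot offer.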

Two performance measures

The paper handles both formulations simultaneously:

Average reward

ρ(π) = ∑_s d^π(s) ∑_a π(s,a) R_s^a — long-term average reward per step

Start-state

ρ(π) = E{∑_{t=1}^∞ γ^{t−1} r_t | s₀, π}, the discounted return from a fixed start state

In both cases, d^π(s) is the state visitation distribution under policy π. For average reward, it's the stationary distribution. For start-state, it's the discounted state visitation: d^π(s) = ∑_{t=0}^∞ γ^t Pr{s_t = s | s₀, π}.
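The discounted visitation can be computed directly from its definition. This is a minimal sketch, assuming a hypothetical two-state chain whose transition matrix P_pi already has the policy folded in. The infinite sum is truncated at a fixed horizon.

```python
def discounted_visitation(P_pi, start, gamma, horizon=2000):
    """d(s) = sum_t gamma^t Pr{s_t = s | s_0 = start}, truncated after `horizon` steps."""
    n = len(P_pi)
    dist = [1.0 if s == start else 0.0 for s in range(n)]  # Pr{s_0 = s}
    d = [0.0] * n
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            d[s] += discount * dist[s]
        # Advance one step: Pr{s_{t+1} = s2} = sum_s Pr{s_t = s} * P_pi[s][s2]
        dist = [sum(dist[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
        discount *= gamma
    return d

# Hypothetical state-to-state transition probabilities under a fixed policy.
P_pi = [[0.9, 0.1],
        [0.4, 0.6]]
gamma = 0.9
d = discounted_visitation(P_pi, start=0, gamma=gamma)
print("d =", d, "sum =", sum(d))
```

Note that this d^π is not normalized: its entries sum to 1/(1 − γ) rather than to 1.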

Why differentiable policies matter: With a neural network policy, a small change in weights θ produces a small change in action probabilities π(s,a). This means a small change in the state distribution d^π(s). Everything varies smoothly — no discontinuities. This is the key structural advantage over value-based methods where the greedy policy creates cliffs.

Why does the paper require the policy π(s, a, θ) to be differentiable with respect to θ?

Differentiability ensures smooth policy changes: small Δθ causes small changes in action probabilities, unlike greedy value-based policies, which have discontinuities
To enable faster training
To use less memory

Chapter 3: Theorem 1 — The Gradient

The central result. For any MDP, in either formulation:

∇_θρ = ∑_s d^π(s) ∑_a ∇_θπ(s, a) Q^π(s, a)

The proof idea

Start with the value function gradient:

∇_θV^π(s) = ∑_a [∇_θπ(s,a) Q^π(s,a) + π(s,a) ∇_θQ^π(s,a)]

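The second term is tricky: ∇_θQ^π(s,a) involves the gradient of the value at successor states, which themselves depend on the policy. In the start-state formulation, Q^π(s,a) = R_s^a + γ ∑_s′ P_ss′^a V^π(s′), so:

∇_θQ^π(s,a) = γ ∑_s′ P_ss′^a ∇_θV^π(s′)

In the average-reward formulation there is no discount and Q^π(s,a) = R_s^a − ρ(π) + ∑_s′ P_ss′^a V^π(s′), so the same step gives ∇_θQ^π(s,a) = −∇_θρ + ∑_s′ P_ss′^a ∇_θV^π(s′).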

Substituting back and unrolling the recursion, the successor-state terms cascade through the entire state space. The final result collects all terms as a weighted sum over states, where the weights are exactly d^π(s) — the state visitation distribution.
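For the start-state formulation with discount γ, the unrolled recursion can be written out as follows. Here Pr(s → x, k, π) denotes the probability of going from state s to state x in k steps under π, the notation used in the paper's proof:

```latex
\nabla_\theta V^\pi(s)
  = \sum_{x} \sum_{k=0}^{\infty} \gamma^{k} \Pr(s \to x, k, \pi)
    \sum_a \nabla_\theta \pi(x,a)\, Q^\pi(x,a)

\nabla_\theta \rho = \nabla_\theta V^\pi(s_0)
  = \sum_{x} d^\pi(x) \sum_a \nabla_\theta \pi(x,a)\, Q^\pi(x,a),
\qquad
d^\pi(x) = \sum_{k=0}^{\infty} \gamma^{k} \Pr(s_0 \to x, k, \pi)
```

The weights on each state x are sums of discounted k-step visitation probabilities, which is exactly the definition of d^π in the start-state case.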

Where ∇d^π goes: Applying the product rule directly to ρ = ∑_s d^π(s) ∑_a π(s,a) R_s^a would produce terms involving ∇d^π(s). The proof never needs them: unrolling the recursion for ∇V^π accounts for the policy's effect on future states, and d^π(s) emerges only as the weight on each state. The result is the gradient as if d^π were fixed. This is not an approximation; it is exact.

What this means for sampling

Since d^π(s) appears only as a weighting (not differentiated), we can estimate the gradient by simply following the policy: sample s ~ d^π, then compute ∑_a ∇_θπ(s,a) Q^π(s,a). This is a sample from the gradient — no model of the environment needed.
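The theorem can be checked numerically on a tiny problem. The sketch below assumes a hypothetical 2-state, 2-action MDP, a tabular softmax policy, and the discounted start-state formulation. It compares the theorem's expression with a finite-difference gradient of ρ = V^π(s₀).

```python
import math

GAMMA = 0.9
N_S, N_A = 2, 2
# Hypothetical MDP: P[s][a][s2] are transition probabilities, R[s][a] expected rewards.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]

def policy(theta):
    """Tabular softmax: pi[s][a] proportional to exp(theta[s][a])."""
    pi = []
    for s in range(N_S):
        exps = [math.exp(x) for x in theta[s]]
        z = sum(exps)
        pi.append([e / z for e in exps])
    return pi

def evaluate(pi, sweeps=2000):
    """Iterative policy evaluation; returns V and Q for the discounted formulation."""
    V = [0.0] * N_S
    for _ in range(sweeps):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(N_A)) for s in range(N_S)]
    return V, Q

def visitation(pi, start=0, steps=2000):
    """Discounted visitation d(s) = sum_t gamma^t Pr{s_t = s | s_0 = start}."""
    dist = [1.0 if s == start else 0.0 for s in range(N_S)]
    d, disc = [0.0] * N_S, 1.0
    for _ in range(steps):
        d = [d[s] + disc * dist[s] for s in range(N_S)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][s2]
                    for s in range(N_S) for a in range(N_A)) for s2 in range(N_S)]
        disc *= GAMMA
    return d

def theorem_gradient(theta):
    """sum_s d(s) sum_a grad pi(s,a) Q(s,a), one entry per parameter theta[s][b]."""
    pi = policy(theta)
    _, Q = evaluate(pi)
    d = visitation(pi)
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            # d pi[s][a] / d theta[s][b] = pi[s][a] * (1[a == b] - pi[s][b])
            grad[s][b] = d[s] * sum(pi[s][a] * ((a == b) - pi[s][b]) * Q[s][a]
                                    for a in range(N_A))
    return grad

def rho(theta):
    """Start-state performance: V(s_0) with s_0 = 0."""
    return evaluate(policy(theta))[0][0]

theta = [[0.2, -0.1], [0.0, 0.5]]
analytic = theorem_gradient(theta)
eps = 1e-5
max_err = 0.0
for s in range(N_S):
    for b in range(N_A):
        up = [row[:] for row in theta]
        up[s][b] += eps
        down = [row[:] for row in theta]
        down[s][b] -= eps
        numeric = (rho(up) - rho(down)) / (2 * eps)
        max_err = max(max_err, abs(numeric - analytic[s][b]))
print("max |finite difference - theorem| =", max_err)
```

The finite-difference gradient perturbs θ and re-solves the whole MDP, so it implicitly includes every change in the state distribution. The theorem's expression never differentiates d^π, yet the two agree to numerical precision.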

If we replace Q^π(s,a) with actual returns R_t, we recover Williams's REINFORCE. But REINFORCE uses returns as a noisy estimate of Q^π — can we do better with a learned approximation?

Figure: The Policy Gradient. The gradient depends on: (1) state distribution d^π(s), which states you visit; (2) policy gradient ∇π(s,a), how parameters affect action probabilities; and (3) Q^π(s,a), how good each action is. Crucially, ∇d^π does NOT appear.

In the Policy Gradient Theorem, why does ∇_θd^π(s) not appear in the final gradient expression?

The recursive expansion of ∇Q^π through successor states produces terms that exactly cancel the ∇d^π terms from the product rule
d^π doesn't depend on θ
The paper approximates by dropping that term

Chapter 4: The Missing Term

Let's pause and appreciate why the absence of ∇_θd^π(s) is so profound.

The naive approach

If you just applied the product rule to ρ = ∑_s d^π(s) ∑_a π(s,a) R_s^a, you'd get:

∇ρ = ∑_s [∇d^π(s) · ∑_a π(s,a) R_s^a + d^π(s) · ∑_a ∇π(s,a) R_s^a]

The first term ∇d^π(s) requires knowing how policy changes affect the entire trajectory distribution — essentially a complete model of the environment's dynamics. This is exactly what model-free RL tries to avoid.

Why it cancels

The magic happens because changing the policy at state s has two effects:

Direct effect: Different action probabilities → different immediate rewards
Indirect effect: Different actions → different successor states → different future state distribution

The proof shows that the indirect effect (changing which states you visit) is already captured by the Q^π(s,a) term. The Q-value at state s already accounts for all future consequences of taking action a — including all the state-distribution changes downstream. So including ∇d^π would be double-counting.

The deep reason: Q^π(s,a) is defined as the expected sum of future rewards from taking a in s and then following π. All the "state distribution" effects of the policy are baked into Q^π itself. The gradient theorem says: you only need to know "how good is each action from each state" (Q^π) and "how does the policy change" (∇π). You don't need to know how the state distribution shifts — that's already inside Q.

Why would including ∇d^π(s) in the gradient estimate be double-counting?

Q^π(s,a) already accounts for all future consequences, including state distribution changes: it's the expected sum of ALL future rewards
d^π is constant
The environment is deterministic

Chapter 5: Theorem 2 — Approximation

Theorem 1 requires the true Q^π(s,a), which we don't know. In practice, we'll learn an approximation f_w(s,a). Can we substitute f_w for Q^π and still get the correct gradient?

Theorem 2 says yes, under two conditions:

Condition 1: Compatibility

∇_wf_w(s, a) = ∇_θπ(s, a) / π(s, a) = ∇_θ ln π(s, a)

The value approximator's features must match the policy's score function. For a linear approximator, this means f_w(s,a) = w^T ∇_θ ln π(s,a).

Condition 2: Minimized projection error

∑_s d^π(s) ∑_a π(s,a) [Q^π(s,a) − f_w(s,a)] ∇_wf_w(s,a) = 0

The approximation error is orthogonal to the features — i.e., f_w has been trained to a local minimum of the squared error.
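To see what Condition 1 demands in practice, assume a softmax policy that is linear in per-action features. Its score function is φ_a − ∑_b π(b) φ_b, so a linear compatible critic must use exactly those centered features. The sketch below is for a single state; the features, θ, and w are hypothetical.

```python
import math

def softmax_probs(theta, features):
    """pi(a) proportional to exp(theta . phi_a) for one state."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def compatible_features(theta, features):
    """psi_a = grad_theta ln pi(a) = phi_a - sum_b pi(b) phi_b for a softmax policy."""
    probs = softmax_probs(theta, features)
    dim = len(theta)
    mean_phi = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(dim)]
    return [[phi[i] - mean_phi[i] for i in range(dim)] for phi in features]

def f_w(w, psi_a):
    """Linear compatible critic: f_w(a) = w . psi_a, so grad_w f_w(a) = psi_a."""
    return sum(wi * xi for wi, xi in zip(w, psi_a))

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.3, -0.2]
w = [0.7, -1.3]  # arbitrary critic weights

probs = softmax_probs(theta, features)
psi = compatible_features(theta, features)
mean_f = sum(p * f_w(w, psi_a) for p, psi_a in zip(probs, psi))
print("policy-weighted mean of f_w:", mean_f)
```

Whatever w is, the critic's policy-weighted mean in each state is zero, because the centered features average to zero under π. A compatible critic of this kind can therefore only express how actions compare within a state, not how good the state is overall.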

The key result: Under these two conditions, substituting f_w for Q^π gives exactly the same gradient:

∇_θρ = ∑_s d^π(s) ∑_a ∇_θπ(s, a) f_w(s, a)

Not approximately — exactly. The error in f_w doesn't bias the gradient because the error is orthogonal to the policy gradient direction.

The proof in one line

Substituting the compatibility condition ∇_wf_w = ∇_θπ / π into condition 2 cancels the factor π(s,a), so the error E = Q^π(s,a) − f_w(s,a) satisfies ∑_s d^π(s) ∑_a ∇_θπ(s,a) · E = 0. So:

∑ d^π ∑ ∇π Q^π = ∑ d^π ∑ ∇π f_w + ∑ d^π ∑ ∇π E = ∑ d^π ∑ ∇π f_w + 0

The approximation error is invisible to the gradient.
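A small numerical check makes this concrete. The sketch assumes a single state (so d^π = 1), a softmax policy with two parameters over four actions, and hypothetical true action values Q. The critic weights solve Condition 2 written as a 2×2 linear system.

```python
import math

def softmax_probs(theta, features):
    """pi(a) proportional to exp(theta . phi_a) for one state."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
theta = [0.3, -0.2]
Q = [1.0, 3.0, -2.0, 0.5]  # hypothetical true action values

probs = softmax_probs(theta, features)
mean_phi = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(2)]
psi = [[phi[i] - mean_phi[i] for i in range(2)] for phi in features]  # grad ln pi

# Condition 2 as normal equations: (sum_a pi psi psi^T) w = sum_a pi psi Q
A = [[sum(p * x[i] * x[j] for p, x in zip(probs, psi)) for j in range(2)]
     for i in range(2)]
b = [sum(p * x[i] * q for p, x, q in zip(probs, psi, Q)) for i in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
w = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Compatible critic values f_w(a) = w . psi_a
f = [w[0] * x[0] + w[1] * x[1] for x in psi]

# grad pi(a) = pi(a) * psi_a, so both gradients are policy-weighted sums.
grad_true = [sum(p * x[i] * q for p, x, q in zip(probs, psi, Q)) for i in range(2)]
grad_critic = [sum(p * x[i] * fa for p, x, fa in zip(probs, psi, f)) for i in range(2)]

max_value_error = max(abs(q - fa) for q, fa in zip(Q, f))
max_grad_error = max(abs(g1 - g2) for g1, g2 in zip(grad_true, grad_critic))
print("largest |Q - f_w|     :", max_value_error)
print("largest gradient error:", max_grad_error)
```

With only two weights, the critic cannot reproduce four action values, so f_w differs visibly from Q. The two gradients still agree to rounding error.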

Figure: Compatible Function Approximation. The error E = Q^π − f_w (red) is orthogonal to the policy gradient direction ∇π (blue). Their dot product is zero, so the error doesn't bias the gradient estimate.

What does the "compatibility condition" ∇_wf_w = ∇_θln π ensure?

That the value approximation error is orthogonal to the policy gradient, so substituting f_w for Q^π gives the exact gradient
That f_w exactly equals Q^π
That the policy is optimal

Chapter 6: Advantages

The paper reveals that f_w doesn't need to approximate Q^π — it only needs to get the relative value of actions correct in each state. This means f_w is really approximating the advantage function:

A^π(s, a) = Q^π(s, a) − V^π(s)

The advantage measures how much better action a is than average in state s. This is because adding any function of state v(s) to f_w doesn't change the gradient:

∑_a ∇_θπ(s,a) · v(s) = v(s) ∑_a ∇_θπ(s,a) = v(s) · ∇_θ 1 = 0

Since ∑ ∇π(s,a) = ∇∑π(s,a) = ∇1 = 0, any state-dependent baseline drops out.

The advantage insight: The gradient only cares about "which actions are better than others in each state," not "how good is this state overall." Subtracting V^π(s) as a baseline doesn't change the expected gradient but can dramatically reduce variance. This is the theoretical justification for advantage-based methods (A2C, GAE) and connects directly to Williams's reinforcement baseline in REINFORCE.
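Both halves of that claim can be checked exactly in a toy case. The sketch assumes a single state with two actions, a tabular softmax policy, and hypothetical action values. It enumerates the actions to get the exact mean and variance of the sampled estimator ∂ ln π(a)/∂θ₀ · (Q(a) − b), with and without a baseline b.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def estimator_stats(prefs, Q, baseline):
    """Exact mean and variance over a ~ pi of d ln pi(a)/d prefs[0] * (Q[a] - baseline)."""
    pi = softmax(prefs)
    # For a tabular softmax, d ln pi(a) / d prefs[0] = 1[a == 0] - pi[0].
    samples = [((1.0 if a == 0 else 0.0) - pi[0]) * (Q[a] - baseline)
               for a in range(len(pi))]
    mean = sum(p * g for p, g in zip(pi, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(pi, samples))
    return mean, var

prefs = [0.0, 0.0]  # both actions equally likely
Q = [10.0, 11.0]    # hypothetical action values, large compared with their difference
V = sum(p * q for p, q in zip(softmax(prefs), Q))

mean_plain, var_plain = estimator_stats(prefs, Q, baseline=0.0)
mean_base, var_base = estimator_stats(prefs, Q, baseline=V)
print("no baseline :", mean_plain, var_plain)
print("baseline = V:", mean_base, var_base)
```

The mean is identical in both cases, which is the ∑_a ∇π = 0 argument at work. The variance shrinks because the estimator no longer carries the large common offset shared by both action values.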

In practice, this means:

Learn a value function V_φ(s) to estimate V^π(s)
Estimate advantages as Â(s,a) = r + γV_φ(s′) − V_φ(s)
Use Â in place of Q^π in the policy gradient
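The advantage estimate in the steps above is a one-line computation. This is a minimal sketch: the `done` flag for episode boundaries is an assumption added here, and the value inputs stand in for a learned V_φ.

```python
def td_advantage(reward, value_s, value_next, gamma, done=False):
    """One-step advantage estimate r + gamma * V(s') - V(s).

    The bootstrap term gamma * V(s') is dropped when the episode ends at s'.
    """
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

adv = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.5)
adv_terminal = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.5, done=True)
print(adv, adv_terminal)
```

A positive estimate means the transition turned out better than V_φ(s) predicted, so the policy gradient raises the probability of the action that was taken.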

Why does adding a state-dependent baseline v(s) to the value function not change the policy gradient?

Because ∑_a ∇_θπ(s,a) = 0 (probabilities sum to 1), so v(s) · ∑ ∇π = 0
Because v(s) is always zero
Because v(s) is the optimal value function

Chapter 7: Convergence

Theorem 3 is the crown jewel: the first proof that policy iteration with arbitrary differentiable function approximation converges to a locally optimal policy.

The algorithm

Alternate between:

Critic update: Find w_k such that f_{w_k} satisfies the projection condition (Condition 2 in Chapter 5), i.e., train the value approximator until convergence.
Actor update: θ_{k+1} = θ_k + α_k ∑_s d^{π_k}(s) ∑_a ∇_θπ_k(s,a) f_{w_k}(s,a)

Convergence guarantee

Under standard conditions (bounded rewards, step sizes with α_k → 0 and ∑_k α_k = ∞, and bounded second derivatives of π with respect to θ), the sequence converges such that:

lim_{k→∞} ∇_θρ(π_k) = 0

In words: the gradient vanishes — we reach a local optimum. The proof applies Proposition 3.5 from Bertsekas and Tsitsiklis (1996): Theorem 2 guarantees the update direction is the true gradient, the bounded second derivatives ensure the objective is smooth, and the step-size conditions are standard for stochastic approximation.
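The alternation of critic and actor updates can be sketched on the smallest possible problem. This example assumes a single state (so d^π = 1), two actions with hypothetical known values Q, a one-parameter softmax policy, and step sizes α_k = 1/k. In a real setting the critic would be fit from sampled experience rather than from the true Q.

```python
import math

Q = [1.0, 0.0]    # hypothetical true action values in a single-state problem
PHI = [1.0, 0.0]  # one scalar feature per action, so pi(0) = sigmoid(theta)

def policy(theta):
    exps = [math.exp(theta * phi) for phi in PHI]
    z = sum(exps)
    return [e / z for e in exps]

def critic_update(theta):
    """Solve the projection condition for the compatible critic f_w(a) = w * psi_a."""
    pi = policy(theta)
    mean_phi = sum(p * phi for p, phi in zip(pi, PHI))
    psi = [phi - mean_phi for phi in PHI]  # grad_theta ln pi(a)
    A = sum(p * x * x for p, x in zip(pi, psi))
    b = sum(p * x * q for p, x, q in zip(pi, psi, Q))
    return b / A, psi

def actor_gradient(theta):
    """sum_a grad pi(a) f_w(a), using grad pi(a) = pi(a) * psi_a."""
    pi = policy(theta)
    w, psi = critic_update(theta)
    return sum(p * x * (w * x) for p, x in zip(pi, psi))

theta = 0.0
initial_grad = actor_gradient(theta)
for k in range(1, 1001):
    alpha_k = 1.0 / k  # alpha_k -> 0 while sum_k alpha_k diverges
    theta += alpha_k * actor_gradient(theta)

final_grad = actor_gradient(theta)
rho_final = sum(p * q for p, q in zip(policy(theta), Q))
print("theta:", theta, "gradient:", initial_grad, "->", final_grad, "rho:", rho_final)
```

In this toy problem the gradient shrinks as the policy concentrates on the better action. It only approaches zero in the limit, which is what the theorem's statement about lim_{k→∞} describes.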

What this means historically: Before this paper, there was no convergence proof for policy iteration using general differentiable function approximation for both policy and value function. Q-learning with neural nets could diverge. SARSA could oscillate. Policy iteration could cycle. This theorem broke the deadlock: policy gradient methods with compatible function approximation are provably convergent to a local optimum. Modern policy gradient and actor-critic algorithms trace their theoretical footing to this result.

What does Theorem 3 prove for the first time?

That policy iteration with arbitrary differentiable function approximation converges to a local optimum: the first convergence guarantee for RL with general function approximation
That Q-learning converges with neural networks
That the global optimum can always be found

Chapter 8: Actor-Critic

The paper provides the theoretical foundation for actor-critic architectures, which had existed since Barto, Sutton, and Anderson (1983) but lacked convergence guarantees.

The actor-critic structure

Two components with separate parameters:

Actor π(s, a; θ): the policy, updated by the policy gradient
Critic f_w(s, a): the value approximation, structured to satisfy the compatibility condition and trained to minimize the projection error

The actor follows the gradient ∇_θρ using the critic's estimates. The critic learns from TD errors. The two updates alternate, with the critic providing variance reduction for the actor's gradient estimates.

Connecting to REINFORCE

REINFORCE is a special case where there is no critic — Q^π(s,a) is estimated directly from returns R_t. This is high-variance but unbiased. The actor-critic uses a learned f_w to reduce variance while Theorem 2 guarantees no bias (under compatibility).

Figure: Actor-Critic Architecture. The actor selects actions; the environment returns rewards and next states; the critic evaluates the action and provides gradient signal to the actor.

The variance-bias tradeoff: REINFORCE (no critic): unbiased, high variance, slow learning. Actor-critic (learned critic): low variance, potentially biased if compatibility condition isn't exactly met. In practice, approximate compatibility works well enough — the bias is small and the variance reduction is enormous. This tradeoff is why actor-critic methods dominate modern deep RL.

How does the actor-critic architecture improve upon REINFORCE?

The learned critic provides low-variance gradient estimates compared to REINFORCE's high-variance return-based estimates, while Theorem 2 guarantees no bias under the compatibility condition
Actor-critic uses a different gradient
Actor-critic doesn't need a policy

Chapter 9: Connections

What the Policy Gradient Theorem built on

REINFORCE (Williams, 1992): Showed that an unbiased estimate of the policy gradient can be obtained from sampled returns alone, without a learned value function. The PGT generalizes REINFORCE by showing that a learned value function can replace the high-variance return estimates without introducing bias.

Actor-critic methods (Barto, Sutton, Anderson, 1983): The PGT provides the first convergence proof for these methods, which had been used heuristically for 16 years.

What the Policy Gradient Theorem enabled

A3C/A2C (Mnih et al., 2016): Asynchronous advantage actor-critic — uses the advantage function (justified by Chapter 6) with parallel workers for stability.

TRPO (Schulman et al., 2015): Trust Region Policy Optimization — constrains policy updates to prevent large steps that destabilize learning, using the PGT gradient.

PPO (Schulman et al., 2017): Proximal Policy Optimization — approximates TRPO's constraint with a clipped surrogate, becoming the default algorithm for deep RL.

SAC (Haarnoja et al., 2018): Soft Actor-Critic — adds entropy regularization for exploration, still built on the PGT's actor-critic framework.

RLHF (Ouyang et al., 2022): Reinforcement Learning from Human Feedback for language models — PPO applied to LLM policy optimization, directly descended from this theorem.

The 25-year legacy: The policy gradient algorithms in wide use today, from the PPO used to train ChatGPT to the SAC controlling robots, descend from this 7-page NeurIPS paper. The Policy Gradient Theorem didn't just solve one problem; it established a paradigm that now stands alongside the value-function approach as a dominant framework for deep RL.

Cheat sheet

Theorem 1

∇ρ = ∑ d^π(s) ∑ ∇π(s,a) Q^π(s,a) — no ∇d^π needed

Theorem 2

Compatible f_w ≈ Q^π gives the exact gradient when ∇_wf_w = ∇_θ ln π and f_w minimizes the projection error

Theorem 3

Policy iteration with differentiable function approximation converges to local optimum

Advantage

A(s,a) = Q(s,a) − V(s) — the gradient only needs relative action values

Impact

Theoretical foundation for A3C, TRPO, PPO, SAC, RLHF

Which modern algorithms are direct descendants of the Policy Gradient Theorem?

PPO, TRPO, A3C, SAC, and RLHF: all use the policy gradient with advantage baselines as established by this theorem
Only Q-learning variants
Only model-based methods