← Gleams
Stanford CS 224R · Lecture 3 · Deep RL

The Complete Guide to Policy Gradients

Every symbol explained. Every step derived. Every pitfall exposed. After this, you teach it.

Full derivations Worked examples On-policy → Off-policy Foundation for PPO
Roadmap

What You'll Master

Chapter 01

RL Foundations & Notation

Reinforcement learning is about an agent interacting with an environment over time, trying to discover — through trial and error — which actions lead to the most total reward. Think of a robot learning to walk: nobody programs exact motor commands; the robot tries stuff, gets a score, and adjusts.

Definition
State — st

The complete description of the "world" at time step t. For a walking robot: every joint position and velocity, torso angle, foot contacts, etc. The state is everything the environment needs to determine what happens next (Markov property). It's a vector of numbers.

Definition
Action — at

The decision the agent makes at time t. For the robot: torque on each motor. Actions can be discrete (left, right, jump) or continuous (3.7 N·m to the knee).

Definition
Trajectory — τ (tau)
τ = (s₁, a₁, s₂, a₂, …, sT, aT)

A full sequence of states and actions. T = time horizon (episode length). One trajectory = one "rollout" = recording every frame and button press of a video game playthrough.

Definition
Reward Function — r(s, a)

A scalar signal: how good is this state-action pair? For the robot, r(s,a) = forward velocity. We design this — it encodes our goal.

Definition
Policy — πθ(a | s)

The agent's strategy: a probability distribution over actions given a state. The subscript θ = neural network weights. This is what we learn.

Discrete actions → softmax over logits. Continuous → outputs mean (and variance) of a Gaussian.

The Trajectory Distribution

Trajectory Probability pθ(τ) = p(s₁) · ∏t=1T πθ(at | st) · p(st+1 | st, at)

Three ingredients:

p(s₁) — initial state distribution. Given by environment.

πθ(at|st) — our policy's choice. Only part we control via θ.

p(st+1|st,at) — environment dynamics. Unknown in model-free RL.

Intuition

Only the πθ terms depend on θ. Initial state and dynamics are fixed. When we take ∇θ, environment terms vanish. This is the key insight of the entire derivation.

On-Policy vs. Off-Policy vs. Offline

TermMeaningExample
OnlineAgent collects new data by actingRobot tries, gets feedback
OfflineLearn from fixed dataset, never actLearn from recorded demos
On-policyUses only data from current policyREINFORCE
Off-policyCan reuse data from past policiesIS-weighted PG, PPO
Chapter 02

The RL Objective

Everything reduces to one problem. Find θ that maximizes expected total reward:

The RL Objective maxθ J(θ) = 𝔼τ ~ pθ(τ) [ Σt=1T r(st, at) ]

maxθ — search over all parameter vectors for the best.

J(θ) — scalar: expected total reward. Make this as big as possible.

𝔼τ ~ pθ(τ) — expectation over trajectories. Both policy and env are stochastic → maximize the average.

Σt r(st,at) — total reward of one trajectory.

With shorthand r(τ) = Σt r(st,at):

J(θ) = 𝔼τ ~ pθ(τ)[ r(τ) ] = ∫ pθ(τ) · r(τ) dτ

We integrate over all possible trajectories. Can't compute this exactly — estimate with samples (actual rollouts).

Why expectations?

Same policy → wildly different outcomes across runs. We want a policy that works on average, not one that got lucky once.

Chapter 03

From Imitation to Policy Gradients

Previously: behavioral cloning — given expert (s,a) pairs, maximize likelihood:

Behavioral Cloning minθ − 𝔼(s,a)~𝒟[ log πθ(a|s) ]

θ JBC ≈ (1/N) Σi ( Σtθ log πθ(ai,t|si,t) )

Limitation: can never outperform the demonstrator. No improvement from practice.

The bridge: keep the same gradient structure, but weight each action by its reward:

Sneak Peek — Policy Gradientθ J(θ) ≈ (1/N) Σi ( Σtθ log πθ(ai,t|si,t) ) · r(τi)

↑ Imitation gradient × trajectory reward
Core Idea

High-reward trajectory → increase action likelihood. Low/negative reward → decrease. This is the formalization of trial and error.

Chapter 04

Deriving the Policy Gradient

The most important derivation in this course. Starting point: J(θ). Goal: ∇θJ(θ).

Step 1 — Write as integral
J(θ) = ∫ pθ(τ) · r(τ) dτ

Definition of expectation. Nothing fancy.

Step 2 — Differentiate
θ J(θ) = ∫ ∇θ pθ(τ) · r(τ) dτ

r(τ) doesn't depend on θ → gradient only hits pθ(τ).

Step 3 — Log-Derivative Trick (the key identity)
θ log f(θ) = ∇θ f(θ) / f(θ)
θ f(θ) = f(θ) · ∇θ log f(θ)

Why?θ pθ(τ) is intractable (unknown dynamics). By introducing the log, dynamics will cancel.

Step 4 — Substitute back
θ J(θ) = ∫ pθ(τ) · ∇θ log pθ(τ) · r(τ) dτ = 𝔼τ ~ pθ(τ)[ ∇θ log pθ(τ) · r(τ) ]

It's an expectation — estimable with samples!

Step 5 — Expand log pθ(τ), dynamics disappear
log pθ(τ) = log p(s₁) + Σt[ log πθ(at|st) + log p(st+1|st,at) ]

θ log pθ(τ) = Σt=1Tθ log πθ(at|st)

log p(s₁) and log p(st+1|st,at) vanish — they don't depend on θ. We never need the dynamics!

Step 6 — The Policy Gradient Theorem
Policy Gradient — Trajectory Formθ J(θ) = 𝔼τ ~ pθ(τ)[ ( Σt=1Tθ log πθ(at|st) ) · r(τ) ]
Step 7 — Monte Carlo estimation
Sample-Based Policy Gradientθ J(θ) ≈ (1/N) Σi=1N ( Σt=1Tθ log πθ(ai,t|si,t) ) · r(τi)

Unbiased but noisy. More trajectories N → less noise.

What does ∇θ log πθ(a|s) mean?

The direction in parameter space that most increases the probability of action a in state s. Multiply by reward → increase probability of good actions, decrease bad ones.

🔨 Derivation Derive the Log-Derivative Trick from the Definition of Derivative ✓ ATTEMPTED

The log-derivative trick states: ∇θ pθ(τ) = pθ(τ) · ∇θ log pθ(τ).

Your task: (1) Prove this identity using only the chain rule. (2) Then show WHY this is useful: prove that 𝔼τ~pθ[∇θ log pθ(τ)] = 0 (the "score function has zero mean" property). (3) Explain what goes wrong if you try to directly compute ∇θ ∫ pθ(τ) r(τ) dτ without this trick.

d/dx log f(x) = f'(x) / f(x). Rearrange: f'(x) = f(x) · d/dx log f(x). Replace f with pθ and d/dx with ∇θ. That's it — the identity is just the chain rule in disguise.
𝔼[∇ log p] = ∫ p · (∇p / p) dτ = ∫ ∇p dτ = ∇ ∫ p dτ = ∇(1) = 0. The gradient commutes with the integral (Leibniz rule), and probabilities sum to 1.
pθ(τ) = p(s1) ∏ πθ(at|st) p(st+1|st,at). The dynamics p(s'|s,a) are unknown — we can't differentiate through them. The log-trick converts the gradient into an expectation we can estimate by sampling, and the log factorization means unknown dynamics terms vanish.

Part 1 — The identity:

By the chain rule: ∇θ log pθ(τ) = ∇θ pθ(τ) / pθ(τ)

Multiply both sides by pθ(τ): ∇θ pθ(τ) = pθ(τ) · ∇θ log pθ(τ) ■

Part 2 — Zero mean:

𝔼τ~pθ[∇θ log pθ(τ)] = ∫ pθ(τ) ∇θ log pθ(τ) dτ = ∫ ∇θ pθ(τ) dτ = ∇θ ∫ pθ(τ) dτ = ∇θ 1 = 0 ■

Part 3 — Why the trick is necessary:

Direct approach: ∇θ J = ∫ ∇θ pθ(τ) · r(τ) dτ. This requires computing ∇θ pθ(τ), which requires differentiating through the unknown dynamics. The log trick converts this to 𝔼[∇ log p · r], which we estimate by sampling trajectories and only requires ∇θ log πθ(a|s) (the policy is known and differentiable).

The key insight: The log-derivative trick converts "differentiate through an intractable probability distribution" into "sample from it and multiply by a score." This pattern appears everywhere: variational inference (ELBO gradient), score-based diffusion models, and black-box optimization. It's arguably the single most important trick in modern ML.

🔗 Pattern Recognition
Policy Gradient = Reward-Weighted Maximum Likelihood
Supervised Learning (MLE)
θ L = (1/N) ∑iθ log pθ(yi|xi)
All examples weighted equally.
Policy Gradient
θ J = (1/N) ∑iθ log πθ(ai|si) · wi
Examples weighted by reward.

Behavioral cloning = MLE with wi = 1 for all examples. Policy gradients = MLE with wi = reward-to-go − baseline. The gradient formula is identical except for the weighting. This is why REINFORCE is sometimes called "reward-weighted regression." Good actions get up-weighted (like having more copies in the training set), bad actions get down-weighted (like removing them).

If REINFORCE is just weighted MLE, why can it outperform the demonstrator while behavioral cloning can't? (Hint: where do the weights come from?)

Chapter 05

The REINFORCE Algorithm

Algorithm: REINFORCE (Vanilla Policy Gradient)
  1. Initialize θ (randomly, from imitation, or heuristics)
  2. Loop forever:
    1. Collect N trajectories {τ₁…τN} by running πθ
    2. Compute gradient:
      θ J ≈ (1/N) Σi ( Σtθ log πθ(ai,t|si,t) ) · r(τi)
    3. Update: θ ← θ + α · ∇θ J (gradient ascent — maximizing)

Collect → compute → update → repeat. After updating θ, old data is invalid → recollect. This is on-policy.

Humanoid Walking

Reward = forward velocity. Five trajectories:

τ₁: falls backward → r = −2 · τ₂: falls forward → r = +2 · τ₃: stands still → r = 0 · τ₄: one step then falls → r = +1 · τ₅: step backward → r = −3

Gradient: increases likelihood of τ₂ (best), decreases τ₅ (worst). The policy learns to fall forward — the best it saw. Over many iterations: stumbling → stepping → walking. Very noisy.

Data Efficiency

Every gradient step needs fresh data. 5 min/trajectory × 100 per step × 10k steps = 6.6 years of robot time. Off-policy methods fix this.

Chapter 06

Improving the Gradient: Causality

Our gradient weights every action by the total trajectory reward. But action at step 5 can't affect reward at step 3.

Policy Gradient with Causality (Reward-to-Go)θ J ≈ (1/N) Σi Σt=1Tθ log πθ(ai,t|si,t) · ( Σt'=tT r(si,t', ai,t') )

Replace total reward Σt'=1T with future-only Σt'=tT.

Why it helps

Robot does great at step 1 (+10) but terrible at step 50 (−10). Without causality: total = 0, so gradient for all actions is zero — even the great one. With causality: step 1 sees its own future reward, properly crediting the good action. Reduces variance, still unbiased.

Formally: for t' < t, the expectation of ∇θ log πθ(at|st) · r(st',at') = 0. Past reward is independent of current action given the state. Removing these terms removes noise without changing the expected gradient.

Chapter 07

Improving the Gradient: Baselines

The Reward Scale Problem

All rewards positive: τ₁ (+2), τ₂ (+5), τ₃ (+8), τ₄ (+12). The gradient increases probability of every trajectory — including falling forward! We waste signal pushing everything up.

The Baseline Trick

Subtract constant b:

Policy Gradient with Baselineθ J = 𝔼[ ∇θ log pθ(τ) · ( r(τ) − b ) ]

Reward > b → positive weight. Reward < b → negative. Clear signal.

Proof: unbiased
𝔼[ ∇θ log pθ(τ) · b ] = b · ∫ pθ · (∇θ pθ / pθ) dτ = b · ∫ ∇θ pθ dτ = b · ∇θ ∫ pθ dτ = b · ∇θ (1) = 0 ✓

Best simple baseline: average reward b = (1/N) Σi r(τi).

Best Version So Farθ J ≈ (1/N) Σi Σtθ log πθ(ai,t|si,t) · ( Σt'=tT r(si,t',ai,t') − b )
Jacket Folding

Reward: 1 (folded), 0.5 (wrinkled), 0 (not folded). Trajectories: τ₁ (0), τ₂ (0), τ₃ (0), τ₄ (1). Baseline b = 0.25. Only τ₄ gets positive weight (+0.75); others get −0.25. Clear signal — but 3/4 trajectories carry the same weak signal. Still noisy. Use dense rewards and large batches.

Variance is the enemy

We estimate a gradient over an exponentially large trajectory space with a handful of samples. Dense rewards and large batches are the main practical levers.

🔨 Derivation Prove: Any State-Dependent Baseline Is Unbiased ✓ ATTEMPTED

We subtract b(st) from the reward-to-go: ∇θ J = 𝔼[∑tθ log πθ(at|st) · (Gt − b(st))].

Your task: Prove that for ANY function b(s) that depends only on the state (not the action), the gradient remains unbiased: 𝔼[∇θ log πθ(at|st) · b(st)] = 0.

Bonus: Show that b(s, a) = Qπ(s,a) would NOT be an unbiased baseline, and explain the intuition for why action-dependent baselines fail.

b(st) doesn't depend on at, so: 𝔼a~π[∇θ log πθ(a|s) · b(s)] = b(s) · 𝔼a~π[∇θ log πθ(a|s)]. You just need to show the inner expectation is zero.
a π(a|s) · ∇ log π(a|s) = ∑a ∇π(a|s) = ∇ ∑a π(a|s) = ∇ 1 = 0. Probabilities sum to 1 regardless of parameters, so their gradient sums to zero.
If b depends on a, you can't factor it out of 𝔼a~π[...]. The term becomes ∑a ∇π(a|s) · b(s,a), which is NOT ∇(∑ π · b) = ∇𝔼[b] (because ∇ of a product has two terms via product rule).

Main proof:

𝔼a~π[∇θ log πθ(a|s) · b(s)]

= b(s) · ∑a πθ(a|s) · ∇θ log πθ(a|s)   [b(s) factors out]

= b(s) · ∑a πθ(a|s) · ∇θπθ(a|s) / πθ(a|s)   [log-derivative identity]

= b(s) · ∑aθπθ(a|s)

= b(s) · ∇θa πθ(a|s)   [gradient commutes with finite sum]

= b(s) · ∇θ 1 = 0

Bonus — why action-dependent baselines bias:

If b = b(s,a): ∑a π(a|s) · ∇ log π(a|s) · b(s,a) = ∑a ∇π(a|s) · b(s,a) ≠ 0 in general. You'd need ∇(∑a π · b) − ∑a π · ∇b, introducing the extra ∑π∇b term. Intuitively: if the baseline depends on the action, subtracting it changes WHICH actions look good, not just the scale. It preferentially reduces the gradient for some actions over others.

The key insight:a ∇π(a|s) = 0 is the reason baselines work. It says: "the total probability mass is conserved under any parameter change." Pushing one action up necessarily pushes others down. A state-dependent baseline simply shifts all actions by the same constant, preserving relative ordering.

🔨 Derivation Natural Policy Gradient: The Fisher Information Connection ✓ ATTEMPTED

The vanilla gradient ∇θJ is not parameterization-invariant: if you reparameterize θ → φ(θ), the gradient direction changes. The natural gradient F−1θJ is invariant, where F is the Fisher information matrix.

Your task: (1) Show that the Fisher matrix F = 𝔼[∇ log π (∇ log π)T] is the Hessian of the KL divergence at θ' = θ. (2) Explain why F−1∇J gives the steepest ascent direction under a KL constraint.

DKLθ || πθ') = 𝔼a~πθ[log πθ(a|s) − log πθ'(a|s)]. At θ' = θ, DKL = 0. The first derivative also vanishes. The second-order Taylor expansion gives: DKL ≈ (1/2)(θ' − θ)T F (θ' − θ), where F is the Fisher matrix.
Steepest ascent under KL constraint = maxΔθ ∇JTΔθ subject to (1/2)ΔθTFΔθ ≤ ε. This is a quadratic program. Use Lagrange multipliers to get the solution Δθ ∝ F−1∇J.

Part 1 — Fisher = Hessian of KL:

DKLθ || πθ') = 𝔼a~πθ[log πθ(a|s) − log πθ'(a|s)]

At θ' = θ: DKL = 0. First derivative: ∇θ' DKL|θ'=θ = −𝔼[∇θ' log πθ']|θ'=θ = 0 (score has zero mean).

Second derivative: ∇2θ' DKL|θ'=θ = −𝔼[∇2 log π] = 𝔼[∇ log π (∇ log π)T] = F.

(The last equality uses the identity: −𝔼[∇2 log p] = 𝔼[∇ log p (∇ log p)T], which holds for any exponential family.)

Part 2 — Steepest ascent under KL:

maxΔθ ∇JTΔθ   s.t.   (1/2)ΔθTFΔθ ≤ ε

Lagrangian: L = ∇JTΔθ − λ[(1/2)ΔθTFΔθ − ε]

ΔθL = ∇J − λFΔθ = 0 ⇒ Δθ = (1/λ) F−1∇J ■

The key insight: The Fisher matrix defines the "natural geometry" of probability distributions. Moving Δθ in parameter space changes the policy by different amounts depending on the local curvature. F−1 undoes this warping, giving equal-sized steps in distribution space. This is why TRPO (which constrains KL directly) is equivalent to natural gradient descent with adaptive step size.

Checkpoint — Before you move on
You now know three variance reduction techniques: causality (reward-to-go), baselines, and natural gradient. Explain: why does each one reduce variance without adding bias? What's the common principle they share?
✓ Gate cleared
Model Answer

Causality: Removes terms whose expectation is zero (past rewards × score function = 0 because past is independent of current action given state). Unbiased because E[removed terms] = 0. Reduces variance by removing noise that averages out anyway.

Baselines: Subtracts b(s) × ∇ log π. Unbiased because E[∇ log π] = 0 (probabilities sum to 1). Reduces variance by centering rewards so the gradient signal reflects relative quality, not absolute scale.

Natural gradient: Multiplies by F−1. Changes the direction (not the expectation of direction). Reduces variance in the sense that equal KL-steps are taken regardless of parameterization, avoiding oscillation in high-curvature directions.

Common principle: All three exploit properties of probability distributions (∑π = 1, independence structure, Fisher geometry) to remove noise without changing the expected signal.

Chapter 08

Implementation: Surrogate Objective

Can't plug gradient formula into autodiff efficiently (N×T backward passes). Instead, build a surrogate objective whose autodiff gradient = our policy gradient:

Surrogate Objective (Pseudo-Loss) J̃(θ) = (1/N) Σi Σt log πθ(ai,t|si,t) · ( Σt'=tT r(si,t',ai,t') − b )

Why it works: autodiff differentiates only log πθ(a|s). Reward weights are detached constants. So ∇θ J̃ = our gradient.

Discrete actions: cross-entropy loss (actions as labels). Continuous (Gaussian):

Gaussian Policy πθ(a|s) = 𝒩(a; μθ(s), σ²)
log πθ(a|s) = −(a − μθ(s))² / (2σ²) − log(σ√(2π))
Weighted Maximum Likelihood

Behavioral cloning = all weights 1. Policy gradient = weights are reward-to-go − baseline. Good actions up-weighted, bad down-weighted.

But Why Does the Surrogate Work?

Let’s slow down. You’ve seen the formula. But many people memorize it without understanding why it exists and why it’s fast. The answer comes from understanding what a backward pass actually does.

When you call .backward() on a scalar in PyTorch, it walks backward through the computation graph and fills in ∂(that scalar)/∂θ for every parameter. That walk is one backward pass. It doesn’t matter if the scalar came from summing 3 things or 3 million things — backward is one sweep through the graph.

In supervised learning, you do this all the time without thinking:

python
loss = (1/N) * sum(cross_entropy(net(x_i), y_i) for i in range(N))
loss.backward()  # one backward pass, done

Mathematically ∇L = (1/N) Σi ∇ℓi, but you never compute each ∇ℓi separately. You build the sum, call backward, done. You trust that ∇(sum) = sum(∇).

The Policy Gradient Formula Has ∇ Inside the Sum

Now look at the policy gradient formula on paper:

θ J ≈ (1/N) Σi Σtθ log πθ(ai,t|si,t) · wi,t

Notice the ∇ is inside the double sum. In supervised learning, you never saw a formula with ∇ written inside — you just saw the loss and knew to take the gradient. Here, the derivation hands you a formula that already has ∇ in it.

So the question is: do you implement this literally, or rearrange first?

The Naive (Literal) Translation: N×T Backward Passes

If you read Σi Σt ∇(something) as literal instructions, you get:

python
# SLOW: one backward pass per (i,t) pair
total = 0
for i in range(N):
    for t in range(T):
        logp = log_pi(a[i,t], s[i,t])   # tiny scalar
        g = autograd.grad(logp, theta)   # backward pass!
        total += g * w[i,t]              # accumulate

Each iteration builds a tiny scalar logp, calls backward to get a gradient vector, multiplies by wi,t, and accumulates. That’s N×T backward passes — each one a full sweep through the neural network. With N=100 trajectories and T=200 steps, that’s 20,000 backward passes per gradient update.

This isn’t a weird straw man. It’s the direct translation of the math to code. A beginner reading Σi Σt ∇(something) and writing it as nested loops with autograd.grad inside would produce exactly this. The “inefficient” version is the obvious version.

The One-Line Algebra Fix

Since wi,t doesn’t depend on θ (it’s just a number computed from rewards), it slides inside the gradient:

∇ log πθ · w = ∇(log πθ · w)

And since ∇(sum) = sum(∇):

(1/N) Σi Σt ∇(log πθ · wi,t) = ∇[ (1/N) Σi Σt log πθ · wi,t ]

Look at the right side. What’s inside the brackets? A single scalar — compute log πθ for each (i,t), multiply by wi,t, sum them up. One number. And the ∇ is outside. Call this scalar J̃. Now you’re in the exact same situation as supervised learning: you have a scalar, you want its gradient, you call backward. One backward pass.

The Surrogate in Three Lines

python
logps = policy(s).log_prob(a)     # [N, T] — one forward pass
surrogate = (logps * w).mean()    # sum into one scalar
surrogate.backward()              # one backward pass

The sums over N and T still happen — but in the forward pass (adding numbers is cheap). The expensive part is the backward sweep through the neural network, and that happens exactly once. Internally, autodiff computes the same thing as N×T separate backward passes summed together — but in one sweep through one graph.

The ∇ didn’t disappear. It’s just not written in the code. In supervised learning you never write ∇ either — you write loss, you write .backward(), and backward is the ∇. Same here. J̃ appears without a ∇ because J̃ is the thing you build as a scalar. The ∇ comes from .backward().

Why .detach() Matters

The only reason we could move ∇ outside the sum was that wi,t doesn’t depend on θ. If w depended on θ, you’d get extra terms and the rearrangement would be invalid. That’s why you .detach() the rewards and baseline in code — to tell autodiff “don’t differentiate through this, it’s a constant.”

What Is the Surrogate, Really?

The real objective J(θ) is expected return. You can’t build J as a scalar that autodiff can differentiate — sampling trajectories isn’t differentiable. So you build a different scalar J̃ whose value is meaningless, but whose gradient happens to equal ∇J. You never care about J̃’s value. You only care that .backward() on it gives you the right gradient.

The surrogate is a trick to make autodiff compute something it couldn’t directly compute. Engineer a scalar whose gradient matches what you want, then let autodiff do its one-backward-pass thing. This pattern — building a “fake loss” whose gradient is a useful signal — appears everywhere in RL, from REINFORCE to PPO’s clipped objective.
Chapter 09

Off-Policy: Importance Sampling

Here’s the frustrating thing about REINFORCE: you collect a batch of trajectories, compute one gradient, update θ, and throw the entire batch away. Every single gradient step requires fresh data from the current policy. This is called on-policy learning, and it’s enormously wasteful.

Why can’t we reuse old data? Because our gradient formula assumes samples from πθ. After we update θ, the old data came from a different policy. Using it directly gives a biased gradient. But what if we could mathematically correct for the mismatch?

Definition
Importance Sampling
𝔼x~p[ f(x) ] = 𝔼x~q[ (p(x)/q(x)) · f(x) ]

If you have samples from distribution q but want to estimate an expectation under distribution p, multiply each sample by the importance weight p(x)/q(x). This corrects the distribution mismatch — samples that are more likely under p than q get up-weighted, and vice versa.

Applying IS to Trajectories

We want to evaluate the policy gradient of πθ′ (our new policy), but using trajectories collected from πθ (our old policy). The IS correction requires the ratio of trajectory probabilities:

Trajectory probability ratio pθ′(τ) / pθ(τ) = [ p(s1) ∏t πθ′(at|st) p(st+1|st,at) ] / [ p(s1) ∏t πθ(at|st) p(st+1|st,at) ]

Something beautiful happens here: the initial state distribution p(s1) and all the dynamics terms p(st+1|st,at) appear in both numerator and denominator — they cancel completely:

Dynamics cancel! pθ′(τ) / pθ(τ) = ∏t=1T πθ′(at|st) / πθ(at|st)
Why this is remarkable: We never need to know the environment dynamics. The IS correction only involves the ratio of policy probabilities, which we can compute. This is the same “model-free” property that made REINFORCE attractive — and it carries over to the off-policy version.

The Curse of Horizon: Exploding Products

There’s a devastating problem. The IS weight is a product of T individual ratios. Even if each ratio is close to 1, the product can be catastrophic:

Worked example: Suppose each ratio is 1.1 (the new policy is 10% more likely to take each action). With T=100 timesteps: 1.110013,780. One trajectory gets weighted 13,780× more than others. Conversely, if each ratio is 0.9: 0.91000.00003. The trajectory is essentially ignored. Either way, the gradient estimate is dominated by a handful of extreme weights — the variance is astronomical. This is the curse of horizon.

Fix: Per-Timestep Importance Weights

The solution: instead of one IS weight per trajectory (product of T ratios), use one IS weight per timestep (a single ratio). This requires switching from a trajectory-level expectation to a timestep-level one.

The full off-policy gradient with per-timestep IS becomes:

Off-Policy Policy Gradient — Common Final Formθ′ J ≈ (1/N) Σi Σt [ πθ′(ai,t|si,t) / πθ(ai,t|si,t) ] · ∇θ′ log πθ′(ai,t|si,t) · ( Σt′=tT r(si,t′,ai,t′) − b )

The Hidden Approximation

There’s a subtle but important approximation buried in this formula. The full per-timestep IS weight should include a marginal state distribution ratio: πθ′(st) / πθ(st). This captures the fact that different policies visit different states.

The problem: πθ′(st) is the probability of being in state st at time t under the new policy. Computing this requires rolling out the new policy or solving for the stationary distribution — which is exactly what we’re trying to avoid. So in practice, this ratio is approximated as 1: we pretend the state distribution doesn’t change between policies.

When is this approximation valid? Only when πθ′ and πθ are similar enough that they visit roughly the same states. If the policies diverge, the state distributions diverge, and our gradient becomes unreliable. This is the fundamental tension that motivates the KL constraint in the next chapter.

The Off-Policy Algorithm

Algorithm: Off-Policy Policy Gradient
  1. Initialize θ, set θ′ = θ
  2. Collect N trajectories using πθ (the data-collecting policy)
  3. For K gradient steps (reusing the same batch):
    1. Compute per-timestep IS weights: wt = πθ′(at|st) / πθ(at|st)
    2. Compute weighted policy gradient using wt
    3. Update: θ′ ← θ′ + α · ∇J
  4. Set θ ← θ′, go to step 2 (collect fresh data)

The key win: K gradient steps per batch instead of 1. With K=10, you get 10× more optimization per unit of environment interaction. But there’s a catch: as θ′ drifts further from θ with each step, the IS weights become more extreme, the state distribution approximation breaks down, and the gradient becomes unreliable. After too many steps, you’re optimizing garbage.

The core dilemma: More gradient steps per batch = better sample efficiency, but each step moves θ′ further from θ, degrading the IS correction. Take too few steps and you waste data. Take too many and you destabilize training. How do you know when to stop? That’s exactly what the KL constraint (next chapter) solves.
Chapter 10

KL Divergence Constraints

We’ve established the dilemma: off-policy policy gradients let us take multiple gradient steps per batch (great for sample efficiency), but each step moves θ′ further from the data-collecting policy θ, degrading the IS correction and eventually ruining the gradient. We need a principled way to say “stop updating before the gradient becomes unreliable.”

Measuring Policy Distance with KL Divergence

First, we need a way to measure how much πθ′ has changed from πθ. Raw parameter distance ||θ′ − θ|| doesn’t work because a small change in parameters can cause a large change in the policy (or vice versa). We need to measure distance in distribution space.

Definition
KL Divergence — DKL(p ∥ q)
DKL(p ∥ q) = Σx p(x) log( p(x) / q(x) )

KL divergence measures how much distribution p differs from distribution q. Key properties: (1) Non-negative: DKL ≥ 0, with equality iff p = q. (2) Not symmetric: DKL(p ∥ q) ≠ DKL(q ∥ p) in general. (3) Interpretable: it measures the expected number of “extra bits” needed to encode samples from p using a code optimized for q.

Why KL, Not Just ||θ′ − θ||?

Consider two policies that differ by Δθ = 0.001 in one parameter. If that parameter controls a softmax temperature, this tiny change might completely flip the action distribution. Conversely, Δθ = 100 in a dead parameter changes nothing. KL divergence measures the functional change in the policy — how different the action distributions actually are, not how different the numbers in memory are.

The KL-Constrained Optimization

We want to maximize the off-policy objective, but constrain how far the policy can move:

KL-Constrained Policy Update maxθ′   𝔼τ~πθ[ Σtθ′(at|st) / πθ(at|st)) · (Σt′≥t r − b) ]

subject to:   𝔼s~πθ[ DKLθ′(·|s) ∥ πθ(·|s)) ] ≤ ε
What the constraint does: It creates a “trust region” around πθ. Within this region (DKL ≤ ε), the IS weights stay close to 1, the state distribution approximation holds, and the gradient is reliable. Outside the region, all bets are off. The constraint prevents the optimizer from leaving the region where we trust our gradient estimate.

Why This Fixes Everything

Inside the trust region:

The Road to PPO: Three Levels of Approximation

The KL-constrained optimization above is the right objective, but how do you solve it?

TRPO (2015)
Enforce the KL constraint exactly using second-order optimization (conjugate gradient + line search). Reliable but complex — requires computing or approximating the Fisher information matrix.
↓ simplify
KL Penalty (naive)
Replace the constraint with a penalty: max J(θ′) − β DKL. First-order, but choosing β is hard — too small allows instability, too large prevents learning. The right β changes during training.
↓ simplify further
PPO (2017)
Skip KL entirely. Clip the IS ratio to [1−ε, 1+ε]. If the ratio tries to go beyond this range, the gradient becomes zero. This creates an implicit trust region with nothing more than a min() and a clip().
The full trajectory of ideas: On-policy REINFORCE (wasteful, one step per batch) → off-policy with IS (multiple steps, but IS weights explode) → KL constraint (trust region keeps IS valid) → TRPO (exact constraint, complex) → PPO (approximate constraint via clipping, simple). Each step is a response to the limitation of the previous one. PPO is the result of 5 years of iterative simplification.

What PPO’s Clipped Objective Looks Like

For reference, PPO’s objective (covered in depth in our PPO Veanors lesson):

PPO Clipped Surrogate LCLIP = 𝔼t[ min( rt(θ) Ât, clip(rt(θ), 1−ε, 1+ε) Ât ) ]

Where rt(θ) = πθ′(at|st) / πθ(at|st) is the IS ratio, and Ât is the advantage estimate. The clip caps the ratio at 1 ± ε, and the min selects the more conservative estimate. When the ratio drifts outside the clipping range, the gradient vanishes — the optimizer can’t keep pushing the policy further.

PPO is the default RL algorithm today. It trains ChatGPT (RLHF), controls Unitree humanoid robots, plays Dota 2 (OpenAI Five), and runs inside every major RL framework. Its simplicity (the core is ~50 lines of code) and robustness (works with default hyperparameters on most problems) make it the “Adam of RL.” And it all flows from one idea: keep the IS ratio close to 1.
Chapter 11

Policy Architectures

The derivation above is deliberately architecture-agnostic. The policy πθ(a|s) can be any differentiable function that outputs a probability distribution over actions. The entire derivation only requires one thing: that you can compute ∇θ log πθ(a|s). Any architecture that satisfies that works.

In practice, what people actually use depends on the input type:

Architecture
MLP (Multi-Layer Perceptron)

Input: State vectors (joint angles, velocities, positions). Structure: 2–3 hidden layers, 256 or 512 units. This is the most common setup in MuJoCo locomotion, robotic manipulation, and CS224R homework. Simple and fast.

Architecture
CNN (Convolutional Neural Network)

Input: Pixel observations (images from a camera). Structure: CNN backbone feeding into an MLP head. Classic Atari DQN-style convnets, or ResNets for more complex visual tasks.

Architecture
RNN / LSTM

Input: Partial observability / history-dependent tasks (agent doesn't see the full state). Structure: RNN/LSTM processes a sequence of past observations to form a belief state, then outputs actions.

Architecture
Transformer

Input: Sequence modeling (offline RL, long-horizon). Structure: As in Decision Transformer or Gato. Less common in the on-policy policy gradient setting from this lecture.

Architecture
Diffusion Policy

Input: Multimodal action distributions (robotic manipulation where there are multiple valid ways to grasp). Structure: Models πθ(a|s) as a denoising diffusion process. Very recent (2023+).

The beauty of policy gradients

The math doesn't care which architecture you use. You swap in a different backbone, the same ∇θ log πθ(a|s) formula applies — PyTorch handles the rest via autodiff on the surrogate objective. For CS224R and its default MuJoCo tasks, assume MLP unless stated otherwise.

ArchitectureBest ForExample Tasks
MLPState vectorsMuJoCo locomotion, robotic manipulation
CNNPixel observationsAtari games, visual navigation
RNN/LSTMPartial observabilityMemory tasks, POMDPs
TransformerSequence/offline RLDecision Transformer, Gato
DiffusionMultimodal actionsDexterous manipulation (2023+)
Chapter 12

Full Summary & Cheat Sheet

The Evolution

V1 — Vanillaθ J = 𝔼[ ( Σtθ log πθ(at|st) ) · r(τ) ]
V2 — + Causalityθ J = 𝔼[ Σtθ log πθ(at|st) · ( Σt'≥t r ) ]
V3 — + Baselineθ J = 𝔼[ Σtθ log πθ(at|st) · ( Σt'≥t r − b ) ]
V4 — Off-Policy + IS + KLθ' J ≈ (1/N) Σi Σtθ'θ] · ∇θ' log πθ' · (Σt'≥t r − b),   DKL ≤ ε

Derivation Cheat Sheet

#Step & Purpose
1J(θ) = ∫ pθ(τ) r(τ) dτ — write as integral
2Log-derivative trick: ∇p = p · ∇log p → expectation
3Expand log pθ(τ) → dynamics vanish under ∇θ
4Monte Carlo: 𝔼 → sample average over N rollouts
5Causality: r(τ) → reward-to-go, less variance
6Baseline: subtract b, still unbiased, less variance
7Off-policy IS: πθ'θ to reuse old data
8KL constraint: keep policy near data source

Strengths vs. Weaknesses

StrengthsWeaknesses
Any differentiable policyHigh variance gradient
Continuous + discrete actionsSample-inefficient (on-policy)
Stochastic policiesReward scale/density sensitive
Directly optimizes RL objectiveSlow convergence, large batches needed
Conceptually simpleHyperparameter-sensitive

What Comes Next

Actor-Critic: replace noisy reward-to-go with a learned value function (the "critic"). Dramatically reduces variance. Combined with KL constraints → PPO, the workhorse of modern RL.

The One Sentence

Policy gradients = run the policy, weight each action's log-probability by how good its future outcomes were minus a baseline, and gradient-ascent. Everything else is variance reduction.

💥 Break-It Lab What Dies When You Remove Variance Reduction? ✓ ATTEMPTED
A working REINFORCE agent learns a simple task. The gold line shows the expected return. Toggle off each component to see the specific failure mode it prevents.
Remove baseline (all rewards positive) ACTIVE
Failure mode: Without a baseline, all actions receive positive reinforcement (rewards are +3, +5, +8, etc. — all positive). The policy changes slowly because every action gets pushed up, just by different amounts. Signal-to-noise ratio collapses. Training takes 10-100x longer to converge, with gradient estimates dominated by the reward scale rather than reward differences.
Learning rate too high (α = 0.5) ACTIVE
Failure mode: Policy gradients have high variance. A single unlucky trajectory can produce a huge gradient that overshoots, moving the policy far from the optimum. The next iteration's data comes from this bad policy, producing another wild gradient in the opposite direction. The policy oscillates between extremes, never settling. This is the "sample efficiency" problem that trust regions and clipping (PPO) solve.
Only 1 trajectory per update (N=1) ACTIVE
Failure mode: With N=1, the gradient estimate is a single sample from a high-variance distribution. The estimate is unbiased but wildly noisy. Some updates push the policy in completely wrong directions just by chance. Progress is a random walk rather than ascent. The cure: use N=100+ trajectories, or equivalently, parallelize with multiple environment instances (vectorized envs).
🏗 Design Challenge You're the Architect: Policy Gradient for Robot Locomotion ✓ ATTEMPTED
You need to train a policy gradient agent for a simulated humanoid robot (20 joints, 1000-dim proprioceptive observation, continuous actions). The task: walk forward at 1.5 m/s. Sample efficiency matters — you have a compute budget of 10M environment steps.
Observation
1000-dim (joint angles, velocities, IMU, foot contacts)
Action
20-dim continuous (joint torques)
Step Budget
10M steps
Parallel Envs
Up to 64
Episode Length
~1000 steps (10 seconds)
1. Policy parameterization: What distribution for 20-dim continuous actions? Gaussian (diagonal, full covariance, or learned variance)? Or something else?
2. Variance reduction: Causality alone? + Baseline? If baseline, what function class for V(s)? Share parameters with the policy network?
3. Batch size: How many steps per gradient update? (N trajectories × T steps). How does this interact with your 64 parallel envs?
4. Given your 10M step budget, would you use vanilla REINFORCE, or upgrade to PPO's clipped objective? Quantify the sample efficiency gain.
5. Entropy bonus: would you add one for exploration? How would you schedule it?

Standard setup (Isaac Gym / MuJoCo locomotion, 2022-2024):

1. Policy: Diagonal Gaussian. μθ(s) = MLP(s) gives the mean, logσ is a learnable parameter vector (state-independent). Diagonal is sufficient because joint torques are approximately independent given the observation. Full covariance adds parameters without proportional gain.

2. Variance reduction: GAE(λ=0.95) with a separate value network Vφ(s). Same architecture as policy but separate parameters (shared features hurt optimization — the policy and value have conflicting gradients). Vφ is trained with MSE against GAE returns.

3. Batch size: 2048 steps per update (64 envs × 32 steps each). This gives ~2 full episodes per batch. With PPO's K=4 epochs, each batch provides 4 gradient updates. Total: 10M / 2048 × 4 = ~19,500 gradient steps.

4. PPO: Absolutely PPO, not vanilla REINFORCE. With K=4 epochs and ε=0.2, you get 4x the gradient steps per environment interaction. Vanilla REINFORCE would need 40M steps for equivalent performance. The clipping ensures the 4 passes don't destabilize despite being slightly off-policy.

5. Entropy: Yes, coefficient 0.01 (small). Decayed to 0 over training. Prevents premature convergence to deterministic policy. For locomotion, entropy helps discover diverse gaits early in training.

💻 Build It Implement REINFORCE with Baseline (Full PyTorch) ✓ ATTEMPTED
Implement the surrogate loss function for REINFORCE with reward-to-go and a value baseline. The key insight: the surrogate is a scalar whose gradient (via autodiff) equals the policy gradient. You'll build J̃ = ∑ log π · (Gt − V(st)), then call .backward().
python def compute_reinforce_loss(policy, value_net, states, actions, rewards, gamma=0.99): """ Args: policy: nn.Module, outputs action distribution given states value_net: nn.Module, outputs V(s) scalar given states states: Tensor [T, obs_dim] actions: Tensor [T, act_dim] rewards: Tensor [T] gamma: float, discount factor Returns: policy_loss: scalar tensor (negate for gradient ascent) value_loss: scalar tensor (MSE of V vs returns) """
Test case
states = randn(5, 4), actions = randn(5, 2), rewards = [1,1,1,1,10], gamma=1.0
returns = [14, 13, 12, 11, 10]
If V(s) = 12 for all s: advantages = [2, 1, 0, -1, -2]
policy_loss = -(log_probs * [2,1,0,-1,-2]).mean()
The advantages (returns - V(s)) should NOT be differentiated through when computing the policy loss. They're treated as fixed weights. Without .detach(), the gradient would flow through V(s) into the value network parameters via the policy loss, corrupting the value network update. Use advantages.detach() for the policy loss, but NOT for the value loss.
import torch

def compute_reinforce_loss(policy, value_net, states, actions, rewards, gamma=0.99):
    T = len(rewards)

    # Step 1: Compute discounted reward-to-go
    returns = torch.zeros(T)
    G = 0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G

    # Step 2: Compute advantages
    values = value_net(states).squeeze(-1)  # [T]
    advantages = returns - values.detach()  # don't backprop through V here

    # Step 3: Get log-probabilities of taken actions
    dist = policy(states)  # e.g., Normal(mu, sigma)
    log_probs = dist.log_prob(actions).sum(-1)  # [T], sum over action dims

    # Step 4: Policy surrogate loss (negate for gradient ascent)
    policy_loss = -(log_probs * advantages).mean()

    # Step 5: Value function loss (regress toward actual returns)
    value_loss = ((values - returns.detach()) ** 2).mean()

    return policy_loss, value_loss
Bonus challenge: Add entropy regularization: subtract α · dist.entropy().mean() from the policy loss. This encourages exploration by penalizing deterministic policies. Start with α = 0.01.
🔗 Pattern Recognition
Actor-Critic = REINFORCE + Learned Baseline + Bootstrapping
REINFORCE (this lesson)
t = Gt − b
Full Monte Carlo return minus average reward. Unbiased, high variance.
Actor-Critic (next step)
t = rt + γV(st+1) − V(st)
One-step TD error. Biased (V is approximate), low variance. → Actor-Critic

The spectrum from REINFORCE to actor-critic is a single axis: how many steps of real reward do you use before bootstrapping from V? REINFORCE: all steps (no bootstrap). One-step AC: one step then bootstrap. GAE(λ): exponentially-weighted blend. λ is the knob that moves you along this axis. λ=1 = REINFORCE. λ=0 = one-step TD. λ=0.95 = the sweet spot used in PPO.

When would you prefer REINFORCE over actor-critic? (Hint: think about when the value function V is likely to be badly wrong.)

⚔ Adversarial: Variance vs Bias in Policy Gradients
You're training REINFORCE on a task where the optimal return is 100. Your current policy gets returns between 95 and 105 (very close to optimal). You're using baseline b = mean(returns) ≈ 100. The gradient estimates are tiny (advantages are ±5 on a scale of 100). A colleague suggests: "use b = 0 to get larger gradient magnitudes." What happens?