CS224R — Policy Gradients: The Complete Guide

Roadmap

What You'll Master

01RL Foundations & Notation 02The RL Objective 03From Imitation → RL 04Deriving the Policy Gradient 05The REINFORCE Algorithm 06Causality Trick 07Baselines 08Implementation Details 09Off-Policy & Importance Sampling 10KL Constraints → PPO 11Policy Architectures 12Summary & Cheat Sheet

Chapter 01

RL Foundations & Notation

Reinforcement learning is about an agent interacting with an environment over time, trying to discover — through trial and error — which actions lead to the most total reward. Think of a robot learning to walk: nobody programs exact motor commands; the robot tries stuff, gets a score, and adjusts.

Definition

State — s_t

The complete description of the "world" at time step t. For a walking robot: every joint position and velocity, torso angle, foot contacts, etc. The state is everything the environment needs to determine what happens next (Markov property). It's a vector of numbers.

Definition

Action — a_t

The decision the agent makes at time t. For the robot: torque on each motor. Actions can be discrete (left, right, jump) or continuous (3.7 N·m to the knee).

Definition

Trajectory — τ (tau)

τ = (s₁, a₁, s₂, a₂, …, s_T, a_T)

A full sequence of states and actions. T = time horizon (episode length). One trajectory = one "rollout" = recording every frame and button press of a video game playthrough.

Definition

Reward Function — r(s, a)

A scalar signal: how good is this state-action pair? For the robot, r(s,a) = forward velocity. We design this — it encodes our goal.

Definition

Policy — π_θ(a | s)

The agent's strategy: a probability distribution over actions given a state. The subscript θ = neural network weights. This is what we learn.

Discrete actions → softmax over logits. Continuous → outputs mean (and variance) of a Gaussian.

The Trajectory Distribution

Trajectory Probability p_θ(τ) = p(s₁) · ∏_t=1^T π_θ(a_t | s_t) · p(s_t+1 | s_t, a_t)

Three ingredients:

• p(s₁) — initial state distribution. Given by environment.

• π_θ(a_t|s_t) — our policy's choice. Only part we control via θ.

• p(s_t+1|s_t,a_t) — environment dynamics. Unknown in model-free RL.

Intuition

Only the π_θ terms depend on θ. Initial state and dynamics are fixed. When we take ∇_θ, environment terms vanish. This is the key insight of the entire derivation.

On-Policy vs. Off-Policy vs. Offline

Term	Meaning	Example
Online	Agent collects new data by acting	Robot tries, gets feedback
Offline	Learn from fixed dataset, never act	Learn from recorded demos
On-policy	Uses only data from current policy	REINFORCE
Off-policy	Can reuse data from past policies	IS-weighted PG, PPO

Chapter 02

The RL Objective

Everything reduces to one problem. Find θ that maximizes expected total reward:

The RL Objective max_θ J(θ) = 𝔼_{τ ~ p_θ(τ)} [ Σ_t=1^T r(s_t, a_t) ]

• max_θ — search over all parameter vectors for the best.

• J(θ) — scalar: expected total reward. Make this as big as possible.

• 𝔼_{τ ~ p_θ(τ)} — expectation over trajectories. Both policy and env are stochastic → maximize the average.

• Σ_t r(s_t,a_t) — total reward of one trajectory.

With shorthand r(τ) = Σ_t r(s_t,a_t):

J(θ) = 𝔼_{τ ~ p_θ(τ)}[ r(τ) ] = ∫ p_θ(τ) · r(τ) dτ

We integrate over all possible trajectories. Can't compute this exactly — estimate with samples (actual rollouts).

Why expectations?

Same policy → wildly different outcomes across runs. We want a policy that works on average, not one that got lucky once.

Chapter 03

From Imitation to Policy Gradients

Previously: behavioral cloning — given expert (s,a) pairs, maximize likelihood:

Behavioral Cloning min_θ − 𝔼_(s,a)~𝒟[ log π_θ(a|s) ]

∇_θ J_BC ≈ (1/N) Σ_i ( Σ_t ∇_θ log π_θ(a_i,t|s_i,t) )

Limitation: can never outperform the demonstrator. No improvement from practice.

The bridge: keep the same gradient structure, but weight each action by its reward:

Sneak Peek — Policy Gradient ∇_θ J(θ) ≈ (1/N) Σ_i ( Σ_t ∇_θ log π_θ(a_i,t|s_i,t) ) · r(τ_i)

↑ Imitation gradient × trajectory reward

Core Idea

High-reward trajectory → increase action likelihood. Low/negative reward → decrease. This is the formalization of trial and error.

Chapter 04

Deriving the Policy Gradient

The most important derivation in this course. Starting point: J(θ). Goal: ∇_θJ(θ).

Step 1 — Write as integral

J(θ) = ∫ p_θ(τ) · r(τ) dτ

Definition of expectation. Nothing fancy.

Step 2 — Differentiate

∇_θ J(θ) = ∫ ∇_θ p_θ(τ) · r(τ) dτ

r(τ) doesn't depend on θ → gradient only hits p_θ(τ).

Step 3 — Log-Derivative Trick (the key identity)

∇_θ log f(θ) = ∇_θ f(θ) / f(θ)
⇒ ∇_θ f(θ) = f(θ) · ∇_θ log f(θ)

Why? ∇_θ p_θ(τ) is intractable (unknown dynamics). By introducing the log, dynamics will cancel.

Step 4 — Substitute back

∇_θ J(θ) = ∫ p_θ(τ) · ∇_θ log p_θ(τ) · r(τ) dτ = 𝔼_{τ ~ p_θ(τ)}[ ∇_θ log p_θ(τ) · r(τ) ]

It's an expectation — estimable with samples!

Step 5 — Expand log p_θ(τ), dynamics disappear

log p_θ(τ) = log p(s₁) + Σ_t[ log π_θ(a_t|s_t) + log p(s_t+1|s_t,a_t) ]

∇_θ log p_θ(τ) = Σ_t=1^T ∇_θ log π_θ(a_t|s_t)

log p(s₁) and log p(s_t+1|s_t,a_t) vanish — they don't depend on θ. We never need the dynamics!

Step 6 — The Policy Gradient Theorem

Policy Gradient — Trajectory Form ∇_θ J(θ) = 𝔼_{τ ~ p_θ(τ)}[ ( Σ_t=1^T ∇_θ log π_θ(a_t|s_t) ) · r(τ) ]

Step 7 — Monte Carlo estimation

Sample-Based Policy Gradient ∇_θ J(θ) ≈ (1/N) Σ_i=1^N ( Σ_t=1^T ∇_θ log π_θ(a_i,t|s_i,t) ) · r(τ_i)

Unbiased but noisy. More trajectories N → less noise.

What does ∇_θ log π_θ(a|s) mean?

The direction in parameter space that most increases the probability of action a in state s. Multiply by reward → increase probability of good actions, decrease bad ones.

🔨 Derivation Derive the Log-Derivative Trick from the Definition of Derivative ▶ ✓ ATTEMPTED

The log-derivative trick states: ∇_θ p_θ(τ) = p_θ(τ) · ∇_θ log p_θ(τ).

Your task: (1) Prove this identity using only the chain rule. (2) Then show WHY this is useful: prove that 𝔼_{τ~p_θ}[∇_θ log p_θ(τ)] = 0 (the "score function has zero mean" property). (3) Explain what goes wrong if you try to directly compute ∇_θ ∫ p_θ(τ) r(τ) dτ without this trick.

d/dx log f(x) = f'(x) / f(x). Rearrange: f'(x) = f(x) · d/dx log f(x). Replace f with p_θ and d/dx with ∇_θ. That's it — the identity is just the chain rule in disguise.

𝔼[∇ log p] = ∫ p · (∇p / p) dτ = ∫ ∇p dτ = ∇ ∫ p dτ = ∇(1) = 0. The gradient commutes with the integral (Leibniz rule), and probabilities sum to 1.

p_θ(τ) = p(s₁) ∏ π_θ(a_t|s_t) p(s_t+1|s_t,a_t). The dynamics p(s'|s,a) are unknown — we can't differentiate through them. The log-trick converts the gradient into an expectation we can estimate by sampling, and the log factorization means unknown dynamics terms vanish.

Part 1 — The identity:

By the chain rule: ∇_θ log p_θ(τ) = ∇_θ p_θ(τ) / p_θ(τ)

Multiply both sides by p_θ(τ): ∇_θ p_θ(τ) = p_θ(τ) · ∇_θ log p_θ(τ) ■

Part 2 — Zero mean:

𝔼_{τ~p_θ}[∇_θ log p_θ(τ)] = ∫ p_θ(τ) ∇_θ log p_θ(τ) dτ = ∫ ∇_θ p_θ(τ) dτ = ∇_θ ∫ p_θ(τ) dτ = ∇_θ 1 = 0 ■

Part 3 — Why the trick is necessary:

Direct approach: ∇_θ J = ∫ ∇_θ p_θ(τ) · r(τ) dτ. This requires computing ∇_θ p_θ(τ), which requires differentiating through the unknown dynamics. The log trick converts this to 𝔼[∇ log p · r], which we estimate by sampling trajectories and only requires ∇_θ log π_θ(a|s) (the policy is known and differentiable).

The key insight: The log-derivative trick converts "differentiate through an intractable probability distribution" into "sample from it and multiply by a score." This pattern appears everywhere: variational inference (ELBO gradient), score-based diffusion models, and black-box optimization. It's arguably the single most important trick in modern ML.

🔗 Pattern Recognition

Policy Gradient = Reward-Weighted Maximum Likelihood

Supervised Learning (MLE)

∇_θ L = (1/N) ∑_i ∇_θ log p_θ(y_i|x_i)
All examples weighted equally.

Policy Gradient

∇_θ J = (1/N) ∑_i ∇_θ log π_θ(a_i|s_i) · w_i
Examples weighted by reward.

Behavioral cloning = MLE with w_i = 1 for all examples. Policy gradients = MLE with w_i = reward-to-go − baseline. The gradient formula is identical except for the weighting. This is why REINFORCE is sometimes called "reward-weighted regression." Good actions get up-weighted (like having more copies in the training set), bad actions get down-weighted (like removing them).

If REINFORCE is just weighted MLE, why can it outperform the demonstrator while behavioral cloning can't? (Hint: where do the weights come from?)

Chapter 05

The REINFORCE Algorithm

Algorithm: REINFORCE (Vanilla Policy Gradient)

Initialize θ (randomly, from imitation, or heuristics)
Loop forever:
1. Collect N trajectories {τ₁…τ_N} by running π_θ
2. Compute gradient:
  ∇_θ J ≈ (1/N) Σ_i ( Σ_t ∇_θ log π_θ(a_i,t|s_i,t) ) · r(τ_i)
3. Update: θ ← θ + α · ∇_θ J (gradient ascent — maximizing)

Collect → compute → update → repeat. After updating θ, old data is invalid → recollect. This is on-policy.

Humanoid Walking

Reward = forward velocity. Five trajectories:

τ₁: falls backward → r = −2 · τ₂: falls forward → r = +2 · τ₃: stands still → r = 0 · τ₄: one step then falls → r = +1 · τ₅: step backward → r = −3

Gradient: increases likelihood of τ₂ (best), decreases τ₅ (worst). The policy learns to fall forward — the best it saw. Over many iterations: stumbling → stepping → walking. Very noisy.

Data Efficiency

Every gradient step needs fresh data. 5 min/trajectory × 100 per step × 10k steps = 6.6 years of robot time. Off-policy methods fix this.

Chapter 06

Improving the Gradient: Causality

Our gradient weights every action by the total trajectory reward. But action at step 5 can't affect reward at step 3.

Policy Gradient with Causality (Reward-to-Go) ∇_θ J ≈ (1/N) Σ_i Σ_t=1^T ∇_θ log π_θ(a_i,t|s_i,t) · ( Σ_t'=t^T r(s_i,t', a_i,t') )

Replace total reward Σ_t'=1^T with future-only Σ_t'=t^T.

Why it helps

Robot does great at step 1 (+10) but terrible at step 50 (−10). Without causality: total = 0, so gradient for all actions is zero — even the great one. With causality: step 1 sees its own future reward, properly crediting the good action. Reduces variance, still unbiased.

Formally: for t' < t, the expectation of ∇_θ log π_θ(a_t|s_t) · r(s_t',a_t') = 0. Past reward is independent of current action given the state. Removing these terms removes noise without changing the expected gradient.

Chapter 07

Improving the Gradient: Baselines

The Reward Scale Problem

All rewards positive: τ₁ (+2), τ₂ (+5), τ₃ (+8), τ₄ (+12). The gradient increases probability of every trajectory — including falling forward! We waste signal pushing everything up.

The Baseline Trick

Subtract constant b:

Policy Gradient with Baseline ∇_θ J = 𝔼[ ∇_θ log p_θ(τ) · ( r(τ) − b ) ]

Reward > b → positive weight. Reward < b → negative. Clear signal.

Proof: unbiased

𝔼[ ∇_θ log p_θ(τ) · b ] = b · ∫ p_θ · (∇_θ p_θ / p_θ) dτ = b · ∫ ∇_θ p_θ dτ = b · ∇_θ ∫ p_θ dτ = b · ∇_θ (1) = 0 ✓

Best simple baseline: average reward b = (1/N) Σ_i r(τ_i).

Best Version So Far ∇_θ J ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_i,t|s_i,t) · ( Σ_t'=t^T r(s_i,t',a_i,t') − b )

Jacket Folding

Reward: 1 (folded), 0.5 (wrinkled), 0 (not folded). Trajectories: τ₁ (0), τ₂ (0), τ₃ (0), τ₄ (1). Baseline b = 0.25. Only τ₄ gets positive weight (+0.75); others get −0.25. Clear signal — but 3/4 trajectories carry the same weak signal. Still noisy. Use dense rewards and large batches.

Variance is the enemy

We estimate a gradient over an exponentially large trajectory space with a handful of samples. Dense rewards and large batches are the main practical levers.

🔨 Derivation Prove: Any State-Dependent Baseline Is Unbiased ▶ ✓ ATTEMPTED

We subtract b(s_t) from the reward-to-go: ∇_θ J = 𝔼[∑_t ∇_θ log π_θ(a_t|s_t) · (G_t − b(s_t))].

Your task: Prove that for ANY function b(s) that depends only on the state (not the action), the gradient remains unbiased: 𝔼[∇_θ log π_θ(a_t|s_t) · b(s_t)] = 0.

Bonus: Show that b(s, a) = Q^π(s,a) would NOT be an unbiased baseline, and explain the intuition for why action-dependent baselines fail.

b(s_t) doesn't depend on a_t, so: 𝔼_a~π[∇_θ log π_θ(a|s) · b(s)] = b(s) · 𝔼_a~π[∇_θ log π_θ(a|s)]. You just need to show the inner expectation is zero.

∑_a π(a|s) · ∇ log π(a|s) = ∑_a ∇π(a|s) = ∇ ∑_a π(a|s) = ∇ 1 = 0. Probabilities sum to 1 regardless of parameters, so their gradient sums to zero.

If b depends on a, you can't factor it out of 𝔼_a~π[...]. The term becomes ∑_a ∇π(a|s) · b(s,a), which is NOT ∇(∑ π · b) = ∇𝔼[b] (because ∇ of a product has two terms via product rule).

Main proof:

𝔼_a~π[∇_θ log π_θ(a|s) · b(s)]

= b(s) · ∑_a π_θ(a|s) · ∇_θ log π_θ(a|s) [b(s) factors out]

= b(s) · ∑_a π_θ(a|s) · ∇_θπ_θ(a|s) / π_θ(a|s) [log-derivative identity]

= b(s) · ∑_a ∇_θπ_θ(a|s)

= b(s) · ∇_θ ∑_a π_θ(a|s) [gradient commutes with finite sum]

= b(s) · ∇_θ 1 = 0 ■

Bonus — why action-dependent baselines bias:

If b = b(s,a): ∑_a π(a|s) · ∇ log π(a|s) · b(s,a) = ∑_a ∇π(a|s) · b(s,a) ≠ 0 in general. You'd need ∇(∑_a π · b) − ∑_a π · ∇b, introducing the extra ∑π∇b term. Intuitively: if the baseline depends on the action, subtracting it changes WHICH actions look good, not just the scale. It preferentially reduces the gradient for some actions over others.

The key insight: ∑_a ∇π(a|s) = 0 is the reason baselines work. It says: "the total probability mass is conserved under any parameter change." Pushing one action up necessarily pushes others down. A state-dependent baseline simply shifts all actions by the same constant, preserving relative ordering.

🔨 Derivation Natural Policy Gradient: The Fisher Information Connection ▶ ✓ ATTEMPTED

The vanilla gradient ∇_θJ is not parameterization-invariant: if you reparameterize θ → φ(θ), the gradient direction changes. The natural gradient F⁻¹∇_θJ is invariant, where F is the Fisher information matrix.

Your task: (1) Show that the Fisher matrix F = 𝔼[∇ log π (∇ log π)^T] is the Hessian of the KL divergence at θ' = θ. (2) Explain why F⁻¹∇J gives the steepest ascent direction under a KL constraint.

D_KL(π_θ || π_θ') = 𝔼_{a~π_θ}[log π_θ(a|s) − log π_θ'(a|s)]. At θ' = θ, D_KL = 0. The first derivative also vanishes. The second-order Taylor expansion gives: D_KL ≈ (1/2)(θ' − θ)^T F (θ' − θ), where F is the Fisher matrix.

Steepest ascent under KL constraint = max_Δθ ∇J^TΔθ subject to (1/2)Δθ^TFΔθ ≤ ε. This is a quadratic program. Use Lagrange multipliers to get the solution Δθ ∝ F⁻¹∇J.

Part 1 — Fisher = Hessian of KL:

D_KL(π_θ || π_θ') = 𝔼_{a~π_θ}[log π_θ(a|s) − log π_θ'(a|s)]

At θ' = θ: D_KL = 0. First derivative: ∇_θ' D_KL|_θ'=θ = −𝔼[∇_θ' log π_θ']|_θ'=θ = 0 (score has zero mean).

Second derivative: ∇²_θ' D_KL|_θ'=θ = −𝔼[∇² log π] = 𝔼[∇ log π (∇ log π)^T] = F.

(The last equality uses the identity: −𝔼[∇² log p] = 𝔼[∇ log p (∇ log p)^T], which holds for any exponential family.)

Part 2 — Steepest ascent under KL:

max_Δθ ∇J^TΔθ s.t. (1/2)Δθ^TFΔθ ≤ ε

Lagrangian: L = ∇J^TΔθ − λ[(1/2)Δθ^TFΔθ − ε]

∇_ΔθL = ∇J − λFΔθ = 0 ⇒ Δθ = (1/λ) F⁻¹∇J ■

The key insight: The Fisher matrix defines the "natural geometry" of probability distributions. Moving Δθ in parameter space changes the policy by different amounts depending on the local curvature. F⁻¹ undoes this warping, giving equal-sized steps in distribution space. This is why TRPO (which constrains KL directly) is equivalent to natural gradient descent with adaptive step size.

Checkpoint — Before you move on

You now know three variance reduction techniques: causality (reward-to-go), baselines, and natural gradient. Explain: why does each one reduce variance without adding bias? What's the common principle they share?

✓ Gate cleared

Model Answer

Causality: Removes terms whose expectation is zero (past rewards × score function = 0 because past is independent of current action given state). Unbiased because E[removed terms] = 0. Reduces variance by removing noise that averages out anyway.

Baselines: Subtracts b(s) × ∇ log π. Unbiased because E[∇ log π] = 0 (probabilities sum to 1). Reduces variance by centering rewards so the gradient signal reflects relative quality, not absolute scale.

Natural gradient: Multiplies by F⁻¹. Changes the direction (not the expectation of direction). Reduces variance in the sense that equal KL-steps are taken regardless of parameterization, avoiding oscillation in high-curvature directions.

Common principle: All three exploit properties of probability distributions (∑π = 1, independence structure, Fisher geometry) to remove noise without changing the expected signal.

Chapter 08

Implementation: Surrogate Objective

Can't plug gradient formula into autodiff efficiently (N×T backward passes). Instead, build a surrogate objective whose autodiff gradient = our policy gradient:

Surrogate Objective (Pseudo-Loss) J̃(θ) = (1/N) Σ_i Σ_t log π_θ(a_i,t|s_i,t) · ( Σ_t'=t^T r(s_i,t',a_i,t') − b )

Why it works: autodiff differentiates only log π_θ(a|s). Reward weights are detached constants. So ∇_θ J̃ = our gradient.

Discrete actions: cross-entropy loss (actions as labels). Continuous (Gaussian):

Gaussian Policy π_θ(a|s) = 𝒩(a; μ_θ(s), σ²)
log π_θ(a|s) = −(a − μ_θ(s))² / (2σ²) − log(σ√(2π))

Weighted Maximum Likelihood

Behavioral cloning = all weights 1. Policy gradient = weights are reward-to-go − baseline. Good actions up-weighted, bad down-weighted.

But Why Does the Surrogate Work?

Let’s slow down. You’ve seen the formula. But many people memorize it without understanding why it exists and why it’s fast. The answer comes from understanding what a backward pass actually does.

When you call .backward() on a scalar in PyTorch, it walks backward through the computation graph and fills in ∂(that scalar)/∂θ for every parameter. That walk is one backward pass. It doesn’t matter if the scalar came from summing 3 things or 3 million things — backward is one sweep through the graph.

In supervised learning, you do this all the time without thinking:

python
loss = (1/N) * sum(cross_entropy(net(x_i), y_i) for i in range(N))
loss.backward()  # one backward pass, done

Mathematically ∇L = (1/N) Σ_i ∇ℓ_i, but you never compute each ∇ℓ_i separately. You build the sum, call backward, done. You trust that ∇(sum) = sum(∇).

The Policy Gradient Formula Has ∇ Inside the Sum

Now look at the policy gradient formula on paper:

∇_θ J ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_i,t|s_i,t) · w_i,t

Notice the ∇ is inside the double sum. In supervised learning, you never saw a formula with ∇ written inside — you just saw the loss and knew to take the gradient. Here, the derivation hands you a formula that already has ∇ in it.

So the question is: do you implement this literally, or rearrange first?

The Naive (Literal) Translation: N×T Backward Passes

If you read Σ_i Σ_t ∇(something) as literal instructions, you get:

python
# SLOW: one backward pass per (i,t) pair
total = 0
for i in range(N):
    for t in range(T):
        logp = log_pi(a[i,t], s[i,t])   # tiny scalar
        g = autograd.grad(logp, theta)   # backward pass!
        total += g * w[i,t]              # accumulate

Each iteration builds a tiny scalar logp, calls backward to get a gradient vector, multiplies by w_i,t, and accumulates. That’s N×T backward passes — each one a full sweep through the neural network. With N=100 trajectories and T=200 steps, that’s 20,000 backward passes per gradient update.

This isn’t a weird straw man. It’s the direct translation of the math to code. A beginner reading Σ_i Σ_t ∇(something) and writing it as nested loops with autograd.grad inside would produce exactly this. The “inefficient” version is the obvious version.

The One-Line Algebra Fix

Since w_i,t doesn’t depend on θ (it’s just a number computed from rewards), it slides inside the gradient:

∇ log π_θ · w = ∇(log π_θ · w)

And since ∇(sum) = sum(∇):

(1/N) Σ_i Σ_t ∇(log π_θ · w_i,t) = ∇[ (1/N) Σ_i Σ_t log π_θ · w_i,t ]

Look at the right side. What’s inside the brackets? A single scalar — compute log π_θ for each (i,t), multiply by w_i,t, sum them up. One number. And the ∇ is outside. Call this scalar J̃. Now you’re in the exact same situation as supervised learning: you have a scalar, you want its gradient, you call backward. One backward pass.

The Surrogate in Three Lines

python
logps = policy(s).log_prob(a)     # [N, T] — one forward pass
surrogate = (logps * w).mean()    # sum into one scalar
surrogate.backward()              # one backward pass

The sums over N and T still happen — but in the forward pass (adding numbers is cheap). The expensive part is the backward sweep through the neural network, and that happens exactly once. Internally, autodiff computes the same thing as N×T separate backward passes summed together — but in one sweep through one graph.

The ∇ didn’t disappear. It’s just not written in the code. In supervised learning you never write ∇ either — you write loss, you write .backward(), and backward is the ∇. Same here. J̃ appears without a ∇ because J̃ is the thing you build as a scalar. The ∇ comes from .backward().

Why `.detach()` Matters

The only reason we could move ∇ outside the sum was that w_i,t doesn’t depend on θ. If w depended on θ, you’d get extra terms and the rearrangement would be invalid. That’s why you .detach() the rewards and baseline in code — to tell autodiff “don’t differentiate through this, it’s a constant.”

What Is the Surrogate, Really?

The real objective J(θ) is expected return. You can’t build J as a scalar that autodiff can differentiate — sampling trajectories isn’t differentiable. So you build a different scalar J̃ whose value is meaningless, but whose gradient happens to equal ∇J. You never care about J̃’s value. You only care that .backward() on it gives you the right gradient.

The surrogate is a trick to make autodiff compute something it couldn’t directly compute. Engineer a scalar whose gradient matches what you want, then let autodiff do its one-backward-pass thing. This pattern — building a “fake loss” whose gradient is a useful signal — appears everywhere in RL, from REINFORCE to PPO’s clipped objective.

Chapter 09

Off-Policy: Importance Sampling

Here’s the frustrating thing about REINFORCE: you collect a batch of trajectories, compute one gradient, update θ, and throw the entire batch away. Every single gradient step requires fresh data from the current policy. This is called on-policy learning, and it’s enormously wasteful.

Why can’t we reuse old data? Because our gradient formula assumes samples from π_θ. After we update θ, the old data came from a different policy. Using it directly gives a biased gradient. But what if we could mathematically correct for the mismatch?

Definition

Importance Sampling

𝔼_x~p[ f(x) ] = 𝔼_x~q[ (p(x)/q(x)) · f(x) ]

If you have samples from distribution q but want to estimate an expectation under distribution p, multiply each sample by the importance weight p(x)/q(x). This corrects the distribution mismatch — samples that are more likely under p than q get up-weighted, and vice versa.

Applying IS to Trajectories

We want to evaluate the policy gradient of π_θ′ (our new policy), but using trajectories collected from π_θ (our old policy). The IS correction requires the ratio of trajectory probabilities:

Trajectory probability ratio p_θ′(τ) / p_θ(τ) = [ p(s₁) ∏_t π_θ′(a_t|s_t) p(s_t+1|s_t,a_t) ] / [ p(s₁) ∏_t π_θ(a_t|s_t) p(s_t+1|s_t,a_t) ]

Something beautiful happens here: the initial state distribution p(s₁) and all the dynamics terms p(s_t+1|s_t,a_t) appear in both numerator and denominator — they cancel completely:

Dynamics cancel! p_θ′(τ) / p_θ(τ) = ∏_t=1^T π_θ′(a_t|s_t) / π_θ(a_t|s_t)

Why this is remarkable: We never need to know the environment dynamics. The IS correction only involves the ratio of policy probabilities, which we can compute. This is the same “model-free” property that made REINFORCE attractive — and it carries over to the off-policy version.

The Curse of Horizon: Exploding Products

There’s a devastating problem. The IS weight is a product of T individual ratios. Even if each ratio is close to 1, the product can be catastrophic:

Worked example: Suppose each ratio is 1.1 (the new policy is 10% more likely to take each action). With T=100 timesteps: 1.1¹⁰⁰ ≈ 13,780. One trajectory gets weighted 13,780× more than others. Conversely, if each ratio is 0.9: 0.9¹⁰⁰ ≈ 0.00003. The trajectory is essentially ignored. Either way, the gradient estimate is dominated by a handful of extreme weights — the variance is astronomical. This is the curse of horizon.

Fix: Per-Timestep Importance Weights

The solution: instead of one IS weight per trajectory (product of T ratios), use one IS weight per timestep (a single ratio). This requires switching from a trajectory-level expectation to a timestep-level one.

The full off-policy gradient with per-timestep IS becomes:

Off-Policy Policy Gradient — Common Final Form ∇_θ′ J ≈ (1/N) Σ_i Σ_t [ π_θ′(a_i,t|s_i,t) / π_θ(a_i,t|s_i,t) ] · ∇_θ′ log π_θ′(a_i,t|s_i,t) · ( Σ_t′=t^T r(s_i,t′,a_i,t′) − b )

The Hidden Approximation

There’s a subtle but important approximation buried in this formula. The full per-timestep IS weight should include a marginal state distribution ratio: π_θ′(s_t) / π_θ(s_t). This captures the fact that different policies visit different states.

The problem: π_θ′(s_t) is the probability of being in state s_t at time t under the new policy. Computing this requires rolling out the new policy or solving for the stationary distribution — which is exactly what we’re trying to avoid. So in practice, this ratio is approximated as 1: we pretend the state distribution doesn’t change between policies.

When is this approximation valid? Only when π_θ′ and π_θ are similar enough that they visit roughly the same states. If the policies diverge, the state distributions diverge, and our gradient becomes unreliable. This is the fundamental tension that motivates the KL constraint in the next chapter.

The Off-Policy Algorithm

Algorithm: Off-Policy Policy Gradient

Initialize θ, set θ′ = θ
Collect N trajectories using π_θ (the data-collecting policy)
For K gradient steps (reusing the same batch):
1. Compute per-timestep IS weights: w_t = π_θ′(a_t|s_t) / π_θ(a_t|s_t)
2. Compute weighted policy gradient using w_t
3. Update: θ′ ← θ′ + α · ∇J
Set θ ← θ′, go to step 2 (collect fresh data)

The key win: K gradient steps per batch instead of 1. With K=10, you get 10× more optimization per unit of environment interaction. But there’s a catch: as θ′ drifts further from θ with each step, the IS weights become more extreme, the state distribution approximation breaks down, and the gradient becomes unreliable. After too many steps, you’re optimizing garbage.

The core dilemma: More gradient steps per batch = better sample efficiency, but each step moves θ′ further from θ, degrading the IS correction. Take too few steps and you waste data. Take too many and you destabilize training. How do you know when to stop? That’s exactly what the KL constraint (next chapter) solves.

Chapter 10

KL Divergence Constraints

We’ve established the dilemma: off-policy policy gradients let us take multiple gradient steps per batch (great for sample efficiency), but each step moves θ′ further from the data-collecting policy θ, degrading the IS correction and eventually ruining the gradient. We need a principled way to say “stop updating before the gradient becomes unreliable.”

Measuring Policy Distance with KL Divergence

First, we need a way to measure how much π_θ′ has changed from π_θ. Raw parameter distance ||θ′ − θ|| doesn’t work because a small change in parameters can cause a large change in the policy (or vice versa). We need to measure distance in distribution space.

Definition

KL Divergence — D_KL(p ∥ q)

D_KL(p ∥ q) = Σ_x p(x) log( p(x) / q(x) )

KL divergence measures how much distribution p differs from distribution q. Key properties: (1) Non-negative: D_KL ≥ 0, with equality iff p = q. (2) Not symmetric: D_KL(p ∥ q) ≠ D_KL(q ∥ p) in general. (3) Interpretable: it measures the expected number of “extra bits” needed to encode samples from p using a code optimized for q.

Why KL, Not Just ||θ′ − θ||?

Consider two policies that differ by Δθ = 0.001 in one parameter. If that parameter controls a softmax temperature, this tiny change might completely flip the action distribution. Conversely, Δθ = 100 in a dead parameter changes nothing. KL divergence measures the functional change in the policy — how different the action distributions actually are, not how different the numbers in memory are.

The KL-Constrained Optimization

We want to maximize the off-policy objective, but constrain how far the policy can move:

KL-Constrained Policy Update max_θ′ 𝔼_{τ~π_θ}[ Σ_t (π_θ′(a_t|s_t) / π_θ(a_t|s_t)) · (Σ_t′≥t r − b) ]

subject to: 𝔼_{s~π_θ}[ D_KL(π_θ′(·|s) ∥ π_θ(·|s)) ] ≤ ε

What the constraint does: It creates a “trust region” around π_θ. Within this region (D_KL ≤ ε), the IS weights stay close to 1, the state distribution approximation holds, and the gradient is reliable. Outside the region, all bets are off. The constraint prevents the optimizer from leaving the region where we trust our gradient estimate.

Why This Fixes Everything

Inside the trust region:

IS weights near 1: If D_KL(π_θ′ ∥ π_θ) is small, the policies assign similar probabilities, so each ratio π_θ′(a|s)/π_θ(a|s) stays near 1. No extreme weights.
State distribution valid: Similar policies visit similar states, so the approximation π_θ′(s) ≈ π_θ(s) holds.
Multiple steps safe: You can take as many gradient steps as you want within the trust region. The constraint, not the number of steps, determines safety.

The Road to PPO: Three Levels of Approximation

The KL-constrained optimization above is the right objective, but how do you solve it?

TRPO (2015)

Enforce the KL constraint exactly using second-order optimization (conjugate gradient + line search). Reliable but complex — requires computing or approximating the Fisher information matrix.

↓ simplify

KL Penalty (naive)

Replace the constraint with a penalty: max J(θ′) − β D_KL. First-order, but choosing β is hard — too small allows instability, too large prevents learning. The right β changes during training.

↓ simplify further

PPO (2017)

Skip KL entirely. Clip the IS ratio to [1−ε, 1+ε]. If the ratio tries to go beyond this range, the gradient becomes zero. This creates an implicit trust region with nothing more than a min() and a clip().

The full trajectory of ideas: On-policy REINFORCE (wasteful, one step per batch) → off-policy with IS (multiple steps, but IS weights explode) → KL constraint (trust region keeps IS valid) → TRPO (exact constraint, complex) → PPO (approximate constraint via clipping, simple). Each step is a response to the limitation of the previous one. PPO is the result of 5 years of iterative simplification.

What PPO’s Clipped Objective Looks Like

For reference, PPO’s objective (covered in depth in our PPO Veanors lesson):

PPO Clipped Surrogate L^CLIP = 𝔼_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]

Where r_t(θ) = π_θ′(a_t|s_t) / π_θ(a_t|s_t) is the IS ratio, and Â_t is the advantage estimate. The clip caps the ratio at 1 ± ε, and the min selects the more conservative estimate. When the ratio drifts outside the clipping range, the gradient vanishes — the optimizer can’t keep pushing the policy further.

PPO is the default RL algorithm today. It trains ChatGPT (RLHF), controls Unitree humanoid robots, plays Dota 2 (OpenAI Five), and runs inside every major RL framework. Its simplicity (the core is ~50 lines of code) and robustness (works with default hyperparameters on most problems) make it the “Adam of RL.” And it all flows from one idea: keep the IS ratio close to 1.

Chapter 11

Policy Architectures

The derivation above is deliberately architecture-agnostic. The policy π_θ(a|s) can be any differentiable function that outputs a probability distribution over actions. The entire derivation only requires one thing: that you can compute ∇_θ log π_θ(a|s). Any architecture that satisfies that works.

In practice, what people actually use depends on the input type:

Architecture

MLP (Multi-Layer Perceptron)

Input: State vectors (joint angles, velocities, positions). Structure: 2–3 hidden layers, 256 or 512 units. This is the most common setup in MuJoCo locomotion, robotic manipulation, and CS224R homework. Simple and fast.

Architecture

CNN (Convolutional Neural Network)

Input: Pixel observations (images from a camera). Structure: CNN backbone feeding into an MLP head. Classic Atari DQN-style convnets, or ResNets for more complex visual tasks.

Architecture

RNN / LSTM

Input: Partial observability / history-dependent tasks (agent doesn't see the full state). Structure: RNN/LSTM processes a sequence of past observations to form a belief state, then outputs actions.

Architecture

Transformer

Input: Sequence modeling (offline RL, long-horizon). Structure: As in Decision Transformer or Gato. Less common in the on-policy policy gradient setting from this lecture.

Architecture

Diffusion Policy

Input: Multimodal action distributions (robotic manipulation where there are multiple valid ways to grasp). Structure: Models π_θ(a|s) as a denoising diffusion process. Very recent (2023+).

The beauty of policy gradients

The math doesn't care which architecture you use. You swap in a different backbone, the same ∇_θ log π_θ(a|s) formula applies — PyTorch handles the rest via autodiff on the surrogate objective. For CS224R and its default MuJoCo tasks, assume MLP unless stated otherwise.

Architecture	Best For	Example Tasks
MLP	State vectors	MuJoCo locomotion, robotic manipulation
CNN	Pixel observations	Atari games, visual navigation
RNN/LSTM	Partial observability	Memory tasks, POMDPs
Transformer	Sequence/offline RL	Decision Transformer, Gato
Diffusion	Multimodal actions	Dexterous manipulation (2023+)

Chapter 12

Full Summary & Cheat Sheet

The Evolution

V1 — Vanilla∇_θ J = 𝔼[ ( Σ_t ∇_θ log π_θ(a_t|s_t) ) · r(τ) ]

V2 — + Causality∇_θ J = 𝔼[ Σ_t ∇_θ log π_θ(a_t|s_t) · ( Σ_t'≥t r ) ]

V3 — + Baseline∇_θ J = 𝔼[ Σ_t ∇_θ log π_θ(a_t|s_t) · ( Σ_t'≥t r − b ) ]

V4 — Off-Policy + IS + KL∇_θ' J ≈ (1/N) Σ_i Σ_t [π_θ'/π_θ] · ∇_θ' log π_θ' · (Σ_t'≥t r − b), D_KL ≤ ε

Derivation Cheat Sheet

#	Step & Purpose
1	J(θ) = ∫ p_θ(τ) r(τ) dτ — write as integral
2	Log-derivative trick: ∇p = p · ∇log p → expectation
3	Expand log p_θ(τ) → dynamics vanish under ∇_θ
4	Monte Carlo: 𝔼 → sample average over N rollouts
5	Causality: r(τ) → reward-to-go, less variance
6	Baseline: subtract b, still unbiased, less variance
7	Off-policy IS: π_θ'/π_θ to reuse old data
8	KL constraint: keep policy near data source

Strengths vs. Weaknesses

Strengths	Weaknesses
Any differentiable policy	High variance gradient
Continuous + discrete actions	Sample-inefficient (on-policy)
Stochastic policies	Reward scale/density sensitive
Directly optimizes RL objective	Slow convergence, large batches needed
Conceptually simple	Hyperparameter-sensitive

What Comes Next

Actor-Critic: replace noisy reward-to-go with a learned value function (the "critic"). Dramatically reduces variance. Combined with KL constraints → PPO, the workhorse of modern RL.

The One Sentence

Policy gradients = run the policy, weight each action's log-probability by how good its future outcomes were minus a baseline, and gradient-ascent. Everything else is variance reduction.

💥 Break-It Lab What Dies When You Remove Variance Reduction? ▶ ✓ ATTEMPTED

A working REINFORCE agent learns a simple task. The gold line shows the expected return. Toggle off each component to see the specific failure mode it prevents.

Remove baseline (all rewards positive) ACTIVE

Failure mode: Without a baseline, all actions receive positive reinforcement (rewards are +3, +5, +8, etc. — all positive). The policy changes slowly because every action gets pushed up, just by different amounts. Signal-to-noise ratio collapses. Training takes 10-100x longer to converge, with gradient estimates dominated by the reward scale rather than reward differences.

Learning rate too high (α = 0.5) ACTIVE

Failure mode: Policy gradients have high variance. A single unlucky trajectory can produce a huge gradient that overshoots, moving the policy far from the optimum. The next iteration's data comes from this bad policy, producing another wild gradient in the opposite direction. The policy oscillates between extremes, never settling. This is the "sample efficiency" problem that trust regions and clipping (PPO) solve.

Only 1 trajectory per update (N=1) ACTIVE

Failure mode: With N=1, the gradient estimate is a single sample from a high-variance distribution. The estimate is unbiased but wildly noisy. Some updates push the policy in completely wrong directions just by chance. Progress is a random walk rather than ascent. The cure: use N=100+ trajectories, or equivalently, parallelize with multiple environment instances (vectorized envs).

🏗 Design Challenge You're the Architect: Policy Gradient for Robot Locomotion ▶ ✓ ATTEMPTED

You need to train a policy gradient agent for a simulated humanoid robot (20 joints, 1000-dim proprioceptive observation, continuous actions). The task: walk forward at 1.5 m/s. Sample efficiency matters — you have a compute budget of 10M environment steps.

Observation

1000-dim (joint angles, velocities, IMU, foot contacts)

Action

20-dim continuous (joint torques)

Step Budget

10M steps

Parallel Envs

Up to 64

Episode Length

~1000 steps (10 seconds)

1. Policy parameterization: What distribution for 20-dim continuous actions? Gaussian (diagonal, full covariance, or learned variance)? Or something else?

2. Variance reduction: Causality alone? + Baseline? If baseline, what function class for V(s)? Share parameters with the policy network?

3. Batch size: How many steps per gradient update? (N trajectories × T steps). How does this interact with your 64 parallel envs?

4. Given your 10M step budget, would you use vanilla REINFORCE, or upgrade to PPO's clipped objective? Quantify the sample efficiency gain.

5. Entropy bonus: would you add one for exploration? How would you schedule it?

Standard setup (Isaac Gym / MuJoCo locomotion, 2022-2024):

1. Policy: Diagonal Gaussian. μ_θ(s) = MLP(s) gives the mean, logσ is a learnable parameter vector (state-independent). Diagonal is sufficient because joint torques are approximately independent given the observation. Full covariance adds parameters without proportional gain.

2. Variance reduction: GAE(λ=0.95) with a separate value network V_φ(s). Same architecture as policy but separate parameters (shared features hurt optimization — the policy and value have conflicting gradients). V_φ is trained with MSE against GAE returns.

3. Batch size: 2048 steps per update (64 envs × 32 steps each). This gives ~2 full episodes per batch. With PPO's K=4 epochs, each batch provides 4 gradient updates. Total: 10M / 2048 × 4 = ~19,500 gradient steps.

4. PPO: Absolutely PPO, not vanilla REINFORCE. With K=4 epochs and ε=0.2, you get 4x the gradient steps per environment interaction. Vanilla REINFORCE would need 40M steps for equivalent performance. The clipping ensures the 4 passes don't destabilize despite being slightly off-policy.

5. Entropy: Yes, coefficient 0.01 (small). Decayed to 0 over training. Prevents premature convergence to deterministic policy. For locomotion, entropy helps discover diverse gaits early in training.

💻 Build It Implement REINFORCE with Baseline (Full PyTorch) ▶ ✓ ATTEMPTED

Implement the surrogate loss function for REINFORCE with reward-to-go and a value baseline. The key insight: the surrogate is a scalar whose gradient (via autodiff) equals the policy gradient. You'll build J̃ = ∑ log π · (G_t − V(s_t)), then call .backward().

python def compute_reinforce_loss(policy, value_net, states, actions, rewards, gamma=0.99): """ Args: policy: nn.Module, outputs action distribution given states value_net: nn.Module, outputs V(s) scalar given states states: Tensor [T, obs_dim] actions: Tensor [T, act_dim] rewards: Tensor [T] gamma: float, discount factor Returns: policy_loss: scalar tensor (negate for gradient ascent) value_loss: scalar tensor (MSE of V vs returns) """

Test case

states = randn(5, 4), actions = randn(5, 2), rewards = [1,1,1,1,10], gamma=1.0
returns = [14, 13, 12, 11, 10]
If V(s) = 12 for all s: advantages = [2, 1, 0, -1, -2]
policy_loss = -(log_probs * [2,1,0,-1,-2]).mean()

The advantages (returns - V(s)) should NOT be differentiated through when computing the policy loss. They're treated as fixed weights. Without .detach(), the gradient would flow through V(s) into the value network parameters via the policy loss, corrupting the value network update. Use advantages.detach() for the policy loss, but NOT for the value loss.

import torch

def compute_reinforce_loss(policy, value_net, states, actions, rewards, gamma=0.99):
    T = len(rewards)

    # Step 1: Compute discounted reward-to-go
    returns = torch.zeros(T)
    G = 0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G

    # Step 2: Compute advantages
    values = value_net(states).squeeze(-1)  # [T]
    advantages = returns - values.detach()  # don't backprop through V here

    # Step 3: Get log-probabilities of taken actions
    dist = policy(states)  # e.g., Normal(mu, sigma)
    log_probs = dist.log_prob(actions).sum(-1)  # [T], sum over action dims

    # Step 4: Policy surrogate loss (negate for gradient ascent)
    policy_loss = -(log_probs * advantages).mean()

    # Step 5: Value function loss (regress toward actual returns)
    value_loss = ((values - returns.detach()) ** 2).mean()

    return policy_loss, value_loss

Bonus challenge: Add entropy regularization: subtract α · dist.entropy().mean() from the policy loss. This encourages exploration by penalizing deterministic policies. Start with α = 0.01.

🔗 Pattern Recognition

Actor-Critic = REINFORCE + Learned Baseline + Bootstrapping

REINFORCE (this lesson)

Â_t = G_t − b
Full Monte Carlo return minus average reward. Unbiased, high variance.

Actor-Critic (next step)

Â_t = r_t + γV(s_t+1) − V(s_t)
One-step TD error. Biased (V is approximate), low variance. → Actor-Critic

The spectrum from REINFORCE to actor-critic is a single axis: how many steps of real reward do you use before bootstrapping from V? REINFORCE: all steps (no bootstrap). One-step AC: one step then bootstrap. GAE(λ): exponentially-weighted blend. λ is the knob that moves you along this axis. λ=1 = REINFORCE. λ=0 = one-step TD. λ=0.95 = the sweet spot used in PPO.

When would you prefer REINFORCE over actor-critic? (Hint: think about when the value function V is likely to be badly wrong.)

⚔ Adversarial: Variance vs Bias in Policy Gradients

You're training REINFORCE on a task where the optimal return is 100. Your current policy gets returns between 95 and 105 (very close to optimal). You're using baseline b = mean(returns) ≈ 100. The gradient estimates are tiny (advantages are ±5 on a scale of 100). A colleague suggests: "use b = 0 to get larger gradient magnitudes." What happens?

The policy improves faster because the gradients are larger and point in the right direction The policy diverges because b=0 introduces bias into the gradient The gradient direction becomes dominated by noise: all trajectories get similar positive weight (~100), drowning out the ±5 signal that actually matters

The Complete Guide to Policy Gradients

What You'll Master

RL Foundations & Notation

The Trajectory Distribution

On-Policy vs. Off-Policy vs. Offline

The RL Objective

From Imitation to Policy Gradients

Deriving the Policy Gradient

The REINFORCE Algorithm

Improving the Gradient: Causality

Improving the Gradient: Baselines

The Baseline Trick

Implementation: Surrogate Objective

But Why Does the Surrogate Work?

The Policy Gradient Formula Has ∇ Inside the Sum

The Naive (Literal) Translation: N×T Backward Passes

The One-Line Algebra Fix

The Surrogate in Three Lines

Why .detach() Matters

What Is the Surrogate, Really?

Off-Policy: Importance Sampling

Applying IS to Trajectories

The Curse of Horizon: Exploding Products

Fix: Per-Timestep Importance Weights

The Hidden Approximation

The Off-Policy Algorithm

KL Divergence Constraints

Measuring Policy Distance with KL Divergence

Why KL, Not Just ||θ′ − θ||?

The KL-Constrained Optimization

Why This Fixes Everything

The Road to PPO: Three Levels of Approximation

What PPO’s Clipped Objective Looks Like

Policy Architectures

Full Summary & Cheat Sheet

The Evolution

Derivation Cheat Sheet

Strengths vs. Weaknesses

What Comes Next

Why `.detach()` Matters