Every symbol explained. Every step derived. Every pitfall exposed. After this, you teach it.
Reinforcement learning is about an agent interacting with an environment over time, trying to discover — through trial and error — which actions lead to the most total reward. Think of a robot learning to walk: nobody programs exact motor commands; the robot tries stuff, gets a score, and adjusts.
The complete description of the "world" at time step t. For a walking robot: every joint position and velocity, torso angle, foot contacts, etc. The state is everything the environment needs to determine what happens next (Markov property). It's a vector of numbers.
The decision the agent makes at time t. For the robot: torque on each motor. Actions can be discrete (left, right, jump) or continuous (3.7 N·m to the knee).
A full sequence of states and actions. T = time horizon (episode length). One trajectory = one "rollout" = recording every frame and button press of a video game playthrough.
A scalar signal: how good is this state-action pair? For the robot, r(s,a) = forward velocity. We design this — it encodes our goal.
The agent's strategy: a probability distribution over actions given a state. The subscript θ = neural network weights. This is what we learn.
Discrete actions → softmax over logits. Continuous → outputs mean (and variance) of a Gaussian.
Three ingredients:
• p(s₁) — initial state distribution. Given by environment.
• πθ(at|st) — our policy's choice. Only part we control via θ.
• p(st+1|st,at) — environment dynamics. Unknown in model-free RL.
Only the πθ terms depend on θ. Initial state and dynamics are fixed. When we take ∇θ, environment terms vanish. This is the key insight of the entire derivation.
| Term | Meaning | Example |
|---|---|---|
| Online | Agent collects new data by acting | Robot tries, gets feedback |
| Offline | Learn from fixed dataset, never act | Learn from recorded demos |
| On-policy | Uses only data from current policy | REINFORCE |
| Off-policy | Can reuse data from past policies | IS-weighted PG, PPO |
Everything reduces to one problem. Find θ that maximizes expected total reward:
• maxθ — search over all parameter vectors for the best.
• J(θ) — scalar: expected total reward. Make this as big as possible.
• 𝔼τ ~ pθ(τ) — expectation over trajectories. Both policy and env are stochastic → maximize the average.
• Σt r(st,at) — total reward of one trajectory.
With shorthand r(τ) = Σt r(st,at):
We integrate over all possible trajectories. Can't compute this exactly — estimate with samples (actual rollouts).
Same policy → wildly different outcomes across runs. We want a policy that works on average, not one that got lucky once.
Previously: behavioral cloning — given expert (s,a) pairs, maximize likelihood:
Limitation: can never outperform the demonstrator. No improvement from practice.
The bridge: keep the same gradient structure, but weight each action by its reward:
High-reward trajectory → increase action likelihood. Low/negative reward → decrease. This is the formalization of trial and error.
The most important derivation in this course. Starting point: J(θ). Goal: ∇θJ(θ).
Definition of expectation. Nothing fancy.
r(τ) doesn't depend on θ → gradient only hits pθ(τ).
Why? ∇θ pθ(τ) is intractable (unknown dynamics). By introducing the log, dynamics will cancel.
It's an expectation — estimable with samples!
log p(s₁) and log p(st+1|st,at) vanish — they don't depend on θ. We never need the dynamics!
Unbiased but noisy. More trajectories N → less noise.
The direction in parameter space that most increases the probability of action a in state s. Multiply by reward → increase probability of good actions, decrease bad ones.
The log-derivative trick states: ∇θ pθ(τ) = pθ(τ) · ∇θ log pθ(τ).
Your task: (1) Prove this identity using only the chain rule. (2) Then show WHY this is useful: prove that 𝔼τ~pθ[∇θ log pθ(τ)] = 0 (the "score function has zero mean" property). (3) Explain what goes wrong if you try to directly compute ∇θ ∫ pθ(τ) r(τ) dτ without this trick.
Part 1 — The identity:
By the chain rule: ∇θ log pθ(τ) = ∇θ pθ(τ) / pθ(τ)
Multiply both sides by pθ(τ): ∇θ pθ(τ) = pθ(τ) · ∇θ log pθ(τ) ■
Part 2 — Zero mean:
𝔼τ~pθ[∇θ log pθ(τ)] = ∫ pθ(τ) ∇θ log pθ(τ) dτ = ∫ ∇θ pθ(τ) dτ = ∇θ ∫ pθ(τ) dτ = ∇θ 1 = 0 ■
Part 3 — Why the trick is necessary:
Direct approach: ∇θ J = ∫ ∇θ pθ(τ) · r(τ) dτ. This requires computing ∇θ pθ(τ), which requires differentiating through the unknown dynamics. The log trick converts this to 𝔼[∇ log p · r], which we estimate by sampling trajectories and only requires ∇θ log πθ(a|s) (the policy is known and differentiable).
The key insight: The log-derivative trick converts "differentiate through an intractable probability distribution" into "sample from it and multiply by a score." This pattern appears everywhere: variational inference (ELBO gradient), score-based diffusion models, and black-box optimization. It's arguably the single most important trick in modern ML.
Behavioral cloning = MLE with wi = 1 for all examples. Policy gradients = MLE with wi = reward-to-go − baseline. The gradient formula is identical except for the weighting. This is why REINFORCE is sometimes called "reward-weighted regression." Good actions get up-weighted (like having more copies in the training set), bad actions get down-weighted (like removing them).
If REINFORCE is just weighted MLE, why can it outperform the demonstrator while behavioral cloning can't? (Hint: where do the weights come from?)
Collect → compute → update → repeat. After updating θ, old data is invalid → recollect. This is on-policy.
Reward = forward velocity. Five trajectories:
τ₁: falls backward → r = −2 · τ₂: falls forward → r = +2 · τ₃: stands still → r = 0 · τ₄: one step then falls → r = +1 · τ₅: step backward → r = −3
Gradient: increases likelihood of τ₂ (best), decreases τ₅ (worst). The policy learns to fall forward — the best it saw. Over many iterations: stumbling → stepping → walking. Very noisy.
Every gradient step needs fresh data. 5 min/trajectory × 100 per step × 10k steps = 6.6 years of robot time. Off-policy methods fix this.
Our gradient weights every action by the total trajectory reward. But action at step 5 can't affect reward at step 3.
Replace total reward Σt'=1T with future-only Σt'=tT.
Robot does great at step 1 (+10) but terrible at step 50 (−10). Without causality: total = 0, so gradient for all actions is zero — even the great one. With causality: step 1 sees its own future reward, properly crediting the good action. Reduces variance, still unbiased.
Formally: for t' < t, the expectation of ∇θ log πθ(at|st) · r(st',at') = 0. Past reward is independent of current action given the state. Removing these terms removes noise without changing the expected gradient.
All rewards positive: τ₁ (+2), τ₂ (+5), τ₃ (+8), τ₄ (+12). The gradient increases probability of every trajectory — including falling forward! We waste signal pushing everything up.
Subtract constant b:
Reward > b → positive weight. Reward < b → negative. Clear signal.
Best simple baseline: average reward b = (1/N) Σi r(τi).
Reward: 1 (folded), 0.5 (wrinkled), 0 (not folded). Trajectories: τ₁ (0), τ₂ (0), τ₃ (0), τ₄ (1). Baseline b = 0.25. Only τ₄ gets positive weight (+0.75); others get −0.25. Clear signal — but 3/4 trajectories carry the same weak signal. Still noisy. Use dense rewards and large batches.
We estimate a gradient over an exponentially large trajectory space with a handful of samples. Dense rewards and large batches are the main practical levers.
We subtract b(st) from the reward-to-go: ∇θ J = 𝔼[∑t ∇θ log πθ(at|st) · (Gt − b(st))].
Your task: Prove that for ANY function b(s) that depends only on the state (not the action), the gradient remains unbiased: 𝔼[∇θ log πθ(at|st) · b(st)] = 0.
Bonus: Show that b(s, a) = Qπ(s,a) would NOT be an unbiased baseline, and explain the intuition for why action-dependent baselines fail.
Main proof:
𝔼a~π[∇θ log πθ(a|s) · b(s)]
= b(s) · ∑a πθ(a|s) · ∇θ log πθ(a|s) [b(s) factors out]
= b(s) · ∑a πθ(a|s) · ∇θπθ(a|s) / πθ(a|s) [log-derivative identity]
= b(s) · ∑a ∇θπθ(a|s)
= b(s) · ∇θ ∑a πθ(a|s) [gradient commutes with finite sum]
= b(s) · ∇θ 1 = 0 ■
Bonus — why action-dependent baselines bias:
If b = b(s,a): ∑a π(a|s) · ∇ log π(a|s) · b(s,a) = ∑a ∇π(a|s) · b(s,a) ≠ 0 in general. You'd need ∇(∑a π · b) − ∑a π · ∇b, introducing the extra ∑π∇b term. Intuitively: if the baseline depends on the action, subtracting it changes WHICH actions look good, not just the scale. It preferentially reduces the gradient for some actions over others.
The key insight: ∑a ∇π(a|s) = 0 is the reason baselines work. It says: "the total probability mass is conserved under any parameter change." Pushing one action up necessarily pushes others down. A state-dependent baseline simply shifts all actions by the same constant, preserving relative ordering.
The vanilla gradient ∇θJ is not parameterization-invariant: if you reparameterize θ → φ(θ), the gradient direction changes. The natural gradient F−1∇θJ is invariant, where F is the Fisher information matrix.
Your task: (1) Show that the Fisher matrix F = 𝔼[∇ log π (∇ log π)T] is the Hessian of the KL divergence at θ' = θ. (2) Explain why F−1∇J gives the steepest ascent direction under a KL constraint.
Part 1 — Fisher = Hessian of KL:
DKL(πθ || πθ') = 𝔼a~πθ[log πθ(a|s) − log πθ'(a|s)]
At θ' = θ: DKL = 0. First derivative: ∇θ' DKL|θ'=θ = −𝔼[∇θ' log πθ']|θ'=θ = 0 (score has zero mean).
Second derivative: ∇2θ' DKL|θ'=θ = −𝔼[∇2 log π] = 𝔼[∇ log π (∇ log π)T] = F.
(The last equality uses the identity: −𝔼[∇2 log p] = 𝔼[∇ log p (∇ log p)T], which holds for any exponential family.)
Part 2 — Steepest ascent under KL:
maxΔθ ∇JTΔθ s.t. (1/2)ΔθTFΔθ ≤ ε
Lagrangian: L = ∇JTΔθ − λ[(1/2)ΔθTFΔθ − ε]
∇ΔθL = ∇J − λFΔθ = 0 ⇒ Δθ = (1/λ) F−1∇J ■
The key insight: The Fisher matrix defines the "natural geometry" of probability distributions. Moving Δθ in parameter space changes the policy by different amounts depending on the local curvature. F−1 undoes this warping, giving equal-sized steps in distribution space. This is why TRPO (which constrains KL directly) is equivalent to natural gradient descent with adaptive step size.
Causality: Removes terms whose expectation is zero (past rewards × score function = 0 because past is independent of current action given state). Unbiased because E[removed terms] = 0. Reduces variance by removing noise that averages out anyway.
Baselines: Subtracts b(s) × ∇ log π. Unbiased because E[∇ log π] = 0 (probabilities sum to 1). Reduces variance by centering rewards so the gradient signal reflects relative quality, not absolute scale.
Natural gradient: Multiplies by F−1. Changes the direction (not the expectation of direction). Reduces variance in the sense that equal KL-steps are taken regardless of parameterization, avoiding oscillation in high-curvature directions.
Common principle: All three exploit properties of probability distributions (∑π = 1, independence structure, Fisher geometry) to remove noise without changing the expected signal.
Can't plug gradient formula into autodiff efficiently (N×T backward passes). Instead, build a surrogate objective whose autodiff gradient = our policy gradient:
Why it works: autodiff differentiates only log πθ(a|s). Reward weights are detached constants. So ∇θ J̃ = our gradient.
Discrete actions: cross-entropy loss (actions as labels). Continuous (Gaussian):
Behavioral cloning = all weights 1. Policy gradient = weights are reward-to-go − baseline. Good actions up-weighted, bad down-weighted.
Let’s slow down. You’ve seen the formula. But many people memorize it without understanding why it exists and why it’s fast. The answer comes from understanding what a backward pass actually does.
When you call .backward() on a scalar in PyTorch, it walks backward through the computation graph and fills in ∂(that scalar)/∂θ for every parameter. That walk is one backward pass. It doesn’t matter if the scalar came from summing 3 things or 3 million things — backward is one sweep through the graph.
In supervised learning, you do this all the time without thinking:
python loss = (1/N) * sum(cross_entropy(net(x_i), y_i) for i in range(N)) loss.backward() # one backward pass, done
Mathematically ∇L = (1/N) Σi ∇ℓi, but you never compute each ∇ℓi separately. You build the sum, call backward, done. You trust that ∇(sum) = sum(∇).
Now look at the policy gradient formula on paper:
Notice the ∇ is inside the double sum. In supervised learning, you never saw a formula with ∇ written inside — you just saw the loss and knew to take the gradient. Here, the derivation hands you a formula that already has ∇ in it.
So the question is: do you implement this literally, or rearrange first?
If you read Σi Σt ∇(something) as literal instructions, you get:
python # SLOW: one backward pass per (i,t) pair total = 0 for i in range(N): for t in range(T): logp = log_pi(a[i,t], s[i,t]) # tiny scalar g = autograd.grad(logp, theta) # backward pass! total += g * w[i,t] # accumulate
Each iteration builds a tiny scalar logp, calls backward to get a gradient vector, multiplies by wi,t, and accumulates. That’s N×T backward passes — each one a full sweep through the neural network. With N=100 trajectories and T=200 steps, that’s 20,000 backward passes per gradient update.
autograd.grad inside would produce exactly this. The “inefficient” version is the obvious version.Since wi,t doesn’t depend on θ (it’s just a number computed from rewards), it slides inside the gradient:
And since ∇(sum) = sum(∇):
Look at the right side. What’s inside the brackets? A single scalar — compute log πθ for each (i,t), multiply by wi,t, sum them up. One number. And the ∇ is outside. Call this scalar J̃. Now you’re in the exact same situation as supervised learning: you have a scalar, you want its gradient, you call backward. One backward pass.
python logps = policy(s).log_prob(a) # [N, T] — one forward pass surrogate = (logps * w).mean() # sum into one scalar surrogate.backward() # one backward pass
The sums over N and T still happen — but in the forward pass (adding numbers is cheap). The expensive part is the backward sweep through the neural network, and that happens exactly once. Internally, autodiff computes the same thing as N×T separate backward passes summed together — but in one sweep through one graph.
loss, you write .backward(), and backward is the ∇. Same here. J̃ appears without a ∇ because J̃ is the thing you build as a scalar. The ∇ comes from .backward()..detach() MattersThe only reason we could move ∇ outside the sum was that wi,t doesn’t depend on θ. If w depended on θ, you’d get extra terms and the rearrangement would be invalid. That’s why you .detach() the rewards and baseline in code — to tell autodiff “don’t differentiate through this, it’s a constant.”
The real objective J(θ) is expected return. You can’t build J as a scalar that autodiff can differentiate — sampling trajectories isn’t differentiable. So you build a different scalar J̃ whose value is meaningless, but whose gradient happens to equal ∇J. You never care about J̃’s value. You only care that .backward() on it gives you the right gradient.
Here’s the frustrating thing about REINFORCE: you collect a batch of trajectories, compute one gradient, update θ, and throw the entire batch away. Every single gradient step requires fresh data from the current policy. This is called on-policy learning, and it’s enormously wasteful.
Why can’t we reuse old data? Because our gradient formula assumes samples from πθ. After we update θ, the old data came from a different policy. Using it directly gives a biased gradient. But what if we could mathematically correct for the mismatch?
If you have samples from distribution q but want to estimate an expectation under distribution p, multiply each sample by the importance weight p(x)/q(x). This corrects the distribution mismatch — samples that are more likely under p than q get up-weighted, and vice versa.
We want to evaluate the policy gradient of πθ′ (our new policy), but using trajectories collected from πθ (our old policy). The IS correction requires the ratio of trajectory probabilities:
Something beautiful happens here: the initial state distribution p(s1) and all the dynamics terms p(st+1|st,at) appear in both numerator and denominator — they cancel completely:
There’s a devastating problem. The IS weight is a product of T individual ratios. Even if each ratio is close to 1, the product can be catastrophic:
The solution: instead of one IS weight per trajectory (product of T ratios), use one IS weight per timestep (a single ratio). This requires switching from a trajectory-level expectation to a timestep-level one.
The full off-policy gradient with per-timestep IS becomes:
There’s a subtle but important approximation buried in this formula. The full per-timestep IS weight should include a marginal state distribution ratio: πθ′(st) / πθ(st). This captures the fact that different policies visit different states.
The problem: πθ′(st) is the probability of being in state st at time t under the new policy. Computing this requires rolling out the new policy or solving for the stationary distribution — which is exactly what we’re trying to avoid. So in practice, this ratio is approximated as 1: we pretend the state distribution doesn’t change between policies.
The key win: K gradient steps per batch instead of 1. With K=10, you get 10× more optimization per unit of environment interaction. But there’s a catch: as θ′ drifts further from θ with each step, the IS weights become more extreme, the state distribution approximation breaks down, and the gradient becomes unreliable. After too many steps, you’re optimizing garbage.
We’ve established the dilemma: off-policy policy gradients let us take multiple gradient steps per batch (great for sample efficiency), but each step moves θ′ further from the data-collecting policy θ, degrading the IS correction and eventually ruining the gradient. We need a principled way to say “stop updating before the gradient becomes unreliable.”
First, we need a way to measure how much πθ′ has changed from πθ. Raw parameter distance ||θ′ − θ|| doesn’t work because a small change in parameters can cause a large change in the policy (or vice versa). We need to measure distance in distribution space.
KL divergence measures how much distribution p differs from distribution q. Key properties: (1) Non-negative: DKL ≥ 0, with equality iff p = q. (2) Not symmetric: DKL(p ∥ q) ≠ DKL(q ∥ p) in general. (3) Interpretable: it measures the expected number of “extra bits” needed to encode samples from p using a code optimized for q.
Consider two policies that differ by Δθ = 0.001 in one parameter. If that parameter controls a softmax temperature, this tiny change might completely flip the action distribution. Conversely, Δθ = 100 in a dead parameter changes nothing. KL divergence measures the functional change in the policy — how different the action distributions actually are, not how different the numbers in memory are.
We want to maximize the off-policy objective, but constrain how far the policy can move:
Inside the trust region:
The KL-constrained optimization above is the right objective, but how do you solve it?
min() and a clip().For reference, PPO’s objective (covered in depth in our PPO Veanors lesson):
Where rt(θ) = πθ′(at|st) / πθ(at|st) is the IS ratio, and Ât is the advantage estimate. The clip caps the ratio at 1 ± ε, and the min selects the more conservative estimate. When the ratio drifts outside the clipping range, the gradient vanishes — the optimizer can’t keep pushing the policy further.
The derivation above is deliberately architecture-agnostic. The policy πθ(a|s) can be any differentiable function that outputs a probability distribution over actions. The entire derivation only requires one thing: that you can compute ∇θ log πθ(a|s). Any architecture that satisfies that works.
In practice, what people actually use depends on the input type:
Input: State vectors (joint angles, velocities, positions). Structure: 2–3 hidden layers, 256 or 512 units. This is the most common setup in MuJoCo locomotion, robotic manipulation, and CS224R homework. Simple and fast.
Input: Pixel observations (images from a camera). Structure: CNN backbone feeding into an MLP head. Classic Atari DQN-style convnets, or ResNets for more complex visual tasks.
Input: Partial observability / history-dependent tasks (agent doesn't see the full state). Structure: RNN/LSTM processes a sequence of past observations to form a belief state, then outputs actions.
Input: Sequence modeling (offline RL, long-horizon). Structure: As in Decision Transformer or Gato. Less common in the on-policy policy gradient setting from this lecture.
Input: Multimodal action distributions (robotic manipulation where there are multiple valid ways to grasp). Structure: Models πθ(a|s) as a denoising diffusion process. Very recent (2023+).
The math doesn't care which architecture you use. You swap in a different backbone, the same ∇θ log πθ(a|s) formula applies — PyTorch handles the rest via autodiff on the surrogate objective. For CS224R and its default MuJoCo tasks, assume MLP unless stated otherwise.
| Architecture | Best For | Example Tasks |
|---|---|---|
| MLP | State vectors | MuJoCo locomotion, robotic manipulation |
| CNN | Pixel observations | Atari games, visual navigation |
| RNN/LSTM | Partial observability | Memory tasks, POMDPs |
| Transformer | Sequence/offline RL | Decision Transformer, Gato |
| Diffusion | Multimodal actions | Dexterous manipulation (2023+) |
| # | Step & Purpose |
|---|---|
| 1 | J(θ) = ∫ pθ(τ) r(τ) dτ — write as integral |
| 2 | Log-derivative trick: ∇p = p · ∇log p → expectation |
| 3 | Expand log pθ(τ) → dynamics vanish under ∇θ |
| 4 | Monte Carlo: 𝔼 → sample average over N rollouts |
| 5 | Causality: r(τ) → reward-to-go, less variance |
| 6 | Baseline: subtract b, still unbiased, less variance |
| 7 | Off-policy IS: πθ'/πθ to reuse old data |
| 8 | KL constraint: keep policy near data source |
| Strengths | Weaknesses |
|---|---|
| Any differentiable policy | High variance gradient |
| Continuous + discrete actions | Sample-inefficient (on-policy) |
| Stochastic policies | Reward scale/density sensitive |
| Directly optimizes RL objective | Slow convergence, large batches needed |
| Conceptually simple | Hyperparameter-sensitive |
Actor-Critic: replace noisy reward-to-go with a learned value function (the "critic"). Dramatically reduces variance. Combined with KL constraints → PPO, the workhorse of modern RL.
Policy gradients = run the policy, weight each action's log-probability by how good its future outcomes were minus a baseline, and gradient-ascent. Everything else is variance reduction.
Standard setup (Isaac Gym / MuJoCo locomotion, 2022-2024):
1. Policy: Diagonal Gaussian. μθ(s) = MLP(s) gives the mean, logσ is a learnable parameter vector (state-independent). Diagonal is sufficient because joint torques are approximately independent given the observation. Full covariance adds parameters without proportional gain.
2. Variance reduction: GAE(λ=0.95) with a separate value network Vφ(s). Same architecture as policy but separate parameters (shared features hurt optimization — the policy and value have conflicting gradients). Vφ is trained with MSE against GAE returns.
3. Batch size: 2048 steps per update (64 envs × 32 steps each). This gives ~2 full episodes per batch. With PPO's K=4 epochs, each batch provides 4 gradient updates. Total: 10M / 2048 × 4 = ~19,500 gradient steps.
4. PPO: Absolutely PPO, not vanilla REINFORCE. With K=4 epochs and ε=0.2, you get 4x the gradient steps per environment interaction. Vanilla REINFORCE would need 40M steps for equivalent performance. The clipping ensures the 4 passes don't destabilize despite being slightly off-policy.
5. Entropy: Yes, coefficient 0.01 (small). Decayed to 0 over training. Prevents premature convergence to deterministic policy. For locomotion, entropy helps discover diverse gaits early in training.
import torch
def compute_reinforce_loss(policy, value_net, states, actions, rewards, gamma=0.99):
T = len(rewards)
# Step 1: Compute discounted reward-to-go
returns = torch.zeros(T)
G = 0
for t in reversed(range(T)):
G = rewards[t] + gamma * G
returns[t] = G
# Step 2: Compute advantages
values = value_net(states).squeeze(-1) # [T]
advantages = returns - values.detach() # don't backprop through V here
# Step 3: Get log-probabilities of taken actions
dist = policy(states) # e.g., Normal(mu, sigma)
log_probs = dist.log_prob(actions).sum(-1) # [T], sum over action dims
# Step 4: Policy surrogate loss (negate for gradient ascent)
policy_loss = -(log_probs * advantages).mean()
# Step 5: Value function loss (regress toward actual returns)
value_loss = ((values - returns.detach()) ** 2).mean()
return policy_loss, value_loss
The spectrum from REINFORCE to actor-critic is a single axis: how many steps of real reward do you use before bootstrapping from V? REINFORCE: all steps (no bootstrap). One-step AC: one step then bootstrap. GAE(λ): exponentially-weighted blend. λ is the knob that moves you along this axis. λ=1 = REINFORCE. λ=0 = one-step TD. λ=0.95 = the sweet spot used in PPO.
When would you prefer REINFORCE over actor-critic? (Hint: think about when the value function V is likely to be badly wrong.)