CS224R — Actor-Critic Methods: The Complete Guide

Roadmap

What You'll Master

01Recap & Motivation 02V, Q, and Advantage 03The Actor-Critic Loop 04Monte Carlo Estimation 05Bootstrapping & TD Learning 06N-Step Returns 07Discount Factors 08Full Actor-Critic Algorithm 09Off-Policy: Importance Weights 10Off-Policy: Replay Buffers 11PPO vs. SAC 12Summary & Cheat Sheet

Chapter 01

Recap & Motivation

In the policy gradients lesson, we derived this formula:

Policy Gradient with Advantage ∇_θ J(θ) ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_i,t|s_i,t) · A^π(s_i,t, a_i,t)

Each action is weighted by the advantage — how much better that action was compared to average. The problem is how we estimate that advantage.

The Jacket Folding Problem

A robot is learning to fold a jacket. Four trajectories from one batch:

• τ₁: Folds both sleeves neatly → reward = 10

• τ₂: Flattens the jacket but doesn't fold → reward = 0

• τ₃: Folds one sleeve, drops the other → reward = 5

• τ₄: Takes one step forward then everything falls backward → reward = −2

Vanilla PG with baseline (average = 3.25): τ₂ gets weight −3.25 and τ₃ gets +1.75. But τ₂ did something useful — flattening is a good sub-step! And the single backward step in τ₄ gets as much blame as the step forward. PG can't distinguish good sub-actions from bad trajectories.

The Core Problem

Single-sample reward-to-go is a terrible estimator of how good an action actually is. One unlucky outcome poisons the gradient for every earlier step. We need something better: a learned estimate of expected future reward.

The Big Idea

What if we trained a separate neural network to predict expected future reward from any state? Then instead of using one noisy trajectory sample, we'd use a smooth, multi-sample average. That's the critic. The policy is the actor. Together: actor-critic.

Chapter 02

V, Q, and Advantage

Three objects underpin all of actor-critic. Each answers a different question about how good things are under the current policy π.

Definition

State-Value Function — V^π(s)

V^π(s) = 𝔼_τ~π[ Σ_t'=t^T r(s_t', a_t') | s_t = s ]

"How much total reward do I expect from state s if I follow policy π from here on?" A single number per state. Think of it as the real estate value of a state — some states are inherently more promising.

Definition

Action-Value Function — Q^π(s, a)

Q^π(s, a) = 𝔼_τ~π[ Σ_t'=t^T r(s_t', a_t') | s_t = s, a_t = a ]

"How much total reward do I expect if I take action a in state s, then follow π after?" Same as V but conditions on the first action. Q is more detailed — it knows about specific actions, not just locations.

Definition

Advantage Function — A^π(s, a)

A^π(s, a) = Q^π(s, a) − V^π(s)

"How much better is action a compared to what we'd normally do?" If A > 0, action a is better than the average action from this state. If A < 0, it's worse. This is the signal we need for policy gradients.

Intuition

V is the "value of the neighborhood." Q is the "value of taking a specific road in that neighborhood." Advantage is "how much better is this road than the average road?" A positive advantage means the action outperforms the default policy.

🔨 Derivation A2C Advantage = Q − V Reduces Variance (Proof) ▶ ✓ ATTEMPTED

The policy gradient theorem gives: ∇J = E[∇ log π(a|s) · Q(s,a)]. We can subtract any state-dependent baseline b(s) without changing the expected gradient: ∇J = E[∇ log π(a|s) · (Q(s,a) − b(s))].

Your task: (1) Show that subtracting b(s) doesn't bias the gradient (prove E[∇ log π · b(s)] = 0). (2) Show that the optimal b*(s) that minimizes variance equals V^π(s), making A(s,a) = Q(s,a) − V(s) the minimum-variance advantage estimator.

The variance of the gradient estimator is Var[∇ log π · (Q − b)]. Expand: E[((∇ log π)(Q−b))²] − (E[...]).². Differentiate with respect to b and set to zero. The optimal b* = E_a[(∇ log π)² Q] / E_a[(∇ log π)²]. If you approximate (∇ log π)² as roughly constant across actions (crude but common), this simplifies to b* ≈ E_a~π[Q(s,a)] = V^π(s).

Without baseline: Var ∝ E[Q(s,a)²]. With optimal baseline V: Var ∝ E[(Q−V)²] = E[A²]. Since A sums to zero across actions (E_a~π[A] = 0), the advantage has smaller magnitude than raw Q-values. If Q ranges from 50-60 (all positive), A ranges from −5 to +5. Variance reduction: (5/55)² ≈ 120x less variance!

Part 1 — Unbiasedness:

E_a~π(·|s)[∇_θ log π_θ(a|s) · b(s)] = b(s) · E_a~π[∇ log π(a|s)]

= b(s) · Σ_a π(a|s) · ∇π(a|s)/π(a|s) = b(s) · Σ_a ∇π(a|s) = b(s) · ∇ 1 = 0

Any baseline that depends only on s (not a) vanishes in expectation. This means we can subtract V, a constant, or even a neural network of s — the gradient stays unbiased.

Part 2 — Optimal baseline minimizes variance:

Var = E_s,a[(g(s,a) · (Q(s,a) − b(s)))²] where g = ∇ log π.

d/db Var = −2 E_s[E_a[g²(Q − b)]] = 0

⇒ b*(s) = E_a~π[g(s,a)² Q(s,a)] / E_a~π[g(s,a)²]

Under the approximation that g² is roughly uniform across actions:

b*(s) ≈ E_a~π[Q(s,a)] = V^π(s)

The key insight: The advantage A = Q − V is the minimum-variance baseline for policy gradients. It works because V centers the signal around zero: good actions get positive weight, bad actions get negative weight, and the magnitudes are as small as possible. This is why actor-critic (which learns V as the baseline) dramatically outperforms vanilla REINFORCE.

A Worked Example

Gridworld Advantage Calculation

Imagine a 3×3 grid. The agent is in the center cell. Under current policy π, the expected returns are:

• V^π(center) = 6.0 — on average, starting from center, we get 6.0 total reward.

• Q^π(center, right) = 8.5 — going right first, then following π, gives 8.5.

• Q^π(center, left) = 4.0 — going left first gives only 4.0.

• Q^π(center, up) = 5.5 — going up gives 5.5.

The advantages:

• A(center, right) = 8.5 − 6.0 = +2.5 → much better than average! Push gradient to go right more.

• A(center, left) = 4.0 − 6.0 = −2.0 → worse than average. Decrease this action's probability.

• A(center, up) = 5.5 − 6.0 = −0.5 → slightly below average.

Notice: V is the weighted average of Q across all actions, so advantages always average to zero: Σ_a π(a|s) A(s,a) = 0.

Chapter 03

The Actor-Critic Loop

Actor-critic turns the policy gradient into a two-network system. One network acts, the other judges.

Definition

Actor — π_θ(a|s)

The policy network. Takes a state, outputs action probabilities (discrete) or a distribution over actions (continuous). Parameters θ. This is what we deploy.

Definition

Critic — V̂^π_φ(s)

The value network. Takes a state, outputs a scalar: the estimated expected future reward. Parameters φ. This is a tool for training — at deployment, only the actor matters.

The Actor-Critic Loop (High Level)

Act: Run policy π_θ to collect a batch of (s, a, r, s') transitions
Evaluate: Fit the critic V̂_φ to estimate expected returns (policy evaluation)
Improve: Compute advantages using the critic, update the actor θ via policy gradient (policy improvement)
Repeat from step 1

Why fit V instead of Q or A directly?

We only need to fit V! Because we can reconstruct the advantage from V using a single sampled transition:

A^π(s_t, a_t) ≈ r(s_t, a_t) + V^π(s_t+1) − V^π(s_t)

The reward r(s_t, a_t) plus the value of the next state V(s_t+1) gives us an estimate of Q. Subtract V(s_t) to get A. One network, not three.

Why does this work?

Start from the definition: Q^π(s,a) = r(s,a) + 𝔼_s'[V^π(s')]. We can't compute the expectation over all possible next states — but we have one actual next state s_t+1 from our rollout. Using it gives a single-sample estimate of Q. Subtract V(s_t) to get A. The approximation is:

Â^π(s_t, a_t) = r(s_t, a_t) + V̂^π_φ(s_t+1) − V̂^π_φ(s_t)

This is a one-step estimate of the advantage. The "hat" ̂ reminds us that V̂ is our neural network approximation, not the true value.

Chapter 04

Monte Carlo Estimation

The critic needs training targets. Where do they come from? First approach: just use the actual rewards from our rollouts.

Definition

Monte Carlo Target

y_i,t = Σ_t'=t^T r(s_i,t', a_i,t')

For each state visited, the MC target is the actual sum of rewards from that point until the end of the episode. This is a direct, model-free estimate of V^π(s_i,t).

Training the critic is then straightforward supervised learning:

Monte Carlo Value Estimation

Collect N trajectories by running π_θ
Compute targets: For each (i, t), set y_i,t = Σ_t'=t^T r(s_i,t', a_i,t')
Regression: Minimize Σ_i,t ( V̂_φ(s_i,t) − y_i,t )² with gradient descent on φ

Implicit multi-sample averaging

Each y_i,t is a single noisy sample. But different trajectories often visit the same or similar states. When the neural network fits all these targets simultaneously, it implicitly averages across many samples for each state. The network acts as a smoothing function — nearby states get similar value estimates even if individual trajectories were noisy.

Worked Example

A simple chain: states A → B → C → terminal, with rewards r(A)=1, r(B)=3, r(C)=5.

• y(A) = 1 + 3 + 5 = 9

• y(B) = 3 + 5 = 8

• y(C) = 5 = 5

The critic trains to output V̂(A) ≈ 9, V̂(B) ≈ 8, V̂(C) ≈ 5. In practice, stochastic dynamics and different actions yield different rewards each time — the network learns the average.

High Variance

MC targets sum many random variables (every future reward). Each is noisy. The sum compounds that noise. For a 1000-step episode, y_t is the sum of up to 1000 random rewards — the variance can be enormous. Can we do better?

Chapter 05

Bootstrapping & TD Learning

Instead of summing all future rewards (expensive, noisy), use the critic's own estimate for the future.

Definition

Temporal Difference (TD) Target

y_i,t = r(s_i,t, a_i,t) + V̂^π_φ(s_i,t+1)

One real reward step, then bootstrap: use the critic's own estimate of V for the rest. The name "temporal difference" comes from the fact that we use the difference in value estimates at consecutive time steps.

Same supervised learning pipeline:

TD Value Update φ ← φ − α ∇_φ Σ_i,t ( V̂_φ(s_i,t) − y_i,t )²

Circular but powerful

TD looks suspicious: we're training V̂ using targets that depend on V̂ itself! The labels change every gradient step. But this is exactly how dynamic programming works — iteratively refine estimates using neighboring estimates. It converges because each update makes the value function more self-consistent (satisfying the Bellman equation).

Why TD has lower variance

MC target: y = r_t + r_t+1 + r_t+2 + … + r_T (sum of many random variables).

TD target: y = r_t + V̂(s_t+1) (one random variable + one deterministic function evaluation).

Fewer random terms → less variance. But V̂ might be wrong → biased. This is the fundamental tradeoff.

MC vs. TD: Head to Head

Concrete Comparison

A robot walks a 5-step hallway. Reward is −1 at every step except r = +10 at the final step. One trajectory: s₁ → s₂ → s₃ → s₄ → s₅.

MC target for s₁: −1 + (−1) + (−1) + (−1) + 10 = 6. Exact for this trajectory but varies wildly if the robot sometimes falls at step 3.

TD target for s₁: −1 + V̂(s₂). If V̂(s₂) = 7.2 (current estimate), then y = 6.2. Smoother, but only as good as V̂.

Property	Monte Carlo	TD (Bootstrap)
Bias	Unbiased	Biased (uses own estimate)
Variance	High (sum of many rewards)	Low (one reward + one estimate)
Needs episode end?	Yes (must reach terminal state)	No (works step-by-step)
Works with infinite horizons?	No	Yes
Early training	More reliable (no wrong estimates)	Can propagate errors

Chapter 06

N-Step Returns

MC and TD are two extremes. What if we took the best of both?

Definition

N-Step Return Target

N-Step Target y_i,t = Σ_t'=t^t+n−1 γ^t'−t r(s_i,t', a_i,t') + γⁿ V̂^π(s_i,t+n)

Use n steps of real rewards, then bootstrap with the critic for the rest. n = 1 is pure TD. n = T (rest of episode) is pure MC. Any n in between blends the two.

The Bias-Variance Spectrum

Small n (near TD): low variance, higher bias. You trust your critic more. Large n (near MC): higher variance, lower bias. You trust your actual rewards more. In practice, n between 5 and 20 often works best — enough real reward to reduce bias, short enough to keep variance manageable.

N-Step in Action

Same 5-step hallway: rewards [−1, −1, −1, −1, +10]. Say γ = 1 for simplicity and V̂(s₃) = 8.0.

• 1-step (TD): y(s₁) = −1 + V̂(s₂). Depends entirely on V̂(s₂).

• 2-step: y(s₁) = −1 + (−1) + V̂(s₃) = −2 + 8.0 = 6.0. Two real rewards, rest from critic.

• 5-step (MC): y(s₁) = −1 + (−1) + (−1) + (−1) + 10 = 6. All real rewards.

With a good critic, the 2-step estimate will be less noisy than MC across different trajectories, while being less biased than 1-step TD.

n	Name	Bias	Variance	When to Use
1	TD(0)	Highest	Lowest	Good critic, dense rewards
5–20	N-step	Medium	Medium	Most practical settings
T	MC	None	Highest	Short episodes, early training

GAE: The Continuous Blend

Generalized Advantage Estimation (GAE, Schulman et al. 2016) takes a weighted average of all n-step returns using parameter λ ∈ [0,1]:

Â^GAE = Σ_l=0^∞ (γλ)^l δ_t+l where δ_t = r_t + γV̂(s_t+1) − V̂(s_t)

λ = 0 is pure TD advantage (one step). λ = 1 is full MC advantage. λ = 0.95 is a common default in PPO.

🔨 Derivation GAE(λ) Derivation — Why TD(λ) Interpolates MC and TD ▶ ✓ ATTEMPTED

Define the n-step advantage: Â⁽ⁿ⁾_t = Σ_l=0ⁿ⁻¹ γ^l r_t+l + γⁿ V̂(s_t+n) − V̂(s_t). GAE is defined as the exponentially-weighted average: Â^GAE_t = (1−λ) Σ_n=1^∞ λⁿ⁻¹ Â⁽ⁿ⁾_t.

Your task: Show this simplifies to Â^GAE_t = Σ_l=0^∞ (γλ)^l δ_t+l, where δ_t = r_t + γV̂(s_t+1) − V̂(s_t). Then show that λ=0 gives TD(0) and λ=1 gives MC.

The n-step advantage can be written as a telescoping sum of TD errors: Â⁽ⁿ⁾_t = Σ_l=0ⁿ⁻¹ γ^l δ_t+l. Verify: Σ γ^l(r_t+l + γV̂(s_t+l+1) − V̂(s_t+l)) telescopes because +γV(s_t+1) at l=0 cancels with −V(s_t+1) at l=1 (after multiplying by γ).

Â^GAE = (1−λ) Σ_n=1^∞ λⁿ⁻¹ Σ_l=0ⁿ⁻¹ γ^l δ_t+l. Swap sums: for each δ_t+l, count how many n-step estimates include it. It appears for n ≥ l+1, so its coefficient is (1−λ) γ^l Σ_n=l+1^∞ λⁿ⁻¹ = (1−λ) γ^l λ^l / (1−λ) = (γλ)^l.

Step 1: Express Â⁽ⁿ⁾ as sum of TD errors.

Â⁽ⁿ⁾_t = Σ_l=0ⁿ⁻¹ γ^l r_t+l + γⁿ V̂(s_t+n) − V̂(s_t)

Rewrite: = Σ_l=0ⁿ⁻¹ γ^l [r_t+l + γV̂(s_t+l+1) − V̂(s_t+l)] = Σ_l=0ⁿ⁻¹ γ^l δ_t+l

(The telescoping: V̂(s_t) cancels from the first term's −V̂(s_t), intermediate V̂'s cancel in pairs, leaving +γⁿV̂(s_t+n) from the last term.)

Step 2: Substitute into GAE definition.

Â^GAE = (1−λ) Σ_n=1^∞ λⁿ⁻¹ Σ_l=0ⁿ⁻¹ γ^l δ_t+l

Step 3: Swap summation order. For fixed l, δ_t+l appears when n ≥ l+1:

Coefficient of δ_t+l = (1−λ) γ^l Σ_n=l+1^∞ λⁿ⁻¹ = (1−λ) γ^l · λ^l/(1−λ) = (γλ)^l

Therefore: Â^GAE_t = Σ_l=0^∞ (γλ)^l δ_t+l

Boundary cases:

λ = 0: Â^GAE = δ_t = r_t + γV̂(s_t+1) − V̂(s_t). Pure TD advantage. High bias, low variance.

λ = 1: Â^GAE = Σ_l=0^∞ γ^l δ_t+l = Σ γ^l r_t+l − V̂(s_t). This is the MC return minus the baseline. Zero bias (if V̂ is removed from target), highest variance.

The key insight: λ controls an effective discount on TD errors. Each additional step's δ is weighted by (γλ)^l. Since γ already discounts future rewards, λ provides a second layer of discounting that specifically controls how much we trust the critic vs. actual rewards. Low λ = trust the critic. High λ = trust the rewards.

Checkpoint — Before you move on

You've seen MC, TD, N-step, and GAE. In one sentence each: (1) Why does TD have bias? (2) Why does MC have high variance? (3) What role does λ play in GAE?

✓ Gate cleared

Model Answer

(1) TD has bias because its target y = r + γV̂(s') uses V̂ — a learned approximation that's wrong early in training. We're bootstrapping from our own (incorrect) estimates. (2) MC has high variance because y = Σr_t' sums many stochastic rewards. Each future timestep adds another random variable, and their variances compound. A 1000-step episode means summing 1000 random variables. (3) λ in GAE trades off bias vs. variance by exponentially weighting n-step returns. Low λ trusts the critic more (lower variance, higher bias). High λ trusts actual rewards more (higher variance, lower bias). λ=0.95 is the standard because it leans toward trusting rewards while still getting meaningful variance reduction from the critic.

Chapter 07

Discount Factors

So far we've treated all future rewards equally. But a reward 1000 steps from now is both uncertain and far away. Enter γ.

Definition

Discount Factor — γ (gamma)

Discounted Return G_t = γ⁰ r_t + γ¹ r_t+1 + γ² r_t+2 + … = Σ_k=0^∞ γ^k r_t+k

γ ∈ [0, 1]. Each future reward is multiplied by γ raised to the power of how many steps away it is. Reward now = full weight. Reward in 100 steps = γ¹⁰⁰ ≈ tiny.

Why discount?

Three reasons: (1) Mathematical — with γ < 1, infinite sums converge. Without it, V^π can be infinite. (2) Practical — distant rewards are more uncertain, so trusting them less reduces variance. (3) Behavioral — we want agents that don't defer reward forever ("I'll clean up tomorrow… forever").

The Effect of γ

Rewards: [+1, +1, +1, +1, +1] at steps t, t+1, t+2, t+3, t+4.

• γ = 1.0: G = 1 + 1 + 1 + 1 + 1 = 5.0 (all rewards equal)

• γ = 0.99: G = 1 + 0.99 + 0.98 + 0.97 + 0.96 = 4.90 (slight decay)

• γ = 0.9: G = 1 + 0.9 + 0.81 + 0.73 + 0.66 = 4.10 (noticeable)

• γ = 0.5: G = 1 + 0.5 + 0.25 + 0.125 + 0.063 = 1.94 (very myopic)

• γ = 0.0: G = 1 + 0 + 0 + 0 + 0 = 1.0 (only immediate reward)

The Discounted Value Functions

With discounting, all our definitions update:

Discounted Bellman Equation V^π(s) = 𝔼_a~π[ r(s, a) + γ V^π(s') ]

Discounted Advantage A^π(s_t, a_t) = r(s_t, a_t) + γ V^π(s_t+1) − V^π(s_t)

Common γ Values

Practical Guidelines

• γ = 0.99 — the default for most tasks. Cares about ~100 steps of future.

• γ = 0.999 — for very long-horizon tasks (thousands of steps).

• γ = 0.95 — when you want faster learning at the cost of some long-term quality.

Rule of thumb: the "effective horizon" is ~1/(1−γ). So γ=0.99 → ~100 steps, γ=0.999 → ~1000 steps.

🔨 Derivation TD(λ) Bias-Variance Tradeoff — Quantifying the Error ▶ ✓ ATTEMPTED

The TD(0) advantage Â_t = δ_t = r_t + γV̂(s_t+1) − V̂(s_t) has bias proportional to V̂'s error, while the MC advantage A_t = Σγ^kr_t+k − V̂(s_t) has variance proportional to the episode length.

Your task: Quantify both. If V̂(s) = V^π(s) + ε(s) with |ε| ≤ ε_max, bound the bias of TD(0). If rewards have variance σ_r², bound the variance of MC returns for horizon T.

True advantage: A(s,a) = r + γV(s') − V(s). TD estimate: Â = r + γV̂(s') − V̂(s) = r + γ(V(s') + ε(s')) − (V(s) + ε(s)) = A(s,a) + γε(s') − ε(s). The bias is γε(s') − ε(s), bounded by |bias| ≤ (1 + γ)ε_max.

MC return G_t = Σ_k=0^T−t−1 γ^k r_t+k. If rewards are independent with Var(r) = σ_r²: Var(G_t) = Σ_k=0^T−t−1 γ^2k σ_r² = σ_r² (1 − γ^2(T−t)) / (1 − γ²). For γ = 0.99 and T−t = 1000: Var ≈ 50 σ_r².

TD(0) Bias:

Â_TD = r + γV̂(s') − V̂(s) = [r + γV(s') − V(s)] + [γε(s') − ε(s)]

= A^true(s,a) + bias term

|Bias| = |γε(s') − ε(s)| ≤ γ|ε(s')| + |ε(s)| ≤ (1+γ)ε_max

For γ = 0.99: |Bias| ≤ 1.99 ε_max. If V̂ is off by 2.0 on average, TD advantages can be wrong by up to 4.0!

MC Variance:

Var(G_t) = Σ_k=0^H−1 γ^2k Var(r_t+k) = σ_r² · (1 − γ^2H)/(1−γ²)

For γ=0.99, H=1000: (1−0.99²⁰⁰⁰)/(1−0.9801) ≈ 1/0.0199 ≈ 50.25

So Var(MC) ≈ 50 σ_r². Standard deviation ≈ 7σ_r.

The tradeoff:

TD(0): MSE = Bias² + Var(single step) ≈ (2ε_max)² + σ_r²

MC: MSE = 0 + 50σ_r²

TD wins when: 4ε_max² + σ_r² < 50σ_r² ⇒ ε_max < 3.5σ_r

The key insight: TD is better whenever the critic's error is less than ~3.5x the per-step reward noise. Early in training (critic very wrong), MC wins. Late in training (critic accurate), TD wins. GAE(λ) with λ=0.95 automatically interpolates — relying mostly on actual rewards but getting meaningful variance reduction from the critic's short-horizon estimates.

Chapter 08

The Full Actor-Critic Algorithm

Now we assemble every piece. This is the complete algorithm you'd implement.

Algorithm: Actor-Critic (On-Policy, N-Step)

Initialize actor parameters θ, critic parameters φ
Loop:
1. Sample batch: Collect N trajectories {(s_i,t, a_i,t, r_i,t, s_i,t+1)} by running π_θ
2. Fit critic (policy evaluation):
  1. Compute n-step targets: y_i,t = Σ_t'=t^t+n−1 γ^t'−t r_i,t' + γⁿ V̂_φ(s_i,t+n)
  2. Update: φ ← φ − α_φ ∇_φ Σ_i,t ( V̂_φ(s_i,t) − y_i,t )²
3. Evaluate advantages (using critic):
  Â^π(s_i,t, a_i,t) = r(s_i,t, a_i,t) + γ V̂_φ(s_i,t+1) − V̂_φ(s_i,t)
4. Policy gradient (actor update):
  ∇_θ J ≈ Σ_i,t ∇_θ log π_θ(a_i,t|s_i,t) · Â^π(s_i,t, a_i,t)
5. Update actor: θ ← θ + α_θ ∇_θ J

Step-by-Step Numerical Walkthrough

One Iteration in Detail

Tiny environment: 3 states {A, B, C}. Two actions {left, right}. γ = 0.9. Current critic: V̂(A) = 2.0, V̂(B) = 5.0, V̂(C) = 8.0.

Step 2a: Sample batch. One trajectory: A ⟶^right B ⟶^right C. Rewards: r(A, right) = 1, r(B, right) = 3.

Step 2b: Critic targets (1-step TD):

• y(A) = r(A, right) + γ V̂(B) = 1 + 0.9 × 5.0 = 5.5

• y(B) = r(B, right) + γ V̂(C) = 3 + 0.9 × 8.0 = 10.2

Critic loss: (V̂(A) − 5.5)² + (V̂(B) − 10.2)² = (2 − 5.5)² + (5 − 10.2)² = 12.25 + 27.04 = 39.29. Gradient descent will push V̂(A) toward 5.5 and V̂(B) toward 10.2.

Step 2c: Advantages:

• Â(A, right) = 1 + 0.9 × 5.0 − 2.0 = +3.5 (great action!)

• Â(B, right) = 3 + 0.9 × 8.0 − 5.0 = +5.2 (even better)

Step 2d: Both advantages are positive → increase probability of "right" in both states.

Two Learning Rates

The actor and critic are trained simultaneously but at potentially different learning rates. A common pattern: α_φ = 3 × α_θ. The critic should learn faster than the actor, so the advantages are based on a reasonably good value estimate. If the critic is still random, advantages are garbage and the actor learns nonsense.

Shared vs. Separate Networks

In some implementations (e.g., A3C), the actor and critic share a feature backbone with two separate heads. This saves parameters and computation, but can cause instability — the critic's gradient can interfere with actor features. Modern practice (PPO, SAC): separate networks unless you have a specific reason to share.

💥 Break-It Lab What Dies When You Remove Actor-Critic Components? ▶ ✓ ATTEMPTED

A well-tuned actor-critic has three critical relationships: the value baseline, the critic's learning speed relative to the actor, and the actor's update magnitude. Toggle each to see distinct failure modes.

Remove Value Baseline (back to REINFORCE) ACTIVE

Failure mode: High-variance gradients. Without V(s) as baseline, the policy gradient uses raw returns: ∇J = Σ ∇logπ · R(τ). If all trajectories have positive reward (common in continuous control), every action gets reinforced — just some more than others. The signal-to-noise ratio is terrible. Gradient estimates oscillate wildly between batches. Learning happens but 10-100x slower, with massive variance in the training curve.

Critic Learns Too Slowly (α_critic = 0.1x actor) ACTIVE

Failure mode: Stale advantages. If the critic can't keep up with the actor, V̂(s) reflects the value of an old policy. Advantages computed as r + γV̂(s') − V̂(s) are garbage — they praise actions that were good 1000 updates ago. The actor receives misleading gradient signals. Training stalls or oscillates as the actor overshoots, then the critic slowly catches up, then the actor overshoots again.

Actor Updates Too Aggressively (α_actor = 10x normal) ACTIVE

Failure mode: Policy oscillation. Large actor updates change the policy drastically each iteration. The advantages from one step become meaningless for the new policy. The critic's targets (computed under the old policy) are wrong. The policy swings between extremes — "always go left" then "always go right" — never settling. This is exactly what PPO's clipping mechanism was designed to prevent: bounding the policy change per update.

Connecting to policy gradients

Compare the gradient from the PG lesson (with reward-to-go and baseline) to the actor-critic gradient:

PG with baseline ∇_θ J ≈ Σ_i,t ∇_θ log π · ( Σ_t'≥t r − b )

Actor-Critic ∇_θ J ≈ Σ_i,t ∇_θ log π · ( r_t + γV̂(s_t+1) − V̂(s_t) )

They're the same structure. The only change: noisy single-sample reward-to-go is replaced by the critic-based advantage. Less noise → better gradients → faster learning.

Chapter 09

Off-Policy: Importance Weights

On-policy actor-critic still throws away data after each update. Can we squeeze multiple gradient steps from the same batch?

Definition

Importance-Weighted Off-Policy Gradient

∇_θ' J ≈ Σ_i,t [ π_θ'(a_i,t|s_i,t) / π_θ(a_i,t|s_i,t) ] · ∇_θ' log π_θ'(a_i,t|s_i,t) · Â^π(s_i,t, a_i,t)

Data was collected under π_θ (old policy). We want to update θ' (new policy). The ratio π_θ'/π_θ corrects for the distribution mismatch — exactly like importance sampling.

Now we can take K gradient steps on the same batch without recollecting data. Much more efficient!

The Drift Problem

After many gradient steps, θ' drifts far from θ. The importance weights become extreme: if π_θ'(a|s) = 0.8 but π_θ(a|s) = 0.01, the ratio = 80. A few huge weights dominate the gradient → instability. The advantages were computed for the old policy — they're stale.

Solution 1: KL Constraint

KL-Constrained Update (TRPO / PPO) max_θ' Σ_i,t [ π_θ'(a|s) / π_θ(a|s) ] · Â(s, a)
subject to: 𝔼_s[ D_KL( π_θ'(·|s) ∥ π_θ(·|s) ) ] ≤ ε

Don't let the new policy stray too far from the old one. This keeps importance weights near 1.

Solution 2: Clipping (PPO)

Definition

PPO Clipped Objective

PPO-Clip L^CLIP(θ') = 𝔼[ min( w_t Â_t, clip(w_t, 1−ε, 1+ε) Â_t ) ]

where w_t = π_θ'(a_t|s_t) / π_θ(a_t|s_t)

Clip the importance ratio to [1−ε, 1+ε] (typically ε = 0.2). If the ratio tries to go beyond the clip range, the gradient is zero — preventing any single update from changing the policy too much.

Why PPO is so popular

PPO = actor-critic + importance sampling + clipping. It gets multiple gradient steps per batch (data efficient), prevents catastrophic policy updates (stable), and is simple to implement (~50 lines of core logic). It's the default algorithm for ChatGPT's RLHF, robotic locomotion, and game AI.

Clipping in Action

ε = 0.2. Old policy: π_θ(right|s) = 0.4. New policy after 3 gradient steps: π_θ'(right|s) = 0.7.

Ratio w = 0.7/0.4 = 1.75. Clipped to [0.8, 1.2] → 1.2.

If Â = +3.0: unclipped objective = 1.75 × 3.0 = 5.25. Clipped = 1.2 × 3.0 = 3.6. PPO takes min(5.25, 3.6) = 3.6.

The policy wants to increase this action's probability even more, but PPO says "that's enough for one batch." Collect new data first.

Chapter 10

Off-Policy: Replay Buffers

PPO reuses one batch for a few gradient steps. What if we stored all past experience and sampled from it freely?

Definition

Replay Buffer

A large memory that stores transitions (s, a, r, s') from all past policies. During training, we sample random minibatches from this buffer. The transitions might be from an hour ago or a million gradient steps ago.

Two Problems to Fix

Problem 1: Advantage A = r + γV(s') − V(s) requires V^π for the current policy, but the transitions were generated by old policies. V of the current policy in those states might be different.

Problem 2: We can't use single-step importance weights effectively because the states themselves were visited under a different policy.

Solution: Learn Q Instead of V

With a replay buffer, we switch from V to Q:

Definition

Q-Function Critic

Q-Learning Target (Off-Policy) y_j = r_j + γ Q^π_φ(s'_j, π_θ(s'_j))

For each transition (s, a, r, s') in the buffer: the target is the reward plus the discounted Q-value of the next state under the current policy's action. The key: we evaluate π_θ(s') using the current actor, not the one that generated the data.

Why Q works off-policy but V doesn't

V^π(s) = 𝔼_a~π[Q^π(s,a)] depends on which actions π would take. If the transition was generated by a different policy, the action a in the buffer might not be what the current π would do → V estimate is wrong.

Q^π(s, a) conditions on a specific action. The (s, a, r, s') tuple tells us exactly what happened when action a was taken in state s. That factual information is valid regardless of which policy chose a. We only need the current policy to evaluate the next action at s'.

SAC: A Modern Replay-Buffer Algorithm

Definition

SAC — Soft Actor-Critic (Haarnoja et al. 2018)

Maximum entropy RL: the policy maximizes reward and entropy (randomness). This encourages exploration and makes training more stable. Uses a replay buffer with two Q-networks (for stability) and automatic temperature tuning.

Algorithm: SAC (Simplified)

Initialize actor π_θ, two critics Q_φ1, Q_φ2, target networks
Loop:
1. Act: Sample a_t ~ π_θ(·|s_t), store (s_t, a_t, r_t, s_t+1) in replay buffer 𝔇
2. Sample minibatch of transitions from 𝔇
3. Update critics: y = r + γ min(Q_φ1', Q_φ2') evaluated at next action from current π
4. Update actor: maximize 𝔼[Q(s, π(s)) + α H(π(·|s))]
5. Soft-update target networks: φ' ← τφ + (1−τ)φ'

Reparameterization Trick

To backprop through the actor into the Q-function, SAC uses the reparameterization trick: instead of sampling a ~ π(·|s), write a = μ_θ(s) + σ_θ(s) · ε, where ε ~ 𝒩(0, 1). Now the gradient flows through the deterministic μ and σ, treating ε as a constant. No log-probability needed — direct gradient from Q to θ.

💻 Build It Implement GAE (Lambda-Weighted TD Errors) ▶ ✓ ATTEMPTED

Implement GAE advantage computation. Given a batch of rewards, values, and dones from a rollout, compute the GAE advantages for each timestep. This is the core of PPO's advantage estimation.

signature def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95): """Compute Generalized Advantage Estimation. Args: rewards: list/array of rewards [T] values: list/array of V(s_t) predictions [T+1] (includes bootstrap V(s_T)) dones: list/array of done flags [T] (1 if episode ended) gamma: discount factor lam: GAE lambda parameter Returns: advantages: array of GAE advantages [T] """

Test case

# rewards = [1, 1, 1], values = [5, 6, 7, 8], dones = [0, 0, 0], gamma=0.99, lam=0.95
# delta_2 = 1 + 0.99*8 - 7 = 1.92
# delta_1 = 1 + 0.99*7 - 6 = 1.93
# delta_0 = 1 + 0.99*6 - 5 = 1.94
# A_2 = 1.92
# A_1 = 1.93 + 0.99*0.95*1.92 = 3.74
# A_0 = 1.94 + 0.99*0.95*3.74 = 5.46

GAE has a recursive structure: A_t = δ_t + (γλ)(1−done_t) · A_t+1. Initialize A_T = 0, then sweep backward. The (1−done) factor resets the advantage at episode boundaries — future rewards from a new episode shouldn't affect the current one.

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0

    for t in reversed(range(T)):
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]

        # GAE recursive formula
        last_gae = delta + gamma * lam * next_non_terminal * last_gae
        advantages[t] = last_gae

    return advantages

Bonus challenge: Modify this to also return the value targets (advantages + values[:T]). In PPO, these targets are used to train the critic via MSE loss. Why do we use advantages + V(s) as the critic target rather than computing n-step returns separately? (Answer: GAE advantages already encode the optimal multi-step target implicitly. Adding V(s) back gives us the "GAE-corrected return" which is a lower-variance target than raw MC returns.)

⚔ Adversarial: The Deadly Critic Lag

You're training PPO on a locomotion task. After 1M steps, the agent walks well (reward ~300). You save the critic checkpoint. You then reset the actor to random weights but keep the trained critic. What happens?

The agent quickly re-learns because the critic provides perfect advantages Nothing changes because PPO doesn't use the critic for action selection The critic gives WRONG advantages because V^π depends on which policy π is being evaluated, and a random policy has completely different state values

Chapter 11

PPO vs. SAC

These two algorithms dominate modern deep RL. They represent fundamentally different design choices.

Property	PPO	SAC
Off-policy degree	Slightly off-policy (few steps per batch)	Fully off-policy (replay buffer)
Value function	V^π(s)	Q^π(s, a)
Policy update	Clipped surrogate objective	Reparameterization through Q
Data efficiency	Lower (discard data quickly)	Higher (reuse all data)
Wall-clock speed	Often faster (simpler updates)	Slower per step (more networks)
Stability	Very stable, easy to tune	Can be finicky (critic lag)
Exploration	Stochastic policy + entropy bonus	Maximum entropy (built-in)
Action space	Both discrete and continuous	Primarily continuous
Famous uses	ChatGPT RLHF, Dota 2, robotics	Dexterous manipulation, locomotion

When to Use Which

PPO when: you have fast simulation (cheap data), need stability, discrete actions, or are new to RL. Start here.

SAC when: data is expensive (real robot), continuous actions, need maximum data efficiency, or the task requires exploration.

Definition

The Spectrum of Off-Policyness

All actor-critic methods live on a spectrum. On one end: pure on-policy (REINFORCE, A2C) — collect, update, discard. In the middle: "slightly" off-policy (PPO) — a few gradient steps per batch. On the far end: fully off-policy (SAC, TD3) — replay buffer, hundreds of gradient steps per environment step.

More off-policy = more data efficient but harder to train. Less off-policy = simpler but wasteful.

Robotics Example

You have a physical robot arm learning to pick up objects. Each real-world episode takes 30 seconds, and you can run maybe 1000 episodes per day.

PPO approach: Collect 100 episodes, take ~10 gradient steps, discard. Next day: another 100. After a week: maybe 7000 episodes used, 70 updates. Slow.

SAC approach: Every single transition goes into the replay buffer. After day 1: 1000 episodes, but ~50,000 gradient updates sampling from the buffer. After a week: 7000 episodes but ~350,000 updates. 50x more learning per real-world interaction.

🏗 Design Challenge You're the Architect: PPO for a Humanoid Walker ▶ ✓ ATTEMPTED

Design a PPO system for training a simulated humanoid (like MuJoCo Humanoid-v4) to walk. 300-dimensional observation (joint angles, velocities, contacts), 17-dimensional continuous action (torques). Budget: 50 million environment steps on 8 parallel workers.

Observation dim

300 (proprioception)

Action dim

17 (continuous torques)

Step budget

50M total steps

Parallel workers

8 environments

Episode length

~1000 steps

GPU

1x RTX 3090 (24GB)

1. Network architecture: MLP depth/width for both actor and critic? Separate or shared backbone? Activation function?

2. Action distribution: Gaussian with what parameterization? State-dependent std or fixed? Initial std?

3. GAE parameters: what γ and λ for a 1000-step horizon? How many steps per rollout before update?

4. PPO hyperparameters: clip epsilon, number of epochs per batch, minibatch size, learning rate schedule?

5. Normalization: observation normalization? Reward scaling? Advantage normalization? Value function clipping?

Network: Separate actor and critic. Both: MLP with 2 hidden layers of 256 units, tanh activation (not ReLU — tanh naturally bounds outputs, important for smooth locomotion). Orthogonal weight initialization (helps with gradient flow in deep RL).

Action distribution: Diagonal Gaussian. Mean μ(s) from actor network. Log-std as a separate learnable parameter (NOT state-dependent initially — state-dependent std is harder to train). Initial log-std = 0 (σ = 1). The std naturally decreases as training converges.

GAE: γ = 0.99, λ = 0.95. Rollout length = 2048 steps per worker. Batch = 8 workers × 2048 = 16,384 transitions. This is ~16 episodes worth of data per update.

PPO: Clip ε = 0.2. 10 epochs per batch. 32 minibatches (minibatch size = 512). Learning rate = 3e-4 with linear decay to 0 over training. Entropy coefficient = 0.0 (humanoid doesn't need exploration bonus).

Normalization (CRITICAL): Running mean/std normalization of observations. Reward scaling by 1/std(returns). Advantage normalization (subtract mean, divide by std) per minibatch. Value function loss clipped (prevent value function from changing too fast). These normalizations are often more important than the algorithm choice — without them, PPO frequently fails on Humanoid.

Timeline: 50M steps / (8 workers × 2048 steps/batch) = ~3000 updates. At ~1 sec/update on a 3090: ~50 minutes to train. Expected reward: ~5000+ (stable walking) by 20M steps.

🔗 Pattern Recognition

Actor-Critic = Policy Gradient + Learned Baseline

Policy Gradients (Previous Lesson)

∇J = E[∇logπ · (R(τ) − b)]
b is a constant or simple average. No learning. → Policy Gradients

Actor-Critic (This Lesson)

∇J = E[∇logπ · (r + γV̂(s') − V̂(s))]
b = V̂(s), a learned neural network. Low variance.

The leap from PG to AC is just one thing: replacing the constant baseline with a learned value function. Everything else (GAE, PPO clipping, replay buffers) is optimization of this core idea. The critic is just a very good baseline.

Can you see the same "learned estimator replaces crude sample" pattern in supervised learning? (Hint: batch normalization learns running statistics instead of using per-batch estimates.)

🔗 Pattern Recognition

Off-Policy Actor-Critic → Q-Learning Continuum

Actor-Critic (learns Q^π)

Critic evaluates current policy's actions.
Actor explicitly parameterized. Continuous actions natural.

Q-Learning (learns Q*)

Critic evaluates optimal actions via max.
Policy = argmax (implicit). Discrete actions only. → Q-Learning

Both use replay buffers and both learn a Q-function. The key difference: actor-critic learns Q^π (value under current policy) and uses importance weights for off-policy correction. Q-learning learns Q* (optimal value) and needs no correction because max is policy-independent. SAC sits between them: it learns Q^π but with maximum entropy, making the policy more exploratory and the Q-function more robust to distribution shift.

If you had a discrete-action problem with 1000 actions, which would you choose: DQN (compute max over 1000 Q-values) or SAC (sample from a learned categorical)? What about 1,000,000 actions?

Chapter 12

Full Summary & Cheat Sheet

The Evolution from PG to Actor-Critic

V1 — Vanilla PG∇_θ J = 𝔼[ Σ_t ∇_θ log π · r(τ) ] ← whole trajectory reward, very noisy

V2 — + Causality + Baseline∇_θ J = 𝔼[ Σ_t ∇_θ log π · ( Σ_t'≥t r − b ) ] ← reward-to-go, still single-sample

V3 — Actor-Critic∇_θ J = 𝔼[ Σ_t ∇_θ log π · ( r_t + γV̂(s') − V̂(s) ) ] ← learned value function replaces noisy samples

V4 — PPOL = 𝔼[ min( w · Â, clip(w, 1±ε) · Â ) ] ← clipped IS for data reuse

V5 — SACmax_θ 𝔼[ Q(s, π(s)) + α H(π) ] ← replay buffer + max entropy

Value Estimation Cheat Sheet

Method	Target y_t	Bias	Variance
MC	Σ_t'=t^T γ^t'−t r_t'	None	High
TD(0)	r_t + γ V̂(s_t+1)	High	Low
N-step	Σ_t'=t^t+n−1 γ^t'−t r_t' + γⁿ V̂(s_t+n)	Medium	Medium
GAE(λ)	Σ_l=0^∞ (γλ)^l δ_t+l	Tunable	Tunable

Algorithm Decision Tree

Situation	Recommended	Why
Fast simulator, discrete actions	PPO	Simple, stable, well-tested
Fast simulator, continuous actions	PPO or SAC	PPO simpler; SAC if exploration matters
Real robot (data expensive)	SAC	Replay buffer maximizes data use
RLHF for language models	PPO	Stable, handles discrete tokens
Need reproducible results	PPO	More deterministic, easier to debug

Key Equations at a Glance

Object	Formula
V^π(s)	𝔼[Σ γ^k r_t+k \| s_t=s]
Q^π(s,a)	𝔼[Σ γ^k r_t+k \| s_t=s, a_t=a]
A^π(s,a)	Q^π(s,a) − V^π(s)
TD error	δ_t = r_t + γV̂(s_t+1) − V̂(s_t)
Actor gradient	Σ ∇_θ log π · Â
PPO clip	min(w Â, clip(w, 1±ε) Â)

The One Sentence

Actor-critic = learn what's good (critic), then do more of it (actor). Everything else is how you estimate "good" and how much old data you reuse.

The Complete Guide to Actor-Critic Methods

What You'll Master

Recap & Motivation

V, Q, and Advantage

A Worked Example

The Actor-Critic Loop

Monte Carlo Estimation

Bootstrapping & TD Learning

MC vs. TD: Head to Head

N-Step Returns

Discount Factors

The Discounted Value Functions

The Full Actor-Critic Algorithm

Step-by-Step Numerical Walkthrough

Off-Policy: Importance Weights

Solution 1: KL Constraint

Solution 2: Clipping (PPO)

Off-Policy: Replay Buffers

Solution: Learn Q Instead of V

SAC: A Modern Replay-Buffer Algorithm

PPO vs. SAC

Full Summary & Cheat Sheet

The Evolution from PG to Actor-Critic

Value Estimation Cheat Sheet

Algorithm Decision Tree

Key Equations at a Glance