← Gleams
Stanford CS 224R · Lecture 4 · Deep RL

The Complete Guide to Actor-Critic Methods

Learn to estimate what is good vs. bad, then do more good. From vanilla PG to PPO and SAC.

V, Q, & Advantage MC vs. TD vs. N-step Full algorithm walkthrough PPO & SAC
Roadmap

What You'll Master

Chapter 01

Recap & Motivation

In the policy gradients lesson, we derived this formula:

Policy Gradient with Advantageθ J(θ) ≈ (1/N) Σi Σtθ log πθ(ai,t|si,t) · Aπ(si,t, ai,t)

Each action is weighted by the advantage — how much better that action was compared to average. The problem is how we estimate that advantage.

The Jacket Folding Problem

A robot is learning to fold a jacket. Four trajectories from one batch:

τ₁: Folds both sleeves neatly → reward = 10

τ₂: Flattens the jacket but doesn't fold → reward = 0

τ₃: Folds one sleeve, drops the other → reward = 5

τ₄: Takes one step forward then everything falls backward → reward = −2

Vanilla PG with baseline (average = 3.25): τ₂ gets weight −3.25 and τ₃ gets +1.75. But τ₂ did something useful — flattening is a good sub-step! And the single backward step in τ₄ gets as much blame as the step forward. PG can't distinguish good sub-actions from bad trajectories.

The Core Problem

Single-sample reward-to-go is a terrible estimator of how good an action actually is. One unlucky outcome poisons the gradient for every earlier step. We need something better: a learned estimate of expected future reward.

The Big Idea

What if we trained a separate neural network to predict expected future reward from any state? Then instead of using one noisy trajectory sample, we'd use a smooth, multi-sample average. That's the critic. The policy is the actor. Together: actor-critic.

Chapter 02

V, Q, and Advantage

Three objects underpin all of actor-critic. Each answers a different question about how good things are under the current policy π.

Definition
State-Value Function — Vπ(s)
Vπ(s) = 𝔼τ~π[ Σt'=tT r(st', at') | st = s ]

"How much total reward do I expect from state s if I follow policy π from here on?" A single number per state. Think of it as the real estate value of a state — some states are inherently more promising.

Definition
Action-Value Function — Qπ(s, a)
Qπ(s, a) = 𝔼τ~π[ Σt'=tT r(st', at') | st = s, at = a ]

"How much total reward do I expect if I take action a in state s, then follow π after?" Same as V but conditions on the first action. Q is more detailed — it knows about specific actions, not just locations.

Definition
Advantage Function — Aπ(s, a)
Aπ(s, a) = Qπ(s, a) − Vπ(s)

"How much better is action a compared to what we'd normally do?" If A > 0, action a is better than the average action from this state. If A < 0, it's worse. This is the signal we need for policy gradients.

Intuition

V is the "value of the neighborhood." Q is the "value of taking a specific road in that neighborhood." Advantage is "how much better is this road than the average road?" A positive advantage means the action outperforms the default policy.

🔨 Derivation A2C Advantage = Q − V Reduces Variance (Proof) ✓ ATTEMPTED

The policy gradient theorem gives: ∇J = E[∇ log π(a|s) · Q(s,a)]. We can subtract any state-dependent baseline b(s) without changing the expected gradient: ∇J = E[∇ log π(a|s) · (Q(s,a) − b(s))].

Your task: (1) Show that subtracting b(s) doesn't bias the gradient (prove E[∇ log π · b(s)] = 0). (2) Show that the optimal b*(s) that minimizes variance equals Vπ(s), making A(s,a) = Q(s,a) − V(s) the minimum-variance advantage estimator.

Key identity: Σa ∇π(a|s) = ∇ Σa π(a|s) = ∇ 1 = 0. So Ea~π[∇ log π(a|s)] = Σa π(a|s) · ∇π(a|s)/π(a|s) = Σa ∇π(a|s) = 0. Since b(s) doesn't depend on a, it factors out: Ea[∇ log π · b(s)] = b(s) · Ea[∇ log π] = b(s) · 0 = 0.
The variance of the gradient estimator is Var[∇ log π · (Q − b)]. Expand: E[((∇ log π)(Q−b))2] − (E[...]).2. Differentiate with respect to b and set to zero. The optimal b* = Ea[(∇ log π)2 Q] / Ea[(∇ log π)2]. If you approximate (∇ log π)2 as roughly constant across actions (crude but common), this simplifies to b* ≈ Ea~π[Q(s,a)] = Vπ(s).
Without baseline: Var ∝ E[Q(s,a)2]. With optimal baseline V: Var ∝ E[(Q−V)2] = E[A2]. Since A sums to zero across actions (Ea~π[A] = 0), the advantage has smaller magnitude than raw Q-values. If Q ranges from 50-60 (all positive), A ranges from −5 to +5. Variance reduction: (5/55)2 ≈ 120x less variance!

Part 1 — Unbiasedness:

Ea~π(·|s)[∇θ log πθ(a|s) · b(s)] = b(s) · Ea~π[∇ log π(a|s)]

= b(s) · Σa π(a|s) · ∇π(a|s)/π(a|s) = b(s) · Σa ∇π(a|s) = b(s) · ∇ 1 = 0

Any baseline that depends only on s (not a) vanishes in expectation. This means we can subtract V, a constant, or even a neural network of s — the gradient stays unbiased.

Part 2 — Optimal baseline minimizes variance:

Var = Es,a[(g(s,a) · (Q(s,a) − b(s)))2] where g = ∇ log π.

d/db Var = −2 Es[Ea[g2(Q − b)]] = 0

⇒ b*(s) = Ea~π[g(s,a)2 Q(s,a)] / Ea~π[g(s,a)2]

Under the approximation that g2 is roughly uniform across actions:

b*(s) ≈ Ea~π[Q(s,a)] = Vπ(s)

The key insight: The advantage A = Q − V is the minimum-variance baseline for policy gradients. It works because V centers the signal around zero: good actions get positive weight, bad actions get negative weight, and the magnitudes are as small as possible. This is why actor-critic (which learns V as the baseline) dramatically outperforms vanilla REINFORCE.

A Worked Example

Gridworld Advantage Calculation

Imagine a 3×3 grid. The agent is in the center cell. Under current policy π, the expected returns are:

Vπ(center) = 6.0 — on average, starting from center, we get 6.0 total reward.

Qπ(center, right) = 8.5 — going right first, then following π, gives 8.5.

Qπ(center, left) = 4.0 — going left first gives only 4.0.

Qπ(center, up) = 5.5 — going up gives 5.5.

The advantages:

A(center, right) = 8.5 − 6.0 = +2.5 → much better than average! Push gradient to go right more.

A(center, left) = 4.0 − 6.0 = −2.0 → worse than average. Decrease this action's probability.

A(center, up) = 5.5 − 6.0 = −0.5 → slightly below average.

Notice: V is the weighted average of Q across all actions, so advantages always average to zero: Σa π(a|s) A(s,a) = 0.

Chapter 03

The Actor-Critic Loop

Actor-critic turns the policy gradient into a two-network system. One network acts, the other judges.

Definition
Actor — πθ(a|s)

The policy network. Takes a state, outputs action probabilities (discrete) or a distribution over actions (continuous). Parameters θ. This is what we deploy.

Definition
Critic — V̂πφ(s)

The value network. Takes a state, outputs a scalar: the estimated expected future reward. Parameters φ. This is a tool for training — at deployment, only the actor matters.

The Actor-Critic Loop (High Level)
  1. Act: Run policy πθ to collect a batch of (s, a, r, s') transitions
  2. Evaluate: Fit the critic V̂φ to estimate expected returns (policy evaluation)
  3. Improve: Compute advantages using the critic, update the actor θ via policy gradient (policy improvement)
  4. Repeat from step 1
Why fit V instead of Q or A directly?

We only need to fit V! Because we can reconstruct the advantage from V using a single sampled transition:

Aπ(st, at) ≈ r(st, at) + Vπ(st+1) − Vπ(st)

The reward r(st, at) plus the value of the next state V(st+1) gives us an estimate of Q. Subtract V(st) to get A. One network, not three.

Why does this work?

Start from the definition: Qπ(s,a) = r(s,a) + 𝔼s'[Vπ(s')]. We can't compute the expectation over all possible next states — but we have one actual next state st+1 from our rollout. Using it gives a single-sample estimate of Q. Subtract V(st) to get A. The approximation is:

π(st, at) = r(st, at) + V̂πφ(st+1) − V̂πφ(st)

This is a one-step estimate of the advantage. The "hat" ̂ reminds us that V̂ is our neural network approximation, not the true value.

Chapter 04

Monte Carlo Estimation

The critic needs training targets. Where do they come from? First approach: just use the actual rewards from our rollouts.

Definition
Monte Carlo Target
yi,t = Σt'=tT r(si,t', ai,t')

For each state visited, the MC target is the actual sum of rewards from that point until the end of the episode. This is a direct, model-free estimate of Vπ(si,t).

Training the critic is then straightforward supervised learning:

Monte Carlo Value Estimation
  1. Collect N trajectories by running πθ
  2. Compute targets: For each (i, t), set yi,t = Σt'=tT r(si,t', ai,t')
  3. Regression: Minimize Σi,t ( V̂φ(si,t) − yi,t )2 with gradient descent on φ
Implicit multi-sample averaging

Each yi,t is a single noisy sample. But different trajectories often visit the same or similar states. When the neural network fits all these targets simultaneously, it implicitly averages across many samples for each state. The network acts as a smoothing function — nearby states get similar value estimates even if individual trajectories were noisy.

Worked Example

A simple chain: states A → B → C → terminal, with rewards r(A)=1, r(B)=3, r(C)=5.

• y(A) = 1 + 3 + 5 = 9

• y(B) = 3 + 5 = 8

• y(C) = 5 = 5

The critic trains to output V̂(A) ≈ 9, V̂(B) ≈ 8, V̂(C) ≈ 5. In practice, stochastic dynamics and different actions yield different rewards each time — the network learns the average.

High Variance

MC targets sum many random variables (every future reward). Each is noisy. The sum compounds that noise. For a 1000-step episode, yt is the sum of up to 1000 random rewards — the variance can be enormous. Can we do better?

Chapter 05

Bootstrapping & TD Learning

Instead of summing all future rewards (expensive, noisy), use the critic's own estimate for the future.

Definition
Temporal Difference (TD) Target
yi,t = r(si,t, ai,t) + V̂πφ(si,t+1)

One real reward step, then bootstrap: use the critic's own estimate of V for the rest. The name "temporal difference" comes from the fact that we use the difference in value estimates at consecutive time steps.

Same supervised learning pipeline:

TD Value Update φ ← φ − α ∇φ Σi,t ( V̂φ(si,t) − yi,t )2
Circular but powerful

TD looks suspicious: we're training V̂ using targets that depend on V̂ itself! The labels change every gradient step. But this is exactly how dynamic programming works — iteratively refine estimates using neighboring estimates. It converges because each update makes the value function more self-consistent (satisfying the Bellman equation).

Why TD has lower variance

MC target: y = rt + rt+1 + rt+2 + … + rT (sum of many random variables).

TD target: y = rt + V̂(st+1) (one random variable + one deterministic function evaluation).

Fewer random terms → less variance. But V̂ might be wrong → biased. This is the fundamental tradeoff.

MC vs. TD: Head to Head

Concrete Comparison

A robot walks a 5-step hallway. Reward is −1 at every step except r = +10 at the final step. One trajectory: s₁ → s₂ → s₃ → s₄ → s₅.

MC target for s₁: −1 + (−1) + (−1) + (−1) + 10 = 6. Exact for this trajectory but varies wildly if the robot sometimes falls at step 3.

TD target for s₁: −1 + V̂(s₂). If V̂(s₂) = 7.2 (current estimate), then y = 6.2. Smoother, but only as good as V̂.

PropertyMonte CarloTD (Bootstrap)
BiasUnbiasedBiased (uses own estimate)
VarianceHigh (sum of many rewards)Low (one reward + one estimate)
Needs episode end?Yes (must reach terminal state)No (works step-by-step)
Works with infinite horizons?NoYes
Early trainingMore reliable (no wrong estimates)Can propagate errors
Chapter 06

N-Step Returns

MC and TD are two extremes. What if we took the best of both?

Definition
N-Step Return Target
N-Step Target yi,t = Σt'=tt+n−1 γt'−t r(si,t', ai,t') + γnπ(si,t+n)

Use n steps of real rewards, then bootstrap with the critic for the rest. n = 1 is pure TD. n = T (rest of episode) is pure MC. Any n in between blends the two.

The Bias-Variance Spectrum

Small n (near TD): low variance, higher bias. You trust your critic more. Large n (near MC): higher variance, lower bias. You trust your actual rewards more. In practice, n between 5 and 20 often works best — enough real reward to reduce bias, short enough to keep variance manageable.

N-Step in Action

Same 5-step hallway: rewards [−1, −1, −1, −1, +10]. Say γ = 1 for simplicity and V̂(s₃) = 8.0.

1-step (TD): y(s₁) = −1 + V̂(s₂). Depends entirely on V̂(s₂).

2-step: y(s₁) = −1 + (−1) + V̂(s₃) = −2 + 8.0 = 6.0. Two real rewards, rest from critic.

5-step (MC): y(s₁) = −1 + (−1) + (−1) + (−1) + 10 = 6. All real rewards.

With a good critic, the 2-step estimate will be less noisy than MC across different trajectories, while being less biased than 1-step TD.

nNameBiasVarianceWhen to Use
1TD(0)HighestLowestGood critic, dense rewards
5–20N-stepMediumMediumMost practical settings
TMCNoneHighestShort episodes, early training
GAE: The Continuous Blend

Generalized Advantage Estimation (GAE, Schulman et al. 2016) takes a weighted average of all n-step returns using parameter λ ∈ [0,1]:

GAE = Σl=0 (γλ)l δt+l where δt = rt + γV̂(st+1) − V̂(st)

λ = 0 is pure TD advantage (one step). λ = 1 is full MC advantage. λ = 0.95 is a common default in PPO.

🔨 Derivation GAE(λ) Derivation — Why TD(λ) Interpolates MC and TD ✓ ATTEMPTED

Define the n-step advantage: Â(n)t = Σl=0n−1 γl rt+l + γn V̂(st+n) − V̂(st). GAE is defined as the exponentially-weighted average: ÂGAEt = (1−λ) Σn=1 λn−1(n)t.

Your task: Show this simplifies to ÂGAEt = Σl=0 (γλ)l δt+l, where δt = rt + γV̂(st+1) − V̂(st). Then show that λ=0 gives TD(0) and λ=1 gives MC.

The n-step advantage can be written as a telescoping sum of TD errors: Â(n)t = Σl=0n−1 γl δt+l. Verify: Σ γl(rt+l + γV̂(st+l+1) − V̂(st+l)) telescopes because +γV(st+1) at l=0 cancels with −V(st+1) at l=1 (after multiplying by γ).
GAE = (1−λ) Σn=1 λn−1 Σl=0n−1 γl δt+l. Swap sums: for each δt+l, count how many n-step estimates include it. It appears for n ≥ l+1, so its coefficient is (1−λ) γl Σn=l+1 λn−1 = (1−λ) γl λl / (1−λ) = (γλ)l.

Step 1: Express Â(n) as sum of TD errors.

(n)t = Σl=0n−1 γl rt+l + γn V̂(st+n) − V̂(st)

Rewrite: = Σl=0n−1 γl [rt+l + γV̂(st+l+1) − V̂(st+l)] = Σl=0n−1 γl δt+l

(The telescoping: V̂(st) cancels from the first term's −V̂(st), intermediate V̂'s cancel in pairs, leaving +γnV̂(st+n) from the last term.)

Step 2: Substitute into GAE definition.

GAE = (1−λ) Σn=1 λn−1 Σl=0n−1 γl δt+l

Step 3: Swap summation order. For fixed l, δt+l appears when n ≥ l+1:

Coefficient of δt+l = (1−λ) γl Σn=l+1 λn−1 = (1−λ) γl · λl/(1−λ) = (γλ)l

Therefore: GAEt = Σl=0 (γλ)l δt+l

Boundary cases:

λ = 0: ÂGAE = δt = rt + γV̂(st+1) − V̂(st). Pure TD advantage. High bias, low variance.

λ = 1: ÂGAE = Σl=0 γl δt+l = Σ γl rt+l − V̂(st). This is the MC return minus the baseline. Zero bias (if V̂ is removed from target), highest variance.

The key insight: λ controls an effective discount on TD errors. Each additional step's δ is weighted by (γλ)l. Since γ already discounts future rewards, λ provides a second layer of discounting that specifically controls how much we trust the critic vs. actual rewards. Low λ = trust the critic. High λ = trust the rewards.

Checkpoint — Before you move on
You've seen MC, TD, N-step, and GAE. In one sentence each: (1) Why does TD have bias? (2) Why does MC have high variance? (3) What role does λ play in GAE?
✓ Gate cleared
Model Answer

(1) TD has bias because its target y = r + γV̂(s') uses V̂ — a learned approximation that's wrong early in training. We're bootstrapping from our own (incorrect) estimates. (2) MC has high variance because y = Σrt' sums many stochastic rewards. Each future timestep adds another random variable, and their variances compound. A 1000-step episode means summing 1000 random variables. (3) λ in GAE trades off bias vs. variance by exponentially weighting n-step returns. Low λ trusts the critic more (lower variance, higher bias). High λ trusts actual rewards more (higher variance, lower bias). λ=0.95 is the standard because it leans toward trusting rewards while still getting meaningful variance reduction from the critic.

Chapter 07

Discount Factors

So far we've treated all future rewards equally. But a reward 1000 steps from now is both uncertain and far away. Enter γ.

Definition
Discount Factor — γ (gamma)
Discounted Return Gt = γ0 rt + γ1 rt+1 + γ2 rt+2 + … = Σk=0 γk rt+k

γ ∈ [0, 1]. Each future reward is multiplied by γ raised to the power of how many steps away it is. Reward now = full weight. Reward in 100 steps = γ100 ≈ tiny.

Why discount?

Three reasons: (1) Mathematical — with γ < 1, infinite sums converge. Without it, Vπ can be infinite. (2) Practical — distant rewards are more uncertain, so trusting them less reduces variance. (3) Behavioral — we want agents that don't defer reward forever ("I'll clean up tomorrow… forever").

The Effect of γ

Rewards: [+1, +1, +1, +1, +1] at steps t, t+1, t+2, t+3, t+4.

γ = 1.0: G = 1 + 1 + 1 + 1 + 1 = 5.0 (all rewards equal)

γ = 0.99: G = 1 + 0.99 + 0.98 + 0.97 + 0.96 = 4.90 (slight decay)

γ = 0.9: G = 1 + 0.9 + 0.81 + 0.73 + 0.66 = 4.10 (noticeable)

γ = 0.5: G = 1 + 0.5 + 0.25 + 0.125 + 0.063 = 1.94 (very myopic)

γ = 0.0: G = 1 + 0 + 0 + 0 + 0 = 1.0 (only immediate reward)

The Discounted Value Functions

With discounting, all our definitions update:

Discounted Bellman Equation Vπ(s) = 𝔼a~π[ r(s, a) + γ Vπ(s') ]
Discounted Advantage Aπ(st, at) = r(st, at) + γ Vπ(st+1) − Vπ(st)
Common γ Values
Practical Guidelines

γ = 0.99 — the default for most tasks. Cares about ~100 steps of future.

γ = 0.999 — for very long-horizon tasks (thousands of steps).

γ = 0.95 — when you want faster learning at the cost of some long-term quality.

Rule of thumb: the "effective horizon" is ~1/(1−γ). So γ=0.99 → ~100 steps, γ=0.999 → ~1000 steps.

🔨 Derivation TD(λ) Bias-Variance Tradeoff — Quantifying the Error ✓ ATTEMPTED

The TD(0) advantage Ât = δt = rt + γV̂(st+1) − V̂(st) has bias proportional to V̂'s error, while the MC advantage At = Σγkrt+k − V̂(st) has variance proportional to the episode length.

Your task: Quantify both. If V̂(s) = Vπ(s) + ε(s) with |ε| ≤ εmax, bound the bias of TD(0). If rewards have variance σr2, bound the variance of MC returns for horizon T.

True advantage: A(s,a) = r + γV(s') − V(s). TD estimate: Â = r + γV̂(s') − V̂(s) = r + γ(V(s') + ε(s')) − (V(s) + ε(s)) = A(s,a) + γε(s') − ε(s). The bias is γε(s') − ε(s), bounded by |bias| ≤ (1 + γ)εmax.
MC return Gt = Σk=0T−t−1 γk rt+k. If rewards are independent with Var(r) = σr2: Var(Gt) = Σk=0T−t−1 γ2k σr2 = σr2 (1 − γ2(T−t)) / (1 − γ2). For γ = 0.99 and T−t = 1000: Var ≈ 50 σr2.

TD(0) Bias:

TD = r + γV̂(s') − V̂(s) = [r + γV(s') − V(s)] + [γε(s') − ε(s)]

= Atrue(s,a) + bias term

|Bias| = |γε(s') − ε(s)| ≤ γ|ε(s')| + |ε(s)| ≤ (1+γ)εmax

For γ = 0.99: |Bias| ≤ 1.99 εmax. If V̂ is off by 2.0 on average, TD advantages can be wrong by up to 4.0!

MC Variance:

Var(Gt) = Σk=0H−1 γ2k Var(rt+k) = σr2 · (1 − γ2H)/(1−γ2)

For γ=0.99, H=1000: (1−0.992000)/(1−0.9801) ≈ 1/0.0199 ≈ 50.25

So Var(MC) ≈ 50 σr2. Standard deviation ≈ 7σr.

The tradeoff:

TD(0): MSE = Bias2 + Var(single step) ≈ (2εmax)2 + σr2

MC: MSE = 0 + 50σr2

TD wins when: 4εmax2 + σr2 < 50σr2 ⇒ εmax < 3.5σr

The key insight: TD is better whenever the critic's error is less than ~3.5x the per-step reward noise. Early in training (critic very wrong), MC wins. Late in training (critic accurate), TD wins. GAE(λ) with λ=0.95 automatically interpolates — relying mostly on actual rewards but getting meaningful variance reduction from the critic's short-horizon estimates.

Chapter 08

The Full Actor-Critic Algorithm

Now we assemble every piece. This is the complete algorithm you'd implement.

Algorithm: Actor-Critic (On-Policy, N-Step)
  1. Initialize actor parameters θ, critic parameters φ
  2. Loop:
    1. Sample batch: Collect N trajectories {(si,t, ai,t, ri,t, si,t+1)} by running πθ
    2. Fit critic (policy evaluation):
      1. Compute n-step targets: yi,t = Σt'=tt+n−1 γt'−t ri,t' + γnφ(si,t+n)
      2. Update: φ ← φ − αφφ Σi,t ( V̂φ(si,t) − yi,t )2
    3. Evaluate advantages (using critic):
      π(si,t, ai,t) = r(si,t, ai,t) + γ V̂φ(si,t+1) − V̂φ(si,t)
    4. Policy gradient (actor update):
      θ J ≈ Σi,tθ log πθ(ai,t|si,t) · Âπ(si,t, ai,t)
    5. Update actor: θ ← θ + αθθ J

Step-by-Step Numerical Walkthrough

One Iteration in Detail

Tiny environment: 3 states {A, B, C}. Two actions {left, right}. γ = 0.9. Current critic: V̂(A) = 2.0, V̂(B) = 5.0, V̂(C) = 8.0.

Step 2a: Sample batch. One trajectory: A ⟶right B ⟶right C. Rewards: r(A, right) = 1, r(B, right) = 3.

Step 2b: Critic targets (1-step TD):

• y(A) = r(A, right) + γ V̂(B) = 1 + 0.9 × 5.0 = 5.5

• y(B) = r(B, right) + γ V̂(C) = 3 + 0.9 × 8.0 = 10.2

Critic loss: (V̂(A) − 5.5)2 + (V̂(B) − 10.2)2 = (2 − 5.5)2 + (5 − 10.2)2 = 12.25 + 27.04 = 39.29. Gradient descent will push V̂(A) toward 5.5 and V̂(B) toward 10.2.

Step 2c: Advantages:

• Â(A, right) = 1 + 0.9 × 5.0 − 2.0 = +3.5 (great action!)

• Â(B, right) = 3 + 0.9 × 8.0 − 5.0 = +5.2 (even better)

Step 2d: Both advantages are positive → increase probability of "right" in both states.

Two Learning Rates

The actor and critic are trained simultaneously but at potentially different learning rates. A common pattern: αφ = 3 × αθ. The critic should learn faster than the actor, so the advantages are based on a reasonably good value estimate. If the critic is still random, advantages are garbage and the actor learns nonsense.

Shared vs. Separate Networks

In some implementations (e.g., A3C), the actor and critic share a feature backbone with two separate heads. This saves parameters and computation, but can cause instability — the critic's gradient can interfere with actor features. Modern practice (PPO, SAC): separate networks unless you have a specific reason to share.

💥 Break-It Lab What Dies When You Remove Actor-Critic Components? ✓ ATTEMPTED
A well-tuned actor-critic has three critical relationships: the value baseline, the critic's learning speed relative to the actor, and the actor's update magnitude. Toggle each to see distinct failure modes.
Remove Value Baseline (back to REINFORCE) ACTIVE
Failure mode: High-variance gradients. Without V(s) as baseline, the policy gradient uses raw returns: ∇J = Σ ∇logπ · R(τ). If all trajectories have positive reward (common in continuous control), every action gets reinforced — just some more than others. The signal-to-noise ratio is terrible. Gradient estimates oscillate wildly between batches. Learning happens but 10-100x slower, with massive variance in the training curve.
Critic Learns Too Slowly (αcritic = 0.1x actor) ACTIVE
Failure mode: Stale advantages. If the critic can't keep up with the actor, V̂(s) reflects the value of an old policy. Advantages computed as r + γV̂(s') − V̂(s) are garbage — they praise actions that were good 1000 updates ago. The actor receives misleading gradient signals. Training stalls or oscillates as the actor overshoots, then the critic slowly catches up, then the actor overshoots again.
Actor Updates Too Aggressively (αactor = 10x normal) ACTIVE
Failure mode: Policy oscillation. Large actor updates change the policy drastically each iteration. The advantages from one step become meaningless for the new policy. The critic's targets (computed under the old policy) are wrong. The policy swings between extremes — "always go left" then "always go right" — never settling. This is exactly what PPO's clipping mechanism was designed to prevent: bounding the policy change per update.
Connecting to policy gradients

Compare the gradient from the PG lesson (with reward-to-go and baseline) to the actor-critic gradient:

PG with baselineθ J ≈ Σi,tθ log π · ( Σt'≥t r − b )
Actor-Criticθ J ≈ Σi,tθ log π · ( rt + γV̂(st+1) − V̂(st) )

They're the same structure. The only change: noisy single-sample reward-to-go is replaced by the critic-based advantage. Less noise → better gradients → faster learning.

Chapter 09

Off-Policy: Importance Weights

On-policy actor-critic still throws away data after each update. Can we squeeze multiple gradient steps from the same batch?

Definition
Importance-Weighted Off-Policy Gradient
θ' J ≈ Σi,t [ πθ'(ai,t|si,t) / πθ(ai,t|si,t) ] · ∇θ' log πθ'(ai,t|si,t) · Âπ(si,t, ai,t)

Data was collected under πθ (old policy). We want to update θ' (new policy). The ratio πθ'θ corrects for the distribution mismatch — exactly like importance sampling.

Now we can take K gradient steps on the same batch without recollecting data. Much more efficient!

The Drift Problem

After many gradient steps, θ' drifts far from θ. The importance weights become extreme: if πθ'(a|s) = 0.8 but πθ(a|s) = 0.01, the ratio = 80. A few huge weights dominate the gradient → instability. The advantages were computed for the old policy — they're stale.

Solution 1: KL Constraint

KL-Constrained Update (TRPO / PPO) maxθ' Σi,t [ πθ'(a|s) / πθ(a|s) ] · Â(s, a)
subject to: 𝔼s[ DKL( πθ'(·|s) ∥ πθ(·|s) ) ] ≤ ε

Don't let the new policy stray too far from the old one. This keeps importance weights near 1.

Solution 2: Clipping (PPO)

Definition
PPO Clipped Objective
PPO-Clip LCLIP(θ') = 𝔼[ min( wtt, clip(wt, 1−ε, 1+ε) Ât ) ]

where wt = πθ'(at|st) / πθ(at|st)

Clip the importance ratio to [1−ε, 1+ε] (typically ε = 0.2). If the ratio tries to go beyond the clip range, the gradient is zero — preventing any single update from changing the policy too much.

Why PPO is so popular

PPO = actor-critic + importance sampling + clipping. It gets multiple gradient steps per batch (data efficient), prevents catastrophic policy updates (stable), and is simple to implement (~50 lines of core logic). It's the default algorithm for ChatGPT's RLHF, robotic locomotion, and game AI.

Clipping in Action

ε = 0.2. Old policy: πθ(right|s) = 0.4. New policy after 3 gradient steps: πθ'(right|s) = 0.7.

Ratio w = 0.7/0.4 = 1.75. Clipped to [0.8, 1.2] → 1.2.

If  = +3.0: unclipped objective = 1.75 × 3.0 = 5.25. Clipped = 1.2 × 3.0 = 3.6. PPO takes min(5.25, 3.6) = 3.6.

The policy wants to increase this action's probability even more, but PPO says "that's enough for one batch." Collect new data first.

Chapter 10

Off-Policy: Replay Buffers

PPO reuses one batch for a few gradient steps. What if we stored all past experience and sampled from it freely?

Definition
Replay Buffer

A large memory that stores transitions (s, a, r, s') from all past policies. During training, we sample random minibatches from this buffer. The transitions might be from an hour ago or a million gradient steps ago.

Two Problems to Fix

Problem 1: Advantage A = r + γV(s') − V(s) requires Vπ for the current policy, but the transitions were generated by old policies. V of the current policy in those states might be different.

Problem 2: We can't use single-step importance weights effectively because the states themselves were visited under a different policy.

Solution: Learn Q Instead of V

With a replay buffer, we switch from V to Q:

Definition
Q-Function Critic
Q-Learning Target (Off-Policy) yj = rj + γ Qπφ(s'j, πθ(s'j))

For each transition (s, a, r, s') in the buffer: the target is the reward plus the discounted Q-value of the next state under the current policy's action. The key: we evaluate πθ(s') using the current actor, not the one that generated the data.

Why Q works off-policy but V doesn't

Vπ(s) = 𝔼a~π[Qπ(s,a)] depends on which actions π would take. If the transition was generated by a different policy, the action a in the buffer might not be what the current π would do → V estimate is wrong.

Qπ(s, a) conditions on a specific action. The (s, a, r, s') tuple tells us exactly what happened when action a was taken in state s. That factual information is valid regardless of which policy chose a. We only need the current policy to evaluate the next action at s'.

SAC: A Modern Replay-Buffer Algorithm

Definition
SAC — Soft Actor-Critic (Haarnoja et al. 2018)

Maximum entropy RL: the policy maximizes reward and entropy (randomness). This encourages exploration and makes training more stable. Uses a replay buffer with two Q-networks (for stability) and automatic temperature tuning.

Algorithm: SAC (Simplified)
  1. Initialize actor πθ, two critics Qφ1, Qφ2, target networks
  2. Loop:
    1. Act: Sample at ~ πθ(·|st), store (st, at, rt, st+1) in replay buffer 𝔇
    2. Sample minibatch of transitions from 𝔇
    3. Update critics: y = r + γ min(Qφ1', Qφ2') evaluated at next action from current π
    4. Update actor: maximize 𝔼[Q(s, π(s)) + α H(π(·|s))]
    5. Soft-update target networks: φ' ← τφ + (1−τ)φ'
Reparameterization Trick

To backprop through the actor into the Q-function, SAC uses the reparameterization trick: instead of sampling a ~ π(·|s), write a = μθ(s) + σθ(s) · ε, where ε ~ 𝒩(0, 1). Now the gradient flows through the deterministic μ and σ, treating ε as a constant. No log-probability needed — direct gradient from Q to θ.

💻 Build It Implement GAE (Lambda-Weighted TD Errors) ✓ ATTEMPTED
Implement GAE advantage computation. Given a batch of rewards, values, and dones from a rollout, compute the GAE advantages for each timestep. This is the core of PPO's advantage estimation.
signature def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95): """Compute Generalized Advantage Estimation. Args: rewards: list/array of rewards [T] values: list/array of V(s_t) predictions [T+1] (includes bootstrap V(s_T)) dones: list/array of done flags [T] (1 if episode ended) gamma: discount factor lam: GAE lambda parameter Returns: advantages: array of GAE advantages [T] """
Test case
# rewards = [1, 1, 1], values = [5, 6, 7, 8], dones = [0, 0, 0], gamma=0.99, lam=0.95
# delta_2 = 1 + 0.99*8 - 7 = 1.92
# delta_1 = 1 + 0.99*7 - 6 = 1.93
# delta_0 = 1 + 0.99*6 - 5 = 1.94
# A_2 = 1.92
# A_1 = 1.93 + 0.99*0.95*1.92 = 3.74
# A_0 = 1.94 + 0.99*0.95*3.74 = 5.46
GAE has a recursive structure: At = δt + (γλ)(1−donet) · At+1. Initialize AT = 0, then sweep backward. The (1−done) factor resets the advantage at episode boundaries — future rewards from a new episode shouldn't affect the current one.
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0

    for t in reversed(range(T)):
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]

        # GAE recursive formula
        last_gae = delta + gamma * lam * next_non_terminal * last_gae
        advantages[t] = last_gae

    return advantages
Bonus challenge: Modify this to also return the value targets (advantages + values[:T]). In PPO, these targets are used to train the critic via MSE loss. Why do we use advantages + V(s) as the critic target rather than computing n-step returns separately? (Answer: GAE advantages already encode the optimal multi-step target implicitly. Adding V(s) back gives us the "GAE-corrected return" which is a lower-variance target than raw MC returns.)
⚔ Adversarial: The Deadly Critic Lag
You're training PPO on a locomotion task. After 1M steps, the agent walks well (reward ~300). You save the critic checkpoint. You then reset the actor to random weights but keep the trained critic. What happens?
Chapter 11

PPO vs. SAC

These two algorithms dominate modern deep RL. They represent fundamentally different design choices.

PropertyPPOSAC
Off-policy degreeSlightly off-policy (few steps per batch)Fully off-policy (replay buffer)
Value functionVπ(s)Qπ(s, a)
Policy updateClipped surrogate objectiveReparameterization through Q
Data efficiencyLower (discard data quickly)Higher (reuse all data)
Wall-clock speedOften faster (simpler updates)Slower per step (more networks)
StabilityVery stable, easy to tuneCan be finicky (critic lag)
ExplorationStochastic policy + entropy bonusMaximum entropy (built-in)
Action spaceBoth discrete and continuousPrimarily continuous
Famous usesChatGPT RLHF, Dota 2, roboticsDexterous manipulation, locomotion
When to Use Which

PPO when: you have fast simulation (cheap data), need stability, discrete actions, or are new to RL. Start here.

SAC when: data is expensive (real robot), continuous actions, need maximum data efficiency, or the task requires exploration.

Definition
The Spectrum of Off-Policyness

All actor-critic methods live on a spectrum. On one end: pure on-policy (REINFORCE, A2C) — collect, update, discard. In the middle: "slightly" off-policy (PPO) — a few gradient steps per batch. On the far end: fully off-policy (SAC, TD3) — replay buffer, hundreds of gradient steps per environment step.

More off-policy = more data efficient but harder to train. Less off-policy = simpler but wasteful.

Robotics Example

You have a physical robot arm learning to pick up objects. Each real-world episode takes 30 seconds, and you can run maybe 1000 episodes per day.

PPO approach: Collect 100 episodes, take ~10 gradient steps, discard. Next day: another 100. After a week: maybe 7000 episodes used, 70 updates. Slow.

SAC approach: Every single transition goes into the replay buffer. After day 1: 1000 episodes, but ~50,000 gradient updates sampling from the buffer. After a week: 7000 episodes but ~350,000 updates. 50x more learning per real-world interaction.

🏗 Design Challenge You're the Architect: PPO for a Humanoid Walker ✓ ATTEMPTED
Design a PPO system for training a simulated humanoid (like MuJoCo Humanoid-v4) to walk. 300-dimensional observation (joint angles, velocities, contacts), 17-dimensional continuous action (torques). Budget: 50 million environment steps on 8 parallel workers.
Observation dim
300 (proprioception)
Action dim
17 (continuous torques)
Step budget
50M total steps
Parallel workers
8 environments
Episode length
~1000 steps
GPU
1x RTX 3090 (24GB)
1. Network architecture: MLP depth/width for both actor and critic? Separate or shared backbone? Activation function?
2. Action distribution: Gaussian with what parameterization? State-dependent std or fixed? Initial std?
3. GAE parameters: what γ and λ for a 1000-step horizon? How many steps per rollout before update?
4. PPO hyperparameters: clip epsilon, number of epochs per batch, minibatch size, learning rate schedule?
5. Normalization: observation normalization? Reward scaling? Advantage normalization? Value function clipping?

Network: Separate actor and critic. Both: MLP with 2 hidden layers of 256 units, tanh activation (not ReLU — tanh naturally bounds outputs, important for smooth locomotion). Orthogonal weight initialization (helps with gradient flow in deep RL).

Action distribution: Diagonal Gaussian. Mean μ(s) from actor network. Log-std as a separate learnable parameter (NOT state-dependent initially — state-dependent std is harder to train). Initial log-std = 0 (σ = 1). The std naturally decreases as training converges.

GAE: γ = 0.99, λ = 0.95. Rollout length = 2048 steps per worker. Batch = 8 workers × 2048 = 16,384 transitions. This is ~16 episodes worth of data per update.

PPO: Clip ε = 0.2. 10 epochs per batch. 32 minibatches (minibatch size = 512). Learning rate = 3e-4 with linear decay to 0 over training. Entropy coefficient = 0.0 (humanoid doesn't need exploration bonus).

Normalization (CRITICAL): Running mean/std normalization of observations. Reward scaling by 1/std(returns). Advantage normalization (subtract mean, divide by std) per minibatch. Value function loss clipped (prevent value function from changing too fast). These normalizations are often more important than the algorithm choice — without them, PPO frequently fails on Humanoid.

Timeline: 50M steps / (8 workers × 2048 steps/batch) = ~3000 updates. At ~1 sec/update on a 3090: ~50 minutes to train. Expected reward: ~5000+ (stable walking) by 20M steps.

🔗 Pattern Recognition
Actor-Critic = Policy Gradient + Learned Baseline
Policy Gradients (Previous Lesson)
∇J = E[∇logπ · (R(τ) − b)]
b is a constant or simple average. No learning. → Policy Gradients
Actor-Critic (This Lesson)
∇J = E[∇logπ · (r + γV̂(s') − V̂(s))]
b = V̂(s), a learned neural network. Low variance.

The leap from PG to AC is just one thing: replacing the constant baseline with a learned value function. Everything else (GAE, PPO clipping, replay buffers) is optimization of this core idea. The critic is just a very good baseline.

Can you see the same "learned estimator replaces crude sample" pattern in supervised learning? (Hint: batch normalization learns running statistics instead of using per-batch estimates.)

🔗 Pattern Recognition
Off-Policy Actor-Critic → Q-Learning Continuum
Actor-Critic (learns Qπ)
Critic evaluates current policy's actions.
Actor explicitly parameterized. Continuous actions natural.
Q-Learning (learns Q*)
Critic evaluates optimal actions via max.
Policy = argmax (implicit). Discrete actions only. → Q-Learning

Both use replay buffers and both learn a Q-function. The key difference: actor-critic learns Qπ (value under current policy) and uses importance weights for off-policy correction. Q-learning learns Q* (optimal value) and needs no correction because max is policy-independent. SAC sits between them: it learns Qπ but with maximum entropy, making the policy more exploratory and the Q-function more robust to distribution shift.

If you had a discrete-action problem with 1000 actions, which would you choose: DQN (compute max over 1000 Q-values) or SAC (sample from a learned categorical)? What about 1,000,000 actions?

Chapter 12

Full Summary & Cheat Sheet

The Evolution from PG to Actor-Critic

V1 — Vanilla PGθ J = 𝔼[ Σtθ log π · r(τ) ] ← whole trajectory reward, very noisy
V2 — + Causality + Baselineθ J = 𝔼[ Σtθ log π · ( Σt'≥t r − b ) ] ← reward-to-go, still single-sample
V3 — Actor-Criticθ J = 𝔼[ Σtθ log π · ( rt + γV̂(s') − V̂(s) ) ] ← learned value function replaces noisy samples
V4 — PPOL = 𝔼[ min( w · Â, clip(w, 1±ε) · Â ) ] ← clipped IS for data reuse
V5 — SACmaxθ 𝔼[ Q(s, π(s)) + α H(π) ] ← replay buffer + max entropy

Value Estimation Cheat Sheet

MethodTarget ytBiasVariance
MCΣt'=tT γt'−t rt'NoneHigh
TD(0)rt + γ V̂(st+1)HighLow
N-stepΣt'=tt+n−1 γt'−t rt' + γn V̂(st+n)MediumMedium
GAE(λ)Σl=0 (γλ)l δt+lTunableTunable

Algorithm Decision Tree

SituationRecommendedWhy
Fast simulator, discrete actionsPPOSimple, stable, well-tested
Fast simulator, continuous actionsPPO or SACPPO simpler; SAC if exploration matters
Real robot (data expensive)SACReplay buffer maximizes data use
RLHF for language modelsPPOStable, handles discrete tokens
Need reproducible resultsPPOMore deterministic, easier to debug

Key Equations at a Glance

ObjectFormula
Vπ(s)𝔼[Σ γk rt+k | st=s]
Qπ(s,a)𝔼[Σ γk rt+k | st=s, at=a]
Aπ(s,a)Qπ(s,a) − Vπ(s)
TD errorδt = rt + γV̂(st+1) − V̂(st)
Actor gradientΣ ∇θ log π · Â
PPO clipmin(w Â, clip(w, 1±ε) Â)
The One Sentence

Actor-critic = learn what's good (critic), then do more of it (actor). Everything else is how you estimate "good" and how much old data you reuse.