Learn to estimate what is good vs. bad, then do more good. From vanilla PG to PPO and SAC.
In the policy gradients lesson, we derived this formula:
Each action is weighted by the advantage — how much better that action was compared to average. The problem is how we estimate that advantage.
A robot is learning to fold a jacket. Four trajectories from one batch:
• τ₁: Folds both sleeves neatly → reward = 10
• τ₂: Flattens the jacket but doesn't fold → reward = 0
• τ₃: Folds one sleeve, drops the other → reward = 5
• τ₄: Takes one step forward then everything falls backward → reward = −2
Vanilla PG with baseline (average = 3.25): τ₂ gets weight −3.25 and τ₃ gets +1.75. But τ₂ did something useful — flattening is a good sub-step! And the single backward step in τ₄ gets as much blame as the step forward. PG can't distinguish good sub-actions from bad trajectories.
Single-sample reward-to-go is a terrible estimator of how good an action actually is. One unlucky outcome poisons the gradient for every earlier step. We need something better: a learned estimate of expected future reward.
What if we trained a separate neural network to predict expected future reward from any state? Then instead of using one noisy trajectory sample, we'd use a smooth, multi-sample average. That's the critic. The policy is the actor. Together: actor-critic.
Three objects underpin all of actor-critic. Each answers a different question about how good things are under the current policy π.
"How much total reward do I expect from state s if I follow policy π from here on?" A single number per state. Think of it as the real estate value of a state — some states are inherently more promising.
"How much total reward do I expect if I take action a in state s, then follow π after?" Same as V but conditions on the first action. Q is more detailed — it knows about specific actions, not just locations.
"How much better is action a compared to what we'd normally do?" If A > 0, action a is better than the average action from this state. If A < 0, it's worse. This is the signal we need for policy gradients.
V is the "value of the neighborhood." Q is the "value of taking a specific road in that neighborhood." Advantage is "how much better is this road than the average road?" A positive advantage means the action outperforms the default policy.
The policy gradient theorem gives: ∇J = E[∇ log π(a|s) · Q(s,a)]. We can subtract any state-dependent baseline b(s) without changing the expected gradient: ∇J = E[∇ log π(a|s) · (Q(s,a) − b(s))].
Your task: (1) Show that subtracting b(s) doesn't bias the gradient (prove E[∇ log π · b(s)] = 0). (2) Show that the optimal b*(s) that minimizes variance equals Vπ(s), making A(s,a) = Q(s,a) − V(s) the minimum-variance advantage estimator.
Part 1 — Unbiasedness:
Ea~π(·|s)[∇θ log πθ(a|s) · b(s)] = b(s) · Ea~π[∇ log π(a|s)]
= b(s) · Σa π(a|s) · ∇π(a|s)/π(a|s) = b(s) · Σa ∇π(a|s) = b(s) · ∇ 1 = 0
Any baseline that depends only on s (not a) vanishes in expectation. This means we can subtract V, a constant, or even a neural network of s — the gradient stays unbiased.
Part 2 — Optimal baseline minimizes variance:
Var = Es,a[(g(s,a) · (Q(s,a) − b(s)))2] where g = ∇ log π.
d/db Var = −2 Es[Ea[g2(Q − b)]] = 0
⇒ b*(s) = Ea~π[g(s,a)2 Q(s,a)] / Ea~π[g(s,a)2]
Under the approximation that g2 is roughly uniform across actions:
b*(s) ≈ Ea~π[Q(s,a)] = Vπ(s)
The key insight: The advantage A = Q − V is the minimum-variance baseline for policy gradients. It works because V centers the signal around zero: good actions get positive weight, bad actions get negative weight, and the magnitudes are as small as possible. This is why actor-critic (which learns V as the baseline) dramatically outperforms vanilla REINFORCE.
Imagine a 3×3 grid. The agent is in the center cell. Under current policy π, the expected returns are:
• Vπ(center) = 6.0 — on average, starting from center, we get 6.0 total reward.
• Qπ(center, right) = 8.5 — going right first, then following π, gives 8.5.
• Qπ(center, left) = 4.0 — going left first gives only 4.0.
• Qπ(center, up) = 5.5 — going up gives 5.5.
The advantages:
• A(center, right) = 8.5 − 6.0 = +2.5 → much better than average! Push gradient to go right more.
• A(center, left) = 4.0 − 6.0 = −2.0 → worse than average. Decrease this action's probability.
• A(center, up) = 5.5 − 6.0 = −0.5 → slightly below average.
Notice: V is the weighted average of Q across all actions, so advantages always average to zero: Σa π(a|s) A(s,a) = 0.
Actor-critic turns the policy gradient into a two-network system. One network acts, the other judges.
The policy network. Takes a state, outputs action probabilities (discrete) or a distribution over actions (continuous). Parameters θ. This is what we deploy.
The value network. Takes a state, outputs a scalar: the estimated expected future reward. Parameters φ. This is a tool for training — at deployment, only the actor matters.
We only need to fit V! Because we can reconstruct the advantage from V using a single sampled transition:
The reward r(st, at) plus the value of the next state V(st+1) gives us an estimate of Q. Subtract V(st) to get A. One network, not three.
Start from the definition: Qπ(s,a) = r(s,a) + 𝔼s'[Vπ(s')]. We can't compute the expectation over all possible next states — but we have one actual next state st+1 from our rollout. Using it gives a single-sample estimate of Q. Subtract V(st) to get A. The approximation is:
This is a one-step estimate of the advantage. The "hat" ̂ reminds us that V̂ is our neural network approximation, not the true value.
The critic needs training targets. Where do they come from? First approach: just use the actual rewards from our rollouts.
For each state visited, the MC target is the actual sum of rewards from that point until the end of the episode. This is a direct, model-free estimate of Vπ(si,t).
Training the critic is then straightforward supervised learning:
Each yi,t is a single noisy sample. But different trajectories often visit the same or similar states. When the neural network fits all these targets simultaneously, it implicitly averages across many samples for each state. The network acts as a smoothing function — nearby states get similar value estimates even if individual trajectories were noisy.
A simple chain: states A → B → C → terminal, with rewards r(A)=1, r(B)=3, r(C)=5.
• y(A) = 1 + 3 + 5 = 9
• y(B) = 3 + 5 = 8
• y(C) = 5 = 5
The critic trains to output V̂(A) ≈ 9, V̂(B) ≈ 8, V̂(C) ≈ 5. In practice, stochastic dynamics and different actions yield different rewards each time — the network learns the average.
MC targets sum many random variables (every future reward). Each is noisy. The sum compounds that noise. For a 1000-step episode, yt is the sum of up to 1000 random rewards — the variance can be enormous. Can we do better?
Instead of summing all future rewards (expensive, noisy), use the critic's own estimate for the future.
One real reward step, then bootstrap: use the critic's own estimate of V for the rest. The name "temporal difference" comes from the fact that we use the difference in value estimates at consecutive time steps.
Same supervised learning pipeline:
TD looks suspicious: we're training V̂ using targets that depend on V̂ itself! The labels change every gradient step. But this is exactly how dynamic programming works — iteratively refine estimates using neighboring estimates. It converges because each update makes the value function more self-consistent (satisfying the Bellman equation).
MC target: y = rt + rt+1 + rt+2 + … + rT (sum of many random variables).
TD target: y = rt + V̂(st+1) (one random variable + one deterministic function evaluation).
Fewer random terms → less variance. But V̂ might be wrong → biased. This is the fundamental tradeoff.
A robot walks a 5-step hallway. Reward is −1 at every step except r = +10 at the final step. One trajectory: s₁ → s₂ → s₃ → s₄ → s₅.
MC target for s₁: −1 + (−1) + (−1) + (−1) + 10 = 6. Exact for this trajectory but varies wildly if the robot sometimes falls at step 3.
TD target for s₁: −1 + V̂(s₂). If V̂(s₂) = 7.2 (current estimate), then y = 6.2. Smoother, but only as good as V̂.
| Property | Monte Carlo | TD (Bootstrap) |
|---|---|---|
| Bias | Unbiased | Biased (uses own estimate) |
| Variance | High (sum of many rewards) | Low (one reward + one estimate) |
| Needs episode end? | Yes (must reach terminal state) | No (works step-by-step) |
| Works with infinite horizons? | No | Yes |
| Early training | More reliable (no wrong estimates) | Can propagate errors |
MC and TD are two extremes. What if we took the best of both?
Use n steps of real rewards, then bootstrap with the critic for the rest. n = 1 is pure TD. n = T (rest of episode) is pure MC. Any n in between blends the two.
Small n (near TD): low variance, higher bias. You trust your critic more. Large n (near MC): higher variance, lower bias. You trust your actual rewards more. In practice, n between 5 and 20 often works best — enough real reward to reduce bias, short enough to keep variance manageable.
Same 5-step hallway: rewards [−1, −1, −1, −1, +10]. Say γ = 1 for simplicity and V̂(s₃) = 8.0.
• 1-step (TD): y(s₁) = −1 + V̂(s₂). Depends entirely on V̂(s₂).
• 2-step: y(s₁) = −1 + (−1) + V̂(s₃) = −2 + 8.0 = 6.0. Two real rewards, rest from critic.
• 5-step (MC): y(s₁) = −1 + (−1) + (−1) + (−1) + 10 = 6. All real rewards.
With a good critic, the 2-step estimate will be less noisy than MC across different trajectories, while being less biased than 1-step TD.
| n | Name | Bias | Variance | When to Use |
|---|---|---|---|---|
| 1 | TD(0) | Highest | Lowest | Good critic, dense rewards |
| 5–20 | N-step | Medium | Medium | Most practical settings |
| T | MC | None | Highest | Short episodes, early training |
Generalized Advantage Estimation (GAE, Schulman et al. 2016) takes a weighted average of all n-step returns using parameter λ ∈ [0,1]:
λ = 0 is pure TD advantage (one step). λ = 1 is full MC advantage. λ = 0.95 is a common default in PPO.
Define the n-step advantage: Â(n)t = Σl=0n−1 γl rt+l + γn V̂(st+n) − V̂(st). GAE is defined as the exponentially-weighted average: ÂGAEt = (1−λ) Σn=1∞ λn−1 Â(n)t.
Your task: Show this simplifies to ÂGAEt = Σl=0∞ (γλ)l δt+l, where δt = rt + γV̂(st+1) − V̂(st). Then show that λ=0 gives TD(0) and λ=1 gives MC.
Step 1: Express Â(n) as sum of TD errors.
Â(n)t = Σl=0n−1 γl rt+l + γn V̂(st+n) − V̂(st)
Rewrite: = Σl=0n−1 γl [rt+l + γV̂(st+l+1) − V̂(st+l)] = Σl=0n−1 γl δt+l
(The telescoping: V̂(st) cancels from the first term's −V̂(st), intermediate V̂'s cancel in pairs, leaving +γnV̂(st+n) from the last term.)
Step 2: Substitute into GAE definition.
ÂGAE = (1−λ) Σn=1∞ λn−1 Σl=0n−1 γl δt+l
Step 3: Swap summation order. For fixed l, δt+l appears when n ≥ l+1:
Coefficient of δt+l = (1−λ) γl Σn=l+1∞ λn−1 = (1−λ) γl · λl/(1−λ) = (γλ)l
Therefore: ÂGAEt = Σl=0∞ (γλ)l δt+l
Boundary cases:
λ = 0: ÂGAE = δt = rt + γV̂(st+1) − V̂(st). Pure TD advantage. High bias, low variance.
λ = 1: ÂGAE = Σl=0∞ γl δt+l = Σ γl rt+l − V̂(st). This is the MC return minus the baseline. Zero bias (if V̂ is removed from target), highest variance.
The key insight: λ controls an effective discount on TD errors. Each additional step's δ is weighted by (γλ)l. Since γ already discounts future rewards, λ provides a second layer of discounting that specifically controls how much we trust the critic vs. actual rewards. Low λ = trust the critic. High λ = trust the rewards.
(1) TD has bias because its target y = r + γV̂(s') uses V̂ — a learned approximation that's wrong early in training. We're bootstrapping from our own (incorrect) estimates. (2) MC has high variance because y = Σrt' sums many stochastic rewards. Each future timestep adds another random variable, and their variances compound. A 1000-step episode means summing 1000 random variables. (3) λ in GAE trades off bias vs. variance by exponentially weighting n-step returns. Low λ trusts the critic more (lower variance, higher bias). High λ trusts actual rewards more (higher variance, lower bias). λ=0.95 is the standard because it leans toward trusting rewards while still getting meaningful variance reduction from the critic.
So far we've treated all future rewards equally. But a reward 1000 steps from now is both uncertain and far away. Enter γ.
γ ∈ [0, 1]. Each future reward is multiplied by γ raised to the power of how many steps away it is. Reward now = full weight. Reward in 100 steps = γ100 ≈ tiny.
Three reasons: (1) Mathematical — with γ < 1, infinite sums converge. Without it, Vπ can be infinite. (2) Practical — distant rewards are more uncertain, so trusting them less reduces variance. (3) Behavioral — we want agents that don't defer reward forever ("I'll clean up tomorrow… forever").
Rewards: [+1, +1, +1, +1, +1] at steps t, t+1, t+2, t+3, t+4.
• γ = 1.0: G = 1 + 1 + 1 + 1 + 1 = 5.0 (all rewards equal)
• γ = 0.99: G = 1 + 0.99 + 0.98 + 0.97 + 0.96 = 4.90 (slight decay)
• γ = 0.9: G = 1 + 0.9 + 0.81 + 0.73 + 0.66 = 4.10 (noticeable)
• γ = 0.5: G = 1 + 0.5 + 0.25 + 0.125 + 0.063 = 1.94 (very myopic)
• γ = 0.0: G = 1 + 0 + 0 + 0 + 0 = 1.0 (only immediate reward)
With discounting, all our definitions update:
• γ = 0.99 — the default for most tasks. Cares about ~100 steps of future.
• γ = 0.999 — for very long-horizon tasks (thousands of steps).
• γ = 0.95 — when you want faster learning at the cost of some long-term quality.
Rule of thumb: the "effective horizon" is ~1/(1−γ). So γ=0.99 → ~100 steps, γ=0.999 → ~1000 steps.
The TD(0) advantage Ât = δt = rt + γV̂(st+1) − V̂(st) has bias proportional to V̂'s error, while the MC advantage At = Σγkrt+k − V̂(st) has variance proportional to the episode length.
Your task: Quantify both. If V̂(s) = Vπ(s) + ε(s) with |ε| ≤ εmax, bound the bias of TD(0). If rewards have variance σr2, bound the variance of MC returns for horizon T.
TD(0) Bias:
ÂTD = r + γV̂(s') − V̂(s) = [r + γV(s') − V(s)] + [γε(s') − ε(s)]
= Atrue(s,a) + bias term
|Bias| = |γε(s') − ε(s)| ≤ γ|ε(s')| + |ε(s)| ≤ (1+γ)εmax
For γ = 0.99: |Bias| ≤ 1.99 εmax. If V̂ is off by 2.0 on average, TD advantages can be wrong by up to 4.0!
MC Variance:
Var(Gt) = Σk=0H−1 γ2k Var(rt+k) = σr2 · (1 − γ2H)/(1−γ2)
For γ=0.99, H=1000: (1−0.992000)/(1−0.9801) ≈ 1/0.0199 ≈ 50.25
So Var(MC) ≈ 50 σr2. Standard deviation ≈ 7σr.
The tradeoff:
TD(0): MSE = Bias2 + Var(single step) ≈ (2εmax)2 + σr2
MC: MSE = 0 + 50σr2
TD wins when: 4εmax2 + σr2 < 50σr2 ⇒ εmax < 3.5σr
The key insight: TD is better whenever the critic's error is less than ~3.5x the per-step reward noise. Early in training (critic very wrong), MC wins. Late in training (critic accurate), TD wins. GAE(λ) with λ=0.95 automatically interpolates — relying mostly on actual rewards but getting meaningful variance reduction from the critic's short-horizon estimates.
Now we assemble every piece. This is the complete algorithm you'd implement.
Tiny environment: 3 states {A, B, C}. Two actions {left, right}. γ = 0.9. Current critic: V̂(A) = 2.0, V̂(B) = 5.0, V̂(C) = 8.0.
Step 2a: Sample batch. One trajectory: A ⟶right B ⟶right C. Rewards: r(A, right) = 1, r(B, right) = 3.
Step 2b: Critic targets (1-step TD):
• y(A) = r(A, right) + γ V̂(B) = 1 + 0.9 × 5.0 = 5.5
• y(B) = r(B, right) + γ V̂(C) = 3 + 0.9 × 8.0 = 10.2
Critic loss: (V̂(A) − 5.5)2 + (V̂(B) − 10.2)2 = (2 − 5.5)2 + (5 − 10.2)2 = 12.25 + 27.04 = 39.29. Gradient descent will push V̂(A) toward 5.5 and V̂(B) toward 10.2.
Step 2c: Advantages:
• Â(A, right) = 1 + 0.9 × 5.0 − 2.0 = +3.5 (great action!)
• Â(B, right) = 3 + 0.9 × 8.0 − 5.0 = +5.2 (even better)
Step 2d: Both advantages are positive → increase probability of "right" in both states.
The actor and critic are trained simultaneously but at potentially different learning rates. A common pattern: αφ = 3 × αθ. The critic should learn faster than the actor, so the advantages are based on a reasonably good value estimate. If the critic is still random, advantages are garbage and the actor learns nonsense.
In some implementations (e.g., A3C), the actor and critic share a feature backbone with two separate heads. This saves parameters and computation, but can cause instability — the critic's gradient can interfere with actor features. Modern practice (PPO, SAC): separate networks unless you have a specific reason to share.
Compare the gradient from the PG lesson (with reward-to-go and baseline) to the actor-critic gradient:
They're the same structure. The only change: noisy single-sample reward-to-go is replaced by the critic-based advantage. Less noise → better gradients → faster learning.
On-policy actor-critic still throws away data after each update. Can we squeeze multiple gradient steps from the same batch?
Data was collected under πθ (old policy). We want to update θ' (new policy). The ratio πθ'/πθ corrects for the distribution mismatch — exactly like importance sampling.
Now we can take K gradient steps on the same batch without recollecting data. Much more efficient!
After many gradient steps, θ' drifts far from θ. The importance weights become extreme: if πθ'(a|s) = 0.8 but πθ(a|s) = 0.01, the ratio = 80. A few huge weights dominate the gradient → instability. The advantages were computed for the old policy — they're stale.
Don't let the new policy stray too far from the old one. This keeps importance weights near 1.
Clip the importance ratio to [1−ε, 1+ε] (typically ε = 0.2). If the ratio tries to go beyond the clip range, the gradient is zero — preventing any single update from changing the policy too much.
PPO = actor-critic + importance sampling + clipping. It gets multiple gradient steps per batch (data efficient), prevents catastrophic policy updates (stable), and is simple to implement (~50 lines of core logic). It's the default algorithm for ChatGPT's RLHF, robotic locomotion, and game AI.
ε = 0.2. Old policy: πθ(right|s) = 0.4. New policy after 3 gradient steps: πθ'(right|s) = 0.7.
Ratio w = 0.7/0.4 = 1.75. Clipped to [0.8, 1.2] → 1.2.
If  = +3.0: unclipped objective = 1.75 × 3.0 = 5.25. Clipped = 1.2 × 3.0 = 3.6. PPO takes min(5.25, 3.6) = 3.6.
The policy wants to increase this action's probability even more, but PPO says "that's enough for one batch." Collect new data first.
PPO reuses one batch for a few gradient steps. What if we stored all past experience and sampled from it freely?
A large memory that stores transitions (s, a, r, s') from all past policies. During training, we sample random minibatches from this buffer. The transitions might be from an hour ago or a million gradient steps ago.
Problem 1: Advantage A = r + γV(s') − V(s) requires Vπ for the current policy, but the transitions were generated by old policies. V of the current policy in those states might be different.
Problem 2: We can't use single-step importance weights effectively because the states themselves were visited under a different policy.
With a replay buffer, we switch from V to Q:
For each transition (s, a, r, s') in the buffer: the target is the reward plus the discounted Q-value of the next state under the current policy's action. The key: we evaluate πθ(s') using the current actor, not the one that generated the data.
Vπ(s) = 𝔼a~π[Qπ(s,a)] depends on which actions π would take. If the transition was generated by a different policy, the action a in the buffer might not be what the current π would do → V estimate is wrong.
Qπ(s, a) conditions on a specific action. The (s, a, r, s') tuple tells us exactly what happened when action a was taken in state s. That factual information is valid regardless of which policy chose a. We only need the current policy to evaluate the next action at s'.
Maximum entropy RL: the policy maximizes reward and entropy (randomness). This encourages exploration and makes training more stable. Uses a replay buffer with two Q-networks (for stability) and automatic temperature tuning.
To backprop through the actor into the Q-function, SAC uses the reparameterization trick: instead of sampling a ~ π(·|s), write a = μθ(s) + σθ(s) · ε, where ε ~ 𝒩(0, 1). Now the gradient flows through the deterministic μ and σ, treating ε as a constant. No log-probability needed — direct gradient from Q to θ.
import numpy as np def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95): T = len(rewards) advantages = np.zeros(T) last_gae = 0.0 for t in reversed(range(T)): # TD error: r_t + gamma * V(s_{t+1}) - V(s_t) next_non_terminal = 1.0 - dones[t] delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t] # GAE recursive formula last_gae = delta + gamma * lam * next_non_terminal * last_gae advantages[t] = last_gae return advantages
These two algorithms dominate modern deep RL. They represent fundamentally different design choices.
| Property | PPO | SAC |
|---|---|---|
| Off-policy degree | Slightly off-policy (few steps per batch) | Fully off-policy (replay buffer) |
| Value function | Vπ(s) | Qπ(s, a) |
| Policy update | Clipped surrogate objective | Reparameterization through Q |
| Data efficiency | Lower (discard data quickly) | Higher (reuse all data) |
| Wall-clock speed | Often faster (simpler updates) | Slower per step (more networks) |
| Stability | Very stable, easy to tune | Can be finicky (critic lag) |
| Exploration | Stochastic policy + entropy bonus | Maximum entropy (built-in) |
| Action space | Both discrete and continuous | Primarily continuous |
| Famous uses | ChatGPT RLHF, Dota 2, robotics | Dexterous manipulation, locomotion |
PPO when: you have fast simulation (cheap data), need stability, discrete actions, or are new to RL. Start here.
SAC when: data is expensive (real robot), continuous actions, need maximum data efficiency, or the task requires exploration.
All actor-critic methods live on a spectrum. On one end: pure on-policy (REINFORCE, A2C) — collect, update, discard. In the middle: "slightly" off-policy (PPO) — a few gradient steps per batch. On the far end: fully off-policy (SAC, TD3) — replay buffer, hundreds of gradient steps per environment step.
More off-policy = more data efficient but harder to train. Less off-policy = simpler but wasteful.
You have a physical robot arm learning to pick up objects. Each real-world episode takes 30 seconds, and you can run maybe 1000 episodes per day.
PPO approach: Collect 100 episodes, take ~10 gradient steps, discard. Next day: another 100. After a week: maybe 7000 episodes used, 70 updates. Slow.
SAC approach: Every single transition goes into the replay buffer. After day 1: 1000 episodes, but ~50,000 gradient updates sampling from the buffer. After a week: 7000 episodes but ~350,000 updates. 50x more learning per real-world interaction.
Network: Separate actor and critic. Both: MLP with 2 hidden layers of 256 units, tanh activation (not ReLU — tanh naturally bounds outputs, important for smooth locomotion). Orthogonal weight initialization (helps with gradient flow in deep RL).
Action distribution: Diagonal Gaussian. Mean μ(s) from actor network. Log-std as a separate learnable parameter (NOT state-dependent initially — state-dependent std is harder to train). Initial log-std = 0 (σ = 1). The std naturally decreases as training converges.
GAE: γ = 0.99, λ = 0.95. Rollout length = 2048 steps per worker. Batch = 8 workers × 2048 = 16,384 transitions. This is ~16 episodes worth of data per update.
PPO: Clip ε = 0.2. 10 epochs per batch. 32 minibatches (minibatch size = 512). Learning rate = 3e-4 with linear decay to 0 over training. Entropy coefficient = 0.0 (humanoid doesn't need exploration bonus).
Normalization (CRITICAL): Running mean/std normalization of observations. Reward scaling by 1/std(returns). Advantage normalization (subtract mean, divide by std) per minibatch. Value function loss clipped (prevent value function from changing too fast). These normalizations are often more important than the algorithm choice — without them, PPO frequently fails on Humanoid.
Timeline: 50M steps / (8 workers × 2048 steps/batch) = ~3000 updates. At ~1 sec/update on a 3090: ~50 minutes to train. Expected reward: ~5000+ (stable walking) by 20M steps.
The leap from PG to AC is just one thing: replacing the constant baseline with a learned value function. Everything else (GAE, PPO clipping, replay buffers) is optimization of this core idea. The critic is just a very good baseline.
Can you see the same "learned estimator replaces crude sample" pattern in supervised learning? (Hint: batch normalization learns running statistics instead of using per-batch estimates.)
Both use replay buffers and both learn a Q-function. The key difference: actor-critic learns Qπ (value under current policy) and uses importance weights for off-policy correction. Q-learning learns Q* (optimal value) and needs no correction because max is policy-independent. SAC sits between them: it learns Qπ but with maximum entropy, making the policy more exploratory and the Q-function more robust to distribution shift.
If you had a discrete-action problem with 1000 actions, which would you choose: DQN (compute max over 1000 Q-values) or SAC (sample from a learned categorical)? What about 1,000,000 actions?
| Method | Target yt | Bias | Variance |
|---|---|---|---|
| MC | Σt'=tT γt'−t rt' | None | High |
| TD(0) | rt + γ V̂(st+1) | High | Low |
| N-step | Σt'=tt+n−1 γt'−t rt' + γn V̂(st+n) | Medium | Medium |
| GAE(λ) | Σl=0∞ (γλ)l δt+l | Tunable | Tunable |
| Situation | Recommended | Why |
|---|---|---|
| Fast simulator, discrete actions | PPO | Simple, stable, well-tested |
| Fast simulator, continuous actions | PPO or SAC | PPO simpler; SAC if exploration matters |
| Real robot (data expensive) | SAC | Replay buffer maximizes data use |
| RLHF for language models | PPO | Stable, handles discrete tokens |
| Need reproducible results | PPO | More deterministic, easier to debug |
| Object | Formula |
|---|---|
| Vπ(s) | 𝔼[Σ γk rt+k | st=s] |
| Qπ(s,a) | 𝔼[Σ γk rt+k | st=s, at=a] |
| Aπ(s,a) | Qπ(s,a) − Vπ(s) |
| TD error | δt = rt + γV̂(st+1) − V̂(st) |
| Actor gradient | Σ ∇θ log π · Â |
| PPO clip | min(w Â, clip(w, 1±ε) Â) |
Actor-critic = learn what's good (critic), then do more of it (actor). Everything else is how you estimate "good" and how much old data you reuse.