Kochenderfer et al., Chapter 13

Actor-Critic Methods

Two networks, one goal: the actor decides, the critic evaluates, and together they learn faster than either could alone.

Prerequisites: Chapters 11–12 (gradient estimation & optimization) + value functions (Ch 7–8). That's it.

Chapter 0: The Noisy Teacher Problem

You're training to play tennis. After every rally, your coach tells you your score for that rally: 42 points. Was that good? You have no idea. It depends entirely on how the rally ended, whether you were at the net or baseline, whether the opponent was weak or strong. The number alone tells you almost nothing about what specific swing caused the result.

This is the problem with pure policy gradient methods like REINFORCE. The learning signal is the full trajectory return — a single number summarizing everything that happened. Noisy. Delayed. High variance. You adjust your swing based on a signal that may take 20 steps to arrive and reflects many random events beyond your control.

The variance problem in numbers: In CartPole, a typical trajectory return has variance ~1000. Averaging 10 trajectories still leaves a standard error of about √(1000/10) = 10 on the estimated return, and the gradient estimate built from those returns is comparably noisy. Your "improvement direction" is more noise than signal. You need hundreds of trajectories to get a reliable gradient, and each one requires rolling out the full episode.
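The arithmetic above can be sketched in a few lines. This is only an illustration of the standard-error formula for an average of independent returns; the function names are hypothetical and the variance of 1000 is the figure quoted in the text, not a measurement.

```python
import math

def standard_error(return_variance: float, num_trajectories: int) -> float:
    """Standard deviation of the mean of `num_trajectories` i.i.d. returns."""
    return math.sqrt(return_variance / num_trajectories)

def trajectories_needed(return_variance: float, target_std: float) -> int:
    """Smallest n such that sqrt(return_variance / n) <= target_std."""
    return math.ceil(return_variance / target_std ** 2)

if __name__ == "__main__":
    print(standard_error(1000.0, 10))        # noise level with 10 trajectories
    print(trajectories_needed(1000.0, 1.0))  # trajectories for a standard error of 1
```

Because the error shrinks only with the square root of the sample count, cutting the noise by 10× costs 100× more trajectories, which is the motivation for a lower-variance learning signal.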

What if your coach, instead of waiting for the rally to end, whispered after each shot: "That was 0.3 points better than I expected." That's immediate, action-specific feedback. Much lower variance. That whisper is the advantage signal, and the coach is the critic.

Actor-critic methods split the learning into two coupled problems:

Actor πθ(a|s)
The policy: given state s, output a distribution over actions. Learns via gradient ascent on expected utility.
↓ "how good was that?" ↓
Critic Uφ(s) or Qφ(s,a)
The value function: predicts expected future reward. Trained via temporal difference learning. Provides advantage estimates to the actor.
↑ "here are the trajectories" ↑
Environment
Provides states, actions, rewards. Neither actor nor critic can change how it works.
Why this works: The critic replaces noisy trajectory returns with bootstrapped estimates: δ = r + γV(s') − V(s). This one-step signal has much lower variance than the full return, because it only captures one step of randomness. The cost is bias — if V is wrong, δ is wrong. The entire field of actor-critic research is about managing this bias-variance tradeoff.
Why do pure policy gradient methods (REINFORCE) have high variance?

Chapter 1: The Actor-Critic Architecture

Start from the policy gradient theorem (Ch 11). The gradient of expected utility under policy πθ is:

$$\nabla U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{d} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Phi_t\right]$$

where Φt is the learning signal at step t. REINFORCE uses Φt = Gt (the full return from t onward). Actor-critic replaces this with the advantage A(st, at) = Q(st, at) − V(st), estimated by the critic.

The advantage answers: "How much better (or worse) was the action I took compared to what I'd expect on average in this state?" Positive advantage → reinforce this action. Negative advantage → discourage it.

Advantage vs. Q-value: Using Q(s,a) directly as Φt is unbiased but still high variance. Subtracting a baseline V(s) that doesn't depend on a leaves the gradient unbiased (the baseline term has zero expectation) but can dramatically reduce variance. V(s) is not the exact variance-minimizing baseline, but it is close in practice and is the standard choice, which gives the advantage A = Q − V.
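A small exact calculation can make the baseline argument concrete. The sketch below assumes a softmax policy over three actions with made-up Q-values (all names and numbers are illustrative); it computes the mean and total variance of the score-function gradient with baseline 0 and with baseline V(s), without any sampling.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(action, probs):
    """Gradient of log pi(action) with respect to the logits: onehot - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def gradient_stats(probs, q_values, baseline):
    """Exact mean and total variance of g(a) = score(a) * (Q(a) - baseline), a ~ pi."""
    n = len(probs)
    samples = [[g * (q_values[a] - baseline) for g in score(a, probs)] for a in range(n)]
    mean = [sum(probs[a] * samples[a][j] for a in range(n)) for j in range(n)]
    second_moment = sum(probs[a] * sum(x * x for x in samples[a]) for a in range(n))
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

probs = softmax([0.0, 0.0, 0.0])                   # uniform policy over 3 actions
q_values = [10.0, 11.0, 12.0]                      # hypothetical Q(s, a)
v = sum(p * q for p, q in zip(probs, q_values))    # V(s) = E_a[Q(s, a)]

mean_q, var_q = gradient_stats(probs, q_values, baseline=0.0)
mean_adv, var_adv = gradient_stats(probs, q_values, baseline=v)
```

Both choices give the same expected gradient, but the variance with the baseline is far smaller here because the large common offset in the Q-values no longer multiplies the score.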

In practice we parameterize two networks:

Actor: πθ(a|s)

Input: state s ∈ Rn
Output: action distribution (Categorical for discrete, Gaussian for continuous)
Loss: −E[log πθ(a|s) · A(s,a)] (minimizing this loss maximizes expected utility)
Update: gradient ascent on U(θ), i.e. descent on the loss above

Critic: Vφ(s) or Qφ(s,a)

Input: state s (and action a for Q)
Output: scalar value estimate ∈ R
Loss: ½ E[(Vφ(s) − Gt)²] (minimize; in practice Gt is often replaced by a bootstrapped TD target)
Update: descent on ∇φ

The update loop runs as follows. Collect m trajectories with the current policy. For each step (st, at, rt, st+1), compute the TD residual δt = rt + γVφ(st+1) − Vφ(st). Use δt as the advantage estimate. Gradient ascent on the actor, gradient descent on the critic.

python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim)  # logits for discrete
        )
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class Critic(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1)  # scalar V(s)
        )
    def forward(self, s):
        return self.net(s).squeeze(-1)

# One update step
def update(actor, critic, opt_a, opt_c, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    # Critic: TD target
    with torch.no_grad():
        # Zero the bootstrap term at terminal states; works for bool or float `done`
        td_target = r + gamma * critic(s_next) * (1.0 - done.float())
    v = critic(s)
    critic_loss = ((v - td_target) ** 2).mean()
    # Actor: advantage = TD residual
    advantage = (td_target - v).detach()
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()
    # Optimize both
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
Tensor shapes (CartPole, batch=256):
s: [256, 4] — 4 state features
a: [256] — integer actions {0,1}
log_prob: [256] — one log-probability per step
advantage: [256] — one scalar per step (detached from graph)
actor_loss: scalar — mean across batch
Why must the advantage be .detach()ed before computing the actor loss?

Chapter 2: TD Error as Advantage

The critic provides Vφ(s). How do we get an advantage estimate from it? The exact advantage requires knowing Q(s,a) = E[r + γV(s')|s,a], which we don't have directly. But we can estimate it from a single observed transition:

δt = rt + γ Vφ(st+1) − Vφ(st)

This is the TD residual (also called the temporal difference error). It measures: "How much better was this step than I predicted?" Suppose γ = 0.99. If the critic predicted a value of −2 at s, you received reward +1, and the next state has value −1.5, then δ = 1 + 0.99×(−1.5) − (−2) = 1 − 1.485 + 2 = 1.515. Better than expected. Reinforce the action.
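The worked number above can be reproduced with a one-line helper. This is a minimal sketch; the function name and the `terminal` flag are assumptions added for illustration.

```python
def td_residual(reward, value, next_value, gamma=0.99, terminal=False):
    """delta = r + gamma * V(s') - V(s); no bootstrap term after a terminal step."""
    bootstrap = 0.0 if terminal else gamma * next_value
    return reward + bootstrap - value

if __name__ == "__main__":
    # Values from the example in the text: V(s) = -2, r = +1, V(s') = -1.5
    print(td_residual(1.0, -2.0, -1.5))
```

A positive result means the transition went better than the critic predicted, so the actor should make that action more likely.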

When is δ an unbiased estimate of A(s,a)? Only when Vφ = Vπ exactly. In that case E[δ|s,a] = E[r + γVπ(s')|s,a] − Vπ(s) = Q(s,a) − V(s) = A(s,a). But Vφ is never perfect — it's trained simultaneously with the policy — so δ carries bias from critic errors.

The bias-variance picture:

Signal Φt | Bias | Variance | Data needed
Full return Gt | Zero | High: sums all future randomness | Full episodes
TD residual δt | High (critic errors propagate) | Low: only one step of randomness | Single steps
k-step return | Medium | Medium | k steps
GAE (Ch 3) | Tunable | Tunable | Full episodes

The practical effect: with the TD residual, the actor gradient estimate has much lower variance than REINFORCE, so you need fewer trajectories to get a reliable gradient direction. But if the critic is poorly trained (early in learning, or with too little network capacity), the bias can mislead the actor worse than high variance would.

[Interactive simulation: TD Residual vs Full Return, Sampling Distribution. Both signals estimate the same true advantage A(s,a) = −2. The TD residual (teal) has critic bias built in; the full return (orange) is unbiased but spreads wide.]
You train a critic for 1000 steps on a fixed policy. Now the policy changes. What happens to the TD residual as an advantage estimate?

Chapter 3: Generalized Advantage Estimation

Schulman et al. (2015) observed that TD residuals and full returns are endpoints of a spectrum. You can interpolate between them using a single scalar λ ∈ [0, 1]. The resulting estimator, Generalized Advantage Estimation (GAE), is now standard in PPO and most modern actor-critic methods.

Start from the k-step advantage. Using k actual rewards plus the critic's estimate of remaining value:

$$\hat{A}^{(k)}_t = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$$

At k=1 this is the TD residual δt. As k→∞ it becomes the full Monte Carlo return minus the baseline V(st). GAE takes an exponentially weighted average over all k:

$$\hat{A}^{\mathrm{GAE}(\lambda)}_t = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}^{(k)}_t$$

This simplifies beautifully. Define δt = rt + γV(st+1) − V(st). Then:

$$\hat{A}^{\mathrm{GAE}(\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

This is a geometric series over TD residuals. The weight (γλ)^l decays as we look further into the future. At l=0, the current TD residual gets weight 1. At l=10, it gets weight (γλ)^10. When λ=0, only the l=0 term survives: pure TD. When λ=1, the residual at lag l gets weight γ^l, and the sum telescopes to the discounted Monte Carlo return minus V(st): pure Monte Carlo.
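To get a feel for how quickly the weights decay, the sketch below lists them for a given γ and λ and reports their total, 1/(1 − γλ), which acts as a rough effective horizon. The function names are illustrative, not part of any library.

```python
def gae_weights(gamma: float, lam: float, horizon: int) -> list:
    """Weights (gamma * lam) ** l for l = 0 .. horizon - 1, applied to delta_{t+l}."""
    return [(gamma * lam) ** l for l in range(horizon)]

def effective_horizon(gamma: float, lam: float) -> float:
    """Sum of all weights, 1 / (1 - gamma * lam); infinite when gamma * lam >= 1."""
    decay = gamma * lam
    return float("inf") if decay >= 1.0 else 1.0 / (1.0 - decay)

if __name__ == "__main__":
    for lam in (0.0, 0.5, 0.95, 1.0):
        weights = gae_weights(0.99, lam, horizon=5)
        print(lam, [round(w, 3) for w in weights], effective_horizon(0.99, lam))
```

With λ = 0 the list collapses to a single 1 followed by zeros (pure TD), while larger λ spreads weight over more future residuals.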

Derive it yourself: Expand Â(1) = δt. Expand Â(2) = δt + γδt+1. Expand Â(3) = δt + γδt+1 + γ²δt+2. The GAE formula is (1−λ)[Â(1) + λÂ(2) + λ²Â(3) + …]. Substitute and collect terms: δt+l appears in every Â(k) with k ≥ l+1, so it gets weight (1−λ)γ^l[λ^l + λ^(l+1) + …] = (1−λ)γ^l λ^l/(1−λ) = (γλ)^l. Done.
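The identity can also be checked numerically. The sketch below assumes a single finite episode whose terminal state has value 0, so every k-step estimate past the end of the episode equals the last one and the tail of the infinite average collapses into one term. The rewards and critic values are made up for illustration.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has length T + 1."""
    return [rewards[t] + gamma * values[t + 1] - values[t] for t in range(len(rewards))]

def k_step_advantage(rewards, values, gamma, t, k):
    """k real rewards, then bootstrap from V(s_{t+k}), minus the baseline V(s_t)."""
    discounted = sum(gamma ** i * rewards[t + i] for i in range(k))
    return discounted + gamma ** k * values[t + k] - values[t]

def gae_as_weighted_average(rewards, values, gamma, lam, t=0):
    """(1 - lam) * sum_k lam**(k-1) * A^(k) for an episode ending after step T - 1.

    With K = T - t steps left, the terms k >= K all equal A^(K), and their
    weights sum to lam**(K-1).
    """
    K = len(rewards) - t
    head = sum((1 - lam) * lam ** (k - 1) * k_step_advantage(rewards, values, gamma, t, k)
               for k in range(1, K))
    return head + lam ** (K - 1) * k_step_advantage(rewards, values, gamma, t, K)

def gae_as_residual_sum(rewards, values, gamma, lam, t=0):
    """sum_l (gamma * lam)**l * delta_{t+l}."""
    deltas = td_residuals(rewards, values, gamma)
    return sum((gamma * lam) ** l * deltas[t + l] for l in range(len(rewards) - t))

rewards = [1.0, 0.0, -0.5, 2.0]          # made-up rewards
values = [0.3, -0.2, 0.8, 1.1, 0.0]      # made-up critic outputs; V(terminal) = 0
```

Calling both functions with the same γ, λ, and t should give the same number up to floating-point rounding, which is the content of the derivation.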

Computing GAE efficiently is a single backwards pass over the trajectory. No extra network calls needed.

python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    rewards: [T]   — observed rewards
    values:  [T+1] — V(s_0)...V(s_T), last is bootstrap
    dones:   [T]   — 1.0 if episode ended
    returns: advantages [T] and targets [T]
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]          # 0 at episode boundary
        delta = rewards[t] + gamma * values[t+1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    targets = [advantages[t] + values[t] for t in range(T)]
    return advantages, targets
# One backwards scan, O(T) time and space. No extra network calls.
Why the backwards scan? GAE[t] = δt + (γλ) × GAE[t+1]. It's a recurrence that runs backwards in time. Start from the last step (GAE = 0 at the terminal state), work backwards accumulating the geometric sum. Each step is a single multiply-and-add.
[Interactive simulation: GAE λ Dial, Bias–Variance Tradeoff. Adjusting λ blends between one-step TD (low variance, biased by critic errors) and full Monte Carlo (unbiased, high variance). Top panel: the (γλ)^l weights on each TD residual. Bottom panel: a simulated sampling distribution of the resulting advantage estimate, compared to the true value A = −1.5. Controls: λ (default 0.95), γ (default 0.99), critic noise σ (default 0.50).]
You set λ = 0.95 and γ = 0.99, giving γλ = 0.9405. How much weight does the TD residual 10 steps in the future (δt+10) get relative to δt?

Chapter 4: A2C and A3C

The basic actor-critic update is sequential: collect a trajectory, update, collect another. Advantage Actor-Critic (A2C) and its asynchronous variant A3C (Mnih et al., 2016) scale this up by collecting experience from multiple parallel environments simultaneously.

A2C: Synchronous Parallel Environments

Run N copies of the environment in parallel (different random seeds, same policy parameters). After T steps in each, you have N×T transition tuples. Average the gradient over all of them before updating.

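$$\nabla U(\theta) \approx \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, \hat{A}^{\mathrm{GAE},(i)}_t$$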

Why does this reduce variance? Each environment produces an independent trajectory. Averaging N independent gradient estimates divides the variance by N. With 8 parallel environments, the gradient variance is 8× lower for the same wall-clock time, assuming the environments really do run in parallel.
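The 1/N scaling is easy to see empirically. In the sketch below each "gradient estimate" is stood in for by a unit-variance Gaussian draw, which is an assumption made purely for illustration; real gradient estimates are vectors and are only approximately independent across environments.

```python
import random
import statistics

def variance_of_mean(num_envs, trials=5000, seed=0):
    """Empirical variance of the average of `num_envs` independent noisy estimates."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(num_envs))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

if __name__ == "__main__":
    single = variance_of_mean(1)
    batched = variance_of_mean(8)
    print(single, batched, single / batched)
```

The printed ratio should land near 8; it will not be exact because it is itself estimated from a finite number of trials.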

Entropy bonus: A2C adds an entropy term to the actor loss: α · H(π(·|s)) where H is the policy entropy. This prevents the policy from collapsing to a near-deterministic distribution too early, which would stop exploration. Typical α = 0.01. The entropy of a Gaussian is H = ½ log(2πeσ²), so maximizing entropy means maximizing σ (keeping the policy exploratory).
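As a rough illustration of the entropy term, the sketch below evaluates the Gaussian formula from the text and the discrete analogue for a categorical policy, then folds the bonus into a loss. The helper names and the example numbers are assumptions for illustration only.

```python
import math

def gaussian_entropy(sigma: float) -> float:
    """Entropy of a 1D Gaussian: 0.5 * log(2 * pi * e * sigma**2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def categorical_entropy(probs) -> float:
    """Entropy of a discrete distribution; zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularized_loss(actor_loss: float, entropy: float, alpha: float = 0.01) -> float:
    """Subtracting alpha * H from the loss rewards higher-entropy policies."""
    return actor_loss - alpha * entropy

if __name__ == "__main__":
    print(gaussian_entropy(0.1), gaussian_entropy(1.0))
    print(categorical_entropy([0.5, 0.5]), categorical_entropy([0.99, 0.01]))
```

A nearly deterministic policy has entropy close to zero (or very negative in the Gaussian case), so it receives almost no bonus and the optimizer is nudged back toward a broader distribution.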

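The full A2C loss (the policy-gradient objective enters with a minus sign because we minimize the total loss while ascending on utility):

$$\ell_{\text{total}} = -\,\mathbb{E}\big[\log \pi_\theta(a \mid s)\, \hat{A}\big] + c_1\, \ell_{\text{critic}} - c_2\, H(\pi_\theta)$$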

Typical coefficients: c1 = 0.5 (value loss weight), c2 = 0.01 (entropy bonus). Both are hyperparameters.

A3C: Asynchronous Threads

A3C runs N worker threads, each with its own copy of the environment and local gradient accumulation. Workers push gradient updates to a shared parameter server asynchronously — no waiting for others. This allows scaling to many cores without synchronization overhead.

A2C (synchronous)

• All workers step together
• Gradient is averaged before update
• More reproducible training
• Standard in modern implementations
• Works well with GPU batching

A3C (asynchronous)

• Workers push gradients independently
• No synchronization → higher throughput
• Stale gradients can cause instability
• Mostly replaced by A2C + GPU
• Original: 16 threads on CPU

Why A2C won: The original A3C paper was written in the CPU era. On a GPU, batching N environments and computing one big gradient is faster than running N separate threads that each compute small, unbatched gradients. Modern practice is A2C with vectorized environments (e.g., gym.vector.SyncVectorEnv).
python
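# A2C with N parallel environments.
# Assumptions: the Gymnasium API (reset -> (obs, info), step -> 5-tuple), the
# Actor/Critic classes from Chapter 1, CartPole as the task, and one optimizer
# over both networks. Hyperparameter values are illustrative.
import gymnasium as gym
import torch

N, T, gamma, lam, num_updates = 8, 5, 0.99, 0.95, 1000

def make_env():
    return gym.make("CartPole-v1")

envs = gym.vector.SyncVectorEnv([make_env for _ in range(N)])
actor, critic = Actor(4, 2), Critic(4)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)
obs, _ = envs.reset(seed=0)
obs = torch.as_tensor(obs, dtype=torch.float32)           # [N, obs_dim]

for update in range(num_updates):
    log_probs, values, entropies, rewards, dones = [], [], [], [], []
    for _ in range(T):                                    # collect T steps from all N envs
        dist = actor(obs)
        actions = dist.sample()                           # [N]
        log_probs.append(dist.log_prob(actions))          # [N]
        entropies.append(dist.entropy())                  # [N]
        values.append(critic(obs))                        # [N]
        obs_next, r, term, trunc, _ = envs.step(actions.numpy())
        rewards.append(torch.as_tensor(r, dtype=torch.float32))
        # Simplification: time-limit truncation is treated like termination
        dones.append(torch.as_tensor(term | trunc, dtype=torch.float32))
        obs = torch.as_tensor(obs_next, dtype=torch.float32)

    # GAE for each env independently: backwards scan over the T collected steps
    with torch.no_grad():
        next_value, gae, advantages = critic(obs), torch.zeros(N), []
        for t in reversed(range(T)):
            mask = 1.0 - dones[t]
            delta = rewards[t] + gamma * next_value * mask - values[t]
            gae = delta + gamma * lam * mask * gae
            advantages.insert(0, gae)
            next_value = values[t]
    advantages = torch.stack(advantages)                  # [T, N], no gradient
    log_probs, values, entropies = map(torch.stack, (log_probs, values, entropies))

    # Actor + critic + entropy loss, using every stored step (not just the last)
    actor_loss = -(log_probs * advantages).mean()
    critic_loss = 0.5 * (values - (advantages + values.detach())).pow(2).mean()
    entropy_loss = -entropies.mean()
    loss = actor_loss + 0.5 * critic_loss + 0.01 * entropy_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()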
In A2C with N=8 parallel environments, by approximately what factor does gradient variance decrease compared to a single environment (same number of steps total)?

Chapter 5: Deterministic Policy Gradient

Everything so far uses stochastic policies: πθ(a|s) is a distribution. This works for discrete actions and low-dimensional continuous actions. But for high-dimensional continuous control — a robot arm with 7 joints — sampling from a multivariate Gaussian and computing log-probabilities becomes expensive and unstable.

Silver et al. (2014) proved a deterministic policy gradient theorem. If the policy is deterministic — μθ(s) outputs a single action vector, not a distribution — then:

$$\nabla J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)} \, \nabla_\theta \mu_\theta(s)\right]$$

Read it as two chain rule factors: ∇aQ answers "if I perturb the action slightly, how does value change?" and ∇θμ answers "if I perturb the parameters, how does the action change?" The actor gradient is their product. No log-probability needed.
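The two chain-rule factors can be checked by hand on a toy problem. The sketch below assumes a scalar linear policy μθ(s) = θ·s and a made-up critic Q(s,a) = −(a + s)², whose best action is a = −s; it compares the analytic product of the two factors with a finite-difference estimate. Everything here is illustrative rather than part of DPG itself.

```python
def mu(theta, s):
    """Deterministic policy mu_theta(s) = theta * s."""
    return theta * s

def q(s, a):
    """Toy critic: value is highest when a = -s."""
    return -(a + s) ** 2

def dpg_gradient(theta, states):
    """Average over states of dQ/da (evaluated at a = mu(s)) times dmu/dtheta."""
    total = 0.0
    for s in states:
        dq_da = -2.0 * (mu(theta, s) + s)   # how value changes with the action
        dmu_dtheta = s                      # how the action changes with theta
        total += dq_da * dmu_dtheta
    return total / len(states)

def finite_difference(theta, states, eps=1e-6):
    """Central-difference estimate of d/dtheta of the mean Q(s, mu_theta(s))."""
    def objective(th):
        return sum(q(s, mu(th, s)) for s in states) / len(states)
    return (objective(theta + eps) - objective(theta - eps)) / (2 * eps)

states = [-1.0, 0.5, 2.0]   # made-up sample of visited states
```

At θ = −1 the policy already outputs a = −s, so both factors multiply to zero and the actor stops moving.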

What this requires: A differentiable Q-function (a neural network). Continuous actions (so we can differentiate w.r.t. action). An off-policy data collection scheme (since the deterministic policy itself doesn't explore).

DDPG: DPG + Deep Networks

Lillicrap et al. (2015) combined DPG with three stability tricks from DQN to get DDPG (Deep Deterministic Policy Gradient):

Experience Replay
Store transitions (s,a,r,s') in a buffer of size 10⁶. Train on random mini-batches. Breaks temporal correlation that causes divergence.
+
Target Networks
Keep slow copies μ' and Q' of both networks. TD target = r + γQ'(s', μ'(s')). Polyak update: θ' ← τθ + (1−τ)θ' with τ = 0.005. Prevents oscillation.
+
Exploration Noise
Add Ornstein-Uhlenbeck noise to the action: a = μ(s) + ε, ε ~ OU process. Temporally correlated noise works better than i.i.d. Gaussian for physical control.
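For the exploration-noise component, a discretized Ornstein-Uhlenbeck process might look like the sketch below. The class name is hypothetical; θ = 0.15 and σ = 0.2 are the values commonly quoted for DDPG, and the step size dt = 1 is an assumption.

```python
import math
import random

class OUNoise:
    """Discretized OU process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, x0=0.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = x0
        self.rng = random.Random(seed)

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt              # pull toward mu
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

if __name__ == "__main__":
    noise = OUNoise(seed=0)
    print([round(noise.sample(), 3) for _ in range(5)])   # successive values are correlated
```

Each sample starts from the previous one, so consecutive perturbations push the action in a similar direction for several steps instead of cancelling out.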
python
import torch
import torch.nn as nn

# DDPG actor update
def update_actor(actor, critic, opt_actor, states):
    # Gradient ascent on Q(s, mu(s)) w.r.t. actor params
    actions = actor(states)                    # [B, act_dim]
    q_values = critic(states, actions)         # [B], differentiable w.r.t. actions
    actor_loss = -q_values.mean()              # maximize Q
    opt_actor.zero_grad()
    actor_loss.backward()                      # flows through critic into actor
    opt_actor.step()
    # Note: critic parameters also accumulate .grad here, but opt_actor never
    # steps them; update_critic clears those grads with opt_critic.zero_grad()

# DDPG critic update (standard Bellman backup against the target networks)
def update_critic(critic, actor_target, critic_target, opt_critic, batch, gamma=0.99):
    s, a, r, s_next, done = batch              # done: bool tensor [B]
    with torch.no_grad():
        a_next = actor_target(s_next)
        q_next = critic_target(s_next, a_next)
        td_target = r + gamma * q_next * (~done)
    q = critic(s, a)
    critic_loss = nn.functional.mse_loss(q, td_target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# Polyak update of a target network; call it for both (actor, actor_target)
# and (critic, critic_target) after each training step
def polyak_update(net, target_net, tau=0.005):
    with torch.no_grad():
        for param, target_param in zip(net.parameters(), target_net.parameters()):
            target_param.mul_(1 - tau).add_(tau * param)
Tensor flow in DDPG actor update:
states [B, obs_dim] → actor → actions [B, act_dim]
(states, actions) → critic → Q [B]
actor_loss = −mean(Q) — gradient flows back through critic to actor
critic.parameters() receive grad but opt_actor only updates actor.parameters()
[Interactive simulation: Deterministic Policy + Exploration Noise. The deterministic policy (blue line) maps state to action with no randomness. Exploration noise (orange dots) covers the action space for experience collection. The OU process (red dots) produces temporally correlated noise: it wanders instead of jumping. Control: Gaussian σ (default 0.50).]
DDPG's actor update computes −Q(s, μ(s)).mean() and backpropagates through the critic. Which parameters does the optimizer update?

Chapter 6: Soft Actor-Critic

DDPG is sample-efficient but brittle: it is hyperparameter-sensitive, prone to Q-value overestimation, and, because its policy is deterministic, its exploration depends entirely on externally added noise. Soft Actor-Critic (SAC) (Haarnoja et al., 2018) addresses all three issues by changing the objective.

Instead of maximizing expected return alone, SAC maximizes entropy-augmented return:

$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \big(r_t + \alpha\, H(\pi(\cdot \mid s_t))\big)\right]$$

α is the temperature parameter: it controls how much entropy matters. High α → very stochastic policy (exploration), low α → nearly deterministic (exploitation). The entropy term H(π(·|s)) = −E[log π(a|s)] rewards distributional spread.

Why entropy regularization works: The standard RL objective finds a single best action per state. The entropy-augmented objective finds a distribution over good actions. This means:
Exploration is built in — high entropy keeps the policy exploratory without separate noise
Multiple optima — if two actions are equally good, the policy learns to use both
Robustness — a spread policy is harder to adversarially exploit

The soft Bellman optimality equation becomes:

Q*(s,a) = r(s,a) + γ Es'[V*(s')]
V*(s) = Ea ~ π[Q*(s,a) − α log π(a|s)]

SAC uses the reparameterization trick to backpropagate through the stochastic action: instead of sampling a ~ π(·|s), sample ε ~ N(0,I) and compute a = μ(s) + σ(s)⊙ε. Now a is a differentiable function of the parameters.
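To see why the reparameterized sample is useful, the sketch below estimates gradients of E[f(a)] by differentiating through a = μ + σ·ε by hand, with the toy choice f(a) = a². For that choice E[f(a)] = μ² + σ², so the exact gradients are 2μ and 2σ. The function name, the choice of f, and the sample count are assumptions for illustration; SAC itself relies on automatic differentiation.

```python
import random

def reparam_gradients(mu, sigma, num_samples=20000, seed=0):
    """Pathwise estimates of d/dmu and d/dsigma of E[f(a)], a = mu + sigma * eps, f(a) = a**2."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)      # noise does not depend on the parameters
        a = mu + sigma * eps           # the action is a differentiable function of them
        df_da = 2.0 * a                # f'(a) for f(a) = a**2
        grad_mu += df_da * 1.0         # da/dmu = 1
        grad_sigma += df_da * eps      # da/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

if __name__ == "__main__":
    print(reparam_gradients(1.5, 0.5))   # compare with the exact values 2*mu and 2*sigma
```

Because the randomness is isolated in ε, the gradient passes through the sample itself, with no log-probability term needed.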

SAC networks:

• Actor: πθ(a|s) — Gaussian with learned mean μθ(s) and diagonal covariance σθ(s)
• Two critics: Qφ1, Qφ2 (Clipped Double-Q)
• Two target critics: Q'φ1, Q'φ2

Clipped Double-Q:

Use min(Qφ1, Qφ2) in the Bellman target. Two independent critics with different initializations disagree about Q-values. Taking the minimum is pessimistic — it counteracts the systematic overestimation that causes instability in DDPG.
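For a single transition, the pessimistic target can be sketched with plain numbers. The function below combines the minimum of two target-critic values with the entropy term from the soft value definition; the function name, argument names, and the default α = 0.2 are assumptions for illustration.

```python
def sac_target(reward, done, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2):
    """y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    min_q = min(q1_next, q2_next)                 # pessimistic of the two estimates
    soft_value = min_q - alpha * log_prob_next    # add the entropy bonus
    not_done = 0.0 if done else 1.0               # no bootstrap after termination
    return reward + gamma * not_done * soft_value

if __name__ == "__main__":
    # Critics disagree (5.0 vs 4.0); the target uses the lower value.
    print(sac_target(1.0, False, 5.0, 4.0, log_prob_next=-1.0, gamma=0.5))
```

When one critic overestimates a state-action pair, the other usually does not overestimate it by the same amount, so the minimum damps the error before it propagates through the Bellman backup.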

python
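import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

# SAC actor (squashed Gaussian)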
class SACPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim,256),nn.ReLU(),nn.Linear(256,256),nn.ReLU())
        self.mean_layer = nn.Linear(256, act_dim)
        self.log_std_layer = nn.Linear(256, act_dim)

    def forward(self, s):
        h = self.backbone(s)
        mean = self.mean_layer(h)
        log_std = self.log_std_layer(h).clamp(-20, 2)  # stability
        std = log_std.exp()
        # Reparameterization: a = mean + std * eps
        eps = torch.randn_like(mean)
        a_raw = mean + std * eps                        # differentiable!
        a = torch.tanh(a_raw)                           # squash to (-1,1)
        # Log-prob with change of variables for tanh squashing
        log_prob = Normal(mean,std).log_prob(a_raw).sum(-1)
        log_prob -= (2*(math.log(2) - a_raw - F.softplus(-2*a_raw))).sum(-1)
        return a, log_prob  # [B,act_dim], [B]

# SAC actor update: maximize E[Q] + alpha*H
def update_actor_sac(actor, q1, q2, opt_actor, states, alpha):
    a, log_prob = actor(states)                         # [B,act_dim], [B]
    q_min = torch.min(q1(states,a), q2(states,a))      # [B]
    actor_loss = (alpha * log_prob - q_min).mean()      # minimize this
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
Automatic temperature tuning: SAC-v2 treats α as a Lagrange multiplier. Define a target entropy H̄ (e.g., −act_dim for continuous). Minimize the dual objective: ℓ(α) = −α · E[log π(a|s) + H̄]. This automatically adjusts α so that actual policy entropy matches the target. No manual tuning needed.
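One gradient step on the temperature might look like the sketch below. It optimizes log α rather than α so the temperature stays positive, which is an implementation choice assumed here rather than something the text specifies; the function name and learning rate are also illustrative.

```python
import math

def temperature_step(log_alpha, log_probs, target_entropy, lr=0.1):
    """One descent step on l(alpha) = -alpha * mean(log_pi + target_entropy), in log-alpha."""
    alpha = math.exp(log_alpha)
    gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad_log_alpha = -alpha * gap          # chain rule: dl/dlog_alpha = alpha * dl/dalpha
    return log_alpha - lr * grad_log_alpha

if __name__ == "__main__":
    # Sampled log-probs of 1.0 and 1.2 mean entropy of about -1.1, below the target -1.0
    print(temperature_step(0.0, [1.0, 1.2], target_entropy=-1.0))
```

When the policy's entropy falls below the target, the step raises α and the entropy bonus gets stronger; when entropy is above the target, α shrinks.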
SAC uses two Q-networks and takes min(Q1, Q2) in the Bellman target. What problem does this solve?

Chapter 7: Actor-Critic with MCTS (AlphaZero)

For games with discrete, finite action spaces and a perfect simulator (chess, Go, shogi), a powerful alternative exists: use MCTS as a policy improvement operator. The actor and critic guide the search; the search results train the actor and critic. This is the AlphaZero architecture.

The key insight: MCTS is a form of lookahead. Running 800 simulations from a state explores the game tree and produces a much better action distribution than the raw policy alone. If we treat this improved distribution as a training target, we can distill the lookahead into the policy network — which then starts from a better place next time, enabling deeper lookahead.

Network Architecture

A single network fθ(s) = (p, v) outputs two heads:

Policy head p = πθ(a|s)

Prior probability over all legal moves. Used as the prior in the MCTS selection rule, where it biases search toward promising moves before they have been visited.

Value head v = Uθ(s)

Estimated expected game outcome from state s, on the same scale as the outcome label z. Replaces rollouts in MCTS leaf evaluation, which is much faster than simulating to game end.

MCTS with Neural Guidance

Each simulation traverses the game tree by selecting actions with:

$$a^* = \arg\max_a \left[ Q(s,a) + c_{\text{puct}} \, p(a \mid s) \, \frac{\sqrt{N(s)}}{1 + N(s,a)} \right]$$

N(s,a) is the visit count of action a from s, and N(s) = ∑b N(s,b) is the total visit count of s. The second term is an exploration bonus: high when a has been visited rarely (N(s,a) small) or the prior is high (p(a|s) large). c_puct ≈ 1 to 5 controls exploration vs. exploitation inside the tree.

Why visit counts? Frequent visits drive Q toward the true value (averaging reduces noise). The √N(s)/(1+N(s,a)) term keeps rarely visited actions in contention. It is a variant of the UCB (Upper Confidence Bound) rule from multi-armed bandits, weighted by the network prior and applied to tree search.
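The selection rule can be sketched for a single node as follows. This is a minimal illustration with per-action lists; the function name and the default c_puct = 1.5 are assumptions, and a real implementation would also handle illegal moves and virtual loss.

```python
import math

def puct_select(q_values, priors, visit_counts, c_puct=1.5):
    """Index of the action maximizing Q(s,a) + c_puct * p(a|s) * sqrt(N(s)) / (1 + N(s,a))."""
    total_visits = sum(visit_counts)                # N(s)

    def score(a):
        bonus = c_puct * priors[a] * math.sqrt(total_visits) / (1 + visit_counts[a])
        return q_values[a] + bonus

    return max(range(len(priors)), key=score)

if __name__ == "__main__":
    # Action 0 looks better so far, but action 1 has a high prior and no visits.
    print(puct_select([0.5, 0.1], [0.2, 0.8], [10, 0]))
    # Once action 1 has been visited often, its bonus shrinks and Q dominates.
    print(puct_select([0.5, 0.1], [0.2, 0.8], [10, 200]))
```

The bonus for an action decays as its own visit count grows, so the search gradually shifts from trusting the prior to trusting the averaged values.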

Training Loop

Self-Play
Two copies of the current network play each other. At each move, run 800 MCTS simulations. Record (s, πMCTS, z) where z ∈ {+1, 0, −1} is the game outcome (win, draw, loss) from the perspective of the player to move.
Replay Buffer
Store the last 500k self-play positions. Sample random mini-batches for training (breaks correlation between consecutive positions).
Network Update
Minimize ℓ = (v − z)² − πMCTS · log p + λ‖θ‖². Value head learns to predict game outcomes; policy head learns to mimic MCTS's improved distribution. (Here λ is an L2 regularization weight, unrelated to the GAE λ.)
↻ repeat

The policy head's loss is cross-entropy against the MCTS visit distribution πMCTS(a|s) = N(s,a)^(1/τ) / ∑b N(s,b)^(1/τ). Temperature τ → 0 makes this the greedy most-visited action; τ = 1 is proportional to counts. AlphaZero uses τ = 1 for the first 30 moves (exploration), then τ → 0 (exploitation).
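Turning visit counts into a training target is a short computation. The sketch below treats τ = 0 as the greedy limit; the function name is illustrative, and very small positive τ would need a numerically safer formulation than raw powers.

```python
def visit_policy(visit_counts, tau=1.0):
    """pi(a) proportional to N(s,a) ** (1 / tau); tau == 0 returns the one-hot greedy choice."""
    if tau == 0:
        best = max(range(len(visit_counts)), key=lambda a: visit_counts[a])
        return [1.0 if a == best else 0.0 for a in range(len(visit_counts))]
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]

if __name__ == "__main__":
    counts = [10, 30, 60]
    for tau in (1.0, 0.5, 0):
        print(tau, [round(p, 3) for p in visit_policy(counts, tau)])
```

Lowering τ sharpens the distribution toward the most-visited move, which is why a high temperature is used early in the game and a low one later.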

What makes AlphaZero work:
• Tabula rasa — starts from random play, no human games needed (unlike AlphaGo)
• MCTS + neural network together > either alone (supported by ablation studies)
• The value network replaces rollouts; the policy network replaces random selection
• Self-play curriculum: as the network improves, so does its opponent (itself), so the difficulty scales automatically
In AlphaZero, what does the policy head p(a|s) contribute to MCTS, and what does MCTS contribute back to the policy?

Chapter 8: Actor-Critic in Action

Let's watch an actor-critic method learn on a simple 1D regulator: the state is a scalar s, and the goal is to drive s to zero. The optimal policy is μ(s) = −s (negative feedback). The optimal value function is a negative quadratic — further from zero is worse.

We parameterize: actor as μθ(s) = θ1·s with log-variance θ2. Critic as Vφ(s) = φ1·s + φ2·s². Watch θ1 → −1 (optimal slope) and φ2 → negative (value is lowest far from zero).

[Interactive simulation: Actor-Critic Learning, 1D Regulator. Top panel: actor policy mean (blue) vs. optimal (dashed). Bottom panel: critic value estimate (green) vs. optimal (dashed). Controls: α actor (default 0.015), α critic (default 0.040), λ (default 0.90). A large α actor produces visible instability.]
Watch for: When αactor is too large relative to αcritic, the actor outruns the critic. The policy changes before the critic can track it, so the advantage estimates become stale and the actor oscillates. The fix: use a smaller actor learning rate (or more critic updates per actor update).
In the 1D regulator, what is the optimal policy parameter θ1 (the slope of μ(s) = θ1·s)?

Chapter 9: Connections and What's Next

Actor-critic methods are the backbone of modern deep RL. The table below shows how the algorithms in this chapter relate to each other and to broader families.

Algorithm | Actor | Critic | Advantage | Key trick
Basic AC | Stochastic πθ | Vφ(s) | TD residual δ | Baseline variance reduction
A2C/A3C | Stochastic πθ | Vφ(s) | GAE(λ) | Parallel envs + entropy bonus
PPO | Stochastic πθ | Vφ(s) | GAE(λ) | Clipped surrogate (Ch 12)
DDPG | Deterministic μθ | Qφ(s,a) | Chain rule through Q | Replay + target networks
TD3 | Deterministic μθ | 2×Qφ | Clipped Double-Q | Delayed actor updates + target noise
SAC | Stochastic (reparam) | 2×Qφ | Q − α log π | Entropy regularization + auto-α
AlphaZero | πθ prior | vθ(s) | MCTS visit counts | Self-play + lookahead distillation

Limitations

When actor-critic struggles:
Sparse rewards: If rewards only come at episode end (e.g., winning a game), the TD signal is nearly zero everywhere. The critic can't bootstrap usefully.
Non-stationary targets: Both actor and critic change simultaneously. The critic's training target (which depends on the actor) is always moving. This can cause divergence with function approximation.
Exploration in sparse environments: A stochastic policy with Gaussian noise may never discover the rewarding region. Requires auxiliary exploration (curiosity, RND, HER).

What's Next

From this chapter:

• GAE + PPO clip = most robust general algorithm today
• SAC = best for continuous control benchmarks
• AlphaZero = best for two-player perfect-information games
• TD3 = DDPG but stable (often better default)

Open problems:

• Model-based AC (Dreamer, MBPO): learn a world model, train AC inside it
• Multi-agent AC (Ch 27): each agent has its own actor, shared or separate critics
• Offline RL: train AC from a fixed dataset with no environment interaction
• RLHF: the reward comes from a model learned from human preferences, and the policy is fine-tuned against it, typically with PPO (GPT fine-tuning)

The central lesson of actor-critic methods: Any learning signal that reduces variance while keeping bias manageable is worth building. The critic is just a learned function that does this for the policy gradient. The GAE parameter λ lets you choose your bias-variance tradeoff explicitly. The rest is engineering — replay buffers, target networks, entropy bonuses — to keep the coupled optimization stable.
"The actor-critic is the right framework for almost any RL problem. The question is which variant."
— paraphrased from Sutton & Barto, Reinforcement Learning (2nd ed.)
SAC maximizes entropy-regularized return. What does this mean for the policy at convergence, compared to a standard actor-critic?