Ch 13: Actor-Critic Methods — Kochenderfer et al.

Chapter 0: The Noisy Teacher Problem

You're training to play tennis. After every rally, your coach tells you your score for that rally: 42 points. Was that good? You have no idea. It depends entirely on how the rally ended, whether you were at the net or baseline, whether the opponent was weak or strong. The number alone tells you almost nothing about what specific swing caused the result.

This is the problem with pure policy gradient methods like REINFORCE. The learning signal is the full trajectory return — a single number summarizing everything that happened. Noisy. Delayed. High variance. You adjust your swing based on a signal that may take 20 steps to arrive and reflects many random events beyond your control.

The variance problem in numbers: In CartPole, a typical trajectory return has variance ~1000. The gradient estimate with 10 trajectories has standard deviation ~√(1000/10) ≈ 10. Your "improvement direction" is more noise than signal. You need hundreds of trajectories to get a reliable gradient — and each one requires rolling out the full episode.
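
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that, like the text, treats the per-trajectory noise as having variance 1000 and the trajectories as independent; the function names are illustrative, not from any library.

```python
import math

def gradient_std(return_variance: float, num_trajectories: int) -> float:
    """Standard deviation of a mean over independent trajectories."""
    return math.sqrt(return_variance / num_trajectories)

def trajectories_needed(return_variance: float, target_std: float) -> int:
    """Smallest m such that sqrt(variance / m) <= target_std."""
    return math.ceil(return_variance / target_std ** 2)

print(gradient_std(1000.0, 10))          # 10.0
print(trajectories_needed(1000.0, 1.0))  # 1000
```

Cutting the noise by a factor of 10 costs 100 times as many trajectories, which is why variance reduction matters more than raw sample count.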

What if your coach, instead of waiting for the rally to end, whispered after each shot: "That was 0.3 points better than I expected." That's immediate, action-specific feedback. Much lower variance. That whisper is the advantage signal, and the coach is the critic.

Actor-critic methods split the learning into two coupled problems:

Actor π_θ(a|s)

The policy: given state s, output a distribution over actions. Learns via gradient ascent on expected utility.

↓ "how good was that?" ↓

Critic U_φ(s) or Q_φ(s,a)

The value function: predicts expected future reward. Trained via temporal difference learning. Provides advantage estimates to the actor.

↑ "here are the trajectories" ↑

Environment

Provides states, actions, rewards. Neither actor nor critic can change how it works.

Why this works: The critic replaces noisy trajectory returns with bootstrapped estimates: δ = r + γV(s') − V(s). This one-step signal has much lower variance than the full return, because it only captures one step of randomness. The cost is bias — if V is wrong, δ is wrong. The entire field of actor-critic research is about managing this bias-variance tradeoff.

Why do pure policy gradient methods (REINFORCE) have high variance?

• Because the policy is too small
• Because the learning signal is the full trajectory return, a sum of many random rewards, making it noisy and high-variance
• Because the critic hasn't been trained yet

Chapter 1: The Actor-Critic Architecture

Start from the policy gradient theorem (Ch 11). The gradient of expected utility under policy π_θ is:

∇U(θ) = E_{τ ~ πθ} [∑_t=1^d ∇_θ log π_θ(a_t|s_t) · Φ_t]

where Φ_t is the learning signal at step t. REINFORCE uses Φ_t = G_t (the full return from t onward). Actor-critic replaces this with the advantage A(s_t, a_t) = Q(s_t, a_t) − V(s_t), estimated by the critic.

The advantage answers: "How much better (or worse) was the action I took compared to what I'd expect on average in this state?" Positive advantage → reinforce this action. Negative advantage → discourage it.

Advantage vs. Q-value: Using Q(s,a) directly as Φ_t is unbiased but still high variance. Subtracting a baseline V(s) that doesn't depend on a leaves the gradient unbiased (the baseline term has zero expectation) but can dramatically reduce variance. V(s) is not the exactly variance-minimizing baseline, but it is a near-optimal and convenient one, and it turns the learning signal into the advantage A = Q − V.
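
To see the baseline argument numerically, the sketch below computes the exact mean and variance of the single-sample gradient for a three-action softmax policy, once with Q as the signal and once with the advantage. The logits and Q-values are made-up numbers chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(action, probs):
    """Gradient of log pi(action) with respect to the softmax logits."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]

def gradient_stats(probs, signal):
    """Exact mean and total variance of score(a) * signal[a] for a ~ probs."""
    n = len(probs)
    samples = [[g * signal[a] for g in score(a, probs)] for a in range(n)]
    mean = [sum(probs[a] * samples[a][j] for a in range(n)) for j in range(n)]
    var = sum(probs[a] * sum((samples[a][j] - mean[j]) ** 2 for j in range(n))
              for a in range(n))
    return mean, var

probs = softmax([0.2, -0.1, 0.5])
q = [10.0, 11.0, 12.0]                        # hypothetical action values
v = sum(p * qa for p, qa in zip(probs, q))    # V(s) = E_a[Q(s, a)]
adv = [qa - v for qa in q]

mean_q, var_q = gradient_stats(probs, q)
mean_adv, var_adv = gradient_stats(probs, adv)
print(mean_q, mean_adv)   # same expected gradient
print(var_q, var_adv)     # much smaller variance with the baseline
```

The two means agree because the baseline term has zero expectation; only the spread around the mean changes.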

In practice we parameterize two networks:

Actor: π_θ(a|s)

Input: state s ∈ Rⁿ
Output: action distribution (Categorical for discrete, Gaussian for continuous)
Loss: −E[log π_θ(a|s) · A(s,a)] (minimize; equivalent to maximizing E[log π_θ(a|s) · A(s,a)])
Update: ascent on ∇_θU

Critic: V_φ(s) or Q_φ(s,a)

Input: state s (and action a for Q)
Output: scalar value estimate ∈ R
Loss: ½ E[(V_φ(s) − y)²] with target y = G_t (Monte Carlo) or y = r + γV_φ(s') (TD) (minimize)
Update: descent on ∇_φℓ

The update loop runs as follows. Collect m trajectories with the current policy. For each step (s_t, a_t, r_t, s_t+1), compute the TD residual δ_t = r_t + γV_φ(s_t+1) − V_φ(s_t). Use δ_t as the advantage estimate. Gradient ascent on the actor, gradient descent on the critic.

python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim)  # logits for discrete
        )
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class Critic(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1)  # scalar V(s)
        )
    def forward(self, s):
        return self.net(s).squeeze(-1)

# One update step
def update(actor, critic, opt_a, opt_c, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    # Critic: TD target
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - done.float())  # no bootstrap at terminal states; works for bool or float done
    v = critic(s)
    critic_loss = ((v - td_target) ** 2).mean()
    # Actor: advantage = TD residual
    advantage = (td_target - v).detach()
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()
    # Optimize both
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

Tensor shapes (CartPole, batch=256):
s: [256, 4] — 4 state features
a: [256] — integer actions {0,1}
log_prob: [256] — one log-probability per step
advantage: [256] — one scalar per step (detached from graph)
actor_loss: scalar — mean across batch

Why must the advantage be .detach()ed before computing the actor loss?

• Because we don't want the actor gradient to flow through the critic's parameters: the advantage is treated as a fixed target, not a function of θ
• Because PyTorch requires it for all multiplications
• Because advantage is always negative

Chapter 2: TD Error as Advantage

The critic provides V_φ(s). How do we get an advantage estimate from it? The exact advantage requires knowing Q(s,a) = E[r + γV(s')|s,a], which we don't have directly. But we can estimate it from a single observed transition:

δ_t = r_t + γ V_φ(s_t+1) − V_φ(s_t)

This is the TD residual (also called the temporal difference error). It measures: "How much better was this step than I predicted?" If you expected -2 value at s, got reward +1, and the next state has value -1.5, then δ = 1 + 0.99×(-1.5) − (-2) = 1 − 1.485 + 2 = 1.515. Better than expected. Reinforce the action.
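
The worked example above translates directly into code. This is a small sketch; the `done` flag is an addition for illustration that drops the bootstrap term at terminal states.

```python
def td_residual(reward, value_s, value_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s), with no bootstrap after a terminal step."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

delta = td_residual(reward=1.0, value_s=-2.0, value_next=-1.5)
print(round(delta, 3))  # 1.515
```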

When is δ an unbiased estimate of A(s,a)? Only when V_φ = V^π exactly. In that case E[δ|s,a] = E[r + γV^π(s')|s,a] − V^π(s) = Q(s,a) − V(s) = A(s,a). But V_φ is never perfect — it's trained simultaneously with the policy — so δ carries bias from critic errors.

The bias-variance picture:

Signal Φ_t	Bias	Variance	Data needed
Full return G_t	Zero	High — sums all future randomness	Full episodes
TD residual δ_t	High (critic errors propagate)	Low — only one step of randomness	Single steps
k-step return	Medium	Medium	k steps
GAE (Ch 3)	Tunable	Tunable	Full episodes

The practical effect: with the TD residual, the actor gradient estimate has much lower variance than REINFORCE, so you need fewer trajectories to get a reliable gradient direction. But if the critic is poorly trained (early in learning, or with too little network capacity), the bias can mislead the actor worse than high variance would.

TD Residual vs Full Return: Sampling Distribution

Both signals estimate the same true advantage A(s,a) = −2. The TD residual (teal) has critic bias built in. The full return (orange) is unbiased but spreads wide. Click "Sample" to accumulate estimates and watch the distributions emerge.

You train a critic for 1000 steps on a fixed policy. Now the policy changes. What happens to the TD residual as an advantage estimate?

• It improves immediately since the critic is already trained
• It becomes biased: the critic estimates V for the old policy, not the new one, so δ = r + γV_old(s') − V_old(s) doesn't reflect A under the new policy
• It becomes unbiased

Chapter 3: Generalized Advantage Estimation

Schulman et al. (2015) noticed that TD residuals and full returns are endpoints of a spectrum. You can interpolate between them using a single scalar λ ∈ [0, 1]. The resulting estimator, Generalized Advantage Estimation (GAE), is now standard in PPO and most modern actor-critic methods.

Start from the k-step advantage. Using k actual rewards plus the critic's estimate of remaining value:

Â^(k)_t = −V(s_t) + r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^{k−1} r_{t+k−1} + γ^k V(s_{t+k})

At k=1 this is the TD residual δ_t. As k→∞ this is the full Monte Carlo return minus the baseline V(s_t). GAE takes an exponentially weighted average over all k:

A^GAE(λ)_t = (1−λ) ∑_{k=1}^∞ λ^{k−1} Â^(k)_t

This simplifies beautifully. Define δ_t = r_t + γV(s_t+1) − V(s_t). Then:

A^GAE(λ)_t = ∑_{l=0}^∞ (γλ)^l δ_{t+l}

This is a geometrically weighted sum of TD residuals. The (γλ)^l weight decays as we look further into the future. At l=0, the current TD residual gets weight 1. At l=10, it gets weight (γλ)¹⁰. When λ=0, only the l=0 term survives: pure TD. When λ=1, the residual at offset l gets weight γ^l, and the sum telescopes to the discounted return minus V(s_t): pure Monte Carlo.
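
A quick way to build intuition for λ is to print the weights themselves. The sketch below is illustrative only; `effective_horizon` is a hypothetical helper name for the sum of all weights, 1/(1 − γλ), which gives a rough sense of how many steps ahead the estimator looks.

```python
def gae_weights(gamma, lam, horizon):
    """Weight placed on the TD residual l steps ahead, for l = 0..horizon-1."""
    return [(gamma * lam) ** l for l in range(horizon)]

def effective_horizon(gamma, lam):
    """Sum of the infinite weight series, 1 / (1 - gamma * lam)."""
    return 1.0 / (1.0 - gamma * lam)

print(gae_weights(0.99, 0.0, 3))   # pure TD: only the first residual counts
print(gae_weights(0.99, 1.0, 3))   # Monte Carlo limit: plain discounting
print(effective_horizon(0.99, 0.95))
```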

Derive it yourself: Expand Â⁽¹⁾ = δ_t. Expand Â⁽²⁾ = δ_t + γδ_{t+1}. Expand Â⁽³⁾ = δ_t + γδ_{t+1} + γ²δ_{t+2}. The GAE formula is (1−λ)[Â⁽¹⁾ + λÂ⁽²⁾ + λ²Â⁽³⁾ + …]. Substitute and collect terms for δ_{t+l}: it appears in every Â⁽ᵏ⁾ with k ≥ l+1, so it gets weight (1−λ)γ^l[λ^l + λ^{l+1} + …] = (1−λ)γ^l · λ^l/(1−λ) = (γλ)^l. Done.
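
The identity can also be checked numerically. The sketch below compares the λ-weighted mixture of k-step advantages with the sum of discounted TD residuals on a short made-up episode. It assumes the episode terminates after the last reward, so the final value is 0, and it truncates the infinite sum over k at a large `max_k`.

```python
def td_residuals(rewards, values, gamma):
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]

def k_step_advantage(rewards, values, t, k, gamma):
    T = len(rewards)
    steps = min(k, T - t)                  # cannot look past the terminal state
    total = -values[t]
    total += sum(gamma ** i * rewards[t + i] for i in range(steps))
    total += gamma ** steps * values[t + steps]   # values[T] is 0 (terminal)
    return total

def gae_from_mixture(rewards, values, t, gamma, lam, max_k=2000):
    return (1 - lam) * sum(
        lam ** (k - 1) * k_step_advantage(rewards, values, t, k, gamma)
        for k in range(1, max_k + 1))

def gae_from_residuals(rewards, values, t, gamma, lam):
    deltas = td_residuals(rewards, values, gamma)
    return sum((gamma * lam) ** l * deltas[t + l]
               for l in range(len(rewards) - t))

rewards = [1.0, 0.0, -1.0, 2.0]
values = [0.5, 0.2, -0.3, 1.0, 0.0]    # last entry: terminal state, value 0

print(gae_from_mixture(rewards, values, 0, 0.99, 0.9))
print(gae_from_residuals(rewards, values, 0, 0.99, 0.9))
```

Both lines print the same number up to floating-point error, which is the claim of the derivation.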

Computing GAE efficiently is a single backwards pass over the trajectory. No extra network calls needed.

python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    rewards: [T]   — observed rewards
    values:  [T+1] — V(s_0)...V(s_T), last is bootstrap
    dones:   [T]   — 1.0 if episode ended
    returns: advantages [T] and targets [T]
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]          # 0 at episode boundary
        delta = rewards[t] + gamma * values[t+1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    targets = [advantages[t] + values[t] for t in range(T)]
    return advantages, targets
# One backwards scan, O(T) time and space. No extra network calls.

Why the backwards scan? GAE[t] = δ_t + (γλ) × GAE[t+1]. It's a recurrence that runs backwards in time. Start from the last step (GAE = 0 at the terminal state), work backwards accumulating the geometric sum. Each step is a single multiply-and-add.

GAE λ Dial — Bias–Variance Tradeoff

Adjust λ to see how GAE blends between one-step TD (low variance, biased by critic errors) and full Monte Carlo (unbiased, high variance). The top panel shows the (γλ)^l weights on each TD residual. The bottom panel shows a simulated sampling distribution of the resulting advantage estimate, compared to the true value A = −1.5.

You set λ = 0.95 and γ = 0.99, giving γλ = 0.9405. How much weight does the TD residual 10 steps in the future (δ_t+10) get relative to δ_t?

• Exactly the same weight (1.0)
• 0.9405 (one decay step)
• 0.9405¹⁰ ≈ 0.541, exponentially decayed over 10 steps

Chapter 4: A2C and A3C

The basic actor-critic update is sequential: collect a trajectory, update, collect another. Advantage Actor-Critic (A2C) and its asynchronous variant A3C (Mnih et al., 2016) scale this up by collecting experience from multiple parallel environments simultaneously.

A2C: Synchronous Parallel Environments

Run N copies of the environment in parallel (different random seeds, same policy parameters). After T steps in each, you have N×T transition tuples. Average the gradient over all of them before updating.

∇U(θ) ≈ (1/(N·T)) ∑_{i=1}^N ∑_{t=1}^T ∇_θ log π_θ(a_t⁽ⁱ⁾|s_t⁽ⁱ⁾) A^GAE_t⁽ⁱ⁾

Why does this reduce variance? Each environment produces an independent trajectory. Averaging N independent gradient estimates divides the variance by N. With 8 parallel environments, you get 8× lower gradient variance for roughly the same wall-clock time, assuming the environments really do run in parallel.
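
The 1/N claim is easy to check by simulation. In this sketch each "environment" returns a scalar gradient sample with Gaussian noise; the true gradient, noise level, and trial count are arbitrary choices for illustration.

```python
import random
import statistics

def averaged_estimate(rng, n_envs, true_grad=1.0, noise_std=3.0):
    """Mean of n_envs independent noisy scalar gradient samples."""
    return sum(rng.gauss(true_grad, noise_std) for _ in range(n_envs)) / n_envs

def empirical_variance(n_envs, trials=4000, seed=0):
    rng = random.Random(seed)
    return statistics.pvariance(
        [averaged_estimate(rng, n_envs) for _ in range(trials)])

var_1 = empirical_variance(1)
var_8 = empirical_variance(8)
print(var_1 / var_8)   # expected to be near 8, up to sampling noise
```

The ratio only approaches N when the samples really are independent; environments that share a seed or a correlated start state would reduce the benefit.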

Entropy bonus: A2C adds an entropy term to the actor loss: α · H(π(·|s)) where H is the policy entropy. This prevents the policy from collapsing to a near-deterministic distribution too early, which would stop exploration. Typical α = 0.01. The entropy of a Gaussian is H = ½log(2πeσ²), so maximizing entropy means maximizing σ (keeping the policy exploratory).
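
Both entropy formulas are short enough to compute by hand. The following sketch uses illustrative function names and shows that a more peaked categorical distribution, or a narrower Gaussian, has lower entropy.

```python
import math

def categorical_entropy(probs):
    """H = -sum_a p(a) log p(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gaussian_entropy(sigma):
    """Entropy of a 1D Gaussian: 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

print(categorical_entropy([0.5, 0.5]))    # log 2, the maximum for two actions
print(categorical_entropy([0.99, 0.01]))  # nearly deterministic, close to 0
print(gaussian_entropy(1.0), gaussian_entropy(0.1))
```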

The full A2C loss, minimized by gradient descent (the actor term ℓ_actor = −E[log π_θ(a|s) · Â] already carries the minus sign that turns ascent into descent):

ℓ_total = ℓ_actor + c₁ℓ_critic − c₂ H(π)

Typical coefficients: c₁ = 0.5 (value loss weight), c₂ = 0.01 (entropy bonus). Both are hyperparameters.

A3C: Asynchronous Threads

A3C runs N worker threads, each with its own copy of the environment and local gradient accumulation. Workers apply their gradient updates to a shared set of parameters asynchronously, without waiting for each other. This allows scaling to many CPU cores without synchronization overhead.

A2C (synchronous)

• All workers step together
• Gradient is averaged before update
• More reproducible training
• Standard in modern implementations
• Works well with GPU batching

A3C (asynchronous)

• Workers push gradients independently
• No synchronization → higher throughput
• Stale gradients can cause instability
• Mostly replaced by A2C + GPU
• Original: 16 threads on CPU

Why A2C won: The original A3C paper targeted multi-core CPUs. On a GPU, batching N environments and computing one big gradient is typically faster than N separate threads each computing small gradients, and it avoids stale updates. Modern practice is A2C with vectorized environments (e.g., gym.vector.SyncVectorEnv).

python
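# A2C with N parallel environments (sketch, not a complete script).
# Assumes the classic Gym vector API: reset() returns obs and step() returns
# (obs, rewards, dones, infos). make_env, actor, critic, optimizer, and
# compute_gae_batch are supplied by the user.
envs = gym.vector.SyncVectorEnv([make_env for _ in range(N)])  # N=8
obs = torch.as_tensor(envs.reset(), dtype=torch.float32)       # [N, obs_dim]

for update in range(num_updates):
    # Collect T steps from all N envs, keeping every step's tensors
    log_probs, values, entropies, rewards, dones = [], [], [], [], []
    for _ in range(T):
        dist = actor(obs)                        # batch of N distributions
        actions = dist.sample()                  # [N]
        log_probs.append(dist.log_prob(actions)) # [N]
        entropies.append(dist.entropy())         # [N]
        values.append(critic(obs))               # [N]
        obs_next, r, d, _ = envs.step(actions.numpy())
        rewards.append(torch.as_tensor(r, dtype=torch.float32))
        dones.append(torch.as_tensor(d, dtype=torch.float32))
        obs = torch.as_tensor(obs_next, dtype=torch.float32)

    # Bootstrap value for last state
    with torch.no_grad():
        last_value = critic(obs)                 # [N]

    # Stack per-step lists into [T, N] tensors
    log_probs, values, entropies = map(torch.stack, (log_probs, values, entropies))
    rewards, dones = torch.stack(rewards), torch.stack(dones)

    # Compute GAE advantages for each env independently: [T, N]
    advantages = compute_gae_batch(rewards, values.detach(), dones, last_value, gamma, lam)
    returns = advantages + values.detach()       # critic regression targets

    # Actor + critic + entropy loss, averaged over all T*N transitions
    actor_loss = -(log_probs * advantages.detach()).mean()
    critic_loss = 0.5 * (values - returns).pow(2).mean()
    entropy_loss = -entropies.mean()
    loss = actor_loss + 0.5 * critic_loss + 0.01 * entropy_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()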
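
The loop above calls a batched GAE helper that the text does not define. One possible shape for it is sketched below under a different, hypothetical name (`gae_time_major`). It works on plain nested lists indexed as [time][env] rather than tensors, so it runs without PyTorch; a tensor version would follow the same recurrence.

```python
def gae_time_major(rewards, values, dones, last_values, gamma=0.99, lam=0.95):
    """
    rewards, values, dones: lists of length T, each a list of N floats
    last_values: list of N bootstrap values for the state after step T-1
    returns advantages as a [T][N] nested list
    """
    T, N = len(rewards), len(last_values)
    advantages = [[0.0] * N for _ in range(T)]
    for i in range(N):                        # each environment independently
        gae = 0.0
        next_value = last_values[i]
        for t in reversed(range(T)):
            mask = 1.0 - dones[t][i]          # 0 where the episode ended at step t
            delta = rewards[t][i] + gamma * next_value * mask - values[t][i]
            gae = delta + gamma * lam * mask * gae
            advantages[t][i] = gae
            next_value = values[t][i]
    return advantages

# Two envs, two steps; env 1 terminates at step 0, so nothing leaks across the boundary.
adv = gae_time_major(
    rewards=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 0.0], [0.0, 0.0]],
    dones=[[0.0, 1.0], [0.0, 0.0]],
    last_values=[0.0, 0.0],
    gamma=1.0, lam=1.0)
print(adv)  # [[2.0, 1.0], [1.0, 1.0]]
```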

In A2C with N=8 parallel environments, by approximately what factor does gradient variance decrease compared to a single environment (same number of steps per environment)?

• No change: same data, same variance
• Factor of 8: averaging 8 independent estimates divides the variance by N
• Factor of 64: variance drops as 1/N²

Chapter 5: Deterministic Policy Gradient

Everything so far uses stochastic policies: π_θ(a|s) is a distribution. This works for discrete actions and low-dimensional continuous actions. But for high-dimensional continuous control (a robot arm with 7 joints, say) the stochastic policy gradient has to average over actions as well as states, so its sample-based estimate can need many more samples and become high-variance.

Silver et al. (2014) proved a deterministic policy gradient theorem. If the policy is deterministic — μ_θ(s) outputs a single action vector, not a distribution — then:

∇_θJ(θ) = E_{s ~ ρ^μ}[∇_a Q^μ(s,a)|_{a=μ_θ(s)} · ∇_θ μ_θ(s)]

Read it as two chain rule factors: ∇_aQ answers "if I perturb the action slightly, how does value change?" and ∇_θμ answers "if I perturb the parameters, how does the action change?" The actor gradient is their product. No log-probability needed.

What this requires: A differentiable Q-function (a neural network). Continuous actions (so we can differentiate w.r.t. action). An off-policy data collection scheme (since the deterministic policy itself doesn't explore).
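
The chain-rule structure can be verified on a toy problem where everything is available in closed form. The sketch below assumes a hand-written critic Q(s,a) = −(a + s)², whose best action is a = −s, and a linear actor μ_θ(s) = θ·s; both are illustrative choices, not part of DPG itself. It compares the product of the two factors with a finite-difference gradient.

```python
def q_value(s, a):
    """Toy critic: the best action in state s is a = -s."""
    return -(a + s) ** 2

def dq_da(s, a):
    return -2.0 * (a + s)

def mu(theta, s):
    return theta * s

def dmu_dtheta(theta, s):
    return s

def dpg_gradient(theta, states):
    """Average of dQ/da (at a = mu(s)) times dmu/dtheta over sampled states."""
    return sum(dq_da(s, mu(theta, s)) * dmu_dtheta(theta, s)
               for s in states) / len(states)

def finite_difference(theta, states, eps=1e-6):
    def objective(th):
        return sum(q_value(s, mu(th, s)) for s in states) / len(states)
    return (objective(theta + eps) - objective(theta - eps)) / (2 * eps)

states = [-2.0, -0.5, 1.0, 3.0]
print(dpg_gradient(0.3, states), finite_difference(0.3, states))
print(dpg_gradient(-1.0, states))   # zero at the optimal slope
```

At θ = 0.3 the gradient is negative, so gradient ascent moves θ toward −1, where the gradient vanishes.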

DDPG: DPG + Deep Networks

Lillicrap et al. (2015) combined DPG with two stability tricks from DQN (experience replay and target networks) plus an exploration-noise process to get DDPG (Deep Deterministic Policy Gradient):

Experience Replay

Store transitions (s,a,r,s') in a buffer of size 10⁶. Train on random mini-batches. Breaks temporal correlation that causes divergence.

Target Networks

Keep slow copies μ' and Q' of both networks. TD target = r + γQ'(s', μ'(s')). Polyak update: θ' ← τθ + (1−τ)θ' with small τ (0.005 is a common choice; the original paper used 0.001). Prevents oscillation.

Exploration Noise

Add Ornstein-Uhlenbeck noise to the action: a = μ(s) + ε, ε ~ OU process. Temporally correlated noise was chosen for physical control with inertia, although later work (TD3) reported that uncorrelated Gaussian noise performs comparably.
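
For a sense of what "temporally correlated" means here, the sketch below generates a simple Euler-discretized OU process and compares its lag-1 autocorrelation with that of i.i.d. Gaussian noise. The parameter values (θ = 0.15, σ = 0.2, dt = 1) are assumptions picked for illustration.

```python
import random

def ou_noise(steps, theta=0.15, sigma=0.2, dt=1.0, seed=0):
    """Euler-discretized Ornstein-Uhlenbeck process reverting to zero."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(steps):
        x += -theta * x * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def iid_noise(steps, sigma=0.2, seed=1):
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(steps)]

def lag1_autocorrelation(xs):
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den

print(lag1_autocorrelation(ou_noise(5000)))    # high: each sample stays near the last
print(lag1_autocorrelation(iid_noise(5000)))   # near zero: samples are independent
```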

python
# DDPG actor update
def update_actor(actor, critic, opt_actor, states):
    # Gradient ascent on Q(s, mu(s)) w.r.t. actor params
    actions = actor(states)                    # [B, act_dim]
    q_values = critic(states, actions)         # [B]  — differentiable!
    actor_loss = -q_values.mean()              # maximize Q
    opt_actor.zero_grad()
    actor_loss.backward()                      # flows through critic
    opt_actor.step()
    # Note: critic parameters are NOT updated here (freeze them)
    # Only actor parameters receive gradient through this path

# DDPG critic update (standard Bellman)
def update_critic(critic, actor_target, critic_target, opt_critic, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = actor_target(s_next)
        q_next = critic_target(s_next, a_next)
        # No bootstrap at terminal states; works for bool or float done flags
        td_target = r + gamma * q_next * (1.0 - done.float())
    q = critic(s, a)
    critic_loss = nn.functional.mse_loss(q, td_target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# Polyak update of a target network; call once for the critic pair
# and once for the actor pair after each training step
@torch.no_grad()
def polyak_update(net, target_net, tau=0.005):
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.copy_(tau * param + (1 - tau) * target_param)

Tensor flow in DDPG actor update:
states [B, obs_dim] → actor → actions [B, act_dim]
(states, actions) → critic → Q [B]
actor_loss = −mean(Q) — gradient flows back through critic to actor
critic.parameters() receive grad but opt_actor only updates actor.parameters()

Deterministic Policy + Exploration Noise

The deterministic policy (blue line) maps state to action with no randomness. Exploration noise (orange dots) covers the action space for experience collection. The OU process (red dots) produces temporally correlated noise — it wanders instead of jumping.

DDPG's actor update computes −Q(s, μ(s)).mean() and backpropagates through the critic. Which parameters does the optimizer update?

• Only critic parameters: Q is the critic's output
• Both actor and critic parameters
• Only actor parameters: the gradient flows through the critic as a fixed function, but only actor.parameters() are in opt_actor

Chapter 6: Soft Actor-Critic

DDPG is sample-efficient but brittle: hyperparameter-sensitive, prone to Q-value overestimation, and reliant on hand-tuned noise for exploration because its policy is deterministic. Soft Actor-Critic (SAC) (Haarnoja et al., 2018) addresses all three, chiefly by changing the objective.

Instead of maximizing expected return alone, SAC maximizes entropy-augmented return:

J(π) = E[∑_{t=0}^∞ γ^t (r_t + α H(π(·|s_t)))]

α is the temperature parameter: it controls how much entropy matters. High α → very stochastic policy (exploration), low α → nearly deterministic (exploitation). The entropy term H(π(·|s)) = −E[log π(a|s)] rewards distributional spread.

Why entropy regularization works: The standard RL objective finds a single best action per state. The entropy-augmented objective finds a distribution over good actions. This means:
• Exploration is built in — high entropy keeps the policy exploratory without separate noise
• Multiple optima — if two actions are equally good, the policy learns to use both
• Robustness — a spread policy is harder to adversarially exploit

The soft Bellman equations for a policy π become (they hold with Q^*, V^* when π is the optimal soft policy):

Q^π(s,a) = r(s,a) + γ E_{s'}[V^π(s')]

V^π(s) = E_{a ~ π}[Q^π(s,a) − α log π(a|s)]
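
The role of the temperature is easiest to see with a handful of discrete actions (an assumption made for illustration; SAC itself targets continuous actions). For fixed Q-values, the policy that maximizes E[Q] + αH is the Boltzmann distribution π(a) ∝ exp(Q(a)/α), and its soft value equals α·log ∑ exp(Q/α). The sketch checks both facts.

```python
import math

def boltzmann_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def soft_value(q_values, probs, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)]."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(probs, q_values) if p > 0.0)

def log_sum_exp_value(q_values, alpha):
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

q = [1.0, 1.0, 0.0]                    # two equally good actions, one worse
for alpha in (5.0, 0.5, 0.1):
    pi = boltzmann_policy(q, alpha)
    print(alpha, [round(p, 3) for p in pi], round(entropy(pi), 3))
```

As α shrinks, the policy concentrates on the two tied best actions but keeps splitting probability evenly between them, which is the "multiple optima" behaviour described above.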

SAC uses the reparameterization trick to backpropagate through the stochastic action: instead of sampling a ~ π(·|s), sample ε ~ N(0,I) and compute a = μ(s) + σ(s)⊙ε. Now a is a differentiable function of the parameters.
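
A scalar example shows why this matters. For f(a) = a² and a = μ + σ·ε, the expectation E[f(a)] = μ² + σ² has gradients 2μ and 2σ. The sketch below recovers both by averaging pathwise derivatives over samples of ε; the test function and sample count are arbitrary illustrative choices.

```python
import random

def reparam_gradients(mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo d/dmu and d/dsigma of E[a^2] with a = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        a = mu + sigma * eps          # differentiable in mu and sigma
        df_da = 2.0 * a               # derivative of f(a) = a^2
        grad_mu += df_da * 1.0        # da/dmu = 1
        grad_sigma += df_da * eps     # da/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
print(g_mu, g_sigma)   # close to 2*mu = 3.0 and 2*sigma = 1.0
```

The randomness lives entirely in ε, so the derivative passes straight through the sampled action to the parameters.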

SAC networks:

• Actor: π_θ(a|s) — Gaussian with learned mean μ_θ(s) and diagonal covariance σ_θ(s)
• Two critics: Q_φ1, Q_φ2 (Clipped Double-Q)
• Two target critics: Q'_φ1, Q'_φ2

Clipped Double-Q:

Use min(Q_φ1, Q_φ2) in the Bellman target. Two independent critics with different initializations disagree about Q-values. Taking the minimum is pessimistic — it counteracts the systematic overestimation that causes instability in DDPG.
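
In scalar form, the clipped target for one transition can be sketched as follows. This follows the form used in common SAC implementations without a separate value network; the argument names are illustrative, and `log_prob_next` is the log-probability of an action a' freshly sampled from the current policy at s'.

```python
def sac_target(reward, done, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """
    y = r + gamma * (1 - done) * (min(Q1', Q2')(s', a') - alpha * log pi(a'|s'))
    where Q1', Q2' are the target critics.
    """
    soft_value_next = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_value_next

# The two target critics disagree (5.0 vs 4.0); the pessimistic 4.0 is used.
y = sac_target(reward=1.0, done=0.0, q1_next=5.0, q2_next=4.0,
               log_prob_next=-1.0, alpha=0.2, gamma=0.5)
print(y)
```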

python
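import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

# SAC actor (squashed Gaussian)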
class SACPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim,256),nn.ReLU(),nn.Linear(256,256),nn.ReLU())
        self.mean_layer = nn.Linear(256, act_dim)
        self.log_std_layer = nn.Linear(256, act_dim)

    def forward(self, s):
        h = self.backbone(s)
        mean = self.mean_layer(h)
        log_std = self.log_std_layer(h).clamp(-20, 2)  # stability
        std = log_std.exp()
        # Reparameterization: a = mean + std * eps
        eps = torch.randn_like(mean)
        a_raw = mean + std * eps                        # differentiable!
        a = torch.tanh(a_raw)                           # squash to (-1,1)
        # Log-prob with change of variables for tanh squashing
        log_prob = Normal(mean,std).log_prob(a_raw).sum(-1)
        log_prob -= (2*(math.log(2) - a_raw - F.softplus(-2*a_raw))).sum(-1)
        return a, log_prob  # [B,act_dim], [B]

# SAC actor update: maximize E[Q] + alpha*H
def update_actor_sac(actor, q1, q2, opt_actor, states, alpha):
    a, log_prob = actor(states)                         # [B,act_dim], [B]
    q_min = torch.min(q1(states,a), q2(states,a))      # [B]
    actor_loss = (alpha * log_prob - q_min).mean()      # minimize this
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

Automatic temperature tuning: The follow-up SAC paper (often called SAC-v2) treats α as a Lagrange multiplier. Define a target entropy H̄ (e.g., −act_dim for continuous actions). Minimize the dual objective: ℓ(α) = −α · E[log π(a|s) + H̄]. This raises α when policy entropy falls below the target and lowers it when entropy is above, so tuning α by hand is replaced by choosing H̄.
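
One gradient step on this dual objective can be sketched in a few lines. Optimizing log α rather than α, so that α stays positive, is an implementation assumption here and not something the text specifies; the learning rate is arbitrary.

```python
import math

def temperature_step(log_alpha, mean_log_prob, target_entropy, lr=0.01):
    """One descent step on l(alpha) = -alpha * (E[log pi] + target_entropy),
    taken with respect to log(alpha)."""
    alpha = math.exp(log_alpha)
    grad = -alpha * (mean_log_prob + target_entropy)   # d l / d log_alpha
    return log_alpha - lr * grad

target = -2.0                                       # e.g. -act_dim with act_dim = 2
low_entropy = temperature_step(0.0, 3.0, target)    # entropy -3.0 is below target
high_entropy = temperature_step(0.0, 1.0, target)   # entropy -1.0 is above target
print(low_entropy, high_entropy)
```

The first call increases log α (more weight on entropy), the second decreases it.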

SAC uses two Q-networks and takes min(Q1, Q2) in the Bellman target. What problem does this solve?

• It doubles computation for no benefit
• It handles discrete actions
• It counteracts systematic Q-value overestimation: two networks with different initializations independently tend to overestimate; taking the minimum is pessimistic and more accurate

Chapter 7: Actor-Critic with MCTS (AlphaZero)

For games with discrete, finite action spaces and a perfect simulator (chess, Go, shogi), a powerful alternative exists: use MCTS as a policy improvement operator. The actor and critic guide the search; the search results train the actor and critic. This is the AlphaZero architecture.

The key insight: MCTS is a form of lookahead. Running 800 simulations from a state explores the game tree and produces a much better action distribution than the raw policy alone. If we treat this improved distribution as a training target, we can distill the lookahead into the policy network — which then starts from a better place next time, enabling deeper lookahead.

Network Architecture

A single network f_θ(s) = (p, v) outputs two heads:

Policy head p = π_θ(a|s)

Prior probability over all legal moves. Stored as the prior P(s,a) on each newly expanded edge in MCTS, it biases action selection toward promising moves before they have been searched.

Value head v = U_θ(s)

Estimated expected game outcome from state s, a scalar in [−1, +1]. Replaces rollouts in MCTS leaf evaluation, which is much faster than simulating to game end.

MCTS with Neural Guidance

Each simulation traverses the game tree by selecting actions with:

a* = argmax_a [ Q(s,a) + c_puct · p(a|s) · √N(s) / (1 + N(s,a)) ]

N(s,a) is the visit count of action a from s. The second term is an exploration bonus: high when a has been visited rarely (N(s,a) small) or the prior is high (p(a|s) large). c_puct ≈ 1 to 5 controls exploration vs. exploitation inside the tree.

Why visit counts? Frequent visits drive Q toward the true value (averaging reduces noise). The √N(s)/(1+N(s,a)) term keeps growing for actions that are rarely visited, so any action with nonzero prior eventually gets tried. This is PUCT, a prior-weighted variant of the UCB (Upper Confidence Bound) rule from multi-armed bandits, applied to tree search.
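
The selection rule is compact enough to write out directly. In this sketch N(s) is taken as the sum of the child visit counts, and the Q-values, priors, and counts are made-up numbers for a node with three moves.

```python
import math

def puct_select(q, prior, visits, c_puct=1.5):
    """Index of the action maximizing Q + c_puct * P * sqrt(N(s)) / (1 + N(s,a))."""
    total_visits = sum(visits)                      # N(s) = sum_b N(s, b)
    scores = [q[a] + c_puct * prior[a] * math.sqrt(total_visits) / (1 + visits[a])
              for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])

q = [0.1, 0.0, 0.0]
prior = [0.2, 0.7, 0.1]
print(puct_select(q, prior, visits=[10, 1, 0]))    # 1: high prior, barely explored
print(puct_select(q, prior, visits=[10, 200, 0]))  # 2: the unvisited move's bonus now wins
```

Early on the prior dominates; once a move has absorbed many visits, its bonus shrinks and attention shifts to neglected moves.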

Training Loop

Self-Play

The current network plays against itself. At each move, run 800 MCTS simulations. Record (s, π_MCTS, z) where z ∈ {+1, 0, −1} is the game outcome (win, draw, loss) from the perspective of the player to move.

↓

Replay Buffer

Store the last 500k self-play positions. Sample random mini-batches for training (breaks correlation between consecutive positions).

↓

Network Update

Minimize ℓ = (v − z)² − π_MCTS · log p + λ||θ||². Value head learns to predict game outcomes; policy head learns to mimic MCTS's improved distribution.

↻ repeat

The policy head's loss is cross-entropy against the MCTS visit distribution π_MCTS(a|s) = N(s,a)^{1/τ} / ∑_b N(s,b)^{1/τ}. Temperature τ → 0 makes this the greedy most-visited action; τ = 1 is proportional to counts. AlphaZero uses τ = 1 for the first 30 moves (exploration), then τ → 0 (exploitation).
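
The temperature formula and the cross-entropy target can be sketched as below. Treating τ = 0 as an explicit argmax is an implementation choice made here to avoid dividing by zero; the visit counts are invented for illustration.

```python
import math

def mcts_policy(visit_counts, tau=1.0):
    """pi(a) proportional to N(s,a)^(1/tau); tau = 0 means pick the most-visited move."""
    if tau == 0.0:
        best = max(range(len(visit_counts)), key=lambda a: visit_counts[a])
        return [1.0 if a == best else 0.0 for a in range(len(visit_counts))]
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]

def policy_cross_entropy(target, predicted, eps=1e-12):
    """-sum_a target(a) * log predicted(a), the policy-head loss term."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

counts = [10, 30, 60]
target = mcts_policy(counts, tau=1.0)
print(target)                          # proportional to the counts
print(mcts_policy(counts, tau=0.1))    # sharply peaked on the most-visited move
print(policy_cross_entropy(target, [1 / 3, 1 / 3, 1 / 3]))
```

The loss is smallest when the network's prior equals the visit distribution, which is what pulls the policy head toward the search result.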

What makes AlphaZero work:
• Tabula rasa — starts from random play, no human games needed (unlike AlphaGo)
• MCTS + neural network together > either alone (supported by ablation studies)
• The value network replaces rollouts; the policy network replaces random selection
• Self-play curriculum: as the network improves, its opponents (previous versions) also improve

In AlphaZero, what does the policy head p(a|s) contribute to MCTS, and what does MCTS contribute back to the policy?

• p(a|s) serves as a prior in the UCB formula, biasing MCTS toward promising moves. MCTS then produces an improved visit-count distribution π_MCTS that becomes the training target for the policy head.
• p(a|s) replaces MCTS entirely when it's accurate enough
• MCTS trains the value head and ignores the policy head

Chapter 8: Actor-Critic in Action

Let's watch an actor-critic method learn on a simple 1D regulator: the state is a scalar s, and the goal is to drive s to zero. The optimal policy is μ(s) = −s (negative feedback). The optimal value function is a negative quadratic — further from zero is worse.

We parameterize: actor as μ_θ(s) = θ₁·s with log-variance θ₂. Critic as V_φ(s) = φ₁·s + φ₂·s². Watch θ₁ → −1 (optimal slope) and φ₂ → negative (value is lowest far from zero).
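
The text does not spell out the regulator's dynamics or reward, so the sketch below assumes the simplest setup consistent with it: deterministic dynamics s' = s + a and reward −s² at each step. Under those assumptions a linear policy a = θ₁·s can be evaluated by a plain rollout, and a small grid of slopes shows where the return peaks.

```python
def rollout_return(theta1, s0, gamma=0.99, horizon=200):
    """Discounted return of the policy a = theta1 * s from start state s0."""
    s, total = s0, 0.0
    for t in range(horizon):
        total += gamma ** t * (-(s ** 2))   # assumed reward: -s^2
        s = s + theta1 * s                  # assumed dynamics: s' = s + a
    return total

slopes = [-1.5, -1.25, -1.0, -0.75, -0.5]
best = max(slopes, key=lambda th: rollout_return(th, s0=2.0))
print(best)                        # -1.0
print(rollout_return(-1.0, 2.0))   # -4.0, i.e. V(s) = -s^2 under these assumptions
```

With θ₁ = −1 the state reaches zero after one step, so the only cost paid is the initial −s², which matches a negative-quadratic value function.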

Actor-Critic Learning — 1D Regulator

Top: Actor policy mean (blue) vs. optimal (dashed). Bottom: Critic value estimate (green) vs. optimal (dashed). Press Train to run the update loop. Try large α_actor to see instability.

Watch for: When α_actor is too large relative to α_critic, the actor outruns the critic. The policy changes before the critic can track it, so the advantage estimates become stale and the actor oscillates. The fix: use a smaller actor learning rate (or more critic updates per actor update).

In the 1D regulator, what is the optimal policy parameter θ₁ (the slope of μ(s) = θ₁·s)?

• θ₁ = +1 (push state away from zero)
• θ₁ ≈ −1 (negative feedback drives state to zero)
• θ₁ = 0 (take no action)

Chapter 9: Connections and What's Next

Actor-critic methods are the backbone of modern deep RL. The table below shows how the algorithms in this chapter relate to each other and to broader families.

Algorithm	Actor	Critic	Advantage	Key trick
Basic AC	Stochastic π_θ	V_φ(s)	TD residual δ	Baseline variance reduction
A2C/A3C	Stochastic π_θ	V_φ(s)	GAE(λ)	Parallel envs + entropy bonus
PPO	Stochastic π_θ	V_φ(s)	GAE(λ)	Clipped surrogate (Ch 12)
DDPG	Deterministic μ_θ	Q_φ(s,a)	Chain rule through Q	Replay + target networks
TD3	Deterministic μ_θ	2×Q_φ	Clipped Double-Q	Delayed actor updates + target noise
SAC	Stochastic (reparam)	2×Q_φ	Q − αlogπ	Entropy regularization + auto-α
AlphaZero	π_θ prior	v_θ(s)	MCTS visit counts	Self-play + lookahead distillation

Limitations

When actor-critic struggles:
• Sparse rewards: If rewards only come at episode end (e.g., winning a game), the TD signal is nearly zero everywhere. The critic can't bootstrap usefully.
• Non-stationary targets: Both actor and critic change simultaneously. The critic's training target (which depends on the actor) is always moving. This can cause divergence with function approximation.
• Exploration in sparse environments: A stochastic policy with Gaussian noise may never discover the rewarding region. Requires auxiliary exploration (curiosity, RND, HER).

What's Next

From this chapter:

• GAE + PPO clip = a robust general-purpose default
• SAC = a strong default for continuous control benchmarks
• AlphaZero = the reference approach for two-player perfect-information games
• TD3 = DDPG with stability fixes (often a better default)

Open problems:

• Model-based AC (Dreamer, MBPO): learn a world model, train AC inside it
• Multi-agent AC (Ch 27): each agent has its own actor, shared or separate critics
• Offline RL: train AC from a fixed dataset with no environment interaction
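• RLHF: the reward comes from a model trained on human preference data, and an actor-critic method (typically PPO with a learned value critic) fine-tunes the language model against it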

The central lesson of actor-critic methods: Any learning signal that reduces variance while keeping bias manageable is worth building. The critic is just a learned function that does this for the policy gradient. The GAE parameter λ lets you choose your bias-variance tradeoff explicitly. The rest is engineering — replay buffers, target networks, entropy bonuses — to keep the coupled optimization stable.

"The actor-critic is the right framework for almost any RL problem. The question is which variant."
— paraphrased from Sutton & Barto, Reinforcement Learning (2nd ed.)

SAC maximizes entropy-regularized return. What does this mean for the policy at convergence, compared to a standard actor-critic?

• It converges to the same deterministic policy but faster
• It always outputs maximum entropy (uniform) regardless of state
• It learns a distribution that covers all near-optimal actions, staying stochastic where multiple actions are similarly good, rather than collapsing to a single deterministic choice