Two networks, one goal: the actor decides, the critic evaluates, and together they learn faster than either could alone.
You're training to play tennis. After every rally, your coach tells you your score for that rally: 42 points. Was that good? You have no idea. It depends entirely on how the rally ended, whether you were at the net or baseline, whether the opponent was weak or strong. The number alone tells you almost nothing about what specific swing caused the result.
This is the problem with pure policy gradient methods like REINFORCE. The learning signal is the full trajectory return — a single number summarizing everything that happened. Noisy. Delayed. High variance. You adjust your swing based on a signal that may take 20 steps to arrive and reflects many random events beyond your control.
What if your coach, instead of waiting for the rally to end, whispered after each shot: "That was 0.3 points better than I expected." That's immediate, action-specific feedback. Much lower variance. That whisper is the advantage signal, and the coach is the critic.
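Actor-critic methods split the learning into two coupled problems: the actor improves the policy πθ using the critic's evaluations, and the critic learns to predict the value of the states that the actor's current policy visits.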
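Start from the policy gradient theorem (Ch 11). The gradient of expected utility under policy πθ is:

$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Phi_t\right]$$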
where Φt is the learning signal at step t. REINFORCE uses Φt = Gt (the full return from t onward). Actor-critic replaces this with the advantage A(st, at) = Q(st, at) − V(st), estimated by the critic.
The advantage answers: "How much better (or worse) was the action I took compared to what I'd expect on average in this state?" Positive advantage → reinforce this action. Negative advantage → discourage it.
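To make the definition concrete, here is a minimal sketch with made-up numbers (the Q-values and policy probabilities below are illustrative assumptions, not taken from any environment). Since V(s) is the policy-weighted average of Q(s, a), the advantages average to zero under the policy.

```python
# Hypothetical Q-values and policy probabilities for one state with 3 actions.
q_values = [1.0, 2.0, 4.0]          # Q(s, a) for a = 0, 1, 2
policy = [0.5, 0.25, 0.25]          # pi(a | s), sums to 1

# V(s) = sum_a pi(a|s) * Q(s, a): what the policy earns on average from s.
v = sum(p * q for p, q in zip(policy, q_values))

# A(s, a) = Q(s, a) - V(s): how much better each action is than average.
advantages = [q - v for q in q_values]

# Under the policy, advantages average to zero by construction.
mean_advantage = sum(p * a for p, a in zip(policy, advantages))

print(v)            # 2.0
print(advantages)   # [-1.0, 0.0, 2.0]
```

Action 2 would be reinforced, action 0 discouraged, and action 1 left alone.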
In practice we parameterize two networks:
Actor: πθ(a|s)
Input: state s ∈ ℝⁿ
Output: action distribution (Categorical for discrete, Gaussian for continuous)
Loss: −E[log πθ(a|s) · A(s,a)] (minimize, which maximizes expected utility)
Update: ascent on ∇θU
Critic: Vφ(s) or Qφ(s,a)
Input: state s (and action a for Q)
Output: scalar value estimate ∈ ℝ
Loss: ½ E[(Vφ(s) − Gt)²] (minimize)
Update: descent on ∇φℓ
The update loop runs as follows. Collect m trajectories with the current policy. For each step (st, at, rt, st+1), compute the TD residual δt = rt + γVφ(st+1) − Vφ(st). Use δt as the advantage estimate. Gradient ascent on the actor, gradient descent on the critic.
```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),  # logits for discrete actions
        )

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))


class Critic(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),  # scalar V(s)
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)


# One update step
def update(actor, critic, opt_a, opt_c, batch, gamma=0.99):
    s, a, r, s_next, done = batch  # done: bool tensor, True where the episode ended

    # Critic: TD target (no gradient flows through the bootstrap value)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (~done).float()
    v = critic(s)
    critic_loss = ((v - td_target) ** 2).mean()

    # Actor: advantage = TD residual, detached so the actor loss
    # treats it as a constant and sends no gradient into the critic
    advantage = (td_target - v).detach()
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()

    # Optimize both
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
```
The critic provides Vφ(s). How do we get an advantage estimate from it? The exact advantage requires knowing Q(s,a) = E[r + γV(s')|s,a], which we don't have directly. But we can estimate it from a single observed transition:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
This is the TD residual (also called the temporal difference error). It measures: "How much better was this step than I predicted?" If you expected -2 value at s, got reward +1, and the next state has value -1.5, then δ = 1 + 0.99×(-1.5) − (-2) = 1 − 1.485 + 2 = 1.515. Better than expected. Reinforce the action.
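The arithmetic above can be reproduced with a small helper. This is a minimal sketch; the function name `td_residual` and the `done` flag (which drops the bootstrap term at terminal states) are illustrative assumptions.

```python
def td_residual(reward, v_s, v_next, gamma=0.99, done=False):
    """One-step TD residual: r + gamma * V(s') - V(s), with no bootstrap at terminal states."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


# The worked example from the text: V(s) = -2, r = +1, V(s') = -1.5.
delta = td_residual(reward=1.0, v_s=-2.0, v_next=-1.5)
print(round(delta, 3))  # 1.515
```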
The bias-variance picture:
| Signal Φt | Bias | Variance | Data needed |
|---|---|---|---|
| Full return Gt | Zero | High — sums all future randomness | Full episodes |
| TD residual δt | High (critic errors propagate) | Low — only one step of randomness | Single steps |
| k-step return | Medium | Medium | k steps |
| GAE (Ch 3) | Tunable | Tunable | Full episodes |
The practical effect: with the TD residual, the actor gradient estimate has much lower variance than REINFORCE, so you need fewer trajectories to get a reliable gradient direction. But if the critic is poorly trained (early in learning, or with too little network capacity), the bias can mislead the actor worse than high variance would.
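The trade-off in the table can be seen in a toy simulation. This is only a sketch under strong assumptions: every reward is pure N(0, 1) noise, so the true advantage is exactly 0, and the critic is deliberately given a wrong value at the start state. None of these numbers come from a real task.

```python
import random
import statistics

random.seed(0)
GAMMA, HORIZON, TRIALS = 0.99, 20, 5000

# Assumed toy process: rewards are N(0, 1) noise, so every true value is 0.
# The critic is deliberately wrong at the start state: V(s_0) = 0.3, V(s_1) = 0.
V_S0, V_S1 = 0.3, 0.0

mc_estimates, td_estimates = [], []
for _ in range(TRIALS):
    rewards = [random.gauss(0.0, 1.0) for _ in range(HORIZON)]
    full_return = sum(GAMMA**t * r for t, r in enumerate(rewards))
    mc_estimates.append(full_return)                       # Phi = G_t, no critic involved
    td_estimates.append(rewards[0] + GAMMA * V_S1 - V_S0)  # Phi = delta_t

mc_mean, mc_var = statistics.mean(mc_estimates), statistics.variance(mc_estimates)
td_mean, td_var = statistics.mean(td_estimates), statistics.variance(td_estimates)
print(f"full return: mean={mc_mean:+.2f} var={mc_var:.2f}")
print(f"TD residual: mean={td_mean:+.2f} var={td_var:.2f}")
```

The full return should come out centered near the true value with a large variance, while the TD residual should sit near the critic's error of −0.3 with a variance close to 1.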
Figure (interactive in the original): sampling distributions of two estimators of the same true advantage A(s,a) = −2. The TD residual (teal) has critic bias built in. The full return (orange) is unbiased but spreads wide.
Schulman et al. (2015) observed that TD residuals and full returns are endpoints of a spectrum. You can interpolate between them using a single scalar λ ∈ [0, 1]. The resulting estimator, Generalized Advantage Estimation (GAE), is now standard in PPO and most modern actor-critic methods.
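Start from the k-step advantage. Using k actual rewards plus the critic's estimate of remaining value:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k}) - V(s_t)$$

At k=1 this is the TD residual δt. As k→∞ it becomes the full Monte Carlo return minus the baseline V(st). GAE takes an exponentially weighted average over all k:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}_t^{(k)}$$

This simplifies beautifully. Define δt = rt + γV(st+1) − V(st). Then:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

This is a geometrically weighted sum of TD residuals. The weight (γλ)^l decays as we look further into the future: the current residual δt (l=0) gets weight 1, and the residual ten steps ahead (l=10) gets weight (γλ)^10. When λ=0, only the l=0 term survives: pure TD. When λ=1, the residual at offset l gets weight γ^l, and the sum telescopes to the Monte Carlo return minus V(st): pure Monte Carlo.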
Computing GAE efficiently is a single backwards pass over the trajectory. No extra network calls needed.
```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    rewards: [T]   observed rewards
    values:  [T+1] V(s_0)...V(s_T), last is bootstrap
    dones:   [T]   1.0 if episode ended
    returns: advantages [T] and targets [T]
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]  # 0 at episode boundary
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    targets = [advantages[t] + values[t] for t in range(T)]
    return advantages, targets

# One backwards scan, O(T) time and space. No extra network calls.
```
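As a sanity check on the two limits, the sketch below re-implements the same backward recursion for a single episode without termination (the helper name `gae` and all reward and value numbers are made up for illustration) and compares λ=0 and λ=1 against the one-step TD residual and the bootstrapped discounted return minus V(s_0).

```python
def gae(rewards, values, gamma, lam):
    """Backward GAE recursion for one rollout with no episode boundary inside it.
    values has one more entry than rewards (the last is the bootstrap value)."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]   # made-up critic outputs; last entry is the bootstrap
gamma = 0.9

# lambda = 0: the advantage at t = 0 is just the one-step TD residual.
td0 = rewards[0] + gamma * values[1] - values[0]
# lambda = 1: the residuals telescope to (discounted return + bootstrap) - V(s_0).
mc0 = sum(gamma**t * r for t, r in enumerate(rewards)) + gamma**3 * values[3] - values[0]

adv_lam0 = gae(rewards, values, gamma, lam=0.0)[0]
adv_lam1 = gae(rewards, values, gamma, lam=1.0)[0]
print(round(adv_lam0, 4), round(adv_lam1, 4))  # 0.86 2.12
```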
Figure (interactive in the original): GAE blending between one-step TD (low variance, biased by critic errors) and full Monte Carlo (unbiased, high variance) as λ varies. The top panel shows the (γλ)^l weights on each TD residual. The bottom panel shows a simulated sampling distribution of the resulting advantage estimate, compared to the true value A = −1.5.
The basic actor-critic update is sequential: collect a trajectory, update, collect another. Advantage Actor-Critic (A2C) and its asynchronous variant A3C (Mnih et al., 2016) scale this up by collecting experience from multiple parallel environments simultaneously.
Run N copies of the environment in parallel (different random seeds, same policy parameters). After T steps in each, you have N×T transition tuples. Average the gradient over all of them before updating.
Why does this reduce variance? Each environment produces an independent trajectory. Averaging N independent gradient estimates divides the variance by N. With 8 parallel environments, the gradient variance is 8× lower for roughly the same wall-clock time, assuming the environments really do run in parallel.
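The 1/N claim is easy to check numerically. In this sketch, `noisy_gradient` is a hypothetical stand-in for one environment's gradient estimate (a true value of 1.0 plus unit Gaussian noise); real gradient estimates are vectors and only approximately independent across environments.

```python
import random
import statistics

random.seed(1)
N_ENVS, TRIALS = 8, 20000


def noisy_gradient():
    """Stand-in for one environment's gradient estimate: true value 1.0 plus unit noise."""
    return 1.0 + random.gauss(0.0, 1.0)


single = [noisy_gradient() for _ in range(TRIALS)]
averaged = [sum(noisy_gradient() for _ in range(N_ENVS)) / N_ENVS for _ in range(TRIALS)]

ratio = statistics.variance(single) / statistics.variance(averaged)
print(f"variance ratio with {N_ENVS} envs: {ratio:.1f}")  # should land near N_ENVS
```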
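The full A2C loss (note: the actor term is negated because we minimize the loss while the policy gradient ascends):

$$L(\theta,\phi) = -\,\mathbb{E}\big[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big] + c_1\, \mathbb{E}\big[(V_\phi(s_t) - \hat{V}_t)^2\big] - c_2\, \mathbb{E}\big[H(\pi_\theta(\cdot \mid s_t))\big]$$

where V̂t = Ât + Vφ(st) is the critic's regression target, held fixed during the update.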
Typical coefficients: c1 = 0.5 (value loss weight), c2 = 0.01 (entropy bonus). Both are hyperparameters.
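How the three terms combine with these coefficients can be sketched in a few lines. Plain Python lists stand in for tensors here, the function name `a2c_loss` is hypothetical, and the value term is assumed to be a plain mean squared error (some implementations add a factor of ½).

```python
def a2c_loss(log_probs, advantages, values, targets, entropies, c1=0.5, c2=0.01):
    """Combined A2C loss over a batch, with lists standing in for tensors."""
    n = len(log_probs)
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    value_loss = sum((v - tgt) ** 2 for v, tgt in zip(values, targets)) / n
    entropy = sum(entropies) / n
    # Minimizing this raises log-probs of good actions, fits the critic,
    # and rewards entropy (the bonus enters with a minus sign).
    return actor_loss + c1 * value_loss - c2 * entropy


loss = a2c_loss(
    log_probs=[-1.0, -2.0], advantages=[1.0, -1.0],
    values=[0.0, 1.0], targets=[1.0, 1.0], entropies=[1.0, 1.0],
)
print(round(loss, 4))  # -0.26
```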
A3C runs N worker threads, each with its own copy of the environment and local gradient accumulation. Workers push gradient updates to a shared parameter server asynchronously — no waiting for others. This allows scaling to many cores without synchronization overhead.
A2C (synchronous)
• All workers step together
• Gradient is averaged before update
• More reproducible training
• Standard in modern implementations
• Works well with GPU batching
A3C (asynchronous)
• Workers push gradients independently
• No synchronization → higher throughput
• Stale gradients can cause instability
• Mostly replaced by A2C + GPU
• Original: 16 threads on CPU
```python
# A2C with N parallel environments (pseudo-code).
# Assumes the classic Gym API (reset -> obs, step -> 4-tuple); Gymnasium instead
# returns (obs, info) from reset and a 5-tuple from step.
envs = gym.vector.SyncVectorEnv([make_env for _ in range(N)])  # N=8
obs = envs.reset()                            # [N, obs_dim]

for update in range(num_updates):
    # Collect T steps from all N envs
    storage = []
    for _ in range(T):
        dist = actor(obs)                     # batch of N distributions
        actions = dist.sample()               # [N]
        log_probs = dist.log_prob(actions)    # [N]
        entropy = dist.entropy()              # [N]
        values = critic(obs)                  # [N]
        obs_next, rewards, dones, _ = envs.step(actions)
        storage.append((obs, actions, log_probs, entropy, values, rewards, dones))
        obs = obs_next

    # Bootstrap value for the last state
    with torch.no_grad():
        last_value = critic(obs)              # [N]

    # Compute GAE advantages for each env independently
    advantages = compute_gae_batch(storage, last_value, gamma, lam)  # [T, N]

    # Stack the per-step tensors so the loss covers the whole rollout,
    # not only the final step
    log_probs = torch.stack([step[2] for step in storage])  # [T, N]
    entropy = torch.stack([step[3] for step in storage])    # [T, N]
    values = torch.stack([step[4] for step in storage])     # [T, N]
    returns = (advantages + values).detach()                # critic targets

    # Actor + critic + entropy loss (c1 = 0.5 is applied once, in the sum)
    actor_loss = -(log_probs * advantages.detach()).mean()
    critic_loss = (values - returns).pow(2).mean()
    entropy_loss = -entropy.mean()
    loss = actor_loss + 0.5 * critic_loss + 0.01 * entropy_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```
Everything so far uses stochastic policies: πθ(a|s) is a distribution. This works for discrete actions and low-dimensional continuous actions. But for high-dimensional continuous control (a robot arm with 7 joints, say), the stochastic policy gradient must average over actions as well as states, so its estimates get noisier and need more samples as the action dimension grows.
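Silver et al. (2014) proved a deterministic policy gradient theorem. If the policy is deterministic (μθ(s) outputs a single action vector, not a distribution), then:

$$\nabla_\theta U(\theta) = \mathbb{E}_{s}\left[\nabla_a Q(s,a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\right]$$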
Read it as two chain rule factors: ∇aQ answers "if I perturb the action slightly, how does value change?" and ∇θμ answers "if I perturb the parameters, how does the action change?" The actor gradient is their product. No log-probability needed.
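A scalar example makes the two factors tangible. The critic `q` and the linear policy `mu` below are toy assumptions chosen so the gradient can be written by hand and checked against a finite difference.

```python
def q(s, a):
    """Toy critic (an assumption for illustration): the best action is a = -s."""
    return -(a + s) ** 2


def mu(theta, s):
    """Linear deterministic policy mu_theta(s) = theta * s."""
    return theta * s


def dpg_gradient(theta, s):
    a = mu(theta, s)
    dq_da = -2.0 * (a + s)      # grad_a Q(s, a) evaluated at a = mu_theta(s)
    dmu_dtheta = s              # grad_theta mu_theta(s)
    return dq_da * dmu_dtheta   # chain rule: no log-probability involved


theta, s, eps = 0.5, 2.0, 1e-6
analytic = dpg_gradient(theta, s)
numeric = (q(s, mu(theta + eps, s)) - q(s, mu(theta - eps, s))) / (2 * eps)
print(analytic)  # -12.0
```

The negative gradient says to decrease θ from 0.5, which moves it toward −1, where the action is a = −s and Q is maximal.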
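Lillicrap et al. (2015) combined DPG with stabilizing techniques, two of them borrowed from DQN, to get DDPG (Deep Deterministic Policy Gradient): an experience replay buffer, slowly updated target networks (Polyak averaging), and temporally correlated exploration noise added to the deterministic action.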
```python
import torch
import torch.nn as nn


# DDPG actor update
def update_actor(actor, critic, opt_actor, states):
    # Gradient ascent on Q(s, mu(s)) w.r.t. actor params
    actions = actor(states)             # [B, act_dim]
    q_values = critic(states, actions)  # [B], differentiable w.r.t. actions
    actor_loss = -q_values.mean()       # maximize Q

    opt_actor.zero_grad()
    actor_loss.backward()               # gradient flows through the critic into the actor
    opt_actor.step()
    # Only opt_actor steps here, so critic parameters are not updated.
    # The critic still accumulates gradients from this backward pass;
    # update_critic clears them with zero_grad() before its own step.


# DDPG critic update (standard Bellman)
def update_critic(critic, actor_target, critic_target, opt_critic, batch,
                  gamma=0.99, tau=0.005):  # tau default is an assumed value
    s, a, r, s_next, done = batch       # done: bool tensor
    with torch.no_grad():
        a_next = actor_target(s_next)
        q_next = critic_target(s_next, a_next)
        td_target = r + gamma * q_next * (~done).float()

    q = critic(s, a)
    critic_loss = nn.functional.mse_loss(q, td_target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Polyak update of the target critic (the target actor is updated the same way)
    with torch.no_grad():
        for param, target_param in zip(critic.parameters(), critic_target.parameters()):
            target_param.copy_(tau * param + (1 - tau) * target_param)
```
Figure: the deterministic policy (blue line) maps state to action with no randomness. Exploration noise (orange dots) covers the action space for experience collection. The Ornstein-Uhlenbeck (OU) process (red dots) produces temporally correlated noise: it wanders instead of jumping.
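The "wanders instead of jumping" behavior of Ornstein-Uhlenbeck noise can be sketched with a simple Euler discretization. The parameter values (`theta=0.15`, `sigma=0.2`, `dt=1.0`) and the helper names are assumptions for illustration. The lag-1 autocorrelation is used as a rough measure of temporal correlation.

```python
import math
import random


def ou_noise(steps, theta=0.15, sigma=0.2, dt=1.0, x0=0.0, rng=random):
    """Euler discretization of an Ornstein-Uhlenbeck process reverting to 0."""
    x, path = x0, []
    for _ in range(steps):
        x += -theta * x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def lag1_autocorr(xs):
    """Sample correlation between consecutive values of a sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


random.seed(0)
ou = ou_noise(5000)
white = [random.gauss(0.0, 0.2) for _ in range(5000)]
print(f"OU lag-1 autocorrelation:    {lag1_autocorr(ou):.2f}")
print(f"white lag-1 autocorrelation: {lag1_autocorr(white):.2f}")
```

With these settings the OU samples should be strongly correlated from one step to the next, while independent Gaussian noise shows almost no correlation.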
DDPG is sample-efficient but brittle: it is hyperparameter-sensitive, prone to Q-value overestimation, and explores poorly because its policy is deterministic and relies on added noise. Soft Actor-Critic (SAC) (Haarnoja et al., 2018) addresses all three issues by changing the objective.
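Instead of maximizing expected return alone, SAC maximizes entropy-augmented return:

$$U(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t \big(r_t + \alpha\, H(\pi(\cdot \mid s_t))\big)\right]$$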
α is the temperature parameter: it controls how much entropy matters. High α → very stochastic policy (exploration), low α → nearly deterministic (exploitation). The entropy term H(π(·|s)) = −E[log π(a|s)] rewards distributional spread.
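The soft Bellman equation that the critics are trained to satisfy becomes:

$$Q(s,a) = r + \gamma\, \mathbb{E}_{s'}\Big[\mathbb{E}_{a' \sim \pi}\big[Q(s',a') - \alpha \log \pi(a' \mid s')\big]\Big]$$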
SAC uses the reparameterization trick to backpropagate through the stochastic action: instead of sampling a ~ π(·|s), sample ε ~ N(0,I) and compute a = μ(s) + σ(s)⊙ε. Now a is a differentiable function of the parameters.
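The sketch below shows the trick without an autodiff library, on a toy objective chosen so the answer is known in closed form: f(a) = a², for which E[a²] = μ² + σ². The derivatives through a = μ + σ·ε are written by hand; in SAC the same path is differentiated automatically and f is the critic.

```python
import random

random.seed(0)
mu, sigma, n = 1.5, 0.5, 20000

# Objective: E[f(a)] with f(a) = a^2 and a ~ N(mu, sigma^2).
# Reparameterize a = mu + sigma * eps, eps ~ N(0, 1), then differentiate through a:
#   d f / d mu    = f'(a) * da/dmu    = 2a * 1
#   d f / d sigma = f'(a) * da/dsigma = 2a * eps
grad_mu, grad_sigma = 0.0, 0.0
for _ in range(n):
    eps = random.gauss(0.0, 1.0)
    a = mu + sigma * eps
    grad_mu += 2.0 * a / n
    grad_sigma += 2.0 * a * eps / n

# Closed form for comparison: the exact gradients are 2*mu = 3.0 and 2*sigma = 1.0.
print(f"grad wrt mu:    {grad_mu:.2f} (exact 3.00)")
print(f"grad wrt sigma: {grad_sigma:.2f} (exact 1.00)")
```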
SAC networks:
• Actor: πθ(a|s) — Gaussian with learned mean μθ(s) and diagonal covariance σθ(s)
• Two critics: Qφ1, Qφ2 (Clipped Double-Q)
• Two target critics: Q'φ1, Q'φ2
Clipped Double-Q:
Use min(Qφ1, Qφ2) in the Bellman target. Two independent critics with different initializations disagree about Q-values. Taking the minimum is pessimistic — it counteracts the systematic overestimation that causes instability in DDPG.
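For a single transition, the pessimistic target can be sketched as below. The function name `sac_target`, the default `alpha=0.2`, and the example numbers are assumptions. In a full implementation the two Q-values come from the target critics, evaluated at an action freshly sampled from the current policy.

```python
def sac_target(reward, done, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2):
    """Soft Bellman target using the pessimistic minimum of two target critics."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * (0.0 if done else soft_value)


# The critics disagree (10 vs. 8); the target trusts the lower one.
y = sac_target(reward=1.0, done=False, q1_next=10.0, q2_next=8.0, log_prob_next=-1.0)
print(round(y, 3))  # 9.118
```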
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal


# SAC actor (squashed Gaussian)
class SACPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.mean_layer = nn.Linear(256, act_dim)
        self.log_std_layer = nn.Linear(256, act_dim)

    def forward(self, s):
        h = self.backbone(s)
        mean = self.mean_layer(h)
        log_std = self.log_std_layer(h).clamp(-20, 2)  # stability
        std = log_std.exp()

        # Reparameterization: a = mean + std * eps
        eps = torch.randn_like(mean)
        a_raw = mean + std * eps   # differentiable w.r.t. mean and std
        a = torch.tanh(a_raw)      # squash to (-1, 1)

        # Log-prob with change of variables for tanh squashing
        log_prob = Normal(mean, std).log_prob(a_raw).sum(-1)
        log_prob -= (2 * (math.log(2) - a_raw - F.softplus(-2 * a_raw))).sum(-1)
        return a, log_prob         # [B, act_dim], [B]


# SAC actor update: maximize E[Q] + alpha * H
def update_actor_sac(actor, q1, q2, opt_actor, states, alpha):
    a, log_prob = actor(states)                       # [B, act_dim], [B]
    q_min = torch.min(q1(states, a), q2(states, a))   # [B]
    actor_loss = (alpha * log_prob - q_min).mean()    # minimize this
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```
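The tanh correction in the policy above uses the identity log(1 − tanh²(u)) = 2·(log 2 − u − softplus(−2u)). The following sketch checks it numerically with the standard library only; the helper names are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_det_stable(u):
    """log(1 - tanh(u)^2) written as 2 * (log 2 - u - softplus(-2u))."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def log_det_naive(u):
    """Direct evaluation; loses all precision once tanh(u) rounds to 1."""
    return math.log(1.0 - math.tanh(u) ** 2)


for u in (-3.0, -0.5, 0.0, 1.0, 4.0):
    assert abs(log_det_stable(u) - log_det_naive(u)) < 1e-9

# For large |u| the naive form breaks down (1 - tanh(u)^2 rounds to 0),
# while the stable form keeps returning a finite value.
large = log_det_stable(30.0)
```

This is why the rewritten form is preferred: pre-squash actions far from zero are common early in training.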
For games with discrete, finite action spaces and a perfect simulator (chess, Go, shogi), a powerful alternative exists: use MCTS as a policy improvement operator. The actor and critic guide the search; the search results train the actor and critic. This is the AlphaZero architecture.
A single network fθ(s) = (p, v) outputs two heads:
Policy head p = πθ(a|s)
Prior probability over all legal moves. Stored as the prior on each newly expanded edge in MCTS, where it biases action selection toward promising moves before they have been searched.
Value head v = vθ(s)
Estimated expected game outcome from state s (from −1 for a loss to +1 for a win). Replaces rollouts in MCTS leaf evaluation, which is much faster than simulating to game end.
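Each simulation traverses the game tree by selecting actions with:

$$a = \arg\max_a \left[ Q(s,a) + c_{\text{puct}}\, p(a \mid s)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right]$$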
N(s,a) is the visit count of action a from s. The second term is an exploration bonus: high when a has been visited rarely (N(s,a) small) or the prior is high (p(a|s) large). The constant c_puct, typically between 1 and 5, controls exploration vs. exploitation inside the tree.
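A minimal sketch of this selection rule follows. It assumes the AlphaZero-style bonus c_puct · p(a|s) · √(Σ_b N(s,b)) / (1 + N(s,a)); the function name and all numbers are made up for illustration.

```python
import math


def puct_select(q, prior, visits, c_puct=1.5):
    """Return (best action index, scores) for Q + c_puct * P * sqrt(sum N) / (1 + N)."""
    total = sum(visits)
    scores = [
        q[a] + c_puct * prior[a] * math.sqrt(total) / (1 + visits[a])
        for a in range(len(q))
    ]
    return max(range(len(q)), key=scores.__getitem__), scores


# Action 0 has the best Q so far but many visits; action 2 is unvisited with a decent prior.
best, scores = puct_select(q=[0.5, 0.1, 0.0], prior=[0.5, 0.2, 0.3], visits=[16, 0, 0])
print(best)  # 2
```

The unvisited action with the larger prior wins here: its exploration bonus (1.8) outweighs the well-explored action's higher Q.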
The policy head's loss is cross-entropy against the MCTS visit distribution πMCTS(a|s) = N(s,a)^(1/τ) / ∑_b N(s,b)^(1/τ). Temperature τ → 0 makes this the greedy most-visited action; τ = 1 is proportional to counts. AlphaZero uses τ = 1 for the first 30 moves (exploration), then τ → 0 (exploitation).
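The effect of the temperature on the visit distribution can be sketched directly from the formula. The function name `visit_policy` and the counts are illustrative.

```python
def visit_policy(visits, tau=1.0):
    """pi(a) proportional to N(a)^(1/tau); tau -> 0 approaches the most-visited action."""
    powered = [n ** (1.0 / tau) for n in visits]
    total = sum(powered)
    return [p / total for p in powered]


counts = [10, 30, 60]
proportional = visit_policy(counts, tau=1.0)   # proportional to counts: 0.1, 0.3, 0.6
sharp = visit_policy(counts, tau=0.1)          # almost all mass on the most-visited action
print([round(p, 3) for p in proportional])     # [0.1, 0.3, 0.6]
```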
Let's watch an actor-critic method learn on a simple 1D regulator: the state is a scalar s, and the goal is to drive s to zero. The optimal policy is μ(s) = −s (negative feedback). The optimal value function is a negative quadratic — further from zero is worse.
We parameterize the actor as a Gaussian with mean μθ(s) = θ1·s and log-variance θ2, and the critic as Vφ(s) = φ1·s + φ2·s². Watch θ1 → −1 (optimal slope) and φ2 become negative (value is lowest far from zero).
Figure (interactive in the original): top, actor policy mean (blue) vs. optimal (dashed); bottom, critic value estimate (green) vs. optimal (dashed). A large actor step size αactor makes training unstable.
Actor-critic methods are the backbone of modern deep RL. The table below shows how the algorithms in this chapter relate to each other and to broader families.
| Algorithm | Actor | Critic | Advantage | Key trick |
|---|---|---|---|---|
| Basic AC | Stochastic πθ | Vφ(s) | TD residual δ | Baseline variance reduction |
| A2C/A3C | Stochastic πθ | Vφ(s) | GAE(λ) | Parallel envs + entropy bonus |
| PPO | Stochastic πθ | Vφ(s) | GAE(λ) | Clipped surrogate (Ch 12) |
| DDPG | Deterministic μθ | Qφ(s,a) | Chain rule through Q | Replay + target networks |
| TD3 | Deterministic μθ | 2×Qφ | Clipped Double-Q | Delayed actor updates + target noise |
| SAC | Stochastic (reparam) | 2×Qφ | Q − αlogπ | Entropy regularization + auto-α |
| AlphaZero | πθ prior | vθ(s) | MCTS visit counts | Self-play + lookahead distillation |
From this chapter:
• GAE + PPO clip = most robust general algorithm today
• SAC = best for continuous control benchmarks
• AlphaZero = best for two-player perfect-information games
• TD3 = DDPG but stable (often better default)
Open problems:
• Model-based AC (Dreamer, MBPO): learn a world model, train AC inside it
• Multi-agent AC (Ch 27): each agent has its own actor, shared or separate critics
• Offline RL: train AC from a fixed dataset with no environment interaction
• RLHF: the reward comes from a model trained on human preferences, and an actor-critic method (typically PPO) fine-tunes the language model against it
"The actor-critic is the right framework for almost any RL problem. The question is which variant."
— paraphrased from Sutton & Barto, Reinforcement Learning (2nd ed.)