The exponentially-weighted advantage estimator that controls the bias-variance tradeoff in policy gradients. It is used by PPO and, in special-case form, by A3C and many other on-policy actor-critic methods.
You want to teach a simulated robot to run. The policy outputs joint torques at every timestep, and the reward comes from how far the robot moves forward. But here is the trouble: the reward the robot gets at timestep 100 was influenced by actions at timesteps 1, 2, 3, ..., 99. Which actions deserve credit for the good outcome? This is the credit assignment problem.
Policy gradient methods tackle this head-on: estimate the gradient of total reward with respect to the policy parameters, then take a gradient step. Simple in principle. Devastating in practice.
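The problem is variance. The gradient estimator is a sum over the entire trajectory, and the effect of any single action is confounded with the effects of every other action. With a 1000-step episode, the gradient signal for any one action is buried under the noise of 999 others. To get a reliable gradient, you need an impractically large number of samples.
What helps is accepting a little bias in exchange for a large cut in variance, and GAE provides exactly this tradeoff. It introduces a single parameter λ ∈ [0, 1] that smoothly interpolates between two extremes: λ = 0 gives the 1-step TD estimator (low variance, high bias), and λ = 1 gives the Monte Carlo estimator (no bias from V, high variance).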
By setting λ somewhere in between (typically 0.95-0.98), you get the best of both worlds: dramatically lower variance than Monte Carlo, with very little bias.
Imagine you have a 1-step advantage estimator and a 2-step advantage estimator. The 1-step version has low variance but high bias. The 2-step version has slightly more variance but less bias. You could also build 3-step, 4-step, ..., k-step estimators, each trading a bit more variance for a bit less bias.
GAE's key insight: take an exponentially-weighted average of all of these k-step estimators. Give the 1-step estimator weight (1−λ), the 2-step estimator weight (1−λ)λ, the 3-step weight (1−λ)λ², and so on.
This exponential weighting is not arbitrary. It collapses into a remarkably simple formula: Â_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}.
Where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD residual: the one-step prediction error of the value function.
The beauty is in the simplicity. To compute GAE in practice, you just compute all the TD residuals δt along the trajectory, then run a single backward pass accumulating (γλ)-discounted sums. That's it. No complex optimization, no second-order methods — just a discounted cumulative sum.
Let's build the foundations. A policy πθ(a|s) maps states to action distributions. The goal is to maximize expected total reward. The policy gradient theorem gives us:
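For reference, here is a sketch of the gradient expression in the form used by the GAE paper. The symbol g for the gradient and the undiscounted reward sum on the left are notational assumptions, not something fixed by the text above.

```latex
g \;=\; \nabla_\theta \, \mathbb{E}\!\left[\sum_{t=0}^{\infty} r_t\right]
  \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]
```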
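Where Ψ_t can be any of: the total reward of the trajectory; the reward following action a_t; a baselined version of that reward-to-go; the state-action value Q^π(s_t, a_t); the advantage A^π(s_t, a_t); or the TD residual r_t + V^π(s_{t+1}) − V^π(s_t).
The advantage function A^π(s, a) = Q^π(s, a) − V^π(s) measures how much better action a is compared to the policy's average behavior in state s. It has a clean interpretation: if A^π(s, a) > 0, the action is better than average and its probability should go up; if A^π(s, a) < 0, it is worse than average and its probability should go down.
Using the advantage as Ψ_t gives close to the lowest variance of the choices above while keeping the gradient estimator unbiased. But we don't know A^π, so we have to estimate it. The question is how.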
The paper works with a discounted advantage A^{π,γ}, where γ ∈ [0, 1] downweights future rewards. Crucially, it treats γ not as part of the problem definition but as a variance reduction parameter. Setting γ < 1 introduces bias but reduces variance by making distant rewards contribute less.
Each choice of Ψt trades off variance and bias differently. The advantage function is the sweet spot.
We need to estimate A^{π,γ}(s_t, a_t), but we don't have the true value function; we have a learned approximation V. This introduces bias. Let's see exactly how.
Define the TD residual: δ_t = r_t + γV(s_{t+1}) − V(s_t).
If V = V^{π,γ} (the true value function), then δ_t is an unbiased estimator of the advantage: E[δ_t | s_t, a_t] = A^{π,γ}(s_t, a_t).
But if V is approximate (as it always is in practice), δ_t is biased. Writing ε(s) = V(s) − V^{π,γ}(s) for the value function's error, the bias is γ E[ε(s_{t+1})] − ε(s_t).
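To see where the bias comes from, it may help to write the error of the learned value function as ε(s) = V(s) − V^{π,γ}(s) (a symbol introduced here for illustration) and take the expectation of the TD residual over the next state, given s_t and a_t:

```latex
\begin{aligned}
\mathbb{E}_{s_{t+1}}\!\left[\delta_t\right]
  &= \mathbb{E}_{s_{t+1}}\!\left[r_t + \gamma V(s_{t+1})\right] - V(s_t) \\
  &= \underbrace{\mathbb{E}_{s_{t+1}}\!\left[r_t + \gamma V^{\pi,\gamma}(s_{t+1})\right] - V^{\pi,\gamma}(s_t)}_{A^{\pi,\gamma}(s_t,\,a_t)}
     \;+\; \gamma\,\mathbb{E}_{s_{t+1}}\!\left[\epsilon(s_{t+1})\right] - \epsilon(s_t)
\end{aligned}
```

The last two terms vanish when ε = 0 everywhere; otherwise they are the bias of the 1-step estimator.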
Alternatively, we can use the empirical returns minus the baseline: Â_t^{MC} = Σ_{l=0}^{∞} γ^l r_{t+l} − V(s_t).
This uses no bootstrapping: it waits for all actual rewards to arrive. If V is used only as a baseline (subtracted but not bootstrapped from), the estimator is unbiased regardless of V's accuracy. But every reward from t onward adds variance. For a 1000-step episode, Â_t^{MC} at timestep 0 is a sum of 1000 noisy terms.
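The Monte Carlo estimator described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single finished episode; the function names `discounted_returns` and `mc_advantages` are hypothetical, and `values` holds the baseline V(s_t) for each visited state.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go at every timestep: sum over l of gamma**l * r_{t+l}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_advantages(rewards, values, gamma):
    """Empirical return minus the baseline V(s_t); V is never bootstrapped from."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


# Three-step episode with a constant baseline of 0.5.
advantages = mc_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], gamma=0.5)
```

Note that the baseline only shifts each estimate; every reward after step t still enters the sum, which is where the variance comes from.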
Figure: sampled advantage estimates from each estimator. TD is tight but shifted (biased). MC is centered but scattered.
We've seen the two extremes: 1-step TD (low variance, high bias) and Monte Carlo (no bias, high variance). Now let's build the bridge between them.
Take the sum of k consecutive TD residuals:
In general:
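Written out from the definition of the TD residual, and assuming the superscript (k) counts how many residuals are summed, the first two cases and the general case are:

```latex
\begin{aligned}
\hat{A}_t^{(1)} &= \delta_t
  = -V(s_t) + r_t + \gamma V(s_{t+1}) \\
\hat{A}_t^{(2)} &= \delta_t + \gamma\,\delta_{t+1}
  = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \\
\hat{A}_t^{(k)} &= \sum_{l=0}^{k-1} \gamma^l\,\delta_{t+l}
  = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})
\end{aligned}
```

The intermediate value terms cancel in pairs, which is why only V(s_t) and the final bootstrap term survive.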
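Look at the pattern. The k-step estimator uses k actual rewards, then bootstraps from V at step t+k. As k increases, the bootstrap term is discounted by γ^k, so errors in V matter less and bias falls, while each extra reward term adds variance.
As k → ∞, the bootstrap term γ^k V(s_{t+k}) vanishes (for γ < 1 and bounded V) and we recover the Monte Carlo estimator.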
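The two views of the k-step estimator (a sum of TD residuals, or k real rewards plus a bootstrap) can be compared directly. The sketch below assumes `values` has one more entry than `rewards`, so that V(s_{t+k}) is always available; both function names are hypothetical.

```python
def k_step_from_residuals(rewards, values, gamma, t, k):
    """Sum of k consecutive TD residuals, discounted by gamma."""
    total = 0.0
    for l in range(k):
        delta = rewards[t + l] + gamma * values[t + l + 1] - values[t + l]
        total += gamma ** l * delta
    return total


def k_step_from_rewards(rewards, values, gamma, t, k):
    """k real rewards, then a bootstrap from V at step t + k, minus V(s_t)."""
    k_step_return = sum(gamma ** l * rewards[t + l] for l in range(k))
    return -values[t] + k_step_return + gamma ** k * values[t + k]


rewards = [1.0, 0.0, 2.0, -1.0]
values = [0.5, 1.0, 0.2, -0.3, 0.0]  # V(s_0) .. V(s_4); the last state is terminal

for k in range(1, 5):
    a = k_step_from_residuals(rewards, values, 0.99, t=0, k=k)
    b = k_step_from_rewards(rewards, values, 0.99, t=0, k=k)
    assert abs(a - b) < 1e-12  # the intermediate V terms cancel
```

With k = 4 the bootstrap lands on the terminal state, whose value is 0 here, so the estimate coincides with the Monte Carlo one.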
Figure: the k-step estimator as k varies, using more real rewards and less bootstrapping as k grows.
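Now for the main result. GAE is defined as the exponentially-weighted average of all k-step advantage estimators: Â_t^{GAE(γ,λ)} = (1−λ)(Â_t^{(1)} + λ Â_t^{(2)} + λ² Â_t^{(3)} + ...).
This looks complicated, but it telescopes into something beautiful. Expand the sum, collect terms, and you get: Â_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}.
That's it. The GAE advantage is just a discounted sum of TD residuals, with effective discount (γλ) instead of γ. Let's verify the two special cases: λ = 0 leaves only the first term, Â_t = δ_t, the 1-step TD estimator; λ = 1 gives Â_t = Σ_{l=0}^{∞} γ^l δ_{t+l} = Σ_{l=0}^{∞} γ^l r_{t+l} − V(s_t), the Monte Carlo estimator.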
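The two special cases can also be checked numerically. The sketch below evaluates the discounted sum of TD residuals at t = 0 for a short episode, assuming the final state is terminal with value 0; `gae_advantage` and the sample numbers are illustrative only.

```python
def gae_advantage(rewards, values, gamma, lam, t=0):
    """Discounted sum of TD residuals from step t, with discount gamma * lam."""
    total = 0.0
    for l in range(len(rewards) - t):
        delta = rewards[t + l] + gamma * values[t + l + 1] - values[t + l]
        total += (gamma * lam) ** l * delta
    return total


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 0.2, 0.0]  # V(s_0) .. V(s_3); s_3 is terminal, so V = 0
gamma = 0.9

# lambda = 0: only the first residual survives (1-step TD).
td_estimate = rewards[0] + gamma * values[1] - values[0]
assert abs(gae_advantage(rewards, values, gamma, lam=0.0) - td_estimate) < 1e-12

# lambda = 1: the value terms telescope away (Monte Carlo return minus baseline).
mc_return = sum(gamma ** l * r for l, r in enumerate(rewards))
assert abs(gae_advantage(rewards, values, gamma, lam=1.0) - (mc_return - values[0])) < 1e-12
```

Any λ strictly between 0 and 1 gives a value that mixes the two, weighting later residuals progressively less.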
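Start with the definition and expand each Â_t^{(k)} as Σ_{l=0}^{k−1} γ^l δ_{t+l}.
Collect the coefficient of each δ_{t+l}. The term δ_{t+l} appears in the k-step estimator for all k ≥ l+1, so its coefficient is γ^l (1−λ) Σ_{k=l+1}^{∞} λ^{k−1} = γ^l (1−λ) · λ^l / (1−λ) = γ^l λ^l.
And there it is: every δ_{t+l} gets weight (γλ)^l.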
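As a sanity check on this derivation, the weighted average of k-step estimators and the (γλ)-discounted sum can be computed separately and compared. This is a sketch under one assumption: the episode ends after n steps in a terminal state of value 0, so every k ≥ n yields the same estimator and the leftover weights (1−λ)λ^(k−1) for k ≥ n add up to λ^(n−1). All function names are hypothetical.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has len(rewards) + 1 entries."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def k_step_advantage(deltas, gamma, k):
    """Sum of the first k TD residuals, discounted by gamma."""
    return sum(gamma ** l * d for l, d in enumerate(deltas[:k]))


def gae_as_weighted_average(deltas, gamma, lam):
    """(1 - lam) * sum over k of lam**(k-1) * A^(k), with the tail folded into k = n."""
    n = len(deltas)
    head = sum((1 - lam) * lam ** (k - 1) * k_step_advantage(deltas, gamma, k)
               for k in range(1, n))
    tail = lam ** (n - 1) * k_step_advantage(deltas, gamma, n)
    return head + tail


def gae_as_discounted_sum(deltas, gamma, lam):
    """Closed form: sum over l of (gamma * lam)**l * delta_l."""
    return sum((gamma * lam) ** l * d for l, d in enumerate(deltas))


deltas = td_residuals([1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.2, -0.3, 0.0], gamma=0.99)
weighted = gae_as_weighted_average(deltas, 0.99, 0.95)
closed_form = gae_as_discounted_sum(deltas, 0.99, 0.95)
assert abs(weighted - closed_form) < 1e-12
```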
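The computation is a simple backward pass through the trajectory: starting from the last timestep and moving backward, compute δ_t = r_t + γV(s_{t+1}) − V(s_t), then set Â_t = δ_t + γλ Â_{t+1} (with Â = 0 past the end of the episode).
Two lines of code. That's the entire implementation.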
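A minimal sketch of that backward pass in plain Python follows. It assumes a single trajectory, with `values` holding one more entry than `rewards`: the value of the state reached after the last step, which should be 0.0 if that state is terminal. The function name `compute_gae` and the default hyperparameters are illustrative choices, not taken from any particular library.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for every timestep of one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), where V(s_T) is the bootstrap value
             (use 0.0 when s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0  # advantage of the step after the current one
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # discounted accumulation
        advantages[t] = running
    return advantages


advantages = compute_gae([1.0, 0.0, 2.0], [0.5, 1.0, 0.2, 0.0])
```

The loop body is the two lines the text refers to; everything else is bookkeeping. Batches that contain several episodes would also need to reset `running` at episode boundaries, which this sketch leaves out.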
Figure: the weight (γλ)^l given to each TD residual δ_{t+l} as λ moves from 0 (1-step TD) to 1 (Monte Carlo).
To understand why GAE is so effective, we need to think carefully about what γ and λ each do.
γ (discount factor) determines the scale of the value function V^{π,γ}. Taking γ < 1 introduces bias into the policy gradient regardless of V's accuracy. It says: "I don't care about rewards more than ~1/(1−γ) steps in the future." With γ = 0.99, that's a horizon of ~100 steps.
λ (GAE parameter) controls how much we bootstrap from V. Taking λ < 1 introduces bias only when V is inaccurate. If V were perfect, any λ would give the correct advantage. Since V is always imperfect in practice, lower λ means more reliance on V (more bias), but dramatically less variance.
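One rough way to compare the two knobs is to apply the same 1/(1 − discount) rule of thumb to γλ, the rate at which GAE decays TD residuals. Extending the rule to γλ is an illustration added here rather than a claim from the paper, and `effective_horizon` is a hypothetical helper.

```python
def effective_horizon(discount):
    """Rule-of-thumb horizon 1 / (1 - discount) for a geometric weighting."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    return 1.0 / (1.0 - discount)


gamma, lam = 0.99, 0.95
reward_horizon = effective_horizon(gamma)          # about 100 steps
residual_horizon = effective_horizon(gamma * lam)  # about 17 steps
```

Under these settings, rewards matter out to roughly 100 steps through V, while the sampled TD residuals that carry the noise are weighted over a much shorter window.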
The paper shows that GAE can be interpreted through the lens of reward shaping (Ng et al., 1999). Define a shaped reward using V as the potential: r̃(s, a, s') = r(s, a, s') + γV(s') − V(s).
Then GAE(γ, λ) computes the γλ-discounted sum of these shaped rewards. Reward shaping theory tells us that the potential-based shaping γV(s') − V(s) doesn't change the optimal policy — so the shaped reward δt has the same optimal policy as the original reward rt, but with much lower variance because V absorbs the "expected" part of future returns.
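The shaping view can be checked on a toy trajectory. The sketch below, with hypothetical helper names, builds the shaped rewards using V as the potential and confirms two things: discounting them by γ reproduces the ordinary return shifted by the potential terms, and discounting them by γλ is the GAE sum.

```python
def shaped_rewards(rewards, values, gamma):
    """r~_t = r_t + gamma * V(s_{t+1}) - V(s_t), i.e. the TD residuals."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def discounted_sum(xs, discount):
    """Sum over l of discount**l * xs[l]."""
    return sum(discount ** l * x for l, x in enumerate(xs))


rewards = [1.0, 0.0, 2.0, -1.0]
values = [0.5, 1.0, 0.2, -0.3, 0.4]  # V(s_0) .. V(s_4)
gamma, lam = 0.99, 0.95
n = len(rewards)

shaped = shaped_rewards(rewards, values, gamma)

# With discount gamma, shaping only adds boundary terms: the potentials telescope.
lhs = discounted_sum(shaped, gamma)
rhs = discounted_sum(rewards, gamma) + gamma ** n * values[n] - values[0]
assert abs(lhs - rhs) < 1e-12

# With the steeper discount gamma * lam, the same sum is the GAE estimate at t = 0.
gae_at_zero = discounted_sum(shaped, gamma * lam)
```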
GAE is the advantage-function analog of TD(λ) with eligibility traces. In TD(λ), the value update at each step is spread backward along the trajectory using traces that decay by γλ per step. In GAE, the advantage estimate at each step accumulates forward-looking TD residuals decaying by γλ per step. The forward view and backward view are equivalent — GAE is the "forward view."
GAE relies on V being a good approximation. If V is terrible, even moderate λ values will give biased gradients. So training V well is critical. The paper proposes using a trust region method for the value function — not just the policy.
Standard approach: minimize the MSE between V_φ(s_t) and the empirical returns. But naive MSE minimization can overfit to the current batch of data, causing V to become worse on states not in the batch. This is especially problematic because GAE bootstraps from V: overfitting V can destabilize the entire training loop.
The paper constrains the value function update to stay close to the previous value function, using a trust region:
Where V̂_n = Σ_{l=0}^{∞} γ^l r_{n,t+l} are the empirical discounted returns. The constraint prevents V from changing too much in a single update.
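As a sketch of the constrained problem being described, following the GAE paper's formulation as far as it can be reconstructed here: N is the number of samples in the batch, φ_old the parameters before the update, ε the trust region size, and σ² a normalizer. Treat the exact normalization as an assumption.

```latex
\begin{aligned}
\underset{\phi}{\text{minimize}} \quad
  & \sum_{n=1}^{N} \left\| V_\phi(s_n) - \hat{V}_n \right\|^2 \\
\text{subject to} \quad
  & \frac{1}{N} \sum_{n=1}^{N}
    \frac{\left\| V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n) \right\|^2}{2\sigma^2} \le \epsilon,
  \qquad
  \sigma^2 = \frac{1}{N} \sum_{n=1}^{N} \left\| V_{\phi_{\text{old}}}(s_n) - \hat{V}_n \right\|^2
\end{aligned}
```

Dividing by σ² expresses the allowed change in V relative to how wrong the old value function already was on this batch.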
The authors use a conjugate gradient method followed by a line search, similar to TRPO for the policy. The Hessian-vector products are computed efficiently. In practice, they compute the step by solving H s = −g for the direction s and then moving to φ = φ_old + α s.
Where H is the Hessian of the constraint, g is the gradient of the objective, and α is chosen by line search to satisfy the constraint.
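The step computation can be sketched with a small conjugate gradient solver that only ever touches H through Hessian-vector products. This is an illustration under stated assumptions, not the paper's code: H is a tiny explicit matrix standing in for the real curvature, all function names are hypothetical, and α is set by rescaling so that the quadratic constraint ½ (αs)ᵀH(αs) = ε holds with equality, a simplification standing in for the line search mentioned above.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp, b, iters=10, tol=1e-10):
    """Approximately solve H x = b using only products hvp(v) = H v (H symmetric positive definite)."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - H x, with x = 0 initially
    p = list(b)  # current search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        hp = hvp(p)
        step = rs / dot(p, hp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def trust_region_step(hvp, g, epsilon):
    """Direction s ~ -H^{-1} g, rescaled so that 0.5 * step^T H step = epsilon."""
    s = conjugate_gradient(hvp, [-gi for gi in g])
    alpha = math.sqrt(2.0 * epsilon / dot(s, hvp(s)))
    return [alpha * si for si in s]


# Toy problem: explicit 2x2 curvature matrix and gradient.
H = [[4.0, 1.0], [1.0, 3.0]]
g = [1.0, 2.0]
hvp = lambda v: [dot(row, v) for row in H]
step = trust_region_step(hvp, g, epsilon=0.01)
```

In a real value network, `hvp` would be computed without ever forming H, which is what makes conjugate gradient attractive here.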
Both the policy and value function are updated with trust region constraints to prevent overfitting.
The paper tests GAE on 3D locomotion tasks in the MuJoCo physics simulator. These are hard: the robot has to coordinate many actuated joints, manage balance, and produce efficient gaits, all while mapping raw joint angles and velocities directly to joint torques.
All policies are neural networks (three hidden layers of 100, 50, and 25 tanh units) mapping directly from raw kinematics to joint torques. No hand-crafted features, no phase variables, no gait libraries.
The paper systematically varies γ and λ to measure their effects:
Learning curves for different λ values on bipedal locomotion. Intermediate λ (0.96–0.99) outperforms both extremes.
REINFORCE (Williams, 1992): The original policy gradient. GAE addresses its crippling variance by introducing a principled advantage estimator.
TD(λ) (Sutton, 1988): The eligibility trace framework for value function estimation. GAE is the advantage-function analog — same exponential weighting, applied to a different quantity.
Actor-critic methods (Konda & Tsitsiklis, 2003): Use learned value functions to reduce variance. GAE provides a principled way to use that value function for advantage estimation.
TRPO (Schulman et al., 2015): Trust region optimization for the policy. The GAE paper extends trust regions to the value function and uses TRPO-style updates for the policy.
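PPO (Schulman et al., 2017): One of the most widely used RL algorithms. PPO typically uses GAE directly: the advantage estimates in PPO's clipped objective are commonly computed with GAE, for example with γ = 0.99 and λ = 0.95 in its MuJoCo experiments. PPO can run with other advantage estimators, but GAE is the standard choice.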
A3C (Mnih et al., 2016): Asynchronous actor-critic with n-step returns — a special case of GAE with λ = 1 and truncated trajectories.
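RLHF / ChatGPT: When you train a language model with PPO from human feedback, GAE typically computes the advantage estimates that determine which tokens to reinforce. GAE runs behind many PPO-based RLHF pipelines.
Other modern actor-critics: On-policy methods such as MAPPO use GAE directly, and IMPALA's V-trace is a related multi-step estimator with off-policy corrections. Not every method uses it: SAC learns Q-functions off-policy, and GRPO replaces the learned critic with a group-relative baseline. Among on-policy methods with a learned value function, GAE is the de facto standard.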