Schulman, Moritz, Levine, Jordan, Abbeel — ICLR 2016

Generalized Advantage Estimation

The exponentially weighted advantage estimator that controls the bias-variance tradeoff in policy gradients. It is the standard advantage estimator in PPO and is closely related to the n-step returns used by A3C and other on-policy actor-critic methods.

Prerequisites: Policy gradients + Value functions + TD learning
10 Chapters · 5+ Simulations

Chapter 0: The Problem

You want to teach a simulated robot to run. The policy outputs joint torques at every timestep, and the reward comes from how far the robot moves forward. But here is the trouble: the reward the robot gets at timestep 100 was influenced by actions at timesteps 1, 2, 3, ..., 99. Which actions deserve credit for the good outcome? This is the credit assignment problem.

Policy gradient methods tackle this head-on: estimate the gradient of total reward with respect to the policy parameters, then take a gradient step. Simple in principle. Devastating in practice.

The problem is variance. The gradient estimator is a sum over the entire trajectory, and the effect of any single action is confounded with the effects of every other action. With a 1000-step episode, the gradient signal for any one action is buried under the noise of 999 others. To get a reliable gradient, you need millions of samples.

The core tension: You can reduce variance by using a learned value function to estimate future returns (instead of waiting for them). But a value function introduces bias — if your estimate is wrong, you'll get a biased gradient that converges to the wrong solution. High variance means slow learning. Bias means wrong learning. We need a principled way to trade one for the other.

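GAE provides exactly this tradeoff. It introduces a single parameter λ ∈ [0, 1] that smoothly interpolates between two extremes: λ = 0, the 1-step TD estimate (low variance, but biased whenever the value function is inaccurate), and λ = 1, the Monte Carlo estimate (unbiased, but high variance).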

By setting λ somewhere in between (typically 0.95-0.98), you get the best of both worlds: dramatically lower variance than Monte Carlo, with very little bias.

Why do vanilla policy gradient methods require millions of samples for continuous control tasks?

Chapter 1: The Key Insight

Imagine you have a 1-step advantage estimator and a 2-step advantage estimator. The 1-step version has low variance but high bias. The 2-step version has slightly more variance but less bias. You could also build 3-step, 4-step, ..., k-step estimators, each trading a bit more variance for a bit less bias.

GAE's key insight: take an exponentially-weighted average of all of these k-step estimators. Give the 1-step estimator weight (1−λ), the 2-step estimator weight (1−λ)λ, the 3-step weight (1−λ)λ², and so on.
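A quick sanity check, offered as a sketch: for 0 ≤ λ < 1 these weights form a geometric series that sums to one, so the combination is a proper weighted average of the k-step estimators.

$$\sum_{k=1}^{\infty} (1-\lambda)\lambda^{k-1} = (1-\lambda)\cdot\frac{1}{1-\lambda} = 1$$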

This exponential weighting is not arbitrary. It collapses into a remarkably simple formula:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l}$$

Where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual: the one-step prediction error of the value function.

The analogy to TD(λ): This construction is directly analogous to TD(λ), which estimates value functions by exponentially weighting n-step returns. GAE applies the same idea to estimate advantage functions. Just as TD(λ) with λ = 0 gives TD(0) and λ = 1 gives Monte Carlo returns, GAE(γ, 0) gives the 1-step TD advantage and GAE(γ, 1) gives the Monte Carlo advantage.

The beauty is in the simplicity. To compute GAE in practice, you just compute all the TD residuals $\delta_t$ along the trajectory, then run a single backward pass accumulating $(\gamma\lambda)$-discounted sums. That's it. No complex optimization, no second-order methods, just a discounted cumulative sum.

What does GAE take an exponentially-weighted average of?

Chapter 2: Policy Gradient Review

Let's build the foundations. A policy $\pi_\theta(a \mid s)$ maps states to action distributions. The goal is to maximize expected total reward. The policy gradient theorem gives us:

$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t \,\nabla_\theta \ln \pi_\theta(a_t \mid s_t)\right]$$

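Where $\Psi_t$ can be any of the following (the choices listed in the paper):
Total reward of the trajectory: $\sum_{t=0}^{\infty} r_t$
Reward following action $a_t$: $\sum_{t'=t}^{\infty} r_{t'}$
Baselined reward-to-go: $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$
State-action value function: $Q^\pi(s_t, a_t)$
Advantage function: $A^\pi(s_t, a_t)$
TD residual: $r_t + V^\pi(s_{t+1}) - V^\pi(s_t)$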

Why the advantage function?

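The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better action a is compared to the policy's average behavior in state s. It has a clean interpretation: if $A^\pi(s, a) > 0$, the action is better than average and its probability should go up; if $A^\pi(s, a) < 0$, it is worse than average and its probability should go down.

Using the advantage as $\Psi_t$ gives close to the lowest variance among the unbiased choices above. But we don't know $A^\pi$; we have to estimate it. The question is how.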
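To make the estimator concrete, here is a minimal sketch for a toy case. It assumes a state-independent softmax policy whose parameters are just the logits; the function names are illustrative and not from the paper. It also shows why subtracting a baseline does not bias the gradient: the score function averages to zero under the policy.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)

def grad_log_softmax(logits, action):
    """Gradient of ln pi(action) with respect to the logits: one_hot(action) - pi."""
    pi = softmax(logits)
    one_hot = np.zeros_like(pi)
    one_hot[action] = 1.0
    return one_hot - pi

def policy_gradient_estimate(logits, actions, psi):
    """Single-trajectory estimate: sum_t psi_t * grad ln pi(a_t).

    Assumes a toy, state-independent policy, so theta is the logit vector.
    """
    g = np.zeros_like(logits)
    for a, p in zip(actions, psi):
        g += p * grad_log_softmax(logits, a)
    return g
```

Because the expected score is zero, adding or subtracting any quantity that does not depend on the action (such as a baseline $b(s_t)$) leaves the expected gradient unchanged and only affects variance.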

The discounted setting

The paper works with a discounted advantage $A^{\pi,\gamma}$, where $\gamma \in [0, 1]$ downweights future rewards. Crucially, they treat γ not as part of the problem definition but as a variance reduction parameter. Setting γ < 1 introduces bias but reduces variance by making distant rewards contribute less.

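$$V^{\pi,\gamma}(s_t) = \mathbb{E}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right]$$
$$A^{\pi,\gamma}(s_t, a_t) = Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t)$$

[Figure: Policy gradient $\Psi_t$ choices, variance vs. bias. Each choice of $\Psi_t$ trades off variance and bias differently; the advantage function is the sweet spot.]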

Why is the advantage function $A^\pi(s, a)$ the preferred choice for $\Psi_t$ in the policy gradient?

Chapter 3: The Bias-Variance Tradeoff

We need to estimate $A^{\pi,\gamma}(s_t, a_t)$, but we don't have the true value function; we only have a learned approximation $V$. This introduces bias. Let's see exactly how.

The TD residual as a 1-step advantage estimate

Define the TD residual:

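$$\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$$

If $V = V^{\pi,\gamma}$ (the true value function), then $\delta_t^V$ is an unbiased estimator of the advantage:

$$\mathbb{E}_{s_{t+1}}\left[\delta_t^{V^{\pi,\gamma}}\right] = \mathbb{E}_{s_{t+1}}\left[r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)\right] = A^{\pi,\gamma}(s_t, a_t)$$

But if $V$ is approximate (as it always is in practice), $\delta_t^V$ is biased. The size of the bias is set by the value function's error at $s_t$ and $s_{t+1}$.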
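As a short worked step (a sketch derived from the definitions above, not an equation quoted from the paper), subtract the true advantage from the expected TD residual. The reward terms cancel, leaving only value-function errors:

$$\mathbb{E}_{s_{t+1}}\left[\delta_t^{V}\right] - A^{\pi,\gamma}(s_t, a_t) = \gamma\,\mathbb{E}_{s_{t+1}}\left[V(s_{t+1}) - V^{\pi,\gamma}(s_{t+1})\right] - \left(V(s_t) - V^{\pi,\gamma}(s_t)\right)$$

So the estimator is unbiased exactly when these two error terms cancel, which holds trivially when $V = V^{\pi,\gamma}$.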

The Monte Carlo advantage: no bias, lots of variance

Alternatively, we can use the empirical returns minus the baseline:

$$\hat{A}_t^{\mathrm{MC}} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$$

This uses no bootstrapping: it waits for all actual rewards to arrive. If $V$ is used only as a baseline (subtracted but not bootstrapped from), the estimator is unbiased regardless of $V$'s accuracy. But every reward from t onward adds variance. For a 1000-step episode, $\hat{A}_t^{\mathrm{MC}}$ at timestep 0 is a sum of 1000 noisy terms.
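A minimal sketch of the Monte Carlo advantage for one finished episode follows. It assumes the episode terminates after the last reward, so nothing is bootstrapped; the function names are illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_l gamma^l * r_{t+l} for every t, computed in one backward pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mc_advantages(rewards, values, gamma):
    """Monte Carlo advantage: empirical return minus the baseline V(s_t).

    `values` holds V(s_0), ..., V(s_{T-1}); it is only subtracted, never
    bootstrapped from.
    """
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)
```

For example, with rewards `[1, 1, 1]`, baseline values `[1, 1, 1]` and `gamma = 0.5`, the returns are `[1.75, 1.5, 1.0]` and the advantages are `[0.75, 0.5, 0.0]`.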

The fundamental tradeoff:
1-step TD ($\delta_t$): Low variance (one random step) but biased (error in $V$ propagates).
Monte Carlo: Unbiased (uses real rewards) but high variance (sums many noisy terms).
We want something in between. That's exactly what k-step estimators and GAE provide.

[Interactive figure: Bias vs. variance, 1-step TD vs. Monte Carlo. Sampled TD advantage estimates are tight but shifted (biased); Monte Carlo estimates are centered but scattered.]

What causes bias in the 1-step TD advantage estimator $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$?

Chapter 4: Building Up to GAE

We've seen the two extremes: 1-step TD (low variance, high bias) and Monte Carlo (no bias, high variance). Now let's build the bridge between them.

The k-step advantage estimator

Take the sum of k consecutive TD residuals:

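$$\hat{A}_t^{(1)} = \delta_t = -V(s_t) + r_t + \gamma V(s_{t+1})$$
$$\hat{A}_t^{(2)} = \delta_t + \gamma\delta_{t+1} = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$$
$$\hat{A}_t^{(3)} = \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3})$$

In general:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$$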

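Look at the pattern. The k-step estimator uses k actual rewards, then bootstraps from $V$ at step $t+k$. As k increases, the bias shrinks, because the bootstrap term $\gamma^k V(s_{t+k})$ is discounted more heavily, while the variance grows, because more random rewards enter the sum.

As $k \to \infty$, the bootstrap term $\gamma^k V(s_{t+k})$ vanishes (for γ < 1) and we recover the Monte Carlo estimator.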
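The two ways of writing the k-step estimator can be checked against each other numerically. The sketch below assumes `values` has one more entry than `rewards` (the last entry is the value used for bootstrapping, 0 if the episode terminated) and that `t + k` does not run past the end of the trajectory; the function names are illustrative.

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), for t = 0, ..., T-1."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def k_step_advantage(rewards, values, gamma, t, k):
    """Sum of k discounted TD residuals starting at step t."""
    deltas = td_residuals(rewards, values, gamma)
    discounts = gamma ** np.arange(k)
    return float(np.sum(discounts * deltas[t:t + k]))

def k_step_advantage_expanded(rewards, values, gamma, t, k):
    """Same quantity written as k real rewards plus one bootstrapped value."""
    discounts = gamma ** np.arange(k)
    reward_part = float(np.sum(discounts * np.asarray(rewards[t:t + k], dtype=float)))
    return -values[t] + reward_part + gamma ** k * values[t + k]
```

The intermediate value terms cancel in pairs, which is why both functions agree for every valid k.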

The insight that leads to GAE: Each $\hat{A}_t^{(k)}$ sits at a different point on the bias-variance spectrum. Instead of picking one k, why not combine all of them? An exponentially weighted average with weights $(1-\lambda)\lambda^{k-1}$ gives nearby steps more influence (low variance) while still incorporating distant steps (low bias). This is GAE.

[Interactive figure: k-step advantage estimators. Increasing k makes the estimator use more real rewards and less bootstrapping.]

As k increases in the k-step advantage estimator $\hat{A}_t^{(k)}$, what happens to bias and variance?

Chapter 5: GAE(γ,λ)

Now for the main result. GAE is defined as the exponentially-weighted average of all k-step advantage estimators:

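$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} := (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + \cdots\right)$$

This looks complicated, but it collapses into something much simpler. Expand the sum, collect terms, sum the resulting geometric series, and you get:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l}$$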

That's it. The GAE advantage is just a discounted sum of TD residuals, with effective discount (γλ) instead of γ. Let's verify the two special cases:

λ = 0: $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. 1-step TD: low variance, high bias if $V$ is inaccurate.
λ = 1: $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$. Monte Carlo: unbiased ($V$ only as baseline), high variance.
0 < λ < 1: Smooth interpolation between the two extremes. Typical: λ = 0.95–0.98, γ = 0.99.

The derivation in detail

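Start with the definition and expand each $\hat{A}_t^{(k)}$:

$$(1-\lambda)\left[\delta_t + \lambda(\delta_t + \gamma\delta_{t+1}) + \lambda^2(\delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2}) + \cdots\right]$$

Collect the coefficient of each $\delta_{t+l}$. The term $\delta_{t+l}$ appears in the k-step estimator for all $k \ge l+1$, so its coefficient is:

$$(1-\lambda)\cdot\gamma^l\cdot\left(\lambda^l + \lambda^{l+1} + \lambda^{l+2} + \cdots\right) = (1-\lambda)\cdot\gamma^l\cdot\frac{\lambda^l}{1-\lambda} = (\gamma\lambda)^l$$

And there it is: every $\delta_{t+l}$ gets weight $(\gamma\lambda)^l$.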
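The identity can also be checked numerically. This sketch makes two assumptions: TD residuals past the end of the trajectory are treated as zero, and the infinite average over k is truncated at `max_k` terms, which is accurate only for λ < 1. The function names are illustrative.

```python
import numpy as np

def gae_from_k_step_average(deltas, gamma, lam, max_k=2000):
    """(1 - lam) * sum_k lam^(k-1) * A^(k) at t = 0, truncated at max_k terms."""
    deltas = np.asarray(deltas, dtype=float)
    # Pad with zeros so every k-step estimator up to max_k is defined.
    padded = np.concatenate([deltas, np.zeros(max_k)])
    # Entry k-1 of the running sum is the k-step estimator A^(k).
    k_step = np.cumsum(gamma ** np.arange(max_k) * padded[:max_k])
    weights = (1.0 - lam) * lam ** np.arange(max_k)
    return float(np.sum(weights * k_step))

def gae_closed_form(deltas, gamma, lam):
    """sum_l (gamma * lam)^l * delta_l at t = 0."""
    deltas = np.asarray(deltas, dtype=float)
    return float(np.sum((gamma * lam) ** np.arange(len(deltas)) * deltas))
```

For any λ < 1 the two functions should agree to within floating-point error, since the truncated tail carries weight on the order of λ raised to `max_k`.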

Computing GAE in practice

The computation is a simple backward pass through the trajectory:

Step 1: Compute all TD residuals: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
Step 2: Backward sweep: $\hat{A}_{T-1} = \delta_{T-1}$, then $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$

Two lines of code. That's the entire implementation.
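A minimal NumPy sketch of those two steps for a single trajectory is below. It assumes `values` has length T + 1, with the last entry being the value of the state after the final step (0 if the episode terminated); the function name and default hyperparameters are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one trajectory of length T."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # Step 1: TD residuals for every timestep.
    deltas = rewards + gamma * values[1:] - values[:-1]
    # Step 2: backward sweep, A_t = delta_t + gamma * lam * A_{t+1}.
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` returns the TD residuals themselves, and `lam=1` with a zero final value returns the Monte Carlo advantages, matching the two special cases above.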

[Interactive figure: GAE(γ, λ), the λ slider. Bars show the weight $(\gamma\lambda)^l$ given to each TD residual $\delta_{t+l}$ as λ moves between 1-step TD (λ = 0) and Monte Carlo (λ = 1). Defaults: λ = 0.95, γ = 0.99.]

What is the GAE advantage estimator with λ = 0?

Chapter 6: Why GAE Works

To understand why GAE is so effective, we need to think carefully about what γ and λ each do.

Two parameters, two roles

γ (discount factor) determines the scale of the value function $V^{\pi,\gamma}$. Taking γ < 1 introduces bias into the policy gradient regardless of V's accuracy. It says: "I don't care about rewards more than ~1/(1−γ) steps in the future." With γ = 0.99, that's a horizon of ~100 steps.

λ (GAE parameter) controls how much we bootstrap from V. Taking λ < 1 introduces bias only when V is inaccurate. If V were perfect, any λ would give an unbiased advantage estimate. Since V is always imperfect in practice, lower λ means more reliance on V (more bias), but dramatically less variance.

The crucial asymmetry: λ introduces far less bias than γ for a reasonably accurate value function. That's why the best λ (typically 0.95-0.98) is much lower than the best γ (typically 0.99-0.999). You can aggressively reduce variance with λ while paying very little in bias — as long as your value function is decent.
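As a rough worked example, assume the common rule of thumb that a discount $d$ gives an effective horizon of about $1/(1-d)$ steps. With γ = 0.99 and λ = 0.95, compare the horizon of rewards set by γ with the horizon of TD residuals set by γλ:

$$\frac{1}{1-\gamma} = \frac{1}{1-0.99} = 100, \qquad \frac{1}{1-\gamma\lambda} = \frac{1}{1-0.9405} \approx 16.8$$

Under that assumption, each advantage estimate leans mostly on the next ~17 TD residuals, even though the value function itself looks ~100 steps ahead.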

Connection to reward shaping

The paper shows that GAE can be interpreted through the lens of reward shaping (Ng et al., 1999). Define a shaped reward:

$$\tilde{r}_t = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t$$

Then GAE(γ, λ) computes the γλ-discounted sum of these shaped rewards. Reward shaping theory tells us that potential-based shaping of the form $\gamma V(s') - V(s)$ leaves the optimal policy of the γ-discounted problem unchanged, so the shaped reward $\delta_t$ targets the same optimal policy as the original reward $r_t$. The benefit is lower variance, because $V$ absorbs the "expected" part of future returns. The steeper discount γλ is then what trades a little bias for that variance reduction.
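A short worked step shows why the shaping is harmless under the original discount γ. It is a sketch that assumes $\gamma^l V(s_{t+l}) \to 0$ as $l \to \infty$. Summing the shaped rewards with discount γ, the value terms cancel in pairs:

$$\sum_{l=0}^{\infty} \gamma^l \tilde{r}_{t+l} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} + \sum_{l=0}^{\infty} \gamma^{l+1} V(s_{t+l+1}) - \sum_{l=0}^{\infty} \gamma^l V(s_{t+l}) = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$$

The shaped return differs from the original return only by $V(s_t)$, which does not depend on the action taken at time t.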

Connection to eligibility traces

GAE is the advantage-function analog of TD(λ) with eligibility traces. In TD(λ), the value update at each step is spread backward along the trajectory using traces that decay by γλ per step. In GAE, the advantage estimate at each step accumulates forward-looking TD residuals decaying by γλ per step. The forward view and backward view are equivalent — GAE is the "forward view."

Why does λ < 1 introduce less bias than γ < 1, given a reasonably accurate value function?

Chapter 7: Trust Region for Value Function

GAE relies on V being a good approximation. If V is terrible, even moderate λ values will give biased gradients. So training V well is critical. The paper proposes using a trust region method for the value function — not just the policy.

The value function objective

Standard approach: minimize the MSE between Vφ(st) and the empirical returns. But naive MSE minimization can overfit to the current batch of data, causing V to become worse on states not in the batch. This is especially problematic because GAE bootstraps from V — overfitting V can destabilize the entire training loop.

Trust region for V

The paper constrains the value function update to stay close to the previous value function, using a trust region:

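$$\underset{\phi}{\text{minimize}} \quad \sum_{n=1}^{N} \left\| V_\phi(s_n) - \hat{V}_n \right\|^2$$
$$\text{subject to} \quad \frac{1}{N} \sum_{n=1}^{N} \left\| V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n) \right\|^2 \le \epsilon$$

Where $\hat{V}_n$ is the empirical discounted return $\sum_{l=0}^{\infty} \gamma^l r_{t+l}$ observed from the timestep t at which $s_n$ was visited. The constraint prevents V from changing too much in a single update.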

Solving the constrained problem

They use a conjugate gradient method to compute the step direction and then choose the step size, similar to TRPO for the policy. The Hessian-vector products are computed efficiently, without forming the full matrix. In practice, the update is:

$$\phi \leftarrow \phi_{\text{old}} - \alpha \cdot H^{-1} g$$

Where $H$ is the Hessian of the constraint, $g$ is the gradient of the objective, and α is chosen so that the step satisfies the constraint. The minus sign reflects that the objective is being minimized.
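The following is a minimal sketch of such a constrained step, not the paper's implementation. It assumes a linear value function $V_\phi(s) = x(s)^\top \phi$ with a full-column-rank feature matrix, uses the constraint exactly as written above, and swaps conjugate gradient for a direct solve with `numpy.linalg.solve` because the toy problem is small. The function and argument names are illustrative.

```python
import numpy as np

def trust_region_value_update(features, targets, phi_old, epsilon):
    """One trust-region least-squares step for a linear value function."""
    X = np.asarray(features, dtype=float)   # shape (N, d), one row per state
    y = np.asarray(targets, dtype=float)    # empirical returns, shape (N,)
    phi_old = np.asarray(phi_old, dtype=float)
    n = X.shape[0]
    residual = X @ phi_old - y
    g = 2.0 * X.T @ residual                # gradient of sum_n (V(s_n) - target_n)^2
    H = X.T @ X / n                         # constraint is step^T H step for linear V
    direction = -np.linalg.solve(H, g)      # direct solve in place of conjugate gradient
    curvature = direction @ H @ direction
    if curvature <= 0.0:
        return phi_old.copy()               # zero gradient: already at the minimum
    alpha_boundary = np.sqrt(epsilon / curvature)  # step landing on the constraint boundary
    alpha_full = 1.0 / (2.0 * n)                   # step reaching the unconstrained minimum
    alpha = min(alpha_boundary, alpha_full)
    return phi_old + alpha * direction
```

With a small `epsilon` the update stops on the trust-region boundary; with a very large `epsilon` it reduces to an ordinary least-squares fit. For a neural network value function, `H` would be replaced by a Gauss-Newton approximation accessed only through matrix-vector products.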

The full algorithm:
(1) Collect a batch of trajectories using the current policy.
(2) Compute GAE advantages using the current V.
(3) Update the policy using a trust region (TRPO-style).
(4) Update the value function using a trust region (constrained MSE).
Repeat. Steps 3 and 4 both use trust regions, but for different objectives.

[Figure: Trust region, policy vs. value function. Both the policy and the value function are updated with trust region constraints to prevent overfitting.]

Why does the paper apply a trust region constraint to the value function, not just the policy?

Chapter 8: Results

The paper tests GAE on 3D locomotion tasks in the MuJoCo physics simulator. These are hard: the robot has to coordinate many actuated joints, manage balance, and produce efficient gaits, all with a policy that maps raw joint angles and velocities to joint torques.

The robots

All policies are neural networks (three hidden layers with 100, 50, and 25 tanh units, as reported in the paper) mapping directly from raw kinematics to joint torques. No hand-crafted features, no phase variables, no gait libraries.

Key results

The paper systematically varies γ and λ to measure their effects on the learning curves.

The standing-up task: The biped must coordinate its entire body to get from lying on the ground to standing upright. This requires a long sequence of precisely coordinated movements: rolling, pushing up, balancing. Plain policy gradients struggle with this. GAE with a well-chosen (γ, λ) learns it from roughly 1-2 weeks' worth of simulated experience. At the time, this was a strong result for neural network locomotion policies.

[Figure: Effect of λ on learning performance. Learning curves for different λ values on bipedal locomotion; intermediate λ (0.96–0.99) outperforms both extremes.]

What range of λ values produced the best results on 3D locomotion tasks?

Chapter 9: Connections

What GAE built on

REINFORCE (Williams, 1992): The original policy gradient. GAE addresses its crippling variance by introducing a principled advantage estimator.

TD(λ) (Sutton, 1988): The eligibility trace framework for value function estimation. GAE is the advantage-function analog — same exponential weighting, applied to a different quantity.

Actor-critic methods (Konda & Tsitsiklis, 2003): Use learned value functions to reduce variance. GAE provides a principled, tunable way to use that value function for advantage estimation.

TRPO (Schulman et al., 2015): Trust region optimization for the policy. The GAE paper extends trust regions to the value function and uses TRPO-style updates for the policy.

What GAE enabled

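PPO (Schulman et al., 2017): One of the most widely used RL algorithms. PPO uses GAE directly: the advantage estimates in PPO's clipped objective are typically computed with GAE, commonly with γ = 0.99 and λ = 0.95. Much of PPO's practical performance depends on these low-variance advantage estimates.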

A3C (Mnih et al., 2016): Asynchronous actor-critic with n-step returns. It was developed around the same time rather than built on GAE, but its n-step advantage corresponds to GAE with λ = 1 on truncated trajectories.

RLHF / ChatGPT: When you train a language model with PPO from human feedback, GAE computes the advantage estimates that determine which tokens to reinforce. GAE is running behind many PPO-based RLHF pipelines.

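Other actor-critic methods: On-policy methods such as MAPPO use GAE directly. Not every method does: SAC is off-policy and learns Q-functions, IMPALA uses the V-trace correction, and GRPO replaces the learned value function with group-relative advantages. For on-policy policy gradients with a learned critic, though, GAE is the de facto standard.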
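In PPO-style code, rollouts usually span several episodes, so the backward sweep has to stop at episode boundaries. The sketch below shows one common way to do that. The `dones` convention, the `last_value` argument and the function name are assumptions for illustration, not something specified in the GAE paper.

```python
import numpy as np

def gae_with_dones(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over a rollout that may contain several episodes.

    dones[t] is True when the episode ended after step t, so nothing is
    bootstrapped and no advantage is carried across that boundary.
    last_value is V of the state following the final step of the rollout.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    next_values = np.append(values[1:], last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] * not_done[t] - values[t]
        running = delta + gamma * lam * not_done[t] * running
        advantages[t] = running
    # Advantage plus value is a common regression target for the critic.
    returns = advantages + values
    return advantages, returns
```

Many implementations also normalize the advantages within each batch before the policy update; that is a separate design choice and is left out here.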

GAE's legacy: This paper addressed one of RL's most fundamental problems: how to estimate advantages with a controllable bias-variance tradeoff. The formula $\hat{A}_t = \sum_{l} (\gamma\lambda)^l \delta_{t+l}$ is used in most modern on-policy policy gradient implementations. It's a couple of lines of code that helped make deep RL practical for continuous control, game-playing, and language model alignment.

Cheat sheet

Core equation: $\hat{A}_t^{\mathrm{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
Key hyperparams: γ = 0.99–0.999 (reward horizon), λ = 0.95–0.98 (bias-variance)
Special cases: λ = 0: 1-step TD (low variance, high bias). λ = 1: Monte Carlo (no bias, high variance)
Implementation: Backward sweep: $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$
Impact: Used by PPO, PPO-based RLHF, and most on-policy actor-critic methods

How does PPO use GAE?