GAE — Veanors

Chapter 0: The Problem

You want to teach a simulated robot to run. The policy outputs joint torques at every timestep, and the reward comes from how far the robot moves forward. But here is the trouble: the reward the robot gets at timestep 100 was influenced by actions at timesteps 1, 2, 3, ..., 99. Which actions deserve credit for the good outcome? This is the credit assignment problem.

Policy gradient methods tackle this head-on: estimate the gradient of total reward with respect to the policy parameters, then take a gradient step. Simple in principle. Devastating in practice.

The problem is variance. The gradient estimator is a sum over the entire trajectory, and the effect of any single action is confounded with the effects of every other action. With a 1000-step episode, the gradient signal for any one action is buried under the noise of 999 others. To get a reliable gradient, you need millions of samples.

The core tension: You can reduce variance by using a learned value function to estimate future returns (instead of waiting for them). But a value function introduces bias — if your estimate is wrong, you'll get a biased gradient that converges to the wrong solution. High variance means slow learning. Bias means wrong learning. We need a principled way to trade one for the other.

GAE (Generalized Advantage Estimation) provides exactly this tradeoff. It introduces a single parameter λ ∈ [0, 1] that smoothly interpolates between:

λ = 0: Low variance, high bias (1-step TD — trust the value function completely)
λ = 1: High variance, zero bias (Monte Carlo — don't trust the value function at all)

By setting λ somewhere in between (typically 0.95-0.98), you get most of the benefit of both: dramatically lower variance than Monte Carlo, with very little bias as long as the value function is reasonably accurate.

Why do vanilla policy gradient methods require millions of samples for continuous control tasks?

The gradient estimator has extremely high variance because each action's effect is confounded with all other actions in the trajectory
The neural network is too large to train quickly
The simulator runs slowly

Chapter 1: The Key Insight

Imagine you have a 1-step advantage estimator and a 2-step advantage estimator. The 1-step version has low variance but high bias. The 2-step version has slightly more variance but less bias. You could also build 3-step, 4-step, ..., k-step estimators, each trading a bit more variance for a bit less bias.

GAE's key insight: take an exponentially-weighted average of all of these k-step estimators. Give the 1-step estimator weight (1−λ), the 2-step estimator weight (1−λ)λ, the 3-step weight (1−λ)λ², and so on.
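To make the weighting concrete, here is a minimal Python sketch that builds the weights (1−λ)λ^(k−1) for the first few k-step estimators and checks that they sum to almost 1. The function name `gae_mixture_weights` and the cutoff of 500 estimators are illustrative assumptions, not something from the paper.

```python
def gae_mixture_weights(lam: float, num_estimators: int) -> list[float]:
    """Weight (1 - lam) * lam**(k - 1) for the k-step estimator, k = 1..num_estimators."""
    return [(1.0 - lam) * lam ** (k - 1) for k in range(1, num_estimators + 1)]


weights = gae_mixture_weights(lam=0.95, num_estimators=500)
print(weights[:3])   # roughly 0.05, 0.0475, 0.0451: each weight is lam times the previous one
print(sum(weights))  # close to 1; the missing mass is lam**500
```

Because the weights form a geometric series, truncating after n estimators leaves out exactly λ^n of the total weight.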

This exponential weighting is not arbitrary. It collapses into a remarkably simple formula:

Â_t^{GAE(γ,λ)} = ∑_{l=0}^{∞} (γλ)^l δ_{t+l}

Where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD residual: the one-step prediction error of the value function.

The analogy to TD(λ): This construction is directly analogous to TD(λ), which estimates value functions by exponentially weighting n-step returns. GAE applies the same idea to estimate advantage functions. Just as TD(λ) with λ = 0 gives TD(0) and λ = 1 gives Monte Carlo returns, GAE(γ, 0) gives the 1-step TD advantage and GAE(γ, 1) gives the Monte Carlo advantage.

The beauty is in the simplicity. To compute GAE in practice, you just compute all the TD residuals δ_t along the trajectory, then run a single backward pass accumulating (γλ)-discounted sums. That's it. No complex optimization, no second-order methods — just a discounted cumulative sum.

What does GAE take an exponentially-weighted average of?

All k-step advantage estimators, from k=1 (1-step TD) to k=∞ (Monte Carlo), weighted by (1−λ)λ^{k−1}
Different reward functions
Multiple value function estimates

Chapter 2: Policy Gradient Review

Let's build the foundations. A policy π_θ(a|s) maps states to action distributions. The goal is to maximize expected total reward. The policy gradient theorem gives us:

g = E[∑_{t=0}^{∞} Ψ_t ∇_θ ln π_θ(a_t|s_t)]

Where Ψ_t can be any of:

Total trajectory reward ∑_t r_t (REINFORCE)
Reward-to-go ∑_{t'≥t} r_{t'} (causality trick)
Reward-to-go minus baseline ∑_{t'≥t} r_{t'} − b(s_t)
Q-function Q^π(s_t, a_t)
Advantage function A^π(s_t, a_t) ← lowest variance
TD residual r_t + γV^π(s_{t+1}) − V^π(s_t)
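The first two choices in the list differ only in which rewards they count. The following Python sketch contrasts them on a made-up three-step reward sequence; `rewards_to_go` is a hypothetical helper name, and the optional `gamma` argument is an addition for later reuse.

```python
def rewards_to_go(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """For each t, the (optionally discounted) sum of rewards from t to the end."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


rewards = [1.0, 0.0, 2.0]           # made-up rewards for illustration
total_reward = sum(rewards)         # REINFORCE: the same weight for every timestep
rtg = rewards_to_go(rewards)        # causality: only rewards from t onward
print(total_reward, rtg)
```

With the total reward, the action at the last step is credited with the reward earned at the first step, which it could not have influenced. Reward-to-go removes that noise term.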

Why the advantage function?

The advantage function A^π(s, a) = Q^π(s, a) − V^π(s) measures how much better action a is compared to the policy's average behavior in state s. It has a clean interpretation:

A > 0: This action was better than average → increase its probability
A < 0: This action was worse than average → decrease its probability
A = 0: This action was exactly average → no change
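A small worked example may help here. The sketch below uses made-up Q-values and action probabilities for a single state (all numbers are assumptions for illustration) and computes V as the policy-weighted average of Q, then the advantage of each action.

```python
# Hypothetical Q^pi(s, a) and pi(a | s) for one state with three actions.
q_values = {"left": 1.0, "stay": 2.0, "right": 4.0}
policy = {"left": 0.25, "stay": 0.5, "right": 0.25}

# V^pi(s) is the expected Q-value under the policy's own action distribution.
v = sum(policy[a] * q_values[a] for a in policy)

# A^pi(s, a) = Q^pi(s, a) - V^pi(s): positive means better than average.
advantages = {a: q_values[a] - v for a in policy}

# Averaged under the policy, the advantages cancel out.
mean_advantage = sum(policy[a] * advantages[a] for a in policy)
print(v, advantages, mean_advantage)
```

Only "right" has a positive advantage here, so a policy gradient step would shift probability toward it and away from the other two actions.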

Using the advantage as Ψ_t gives the lowest-variance unbiased gradient estimator (among the choices above). But we don't know A^π — we have to estimate it. The question is how.

The discounted setting

The paper works with a discounted advantage A^π,γ, where γ ∈ [0, 1] downweights future rewards. Crucially, they treat γ not as part of the problem definition but as a variance reduction parameter. Setting γ < 1 introduces bias but reduces variance by making distant rewards contribute less.

V^{π,γ}(s_t) = E[∑_{l=0}^{∞} γ^l r_{t+l}]

A^{π,γ}(s_t, a_t) = Q^{π,γ}(s_t, a_t) − V^{π,γ}(s_t)

Policy Gradient Ψ_t Choices: Variance vs. Bias

Each choice of Ψ_t trades off variance and bias differently. The advantage function is the sweet spot.

Why is the advantage function A^π(s, a) the preferred choice for Ψ_t in the policy gradient?

It measures how much better an action is compared to the policy's average, giving the lowest variance among unbiased choices while naturally centering the gradient around zero
It is easier to compute than the Q-function
It converges faster than other choices

Chapter 3: The Bias-Variance Tradeoff

We need to estimate A^π,γ(s_t, a_t), but we don't have the true value function — we have a learned approximation V. This introduces bias. Let's see exactly how.

The TD residual as a 1-step advantage estimate

Define the TD residual:

δ_t^V = r_t + γV(s_{t+1}) − V(s_t)

If V = V^π,γ (the true value function), then δ_t is an unbiased estimator of the advantage:

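E_{s_{t+1}}[δ_t^{V^{π,γ}}] = E_{s_{t+1}}[r_t + γV^{π,γ}(s_{t+1}) − V^{π,γ}(s_t)] = A^{π,γ}(s_t, a_t)

But if V is approximate (as it always is in practice), δ_t is biased. The bias comes from V's errors at s_t and s_{t+1}: the estimator inherits γ times the expected error at the next state, minus the error at the current state.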
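To see where this bias comes from, the following short derivation subtracts the true advantage from the expected TD residual. It is a sketch in the notation above, with expectations taken over r_t and s_{t+1} given s_t and a_t.

```latex
\begin{align*}
\mathbb{E}\big[\delta_t^{V}\big] - A^{\pi,\gamma}(s_t, a_t)
  &= \mathbb{E}\big[r_t + \gamma V(s_{t+1})\big] - V(s_t)
     - \Big(\mathbb{E}\big[r_t + \gamma V^{\pi,\gamma}(s_{t+1})\big] - V^{\pi,\gamma}(s_t)\Big) \\
  &= \gamma\,\mathbb{E}\big[V(s_{t+1}) - V^{\pi,\gamma}(s_{t+1})\big]
     - \big(V(s_t) - V^{\pi,\gamma}(s_t)\big)
\end{align*}
```

The second term depends only on s_t, so it behaves like a baseline and leaves the expected policy gradient unchanged. The bootstrap error at s_{t+1} is the part that can bias the gradient.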

The Monte Carlo advantage: no bias, lots of variance

Alternatively, we can use the empirical returns minus the baseline:

Â_t^{MC} = ∑_{l=0}^{∞} γ^l r_{t+l} − V(s_t)

This uses no bootstrapping — it waits for all actual rewards to arrive. If V is used only as a baseline (subtracted but not bootstrapped from), the estimator is unbiased regardless of V's accuracy. But every reward from t onward adds variance. For a 1000-step episode, Â_t^MC at timestep 0 is a sum of 1000 noisy terms.

The fundamental tradeoff:
1-step TD (δ_t): Low variance (one random step) but biased (error in V propagates).
Monte Carlo: Unbiased (uses real rewards) but high variance (sums many noisy terms).
We want something in between. That's exactly what k-step estimators and GAE provide.
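The variance side of this tradeoff is easy to check numerically. The sketch below uses a toy setup that is purely an assumption for illustration: every reward is 1 plus Gaussian noise, there are no actions, and V is the exact expected return. Both estimators then have mean 0, so the only difference is their spread. It does not demonstrate bias, since V is exact here.

```python
import random
import statistics


def sample_td_and_mc(horizon, gamma, noise_std, rng):
    """Return (1-step TD estimate, Monte Carlo estimate) at t = 0 for one episode."""
    rewards = [1.0 + rng.gauss(0.0, noise_std) for _ in range(horizon)]
    v0 = sum(gamma ** l for l in range(horizon))       # exact V(s_0)
    v1 = sum(gamma ** l for l in range(horizon - 1))   # exact V(s_1)
    td = rewards[0] + gamma * v1 - v0                  # one noisy term
    mc = sum(gamma ** l * r for l, r in enumerate(rewards)) - v0  # many noisy terms
    return td, mc


rng = random.Random(0)
samples = [sample_td_and_mc(horizon=100, gamma=0.99, noise_std=1.0, rng=rng)
           for _ in range(2000)]
td_var = statistics.variance([td for td, _ in samples])
mc_var = statistics.variance([mc for _, mc in samples])
print(f"TD variance: {td_var:.2f}, MC variance: {mc_var:.2f}")
```

In this setup the TD estimate contains one noise term while the Monte Carlo estimate sums 100 discounted noise terms, so its variance comes out many times larger.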

Figure: Bias vs. Variance: 1-Step TD vs. Monte Carlo. Sampled TD advantage estimates are tight but shifted (biased); MC estimates are centered but scattered.

What causes bias in the 1-step TD advantage estimator δ_t = r_t + γV(s_t+1) − V(s_t)?

The approximate value function V introduces error: bootstrapping from an inaccurate V means the estimator doesn't converge to the true advantage
The discount factor γ causes bias
The reward r_t is noisy

Chapter 4: Building Up to GAE

We've seen the two extremes: 1-step TD (low variance, high bias) and Monte Carlo (no bias, high variance). Now let's build the bridge between them.

The k-step advantage estimator

Take the γ-discounted sum of k consecutive TD residuals:

Â_t^{(1)} = δ_t = −V(s_t) + r_t + γV(s_{t+1})

Â_t^{(2)} = δ_t + γδ_{t+1} = −V(s_t) + r_t + γr_{t+1} + γ²V(s_{t+2})

Â_t^{(3)} = δ_t + γδ_{t+1} + γ²δ_{t+2} = −V(s_t) + r_t + γr_{t+1} + γ²r_{t+2} + γ³V(s_{t+3})

In general:

Â_t^{(k)} = ∑_{l=0}^{k−1} γ^l δ_{t+l} = −V(s_t) + r_t + γr_{t+1} + ··· + γ^{k−1}r_{t+k−1} + γ^k V(s_{t+k})

Look at the pattern. The k-step estimator uses k actual rewards, then bootstraps from V at step t+k. As k increases:

More real rewards → less reliance on V → less bias
More random terms summed → more variance

As k → ∞, the bootstrap term γ^k V(s_{t+k}) vanishes (for γ < 1 and bounded V) and we recover the Monte Carlo estimator.
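The two ways of writing the k-step estimator (a discounted sum of TD residuals, or k real rewards plus a bootstrap) can be checked against each other numerically. This is a minimal sketch with made-up rewards and value estimates; the function names are hypothetical, and `values` is assumed to hold one more entry than `rewards`.

```python
def k_step_advantage(rewards, values, gamma, t, k):
    """Discounted sum of k TD residuals starting at time t."""
    total = 0.0
    for l in range(k):
        delta = rewards[t + l] + gamma * values[t + l + 1] - values[t + l]
        total += gamma ** l * delta
    return total


def k_step_advantage_expanded(rewards, values, gamma, t, k):
    """Same quantity as k discounted rewards plus a bootstrap, minus V(s_t)."""
    discounted_rewards = sum(gamma ** l * rewards[t + l] for l in range(k))
    return -values[t] + discounted_rewards + gamma ** k * values[t + k]


rewards = [1.0, 0.5, -0.2, 2.0]        # made-up r_0..r_3
values = [0.8, 0.3, 1.1, 1.9, 0.0]     # made-up V(s_0)..V(s_4)
gamma = 0.9
for k in (1, 2, 3, 4):
    a = k_step_advantage(rewards, values, gamma, 0, k)
    b = k_step_advantage_expanded(rewards, values, gamma, 0, k)
    assert abs(a - b) < 1e-12          # intermediate V terms cancel
    print(k, round(a, 4))
```

The intermediate V terms cancel pairwise, which is why only V(s_t) and the final bootstrap V(s_{t+k}) survive in the expanded form.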

The insight that leads to GAE: Each Â_t^{(k)} sits at a different point on the bias-variance spectrum. Instead of picking one k, why not combine all of them? An exponentially-weighted average with weights (1−λ)λ^{k−1} gives nearby steps more influence (low variance) while still incorporating distant steps (low bias). This is GAE.

Figure: k-Step Advantage Estimators. As k increases, the estimator uses more real rewards and less bootstrapping.

As k increases in the k-step advantage estimator Â_t^(k), what happens to bias and variance?

Bias decreases (less reliance on approximate V) and variance increases (more random reward terms summed)
Both bias and variance decrease
Bias increases and variance decreases

Chapter 5: GAE(γ,λ)

Now for the main result. GAE is defined as the exponentially-weighted average of all k-step advantage estimators:

Â_t^{GAE(γ,λ)} := (1−λ)(Â_t^{(1)} + λÂ_t^{(2)} + λ²Â_t^{(3)} + ···)

This looks complicated, but it collapses into something beautiful. Expand the sum, collect the coefficient of each δ, sum the resulting geometric series, and you get:

Â_t^{GAE(γ,λ)} = ∑_{l=0}^{∞} (γλ)^l δ_{t+l}

That's it. The GAE advantage is just a discounted sum of TD residuals, with effective discount (γλ) instead of γ. Let's verify the two special cases:

λ = 0: Â_t = δ_t = r_t + γV(s_{t+1}) − V(s_t)
1-step TD: low variance, high bias if V is inaccurate

λ = 1: Â_t = ∑_{l=0}^{∞} γ^l δ_{t+l} = ∑_{l=0}^{∞} γ^l r_{t+l} − V(s_t)
Monte Carlo: unbiased (V only as baseline), high variance

0 < λ < 1: Smooth interpolation between the two extremes.
Typical: λ = 0.95–0.98, γ = 0.99

The derivation in detail

Start with the definition and expand each Â_t^(k):

(1−λ)[δ_t + λ(δ_t + γδ_{t+1}) + λ²(δ_t + γδ_{t+1} + γ²δ_{t+2}) + ···]

Collect the coefficient of each δ_{t+l}. The term δ_{t+l} appears in the k-step estimator for all k ≥ l+1, so its coefficient is:

(1−λ) · γ^l · (λ^l + λ^{l+1} + λ^{l+2} + ···) = (1−λ) · γ^l · λ^l/(1−λ) = (γλ)^l

And there it is: every δ_t+l gets weight (γλ)^l.
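The derivation can also be checked numerically. The sketch below computes the estimate at t = 0 both ways on a made-up finite episode: once from the definition as a weighted mixture of k-step estimators, and once from the closed form. It assumes the episode ends after T steps with no TD residuals afterwards, so every estimator with k ≥ T coincides with the T-step one and the infinite tail of weights sums to λ^(T−1). All function names are hypothetical.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has len(rewards) + 1 entries."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def gae_from_residuals(deltas, gamma, lam):
    """Closed form at t = 0: sum over l of (gamma * lam)**l * delta_l."""
    return sum((gamma * lam) ** l * d for l, d in enumerate(deltas))


def gae_from_k_step_mixture(deltas, gamma, lam):
    """Definition at t = 0: (1 - lam) * sum over k of lam**(k - 1) * A^(k)."""
    T = len(deltas)
    k_step = []                      # k_step[k - 1] holds the k-step estimator A^(k)
    running = 0.0
    for l, d in enumerate(deltas):
        running += gamma ** l * d
        k_step.append(running)
    head = sum((1 - lam) * lam ** (k - 1) * k_step[k - 1] for k in range(1, T))
    tail = lam ** (T - 1) * k_step[T - 1]   # total weight of all k >= T
    return head + tail


deltas = td_residuals([1.0, 0.5, -0.2, 2.0], [0.8, 0.3, 1.1, 1.9, 0.0], gamma=0.9)
for lam in (0.0, 0.5, 0.95, 1.0):
    print(lam, gae_from_residuals(deltas, 0.9, lam), gae_from_k_step_mixture(deltas, 0.9, lam))
```

The two columns agree for every λ, including the endpoints λ = 0 (only δ_0 survives) and λ = 1 (the plain γ-discounted sum of residuals).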

Computing GAE in practice

The computation is a simple backward pass through the trajectory:

Step 1: Compute all TD residuals: δ_t = r_t + γV(s_{t+1}) − V(s_t)

Step 2: Backward sweep: Â_{T−1} = δ_{T−1}, then Â_t = δ_t + γλ · Â_{t+1} for t = T−2, ..., 0

Two lines of code. That's the entire implementation.
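A minimal Python sketch of that backward pass follows. It is one possible implementation under stated assumptions, not the paper's code: it handles a single trajectory, and `values` is assumed to carry one extra entry V(s_T), set to 0 if s_T is terminal or to a bootstrap estimate if the trajectory was cut off early. The name `compute_gae` is hypothetical.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), one more entry than rewards
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_advantage = 0.0                 # nothing to accumulate after the last step
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # Step 1: TD residual
        next_advantage = delta + gamma * lam * next_advantage    # Step 2: backward sweep
        advantages[t] = next_advantage
    return advantages


rewards = [1.0, 1.0, 1.0]                # made-up trajectory
values = [0.5, 0.5, 0.5, 0.0]            # made-up V estimates; terminal value is 0
print(compute_gae(rewards, values, lam=0.0))   # equals the TD residuals
print(compute_gae(rewards, values, lam=1.0))   # equals discounted return minus V(s_t)
```

Batched implementations usually also reset the accumulator at episode boundaries; that bookkeeping is left out here to keep the two core lines visible.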

Figure: GAE(γ, λ) weights. The bars show the weight (γλ)^l given to each TD residual δ_{t+l} as λ moves between 1-step TD (λ=0) and Monte Carlo (λ=1). Shown for λ = 0.95, γ = 0.99.

What is the GAE advantage estimator with λ = 0?

Â_t = δ_t = r_t + γV(s_{t+1}) − V(s_t), the 1-step TD residual
Â_t = ∑ γ^l r_{t+l} − V(s_t), the Monte Carlo advantage
Â_t = 0, no advantage estimate

Chapter 6: Why GAE Works

To understand why GAE is so effective, we need to think carefully about what γ and λ each do.

Two parameters, two roles

γ (discount factor) determines the scale of the value function V^π,γ. Taking γ < 1 introduces bias into the policy gradient regardless of V's accuracy. It says: "I don't care about rewards more than ~1/(1−γ) steps in the future." With γ = 0.99, that's a horizon of ~100 steps.

λ (GAE parameter) controls how much we bootstrap from V. Taking λ < 1 introduces bias only when V is inaccurate. If V were perfect, any λ would give the correct advantage. Since V is always imperfect in practice, lower λ means more reliance on V (more bias), but dramatically less variance.

The crucial asymmetry: λ introduces far less bias than γ for a reasonably accurate value function. That's why the best λ (typically 0.95-0.98) is much lower than the best γ (typically 0.99-0.999). You can aggressively reduce variance with λ while paying very little in bias — as long as your value function is decent.
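One rough way to compare the two parameters is through the effective horizon 1/(1−d) of a discount d. The sketch below applies this to γ alone and to the combined factor γλ that weights the TD residuals. The helper name `effective_horizon` is an assumption, and the horizon is only a rule of thumb.

```python
def effective_horizon(discount: float) -> float:
    """Sum of 1 + d + d**2 + ... = 1 / (1 - d): roughly how many steps carry weight."""
    return 1.0 / (1.0 - discount)


gamma, lam = 0.99, 0.95
print(effective_horizon(gamma))         # about 100 steps of reward matter to V
print(effective_horizon(gamma * lam))   # about 17 TD residuals matter to the advantage
```

So with these settings the value function still accounts for rewards around 100 steps ahead, while the advantage estimate sums noise from only about 17 residuals.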

Connection to reward shaping

The paper shows that GAE can be interpreted through the lens of reward shaping (Ng et al., 1999). Define a shaped reward:

r̃_t = r_t + γV(s_t+1) − V(s_t) = δ_t

Then GAE(γ, λ) computes the γλ-discounted sum of these shaped rewards. Reward shaping theory tells us that the potential-based shaping γV(s') − V(s) doesn't change the optimal policy — so the shaped reward δ_t has the same optimal policy as the original reward r_t, but with much lower variance because V absorbs the "expected" part of future returns.
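As a quick check of how the shaping term behaves, here is the λ = 1 case worked out. It is a sketch that assumes γ^l V(s_{t+l}) → 0 as l grows, for example because V is bounded and γ < 1.

```latex
\begin{align*}
\sum_{l=0}^{\infty} \gamma^{l}\,\tilde{r}_{t+l}
  &= \sum_{l=0}^{\infty} \gamma^{l}\big(r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})\big) \\
  &= \sum_{l=0}^{\infty} \gamma^{l} r_{t+l}
     + \sum_{l=1}^{\infty} \gamma^{l} V(s_{t+l})
     - \sum_{l=0}^{\infty} \gamma^{l} V(s_{t+l}) \\
  &= \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s_t)
\end{align*}
```

The V terms telescope, so the shaped return differs from the original return only by the state-dependent offset V(s_t). With λ < 1 the sum is discounted by γλ instead and the cancellation is only partial, which is where errors in V enter.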

Connection to eligibility traces

GAE is the advantage-function analog of TD(λ) with eligibility traces. In TD(λ), the value update at each step is spread backward along the trajectory using traces that decay by γλ per step. In GAE, the advantage estimate at each step accumulates forward-looking TD residuals decaying by γλ per step. The forward view and backward view are equivalent — GAE is the "forward view."

Why does λ < 1 introduce less bias than γ < 1, given a reasonably accurate value function?

γ introduces bias regardless of V's accuracy (it truncates the reward horizon), while λ only introduces bias through V's approximation error; if V were perfect, any λ would be unbiased
λ has a smaller numerical range than γ
λ is applied to a smaller number of terms

Chapter 7: Trust Region for Value Function

GAE relies on V being a good approximation. If V is terrible, even moderate λ values will give biased gradients. So training V well is critical. The paper proposes using a trust region method for the value function — not just the policy.

The value function objective

Standard approach: minimize the MSE between V_φ(s_t) and the empirical returns. But naive MSE minimization can overfit to the current batch of data, causing V to become worse on states not in the batch. This is especially problematic because GAE bootstraps from V — overfitting V can destabilize the entire training loop.

Trust region for V

The paper constrains the value function update to stay close to the previous value function, using a trust region:

minimize_φ ∑_n ||V_φ(s_n) − V̂_n||²

subject to: (1/N) ∑_n ||V_φ(s_n) − V_{φ_old}(s_n)||² ≤ ε

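Where V̂_n is the empirical discounted return ∑_{l=0}^{∞} γ^l r_{t+l} measured from the timestep t that sample n corresponds to, and n indexes all timesteps in the batch. The constraint prevents V from changing too much in a single update.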
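To make the two quantities in this problem concrete, the following sketch evaluates the objective and the constraint, as written above, for a candidate set of value predictions. It works on plain lists of predictions rather than on network parameters, and the function names and numbers are assumptions for illustration.

```python
def value_objective(predictions, targets):
    """Sum of squared errors between V_phi(s_n) and the empirical returns."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets))


def mean_squared_change(new_predictions, old_predictions):
    """Average squared change in predictions relative to the previous value function."""
    n = len(new_predictions)
    return sum((p - q) ** 2 for p, q in zip(new_predictions, old_predictions)) / n


def within_trust_region(new_predictions, old_predictions, epsilon):
    """True if the candidate update stays inside the trust region."""
    return mean_squared_change(new_predictions, old_predictions) <= epsilon


old = [1.0, 2.0]            # made-up predictions of the previous value function
targets = [2.0, 4.0]        # made-up empirical returns
candidate = [1.5, 3.0]      # made-up predictions after a proposed update
print(value_objective(candidate, targets))           # lower is a better fit
print(mean_squared_change(candidate, old))           # how far the predictions moved
print(within_trust_region(candidate, old, epsilon=1.0))
```

A candidate that fits the targets perfectly would move further from the old predictions, so a small ε forces the fit to improve over several updates instead of one.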

Solving the constrained problem

They use a conjugate gradient method, analogous to the TRPO update for the policy. Conjugate gradient needs only Hessian-vector products, so H is never formed explicitly. The resulting step is:

φ ← φ_old − α · H⁻¹g

Where g is the gradient of the objective, H is a Gauss-Newton approximation to the Hessian (it captures the curvature of the constraint), and α scales the step, for example by rescaling or a line search, so that the update satisfies the constraint. The minus sign reflects that the objective is being minimized.
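For readers who have not seen it, here is a minimal pure-Python sketch of the conjugate gradient method itself. It approximates the product H⁻¹g using only matrix-vector products. The small 2×2 matrix stands in for H and is an assumption for illustration; in a real implementation `matvec` would compute a Hessian-vector product through the network.

```python
def conjugate_gradient(matvec, b, iterations=10, tol=1e-10):
    """Approximately solve H x = b for symmetric positive-definite H.

    matvec(v) must return the product H v; H itself is never needed.
    """
    x = [0.0] * len(b)
    r = list(b)                      # residual b - H x (x starts at zero)
    p = list(b)                      # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iterations):
        if rs_old < tol:             # residual is small enough: stop early
            break
        hp = matvec(p)
        alpha = rs_old / sum(pi * hpi for pi, hpi in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


H = [[4.0, 1.0], [1.0, 3.0]]         # made-up stand-in for the curvature matrix
g = [1.0, 2.0]                       # made-up stand-in for the gradient
x = conjugate_gradient(lambda v: [sum(h * vi for h, vi in zip(row, v)) for row in H], g)
print(x)                             # approximately [1/11, 7/11]
```

For an n-dimensional problem the method converges in at most n iterations in exact arithmetic, and in practice a handful of iterations gives a usable direction even when n is in the thousands.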

The full algorithm: (1) Collect batch of trajectories using current policy. (2) Compute GAE advantages using current V. (3) Update policy using trust region (TRPO-style). (4) Update value function using trust region (constrained MSE). Repeat. Steps 3 and 4 both use trust regions, but for different objectives.

Trust Region: Policy vs. Value Function

Both the policy and value function are updated with trust region constraints to prevent overfitting.

Why does the paper apply a trust region constraint to the value function, not just the policy?

Because GAE bootstraps from V: if V overfits to the current batch, the advantage estimates become unreliable, destabilizing the entire training loop
Because the value function has more parameters than the policy
Because the value function is harder to optimize

Chapter 8: Results

The paper tests GAE on 3D locomotion tasks in the MuJoCo physics simulator. These are hard: the robot has to coordinate many joints, manage balance, and produce efficient gaits, using a policy that maps raw joint angles and velocities directly to joint torques.

The robots

3D biped: 33 state dimensions, 10 actuated joints. Tasks: walking, running.
3D quadruped: 29 state dimensions, 8 actuated joints. Task: galloping.
3D biped standing: Start lying on the ground, learn to stand up. The hardest task.

All policies are neural networks (three hidden layers with 100, 50, and 25 tanh units) mapping directly from raw kinematics to joint torques. No hand-crafted features, no phase variables, no gait libraries.

Key results

The paper systematically varies γ and λ to measure their effects:

γ: On the 3D biped, the best results came at γ = 0.99–0.995. With no discounting (γ = 1), variance is higher and learning suffers.
λ: Best results at λ = 0.96–0.99. λ = 0 (1-step TD) is consistently worse. λ = 1 (Monte Carlo) has high variance. The sweet spot is in between, confirming GAE's value.

The standing-up task: The biped must coordinate its entire body to transition from lying prone to standing upright. This requires a long sequence of precisely coordinated movements — rolling, pushing up, balancing. Standard policy gradients cannot solve this. GAE with the right (γ, λ) learns it in ~1-2 weeks of simulated experience. This was state-of-the-art for neural network locomotion policies at the time.

Effect of λ on Learning Performance

Learning curves for different λ values on bipedal locomotion. Intermediate λ (0.96–0.99) outperforms both extremes.

What range of λ values produced the best results on 3D locomotion tasks?

λ = 0.96–0.99: intermediate values that balance bias and variance outperformed both λ=0 (pure TD) and λ=1 (Monte Carlo)
λ = 0 was always best due to low variance
λ = 1 was always best due to zero bias

Chapter 9: Connections

What GAE built on

REINFORCE (Williams, 1992): The original policy gradient. GAE addresses its crippling variance by introducing a principled advantage estimator.

TD(λ) (Sutton, 1988): The eligibility trace framework for value function estimation. GAE is the advantage-function analog — same exponential weighting, applied to a different quantity.

Actor-critic methods (Konda & Tsitsiklis, 2003): Use learned value functions to reduce variance. GAE provides a principled, tunable way to use that value function for advantage estimation.

TRPO (Schulman et al., 2015): Trust region optimization for the policy. The GAE paper extends trust regions to the value function and uses TRPO-style updates for the policy.

What GAE enabled

PPO (Schulman et al., 2017): One of the most widely used RL algorithms. PPO uses GAE directly: the advantage estimates in PPO's clipped objective are computed with GAE, commonly with γ = 0.99 and λ = 0.95. PPO's multi-epoch updates depend on these low-variance advantage estimates.

A3C (Mnih et al., 2016): Asynchronous actor-critic with n-step returns — a special case of GAE with λ = 1 and truncated trajectories.

RLHF / ChatGPT: When you train a language model with PPO from human feedback, GAE computes the advantage estimates that determine which tokens to reinforce. GAE runs inside PPO-based RLHF pipelines.

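Many on-policy actor-critic methods: PPO variants such as MAPPO use GAE directly, and it is the de facto standard for on-policy advantage estimation. Not every method uses it: SAC is off-policy and learns Q-functions instead, IMPALA uses V-trace (a related off-policy-corrected multi-step estimator), and GRPO replaces the learned value function with a group-relative baseline.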
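To show where a GAE advantage enters PPO's clipped objective, here is a minimal sketch of the per-sample term. The function name is hypothetical, `ratio` stands for π_new(a|s) / π_old(a|s), and the clip range of 0.2 is a commonly used value assumed here for illustration.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the action's probability is capped.
print(ppo_clipped_term(ratio=1.5, advantage=2.0))
# Negative advantage: the gain from lowering the action's probability is capped too.
print(ppo_clipped_term(ratio=0.5, advantage=-1.0))
```

The advantage sets both the sign and the size of each term, so a noisy advantage estimate feeds noise directly into every one of PPO's repeated passes over the same batch.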

GAE's legacy: This paper solved one of RL's most fundamental problems — how to estimate advantages with controllable bias-variance tradeoff. The formula Â_t = ∑ (γλ)^l δ_t+l is used in virtually every modern policy gradient implementation. It's two lines of code that made deep RL practical for continuous control, game-playing, and language model alignment.

Cheat sheet

Core equation

Â_t^GAE = ∑_l=0^∞ (γλ)^l δ_t+l, where δ_t = r_t + γV(s_t+1) − V(s_t)

Key hyperparams

γ = 0.99–0.999 (reward horizon), λ = 0.95–0.98 (bias-variance)

Special cases

λ=0: 1-step TD (low var, high bias). λ=1: MC (no bias, high var)

Implementation

Backward sweep: Â_t = δ_t + γλ · Â_t+1

Impact

Used by PPO, PPO-based RLHF, and most on-policy actor-critic methods

How does PPO use GAE?

PPO uses GAE to compute the advantage estimates Â_t in its clipped surrogate objective; without GAE's variance reduction, PPO's multi-epoch updates would be too noisy to work
PPO uses GAE for the clipping threshold
PPO replaces GAE with its own advantage estimator

Generalized Advantage Estimation