Ch 12: Eligibility Traces — Sutton & Barto RL

Chapter 0: Bridging TD and MC

We've seen two extremes. TD(0) updates immediately after each step, using the next state's value as a stand-in for the future. It's fast, online, and low-variance, but biased by the quality of the current estimate. Monte Carlo waits until the episode ends and uses the actual return. It's unbiased but high-variance and requires complete episodes.

Chapter 7 showed how n-step methods interpolate between them: use n steps of real rewards, then bootstrap. But n is an integer — you have to pick one. What if you could blend all possible n-step returns at once?

The core idea: Instead of choosing one look-ahead depth n, take a weighted average of all n-step returns: the 1-step return, the 2-step return, all the way to the full MC return. The parameter λ (lambda) controls the weighting. This produces the λ-return, and the mechanism that implements it efficiently is called an eligibility trace.

Think of it this way. You're studying for an exam. One strategy is to review material immediately after learning it (TD-style). Another is to wait until the course ends and review everything (MC-style). The best strategy? Review at multiple intervals, weighting each review by how useful it is. That's what λ does for RL.

λ = 0

Pure TD(0): 1-step return only

↓

0 < λ < 1

Blend of all n-step returns, geometrically weighted

↓

λ = 1

Pure MC: full episode return only

Check: What does the parameter λ control?

How much weight to give longer vs shorter returns — interpolating between TD and MC The learning rate The discount factor

Chapter 1: The λ-Return

The n-step return from Chapter 7 is G_t:t+n = R_t+1 + γR_t+2 + … + γⁿ⁻¹R_t+n + γⁿV(S_t+n). It uses n real rewards and then bootstraps. The λ-return is a weighted average of all these n-step returns:

G_t^λ = (1 − λ) ∑_n=1^T−t−1 λⁿ⁻¹ G_t:t+n + λ^T−t−1 G_t

The weights are geometric: the 1-step return gets weight (1−λ), the 2-step return gets (1−λ)λ, the 3-step gets (1−λ)λ², and so on. The full MC return G_t gets whatever weight is left: λ^T−t−1. These weights always sum to 1 — it's a proper average.

Why geometric? Geometric weights have a special property: they are memoryless. The relative weight between consecutive n-step returns is always the same ratio λ, regardless of n. This means the λ-return can be computed incrementally without storing all the individual returns. This mathematical convenience is what makes eligibility traces possible.

Showcase: Lambda Weight Distribution

Drag λ to see how the weights shift. At λ=0, all weight goes to the 1-step return (pure TD). At λ=1, all weight goes to the full MC return. In between, the weights form a geometric distribution across all n-step returns.

λ = 0.50

Episode length T = 10

At λ = 0: only the 1-step return has nonzero weight. G_t^λ=0 = G_t:t+1 = R_t+1 + γV(S_t+1). This is the TD(0) target.

At λ = 1: only the final MC return has nonzero weight. G_t^λ=1 = G_t. This is the Monte Carlo target.

At λ = 0.5: the 1-step return gets weight 0.5, the 2-step gets 0.25, the 3-step gets 0.125, etc. Short-term returns dominate, but long-term information is blended in.

Check: In the λ-return, what is the weight given to the n-step return G_t:t+n (for n < T−t)?

1/n (1 − λ) λⁿ⁻¹ λⁿ

Chapter 2: TD(λ)

Computing the λ-return requires the full episode (since the MC return is part of the blend). That defeats the purpose of TD's incremental, online nature. The eligibility trace solves this by turning the forward-looking λ-return into a backward-looking mechanism that updates after every step.

The eligibility trace z_t is a vector the same size as the weight vector w. It starts at zero and updates on every step:

z_t = γ λ z_t−1 + ∇ v̂(S_t, w_t)

Each time a state is visited, its features get added to z. Between visits, z decays by γλ per step. The trace records a fading memory of which features were recently active. Then the weight update is simply:

w_t+1 = w_t + α δ_t z_t

The backward view: When a TD error δ_t occurs (positive surprise or negative surprise), credit is assigned to all recently visited states, proportional to their trace. States visited recently get large updates; states visited long ago get tiny updates (their traces have decayed). It's like a spotlight that fades: the current state is brightest, previous states dimmer and dimmer.

Trace Decay Animation

Watch the eligibility trace for each state as an agent walks through a 7-state chain. The brightness shows trace magnitude. When the agent reaches a reward, the TD error propagates backward through the trace, updating all recently visited states.

λ = 0.80

γ = 0.95

pseudocode
# Semi-gradient TD(λ) — backward view
Initialize w arbitrarily
for each episode:
  S ← start state
  z ← 0 (trace vector)
  while S is not terminal:
    A ← π(S)
    take A, observe R, S'
    δ ← R + γ v̂(S', w) − v̂(S, w)
    z ← γ λ z + ∇ v̂(S, w)
    w ← w + α δ z
    S ← S'

Check: What does the eligibility trace z_t represent?

The current reward signal The value function estimate A fading memory of recently visited states — how eligible each feature is for credit assignment

Chapter 3: Forward vs Backward

There are two equivalent ways to think about the same algorithm. The forward view says: at each time step t, look forward to all future rewards, compute the λ-return G_t^λ, and update w toward it. The backward view says: at each time step t, compute the TD error δ_t and send credit backward to all recently visited states through the eligibility trace.

Forward view (λ-return)

At each step, look ahead to compute the weighted average of n-step returns. Requires the full episode (or a truncation). Conceptually clean — the update target is explicit.

Backward view (traces)

At each step, compute δ_t and multiply it by the trace z_t. Fully online — updates happen immediately. The trace encodes the forward view implicitly.

Key insight: For the offline case (accumulate updates, apply at end of episode), the forward and backward views produce identical weight changes. This equivalence is the mathematical foundation of TD(λ). The backward view computes the same thing as the forward view, just incrementally, step by step.

The equivalence holds exactly in the linear case with accumulating traces. For the online case (apply updates immediately), the two views diverge slightly because each update changes w, which changes future TD errors. The true online TD(λ) algorithm (Chapter 6) closes this gap.

Why does this matter? Because the backward view is O(d) per step (d = number of features), while computing the forward view naively is O(d · T) per episode. The trace is the computational trick that makes λ-returns practical for large-scale problems.

Check: What is the computational advantage of the backward view (traces) over the forward view (λ-return)?

It is O(d) per step instead of O(d · T) per episode It converges faster It uses less memory for the weight vector

Chapter 4: Truncated λ-Return

The λ-return involves the MC return at the end, which means you need the full episode. For continuing tasks (no episodes), this is a problem. The truncated λ-return solves it by cutting off after h steps:

G_t:t+h^λ = (1 − λ) ∑_n=1^h−1 λⁿ⁻¹ G_t:t+n + λ^h−1 G_t:t+h

Instead of averaging all the way to the end of the episode, we average up to h steps ahead, putting all remaining weight on the h-step return. This is an approximation — the truncated λ-return is not exactly the same as the full λ-return. But for large h and λγ < 1, the distant terms have decayed to near-zero weight anyway.

The trade-off: Larger h = better approximation to the true λ-return, but requires waiting h steps before you can compute the update. Smaller h = more timely updates, but the approximation is coarser. With h=1, you recover TD(0). The trace-based backward view effectively uses h = ∞ (the full episode), which is why it's exact — but it relies on the trace decaying geometrically.

The truncated λ-return is the basis for the TTD(λ) algorithm. It updates each state t using the truncated return computed h steps later. This requires a buffer of h recent transitions — more memory than the trace approach, but it works for continuing tasks and doesn't accumulate trace arithmetic.

Check: Why is the truncated λ-return needed?

It reduces the step size It makes the λ-return computable for continuing tasks where episodes don't end It improves the variance of Monte Carlo

Chapter 5: Online λ-Return

Consider a subtle problem with the offline λ-return. At time t, we compute G_t^λ using the weight vector w_t (the version of w at time t). But by the time we have all the data to compute G_t^λ, w has already been updated many times. Wouldn't it be better to use the latest w at each point?

The online λ-return algorithm does exactly this. At each time step h, it goes back and redoes all updates from the start of the episode using the h-step truncated λ-return with the latest weights. It's as if, at each step, you erase all the updates you've done and redo them with better information.

The idea: Imagine writing a paper. The offline approach writes it once, then submits. The online approach rewrites the entire paper from scratch after each new paragraph, incorporating new insights into every section. The result is better, but the cost is O(t) work at time step t, and O(T²) total per episode.

This seems prohibitively expensive. At step h, we redo h updates, each costing O(d). Over a T-step episode, the total cost is O(d · T²). For short episodes this is fine. For long episodes it's impractical. But the mathematical sequence of weight vectors it produces is the best that the λ-return framework can achieve online — it's the ideal that other algorithms approximate.

w_t+1^h = w_t^h + α [G_t:h^λ − v̂(S_t, w_t^h)] ∇ v̂(S_t, w_t^h)

The key: w_t^h means "the weight vector at time t, computed using data up to time h." The online algorithm, at each new time step h, recomputes this entire sequence from scratch. Impractical? Yes. But it defines the target that the next chapter's efficient algorithm matches exactly.

Check: What is the main drawback of the online λ-return algorithm?

O(T²) computation per episode because it redoes all updates at each step It doesn't converge It only works with tabular representations

Chapter 6: True Online TD(λ)

The online λ-return is ideal but costs O(T²). True online TD(λ) (van Seijen & Sutton, 2014) achieves the exact same sequence of weight updates in O(d) per step — the same cost as regular TD(λ). It's a mathematical marvel: through careful algebraic manipulation, the O(T²) algorithm collapses to O(d).

The trick is Dutch traces. Instead of the standard accumulating trace, the Dutch trace is:

z_t = γ λ z_t−1 + (1 − α γ λ z_t−1^T x_t) x_t

where x_t = ∇v̂(S_t, w_t) is the feature vector. The extra term (1 − αγλ z_t−1^Tx_t) corrects for the online nature of the algorithm. The weight update also includes an additional correction:

w_t+1 = w_t + α δ_t z_t + α (w_t^Tx_t − w_t−1^Tx_t)(z_t − x_t)

The result: True online TD(λ) produces exactly the same weight sequence as the online λ-return algorithm, but in O(d) per step. It needs one extra scalar (the old value w_t−1^Tx_t) and the modified trace. It should be the default choice for TD(λ) with linear function approximation.

pseudocode
# True Online TD(λ) with Dutch traces
Initialize w
for each episode:
  S ← start; z ← 0; V_old ← 0
  while S is not terminal:
    A ← π(S); take A, observe R, S'
    x ← feature vector for S
    V ← w^Tx; V' ← w^Tx'
    δ ← R + γ V' − V
    z ← γλ z + (1 − αγλ z^Tx) x
    w ← w + α(δ + V − V_old) z − α(V − V_old) x
    V_old ← V'
    S ← S'

Check: What makes True Online TD(λ) better than standard TD(λ)?

It uses a smaller step size It produces the exact same updates as the ideal online λ-return, at the same O(d) cost as standard TD(λ) It eliminates the need for eligibility traces

Chapter 7: Sarsa(λ)

Everything we've done with eligibility traces for prediction (learning V) extends naturally to control (learning Q). Sarsa(λ) uses traces to assign credit over multiple time steps in the action-value setting.

The update is exactly the TD(λ) pattern, but with Q instead of V and with features defined over (state, action) pairs:

z_t = γ λ z_t−1 + ∇ q̂(S_t, A_t, w_t)

w_t+1 = w_t + α δ_t z_t

where δ_t = R_t+1 + γ q̂(S_t+1, A_t+1, w_t) − q̂(S_t, A_t, w_t).

The practical benefit: In regular Sarsa, a reward only directly updates the state-action pair that immediately preceded it. With Sarsa(λ), the reward propagates backward through the eligibility trace to update all recent state-action pairs. This dramatically speeds up learning in problems with delayed rewards — the agent doesn't have to visit the same path many times to propagate reward information backward.

Sarsa(λ) on a Grid World

An agent navigates from start (orange) to goal (green). Compare how fast the agent learns with different λ values. Higher λ propagates reward information further back through the trajectory on each episode.

λ = 0.8

Episode 0 | Steps: --

Check: What advantage does Sarsa(λ) have over regular Sarsa(0)?

Reward information propagates back through the full trajectory in a single episode, not just one step It requires fewer features It doesn't need a discount factor

Chapter 8: Variable λ and γ

So far λ and γ have been constants. But there's no reason they can't vary by state. State-dependent bootstrapping lets the agent decide, at each state, how much to bootstrap.

The generalized λ-return uses λ(s) and γ(s):

δ_t = R_t+1 + γ(S_t+1) v̂(S_t+1) − v̂(S_t)

z_t = γ(S_t) λ(S_t) z_t−1 + ∇ v̂(S_t)

Why vary λ? In states where the value estimate is reliable, bootstrap heavily (λ near 0). In states where the estimate is uncertain or changing rapidly, lean toward MC (λ near 1). This is similar to how human decision-making works: in familiar situations we rely on instinct (bootstrapping), in novel situations we wait and see what happens (Monte Carlo).

State-dependent γ has a natural interpretation too. Setting γ(s) = 0 for terminal states is something we already do — it signals the end of the return. But we can also set γ(s) to values between 0 and 1 to encode how much we care about future rewards from different states. This is a form of soft termination.

The state-dependent formulation subsumes all the previous versions as special cases. Constant λ and γ are just the case where λ(s) = λ and γ(s) = γ for all s. The theory (convergence guarantees, forward-backward equivalence) carries through with appropriate modifications.

Check: When would you set λ(s) close to 1 for a particular state?

When the value estimate at that state is very accurate When the value estimate at that state is uncertain, so we should rely more on actual returns When the rewards at that state are always zero

Chapter 9: Off-policy Traces

Eligibility traces and off-policy learning interact in complex ways. Chapter 11 showed that off-policy with function approximation is already dangerous. Adding traces (which propagate updates over multiple steps) compounds the importance sampling problem: the IS ratio for an n-step update is the product ρ_tρ_t+1…ρ_t+n−1, which can have enormous variance.

Several approaches handle this:

IS-weighted Traces

Multiply the trace update by ρ_t:

z_t = ρ_t (γ λ z_t−1 + ∇ v̂(S_t))

When ρ_t = 0 (the target policy wouldn't take this action), the trace resets to zero. This is correct but can have very high variance when ratios are large.

Tree-Backup Traces

Instead of IS ratios, use the target policy's action probabilities directly. The trace decays by π(A_t|S_t) instead of using ρ. This bounds the trace (no ratios greater than 1) but means the effective λ is reduced — you bootstrap more when the policies differ.

Retrace(λ)

Cap the IS ratio at 1: use min(ρ_t, 1) in place of ρ_t. This keeps the trace bounded (low variance) while still correcting for the off-policy distribution. Retrace is the basis of many practical deep RL algorithms.

The pattern: All of these modify the trace to limit the damage of large IS ratios. The spectrum ranges from "use full IS ratios" (unbiased but high variance) to "use no IS correction" (biased but low variance). Tree-backup and Retrace sit at carefully chosen points in between.

Check: Why does off-policy learning with traces have worse variance than without traces?

Traces propagate updates over multiple steps, requiring products of IS ratios which can be very large Traces add extra parameters to learn Traces make the discount factor larger

Chapter 10: Empirical Results

Theory tells us that λ interpolates between TD(0) and MC. But which value of λ actually works best? The consistent empirical finding, replicated across many domains, is: intermediate values of λ are best. Neither λ=0 (pure TD) nor λ=1 (pure MC) wins.

Lambda Performance Curve

Error after a fixed number of episodes on a random walk, as a function of λ. The U-shaped curve shows that intermediate λ (around 0.4–0.8) outperforms both extremes. Drag the α slider to see how the optimal λ changes with the step size.

α = 0.10

Why intermediate λ? Pure TD(0) is biased by the initial value estimates — it takes many episodes for the bias to wash out. Pure MC has zero bias but high variance (returns are noisy). Intermediate λ balances the two: enough real rewards to reduce bias, enough bootstrapping to reduce variance. The "sweet spot" depends on the problem, the step size, and the quality of the function approximation.

The optimal λ also interacts with the step size α. With a very small α, the bias of TD(0) is corrected slowly, so higher λ is better. With a large α, the variance of MC makes λ=1 unstable, so lower λ is preferred. The best (α, λ) pair is problem-dependent, which is why tuning both is standard practice.

In the tabular case on the 19-state random walk, the best performance is typically around λ = 0.4–0.6 with moderate α. With function approximation on harder problems, λ = 0.8–0.95 is common. The empirical lesson is clear: always try intermediate λ, never assume one extreme is best.

Check: Why does λ=1 (pure MC) typically not give the best performance?

MC returns are always biased MC cannot be used with function approximation MC returns have high variance because they depend on all future rewards, making updates noisy

Chapter 11: Summary

Eligibility traces provide a unified mechanism for assigning credit over time. They bridge the gap between TD(0) and Monte Carlo, and the parameter λ provides a smooth dial between them. In practice, intermediate λ almost always outperforms either extreme.

Method	Core Idea	Key Property
TD(λ)	Accumulating trace with decay γλ	O(d) per step, offline equivalence to λ-return
True Online TD(λ)	Dutch traces	Exact online λ-return at O(d) cost
Sarsa(λ)	Traces for control	Faster credit propagation for action values
Truncated TD(λ)	h-step cutoff	Works for continuing tasks
Tree-Backup	Policy probs replace IS ratios	Bounded off-policy traces
Retrace(λ)	Capped IS ratios	Low variance off-policy traces

The practical takeaway: For on-policy with linear function approximation, True Online TD(λ) should be the default. For off-policy, Retrace(λ) is the workhorse. Always tune λ — the optimal value is problem-dependent but almost never 0 or 1.

What traces give us:

• Smooth interpolation TD ↔ MC
• Faster credit assignment
• One more knob (λ) to tune
• A backward-view that's O(d)/step
• State-dependent bootstrapping

Limitations:

• Extra memory for the trace vector
• Off-policy traces are still fragile
• Optimal λ is problem-dependent
• Nonlinear function approx complicates theory
• Dutch traces need linear features

What comes next: Chapter 11 explored the instability of off-policy learning. Chapter 13 takes a completely different approach: instead of learning value functions and deriving policies from them, we learn the policy directly. This is the world of policy gradient methods — the foundation of modern deep RL.

"Eligibility traces are one of the oldest and most enduring ideas in reinforcement learning.
They are a bridge between two fundamental approaches to temporal credit assignment."
— Richard S. Sutton & Andrew G. Barto

Check: What is the key empirical finding about the parameter λ?

λ=0 is always best Intermediate values of λ (between 0 and 1) consistently outperform both extremes λ=1 is always best