The bridge between TD and Monte Carlo — one parameter smoothly interpolates between them.
We've seen two extremes. TD(0) updates immediately after each step, using the next state's value as a stand-in for the future. It's fast, online, and low-variance, but biased by the quality of the current estimate. Monte Carlo waits until the episode ends and uses the actual return. It's unbiased but high-variance and requires complete episodes.
Chapter 7 showed how n-step methods interpolate between them: use n steps of real rewards, then bootstrap. But n is an integer — you have to pick one. What if you could blend all possible n-step returns at once?
Think of it this way. You're studying for an exam. One strategy is to review material immediately after learning it (TD-style). Another is to wait until the course ends and review everything (MC-style). The best strategy? Review at multiple intervals, weighting each review by how useful it is. That's what λ does for RL.
The n-step return from Chapter 7 is Gt:t+n = Rt+1 + γRt+2 + … + γn−1Rt+n + γnV(St+n). It uses n real rewards and then bootstraps. The λ-return is a weighted average of all these n-step returns:
The weights are geometric: the 1-step return gets weight (1−λ), the 2-step return gets (1−λ)λ, the 3-step gets (1−λ)λ2, and so on. The full MC return Gt gets whatever weight is left: λT−t−1. These weights always sum to 1 — it's a proper average.
Drag λ to see how the weights shift. At λ=0, all weight goes to the 1-step return (pure TD). At λ=1, all weight goes to the full MC return. In between, the weights form a geometric distribution across all n-step returns.
At λ = 0: only the 1-step return has nonzero weight. Gtλ=0 = Gt:t+1 = Rt+1 + γV(St+1). This is the TD(0) target.
At λ = 1: only the final MC return has nonzero weight. Gtλ=1 = Gt. This is the Monte Carlo target.
At λ = 0.5: the 1-step return gets weight 0.5, the 2-step gets 0.25, the 3-step gets 0.125, etc. Short-term returns dominate, but long-term information is blended in.
Computing the λ-return requires the full episode (since the MC return is part of the blend). That defeats the purpose of TD's incremental, online nature. The eligibility trace solves this by turning the forward-looking λ-return into a backward-looking mechanism that updates after every step.
The eligibility trace zt is a vector the same size as the weight vector w. It starts at zero and updates on every step:
Each time a state is visited, its features get added to z. Between visits, z decays by γλ per step. The trace records a fading memory of which features were recently active. Then the weight update is simply:
Watch the eligibility trace for each state as an agent walks through a 7-state chain. The brightness shows trace magnitude. When the agent reaches a reward, the TD error propagates backward through the trace, updating all recently visited states.
pseudocode # Semi-gradient TD(λ) — backward view Initialize w arbitrarily for each episode: S ← start state z ← 0 (trace vector) while S is not terminal: A ← π(S) take A, observe R, S' δ ← R + γ v̂(S', w) − v̂(S, w) z ← γ λ z + ∇ v̂(S, w) w ← w + α δ z S ← S'
There are two equivalent ways to think about the same algorithm. The forward view says: at each time step t, look forward to all future rewards, compute the λ-return Gtλ, and update w toward it. The backward view says: at each time step t, compute the TD error δt and send credit backward to all recently visited states through the eligibility trace.
Forward view (λ-return)
At each step, look ahead to compute the weighted average of n-step returns. Requires the full episode (or a truncation). Conceptually clean — the update target is explicit.
Backward view (traces)
At each step, compute δt and multiply it by the trace zt. Fully online — updates happen immediately. The trace encodes the forward view implicitly.
The equivalence holds exactly in the linear case with accumulating traces. For the online case (apply updates immediately), the two views diverge slightly because each update changes w, which changes future TD errors. The true online TD(λ) algorithm (Chapter 6) closes this gap.
Why does this matter? Because the backward view is O(d) per step (d = number of features), while computing the forward view naively is O(d · T) per episode. The trace is the computational trick that makes λ-returns practical for large-scale problems.
The λ-return involves the MC return at the end, which means you need the full episode. For continuing tasks (no episodes), this is a problem. The truncated λ-return solves it by cutting off after h steps:
Instead of averaging all the way to the end of the episode, we average up to h steps ahead, putting all remaining weight on the h-step return. This is an approximation — the truncated λ-return is not exactly the same as the full λ-return. But for large h and λγ < 1, the distant terms have decayed to near-zero weight anyway.
The truncated λ-return is the basis for the TTD(λ) algorithm. It updates each state t using the truncated return computed h steps later. This requires a buffer of h recent transitions — more memory than the trace approach, but it works for continuing tasks and doesn't accumulate trace arithmetic.
Consider a subtle problem with the offline λ-return. At time t, we compute Gtλ using the weight vector wt (the version of w at time t). But by the time we have all the data to compute Gtλ, w has already been updated many times. Wouldn't it be better to use the latest w at each point?
The online λ-return algorithm does exactly this. At each time step h, it goes back and redoes all updates from the start of the episode using the h-step truncated λ-return with the latest weights. It's as if, at each step, you erase all the updates you've done and redo them with better information.
This seems prohibitively expensive. At step h, we redo h updates, each costing O(d). Over a T-step episode, the total cost is O(d · T2). For short episodes this is fine. For long episodes it's impractical. But the mathematical sequence of weight vectors it produces is the best that the λ-return framework can achieve online — it's the ideal that other algorithms approximate.
The key: wth means "the weight vector at time t, computed using data up to time h." The online algorithm, at each new time step h, recomputes this entire sequence from scratch. Impractical? Yes. But it defines the target that the next chapter's efficient algorithm matches exactly.
The online λ-return is ideal but costs O(T2). True online TD(λ) (van Seijen & Sutton, 2014) achieves the exact same sequence of weight updates in O(d) per step — the same cost as regular TD(λ). It's a mathematical marvel: through careful algebraic manipulation, the O(T2) algorithm collapses to O(d).
The trick is Dutch traces. Instead of the standard accumulating trace, the Dutch trace is:
where xt = ∇v̂(St, wt) is the feature vector. The extra term (1 − αγλ zt−1Txt) corrects for the online nature of the algorithm. The weight update also includes an additional correction:
pseudocode # True Online TD(λ) with Dutch traces Initialize w for each episode: S ← start; z ← 0; V_old ← 0 while S is not terminal: A ← π(S); take A, observe R, S' x ← feature vector for S V ← wTx; V' ← wTx' δ ← R + γ V' − V z ← γλ z + (1 − αγλ zTx) x w ← w + α(δ + V − V_old) z − α(V − V_old) x V_old ← V' S ← S'
Everything we've done with eligibility traces for prediction (learning V) extends naturally to control (learning Q). Sarsa(λ) uses traces to assign credit over multiple time steps in the action-value setting.
The update is exactly the TD(λ) pattern, but with Q instead of V and with features defined over (state, action) pairs:
where δt = Rt+1 + γ q̂(St+1, At+1, wt) − q̂(St, At, wt).
An agent navigates from start (orange) to goal (green). Compare how fast the agent learns with different λ values. Higher λ propagates reward information further back through the trajectory on each episode.
So far λ and γ have been constants. But there's no reason they can't vary by state. State-dependent bootstrapping lets the agent decide, at each state, how much to bootstrap.
The generalized λ-return uses λ(s) and γ(s):
State-dependent γ has a natural interpretation too. Setting γ(s) = 0 for terminal states is something we already do — it signals the end of the return. But we can also set γ(s) to values between 0 and 1 to encode how much we care about future rewards from different states. This is a form of soft termination.
The state-dependent formulation subsumes all the previous versions as special cases. Constant λ and γ are just the case where λ(s) = λ and γ(s) = γ for all s. The theory (convergence guarantees, forward-backward equivalence) carries through with appropriate modifications.
Eligibility traces and off-policy learning interact in complex ways. Chapter 11 showed that off-policy with function approximation is already dangerous. Adding traces (which propagate updates over multiple steps) compounds the importance sampling problem: the IS ratio for an n-step update is the product ρtρt+1…ρt+n−1, which can have enormous variance.
Several approaches handle this:
Multiply the trace update by ρt:
When ρt = 0 (the target policy wouldn't take this action), the trace resets to zero. This is correct but can have very high variance when ratios are large.
Instead of IS ratios, use the target policy's action probabilities directly. The trace decays by π(At|St) instead of using ρ. This bounds the trace (no ratios greater than 1) but means the effective λ is reduced — you bootstrap more when the policies differ.
Cap the IS ratio at 1: use min(ρt, 1) in place of ρt. This keeps the trace bounded (low variance) while still correcting for the off-policy distribution. Retrace is the basis of many practical deep RL algorithms.
Theory tells us that λ interpolates between TD(0) and MC. But which value of λ actually works best? The consistent empirical finding, replicated across many domains, is: intermediate values of λ are best. Neither λ=0 (pure TD) nor λ=1 (pure MC) wins.
Error after a fixed number of episodes on a random walk, as a function of λ. The U-shaped curve shows that intermediate λ (around 0.4–0.8) outperforms both extremes. Drag the α slider to see how the optimal λ changes with the step size.
The optimal λ also interacts with the step size α. With a very small α, the bias of TD(0) is corrected slowly, so higher λ is better. With a large α, the variance of MC makes λ=1 unstable, so lower λ is preferred. The best (α, λ) pair is problem-dependent, which is why tuning both is standard practice.
In the tabular case on the 19-state random walk, the best performance is typically around λ = 0.4–0.6 with moderate α. With function approximation on harder problems, λ = 0.8–0.95 is common. The empirical lesson is clear: always try intermediate λ, never assume one extreme is best.
Eligibility traces provide a unified mechanism for assigning credit over time. They bridge the gap between TD(0) and Monte Carlo, and the parameter λ provides a smooth dial between them. In practice, intermediate λ almost always outperforms either extreme.
| Method | Core Idea | Key Property |
|---|---|---|
| TD(λ) | Accumulating trace with decay γλ | O(d) per step, offline equivalence to λ-return |
| True Online TD(λ) | Dutch traces | Exact online λ-return at O(d) cost |
| Sarsa(λ) | Traces for control | Faster credit propagation for action values |
| Truncated TD(λ) | h-step cutoff | Works for continuing tasks |
| Tree-Backup | Policy probs replace IS ratios | Bounded off-policy traces |
| Retrace(λ) | Capped IS ratios | Low variance off-policy traces |
What traces give us:
• Smooth interpolation TD ↔ MC
• Faster credit assignment
• One more knob (λ) to tune
• A backward-view that's O(d)/step
• State-dependent bootstrapping
Limitations:
• Extra memory for the trace vector
• Off-policy traces are still fragile
• Optimal λ is problem-dependent
• Nonlinear function approx complicates theory
• Dutch traces need linear features
What comes next: Chapter 11 explored the instability of off-policy learning. Chapter 13 takes a completely different approach: instead of learning value functions and deriving policies from them, we learn the policy directly. This is the world of policy gradient methods — the foundation of modern deep RL.