Bridging the gap between TD and Monte Carlo — one step at a time.
In Chapter 6, we met two extremes. TD(0) updates its value estimate after just one step: it peeks at the next state's value and bootstraps from there. Monte Carlo waits until the very end of the episode, using the full actual return. One is fast but biased. The other is unbiased but slow. Surely there's something in between?
There is. n-step methods look ahead n steps, use the actual rewards for those steps, then bootstrap from the state n steps ahead. When n = 1, you get TD(0). When n = ∞ (or n reaches the episode's end), you get Monte Carlo. Everything in between is a new method — and often a better one.
Drag the slider to see how the n-step return changes. At n=1, we bootstrap immediately (TD). As n grows, we use more real rewards before bootstrapping. At n=∞, we use only real rewards (MC).
Think of it like writing an essay. MC reads the entire book before writing a single sentence — thorough but slow. TD(0) reads one sentence and starts writing — fast but risky. n-step methods read a chapter at a time: enough context to be useful, without waiting forever.
Let's make the idea precise. The n-step return from time t is defined as:
We sum up n actual rewards, each discounted appropriately, then add the bootstrapped value of the state n steps ahead. The subscript on V reminds us that we're using whatever value estimates we had at time t+n−1.
The update rule for n-step TD prediction is then:
This looks just like the TD(0) update, but with Gt:t+n replacing the one-step return. There's a catch: the update for time t can't happen until time t+n, because we need n future rewards. So there is a delay of n steps before each update. The agent still acts at every step, but it queues its updates.
pseudocode # n-step TD Prediction Input: policy π, step size α, n ≥ 1 Initialize V(s) for all s for each episode: S0 ← starting state T ← ∞ for t = 0, 1, 2, …: if t < T: take action from π, observe Rt+1, St+1 if St+1 is terminal: T ← t+1 τ ← t − n + 1 # time of state being updated if τ ≥ 0: G ← ∑i=τ+1min(τ+n,T) γi−τ−1 Ri if τ+n < T: G ← G + γn V(Sτ+n) V(Sτ) ← V(Sτ) + α[G − V(Sτ)] until τ = T − 1
Notice the variable τ (tau). It tracks which state we're actually updating. When τ < 0, we haven't seen enough steps yet, so we skip. The algorithm keeps a sliding window of n transitions and makes one update per step (after the initial delay).
This is the chapter's big payoff. The 19-state random walk is the classic testbed for n-step methods (Example 7.1 in Sutton & Barto). An agent starts in the center of a 19-state chain. At each step it moves left or right with equal probability. The left terminal gives reward 0; the right terminal gives reward +1. All other rewards are 0. The true values form a smooth ramp from near 0 on the left to near 1 on the right.
We run n-step TD with various values of n and various step sizes α, then measure the RMS error averaged over all states and episodes. The result is a U-shaped curve: n=1 (TD) and very large n (near MC) perform poorly, while intermediate values of n (around 4-8) hit the sweet spot.
Choose n from 1 to 20 and a step-size α. Click Run to train over 10 episodes and see the RMS error. Click Run All n to compare all n values at the current α and find the sweet spot.
Why does this happen? With n=1 (TD), we bootstrap heavily, but our initial value estimates are bad — so we propagate errors. With very large n (MC), we use complete returns, but those returns have high variance because they depend on many random transitions. The sweet spot balances bias (from bootstrapping with imperfect values) against variance (from noisy long returns).
So far we've been doing prediction — estimating V(s) for a fixed policy. But we want control: finding the best policy. The natural extension is n-step Sarsa, which estimates action values Q(s,a) instead of state values.
The idea is identical: instead of bootstrapping from V(St+n), we bootstrap from Q(St+n, At+n). The n-step return for Sarsa is:
And the update rule is:
One-step Sarsa (n=1) is just the original Sarsa from Chapter 6. n-step Sarsa uses n actual rewards before bootstrapping. This means information about rewards propagates n times faster per episode than with one-step Sarsa — a huge advantage in tasks where the reward signal is sparse.
pseudocode # n-step Sarsa Input: step size α, n ≥ 1, ε for exploration Initialize Q(s,a) for all s, a for each episode: S0 ← start; A0 ← ε-greedy from Q(S0,·) T ← ∞ for t = 0, 1, 2, …: if t < T: take At, observe Rt+1, St+1 if terminal: T ← t+1 else: At+1 ← ε-greedy from Q(St+1,·) τ ← t − n + 1 if τ ≥ 0: G ← ∑i=τ+1min(τ+n,T) γi−τ−1 Ri if τ+n < T: G ← G + γn Q(Sτ+n, Aτ+n) Q(Sτ,Aτ) ← Q(Sτ,Aτ) + α[G − Q(Sτ,Aτ)] until τ = T − 1
There is also n-step Expected Sarsa, which replaces the bootstrapped Q(St+n, At+n) with the expected value over all actions: ∑a π(a|St+n) Q(St+n, a). This removes one source of variance (the randomness of At+n) and generally performs better.
Sometimes we want to learn about one policy (the target policy π) while following another (the behavior policy b). This is the off-policy setting from Chapter 5. In the n-step case, we need importance sampling to correct for the mismatch between the two policies.
The importance sampling ratio for n steps is:
This product of ratios tells us how much more (or less) likely this particular sequence of actions was under π compared to b. We then scale the update by this ratio:
The same idea extends to n-step Sarsa: for off-policy learning of Q values, we use the ratio ρt+1:t+n−1 (starting one step later, because the first action At is already accounted for in Q(St, At)).
See how the IS ratio can explode or collapse as n grows. Each step multiplies by π(a)/b(a). Even mild mismatches compound quickly.
This variance problem motivates two alternatives. One is per-decision importance sampling, which applies the ratio more carefully. The other — and more elegant — is to avoid importance sampling entirely. That's the tree backup algorithm, coming in Chapter 5.
What if we could do off-policy learning without importance sampling? The tree backup algorithm achieves this by using expected values at each step instead of sampled actions.
Here's the insight: at each step, instead of following the action that was actually taken and correcting with importance sampling, we weight each possible action by its probability under π. For the action that was taken, we go deeper into the tree. For all actions that were not taken, we use their current Q estimates.
The one-step tree backup return is just Expected Sarsa's target:
The two-step tree backup return is more interesting. It takes the reward Rt+1 that was actually observed, adds the Q values for all untaken actions at St+1 weighted by π, and follows the taken action one step deeper:
The general pattern continues recursively. At each level, side-branch actions contribute their Q values immediately, while the taken action is expanded further.
A visual of how the tree backup builds its return. Orange nodes are the actions actually taken (followed deeper). Teal nodes are side-branch actions (Q values used immediately).
We now have three n-step algorithms for action values: n-step Sarsa (pure sampling), n-step Expected Sarsa (expectation at the last step), and the tree backup (expectations at every step). Can we unify them?
Yes. The Q(σ) algorithm introduces a parameter σ ∈ [0, 1] at each step that controls how much we sample versus how much we take expectations. When σ = 1, we fully sample that step (Sarsa-style). When σ = 0, we fully take expectations (tree-backup-style). At intermediate values, we blend.
Slide σ from 0 (tree backup / full expectation) to 1 (Sarsa / full sampling). Watch how the algorithm blends between the two extremes.
At each step k of the n-step return, the sliding-scale target is:
When σk = 1, we get the sampled next action-value. When σk = 0, we get the expected action-value. And σ can vary per step — you could sample for early steps (when you have actual trajectory data) and take expectations later.
In practice, σ can even be set per state — for example, σ = 1 in states where the policy is deterministic (where sampling and expectation are the same) and σ = 0 in states with high stochasticity (to reduce variance).
One of the best tools for understanding RL algorithms is the backup diagram. Each diagram shows which states and actions contribute to the update of the root node. Let's compare all the backup types side by side.
Select an algorithm to see its backup structure. Open circles = states. Filled circles = actions. Orange = sampled path. Teal = expected/backed-up branches.
| Backup Type | Samples? | Off-policy? | Needs IS? |
|---|---|---|---|
| n-step TD | Yes (one path) | With IS | Yes |
| n-step Sarsa | Yes (one path) | With IS | Yes |
| Tree Backup | No (all branches) | Naturally | No |
| Q(σ) | Blended | Blended | Partial |
The diagrams make the differences concrete. In n-step TD/Sarsa, a single sampled path goes deep. In tree backup, at every level, all action branches are shown, but only the taken action goes deeper. In Q(σ), the degree of branching depends on σ.
n-step methods fill the gap between TD and Monte Carlo, giving us a powerful design knob. Instead of committing to one extreme, we can tune n to balance bias and variance for each problem.
| Algorithm | What it does | Key property |
|---|---|---|
| n-step TD | Prediction with n-step returns | Bridges TD(0) and MC |
| n-step Sarsa | On-policy control | Faster credit propagation |
| Off-policy n-step | Importance sampling corrections | Learns target from behavior |
| Tree Backup | Off-policy without IS | No variance explosion |
| Q(σ) | Unifies all three | Per-step sampling/expectation |
Strengths of n-step methods:
• Tunable bias-variance tradeoff
• Faster credit propagation than 1-step
• Tree backup: off-policy without IS
• Q(σ) unifies the design space
Limitations:
• Must store n transitions in memory
• Updates delayed by n steps
• IS variance grows with n (except tree backup)
• Optimal n is problem-dependent
What comes next: Chapter 6 gave us model-free learning. This chapter showed how to tune the bootstrapping horizon. Chapter 8 asks: what if we have a model, even an imperfect one? The Dyna architecture integrates model-based planning with model-free learning, using the model to generate simulated experience that amplifies real experience.