Ch 7: n-step Bootstrapping — Sutton & Barto RL

Chapter 0: The Spectrum

In Chapter 6, we met two extremes. TD(0) updates its value estimate after just one step: it peeks at the next state's value and bootstraps from there. Monte Carlo waits until the very end of the episode, using the full actual return. One is fast but biased. The other is unbiased but slow. Surely there's something in between?

There is. n-step methods look ahead n steps, use the actual rewards for those steps, then bootstrap from the state n steps ahead. When n = 1, you get TD(0). When n = ∞ (or n reaches the episode's end), you get Monte Carlo. Everything in between is a new method — and often a better one.

The core idea: TD and MC are not two separate algorithms — they're the endpoints of a continuous spectrum. n-step methods let you slide between them. The sweet spot is rarely at either extreme.

The n-step Spectrum

Drag the slider to see how the n-step return changes. At n=1, we bootstrap immediately (TD). As n grows, we use more real rewards before bootstrapping. At n=∞, we use only real rewards (MC).

n = 1

Think of it like writing an essay. MC reads the entire book before writing a single sentence — thorough but slow. TD(0) reads one sentence and starts writing — fast but risky. n-step methods read a chapter at a time: enough context to be useful, without waiting forever.

n = 1 (TD)

R_t+1 + γ V(S_t+1)

↓

n = 3

R_t+1 + γR_t+2 + γ²R_t+3 + γ³V(S_t+3)

↓

n = ∞ (MC)

R_t+1 + γR_t+2 + γ²R_t+3 + … + γ^T−t−1R_T

Check: What happens when you set n=1 in an n-step method?

You get Monte Carlo You get TD(0) You get Dynamic Programming

Chapter 1: n-step TD Prediction

Let's make the idea precise. The n-step return from time t is defined as:

G_t:t+n = R_t+1 + γR_t+2 + … + γⁿ⁻¹R_t+n + γⁿV_t+n−1(S_t+n)

We sum up n actual rewards, each discounted appropriately, then add the bootstrapped value of the state n steps ahead. The subscript on V reminds us that we're using whatever value estimates we had at time t+n−1.

The update rule for n-step TD prediction is then:

V(S_t) ← V(S_t) + α [ G_t:t+n − V(S_t) ]

This looks just like the TD(0) update, but with G_t:t+n replacing the one-step return. There's a catch: the update for time t can't happen until time t+n, because we need n future rewards. So there is a delay of n steps before each update. The agent still acts at every step, but it queues its updates.

Error reduction: If V is a good approximation, then the n-step return G_t:t+n is a better estimate of v_π(S_t) than V(S_t) alone. In fact, the worst-case error of the n-step return is guaranteed to be ≤ γⁿ times the worst-case error of V. More steps of real rewards means less reliance on the (possibly inaccurate) bootstrap value.

pseudocode
# n-step TD Prediction
Input: policy π, step size α, n ≥ 1
Initialize V(s) for all s

for each episode:
  S₀ ← starting state
  T ← ∞
  for t = 0, 1, 2, …:
    if t < T:
      take action from π, observe R_t+1, S_t+1
      if S_t+1 is terminal: T ← t+1
    τ ← t − n + 1        # time of state being updated
    if τ ≥ 0:
      G ← ∑_i=τ+1^min(τ+n,T) γ^i−τ−1 R_i
      if τ+n < T: G ← G + γⁿ V(S_τ+n)
      V(S_τ) ← V(S_τ) + α[G − V(S_τ)]
  until τ = T − 1

Notice the variable τ (tau). It tracks which state we're actually updating. When τ < 0, we haven't seen enough steps yet, so we skip. The algorithm keeps a sliding window of n transitions and makes one update per step (after the initial delay).

Edge cases: If the episode ends before n steps, the return automatically becomes the full MC return for those remaining steps — there's nothing to bootstrap from. So n-step TD gracefully reduces to MC at the end of short episodes, even for large n.

Check: Why must the n-step TD update for state S_t wait until time t+n?

To ensure the policy has converged Because it needs the next n rewards and S_t+n to compute the return Because bootstrapping requires full episodes

Chapter 2: n-step Performance

This is the chapter's big payoff. The 19-state random walk is the classic testbed for n-step methods (Example 7.1 in Sutton & Barto). An agent starts in the center of a 19-state chain. At each step it moves left or right with equal probability. The left terminal gives reward 0; the right terminal gives reward +1. All other rewards are 0. The true values form a smooth ramp from near 0 on the left to near 1 on the right.

We run n-step TD with various values of n and various step sizes α, then measure the RMS error averaged over all states and episodes. The result is a U-shaped curve: n=1 (TD) and very large n (near MC) perform poorly, while intermediate values of n (around 4-8) hit the sweet spot.

Showcase: n-step Random Walk

Choose n from 1 to 20 and a step-size α. Click Run to train over 10 episodes and see the RMS error. Click Run All n to compare all n values at the current α and find the sweet spot.

n = 4

α = 0.40

Key insight: The optimal n is almost never 1 (pure TD) or infinity (pure MC). Intermediate n values combine enough real experience to reduce bias with enough bootstrapping to reduce variance. This is the fundamental argument for n-step methods.

Why does this happen? With n=1 (TD), we bootstrap heavily, but our initial value estimates are bad — so we propagate errors. With very large n (MC), we use complete returns, but those returns have high variance because they depend on many random transitions. The sweet spot balances bias (from bootstrapping with imperfect values) against variance (from noisy long returns).

Check: In the random walk experiment, why do intermediate values of n outperform both n=1 and n=∞?

They balance low bias (more real rewards) with low variance (some bootstrapping) They use a better step size automatically They converge in fewer episodes

Chapter 3: n-step Sarsa

So far we've been doing prediction — estimating V(s) for a fixed policy. But we want control: finding the best policy. The natural extension is n-step Sarsa, which estimates action values Q(s,a) instead of state values.

The idea is identical: instead of bootstrapping from V(S_t+n), we bootstrap from Q(S_t+n, A_t+n). The n-step return for Sarsa is:

G_t:t+n = R_t+1 + γR_t+2 + … + γⁿ⁻¹R_t+n + γⁿQ_t+n−1(S_t+n, A_t+n)

And the update rule is:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ G_t:t+n − Q(S_t, A_t) ]

One-step Sarsa (n=1) is just the original Sarsa from Chapter 6. n-step Sarsa uses n actual rewards before bootstrapping. This means information about rewards propagates n times faster per episode than with one-step Sarsa — a huge advantage in tasks where the reward signal is sparse.

Why this matters: Imagine a maze where the only reward is +1 at the exit. With n=1 Sarsa, after one episode, only the state-action pair one step before the exit learns anything useful. With n=10 Sarsa, the last ten state-action pairs all update in one pass. Learning spreads ten times faster.

pseudocode
# n-step Sarsa
Input: step size α, n ≥ 1, ε for exploration
Initialize Q(s,a) for all s, a

for each episode:
  S₀ ← start; A₀ ← ε-greedy from Q(S₀,·)
  T ← ∞
  for t = 0, 1, 2, …:
    if t < T:
      take A_t, observe R_t+1, S_t+1
      if terminal: T ← t+1
      else: A_t+1 ← ε-greedy from Q(S_t+1,·)
    τ ← t − n + 1
    if τ ≥ 0:
      G ← ∑_i=τ+1^min(τ+n,T) γ^i−τ−1 R_i
      if τ+n < T: G ← G + γⁿ Q(S_τ+n, A_τ+n)
      Q(S_τ,A_τ) ← Q(S_τ,A_τ) + α[G − Q(S_τ,A_τ)]
  until τ = T − 1

There is also n-step Expected Sarsa, which replaces the bootstrapped Q(S_t+n, A_t+n) with the expected value over all actions: ∑_a π(a|S_t+n) Q(S_t+n, a). This removes one source of variance (the randomness of A_t+n) and generally performs better.

Check: Compared to 1-step Sarsa, what is the main advantage of n-step Sarsa?

Reward information propagates n times faster along the trajectory It doesn't need an exploration strategy It guarantees convergence in n episodes

Chapter 4: Off-policy n-step

Sometimes we want to learn about one policy (the target policy π) while following another (the behavior policy b). This is the off-policy setting from Chapter 5. In the n-step case, we need importance sampling to correct for the mismatch between the two policies.

The importance sampling ratio for n steps is:

ρ_t:t+n−1 = ∏_k=t^t+n−1 π(A_k|S_k) / b(A_k|S_k)

This product of ratios tells us how much more (or less) likely this particular sequence of actions was under π compared to b. We then scale the update by this ratio:

V(S_t) ← V(S_t) + α ρ_t:t+n−1 [ G_t:t+n − V(S_t) ]

The same idea extends to n-step Sarsa: for off-policy learning of Q values, we use the ratio ρ_t+1:t+n−1 (starting one step later, because the first action A_t is already accounted for in Q(S_t, A_t)).

The variance problem: Importance sampling ratios are products. Each factor can be large if π likes an action that b rarely takes. After n steps, the product can explode or collapse to zero. High variance is the fundamental cost of off-policy learning with importance sampling — and it gets worse with larger n.

Importance Sampling Variance

See how the IS ratio can explode or collapse as n grows. Each step multiplies by π(a)/b(a). Even mild mismatches compound quickly.

n steps = 1

π(a)/b(a) = 1.5

This variance problem motivates two alternatives. One is per-decision importance sampling, which applies the ratio more carefully. The other — and more elegant — is to avoid importance sampling entirely. That's the tree backup algorithm, coming in Chapter 5.

Check: Why does the importance sampling ratio become problematic for large n?

It always equals zero for large n It requires knowing the true value function It's a product of n terms, so variance grows exponentially

Chapter 5: Tree Backup Algorithm

What if we could do off-policy learning without importance sampling? The tree backup algorithm achieves this by using expected values at each step instead of sampled actions.

Here's the insight: at each step, instead of following the action that was actually taken and correcting with importance sampling, we weight each possible action by its probability under π. For the action that was taken, we go deeper into the tree. For all actions that were not taken, we use their current Q estimates.

The tree metaphor: Imagine a tree rooted at S_t. At the first level, the tree branches into all possible actions. For the action actually taken (A_t), we follow the branch one level deeper. For all other actions, we "back up" their Q values immediately, weighted by π(a|S_t). The result is a return that blends actual experience (for the taken actions) with expected values (for the untaken ones).

The one-step tree backup return is just Expected Sarsa's target:

G_t:t+1 = ∑_a π(a|S_t) Q(S_t, a)

The two-step tree backup return is more interesting. It takes the reward R_t+1 that was actually observed, adds the Q values for all untaken actions at S_t+1 weighted by π, and follows the taken action one step deeper:

G_t:t+2 = R_t+1 + γ ∑_{a≠A_t+1} π(a|S_t+1) Q(S_t+1,a) + γ π(A_t+1|S_t+1) G_t+1:t+2

The general pattern continues recursively. At each level, side-branch actions contribute their Q values immediately, while the taken action is expanded further.

Tree Backup Structure

A visual of how the tree backup builds its return. Orange nodes are the actions actually taken (followed deeper). Teal nodes are side-branch actions (Q values used immediately).

Depth = 2

No importance sampling needed: Because we weight actions by π(a|s) at every level, the tree backup return is already an expectation under π. The behavior policy b only matters in that it determines which action we follow deeper — but the weighting uses π, not b. This eliminates the variance explosion that plagues importance sampling at large n.

Check: How does the tree backup algorithm avoid importance sampling?

It weights all actions by their target policy probabilities instead of correcting sampled actions It only learns on-policy It ignores actions that weren't taken

Chapter 6: Unifying with Q(σ)

We now have three n-step algorithms for action values: n-step Sarsa (pure sampling), n-step Expected Sarsa (expectation at the last step), and the tree backup (expectations at every step). Can we unify them?

Yes. The Q(σ) algorithm introduces a parameter σ ∈ [0, 1] at each step that controls how much we sample versus how much we take expectations. When σ = 1, we fully sample that step (Sarsa-style). When σ = 0, we fully take expectations (tree-backup-style). At intermediate values, we blend.

The unification: σ = 1 at all steps gives n-step Sarsa. σ = 0 at all steps gives tree backup. σ = 1 at all steps except the last (where σ = 0) gives n-step Expected Sarsa. Q(σ) subsumes all three as special cases.

Q(σ) Interpolation

Slide σ from 0 (tree backup / full expectation) to 1 (Sarsa / full sampling). Watch how the algorithm blends between the two extremes.

σ = 0.50

At each step k of the n-step return, the sliding-scale target is:

σ_k · [R_k+1 + γ Q(S_k+1,A_k+1)] + (1 − σ_k) · [R_k+1 + γ ∑_a π(a|S_k+1) Q(S_k+1,a)]

When σ_k = 1, we get the sampled next action-value. When σ_k = 0, we get the expected action-value. And σ can vary per step — you could sample for early steps (when you have actual trajectory data) and take expectations later.

In practice, σ can even be set per state — for example, σ = 1 in states where the policy is deterministic (where sampling and expectation are the same) and σ = 0 in states with high stochasticity (to reduce variance).

Check: What does Q(σ) with σ=0 at all steps reduce to?

n-step Sarsa Monte Carlo Tree backup

Chapter 7: Backup Diagrams

One of the best tools for understanding RL algorithms is the backup diagram. Each diagram shows which states and actions contribute to the update of the root node. Let's compare all the backup types side by side.

Backup Diagram Comparison

Select an algorithm to see its backup structure. Open circles = states. Filled circles = actions. Orange = sampled path. Teal = expected/backed-up branches.

Backup Type	Samples?	Off-policy?	Needs IS?
n-step TD	Yes (one path)	With IS	Yes
n-step Sarsa	Yes (one path)	With IS	Yes
Tree Backup	No (all branches)	Naturally	No
Q(σ)	Blended	Blended	Partial

Reading backup diagrams: A solid downward line means "this specific action was taken and this specific next state was observed" (sampling). A branching set of lines means "we considered all possible actions or next states" (expectation). The depth of the diagram is n — the number of steps before we bootstrap. TD has depth 1. MC has depth all the way to the terminal.

The diagrams make the differences concrete. In n-step TD/Sarsa, a single sampled path goes deep. In tree backup, at every level, all action branches are shown, but only the taken action goes deeper. In Q(σ), the degree of branching depends on σ.

Check: In a backup diagram, what distinguishes the tree backup from n-step Sarsa?

Tree backup shows all action branches at each level; n-step Sarsa follows only the taken action Tree backup is always deeper than n-step Sarsa They look identical

Chapter 8: Summary

n-step methods fill the gap between TD and Monte Carlo, giving us a powerful design knob. Instead of committing to one extreme, we can tune n to balance bias and variance for each problem.

Algorithm	What it does	Key property
n-step TD	Prediction with n-step returns	Bridges TD(0) and MC
n-step Sarsa	On-policy control	Faster credit propagation
Off-policy n-step	Importance sampling corrections	Learns target from behavior
Tree Backup	Off-policy without IS	No variance explosion
Q(σ)	Unifies all three	Per-step sampling/expectation

The bias-variance tradeoff: Larger n means more real rewards (less bias from bootstrapping) but more variance (more random steps in the return). Smaller n means less variance but more bias. The best n depends on the problem, the quality of the value estimates, and the step size.

Strengths of n-step methods:

• Tunable bias-variance tradeoff
• Faster credit propagation than 1-step
• Tree backup: off-policy without IS
• Q(σ) unifies the design space

Limitations:

• Must store n transitions in memory
• Updates delayed by n steps
• IS variance grows with n (except tree backup)
• Optimal n is problem-dependent

What comes next: Chapter 6 gave us model-free learning. This chapter showed how to tune the bootstrapping horizon. Chapter 8 asks: what if we have a model, even an imperfect one? The Dyna architecture integrates model-based planning with model-free learning, using the model to generate simulated experience that amplifies real experience.

Looking ahead: In Chapter 12, we'll see eligibility traces, which provide an even more elegant way to interpolate between TD and MC. Instead of picking a fixed n, eligibility traces effectively average over all n values simultaneously, weighted by λⁿ. This gives us TD(λ) — the continuous version of the n-step spectrum.

"The best number of steps to look ahead depends on
the quality of your crystal ball."
— Intuition behind the bias-variance tradeoff

Check: What is the main advantage of the tree backup algorithm over n-step Sarsa for off-policy learning?

It converges in fewer episodes It avoids importance sampling, eliminating variance explosion It works only for deterministic policies