Ch 6: Temporal-Difference Learning

Chapter 0: TD Prediction

In Chapter 5, Monte Carlo methods learned value functions by waiting until the end of an episode, computing the actual return, and then updating. That works, but it has a frustrating requirement: you must wait until the episode terminates. What if episodes are very long? What if they never end at all?

Temporal-difference (TD) learning removes that requirement. Instead of waiting for the actual return, TD updates its value estimate immediately after each step, using the reward it just observed plus its current estimate of the next state's value. It learns a guess from a guess.

The simplest TD method, called TD(0), uses this update rule:

V(S_t) ← V(S_t) + α [ R_t+1 + γ V(S_t+1) − V(S_t) ]

The quantity in brackets is the TD error, written δ_t. It measures the difference between the better estimate (reward + discounted next value) and the current estimate. If δ_t is positive, the outcome was better than expected, so we nudge V(S_t) upward. If negative, we nudge it down.

The core idea: MC waits for the real answer; TD makes do with a partial answer plus a guess. The target R_t+1 + γV(S_t+1) is called the TD target. Because it includes an estimate V(S_t+1), TD is said to bootstrap — it updates a guess toward a guess.

Random Walk: TD(0) vs Monte Carlo

A 5-state random walk (A–E) with terminal states at each end. The agent steps left or right with equal probability. The left terminal gives reward 0, the right terminal gives reward +1. True values are 1/6, 2/6, 3/6, 4/6, 5/6. Watch how TD updates after every step while MC waits for the episode to end.

Click Step or Run Episode to begin.

What does TD(0) use as its update target instead of the actual return G_t?

The average of all future rewards R_t+1 + γV(S_t+1) — the reward plus discounted next-state estimate Just the immediate reward R_t+1

Chapter 1: The Driving Home Example

Sutton and Barto's famous example makes the difference vivid. Imagine you commute home every day and try to predict the total travel time. Each day, as you pass landmarks (leaving work, reaching the highway, exiting, etc.), you observe elapsed time and revise your estimate of the remaining time.

A Monte Carlo approach would wait until you actually arrive home, record the total time, then go back and update your estimate for every landmark you passed that day. You learn nothing until the trip is over.

A TD approach updates immediately. When you reach the highway and notice it took 5 minutes instead of the 3 you expected, you immediately revise your estimate upward — without waiting to see how the rest of the drive goes. Each landmark updates from the next landmark's estimate.

Key insight: TD learning is how humans naturally think. You don't wait until you arrive home to feel that traffic was heavier than expected. You revise your mental estimate in real time, at each stage of the journey, using partial information plus your expectations about the rest.

Driving Home: TD vs MC Updates

Watch how the predicted total travel time evolves through one commute. TD updates at each waypoint; MC stays flat until the trip ends, then updates all at once.

Waypoint	Elapsed	Predicted Total	Actual Total
Leave office	0	30	43
Reach car	5	35	43
Highway on-ramp	20	35	43
Exit highway	30	40	43
Side street	35	43	43
Home	43	43	43

In the driving home example, when does a TD agent first update its predictions?

At the very first waypoint — as soon as it sees the next estimate Only when it arrives home It never updates predictions

Chapter 2: Advantages of TD

Why not just use Monte Carlo? TD has three practical advantages that matter enormously in real applications.

1. No waiting for episode end. TD can learn online, step by step. This is essential for continuing tasks (tasks that never terminate) and for long episodes where waiting is impractical.

2. Lower variance. MC targets (actual returns) depend on many random actions and transitions. TD targets depend on only one step of randomness. The TD target has higher bias (because V(S_t+1) might be wrong) but much lower variance. In practice, the lower variance usually wins.

3. No model required. Like MC, TD learns directly from experience. Unlike DP, it doesn't need to know transition probabilities. TD combines the best of both: bootstrapping like DP, but model-free like MC.

The bias-variance tradeoff: MC is unbiased but high-variance (the actual return is noisy). TD is biased (the bootstrap estimate might be wrong) but low-variance (only one step of noise). For most practical problems, TD converges faster because low variance dominates.

Property	DP	MC	TD
Needs model?	Yes	No	No
Bootstraps?	Yes	No	Yes
Online (per-step)?	No*	No	Yes
Works with continuing tasks?	Yes	No	Yes
Target bias	None	None	Some
Target variance	None	High	Low

*DP does full sweeps, not truly step-by-step from experience.

Bootstraps + needs model

↓ remove model requirement

Bootstraps + learns from experience

↑ add bootstrapping

No bootstrap + learns from experience

Which is TRUE about TD compared to Monte Carlo?

TD has lower variance targets but introduces bias from bootstrapping TD has higher variance because it updates more often TD requires a model of the environment

Chapter 3: Batch TD vs MC

Here's a surprising fact: given the same finite batch of experience, TD and MC converge to different answers. Neither is wrong — they're answering different questions.

Batch MC converges to the values that minimize mean-squared error on the observed returns. It treats each episode as an independent sample and averages.

Batch TD(0) converges to the values of the maximum-likelihood MDP model implied by the data. TD implicitly builds a model from the data (count transitions, average rewards) and computes the value function of that model. This is called the certainty-equivalence estimate.

Certainty equivalence: TD acts as if its learned model is exactly correct and computes values accordingly. MC makes no such assumption — it just averages returns. TD's approach is typically better with limited data because it exploits the Markov structure.

Consider a tiny example with states A and B. You observe 8 episodes: A → B → 0 (seen 6 times), B → 1 (seen 1 time), B → 0 (seen 7 times). What is V(A)?

Batch MC: The only episode starting from A ended with return 0, so V(A) = 0.

Batch TD: A always transitions to B. V(B) = 1/8 (one reward of 1 out of 8 visits). Since A → B with reward 0, V(A) = 0 + γV(B) = γ/8. TD propagates B's value back to A through the Markov structure.

Batch TD vs Batch MC

Generate random batches from a 3-state chain. See how batch TD and batch MC converge to different value estimates from the same data.

Batch size20

Why does batch TD give a different answer than batch MC on finite data?

TD uses a different learning rate TD finds values of the maximum-likelihood MDP; MC averages observed returns TD has a bug in the algorithm

Chapter 4: Sarsa — On-Policy TD Control

So far, TD has estimated state values V(s). But for control (finding a good policy), we need action values Q(s,a). Sarsa is the natural extension: instead of updating V toward the next state's value, we update Q toward the next state-action pair's value.

The name comes from the quintuple used in each update: (S_t, A_t, R_t+1, S_t+1, A_t+1) — SARSA. The update rule is:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ Q(S_t+1, A_t+1) − Q(S_t, A_t) ]

Notice: the next action A_t+1 is the action the agent actually takes (chosen by the current policy, typically ε-greedy). This makes Sarsa on-policy: it evaluates and improves the policy it's actually following, including its exploration.

The SARSA quintuple: S_t (state), A_t (action taken), R_t+1 (reward received), S_t+1 (next state), A_t+1 (next action taken). The agent needs all five pieces before it can do one update. It commits to the next action before learning.

Sarsa on Windy Gridworld

A 10×7 gridworld with upward wind in the middle columns. Start at S, goal at G. Wind pushes the agent upward by 0, 1, or 2 cells depending on column. Sarsa learns to compensate.

Click Train to begin Sarsa learning.

Why is Sarsa called "on-policy"?

It uses the action A_t+1 the agent actually takes under its current policy It always picks the greedy action It requires knowing the environment model

Chapter 5: Q-Learning — Off-Policy TD Control

Q-learning is arguably the most important algorithm in this entire book. Its update rule looks almost identical to Sarsa's, but with one crucial difference: instead of using the action the agent actually takes, it uses the best possible action at the next state:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ max_a Q(S_t+1, a) − Q(S_t, A_t) ]

That max is the key. Q-learning updates toward the greedy action regardless of which action the agent actually took. This makes it off-policy: it learns the optimal Q-function directly, even while following an exploratory ε-greedy policy.

The difference between Sarsa and Q-learning comes alive in the Cliff Walking problem. Imagine a gridworld where a cliff runs along the bottom edge. Falling off the cliff gives -100 reward. Each normal step gives -1. The start is bottom-left, the goal is bottom-right.

Sarsa vs Q-learning on the cliff: Sarsa, being on-policy, knows its ε-greedy policy sometimes takes random steps. Near the cliff, a random step is fatal, so Sarsa learns the safe path along the top. Q-learning, being off-policy, learns the optimal path along the cliff edge — shorter but dangerous under exploration. This perfectly illustrates the on-policy/off-policy distinction.

Cliff Walking: Sarsa vs Q-Learning

Watch both algorithms learn side by side. Sarsa (left) finds the safe path away from the cliff. Q-learning (right) finds the optimal path along the cliff edge. Adjust ε to see how exploration changes the strategies.

Epsilon ε0.10

Learning rate α0.50

Click Train to watch Sarsa and Q-learning learn different strategies.

Why does Sarsa learn a longer, safer path than Q-learning on Cliff Walking?

Sarsa accounts for its own exploratory behavior and avoids the dangerous cliff edge Sarsa has a lower learning rate Q-learning converges faster

Chapter 6: Expected Sarsa

Sarsa updates toward Q(S_t+1, A_t+1) — the value of the specific next action taken. But what if instead of sampling one action, we averaged over all possible next actions, weighted by the policy?

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ ∑_a π(a|S_t+1) Q(S_t+1, a) − Q(S_t, A_t) ]

This is Expected Sarsa. Instead of the random next action, it uses the expected value under the policy. This eliminates one source of variance (the randomness of A_t+1) at the cost of slightly more computation per step.

Here is the beautiful unification: Expected Sarsa is a generalization of both Sarsa and Q-learning. If the policy π is ε-greedy, Expected Sarsa accounts for both greedy and exploratory actions. If you set π to be purely greedy (the argmax), the expectation reduces to the max, and Expected Sarsa becomes Q-learning.

Unification: Sarsa, Q-learning, and Expected Sarsa are all the same algorithm with different targets. Sarsa: sample one next action. Q-learning: take the max. Expected Sarsa: take the weighted average. Each trades off bias, variance, and computation differently.

Method	TD Target uses	Variance	Policy
Sarsa	Q(S', A') where A' is sampled	Highest	On-policy
Q-learning	max_a Q(S', a)	Medium	Off-policy
Expected Sarsa	∑_a π(a\|S') Q(S', a)	Lowest	Either

Under what condition does Expected Sarsa reduce to Q-learning?

When the learning rate is 1 When the target policy is greedy (all weight on the max-action) When epsilon is very large

Chapter 7: Maximization Bias

Q-learning uses max_a Q(S', a) as its target. This seems sensible, but there is a subtle trap. When Q-values are noisy estimates (and they always are, especially early in learning), the maximum of noisy estimates is biased upward. This is called maximization bias.

Think of it this way: you ask 10 friends to estimate the temperature outside. Some guess too high, some too low. If you take the highest guess, you'll almost certainly overestimate the true temperature. The max of noisy estimates is optimistically biased.

The problem: If Q(s,a) has noise around the true value, then max_a Q(s,a) ≥ max_a Q^*(s,a) in expectation. Q-learning can systematically overestimate values, leading to poor policy decisions.

The fix is Double Q-learning. Maintain two Q-tables, Q₁ and Q₂. On each update, use one table to select the best action and the other table to evaluate it:

Q₁(S,A) ← Q₁(S,A) + α [ R + γ Q₂(S', argmax_a Q₁(S', a)) − Q₁(S,A) ]

By decoupling action selection (which action looks best) from action evaluation (how good is that action really), the bias disappears. Each table's noise is independent, so one table's optimistic errors don't infect the other's evaluation.

Why it works: When Q₁ overestimates action a, Q₂ has independent noise and is unlikely to overestimate the same action. The cross-evaluation cancels the optimistic bias. Symmetrically, Q₂ selects, Q₁ evaluates (randomly chosen each step).

What causes maximization bias in Q-learning?

The learning rate is too large The max of noisy estimates is systematically higher than the max of true values The agent explores too much

Chapter 8: Comparing TD Control Methods

Let's put it all together. We have three TD control methods — Sarsa, Q-learning, and Expected Sarsa — plus the Double variants. How do they relate, and when should you use each?

The differences come down to what each method uses as its TD target at S_t+1. Here's a unified view using backup diagrams, the visual language Sutton and Barto use to show which future quantities an update depends on.

Reading backup diagrams: A filled circle is a state. An open circle is a state-action pair. Lines show which future values are used in the update. Sarsa follows one path down. Q-learning takes the max over a fan of actions. Expected Sarsa averages over the fan.

Backup Diagrams & Performance

Backup diagrams for each TD control method, showing which quantities are used in the update target.

Method	Target at S'	On/Off-policy	Best when
Sarsa	Q(S', A') (sampled)	On-policy	Online performance matters
Q-learning	max_a Q(S', a)	Off-policy	You want the optimal policy
Expected Sarsa	E_π[Q(S', a)]	Either	You want low variance
Double Q	Q₂(S', argmax Q₁)	Off-policy	Max bias is a concern

The GPI thread continues: All of these methods follow the Generalized Policy Iteration (GPI) pattern from Chapter 4: evaluate Q under the current policy, then improve the policy greedily. The only difference is how they estimate Q. TD methods do it sample by sample, one step at a time.

For an agent that must perform well DURING learning (not just after), which method is typically preferred?

Sarsa — it accounts for exploration in its value estimates Q-learning — it converges to optimal values None — they all perform the same online

Chapter 9: Summary

Temporal-difference learning is the central and most novel idea in reinforcement learning. It combines the model-free, experience-based nature of Monte Carlo with the bootstrapping power of dynamic programming. The result is a family of algorithms that learn online, step by step, from incomplete episodes, without a model.

TD Prediction

TD(0) updates V using R + γV(S') as target

↓ extend to action values

Sarsa

On-policy: uses Q(S', A') where A' is from π

↓ use max instead

Q-learning

Off-policy: uses max_a Q(S', a)

↓ take expectation

Expected Sarsa

Uses ∑ π(a|S') Q(S',a) — unifies both

↓ fix max bias

Double Q-learning

Two tables decouple selection and evaluation

Looking ahead: Chapter 7 generalizes from 1-step TD to n-step methods, bridging the gap between TD(0) and Monte Carlo. Chapter 8 combines TD learning with model-based planning in Dyna. The TD idea — learning a guess from a guess — runs through all of it.

Key equations from this chapter:

TD(0): V(S) ← V(S) + α[R + γV(S') − V(S)]

Sarsa: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') − Q(S,A)]

Q-learn: Q(S,A) ← Q(S,A) + α[R + γmax_aQ(S',a) − Q(S,A)]

What to remember:

TD bootstraps — it updates a guess from a guess. This is both its power (low variance, online, no model) and its limitation (biased until convergence).

On-policy methods (Sarsa) learn the value of what you do. Off-policy methods (Q-learning) learn the value of what you should do.

← Chapter 5: Monte Carlo Chapter 7: n-step →

What is the defining characteristic of temporal-difference learning that distinguishes it from both DP and MC?

It uses a fixed learning rate It requires a model of the environment It bootstraps (updates from estimates) while learning directly from experience without a model