Sutton & Barto, Chapter 6

Temporal-Difference Learning

Learning a guess from a guess — the idea that makes modern RL possible.

Prerequisites: Chapters 3–5 (MDPs, DP, Monte Carlo methods).
10
Chapters
5
Simulations
10
Quizzes

Chapter 0: TD Prediction

In Chapter 5, Monte Carlo methods learned value functions by waiting until the end of an episode, computing the actual return, and then updating. That works, but it has a frustrating requirement: you must wait until the episode terminates. What if episodes are very long? What if they never end at all?

Temporal-difference (TD) learning removes that requirement. Instead of waiting for the actual return, TD updates its value estimate immediately after each step, using the reward it just observed plus its current estimate of the next state's value. It learns a guess from a guess.

The simplest TD method, called TD(0), uses this update rule:

V(St) ← V(St) + α [ Rt+1 + γ V(St+1) − V(St) ]

The quantity in brackets is the TD error, written δt. It measures the difference between the better estimate (reward + discounted next value) and the current estimate. If δt is positive, the outcome was better than expected, so we nudge V(St) upward. If negative, we nudge it down.

The core idea: MC waits for the real answer; TD makes do with a partial answer plus a guess. The target Rt+1 + γV(St+1) is called the TD target. Because it includes an estimate V(St+1), TD is said to bootstrap — it updates a guess toward a guess.
Random Walk: TD(0) vs Monte Carlo

A 5-state random walk (A–E) with terminal states at each end. The agent steps left or right with equal probability. The left terminal gives reward 0, the right terminal gives reward +1. True values are 1/6, 2/6, 3/6, 4/6, 5/6. Watch how TD updates after every step while MC waits for the episode to end.

Click Step or Run Episode to begin.
What does TD(0) use as its update target instead of the actual return Gt?

Chapter 1: The Driving Home Example

Sutton and Barto's famous example makes the difference vivid. Imagine you commute home every day and try to predict the total travel time. Each day, as you pass landmarks (leaving work, reaching the highway, exiting, etc.), you observe elapsed time and revise your estimate of the remaining time.

A Monte Carlo approach would wait until you actually arrive home, record the total time, then go back and update your estimate for every landmark you passed that day. You learn nothing until the trip is over.

A TD approach updates immediately. When you reach the highway and notice it took 5 minutes instead of the 3 you expected, you immediately revise your estimate upward — without waiting to see how the rest of the drive goes. Each landmark updates from the next landmark's estimate.

Key insight: TD learning is how humans naturally think. You don't wait until you arrive home to feel that traffic was heavier than expected. You revise your mental estimate in real time, at each stage of the journey, using partial information plus your expectations about the rest.
Driving Home: TD vs MC Updates

Watch how the predicted total travel time evolves through one commute. TD updates at each waypoint; MC stays flat until the trip ends, then updates all at once.

WaypointElapsedPredicted TotalActual Total
Leave office03043
Reach car53543
Highway on-ramp203543
Exit highway304043
Side street354343
Home434343
In the driving home example, when does a TD agent first update its predictions?

Chapter 2: Advantages of TD

Why not just use Monte Carlo? TD has three practical advantages that matter enormously in real applications.

1. No waiting for episode end. TD can learn online, step by step. This is essential for continuing tasks (tasks that never terminate) and for long episodes where waiting is impractical.

2. Lower variance. MC targets (actual returns) depend on many random actions and transitions. TD targets depend on only one step of randomness. The TD target has higher bias (because V(St+1) might be wrong) but much lower variance. In practice, the lower variance usually wins.

3. No model required. Like MC, TD learns directly from experience. Unlike DP, it doesn't need to know transition probabilities. TD combines the best of both: bootstrapping like DP, but model-free like MC.

The bias-variance tradeoff: MC is unbiased but high-variance (the actual return is noisy). TD is biased (the bootstrap estimate might be wrong) but low-variance (only one step of noise). For most practical problems, TD converges faster because low variance dominates.
PropertyDPMCTD
Needs model?YesNoNo
Bootstraps?YesNoYes
Online (per-step)?No*NoYes
Works with continuing tasks?YesNoYes
Target biasNoneNoneSome
Target varianceNoneHighLow

*DP does full sweeps, not truly step-by-step from experience.

DP
Bootstraps + needs model
↓ remove model requirement
TD
Bootstraps + learns from experience
↑ add bootstrapping
MC
No bootstrap + learns from experience
Which is TRUE about TD compared to Monte Carlo?

Chapter 3: Batch TD vs MC

Here's a surprising fact: given the same finite batch of experience, TD and MC converge to different answers. Neither is wrong — they're answering different questions.

Batch MC converges to the values that minimize mean-squared error on the observed returns. It treats each episode as an independent sample and averages.

Batch TD(0) converges to the values of the maximum-likelihood MDP model implied by the data. TD implicitly builds a model from the data (count transitions, average rewards) and computes the value function of that model. This is called the certainty-equivalence estimate.

Certainty equivalence: TD acts as if its learned model is exactly correct and computes values accordingly. MC makes no such assumption — it just averages returns. TD's approach is typically better with limited data because it exploits the Markov structure.

Consider a tiny example with states A and B. You observe 8 episodes: A → B → 0 (seen 6 times), B → 1 (seen 1 time), B → 0 (seen 7 times). What is V(A)?

Batch MC: The only episode starting from A ended with return 0, so V(A) = 0.

Batch TD: A always transitions to B. V(B) = 1/8 (one reward of 1 out of 8 visits). Since A → B with reward 0, V(A) = 0 + γV(B) = γ/8. TD propagates B's value back to A through the Markov structure.

Batch TD vs Batch MC

Generate random batches from a 3-state chain. See how batch TD and batch MC converge to different value estimates from the same data.

Batch size20
Why does batch TD give a different answer than batch MC on finite data?

Chapter 4: Sarsa — On-Policy TD Control

So far, TD has estimated state values V(s). But for control (finding a good policy), we need action values Q(s,a). Sarsa is the natural extension: instead of updating V toward the next state's value, we update Q toward the next state-action pair's value.

The name comes from the quintuple used in each update: (St, At, Rt+1, St+1, At+1) — SARSA. The update rule is:

Q(St, At) ← Q(St, At) + α [ Rt+1 + γ Q(St+1, At+1) − Q(St, At) ]

Notice: the next action At+1 is the action the agent actually takes (chosen by the current policy, typically ε-greedy). This makes Sarsa on-policy: it evaluates and improves the policy it's actually following, including its exploration.

The SARSA quintuple: St (state), At (action taken), Rt+1 (reward received), St+1 (next state), At+1 (next action taken). The agent needs all five pieces before it can do one update. It commits to the next action before learning.
Sarsa on Windy Gridworld

A 10×7 gridworld with upward wind in the middle columns. Start at S, goal at G. Wind pushes the agent upward by 0, 1, or 2 cells depending on column. Sarsa learns to compensate.

Click Train to begin Sarsa learning.
Why is Sarsa called "on-policy"?

Chapter 5: Q-Learning — Off-Policy TD Control

Q-learning is arguably the most important algorithm in this entire book. Its update rule looks almost identical to Sarsa's, but with one crucial difference: instead of using the action the agent actually takes, it uses the best possible action at the next state:

Q(St, At) ← Q(St, At) + α [ Rt+1 + γ maxa Q(St+1, a) − Q(St, At) ]

That max is the key. Q-learning updates toward the greedy action regardless of which action the agent actually took. This makes it off-policy: it learns the optimal Q-function directly, even while following an exploratory ε-greedy policy.

The difference between Sarsa and Q-learning comes alive in the Cliff Walking problem. Imagine a gridworld where a cliff runs along the bottom edge. Falling off the cliff gives -100 reward. Each normal step gives -1. The start is bottom-left, the goal is bottom-right.

Sarsa vs Q-learning on the cliff: Sarsa, being on-policy, knows its ε-greedy policy sometimes takes random steps. Near the cliff, a random step is fatal, so Sarsa learns the safe path along the top. Q-learning, being off-policy, learns the optimal path along the cliff edge — shorter but dangerous under exploration. This perfectly illustrates the on-policy/off-policy distinction.
Cliff Walking: Sarsa vs Q-Learning

Watch both algorithms learn side by side. Sarsa (left) finds the safe path away from the cliff. Q-learning (right) finds the optimal path along the cliff edge. Adjust ε to see how exploration changes the strategies.

Epsilon ε0.10
Learning rate α0.50
Click Train to watch Sarsa and Q-learning learn different strategies.
Why does Sarsa learn a longer, safer path than Q-learning on Cliff Walking?

Chapter 6: Expected Sarsa

Sarsa updates toward Q(St+1, At+1) — the value of the specific next action taken. But what if instead of sampling one action, we averaged over all possible next actions, weighted by the policy?

Q(St, At) ← Q(St, At) + α [ Rt+1 + γ ∑a π(a|St+1) Q(St+1, a) − Q(St, At) ]

This is Expected Sarsa. Instead of the random next action, it uses the expected value under the policy. This eliminates one source of variance (the randomness of At+1) at the cost of slightly more computation per step.

Here is the beautiful unification: Expected Sarsa is a generalization of both Sarsa and Q-learning. If the policy π is ε-greedy, Expected Sarsa accounts for both greedy and exploratory actions. If you set π to be purely greedy (the argmax), the expectation reduces to the max, and Expected Sarsa becomes Q-learning.

Unification: Sarsa, Q-learning, and Expected Sarsa are all the same algorithm with different targets. Sarsa: sample one next action. Q-learning: take the max. Expected Sarsa: take the weighted average. Each trades off bias, variance, and computation differently.
MethodTD Target usesVariancePolicy
SarsaQ(S', A') where A' is sampledHighestOn-policy
Q-learningmaxa Q(S', a)MediumOff-policy
Expected Sarsaa π(a|S') Q(S', a)LowestEither
Under what condition does Expected Sarsa reduce to Q-learning?

Chapter 7: Maximization Bias

Q-learning uses maxa Q(S', a) as its target. This seems sensible, but there is a subtle trap. When Q-values are noisy estimates (and they always are, especially early in learning), the maximum of noisy estimates is biased upward. This is called maximization bias.

Think of it this way: you ask 10 friends to estimate the temperature outside. Some guess too high, some too low. If you take the highest guess, you'll almost certainly overestimate the true temperature. The max of noisy estimates is optimistically biased.

The problem: If Q(s,a) has noise around the true value, then maxa Q(s,a) ≥ maxa Q*(s,a) in expectation. Q-learning can systematically overestimate values, leading to poor policy decisions.

The fix is Double Q-learning. Maintain two Q-tables, Q1 and Q2. On each update, use one table to select the best action and the other table to evaluate it:

Q1(S,A) ← Q1(S,A) + α [ R + γ Q2(S', argmaxa Q1(S', a)) − Q1(S,A) ]

By decoupling action selection (which action looks best) from action evaluation (how good is that action really), the bias disappears. Each table's noise is independent, so one table's optimistic errors don't infect the other's evaluation.

Why it works: When Q1 overestimates action a, Q2 has independent noise and is unlikely to overestimate the same action. The cross-evaluation cancels the optimistic bias. Symmetrically, Q2 selects, Q1 evaluates (randomly chosen each step).
What causes maximization bias in Q-learning?

Chapter 8: Comparing TD Control Methods

Let's put it all together. We have three TD control methods — Sarsa, Q-learning, and Expected Sarsa — plus the Double variants. How do they relate, and when should you use each?

The differences come down to what each method uses as its TD target at St+1. Here's a unified view using backup diagrams, the visual language Sutton and Barto use to show which future quantities an update depends on.

Reading backup diagrams: A filled circle is a state. An open circle is a state-action pair. Lines show which future values are used in the update. Sarsa follows one path down. Q-learning takes the max over a fan of actions. Expected Sarsa averages over the fan.
Backup Diagrams & Performance

Backup diagrams for each TD control method, showing which quantities are used in the update target.

MethodTarget at S'On/Off-policyBest when
SarsaQ(S', A') (sampled)On-policyOnline performance matters
Q-learningmaxa Q(S', a)Off-policyYou want the optimal policy
Expected SarsaEπ[Q(S', a)]EitherYou want low variance
Double QQ2(S', argmax Q1)Off-policyMax bias is a concern
The GPI thread continues: All of these methods follow the Generalized Policy Iteration (GPI) pattern from Chapter 4: evaluate Q under the current policy, then improve the policy greedily. The only difference is how they estimate Q. TD methods do it sample by sample, one step at a time.
For an agent that must perform well DURING learning (not just after), which method is typically preferred?

Chapter 9: Summary

Temporal-difference learning is the central and most novel idea in reinforcement learning. It combines the model-free, experience-based nature of Monte Carlo with the bootstrapping power of dynamic programming. The result is a family of algorithms that learn online, step by step, from incomplete episodes, without a model.

TD Prediction
TD(0) updates V using R + γV(S') as target
↓ extend to action values
Sarsa
On-policy: uses Q(S', A') where A' is from π
↓ use max instead
Q-learning
Off-policy: uses maxa Q(S', a)
↓ take expectation
Expected Sarsa
Uses ∑ π(a|S') Q(S',a) — unifies both
↓ fix max bias
Double Q-learning
Two tables decouple selection and evaluation
Looking ahead: Chapter 7 generalizes from 1-step TD to n-step methods, bridging the gap between TD(0) and Monte Carlo. Chapter 8 combines TD learning with model-based planning in Dyna. The TD idea — learning a guess from a guess — runs through all of it.

Key equations from this chapter:

TD(0): V(S) ← V(S) + α[R + γV(S') − V(S)]
Sarsa: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') − Q(S,A)]
Q-learn: Q(S,A) ← Q(S,A) + α[R + γmaxaQ(S',a) − Q(S,A)]

What to remember:

TD bootstraps — it updates a guess from a guess. This is both its power (low variance, online, no model) and its limitation (biased until convergence).

On-policy methods (Sarsa) learn the value of what you do. Off-policy methods (Q-learning) learn the value of what you should do.

← Chapter 5: Monte Carlo    Chapter 7: n-step →

What is the defining characteristic of temporal-difference learning that distinguishes it from both DP and MC?