Learning a guess from a guess — the idea that makes modern RL possible.
In Chapter 5, Monte Carlo methods learned value functions by waiting until the end of an episode, computing the actual return, and then updating. That works, but it has a frustrating requirement: you must wait until the episode terminates. What if episodes are very long? What if they never end at all?
Temporal-difference (TD) learning removes that requirement. Instead of waiting for the actual return, TD updates its value estimate immediately after each step, using the reward it just observed plus its current estimate of the next state's value. It learns a guess from a guess.
The simplest TD method, called TD(0), uses this update rule:
The quantity in brackets is the TD error, written δt. It measures the difference between the better estimate (reward + discounted next value) and the current estimate. If δt is positive, the outcome was better than expected, so we nudge V(St) upward. If negative, we nudge it down.
A 5-state random walk (A–E) with terminal states at each end. The agent steps left or right with equal probability. The left terminal gives reward 0, the right terminal gives reward +1. True values are 1/6, 2/6, 3/6, 4/6, 5/6. Watch how TD updates after every step while MC waits for the episode to end.
Sutton and Barto's famous example makes the difference vivid. Imagine you commute home every day and try to predict the total travel time. Each day, as you pass landmarks (leaving work, reaching the highway, exiting, etc.), you observe elapsed time and revise your estimate of the remaining time.
A Monte Carlo approach would wait until you actually arrive home, record the total time, then go back and update your estimate for every landmark you passed that day. You learn nothing until the trip is over.
A TD approach updates immediately. When you reach the highway and notice it took 5 minutes instead of the 3 you expected, you immediately revise your estimate upward — without waiting to see how the rest of the drive goes. Each landmark updates from the next landmark's estimate.
Watch how the predicted total travel time evolves through one commute. TD updates at each waypoint; MC stays flat until the trip ends, then updates all at once.
| Waypoint | Elapsed | Predicted Total | Actual Total |
|---|---|---|---|
| Leave office | 0 | 30 | 43 |
| Reach car | 5 | 35 | 43 |
| Highway on-ramp | 20 | 35 | 43 |
| Exit highway | 30 | 40 | 43 |
| Side street | 35 | 43 | 43 |
| Home | 43 | 43 | 43 |
Why not just use Monte Carlo? TD has three practical advantages that matter enormously in real applications.
1. No waiting for episode end. TD can learn online, step by step. This is essential for continuing tasks (tasks that never terminate) and for long episodes where waiting is impractical.
2. Lower variance. MC targets (actual returns) depend on many random actions and transitions. TD targets depend on only one step of randomness. The TD target has higher bias (because V(St+1) might be wrong) but much lower variance. In practice, the lower variance usually wins.
3. No model required. Like MC, TD learns directly from experience. Unlike DP, it doesn't need to know transition probabilities. TD combines the best of both: bootstrapping like DP, but model-free like MC.
| Property | DP | MC | TD |
|---|---|---|---|
| Needs model? | Yes | No | No |
| Bootstraps? | Yes | No | Yes |
| Online (per-step)? | No* | No | Yes |
| Works with continuing tasks? | Yes | No | Yes |
| Target bias | None | None | Some |
| Target variance | None | High | Low |
*DP does full sweeps, not truly step-by-step from experience.
Here's a surprising fact: given the same finite batch of experience, TD and MC converge to different answers. Neither is wrong — they're answering different questions.
Batch MC converges to the values that minimize mean-squared error on the observed returns. It treats each episode as an independent sample and averages.
Batch TD(0) converges to the values of the maximum-likelihood MDP model implied by the data. TD implicitly builds a model from the data (count transitions, average rewards) and computes the value function of that model. This is called the certainty-equivalence estimate.
Consider a tiny example with states A and B. You observe 8 episodes: A → B → 0 (seen 6 times), B → 1 (seen 1 time), B → 0 (seen 7 times). What is V(A)?
Batch MC: The only episode starting from A ended with return 0, so V(A) = 0.
Batch TD: A always transitions to B. V(B) = 1/8 (one reward of 1 out of 8 visits). Since A → B with reward 0, V(A) = 0 + γV(B) = γ/8. TD propagates B's value back to A through the Markov structure.
Generate random batches from a 3-state chain. See how batch TD and batch MC converge to different value estimates from the same data.
So far, TD has estimated state values V(s). But for control (finding a good policy), we need action values Q(s,a). Sarsa is the natural extension: instead of updating V toward the next state's value, we update Q toward the next state-action pair's value.
The name comes from the quintuple used in each update: (St, At, Rt+1, St+1, At+1) — SARSA. The update rule is:
Notice: the next action At+1 is the action the agent actually takes (chosen by the current policy, typically ε-greedy). This makes Sarsa on-policy: it evaluates and improves the policy it's actually following, including its exploration.
A 10×7 gridworld with upward wind in the middle columns. Start at S, goal at G. Wind pushes the agent upward by 0, 1, or 2 cells depending on column. Sarsa learns to compensate.
Q-learning is arguably the most important algorithm in this entire book. Its update rule looks almost identical to Sarsa's, but with one crucial difference: instead of using the action the agent actually takes, it uses the best possible action at the next state:
That max is the key. Q-learning updates toward the greedy action regardless of which action the agent actually took. This makes it off-policy: it learns the optimal Q-function directly, even while following an exploratory ε-greedy policy.
The difference between Sarsa and Q-learning comes alive in the Cliff Walking problem. Imagine a gridworld where a cliff runs along the bottom edge. Falling off the cliff gives -100 reward. Each normal step gives -1. The start is bottom-left, the goal is bottom-right.
Watch both algorithms learn side by side. Sarsa (left) finds the safe path away from the cliff. Q-learning (right) finds the optimal path along the cliff edge. Adjust ε to see how exploration changes the strategies.
Sarsa updates toward Q(St+1, At+1) — the value of the specific next action taken. But what if instead of sampling one action, we averaged over all possible next actions, weighted by the policy?
This is Expected Sarsa. Instead of the random next action, it uses the expected value under the policy. This eliminates one source of variance (the randomness of At+1) at the cost of slightly more computation per step.
Here is the beautiful unification: Expected Sarsa is a generalization of both Sarsa and Q-learning. If the policy π is ε-greedy, Expected Sarsa accounts for both greedy and exploratory actions. If you set π to be purely greedy (the argmax), the expectation reduces to the max, and Expected Sarsa becomes Q-learning.
| Method | TD Target uses | Variance | Policy |
|---|---|---|---|
| Sarsa | Q(S', A') where A' is sampled | Highest | On-policy |
| Q-learning | maxa Q(S', a) | Medium | Off-policy |
| Expected Sarsa | ∑a π(a|S') Q(S', a) | Lowest | Either |
Q-learning uses maxa Q(S', a) as its target. This seems sensible, but there is a subtle trap. When Q-values are noisy estimates (and they always are, especially early in learning), the maximum of noisy estimates is biased upward. This is called maximization bias.
Think of it this way: you ask 10 friends to estimate the temperature outside. Some guess too high, some too low. If you take the highest guess, you'll almost certainly overestimate the true temperature. The max of noisy estimates is optimistically biased.
The fix is Double Q-learning. Maintain two Q-tables, Q1 and Q2. On each update, use one table to select the best action and the other table to evaluate it:
By decoupling action selection (which action looks best) from action evaluation (how good is that action really), the bias disappears. Each table's noise is independent, so one table's optimistic errors don't infect the other's evaluation.
Let's put it all together. We have three TD control methods — Sarsa, Q-learning, and Expected Sarsa — plus the Double variants. How do they relate, and when should you use each?
The differences come down to what each method uses as its TD target at St+1. Here's a unified view using backup diagrams, the visual language Sutton and Barto use to show which future quantities an update depends on.
Backup diagrams for each TD control method, showing which quantities are used in the update target.
| Method | Target at S' | On/Off-policy | Best when |
|---|---|---|---|
| Sarsa | Q(S', A') (sampled) | On-policy | Online performance matters |
| Q-learning | maxa Q(S', a) | Off-policy | You want the optimal policy |
| Expected Sarsa | Eπ[Q(S', a)] | Either | You want low variance |
| Double Q | Q2(S', argmax Q1) | Off-policy | Max bias is a concern |
Temporal-difference learning is the central and most novel idea in reinforcement learning. It combines the model-free, experience-based nature of Monte Carlo with the bootstrapping power of dynamic programming. The result is a family of algorithms that learn online, step by step, from incomplete episodes, without a model.
Key equations from this chapter:
What to remember:
TD bootstraps — it updates a guess from a guess. This is both its power (low variance, online, no model) and its limitation (biased until convergence).
On-policy methods (Sarsa) learn the value of what you do. Off-policy methods (Q-learning) learn the value of what you should do.
← Chapter 5: Monte Carlo Chapter 7: n-step →