When learning about one policy from data generated by another — and why this is much harder than it sounds.
You're building a robot that learns from a log of past experience. The log was generated by an old, cautious policy — it moved slowly and avoided corners. But you want to evaluate (or improve) a new policy that is aggressive and explores everywhere. The data doesn't match the policy you care about. This is the off-policy setting.
In the tabular case (Chapters 5–7), off-policy learning was manageable. We used importance sampling (IS) to reweight returns, and Q-learning neatly sidestepped the issue for control. But now we have function approximation — and two new, serious problems collide.
On-policy methods (Chapter 9–10) were stable because the data distribution matched the update targets. The states visited were exactly the states the policy cared about. Off-policy breaks this: the agent updates toward one thing while the data comes from another. With a lookup table, each state is independent, so mismatched distributions don't matter. With function approximation, changing one weight affects many states, and the mismatch can amplify errors in a vicious cycle.
This chapter is the most challenging in the book. It's where the limits of the field are laid bare. But understanding why things break is just as important as knowing what works. Let's start simple and build to the hard truths.
The simplest approach: take the semi-gradient methods from Chapter 9 and add importance sampling ratios. For one-step semi-gradient TD, the update becomes:
Here ρt = π(At|St) / b(At|St) is the importance sampling ratio for a single step. It reweights the update so that, in expectation, it targets the value function under π rather than b. The TD error δt is the usual Rt+1 + γ v̂(St+1, wt) − v̂(St, wt).
For action-value methods (semi-gradient expected Sarsa), there's good news: the ρ ratio can sometimes be avoided entirely. If the update target uses the expected value under π rather than the sampled next action, the importance sampling ratio drops out. This is why Q-learning works off-policy without IS.
pseudocode # Semi-gradient off-policy TD(0) Initialize w arbitrarily for each episode: S ← start state while S is not terminal: A ← sample from b(·|S) take A, observe R, S' ρ ← π(A|S) / b(A|S) δ ← R + γ v̂(S', w) − v̂(S, w) w ← w + α ρ δ ∇ v̂(S, w) S ← S'
This looks straightforward. The update is just TD(0) multiplied by ρ. But there's a catch: even this simple method has no convergence guarantee under off-policy with function approximation. The semi-gradient avoids taking a gradient through the bootstrapped target, which breaks the clean theoretical story we had on-policy. On-policy semi-gradient TD converges with linear function approximation; off-policy, it can diverge.
The simplest example of divergence uses just one state with two components. Consider a single state s whose feature vector is x(s) = [1, 2]. The approximate value is v̂(s, w) = w1 + 2w2. On each step, the agent "transitions" back to s with reward 0 and discount γ. The semi-gradient TD update amplifies the weights by a factor of γ in a direction that never converges.
But the canonical example of off-policy divergence is Baird's counterexample (1995). It has 7 states, each with a distinct feature vector, and a simple transition structure: from any state, one action goes to state 7 (the "solid" state) with probability 1, and the other action goes uniformly to one of the 6 "dashed" states. The behavior policy picks the dashed action 6/7 of the time and the solid action 1/7. The target policy always picks solid.
Under semi-gradient DP (or TD) with this setup and linear function approximation, the 8 weight components diverge to infinity. Not slowly — exponentially. The animation below shows it happening in real time.
Watch the 8 weight components of a linear value function diverge under semi-gradient DP. Each colored line is one weight. The updates are exact expected updates (no sampling noise) — this is pure, deterministic divergence.
Notice that the true value of every state is 0 (all rewards are 0, so vπ(s) = 0 for all s). The weights should converge to zero. Instead they diverge exponentially. This is not a sampling fluke or a step-size problem — it's structural. The combination of off-policy distribution and semi-gradient bootstrapping creates an unstable system.
Baird's counterexample isn't a quirky edge case. It reveals a fundamental tension in RL. Instability arises whenever three elements are combined:
1. Function Approximation
Generalizing across states. Changing one parameter affects many state values. This is essential for large state spaces — we can't avoid it.
2. Bootstrapping
Updating toward a target that includes the current estimate (like TD). DP and all TD methods bootstrap. Only pure MC avoids it.
3. Off-policy Learning
Learning about one policy from data generated by another. Needed for Q-learning, experience replay, parallel data collection, and many practical systems.
This is a deep and unsettling result. Each element is useful — even necessary — for practical RL. Function approximation handles large state spaces. Bootstrapping enables online, incremental learning. Off-policy methods enable data reuse and flexibility. But the combination is dangerous.
The rest of this chapter searches for ways to have all three safely. The main approaches are: (1) avoid true off-policy by using on-policy methods; (2) avoid bootstrapping by using MC; (3) use true gradient methods that are stable by construction (Gradient-TD); or (4) reweight the update distribution to match the target (Emphatic-TD).
Click each element to see what happens when it is removed. When all three are active, instability arises.
To understand why some objectives work and others don't, we need to think about value functions as vectors in a high-dimensional space. If there are |S| states, then a value function v is a vector in R|S| — one component per state. The true value vπ is one point in this space. Our approximation v̂(w) is constrained to a subspace (for linear approximation, it's a hyperplane). The best we can do is project vπ onto this subspace.
Different objectives project differently:
| Objective | What It Minimizes | Notation |
|---|---|---|
| VE | Mean squared error from vπ | ||vw − vπ||2μ |
| BE | Mean squared Bellman error | ||vw − Tπvw||2μ |
| PBE | Projected Bellman error | ||vw − Π Tπvw||2μ |
The Bellman operator Tπ takes any value function v and produces a new one: (Tπv)(s) = ∑a π(a|s) ∑s',r p(s',r|s,a)[r + γv(s')]. It maps a point in value-function space to another point. The true value vπ is the unique fixed point: Tπvπ = vπ. But Tπvw typically leaves the representable subspace, so we need projection Π to bring it back.
Semi-gradient TD finds the PBE minimum — where vw = Π Tπvw. On-policy, this fixed point always exists for linear approximation and TD converges to it. Off-policy, the fixed point may not exist, and that's when divergence happens.
The Bellman error (BE) at a state s is the difference between the current value and what the Bellman operator says it should be:
This is the expected TD error. If the BE is zero everywhere, the value function satisfies the Bellman equation exactly. The mean squared BE (the BE objective) squares this and averages over states: BE(w) = ∑s μ(s) [δ̅w(s)]2.
Can we do gradient descent directly in the BE? Yes — we can compute ∇BE(w) and follow it downhill. Unlike semi-gradient methods, this is a true gradient, so it's guaranteed to converge to a local minimum. The gradient is:
Notice the ∇v̂(S',w) term. This is what semi-gradient methods drop. Including it makes the method a true gradient — but it requires knowing the gradient at the next state before completing the current step, which can be estimated from samples.
This is a surprising and deep result. The BE seems like the right objective — after all, it measures how well the Bellman equation is satisfied. But it depends on the full transition dynamics in a way that samples cannot distinguish. The next chapter explains why.
Here's the key thought experiment. Consider two Markov reward processes (MRPs) — MDPs with a fixed policy, so we only have states and transitions. Both have two states, s1 and s2. Both produce exactly the same observable sequence of states and rewards. But their internal transition structures differ.
MRP A
State s has two possible next states, each with probability 0.5. The rewards differ: one transition gives +1, the other gives −1. The Bellman error involves the variance of rewards at s.
MRP B
State s always goes to one next state, but which next state it goes to is determined by an unobserved variable. The observable data stream is identical to MRP A.
Formally, the issue is that the BE involves a product of two expectations at the same state: E[R + γv(S')] − v(s), and squaring this gives a term E[δ]2 which is fine, but the gradient introduces terms involving E[δ · ∇v(S')], which requires two independent samples from the same state. If you use the same sample for both, you get a biased estimate.
This is why the projected Bellman error (PBE) is the preferred objective. The PBE is learnable from data. It doesn't suffer from this double-sampling problem because the projection operator absorbs the problematic variance term. The Gradient-TD methods in the next chapter minimize the PBE.
The solution to the divergence problem: do true stochastic gradient descent on the projected Bellman error (PBE). The PBE is learnable, and gradient descent in it is guaranteed to converge. The resulting family of algorithms — GTD, GTD2, and TDC — are the first provably convergent off-policy TD methods with linear function approximation.
The PBE can be written as:
where X is the feature matrix, D is the diagonal matrix of state weights μ(s), and δ̅w is the vector of expected TD errors. Taking the gradient and converting to sample-based updates requires a trick: introducing a second set of weights v (sometimes called h) that estimate a secondary quantity.
The three algorithms differ in how they decompose the PBE gradient:
| Method | Update Rule (intuition) | Key Property |
|---|---|---|
| GTD | w += α (ρδ · x − γρ x' · (xTv)) | First convergent off-policy linear TD |
| GTD2 | w += α (ρ x − γρ x') · (xTv) | Better decomposition, lower variance |
| TDC | w += α ρδ x − αγ ρx'(xTv) | = TD correction; most commonly used |
In all three, the auxiliary weights v are updated as: v += β(ρδ − xTv)x, using step size β. This is a standard least-squares update — v tracks E[ρδ | x].
TDC (TD with gradient Correction) is the most intuitive. The first term (αρδx) is the usual semi-gradient TD update. The second term (−αγρx'(xTv)) is a correction that accounts for the gradient through the bootstrapped target. It's the part that semi-gradient methods drop. By estimating it with v, TDC makes the update a true gradient.
Compare semi-gradient TD (diverges) vs TDC (converges to zero) on Baird's counterexample. Both use the same data distribution.
Gradient-TD methods fix the objective (PBE instead of VE). Emphatic-TD takes a different approach: it fixes the distribution. The idea is to reweight updates so that, even though data comes from the behavior policy b, the effective update distribution matches what on-policy TD would see under π.
The key quantity is the followon trace Ft:
where It is the interest (how much we care about the current state, typically 1), and ρt−1 is the previous step's IS ratio. The followon trace accumulates discounted, importance-weighted "emphasis" over time. The emphasis Mt = Ft ρt then multiplies the semi-gradient update:
Emphatic TD(0) has convergence guarantees for linear function approximation in the off-policy setting. It's a single weight vector (no auxiliary parameters like Gradient-TD). The cost is high variance. The emphasis Mt involves products of IS ratios, which can be very large or very small, making the updates noisy.
The followon trace is analogous to eligibility traces (Chapter 12), but it serves a completely different purpose. Eligibility traces propagate credit backward through time; the followon trace reweights the state distribution forward through time.
Every off-policy method pays a variance tax. Importance sampling ratios ρ multiply the updates, and ρ can be very large when the target policy favors an action much more than the behavior policy. Over many steps, products of ratios can explode. This chapter collects practical techniques for keeping variance manageable.
Instead of weighting entire returns by the product of all ratios, weight each reward in the return by only the ratios up to that time step. This reduces variance because later ratios (which don't affect earlier rewards) are removed.
Normalizing the IS ratios (dividing by their sum) gives a biased but much lower-variance estimator. For finite data, the bias is small and the variance reduction is dramatic. This is the estimator from Chapter 5 that typically outperforms ordinary IS.
Cap ρ at some maximum value (e.g., ρ̅ = min(ρ, c) for some constant c). This introduces bias but eliminates the worst-case variance. The Retrace(λ) algorithm uses this idea, as does PPO's clipped objective.
Subtract a baseline that has the same expectation as the IS-corrected update but lower variance. The auxiliary weights v in Gradient-TD can be seen as a form of control variate. REINFORCE with baseline (Chapter 13) uses the same idea for policy gradients.
The practical lesson: off-policy methods can work, but you must budget significant engineering effort for variance reduction. The algorithms in this chapter form a toolkit, and the best choice depends on how different π and b are.
This chapter confronted the hardest open problem in RL: learning reliably off-policy with function approximation. The deadly triad — function approximation + bootstrapping + off-policy — produces instability that no simple fix can fully resolve. But significant progress has been made.
| Method | Approach | Trade-off |
|---|---|---|
| Semi-gradient TD | Ignore the gradient through bootstrap target | Simple but can diverge off-policy |
| Gradient BE | True gradient in Bellman error | Converges but BE is not learnable |
| GTD / TDC | True gradient in projected BE | Converges; needs auxiliary weights |
| Emphatic TD | Reweight to match target distribution | Converges; high variance |
What we learned:
• Off-policy + function approx can diverge
• Baird's example proves it constructively
• The deadly triad explains when danger arises
• PBE is learnable; BE is not
• Gradient-TD methods are provably stable
• Emphatic TD fixes the distribution
Open questions:
• Why does DQN work at all? (deadly triad!)
• Can we extend Gradient-TD to neural nets?
• Is there a low-variance emphatic method?
• Are there better objectives than PBE?
• Can the deadly triad be tamed in general?
What comes next: Chapter 12 introduces eligibility traces, which unify TD and MC by letting the agent assign credit over multiple time steps. Traces provide a mechanism for smoothly interpolating between the extremes of one-step TD and full Monte Carlo returns.