Sutton & Barto, Chapter 11

Off-policy Approximation

When learning about one policy from data generated by another — and why this is much harder than it sounds.

Prerequisites: Chapters 5–6 (off-policy IS) + Chapter 9–10 (function approximation). That's it.
11
Chapters
4
Simulations
11
Quizzes

Chapter 0: The Challenge

You're building a robot that learns from a log of past experience. The log was generated by an old, cautious policy — it moved slowly and avoided corners. But you want to evaluate (or improve) a new policy that is aggressive and explores everywhere. The data doesn't match the policy you care about. This is the off-policy setting.

In the tabular case (Chapters 5–7), off-policy learning was manageable. We used importance sampling (IS) to reweight returns, and Q-learning neatly sidestepped the issue for control. But now we have function approximation — and two new, serious problems collide.

The two problems: (1) Target correction — importance sampling ratios can have enormous variance when the behavior and target policies differ significantly. (2) Distribution mismatch — the state distribution under the behavior policy differs from the target policy's distribution, and with function approximation, updating toward the wrong distribution can cause the weights to diverge to infinity.

On-policy methods (Chapter 9–10) were stable because the data distribution matched the update targets. The states visited were exactly the states the policy cared about. Off-policy breaks this: the agent updates toward one thing while the data comes from another. With a lookup table, each state is independent, so mismatched distributions don't matter. With function approximation, changing one weight affects many states, and the mismatch can amplify errors in a vicious cycle.

Problem 1: Correction
IS ratios fix the targets but add massive variance
Problem 2: Distribution
Updates under μ (behavior) destabilize vπ (target)
Result
Weights can diverge — even with linear function approximation

This chapter is the most challenging in the book. It's where the limits of the field are laid bare. But understanding why things break is just as important as knowing what works. Let's start simple and build to the hard truths.

Check: Why is off-policy learning harder with function approximation than with tables?

Chapter 1: Semi-gradient Off-policy

The simplest approach: take the semi-gradient methods from Chapter 9 and add importance sampling ratios. For one-step semi-gradient TD, the update becomes:

wt+1 = wt + α ρt δt ∇ v̂(St, wt)

Here ρt = π(At|St) / b(At|St) is the importance sampling ratio for a single step. It reweights the update so that, in expectation, it targets the value function under π rather than b. The TD error δt is the usual Rt+1 + γ v̂(St+1, wt) − v̂(St, wt).

Key insight: The ratio ρt is a correction factor. If π would have taken action At twice as often as b did, then ρt = 2 and we double the update. If π would never take that action, ρt = 0 and we ignore the transition entirely. It's a per-step reweighting.

For action-value methods (semi-gradient expected Sarsa), there's good news: the ρ ratio can sometimes be avoided entirely. If the update target uses the expected value under π rather than the sampled next action, the importance sampling ratio drops out. This is why Q-learning works off-policy without IS.

pseudocode
# Semi-gradient off-policy TD(0)
Initialize w arbitrarily

for each episode:
  S ← start state
  while S is not terminal:
    A ← sample from b(·|S)
    take A, observe R, S'
    ρ ← π(A|S) / b(A|S)
    δ ← R + γ v̂(S', w) − v̂(S, w)
    w ← w + α ρ δ ∇ v̂(S, w)
    S ← S'

This looks straightforward. The update is just TD(0) multiplied by ρ. But there's a catch: even this simple method has no convergence guarantee under off-policy with function approximation. The semi-gradient avoids taking a gradient through the bootstrapped target, which breaks the clean theoretical story we had on-policy. On-policy semi-gradient TD converges with linear function approximation; off-policy, it can diverge.

Check: What does the importance sampling ratio ρt correct for?

Chapter 2: Divergence Examples

The simplest example of divergence uses just one state with two components. Consider a single state s whose feature vector is x(s) = [1, 2]. The approximate value is v̂(s, w) = w1 + 2w2. On each step, the agent "transitions" back to s with reward 0 and discount γ. The semi-gradient TD update amplifies the weights by a factor of γ in a direction that never converges.

The w-to-2w example: Imagine a parameter w that gets doubled every update. The value function says v̂ = 2w, the target says "the next state is worth 2w too, so the target is γ · 2w." If γ is large enough, the update pushes w toward a larger value, which makes the target even larger, which pushes w further — a positive feedback loop. The weights spiral to infinity.

But the canonical example of off-policy divergence is Baird's counterexample (1995). It has 7 states, each with a distinct feature vector, and a simple transition structure: from any state, one action goes to state 7 (the "solid" state) with probability 1, and the other action goes uniformly to one of the 6 "dashed" states. The behavior policy picks the dashed action 6/7 of the time and the solid action 1/7. The target policy always picks solid.

Under semi-gradient DP (or TD) with this setup and linear function approximation, the 8 weight components diverge to infinity. Not slowly — exponentially. The animation below shows it happening in real time.

Showcase: Baird's Counterexample

Watch the 8 weight components of a linear value function diverge under semi-gradient DP. Each colored line is one weight. The updates are exact expected updates (no sampling noise) — this is pure, deterministic divergence.

Step size α = 0.010
Discount γ = 0.990
Sweep 0

Notice that the true value of every state is 0 (all rewards are 0, so vπ(s) = 0 for all s). The weights should converge to zero. Instead they diverge exponentially. This is not a sampling fluke or a step-size problem — it's structural. The combination of off-policy distribution and semi-gradient bootstrapping creates an unstable system.

Check: In Baird's counterexample, what are the true values of all states?

Chapter 3: The Deadly Triad

Baird's counterexample isn't a quirky edge case. It reveals a fundamental tension in RL. Instability arises whenever three elements are combined:

1. Function Approximation

Generalizing across states. Changing one parameter affects many state values. This is essential for large state spaces — we can't avoid it.

2. Bootstrapping

Updating toward a target that includes the current estimate (like TD). DP and all TD methods bootstrap. Only pure MC avoids it.

3. Off-policy Learning

Learning about one policy from data generated by another. Needed for Q-learning, experience replay, parallel data collection, and many practical systems.

The deadly triad: Any two of these three are safe. Function approximation + bootstrapping works (on-policy TD). Function approximation + off-policy works (MC with IS). Bootstrapping + off-policy works (tabular Q-learning). But all three together? Instability. No convergence guarantee. Potential divergence.

This is a deep and unsettling result. Each element is useful — even necessary — for practical RL. Function approximation handles large state spaces. Bootstrapping enables online, incremental learning. Off-policy methods enable data reuse and flexibility. But the combination is dangerous.

The rest of this chapter searches for ways to have all three safely. The main approaches are: (1) avoid true off-policy by using on-policy methods; (2) avoid bootstrapping by using MC; (3) use true gradient methods that are stable by construction (Gradient-TD); or (4) reweight the update distribution to match the target (Emphatic-TD).

The Deadly Triad

Click each element to see what happens when it is removed. When all three are active, instability arises.

Check: Which combination is safe from divergence?

Chapter 4: Value Function Geometry

To understand why some objectives work and others don't, we need to think about value functions as vectors in a high-dimensional space. If there are |S| states, then a value function v is a vector in R|S| — one component per state. The true value vπ is one point in this space. Our approximation v̂(w) is constrained to a subspace (for linear approximation, it's a hyperplane). The best we can do is project vπ onto this subspace.

Different objectives project differently:

ObjectiveWhat It MinimizesNotation
VEMean squared error from vπ||vw − vπ||2μ
BEMean squared Bellman error||vw − Tπvw||2μ
PBEProjected Bellman error||vw − Π Tπvw||2μ
The geometric picture: Think of a 2D plane (the representable functions) inside 3D space (all value functions). vπ floats above the plane. The VE minimum is the perpendicular projection of vπ onto the plane. The BE minimum is where the Bellman operator Tπ applied to points on the plane lands closest to the plane. The PBE minimum is where the projection of Tπvw back onto the plane equals vw — a fixed point.

The Bellman operator Tπ takes any value function v and produces a new one: (Tπv)(s) = ∑a π(a|s) ∑s',r p(s',r|s,a)[r + γv(s')]. It maps a point in value-function space to another point. The true value vπ is the unique fixed point: Tπvπ = vπ. But Tπvw typically leaves the representable subspace, so we need projection Π to bring it back.

Semi-gradient TD finds the PBE minimum — where vw = Π Tπvw. On-policy, this fixed point always exists for linear approximation and TD converges to it. Off-policy, the fixed point may not exist, and that's when divergence happens.

Check: What does the projection operator Π do in the PBE?

Chapter 5: The Bellman Error

The Bellman error (BE) at a state s is the difference between the current value and what the Bellman operator says it should be:

δ̅w(s) = ( ∑a π(a|s) ∑s',r p(s',r|s,a) [r + γ v̂(s',w)] ) − v̂(s,w)

This is the expected TD error. If the BE is zero everywhere, the value function satisfies the Bellman equation exactly. The mean squared BE (the BE objective) squares this and averages over states: BE(w) = ∑s μ(s) [δ̅w(s)]2.

Can we do gradient descent directly in the BE? Yes — we can compute ∇BE(w) and follow it downhill. Unlike semi-gradient methods, this is a true gradient, so it's guaranteed to converge to a local minimum. The gradient is:

∇BE(w) = 2 ∑s μ(s) δ̅w(s) (γ ∇v̂(S',w) − ∇v̂(S,w))

Notice the ∇v̂(S',w) term. This is what semi-gradient methods drop. Including it makes the method a true gradient — but it requires knowing the gradient at the next state before completing the current step, which can be estimated from samples.

The problem with BE: Even though gradient descent in BE converges, the minimum it finds may not be useful. The BE objective has a fundamental flaw: it is not learnable from observable data. Two different MDPs can produce identical interaction data but have different BE minima. If we can't tell which MDP we're in, we can't know which BE minimum to target.

This is a surprising and deep result. The BE seems like the right objective — after all, it measures how well the Bellman equation is satisfied. But it depends on the full transition dynamics in a way that samples cannot distinguish. The next chapter explains why.

Check: Why is the Bellman error a problematic objective despite being a true gradient?

Chapter 6: Not Learnable

Here's the key thought experiment. Consider two Markov reward processes (MRPs) — MDPs with a fixed policy, so we only have states and transitions. Both have two states, s1 and s2. Both produce exactly the same observable sequence of states and rewards. But their internal transition structures differ.

MRP A

State s has two possible next states, each with probability 0.5. The rewards differ: one transition gives +1, the other gives −1. The Bellman error involves the variance of rewards at s.

MRP B

State s always goes to one next state, but which next state it goes to is determined by an unobserved variable. The observable data stream is identical to MRP A.

The punchline: From data alone, you cannot tell whether the randomness in rewards comes from stochastic transitions (MRP A) or from a hidden variable (MRP B). But the Bellman error depends on this distinction — it involves the variance of the next-state value, which differs between the two. So the BE minimum differs, even though the data is identical. An algorithm that only sees data cannot reliably minimize BE.

Formally, the issue is that the BE involves a product of two expectations at the same state: E[R + γv(S')] − v(s), and squaring this gives a term E[δ]2 which is fine, but the gradient introduces terms involving E[δ · ∇v(S')], which requires two independent samples from the same state. If you use the same sample for both, you get a biased estimate.

This is why the projected Bellman error (PBE) is the preferred objective. The PBE is learnable from data. It doesn't suffer from this double-sampling problem because the projection operator absorbs the problematic variance term. The Gradient-TD methods in the next chapter minimize the PBE.

Check: Why can't an algorithm reliably minimize the BE from observed data?

Chapter 7: Gradient-TD Methods

The solution to the divergence problem: do true stochastic gradient descent on the projected Bellman error (PBE). The PBE is learnable, and gradient descent in it is guaranteed to converge. The resulting family of algorithms — GTD, GTD2, and TDC — are the first provably convergent off-policy TD methods with linear function approximation.

The PBE can be written as:

PBE(w) = (δ̅w)T X (XT D X)−1 XT D δ̅w

where X is the feature matrix, D is the diagonal matrix of state weights μ(s), and δ̅w is the vector of expected TD errors. Taking the gradient and converting to sample-based updates requires a trick: introducing a second set of weights v (sometimes called h) that estimate a secondary quantity.

The saddle-point trick: The PBE gradient involves a matrix inverse that we can't compute from samples. But we can rewrite the objective as a saddle-point problem: minimize over w, maximize over v. The auxiliary weights v track the "direction of steepest descent" on the fly. This doubles the number of parameters but gives us O(d) per-step complexity with convergence guarantees.

The three algorithms differ in how they decompose the PBE gradient:

MethodUpdate Rule (intuition)Key Property
GTDw += α (ρδ · x − γρ x' · (xTv))First convergent off-policy linear TD
GTD2w += α (ρ x − γρ x') · (xTv)Better decomposition, lower variance
TDCw += α ρδ x − αγ ρx'(xTv)= TD correction; most commonly used

In all three, the auxiliary weights v are updated as: v += β(ρδ − xTv)x, using step size β. This is a standard least-squares update — v tracks E[ρδ | x].

TDC (TD with gradient Correction) is the most intuitive. The first term (αρδx) is the usual semi-gradient TD update. The second term (−αγρx'(xTv)) is a correction that accounts for the gradient through the bootstrapped target. It's the part that semi-gradient methods drop. By estimating it with v, TDC makes the update a true gradient.

Gradient-TD vs Semi-gradient on Baird's Example

Compare semi-gradient TD (diverges) vs TDC (converges to zero) on Baird's counterexample. Both use the same data distribution.

Sweep 0
Check: What objective do Gradient-TD methods (GTD, GTD2, TDC) minimize?

Chapter 8: Emphatic-TD

Gradient-TD methods fix the objective (PBE instead of VE). Emphatic-TD takes a different approach: it fixes the distribution. The idea is to reweight updates so that, even though data comes from the behavior policy b, the effective update distribution matches what on-policy TD would see under π.

The key quantity is the followon trace Ft:

Ft = γ ρt−1 Ft−1 + It

where It is the interest (how much we care about the current state, typically 1), and ρt−1 is the previous step's IS ratio. The followon trace accumulates discounted, importance-weighted "emphasis" over time. The emphasis Mt = Ft ρt then multiplies the semi-gradient update:

wt+1 = wt + α Mt δt ∇ v̂(St, wt)
The intuition: In on-policy TD, each state is visited proportionally to dπ(s) — the on-policy state distribution. Off-policy, states are visited proportionally to db(s). The emphasis Mt reweights each update so that the effective distribution of updates matches dπ(s), even though the data comes from b. It's like assigning "importance" to each time step based on how likely it is under π.

Emphatic TD(0) has convergence guarantees for linear function approximation in the off-policy setting. It's a single weight vector (no auxiliary parameters like Gradient-TD). The cost is high variance. The emphasis Mt involves products of IS ratios, which can be very large or very small, making the updates noisy.

The followon trace is analogous to eligibility traces (Chapter 12), but it serves a completely different purpose. Eligibility traces propagate credit backward through time; the followon trace reweights the state distribution forward through time.

Check: What does the emphasis Mt in emphatic-TD correct for?

Chapter 9: Variance Reduction

Every off-policy method pays a variance tax. Importance sampling ratios ρ multiply the updates, and ρ can be very large when the target policy favors an action much more than the behavior policy. Over many steps, products of ratios can explode. This chapter collects practical techniques for keeping variance manageable.

Per-decision Importance Sampling

Instead of weighting entire returns by the product of all ratios, weight each reward in the return by only the ratios up to that time step. This reduces variance because later ratios (which don't affect earlier rewards) are removed.

Weighted Importance Sampling

Normalizing the IS ratios (dividing by their sum) gives a biased but much lower-variance estimator. For finite data, the bias is small and the variance reduction is dramatic. This is the estimator from Chapter 5 that typically outperforms ordinary IS.

Truncation and Clipping

Cap ρ at some maximum value (e.g., ρ̅ = min(ρ, c) for some constant c). This introduces bias but eliminates the worst-case variance. The Retrace(λ) algorithm uses this idea, as does PPO's clipped objective.

Practical wisdom: In practice, per-decision IS + truncation at ρ=1 (the "tree backup" idea) is often the safest choice. It keeps all ratios in [0, 1], bounding variance, at the cost of slower learning when π and b differ significantly.

Control Variates

Subtract a baseline that has the same expectation as the IS-corrected update but lower variance. The auxiliary weights v in Gradient-TD can be seen as a form of control variate. REINFORCE with baseline (Chapter 13) uses the same idea for policy gradients.

The practical lesson: off-policy methods can work, but you must budget significant engineering effort for variance reduction. The algorithms in this chapter form a toolkit, and the best choice depends on how different π and b are.

Check: Why does truncating the importance sampling ratio reduce variance?

Chapter 10: Summary

This chapter confronted the hardest open problem in RL: learning reliably off-policy with function approximation. The deadly triad — function approximation + bootstrapping + off-policy — produces instability that no simple fix can fully resolve. But significant progress has been made.

MethodApproachTrade-off
Semi-gradient TDIgnore the gradient through bootstrap targetSimple but can diverge off-policy
Gradient BETrue gradient in Bellman errorConverges but BE is not learnable
GTD / TDCTrue gradient in projected BEConverges; needs auxiliary weights
Emphatic TDReweight to match target distributionConverges; high variance
The honest summary: There is no known method that has all of: (1) convergence guarantees, (2) low variance, (3) computational efficiency, and (4) generality to nonlinear function approximation. The field is still searching. Deep RL routinely uses the deadly triad (experience replay + neural networks + bootstrapping) and gets away with it through careful engineering, but the theoretical foundations remain incomplete.

What we learned:

• Off-policy + function approx can diverge
• Baird's example proves it constructively
• The deadly triad explains when danger arises
• PBE is learnable; BE is not
• Gradient-TD methods are provably stable
• Emphatic TD fixes the distribution

Open questions:

• Why does DQN work at all? (deadly triad!)
• Can we extend Gradient-TD to neural nets?
• Is there a low-variance emphatic method?
• Are there better objectives than PBE?
• Can the deadly triad be tamed in general?

What comes next: Chapter 12 introduces eligibility traces, which unify TD and MC by letting the agent assign credit over multiple time steps. Traces provide a mechanism for smoothly interpolating between the extremes of one-step TD and full Monte Carlo returns.

"The combination of off-policy learning, function approximation, and bootstrapping
is both essential and dangerous. Understanding the danger is itself progress."
— Richard S. Sutton & Andrew G. Barto
Check: What makes Gradient-TD methods (GTD/TDC) fundamentally different from semi-gradient TD?