Sutton & Barto, Chapter 10

On-policy Control with
Approximation

Learning to act — not just evaluate — when the state space is too large for tables.

Prerequisites: Chapter 9 (Prediction Approx) + Chapter 6 (TD, Sarsa). That's it.
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: From Prediction to Control

Chapter 9 taught us to predict — to approximate the value vπ(s) of following a fixed policy. But the whole point of RL is control: finding the best policy. For that, we need action values q̂(s, a, w), not just state values. The agent must know which action is best in each state.

The leap from prediction to control follows the same pattern as in tabular methods (Chapter 6). We use generalized policy improvement (GPI): evaluate the current policy using action-value approximation, then improve the policy by acting greedily with respect to q̂. The difference is that now q̂ is a parameterized function, not a table.

The key shift: Chapter 9 approximated v(s). This chapter approximates q(s, a). The input is now a state-action pair, and the output is how good that action is in that state. Everything else — SGD, semi-gradients, features — transfers directly.
Ch 9: Prediction
Approximate vπ(s) with v̂(s, w)
Ch 10: Control
Approximate qπ(s, a) with q̂(s, a, w)
Action Selection
ε-greedy: pick a = argmaxa q̂(s, a, w) most of the time

There's a subtlety with approximation that didn't exist in tabular methods. When we update q̂ for one (s, a) pair, the weights change, and that change affects q̂ for every (s, a) pair. Improving the value of one action might accidentally make another action look better or worse. This coupling — absent in tables — makes control with approximation fundamentally harder.

Check: Why do we need action values q(s, a) instead of state values v(s) for control?

Chapter 1: Semi-Gradient Sarsa

Tabular Sarsa (Chapter 6) updated Q(S, A) toward R + γ Q(S', A'). The semi-gradient version does the same thing, but with a parameterized function. The agent follows an ε-greedy policy, and after each transition (S, A, R, S', A'), updates the weights:

wt+1 = wt + α [ Rt+1 + γ q̂(St+1, At+1, wt) − q̂(St, At, wt) ] ∇w q̂(St, At, wt)

This is "semi-gradient" for the same reason as in Chapter 9: the target R + γ q̂(S', A', w) depends on w, but we ignore that dependence in the gradient. It's Sarsa because the next action A' is the action actually taken under the current policy (on-policy), not the greedy action (which would be Q-learning).

pseudocode
# Episodic Semi-gradient Sarsa
Initialize w arbitrarily
for each episode:
  S ← initial state
  A ← ε-greedy from q̂(S, ·, w)
  while S is not terminal:
    take A, observe R, S'
    if S' is terminal:
      ww + α [R − q̂(S, A, w)] ∇ q̂(S, A, w)
    else:
      A' ← ε-greedy from q̂(S', ·, w)
      ww + α [R + γ q̂(S', A', w) − q̂(S, A, w)] ∇ q̂(S, A, w)
      A ← A'
    S ← S'
Key insight: The update rule is identical to tabular Sarsa, except Q(S, A) is replaced by q̂(S, A, w) and the table update is replaced by a gradient step on w. This is the pattern throughout Part II: take a tabular algorithm, swap in function approximation, and you get the approximate version.

For linear approximation, q̂(s, a, w) = wT x(s, a), where x(s, a) is a feature vector that depends on both the state and action. One common approach: construct separate features for each action. For 3 actions and 10 state features, the full feature vector has 30 components (10 per action), and only the 10 corresponding to the chosen action are nonzero.

Check: In semi-gradient Sarsa, what makes it "on-policy"?

Chapter 2: Mountain Car

The Mountain Car is the quintessential test problem for function approximation in RL. A car sits in a valley between two hills. The goal is to drive up the right hill, but the engine is too weak to climb directly. The agent must learn to swing back and forth — building momentum — to reach the top.

The state space is continuous: (position, velocity). Position ranges from −1.2 to 0.5, velocity from −0.07 to 0.07. Three actions: throttle left, no throttle, throttle right. The reward is −1 per step (punishing slow solutions). The episode ends when the car reaches position 0.5 (the goal on the right).

Why Mountain Car matters: A tabular method would need to discretize the continuous state space into a grid, losing resolution. With tile coding, the agent gets smooth generalization across nearby positions and velocities. A state the agent has never seen before still gets a reasonable value estimate from overlapping tiles.
Showcase: Mountain Car with Tile Coding

Watch the car learn to reach the goal on the right hill. Click Train 1 for one episode or Train 100 to see rapid learning. The value surface below shows the learned cost-to-go. The right panel tracks steps per episode.

α = 0.50
ε = 0.05
Episode 0 | Steps: --
What to watch for: Early episodes take hundreds of steps — the car just bounces around randomly. After 50-100 training episodes, the car learns the swinging strategy and reaches the goal in under 200 steps. Click "Animate" after training to watch the learned policy in action. Try increasing ε to see how more exploration slows convergence, or decreasing α to see slower but steadier learning.

The tile coding uses 8 tilings, each with 8×8 tiles over the (position, velocity) space. That's 8 × 8 × 8 = 512 tiles per action, 1536 total weights. The step size α is divided by 8 (number of tilings) internally. Each state activates exactly 8 tiles per action — a tiny fraction of all weights — making updates extremely fast.

Check: Why can't the Mountain Car reach the goal by just throttling right from the start?

Chapter 3: n-step Sarsa

Just as we extended TD prediction to n-step returns (Chapter 7, then Chapter 9), we can extend Sarsa to use n-step returns. The n-step return for control is:

Gt:t+n = Rt+1 + γ Rt+2 + ... + γn−1 Rt+n + γn q̂(St+n, At+n, wt+n−1)

The semi-gradient n-step Sarsa update is then:

wt+n = wt+n−1 + α [ Gt:t+n − q̂(St, At, wt+n−1) ] ∇w q̂(St, At, wt+n−1)

More steps means more actual rewards in the return and less reliance on the current (possibly inaccurate) value estimate. The tradeoff is the same as always: more steps reduces bias from bootstrapping but increases variance from stochastic rewards and actions.

Practical tip: For Mountain Car, n-step Sarsa with n = 8 learns significantly faster than one-step Sarsa. The negative reward signal (−1 per step) needs to propagate backwards from the goal. With n = 1, this propagation is slow — each update only moves the value one step backwards. With n = 8, each update pushes the value 8 steps back, reaching earlier states much faster.

n = 1 (one-step Sarsa)

Maximum bootstrapping. Lowest variance. Slowest value propagation. Needs many episodes to learn.

n = 8 or 16

Moderate bootstrapping. More variance. Faster propagation. Often the sweet spot for Mountain Car.

The algorithm stores a buffer of the last n transitions. At each step t, once n more steps have been taken, it computes Gt:t+n and updates. Near the end of an episode (when fewer than n steps remain), the return truncates naturally — the final steps become effectively MC updates.

Check: Why does n-step Sarsa learn faster than one-step Sarsa on Mountain Car?

Chapter 4: Average Reward

Everything so far has used the discounted return: Gt = Rt+1 + γ Rt+2 + γ2 Rt+3 + ... . Discounting makes the sum finite by shrinking future rewards. But for continuing tasks (tasks that never end), there's a better way: the average reward formulation.

Instead of maximizing the discounted sum, the agent maximizes the long-run average reward per step:

r(π) = limT→∞ 1/Tt=1T E[Rt | π]

This is the average reward r(π). A good policy achieves a high average reward per step. Think of it like a factory's throughput: you don't care about any single product; you care about the average production rate over time.

Why average reward? In continuing tasks, the discounted return depends on γ in ways that don't reflect the actual quality of the policy. A policy that gets reward 10 per step should be preferred over one that gets reward 5 per step, regardless of γ. But the discounted return conflates time preference (γ) with policy quality. Average reward separates them cleanly.

Under the average reward formulation, the differential return replaces the discounted return:

Gt = (Rt+1 − r) + (Rt+2 − r) + (Rt+3 − r) + ...

Each reward is measured relative to the average r(π). Positive terms mean "better than average." Negative terms mean "worse than average." The differential return doesn't need discounting because it sums deviations from the mean, which naturally tend to cancel out.

Check: In the average reward formulation, what does r(π) represent?

Chapter 5: Differential Values

In the average reward setting, the differential value of a state measures how much better (or worse) it is to be in that state compared to the long-run average. Formally:

vπ(s) = Eπ[ ∑k=0 (Rt+k+1 − r(π)) | St = s ]

This is the sum of future rewards minus the average. If the state leads to a sequence of above-average rewards, its differential value is positive. If it leads to below-average rewards, it's negative. A state with differential value zero is "average" — being there is neither better nor worse than the long-run expectation.

Analogy: Think of a poker player's position at the table. Their long-run win rate is r(π) (say $5/hand). The differential value of being in "early position" might be negative ($3/hand from here), and "late position" might be positive ($7/hand). The differential tells you how much your current situation deviates from average.

The differential Bellman equation is:

vπ(s) = ∑a π(a|s) ∑s',r p(s',r|s,a) [ r − r(π) + vπ(s') ]

Compare this with the discounted Bellman equation. The discount factor γ is gone, replaced by the subtraction of r(π). Instead of shrinking future values by γ, we measure each reward as a deviation from the baseline. The structure is otherwise identical, which means all our RL algorithms transfer directly.

Similarly, the differential action value qπ(s, a) measures how much cumulative excess reward the agent gets starting from (s, a):

qπ(s, a) = ∑s',r p(s',r|s,a) [ r − r(π) + ∑a' π(a'|s') qπ(s', a') ]
Check: What does a differential value of zero for a state mean?

Chapter 6: The Futility of Discounting

Here's one of the most surprising results in the book. For continuing tasks with function approximation, the discount factor γ does not affect which policy is best. Every γ ∈ (0, 1) produces the same policy ordering. This means discounting adds complexity without changing the answer.

How can this be? In tabular methods, γ matters because it changes the shape of the value function, and greedy policies depend on that shape. But with function approximation, value functions are already approximate — they can't represent the true values exactly. The relative ordering of policies, which is what matters for control, turns out to be independent of γ.

The proof sketch: For any two policies π1 and π2, r(π1) > r(π2) ⇔ the average reward of π1 exceeds π2's. This ordering is a property of the policies and the environment — not of γ. Since the discounted return converges to the average reward formulation as γ → 1, and the ordering doesn't change for any γ, the discount factor is irrelevant for ranking policies.

The practical consequence: in continuing tasks with approximation, γ is a free parameter with no clear best value. Setting γ = 0.9 vs γ = 0.99 changes learning dynamics (step-size sensitivity, variance, bootstrapping) but not which policy is theoretically best. This is why the average reward formulation is cleaner — it removes the arbitrary γ entirely.

Discounting Does Not Change Policy Ordering

Three policies (A, B, C) with different reward streams. Drag γ and observe: the discounted returns change, but the ranking (which bar is tallest) stays the same. Policy ordering is independent of γ.

γ = 0.90

This result has deep implications. It says that for continuing tasks, the entire γ-based framework is a convenience, not a necessity. The average reward framework captures policy quality directly, without the extra parameter. Many researchers in the field have shifted to average reward for continuing problems as a result.

Check: Why is discounting called "futile" for continuing tasks with approximation?

Chapter 7: Differential Sarsa

Now we combine all the pieces: average reward, differential values, semi-gradient methods, and n-step returns. The result is differential semi-gradient n-step Sarsa — the practical algorithm for continuing control with approximation.

The idea: instead of bootstrapping from discounted values, we bootstrap from differential values. The TD error becomes:

δt = Rt+1R + q̂(St+1, At+1, w) − q̂(St, At, w)

where R is the agent's current estimate of the average reward r(π). Notice: no γ anywhere. The discount factor has been replaced by the average reward estimate.

The weight update is the familiar semi-gradient step:

wt+1 = wt + α δtw q̂(St, At, wt)

The average reward estimate R is updated incrementally:

Rt+1 = Rt + β δt

where β is a step size for the average. When δt > 0, the agent is doing better than expected, so R increases. When δt < 0, the agent is doing worse, and R decreases. Over time, R converges to the true average reward.

pseudocode
# Differential Semi-gradient Sarsa
Initialize w arbitrarily, R0
S ← initial state
A ← ε-greedy from q̂(S, ·, w)

for each step:
  take A, observe R, S'
  A' ← ε-greedy from q̂(S', ·, w)
  δ ← R − R + q̂(S', A', w) − q̂(S, A, w)
  RR + β · δ
  ww + α · δ · ∇ q̂(S, A, w)
  S ← S', A ← A'
Elegance of the differential approach: Compare this pseudocode to episodic Sarsa. The only differences are: (1) no γ, (2) subtract R from the reward, (3) update R incrementally. The structure is identical. Every RL algorithm has a differential version, obtained by this same substitution.

The n-step version uses a differential n-step return: Gt:t+n = ∑k=0n−1(Rt+k+1R) + q̂(St+n, At+n, w). More steps means less bias from the current q̂ estimate, at the cost of more variance from stochastic transitions.

Check: In differential Sarsa, what replaces the discount factor γ?

Chapter 8: Continuing Tasks

Most real-world problems are continuing tasks: a factory optimizing production, a trading algorithm managing a portfolio, a server balancing network traffic. These tasks have no natural episodes. The agent runs indefinitely, and its performance is measured by the long-run average reward.

Continuing tasks bring several practical challenges that episodic tasks don't have:

No episode boundaries

No natural "reset" points. The agent must learn online, continuously adapting. Mistakes in one time region affect all future learning.

Non-stationarity

The environment may slowly change. The step size α must stay positive (no 1/t decay) to track changes, but this means estimates never fully converge.

Episodic trick: Many practitioners convert continuing tasks into episodic ones by imposing artificial episode boundaries (e.g., reset every 1000 steps). This is a hack that works surprisingly well. It lets you use standard episodic algorithms with minor modifications. But it's theoretically unsatisfying and can miss long-term dependencies.

The average reward formulation handles continuing tasks naturally. The differential Sarsa algorithm from Chapter 7 runs forever, continuously updating R and w. There's no need for episode boundaries, discount factors, or artificial resets.

The key practical considerations for continuing tasks:

IssueSolution
No episodes → no per-episode performance measureTrack rolling average reward (last N steps)
Non-stationarityConstant step size α (no decay)
Initial transientDiscard early data; start measuring after warm-up
Value function scaleDifferential values are naturally centered around zero
Credit assignment over long horizonsUse n-step or eligibility traces (Chapter 12)
Check: Why should the step size stay constant (not decay to zero) in continuing tasks?

Chapter 9: Summary

This chapter extended approximation from prediction to control — from knowing how good states are to actually learning how to act. The core algorithm, semi-gradient Sarsa, is a direct generalization of tabular Sarsa with a parameterized q̂(s, a, w). The Mountain Car demonstrated that tile coding makes RL work in continuous state spaces where tables are hopeless.

ConceptCore IdeaKey Property
q̂(s, a, w)Parameterized action valuesEnables GPI with approximation
Semi-gradient SarsaOne-step on-policy controlSame as tabular Sarsa + gradient step
n-step SarsaMulti-step bootstrapFaster value propagation
Mountain CarClassic continuous test bedTile coding + Sarsa solves it
Average reward r(π)Long-run rate for continuing tasksNo γ needed
Differential valuesValues relative to averageReplace discounted returns
Futility of γDiscounting doesn't affect optimal policyAverage reward is cleaner
Differential SarsaControl without discountingNatural for continuing tasks
The two frameworks side by side: For episodic tasks, use discounted semi-gradient Sarsa — it's simple and effective. For continuing tasks, use differential Sarsa with average reward — it's theoretically cleaner and avoids the arbitrary γ. In practice, both work; the choice depends on whether your problem has natural episodes.

What we can now do:

• Learn control policies in continuous state spaces
• Use tile coding for fast, linear, convergent learning
• Handle episodic tasks (semi-gradient Sarsa)
• Handle continuing tasks (differential Sarsa)

What's still missing:

• Off-policy learning with approximation (Ch 11)
• Eligibility traces for faster credit assignment (Ch 12)
• Policy gradient methods (Ch 13)
• Stability guarantees for nonlinear functions

What comes next: Chapter 9 handled prediction. This chapter handled on-policy control. Chapter 11 tackles the hardest case: off-policy learning with approximation. This is where function approximation, bootstrapping, and off-policy data collide to create the infamous deadly triad — and where things can go seriously wrong.

"Don't let the discount factor distract you.
What matters is the average reward per step."
— Andrew Barto (paraphrased)
Check: What is the main contribution of this chapter beyond Chapter 9?