Ch 10: On-policy Control with Approximation

Chapter 0: From Prediction to Control

Chapter 9 taught us to predict — to approximate the value v_π(s) of following a fixed policy. But the whole point of RL is control: finding the best policy. For that, we need action values q̂(s, a, w), not just state values. The agent must know which action is best in each state.

The leap from prediction to control follows the same pattern as in tabular methods (Chapter 6). We use generalized policy improvement (GPI): evaluate the current policy using action-value approximation, then improve the policy by acting greedily with respect to q̂. The difference is that now q̂ is a parameterized function, not a table.

The key shift: Chapter 9 approximated v(s). This chapter approximates q(s, a). The input is now a state-action pair, and the output is how good that action is in that state. Everything else — SGD, semi-gradients, features — transfers directly.

Ch 9: Prediction

Approximate v_π(s) with v̂(s, w)

↓

Ch 10: Control

Approximate q_π(s, a) with q̂(s, a, w)

↓

Action Selection

ε-greedy: pick a = argmax_a q̂(s, a, w) most of the time

There's a subtlety with approximation that didn't exist in tabular methods. When we update q̂ for one (s, a) pair, the weights change, and that change affects q̂ for every (s, a) pair. Improving the value of one action might accidentally make another action look better or worse. This coupling — absent in tables — makes control with approximation fundamentally harder.

Check: Why do we need action values q(s, a) instead of state values v(s) for control?

Because the agent must compare actions to pick the best one, and v(s) doesn't distinguish between actions Because action values are easier to compute Because state values require a model

Chapter 1: Semi-Gradient Sarsa

Tabular Sarsa (Chapter 6) updated Q(S, A) toward R + γ Q(S', A'). The semi-gradient version does the same thing, but with a parameterized function. The agent follows an ε-greedy policy, and after each transition (S, A, R, S', A'), updates the weights:

w_t+1 = w_t + α [ R_t+1 + γ q̂(S_t+1, A_t+1, w_t) − q̂(S_t, A_t, w_t) ] ∇_w q̂(S_t, A_t, w_t)

This is "semi-gradient" for the same reason as in Chapter 9: the target R + γ q̂(S', A', w) depends on w, but we ignore that dependence in the gradient. It's Sarsa because the next action A' is the action actually taken under the current policy (on-policy), not the greedy action (which would be Q-learning).

pseudocode
# Episodic Semi-gradient Sarsa
Initialize w arbitrarily
for each episode:
  S ← initial state
  A ← ε-greedy from q̂(S, ·, w)
  while S is not terminal:
    take A, observe R, S'
    if S' is terminal:
      w ← w + α [R − q̂(S, A, w)] ∇ q̂(S, A, w)
    else:
      A' ← ε-greedy from q̂(S', ·, w)
      w ← w + α [R + γ q̂(S', A', w) − q̂(S, A, w)] ∇ q̂(S, A, w)
      A ← A'
    S ← S'

Key insight: The update rule is identical to tabular Sarsa, except Q(S, A) is replaced by q̂(S, A, w) and the table update is replaced by a gradient step on w. This is the pattern throughout Part II: take a tabular algorithm, swap in function approximation, and you get the approximate version.

For linear approximation, q̂(s, a, w) = w^T x(s, a), where x(s, a) is a feature vector that depends on both the state and action. One common approach: construct separate features for each action. For 3 actions and 10 state features, the full feature vector has 30 components (10 per action), and only the 10 corresponding to the chosen action are nonzero.

Check: In semi-gradient Sarsa, what makes it "on-policy"?

It uses experience replay The next action A' in the target is the action actually taken by the current policy, not the greedy action It only works with linear functions

Chapter 2: Mountain Car

The Mountain Car is the quintessential test problem for function approximation in RL. A car sits in a valley between two hills. The goal is to drive up the right hill, but the engine is too weak to climb directly. The agent must learn to swing back and forth — building momentum — to reach the top.

The state space is continuous: (position, velocity). Position ranges from −1.2 to 0.5, velocity from −0.07 to 0.07. Three actions: throttle left, no throttle, throttle right. The reward is −1 per step (punishing slow solutions). The episode ends when the car reaches position 0.5 (the goal on the right).

Why Mountain Car matters: A tabular method would need to discretize the continuous state space into a grid, losing resolution. With tile coding, the agent gets smooth generalization across nearby positions and velocities. A state the agent has never seen before still gets a reasonable value estimate from overlapping tiles.

Showcase: Mountain Car with Tile Coding

Watch the car learn to reach the goal on the right hill. Click Train 1 for one episode or Train 100 to see rapid learning. The value surface below shows the learned cost-to-go. The right panel tracks steps per episode.

α = 0.50

ε = 0.05

Episode 0 | Steps: --

What to watch for: Early episodes take hundreds of steps — the car just bounces around randomly. After 50-100 training episodes, the car learns the swinging strategy and reaches the goal in under 200 steps. Click "Animate" after training to watch the learned policy in action. Try increasing ε to see how more exploration slows convergence, or decreasing α to see slower but steadier learning.

The tile coding uses 8 tilings, each with 8×8 tiles over the (position, velocity) space. That's 8 × 8 × 8 = 512 tiles per action, 1536 total weights. The step size α is divided by 8 (number of tilings) internally. Each state activates exactly 8 tiles per action — a tiny fraction of all weights — making updates extremely fast.

Check: Why can't the Mountain Car reach the goal by just throttling right from the start?

The right action is not available The discount factor is too low The engine is too weak to climb the hill directly — it must build momentum by swinging back and forth

Chapter 3: n-step Sarsa

Just as we extended TD prediction to n-step returns (Chapter 7, then Chapter 9), we can extend Sarsa to use n-step returns. The n-step return for control is:

G_t:t+n = R_t+1 + γ R_t+2 + ... + γⁿ⁻¹ R_t+n + γⁿ q̂(S_t+n, A_t+n, w_t+n−1)

The semi-gradient n-step Sarsa update is then:

w_t+n = w_t+n−1 + α [ G_t:t+n − q̂(S_t, A_t, w_t+n−1) ] ∇_w q̂(S_t, A_t, w_t+n−1)

More steps means more actual rewards in the return and less reliance on the current (possibly inaccurate) value estimate. The tradeoff is the same as always: more steps reduces bias from bootstrapping but increases variance from stochastic rewards and actions.

Practical tip: For Mountain Car, n-step Sarsa with n = 8 learns significantly faster than one-step Sarsa. The negative reward signal (−1 per step) needs to propagate backwards from the goal. With n = 1, this propagation is slow — each update only moves the value one step backwards. With n = 8, each update pushes the value 8 steps back, reaching earlier states much faster.

n = 1 (one-step Sarsa)

Maximum bootstrapping. Lowest variance. Slowest value propagation. Needs many episodes to learn.

n = 8 or 16

Moderate bootstrapping. More variance. Faster propagation. Often the sweet spot for Mountain Car.

The algorithm stores a buffer of the last n transitions. At each step t, once n more steps have been taken, it computes G_t:t+n and updates. Near the end of an episode (when fewer than n steps remain), the return truncates naturally — the final steps become effectively MC updates.

Check: Why does n-step Sarsa learn faster than one-step Sarsa on Mountain Car?

The reward signal propagates n steps back per update instead of just one, reaching earlier states much sooner It uses a different learning rate It explores more states

Chapter 4: Average Reward

Everything so far has used the discounted return: G_t = R_t+1 + γ R_t+2 + γ² R_t+3 + ... . Discounting makes the sum finite by shrinking future rewards. But for continuing tasks (tasks that never end), there's a better way: the average reward formulation.

Instead of maximizing the discounted sum, the agent maximizes the long-run average reward per step:

r(π) = lim_T→∞ 1/T ∑_t=1^T E[R_t | π]

This is the average reward r(π). A good policy achieves a high average reward per step. Think of it like a factory's throughput: you don't care about any single product; you care about the average production rate over time.

Why average reward? In continuing tasks, the discounted return depends on γ in ways that don't reflect the actual quality of the policy. A policy that gets reward 10 per step should be preferred over one that gets reward 5 per step, regardless of γ. But the discounted return conflates time preference (γ) with policy quality. Average reward separates them cleanly.

Under the average reward formulation, the differential return replaces the discounted return:

G_t = (R_t+1 − r) + (R_t+2 − r) + (R_t+3 − r) + ...

Each reward is measured relative to the average r(π). Positive terms mean "better than average." Negative terms mean "worse than average." The differential return doesn't need discounting because it sums deviations from the mean, which naturally tend to cancel out.

Check: In the average reward formulation, what does r(π) represent?

The discount factor The long-run average reward per time step under policy π The maximum possible reward

Chapter 5: Differential Values

In the average reward setting, the differential value of a state measures how much better (or worse) it is to be in that state compared to the long-run average. Formally:

v_π(s) = E_π[ ∑_k=0^∞ (R_t+k+1 − r(π)) | S_t = s ]

This is the sum of future rewards minus the average. If the state leads to a sequence of above-average rewards, its differential value is positive. If it leads to below-average rewards, it's negative. A state with differential value zero is "average" — being there is neither better nor worse than the long-run expectation.

Analogy: Think of a poker player's position at the table. Their long-run win rate is r(π) (say $5/hand). The differential value of being in "early position" might be negative ($3/hand from here), and "late position" might be positive ($7/hand). The differential tells you how much your current situation deviates from average.

The differential Bellman equation is:

v_π(s) = ∑_a π(a|s) ∑_s',r p(s',r|s,a) [ r − r(π) + v_π(s') ]

Compare this with the discounted Bellman equation. The discount factor γ is gone, replaced by the subtraction of r(π). Instead of shrinking future values by γ, we measure each reward as a deviation from the baseline. The structure is otherwise identical, which means all our RL algorithms transfer directly.

Similarly, the differential action value q_π(s, a) measures how much cumulative excess reward the agent gets starting from (s, a):

q_π(s, a) = ∑_s',r p(s',r|s,a) [ r − r(π) + ∑_a' π(a'|s') q_π(s', a') ]

Check: What does a differential value of zero for a state mean?

The state has never been visited The state is unreachable Being in that state is neither better nor worse than the long-run average — it's exactly average

Chapter 6: The Futility of Discounting

Here's one of the most surprising results in the book. For continuing tasks with function approximation, the discount factor γ does not affect which policy is best. Every γ ∈ (0, 1) produces the same policy ordering. This means discounting adds complexity without changing the answer.

How can this be? In tabular methods, γ matters because it changes the shape of the value function, and greedy policies depend on that shape. But with function approximation, value functions are already approximate — they can't represent the true values exactly. The relative ordering of policies, which is what matters for control, turns out to be independent of γ.

The proof sketch: For any two policies π₁ and π₂, r(π₁) > r(π₂) ⇔ the average reward of π₁ exceeds π₂'s. This ordering is a property of the policies and the environment — not of γ. Since the discounted return converges to the average reward formulation as γ → 1, and the ordering doesn't change for any γ, the discount factor is irrelevant for ranking policies.

The practical consequence: in continuing tasks with approximation, γ is a free parameter with no clear best value. Setting γ = 0.9 vs γ = 0.99 changes learning dynamics (step-size sensitivity, variance, bootstrapping) but not which policy is theoretically best. This is why the average reward formulation is cleaner — it removes the arbitrary γ entirely.

Discounting Does Not Change Policy Ordering

Three policies (A, B, C) with different reward streams. Drag γ and observe: the discounted returns change, but the ranking (which bar is tallest) stays the same. Policy ordering is independent of γ.

γ = 0.90

This result has deep implications. It says that for continuing tasks, the entire γ-based framework is a convenience, not a necessity. The average reward framework captures policy quality directly, without the extra parameter. Many researchers in the field have shifted to average reward for continuing problems as a result.

Check: Why is discounting called "futile" for continuing tasks with approximation?

Because the discount factor does not affect which policy is best — all gamma values produce the same policy ordering Because the discount factor makes learning slower Because continuing tasks never terminate

Chapter 7: Differential Sarsa

Now we combine all the pieces: average reward, differential values, semi-gradient methods, and n-step returns. The result is differential semi-gradient n-step Sarsa — the practical algorithm for continuing control with approximation.

The idea: instead of bootstrapping from discounted values, we bootstrap from differential values. The TD error becomes:

δ_t = R_t+1 − R + q̂(S_t+1, A_t+1, w) − q̂(S_t, A_t, w)

where R is the agent's current estimate of the average reward r(π). Notice: no γ anywhere. The discount factor has been replaced by the average reward estimate.

The weight update is the familiar semi-gradient step:

w_t+1 = w_t + α δ_t ∇_w q̂(S_t, A_t, w_t)

The average reward estimate R is updated incrementally:

R_t+1 = R_t + β δ_t

where β is a step size for the average. When δ_t > 0, the agent is doing better than expected, so R increases. When δ_t < 0, the agent is doing worse, and R decreases. Over time, R converges to the true average reward.

pseudocode
# Differential Semi-gradient Sarsa
Initialize w arbitrarily, R ← 0
S ← initial state
A ← ε-greedy from q̂(S, ·, w)

for each step:
  take A, observe R, S'
  A' ← ε-greedy from q̂(S', ·, w)
  δ ← R − R + q̂(S', A', w) − q̂(S, A, w)
  R ← R + β · δ
  w ← w + α · δ · ∇ q̂(S, A, w)
  S ← S', A ← A'

Elegance of the differential approach: Compare this pseudocode to episodic Sarsa. The only differences are: (1) no γ, (2) subtract R from the reward, (3) update R incrementally. The structure is identical. Every RL algorithm has a differential version, obtained by this same substitution.

The n-step version uses a differential n-step return: G_t:t+n = ∑_k=0ⁿ⁻¹(R_t+k+1 − R) + q̂(S_t+n, A_t+n, w). More steps means less bias from the current q̂ estimate, at the cost of more variance from stochastic transitions.

Check: In differential Sarsa, what replaces the discount factor γ?

A larger learning rate More planning steps Subtracting the average reward estimate from each reward, measuring everything relative to the baseline

Chapter 8: Continuing Tasks

Most real-world problems are continuing tasks: a factory optimizing production, a trading algorithm managing a portfolio, a server balancing network traffic. These tasks have no natural episodes. The agent runs indefinitely, and its performance is measured by the long-run average reward.

Continuing tasks bring several practical challenges that episodic tasks don't have:

No episode boundaries

No natural "reset" points. The agent must learn online, continuously adapting. Mistakes in one time region affect all future learning.

Non-stationarity

The environment may slowly change. The step size α must stay positive (no 1/t decay) to track changes, but this means estimates never fully converge.

Episodic trick: Many practitioners convert continuing tasks into episodic ones by imposing artificial episode boundaries (e.g., reset every 1000 steps). This is a hack that works surprisingly well. It lets you use standard episodic algorithms with minor modifications. But it's theoretically unsatisfying and can miss long-term dependencies.

The average reward formulation handles continuing tasks naturally. The differential Sarsa algorithm from Chapter 7 runs forever, continuously updating R and w. There's no need for episode boundaries, discount factors, or artificial resets.

The key practical considerations for continuing tasks:

Issue	Solution
No episodes → no per-episode performance measure	Track rolling average reward (last N steps)
Non-stationarity	Constant step size α (no decay)
Initial transient	Discard early data; start measuring after warm-up
Value function scale	Differential values are naturally centered around zero
Credit assignment over long horizons	Use n-step or eligibility traces (Chapter 12)

Check: Why should the step size stay constant (not decay to zero) in continuing tasks?

Because the environment may change over time, and the agent needs to keep adapting Because decaying step sizes make the algorithm faster Because continuing tasks converge immediately

Chapter 9: Summary

This chapter extended approximation from prediction to control — from knowing how good states are to actually learning how to act. The core algorithm, semi-gradient Sarsa, is a direct generalization of tabular Sarsa with a parameterized q̂(s, a, w). The Mountain Car demonstrated that tile coding makes RL work in continuous state spaces where tables are hopeless.

Concept	Core Idea	Key Property
q̂(s, a, w)	Parameterized action values	Enables GPI with approximation
Semi-gradient Sarsa	One-step on-policy control	Same as tabular Sarsa + gradient step
n-step Sarsa	Multi-step bootstrap	Faster value propagation
Mountain Car	Classic continuous test bed	Tile coding + Sarsa solves it
Average reward r(π)	Long-run rate for continuing tasks	No γ needed
Differential values	Values relative to average	Replace discounted returns
Futility of γ	Discounting doesn't affect optimal policy	Average reward is cleaner
Differential Sarsa	Control without discounting	Natural for continuing tasks

The two frameworks side by side: For episodic tasks, use discounted semi-gradient Sarsa — it's simple and effective. For continuing tasks, use differential Sarsa with average reward — it's theoretically cleaner and avoids the arbitrary γ. In practice, both work; the choice depends on whether your problem has natural episodes.

What we can now do:

• Learn control policies in continuous state spaces
• Use tile coding for fast, linear, convergent learning
• Handle episodic tasks (semi-gradient Sarsa)
• Handle continuing tasks (differential Sarsa)

What's still missing:

• Off-policy learning with approximation (Ch 11)
• Eligibility traces for faster credit assignment (Ch 12)
• Policy gradient methods (Ch 13)
• Stability guarantees for nonlinear functions

What comes next: Chapter 9 handled prediction. This chapter handled on-policy control. Chapter 11 tackles the hardest case: off-policy learning with approximation. This is where function approximation, bootstrapping, and off-policy data collide to create the infamous deadly triad — and where things can go seriously wrong.

"Don't let the discount factor distract you.
What matters is the average reward per step."
— Andrew Barto (paraphrased)

Check: What is the main contribution of this chapter beyond Chapter 9?

It introduced function approximation It extended approximation from prediction to control, enabling the agent to learn policies (actions), not just values It proved that all RL algorithms converge

On-policy Control withApproximation