Learning to act — not just evaluate — when the state space is too large for tables.
Chapter 9 taught us to predict — to approximate the value vπ(s) of following a fixed policy. But the whole point of RL is control: finding the best policy. For that, we need action values q̂(s, a, w), not just state values. The agent must know which action is best in each state.
The leap from prediction to control follows the same pattern as in tabular methods (Chapter 6). We use generalized policy improvement (GPI): evaluate the current policy using action-value approximation, then improve the policy by acting greedily with respect to q̂. The difference is that now q̂ is a parameterized function, not a table.
There's a subtlety with approximation that didn't exist in tabular methods. When we update q̂ for one (s, a) pair, the weights change, and that change affects q̂ for every (s, a) pair. Improving the value of one action might accidentally make another action look better or worse. This coupling — absent in tables — makes control with approximation fundamentally harder.
Tabular Sarsa (Chapter 6) updated Q(S, A) toward R + γ Q(S', A'). The semi-gradient version does the same thing, but with a parameterized function. The agent follows an ε-greedy policy, and after each transition (S, A, R, S', A'), updates the weights:
This is "semi-gradient" for the same reason as in Chapter 9: the target R + γ q̂(S', A', w) depends on w, but we ignore that dependence in the gradient. It's Sarsa because the next action A' is the action actually taken under the current policy (on-policy), not the greedy action (which would be Q-learning).
pseudocode # Episodic Semi-gradient Sarsa Initialize w arbitrarily for each episode: S ← initial state A ← ε-greedy from q̂(S, ·, w) while S is not terminal: take A, observe R, S' if S' is terminal: w ← w + α [R − q̂(S, A, w)] ∇ q̂(S, A, w) else: A' ← ε-greedy from q̂(S', ·, w) w ← w + α [R + γ q̂(S', A', w) − q̂(S, A, w)] ∇ q̂(S, A, w) A ← A' S ← S'
For linear approximation, q̂(s, a, w) = wT x(s, a), where x(s, a) is a feature vector that depends on both the state and action. One common approach: construct separate features for each action. For 3 actions and 10 state features, the full feature vector has 30 components (10 per action), and only the 10 corresponding to the chosen action are nonzero.
The Mountain Car is the quintessential test problem for function approximation in RL. A car sits in a valley between two hills. The goal is to drive up the right hill, but the engine is too weak to climb directly. The agent must learn to swing back and forth — building momentum — to reach the top.
The state space is continuous: (position, velocity). Position ranges from −1.2 to 0.5, velocity from −0.07 to 0.07. Three actions: throttle left, no throttle, throttle right. The reward is −1 per step (punishing slow solutions). The episode ends when the car reaches position 0.5 (the goal on the right).
Watch the car learn to reach the goal on the right hill. Click Train 1 for one episode or Train 100 to see rapid learning. The value surface below shows the learned cost-to-go. The right panel tracks steps per episode.
The tile coding uses 8 tilings, each with 8×8 tiles over the (position, velocity) space. That's 8 × 8 × 8 = 512 tiles per action, 1536 total weights. The step size α is divided by 8 (number of tilings) internally. Each state activates exactly 8 tiles per action — a tiny fraction of all weights — making updates extremely fast.
Just as we extended TD prediction to n-step returns (Chapter 7, then Chapter 9), we can extend Sarsa to use n-step returns. The n-step return for control is:
The semi-gradient n-step Sarsa update is then:
More steps means more actual rewards in the return and less reliance on the current (possibly inaccurate) value estimate. The tradeoff is the same as always: more steps reduces bias from bootstrapping but increases variance from stochastic rewards and actions.
n = 1 (one-step Sarsa)
Maximum bootstrapping. Lowest variance. Slowest value propagation. Needs many episodes to learn.
n = 8 or 16
Moderate bootstrapping. More variance. Faster propagation. Often the sweet spot for Mountain Car.
The algorithm stores a buffer of the last n transitions. At each step t, once n more steps have been taken, it computes Gt:t+n and updates. Near the end of an episode (when fewer than n steps remain), the return truncates naturally — the final steps become effectively MC updates.
Everything so far has used the discounted return: Gt = Rt+1 + γ Rt+2 + γ2 Rt+3 + ... . Discounting makes the sum finite by shrinking future rewards. But for continuing tasks (tasks that never end), there's a better way: the average reward formulation.
Instead of maximizing the discounted sum, the agent maximizes the long-run average reward per step:
This is the average reward r(π). A good policy achieves a high average reward per step. Think of it like a factory's throughput: you don't care about any single product; you care about the average production rate over time.
Under the average reward formulation, the differential return replaces the discounted return:
Each reward is measured relative to the average r(π). Positive terms mean "better than average." Negative terms mean "worse than average." The differential return doesn't need discounting because it sums deviations from the mean, which naturally tend to cancel out.
In the average reward setting, the differential value of a state measures how much better (or worse) it is to be in that state compared to the long-run average. Formally:
This is the sum of future rewards minus the average. If the state leads to a sequence of above-average rewards, its differential value is positive. If it leads to below-average rewards, it's negative. A state with differential value zero is "average" — being there is neither better nor worse than the long-run expectation.
The differential Bellman equation is:
Compare this with the discounted Bellman equation. The discount factor γ is gone, replaced by the subtraction of r(π). Instead of shrinking future values by γ, we measure each reward as a deviation from the baseline. The structure is otherwise identical, which means all our RL algorithms transfer directly.
Similarly, the differential action value qπ(s, a) measures how much cumulative excess reward the agent gets starting from (s, a):
Here's one of the most surprising results in the book. For continuing tasks with function approximation, the discount factor γ does not affect which policy is best. Every γ ∈ (0, 1) produces the same policy ordering. This means discounting adds complexity without changing the answer.
How can this be? In tabular methods, γ matters because it changes the shape of the value function, and greedy policies depend on that shape. But with function approximation, value functions are already approximate — they can't represent the true values exactly. The relative ordering of policies, which is what matters for control, turns out to be independent of γ.
The practical consequence: in continuing tasks with approximation, γ is a free parameter with no clear best value. Setting γ = 0.9 vs γ = 0.99 changes learning dynamics (step-size sensitivity, variance, bootstrapping) but not which policy is theoretically best. This is why the average reward formulation is cleaner — it removes the arbitrary γ entirely.
Three policies (A, B, C) with different reward streams. Drag γ and observe: the discounted returns change, but the ranking (which bar is tallest) stays the same. Policy ordering is independent of γ.
This result has deep implications. It says that for continuing tasks, the entire γ-based framework is a convenience, not a necessity. The average reward framework captures policy quality directly, without the extra parameter. Many researchers in the field have shifted to average reward for continuing problems as a result.
Now we combine all the pieces: average reward, differential values, semi-gradient methods, and n-step returns. The result is differential semi-gradient n-step Sarsa — the practical algorithm for continuing control with approximation.
The idea: instead of bootstrapping from discounted values, we bootstrap from differential values. The TD error becomes:
where R is the agent's current estimate of the average reward r(π). Notice: no γ anywhere. The discount factor has been replaced by the average reward estimate.
The weight update is the familiar semi-gradient step:
The average reward estimate R is updated incrementally:
where β is a step size for the average. When δt > 0, the agent is doing better than expected, so R increases. When δt < 0, the agent is doing worse, and R decreases. Over time, R converges to the true average reward.
pseudocode # Differential Semi-gradient Sarsa Initialize w arbitrarily, R ← 0 S ← initial state A ← ε-greedy from q̂(S, ·, w) for each step: take A, observe R, S' A' ← ε-greedy from q̂(S', ·, w) δ ← R − R + q̂(S', A', w) − q̂(S, A, w) R ← R + β · δ w ← w + α · δ · ∇ q̂(S, A, w) S ← S', A ← A'
The n-step version uses a differential n-step return: Gt:t+n = ∑k=0n−1(Rt+k+1 − R) + q̂(St+n, At+n, w). More steps means less bias from the current q̂ estimate, at the cost of more variance from stochastic transitions.
Most real-world problems are continuing tasks: a factory optimizing production, a trading algorithm managing a portfolio, a server balancing network traffic. These tasks have no natural episodes. The agent runs indefinitely, and its performance is measured by the long-run average reward.
Continuing tasks bring several practical challenges that episodic tasks don't have:
No episode boundaries
No natural "reset" points. The agent must learn online, continuously adapting. Mistakes in one time region affect all future learning.
Non-stationarity
The environment may slowly change. The step size α must stay positive (no 1/t decay) to track changes, but this means estimates never fully converge.
The average reward formulation handles continuing tasks naturally. The differential Sarsa algorithm from Chapter 7 runs forever, continuously updating R and w. There's no need for episode boundaries, discount factors, or artificial resets.
The key practical considerations for continuing tasks:
| Issue | Solution |
|---|---|
| No episodes → no per-episode performance measure | Track rolling average reward (last N steps) |
| Non-stationarity | Constant step size α (no decay) |
| Initial transient | Discard early data; start measuring after warm-up |
| Value function scale | Differential values are naturally centered around zero |
| Credit assignment over long horizons | Use n-step or eligibility traces (Chapter 12) |
This chapter extended approximation from prediction to control — from knowing how good states are to actually learning how to act. The core algorithm, semi-gradient Sarsa, is a direct generalization of tabular Sarsa with a parameterized q̂(s, a, w). The Mountain Car demonstrated that tile coding makes RL work in continuous state spaces where tables are hopeless.
| Concept | Core Idea | Key Property |
|---|---|---|
| q̂(s, a, w) | Parameterized action values | Enables GPI with approximation |
| Semi-gradient Sarsa | One-step on-policy control | Same as tabular Sarsa + gradient step |
| n-step Sarsa | Multi-step bootstrap | Faster value propagation |
| Mountain Car | Classic continuous test bed | Tile coding + Sarsa solves it |
| Average reward r(π) | Long-run rate for continuing tasks | No γ needed |
| Differential values | Values relative to average | Replace discounted returns |
| Futility of γ | Discounting doesn't affect optimal policy | Average reward is cleaner |
| Differential Sarsa | Control without discounting | Natural for continuing tasks |
What we can now do:
• Learn control policies in continuous state spaces
• Use tile coding for fast, linear, convergent learning
• Handle episodic tasks (semi-gradient Sarsa)
• Handle continuing tasks (differential Sarsa)
What's still missing:
• Off-policy learning with approximation (Ch 11)
• Eligibility traces for faster credit assignment (Ch 12)
• Policy gradient methods (Ch 13)
• Stability guarantees for nonlinear functions
What comes next: Chapter 9 handled prediction. This chapter handled on-policy control. Chapter 11 tackles the hardest case: off-policy learning with approximation. This is where function approximation, bootstrapping, and off-policy data collide to create the infamous deadly triad — and where things can go seriously wrong.