Ch 9: On-policy Prediction with Approximation

Chapter 0: Why Approximation?

Consider a game of backgammon. The state space has roughly 10²⁰ board positions. A tabular value function — one entry per state — would need more memory than all the computers on Earth combined. And even if you could store it, you'd never visit most states, so most entries would remain at zero. The table would be enormous, empty, and useless.

Or consider a robot arm with six joints, each measured to 1-degree resolution. That's 360⁶ ≈ 2.2 × 10¹⁵ states. A lookup table is out of the question. But adjacent joint configurations produce nearly the same physics. A state with angles [30, 45, 60, 90, 120, 150] should have nearly the same value as [31, 45, 60, 90, 120, 150]. The table doesn't know this.

The core problem: Tabular methods store one value per state. In large (or continuous) state spaces, you can't visit every state, so you can't fill the table. We need methods that generalize — learning about one state should tell us something about similar states.

The solution is function approximation: instead of a table, we use a parameterized function v̂(s, w) that takes a state and outputs a value. The parameters w are shared across all states, so updating one state's value automatically affects nearby states. This is the leap from Part I to Part II of the book.

Tabular (Part I)

V[s] = one number per state. Independent entries. No generalization. Works only when every state is visited.

Approximate (Part II)

v̂(s, w) = parameterized function. Shared weights. Automatic generalization. Works in huge/continuous spaces.

Think of it like weather forecasting. You don't build a separate model for each GPS coordinate on Earth. You build one model with parameters that capture how temperature, pressure, and humidity relate — and it generalizes to coordinates it's never seen. That's what function approximation does for value functions.

Check: Why can't tabular methods handle large state spaces?

The learning rate is too small Most states are never visited, so the table stays empty, and there is no generalization to unseen states Tabular methods converge too slowly

Chapter 1: The Framework

Here's the setup. We have a parameterized function v̂(s, w) that maps states to estimated values. The vector w = [w₁, w₂, ..., w_d] contains d weights. The number of weights d is much, much smaller than the number of states — that's the whole point. We're compressing a massive value table into a small weight vector.

Training examples come from experience. Each time the agent visits a state S_t and observes a return (or a TD target), we get a training pair: (S_t, target). The target says "the value of this state should be approximately this." We adjust w to make v̂(S_t, w) closer to the target.

Experience

Agent follows policy π, observes states and rewards

↓

Training Examples

(S_t, G_t) or (S_t, R + γ v̂(S', w))

↓

Weight Update

w ← w + α [target − v̂(S, w)] ∇ v̂(S, w)

↻ repeat

This looks exactly like supervised learning. The difference: our training targets are not given by a teacher — they come from the agent's own experience. Monte Carlo targets (G_t) are unbiased but noisy. TD targets (R + γ v̂(S', w)) are biased (they depend on the current weights) but lower variance. This distinction drives the rest of the chapter.

Key insight: Function approximation turns RL into a supervised learning problem. The "dataset" is generated by the agent's own policy. The "labels" come from returns or bootstrapped estimates. The "model" is v̂(s, w). Everything you know about supervised learning — gradients, overfitting, feature engineering — now applies to RL.

The function v̂ can be anything: a linear combination of features, a neural network, a decision tree. But not all function approximators are equal. Some have nice convergence guarantees (linear methods). Others are more powerful but more dangerous (neural networks). This chapter explores the spectrum.

Check: Where do the training examples for function approximation in RL come from?

A pre-collected labeled dataset A human expert labeling states The agent's own experience: states visited and returns or TD targets observed

Chapter 2: The VE Objective

With a table, we could make V(s) exactly equal to v_π(s) for every state. With function approximation, we usually can't — the function has fewer parameters than states, so it must compromise. Some states will be approximated well, others poorly. How do we decide which states matter most?

We define the Mean Squared Value Error, written VE. It measures how far our approximation v̂(s, w) is from the true value v_π(s), averaged over states and weighted by how often we visit them under policy π:

VE(w) = ∑_s μ(s) [ v_π(s) − v̂(s, w) ]²

The weighting μ(s) is the on-policy distribution — the fraction of time spent in state s while following π. States we visit often matter more. States we rarely visit matter less. This makes practical sense: we care most about accuracy where the agent actually spends its time.

Why μ? If the agent spends 90% of its time in states A and B, and 10% in state C, then errors in A and B should be penalized 9x more than errors in C. The on-policy distribution does exactly this. It says: "get the values right where it matters."

In episodic tasks, μ(s) is proportional to the number of time steps spent in s averaged over episodes. In continuing tasks, μ(s) is the stationary distribution of the Markov chain under π. Either way, μ naturally emphasizes the states the agent actually visits.

The ideal weight vector w* minimizes VE. For linear methods, this minimum is unique and can be characterized exactly. For nonlinear methods (neural networks), there may be many local minima. But the goal is always the same: find w that makes VE(w) as small as possible.

Check: What does the on-policy distribution μ(s) represent in the VE objective?

How often state s is visited under policy π, weighting errors by importance The optimal value of state s The learning rate for state s

Chapter 3: SGD and Semi-Gradient

To minimize VE, we use stochastic gradient descent (SGD). On each step, we observe a state S_t and a target U_t (the "true" or estimated value), and update the weights in the direction that reduces the squared error for that one example:

w_t+1 = w_t + α [ U_t − v̂(S_t, w_t) ] ∇_w v̂(S_t, w_t)

The gradient ∇_w v̂(S_t, w_t) tells us how each weight affects the output. We move the weights in the direction that makes the output closer to the target. If U_t is the true value v_π(S_t), this is a true gradient method and converges to the global minimum of VE.

Here's the critical distinction. With Monte Carlo, the target U_t = G_t is the actual return — it doesn't depend on w. So the gradient is correct and MC with function approximation is a true gradient method. It converges to a local minimum (global for linear functions).

With TD, the target U_t = R_t+1 + γ v̂(S_t+1, w_t) depends on w itself. The gradient of this target with respect to w is ignored — we pretend it's a constant. This makes the update a semi-gradient method. It's not following the true gradient of any objective function.

The key difference: MC targets are independent of w → true gradient → guaranteed convergence. TD targets depend on w → semi-gradient → convergence only with special function approximators (linear). This gap between "correct but slow" and "fast but approximate" is a recurring theme.

True Gradient (MC)

Target: G_t (actual return)
Independent of w
Unbiased, high variance
Converges to local min of VE

Semi-Gradient (TD)

Target: R + γ v̂(S', w)
Depends on w
Biased, low variance
Converges for linear — not guaranteed for nonlinear

Despite lacking a true gradient, semi-gradient TD is usually faster than MC in practice. The bootstrapped target introduces bias but dramatically reduces variance. And for linear function approximation, semi-gradient TD converges reliably to a well-defined TD fixed point. That fixed point is close to the VE minimum, as we'll see in Chapter 5.

Check: Why is TD with function approximation called a "semi-gradient" method?

Because the TD target depends on the current weights w, so the gradient of the target is ignored Because we only use half the training data Because the step size is halved

Chapter 4: Linear Methods

The simplest and most important function approximator is the linear function. We represent each state s by a feature vector x(s) = [x₁(s), x₂(s), ..., x_d(s)]^T. Each feature is a number that captures something about the state. The approximate value is just the dot product:

v̂(s, w) = w^T x(s) = ∑_i=1^d w_i · x_i(s)

Think of it like a recipe. Each feature x_i(s) is an ingredient, and each weight w_i controls how much of that ingredient goes into the value. Feature x₁ might indicate "how close to the goal," x₂ might indicate "how much danger nearby," and the weights learn the right mixture.

The gradient is beautifully simple: ∇_w v̂(s, w) = x(s). The SGD update becomes:

w_t+1 = w_t + α [ U_t − w_t^T x(S_t) ] x(S_t)

For semi-gradient TD(0), the update specializes to:

w_t+1 = w_t + α [ R_t+1 + γ w_t^T x(S_t+1) − w_t^T x(S_t) ] x(S_t)

Key insight: With linear methods, the only design choice is the features x(s). Pick good features and the method works beautifully. Pick bad features and no amount of training helps. Feature engineering is everything in linear function approximation — the same lesson as classical machine learning.

Semi-gradient TD(0) with linear approximation converges to the TD fixed point w_TD, defined as the weight vector where the expected update is zero. This fixed point exists and is unique (as long as the feature vectors are linearly independent), and the method converges to it reliably. We'll see exactly how good this fixed point is in the next chapter.

Check: In linear function approximation, what is the gradient ∇_w v̂(s, w)?

The weight vector w The feature vector x(s) The value v̂(s, w) itself

Chapter 5: Convergence

We said semi-gradient TD converges to the TD fixed point. But how good is that fixed point? Does it minimize VE? Not exactly — but it comes close, and we can bound exactly how close.

The VE bound for the TD fixed point is:

VE(w_TD) ≤ 1 ⁄ (1 − γ) · min_w VE(w)

This says the TD fixed point's error is at most 1/(1−γ) times the best possible error. For γ = 0.9, that's a factor of 10. For γ = 0.99, it's 100. The bound gets worse as γ approaches 1. But in practice, TD usually does much better than this worst case.

The tradeoff: MC converges to the actual VE minimum (the best possible with these features), but slowly. TD converges to a fixed point that might be up to 1/(1−γ) times worse, but it gets there much faster. In practice, TD's speed advantage usually wins.

Why the 1/(1−γ) factor? Bootstrapping introduces bias — TD's target includes the current estimate v̂(S', w), which may be wrong. This bias compounds across the bootstrap chain. The more we bootstrap (higher γ), the more steps of bias compound, and the worse the bound. With γ = 0, there's no bootstrapping and TD matches MC.

Method	Converges to	Guarantee
MC + linear	min VE(w)	Global optimum
TD(0) + linear	w_TD	≤ 1/(1−γ) × min VE
MC + nonlinear	Local min of VE	Only local
TD + nonlinear	No guarantee	Can diverge!

That last row is alarming. Semi-gradient TD with nonlinear function approximation (e.g., neural networks) has no convergence guarantee. It can diverge. This is a taste of the deadly triad problem from Chapter 11. For now, know that linear methods are safe, and nonlinear methods require care.

Check: What does the VE bound 1/(1−γ) tell us?

The TD fixed point's error is at most 1/(1−γ) times the best achievable error with these features TD always performs 1/(1−γ) times worse than MC We need 1/(1−γ) episodes to converge

Chapter 6: Fourier Basis

The power of linear methods lives entirely in the features. Choose features wisely and you can approximate complex value functions with simple dot products. Two classic approaches are polynomial bases and Fourier bases. The Fourier basis is especially elegant and practical.

The idea: represent the state as a number in [0, 1] (or a vector in [0,1]^k for multi-dimensional states), then use cosine functions of different frequencies as features:

x_i(s) = cos(i π s), i = 0, 1, 2, ..., n

Feature x₀ = cos(0) = 1 is a constant (the bias). Feature x₁ = cos(πs) captures one half-wave across the state space. Feature x₂ = cos(2πs) captures one full wave. Higher-order features capture finer details. The order n controls the resolution: more features mean a better fit, but also more parameters and risk of overfitting.

Why Fourier? Any smooth function can be approximated by a sum of cosines (this is the Fourier cosine series). By including cosines up to order n, we can represent functions with up to n "bumps." It's like an equalizer on a stereo — each slider controls how much of that frequency to include. The weights w_i are the slider positions.

Showcase: Fourier Basis Approximation

The teal curve is the true value function. The orange curve is the Fourier basis approximation. Adjust the order to see how more basis functions improve the fit. The individual basis functions are shown below in gray.

Fourier order n = 3

Target function

Notice that smooth functions (the bump, the sine wave) are well-approximated with low order. But the step function needs many terms — it has a sharp discontinuity that cosines can only approximate by piling up many high-frequency components. This is a fundamental tradeoff: smoother value functions are easier to approximate.

Check: What determines the resolution of a Fourier basis approximation?

The learning rate α The discount factor γ The order n — the number of cosine basis functions used

Chapter 7: Tile Coding

Tile coding is the most practical feature construction method for RL. It's fast, simple, and scales well. The idea: lay a grid (a tiling) over the state space. Each grid cell (tile) that contains the current state produces a feature value of 1; all other tiles produce 0. Then overlay several tilings, each offset slightly from the others.

A single tiling is too coarse — nearby states in the same tile get identical features. But with multiple offset tilings, the set of active tiles changes gradually as the state moves. Two nearby states share most of their active tiles (similar features), while distant states share few (different features). This creates smooth generalization.

Analogy: Imagine describing your location using street grids. One grid says "you're in block A3." Another grid, shifted half a block, says "you're in block B4." A third, shifted differently, says "C2." Together, these three descriptions locate you much more precisely than any single grid, and nearby locations share most of their block labels.

With n tilings and k tiles per dimension, each state activates exactly n tiles (one per tiling). The feature vector is binary: mostly zeros, with exactly n ones. The value estimate is just the sum of the n active tile weights. This makes updates extremely fast — you only touch n weights per step.

v̂(s, w) = ∑_{i ∈ active(s)} w_i

Showcase: 2D Tile Coding Visualizer

Move your cursor over the 2D space. Each colored overlay is one tiling (offset from the others). The tiles that contain your cursor are highlighted. Notice how the set of active tiles changes smoothly as you move.

Number of tilings = 4

Tiles per dimension = 4

Move cursor over canvas

Practical considerations: the step size α should be proportional to 1/n (number of tilings), because each update affects n tiles. With 8 tilings, use α = 1/8 as a starting point. More tilings give finer generalization but require more memory (more tile weights to store).

Why tile coding dominates in RL: It's (1) fast: only n additions per value lookup, (2) linear: convergence guarantees apply, (3) local: updating one region doesn't affect distant regions, (4) tunable: the number of tilings and tiles per dimension control the resolution. Sutton's RL group used tile coding for decades of successful experiments.

Check: Why use multiple offset tilings instead of a single fine grid?

It reduces computation Multiple tilings create smooth generalization — nearby states share most active tiles, while a single grid has sharp boundaries It makes the feature vector denser

Chapter 8: RBFs and Step-Size

Radial basis functions (RBFs) are a natural generalization of tile coding. Instead of each feature being 0 or 1 (tile active or not), an RBF feature produces a continuous value that depends on the distance between the state and the feature's center:

x_i(s) = exp(− ||s − c_i||² / 2σ_i²)

Each RBF is a Gaussian bump centered at c_i with width σ_i. When the state is near center c_i, the feature fires strongly. As the state moves away, the feature decays smoothly to zero. The result is an even smoother approximation than tile coding — but at the cost of more computation (all features are nonzero, not just n).

Tile coding vs RBFs: Tile coding features are binary — sharp on/off within tiles. RBFs are smooth Gaussians — gradual falloff with distance. Tile coding is faster (sparse features, only n active). RBFs give smoother approximations. In practice, tile coding is preferred for RL because speed matters and the smoothness advantage of RBFs is marginal.

Practical Step-Size Selection

Choosing the step size α is critical. Too large and the weights oscillate. Too small and learning is painfully slow. For tile coding with n tilings, a good starting point is α = 1/n. More generally, α should be inversely proportional to the expected number of active features.

The α = 1/τ heuristic: think of τ as the number of experiences you want to average over. With α = 1/10, the estimate is roughly a running average of the last 10 experiences at each feature. Too many features sharing an update means each one gets diluted — scale α accordingly.

Feature Type	Active Features	Suggested α
Tile coding (n tilings)	n per state	1/n
RBFs	All (d total)	1/E[x^Tx]
Tabular (one-hot)	1	Any (no sharing)

Check: How does an RBF feature respond as the state moves away from its center?

It switches off abruptly (binary) It increases linearly It decays smoothly to zero (Gaussian falloff)

Chapter 9: Neural Networks

Everything we've discussed so far has been linear: v̂(s, w) = w^Tx(s). The features x(s) are fixed; only the weights are learned. But what if the features themselves could be learned? That's what neural networks do. They learn both the features and the weights, automatically discovering the right representation of the state.

A neural network is a series of layers, each applying a linear transformation followed by a nonlinear activation function. The output of each layer becomes the features for the next. The final layer produces the value estimate. The entire network is trained end-to-end with backpropagation.

Input: State s

Raw state representation (pixels, positions, etc.)

↓

Hidden Layers

Learned features: nonlinear transformations of the input

↓

Output: v̂(s, w)

Single number: estimated value of state s

This is the connection to deep reinforcement learning. DQN (Mnih et al., 2015) used a deep convolutional neural network to play Atari games from raw pixels. The network learned to extract features — positions of enemies, trajectory of the ball — without any human feature engineering. The semi-gradient update is the same; only the function approximator changed.

The catch: Neural networks are nonlinear, so we lose all convergence guarantees. Semi-gradient TD with a neural network can diverge. In practice, tricks like experience replay (storing transitions and replaying them) and target networks (using a frozen copy of the network for the TD target) stabilize training. These hacks are necessary precisely because the theory doesn't guarantee stability.

The semi-gradient update with a neural network is identical in form to the linear case:

w_t+1 = w_t + α δ_t ∇_w v̂(S_t, w_t)

where δ_t = R_t+1 + γ v̂(S_t+1, w) − v̂(S_t, w) is the TD error. The only difference is that ∇_w v̂ now involves backpropagation through the network layers, not just the feature vector x(s).

Check: What is the key advantage of neural networks over linear methods for value approximation?

They always converge They learn features automatically, eliminating the need for manual feature engineering They require fewer parameters

Chapter 10: n-step Semi-Gradient TD

In Chapter 7 we saw how n-step methods unify MC (n = ∞) and TD(0) (n = 1) for tabular methods. The same idea extends directly to function approximation. The n-step semi-gradient TD update uses an n-step return as the target:

G_t:t+n = R_t+1 + γ R_t+2 + ... + γⁿ⁻¹ R_t+n + γⁿ v̂(S_t+n, w_t+n−1)

w_t+n = w_t+n−1 + α [ G_t:t+n − v̂(S_t, w_t+n−1) ] ∇_w v̂(S_t, w_t+n−1)

The n-step return looks n steps ahead in the actual experience, then bootstraps from the current value estimate at step t+n. More steps means more real rewards and less bootstrapping — lower bias but higher variance.

The n-step spectrum with approximation: n = 1 is semi-gradient TD(0) — maximum bootstrapping, minimum variance, but the convergence bound is 1/(1−γ). As n increases, the bound improves (less bias), but variance grows. n = ∞ gives MC — zero bias, optimal VE, but highest variance. The best n is problem-dependent.

For linear methods, n-step semi-gradient TD still converges to a fixed point, and the VE bound becomes:

VE(w_n-TD) ≤ γⁿ ⁄ (1 − γⁿ) · min_w VE(w)

As n grows, γⁿ shrinks, and the bound tightens toward zero (the MC limit). So intermediate n values can provide a better balance than either extreme. This is the same story as Chapter 7, just generalized to function approximation.

n-step Bias-Variance Tradeoff

See how the VE bound changes with n for different γ values. Lower is better.

γ = 0.90

Check: What happens to the VE bound as n (the number of steps) increases?

It tightens (improves) because more real rewards reduce the bias from bootstrapping It gets worse because more steps mean more noise It stays the same regardless of n

Chapter 11: Summary

This chapter crossed the threshold from tabular to approximate methods — the most important transition in reinforcement learning. When state spaces are too large for tables, function approximation is not just helpful; it's the only option. And it brings a fundamentally different character to RL: now we're doing optimization, not just averaging.

Concept	Core Idea	Key Property
v̂(s, w)	Parameterized value function	Generalizes across states
VE	Mean squared value error	Weighted by on-policy μ(s)
True gradient (MC)	Target independent of w	Converges to min VE
Semi-gradient (TD)	Target depends on w	Converges for linear only
Linear methods	v̂ = w^Tx(s)	Convergence guarantees; features are everything
Fourier basis	Cosine features	Smooth, tunable order
Tile coding	Multiple offset binary grids	Fast, local, practical
RBFs	Gaussian bumps	Smooth but slower
Neural networks	Learned features	Powerful but no guarantees

The two big lessons: (1) Feature construction is the most important design choice in linear approximation. Good features make simple linear methods work beautifully. (2) The semi-gradient "cheat" — ignoring the gradient through the target — is what makes TD fast but also what breaks convergence guarantees with nonlinear functions. This tension is the heart of modern deep RL.

What we gained:

• Ability to handle huge/continuous state spaces
• Automatic generalization to unseen states
• Connection to supervised learning tools
• Practical methods (tile coding) for real problems

What we lost:

• Exact convergence to v_π
• Convergence guarantees (for nonlinear + TD)
• State-independent accuracy
• Simplicity of table lookup

What comes next: Chapter 8 gave us planning with tabular methods. This chapter gave us prediction with approximation. Chapter 10 combines these ideas for control — learning to act, not just evaluate, in large state spaces. We'll extend semi-gradient methods to action values and meet the Mountain Car, the classic test-bed for function approximation in RL.

"The representation is the intelligence."
— Rich Sutton (paraphrased)

Check: What is the fundamental tradeoff introduced by function approximation in RL?

Speed vs memory Generalization to unseen states, at the cost of exact convergence and guaranteed stability Exploration vs exploitation

On-policy Prediction withApproximation