When the state space is too large for tables, we need functions that generalize.
Consider a game of backgammon. The state space has roughly 1020 board positions. A tabular value function — one entry per state — would need more memory than all the computers on Earth combined. And even if you could store it, you'd never visit most states, so most entries would remain at zero. The table would be enormous, empty, and useless.
Or consider a robot arm with six joints, each measured to 1-degree resolution. That's 3606 ≈ 2.2 × 1015 states. A lookup table is out of the question. But adjacent joint configurations produce nearly the same physics. A state with angles [30, 45, 60, 90, 120, 150] should have nearly the same value as [31, 45, 60, 90, 120, 150]. The table doesn't know this.
The solution is function approximation: instead of a table, we use a parameterized function v̂(s, w) that takes a state and outputs a value. The parameters w are shared across all states, so updating one state's value automatically affects nearby states. This is the leap from Part I to Part II of the book.
Tabular (Part I)
V[s] = one number per state. Independent entries. No generalization. Works only when every state is visited.
Approximate (Part II)
v̂(s, w) = parameterized function. Shared weights. Automatic generalization. Works in huge/continuous spaces.
Think of it like weather forecasting. You don't build a separate model for each GPS coordinate on Earth. You build one model with parameters that capture how temperature, pressure, and humidity relate — and it generalizes to coordinates it's never seen. That's what function approximation does for value functions.
Here's the setup. We have a parameterized function v̂(s, w) that maps states to estimated values. The vector w = [w1, w2, ..., wd] contains d weights. The number of weights d is much, much smaller than the number of states — that's the whole point. We're compressing a massive value table into a small weight vector.
Training examples come from experience. Each time the agent visits a state St and observes a return (or a TD target), we get a training pair: (St, target). The target says "the value of this state should be approximately this." We adjust w to make v̂(St, w) closer to the target.
This looks exactly like supervised learning. The difference: our training targets are not given by a teacher — they come from the agent's own experience. Monte Carlo targets (Gt) are unbiased but noisy. TD targets (R + γ v̂(S', w)) are biased (they depend on the current weights) but lower variance. This distinction drives the rest of the chapter.
The function v̂ can be anything: a linear combination of features, a neural network, a decision tree. But not all function approximators are equal. Some have nice convergence guarantees (linear methods). Others are more powerful but more dangerous (neural networks). This chapter explores the spectrum.
With a table, we could make V(s) exactly equal to vπ(s) for every state. With function approximation, we usually can't — the function has fewer parameters than states, so it must compromise. Some states will be approximated well, others poorly. How do we decide which states matter most?
We define the Mean Squared Value Error, written VE. It measures how far our approximation v̂(s, w) is from the true value vπ(s), averaged over states and weighted by how often we visit them under policy π:
The weighting μ(s) is the on-policy distribution — the fraction of time spent in state s while following π. States we visit often matter more. States we rarely visit matter less. This makes practical sense: we care most about accuracy where the agent actually spends its time.
In episodic tasks, μ(s) is proportional to the number of time steps spent in s averaged over episodes. In continuing tasks, μ(s) is the stationary distribution of the Markov chain under π. Either way, μ naturally emphasizes the states the agent actually visits.
The ideal weight vector w* minimizes VE. For linear methods, this minimum is unique and can be characterized exactly. For nonlinear methods (neural networks), there may be many local minima. But the goal is always the same: find w that makes VE(w) as small as possible.
To minimize VE, we use stochastic gradient descent (SGD). On each step, we observe a state St and a target Ut (the "true" or estimated value), and update the weights in the direction that reduces the squared error for that one example:
The gradient ∇w v̂(St, wt) tells us how each weight affects the output. We move the weights in the direction that makes the output closer to the target. If Ut is the true value vπ(St), this is a true gradient method and converges to the global minimum of VE.
Here's the critical distinction. With Monte Carlo, the target Ut = Gt is the actual return — it doesn't depend on w. So the gradient is correct and MC with function approximation is a true gradient method. It converges to a local minimum (global for linear functions).
With TD, the target Ut = Rt+1 + γ v̂(St+1, wt) depends on w itself. The gradient of this target with respect to w is ignored — we pretend it's a constant. This makes the update a semi-gradient method. It's not following the true gradient of any objective function.
True Gradient (MC)
Target: Gt (actual return)
Independent of w
Unbiased, high variance
Converges to local min of VE
Semi-Gradient (TD)
Target: R + γ v̂(S', w)
Depends on w
Biased, low variance
Converges for linear — not guaranteed for nonlinear
Despite lacking a true gradient, semi-gradient TD is usually faster than MC in practice. The bootstrapped target introduces bias but dramatically reduces variance. And for linear function approximation, semi-gradient TD converges reliably to a well-defined TD fixed point. That fixed point is close to the VE minimum, as we'll see in Chapter 5.
The simplest and most important function approximator is the linear function. We represent each state s by a feature vector x(s) = [x1(s), x2(s), ..., xd(s)]T. Each feature is a number that captures something about the state. The approximate value is just the dot product:
Think of it like a recipe. Each feature xi(s) is an ingredient, and each weight wi controls how much of that ingredient goes into the value. Feature x1 might indicate "how close to the goal," x2 might indicate "how much danger nearby," and the weights learn the right mixture.
The gradient is beautifully simple: ∇w v̂(s, w) = x(s). The SGD update becomes:
For semi-gradient TD(0), the update specializes to:
Semi-gradient TD(0) with linear approximation converges to the TD fixed point wTD, defined as the weight vector where the expected update is zero. This fixed point exists and is unique (as long as the feature vectors are linearly independent), and the method converges to it reliably. We'll see exactly how good this fixed point is in the next chapter.
We said semi-gradient TD converges to the TD fixed point. But how good is that fixed point? Does it minimize VE? Not exactly — but it comes close, and we can bound exactly how close.
The VE bound for the TD fixed point is:
This says the TD fixed point's error is at most 1/(1−γ) times the best possible error. For γ = 0.9, that's a factor of 10. For γ = 0.99, it's 100. The bound gets worse as γ approaches 1. But in practice, TD usually does much better than this worst case.
Why the 1/(1−γ) factor? Bootstrapping introduces bias — TD's target includes the current estimate v̂(S', w), which may be wrong. This bias compounds across the bootstrap chain. The more we bootstrap (higher γ), the more steps of bias compound, and the worse the bound. With γ = 0, there's no bootstrapping and TD matches MC.
| Method | Converges to | Guarantee |
|---|---|---|
| MC + linear | min VE(w) | Global optimum |
| TD(0) + linear | wTD | ≤ 1/(1−γ) × min VE |
| MC + nonlinear | Local min of VE | Only local |
| TD + nonlinear | No guarantee | Can diverge! |
That last row is alarming. Semi-gradient TD with nonlinear function approximation (e.g., neural networks) has no convergence guarantee. It can diverge. This is a taste of the deadly triad problem from Chapter 11. For now, know that linear methods are safe, and nonlinear methods require care.
The power of linear methods lives entirely in the features. Choose features wisely and you can approximate complex value functions with simple dot products. Two classic approaches are polynomial bases and Fourier bases. The Fourier basis is especially elegant and practical.
The idea: represent the state as a number in [0, 1] (or a vector in [0,1]k for multi-dimensional states), then use cosine functions of different frequencies as features:
Feature x0 = cos(0) = 1 is a constant (the bias). Feature x1 = cos(πs) captures one half-wave across the state space. Feature x2 = cos(2πs) captures one full wave. Higher-order features capture finer details. The order n controls the resolution: more features mean a better fit, but also more parameters and risk of overfitting.
The teal curve is the true value function. The orange curve is the Fourier basis approximation. Adjust the order to see how more basis functions improve the fit. The individual basis functions are shown below in gray.
Notice that smooth functions (the bump, the sine wave) are well-approximated with low order. But the step function needs many terms — it has a sharp discontinuity that cosines can only approximate by piling up many high-frequency components. This is a fundamental tradeoff: smoother value functions are easier to approximate.
Tile coding is the most practical feature construction method for RL. It's fast, simple, and scales well. The idea: lay a grid (a tiling) over the state space. Each grid cell (tile) that contains the current state produces a feature value of 1; all other tiles produce 0. Then overlay several tilings, each offset slightly from the others.
A single tiling is too coarse — nearby states in the same tile get identical features. But with multiple offset tilings, the set of active tiles changes gradually as the state moves. Two nearby states share most of their active tiles (similar features), while distant states share few (different features). This creates smooth generalization.
With n tilings and k tiles per dimension, each state activates exactly n tiles (one per tiling). The feature vector is binary: mostly zeros, with exactly n ones. The value estimate is just the sum of the n active tile weights. This makes updates extremely fast — you only touch n weights per step.
Move your cursor over the 2D space. Each colored overlay is one tiling (offset from the others). The tiles that contain your cursor are highlighted. Notice how the set of active tiles changes smoothly as you move.
Practical considerations: the step size α should be proportional to 1/n (number of tilings), because each update affects n tiles. With 8 tilings, use α = 1/8 as a starting point. More tilings give finer generalization but require more memory (more tile weights to store).
Radial basis functions (RBFs) are a natural generalization of tile coding. Instead of each feature being 0 or 1 (tile active or not), an RBF feature produces a continuous value that depends on the distance between the state and the feature's center:
Each RBF is a Gaussian bump centered at ci with width σi. When the state is near center ci, the feature fires strongly. As the state moves away, the feature decays smoothly to zero. The result is an even smoother approximation than tile coding — but at the cost of more computation (all features are nonzero, not just n).
Choosing the step size α is critical. Too large and the weights oscillate. Too small and learning is painfully slow. For tile coding with n tilings, a good starting point is α = 1/n. More generally, α should be inversely proportional to the expected number of active features.
The α = 1/τ heuristic: think of τ as the number of experiences you want to average over. With α = 1/10, the estimate is roughly a running average of the last 10 experiences at each feature. Too many features sharing an update means each one gets diluted — scale α accordingly.
| Feature Type | Active Features | Suggested α |
|---|---|---|
| Tile coding (n tilings) | n per state | 1/n |
| RBFs | All (d total) | 1/E[xTx] |
| Tabular (one-hot) | 1 | Any (no sharing) |
Everything we've discussed so far has been linear: v̂(s, w) = wTx(s). The features x(s) are fixed; only the weights are learned. But what if the features themselves could be learned? That's what neural networks do. They learn both the features and the weights, automatically discovering the right representation of the state.
A neural network is a series of layers, each applying a linear transformation followed by a nonlinear activation function. The output of each layer becomes the features for the next. The final layer produces the value estimate. The entire network is trained end-to-end with backpropagation.
This is the connection to deep reinforcement learning. DQN (Mnih et al., 2015) used a deep convolutional neural network to play Atari games from raw pixels. The network learned to extract features — positions of enemies, trajectory of the ball — without any human feature engineering. The semi-gradient update is the same; only the function approximator changed.
The semi-gradient update with a neural network is identical in form to the linear case:
where δt = Rt+1 + γ v̂(St+1, w) − v̂(St, w) is the TD error. The only difference is that ∇w v̂ now involves backpropagation through the network layers, not just the feature vector x(s).
In Chapter 7 we saw how n-step methods unify MC (n = ∞) and TD(0) (n = 1) for tabular methods. The same idea extends directly to function approximation. The n-step semi-gradient TD update uses an n-step return as the target:
The n-step return looks n steps ahead in the actual experience, then bootstraps from the current value estimate at step t+n. More steps means more real rewards and less bootstrapping — lower bias but higher variance.
For linear methods, n-step semi-gradient TD still converges to a fixed point, and the VE bound becomes:
As n grows, γn shrinks, and the bound tightens toward zero (the MC limit). So intermediate n values can provide a better balance than either extreme. This is the same story as Chapter 7, just generalized to function approximation.
See how the VE bound changes with n for different γ values. Lower is better.
This chapter crossed the threshold from tabular to approximate methods — the most important transition in reinforcement learning. When state spaces are too large for tables, function approximation is not just helpful; it's the only option. And it brings a fundamentally different character to RL: now we're doing optimization, not just averaging.
| Concept | Core Idea | Key Property |
|---|---|---|
| v̂(s, w) | Parameterized value function | Generalizes across states |
| VE | Mean squared value error | Weighted by on-policy μ(s) |
| True gradient (MC) | Target independent of w | Converges to min VE |
| Semi-gradient (TD) | Target depends on w | Converges for linear only |
| Linear methods | v̂ = wTx(s) | Convergence guarantees; features are everything |
| Fourier basis | Cosine features | Smooth, tunable order |
| Tile coding | Multiple offset binary grids | Fast, local, practical |
| RBFs | Gaussian bumps | Smooth but slower |
| Neural networks | Learned features | Powerful but no guarantees |
What we gained:
• Ability to handle huge/continuous state spaces
• Automatic generalization to unseen states
• Connection to supervised learning tools
• Practical methods (tile coding) for real problems
What we lost:
• Exact convergence to vπ
• Convergence guarantees (for nonlinear + TD)
• State-independent accuracy
• Simplicity of table lookup
What comes next: Chapter 8 gave us planning with tabular methods. This chapter gave us prediction with approximation. Chapter 10 combines these ideas for control — learning to act, not just evaluate, in large state spaces. We'll extend semi-gradient methods to action values and meet the Mountain Car, the classic test-bed for function approximation in RL.