Approximate Value Functions

Chapter 0: Why Approximate?

A robot arm with 7 joints has 14 state variables (angle + velocity per joint). Discretize each into 100 bins and you need 100¹⁴ = 10²⁸ table entries. There are roughly 10²⁵ grains of sand on Earth. The table would fill thousands of planets. Tables don't scale to continuous or high-dimensional state spaces.

Chapter 7's exact value iteration stores one number U(s) per state s. This works for small, finite MDPs — Kochenderfer et al. use the example of a two-state aircraft collision avoidance system. But the real-world ACAS X system has a continuous state space with 8+ dimensions. Exact methods cannot be applied. The entire ACAS X project used the approximate value function methods from this chapter.

The answer is yes — and there are many ways to do it. This chapter surveys a toolkit of approximators, from the conceptually simple (nearest neighbor) to the powerful (neural networks). Each makes different tradeoffs between accuracy, scalability, and computational cost. The ACAS X system ultimately used a grid-based approach (piecewise linear) over a 5-dimensional state space with 5 billion grid points, stored as a compressed lookup table of 2.4 GB.

1. Sample

Collect a finite set of sample states S = {s₁, ..., s_m}

↓

2. Backup

Compute target values y_i = max_a[R(s_i,a) + γ∑T(s'|s_i,a)U(s')] at each sample

↓

3. Fit

Update approximator U_θ so U_θ(s_i) ≈ y_i

↻ repeat

4. Extract Policy

π(s) = argmax_a[R(s,a) + γ∑T(s'|s,a)U_θ(s')]

The approximation tradeoff. We lose the convergence guarantees of exact value iteration. In exchange, we can handle problems that exact methods cannot touch. The quality depends on (1) the function family (can it represent the true U?), (2) the sample states (do they cover important regions?), and (3) the fitting procedure (does it find good parameters?).

The textbook categorizes the full approximate value iteration loop as fitted value iteration (FVI). The word "fitted" emphasizes that each iteration involves a supervised learning step — fitting U_θ to target values — not just a Bellman update. FVI was first analyzed by Szepesvari and Smart (2004) and Gordon (1999), who established conditions under which it converges and derived error bounds. The key condition: the regression step must not increase the error — formally, the approximation operator Π must be non-expansive in the weighted norm. For least-squares regression (linear or neural net), this typically holds, but it is not guaranteed in general.

The name "fitted value iteration" was coined by Ernst, Geurts & Wehenkel (2005) in their paper "Tree-Based Batch Mode Reinforcement Learning," which used random forests as the regression oracle in the FVI loop. This paper demonstrated that any supervised regression method can be used as the fitting step — gradient boosted trees, Gaussian processes, random forests, and neural networks all fit the FVI framework. The textbook covers the gradient boosting and neural network variants most closely, but the framework is general.

Modern deep RL methods (DQN, TD3, SAC) are all variants of FVI with neural function approximation. The conceptual framework is identical to the textbook's presentation; the engineering innovations (replay buffers, target networks, gradient clipping, normalized rewards) are stabilization tricks that make FVI work in practice for complex problems.

The Bellman Backup: A Worked Example

Before diving into approximators, let's see exactly what one Bellman backup looks like at a continuous state. Suppose we have a 1D MDP with state s ∈ [0, 5], two actions {left, right}, and the following simple model:

Action	Transition	Reward
right	s' = min(s + 1, 5)	R = 10 if s' = 5 else 0
left	s' = max(s − 1, 0)	R = 0

Suppose our current value approximation says U(0)=0, U(1)=4, U(2)=8, U(3)=13, U(4)=19, U(5)=10 (the fifth state has the goal reward but no future). The Bellman backup at s=3, γ=0.9:

U_new(3) = max_a[R(3,a) + γ · U(s')]

a=right: 0 + 0.9 × U(4) = 0.9 × 19 = 17.1

a=left: 0 + 0.9 × U(2) = 0.9 × 8 = 7.2

So U_new(3) = 17.1, greedy action = right. Notice U changed from 13 to 17.1 in one backup — the value is still being propagated from the goal at s=5. After more iterations, all states will reflect the true discounted value of reaching the goal.

At a continuous query state like s=2.6, we cannot look up U(2.6) from a table. We must interpolate or approximate — that is the problem this chapter solves. The backup still computes R + γ E[U(s')], but the E[U(s')] evaluation requires the approximator.

The Curse of Dimensionality

See how the number of table entries (100 bins per dimension) explodes. Adjust the number of state dimensions. Even 6 dimensions makes a table infeasible; approximate methods are the only option.

State space dimensions 4

The approximate value iteration framework is sometimes called approximate dynamic programming (ADP) in the operations research literature. The two fields developed the same ideas independently: RL researchers focused on the online/model-free setting, while ADP researchers focused on the offline/model-based setting. The textbook bridges both, presenting approximate value iteration as a unified framework that covers both RL (when T and R are learned) and ADP (when T and R are known). Chapter 8 is the ADP treatment; Chapters 12–17 cover the RL variants where the model is unknown.

Why is tabular value iteration infeasible for a pendulum with continuous angle and angular velocity?

The state space is continuous and has infinitely many states, so no finite table can represent U(s) exactly The discount factor must be exactly 0 for continuous problems Continuous problems have no optimal policy

Chapter 1: Parametric Representations

A parametric approximation U_θ(s) represents the value function as a fixed functional form controlled by a parameter vector θ. Instead of storing one number per state, we store θ and compute U_θ(s) on demand for any query state.

The most important categorization is local vs global:

Type	How parameters work	Generalization	Examples
Local	θ = values at sample states; query computes weighted sum of nearby θ_i	Interpolates locally near samples	k-NN, kernel, simplex
Global	θ = coefficients of basis functions; U_θ(s) = θ^Tβ(s)	Extrapolates everywhere	Linear regression, neural nets

In both cases, the fitting procedure finds θ by minimizing a loss between U_θ(s_i) and the backed-up target values y_i. The loss is typically mean squared error:

L(θ) = ∑_i=1^m (U_θ(s_i) − y_i)²

Why these methods differ. Local methods (k-NN, kernel) store all sample states at prediction time — they are "lazy learners." Their computational cost scales with the sample set at query time. Global methods (linear, neural nets) store only θ at query time but pay an upfront fitting cost. For very large sample sets, global methods are far faster to query.

Fitting the Approximator: Regression on Bellman Targets

In approximate value iteration, we compute Bellman targets y_i at each sample state and then fit the approximator to them. The full loop is:

Initialize: U_θ(s) = 0 for all s (or some prior)
Sample states: Draw m states s₁, ..., s_m from S (uniform, grid, or policy rollouts)
Backup: y_i = max_a[R(s_i,a) + γ ∑_s' T(s'|s_i,a) U_θ(s')] for each i
Fit: Find θ that minimizes ∑_i(U_θ(s_i) − y_i)²
Repeat steps 3–4 until convergence or budget exhausted

The backup in step 3 requires evaluating U_θ(s') at the next state s', which uses the current approximation. This creates the same "moving target" instability as neural network regression (Chapter 7). With linear and k-NN approximators, the instability is less severe but still exists — there is no general proof of convergence for approximate value iteration.

One exception: If the approximation family is rich enough to contain the true value function, and the sample distribution covers S well, approximate value iteration converges to a small error ball. The formal guarantee: ||U_θ* − U*||_∞ ≤ ε/(1−γ) where ε is the approximation error of the best-fit U* in the function class. This bound degrades with discount γ → 1 and improves as ε → 0.

Local vs Global: Which Fits the True Function?

The true value function (a sine-like curve) is sampled at dots. A local method (nearest neighbor) and a global method (linear fit) are shown. Add sample points by clicking the canvas.

3 sample points

A k-nearest-neighbor approximator stores the value function as what?

A list of (state, value) pairs; to query U(s), find the k closest pairs and average their values A set of weights that define a linear function over the whole state space A single number representing the average value across all states

Chapter 2: Nearest Neighbor

Nearest neighbor (k-NN) is the simplest local approximator. To compute U_θ(s) at a query state s:

Find the k closest sample states to s (by Euclidean distance or any metric)
Return the average of their stored values

U_θ(s) = ¹/_k ∑_{i ∈ N_k(s)} θ_i

where N_k(s) is the set of k-nearest sample indices and θ_i = y_i is the backed-up value at sample s_i. With k=1, the approximation is piecewise constant — every query returns the value of its single nearest neighbor. As k increases, the estimate becomes smoother but less responsive to local variation.

A critical design choice: distance metric. Raw Euclidean distance treats all dimensions equally, which fails when state variables have very different scales (e.g., angle in radians vs. velocity in m/s). Always normalize state variables to comparable ranges, or use the Mahalanobis distance which accounts for scale and correlation.

python
import numpy as np

class KNNValueFunction:
    """k-Nearest Neighbor value function approximation."""
    def __init__(self, k=3):
        self.k = k
        self.states = []   # list of sample states (arrays)
        self.values = []   # list of target values (floats)

    def update(self, states, values):
        """Replace stored samples with new (state, value) pairs."""
        self.states = [np.array(s) for s in states]
        self.values = list(values)

    def __call__(self, s):
        """Query value at state s."""
        s = np.array(s)
        dists = [np.linalg.norm(s - si) for si in self.states]
        idx = np.argsort(dists)[:self.k]
        return np.mean([self.values[i] for i in idx])

# Usage in approximate value iteration
approx_U = KNNValueFunction(k=3)
sample_states = [np.array([s]) for s in np.linspace(0, 10, 20)]
for iteration in range(50):
    targets = [bellman_backup(s, approx_U) for s in sample_states]
    approx_U.update(sample_states, targets)

k=1 is like a Voronoi diagram. The state space is partitioned into Voronoi cells, one per sample. Every query in cell i returns the value at sample i. As samples become denser, the approximation improves. But in high dimensions, "nearest neighbor" becomes meaningless (all points are approximately equidistant) — the curse of dimensionality strikes again.

Efficient k-NN with k-d Trees

Naively finding the k nearest neighbors requires scanning all m samples: O(m) per query. For large m, this is too slow. The k-d tree (k-dimensional tree) is a binary tree that partitions the state space recursively, enabling nearest-neighbor queries in O(log m) average time.

Structure	Build Time	Query Time	Best When
Brute force scan	O(1)	O(m × n)	m < 100 and queries are rare
k-d tree	O(m log m)	O(log m) avg	Low dimensions (n ≤ 20), many queries
Ball tree	O(m log m)	O(log m) avg	High-d data with metric structure
Approximate NN (FAISS)	O(m)	O(1) amortized	Very high-d, speed over exact answer

In Python: scipy.spatial.cKDTree builds a k-d tree in milliseconds for m=10,000 points, and queries are 100× faster than brute force. For the approximate value iteration setting where we make thousands of queries per iteration, this speedup is critical.

python
from scipy.spatial import cKDTree
import numpy as np

class FastKNNValueFunction:
    def __init__(self, k=3):
        self.k = k; self.tree = None; self.values = None

    def update(self, states, values):
        self.tree = cKDTree(np.array(states))  # O(m log m) build
        self.values = np.array(values)

    def __call__(self, s):
        dists, idx = self.tree.query(np.array(s), k=self.k)  # O(log m)
        return np.mean(self.values[idx])

k-NN Approximation Explorer

The true function is sin(x). Sample states are dots. The k-NN approximation is piecewise constant (k=1) or piecewise averaged (k=3). Adjust k.

k (number of neighbors) 1

What is the main weakness of k-NN in high-dimensional state spaces?

Points become approximately equidistant (curse of dimensionality), so "nearest neighbor" carries little meaning k-NN always overfits regardless of k k-NN cannot handle continuous states

Chapter 3: Kernel Smoothing

Kernel smoothing is a weighted average over all (or nearby) sample states, where the weight is a decreasing function of distance. Common kernels include the Gaussian and the uniform (box) kernel:

U_θ(s) = ∑_i=1^m w(s, s_i) θ_i where w(s, s_i) = K(d(s,s_i)/h) / ∑_j K(d(s,s_j)/h)

Here K is the kernel function and h is the bandwidth — the single most important hyperparameter. Small h: tight kernel, high local sensitivity, noisy estimate. Large h: wide kernel, smooth estimate, may miss local structure.

K-NN is actually a special case of kernel smoothing with the "top-k" kernel: w=1/k for the k nearest points and 0 for the rest. The Nadaraya-Watson estimator (kernel smoothing) can be derived from a nonparametric regression model where U(s) is locally constant in a neighborhood of size h. Maximizing the weighted log-likelihood under a Gaussian noise model gives exactly the kernel smoothing formula above. This probabilistic interpretation means the bandwidth h has a natural interpretation: it is the standard deviation of the assumed "spread" in the value function, not just an arbitrary smoothing parameter.

For value function approximation specifically, a key practical issue is extrapolation. Both k-NN and kernel smoothing extrapolate poorly: they return the nearest sample's value (k-NN) or a weighted average of nearby samples (kernel) for query states outside the convex hull of the training data. This means that states that are never visited during training have poorly estimated values. For problems with bounded state spaces, this is manageable; for problems where the optimal policy may visit novel states (exploration), this creates a chicken-and-egg problem: you need good values to explore, but you need exploration to get good values. This is why on-policy sampling (Chapter 9 of the textbook) helps: it focuses value estimation where the current policy actually goes, ensuring the approximation is good where it matters.

The key difference: with a Gaussian kernel, every sample contributes to every query (though distant ones contribute very little).

Bandwidth = the smoothness dial. As h → 0, the kernel approximation approaches the sample values exactly (interpolation). As h → ∞, the approximation approaches the global mean of all sample values. The right h is problem-specific and typically set by cross-validation or domain knowledge about how fast the true value function varies.

Kernel smoothing generalizes naturally to multivariate states using a product kernel: w(s, s_i) = ∏_j K((s_j − s_ij)/h_j) where h_j is the bandwidth for dimension j. Using separate bandwidths per dimension is important when state variables have different scales or different degrees of variation. For example, in a pendulum MDP, the angle variable may need a narrow bandwidth (value changes quickly near the unstable equilibrium) while the angular velocity may need a wider bandwidth.

Choosing the Bandwidth: Cross-Validation and the Bias-Variance Tradeoff

The optimal bandwidth minimizes the mean squared error between the approximation and the true value function. This has two competing components:

Component	Effect of small h	Effect of large h
Bias	Low — the approximation can capture local variation	High — the approximation over-smooths; misses peaks
Variance	High — few points in each window; noisy estimate	Low — many points averaged; stable but biased

The bias-variance tradeoff is the fundamental problem of statistical estimation. For the value function approximation setting, we typically use leave-one-out cross-validation to select h: fit the approximation on all but one sample state and measure the prediction error at the held-out point. Average over all held-out points and choose the h that minimizes this cross-validation error.

An alternative: Silverman's rule of thumb sets h proportional to the sample standard deviation and inversely to m^1/(n+4) where n is the state dimension. This rule is optimal for Gaussian-distributed data and works well in practice even for non-Gaussian cases. For a 1D problem with standard deviation σ and m samples: h = 1.06 σ m^−1/5.

python
import numpy as np

def silverman_bandwidth(states):
    """Silverman's rule of thumb for kernel bandwidth (1D)."""
    m = len(states)
    sigma = np.std(states)
    return 1.06 * sigma * m ** (-0.2)

def gaussian_kernel_smooth(query_states, sample_states, values, h):
    """Vectorized Gaussian kernel smoothing."""
    diffs = query_states[:, None] - sample_states[None, :]  # (nq, m)
    weights = np.exp(-0.5 * (diffs / h) ** 2)                 # (nq, m)
    weights /= weights.sum(axis=1, keepdims=True) + 1e-10     # normalize
    return weights @ values                                       # (nq,)

# Usage: smooth a noisy value estimate over 30 sample states
sample_s = np.linspace(0, 10, 30)
values = np.sin(sample_s) * 5 + np.random.randn(30) * 0.5  # noisy
h = silverman_bandwidth(sample_s)
query_s = np.linspace(0, 10, 200)
smooth_v = gaussian_kernel_smooth(query_s, sample_s, values, h)

Kernel Smoothing Bandwidth Explorer

The true function is sin(2x). Adjust the bandwidth h. Too small: wiggly and noisy. Too large: over-smoothed, misses peaks and valleys.

Bandwidth h 0.40

What happens to a kernel smoothing estimator as the bandwidth h approaches infinity?

The estimate converges to the global mean of all sample values — every state gets the same average value It becomes identical to 1-NN The estimate diverges to infinity

Chapter 4: Linear Interpolation

Linear interpolation works on a grid: a regular rectangular discretization of the state space. Given a query state s between grid points, the value is computed as a weighted combination of the surrounding grid values, with weights proportional to how close s is to each grid point.

In 1D with grid points at x_l and x_r (the two neighbors of query x), the linear interpolation is:

U(x) = λ · U(x_l) + (1−λ) · U(x_r) where λ = (x_r − x) / (x_r − x_l)

In 2D, bilinear interpolation uses the four surrounding grid corners. In n dimensions, multilinear interpolation uses 2ⁿ surrounding corners. This exponential scaling is why linear interpolation is practical in low dimensions (1–4) but problematic in higher ones.

Grid interpolation vs function approximation. Linear interpolation on a grid guarantees exact recovery at grid points and smooth (continuous) interpolation between them. It requires no training phase beyond setting up the grid. The cost: memory scales as O(resolutionⁿ), and the resolution needed for accuracy also scales poorly with n. This is the same curse of dimensionality as tables, just delayed by the smoothness assumption.

Choosing a Grid Resolution

For grid-based interpolation methods, the resolution per dimension is a critical design parameter. Choosing it requires balancing approximation accuracy, memory, and computation:

Consideration	Effect of finer resolution	Effect of coarser resolution
Approximation error	Lower — grid points closer together, interpolation more accurate	Higher — grid points further apart, more interpolation error
Memory	Exponentially more — doubling resolution in n dims multiplies storage by 2ⁿ	Exponentially less
Computation (value iteration)	More grid points to backup; more total work	Fewer points; faster per iteration
Policy quality	Better approximation of U* → better policy	Coarse resolution may miss narrow peaks (e.g., near goal states)

A practical heuristic: start with a coarse 5×5 (or 5ⁿ) grid, solve approximately, inspect the value function, and double the resolution in dimensions where the function varies most rapidly. ACAS X used a non-uniform grid with finer resolution near zero relative altitude (where collision risk is highest) and coarser resolution at large separations (where the optimal action is always "do nothing").

Linear interpolation is also the foundation of the Q-table with linear interpolation method used in classic RL papers. Sutton & Barto's mountain car problem (1998) used a tile-coded linear approximation that is equivalent to piecewise linear interpolation on a non-uniform grid. The tile coding creates a grid implicitly: each "tile" is a rectangular region in state space with its own value estimate. When multiple tilings overlap, the interpolated value is a weighted sum of tiles, approximating the same multilinear formula as regular grid interpolation. The advantage: tile coding adapts naturally to the state-space geometry (tiles can be offset and rotated) while maintaining the O(1) query time of grid methods.

Despite this limitation, piecewise linear interpolation on a grid is the most widely deployed approximate value function method in safety-critical aviation. The original TCAS (Traffic Collision Avoidance System, deployed in virtually all commercial aircraft since 1993) used a lookup table. ACAS X (2020s) uses a 5D grid with piecewise linear interpolation, achieving far better performance than TCAS while maintaining hard safety guarantees. The key insight: in the aviation application, 5 state dimensions × modest resolution is manageable; the smoothness assumption holds because physics is continuous.

The multilinear interpolation formula in n dimensions:

U_θ(s) = ∑_{b ∈ {0,1}ⁿ} (∏_j=1ⁿ [b_jλ_j + (1−b_j)(1−λ_j)]) · U(corner_b)

where λ_j is the fractional offset in dimension j, and corner_b is the grid corner indexed by binary vector b. This formula has 2ⁿ terms and requires 2ⁿ table lookups per query. For n=5 that is 32 lookups; for n=10 that is 1024. Simplex interpolation replaces this with just n+1 = 6 or 11 lookups respectively.

python
import numpy as np

def linear_interp_1d(grid_x, grid_U, query_x):
    """Piecewise linear interpolation in 1D."""
    # Find the two surrounding grid points
    idx = np.searchsorted(grid_x, query_x) - 1
    idx = np.clip(idx, 0, len(grid_x) - 2)
    xl, xr = grid_x[idx], grid_x[idx + 1]
    Ul, Ur = grid_U[idx], grid_U[idx + 1]
    lam = (xr - query_x) / (xr - xl)  # weight for left point
    return lam * Ul + (1 - lam) * Ur

# Example: value function on [0, 1] with 5 grid points
grid_x = np.linspace(0, 1, 5)
grid_U = np.array([0.0, 0.8, 1.2, 0.9, 0.1])
print(linear_interp_1d(grid_x, grid_U, 0.35))  # 0.96

In 1D linear interpolation, a query point exactly on the left grid point x_l gets weight λ = 1. What value does it return?

U(x_l) exactly — λ=1 means full weight on the left grid value, recovering it exactly The average of U(x_l) and U(x_r) 0, because λ=1 means the point is at the boundary

Chapter 5: Simplex Interpolation

Simplex interpolation avoids the exponential 2ⁿ corner problem of multilinear interpolation. Instead of requiring all 2ⁿ corners of a hypercube, it decomposes each grid cell into simplices (triangles in 2D, tetrahedra in 3D, etc.) using only n+1 vertices each.

A simplex in n dimensions has n+1 vertices. Any point inside a simplex can be written as a convex combination of the vertices with non-negative barycentric coordinates λ₀, ..., λ_n that sum to 1:

s = λ₀s₀ + λ₁s₁ + ... + λ_ns_n with ∑_i λ_i = 1, λ_i ≥ 0

The interpolated value is then U_θ(s) = ∑_i λ_i U(s_i). This uses only n+1 lookups per query instead of 2ⁿ. The tradeoff: the interpolation is linear within each simplex but may have discontinuous derivatives at simplex boundaries.

Efficient simplex decomposition via sorting. The textbook describes an efficient method: sort the fractional parts of the scaled coordinates in decreasing order. This determines which simplex the query lies in, without explicitly constructing all simplices. The algorithm runs in O(n log n) per query.

Worked Numerical Example: 2D Simplex Interpolation

Suppose our 2D grid has spacing 1 in each dimension. Grid corners at (0,0), (1,0), (0,1), (1,1) with values U(0,0)=2, U(1,0)=5, U(0,1)=3, U(1,1)=8. We query at s=(0.7, 0.4).

Step 1: Fractional coordinates. Already in [0,1]: d_x=0.7, d_y=0.4.

Step 2: Determine triangle. Since d_x = 0.7 ≥ d_y = 0.4, we're in the lower triangle with vertices (0,0), (1,0), (1,1).

Step 3: Barycentric coordinates. λ₀ = 1 − d_x = 0.3, λ₁ = d_x − d_y = 0.3, λ₂ = d_y = 0.4. Verify: 0.3 + 0.3 + 0.4 = 1.0 ✓.

Step 4: Interpolate. U(0.7, 0.4) = 0.3×U(0,0) + 0.3×U(1,0) + 0.4×U(1,1) = 0.3×2 + 0.3×5 + 0.4×8 = 0.6 + 1.5 + 3.2 = 5.3.

Compare: bilinear interpolation at (0.7, 0.4) uses all 4 corners. λ_x=0.7, λ_y=0.4: U = (1−0.7)(1−0.4)×2 + 0.7(1−0.4)×5 + (1−0.7)0.4×3 + 0.7×0.4×8 = 0.36 + 2.1 + 0.36 + 2.24 = 5.06. The two methods agree closely but use different numbers of lookups (3 vs 4). In 10D: simplex uses 11 lookups vs bilinear's 1024.

Here is the full simplex interpolation algorithm for a 2D grid with state (x, y) ∈ [0,1]²:

python
import numpy as np

def simplex_interp_2d(grid_U, query_x, query_y, x_min, x_max, y_min, y_max):
    """
    Simplex interpolation in 2D.
    grid_U: 2D array of values at grid points
    query: (x, y) continuous state
    Returns interpolated value.
    """
    nx, ny = grid_U.shape
    # Map to fractional grid coordinates [0, nx-1] x [0, ny-1]
    fx = (query_x - x_min) / (x_max - x_min) * (nx - 1)
    fy = (query_y - y_min) / (y_max - y_min) * (ny - 1)
    # Integer and fractional parts
    ix, iy = int(np.clip(fx, 0, nx-2)), int(np.clip(fy, 0, ny-2))
    dx, dy = fx - ix, fy - iy  # fractional parts in [0, 1)
    # Sort fractional parts: determines which triangle we're in
    if dx >= dy:  # lower triangle: (0,0)-(1,0)-(1,1)
        lam = [1 - dx, dx - dy, dy]  # barycentric coordinates
        pts = [(ix, iy), (ix+1, iy), (ix+1, iy+1)]
    else:         # upper triangle: (0,0)-(0,1)-(1,1)
        lam = [1 - dy, dy - dx, dx]
        pts = [(ix, iy), (ix, iy+1), (ix+1, iy+1)]
    return sum(lam[k] * grid_U[pts[k]] for k in range(3))

# Only 3 lookups (n+1 = 3 for 2D) vs 4 for bilinear (2^2 = 4)

Multilinear (bilinear in 2D)

Uses 2ⁿ = 4 corners per cell
Smooth (C⁰ continuous)
Cost: 4 lookups per query in 2D
Scales as 2ⁿ — problematic in high-d

Simplex interpolation

Uses n+1 = 3 vertices per triangle
Linear within simplices (C⁰ at boundaries)
Cost: 3 lookups per query in 2D
Scales as n+1 — much better in high-d

In 5 dimensions, simplex interpolation uses ____ vertices per query, while multilinear interpolation uses ____.

6 vertices (n+1=6) vs 32 vertices (2⁵=32) 5 vs 25 10 vs 32

Chapter 6: Linear Regression

Linear regression approximation represents the value function as a linear combination of basis functions β₁(s), ..., β_p(s):

U_θ(s) = θ₁ β₁(s) + θ₂ β₂(s) + ... + θ_p β_p(s) = θ^Tβ(s)

The parameters θ are found by minimizing the mean squared error over sample states. This has a closed-form solution: the ordinary least squares estimator θ^* = (B^TB)⁻¹B^Ty, where B is the matrix with rows β(s_i)^T and y is the vector of target values y_i.

Basis function choices for value function approximation:

Polynomials: β_k(s) = s₁^a₁s₂^a₂... (monomials up to some degree)
Radial basis functions: β_k(s) = exp(−||s − c_k||²/(2σ²)) (Gaussians centered at c_k)
Fourier features: sin/cos of random projections (useful for periodic value functions)
Tile coding: binary features from overlapping tilings (a classic RL trick)

Linear in parameters, nonlinear in state. "Linear regression" means linear in θ, not necessarily in s. With polynomial or RBF features, U_θ(s) is a highly nonlinear function of s. The word "linear" describes the parametric structure: double θ and you double U_θ. This linearity is what makes the closed-form OLS solution possible.

The OLS estimator θ^* = (B^TB)⁻¹B^Ty requires matrix inversion. For p basis functions and m sample states, the matrix B^TB is p×p. Inverting it costs O(p³). For p=100 basis functions this is fast; for p=10,000 (e.g., with many tile coding tilings) it becomes expensive. In practice, use numpy.linalg.lstsq or iterative solvers (conjugate gradient) for large p. Alternatively, use stochastic gradient descent to update θ online rather than solving the full OLS every iteration — this is the basis of TD-learning with linear function approximation, analyzed by Sutton (1988) and Tsitsiklis & Van Roy (1997).

Tile coding is a classic trick from RL (Sutton & Barto) that creates binary features by overlaying multiple offset grids ("tilings") over the state space. Each tiling is like a low-resolution grid; the feature for a state is 1 in the bin it falls into on each tiling, 0 everywhere else. Using k overlapping tilings at different offsets gives k features per state. The advantage: linear regression on tile-coded features can represent nonlinear functions, while remaining fully linear in parameters.

Basis Function Family	Strengths	Weaknesses	Typical Use
Monomials (polynomial)	Interpretable; can represent smooth functions exactly	Runge phenomenon: oscillates at boundaries for high degree	Value functions with known polynomial structure
RBF (Gaussian)	Local support; no oscillation; smooth	Requires choosing centers and width σ	Unknown smooth functions; robotics control
Tile coding	Very fast to evaluate; sparse binary features	Doesn't extrapolate; fixed resolution	Classic RL: grid worlds, mountain car
Fourier features	Handles periodic structure; random sampling works	Sensitive to scale; less intuitive	Periodic or quasi-periodic value functions

Basis Functions: Polynomial vs RBF

The true function (sin wave) is approximated with polynomial (orange) and RBF (teal) bases. Adjust the number of basis functions. More RBFs capture local structure better; polynomials can oscillate (Runge's phenomenon).

Number of basis functions 4

Why can a linear regression approximation with polynomial features represent a nonlinear value function?

The features β(s) are nonlinear functions of s; the linearity is in the parameters θ, not in s Polynomials always produce linear functions Only linear value functions can be represented this way

Chapter 7: Neural Network Regression

A neural network approximator U_θ(s) passes the state through a stack of learned transformations. Each layer computes a linear combination of the previous layer's activations, followed by a nonlinear activation function:

h₁ = σ(W₁s + b₁) h₂ = σ(W₂h₁ + b₂) U_θ(s) = W_outh_L + b_out

The parameters θ = {W₁, b₁, W₂, b₂, ..., W_out, b_out} are learned by stochastic gradient descent on the MSE loss. Common activation functions: ReLU σ(x) = max(0,x), tanh, or sigmoid.

Neural networks are universal approximators: with enough width or depth, a ReLU network can approximate any continuous function to arbitrary precision. In practice, they are the default for high-dimensional state spaces — they power DQN, A3C, PPO, and virtually all modern deep RL.

For the approximate value iteration setting (offline, batch), neural networks are trained by minimizing the mean squared Bellman error over sample transitions (s, a, r, s'). The key data pipeline is: (1) sample a batch of transitions from a replay buffer, (2) compute Bellman targets y_i = r_i + γ · max_a' U_θ−(s'_i, a') using the frozen target network, (3) gradient step on ∑(U_θ(s_i, a_i) − y_i)². The target network weights θ

Architecture Design for Value Function Networks

The architecture of the value network matters enormously. Key design decisions:

Decision	Common Choice	Why
Input normalization	Standardize each state dimension to mean 0, std 1	Prevents large-magnitude dimensions from dominating gradients
Width vs depth	2–3 hidden layers of 64–256 units for simple MDPs; 4+ layers for high-d	Deeper is more expressive but harder to train; diminishing returns beyond 4 layers
Activation function	ReLU (default); ELU or SiLU for smoother value functions	ReLU trains fast; smoother activations give smoother Q-functions
Output layer	Linear (no activation) for Q-values	Q-values can be any real number, not bounded; tanh/sigmoid would artificially constrain them
Separate network per action vs shared	One network, output size = \|A\| for discrete actions	Shared network shares representations; cheaper to evaluate all actions at once
Dueling architecture (optional)	Split output into V(s) + A(s,a); Q = V + A − mean(A)	Separates state value from action advantage; faster convergence in many MDPs

For the MDP in the showcase (1D continuous, 2 actions), a minimal network is 2 hidden layers × 32 units. For Atari (image input, 18 actions), DQN used 3 convolutional layers followed by 2 fully connected layers with 512 units. For robotics with 50D state spaces, typical architectures are 3 fully connected layers with 256 units each.

Neural networks introduce instability in value iteration. When you use a neural net and update it on Bellman targets, the targets themselves change as θ changes (because U_θ appears on both sides of the Bellman equation). DQN fixes this by using a target network — a frozen copy of the network used to compute targets, updated only every few thousand steps. Without this trick, training diverges.

python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),  nn.ReLU(),
            nn.Linear(hidden, 1)            # scalar value output
        )
    def forward(self, s):
        return self.net(s).squeeze(-1)  # (batch,) scalar per state

# Approximate value iteration step
def approx_value_iter_step(net, target_net, optimizer, states, rewards, next_states, gamma=0.99):
    with torch.no_grad():
        targets = rewards + gamma * target_net(next_states)  # Bellman backup
    preds = net(states)
    loss = nn.functional.mse_loss(preds, targets)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

Why Neural Networks Are Unstable in Value Iteration

The fundamental problem: the Bellman backup computes y_i = R + γ · max_a U_θ(s'). But U_θ uses the current network weights θ. So when we take a gradient step to make U_θ(s) closer to y_i, we also change U_θ(s'), which changes y_i on the next iteration. We are chasing a moving target. This can cause oscillations or divergence.

DQN (Mnih et al., 2015, DeepMind) introduced two key fixes that made neural value functions stable for the first time:

Fix	Mechanism	Why It Helps
Replay buffer	Store all past transitions (s,a,r,s'). Train on random mini-batches from this buffer instead of the latest experience.	Breaks temporal correlations; diverse mini-batches prevent the network from forgetting earlier states
Target network	A frozen copy θ⁻ of the network, used only for computing targets y = R + γ max_a' U_θ⁻(s'). Copy θ to θ⁻ every C steps.	Targets are stable for C steps; regression is well-defined

Together these two fixes turned a theoretically unstable algorithm into a practical powerhouse. DQN achieved superhuman performance on 49 Atari games using this architecture: a CNN for image features, two hidden layers with 512 ReLU units, output of Q-values for each action, trained with replay buffer of 1M transitions and target network update every 10,000 steps.

Data Flow in Neural Approximate Value Iteration

Double DQN and Other Improvements

Since the original DQN paper (2015), several improvements have become standard. For the approximate value iteration context:

Improvement	Problem it solves	How
Double DQN (van Hasselt et al.)	Q-values are overestimated because max_a' Q(s',a') is biased upward	Use main net to select action, target net to evaluate: Q_θ⁻(s', argmax_a' Q_θ(s',a'))
Dueling DQN (Wang et al.)	Most actions in most states have similar Q-values; wasted capacity	Split Q into V(s) + A(s,a): value head and advantage head; combine as Q = V + A − mean(A)
Prioritized Experience Replay	Uniform replay wastes time on easy transitions already well-modeled	Sample transitions proportional to \|TD error\|; correct bias with importance weights
N-step returns	1-step Bellman targets have high bias; propagating value is slow	Use y = ∑_t=0ⁿ γ^tr_t + γⁿ max_a'Q(s_n,a')
Rainbow DQN (Hessel et al.)	Combines all of the above	Double + Dueling + Prioritized + N-step + NoisyNets + Distributional; state-of-the-art for Atari

All of these improvements operate on the same approximate value iteration loop from the textbook. The core insight is unchanged: sample, backup, fit. The improvements address specific failure modes of the basic loop when scaled to complex environments. For the textbook's simple MDP examples, the basic DQN loop without improvements is sufficient and often preferred for pedagogical clarity.

Concretely, the tensor shapes for a 1D continuous state problem with 2 actions:

Quantity	Shape	Meaning
States s	(batch, 1)	Batch of 1D continuous states
Q-values Q_θ(s)	(batch, 2)	One Q-value per action per state
Actions a	(batch,) int	Which action was taken
Rewards r	(batch,)	Immediate reward for (s,a)
Next states s'	(batch, 1)	States reached after action
Targets y	(batch,)	r + γ max_a' Q_θ⁻(s', a')
Loss	scalar	mean[ (Q_θ(s,a) − y)² ]

The forward pass: s → hidden layers → Q-values (batch, 2) → gather Q(s,a) at chosen action → compare with target y. Gradient flows back through gather → Q-head → hidden layers only (not through the target network, which is detached from the computation graph).

Why does DQN use a separate "target network" in addition to the main network?

The Bellman targets change as the network updates, creating instability; a frozen target network provides stable regression targets Two networks always produce better estimates than one The target network handles exploration while the main network handles exploitation

Chapter 8: Showcase — Approximate Value Iteration

Here is the full approximate value iteration loop, visualized in real time on a 1D continuous MDP. The state is a position s ∈ [0, 10]. The reward is high at the goal (s=8) and zero elsewhere. The discount is γ=0.9. There is no analytical solution — we must approximate.

You choose the approximator: k-NN, kernel smoothing, or linear regression with RBF features. The loop samples states randomly, computes Bellman backups, refits the approximator, and shows how the value function evolves. Watch the value function grow toward the goal.

From Approximate Value Function to Policy Extraction

Once we have a good approximation U_θ, extracting a policy requires evaluating the Bellman maximization at each decision point:

π(s) = argmax_{a ∈ A} [ R(s,a) + γ ∑_s' T(s'|s,a) U_θ(s') ]

For discrete action spaces (e.g., left/right/stop), this is a finite argmax — cheap. For continuous actions, this becomes a non-convex optimization problem. Common approaches:

Action Space	Policy Extraction Method	Cost per Decision
Small discrete (≤ 10 actions)	Enumerate all actions, compute each Q-value, take max	O(\|A\| × approximator cost)
Large discrete (≥ 100 actions)	Beam search or UCT tree search from current state	O(\|A\| × tree depth)
Continuous (bounded interval)	Random shooting: sample N actions, pick best	O(N × approximator cost)
Continuous (high-d)	Cross-entropy method or gradient ascent on Q_θ(s,a) in a	O(iterations × batch)

In the showcase's 1D MDP with 2 discrete actions (left/right), policy extraction is O(2). For ACAS X with 5 actions (alerts), O(5). The simplicity of discrete, small action spaces is one reason they are preferred in safety-critical applications.

What you should see. After a few iterations, a peak forms near the goal (s=8). The peak spreads backward as the discount propagates value. With fewer sample states, the approximation is noisier. With more, it smooths out. Kernel smoothing is smoothest; k-NN is most jagged; RBF regression is globally smooth.

Approximate Value Iteration: 1D Continuous MDP

Goal at s=8 (reward=10). Discount γ=0.9. Run iterations and watch the value function (teal) converge. The orange dots are current sample states. Change the approximator to see how each method fits.

Number of sample states 15

Bandwidth h (for kernel) / k (for k-NN) 1.00

0 iterations

In this showcase, why does the value function peak near s=8 but also have non-zero values at other states?

The discount factor γ propagates future value backwards: states near the goal receive high value because from there the goal is reachable in few steps The approximator is overfitting and creating spurious peaks All states have the same value

Chapter 9: Connections & What's Next

Method	Strengths	Weaknesses	Best For
k-NN	No training, exact at samples	Slow queries (O(m)), poor high-d	Low-d, small datasets
Kernel smoothing	Smooth, simple to implement	Same as k-NN; bandwidth sensitive	Low-d, smooth value functions
Linear interpolation	Fast queries, exact at grid points	Memory O(resolutionⁿ)	1–3D problems with regular grids
Simplex interpolation	n+1 lookups (vs 2ⁿ)	Discontinuous derivatives	4–6D problems with regular grids
Linear regression	Fast, closed-form solution	Must choose good basis functions	Problems with known structure
Neural networks	Universal approximator, auto-features	Instability, needs target network	High-d, complex value functions

What comes next. Chapter 9 (Online Planning) leaves the "approximate the value function" paradigm and instead plans from the current state at decision time using search trees (MCTS, rollout). Chapter 10 (Policy Search) directly optimizes the policy without computing a value function at all. The choice: approximate value functions excel when you make many decisions from the same states and can afford offline computation; online planning excels for one-off decisions in novel states; policy search excels when you only need a policy and the value function is irrelevant.

Convergence Theory for Approximate Value Iteration

Exact value iteration converges because the Bellman operator T is a γ-contraction: ||TU − TU'||_∞ ≤ γ ||U − U'||_∞. The fixed point of T is U*, and repeated application of T shrinks any error by γ per iteration.

With approximate value iteration, the error does not shrink to zero. Each iteration first applies T (contraction) then applies the approximation operator Π (projection onto the function class). The composition ΠT is no longer a contraction in general. Three cases:

Function class	Convergence guarantee	Practical behavior
Exact (tabular)	Converges to U* (geometric rate γ)	Always converges; rate determined by γ
Linear (fixed features)	Converges to a small error ball \|\|U_θ* − U*\|\| ≤ ε/(1−γ)	Usually converges; rate depends on features and γ
Neural network	No general guarantee	Usually converges in practice with target network; can oscillate
k-NN / kernel	Converges if samples are dense and h is appropriately chosen	Converges with enough samples and good bandwidth

The error bound ε/(1−γ) has two terms: ε is the best-in-class approximation error (how well the function family can represent U*), and 1/(1−γ) is the discount amplification factor. As γ → 1, even small approximation errors get amplified arbitrarily. This is why deep RL with γ=0.99 is so finicky: the 1/(1−0.99) = 100 amplification magnifies any approximation error by 100x in the value bound.

On-Policy vs Off-Policy Sampling

The choice of sample states is closely related to the concept of model bias in approximate dynamic programming. If the sample states are not representative of the states visited by the optimal policy, the approximation will be accurate in the wrong region of state space, leading to a suboptimal policy. This is a form of distribution mismatch: we minimize loss on the training distribution (sample states) but the policy deploys in a different distribution (states the optimal policy visits). The solution is iterative: use the current approximate policy to generate new sample states, refit, and repeat. This is called approximate policy iteration (API) and is the algorithmic framework underlying SARSA, on-policy actor-critic, and TRPO in modern deep RL.

The sampling distribution for sample states is a crucial design choice that the textbook addresses explicitly. Two main strategies:

Strategy	Mechanism	Pros	Cons
Uniform grid	Place sample states on a regular grid or uniformly at random	Covers the state space evenly; no policy dependence; easy to implement	Wastes samples on states rarely visited; poor coverage of high-value regions
On-policy rollouts	Generate sample states by following the current approximate policy	Concentrates samples where the policy actually goes; better approximation where it matters	Biased: misses states the policy avoids; divergence risk if policy is bad early on
Experience replay	Collect transitions from many historical policies; sample uniformly from the replay buffer	Data efficient; diverse samples; standard in DQN	Off-policy samples may not reflect current state distribution; distribution shift
Prioritized replay	Sample transitions with probability proportional to Bellman error \|U(s) − y\|	Focuses computation on states where the approximation is worst	Requires importance weights to correct for the sampling bias; more complex

For the grid-based methods (linear interpolation, simplex), a uniform grid is natural — the grid points are the sample states. For k-NN and kernel methods, on-policy sampling often works best after the first few iterations. For neural networks, experience replay (off-policy) is standard: it reduces correlation between consecutive samples and allows reuse of data, which is critical when environment interaction is expensive.

The textbook also notes an important practical consideration: the sampling distribution of states in S matters enormously. If your sample states cluster in low-value regions but the optimal policy mostly visits high-value regions, the approximation will be poor where it matters most. Common solutions: (1) follow the current approximate policy to generate samples (on-policy sampling), (2) importance-weight samples from a fixed distribution, or (3) use a uniform grid if the state space is bounded and low-dimensional. The interplay between sample distribution and approximation error is the core technical challenge of approximate dynamic programming and motivates most of the innovations in deep RL (experience replay, on-policy methods, distributional RL) that have appeared since the DQN paper.

Practical Consideration	Details
Sample density	Uniform grid covers evenly; policy rollouts focus on reachable states; both have tradeoffs
Extrapolation	Local methods (k-NN, kernel) extrapolate poorly; global methods (linear, neural) extrapolate according to their functional form
Convergence	Approximate VI does not guarantee convergence; oscillations possible; reduce learning rate or use trust regions
Feature engineering	Domain-specific features (sin/cos of angles for pendulum, distance to goal) dramatically improve data efficiency
Generalization vs overfit	Too few samples: underfit (smooth but wrong); too many samples with rich features: overfit (noisy)

Computing Approximate Bellman Backups in Continuous State Spaces

The Bellman backup y_i = max_a[R(s_i,a) + γ ∑_s' T(s'|s_i,a) U_θ(s')] requires integrating U_θ(s') over the transition distribution T. For deterministic transitions (s' = f(s,a) exactly), this is trivial: y_i = max_a[R(s_i,a) + γ U_θ(f(s_i,a))]. For stochastic transitions (s' ~ T(·|s,a)), we must approximate the expectation:

Monte Carlo: Sample K next states s'_k ~ T(·|s,a); approximate E[U(s')] ≈ (1/K) ∑_k U_θ(s'_k). Simple but requires K environment rollouts per backup.
Gaussian quadrature: If T is Gaussian, use exact quadrature points and weights. O(Kⁿ) points for n-dimensional state but exact for polynomial approximators up to degree 2K−1.
Reparameterization: Write s' = g(s,a,ε) for ε ~ N(0,I); estimate E[U(s')] ≈ U(g(s,a,E[ε])) (deterministic equivalent, exact for linear U_θ).

For the showcase's 1D MDP, the transitions are deterministic (s' = clamp(s + action, 0, 10)), so backups are exact single-step lookups. In the real-world ACAS X problem, the aircraft dynamics are stochastic (due to wind and sensor noise), and Monte Carlo backup with K=10–50 samples per state per action is the standard approach.

Which method is most appropriate when the value function is known to be approximately quadratic in the state, with no data about the global mean?

Linear regression with polynomial basis functions — it can exactly represent the quadratic structure using few parameters k-NN — it will automatically discover the quadratic shape A large neural network — it will find whatever structure exists

Real-World Deployment: ACAS X as a Case Study

The Airborne Collision Avoidance System X (ACAS X) is the premier real-world deployment of approximate value function methods. Developed at MIT Lincoln Laboratory and the FAA, it uses a 5D piecewise linear approximation over the states: relative altitude, relative bearing, intruder closure rate, own-ship altitude rate, and intruder altitude rate.

Key engineering choices and why they were made:

Decision	Choice	Reason
Approximation method	Piecewise linear (simplex interpolation) over a 5D grid	Exact at grid points; fast (11 lookups); proved safe
State space	5 continuous dimensions; 5 billion grid points	Covers all physically relevant encounter geometries
Storage	2.4 GB compressed; 45 MB compressed with quantization	Fits in onboard flight computer memory
Computation	Offline: weeks to solve; online: microseconds to query	Safety-critical systems cannot afford real-time planning
Offline solver	Exact value iteration on discretized state space	Grid is fine enough to capture all safety-critical structure
Policy form	Lookup table with piecewise linear interpolation	Certifiable: no neural net uncertainty

The key lesson: for safety-critical aviation systems, neural networks are not acceptable — regulators require deterministic behavior that can be formally verified. This does not mean neural networks are never used in aviation: they are used for perception (camera-based object detection), speech recognition, and anomaly detection in maintenance. But for the decision-making layer that determines pilot advisories, only certifiable methods (look-up tables, piecewise linear functions) are acceptable under current FAA standards. The push toward neural network verification (using formal methods like Marabou or α,β-CROWN) may change this in the future.

The grid-based piecewise linear approach is not the most accurate approximator, but it is the most certifiable one. The choice of approximation method is as much a regulatory and safety question as a technical one.

The ACAS X value table is publicly released and has been analyzed extensively by the formal verification community. Katz et al. (2017) verified properties of small ACAS X networks using the Reluplex satisfiability checker. Julian et al. (2019) showed how to compress the 5D grid into a neural network that is 1000x smaller, while Kouvaros et al. (2021) formally verified certain safety properties of those compressed networks. This research direction — compressing safe grid-based policies into neural networks and then formally verifying the neural version — represents the state of the art in bridging the certifiability gap between classical and neural approaches. The textbook's approximate value function methods are the starting point for this entire research thread.

Why not neural networks for ACAS X? Neural networks are more accurate and more scalable. But for FAA certification, you must demonstrate that the system's behavior is safe in every possible encounter geometry. With a lookup table, you can enumerate all extreme inputs and check the output. With a neural network, you cannot: there are infinitely many inputs, and adversarial examples may produce unexpected outputs. This is the core tension between ML performance and system safety that the field is actively working to resolve.

How This Chapter Connects to the Rest of the Book

Chapter 8's approximate value functions feed directly into every subsequent chapter. Chapter 9 (Online Planning) replaces offline precomputation with real-time search using rollout policies — the rollout policy is exactly the policy extracted from an approximate value function via the greedy policy improvement step. Chapter 10 (Policy Search) parameterizes π_θ(a|s) directly and optimizes θ with gradient ascent; approximate value functions appear here as critics in actor-critic methods, providing low-variance gradient estimates. Chapters 12–17 (Model-Free RL) learn U_θ or Q_θ without knowing T or R — but the Bellman operator being approximated is identical; the only difference is whether samples come from a simulator or from real environment interactions.

The critical design axis that runs through all of these chapters is the approximation-exploration tradeoff: the more compact U_θ is (fewer parameters, simpler basis), the faster it can be fit and the less data it needs — but the worse it approximates the true U* in complex regions of state space. The practitioner's job is to choose an approximator that is rich enough to capture the value structure that matters for decision-making, while being simple enough to be trained reliably from the available data. Chapter 8 gives you the vocabulary and toolkit to make this choice deliberately rather than by default.

Generalization and the Credit Assignment Problem

One of the deepest challenges in approximate value functions is credit assignment: when the agent takes action a in state s and eventually receives a reward r, how much of r is attributable to a versus to later actions? The Bellman equation solves this by definition — Q(s,a) = R(s,a) + γ max_a' Q(s',a') assigns exactly the right credit to action a by recursion. But with function approximation, each gradient update that adjusts θ to reduce the Bellman error at one (s,a) pair also changes U_θ(s') for all other states s' where the basis functions overlap. This generalization is both the power and the hazard of function approximation: a good approximator generalizes correctly (nearby states have similar values), while a bad approximator corrupts good value estimates when it over-generalizes.

Tile coding (Section 8.3) manages this tradeoff explicitly: the tiles are designed so that generalization happens only within each tile, never across tiles. This gives exact local estimates at the cost of coarse approximation at boundaries. Neural networks generalize more smoothly but without explicit control. The field of representation learning (Bengio et al., 2013; Lesort et al., 2018) studies which learned state representations support accurate value generalization — this is the same as asking what features make a good basis for U_θ.

← Chapter 7: Exact Methods Chapter 9: Online Planning →