Kochenderfer, Wheeler & Wray, Chapter 8

Approximate Value Functions

When state spaces are continuous or enormous, the exact value table from Chapter 7 is impossible. This chapter builds a toolkit of function approximators — from nearest neighbors to neural networks — that generalize from finite sample states to the entire state space.

Prerequisites: Chapter 7 (Exact Solution Methods) — MDPs, Bellman equations, value iteration, policy extraction.
10
Chapters
6+
Simulations
10
Quizzes

Chapter 0: Why Approximate?

A robot arm with 7 joints has 14 state variables (angle + velocity per joint). Discretize each into 100 bins and you need 10014 = 1028 table entries. There are roughly 1025 grains of sand on Earth. The table would fill thousands of planets. Tables don't scale to continuous or high-dimensional state spaces.

Chapter 7's exact value iteration stores one number U(s) per state s. This works for small, finite MDPs — Kochenderfer et al. use the example of a two-state aircraft collision avoidance system. But the real-world ACAS X system has a continuous state space with 8+ dimensions. Exact methods cannot be applied. The entire ACAS X project used the approximate value function methods from this chapter.

The answer is yes — and there are many ways to do it. This chapter surveys a toolkit of approximators, from the conceptually simple (nearest neighbor) to the powerful (neural networks). Each makes different tradeoffs between accuracy, scalability, and computational cost. The ACAS X system ultimately used a grid-based approach (piecewise linear) over a 5-dimensional state space with 5 billion grid points, stored as a compressed lookup table of 2.4 GB.

1. Sample
Collect a finite set of sample states S = {s1, ..., sm}
2. Backup
Compute target values yi = maxa[R(si,a) + γ∑T(s'|si,a)U(s')] at each sample
3. Fit
Update approximator Uθ so Uθ(si) ≈ yi
↻ repeat
4. Extract Policy
π(s) = argmaxa[R(s,a) + γ∑T(s'|s,a)Uθ(s')]
The approximation tradeoff. We lose the convergence guarantees of exact value iteration. In exchange, we can handle problems that exact methods cannot touch. The quality depends on (1) the function family (can it represent the true U?), (2) the sample states (do they cover important regions?), and (3) the fitting procedure (does it find good parameters?).

The textbook categorizes the full approximate value iteration loop as fitted value iteration (FVI). The word "fitted" emphasizes that each iteration involves a supervised learning step — fitting Uθ to target values — not just a Bellman update. FVI was first analyzed by Szepesvari and Smart (2004) and Gordon (1999), who established conditions under which it converges and derived error bounds. The key condition: the regression step must not increase the error — formally, the approximation operator Π must be non-expansive in the weighted norm. For least-squares regression (linear or neural net), this typically holds, but it is not guaranteed in general.

The name "fitted value iteration" was coined by Ernst, Geurts & Wehenkel (2005) in their paper "Tree-Based Batch Mode Reinforcement Learning," which used random forests as the regression oracle in the FVI loop. This paper demonstrated that any supervised regression method can be used as the fitting step — gradient boosted trees, Gaussian processes, random forests, and neural networks all fit the FVI framework. The textbook covers the gradient boosting and neural network variants most closely, but the framework is general.

Modern deep RL methods (DQN, TD3, SAC) are all variants of FVI with neural function approximation. The conceptual framework is identical to the textbook's presentation; the engineering innovations (replay buffers, target networks, gradient clipping, normalized rewards) are stabilization tricks that make FVI work in practice for complex problems.

The Bellman Backup: A Worked Example

Before diving into approximators, let's see exactly what one Bellman backup looks like at a continuous state. Suppose we have a 1D MDP with state s ∈ [0, 5], two actions {left, right}, and the following simple model:

ActionTransitionReward
rights' = min(s + 1, 5)R = 10 if s' = 5 else 0
lefts' = max(s − 1, 0)R = 0

Suppose our current value approximation says U(0)=0, U(1)=4, U(2)=8, U(3)=13, U(4)=19, U(5)=10 (the fifth state has the goal reward but no future). The Bellman backup at s=3, γ=0.9:

Unew(3) = maxa[R(3,a) + γ · U(s')]
a=right: 0 + 0.9 × U(4) = 0.9 × 19 = 17.1
a=left: 0 + 0.9 × U(2) = 0.9 × 8 = 7.2

So Unew(3) = 17.1, greedy action = right. Notice U changed from 13 to 17.1 in one backup — the value is still being propagated from the goal at s=5. After more iterations, all states will reflect the true discounted value of reaching the goal.

At a continuous query state like s=2.6, we cannot look up U(2.6) from a table. We must interpolate or approximate — that is the problem this chapter solves. The backup still computes R + γ E[U(s')], but the E[U(s')] evaluation requires the approximator.

The Curse of Dimensionality

See how the number of table entries (100 bins per dimension) explodes. Adjust the number of state dimensions. Even 6 dimensions makes a table infeasible; approximate methods are the only option.

State space dimensions 4

The approximate value iteration framework is sometimes called approximate dynamic programming (ADP) in the operations research literature. The two fields developed the same ideas independently: RL researchers focused on the online/model-free setting, while ADP researchers focused on the offline/model-based setting. The textbook bridges both, presenting approximate value iteration as a unified framework that covers both RL (when T and R are learned) and ADP (when T and R are known). Chapter 8 is the ADP treatment; Chapters 12–17 cover the RL variants where the model is unknown.

Why is tabular value iteration infeasible for a pendulum with continuous angle and angular velocity?

Chapter 1: Parametric Representations

A parametric approximation Uθ(s) represents the value function as a fixed functional form controlled by a parameter vector θ. Instead of storing one number per state, we store θ and compute Uθ(s) on demand for any query state.

The most important categorization is local vs global:

TypeHow parameters workGeneralizationExamples
Localθ = values at sample states; query computes weighted sum of nearby θiInterpolates locally near samplesk-NN, kernel, simplex
Globalθ = coefficients of basis functions; Uθ(s) = θTβ(s)Extrapolates everywhereLinear regression, neural nets

In both cases, the fitting procedure finds θ by minimizing a loss between Uθ(si) and the backed-up target values yi. The loss is typically mean squared error:

L(θ) = ∑i=1m (Uθ(si) − yi)2
Why these methods differ. Local methods (k-NN, kernel) store all sample states at prediction time — they are "lazy learners." Their computational cost scales with the sample set at query time. Global methods (linear, neural nets) store only θ at query time but pay an upfront fitting cost. For very large sample sets, global methods are far faster to query.

Fitting the Approximator: Regression on Bellman Targets

In approximate value iteration, we compute Bellman targets yi at each sample state and then fit the approximator to them. The full loop is:

  1. Initialize: Uθ(s) = 0 for all s (or some prior)
  2. Sample states: Draw m states s1, ..., sm from S (uniform, grid, or policy rollouts)
  3. Backup: yi = maxa[R(si,a) + γ ∑s' T(s'|si,a) Uθ(s')] for each i
  4. Fit: Find θ that minimizes ∑i(Uθ(si) − yi)2
  5. Repeat steps 3–4 until convergence or budget exhausted

The backup in step 3 requires evaluating Uθ(s') at the next state s', which uses the current approximation. This creates the same "moving target" instability as neural network regression (Chapter 7). With linear and k-NN approximators, the instability is less severe but still exists — there is no general proof of convergence for approximate value iteration.

One exception: If the approximation family is rich enough to contain the true value function, and the sample distribution covers S well, approximate value iteration converges to a small error ball. The formal guarantee: ||Uθ* − U*|| ≤ ε/(1−γ) where ε is the approximation error of the best-fit U* in the function class. This bound degrades with discount γ → 1 and improves as ε → 0.

Local vs Global: Which Fits the True Function?

The true value function (a sine-like curve) is sampled at dots. A local method (nearest neighbor) and a global method (linear fit) are shown. Add sample points by clicking the canvas.

3 sample points
A k-nearest-neighbor approximator stores the value function as what?

Chapter 2: Nearest Neighbor

Nearest neighbor (k-NN) is the simplest local approximator. To compute Uθ(s) at a query state s:

  1. Find the k closest sample states to s (by Euclidean distance or any metric)
  2. Return the average of their stored values
Uθ(s) = 1/ki ∈ Nk(s) θi

where Nk(s) is the set of k-nearest sample indices and θi = yi is the backed-up value at sample si. With k=1, the approximation is piecewise constant — every query returns the value of its single nearest neighbor. As k increases, the estimate becomes smoother but less responsive to local variation.

A critical design choice: distance metric. Raw Euclidean distance treats all dimensions equally, which fails when state variables have very different scales (e.g., angle in radians vs. velocity in m/s). Always normalize state variables to comparable ranges, or use the Mahalanobis distance which accounts for scale and correlation.

python
import numpy as np

class KNNValueFunction:
    """k-Nearest Neighbor value function approximation."""
    def __init__(self, k=3):
        self.k = k
        self.states = []   # list of sample states (arrays)
        self.values = []   # list of target values (floats)

    def update(self, states, values):
        """Replace stored samples with new (state, value) pairs."""
        self.states = [np.array(s) for s in states]
        self.values = list(values)

    def __call__(self, s):
        """Query value at state s."""
        s = np.array(s)
        dists = [np.linalg.norm(s - si) for si in self.states]
        idx = np.argsort(dists)[:self.k]
        return np.mean([self.values[i] for i in idx])

# Usage in approximate value iteration
approx_U = KNNValueFunction(k=3)
sample_states = [np.array([s]) for s in np.linspace(0, 10, 20)]
for iteration in range(50):
    targets = [bellman_backup(s, approx_U) for s in sample_states]
    approx_U.update(sample_states, targets)
k=1 is like a Voronoi diagram. The state space is partitioned into Voronoi cells, one per sample. Every query in cell i returns the value at sample i. As samples become denser, the approximation improves. But in high dimensions, "nearest neighbor" becomes meaningless (all points are approximately equidistant) — the curse of dimensionality strikes again.

Efficient k-NN with k-d Trees

Naively finding the k nearest neighbors requires scanning all m samples: O(m) per query. For large m, this is too slow. The k-d tree (k-dimensional tree) is a binary tree that partitions the state space recursively, enabling nearest-neighbor queries in O(log m) average time.

StructureBuild TimeQuery TimeBest When
Brute force scanO(1)O(m × n)m < 100 and queries are rare
k-d treeO(m log m)O(log m) avgLow dimensions (n ≤ 20), many queries
Ball treeO(m log m)O(log m) avgHigh-d data with metric structure
Approximate NN (FAISS)O(m)O(1) amortizedVery high-d, speed over exact answer

In Python: scipy.spatial.cKDTree builds a k-d tree in milliseconds for m=10,000 points, and queries are 100× faster than brute force. For the approximate value iteration setting where we make thousands of queries per iteration, this speedup is critical.

python
from scipy.spatial import cKDTree
import numpy as np

class FastKNNValueFunction:
    def __init__(self, k=3):
        self.k = k; self.tree = None; self.values = None

    def update(self, states, values):
        self.tree = cKDTree(np.array(states))  # O(m log m) build
        self.values = np.array(values)

    def __call__(self, s):
        dists, idx = self.tree.query(np.array(s), k=self.k)  # O(log m)
        return np.mean(self.values[idx])
k-NN Approximation Explorer

The true function is sin(x). Sample states are dots. The k-NN approximation is piecewise constant (k=1) or piecewise averaged (k=3). Adjust k.

k (number of neighbors) 1
What is the main weakness of k-NN in high-dimensional state spaces?

Chapter 3: Kernel Smoothing

Kernel smoothing is a weighted average over all (or nearby) sample states, where the weight is a decreasing function of distance. Common kernels include the Gaussian and the uniform (box) kernel:

Uθ(s) = ∑i=1m w(s, si) θi    where    w(s, si) = K(d(s,si)/h) / ∑j K(d(s,sj)/h)

Here K is the kernel function and h is the bandwidth — the single most important hyperparameter. Small h: tight kernel, high local sensitivity, noisy estimate. Large h: wide kernel, smooth estimate, may miss local structure.

K-NN is actually a special case of kernel smoothing with the "top-k" kernel: w=1/k for the k nearest points and 0 for the rest. The Nadaraya-Watson estimator (kernel smoothing) can be derived from a nonparametric regression model where U(s) is locally constant in a neighborhood of size h. Maximizing the weighted log-likelihood under a Gaussian noise model gives exactly the kernel smoothing formula above. This probabilistic interpretation means the bandwidth h has a natural interpretation: it is the standard deviation of the assumed "spread" in the value function, not just an arbitrary smoothing parameter.

For value function approximation specifically, a key practical issue is extrapolation. Both k-NN and kernel smoothing extrapolate poorly: they return the nearest sample's value (k-NN) or a weighted average of nearby samples (kernel) for query states outside the convex hull of the training data. This means that states that are never visited during training have poorly estimated values. For problems with bounded state spaces, this is manageable; for problems where the optimal policy may visit novel states (exploration), this creates a chicken-and-egg problem: you need good values to explore, but you need exploration to get good values. This is why on-policy sampling (Chapter 9 of the textbook) helps: it focuses value estimation where the current policy actually goes, ensuring the approximation is good where it matters.

The key difference: with a Gaussian kernel, every sample contributes to every query (though distant ones contribute very little).

Bandwidth = the smoothness dial. As h → 0, the kernel approximation approaches the sample values exactly (interpolation). As h → ∞, the approximation approaches the global mean of all sample values. The right h is problem-specific and typically set by cross-validation or domain knowledge about how fast the true value function varies.

Kernel smoothing generalizes naturally to multivariate states using a product kernel: w(s, si) = ∏j K((sj − sij)/hj) where hj is the bandwidth for dimension j. Using separate bandwidths per dimension is important when state variables have different scales or different degrees of variation. For example, in a pendulum MDP, the angle variable may need a narrow bandwidth (value changes quickly near the unstable equilibrium) while the angular velocity may need a wider bandwidth.

Choosing the Bandwidth: Cross-Validation and the Bias-Variance Tradeoff

The optimal bandwidth minimizes the mean squared error between the approximation and the true value function. This has two competing components:

ComponentEffect of small hEffect of large h
BiasLow — the approximation can capture local variationHigh — the approximation over-smooths; misses peaks
VarianceHigh — few points in each window; noisy estimateLow — many points averaged; stable but biased

The bias-variance tradeoff is the fundamental problem of statistical estimation. For the value function approximation setting, we typically use leave-one-out cross-validation to select h: fit the approximation on all but one sample state and measure the prediction error at the held-out point. Average over all held-out points and choose the h that minimizes this cross-validation error.

An alternative: Silverman's rule of thumb sets h proportional to the sample standard deviation and inversely to m1/(n+4) where n is the state dimension. This rule is optimal for Gaussian-distributed data and works well in practice even for non-Gaussian cases. For a 1D problem with standard deviation σ and m samples: h = 1.06 σ m−1/5.

python
import numpy as np

def silverman_bandwidth(states):
    """Silverman's rule of thumb for kernel bandwidth (1D)."""
    m = len(states)
    sigma = np.std(states)
    return 1.06 * sigma * m ** (-0.2)

def gaussian_kernel_smooth(query_states, sample_states, values, h):
    """Vectorized Gaussian kernel smoothing."""
    diffs = query_states[:, None] - sample_states[None, :]  # (nq, m)
    weights = np.exp(-0.5 * (diffs / h) ** 2)                 # (nq, m)
    weights /= weights.sum(axis=1, keepdims=True) + 1e-10     # normalize
    return weights @ values                                       # (nq,)

# Usage: smooth a noisy value estimate over 30 sample states
sample_s = np.linspace(0, 10, 30)
values = np.sin(sample_s) * 5 + np.random.randn(30) * 0.5  # noisy
h = silverman_bandwidth(sample_s)
query_s = np.linspace(0, 10, 200)
smooth_v = gaussian_kernel_smooth(query_s, sample_s, values, h)
Kernel Smoothing Bandwidth Explorer

The true function is sin(2x). Adjust the bandwidth h. Too small: wiggly and noisy. Too large: over-smoothed, misses peaks and valleys.

Bandwidth h 0.40
What happens to a kernel smoothing estimator as the bandwidth h approaches infinity?

Chapter 4: Linear Interpolation

Linear interpolation works on a grid: a regular rectangular discretization of the state space. Given a query state s between grid points, the value is computed as a weighted combination of the surrounding grid values, with weights proportional to how close s is to each grid point.

In 1D with grid points at xl and xr (the two neighbors of query x), the linear interpolation is:

U(x) = λ · U(xl) + (1−λ) · U(xr)    where    λ = (xr − x) / (xr − xl)

In 2D, bilinear interpolation uses the four surrounding grid corners. In n dimensions, multilinear interpolation uses 2n surrounding corners. This exponential scaling is why linear interpolation is practical in low dimensions (1–4) but problematic in higher ones.

Grid interpolation vs function approximation. Linear interpolation on a grid guarantees exact recovery at grid points and smooth (continuous) interpolation between them. It requires no training phase beyond setting up the grid. The cost: memory scales as O(resolutionn), and the resolution needed for accuracy also scales poorly with n. This is the same curse of dimensionality as tables, just delayed by the smoothness assumption.

Choosing a Grid Resolution

For grid-based interpolation methods, the resolution per dimension is a critical design parameter. Choosing it requires balancing approximation accuracy, memory, and computation:

ConsiderationEffect of finer resolutionEffect of coarser resolution
Approximation errorLower — grid points closer together, interpolation more accurateHigher — grid points further apart, more interpolation error
MemoryExponentially more — doubling resolution in n dims multiplies storage by 2nExponentially less
Computation (value iteration)More grid points to backup; more total workFewer points; faster per iteration
Policy qualityBetter approximation of U* → better policyCoarse resolution may miss narrow peaks (e.g., near goal states)

A practical heuristic: start with a coarse 5×5 (or 5n) grid, solve approximately, inspect the value function, and double the resolution in dimensions where the function varies most rapidly. ACAS X used a non-uniform grid with finer resolution near zero relative altitude (where collision risk is highest) and coarser resolution at large separations (where the optimal action is always "do nothing").

Linear interpolation is also the foundation of the Q-table with linear interpolation method used in classic RL papers. Sutton & Barto's mountain car problem (1998) used a tile-coded linear approximation that is equivalent to piecewise linear interpolation on a non-uniform grid. The tile coding creates a grid implicitly: each "tile" is a rectangular region in state space with its own value estimate. When multiple tilings overlap, the interpolated value is a weighted sum of tiles, approximating the same multilinear formula as regular grid interpolation. The advantage: tile coding adapts naturally to the state-space geometry (tiles can be offset and rotated) while maintaining the O(1) query time of grid methods.

Despite this limitation, piecewise linear interpolation on a grid is the most widely deployed approximate value function method in safety-critical aviation. The original TCAS (Traffic Collision Avoidance System, deployed in virtually all commercial aircraft since 1993) used a lookup table. ACAS X (2020s) uses a 5D grid with piecewise linear interpolation, achieving far better performance than TCAS while maintaining hard safety guarantees. The key insight: in the aviation application, 5 state dimensions × modest resolution is manageable; the smoothness assumption holds because physics is continuous.

The multilinear interpolation formula in n dimensions:

Uθ(s) = ∑b ∈ {0,1}n (∏j=1n [bjλj + (1−bj)(1−λj)]) · U(cornerb)

where λj is the fractional offset in dimension j, and cornerb is the grid corner indexed by binary vector b. This formula has 2n terms and requires 2n table lookups per query. For n=5 that is 32 lookups; for n=10 that is 1024. Simplex interpolation replaces this with just n+1 = 6 or 11 lookups respectively.

python
import numpy as np

def linear_interp_1d(grid_x, grid_U, query_x):
    """Piecewise linear interpolation in 1D."""
    # Find the two surrounding grid points
    idx = np.searchsorted(grid_x, query_x) - 1
    idx = np.clip(idx, 0, len(grid_x) - 2)
    xl, xr = grid_x[idx], grid_x[idx + 1]
    Ul, Ur = grid_U[idx], grid_U[idx + 1]
    lam = (xr - query_x) / (xr - xl)  # weight for left point
    return lam * Ul + (1 - lam) * Ur

# Example: value function on [0, 1] with 5 grid points
grid_x = np.linspace(0, 1, 5)
grid_U = np.array([0.0, 0.8, 1.2, 0.9, 0.1])
print(linear_interp_1d(grid_x, grid_U, 0.35))  # 0.96
In 1D linear interpolation, a query point exactly on the left grid point x_l gets weight λ = 1. What value does it return?

Chapter 5: Simplex Interpolation

Simplex interpolation avoids the exponential 2n corner problem of multilinear interpolation. Instead of requiring all 2n corners of a hypercube, it decomposes each grid cell into simplices (triangles in 2D, tetrahedra in 3D, etc.) using only n+1 vertices each.

A simplex in n dimensions has n+1 vertices. Any point inside a simplex can be written as a convex combination of the vertices with non-negative barycentric coordinates λ0, ..., λn that sum to 1:

s = λ0s0 + λ1s1 + ... + λnsn    with    ∑i λi = 1, λi ≥ 0

The interpolated value is then Uθ(s) = ∑i λi U(si). This uses only n+1 lookups per query instead of 2n. The tradeoff: the interpolation is linear within each simplex but may have discontinuous derivatives at simplex boundaries.

Efficient simplex decomposition via sorting. The textbook describes an efficient method: sort the fractional parts of the scaled coordinates in decreasing order. This determines which simplex the query lies in, without explicitly constructing all simplices. The algorithm runs in O(n log n) per query.

Worked Numerical Example: 2D Simplex Interpolation

Suppose our 2D grid has spacing 1 in each dimension. Grid corners at (0,0), (1,0), (0,1), (1,1) with values U(0,0)=2, U(1,0)=5, U(0,1)=3, U(1,1)=8. We query at s=(0.7, 0.4).

Step 1: Fractional coordinates. Already in [0,1]: dx=0.7, dy=0.4.

Step 2: Determine triangle. Since dx = 0.7 ≥ dy = 0.4, we're in the lower triangle with vertices (0,0), (1,0), (1,1).

Step 3: Barycentric coordinates. λ0 = 1 − dx = 0.3, λ1 = dx − dy = 0.3, λ2 = dy = 0.4. Verify: 0.3 + 0.3 + 0.4 = 1.0 ✓.

Step 4: Interpolate. U(0.7, 0.4) = 0.3×U(0,0) + 0.3×U(1,0) + 0.4×U(1,1) = 0.3×2 + 0.3×5 + 0.4×8 = 0.6 + 1.5 + 3.2 = 5.3.

Compare: bilinear interpolation at (0.7, 0.4) uses all 4 corners. λx=0.7, λy=0.4: U = (1−0.7)(1−0.4)×2 + 0.7(1−0.4)×5 + (1−0.7)0.4×3 + 0.7×0.4×8 = 0.36 + 2.1 + 0.36 + 2.24 = 5.06. The two methods agree closely but use different numbers of lookups (3 vs 4). In 10D: simplex uses 11 lookups vs bilinear's 1024.

Here is the full simplex interpolation algorithm for a 2D grid with state (x, y) ∈ [0,1]2:

python
import numpy as np

def simplex_interp_2d(grid_U, query_x, query_y, x_min, x_max, y_min, y_max):
    """
    Simplex interpolation in 2D.
    grid_U: 2D array of values at grid points
    query: (x, y) continuous state
    Returns interpolated value.
    """
    nx, ny = grid_U.shape
    # Map to fractional grid coordinates [0, nx-1] x [0, ny-1]
    fx = (query_x - x_min) / (x_max - x_min) * (nx - 1)
    fy = (query_y - y_min) / (y_max - y_min) * (ny - 1)
    # Integer and fractional parts
    ix, iy = int(np.clip(fx, 0, nx-2)), int(np.clip(fy, 0, ny-2))
    dx, dy = fx - ix, fy - iy  # fractional parts in [0, 1)
    # Sort fractional parts: determines which triangle we're in
    if dx >= dy:  # lower triangle: (0,0)-(1,0)-(1,1)
        lam = [1 - dx, dx - dy, dy]  # barycentric coordinates
        pts = [(ix, iy), (ix+1, iy), (ix+1, iy+1)]
    else:         # upper triangle: (0,0)-(0,1)-(1,1)
        lam = [1 - dy, dy - dx, dx]
        pts = [(ix, iy), (ix, iy+1), (ix+1, iy+1)]
    return sum(lam[k] * grid_U[pts[k]] for k in range(3))

# Only 3 lookups (n+1 = 3 for 2D) vs 4 for bilinear (2^2 = 4)

Multilinear (bilinear in 2D)

  • Uses 2n = 4 corners per cell
  • Smooth (C0 continuous)
  • Cost: 4 lookups per query in 2D
  • Scales as 2n — problematic in high-d

Simplex interpolation

  • Uses n+1 = 3 vertices per triangle
  • Linear within simplices (C0 at boundaries)
  • Cost: 3 lookups per query in 2D
  • Scales as n+1 — much better in high-d
In 5 dimensions, simplex interpolation uses ____ vertices per query, while multilinear interpolation uses ____.

Chapter 6: Linear Regression

Linear regression approximation represents the value function as a linear combination of basis functions β1(s), ..., βp(s):

Uθ(s) = θ1 β1(s) + θ2 β2(s) + ... + θp βp(s) = θTβ(s)

The parameters θ are found by minimizing the mean squared error over sample states. This has a closed-form solution: the ordinary least squares estimator θ* = (BTB)−1BTy, where B is the matrix with rows β(si)T and y is the vector of target values yi.

Basis function choices for value function approximation:

Linear in parameters, nonlinear in state. "Linear regression" means linear in θ, not necessarily in s. With polynomial or RBF features, Uθ(s) is a highly nonlinear function of s. The word "linear" describes the parametric structure: double θ and you double Uθ. This linearity is what makes the closed-form OLS solution possible.

The OLS estimator θ* = (BTB)−1BTy requires matrix inversion. For p basis functions and m sample states, the matrix BTB is p×p. Inverting it costs O(p3). For p=100 basis functions this is fast; for p=10,000 (e.g., with many tile coding tilings) it becomes expensive. In practice, use numpy.linalg.lstsq or iterative solvers (conjugate gradient) for large p. Alternatively, use stochastic gradient descent to update θ online rather than solving the full OLS every iteration — this is the basis of TD-learning with linear function approximation, analyzed by Sutton (1988) and Tsitsiklis & Van Roy (1997).

Tile coding is a classic trick from RL (Sutton & Barto) that creates binary features by overlaying multiple offset grids ("tilings") over the state space. Each tiling is like a low-resolution grid; the feature for a state is 1 in the bin it falls into on each tiling, 0 everywhere else. Using k overlapping tilings at different offsets gives k features per state. The advantage: linear regression on tile-coded features can represent nonlinear functions, while remaining fully linear in parameters.

Basis Function FamilyStrengthsWeaknessesTypical Use
Monomials (polynomial)Interpretable; can represent smooth functions exactlyRunge phenomenon: oscillates at boundaries for high degreeValue functions with known polynomial structure
RBF (Gaussian)Local support; no oscillation; smoothRequires choosing centers and width σUnknown smooth functions; robotics control
Tile codingVery fast to evaluate; sparse binary featuresDoesn't extrapolate; fixed resolutionClassic RL: grid worlds, mountain car
Fourier featuresHandles periodic structure; random sampling worksSensitive to scale; less intuitivePeriodic or quasi-periodic value functions
Basis Functions: Polynomial vs RBF

The true function (sin wave) is approximated with polynomial (orange) and RBF (teal) bases. Adjust the number of basis functions. More RBFs capture local structure better; polynomials can oscillate (Runge's phenomenon).

Number of basis functions 4
Why can a linear regression approximation with polynomial features represent a nonlinear value function?

Chapter 7: Neural Network Regression

A neural network approximator Uθ(s) passes the state through a stack of learned transformations. Each layer computes a linear combination of the previous layer's activations, followed by a nonlinear activation function:

h1 = σ(W1s + b1)    h2 = σ(W2h1 + b2)    Uθ(s) = WouthL + bout

The parameters θ = {W1, b1, W2, b2, ..., Wout, bout} are learned by stochastic gradient descent on the MSE loss. Common activation functions: ReLU σ(x) = max(0,x), tanh, or sigmoid.

Neural networks are universal approximators: with enough width or depth, a ReLU network can approximate any continuous function to arbitrary precision. In practice, they are the default for high-dimensional state spaces — they power DQN, A3C, PPO, and virtually all modern deep RL.

For the approximate value iteration setting (offline, batch), neural networks are trained by minimizing the mean squared Bellman error over sample transitions (s, a, r, s'). The key data pipeline is: (1) sample a batch of transitions from a replay buffer, (2) compute Bellman targets yi = ri + γ · maxa' Uθ−(s'i, a') using the frozen target network, (3) gradient step on ∑(Uθ(si, ai) − yi)2. The target network weights θ

Architecture Design for Value Function Networks

The architecture of the value network matters enormously. Key design decisions:

DecisionCommon ChoiceWhy
Input normalizationStandardize each state dimension to mean 0, std 1Prevents large-magnitude dimensions from dominating gradients
Width vs depth2–3 hidden layers of 64–256 units for simple MDPs; 4+ layers for high-dDeeper is more expressive but harder to train; diminishing returns beyond 4 layers
Activation functionReLU (default); ELU or SiLU for smoother value functionsReLU trains fast; smoother activations give smoother Q-functions
Output layerLinear (no activation) for Q-valuesQ-values can be any real number, not bounded; tanh/sigmoid would artificially constrain them
Separate network per action vs sharedOne network, output size = |A| for discrete actionsShared network shares representations; cheaper to evaluate all actions at once
Dueling architecture (optional)Split output into V(s) + A(s,a); Q = V + A − mean(A)Separates state value from action advantage; faster convergence in many MDPs

For the MDP in the showcase (1D continuous, 2 actions), a minimal network is 2 hidden layers × 32 units. For Atari (image input, 18 actions), DQN used 3 convolutional layers followed by 2 fully connected layers with 512 units. For robotics with 50D state spaces, typical architectures are 3 fully connected layers with 256 units each.

Neural networks introduce instability in value iteration. When you use a neural net and update it on Bellman targets, the targets themselves change as θ changes (because Uθ appears on both sides of the Bellman equation). DQN fixes this by using a target network — a frozen copy of the network used to compute targets, updated only every few thousand steps. Without this trick, training diverges.
python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),  nn.ReLU(),
            nn.Linear(hidden, 1)            # scalar value output
        )
    def forward(self, s):
        return self.net(s).squeeze(-1)  # (batch,) scalar per state

# Approximate value iteration step
def approx_value_iter_step(net, target_net, optimizer, states, rewards, next_states, gamma=0.99):
    with torch.no_grad():
        targets = rewards + gamma * target_net(next_states)  # Bellman backup
    preds = net(states)
    loss = nn.functional.mse_loss(preds, targets)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

Why Neural Networks Are Unstable in Value Iteration

The fundamental problem: the Bellman backup computes yi = R + γ · maxa Uθ(s'). But Uθ uses the current network weights θ. So when we take a gradient step to make Uθ(s) closer to yi, we also change Uθ(s'), which changes yi on the next iteration. We are chasing a moving target. This can cause oscillations or divergence.

DQN (Mnih et al., 2015, DeepMind) introduced two key fixes that made neural value functions stable for the first time:

FixMechanismWhy It Helps
Replay bufferStore all past transitions (s,a,r,s'). Train on random mini-batches from this buffer instead of the latest experience.Breaks temporal correlations; diverse mini-batches prevent the network from forgetting earlier states
Target networkA frozen copy θ of the network, used only for computing targets y = R + γ maxa' Uθ(s'). Copy θ to θ every C steps.Targets are stable for C steps; regression is well-defined

Together these two fixes turned a theoretically unstable algorithm into a practical powerhouse. DQN achieved superhuman performance on 49 Atari games using this architecture: a CNN for image features, two hidden layers with 512 ReLU units, output of Q-values for each action, trained with replay buffer of 1M transitions and target network update every 10,000 steps.

Data Flow in Neural Approximate Value Iteration

Double DQN and Other Improvements

Since the original DQN paper (2015), several improvements have become standard. For the approximate value iteration context:

ImprovementProblem it solvesHow
Double DQN (van Hasselt et al.)Q-values are overestimated because maxa' Q(s',a') is biased upwardUse main net to select action, target net to evaluate: Qθ(s', argmaxa' Qθ(s',a'))
Dueling DQN (Wang et al.)Most actions in most states have similar Q-values; wasted capacitySplit Q into V(s) + A(s,a): value head and advantage head; combine as Q = V + A − mean(A)
Prioritized Experience ReplayUniform replay wastes time on easy transitions already well-modeledSample transitions proportional to |TD error|; correct bias with importance weights
N-step returns1-step Bellman targets have high bias; propagating value is slowUse y = ∑t=0n γtrt + γn maxa'Q(sn,a')
Rainbow DQN (Hessel et al.)Combines all of the aboveDouble + Dueling + Prioritized + N-step + NoisyNets + Distributional; state-of-the-art for Atari

All of these improvements operate on the same approximate value iteration loop from the textbook. The core insight is unchanged: sample, backup, fit. The improvements address specific failure modes of the basic loop when scaled to complex environments. For the textbook's simple MDP examples, the basic DQN loop without improvements is sufficient and often preferred for pedagogical clarity.

Concretely, the tensor shapes for a 1D continuous state problem with 2 actions:

QuantityShapeMeaning
States s(batch, 1)Batch of 1D continuous states
Q-values Qθ(s)(batch, 2)One Q-value per action per state
Actions a(batch,) intWhich action was taken
Rewards r(batch,)Immediate reward for (s,a)
Next states s'(batch, 1)States reached after action
Targets y(batch,)r + γ maxa' Qθ(s', a')
Lossscalarmean[ (Qθ(s,a) − y)2 ]

The forward pass: s → hidden layers → Q-values (batch, 2) → gather Q(s,a) at chosen action → compare with target y. Gradient flows back through gather → Q-head → hidden layers only (not through the target network, which is detached from the computation graph).

Why does DQN use a separate "target network" in addition to the main network?

Chapter 8: Showcase — Approximate Value Iteration

Here is the full approximate value iteration loop, visualized in real time on a 1D continuous MDP. The state is a position s ∈ [0, 10]. The reward is high at the goal (s=8) and zero elsewhere. The discount is γ=0.9. There is no analytical solution — we must approximate.

You choose the approximator: k-NN, kernel smoothing, or linear regression with RBF features. The loop samples states randomly, computes Bellman backups, refits the approximator, and shows how the value function evolves. Watch the value function grow toward the goal.

From Approximate Value Function to Policy Extraction

Once we have a good approximation Uθ, extracting a policy requires evaluating the Bellman maximization at each decision point:

π(s) = argmaxa ∈ A [ R(s,a) + γ ∑s' T(s'|s,a) Uθ(s') ]

For discrete action spaces (e.g., left/right/stop), this is a finite argmax — cheap. For continuous actions, this becomes a non-convex optimization problem. Common approaches:

Action SpacePolicy Extraction MethodCost per Decision
Small discrete (≤ 10 actions)Enumerate all actions, compute each Q-value, take maxO(|A| × approximator cost)
Large discrete (≥ 100 actions)Beam search or UCT tree search from current stateO(|A| × tree depth)
Continuous (bounded interval)Random shooting: sample N actions, pick bestO(N × approximator cost)
Continuous (high-d)Cross-entropy method or gradient ascent on Qθ(s,a) in aO(iterations × batch)

In the showcase's 1D MDP with 2 discrete actions (left/right), policy extraction is O(2). For ACAS X with 5 actions (alerts), O(5). The simplicity of discrete, small action spaces is one reason they are preferred in safety-critical applications.

What you should see. After a few iterations, a peak forms near the goal (s=8). The peak spreads backward as the discount propagates value. With fewer sample states, the approximation is noisier. With more, it smooths out. Kernel smoothing is smoothest; k-NN is most jagged; RBF regression is globally smooth.
Approximate Value Iteration: 1D Continuous MDP

Goal at s=8 (reward=10). Discount γ=0.9. Run iterations and watch the value function (teal) converge. The orange dots are current sample states. Change the approximator to see how each method fits.

Number of sample states 15
Bandwidth h (for kernel) / k (for k-NN) 1.00
0 iterations
In this showcase, why does the value function peak near s=8 but also have non-zero values at other states?

Chapter 9: Connections & What's Next

MethodStrengthsWeaknessesBest For
k-NNNo training, exact at samplesSlow queries (O(m)), poor high-dLow-d, small datasets
Kernel smoothingSmooth, simple to implementSame as k-NN; bandwidth sensitiveLow-d, smooth value functions
Linear interpolationFast queries, exact at grid pointsMemory O(resolutionn)1–3D problems with regular grids
Simplex interpolationn+1 lookups (vs 2n)Discontinuous derivatives4–6D problems with regular grids
Linear regressionFast, closed-form solutionMust choose good basis functionsProblems with known structure
Neural networksUniversal approximator, auto-featuresInstability, needs target networkHigh-d, complex value functions
What comes next. Chapter 9 (Online Planning) leaves the "approximate the value function" paradigm and instead plans from the current state at decision time using search trees (MCTS, rollout). Chapter 10 (Policy Search) directly optimizes the policy without computing a value function at all. The choice: approximate value functions excel when you make many decisions from the same states and can afford offline computation; online planning excels for one-off decisions in novel states; policy search excels when you only need a policy and the value function is irrelevant.

Convergence Theory for Approximate Value Iteration

Exact value iteration converges because the Bellman operator T is a γ-contraction: ||TU − TU'|| ≤ γ ||U − U'||. The fixed point of T is U*, and repeated application of T shrinks any error by γ per iteration.

With approximate value iteration, the error does not shrink to zero. Each iteration first applies T (contraction) then applies the approximation operator Π (projection onto the function class). The composition ΠT is no longer a contraction in general. Three cases:

Function classConvergence guaranteePractical behavior
Exact (tabular)Converges to U* (geometric rate γ)Always converges; rate determined by γ
Linear (fixed features)Converges to a small error ball ||Uθ* − U*|| ≤ ε/(1−γ)Usually converges; rate depends on features and γ
Neural networkNo general guaranteeUsually converges in practice with target network; can oscillate
k-NN / kernelConverges if samples are dense and h is appropriately chosenConverges with enough samples and good bandwidth

The error bound ε/(1−γ) has two terms: ε is the best-in-class approximation error (how well the function family can represent U*), and 1/(1−γ) is the discount amplification factor. As γ → 1, even small approximation errors get amplified arbitrarily. This is why deep RL with γ=0.99 is so finicky: the 1/(1−0.99) = 100 amplification magnifies any approximation error by 100x in the value bound.

On-Policy vs Off-Policy Sampling

The choice of sample states is closely related to the concept of model bias in approximate dynamic programming. If the sample states are not representative of the states visited by the optimal policy, the approximation will be accurate in the wrong region of state space, leading to a suboptimal policy. This is a form of distribution mismatch: we minimize loss on the training distribution (sample states) but the policy deploys in a different distribution (states the optimal policy visits). The solution is iterative: use the current approximate policy to generate new sample states, refit, and repeat. This is called approximate policy iteration (API) and is the algorithmic framework underlying SARSA, on-policy actor-critic, and TRPO in modern deep RL.

The sampling distribution for sample states is a crucial design choice that the textbook addresses explicitly. Two main strategies:

StrategyMechanismProsCons
Uniform gridPlace sample states on a regular grid or uniformly at randomCovers the state space evenly; no policy dependence; easy to implementWastes samples on states rarely visited; poor coverage of high-value regions
On-policy rolloutsGenerate sample states by following the current approximate policyConcentrates samples where the policy actually goes; better approximation where it mattersBiased: misses states the policy avoids; divergence risk if policy is bad early on
Experience replayCollect transitions from many historical policies; sample uniformly from the replay bufferData efficient; diverse samples; standard in DQNOff-policy samples may not reflect current state distribution; distribution shift
Prioritized replaySample transitions with probability proportional to Bellman error |U(s) − y|Focuses computation on states where the approximation is worstRequires importance weights to correct for the sampling bias; more complex

For the grid-based methods (linear interpolation, simplex), a uniform grid is natural — the grid points are the sample states. For k-NN and kernel methods, on-policy sampling often works best after the first few iterations. For neural networks, experience replay (off-policy) is standard: it reduces correlation between consecutive samples and allows reuse of data, which is critical when environment interaction is expensive.

The textbook also notes an important practical consideration: the sampling distribution of states in S matters enormously. If your sample states cluster in low-value regions but the optimal policy mostly visits high-value regions, the approximation will be poor where it matters most. Common solutions: (1) follow the current approximate policy to generate samples (on-policy sampling), (2) importance-weight samples from a fixed distribution, or (3) use a uniform grid if the state space is bounded and low-dimensional. The interplay between sample distribution and approximation error is the core technical challenge of approximate dynamic programming and motivates most of the innovations in deep RL (experience replay, on-policy methods, distributional RL) that have appeared since the DQN paper.

Practical ConsiderationDetails
Sample densityUniform grid covers evenly; policy rollouts focus on reachable states; both have tradeoffs
ExtrapolationLocal methods (k-NN, kernel) extrapolate poorly; global methods (linear, neural) extrapolate according to their functional form
ConvergenceApproximate VI does not guarantee convergence; oscillations possible; reduce learning rate or use trust regions
Feature engineeringDomain-specific features (sin/cos of angles for pendulum, distance to goal) dramatically improve data efficiency
Generalization vs overfitToo few samples: underfit (smooth but wrong); too many samples with rich features: overfit (noisy)

Further Reading

The textbook's Chapter 8 is a strong survey. For deeper coverage:

"All models are wrong, but some are useful."
— George E. P. Box

No approximator recovers the exact value function. The question is always: is this approximation good enough to make the right decisions most of the time?

Computing Approximate Bellman Backups in Continuous State Spaces

The Bellman backup yi = maxa[R(si,a) + γ ∑s' T(s'|si,a) Uθ(s')] requires integrating Uθ(s') over the transition distribution T. For deterministic transitions (s' = f(s,a) exactly), this is trivial: yi = maxa[R(si,a) + γ Uθ(f(si,a))]. For stochastic transitions (s' ~ T(·|s,a)), we must approximate the expectation:

For the showcase's 1D MDP, the transitions are deterministic (s' = clamp(s + action, 0, 10)), so backups are exact single-step lookups. In the real-world ACAS X problem, the aircraft dynamics are stochastic (due to wind and sensor noise), and Monte Carlo backup with K=10–50 samples per state per action is the standard approach.

Which method is most appropriate when the value function is known to be approximately quadratic in the state, with no data about the global mean?

Real-World Deployment: ACAS X as a Case Study

The Airborne Collision Avoidance System X (ACAS X) is the premier real-world deployment of approximate value function methods. Developed at MIT Lincoln Laboratory and the FAA, it uses a 5D piecewise linear approximation over the states: relative altitude, relative bearing, intruder closure rate, own-ship altitude rate, and intruder altitude rate.

Key engineering choices and why they were made:

DecisionChoiceReason
Approximation methodPiecewise linear (simplex interpolation) over a 5D gridExact at grid points; fast (11 lookups); proved safe
State space5 continuous dimensions; 5 billion grid pointsCovers all physically relevant encounter geometries
Storage2.4 GB compressed; 45 MB compressed with quantizationFits in onboard flight computer memory
ComputationOffline: weeks to solve; online: microseconds to querySafety-critical systems cannot afford real-time planning
Offline solverExact value iteration on discretized state spaceGrid is fine enough to capture all safety-critical structure
Policy formLookup table with piecewise linear interpolationCertifiable: no neural net uncertainty

The key lesson: for safety-critical aviation systems, neural networks are not acceptable — regulators require deterministic behavior that can be formally verified. This does not mean neural networks are never used in aviation: they are used for perception (camera-based object detection), speech recognition, and anomaly detection in maintenance. But for the decision-making layer that determines pilot advisories, only certifiable methods (look-up tables, piecewise linear functions) are acceptable under current FAA standards. The push toward neural network verification (using formal methods like Marabou or α,β-CROWN) may change this in the future.

The grid-based piecewise linear approach is not the most accurate approximator, but it is the most certifiable one. The choice of approximation method is as much a regulatory and safety question as a technical one.

The ACAS X value table is publicly released and has been analyzed extensively by the formal verification community. Katz et al. (2017) verified properties of small ACAS X networks using the Reluplex satisfiability checker. Julian et al. (2019) showed how to compress the 5D grid into a neural network that is 1000x smaller, while Kouvaros et al. (2021) formally verified certain safety properties of those compressed networks. This research direction — compressing safe grid-based policies into neural networks and then formally verifying the neural version — represents the state of the art in bridging the certifiability gap between classical and neural approaches. The textbook's approximate value function methods are the starting point for this entire research thread.

Why not neural networks for ACAS X? Neural networks are more accurate and more scalable. But for FAA certification, you must demonstrate that the system's behavior is safe in every possible encounter geometry. With a lookup table, you can enumerate all extreme inputs and check the output. With a neural network, you cannot: there are infinitely many inputs, and adversarial examples may produce unexpected outputs. This is the core tension between ML performance and system safety that the field is actively working to resolve.

How This Chapter Connects to the Rest of the Book

Chapter 8's approximate value functions feed directly into every subsequent chapter. Chapter 9 (Online Planning) replaces offline precomputation with real-time search using rollout policies — the rollout policy is exactly the policy extracted from an approximate value function via the greedy policy improvement step. Chapter 10 (Policy Search) parameterizes πθ(a|s) directly and optimizes θ with gradient ascent; approximate value functions appear here as critics in actor-critic methods, providing low-variance gradient estimates. Chapters 12–17 (Model-Free RL) learn Uθ or Qθ without knowing T or R — but the Bellman operator being approximated is identical; the only difference is whether samples come from a simulator or from real environment interactions.

The critical design axis that runs through all of these chapters is the approximation-exploration tradeoff: the more compact Uθ is (fewer parameters, simpler basis), the faster it can be fit and the less data it needs — but the worse it approximates the true U* in complex regions of state space. The practitioner's job is to choose an approximator that is rich enough to capture the value structure that matters for decision-making, while being simple enough to be trained reliably from the available data. Chapter 8 gives you the vocabulary and toolkit to make this choice deliberately rather than by default.

Generalization and the Credit Assignment Problem

One of the deepest challenges in approximate value functions is credit assignment: when the agent takes action a in state s and eventually receives a reward r, how much of r is attributable to a versus to later actions? The Bellman equation solves this by definition — Q(s,a) = R(s,a) + γ maxa' Q(s',a') assigns exactly the right credit to action a by recursion. But with function approximation, each gradient update that adjusts θ to reduce the Bellman error at one (s,a) pair also changes Uθ(s') for all other states s' where the basis functions overlap. This generalization is both the power and the hazard of function approximation: a good approximator generalizes correctly (nearby states have similar values), while a bad approximator corrupts good value estimates when it over-generalizes.

Tile coding (Section 8.3) manages this tradeoff explicitly: the tiles are designed so that generalization happens only within each tile, never across tiles. This gives exact local estimates at the cost of coarse approximation at boundaries. Neural networks generalize more smoothly but without explicit control. The field of representation learning (Bengio et al., 2013; Lesort et al., 2018) studies which learned state representations support accurate value generalization — this is the same as asking what features make a good basis for Uθ.

← Chapter 7: Exact Methods Chapter 9: Online Planning →