When state spaces are continuous or enormous, the exact value table from Chapter 7 is impossible. This chapter builds a toolkit of function approximators — from nearest neighbors to neural networks — that generalize from finite sample states to the entire state space.
A robot arm with 7 joints has 14 state variables (angle + velocity per joint). Discretize each into 100 bins and you need 10014 = 1028 table entries. There are roughly 1025 grains of sand on Earth. The table would fill thousands of planets. Tables don't scale to continuous or high-dimensional state spaces.
Chapter 7's exact value iteration stores one number U(s) per state s. This works for small, finite MDPs — Kochenderfer et al. use the example of a two-state aircraft collision avoidance system. But the real-world ACAS X system has a continuous state space with 8+ dimensions. Exact methods cannot be applied. The entire ACAS X project used the approximate value function methods from this chapter.
The answer is yes — and there are many ways to do it. This chapter surveys a toolkit of approximators, from the conceptually simple (nearest neighbor) to the powerful (neural networks). Each makes different tradeoffs between accuracy, scalability, and computational cost. The ACAS X system ultimately used a grid-based approach (piecewise linear) over a 5-dimensional state space with 5 billion grid points, stored as a compressed lookup table of 2.4 GB.
The textbook categorizes the full approximate value iteration loop as fitted value iteration (FVI). The word "fitted" emphasizes that each iteration involves a supervised learning step — fitting Uθ to target values — not just a Bellman update. FVI was first analyzed by Szepesvari and Smart (2004) and Gordon (1999), who established conditions under which it converges and derived error bounds. The key condition: the regression step must not increase the error — formally, the approximation operator Π must be non-expansive in the weighted norm. For least-squares regression (linear or neural net), this typically holds, but it is not guaranteed in general.
The name "fitted value iteration" was coined by Ernst, Geurts & Wehenkel (2005) in their paper "Tree-Based Batch Mode Reinforcement Learning," which used random forests as the regression oracle in the FVI loop. This paper demonstrated that any supervised regression method can be used as the fitting step — gradient boosted trees, Gaussian processes, random forests, and neural networks all fit the FVI framework. The textbook covers the gradient boosting and neural network variants most closely, but the framework is general.
Modern deep RL methods (DQN, TD3, SAC) are all variants of FVI with neural function approximation. The conceptual framework is identical to the textbook's presentation; the engineering innovations (replay buffers, target networks, gradient clipping, normalized rewards) are stabilization tricks that make FVI work in practice for complex problems.
Before diving into approximators, let's see exactly what one Bellman backup looks like at a continuous state. Suppose we have a 1D MDP with state s ∈ [0, 5], two actions {left, right}, and the following simple model:
| Action | Transition | Reward |
|---|---|---|
| right | s' = min(s + 1, 5) | R = 10 if s' = 5 else 0 |
| left | s' = max(s − 1, 0) | R = 0 |
Suppose our current value approximation says U(0)=0, U(1)=4, U(2)=8, U(3)=13, U(4)=19, U(5)=10 (the fifth state has the goal reward but no future). The Bellman backup at s=3, γ=0.9:
So Unew(3) = 17.1, greedy action = right. Notice U changed from 13 to 17.1 in one backup — the value is still being propagated from the goal at s=5. After more iterations, all states will reflect the true discounted value of reaching the goal.
At a continuous query state like s=2.6, we cannot look up U(2.6) from a table. We must interpolate or approximate — that is the problem this chapter solves. The backup still computes R + γ E[U(s')], but the E[U(s')] evaluation requires the approximator.
See how the number of table entries (100 bins per dimension) explodes. Adjust the number of state dimensions. Even 6 dimensions makes a table infeasible; approximate methods are the only option.
The approximate value iteration framework is sometimes called approximate dynamic programming (ADP) in the operations research literature. The two fields developed the same ideas independently: RL researchers focused on the online/model-free setting, while ADP researchers focused on the offline/model-based setting. The textbook bridges both, presenting approximate value iteration as a unified framework that covers both RL (when T and R are learned) and ADP (when T and R are known). Chapter 8 is the ADP treatment; Chapters 12–17 cover the RL variants where the model is unknown.
A parametric approximation Uθ(s) represents the value function as a fixed functional form controlled by a parameter vector θ. Instead of storing one number per state, we store θ and compute Uθ(s) on demand for any query state.
The most important categorization is local vs global:
| Type | How parameters work | Generalization | Examples |
|---|---|---|---|
| Local | θ = values at sample states; query computes weighted sum of nearby θi | Interpolates locally near samples | k-NN, kernel, simplex |
| Global | θ = coefficients of basis functions; Uθ(s) = θTβ(s) | Extrapolates everywhere | Linear regression, neural nets |
In both cases, the fitting procedure finds θ by minimizing a loss between Uθ(si) and the backed-up target values yi. The loss is typically mean squared error:
In approximate value iteration, we compute Bellman targets yi at each sample state and then fit the approximator to them. The full loop is:
The backup in step 3 requires evaluating Uθ(s') at the next state s', which uses the current approximation. This creates the same "moving target" instability as neural network regression (Chapter 7). With linear and k-NN approximators, the instability is less severe but still exists — there is no general proof of convergence for approximate value iteration.
One exception: If the approximation family is rich enough to contain the true value function, and the sample distribution covers S well, approximate value iteration converges to a small error ball. The formal guarantee: ||Uθ* − U*||∞ ≤ ε/(1−γ) where ε is the approximation error of the best-fit U* in the function class. This bound degrades with discount γ → 1 and improves as ε → 0.
The true value function (a sine-like curve) is sampled at dots. A local method (nearest neighbor) and a global method (linear fit) are shown. Add sample points by clicking the canvas.
Nearest neighbor (k-NN) is the simplest local approximator. To compute Uθ(s) at a query state s:
where Nk(s) is the set of k-nearest sample indices and θi = yi is the backed-up value at sample si. With k=1, the approximation is piecewise constant — every query returns the value of its single nearest neighbor. As k increases, the estimate becomes smoother but less responsive to local variation.
A critical design choice: distance metric. Raw Euclidean distance treats all dimensions equally, which fails when state variables have very different scales (e.g., angle in radians vs. velocity in m/s). Always normalize state variables to comparable ranges, or use the Mahalanobis distance which accounts for scale and correlation.
python import numpy as np class KNNValueFunction: """k-Nearest Neighbor value function approximation.""" def __init__(self, k=3): self.k = k self.states = [] # list of sample states (arrays) self.values = [] # list of target values (floats) def update(self, states, values): """Replace stored samples with new (state, value) pairs.""" self.states = [np.array(s) for s in states] self.values = list(values) def __call__(self, s): """Query value at state s.""" s = np.array(s) dists = [np.linalg.norm(s - si) for si in self.states] idx = np.argsort(dists)[:self.k] return np.mean([self.values[i] for i in idx]) # Usage in approximate value iteration approx_U = KNNValueFunction(k=3) sample_states = [np.array([s]) for s in np.linspace(0, 10, 20)] for iteration in range(50): targets = [bellman_backup(s, approx_U) for s in sample_states] approx_U.update(sample_states, targets)
Naively finding the k nearest neighbors requires scanning all m samples: O(m) per query. For large m, this is too slow. The k-d tree (k-dimensional tree) is a binary tree that partitions the state space recursively, enabling nearest-neighbor queries in O(log m) average time.
| Structure | Build Time | Query Time | Best When |
|---|---|---|---|
| Brute force scan | O(1) | O(m × n) | m < 100 and queries are rare |
| k-d tree | O(m log m) | O(log m) avg | Low dimensions (n ≤ 20), many queries |
| Ball tree | O(m log m) | O(log m) avg | High-d data with metric structure |
| Approximate NN (FAISS) | O(m) | O(1) amortized | Very high-d, speed over exact answer |
In Python: scipy.spatial.cKDTree builds a k-d tree in milliseconds for m=10,000 points, and queries are 100× faster than brute force. For the approximate value iteration setting where we make thousands of queries per iteration, this speedup is critical.
python from scipy.spatial import cKDTree import numpy as np class FastKNNValueFunction: def __init__(self, k=3): self.k = k; self.tree = None; self.values = None def update(self, states, values): self.tree = cKDTree(np.array(states)) # O(m log m) build self.values = np.array(values) def __call__(self, s): dists, idx = self.tree.query(np.array(s), k=self.k) # O(log m) return np.mean(self.values[idx])
The true function is sin(x). Sample states are dots. The k-NN approximation is piecewise constant (k=1) or piecewise averaged (k=3). Adjust k.
Kernel smoothing is a weighted average over all (or nearby) sample states, where the weight is a decreasing function of distance. Common kernels include the Gaussian and the uniform (box) kernel:
Here K is the kernel function and h is the bandwidth — the single most important hyperparameter. Small h: tight kernel, high local sensitivity, noisy estimate. Large h: wide kernel, smooth estimate, may miss local structure.
K-NN is actually a special case of kernel smoothing with the "top-k" kernel: w=1/k for the k nearest points and 0 for the rest. The Nadaraya-Watson estimator (kernel smoothing) can be derived from a nonparametric regression model where U(s) is locally constant in a neighborhood of size h. Maximizing the weighted log-likelihood under a Gaussian noise model gives exactly the kernel smoothing formula above. This probabilistic interpretation means the bandwidth h has a natural interpretation: it is the standard deviation of the assumed "spread" in the value function, not just an arbitrary smoothing parameter.
For value function approximation specifically, a key practical issue is extrapolation. Both k-NN and kernel smoothing extrapolate poorly: they return the nearest sample's value (k-NN) or a weighted average of nearby samples (kernel) for query states outside the convex hull of the training data. This means that states that are never visited during training have poorly estimated values. For problems with bounded state spaces, this is manageable; for problems where the optimal policy may visit novel states (exploration), this creates a chicken-and-egg problem: you need good values to explore, but you need exploration to get good values. This is why on-policy sampling (Chapter 9 of the textbook) helps: it focuses value estimation where the current policy actually goes, ensuring the approximation is good where it matters.
The key difference: with a Gaussian kernel, every sample contributes to every query (though distant ones contribute very little).Kernel smoothing generalizes naturally to multivariate states using a product kernel: w(s, si) = ∏j K((sj − sij)/hj) where hj is the bandwidth for dimension j. Using separate bandwidths per dimension is important when state variables have different scales or different degrees of variation. For example, in a pendulum MDP, the angle variable may need a narrow bandwidth (value changes quickly near the unstable equilibrium) while the angular velocity may need a wider bandwidth.
The optimal bandwidth minimizes the mean squared error between the approximation and the true value function. This has two competing components:
| Component | Effect of small h | Effect of large h |
|---|---|---|
| Bias | Low — the approximation can capture local variation | High — the approximation over-smooths; misses peaks |
| Variance | High — few points in each window; noisy estimate | Low — many points averaged; stable but biased |
The bias-variance tradeoff is the fundamental problem of statistical estimation. For the value function approximation setting, we typically use leave-one-out cross-validation to select h: fit the approximation on all but one sample state and measure the prediction error at the held-out point. Average over all held-out points and choose the h that minimizes this cross-validation error.
An alternative: Silverman's rule of thumb sets h proportional to the sample standard deviation and inversely to m1/(n+4) where n is the state dimension. This rule is optimal for Gaussian-distributed data and works well in practice even for non-Gaussian cases. For a 1D problem with standard deviation σ and m samples: h = 1.06 σ m−1/5.
python import numpy as np def silverman_bandwidth(states): """Silverman's rule of thumb for kernel bandwidth (1D).""" m = len(states) sigma = np.std(states) return 1.06 * sigma * m ** (-0.2) def gaussian_kernel_smooth(query_states, sample_states, values, h): """Vectorized Gaussian kernel smoothing.""" diffs = query_states[:, None] - sample_states[None, :] # (nq, m) weights = np.exp(-0.5 * (diffs / h) ** 2) # (nq, m) weights /= weights.sum(axis=1, keepdims=True) + 1e-10 # normalize return weights @ values # (nq,) # Usage: smooth a noisy value estimate over 30 sample states sample_s = np.linspace(0, 10, 30) values = np.sin(sample_s) * 5 + np.random.randn(30) * 0.5 # noisy h = silverman_bandwidth(sample_s) query_s = np.linspace(0, 10, 200) smooth_v = gaussian_kernel_smooth(query_s, sample_s, values, h)
The true function is sin(2x). Adjust the bandwidth h. Too small: wiggly and noisy. Too large: over-smoothed, misses peaks and valleys.
Linear interpolation works on a grid: a regular rectangular discretization of the state space. Given a query state s between grid points, the value is computed as a weighted combination of the surrounding grid values, with weights proportional to how close s is to each grid point.
In 1D with grid points at xl and xr (the two neighbors of query x), the linear interpolation is:
In 2D, bilinear interpolation uses the four surrounding grid corners. In n dimensions, multilinear interpolation uses 2n surrounding corners. This exponential scaling is why linear interpolation is practical in low dimensions (1–4) but problematic in higher ones.
For grid-based interpolation methods, the resolution per dimension is a critical design parameter. Choosing it requires balancing approximation accuracy, memory, and computation:
| Consideration | Effect of finer resolution | Effect of coarser resolution |
|---|---|---|
| Approximation error | Lower — grid points closer together, interpolation more accurate | Higher — grid points further apart, more interpolation error |
| Memory | Exponentially more — doubling resolution in n dims multiplies storage by 2n | Exponentially less |
| Computation (value iteration) | More grid points to backup; more total work | Fewer points; faster per iteration |
| Policy quality | Better approximation of U* → better policy | Coarse resolution may miss narrow peaks (e.g., near goal states) |
A practical heuristic: start with a coarse 5×5 (or 5n) grid, solve approximately, inspect the value function, and double the resolution in dimensions where the function varies most rapidly. ACAS X used a non-uniform grid with finer resolution near zero relative altitude (where collision risk is highest) and coarser resolution at large separations (where the optimal action is always "do nothing").
Linear interpolation is also the foundation of the Q-table with linear interpolation method used in classic RL papers. Sutton & Barto's mountain car problem (1998) used a tile-coded linear approximation that is equivalent to piecewise linear interpolation on a non-uniform grid. The tile coding creates a grid implicitly: each "tile" is a rectangular region in state space with its own value estimate. When multiple tilings overlap, the interpolated value is a weighted sum of tiles, approximating the same multilinear formula as regular grid interpolation. The advantage: tile coding adapts naturally to the state-space geometry (tiles can be offset and rotated) while maintaining the O(1) query time of grid methods.
Despite this limitation, piecewise linear interpolation on a grid is the most widely deployed approximate value function method in safety-critical aviation. The original TCAS (Traffic Collision Avoidance System, deployed in virtually all commercial aircraft since 1993) used a lookup table. ACAS X (2020s) uses a 5D grid with piecewise linear interpolation, achieving far better performance than TCAS while maintaining hard safety guarantees. The key insight: in the aviation application, 5 state dimensions × modest resolution is manageable; the smoothness assumption holds because physics is continuous.
The multilinear interpolation formula in n dimensions:
where λj is the fractional offset in dimension j, and cornerb is the grid corner indexed by binary vector b. This formula has 2n terms and requires 2n table lookups per query. For n=5 that is 32 lookups; for n=10 that is 1024. Simplex interpolation replaces this with just n+1 = 6 or 11 lookups respectively.
python import numpy as np def linear_interp_1d(grid_x, grid_U, query_x): """Piecewise linear interpolation in 1D.""" # Find the two surrounding grid points idx = np.searchsorted(grid_x, query_x) - 1 idx = np.clip(idx, 0, len(grid_x) - 2) xl, xr = grid_x[idx], grid_x[idx + 1] Ul, Ur = grid_U[idx], grid_U[idx + 1] lam = (xr - query_x) / (xr - xl) # weight for left point return lam * Ul + (1 - lam) * Ur # Example: value function on [0, 1] with 5 grid points grid_x = np.linspace(0, 1, 5) grid_U = np.array([0.0, 0.8, 1.2, 0.9, 0.1]) print(linear_interp_1d(grid_x, grid_U, 0.35)) # 0.96
Simplex interpolation avoids the exponential 2n corner problem of multilinear interpolation. Instead of requiring all 2n corners of a hypercube, it decomposes each grid cell into simplices (triangles in 2D, tetrahedra in 3D, etc.) using only n+1 vertices each.
A simplex in n dimensions has n+1 vertices. Any point inside a simplex can be written as a convex combination of the vertices with non-negative barycentric coordinates λ0, ..., λn that sum to 1:
The interpolated value is then Uθ(s) = ∑i λi U(si). This uses only n+1 lookups per query instead of 2n. The tradeoff: the interpolation is linear within each simplex but may have discontinuous derivatives at simplex boundaries.
Suppose our 2D grid has spacing 1 in each dimension. Grid corners at (0,0), (1,0), (0,1), (1,1) with values U(0,0)=2, U(1,0)=5, U(0,1)=3, U(1,1)=8. We query at s=(0.7, 0.4).
Step 1: Fractional coordinates. Already in [0,1]: dx=0.7, dy=0.4.
Step 2: Determine triangle. Since dx = 0.7 ≥ dy = 0.4, we're in the lower triangle with vertices (0,0), (1,0), (1,1).
Step 3: Barycentric coordinates. λ0 = 1 − dx = 0.3, λ1 = dx − dy = 0.3, λ2 = dy = 0.4. Verify: 0.3 + 0.3 + 0.4 = 1.0 ✓.
Step 4: Interpolate. U(0.7, 0.4) = 0.3×U(0,0) + 0.3×U(1,0) + 0.4×U(1,1) = 0.3×2 + 0.3×5 + 0.4×8 = 0.6 + 1.5 + 3.2 = 5.3.
Compare: bilinear interpolation at (0.7, 0.4) uses all 4 corners. λx=0.7, λy=0.4: U = (1−0.7)(1−0.4)×2 + 0.7(1−0.4)×5 + (1−0.7)0.4×3 + 0.7×0.4×8 = 0.36 + 2.1 + 0.36 + 2.24 = 5.06. The two methods agree closely but use different numbers of lookups (3 vs 4). In 10D: simplex uses 11 lookups vs bilinear's 1024.
Here is the full simplex interpolation algorithm for a 2D grid with state (x, y) ∈ [0,1]2:
python import numpy as np def simplex_interp_2d(grid_U, query_x, query_y, x_min, x_max, y_min, y_max): """ Simplex interpolation in 2D. grid_U: 2D array of values at grid points query: (x, y) continuous state Returns interpolated value. """ nx, ny = grid_U.shape # Map to fractional grid coordinates [0, nx-1] x [0, ny-1] fx = (query_x - x_min) / (x_max - x_min) * (nx - 1) fy = (query_y - y_min) / (y_max - y_min) * (ny - 1) # Integer and fractional parts ix, iy = int(np.clip(fx, 0, nx-2)), int(np.clip(fy, 0, ny-2)) dx, dy = fx - ix, fy - iy # fractional parts in [0, 1) # Sort fractional parts: determines which triangle we're in if dx >= dy: # lower triangle: (0,0)-(1,0)-(1,1) lam = [1 - dx, dx - dy, dy] # barycentric coordinates pts = [(ix, iy), (ix+1, iy), (ix+1, iy+1)] else: # upper triangle: (0,0)-(0,1)-(1,1) lam = [1 - dy, dy - dx, dx] pts = [(ix, iy), (ix, iy+1), (ix+1, iy+1)] return sum(lam[k] * grid_U[pts[k]] for k in range(3)) # Only 3 lookups (n+1 = 3 for 2D) vs 4 for bilinear (2^2 = 4)
Multilinear (bilinear in 2D)
Simplex interpolation
Linear regression approximation represents the value function as a linear combination of basis functions β1(s), ..., βp(s):
The parameters θ are found by minimizing the mean squared error over sample states. This has a closed-form solution: the ordinary least squares estimator θ* = (BTB)−1BTy, where B is the matrix with rows β(si)T and y is the vector of target values yi.
Basis function choices for value function approximation:
The OLS estimator θ* = (BTB)−1BTy requires matrix inversion. For p basis functions and m sample states, the matrix BTB is p×p. Inverting it costs O(p3). For p=100 basis functions this is fast; for p=10,000 (e.g., with many tile coding tilings) it becomes expensive. In practice, use numpy.linalg.lstsq or iterative solvers (conjugate gradient) for large p. Alternatively, use stochastic gradient descent to update θ online rather than solving the full OLS every iteration — this is the basis of TD-learning with linear function approximation, analyzed by Sutton (1988) and Tsitsiklis & Van Roy (1997).
Tile coding is a classic trick from RL (Sutton & Barto) that creates binary features by overlaying multiple offset grids ("tilings") over the state space. Each tiling is like a low-resolution grid; the feature for a state is 1 in the bin it falls into on each tiling, 0 everywhere else. Using k overlapping tilings at different offsets gives k features per state. The advantage: linear regression on tile-coded features can represent nonlinear functions, while remaining fully linear in parameters.
| Basis Function Family | Strengths | Weaknesses | Typical Use |
|---|---|---|---|
| Monomials (polynomial) | Interpretable; can represent smooth functions exactly | Runge phenomenon: oscillates at boundaries for high degree | Value functions with known polynomial structure |
| RBF (Gaussian) | Local support; no oscillation; smooth | Requires choosing centers and width σ | Unknown smooth functions; robotics control |
| Tile coding | Very fast to evaluate; sparse binary features | Doesn't extrapolate; fixed resolution | Classic RL: grid worlds, mountain car |
| Fourier features | Handles periodic structure; random sampling works | Sensitive to scale; less intuitive | Periodic or quasi-periodic value functions |
The true function (sin wave) is approximated with polynomial (orange) and RBF (teal) bases. Adjust the number of basis functions. More RBFs capture local structure better; polynomials can oscillate (Runge's phenomenon).
A neural network approximator Uθ(s) passes the state through a stack of learned transformations. Each layer computes a linear combination of the previous layer's activations, followed by a nonlinear activation function:
The parameters θ = {W1, b1, W2, b2, ..., Wout, bout} are learned by stochastic gradient descent on the MSE loss. Common activation functions: ReLU σ(x) = max(0,x), tanh, or sigmoid.
Neural networks are universal approximators: with enough width or depth, a ReLU network can approximate any continuous function to arbitrary precision. In practice, they are the default for high-dimensional state spaces — they power DQN, A3C, PPO, and virtually all modern deep RL.
For the approximate value iteration setting (offline, batch), neural networks are trained by minimizing the mean squared Bellman error over sample transitions (s, a, r, s'). The key data pipeline is: (1) sample a batch of transitions from a replay buffer, (2) compute Bellman targets yi = ri + γ · maxa' Uθ−(s'i, a') using the frozen target network, (3) gradient step on ∑(Uθ(si, ai) − yi)2. The target network weights θ The architecture of the value network matters enormously. Key design decisions: For the MDP in the showcase (1D continuous, 2 actions), a minimal network is 2 hidden layers × 32 units. For Atari (image input, 18 actions), DQN used 3 convolutional layers followed by 2 fully connected layers with 512 units. For robotics with 50D state spaces, typical architectures are 3 fully connected layers with 256 units each. The fundamental problem: the Bellman backup computes yi = R + γ · maxa Uθ(s'). But Uθ uses the current network weights θ. So when we take a gradient step to make Uθ(s) closer to yi, we also change Uθ(s'), which changes yi on the next iteration. We are chasing a moving target. This can cause oscillations or divergence. DQN (Mnih et al., 2015, DeepMind) introduced two key fixes that made neural value functions stable for the first time: Together these two fixes turned a theoretically unstable algorithm into a practical powerhouse. DQN achieved superhuman performance on 49 Atari games using this architecture: a CNN for image features, two hidden layers with 512 ReLU units, output of Q-values for each action, trained with replay buffer of 1M transitions and target network update every 10,000 steps. Since the original DQN paper (2015), several improvements have become standard. For the approximate value iteration context: All of these improvements operate on the same approximate value iteration loop from the textbook. The core insight is unchanged: sample, backup, fit. The improvements address specific failure modes of the basic loop when scaled to complex environments. For the textbook's simple MDP examples, the basic DQN loop without improvements is sufficient and often preferred for pedagogical clarity. Concretely, the tensor shapes for a 1D continuous state problem with 2 actions: The forward pass: s → hidden layers → Q-values (batch, 2) → gather Q(s,a) at chosen action → compare with target y. Gradient flows back through gather → Q-head → hidden layers only (not through the target network, which is detached from the computation graph).Architecture Design for Value Function Networks
Decision Common Choice Why Input normalization Standardize each state dimension to mean 0, std 1 Prevents large-magnitude dimensions from dominating gradients Width vs depth 2–3 hidden layers of 64–256 units for simple MDPs; 4+ layers for high-d Deeper is more expressive but harder to train; diminishing returns beyond 4 layers Activation function ReLU (default); ELU or SiLU for smoother value functions ReLU trains fast; smoother activations give smoother Q-functions Output layer Linear (no activation) for Q-values Q-values can be any real number, not bounded; tanh/sigmoid would artificially constrain them Separate network per action vs shared One network, output size = |A| for discrete actions Shared network shares representations; cheaper to evaluate all actions at once Dueling architecture (optional) Split output into V(s) + A(s,a); Q = V + A − mean(A) Separates state value from action advantage; faster convergence in many MDPs python
import torch
import torch.nn as nn
class ValueNetwork(nn.Module):
def __init__(self, state_dim, hidden=64):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
nn.Linear(hidden, 1) # scalar value output
)
def forward(self, s):
return self.net(s).squeeze(-1) # (batch,) scalar per state
# Approximate value iteration step
def approx_value_iter_step(net, target_net, optimizer, states, rewards, next_states, gamma=0.99):
with torch.no_grad():
targets = rewards + gamma * target_net(next_states) # Bellman backup
preds = net(states)
loss = nn.functional.mse_loss(preds, targets)
optimizer.zero_grad(); loss.backward(); optimizer.step()
return loss.item()
Why Neural Networks Are Unstable in Value Iteration
Fix Mechanism Why It Helps Replay buffer Store all past transitions (s,a,r,s'). Train on random mini-batches from this buffer instead of the latest experience. Breaks temporal correlations; diverse mini-batches prevent the network from forgetting earlier states Target network A frozen copy θ− of the network, used only for computing targets y = R + γ maxa' Uθ−(s'). Copy θ to θ− every C steps. Targets are stable for C steps; regression is well-defined Data Flow in Neural Approximate Value Iteration
Double DQN and Other Improvements
Improvement Problem it solves How Double DQN (van Hasselt et al.) Q-values are overestimated because maxa' Q(s',a') is biased upward Use main net to select action, target net to evaluate: Qθ−(s', argmaxa' Qθ(s',a')) Dueling DQN (Wang et al.) Most actions in most states have similar Q-values; wasted capacity Split Q into V(s) + A(s,a): value head and advantage head; combine as Q = V + A − mean(A) Prioritized Experience Replay Uniform replay wastes time on easy transitions already well-modeled Sample transitions proportional to |TD error|; correct bias with importance weights N-step returns 1-step Bellman targets have high bias; propagating value is slow Use y = ∑t=0n γtrt + γn maxa'Q(sn,a') Rainbow DQN (Hessel et al.) Combines all of the above Double + Dueling + Prioritized + N-step + NoisyNets + Distributional; state-of-the-art for Atari
Quantity Shape Meaning States s (batch, 1) Batch of 1D continuous states Q-values Qθ(s) (batch, 2) One Q-value per action per state Actions a (batch,) int Which action was taken Rewards r (batch,) Immediate reward for (s,a) Next states s' (batch, 1) States reached after action Targets y (batch,) r + γ maxa' Qθ−(s', a') Loss scalar mean[ (Qθ(s,a) − y)2 ]
Here is the full approximate value iteration loop, visualized in real time on a 1D continuous MDP. The state is a position s ∈ [0, 10]. The reward is high at the goal (s=8) and zero elsewhere. The discount is γ=0.9. There is no analytical solution — we must approximate.
You choose the approximator: k-NN, kernel smoothing, or linear regression with RBF features. The loop samples states randomly, computes Bellman backups, refits the approximator, and shows how the value function evolves. Watch the value function grow toward the goal.
Once we have a good approximation Uθ, extracting a policy requires evaluating the Bellman maximization at each decision point:
For discrete action spaces (e.g., left/right/stop), this is a finite argmax — cheap. For continuous actions, this becomes a non-convex optimization problem. Common approaches:
| Action Space | Policy Extraction Method | Cost per Decision |
|---|---|---|
| Small discrete (≤ 10 actions) | Enumerate all actions, compute each Q-value, take max | O(|A| × approximator cost) |
| Large discrete (≥ 100 actions) | Beam search or UCT tree search from current state | O(|A| × tree depth) |
| Continuous (bounded interval) | Random shooting: sample N actions, pick best | O(N × approximator cost) |
| Continuous (high-d) | Cross-entropy method or gradient ascent on Qθ(s,a) in a | O(iterations × batch) |
In the showcase's 1D MDP with 2 discrete actions (left/right), policy extraction is O(2). For ACAS X with 5 actions (alerts), O(5). The simplicity of discrete, small action spaces is one reason they are preferred in safety-critical applications.
Goal at s=8 (reward=10). Discount γ=0.9. Run iterations and watch the value function (teal) converge. The orange dots are current sample states. Change the approximator to see how each method fits.
| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| k-NN | No training, exact at samples | Slow queries (O(m)), poor high-d | Low-d, small datasets |
| Kernel smoothing | Smooth, simple to implement | Same as k-NN; bandwidth sensitive | Low-d, smooth value functions |
| Linear interpolation | Fast queries, exact at grid points | Memory O(resolutionn) | 1–3D problems with regular grids |
| Simplex interpolation | n+1 lookups (vs 2n) | Discontinuous derivatives | 4–6D problems with regular grids |
| Linear regression | Fast, closed-form solution | Must choose good basis functions | Problems with known structure |
| Neural networks | Universal approximator, auto-features | Instability, needs target network | High-d, complex value functions |
Exact value iteration converges because the Bellman operator T is a γ-contraction: ||TU − TU'||∞ ≤ γ ||U − U'||∞. The fixed point of T is U*, and repeated application of T shrinks any error by γ per iteration.
With approximate value iteration, the error does not shrink to zero. Each iteration first applies T (contraction) then applies the approximation operator Π (projection onto the function class). The composition ΠT is no longer a contraction in general. Three cases:
| Function class | Convergence guarantee | Practical behavior |
|---|---|---|
| Exact (tabular) | Converges to U* (geometric rate γ) | Always converges; rate determined by γ |
| Linear (fixed features) | Converges to a small error ball ||Uθ* − U*|| ≤ ε/(1−γ) | Usually converges; rate depends on features and γ |
| Neural network | No general guarantee | Usually converges in practice with target network; can oscillate |
| k-NN / kernel | Converges if samples are dense and h is appropriately chosen | Converges with enough samples and good bandwidth |
The error bound ε/(1−γ) has two terms: ε is the best-in-class approximation error (how well the function family can represent U*), and 1/(1−γ) is the discount amplification factor. As γ → 1, even small approximation errors get amplified arbitrarily. This is why deep RL with γ=0.99 is so finicky: the 1/(1−0.99) = 100 amplification magnifies any approximation error by 100x in the value bound.
The choice of sample states is closely related to the concept of model bias in approximate dynamic programming. If the sample states are not representative of the states visited by the optimal policy, the approximation will be accurate in the wrong region of state space, leading to a suboptimal policy. This is a form of distribution mismatch: we minimize loss on the training distribution (sample states) but the policy deploys in a different distribution (states the optimal policy visits). The solution is iterative: use the current approximate policy to generate new sample states, refit, and repeat. This is called approximate policy iteration (API) and is the algorithmic framework underlying SARSA, on-policy actor-critic, and TRPO in modern deep RL.
The sampling distribution for sample states is a crucial design choice that the textbook addresses explicitly. Two main strategies:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Uniform grid | Place sample states on a regular grid or uniformly at random | Covers the state space evenly; no policy dependence; easy to implement | Wastes samples on states rarely visited; poor coverage of high-value regions |
| On-policy rollouts | Generate sample states by following the current approximate policy | Concentrates samples where the policy actually goes; better approximation where it matters | Biased: misses states the policy avoids; divergence risk if policy is bad early on |
| Experience replay | Collect transitions from many historical policies; sample uniformly from the replay buffer | Data efficient; diverse samples; standard in DQN | Off-policy samples may not reflect current state distribution; distribution shift |
| Prioritized replay | Sample transitions with probability proportional to Bellman error |U(s) − y| | Focuses computation on states where the approximation is worst | Requires importance weights to correct for the sampling bias; more complex |
For the grid-based methods (linear interpolation, simplex), a uniform grid is natural — the grid points are the sample states. For k-NN and kernel methods, on-policy sampling often works best after the first few iterations. For neural networks, experience replay (off-policy) is standard: it reduces correlation between consecutive samples and allows reuse of data, which is critical when environment interaction is expensive.
The textbook also notes an important practical consideration: the sampling distribution of states in S matters enormously. If your sample states cluster in low-value regions but the optimal policy mostly visits high-value regions, the approximation will be poor where it matters most. Common solutions: (1) follow the current approximate policy to generate samples (on-policy sampling), (2) importance-weight samples from a fixed distribution, or (3) use a uniform grid if the state space is bounded and low-dimensional. The interplay between sample distribution and approximation error is the core technical challenge of approximate dynamic programming and motivates most of the innovations in deep RL (experience replay, on-policy methods, distributional RL) that have appeared since the DQN paper.
| Practical Consideration | Details |
|---|---|
| Sample density | Uniform grid covers evenly; policy rollouts focus on reachable states; both have tradeoffs |
| Extrapolation | Local methods (k-NN, kernel) extrapolate poorly; global methods (linear, neural) extrapolate according to their functional form |
| Convergence | Approximate VI does not guarantee convergence; oscillations possible; reduce learning rate or use trust regions |
| Feature engineering | Domain-specific features (sin/cos of angles for pendulum, distance to goal) dramatically improve data efficiency |
| Generalization vs overfit | Too few samples: underfit (smooth but wrong); too many samples with rich features: overfit (noisy) |
The textbook's Chapter 8 is a strong survey. For deeper coverage:
No approximator recovers the exact value function. The question is always: is this approximation good enough to make the right decisions most of the time?
The Bellman backup yi = maxa[R(si,a) + γ ∑s' T(s'|si,a) Uθ(s')] requires integrating Uθ(s') over the transition distribution T. For deterministic transitions (s' = f(s,a) exactly), this is trivial: yi = maxa[R(si,a) + γ Uθ(f(si,a))]. For stochastic transitions (s' ~ T(·|s,a)), we must approximate the expectation:
For the showcase's 1D MDP, the transitions are deterministic (s' = clamp(s + action, 0, 10)), so backups are exact single-step lookups. In the real-world ACAS X problem, the aircraft dynamics are stochastic (due to wind and sensor noise), and Monte Carlo backup with K=10–50 samples per state per action is the standard approach.
The Airborne Collision Avoidance System X (ACAS X) is the premier real-world deployment of approximate value function methods. Developed at MIT Lincoln Laboratory and the FAA, it uses a 5D piecewise linear approximation over the states: relative altitude, relative bearing, intruder closure rate, own-ship altitude rate, and intruder altitude rate.
Key engineering choices and why they were made:
| Decision | Choice | Reason |
|---|---|---|
| Approximation method | Piecewise linear (simplex interpolation) over a 5D grid | Exact at grid points; fast (11 lookups); proved safe |
| State space | 5 continuous dimensions; 5 billion grid points | Covers all physically relevant encounter geometries |
| Storage | 2.4 GB compressed; 45 MB compressed with quantization | Fits in onboard flight computer memory |
| Computation | Offline: weeks to solve; online: microseconds to query | Safety-critical systems cannot afford real-time planning |
| Offline solver | Exact value iteration on discretized state space | Grid is fine enough to capture all safety-critical structure |
| Policy form | Lookup table with piecewise linear interpolation | Certifiable: no neural net uncertainty |
The key lesson: for safety-critical aviation systems, neural networks are not acceptable — regulators require deterministic behavior that can be formally verified. This does not mean neural networks are never used in aviation: they are used for perception (camera-based object detection), speech recognition, and anomaly detection in maintenance. But for the decision-making layer that determines pilot advisories, only certifiable methods (look-up tables, piecewise linear functions) are acceptable under current FAA standards. The push toward neural network verification (using formal methods like Marabou or α,β-CROWN) may change this in the future.
The grid-based piecewise linear approach is not the most accurate approximator, but it is the most certifiable one. The choice of approximation method is as much a regulatory and safety question as a technical one.The ACAS X value table is publicly released and has been analyzed extensively by the formal verification community. Katz et al. (2017) verified properties of small ACAS X networks using the Reluplex satisfiability checker. Julian et al. (2019) showed how to compress the 5D grid into a neural network that is 1000x smaller, while Kouvaros et al. (2021) formally verified certain safety properties of those compressed networks. This research direction — compressing safe grid-based policies into neural networks and then formally verifying the neural version — represents the state of the art in bridging the certifiability gap between classical and neural approaches. The textbook's approximate value function methods are the starting point for this entire research thread.
Chapter 8's approximate value functions feed directly into every subsequent chapter. Chapter 9 (Online Planning) replaces offline precomputation with real-time search using rollout policies — the rollout policy is exactly the policy extracted from an approximate value function via the greedy policy improvement step. Chapter 10 (Policy Search) parameterizes πθ(a|s) directly and optimizes θ with gradient ascent; approximate value functions appear here as critics in actor-critic methods, providing low-variance gradient estimates. Chapters 12–17 (Model-Free RL) learn Uθ or Qθ without knowing T or R — but the Bellman operator being approximated is identical; the only difference is whether samples come from a simulator or from real environment interactions.
The critical design axis that runs through all of these chapters is the approximation-exploration tradeoff: the more compact Uθ is (fewer parameters, simpler basis), the faster it can be fit and the less data it needs — but the worse it approximates the true U* in complex regions of state space. The practitioner's job is to choose an approximator that is rich enough to capture the value structure that matters for decision-making, while being simple enough to be trained reliably from the available data. Chapter 8 gives you the vocabulary and toolkit to make this choice deliberately rather than by default.
One of the deepest challenges in approximate value functions is credit assignment: when the agent takes action a in state s and eventually receives a reward r, how much of r is attributable to a versus to later actions? The Bellman equation solves this by definition — Q(s,a) = R(s,a) + γ maxa' Q(s',a') assigns exactly the right credit to action a by recursion. But with function approximation, each gradient update that adjusts θ to reduce the Bellman error at one (s,a) pair also changes Uθ(s') for all other states s' where the basis functions overlap. This generalization is both the power and the hazard of function approximation: a good approximator generalizes correctly (nearby states have similar values), while a bad approximator corrupts good value estimates when it over-generalizes.
Tile coding (Section 8.3) manages this tradeoff explicitly: the tiles are designed so that generalization happens only within each tile, never across tiles. This gives exact local estimates at the cost of coarse approximation at boundaries. Neural networks generalize more smoothly but without explicit control. The field of representation learning (Bengio et al., 2013; Lesort et al., 2018) studies which learned state representations support accurate value generalization — this is the same as asking what features make a good basis for Uθ.