Kochenderfer & Wheeler, Chapter 17

Surrogate Models

Can't afford to evaluate the real function? Build a cheap approximation and optimize that instead.

Prerequisites: Chapter 16 (Sampling Plans).

Chapter 0: Why Surrogates?

A single CFD simulation might take hours, and training a neural network can take days. Running an optimizer that needs thousands of evaluations directly on such functions is impractical, because each evaluation is too expensive. Instead, build a surrogate model: a cheap-to-evaluate approximation that captures the function's essential behavior.

The core idea: Evaluate the expensive function at a few carefully chosen points (Chapter 16). Fit a simple model through those points. Optimize the model instead of the real function. The surrogate lets you perform thousands of cheap evaluations to find a promising region, then verify with one expensive evaluation.
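
The four steps above can be sketched in a few lines. This is only an illustration under stated assumptions: `expensive_f` is a made-up stand-in for a costly simulation, the surrogate is an arbitrary cubic polynomial, and the "optimizer" is a dense grid search over the cheap model.

```python
import numpy as np

def expensive_f(x):
    # Hypothetical stand-in for a costly simulation.
    return (x - 0.3) ** 2 + 0.1 * np.sin(8 * x)

# 1. Evaluate the expensive function at a few sample points.
X = np.linspace(0.0, 1.0, 6)
y = expensive_f(X)

# 2. Fit a simple surrogate (here a cubic polynomial) to those points.
surrogate = np.poly1d(np.polyfit(X, y, deg=3))

# 3. Optimize the surrogate using many cheap evaluations.
grid = np.linspace(0.0, 1.0, 10_001)
x_best = grid[np.argmin(surrogate(grid))]

# 4. Verify the candidate with one more expensive evaluation.
y_true = expensive_f(x_best)
print(x_best, surrogate(x_best), y_true)
```

If the surrogate's prediction and `y_true` disagree badly, the new point can be added to the data and the surrogate refit.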

Chapter 1: Linear Models

The simplest surrogate: f̂(x) = θᵀx + θ₀. Fit via least squares: (θ*, θ₀*) = argmin ∑ᵢ (yᵢ − θᵀxᵢ − θ₀)². This is just the hyperplane that best fits the data points.
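
A minimal sketch of this fit, assuming NumPy and synthetic noise-free data (the true weights below are invented for the example). Appending a column of ones lets `np.linalg.lstsq` fit the offset θ₀ together with θ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 2))          # 20 sample points in 2-D
theta_true, theta0_true = np.array([2.0, -1.0]), 0.5
y = X @ theta_true + theta0_true              # noise-free linear data

# Append a column of ones so the offset is fit together with the weights.
A = np.hstack([X, np.ones((X.shape[0], 1))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
theta, theta0 = params[:-1], params[-1]

def f_hat(x):
    # Linear surrogate: theta^T x + theta0.
    return x @ theta + theta0
```

Because the data here are exactly linear, the fit recovers the weights used to generate them; with real data the residual would be nonzero.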

Linear models are fast to fit, easy to interpret, and provide baselines. But they cannot capture nonlinear behavior — if the function curves, a plane will be a poor approximation.

Think of it this way: A linear surrogate is a first-order Taylor approximation, but fit to multiple data points rather than evaluated at one. It captures the trend but misses the shape.

Chapter 2: Basis Functions

To capture nonlinearity, express the surrogate as a linear combination of basis functions:

f̂(x) = ∑ⱼ θⱼ bⱼ(x)

Common basis function families: polynomials (bⱼ(x) = x^j), sinusoids, and radial basis functions. The model is still linear in the parameters θ, so fitting is a least-squares problem, but the basis functions inject nonlinearity in x.
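
As a sketch of why the fit stays a least-squares problem, the example below builds a polynomial basis matrix for a 1-D input and solves for θ. The helper name `polynomial_basis` and the quadratic test function are assumptions made for illustration.

```python
import numpy as np

def polynomial_basis(x, degree):
    # Column j holds b_j(x) = x**j for j = 0..degree (1-D input).
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(-1, 1, 15)
y = 1.0 - 2.0 * x + 3.0 * x**2        # quadratic test function

B = polynomial_basis(x, degree=2)     # nonlinear in x ...
theta, *_ = np.linalg.lstsq(B, y, rcond=None)   # ... but linear in theta

def f_hat(x_new):
    return polynomial_basis(np.atleast_1d(x_new), 2) @ theta
```

Swapping in sinusoids or radial basis functions only changes how the columns of `B` are computed; the solve is unchanged.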

Key insight: The choice of basis functions encodes your prior knowledge about the function. Polynomials assume smoothness. Sinusoids assume periodicity. Radial basis functions assume local influence. A good basis set can dramatically reduce the number of data points needed.

Chapter 3: Radial Basis Functions

Radial basis functions (RBFs) center a basis function at each data point: bⱼ(x) = φ(||x − xⱼ||). The surrogate is f̂(x) = ∑ⱼ θⱼ φ(||x − xⱼ||). Popular kernels:

| Kernel | Formula | Character |
| --- | --- | --- |
| Gaussian | exp(−r²/(2ℓ²)) | Smooth, localized |
| Multiquadric | √(r² + c²) | Global influence |
| Thin-plate spline | r² log(r) | Minimum curvature |

Why RBFs work well: With one basis function per data point, they can interpolate the data exactly (when no regularization is applied, i.e. λ = 0 in Chapter 4). For kernels with a length scale, such as the Gaussian, ℓ controls how far each point's influence extends. Small ℓ gives a wiggly surface that fits noise; large ℓ gives a smooth surface that may miss features. Choosing ℓ is the key modeling decision.
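
A minimal sketch of Gaussian RBF interpolation, assuming NumPy; the function names, the sine test data, and the length scale of 0.15 are illustrative choices, not values from the chapter. With one kernel per data point the weights come from a square linear system, which is why the surrogate passes through every point.

```python
import numpy as np

def gaussian_kernel(r, length_scale):
    return np.exp(-r**2 / (2.0 * length_scale**2))

def pairwise_dist(A, C):
    # Euclidean distance between every row of A and every row of C.
    return np.linalg.norm(A[:, None, :] - C[None, :, :], axis=-1)

def rbf_fit(X, y, length_scale):
    # One kernel centered at each data point gives a square system.
    Phi = gaussian_kernel(pairwise_dist(X, X), length_scale)
    return np.linalg.solve(Phi, y)

def rbf_predict(X_new, X, theta, length_scale):
    return gaussian_kernel(pairwise_dist(X_new, X), length_scale) @ theta

X = np.linspace(0, 1, 8).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
theta = rbf_fit(X, y, length_scale=0.15)
y_at_data = rbf_predict(X, X, theta, length_scale=0.15)
```

Re-running with a much larger `length_scale` makes the kernel matrix nearly singular, which is one practical reason to regularize.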

Chapter 4: Fitting Noisy Data

Real data is noisy. Exact interpolation through noisy points overfits — the surrogate fits the noise, not the signal. Regularization adds a penalty term:

minimize ∑ᵢ (yᵢ − f̂(xᵢ))² + λ||θ||²

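The regularization parameter λ controls the tradeoff: λ = 0 gives the unregularized least-squares fit (exact interpolation when there is one basis function per data point, hence the overfitting risk), while large λ gives an overly smooth fit (underfitting). With B the matrix of basis-function values, Bᵢⱼ = bⱼ(xᵢ), the solution becomes θ* = (BᵀB + λI)⁻¹Bᵀy.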
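
The closed-form solution can be sketched directly, assuming NumPy. The degree-9 polynomial basis, the noise level, and the two λ values are arbitrary choices for illustration; the code solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

def ridge_fit(B, y, lam):
    # Solve (B^T B + lam I) theta = B^T y without an explicit inverse.
    n_basis = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(n_basis), B.T @ y)

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 12)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)   # noisy samples
B = np.vander(x, 10, increasing=True)                   # degree-9 polynomial basis

theta_small = ridge_fit(B, y, lam=1e-8)   # nearly unregularized
theta_large = ridge_fit(B, y, lam=1.0)    # heavily regularized
print(np.linalg.norm(theta_small), np.linalg.norm(theta_large))
```

The printed norms show the penalty at work: a larger λ shrinks the parameter vector.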

Key insight: Regularization is equivalent to a prior belief that parameters should be small. It stabilizes the fit when data is scarce or noisy. The optimal λ can be found by cross-validation (Chapter 6 below).

Chapter 5: Model Selection

How do you choose between different surrogates (polynomial vs. RBF, different ℓ, different λ)? You need to estimate how well each model generalizes to unseen data, not just how well it fits the training data.

The holdout method splits the data into a training set and a test set: fit on one, evaluate on the other. It is simple, but it wastes data (the test points never inform the fit) and the result depends on the particular split.
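
A minimal holdout sketch, assuming NumPy, synthetic data, and polynomial surrogates of a few candidate degrees (all invented for illustration). Changing the seed changes the split and therefore the error estimates, which is the weakness noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30
x = rng.uniform(-1, 1, m)
y = np.sin(3 * x) + 0.05 * rng.standard_normal(m)

# Shuffle the indices, then hold out the last third for testing.
perm = rng.permutation(m)
train, test = perm[:20], perm[20:]

def holdout_mse(degree):
    # Fit on the training points only, score on the held-out points.
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[test])
    return float(np.mean((pred - y[test]) ** 2))

errors = {d: holdout_mse(d) for d in (1, 3, 5)}
print(errors)
```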

Think of it this way: A model that memorizes the training data scores perfectly on training but fails on new data. We need to test on data the model has never seen. The question is how to split our precious, expensive data efficiently.

Chapter 6: Cross Validation

k-fold cross validation partitions data into k sets. For each set, train on the other k−1 sets and test on the held-out set. Average the k error estimates. This uses all data for both training and testing.

Leave-one-out (k = m) trains on m−1 points and tests on 1, repeated for every point. It uses the most data for training but requires fitting m models.

Practical use: To choose λ for an RBF surrogate, sweep λ over a range and pick the value with the lowest cross-validation error. This automatically balances overfitting and underfitting.
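
The λ sweep could look roughly like the sketch below, assuming NumPy, a 1-D Gaussian RBF surrogate with a fixed length scale of 0.3, and 5 folds; every name and value here is an illustrative assumption. The same fold assignment is reused for every λ so the comparison is not confounded by different splits.

```python
import numpy as np

def gaussian_design(x_eval, x_centers, length_scale):
    # Matrix of kernel values between evaluation points and centers (1-D).
    r = np.abs(x_eval[:, None] - x_centers[None, :])
    return np.exp(-r**2 / (2.0 * length_scale**2))

def cv_error(x, y, lam, length_scale, k, seed=0):
    rng = np.random.default_rng(seed)          # same folds for every lam
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        B = gaussian_design(x[train], x[train], length_scale)
        theta = np.linalg.solve(B.T @ B + lam * np.eye(len(train)), B.T @ y[train])
        pred = gaussian_design(x[test], x[train], length_scale) @ theta
        errs.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)

lambdas = [1e-6, 1e-4, 1e-2, 1.0, 100.0]
cv = [cv_error(x, y, lam, length_scale=0.3, k=5) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv))]
print(dict(zip(lambdas, cv)), best_lambda)
```

Setting `k` equal to the number of points turns the same function into leave-one-out cross validation, at the cost of one fit per point.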

Chapter 7: Bootstrap

The bootstrap method creates multiple training sets by sampling with replacement from the original data. Each bootstrap sample consists of m indices drawn independently and uniformly, so some points appear multiple times and, on average, about 36.8% of the points (≈ 1/e for large m) are left out.
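
The left-out fraction is easy to check numerically. The sketch below assumes NumPy, with m = 1000 and 200 bootstrap samples chosen arbitrarily; it compares the simulated fraction with (1 − 1/m)^m, the probability that a given point is never drawn, which tends to 1/e.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000        # dataset size (arbitrary for this demo)
n_boot = 200    # number of bootstrap samples

left_out = []
for _ in range(n_boot):
    idx = rng.integers(0, m, size=m)               # m draws with replacement
    left_out.append(1.0 - np.unique(idx).size / m) # fraction never drawn

frac = float(np.mean(left_out))
print(frac, (1 - 1 / m) ** m, np.exp(-1))
```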

The leave-one-out bootstrap evaluates each model only on points not in its bootstrap sample, removing the optimistic bias of testing on training data.
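
One way to sketch the leave-one-out bootstrap estimate, assuming NumPy, synthetic data, a cubic polynomial surrogate, and squared error as the loss (all illustrative assumptions): for each point, average the error over only those bootstrap models that never saw it, then average over points.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_boot, degree = 40, 100, 3
x = rng.uniform(-1, 1, m)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(m)

sq_err_sum = np.zeros(m)   # accumulated squared error per point
count = np.zeros(m)        # number of models that did not train on the point

for _ in range(n_boot):
    idx = rng.integers(0, m, size=m)             # bootstrap sample
    out = np.setdiff1d(np.arange(m), idx)        # points absent from the sample
    coeffs = np.polyfit(x[idx], y[idx], degree)  # fit on the sample only
    sq_err_sum[out] += (np.polyval(coeffs, x[out]) - y[out]) ** 2
    count[out] += 1

seen = count > 0           # points left out by at least one model
loo_boot_mse = float(np.mean(sq_err_sum[seen] / count[seen]))
print(loo_boot_mse)
```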

Think of it this way: The bootstrap simulates having multiple datasets by resampling. It is like dealing poker hands from the same deck with replacement — each hand is different, but all come from the same source.

Chapter 8: Summary

| Concept | Key Fact |
| --- | --- |
| Surrogate model | Cheap approximation of an expensive function |
| Basis functions | Inject nonlinearity while keeping fitting linear in parameters |
| RBFs | Center a kernel at each data point; length scale controls smoothness |
| Regularization | Penalize parameter magnitude to prevent overfitting |
| Cross validation | Estimate generalization error; choose hyperparameters |

Looking ahead: Chapter 18 introduces Gaussian processes — probabilistic surrogates that not only predict function values but also quantify uncertainty in those predictions.