Kochenderfer & Wheeler, Chapter 17

Surrogate Models

Can't afford to evaluate the real function? Build a cheap approximation and optimize that instead.

Prerequisites: Chapter 16 (Sampling Plans).

Chapter 0: Why Surrogates?

A single CFD simulation might take hours. Training a neural network can take days. Running an optimizer that needs hundreds or thousands of evaluations directly on such functions is impractical, because each evaluation is too expensive. Instead, build a surrogate model: a cheap-to-evaluate approximation that captures the function's essential behavior.

The core idea: Evaluate the expensive function at a few carefully chosen points (Chapter 16). Fit a simple model through those points. Optimize the model instead of the real function. The surrogate lets you perform thousands of cheap evaluations to find a promising region, then verify with one expensive evaluation.
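The four steps above can be sketched in a few lines. This is only an illustration under assumptions: `expensive_f` is a hypothetical stand-in (cheap here so the sketch runs), the sample points are an even grid rather than a proper sampling plan, and the surrogate is a quadratic fit with NumPy's `polyfit`.

```python
import numpy as np

def expensive_f(x):
    # Hypothetical stand-in for a costly simulation.
    return (x - 0.3) ** 2 + 0.1 * x ** 4

# 1. Evaluate the expensive function at a few sample points.
X = np.linspace(-1.0, 1.0, 7)
y = expensive_f(X)

# 2. Fit a simple model to those points (here: a least-squares quadratic).
coeffs = np.polyfit(X, y, deg=2)

# 3. Optimize the cheap surrogate using thousands of evaluations.
grid = np.linspace(-1.0, 1.0, 10_001)
x_best = grid[np.argmin(np.polyval(coeffs, grid))]

# 4. Verify the candidate with one more expensive evaluation.
y_best = expensive_f(x_best)
```

In this toy case the verified value `y_best` is lower than any of the seven sampled values, even though the surrogate is not exact. In practice the new point would usually be added to the data and the surrogate refit.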

Surrogate models are used when:

The true objective function is too expensive to evaluate many times
The objective function is linear
Gradients are available

Chapter 1: Linear Models

The simplest surrogate: f̂(x) = θ^⊤x + θ₀. Fit via least squares: (θ*, θ₀*) = argmin ∑_i (y_i − θ^⊤x_i − θ₀)². This is the hyperplane that best fits the data points in the least-squares sense; it generally does not pass through them exactly.

Linear models are fast to fit, easy to interpret, and provide baselines. But they cannot capture nonlinear behavior — if the function curves, a plane will be a poor approximation.

Think of it this way: A linear surrogate resembles a first-order Taylor approximation, but it is fit to multiple data points rather than built from the value and gradient at a single point. It captures the trend but misses the shape.
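A minimal sketch of the least-squares fit, assuming NumPy. The data, the "true" parameters, and the name `f_hat` are made up for illustration; the intercept θ₀ is handled by appending a column of ones to the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 samples of a 2-D function that is linear plus small noise.
X = rng.uniform(-1.0, 1.0, size=(20, 2))
theta_true, theta0_true = np.array([2.0, -1.0]), 0.5
y = X @ theta_true + theta0_true + 0.01 * rng.standard_normal(20)

# Append a column of ones so the intercept is fit together with theta.
A = np.hstack([X, np.ones((X.shape[0], 1))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
theta, theta0 = params[:-1], params[-1]

def f_hat(x):
    """Linear surrogate: theta^T x + theta0 (x may be one point or a batch)."""
    return x @ theta + theta0
```

Because the toy data really is linear, the recovered `theta` and `theta0` land close to the values used to generate it. On a curved function the same fit would only capture the overall trend.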

A linear surrogate model is limited because it cannot capture:

Nonlinear (curved) behavior of the objective function
Data with more than two variables
Positive function values

Chapter 2: Basis Functions

To capture nonlinearity, express the surrogate as a linear combination of basis functions:

f̂(x) = ∑_j θ_j b_j(x)

Common basis function families: polynomials (b_j(x) = x^j), sinusoids, and radial basis functions. The model is still linear in the parameters θ, so fitting is a least-squares problem, but the basis functions inject nonlinearity in x.

Key insight: The choice of basis functions encodes your prior knowledge about the function. Polynomials assume smoothness. Sinusoids assume periodicity. Radial basis functions assume local influence. A good basis set can dramatically reduce the number of data points needed.
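As a rough sketch of fitting with basis functions, the example below uses the polynomial basis b_j(x) = x^j in one dimension. The target `sin(3x)`, the degree, and the helper name `poly_basis` are assumptions chosen for illustration.

```python
import numpy as np

def poly_basis(x, degree):
    """Basis matrix B with B[i, j] = x_i**j for j = 0..degree (1-D inputs)."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical nonlinear target sampled at 15 points.
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(3 * x)

# The model is linear in theta, so fitting is ordinary least squares on B.
B = poly_basis(x, degree=5)
theta, *_ = np.linalg.lstsq(B, y, rcond=None)

# Compare against a plain linear fit on a dense grid.
x_new = np.linspace(-1.0, 1.0, 200)
max_err = np.max(np.abs(poly_basis(x_new, 5) @ theta - np.sin(3 * x_new)))

theta_lin, *_ = np.linalg.lstsq(poly_basis(x, 1), y, rcond=None)
max_err_linear = np.max(np.abs(poly_basis(x_new, 1) @ theta_lin - np.sin(3 * x_new)))
```

Swapping `poly_basis` for a sinusoidal or radial basis changes only how `B` is built; the least-squares step stays the same.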

A model that is linear in parameters but uses nonlinear basis functions can capture:

Nonlinear relationships while remaining easy to fit via least squares
Only linear relationships
Nothing more than a linear model

Chapter 3: Radial Basis Functions

Radial basis functions (RBFs) center a basis function at each data point: b_j(x) = φ(||x − x_j||). The surrogate is ∑θ_jφ(||x − x_j||). Popular kernels:

Kernel	Formula	Character
Gaussian	exp(−r²/(2ℓ²))	Smooth, localized
Multiquadric	√(r² + c²)	Global influence
Thin-plate spline	r² log(r)	Minimum curvature

Why RBFs work well: With one basis function per data point and no regularization (λ = 0, see Chapter 4), they interpolate the data exactly. For the Gaussian kernel, the length scale ℓ controls how far each point's influence extends. Small ℓ gives a wiggly surface that fits noise; large ℓ gives a smooth surface that may miss features. Choosing ℓ is the key modeling decision.
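A minimal sketch of exact RBF interpolation with the Gaussian kernel, assuming NumPy. The function names, the test function, and the value ℓ = 0.2 are illustrative choices. The weights θ come from solving the square system Φθ = y, where Φ_ij = φ(||x_i − x_j||).

```python
import numpy as np

def gaussian_kernel(r, ell):
    return np.exp(-r**2 / (2 * ell**2))

def pairwise_dist(A, B):
    """Euclidean distances between rows of A (n x d) and rows of B (m x d)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def rbf_fit(X, y, ell):
    # One basis function per data point gives a square system to solve.
    return np.linalg.solve(gaussian_kernel(pairwise_dist(X, X), ell), y)

def rbf_predict(X_train, theta, ell, X_new):
    return gaussian_kernel(pairwise_dist(X_new, X_train), ell) @ theta

# Hypothetical 1-D data, stored as an (8, 1) array.
X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])

ell = 0.2
theta = rbf_fit(X, y, ell)
y_at_data = rbf_predict(X, theta, ell, X)   # reproduces y up to round-off
y_new = rbf_predict(X, theta, ell, np.array([[0.25], [0.6]]))
```

One practical caveat: as ℓ grows relative to the point spacing, the kernel matrix becomes increasingly ill-conditioned, so a very large ℓ can make this direct solve numerically unreliable.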

In an RBF surrogate, the length scale parameter ℓ controls:

How far each data point's influence extends (smoothness vs. fidelity)
The number of data points used
The dimensionality of the problem

Chapter 4: Fitting Noisy Data

Real data is noisy. Exact interpolation through noisy points overfits — the surrogate fits the noise, not the signal. Regularization adds a penalty term:

minimize ∑(y_i − f̂(x_i))² + λ||θ||²

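The regularization parameter λ controls the tradeoff: λ = 0 recovers the unregularized least-squares fit (exact interpolation when there is one basis function per data point, hence the overfitting risk), while large λ gives an overly smooth fit (underfitting). With the basis matrix B, where B_ij = b_j(x_i), the solution becomes θ* = (B^⊤B + λI)⁻¹B^⊤y.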

Key insight: Regularization is equivalent to a prior belief that parameters should be small. It stabilizes the fit when data is scarce or noisy. The optimal λ can be found by cross-validation (Chapter 6 below).
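The closed-form solution can be sketched as follows, assuming NumPy. The helper name `ridge_fit`, the noisy data, and the degree-9 polynomial basis are assumptions for illustration. The linear system is solved directly instead of forming the inverse.

```python
import numpy as np

def ridge_fit(B, y, lam):
    """Solve (B^T B + lam * I) theta = B^T y without an explicit inverse."""
    n = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(n), B.T @ y)

# Hypothetical noisy samples and a flexible (degree-9 polynomial) basis.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)
B = np.vander(x, 10, increasing=True)

theta_small = ridge_fit(B, y, lam=1e-6)   # nearly unregularized
theta_large = ridge_fit(B, y, lam=1.0)    # strongly regularized

norm_small = np.linalg.norm(theta_small)
norm_large = np.linalg.norm(theta_large)  # shrinks as lam increases
```

Comparing `norm_small` with `norm_large` shows the penalty at work: the larger λ pulls the coefficients toward zero, which is what smooths the fit.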

Regularization prevents overfitting by:

Penalizing large parameter values, producing a smoother fit
Adding more data points
Removing outliers

Chapter 5: Model Selection

How do you choose between different surrogates (polynomial vs. RBF, different ℓ, different λ)? You need to estimate how well each model generalizes to unseen data, not just how well it fits the training data.

The holdout method splits data into training and test sets. Train on one, evaluate on the other. It is simple, but it wastes data, and the results depend on the particular split.

Think of it this way: A model that memorizes the training data scores perfectly on training but fails on new data. We need to test on data the model has never seen. The question is how to split our precious, expensive data efficiently.
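A small sketch of the holdout method, assuming NumPy. The data, the 20/10 split, and the candidate polynomial degrees are hypothetical. It reports training and test error side by side for each candidate model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy data set of 30 points.
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)

# Random split: 20 points for training, 10 held out for testing.
perm = rng.permutation(30)
train, test = perm[:20], perm[20:]

def rmse(coeffs, idx):
    return float(np.sqrt(np.mean((np.polyval(coeffs, x[idx]) - y[idx]) ** 2)))

def holdout_rmse(degree):
    """Fit on the training split only; return (train RMSE, test RMSE)."""
    coeffs = np.polyfit(x[train], y[train], degree)
    return rmse(coeffs, train), rmse(coeffs, test)

results = {d: holdout_rmse(d) for d in (1, 3, 9)}
```

Training error can only go down as the degree increases, which is why it cannot be used to pick a model. The test error is the number to compare, bearing in mind that it depends on this particular split.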

Generalization error measures how well a model performs on:

Unseen data (not used for training)
The training data itself
The parameter values

Chapter 6: Cross Validation

k-fold cross validation partitions data into k sets. For each set, train on the other k−1 sets and test on the held-out set. Average the k error estimates. This uses all data for both training and testing.

Leave-one-out (k = m) trains on m−1 points and tests on 1, repeated for every point. It uses the most data for training but requires fitting m models.

Practical use: To choose λ for an RBF surrogate, sweep λ over a range and pick the value with the lowest cross-validation error. This automatically balances overfitting and underfitting.
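The sweep described above might look like the sketch below, assuming NumPy. The function names, the data, the fixed ℓ = 0.2, and the candidate λ values are assumptions for illustration. Each fold fits a regularized Gaussian RBF model on the other folds and scores it on the held-out fold.

```python
import numpy as np

def gaussian_design(x_a, x_b, ell):
    """Matrix of Gaussian kernel values between 1-D point sets x_a and x_b."""
    r = np.abs(x_a[:, None] - x_b[None, :])
    return np.exp(-r**2 / (2 * ell**2))

def kfold_cv_error(x, y, lam, ell, k, rng):
    folds = np.array_split(rng.permutation(x.size), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        B = gaussian_design(x[train], x[train], ell)
        # Regularized least squares: (B^T B + lam I) theta = B^T y
        theta = np.linalg.solve(B.T @ B + lam * np.eye(train.size), B.T @ y[train])
        pred = gaussian_design(x[test], x[train], ell) @ theta
        errs.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errs))

# Hypothetical noisy data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)

# Reuse the same seed so every lambda is scored on identical folds.
lambdas = [1e-6, 1e-4, 1e-2, 1.0, 100.0]
cv_errors = [kfold_cv_error(x, y, lam, ell=0.2, k=5, rng=np.random.default_rng(42))
             for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]
```

Setting `k` equal to the number of points turns the same function into leave-one-out cross validation. The length scale ℓ could be swept the same way, either separately or jointly with λ.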

k-fold cross validation improves over a single holdout split by:

Using all data for both training and testing across k rounds
Requiring no test data
Training only one model

Chapter 7: Bootstrap

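The bootstrap method creates multiple training sets by sampling with replacement from the original data. Each bootstrap sample consists of m indices drawn independently and uniformly from the m data points, so some points appear multiple times and others are left out. A given point is missed by all m draws with probability (1 − 1/m)^m, which approaches 1/e ≈ 36.8% as m grows.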

The leave-one-out bootstrap evaluates each model only on points not in its bootstrap sample, removing the optimistic bias of testing on training data.

Think of it this way: The bootstrap simulates having multiple datasets by resampling. It is like dealing poker hands from the same deck with replacement — each hand is different, but all come from the same source.
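A sketch of both ideas, assuming NumPy: drawing one bootstrap sample and checking the left-out fraction, then a leave-one-out bootstrap error estimate. The helper name `loo_bootstrap_mse`, the cubic polynomial model, and the data are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# One bootstrap sample: m indices drawn independently with replacement.
m = 1000
sample = rng.integers(0, m, size=m)
left_out = np.setdiff1d(np.arange(m), sample)
frac_left_out = left_out.size / m   # expected value is (1 - 1/m)**m, about 0.368

def loo_bootstrap_mse(x, y, degree, n_boot, rng):
    """Average each point's squared error over models whose sample excluded it."""
    n = x.size
    sq_err_sum, count = np.zeros(n), np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        out = np.setdiff1d(np.arange(n), idx)
        if out.size == 0:
            continue  # nothing held out in this sample
        coeffs = np.polyfit(x[idx], y[idx], degree)
        sq_err_sum[out] += (np.polyval(coeffs, x[out]) - y[out]) ** 2
        count[out] += 1
    seen = count > 0  # points left out by at least one sample
    return float(np.mean(sq_err_sum[seen] / count[seen]))

# Hypothetical noisy data and a cubic polynomial surrogate.
x = np.linspace(-1.0, 1.0, 25)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(25)
mse = loo_bootstrap_mse(x, y, degree=3, n_boot=50, rng=rng)
```

Each point's error is averaged only over the models that never saw it, so no model is ever scored on its own training data.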

A bootstrap sample differs from the original dataset because:

Some points appear multiple times and ~37% are excluded
It has more data points
It uses a completely different dataset

Chapter 8: Summary

Concept	Key Fact
Surrogate model	Cheap approximation of an expensive function
Basis functions	Inject nonlinearity while keeping fitting linear in parameters
RBFs	Center a kernel at each data point; length scale controls smoothness
Regularization	Penalize parameter magnitude to prevent overfitting
Cross validation	Estimate generalization error; choose hyperparameters

Looking ahead: Chapter 18 introduces Gaussian processes — probabilistic surrogates that not only predict function values but also quantify uncertainty in those predictions.

The key hyperparameter in an RBF surrogate is:

The length scale ℓ (or regularization λ), chosen by cross validation
The number of design variables
The gradient direction