Can't afford to evaluate the real function? Build a cheap approximation and optimize that instead.
A single CFD simulation might take hours, and training a neural network can take days. Running gradient descent directly on such functions is impractical, because it needs many evaluations and each one is too expensive. Instead, build a surrogate model: a cheap-to-evaluate approximation that captures the function's essential behavior.
The simplest surrogate is linear: f̂(x) = θᵀx + θ₀. Fit it by least squares: (θ*, θ₀*) = argmin ∑ᵢ (yᵢ − θᵀxᵢ − θ₀)². The result is a hyperplane fit to the data points; in general it does not pass through them.
Linear models are fast to fit, easy to interpret, and make useful baselines. But they cannot capture nonlinear behavior: if the function curves, a plane will approximate it poorly.
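
As a minimal sketch of the linear surrogate, the snippet below fits θ and θ₀ with NumPy's least-squares solver. The helper names `fit_linear` and `predict_linear` and the toy data are assumptions made for illustration, not part of the original text.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of f_hat(x) = theta^T x + theta0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the intercept theta0 is fit together with theta.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def predict_linear(theta, theta0, X):
    """Evaluate the fitted hyperplane at the rows of X."""
    return np.asarray(X, dtype=float) @ theta + theta0

# Toy data (hypothetical): an exactly linear function of two inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5
theta, theta0 = fit_linear(X, y)

# The same model applied to a curved function leaves a visible residual.
y_curved = X[:, 0] ** 2
theta_c, theta0_c = fit_linear(X, y_curved)
residual_curved = np.max(np.abs(predict_linear(theta_c, theta0_c, X) - y_curved))
```

On the linear toy data the fit recovers the generating coefficients; on the curved data the largest residual stays well above zero, which is the limitation described above.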
To capture nonlinearity, express the surrogate as a linear combination of basis functions: f̂(x) = ∑ⱼ θⱼ bⱼ(x).
Common basis function families include polynomials (bⱼ(x) = xʲ for scalar x), sinusoids, and radial basis functions. The model is still linear in the parameters θ, so fitting remains a linear least-squares problem, while the basis functions supply the nonlinearity in x.
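
The following sketch illustrates the idea for a one-dimensional polynomial basis. It assumes scalar inputs, and the names `polynomial_design_matrix` and `fit_basis` are hypothetical helpers introduced here.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Return B with B[i, j] = x_i ** j for j = 0..degree (scalar inputs)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)

def fit_basis(B, y):
    """Least-squares parameters for a model that is linear in theta."""
    theta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return theta

# Toy data (hypothetical): an exact quadratic sampled at 15 points.
x = np.linspace(-1.0, 1.0, 15)
y = 1.0 - 2.0 * x + 3.0 * x**2
B = polynomial_design_matrix(x, degree=2)
theta = fit_basis(B, y)
```

Only the construction of `B` depends on the basis family; swapping in sinusoids or RBFs would leave `fit_basis` unchanged.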
Radial basis functions (RBFs) center a basis function at each data point: bⱼ(x) = φ(‖x − xⱼ‖). The surrogate is f̂(x) = ∑ⱼ θⱼ φ(‖x − xⱼ‖). Popular kernels:
| Kernel | Formula | Character |
|---|---|---|
| Gaussian | exp(−r²/(2ℓ²)) | Smooth, localized |
| Multiquadric | √(r² + c²) | Global influence |
| Thin-plate spline | r² log(r) | Minimum curvature |
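
A small sketch of RBF interpolation with the Gaussian kernel from the table is given below. It assumes distinct data points, so that the kernel matrix is invertible, and the function names and the length scale of 0.2 are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def gaussian_kernel(r, length_scale=1.0):
    """phi(r) = exp(-r^2 / (2 l^2))."""
    return np.exp(-r**2 / (2.0 * length_scale**2))

def pairwise_distances(A, C):
    """Euclidean distances between the rows of A and the rows of C."""
    diff = A[:, None, :] - C[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def fit_rbf(X, y, kernel):
    """Solve Phi theta = y with Phi[i, j] = phi(||x_i - x_j||)."""
    Phi = kernel(pairwise_distances(X, X))
    return np.linalg.solve(Phi, y)

def predict_rbf(X_train, theta, kernel, X_new):
    """Evaluate sum_j theta_j * phi(||x - x_j||) at the rows of X_new."""
    return kernel(pairwise_distances(X_new, X_train)) @ theta

# Toy data (hypothetical): six samples of a sine wave on [0, 1].
X = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
kernel = lambda r: gaussian_kernel(r, length_scale=0.2)
theta = fit_rbf(X, y, kernel)
y_at_data = predict_rbf(X, theta, kernel, X)
```

With one basis function per data point and no regularization, the surrogate reproduces the training values up to round-off.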
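Real data is noisy. Exact interpolation through noisy points overfits: the surrogate fits the noise, not the signal. Regularization adds a penalty on the parameter magnitude to the least-squares objective: θ* = argmin ‖y − Bθ‖² + λ‖θ‖², where B is the design matrix with entries Bᵢⱼ = bⱼ(xᵢ).
The regularization parameter λ ≥ 0 controls the tradeoff. With λ = 0 the fit is ordinary least squares, which interpolates the data exactly when there are as many independent basis functions as data points, as with RBFs (overfitting risk); large λ gives an overly smooth fit (underfitting). The solution is θ* = (BᵀB + λI)⁻¹Bᵀy.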
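
The closed-form solution can be sketched directly in NumPy. The helper name `fit_ridge`, the degree-7 polynomial basis, and the two λ values are assumptions chosen only to show the effect of the penalty.

```python
import numpy as np

def fit_ridge(B, y, lam):
    """theta = (B^T B + lam I)^{-1} B^T y, computed with a linear solve."""
    n_basis = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(n_basis), B.T @ y)

# Toy data (hypothetical): noisy samples of sin(3x) with a degree-7 polynomial basis.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)
B = np.vander(x, 8, increasing=True)

theta_small = fit_ridge(B, y, lam=1e-8)   # close to plain least squares
theta_large = fit_ridge(B, y, lam=10.0)   # heavily penalized

train_err_small = np.linalg.norm(B @ theta_small - y)
train_err_large = np.linalg.norm(B @ theta_large - y)
```

Increasing λ shrinks the parameter vector and raises the training error, which is the tradeoff described above. Solving the linear system is generally preferable to forming the matrix inverse explicitly.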
How do you choose between different surrogates (polynomial vs. RBF, different ℓ, different λ)? You need to estimate how well each model generalizes to unseen data, not just how well it fits the training data.
The holdout method splits the data into training and test sets: train on one, evaluate on the other. It is simple, but the held-out points never inform the fit, and the error estimate depends on the particular split.
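
A minimal holdout sketch follows, assuming a polynomial surrogate and mean squared error as the metric. The function `holdout_mse`, the 30% test fraction, and the toy data are illustrative assumptions.

```python
import numpy as np

def holdout_mse(x, y, degree, test_fraction=0.3, seed=0):
    """Fit a polynomial on a random training split; return test-set MSE."""
    rng = np.random.default_rng(seed)
    m = len(x)
    perm = rng.permutation(m)
    n_test = max(1, int(round(test_fraction * m)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    B_train = np.vander(x[train_idx], degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(B_train, y[train_idx], rcond=None)
    B_test = np.vander(x[test_idx], degree + 1, increasing=True)
    return float(np.mean((B_test @ theta - y[test_idx]) ** 2))

# Toy data (hypothetical): a noisy quadratic.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 40)
y = x**2 + 0.05 * rng.standard_normal(40)

# Two different random splits of the same data give two different estimates.
mse_split_a = holdout_mse(x, y, degree=2, seed=0)
mse_split_b = holdout_mse(x, y, degree=2, seed=1)
```

Comparing `mse_split_a` and `mse_split_b` shows how much the estimate can move with the choice of split.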
k-fold cross validation partitions the data into k sets (folds). For each fold, train on the other k−1 folds and test on the held-out fold, then average the k error estimates. Every point is used for testing exactly once and for training k−1 times.
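
The procedure can be sketched as below, again assuming a polynomial surrogate and mean squared error. The function `k_fold_mse` and the candidate degrees are hypothetical choices used to show model selection.

```python
import numpy as np

def k_fold_mse(x, y, degree, k, seed=0):
    """Average test MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        B_train = np.vander(x[train_idx], degree + 1, increasing=True)
        theta, *_ = np.linalg.lstsq(B_train, y[train_idx], rcond=None)
        B_test = np.vander(x[test_idx], degree + 1, increasing=True)
        errors.append(np.mean((B_test @ theta - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Toy data (hypothetical): noisy samples of sin(3x).
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + 0.05 * rng.standard_normal(30)

# Compare candidate surrogates by their cross-validated error.
scores = {d: k_fold_mse(x, y, degree=d, k=5) for d in (1, 3, 5)}
best_degree = min(scores, key=scores.get)

# Setting k to the number of points puts exactly one point in each fold.
loo_score = k_fold_mse(x, y, degree=3, k=len(x))
```

The same loop works for any surrogate family; only the fitting and prediction lines would change.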
Leave-one-out cross validation is the special case k = m, where m is the number of data points: train on m−1 points and test on the remaining one, repeated for every point. It uses the most data for training but requires fitting m models.
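The bootstrap method creates multiple training sets by sampling with replacement from the original data. Each bootstrap sample consists of m indices drawn independently and uniformly, so some points appear multiple times while others are left out. A given point is missed with probability (1 − 1/m)ᵐ, which approaches 1/e ≈ 36.8% for large m.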
The leave-one-out bootstrap evaluates each model only on points not in its bootstrap sample, removing the optimistic bias of testing on training data.
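
One way to sketch the leave-one-out bootstrap is shown below. It assumes a polynomial surrogate and squared error, averages each point's error over the bootstrap models that did not see it, and then averages over points. The function name, the number of bootstrap samples, and the toy data are illustrative assumptions.

```python
import numpy as np

def leave_one_out_bootstrap_mse(x, y, degree, n_boot=200, seed=0):
    """Estimate generalization MSE using only out-of-sample points."""
    rng = np.random.default_rng(seed)
    m = len(x)
    err_sum = np.zeros(m)
    err_count = np.zeros(m)
    left_out_fractions = []
    for _ in range(n_boot):
        idx = rng.integers(0, m, size=m)        # m indices, with replacement
        out = np.setdiff1d(np.arange(m), idx)   # points absent from this sample
        left_out_fractions.append(out.size / m)
        if out.size == 0:
            continue
        B = np.vander(x[idx], degree + 1, increasing=True)
        theta, *_ = np.linalg.lstsq(B, y[idx], rcond=None)
        B_out = np.vander(x[out], degree + 1, increasing=True)
        err_sum[out] += (B_out @ theta - y[out]) ** 2
        err_count[out] += 1
    # Average per point first, then over points left out at least once.
    seen = err_count > 0
    estimate = float(np.mean(err_sum[seen] / err_count[seen]))
    return estimate, float(np.mean(left_out_fractions))

# Toy data (hypothetical): a noisy quadratic.
rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 50)
y = x**2 + 0.05 * rng.standard_normal(50)
mse_estimate, mean_left_out = leave_one_out_bootstrap_mse(x, y, degree=2)
```

The second return value is the average fraction of points missing from a bootstrap sample, which can be compared against the left-out fraction quoted above.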
| Concept | Key Fact |
|---|---|
| Surrogate model | Cheap approximation of an expensive function |
| Basis functions | Inject nonlinearity while keeping fitting linear in parameters |
| RBFs | Center a kernel at each data point; length scale controls smoothness |
| Regularization | Penalize parameter magnitude to prevent overfitting |
| Cross validation | Estimate generalization error; choose hyperparameters |