Kochenderfer & Wheeler, Chapter 18

Probabilistic Surrogate Models

Gaussian processes: surrogates that tell you not just what the function value probably is, but how uncertain they are about it.

Prerequisites: Chapter 17 (Surrogate Models).

Chapter 0: Why Uncertainty?

A standard surrogate tells you "the function value at x is about 3.7." A Gaussian process tells you "the function value at x is about 3.7 ± 0.8." That uncertainty estimate is crucial for optimization: it tells you where to look next.

The core idea: A Gaussian process (GP) is a probability distribution over functions. Given data, it provides a posterior distribution: a predicted mean (best guess) and a predicted variance (uncertainty). Near observed points, variance is low. Far from observations, variance is high. This uncertainty drives exploration in surrogate optimization.

Chapter 1: Gaussian Processes

A GP is defined by a mean function m(x) and a covariance function (kernel) k(x, x'). Any finite collection of function values follows a multivariate Gaussian distribution with mean m(X) and covariance matrix K(X, X), where Kᵢⱼ = k(xᵢ, xⱼ).

The mean function encodes your prior belief about the function's average behavior. The kernel encodes your belief about smoothness, periodicity, or other structure. Together, they define a prior over functions before seeing any data.

Key insight: The kernel is the heart of a GP. It says: "how correlated are function values at x and x'?" The squared exponential kernel k(x,x') = exp(−||x−x'||²/(2ℓ²)) says that nearby points have similar values, with ℓ controlling how quickly correlation decays with distance.
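To make the definition concrete, here is a minimal sketch that builds the covariance matrix K for a few points and draws one sample from a zero-mean GP prior. The function names, the three example points, the zero mean function, and the small diagonal jitter are all assumptions made for illustration, not something the chapter prescribes.

```python
import math
import random

def sq_exp_kernel(x, x2, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2)) for tuples x, x2."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x2))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def covariance_matrix(X, kernel, jitter=1e-9):
    """K[i][j] = kernel(X[i], X[j]); the jitter on the diagonal is an
    assumed numerical safeguard, not part of the GP definition."""
    n = len(X)
    return [[kernel(X[i], X[j]) + (jitter if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_prior(X, kernel, rng):
    """Draw one zero-mean prior sample f ~ N(0, K) as f = L z with z ~ N(0, I)."""
    L = cholesky(covariance_matrix(X, kernel))
    z = [rng.gauss(0.0, 1.0) for _ in X]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(X))]

# Hypothetical 1-D points: two close together, one far away.
X = [(0.0,), (0.5,), (3.0,)]
K = covariance_matrix(X, sq_exp_kernel)
f = sample_prior(X, sq_exp_kernel, random.Random(0))
```

The entry K[0][1] (points 0.5 apart) is much larger than K[0][2] (points 3.0 apart), which is the "nearby points have similar values" statement in matrix form.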

Chapter 2: Kernel Functions

The kernel controls what kinds of functions the GP considers plausible:

Kernel | Formula (r = ‖x−x'‖) | Character
--- | --- | ---
Squared exponential | exp(−r²/(2ℓ²)) | Infinitely smooth
Matérn (ν=3/2) | (1 + √3·r/ℓ) exp(−√3·r/ℓ) | Once differentiable
Exponential | exp(−r/ℓ) | Rough, continuous but not differentiable
Rational quadratic | (1 + r²/(2αℓ²))^(−α) | Mix of length scales
Think of it this way: The squared exponential kernel assumes the function is very smooth (infinitely differentiable). If your real function has kinks or sharp features, a Matérn kernel with lower ν is more appropriate. The length scale ℓ controls how quickly the correlation between two points decays with distance.
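The four kernels can be written as plain functions of the distance r, which makes it easy to compare how fast each one decays. This is a sketch only: the function names and the default values ℓ = 1 and α = 1 are assumptions chosen for illustration.

```python
import math

def squared_exponential(r, ell=1.0):
    """Infinitely smooth: exp(-r^2 / (2 ell^2))."""
    return math.exp(-r ** 2 / (2.0 * ell ** 2))

def matern32(r, ell=1.0):
    """Matern with nu = 3/2: (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell)."""
    s = math.sqrt(3.0) * r / ell
    return (1.0 + s) * math.exp(-s)

def exponential(r, ell=1.0):
    """Rough kernel: exp(-r / ell)."""
    return math.exp(-r / ell)

def rational_quadratic(r, ell=1.0, alpha=1.0):
    """(1 + r^2 / (2 alpha ell^2))^(-alpha)."""
    return (1.0 + r ** 2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

KERNELS = {
    "squared exponential": squared_exponential,
    "matern 3/2": matern32,
    "exponential": exponential,
    "rational quadratic": rational_quadratic,
}

def correlation_table(distances, ell=1.0):
    """Correlation of each kernel at each distance, for a shared length scale."""
    return {name: [k(r, ell) for r in distances] for name, k in KERNELS.items()}

table = correlation_table([0.0, 0.1, 1.0, 3.0])
```

Every kernel returns 1 at r = 0. At a small distance such as r = 0.1, the exponential kernel has already lost more correlation than the Matérn kernel, which in turn has lost more than the squared exponential. That ordering is the roughness ranking from the table.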

Chapter 3: Prediction

Given observations (X, y), the GP posterior at a new point x has mean and variance:

μ̂(x) = m(x) + K(x,X) K(X,X)⁻¹ (y − m(X))
ν̂(x) = K(x,x) − K(x,X) K(X,X)⁻¹ K(X,x)

In the noise-free case, the predicted mean μ̂ interpolates the data, and the predicted variance ν̂ is zero at data points and increases with distance from observations. The variance formula does not involve y: for fixed kernel hyperparameters, uncertainty depends only on where you have data, not on what values you observed.

The beautiful property: Uncertainty is zero at observed points and grows as you move away. This makes intuitive sense: you are certain about what you have measured and uncertain about regions you have not explored. The GP captures this naturally through the posterior covariance.
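The posterior mean and variance formulas can be sketched in a few lines for 1-D inputs. This example assumes a zero mean function m(x) = 0, a squared exponential kernel with ℓ = 1, and a tiny diagonal jitter for numerical stability. The helper names and the data values are hypothetical, and the linear solves use a Cholesky factorization rather than an explicit inverse.

```python
import math

def kernel(a, b, ell=1.0):
    """Squared exponential kernel for scalar inputs."""
    return math.exp(-(a - b) ** 2 / (2.0 * ell ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X, y, x_new, jitter=1e-10):
    """Posterior mean and variance at x_new for a zero-mean GP (assumed)."""
    n = len(X)
    K = [[kernel(X[i], X[j]) + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    k_star = [kernel(x_new, xi) for xi in X]          # K(x, X)
    alpha = chol_solve(L, y)                          # K(X,X)^-1 y
    v = chol_solve(L, k_star)                         # K(X,X)^-1 K(X, x)
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    var = kernel(x_new, x_new) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, max(var, 0.0)                        # clip tiny negative round-off

# Hypothetical observations.
X = [0.0, 1.0, 2.5]
y = [0.5, -0.3, 1.2]
mean_at_data, var_at_data = gp_predict(X, y, 1.0)
mean_far, var_far = gp_predict(X, y, 10.0)
```

At the observed point x = 1.0 the mean reproduces the observation and the variance is essentially zero. At x = 10.0, far from all data, the mean falls back to the prior mean of 0 and the variance returns to the prior value k(x, x) = 1.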

Chapter 4: Confidence Intervals

From the predicted mean and standard deviation σ̂(x) = √ν̂(x), we can construct a 95% confidence interval: μ̂(x) ± 1.96σ̂(x). This band narrows near data points and widens between them.

Think of it this way: The confidence band is like fog. Where you have measured, the fog clears and you see the function clearly. Between measurements, the fog thickens and you are less sure of the function's true shape. The GP tells you exactly how thick the fog is.
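As a small worked example, the interval μ̂(x) ± 1.96σ̂(x) can be computed directly from a predicted mean and variance. The function name is hypothetical, and the numbers reuse the "3.7 ± 0.8" illustration from Chapter 0, treating 0.8 as the predicted standard deviation (an assumption).

```python
import math

def confidence_interval(mean, variance, z=1.96):
    """Return (lower, upper) for mean +/- z * sqrt(variance).

    z = 1.96 corresponds to a 95% interval under a Gaussian prediction.
    """
    sigma = math.sqrt(max(variance, 0.0))  # guard against tiny negative round-off
    return mean - z * sigma, mean + z * sigma

# Predicted mean 3.7 with standard deviation 0.8 (variance 0.64).
lower, upper = confidence_interval(3.7, 0.64)   # approximately (2.132, 5.268)

# At a noise-free data point the variance is zero and the band collapses.
lower_at_data, upper_at_data = confidence_interval(3.7, 0.0)
```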

Chapter 5: Gradient Observations

If gradient observations are available, they can be incorporated into the GP by extending the covariance structure. Derivatives of a GP are also Gaussian processes, with covariance functions derived from the original kernel via differentiation.

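For the squared exponential kernel, the covariance between the i-th gradient component at x and the function value at x' is k∇f(x,x')ᵢ = ∂k(x,x')/∂xᵢ = −((xᵢ − x'ᵢ)/ℓ²) k(x,x'), which reduces to −(xᵢ − x'ᵢ) k(x,x') when ℓ = 1. Including gradient observations can dramatically improve prediction quality with few additional evaluations.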

Key insight: Each gradient observation in n dimensions provides n additional pieces of information. For expensive functions where gradients are available (e.g., via adjoint methods), incorporating gradients into the GP makes the surrogate much more informative per evaluation.
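One way to sanity-check a gradient covariance is to compare the analytic derivative of the kernel with a finite-difference approximation. The sketch below does this for the squared exponential kernel with a general length scale, where the derivative carries a factor 1/ℓ² and matches the ℓ = 1 expression −(xᵢ − x'ᵢ)k(x,x') in that special case. Function names and test points are assumptions for illustration.

```python
import math

def k_ff(x, x2, ell=1.0):
    """Squared exponential covariance between function values at x and x2."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x2))
    return math.exp(-sq_dist / (2.0 * ell ** 2))

def k_grad_f(x, x2, ell=1.0):
    """Covariance between the gradient at x and the value at x2.

    Component i is d k_ff / d x_i = -(x_i - x2_i) / ell^2 * k_ff(x, x2).
    """
    k = k_ff(x, x2, ell)
    return [-(a - b) / ell ** 2 * k for a, b in zip(x, x2)]

def k_grad_f_numeric(x, x2, ell=1.0, h=1e-6):
    """Central finite-difference approximation of the same derivative."""
    grads = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grads.append((k_ff(x_plus, x2, ell) - k_ff(x_minus, x2, ell)) / (2.0 * h))
    return grads

# Hypothetical 2-D points and length scale.
x, x2, ell = (0.3, -1.2), (1.0, 0.5), 1.5
analytic = k_grad_f(x, x2, ell)
numeric = k_grad_f_numeric(x, x2, ell)
```

In n dimensions `k_grad_f` returns n numbers per pair of points, which is the "n additional pieces of information" carried by each gradient observation.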

Chapter 6: Noisy Measurements

Real measurements include noise: y = f(x) + ε, where ε ~ N(0, ν). Here ν is the noise variance, unrelated to the Matérn smoothness parameter ν from Chapter 2. To handle noise, add the noise variance to the diagonal of the covariance matrix: K(X,X) + νI. The GP no longer interpolates exactly through data points but instead smooths through the noise.

Think of it this way: Without noise, the GP passes exactly through every data point. With noise, it acknowledges that each measurement might be off, so it averages through them. The noise variance ν is like saying "I trust each measurement to within ±√ν."
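The effect of νI is easiest to see with a single observation, where the matrix inverse becomes a scalar division. This sketch assumes a zero-mean GP with prior variance k(x,x) = 1 and reports the posterior of the underlying function f at the observed location. The function name and numbers are hypothetical.

```python
def posterior_at_observation(y_obs, noise_var, signal_var=1.0):
    """Posterior mean and variance of f at the location of one noisy observation.

    With one data point, K(X,X) + noise_var * I is the scalar
    signal_var + noise_var, so the GP formulas reduce to a shrinkage factor.
    """
    gain = signal_var / (signal_var + noise_var)
    mean = gain * y_obs                        # pulled toward the prior mean 0
    variance = signal_var - gain * signal_var  # no longer zero when noise_var > 0
    return mean, variance

# Noise-free: the GP reproduces the measurement exactly.
mean_exact, var_exact = posterior_at_observation(2.0, noise_var=0.0)

# Noisy: the prediction is shrunk toward the prior and keeps some uncertainty.
mean_noisy, var_noisy = posterior_at_observation(2.0, noise_var=0.25)
```

With ν = 0.25 the predicted mean at the measured point is about 1.6 rather than 2.0, and the variance is about 0.2 rather than 0: the GP treats the measurement as evidence, not as ground truth.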

Chapter 7: Fitting GPs

The kernel hyperparameters (ℓ, signal variance, noise variance) must be chosen. The standard approach: maximum marginal likelihood. Maximize the probability of the observed data under the GP model with respect to the hyperparameters.

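log p(y | X, θ) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log(2π)

Here K is the covariance matrix of the n observations (including νI when measurements are noisy). The expression assumes a zero-mean prior; otherwise replace y with y − m(X).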

The first term rewards data fit. The second term penalizes model complexity (determinant of K). Together they implement automatic Occam's razor.

Automatic model selection: The marginal likelihood naturally balances fit and complexity. A kernel that is too wiggly (small ℓ) will have a high data fit term but a huge complexity penalty. A kernel that is too smooth (large ℓ) will have a low complexity penalty but poor fit. The optimum balances both.
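A minimal sketch of this idea evaluates the log marginal likelihood for a handful of candidate length scales and keeps the best one. It assumes a zero-mean GP, a squared exponential kernel on 1-D inputs, and a fixed noise variance. The grid search stands in for a proper optimizer, and the data and candidate values are hypothetical.

```python
import math

def kernel(a, b, ell):
    """Squared exponential kernel for scalar inputs."""
    return math.exp(-(a - b) ** 2 / (2.0 * ell ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(X, y, ell, noise_var=1e-2):
    """log p(y | X, theta) for a zero-mean GP with K = K(X,X) + noise_var * I."""
    n = len(X)
    K = [[kernel(X[i], X[j], ell) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    # Forward substitution L z = y, so that y^T K^-1 y = z^T z.
    z = [0.0] * n
    for i in range(n):
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    data_fit = -0.5 * sum(zi * zi for zi in z)
    complexity = -sum(math.log(L[i][i]) for i in range(n))  # equals -1/2 log|K|
    constant = -0.5 * n * math.log(2.0 * math.pi)
    return data_fit + complexity + constant

# Hypothetical data: samples of sin(x) on a coarse grid.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [math.sin(x) for x in X]
candidates = [0.1, 0.3, 1.0, 3.0]
best_ell = max(candidates, key=lambda ell: log_marginal_likelihood(X, y, ell))
```

In practice the hyperparameters are usually tuned with a gradient-based optimizer over all of θ at once; the grid here only illustrates that the objective is a single number that can be compared across candidate models.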

Chapter 8: Summary

Concept | Key Fact
--- | ---
Gaussian process | Distribution over functions, defined by mean + kernel
Kernel | Controls smoothness, length scale, and correlation structure
Posterior | Mean interpolates data; variance is zero at data, grows with distance
Noisy measurements | Add νI to covariance; GP smooths instead of interpolating
Hyperparameter fitting | Maximum marginal likelihood balances fit and complexity
Looking ahead: Chapter 19 shows how to use GP uncertainty to guide optimization: where to sample next to find the minimum most efficiently. This is Bayesian optimization.