Kochenderfer & Wheeler, Chapter 18

Probabilistic Surrogate Models

Gaussian processes: surrogates that tell you not just what the function value probably is, but how uncertain they are about it.

Prerequisites: Chapter 17 (Surrogate Models).

Chapter 0: Why Uncertainty?

A standard surrogate tells you "the function value at x is about 3.7." A Gaussian process tells you "the function value at x is about 3.7 ± 0.8." That uncertainty estimate is crucial for optimization: it tells you where to look next.

The core idea: A Gaussian process (GP) is a probability distribution over functions. Given data, it provides a posterior distribution: a predicted mean (best guess) and a predicted variance (uncertainty). Near observed points, variance is low. Far from observations, variance is high. This uncertainty drives exploration in surrogate optimization.

Gaussian processes improve over standard surrogates by also providing:

(a) Uncertainty estimates (variance) for predictions
(b) Faster training time
(c) Exact gradient computation

Chapter 1: Gaussian Processes

A GP is defined by a mean function m(x) and a covariance function (kernel) k(x, x'). Any finite collection of function values follows a multivariate Gaussian distribution with mean m(X) and covariance matrix K(X, X) where K_ij = k(x_i, x_j).

The mean function encodes your prior belief about the function's average behavior. The kernel encodes your belief about smoothness, periodicity, or other structure. Together, they define a prior over functions before seeing any data.

Key insight: The kernel is the heart of a GP. It answers the question "how correlated are the function values at x and x'?" The squared exponential kernel k(x,x') = exp(−||x−x'||²/(2ℓ²)) says that nearby points have similar values, with the length scale ℓ controlling how quickly correlation decays with distance.
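To make the "distribution over functions" idea concrete, here is a minimal sketch that builds the covariance matrix K(X, X) for the squared exponential kernel and draws a few functions from the zero-mean prior. The function name sq_exp_kernel, the grid, the length scale, and the small diagonal "jitter" added for numerical stability are all illustrative assumptions, not something prescribed by the chapter.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared exponential kernel matrix between points A (n, d) and B (m, d)."""
    A = np.atleast_2d(A)
    B = np.atleast_2d(B)
    # Pairwise squared Euclidean distances ||a - b||^2
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))

# 1-D grid of inputs, shaped (n, 1) so each row is one point
X = np.linspace(0.0, 5.0, 50).reshape(-1, 1)
K = sq_exp_kernel(X, X, length_scale=1.0)

# Draw three functions from the zero-mean GP prior N(0, K).
# The jitter (an assumption of this sketch) keeps the Cholesky factorization stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))
samples = L @ rng.standard_normal((len(X), 3))  # one sampled function per column
```

Each column of samples is one plausible function under the prior. Rerunning with a smaller length_scale should give visibly wigglier draws.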

A Gaussian process is fully defined by its:

(a) Mean function and covariance (kernel) function
(b) Gradient and Hessian
(c) A single point estimate

Chapter 2: Kernel Functions

The kernel controls what kinds of functions the GP considers plausible:

Kernel	Formula (r = ||x−x'||)	Character
Squared exponential	exp(−r²/(2ℓ²))	Infinitely smooth
Matérn (ν=3/2)	(1 + √3·r/ℓ) exp(−√3·r/ℓ)	Once differentiable
Exponential	exp(−r/ℓ)	Rough, continuous but not differentiable
Rational quadratic	(1 + r²/(2αℓ²))^(−α)	Mix of length scales

Think of it this way: The squared exponential kernel assumes the function is very smooth (infinitely differentiable). If your real function has kinks or sharp features, a Matérn kernel with lower ν is more appropriate. The length scale ℓ controls how quickly the correlation between two points decays with distance.
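The four kernels in the table can be written directly as functions of the distance r. The sketch below is only an illustration: the function names and the default values ell=1.0 and alpha=1.0 are assumptions chosen for the example.

```python
import numpy as np

def squared_exponential(r, ell=1.0):
    return np.exp(-r**2 / (2.0 * ell**2))

def matern32(r, ell=1.0):
    s = np.sqrt(3.0) * r / ell
    return (1.0 + s) * np.exp(-s)

def exponential(r, ell=1.0):
    return np.exp(-r / ell)

def rational_quadratic(r, ell=1.0, alpha=1.0):
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

# Compare how fast correlation decays with distance for each kernel
r = np.array([0.0, 0.5, 1.0, 2.0])
kernels = {
    "squared exponential": squared_exponential,
    "Matern 3/2": matern32,
    "exponential": exponential,
    "rational quadratic": rational_quadratic,
}
for name, k in kernels.items():
    print(f"{name:20s}", np.round(k(r), 3))
```

All four return 1 at r = 0 and decay as r grows. Doubling ell stretches every curve horizontally by a factor of two, which is what "length scale" means here.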

The length scale ℓ in a kernel controls:

(a) How quickly correlation between function values decays with distance
(b) The number of data points
(c) The dimensionality of the input

Chapter 3: Prediction

Given observations (X, y), the GP posterior at a new point x has the following mean and variance:

μ̂(x) = m(x) + K(x,X)K(X,X)⁻¹(y − m(X))

ν̂(x) = K(x,x) − K(x,X)K(X,X)⁻¹K(X,x)

The predicted mean μ̂ interpolates the data. With noise-free observations, the predicted variance ν̂ is zero at data points and increases with distance from observations. The variance formula does not involve y: uncertainty depends only on where you have data, not on what values you observed.

The beautiful property: Uncertainty is zero at observed points and grows as you move away. This makes intuitive sense: you are certain about what you have measured and uncertain about regions you have not explored. The GP captures this naturally through the posterior covariance.
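The two posterior formulas translate almost line for line into NumPy. This is a minimal sketch under stated assumptions: a zero mean function (m(x) = 0), the squared exponential kernel with unit signal variance, 1-D inputs, and a tiny diagonal jitter for numerical stability. The names gp_predict and sq_exp_kernel are made up for the example.

```python
import numpy as np

def sq_exp_kernel(a, b, ell=1.0):
    """Squared exponential kernel between 1-D input arrays a (n,) and b (m,)."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * ell**2))

def gp_predict(X, y, X_new, ell=1.0, jitter=1e-10):
    """Posterior mean and variance of a zero-mean, noise-free GP at X_new."""
    K = sq_exp_kernel(X, X, ell) + jitter * np.eye(len(X))   # K(X, X)
    K_s = sq_exp_kernel(X_new, X, ell)                       # K(x, X)
    mean = K_s @ np.linalg.solve(K, y)
    # Diagonal of K(x,x) - K(x,X) K(X,X)^-1 K(X,x); K(x,x) = 1 for this kernel
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, np.maximum(var, 0.0)   # clip tiny negative round-off

X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.sin(X)
X_new = np.array([1.0, 1.75, 8.0])      # at data, between data, far from data
mean, var = gp_predict(X, y, X_new)
```

Passing different y values with the same X changes mean but leaves var untouched, which is the "uncertainty depends only on where you have data" property in code form.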

The GP predicted variance is zero at:

(a) Observed data points (where you have measurements)
(b) Points far from data
(c) The global minimum

Chapter 4: Confidence Intervals

From the predicted mean and standard deviation σ̂(x) = √ν̂(x), we can construct a 95% confidence interval: μ̂(x) ± 1.96σ̂(x). This band narrows near data points and widens between them.

Think of it this way: The confidence band is like fog. Where you have measured, the fog clears and you see the function clearly. Between measurements, the fog thickens and you are less sure of the function's true shape. The GP tells you exactly how thick the fog is.
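As a rough illustration of the fog picture, the sketch below computes the band μ̂(x) ± 1.96σ̂(x) at three query points: one on top of a data point, one between data points, and one far away. The observations are made up, and the zero-mean, noise-free, squared exponential setup and the helper names are assumptions of the example.

```python
import numpy as np

def sq_exp_kernel(a, b, ell=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * ell**2))

def confidence_band(X, y, X_new, ell=1.0, z=1.96, jitter=1e-10):
    """Return (lower, mean, upper) of the z-sigma band for a zero-mean, noise-free GP."""
    K = sq_exp_kernel(X, X, ell) + jitter * np.eye(len(X))
    K_s = sq_exp_kernel(X_new, X, ell)
    mean = K_s @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    std = np.sqrt(np.maximum(var, 0.0))   # clip round-off before the square root
    return mean - z * std, mean, mean + z * std

X = np.array([0.0, 1.0, 3.0])
y = np.array([0.5, 1.2, -0.3])            # made-up observations
X_new = np.array([1.0, 2.0, 6.0])         # at data, between data, far from data
lower, mean, upper = confidence_band(X, y, X_new)
width = upper - lower                     # how thick the fog is at each query point
```

The width is essentially zero at x = 1, larger at x = 2, and largest at x = 6, where the band falls back to the prior.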

A 95% GP confidence interval widens when:

(a) You are far from any observed data point
(b) You are at an observed data point
(c) The function value is large

Chapter 5: Gradient Observations

If gradient observations are available, they can be incorporated into the GP by extending the covariance structure. Derivatives of a GP are also Gaussian processes, with covariance functions derived from the original kernel via differentiation.

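For the squared exponential kernel with unit length scale (ℓ = 1), the covariance between the i-th gradient component at x and the function value at x' is k_∇f(x,x')_i = −(x_i−x'_i)k(x,x'); for a general length scale, the right-hand side is divided by ℓ². Including gradient observations can dramatically improve prediction quality with few additional evaluations.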

Key insight: Each gradient observation in n dimensions provides n additional pieces of information. For expensive functions where gradients are available (e.g., via adjoint methods), incorporating gradients into the GP makes the surrogate much more informative per evaluation.
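In one dimension, "extending the covariance structure" means stacking function values and derivatives into a single vector and building a block covariance matrix from derivatives of the kernel. The following is a sketch under assumptions: 1-D inputs, the squared exponential kernel, a zero mean, noise-free observations, and hypothetical helper names (se_blocks, predict_with_gradients, predict_values_only).

```python
import numpy as np

def se_blocks(a, b, ell=1.0):
    """Covariance blocks between (f, f') at 1-D inputs a and (f, f') at b."""
    d = a[:, None] - b[None, :]
    k = np.exp(-d**2 / (2.0 * ell**2))
    k_ff = k                                    # Cov(f(a),  f(b))
    k_fg = (d / ell**2) * k                     # Cov(f(a),  f'(b)) = dk/db
    k_gf = -(d / ell**2) * k                    # Cov(f'(a), f(b))  = dk/da
    k_gg = (1.0 / ell**2 - d**2 / ell**4) * k   # Cov(f'(a), f'(b)) = d2k/(da db)
    return k_ff, k_fg, k_gf, k_gg

def predict_with_gradients(X, y, dy, X_new, ell=1.0, jitter=1e-10):
    """Posterior mean/variance of f at X_new given values y and derivatives dy at X."""
    k_ff, k_fg, k_gf, k_gg = se_blocks(X, X, ell)
    K = np.block([[k_ff, k_fg], [k_gf, k_gg]]) + jitter * np.eye(2 * len(X))
    s_ff, s_fg, _, _ = se_blocks(X_new, X, ell)
    K_s = np.hstack([s_ff, s_fg])               # Cov(f(x_new), [f(X), f'(X)])
    mean = K_s @ np.linalg.solve(K, np.concatenate([y, dy]))
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

def predict_values_only(X, y, X_new, ell=1.0, jitter=1e-10):
    """Same GP, but conditioned on function values only (for comparison)."""
    K = se_blocks(X, X, ell)[0] + jitter * np.eye(len(X))
    K_s = se_blocks(X_new, X, ell)[0]
    mean = K_s @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

X = np.array([0.0, 2.0, 4.0])
y, dy = np.sin(X), np.cos(X)                    # values and derivatives of sin
X_new = np.array([1.0, 3.0])
mean_g, var_g = predict_with_gradients(X, y, dy, X_new)
mean_v, var_v = predict_values_only(X, y, X_new)
```

Conditioning on the extra derivative observations can only shrink the posterior variance, so var_g is never larger than var_v. In n dimensions the same construction applies with n derivative blocks per point instead of one.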

Incorporating gradient observations helps because:

(a) Each gradient gives n additional constraints, improving the fit
(b) Gradients are always free to compute
(c) It eliminates the need for function evaluations

Chapter 6: Noisy Measurements

Real measurements include noise: y = f(x) + ε, where ε ~ N(0, ν). To handle this, add the noise variance to the diagonal of the covariance matrix: K(X,X) + νI. The GP no longer interpolates exactly through data points but instead smooths through the noise.

Think of it this way: Without noise, the GP passes exactly through every data point. With noise, it acknowledges that each measurement might be off, so it averages through them. The noise variance ν is like saying "I trust each measurement to within about ±√ν," that is, one standard deviation.
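The only change from the noise-free predictor is the term added to the diagonal of K(X,X). The sketch below fits the same made-up noisy data twice, once with a negligible noise variance and once with ν = 0.1. The data, the seed, the helper names, and the zero-mean squared exponential setup are assumptions of the example; the returned variance is that of the underlying function f, not of a new noisy measurement.

```python
import numpy as np

def sq_exp_kernel(a, b, ell=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * ell**2))

def gp_predict_noisy(X, y, X_new, ell=1.0, noise_var=0.1):
    """Posterior mean and variance of f at X_new, given noisy observations y."""
    K = sq_exp_kernel(X, X, ell) + noise_var * np.eye(len(X))   # K(X,X) + nu*I
    K_s = sq_exp_kernel(X_new, X, ell)
    mean = K_s @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

rng = np.random.default_rng(1)
X = np.arange(0.0, 6.0)                                 # 0, 1, ..., 5
y = np.sin(X) + 0.2 * rng.standard_normal(len(X))       # made-up noisy data

# Predict back at the training inputs under both assumptions
mean_exact, var_exact = gp_predict_noisy(X, y, X, noise_var=1e-10)
mean_smooth, var_smooth = gp_predict_noisy(X, y, X, noise_var=0.1)

print("max |mean - y|, nearly noise-free:", np.max(np.abs(mean_exact - y)))
print("max |mean - y|, noise_var = 0.1:  ", np.max(np.abs(mean_smooth - y)))
```

With a negligible noise variance the mean reproduces y and the variance at the data collapses to zero. With ν = 0.1 the mean no longer passes through the points and some uncertainty remains even where you have measured.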

Adding noise variance ν to the GP causes it to:

(a) Smooth through data points instead of interpolating exactly
(b) Interpolate more aggressively
(c) Ignore all data points

Chapter 7: Fitting GPs

The kernel hyperparameters (ℓ, signal variance, noise variance) must be chosen. The standard approach: maximum marginal likelihood. Maximize the probability of the observed data under the GP model with respect to the hyperparameters.

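log p(y | X, θ) = −(1/2) yᵀK⁻¹y − (1/2) log|K| − (n/2) log(2π)

Here K is the covariance matrix of the observations under hyperparameters θ (K(X,X), plus νI when the data are noisy), n is the number of observations, and a zero mean function is assumed. The first term rewards data fit. The second term penalizes model complexity through the determinant of K. Together they implement an automatic Occam's razor.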

Automatic model selection: The marginal likelihood naturally balances fit and complexity. A kernel that is too wiggly (small ℓ) will have a high data fit term but a huge complexity penalty. A kernel that is too smooth (large ℓ) will have a low complexity penalty but poor fit. The optimum balances both.
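A simple way to see this trade-off is to evaluate the log marginal likelihood on a grid of length scales and pick the best one. The sketch below does exactly that and nothing more. Its assumptions: a zero mean, the squared exponential kernel with signal variance fixed at 1, a fixed noise variance of 0.01, made-up data, and a plain grid search standing in for the gradient-based optimizers typically used in practice.

```python
import numpy as np

def sq_exp_kernel(a, b, ell=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * ell**2))

def log_marginal_likelihood(X, y, ell, noise_var):
    """log p(y | X, theta) for a zero-mean GP, computed via a Cholesky factor."""
    n = len(X)
    K = sq_exp_kernel(X, X, ell) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                               # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^-1 y
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))                # equals -0.5 * log|K|
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))           # made-up noisy data

# Grid search over the length scale (an illustrative stand-in for an optimizer)
ells = np.logspace(-1, 1, 30)
scores = np.array([log_marginal_likelihood(X, y, ell, noise_var=0.01) for ell in ells])
best_ell = ells[np.argmax(scores)]
print("length scale with highest marginal likelihood:", best_ell)
```

Plotting scores against ells typically shows a single interior peak: very small length scales pay the complexity penalty, very large ones pay in data fit.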

Maximum marginal likelihood for GP hyperparameters balances:

(a) Data fit and model complexity (automatic Occam's razor)
(b) Training speed and accuracy
(c) The number of kernels used

Chapter 8: Summary

Concept	Key Fact
Gaussian process	Distribution over functions, defined by mean + kernel
Kernel	Controls smoothness, length scale, and correlation structure
Posterior	Mean interpolates data; variance is zero at data (noise-free case), grows with distance
Noisy measurements	Add νI to covariance; GP smooths instead of interpolating
Hyperparameter fitting	Maximum marginal likelihood balances fit and complexity

Looking ahead: Chapter 19 shows how to use GP uncertainty to guide optimization: where to sample next to find the minimum most efficiently. This is Bayesian optimization.

The GP posterior variance depends on:

(a) Where data points are located, not their values
(b) Only the observed function values
(c) The gradient at each point