Gaussian processes: surrogates that tell you not just what the function value probably is, but how uncertain they are about it.
A standard surrogate tells you "the function value at x is about 3.7." A Gaussian process tells you "the function value at x is about 3.7 ± 0.8." That uncertainty estimate is crucial for optimization: it tells you where to look next.
A GP is defined by a mean function m(x) and a covariance function (kernel) k(x, x'). Any finite collection of function values follows a multivariate Gaussian distribution with mean m(X) and covariance matrix K(X, X), where Kᵢⱼ = k(xᵢ, xⱼ).
The mean function encodes your prior belief about the function's average behavior. The kernel encodes your belief about smoothness, periodicity, or other structure. Together, they define a prior over functions before seeing any data.
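To make the "prior over functions" concrete, here is a minimal NumPy sketch that draws a few sample functions from a GP prior on a 1-D grid. The function names (`sq_exp_kernel`, `sample_prior`), the zero mean, the unit length scale, and the jitter value are illustrative assumptions, not part of the text above.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared exponential kernel matrix between 1-D point sets A and B."""
    r = np.abs(A[:, None] - B[None, :])
    return np.exp(-r**2 / (2.0 * length_scale**2))

def sample_prior(X, mean_fn, kernel_fn, n_samples=3, jitter=1e-6, seed=0):
    """Draw function values at X from the GP prior N(m(X), K(X, X))."""
    rng = np.random.default_rng(seed)
    # A small diagonal jitter (an assumption) keeps the Cholesky factorization stable.
    K = kernel_fn(X, X) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(X), n_samples))
    # If z ~ N(0, I), then m + L z ~ N(m, L L^T) = N(m, K).
    return mean_fn(X)[:, None] + L @ z

X = np.linspace(0.0, 5.0, 25)
samples = sample_prior(X, mean_fn=np.zeros_like, kernel_fn=sq_exp_kernel)
print(samples.shape)  # one column per sampled function
```

Each column of `samples` is one plausible function under the prior; changing the kernel or its length scale changes how wiggly those columns look.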
The kernel controls what kinds of functions the GP considers plausible:
| Kernel | Formula (r = ‖x − x'‖) | Character |
|---|---|---|
| Squared exponential | exp(−r²/(2ℓ²)) | Infinitely smooth |
| Matérn (ν=3/2) | (1 + √3·r/ℓ) exp(−√3·r/ℓ) | Once differentiable |
| Exponential | exp(−r/ℓ) | Rough: continuous but not differentiable |
| Rational quadratic | (1 + r²/(2αℓ²))^(−α) | Mix of length scales |
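The four kernels in the table can be written directly as functions of the distance r. This is a sketch with unit signal variance; the function names and default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def squared_exponential(r, ell=1.0):
    return np.exp(-r**2 / (2.0 * ell**2))

def matern32(r, ell=1.0):
    s = np.sqrt(3.0) * r / ell
    return (1.0 + s) * np.exp(-s)

def exponential(r, ell=1.0):
    return np.exp(-r / ell)

def rational_quadratic(r, ell=1.0, alpha=1.0):
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

# Compare how quickly correlation decays with distance for each kernel.
r = np.linspace(0.0, 3.0, 7)
for name, k in [("SE", squared_exponential), ("Matern 3/2", matern32),
                ("Exponential", exponential), ("RQ", rational_quadratic)]:
    print(f"{name:12s}", np.round(k(r), 3))
```

All four equal 1 at r = 0 and decay with distance. As α grows, the rational quadratic approaches the squared exponential, which is one way to read "mix of length scales".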
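Given observations (X, y), the GP posterior at new points X* is Gaussian, with mean and covariance

$$
\hat{\mu}(X^*) = m(X^*) + K(X^*, X)\,K(X, X)^{-1}\,\bigl(y - m(X)\bigr)
$$

$$
\hat{\Sigma}(X^*) = K(X^*, X^*) - K(X^*, X)\,K(X, X)^{-1}\,K(X, X^*)
$$

The predicted variance ν̂(x) at a single point is the corresponding diagonal entry of the posterior covariance.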
With noise-free observations, the predicted mean μ̂ interpolates the data, and the predicted variance ν̂ is zero at the data points and grows as you move away from them. The posterior covariance does not depend on y: for fixed hyperparameters, uncertainty depends only on where you have data, not on what values you observed.
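The standard GP conditioning formulas can be sketched in a few lines of NumPy. This version assumes a zero mean function and a squared exponential kernel with unit length scale; `gp_posterior`, its arguments, and the jitter term are hypothetical names and choices for illustration.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0):
    r = np.abs(A[:, None] - B[None, :])
    return np.exp(-r**2 / (2.0 * ell**2))

def gp_posterior(X, y, X_star, kernel_fn, noise_var=0.0, jitter=1e-10):
    """Posterior mean and pointwise variance of a zero-mean GP at X_star."""
    K = kernel_fn(X, X) + (noise_var + jitter) * np.eye(len(X))
    K_s = kernel_fn(X_star, X)
    L = np.linalg.cholesky(K)
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    V = np.linalg.solve(L, K_s.T)                          # L^{-1} K(X, X*)
    mean = K_s @ weights
    # diag(K** - K*x K^{-1} Kx*), computed without forming the full matrix
    var = np.diag(kernel_fn(X_star, X_star)) - np.sum(V**2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.sin(X)
X_star = np.array([1.0, 1.75, 6.0])  # at a data point, between points, far away
mean, var = gp_posterior(X, y, X_star, sq_exp_kernel)
print(mean, var)
```

The three query points illustrate the behavior described above: essentially zero variance at the data point, moderate variance between observations, and variance near the prior value far from all data. Note that `y` enters only the mean, never the variance.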
From the predicted mean and standard deviation σ̂(x) = √ν̂(x), we can construct a pointwise 95% confidence interval: μ̂(x) ± 1.96σ̂(x). This band narrows near data points and widens between them.
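As a small worked example of the interval formula, the sketch below turns a predicted mean and variance into lower and upper bounds. The helper name `confidence_band` and the numbers are made up for illustration; they are not outputs of any model in this text.

```python
import numpy as np

def confidence_band(mean, var, z=1.96):
    """Pointwise band mean ± z * std; z = 1.96 gives roughly 95% for a Gaussian."""
    std = np.sqrt(np.maximum(var, 0.0))
    return mean - z * std, mean + z * std

# Hypothetical predictions: at a data point, near one, and far from data.
mean = np.array([3.7, 3.2, 2.9])
var = np.array([0.0, 0.16, 0.64])
lower, upper = confidence_band(mean, var)
print(upper - lower)  # band width grows with the predicted variance
```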
If gradient observations are available, they can be incorporated into the GP by extending the covariance structure. Derivatives of a GP are also Gaussian processes, with covariance functions derived from the original kernel via differentiation.
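For the squared exponential kernel, the covariance between the i-th gradient component at x and the function value at x' is k∇f(x, x')ᵢ = ∂k(x, x')/∂xᵢ = −((xᵢ − x'ᵢ)/ℓ²) k(x, x'), which reduces to −(xᵢ − x'ᵢ) k(x, x') for ℓ = 1. Including gradient observations can dramatically improve prediction quality with few additional evaluations.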
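One way to sanity-check a gradient covariance is to compare it against a finite-difference derivative of the kernel. The sketch below assumes the derivative is taken with respect to the first argument x and keeps a general length scale ℓ, so a factor 1/ℓ² appears; with ℓ = 1 it matches the unit-length-scale expression. The names `sq_exp` and `grad_f_cov` and the test points are illustrative.

```python
import numpy as np

def sq_exp(x, xp, ell=1.0):
    d = x - xp
    return np.exp(-np.dot(d, d) / (2.0 * ell**2))

def grad_f_cov(x, xp, ell=1.0):
    """Cov(df/dx_i at x, f at x') for every i, i.e. the gradient of k in x."""
    return -(x - xp) / ell**2 * sq_exp(x, xp, ell)

x = np.array([0.3, -1.2])
xp = np.array([1.0, 0.5])
ell = 0.7

analytic = grad_f_cov(x, xp, ell)

# Central finite differences of k(x, x') in each coordinate of x.
h = 1e-6
numeric = np.array([
    (sq_exp(x + h * e, xp, ell) - sq_exp(x - h * e, xp, ell)) / (2.0 * h)
    for e in np.eye(len(x))
])
print(analytic, numeric)
```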
Real measurements include noise: y = f(x) + ε, where ε ~ N(0, ν). To handle this, add the noise variance to the diagonal of the covariance matrix: K(X,X) + νI. The GP no longer interpolates exactly through data points but instead smooths through the noise.
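The effect of adding νI can be seen by evaluating the posterior mean at the training inputs with and without a noise term. This is a sketch assuming a zero-mean GP and a unit-length-scale squared exponential kernel; the synthetic data, the noise level, and the name `posterior_mean` are assumptions for illustration.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0):
    r = np.abs(A[:, None] - B[None, :])
    return np.exp(-r**2 / (2.0 * ell**2))

def posterior_mean(X, y, X_star, noise_var):
    """Zero-mean GP posterior mean with noise variance added to the diagonal."""
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))
    return sq_exp_kernel(X_star, X) @ np.linalg.solve(K, y)

# Synthetic noisy observations of sin(x); the noise std 0.2 is arbitrary.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 8)
y = np.sin(X) + 0.2 * rng.standard_normal(len(X))

fit_exact = posterior_mean(X, y, X, noise_var=1e-10)  # effectively noise-free
fit_noisy = posterior_mean(X, y, X, noise_var=0.04)   # matches std 0.2

print(np.max(np.abs(fit_exact - y)))  # near zero: the mean interpolates
print(np.max(np.abs(fit_noisy - y)))  # clearly nonzero: the mean smooths
```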
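The kernel hyperparameters (ℓ, signal variance, noise variance) must be chosen. The standard approach is maximum marginal likelihood: maximize the probability of the observed data under the GP model with respect to the hyperparameters θ. With K_θ = K(X, X) + νI and n observations, the log marginal likelihood is

$$
\log p(y \mid X, \theta) = -\tfrac{1}{2}\,\bigl(y - m(X)\bigr)^\top K_\theta^{-1}\,\bigl(y - m(X)\bigr) - \tfrac{1}{2}\log\lvert K_\theta\rvert - \tfrac{n}{2}\log 2\pi
$$

The first term rewards data fit. The second term penalizes model complexity through the determinant of K_θ. Together they implement an automatic Occam's razor.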
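A minimal sketch of maximum marginal likelihood follows: it evaluates the log marginal likelihood of a zero-mean GP and picks the length scale by a coarse grid search. The grid search is a stand-in chosen for simplicity (gradient-based optimizers are more common in practice), and the data, fixed signal and noise variances, and function names are assumptions.

```python
import numpy as np

def sq_exp_kernel(A, B, ell, signal_var):
    r = np.abs(A[:, None] - B[None, :])
    return signal_var * np.exp(-r**2 / (2.0 * ell**2))

def log_marginal_likelihood(X, y, ell, signal_var, noise_var):
    """log p(y | X, theta) for a zero-mean GP with a squared exponential kernel."""
    n = len(X)
    K = sq_exp_kernel(X, X, ell, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    data_fit = -0.5 * y @ weights
    complexity = -np.sum(np.log(np.diag(L)))  # equals -0.5 * log|K|
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)

# Synthetic data: noisy samples of sin(x).
rng = np.random.default_rng(1)
X = np.linspace(0.0, 6.0, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(len(X))

# Grid search over the length scale with the other hyperparameters held fixed.
ells = np.logspace(-1, 1, 30)
scores = [log_marginal_likelihood(X, y, ell, signal_var=1.0, noise_var=0.01)
          for ell in ells]
best_ell = ells[int(np.argmax(scores))]
print(best_ell)
```

Very short length scales fit the data but pay a large complexity penalty, while very long ones are cheap but cannot explain the data, so the maximum tends to land in between.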
| Concept | Key Fact |
|---|---|
| Gaussian process | Distribution over functions, defined by mean + kernel |
| Kernel | Controls smoothness, length scale, and correlation structure |
| Posterior | Mean interpolates data; variance is zero at data, grows with distance |
| Noisy measurements | Add νI to covariance; GP smooths instead of interpolating |
| Hyperparameter fitting | Maximum marginal likelihood balances fit and complexity |