Modalities & Methods

Gaussian Processes

Don’t fit one function: fit a distribution over all plausible functions, and the uncertainty estimate comes out of the same calculation as the prediction. This lesson covers the GP, a standard reference point for “what does the model not know,” plus the broader toolkit of uncertainty in deep learning.

Prerequisites: a Gaussian has a mean and a variance; regression fits a curve to points. That’s it.


Chapter 0: What the Model Doesn’t Know

A normal regression model gives you a single curve: for any input, one predicted output. But that prediction hides something vital — how confident should you be? Near your training data the answer is probably good; far from it, the model is extrapolating blindly, yet it reports the prediction with the same false confidence. For anything where being wrong is costly — medicine, robotics, scientific experiments, finance — you need the model to say “here’s my best guess, and here’s how unsure I am.”

A Gaussian Process (GP) gives exactly that. Instead of fitting one function, it maintains a distribution over all functions consistent with the data. Its prediction at any point is itself a Gaussian: a mean (best guess) and a variance (uncertainty). And the uncertainty behaves the way you would want: small near data, large far from it. When the kernel and noise assumptions match the data, GPs are the reference point for calibrated uncertainty. This lesson builds one from scratch, then surveys how deep learning chases the same goal.

The trap: “a confident prediction is a good prediction.” A model can be confidently, catastrophically wrong — especially when extrapolating. Knowing what it doesn’t know is often more valuable than the prediction itself: it tells you when to trust the model, when to gather more data, and where the risk lies. Uncertainty is not a luxury; it’s the difference between a model you can deploy safely and one you can’t.

A prediction with honest uncertainty

The mean prediction (line) plus an uncertainty band (shaded). Notice the band is tight where there’s data and balloons in the gaps and beyond — the model admitting where it’s guessing. Drag to move the data and watch the uncertainty follow.


What does a Gaussian Process give that a normal regression model doesn’t?

Faster predictions
Calibrated uncertainty: a mean AND a variance per prediction, small near data and large far from it
The ability to classify images

Chapter 1: A Distribution Over Functions

Here’s the mind-bending core idea. We usually think of a probability distribution over numbers (a Gaussian over heights, say). A Gaussian Process is a probability distribution over functions. Before seeing any data, the GP has a prior — a belief about which functions are plausible. You can literally sample from it: each sample is a whole random function (a wiggly curve). The prior says “functions like these are likely; functions like those are not.”

How can a distribution over infinitely many points be tractable? The trick: a GP is defined so that any finite set of points has a multivariate Gaussian distribution. You never need the whole infinite function — just the values at the finite points you care about, which are jointly Gaussian. That keeps everything to standard Gaussian math (means, covariances, conditioning), which is exactly what makes GPs both principled and computable. The whole behavior of the prior — how smooth, how wiggly, how the functions look — is controlled by one object: the kernel, which we meet next.
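Written out, the finite-set property reads as follows. The symbols m (mean function), k (kernel), and K (covariance matrix) are notation introduced here for illustration; the lesson itself does not fix a notation.

```latex
f \sim \mathcal{GP}(m, k)
\quad\Longleftrightarrow\quad
\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}
\sim \mathcal{N}(\mathbf{m}, K)
\ \ \text{for every finite set } x_1, \dots, x_n,
\qquad
\mathbf{m}_i = m(x_i), \quad K_{ij} = k(x_i, x_j).
```

Everything a GP computes is a statement about such a finite vector, which is why ordinary multivariate-Gaussian algebra is enough.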

Samples from a GP prior

Each curve is one random function drawn from the GP prior — before any data. They share a character (here, smooth) but vary. Press resample to draw new functions; this cloud of plausible functions IS the prior.

A Gaussian Process is a probability distribution over:

single numbers
whole functions, and any finite set of points from it is jointly multivariate Gaussian
images

Chapter 2: The Kernel — how points relate

The kernel (or covariance function) is the heart of a GP. It answers one question: given two input points, how correlated should their outputs be? The intuition: nearby inputs should have similar outputs (a smooth function), distant inputs less so. The most common kernel, the RBF (squared-exponential), says exactly this — the correlation between two points falls off smoothly as the distance between their inputs grows.

The kernel has a crucial knob: the lengthscale. It sets the distance over which outputs stay strongly correlated; in other words, how fast the function is allowed to wiggle. (For the RBF kernel the correlation never reaches exactly zero, but it is negligible beyond a few lengthscales.) A short lengthscale means correlation dies quickly: nearby points are linked but the function can change rapidly (wiggly). A long lengthscale means correlation persists far: the function must be smooth and slowly varying. The kernel encodes your assumptions about the function before you see data (smooth? periodic? linear?), and the lengthscale sets the distance scale over which those assumptions apply.

Concept → realization: evaluating the kernel at every pair of the finite points gives the entries of their covariance matrix. “Outputs at nearby inputs are correlated” becomes “the covariance matrix has large off-diagonal entries for nearby points.” Choosing a kernel is choosing the shape of that matrix, and thus the character of every function the GP considers plausible.
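A minimal NumPy sketch of that idea, assuming a one-dimensional input and a zero-mean prior. The function name rbf_kernel, the grid, and the specific lengthscales are illustrative choices, not part of the lesson.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=0.2, variance=1.0):
    """RBF kernel: variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

x = np.linspace(0.0, 1.0, 50)

# The covariance matrix of the 50 function values is the kernel on all pairs.
K_short = rbf_kernel(x, x, lengthscale=0.05)
K_long = rbf_kernel(x, x, lengthscale=0.5)

# Correlation between two inputs that are 0.1 apart.
a, b = np.array([0.0]), np.array([0.1])
corr_short = rbf_kernel(a, b, lengthscale=0.05)[0, 0]  # exp(-2), about 0.135
corr_long = rbf_kernel(a, b, lengthscale=0.5)[0, 0]    # exp(-0.02), about 0.980

# Draw three functions from the zero-mean prior: f = L z with L L^T = K.
# The small jitter on the diagonal is a numerical-stability assumption.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K_long + 1e-6 * np.eye(len(x)))
samples = (L @ rng.standard_normal((len(x), 3))).T  # shape (3, 50)
```

Swapping K_long for K_short in the last step gives visibly wigglier samples, because distant entries of the matrix are then close to zero.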

The RBF kernel and its lengthscale

Correlation between two points vs. the distance between their inputs. Drag the lengthscale: short = correlation dies fast (wiggly functions allowed); long = correlation persists (smooth functions only). The prior samples (right) respond.


What does the kernel’s lengthscale control?

The number of data points
How far apart inputs must be before their outputs stop being correlated, i.e. how fast the function can wiggle (short=wiggly, long=smooth)
The learning rate

Chapter 3: Conditioning on Data

The magic step: given observed data points, condition the prior to get the posterior, the distribution over functions that now agree with the data. Because everything is jointly Gaussian (including the observation noise, which is assumed Gaussian), conditioning is a closed-form operation: the same formula as conditioning any multivariate Gaussian on some of its variables. For fixed kernel hyperparameters there is no iterative training and no gradient descent, just linear algebra. You get the posterior exactly, in one shot.
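For reference, the closed-form result for a zero-mean prior with Gaussian observation noise is sketched below. The notation is an assumption made here: X are the training inputs, y the observed targets, X_* the test inputs, and \sigma_n^2 the noise variance.

```latex
K = k(X, X), \qquad K_* = k(X, X_*), \qquad K_{**} = k(X_*, X_*)

\mu_* = K_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} y

\Sigma_* = K_{**} - K_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} K_*
```

The mean is a weighted combination of the observed targets, and the subtracted term in the covariance is the uncertainty removed by the data; it vanishes for test inputs that are uncorrelated with every training input.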

And the posterior does the beautiful thing. At and near observed points, the functions are pinned down — they must pass through (or near) the data — so the uncertainty collapses. In the gaps between data, and far beyond it, the functions are free to vary, so the uncertainty grows. The posterior mean interpolates the data smoothly (governed by the kernel), and the posterior variance is your honest error bar — automatically tight where you know and wide where you don’t. This is the GP’s signature: data pins the function down locally, and uncertainty fills the unknown.
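The behavior described above can be checked numerically. This is a minimal sketch assuming an RBF kernel with unit variance, a zero-mean prior, and a small Gaussian noise term; gp_posterior and the toy data are hypothetical names and values chosen for illustration.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=0.2, variance=1.0):
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, lengthscale=0.2, noise_var=1e-4):
    """Posterior mean and variance of the latent function at x_test."""
    K = rbf_kernel(x_train, x_train, lengthscale) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, lengthscale)
    prior_var = np.full(len(x_test), 1.0)  # k(x, x) for the unit-variance RBF

    # Factorize once instead of forming an explicit inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha

    v = np.linalg.solve(L, K_s)
    var = prior_var - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

x_train = np.array([0.2, 0.4, 0.5])
y_train = np.sin(2 * np.pi * x_train)

# One test input on a data point, one in a gap, one far outside the data.
x_test = np.array([0.4, 0.3, 1.5])
mean, var = gp_posterior(x_train, y_train, x_test)
```

With these settings the variance is smallest at the observed input, larger in the gap, and close to the prior variance of 1 far from the data, while the mean at the observed input sits very near the observed value.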

Prior → posterior as data arrives

Start with the wide prior (many functions). Add data points (drag the slider) and watch the functions snap to pass through them — uncertainty collapsing at the data, remaining wide in the gaps. That collapse is conditioning.


When a GP conditions on data, what happens to the uncertainty?

It stays constant everywhere
It collapses near observed points and remains large in the gaps and far from data
It grows everywhere

Chapter 4: Predictions with Error Bars

Put it together and you have GP regression’s signature picture: a mean curve that smoothly interpolates the data, wrapped in an uncertainty band that hugs the data tight and flares in the gaps. At any new input, the GP returns a Gaussian (mean and variance), so you can report “the value is about 5, give or take 0.3” near data, or “about 5, give or take 4” far away. The error bars come straight from the model, and they are well calibrated to the extent that the kernel, its hyperparameters, and the noise level match the data.

This is enormously useful. Active learning: sample next where uncertainty is highest — the model tells you where it’s most ignorant. Bayesian optimization: optimize an expensive function (tuning a model, designing an experiment) by balancing exploiting the high-mean regions and exploring the high-variance ones — GPs are the standard engine here. Safety: refuse to act when uncertainty is too high. The GP doesn’t just predict; it tells you how much to trust each prediction, which turns prediction into informed decision-making.
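As a rough illustration of how the mean and variance drive these decisions, the sketch below picks the next input to evaluate under two rules: pure exploration (largest uncertainty) and an upper-confidence-bound score, which is one common acquisition rule among several and is an assumption here, since the lesson does not name one. The posterior values are made-up numbers standing in for a GP’s output.

```python
import numpy as np

def next_point_max_variance(candidates, std):
    """Active learning: query where the model is most uncertain."""
    return candidates[np.argmax(std)]

def next_point_ucb(candidates, mean, std, kappa=2.0):
    """Bayesian optimization (maximization): trade off high mean and high std.

    kappa = 0 is pure exploitation; larger kappa explores more.
    """
    return candidates[np.argmax(mean + kappa * std)]

# Hypothetical posterior at five candidate inputs.
candidates = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
mean = np.array([1.0, 1.2, 0.9, 0.3, -1.0])
std = np.array([0.05, 0.1, 0.3, 0.5, 0.9])

explore = next_point_max_variance(candidates, std)             # 1.0
exploit = next_point_ucb(candidates, mean, std, kappa=0.0)     # 0.25
balanced = next_point_ucb(candidates, mean, std, kappa=2.0)    # 0.5
```

The three rules pick three different inputs here: the most uncertain one, the one with the best current mean, and a compromise that is promising but not yet well measured.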

GP regression in one picture

Data points, the posterior mean (line), and the uncertainty band (shaded ±2 standard deviations). Tight at data, wide between — the calibrated error bar. Drag the observation noise: noisier data loosens the fit and widens the band even at the points.


Why are GPs the standard engine for Bayesian optimization?

They are the fastest models
Their calibrated uncertainty lets you balance exploiting high-mean regions and exploring high-variance ones
They require no kernel

Chapter 5: Choosing the Kernel

The kernel is where you inject prior knowledge, and different kernels produce wildly different behavior. RBF gives smooth functions. Periodic kernels produce repeating patterns — perfect for seasonality. Linear kernels give straight-line trends. Matérn kernels allow rougher functions than RBF. And you can combine them: add a periodic kernel to a linear one to model “a rising trend with a yearly cycle.” The kernel is a language for expressing what you believe about the function.
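The “rising trend with a yearly cycle” combination can be sketched as a sum of two kernel functions. The periodic form used below is the standard exponentiated sine-squared kernel; the function names and parameter values are illustrative assumptions.

```python
import numpy as np

def linear_kernel(xa, xb, variance=1.0):
    """Linear kernel: functions drawn from it are straight lines through 0."""
    return variance * xa[:, None] * xb[None, :]

def periodic_kernel(xa, xb, period=1.0, lengthscale=1.0, variance=1.0):
    """Periodic kernel: inputs a whole period apart are perfectly correlated."""
    dist = np.abs(xa[:, None] - xb[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / lengthscale**2)

def trend_plus_cycle(xa, xb):
    """Sum of kernels: models f(x) = trend(x) + cycle(x)."""
    return linear_kernel(xa, xb, variance=0.5) + periodic_kernel(xa, xb, period=1.0)

x = np.linspace(0.0, 3.0, 30)
K = trend_plus_cycle(x, x)

# A sum of valid kernels is a valid kernel: K stays symmetric and
# positive semi-definite (up to floating-point round-off).
min_eig = np.linalg.eigvalsh(K).min()
one_period_apart = periodic_kernel(np.array([0.0]), np.array([1.0]))[0, 0]
```

Adding kernels corresponds to adding independent functions, which is why the sum expresses “trend plus cycle” rather than some blend of the two.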

The kernel’s hyperparameters (lengthscale, output variance, noise) are usually learned from data by maximizing the marginal likelihood — the GP’s own principled measure of how well a setting explains the data, which automatically balances fit against complexity (a built-in Occam’s razor). So you choose the kernel family (your structural assumption) and let the data tune its parameters. Picking the right kernel is the main modeling decision in GP work — get it right and the GP extrapolates sensibly; get it wrong and it’s confidently mistaken in exactly the way the kernel implies.
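A small sketch of that selection step, assuming an RBF kernel, a fixed noise variance, and a coarse grid search in place of the gradient-based optimizers that GP libraries typically use. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale, variance=1.0):
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def log_marginal_likelihood(x, y, lengthscale, noise_var=0.01):
    """log p(y | x) for a zero-mean GP with Gaussian noise."""
    n = len(x)
    K = rbf_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * (y @ alpha)              # rewards explaining the data
    complexity = -np.sum(np.log(np.diag(L)))   # equals -0.5 * log det K
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)

# Synthetic data: one period of a sine wave plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

grid = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0])
scores = np.array([log_marginal_likelihood(x, y, ls) for ls in grid])
best_lengthscale = grid[np.argmax(scores)]
```

The two named terms are the built-in trade-off: a very short lengthscale treats the data as near-independent noise, and a very long one cannot bend enough to follow the sine, so both extremes score poorly on this data.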

Different kernels, different functions

The same data, four kernels. RBF (smooth), periodic (repeating), linear (trend), Matérn (rougher). The kernel is your assumption about the function’s shape. Drag to switch and see how the GP extrapolates each way.


What role does the kernel play in a GP?

It is the optimizer
It encodes your assumptions about the function (smooth, periodic, linear, …) by defining how outputs correlate with input distance
It stores the training labels

Chapter 6: The n³ Wall

GPs have one serious limitation: they scale badly. Conditioning requires inverting (or factorizing) the kernel matrix over all data points — an n×n matrix — which costs O(n³) time and O(n²) memory. At a few thousand points it’s fine; at a million, the exact GP is hopeless. This cubic cost is why GPs shine on small-to-medium data and struggle on the large datasets where deep nets thrive.

The fix is sparse / approximate GPs: instead of using all n points, summarize them with a smaller set of inducing points (m ≪ n) that capture the data’s structure. The GP then operates on these m representatives, dropping the cost dramatically (to roughly O(n·m²)). If that sounds familiar (a small set of learned representatives summarizing a large input), it’s the same bottleneck idea as the Perceiver’s latents. There are also iterative and GPU-friendly methods. So GPs can scale, with approximation, but the exact, beautiful closed-form GP is fundamentally a small-data tool, and that’s the central trade-off versus deep learning’s scale.
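To put rough numbers on the gap, the sketch below compares the leading-order costs for one million points and one thousand inducing points. Constants are dropped, so these are orders of magnitude only; the 8 bytes per entry assumes float64, and the O(n·m) memory figure for the sparse case is an assumption about storing the n×m cross-covariance.

```python
def exact_gp_cost(n):
    """Leading-order cost of an exact GP: n^3 operations, n x n float64 matrix."""
    return {"ops": n**3, "bytes": 8 * n**2}

def sparse_gp_cost(n, m):
    """Leading-order cost with m inducing points: n * m^2 operations."""
    return {"ops": n * m**2, "bytes": 8 * n * m}

n, m = 1_000_000, 1_000
exact = exact_gp_cost(n)
sparse = sparse_gp_cost(n, m)

speedup = exact["ops"] / sparse["ops"]   # (n / m)^2 = 1e6
exact_tb = exact["bytes"] / 1e12         # 8.0 TB just to hold the kernel matrix
sparse_gb = sparse["bytes"] / 1e9        # 8.0 GB for the n x m matrix
```

The ratio of operation counts is (n/m)², so every tenfold reduction in inducing points relative to n buys roughly a hundredfold reduction in compute.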

Cost vs. data size, and inducing points

Exact GP cost grows as n³ (orange) — it explodes. Sparse GPs with m inducing points (teal) summarize the data and grow far more gently. Drag the data size and watch the gap open.


Why don’t exact GPs scale to very large datasets?

They need too many layers
Conditioning inverts an n×n kernel matrix, which costs O(n³) time, so they shine on small/medium data; sparse GPs use m≪n inducing points to scale
They can’t represent nonlinear functions

Chapter 7: A Gaussian Process, Live (showcase)

Build a GP by hand: click anywhere to add a data point and watch the posterior mean and uncertainty band update instantly (it’s closed-form — no training). Tune the lengthscale and noise and see the fit’s character change. This is a real GP regression running in your browser.

Click to add data — the GP responds

Click in the plot to add observations. The teal mean curve interpolates them; the shaded band is the ±2σ uncertainty — tight at your points, wide in the gaps. Tune lengthscale (wiggliness) and noise (how loosely it fits). Watch the uncertainty grow wherever you leave a gap.


Play with it: cluster points and the band shrinks there; leave a gap and watch the band balloon; shrink the lengthscale and the mean gets wiggly and over-fits; grow it and the fit smooths out. Everything you learned — prior, kernel, conditioning, error bars — is happening live, in closed form, every time you click.

Chapter 8: Uncertainty Beyond GPs

GPs are the gold standard for uncertainty, but they don’t scale to deep-learning-sized problems. So how do big neural nets estimate uncertainty? Several practical methods:

Deep ensembles: train several networks (different seeds); their disagreement on an input is the uncertainty. Simple, strong, and surprisingly well-calibrated — where they agree, trust; where they diverge, doubt.
MC dropout: keep dropout on at test time and run the network many times; the spread of outputs is a rough approximation to sampling from a Bayesian posterior predictive. Cheap, if rough.
Conformal prediction: a model-agnostic wrapper that turns any predictor into one with guaranteed-coverage prediction intervals (e.g. “90% of the time the truth is in this range”), using a held-out calibration set. The guarantee is distribution-free but marginal (it holds on average over inputs, not for each input separately), and it assumes the calibration and test data are exchangeable.
Calibration: separately, are a model’s confidence scores honest? Techniques like temperature scaling fix overconfident classifiers so “90% sure” means right 90% of the time.
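Of the methods listed above, conformal prediction is the easiest to show end to end. The sketch below is a minimal split-conformal wrapper for regression using absolute residuals as the score; the predictor, the synthetic data, and the function name are hypothetical stand-ins for a real model and dataset.

```python
import numpy as np

def split_conformal_halfwidth(cal_pred, cal_true, alpha=0.1):
    """Half-width q so that [pred - q, pred + q] covers about 1 - alpha."""
    scores = np.abs(cal_true - cal_pred)           # nonconformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))        # finite-sample corrected rank
    if k > n:
        return np.inf                              # too few calibration points
    return np.sort(scores)[k - 1]

def predictor(x):
    """Stand-in for any trained model (hypothetical)."""
    return 2.0 * x

rng = np.random.default_rng(0)

# Held-out calibration set, never used to fit the predictor.
x_cal = rng.uniform(0.0, 1.0, 500)
y_cal = 2.0 * x_cal + 0.3 * rng.standard_normal(500)
q = split_conformal_halfwidth(predictor(x_cal), y_cal, alpha=0.1)

# Fresh data from the same distribution: check empirical coverage.
x_new = rng.uniform(0.0, 1.0, 2000)
y_new = 2.0 * x_new + 0.3 * rng.standard_normal(2000)
coverage = np.mean(np.abs(y_new - predictor(x_new)) <= q)
```

On data drawn like the calibration set, the empirical coverage should land close to the 90% target. Note that the interval has the same width everywhere, so unlike a GP band it does not by itself widen far from the data.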

The throughline: a prediction without uncertainty is half an answer. GPs give it exactly and elegantly for small data; deep ensembles, MC dropout, and conformal prediction give it approximately and scalably for large models. Whichever you use, quantifying what the model doesn’t know — and powering decisions, active learning, and Bayesian optimization with it — is one of the most underrated skills in applied ML.

Uncertainty methods compared

GP (exact, small data), deep ensemble (disagreement), MC dropout (sampled), conformal (guaranteed coverage). Drag to see each method’s uncertainty estimate and scalability trade-off.


What does conformal prediction provide?

Faster training
Model-agnostic prediction intervals with guaranteed coverage (distribution-free), via a held-out calibration set
A new kernel for GPs

Chapter 9: Cheat Sheet & Connections

prior: a distribution over functions, defined by a kernel (covariance)

↓ kernel sets character (lengthscale = wiggliness)

kernel: RBF / periodic / linear / Matérn; encodes assumptions; hyperparams learned via marginal likelihood

↓ condition on data (closed-form)

posterior: mean interpolates data; variance = error bar (tight at data, wide in gaps)

↓ cost O(n³) → sparse GPs (inducing points)

use: calibrated uncertainty, active learning, Bayesian optimization

Method	Uncertainty	Scale
Gaussian Process	exact, calibrated	small/medium (O(n³))
Deep ensembles	disagreement, strong	large (×k cost)
MC dropout	sampled, rough	large, cheap
Conformal	guaranteed coverage	any (wraps a model)

Keep exploring

→ Time-Series Forecasting — another home for probabilistic prediction
→ Bayesian Estimation — the prior→posterior machinery, generally
→ Kalman Filter — a Gaussian, recursive cousin
→ Perceiver — the inducing-points / latent-bottleneck idea

“What I cannot create, I do not understand.” You just rebuilt the Gaussian Process: a distribution over functions shaped by a kernel, conditioned on data in closed form to get a posterior whose mean interpolates and whose variance is an honest error bar — tight where you know, wide where you don’t. Knowing what the model doesn’t know is half the answer.