Gaussian Processes

Don’t fit one function — fit a distribution over all plausible functions, and get calibrated uncertainty for free. The gold standard for “what does the model not know,” plus the broader toolkit of uncertainty in deep learning.

Prerequisites: “A Gaussian has a mean and a variance” and “Regression fits a curve to points.” That’s it.
10 chapters · 9+ simulations · 0 assumed knowledge

Chapter 0: What the Model Doesn’t Know

A normal regression model gives you a single curve: for any input, one predicted output. But that prediction hides something vital: how confident should you be? Near your training data the answer is probably good; far from it, the model is extrapolating blindly, yet it reports the prediction with the same false confidence. For anything where being wrong is costly (medicine, robotics, scientific experiments, finance) you need the model to say “here’s my best guess, and here’s how unsure I am.”

A Gaussian Process (GP) gives exactly that. Instead of fitting one function, it maintains a distribution over all functions consistent with the data. Its prediction at any point is itself a Gaussian: a mean (best guess) and a variance (uncertainty). And the uncertainty behaves exactly as it should — small near data, large far from it. GPs are the gold standard for calibrated uncertainty, and this lesson builds one from scratch, then surveys how deep learning chases the same goal.

The trap: “a confident prediction is a good prediction.” A model can be confidently, catastrophically wrong — especially when extrapolating. Knowing what it doesn’t know is often more valuable than the prediction itself: it tells you when to trust the model, when to gather more data, and where the risk lies. Uncertainty is not a luxury; it’s the difference between a model you can deploy safely and one you can’t.
A prediction with honest uncertainty

The mean prediction (line) plus an uncertainty band (shaded). Notice the band is tight where there’s data and balloons in the gaps and beyond — the model admitting where it’s guessing. Drag to move the data and watch the uncertainty follow.

What does a Gaussian Process give that a normal regression model doesn’t?

Chapter 1: A Distribution Over Functions

Here’s the mind-bending core idea. We usually think of a probability distribution over numbers (a Gaussian over heights, say). A Gaussian Process is a probability distribution over functions. Before seeing any data, the GP has a prior — a belief about which functions are plausible. You can literally sample from it: each sample is a whole random function (a wiggly curve). The prior says “functions like these are likely; functions like those are not.”

How can a distribution over infinitely many points be tractable? The trick: a GP is defined so that any finite set of points has a multivariate Gaussian distribution. You never need the whole infinite function — just the values at the finite points you care about, which are jointly Gaussian. That keeps everything to standard Gaussian math (means, covariances, conditioning), which is exactly what makes GPs both principled and computable. The whole behavior of the prior — how smooth, how wiggly, how the functions look — is controlled by one object: the kernel, which we meet next.
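In symbols, the defining property might be written as follows. The mean function m and the kernel k are notation assumed here for illustration; the lesson itself does not fix symbols.

```latex
f \sim \mathcal{GP}(m, k)
\quad\Longleftrightarrow\quad
\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix},
\begin{bmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_n) \\
\vdots & \ddots & \vdots \\
k(x_n, x_1) & \cdots & k(x_n, x_n)
\end{bmatrix}
\right)
\quad \text{for every finite set } x_1, \dots, x_n .
```

Everything a GP computes is a statement about such a finite vector, so the infinite-dimensional object never has to be stored.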

Samples from a GP prior

Each curve is one random function drawn from the GP prior — before any data. They share a character (here, smooth) but vary. Press resample to draw new functions; this cloud of plausible functions IS the prior.

A Gaussian Process is a probability distribution over:

Chapter 2: The Kernel — how points relate

The kernel (or covariance function) is the heart of a GP. It answers one question: given two input points, how correlated should their outputs be? The intuition: nearby inputs should have similar outputs (a smooth function), distant inputs less so. The most common kernel, the RBF (squared-exponential), says exactly this — the correlation between two points falls off smoothly as the distance between their inputs grows.

The kernel has a crucial knob: the lengthscale. It sets how far apart points must be before they stop being correlated — in other words, how fast the function is allowed to wiggle. A short lengthscale means correlation dies quickly: nearby points are linked but the function can change rapidly (wiggly). A long lengthscale means correlation persists far: the function must be smooth and slowly-varying. The kernel encodes your assumptions about the function before you see data — smooth? periodic? linear? — and the lengthscale tunes how strongly.
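A minimal sketch of the RBF kernel makes the lengthscale's effect concrete. The function name `rbf_kernel` and the parameter names are illustrative choices, not taken from the lesson's own code.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between every pair of inputs in x1 and x2."""
    x1 = np.asarray(x1, dtype=float).reshape(-1, 1)  # column vector
    x2 = np.asarray(x2, dtype=float).reshape(1, -1)  # row vector
    sq_dist = (x1 - x2) ** 2                         # pairwise squared distances
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

# Correlation between two inputs 0.3 apart under a short and a long lengthscale.
short = rbf_kernel([0.0], [0.3], lengthscale=0.1)[0, 0]
long_ = rbf_kernel([0.0], [0.3], lengthscale=1.0)[0, 0]
print(f"lengthscale 0.1: {short:.3f}")  # about 0.011, nearly uncorrelated
print(f"lengthscale 1.0: {long_:.3f}")  # about 0.956, still strongly correlated
```

The same pair of inputs is almost independent under the short lengthscale and almost locked together under the long one.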

Concept → realization: the kernel is literally the entries of the covariance matrix for the finite points. “Outputs at nearby inputs are correlated” becomes “the covariance matrix has large off-diagonal entries for nearby points.” Choosing a kernel is choosing the shape of that matrix — and thus the character of every function the GP considers plausible.
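To see the "kernel as covariance matrix" idea in action, here is a hedged sketch that builds the matrix on a grid and draws prior samples from it. It assumes a zero prior mean and an RBF kernel; the small `jitter` added to the diagonal is a standard numerical stabilizer and an assumption of this sketch, as are all function names.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel evaluated on all pairs of inputs."""
    d = np.asarray(x1, float).reshape(-1, 1) - np.asarray(x2, float).reshape(1, -1)
    return variance * np.exp(-0.5 * d**2 / lengthscale**2)

def sample_prior(x, n_samples=3, lengthscale=0.2, jitter=1e-6, seed=0):
    """Draw random functions from a zero-mean GP prior, evaluated at the points x."""
    rng = np.random.default_rng(seed)
    # Covariance matrix: entry (i, j) is the kernel between x[i] and x[j].
    K = rbf_kernel(x, x, lengthscale) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)                  # K = L @ L.T
    z = rng.standard_normal((len(x), n_samples))
    return (L @ z).T                           # each row is one sampled function

x_grid = np.linspace(0.0, 1.0, 50)
samples = sample_prior(x_grid)                 # shape (3, 50)
```

Multiplying independent standard normals by the Cholesky factor gives them exactly the covariance K, which is why neighboring grid values in each sample move together.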
The RBF kernel and its lengthscale

Correlation between two points vs. the distance between their inputs. Drag the lengthscale: short = correlation dies fast (wiggly functions allowed); long = correlation persists (smooth functions only). The prior samples (right) respond.

What does the kernel’s lengthscale control?

Chapter 3: Conditioning on Data

The magic step: given observed data points, condition the prior to get the posterior — the distribution over functions that now agree with the data. Because everything is jointly Gaussian, conditioning is a closed-form operation (the same formula as conditioning any multivariate Gaussian on some of its variables). No iterative training, no gradient descent — just linear algebra. You get the posterior exactly, in one shot.

And the posterior does the beautiful thing. At and near observed points, the functions are pinned down — they must pass through (or near) the data — so the uncertainty collapses. In the gaps between data, and far beyond it, the functions are free to vary, so the uncertainty grows. The posterior mean interpolates the data smoothly (governed by the kernel), and the posterior variance is your honest error bar — automatically tight where you know and wide where you don’t. This is the GP’s signature: data pins the function down locally, and uncertainty fills the unknown.
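The conditioning step can be sketched in a few lines of linear algebra. This is a minimal version under stated assumptions: zero prior mean, an RBF kernel with unit output variance, and `noise` interpreted as the standard deviation of Gaussian observation noise. The names `gp_posterior`, `x_train` and `x_test` are illustrative.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.2):
    """Squared-exponential kernel with unit output variance."""
    d = np.asarray(x1, float).reshape(-1, 1) - np.asarray(x2, float).reshape(1, -1)
    return np.exp(-0.5 * d**2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, lengthscale=0.2, noise=0.05):
    """Closed-form posterior mean and variance of the function at x_test.

    mean = K_s^T (K + noise^2 I)^-1 y
    var  = k(x*, x*) - K_s^T (K + noise^2 I)^-1 K_s   (diagonal only)
    """
    K = rbf_kernel(x_train, x_train, lengthscale) + noise**2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, lengthscale)       # train-vs-test covariances
    L = np.linalg.cholesky(K)                            # factorize once, reuse below
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    prior_var = 1.0                                      # k(x, x) for a unit-variance RBF
    var = prior_var - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)                    # clip tiny negative round-off

x_train = np.array([0.2, 0.4, 0.5])
y_train = np.sin(2 * np.pi * x_train)
x_test = np.array([0.4, 0.95])                           # one point at data, one far away
mean, var = gp_posterior(x_train, y_train, x_test)
print("mean:", mean, "variance:", var)
```

The variance at 0.4, where there is an observation, comes out close to zero, while at 0.95 it stays near the prior variance of 1.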

Prior → posterior as data arrives

Start with the wide prior (many functions). Add data points (drag the slider) and watch the functions snap to pass through them — uncertainty collapsing at the data, remaining wide in the gaps. That collapse is conditioning.

When a GP conditions on data, what happens to the uncertainty?

Chapter 4: Predictions with Error Bars

Put it together and you have GP regression’s signature picture: a mean curve that smoothly interpolates the data, wrapped in an uncertainty band that hugs the data tight and flares in the gaps. At any new input, the GP returns a Gaussian (mean and variance), so you can report “the value is about 5, give or take 0.3” near data, or “about 5, give or take 4” far away. The error bars come directly from the posterior, so they are well calibrated whenever the kernel and noise assumptions actually match the data.

This is enormously useful. Active learning: sample next where uncertainty is highest — the model tells you where it’s most ignorant. Bayesian optimization: optimize an expensive function (tuning a model, designing an experiment) by balancing exploiting the high-mean regions and exploring the high-variance ones — GPs are the standard engine here. Safety: refuse to act when uncertainty is too high. The GP doesn’t just predict; it tells you how much to trust each prediction, which turns prediction into informed decision-making.
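Both active learning and Bayesian optimization reduce to "score every candidate using the posterior, then pick the best". The sketch below uses maximum posterior standard deviation for active learning and an upper confidence bound (mean plus `beta` times standard deviation) for optimization. UCB is one common acquisition rule chosen here for illustration, and the data, `beta` value and function names are all hypothetical.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel with unit output variance."""
    d = np.asarray(a, float).reshape(-1, 1) - np.asarray(b, float).reshape(1, -1)
    return np.exp(-0.5 * d**2 / ls**2)

def posterior(x_tr, y_tr, x_te, ls=0.2, noise=0.05):
    """Posterior mean and standard deviation of a zero-mean GP at x_te."""
    K = rbf(x_tr, x_tr, ls) + noise**2 * np.eye(len(x_tr))
    K_s = rbf(x_tr, x_te, ls)
    mean = K_s.T @ np.linalg.solve(K, y_tr)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

x_tr = np.array([0.1, 0.2, 0.3])           # points already evaluated
y_tr = np.array([0.2, 0.5, 0.9])           # their (made-up) observed values
candidates = np.linspace(0.0, 1.0, 101)
mean, std = posterior(x_tr, y_tr, candidates)

# Active learning: query where the model is most ignorant.
next_active = candidates[np.argmax(std)]

# Bayesian optimization (maximizing): trade off high mean against high uncertainty.
beta = 2.0
next_ucb = candidates[np.argmax(mean + beta * std)]
print("active learning picks", next_active, "| UCB picks", next_ucb)
```

A larger `beta` pushes the optimizer toward exploration; a smaller one makes it exploit the regions it already believes are good.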

GP regression in one picture

Data points, the posterior mean (line), and the uncertainty band (shaded ±2 standard deviations). Tight at data, wide between — the calibrated error bar. Drag the observation noise: noisier data loosens the fit and widens the band even at the points.

Why are GPs the standard engine for Bayesian optimization?

Chapter 5: Choosing the Kernel

The kernel is where you inject prior knowledge, and different kernels produce wildly different behavior. RBF gives smooth functions. Periodic kernels produce repeating patterns — perfect for seasonality. Linear kernels give straight-line trends. Matérn kernels allow rougher functions than RBF. And you can combine them: add a periodic kernel to a linear one to model “a rising trend with a yearly cycle.” The kernel is a language for expressing what you believe about the function.
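Combining kernels is just adding their covariance matrices, since a sum of valid kernels is again a valid kernel. The sketch below builds the "rising trend with a yearly cycle" example from a linear and a periodic kernel; the specific parameterizations and names are common textbook forms assumed for illustration.

```python
import numpy as np

def linear_kernel(x1, x2, variance=1.0):
    """Linear kernel: functions drawn from it are straight lines through the origin."""
    return variance * np.outer(x1, x2)

def periodic_kernel(x1, x2, period=1.0, lengthscale=1.0):
    """Periodic kernel: inputs a whole number of periods apart are fully correlated."""
    d = np.abs(np.subtract.outer(x1, x2))
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

def trend_plus_cycle(x1, x2):
    """Sum of the two kernels: a linear trend with a repeating pattern on top."""
    return linear_kernel(x1, x2) + periodic_kernel(x1, x2, period=1.0)

x = np.array([0.0, 0.5, 1.0, 2.0])   # e.g. time in years, with period = 1 year
K = trend_plus_cycle(x, x)
print(np.round(K, 3))
```

In the printed matrix, the periodic part contributes exactly 1 between inputs that are one or two years apart, while the linear part grows with the product of the inputs.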

The kernel’s hyperparameters (lengthscale, output variance, noise) are usually learned from data by maximizing the marginal likelihood — the GP’s own principled measure of how well a setting explains the data, which automatically balances fit against complexity (a built-in Occam’s razor). So you choose the kernel family (your structural assumption) and let the data tune its parameters. Picking the right kernel is the main modeling decision in GP work — get it right and the GP extrapolates sensibly; get it wrong and it’s confidently mistaken in exactly the way the kernel implies.
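As a rough illustration of marginal-likelihood tuning, the sketch below scores a handful of candidate lengthscales on synthetic data and keeps the best. Real libraries optimize this quantity with gradients; the grid search, the fixed `noise` value, the synthetic data and the function names are simplifying assumptions made here.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale):
    """Squared-exponential kernel with unit output variance."""
    d = np.asarray(x1, float).reshape(-1, 1) - np.asarray(x2, float).reshape(1, -1)
    return np.exp(-0.5 * d**2 / lengthscale**2)

def log_marginal_likelihood(x, y, lengthscale, noise=0.1):
    """log p(y | x) for a zero-mean GP: data-fit term plus complexity penalty."""
    n = len(x)
    K = rbf_kernel(x, x, lengthscale) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                    # rewards explaining the data
    complexity = -np.sum(np.log(np.diag(L)))       # equals -0.5 * log det K
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

# Synthetic data: one period of a sine wave plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

grid = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
scores = [log_marginal_likelihood(x, y, ls) for ls in grid]
best = grid[int(np.argmax(scores))]
print("best lengthscale on this grid:", best)
```

Very short lengthscales are penalized by the complexity term and very long ones by the data-fit term, which is the built-in Occam's razor at work.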

Different kernels, different functions

The same data, four kernels. RBF (smooth), periodic (repeating), linear (trend), Matérn (rougher). The kernel is your assumption about the function’s shape. Drag to switch and see how the GP extrapolates each way.

What role does the kernel play in a GP?

Chapter 6: The n³ Wall

GPs have one serious limitation: they scale badly. Conditioning requires inverting (or factorizing) the kernel matrix over all data points — an n×n matrix — which costs O(n³) time and O(n²) memory. At a few thousand points it’s fine; at a million, the exact GP is hopeless. This cubic cost is why GPs shine on small-to-medium data and struggle on the large datasets where deep nets thrive.

The fix is sparse / approximate GPs: instead of using all n points, summarize them with a smaller set of inducing points (m ≪ n) that capture the data’s structure. The GP then operates on these m representatives, dropping the cost dramatically (to roughly O(n·m²)). If that sounds familiar (a small set of learned representatives summarizing a large input), it’s the same bottleneck idea as the Perceiver’s latents. There are also iterative and GPU-friendly methods. So GPs can scale, with approximation, but the exact, beautiful closed-form GP is fundamentally a small-data tool, and that’s the central trade-off versus deep learning’s scale.
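Some back-of-the-envelope arithmetic shows how quickly the gap opens. The counts below are order-of-magnitude only (constants are ignored), and the choice of m = 500 inducing points is a hypothetical value picked for illustration.

```python
def exact_gp_cost(n):
    """Order-of-magnitude operation count for factorizing an n x n kernel matrix."""
    return n**3

def sparse_gp_cost(n, m):
    """Order-of-magnitude operation count with m inducing points."""
    return n * m**2

def kernel_matrix_gigabytes(n, bytes_per_float=8):
    """Memory needed to store the full n x n kernel matrix in float64."""
    return n * n * bytes_per_float / 1e9

m = 500  # hypothetical number of inducing points
for n in (1_000, 100_000, 1_000_000):
    ratio = exact_gp_cost(n) / sparse_gp_cost(n, m)
    print(f"n={n:>9,}  exact~{exact_gp_cost(n):.1e}  sparse~{sparse_gp_cost(n, m):.1e}  "
          f"ratio~{ratio:.1e}  full K~{kernel_matrix_gigabytes(n):,.3f} GB")
```

At a million points the full kernel matrix alone would need about 8,000 GB in double precision, so the O(n²) memory cost bites well before the O(n³) time does.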

Cost vs. data size, and inducing points

Exact GP cost grows as n³ (orange) — it explodes. Sparse GPs with m inducing points (teal) summarize the data and grow far more gently. Drag the data size and watch the gap open.

Why don’t exact GPs scale to very large datasets?

Chapter 7: A Gaussian Process, Live (showcase)

Build a GP by hand: click anywhere to add a data point and watch the posterior mean and uncertainty band update instantly (it’s closed-form — no training). Tune the lengthscale and noise and see the fit’s character change. This is a real GP regression running in your browser.

Click to add data — the GP responds

Click in the plot to add observations. The teal mean curve interpolates them; the shaded band is the ±2σ uncertainty — tight at your points, wide in the gaps. Tune lengthscale (wiggliness) and noise (how loosely it fits). Watch the uncertainty grow wherever you leave a gap.


Play with it: cluster points and the band shrinks there; leave a gap and watch the band balloon; shrink the lengthscale and the mean gets wiggly and over-fits; grow it and the fit smooths out. Everything you learned — prior, kernel, conditioning, error bars — is happening live, in closed form, every time you click.

Chapter 8: Uncertainty Beyond GPs

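GPs are the gold standard for uncertainty, but they don’t scale to deep-learning-sized problems. So how do big neural nets estimate uncertainty? Several practical methods:

Deep ensembles: train several networks independently and read uncertainty from how much their predictions disagree. Strong estimates, at roughly k times the training cost.
MC dropout: keep dropout switched on at prediction time and run several stochastic forward passes; the spread of the outputs is a cheap, rough uncertainty estimate.
Conformal prediction: wrap any already-trained model and use held-out calibration data to build prediction intervals with a guaranteed coverage rate.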

The throughline: a prediction without uncertainty is half an answer. GPs give it exactly and elegantly for small data; deep ensembles, MC dropout, and conformal prediction give it approximately and scalably for large models. Whichever you use, quantifying what the model doesn’t know — and powering decisions, active learning, and Bayesian optimization with it — is one of the most underrated skills in applied ML.
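Of the scalable methods named above, conformal prediction is the simplest to sketch end to end. The version below is split conformal prediction wrapped around a plain least-squares line. The underlying model, the synthetic data and every name are illustrative assumptions; the only essential ingredient is a calibration set the model was not fitted on.

```python
import numpy as np

def conformal_half_width(abs_residuals, alpha=0.1):
    """Split-conformal quantile of calibration residuals for (1 - alpha) coverage."""
    n = len(abs_residuals)
    rank = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    if rank > n:
        return np.inf                            # too few calibration points
    return np.sort(abs_residuals)[rank - 1]

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: a straight line plus Gaussian noise."""
    x = rng.uniform(0.0, 1.0, n)
    return x, 2.0 * x + 0.3 * rng.standard_normal(n)

x_fit, y_fit = make_data(200)    # used to fit the model
x_cal, y_cal = make_data(500)    # held out, used only for calibration
x_new, y_new = make_data(2000)   # fresh points to check coverage

coef = np.polyfit(x_fit, y_fit, deg=1)           # any point predictor would do

def predict(x):
    return np.polyval(coef, x)

q = conformal_half_width(np.abs(y_cal - predict(x_cal)), alpha=0.1)
covered = np.abs(y_new - predict(x_new)) <= q    # is y inside [pred - q, pred + q]?
print(f"interval half-width: {q:.3f}, empirical coverage: {covered.mean():.3f}")
```

The empirical coverage should land close to the 90% target. Note that this basic form gives every input the same interval width, unlike a GP band that widens away from the data.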

Uncertainty methods compared

GP (exact, small data), deep ensemble (disagreement), MC dropout (sampled), conformal (guaranteed coverage). Drag to see each method’s uncertainty estimate and scalability trade-off.

What does conformal prediction provide?

Chapter 9: Cheat Sheet & Connections

prior: a distribution over functions, defined by a kernel (covariance)
↓ kernel sets character (lengthscale = wiggliness)
kernel: RBF / periodic / linear / Matérn; encodes assumptions; hyperparameters learned via marginal likelihood
↓ condition on data (closed-form)
posterior: mean interpolates data; variance = error bar (tight at data, wide in gaps)
↓ cost O(n³) → sparse GPs (inducing points)
use: calibrated uncertainty, active learning, Bayesian optimization

Method            Uncertainty            Scale
Gaussian Process  exact, calibrated      small/medium (O(n³))
Deep ensembles    disagreement, strong   large (×k cost)
MC dropout        sampled, rough         large, cheap
Conformal         guaranteed coverage    any (wraps a model)

Keep exploring

Time-Series Forecasting — another home for probabilistic prediction
Bayesian Estimation — the prior→posterior machinery, generally
Kalman Filter — a Gaussian, recursive cousin
Perceiver — the inducing-points / latent-bottleneck idea

“What I cannot create, I do not understand.” You just rebuilt the Gaussian Process: a distribution over functions shaped by a kernel, conditioned on data in closed form to get a posterior whose mean interpolates and whose variance is an honest error bar — tight where you know, wide where you don’t. Knowing what the model doesn’t know is half the answer.