Don’t fit one function — fit a distribution over all plausible functions, and get calibrated uncertainty for free. The gold standard for “what does the model not know,” plus the broader toolkit of uncertainty in deep learning.
A normal regression model gives you a single curve: for any input, one predicted output. But that prediction hides something vital — how confident should you be? Near your training data the answer is probably good; far from it, the model is extrapolating blindly, yet it reports the prediction with the same false confidence. For anything where being wrong is costly — medicine, robotics, scientific experiments, finance — you need the model to say “here’s my best guess, and here’s how unsure I am.”
A Gaussian Process (GP) gives exactly that. Instead of fitting one function, it maintains a distribution over all functions consistent with the data. Its prediction at any point is itself a Gaussian: a mean (best guess) and a variance (uncertainty). And the uncertainty behaves exactly as it should — small near data, large far from it. GPs are the gold standard for calibrated uncertainty, and this lesson builds one from scratch, then surveys how deep learning chases the same goal.
The mean prediction (line) plus an uncertainty band (shaded). Notice the band is tight where there’s data and balloons in the gaps and beyond — the model admitting where it’s guessing. Drag to move the data and watch the uncertainty follow.
Here’s the mind-bending core idea. We usually think of a probability distribution over numbers (a Gaussian over heights, say). A Gaussian Process is a probability distribution over functions. Before seeing any data, the GP has a prior — a belief about which functions are plausible. You can literally sample from it: each sample is a whole random function (a wiggly curve). The prior says “functions like these are likely; functions like those are not.”
How can a distribution over infinitely many points be tractable? The trick: a GP is defined so that any finite set of points has a multivariate Gaussian distribution. You never need the whole infinite function — just the values at the finite points you care about, which are jointly Gaussian. That keeps everything to standard Gaussian math (means, covariances, conditioning), which is exactly what makes GPs both principled and computable. The whole behavior of the prior — how smooth, how wiggly, how the functions look — is controlled by one object: the kernel, which we meet next.
Each curve is one random function drawn from the GP prior — before any data. They share a character (here, smooth) but vary. Press resample to draw new functions; this cloud of plausible functions IS the prior.
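The finite-points trick can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the widget: the function names, the grid, the zero prior mean, and the `1e-6` jitter are all assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def sample_prior(x, n_samples=3, lengthscale=1.0, seed=0):
    """Draw functions from a zero-mean GP prior, evaluated at the points x."""
    rng = np.random.default_rng(seed)
    cov = rbf_kernel(x, x, lengthscale)
    # A small diagonal "jitter" keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-6 * np.eye(len(x)))
    # If z ~ N(0, I), then chol @ z ~ N(0, cov).
    z = rng.standard_normal((len(x), n_samples))
    return (chol @ z).T

x_grid = np.linspace(-5.0, 5.0, 100)
prior_draws = sample_prior(x_grid)  # shape (3, 100): three random curves
```

Each row of `prior_draws` is one "function", but only ever evaluated at the 100 grid points, which is all a plot needs.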
The kernel (or covariance function) is the heart of a GP. It answers one question: given two input points, how correlated should their outputs be? The intuition: nearby inputs should have similar outputs (a smooth function), distant inputs less so. The most common kernel, the RBF (squared-exponential), says exactly this — the correlation between two points falls off smoothly as the distance between their inputs grows.
The kernel has a crucial knob: the lengthscale. It sets how far apart two inputs must be before their outputs are essentially uncorrelated, in other words how fast the function is allowed to wiggle. A short lengthscale means correlation dies quickly: nearby points are linked but the function can change rapidly (wiggly). A long lengthscale means correlation persists over long distances: the function must be smooth and slowly varying. The kernel family encodes your assumptions about the function before you see data (smooth? periodic? linear?), and the lengthscale sets the distance scale over which those assumptions apply.
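To make the lengthscale concrete, here is a small sketch assuming the usual RBF form, correlation = exp(-d² / (2ℓ²)) for inputs a distance d apart. The function name and the chosen distances are illustrative.

```python
import numpy as np

def rbf_correlation(distance, lengthscale):
    """RBF correlation between two outputs whose inputs are `distance` apart."""
    return np.exp(-0.5 * (distance / lengthscale) ** 2)

distances = np.array([0.0, 0.5, 1.0, 2.0])
corr_short = rbf_correlation(distances, lengthscale=0.5)
corr_long = rbf_correlation(distances, lengthscale=2.0)
```

At a distance of 1, the short lengthscale gives a correlation of about 0.14, while the long one still gives about 0.88: the same pair of points is nearly independent under one prior and tightly linked under the other.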
Correlation between two points vs. the distance between their inputs. Drag the lengthscale: short = correlation dies fast (wiggly functions allowed); long = correlation persists (smooth functions only). The prior samples (right) respond.
The magic step: given observed data points, condition the prior to get the posterior, the distribution over functions that now agree with the data. Because everything is jointly Gaussian, conditioning is a closed-form operation (the same formula as conditioning any multivariate Gaussian on some of its variables). For a fixed kernel and fixed hyperparameters there is no iterative training and no gradient descent, just linear algebra. You get the posterior exactly, in one shot.
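For reference, the closed form looks like this under common assumptions: a zero prior mean and Gaussian observation noise with variance σ_n². The notation is chosen here for illustration: K = k(X, X) over the training inputs, K_* = k(X, X_*) between training and test inputs, and K_** = k(X_*, X_*).

```latex
\begin{aligned}
\boldsymbol{\mu}_* &= K_*^{\top}\,\bigl(K + \sigma_n^{2} I\bigr)^{-1}\,\mathbf{y} \\
\Sigma_* &= K_{**} - K_*^{\top}\,\bigl(K + \sigma_n^{2} I\bigr)^{-1}\,K_*
\end{aligned}
```

The mean is a weighted combination of the observed targets, and the covariance is the prior covariance minus whatever the data has explained.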
And the posterior does the beautiful thing. At and near observed points, the functions are pinned down — they must pass through (or near) the data — so the uncertainty collapses. In the gaps between data, and far beyond it, the functions are free to vary, so the uncertainty grows. The posterior mean interpolates the data smoothly (governed by the kernel), and the posterior variance is your honest error bar — automatically tight where you know and wide where you don’t. This is the GP’s signature: data pins the function down locally, and uncertainty fills the unknown.
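The pinning-down behavior can be checked numerically. The sketch below assumes a zero-mean GP with a unit-variance RBF kernel and a noise standard deviation of 0.1; the helper names and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0):
    """Unit-variance squared-exponential covariance for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, lengthscale=1.0, noise=0.1):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    k_train = rbf_kernel(x_train, x_train, lengthscale) + noise**2 * np.eye(len(x_train))
    k_cross = rbf_kernel(x_train, x_test, lengthscale)
    chol = np.linalg.cholesky(k_train)
    # alpha = (K + noise^2 I)^{-1} y, computed with two solves against the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_cross.T @ alpha
    v = np.linalg.solve(chol, k_cross)
    # Prior variance k(x, x) is 1 for this kernel; subtract what the data explains.
    var = 1.0 - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

x_train = np.array([-2.0, -1.0, 0.5, 1.5])
y_train = np.sin(x_train)
x_test = np.array([0.5, 6.0])  # one observed input, one far from all data
mean, std = gp_posterior(x_train, y_train, x_test)
```

Here `std[0]` (at an observed input) stays below the noise level, while `std[1]` (far from every observation) is back near the prior value of 1.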
Start with the wide prior (many functions). Add data points (drag the slider) and watch the functions snap to pass through them — uncertainty collapsing at the data, remaining wide in the gaps. That collapse is conditioning.
Put it together and you have GP regression’s signature picture: a mean curve that smoothly interpolates the data, wrapped in an uncertainty band that hugs the data tight and flares in the gaps. At any new input, the GP returns a Gaussian (mean and variance), so you can report “the value is about 5, give or take 0.3” near data, or “about 5, give or take 4” far away. The error bars fall out of the same Gaussian conditioning as the mean, and they are well calibrated as long as the kernel and noise assumptions match the data.
This is enormously useful. Active learning: sample next where uncertainty is highest — the model tells you where it’s most ignorant. Bayesian optimization: optimize an expensive function (tuning a model, designing an experiment) by balancing exploiting the high-mean regions and exploring the high-variance ones — GPs are the standard engine here. Safety: refuse to act when uncertainty is too high. The GP doesn’t just predict; it tells you how much to trust each prediction, which turns prediction into informed decision-making.
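A rough sketch of how these three uses turn a posterior into a decision. The upper-confidence-bound rule shown is one common acquisition function, not the only one, and the candidate inputs, means, standard deviations, and thresholds are made-up numbers for illustration.

```python
import numpy as np

def next_query_active_learning(candidates, std):
    """Pure exploration: query where the model is most uncertain."""
    return candidates[np.argmax(std)]

def next_query_ucb(candidates, mean, std, beta=1.0):
    """Upper confidence bound: trade off high mean (exploit) against high std (explore)."""
    return candidates[np.argmax(mean + beta * std)]

def safe_to_act(std_at_input, max_std=0.5):
    """Refuse to act when the predictive uncertainty exceeds a chosen threshold."""
    return std_at_input <= max_std

# Hypothetical posterior summaries at five candidate inputs.
candidates = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
mean = np.array([0.2, 0.9, 1.0, 0.4, 0.1])
std = np.array([0.05, 0.10, 0.30, 0.20, 0.90])

explore_pick = next_query_active_learning(candidates, std)  # 4.0: the most uncertain input
ucb_pick = next_query_ucb(candidates, mean, std, beta=1.0)  # 2.0: high mean, moderate std
```

Raising `beta` pushes the UCB rule toward exploration; with these numbers, `beta=2.0` would also pick the input at 4.0.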
Data points, the posterior mean (line), and the uncertainty band (shaded ±2 standard deviations). Tight at data, wide between — the calibrated error bar. Drag the observation noise: noisier data loosens the fit and widens the band even at the points.
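Why noise widens the band even at the data can be seen in the simplest case: a single observation at input x, with prior variance σ_f² and noise variance σ_n² (symbols chosen here for illustration). Conditioning gives the posterior variance of the function value at that same x:

```latex
\operatorname{Var}\bigl[f(x)\mid y\bigr]
  = \sigma_f^{2} - \frac{\sigma_f^{4}}{\sigma_f^{2} + \sigma_n^{2}}
  = \frac{\sigma_f^{2}\,\sigma_n^{2}}{\sigma_f^{2} + \sigma_n^{2}}
```

With σ_f² = 1 this is 0 for noiseless data, 0.2 for σ_n = 0.5 (a standard deviation of about 0.45), and it approaches the prior variance of 1 as the noise grows.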
The kernel is where you inject prior knowledge, and different kernels produce wildly different behavior. RBF gives smooth functions. Periodic kernels produce repeating patterns — perfect for seasonality. Linear kernels give straight-line trends. Matérn kernels allow rougher functions than RBF. And you can combine them: add a periodic kernel to a linear one to model “a rising trend with a yearly cycle.” The kernel is a language for expressing what you believe about the function.
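Combining kernels is just adding their covariance functions. The sketch below builds the "rising trend with a yearly cycle" example, assuming the standard exp-sine-squared form for the periodic kernel; the function names, the variance of 0.1 on the linear part, and the monthly time stamps are illustrative choices.

```python
import numpy as np

def linear_kernel(xa, xb, variance=1.0):
    """Covariance of straight lines through the origin with random slope."""
    return variance * xa[:, None] * xb[None, :]

def periodic_kernel(xa, xb, period=1.0, lengthscale=1.0, variance=1.0):
    """Exp-sine-squared kernel: outputs one period apart are perfectly correlated."""
    dist = np.abs(xa[:, None] - xb[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / lengthscale**2)

def trend_plus_cycle(xa, xb, period=1.0):
    # The sum of two valid kernels is itself a valid kernel.
    return linear_kernel(xa, xb, variance=0.1) + periodic_kernel(xa, xb, period=period)

years = np.linspace(0.0, 3.0, 37)  # hypothetical monthly time stamps over three years
cov = trend_plus_cycle(years, years, period=1.0)
```

The resulting `cov` can be dropped into the same conditioning formulas as any other kernel matrix; products of kernels are valid too.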
The kernel’s hyperparameters (lengthscale, output variance, noise) are usually learned from data by maximizing the marginal likelihood — the GP’s own principled measure of how well a setting explains the data, which automatically balances fit against complexity (a built-in Occam’s razor). So you choose the kernel family (your structural assumption) and let the data tune its parameters. Picking the right kernel is the main modeling decision in GP work — get it right and the GP extrapolates sensibly; get it wrong and it’s confidently mistaken in exactly the way the kernel implies.
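As a hedged illustration of learning the lengthscale, the sketch below scores a handful of candidate values with the log marginal likelihood, log p(y | X) = -½ yᵀK_y⁻¹y - ½ log|K_y| - (n/2) log 2π, where K_y is the kernel matrix plus noise on the diagonal. Real libraries typically use a gradient-based optimizer rather than a grid; the grid, the toy data, and the fixed noise level are assumptions made to keep the example short.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0):
    """Unit-variance squared-exponential covariance for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def log_marginal_likelihood(x, y, lengthscale, noise=0.1):
    """Log evidence of a zero-mean GP with the given lengthscale and noise std."""
    k_y = rbf_kernel(x, x, lengthscale) + noise**2 * np.eye(len(x))
    chol = np.linalg.cholesky(k_y)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    data_fit = -0.5 * y @ alpha                  # rewards explaining the data
    complexity = -np.sum(np.log(np.diag(chol)))  # equals -0.5 * log det(K_y)
    return data_fit + complexity - 0.5 * len(x) * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))

candidates = np.array([0.05, 0.2, 0.5, 1.0, 2.0, 5.0, 20.0])
scores = np.array([log_marginal_likelihood(x, y, ls) for ls in candidates])
best_lengthscale = candidates[np.argmax(scores)]
```

Very short lengthscales are penalized by the complexity term and very long ones by the data-fit term, so the winner lands in between.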
The same data, four kernels. RBF (smooth), periodic (repeating), linear (trend), Matérn (rougher). The kernel is your assumption about the function’s shape. Drag to switch and see how the GP extrapolates each way.
GPs have one serious limitation: they scale badly. Conditioning requires inverting (or factorizing) the kernel matrix over all data points — an n×n matrix — which costs O(n³) time and O(n²) memory. At a few thousand points it’s fine; at a million, the exact GP is hopeless. This cubic cost is why GPs shine on small-to-medium data and struggle on the large datasets where deep nets thrive.
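Some back-of-the-envelope arithmetic shows where the wall is. This sketch assumes a dense kernel matrix stored in 64-bit floats and uses the leading-order n³/3 operation count for a Cholesky factorization; actual runtimes depend on hardware and library.

```python
def exact_gp_cost(n, bytes_per_float=8):
    """Rough cost of an exact GP on n points: dense n x n matrix, about n^3 / 3 flops."""
    memory_gb = n * n * bytes_per_float / 1e9
    cholesky_flops = n**3 / 3
    return memory_gb, cholesky_flops

for n in (1_000, 10_000, 1_000_000):
    memory_gb, flops = exact_gp_cost(n)
    print(f"n={n:>9,}: kernel matrix ~{memory_gb:,.3f} GB, Cholesky ~{flops:.1e} flops")
```

At a thousand points the matrix is 8 MB; at a million points it is 8 TB before any factorization starts.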
The fix is sparse / approximate GPs: instead of using all n points, summarize them with a smaller set of inducing points (m ≪ n) that capture the data’s structure. The GP then operates on these m representatives, dropping the cost dramatically (to roughly O(n·m²)). If that sounds familiar (a small set of learned representatives summarizing a large input), it’s the same bottleneck idea as the Perceiver’s latents. There are also iterative and GPU-friendly methods. So GPs can scale, with approximation, but the exact, beautiful closed-form GP is fundamentally a small-data tool, and that’s the central trade-off versus deep learning’s scale.
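One of the simplest inducing-point schemes is the subset-of-regressors approximation, sketched below for the predictive mean only. It is a stand-in to show the shape of the computation, not necessarily the method any particular library uses: the inducing inputs sit on a fixed grid instead of being learned, and the function names, toy data, and jitter are assumptions.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0):
    """Unit-variance squared-exponential covariance for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def sparse_gp_mean(x_train, y_train, x_test, inducing, noise=0.1, jitter=1e-6):
    """Subset-of-regressors predictive mean using m inducing inputs."""
    k_mm = rbf_kernel(inducing, inducing)
    k_mn = rbf_kernel(inducing, x_train)  # m x n: the only object that grows with n
    k_sm = rbf_kernel(x_test, inducing)
    # Only an m x m system is solved; forming k_mn @ k_mn.T costs O(n m^2).
    a = noise**2 * k_mm + k_mn @ k_mn.T + jitter * np.eye(len(inducing))
    return k_sm @ np.linalg.solve(a, k_mn @ y_train)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 10.0, 500)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(500)
inducing = np.linspace(0.0, 10.0, 15)  # m = 15 representatives for n = 500
x_test = np.linspace(1.0, 9.0, 50)
approx_mean = sparse_gp_mean(x_train, y_train, x_test, inducing)
```

No n×n matrix is ever formed. More refined variants also correct the predictive variance and learn where the inducing points should go.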
Exact GP cost grows as n³ (orange) — it explodes. Sparse GPs with m inducing points (teal) summarize the data and grow far more gently. Drag the data size and watch the gap open.
Build a GP by hand: click anywhere to add a data point and watch the posterior mean and uncertainty band update instantly (it’s closed-form — no training). Tune the lengthscale and noise and see the fit’s character change. This is a real GP regression running in your browser.
Click in the plot to add observations. The teal mean curve interpolates them; the shaded band is the ±2σ uncertainty — tight at your points, wide in the gaps. Tune lengthscale (wiggliness) and noise (how loosely it fits). Watch the uncertainty grow wherever you leave a gap.
Play with it: cluster points and the band shrinks there; leave a gap and watch the band balloon; shrink the lengthscale and the mean gets wiggly and over-fits; grow it and the fit smooths out. Everything you learned — prior, kernel, conditioning, error bars — is happening live, in closed form, every time you click.
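GPs are the gold standard for uncertainty, but they don’t scale to deep-learning-sized problems. So how do big neural nets estimate uncertainty? Several practical methods exist: deep ensembles (train several networks and read uncertainty from their disagreement), MC dropout (keep dropout switched on at prediction time and sample repeatedly), and conformal prediction (wrap any trained model to get intervals with guaranteed coverage).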
The throughline: a prediction without uncertainty is half an answer. GPs give it exactly and elegantly for small data; deep ensembles, MC dropout, and conformal prediction give it approximately and scalably for large models. Whichever you use, quantifying what the model doesn’t know — and powering decisions, active learning, and Bayesian optimization with it — is one of the most underrated skills in applied ML.
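The ensemble idea, uncertainty as disagreement, can be shown without a neural network. In this sketch each "member" is a degree-5 polynomial fit on a bootstrap resample, a deliberately small stand-in for a separately trained network (deep ensembles usually get their diversity from random initialization instead). All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-2.0, 2.0, 60)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(60)

def fit_member(seed, degree=5):
    """One ensemble member: a polynomial fit on a bootstrap resample of the data."""
    member_rng = np.random.default_rng(seed)
    idx = member_rng.integers(0, len(x_train), len(x_train))
    return np.polyfit(x_train[idx], y_train[idx], degree)

members = [fit_member(seed) for seed in range(10)]
x_query = np.array([0.0, 4.0])  # inside vs. well outside the training range
preds = np.stack([np.polyval(coeffs, x_query) for coeffs in members])  # shape (10, 2)
ensemble_mean = preds.mean(axis=0)
ensemble_std = preds.std(axis=0)  # member disagreement, used as an uncertainty proxy
```

The members agree closely at 0.0, where they all saw data, and scatter widely at 4.0, where each is extrapolating on its own.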
GP (exact, small data), deep ensemble (disagreement), MC dropout (sampled), conformal (guaranteed coverage). Drag to see each method’s uncertainty estimate and scalability trade-off.
| Method | Uncertainty | Scale |
|---|---|---|
| Gaussian Process | exact, calibrated | small/medium (O(n³)) |
| Deep ensembles | disagreement, strong | large (×k cost for k models) |
| MC dropout | sampled, rough | large, cheap |
| Conformal | guaranteed coverage | any (wraps a model) |
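The "wraps a model" row can be made concrete with split conformal prediction. The sketch assumes calibration and test points are exchangeable and uses a cubic polynomial as a stand-in for whatever model you have trained; the function names and toy data are illustrative.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Symmetric intervals with at least 1 - alpha marginal coverage under exchangeability."""
    scores = np.sort(np.abs(cal_true - cal_pred))  # nonconformity scores
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))     # finite-sample corrected rank
    if rank > n:
        raise ValueError("calibration set too small for this alpha")
    q = scores[rank - 1]
    return test_pred - q, test_pred + q

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 1200)
y = np.sin(x) + 0.2 * rng.standard_normal(len(x))
x_fit, y_fit = x[:400], y[:400]
x_cal, y_cal = x[400:800], y[400:800]
x_new, y_new = x[800:], y[800:]

# Stand-in for any trained model: a cubic polynomial fit on the training split.
coeffs = np.polyfit(x_fit, y_fit, deg=3)

def predict(inputs):
    return np.polyval(coeffs, inputs)

lower, upper = split_conformal_interval(predict(x_cal), y_cal, predict(x_new), alpha=0.1)
coverage = np.mean((y_new >= lower) & (y_new <= upper))
```

The guarantee is on average coverage across inputs, so the interval width here is the same everywhere; unlike a GP band, it does not widen by itself in data-poor regions.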
→ Time-Series Forecasting — another home for probabilistic prediction
→ Bayesian Estimation — the prior→posterior machinery, generally
→ Kalman Filter — a Gaussian, recursive cousin
→ Perceiver — the inducing-points / latent-bottleneck idea