From Bayesian regression to reproducing kernel Hilbert spaces — when you need nonlinear models but still want closed-form solutions.
In the last lecture, we fitted linear models: f(x) = wTx. This works great if the relationship between features and targets is actually linear. But what if it's not?
One approach: transform the features. Instead of using x directly, map it to a higher-dimensional space using a feature map φ(x), then do linear regression in the new space:
For example, if x is 1D and we set φ(x) = [1, x, x2, x3], we get polynomial regression — a linear model in a higher-dimensional feature space.
Here's a preview. Suppose the only thing your algorithm ever computes is inner products ⟨φ(xi), φ(xj)⟩. If there exists a kernel function K(xi, xj) that equals this inner product, you can work entirely in terms of K — without ever computing φ itself. This is the kernel trick.
Data from a nonlinear function. Linear regression fails. But we can fit it perfectly with the right feature map.
You might ask: why not just compute φ(x) explicitly and do linear regression in the high-dimensional space? For the polynomial kernel of degree d in p dimensions, the feature space has O(pd) dimensions. For degree 5 with 100 features: 10 billion dimensions. You can't even allocate the memory.
For the Gaussian kernel, it's worse: the feature space is literally infinite-dimensional (the Taylor expansion of exp has infinitely many terms). No finite computer can store φ(x). The kernel trick sidesteps this entirely: K(x, x') = exp(−‖x−x'‖2/2ℓ2) is a single scalar computation, yet it implicitly computes an inner product in an infinite-dimensional Hilbert space.
Before kernels, let's see ridge regression from a Bayesian perspective. This will motivate why kernels naturally arise.
Assume the data comes from a Gaussian model:
Each observation y is the linear prediction plus Gaussian noise with variance σ2. Now put a Gaussian prior on the weights:
This says: before seeing any data, we believe the weights are small (centered at zero) with variance τ2 per component.
The maximum a posteriori (MAP) estimate maximizes the posterior p(w|data) ∝ p(data|w) · p(w). Taking the negative log:
Minimizing this is exactly ridge regression with λ = σ2/τ2.
The full Bayesian approach doesn't just find a point estimate. The posterior over w is also Gaussian:
The posterior mean μpost equals the ridge solution. But we also get uncertainty: Σpost tells us how confident we are about each weight.
Suppose d = 1 (one feature), σ2 = 1, τ2 = 4. Then λ = σ2/τ2 = 0.25. Given N = 3 points with x = [1, 2, 3], y = [2.1, 3.9, 6.2] (and bias column), the MAP/ridge solution will shrink the weights slightly toward zero compared to OLS.
The effect is small here because λ = 0.25 is modest. But consider τ2 = 0.1 (tight prior: "I believe weights are near zero"). Then λ = 10, and the ridge solution is dramatically different from OLS — the weights are pulled almost to zero.
For a new input x*, the predictive distribution integrates over the posterior on w:
The first term σ2 is the irreducible noise. The second term x*TΣpostx* is the epistemic uncertainty — uncertainty from not knowing the true weights. It's larger when x* is far from the training data (extrapolation). This is the precursor to GP uncertainty.
Adjust the prior variance τ². Large prior → trust data (OLS). Small prior → shrink weights (ridge).
The Bayesian predictive distribution at x* integrates over all possible weight vectors:
Since both terms are Gaussian, the integral is also Gaussian with mean x*Tμpost and variance σ2 + x*TΣpostx*. Now the key observation: the variance depends on x* only through x*TΣpostx*. Substituting Σpost and rearranging, this can be written entirely in terms of inner products between feature vectors. If we replace those inner products with a kernel function K, we get the GP predictive variance — no feature computation needed.
This is how the Bayesian perspective naturally leads to kernel methods: the entire posterior computation (mean and variance) depends on the feature vectors only through their inner products, which is exactly what a kernel provides.
Now the key move. Instead of specifying a finite feature map φ(x), we specify a kernel function K(x, x') that measures similarity between any two inputs. The model becomes:
The prediction at a new point x is a weighted sum of the kernel's similarity to every training point. The coefficients αi replace the weight vector w.
| Name | K(x, x') | Feature space |
|---|---|---|
| Linear | xTx' | Same as input (finite) |
| Polynomial (deg d) | (xTx' + c)d | All monomials up to degree d |
| Gaussian (RBF) | exp(−‖x − x'‖2 / 2ℓ2) | Infinite-dimensional |
The Gaussian kernel (also called the radial basis function or RBF kernel) is the most important. Its feature space is infinite-dimensional, yet we can compute K(x, x') in O(d) time. This is the kernel trick in action.
Think of K(x, x') as measuring how similar two inputs are. For the Gaussian kernel, K(x, x') ≈ 1 when x and x' are close (within a few length scales ℓ), and K(x, x') ≈ 0 when they're far apart. The prediction f(x*) = ∑αiK(xi, x*) weights each training point by its similarity to the test point. Nearby training points have a big say; distant ones are ignored.
This is why ℓ matters so much: it controls the "radius of influence" of each training point. Small ℓ = only very nearby points matter = wiggly fits. Large ℓ = distant points still influence = smooth fits.
Define the kernel matrix (or Gram matrix) K where Kij = K(xi, xj). The ridge regression solution in kernel form is:
The length scale ℓ in the Gaussian kernel controls how "local" the model is. Small ℓ: each training point only influences nearby predictions (wiggly). Large ℓ: influence spreads widely (smooth).
Consider a 1D polynomial kernel K(x, x') = (xx' + 1)2. Expanding: (xx' + 1)2 = x2x'2 + 2xx' + 1. This is the inner product of φ(x) = [x2, √2·x, 1] with φ(x') = [x'2, √2·x', 1].
The feature map φ maps 1D inputs to 3D. But we never compute φ — we just evaluate K(x, x') = (xx' + 1)2 in O(1). For the Gaussian kernel, the implicit feature space has infinitely many dimensions (the Taylor expansion of exp has infinitely many terms), yet each kernel evaluation is still O(d).
Start with ridge regression in the feature space: w = (XφTXφ + λI)−1XφTy, where Xφ has rows φ(xi)T. The prediction at x is:
Using the matrix identity AT(AAT + λI)−1 = (ATA + λI)−1AT, this becomes:
where k(x)i = K(xi, x) and Kij = K(xi, xj). Everything is expressed in terms of kernel evaluations — φ never appears.
Choose a kernel type and adjust its parameters. Watch how the fit changes.
We've been working with kernels informally. Now let's make it rigorous. A Reproducing Kernel Hilbert Space (RKHS) is a special type of function space with a remarkable property.
A Hilbert space is a (possibly infinite-dimensional) vector space with an inner product that is complete (all Cauchy sequences converge). Think of it as a generalization of Euclidean space to functions.
An RKHS H is a Hilbert space of functions f: X → R with one extra property: evaluation is continuous. Formally, for every point x ∈ X, the evaluation functional δx(f) = f(x) is a bounded linear functional on H.
By the Riesz representation theorem, every bounded linear functional on a Hilbert space can be represented as an inner product with some element. So for each x, there exists a function Kx ∈ H such that:
The function K(x, x') = Kx(x') = ⟨Kx, Kx'⟩H is the reproducing kernel of H. It "reproduces" function values via inner products.
This is the same kernel function we used in kernel regression. The RKHS gives it a rigorous foundation.
A valid kernel must be positive semi-definite (PSD): for any set of points {x1, ..., xN}, the Gram matrix Kij = K(xi, xj) must be PSD. Conversely, any PSD function defines a unique RKHS (Moore-Aronszajn theorem).
For the linear kernel K(x, x') = xTx', the RKHS is the set of linear functions f(x) = wTx, and the RKHS norm is ‖f‖H = ‖w‖. Regularizing ‖f‖H2 = ‖w‖2 is exactly ridge regression.
For the Gaussian kernel, the RKHS contains smooth functions. A function that interpolates every data point exactly (zero training error) will have a large RKHS norm because it oscillates wildly between points. A smoother function that approximately fits the data has a smaller norm. The regularizer λ‖f‖H2 explicitly prefers the smoother fit.
Every PSD kernel defines exactly one RKHS, and every RKHS has exactly one reproducing kernel. This is the Moore-Aronszajn theorem. The RKHS is constructed as the closure of the span of {K(x, ·) : x ∈ X}. The inner product is defined by ⟨K(x, ·), K(x', ·)⟩H = K(x, x'). Any function in this RKHS can be expressed (possibly as an infinite sum) as f = ∑ αi K(xi, ·).
Not every function K(x, x') is a valid kernel. It must be PSD. But you can build new kernels from existing ones:
The Gaussian kernel is valid because it can be written as exp(−‖x‖2/2ℓ2) · exp(xTx'/ℓ2) · exp(−‖x'‖2/2ℓ2), and the exponential of a PSD kernel is PSD (via the Taylor series: each term is a product of PSD kernels, and the sum of PSD kernels is PSD).
The space L2 (square-integrable functions) is a Hilbert space, but NOT an RKHS. In L2, two functions that differ at a single point are considered "the same" (they have the same L2 norm). So you can't meaningfully evaluate f(x) at a single point — it's not a continuous operation. In an RKHS, the norm is strong enough that nearby functions (in the RKHS norm) also have nearby values at each point. This "pointwise control" is exactly what makes RKHS useful for regression: we need f(xi) to be well-defined.
The intuition: an RKHS is a "nicer" function space than L2. It contains only functions that are smooth enough to be evaluated pointwise. The kernel controls exactly how smooth.
Here's the most important result in kernel methods. You want to find the function f in an infinite-dimensional RKHS that minimizes:
where L is any loss function. The RKHS is infinite-dimensional, so this seems like searching over infinitely many parameters. But the Representer Theorem says:
Decompose any f ∈ H as f = f∥ + f⊥, where f∥ is in the span of {Kx1, ..., KxN} and f⊥ is orthogonal to this span.
The loss term only sees f at x1, ..., xN. By the reproducing property, f(xi) = ⟨f, Kxi⟩ = ⟨f∥, Kxi⟩ (since f⊥ is orthogonal). So the loss doesn't depend on f⊥.
The regularization term: ‖f‖2 = ‖f∥‖2 + ‖f⊥‖2 ≥ ‖f∥‖2. Adding f⊥ only increases the penalty without changing the loss. So the optimal f has f⊥ = 0.
When L is squared loss, the problem becomes:
where we used ‖f‖H2 = αTKα (since f = ∑αiKxi and ⟨Kxi, Kxj⟩ = Kij). Taking the derivative:
which gives α = (K + λI)−1y, the familiar kernel ridge formula. The Representer Theorem guarantees this is also the optimum over the entire infinite-dimensional RKHS, not just over N-dimensional coefficient vectors.
Click to add data points. The model fits f(x) = ∑ αi K(xi, x) using (K + λI)−1y. The faint blue curves show each individual kernel contribution αiK(xi, ·); the orange curve is their sum. Try different kernels and watch how the fit changes character.
Click canvas to add points. Adjust kernel type, bandwidth, and regularization.
Kernel regression gives us point predictions. But what about uncertainty? Bayesian linear regression gave us a posterior distribution. Can we get the same thing with kernels?
A Gaussian process (GP) is a distribution over functions. Formally, f ~ GP(m(x), K(x, x')) means that for any finite set of points {x1, ..., xN}, the vector [f(x1), ..., f(xN)] is jointly Gaussian with mean [m(x1), ..., m(xN)] and covariance matrix Kij = K(xi, xj).
Given training data {(xi, yi)} with noise model y = f(x) + ε, ε ~ N(0, σ2), the posterior at a new point x* is:
where k* = [K(x1, x*), ..., K(xN, x*)]T. The posterior mean μ* is exactly kernel ridge regression with λ = σ2. But we also get σ*2: uncertainty that is small near training points and large far away.
Let K be the Gaussian kernel with ℓ = 1, σ2 = 0.1. Training data: x1 = 0, y1 = 1 and x2 = 1, y2 = −0.5.
Its inverse: det = 1.21 − 0.368 = 0.842. So (K + σ2I)−1 = [[1.306, −0.721], [−0.721, 1.306]]. Multiplying by y:
Prediction at x* = 0.5: k* = [K(0, 0.5), K(1, 0.5)] = [0.882, 0.882]. Mean = 0.882 · 1.667 + 0.882 · (−1.374) = 0.259. This is a weighted blend of the two observations, reflecting the kernel's smoothness.
Variance: σ*2 = K(0.5, 0.5) − k*T(K + σ2I)−1k* = 1.0 − [0.882, 0.882] · [1.667, −1.374] · ... The variance is small because x* = 0.5 is between two training points. At x* = 3 (far from data), σ*2 → K(3, 3) = 1 (full prior uncertainty).
Mercer's theorem states that a continuous PSD kernel on a compact domain can be expanded as:
where λj ≥ 0 are eigenvalues and φj are eigenfunctions of the integral operator associated with K. The feature map is φ(x) = [√λ1φ1(x), √λ2φ2(x), ...]. For the Gaussian kernel, all λj > 0 (infinitely many), confirming the feature space is infinite-dimensional.
GP regression implicitly uses this eigenexpansion. The posterior shrinks each eigencomponent by λj/(λj + σ2) — high-eigenvalue components (smooth, data-aligned) are trusted, low-eigenvalue components (rough, noisy) are suppressed. This is the same spectral filtering as ridge regression in the eigenspace of XTX.
The shaded region is the 95% confidence interval. Notice it narrows near data points.
This behavior is exactly right for decision-making under uncertainty. In Bayesian optimization, the GP posterior variance guides exploration: we sample at points where uncertainty is highest (most information to gain). In active learning, we query labels at points where the GP is most uncertain.
Let's tie everything together. In ordinary regression, we regularize the weight vector: λ‖w‖2. In kernel regression, we regularize the function itself: λ‖f‖H2.
What does the RKHS norm ‖f‖H actually measure? For the Gaussian kernel:
where F(ω) is the Fourier transform of f and S(ω) is the spectral density of the kernel. High-frequency components of f are penalized more heavily (since S(ω) decays for large ω). The RKHS norm is a smoothness penalty in disguise.
| View | Optimization | Regularizer |
|---|---|---|
| Primal (finite features) | min ‖X w − y‖2 + λ‖w‖2 | Weight magnitude |
| Dual (kernel) | min ‖Kα − y‖2 + λαTKα | Function RKHS norm |
| Bayesian | max p(y|f) · GP(f; 0, K) | Prior over functions |
These are three perspectives on the same optimization. Primal works when d is small. Dual works when N is small (or d is infinite). Bayesian gives us uncertainty.
Start from the Bayesian view: the GP prior says f ~ GP(0, K). The likelihood is p(y|f) = N(f, σ2I). The MAP estimate minimizes:
where f = [f(x1), ..., f(xN)]. Multiply by 2σ2: minimize ‖f − y‖2 + σ2fTK−1f. Writing f = Kα:
Setting the gradient to zero: (K + σ2I)α = y. This is kernel ridge regression with λ = σ2. The same α, the same predictions, derived from three completely different starting points.
The kernel encodes your inductive bias — what you believe the function looks like before seeing data. Key choices:
In kernel methods, the kernel parameters control the bias-variance tradeoff:
The RKHS norm provides a principled measure of complexity. Unlike polynomial degree (discrete), the RKHS norm is continuous — λ smoothly trades off fit quality against function complexity. This is one reason kernel methods generalize better than polynomial regression in high dimensions.
The deep connection: regularization in weight space (λ‖w‖2), regularization in function space (λ‖f‖H2), and Bayesian priors (p(f) = GP(0, K)) are three faces of the same idea. The RKHS makes this precise.
python import numpy as np def gaussian_kernel(X1, X2, ell=1.0): """Gram matrix K[i,j] = exp(-||x_i - x_j||^2 / 2*ell^2)""" sq = np.sum(X1**2, axis=1, keepdims=True) # (N1, 1) + np.sum(X2**2, axis=1) # (N2,) - 2 * X1 @ X2.T # (N1, N2) return np.exp(-sq / (2 * ell**2)) def kernel_ridge(X_train, y_train, X_test, ell=1.0, lam=0.1): """Kernel ridge regression with Gaussian kernel.""" K = gaussian_kernel(X_train, X_train, ell) # (N, N) alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train) K_test = gaussian_kernel(X_test, X_train, ell) # (M, N) return K_test @ alpha # predictions def gp_predict(X_train, y_train, X_test, ell=1.0, sig2=0.1): """GP posterior mean + variance.""" K = gaussian_kernel(X_train, X_train, ell) + sig2 * np.eye(len(X_train)) K_star = gaussian_kernel(X_test, X_train, ell) alpha = np.linalg.solve(K, y_train) mu = K_star @ alpha # posterior mean v = np.linalg.solve(K, K_star.T) var = 1.0 - np.sum(K_star * v.T, axis=1) # posterior variance return mu, var
Let's map the entire journey from linear regression to RKHS.
| Concept | Key Equation | Insight |
|---|---|---|
| Kernel trick | K(x,x') = ⟨φ(x), φ(x')⟩ | Never compute φ explicitly |
| Representer Theorem | f* = ∑ αi K(xi, ·) | ∞-dim search → N coefficients |
| GP posterior mean | μ* = k*T(K+σ²I)−1y | Same as kernel ridge regression |
| GP posterior var | σ*² = K(**) − k*T(K+σ²I)−1k* | Uncertainty ↑ far from data |
The RKHS framework is one of the most powerful tools in statistical learning theory. It gives a rigorous language for "function complexity," connects Bayesian and frequentist methods through the kernel, and the Representer Theorem turns infinite-dimensional optimization into finite computation. Every time you use a Gaussian process, an SVM, or analyze a neural network through the NTK, you're working in an RKHS.
| Method | Training Cost | Prediction Cost | Memory |
|---|---|---|---|
| Primal ridge (d features) | O(Nd2 + d3) | O(d) | O(d) |
| Kernel ridge (N points) | O(N3) | O(N) | O(N2) |
| GP posterior | O(N3) | O(N) mean, O(N2) var | O(N2) |
The O(N3) cost of kernel methods is the main bottleneck. For N > 10,000, approximate methods are needed: random Fourier features (Rahimi & Recht, 2007), inducing points (Snelson & Ghahramani, 2006), or structured kernel interpolation.
Decision guide for practitioners:
In modern deep learning, the neural tangent kernel (NTK) connects neural networks to kernel methods: an infinitely wide neural network trained with gradient descent is equivalent to kernel regression with the NTK. This gives theoretical tools to analyze neural network generalization via RKHS theory.
The kernel parameters (length scale ℓ, polynomial degree d) and regularization λ dramatically affect performance. Two standard approaches:
Cross-validation: Split data into K folds, evaluate prediction error for each (ℓ, λ) pair, pick the best. Computational cost: O(K · N3) per parameter setting. Works for any kernel method.
Marginal likelihood (for GPs): Maximize log p(y | ℓ, σ2) = −(1/2)yT(K + σ2I)−1y − (1/2)log|K + σ2I| − (N/2)log(2π). The first term favors data fit, the second penalizes model complexity (Occam's razor built in). Gradient-based optimization finds the best kernel parameters automatically. This is the preferred method for GPs because it avoids the expense of cross-validation.
The marginal likelihood is sometimes called the evidence. It automatically balances model complexity against data fit — too flexible a kernel (very small ℓ) gets penalized by the log-determinant term, while too rigid a kernel (very large ℓ) gets penalized by the data fit term. This is the Bayesian answer to cross-validation.
"The kernel trick is one of the most beautiful mathematical ideas in machine learning. It says: you can work in a million-dimensional space by only ever computing dot products." — Pilanci, EE269