The kernel trick, Gaussian processes, and the art of working in infinite-dimensional feature spaces without ever going there.
Many algorithms (linear regression, PCA, nearest neighbors) depend on the data only through inner products between data points: xnTxm. What if we could replace these inner products with a more general similarity measure?
The kernel function k(x, x') measures similarity between two inputs. The remarkable discovery: for many useful kernels, there exists a (possibly infinite-dimensional) feature space where the kernel computes the inner product: k(x, x') = φ(x)Tφ(x').
The kernel trick transforms the question from "what basis functions should I use?" to "what similarity measure is appropriate for my data?" This is often a more natural question.
Consider regularized least squares from Ch 3. The solution w = (ΦTΦ + λI)−1ΦTt can be rewritten as:
where a = (1/λ)(t − Φw) is a vector of N coefficients (one per training point). Substituting back, we get the dual form:
where K is the N × N Gram matrix with Knm = k(xn, xm) = φ(xn)Tφ(xm).
Prediction for a new input x:
where k(x)n = k(x, xn). The prediction is a weighted sum of kernel evaluations with training points. No explicit feature vectors needed!
For a function k(x, x') to be a valid kernel, it must correspond to an inner product in some feature space. The key theorem:
A concrete example: the polynomial kernel. Consider k(x, x') = (x x' + 1)2 in 1D. Expanding: k(x, x') = 1 + 2xx' + x2x'2 = φ(x)Tφ(x') where φ(x) = (1, √2 x, x2). A degree-2 polynomial kernel in D dimensions implicitly works in a feature space of dimension O(D2) — but the kernel evaluation costs just O(D).
For a general degree-M polynomial in D dimensions, the feature space has dimension O(DM), but the kernel k(x, x') = (xTx' + c)M still costs just O(D). The savings become astronomical for large M or D.
| Kernel | Formula | Feature space |
|---|---|---|
| Linear | k(x,x') = xTx' | D-dimensional (no transformation) |
| Polynomial | k(x,x') = (xTx'+c)M | O(DM)-dimensional |
| Gaussian (RBF) | k(x,x') = exp(−||x−x'||2/2ℓ2) | Infinite-dimensional |
| Sigmoid | k(x,x') = tanh(κxTx'−δ) | Not always valid (not PSD for all κ,δ) |
The length scale ℓ in the Gaussian kernel controls the smoothness. Small ℓ → local influence (only nearby points matter) → wiggly function. Large ℓ → global influence → smooth function. This is the kernel analog of the bias-variance tradeoff.
We can build complex kernels from simpler ones. If k1 and k2 are valid kernels, then so are:
| Construction | Result |
|---|---|
| c k1(x,x') | Scaled kernel (c > 0) |
| k1 + k2 | Sum kernel |
| k1 · k2 | Product kernel |
| f(x) k1(x,x') f(x') | Scaled kernel (any function f) |
| exp(k1(x,x')) | Exponentiated kernel |
These rules let us design kernels for structured data — strings, graphs, sets — where traditional feature vectors don't exist. For example, string kernels count shared substrings between two text documents, and graph kernels measure structural similarity between molecules.
Radial basis function (RBF) networks use Gaussian basis functions centered on a subset of the training data (or on learned centers):
This is a linear model in the feature space defined by Gaussian basis functions. The parameters wj can be found by solving a linear system (unlike neural networks, which require iterative optimization).
Nadaraya-Watson model: A special case of kernel regression where the weights are determined by the kernel itself: y(x) = ∑n k(x,xn) tn / ∑n k(x,xn). This is a weighted average of training targets, with weights proportional to kernel similarity. No training required — just look up the nearest neighbors and average.
A Gaussian process (GP) is a distribution over functions. Instead of specifying a parametric model and putting a prior on parameters, we put a prior directly on the space of functions.
where m(x) is the mean function (often zero) and k(x, x') is the covariance function (the kernel). This means: for any finite set of points, the function values are jointly Gaussian.
The kernel encodes our prior beliefs about the function: how smooth it is, how quickly it varies, whether it has periodic structure. Choosing a kernel is the GP analog of choosing a model architecture.
Random functions drawn from a GP prior with RBF kernel. Adjust the length scale to see how it controls smoothness.
Given N training points, the GP posterior is also a GP (Gaussians are closed under conditioning):
where K is the N × N kernel matrix and k(x) is the vector of kernel evaluations with training points.
The kernel hyperparameters (length scale, signal variance, noise variance) can be learned by maximizing the log marginal likelihood:
This automatically balances data fit (first term) with model complexity (second term) — the evidence framework from Ch 3, now in function space.
For classification, the GP prior is placed on a latent function f(x), then passed through a sigmoid to get class probabilities:
where f ~ GP(0, k). The problem: the sigmoid likelihood is not conjugate to the Gaussian prior, so the posterior over f is non-Gaussian.
Two approximation strategies:
Laplace approximation: Find the mode of p(f|t) (the MAP latent function values), then fit a Gaussian. Conceptually identical to Laplace for logistic regression (Ch 4) but now in function space.
Expectation propagation (EP): A more accurate approximation that matches the moments of the true posterior at each data point. Generally better calibrated than Laplace.
Chapter 6 introduced a fundamentally different perspective on machine learning:
| Parametric (Ch 3-5) | Kernel / GP (Ch 6) |
|---|---|
| Prior on parameters w | Prior on functions f |
| Finite-dimensional | Potentially infinite-dimensional |
| Prediction: O(M) per point | Prediction: O(N) per point |
| Training: O(NM2) | Training: O(N3) |
| Fixed basis functions or learned (NN) | Kernel encodes prior beliefs |
| Scales to large N | Expensive for large N (cubic cost) |
What comes next: Chapter 7 covers sparse kernel machines — SVMs and relevance vector machines — which address the O(N) prediction cost of kernel methods by finding sparse representations.