Bishop PRML, Chapter 6

Kernel Methods

The kernel trick, Gaussian processes, and the art of working in infinite-dimensional feature spaces without ever going there.

Prerequisites: Chapters 3–4 (linear regression, classification, Bayesian inference).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Kernels?

Many algorithms (linear regression, PCA, nearest neighbors) depend on the data only through inner products between data points: xnTxm. What if we could replace these inner products with a more general similarity measure?

The kernel function k(x, x') measures similarity between two inputs. The remarkable discovery: for many useful kernels, there exists a (possibly infinite-dimensional) feature space where the kernel computes the inner product: k(x, x') = φ(x)Tφ(x').

The kernel trick: We can work in a high-dimensional (even infinite-dimensional) feature space without ever computing the feature vectors φ(x). We only need the kernel function k(x, x'), which is cheap to compute. This gives us the power of nonlinear feature expansions with the computational cost of working in the original input space.

The kernel trick transforms the question from "what basis functions should I use?" to "what similarity measure is appropriate for my data?" This is often a more natural question.

Check: What is the kernel trick?

Chapter 1: The Dual Representation

Consider regularized least squares from Ch 3. The solution w = (ΦTΦ + λI)−1ΦTt can be rewritten as:

w = ΦTa

where a = (1/λ)(tΦw) is a vector of N coefficients (one per training point). Substituting back, we get the dual form:

a = (K + λI)−1 t

where K is the N × N Gram matrix with Knm = k(xn, xm) = φ(xn)Tφ(xm).

Primal vs. dual: The primal form inverts an M × M matrix (M = number of basis functions). The dual inverts an N × N matrix (N = number of data points). If we use the kernel trick, M could be infinite but N is always finite. The dual lets us work in infinite feature spaces at cost O(N3).

Prediction for a new input x:

y(x) = k(x)T (K + λI)−1 t

where k(x)n = k(x, xn). The prediction is a weighted sum of kernel evaluations with training points. No explicit feature vectors needed!

Check: The dual representation expresses the solution as:

Chapter 2: The Kernel Trick in Detail

For a function k(x, x') to be a valid kernel, it must correspond to an inner product in some feature space. The key theorem:

Mercer's theorem: A function k(x, x') is a valid kernel if and only if the Gram matrix Knm = k(xn, xm) is positive semi-definite for any set of data points. Equivalently, the kernel has an expansion k(x, x') = ∑i λi ψi(x) ψi(x') with λi ≥ 0.

A concrete example: the polynomial kernel. Consider k(x, x') = (x x' + 1)2 in 1D. Expanding: k(x, x') = 1 + 2xx' + x2x'2 = φ(x)Tφ(x') where φ(x) = (1, √2 x, x2). A degree-2 polynomial kernel in D dimensions implicitly works in a feature space of dimension O(D2) — but the kernel evaluation costs just O(D).

For a general degree-M polynomial in D dimensions, the feature space has dimension O(DM), but the kernel k(x, x') = (xTx' + c)M still costs just O(D). The savings become astronomical for large M or D.

Check: What condition must a valid kernel function satisfy?

Chapter 3: Common Kernels

KernelFormulaFeature space
Lineark(x,x') = xTx'D-dimensional (no transformation)
Polynomialk(x,x') = (xTx'+c)MO(DM)-dimensional
Gaussian (RBF)k(x,x') = exp(−||xx'||2/2ℓ2)Infinite-dimensional
Sigmoidk(x,x') = tanh(κxTx'−δ)Not always valid (not PSD for all κ,δ)
The Gaussian kernel is infinite-dimensional. The Taylor expansion of exp(−||xx'||2/2ℓ2) contains terms of every polynomial degree. The corresponding feature space is infinite-dimensional. Yet the kernel evaluation is O(D) — a single exponential of the squared distance. This is the magic of kernels: infinite-dimensional feature spaces at finite computational cost.

The length scale ℓ in the Gaussian kernel controls the smoothness. Small ℓ → local influence (only nearby points matter) → wiggly function. Large ℓ → global influence → smooth function. This is the kernel analog of the bias-variance tradeoff.

Check: The Gaussian (RBF) kernel implicitly maps to what kind of feature space?

Chapter 4: Constructing Kernels

We can build complex kernels from simpler ones. If k1 and k2 are valid kernels, then so are:

ConstructionResult
c k1(x,x')Scaled kernel (c > 0)
k1 + k2Sum kernel
k1 · k2Product kernel
f(x) k1(x,x') f(x')Scaled kernel (any function f)
exp(k1(x,x'))Exponentiated kernel

These rules let us design kernels for structured data — strings, graphs, sets — where traditional feature vectors don't exist. For example, string kernels count shared substrings between two text documents, and graph kernels measure structural similarity between molecules.

Kernels beyond vectors: One of the most powerful aspects of kernel methods is that inputs don't have to be vectors. As long as you can define a valid kernel between any two inputs (sequences, trees, graphs, images), you can apply kernel methods. This makes kernel methods applicable to almost any data type.
Check: Can kernel methods be applied to non-vector data like strings or graphs?

Chapter 5: RBF Networks

Radial basis function (RBF) networks use Gaussian basis functions centered on a subset of the training data (or on learned centers):

y(x) = ∑j=1M wj exp(−||xμj||2 / 2s2) + w0

This is a linear model in the feature space defined by Gaussian basis functions. The parameters wj can be found by solving a linear system (unlike neural networks, which require iterative optimization).

RBF vs neural network: Both use nonlinear basis functions. But in an RBF network, each basis function responds to a local region of input space (near its center). In a neural network, each hidden unit divides the input space with a hyperplane. RBF networks are local; neural networks are global. This makes RBFs better for interpolation but potentially requiring more units for complex functions.

Nadaraya-Watson model: A special case of kernel regression where the weights are determined by the kernel itself: y(x) = ∑n k(x,xn) tn / ∑n k(x,xn). This is a weighted average of training targets, with weights proportional to kernel similarity. No training required — just look up the nearest neighbors and average.

Check: How do RBF basis functions differ from neural network hidden units?

Chapter 6: Gaussian Processes — The Idea

A Gaussian process (GP) is a distribution over functions. Instead of specifying a parametric model and putting a prior on parameters, we put a prior directly on the space of functions.

f(x) ~ GP(m(x), k(x, x'))

where m(x) is the mean function (often zero) and k(x, x') is the covariance function (the kernel). This means: for any finite set of points, the function values are jointly Gaussian.

From weights to functions: Remember Bayesian linear regression (Ch 3)? We put a Gaussian prior on weights w, which induced a Gaussian distribution over functions y(x) = wTφ(x). As M → ∞, this becomes a Gaussian process with kernel k(x,x') = ∑j φj(xj(x')/α. The GP is the nonparametric limit of Bayesian linear regression — infinitely many basis functions.

The kernel encodes our prior beliefs about the function: how smooth it is, how quickly it varies, whether it has periodic structure. Choosing a kernel is the GP analog of choosing a model architecture.

Gaussian Process Prior Samples

Random functions drawn from a GP prior with RBF kernel. Adjust the length scale to see how it controls smoothness.

0.30
Check: What is a Gaussian process?

Chapter 7: GP Regression

Given N training points, the GP posterior is also a GP (Gaussians are closed under conditioning):

m(x|D) = k(x)T (K + σ2I)−1 t
σ2(x|D) = k(x,x) − k(x)T (K + σ2I)−1 k(x)

where K is the N × N kernel matrix and k(x) is the vector of kernel evaluations with training points.

Automatic uncertainty: The predictive variance is small near training points (where we have data) and large far from them (where we're uncertain). This is the ideal behavior: the model "knows what it doesn't know." Unlike neural networks where uncertainty estimation requires approximations, GPs provide exact uncertainty for free.

The kernel hyperparameters (length scale, signal variance, noise variance) can be learned by maximizing the log marginal likelihood:

ln p(t|θ) = −½ tT(K + σ2I)−1t − ½ ln|K + σ2I| − N/2 ln(2π)

This automatically balances data fit (first term) with model complexity (second term) — the evidence framework from Ch 3, now in function space.

Check: How does GP predictive variance behave?

Chapter 8: GP Classification

For classification, the GP prior is placed on a latent function f(x), then passed through a sigmoid to get class probabilities:

p(C1|x) = σ(f(x))

where f ~ GP(0, k). The problem: the sigmoid likelihood is not conjugate to the Gaussian prior, so the posterior over f is non-Gaussian.

Two approximation strategies:

Laplace approximation: Find the mode of p(f|t) (the MAP latent function values), then fit a Gaussian. Conceptually identical to Laplace for logistic regression (Ch 4) but now in function space.

Expectation propagation (EP): A more accurate approximation that matches the moments of the true posterior at each data point. Generally better calibrated than Laplace.

The recurring theme: Bayesian inference is easy when everything is Gaussian (regression). The moment we introduce non-Gaussian likelihoods (classification), we need approximations. This theme continues throughout the book: Ch 10 (variational inference) and Ch 11 (sampling) develop increasingly powerful tools for non-conjugate inference.
Check: Why is GP classification harder than GP regression?

Chapter 9: Summary

Chapter 6 introduced a fundamentally different perspective on machine learning:

Parametric (Ch 3-5)Kernel / GP (Ch 6)
Prior on parameters wPrior on functions f
Finite-dimensionalPotentially infinite-dimensional
Prediction: O(M) per pointPrediction: O(N) per point
Training: O(NM2)Training: O(N3)
Fixed basis functions or learned (NN)Kernel encodes prior beliefs
Scales to large NExpensive for large N (cubic cost)
The GP perspective: A GP is the Bayesian nonparametric ideal: it uses the data to shape the posterior over functions without committing to a finite parametric form. The kernel replaces architecture design. The marginal likelihood replaces cross-validation. Uncertainty is exact, not approximate. The catch is the O(N3) cost, which limits GPs to moderate-sized datasets unless approximations (inducing points, sparse GPs) are used.

What comes next: Chapter 7 covers sparse kernel machines — SVMs and relevance vector machines — which address the O(N) prediction cost of kernel methods by finding sparse representations.

"The Gaussian process viewpoint provides a unifying
framework for many regression and classification methods."
— Christopher Bishop, PRML §6.4
Check: What is the main computational limitation of Gaussian processes?