Murphy, Chapters 16–17

Nonparametric Methods: KNN, Kernels & GPs

No fixed parameter count. The model grows with the data. From nearest neighbors to Gaussian processes, these methods let the data speak with minimal assumptions.

Prerequisites: Probability (Ch 2–3) + Linear Models (Ch 10–11). Bayesian linear regression helps for GPs.
12
Chapters
5
Simulations
12
Quizzes

Chapter 0: Why Nonparametric?

Every model we have seen so far — logistic regression, linear regression, neural networks — has a fixed number of parameters. You choose an architecture, train it, and the parameter count is determined before seeing any data. These are parametric models.

But what if you don't know how complex the true function is? What if a linear model is too simple and a neural network is overkill? Nonparametric models let the data determine the complexity. The number of effective parameters grows with the dataset size.

The big picture: "Nonparametric" does not mean "no parameters." It means "not a fixed number of parameters." KNN stores all training data and classifies by majority vote of the K nearest neighbors. Gaussian processes define a prior over functions and condition on observed data. The model complexity adapts automatically.
Exemplar Methods (Ch 16)
KNN, kernel density estimation, metric learning
Mercer Kernels (Ch 17)
Similarity functions, the kernel trick, kernel ridge regression
Gaussian Processes (Ch 17)
Prior over functions, posterior predictive with uncertainty
SVMs (Ch 17)
Maximum margin classifiers, the dual problem, kernel trick
What does "nonparametric" mean in the context of machine learning models?

Chapter 1: K-Nearest Neighbors

The simplest nonparametric model: store all training data. To classify a new point x, find the K closest training points and take a majority vote.

p(y = c | x, D, K) = (1/K) ∑n ∈ NK(x) I(yn = c)

where NK(x) is the set of K nearest neighbors of x in the training set D, and I(·) is the indicator function. For regression, we replace the vote with an average: f(x) = (1/K) ∑ yn.

K controls the bias-variance tradeoff: K=1 gives a complex, jagged decision boundary that exactly classifies every training point (zero training error, high variance). Large K gives smoother boundaries (higher bias, lower variance). The optimal K is found by cross-validation. As N → ∞, KNN with K → ∞ and K/N → 0 converges to the Bayes-optimal classifier.

The choice of distance metric matters enormously. Euclidean distance treats all features equally, which is problematic when features have different scales. Options include:

DistanceFormulaWhen to use
Euclidean√(∑(xd − x'd)²)Continuous features, same scale
Manhattan∑|xd − x'd|High-dimensional or sparse data
Mahalanobis√((x−x')TΣ−1(x−x'))Correlated features
Cosine1 − x·x' / (||x|| ||x'||)Text, high-dimensional embeddings
Deep metric learning (16.2): Instead of using hand-crafted distances, learn a neural network that maps inputs to an embedding space where Euclidean distance is meaningful. Train with contrastive or triplet loss: similar items should be close, dissimilar items far apart. This combines the power of DNNs with the simplicity of KNN.
What happens to the KNN decision boundary as K increases from 1 to N (the total number of training points)?

Chapter 2: The Curse of Dimensionality

KNN works beautifully in low dimensions. In high dimensions, it breaks. This is the curse of dimensionality (Murphy 16.1.2).

Consider a unit hypercube [0, 1]D. To capture a fraction f of the volume, you need a sub-cube with side length f1/D. For D = 10 and f = 0.01 (1% of data), the side length is 0.010.1 ≈ 0.63. Your "local" neighborhood spans 63% of each axis — not local at all!

The geometric argument: In high dimensions, almost all the volume of a sphere concentrates in a thin shell near the surface. Most of the hypercube's volume is in the corners. Distances between random points concentrate around a single value: all points are roughly equidistant. The concept of "nearest" becomes meaningless.

Consequences for KNN:

DimensionEffect
Low (D ≤ 10)KNN works well, neighbors are truly "near"
Medium (10–100)Need exponentially more data; metric learning helps
High (D > 100)Raw KNN fails; need dimensionality reduction or learned embeddings first
Practical solutions: (1) Reduce dimensionality with PCA or autoencoders (Ch 20) before applying KNN. (2) Learn a distance metric via deep metric learning. (3) Use approximate nearest neighbor search (locality-sensitive hashing, k-d trees) for efficiency. (4) Accept that parametric models with strong inductive biases (like CNNs) often outperform KNN in high dimensions.
Why does the curse of dimensionality make KNN ineffective in high-dimensional spaces?

Chapter 3: Kernel Density Estimation

KNN classifies but doesn't estimate the underlying density p(x). Kernel density estimation (KDE) does. Place a small "bump" (kernel) at each data point and sum them up (Murphy 16.3).

p̂(x) = (1/NhD) ∑n=1N K((x − xn) / h)

where K is a density kernel (a symmetric, non-negative function that integrates to 1) and h is the bandwidth (how wide each bump is). The most common choice is the Gaussian kernel:

K(u) = (2π)−D/2 exp(−||u||² / 2)
Bandwidth is the key hyperparameter: Small h gives a spiky estimate (each data point is a sharp peak). Large h gives a smooth, spread-out estimate. Too small = overfitting (the density looks like a sum of delta functions). Too large = underfitting (all structure is smoothed away). This is the bias-variance tradeoff again.
From KDE to KNN (16.3.4): KDE fixes the bandwidth h and counts how many points fall within it. KNN fixes the count K and adapts the "bandwidth" (the distance to the K-th neighbor). Both estimate density; they are duals of each other. In regions with dense data, KNN uses a small neighborhood. In sparse regions, it reaches farther. This makes KNN adaptive.

Kernel regression (Nadaraya-Watson estimator) combines KDE with regression: weight each training point's y-value by how close its x-value is to the query:

f̂(x) = ∑n K((x − xn)/h) yn / ∑n K((x − xn)/h)
What role does the bandwidth h play in kernel density estimation?

Chapter 4: Mercer Kernels

A Mercer kernel (or positive definite kernel) is a function k(x, x') that measures similarity between two inputs. Murphy (17.1) explains why these are so powerful.

k(x, x') = φ(x)Tφ(x')

The key insight: every Mercer kernel corresponds to a dot product in some (possibly infinite-dimensional) feature space. The kernel computes this dot product without ever constructing the feature vector. This is the foundation of the kernel trick.

Mercer's theorem: A symmetric function k(x, x') is a valid (Mercer) kernel if and only if the kernel matrix Kij = k(xi, xj) is positive semi-definite for any set of inputs. Equivalently, k can be written as φ(x)Tφ(x') for some feature map φ.
Kernelk(x, x')Feature Space
LinearxTx'D-dimensional (original space)
Polynomial(xTx' + c)dO(Dd)-dimensional
RBF / Gaussianexp(−||x − x'||² / 2σ²)Infinite-dimensional
Laplacianexp(−||x − x'||1 / σ)Infinite-dimensional
The RBF kernel is special: Its feature space is infinite-dimensional. It can represent any continuous function. The bandwidth σ controls smoothness: small σ gives a wiggly function (each training point is its own sharp peak); large σ gives a smooth, slowly varying function. This is the analog of the KDE bandwidth, but now applied to regression and classification.

Kernels can be combined: if k1 and k2 are valid kernels, then so are k1 + k2, c · k1, k1 · k2, and many other combinations. This lets you encode prior knowledge about the function's structure.

What does the kernel trick allow you to do?

Chapter 5: Gaussian Processes

A Gaussian process (GP) is a distribution over functions (Murphy 17.2). Instead of learning a fixed set of weights, we place a prior directly over the function f(x) and update it with data.

f(x) ∼ GP(m(x), k(x, x'))

This means: for any finite set of inputs {x1, …, xN}, the function values [f(x1), …, f(xN)] follow a multivariate Gaussian with mean [m(x1), …, m(xN)] and covariance matrix Kij = k(xi, xj).

The prior encodes smoothness: The kernel k determines what kinds of functions the GP considers plausible before seeing data. An RBF kernel says "nearby inputs should have similar outputs." A periodic kernel says "the function repeats." A Matérn kernel controls the exact degree of smoothness. The mean function m(x) is typically set to zero.

Think of it this way: sample a random function from the GP prior. It is a smooth, wiggly curve (if using an RBF kernel). Now condition on some observed data points. The posterior is also a GP, but now constrained to pass through (or near) the data. Between data points, the function is uncertain — the posterior variance is large. Near data points, it is confident.

Connection to Bayesian linear regression: A GP with a linear kernel k(x, x') = xTx' is exactly equivalent to Bayesian linear regression with a Gaussian prior on weights. GPs generalize this to arbitrary kernels, giving infinite-dimensional Bayesian regression.
What does a Gaussian process place a distribution over?

Chapter 6: GP Prediction

Given N noisy observations D = {(xn, yn)}, where yn = f(xn) + ε, ε ∼ N(0, σy²), the GP posterior predictive at a new input x* is (Murphy 17.2.2):

p(f* | x*, D) = N(f* | μ*, σ*²)
μ* = k*T(K + σy² I)−1 y
σ*² = k(x*, x*) − k*T(K + σy² I)−1 k*

where k* = [k(x*, x1), …, k(x*, xN)]T is the kernel vector between the test point and all training points, and K is the N×N kernel matrix of training points.

Interpreting the formulas: The mean μ* is a weighted sum of training outputs y, where the weights come from (K + σy²I)−1k*. Nearby training points (high k values) contribute more. The variance σ*² starts at the prior variance k(x*, x*) and decreases by the amount of information the training data provides. Far from data, variance stays high. Near data, it shrinks.
Computational cost: The main bottleneck is inverting the N×N matrix (K + σy²I), which costs O(N³). For N > 10,000, this is impractical. Solutions include sparse GPs (inducing points), random Fourier features, and scalable variational methods (Murphy 17.2.9).

Estimating the kernel (17.2.6): The kernel has hyperparameters (e.g., lengthscale ℓ and signal variance σf² for the RBF kernel). We optimize them by maximizing the log marginal likelihood:

log p(y | X) = −½ yT(K + σy²I)−1y − ½ log |K + σy²I| − N/2 log(2π)

This automatically balances data fit (first term) and model complexity (second term) — Bayesian Occam's razor.

What happens to the GP posterior variance at a test point that is far from all training data?

Chapter 7: Support Vector Machines

SVMs (Murphy 17.3) approach classification differently from logistic regression. Instead of maximizing likelihood, they find the hyperplane that maximizes the margin — the distance to the nearest training points.

maxw, b   2 / ||w||   s.t.   yn(wTxn + b) ≥ 1   ∀ n

The constraint says every point must be on the correct side with at least margin 1/||w||. The objective maximizes this margin. Only the training points on the margin boundary matter — these are the support vectors.

Why maximum margin? A wider margin means the classifier is more robust to small perturbations. Among all hyperplanes that separate the data, the maximum-margin one generalizes best. The VC dimension of the classifier depends on the margin, not just the number of features.

Soft margin (17.3.3): Real data is rarely perfectly separable. We allow some points to violate the margin by introducing slack variables ξn ≥ 0:

minw ½||w||² + C ∑n ξn   s.t.   yn(wTxn + b) ≥ 1 − ξn

The parameter C controls the tradeoff: large C penalizes violations heavily (hard margin); small C allows more violations (softer boundary). C is analogous to 1/λ in ridge regression.

SVM vs logistic regression (17.3.6): The SVM loss (hinge loss) is max(0, 1 − y · f(x)). Logistic regression uses log(1 + exp(−y · f(x))). Both are convex upper bounds on the 0-1 loss. Hinge loss is exactly zero for correctly classified points with margin ≥ 1, making SVMs sparse in dual form. Logistic regression always assigns a small gradient to every point.
What are support vectors?

Chapter 8: The Kernel Trick

The SVM optimization can be written in dual form (Murphy 17.3.2), where the solution depends on the data only through dot products xiTxj:

maxαn αn − ½ ∑i,j αi αj yi yj xiTxj   s.t.   0 ≤ αn ≤ C

Now the kernel trick: replace every dot product xiTxj with a kernel k(xi, xj). This implicitly maps the data to a high-dimensional feature space where it may be linearly separable, without ever computing the feature vectors.

A nonlinear SVM in the original space is a linear SVM in the kernel feature space. Using an RBF kernel, the SVM can learn arbitrarily complex decision boundaries. The complexity is controlled by C (regularization) and σ (kernel bandwidth), not by choosing a feature map by hand.

The prediction for a new input x is:

f(x) = ∑n ∈ SV αn yn k(xn, x) + b

The sum is only over support vectors (points with αn > 0), making prediction sparse and efficient.

Kernel ridge regression (17.3.9): The kernel trick applies beyond SVMs. Kernel ridge regression replaces wTx with kernel evaluations: f(x) = k*T(K + λI)−1y. This is identical to the GP posterior mean! The kernel trick provides a unified framework for nonlinear regression and classification.
MethodLossSparsityProbabilistic?
SVMHinge lossYes (support vectors only)No (margins, not probabilities)
Kernel ridge regressionSquared lossNo (all points contribute)Yes (equivalent to GP mean)
Gaussian processLog marginal likelihoodNoYes (full posterior)
Relevance vector machineType-II MLYesYes
Why is the kernel SVM prediction efficient despite operating in a potentially infinite-dimensional feature space?

Chapter 9: KNN Explorer

See how K-nearest neighbors classifies 2D data. Place points, choose K, and watch the decision regions change.

K-Nearest Neighbors: Decision Regions

Click to place points (toggle class below). The background shows how KNN would classify each location. Increase K for smoother boundaries.

0 points
K3
What to observe: With K=1, every training point creates its own region (Voronoi diagram). Increase K and the boundaries smooth out. Try overlapping the classes to see how K affects the noise tolerance.
Why does K=1 in KNN give a Voronoi tessellation as the decision boundary?

Chapter 10: GP Regression

Click to add training points and watch the Gaussian process posterior update in real time. The mean prediction is a smooth curve that passes through (or near) the data, with uncertainty bands that grow between observations.

Gaussian Process Regression

Click to add data points. The teal line is the posterior mean. The shaded band is ±2 standard deviations. The dashed lines are prior samples.

Lengthscale ℓ0.30
Noise σy0.10
Try this: Place a few points and observe how the uncertainty band narrows near data and widens between observations. Then change the lengthscale: small ℓ gives wiggly fits, large ℓ gives smooth fits. Increase noise σy to see the mean curve "relax" away from the data points.
What happens to the GP uncertainty band as you add more training points?

Chapter 11: Connections

Nonparametric methods are the bridge between simple linear models and the full Bayesian toolkit.

Concept from this chapterWhere it leads
KNNBaseline for any classification problem; retrieval-augmented generation (RAG)
Kernel density estimationDensity estimation in GMMs (Ch 21), normalizing flows
Mercer kernelsKernel PCA (Ch 20), spectral clustering (Ch 21), GP kernels
Gaussian processesBayesian optimization, neural tangent kernels, infinite-width DNNs
SVMsMaximum margin principle, hinge loss in structured prediction
Kernel trickKernel PCA, kernel k-means, any algorithm that uses only dot products
Deep metric learningFace verification, image retrieval, few-shot learning
What we covered: KNN classification and regression, the curse of dimensionality, kernel density estimation, Mercer kernels and the kernel trick, Gaussian processes with posterior predictive, support vector machines with maximum margin and soft margin, and kernel ridge regression.
What comes next: Chapters 20–21 go beyond supervised learning. PCA and autoencoders learn low-dimensional representations. VAEs combine neural networks with probabilistic generative models. K-means and GMMs cluster data without labels. The EM algorithm ties it all together.

"The purpose of computing is insight, not numbers." — Richard Hamming

How is kernel ridge regression related to Gaussian process prediction?