No fixed parameter count. The model grows with the data. From nearest neighbors to Gaussian processes, these methods let the data speak with minimal assumptions.
Every model we have seen so far — logistic regression, linear regression, neural networks — has a fixed number of parameters. You choose an architecture, train it, and the parameter count is determined before seeing any data. These are parametric models.
But what if you don't know how complex the true function is? What if a linear model is too simple and a neural network is overkill? Nonparametric models let the data determine the complexity. The number of effective parameters grows with the dataset size.
The simplest nonparametric model: store all training data. To classify a new point x, find the K closest training points and take a majority vote.
where NK(x) is the set of K nearest neighbors of x in the training set D, and I(·) is the indicator function. For regression, we replace the vote with an average: f(x) = (1/K) ∑ yn.
The choice of distance metric matters enormously. Euclidean distance treats all features equally, which is problematic when features have different scales. Options include:
| Distance | Formula | When to use |
|---|---|---|
| Euclidean | √(∑(xd − x'd)²) | Continuous features, same scale |
| Manhattan | ∑|xd − x'd| | High-dimensional or sparse data |
| Mahalanobis | √((x−x')TΣ−1(x−x')) | Correlated features |
| Cosine | 1 − x·x' / (||x|| ||x'||) | Text, high-dimensional embeddings |
KNN works beautifully in low dimensions. In high dimensions, it breaks. This is the curse of dimensionality (Murphy 16.1.2).
Consider a unit hypercube [0, 1]D. To capture a fraction f of the volume, you need a sub-cube with side length f1/D. For D = 10 and f = 0.01 (1% of data), the side length is 0.010.1 ≈ 0.63. Your "local" neighborhood spans 63% of each axis — not local at all!
Consequences for KNN:
| Dimension | Effect |
|---|---|
| Low (D ≤ 10) | KNN works well, neighbors are truly "near" |
| Medium (10–100) | Need exponentially more data; metric learning helps |
| High (D > 100) | Raw KNN fails; need dimensionality reduction or learned embeddings first |
KNN classifies but doesn't estimate the underlying density p(x). Kernel density estimation (KDE) does. Place a small "bump" (kernel) at each data point and sum them up (Murphy 16.3).
where K is a density kernel (a symmetric, non-negative function that integrates to 1) and h is the bandwidth (how wide each bump is). The most common choice is the Gaussian kernel:
Kernel regression (Nadaraya-Watson estimator) combines KDE with regression: weight each training point's y-value by how close its x-value is to the query:
A Mercer kernel (or positive definite kernel) is a function k(x, x') that measures similarity between two inputs. Murphy (17.1) explains why these are so powerful.
The key insight: every Mercer kernel corresponds to a dot product in some (possibly infinite-dimensional) feature space. The kernel computes this dot product without ever constructing the feature vector. This is the foundation of the kernel trick.
| Kernel | k(x, x') | Feature Space |
|---|---|---|
| Linear | xTx' | D-dimensional (original space) |
| Polynomial | (xTx' + c)d | O(Dd)-dimensional |
| RBF / Gaussian | exp(−||x − x'||² / 2σ²) | Infinite-dimensional |
| Laplacian | exp(−||x − x'||1 / σ) | Infinite-dimensional |
Kernels can be combined: if k1 and k2 are valid kernels, then so are k1 + k2, c · k1, k1 · k2, and many other combinations. This lets you encode prior knowledge about the function's structure.
A Gaussian process (GP) is a distribution over functions (Murphy 17.2). Instead of learning a fixed set of weights, we place a prior directly over the function f(x) and update it with data.
This means: for any finite set of inputs {x1, …, xN}, the function values [f(x1), …, f(xN)] follow a multivariate Gaussian with mean [m(x1), …, m(xN)] and covariance matrix Kij = k(xi, xj).
Think of it this way: sample a random function from the GP prior. It is a smooth, wiggly curve (if using an RBF kernel). Now condition on some observed data points. The posterior is also a GP, but now constrained to pass through (or near) the data. Between data points, the function is uncertain — the posterior variance is large. Near data points, it is confident.
Given N noisy observations D = {(xn, yn)}, where yn = f(xn) + ε, ε ∼ N(0, σy²), the GP posterior predictive at a new input x* is (Murphy 17.2.2):
where k* = [k(x*, x1), …, k(x*, xN)]T is the kernel vector between the test point and all training points, and K is the N×N kernel matrix of training points.
Estimating the kernel (17.2.6): The kernel has hyperparameters (e.g., lengthscale ℓ and signal variance σf² for the RBF kernel). We optimize them by maximizing the log marginal likelihood:
This automatically balances data fit (first term) and model complexity (second term) — Bayesian Occam's razor.
SVMs (Murphy 17.3) approach classification differently from logistic regression. Instead of maximizing likelihood, they find the hyperplane that maximizes the margin — the distance to the nearest training points.
The constraint says every point must be on the correct side with at least margin 1/||w||. The objective maximizes this margin. Only the training points on the margin boundary matter — these are the support vectors.
Soft margin (17.3.3): Real data is rarely perfectly separable. We allow some points to violate the margin by introducing slack variables ξn ≥ 0:
The parameter C controls the tradeoff: large C penalizes violations heavily (hard margin); small C allows more violations (softer boundary). C is analogous to 1/λ in ridge regression.
The SVM optimization can be written in dual form (Murphy 17.3.2), where the solution depends on the data only through dot products xiTxj:
Now the kernel trick: replace every dot product xiTxj with a kernel k(xi, xj). This implicitly maps the data to a high-dimensional feature space where it may be linearly separable, without ever computing the feature vectors.
The prediction for a new input x is:
The sum is only over support vectors (points with αn > 0), making prediction sparse and efficient.
| Method | Loss | Sparsity | Probabilistic? |
|---|---|---|---|
| SVM | Hinge loss | Yes (support vectors only) | No (margins, not probabilities) |
| Kernel ridge regression | Squared loss | No (all points contribute) | Yes (equivalent to GP mean) |
| Gaussian process | Log marginal likelihood | No | Yes (full posterior) |
| Relevance vector machine | Type-II ML | Yes | Yes |
See how K-nearest neighbors classifies 2D data. Place points, choose K, and watch the decision regions change.
Click to place points (toggle class below). The background shows how KNN would classify each location. Increase K for smoother boundaries.
Click to add training points and watch the Gaussian process posterior update in real time. The mean prediction is a smooth curve that passes through (or near) the data, with uncertainty bands that grow between observations.
Click to add data points. The teal line is the posterior mean. The shaded band is ±2 standard deviations. The dashed lines are prior samples.
Nonparametric methods are the bridge between simple linear models and the full Bayesian toolkit.
| Concept from this chapter | Where it leads |
|---|---|
| KNN | Baseline for any classification problem; retrieval-augmented generation (RAG) |
| Kernel density estimation | Density estimation in GMMs (Ch 21), normalizing flows |
| Mercer kernels | Kernel PCA (Ch 20), spectral clustering (Ch 21), GP kernels |
| Gaussian processes | Bayesian optimization, neural tangent kernels, infinite-width DNNs |
| SVMs | Maximum margin principle, hinge loss in structured prediction |
| Kernel trick | Kernel PCA, kernel k-means, any algorithm that uses only dot products |
| Deep metric learning | Face verification, image retrieval, few-shot learning |
"The purpose of computing is insight, not numbers." — Richard Hamming