Bishop PRML, Chapter 7

Sparse Kernel Machines

Support vector machines, maximum margin classifiers, and relevance vector machines — kernel methods that use only a sparse subset of training points.

Prerequisites: Chapter 6 (kernel methods, Gaussian processes).
11
Chapters
2
Simulations
11
Quizzes

Chapter 0: Why Sparsity?

In Chapter 6, we saw that kernel methods like Gaussian processes make predictions using all N training points. The predictive function is a weighted sum over every data point:

y(x) = ∑n=1N an k(x, xn)

This means prediction costs O(N) per test point — expensive when N is large. Can we find solutions where most of the coefficients an are exactly zero?

The sparsity goal: A sparse kernel machine uses only a small subset of the training data to make predictions. If only S out of N points have nonzero coefficients (S « N), prediction costs O(S) instead of O(N). The data points with nonzero coefficients are called support vectors (in SVMs) or relevance vectors (in RVMs).

This chapter covers two approaches to achieving sparsity:

Support Vector Machines (SVMs): Find the maximum margin decision boundary. Sparsity emerges from the geometry — only points on the margin boundary matter.

Relevance Vector Machines (RVMs): Use a Bayesian prior with individual hyperparameters per weight. Sparsity emerges from evidence maximization — most hyperparameters go to infinity, pruning their weights to zero.

Check: Why is sparsity desirable in kernel methods?

Chapter 1: Maximum Margin Classifiers

Consider a two-class problem with a linear decision boundary y(x) = wTφ(x) + b. Many hyperplanes can separate the classes perfectly. Which one should we choose?

The margin is the perpendicular distance from the decision boundary to the nearest data point. The SVM chooses the hyperplane that maximizes this margin.

Why maximum margin? Intuitively, a large margin means the classifier is confident about its decision — there's a wide "buffer zone" between classes. A small perturbation to a test point won't change its classification. Formally, computational learning theory (VC theory) shows that larger margins lead to better generalization bounds, independent of the input dimensionality.

The distance from point xn to the decision surface is tny(xn)/||w||. We want to maximize the minimum such distance:

arg maxw,b { (1/||w||) minn [tn(wTφ(xn) + b)] }

Using the canonical representation (rescaling so the closest point has tny(xn) = 1), this becomes:

arg minw,b ½||w||2    subject to    tn(wTφ(xn) + b) ≥ 1   ∀n

This is a quadratic program — minimize a quadratic objective subject to linear inequality constraints. It has a unique global solution.

Check: What does the SVM maximize?

Chapter 2: The Lagrangian Dual

To solve the constrained optimization, we introduce Lagrange multipliers an ≥ 0 for each constraint:

L(w, b, a) = ½||w||2 − ∑n=1N an{tn(wTφ(xn) + b) − 1}

Setting derivatives to zero: ∂L/∂w = 0 gives w = ∑n an tn φ(xn), and ∂L/∂b = 0 gives ∑n an tn = 0.

Substituting back, we get the dual problem:

maximize   L̃(a) = ∑n=1N an − ½ ∑n=1Nm=1N an am tn tm k(xn, xm)

subject to an ≥ 0 and ∑n an tn = 0, where k(xn, xm) = φ(xn)Tφ(xm) is the kernel function.

The kernel trick appears: The dual problem depends on the data only through kernel evaluations k(xn, xm). We never need to compute φ(x) explicitly. This means we can use infinite-dimensional feature spaces (via the Gaussian kernel, for example) at finite computational cost.

The KKT conditions (Karush-Kuhn-Tucker) require that an{tny(xn) − 1} = 0 for every n. This means either an = 0 (the point is irrelevant to the solution) or tny(xn) = 1 (the point sits exactly on the margin). Points with an > 0 are the support vectors.

Prediction for a new point:

y(x) = ∑n ∈ SV an tn k(x, xn) + b

Only support vectors contribute. The rest of the training data can be discarded.

Check: What are support vectors?

Chapter 3: The Soft Margin

Real data is rarely linearly separable, even in feature space. We need to allow some points to be misclassified. The soft margin SVM introduces slack variables ξn ≥ 0:

tn(wTφ(xn) + b) ≥ 1 − ξn

If ξn = 0, the point is correctly classified and on or beyond the margin. If 0 < ξn < 1, the point is correctly classified but inside the margin. If ξn > 1, the point is misclassified.

The objective becomes:

minw,b,ξ ½||w||2 + C ∑n=1N ξn

The parameter C controls the tradeoff: large C → fewer violations (potentially overfitting), small C → wider margin (more violations allowed).

The C parameter: C is the SVM's complexity parameter. It plays the same role as inverse regularization strength: large C is like low regularization (fit the data closely), small C is like high regularization (prefer simplicity). Unlike Bayesian methods, C must be set by cross-validation.

The dual of the soft-margin SVM is almost identical to the hard-margin case, except that now 0 ≤ an ≤ C. Points with 0 < an < C lie on the margin; points with an = C are inside the margin or misclassified. An alternative formulation using a parameter ν ∈ (0, 1] replaces C and has a nice interpretation: ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.

Check: What do slack variables allow in the soft-margin SVM?

Chapter 4: SVM vs Logistic Regression

The SVM and logistic regression are more closely related than they might appear. Both find a linear decision boundary in feature space. The difference is in the loss function.

MethodLoss functionOutputs
Logistic regressionln(1 + exp(−tnyn)) — log lossProbabilities p(C1|x)
SVM[1 − tnyn]+ — hinge lossDecisions only (sign of y)

The hinge loss [1 − ty]+ = max(0, 1 − ty) is zero whenever ty ≥ 1, i.e., when the point is correctly classified and beyond the margin. This is why SVMs produce sparse solutions: points safely classified contribute zero loss and zero gradient, so they have no influence on the solution.

Sparsity from the hinge loss: The hinge loss has a "flat region" for ty ≥ 1 where the gradient is exactly zero. This is what causes sparsity — points in this region don't affect the solution. Logistic regression's log loss is always positive and always has a nonzero gradient, so every point influences the solution. No sparsity.

To get probabilities from an SVM, one can fit a sigmoid to the SVM outputs: p(t=1|x) = σ(Ay(x) + B), where A and B are fitted on a validation set. This is known as Platt scaling.

Check: Why does the SVM produce sparse solutions but logistic regression does not?

Chapter 5: Multiclass & SVR

The SVM is inherently a two-class classifier. Extension to K classes is ad hoc:

One-vs-rest: Train K separate SVMs, each separating one class from the rest. Assign the class with the largest y(x). Problem: different SVMs have different scales, and there's ambiguous regions.

One-vs-one: Train K(K−1)/2 SVMs, one for each pair of classes. Use majority voting. Problem: quadratic number of classifiers and inconsistent decisions.

For regression, the SVM uses the ε-insensitive loss:

Eε(y − t) = max(0, |y − t| − ε)

This loss is zero when the prediction is within ε of the target. The ε-tube around the regression function plays the same role as the margin: only data points outside the tube (with |yn − tn| > ε) contribute to the solution.

Support vector regression (SVR): The ε-insensitive loss creates sparsity in regression: points inside the ε-tube have zero loss and don't affect the solution. The regression function is expressed as a weighted sum over support vectors only. The parameter ε controls the width of the tube and hence the sparsity: larger ε gives fewer support vectors but a coarser fit.
Check: How does the ε-insensitive loss achieve sparsity in SVR?

Chapter 6: SVM Decision Boundary

Drag the data points to see how the SVM margin and support vectors change. Notice how only points on or inside the margin affect the decision boundary.

SVM Maximum Margin Classifier

Blue circles = class +1, orange diamonds = class −1. The shaded band is the margin. Support vectors are highlighted. Click "New Data" for a fresh dataset.

C 10.0
Observe sparsity: Most data points have no effect on the decision boundary. Only the support vectors (highlighted with rings) matter. Move a non-support-vector point around — the boundary doesn't change. But move a support vector, and the whole boundary shifts.
Check: If you move a data point that is far from the decision boundary, what happens?

Chapter 7: RVM for Regression

The relevance vector machine (RVM) takes a completely different approach to sparsity. Instead of a geometric argument (maximum margin), the RVM uses Bayesian evidence maximization — the same automatic relevance determination (ARD) framework from Chapter 3.

The model is a linear combination of kernels, just like the SVM:

y(x) = ∑n=1N wn k(x, xn) + b

The key difference: each weight wn gets its own precision hyperparameter αn:

p(w|α) = ∏n N(wn|0, αn−1)
Automatic pruning: When we maximize the evidence (marginal likelihood) with respect to α, many αn → ∞. This forces the corresponding wn → 0 in the posterior, effectively pruning those basis functions. The surviving basis functions are the relevance vectors. Typically, the RVM produces much sparser solutions than the SVM.

The posterior over weights is Gaussian: p(w|t, α, β) = N(w|m, Σ) where:

Σ = (A + βΦTΦ)−1,    m = βΣΦTt

with A = diag(α1, ..., αN).

Check: How does the RVM achieve sparsity?

Chapter 8: RVM Sparsity Analysis

Why does evidence maximization produce sparsity? Consider the marginal likelihood as a function of a single hyperparameter αi. It can be decomposed as:

ln p(t|α, β) = L(α−i) + ½[ln αi − ln(αi + si) + qi2/(αi + si)]

where si and qi are the "sparsity" and "quality" of basis function i, measuring how much adding this basis function improves the fit.

The sparsity criterion: The optimal αi is finite (keep the basis function) only if qi2 > si, i.e., the quality exceeds the sparsity threshold. Otherwise, αi → ∞ and the basis function is pruned. This provides a principled, automatic criterion for model complexity: each basis function is individually tested against the evidence.
PropertySVMRVM
Sparsity mechanismHinge loss flat regionEvidence maximization / ARD
Typical sparsityModerate (10-50% SVs)Very sparse (often <10%)
OutputsDecisions onlyFull posterior predictive (with uncertainty)
Kernel constraintMust be positive definiteAny basis function
HyperparametersC (or ν), kernel paramsNone — all learned from data
Training costO(N2–N3) QPO(N3) evidence maximization
MulticlassAd hoc (one-vs-rest)Natural extension
Check: When is a basis function kept in the RVM?

Chapter 9: RVM for Classification

For classification, the RVM uses a logistic sigmoid output: p(t=1|x) = σ(y(x)). The non-conjugacy between the sigmoid likelihood and Gaussian prior means the posterior is non-Gaussian.

The solution uses the Laplace approximation (same idea as Ch 4 for Bayesian logistic regression):

p(w|t, α) ≈ N(w|wMAP, Σ)

where wMAP is found by iteratively reweighted least squares (IRLS), and Σ is the inverse Hessian at the mode. The hyperparameters αn are then optimized using the approximate marginal likelihood, exactly as in regression.

RVM classification in practice: The RVM classifier typically uses far fewer basis functions than the SVM for comparable accuracy. This means faster prediction — critical in real-time applications. The RVM also provides genuine posterior probabilities, unlike the SVM where probabilities must be calibrated post-hoc via Platt scaling.

The iterative algorithm alternates between:

1. Fix α, find the posterior mode wMAP and covariance Σ using IRLS.

2. Fix the posterior, update each αi using the evidence framework. Prune basis functions where αi → ∞.

Repeat until convergence.

Check: What approximation does the RVM classifier use for the posterior?

Chapter 10: Summary

Chapter 7 presented two philosophies for building sparse kernel machines:

AspectSVMRVM
PhilosophyGeometric (maximize margin)Bayesian (maximize evidence)
Why sparse?Hinge loss has flat zero-gradient regionARD priors prune irrelevant basis functions
Sweet spotLarge N, need a quick reliable classifierNeed uncertainty, small-to-medium N
The big picture: SVMs dominated machine learning in the 2000s because they provided good generalization with theoretical guarantees. RVMs offered a Bayesian alternative with sparser solutions and uncertainty. Both have been largely superseded by deep learning for large-scale problems, but remain valuable for small datasets and interpretable models. The concepts — maximum margin, kernel trick, automatic relevance determination — continue to influence modern methods.

What comes next: Chapter 8 introduces graphical models — a powerful framework for representing and reasoning about probabilistic models with many variables. The graphical structure reveals conditional independencies that make inference tractable.

"The support vector machine is an example of a sparse
kernel machine, in which predictions for new inputs depend
on the kernel function evaluated at a subset of the training data."
— Christopher Bishop, PRML §7
Check: What is the fundamental advantage of the RVM over the SVM?