Support vector machines, maximum margin classifiers, and relevance vector machines — kernel methods that use only a sparse subset of training points.
In Chapter 6, we saw that kernel methods like Gaussian processes make predictions using all N training points. The predictive function is a weighted sum over every data point:
This means prediction costs O(N) per test point — expensive when N is large. Can we find solutions where most of the coefficients an are exactly zero?
This chapter covers two approaches to achieving sparsity:
Support Vector Machines (SVMs): Find the maximum margin decision boundary. Sparsity emerges from the geometry — only points on the margin boundary matter.
Relevance Vector Machines (RVMs): Use a Bayesian prior with individual hyperparameters per weight. Sparsity emerges from evidence maximization — most hyperparameters go to infinity, pruning their weights to zero.
Consider a two-class problem with a linear decision boundary y(x) = wTφ(x) + b. Many hyperplanes can separate the classes perfectly. Which one should we choose?
The margin is the perpendicular distance from the decision boundary to the nearest data point. The SVM chooses the hyperplane that maximizes this margin.
The distance from point xn to the decision surface is tny(xn)/||w||. We want to maximize the minimum such distance:
Using the canonical representation (rescaling so the closest point has tny(xn) = 1), this becomes:
This is a quadratic program — minimize a quadratic objective subject to linear inequality constraints. It has a unique global solution.
To solve the constrained optimization, we introduce Lagrange multipliers an ≥ 0 for each constraint:
Setting derivatives to zero: ∂L/∂w = 0 gives w = ∑n an tn φ(xn), and ∂L/∂b = 0 gives ∑n an tn = 0.
Substituting back, we get the dual problem:
subject to an ≥ 0 and ∑n an tn = 0, where k(xn, xm) = φ(xn)Tφ(xm) is the kernel function.
The KKT conditions (Karush-Kuhn-Tucker) require that an{tny(xn) − 1} = 0 for every n. This means either an = 0 (the point is irrelevant to the solution) or tny(xn) = 1 (the point sits exactly on the margin). Points with an > 0 are the support vectors.
Prediction for a new point:
Only support vectors contribute. The rest of the training data can be discarded.
Real data is rarely linearly separable, even in feature space. We need to allow some points to be misclassified. The soft margin SVM introduces slack variables ξn ≥ 0:
If ξn = 0, the point is correctly classified and on or beyond the margin. If 0 < ξn < 1, the point is correctly classified but inside the margin. If ξn > 1, the point is misclassified.
The objective becomes:
The parameter C controls the tradeoff: large C → fewer violations (potentially overfitting), small C → wider margin (more violations allowed).
The dual of the soft-margin SVM is almost identical to the hard-margin case, except that now 0 ≤ an ≤ C. Points with 0 < an < C lie on the margin; points with an = C are inside the margin or misclassified. An alternative formulation using a parameter ν ∈ (0, 1] replaces C and has a nice interpretation: ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
The SVM and logistic regression are more closely related than they might appear. Both find a linear decision boundary in feature space. The difference is in the loss function.
| Method | Loss function | Outputs |
|---|---|---|
| Logistic regression | ln(1 + exp(−tnyn)) — log loss | Probabilities p(C1|x) |
| SVM | [1 − tnyn]+ — hinge loss | Decisions only (sign of y) |
The hinge loss [1 − ty]+ = max(0, 1 − ty) is zero whenever ty ≥ 1, i.e., when the point is correctly classified and beyond the margin. This is why SVMs produce sparse solutions: points safely classified contribute zero loss and zero gradient, so they have no influence on the solution.
To get probabilities from an SVM, one can fit a sigmoid to the SVM outputs: p(t=1|x) = σ(Ay(x) + B), where A and B are fitted on a validation set. This is known as Platt scaling.
The SVM is inherently a two-class classifier. Extension to K classes is ad hoc:
One-vs-rest: Train K separate SVMs, each separating one class from the rest. Assign the class with the largest y(x). Problem: different SVMs have different scales, and there's ambiguous regions.
One-vs-one: Train K(K−1)/2 SVMs, one for each pair of classes. Use majority voting. Problem: quadratic number of classifiers and inconsistent decisions.
For regression, the SVM uses the ε-insensitive loss:
This loss is zero when the prediction is within ε of the target. The ε-tube around the regression function plays the same role as the margin: only data points outside the tube (with |yn − tn| > ε) contribute to the solution.
Drag the data points to see how the SVM margin and support vectors change. Notice how only points on or inside the margin affect the decision boundary.
Blue circles = class +1, orange diamonds = class −1. The shaded band is the margin. Support vectors are highlighted. Click "New Data" for a fresh dataset.
The relevance vector machine (RVM) takes a completely different approach to sparsity. Instead of a geometric argument (maximum margin), the RVM uses Bayesian evidence maximization — the same automatic relevance determination (ARD) framework from Chapter 3.
The model is a linear combination of kernels, just like the SVM:
The key difference: each weight wn gets its own precision hyperparameter αn:
The posterior over weights is Gaussian: p(w|t, α, β) = N(w|m, Σ) where:
with A = diag(α1, ..., αN).
Why does evidence maximization produce sparsity? Consider the marginal likelihood as a function of a single hyperparameter αi. It can be decomposed as:
where si and qi are the "sparsity" and "quality" of basis function i, measuring how much adding this basis function improves the fit.
| Property | SVM | RVM |
|---|---|---|
| Sparsity mechanism | Hinge loss flat region | Evidence maximization / ARD |
| Typical sparsity | Moderate (10-50% SVs) | Very sparse (often <10%) |
| Outputs | Decisions only | Full posterior predictive (with uncertainty) |
| Kernel constraint | Must be positive definite | Any basis function |
| Hyperparameters | C (or ν), kernel params | None — all learned from data |
| Training cost | O(N2–N3) QP | O(N3) evidence maximization |
| Multiclass | Ad hoc (one-vs-rest) | Natural extension |
For classification, the RVM uses a logistic sigmoid output: p(t=1|x) = σ(y(x)). The non-conjugacy between the sigmoid likelihood and Gaussian prior means the posterior is non-Gaussian.
The solution uses the Laplace approximation (same idea as Ch 4 for Bayesian logistic regression):
where wMAP is found by iteratively reweighted least squares (IRLS), and Σ is the inverse Hessian at the mode. The hyperparameters αn are then optimized using the approximate marginal likelihood, exactly as in regression.
The iterative algorithm alternates between:
1. Fix α, find the posterior mode wMAP and covariance Σ using IRLS.
2. Fix the posterior, update each αi using the evidence framework. Prune basis functions where αi → ∞.
Repeat until convergence.
Chapter 7 presented two philosophies for building sparse kernel machines:
| Aspect | SVM | RVM |
|---|---|---|
| Philosophy | Geometric (maximize margin) | Bayesian (maximize evidence) |
| Why sparse? | Hinge loss has flat zero-gradient region | ARD priors prune irrelevant basis functions |
| Sweet spot | Large N, need a quick reliable classifier | Need uncertainty, small-to-medium N |
What comes next: Chapter 8 introduces graphical models — a powerful framework for representing and reasoning about probabilistic models with many variables. The graphical structure reveals conditional independencies that make inference tractable.