Finding the widest street between classes — maximum margin classification, duality, and the kernel trick.
You have a dataset of emails, each labelled "spam" or "not spam." Each email is represented as a feature vector x (word counts, sender info, etc.) and a label y ∈ {+1, −1}. Your goal: learn a rule that predicts the label of a new, unseen email.
There are many ways to draw a boundary between two classes. A line in 2D. A plane in 3D. A hyperplane in D dimensions. But not all boundaries are created equal. Some barely separate the classes, threading the needle between nearby points. Others create a wide buffer zone — a generous gap between the closest points of each class.
Click New Boundary to draw random separating lines through this 2D data. Notice how some lines barely miss nearby points while others leave a comfortable margin. Which would you trust more on new data?
Intuitively, a wider gap means more confidence. A boundary that barely separates the training data will misclassify slightly perturbed points. A boundary with a wide margin is more robust. This is the fundamental insight behind Support Vector Machines (SVMs): among all valid boundaries, pick the one with the maximum margin.
Over the next ten chapters, we will build SVMs from scratch: the geometry of hyperplanes, the margin, the optimization problem, the dual formulation, and finally the kernel trick that lets SVMs classify data that no line could ever separate.
A hyperplane in D-dimensional space is defined by a weight vector w ∈ RD and a bias b ∈ R. The hyperplane is the set of all points x satisfying:
In 2D, this is a line. In 3D, a plane. In higher dimensions, the geometry is the same — a flat (D−1)-dimensional surface that splits the space in two. The vector w is normal (perpendicular) to the hyperplane. It points in the direction of the positive class.
A data point xn is on the positive side if wTxn + b > 0, and on the negative side if wTxn + b < 0. Our classifier predicts:
The value wTx + b is the signed distance from x to the hyperplane (up to a scaling by ||w||). Points far from the boundary have large magnitude — the classifier is confident. Points near the boundary have small magnitude — the classifier is uncertain.
The margin is the distance from the hyperplane to the nearest data point on either side. Formally, the distance from a point x to the hyperplane wTx + b = 0 is:
The margin is twice this distance for the closest point — the width of the "street" between the two classes. To make the math cleaner, we adopt a canonical scaling: we require that the closest points satisfy wTx + b = +1 (positive class) or wTx + b = −1 (negative class). These define two parallel "gutters" on either side of the decision boundary.
This is the crucial formula. The margin depends only on the norm of w. A smaller ||w|| means a wider margin. So maximizing the margin is equivalent to minimizing ||w||.
Drag the slider to rotate the decision boundary. Watch how the margin (shaded zone) changes width. The optimal SVM boundary maximizes this width.
We now have all the ingredients. The hard-margin SVM finds the maximum-margin hyperplane that correctly classifies every training point. The optimization problem is:
The constraint says: every point must be on the correct side of its margin boundary. yn = +1 points must have wTxn + b ≥ +1, and yn = −1 points must have wTxn + b ≤ −1. The product yn(wTxn + b) conveniently captures both cases.
This is a convex quadratic program (QP): a quadratic objective with linear inequality constraints. Convex QPs have a unique global optimum (no local minima), and efficient solvers exist. The solution (w*, b*) defines the maximum-margin hyperplane.
| Term | Formula | Meaning |
|---|---|---|
| Objective | ½||w||2 | Inverse of margin (squared) |
| Constraint | yn(wTxn + b) ≥ 1 | All points correctly classified with margin ≥ 1 |
| Margin | 2/||w*|| | Width of the "street" at the optimum |
Real data is rarely linearly separable. Some points will be on the wrong side of any hyperplane. The soft-margin SVM introduces slack variables ξn ≥ 0 that allow individual points to violate the margin:
Each ξn measures how much point n violates the margin. If ξn = 0, the point is correctly classified and outside the margin (no violation). If 0 < ξn < 1, the point is inside the margin but still on the correct side. If ξn ≥ 1, the point is misclassified.
Drag the slider to change C. Low C = wide margin, more violations. High C = narrow margin, fewer violations. Points inside the margin or misclassified are marked with slack ξ.
| ξn value | Point location | Status |
|---|---|---|
| ξn = 0 | Outside margin (correct side) | No violation |
| 0 < ξn < 1 | Inside margin (correct side) | Margin violation |
| ξn ≥ 1 | Wrong side of boundary | Misclassified |
There is an equivalent way to write the soft-margin SVM that reveals a deeper connection. The slack variables ξn in the constrained optimization can be eliminated to yield an unconstrained problem:
The term max(0, 1 − yn f(xn)) is the hinge loss. It is zero when the functional margin yn f(xn) ≥ 1 (point is outside the margin on the correct side). It increases linearly as the point moves into the margin or onto the wrong side.
The hinge loss (orange) compared to 0-1 loss (teal, the ideal but non-differentiable loss) and logistic loss (blue). The x-axis is the functional margin y · f(x). The hinge is a tight convex upper bound on the 0-1 loss.
Why the hinge and not something else? Three important properties:
| Property | Why it matters |
|---|---|
| Convex | Guarantees a unique global optimum — no local minima traps |
| Sparsity-inducing | Points with margin ≥ 1 contribute zero loss and zero gradient. Only the critical points (support vectors) matter. |
| Upper bound on 0-1 loss | Minimizing the hinge loss pushes down the classification error |
The hinge loss is piecewise linear: flat at zero for large margins, and linearly increasing for small or negative margins. It is continuous but not differentiable at the kink (margin = 1). In practice, this is handled by subgradient methods or by solving the QP dual instead.
The hard-margin SVM is a constrained optimization. Chapter 7 of the book introduced Lagrange multipliers for handling constraints. We introduce a multiplier αn ≥ 0 for each constraint and form the Lagrangian:
Setting ∂L/∂w = 0 gives the representer theorem for SVMs — the optimal weight vector is a linear combination of the training data:
Setting ∂L/∂b = 0 gives ∑n αn yn = 0. Substituting both back into the Lagrangian eliminates w and b entirely, yielding the dual problem:
The dual is also a QP — but in N variables (α1, …, αN) instead of D+1 variables (w, b). For high-dimensional data (large D) with moderate N, the dual is much smaller. Efficient algorithms like Sequential Minimal Optimization (SMO) solve the dual by updating two αn at a time.
By complementary slackness, only points with αn > 0 contribute to the solution. These points — the ones sitting exactly on the margin boundary — are the support vectors. Everything else could be deleted without changing the decision boundary at all.
The prediction for a new point x is:
Since most αn = 0, this sum has only as many terms as there are support vectors. Prediction is efficient — it depends on the number of support vectors, not the full training set.
Support vectors (circled) sit on the margin gutters. They are the only points that determine the decision boundary. All other points could be removed without changing the SVM solution.
To find b, use any support vector xs (a point with αs > 0): since ys(wTxs + b) = 1, we get b = ys − wTxs. In practice, average over all support vectors for numerical stability.
| Point type | αn | Location |
|---|---|---|
| Non-support | = 0 | Outside margin (doesn't affect boundary) |
| Support vector | > 0 | On the margin gutter (defines boundary) |
| Bounded SV (soft margin) | = C | Inside margin or misclassified |
So far, SVM boundaries are hyperplanes — flat, linear surfaces. But many real-world problems are not linearly separable. Consider two concentric circles of data: the inner ring is class +1, the outer ring is class −1. No line can separate them.
The classic solution: map the data into a higher-dimensional feature space φ(x) where the classes are linearly separable, then run the SVM there. For the concentric circles, adding a feature x12 + x22 (distance from origin) makes the problem trivially separable in 3D.
But computing φ(x) explicitly can be expensive — the feature space might have millions or even infinite dimensions. Here is the key insight: the dual SVM only uses inner products xnTxm. If we replace these with inner products in the feature space φ(xn)Tφ(xm), we get a nonlinear SVM. The kernel trick is: find a function k(xn, xm) that computes this inner product without ever computing φ.
The dual becomes:
Common kernels:
| Kernel | Formula | Implicit feature space |
|---|---|---|
| Linear | xTx' | Original space (no mapping) |
| Polynomial | (xTx' + c)d | All monomials up to degree d |
| RBF (Gaussian) | exp(−γ||x − x'||2) | Infinite-dimensional! (all Taylor terms) |
Time to see kernels in action. Below is a dataset of concentric rings — clearly not linearly separable. Select different kernels and tune their parameters. Watch how the decision boundary bends and wraps to separate the classes.
Choose a kernel and adjust its parameters. The decision boundary (the colored regions) redraws in real time. The linear kernel fails spectacularly. The polynomial and RBF kernels find nonlinear boundaries.
Support Vector Machines combine geometric intuition (maximum margin), convex optimization (QP), and the kernel trick (nonlinear classification) into one of the most elegant algorithms in machine learning.
| Concept | Key Formula | Takeaway |
|---|---|---|
| Decision function | f(x) = wTx + b | Linear classifier |
| Margin | 2/||w|| | Wider margin = better generalization |
| Hard-margin objective | min ½||w||2 | Maximize margin, zero tolerance |
| Soft-margin | + C ∑ ξn | Allow violations with penalty |
| Hinge loss | max(0, 1−yf(x)) | Sparse, convex loss function |
| Dual | In terms of αn and xnTxm | Opens door to kernels |
| Representer theorem | w = ∑ αn yn xn | Sparse: only SVs matter |
| Kernel trick | k(x,x') replaces xTx' | Nonlinear boundaries, finite cost |
Strengths of SVMs:
• Strong theoretical foundations
• Convex → unique global optimum
• Kernel trick for nonlinearity
• Sparse (only support vectors matter)
• Works well in high dimensions
Limitations:
• Training O(N2) to O(N3) for the QP
• Kernel choice is a black art
• No native probability outputs
• Doesn't scale to millions of points easily
• Binary only (extensions: one-vs-rest)
Looking back: This chapter built on optimization (Chapter 7) for the QP formulation, probability (Chapter 6) for the Bayesian connections, and linear algebra (Chapter 2) for inner products and projections. The SVM is a beautiful synthesis of mathematics meeting data.