Deisenroth et al., Chapter 12

Support Vector Machines

Finding the widest street between classes — maximum margin classification, duality, and the kernel trick.

Prerequisites: Chapter 7 (Continuous Optimization) + Chapter 8 (When Models Meet Data). That's it.
11
Chapters
5+
Simulations
11
Quizzes

Chapter 0: The Classification Problem

You have a dataset of emails, each labelled "spam" or "not spam." Each email is represented as a feature vector x (word counts, sender info, etc.) and a label y ∈ {+1, −1}. Your goal: learn a rule that predicts the label of a new, unseen email.

There are many ways to draw a boundary between two classes. A line in 2D. A plane in 3D. A hyperplane in D dimensions. But not all boundaries are created equal. Some barely separate the classes, threading the needle between nearby points. Others create a wide buffer zone — a generous gap between the closest points of each class.

Many Possible Boundaries

Click New Boundary to draw random separating lines through this 2D data. Notice how some lines barely miss nearby points while others leave a comfortable margin. Which would you trust more on new data?

Intuitively, a wider gap means more confidence. A boundary that barely separates the training data will misclassify slightly perturbed points. A boundary with a wide margin is more robust. This is the fundamental insight behind Support Vector Machines (SVMs): among all valid boundaries, pick the one with the maximum margin.

The SVM philosophy: Don't just find any boundary that separates the classes. Find the best one — the one with the widest buffer zone. This maximum-margin principle leads to better generalization (performance on new data) and, remarkably, depends only on a handful of critical training points called support vectors.

Over the next ten chapters, we will build SVMs from scratch: the geometry of hyperplanes, the margin, the optimization problem, the dual formulation, and finally the kernel trick that lets SVMs classify data that no line could ever separate.

Among many possible separating boundaries, why should we prefer one with a wider margin?

Chapter 1: Separating Hyperplanes

A hyperplane in D-dimensional space is defined by a weight vector w ∈ RD and a bias b ∈ R. The hyperplane is the set of all points x satisfying:

wTx + b = 0

In 2D, this is a line. In 3D, a plane. In higher dimensions, the geometry is the same — a flat (D−1)-dimensional surface that splits the space in two. The vector w is normal (perpendicular) to the hyperplane. It points in the direction of the positive class.

A data point xn is on the positive side if wTxn + b > 0, and on the negative side if wTxn + b < 0. Our classifier predicts:

f(x) = sign(wTx + b)

The value wTx + b is the signed distance from x to the hyperplane (up to a scaling by ||w||). Points far from the boundary have large magnitude — the classifier is confident. Points near the boundary have small magnitude — the classifier is uncertain.

Key insight: The weight vector w controls two things simultaneously: the orientation of the hyperplane (which direction it faces) and the scale (how steep the decision function is). The bias b shifts the hyperplane along the normal direction. Together, (w, b) fully parameterize any hyperplane.
Scaling ambiguity: If (w, b) defines a hyperplane, so does (2w, 2b) — the same geometric plane, just different parameterization. This ambiguity becomes important when we define the margin, because we will fix the scale by requiring that the closest points satisfy |wTx + b| = 1.
What role does the vector w play in the hyperplane equation wTx + b = 0?

Chapter 2: The Margin

The margin is the distance from the hyperplane to the nearest data point on either side. Formally, the distance from a point x to the hyperplane wTx + b = 0 is:

distance(x) = |wTx + b| / ||w||

The margin is twice this distance for the closest point — the width of the "street" between the two classes. To make the math cleaner, we adopt a canonical scaling: we require that the closest points satisfy wTx + b = +1 (positive class) or wTx + b = −1 (negative class). These define two parallel "gutters" on either side of the decision boundary.

margin = 2 / ||w||

This is the crucial formula. The margin depends only on the norm of w. A smaller ||w|| means a wider margin. So maximizing the margin is equivalent to minimizing ||w||.

Margin Visualization

Drag the slider to rotate the decision boundary. Watch how the margin (shaded zone) changes width. The optimal SVM boundary maximizes this width.

Angle 45°
Maximizing the margin = minimizing ||w||: This is the key insight that turns a geometric intuition (wide street) into an optimization problem (minimize a quadratic). Since ||w|| is awkward to differentiate (due to the square root), we minimize ½||w||2 instead. Same solution, cleaner math.
The "street" analogy: Think of the hyperplane as the center line of a street. The positive class lives on one sidewalk, the negative class on the other. The margin is the width of the street. We want the widest possible street where no data points intrude into the roadway. The points on the curb — the closest ones — are the support vectors.
If we want to maximize the margin 2/||w||, what equivalent problem should we solve?

Chapter 3: Hard Margin SVM

We now have all the ingredients. The hard-margin SVM finds the maximum-margin hyperplane that correctly classifies every training point. The optimization problem is:

minw,b ½ ||w||2
subject to: yn(wTxn + b) ≥ 1    ∀ n = 1, …, N

The constraint says: every point must be on the correct side of its margin boundary. yn = +1 points must have wTxn + b ≥ +1, and yn = −1 points must have wTxn + b ≤ −1. The product yn(wTxn + b) conveniently captures both cases.

Reading the constraint: yn(wTxn + b) is the functional margin. It is positive when the prediction sign matches the label, and its magnitude measures confidence. The constraint says the functional margin must be at least 1 for every point — not just correct, but confidently correct.

This is a convex quadratic program (QP): a quadratic objective with linear inequality constraints. Convex QPs have a unique global optimum (no local minima), and efficient solvers exist. The solution (w*, b*) defines the maximum-margin hyperplane.

"Hard" margin means zero tolerance. Every single training point must be outside the margin. If even one point falls inside, the problem is infeasible — no solution exists. This is a serious limitation: real data is messy, and classes often overlap. The soft-margin extension (next chapter) addresses this.
TermFormulaMeaning
Objective½||w||2Inverse of margin (squared)
Constraintyn(wTxn + b) ≥ 1All points correctly classified with margin ≥ 1
Margin2/||w*||Width of the "street" at the optimum
What makes the hard-margin SVM infeasible?

Chapter 4: Soft Margin & Slack

Real data is rarely linearly separable. Some points will be on the wrong side of any hyperplane. The soft-margin SVM introduces slack variables ξn ≥ 0 that allow individual points to violate the margin:

minw,b,ξ ½ ||w||2 + C ∑n=1N ξn
subject to: yn(wTxn + b) ≥ 1 − ξn,   ξn ≥ 0

Each ξn measures how much point n violates the margin. If ξn = 0, the point is correctly classified and outside the margin (no violation). If 0 < ξn < 1, the point is inside the margin but still on the correct side. If ξn ≥ 1, the point is misclassified.

The C parameter: C controls the trade-off between a wide margin and few violations. Large C penalizes violations heavily, producing a narrow margin that classifies most training points correctly. Small C tolerates more violations, producing a wider margin that may misclassify some training points but generalizes better. C is the SVM's main hyperparameter.
Soft Margin: The C Trade-Off

Drag the slider to change C. Low C = wide margin, more violations. High C = narrow margin, fewer violations. Points inside the margin or misclassified are marked with slack ξ.

C 10.0
ξn valuePoint locationStatus
ξn = 0Outside margin (correct side)No violation
0 < ξn < 1Inside margin (correct side)Margin violation
ξn ≥ 1Wrong side of boundaryMisclassified
Bias-variance through the lens of C: Large C = low bias, high variance (fits the training data closely, may overfit). Small C = high bias, low variance (simpler model, may underfit). In practice, C is selected by cross-validation.
What happens when ξn ≥ 1 for a training point?

Chapter 5: The Hinge Loss

There is an equivalent way to write the soft-margin SVM that reveals a deeper connection. The slack variables ξn in the constrained optimization can be eliminated to yield an unconstrained problem:

minw,b ½ ||w||2 + C ∑n=1N max(0, 1 − yn(wTxn + b))

The term max(0, 1 − yn f(xn)) is the hinge loss. It is zero when the functional margin yn f(xn) ≥ 1 (point is outside the margin on the correct side). It increases linearly as the point moves into the margin or onto the wrong side.

Hinge Loss vs. Other Losses

The hinge loss (orange) compared to 0-1 loss (teal, the ideal but non-differentiable loss) and logistic loss (blue). The x-axis is the functional margin y · f(x). The hinge is a tight convex upper bound on the 0-1 loss.

The SVM is regularized risk minimization: The objective has two terms: ½||w||2 (regularizer, keeps the model simple) plus C × ∑ hinge loss (empirical risk, fits the data). This is the same structure as ridge regression and other regularized models. The loss function is what makes it an SVM rather than, say, logistic regression.

Why the hinge and not something else? Three important properties:

PropertyWhy it matters
ConvexGuarantees a unique global optimum — no local minima traps
Sparsity-inducingPoints with margin ≥ 1 contribute zero loss and zero gradient. Only the critical points (support vectors) matter.
Upper bound on 0-1 lossMinimizing the hinge loss pushes down the classification error

The hinge loss is piecewise linear: flat at zero for large margins, and linearly increasing for small or negative margins. It is continuous but not differentiable at the kink (margin = 1). In practice, this is handled by subgradient methods or by solving the QP dual instead.

When is the hinge loss for a data point exactly zero?

Chapter 6: Dual SVM & Lagrange

The hard-margin SVM is a constrained optimization. Chapter 7 of the book introduced Lagrange multipliers for handling constraints. We introduce a multiplier αn ≥ 0 for each constraint and form the Lagrangian:

L(w, b, α) = ½||w||2 − ∑n=1N αn [ yn(wTxn + b) − 1 ]

Setting ∂L/∂w = 0 gives the representer theorem for SVMs — the optimal weight vector is a linear combination of the training data:

w = ∑n=1N αn yn xn

Setting ∂L/∂b = 0 gives ∑n αn yn = 0. Substituting both back into the Lagrangian eliminates w and b entirely, yielding the dual problem:

maxαn=1N αn − ½ ∑n,m αn αm yn ym xnT xm
subject to: αn ≥ 0,   ∑n αn yn = 0
Why the dual matters: The dual problem depends on the data only through inner products xnTxm. This is the door that opens to the kernel trick (Chapter 8). If we can compute inner products in a higher-dimensional space without ever constructing the features explicitly, we can learn nonlinear boundaries. The dual makes this possible.

The dual is also a QP — but in N variables (α1, …, αN) instead of D+1 variables (w, b). For high-dimensional data (large D) with moderate N, the dual is much smaller. Efficient algorithms like Sequential Minimal Optimization (SMO) solve the dual by updating two αn at a time.

Complementary slackness: The KKT conditions require αn[yn(wTxn + b) − 1] = 0. For each point, either αn = 0 (the point is outside the margin and doesn't matter) or yn(wTxn + b) = 1 (the point sits exactly on the margin — a support vector). This is the sparsity of SVMs: most αn are zero.
The optimal w in the SVM can be written as w = ∑ αn yn xn. What does this mean?

Chapter 7: Support Vectors

By complementary slackness, only points with αn > 0 contribute to the solution. These points — the ones sitting exactly on the margin boundary — are the support vectors. Everything else could be deleted without changing the decision boundary at all.

The name reveals the idea: The support vectors literally "support" the decision boundary. They are the critical few data points that define the margin. Move one of them, and the boundary shifts. Remove a non-support-vector, and nothing changes. This is remarkable sparsity: out of thousands of training points, often only a handful matter.

The prediction for a new point x is:

f(x) = sign(wTx + b) = sign(∑n=1N αn yn xnTx + b)

Since most αn = 0, this sum has only as many terms as there are support vectors. Prediction is efficient — it depends on the number of support vectors, not the full training set.

Support Vectors Identified

Support vectors (circled) sit on the margin gutters. They are the only points that determine the decision boundary. All other points could be removed without changing the SVM solution.

To find b, use any support vector xs (a point with αs > 0): since ys(wTxs + b) = 1, we get b = ys − wTxs. In practice, average over all support vectors for numerical stability.

Point typeαnLocation
Non-support= 0Outside margin (doesn't affect boundary)
Support vector> 0On the margin gutter (defines boundary)
Bounded SV (soft margin)= CInside margin or misclassified
If you remove a training point that is NOT a support vector, what happens to the SVM decision boundary?

Chapter 8: The Kernel Trick

So far, SVM boundaries are hyperplanes — flat, linear surfaces. But many real-world problems are not linearly separable. Consider two concentric circles of data: the inner ring is class +1, the outer ring is class −1. No line can separate them.

The classic solution: map the data into a higher-dimensional feature space φ(x) where the classes are linearly separable, then run the SVM there. For the concentric circles, adding a feature x12 + x22 (distance from origin) makes the problem trivially separable in 3D.

But computing φ(x) explicitly can be expensive — the feature space might have millions or even infinite dimensions. Here is the key insight: the dual SVM only uses inner products xnTxm. If we replace these with inner products in the feature space φ(xn)Tφ(xm), we get a nonlinear SVM. The kernel trick is: find a function k(xn, xm) that computes this inner product without ever computing φ.

k(xn, xm) = φ(xn)Tφ(xm)

The dual becomes:

maxαn αn − ½ ∑n,m αn αm yn ym k(xn, xm)

Common kernels:

KernelFormulaImplicit feature space
LinearxTx'Original space (no mapping)
Polynomial(xTx' + c)dAll monomials up to degree d
RBF (Gaussian)exp(−γ||x − x'||2)Infinite-dimensional! (all Taylor terms)
The RBF kernel maps to infinite dimensions. Yet we never compute the infinite-dimensional features — we only ever evaluate the kernel function, which is a simple exponential. This is the magic: infinite expressiveness at finite computational cost. The γ parameter controls how quickly the kernel decays with distance — large γ means nearby points dominate, small γ means faraway points also matter.
Mercer's theorem: A function k(x, x') can be used as a kernel if and only if the resulting Gram matrix Knm = k(xn, xm) is positive semi-definite for any set of points. All three kernels above satisfy this condition. You can even design your own kernels — as long as Mercer's condition holds, there exists a valid feature space.
What is the key advantage of the kernel trick?

Chapter 9: Kernel Playground

Time to see kernels in action. Below is a dataset of concentric rings — clearly not linearly separable. Select different kernels and tune their parameters. Watch how the decision boundary bends and wraps to separate the classes.

Kernel SVM Playground

Choose a kernel and adjust its parameters. The decision boundary (the colored regions) redraws in real time. The linear kernel fails spectacularly. The polynomial and RBF kernels find nonlinear boundaries.

What to watch for: With the linear kernel, the boundary is a straight line — it cannot wrap around the inner ring. Switch to RBF and the boundary curves smoothly between the classes. Increase γ and the boundary becomes tighter (more complex, possible overfitting). Decrease γ and it becomes smoother (simpler, possible underfitting). The polynomial kernel produces decision boundaries with interesting shapes depending on the degree.
No free lunch: More flexible kernels (high γ RBF, high-degree polynomial) can fit any training data perfectly, but may overfit. The best kernel and parameters depend on the data — use cross-validation. In modern practice, SVMs with RBF kernels remain competitive on many tabular datasets.
Why does the linear kernel fail on concentric ring data?

Chapter 10: Summary

Support Vector Machines combine geometric intuition (maximum margin), convex optimization (QP), and the kernel trick (nonlinear classification) into one of the most elegant algorithms in machine learning.

ConceptKey FormulaTakeaway
Decision functionf(x) = wTx + bLinear classifier
Margin2/||w||Wider margin = better generalization
Hard-margin objectivemin ½||w||2Maximize margin, zero tolerance
Soft-margin+ C ∑ ξnAllow violations with penalty
Hinge lossmax(0, 1−yf(x))Sparse, convex loss function
DualIn terms of αn and xnTxmOpens door to kernels
Representer theoremw = ∑ αn yn xnSparse: only SVs matter
Kernel trickk(x,x') replaces xTx'Nonlinear boundaries, finite cost

Strengths of SVMs:

• Strong theoretical foundations
• Convex → unique global optimum
• Kernel trick for nonlinearity
• Sparse (only support vectors matter)
• Works well in high dimensions

Limitations:

• Training O(N2) to O(N3) for the QP
• Kernel choice is a black art
• No native probability outputs
• Doesn't scale to millions of points easily
• Binary only (extensions: one-vs-rest)

SVMs vs. Neural Networks: SVMs were the dominant classifier from the mid-1990s to the early 2010s. Deep learning supplanted them for images, text, and audio — tasks where learned features outperform hand-crafted kernels. But SVMs remain competitive and sometimes superior for structured/tabular data, small datasets, and problems where interpretability (via support vectors) matters.

Looking back: This chapter built on optimization (Chapter 7) for the QP formulation, probability (Chapter 6) for the Bayesian connections, and linear algebra (Chapter 2) for inner products and projections. The SVM is a beautiful synthesis of mathematics meeting data.

"The key insight of SVMs is that the solution depends only on a subset
of the training data — the support vectors."
— Bernhard Scholkopf & Alexander Smola
What is the main advantage of the dual formulation over the primal for kernel SVMs?