Ch 12: Support Vector Machines

Chapter 0: The Classification Problem

You have a dataset of emails, each labelled "spam" or "not spam." Each email is represented as a feature vector x (word counts, sender info, etc.) and a label y ∈ {+1, −1}. Your goal: learn a rule that predicts the label of a new, unseen email.

There are many ways to draw a boundary between two classes. A line in 2D. A plane in 3D. A hyperplane in D dimensions. But not all boundaries are created equal. Some barely separate the classes, threading the needle between nearby points. Others create a wide buffer zone — a generous gap between the closest points of each class.

Many Possible Boundaries

Click New Boundary to draw random separating lines through this 2D data. Notice how some lines barely miss nearby points while others leave a comfortable margin. Which would you trust more on new data?

Intuitively, a wider gap means more confidence. A boundary that barely separates the training data will misclassify slightly perturbed points. A boundary with a wide margin is more robust. This is the fundamental insight behind Support Vector Machines (SVMs): among all valid boundaries, pick the one with the maximum margin.

The SVM philosophy: Don't just find any boundary that separates the classes. Find the best one — the one with the widest buffer zone. This maximum-margin principle leads to better generalization (performance on new data) and, remarkably, depends only on a handful of critical training points called support vectors.

Over the next ten chapters, we will build SVMs from scratch: the geometry of hyperplanes, the margin, the optimization problem, the dual formulation, and finally the kernel trick that lets SVMs classify data that no line could ever separate.

Among many possible separating boundaries, why should we prefer one with a wider margin?

Because wider margins are faster to compute Because a wider margin is more robust to perturbations and generalizes better Because narrow margins use more memory

Chapter 1: Separating Hyperplanes

A hyperplane in D-dimensional space is defined by a weight vector w ∈ R^D and a bias b ∈ R. The hyperplane is the set of all points x satisfying:

w^Tx + b = 0

In 2D, this is a line. In 3D, a plane. In higher dimensions, the geometry is the same — a flat (D−1)-dimensional surface that splits the space in two. The vector w is normal (perpendicular) to the hyperplane. It points in the direction of the positive class.

A data point x_n is on the positive side if w^Tx_n + b > 0, and on the negative side if w^Tx_n + b < 0. Our classifier predicts:

f(x) = sign(w^Tx + b)

The value w^Tx + b is the signed distance from x to the hyperplane (up to a scaling by ||w||). Points far from the boundary have large magnitude — the classifier is confident. Points near the boundary have small magnitude — the classifier is uncertain.

Key insight: The weight vector w controls two things simultaneously: the orientation of the hyperplane (which direction it faces) and the scale (how steep the decision function is). The bias b shifts the hyperplane along the normal direction. Together, (w, b) fully parameterize any hyperplane.

Scaling ambiguity: If (w, b) defines a hyperplane, so does (2w, 2b) — the same geometric plane, just different parameterization. This ambiguity becomes important when we define the margin, because we will fix the scale by requiring that the closest points satisfy |w^Tx + b| = 1.

What role does the vector w play in the hyperplane equation w^Tx + b = 0?

It is the normal (perpendicular) to the hyperplane It is a point on the hyperplane It is the label vector

Chapter 2: The Margin

The margin is the distance from the hyperplane to the nearest data point on either side. Formally, the distance from a point x to the hyperplane w^Tx + b = 0 is:

distance(x) = |w^Tx + b| / ||w||

The margin is twice this distance for the closest point — the width of the "street" between the two classes. To make the math cleaner, we adopt a canonical scaling: we require that the closest points satisfy w^Tx + b = +1 (positive class) or w^Tx + b = −1 (negative class). These define two parallel "gutters" on either side of the decision boundary.

margin = 2 / ||w||

This is the crucial formula. The margin depends only on the norm of w. A smaller ||w|| means a wider margin. So maximizing the margin is equivalent to minimizing ||w||.

Margin Visualization

Drag the slider to rotate the decision boundary. Watch how the margin (shaded zone) changes width. The optimal SVM boundary maximizes this width.

Angle 45°

Maximizing the margin = minimizing ||w||: This is the key insight that turns a geometric intuition (wide street) into an optimization problem (minimize a quadratic). Since ||w|| is awkward to differentiate (due to the square root), we minimize ½||w||² instead. Same solution, cleaner math.

The "street" analogy: Think of the hyperplane as the center line of a street. The positive class lives on one sidewalk, the negative class on the other. The margin is the width of the street. We want the widest possible street where no data points intrude into the roadway. The points on the curb — the closest ones — are the support vectors.

If we want to maximize the margin 2/||w||, what equivalent problem should we solve?

Maximize ||w|| Minimize ½||w||² subject to correct classification Set w = 0

Chapter 3: Hard Margin SVM

We now have all the ingredients. The hard-margin SVM finds the maximum-margin hyperplane that correctly classifies every training point. The optimization problem is:

min_w,b ½ ||w||²

subject to: y_n(w^Tx_n + b) ≥ 1 ∀ n = 1, …, N

The constraint says: every point must be on the correct side of its margin boundary. y_n = +1 points must have w^Tx_n + b ≥ +1, and y_n = −1 points must have w^Tx_n + b ≤ −1. The product y_n(w^Tx_n + b) conveniently captures both cases.

Reading the constraint: y_n(w^Tx_n + b) is the functional margin. It is positive when the prediction sign matches the label, and its magnitude measures confidence. The constraint says the functional margin must be at least 1 for every point — not just correct, but confidently correct.

This is a convex quadratic program (QP): a quadratic objective with linear inequality constraints. Convex QPs have a unique global optimum (no local minima), and efficient solvers exist. The solution (w*, b*) defines the maximum-margin hyperplane.

"Hard" margin means zero tolerance. Every single training point must be outside the margin. If even one point falls inside, the problem is infeasible — no solution exists. This is a serious limitation: real data is messy, and classes often overlap. The soft-margin extension (next chapter) addresses this.

Term	Formula	Meaning
Objective	½\|\|w\|\|²	Inverse of margin (squared)
Constraint	y_n(w^Tx_n + b) ≥ 1	All points correctly classified with margin ≥ 1
Margin	2/\|\|w*\|\|	Width of the "street" at the optimum

What makes the hard-margin SVM infeasible?

If any training point falls inside the margin or on the wrong side If there are more than 100 data points If the data is in more than 3 dimensions

Chapter 4: Soft Margin & Slack

Real data is rarely linearly separable. Some points will be on the wrong side of any hyperplane. The soft-margin SVM introduces slack variables ξ_n ≥ 0 that allow individual points to violate the margin:

min_w,b,ξ ½ ||w||² + C ∑_n=1^N ξ_n

subject to: y_n(w^Tx_n + b) ≥ 1 − ξ_n, ξ_n ≥ 0

Each ξ_n measures how much point n violates the margin. If ξ_n = 0, the point is correctly classified and outside the margin (no violation). If 0 < ξ_n < 1, the point is inside the margin but still on the correct side. If ξ_n ≥ 1, the point is misclassified.

The C parameter: C controls the trade-off between a wide margin and few violations. Large C penalizes violations heavily, producing a narrow margin that classifies most training points correctly. Small C tolerates more violations, producing a wider margin that may misclassify some training points but generalizes better. C is the SVM's main hyperparameter.

Soft Margin: The C Trade-Off

Drag the slider to change C. Low C = wide margin, more violations. High C = narrow margin, fewer violations. Points inside the margin or misclassified are marked with slack ξ.

C 10.0

ξ_n value	Point location	Status
ξ_n = 0	Outside margin (correct side)	No violation
0 < ξ_n < 1	Inside margin (correct side)	Margin violation
ξ_n ≥ 1	Wrong side of boundary	Misclassified

Bias-variance through the lens of C: Large C = low bias, high variance (fits the training data closely, may overfit). Small C = high bias, low variance (simpler model, may underfit). In practice, C is selected by cross-validation.

What happens when ξ_n ≥ 1 for a training point?

The point is outside the margin The point is inside the margin but correct The point is misclassified (on the wrong side of the boundary)

Chapter 5: The Hinge Loss

There is an equivalent way to write the soft-margin SVM that reveals a deeper connection. The slack variables ξ_n in the constrained optimization can be eliminated to yield an unconstrained problem:

min_w,b ½ ||w||² + C ∑_n=1^N max(0, 1 − y_n(w^Tx_n + b))

The term max(0, 1 − y_n f(x_n)) is the hinge loss. It is zero when the functional margin y_n f(x_n) ≥ 1 (point is outside the margin on the correct side). It increases linearly as the point moves into the margin or onto the wrong side.

Hinge Loss vs. Other Losses

The hinge loss (orange) compared to 0-1 loss (teal, the ideal but non-differentiable loss) and logistic loss (blue). The x-axis is the functional margin y · f(x). The hinge is a tight convex upper bound on the 0-1 loss.

The SVM is regularized risk minimization: The objective has two terms: ½||w||² (regularizer, keeps the model simple) plus C × ∑ hinge loss (empirical risk, fits the data). This is the same structure as ridge regression and other regularized models. The loss function is what makes it an SVM rather than, say, logistic regression.

Why the hinge and not something else? Three important properties:

Property	Why it matters
Convex	Guarantees a unique global optimum — no local minima traps
Sparsity-inducing	Points with margin ≥ 1 contribute zero loss and zero gradient. Only the critical points (support vectors) matter.
Upper bound on 0-1 loss	Minimizing the hinge loss pushes down the classification error

The hinge loss is piecewise linear: flat at zero for large margins, and linearly increasing for small or negative margins. It is continuous but not differentiable at the kink (margin = 1). In practice, this is handled by subgradient methods or by solving the QP dual instead.

When is the hinge loss for a data point exactly zero?

When the point is misclassified When the functional margin y_n f(x_n) ≥ 1 When the point is on the hyperplane

Chapter 6: Dual SVM & Lagrange

The hard-margin SVM is a constrained optimization. Chapter 7 of the book introduced Lagrange multipliers for handling constraints. We introduce a multiplier α_n ≥ 0 for each constraint and form the Lagrangian:

L(w, b, α) = ½||w||² − ∑_n=1^N α_n [ y_n(w^Tx_n + b) − 1 ]

Setting ∂L/∂w = 0 gives the representer theorem for SVMs — the optimal weight vector is a linear combination of the training data:

w = ∑_n=1^N α_n y_n x_n

Setting ∂L/∂b = 0 gives ∑_n α_n y_n = 0. Substituting both back into the Lagrangian eliminates w and b entirely, yielding the dual problem:

max_α ∑_n=1^N α_n − ½ ∑_n,m α_n α_m y_n y_m x_n^T x_m

subject to: α_n ≥ 0, ∑_n α_n y_n = 0

Why the dual matters: The dual problem depends on the data only through inner products x_n^Tx_m. This is the door that opens to the kernel trick (Chapter 8). If we can compute inner products in a higher-dimensional space without ever constructing the features explicitly, we can learn nonlinear boundaries. The dual makes this possible.

The dual is also a QP — but in N variables (α₁, …, α_N) instead of D+1 variables (w, b). For high-dimensional data (large D) with moderate N, the dual is much smaller. Efficient algorithms like Sequential Minimal Optimization (SMO) solve the dual by updating two α_n at a time.

Complementary slackness: The KKT conditions require α_n[y_n(w^Tx_n + b) − 1] = 0. For each point, either α_n = 0 (the point is outside the margin and doesn't matter) or y_n(w^Tx_n + b) = 1 (the point sits exactly on the margin — a support vector). This is the sparsity of SVMs: most α_n are zero.

The optimal w in the SVM can be written as w = ∑ α_n y_n x_n. What does this mean?

The decision boundary is built from a weighted combination of the training points The weight vector must be orthogonal to all training points Only the first training point contributes

Chapter 7: Support Vectors

By complementary slackness, only points with α_n > 0 contribute to the solution. These points — the ones sitting exactly on the margin boundary — are the support vectors. Everything else could be deleted without changing the decision boundary at all.

The name reveals the idea: The support vectors literally "support" the decision boundary. They are the critical few data points that define the margin. Move one of them, and the boundary shifts. Remove a non-support-vector, and nothing changes. This is remarkable sparsity: out of thousands of training points, often only a handful matter.

The prediction for a new point x is:

f(x) = sign(w^Tx + b) = sign(∑_n=1^N α_n y_n x_n^Tx + b)

Since most α_n = 0, this sum has only as many terms as there are support vectors. Prediction is efficient — it depends on the number of support vectors, not the full training set.

Support Vectors Identified

Support vectors (circled) sit on the margin gutters. They are the only points that determine the decision boundary. All other points could be removed without changing the SVM solution.

To find b, use any support vector x_s (a point with α_s > 0): since y_s(w^Tx_s + b) = 1, we get b = y_s − w^Tx_s. In practice, average over all support vectors for numerical stability.

Point type	α_n	Location
Non-support	= 0	Outside margin (doesn't affect boundary)
Support vector	> 0	On the margin gutter (defines boundary)
Bounded SV (soft margin)	= C	Inside margin or misclassified

If you remove a training point that is NOT a support vector, what happens to the SVM decision boundary?

Nothing — the boundary stays exactly the same The boundary shifts slightly The SVM must be completely retrained

Chapter 8: The Kernel Trick

So far, SVM boundaries are hyperplanes — flat, linear surfaces. But many real-world problems are not linearly separable. Consider two concentric circles of data: the inner ring is class +1, the outer ring is class −1. No line can separate them.

The classic solution: map the data into a higher-dimensional feature space φ(x) where the classes are linearly separable, then run the SVM there. For the concentric circles, adding a feature x₁² + x₂² (distance from origin) makes the problem trivially separable in 3D.

But computing φ(x) explicitly can be expensive — the feature space might have millions or even infinite dimensions. Here is the key insight: the dual SVM only uses inner products x_n^Tx_m. If we replace these with inner products in the feature space φ(x_n)^Tφ(x_m), we get a nonlinear SVM. The kernel trick is: find a function k(x_n, x_m) that computes this inner product without ever computing φ.

k(x_n, x_m) = φ(x_n)^Tφ(x_m)

The dual becomes:

max_α ∑_n α_n − ½ ∑_n,m α_n α_m y_n y_m k(x_n, x_m)

Common kernels:

Kernel	Formula	Implicit feature space
Linear	x^Tx'	Original space (no mapping)
Polynomial	(x^Tx' + c)^d	All monomials up to degree d
RBF (Gaussian)	exp(−γ\|\|x − x'\|\|²)	Infinite-dimensional! (all Taylor terms)

The RBF kernel maps to infinite dimensions. Yet we never compute the infinite-dimensional features — we only ever evaluate the kernel function, which is a simple exponential. This is the magic: infinite expressiveness at finite computational cost. The γ parameter controls how quickly the kernel decays with distance — large γ means nearby points dominate, small γ means faraway points also matter.

Mercer's theorem: A function k(x, x') can be used as a kernel if and only if the resulting Gram matrix K_nm = k(x_n, x_m) is positive semi-definite for any set of points. All three kernels above satisfy this condition. You can even design your own kernels — as long as Mercer's condition holds, there exists a valid feature space.

What is the key advantage of the kernel trick?

It makes the SVM run faster by reducing dimensions It computes inner products in a high-dimensional space without explicitly constructing the features It eliminates the need for training data

Chapter 9: Kernel Playground

Time to see kernels in action. Below is a dataset of concentric rings — clearly not linearly separable. Select different kernels and tune their parameters. Watch how the decision boundary bends and wraps to separate the classes.

Kernel SVM Playground

Choose a kernel and adjust its parameters. The decision boundary (the colored regions) redraws in real time. The linear kernel fails spectacularly. The polynomial and RBF kernels find nonlinear boundaries.

What to watch for: With the linear kernel, the boundary is a straight line — it cannot wrap around the inner ring. Switch to RBF and the boundary curves smoothly between the classes. Increase γ and the boundary becomes tighter (more complex, possible overfitting). Decrease γ and it becomes smoother (simpler, possible underfitting). The polynomial kernel produces decision boundaries with interesting shapes depending on the degree.

No free lunch: More flexible kernels (high γ RBF, high-degree polynomial) can fit any training data perfectly, but may overfit. The best kernel and parameters depend on the data — use cross-validation. In modern practice, SVMs with RBF kernels remain competitive on many tabular datasets.

Why does the linear kernel fail on concentric ring data?

Because a linear boundary (straight line/plane) cannot enclose one class inside another Because the linear kernel is too slow Because linear kernels only work in 1D

Chapter 10: Summary

Support Vector Machines combine geometric intuition (maximum margin), convex optimization (QP), and the kernel trick (nonlinear classification) into one of the most elegant algorithms in machine learning.

Concept	Key Formula	Takeaway
Decision function	f(x) = w^Tx + b	Linear classifier
Margin	2/\|\|w\|\|	Wider margin = better generalization
Hard-margin objective	min ½\|\|w\|\|²	Maximize margin, zero tolerance
Soft-margin	+ C ∑ ξ_n	Allow violations with penalty
Hinge loss	max(0, 1−yf(x))	Sparse, convex loss function
Dual	In terms of α_n and x_n^Tx_m	Opens door to kernels
Representer theorem	w = ∑ α_n y_n x_n	Sparse: only SVs matter
Kernel trick	k(x,x') replaces x^Tx'	Nonlinear boundaries, finite cost

Strengths of SVMs:

• Strong theoretical foundations
• Convex → unique global optimum
• Kernel trick for nonlinearity
• Sparse (only support vectors matter)
• Works well in high dimensions

Limitations:

• Training O(N²) to O(N³) for the QP
• Kernel choice is a black art
• No native probability outputs
• Doesn't scale to millions of points easily
• Binary only (extensions: one-vs-rest)

SVMs vs. Neural Networks: SVMs were the dominant classifier from the mid-1990s to the early 2010s. Deep learning supplanted them for images, text, and audio — tasks where learned features outperform hand-crafted kernels. But SVMs remain competitive and sometimes superior for structured/tabular data, small datasets, and problems where interpretability (via support vectors) matters.

Looking back: This chapter built on optimization (Chapter 7) for the QP formulation, probability (Chapter 6) for the Bayesian connections, and linear algebra (Chapter 2) for inner products and projections. The SVM is a beautiful synthesis of mathematics meeting data.

"The key insight of SVMs is that the solution depends only on a subset
of the training data — the support vectors."
— Bernhard Scholkopf & Alexander Smola

What is the main advantage of the dual formulation over the primal for kernel SVMs?

The dual depends on data only through inner products, enabling the kernel trick The dual is faster for small datasets The dual does not require labels