EE269 Lecture 17 — Mert Pilanci, Stanford

Support Vector Machines

From Fisher's discriminant to the maximum-margin classifier that launched a thousand kernels.

Prerequisites: Linear algebra + Basic optimization. That's it.
8
Chapters
5+
Simulations
0
Assumed Knowledge

Chapter 0: From Fisher to Hyperplanes

You have two classes of data points in high-dimensional space. Maybe they're tumor cells vs healthy cells, spam vs ham, or cat images vs dog images. You want a rule that separates them. The simplest possible rule? Draw a line (or in higher dimensions, a flat surface) between them.

We already know one approach: Fisher's Linear Discriminant Analysis (LDA). It projects data onto the direction that maximizes the ratio of between-class separation to within-class scatter:

J(w) = (β1 − β2)² / (σ1² + σ2²)

where βk = wTμk are the projected class means and σk² are the projected class variances. Fisher finds the w that maximizes this. It works beautifully when classes are roughly Gaussian.

But Fisher has a limitation: it cares about the entire distribution of each class. If the data is well-separated, Fisher wastes effort modeling points far from the boundary. What if we only cared about the points closest to the decision boundary — the ones that are hardest to classify?

The key shift: Fisher uses all points equally (via covariance). SVMs focus only on the support vectors — the hardest points near the boundary. This makes SVMs robust to outliers far from the decision surface.
Fisher vs Margin: Two Philosophies

Left: Fisher's projection maximizes class separation. Right: SVM maximizes the gap (margin) between the closest points. Click Regenerate to see new data.

This chapter traces the path from Fisher's global view to the SVM's local, boundary-focused philosophy. The destination: a classifier defined entirely by a handful of critical points.

What is Fisher LDA's main limitation that motivates SVMs?

Chapter 1: Separating Hyperplane Geometry

A hyperplane in ℝd is the set of all points x satisfying a linear equation:

H = { x ∈ ℝd : wTx + b = 0 }

The vector w is the normal to the hyperplane — it points perpendicular to the surface. The scalar b is the bias (or offset) that shifts the hyperplane away from the origin. In 2D, a hyperplane is just a line. In 3D, it's a plane. In higher dimensions, it's a flat (d−1)-dimensional surface.

The most important geometric fact we need: how far is a point from the hyperplane?

Key derivation: Take any point z. Decompose it as z = z0 + r·(w/||w||) where z0 lies on H and r is the signed distance. Since z0 is on H: wTz0 + b = 0. Substituting: wT(z − r·w/||w||) + b = 0, so wTz + b = r·||w||. Therefore r = (wTz + b)/||w||.

The signed distance from point z to hyperplane H is:

d(z, H) = (wTz + b) / ||w||

The unsigned distance (always positive) is simply |wTz + b| / ||w||. Points on the w-side of H get positive distance; points on the opposite side get negative distance.

This formula is the foundation of everything that follows. The SVM will ask: "which hyperplane maximizes the minimum distance to any training point?"

Distance to a Hyperplane

Drag the orange point and adjust the hyperplane angle. The dashed line shows the perpendicular distance d(z,H).

Angle θ45°
Bias b0.0

Notice: scaling w and b by the same constant doesn't change the hyperplane (the set H stays the same), but it does change ||w||. This scaling ambiguity will be important when we formulate the SVM optimization — we'll fix a specific normalization.

If w = [3, 4] and b = −10, what is the distance from the origin (z = [0,0]) to the hyperplane?

Chapter 2: Margin — The Gap Between Classes

Given labeled training data {(xi, yi)} where yi ∈ {+1, −1}, a separating hyperplane satisfies yi(wTxi + b) > 0 for all i. Points with label +1 land on the positive side; points with label −1 land on the negative side.

But if data is linearly separable, there are infinitely many separating hyperplanes. Which one should we choose?

The maximum margin principle: Choose the hyperplane that maximizes the margin — the minimum distance from any training point to the decision boundary. This gives the classifier the best "safety buffer" against new, unseen data.

The margin of a hyperplane (w, b) with respect to the training set is:

margin = mini |wTxi + b| / ||w||

The total margin (the width of the "street" between classes) is twice this: 2 · mini |wTxi + b| / ||w||.

Here's where the canonical scaling trick comes in. Since scaling (w, b) doesn't change the hyperplane, we can always rescale so that the closest point satisfies |wTxi + b| = 1. Under this normalization:

margin = 1 / ||w||,    total margin = 2 / ||w||

So maximizing the margin is equivalent to minimizing ||w||. The points that achieve |wTxi + b| = 1 are called support vectors — they "support" the margin boundaries.

Margin Visualization

The shaded "street" is the margin. Orange = class +1, Teal = class −1. Support vectors are circled. Drag the slider to rotate the boundary and see how margin changes.

Boundary angle90°
Why maximum margin? Statistical learning theory (Vapnik-Chervonenkis theory) shows that larger margins lead to better generalization bounds. Intuitively: a wide street means small perturbations to test points won't flip their classification.
Under canonical scaling (closest point has |wTx + b| = 1), the total margin width is:

Chapter 3: Hard Margin SVM

We now have everything to write the hard margin SVM optimization problem. "Hard margin" means we demand perfect separation — every point must be on the correct side with margin at least 1/||w||.

Goal
Maximize margin = 2/||w||
↓ equivalent to
Reformulation
Minimize ||w||² (easier to optimize)
↓ subject to
Constraints
yi(wTxi + b) ≥ 1 for all i

The hard margin SVM primal problem:

minw,b ½ ||w||²    s.t.   yi(wTxi + b) ≥ 1,   i = 1, …, n

Why ½||w||² instead of ||w||? Two reasons: (1) the square removes the square root, making the objective differentiable everywhere, and (2) the ½ is a convenience — it cancels with the 2 when we differentiate.

This is a convex quadratic program (QP): quadratic objective, linear constraints. Convexity guarantees a unique global minimum. The solution gives us:

What makes it "hard": The constraints yi(wTxi + b) ≥ 1 demand that EVERY training point is classified correctly with a gap of at least 1/||w||. If even one point violates this (data is not linearly separable), the problem is infeasible — no solution exists.
Hard Margin SVM Solution

The solid line is the optimal hyperplane. Dashed lines show the margin boundaries at wTx + b = ±1. Circled points are support vectors. Click New Data to try different configurations.

Observe: the solution depends ONLY on the support vectors. Moving any non-support-vector point (as long as it stays outside the margin) doesn't change the boundary at all. This sparsity is what makes SVMs powerful.

What happens if the data is NOT linearly separable and we use hard margin SVM?

Chapter 4: Soft Margin & Slack Variables

Real data is messy. Classes overlap. Outliers exist. Hard margin SVM fails catastrophically in these cases — the optimization is simply infeasible. We need a way to allow some misclassifications while still preferring large margins.

The solution: introduce a slack variable si ≥ 0 for each training point. The slack "relaxes" the hard constraint:

yi(wTxi + b) ≥ 1 − si,    si ≥ 0

What does si mean geometrically?

The soft margin SVM (C-SVM) trades off margin width against slack:

minw,b,s ½ ||w||² + (C/n) ∑i=1n si
s.t.   yi(wTxi + b) ≥ 1 − si,   si ≥ 0,   ∀i
The C parameter: C controls the penalty for violations. Large C = "I hate misclassifications" → small margin, few errors. Small C = "I'm OK with some errors" → wide margin, more errors. C is the fundamental hyperparameter of SVMs.

Notice the structure: the objective is still convex (quadratic + linear), and the constraints are still linear. So soft margin SVM is also a convex QP with a unique global solution.

Effect of C on the Decision Boundary

Drag C to see how it trades margin width vs misclassification. Low C = wide margin, high C = tight fit. Points inside the margin have si > 0 (shown with red halos).

log10(C)10.0

The soft margin formulation is equivalent to the hinge loss: ℓ(y, f(x)) = max(0, 1 − y·f(x)). Each slack si = max(0, 1 − yi(wTxi + b)). So the objective becomes ½||w||² + (C/n)∑ max(0, 1 − yif(xi)). This connection to loss functions bridges SVMs with general regularized empirical risk minimization.

A training point has slack si = 1.5. What does this mean geometrically?

Chapter 5: Interactive SVM

Time to build intuition by playing with an SVM. Click to place points (left-click for class +1, right-click or hold Shift+click for class −1). The SVM computes the optimal boundary in real time.

Things to try: (1) Place two well-separated clusters — see the wide margin. (2) Move a point into the margin — watch it become a support vector. (3) Crank C down — see the margin widen as the SVM tolerates errors. (4) Make data non-separable — see slack variables activate.
Live SVM Playground

Click = class +1 (orange). Shift+click = class −1 (teal). Support vectors are circled. The shaded band is the margin.

log10(C)100

Notice what happens with the XOR preset: four points arranged in an X pattern. No linear boundary can separate them. The hard margin SVM is infeasible; the soft margin SVM does its best but misclassifies. This is the fundamental limitation of linear classifiers — and the motivation for kernels (Lecture 19).

Key observations: (1) Only support vectors determine the boundary — move other points freely. (2) As C → ∞, soft margin approaches hard margin. (3) As C → 0, the margin grows but errors increase. (4) The number of support vectors indicates model complexity.

Chapter 6: Multi-class Strategies

SVMs are inherently binary classifiers — they find a hyperplane between two classes. But real problems often have K > 2 classes. How do we extend?

One-vs-All (OvA): Train K binary SVMs, each separating class k from all others. For a new point x, predict the class whose SVM gives the highest confidence: argmaxk (wkTx + bk).

One-vs-One (OvO): Train K(K−1)/2 binary SVMs, one for each pair of classes. For a new point, each SVM "votes" for one class; predict the class with the most votes.

StrategySVMs trainedProsCons
One-vs-AllKFewer models, fast at test timeImbalanced training sets
One-vs-OneK(K−1)/2Each model sees balanced dataMany models for large K

There's also the multi-class Fisher approach (from EE269): the generalized eigenvalue problem. Given K classes with means μk and shared within-class scatter SW, we solve:

SB w = λ SW w

where SB = ∑k nkk − μ)(μk − μ)T is the between-class scatter. The top (K−1) eigenvectors give discriminant directions. But this is a dimensionality reduction method, not a classifier — you still need a classification rule in the projected space.

Multi-class Decision Regions

Three classes separated by OvA SVMs. Each region shows which class "wins." The boundaries are piecewise linear.

For a 10-class problem, how many binary SVMs does One-vs-One require?

Chapter 7: Mastery

Let's consolidate what we've built, from Fisher's global projection to the SVM's local, margin-focused philosophy.

ConceptFormulaIntuition
HyperplaneH = {x : wTx + b = 0}A flat decision surface
Distance to H|wTz + b| / ||w||Perpendicular gap
Margin (canonical)2 / ||w||Width of the "street"
Hard marginmin ½||w||² s.t. yi(wTxi+b)≥1Widest street, no errors allowed
Soft margin+ (C/n)∑si, relax to ≥1−siAllow errors, penalize them
Support vectorsPoints with si=0 at margin boundaryThe few critical points
The big picture: SVMs find the simplest boundary (maximum margin) that fits the data (up to tolerance C). This embodies Occam's razor in geometric form. Next lecture: we'll see the dual formulation, which reveals that the solution depends only on inner products xiTxj — opening the door to kernels.

What's next:

True or False: Moving a training point that is NOT a support vector (and stays outside the margin) changes the SVM solution.