From Fisher's discriminant to the maximum-margin classifier that launched a thousand kernels.
You have two classes of data points in high-dimensional space. Maybe they're tumor cells vs healthy cells, spam vs ham, or cat images vs dog images. You want a rule that separates them. The simplest possible rule? Draw a line (or in higher dimensions, a flat surface) between them.
We already know one approach: Fisher's Linear Discriminant Analysis (LDA). It projects data onto the direction that maximizes the ratio of between-class separation to within-class scatter:
where βk = wTμk are the projected class means and σk² are the projected class variances. Fisher finds the w that maximizes this. It works beautifully when classes are roughly Gaussian.
But Fisher has a limitation: it cares about the entire distribution of each class. If the data is well-separated, Fisher wastes effort modeling points far from the boundary. What if we only cared about the points closest to the decision boundary — the ones that are hardest to classify?
Left: Fisher's projection maximizes class separation. Right: SVM maximizes the gap (margin) between the closest points. Click Regenerate to see new data.
This chapter traces the path from Fisher's global view to the SVM's local, boundary-focused philosophy. The destination: a classifier defined entirely by a handful of critical points.
A hyperplane in ℝd is the set of all points x satisfying a linear equation:
The vector w is the normal to the hyperplane — it points perpendicular to the surface. The scalar b is the bias (or offset) that shifts the hyperplane away from the origin. In 2D, a hyperplane is just a line. In 3D, it's a plane. In higher dimensions, it's a flat (d−1)-dimensional surface.
The most important geometric fact we need: how far is a point from the hyperplane?
The signed distance from point z to hyperplane H is:
The unsigned distance (always positive) is simply |wTz + b| / ||w||. Points on the w-side of H get positive distance; points on the opposite side get negative distance.
This formula is the foundation of everything that follows. The SVM will ask: "which hyperplane maximizes the minimum distance to any training point?"
Drag the orange point and adjust the hyperplane angle. The dashed line shows the perpendicular distance d(z,H).
Notice: scaling w and b by the same constant doesn't change the hyperplane (the set H stays the same), but it does change ||w||. This scaling ambiguity will be important when we formulate the SVM optimization — we'll fix a specific normalization.
Given labeled training data {(xi, yi)} where yi ∈ {+1, −1}, a separating hyperplane satisfies yi(wTxi + b) > 0 for all i. Points with label +1 land on the positive side; points with label −1 land on the negative side.
But if data is linearly separable, there are infinitely many separating hyperplanes. Which one should we choose?
The margin of a hyperplane (w, b) with respect to the training set is:
The total margin (the width of the "street" between classes) is twice this: 2 · mini |wTxi + b| / ||w||.
Here's where the canonical scaling trick comes in. Since scaling (w, b) doesn't change the hyperplane, we can always rescale so that the closest point satisfies |wTxi + b| = 1. Under this normalization:
So maximizing the margin is equivalent to minimizing ||w||. The points that achieve |wTxi + b| = 1 are called support vectors — they "support" the margin boundaries.
The shaded "street" is the margin. Orange = class +1, Teal = class −1. Support vectors are circled. Drag the slider to rotate the boundary and see how margin changes.
We now have everything to write the hard margin SVM optimization problem. "Hard margin" means we demand perfect separation — every point must be on the correct side with margin at least 1/||w||.
The hard margin SVM primal problem:
Why ½||w||² instead of ||w||? Two reasons: (1) the square removes the square root, making the objective differentiable everywhere, and (2) the ½ is a convenience — it cancels with the 2 when we differentiate.
This is a convex quadratic program (QP): quadratic objective, linear constraints. Convexity guarantees a unique global minimum. The solution gives us:
The solid line is the optimal hyperplane. Dashed lines show the margin boundaries at wTx + b = ±1. Circled points are support vectors. Click New Data to try different configurations.
Observe: the solution depends ONLY on the support vectors. Moving any non-support-vector point (as long as it stays outside the margin) doesn't change the boundary at all. This sparsity is what makes SVMs powerful.
Real data is messy. Classes overlap. Outliers exist. Hard margin SVM fails catastrophically in these cases — the optimization is simply infeasible. We need a way to allow some misclassifications while still preferring large margins.
The solution: introduce a slack variable si ≥ 0 for each training point. The slack "relaxes" the hard constraint:
What does si mean geometrically?
The soft margin SVM (C-SVM) trades off margin width against slack:
Notice the structure: the objective is still convex (quadratic + linear), and the constraints are still linear. So soft margin SVM is also a convex QP with a unique global solution.
Drag C to see how it trades margin width vs misclassification. Low C = wide margin, high C = tight fit. Points inside the margin have si > 0 (shown with red halos).
The soft margin formulation is equivalent to the hinge loss: ℓ(y, f(x)) = max(0, 1 − y·f(x)). Each slack si = max(0, 1 − yi(wTxi + b)). So the objective becomes ½||w||² + (C/n)∑ max(0, 1 − yif(xi)). This connection to loss functions bridges SVMs with general regularized empirical risk minimization.
Time to build intuition by playing with an SVM. Click to place points (left-click for class +1, right-click or hold Shift+click for class −1). The SVM computes the optimal boundary in real time.
Click = class +1 (orange). Shift+click = class −1 (teal). Support vectors are circled. The shaded band is the margin.
Notice what happens with the XOR preset: four points arranged in an X pattern. No linear boundary can separate them. The hard margin SVM is infeasible; the soft margin SVM does its best but misclassifies. This is the fundamental limitation of linear classifiers — and the motivation for kernels (Lecture 19).
SVMs are inherently binary classifiers — they find a hyperplane between two classes. But real problems often have K > 2 classes. How do we extend?
One-vs-All (OvA): Train K binary SVMs, each separating class k from all others. For a new point x, predict the class whose SVM gives the highest confidence: argmaxk (wkTx + bk).
One-vs-One (OvO): Train K(K−1)/2 binary SVMs, one for each pair of classes. For a new point, each SVM "votes" for one class; predict the class with the most votes.
| Strategy | SVMs trained | Pros | Cons |
|---|---|---|---|
| One-vs-All | K | Fewer models, fast at test time | Imbalanced training sets |
| One-vs-One | K(K−1)/2 | Each model sees balanced data | Many models for large K |
There's also the multi-class Fisher approach (from EE269): the generalized eigenvalue problem. Given K classes with means μk and shared within-class scatter SW, we solve:
where SB = ∑k nk(μk − μ)(μk − μ)T is the between-class scatter. The top (K−1) eigenvectors give discriminant directions. But this is a dimensionality reduction method, not a classifier — you still need a classification rule in the projected space.
Three classes separated by OvA SVMs. Each region shows which class "wins." The boundaries are piecewise linear.
Let's consolidate what we've built, from Fisher's global projection to the SVM's local, margin-focused philosophy.
| Concept | Formula | Intuition |
|---|---|---|
| Hyperplane | H = {x : wTx + b = 0} | A flat decision surface |
| Distance to H | |wTz + b| / ||w|| | Perpendicular gap |
| Margin (canonical) | 2 / ||w|| | Width of the "street" |
| Hard margin | min ½||w||² s.t. yi(wTxi+b)≥1 | Widest street, no errors allowed |
| Soft margin | + (C/n)∑si, relax to ≥1−si | Allow errors, penalize them |
| Support vectors | Points with si=0 at margin boundary | The few critical points |
What's next: