EE269 Lecture 17 — Support Vector Machines

Chapter 0: From Fisher to Hyperplanes

You have two classes of data points in high-dimensional space. Maybe they're tumor cells vs healthy cells, spam vs ham, or cat images vs dog images. You want a rule that separates them. The simplest possible rule? Draw a line (or in higher dimensions, a flat surface) between them.

We already know one approach: Fisher's Linear Discriminant Analysis (LDA). It projects data onto the direction that maximizes the ratio of between-class separation to within-class scatter:

J(w) = (β₁ − β₂)² / (σ₁² + σ₂²)

where β_k = w^Tμ_k are the projected class means and σ_k² are the projected class variances. Fisher finds the w that maximizes this. It works beautifully when classes are roughly Gaussian.
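The criterion can be checked numerically with a minimal NumPy sketch. The function names `fisher_criterion` and `fisher_direction` and the toy Gaussian data are assumptions made for illustration. The closed form w ∝ S_W⁻¹(μ₁ − μ₂) is the standard maximizer of J(w), which these notes use but do not derive.

```python
import numpy as np

def fisher_criterion(w, X1, X2):
    """J(w) = (beta1 - beta2)^2 / (sigma1^2 + sigma2^2) for projections onto w."""
    p1, p2 = X1 @ w, X2 @ w                  # projected samples of each class
    beta1, beta2 = p1.mean(), p2.mean()      # projected class means
    return (beta1 - beta2) ** 2 / (p1.var() + p2.var())

def fisher_direction(X1, X2):
    """Standard closed-form maximizer: w proportional to S_W^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two (biased) class covariances.
    S_W = np.cov(X1, rowvar=False, bias=True) + np.cov(X2, rowvar=False, bias=True)
    return np.linalg.solve(S_W, mu1 - mu2)

# Hypothetical toy data: two correlated Gaussian classes.
rng = np.random.default_rng(0)
cov = [[3.0, 1.0], [1.0, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([3.0, 1.0], cov, size=200)

w_fisher = fisher_direction(X1, X2)
w_means = X1.mean(axis=0) - X2.mean(axis=0)  # naive direction: difference of means
J_fisher = fisher_criterion(w_fisher, X1, X2)
J_means = fisher_criterion(w_means, X1, X2)
print(J_fisher, J_means)
```

On this data the Fisher direction scores at least as high as the plain difference of means, and J is unchanged when w is rescaled.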

But Fisher has a limitation: it cares about the entire distribution of each class. If the data is well-separated, Fisher wastes effort modeling points far from the boundary. What if we only cared about the points closest to the decision boundary — the ones that are hardest to classify?

The key shift: Fisher uses all points equally (via the class means and variances). SVMs focus only on the support vectors, the hardest points near the boundary. This makes the SVM solution insensitive to points that lie far from the decision surface on the correct side.

Fisher vs Margin: Two Philosophies

Left: Fisher's projection maximizes class separation. Right: SVM maximizes the gap (margin) between the closest points.

This chapter traces the path from Fisher's global view to the SVM's local, boundary-focused philosophy. The destination: a classifier defined entirely by a handful of critical points.

What is Fisher LDA's main limitation that motivates SVMs?

It can only handle two classes
It requires too much computation
It uses all points equally, even those far from the boundary

Chapter 1: Separating Hyperplane Geometry

A hyperplane in ℝ^d is the set of all points x satisfying a linear equation:

H = { x ∈ ℝ^d : w^Tx + b = 0 }

The vector w is the normal to the hyperplane — it points perpendicular to the surface. The scalar b is the bias (or offset) that shifts the hyperplane away from the origin. In 2D, a hyperplane is just a line. In 3D, it's a plane. In higher dimensions, it's a flat (d−1)-dimensional surface.

The most important geometric fact we need: how far is a point from the hyperplane?

Key derivation: Take any point z. Decompose it as z = z₀ + r·(w/||w||) where z₀ lies on H and r is the signed distance. Since z₀ is on H: w^Tz₀ + b = 0. Substituting: w^T(z − r·w/||w||) + b = 0, so w^Tz + b = r·||w||. Therefore r = (w^Tz + b)/||w||.

The signed distance from point z to hyperplane H is:

d(z, H) = (w^Tz + b) / ||w||

The unsigned distance (never negative) is simply |w^Tz + b| / ||w||. Points on the side of H that w points toward get positive signed distance; points on the opposite side get negative signed distance; points on H get zero.
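A short NumPy sketch of the signed distance formula. The helper name `signed_distance` and the sample points are illustrative assumptions. The last lines check the decomposition from the derivation: stepping back from z by r along the unit normal should land on H.

```python
import numpy as np

def signed_distance(z, w, b):
    """Signed distance from point z to the hyperplane {x : w^T x + b = 0}."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    return (w @ z + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -10.0            # ||w|| = 5

d_origin = signed_distance([0.0, 0.0], w, b)  # (0 + 0 - 10) / 5
d_other = signed_distance([4.0, 3.0], w, b)   # (12 + 12 - 10) / 5
d_on_H = signed_distance([2.0, 1.0], w, b)    # (6 + 4 - 10) / 5

# Recover z0 = z - r * w/||w|| and confirm it satisfies w^T z0 + b = 0.
z = np.array([4.0, 3.0])
z0 = z - d_other * w / np.linalg.norm(w)
residual = w @ z0 + b
print(d_origin, d_other, d_on_H, residual)
```

The origin comes out at signed distance −2, so it lies on the side opposite to w, at unsigned distance 2.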

This formula is the foundation of everything that follows. The SVM will ask: "which hyperplane maximizes the minimum distance to any training point?"

Distance to a Hyperplane

Interactive figure: a draggable point z and a hyperplane with adjustable angle θ (shown at 45°) and bias b (shown at 0.0). The dashed line shows the perpendicular distance d(z, H).

Notice: scaling w and b by the same constant doesn't change the hyperplane (the set H stays the same), but it does change ||w||. This scaling ambiguity will be important when we formulate the SVM optimization — we'll fix a specific normalization.

If w = [3, 4] and b = −10, what is the distance from the origin (z = [0,0]) to the hyperplane?

10
2 (since |0 + 0 − 10| / 5 = 2)
5

Chapter 2: Margin — The Gap Between Classes

Given labeled training data {(x_i, y_i)} where y_i ∈ {+1, −1}, a separating hyperplane satisfies y_i(w^Tx_i + b) > 0 for all i. Points with label +1 land on the positive side; points with label −1 land on the negative side.

But if data is linearly separable, there are infinitely many separating hyperplanes. Which one should we choose?

The maximum margin principle: Choose the hyperplane that maximizes the margin — the minimum distance from any training point to the decision boundary. This gives the classifier the best "safety buffer" against new, unseen data.

The margin of a hyperplane (w, b) with respect to the training set is:

margin = min_i |w^Tx_i + b| / ||w||

The total margin (the width of the "street" between classes) is twice this: 2 · min_i |w^Tx_i + b| / ||w||.

Here's where the canonical scaling trick comes in. Since scaling (w, b) doesn't change the hyperplane, we can always rescale so that the closest point satisfies |w^Tx_i + b| = 1. Under this normalization:

margin = 1 / ||w||, total margin = 2 / ||w||

So maximizing the margin is equivalent to minimizing ||w||. The points that achieve |w^Tx_i + b| = 1 are called support vectors — they "support" the margin boundaries.
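The canonical scaling step can be sketched in NumPy as follows. The function names, the toy data, and the particular separating hyperplane are assumptions chosen for illustration.

```python
import numpy as np

def geometric_margin(w, b, X):
    """min_i |w^T x_i + b| / ||w|| over the rows of X."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

def canonical_scale(w, b, X):
    """Rescale (w, b) so the closest point satisfies |w^T x_i + b| = 1."""
    s = np.min(np.abs(X @ w + b))
    return w / s, b / s

# Hypothetical separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), -2.0              # a separating hyperplane for this data

margin = geometric_margin(w, b, X)
w_c, b_c = canonical_scale(w, b, X)
margin_c = geometric_margin(w_c, b_c, X)       # unchanged: same hyperplane

# Under canonical scaling the margin is 1/||w|| and the support vectors
# are the points with |w^T x_i + b| = 1.
support = np.flatnonzero(np.isclose(np.abs(X @ w_c + b_c), 1.0))
print(margin, 1.0 / np.linalg.norm(w_c), support)
```

Rescaling leaves the geometric margin untouched; it only moves the information into ||w||, which is what lets the optimization work with ||w|| alone.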

Margin Visualization

The shaded "street" is the margin. Orange = class +1, Teal = class −1. Support vectors are circled. A slider rotates the boundary (shown at 90°) to show how the margin changes.

Why maximum margin? Statistical learning theory (Vapnik-Chervonenkis theory) shows that larger margins lead to better generalization bounds. Intuitively: a wide street means small perturbations to test points won't flip their classification.

Under canonical scaling (closest point has |w^Tx + b| = 1), the total margin width is:

2/||w||
||w||/2
1/||w||²

Chapter 3: Hard Margin SVM

We now have everything to write the hard margin SVM optimization problem. "Hard margin" means we demand perfect separation — every point must be on the correct side with margin at least 1/||w||.

Goal: maximize margin = 2/||w||

↓ equivalent to

Reformulation: minimize ||w||² (easier to optimize)

↓ subject to

Constraints: y_i(w^Tx_i + b) ≥ 1 for all i

The hard margin SVM primal problem:

min_{w,b} ½||w||²   s.t.   y_i(w^Tx_i + b) ≥ 1,   i = 1, …, n

Why ½||w||² instead of ||w||? Two reasons: (1) the square removes the square root, making the objective differentiable everywhere, and (2) the ½ is a convenience — it cancels with the 2 when we differentiate.

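This is a convex quadratic program (QP): quadratic objective, linear constraints. Convexity means any local minimum is a global minimum, and because the objective is strictly convex in w, the minimizer is unique whenever the data is linearly separable. The solution gives us: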

The optimal normal vector w* and bias b*
The decision boundary: w*^Tx + b* = 0
Prediction rule: sign(w*^Tx + b*)

What makes it "hard": The constraints y_i(w^Tx_i + b) ≥ 1 demand that EVERY training point is classified correctly with a gap of at least 1/||w||. If no hyperplane can satisfy all n constraints at once (the data is not linearly separable), the problem is infeasible and no solution exists.
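For intuition, the smallest possible instance can be solved by hand: with exactly one point per class, the hard margin solution is the perpendicular bisector of the segment joining the two points, scaled so both constraints are tight. The sketch below assumes that special case only; the function name `two_point_hard_margin` is illustrative, and a general dataset needs a QP solver.

```python
import numpy as np

def two_point_hard_margin(x_pos, x_neg):
    """Closed-form hard margin SVM for one point per class.

    Subtracting the two constraints gives w^T (x_pos - x_neg) >= 2, so the
    smallest feasible ||w|| points along x_pos - x_neg with length 2 / ||x_pos - x_neg||.
    """
    diff = x_pos - x_neg
    w = 2.0 * diff / (diff @ diff)
    b = -(w @ (x_pos + x_neg)) / 2.0          # boundary passes through the midpoint
    return w, b

x_pos, x_neg = np.array([2.0, 2.0]), np.array([0.0, 0.0])
w, b = two_point_hard_margin(x_pos, x_neg)

X = np.vstack([x_pos, x_neg])
y = np.array([1.0, -1.0])
functional_margins = y * (X @ w + b)          # both constraints hold with equality
total_margin = 2.0 / np.linalg.norm(w)        # width of the street
objective = 0.5 * (w @ w)
print(w, b, functional_margins, total_margin, objective)
```

Both points are support vectors here, and the street width 2/||w|| equals the distance between them.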

Hard Margin SVM Solution

The solid line is the optimal hyperplane. Dashed lines show the margin boundaries at w^Tx + b = ±1. Circled points are support vectors.

Observe: the solution depends ONLY on the support vectors. Moving any non-support-vector point (as long as it stays outside the margin) doesn't change the boundary at all. This sparsity is what makes SVMs powerful.

What happens if the data is NOT linearly separable and we use hard margin SVM?

The problem is infeasible (no solution exists)
It finds the best approximate boundary
It uses a nonlinear boundary instead

Chapter 4: Soft Margin & Slack Variables

Real data is messy. Classes overlap. Outliers exist. Hard margin SVM fails catastrophically in these cases — the optimization is simply infeasible. We need a way to allow some misclassifications while still preferring large margins.

The solution: introduce a slack variable s_i ≥ 0 for each training point. The slack "relaxes" the hard constraint:

y_i(w^Tx_i + b) ≥ 1 − s_i, s_i ≥ 0

What does s_i mean geometrically?

s_i = 0: Point is correctly classified, on or outside the margin. Happy.
0 < s_i < 1: Point is correctly classified but inside the margin. Slightly penalized.
s_i = 1: Point is exactly on the decision boundary.
s_i > 1: Point is misclassified. Heavily penalized.
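The four cases above can be reproduced with a small NumPy sketch. It assumes each slack takes its smallest feasible value, max(0, 1 − y_i(w^Tx_i + b)), and uses a hypothetical hyperplane and points; the names `slack` and `describe` are illustrative.

```python
import numpy as np

def slack(w, b, X, y):
    """Smallest slack satisfying both constraints: max(0, 1 - y_i (w^T x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def describe(s):
    """Map a slack value to its geometric meaning."""
    if s == 0:
        return "on or outside the margin"
    if s < 1:
        return "inside the margin, correctly classified"
    if s == 1:
        return "on the decision boundary"
    return "misclassified"

w, b = np.array([1.0, 0.0]), 0.0               # hypothetical boundary: x_1 = 0
X = np.array([[2.0, 0.0], [0.5, 1.0], [0.0, -1.0], [-0.5, 2.0]])
y = np.array([1.0, 1.0, 1.0, 1.0])             # all four points labeled +1

s = slack(w, b, X, y)
labels = [describe(v) for v in s]
for value, label in zip(s, labels):
    print(value, label)
```

The last point has slack 1.5: it sits 0.5 units (in functional margin) on the wrong side of the boundary.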

The soft margin SVM (C-SVM) trades off margin width against slack:

min_{w,b,s} ½||w||² + (C/n) ∑_{i=1}^n s_i

s.t. y_i(w^Tx_i + b) ≥ 1 − s_i, s_i ≥ 0, ∀i

The C parameter: C controls the penalty for violations. Large C = "I hate misclassifications" → small margin, few errors. Small C = "I'm OK with some errors" → wide margin, more errors. C is the fundamental hyperparameter of SVMs.

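Notice the structure: the objective is still convex (quadratic + linear), and the constraints are still linear. So soft margin SVM is also a convex QP: any local minimum is a global minimum, and because the objective is strictly convex in w, the optimal w is unique. Unlike the hard margin problem, it is always feasible, since the slacks can be made as large as needed.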

Effect of C on the Decision Boundary

A slider sets log₁₀(C) to show how C trades margin width against misclassification. Low C = wide margin, high C = tight fit. Points inside the margin have s_i > 0 (shown with red halos).

The soft margin formulation is equivalent to minimizing a regularized hinge loss, ℓ(y, f(x)) = max(0, 1 − y·f(x)) with f(x) = w^Tx + b. At the optimum each slack takes its smallest feasible value, s_i = max(0, 1 − y_i(w^Tx_i + b)), so the objective becomes ½||w||² + (C/n) ∑_i max(0, 1 − y_i f(x_i)). This connection to loss functions bridges SVMs with general regularized empirical risk minimization.
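Because the hinge-loss form is unconstrained, it can be minimized directly. The sketch below uses fixed-step subgradient descent, which is an assumed stand-in solver and not the QP approach of this lecture; the step size, iteration count, and toy data are likewise illustrative choices.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """0.5 ||w||^2 + (C/n) * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * hinge.mean()

def train_linear_svm(X, y, C=1.0, steps=2000, lr=0.01):
    """Fixed-step subgradient descent on the hinge-loss objective."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        active = y * (X @ w + b) < 1.0          # points with nonzero hinge loss
        grad_w = w - (C / n) * (y[active] @ X[active])
        grad_b = -(C / n) * y[active].sum()
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b

# Hypothetical toy data: two Gaussian blobs with a fixed seed.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(100, 2)),
               rng.normal(-2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.ones(100), -np.ones(100)])

w, b = train_linear_svm(X, y, C=1.0)
obj_start = svm_objective(np.zeros(2), 0.0, X, y, C=1.0)  # every hinge term is 1 at w = 0
obj_end = svm_objective(w, b, X, y, C=1.0)
accuracy = np.mean(np.sign(X @ w + b) == y)
print(obj_start, obj_end, accuracy)
```

Only the points with y_i f(x_i) < 1 contribute to the subgradient, which mirrors the observation that points well outside the margin do not influence the solution.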

A training point has slack s_i = 1.5. What does this mean geometrically?

It's correctly classified but inside the margin
It's exactly on the decision boundary
It's misclassified (on the wrong side of the boundary)

Chapter 5: Interactive SVM

Time to build intuition by playing with an SVM. Click to place points (left-click for class +1, right-click or hold Shift+click for class −1). The SVM computes the optimal boundary in real time.

Things to try: (1) Place two well-separated clusters — see the wide margin. (2) Move a point into the margin — watch it become a support vector. (3) Crank C down — see the margin widen as the SVM tolerates errors. (4) Make data non-separable — see slack variables activate.

Live SVM Playground

Orange = class +1, teal = class −1. Support vectors are circled. The shaded band is the margin. A slider sets log₁₀(C).

Notice what happens with the XOR preset: four points arranged in an X pattern. No linear boundary can separate them. The hard margin SVM is infeasible; the soft margin SVM does its best but misclassifies. This is the fundamental limitation of linear classifiers — and the motivation for kernels (Lecture 19).

Key observations: (1) Only support vectors determine the boundary — move other points freely. (2) As C → ∞, soft margin approaches hard margin. (3) As C → 0, the margin grows but errors increase. (4) The number of support vectors indicates model complexity.

Chapter 6: Multi-class Strategies

SVMs are inherently binary classifiers — they find a hyperplane between two classes. But real problems often have K > 2 classes. How do we extend?

One-vs-All (OvA): Train K binary SVMs, each separating class k from all others. For a new point x, predict the class whose SVM gives the highest confidence: argmax_k (w_k^Tx + b_k).

One-vs-One (OvO): Train K(K−1)/2 binary SVMs, one for each pair of classes. For a new point, each SVM "votes" for one class; predict the class with the most votes.

Strategy	SVMs trained	Pros	Cons
One-vs-All	K	Fewer models, fast at test time	Imbalanced training sets
One-vs-One	K(K−1)/2	Each model sees balanced data	Many models for large K
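Both prediction rules are short enough to sketch in NumPy. The binary classifiers below are not trained SVMs: as a labeled assumption, each pairwise boundary is the perpendicular bisector between two hypothetical class centers, and the centers themselves stand in for the OvA weight vectors w_k. Ties in the vote are broken toward the lowest class index, which is an arbitrary choice.

```python
import numpy as np
from itertools import combinations

def ova_predict(W, b, x):
    """One-vs-All: row k of W is w_k; predict argmax_k (w_k^T x + b_k)."""
    return int(np.argmax(W @ x + b))

def ovo_predict(pairwise, x, K):
    """One-vs-One: pairwise maps (i, j), i < j, to (w, b); a positive score votes for i."""
    votes = np.zeros(K, dtype=int)
    for (i, j), (w, b) in pairwise.items():
        votes[i if w @ x + b > 0 else j] += 1
    return int(np.argmax(votes))                # ties go to the lowest index

# Hypothetical 3-class setup with stand-in (untrained) linear classifiers.
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
K = len(centers)
pairwise = {}
for i, j in combinations(range(K), 2):
    w = centers[i] - centers[j]
    b = -(centers[i] @ centers[i] - centers[j] @ centers[j]) / 2.0
    pairwise[(i, j)] = (w, b)

x = np.array([2.0, 0.5])
ova_class = ova_predict(centers, np.zeros(K), x)
ovo_class = ovo_predict(pairwise, x, K)
n_models = len(list(combinations(range(10), 2)))  # pairs needed for K = 10
print(ova_class, ovo_class, len(pairwise), n_models)
```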

There's also the multi-class Fisher approach (covered earlier in EE269): the generalized eigenvalue problem. Given K classes with means μ_k and shared within-class scatter S_W, we solve:

S_B w = λ S_W w

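where S_B = ∑_k n_k(μ_k − μ)(μ_k − μ)^T is the between-class scatter, n_k is the number of points in class k, and μ is the overall mean. Since S_B has rank at most K−1, at most K−1 eigenvalues are nonzero, and the corresponding eigenvectors give the discriminant directions. But this is a dimensionality reduction method, not a classifier: you still need a classification rule in the projected space.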
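A minimal NumPy sketch of the generalized eigenvalue problem, assuming S_W is invertible so that S_B w = λ S_W w can be rewritten as (S_W⁻¹ S_B) w = λ w. The function names and the three-class toy data are illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class scatter S_W and between-class scatter S_B."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)
        S_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
    return S_W, S_B

def discriminant_directions(S_W, S_B, K):
    """Top K-1 solutions of S_B w = lambda S_W w (assumes S_W is invertible)."""
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(evals.real)[::-1][: K - 1]   # largest eigenvalues first
    return evals.real[order], evecs.real[:, order]

# Hypothetical toy data: three Gaussian classes in 3-D.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in means])
y = np.repeat(np.arange(3), 50)

S_W, S_B = scatter_matrices(X, y)
lams, directions = discriminant_directions(S_W, S_B, K=3)
Z = X @ directions                                   # data projected to K-1 = 2 dimensions
print(lams, Z.shape)
```

The projection Z is only a lower-dimensional representation; a classifier (for example, nearest projected class mean) still has to be applied on top of it.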

Multi-class Decision Regions

Three classes separated by OvA SVMs. Each region shows which class "wins." The boundaries are piecewise linear.

For a 10-class problem, how many binary SVMs does One-vs-One require?

10
45 (since 10·9/2 = 45)
100

Chapter 7: Mastery

Let's consolidate what we've built, from Fisher's global projection to the SVM's local, margin-focused philosophy.

Concept	Formula	Intuition
Hyperplane	H = {x : w^Tx + b = 0}	A flat decision surface
Distance to H	|w^Tz + b| / ||w||	Perpendicular gap
Total margin (canonical)	2 / ||w||	Width of the "street"
Hard margin	min ½||w||² s.t. y_i(w^Tx_i+b) ≥ 1	Widest street, no errors allowed
Soft margin	min ½||w||² + (C/n)∑_i s_i s.t. y_i(w^Tx_i+b) ≥ 1−s_i, s_i ≥ 0	Allow errors, penalize them
Support vectors	Points with y_i(w^Tx_i+b) = 1; in soft margin, also points with s_i > 0	The few critical points

The big picture: SVMs find the simplest boundary (maximum margin) that fits the data (up to tolerance C). This embodies Occam's razor in geometric form. Next lecture: we'll see the dual formulation, which reveals that the solution depends only on inner products x_i^Tx_j — opening the door to kernels.

What's next:

Lecture 18: Convex Duality & Dual SVM — Lagrangian, KKT, and the dual problem
Lecture 19: Kernels — Nonlinear boundaries via the kernel trick

True or False: Moving a training point that is NOT a support vector (and stays outside the margin) changes the SVM solution.

True: all points affect the solution
False: only support vectors determine the boundary