The Lagrangian, KKT conditions, and the dual formulation that reveals SVMs depend only on inner products.
We ended Lecture 17 with the soft margin SVM primal:
This works. We can solve it directly as a quadratic program. So why bother with duality?
Three reasons that each alone would justify the effort:
Vapnik himself stated the principle: "Don't solve a more general problem as an intermediate step." The dual strips away the explicit weight vector and works directly with the data points that matter.
Drag the sliders to see when the dual has fewer variables. The crossover happens when d > n.
The Lagrangian is a single function that packages the objective and all constraints into one expression. For any constrained optimization problem:
the Lagrangian is:
Each Lagrange multiplier αi ≥ 0 is paired with constraint gi(x) ≤ 0. Think of αi as the "price" of violating constraint i. If a constraint is satisfied easily (gi < 0), the price drops to zero. If a constraint is tight (gi = 0), the multiplier can be positive — it tells you how much the objective would improve if you relaxed that constraint slightly.
For the soft margin SVM, we rewrite the constraints as ≤ 0:
The SVM Lagrangian is:
where αi ≥ 0 and βi ≥ 0 are the dual variables. This single expression encodes everything: the objective, the margin constraints, and the non-negativity of slacks.
Problem: min x² s.t. x ≥ 1. Lagrangian: L = x² − α(x − 1). Drag α and see the saddle point.
The optimal α* = 2 (the derivative of L w.r.t. x at x*=1 is 2x − α = 0, so α* = 2). At this point, L(x*, α*) = 1 — matching the constrained optimum.
The Karush-Kuhn-Tucker (KKT) conditions are necessary (and for convex problems, sufficient) conditions for optimality. They come from setting the gradient of the Lagrangian to zero and adding complementarity.
For our SVM Lagrangian, we take partial derivatives and set them to zero:
The optimal weight vector is a linear combination of training points, weighted by αiyi. Points with αi = 0 don't contribute at all.
The α-weighted labels must balance. This is a constraint that appears in the dual.
Since βi ≥ 0, this gives 0 ≤ αi ≤ C/n. The box constraint on the dual variables!
Plus the complementary slackness conditions (Chapter 6):
These five sets of conditions (stationarity ×3, primal feasibility, dual feasibility, complementary slackness) completely characterize the optimal solution.
Each row shows one KKT condition and what it implies. Toggle to highlight which conditions determine which dual properties.
Now we substitute the KKT stationarity conditions back into the Lagrangian to eliminate w, b, and s. This is the mechanical heart of duality — and it's surprisingly clean.
Step 1: Start with the Lagrangian:
Step 2: Substitute w = ∑αjyjxj into ||w||²:
Step 3: Substitute w = ∑αjyjxj into the wTxi terms:
Step 4: Use ∑αiyi = 0 to kill the b terms, and αi + βi = C/n to kill the si terms:
Step 5: Collect terms. The ½||w||² term contributes +½∑∑αiαjyiyjxiTxj. The −∑αiyiwTxi term contributes −∑∑αiαjyiyjxiTxj. The +∑αi term survives.
The dual SVM problem is:
Notice: the features xi only appear as inner products xiTxj. This is the critical observation that enables the kernel trick.
Click through each step to see terms appear and cancel in the derivation.
The dual formulation makes the role of support vectors crystal clear. Each training point gets a dual variable αi. At the optimum:
Prediction for a new point x uses only the support vectors:
Each point's size reflects its αi. Yellow circles = support vectors (αi > 0). Bar chart on right shows α values. Adjust C to see how support vectors change.
Watch carefully as you change C:
In general, for any optimization problem, the dual provides a lower bound on the primal:
The gap p* − d* ≥ 0 is the duality gap. This inequality always holds — it's called weak duality.
The remarkable fact for SVMs: because the primal is a convex problem with linear constraints (satisfying Slater's condition), we get strong duality:
Slater's condition (a sufficient condition for strong duality): for a convex problem, if there exists a strictly feasible point (all inequality constraints are strict), then strong duality holds. For the SVM, this is easily satisfied — just pick any w with a large enough margin.
The primal value (minimizing) and dual value (maximizing) converge to the same optimum. Drag α for a simple example: min x² s.t. x ≥ 1.
In the visualization: the primal value is p* = 1 (at x* = 1). As you increase α, the dual value g(α) = minx(x² − α(x−1)) increases until α* = 2 where g(2) = 1 = p*. Strong duality!
Complementary slackness is the most revealing KKT condition. For each constraint-multiplier pair:
This says: at least one of the two factors must be zero. Either the multiplier αi = 0 (the constraint doesn't matter), or the constraint is tight (the bracket = 0). They "complement" each other — like a see-saw, one must be down.
Combined with βi si = 0 and αi + βi = C/n, we get three cases for each training point:
| αi | si | Margin | Meaning |
|---|---|---|---|
| 0 | 0 | yif(xi) > 1 | Correct, outside margin — non-support vector |
| 0 < αi < C/n | 0 | yif(xi) = 1 | On margin boundary — free support vector |
| C/n | > 0 | yif(xi) < 1 | Inside margin or misclassified — bounded support vector |
Each point is colored by its KKT status. Gray = non-SV (α=0). Green = free SV (on margin). Red = bounded SV (inside/misclassified).
We've traveled from the primal SVM through Lagrangian mechanics to the dual. Here's the complete picture:
| Concept | Key Result |
|---|---|
| Lagrangian | Packages objective + constraints into L(w,b,s,α,β) |
| ∂L/∂w = 0 | w = ∑αiyixi |
| ∂L/∂b = 0 | ∑αiyi = 0 |
| ∂L/∂si = 0 | 0 ≤ αi ≤ C/n |
| Dual SVM | max ∑αi − ½∑∑αiαjyiyjxiTxj |
| Strong duality | Primal optimum = dual optimum (convexity) |
| Comp. slackness | αi[yif(xi) − 1 + si] = 0 |
What's next: