EE269 Lecture 18 — Mert Pilanci, Stanford

Convex Duality & Dual SVM

The Lagrangian, KKT conditions, and the dual formulation that reveals SVMs depend only on inner products.

Prerequisites: SVM primal (Lecture 17) + Multivariable calculus. That's it.
8
Chapters
4+
Simulations
0
Assumed Knowledge

Chapter 0: Why Duality?

We ended Lecture 17 with the soft margin SVM primal:

minw,b,s ½||w||² + (C/n)∑si    s.t.   yi(wTxi + b) ≥ 1 − si,   si ≥ 0

This works. We can solve it directly as a quadratic program. So why bother with duality?

Three reasons that each alone would justify the effort:

Reason 1 — Reveals structure: The dual shows that the SVM solution depends only on inner products xiTxj between training points. This is the doorway to the kernel trick (Lecture 19).
Reason 2 — Sparsity: The dual variables αi are zero for non-support vectors. Only a few αi > 0, making prediction efficient: f(x) = ∑i: αi>0 αiyixiTx + b.
Reason 3 — Computational: The primal has d + 1 + n variables (w ∈ ℝd, b, slacks). The dual has n variables (αi). When d ≫ n (more features than samples, common in genomics/NLP), the dual is cheaper.

Vapnik himself stated the principle: "Don't solve a more general problem as an intermediate step." The dual strips away the explicit weight vector and works directly with the data points that matter.

Primal vs Dual: Variable Count

Drag the sliders to see when the dual has fewer variables. The crossover happens when d > n.

Features d100
Samples n50
The SVM dual reveals that the solution depends only on:

Chapter 1: The Lagrangian

The Lagrangian is a single function that packages the objective and all constraints into one expression. For any constrained optimization problem:

min f(x)   s.t.   gi(x) ≤ 0,   i = 1,…,m

the Lagrangian is:

L(x, α) = f(x) + ∑i=1m αi gi(x),    αi ≥ 0

Each Lagrange multiplier αi ≥ 0 is paired with constraint gi(x) ≤ 0. Think of αi as the "price" of violating constraint i. If a constraint is satisfied easily (gi < 0), the price drops to zero. If a constraint is tight (gi = 0), the multiplier can be positive — it tells you how much the objective would improve if you relaxed that constraint slightly.

For the soft margin SVM, we rewrite the constraints as ≤ 0:

Constraint 1
1 − si − yi(wTxi + b) ≤ 0    (multiplier αi)
Constraint 2
−si ≤ 0    (multiplier βi)

The SVM Lagrangian is:

L(w, b, s, α, β) = ½||w||² + (C/n)∑si − ∑αi[yi(wTxi + b) − 1 + si] − ∑βisi

where αi ≥ 0 and βi ≥ 0 are the dual variables. This single expression encodes everything: the objective, the margin constraints, and the non-negativity of slacks.

The Lagrangian trick: At the optimal solution, maximizing over α,β while minimizing over w,b,s recovers the original constrained problem. Why? If any constraint is violated (gi > 0), the adversary can send αi → ∞, making L = ∞. So the minimizer is forced to satisfy all constraints.
Lagrangian for a Simple 1D Problem

Problem: min x² s.t. x ≥ 1. Lagrangian: L = x² − α(x − 1). Drag α and see the saddle point.

α2.0

The optimal α* = 2 (the derivative of L w.r.t. x at x*=1 is 2x − α = 0, so α* = 2). At this point, L(x*, α*) = 1 — matching the constrained optimum.

In the Lagrangian L = f(x) + ∑αigi(x), what happens if constraint gk is violated (gk > 0)?

Chapter 2: KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions are necessary (and for convex problems, sufficient) conditions for optimality. They come from setting the gradient of the Lagrangian to zero and adding complementarity.

For our SVM Lagrangian, we take partial derivatives and set them to zero:

KKT Condition 1 — Stationarity w.r.t. w:
∂L/∂w = w − ∑αiyixi = 0   ⇒   w = ∑i=1n αiyixi

The optimal weight vector is a linear combination of training points, weighted by αiyi. Points with αi = 0 don't contribute at all.

KKT Condition 2 — Stationarity w.r.t. b:
∂L/∂b = −∑αiyi = 0   ⇒   ∑i=1n αiyi = 0

The α-weighted labels must balance. This is a constraint that appears in the dual.

KKT Condition 3 — Stationarity w.r.t. si:
∂L/∂si = C/n − αi − βi = 0   ⇒   αi + βi = C/n

Since βi ≥ 0, this gives 0 ≤ αi ≤ C/n. The box constraint on the dual variables!

Plus the complementary slackness conditions (Chapter 6):

These five sets of conditions (stationarity ×3, primal feasibility, dual feasibility, complementary slackness) completely characterize the optimal solution.

KKT Conditions at a Glance

Each row shows one KKT condition and what it implies. Toggle to highlight which conditions determine which dual properties.

From ∂L/∂w = 0, the optimal w is expressed as:

Chapter 3: Primal → Dual (Full Derivation)

Now we substitute the KKT stationarity conditions back into the Lagrangian to eliminate w, b, and s. This is the mechanical heart of duality — and it's surprisingly clean.

Step 1: Start with the Lagrangian:

L = ½||w||² + (C/n)∑si − ∑αi[yi(wTxi + b) − 1 + si] − ∑βisi

Step 2: Substitute w = ∑αjyjxj into ||w||²:

||w||² = wTw = (∑iαiyixi)T(∑jαjyjxj) = ∑ij αiαjyiyjxiTxj

Step 3: Substitute w = ∑αjyjxj into the wTxi terms:

wTxi = ∑j αjyjxjTxi

Step 4: Use ∑αiyi = 0 to kill the b terms, and αi + βi = C/n to kill the si terms:

(C/n)∑si − ∑αisi − ∑βisi = ∑si(C/n − αi − βi) = 0

Step 5: Collect terms. The ½||w||² term contributes +½∑∑αiαjyiyjxiTxj. The −∑αiyiwTxi term contributes −∑∑αiαjyiyjxiTxj. The +∑αi term survives.

The dual objective (after all cancellations):
g(α) = ∑i=1n αi − ½∑i=1nj=1n αiαjyiyjxiTxj

The dual SVM problem is:

maxα ∑αi − ½∑∑αiαjyiyjxiTxj
s.t.   ∑αiyi = 0,   0 ≤ αi ≤ C/n,   ∀i

Notice: the features xi only appear as inner products xiTxj. This is the critical observation that enables the kernel trick.

Derivation Step-by-Step

Click through each step to see terms appear and cancel in the derivation.

Step 1/5
In the dual SVM objective, data points appear only as:

Chapter 4: Dual SVM — Support Vectors Revealed

The dual formulation makes the role of support vectors crystal clear. Each training point gets a dual variable αi. At the optimum:

Prediction for a new point x uses only the support vectors:

f(x) = ∑i: αi>0 αiyixiTx + b
Sparsity in action: Out of n training points, typically only a small fraction become support vectors (αi > 0). The rest could be deleted without changing the solution. This is why SVMs generalize well — the model complexity is controlled by the number of support vectors, not the number of features.
Dual Variables αi Visualization

Each point's size reflects its αi. Yellow circles = support vectors (αi > 0). Bar chart on right shows α values. Adjust C to see how support vectors change.

log10(C)10

Watch carefully as you change C:

Computing b: For any support vector with 0 < αi < C/n (on the margin, not penalized), we know yi(wTxi + b) = 1. So b = yi − wTxi = yi − ∑jαjyjxjTxi. In practice, average over all such support vectors for numerical stability.

Chapter 5: Strong Duality

In general, for any optimization problem, the dual provides a lower bound on the primal:

d* = maxα≥0 minx L(x, α) ≤ minx maxα≥0 L(x, α) = p*

The gap p* − d* ≥ 0 is the duality gap. This inequality always holds — it's called weak duality.

The remarkable fact for SVMs: because the primal is a convex problem with linear constraints (satisfying Slater's condition), we get strong duality:

p* = d*    (zero duality gap)
What strong duality means: The dual optimum exactly equals the primal optimum. There's no approximation. Solving the dual gives the same answer as solving the primal. This is a gift of convexity.

Slater's condition (a sufficient condition for strong duality): for a convex problem, if there exists a strictly feasible point (all inequality constraints are strict), then strong duality holds. For the SVM, this is easily satisfied — just pick any w with a large enough margin.

Duality Gap

The primal value (minimizing) and dual value (maximizing) converge to the same optimum. Drag α for a simple example: min x² s.t. x ≥ 1.

α0.0

In the visualization: the primal value is p* = 1 (at x* = 1). As you increase α, the dual value g(α) = minx(x² − α(x−1)) increases until α* = 2 where g(2) = 1 = p*. Strong duality!

Strong duality (p* = d*) holds for the SVM because:

Chapter 6: Complementary Slackness

Complementary slackness is the most revealing KKT condition. For each constraint-multiplier pair:

αi · [yi(wTxi + b) − 1 + si] = 0

This says: at least one of the two factors must be zero. Either the multiplier αi = 0 (the constraint doesn't matter), or the constraint is tight (the bracket = 0). They "complement" each other — like a see-saw, one must be down.

Combined with βi si = 0 and αi + βi = C/n, we get three cases for each training point:

αisiMarginMeaning
00yif(xi) > 1Correct, outside margin — non-support vector
0 < αi < C/n0yif(xi) = 1On margin boundary — free support vector
C/n> 0yif(xi) < 1Inside margin or misclassified — bounded support vector
Why this matters for prediction: Only support vectors (αi > 0) contribute to the decision function. Free support vectors (0 < αi < C/n) are used to compute b. Bounded support vectors (αi = C/n) are points the SVM "gave up on" — they're too close or misclassified, and C caps how much it cares.
Complementary Slackness Anatomy

Each point is colored by its KKT status. Gray = non-SV (α=0). Green = free SV (on margin). Red = bounded SV (inside/misclassified).

log10(C)10
A point has αi = C/n and si = 0.3. What kind of support vector is it?

Chapter 7: Mastery

We've traveled from the primal SVM through Lagrangian mechanics to the dual. Here's the complete picture:

ConceptKey Result
LagrangianPackages objective + constraints into L(w,b,s,α,β)
∂L/∂w = 0w = ∑αiyixi
∂L/∂b = 0∑αiyi = 0
∂L/∂si = 00 ≤ αi ≤ C/n
Dual SVMmax ∑αi − ½∑∑αiαjyiyjxiTxj
Strong dualityPrimal optimum = dual optimum (convexity)
Comp. slacknessαi[yif(xi) − 1 + si] = 0
The punchline: The dual SVM depends on data ONLY through inner products xiTxj. If we replace these inner products with a kernel function K(xi, xj) = φ(xi)Tφ(xj), we get nonlinear decision boundaries without ever computing φ explicitly. That's Lecture 19.

What's next:

The dual SVM reveals that w can be written as w = ∑αiyixi. If 100 training points are used but only 5 have αi > 0, how many points determine the decision boundary?