EE269 Lecture 18 — Convex Duality & Dual SVM

Chapter 0: Why Duality?

We ended Lecture 17 with the soft margin SVM primal:

min_w,b,s ½||w||² + (C/n)∑s_i s.t. y_i(w^Tx_i + b) ≥ 1 − s_i, s_i ≥ 0

This works. We can solve it directly as a quadratic program. So why bother with duality?

Three reasons that each alone would justify the effort:

Reason 1 — Reveals structure: The dual shows that the SVM solution depends only on inner products x_i^Tx_j between training points. This is the doorway to the kernel trick (Lecture 19).

Reason 2 — Sparsity: The dual variables α_i are zero for non-support vectors. Only a few α_i > 0, making prediction efficient: f(x) = ∑_{i: α_i>0} α_iy_ix_i^Tx + b.

Reason 3 — Computational: The primal has d + 1 + n variables (w ∈ ℝ^d, b, slacks). The dual has n variables (α_i). When d ≫ n (more features than samples, common in genomics/NLP), the dual is cheaper.

Vapnik himself stated the principle: "Don't solve a more general problem as an intermediate step." The dual strips away the explicit weight vector and works directly with the data points that matter.

Primal vs Dual: Variable Count

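Drag the sliders to compare variable counts (shown here at d = 100 features and n = 50 samples). Counting the slacks, the primal always has more variables (d + 1 + n versus n). If the slacks are eliminated by writing s_i = max(0, 1 − y_i(w^Tx_i + b)), the primal has d + 1 variables, and the dual becomes the smaller problem once d + 1 > n.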
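The counts in this demo are plain arithmetic. A minimal sketch follows; the function name `variable_counts` is illustrative and not part of the lecture, and the "no slacks" count assumes the slacks have been folded into the hinge loss.

```python
def variable_counts(d, n):
    """Return (primal with slacks, primal without slacks, dual) variable counts."""
    primal_with_slacks = d + 1 + n   # w in R^d, bias b, one slack per sample
    primal_no_slacks = d + 1         # assumes slacks are folded into the hinge loss
    dual = n                         # one alpha_i per sample
    return primal_with_slacks, primal_no_slacks, dual

# A few (d, n) settings: the dual is smaller than the slack-free primal when d + 1 > n.
for d, n in [(100, 50), (10, 1000), (20000, 200)]:
    print(d, n, variable_counts(d, n))
```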

The SVM dual reveals that the solution depends only on:

The weight vector w directly
Inner products x_i^Tx_j between training points
The number of features d

Chapter 1: The Lagrangian

The Lagrangian is a single function that packages the objective and all constraints into one expression. For any constrained optimization problem:

min f(x) s.t. g_i(x) ≤ 0, i = 1,…,m

the Lagrangian is:

L(x, α) = f(x) + ∑_i=1^m α_i g_i(x), α_i ≥ 0

Each Lagrange multiplier α_i ≥ 0 is paired with constraint g_i(x) ≤ 0. Think of α_i as the "price" of violating constraint i. If a constraint is satisfied easily (g_i < 0), the price drops to zero. If a constraint is tight (g_i = 0), the multiplier can be positive — it tells you how much the objective would improve if you relaxed that constraint slightly.

For the soft margin SVM, we rewrite the constraints as ≤ 0:

Constraint 1

1 − s_i − y_i(w^Tx_i + b) ≤ 0 (multiplier α_i)

↓

Constraint 2

−s_i ≤ 0 (multiplier β_i)

The SVM Lagrangian is:

L(w, b, s, α, β) = ½||w||² + (C/n)∑s_i − ∑α_i[y_i(w^Tx_i + b) − 1 + s_i] − ∑β_is_i

where α_i ≥ 0 and β_i ≥ 0 are the dual variables. This single expression encodes everything: the objective, the margin constraints, and the non-negativity of slacks.
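To make the expression concrete, here is a small sketch that evaluates the SVM Lagrangian term by term. The helper names (`dot`, `svm_lagrangian`) and the toy data are assumptions made for illustration only.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_lagrangian(w, b, s, alpha, beta, X, y, C):
    """Evaluate L(w, b, s, alpha, beta) for the soft margin SVM."""
    n = len(X)
    objective = 0.5 * dot(w, w) + (C / n) * sum(s)
    # Margin constraints: alpha_i * [y_i (w^T x_i + b) - 1 + s_i]
    margin_terms = sum(
        a * (yi * (dot(w, xi) + b) - 1.0 + si)
        for a, xi, yi, si in zip(alpha, X, y, s)
    )
    # Slack non-negativity: beta_i * s_i
    slack_terms = sum(bi * si for bi, si in zip(beta, s))
    return objective - margin_terms - slack_terms

# Hypothetical toy data and a feasible (not necessarily optimal) primal point.
X = [(2.0, 2.0), (-1.0, -1.0), (0.5, 0.0)]
y = [1, -1, -1]
w, b, s, C = (0.5, 0.5), 0.0, [0.0, 0.0, 1.25], 3.0

primal_value = 0.5 * dot(w, w) + (C / len(X)) * sum(s)
zero = [0.0] * len(X)
# With all multipliers at zero the Lagrangian is just the primal objective.
print(svm_lagrangian(w, b, s, zero, zero, X, y, C), primal_value)
# At a feasible point, non-negative multipliers can only lower the value.
print(svm_lagrangian(w, b, s, [0.2, 0.1, 0.3], [0.1, 0.1, 0.1], X, y, C))
```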

The Lagrangian trick: Minimizing over w,b,s the result of maximizing L over α,β ≥ 0 recovers the original constrained problem. Why? If any constraint is violated (g_i > 0), the inner maximizer can send α_i → ∞, making L → ∞, so the outer minimizer is forced to satisfy all constraints. When every constraint holds, the best the maximizer can do is make each α_i g_i equal to zero, which leaves just the original objective.
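The blow-up is easy to see numerically. The sketch below assumes a toy problem, min x² s.t. x ≥ 1 (so g(x) = 1 − x), and caps α at a finite value to stand in for α → ∞. The names `lagrangian` and `inner_max` are illustrative.

```python
def lagrangian(x, alpha):
    """L(x, alpha) = f(x) + alpha * g(x) with f(x) = x^2 and g(x) = 1 - x."""
    return x**2 + alpha * (1.0 - x)

def inner_max(x, alpha_cap):
    """Maximize L over 0 <= alpha <= alpha_cap; L is linear in alpha, so an endpoint wins."""
    return max(lagrangian(x, 0.0), lagrangian(x, alpha_cap))

for cap in (1.0, 1e3, 1e6):
    # x = 0.5 violates x >= 1, so the value keeps growing as the cap grows.
    # x = 1.5 is feasible, so the maximizer picks alpha = 0 and returns f(x).
    print(cap, inner_max(0.5, cap), inner_max(1.5, cap))
```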

Lagrangian for a Simple 1D Problem

Problem: min x² s.t. x ≥ 1. Lagrangian: L = x² − α(x − 1). Drag α and see the saddle point.


The optimal α* = 2 (the derivative of L w.r.t. x at x*=1 is 2x − α = 0, so α* = 2). At this point, L(x*, α*) = 1 — matching the constrained optimum.

In the Lagrangian L = f(x) + ∑α_ig_i(x), what happens if constraint g_k is violated (g_k > 0)?

α_k is set to zero
The Lagrangian decreases
The maximizer sends α_k → ∞, making L → ∞

Chapter 2: KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions are necessary (and for convex problems, sufficient) conditions for optimality. They combine stationarity (the gradient of the Lagrangian with respect to the primal variables is zero) with primal feasibility, dual feasibility, and complementary slackness.

For our SVM Lagrangian, we take partial derivatives and set them to zero:

KKT Condition 1 — Stationarity w.r.t. w:

∂L/∂w = w − ∑α_iy_ix_i = 0 ⇒ w = ∑_i=1ⁿ α_iy_ix_i

The optimal weight vector is a linear combination of training points, weighted by α_iy_i. Points with α_i = 0 don't contribute at all.

KKT Condition 2 — Stationarity w.r.t. b:

∂L/∂b = −∑α_iy_i = 0 ⇒ ∑_i=1ⁿ α_iy_i = 0

The α-weighted labels must balance. This is a constraint that appears in the dual.

KKT Condition 3 — Stationarity w.r.t. s_i:

∂L/∂s_i = C/n − α_i − β_i = 0 ⇒ α_i + β_i = C/n

Since β_i ≥ 0, this gives 0 ≤ α_i ≤ C/n. The box constraint on the dual variables!

Plus the complementary slackness conditions (Chapter 6):

α_i[y_i(w^Tx_i + b) − 1 + s_i] = 0 for all i
β_i s_i = 0 for all i

These four groups of conditions (stationarity in w, b, and s; primal feasibility; dual feasibility; complementary slackness) completely characterize the optimal solution.
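One way to internalize the conditions is to check them on a candidate solution. The sketch below reports the largest violation in each group. The function name `kkt_residuals` and the two-point dataset are assumptions for illustration; the candidate values were worked out by hand for that dataset.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kkt_residuals(w, b, s, alpha, beta, X, y, C):
    """Return the largest violation in each KKT group (0 means satisfied)."""
    n, d = len(X), len(w)
    # Margin constraint values y_i (w^T x_i + b) - 1 + s_i, required to be >= 0.
    margins = [yi * (dot(w, xi) + b) - 1.0 + si for xi, yi, si in zip(X, y, s)]
    # Stationarity: w = sum alpha_i y_i x_i, sum alpha_i y_i = 0, alpha_i + beta_i = C/n.
    grad_w = [w[k] - sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(d)]
    grad_b = sum(a * yi for a, yi in zip(alpha, y))
    grad_s = [C / n - a - be for a, be in zip(alpha, beta)]
    products = [a * m for a, m in zip(alpha, margins)] + [be * si for be, si in zip(beta, s)]
    return {
        "stationarity": max(abs(g) for g in grad_w + [grad_b] + grad_s),
        "primal_feasibility": max(max(0.0, -m) for m in margins + list(s)),
        "dual_feasibility": max(max(0.0, -v) for v in list(alpha) + list(beta)),
        "complementary_slackness": max(abs(p) for p in products),
    }

# Hypothetical two-point dataset with C/n = 1.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
C = 2.0
res = kkt_residuals((0.5, 0.5), 0.0, [0.0, 0.0], [0.25, 0.25], [0.75, 0.75], X, y, C)
print(res)
```

Because the problem is convex, a point with all four residuals at zero is optimal. Perturbing w (for example to (1, 1)) makes the stationarity residual nonzero.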

KKT Conditions at a Glance

Each row shows one KKT condition and what it implies. Toggle to highlight which conditions determine which dual properties.

From ∂L/∂w = 0, the optimal w is expressed as:

w = ∑α_iy_ix_i (linear combination of data points)
w = ∑x_i/n (the mean of all data)
w = X^TX (the covariance)

Chapter 3: Primal → Dual (Full Derivation)

Now we substitute the KKT stationarity conditions back into the Lagrangian to eliminate w, b, and s. This is the mechanical heart of duality — and it's surprisingly clean.

Step 1: Start with the Lagrangian:

L = ½||w||² + (C/n)∑s_i − ∑α_i[y_i(w^Tx_i + b) − 1 + s_i] − ∑β_is_i

Step 2: Substitute w = ∑α_jy_jx_j into ||w||²:

||w||² = w^Tw = (∑_iα_iy_ix_i)^T(∑_jα_jy_jx_j) = ∑_i∑_j α_iα_jy_iy_jx_i^Tx_j

Step 3: Substitute w = ∑α_jy_jx_j into the w^Tx_i terms:

w^Tx_i = ∑_j α_jy_jx_j^Tx_i

Step 4: Use ∑α_iy_i = 0 to kill the b terms, and α_i + β_i = C/n to kill the s_i terms:

(C/n)∑s_i − ∑α_is_i − ∑β_is_i = ∑s_i(C/n − α_i − β_i) = 0

Step 5: Collect terms. The ½||w||² term contributes +½∑∑α_iα_jy_iy_jx_i^Tx_j. The −∑α_iy_iw^Tx_i term contributes −∑∑α_iα_jy_iy_jx_i^Tx_j. Together these leave −½∑∑α_iα_jy_iy_jx_i^Tx_j. The +∑α_i term survives.

The dual objective (after all cancellations):

g(α) = ∑_i=1ⁿ α_i − ½∑_i=1ⁿ∑_j=1ⁿ α_iα_jy_iy_jx_i^Tx_j
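The cancellations can be sanity-checked numerically: once w is set by stationarity and α satisfies ∑α_iy_i = 0 with β_i = C/n − α_i, the Lagrangian should equal g(α) for any b and s. The sketch below uses made-up toy data, and the function names are illustrative.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical toy data; alpha satisfies sum(alpha_i * y_i) = 0 and 0 <= alpha_i <= C/n.
X = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.5), (0.0, -2.0)]
y = [1, 1, -1, -1]
alpha = [0.1, 0.3, 0.2, 0.2]
C, n = 2.0, len(X)
beta = [C / n - a for a in alpha]            # from alpha_i + beta_i = C/n

# Stationarity in w: w = sum_i alpha_i y_i x_i
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

def lagrangian(b, s):
    margin = sum(a * (yi * (dot(w, xi) + b) - 1.0 + si)
                 for a, xi, yi, si in zip(alpha, X, y, s))
    return (0.5 * dot(w, w) + (C / n) * sum(s) - margin
            - sum(be * si for be, si in zip(beta, s)))

def dual_objective():
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

rng = random.Random(0)
for _ in range(3):
    b = rng.uniform(-5, 5)
    s = [rng.uniform(0, 3) for _ in range(n)]
    # b and s drop out, so every draw reproduces the same dual value.
    print(round(lagrangian(b, s), 10), round(dual_objective(), 10))
```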

The dual SVM problem is:

max_α ∑α_i − ½∑∑α_iα_jy_iy_jx_i^Tx_j

s.t. ∑α_iy_i = 0, 0 ≤ α_i ≤ C/n, ∀i

Notice: the features x_i only appear as inner products x_i^Tx_j. This is the critical observation that enables the kernel trick.
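To see the dual solved end to end, here is a deliberately bare-bones sketch that repeatedly optimizes two α's at a time, in the spirit of SMO. SMO is not covered in this lecture; it is swapped in only to have something runnable without a QP library. There is no convergence test and no working-set heuristic, and the function name and toy data are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def solve_dual_pairwise(X, y, C, sweeps=50):
    """Maximize the dual by exact two-variable updates (a bare-bones SMO-style loop)."""
    n = len(X)
    cap = C / n                                   # box constraint 0 <= alpha_i <= C/n
    K = [[dot(xi, xj) for xj in X] for xi in X]   # the data enter only through inner products
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if eta <= 1e-12:
                    continue
                # f(x_k) - y_k without the bias; b cancels in the difference used below.
                err = [sum(alpha[m] * y[m] * K[m][k] for m in range(n)) - y[k]
                       for k in (i, j)]
                # The pair must keep sum_k alpha_k y_k fixed, which bounds alpha_j.
                if y[i] != y[j]:
                    lo, hi = max(0.0, alpha[j] - alpha[i]), min(cap, cap + alpha[j] - alpha[i])
                else:
                    lo, hi = max(0.0, alpha[i] + alpha[j] - cap), min(cap, alpha[i] + alpha[j])
                new_j = min(hi, max(lo, alpha[j] + y[j] * (err[0] - err[1]) / eta))
                alpha[i] += y[i] * y[j] * (alpha[j] - new_j)
                alpha[j] = new_j
    return alpha

# Hypothetical two-point dataset; by hand, the optimum is alpha = (0.25, 0.25).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = solve_dual_pairwise(X, y, C=2.0)
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]
print(alpha, w)
```

Note that the solver builds the matrix of inner products once and never touches the raw coordinates again, which is the observation made above.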

Derivation Step-by-Step

Click through each step to see terms appear and cancel in the derivation.


In the dual SVM objective, data points appear only as:

Inner products x_i^Tx_j
Individual coordinates x_i,k
Norms ||x_i||

Chapter 4: Dual SVM — Support Vectors Revealed

The dual formulation makes the role of support vectors crystal clear. Each training point gets a dual variable α_i. At the optimum:

α_i = 0: Non-support vector. This point lies on the correct side, on or outside the margin (y_i(w^Tx_i + b) ≥ 1), and doesn't influence the boundary at all.
0 < α_i < C/n: Support vector on the margin boundary. y_i(w^Tx_i + b) = 1 exactly.
α_i = C/n: Support vector that is typically inside the margin or misclassified, with slack s_i > 0.

Prediction for a new point x uses only the support vectors:

f(x) = ∑_{i: α_i>0} α_iy_ix_i^Tx + b
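A short sketch of this prediction rule, skipping every point whose α_i is numerically zero. The tolerance `tol`, the function names, and the toy values are assumptions; the α's and b below are hand-computed for this tiny dataset.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_function(x, X, y, alpha, b, tol=1e-8):
    """f(x) = sum over support vectors of alpha_i y_i x_i^T x, plus b."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X) if a > tol) + b

def predict(x, X, y, alpha, b):
    return 1 if decision_function(x, X, y, alpha, b) >= 0 else -1

# Hypothetical data: the third point sits well outside the margin, so alpha_3 = 0.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1, -1, 1]
alpha = [0.25, 0.25, 0.0]
b = 0.0

print(decision_function((2.0, 0.0), X, y, alpha, b), predict((-1.0, 0.0), X, y, alpha, b))
# Dropping the non-support vector leaves the decision function unchanged.
print(decision_function((2.0, 0.0), X[:2], y[:2], alpha[:2], b))
```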

Sparsity in action: Out of n training points, typically only a small fraction become support vectors (α_i > 0). The rest could be deleted without changing the solution. This is one common explanation for why SVMs generalize well: the solution is determined by the support vectors alone, however many features there are.

Dual Variables α_i Visualization

Each point's size reflects its α_i. Yellow circles = support vectors (α_i > 0). Bar chart on right shows α values. Adjust C to see how support vectors change.


Watch carefully as you change C:

Large C: Fewer support vectors, mostly on the margin boundary (all of them when the data are separable and C is large enough). Tight fit.
Small C: Many support vectors, many of them at α_i = C/n. Wide margin, more errors tolerated.

Computing b: For any support vector with 0 < α_i < C/n (on the margin, not penalized), we know y_i(w^Tx_i + b) = 1. Multiplying both sides by y_i and using y_i² = 1 gives w^Tx_i + b = y_i, so b = y_i − w^Tx_i = y_i − ∑_jα_jy_jx_j^Tx_i. In practice, average over all such support vectors for numerical stability.
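The averaging rule might look like the following sketch. The tolerance `tol`, the function name `compute_bias`, and the toy data are assumptions; raising an error when no free support vector exists is one possible design choice, not something the lecture prescribes.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def compute_bias(X, y, alpha, C, tol=1e-8):
    """Average b = y_i - sum_j alpha_j y_j x_j^T x_i over free support vectors."""
    n = len(X)
    cap = C / n
    free = [i for i, a in enumerate(alpha) if tol < a < cap - tol]
    if not free:
        raise ValueError("no free support vector; b is not pinned down by this rule")
    estimates = [
        y[i] - sum(alpha[j] * y[j] * dot(X[j], X[i]) for j in range(n))
        for i in free
    ]
    return sum(estimates) / len(estimates)

# Hypothetical two-point dataset whose boundary does not pass through the origin.
X = [(2.0, 2.0), (0.0, 0.0)]
y = [1, -1]
alpha = [0.25, 0.25]          # hand-computed dual optimum for this data with C/n = 1
b = compute_bias(X, y, alpha, C=2.0)
print(b)
```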

Chapter 5: Strong Duality

In general, for any optimization problem, the dual provides a lower bound on the primal:

d* = max_α≥0 min_x L(x, α) ≤ min_x max_α≥0 L(x, α) = p*

The gap p* − d* ≥ 0 is the duality gap. This inequality always holds — it's called weak duality.

The remarkable fact for SVMs: because the primal is a convex problem with linear constraints (satisfying Slater's condition), we get strong duality:

p* = d* (zero duality gap)

What strong duality means: The dual optimum exactly equals the primal optimum. There's no approximation. Solving the dual gives the same answer as solving the primal. This is a gift of convexity.

Slater's condition (a sufficient condition for strong duality): for a convex problem, if there exists a strictly feasible point (all inequality constraints hold strictly), then strong duality holds. For the soft margin SVM, this is easily satisfied: pick any w and b, then make every slack s_i large enough that y_i(w^Tx_i + b) > 1 − s_i and s_i > 0. For example, w = 0, b = 0, s_i = 2 works for any dataset.
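A strictly feasible point for the soft margin primal can be checked directly. The sketch below tries w = 0, b = 0 with every slack set to 2 on a deliberately non-separable toy dataset; the data and the function name `is_strictly_feasible` are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def is_strictly_feasible(w, b, s, X, y):
    """True if every inequality of the soft margin primal holds strictly."""
    margins_ok = all(yi * (dot(w, xi) + b) > 1.0 - si for xi, yi, si in zip(X, y, s))
    slacks_ok = all(si > 0.0 for si in s)
    return margins_ok and slacks_ok

# Hypothetical non-separable data: the same point appears with both labels.
X = [(1.0, 1.0), (1.0, 1.0), (-2.0, 0.5)]
y = [1, -1, -1]
w, b = (0.0, 0.0), 0.0

# With w = 0 and b = 0 every margin is 0, so any slack above 1 gives strict feasibility.
print(is_strictly_feasible(w, b, [2.0] * len(X), X, y))
# Slacks that are too small break the margin constraints.
print(is_strictly_feasible(w, b, [0.5] * len(X), X, y))
```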

Duality Gap

The primal value (minimizing) and dual value (maximizing) converge to the same optimum. Drag α for a simple example: min x² s.t. x ≥ 1.


In the visualization: the primal value is p* = 1 (at x* = 1). As you increase α, the dual value g(α) = min_x(x² − α(x−1)) increases until α* = 2 where g(2) = 1 = p*. Strong duality!
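The same picture can be reproduced without the slider. Setting the derivative 2x − α to zero gives the inner minimizer x = α/2, so g(α) = α − α²/4. The sketch below evaluates g on a grid; the grid range and the variable names are arbitrary choices.

```python
def dual_function(alpha):
    """g(alpha) = min_x [x^2 - alpha (x - 1)]; the minimizer is x = alpha / 2."""
    x = alpha / 2.0
    return x**2 - alpha * (x - 1.0)       # simplifies to alpha - alpha^2 / 4

p_star = 1.0                               # primal optimum, attained at x* = 1
alphas = [i / 10.0 for i in range(0, 41)]  # grid over alpha in [0, 4]
values = [dual_function(a) for a in alphas]
best = max(range(len(alphas)), key=lambda i: values[i])

# Weak duality: every g(alpha) is a lower bound on p*. The gap closes at alpha = 2.
print(alphas[best], values[best], p_star - values[best])
```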

Strong duality (p* = d*) holds for the SVM because:

SVMs always have a unique solution
The primal is convex and Slater's condition is satisfied
The data is linearly separable

Chapter 6: Complementary Slackness

Complementary slackness is the most revealing KKT condition. For each constraint-multiplier pair:

α_i · [y_i(w^Tx_i + b) − 1 + s_i] = 0

This says: at least one of the two factors must be zero. Either the multiplier α_i = 0 (the constraint doesn't matter), or the constraint is tight (the bracket = 0). They "complement" each other — like a see-saw, one must be down.

Combined with β_i s_i = 0 and α_i + β_i = C/n, we get three cases for each training point:

α_i	s_i	Margin	Meaning
0	0	y_if(x_i) > 1	Correct, outside margin — non-support vector
0 < α_i < C/n	0	y_if(x_i) = 1	On margin boundary — free support vector
C/n	> 0	y_if(x_i) < 1	Inside margin or misclassified — bounded support vector
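The three rows of the table translate into a simple test on α_i. In the sketch below, the tolerance `tol` is an assumption: numerical solvers rarely return an exact 0 or an exact C/n, so some slack in the comparison is needed. The function name is illustrative.

```python
def classify_point(alpha_i, C, n, tol=1e-8):
    """Label a training point by where alpha_i sits in the box [0, C/n]."""
    cap = C / n
    if alpha_i <= tol:
        return "non-support vector"
    if alpha_i >= cap - tol:
        return "bounded support vector"
    return "free support vector"

# Hypothetical dual solution for n = 4 points with C = 1, so C/n = 0.25.
alphas = [0.0, 0.1, 0.25, 0.0]
for i, a in enumerate(alphas):
    print(i, a, classify_point(a, C=1.0, n=len(alphas)))
```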

Why this matters for prediction: Only support vectors (α_i > 0) contribute to the decision function. Free support vectors (0 < α_i < C/n) are used to compute b. Bounded support vectors (α_i = C/n) are points the SVM "gave up on" — they're too close or misclassified, and C caps how much it cares.

Complementary Slackness Anatomy

Each point is colored by its KKT status. Gray = non-SV (α=0). Green = free SV (on margin). Red = bounded SV (inside/misclassified).


A point has α_i = C/n and s_i = 0.3. What kind of support vector is it?

Non-support vector
Bounded SV (inside margin, 0 < s_i < 1 means correct but in margin)
Free SV (on the margin boundary)

Chapter 7: Mastery

We've traveled from the primal SVM through Lagrangian mechanics to the dual. Here's the complete picture:

Concept	Key Result
Lagrangian	Packages objective + constraints into L(w,b,s,α,β)
∂L/∂w = 0	w = ∑α_iy_ix_i
∂L/∂b = 0	∑α_iy_i = 0
∂L/∂s_i = 0	α_i + β_i = C/n, hence 0 ≤ α_i ≤ C/n
Dual SVM	max ∑α_i − ½∑∑α_iα_jy_iy_jx_i^Tx_j
Strong duality	Primal optimum = dual optimum (convexity + Slater's condition)
Comp. slackness	α_i[y_if(x_i) − 1 + s_i] = 0

The punchline: The dual SVM depends on data ONLY through inner products x_i^Tx_j. If we replace these inner products with a kernel function K(x_i, x_j) = φ(x_i)^Tφ(x_j), we get nonlinear decision boundaries without ever computing φ explicitly. That's Lecture 19.

Related lectures:

Lecture 17: SVM Primal — Margin, hard/soft margin
Lecture 19: Kernels — The kernel trick, RBF, Mercer's theorem

The dual SVM reveals that w can be written as w = ∑α_iy_ix_i. If 100 training points are used but only 5 have α_i > 0, how many points determine the decision boundary?

100
5 (only the support vectors)
50 (half of them)