EE269 Lecture 19 — Mert Pilanci, Stanford

Kernels & Kernel Machines

The trick that lets SVMs draw nonlinear boundaries by never leaving the comfort of inner products.

Prerequisites: Dual SVM (Lecture 18) + Inner products. That's it.
8
Chapters
5+
Simulations
0
Assumed Knowledge

Chapter 0: The Nonlinearity Problem

Consider the XOR pattern: four points at the corners of a square. Class +1 at top-left and bottom-right, class −1 at top-right and bottom-left. No line can separate them. Not a steep one, not a flat one, not any line at all.

This isn't a toy problem. Real data is full of such patterns: a tumor marker might be dangerous at very low AND very high values but safe in the middle. A credit score model might flag both very young AND very old first-time borrowers. Linear boundaries simply can't capture these relationships.

XOR: Linear SVM Fails

The linear SVM tries its best but cannot separate the XOR pattern. The best it can do is 50% accuracy.

One approach: manually engineer new features. For XOR, the product x1·x2 separates the classes perfectly (positive for same-sign corners, negative for opposite-sign corners). But hand-engineering features for every problem doesn't scale.

The question: Can we systematically map data into a higher-dimensional space where a linear boundary exists — without the combinatorial explosion of feature engineering? The answer is the kernel trick, and it's one of the most elegant ideas in machine learning.
Why can't a linear SVM separate the XOR pattern?

Chapter 1: Feature Maps φ(x)

A feature map is a function φ: ℝd → ℝD that transforms each data point into a higher-dimensional space. In this new space, data that was linearly inseparable may become linearly separable.

Example: for 2D data x = (x1, x2), consider the quadratic feature map:

φ(x) = (x1², x2², √2 · x1x2)

This maps 2D points into 3D. A linear boundary in 3D (a plane through φ-space) corresponds to a quadratic boundary back in the original 2D space — circles, ellipses, hyperbolas, all become possible.

Key insight: The feature map φ doesn't change the algorithm. We still run the exact same linear SVM — we just feed it φ(xi) instead of xi. All the theory (margin, duality, support vectors) transfers unchanged.

The problem: D can be enormous. A degree-p polynomial map from ℝd has D = C(d+p, p) features. For d=100 and p=5, that's over 96 million features. Computing φ(x) explicitly is impractical.

Feature Map Lifts XOR to 3D

The 2D XOR pattern (left) becomes linearly separable when lifted by φ(x) = (x1², x2², x1x2) into 3D (right). A plane in 3D = a curve in 2D.

But wait — recall from Lecture 18 that the dual SVM only uses inner products φ(xi)Tφ(xj). What if we could compute this inner product without ever computing φ explicitly?

A quadratic feature map from ℝ² to ℝ³ maps x = (x1, x2) to φ(x) = (x1², x2², √2 x1x2). What does a linear boundary in φ-space look like in the original space?

Chapter 2: The Kernel Trick

Here is the magic. Consider the quadratic feature map φ(x) = (x1², x2², √2 x1x2). Compute the inner product in φ-space:

φ(x)Tφ(z) = x1²z1² + x2²z2² + 2x1x2z1z2

Now factor:

= (x1z1 + x2z2)² = (xTz)²
The kernel trick: We can compute the inner product in the high-dimensional φ-space using a simple function K(x,z) = (xTz)² that operates entirely in the original low-dimensional space. We never need to compute φ(x) or φ(z) explicitly.

A kernel function K(x, z) computes the inner product in some feature space:

K(x, z) = φ(x)Tφ(z)

The cost of computing K(x, z) = (xTz)² is O(d) — just a dot product and a square. The cost of computing φ(x)Tφ(z) explicitly would be O(D), where D ≫ d. For the RBF kernel (coming next), D is infinite, yet K(x, z) still takes O(d) to compute.

Naive Approach
Compute φ(xi) ∈ ℝD for all i, then inner products in ℝD
↓ replaced by
Kernel Trick
Compute K(xi, xj) directly in ℝd — O(d) per pair
Kernel Computation: Explicit vs Trick

For polynomial degree p in dimension d, compare the cost. The kernel trick wins massively as p and d grow.

Dimension d10
Degree p3
The kernel trick lets us compute φ(x)Tφ(z) without ever computing φ. What makes this possible?

Chapter 3: Common Kernels

Different kernels correspond to different feature spaces and different types of decision boundaries.

KernelK(x, z)Feature space dimBoundary type
LinearxTzd (no lift)Hyperplane
Polynomial(xTz + c)pC(d+p, p)Polynomial curves
RBF (Gaussian)exp(−||x−z||² / 2σ²)Arbitrary smooth

Polynomial kernel: K(x, z) = (xTz + c)p. The constant c controls whether pure monomials (c=0) or mixed terms (c>0) are included. Degree p=2 gives quadratic boundaries, p=3 gives cubic, etc.

RBF (Radial Basis Function) kernel: K(x, z) = exp(−||x − z||² / 2σ²). This is the most widely used kernel. The parameter σ (bandwidth) controls smoothness: small σ = wiggly boundaries (each support vector has local influence), large σ = smooth boundaries (global influence).

The RBF kernel corresponds to an infinite-dimensional feature map! You can show this by expanding the exponential as a Taylor series: exp(xTz/σ²) = ∑k=0 (xTz)k / (k! σ2k). Each term (xTz)k is an inner product in a degree-k polynomial space. The full expansion uses ALL polynomial degrees simultaneously — an infinite-dimensional φ.

Despite the infinite-dimensional feature space, K(x,z) = exp(−||x−z||²/2σ²) takes O(d) to compute. This is the kernel trick at its most dramatic.

Kernel Values: How Points "See" Each Other

Each kernel measures similarity differently. The heatmap shows K(x, z) for the orange point relative to every location. Drag the point to explore.

Kernel
σ (RBF)0.50
The RBF kernel K(x,z) = exp(−||x−z||²/2σ²) corresponds to a feature space of dimension:

Chapter 4: Kernel SVM — Nonlinear Boundaries

The kernel SVM dual is identical to the linear dual, with xiTxj replaced by K(xi, xj):

maxα ∑αi − ½∑∑αiαjyiyjK(xi, xj)    s.t.   ∑αiyi=0,   0 ≤ αi ≤ C/n

Prediction for a new point x:

f(x) = ∑i: αi>0 αiyiK(xi, x) + b

Notice: we never compute w explicitly (it lives in the potentially infinite-dimensional φ-space). We only ever compute kernel evaluations K(xi, x). This is the full power of the kernel trick in action.

Play with the simulation below. Switch between kernels. Place nonlinear patterns (circles, spirals). Watch the RBF kernel conform to arbitrary shapes. Adjust σ to control smoothness: too small = overfitting (wiggly boundary hugging each point), too large = underfitting (approaches linear).
Kernel SVM Playground

Click = class +1 (orange). Shift+click = class −1 (teal). Support vectors are circled in yellow.

Kernel
σ (RBF bandwidth)0.12
log10(C)100
Key observations: (1) RBF with small σ creates tight "bubbles" around each support vector — high variance. (2) RBF with large σ smooths everything — high bias. (3) Polynomial kernels create global boundaries (a degree-3 polynomial everywhere). (4) The number of support vectors increases as you lower C or decrease σ.

Chapter 5: Mercer's Theorem

Not every function K(x, z) is a valid kernel. If we want K to correspond to an inner product in some feature space (K(x,z) = φ(x)Tφ(z)), it must satisfy a mathematical condition.

Mercer's theorem: A continuous, symmetric function K(x, z) is a valid kernel if and only if the kernel matrix (Gram matrix) is positive semi-definite for any finite set of points.

The kernel matrix (Gram matrix) for points x1, …, xn is the n×n matrix:

Gij = K(xi, xj)

Positive semi-definite means: for any vector c ∈ ℝn, cTGc ≥ 0. Equivalently, all eigenvalues of G are ≥ 0.

Why PSD matters: If K is a valid kernel, then K(x,z) = φ(x)Tφ(z) for some φ. Then cTGc = ∑i,j cicjφ(xi)Tφ(xj) = ||∑iciφ(xi)||² ≥ 0. The PSD condition follows automatically from the existence of a feature map. Mercer's theorem says the converse is also true.

Useful closure properties — you can build complex kernels from simple ones:

Gram Matrix Eigenvalues

For random 2D points, the Gram matrix Gij = K(xi,xj) has all non-negative eigenvalues (valid kernel). The bar chart shows eigenvalues sorted descending.

Kernel
σ (RBF)1.00
A function K(x,z) is a valid kernel (by Mercer's theorem) if and only if:

Chapter 6: Representer Theorem

The representer theorem is a remarkable result that justifies the kernel approach far beyond SVMs. It says:

Representer Theorem: For any regularized empirical risk minimization problem of the form minf ∈ Hi=1n ℓ(yi, f(xi)) + λ||f||H², where H is a reproducing kernel Hilbert space (RKHS), the optimal solution has the form:
f*(x) = ∑i=1n ci K(xi, x)

The solution is always a linear combination of kernel evaluations at the training points.

This is profound. The function space H might be infinite-dimensional (for the RBF kernel, it is). But the representer theorem guarantees that the optimal f* lives in the finite-dimensional subspace spanned by {K(x1, ·), K(x2, ·), …, K(xn, ·)}. We only need n coefficients c1, …, cn.

For the SVM specifically, ci = αiyi, and most ci are zero (non-support vectors). The representer theorem tells us this structure isn't specific to SVMs — it's a property of any kernel method with regularization.

Search space
All functions in RKHS H (possibly infinite-dim)
↓ Representer theorem
Reduced to
f(x) = ∑i=1n ciK(xi, x) — n coefficients
↓ SVM sparsity
Further reduced
Only support vectors have ci ≠ 0
Representer Theorem: Kernel Basis Functions

f(x) = ∑ ciK(xi, x) is a weighted sum of "bumps" centered at training points. Drag coefficients to see how the function shape changes.

c11.0
c2-1.0
c30.5
σ0.50
The representer theorem guarantees that the optimal kernel method solution:

Chapter 7: Mastery

We've built the complete kernel machine pipeline: from the nonlinearity problem to feature maps to the kernel trick to Mercer's condition for valid kernels to the representer theorem guaranteeing finite solutions in infinite spaces.

ConceptKey Idea
Feature map φLift data to higher dimensions where linear separation exists
Kernel trickK(x,z) = φ(x)Tφ(z) computable without φ
Polynomial kernel(xTz + c)p — degree-p boundaries
RBF kernelexp(−||x−z||²/2σ²) — infinite-dim, local influence
Mercer's theoremK valid ⇔ Gram matrix is PSD
Representer theoremf*(x) = ∑ciK(xi,x) — n coefficients suffice
The legacy of kernels: While deep learning has displaced SVMs in many tasks, the kernel perspective remains foundational. The neural tangent kernel (NTK) shows that infinitely wide neural networks behave as kernel machines. Understanding kernels is understanding the mathematical soul of learning.

Connections:

Why is the kernel trick essential for using the RBF kernel?