EE269 Lecture 19 — Kernels & Kernel Machines

Chapter 0: The Nonlinearity Problem

Consider the XOR pattern: four points at the corners of a square. Class +1 at top-left and bottom-right, class −1 at top-right and bottom-left. No line can separate them. Not a steep one, not a flat one, not any line at all.

This isn't a toy problem. Real data is full of such patterns: a tumor marker might be dangerous at very low AND very high values but safe in the middle. A credit score model might flag both very young AND very old first-time borrowers. Linear boundaries simply can't capture these relationships.

XOR: Linear SVM Fails

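The linear SVM tries its best but cannot separate the XOR pattern. No line classifies more than three of the four corners correctly, and on symmetric XOR data the SVM itself typically ends up at chance level (about 50% accuracy).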

One approach: manually engineer new features. For XOR, the product x₁·x₂ separates the classes perfectly (positive for same-sign corners, negative for opposite-sign corners). But hand-engineering features for every problem doesn't scale.
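A quick numeric check of the product feature, as a minimal sketch. The corner coordinates (±1, ±1) are an assumption; the text above only fixes which corner gets which label.

```python
import numpy as np

# XOR corners, assumed to sit at (+-1, +-1); each row is (x1, x2).
X = np.array([[-1.0,  1.0],   # top-left     -> class +1
              [ 1.0, -1.0],   # bottom-right -> class +1
              [ 1.0,  1.0],   # top-right    -> class -1
              [-1.0, -1.0]])  # bottom-left  -> class -1
y = np.array([1, 1, -1, -1])

# Hand-engineered feature: the product x1 * x2.
product = X[:, 0] * X[:, 1]

# With this labeling, class +1 has a negative product and class -1 a positive one,
# so a single threshold at 0 on the new feature separates the classes.
predictions = -np.sign(product).astype(int)
print("product feature:", product)
print("predictions:    ", predictions)
print("all correct:    ", bool((predictions == y).all()))
```

The classifier here is just a sign flip of one feature, which is a linear rule in the lifted one-dimensional space.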

The question: Can we systematically map data into a higher-dimensional space where a linear boundary exists — without the combinatorial explosion of feature engineering? The answer is the kernel trick, and it's one of the most elegant ideas in machine learning.

Why can't a linear SVM separate the XOR pattern?

There aren't enough data points
No single hyperplane can put both +1 points on one side and both −1 points on the other
The margin is too small

Chapter 1: Feature Maps φ(x)

A feature map is a function φ: ℝ^d → ℝ^D that transforms each data point into a higher-dimensional space. In this new space, data that was linearly inseparable may become linearly separable.

Example: for 2D data x = (x₁, x₂), consider the quadratic feature map:

φ(x) = (x₁², x₂², √2 · x₁x₂)

This maps 2D points into 3D. A linear boundary in 3D (a plane through φ-space) corresponds to a quadratic boundary back in the original 2D space — circles, ellipses, hyperbolas, all become possible.
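As an illustration of the plane-to-conic correspondence, the sketch below picks one hypothetical plane in φ-space (the weights `w` and offset `b` are made up for this example) and evaluates it at a few 2D points. With these particular values the boundary in the original space is the unit circle.

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the text: (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

# Hypothetical plane in phi-space: w . phi(x) + b = 0.
w = np.array([1.0, 1.0, 0.0])
b = -1.0

def side(x):
    """Linear decision function, evaluated in phi-space."""
    return float(w @ phi(x) + b)

# In the original coordinates this is x1^2 + x2^2 - 1, i.e. the unit circle.
print(side((0.0, 0.0)))   # inside the circle: negative
print(side((2.0, 0.0)))   # outside the circle: positive
print(side((0.6, 0.8)))   # on the circle: zero up to rounding
```

Changing `w` changes the conic: a nonzero third weight, for example, brings in the x₁x₂ term and rotates or reshapes the curve.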

Key insight: The feature map φ doesn't change the algorithm. We still run the exact same linear SVM — we just feed it φ(x_i) instead of x_i. All the theory (margin, duality, support vectors) transfers unchanged.

The problem: D can be enormous. A degree-p polynomial map from ℝ^d has D = C(d+p, p) features. For d=100 and p=5, that's over 96 million features. Computing φ(x) explicitly is impractical.
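The feature count is easy to check with the standard library. This sketch evaluates C(d+p, p), which counts all monomials of degree at most p (including the constant term); the helper name `num_poly_features` is made up for this example.

```python
from math import comb

def num_poly_features(d: int, p: int) -> int:
    """Number of monomials of degree at most p in d variables: C(d + p, p)."""
    return comb(d + p, p)

for d, p in [(2, 2), (10, 3), (100, 5)]:
    print(f"d={d:>3}, p={p}: D = {num_poly_features(d, p):,}")
```

For d=100 and p=5 this gives 96,560,646, matching the figure quoted above.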

Feature Map Lifts XOR to 3D

The 2D XOR pattern (left) becomes linearly separable when lifted by φ(x) = (x₁², x₂², √2 · x₁x₂) into 3D (right). A plane in 3D = a curve in 2D.

But wait — recall from Lecture 18 that the dual SVM only uses inner products φ(x_i)^Tφ(x_j). What if we could compute this inner product without ever computing φ explicitly?

A quadratic feature map from ℝ² to ℝ³ maps x = (x₁, x₂) to φ(x) = (x₁², x₂², √2 x₁x₂). What does a linear boundary in φ-space look like in the original space?

A quadratic curve (conic section)
Still a straight line
A cubic curve

Chapter 2: The Kernel Trick

Here is the magic. Consider the quadratic feature map φ(x) = (x₁², x₂², √2 x₁x₂). Compute the inner product in φ-space:

φ(x)^Tφ(z) = x₁²z₁² + x₂²z₂² + 2x₁x₂z₁z₂

Now factor:

= (x₁z₁ + x₂z₂)² = (x^Tz)²

The kernel trick: We can compute the inner product in the high-dimensional φ-space using a simple function K(x,z) = (x^Tz)² that operates entirely in the original low-dimensional space. We never need to compute φ(x) or φ(z) explicitly.
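The identity can be checked numerically. This is a minimal sketch on two random 2D points (the seed is arbitrary); it compares the explicit inner product in φ-space with the kernel evaluated in the original space.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

def quad_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed entirely in the original 2D space."""
    return float(np.dot(x, z)) ** 2

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x, z = rng.normal(size=2), rng.normal(size=2)

explicit = float(phi(x) @ phi(z))   # lift both points, then take the inner product
implicit = quad_kernel(x, z)        # never builds phi
print(explicit, implicit)
```

The two numbers agree up to floating-point rounding, which is all the kernel trick claims.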

A kernel function K(x, z) computes the inner product in some feature space:

K(x, z) = φ(x)^Tφ(z)

The cost of computing K(x, z) = (x^Tz)² is O(d) — just a dot product and a square. The cost of computing φ(x)^Tφ(z) explicitly would be O(D), where D ≫ d. For the RBF kernel (coming next), D is infinite, yet K(x, z) still takes O(d) to compute.

Naive Approach

Compute φ(x_i) ∈ ℝ^D for all i, then inner products in ℝ^D

↓ replaced by

Kernel Trick

Compute K(x_i, x_j) directly in ℝ^d — O(d) per pair

Kernel Computation: Explicit vs Trick

For polynomial degree p in dimension d, compare the cost. The kernel trick wins massively as p and d grow.

The kernel trick lets us compute φ(x)^Tφ(z) without ever computing φ. What makes this possible?

The inner product in feature space equals a simple function K(x,z) in input space
We approximate φ with fewer dimensions
The feature map is invertible

Chapter 3: Common Kernels

Different kernels correspond to different feature spaces and different types of decision boundaries.

Kernel	K(x, z)	Feature space dim	Boundary type
Linear	x^Tz	d (no lift)	Hyperplane
Polynomial	(x^Tz + c)^p	C(d+p, p)	Polynomial curves
RBF (Gaussian)	exp(−||x−z||² / 2σ²)	∞	Arbitrary smooth

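Polynomial kernel: K(x, z) = (x^Tz + c)^p. The constant c controls which monomials appear in the feature space: with c = 0 only monomials of degree exactly p (as in the quadratic map of Chapter 1), with c > 0 all monomials of degree up to p. Degree p=2 gives quadratic boundaries, p=3 gives cubic, etc.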
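To see what the constant c contributes, here is a sketch for d = 2 and p = 2. The explicit map `phi_c` is not given in the lecture; it is obtained by expanding (x^Tz + c)² by hand and is included only to check the kernel against an explicit inner product.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, p=2):
    """Polynomial kernel K(x, z) = (x^T z + c)^p."""
    return (float(np.dot(x, z)) + c) ** p

def phi_c(x, c=1.0):
    """Explicit degree-2 map in 2D whose inner product equals (x^T z + c)^2."""
    x1, x2 = x
    return np.array([
        x1**2, x2**2, np.sqrt(2.0) * x1 * x2,          # degree-2 terms
        np.sqrt(2.0 * c) * x1, np.sqrt(2.0 * c) * x2,  # degree-1 terms (vanish when c = 0)
        c,                                             # constant term
    ])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
for c in (0.0, 1.0):
    print(c, poly_kernel(x, z, c=c), float(phi_c(x, c) @ phi_c(z, c)))
```

With c = 0 the last three coordinates are zero and the map collapses to the three-feature quadratic map; with c > 0 the linear and constant terms appear, for C(4, 2) = 6 features in total.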

RBF (Radial Basis Function) kernel: K(x, z) = exp(−||x − z||² / 2σ²). This is the most widely used kernel. The parameter σ (bandwidth) controls smoothness: small σ = wiggly boundaries (each support vector has local influence), large σ = smooth boundaries (global influence).
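The effect of the bandwidth shows up directly in the kernel value between two fixed points. In this sketch the points and the three σ values are illustrative choices, not values from the lecture.

```python
import numpy as np

def rbf_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma**2)))

x = [0.0, 0.0]
z = [1.0, 0.0]   # a point at distance 1 from x
for sigma in (0.1, 1.0, 10.0):   # illustrative bandwidths
    print(f"sigma={sigma:>4}: K = {rbf_kernel(x, z, sigma):.3g}")
```

With a small σ the two points are essentially invisible to each other (K is close to 0), which is what "local influence" means; with a large σ every point looks similar to every other (K is close to 1).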

The RBF kernel corresponds to an infinite-dimensional feature map! To see this, factor the kernel as exp(−||x−z||²/2σ²) = exp(−||x||²/2σ²) · exp(−||z||²/2σ²) · exp(x^Tz/σ²), then expand the last factor as a Taylor series: exp(x^Tz/σ²) = ∑_{k=0}^∞ (x^Tz)^k / (k! σ^{2k}). Each term (x^Tz)^k is an inner product in a degree-k polynomial feature space, so the full expansion uses ALL polynomial degrees simultaneously, which makes φ infinite-dimensional.

Despite the infinite-dimensional feature space, K(x,z) = exp(−||x−z||²/2σ²) takes O(d) to compute. This is the kernel trick at its most dramatic.
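The series view can be made concrete by truncating it. The sketch below compares the closed-form RBF kernel with a version that sums the Taylor series of exp(x^Tz/σ²) up to a chosen degree; the function names and test points are made up for this example.

```python
import numpy as np
from math import factorial

def rbf_kernel(x, z, sigma=1.0):
    """Closed form: exp(-||x - z||^2 / (2 sigma^2)), an O(d) computation."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma**2)))

def rbf_truncated(x, z, sigma=1.0, degree=10):
    """Same kernel via the Taylor series of exp(x^T z / sigma^2), cut off at `degree`."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    # exp(-||x-z||^2 / 2s^2) = exp(-||x||^2 / 2s^2) * exp(-||z||^2 / 2s^2) * exp(x^T z / s^2)
    prefactor = np.exp(-(np.dot(x, x) + np.dot(z, z)) / (2.0 * sigma**2))
    series = sum(np.dot(x, z) ** k / (factorial(k) * sigma ** (2 * k))
                 for k in range(degree + 1))
    return float(prefactor * series)

x, z = [1.0, 2.0], [1.5, 0.5]
for degree in (2, 5, 20):
    print(degree, rbf_truncated(x, z, degree=degree), rbf_kernel(x, z))
```

Each extra degree adds one more polynomial feature space to the sum; the closed form corresponds to keeping all of them.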

Kernel Values: How Points "See" Each Other

Each kernel measures similarity differently. The heatmap shows K(x, z) for the orange point relative to every location. Drag the point to explore.

The RBF kernel K(x,z) = exp(−||x−z||²/2σ²) corresponds to a feature space of dimension:

d (same as input)
d²
Infinite

Chapter 4: Kernel SVM — Nonlinear Boundaries

The kernel SVM dual is identical to the linear dual, with x_i^Tx_j replaced by K(x_i, x_j):

max_α ∑_i α_i − ½ ∑_i ∑_j α_i α_j y_i y_j K(x_i, x_j)   s.t.   ∑_i α_i y_i = 0,   0 ≤ α_i ≤ C/n

Prediction for a new point x uses the sign of the decision function:

f(x) = ∑_{i: α_i>0} α_iy_iK(x_i, x) + b

Notice: we never compute w explicitly (it lives in the potentially infinite-dimensional φ-space). We only ever compute kernel evaluations K(x_i, x). This is the full power of the kernel trick in action.
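The prediction formula translates almost line for line into code. This sketch applies it to the XOR corners (assumed at (±1, ±1), labeled as in Chapter 0) with the quadratic kernel. The multipliers are hand-derived for this toy problem rather than produced by a solver, and they assume the upper bound C/n is at least 1/8.

```python
import numpy as np

def quad_kernel(x, z):
    """K(x, z) = (x^T z)^2."""
    return float(np.dot(x, z)) ** 2

def decision_function(x, support_vectors, alphas, labels, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(a * y_i * kernel(sv, x)
               for a, y_i, sv in zip(alphas, labels, support_vectors)) + b

X = np.array([[-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Hand-derived multipliers (an assumption of this sketch): they satisfy
# sum_i alpha_i * y_i = 0 and place every corner exactly on the margin.
alphas = np.full(4, 1.0 / 8.0)
b = 0.0

for x_i, y_i in zip(X, y):
    f = decision_function(x_i, X, alphas, y, b, quad_kernel)
    print(x_i, "label:", y_i, "f(x):", f)
```

Expanding the sum by hand gives f(x) = −x₁x₂, the same product feature that separated XOR in Chapter 0, yet the code never forms w or φ.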

Play with the simulation below. Switch between kernels. Place nonlinear patterns (circles, spirals). Watch the RBF kernel conform to arbitrary shapes. Adjust σ to control smoothness: too small = overfitting (wiggly boundary hugging each point), too large = underfitting (approaches linear).

Kernel SVM Playground

Click = class +1 (orange). Shift+click = class −1 (teal). Support vectors are circled in yellow.

Key observations: (1) RBF with small σ creates tight "bubbles" around each support vector — high variance. (2) RBF with large σ smooths everything — high bias. (3) Polynomial kernels create global boundaries (a degree-3 polynomial everywhere). (4) The number of support vectors increases as you lower C or decrease σ.

Chapter 5: Mercer's Theorem

Not every function K(x, z) is a valid kernel. If we want K to correspond to an inner product in some feature space (K(x,z) = φ(x)^Tφ(z)), it must satisfy a mathematical condition.

Mercer's theorem: A continuous, symmetric function K(x, z) is a valid kernel if and only if the kernel matrix (Gram matrix) is positive semi-definite for any finite set of points.

The kernel matrix (Gram matrix) for points x₁, …, x_n is the n×n matrix:

G_ij = K(x_i, x_j)

Positive semi-definite means: for any vector c ∈ ℝⁿ, c^TGc ≥ 0. Equivalently, all eigenvalues of G are ≥ 0.

Why PSD matters: If K is a valid kernel, then K(x,z) = φ(x)^Tφ(z) for some φ. Then c^TGc = ∑_{i,j} c_i c_j φ(x_i)^Tφ(x_j) = ||∑_i c_i φ(x_i)||² ≥ 0. The PSD condition follows automatically from the existence of a feature map. Mercer's theorem says the converse is also true.
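A numerical check of the PSD condition, as a sketch. It builds the Gram matrix on a few random 2D points for the RBF kernel and for a deliberately invalid candidate, the negative squared distance, which is chosen here only as a counterexample. Its Gram matrix has a zero diagonal, so the eigenvalues sum to zero and at least one must be negative.

```python
import numpy as np

def gram_matrix(X, kernel):
    """G[i, j] = kernel(X[i], X[j]) for every pair of rows in X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

def neg_sq_dist(x, z):
    """Symmetric, but not a valid kernel: its Gram matrix has a negative eigenvalue."""
    return -np.sum((x - z) ** 2)

rng = np.random.default_rng(0)   # arbitrary seed
X = rng.normal(size=(8, 2))      # 8 random 2D points

min_eigs = {}
for name, k in [("rbf", rbf), ("neg_sq_dist", neg_sq_dist)]:
    eigvals = np.linalg.eigvalsh(gram_matrix(X, k))   # eigvalsh is for symmetric matrices
    min_eigs[name] = float(eigvals.min())
    print(name, "min eigenvalue:", min_eigs[name])
```

A single point set with a negative eigenvalue is enough to rule a function out; a non-negative spectrum on one point set is only evidence, not proof, that a kernel is valid.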

Useful closure properties — you can build complex kernels from simple ones:

K₁ + K₂ is a valid kernel (if both are)
α K is valid for α ≥ 0
K₁ · K₂ is valid (Schur product theorem)
exp(K) is valid (its Taylor series is a limit of sums and products of K with non-negative coefficients)
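The closure properties above can be spot-checked on Gram matrices. In this sketch K₁ is the linear kernel and K₂ the RBF kernel with σ = 1, both evaluated on the same random points; the seed and the scale factor 3 are arbitrary. Products and exponentials are elementwise, not matrix operations.

```python
import numpy as np

def min_eig(G):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(G).min())

rng = np.random.default_rng(1)   # arbitrary seed
X = rng.normal(size=(6, 2))

# Two valid base kernels evaluated on the same points.
G_lin = X @ X.T                                                  # linear kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G_rbf = np.exp(-sq_dists / 2.0)                                  # RBF with sigma = 1

candidates = {
    "K1 + K2": G_lin + G_rbf,
    "3 * K1":  3.0 * G_lin,
    "K1 * K2": G_lin * G_rbf,     # elementwise (Schur) product
    "exp(K1)": np.exp(G_lin),     # elementwise exponential
}
for name, G in candidates.items():
    print(f"{name:8s} min eigenvalue: {min_eig(G):.2e}")
```

Eigenvalues that should be exactly zero can come out as tiny negative numbers from rounding, so compare against a small tolerance rather than against 0.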

Gram Matrix Eigenvalues

For random 2D points, the Gram matrix G_ij = K(x_i,x_j) has all non-negative eigenvalues (valid kernel). The bar chart shows eigenvalues sorted descending.

A function K(x,z) is a valid kernel (by Mercer's theorem) if and only if:

K(x,z) > 0 for all x, z
The Gram matrix G_ij = K(x_i,x_j) is positive semi-definite for any finite point set
K is differentiable everywhere

Chapter 6: Representer Theorem

The representer theorem is a remarkable result that justifies the kernel approach far beyond SVMs. It says:

Representer Theorem: For any regularized empirical risk minimization problem of the form min_{f ∈ H} ∑_{i=1}^n ℓ(y_i, f(x_i)) + λ||f||_H², where H is a reproducing kernel Hilbert space (RKHS), the optimal solution has the form:

f*(x) = ∑_{i=1}^n c_i K(x_i, x)

The solution is always a linear combination of kernel evaluations at the training points.

This is profound. The function space H might be infinite-dimensional (for the RBF kernel, it is). But the representer theorem guarantees that the optimal f* lives in the finite-dimensional subspace spanned by {K(x₁, ·), K(x₂, ·), …, K(x_n, ·)}. We only need n coefficients c₁, …, c_n.

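For the SVM specifically, c_i = α_iy_i, and c_i = 0 for every non-support vector, which is typically most of the training set. The representer theorem tells us that the kernel-expansion form isn't specific to SVMs: it's a property of any kernel method with this kind of regularization. The sparsity, by contrast, comes from the SVM's hinge loss.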
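To see the representer form outside the SVM setting, the sketch below swaps in squared loss, which gives kernel ridge regression. That choice is made purely for illustration and is not covered in the lecture. With squared loss the coefficients have a closed form: c solves (G + λI)c = y. The data, σ, and λ are made up.

```python
import numpy as np

def rbf_gram(A, B, sigma=0.5):
    """Matrix of RBF kernel values K(a_i, b_j) between rows of A and rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

# Synthetic 1D regression data (made up for this sketch).
rng = np.random.default_rng(0)
X_train = np.linspace(-3.0, 3.0, 20).reshape(-1, 1)
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=20)

lam = 0.1   # regularization strength lambda, chosen arbitrarily
G = rbf_gram(X_train, X_train)

# Squared loss: the coefficients solve (G + lambda * I) c = y.
c = np.linalg.solve(G + lam * np.eye(len(X_train)), y_train)

def f_star(X_new):
    """f*(x) = sum_i c_i K(x_i, x): n coefficients, one per training point."""
    return rbf_gram(X_new, X_train) @ c

X_test = np.array([[0.0], [1.5]])
print("number of coefficients:", c.shape[0])
print("predictions:", f_star(X_test))
```

Unlike the SVM, this solution is generally dense: every training point gets a nonzero coefficient, which is consistent with the theorem since it promises the expansion form, not sparsity.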

Search space

All functions in RKHS H (possibly infinite-dim)

↓ Representer theorem

Reduced to

f(x) = ∑_{i=1}^n c_i K(x_i, x), with n coefficients

↓ SVM sparsity

Further reduced

Only support vectors have c_i ≠ 0

Representer Theorem: Kernel Basis Functions

f(x) = ∑ c_iK(x_i, x) is a weighted sum of "bumps" centered at training points. Drag coefficients to see how the function shape changes.

The representer theorem guarantees that the optimal kernel method solution:

Has exactly d non-zero coefficients
Is a linear combination of kernel evaluations at the training points
Always achieves zero training error

Chapter 7: Mastery

We've built the complete kernel machine pipeline: the nonlinearity problem, feature maps, the kernel trick, Mercer's condition for valid kernels, and the representer theorem, which guarantees finite solutions in infinite-dimensional spaces.

Concept	Key Idea
Feature map φ	Lift data to higher dimensions where linear separation exists
Kernel trick	K(x,z) = φ(x)^Tφ(z) computable without φ
Polynomial kernel	(x^Tz + c)^p: degree-p boundaries
RBF kernel	exp(−||x−z||²/2σ²): infinite-dim, local influence
Mercer's theorem	K valid ⇔ Gram matrix is PSD
Representer theorem	f*(x) = ∑_i c_i K(x_i, x): n coefficients suffice

The legacy of kernels: While deep learning has displaced SVMs in many tasks, the kernel perspective remains foundational. The neural tangent kernel (NTK) shows that, in the infinite-width limit, neural networks trained by gradient descent behave like kernel machines. Understanding kernels is understanding the mathematical soul of learning.

Connections:

Lecture 17: SVM Primal — Hard/soft margin in original space
Lecture 18: Convex Duality — How inner products emerged in the dual

Why is the kernel trick essential for using the RBF kernel?

The RBF feature map φ is infinite-dimensional, so we can never compute it, but K(x,z) takes O(d)
The RBF kernel is faster than the linear kernel
It avoids the need for training data