The trick that lets SVMs draw nonlinear boundaries by never leaving the comfort of inner products.
Consider the XOR pattern: four points at the corners of a square. Class +1 at top-left and bottom-right, class −1 at top-right and bottom-left. No line can separate them. Not a steep one, not a flat one, not any line at all.
This isn't a toy problem. Real data is full of such patterns: a tumor marker might be dangerous at very low AND very high values but safe in the middle. A credit score model might flag both very young AND very old first-time borrowers. Linear boundaries simply can't capture these relationships.
The linear SVM tries its best but cannot separate the XOR pattern. The best it can do is 50% accuracy.
One approach: manually engineer new features. For XOR, the product x1·x2 separates the classes perfectly (positive for same-sign corners, negative for opposite-sign corners). But hand-engineering features for every problem doesn't scale.
A feature map is a function φ: ℝd → ℝD that transforms each data point into a higher-dimensional space. In this new space, data that was linearly inseparable may become linearly separable.
Example: for 2D data x = (x1, x2), consider the quadratic feature map:
This maps 2D points into 3D. A linear boundary in 3D (a plane through φ-space) corresponds to a quadratic boundary back in the original 2D space — circles, ellipses, hyperbolas, all become possible.
The problem: D can be enormous. A degree-p polynomial map from ℝd has D = C(d+p, p) features. For d=100 and p=5, that's over 96 million features. Computing φ(x) explicitly is impractical.
The 2D XOR pattern (left) becomes linearly separable when lifted by φ(x) = (x1², x2², x1x2) into 3D (right). A plane in 3D = a curve in 2D.
But wait — recall from Lecture 18 that the dual SVM only uses inner products φ(xi)Tφ(xj). What if we could compute this inner product without ever computing φ explicitly?
Here is the magic. Consider the quadratic feature map φ(x) = (x1², x2², √2 x1x2). Compute the inner product in φ-space:
Now factor:
A kernel function K(x, z) computes the inner product in some feature space:
The cost of computing K(x, z) = (xTz)² is O(d) — just a dot product and a square. The cost of computing φ(x)Tφ(z) explicitly would be O(D), where D ≫ d. For the RBF kernel (coming next), D is infinite, yet K(x, z) still takes O(d) to compute.
For polynomial degree p in dimension d, compare the cost. The kernel trick wins massively as p and d grow.
Different kernels correspond to different feature spaces and different types of decision boundaries.
| Kernel | K(x, z) | Feature space dim | Boundary type |
|---|---|---|---|
| Linear | xTz | d (no lift) | Hyperplane |
| Polynomial | (xTz + c)p | C(d+p, p) | Polynomial curves |
| RBF (Gaussian) | exp(−||x−z||² / 2σ²) | ∞ | Arbitrary smooth |
Polynomial kernel: K(x, z) = (xTz + c)p. The constant c controls whether pure monomials (c=0) or mixed terms (c>0) are included. Degree p=2 gives quadratic boundaries, p=3 gives cubic, etc.
RBF (Radial Basis Function) kernel: K(x, z) = exp(−||x − z||² / 2σ²). This is the most widely used kernel. The parameter σ (bandwidth) controls smoothness: small σ = wiggly boundaries (each support vector has local influence), large σ = smooth boundaries (global influence).
Despite the infinite-dimensional feature space, K(x,z) = exp(−||x−z||²/2σ²) takes O(d) to compute. This is the kernel trick at its most dramatic.
Each kernel measures similarity differently. The heatmap shows K(x, z) for the orange point relative to every location. Drag the point to explore.
The kernel SVM dual is identical to the linear dual, with xiTxj replaced by K(xi, xj):
Prediction for a new point x:
Notice: we never compute w explicitly (it lives in the potentially infinite-dimensional φ-space). We only ever compute kernel evaluations K(xi, x). This is the full power of the kernel trick in action.
Click = class +1 (orange). Shift+click = class −1 (teal). Support vectors are circled in yellow.
Not every function K(x, z) is a valid kernel. If we want K to correspond to an inner product in some feature space (K(x,z) = φ(x)Tφ(z)), it must satisfy a mathematical condition.
Mercer's theorem: A continuous, symmetric function K(x, z) is a valid kernel if and only if the kernel matrix (Gram matrix) is positive semi-definite for any finite set of points.
The kernel matrix (Gram matrix) for points x1, …, xn is the n×n matrix:
Positive semi-definite means: for any vector c ∈ ℝn, cTGc ≥ 0. Equivalently, all eigenvalues of G are ≥ 0.
Useful closure properties — you can build complex kernels from simple ones:
For random 2D points, the Gram matrix Gij = K(xi,xj) has all non-negative eigenvalues (valid kernel). The bar chart shows eigenvalues sorted descending.
The representer theorem is a remarkable result that justifies the kernel approach far beyond SVMs. It says:
The solution is always a linear combination of kernel evaluations at the training points.
This is profound. The function space H might be infinite-dimensional (for the RBF kernel, it is). But the representer theorem guarantees that the optimal f* lives in the finite-dimensional subspace spanned by {K(x1, ·), K(x2, ·), …, K(xn, ·)}. We only need n coefficients c1, …, cn.
For the SVM specifically, ci = αiyi, and most ci are zero (non-support vectors). The representer theorem tells us this structure isn't specific to SVMs — it's a property of any kernel method with regularization.
f(x) = ∑ ciK(xi, x) is a weighted sum of "bumps" centered at training points. Drag coefficients to see how the function shape changes.
We've built the complete kernel machine pipeline: from the nonlinearity problem to feature maps to the kernel trick to Mercer's condition for valid kernels to the representer theorem guaranteeing finite solutions in infinite spaces.
| Concept | Key Idea |
|---|---|
| Feature map φ | Lift data to higher dimensions where linear separation exists |
| Kernel trick | K(x,z) = φ(x)Tφ(z) computable without φ |
| Polynomial kernel | (xTz + c)p — degree-p boundaries |
| RBF kernel | exp(−||x−z||²/2σ²) — infinite-dim, local influence |
| Mercer's theorem | K valid ⇔ Gram matrix is PSD |
| Representer theorem | f*(x) = ∑ciK(xi,x) — n coefficients suffice |
Connections: