EE269 Lecture 21 — Mert Pilanci, Stanford

RKHS & Kernel Regression

From Bayesian regression to reproducing kernel Hilbert spaces — when you need nonlinear models but still want closed-form solutions.

Prerequisites: Linear regression & ridge + Basic probability (Gaussian, Bayes' rule). That's it.
8
Chapters
5+
Simulations
0
Assumed Knowledge

Chapter 0: From Finite to Infinite Dimensions

In the last lecture, we fitted linear models: f(x) = wTx. This works great if the relationship between features and targets is actually linear. But what if it's not?

One approach: transform the features. Instead of using x directly, map it to a higher-dimensional space using a feature map φ(x), then do linear regression in the new space:

f(x) = wTφ(x)

For example, if x is 1D and we set φ(x) = [1, x, x2, x3], we get polynomial regression — a linear model in a higher-dimensional feature space.

The fundamental question. What if the "right" feature space is infinite-dimensional? You can't store an infinite-dimensional weight vector. You can't invert an infinite-dimensional matrix. Or can you? The kernel trick says: you never need to work in the high-dimensional space explicitly. All you need are inner products between mapped points.

Here's a preview. Suppose the only thing your algorithm ever computes is inner products ⟨φ(xi), φ(xj)⟩. If there exists a kernel function K(xi, xj) that equals this inner product, you can work entirely in terms of K — without ever computing φ itself. This is the kernel trick.

Linear vs. Nonlinear Fit

Data from a nonlinear function. Linear regression fails. But we can fit it perfectly with the right feature map.

Feature map Linear

Why Not Just Use More Features?

You might ask: why not just compute φ(x) explicitly and do linear regression in the high-dimensional space? For the polynomial kernel of degree d in p dimensions, the feature space has O(pd) dimensions. For degree 5 with 100 features: 10 billion dimensions. You can't even allocate the memory.

For the Gaussian kernel, it's worse: the feature space is literally infinite-dimensional (the Taylor expansion of exp has infinitely many terms). No finite computer can store φ(x). The kernel trick sidesteps this entirely: K(x, x') = exp(−‖x−x'‖2/2ℓ2) is a single scalar computation, yet it implicitly computes an inner product in an infinite-dimensional Hilbert space.

The journey ahead. Chapter 1: Bayesian view connects ridge to MAP. Chapter 2: kernel regression mechanics. Chapter 3: RKHS formalism. Chapter 4: the Representer Theorem (why N coefficients suffice). Chapter 5: Gaussian processes (uncertainty from kernels). Chapter 6: what regularization means in function space.
The kernel trick allows us to:

Chapter 1: Bayesian Regression

Before kernels, let's see ridge regression from a Bayesian perspective. This will motivate why kernels naturally arise.

The Generative Model

Assume the data comes from a Gaussian model:

y | x, w ~ N(wTx + b, σ2)

Each observation y is the linear prediction plus Gaussian noise with variance σ2. Now put a Gaussian prior on the weights:

w ~ N(0, τ2I)

This says: before seeing any data, we believe the weights are small (centered at zero) with variance τ2 per component.

MAP = Ridge

The maximum a posteriori (MAP) estimate maximizes the posterior p(w|data) ∝ p(data|w) · p(w). Taking the negative log:

−log p(w|data) ∝ (1/2σ2) ∑ (yi − wTxi)2 + (1/2τ2) ‖w‖2

Minimizing this is exactly ridge regression with λ = σ22.

Ridge = Bayesian MAP with Gaussian prior. This is a deep connection. The regularization parameter λ encodes your prior belief about how large the weights can be. A small τ2 (tight prior, small weights expected) means large λ (strong regularization). A large τ2 (vague prior) means small λ (trust the data more).

The Posterior Distribution

The full Bayesian approach doesn't just find a point estimate. The posterior over w is also Gaussian:

w | data ~ N(μpost, Σpost)
Σpost = (σ−2 XTX + τ−2 I)−1
μpost = σ−2 Σpost XTy

The posterior mean μpost equals the ridge solution. But we also get uncertainty: Σpost tells us how confident we are about each weight.

Worked Example

Suppose d = 1 (one feature), σ2 = 1, τ2 = 4. Then λ = σ22 = 0.25. Given N = 3 points with x = [1, 2, 3], y = [2.1, 3.9, 6.2] (and bias column), the MAP/ridge solution will shrink the weights slightly toward zero compared to OLS.

The effect is small here because λ = 0.25 is modest. But consider τ2 = 0.1 (tight prior: "I believe weights are near zero"). Then λ = 10, and the ridge solution is dramatically different from OLS — the weights are pulled almost to zero.

Predictive Distribution

For a new input x*, the predictive distribution integrates over the posterior on w:

p(y* | x*, data) = N(x*Tμpost,   σ2 + x*TΣpostx*)

The first term σ2 is the irreducible noise. The second term x*TΣpostx* is the epistemic uncertainty — uncertainty from not knowing the true weights. It's larger when x* is far from the training data (extrapolation). This is the precursor to GP uncertainty.

MAP Estimate = Ridge Solution

Adjust the prior variance τ². Large prior → trust data (OLS). Small prior → shrink weights (ridge).

τ² (prior var) 2.00
σ² (noise var) 1.00

Why Bayesian Leads to Kernels

The Bayesian predictive distribution at x* integrates over all possible weight vectors:

p(y* | x*, data) = ∫ p(y* | x*, w) p(w | data) dw

Since both terms are Gaussian, the integral is also Gaussian with mean x*Tμpost and variance σ2 + x*TΣpostx*. Now the key observation: the variance depends on x* only through x*TΣpostx*. Substituting Σpost and rearranging, this can be written entirely in terms of inner products between feature vectors. If we replace those inner products with a kernel function K, we get the GP predictive variance — no feature computation needed.

This is how the Bayesian perspective naturally leads to kernel methods: the entire posterior computation (mean and variance) depends on the feature vectors only through their inner products, which is exactly what a kernel provides.

A Gaussian prior p(w) = N(0, τ²I) with very small τ² corresponds to ridge regression with:

Chapter 2: Kernel Regression

Now the key move. Instead of specifying a finite feature map φ(x), we specify a kernel function K(x, x') that measures similarity between any two inputs. The model becomes:

f(x) = ∑i=1N αi K(xi, x)

The prediction at a new point x is a weighted sum of the kernel's similarity to every training point. The coefficients αi replace the weight vector w.

Common Kernels

NameK(x, x')Feature space
LinearxTx'Same as input (finite)
Polynomial (deg d)(xTx' + c)dAll monomials up to degree d
Gaussian (RBF)exp(−‖x − x'‖2 / 2ℓ2)Infinite-dimensional

The Gaussian kernel (also called the radial basis function or RBF kernel) is the most important. Its feature space is infinite-dimensional, yet we can compute K(x, x') in O(d) time. This is the kernel trick in action.

Kernel as Similarity Measure

Think of K(x, x') as measuring how similar two inputs are. For the Gaussian kernel, K(x, x') ≈ 1 when x and x' are close (within a few length scales ℓ), and K(x, x') ≈ 0 when they're far apart. The prediction f(x*) = ∑αiK(xi, x*) weights each training point by its similarity to the test point. Nearby training points have a big say; distant ones are ignored.

This is why ℓ matters so much: it controls the "radius of influence" of each training point. Small ℓ = only very nearby points matter = wiggly fits. Large ℓ = distant points still influence = smooth fits.

Solving for α

Define the kernel matrix (or Gram matrix) K where Kij = K(xi, xj). The ridge regression solution in kernel form is:

α = (K + λI)−1y
Dual form of ridge regression. In the primal form, we invert a d×d matrix (XTX + λI). In the dual (kernel) form, we invert an N×N matrix (K + λI). When d is huge (or infinite), the dual form is computationally tractable. When N is small, this is efficient even for infinite-dimensional feature spaces.

The length scale ℓ in the Gaussian kernel controls how "local" the model is. Small ℓ: each training point only influences nearby predictions (wiggly). Large ℓ: influence spreads widely (smooth).

Worked Example: The Kernel Trick in Action

Consider a 1D polynomial kernel K(x, x') = (xx' + 1)2. Expanding: (xx' + 1)2 = x2x'2 + 2xx' + 1. This is the inner product of φ(x) = [x2, √2·x, 1] with φ(x') = [x'2, √2·x', 1].

The feature map φ maps 1D inputs to 3D. But we never compute φ — we just evaluate K(x, x') = (xx' + 1)2 in O(1). For the Gaussian kernel, the implicit feature space has infinitely many dimensions (the Taylor expansion of exp has infinitely many terms), yet each kernel evaluation is still O(d).

From Primal to Dual

Start with ridge regression in the feature space: w = (XφTXφ + λI)−1XφTy, where Xφ has rows φ(xi)T. The prediction at x is:

f(x) = φ(x)Tw = φ(x)T(XφTXφ + λI)−1XφTy

Using the matrix identity AT(AAT + λI)−1 = (ATA + λI)−1AT, this becomes:

f(x) = k(x)T(K + λI)−1y

where k(x)i = K(xi, x) and Kij = K(xi, xj). Everything is expressed in terms of kernel evaluations — φ never appears.

Kernel Regression with Different Kernels

Choose a kernel type and adjust its parameters. Watch how the fit changes.

Kernel Gaussian
Length scale ℓ 0.50
λ 0.10
The Gaussian (RBF) kernel maps inputs to a feature space that is:

Chapter 3: What is an RKHS?

We've been working with kernels informally. Now let's make it rigorous. A Reproducing Kernel Hilbert Space (RKHS) is a special type of function space with a remarkable property.

Building Blocks

A Hilbert space is a (possibly infinite-dimensional) vector space with an inner product that is complete (all Cauchy sequences converge). Think of it as a generalization of Euclidean space to functions.

An RKHS H is a Hilbert space of functions f: X → R with one extra property: evaluation is continuous. Formally, for every point x ∈ X, the evaluation functional δx(f) = f(x) is a bounded linear functional on H.

Why "evaluation is continuous" matters. In some function spaces (like L2), you can't meaningfully evaluate f at a single point — changing f at one point doesn't change its L2 norm. In an RKHS, point evaluation is well-defined and well-behaved: if two functions are "close" in the RKHS norm, their values at every point are close.

The Reproducing Property

By the Riesz representation theorem, every bounded linear functional on a Hilbert space can be represented as an inner product with some element. So for each x, there exists a function Kx ∈ H such that:

f(x) = ⟨f, KxH   for all f ∈ H

The function K(x, x') = Kx(x') = ⟨Kx, Kx'H is the reproducing kernel of H. It "reproduces" function values via inner products.

K(x, x') = ⟨Kx, Kx'H

This is the same kernel function we used in kernel regression. The RKHS gives it a rigorous foundation.

Properties of the Kernel

A valid kernel must be positive semi-definite (PSD): for any set of points {x1, ..., xN}, the Gram matrix Kij = K(xi, xj) must be PSD. Conversely, any PSD function defines a unique RKHS (Moore-Aronszajn theorem).

PSD kernel K(x,x')
A symmetric function with PSD Gram matrices
RKHS H
A Hilbert space where f(x) = ⟨f, Kx
Feature map φ
K(x,x') = ⟨φ(x), φ(x')⟩
The RKHS norm. The norm ‖f‖H measures the "complexity" of function f. Smooth functions have small RKHS norms. Wiggly functions have large norms. Regularization in kernel regression (λ‖f‖H2) penalizes complexity in this precise sense.

Concrete Example

For the linear kernel K(x, x') = xTx', the RKHS is the set of linear functions f(x) = wTx, and the RKHS norm is ‖f‖H = ‖w‖. Regularizing ‖f‖H2 = ‖w‖2 is exactly ridge regression.

For the Gaussian kernel, the RKHS contains smooth functions. A function that interpolates every data point exactly (zero training error) will have a large RKHS norm because it oscillates wildly between points. A smoother function that approximately fits the data has a smaller norm. The regularizer λ‖f‖H2 explicitly prefers the smoother fit.

The One-to-One Correspondence

Every PSD kernel defines exactly one RKHS, and every RKHS has exactly one reproducing kernel. This is the Moore-Aronszajn theorem. The RKHS is constructed as the closure of the span of {K(x, ·) : x ∈ X}. The inner product is defined by ⟨K(x, ·), K(x', ·)⟩H = K(x, x'). Any function in this RKHS can be expressed (possibly as an infinite sum) as f = ∑ αi K(xi, ·).

Building Valid Kernels

Not every function K(x, x') is a valid kernel. It must be PSD. But you can build new kernels from existing ones:

The Gaussian kernel is valid because it can be written as exp(−‖x‖2/2ℓ2) · exp(xTx'/ℓ2) · exp(−‖x'‖2/2ℓ2), and the exponential of a PSD kernel is PSD (via the Taylor series: each term is a product of PSD kernels, and the sum of PSD kernels is PSD).

Why Not L2?

The space L2 (square-integrable functions) is a Hilbert space, but NOT an RKHS. In L2, two functions that differ at a single point are considered "the same" (they have the same L2 norm). So you can't meaningfully evaluate f(x) at a single point — it's not a continuous operation. In an RKHS, the norm is strong enough that nearby functions (in the RKHS norm) also have nearby values at each point. This "pointwise control" is exactly what makes RKHS useful for regression: we need f(xi) to be well-defined.

The intuition: an RKHS is a "nicer" function space than L2. It contains only functions that are smooth enough to be evaluated pointwise. The kernel controls exactly how smooth.

The "reproducing property" of an RKHS means:

Chapter 4: The Representer Theorem

Here's the most important result in kernel methods. You want to find the function f in an infinite-dimensional RKHS that minimizes:

minf ∈ Hi=1N L(yi, f(xi)) + λ ‖f‖H2

where L is any loss function. The RKHS is infinite-dimensional, so this seems like searching over infinitely many parameters. But the Representer Theorem says:

Representer Theorem. The minimizer has the form f*(x) = ∑i=1N αi K(xi, x). You only need N coefficients, one per training point. The infinite-dimensional optimization reduces to a finite N-dimensional problem.

Why This Works

Decompose any f ∈ H as f = f + f, where f is in the span of {Kx1, ..., KxN} and f is orthogonal to this span.

The loss term only sees f at x1, ..., xN. By the reproducing property, f(xi) = ⟨f, Kxi⟩ = ⟨f, Kxi⟩ (since f is orthogonal). So the loss doesn't depend on f.

The regularization term: ‖f‖2 = ‖f2 + ‖f2 ≥ ‖f2. Adding f only increases the penalty without changing the loss. So the optimal f has f = 0.

For Squared Loss: Explicit Solution

When L is squared loss, the problem becomes:

minα ∈ RN ‖Kα − y‖2 + λ αT

where we used ‖f‖H2 = αTKα (since f = ∑αiKxi and ⟨Kxi, Kxj⟩ = Kij). Taking the derivative:

2K(Kα − y) + 2λKα = 0 ⇒ (K + λI)α = y

which gives α = (K + λI)−1y, the familiar kernel ridge formula. The Representer Theorem guarantees this is also the optimum over the entire infinite-dimensional RKHS, not just over N-dimensional coefficient vectors.

Showcase: Kernel Regression Explorer

Click to add data points. The model fits f(x) = ∑ αi K(xi, x) using (K + λI)−1y. The faint blue curves show each individual kernel contribution αiK(xi, ·); the orange curve is their sum. Try different kernels and watch how the fit changes character.

Kernel Regression Showcase

Click canvas to add points. Adjust kernel type, bandwidth, and regularization.

Kernel Gaussian
Bandwidth ℓ 0.40
λ 0.050
Experiment ideas:
• With Gaussian kernel, shrink ℓ → 0.1 and watch each point get its own "bump"
• Increase ℓ → 2.0 for a very smooth interpolation
• Switch to linear kernel — you get back ordinary ridge regression
• Set λ very small — the function interpolates all points exactly
The Representer Theorem guarantees that the optimal function in an RKHS:

Chapter 5: Gaussian Processes

Kernel regression gives us point predictions. But what about uncertainty? Bayesian linear regression gave us a posterior distribution. Can we get the same thing with kernels?

A Gaussian process (GP) is a distribution over functions. Formally, f ~ GP(m(x), K(x, x')) means that for any finite set of points {x1, ..., xN}, the vector [f(x1), ..., f(xN)] is jointly Gaussian with mean [m(x1), ..., m(xN)] and covariance matrix Kij = K(xi, xj).

GP = infinite-dimensional Gaussian. Just as a multivariate Gaussian is specified by its mean vector and covariance matrix, a GP is specified by its mean function m(x) and covariance function K(x, x'). The covariance function is a kernel — it must be PSD.

GP Posterior (Prediction)

Given training data {(xi, yi)} with noise model y = f(x) + ε, ε ~ N(0, σ2), the posterior at a new point x* is:

f(x*) | data ~ N(μ*, σ*2)
μ* = k*T (K + σ2I)−1 y
σ*2 = K(x*, x*) − k*T (K + σ2I)−1 k*

where k* = [K(x1, x*), ..., K(xN, x*)]T. The posterior mean μ* is exactly kernel ridge regression with λ = σ2. But we also get σ*2: uncertainty that is small near training points and large far away.

Worked Example: GP with 2 Training Points

Let K be the Gaussian kernel with ℓ = 1, σ2 = 0.1. Training data: x1 = 0, y1 = 1 and x2 = 1, y2 = −0.5.

K = [[1.0, 0.607], [0.607, 1.0]]
(K + σ2I) = [[1.1, 0.607], [0.607, 1.1]]

Its inverse: det = 1.21 − 0.368 = 0.842. So (K + σ2I)−1 = [[1.306, −0.721], [−0.721, 1.306]]. Multiplying by y:

α = (K + σ2I)−1y = [1.306·1 + (−0.721)(−0.5), (−0.721)·1 + 1.306·(−0.5)] = [1.667, −1.374]

Prediction at x* = 0.5: k* = [K(0, 0.5), K(1, 0.5)] = [0.882, 0.882]. Mean = 0.882 · 1.667 + 0.882 · (−1.374) = 0.259. This is a weighted blend of the two observations, reflecting the kernel's smoothness.

Variance: σ*2 = K(0.5, 0.5) − k*T(K + σ2I)−1k* = 1.0 − [0.882, 0.882] · [1.667, −1.374] · ... The variance is small because x* = 0.5 is between two training points. At x* = 3 (far from data), σ*2 → K(3, 3) = 1 (full prior uncertainty).

Mercer's Theorem and the Eigenexpansion

Mercer's theorem states that a continuous PSD kernel on a compact domain can be expanded as:

K(x, x') = ∑j=1 λj φj(x) φj(x')

where λj ≥ 0 are eigenvalues and φj are eigenfunctions of the integral operator associated with K. The feature map is φ(x) = [√λ1φ1(x), √λ2φ2(x), ...]. For the Gaussian kernel, all λj > 0 (infinitely many), confirming the feature space is infinite-dimensional.

GP regression implicitly uses this eigenexpansion. The posterior shrinks each eigencomponent by λj/(λj + σ2) — high-eigenvalue components (smooth, data-aligned) are trusted, low-eigenvalue components (rough, noisy) are suppressed. This is the same spectral filtering as ridge regression in the eigenspace of XTX.

Why GP = kernel ridge + uncertainty. The GP posterior mean is identical to kernel ridge regression. The posterior variance adds a principled error bar. When you need predictions only, use kernel ridge. When you need to know how confident those predictions are — for active learning, Bayesian optimization, or safety-critical decisions — use GPs.
GP Posterior with Confidence Bands

The shaded region is the 95% confidence interval. Notice it narrows near data points.

ℓ (length scale) 0.50
σ² (noise) 0.10
GP intuition. Far from data: posterior mean reverts to the prior mean (zero), and variance returns to the prior variance K(x*, x*) = 1. The model says "I have no information here, so I default to my prior belief." Near training points: the posterior is pinned close to the observed values, with small variance. Between training points: the posterior smoothly interpolates, with moderate uncertainty that depends on how far apart the neighbors are.

This behavior is exactly right for decision-making under uncertainty. In Bayesian optimization, the GP posterior variance guides exploration: we sample at points where uncertainty is highest (most information to gain). In active learning, we query labels at points where the GP is most uncertain.

In a GP posterior, the predictive uncertainty σ*² is smallest:

Chapter 6: Regularization in Function Space

Let's tie everything together. In ordinary regression, we regularize the weight vector: λ‖w‖2. In kernel regression, we regularize the function itself: λ‖f‖H2.

What does the RKHS norm ‖f‖H actually measure? For the Gaussian kernel:

‖f‖H2 = ∫ |F(ω)|2 / S(ω) dω

where F(ω) is the Fourier transform of f and S(ω) is the spectral density of the kernel. High-frequency components of f are penalized more heavily (since S(ω) decays for large ω). The RKHS norm is a smoothness penalty in disguise.

Different kernels, different smoothness. The Gaussian kernel penalizes all frequencies, giving infinitely smooth functions. A Matérn kernel with parameter ν gives functions that are ν times differentiable. Choose the kernel to match your prior belief about the function's smoothness.

The Three Views

ViewOptimizationRegularizer
Primal (finite features)min ‖X w − y‖2 + λ‖w‖2Weight magnitude
Dual (kernel)min ‖Kα − y‖2 + λαTFunction RKHS norm
Bayesianmax p(y|f) · GP(f; 0, K)Prior over functions

These are three perspectives on the same optimization. Primal works when d is small. Dual works when N is small (or d is infinite). Bayesian gives us uncertainty.

Deriving the Equivalence

Start from the Bayesian view: the GP prior says f ~ GP(0, K). The likelihood is p(y|f) = N(f, σ2I). The MAP estimate minimizes:

−log p(f|y) ∝ (1/2σ2)‖f − y‖2 + (1/2)fTK−1f

where f = [f(x1), ..., f(xN)]. Multiply by 2σ2: minimize ‖f − y‖2 + σ2fTK−1f. Writing f = Kα:

‖Kα − y‖2 + σ2αT

Setting the gradient to zero: (K + σ2I)α = y. This is kernel ridge regression with λ = σ2. The same α, the same predictions, derived from three completely different starting points.

Choosing the Kernel

The kernel encodes your inductive bias — what you believe the function looks like before seeing data. Key choices:

The Bias-Variance Tradeoff Revisited

In kernel methods, the kernel parameters control the bias-variance tradeoff:

The RKHS norm provides a principled measure of complexity. Unlike polynomial degree (discrete), the RKHS norm is continuous — λ smoothly trades off fit quality against function complexity. This is one reason kernel methods generalize better than polynomial regression in high dimensions.

The deep connection: regularization in weight space (λ‖w‖2), regularization in function space (λ‖f‖H2), and Bayesian priors (p(f) = GP(0, K)) are three faces of the same idea. The RKHS makes this precise.

Implementation: Kernel Ridge Regression in NumPy

python
import numpy as np

def gaussian_kernel(X1, X2, ell=1.0):
    """Gram matrix K[i,j] = exp(-||x_i - x_j||^2 / 2*ell^2)"""
    sq = np.sum(X1**2, axis=1, keepdims=True)  # (N1, 1)
         + np.sum(X2**2, axis=1)              # (N2,)
         - 2 * X1 @ X2.T                      # (N1, N2)
    return np.exp(-sq / (2 * ell**2))

def kernel_ridge(X_train, y_train, X_test, ell=1.0, lam=0.1):
    """Kernel ridge regression with Gaussian kernel."""
    K = gaussian_kernel(X_train, X_train, ell)  # (N, N)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    K_test = gaussian_kernel(X_test, X_train, ell)  # (M, N)
    return K_test @ alpha  # predictions

def gp_predict(X_train, y_train, X_test, ell=1.0, sig2=0.1):
    """GP posterior mean + variance."""
    K = gaussian_kernel(X_train, X_train, ell) + sig2 * np.eye(len(X_train))
    K_star = gaussian_kernel(X_test, X_train, ell)
    alpha = np.linalg.solve(K, y_train)
    mu = K_star @ alpha                          # posterior mean
    v = np.linalg.solve(K, K_star.T)
    var = 1.0 - np.sum(K_star * v.T, axis=1)    # posterior variance
    return mu, var
The RKHS norm ‖f‖_H measures:

Chapter 7: Mastery

Let's map the entire journey from linear regression to RKHS.

Linear Regression
f(x) = wTx, solve normal equations
↓ add prior on w
Bayesian / Ridge
MAP with N(0, τ²I) = ridge with λ = σ²/τ²
↓ replace x with φ(x)
Feature-Space Ridge
f(x) = wTφ(x), dual: α = (K+λI)−1y
↓ let φ be infinite-dim
Kernel Regression / RKHS
f = ∑ αi K(xi, ·), Representer Theorem
↓ full posterior over f
Gaussian Process
f ~ GP(0, K), posterior = mean + uncertainty
ConceptKey EquationInsight
Kernel trickK(x,x') = ⟨φ(x), φ(x')⟩Never compute φ explicitly
Representer Theoremf* = ∑ αi K(xi, ·)∞-dim search → N coefficients
GP posterior meanμ* = k*T(K+σ²I)−1ySame as kernel ridge regression
GP posterior varσ*² = K(**) − k*T(K+σ²I)−1k*Uncertainty ↑ far from data

Connections

The RKHS framework is one of the most powerful tools in statistical learning theory. It gives a rigorous language for "function complexity," connects Bayesian and frequentist methods through the kernel, and the Representer Theorem turns infinite-dimensional optimization into finite computation. Every time you use a Gaussian process, an SVM, or analyze a neural network through the NTK, you're working in an RKHS.

What you can now do:
• Derive ridge regression as MAP with Gaussian prior
• Compute kernel regression predictions: α = (K + λI)−1y
• Explain what an RKHS is and why evaluation must be continuous
• State and explain the Representer Theorem
• Compute GP posterior mean and variance
• Choose between kernels based on smoothness assumptions

Computational Complexity

MethodTraining CostPrediction CostMemory
Primal ridge (d features)O(Nd2 + d3)O(d)O(d)
Kernel ridge (N points)O(N3)O(N)O(N2)
GP posteriorO(N3)O(N) mean, O(N2) varO(N2)

The O(N3) cost of kernel methods is the main bottleneck. For N > 10,000, approximate methods are needed: random Fourier features (Rahimi & Recht, 2007), inducing points (Snelson & Ghahramani, 2006), or structured kernel interpolation.

When to Use Which

Decision guide for practitioners:

In modern deep learning, the neural tangent kernel (NTK) connects neural networks to kernel methods: an infinitely wide neural network trained with gradient descent is equivalent to kernel regression with the NTK. This gives theoretical tools to analyze neural network generalization via RKHS theory.

Kernel Hyperparameter Selection

The kernel parameters (length scale ℓ, polynomial degree d) and regularization λ dramatically affect performance. Two standard approaches:

Cross-validation: Split data into K folds, evaluate prediction error for each (ℓ, λ) pair, pick the best. Computational cost: O(K · N3) per parameter setting. Works for any kernel method.

Marginal likelihood (for GPs): Maximize log p(y | ℓ, σ2) = −(1/2)yT(K + σ2I)−1y − (1/2)log|K + σ2I| − (N/2)log(2π). The first term favors data fit, the second penalizes model complexity (Occam's razor built in). Gradient-based optimization finds the best kernel parameters automatically. This is the preferred method for GPs because it avoids the expense of cross-validation.

The marginal likelihood is sometimes called the evidence. It automatically balances model complexity against data fit — too flexible a kernel (very small ℓ) gets penalized by the log-determinant term, while too rigid a kernel (very large ℓ) gets penalized by the data fit term. This is the Bayesian answer to cross-validation.

Random Fourier Features. For shift-invariant kernels like the Gaussian, K(x, x') ≈ z(x)Tz(x') where z(x) is a D-dimensional random feature vector. This turns kernel regression back into primal ridge with D random features — O(ND2) instead of O(N3). Choose D << N for massive speedups.

"The kernel trick is one of the most beautiful mathematical ideas in machine learning. It says: you can work in a million-dimensional space by only ever computing dot products." — Pilanci, EE269