EE269 Lecture 21 — RKHS & Kernel Regression

Chapter 0: From Finite to Infinite Dimensions

In the last lecture, we fitted linear models: f(x) = w^Tx. This works great if the relationship between features and targets is actually linear. But what if it's not?

One approach: transform the features. Instead of using x directly, map it to a higher-dimensional space using a feature map φ(x), then do linear regression in the new space:

f(x) = w^Tφ(x)

For example, if x is 1D and we set φ(x) = [1, x, x², x³], we get polynomial regression — a linear model in a higher-dimensional feature space.
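As a minimal sketch of the cubic example above, the snippet below builds φ(x) = [1, x, x², x³] and runs ordinary least squares in feature space. The data, the coefficients of the generating cubic, and the names `phi`, `x_train`, `y_train` are hypothetical, chosen only for illustration.

```python
import numpy as np

def phi(x):
    """Cubic feature map: each scalar x becomes the row [1, x, x^2, x^3]."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

# Hypothetical noiseless data from a cubic (illustration only).
x_train = np.linspace(-2.0, 2.0, 9)
y_train = 0.5 * x_train**3 - x_train + 1.0

# Least squares in feature space: the model is linear in w, not in x.
w, *_ = np.linalg.lstsq(phi(x_train), y_train, rcond=None)

def f(x):
    """Prediction f(x) = w^T phi(x)."""
    return phi(x) @ w

print(np.round(w, 3))  # coefficients for [1, x, x^2, x^3]
```

The fit is nonlinear in x, yet the solver only ever sees a linear problem in the four columns of φ.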

The fundamental question. What if the "right" feature space is infinite-dimensional? You can't store an infinite-dimensional weight vector. You can't invert an infinite-dimensional matrix. Or can you? The kernel trick says: you never need to work in the high-dimensional space explicitly. All you need are inner products between mapped points.

Here's a preview. Suppose the only thing your algorithm ever computes is inner products ⟨φ(x_i), φ(x_j)⟩. If there exists a kernel function K(x_i, x_j) that equals this inner product, you can work entirely in terms of K — without ever computing φ itself. This is the kernel trick.

Linear vs. Nonlinear Fit

Data from a nonlinear function. Linear regression fails. But we can fit it closely with the right feature map.


Why Not Just Use More Features?

You might ask: why not just compute φ(x) explicitly and do linear regression in the high-dimensional space? For the polynomial kernel of degree d in p dimensions, the feature space has C(p+d, d) = O(p^d) dimensions. For degree 5 with 100 features, that is C(105, 5) ≈ 96.6 million dimensions per input (the looser p^d estimate gives 10 billion). Storing a design matrix of that width for even a few thousand samples is already impractical.

For the Gaussian kernel, it's worse: the feature space is literally infinite-dimensional (the Taylor expansion of exp has infinitely many terms). No finite computer can store φ(x). The kernel trick sidesteps this entirely: K(x, x') = exp(−‖x−x'‖²/2ℓ²) is a single scalar computation, yet it implicitly computes an inner product in an infinite-dimensional Hilbert space.
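To make the contrast concrete, here is a small sketch that counts the monomials of degree at most d in p variables, C(p + d, d), and compares that with the cost of one Gaussian kernel evaluation. The helper names `n_monomials` and `gaussian_k` and the random inputs are assumptions for illustration.

```python
import math
import numpy as np

def n_monomials(p, d):
    """Number of monomials of degree <= d in p variables: C(p + d, d)."""
    return math.comb(p + d, d)

def gaussian_k(x, xp, ell=1.0):
    """One Gaussian kernel evaluation: O(p) work, a single scalar out."""
    diff = x - xp
    return float(np.exp(-(diff @ diff) / (2.0 * ell**2)))

p, d = 100, 5
dim = n_monomials(p, d)          # explicit polynomial feature dimension

rng = np.random.default_rng(0)
x, xp = rng.normal(size=p), rng.normal(size=p)
k_val = gaussian_k(x, xp, ell=10.0)

print(dim, k_val)
```

The explicit feature count grows combinatorially with p and d, while the kernel evaluation stays a single pass over the p input coordinates.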

The journey ahead. Chapter 1: Bayesian view connects ridge to MAP. Chapter 2: kernel regression mechanics. Chapter 3: RKHS formalism. Chapter 4: the Representer Theorem (why N coefficients suffice). Chapter 5: Gaussian processes (uncertainty from kernels). Chapter 6: what regularization means in function space.

The kernel trick allows us to:

• Reduce the number of training samples needed
• Work in high-dimensional feature spaces without computing features explicitly
• Avoid regularization entirely

Chapter 1: Bayesian Regression

Before kernels, let's see ridge regression from a Bayesian perspective. This will motivate why kernels naturally arise.

The Generative Model

Assume the data comes from a Gaussian model:

y | x, w ~ N(w^Tx + b, σ²)

Each observation y is the linear prediction plus Gaussian noise with variance σ². From here on, the bias b is absorbed into w by appending a constant 1 to x. Now put a Gaussian prior on the weights:

w ~ N(0, τ²I)

This says: before seeing any data, we believe the weights are small (centered at zero) with variance τ² per component.

MAP = Ridge

The maximum a posteriori (MAP) estimate maximizes the posterior p(w|data) ∝ p(data|w) · p(w). Taking the negative log:

−log p(w|data) = (1/2σ²) ∑_i (y_i − w^Tx_i)² + (1/2τ²) ‖w‖² + const

Minimizing this is exactly ridge regression with λ = σ²/τ².

Ridge = Bayesian MAP with Gaussian prior. This is a deep connection. The regularization parameter λ encodes your prior belief about how large the weights can be. A small τ² (tight prior, small weights expected) means large λ (strong regularization). A large τ² (vague prior) means small λ (trust the data more).

The Posterior Distribution

The full Bayesian approach doesn't just find a point estimate. The posterior over w is also Gaussian:

w | data ~ N(μ_post, Σ_post)

Σ_post = (σ⁻² X^TX + τ⁻² I)⁻¹

μ_post = σ⁻² Σ_post X^Ty

The posterior mean μ_post equals the ridge solution. But we also get uncertainty: Σ_post tells us how confident we are about each weight.
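The claim that the posterior mean equals the ridge solution can be checked numerically. This is a sketch on synthetic data; the sizes, the variances, and `w_true` are hypothetical values picked for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
sigma2, tau2 = 0.5, 2.0                  # noise and prior variances (illustrative)
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])      # hypothetical ground truth
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=N)

# Posterior covariance and mean, following the formulas above.
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

# Ridge solution with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(mu_post, w_ridge))
```

The diagonal of `Sigma_post` gives the per-weight posterior variances that the point estimate alone does not provide.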

Worked Example

Suppose d = 1 (one feature), σ² = 1, τ² = 4. Then λ = σ²/τ² = 0.25. Given N = 3 points with x = [1, 2, 3], y = [2.1, 3.9, 6.2] (and bias column), the MAP/ridge solution will shrink the weights slightly toward zero compared to OLS.

The effect is small here because λ = 0.25 is modest. But consider τ² = 0.1 (tight prior: "I believe weights are near zero"). Then λ = 10, and the ridge solution differs substantially from OLS: the slope shrinks to roughly half its OLS value.
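The numbers in this worked example can be reproduced with a few lines. The sketch assumes the bias column is penalized along with the slope, which is what the prior N(0, τ²I) on all weights implies; the helper name `ridge` is introduced here for illustration.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # columns: bias, x
y = np.array([2.1, 3.9, 6.2])

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)      # no prior
w_mild = ridge(X, y, 0.25)    # tau^2 = 4
w_tight = ridge(X, y, 10.0)   # tau^2 = 0.1

for name, w in [("OLS", w_ols), ("lambda=0.25", w_mild), ("lambda=10", w_tight)]:
    print(f"{name:12s} intercept={w[0]: .3f} slope={w[1]: .3f} norm={np.linalg.norm(w):.3f}")
```

The overall weight norm decreases as λ grows, even though an individual coordinate (here the intercept) can move away from zero.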

Predictive Distribution

For a new input x*, the predictive distribution integrates over the posterior on w:

p(y* | x*, data) = N(x*^Tμ_post, σ² + x*^TΣ_postx*)

The first term σ² is the irreducible noise. The second term x*^TΣ_postx* is the epistemic uncertainty — uncertainty from not knowing the true weights. It's larger when x* is far from the training data (extrapolation). This is the precursor to GP uncertainty.
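A short sketch of the predictive variance, reusing the three-point data set from the worked example with σ² = 1 and τ² = 4. The function name `predictive` and the two query points are assumptions for illustration.

```python
import numpy as np

sigma2, tau2 = 1.0, 4.0
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # bias column + feature
y = np.array([2.1, 3.9, 6.2])

Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

def predictive(x_star):
    """Mean and variance of y* at a scalar input x_star."""
    xs = np.array([1.0, x_star])
    mean = xs @ mu_post
    var = sigma2 + xs @ Sigma_post @ xs    # noise + epistemic term
    return mean, var

m_in, v_in = predictive(2.0)      # inside the training range
m_out, v_out = predictive(10.0)   # far outside it
print(v_in, v_out)
```

The noise floor σ² is the same at both points; only the epistemic term grows under extrapolation.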

MAP Estimate = Ridge Solution

Adjust the prior variance τ². Large prior → trust data (OLS). Small prior → shrink weights (ridge).


Why Bayesian Leads to Kernels

The Bayesian predictive distribution at x* integrates over all possible weight vectors:

p(y* | x*, data) = ∫ p(y* | x*, w) p(w | data) dw

Since both terms are Gaussian, the integral is also Gaussian with mean x*^Tμ_post and variance σ² + x*^TΣ_postx*. Now the key observation: the variance depends on x* only through x*^TΣ_postx*. Substituting Σ_post and rearranging, this can be written entirely in terms of inner products between feature vectors. If we replace those inner products with a kernel function K, we get the GP predictive variance — no feature computation needed.

This is how the Bayesian perspective naturally leads to kernel methods: the entire posterior computation (mean and variance) depends on the feature vectors only through their inner products, which is exactly what a kernel provides.
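One way to carry out the rearrangement described above is the Woodbury identity. The sketch below assumes the prior variance is folded into a linear kernel, K(x, x') = τ² x^Tx', so that K = τ² XX^T and k_* = τ² X x_*; this particular choice of notation is an assumption made for the illustration.

```latex
\begin{align*}
\Sigma_{\text{post}}
  &= \left(\sigma^{-2} X^\top X + \tau^{-2} I\right)^{-1} \\
  &= \tau^{2} I - \tau^{2} X^\top
     \left(X X^\top + \tfrac{\sigma^{2}}{\tau^{2}} I\right)^{-1} X
     && \text{(Woodbury identity)} \\[4pt]
x_*^\top \Sigma_{\text{post}}\, x_*
  &= \tau^{2} x_*^\top x_*
     - \tau^{2} (X x_*)^\top
       \left(X X^\top + \tfrac{\sigma^{2}}{\tau^{2}} I\right)^{-1} (X x_*) \\
  &= K(x_*, x_*) - k_*^\top \left(K + \sigma^{2} I\right)^{-1} k_* .
\end{align*}
```

Every term on the last line is an inner product between inputs, so replacing x^Tx' with any other kernel leaves the expression well-defined.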

A Gaussian prior p(w) = N(0, τ²I) with very small τ² corresponds to ridge regression with:

• Large λ (strong regularization)
• Small λ (weak regularization)
• λ = 0 (no regularization)

Chapter 2: Kernel Regression

Now the key move. Instead of specifying a finite feature map φ(x), we specify a kernel function K(x, x') that measures similarity between any two inputs. The model becomes:

f(x) = ∑_{i=1}^N α_i K(x_i, x)

The prediction at a new point x is a weighted sum of the kernel's similarity to every training point. The coefficients α_i replace the weight vector w.

Common Kernels

Name	K(x, x')	Feature space
Linear	x^Tx'	Same as input (finite)
Polynomial (deg d)	(x^Tx' + c)^d	All monomials up to degree d
Gaussian (RBF)	exp(−‖x − x'‖² / 2ℓ²)	Infinite-dimensional

The Gaussian kernel (also called the radial basis function or RBF kernel) is the most important. Its feature space is infinite-dimensional, yet we can compute K(x, x') in O(d) time. This is the kernel trick in action.

Kernel as Similarity Measure

Think of K(x, x') as measuring how similar two inputs are. For the Gaussian kernel, K(x, x') ≈ 1 when x and x' are close (separated by much less than one length scale ℓ), and K(x, x') ≈ 0 when they are more than a few length scales apart. The prediction f(x*) = ∑α_iK(x_i, x*) weights each training point by its similarity to the test point. Nearby training points have a big say; distant ones are ignored.

This is why ℓ matters so much: it controls the "radius of influence" of each training point. Small ℓ = only very nearby points matter = wiggly fits. Large ℓ = distant points still influence = smooth fits.
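To get a feel for the radius of influence, the sketch below tabulates the Gaussian kernel as a function of distance measured in units of ℓ. The function name `gaussian_k` and the chosen distance ratios are illustrative assumptions.

```python
import numpy as np

def gaussian_k(r, ell):
    """Gaussian kernel as a function of the distance r = ||x - x'||."""
    return np.exp(-r**2 / (2.0 * ell**2))

ell = 1.0
ratios = np.array([0.1, 0.5, 1.0, 2.0, 3.0])   # distance in units of ell
values = gaussian_k(ratios * ell, ell)

for ratio, k in zip(ratios, values):
    print(f"r = {ratio:.1f} ell -> K = {k:.3f}")
```

Because the kernel depends only on r/ℓ, rescaling ℓ stretches or shrinks this whole profile without changing its shape.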

Solving for α

Define the kernel matrix (or Gram matrix) K where K_ij = K(x_i, x_j). The ridge regression solution in kernel form is:

α = (K + λI)⁻¹y

Dual form of ridge regression. In the primal form, we invert a d×d matrix (X^TX + λI). In the dual (kernel) form, we invert an N×N matrix (K + λI). The dual cost depends on N rather than d, so it stays tractable when d is huge (or infinite) as long as N is moderate.

The length scale ℓ in the Gaussian kernel controls how "local" the model is. Small ℓ: each training point only influences nearby predictions (wiggly). Large ℓ: influence spreads widely (smooth).

Worked Example: The Kernel Trick in Action

Consider a 1D polynomial kernel K(x, x') = (xx' + 1)². Expanding: (xx' + 1)² = x²x'² + 2xx' + 1. This is the inner product of φ(x) = [x², √2·x, 1] with φ(x') = [x'², √2·x', 1].

The feature map φ maps 1D inputs to 3D. But we never compute φ — we just evaluate K(x, x') = (xx' + 1)² in O(1). For the Gaussian kernel, the implicit feature space has infinitely many dimensions (the Taylor expansion of exp has infinitely many terms), yet each kernel evaluation is still O(d).
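The identity in this example is easy to verify numerically. A minimal sketch, with `phi` and `k_poly` as illustrative names and arbitrary test inputs:

```python
import numpy as np

def phi(x):
    """Explicit feature map [x^2, sqrt(2) x, 1] for a scalar input."""
    return np.array([x**2, np.sqrt(2.0) * x, 1.0])

def k_poly(x, xp):
    """Degree-2 polynomial kernel, evaluated without any feature map."""
    return (x * xp + 1.0) ** 2

x, xp = 1.5, -0.7
lhs = phi(x) @ phi(xp)    # inner product in the 3D feature space
rhs = k_poly(x, xp)       # one multiply, one add, one square
print(lhs, rhs)
```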

From Primal to Dual

Start with ridge regression in the feature space: w = (X_φ^TX_φ + λI)⁻¹X_φ^Ty, where X_φ has rows φ(x_i)^T. The prediction at x is:

f(x) = φ(x)^Tw = φ(x)^T(X_φ^TX_φ + λI)⁻¹X_φ^Ty

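Using the matrix identity A^T(AA^T + λI)⁻¹ = (A^TA + λI)⁻¹A^T with A = X_φ, the prediction is f(x) = φ(x)^TX_φ^T(X_φX_φ^T + λI)⁻¹y. Since X_φX_φ^T = K and X_φφ(x) = k(x), this becomes: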

f(x) = k(x)^T(K + λI)⁻¹y

where k(x)_i = K(x_i, x) and K_ij = K(x_i, x_j). Everything is expressed in terms of kernel evaluations — φ never appears.
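The equivalence of the two forms can be checked on a kernel whose feature map is known. The sketch below uses the degree-2 polynomial kernel (xx' + 1)² with its 3D feature map; the training data, the test grid, and λ are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, x') = (x x' + 1)^2 on 1D inputs."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**2, np.sqrt(2.0) * x, np.ones_like(x)], axis=1)

def kernel(a, b):
    """Gram matrix of the same kernel, with no feature map involved."""
    return (np.outer(a, b) + 1.0) ** 2

rng = np.random.default_rng(1)
x_train = rng.uniform(-2, 2, size=8)
y_train = np.sin(x_train)                 # hypothetical targets
x_test = np.linspace(-2, 2, 5)
lam = 0.1

# Primal: solve a 3x3 system for w.
Phi = phi(x_train)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y_train)
f_primal = phi(x_test) @ w

# Dual: solve an 8x8 system for alpha.
K = kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
f_dual = kernel(x_test, x_train) @ alpha

print(np.allclose(f_primal, f_dual))
```

The two coefficient vectors are related by w = X_φ^Tα, which is why the predictions coincide.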

Kernel Regression with Different Kernels

Choose a kernel type and adjust its parameters. Watch how the fit changes.


The Gaussian (RBF) kernel maps inputs to a feature space that is:

• Infinite-dimensional
• The same dimension as the input
• Dimension equal to the number of training points

Chapter 3: What is an RKHS?

We've been working with kernels informally. Now let's make it rigorous. A Reproducing Kernel Hilbert Space (RKHS) is a special type of function space with a remarkable property.

Building Blocks

A Hilbert space is a (possibly infinite-dimensional) vector space with an inner product that is complete (all Cauchy sequences converge). Think of it as a generalization of Euclidean space to functions.

An RKHS H is a Hilbert space of functions f: X → R with one extra property: evaluation is continuous. Formally, for every point x ∈ X, the evaluation functional δ_x(f) = f(x) is a bounded linear functional on H.

Why "evaluation is continuous" matters. In some function spaces (like L²), you can't meaningfully evaluate f at a single point — changing f at one point doesn't change its L² norm. In an RKHS, point evaluation is well-defined and well-behaved: if two functions are "close" in the RKHS norm, their values at every point are close.

The Reproducing Property

By the Riesz representation theorem, every bounded linear functional on a Hilbert space can be represented as an inner product with some element. So for each x, there exists a function K_x ∈ H such that:

f(x) = ⟨f, K_x⟩_H for all f ∈ H

The function K(x, x') = K_x(x') = ⟨K_x, K_x'⟩_H is the reproducing kernel of H. It "reproduces" function values via inner products.

K(x, x') = ⟨K_x, K_x'⟩_H

This is the same kernel function we used in kernel regression. The RKHS gives it a rigorous foundation.

Properties of the Kernel

A valid kernel must be symmetric and positive semi-definite (PSD): for any set of points {x₁, ..., x_N}, the Gram matrix K_ij = K(x_i, x_j) must be PSD. Conversely, any symmetric PSD function defines a unique RKHS (Moore-Aronszajn theorem).

PSD kernel K(x,x'): a symmetric function with PSD Gram matrices

↔

RKHS H: a Hilbert space where f(x) = ⟨f, K_x⟩

↔

Feature map φ: K(x,x') = ⟨φ(x), φ(x')⟩

The RKHS norm. The norm ‖f‖_H measures the "complexity" of function f. Smooth functions have small RKHS norms. Wiggly functions have large norms. Regularization in kernel regression (λ‖f‖_H²) penalizes complexity in this precise sense.

Concrete Example

For the linear kernel K(x, x') = x^Tx', the RKHS is the set of linear functions f(x) = w^Tx, and the RKHS norm is ‖f‖_H = ‖w‖. Regularizing ‖f‖_H² = ‖w‖² is exactly ridge regression.

For the Gaussian kernel, the RKHS contains smooth functions. With noisy data, a function that interpolates every data point exactly (zero training error) tends to have a large RKHS norm because it must oscillate between points to chase the noise. A smoother function that approximately fits the data has a smaller norm. The regularizer λ‖f‖_H² explicitly prefers the smoother fit.
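The trade-off between fit and RKHS norm can be observed directly. The sketch below fits kernel ridge to hypothetical noisy samples of a sine wave with two values of λ and reports ‖f‖_H² = α^TKα together with the training error. The data, the length scale, and the helper names `gram` and `fit` are assumptions for illustration.

```python
import numpy as np

def gram(a, b, ell):
    """Gaussian Gram matrix for 1D inputs."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.size)   # hypothetical noisy data
K = gram(x, x, ell=0.1)

def fit(lam):
    """Return (RKHS norm squared, training MSE) of the kernel ridge fit."""
    alpha = np.linalg.solve(K + lam * np.eye(x.size), y)
    norm_sq = alpha @ K @ alpha              # ||f||_H^2 = alpha^T K alpha
    train_mse = np.mean((K @ alpha - y)**2)
    return norm_sq, train_mse

norm_tight, mse_tight = fit(1e-6)    # very weak regularization, tight fit
norm_smooth, mse_smooth = fit(1e-1)  # stronger regularization, looser fit
print(norm_tight, mse_tight)
print(norm_smooth, mse_smooth)
```

Raising λ lowers the norm and raises the training error, which is the trade-off the regularizer encodes.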

The One-to-One Correspondence

Every PSD kernel defines exactly one RKHS, and every RKHS has exactly one reproducing kernel. This is the Moore-Aronszajn theorem. The RKHS is constructed as the closure of the span of {K(x, ·) : x ∈ X}. The inner product is defined by ⟨K(x, ·), K(x', ·)⟩_H = K(x, x'). Any function in this RKHS can be expressed (possibly as an infinite sum) as f = ∑ α_i K(x_i, ·).

Building Valid Kernels

Not every function K(x, x') is a valid kernel. It must be PSD. But you can build new kernels from existing ones:

Sum: K₁ + K₂ is PSD if both are PSD
Product: K₁ · K₂ is PSD if both are PSD
Scaling: cK is PSD for c > 0
Composition: K(x,x') = p(⟨x,x'⟩) is PSD if p is a polynomial with positive coefficients
Exponentiation: exp(K) is PSD if K is PSD

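The Gaussian kernel is valid because it can be written as exp(−‖x‖²/2ℓ²) · exp(x^Tx'/ℓ²) · exp(−‖x'‖²/2ℓ²). The middle factor is the exponential of a scaled linear kernel, and the exponential of a PSD kernel is PSD (via the Taylor series: each term is a product of PSD kernels, and the sum of PSD kernels is PSD). The outer factors have the form g(x)g(x'), which is itself a PSD kernel, so the full product is PSD by the product rule.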
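A numerical sanity check of these rules, not a proof: the sketch builds a few Gram matrices on random points, confirms that the three-factor form of the Gaussian kernel matches direct evaluation, and checks that the smallest eigenvalues are non-negative up to round-off. The point set, ℓ, and the tolerance are illustrative assumptions.

```python
import numpy as np

def min_eig(G):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(G).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
ell = 1.5

K_lin = X @ X.T                         # linear kernel
K_poly = (K_lin + 1.0) ** 2             # degree-2 polynomial kernel
sq_norms = np.sum(X**2, axis=1)

# Three-factor form: g(x) * exp(x^T x' / ell^2) * g(x').
g = np.exp(-sq_norms / (2 * ell**2))
K_gauss_factored = g[:, None] * np.exp(K_lin / ell**2) * g[None, :]

# Direct evaluation for comparison.
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * K_lin
K_gauss = np.exp(-sq_dists / (2 * ell**2))

tol = -1e-8   # allow for floating-point round-off
checks = {
    "sum": min_eig(K_lin + K_poly) > tol,
    "product": min_eig(K_lin * K_poly) > tol,   # elementwise product of Gram matrices
    "gaussian": min_eig(K_gauss) > tol,
}
print(checks)
```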

Why Not L²?

The space L² (square-integrable functions) is a Hilbert space, but NOT an RKHS. In L², two functions that differ at a single point are considered "the same" (their difference has zero L² norm). So you can't meaningfully evaluate f(x) at a single point: evaluation is not a continuous operation. In an RKHS, the norm is strong enough that nearby functions (in the RKHS norm) also have nearby values at each point. This "pointwise control" is exactly what makes RKHS useful for regression: we need f(x_i) to be well-defined.

The intuition: an RKHS is a "nicer" function space than L². It contains only functions that are smooth enough to be evaluated pointwise. The kernel controls exactly how smooth.

The "reproducing property" of an RKHS means:

• Function values can be computed as inner products with the kernel: f(x) = ⟨f, K_x⟩
• The space can reproduce any continuous function exactly
• Functions in the space are always periodic

Chapter 4: The Representer Theorem

Here's the most important result in kernel methods. You want to find the function f in an infinite-dimensional RKHS that minimizes:

min_{f ∈ H} ∑_{i=1}^N L(y_i, f(x_i)) + λ ‖f‖_H²

where L is any loss function. The RKHS is infinite-dimensional, so this seems like searching over infinitely many parameters. But the Representer Theorem says:

Representer Theorem. The minimizer has the form f*(x) = ∑_{i=1}^N α_i K(x_i, x). You only need N coefficients, one per training point. The infinite-dimensional optimization reduces to a finite N-dimensional problem.

Why This Works

Decompose any f ∈ H as f = f_∥ + f_⊥, where f_∥ is in the span of {K_{x_1}, ..., K_{x_N}} and f_⊥ is orthogonal to this span.

The loss term only sees f at x₁, ..., x_N. By the reproducing property, f(x_i) = ⟨f, K_{x_i}⟩ = ⟨f_∥, K_{x_i}⟩ (since f_⊥ is orthogonal). So the loss doesn't depend on f_⊥.

The regularization term: ‖f‖² = ‖f_∥‖² + ‖f_⊥‖² ≥ ‖f_∥‖². Adding f_⊥ only increases the penalty without changing the loss. So the optimal f has f_⊥ = 0.

For Squared Loss: Explicit Solution

When L is squared loss, the problem becomes:

min_{α ∈ R^N} ‖Kα − y‖² + λ α^TKα

where we used ‖f‖_H² = α^TKα (since f = ∑_i α_iK_{x_i} and ⟨K_{x_i}, K_{x_j}⟩ = K_ij). Taking the derivative with respect to α and setting it to zero:

2K(Kα − y) + 2λKα = 0 ⇔ K[(K + λI)α − y] = 0 ⇐ (K + λI)α = y

which gives α = (K + λI)⁻¹y, the familiar kernel ridge formula. The Representer Theorem guarantees this is also the optimum over the entire infinite-dimensional RKHS, not just over N-dimensional coefficient vectors.
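As a numerical check of this derivation, the sketch below solves (K + λI)α = y on hypothetical data, evaluates the gradient of the objective at the solution, and compares the objective against randomly perturbed coefficient vectors. The data, ℓ, λ, and the perturbation scale are assumptions for illustration.

```python
import numpy as np

def gram(a, b, ell=0.5):
    """Gaussian Gram matrix for 1D inputs."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=12)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)   # hypothetical noisy targets
lam = 0.05
K = gram(x, x)

def objective(alpha):
    """||K alpha - y||^2 + lam * alpha^T K alpha."""
    return np.sum((K @ alpha - y)**2) + lam * alpha @ K @ alpha

alpha_star = np.linalg.solve(K + lam * np.eye(x.size), y)
grad = 2 * K @ (K @ alpha_star - y) + 2 * lam * K @ alpha_star

# The objective is a convex quadratic, so perturbations should not lower it.
base = objective(alpha_star)
perturbed = [objective(alpha_star + 0.1 * rng.normal(size=x.size)) for _ in range(100)]
print(np.abs(grad).max(), min(perturbed) - base)
```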

Showcase: Kernel Regression Explorer

Click to add data points. The model fits f(x) = ∑ α_i K(x_i, x) using (K + λI)⁻¹y. The faint blue curves show each individual kernel contribution α_iK(x_i, ·); the orange curve is their sum. Try different kernels and watch how the fit changes character.


Experiment ideas:
• With Gaussian kernel, shrink ℓ → 0.1 and watch each point get its own "bump"
• Increase ℓ → 2.0 for a very smooth interpolation
• Switch to linear kernel — you get back ordinary ridge regression
• Set λ very small — the function interpolates all points exactly

The Representer Theorem guarantees that the optimal function in an RKHS:

• Is always a polynomial
• Can be written as a finite sum of N kernel evaluations (one per training point)
• Has zero training error

Chapter 5: Gaussian Processes

Kernel regression gives us point predictions. But what about uncertainty? Bayesian linear regression gave us a posterior distribution. Can we get the same thing with kernels?

A Gaussian process (GP) is a distribution over functions. Formally, f ~ GP(m(x), K(x, x')) means that for any finite set of points {x₁, ..., x_N}, the vector [f(x₁), ..., f(x_N)] is jointly Gaussian with mean [m(x₁), ..., m(x_N)] and covariance matrix K_ij = K(x_i, x_j).

GP = infinite-dimensional Gaussian. Just as a multivariate Gaussian is specified by its mean vector and covariance matrix, a GP is specified by its mean function m(x) and covariance function K(x, x'). The covariance function is a kernel — it must be PSD.

GP Posterior (Prediction)

Given training data {(x_i, y_i)} with noise model y = f(x) + ε, ε ~ N(0, σ²), the posterior at a new point x* is:

f(x*) | data ~ N(μ*, σ*²)

μ* = k_*^T (K + σ²I)⁻¹ y

σ*² = K(x*, x*) − k_*^T (K + σ²I)⁻¹ k_*

where k_* = [K(x₁, x*), ..., K(x_N, x*)]^T. The posterior mean μ* is exactly kernel ridge regression with λ = σ². But we also get σ*²: uncertainty that is small near training points and large far away.

Worked Example: GP with 2 Training Points

Let K be the Gaussian kernel with ℓ = 1, σ² = 0.1. Training data: x₁ = 0, y₁ = 1 and x₂ = 1, y₂ = −0.5.

K = [[1.0, 0.607], [0.607, 1.0]]

(K + σ²I) = [[1.1, 0.607], [0.607, 1.1]]

Its inverse: det = 1.21 − 0.368 = 0.842. So (K + σ²I)⁻¹ = [[1.306, −0.720], [−0.720, 1.306]]. Multiplying by y:

α = (K + σ²I)⁻¹y = [1.306·1 + (−0.720)(−0.5), (−0.720)·1 + 1.306·(−0.5)] = [1.666, −1.373]

Prediction at x* = 0.5: k_* = [K(0, 0.5), K(1, 0.5)] = [0.882, 0.882]. Mean = 0.882 · 1.666 + 0.882 · (−1.373) ≈ 0.259. This is a weighted blend of the two observations, reflecting the kernel's smoothness.

Variance: σ*² = K(0.5, 0.5) − k_*^T(K + σ²I)⁻¹k_*. Carrying one more digit, k_* = [0.8825, 0.8825] and (K + σ²I)⁻¹k_* = 0.8825 · [1.306 − 0.720, −0.720 + 1.306] = [0.5171, 0.5171], so σ*² = 1.0 − 2 · 0.8825 · 0.5171 ≈ 0.087. The variance is small because x* = 0.5 is between two training points. At x* = 3 (far from data), k_* = [0.011, 0.135] is nearly zero, so σ*² ≈ 0.98, close to the prior variance K(3, 3) = 1.
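The arithmetic of this two-point example can be reproduced in a few lines. A minimal sketch, with the helper names `k` and `posterior` introduced for illustration:

```python
import numpy as np

def k(a, b, ell=1.0):
    """Gaussian kernel matrix between 1D point sets a and b."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

x_train = np.array([0.0, 1.0])
y_train = np.array([1.0, -0.5])
sig2 = 0.1

A = k(x_train, x_train) + sig2 * np.eye(2)     # K + sigma^2 I
alpha = np.linalg.solve(A, y_train)

def posterior(x_star):
    """Posterior mean and variance of f at a scalar x_star."""
    k_star = k(x_train, x_star)[:, 0]                  # shape (N,)
    mean = k_star @ alpha
    var = 1.0 - k_star @ np.linalg.solve(A, k_star)    # K(x*, x*) = 1 for this kernel
    return mean, var

mean_mid, var_mid = posterior(0.5)   # between the two training points
mean_far, var_far = posterior(3.0)   # away from the data
print(alpha, mean_mid, var_mid, var_far)
```

Note that the variance uses (K + σ²I)⁻¹k_*, a different vector from the α used for the mean.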

Mercer's Theorem and the Eigenexpansion

Mercer's theorem states that a continuous PSD kernel on a compact domain can be expanded as:

K(x, x') = ∑_{j=1}^∞ λ_j φ_j(x) φ_j(x')

where λ_j ≥ 0 are eigenvalues and φ_j are eigenfunctions of the integral operator associated with K. The feature map is φ(x) = [√λ₁φ₁(x), √λ₂φ₂(x), ...]. For the Gaussian kernel, all λ_j > 0 (infinitely many), confirming the feature space is infinite-dimensional.

GP regression implicitly uses this eigenexpansion. The posterior shrinks each eigencomponent by λ_j/(λ_j + σ²) — high-eigenvalue components (smooth, data-aligned) are trusted, low-eigenvalue components (rough, noisy) are suppressed. This is the same spectral filtering as ridge regression in the eigenspace of X^TX.
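A finite-sample analogue of this spectral filtering can be seen on the Gram matrix itself. The sketch below uses the eigenvectors of K (not the Mercer eigenfunctions, which is a simplifying assumption) and checks that shrinking each component of y by λ_j/(λ_j + σ²) reproduces the GP posterior mean at the training inputs. Data and parameters are hypothetical.

```python
import numpy as np

def gram(a, b, ell=0.5):
    """Gaussian Gram matrix for 1D inputs."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, size=30))
y = np.sin(x) + 0.2 * rng.normal(size=x.size)   # hypothetical noisy targets
sig2 = 0.1

K = gram(x, x)
eigvals, U = np.linalg.eigh(K)            # K = U diag(eigvals) U^T
eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off

shrink = eigvals / (eigvals + sig2)       # per-component shrinkage factor
fitted_spectral = U @ (shrink * (U.T @ y))
fitted_direct = K @ np.linalg.solve(K + sig2 * np.eye(x.size), y)

print(np.round(shrink[-5:], 3))           # factors for the largest eigenvalues
```

Components with eigenvalues far above σ² pass through almost unchanged, while those far below σ² are nearly removed.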

Why GP = kernel ridge + uncertainty. The GP posterior mean is identical to kernel ridge regression. The posterior variance adds a principled error bar. When you need predictions only, use kernel ridge. When you need to know how confident those predictions are — for active learning, Bayesian optimization, or safety-critical decisions — use GPs.

GP Posterior with Confidence Bands

The shaded region is the 95% confidence interval. Notice it narrows near data points.


GP intuition. Far from data: posterior mean reverts to the prior mean (zero), and variance returns to the prior variance K(x*, x*) = 1. The model says "I have no information here, so I default to my prior belief." Near training points: the posterior is pinned close to the observed values, with small variance. Between training points: the posterior smoothly interpolates, with moderate uncertainty that depends on how far apart the neighbors are.

This behavior is exactly right for decision-making under uncertainty. In Bayesian optimization, the GP posterior variance guides exploration: we sample at points where uncertainty is highest (most information to gain). In active learning, we query labels at points where the GP is most uncertain.

In a GP posterior, the predictive uncertainty σ*² is smallest:

• Near the training data points
• Far from all training points
• Everywhere equally (constant)

Chapter 6: Regularization in Function Space

Let's tie everything together. In ordinary regression, we regularize the weight vector: λ‖w‖². In kernel regression, we regularize the function itself: λ‖f‖_H².

What does the RKHS norm ‖f‖_H actually measure? For the Gaussian kernel:

‖f‖_H² = ∫ |F(ω)|² / S(ω) dω

where F(ω) is the Fourier transform of f and S(ω) is the spectral density of the kernel. High-frequency components of f are penalized more heavily (since S(ω) decays for large ω). The RKHS norm is a smoothness penalty in disguise.

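Different kernels, different smoothness. The spectral density of the Gaussian kernel decays exponentially, so high frequencies are penalized very heavily and functions in its RKHS are infinitely differentiable. A Matérn kernel with parameter ν gives rougher functions: GP sample paths are ⌈ν⌉ − 1 times differentiable (once for ν = 3/2, twice for ν = 5/2). Choose the kernel to match your prior belief about the function's smoothness.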

The Three Views

View	Optimization	Regularizer
Primal (finite features)	min ‖X w − y‖² + λ‖w‖²	Weight magnitude
Dual (kernel)	min ‖Kα − y‖² + λα^TKα	Function RKHS norm
Bayesian	max p(y|f) · GP(f; 0, K)	Prior over functions

These are three perspectives on the same optimization. Primal works when d is small. Dual works when N is small (or d is infinite). Bayesian gives us uncertainty.

Deriving the Equivalence

Start from the Bayesian view: the GP prior says f ~ GP(0, K). The likelihood is p(y|f) = N(f, σ²I). The MAP estimate minimizes:

−log p(f|y) = (1/2σ²)‖f − y‖² + (1/2)f^TK⁻¹f + const

where f = [f(x₁), ..., f(x_N)]. Multiply by 2σ²: minimize ‖f − y‖² + σ²f^TK⁻¹f. Writing f = Kα:

‖Kα − y‖² + σ²α^TKα

Setting the gradient to zero: (K + σ²I)α = y. This is kernel ridge regression with λ = σ². The same α, the same predictions, derived from three completely different starting points.

Choosing the Kernel

The kernel encodes your inductive bias — what you believe the function looks like before seeing data. Key choices:

Gaussian RBF: smooth, infinitely differentiable. Good default.
Matérn: rougher functions. Parameter ν controls differentiability.
Polynomial: captures polynomial trends. Degree d is explicit.
Linear: ordinary ridge regression. Simplest baseline.

The Bias-Variance Tradeoff Revisited

In kernel methods, the kernel parameters control the bias-variance tradeoff:

Small ℓ + small λ: Low bias (can fit complex functions), high variance (sensitive to specific training points)
Large ℓ + large λ: High bias (overly smooth), low variance (stable across datasets)
Optimal: the sweet spot depends on the true function complexity and the amount of data

The RKHS norm provides a principled measure of complexity. Unlike polynomial degree (discrete), the regularization strength λ is continuous: it smoothly trades off fit quality against function complexity. This is one reason kernel methods often generalize better than explicit polynomial regression in high dimensions.

The deep connection: regularization in weight space (λ‖w‖²), regularization in function space (λ‖f‖_H²), and Bayesian priors (p(f) = GP(0, K)) are three faces of the same idea. The RKHS makes this precise.

Implementation: Kernel Ridge Regression in NumPy

python
import numpy as np

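def gaussian_kernel(X1, X2, ell=1.0):
    """Gram matrix K[i,j] = exp(-||x1_i - x2_j||^2 / (2*ell^2)) for 2D arrays X1, X2."""
    # Parentheses are required for the expression to continue across lines.
    sq = (np.sum(X1**2, axis=1, keepdims=True)  # (N1, 1)
          + np.sum(X2**2, axis=1)               # (N2,), broadcasts across rows
          - 2 * X1 @ X2.T)                      # (N1, N2)
    sq = np.maximum(sq, 0.0)                    # clip tiny negatives from round-off
    return np.exp(-sq / (2 * ell**2))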

def kernel_ridge(X_train, y_train, X_test, ell=1.0, lam=0.1):
    """Kernel ridge regression with Gaussian kernel."""
    K = gaussian_kernel(X_train, X_train, ell)  # (N, N)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    K_test = gaussian_kernel(X_test, X_train, ell)  # (M, N)
    return K_test @ alpha  # predictions

def gp_predict(X_train, y_train, X_test, ell=1.0, sig2=0.1):
    """GP posterior mean + variance."""
    K = gaussian_kernel(X_train, X_train, ell) + sig2 * np.eye(len(X_train))
    K_star = gaussian_kernel(X_test, X_train, ell)
    alpha = np.linalg.solve(K, y_train)
    mu = K_star @ alpha                          # posterior mean
    v = np.linalg.solve(K, K_star.T)
    var = 1.0 - np.sum(K_star * v.T, axis=1)    # posterior variance; 1.0 = K(x*, x*) for the Gaussian kernel
    return mu, var

The RKHS norm ‖f‖_H measures:

• The number of non-zero coefficients in f
• The smoothness/complexity of the function f
• The magnitude of f at the origin

Chapter 7: Mastery

Let's map the entire journey from linear regression to RKHS.

Linear Regression: f(x) = w^Tx, solve normal equations

↓ add prior on w

Bayesian / Ridge: MAP with N(0, τ²I) = ridge with λ = σ²/τ²

↓ replace x with φ(x)

Feature-Space Ridge: f(x) = w^Tφ(x), dual: α = (K+λI)⁻¹y

↓ let φ be infinite-dim

Kernel Regression / RKHS: f = ∑ α_i K(x_i, ·), Representer Theorem

↓ full posterior over f

Gaussian Process: f ~ GP(0, K), posterior = mean + uncertainty

Concept	Key Equation	Insight
Kernel trick	K(x,x') = ⟨φ(x), φ(x')⟩	Never compute φ explicitly
Representer Theorem	f* = ∑ α_i K(x_i, ·)	∞-dim search → N coefficients
GP posterior mean	μ* = k*^T(K+σ²I)⁻¹y	Same as kernel ridge regression
GP posterior var	σ*² = K(x*, x*) − k*^T(K+σ²I)⁻¹k*	Uncertainty ↑ far from data

Connections

Regression & AR Models — the linear starting point; kernel methods are its nonlinear generalization
Adaptive Filters & LMS — online learning as opposed to batch kernel methods; kernel LMS bridges the two
Support Vector Machines — classification with kernels; the representer theorem applies there too
Neural Tangent Kernel — connects deep learning to RKHS theory in the infinite-width limit

The RKHS framework is one of the most powerful tools in statistical learning theory. It gives a rigorous language for "function complexity," connects Bayesian and frequentist methods through the kernel, and the Representer Theorem turns infinite-dimensional optimization into finite computation. Every time you use a Gaussian process, an SVM, or analyze a neural network through the NTK, you're working in an RKHS.

What you can now do:
• Derive ridge regression as MAP with Gaussian prior
• Compute kernel regression predictions: α = (K + λI)⁻¹y
• Explain what an RKHS is and why evaluation must be continuous
• State and explain the Representer Theorem
• Compute GP posterior mean and variance
• Choose between kernels based on smoothness assumptions

Computational Complexity

Method	Training Cost	Prediction Cost	Memory
Primal ridge (d features)	O(Nd² + d³)	O(d)	O(d²)
Kernel ridge (N points)	O(N³)	O(N)	O(N²)
GP posterior	O(N³)	O(N) mean, O(N²) var	O(N²)

The O(N³) cost of kernel methods is the main bottleneck. For N > 10,000, approximate methods are needed: random Fourier features (Rahimi & Recht, 2007), inducing points (Snelson & Ghahramani, 2006), or structured kernel interpolation.

When to Use Which

Decision guide for practitioners:

N < 1000, want uncertainty: Full GP. Exact posterior, confidence intervals.
N < 10,000, no uncertainty needed: Kernel ridge. Fast, exact.
N > 10,000: Approximate. Random Fourier features or sparse GP with inducing points.
Streaming data: Kernel LMS (online kernel methods) or random feature regression.
d is small, linear relationship: Just use ridge regression. No kernel needed.

In modern deep learning, the neural tangent kernel (NTK) connects neural networks to kernel methods: an infinitely wide neural network trained with gradient descent is equivalent to kernel regression with the NTK. This gives theoretical tools to analyze neural network generalization via RKHS theory.

Kernel Hyperparameter Selection

The kernel parameters (length scale ℓ, polynomial degree d) and regularization λ dramatically affect performance. Two standard approaches:

Cross-validation: Split data into k folds, evaluate prediction error for each (ℓ, λ) pair, pick the best. Computational cost: O(k · N³) per parameter setting. Works for any kernel method.

Marginal likelihood (for GPs): Maximize log p(y | ℓ, σ²) = −(1/2)y^T(K + σ²I)⁻¹y − (1/2)log|K + σ²I| − (N/2)log(2π). The first term favors data fit, the second penalizes model complexity (Occam's razor built in). Gradient-based optimization finds the best kernel parameters automatically. This is the preferred method for GPs because it avoids the expense of cross-validation.

The marginal likelihood is sometimes called the evidence. It automatically balances model complexity against data fit — too flexible a kernel (very small ℓ) gets penalized by the log-determinant term, while too rigid a kernel (very large ℓ) gets penalized by the data fit term. This is the Bayesian answer to cross-validation.
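The log marginal likelihood above can be evaluated stably with a Cholesky factor. The sketch below scans a grid of length scales on hypothetical data rather than using gradients, which is a simplification; the function name `log_marginal_likelihood`, the grid, and the fixed σ² are assumptions for illustration.

```python
import numpy as np

def gram(a, b, ell):
    """Gaussian Gram matrix for 1D inputs."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

def log_marginal_likelihood(x, y, ell, sig2):
    """log p(y | ell, sig2) for a zero-mean GP, computed via Cholesky."""
    N = x.size
    L = np.linalg.cholesky(gram(x, x, ell) + sig2 * np.eye(N))
    z = np.linalg.solve(L, y)                   # z @ z = y^T (K + sig2 I)^{-1} y
    data_fit = -0.5 * z @ z
    complexity = -np.sum(np.log(np.diag(L)))    # equals -0.5 * log|K + sig2 I|
    return data_fit + complexity - 0.5 * N * np.log(2 * np.pi)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-3, 3, size=40))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)   # hypothetical noisy targets

ells = np.logspace(-2, 1, 25)
scores = np.array([log_marginal_likelihood(x, y, ell, sig2=0.01) for ell in ells])
best_ell = ells[np.argmax(scores)]
print(best_ell)
```

Each evaluation still costs O(N³) for the factorization; the saving relative to cross-validation is that no data splitting or refitting per fold is needed.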

Random Fourier Features. For shift-invariant kernels like the Gaussian, K(x, x') ≈ z(x)^Tz(x') where z(x) is a D-dimensional random feature vector. This turns kernel regression back into primal ridge with D random features — O(ND²) instead of O(N³). Choose D << N for massive speedups.
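A minimal sketch of random Fourier features for the Gaussian kernel, using the standard cosine construction with Gaussian frequencies and uniform phases. The sizes N = 1000 and D = 500, the data, and the names `W`, `b`, `z` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
p, D, ell = 2, 500, 1.0

# Draw the random projection once and reuse it for every input.
W = rng.normal(scale=1.0 / ell, size=(p, D))   # frequencies ~ N(0, I / ell^2)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)      # random phases

def z(X):
    """Random feature map with E[z(x)^T z(x')] = exp(-||x - x'||^2 / (2 ell^2))."""
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = rng.normal(size=(1000, p))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)   # hypothetical targets

# How well does z(x)^T z(x') approximate the exact kernel?
sq = np.sum(X**2, axis=1)
sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
K_exact = np.exp(-sq_dists / (2 * ell**2))
Z = z(X)
mean_abs_err = np.abs(K_exact - Z @ Z.T).mean()

# Primal ridge on the D random features: a D x D solve instead of N x N.
lam = 0.1
w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
pred = Z @ w
print(mean_abs_err, np.mean((pred - y)**2))
```

The approximation error shrinks roughly like 1/√D, so D trades accuracy against the cost of the primal solve.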

"The kernel trick is one of the most beautiful mathematical ideas in machine learning. It says: you can work in a million-dimensional space by only ever computing dot products." — Pilanci, EE269
