From fitting a line through data to predicting the next sample of a signal — and the regularization tricks that keep it all from falling apart.
You have N data points. Each point has a feature vector x and a target value y. A house's features might be [square footage, bedrooms, age]; the target is its sale price. You want a function f(x) that predicts y from x. How do you find the best f?
The simplest idea: make f linear. Set f(x) = wTx + w0, where w is a weight vector and w0 is a bias. Now "finding the best f" reduces to "finding the best w."
But "best" needs a definition. We measure quality by the mean squared error (MSE) risk:
In practice, we don't know the true distribution. We only have N training samples {(xi, yi)}i=1N. So we minimize the empirical risk:
Here's a quick demo. We generate noisy data from a true line y = 2x + 1. Try fitting it with different polynomial degrees. Low degree = underfitting (can't capture the pattern). High degree = overfitting (wiggles through every point).
Drag the slider to change the polynomial degree. Watch how high degrees chase noise.
Let's derive the optimal weights in closed form. Stack the N data points into a matrix. Let X be the N × d design matrix (row i is xiT, with a 1 prepended for the bias). Let y be the N × 1 vector of targets. The model predictions are X·w, and the loss is:
Expand this:
Take the gradient with respect to w and set it to zero:
This gives us the normal equations:
If XTX is invertible, we get the closed-form solution:
The matrix X+ = (XTX)−1XT is the pseudoinverse of X. The hat matrix H = X(XTX)−1XT projects y onto the column space of X. The fitted values are ŷ = Hy.
Suppose we have 3 data points in 1D (with a bias column):
Then XTX = [[3, 6], [6, 14]] and XTy = [12.2, 28.5]. Solving:
So the line is y ≈ −0.033 + 2.05x. The true slope was 2 and the intercept was 0 — close, given only 3 noisy points.
Click to place data points. The red line is the OLS fit. The dashed lines show residuals.
Ordinary least squares minimizes the training error. But sometimes that's a bad idea. When features are correlated, or when there are more features than samples (d > N), XTX is singular or nearly so, and the OLS weights explode.
Ridge regression (also called Tikhonov regularization or L2 regularization) adds a penalty on the size of the weights:
where λ ≥ 0 is the regularization parameter. The penalty term λ‖w‖2 = λ ∑ wj2 discourages large weights. Taking the gradient and setting it to zero:
Ridge introduces bias — the expected prediction is no longer exactly right, because we're shrinking the weights toward zero. But it reduces variance — the weights are more stable across different datasets.
As λ → 0, we get OLS (low bias, high variance). As λ → ∞, w → 0 (high bias, zero variance). The optimal λ balances the two.
Watch how the weights shrink toward zero as λ increases. The fit line becomes less wiggly but more biased.
Write the SVD of X: X = UΣVT. The OLS prediction is ŷ = X(XTX)−1XTy. In terms of singular values σj, each component is scaled by σj2 / σj2 = 1. Ridge instead scales each component by:
Components with small singular values (directions where data has little variation) get shrunk the most. This is exactly what we want: don't trust directions where you have little information.
Ridge shrinks weights toward zero but never makes them exactly zero. If you have 1000 features and only 5 matter, ridge keeps all 1000 — just smaller. Can we do better?
The LASSO (Least Absolute Shrinkage and Selection Operator) replaces the L2 penalty with an L1 penalty:
The key difference: the L1 penalty has "corners" at wj = 0 in the geometry of the constraint set. The optimal solution tends to land on these corners, making some weights exactly zero. LASSO performs feature selection automatically.
| Property | Ridge (L2) | LASSO (L1) |
|---|---|---|
| Penalty | λ‖w‖22 | λ‖w‖1 |
| Solution form | Closed-form | No closed-form (iterative) |
| Sparsity | Never exactly zero | Many weights become zero |
| Correlated features | Keeps all, shares weight | Picks one, drops others |
| Use case | All features matter | Few features matter |
The ellipses are MSE contours. The colored shape is the constraint region. Watch where they first touch.
In practice, a compromise called Elastic Net combines both penalties:
This gets sparsity from L1 while handling correlated features better (L2 groups them). Scikit-learn's default for many problems.
Now let's apply regression to signals. Instead of predicting house prices from features, we predict the next sample of a time series from its past values. This is the autoregressive (AR) model.
An AR(p) model of order p says:
where a1, ..., ap are the AR coefficients and e[n] is white noise (the "innovation" — the part of x[n] that can't be predicted from the past). The current sample is a weighted sum of the p most recent past samples, plus unpredictable noise.
Speech signals are well-modeled by AR(10-20). Financial time series use AR models for short-term prediction. EEG, vibration data, climate records — any signal with temporal structure.
The key parameter is the model order p. Too low: can't capture the signal's autocorrelation structure. Too high: overfits to noise (same story as polynomial degree).
Given a signal x[0], x[1], ..., x[N−1], we form the regression:
Then a = (XTX)−1XTy. Same normal equations. The prediction is x̂[n] = aT[x[n−1], ..., x[n−p]].
A signal is generated from a true AR process. Fit an AR model and watch predictions. Adjust the model order to see underfitting vs. overfitting.
We just fitted an AR model by building the design matrix X and solving the normal equations. But there's an elegant alternative that uses the signal's autocorrelation function directly.
For a stationary signal, the autocorrelation at lag k is:
Now multiply both sides of the AR equation x[n] = ∑k=1p akx[n−k] + e[n] by x[n−m] and take expectations:
For m ≥ 1, the noise e[n] is uncorrelated with past values x[n−m], so E[e[n]·x[n−m]] = 0. This gives us:
These are the Yule-Walker equations. In matrix form:
The noise variance is recovered from m = 0:
Suppose r[0] = 1.0, r[1] = 0.6, r[2] = 0.2. The Yule-Walker system is:
Solving: det = 1 − 0.36 = 0.64. a1 = (1.0·0.6 − 0.6·0.2)/0.64 = 0.48/0.64 = 0.75. a2 = (1.0·0.2 − 0.6·0.6)/0.64 = −0.16/0.64 = −0.25.
The noise variance: σe2 = 1.0 − 0.75(0.6) − (−0.25)(0.2) = 1.0 − 0.45 + 0.05 = 0.60.
Adjust the autocorrelation values and see the Yule-Walker solution update in real time.
We've seen that choosing λ (for ridge/LASSO) or p (for AR models) is critical. Too small: overfit. Too large: underfit. How do we pick the right value without a separate test set?
K-fold cross-validation: split the data into K roughly equal parts. For each fold k, train on the other K−1 parts and measure the error on fold k. Average the K error values. Pick the hyperparameter that minimizes this average.
The extreme case: K = N. Each fold is a single data point. For linear regression, there's a beautiful shortcut. The LOO error is:
where hii is the i-th diagonal of the hat matrix H = X(XTX)−1XT. You only fit the model once and adjust each residual by the leverage hii. This is called the PRESS statistic.
For AR model order selection, two popular criteria avoid cross-validation entirely:
Both penalize model complexity. BIC penalizes more heavily for large N and tends to select simpler models. In practice, BIC is often preferred for AR order selection because it's consistent (picks the true order as N → ∞).
The training error always decreases with model complexity. The CV error dips at the true order, then rises.
Let's review the landscape of linear prediction and regularization.
| Method | Loss | Solution | Key Property |
|---|---|---|---|
| OLS | ‖Xw − y‖2 | (XTX)−1XTy | Unbiased, minimum variance (Gauss-Markov) |
| Ridge | ‖Xw − y‖2 + λ‖w‖2 | (XTX + λI)−1XTy | Always invertible, shrinks weights |
| LASSO | ‖Xw − y‖2 + λ‖w‖1 | Iterative (no closed form) | Sparse solutions, feature selection |
| AR(p) | ‖x − Xpasta‖2 | Yule-Walker or OLS | Temporal prediction from p past values |
This lecture connects forward to:
And connects back to:
"All models are wrong, but some are useful." — George Box