EE269 Lecture 20 — Mert Pilanci, Stanford

Regression & AR Models

From fitting a line through data to predicting the next sample of a signal — and the regularization tricks that keep it all from falling apart.

Prerequisites: Linear algebra (matrix inverse, transpose) + Basic probability (expectation, variance). That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: The Prediction Problem

You have N data points. Each point has a feature vector x and a target value y. A house's features might be [square footage, bedrooms, age]; the target is its sale price. You want a function f(x) that predicts y from x. How do you find the best f?

The simplest idea: make f linear. Set f(x) = wTx + w0, where w is a weight vector and w0 is a bias. Now "finding the best f" reduces to "finding the best w."

But "best" needs a definition. We measure quality by the mean squared error (MSE) risk:

R(w) = E[(f(x) − y)2] = E[(wTx + w0 − y)2]

In practice, we don't know the true distribution. We only have N training samples {(xi, yi)}i=1N. So we minimize the empirical risk:

R̂(w) = (1/N) ∑i=1N (wTxi + w0 − yi)2
The core tension. Minimizing R̂ exactly can lead to overfitting: the model memorizes training noise instead of learning the underlying pattern. This entire lecture is about balancing fit vs. simplicity.

Here's a quick demo. We generate noisy data from a true line y = 2x + 1. Try fitting it with different polynomial degrees. Low degree = underfitting (can't capture the pattern). High degree = overfitting (wiggles through every point).

Underfitting vs. Overfitting

Drag the slider to change the polynomial degree. Watch how high degrees chase noise.

Degree 1
A model with very low training error but high test error is likely:

Chapter 1: Least Squares

Let's derive the optimal weights in closed form. Stack the N data points into a matrix. Let X be the N × d design matrix (row i is xiT, with a 1 prepended for the bias). Let y be the N × 1 vector of targets. The model predictions are X·w, and the loss is:

L(w) = ‖Xw − y‖2 = (Xw − y)T(Xw − y)

Expand this:

L(w) = wTXTXw − 2wTXTy + yTy

Take the gradient with respect to w and set it to zero:

wL = 2XTXw − 2XTy = 0

This gives us the normal equations:

XTXw = XTy
The normal equations. The name "normal" comes from geometry: the residual vector (y − Xw) is orthogonal (normal) to the column space of X. We're projecting y onto the subspace spanned by the features.

If XTX is invertible, we get the closed-form solution:

wOLS = (XTX)−1XTy

The matrix X+ = (XTX)−1XT is the pseudoinverse of X. The hat matrix H = X(XTX)−1XT projects y onto the column space of X. The fitted values are ŷ = Hy.

Worked Example

Suppose we have 3 data points in 1D (with a bias column):

X = [[1, 1], [1, 2], [1, 3]],   y = [2.1, 3.9, 6.2]

Then XTX = [[3, 6], [6, 14]] and XTy = [12.2, 28.5]. Solving:

w = [[14, −6], [−6, 3]] / (3·14 − 6·6) · [12.2, 28.5]
= [[14, −6], [−6, 3]] / 6 · [12.2, 28.5]
w0 ≈ −0.033,   w1 ≈ 2.05

So the line is y ≈ −0.033 + 2.05x. The true slope was 2 and the intercept was 0 — close, given only 3 noisy points.

Normal Equations Visualized

Click to place data points. The red line is the OLS fit. The dashed lines show residuals.

When does this fail? If XTX is singular (features are linearly dependent) or nearly singular (multicollinearity), the inverse is unstable. Small changes in data cause wild swings in w. This is exactly where ridge regression saves us — next chapter.
The OLS solution w = (XTX)−1XTy requires which condition on X?

Chapter 2: Ridge Regression

Ordinary least squares minimizes the training error. But sometimes that's a bad idea. When features are correlated, or when there are more features than samples (d > N), XTX is singular or nearly so, and the OLS weights explode.

Ridge regression (also called Tikhonov regularization or L2 regularization) adds a penalty on the size of the weights:

Lridge(w) = ‖Xw − y‖2 + λ‖w‖2

where λ ≥ 0 is the regularization parameter. The penalty term λ‖w‖2 = λ ∑ wj2 discourages large weights. Taking the gradient and setting it to zero:

wL = 2XTXw − 2XTy + 2λw = 0
wridge = (XTX + λI)−1XTy
Why λI fixes everything. Adding λI to XTX shifts every eigenvalue by λ. If the smallest eigenvalue of XTX was near zero (ill-conditioned), now it's at least λ. The matrix is always invertible, and the condition number drops dramatically.

The Bias-Variance Tradeoff

Ridge introduces bias — the expected prediction is no longer exactly right, because we're shrinking the weights toward zero. But it reduces variance — the weights are more stable across different datasets.

As λ → 0, we get OLS (low bias, high variance). As λ → ∞, w → 0 (high bias, zero variance). The optimal λ balances the two.

Ridge Regularization Path

Watch how the weights shrink toward zero as λ increases. The fit line becomes less wiggly but more biased.

λ 0.00

Eigenvalue Perspective

Write the SVD of X: X = UΣVT. The OLS prediction is ŷ = X(XTX)−1XTy. In terms of singular values σj, each component is scaled by σj2 / σj2 = 1. Ridge instead scales each component by:

σj2 / (σj2 + λ)

Components with small singular values (directions where data has little variation) get shrunk the most. This is exactly what we want: don't trust directions where you have little information.

As λ increases in ridge regression, what happens to the weight vector?

Chapter 3: LASSO & Sparsity

Ridge shrinks weights toward zero but never makes them exactly zero. If you have 1000 features and only 5 matter, ridge keeps all 1000 — just smaller. Can we do better?

The LASSO (Least Absolute Shrinkage and Selection Operator) replaces the L2 penalty with an L1 penalty:

Llasso(w) = ‖Xw − y‖2 + λ ∑j |wj|

The key difference: the L1 penalty has "corners" at wj = 0 in the geometry of the constraint set. The optimal solution tends to land on these corners, making some weights exactly zero. LASSO performs feature selection automatically.

Geometry of L1 vs. L2. The L2 constraint set is a circle (sphere in higher dims). The L1 constraint set is a diamond (cross-polytope). A circle is smooth everywhere, so the solution can land anywhere on its surface. A diamond has sharp corners at the axes. The elliptical contours of the MSE loss are more likely to first touch a corner — and corners mean some coordinates are zero.

Comparing Ridge vs. LASSO

PropertyRidge (L2)LASSO (L1)
Penaltyλ‖w‖22λ‖w‖1
Solution formClosed-formNo closed-form (iterative)
SparsityNever exactly zeroMany weights become zero
Correlated featuresKeeps all, shares weightPicks one, drops others
Use caseAll features matterFew features matter
L1 vs. L2 Constraint Geometry

The ellipses are MSE contours. The colored shape is the constraint region. Watch where they first touch.

Penalty type L2 (Ridge)
λ 1.00

Elastic Net

In practice, a compromise called Elastic Net combines both penalties:

Lelastic(w) = ‖Xw − y‖2 + λ1‖w‖1 + λ2‖w‖22

This gets sparsity from L1 while handling correlated features better (L2 groups them). Scikit-learn's default for many problems.

Which regularization method can set some weights to exactly zero?

Chapter 4: Autoregressive Models for Signals

Now let's apply regression to signals. Instead of predicting house prices from features, we predict the next sample of a time series from its past values. This is the autoregressive (AR) model.

An AR(p) model of order p says:

x[n] = a1x[n−1] + a2x[n−2] + ... + apx[n−p] + e[n]

where a1, ..., ap are the AR coefficients and e[n] is white noise (the "innovation" — the part of x[n] that can't be predicted from the past). The current sample is a weighted sum of the p most recent past samples, plus unpredictable noise.

AR models ARE regression. Define the feature vector for sample n as xn = [x[n−1], x[n−2], ..., x[n−p]]. The target is yn = x[n]. The weight vector is a = [a1, ..., ap]. This is exactly linear regression: x[n] = aTxn + e[n].

Why AR Models Matter

Speech signals are well-modeled by AR(10-20). Financial time series use AR models for short-term prediction. EEG, vibration data, climate records — any signal with temporal structure.

The key parameter is the model order p. Too low: can't capture the signal's autocorrelation structure. Too high: overfits to noise (same story as polynomial degree).

Fitting an AR Model

Given a signal x[0], x[1], ..., x[N−1], we form the regression:

X = [[x[p−1], ..., x[0]], [x[p], ..., x[1]], ..., [x[N−2], ..., x[N−1−p]]]
y = [x[p], x[p+1], ..., x[N−1]]

Then a = (XTX)−1XTy. Same normal equations. The prediction is x̂[n] = aT[x[n−1], ..., x[n−p]].

AR Model Showcase

A signal is generated from a true AR process. Fit an AR model and watch predictions. Adjust the model order to see underfitting vs. overfitting.

Model order p 2
Noise σ 0.50
True order 2
Reading the showcase. The teal line is the actual signal. The orange line is the one-step-ahead prediction from the fitted AR model. When the model order matches the true order (and noise is low), the predictions track the signal closely. Crank up the order past the true value and the MSE on held-out data starts rising — that's overfitting.
An AR(3) model predicts x[n] from:

Chapter 5: Yule-Walker Equations

We just fitted an AR model by building the design matrix X and solving the normal equations. But there's an elegant alternative that uses the signal's autocorrelation function directly.

For a stationary signal, the autocorrelation at lag k is:

r[k] = E[x[n] · x[n−k]]

Now multiply both sides of the AR equation x[n] = ∑k=1p akx[n−k] + e[n] by x[n−m] and take expectations:

E[x[n]·x[n−m]] = ∑k=1p ak E[x[n−k]·x[n−m]] + E[e[n]·x[n−m]]

For m ≥ 1, the noise e[n] is uncorrelated with past values x[n−m], so E[e[n]·x[n−m]] = 0. This gives us:

r[m] = a1r[m−1] + a2r[m−2] + ... + apr[m−p],   m = 1, 2, ..., p

These are the Yule-Walker equations. In matrix form:

[[r[0], r[1], ..., r[p−1]], [r[1], r[0], ..., r[p−2]], ..., [r[p−1], ..., r[0]]] · a = [r[1], r[2], ..., r[p]]
Toeplitz structure. The autocorrelation matrix R is Toeplitz — each diagonal is constant. This structure allows the Levinson-Durbin algorithm to solve the system in O(p2) instead of O(p3). For large model orders, this matters.

The noise variance is recovered from m = 0:

σe2 = r[0] − ∑k=1p ak r[k]

Worked Example: AR(2)

Suppose r[0] = 1.0, r[1] = 0.6, r[2] = 0.2. The Yule-Walker system is:

[[1.0, 0.6], [0.6, 1.0]] · [a1, a2] = [0.6, 0.2]

Solving: det = 1 − 0.36 = 0.64. a1 = (1.0·0.6 − 0.6·0.2)/0.64 = 0.48/0.64 = 0.75. a2 = (1.0·0.2 − 0.6·0.6)/0.64 = −0.16/0.64 = −0.25.

The noise variance: σe2 = 1.0 − 0.75(0.6) − (−0.25)(0.2) = 1.0 − 0.45 + 0.05 = 0.60.

Autocorrelation → AR Coefficients

Adjust the autocorrelation values and see the Yule-Walker solution update in real time.

r[1] 0.60
r[2] 0.20
The Yule-Walker equations relate AR coefficients to:

Chapter 6: Cross-Validation

We've seen that choosing λ (for ridge/LASSO) or p (for AR models) is critical. Too small: overfit. Too large: underfit. How do we pick the right value without a separate test set?

K-fold cross-validation: split the data into K roughly equal parts. For each fold k, train on the other K−1 parts and measure the error on fold k. Average the K error values. Pick the hyperparameter that minimizes this average.

Split
Divide N samples into K folds
For each fold k
Train on folds ≠ k, test on fold k
Average
CV error = (1/K) ∑ MSEk
Select
Pick λ (or p) minimizing CV error

Leave-One-Out (LOO) Cross-Validation

The extreme case: K = N. Each fold is a single data point. For linear regression, there's a beautiful shortcut. The LOO error is:

CVLOO = (1/N) ∑i=1N (yi − ŷi)2 / (1 − hii)2

where hii is the i-th diagonal of the hat matrix H = X(XTX)−1XT. You only fit the model once and adjust each residual by the leverage hii. This is called the PRESS statistic.

Leverage hii measures how much influence point i has on its own prediction. High leverage points (those far from the centroid of the features) have hii close to 1, making their LOO residuals amplified.

Information Criteria

For AR model order selection, two popular criteria avoid cross-validation entirely:

AIC = N · ln(σ̂e2) + 2p
BIC = N · ln(σ̂e2) + p · ln(N)

Both penalize model complexity. BIC penalizes more heavily for large N and tends to select simpler models. In practice, BIC is often preferred for AR order selection because it's consistent (picks the true order as N → ∞).

Model Selection: CV Error vs. Order

The training error always decreases with model complexity. The CV error dips at the true order, then rises.

In K-fold cross-validation, increasing K from 5 to N (leave-one-out) generally:

Chapter 7: Mastery

Let's review the landscape of linear prediction and regularization.

MethodLossSolutionKey Property
OLS‖Xw − y‖2(XTX)−1XTyUnbiased, minimum variance (Gauss-Markov)
Ridge‖Xw − y‖2 + λ‖w‖2(XTX + λI)−1XTyAlways invertible, shrinks weights
LASSO‖Xw − y‖2 + λ‖w‖1Iterative (no closed form)Sparse solutions, feature selection
AR(p)‖x − Xpasta‖2Yule-Walker or OLSTemporal prediction from p past values
The grand unification. Ridge, LASSO, and AR models are all the same thing: least-squares regression with different design matrices and penalties. The design matrix changes (features vs. lagged signal), the penalty changes (L2 vs. L1 vs. none), but the mathematical machinery is identical.

Connections

This lecture connects forward to:

And connects back to:

What you can now do:
• Derive the normal equations from scratch
• Explain why ridge regression stabilizes ill-conditioned problems
• Compare L1 vs. L2 geometry for sparsity
• Fit an AR model to a signal and predict future samples
• Use Yule-Walker equations with autocorrelation
• Choose model order via cross-validation or BIC

"All models are wrong, but some are useful." — George Box