EE269 Lecture 20 — Least-Squares Regression & AR Models

Chapter 0: The Prediction Problem

You have N data points. Each point has a feature vector x and a target value y. A house's features might be [square footage, bedrooms, age]; the target is its sale price. You want a function f(x) that predicts y from x. How do you find the best f?

The simplest idea: make f linear. Set f(x) = w^Tx + w₀, where w is a weight vector and w₀ is a bias. Now "finding the best f" reduces to "finding the best w."

But "best" needs a definition. We measure quality by the mean squared error (MSE) risk:

R(w) = E[(f(x) − y)²] = E[(w^Tx + w₀ − y)²]

In practice, we don't know the true distribution. We only have N training samples {(x_i, y_i)}_{i=1}^N. So we minimize the empirical risk:

R̂(w) = (1/N) ∑_{i=1}^N (w^Tx_i + w₀ − y_i)²

The core tension. Minimizing R̂ exactly can lead to overfitting: the model memorizes training noise instead of learning the underlying pattern. This entire lecture is about balancing fit vs. simplicity.

Here's a quick demo. We generate noisy data from a true line y = 2x + 1. Try fitting it with different polynomial degrees. Low degree = underfitting (can't capture the pattern). High degree = overfitting (wiggles through every point).
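The demo can be approximated offline with a few lines of NumPy. This is a minimal sketch, not the page's own code: the noise level (0.3), the sample sizes, the seed, and the helper name `poly_mse` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) so runs are repeatable

# Noisy samples from the true line y = 2x + 1; noise std 0.3 is an assumption
x_train = np.linspace(-1.0, 1.0, 12)
y_train = 2.0 * x_train + 1.0 + 0.3 * rng.standard_normal(x_train.size)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = 2.0 * x_test + 1.0 + 0.3 * rng.standard_normal(x_test.size)

def poly_mse(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    V_train = np.vander(x_train, degree + 1)  # columns x^degree, ..., x, 1
    V_test = np.vander(x_test, degree + 1)
    coeffs, *_ = np.linalg.lstsq(V_train, y_train, rcond=None)
    train_mse = np.mean((V_train @ coeffs - y_train) ** 2)
    test_mse = np.mean((V_test @ coeffs - y_test) ** 2)
    return train_mse, test_mse

results = {d: poly_mse(d) for d in (0, 1, 3, 9)}
for d, (tr, te) in results.items():
    print(f"degree {d}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```

Because each higher degree contains the lower ones as special cases, the training MSE cannot go up as the degree grows. The test MSE is the number to watch: it is typically large for degree 0 (underfitting) and tends to rise again once the degree is high enough to chase noise.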

[Interactive demo: Underfitting vs. Overfitting. A slider sets the polynomial degree (initially 1); high degrees chase the noise.]

A model with very low training error but high test error is likely:

• Underfitting the data
• Overfitting the data
• Perfectly generalized

Chapter 1: Least Squares

Let's derive the optimal weights in closed form. Stack the N data points into a matrix. Let X be the N × d design matrix whose row i is x_i^T with a 1 prepended for the bias, so d counts the bias column and the bias w₀ becomes the first entry of w. Let y be the N × 1 vector of targets. The model predictions are X·w, and the loss is:

L(w) = ‖Xw − y‖² = (Xw − y)^T(Xw − y)

Expand this:

L(w) = w^TX^TXw − 2w^TX^Ty + y^Ty

Take the gradient with respect to w and set it to zero:

∇_wL = 2X^TXw − 2X^Ty = 0

This gives us the normal equations:

X^TXw = X^Ty

The normal equations. The name "normal" comes from geometry: the residual vector (y − Xw) is orthogonal (normal) to the column space of X. We're projecting y onto the subspace spanned by the features.

If X^TX is invertible, we get the closed-form solution:

w_OLS = (X^TX)⁻¹X^Ty

The matrix X⁺ = (X^TX)⁻¹X^T is the pseudoinverse of X. The hat matrix H = X(X^TX)⁻¹X^T projects y onto the column space of X. The fitted values are ŷ = Hy.

Worked Example

Suppose we have 3 data points in 1D (with a bias column):

X = [[1, 1], [1, 2], [1, 3]], y = [2.1, 3.9, 6.2]

Then X^TX = [[3, 6], [6, 14]] and X^Ty = [12.2, 28.5]. Solving:

w = (1/(3·14 − 6·6)) · [[14, −6], [−6, 3]] · [12.2, 28.5]
= (1/6) · [[14, −6], [−6, 3]] · [12.2, 28.5]
w₀ = (14·12.2 − 6·28.5)/6 = −0.2/6 ≈ −0.033
w₁ = (−6·12.2 + 3·28.5)/6 = 12.3/6 = 2.05

So the fitted line is y ≈ −0.033 + 2.05x. These three points were generated from a line with slope 2 and intercept 0, so the estimate is close given only 3 noisy points.
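The arithmetic in this worked example can be checked numerically. The sketch below uses only the X and y given above; it solves the normal equations, cross-checks against NumPy's least-squares solver, and confirms that the residual is orthogonal to the columns of X.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column, then x
y = np.array([2.1, 3.9, 6.2])

XtX = X.T @ X                  # [[3, 6], [6, 14]]
Xty = X.T @ y                  # approximately [12.2, 28.5]
w = np.linalg.solve(XtX, Xty)  # solves the normal equations without forming an inverse

# Cross-check with NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hat matrix H = X (X^T X)^{-1} X^T and the residual y - Hy
H = X @ np.linalg.solve(XtX, X.T)
residual = y - H @ y

print("w =", w)
print("X^T (y - Xw) =", X.T @ residual)  # numerically zero: residual is normal to col(X)
```

Using `np.linalg.solve` (or `lstsq`) rather than an explicit inverse is a common design choice because it is cheaper and numerically better behaved.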

[Interactive demo: Normal Equations Visualized. Click to place data points; the red line is the OLS fit and the dashed lines show the residuals.]

When does this fail? If X^TX is singular (features are linearly dependent) or nearly singular (multicollinearity), the inverse is unstable. Small changes in data cause wild swings in w. This is exactly where ridge regression saves us — next chapter.

The OLS solution w = (X^TX)⁻¹X^Ty requires which condition on X?

• X^TX must be invertible (full column rank)
• X must be square
• X must have more columns than rows

Chapter 2: Ridge Regression

Ordinary least squares minimizes the training error. But sometimes that's a bad idea. When features are strongly correlated, X^TX is nearly singular and the OLS weights explode. When there are more features than samples (d > N), X^TX is exactly singular and the OLS solution is not even unique.

Ridge regression (also called Tikhonov regularization or L2 regularization) adds a penalty on the size of the weights:

L_ridge(w) = ‖Xw − y‖² + λ‖w‖²

where λ ≥ 0 is the regularization parameter. The penalty term λ‖w‖² = λ ∑ w_j² discourages large weights. Taking the gradient and setting it to zero:

∇_wL = 2X^TXw − 2X^Ty + 2λw = 0

w_ridge = (X^TX + λI)⁻¹X^Ty

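Why λI fixes everything. Adding λI to X^TX shifts every eigenvalue up by λ. If the smallest eigenvalue of X^TX was near zero (ill-conditioned), now it's at least λ. So for any λ > 0 the matrix is invertible, and the condition number drops from σ_max²/σ_min² to (σ_max² + λ)/(σ_min² + λ), where σ_max and σ_min are the largest and smallest singular values of X.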
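The eigenvalue shift is easy to see on a toy problem. The following is a sketch under assumed settings: two nearly collinear synthetic features, λ = 1, and a hypothetical helper named `ridge`.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (assumption)

# Two nearly collinear features: x2 is almost a copy of x1 (assumed toy setup)
N = 50
x1 = rng.standard_normal(N)
x2 = x1 + 1e-4 * rng.standard_normal(N)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.standard_normal(N)

def ridge(X, y, lam):
    """Closed-form ridge weights (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

lam = 1.0
XtX = X.T @ X
eig_ols = np.linalg.eigvalsh(XtX)                      # ascending eigenvalues
eig_ridge = np.linalg.eigvalsh(XtX + lam * np.eye(2))  # each one shifted up by lam
cond_ols = np.linalg.cond(XtX)
cond_ridge = np.linalg.cond(XtX + lam * np.eye(2))

w_ols = ridge(X, y, 0.0)    # lam = 0 reduces to OLS
w_ridge = ridge(X, y, lam)

print("condition number, OLS vs ridge:", cond_ols, cond_ridge)
print("weight norm, OLS vs ridge:", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The two features carry almost the same information, so OLS is free to assign them large weights of opposite sign, while the ridge weights stay small and close to each other.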

The Bias-Variance Tradeoff

Ridge introduces bias — the expected prediction is no longer exactly right, because we're shrinking the weights toward zero. But it reduces variance — the weights are more stable across different datasets.

As λ → 0, we get OLS (low bias, high variance). As λ → ∞, w → 0 (high bias, zero variance). The optimal λ balances the two.

[Interactive demo: Ridge Regularization Path. A slider sets λ (initially 0.00). The weights shrink toward zero as λ increases, and the fit becomes less wiggly but more biased.]

Eigenvalue Perspective

Write the SVD of X: X = UΣV^T. The OLS prediction is ŷ = X(X^TX)⁻¹X^Ty. In terms of singular values σ_j, each component is scaled by σ_j² / σ_j² = 1. Ridge instead scales each component by:

σ_j² / (σ_j² + λ)

Components with small singular values (directions where data has little variation) get shrunk the most. This is exactly what we want: don't trust directions where you have little information.
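The shrinkage factors can be verified against the closed-form ridge solution. This sketch uses random data and λ = 0.5, both of which are assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed (assumption)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
lam = 0.5

# Thin SVD: X = U diag(s) V^T, singular values in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

shrink = s**2 / (s**2 + lam)          # ridge factor for each singular direction
y_hat_svd = U @ (shrink * (U.T @ y))  # ridge fitted values built from the SVD

# The same fitted values from the closed-form ridge weights
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
y_hat_direct = X @ w_ridge

print("shrinkage factors:", shrink)
print("max difference:", np.max(np.abs(y_hat_svd - y_hat_direct)))
```

Setting `lam = 0.0` makes every factor equal to 1, which recovers the OLS projection of y onto the column space of X.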

As λ increases in ridge regression, what happens to the weight vector?

• Weights grow larger
• Weights shrink toward zero
• Weights become exactly zero

Chapter 3: LASSO & Sparsity

Ridge shrinks weights toward zero but never makes them exactly zero. If you have 1000 features and only 5 matter, ridge keeps all 1000 — just smaller. Can we do better?

The LASSO (Least Absolute Shrinkage and Selection Operator) replaces the L2 penalty with an L1 penalty:

L_lasso(w) = ‖Xw − y‖² + λ ∑_j |w_j|

The key difference: the L1 constraint set has sharp corners on the coordinate axes, where some of the w_j are exactly 0. The optimal solution tends to land on these corners, which makes some weights exactly zero. LASSO therefore performs feature selection automatically.

Geometry of L1 vs. L2. The L2 constraint set is a circle (sphere in higher dims). The L1 constraint set is a diamond (cross-polytope). A circle is smooth everywhere, so the solution can land anywhere on its surface. A diamond has sharp corners at the axes. The elliptical contours of the MSE loss are more likely to first touch a corner — and corners mean some coordinates are zero.

Comparing Ridge vs. LASSO

Property	Ridge (L2)	LASSO (L1)
Penalty	λ‖w‖₂²	λ‖w‖₁
Solution form	Closed-form	No closed-form (iterative)
Sparsity	Never exactly zero	Many weights become zero
Correlated features	Keeps all, shares weight	Picks one, drops others
Use case	All features matter	Few features matter
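The table lists the LASSO solution as iterative. One such iterative scheme, which the lecture does not name, is proximal gradient descent (often called ISTA): take a gradient step on the squared error, then apply soft-thresholding. The sketch below is an illustration under assumptions (synthetic data with 2 relevant features out of 10, λ = 20, hypothetical function names), not the method used by any particular library.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries with |v| <= t become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent (ISTA)."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient 2 X^T (Xw - y)
    step = 1.0 / L
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic data (assumption): only features 0 and 3 matter
rng = np.random.default_rng(3)
N, d = 100, 10
X = rng.standard_normal((N, d))
w_true = np.zeros(d)
w_true[[0, 3]] = [3.0, -2.0]
y = X @ w_true + 0.1 * rng.standard_normal(N)

lam = 20.0
w_lasso = lasso_ista(X, y, lam)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("nonzero LASSO weights:", np.count_nonzero(w_lasso))
print("nonzero ridge weights:", np.count_nonzero(w_ridge))
```

The soft-threshold step is what produces exact zeros; the ridge weights for the irrelevant features are small but generally not zero.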

[Interactive demo: L1 vs. L2 Constraint Geometry. The ellipses are MSE contours and the colored shape is the constraint region; watch where they first touch. Controls: penalty type (initially L2, ridge) and λ (initially 1.00).]

Elastic Net

In practice, a compromise called Elastic Net combines both penalties:

L_elastic(w) = ‖Xw − y‖² + λ₁‖w‖₁ + λ₂‖w‖₂²

This gets sparsity from L1 while handling correlated features better (the L2 term tends to keep groups of correlated features together). Scikit-learn provides it as the ElasticNet estimator.

Which regularization method can set some weights to exactly zero?

• Ridge (L2)
• LASSO (L1)
• Neither: both just shrink weights

Chapter 4: Autoregressive Models for Signals

Now let's apply regression to signals. Instead of predicting house prices from features, we predict the next sample of a time series from its past values. This is the autoregressive (AR) model.

An AR(p) model of order p says:

x[n] = a₁x[n−1] + a₂x[n−2] + ... + a_px[n−p] + e[n]

where a₁, ..., a_p are the AR coefficients and e[n] is white noise (the "innovation" — the part of x[n] that can't be predicted from the past). The current sample is a weighted sum of the p most recent past samples, plus unpredictable noise.

AR models ARE regression. Define the feature vector for sample n as x_n = [x[n−1], x[n−2], ..., x[n−p]]. The target is y_n = x[n]. The weight vector is a = [a₁, ..., a_p]. This is exactly linear regression: x[n] = a^Tx_n + e[n].

Why AR Models Matter

Speech signals are well-modeled by AR(10-20). Financial time series use AR models for short-term prediction. EEG, vibration data, climate records — any signal with temporal structure.

The key parameter is the model order p. Too low: can't capture the signal's autocorrelation structure. Too high: overfits to noise (same story as polynomial degree).

Fitting an AR Model

Given a signal x[0], x[1], ..., x[N−1], we form the regression:

X = [[x[p−1], ..., x[0]], [x[p], ..., x[1]], ..., [x[N−2], ..., x[N−1−p]]]
y = [x[p], x[p+1], ..., x[N−1]]

Then a = (X^TX)⁻¹X^Ty. Same normal equations. The prediction is x̂[n] = a^T[x[n−1], ..., x[n−p]].
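The lagged design matrix and the least-squares fit can be sketched as follows. The function names, the simulated AR(2) coefficients (0.75, −0.25), the noise level, and the seed are assumptions for illustration.

```python
import numpy as np

def lagged_design(x, p):
    """Rows are [x[n-1], ..., x[n-p]] for n = p, ..., N-1; targets are x[n]."""
    N = len(x)
    X = np.column_stack([x[p - k : N - k] for k in range(1, p + 1)])
    y = x[p:]
    return X, y

def fit_ar(x, p):
    """Least-squares estimate of the AR(p) coefficients a_1, ..., a_p."""
    X, y = lagged_design(x, p)
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

# Simulate a stable AR(2) process (coefficients and noise level are assumptions)
rng = np.random.default_rng(4)
a_true = np.array([0.75, -0.25])
N = 5000
e = 0.5 * rng.standard_normal(N)
x = np.zeros(N)
for n in range(2, N):
    x[n] = a_true[0] * x[n - 1] + a_true[1] * x[n - 2] + e[n]

a_hat = fit_ar(x, 2)
X, y = lagged_design(x, 2)
x_pred = X @ a_hat  # one-step-ahead predictions of x[2], ..., x[N-1]

print("estimated coefficients:", a_hat)
print("one-step prediction MSE:", np.mean((y - x_pred) ** 2))
```

With enough samples the estimated coefficients should land near the simulated ones, and the one-step prediction MSE should approach the innovation variance (0.25 in this setup).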

[Interactive demo: AR Model Showcase. A signal is generated from a true AR process; fit an AR model, watch the predictions, and adjust the model order to see underfitting vs. overfitting. Settings shown: model order p = 2, noise σ = 0.50, true order = 2.]

Reading the showcase. The teal line is the actual signal. The orange line is the one-step-ahead prediction from the fitted AR model. When the model order matches the true order (and noise is low), the predictions track the signal closely. Crank up the order past the true value and the MSE on held-out data starts rising — that's overfitting.

An AR(3) model predicts x[n] from:

• x[n−1], x[n−2], x[n−3]
• x[n+1], x[n+2], x[n+3]
• x[0], x[1], x[2]

Chapter 5: Yule-Walker Equations

We just fitted an AR model by building the design matrix X and solving the normal equations. But there's an elegant alternative that uses the signal's autocorrelation function directly.

For a stationary signal, the autocorrelation at lag k is:

r[k] = E[x[n] · x[n−k]]

Now multiply both sides of the AR equation x[n] = ∑_{k=1}^p a_k x[n−k] + e[n] by x[n−m] and take expectations:

E[x[n]·x[n−m]] = ∑_{k=1}^p a_k E[x[n−k]·x[n−m]] + E[e[n]·x[n−m]]

For m ≥ 1, the noise e[n] is uncorrelated with past values x[n−m], so E[e[n]·x[n−m]] = 0. This gives us:

r[m] = a₁r[m−1] + a₂r[m−2] + ... + a_pr[m−p], m = 1, 2, ..., p

These are the Yule-Walker equations. In matrix form:

[[r[0], r[1], ..., r[p−1]], [r[1], r[0], ..., r[p−2]], ..., [r[p−1], ..., r[0]]] · a = [r[1], r[2], ..., r[p]]

Toeplitz structure. The autocorrelation matrix R is Toeplitz — each diagonal is constant. This structure allows the Levinson-Durbin algorithm to solve the system in O(p²) instead of O(p³). For large model orders, this matters.

The noise variance is recovered from m = 0:

σ_e² = r[0] − ∑_{k=1}^p a_k r[k]
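In practice r[k] is not known and has to be estimated from the signal. The sketch below assumes a zero-mean signal and uses the biased sample estimate r[k] ≈ (1/N) ∑_n x[n]·x[n−k]; the function names and the simulated AR(2) process are assumptions for illustration.

```python
import numpy as np

def autocorr(x, max_lag):
    """Biased sample autocorrelation r[0..max_lag] of a zero-mean signal."""
    N = len(x)
    return np.array([np.dot(x[k:], x[:N - k]) / N for k in range(max_lag + 1)])

def yule_walker(r, p):
    """Solve R a = [r[1], ..., r[p]] with R[i, j] = r[|i - j|]; return (a, noise variance)."""
    r = np.asarray(r, dtype=float)
    idx = np.arange(p)
    R = r[np.abs(idx[:, None] - idx[None, :])]  # Toeplitz autocorrelation matrix
    a = np.linalg.solve(R, r[1 : p + 1])
    sigma2 = r[0] - np.dot(a, r[1 : p + 1])
    return a, sigma2

# Simulated AR(2) signal (coefficients, noise level, and seed are assumptions)
rng = np.random.default_rng(8)
N = 5000
e = 0.5 * rng.standard_normal(N)
x = np.zeros(N)
for n in range(2, N):
    x[n] = 0.75 * x[n - 1] - 0.25 * x[n - 2] + e[n]

r = autocorr(x, 2)
a_hat, sigma2_hat = yule_walker(r, 2)
print("a =", a_hat, " noise variance =", sigma2_hat)
```

This version solves the system with a general-purpose solver; it does not exploit the Toeplitz structure.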

Worked Example: AR(2)

Suppose r[0] = 1.0, r[1] = 0.6, r[2] = 0.2. The Yule-Walker system is:

[[1.0, 0.6], [0.6, 1.0]] · [a₁, a₂] = [0.6, 0.2]

Solving: det = 1 − 0.36 = 0.64. a₁ = (1.0·0.6 − 0.6·0.2)/0.64 = 0.48/0.64 = 0.75. a₂ = (1.0·0.2 − 0.6·0.6)/0.64 = −0.16/0.64 = −0.25.

The noise variance: σ_e² = 1.0 − 0.75(0.6) − (−0.25)(0.2) = 1.0 − 0.45 + 0.05 = 0.60.
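The same numbers come out of the Levinson-Durbin recursion, which builds the order-m solution from the order-(m−1) solution. This is a minimal sketch of the standard recursion with a hypothetical function name; it takes r[0..p] as input and returns the coefficients together with the final prediction error variance.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations in O(p^2); returns (a, prediction error variance)."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(p)
    err = r[0]  # error variance of the order-0 predictor
    for m in range(1, p + 1):
        # Reflection coefficient: part of r[m] not explained by the order-(m-1) model
        k = (r[m] - np.dot(a[:m - 1], r[m - 1:0:-1])) / err
        a_prev = a[:m - 1].copy()
        a[:m - 1] = a_prev - k * a_prev[::-1]  # update a_1, ..., a_{m-1}
        a[m - 1] = k                           # new last coefficient a_m
        err *= 1.0 - k**2                      # error variance shrinks by (1 - k^2)
    return a, err

r = [1.0, 0.6, 0.2]
a, sigma2 = levinson_durbin(r, 2)

# Cross-check against a direct solve of the 2 x 2 system
a_direct = np.linalg.solve(np.array([[1.0, 0.6], [0.6, 1.0]]), np.array([0.6, 0.2]))

print("a =", a, " sigma_e^2 =", sigma2)
```

Following the recursion by hand: order 1 gives k = 0.6 and error 0.64; order 2 gives k = −0.16/0.64 = −0.25, a₁ = 0.6 + 0.25·0.6 = 0.75, and error 0.64·(1 − 0.0625) = 0.60, matching the worked example.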

[Interactive demo: Autocorrelation → AR Coefficients. Sliders set r[1] (initially 0.60) and r[2] (initially 0.20), and the Yule-Walker solution updates in real time.]

The Yule-Walker equations relate AR coefficients to:

• The signal's autocorrelation values
• The signal's frequency components
• The signal's sample values directly

Chapter 6: Cross-Validation

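We've seen that choosing λ (for ridge/LASSO) or p (for AR models) is critical. Too little regularization (small λ) or too high an order (large p) overfits. Too much regularization (large λ) or too low an order (small p) underfits. How do we pick the right value without a separate test set?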

K-fold cross-validation: split the data into K roughly equal parts. For each fold k, train on the other K−1 parts and measure the error on fold k. Average the K error values. Pick the hyperparameter that minimizes this average.

Split: divide the N samples into K folds
↓
For each fold k: train on folds ≠ k, test on fold k
↓
Average: CV error = (1/K) ∑_k MSE_k
↓
Select: pick the λ (or p) that minimizes the CV error
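The four steps above can be sketched for ridge regression as follows. The data, the λ grid, K = 5, and the function names are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    """Average held-out MSE over K folds for a given lam."""
    N = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(N), K)  # Split: K roughly equal parts
    mses = []
    for k in range(K):                              # For each fold k
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        mses.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return np.mean(mses)                            # Average

# Synthetic regression problem (assumption): 40 samples, 15 features
rng = np.random.default_rng(5)
N, d = 40, 15
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + rng.standard_normal(N)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = [kfold_cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]      # Select

for lam, err in zip(lambdas, cv_errors):
    print(f"lambda = {lam:>6}: CV error = {err:.3f}")
print("selected lambda:", best_lam)
```

The same seed is used for every λ so that all candidates are scored on identical folds. The random shuffle suits independent samples; for time series such as AR models, contiguous blocks are usually a safer choice than shuffled indices.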

Leave-One-Out (LOO) Cross-Validation

The extreme case: K = N. Each fold is a single data point. For linear regression, there's a beautiful shortcut. The LOO error is:

CV_LOO = (1/N) ∑_i=1^N (y_i − ŷ_i)² / (1 − h_ii)²

where h_ii is the i-th diagonal entry of the hat matrix H = X(X^TX)⁻¹X^T. You fit the model only once and divide each residual by 1 − h_ii. The sum of these adjusted squared residuals is the PRESS statistic, so CV_LOO = PRESS/N.

Leverage h_ii measures how much influence point i has on its own prediction. High-leverage points (those far from the centroid of the features) have h_ii close to 1, so their LOO residuals are strongly amplified.
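The shortcut can be checked against a brute-force loop that actually refits the model N times. The data below are synthetic and the settings (N = 25, two features plus a bias column) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)  # fixed seed (assumption)
N = 25
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])  # bias + 2 features
y = X @ np.array([1.0, 2.0, -1.0]) + 0.3 * rng.standard_normal(N)

# Shortcut: one fit, then rescale each residual by 1 / (1 - h_ii)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)          # leverages
resid = y - H @ y
cv_loo_shortcut = np.mean((resid / (1.0 - h)) ** 2)

# Brute force: leave each point out, refit, predict the held-out point
errs = []
for i in range(N):
    keep = np.arange(N) != i
    w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append((y[i] - X[i] @ w) ** 2)
cv_loo_brute = np.mean(errs)

print("shortcut:", cv_loo_shortcut, " brute force:", cv_loo_brute)
print("sum of leverages:", h.sum())  # equals the number of columns of X
```

The two numbers agree to floating-point precision for ordinary least squares, and the leverages sum to the number of fitted parameters.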

Information Criteria

For AR model order selection, two popular criteria avoid cross-validation entirely:

AIC = N · ln(σ̂_e²) + 2p

BIC = N · ln(σ̂_e²) + p · ln(N)

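Both penalize model complexity. BIC's per-parameter penalty ln(N) exceeds AIC's penalty of 2 whenever N > e² ≈ 7.4, so BIC tends to select simpler models. In practice, BIC is often preferred for AR order selection because it is consistent: if the data really come from a finite-order AR process, it picks the true order with probability approaching 1 as N → ∞.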
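A sketch of order selection with both criteria follows. It simulates an AR(2) signal (an assumption), fits orders 1 through 8 by least squares, and evaluates AIC and BIC. One deliberate choice, also an assumption: every order is fitted on the same N − p_max targets so the residual variances are comparable, and that effective sample count is used in place of N.

```python
import numpy as np

def ar_residual_variance(x, p, p_max):
    """Least-squares AR(p) fit on targets x[p_max:], so every order sees the same samples."""
    N = len(x)
    X = np.column_stack([x[p_max - k : N - k] for k in range(1, p + 1)])
    y = x[p_max:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ a) ** 2)

# Simulated AR(2) signal (coefficients, noise level, and seed are assumptions)
rng = np.random.default_rng(7)
N = 2000
e = 0.5 * rng.standard_normal(N)
x = np.zeros(N)
for n in range(2, N):
    x[n] = 0.75 * x[n - 1] - 0.25 * x[n - 2] + e[n]

p_max = 8
orders = np.arange(1, p_max + 1)
n_eff = N - p_max  # number of targets actually used in each fit
sigma2 = np.array([ar_residual_variance(x, p, p_max) for p in orders])

aic = n_eff * np.log(sigma2) + 2 * orders
bic = n_eff * np.log(sigma2) + orders * np.log(n_eff)
p_aic = int(orders[np.argmin(aic)])
p_bic = int(orders[np.argmin(bic)])

print("residual variances:", np.round(sigma2, 4))
print("AIC picks p =", p_aic, " BIC picks p =", p_bic)
```

The residual variance never increases with p, so it cannot select an order on its own; the penalty terms do that. Because BIC's penalty is the larger one here, the order it picks is never higher than the one AIC picks.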

[Figure: Model Selection: CV Error vs. Order. The training error never increases with model complexity. The CV error dips at the true order, then rises.]

In K-fold cross-validation, increasing K from 5 to N (leave-one-out) generally:

• Reduces both bias and variance of the CV estimate
• Reduces bias but increases variance of the CV estimate
• Increases both bias and variance

Chapter 7: Mastery

Let's review the landscape of linear prediction and regularization.

Method	Loss	Solution	Key Property
OLS	‖Xw − y‖²	(X^TX)⁻¹X^Ty	Unbiased, minimum variance among linear unbiased estimators (Gauss-Markov)
Ridge	‖Xw − y‖² + λ‖w‖²	(X^TX + λI)⁻¹X^Ty	Always invertible, shrinks weights
LASSO	‖Xw − y‖² + λ‖w‖₁	Iterative (no closed form)	Sparse solutions, feature selection
AR(p)	‖x − X_past·a‖²	Yule-Walker or OLS	Temporal prediction from p past values

The grand unification. Ridge, LASSO, and AR models are all the same thing: least-squares regression with different design matrices and penalties. The design matrix changes (features vs. lagged signal), the penalty changes (L2 vs. L1 vs. none), but the mathematical machinery is identical.

Connections

This lecture connects forward to:

RKHS & Kernel Regression — what happens when we go beyond linear f(x)?
Adaptive Filters & LMS — online regression where data arrives one sample at a time

And connects back to:

The DFT and spectral analysis: AR models produce a power spectral estimate σ_e²/|A(e^jω)|², where A(z) = 1 − ∑_{k=1}^p a_k z^−k
Bayesian estimation — ridge regression is equivalent to MAP with a Gaussian prior on w
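As a small illustration of the spectral connection, the sketch below evaluates an AR power spectrum for the AR(2) coefficients from the Yule-Walker worked example (a₁ = 0.75, a₂ = −0.25, σ_e² = 0.60). It assumes the spectrum P(ω) = σ_e²/|A(e^jω)|² with A(z) = 1 − ∑_k a_k z^−k, which follows from the sign convention of the AR equation in Chapter 4; the function name `ar_psd` is hypothetical.

```python
import numpy as np

def ar_psd(a, sigma2, n_freq=512):
    """AR power spectrum sigma2 / |A(e^{jw})|^2 on a grid of w in [0, pi]."""
    a = np.asarray(a, dtype=float)
    omega = np.linspace(0.0, np.pi, n_freq)
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(omega, k)) @ a  # A(e^{jw}) = 1 - sum_k a_k e^{-jwk}
    return omega, sigma2 / np.abs(A) ** 2

omega, P = ar_psd([0.75, -0.25], 0.60)

# Sanity check: the average of P over frequency should recover r[0] (trapezoid rule)
r0 = np.sum((P[1:] + P[:-1]) / 2.0 * np.diff(omega)) / np.pi

print("peak frequency (rad/sample):", omega[np.argmax(P)])
print("recovered r[0]:", r0)
```

The recovered r[0] should come out close to 1.0, the value used in the worked example, which ties the spectrum back to the autocorrelation it was derived from.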

What you can now do:
• Derive the normal equations from scratch
• Explain why ridge regression stabilizes ill-conditioned problems
• Compare L1 vs. L2 geometry for sparsity
• Fit an AR model to a signal and predict future samples
• Use Yule-Walker equations with autocorrelation
• Choose model order via cross-validation or BIC

"All models are wrong, but some are useful." — George Box